From chips to data: AI’s next battle

Reprinted from chaincatcher

01/23/2025

Author: Dr. Max Li, founder of OORT and professor at Columbia University

While the world remains focused on the battles over AI chips (tariffs, intellectual property restrictions, supply chain sanctions, and geopolitical disputes), the data shortage that directly shapes AI's future has largely been overlooked.

At the beginning of this year, Elon Musk pointed out that AI companies have run out of data for training models and have, in effect, "exhausted" the sum of human knowledge.

This article explores the shrinking data pool and how decentralized AI (DeAI) can play a key role in solving this challenge.

The data war is coming

First, let’s be clear: data is not inexhaustible.

The early signs of the data war are already visible. In 2023, a group of visual artists filed a landmark lawsuit against Stability AI, MidJourney, and DeviantArt, accusing the companies of using their work without permission to train generative AI models such as Stable Diffusion. Around the same time, Musk accused companies such as OpenAI of scraping data from Twitter (now X) without authorization, prompting X to tighten API pricing and access restrictions.

Similarly, Reddit significantly increased its API pricing, disrupting companies such as OpenAI and Anthropic that rely on Reddit's user-generated content for AI model training. Reddit framed the decision as a way to monetize its data, but it also sparked debate about the tension between platforms that host user data and the AI companies seeking to use it.

These incidents highlight an increasingly clear reality: We are running out of legally and ethically available data.

The multiple fronts of data

The chip war is about producing the most powerful hardware; the data war is about obtaining the right datasets to train AI. The growing scarcity of ethical, high-quality data has become a bottleneck for many companies developing AI.

For large companies, the most feasible path is to buy data from centralized giants, expensive as that is. Small businesses, however, face limited and often unaffordable options. Without viable methods or channels for collecting data, these companies will fall far behind in AI development and innovation.

So how exactly do we ethically and effectively collect the data we need to advance AI development?

The data war will be fought on multiple fronts, each presenting unique challenges and opportunities.

Data collection

Who controls the data collection pipeline, and how can it operate ethically and legally?

New initiatives are emerging as lawsuits pile up against tech giants for scraping or using data unlawfully. Harvard University, for example, has pioneered user-consented data contributions to build open-access datasets for the public. Valuable as such projects are, they fall far short of meeting the needs of commercial AI applications.

Synthetic data is also emerging as a potential solution. Companies like Meta and Microsoft have begun using AI-generated data to fine-tune models such as Llama and Phi-4, and Google and OpenAI likewise use synthetic data in their work. However, synthetic data faces challenges of its own, such as model "hallucination," which can undermine its accuracy and reliability.

Decentralized data collection offers another promising option. By leveraging blockchain technology and using cryptocurrency to incentivize individuals to share data securely, a decentralized model can address issues of privacy, ownership, and quality. Such solutions also democratize access to data, enabling small businesses to compete in the AI ecosystem.
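To make the incentive mechanism concrete, here is a minimal, purely illustrative sketch of a contribution ledger: each submission carries a content hash so any peer can verify integrity before a reward is credited. The `Contribution` and `Ledger` classes and the reward scheme are hypothetical, not part of any specific protocol or the author's platform.

```python
import hashlib
from dataclasses import dataclass, field

@dataclass
class Contribution:
    contributor: str
    payload: bytes
    digest: str = ""

    def __post_init__(self):
        # Content hash lets any peer verify the data was not altered in transit.
        self.digest = hashlib.sha256(self.payload).hexdigest()

@dataclass
class Ledger:
    entries: list = field(default_factory=list)
    rewards: dict = field(default_factory=dict)

    def submit(self, c: Contribution, reward: int = 1) -> bool:
        # Re-hash on receipt; reject tampered submissions.
        if hashlib.sha256(c.payload).hexdigest() != c.digest:
            return False
        # Reject duplicate content so contributors cannot farm rewards.
        if any(e.digest == c.digest for e in self.entries):
            return False
        self.entries.append(c)
        self.rewards[c.contributor] = self.rewards.get(c.contributor, 0) + reward
        return True

ledger = Ledger()
ok = ledger.submit(Contribution("alice", b"sensor reading 42"))
dup = ledger.submit(Contribution("bob", b"sensor reading 42"))  # same content, rejected
```

A real system would replace the in-memory ledger with an on-chain commitment and add privacy protections, but the core idea is the same: verifiable content plus transparent rewards.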

Data quality

Low-quality data can lead to model bias, inaccurate predictions, and ultimately distrust of AI systems. How do we ensure that the data used for AI training is accurate and representative?

Common industry practices include:

  • Rigorous data validation: Companies use advanced validation techniques to filter errors, inconsistencies, and noise from datasets. This typically involves human oversight, automated checks, or a combination of both to verify data integrity.
  • Bias mitigation strategies: To ensure data is representative, companies deploy bias detection tools and diverse sampling techniques. In the medical field, for example, datasets must include diverse population groups to avoid biases that could skew diagnostic models.
  • Following standards: Data security frameworks such as ISO/IEC 27001, along with emerging ethical AI guidelines, are becoming necessary to ensure data quality and compliance with global standards.
  • Crowdsourced quality checks: Platforms such as Amazon Mechanical Turk are used for labeling and validating data. Though low cost, these methods require supervision to ensure consistency and accuracy.
  • Decentralized verification: Blockchain and other decentralized systems are increasingly used to authenticate data sources, ensuring authenticity and tamper resistance.
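The first and last items above can be sketched in a few lines: automated checks filter malformed records, and a dataset fingerprint serves as a tamper-evident commitment that could be published on-chain. The record schema, label set, and function names are invented for illustration.

```python
import hashlib

def validate_record(record: dict) -> bool:
    """Minimal automated validation: required fields present and values allowed."""
    if not isinstance(record.get("text"), str) or not record["text"].strip():
        return False
    return record.get("label") in {"positive", "negative", "neutral"}

def fingerprint(records: list) -> str:
    """Order-sensitive SHA-256 fingerprint of a dataset. Publishing this digest
    (e.g. on a blockchain) lets downstream users detect any later tampering."""
    h = hashlib.sha256()
    for r in records:
        h.update(repr(sorted(r.items())).encode())
    return h.hexdigest()

raw = [
    {"text": "great product", "label": "positive"},
    {"text": "", "label": "positive"},     # rejected: empty text
    {"text": "meh", "label": "unknown"},   # rejected: invalid label
]
clean = [r for r in raw if validate_record(r)]
commitment = fingerprint(clean)
```

Production pipelines add statistical checks, deduplication, and human review on top of rule-based filters like these, but the pattern of validate-then-commit is the common core.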

In addition, national regulators face the urgent challenge of establishing comprehensive data privacy and security rules that balance individual rights with technological innovation, while also addressing critical national security concerns such as protecting sensitive data from cyber threats, foreign exploitation, and misuse by hostile entities.

The road ahead is rough

The impact of the data war is far-reaching. In the healthcare industry, for example, access to high-quality patient data can revolutionize diagnosis and treatment planning, but strict privacy regulations pose a roadblock. Likewise, in the music industry, AI models trained using ethical datasets could transform everything from songwriting to copyright enforcement, provided they respect intellectual property rights.

These challenges highlight the importance of decentralized solutions that prioritize data transparency, quality, and accessibility. By leveraging decentralized systems, we can create a fairer data ecosystem in which individuals retain control over their data and businesses gain access to ethical, high-quality datasets, driving innovation without compromising privacy or security.

The shift from chip wars to data wars will reshape the AI ecosystem and its evolution, opening the door for decentralized data solutions to take the lead. By prioritizing ethical data collection and accessibility, decentralized AI has the potential to bridge the gap and usher in a fairer, more innovative AI future.

The battle for the best data has begun. Are we prepared to deal with it?
