Bolt Picks | The Opportunities and Challenges of Synthetic Data

线性资本·October 16, 2024

Synthetic Data: The Future of AI?

Image source: HIRETUAL

As real-world data becomes increasingly difficult to obtain, synthetic data has gradually become a focal point for large model companies. The idea of using synthetic data for model training has been around for some time. According to media reports, Anthropic used partial synthetic data to train its flagship model Claude 3.5 Sonnet, Meta fine-tuned its Llama 3.1 model using AI-generated data, and OpenAI is also using synthetic data generated by its o1 model to provide training material for its upcoming Orion model.

Kyle Wiggers is a senior reporter at TechCrunch. Through interviews with multiple experts in the field of artificial intelligence, the author explores questions the public is asking about synthetic data: 1. Why does AI need data? 2. What kind of data does AI need? 3. Will real data be replaced by synthetic data? Below is our compilation and translation of the author's article; the original content can be accessed by clicking the "read more" link.


01 The Importance of Data Labeling

AI systems are essentially statistical machines. Their training requires inputting massive amounts of data, and AI learns patterns in the data to make predictions — for instance, that "to whom" typically appears before "it may concern" in emails (a common English letter-opening phrase).

"Data labeling" refers to the process of adding tags or annotations to raw data so that machine learning algorithms can understand and use it. These labels act like signposts, teaching models how to distinguish between things, places, and concepts.

Take a model trained to classify kitchen photos. It is shown numerous images labeled "kitchen." During training, the model gradually associates "kitchen" with general features of kitchens (refrigerators, countertops, etc.). After training, the model should be able to identify kitchen photos it has never seen before. Of course, if these kitchen photos were mislabeled as a cow, the model would identify them as cows — highlighting the importance of data labeling quality.

The data labeling service market is expanding rapidly. According to Dimension Market Research, the data labeling market is currently valued at $838.2 million and will grow to $10.34 billion over the next decade. While there are no exact statistics on how many people work in labeling, a 2022 paper estimated the figure to be in the millions.

Many large companies and small businesses rely on employees of data labeling firms to annotate data for AI training sets. Some labeling work pays well, especially tasks requiring specialized expertise (such as mathematics). But other work is extremely burdensome — in developing countries, labelers earn an average of just a few dollars per hour, with no benefits or guarantees of future work.


02 The Data Scarcity Dilemma

From a humanitarian perspective, there is reason to seek alternatives to human data labeling services. But there are also practical considerations:

  1. Human data labeling has its limitations. First, humans can only label data so quickly. Beyond that, labelers' biases can affect their annotation quality, which in turn affects any models trained on that data. Labelers can also make mistakes, or be confused by complex annotation guidelines.

  2. Data is generally expensive. For example, Shutterstock charges AI vendors tens of millions of dollars to access its data archives, while Reddit has also earned hundreds of millions from selling data to companies like Google and OpenAI.

Beyond these two reasons, more importantly, data is becoming increasingly difficult to obtain. Most models are trained on large amounts of publicly available data, but out of concern that their data may be scraped or not receive proper attribution, data owners are increasingly restricting access to it. Today, over 35% of the world's top 1,000 websites block OpenAI's web crawler. A recent study found that approximately 25% of "high-quality" sources in major datasets have already been restricted.

If this trend continues, research institute Epoch AI predicts that developers will exhaust data for training generative AI models between 2026 and 2032. Coupled with concerns about copyright lawsuits and the possibility of objectionable content entering public datasets, AI vendors have been forced to reconsider this issue.


03 The Opportunity of Synthetic Data

On the surface, synthetic data seems like a perfect solution to these problems. Need labels? Generate them. Need more example data? No problem — unlimited quantities. To some extent, this is indeed true.

Os Keyes, a PhD student at the University of Washington researching the ethical implications of emerging technologies, told TechCrunch: "If data is the new oil, synthetic data is like biofuel — it can be generated without the negative externalities of real data, such as privacy breaches and data bias."

The AI industry has already begun widely applying this concept. This month, enterprise generative AI company Writer released a model called Palmyra X 004 trained almost entirely on synthetic data. Writer claims its development cost was only $700,000, compared to $4.6 million for an equivalent OpenAI model.

Microsoft's Phi open-source models partially used synthetic data for training, as did Google's Gemma models. Nvidia launched a series of models specifically designed for generating synthetic training data this summer, while AI startup Hugging Face claims to have released the largest synthetic text training dataset to date.

Synthetic data generation has become a standalone business, with its market size potentially reaching $2.34 billion by 2030. Gartner predicts that 60% of data used in AI and analytics projects this year will be synthetic.

Luca Soldaini, senior research scientist at AI2, notes that synthetic data technology can generate training data that is difficult to obtain through scraping or content licensing. For example, when training its video generator Movie Gen, Meta used Llama 3 to generate video captions in its training data, which were then refined by humans to add more detail.

Similarly, OpenAI stated that it used synthetic data to fine-tune GPT-4o to build Canvas, ChatGPT's drawing board feature. Amazon also said that synthetic data is an important component it uses to supplement real-world data and train Alexa's speech recognition models.


04 The Challenges of Synthetic Data

However, synthetic data is not a panacea. It faces the same "garbage in, garbage out" problem. Models can generate synthetic data, but if the data used to train those models itself contains biases and limitations, the output will be similarly affected. For example, groups underrepresented in the base data will also be underrepresented in synthetic data.

In 2023, a study published by researchers at Rice University and Stanford found that over-reliance on synthetic data for training leads to declining model quality and diversity. The researchers noted that sampling bias (i.e., underrepresentation of the real world) causes model diversity to gradually deteriorate over generations of training. But they also found that mixing in some real-world data with synthetic data helps alleviate this problem.

Keyes also points out that complex models like OpenAI's o1 may produce harder-to-detect hallucinations in their generated synthetic data, which could reduce the accuracy of models trained on that data — especially when the source of the hallucinations is difficult to identify.

Keyes added: "Because models hallucinate, the data they generate contains hallucinations. For models like o1, even the developers themselves may not be able to explain why these phenomena occur."

The accumulation of hallucinations can lead models to output meaningless data. A study published in Nature revealed how models, when trained on erroneous data, generate more erroneous data — a feedback loop that gradually erodes the performance of subsequent model generations. The researchers found that as models progressed through generations of training, they gradually lost grasp of more niche knowledge, became more generic, and often gave answers unrelated to the questions asked.

Image source: Ilia Shumailov et al.

A follow-up study showed that other types of models, such as image generators, are not immune to this collapse:

Image source: Ilia Shumailov et al.

Soldaini agrees that raw synthetic data is not trustworthy, especially when the goal is to avoid training forgetful chatbots or homogenized image generators. He notes that the safe way to use synthetic data is to thoroughly review, screen, and filter it — and ideally pair it with fresh real-world data, just as one would with any other dataset.

Failure to do so may cause model outputs to become increasingly uncreative and more biased, ultimately severely affecting their functionality. While this process may be identified and corrected before becoming serious, it is indeed a risk.

Soldaini said: "Researchers need to inspect generated data, iterate repeatedly on the generation process, and establish safeguards to eliminate low-quality data points. Synthetic data pipelines are not self-improving machines — their outputs must be carefully inspected and optimized before being used for training."

OpenAI CEO Sam Altman has stated that AI will someday generate synthetic data of sufficient quality to effectively train itself. This vision may hold, but the relevant technology does not yet exist — to date, no major AI lab has released a model trained entirely on synthetic data.

So, at least for the foreseeable future, AI training still requires human involvement to ensure nothing goes wrong during the model training process.


Linear Bolt is an investment initiative established by Linear Capital specifically for early-stage, global-market-facing AI applications. It upholds Linear's investment philosophy, focusing on projects where technology-driven transformation occurs, and aims to help founders find the shortest path to achieving their goals. Whether in speed of action or investment approach, Bolt's commitment is to be lighter, faster, and more flexible. Bolt has already invested in seven AI application projects in the first half of 2024, including Final Round, Xinguang, Cathoven, Xbuddy, and Midreal.