Training Data

Ryan Rutan

Training data is the corpus of examples (text, images, code, audio, video) used to train AI models. Under the empirical scaling laws, model capability depends on three key inputs: training data, model size, and compute. High-quality training data is increasingly the constrained resource in AI development, because compute is growing faster than the supply of high-quality data. It's the input that becomes the output: what the model can do is bounded by what it learned from.

The components of modern AI training data:

Pre-training data (foundation model training):

  • Web crawl (Common Crawl, FineWeb, etc.): hundreds of TBs of web text.
  • Books and literature (sometimes controversial).
  • Code repositories (GitHub, etc.).
  • Scientific papers (arXiv, PubMed).
  • Wikipedia and reference data.
  • Multimodal: image-text pairs, video, audio.

Typical pre-training corpus size for frontier models: on the order of 10+ trillion tokens by public reports (Llama 3, for example, reports roughly 15 trillion).

Fine-tuning data (post-training):

  • Human-written demonstrations of desired behavior.
  • Human preferences (which output is better, for RLHF).
  • Domain-specific datasets (medical, legal, code).
  • Instruction-following datasets.
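
To make the first two bullet types concrete, here is a minimal sketch of what a demonstration record and a preference record might look like when stored as JSON Lines. The class and field names (`Demonstration`, `Preference`, `prompt`, `chosen`, `rejected`) are illustrative assumptions, not a standard schema; real pipelines vary.

```python
from dataclasses import dataclass, asdict
import json


@dataclass
class Demonstration:
    """A human-written example of the desired behavior (hypothetical schema)."""
    prompt: str
    response: str


@dataclass
class Preference:
    """A human judgment between two candidate outputs (hypothetical schema)."""
    prompt: str
    chosen: str    # the output the annotator preferred
    rejected: str  # the output the annotator ranked lower


def to_jsonl(records) -> str:
    """Serialize records as JSON Lines: one JSON object per line."""
    return "\n".join(json.dumps(asdict(r), ensure_ascii=False) for r in records)


demo = Demonstration(
    prompt="Summarize: The meeting moved from Tuesday to Thursday.",
    response="The meeting is now on Thursday.",
)
pref = Preference(
    prompt="Explain what a token is.",
    chosen="A token is a small unit of text, such as a word piece, that a model reads.",
    rejected="Tokens are things.",
)

print(to_jsonl([demo]))
print(to_jsonl([pref]))
```

The difference in shape is the point: a demonstration says "this is the right output," while a preference only says "this output is better than that one."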

The scaling laws (Hoffmann et al., 2022, "Chinchilla"):

Empirical observation: for a fixed compute budget, there is an optimal balance between model size and training data. Three quantities are involved:

  • Model size (parameters).
  • Training data (tokens).
  • Compute (FLOPs).

The "Chinchilla optimal" finding: most large models of the time were trained on too little data for their size. A 70B model (Chinchilla) trained on 1.4T tokens outperformed a 280B model (Gopher) trained on 300B tokens, using a similar compute budget.

Implication: data quantity matters as much as model size. Hence the race to acquire more (and better) training data.
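
The arithmetic behind that comparison can be sketched in a few lines. This assumes the common rule of thumb that training compute is roughly 6 × parameters × tokens, which is an approximation and not a figure taken from this article; the function names are illustrative.

```python
def training_flops(params: float, tokens: float) -> float:
    """Approximate training compute with the rule of thumb C ~ 6 * N * D (an assumption)."""
    return 6.0 * params * tokens


def tokens_per_param(params: float, tokens: float) -> float:
    """How many training tokens the model saw per parameter."""
    return tokens / params


# The two models compared in the text above: (parameters, training tokens).
models = {
    "70B params, 1.4T tokens": (70e9, 1.4e12),
    "280B params, 300B tokens": (280e9, 300e9),
}

for name, (n, d) in models.items():
    print(f"{name}: ~{training_flops(n, d):.2e} FLOPs, "
          f"{tokens_per_param(n, d):.1f} tokens per parameter")
```

Under this approximation the two runs cost about the same compute, but the smaller model saw roughly 20 tokens per parameter versus about 1 for the larger one. That is the sense in which the larger model was under-trained on data.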

The data quality vs scale debate:

Quality view: filtered, deduplicated, high-quality data trains better models than raw scaled-up web data.

  • Examples: Phi models (Microsoft), Llama 3 (Meta), Claude (Anthropic) all emphasize data quality.
  • Trend: heavy data curation, filtering, deduplication.

Scale view: more data always helps, even if quality is lower.

  • Examples: early GPT models trained on broader web data.
  • Becoming less popular as compute outpaces raw data availability.

Reality: both matter. Quality at scale is the goal.
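
As a rough illustration of what "filtering and deduplication" means in practice, here is a toy curation pass. It assumes exact-match deduplication on normalized text and a crude quality heuristic with made-up thresholds; production pipelines typically use near-duplicate detection and learned quality classifiers instead.

```python
import hashlib
import re


def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so trivial variants hash identically."""
    return re.sub(r"\s+", " ", text).strip().lower()


def looks_low_quality(text: str, min_words: int = 5, min_alpha_ratio: float = 0.6) -> bool:
    """Toy heuristic (thresholds are assumptions): too short, or mostly non-letters."""
    words = text.split()
    if len(words) < min_words:
        return True
    non_space = [c for c in text if not c.isspace()]
    alpha = sum(c.isalpha() for c in non_space)
    return alpha / len(non_space) < min_alpha_ratio


def curate(docs):
    """Drop exact duplicates (after normalization) and low-quality documents."""
    seen = set()
    kept = []
    for doc in docs:
        digest = hashlib.sha256(normalize(doc).encode("utf-8")).hexdigest()
        if digest in seen or looks_low_quality(doc):
            continue
        seen.add(digest)
        kept.append(doc)
    return kept


raw = [
    "Training data quality often matters more than raw volume.",
    "training data   quality often matters more than raw volume.",  # duplicate after normalization
    "click here >>> $$$ 1234 !!! ### 5678 @@@",                      # mostly symbols and digits
    "Too short.",
    "Deduplication removes repeated documents before training begins.",
]
print(curate(raw))
```

Of the five sample documents, only the first and the last survive: one is a duplicate, one is mostly symbols, and one is too short.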

The data sources controversy:

Web crawl: largely public, but copyright questions exist (NYT v. OpenAI lawsuit, others).

Books: copyright claims have led to high-profile lawsuits.

Code (GitHub): license compliance questions (open-source licenses with conditions).

Images / video: artists' and creators' work used without explicit permission.

Synthetic data: increasingly used as a supplement (training on AI-generated data).

The startup data moat question:

Foundation model labs: their data moat is shrinking as competitors acquire similar data.

Application AI startups: real data moat comes from unique customer data:

  • Vertical AI: domain-specific data from customer interactions.
  • Workflow AI: usage patterns and behavioral data.
  • Voice AI: customer conversations.
  • Specialized verticals: medical records, legal documents, financial transactions.

The data flywheel (see Data Flywheel): customer use generates data, which improves the model, which improves the customer experience, which drives more use.
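
One way to picture the "use generates data" step is a function that turns logged product interactions into fine-tuning examples. This is a sketch under stated assumptions: the `Interaction` record and its fields are hypothetical, and the rule for which interactions count as signal is a design choice, not an established method.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Interaction:
    """One logged product interaction (all field names are hypothetical)."""
    user_input: str
    model_output: str
    user_edit: Optional[str]  # what the customer changed the output to, if anything
    accepted: bool            # whether the customer kept the result


def to_training_examples(log: List[Interaction]) -> List[dict]:
    """Keep only interactions that carry a usable signal about the desired output."""
    examples = []
    for item in log:
        if item.user_edit:
            # A correction is the strongest signal: train toward the edited text.
            examples.append({"prompt": item.user_input, "response": item.user_edit})
        elif item.accepted:
            # Accepted as-is: the original output is a reasonable target.
            examples.append({"prompt": item.user_input, "response": item.model_output})
        # Rejected with no edit: there is no target to learn from, so skip it.
    return examples


log = [
    Interaction("Draft a refund reply", "We cannot refund.", "We're happy to refund you.", True),
    Interaction("Draft a shipping update", "Your order ships Monday.", None, True),
    Interaction("Draft an apology", "Sorry.", None, False),
]
print(to_training_examples(log))
```

Only the first two interactions become training examples here. The product has to capture edits and accept/reject signals in the first place, which is why the flywheel is a product design decision as much as a modeling one.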

Synthetic data:

Increasingly important strategy:

  • Use foundation models to generate training data for smaller, specialized models.
  • Self-training: a model improves by training on a filtered set of its own best outputs.
  • Distillation: smaller model learns from larger model.
  • Risks: model collapse if synthetic data overwhelms diverse human-generated data.
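
One simple guard against synthetic data crowding out human-generated data is to cap its share of the training mix. The sketch below assumes a cap of 25%, which is an arbitrary illustrative value; the function name and parameters are hypothetical, and a cap alone is not a proven remedy for model collapse.

```python
import random


def mix_with_synthetic_cap(human, synthetic, max_synthetic_fraction=0.25, seed=0):
    """Combine datasets so synthetic examples are at most the given fraction of the mix."""
    if not 0.0 <= max_synthetic_fraction < 1.0:
        raise ValueError("max_synthetic_fraction must be in [0, 1)")
    # If s synthetic items join h human items: s / (h + s) <= f  =>  s <= f * h / (1 - f)
    limit = int(max_synthetic_fraction * len(human) / (1.0 - max_synthetic_fraction))
    rng = random.Random(seed)
    sampled = rng.sample(synthetic, min(limit, len(synthetic)))
    mixed = list(human) + sampled
    rng.shuffle(mixed)
    return mixed


human = [f"human-{i}" for i in range(75)]
synthetic = [f"synthetic-{i}" for i in range(200)]
mixed = mix_with_synthetic_cap(human, synthetic, max_synthetic_fraction=0.25)

n_synthetic = sum(x.startswith("synthetic") for x in mixed)
print(f"{n_synthetic} synthetic out of {len(mixed)} examples")
```

With 75 human examples and a 25% cap, at most 25 synthetic examples are admitted, however many are available.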

Ryan's Take

Training data is the input that defines what AI can do. For foundation model labs, data is becoming the constrained resource. For application AI startups, unique customer data is one of the few real moats available. The discipline that works: identify what data your customers generate that no one else has; capture and structure it; use it to fine-tune domain-specific models; build the data flywheel intentionally. The pattern that fails: rely on the same foundation model APIs everyone else uses with no proprietary data; have no defensible data position; get commoditized when foundation models commoditize your use case.

What founders get wrong: underestimating the data dimension of AI moats. The right discipline: identify the unique data your business generates; structure it for AI use; design the data flywheel from product day one; treat data as a strategic asset, not exhaust.

Related: Foundation Model · Large Language Model · Fine-Tuning · Machine Learning · Data Flywheel

FAQ

What is training data?
The corpus of examples (text, images, code, audio, video) used to train AI models. Quality and scale of training data are key inputs (alongside model size and compute) that determine model capability.

How much training data do modern LLMs use?
Frontier foundation models train on pre-training corpora on the order of 10+ trillion tokens by public reports, plus smaller curated datasets for fine-tuning. At roughly four bytes of text per token, that is tens to hundreds of TBs of text after deduplication and filtering.

What's the data moat for AI startups?
For application AI startups, unique customer data is one of the few real moats. Foundation model labs are losing data moats as competitors acquire similar data. Vertical AI startups build moats from domain-specific data their customers generate.

Why is data quality vs quantity important?
The Chinchilla scaling laws (2022) showed that, for a fixed compute budget, optimal training balances model size and data quantity. Separately, heavy data curation, filtering, and deduplication produce better models than raw scale alone. Quality at scale is the goal.
