Pretraining — Technical Glossary

Pretraining starts with a randomly initialised neural network and ends with a model that has absorbed the statistical structure of language from a massive corpus. The work happens through next-token prediction: the model sees a chunk of text, predicts the next token, is scored against the token that actually follows, and updates its parameters to reduce that error. Repeat that loop over trillions of tokens drawn from a deduplicated mix of web text, books, code, scientific papers, and conversational data, and the model emerges with broad general capabilities. GPT-4 was pretrained on an undisclosed mix that outside estimates put in the tens of trillions of tokens. Llama 3.1 405B was pretrained on more than 15 trillion tokens. Consilience 40B from Nous Research’s Psyche network reached 20 trillion.
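The predict, score, update loop described above can be sketched at toy scale. The example below is a minimal illustration and not how any of the named models was trained: it assumes a character-level bigram table in place of a neural network, and the names `train_bigram`, `steps`, and `lr` are hypothetical. What it shares with real pretraining is the objective (cross-entropy on the next token) and the shape of the loop.

```python
import math
import random


def train_bigram(text, steps=2000, lr=0.5, seed=0):
    """Toy next-token prediction: a bigram logit table trained by SGD.

    Returns (mean loss before training, mean loss after training).
    """
    rng = random.Random(seed)
    vocab = sorted(set(text))
    index = {ch: i for i, ch in enumerate(vocab)}
    tokens = [index[ch] for ch in text]
    vocab_size = len(vocab)

    # Randomly initialised parameters: logits[prev][next].
    logits = [[rng.gauss(0.0, 0.01) for _ in range(vocab_size)]
              for _ in range(vocab_size)]
    # Each training example is (current token, token that follows it).
    pairs = list(zip(tokens[:-1], tokens[1:]))

    def mean_loss():
        total = 0.0
        for prev, nxt in pairs:
            row = logits[prev]
            m = max(row)
            log_z = m + math.log(sum(math.exp(x - m) for x in row))
            total += log_z - row[nxt]  # cross-entropy for this pair
        return total / len(pairs)

    initial = mean_loss()
    for _ in range(steps):
        prev, nxt = rng.choice(pairs)          # see some text
        row = logits[prev]
        m = max(row)
        exps = [math.exp(x - m) for x in row]
        z = sum(exps)
        probs = [e / z for e in exps]          # predict the next token
        for j in range(vocab_size):
            # Gradient of cross-entropy w.r.t. logits is probs - one_hot.
            grad = probs[j] - (1.0 if j == nxt else 0.0)
            row[j] -= lr * grad                # update the parameters
    return initial, mean_loss()


if __name__ == "__main__":
    before, after = train_bigram("the cat sat on the mat. " * 20)
    print(f"loss before: {before:.3f}, loss after: {after:.3f}")
```

The loss starts near the log of the vocabulary size (a uniform guess) and falls as the table learns which characters tend to follow which. A real pretraining run replaces the table with a transformer and the single sampled pair with large batches, but the objective is the same.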

The cost asymmetry between pretraining and fine-tuning is the central economic fact in modern AI. A frontier-scale pretraining run consumes tens of thousands of high-end GPUs for weeks or months and costs tens to hundreds of millions of dollars. Fine-tuning the same base model on a specialised domain might cost a few thousand dollars on a handful of GPUs over a few hours. This is why the open-weight ecosystem is dominated by fine-tunes (Hermes, Dolphin, OpenHermes) of a small number of pretrained bases (Llama, Qwen, DeepSeek, Mistral). Almost nobody has the capital to pretrain from scratch.
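The size of that asymmetry can be roughed out with back-of-the-envelope arithmetic. The sketch below uses the common rule of thumb that training compute is about 6 FLOPs per parameter per token. Every hardware and price figure in it (peak throughput, utilisation, price per GPU-hour, the fine-tuning token count) is an illustrative assumption, not a number from any vendor or project, and the function names are hypothetical.

```python
def training_flops(num_params, num_tokens):
    """Approximate training compute: ~6 FLOPs per parameter per token."""
    return 6.0 * num_params * num_tokens


def gpu_hours(flops, peak_flops_per_s, utilisation):
    """GPU-hours needed at a given peak throughput and utilisation."""
    return flops / (peak_flops_per_s * utilisation) / 3600.0


def dollar_cost(hours, price_per_gpu_hour):
    """Rental cost for a given number of GPU-hours."""
    return hours * price_per_gpu_hour


if __name__ == "__main__":
    # ASSUMED hardware and pricing, chosen only for illustration.
    PEAK = 1e15          # peak FLOP/s per GPU (assumption)
    UTILISATION = 0.4    # fraction of peak actually achieved (assumption)
    PRICE = 2.0          # dollars per GPU-hour (assumption)

    runs = {
        "pretrain (405B params, 15T tokens)": (405e9, 15e12),
        "fine-tune (405B params, 1B tokens, assumed)": (405e9, 1e9),
    }
    for name, (params, tokens) in runs.items():
        flops = training_flops(params, tokens)
        hours = gpu_hours(flops, PEAK, UTILISATION)
        cost = dollar_cost(hours, PRICE)
        print(f"{name}: {flops:.3e} FLOPs, "
              f"{hours:,.0f} GPU-hours, ${cost:,.0f}")
```

Under this approximation compute scales linearly with token count, so the ratio between the two runs is simply the ratio of tokens seen. The estimate ignores memory limits, parameter-efficient methods, failed runs, and data preparation, all of which shift real costs.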

In DeAI, “decentralised training” usually means decentralised pretraining because that is where the compute footprint and value capture are largest. Nous Research, Prime Intellect (before INTELLECT-3), and Templar (with Covenant-72B) all targeted pretraining runs over the public internet. The technical bar is much higher than for distributed fine-tuning because the gradient updates are larger, the runs take longer, and a coordination failure halfway through can erase weeks of work. Only a small set of projects ship public pretraining runs at meaningful scale, which is why Consilience 40B, pretrained on 20 trillion tokens, is treated as a category-defining artifact.
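To see why update size matters over the public internet, consider how long one uncompressed synchronisation would take. The sketch below is simple arithmetic under stated assumptions: 2 bytes per parameter, a 1 Gbit/s link, and a hypothetical 50-million-parameter adapter standing in for a small fine-tuning update. It does not describe the protocol of any project named above, and real systems typically reduce this cost by compressing updates or synchronising less often.

```python
def sync_seconds(num_params, bytes_per_param, link_gbit_per_s):
    """Seconds to send one full, uncompressed update over a single link."""
    payload_bits = num_params * bytes_per_param * 8
    return payload_bits / (link_gbit_per_s * 1e9)


if __name__ == "__main__":
    LINK = 1.0  # Gbit/s, assumed home or small-datacentre uplink

    # Full update for a 40B-parameter model at 2 bytes per parameter.
    full = sync_seconds(40e9, 2, LINK)
    # Hypothetical 50M-parameter adapter update, same precision.
    adapter = sync_seconds(50e6, 2, LINK)

    print(f"full 40B update: {full:.0f} s per sync")
    print(f"50M adapter update: {adapter:.1f} s per sync")
```

At these assumed figures a full update takes minutes per exchange while the adapter takes under a second, which is one way to read the claim that pretraining raises the bar for coordination.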

Related terms