DisTrO — Technical Glossary

DisTrO is the umbrella name for Nous Research’s gradient-compression and coordination techniques. The preliminary report (August 2024) claimed a 1,000x to 10,000x reduction in inter-GPU bandwidth compared to standard AllReduce, the collective operation most centralised clusters use to synchronise gradients. The follow-on DeMo paper (December 2024) trained a 15B model across the public internet using the technique. The current production system in Psyche uses DisTrO derivatives plus a Solana-based coordinator to schedule training rounds, register participants, and settle work.

The technical contribution stacks four ideas. Decoupled momentum reduces how often GPUs need to fully synchronise. Compressed gradient updates send only the largest-magnitude values each step. Asynchronous coordination lets participants drop in and out without halting the run. Bandwidth-aware scheduling matches work to the link quality each node actually has. Combined, these mean a training step that would cost gigabytes of cross-GPU traffic in a centralised cluster costs megabytes across the internet.
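
The first two ideas can be illustrated with a minimal sketch. It is not the DisTrO or DeMo implementation: it assumes plain top-k selection on raw momentum values, whereas the published method may select components differently. The names local_step, topk_sparsify, payload_bytes, and dense_bytes are hypothetical, chosen for this example only. Each worker keeps its own momentum, transmits only the k largest-magnitude entries, and removes what it sent from its local state.

```python
from typing import List, Tuple


def topk_sparsify(values: List[float], k: int) -> List[Tuple[int, float]]:
    """Return the k largest-magnitude entries as (index, value) pairs, ordered by index."""
    order = sorted(range(len(values)), key=lambda i: abs(values[i]), reverse=True)
    return [(i, values[i]) for i in sorted(order[:k])]


def local_step(
    momentum: List[float], grad: List[float], beta: float, k: int
) -> Tuple[List[float], List[Tuple[int, float]]]:
    """Fold the gradient into local momentum, then split off the top-k part to transmit.

    Illustrative assumption: the transmitted entries are zeroed locally, so the
    remaining momentum stays on the worker and is never fully synchronised.
    """
    updated = [beta * m + g for m, g in zip(momentum, grad)]
    payload = topk_sparsify(updated, k)
    for i, _ in payload:
        updated[i] = 0.0
    return updated, payload


def payload_bytes(k: int, index_bytes: int = 4, value_bytes: int = 4) -> int:
    """Bytes needed to send k (index, value) pairs (assumed 32-bit each)."""
    return k * (index_bytes + value_bytes)


def dense_bytes(n: int, value_bytes: int = 4) -> int:
    """Bytes needed to send all n values uncompressed (assumed 32-bit each)."""
    return n * value_bytes


if __name__ == "__main__":
    residual, payload = local_step([0.0] * 4, [0.1, -2.0, 0.3, 1.5], beta=0.9, k=2)
    print("sent:", payload)
    print("kept locally:", residual)
    ratio = dense_bytes(1_000_000) / payload_bytes(1_000)
    print("compression ratio for k=1,000 of 1,000,000 values:", ratio)
```

Under these assumptions, sending 1,000 of 1,000,000 values costs 8,000 bytes instead of 4,000,000, a 500x reduction. The larger figures quoted for DisTrO would need further savings that this sketch does not model, such as less frequent synchronisation.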

In context, DisTrO is the direct competitor to Templar’s SparseLoCo and the gradient-compression layer inside Prime Intellect’s earlier INTELLECT runs. The unique angle is the production track record. Hermes 4.3 (36B, released August 2025) was trained end-to-end on Psyche using DisTrO at 144,000 tokens per second across 24 nodes. Consilience 40B reached 20 trillion training tokens over the internet, the largest decentralised pre-training run anyone has published. The technique is real, the runs are public, and the code is open-source under Apache-2.0.

Related terms