A substantial improvement in intelligence and behavior over previous models. Better at sustained work on long-running tasks, follows complex instructions more reliably, and is more pleasant to collaborate with.
We improved Composer by scaling training, generating more complex RL environments, and introducing new learning methods. In addition to training on more difficult tasks, we improved behavioral aspects like communication style and effort calibration — dimensions not well captured by existing benchmarks, but critical for real-world usefulness.
Credit assignment improved with localized training signals. Instead of a single reward over hundreds of thousands of tokens, feedback is applied at the exact point of the decision.
Trained on 25× more synthetic tasks than Composer 2. Dynamic difficulty scaling keeps the model improving throughout the run with grounded, verifiable rewards.
Distributed orthogonalization with Newton-Schulz at the model's natural granularity. Dual HSDP layouts separate expert and non-expert weights for optimal throughput.
Composer 2.5 is built on the same open-source checkpoint as Composer 2 — Moonshot's Kimi K2.5 — with continued pretraining and custom RL fine-tuning.
Composer 2.5 contains several new improvements to the training stack. These changes target both model intelligence and usability.
Credit assignment during RL becomes increasingly difficult as rollouts span hundreds of thousands of tokens. When a reward is computed over an entire rollout, it is hard for the model to tell which specific decision helped or hurt the outcome. To address this, Composer 2.5 was trained with targeted textual feedback.
The idea is to provide feedback directly at the point in the trajectory where the model could have behaved better. For a target model message, a short hint describing the desired improvement is constructed and inserted into the local context. The resulting model distribution acts as a teacher, and an on-policy distillation KL loss moves the student's token probabilities toward the teacher's.
This gives a localized training signal for the behavior to change, while retaining the broader RL objective over the full trajectory. During the Composer 2.5 run, this method was applied to a variety of behaviors, from coding style to model communication.
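As a rough illustration of the localized loss described above, the sketch below computes a per-token KL term only at the targeted positions. The function names, the mask format, and the direction of the KL (student relative to teacher) are assumptions made for illustration; the text does not specify them.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions over the same vocabulary."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))

def hint_distillation_loss(student_logits, teacher_logits, target_mask):
    """Mean per-token KL(student || teacher) over the targeted tokens only.

    student_logits: per-token logits from the policy without the hint.
    teacher_logits: per-token logits from the same model with the hint in
                    its context (treated as a fixed target).
    target_mask:    1 for tokens of the targeted message, 0 elsewhere.
    """
    total, count = 0.0, 0
    for s, t, m in zip(student_logits, teacher_logits, target_mask):
        if m:
            total += kl_divergence(softmax(s), softmax(t))
            count += 1
    return total / count if count else 0.0

# Toy example: two token positions over a three-token vocabulary.
student = [[2.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
teacher = [[0.0, 2.0, 0.0], [0.0, 1.0, 0.0]]
loss_on_target = hint_distillation_loss(student, teacher, [1, 0])   # distributions differ
loss_off_target = hint_distillation_loss(student, teacher, [0, 1])  # distributions match
```

In a full training setup this term would be added to the trajectory-level RL objective rather than replace it.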
During RL training, coding ability improves to the point where the model gets most training problems correct. To continue increasing intelligence, harder tasks are both selected and created dynamically. Composer 2.5 is trained with 25× more synthetic tasks than Composer 2.
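One simple way to picture dynamic selection is to track a pass rate per task and keep training on tasks the model solves only some of the time. The thresholds, task names, and function names below are hypothetical; the text does not describe the actual selection rule.

```python
def select_frontier_tasks(pass_rates, low=0.1, high=0.8):
    """Keep tasks that are neither hopeless nor already solved reliably."""
    return sorted(task for task, rate in pass_rates.items() if low <= rate <= high)

def pool_is_saturated(pass_rates, high=0.8, max_solved_fraction=0.5):
    """True when more than max_solved_fraction of tasks exceed the high pass rate."""
    if not pass_rates:
        return False
    solved = sum(1 for rate in pass_rates.values() if rate > high)
    return solved / len(pass_rates) > max_solved_fraction

# Hypothetical pass rates measured over recent rollouts.
pass_rates = {
    "fix_parser": 1.0,
    "add_cache": 0.5,
    "port_module": 0.0,
    "rename_api": 0.9,
    "split_service": 0.95,
}
frontier = select_frontier_tasks(pass_rates)
saturated = pool_is_saturated(pass_rates)  # signal to generate harder tasks
```

When the pool is saturated, a pipeline along these lines would trigger the creation of new, harder tasks instead of resampling solved ones.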
One approach is feature deletion: an agent is given a codebase with a large set of tests and asked to delete the code behind specific testable features while keeping the rest of the codebase functional. The synthetic task is then to reimplement the deleted feature, with the tests serving as a verifiable reward.
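The feature-deletion setup can be sketched with test results alone. This is a minimal illustration under assumed names (`FeatureDeletionTask`, `build_task`, `reward`) and an assumed all-or-nothing reward; the actual reward shaping is not described in the text.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class FeatureDeletionTask:
    feature_tests: frozenset     # broken by the deletion; must pass again
    regression_tests: frozenset  # unaffected by the deletion; must keep passing

def build_task(results_before, results_after_deletion):
    """Derive the task from test outcomes before and after the deletion."""
    passing = {t for t, ok in results_before.items() if ok}
    feature = frozenset(t for t in passing if not results_after_deletion.get(t, False))
    return FeatureDeletionTask(feature, frozenset(passing - feature))

def reward(task, results_after_attempt):
    """1.0 only if the feature is restored and nothing else regressed."""
    required = task.feature_tests | task.regression_tests
    return 1.0 if all(results_after_attempt.get(t, False) for t in required) else 0.0

# Hypothetical test outcomes (test name -> passed).
before = {"t_login": True, "t_export": True, "t_parse": True}
after_deletion = {"t_login": True, "t_export": False, "t_parse": True}
task = build_task(before, after_deletion)

good_attempt = {"t_login": True, "t_export": True, "t_parse": True}
bad_attempt = {"t_login": False, "t_export": True, "t_parse": True}
```

Requiring the regression tests to keep passing is one way to stop a solution from restoring the feature at the expense of the rest of the codebase.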
One downstream consequence of large-scale synthetic task creation is unexpected reward hacking. As the model became more adept, it found increasingly sophisticated workarounds, including reverse-engineering Python type-checking caches and decompiling Java bytecode to reconstruct APIs. These were diagnosed using agentic monitoring tools, which underscores the care that large-scale RL requires.
For continued pretraining, Muon is used with distributed orthogonalization. After forming the momentum update, Newton-Schulz runs at the model's natural granularity: per attention head for attention projections, and per expert for stacked MoE weights.
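To make the per-head and per-expert granularity concrete, here is a small sketch that orthogonalizes each block independently. It uses the classic cubic Newton-Schulz iteration in plain Python for portability; Muon implementations commonly use a tuned higher-order polynomial and batched tensor operations, and the text does not say which variant Composer 2.5 uses. All function names are illustrative.

```python
import math
import random

def transpose(a):
    return [list(col) for col in zip(*a)]

def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    cols_b = list(zip(*b))
    return [[sum(x * y for x, y in zip(row, col)) for col in cols_b] for row in a]

def newton_schulz(g, steps=30):
    """Approximate the semi-orthogonal factor of g (rows <= cols)."""
    fro = math.sqrt(sum(v * v for row in g for v in row)) or 1.0
    x = [[v / fro for v in row] for row in g]  # scale singular values to <= 1
    for _ in range(steps):
        xxt_x = matmul(matmul(x, transpose(x)), x)
        # Cubic update: X <- 1.5 * X - 0.5 * (X X^T X)
        x = [[1.5 * xv - 0.5 * yv for xv, yv in zip(xr, yr)]
             for xr, yr in zip(x, xxt_x)]
    return x

def orthogonalize_per_block(blocks, steps=30):
    """Run Newton-Schulz separately on each head or expert matrix."""
    return [newton_schulz(block, steps) for block in blocks]

# Four hypothetical expert updates, each a 3 x 6 matrix.
random.seed(0)
expert_updates = [[[random.gauss(0.0, 1.0) for _ in range(6)] for _ in range(3)]
                  for _ in range(4)]
orthogonalized = orthogonalize_per_block(expert_updates)
```

Treating each head or expert as its own matrix means one block's scale does not influence another's orthogonalized update.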
The main cost is orthogonalizing expert weights. For sharded parameters, same-shaped tensors are batched, redistributed with an all-to-all so that each rank holds complete matrices, orthogonalized with Newton-Schulz, and then returned to the original sharded layout with a second all-to-all. These transfers are asynchronous, so network and compute overlap. On the 1T model, optimizer step time is 0.2 s.
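The data movement described above can be simulated in a single process. In this sketch each expert matrix is row-sharded across ranks, gathered onto one owning rank, processed, and scattered back. The round-robin owner assignment and all function names are assumptions, and the asynchronous overlap of network and compute is not modeled.

```python
def all_to_all_gather(shards, num_ranks):
    """shards[rank][expert] row chunks -> full[owner][expert] complete matrices."""
    num_experts = len(shards[0])
    full = [dict() for _ in range(num_ranks)]
    for e in range(num_experts):
        owner = e % num_ranks  # assumed round-robin ownership
        rows = []
        for r in range(num_ranks):
            rows.extend(shards[r][e])
        full[owner][e] = rows
    return full

def all_to_all_scatter(full, num_ranks, rows_per_shard):
    """Inverse of all_to_all_gather: return each rank its own row chunk."""
    num_experts = sum(len(owned) for owned in full)
    shards = [[None] * num_experts for _ in range(num_ranks)]
    for owned in full:
        for e, matrix in owned.items():
            for r in range(num_ranks):
                shards[r][e] = matrix[r * rows_per_shard:(r + 1) * rows_per_shard]
    return shards

def sharded_update(shards, num_ranks, rows_per_shard, fn):
    """Gather complete matrices, apply fn (shape-preserving), scatter back."""
    full = all_to_all_gather(shards, num_ranks)
    processed = [{e: fn(m) for e, m in owned.items()} for owned in full]
    return all_to_all_scatter(processed, num_ranks, rows_per_shard)

# Two ranks, four experts, each expert a 2 x 2 matrix split into one row per rank.
num_ranks, num_experts = 2, 4
shards = [[[[float(100 * e + 10 * r + c) for c in range(2)]]
           for e in range(num_experts)] for r in range(num_ranks)]
gathered = all_to_all_gather(shards, num_ranks)
round_trip = sharded_update(shards, num_ranks, 1, lambda m: m)
doubled = sharded_update(shards, num_ranks, 1,
                         lambda m: [[2 * v for v in row] for row in m])
```

In place of the identity or doubling function, a real optimizer step would pass the orthogonalization routine as `fn`.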
This interacts closely with HSDP (hybrid sharded data parallelism) for MoE models. Separate HSDP layouts are used for non-expert and expert weights. Non-expert weights stay narrow (within a node or rack), while expert weights use a wider mesh. For example, context parallelism of 2 (CP=2) and expert parallelism of 8 (EP=8) can run on 8 GPUs instead of the 16 that a single shared mesh would require.
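The GPU counts in this example follow from a simple counting argument, sketched below under a simplified model: a shared mesh needs one device per combination of axis coordinates, while separate layouts only need a device count divisible by each axis size. The function names are illustrative and other parallelism axes are ignored.

```python
import math

def gpus_for_shared_mesh(cp, ep):
    """Both axes live in one mesh, so the device count is their product."""
    return cp * ep

def gpus_for_separate_layouts(cp, ep):
    """Each weight group has its own layout over the same devices, so the
    device count only has to be a common multiple of the two axis sizes."""
    return cp * ep // math.gcd(cp, ep)

shared = gpus_for_shared_mesh(cp=2, ep=8)
separate = gpus_for_separate_layouts(cp=2, ep=8)
```

With CP=2 and EP=8 this gives 16 devices for the shared mesh and 8 for separate layouts, matching the figures in the text.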
Composer 2.5 uses a targeted RL loop that attaches feedback to the specific turn it concerns rather than to the rollout as a whole.
The model produces a trajectory spanning hundreds of tool calls and hundreds of thousands of tokens in complex coding environments.
A short textual hint is inserted at the exact turn where behavior should improve — whether it is tool selection, style, or reasoning.
The hint changes token probabilities for that turn. The resulting distribution serves as a localized teacher signal.
An on-policy distillation loss updates the student model only at the targeted turn, preserving the broader RL objective.
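The data side of the steps above can be sketched as follows: build the context the teacher sees by inserting the hint before the targeted turn, and mark only that turn for the distillation loss. The `Turn` type, the `"hint"` role, and the function names are hypothetical and stand in for whatever format the real pipeline uses.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Turn:
    role: str  # e.g. "user", "assistant", "tool"
    text: str

def build_teacher_context(trajectory, target_idx, hint):
    """History before the targeted turn, followed by the inserted hint."""
    if trajectory[target_idx].role != "assistant":
        raise ValueError("the targeted turn must be a model message")
    context = list(trajectory[:target_idx])  # copy; the rollout is not modified
    context.append(Turn(role="hint", text=hint))
    return context

def distillation_mask(trajectory, target_idx):
    """1 for the targeted turn, 0 for every other turn."""
    return [1 if i == target_idx else 0 for i in range(len(trajectory))]

trajectory = [
    Turn("user", "Rename the config loader."),
    Turn("tool", "grep results: 14 matches"),
    Turn("assistant", "I will rewrite every file by hand."),
    Turn("tool", "edit applied"),
]
teacher_context = build_teacher_context(
    trajectory, target_idx=2, hint="Prefer the project-wide rename tool.")
mask = distillation_mask(trajectory, target_idx=2)
```

The teacher distribution for turn 2 would be obtained by conditioning the model on `teacher_context`, while the student is scored on the original, hint-free context.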
Composer 2.5 includes double usage for the first week. Choose the tier that fits your workflow.
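Composer 2.5: $0.50 / M input tokens, $2.50 / M output tokens
Composer 2.5, fast variant (default): $3.00 / M input tokens, $15.00 / M output tokens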
Composer 2.5 is a substantial improvement in intelligence and behavior over Composer 2. It is better at sustained work on long-running tasks, follows complex instructions more reliably, and is more pleasant to collaborate with.
Composer 2.5 introduces targeted RL with textual feedback for localized credit assignment, 25× more synthetic tasks with dynamic difficulty scaling, and sharded Muon with dual-mesh HSDP for efficient distributed training.
Together with SpaceXAI, a significantly larger model is being trained from scratch using 10× more total compute, leveraging Colossus 2's million H100-equivalents.
Composer 2.5 is priced at $0.50/M input and $2.50/M output tokens. A faster variant with the same intelligence is available at $3.00/M input and $15.00/M output tokens. The fast variant is the default option.
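For a quick estimate from the listed prices, a small calculator might look like the sketch below. The variant keys `"standard"` and `"fast"` are labels chosen here for illustration, and the estimate ignores the first-week double usage.

```python
# Prices in US dollars per million tokens, taken from the figures above.
PRICES_PER_MILLION = {
    "standard": {"input": 0.50, "output": 2.50},
    "fast": {"input": 3.00, "output": 15.00},
}

def estimate_cost(input_tokens, output_tokens, variant="fast"):
    """Estimated cost in dollars; defaults to the fast variant, the stated default."""
    price = PRICES_PER_MILLION[variant]
    total = input_tokens * price["input"] + output_tokens * price["output"]
    return total / 1_000_000

standard_cost = estimate_cost(200_000, 50_000, variant="standard")
fast_cost = estimate_cost(200_000, 50_000)
```

For 200,000 input tokens and 50,000 output tokens, this works out to about $0.23 on the standard variant and $1.35 on the fast variant.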
Try it now and see the difference in sustained intelligence, reliable instruction following, and collaborative coding.