Now Available

Composer 2.5
A New Standard in AI Coding Intelligence

Composer 2.5 is a substantial improvement in intelligence and behavior over previous models. It is better at sustained work on long-running tasks, follows complex instructions more reliably, and is more pleasant to collaborate with.

What makes Composer 2.5 different

We improved Composer by scaling training, generating more complex RL environments, and introducing new learning methods. In addition to training on more difficult tasks, we improved behavioral aspects like communication style and effort calibration — dimensions not well captured by existing benchmarks, but critical for real-world usefulness.


Targeted RL with Textual Feedback

Localized training signals improve credit assignment. Instead of a single reward over hundreds of thousands of tokens, feedback is applied at the exact point of the decision.


25× More Synthetic Tasks

Trained on 25× more synthetic tasks than Composer 2. Dynamic difficulty scaling keeps the model improving throughout the run with grounded, verifiable rewards.

Sharded Muon & Dual Mesh HSDP

Distributed orthogonalization with Newton-Schulz at the model's natural granularity. Dual HSDP layouts separate expert and non-expert weights for optimal throughput.


Built on Kimi K2.5

Composer 2.5 is built on the same open-source checkpoint as Composer 2 — Moonshot's Kimi K2.5 — with continued pretraining and custom RL fine-tuning.

Key metrics

25× more synthetic tasks than Composer 2
10× more total compute for next-gen training
0.2 s optimizer step time on the 1T model

Training Composer 2.5

Composer 2.5 includes several improvements to the training stack. These changes target both model intelligence and usability.

1 Targeted RL with Textual Feedback

Credit assignment during RL becomes increasingly difficult as rollouts span hundreds of thousands of tokens. When a reward is computed over an entire rollout, it is hard for the model to tell which specific decision helped or hurt the outcome. To address this, Composer 2.5 was trained with targeted textual feedback.

The idea is to provide feedback directly at the point in the trajectory where the model could have behaved better. For a target model message, a short hint describing the desired improvement is constructed and inserted into the local context. The model's distribution with the hint in context acts as a teacher, and an on-policy distillation KL loss moves the student's token probabilities toward the teacher's.

This gives a localized training signal for the behavior that should change, while retaining the broader RL objective over the full trajectory. During the Composer 2.5 run, this method was applied to a variety of behaviors, from coding style to model communication.
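The distillation step described above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the actual training code: the logits are random toy values, the names `log_softmax`, `targeted_kl_loss`, and `target_mask` are hypothetical, and the loss uses KL(student || teacher), which is one common choice for on-policy distillation. The source does not specify the KL direction.

```python
import numpy as np

def log_softmax(logits):
    # Numerically stable log-softmax over the vocabulary axis.
    shifted = logits - logits.max(axis=-1, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=-1, keepdims=True))

def targeted_kl_loss(student_logits, teacher_logits, target_mask):
    """Mean per-token KL(student || teacher) over the targeted positions only.

    student_logits: (T, V) logits without the hint in context.
    teacher_logits: (T, V) logits from the same model with the hint inserted.
    target_mask:    (T,) bool, True for tokens of the targeted message.
    """
    log_p = log_softmax(student_logits)   # student distribution
    log_q = log_softmax(teacher_logits)   # teacher distribution (fixed target)
    per_token_kl = (np.exp(log_p) * (log_p - log_q)).sum(axis=-1)
    return float((per_token_kl * target_mask).sum() / max(target_mask.sum(), 1))

rng = np.random.default_rng(0)
T, V = 6, 5                               # toy trajectory length and vocabulary size
student = rng.normal(size=(T, V))
teacher = student.copy()
teacher[2:4] += rng.normal(size=(2, V))   # assumption: the hint only shifts the targeted turn
mask = np.zeros(T, dtype=bool)
mask[2:4] = True

loss = targeted_kl_loss(student, teacher, mask)
```

Because the mask zeroes out every position outside the targeted message, the rest of the trajectory contributes nothing to this term and stays governed by the trajectory-level RL objective.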

2 Synthetic Data at Scale

During RL training, coding ability improves to the point where the model gets most training problems correct. To continue increasing intelligence, harder tasks are both selected and created dynamically. Composer 2.5 is trained with 25× more synthetic tasks than Composer 2.
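One simple way to picture dynamic task selection is a pass-rate filter. The sketch below is an assumption for illustration only: the source does not describe the selection rule, and the function name `select_tasks` and the thresholds `low` and `high` are hypothetical.

```python
def select_tasks(pass_rates, low=0.1, high=0.8):
    """Keep tasks the model sometimes, but not usually, solves.

    pass_rates: dict mapping task id -> fraction of recent rollouts that passed.
    low, high:  hypothetical thresholds, not values from the source.
    """
    return [task for task, rate in pass_rates.items() if low <= rate <= high]

recent = {"task_a": 0.95, "task_b": 0.50, "task_c": 0.00, "task_d": 0.20}
selected = select_tasks(recent)   # drops the near-solved task_a and the unsolved task_c
```

A filter like this removes tasks that no longer carry much learning signal, which is the situation the paragraph above describes once most problems are answered correctly.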

One approach is feature deletion: the agent is given a codebase with a large set of tests and asked to delete code so that the codebase stays functional except for specific testable features. The synthetic task is then to reimplement the deleted feature, with the tests serving as a verifiable reward.
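A test-based reward for such a task could look like the following sketch. The binary scoring rule, the function name `feature_reward`, and the test names are assumptions made for illustration; the source only states that tests are used as a verifiable reward.

```python
def feature_reward(test_results, feature_tests):
    """Return 1.0 only if the feature tests and all other tests pass.

    test_results:  dict mapping test name -> bool (passed or not).
    feature_tests: names of the tests that cover the deleted feature.
    """
    feature_ok = all(test_results.get(name, False) for name in feature_tests)
    rest_ok = all(passed for name, passed in test_results.items()
                  if name not in feature_tests)
    return 1.0 if feature_ok and rest_ok else 0.0

feature = {"test_export_csv", "test_export_json"}   # hypothetical test names

solved = feature_reward(
    {"test_export_csv": True, "test_export_json": True, "test_login": True}, feature)
partial = feature_reward(
    {"test_export_csv": True, "test_export_json": False, "test_login": True}, feature)
regressed = feature_reward(
    {"test_export_csv": True, "test_export_json": True, "test_login": False}, feature)
```

Requiring the unrelated tests to keep passing means a reimplementation that breaks other parts of the codebase earns no reward.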

One downstream consequence of large-scale synthetic task creation is unexpected reward hacking. As the model became more adept, it found increasingly sophisticated workarounds, including reverse-engineering Python type-checking caches and decompiling Java bytecode to reconstruct APIs. These were diagnosed using agentic monitoring tools, which shows the care that large-scale RL requires.

3 Sharded Muon & Dual Mesh HSDP

For continued pretraining, Muon is used with distributed orthogonalization. After forming the momentum update, Newton-Schulz runs at the model's natural granularity: per attention head for attention projections, and per expert for stacked MoE weights.
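To make the orthogonalization step concrete, here is a minimal batched Newton-Schulz sketch in NumPy. It uses the classic cubic iteration and a generous step count so that it converges in float64. The source does not state which polynomial, step count, or precision is used in practice, and the shapes below (4 heads, 8×16 blocks) are hypothetical.

```python
import numpy as np

def newton_schulz(G, steps=30, eps=1e-7):
    """Approximately orthogonalize each matrix in a batch.

    G: (batch, rows, cols) array, e.g. one matrix per attention head or expert.
    Uses the classic cubic iteration X <- 1.5 X - 0.5 (X X^T) X.
    """
    # Scale by the Frobenius norm so every singular value is at most 1,
    # which keeps the iteration inside its convergence region.
    norms = np.linalg.norm(G, axis=(-2, -1), keepdims=True)
    X = G / (norms + eps)
    for _ in range(steps):
        XXt = X @ np.swapaxes(X, -1, -2)
        X = 1.5 * X - 0.5 * XXt @ X
    return X

rng = np.random.default_rng(0)
momentum = rng.normal(size=(4, 8, 16))   # 4 hypothetical heads, each an 8x16 block
update = newton_schulz(momentum)

# Each block now has (approximately) orthonormal rows.
gram = update @ np.swapaxes(update, -1, -2)
```

Batching over the leading axis is what "natural granularity" amounts to in this sketch: every head or expert matrix is orthogonalized independently rather than treating the stacked tensor as one large matrix.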

The main cost is orthogonalizing expert weights. For sharded parameters, same-shaped tensors are batched and redistributed with an all-to-all so that each rank holds complete matrices. Newton-Schulz is run on those matrices, and a second all-to-all returns the results to the original sharded layout. These transfers are asynchronous, so network communication overlaps with compute. On the 1T model, optimizer step time is 0.2 s.
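The data movement can be simulated in a single process to show the layout change. This is a sketch under stated assumptions: ranks are modeled as list indices, matrices are sharded by rows, ownership is round-robin, and the helper names are hypothetical. A real implementation would use a collective all-to-all across devices and overlap it with compute.

```python
import numpy as np

def all_to_all_gather(shards):
    """shards[r][m] is rank r's row-slice of matrix m.
    Returns full[r], a list of (matrix index, complete matrix) owned by rank r."""
    world = len(shards)
    num_matrices = len(shards[0])
    full = []
    for r in range(world):
        owned = range(r, num_matrices, world)   # round-robin ownership (assumption)
        full.append([(m, np.concatenate([shards[s][m] for s in range(world)], axis=0))
                     for m in owned])
    return full

def all_to_all_scatter(full, world):
    """Inverse of all_to_all_gather: split complete matrices back into row-slices."""
    num_matrices = sum(len(owned) for owned in full)
    shards = [[None] * num_matrices for _ in range(world)]
    for owned in full:
        for m, matrix in owned:
            for s, piece in enumerate(np.split(matrix, world, axis=0)):
                shards[s][m] = piece
    return shards

def sharded_apply(shards, fn):
    """Gather complete matrices, apply fn to each, return to the sharded layout."""
    full = all_to_all_gather(shards)
    processed = [[(m, fn(matrix)) for m, matrix in owned] for owned in full]
    return all_to_all_scatter(processed, len(shards))

world = 2
matrices = [np.arange(12, dtype=float).reshape(4, 3) + 100 * m for m in range(4)]
shards = [[np.split(mat, world, axis=0)[r] for mat in matrices] for r in range(world)]
doubled = sharded_apply(shards, lambda matrix: 2.0 * matrix)
```

The doubling function is only a stand-in that is easy to check. Orthogonalization needs the whole matrix at once, which is why the complete matrices have to be assembled before the function runs.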

This interacts closely with HSDP for MoE models. Separate HSDP layouts are used for non-expert and expert weights: non-expert weights stay narrow (within a node or rack), while expert weights use a wider mesh. As a result, context parallelism of 2 (CP=2) and expert parallelism of 8 (EP=8) can run on 8 GPUs instead of requiring 16 in a single shared mesh.
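The GPU count in that example follows from simple mesh arithmetic. The sketch below is one possible reading, not the actual layout: the 4-way sharding axis for non-expert weights and the axis names in the comments are assumptions.

```python
import numpy as np

ranks = np.arange(8)                     # the same 8 GPUs, viewed two ways

# Non-expert weights: hypothetical layout with 4-way sharding and CP=2.
non_expert_mesh = ranks.reshape(4, 2)    # axes: (shard, cp)

# Expert weights: the same 8 GPUs viewed as a single EP=8 group.
expert_mesh = ranks.reshape(1, 8)        # axes: (replicate, ep)

# In one shared mesh, CP and EP are separate axes of the same grid,
# so the grid needs CP * EP devices.
cp, ep = 2, 8
shared_mesh_gpus = cp * ep
dual_mesh_gpus = max(non_expert_mesh.size, expert_mesh.size)
```

With these numbers, `shared_mesh_gpus` is 16 and `dual_mesh_gpus` is 8, matching the figures in the text.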

The loop stays efficient so each improvement compounds

Composer 2.5 uses a targeted RL loop that makes every training signal count.

01

Long rollout executes

The model produces a trajectory spanning hundreds of tool calls and hundreds of thousands of tokens in complex coding environments.

02

Feedback is inserted

A short textual hint is inserted at the exact turn where behavior should improve — whether it is tool selection, style, or reasoning.

03

Teacher distribution computed

The hint changes token probabilities for that turn. The resulting distribution serves as a localized teacher signal.

04

KL distillation applied

An on-policy distillation loss updates the student model only at the targeted turn, preserving the broader RL objective.

Try Composer 2.5

Composer 2.5 includes double usage for the first week. Choose the tier that fits your workflow.

Standard
$0.50 / M input tokens
$2.50 / M output tokens

Get Started with Composer 2.5

The fastest answers to the questions people ask first

What is Composer 2.5?

Composer 2.5 is a substantial improvement in intelligence and behavior over Composer 2. It is better at sustained work on long-running tasks, follows complex instructions more reliably, and is more pleasant to collaborate with.

How is Composer 2.5 trained differently?

Composer 2.5 introduces targeted RL with textual feedback for localized credit assignment, 25× more synthetic tasks with dynamic difficulty scaling, and Sharded Muon with dual mesh HSDP for efficient distributed training.

What model is Composer 2.5 based on?

Composer 2.5 is built on the same open-source checkpoint as Composer 2 — Moonshot's Kimi K2.5 — with continued pretraining and custom RL fine-tuning.

What hardware is used for training?

For the next generation, a significantly larger model is being trained from scratch together with SpaceXAI, using 10× more total compute and leveraging Colossus 2's million H100-equivalents.

How much does Composer 2.5 cost?

Composer 2.5 is priced at $0.50/M input and $2.50/M output tokens. A faster variant with the same intelligence is available at $3.00/M input and $15.00/M output tokens. The fast variant is the default option.
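As a rough worked example, the cost of a request can be estimated from these per-million-token prices. The variant labels `standard` and `fast`, the function name, and the token counts are illustrative assumptions. The estimate ignores the first-week double usage and any plan-specific terms.

```python
PRICES_PER_MILLION = {   # USD per million tokens, taken from the pricing above
    "standard": {"input": 0.50, "output": 2.50},
    "fast": {"input": 3.00, "output": 15.00},
}

def request_cost(variant, input_tokens, output_tokens):
    """Estimated USD cost for one request at the listed prices."""
    price = PRICES_PER_MILLION[variant]
    return (input_tokens * price["input"] + output_tokens * price["output"]) / 1_000_000

# Hypothetical request: 200k input tokens and 20k output tokens.
standard = request_cost("standard", 200_000, 20_000)   # 0.10 + 0.05 USD
fast = request_cost("fast", 200_000, 20_000)           # 0.60 + 0.30 USD
```

At the listed prices the fast variant costs six times as much as the standard one for the same token counts.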

Ready to experience Composer 2.5?

Try it now and see the difference in sustained intelligence, reliable instruction following, and collaborative coding.
