Composer 2.5 — A Major Leap in AI Coding Intelligence

Overview

What makes Composer 2.5 different

We improved Composer by scaling training, generating more complex RL environments, and introducing new learning methods. In addition to training on more difficult tasks, we improved behavioral aspects like communication style and effort calibration — dimensions not well captured by existing benchmarks, but critical for real-world usefulness.

🧠

Targeted RL with Textual Feedback

Credit assignment improved with localized training signals. Instead of a single reward over hundreds of thousands of tokens, feedback is applied at the exact point of the decision.

📊

25× More Synthetic Tasks

Trained on 25× more synthetic tasks than Composer 2. Dynamic difficulty scaling keeps the model improving throughout the run with grounded, verifiable rewards.

⚡

Sharded Muon & Dual Mesh HSDP

Distributed orthogonalization with Newton-Schulz at the model's natural granularity. Dual HSDP layouts separate expert and non-expert weights for optimal throughput.

🔬

Built on Kimi K2.5

Composer 2.5 is built on the same open-source checkpoint as Composer 2 — Moonshot's Kimi K2.5 — with continued pretraining and custom RL fine-tuning.

Research

Training Composer 2.5

Composer 2.5 contains several new improvements to the training stack. These changes target both model intelligence and usability.

1 Targeted RL with Textual Feedback

Credit assignment during RL becomes increasingly difficult as rollouts span hundreds of thousands of tokens. When a reward is computed over an entire rollout, it is hard for the model to tell which specific decision helped or hurt the outcome. To address this, Composer 2.5 was trained with targeted textual feedback.

The idea is to provide feedback directly at the point in the trajectory where the model could have behaved better. For a target model message, a short hint describing the desired improvement is constructed and inserted into the local context. The resulting model distribution acts as a teacher, and an on-policy distillation KL loss moves the student's token probabilities toward the teacher's.

This gives a localized training signal for the behavior to change, while retaining the broader RL objective over the full trajectory. During the Composer 2.5 run, this method was applied to a variety of behaviors, from coding style to model communication.

2 Synthetic Data at Scale

During RL training, coding ability improves to the point where the model gets most training problems correct. To continue increasing intelligence, harder tasks are both selected and created dynamically. Composer 2.5 is trained with 25× more synthetic tasks than Composer 2.

One approach is feature deletion: the agent is given a codebase with a large set of tests, and asked to delete code while keeping the codebase functional except for specific testable features. The synthetic task is to reimplement the feature, with tests used as a verifiable reward.

One downstream consequence of large scale synthetic task creation is unexpected reward hacking. As the model became more adept, it found increasingly sophisticated workarounds — including reverse-engineering Python type-checking caches and decompiling Java bytecode to reconstruct APIs. These were diagnosed using agentic monitoring tools, demonstrating the care necessary for large scale RL.

3 Sharded Muon & Dual Mesh HSDP

For continued pretraining, Muon is used with distributed orthogonalization. After forming the momentum update, Newton-Schulz runs at the model's natural granularity: per attention head for attention projections, and per expert for stacked MoE weights.

The main cost is orthogonalizing expert weights. For sharded parameters, same-shaped tensors are batched, all-to-all sharded into complete matrices, Newton-Schulz is run, then results are all-to-all'ed back to the original sharded layout. These transfers are asynchronous, overlapping network and compute. On the 1T model, optimizer step time is 0.2s.

This interacts closely with HSDP for MoE models. Separate HSDP layouts are used for non-expert and expert weights. Non-expert weights stay narrow (within a node or rack), while expert weights use a wider mesh. CP=2 and EP=8 can run on 8 GPUs instead of requiring 16 in a single shared mesh.

How It Works

The loop stays efficient so each improvement compounds

Composer 2.5 uses a targeted RL loop that makes every training signal count.

Long rollout executes

The model produces a trajectory spanning hundreds of tool calls and thousands of tokens in complex coding environments.

Feedback is inserted

A short textual hint is inserted at the exact turn where behavior should improve — whether it is tool selection, style, or reasoning.

Teacher distribution computed

The hint changes token probabilities for that turn. The resulting distribution serves as a localized teacher signal.

KL distillation applied

An on-policy distillation loss updates the student model only at the targeted turn, preserving the broader RL objective.

FAQ

The fastest answers to the questions people ask first

What is Composer 2.5?

Composer 2.5 is a substantial improvement in intelligence and behavior over Composer 2. It is better at sustained work on long-running tasks, follows complex instructions more reliably, and is more pleasant to collaborate with.

How is Composer 2.5 trained differently?

Composer 2.5 introduces targeted RL with textual feedback for localized credit assignment, 25× more synthetic tasks with dynamic difficulty scaling, and Sharded Muon with dual mesh HSDP for efficient distributed training.

What model is Composer 2.5 based on?

Composer 2.5 is built on the same open-source checkpoint as Composer 2 — Moonshot's Kimi K2.5 — with continued pretraining and custom RL fine-tuning.

What hardware is used for training?

Together with SpaceXAI, a significantly larger model is being trained from scratch using 10× more total compute, leveraging Colossus 2's million H100-equivalents.

How much does Composer 2.5 cost?

Composer 2.5 is priced at $0.50/M input and $2.50/M output tokens. A faster variant with the same intelligence is available at $3.00/M input and $15.00/M output tokens. The fast variant is the default option.

Composer 2.5
A New Standard in
AI Coding Intelligence

What makes Composer 2.5 different

Targeted RL with Textual Feedback

25× More Synthetic Tasks

Sharded Muon & Dual Mesh HSDP

Built on Kimi K2.5

Key metrics

Training Composer 2.5

1 Targeted RL with Textual Feedback

2 Synthetic Data at Scale

3 Sharded Muon & Dual Mesh HSDP

The loop stays efficient so each improvement compounds

Long rollout executes

Feedback is inserted

Teacher distribution computed

KL distillation applied

Try Composer 2.5

The fastest answers to the questions people ask first

What is Composer 2.5?

How is Composer 2.5 trained differently?

What model is Composer 2.5 based on?

What hardware is used for training?

How much does Composer 2.5 cost?

Ready to experience
Composer 2.5?

Composer 2.5 A New Standard in AI Coding Intelligence

What makes Composer 2.5 different

Targeted RL with Textual Feedback

25× More Synthetic Tasks

Sharded Muon & Dual Mesh HSDP

Built on Kimi K2.5

Key metrics

Training Composer 2.5

1 Targeted RL with Textual Feedback

2 Synthetic Data at Scale

3 Sharded Muon & Dual Mesh HSDP

The loop stays efficient so each improvement compounds

Long rollout executes

Feedback is inserted

Teacher distribution computed

KL distillation applied

Try Composer 2.5

The fastest answers to the questions people ask first

What is Composer 2.5?

How is Composer 2.5 trained differently?

What model is Composer 2.5 based on?

What hardware is used for training?

How much does Composer 2.5 cost?

Ready to experienceComposer 2.5?

Composer 2.5
A New Standard in
AI Coding Intelligence

Ready to experience
Composer 2.5?