Let AI coaches score every action to train the agents end-to-end.
TLDR: Finetuning many agents end-to-end offers a workaround to the continual learning problem, since different agents can specialize without catastrophic forgetting. Yet doing so is hard due to credit assignment and sample efficiency. We found that using AI feedback as per-action process rewards holds promise for addressing these challenges and unlocks a new axis for scaling post-training.

Multiagent systems sidestep catastrophic forgetting the same way mixture-of-experts does—by giving different skills different parameters.
Finetuning a single model on one capability often degrades others. Train extensively on one language, and performance on others may drop. This is catastrophic forgetting: all tasks compete for the same parameters.
MoE architectures partially solve this by routing different inputs to different parameter subsets, creating more runway to scale (more training can be done without forgetting) within one big model. Almost all frontier models nowadays, including Gemini 2.5, Kimi K2, and Claude Opus 4.5, use MoE designs.
Multiagent systems apply the same idea at the agent level: each agent has its own weights that can be finetuned separately. Thus, if coordinated right, the number of agents could be the next dimension of scaling.
So far, most multiagent frameworks implement specialization by assigning different personas or instructions to each agent, leaving the weight-separation advantage completely untapped. This is because training all agents end-to-end faces two fundamental challenges:
Credit assignment. When a task succeeds or fails, which agent is responsible? A data science pipeline might fail with FileNotFoundError. The error may only surface when the final agent tries to access the file, even though the root cause is an earlier agent forgetting to save that file. Under current RL approaches, all agents share the same final outcome score regardless, penalizing the final agent for doing the right thing.
Sample efficiency. Multiagent rollouts are expensive. A single run can easily involve generating dozens of actions from different LLMs, each containing tool calls to be executed by the environment, taking minutes if not hours at a time. Yet current RL approaches provide only one training signal at the very end, which makes it very much like “sucking supervision through a straw.”
We address both challenges by having an LLM coach evaluate every action as it happens—not just the final outcome.
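To make this concrete, here is a minimal sketch of the idea (the `agents`, `env`, and `coach` interfaces are hypothetical stand-ins, not our actual implementation): instead of broadcasting one outcome score to every agent at the end, the coach scores each action as it is taken.

```python
# Minimal sketch (hypothetical interfaces): every action gets its own reward,
# so each agent's update is based on what that agent actually did.

def run_and_score(task, agents, env, coach):
    """Roll out a multiagent episode and collect per-action rewards."""
    rewards = {agent.name: [] for agent in agents}
    for agent in agents:                      # agents act in sequence
        for action in agent.act(task, env):   # each action may call tools
            tool_output = env.execute(action)
            # The coach sees the agent's role, the current state, the action,
            # and the tool output -- enough to score this single step in context.
            score = coach.score(
                role=agent.role,
                state=env.snapshot(),
                action=action,
                tool_output=tool_output,
            )
            rewards[agent.name].append(score)
    return rewards  # dense signal: one reward per action, per agent
```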

With per-action evaluation, every step gets feedback—not just the final outcome.
The coach receives the context it needs for accurate credit assignment: each agent's assigned role, the inputs it was given, and the tool outputs its actions produced.
Why “coach” rather than “judge”? A judge rules objectively on correctness. A coach is context-aware, evaluating each agent based on its assigned role and given inputs, not just on fixed metrics or eventual outcomes.
When the final agent crashes with FileNotFoundError, the coach checks the earlier agents' tool outputs and the resulting filesystem. If no agent ever saved X_test.pkl, blame traces to whichever earlier action should have created it, not to the final agent's action that correctly tried to load it.
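Concretely, the input to the coach for a single action can be assembled along these lines (the function and prompt wording below are illustrative, not our exact implementation):

```python
import json
from pathlib import Path

def build_coach_prompt(role, task_brief, action, tool_output, workspace):
    """Assemble the context the coach sees for one action (illustrative sketch)."""
    files = sorted(p.name for p in Path(workspace).iterdir())
    return (
        f"You are coaching the {role} agent in a multiagent pipeline.\n"
        f"Task given to this agent: {task_brief}\n"
        f"Action taken:\n{action}\n"
        f"Tool output:\n{tool_output}\n"
        f"Files currently in the shared workspace: {json.dumps(files)}\n"
        "Score this action from 1-10 for how well it fulfills the agent's "
        "assigned role given its inputs, independent of downstream failures. "
        'Reply with a short verdict and "SCORE: <n>/10".'
    )
```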
We call the overall approach MAPPA: training MultiAgent systems with Per-action Process rewards from AI feedback.
What does MAPPA look like in practice? We train a three-agent pipeline on Kaggle-style data modeling problems—realistic, long-horizon tasks where agents must coordinate over many steps. Each task provides CSV files and requires generating predictions for held-out test data. Note that this is not the only multiagent system or task we test our approach on; more experimental results and in-depth discussion can be found in our technical report.

Agents pass files to each other through a shared workspace—creating a paper trail the coach can examine.
Three agents pass the baton in sequence: a Data Engineer, a Modeler, and an Analyst.
Each agent can take up to 4 turns, executing Python code in a sandboxed environment. Agents communicate by reading and writing files to a shared workspace.
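For intuition, the control flow looks roughly like the sketch below (names such as `agent.write_code` and `sandbox.run` are placeholders, not our actual code): agents run one after another, each getting up to 4 turns against the shared workspace, and everything they do is logged.

```python
# Sketch of the pipeline loop: sequential agents, up to 4 turns each,
# communicating only through files in a shared workspace directory.

MAX_TURNS = 4

def run_pipeline(task, agents, sandbox, workspace):
    transcript = []  # the "paper trail" the coach later examines
    for agent in agents:  # e.g. data engineer -> modeler -> analyst
        for turn in range(MAX_TURNS):
            code = agent.write_code(task, transcript, workspace)
            output = sandbox.run(code, cwd=workspace)  # reads/writes shared files
            transcript.append({"agent": agent.name, "turn": turn,
                               "code": code, "tool_output": output})
            if agent.is_done(output):
                break
    return transcript
```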
Why does file-passing matter for credit assignment? It creates a written record for the coach to examine. For example, when something fails:
DATAENGINEER evaluation:
- Tool output: "Saved X_train.pkl, y_train.pkl"
- No mention of X_test.pkl
- VERDICT: Failed to save required artifact
- SCORE: 3/10
MODELER evaluation:
- Received expected files from Data Engineer
- Tool output: "Saved model.pkl successfully"
- VERDICT: Completed task correctly given inputs
- SCORE: 8/10
ANALYST evaluation:
- Required file X_test.pkl was never created upstream
- Correctly attempted to load it
- VERDICT: Not at fault for the failure
- SCORE: 6/10
Importantly, there is no counterfactual reasoning required—just checking what each agent actually produced.
As with most long-horizon, real-world tasks, evaluating the agent's predictions is not as straightforward as checking a single number: a model might achieve 89% accuracy but a 23% F1 score on a classification task. Naive averaging would label this as decent, when in reality the model just learned to predict the majority class.
The coach understands this by putting the pieces of context together:
Coach reasoning:
High accuracy (0.89) but very low F1 (0.23)
indicates a class imbalance problem.
The model is not learning the actual signal.
SCORE: 4/10
Simple averaging across metrics can’t catch this. The coach synthesizes them in context—that’s the judgment call.
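To see why averaging misleads, here is a toy example (the predictions are made up to roughly mirror the numbers above, not taken from our experiments): a model that mostly predicts the majority class gets high accuracy, a poor F1, and a naive average that still looks respectable.

```python
# Toy illustration with hypothetical predictions: 11% of labels are positive,
# and the model catches only 2 of the 11 positives.
from sklearn.metrics import accuracy_score, f1_score

y_true = [0] * 89 + [1] * 11                      # imbalanced: 11% positive
y_pred = [0] * 87 + [1] * 2 + [1] * 2 + [0] * 9   # mostly majority-class guesses

acc = accuracy_score(y_true, y_pred)   # 0.89
f1 = f1_score(y_true, y_pred)          # ~0.27

print(f"accuracy={acc:.2f}  f1={f1:.2f}  naive average={(acc + f1) / 2:.2f}")
# The naive average (~0.58) looks decent, but the F1 alone shows the model has
# barely learned the minority class -- exactly the pattern the coach flags.
```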
We train for 21 epochs on 64 tasks, evaluate with 4 trials on 6 held-out tasks, and report the average:
| Metric | Before | After | Change |
|---|---|---|---|
| Success Rate | 50.0% | 66.7% | +16.7pp |
| Accuracy (Fair) | 0.583 | 0.719 | +23% |
| F1 (Fair) | 0.126 | 0.174 | +38% |
| RMSE (Fair) | 24.9% | 14.6% | -41% |
Fair metrics penalize failed runs rather than ignoring them. For example, a failed run of a classification task falls back to chance accuracy (50%).
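As an illustration of what we mean by “fair” (the aggregation below is a simplified sketch, not our evaluation code): failed runs stay in the average at a chance-level fallback, so crashing can never score better than finishing with a mediocre model.

```python
# Simplified sketch of "fair" aggregation for a binary-classification metric.

CHANCE_ACCURACY = 0.5  # fallback score for a failed binary-classification run

def fair_accuracy(runs):
    """runs: list of dicts like {"succeeded": bool, "accuracy": float | None}."""
    scores = [r["accuracy"] if r["succeeded"] else CHANCE_ACCURACY for r in runs]
    return sum(scores) / len(scores)

runs = [
    {"succeeded": True, "accuracy": 0.82},
    {"succeeded": False, "accuracy": None},   # crashed run counts as 0.50
    {"succeeded": True, "accuracy": 0.74},
]
print(round(fair_accuracy(runs), 3))  # 0.687
```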
Training improves both success rate and quality metrics. The coach’s per-action feedback translates to downstream improvements on held-out tasks.
We also validate MAPPA on competition math problems with a different multiagent configuration, achieving +5–17pp improvements on AIME and AMC benchmarks. See the paper for details.
While analyzing training dynamics, we noticed an interesting pattern: agents specialize based on coach preferences.
Regression tasks kept improving while classification stagnated. The scores revealed that the coach consistently rated regression actions 0.5–1.8 points higher than equivalent classification actions, a preference we didn't program but one the agents discovered and exploited.
After evaluation performance peaked (within the first 10 epochs), agents leaned into this signal over the next 10 epochs, maintaining 87.5% success on regression while classification dropped back to baseline. The coaching worked: agents learned to optimize what the coach rewarded.
This suggests an opportunity: coach preferences could be deliberately designed to steer specialization. The current limitation is that our coach is stateless—it can’t see trends across epochs or calibrate standards across task types. A context-aware coach could balance task types intentionally, or even exploit this dynamic to shape agent expertise by design.
We showed that multiagent systems can be trained end-to-end using process rewards from an LLM coach. Dense per-action feedback addresses credit assignment, improves sample efficiency, and works across domains.
The broader direction: scaling specialized agents—not just scaling single models—may be a promising path for complex tasks. A strong general model serves as coach to a team of smaller specialists that can collectively exceed what the coach could do alone.
Current limitations:
Promising directions:
We are entering an era where AI systems increasingly involve multiple agents working together, and figuring out how to train and evaluate these systems is becoming critical. Our approach, MAPPA, outlined in this blog post, could be a first step in that direction.
Define and train your own multiagent system @ our repo!
If you find this page helpful, please cite:
@misc{li2026mappa,
      title={Scaling Multiagent Systems with Process Rewards},
      author={Ed Li and Junyu Ren and Cat Yan},
      year={2026},
      eprint={2601.23228},
      archivePrefix={arXiv},
      primaryClass={cs.AI},
      url={https://arxiv.org/abs/2601.23228},
}