Let AI coaches score every action to train the agents end-to-end.
TLDR: Finetuning many agents end-to-end offers a workaround to the continual learning problem, since different agents can specialize without catastrophic forgetting. Yet doing so is hard due to credit assignment and sample efficiency. We found that using AI feedback as per-action process rewards holds promise for addressing these challenges and unlocks a new axis for scaling post-training.

Multiagent systems sidestep catastrophic forgetting the same way mixture-of-experts does—by giving different skills different parameters.
Finetuning a single model on one capability often degrades others. Train extensively on one language, and performance on others may drop. This is catastrophic forgetting: all tasks compete for the same parameters.
MoE architectures partially solve this by routing different inputs to different parameter subsets, creating more runway to scale (more training can be done without forgetting) within one big model. Almost all frontier models nowadays, including Gemini 2.5, Kimi K2, and Claude Opus 4.5, use MoE designs.
Multiagent systems apply the same idea at the agent level: each agent has its own weights that can be finetuned separately. Thus, if coordinated right, the number of agents could be the next dimension of scaling.
So far, most multiagent frameworks implement specialization by assigning different personas or instructions to each agent, leaving the weight-separation advantage completely untapped. This is because training all agents end-to-end faces two fundamental challenges:
Credit assignment. When a task succeeds or fails, which agent is responsible? A data science pipeline might fail with FileNotFoundError. The error may only surface when the final agent tries to access the file, even though the root cause is an earlier agent forgetting to save that file. Under current RL approaches, all agents share the same final outcome score regardless, penalizing the final agent for doing the right thing.
Sample efficiency. Multiagent rollouts are expensive. A single run can easily involve generating dozens of actions from different LLMs, each containing tool calls to be executed by the environment, taking minutes if not hours at a time. Yet current RL approaches provide only one training signal at the very end, which makes it very much like “sucking supervision through a straw.”
We address both challenges by having an LLM coach evaluate every action as it happens—not just the final outcome.
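To make this concrete, here is a minimal sketch of the idea (the `agents`, `env`, and `coach` interfaces are hypothetical stand-ins, not our actual implementation): instead of broadcasting one outcome score to every agent at the end, the coach scores each action as it is taken.

```python
# Minimal sketch (hypothetical interfaces): every action gets its own reward,
# so each agent's update is based on what that agent actually did.

def run_and_score(task, agents, env, coach):
    """Roll out a multiagent episode and collect per-action rewards."""
    rewards = {agent.name: [] for agent in agents}
    for agent in agents:                      # agents act in sequence
        for action in agent.act(task, env):   # each action may call tools
            tool_output = env.execute(action)
            # The coach sees the agent's role, the current state, the action,
            # and the tool output -- enough to score this single step in context.
            score = coach.score(
                role=agent.role,
                state=env.snapshot(),
                action=action,
                tool_output=tool_output,
            )
            rewards[agent.name].append(score)
    return rewards  # dense signal: one reward per action, per agent
```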

With per-action evaluation, every step gets feedback—not just the final outcome.
The coach receives the context it needs for accurate credit assignment: each agent's assigned role, the inputs it was given, and the tool outputs its actions produced.
Why “coach” rather than “judge”? A judge rules objectively on correctness. A coach is context-aware, evaluating each agent based on its assigned role and given inputs, not just on fixed metrics or eventual outcomes.
When the final agent crashes with FileNotFoundError, the coach checks the earlier agents' tool outputs and the resulting filesystem. If no agent ever saved X_test.pkl, blame traces to whichever earlier action should have created it, not to the final agent's action that correctly tried to load it.
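Concretely, the input to the coach for a single action can be assembled along these lines (the function and prompt wording below are illustrative, not our exact implementation):

```python
import json
from pathlib import Path

def build_coach_prompt(role, task_brief, action, tool_output, workspace):
    """Assemble the context the coach sees for one action (illustrative sketch)."""
    files = sorted(p.name for p in Path(workspace).iterdir())
    return (
        f"You are coaching the {role} agent in a multiagent pipeline.\n"
        f"Task given to this agent: {task_brief}\n"
        f"Action taken:\n{action}\n"
        f"Tool output:\n{tool_output}\n"
        f"Files currently in the shared workspace: {json.dumps(files)}\n"
        "Score this action from 1-10 for how well it fulfills the agent's "
        "assigned role given its inputs, independent of downstream failures. "
        'Reply with a short verdict and "SCORE: <n>/10".'
    )
```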
We call the overall approach MAPPA: training MultiAgent systems with Per-action Process rewards from AI feedback.
What does MAPPA look like in practice? We train a three-agent pipeline on Kaggle-style data modeling problems—realistic, long-horizon tasks where agents must coordinate over many steps. Each task provides CSV files and requires generating predictions for held-out test data. Note that this is not the only multiagent system or task we test our approach on; more experimental results and in-depth discussion can be found in our technical report.

Agents pass files to each other through a shared workspace—creating a paper trail the coach can examine.
Three agents pass the baton in sequence: a Data Engineer, a Modeler, and an Analyst.
Each agent can take up to 4 turns, executing Python code in a sandboxed environment. Agents communicate by reading and writing files to a shared workspace.
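For intuition, the control flow looks roughly like the sketch below (names such as `agent.write_code` and `sandbox.run` are placeholders, not our actual code): agents run one after another, each getting up to 4 turns against the shared workspace, and everything they do is logged.

```python
# Sketch of the pipeline loop: sequential agents, up to 4 turns each,
# communicating only through files in a shared workspace directory.

MAX_TURNS = 4

def run_pipeline(task, agents, sandbox, workspace):
    transcript = []  # the "paper trail" the coach later examines
    for agent in agents:  # e.g. data engineer -> modeler -> analyst
        for turn in range(MAX_TURNS):
            code = agent.write_code(task, transcript, workspace)
            output = sandbox.run(code, cwd=workspace)  # reads/writes shared files
            transcript.append({"agent": agent.name, "turn": turn,
                               "code": code, "tool_output": output})
            if agent.is_done(output):
                break
    return transcript
```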
Why does file-passing matter for credit assignment? It creates a written record for the coach to examine. For example, when something fails:
DATAENGINEER evaluation:
- Tool output: "Saved X_train.pkl, y_train.pkl"
- No mention of X_test.pkl
- VERDICT: Failed to save required artifact
- SCORE: 3/10
MODELER evaluation:
- Received expected files from Data Engineer
- Tool output: "Saved model.pkl successfully"
- VERDICT: Completed task correctly given inputs
- SCORE: 8/10
ANALYST evaluation:
- Required file X_test.pkl was never created upstream
- Correctly attempted to load it
- VERDICT: Not at fault for the failure
- SCORE: 6/10
Importantly, there is no counterfactual reasoning required—just checking what each agent actually produced.
As with most long-horizon, real-world tasks, evaluating the agent's predictions is not as straightforward as checking a single number: a model might achieve 89% accuracy but a 23% F1 score on a classification task. Naive averaging would label this as decent, when in reality the model just learned to predict the majority class.
The coach understands this by putting the pieces of context together:
Coach reasoning:
High accuracy (0.89) but very low F1 (0.23)
indicates a class imbalance problem.
The model is not learning the actual signal.
SCORE: 4/10
Simple averaging across metrics can’t catch this. The coach synthesizes them in context—that’s the judgment call.
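To see why averaging misleads, here is a toy example (the predictions are made up to roughly mirror the numbers above, not taken from our experiments): a model that mostly predicts the majority class gets high accuracy, a poor F1, and a naive average that still looks respectable.

```python
# Toy illustration with hypothetical predictions: 11% of labels are positive,
# and the model catches only 2 of the 11 positives.
from sklearn.metrics import accuracy_score, f1_score

y_true = [0] * 89 + [1] * 11                      # imbalanced: 11% positive
y_pred = [0] * 87 + [1] * 2 + [1] * 2 + [0] * 9   # mostly majority-class guesses

acc = accuracy_score(y_true, y_pred)   # 0.89
f1 = f1_score(y_true, y_pred)          # ~0.27

print(f"accuracy={acc:.2f}  f1={f1:.2f}  naive average={(acc + f1) / 2:.2f}")
# The naive average (~0.58) looks decent, but the F1 alone shows the model has
# barely learned the minority class -- exactly the pattern the coach flags.
```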
We train for 21 epochs on 64 tasks, evaluate with 4 trials on 6 held-out tasks, and report the average:
| Metric | Before | After | Change |
|---|---|---|---|
| Success Rate | 50.0% | 66.7% | +16.7pp |
| Accuracy (Fair) | 0.583 | 0.719 | +23% |
| F1 (Fair) | 0.126 | 0.174 | +38% |
| RMSE (Fair) | 24.9% | 14.6% | -41% |
Fair metrics penalize failed runs rather than ignoring them. For example, a failed run of a classification task falls back to chance accuracy (50%).
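As an illustration of what we mean by “fair” (the aggregation below is a simplified sketch, not our evaluation code): failed runs stay in the average at a chance-level fallback, so crashing can never score better than finishing with a mediocre model.

```python
# Simplified sketch of "fair" aggregation for a binary-classification metric.

CHANCE_ACCURACY = 0.5  # fallback score for a failed binary-classification run

def fair_accuracy(runs):
    """runs: list of dicts like {"succeeded": bool, "accuracy": float | None}."""
    scores = [r["accuracy"] if r["succeeded"] else CHANCE_ACCURACY for r in runs]
    return sum(scores) / len(scores)

runs = [
    {"succeeded": True, "accuracy": 0.82},
    {"succeeded": False, "accuracy": None},   # crashed run counts as 0.50
    {"succeeded": True, "accuracy": 0.74},
]
print(round(fair_accuracy(runs), 3))  # 0.687
```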
Training improves both success rate and quality metrics. The coach’s per-action feedback translates to downstream improvements on held-out tasks.
We also validate MAPPA on competition math problems with a different multiagent configuration, achieving +5–17pp improvements on AIME and AMC benchmarks. See the paper for details.
While analyzing training dynamics, we noticed an interesting pattern: agents specialize based on coach preferences.
Regression tasks kept improving while classification stagnated. The scores revealed that the coach consistently rated regression actions 0.5–1.8 points higher than equivalent classification actions, a preference we didn't program but one the agents discovered and exploited.
After evaluation performance peaked (within the first 10 epochs), agents leaned into this signal over the next 10 epochs, maintaining 87.5% success on regression while classification dropped back to baseline. The coaching worked: agents learned to optimize what the coach rewarded.
This suggests an opportunity: coach preferences could be deliberately designed to steer specialization. The current limitation is that our coach is stateless—it can’t see trends across epochs or calibrate standards across task types. A context-aware coach could balance task types intentionally, or even exploit this dynamic to shape agent expertise by design.
We showed that multiagent systems can be trained end-to-end using process rewards from an LLM coach. Dense per-action feedback addresses credit assignment, improves sample efficiency, and works across domains.
The broader direction: scaling specialized agents—not just scaling single models—may be a promising path for complex tasks. A strong general model serves as coach to a team of smaller specialists that can collectively exceed what the coach could do alone.
Current limitations:
Promising directions:
We are entering an era where AI systems increasingly involve multiple agents working together, and figuring out how to train and evaluate these systems is becoming critical. Our approach, MAPPA, outlined in this blog post, could be a first step in that direction.
Define and train your own multiagent system @ our repo!
If you find this page helpful, please cite:
@misc{li2026mappa,
      title={Scaling Multiagent Systems with Process Rewards},
      author={Ed Li and Junyu Ren and Cat Yan},
      year={2026},
      eprint={2601.23228},
      archivePrefix={arXiv},
      primaryClass={cs.AI},
      url={https://arxiv.org/abs/2601.23228},
}