Designing an RLVR Environment from a Failure Taxonomy (Terminal-Bench 2)
Reward design has to be grounded in the data and the failure modes. Ungrounded rewards fail quietly: you penalize failures the model does not actually have, you miss the one that kills most trials, and on a hard benchmark a bare pass/fail reward leaves GRPO groups all-zero, so nothing trains. The problems you never measured never get solved.
So I measured first: a verifiable failure taxonomy on gpt-oss-20b, 444 trials, every failure
classified and diagnosed, and only then designed the RLVR environment. Every reward term traces back to a
measured failure mode. This failure-taxonomy-driven method is something I would carry to any RL environment
design.
1. Benchmark selection
I chose Terminal-Bench 2 for two reasons. Every task ships a deterministic tests/test.sh verifier, so the reward
is a real pass/fail signal. And Harbor records full terminus-2 ATIF traces, which
makes the failure taxonomy in §2 possible. Harbor also gives a clean skeleton (sandboxed
tasks, an agent loop, and verifiers), so I am wiring rewards onto existing scaffolding instead of building the
harness from scratch.
2. Failure taxonomy
The baseline is gpt-oss-20b without any post-training. I ran a verifiable failure taxonomy on it: 444 trials in Harbor,
each scored by rule-based ATIF detectors. The taxonomy is inspired by Atreja et al. [1], who used failure
detection for debugging and trace analysis; here I take it one step further and use the failure detections
as reward signals.
The taxonomy has two layers: L1 classifies what happened, L2 diagnoses why. This is my own implementation based on the paper; the original Pathfinder code is not open-sourced.
The taxonomy runs once per policy, not once per project. This post shows the first run, on the base model. The plan is to run it again on the SFT model before designing the RL reward: each training stage changes how the model fails, so each stage's reward consumes a fresh measurement.
2.1 L1 classification
We classify each agent trial into one of four categories (pass 21 ·
verifier_fail 372 · agent_timeout 46 · infra 5, of 444 trials), so that
infra crashes are not counted as agent behavior and the tasks the agent already passed can be skipped
(a baseline shortcut; in the next iteration passes also run through L2, as the control group for the
failure modes).
The dominant outcome is verifier_fail (~84%), not timeout. The learning problem is how the agent fails the verifier.
L1 is a deterministic, rule-based function over two fields of Harbor's result.json. We don't
use an LLM judge here.
L1 pseudocode:
# one Harbor trial = one result.json + one ATIF trace
for result_json in jobs/<shard>/*/result.json:
    row = normalize(result_json)      # flatten to: task, trial_name, reward
                                      # (= verifier_result.rewards.reward),
                                      # exception_type, exception_message
    row.l1 = classify_l1(row)         # L1 is a rule-based function, no LLM
    # reward >= 1.0                   -> pass
    # AgentTimeoutError               -> agent_timeout
    # vLLM / API / image-build error  -> infra_*
    # else (tests ran and failed)     -> verifier_fail
    upsert(supabase.tb2_trials, row)  # one row per trial, keyed by trial_name
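The L1 rules can be sketched as a small runnable function. This is a minimal illustration, not the project's actual classifier: the function signature, the `INFRA_MARKERS` substrings, and the single `infra` label (instead of the `infra_*` subtypes) are assumptions made for the example.

```python
from typing import Optional

# Hypothetical substrings that mark an infra failure; the real exception
# names depend on the Harbor / vLLM setup and are assumptions here.
INFRA_MARKERS = ("vllm", "api", "imagebuild")


def classify_l1(reward: Optional[float], exception_type: Optional[str]) -> str:
    """Map one trial to an L1 bucket using only reward and exception type."""
    # A full verifier reward wins regardless of any recorded exception.
    if reward is not None and reward >= 1.0:
        return "pass"
    exc = (exception_type or "").lower()
    if exc == "agenttimeouterror":
        return "agent_timeout"
    if any(marker in exc for marker in INFRA_MARKERS):
        return "infra"
    # Everything else: the tests ran and failed.
    return "verifier_fail"
```

Because the function reads only two fields, the same trial always lands in the same bucket, which is what makes the L1 counts reproducible.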
2.2 L2 diagnosis
For the verifier_fail trials only, we run executable detectors over the agent trace to diagnose how the agent failed.
- This bucket holds 84% of the L1 classifications, with complete traces and clean failure semantics, which makes it the highest-ROI bucket to tackle first. 334 of the 372 get a trace-visible diagnosis; the other 38 are silent (the verifier failed but no detector fired), so these are skipped.
- Timeouts are deprioritized at the baseline because there are few of them (about 10% of trials, vs 84% verifier_fail). This routing choice should follow the failure distribution, though: re-derive it after each training stage, because the failure mix can shift as the model weights change.
After excluding the 38 silent trials, we have 334 trials with trace evidence. Figures 5 and 6 are computed over these.
The missing-reflection family (premature_complete + error_unaddressed) fires 434+ times across multi-label trials.
| Failure mode | Detector | Computed signal |
|---|---|---|
| Misreflection | error_unaddressed | Prior step had errors; next step did not address them |
| Early submit | premature_complete | mark_task_complete while errors still present |
| Infinite loop | repeat_command_loop | Same bash keystrokes repeated ≥3× |
| Non-converging | high_wasted_commands | ≥50% agent steps carry error observations |
| Environment dependency | missing_env | command not found / ModuleNotFoundError |
| Context pressure | context_pressure | prompt_tokens ≥ 25K/step or total ≥100K |
| Malformed JSON | json_parse_warning | JSON parse errors in observation or message |
Each L2 detector is a rule-based function too, deterministic (regex, counters, thresholds over the trace). We don't use an LLM judge here.
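To make "regex, counters, thresholds" concrete, here is a minimal sketch of two of the detectors from the table. The step shape (`command`, `has_error`) is a hypothetical simplification of an ATIF trace, and treating "repeated ≥3×" as three consecutive identical commands is an assumption; the thresholds themselves come from the table.

```python
from typing import Dict, List, Optional

# Hypothetical simplified step: {"command": str, "has_error": bool}
Step = Dict[str, object]


def repeat_command_loop(steps: List[Step], min_repeats: int = 3) -> Optional[int]:
    """Return the step index where one command has run min_repeats times in a row."""
    run = 0
    prev = None
    for i, step in enumerate(steps):
        cmd = step["command"]
        run = run + 1 if cmd == prev else 1
        prev = cmd
        if run >= min_repeats:
            return i  # evidence step for the diagnosis
    return None


def high_wasted_commands(steps: List[Step], threshold: float = 0.5) -> bool:
    """True when at least `threshold` of the steps carry an error observation."""
    if not steps:
        return False
    errors = sum(1 for s in steps if s["has_error"])
    return errors / len(steps) >= threshold
```

Both detectors are pure functions of the trace, so rerunning them on the same trial gives the same answer.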
L2 pseudocode:
# L2: verifier_fail rows only
# (routing chosen from the baseline distribution; re-derive per stage)
for row in supabase.tb2_trials where l1 == "verifier_fail":
    hit = first_hit(ordered_detectors, atif_trace(row))
    # each detector = regex/counter rules over the trace, no LLM
    # error_unaddressed, premature_complete, repeat_command_loop, ...
    update row set l2_failure_class = hit.code,   # null -> silent
                   evidence_step    = hit.step
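The `first_hit` routing can be sketched as an ordered list of (code, detector) pairs. This is an illustration under assumptions: the trace is reduced to a list of observation strings, only two regex detectors are shown, their order is arbitrary here, and the JSON pattern is hypothetical. The `missing_env` pattern follows the detector table.

```python
import re
from typing import Callable, List, Optional, Tuple

# A detector returns the index of the first matching step, or None.
Detector = Callable[[List[str]], Optional[int]]

MISSING_ENV_RE = re.compile(r"command not found|ModuleNotFoundError")
JSON_WARN_RE = re.compile(r"JSON.*(parse|decode)", re.IGNORECASE)  # hypothetical pattern


def regex_detector(pattern: "re.Pattern[str]") -> Detector:
    """Build a detector that fires on the first observation matching `pattern`."""
    def detect(observations: List[str]) -> Optional[int]:
        for i, text in enumerate(observations):
            if pattern.search(text):
                return i
        return None
    return detect


# Order is an assumption; the real priority comes from the taxonomy.
ORDERED_DETECTORS: List[Tuple[str, Detector]] = [
    ("missing_env", regex_detector(MISSING_ENV_RE)),
    ("json_parse_warning", regex_detector(JSON_WARN_RE)),
]


def first_hit(detectors: List[Tuple[str, Detector]],
              observations: List[str]) -> Optional[Tuple[str, int]]:
    """Return (failure code, evidence step) for the first detector that fires."""
    for code, detect in detectors:
        step = detect(observations)
        if step is not None:
            return code, step
    return None  # silent trial: verifier failed but no detector fired
```

Returning the step index alongside the code is what lets each diagnosis be pinned to evidence in the trace.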
2.3 Findings
Two findings drive the reward design. The agent mostly fails by submitting too early: the missing-reflection family (premature complete plus unaddressed errors) dominates Figure 5. And those failures land at steps 2–3, not at the turn cap (Figure 6), so the bottleneck is wasted early actions, not horizon length.
The failure modes also decide what goes to SFT and what goes to RL:
| Failure type | Modes | Fix |
|---|---|---|
| Doesn't know the move | missing_env · json_parse | SFT cold start (§4.3) |
| Knows the move, decides badly | premature_complete · error_unaddressed · loops | Reward penalties (§3.3) |
The logic: demonstrations teach moves, rewards teach decisions. You can imitate how to install a dependency; you can't imitate when to stop. §3.3 gives every detector its disposition.
2.4 Data lens: preparing the SFT and RL data
§2.3 decided which failure modes go to SFT and which go to RL. The data lens is the other half of that
decision: what data actually teaches them. Every TB2 task ships category and
difficulty in task.toml: 89 tasks across 16 categories, dominated by
software-engineering (26 tasks), with difficulty medium 55 · hard 30 · easy 4. The eval set is therefore skewed toward one category and toward medium and hard tasks, which matters when balancing the training data.
My chosen SFT data is NVIDIA's Nemotron-Terminal-Corpus: terminus-2 trajectories subsampled from a
~140k-row pool. The selection is matched to TB2: same agent format, explicit category balance, and a turn/token budget aligned
with MAX_TURNS so imitation does not teach verbosity.
The corpus's own mix is fairly even: an equal share of every category, 2,000 rows across 11 categories, about 180 each (2,000 / 11). Rows come out in category round-robin order, so any prefix of the file is balanced too; the first 500 are the cold-start set.
Why equal shares instead of copying the eval mix? At n=89 the anchor's percentages are noise: 29% software-engineering is 26 tasks. And SFT cold start teaches format and terminal habits, which transfer across categories; what it needs is coverage and variety, not ratio matching. We can adjust the data sampling based on the first iteration of the eval result.
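The round-robin ordering described above can be sketched as follows. This is a minimal illustration, assuming each row is a dict with a `category` key; the function name and row shape are hypothetical, not the actual selection script.

```python
from collections import defaultdict, deque
from typing import Deque, Dict, Iterable, List


def round_robin_by_category(rows: Iterable[Dict[str, str]], limit: int) -> List[Dict[str, str]]:
    """Emit rows one category at a time so every prefix stays balanced."""
    queues: Dict[str, Deque[Dict[str, str]]] = defaultdict(deque)
    for row in rows:
        queues[row["category"]].append(row)
    order = sorted(queues)  # fixed category order for reproducibility
    out: List[Dict[str, str]] = []
    while len(out) < limit and any(queues[c] for c in order):
        for cat in order:
            if queues[cat] and len(out) < limit:
                out.append(queues[cat].popleft())
    return out
```

With this ordering, taking the first 500 rows as the cold-start set needs no extra sampling step: the prefix already covers every category in near-equal shares.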
Both datasets are browsable row by row, with provenance on every row, in the data viewer.
3. Environment creation
An RL environment is three design decisions: what the agent sees (observation), what it can do (actions), and what it gets rewarded for. For Terminal-Bench 2, the first two mostly follow Harbor's terminus-2 setup; the real design work is in the reward, and that is where the failure taxonomy from §2 comes in.
Mechanically, with TRL + Harbor each task episode runs in a sandbox and ends with a verifier reward. During training GRPO samples G rollouts per task and normalizes advantages within the group, so the reward only has to be right in a relative sense across rollouts of the same task, not calibrated on an absolute scale.
3.1 State / observation space
One Terminal-Bench 2 task = one multi-turn episode in a persistent Harbor sandbox. The policy uses the terminus-2 format to interact with the sandbox.
Episode: 1 task = 1 persistent sandbox. At step t, the policy sees the chat transcript's token sequence.
[ system prompt | instruction.md | (assistant_turn, env_feedback)_1 … (assistant_turn, env_feedback)_{t-1} ]
| Component | Content |
|---|---|
| System prompt | Terminal coding agent; acts only via bash |
| Task instruction | TB2 instruction.md |
| History | Per-step bash stdout/stderr/exit code |
| Encoding | Model chat template; loss_mask trains agent tokens only |
| Bounds | max_seq_len (e.g. 8K train / 32K eval) |
Observation is partial and stateful: same instruction, different history across steps. Harbor rollouts convert
to prompt/completion pairs with a loss_mask for TRL GRPOTrainer.
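The idea of training on agent tokens only can be sketched with a plain mask builder. This is an assumption-laden illustration: the segment shape `(role, token_ids)` and the role names are hypothetical, and the actual Harbor-to-TRL conversion may represent the mask differently.

```python
from typing import List, Tuple


def build_loss_mask(segments: List[Tuple[str, List[int]]]) -> Tuple[List[int], List[int]]:
    """Flatten (role, token_ids) segments and mark assistant tokens with 1."""
    input_ids: List[int] = []
    loss_mask: List[int] = []
    for role, token_ids in segments:
        input_ids.extend(token_ids)
        # Only the agent's own tokens contribute to the loss; system prompt,
        # instruction, and environment feedback are context only.
        loss_mask.extend([1 if role == "assistant" else 0] * len(token_ids))
    return input_ids, loss_mask
```

Masking environment feedback keeps the policy from being trained to predict bash output it did not generate.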
3.2 Action space
At each step, the policy emits one terminus-2 message. The meaningful action is what happens in the sandbox.
| Dimension | Definition |
|---|---|
| Action | One generation → one terminus-2 message |
| Semantic action | One bash tool call or mark_task_complete |
| Effect | Command runs in sandbox; state persists across steps |
| Horizon cap | MAX_TURNS (≈6 train / 20 eval) |
Horizon is capped at 6 in training since first failures cluster at steps 2–3 (Figure 6); eval keeps 20 for long-horizon recovery, a deliberate train/eval gap.
3.3 Reward functions
§2 tells me what fails; this is what I optimize. P0 is the core set; P1 are refinements I add if there is bandwidth.
The mapping is deliberately boring, and that is the point: every reward term traces back to a row in the taxonomy, and anything the taxonomy did not show me, I do not reward. The dominant failure (most trials miss the verifier) calls for dense partial credit so the GRPO groups are not all zeros. The early-submit and loop failures get explicit terms, a small penalty and an efficiency tiebreaker respectively, but only on top of partial credit, not in place of it. The 38 silent trials get nothing extra, because I cannot see why they failed and I will not penalize what I cannot measure.
| Failure | Reward | Priority |
|---|---|---|
| Most trials fail verifier (~84%) | R_outcome: partial credit from CTRF | P0 |
| Early submit, tests partly pass | Same R_outcome | P0 |
| Mark done while tests fail | Penalty −lam_pc · 1[premature_complete] | P1 |
| Repeat loops / wasted bash | R_agency tiebreaker among successes | P1 |
| Failures at steps 2–3, not turn cap | No horizon bonus or per-step shaping | P0 |
| Silent trials (38) | R_outcome only | P0 |
| error_unaddressed | No separate term; same family as premature complete, penalized at the submit step | P1 |
| missing_env / json_parse | No reward term; targeted by the SFT cold start (§4.3) | P2 |
| context_pressure | No reward term; handled by env design, MAX_TURNS + truncated observations (§3.1) | P2 |
With these rows, all 7 detectors have an explicit disposition: imitation (SFT), reward pressure (RL), environment design, or deliberately nothing. This is the proposed plan; the P1/P2 terms are implemented but not all ablated yet.
R_outcome: passed_tests / total_tests ∈ [0, 1]. No LLM rubric.
R_agency: Among rollouts with R_outcome = 1.0, rank by turns + wasted_commands; the scaled bonus λ · agency_bonus lies in [0, λ]. It is only a tiebreaker, and it needs at least two successes in a group to fire at all. Early in training, when ~84% of trials fail, it almost never triggers. That is on purpose: efficiency only becomes worth rewarding once the model passes often enough to choose between a clean solution and a messy one.
R_integrity: Void trajectory on test/verifier tampering, hardcoded answers, or exfiltration (total ≤ 0).
total_i = R_outcome_i + λ · agency_bonus_i − integrity_penalty_i − lam_pc · 1[premature_complete]
P0: outcome + integrity (λ = 0, lam_pc = 0). P1: agency and premature-complete penalty.
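The composition above can be sketched as a group-level scoring function. This is a minimal illustration, not the shipped reward module. The `Rollout` fields are hypothetical, and so are the linear rank-to-bonus mapping and the choice to void a trajectory to exactly 0.0 (the text only requires total ≤ 0).

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Rollout:
    passed_tests: int
    total_tests: int
    turns: int
    wasted_commands: int
    premature_complete: bool = False
    integrity_violation: bool = False


def r_outcome(r: Rollout) -> float:
    """Dense partial credit: fraction of verifier tests passed."""
    return r.passed_tests / r.total_tests if r.total_tests else 0.0


def agency_bonuses(group: List[Rollout]) -> List[float]:
    """Rank clean full passes by turns + wasted_commands; best 1.0, worst 0.0."""
    winners = [i for i, r in enumerate(group)
               if r_outcome(r) == 1.0 and not r.integrity_violation]
    bonuses = [0.0] * len(group)
    if len(winners) < 2:
        return bonuses  # tiebreaker needs at least two successes
    ranked = sorted(winners, key=lambda i: group[i].turns + group[i].wasted_commands)
    for rank, i in enumerate(ranked):
        bonuses[i] = 1.0 - rank / (len(ranked) - 1)
    return bonuses


def total_rewards(group: List[Rollout], lam: float = 0.0, lam_pc: float = 0.0) -> List[float]:
    """P0 is lam = lam_pc = 0; P1 switches on the agency and premature terms."""
    totals = []
    for r, bonus in zip(group, agency_bonuses(group)):
        if r.integrity_violation:
            totals.append(0.0)  # voided trajectory
            continue
        penalty = lam_pc if r.premature_complete else 0.0
        totals.append(r_outcome(r) + lam * bonus - penalty)
    return totals
```

Keeping `lam` and `lam_pc` small relative to the [0, 1] outcome range means a rollout cannot gain more from gaming a shaping term than from passing one more test.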
| Term | What it does | Module | Default mode |
|---|---|---|---|
| R_outcome | passed_tests / total_tests from the CTRF report; dense partial credit so GRPO groups don't go all-zero when most rollouts fail | r_outcome | P0 on |
| R_integrity | Voids the trajectory (total ≤ 0) on test tampering, hardcoded answers, or exfiltration; the anti-reward-hacking fuse | r_integrity | P0 on |
| R_agency | Efficiency tiebreaker among rollouts that fully pass, ranked by turns + wasted commands; needs ≥2 successes in a group to fire | r_agency + group tiebreak | P1 (lam=0.1) |
| premature_complete | Penalty for mark_task_complete while errors are still visible in the trace; the taxonomy's top failure | r_premature_complete | P1 (lam_pc=0.1) |
| Other Figure 5 detectors | Routed to SFT (§4.3) or env design (§3.1), not to reward | - | Not rewarded (design only) |
4. Model selection and training plan
With the environment fixed, the training plan is mostly downstream of it. Two choices still carry weight: which model I start from, and how I stop GRPO from burning rollouts on tasks it cannot learn from. The rest of this section is those two decisions and the knobs around them.
4.1 Base model
Decision: openai/gpt-oss-20b
Aligns with the Terminal-Bench 2 blog stack: Harbor + terminus-2 + gpt-oss-20b + verifiable rewards + GRPO.
The baseline is verifier-dominated (~84% verifier_fail), not timeout-limited; dense partial credit
produces non-degenerate GRPO groups instead of all-zero advantages.
gpt-oss-120b is the natural teacher for a later on-policy distillation (OPD)
stretch: dense per-token signal when rollouts end at zero verifier reward or truncate early (Figure 4 silent
bucket, Figure 6 early-step cluster).
4.2 RL algorithm
Decision: GRPO via TRL GRPOTrainer on Harbor rollouts.
For each task prompt, sample G independent terminus-2 rollouts, score with §3.3 rewards, normalize advantages within the group:
Â_i = (R_i − mean(R_{1..G})) / (std(R_{1..G}) + ε)
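A small sketch of the group normalization makes the all-zero problem visible. It assumes population standard deviation and ε = 1e-6; the trainer's exact choices may differ.

```python
from statistics import mean, pstdev
from typing import List


def group_advantages(rewards: List[float], eps: float = 1e-6) -> List[float]:
    """Normalize rewards within one task group: (R_i - mean) / (std + eps)."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; an assumption for this sketch
    return [(r - mu) / (sigma + eps) for r in rewards]
```

A group where every rollout scores 0 yields all-zero advantages and therefore no gradient, while a group with partial credit spread across rollouts does not. That is the practical argument for the dense R_outcome term.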
| Property | Fit |
|---|---|
| Verifiable scalar reward | R_outcome from tests/test.sh; no learned RM |
| High variance across rollouts | Same task, different bash traces → spread in partial credit |
| No critic network | Simpler than PPO on long multi-turn trajectories |
| Reference recipe | GRPO [4] · AfterQuery blog + Eureka SFT→GRPO [3] |

The planned training stages:
| Stage | Role | Planned |
|---|---|---|
| SFT | Cold-start terminus-2 + occasional verifier passes | Yes |
| OPD (optional) | gpt-oss-120b teacher, reverse-KL on student trajectories | Stretch |
| GRPO | On-policy improvement with composed rewards | Yes |
4.3 Data splits and curriculum
| Split | Source | Size | Use |
|---|---|---|---|
| SFT gold | nvidia/Nemotron-Terminal-Corpus | 500 trajectories | SFT cold-start |
| GRPO pool | nvidia/Nemotron-Terminal-Synthetic-Tasks | 5,984 tasks; ~10–50 in band/run | Rollout + update |
| Probe band (discovery sweep) | Subset of synthetic pool | 10–80% pass@k | Curriculum gate |
| Eval (held-out) | Official Terminal-Bench 2 | 89 × k=5 | pass@1_macro |
| Eval (fast) | 10-task slice | 10 × k=5 | Iteration gate |
The probe band (10–80%) is a wide discovery sweep to find learnable tasks; GRPO then trains on the 20–60% sub-band, where group advantage is richest. Tasks at 0% or 100% contribute little or no group advantage.
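The band filter is a one-line selection over measured pass rates. The sketch below assumes a dict of task id to pass rate and inclusive bounds, both of which are illustrative choices.

```python
from typing import Dict, List


def select_tasks(pass_rates: Dict[str, float], low: float = 0.2, high: float = 0.6) -> List[str]:
    """Keep tasks whose measured pass rate falls inside [low, high]."""
    return sorted(task for task, rate in pass_rates.items() if low <= rate <= high)
```

Calling it with `low=0.1, high=0.8` gives the discovery sweep, and the defaults give the training sub-band. Tasks at 0.0 or 1.0 drop out of both.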
4.4 Hyperparameters
Single-node LoRA on gpt-oss-20b (16GB-class MoE with mxfp4). Anchored to the AfterQuery blog band.
| Parameter | SFT | GRPO rollouts | GRPO update |
|---|---|---|---|
| Base weights | openai/gpt-oss-20b | SFT checkpoint | Same |
| Adapter | LoRA r=32, α=64 | - | Trainable |
| Learning rate | 2e-5 | - | 1e-6 |
| Optimizer | AdamW β=(0.9, 0.95) | - | AdamW |
| Steps | 250 (500 demos) | 15 rounds | 13–15 |
| Group size G | - | 8 | 8 |
| Max turns (train / eval) | - | 6 | 20 eval |
| Max seq len | 8K | 8K | 8K train; 32K eval |
| Temperature | - | 0.7 | - |
| Eval temperature | - | - | 1.0 |
Eval samples k=5 per task at temperature 1.0; pass@1_macro is the per-task pass rate averaged
over those 5 samples, so a non-zero eval temperature is intentional sampling for the macro estimate rather
than a single greedy decode.
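The macro average described above can be written out in a few lines. The input shape (task id mapped to a list of per-sample pass flags) is an assumption for the sketch.

```python
from typing import Dict, List


def pass_at_1_macro(results: Dict[str, List[bool]]) -> float:
    """Average the per-task pass rate over tasks, so each task counts equally."""
    per_task = [sum(samples) / len(samples) for samples in results.values() if samples]
    return sum(per_task) / len(per_task) if per_task else 0.0
```

For example, three tasks with pass rates 1.0, 0.0, and 0.2 over k=5 samples average to 0.4, regardless of how many samples each task happened to receive.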
LoRA on the mxfp4 MoE base: adapters sit on the mxfp4 MoE weights, target the attention and router projections, stay in bf16, and are validated for numerical stability on a 1-step smoke run before the full job.
Rollout gate
# Only commit a GRPO update when reward variance exists within a batch
std(R_{1..G}) > 0 for at least one task group
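The gate can be sketched as a check over per-task reward groups. The function names and the small tolerance are assumptions; the second helper computes the "fraction with reward_std > 0" style of metric that a run might log.

```python
from statistics import pstdev
from typing import List


def should_update(groups: List[List[float]], tol: float = 1e-8) -> bool:
    """True when at least one task group has non-zero reward spread."""
    return any(pstdev(g) > tol for g in groups if len(g) > 1)


def fraction_with_signal(groups: List[List[float]], tol: float = 1e-8) -> float:
    """Share of task groups whose rewards are not all identical."""
    if not groups:
        return 0.0
    live = sum(1 for g in groups if len(g) > 1 and pstdev(g) > tol)
    return live / len(groups)
```

Skipping updates when every group is flat avoids spending an optimizer step on a batch whose advantages are all zero.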
4.5 Metrics
| Phase | Metric | Purpose |
|---|---|---|
| Eval | pass@1_macro on held-out 89 | Headline benchmark lift |
| Eval | Per-task pass rate, L1 outcome mix | Tie back to taxonomy |
| SFT | Train loss, eval pass@1 vs base | Cold-start signal |
| GRPO rollouts | Mean/std of R_outcome; fraction with reward_std > 0 | Gate policy updates |
| GRPO update | grad_norm, clipped fraction, KL | Stability |
| GRPO update | premature_complete, repeat_command_loop rates | Ablation vs taxonomy |
| Integrity | Voided trajectories (R_integrity) | Reward hacking guard |
End-to-end validation would run Harbor trajectories with CTRF rewards on Nemotron tasks, confirm
≥1 GRPO step with reward_std > 0, and check that a rep-10 eval shows monotonic
base → SFT → GRPO lift. The sections above are the design; full training runs are future work.
Primary ablation: P0 (R_outcome + R_integrity) vs P1 (+ R_agency + premature_complete).
5. Closing & future steps
The bet at the top of this post was: the problems you never measured never get solved. This design is that bet applied end to end. Nothing gets penalized, trained on, or capped without a verifiable measurement behind it, and the measurement itself reruns on the SFT model before the RL reward is locked.
What I would fix next:
- Reward hacking. The taxonomy itself is an analysis tool: I decide how the reward gets shaped, so the environment guides the agent to learn properly. But once a rule is part of the reward, it runs automatically on every rollout, and rollouts that happen to dodge its text pattern (clear the error text before submitting, say "let me fix this" and do nothing) score higher and get reinforced. That is why the outcome reward dominates and the penalties stay small: hacking a rule never moves the test score.
- No human verification yet. No human labels, so I am not sure how often each detector fires wrongly, and a bad detector brings noise into the reward. Two things to do: hand-label some data, and build a validation approach that can check a detector's results quickly. Where possible, a detector should rerun the failing test instead of matching error text: execution is harder to fool.
- Generalization. The detectors are TB2-specific; the method is not. It needs a deterministic verifier and full traces, nothing else, and the same two-layer taxonomy can run on τ²-bench retail with a different detector pack.
References
- [1] Atreja, Dhruv. 2026. Pathfinder: Self-Improving Agent Trace Analysis via Adversarial Self-Play and Code Execution. ACM Conference on AI and Agentic Systems, 1336–1339. doi:10.1145/3786335.3813199
- [2] Helwig, Jacob. 2026. On-Policy Distillation (OPD). verl documentation. verl.readthedocs.io
- [3] Li, Hangxuan, et al. 2026. Eureka: Intelligent Feature Engineering for Enterprise AI Cloud Resource Demand Prediction. DASFAA 2026.
- [4] Shao, Zhihong, et al. 2024. DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models. arXiv preprint arXiv:2402.03300. arxiv.org/abs/2402.03300
Appendix
A. Failure taxonomy × TB2 detectability
L1: Classification. L1 answers "what happened to this trial": the tests passed, the
tests ran and failed, the agent ran out of time, or the infra broke before the agent got a fair shot.
It is a pure if/else over two fields of Harbor's result.json (verifier reward + exception
type), no LLM.
L2: Diagnose. L2 answers "why did the agent fail the tests": for each verifier-fail trial, a set of small rules (regex, counters, thresholds) scan the trace and name the failure mode, each hit pinned to a specific step with the evidence attached.
Optional mapping from Atreja et al. [1] to terminus-2 + Harbor ATIF.
| Paper family | Subtype | TB2 | How / count |
|---|---|---|---|
| Architecture | Missing reflection | yes | premature_complete, error_unaddressed · 434+ |
| Architecture | Infinite loop | yes | repeat_command_loop · 73 |
| Architecture | Non-converging planner | yes | high_wasted_commands · 123 |
| Context | Window overflow | yes | context_pressure · 51 |
| Parsing/config | Malformed JSON | yes | json_parse_warning · 25 |
| Parsing/config | Missing env | yes | missing_env · 131 |
| Prompt | Contradictory instructions | no | Prompt not in ATIF spans |
| Tool misuse | Malformed tool schema | no | Only bash + mark_task_complete |
| Streaming/API | Tool-call breaks | no | No streaming spans |
B. Framework & stack
The design above does not lean much on which trainer I use; this section is the reproducibility detail. I run GRPO [4] (group-relative advantages on verifiable rewards) in TRL, with Harbor as the environment layer.
Why GRPO. This is the same post-training recipe I have already shipped in production. In Eureka [3], we frame enterprise feature engineering as agentic code generation: SFT cold-start on domain plans, then GRPO on a composed reward. Terminal-Bench 2 has the same shape in a different domain, and the loop is identical: sample rollouts, score with verifiers, normalize advantages within a group, update the policy.
Considerations
| Layer | Choice |
|---|---|
| Environment | Harbor (Modal/Docker sandboxes, terminus-2 agent, test.sh verifiers) |
| SFT | TRL SFTTrainer |
| RL | TRL GRPOTrainer |
| Reward | Custom module on verifier output (composed shaping terms) |