Designing an RLVR Environment from a Failure Taxonomy (Terminal-Bench 2)

Lily Zhang · Jun 10, 2026

Reward design has to be grounded in the data and the failure modes. Ungrounded rewards fail quietly: you penalize failures the model does not actually have, you miss the one that kills most trials, and on a hard benchmark a bare pass/fail reward leaves GRPO groups all-zero, so nothing trains. The problems you never measured never get solved.

So I measured first: I ran a verifiable failure taxonomy on gpt-oss-20b (444 trials, every failure classified and diagnosed), and only then designed the RLVR environment. Every reward term traces back to a measured failure mode. I would carry this taxonomy-driven method to any RL environment design.

Figure 1: The pipeline. Classify what happened (L1), diagnose why (L2), then split by failure type: knowledge gaps go to SFT, decision failures become reward terms for GRPO. (Diagram: 444 trials classified at L1; the 372 verifier failures diagnosed at L2 by 7 detectors.)

1. Benchmark selection

I chose Terminal-Bench 2 for two reasons. Every task ships a deterministic tests/test.sh verifier, so the reward is a real pass/fail signal. And Harbor records full terminus-2 ATIF traces, which makes the failure taxonomy in §2 possible. Harbor also gives a clean skeleton (sandboxed tasks, an agent loop, and verifiers), so I am wiring rewards onto existing scaffolding instead of building the harness from scratch.

2. Failure taxonomy

The baseline is gpt-oss-20b without any post-training. I ran a verifiable failure taxonomy on it: 444 trials in Harbor, each scored by rule-based ATIF detectors. The taxonomy is inspired by Atreja et al. [1], who used failure detection for debugging and trace analysis; here I take it one step further and use the failure detections as reward signals.

The taxonomy has two layers: L1 classifies what happened, L2 diagnoses why. This is my own implementation based on the paper; the original Pathfinder code is not open-sourced.

Figure 2: The taxonomy as a system. L1 does coarse attribution (who is responsible) and quarantines what can't be attributed; L2 runs the executable detectors over every attributable trace, pass included (passes are the control group for the failure modes). Every silent trial exits the improvement queue one of two ways: A, a failure we have no detector for, so we build one and L2 grows; or B, the task or verifier itself is broken, so we add a quarantine rule and L1 grows.

The taxonomy runs once per policy, not once per project. This post shows the first run, on the base model. The plan is to run it again on the SFT model before designing the RL reward: each training stage changes how the model fails, so each stage's reward consumes a fresh measurement.

2.1 L1 classification

Each trial gets one of four L1 labels (pass 21 · verifier_fail 372 · agent_timeout 46 · infra 5, of 444 trials). This keeps infra crashes from being counted as agent behavior and lets us skip the tasks the agent already passed (a baseline shortcut; in the next iteration passes also run through L2, as the control group for the failure modes).

Figure 3: Harbor L1 classifications. Mostly verifier_fail (~84%), not timeout. The learning problem is how the agent fails the verifier.
Figure 4: Analysis funnel. 38 silent trials: verifier failed but no ATIF signal for a mode penalty. Design rule: only penalize computed trace evidence.

L1 is a deterministic, rule-based function over two fields of Harbor's result.json. We don't use an LLM judge here.

L1 pseudocode:
# one Harbor trial = one result.json + one ATIF trace
for result_json in jobs/<shard>/*/result.json:
    row = normalize(result_json)      # flatten to: task, trial_name, reward
                                      # (= verifier_result.rewards.reward),
                                      # exception_type, exception_message
    row.l1 = classify_l1(row)         # L1 is a rule-based function, no LLM
    #   reward >= 1.0                  -> pass
    #   AgentTimeoutError              -> agent_timeout
    #   vLLM / API / image-build error -> infra_*
    #   else (tests ran and failed)    -> verifier_fail
    upsert(supabase.tb2_trials, row)  # one row per trial, keyed by trial_name
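The classification rules in the pseudocode above can be sketched as a small runnable function. This is a minimal sketch, not the post's actual implementation: the `TrialRow` dataclass, the `INFRA_MARKERS` substrings, and the single `infra` label (instead of `infra_*` subtypes) are assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Optional

# Assumed substrings that flag an infra failure in the exception type.
# The real rule list is not shown in the post.
INFRA_MARKERS = ("vllm", "api", "image", "build")


@dataclass
class TrialRow:
    task: str
    trial_name: str
    reward: Optional[float]        # verifier_result.rewards.reward; None if absent
    exception_type: Optional[str]  # e.g. "AgentTimeoutError"; None if no exception


def classify_l1(row: TrialRow) -> str:
    """Rule-based L1 label, checked in the same order as the pseudocode."""
    if row.reward is not None and row.reward >= 1.0:
        return "pass"
    if row.exception_type == "AgentTimeoutError":
        return "agent_timeout"
    if row.exception_type and any(
        marker in row.exception_type.lower() for marker in INFRA_MARKERS
    ):
        return "infra"
    # Everything else: the tests ran and failed.
    return "verifier_fail"
```

Because the function reads only the reward and the exception type, the same trial always gets the same label, which is what makes the L1 counts reproducible.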

2.2 L2 diagnosis

We run executable detectors over the agent trace only for the verifier_fail trials, to diagnose how the agent failed.

  • This bucket holds 84% of all trials, with complete traces and clean failure semantics, which makes it the highest-ROI bucket to tackle first. 334 of the 372 get a trace-visible diagnosis; the other 38 are silent (the verifier failed but no detector fired) and are skipped.
  • Timeouts are deprioritized at the baseline because they are a small share (10% of trials, vs 84% verifier_fail). This routing choice follows the failure distribution, so it has to be re-derived after each training stage: the distribution shifts whenever the model weights change.

After excluding the 38 silent trials, we have 334 trials with trace evidence. Figures 5 and 6 are computed over these.

Figure 5: Executable failure modes. Primary chart for reward design. Headline: the missing-reflection family (premature_complete + error_unaddressed) fires 434+ times across multi-label trials.
Figure 6: Failure step distribution. First failures cluster at steps 2–3, not the turn cap. Bottleneck is wasted actions early, not horizon length.
Table 1: Executable failure modes and their computed signals

| Failure mode | Detector | Computed signal |
| --- | --- | --- |
| Misreflection | error_unaddressed | Prior step had errors; next step did not address them |
| Early submit | premature_complete | mark_task_complete while errors still present |
| Infinite loop | repeat_command_loop | Same bash keystrokes repeated ≥3× |
| Non-converging | high_wasted_commands | ≥50% agent steps carry error observations |
| Environment dependency | missing_env | command not found / ModuleNotFoundError |
| Context pressure | context_pressure | prompt_tokens ≥ 25K/step or total ≥100K |
| Malformed JSON | json_parse_warning | JSON parse errors in observation or message |

Each L2 detector is also a deterministic, rule-based function (regex, counters, thresholds over the trace). We don't use an LLM judge here.

L2 pseudocode:
# L2: verifier_fail rows only
# (routing chosen from the baseline distribution; re-derive per stage)
for row in supabase.tb2_trials where l1 == "verifier_fail":
    hit = first_hit(ordered_detectors, atif_trace(row))
    #   each detector = regex/counter rules over the trace, no LLM
    #   error_unaddressed, premature_complete, repeat_command_loop, ...
    update row set l2_failure_class = hit.code,   # null -> silent
                   evidence_step    = hit.step
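As an illustration of what "regex/counter rules over the trace" can look like, here is a minimal sketch of two detectors and the `first_hit` dispatch. The step shape (`command`, `observation`, `done`), the error regex, the detector order, and the reading of "repeated ≥3×" as three consecutive identical commands are all assumptions; the real ATIF schema and rule set are not shown in the post.

```python
import re
from typing import Optional

# Assumed step shape: {"command": str | None, "observation": str, "done": bool}
ERROR_RE = re.compile(
    r"command not found|ModuleNotFoundError|Traceback|\berror\b", re.IGNORECASE
)


def has_error(step: dict) -> bool:
    return bool(ERROR_RE.search(step.get("observation", "")))


def premature_complete(trace: list) -> Optional[int]:
    """Step index where the agent marks done right after an error observation."""
    for i, step in enumerate(trace):
        if step.get("done") and i > 0 and has_error(trace[i - 1]):
            return i
    return None


def repeat_command_loop(trace: list, min_repeats: int = 3) -> Optional[int]:
    """Step index where the same command has run min_repeats times in a row."""
    run = 1
    for i in range(1, len(trace)):
        cmd = trace[i].get("command")
        run = run + 1 if cmd and cmd == trace[i - 1].get("command") else 1
        if run >= min_repeats:
            return i
    return None


# Order matters: the first detector that fires names the primary failure class.
ORDERED_DETECTORS = [
    ("premature_complete", premature_complete),
    ("repeat_command_loop", repeat_command_loop),
]


def first_hit(trace: list):
    for code, detect in ORDERED_DETECTORS:
        step = detect(trace)
        if step is not None:
            return code, step
    return None  # no detector fired: a silent trial
```

Each hit carries the step index, so a diagnosis can always be checked against the exact place in the trace that triggered it.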

2.3 Findings

Two findings drive the reward design. The agent mostly fails by submitting too early: the missing-reflection family (premature complete plus unaddressed errors) dominates Figure 5. And those failures land at steps 2–3, not at the turn cap (Figure 6), so the bottleneck is wasted early actions, not horizon length.

The failure modes also decide what goes to SFT and what goes to RL:

Table 2: Failure type → training stage

| Failure type | Modes | Fix |
| --- | --- | --- |
| Doesn't know the move | missing_env · json_parse | SFT cold start (§4.3) |
| Knows the move, decides badly | premature_complete · error_unaddressed · loops | Reward penalties (§3.3) |

The logic: demonstrations teach moves, rewards teach decisions. You can imitate how to install a dependency; you can't imitate when to stop. §3.3 gives every detector its disposition.

2.4 Data lens: preparing the SFT and RL data

§2.3 decided which failure modes go to SFT and which go to RL. The data lens is the other half of that decision: what data actually teaches them. Every TB2 task ships category and difficulty in task.toml: 89 tasks across 16 categories, dominated by software-engineering (26 tasks), with difficulty medium 55 · hard 30 · easy 4. That distribution is something we need to be aware of.

My chosen SFT data is NVIDIA's Nemotron-Terminal-Corpus: terminus-2 trajectories subsampled from a ~140k-row pool. The selection is balanced against TB2: same agent format, explicit category balance, and a turn/token budget aligned with MAX_TURNS so imitation does not teach verbosity.

The corpus's own mix is fairly even (6 to 14 percent per category), and the sample takes an equal share of every category: 2,000 rows, 11 categories, about 200 each. Rows come out in category round-robin order, so any prefix of the file is balanced too; the first 500 are the cold-start set.
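The round-robin ordering can be sketched in a few lines. This is a minimal sketch under assumptions: rows are dicts with a `category` key, and the function name and `per_category` cap are hypothetical, not the post's actual sampler.

```python
from collections import defaultdict
from itertools import zip_longest


def round_robin_by_category(rows, per_category):
    """Cap each category, then interleave so every prefix stays balanced."""
    buckets = defaultdict(list)
    for row in rows:
        bucket = buckets[row["category"]]
        if len(bucket) < per_category:
            bucket.append(row)

    ordered = []
    # One "layer" holds at most one row from each category.
    for layer in zip_longest(*(buckets[c] for c in sorted(buckets))):
        ordered.extend(row for row in layer if row is not None)
    return ordered
```

With this ordering, any prefix holds category counts that differ by at most one, as long as no category has run out of rows. That is why slicing the first 500 rows gives a balanced cold-start set without a second sampling pass.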

Figure 7: Source corpus vs the first-round sample: each of the 11 categories gets an equal share. (Chart: Nemotron corpus category distribution, 6 to 14 percent per category, next to the balanced sample, 10.1 percent per category.)
Figure 8: The same sample vs the eval anchor. Six eval categories do not exist in the corpus at all; model-training survives the quality gates with only 3 rows. (Chart: SFT sample category shares vs the 89-task eval anchor; six anchor categories are marked not in corpus.)

Why equal shares instead of copying the eval mix? At n=89 the anchor's percentages are noise: 29% software-engineering is 26 tasks. And SFT cold start teaches format and terminal habits, which transfer across categories; what it needs is coverage and variety, not ratio matching. We can adjust the data sampling based on the first iteration of the eval result.

Both datasets are browsable row by row, with provenance on every row, in the data viewer.

3. Environment creation

An RL environment is three design decisions: what the agent sees (observation), what it can do (actions), and what it gets rewarded for. For Terminal-Bench 2, the first two mostly follow Harbor's terminus-2 setup; the real design work is in the reward, and that is where the failure taxonomy from §2 comes in.

Mechanically, with TRL + Harbor each task episode runs in a sandbox and ends with a verifier reward. During training GRPO samples G rollouts per task and normalizes advantages within the group, so the reward only has to be right in a relative sense across rollouts of the same task, not calibrated on an absolute scale.

3.1 State / observation space

One Terminal-Bench 2 task = one multi-turn episode in a persistent Harbor sandbox. The policy uses the terminus-2 format to interact with the sandbox.

Episode: 1 task = 1 persistent sandbox. At step t, the policy sees the chat transcript's token sequence.

[ system prompt | instruction.md | (assistant_turn, env_feedback)_1 … (assistant_turn, env_feedback)_{t-1} ]
Table 3: Observation space

| Component | Content |
| --- | --- |
| System prompt | Terminal coding agent; acts only via bash |
| Task instruction | TB2 instruction.md |
| History | Per-step bash stdout/stderr/exit code |
| Encoding | Model chat template; loss_mask trains agent tokens only |
| Bounds | max_seq_len (e.g. 8K train / 32K eval) |

Observation is partial and stateful: same instruction, different history across steps. Harbor rollouts convert to prompt/completion pairs with a loss_mask for TRL GRPOTrainer.
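To make the loss_mask idea concrete, here is a minimal, library-free sketch: tokens produced by the agent get mask 1, everything else (system prompt, instruction, environment feedback) gets mask 0. The `(role, token_ids)` segment format and the function name are assumptions for illustration; this is not the Harbor or TRL conversion code.

```python
def build_example(segments):
    """Flatten (role, token_ids) segments into input_ids and a loss_mask.

    Only tokens from "assistant" segments are trained on (mask = 1).
    """
    input_ids, loss_mask = [], []
    for role, token_ids in segments:
        input_ids.extend(token_ids)
        flag = 1 if role == "assistant" else 0
        loss_mask.extend([flag] * len(token_ids))
    return input_ids, loss_mask


# Usage: one system prompt, one instruction, two agent turns with feedback between.
segments = [
    ("system", [1, 2]),
    ("user", [3]),
    ("assistant", [4, 5]),
    ("env", [6, 7, 8]),
    ("assistant", [9]),
]
ids, mask = build_example(segments)
```

Masking the environment feedback means the policy is never trained to predict stdout or stderr, only its own turns.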

3.2 Action space

At each step, the policy emits one terminus-2 message. The meaningful action is what happens in the sandbox.

Table 4: Action space

| Dimension | Definition |
| --- | --- |
| Action | One generation → one terminus-2 message |
| Semantic action | One bash tool call or mark_task_complete |
| Effect | Command runs in sandbox; state persists across steps |
| Horizon cap | MAX_TURNS (≈6 train / 20 eval) |

Horizon is capped at 6 in training since first failures cluster at steps 2–3 (Figure 6); eval keeps 20 for long-horizon recovery, a deliberate train/eval gap.

3.3 Reward functions

§2 tells me what fails; this is what I optimize. P0 is the core set; P1 are refinements I add if there is bandwidth.

The mapping is deliberately boring, and that is the point: every reward term traces back to a row in the taxonomy, and anything the taxonomy did not show me, I do not reward. The dominant failure (most trials miss the verifier) calls for dense partial credit so the GRPO groups are not all zeros. Early submits get an explicit penalty and loops feed an efficiency tiebreaker, but only on top of partial credit, not in place of it. The 38 silent trials get nothing extra, because I cannot see why they failed and I will not penalize what I cannot measure.

Table 5: Failure → reward mapping

| Failure | Reward | Priority |
| --- | --- | --- |
| Most trials fail verifier (~84%) | R_outcome: partial credit from CTRF | P0 |
| Early submit, tests partly pass | Same R_outcome | P0 |
| Mark done while tests fail | Penalty −lam_pc · 1[premature_complete] | P1 |
| Repeat loops / wasted bash | R_agency tiebreaker among successes | P1 |
| Failures at steps 2–3, not turn cap | No horizon bonus or per-step shaping | P0 |
| Silent trials (38) | R_outcome only | P0 |
| error_unaddressed | No separate term; same family as premature complete, penalized at the submit step | P1 |
| missing_env / json_parse | No reward term; targeted by the SFT cold start (§4.3) | P2 |
| context_pressure | No reward term; handled by env design, MAX_TURNS + truncated observations (§3.1) | P2 |

With these rows, all 7 detectors have an explicit disposition: imitation (SFT), reward pressure (RL), environment design, or deliberately nothing. This is the proposed plan; the P1/P2 terms are implemented but not all ablated yet.

R_outcome: passed_tests / total_tests ∈ [0, 1]. No LLM rubric.

R_agency: Among rollouts with R_outcome = 1.0, rank by turns + wasted_commands; agency_bonus ∈ [0, 1], so the weighted term λ · agency_bonus ∈ [0, λ]. It is only a tiebreaker, and it needs at least two successes in a group to fire at all. Early in training, when ~84% of trials fail the verifier, it almost never triggers. That is on purpose: efficiency only becomes worth rewarding once the model passes often enough to choose between a clean solution and a messy one.

R_integrity: Void trajectory on test/verifier tampering, hardcoded answers, or exfiltration (total ≤ 0).

total_i = R_outcome_i + λ · agency_bonus_i − integrity_penalty_i − lam_pc · 1[premature_complete_i]

P0: outcome + integrity (λ = 0, lam_pc = 0). P1: agency and premature-complete penalty.
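The composition above can be sketched as one function over a GRPO group. This is a sketch under stated assumptions, not the tested reward module: the `Rollout` fields are hypothetical, the agency bonus is a linear rescaling of turns + wasted_commands to [0, 1] rather than a strict rank, and an integrity violation is modeled as voiding the total to 0.0.

```python
from dataclasses import dataclass


@dataclass
class Rollout:
    passed_tests: int
    total_tests: int
    turns: int
    wasted_commands: int
    premature_complete: bool = False
    integrity_violation: bool = False


def r_outcome(r: Rollout) -> float:
    return r.passed_tests / r.total_tests if r.total_tests else 0.0


def agency_bonus(group):
    """Bonus in [0, 1] for full passes only; needs at least two to fire."""
    bonus = [0.0] * len(group)
    winners = [
        i for i, r in enumerate(group)
        if r_outcome(r) == 1.0 and not r.integrity_violation
    ]
    if len(winners) < 2:
        return bonus
    cost = {i: group[i].turns + group[i].wasted_commands for i in winners}
    lo, hi = min(cost.values()), max(cost.values())
    for i in winners:
        # Cheapest pass gets 1.0, most expensive gets 0.0; an exact tie gets 0.0.
        bonus[i] = 0.0 if hi == lo else (hi - cost[i]) / (hi - lo)
    return bonus


def total_rewards(group, lam=0.0, lam_pc=0.0):
    """P0 is the default (lam = lam_pc = 0); P1 turns both terms on."""
    totals = []
    for r, b in zip(group, agency_bonus(group)):
        if r.integrity_violation:
            totals.append(0.0)  # voided trajectory: total <= 0
            continue
        totals.append(r_outcome(r) + lam * b - lam_pc * float(r.premature_complete))
    return totals
```

With the defaults this reduces to partial credit plus the integrity fuse, which is the P0 configuration; passing `lam=0.1, lam_pc=0.1` gives the P1 variant used in the ablation.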

Table 6: Implemented in code (42 tests)

| Term | What it does | Module | Default mode |
| --- | --- | --- | --- |
| R_outcome | passed_tests / total_tests from the CTRF report; dense partial credit so GRPO groups don't go all-zero when most rollouts fail | r_outcome | P0 on |
| R_integrity | Voids the trajectory (total ≤ 0) on test tampering, hardcoded answers, or exfiltration; the anti-reward-hacking fuse | r_integrity | P0 on |
| R_agency | Efficiency tiebreaker among rollouts that fully pass, ranked by turns + wasted commands; needs ≥2 successes in a group to fire | r_agency + group tiebreak | P1 (lam=0.1) |
| premature_complete | Penalty for mark_task_complete while errors are still visible in the trace; the taxonomy's top failure | r_premature_complete | P1 (lam_pc=0.1) |
| Other Figure 5 detectors | Routed to SFT (§4.3) or env design (§3.1), not to reward | - | Not rewarded (design only) |

4. Model selection and training plan

With the environment fixed, the training plan is mostly downstream of it. Two choices still carry weight: which model I start from, and how I stop GRPO from burning rollouts on tasks it cannot learn from. The rest of this section is those two decisions and the knobs around them.

4.1 Base model

Decision: openai/gpt-oss-20b

Aligns with the Terminal-Bench 2 blog stack: Harbor + terminus-2 + gpt-oss-20b + verifiable rewards + GRPO. The baseline is verifier-dominated (~84% verifier_fail), not timeout-limited; dense partial credit produces non-degenerate GRPO groups instead of all-zero advantages.

gpt-oss-120b is the natural teacher for a later on-policy distillation (OPD) stretch: dense per-token signal when rollouts end at zero verifier reward or truncate early (Figure 4 silent bucket, Figure 6 early-step cluster).

4.2 RL algorithm

Decision: GRPO via TRL GRPOTrainer on Harbor rollouts.

For each task prompt, sample G independent terminus-2 rollouts, score with §3.3 rewards, normalize advantages within the group:

Â_i = (R_i − mean(R_{1..G})) / (std(R_{1..G}) + ε)
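A small worked example shows why dense partial credit matters for this formula. The sketch below uses the population standard deviation, which is an assumption: the post does not say which estimator the trainer uses.

```python
import statistics


def group_advantages(rewards, eps=1e-6):
    """Group-normalized advantages: (R_i - mean) / (std + eps)."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)  # population std; an assumption
    return [(r - mean) / (std + eps) for r in rewards]


# Bare pass/fail on a hard task: every rollout scores 0, so no learning signal.
all_fail = group_advantages([0.0, 0.0, 0.0, 0.0])

# Partial credit on the same task: rollouts separate even though none passes.
partial = group_advantages([0.0, 0.25, 0.5, 0.25])
```

In the first group every advantage is zero and the update is a no-op. In the second, the rollout that passed half the tests gets a positive advantage and the one that passed none gets a negative one.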
Table 7: Why GRPO fits this environment

| Property | Fit |
| --- | --- |
| Verifiable scalar reward | R_outcome from tests/test.sh; no learned RM |
| High variance across rollouts | Same task, different bash traces → spread in partial credit |
| No critic network | Simpler than PPO on long multi-turn trajectories |
| Reference recipe | GRPO [4] · AfterQuery blog + Eureka SFT→GRPO [3] |
Table 8: Training stack

| Stage | Role | Planned |
| --- | --- | --- |
| SFT | Cold-start terminus-2 + occasional verifier passes | Yes |
| OPD (optional) | gpt-oss-120b teacher, reverse-KL on student trajectories | Stretch |
| GRPO | On-policy improvement with composed rewards | Yes |

4.3 Data splits and curriculum

Table 9: Data splits

| Split | Source | Size | Use |
| --- | --- | --- | --- |
| SFT gold | nvidia/Nemotron-Terminal-Corpus | 500 trajectories | SFT cold-start |
| GRPO pool | nvidia/Nemotron-Terminal-Synthetic-Tasks | 5,984 tasks; ~10–50 in band/run | Rollout + update |
| Probe band (discovery sweep) | Subset of synthetic pool | 10–80% pass@k | Curriculum gate |
| Eval (held-out) | Official Terminal-Bench 2 | 89 × k=5 | pass@1_macro |
| Eval (fast) | 10-task slice | 10 × k=5 | Iteration gate |

The probe band (10–80%) is a wide discovery sweep to find learnable tasks; GRPO then trains on the 20–60% sub-band, where group advantage is richest. Tasks at 0% or 100% contribute little or no group advantage.
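The two-band gate can be sketched as a filter over measured pass rates. This is a minimal sketch: the function names are hypothetical, and treating both band edges as inclusive is an assumption.

```python
def in_band(pass_rate, low, high):
    return low <= pass_rate <= high  # inclusive edges: an assumption


def select_tasks(pass_rates, probe=(0.10, 0.80), train=(0.20, 0.60)):
    """pass_rates maps task id -> measured pass rate from the discovery sweep."""
    probe_set = {t for t, p in pass_rates.items() if in_band(p, *probe)}
    train_set = {t for t in probe_set if in_band(pass_rates[t], *train)}
    return probe_set, train_set


# Usage with made-up pass rates for six tasks.
rates = {"a": 0.0, "b": 0.125, "c": 0.25, "d": 0.5, "e": 0.75, "f": 1.0}
probe_set, train_set = select_tasks(rates)
```

Tasks "a" and "f" drop out at the probe stage because a group that always fails or always passes has no reward spread to learn from.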

4.4 Hyperparameters

Single-node LoRA on gpt-oss-20b (16GB-class MoE with mxfp4). Anchored to the AfterQuery blog band.

Table 10: LoRA SFT + GRPO configuration

| Parameter | SFT | GRPO rollouts | GRPO update |
| --- | --- | --- | --- |
| Base weights | openai/gpt-oss-20b | SFT checkpoint | Same |
| Adapter | LoRA r=32, α=64 | - | Trainable |
| Learning rate | 2e-5 | - | 1e-6 |
| Optimizer | AdamW β=(0.9, 0.95) | - | AdamW |
| Steps | 250 (500 demos) | 15 rounds | 13–15 |
| Group size G | - | 8 | 8 |
| Max turns (train / eval) | - | 6 | 20 eval |
| Max seq len | 8K | 8K | 8K train; 32K eval |
| Temperature | - | 0.7 | - |
| Eval temperature | - | - | 1.0 |

Eval samples k=5 per task at temperature 1.0; pass@1_macro is the per-task pass rate averaged over those 5 samples, so a non-zero eval temperature is intentional sampling for the macro estimate rather than a single greedy decode.
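As a worked example of the metric as defined above, here is a minimal sketch. The input shape (task id mapped to a list of k pass/fail booleans) and the function name are assumptions.

```python
def pass_at_1_macro(results):
    """Mean over tasks of each task's pass rate across its k samples."""
    per_task = [sum(samples) / len(samples) for samples in results.values() if samples]
    return sum(per_task) / len(per_task) if per_task else 0.0


# Usage: two tasks, k=5 samples each.
results = {
    "task_a": [True, False, False, False, False],  # 1/5 = 0.2
    "task_b": [True, True, True, True, True],      # 5/5 = 1.0
}
score = pass_at_1_macro(results)  # (0.2 + 1.0) / 2
```

Averaging per task first means every task counts equally, regardless of how many samples it has.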

LoRA on the mxfp4 MoE base: adapters sit on the mxfp4 MoE weights, target the attention and router projections, stay in bf16, and are validated for numerical stability on a 1-step smoke run before the full job.

Rollout gate

# Only commit a GRPO update when reward variance exists within a batch
std(R_{1..G}) > 0  for at least one task group
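The gate can be sketched as a check over per-task reward groups. It also reports the fraction of groups with non-zero spread, the quantity tracked as "fraction with reward_std > 0". The function names and the input shape are assumptions for illustration.

```python
import statistics


def group_has_signal(rewards):
    """True when the group's rewards are not all identical."""
    return len(rewards) > 1 and statistics.pstdev(rewards) > 0.0


def rollout_gate(groups):
    """groups maps task id -> list of G rewards.

    Returns (commit_update, fraction_of_groups_with_signal).
    """
    flags = [group_has_signal(r) for r in groups.values()]
    fraction = sum(flags) / len(flags) if flags else 0.0
    return any(flags), fraction


# Usage: one dead group and one group with spread.
commit, fraction = rollout_gate({
    "task_a": [0.0, 0.0, 0.0, 0.0],
    "task_b": [0.0, 0.5, 1.0, 0.5],
})
```

A batch where every group is flat yields `(False, 0.0)`, so the update is skipped instead of being spent on zero advantages.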

4.5 Metrics

Table 11: Metrics

| Phase | Metric | Purpose |
| --- | --- | --- |
| Eval | pass@1_macro on held-out 89 | Headline benchmark lift |
| Eval | Per-task pass rate, L1 outcome mix | Tie back to taxonomy |
| SFT | Train loss, eval pass@1 vs base | Cold-start signal |
| GRPO rollouts | Mean/std of R_outcome; fraction with reward_std > 0 | Gate policy updates |
| GRPO update | grad_norm, clipped fraction, KL | Stability |
| GRPO update | premature_complete, repeat_command_loop rates | Ablation vs taxonomy |
| Integrity | Voided trajectories (R_integrity) | Reward hacking guard |

End-to-end validation would run Harbor trajectories with CTRF rewards on Nemotron tasks, confirm ≥1 GRPO step with reward_std > 0, and check that a rep-10 eval shows monotonic base → SFT → GRPO lift. The sections above are the design; full training runs are future work.

Primary ablation: P0 (R_outcome + R_integrity) vs P1 (+ R_agency + premature_complete).

5. Closing & future steps

The bet at the top of this post was: the problems you never measured never get solved. This design is that bet applied end to end. Nothing gets penalized, trained on, or capped without a verifiable measurement behind it, and the measurement itself reruns on the SFT model before the RL reward is locked.

What I would fix next:

  1. Reward hacking. The taxonomy itself is an analysis tool: I decide how the reward gets shaped, so the environment guides the agent to learn properly. But once a rule is part of the reward, it runs automatically on every rollout, and rollouts that happen to dodge its text pattern (clear the error text before submitting, say "let me fix this" and do nothing) score higher and get reinforced. That is why the outcome reward dominates and the penalties stay small: hacking a rule never moves the test score.
  2. No human verification yet. No human labels, so I am not sure how often each detector fires wrongly, and a bad detector brings noise into the reward. Two things to do: hand-label some data, and build a validation approach that can check a detector's results quickly. Where possible, a detector should rerun the failing test instead of matching error text: execution is harder to fool.
  3. Generalization. The detectors are TB2-specific; the method is not. It needs a deterministic verifier and full traces, nothing else, and the same two-layer taxonomy can run on τ²-bench retail with a different detector pack.

References

  1. Dhruv Atreja. 2026. Pathfinder: Self-Improving Agent Trace Analysis via Adversarial Self-Play and Code Execution. ACM Conference on AI and Agentic Systems, 1336–1339. doi:10.1145/3786335.3813199
  2. Jacob Helwig. 2026. On-Policy Distillation (OPD). verl documentation. verl.readthedocs.io
  3. Li, Hangxuan, et al. 2026. Eureka: Intelligent Feature Engineering for Enterprise AI Cloud Resource Demand Prediction. DASFAA 2026.
  4. Shao, Zhihong, et al. 2024. DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models. arXiv preprint arXiv:2402.03300. arxiv.org/abs/2402.03300

Appendix

A. Failure taxonomy × TB2 detectability

L1: Classification. L1 answers "what happened to this trial": the tests passed, the tests ran and failed, the agent ran out of time, or the infra broke before the agent got a fair shot. It is a pure if/else over two fields of Harbor's result.json (verifier reward + exception type), no LLM.

L2: Diagnosis. L2 answers "why did the agent fail the tests": for each verifier-fail trial, a set of small rules (regex, counters, thresholds) scans the trace and names the failure mode, each hit pinned to a specific step with the evidence attached.

Optional mapping from Atreja et al. [1] to terminus-2 + Harbor ATIF.

Table A1: Paper taxonomy × TB2 detectability

| Paper family | Subtype | TB2 | How / count |
| --- | --- | --- | --- |
| Architecture | Missing reflection | yes | premature_complete, error_unaddressed · 434+ |
| Architecture | Infinite loop | yes | repeat_command_loop · 73 |
| Architecture | Non-converging planner | yes | high_wasted_commands · 123 |
| Context | Window overflow | yes | context_pressure · 51 |
| Parsing/config | Malformed JSON | yes | json_parse_warning · 25 |
| Parsing/config | Missing env | yes | missing_env · 131 |
| Prompt | Contradictory instructions | no | Prompt not in ATIF spans |
| Tool misuse | Malformed tool schema | no | Only bash + mark_task_complete |
| Streaming/API | Tool-call breaks | no | No streaming spans |

B. Framework & stack

The design above does not lean much on which trainer I use; this section is the reproducibility detail. I run GRPO [4] (group-relative advantages on verifiable rewards) in TRL, with Harbor as the environment layer.

Why GRPO. This is the same post-training recipe I have already shipped in production. In Eureka [3], we frame enterprise feature engineering as agentic code generation: SFT cold-start on domain plans, then GRPO on a composed reward. Terminal-Bench 2 is a different domain, but the loop is identical: sample rollouts, score with verifiers, normalize advantages within a group, update the policy.

Considerations

  • SkyRL: Strong agentic coding integration; I would use it for production post-training on coding agents. When building Sofa Genius (sofagenius.ai), SkyRL train was powerful but operationally heavy with long debugging loops. When the focus is environment and reward design, I want to mitigate infra risk.
  • veRL: Great for large-scale multi-node training and first-class on-policy distillation (OPD) [2]: the student samples rollouts from its own policy, and the teacher provides next-token log-probabilities on those student-visited states. Compared with RLVR, OPD provides dense, token-level supervision. For a single-node setup I doubt we need multi-node training, so veRL's setup cost isn't worth it.
  • SLIME: Relatively new, backed by Z.AI and the GLM family, hackable for custom pipelines. Environment glue is not first-class.
  • TRL (chosen): Hugging Face ecosystem; mature SFT + GRPO; decouples cleanly from Harbor as the environment layer. Keeps the design (observation, action, reward, and training plan) legible and reproducible. I chose Terminal-Bench 2 with Harbor using TRL.
Table A2: Stack

| Layer | Choice |
| --- | --- |
| Environment | Harbor (Modal/Docker sandboxes, terminus-2 agent, test.sh verifiers) |
| SFT | TRL SFTTrainer |
| RL | TRL GRPOTrainer |
| Reward | Custom module on verifier output (composed shaping terms) |