Designing an RLVR Environment from a Failure Taxonomy (Terminal-Bench 2)
Reward designs that are not built from the data and its failure modes are not well grounded. So before writing any reward terms, I ran a verifiable failure taxonomy on gpt-oss-20b, and only then designed
the RLVR environment.
What follows is the whole environment that came out of that: the observation and action spaces, the failure
taxonomy, the reward functions it motivated, and the training plan around them, on Terminal-Bench 2 with
gpt-oss-20b. The specifics are TB2-shaped, but the method is not: decompose the environment,
measure how it actually fails, reward against that, and train where the signal is. That is the part I would
carry to any RL environment.
1. Benchmark selection
I chose Terminal-Bench 2. Every task ships a deterministic tests/test.sh verifier, so the reward
is a real pass/fail signal with no learned judge, and Harbor records full terminus-2 ATIF traces, which is what
makes the failure taxonomy in §2 possible. Harbor also gives a clean skeleton for an RL environment (sandboxed
tasks, an agent loop, and verifiers), so I am wiring rewards onto solid scaffolding instead of building the
harness from scratch.
2. Failure taxonomy
This is the part that grounds everything downstream, so it comes first. I ran a verifiable failure taxonomy on
a Terminal-Bench 2 baseline: 444 trials in Harbor (the sandboxed TB2 runner, driving a terminus-2 agent loop),
each scored by rule-based ATIF detectors rather than an LLM judge. Framing inspired by Atreja et al. [1].
Summary: 372 verifier_fail at Harbor L1 · 334 trace-visible diagnoses · 38 silent.
The baseline is dominated by verifier_fail (~84%), not timeout. The learning problem is how the agent fails the verifier.
| Failure mode | Detector | Computed signal |
|---|---|---|
| Misreflection | error_unaddressed | Prior step had errors; next step did not address them |
| Early submit | premature_complete | mark_task_complete while errors still present |
| Infinite loop | repeat_command_loop | Same bash keystrokes repeated ≥3× |
| Non-converging | high_wasted_commands | ≥50% agent steps carry error observations |
| Environment dependency | missing_env | command not found / ModuleNotFoundError |
| Context pressure | context_pressure | prompt_tokens ≥ 25K/step or total ≥100K |
| Malformed JSON | json_parse_warning | JSON parse errors in observation or message |
The missing-reflection family (premature_complete + error_unaddressed) fires 434+ times across multi-label trials.
Two findings drive the reward design. The agent mostly fails by submitting too early: the missing-reflection family (premature complete plus unaddressed errors) dominates Figure 3. And those failures land at steps 2–3, not at the turn cap (Figure 4), so the bottleneck is wasted early actions, not horizon length. §3.3 maps each of these to a concrete reward.
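To make the detector table concrete, here is a minimal sketch of four of the rule-based detectors. It assumes a simplified, hypothetical `Step` record rather than the real ATIF schema, treats a non-zero exit code as an "error observation", and reads "repeated ≥3×" as three consecutive identical commands. All of those are my assumptions for illustration, not the detectors as run.

```python
from dataclasses import dataclass


@dataclass
class Step:
    """Hypothetical, simplified view of one agent step (not the ATIF schema)."""
    command: str = ""            # bash keystrokes sent this step
    output: str = ""             # combined stdout/stderr observed afterwards
    exit_code: int = 0           # non-zero is treated as an error observation
    mark_complete: bool = False  # True if the step called mark_task_complete


def repeat_command_loop(steps, min_repeats=3):
    """Same keystrokes issued min_repeats times in a row (assumed: consecutive)."""
    run, prev = 0, None
    for s in steps:
        if not s.command:
            continue
        run = run + 1 if s.command == prev else 1
        prev = s.command
        if run >= min_repeats:
            return True
    return False


def premature_complete(steps):
    """mark_task_complete issued while the latest command still errored."""
    last_errored = False
    for s in steps:
        if s.mark_complete and last_errored:
            return True
        if s.command:
            last_errored = s.exit_code != 0
    return False


def high_wasted_commands(steps, threshold=0.5):
    """At least `threshold` of command steps carry an error observation."""
    cmd_steps = [s for s in steps if s.command]
    if not cmd_steps:
        return False
    errors = sum(s.exit_code != 0 for s in cmd_steps)
    return errors / len(cmd_steps) >= threshold


def missing_env(steps):
    """Output mentions a missing binary or a missing Python module."""
    needles = ("command not found", "ModuleNotFoundError")
    return any(n in s.output for s in steps for n in needles)


DETECTORS = {
    "repeat_command_loop": repeat_command_loop,
    "premature_complete": premature_complete,
    "high_wasted_commands": high_wasted_commands,
    "missing_env": missing_env,
}


def diagnose(steps):
    """Return every detector label that fires; trials can be multi-label."""
    return sorted(name for name, fn in DETECTORS.items() if fn(steps))
```

Because `diagnose` returns every label that fires, one trial can count toward several rows of the table, which is why the family counts can exceed the number of failed trials.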
3. Environment creation
I build every RL environment in the same order: what the agent sees, what it can do, and what it gets rewarded for. Get those three right and the training code mostly writes itself; get the reward wrong and no amount of training fixes it. The next three subsections are that decomposition for Terminal-Bench 2.
Mechanically, with TRL + Harbor each task episode runs in a sandbox and ends with a verifier reward. During training GRPO samples G rollouts per task and normalizes advantages within the group, so the reward only has to be right in a relative sense across rollouts of the same task, not calibrated on an absolute scale.
3.1 State / observation space
One Terminal-Bench 2 task = one multi-turn episode in a persistent Harbor sandbox. The policy uses the terminus-2 format to interact with the sandbox.
Episode: 1 task = 1 persistent sandbox. At step t, the policy sees the chat transcript's token sequence.
[ system prompt | instruction.md | (assistant_turn, env_feedback)_1 … (assistant_turn, env_feedback)_{t-1} ]
| Component | Content |
|---|---|
| System prompt | Terminal coding agent; acts only via bash |
| Task instruction | TB2 instruction.md |
| History | Per-step bash stdout/stderr/exit code |
| Encoding | Model chat template; loss_mask trains agent tokens only |
| Bounds | max_seq_len (e.g. 8K train / 32K eval) |
Observation is partial and stateful: same instruction, different history across steps. Harbor rollouts are converted
to prompt/completion pairs with a loss_mask for TRL GRPOTrainer.
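A rough sketch of the loss-mask idea, assuming a hypothetical `build_example` helper and whitespace-split words standing in for the real chat-template tokens. The actual Harbor-to-TRL conversion is not shown in this post, so treat the names and the right-truncation rule as illustrative only.

```python
def build_example(segments, max_seq_len=8192):
    """Flatten (role, text) segments into tokens plus a per-token loss mask.

    Assumptions: whitespace-split words stand in for chat-template tokens,
    and only segments with role "assistant" are trained on (mask = 1).
    """
    tokens, loss_mask = [], []
    for role, text in segments:
        piece = text.split()
        tokens.extend(piece)
        loss_mask.extend([1 if role == "assistant" else 0] * len(piece))
    # Respect the sequence bound by truncating on the right (assumed policy).
    return tokens[:max_seq_len], loss_mask[:max_seq_len]


episode = [
    ("system", "You act via bash"),
    ("user", "Fix the build"),
    ("assistant", "make all"),
    ("env", "exit code 2"),
    ("assistant", "cat Makefile"),
]
tokens, loss_mask = build_example(episode)
```

Environment feedback stays in the context so the policy can condition on it, but it contributes nothing to the loss; only the agent's own tokens are trained.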
3.2 Action space
At each step, the policy emits one terminus-2 message. The meaningful action is what happens in the sandbox.
| Dimension | Definition |
|---|---|
| Action | One generation → one terminus-2 message |
| Semantic action | One bash tool call or mark_task_complete |
| Effect | Command runs in sandbox; state persists across steps |
| Horizon cap | MAX_TURNS (≈6 train / 20 eval) |
Horizon is capped at 6 in training since first failures cluster at steps 2–3 (Figure 4); eval keeps 20 for long-horizon recovery, a deliberate train/eval gap.
3.3 Reward functions
§2 tells me what fails; this is what I optimize. P0 is the core set; P1 are refinements I add if there is bandwidth.
The mapping is deliberately boring, and that is the point: every reward term traces back to a row in the taxonomy, and anything the taxonomy did not show me, I do not reward. The dominant failure (most trials miss the verifier) calls for dense partial credit so the GRPO groups are not all zeros. The early-submit and loop failures get explicit penalties, but only on top of partial credit, not in place of it. The 38 silent trials get nothing extra, because I cannot see why they failed and I will not penalize what I cannot measure.
| Failure | Reward | Priority |
|---|---|---|
| Most trials fail verifier (~84%) | R_outcome: partial credit from CTRF | P0 |
| Early submit, tests partly pass | Same R_outcome | P0 |
| Mark done while tests fail | Penalty −lam_pc · 1[premature_complete] | P1 |
| Repeat loops / wasted bash | R_agency tiebreaker among successes | P1 |
| Failures at steps 2–3, not turn cap | No horizon bonus or per-step shaping | P0 |
| Silent trials (38) | R_outcome only | P0 |
| missing_env / context_pressure | None (ablate) | P2 |
R_outcome: passed_tests / total_tests ∈ [0, 1]. No LLM rubric.
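As a sketch of R_outcome, the function below reads a CTRF-style JSON report and returns the pass fraction. The `results.summary.tests` / `results.summary.passed` layout follows my understanding of the CTRF schema. How Harbor exposes the report, and the choice to score an unreadable report as 0.0, are assumptions.

```python
import json


def r_outcome(ctrf_json: str) -> float:
    """passed_tests / total_tests from a CTRF-style report; 0.0 if unusable."""
    try:
        summary = json.loads(ctrf_json)["results"]["summary"]
        total = int(summary["tests"])
        passed = int(summary["passed"])
    except (KeyError, TypeError, ValueError):
        # Missing fields or malformed JSON: assume no credit rather than crash.
        return 0.0
    if total <= 0:
        return 0.0
    # Clamp defensively so the reward always lies in [0, 1].
    return min(max(passed / total, 0.0), 1.0)
```

For example, a report with 4 tests and 3 passes yields 0.75, which is the partial credit that keeps a GRPO group from collapsing to all zeros.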
R_agency: Among rollouts with R_outcome = 1.0, rank by turns + wasted_commands; agency_bonus ∈ [0, 1], so after scaling by λ it contributes at most λ. It is only a tiebreaker, and it needs at least two successes in a group to fire at all. Early in training, when ~84% of trials fail, it almost never triggers. That is on purpose: efficiency only becomes worth rewarding once the model passes often enough to choose between a clean solution and a messy one.
R_integrity: Void trajectory on test/verifier tampering, hardcoded answers, or exfiltration (total ≤ 0).
total_i = R_outcome_i + λ · agency_bonus_i − integrity_penalty_i − lam_pc · 1[premature_complete]
P0: outcome + integrity (λ = 0, lam_pc = 0). P1: agency and premature-complete penalty.
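The composition can be sketched per GRPO group as follows. The `Rollout` record, the linear cost-to-bonus mapping, and the voiding rule (a flagged trajectory is capped at 0 and excluded from the tiebreak) are assumptions I am making for illustration. The post only fixes the formula and the constraint that a voided trajectory ends with total ≤ 0.

```python
from dataclasses import dataclass


@dataclass
class Rollout:
    """Hypothetical per-rollout summary used only for this sketch."""
    r_outcome: float                  # passed_tests / total_tests in [0, 1]
    turns: int = 0
    wasted_commands: int = 0
    premature_complete: bool = False
    tampered: bool = False            # True if R_integrity flags the trajectory


def agency_bonuses(group):
    """Unscaled bonus in [0, 1]; fires only with at least two clean successes."""
    winners = [i for i, r in enumerate(group)
               if r.r_outcome == 1.0 and not r.tampered]
    bonuses = [0.0] * len(group)
    if len(winners) < 2:
        return bonuses
    cost = {i: group[i].turns + group[i].wasted_commands for i in winners}
    lo, hi = min(cost.values()), max(cost.values())
    if hi == lo:
        return bonuses  # all successes equally efficient: nothing to break
    for i in winners:
        # Cheapest success gets 1.0, most expensive 0.0 (assumed linear scale).
        bonuses[i] = (hi - cost[i]) / (hi - lo)
    return bonuses


def total_rewards(group, lam=0.0, lam_pc=0.0):
    """P0 defaults (lam = lam_pc = 0) reduce to outcome plus integrity."""
    totals = []
    for r, bonus in zip(group, agency_bonuses(group)):
        total = r.r_outcome + lam * bonus - lam_pc * float(r.premature_complete)
        if r.tampered:
            # Assumed voiding rule: a flagged trajectory never scores above 0.
            total = min(total, 0.0)
        totals.append(total)
    return totals
```

Calling `total_rewards(group)` gives the P0 configuration, and `total_rewards(group, lam=0.1, lam_pc=0.1)` gives P1, so the primary ablation is a change of two arguments rather than a different code path.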
| Term | Module | Default mode |
|---|---|---|
| R_outcome | r_outcome | P0 on |
| R_integrity | r_integrity | P0 on |
| R_agency | r_agency + group tiebreak | P1 (lam=0.1) |
| premature_complete | r_premature_complete | P1 (lam_pc=0.1) |
| Other Figure 3 detectors | - | Not rewarded (design only) |
4. Model selection and training plan
With the environment fixed, the training plan is mostly downstream of it. Two choices still carry weight: which model I start from, and how I stop GRPO from burning rollouts on tasks it cannot learn from. The rest of this section is those two decisions and the knobs around them.
4.1 Base model
Decision: openai/gpt-oss-20b
Aligns with the Terminal-Bench 2 blog stack: Harbor + terminus-2 + gpt-oss-20b + verifiable rewards + GRPO.
The baseline is verifier-dominated (~84% verifier_fail), not timeout-limited; dense partial credit
produces non-degenerate GRPO groups instead of all-zero advantages.
gpt-oss-120b is the natural teacher for a later on-policy distillation (OPD)
stretch: dense per-token signal when rollouts end at zero verifier reward or truncate early (Figure 2 silent
bucket, Figure 4 early-step cluster).
4.2 RL algorithm
Decision: GRPO via TRL GRPOTrainer on Harbor rollouts.
For each task prompt, sample G independent terminus-2 rollouts, score with §3.3 rewards, normalize advantages within the group:
Â_i = (R_i − mean(R_{1..G})) / (std(R_{1..G}) + ε)
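A small sketch of the normalization above in plain Python. I use the population standard deviation here; whether the trainer uses the population or sample estimator is an implementation detail I am not asserting, and it does not change the two properties that matter for the design: an all-equal group yields zero advantage, and shifting every reward in a group by a constant leaves the advantages unchanged.

```python
from statistics import fmean, pstdev


def group_advantages(rewards, eps=1e-6):
    """Group-relative advantages: (R_i - mean) / (std + eps) within one task."""
    mu = fmean(rewards)
    sigma = pstdev(rewards)  # assumption: population std, for illustration
    return [(r - mu) / (sigma + eps) for r in rewards]


all_fail = group_advantages([0.0] * 8)          # no signal: every entry is 0.0
mixed = group_advantages([0.0, 0.0, 0.5, 1.0])  # partial credit creates spread
shifted = group_advantages([1.0, 1.0, 1.5, 2.0])
```

`mixed` and `shifted` agree up to floating-point error, which is the sense in which the reward only needs to be right relative to other rollouts of the same task.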
| Property | Fit |
|---|---|
| Verifiable scalar reward | R_outcome from tests/test.sh; no learned RM |
| High variance across rollouts | Same task, different bash traces → spread in partial credit |
| No critic network | Simpler than PPO on long multi-turn trajectories |
| Reference recipe | GRPO [4] · AfterQuery blog + Eureka SFT→GRPO [3] |
| Stage | Role | Planned |
|---|---|---|
| SFT | Cold-start terminus-2 + occasional verifier passes | Yes |
| OPD (optional) | gpt-oss-120b teacher, reverse-KL on student trajectories | Stretch |
| GRPO | On-policy improvement with composed rewards | Yes |
4.3 Data splits and curriculum
| Split | Source | Size | Use |
|---|---|---|---|
| SFT gold | nvidia/Nemotron-Terminal-Corpus | 500 trajectories | SFT cold-start |
| GRPO pool | nvidia/Nemotron-Terminal-Synthetic-Tasks | 5,984 tasks; ~10–50 in band/run | Rollout + update |
| Probe band (discovery sweep) | Subset of synthetic pool | 10–80% pass@k | Curriculum gate |
| Eval (held-out) | Official Terminal-Bench 2 | 89 × k=5 | pass@1_macro |
| Eval (fast) | 10-task slice | 10 × k=5 | Iteration gate |
The probe band (10–80%) is a wide discovery sweep to find learnable tasks; GRPO then trains on the 20–60% sub-band, where group advantage is richest. Tasks at 0% or 100% contribute little or no group advantage.
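A sketch of the curriculum gate, assuming a hypothetical `pass_counts` mapping from task id to the number of passes in k probe rollouts, with inclusive band edges. The second helper covers the simplified case of a binary pass/fail reward with independent rollouts (an assumption; partial credit only widens the spread) and shows why tasks near 0% or 100% rarely produce a usable group.

```python
def select_band(pass_counts, k, low=0.20, high=0.60):
    """Task ids whose empirical pass rate over k probes lies in [low, high]."""
    return sorted(t for t, c in pass_counts.items() if low <= c / k <= high)


def p_mixed_group(p, group_size=8):
    """Chance that a group holds both a pass and a fail (binary reward only)."""
    return 1.0 - p ** group_size - (1.0 - p) ** group_size


probe = {"a": 0, "b": 1, "c": 2, "d": 4, "e": 5}   # passes out of k = 5
discovery = select_band(probe, k=5, low=0.10, high=0.80)
training = select_band(probe, k=5)                  # default 20-60% sub-band
```

Under the binary assumption, `p_mixed_group(0.5)` is about 0.99 while `p_mixed_group(0.0)` is exactly 0, so rollouts spent on never-solved tasks produce no advantage at all.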
4.4 Hyperparameters
Single-node LoRA on gpt-oss-20b (16GB-class MoE with mxfp4). Anchored to the AfterQuery blog band.
| Parameter | SFT | GRPO rollouts | GRPO update |
|---|---|---|---|
| Base weights | openai/gpt-oss-20b | SFT checkpoint | Same |
| Adapter | LoRA r=32, α=64 | - | Trainable |
| Learning rate | 2e-5 | - | 1e-6 |
| Optimizer | AdamW β=(0.9, 0.95) | - | AdamW |
| Steps | 250 (500 demos) | 15 rounds | 13–15 |
| Group size G | - | 8 | 8 |
| Max turns (train / eval) | - | 6 | 20 eval |
| Max seq len | 8K | 8K | 8K train; 32K eval |
| Temperature | - | 0.7 | - |
| Eval temperature | - | - | 1.0 |
Eval samples k=5 per task at temperature 1.0; pass@1_macro is the per-task pass rate averaged
over those 5 samples, so a non-zero eval temperature is intentional sampling for the macro estimate rather
than a single greedy decode.
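The macro estimate described above can be sketched as follows, assuming a hypothetical `results` mapping from task id to a list of per-sample pass booleans.

```python
def pass_at_1_macro(results):
    """Mean over tasks of each task's pass rate across its sampled attempts."""
    per_task = [sum(attempts) / len(attempts)
                for attempts in results.values() if attempts]
    return sum(per_task) / len(per_task) if per_task else 0.0


example = {
    "t1": [True, True, True, True, True],
    "t2": [False, False, False, False, False],
    "t3": [True, False, False, False, False],
}
score = pass_at_1_macro(example)  # per-task rates 1.0, 0.0, 0.2
```

Averaging per task first means every task carries equal weight regardless of how many samples it received.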
LoRA on the mxfp4 MoE base: the adapters sit on top of the mxfp4 MoE weights, target the attention and router projections, and stay in bf16. I validate them for numerical stability with a 1-step smoke run before the full job.
Rollout gate
# Only commit a GRPO update when reward variance exists within a batch
std(R_{1..G}) > 0 for at least one task group
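The gate can be sketched in a few lines, assuming a hypothetical `batch` mapping from task id to that task's G rewards. The small tolerance and the helper names are mine.

```python
from statistics import pstdev


def groups_with_signal(batch, tol=1e-8):
    """Task ids whose reward group has non-zero spread."""
    return [task for task, rewards in batch.items()
            if len(rewards) > 1 and pstdev(rewards) > tol]


def should_update(batch):
    """Commit a GRPO update only if at least one group carries signal."""
    return len(groups_with_signal(batch)) > 0


batch = {
    "task_a": [0.0] * 8,                                   # all fail
    "task_b": [0.0, 0.0, 0.5, 1.0, 0.0, 0.0, 0.0, 0.0],    # mixed
}
signal_fraction = len(groups_with_signal(batch)) / len(batch)
```

The same helper gives the "fraction with reward_std > 0" rollout metric, so the gate and the monitoring number come from one computation.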
4.5 Metrics
| Phase | Metric | Purpose |
|---|---|---|
| Eval | pass@1_macro on held-out 89 | Headline benchmark lift |
| Eval | Per-task pass rate, L1 outcome mix | Tie back to taxonomy |
| SFT | Train loss, eval pass@1 vs base | Cold-start signal |
| GRPO rollouts | Mean/std of R_outcome; fraction with reward_std > 0 | Gate policy updates |
| GRPO update | grad_norm, clipped fraction, KL | Stability |
| GRPO update | premature_complete, repeat_command_loop rates | Ablation vs taxonomy |
| Integrity | Voided trajectories (R_integrity) | Reward hacking guard |
End-to-end validation would run Harbor trajectories with CTRF rewards on Nemotron tasks, confirm
≥1 GRPO step with reward_std > 0, and check that a rep-10 eval shows monotonic
base → SFT → GRPO lift. The sections above are the design; full training runs are future work.
Primary ablation: P0 (R_outcome + R_integrity) vs P1 (+ R_agency + premature_complete).
References
- [1] Atreja, Dhruv. 2026. Pathfinder: Self-Improving Agent Trace Analysis via Adversarial Self-Play and Code Execution. ACM Conference on AI and Agentic Systems, 1336–1339. doi:10.1145/3786335.3813199
- [2] Helwig, Jacob. 2026. On-Policy Distillation (OPD). verl documentation. verl.readthedocs.io
- [3] Li, Hangxuan, et al. 2026. Eureka: Intelligent Feature Engineering for Enterprise AI Cloud Resource Demand Prediction. DASFAA 2026.
- [4] Shao, Zhihong, et al. 2024. DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models. arXiv preprint arXiv:2402.03300. arxiv.org/abs/2402.03300
Appendix
A. Failure taxonomy × TB2 detectability
Optional mapping from Atreja et al. [1] to terminus-2 + Harbor ATIF.
| Paper family | Subtype | TB2 | How / count |
|---|---|---|---|
| Architecture | Missing reflection | yes | premature_complete, error_unaddressed · 434+ |
| Architecture | Infinite loop | yes | repeat_command_loop · 73 |
| Architecture | Non-converging planner | yes | high_wasted_commands · 123 |
| Context | Window overflow | yes | context_pressure · 51 |
| Parsing/config | Malformed JSON | yes | json_parse_warning · 25 |
| Parsing/config | Missing env | yes | missing_env · 131 |
| Prompt | Contradictory instructions | no | Prompt not in ATIF spans |
| Tool misuse | Malformed tool schema | no | Only bash + mark_task_complete |
| Streaming/API | Tool-call breaks | no | No streaming spans |
B. Framework & stack
The design above does not lean much on which trainer I use; this section is the reproducibility detail. I run GRPO [4] (group-relative advantages on verifiable rewards) in TRL, with Harbor as the environment layer.
Why GRPO. This is the same post-training recipe I have already shipped in production. In Eureka [3], we frame enterprise feature engineering as agentic code generation: SFT cold-start on domain plans, then GRPO on a composed reward. Terminal-Bench 2 is the same shape in a different domain, and the loop is identical: sample rollouts, score with verifiers, normalize advantages within a group, update the policy.
Considerations
| Layer | Choice |
|---|---|
| Environment | Harbor (Modal/Docker sandboxes, terminus-2 agent, test.sh verifiers) |
| SFT | TRL SFTTrainer |
| RL | TRL GRPOTrainer |
| Reward | Custom module on verifier output (composed shaping terms) |