Designing an RLVR Environment from a Failure Taxonomy (Terminal-Bench 2)

Lily Zhang · Jun 10, 2026

Reward terms designed without looking at the data and its failure modes are not well grounded. So I ran a verifiable failure taxonomy on gpt-oss-20b before writing any reward terms, and only then designed the RLVR environment.

What follows is the whole environment that came out of that: the observation and action spaces, the failure taxonomy, the reward functions it motivated, and the training plan around them, on Terminal-Bench 2 with gpt-oss-20b. The specifics are TB2-shaped, but the method is not: decompose the environment, measure how it actually fails, reward against that, and train where the signal is. That is the part I would carry to any RL environment.

1. Benchmark selection

I chose Terminal-Bench 2. Every task ships a deterministic tests/test.sh verifier, so the reward is a real pass/fail signal with no learned judge, and Harbor records full terminus-2 ATIF traces, which is what makes the failure taxonomy in §2 possible. Harbor also gives a clean skeleton for an RL environment (sandboxed tasks, an agent loop, and verifiers), so I am wiring rewards onto solid scaffolding instead of building the harness from scratch.

2. Failure taxonomy

This is the part that grounds everything downstream, so it comes first. I ran a verifiable failure taxonomy on a Terminal-Bench 2 baseline: 444 trials in Harbor (the sandboxed TB2 runner, driving a terminus-2 agent loop), each scored by rule-based ATIF detectors rather than an LLM judge. The framing is inspired by Atreja et al. [1]. Summary: 372 of the 444 trials end in verifier_fail at Harbor L1; 334 of those carry a trace-visible diagnosis and the remaining 38 are silent.

Figure 1: Harbor L1 outcomes. Mostly `verifier_fail` (~84%), not timeout. The learning problem is *how* the agent fails the verifier.

Figure 2: Analysis funnel. 38 silent trials: the verifier failed but no ATIF signal supports a mode penalty. Design rule: penalize only what is computed from trace evidence.

Failure mode	Detector	Computed signal
Misreflection	`error_unaddressed`	Prior step had errors; next step did not address them
Early submit	`premature_complete`	`mark_task_complete` while errors still present
Infinite loop	`repeat_command_loop`	Same bash keystrokes repeated ≥3×
Non-converging	`high_wasted_commands`	≥50% agent steps carry error observations
Environment dependency	`missing_env`	command not found / ModuleNotFoundError
Context pressure	`context_pressure`	`prompt_tokens` ≥ 25K/step or total ≥100K
Malformed JSON	`json_parse_warning`	JSON parse errors in observation or message
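
To make the "computed signal" column concrete, here is a minimal sketch of three of these detectors. It assumes each trace step has already been reduced to a small record; the `Step` class, its field names, and `detect` are hypothetical and are not Harbor's ATIF schema. I also assume that "repeated" means back-to-back repeats and that "errors still present" means the step just before `mark_task_complete` carried an error. The thresholds are the ones in the table.

```python
from dataclasses import dataclass
from typing import List, Set


@dataclass
class Step:
    """Hypothetical per-step record distilled from a trace (not the ATIF schema)."""
    command: str                  # bash keystrokes ("" for mark_task_complete)
    has_error: bool               # observation for this step carried an error
    marks_complete: bool = False  # step was a mark_task_complete call


def detect(steps: List[Step], loop_threshold: int = 3,
           wasted_ratio: float = 0.5) -> Set[str]:
    """Return the detector labels that fire on one trial."""
    labels: Set[str] = set()

    # repeat_command_loop: same keystrokes sent loop_threshold times in a row.
    run, prev = 0, None
    for cmd in (s.command for s in steps if not s.marks_complete):
        run = run + 1 if cmd == prev else 1
        prev = cmd
        if run >= loop_threshold:
            labels.add("repeat_command_loop")

    # high_wasted_commands: at least wasted_ratio of steps carry an error.
    if steps and sum(s.has_error for s in steps) / len(steps) >= wasted_ratio:
        labels.add("high_wasted_commands")

    # premature_complete: completion is marked right after an erroring step.
    for before, after in zip(steps, steps[1:]):
        if after.marks_complete and before.has_error:
            labels.add("premature_complete")

    return labels


trial = [
    Step("pytest -q", has_error=True),
    Step("pytest -q", has_error=True),
    Step("pytest -q", has_error=True),
    Step("", has_error=False, marks_complete=True),
]
print(sorted(detect(trial)))
# ['high_wasted_commands', 'premature_complete', 'repeat_command_loop']
```

One trial can carry several labels at once, which is why the counts in Figure 3 are multi-label.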

Figure 3: Executable failure modes. Primary chart for reward design. Headline: the missing-reflection family (`premature_complete` + `error_unaddressed`) fires 434+ times across multi-label trials.

Figure 4: Failure step distribution. First failures cluster at steps 2–3, not the turn cap. Bottleneck is wasted actions early, not horizon length.

Two findings drive the reward design. The agent mostly fails by submitting too early: the missing-reflection family (premature complete plus unaddressed errors) dominates Figure 3. And those failures land at steps 2–3, not at the turn cap (Figure 4), so the bottleneck is wasted early actions, not horizon length. §3.3 maps each of these to a concrete reward.

3. Environment creation

I build every RL environment in the same order: what the agent sees, what it can do, and what it gets rewarded for. Get those three right and the training code mostly writes itself; get the reward wrong and no amount of training fixes it. The next three subsections are that decomposition for Terminal-Bench 2.

Mechanically, with TRL + Harbor each task episode runs in a sandbox and ends with a verifier reward. During training GRPO samples G rollouts per task and normalizes advantages within the group, so the reward only has to be right in a relative sense across rollouts of the same task, not calibrated on an absolute scale.

3.1 State / observation space

One Terminal-Bench 2 task = one multi-turn episode in a persistent Harbor sandbox. The policy uses the terminus-2 format to interact with the sandbox.

Episode: 1 task = 1 persistent sandbox. At step t, the policy sees the chat transcript's token sequence.

[ system prompt | instruction.md | (assistant_turn, env_feedback)_1 … (assistant_turn, env_feedback)_{t-1} ]

Component	Content
System prompt	Terminal coding agent; acts only via `bash`
Task instruction	TB2 `instruction.md`
History	Per-step bash stdout/stderr/exit code
Encoding	Model chat template; `loss_mask` trains agent tokens only
Bounds	`max_seq_len` (e.g. 8K train / 32K eval)

Observation is partial and stateful: same instruction, different history across steps. Harbor rollouts convert to prompt/completion pairs with a loss_mask for TRL GRPOTrainer.
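
A minimal sketch of what "`loss_mask` trains agent tokens only" means, assuming the transcript is already tokenized into (role, token ids) segments. The `build_loss_mask` helper, the role names, and the truncation policy are illustrative assumptions, not the actual Harbor or TRL conversion code.

```python
from typing import List, Tuple

Segment = Tuple[str, List[int]]  # (role, token ids); role names are illustrative


def build_loss_mask(segments: List[Segment],
                    max_seq_len: int) -> Tuple[List[int], List[int]]:
    """Flatten segments into token ids plus a 0/1 mask that is 1 on assistant tokens."""
    input_ids: List[int] = []
    loss_mask: List[int] = []
    for role, tokens in segments:
        input_ids.extend(tokens)
        loss_mask.extend([1 if role == "assistant" else 0] * len(tokens))
    # Simple right-truncation to the bound; a real pipeline may drop the episode instead.
    return input_ids[:max_seq_len], loss_mask[:max_seq_len]


episode = [
    ("system", [1, 2, 3]),
    ("user", [4, 5]),          # instruction.md
    ("assistant", [6, 7, 8]),  # first agent turn
    ("tool", [9, 10]),         # env feedback: stdout/stderr/exit code
    ("assistant", [11, 12]),   # second agent turn
]
ids, mask = build_loss_mask(episode, max_seq_len=11)
print(mask)  # [0, 0, 0, 0, 0, 1, 1, 1, 0, 0, 1]
```

Environment feedback stays in the context so the policy can condition on it, but it contributes nothing to the loss.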

3.2 Action space

At each step, the policy emits one terminus-2 message. The meaningful action is what happens in the sandbox.

Dimension	Definition
Action	One generation → one terminus-2 message
Semantic action	One `bash` tool call or `mark_task_complete`
Effect	Command runs in sandbox; state persists across steps
Horizon cap	`MAX_TURNS` (≈6 train / 20 eval)

Horizon is capped at 6 in training since first failures cluster at steps 2–3 (Figure 4); eval keeps 20 for long-horizon recovery, a deliberate train/eval gap.

3.3 Reward functions

§2 tells me what fails; this is what I optimize. P0 is the core set; P1 are refinements I add if there is bandwidth.

The mapping is deliberately boring, and that is the point: every reward term traces back to a row in the taxonomy, and anything the taxonomy did not show me, I do not reward. The dominant failure (most trials miss the verifier) calls for dense partial credit so the GRPO groups are not all zeros. The early-submit and loop failures get explicit penalties, but only on top of partial credit, not in place of it. The 38 silent trials get nothing extra, because I cannot see why they failed and I will not penalize what I cannot measure.

Failure → reward mapping

Failure	Reward	Priority
Most trials fail verifier (~84%)	`R_outcome`: partial credit from CTRF	P0
Early submit, tests partly pass	Same `R_outcome`	P0
Mark done while tests fail	Penalty `−lam_pc · 1[premature_complete]`	P1
Repeat loops / wasted bash	`R_agency` tiebreaker among successes	P1
Failures at steps 2–3, not turn cap	No horizon bonus or per-step shaping	P0
Silent trials (38)	`R_outcome` only	P0
`missing_env` / `context_pressure`	None (ablate)	P2

R_outcome: passed_tests / total_tests ∈ [0, 1]. No LLM rubric.
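
A minimal sketch of the partial-credit computation, assuming the verifier's CTRF report is JSON with a `results.tests` list whose entries carry a `status` field. That layout is my reading of CTRF and should be checked against what Harbor actually writes; `outcome_from_ctrf` is a hypothetical name, not the `r_outcome` module itself. Counting skipped tests in the denominator is also an assumption.

```python
import json


def outcome_from_ctrf(ctrf_report: dict) -> float:
    """Fraction of tests with status 'passed'; 0.0 if the report lists no tests."""
    tests = ctrf_report.get("results", {}).get("tests", [])
    if not tests:
        return 0.0
    passed = sum(1 for t in tests if t.get("status") == "passed")
    return passed / len(tests)


report = json.loads("""
{"results": {"tests": [
  {"name": "test_build",   "status": "passed"},
  {"name": "test_output",  "status": "failed"},
  {"name": "test_cleanup", "status": "passed"},
  {"name": "test_perf",    "status": "skipped"}
]}}
""")
print(outcome_from_ctrf(report))  # 0.5
```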

R_agency: Among rollouts with R_outcome = 1.0, rank by turns + wasted_commands (lower is better); agency_bonus ∈ [0, 1], so the scaled term λ · agency_bonus ∈ [0, λ]. It is only a tiebreaker, and it needs at least two successes in a group to fire at all. Early in training, when ~84% of trials fail, it almost never triggers. That is on purpose: efficiency only becomes worth rewarding once the model passes often enough to choose between a clean solution and a messy one.
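
A sketch of the group tiebreak under stated assumptions: the function returns an unscaled bonus in [0, 1] that is later multiplied by λ, the mapping from rank to bonus is linear, and tied rollouts share the better rank. None of those details are fixed by the description above, and `agency_bonus` here is an illustration rather than the shipped `r_agency` module.

```python
from typing import List


def agency_bonus(outcomes: List[float], costs: List[float]) -> List[float]:
    """Unscaled bonus per rollout; non-zero only among fully passing rollouts.

    costs[i] is turns + wasted_commands for rollout i (lower is better).
    """
    winners = [i for i, r in enumerate(outcomes) if r == 1.0]
    bonus = [0.0] * len(outcomes)
    if len(winners) < 2:
        return bonus  # fewer than two successes: no tie to break
    for i in winners:
        # Rank = number of winners that were strictly cheaper than rollout i.
        rank = sum(1 for j in winners if costs[j] < costs[i])
        bonus[i] = 1.0 - rank / (len(winners) - 1)
    return bonus


# Five rollouts of one task: three pass fully, with costs 4, 6, and 9.
print(agency_bonus([1.0, 0.5, 1.0, 1.0, 0.0], [4, 3, 6, 9, 2]))
# [1.0, 0.0, 0.5, 0.0, 0.0]
```

The cheap rollout with partial credit (cost 3) gets no bonus: efficiency is only compared among rollouts that already pass.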

R_integrity: Void trajectory on test/verifier tampering, hardcoded answers, or exfiltration (total ≤ 0).

total_i = R_outcome_i + λ · agency_bonus_i − integrity_penalty_i − lam_pc · 1[premature_complete]

P0: outcome + integrity (λ = 0, lam_pc = 0). P1: agency and premature-complete penalty.
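
The composition can be sketched as one function, with P0 and P1 differing only in the two coefficients. The size of the integrity penalty and the clamp to zero are my assumptions for realizing "total ≤ 0" on a voided trajectory; `total_reward` is a hypothetical name.

```python
def total_reward(r_outcome: float, agency_bonus: float,
                 premature_complete: bool, integrity_violation: bool,
                 lam: float = 0.0, lam_pc: float = 0.0,
                 integrity_penalty: float = 1.0) -> float:
    """Compose the reward terms; defaults correspond to P0 (lam = lam_pc = 0)."""
    base = r_outcome + lam * agency_bonus - lam_pc * float(premature_complete)
    if integrity_violation:
        # Voided trajectory: subtract the penalty and never let the total exceed 0.
        return min(0.0, base - integrity_penalty)
    return base


# Same rollout scored under P0 and under P1 (lam = lam_pc = 0.1).
p0 = total_reward(0.75, 0.0, premature_complete=True, integrity_violation=False)
p1 = total_reward(0.75, 0.0, premature_complete=True, integrity_violation=False,
                  lam=0.1, lam_pc=0.1)
print(p0, round(p1, 2))
```

Under P0 the early submit costs nothing beyond the tests it left failing; under P1 the same rollout is pushed slightly below its partial credit.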

Implemented in code (42 tests)

Term	Module	Default mode
R_outcome	`r_outcome`	P0 on
R_integrity	`r_integrity`	P0 on
R_agency	`r_agency` + group tiebreak	P1 (`lam=0.1`)
premature_complete	`r_premature_complete`	P1 (`lam_pc=0.1`)
Other Figure 3 detectors	-	Not rewarded (design only)

4. Model selection and training plan

With the environment fixed, the training plan is mostly downstream of it. Two choices still carry weight: which model I start from, and how I stop GRPO from burning rollouts on tasks it cannot learn from. The rest of this section is those two decisions and the knobs around them.

4.1 Base model

Decision: openai/gpt-oss-20b

Aligns with the Terminal-Bench 2 blog stack: Harbor + terminus-2 + gpt-oss-20b + verifiable rewards + GRPO. The baseline is verifier-dominated (~84% verifier_fail), not timeout-limited; dense partial credit produces non-degenerate GRPO groups instead of all-zero advantages.

gpt-oss-120b is the natural teacher for a later on-policy distillation (OPD) stretch: dense per-token signal when rollouts end at zero verifier reward or truncate early (Figure 2 silent bucket, Figure 4 early-step cluster).
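
As a sketch of what "dense per-token signal" would mean here: the function below computes the reverse KL, KL(student ‖ teacher), at each position of a student-sampled sequence, given full next-token log-probabilities from both models over a toy vocabulary. The function name and the numbers are illustrative, and a real OPD setup may estimate this quantity from the sampled token only.

```python
import math
from typing import List


def reverse_kl_per_token(student_logprobs: List[List[float]],
                         teacher_logprobs: List[List[float]]) -> List[float]:
    """KL(student || teacher) at each position; inner lists are log-probs over the vocab."""
    per_token = []
    for s_row, t_row in zip(student_logprobs, teacher_logprobs):
        per_token.append(sum(math.exp(s) * (s - t) for s, t in zip(s_row, t_row)))
    return per_token


# Two positions, vocabulary of size 2.
student = [[math.log(0.5), math.log(0.5)], [math.log(0.9), math.log(0.1)]]
teacher = [[math.log(0.5), math.log(0.5)], [math.log(0.5), math.log(0.5)]]
kl = reverse_kl_per_token(student, teacher)
print([round(x, 3) for x in kl])  # [0.0, 0.368]
```

Every position yields a number, even when the episode's verifier reward is zero, which is the contrast with the single scalar per rollout that GRPO sees.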

4.2 RL algorithm

Decision: GRPO via TRL GRPOTrainer on Harbor rollouts.

For each task prompt, sample G independent terminus-2 rollouts, score with §3.3 rewards, normalize advantages within the group:

Â_i = (R_i − mean(R_{1..G})) / (std(R_{1..G}) + ε)
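
A small sketch of that normalization, mainly to show why an all-zero group is useless and a group with partial credit is not. It uses the population standard deviation, which is an assumption on my part; a trainer may use the sample standard deviation instead.

```python
from statistics import mean, pstdev
from typing import List


def group_advantages(rewards: List[float], eps: float = 1e-6) -> List[float]:
    """Group-relative advantages: (R_i - mean) / (std + eps) within one task's rollouts."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; an assumption, see lead-in
    return [(r - mu) / (sigma + eps) for r in rewards]


all_fail = group_advantages([0.0, 0.0, 0.0, 0.0])
partial = group_advantages([0.0, 0.25, 0.5, 1.0])
print(all_fail)                        # every advantage is 0.0: no gradient signal
print([round(a, 2) for a in partial])  # spread around zero, ordered like the rewards
```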

Why GRPO fits this environment

Property	Fit
Verifiable scalar reward	`R_outcome` from `tests/test.sh`; no learned RM
High variance across rollouts	Same task, different bash traces → spread in partial credit
No critic network	Simpler than PPO on long multi-turn trajectories
Reference recipe	GRPO [4] · AfterQuery blog + Eureka SFT→GRPO [3]

Training stack

Stage	Role	Planned
SFT	Cold-start terminus-2 + occasional verifier passes	Yes
OPD (optional)	gpt-oss-120b teacher, reverse-KL on student trajectories	Stretch
GRPO	On-policy improvement with composed rewards	Yes

4.3 Data splits and curriculum

Split	Source	Size	Use
SFT gold	`nvidia/Nemotron-Terminal-Corpus`	500 trajectories	SFT cold-start
GRPO pool	`nvidia/Nemotron-Terminal-Synthetic-Tasks`	5,984 tasks; ~10–50 in band/run	Rollout + update
Probe band (discovery sweep)	Subset of synthetic pool	10–80% pass@k	Curriculum gate
Eval (held-out)	Official Terminal-Bench 2	89 × k=5	`pass@1_macro`
Eval (fast)	10-task slice	10 × k=5	Iteration gate

The probe band (10–80%) is a wide discovery sweep to find learnable tasks; GRPO then trains on the 20–60% sub-band, where group advantage is richest. Tasks at 0% or 100% contribute little or no group advantage.
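
The two-stage gate can be sketched as a plain filter over measured pass rates. The task names and rates below are made up for illustration, and `in_band` is a hypothetical helper.

```python
from typing import Dict, List


def in_band(pass_rates: Dict[str, float], low: float, high: float) -> List[str]:
    """Task ids whose measured pass rate lies in [low, high], sorted by id."""
    return sorted(t for t, p in pass_rates.items() if low <= p <= high)


# Illustrative pass@k estimates from a probe sweep.
probe = {
    "task_a": 0.0, "task_b": 0.125, "task_c": 0.375,
    "task_d": 0.5, "task_e": 0.75, "task_f": 1.0,
}
discovery = in_band(probe, 0.10, 0.80)  # wide probe band
training = in_band(probe, 0.20, 0.60)   # narrower GRPO sub-band
print(discovery)  # ['task_b', 'task_c', 'task_d', 'task_e']
print(training)   # ['task_c', 'task_d']
```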

4.4 Hyperparameters

Single-node LoRA on gpt-oss-20b (16GB-class MoE with mxfp4). Anchored to the AfterQuery blog band.

LoRA SFT + GRPO configuration

Parameter	SFT	GRPO rollouts	GRPO update
Base weights	`openai/gpt-oss-20b`	SFT checkpoint	Same
Adapter	LoRA r=32, α=64	-	Trainable
Learning rate	2e-5	-	1e-6
Optimizer	AdamW β=(0.9, 0.95)	-	AdamW
Steps	250 (500 demos)	15 rounds	13–15
Group size G	-	8	8
Max turns (train / eval)	-	6	20 eval
Max seq len	8K	8K	8K train; 32K eval
Temperature	-	0.7	-
Eval temperature	-	-	1.0

Eval samples k=5 per task at temperature 1.0; pass@1_macro is the per-task pass rate averaged over those 5 samples, so a non-zero eval temperature is intentional sampling for the macro estimate rather than a single greedy decode.
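
A sketch of the metric as defined above: per-task pass rate over the k samples, then an unweighted mean across tasks. The `pass_at_1_macro` function and the three-task example are illustrative.

```python
from typing import Dict, List


def pass_at_1_macro(results: Dict[str, List[bool]]) -> float:
    """Mean over tasks of the per-task pass rate across its k samples."""
    per_task = [sum(samples) / len(samples) for samples in results.values() if samples]
    return sum(per_task) / len(per_task) if per_task else 0.0


# Three tasks, k=5 samples each (True means the verifier passed).
results = {
    "t1": [True, False, False, True, False],  # 0.4
    "t2": [False] * 5,                        # 0.0
    "t3": [True] * 5,                         # 1.0
}
print(round(pass_at_1_macro(results), 3))  # 0.467
```

Because the average is taken per task first, a task that passes 5 of 5 counts the same as any other task rather than dominating the pooled sample count.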

LoRA on the mxfp4 MoE base: adapters sit on the mxfp4 MoE weights, target the attention and router projections, stay in bf16, and are validated for numerical stability on a 1-step smoke run before the full job.

Rollout gate

# Only commit a GRPO update when reward variance exists within a batch
std(R_{1..G}) > 0  for at least one task group
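
The gate above, written out as a minimal sketch together with the "fraction with `reward_std > 0`" number tracked in §4.5. The function names and the example batch are hypothetical.

```python
from statistics import pstdev
from typing import Dict, List


def has_signal(rewards: List[float]) -> bool:
    """True if the rollouts of one task do not all share the same reward."""
    return len(rewards) > 1 and pstdev(rewards) > 0.0


def should_update(groups: Dict[str, List[float]]) -> bool:
    """Commit a GRPO update only if at least one task group has reward variance."""
    return any(has_signal(r) for r in groups.values())


def fraction_with_signal(groups: Dict[str, List[float]]) -> float:
    """Share of task groups whose reward std is positive."""
    if not groups:
        return 0.0
    return sum(has_signal(r) for r in groups.values()) / len(groups)


batch = {
    "task_a": [0.0] * 8,                                  # all fail: no advantage
    "task_b": [0.0, 0.0, 0.5, 1.0, 0.0, 0.25, 0.0, 0.0],  # mixed: usable
}
print(should_update(batch), fraction_with_signal(batch))  # True 0.5
```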

4.5 Metrics

Phase	Metric	Purpose
Eval	`pass@1_macro` on held-out 89	Headline benchmark lift
Eval	Per-task pass rate, L1 outcome mix	Tie back to taxonomy
SFT	Train loss, eval pass@1 vs base	Cold-start signal
GRPO rollouts	Mean/std of `R_outcome`; fraction with `reward_std > 0`	Gate policy updates
GRPO update	`grad_norm`, clipped fraction, KL	Stability
GRPO update	`premature_complete`, `repeat_command_loop` rates	Ablation vs taxonomy
Integrity	Voided trajectories (`R_integrity`)	Reward hacking guard

End-to-end validation would run Harbor trajectories with CTRF rewards on Nemotron tasks, confirm ≥1 GRPO step with reward_std > 0, and check that a rep-10 eval shows monotonic base → SFT → GRPO lift. The sections above are the design; full training runs are future work.

Primary ablation: P0 (R_outcome + R_integrity) vs P1 (+ R_agency + premature_complete).

References

[1] Atreja, Dhruv. 2026. Pathfinder: Self-Improving Agent Trace Analysis via Adversarial Self-Play and Code Execution. ACM Conference on AI and Agentic Systems, 1336–1339. doi:10.1145/3786335.3813199
[2] Helwig, Jacob. 2026. On-Policy Distillation (OPD). verl documentation. verl.readthedocs.io
[3] Li, Hangxuan, et al. 2026. Eureka: Intelligent Feature Engineering for Enterprise AI Cloud Resource Demand Prediction. DASFAA 2026.
[4] Shao, Zhihong, et al. 2024. DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models. arXiv preprint arXiv:2402.03300. arxiv.org/abs/2402.03300

Appendix

A. Failure taxonomy × TB2 detectability

Optional mapping from Atreja et al. [1] to terminus-2 + Harbor ATIF.

Paper family	Subtype	TB2	How / count
Architecture	Missing reflection	yes	`premature_complete`, `error_unaddressed` · 434+
Architecture	Infinite loop	yes	`repeat_command_loop` · 73
Architecture	Non-converging planner	yes	`high_wasted_commands` · 123
Context	Window overflow	yes	`context_pressure` · 51
Parsing/config	Malformed JSON	yes	`json_parse_warning` · 25
Parsing/config	Missing env	yes	`missing_env` · 131
Prompt	Contradictory instructions	no	Prompt not in ATIF spans
Tool misuse	Malformed tool schema	no	Only bash + mark_task_complete
Streaming/API	Tool-call breaks	no	No streaming spans

B. Framework & stack

The design above does not lean much on which trainer I use; this section is the reproducibility detail. I run GRPO [4] (group-relative advantages on verifiable rewards) in TRL, with Harbor as the environment layer.

Why GRPO. This is the same post-training recipe I have already shipped in production. In Eureka [3], we frame enterprise feature engineering as agentic code generation: SFT cold-start on domain plans, then GRPO on a composed reward. Terminal-Bench 2 has the same shape in a different domain, and the loop is identical: sample rollouts, score with verifiers, normalize advantages within a group, update the policy.

Considerations

SkyRL: Strong agentic coding integration; I would use it for production post-training on coding agents. When building Sofa Genius (sofagenius.ai), SkyRL train was powerful but operationally heavy, with long debugging loops. When the focus is environment and reward design, I want to mitigate infra risk.

veRL: Great for large-scale multi-node training and first-class on-policy distillation (OPD) [2]: the student samples rollouts from its own policy, and the teacher provides next-token log-probabilities on those student-visited states. Compared with RLVR, OPD provides dense, token-level supervision. This run is single-node, so veRL's multi-node setup cost isn't worth it here.

SLIME: Relatively new, backed by Z.AI and the GLM family, and hackable for custom pipelines, but environment glue is not first-class.

TRL (chosen): Hugging Face ecosystem; mature SFT + GRPO; decouples cleanly from Harbor as the environment layer. Keeps the design (observation, action, reward, and training plan) legible and reproducible. I chose Terminal-Bench 2 with Harbor using TRL.

Stack

Layer	Choice
Environment	Harbor (Modal/Docker sandboxes, terminus-2 agent, test.sh verifiers)
SFT	TRL `SFTTrainer`
RL	TRL `GRPOTrainer`
Reward	Custom module on verifier output (composed shaping terms)
