Designing an RLVR Environment from a Failure Taxonomy (Terminal-Bench 2)

Lily Zhang · Jun 10, 2026

Reward designs that are not derived from the data and its failure modes are not well grounded. So I ran a verifiable failure taxonomy on gpt-oss-20b before writing any reward terms, and only then designed the RLVR environment.

What follows is the whole environment that came out of that: the observation and action spaces, the failure taxonomy, the reward functions it motivated, and the training plan around them, on Terminal-Bench 2 with gpt-oss-20b. The specifics are TB2-shaped, but the method is not: decompose the environment, measure how it actually fails, reward against that, and train where the signal is. That is the part I would carry to any RL environment.

1. Benchmark selection

I chose Terminal-Bench 2. Every task ships a deterministic tests/test.sh verifier, so the reward is a real pass/fail signal with no learned judge, and Harbor records full terminus-2 ATIF traces, which is what makes the failure taxonomy in §2 possible. Harbor also gives a clean skeleton for an RL environment (sandboxed tasks, an agent loop, and verifiers), so I am wiring rewards onto solid scaffolding instead of building the harness from scratch.

2. Failure taxonomy

This is the part that grounds everything downstream, so it comes first. I ran a verifiable failure taxonomy on a Terminal-Bench 2 baseline: 444 trials in Harbor (the sandboxed TB2 runner, driving a terminus-2 agent loop), each scored by rule-based ATIF detectors rather than an LLM judge. The framing is inspired by Atreja et al. [1]. Summary: 372 verifier_fail at Harbor L1 · 334 trace-visible diagnoses · 38 silent.
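
The summary counts can be sanity-checked with a few lines of arithmetic. This is only a sketch over the numbers quoted above; the variable names are illustrative and not taken from the analysis code.

```python
# Counts quoted in the summary line above.
total_trials = 444
verifier_fail = 372
trace_visible = 334
silent = 38

# The funnel splits verifier failures into diagnosed and silent trials.
assert trace_visible + silent == verifier_fail

verifier_fail_rate = verifier_fail / total_trials   # the "~84%" used throughout
silent_share = silent / verifier_fail               # failures with no trace signal

print(f"verifier_fail rate: {verifier_fail_rate:.1%}")
print(f"silent share of verifier failures: {silent_share:.1%}")
```

Roughly one in ten verifier failures is silent, which is the slice the design rule "only penalize computed trace evidence" leaves untouched.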

Figure 1: Harbor L1 outcomes. Mostly verifier_fail (~84%), not timeout. The learning problem is how the agent fails the verifier.
Figure 2: Analysis funnel. 38 silent trials: verifier failed but no ATIF signal for a mode penalty. Design rule: only penalize computed trace evidence.

| Failure mode | Detector | Computed signal |
|---|---|---|
| Misreflection | error_unaddressed | Prior step had errors; next step did not address them |
| Early submit | premature_complete | mark_task_complete while errors still present |
| Infinite loop | repeat_command_loop | Same bash keystrokes repeated ≥3× |
| Non-converging | high_wasted_commands | ≥50% agent steps carry error observations |
| Environment dependency | missing_env | command not found / ModuleNotFoundError |
| Context pressure | context_pressure | prompt_tokens ≥ 25K/step or total ≥100K |
| Malformed JSON | json_parse_warning | JSON parse errors in observation or message |

Figure 3: Executable failure modes. Primary chart for reward design. Headline: the missing-reflection family (premature_complete + error_unaddressed) fires 434+ times across multi-label trials.
Figure 4: Failure step distribution. First failures cluster at steps 2–3, not the turn cap. Bottleneck is wasted actions early, not horizon length.

Two findings drive the reward design. The agent mostly fails by submitting too early: the missing-reflection family (premature complete plus unaddressed errors) dominates Figure 3. And those failures land at steps 2–3, not at the turn cap (Figure 4), so the bottleneck is wasted early actions, not horizon length. §3.3 maps each of these to a concrete reward.
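
To make "rule-based detector" concrete, here is a minimal sketch of three of the detectors from the taxonomy table. It assumes a simplified, hypothetical `Step` record rather than the real ATIF schema, treats "repeated ≥3×" as three consecutive identical commands, and reads "errors still present" as "the previous observation had an error". All of these are assumptions for illustration, not the detectors that produced the counts above. error_unaddressed is left out because "addressing" an error needs a richer rule than this toy schema supports.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Step:
    """One agent step in a simplified, hypothetical trace schema (not real ATIF)."""
    command: Optional[str]          # bash keystrokes; None if no command was sent
    has_error: bool                 # the observation for this step carried an error
    marked_complete: bool = False   # the step called mark_task_complete


def repeat_command_loop(steps: List[Step], min_repeats: int = 3) -> bool:
    """Fires when the same command is sent min_repeats times in a row (assumed)."""
    run, prev = 0, None
    for step in steps:
        if step.command is None:
            run = 0
        elif step.command == prev:
            run += 1
        else:
            run = 1
        prev = step.command
        if run >= min_repeats:
            return True
    return False


def high_wasted_commands(steps: List[Step], threshold: float = 0.5) -> bool:
    """Fires when at least `threshold` of the steps carry error observations."""
    if not steps:
        return False
    return sum(s.has_error for s in steps) / len(steps) >= threshold


def premature_complete(steps: List[Step]) -> bool:
    """Fires when a step marks the task complete right after an error (assumed rule)."""
    return any(cur.marked_complete and prev.has_error
               for prev, cur in zip(steps, steps[1:]))


def diagnose(steps: List[Step]) -> List[str]:
    """Return the names of all detectors that fire; trials can be multi-label."""
    detectors = {
        "premature_complete": premature_complete,
        "repeat_command_loop": repeat_command_loop,
        "high_wasted_commands": high_wasted_commands,
    }
    return [name for name, fn in detectors.items() if fn(steps)]


failing_trace = [
    Step("ls", False),
    Step("make", True),
    Step("make", True),
    Step("make", True),
    Step(None, False, marked_complete=True),
]
clean_trace = [Step("ls", False), Step("make", False), Step(None, False, True)]
print(diagnose(failing_trace))
print(diagnose(clean_trace))
```

A trial with an empty diagnosis and a failed verifier is what the funnel calls silent.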

3. Environment creation

I build every RL environment in the same order: what the agent sees, what it can do, and what it gets rewarded for. Get those three right and the training code mostly writes itself; get the reward wrong and no amount of training fixes it. The next three subsections are that decomposition for Terminal-Bench 2.

Mechanically, with TRL + Harbor each task episode runs in a sandbox and ends with a verifier reward. During training GRPO samples G rollouts per task and normalizes advantages within the group, so the reward only has to be right in a relative sense across rollouts of the same task, not calibrated on an absolute scale.

3.1 State / observation space

One Terminal-Bench 2 task = one multi-turn episode in a persistent Harbor sandbox. The policy uses the terminus-2 format to interact with the sandbox.

At step t, the policy sees the tokenized chat transcript up to that point:

[ system prompt | instruction.md | (assistant_turn, env_feedback)_1 … (assistant_turn, env_feedback)_{t-1} ]

| Component | Content |
|---|---|
| System prompt | Terminal coding agent; acts only via bash |
| Task instruction | TB2 instruction.md |
| History | Per-step bash stdout/stderr/exit code |
| Encoding | Model chat template; loss_mask trains agent tokens only |
| Bounds | max_seq_len (e.g. 8K train / 32K eval) |

Observation is partial and stateful: same instruction, different history across steps. Harbor rollouts convert to prompt/completion pairs with a loss_mask for TRL GRPOTrainer.
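
As a rough illustration of the loss_mask idea, the sketch below flattens a transcript into tokens and marks only assistant tokens as trainable. It is an assumption-heavy toy: `build_sequence` is a hypothetical helper, and whitespace splitting stands in for the real chat template and tokenizer.

```python
from typing import List, Tuple


def build_sequence(
    system_prompt: str,
    instruction: str,
    history: List[Tuple[str, str]],
) -> Tuple[List[str], List[int]]:
    """Flatten a transcript into tokens plus a loss mask.

    `history` holds (assistant_turn, env_feedback) pairs. Tokens are
    whitespace-split words, a stand-in for the model's real tokenizer.
    """
    tokens: List[str] = []
    mask: List[int] = []

    def extend(text: str, trainable: bool) -> None:
        piece = text.split()
        tokens.extend(piece)
        mask.extend([1 if trainable else 0] * len(piece))

    extend(system_prompt, trainable=False)
    extend(instruction, trainable=False)
    for assistant_turn, env_feedback in history:
        extend(assistant_turn, trainable=True)    # agent tokens receive gradient
        extend(env_feedback, trainable=False)     # sandbox output is context only
    return tokens, mask


tokens, mask = build_sequence(
    "You are a bash agent",
    "Fix the build",
    [("make all", "error: missing file"), ("touch file", "ok")],
)
trained = [tok for tok, m in zip(tokens, mask) if m == 1]
print(trained)
```

The same instruction produces a different sequence at every step because the history grows, which is what "stateful" means here.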

3.2 Action space

At each step, the policy emits one terminus-2 message. The meaningful action is what happens in the sandbox.

| Dimension | Definition |
|---|---|
| Action | One generation → one terminus-2 message |
| Semantic action | One bash tool call or mark_task_complete |
| Effect | Command runs in sandbox; state persists across steps |
| Horizon cap | MAX_TURNS (≈6 train / 20 eval) |

Horizon is capped at 6 in training since first failures cluster at steps 2–3 (Figure 4); eval keeps 20 for long-horizon recovery, a deliberate train/eval gap.
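
The horizon cap can be pictured as a plain loop. The sketch below uses stub callables for the policy and the sandbox, and a made-up action dictionary instead of the real terminus-2 message format; `run_episode`, `looping_policy`, and `fake_sandbox` are hypothetical names, not Harbor APIs.

```python
from typing import Callable, Dict, List

MAX_TURNS_TRAIN = 6    # training cap from the table above
MAX_TURNS_EVAL = 20    # eval cap from the table above


def run_episode(
    policy: Callable[[List[Dict]], Dict],
    sandbox: Callable[[str], str],
    max_turns: int,
) -> List[Dict]:
    """Run one episode until mark_task_complete or the turn cap.

    `policy` maps the history to an action dict: either
    {"type": "bash", "command": ...} or {"type": "mark_task_complete"}.
    `sandbox` maps a command string to an observation string.
    """
    history: List[Dict] = []
    for turn in range(max_turns):
        action = policy(history)
        if action["type"] == "mark_task_complete":
            history.append({"turn": turn, "action": action, "observation": None})
            break
        observation = sandbox(action["command"])
        history.append({"turn": turn, "action": action, "observation": observation})
    return history


def looping_policy(history: List[Dict]) -> Dict:
    """Stub policy that never finishes, to show the cap in action."""
    return {"type": "bash", "command": "make"}


def fake_sandbox(command: str) -> str:
    """Stub sandbox that just echoes the command."""
    return f"ran: {command}"


train_trace = run_episode(looping_policy, fake_sandbox, MAX_TURNS_TRAIN)
eval_trace = run_episode(looping_policy, fake_sandbox, MAX_TURNS_EVAL)
print(len(train_trace), len(eval_trace))
```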

3.3 Reward functions

§2 tells me what fails; this is what I optimize. P0 is the core set; P1 are refinements I add if there is bandwidth; P2 items are not rewarded and are kept only for ablation.

The mapping is deliberately boring, and that is the point: every reward term traces back to a row in the taxonomy, and anything the taxonomy did not show me, I do not reward. The dominant failure (most trials miss the verifier) calls for dense partial credit so that GRPO groups are not all zeros. Early submission gets an explicit penalty and loops feed an efficiency tiebreaker, but both sit on top of partial credit, not in place of it. The 38 silent trials get nothing extra, because I cannot see why they failed and I will not penalize what I cannot measure.

Failure → reward mapping

| Failure | Reward | Priority |
|---|---|---|
| Most trials fail verifier (~84%) | R_outcome: partial credit from CTRF | P0 |
| Early submit, tests partly pass | Same R_outcome | P0 |
| Mark done while tests fail | Penalty −lam_pc · 1[premature_complete] | P1 |
| Repeat loops / wasted bash | R_agency tiebreaker among successes | P1 |
| Failures at steps 2–3, not turn cap | No horizon bonus or per-step shaping | P0 |
| Silent trials (38) | R_outcome only | P0 |
| missing_env / context_pressure | None (ablate) | P2 |

R_outcome: passed_tests / total_tests ∈ [0, 1]. No LLM rubric.
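
A minimal sketch of computing R_outcome from a CTRF report follows. It assumes the verifier emits a standard CTRF JSON document with a `results.summary` object carrying `tests` and `passed` counts; how Harbor exposes that report is not shown here, and `r_outcome_from_ctrf` is a hypothetical helper name.

```python
import json


def r_outcome_from_ctrf(report_json: str) -> float:
    """Return passed_tests / total_tests from a CTRF-style report, in [0, 1]."""
    report = json.loads(report_json)
    summary = report["results"]["summary"]
    total = summary["tests"]
    if total == 0:
        # No tests collected: treat as zero reward rather than dividing by zero.
        return 0.0
    return summary["passed"] / total


example_report = json.dumps({
    "results": {
        "tool": {"name": "pytest"},
        "summary": {"tests": 8, "passed": 3, "failed": 5,
                    "pending": 0, "skipped": 0, "other": 0},
        "tests": [],
    }
})
print(r_outcome_from_ctrf(example_report))
```

Three of eight tests passing yields 0.375, which is enough to separate this rollout from an all-fail sibling in the same GRPO group.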

R_agency: Among rollouts with R_outcome = 1.0, rank by turns + wasted_commands and assign agency_bonus ∈ [0, 1], so the scaled term λ · agency_bonus ∈ [0, λ]. It is only a tiebreaker, and it needs at least two successes in a group to fire at all. Early in training, when ~84% of trials fail, it almost never triggers. That is on purpose: efficiency only becomes worth rewarding once the model passes often enough to choose between a clean solution and a messy one.

R_integrity: Void trajectory on test/verifier tampering, hardcoded answers, or exfiltration (total ≤ 0).

total_i = R_outcome_i + λ · agency_bonus_i − integrity_penalty_i − lam_pc · 1[premature_complete_i]

P0: outcome + integrity (λ = 0, lam_pc = 0). P1: agency and premature-complete penalty.
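
The composed reward can be sketched directly from the formula. Several details below are assumptions made for illustration: the `Rollout` record is hypothetical, the agency bonus is taken in [0, 1] and mapped linearly from rank (best gets 1, worst gets 0), voided rollouts are excluded from the ranking, and a voided rollout forfeits all positive reward so that its total is ≤ 0.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Rollout:
    """Per-rollout signals; a hypothetical record, not the project's schema."""
    r_outcome: float                  # passed_tests / total_tests
    turns: int
    wasted_commands: int
    premature_complete: bool = False
    integrity_violation: bool = False


def agency_bonuses(group: List[Rollout]) -> List[float]:
    """Bonus in [0, 1] for clean, fully passing rollouts; zero elsewhere.

    Needs at least two successes to fire. Ties are broken by rollout order.
    """
    bonuses = [0.0] * len(group)
    winners = [i for i, r in enumerate(group)
               if r.r_outcome == 1.0 and not r.integrity_violation]
    if len(winners) < 2:
        return bonuses
    ranked = sorted(winners, key=lambda i: group[i].turns + group[i].wasted_commands)
    for rank, i in enumerate(ranked):
        bonuses[i] = 1.0 - rank / (len(ranked) - 1)   # assumed linear rank mapping
    return bonuses


def total_rewards(group: List[Rollout], lam: float = 0.0, lam_pc: float = 0.0) -> List[float]:
    """total_i = R_outcome_i + lam*bonus_i - integrity_penalty_i - lam_pc*1[premature]."""
    totals = []
    for rollout, bonus in zip(group, agency_bonuses(group)):
        positive = rollout.r_outcome + lam * bonus
        # Assumption: voiding removes all positive reward, so total <= 0.
        integrity_penalty = positive if rollout.integrity_violation else 0.0
        pc_penalty = lam_pc if rollout.premature_complete else 0.0
        totals.append(positive - integrity_penalty - pc_penalty)
    return totals


group = [
    Rollout(1.0, turns=4, wasted_commands=0),
    Rollout(1.0, turns=6, wasted_commands=2),
    Rollout(0.5, turns=3, wasted_commands=1, premature_complete=True),
    Rollout(1.0, turns=2, wasted_commands=0, integrity_violation=True),
]
p0 = total_rewards(group)                        # lam = 0, lam_pc = 0
p1 = total_rewards(group, lam=0.1, lam_pc=0.1)   # P1 defaults
print(p0)
print(p1)
```

Under P0 the two clean successes are indistinguishable; under P1 the shorter one edges ahead and the premature submit drops slightly below its partial credit.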

Implemented in code (42 tests)

| Term | Module | Default mode |
|---|---|---|
| R_outcome | r_outcome | P0 on |
| R_integrity | r_integrity | P0 on |
| R_agency | r_agency + group tiebreak | P1 (lam=0.1) |
| premature_complete | r_premature_complete | P1 (lam_pc=0.1) |
| Other Figure 3 detectors | - | Not rewarded (design only) |

4. Model selection and training plan

With the environment fixed, the training plan is mostly downstream of it. Two choices still carry weight: which model I start from, and how I stop GRPO from burning rollouts on tasks it cannot learn from. The rest of this section is those two decisions and the knobs around them.

4.1 Base model

Decision: openai/gpt-oss-20b

This matches the Terminal-Bench 2 blog stack: Harbor + terminus-2 + gpt-oss-20b + verifiable rewards + GRPO. The baseline is verifier-dominated (~84% verifier_fail), not timeout-limited, so dense partial credit should produce non-degenerate GRPO groups instead of all-zero advantages.

gpt-oss-120b is the natural teacher for a later on-policy distillation (OPD) stretch: dense per-token signal when rollouts end at zero verifier reward or truncate early (Figure 2 silent bucket, Figure 4 early-step cluster).

4.2 RL algorithm

Decision: GRPO via TRL GRPOTrainer on Harbor rollouts.

For each task prompt, sample G independent terminus-2 rollouts, score with §3.3 rewards, normalize advantages within the group:

Â_i = (R_i − mean(R_{1..G})) / (std(R_{1..G}) + ε)
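
The normalization above fits in a few lines of plain Python. This sketch uses the population standard deviation, which is an assumption: the formula does not say which estimator is meant, and a trainer may use the sample version instead.

```python
from statistics import mean, pstdev
from typing import List


def group_advantages(rewards: List[float], eps: float = 1e-6) -> List[float]:
    """Group-relative advantages: (R_i - mean) / (std + eps) within one task group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)   # population std; assumed, see lead-in
    return [(r - mu) / (sigma + eps) for r in rewards]


mixed = group_advantages([0.0, 0.5, 1.0, 0.5])
flat = group_advantages([0.0, 0.0, 0.0, 0.0])
print(mixed)
print(flat)
```

The flat group shows why partial credit matters: when every rollout scores the same, every advantage is zero and the group contributes no gradient.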

Why GRPO fits this environment

| Property | Fit |
|---|---|
| Verifiable scalar reward | R_outcome from tests/test.sh; no learned RM |
| High variance across rollouts | Same task, different bash traces → spread in partial credit |
| No critic network | Simpler than PPO on long multi-turn trajectories |
| Reference recipe | GRPO [4] · AfterQuery blog + Eureka SFT→GRPO [3] |

Training stack

| Stage | Role | Planned |
|---|---|---|
| SFT | Cold-start terminus-2 + occasional verifier passes | Yes |
| OPD (optional) | gpt-oss-120b teacher, reverse-KL on student trajectories | Stretch |
| GRPO | On-policy improvement with composed rewards | Yes |
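
For the optional OPD stage, the per-token signal is a reverse KL between student and teacher next-token distributions, evaluated on positions the student itself visited. The sketch below computes the full-vocabulary version over toy log-probabilities. It is an illustration under that assumption only; an actual OPD implementation may use a sampled-token estimator or differ in other details.

```python
import math
from typing import List


def reverse_kl_per_token(
    student_logprobs: List[List[float]],
    teacher_logprobs: List[List[float]],
) -> List[float]:
    """KL(student || teacher) at each position, summed over the vocabulary.

    Each inner list holds log-probabilities over the same (toy) vocabulary.
    """
    kls = []
    for s_row, t_row in zip(student_logprobs, teacher_logprobs):
        kls.append(sum(math.exp(s) * (s - t) for s, t in zip(s_row, t_row)))
    return kls


# Two positions over a two-token toy vocabulary.
student = [[math.log(0.5), math.log(0.5)], [math.log(0.9), math.log(0.1)]]
teacher = [[math.log(0.5), math.log(0.5)], [math.log(0.5), math.log(0.5)]]
kls = reverse_kl_per_token(student, teacher)
print(kls)
```

Unlike the verifier reward, this signal is non-zero at every position where student and teacher disagree, even if the episode ends with zero passed tests.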

4.3 Data splits and curriculum

| Split | Source | Size | Use |
|---|---|---|---|
| SFT gold | nvidia/Nemotron-Terminal-Corpus | 500 trajectories | SFT cold-start |
| GRPO pool | nvidia/Nemotron-Terminal-Synthetic-Tasks | 5,984 tasks; ~10–50 in band/run | Rollout + update |
| Probe band (discovery sweep) | Subset of synthetic pool | 10–80% pass@k | Curriculum gate |
| Eval (held-out) | Official Terminal-Bench 2 | 89 × k=5 | pass@1_macro |
| Eval (fast) | 10-task slice | 10 × k=5 | Iteration gate |

The probe band (10–80%) is a wide discovery sweep to find learnable tasks; GRPO then trains on the 20–60% sub-band, where group advantage is richest. Tasks at 0% or 100% contribute little or no group advantage.
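
The two bands amount to a filter over per-task pass rates. The sketch below assumes probe results are stored as lists of 0/1 outcomes per task, uses the empirical pass rate as the band statistic, and treats the band edges as inclusive. The helper names and task IDs are made up for the example.

```python
from typing import Dict, List


def pass_rates(probe_results: Dict[str, List[int]]) -> Dict[str, float]:
    """Empirical pass rate per task from 0/1 probe outcomes."""
    return {task: sum(outcomes) / len(outcomes)
            for task, outcomes in probe_results.items() if outcomes}


def in_band(rates: Dict[str, float], low: float, high: float) -> List[str]:
    """Tasks whose pass rate falls inside [low, high] (inclusive, assumed)."""
    return sorted(task for task, rate in rates.items() if low <= rate <= high)


probe = {
    "task_a": [0, 0, 0, 0, 0, 0, 0, 0],   # 0%: no group advantage
    "task_b": [1, 0, 0, 0, 0, 0, 0, 0],   # 12.5%
    "task_c": [1, 1, 1, 0, 0, 0, 0, 0],   # 37.5%
    "task_d": [1, 1, 1, 1, 1, 1, 0, 0],   # 75%
    "task_e": [1, 1, 1, 1, 1, 1, 1, 1],   # 100%: no group advantage
}
rates = pass_rates(probe)
probe_band = in_band(rates, 0.10, 0.80)   # discovery sweep
train_band = in_band(rates, 0.20, 0.60)   # GRPO training sub-band
print(probe_band)
print(train_band)
```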

4.4 Hyperparameters

Single-node LoRA on gpt-oss-20b (16GB-class MoE with mxfp4). Anchored to the AfterQuery blog band.

LoRA SFT + GRPO configuration

| Parameter | SFT | GRPO rollouts | GRPO update |
|---|---|---|---|
| Base weights | openai/gpt-oss-20b | SFT checkpoint | Same |
| Adapter | LoRA r=32, α=64 | - | Trainable |
| Learning rate | 2e-5 | - | 1e-6 |
| Optimizer | AdamW β=(0.9, 0.95) | - | AdamW |
| Steps | 250 (500 demos) | 15 rounds | 13–15 |
| Group size G | - | 8 | 8 |
| Max turns (train / eval) | - | 6 | 20 eval |
| Max seq len | 8K | 8K | 8K train; 32K eval |
| Temperature | - | 0.7 | - |
| Eval temperature | - | - | 1.0 |

Eval samples k=5 per task at temperature 1.0; pass@1_macro is the per-task pass rate averaged over those 5 samples, so a non-zero eval temperature is intentional sampling for the macro estimate rather than a single greedy decode.
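
A minimal sketch of the macro average described above, assuming eval results are stored as k binary outcomes per task. The function name and the example task IDs are illustrative.

```python
from typing import Dict, List


def pass_at_1_macro(results: Dict[str, List[int]]) -> float:
    """Average of per-task pass rates, each computed over that task's k samples."""
    per_task = [sum(samples) / len(samples) for samples in results.values()]
    return sum(per_task) / len(per_task)


eval_results = {
    "task_1": [1, 1, 0, 0, 1],   # 3/5 = 0.6
    "task_2": [0, 0, 0, 0, 0],   # 0.0
    "task_3": [1, 1, 1, 1, 1],   # 1.0
}
score = pass_at_1_macro(eval_results)
print(f"pass@1_macro = {score:.3f}")
```

Averaging per task first means every task carries equal weight in the headline number.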

LoRA on the mxfp4 MoE base: adapters sit on the mxfp4 MoE weights, target the attention and router projections, stay in bf16, and are validated for numerical stability on a 1-step smoke run before the full job.

Rollout gate

# Only commit a GRPO update when reward variance exists within a batch
std(R_{1..G}) > 0  for at least one task group
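
The gate above, written out as a runnable sketch. It assumes a batch is a mapping from task ID to that task's G rewards; `should_update` and `fraction_with_signal` are hypothetical helper names, the latter matching the "fraction with reward_std > 0" metric tracked in §4.5.

```python
from statistics import pstdev
from typing import Dict, List


def reward_std_by_group(batch: Dict[str, List[float]]) -> Dict[str, float]:
    """Standard deviation of rewards within each task group."""
    return {task: pstdev(rewards) for task, rewards in batch.items()}


def should_update(batch: Dict[str, List[float]]) -> bool:
    """Commit a GRPO update only if at least one group has reward variance."""
    return any(std > 0 for std in reward_std_by_group(batch).values())


def fraction_with_signal(batch: Dict[str, List[float]]) -> float:
    """Share of task groups whose rewards are not all identical."""
    stds = reward_std_by_group(batch)
    return sum(std > 0 for std in stds.values()) / len(stds)


batch = {
    "task_a": [0.0] * 8,                                    # all fail: no signal
    "task_b": [0.0, 0.25, 0.5, 0.0, 1.0, 0.25, 0.0, 0.5],   # spread: signal
    "task_c": [1.0] * 8,                                    # all pass: no signal
}
print(should_update(batch), fraction_with_signal(batch))
```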

4.5 Metrics

| Phase | Metric | Purpose |
|---|---|---|
| Eval | pass@1_macro on held-out 89 | Headline benchmark lift |
| Eval | Per-task pass rate, L1 outcome mix | Tie back to taxonomy |
| SFT | Train loss, eval pass@1 vs base | Cold-start signal |
| GRPO rollouts | Mean/std of R_outcome; fraction with reward_std > 0 | Gate policy updates |
| GRPO update | grad_norm, clipped fraction, KL | Stability |
| GRPO update | premature_complete, repeat_command_loop rates | Ablation vs taxonomy |
| Integrity | Voided trajectories (R_integrity) | Reward hacking guard |

End-to-end validation would run Harbor trajectories with CTRF rewards on Nemotron tasks, confirm ≥1 GRPO step with reward_std > 0, and check that a rep-10 eval shows monotonic base → SFT → GRPO lift. The sections above are the design; full training runs are future work.

Primary ablation: P0 (R_outcome + R_integrity) vs P1 (+ R_agency + premature_complete).

References

  1. Dhruv Atreja. 2026. Pathfinder: Self-Improving Agent Trace Analysis via Adversarial Self-Play and Code Execution. ACM Conference on AI and Agentic Systems, 1336–1339. doi:10.1145/3786335.3813199
  2. Jacob Helwig. 2026. On-Policy Distillation (OPD). verl documentation. verl.readthedocs.io
  3. Hangxuan Li et al. 2026. Eureka: Intelligent Feature Engineering for Enterprise AI Cloud Resource Demand Prediction. DASFAA 2026.
  4. Zhihong Shao et al. 2024. DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models. arXiv preprint arXiv:2402.03300. arxiv.org/abs/2402.03300

Appendix

A. Failure taxonomy × TB2 detectability

Optional mapping from Atreja et al. [1] to terminus-2 + Harbor ATIF.

| Paper family | Subtype | TB2 | How / count |
|---|---|---|---|
| Architecture | Missing reflection | yes | premature_complete, error_unaddressed · 434+ |
| Architecture | Infinite loop | yes | repeat_command_loop · 73 |
| Architecture | Non-converging planner | yes | high_wasted_commands · 123 |
| Context | Window overflow | yes | context_pressure · 51 |
| Parsing/config | Malformed JSON | yes | json_parse_warning · 25 |
| Parsing/config | Missing env | yes | missing_env · 131 |
| Prompt | Contradictory instructions | no | Prompt not in ATIF spans |
| Tool misuse | Malformed tool schema | no | Only bash + mark_task_complete |
| Streaming/API | Tool-call breaks | no | No streaming spans |

B. Framework & stack

The design above does not lean much on which trainer I use; this section is the reproducibility detail. I run GRPO [4] (group-relative advantages on verifiable rewards) in TRL, with Harbor as the environment layer.

Why GRPO. This is the same post-training recipe I have already shipped in production. In Eureka [3], we frame enterprise feature engineering as agentic code generation: SFT cold-start on domain plans, then GRPO on a composed reward. Terminal-Bench 2 has the same shape in a different domain, and the loop is identical: sample rollouts, score with verifiers, normalize advantages within a group, update the policy.

Considerations

SkyRL: Strong agentic coding integration; I would use it for production post-training on coding agents. When building Sofa Genius (sofagenius.ai), SkyRL train was powerful but operationally heavy with long debugging loops. When the focus is environment and reward design, I want to mitigate infra risk.

veRL: Great for large-scale multi-node training and first-class on-policy distillation (OPD) [2]: the student samples rollouts from its own policy, and the teacher provides next-token log-probabilities on those student-visited states. Compared with RLVR, OPD provides dense, token-level supervision. For a single-node setup I doubt we need multi-node training, so veRL's setup cost isn't worth it.

SLIME: Relatively new, backed by Z.AI and the GLM family, hackable for custom pipelines. Environment glue is not first-class.

TRL (chosen): Hugging Face ecosystem; mature SFT + GRPO; decouples cleanly from Harbor as the environment layer. Keeps the design (observation, action, reward, and training plan) legible and reproducible. I chose Terminal-Bench 2 with Harbor using TRL.

Stack

| Layer | Choice |
|---|---|
| Environment | Harbor (Modal/Docker sandboxes, terminus-2 agent, test.sh verifiers) |
| SFT | TRL SFTTrainer |
| RL | TRL GRPOTrainer |
| Reward | Custom module on verifier output (composed shaping terms) |