The Agent Review Benchmark · Built for the Meta × Hugging Face OpenEnv Hackathon, India 2026 · by ~The Cook House

Security review,
for the age of AI.

AI writes the code. Who reviews it? SecureReview is the first OpenEnv harness that trains and grades agents on real security review — supply chain, infrastructure-as-code, and database migrations.


3 Review Domains

76 Scenarios

430 Vulnerabilities

+0.24 Mean Lift After Training

01 / RESULTS

SFT → GRPO hybrid training. Real lift on every domain.

+0.302 · Dependency · 20/24 wins · 0.083 → 0.385

+0.295 · Migration · 10/12 wins · 0.170 → 0.465

+0.126 · IaC · 6/13 wins · 0.177 → 0.303

Figure: Dependency review, before vs after SFT · 24 scenarios. Standout: dep_015 0.02 → 0.93.

Figure: Migration review, before vs after SFT · 12 curriculum-filtered scenarios. Standout: migration_025 0.06 → 0.64.

Figure: IaC review, before vs after SFT · 13 scenarios. Standout: iac_010 0.01 → 0.76.
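The headline +0.24 is consistent with an unweighted mean of the three per-domain lifts reported above. A minimal Python sketch of that arithmetic follows. The averaging method is an assumption, since the page does not say how the mean was taken, and the variable names are illustrative.

```python
# Before/after mean scores per domain, copied from the results above.
scores = {
    "dependency": (0.083, 0.385),
    "migration": (0.170, 0.465),
    "iac": (0.177, 0.303),
}

# Lift = score after training minus score before training.
lifts = {
    domain: round(after - before, 3)
    for domain, (before, after) in scores.items()
}

# Assumption: the headline number is an unweighted mean across domains.
mean_lift = sum(lifts.values()) / len(lifts)

print(lifts)
print(round(mean_lift, 2))
```

Weighting by the number of evaluated scenarios (24, 12, 13) would give a slightly different figure, so the unweighted reading is the one that matches the page.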

02 / WHAT THE AGENT REVIEWS

Three real scenes. Three real failure modes.

DEPENDENCY requirements.txt · dep_010

# Suggested by LLM during code generation
openai==1.3.0
langchain-utils==0.5.2      ← hallucinated, not on PyPI
streamlit-helpers==0.3.1    ← slopsquat opportunity
torch-helpers==1.9.0        ← real pkg is "torch"
chromadb-client==0.4.5      ← real pkg is "chromadb"
embedding-models==2.1.0     ← does not exist
vector-store==1.2.3         ← does not exist
ai-toolkit==0.8.0           ← generic squat target

Agent must catch 7 hallucinated packages · 6 critical, 1 high.
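One way to picture the check in this scene is to compare every pinned name against a set of packages known to exist. The sketch below is a rough illustration, not SecureReview's grader: `KNOWN_PACKAGES` is a hypothetical offline allowlist standing in for a real package-index lookup, and `unknown_packages` is an illustrative helper name.

```python
import re

# Hypothetical allowlist; a real reviewer would query the package index instead.
KNOWN_PACKAGES = {"openai", "torch", "chromadb", "streamlit", "langchain"}

# Matches lines of the form "name==version" and captures the name.
REQUIREMENT = re.compile(r"^\s*([A-Za-z0-9][A-Za-z0-9._-]*)\s*==")


def unknown_packages(requirements_text, known=KNOWN_PACKAGES):
    """Return pinned package names that are not in the known set."""
    flagged = []
    for line in requirements_text.splitlines():
        line = line.split("#", 1)[0]  # drop comments
        match = REQUIREMENT.match(line)
        if match and match.group(1).lower() not in known:
            flagged.append(match.group(1))
    return flagged


sample = """\
# Suggested by LLM during code generation
openai==1.3.0
langchain-utils==0.5.2
streamlit-helpers==0.3.1
torch-helpers==1.9.0
chromadb-client==0.4.5
embedding-models==2.1.0
vector-store==1.2.3
ai-toolkit==0.8.0
"""

flagged = unknown_packages(sample)
print(flagged)
```

An allowlist only says a name is unfamiliar. Deciding whether it is a typosquat, a slopsquat target, or simply a new package is the judgment the agent is graded on.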

IAC main.tf · iac_015

resource "aws_security_group" "db" {
  ingress {
    cidr_blocks = ["0.0.0.0/0"]   ← Postgres open to internet
  }
}
resource "aws_db_instance" "analytics" {
  username                = "admin"
  password                = "Sup3rSecret!2023"  ← in TF state
  publicly_accessible     = true            ← public RDS
  storage_encrypted       = false           ← unencrypted
  backup_retention_period = 0               ← no PITR
}
resource "aws_s3_bucket" "exports" {
  acl = "public-read"                         ← public bucket
}

Agent must catch 6 misconfigurations · network exposure, encryption, credentials, backup posture.
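The six findings in this scene map to simple textual patterns, which makes them easy to sketch as a rule table. The example below is a crude illustration under stated assumptions: the rule labels and regexes are invented for this sketch, and matching raw text is far weaker than parsing HCL or reasoning across files the way the benchmark expects.

```python
import re

# Illustrative rules only: (finding label, regex over raw Terraform text).
RULES = [
    ("ingress open to 0.0.0.0/0", r'cidr_blocks\s*=\s*\[\s*"0\.0\.0\.0/0"'),
    ("hardcoded password", r'password\s*=\s*"[^"]+"'),
    ("publicly accessible database", r"publicly_accessible\s*=\s*true"),
    ("storage not encrypted", r"storage_encrypted\s*=\s*false"),
    ("backups disabled", r"backup_retention_period\s*=\s*0\b"),
    ("public-read bucket ACL", r'acl\s*=\s*"public-read"'),
]


def scan_terraform(text):
    """Return the labels of every rule whose pattern appears in the text."""
    return [label for label, pattern in RULES if re.search(pattern, text)]


sample_tf = '''
resource "aws_security_group" "db" {
  ingress {
    cidr_blocks = ["0.0.0.0/0"]
  }
}
resource "aws_db_instance" "analytics" {
  username                = "admin"
  password                = "Sup3rSecret!2023"
  publicly_accessible     = true
  storage_encrypted       = false
  backup_retention_period = 0
}
resource "aws_s3_bucket" "exports" {
  acl = "public-read"
}
'''

findings = scan_terraform(sample_tf)
for label in findings:
    print(label)
```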

MIGRATION migration_007.sql · 4.2B-row telemetry table

-- table: 4.2B rows · 1.4k writes/sec
-- deploy: rolling, 0s downtime budget

CREATE INDEX idx_dev_metric_time   ← no CONCURRENTLY
    ON telemetry_records(device_id, metric, recorded_at);

ALTER TABLE telemetry_records
    SET (fillfactor = 100);              ← kills HOT updates

CLUSTER telemetry_records              ← AccessExclusiveLock
    USING idx_dev_metric_time;             on a 4B-row table

CREATE INDEX idx_payload
    ON telemetry_records USING gin(tags);  ← no jsonb_path_ops

Agent must catch 5 issues, including a blocking CLUSTER on a 4.2B-row table during a zero-downtime deploy. Senior-engineer judgment, not lint.
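The lock-related problems in this migration can be sketched as a small statement scanner. This is a deliberately narrow illustration: `flag_blocking_statements` is a hypothetical helper, it covers only index builds without CONCURRENTLY and CLUSTER, and it ignores the table-size and write-rate context that the scenario depends on.

```python
import re


def flag_blocking_statements(sql):
    """Return (statement_prefix, reason) pairs for statements that take heavy locks."""
    findings = []
    without_comments = re.sub(r"--[^\n]*", "", sql)  # strip line comments
    for raw in without_comments.split(";"):
        statement = " ".join(raw.split())  # collapse whitespace
        if not statement:
            continue
        upper = statement.upper()
        if upper.startswith(("CREATE INDEX", "CREATE UNIQUE INDEX")):
            if "CONCURRENTLY" not in upper:
                findings.append(
                    (statement[:40], "index build without CONCURRENTLY blocks writes")
                )
        elif upper.startswith("CLUSTER"):
            findings.append(
                (statement[:40], "CLUSTER takes an ACCESS EXCLUSIVE lock")
            )
    return findings


migration = """
-- table: 4.2B rows, 1.4k writes/sec
CREATE INDEX idx_dev_metric_time
    ON telemetry_records(device_id, metric, recorded_at);

ALTER TABLE telemetry_records
    SET (fillfactor = 100);

CLUSTER telemetry_records
    USING idx_dev_metric_time;

CREATE INDEX idx_payload
    ON telemetry_records USING gin(tags);
"""

results = flag_blocking_statements(migration)
for prefix, reason in results:
    print(f"{prefix!r}: {reason}")
```

The fillfactor setting and the GIN operator class are not lock issues, so a scanner like this stays silent on them. Catching those requires the workload context given in the scenario header.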

03 / BENCHMARK

Three domains, three difficulties, one standard.

I.

Dependency & Supply Chain Security

Typosquats, hallucinated PyPI imports, pinned CVEs. Supply-chain literacy.

24 Scenarios · 15 Steps · 120 Findings

Easy

II.

Infrastructure-as-Code Misconfiguration

CIS violations in Terraform / K8s — public buckets, wildcard IAM, privileged containers. Multi-file cloud reasoning.

24 Scenarios · 25 Steps · 155 Findings

Medium

III.

Database Migration Safety

SQL migrations against live production context — table sizes, write throughput, downstream services. Judgment, not lint.

28 Scenarios · 35 Steps · 155 Findings

Hard
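The three domain cards above can be written down as a small data structure, which also makes the totals easy to check. This is a sketch under assumptions: the `Domain` class and its field names are hypothetical, and reading "Steps" as a per-episode step budget is an interpretation the page does not confirm.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Domain:
    name: str
    difficulty: str
    scenarios: int
    max_steps: int  # assumption: "Steps" is the per-episode step budget
    findings: int


# Numbers copied from the three benchmark cards above.
DOMAINS = [
    Domain("dependency", "easy", 24, 15, 120),
    Domain("iac", "medium", 24, 25, 155),
    Domain("migration", "hard", 28, 35, 155),
]

total_scenarios = sum(d.scenarios for d in DOMAINS)
total_findings = sum(d.findings for d in DOMAINS)

print(total_scenarios, total_findings)  # prints: 76 430
```

The sums agree with the 76 indexed scenarios and the 430 vulnerabilities quoted elsewhere on the page.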

— Thesis

AI now authors a generation of production code. Review is the bottleneck — not authorship. An agent that cannot review code at the level of a senior engineer cannot be trusted to write it.

04 / RESOURCES

Everything in one place. For judges & replicators.

BLOG · BLOG.md · submission writeup ↗

DOC · RESULTS.md · full training story ↗

DOC · SCENARIOS.md · all 76 scenarios indexed ↗

PLOTS · /training_results/plots · PNGs + results.json ↗

RUN · securereview-trainer · dependency · one-click ↗

RUN · securereview-trainer-migration · migration · one-click ↗

RUN · securereview-trainer-iac · iac · one-click ↗

CODE · github.com/sam25kat/Secure_Reveiw · full source ↗

API · OpenAPI / Swagger · interactive docs ↗

05 / OPENENV INTERFACE

Standard gym-style endpoints. Plus a six-line quickstart.

health

list tasks

metadata

openapi

POST /reset

start episode

POST /step

execute action

terminal — full episode in 6 lines

# 1. start a dependency review episode
curl -X POST https://sam25kat-securereview.hf.space/reset \
  -H 'Content-Type: application/json' -d '{"task_id": "dependency_review"}'

# 2. mark complete to receive the F1-graded reward
curl -X POST https://sam25kat-securereview.hf.space/step \
  -H 'Content-Type: application/json' -d '{"action": {"action_type": "mark_complete"}}'
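The quickstart describes the reward as F1-graded. For readers unfamiliar with the metric, here is a minimal Python sketch of a plain F1 score over sets of finding identifiers. How SecureReview actually matches findings, or whether it weights severity, is not stated on this page, so exact matching on identifiers is an assumption and `f1_score` is an illustrative name.

```python
def f1_score(predicted, expected):
    """Harmonic mean of precision and recall over two sets of finding IDs."""
    predicted, expected = set(predicted), set(expected)
    true_positives = len(predicted & expected)
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(predicted)
    recall = true_positives / len(expected)
    return 2 * precision * recall / (precision + recall)


# Toy example: the agent reports three findings, two of which are correct,
# against four expected findings.
expected = {"langchain-utils", "torch-helpers", "vector-store", "ai-toolkit"}
predicted = {"langchain-utils", "torch-helpers", "openai"}

score = f1_score(predicted, expected)
print(round(score, 3))  # prints: 0.571
```

Because F1 penalizes both missed findings and false alarms, an agent cannot raise its score by flagging every line.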
