The Agent Review Benchmark · Built for the Meta × Hugging Face OpenEnv Hackathon, India 2026 · by ~The Cook House

Security review,
for the age of AI.

AI writes the code. Who reviews it? SecureReview is the first OpenEnv harness that trains and grades agents on real security review — supply chain, infrastructure-as-code, and database migrations.

3 Review Domains · 76 Scenarios · 430 Vulnerabilities · +0.24 Mean Lift After Training

01 / RESULTS

SFT → GRPO hybrid training. Real lift on every domain.

+0.302 · Dependency · 20/24 wins · 0.083 → 0.385
+0.295 · Migration · 10/12 wins · 0.170 → 0.465
+0.126 · IaC · 6/13 wins · 0.177 → 0.303
[Chart] Dependency review, before vs after SFT · 24 scenarios. Standout: dep_015 0.02 → 0.93.
[Chart] Migration review, before vs after SFT · 12 curriculum-filtered scenarios. Standout: migration_025 0.06 → 0.64.
[Chart] IaC review, before vs after SFT · 13 scenarios. Standout: iac_010 0.01 → 0.76.
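
The per-domain lifts and the headline +0.24 can be re-derived from the before/after scores quoted above. The sketch below assumes the headline is an unweighted mean over the three domains; that assumption reproduces the published figure, but the page does not state how it was aggregated.

```python
# Before/after mean scores per domain, copied from the results above.
scores = {
    "dependency": (0.083, 0.385),
    "migration": (0.170, 0.465),
    "iac": (0.177, 0.303),
}

# Lift is the after score minus the before score, rounded to three decimals.
lifts = {
    domain: round(after - before, 3)
    for domain, (before, after) in scores.items()
}

# Assumption: the headline number is an unweighted mean across domains.
mean_lift = round(sum(lifts.values()) / len(lifts), 2)

print(lifts)
print(mean_lift)
```
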
02 / WHAT THE AGENT REVIEWS

Three real scenes. Three real failure modes.

DEPENDENCY requirements.txt · dep_010
# Suggested by LLM during code generation
openai==1.3.0
langchain-utils==0.5.2      ← hallucinated, not on PyPI
streamlit-helpers==0.3.1    ← slopsquat opportunity
torch-helpers==1.9.0        ← real pkg is "torch"
chromadb-client==0.4.5      ← real pkg is "chromadb"
embedding-models==2.1.0     ← does not exist
vector-store==1.2.3         ← does not exist
ai-toolkit==0.8.0           ← generic squat target
Agent must catch 7 hallucinated packages · 6 critical, 1 high.
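
As a rough illustration of the mechanical part of this check, the sketch below parses pinned requirements and flags names that a registry lookup does not recognise. The `unknown_packages` helper, the `exists` callback, and the `KNOWN` set are hypothetical stand-ins, not part of SecureReview; a real check would query the package index, and existence alone says nothing about severity or intent.

```python
import re
from typing import Callable, Iterable

# Matches "name==version" pins; other specifier forms are ignored in this sketch.
PIN = re.compile(r"^\s*([A-Za-z0-9][A-Za-z0-9._-]*)\s*==\s*([^\s#]+)")


def unknown_packages(lines: Iterable[str], exists: Callable[[str], bool]) -> list[str]:
    """Return pinned package names for which `exists` reports False."""
    flagged = []
    for line in lines:
        match = PIN.match(line)
        if match and not exists(match.group(1).lower()):
            flagged.append(match.group(1))
    return flagged


# Hypothetical stand-in for a registry lookup (assumption, not a real index).
KNOWN = {"openai", "torch", "chromadb"}

requirements = [
    "# Suggested by LLM during code generation",
    "openai==1.3.0",
    "langchain-utils==0.5.2",
    "torch-helpers==1.9.0",
    "chromadb-client==0.4.5",
]

flagged = unknown_packages(requirements, lambda name: name in KNOWN)
print(flagged)
```
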
IAC main.tf · iac_015
resource "aws_security_group" "db" {
  ingress {
    cidr_blocks = ["0.0.0.0/0"]   ← Postgres open to internet
  }
}
resource "aws_db_instance" "analytics" {
  username                = "admin"
  password                = "Sup3rSecret!2023"  ← in TF state
  publicly_accessible     = true            ← public RDS
  storage_encrypted       = false           ← unencrypted
  backup_retention_period = 0               ← no PITR
}
resource "aws_s3_bucket" "exports" {
  acl = "public-read"                         ← public bucket
}
Agent must catch 6 misconfigurations · network exposure, encryption, credentials, backup posture.
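
A lint-style baseline for the six findings above might look like the following. The rule list and the `scan` function are illustrative assumptions that pattern-match raw Terraform text; they are not the benchmark's grader, and they miss anything that needs cross-resource reasoning.

```python
import re

# Illustrative rules only: (regex over raw Terraform text, finding label).
RULES = [
    (r'cidr_blocks\s*=\s*\[\s*"0\.0\.0\.0/0"', "ingress open to the internet"),
    (r'password\s*=\s*"[^"]+"', "hardcoded credential"),
    (r"publicly_accessible\s*=\s*true", "publicly accessible database"),
    (r"storage_encrypted\s*=\s*false", "storage not encrypted"),
    (r"backup_retention_period\s*=\s*0\b", "automated backups disabled"),
    (r'acl\s*=\s*"public-read"', "public-read bucket ACL"),
]


def scan(terraform: str) -> list[str]:
    """Return the label of every rule whose pattern appears in the text."""
    return [label for pattern, label in RULES if re.search(pattern, terraform)]


sample = """
resource "aws_db_instance" "analytics" {
  publicly_accessible     = true
  storage_encrypted       = false
  backup_retention_period = 0
}
"""

findings = scan(sample)
print(findings)
```
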
MIGRATION migration_007.sql · 4.2B-row telemetry table
-- table: 4.2B rows · 1.4k writes/sec
-- deploy: rolling, 0s downtime budget

CREATE INDEX idx_dev_metric_time   ← no CONCURRENTLY
    ON telemetry_records(device_id, metric, recorded_at);

ALTER TABLE telemetry_records
    SET (fillfactor = 100);              ← kills HOT updates

CLUSTER telemetry_records              ← AccessExclusiveLock
    USING idx_dev_metric_time;             on a 4B-row table

CREATE INDEX idx_payload
    ON telemetry_records USING gin(tags);  ← no jsonb_path_ops
Agent must catch 5 issues, including a blocking CLUSTER on a 4.2B-row table during a zero-downtime deploy. Senior-engineer judgment, not lint.
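
For contrast, here is what the lint-level part of that review could look like: a sketch that flags index builds without CONCURRENTLY and any CLUSTER statement. The `risky_statements` function is a hypothetical helper with deliberately naive statement splitting. It has no notion of table size or write rate, which is exactly the context the scenario expects the agent to weigh.

```python
import re


def risky_statements(sql: str) -> list[str]:
    """Flag statements that commonly block writes on large PostgreSQL tables."""
    findings = []
    # Drop "--" comments, then split on ";" (naive: ignores ";" inside strings).
    cleaned = re.sub(r"--[^\n]*", "", sql)
    for raw in cleaned.split(";"):
        stmt = " ".join(raw.split()).upper()
        if re.match(r"CREATE (UNIQUE )?INDEX\b", stmt) and "CONCURRENTLY" not in stmt:
            findings.append("CREATE INDEX without CONCURRENTLY")
        elif stmt.startswith("CLUSTER "):
            findings.append("CLUSTER takes an exclusive lock")
    return findings


migration = """
CREATE INDEX idx_dev_metric_time
    ON telemetry_records(device_id, metric, recorded_at);
CLUSTER telemetry_records USING idx_dev_metric_time;
CREATE INDEX CONCURRENTLY idx_ok ON telemetry_records(device_id);
"""

findings = risky_statements(migration)
print(findings)
```
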
03 / BENCHMARK

Three domains, three difficulties, one standard.

I.

Dependency & Supply Chain Security

Typosquats, hallucinated PyPI imports, pinned CVEs. Supply-chain literacy.

24 Scenarios · 15 Steps · 120 Findings · Easy

II.

Infrastructure-as-Code Misconfiguration

CIS violations in Terraform / K8s — public buckets, wildcard IAM, privileged containers. Multi-file cloud reasoning.

24 Scenarios · 25 Steps · 155 Findings · Medium

III.

Database Migration Safety

SQL migrations against live production context — table sizes, write throughput, downstream services. Judgment, not lint.

28 Scenarios · 35 Steps · 155 Findings · Hard

Thesis
AI now authors a generation of production code. Review is the bottleneck, not authorship. An agent that cannot review code at the level of a senior engineer cannot be trusted to write it.

04 / RESOURCES

Everything in one place. For judges & replicators.

05 / OPENENV INTERFACE

Standard gym-style endpoints. Plus a six-line quickstart.

GET   /health     health
GET   /tasks      list tasks
GET   /metadata   metadata
GET   /docs       openapi
POST  /reset      start episode
POST  /step       execute action
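
The two POST endpoints can also be driven from Python with only the standard library. This is a minimal sketch: the payload shapes mirror the quickstart on this page, while the `build_request` and `post` helpers, the JSON content type, and the assumption that responses are JSON objects are mine rather than documented behaviour.

```python
import json
import urllib.request

BASE_URL = "https://sam25kat-securereview.hf.space"


def build_request(path: str, payload: dict) -> urllib.request.Request:
    """Build a JSON POST request without sending it."""
    return urllib.request.Request(
        BASE_URL + path,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def post(path: str, payload: dict) -> dict:
    """Send the request and decode a JSON response (needs network access)."""
    with urllib.request.urlopen(build_request(path, payload), timeout=30) as resp:
        return json.load(resp)


# Requests are only constructed here; nothing is sent until post() is called.
reset_req = build_request("/reset", {"task_id": "dependency_review"})
step_req = build_request("/step", {"action": {"action_type": "mark_complete"}})
print(reset_req.full_url, step_req.full_url)
```

Calling `post("/reset", {"task_id": "dependency_review"})` followed by `post("/step", {"action": {"action_type": "mark_complete"}})` would run the same two-call episode as the terminal quickstart.
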
terminal — full episode in 6 lines
# 1. start a dependency review episode
curl -X POST -H "Content-Type: application/json" https://sam25kat-securereview.hf.space/reset \
  -d '{"task_id": "dependency_review"}'

# 2. mark complete to receive the F1-graded reward
curl -X POST -H "Content-Type: application/json" https://sam25kat-securereview.hf.space/step \
  -d '{"action": {"action_type": "mark_complete"}}'
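
The reward returned by the final step is described as F1-graded. One plausible reading, sketched below, is F1 over the set of reported findings versus the scenario's expected findings. Matching findings by exact identifier is an assumption made for illustration; the page does not say how the grader pairs a report with ground truth.

```python
def f1_score(reported: set[str], expected: set[str]) -> float:
    """F1 over exact-match finding identifiers (an assumed matching rule)."""
    true_positives = len(reported & expected)
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(reported)
    recall = true_positives / len(expected)
    return 2 * precision * recall / (precision + recall)


# Hypothetical episode: two correct findings, one false positive, two misses.
expected = {"langchain-utils", "torch-helpers", "chromadb-client", "vector-store"}
reported = {"langchain-utils", "torch-helpers", "openai"}

score = f1_score(reported, expected)
print(round(score, 3))
```

Under this reading, marking an episode complete without reporting anything scores 0, which is consistent with the low pre-training baselines shown in the results.
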