Haichuan Zhou · AI Agent Engineer · LLM Systems

I build agent systems that prove their answers — and the harnesses that measure them.

Most LLM agents sound confident before they're grounded. I built a full three-layer stack against that: self-authored MCP tool servers for deterministic code facts, a verifier-backed multi-agent system that labels every claim with evidence, and an open-source eval harness that benchmarks agent architectures with real numbers.


One stack, three layers

These aren't three unrelated side projects — each layer is built on the one below it, and the top layer measures the whole thing.

01 · Fact layer
project5 — MCP servers
Three deterministic MCP servers: repo structure, AST symbols, test execution. No LLM inside — they refuse to invent what isn't there.
02 · Orchestration layer
wayfinder
LangGraph multi-agent onboarding copilot that routes questions, grounds claims in MCP evidence, and labels each one verified / unverified / contradicted.
03 · Evaluation layer
agent-eval-harness
Open-source framework scoring routing, factuality, citation grounding, and verification rate — benchmarking Supervisor vs ReAct on 40 real tasks.

Projects

Flagship first. Every claim below is backed by tests, deploy evidence, or benchmark data in the repos.

wayfinder

Verifier-backed codebase onboarding copilot. Point it at an unfamiliar repository and it maps the architecture, explains entry paths from AST evidence, and, unlike most code-explanation tools, shows its uncertainty instead of hiding it.

Live workspace: a grounded run on pallets/click, with 3 claims verified from AST evidence (definition at src/click/core.py:1353, signature, qualified name) and 1 runtime claim honestly labeled unverified because it has no test coverage.
  • Evidence-first pipeline: a LangGraph Supervisor routes each question to role-contracted worker agents (architect_mapper, entry_explainer, verifier) that emit typed ClaimPackets with per-claim provenance.
  • Every claim gets a label: verified (AST/repo/test evidence), unverified (no grounding — shown, not hidden), or contradicted (a failing test challenges it, and the final prose is rewritten).
  • Hardened NL→symbol resolution: backticked/dotted symbols, CLI entry points from pyproject.toml, module-behavior questions — ambiguity resolves to "refuse" rather than "guess."
  • Production-shaped: auth + encrypted key storage, SQLite/Postgres run stores, rate limiting, sandboxed test-runner worker, readiness probes, full trace metadata schema — deployed publicly on Railway with recorded smoke evidence.
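
The three-state labeling rule above can be sketched in a few lines. This is a minimal illustration, not wayfinder's code: the `Claim` class, its fields, and `label_claim` are hypothetical names, and the precedence (a failing test outranks supporting evidence) is an assumption drawn from the bullet descriptions.

```python
from dataclasses import dataclass, field
from enum import Enum


class Label(str, Enum):
    VERIFIED = "verified"
    UNVERIFIED = "unverified"
    CONTRADICTED = "contradicted"


@dataclass
class Claim:
    """Hypothetical claim record; not wayfinder's actual ClaimPacket schema."""
    text: str
    evidence: list[str] = field(default_factory=list)       # e.g. file:line references
    failing_tests: list[str] = field(default_factory=list)  # tests that challenge the claim


def label_claim(claim: Claim) -> Label:
    # Assumed precedence: a failing test outranks any supporting evidence.
    if claim.failing_tests:
        return Label.CONTRADICTED
    if claim.evidence:
        return Label.VERIFIED
    # No grounding at all: keep the claim visible, but mark it unverified.
    return Label.UNVERIFIED
```

The point of the third branch is that an ungrounded claim still gets returned with a label rather than being silently dropped.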

3-state claim labeling with provenance
Live public deploy on Railway
4 agents with role contracts + typed claims
8+ failure modes with designed mitigations

Python 3.11 · FastAPI · LangGraph · MCP · Next.js 15 · Tailwind · Docker Compose · Railway · GitHub Actions

agent-eval-harness

Open-source evaluation framework for LLM agents, shipped with a Supervisor-vs-ReAct benchmark: same model, same five MCP tools, 40 tasks across 10 Python OSS repos — so the comparison isolates orchestration, not tooling.

  • Headline result: the Supervisor architecture used ~12× fewer tokens (396k vs 4.8M) and completed all 40 tasks with zero errors, while the ReAct baseline failed 6/40 by blowing past its recursion limit.
  • Honest reporting: ReAct scored higher on raw answer quality for the tasks it finished (factual 0.70 vs 0.48). The report says so, because a benchmark that hides the trade-off is worthless.
  • Four metrics: routing accuracy, LLM-as-judge factual correctness with variance-gated self-consistency, citation grounding (anti-hallucination symbol resolver), and verification rate from real pytest execution.
  • Run/score split: expensive agent runs persist to JSONL, so when a resolver bug surfaced mid-analysis, I fixed it and re-scored the stored runs offline at no extra cost. The citation score was corrected from 0.37 to 0.80 without re-running a single agent.
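
The run/score split can be illustrated with a small sketch. Everything here is hypothetical: the record fields (`cited`, `known`), the two scorer functions, and the sample data are invented for illustration and are not the harness's schema or its real bug. The idea is only that stored runs can be re-scored with a corrected scorer without calling an agent again.

```python
import json
import tempfile
from pathlib import Path
from typing import Callable


def save_runs(path: Path, runs: list[dict]) -> None:
    """Persist one JSON object per line so runs can be re-read later."""
    with path.open("w", encoding="utf-8") as fh:
        for run in runs:
            fh.write(json.dumps(run) + "\n")


def score_runs(path: Path, scorer: Callable[[dict], float]) -> float:
    """Mean score over stored runs; no agent is invoked here."""
    with path.open(encoding="utf-8") as fh:
        scores = [scorer(json.loads(line)) for line in fh if line.strip()]
    return sum(scores) / len(scores) if scores else 0.0


def citation_score_buggy(run: dict) -> float:
    # Hypothetical bug: backticked citations never match a known symbol.
    cited = run["cited"]
    return sum(c in run["known"] for c in cited) / len(cited) if cited else 0.0


def citation_score_fixed(run: dict) -> float:
    # Hypothetical fix: strip backticks before comparing.
    cited = [c.strip("`") for c in run["cited"]]
    return sum(c in run["known"] for c in cited) / len(cited) if cited else 0.0


def rescore_demo() -> tuple[float, float]:
    """Write made-up runs once, then score the same file with both scorers."""
    runs = [
        {"cited": ["`pkg.run`", "pkg.main"], "known": ["pkg.run", "pkg.main"]},
        {"cited": ["pkg.missing"], "known": ["pkg.run"]},
    ]
    with tempfile.TemporaryDirectory() as tmp:
        path = Path(tmp) / "runs.jsonl"
        save_runs(path, runs)
        return score_runs(path, citation_score_buggy), score_runs(path, citation_score_fixed)
```

Because scoring only reads the JSONL file, swapping the scorer costs a file read rather than another round of model calls.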

12× fewer tokens, Supervisor vs ReAct
40 tasks · 10 OSS repos · 4 buckets
0 vs 6 task failures (Supervisor vs ReAct)
73 tests · ruff + mypy --strict green

Python · LLM-as-judge · self-consistency · pytest · mypy --strict · CLI + Python API

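
One way to read "variance-gated self-consistency" for the LLM-as-judge metric is sketched below. This is an assumption about the technique, not the harness's implementation: the `judge` callable, the sample count, and the `max_variance` threshold are all hypothetical. The judge is sampled several times, and the mean is accepted only when the samples agree closely enough.

```python
from statistics import mean, pvariance
from typing import Callable, Optional


def gated_judge_score(
    judge: Callable[[], float],
    samples: int = 5,
    max_variance: float = 0.02,
) -> Optional[float]:
    """Return the mean judge score, or None when the samples disagree too much."""
    scores = [judge() for _ in range(samples)]
    # Gate: a high population variance means the judge is not self-consistent.
    if pvariance(scores) > max_variance:
        return None
    return mean(scores)
```

Returning `None` instead of a number keeps an unstable judgment from being averaged into a headline score; a caller could route those cases to manual review.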
project5 — MCP tool suite

Three self-authored, deterministic MCP servers that form wayfinder's fact layer. None of them contains an LLM — if a symbol doesn't exist, they return a structured not-found instead of inventing an answer.

  • mcp-repo-mapper: repository structure, language breakdown, Python dependency graph, circular-dependency detection, framework detection, ranked entry-point candidates.
  • mcp-ast-explorer: LibCST-based symbol layer — definitions, signatures, references, call chains, class hierarchies. Refuses ambiguous bare names rather than guessing.
  • mcp-test-runner: bounded pytest/Jest execution (timeouts, CPU/memory limits, shell=False), normalized JSON result parsing, coverage summaries — the layer that turns claims into verdicts.
  • Designed as a stack: structure → symbols → execution mirrors how an engineer actually reads a new codebase.
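
The bounded execution described for mcp-test-runner can be sketched with the standard library alone. This is a simplified illustration, not the server's code: `RunResult` and `run_bounded` are hypothetical names, only the timeout and `shell=False` bounds are shown, and CPU/memory limits are left out because they are platform-specific.

```python
import subprocess
import sys
from dataclasses import dataclass
from typing import Optional


@dataclass
class RunResult:
    status: str                # "passed", "failed", or "timeout"
    returncode: Optional[int]  # None when the process was killed on timeout
    stdout: str


def run_bounded(argv: list[str], timeout_s: float) -> RunResult:
    """Run a command as an argument list (no shell) with a hard wall-clock limit."""
    try:
        proc = subprocess.run(
            argv,
            shell=False,          # argv is never interpreted by a shell
            capture_output=True,
            text=True,
            timeout=timeout_s,    # the child is killed when this expires
        )
    except subprocess.TimeoutExpired:
        return RunResult("timeout", None, "")
    status = "passed" if proc.returncode == 0 else "failed"
    return RunResult(status, proc.returncode, proc.stdout)
```

A call such as `run_bounded([sys.executable, "-m", "pytest", "-q"], timeout_s=60)` would apply the same bounds to a test run, assuming pytest is installed in that interpreter.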

3 standalone MCP servers
0 LLM calls, fully deterministic
16 focused tools across the suite
stdio+HTTP transports, deploy-tested

FastMCP 2.x · LibCST · pytest-json-report · sandboxed subprocess · typed results
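
The "structured not-found" and "refuse ambiguous bare names" behaviors can be sketched as a resolver that never guesses. This is an illustration under assumptions: the `Resolution` type, the status strings, and the suffix-matching rule are hypothetical and are not the mcp-ast-explorer API.

```python
from dataclasses import dataclass
from typing import Literal


@dataclass(frozen=True)
class Resolution:
    status: Literal["found", "not_found", "ambiguous"]
    name: str
    candidates: tuple[str, ...] = ()


def resolve_symbol(name: str, qualified_names: list[str]) -> Resolution:
    """Match a bare or dotted name against known fully qualified names."""
    # A name matches exactly, or as a suffix on a dot boundary.
    matches = tuple(
        sorted(q for q in qualified_names if q == name or q.endswith("." + name))
    )
    if not matches:
        # Nothing to report: say so in a structured way instead of inventing one.
        return Resolution("not_found", name)
    if len(matches) > 1:
        # Several candidates: refuse and list them rather than pick one.
        return Resolution("ambiguous", name, matches)
    return Resolution("found", name, matches)
```

With a made-up index of `pkg.core.Result` and `pkg.testing.Result`, the bare name `Result` comes back as ambiguous with both candidates, while `core.Result` resolves to a single symbol.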

Writing

Design write-ups and postmortems from building the stack above.

About

I'm Haichuan Zhou, an AI engineer currently interning at HireBeat, where I'm the sole engineer on an autonomous job-application agent (LangGraph + Claude). Its grounded, anti-hallucination scoring engine makes decisions deterministically in Python, not in the LLM.

The projects on this page came out of one conviction: LLM output is a claim, not an answer. I kept seeing agents that sounded right and weren't — so I built the tool layer that can't hallucinate (deterministic MCP servers), the agent layer that labels its own uncertainty (wayfinder), and the eval layer that measures whether any of it actually works (agent-eval-harness). Even this site's chatbot follows the rule: it answers with retrieval from my real design docs and cites its sources.

M.S. in Analytics (STEM) at USC, expected Dec 2026 · B.S. in Mathematics, CS minor, NYU 2025 · Based in Los Angeles, open to AI/LLM engineering roles.

What I bring

The through-line: treating LLM output as a claim to be verified, not an answer to be trusted.

Agent systems

LangGraph supervisor/worker orchestration, role contracts, typed claim provenance, human-in-the-loop resume, reflection loops with hard caps, MCP tool authoring (stdio & HTTP).

Evaluation & rigor

LLM-as-judge with self-consistency and variance gating, architecture-blind scoring, run/score decoupling, ground-truth review workflows, honest benchmark reporting.

Production engineering

FastAPI, Postgres/SQLite, Docker Compose, Railway/Cloud Run deploys, sandboxed execution workers, auth & encrypted secrets, rate limiting, observability schemas, CI gates (ruff, mypy --strict, pytest, typecheck).