Haichuan Zhou — AI Agent Engineer

Projects

Flagship first. Every claim below is backed by tests, deploy evidence, or benchmark data in the repos.

wayfinder (Flagship)

Verifier-backed codebase onboarding copilot. Point it at an unfamiliar repository and it maps the architecture, explains entry paths from AST evidence, and — unlike most code-explanation tools — shows its uncertainty instead of hiding it.

Screenshot (live workspace): a completed grounded run on pallets/click with verified/unverified/contradicted labels. 3 claims are verified from AST evidence (definition at src/click/core.py:1353, signature, qualified name), and 1 runtime claim is labeled unverified because it has no test coverage.

Evidence-first pipeline: a LangGraph Supervisor routes each question to role-contracted worker agents (architect_mapper, entry_explainer, verifier) that emit typed ClaimPackets with per-claim provenance.
Every claim gets a label: verified (AST/repo/test evidence), unverified (no grounding — shown, not hidden), or contradicted (a failing test challenges it, and the final prose is rewritten).
Hardened NL→symbol resolution: backticked/dotted symbols, CLI entry points from pyproject.toml, module-behavior questions — ambiguity resolves to "refuse" rather than "guess."
Production-shaped: auth + encrypted key storage, SQLite/Postgres run stores, rate limiting, sandboxed test-runner worker, readiness probes, full trace metadata schema — deployed publicly on Railway with recorded smoke evidence.
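
The three-state labeling described above can be sketched in a few lines. This is an illustrative sketch, not the wayfinder source: only the ClaimPacket name and the three labels come from this page, while the Evidence class, its fields, and the precedence rule (any contradicting evidence wins) are assumptions.

```python
from dataclasses import dataclass, field
from enum import Enum


class Label(str, Enum):
    VERIFIED = "verified"
    UNVERIFIED = "unverified"
    CONTRADICTED = "contradicted"


@dataclass(frozen=True)
class Evidence:
    """One piece of provenance attached to a claim (hypothetical shape)."""
    kind: str        # "ast", "repo", or "test"
    location: str    # e.g. a file:line reference
    supports: bool   # False when a failing test challenges the claim


@dataclass
class ClaimPacket:
    """A single claim emitted by a worker agent, with its evidence."""
    text: str
    agent: str
    evidence: list[Evidence] = field(default_factory=list)

    @property
    def label(self) -> Label:
        # Assumed precedence: contradiction beats support, and a claim
        # with no evidence at all is shown as unverified, not dropped.
        if any(not e.supports for e in self.evidence):
            return Label.CONTRADICTED
        if self.evidence:
            return Label.VERIFIED
        return Label.UNVERIFIED
```

Deriving the label from the evidence list, rather than storing it, means a claim cannot be marked verified without carrying at least one supporting record.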

3-state: claim labeling with provenance
Live: public deploy on Railway
4 agents: role contracts + typed claims
failure modes with designed mitigations

Python 3.11 · FastAPI · LangGraph · MCP · Next.js 15 · Tailwind · Docker Compose · Railway · GitHub Actions

agent-eval-harness

source ↗

Open-source evaluation framework for LLM agents, shipped with a Supervisor-vs-ReAct benchmark: same model, same five MCP tools, 40 tasks across 10 Python OSS repos — so the comparison isolates orchestration, not tooling.

Headline result: the Supervisor architecture used ~12× fewer tokens (396k vs 4.8M) and completed all 40 tasks with zero errors, while the ReAct baseline failed 6/40 by blowing past its recursion limit.
Honest reporting: ReAct scored higher raw answer quality on tasks it finished (factual 0.70 vs 0.48) — the report says so, because a benchmark that hides the trade-off is worthless.
Four metrics: routing accuracy, LLM-as-judge factual correctness with variance-gated self-consistency, citation grounding (anti-hallucination symbol resolver), and verification rate from real pytest execution.
Run/score split: expensive agent runs persist to JSONL; a mid-analysis resolver bug was fixed and re-scored offline for free — citation score corrected 0.37 → 0.80 without re-running a single agent.
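
The run/score split is what makes offline re-scoring possible, and the idea fits in a short sketch. The code below is an assumption-laden illustration rather than the harness's real API: the function names, the `cited_symbols` record field, and the simplified citation scorer are all hypothetical.

```python
import json
from pathlib import Path
from typing import Callable, Iterable


def persist_runs(records: Iterable[dict], path: Path) -> None:
    """Write one JSON object per line: the expensive step, done once."""
    with path.open("w", encoding="utf-8") as fh:
        for record in records:
            fh.write(json.dumps(record) + "\n")


def load_runs(path: Path) -> list[dict]:
    """Read persisted runs back without calling any agent or model."""
    with path.open(encoding="utf-8") as fh:
        return [json.loads(line) for line in fh if line.strip()]


def rescore(path: Path, scorer: Callable[[dict], float]) -> float:
    """Mean score over persisted runs; swap the scorer and run again."""
    runs = load_runs(path)
    if not runs:
        return 0.0
    return sum(scorer(run) for run in runs) / len(runs)


def citation_score(run: dict, known_symbols: set[str]) -> float:
    """Fraction of cited symbols that resolve (simplified stand-in)."""
    cited = run["cited_symbols"]
    if not cited:
        return 0.0
    return sum(symbol in known_symbols for symbol in cited) / len(cited)
```

Because scoring only reads the JSONL file, fixing a scorer bug costs one more call to `rescore`, not another round of agent runs.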

12×: fewer tokens, Supervisor vs ReAct
40 tasks · 10 OSS repos · 4 buckets
0 vs 6: task failures (Supervisor vs ReAct)
tests · ruff + mypy --strict green

Python · LLM-as-judge · self-consistency · pytest · mypy --strict · CLI + Python API

project5 — MCP tool suite

repo-mapper ↗ ast-explorer ↗ test-runner ↗

Three self-authored, deterministic MCP servers that form wayfinder's fact layer. None of them contains an LLM — if a symbol doesn't exist, they return a structured not-found instead of inventing an answer.
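
To make the "structured not-found instead of inventing an answer" behavior concrete, here is a minimal sketch of a deterministic resolver. It is not the servers' actual code: the Resolution type, its status values, and the rule that a bare name matches on the last dotted component are assumptions for illustration.

```python
from dataclasses import dataclass
from typing import Literal


@dataclass(frozen=True)
class Resolution:
    """Structured answer: always a status, never a guessed symbol."""
    status: Literal["found", "not_found", "ambiguous"]
    query: str
    matches: tuple[str, ...] = ()


def resolve(query: str, index: set[str]) -> Resolution:
    """Resolve a symbol against an index of fully qualified names."""
    # An exact qualified name is unambiguous by construction.
    if query in index:
        return Resolution("found", query, (query,))
    # Otherwise treat the query as a bare name and compare it with the
    # last dotted component of every indexed symbol.
    candidates = tuple(
        sorted(name for name in index if name.rsplit(".", 1)[-1] == query)
    )
    if len(candidates) == 1:
        return Resolution("found", query, candidates)
    if candidates:
        # More than one match: refuse and report the options.
        return Resolution("ambiguous", query, candidates)
    return Resolution("not_found", query)
```

The caller always receives a status it can branch on, so an upstream agent has to handle "ambiguous" and "not_found" explicitly instead of getting a plausible-looking default.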

mcp-repo-mapper: repository structure, language breakdown, Python dependency graph, circular-dependency detection, framework detection, ranked entry-point candidates.
mcp-ast-explorer: LibCST-based symbol layer — definitions, signatures, references, call chains, class hierarchies. Refuses ambiguous bare names rather than guessing.
mcp-test-runner: bounded pytest/Jest execution (timeouts, CPU/memory limits, shell=False), normalized JSON result parsing, coverage summaries — the layer that turns claims into verdicts.
Designed as a stack: structure → symbols → execution mirrors how an engineer actually reads a new codebase.
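
The bounded-execution idea behind the test-runner layer can be sketched with the standard library alone. This is a simplified illustration under stated assumptions, not mcp-test-runner itself: RunResult and run_bounded are hypothetical names, only the timeout and shell=False parts are shown, and the CPU/memory limits mentioned above are left out.

```python
import subprocess
import sys
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class RunResult:
    """Normalized outcome of one bounded command execution."""
    status: str                 # "passed", "failed", or "timeout"
    returncode: Optional[int]   # None when the process was killed
    stdout: str


def run_bounded(argv: list[str], timeout_s: float = 30.0) -> RunResult:
    """Run argv without a shell and stop it after timeout_s seconds."""
    try:
        proc = subprocess.run(
            argv,
            shell=False,          # argv is passed as-is, no shell parsing
            capture_output=True,
            text=True,
            timeout=timeout_s,    # child is killed when this expires
        )
    except subprocess.TimeoutExpired:
        return RunResult("timeout", None, "")
    # Simplification: any non-zero exit code is reported as "failed".
    status = "passed" if proc.returncode == 0 else "failed"
    return RunResult(status, proc.returncode, proc.stdout)


if __name__ == "__main__":
    # Example invocation; assumes pytest is installed in this interpreter.
    print(run_bounded([sys.executable, "-m", "pytest", "-q"], timeout_s=60.0))
```

Returning a "timeout" status instead of raising keeps the result shape uniform, which is what lets a verifier turn any run, including a hung one, into a verdict.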

3 standalone MCP servers
0 LLM calls, fully deterministic
focused tools across the suite
stdio+HTTP: transports, deploy-tested

FastMCP 2.x · LibCST · pytest-json-report · sandboxed subprocess · typed results

About

I'm Haichuan Zhou, an AI engineer currently interning at HireBeat, where I'm the sole engineer on an autonomous job-application agent (LangGraph + Claude). It is built around a grounded, anti-hallucination scoring engine whose decisions are made deterministically in Python, not by the LLM.

The projects on this page came out of one conviction: LLM output is a claim, not an answer. I kept seeing agents that sounded right and weren't — so I built the tool layer that can't hallucinate (deterministic MCP servers), the agent layer that labels its own uncertainty (wayfinder), and the eval layer that measures whether any of it actually works (agent-eval-harness). Even this site's chatbot follows the rule: it answers with retrieval from my real design docs and cites its sources.

M.S. in Analytics (STEM) at USC, expected Dec 2026 · B.S. in Mathematics, CS minor, NYU 2025 · Based in Los Angeles, open to AI/LLM engineering roles.

I build agent systems that prove their answers — and the harnesses that measure them.

One stack, three layers


What I bring

Agent systems

Evaluation & rigor

Production engineering