wayfinder · production postmortem

The timeout wasn't the bug

Haichuan Zhou · July 2026 · mcp-ast-explorer · wayfinder

While putting a screenshot of wayfinder's deployed instance on this site, I noticed something embarrassing: every symbol question — the exact query type the system is built to answer with verified, file-and-line evidence — came back with zero verified claims. The answers were honest about it ("I can't verify where this is defined from the evidence in this packet"), which is the product working as designed. But the evidence should have been there.

Following the trace

wayfinder persists per-run trace metadata precisely for moments like this. The run record pointed straight at the failing hop:

Entry explanation degraded for BaseCommand.invoke: the AST evidence tool failed. Evidence limitation: MCP tool timed out after 8s.

The deployed config ran MCP tool calls with an 8-second timeout and a single attempt. So the question became: why does an AST lookup take more than 8 seconds?

Reading mcp-ast-explorer's server code gave the answer: every tool call rebuilt the full LibCST index from scratch. find_definition, function_signature, find_references, and call_chain each called build_cst_index(path) with no caching. I timed it locally against pallets/click (63 Python files): ~5.4 seconds per build on an M-series laptop. On Railway's shared vCPU the same build ran comfortably past 8 seconds, and a single grounded run makes several of these calls. Every symbol question was structurally guaranteed to time out.

(A fun detour: my first test query asked about BaseCommand.invoke, and even after fixing the timeout the answer refused to locate it. That wasn't a bug — BaseCommand no longer exists on click's main branch. The honest not-found was correct. Debug with symbols that exist.)

Stopgap, then the real fix

First, stop the bleeding with config: tool timeout 8s → 30s, one attempt → two, graph-node timeout raised to cover both. Symbol questions — now asking about Command.invoke, which does exist — immediately started returning verified 3 / unverified 1: definition at src/click/core.py:1353, signature, qualified name, all labeled from AST evidence. But each grounded run took 44.7–49.0 seconds, because the index was still being rebuilt on every tool call.
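To make "raised to cover both" concrete: the outer node timeout has to be at least attempts × per-attempt timeout, which is 2 × 30s = 60s with the values above. The sketch below shows that arithmetic next to a generic retry-with-timeout wrapper. It is an illustration, not wayfinder's actual code: min_node_timeout, call_with_retries, and the overhead_s parameter are hypothetical names introduced here.

```python
import asyncio
from typing import Awaitable, Callable, TypeVar

T = TypeVar("T")

TOOL_TIMEOUT_S = 30.0  # per-attempt tool timeout from the post
TOOL_ATTEMPTS = 2      # attempt count from the post


def min_node_timeout(tool_timeout_s: float, attempts: int, overhead_s: float = 0.0) -> float:
    """Smallest outer timeout that lets every attempt run to its own limit.

    overhead_s is a hypothetical allowance for work around the tool calls.
    """
    return tool_timeout_s * attempts + overhead_s


async def call_with_retries(
    make_call: Callable[[], Awaitable[T]],
    timeout_s: float,
    attempts: int,
) -> T:
    """Run make_call up to `attempts` times, bounding each attempt by timeout_s."""
    last_error: Exception | None = None
    for _ in range(attempts):
        try:
            return await asyncio.wait_for(make_call(), timeout=timeout_s)
        except asyncio.TimeoutError as exc:
            last_error = exc  # remember the failure and try again
    raise TimeoutError(
        f"tool timed out after {attempts} attempts of {timeout_s}s"
    ) from last_error
```

With these values, min_node_timeout(TOOL_TIMEOUT_S, TOOL_ATTEMPTS) gives 60.0. An outer limit below that would cut the second attempt short, so the retry could not actually run to its own 30s limit.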

The real fix is a small, boring cache in mcp-ast-explorer (v0.2.0): an in-process CstIndexCache keyed by resolved repo root, invalidated by a per-file (path, size, mtime_ns) fingerprint. Stat-ing every file is cheap next to re-parsing it, and any edit, addition, or deletion changes the fingerprint and triggers a rebuild. Locally: cold build 5.6s, warm hit 4ms.
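The idea fits in a few lines. What follows is a minimal sketch of a fingerprint-invalidated cache, not the code shipped in v0.2.0: IndexCache, repo_fingerprint, and the build callback are hypothetical names, and restricting the fingerprint to *.py files is an assumption made for the example.

```python
from pathlib import Path
from typing import Callable, Dict, Generic, Tuple, TypeVar, Union

T = TypeVar("T")
Fingerprint = Tuple[Tuple[str, int, int], ...]


def repo_fingerprint(root: Path) -> Fingerprint:
    """Stat every .py file under root and record (path, size, mtime_ns).

    Editing a file changes its size or mtime; adding or deleting a file
    changes the set of entries. Either way the tuple compares unequal.
    """
    entries = []
    for path in sorted(root.rglob("*.py")):
        st = path.stat()
        entries.append((str(path.relative_to(root)), st.st_size, st.st_mtime_ns))
    return tuple(entries)


class IndexCache(Generic[T]):
    """In-process cache keyed by resolved repo root (hypothetical sketch)."""

    def __init__(self, build: Callable[[Path], T]) -> None:
        self._build = build  # the expensive step, e.g. a full index build
        self._entries: Dict[Path, Tuple[Fingerprint, T]] = {}
        self.builds = 0      # counts how often the expensive step actually ran

    def get(self, repo: Union[str, Path]) -> T:
        root = Path(repo).resolve()
        fingerprint = repo_fingerprint(root)
        cached = self._entries.get(root)
        if cached is not None and cached[0] == fingerprint:
            return cached[1]  # warm hit: only the stat calls were paid
        index = self._build(root)
        self.builds += 1
        self._entries[root] = (fingerprint, index)
        return index
```

A caller would construct one IndexCache per process, passing the index builder as the callback, and route every tool call through get(). A warm hit still walks and stats the tree on every call; that is the deliberate trade-off, since the walk is what catches edits without a file watcher.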

"Deployed" is a claim — verify it

Here's the part of the story I didn't expect to write. The cache fix was implemented and I believed it was deployed. When I measured the live instance, nothing had changed. Checking the chain end to end: the code existed only as uncommitted changes in a local worktree; GitHub main didn't have it; PyPI was two versions behind; and wayfinder's Dockerfile installed the dependency from git+...@main — so even a successful rebuild had faithfully shipped the old code.

The fix for that class of failure is the same discipline as everything else in this stack: commit, push, and pin the dependency to a commit SHA in the Dockerfile. The pin doubles as a Docker layer-cache bust — bumping the SHA forces the install layer to rebuild, so "it built" and "it shipped" can't quietly diverge again.
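One way to check the claim from inside the running image is to read what pip recorded at install time. For a git+ install, pip writes a direct_url.json file (PEP 610) whose vcs_info.commit_id is the commit it actually resolved. The sketch below compares that against the pinned SHA. It is not part of wayfinder as far as this post describes: the helper names are hypothetical, and the distribution name passed to check_distribution is an assumption.

```python
import json
from importlib import metadata
from typing import Optional


def installed_commit(direct_url_json: Optional[str]) -> Optional[str]:
    """Extract vcs_info.commit_id from the text of a PEP 610 direct_url.json."""
    if not direct_url_json:
        return None  # no file recorded, e.g. the package came from an index
    info = json.loads(direct_url_json)
    return info.get("vcs_info", {}).get("commit_id")


def matches_pin(commit_id: Optional[str], pinned_sha: str) -> bool:
    """True if the installed commit starts with the (possibly abbreviated) pin."""
    return bool(commit_id) and bool(pinned_sha) and commit_id.startswith(pinned_sha)


def check_distribution(dist_name: str, pinned_sha: str) -> bool:
    """Compare the installed distribution's recorded commit with the pin.

    dist_name is whatever name the dependency is installed under; treat
    "mcp-ast-explorer" as an assumption rather than a confirmed name.
    """
    text = metadata.distribution(dist_name).read_text("direct_url.json")
    return matches_pin(installed_commit(text), pinned_sha)
```

Run at container start or in a smoke test, a call such as check_distribution("mcp-ast-explorer", "8126b9f") would report whether the code in the image is the commit the Dockerfile claims to pin.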

Results, measured on production

Takeaways

Evidence for the numbers above: run-by-run timeline and the raw production run records (verbatim GET /runs snapshot). The cache: mcp-ast-explorer commit 8126b9f. The pin: wayfinder commit b3e9b4b. Or ask my homepage chatbot — it retrieves from the projects' real docs and cites its sources.