Learnings Log¶
A growing record of what we actually learned about the system, the models, and the pedagogy as the wiki was built. Distinct from the auto-generated Agent Changelog (per-page runs) — this is a human-curated record of sessions where something non-obvious surfaced and is worth carrying forward.
2026-05-27 (late) — Closing the loop: retrospective agent, hallucination defenses, knockout selection¶
Goal of the session: validate the closed-loop is actually a loop — that the system can write pages, reflect on them, seed its own next sprint, and keep cycling without a human in the queue.
What worked
- The loop closed end-to-end. Six retrospective cycles ran in one day,
cycle #1 (~50 pages) → manual tests → cycle #6 (22 pages from a fresh
sprint). Each cycle's scrum retro auto-seeded the next: 7 stubs after
cycle #5, 10 stubs after cycle #6 — covering
policy-evaluation,representation-learning,quantization-aware-training,normalization,gradient-descent,mixed-precision-training, and others. All of those topics surfaced because the retrospective spotted them being referenced across 2+ existing pages without backing files. The system is now growing its own knowledge graph. - Knockout selector earned its keep. Of 22 sprint-2 topics, 5 improve
passes regressed (
data-parallelism0.75 → 0.61,policy-evaluation0.81 → 0.60,representation-learning0.81 → 0.74,backpropagation0.81 → 0.76,flash-attention0.77 → 0.70). The knockout selector retained the higher-confidence first draft in every case — no shipped page got worse on the improve pass. Without it, the revision spiral would have actively degraded the wiki. - Hallucination defense at three layers works. (1) Regex catches obvious
fakes (future YYMM, implausibly-high suffix). (2) HTTP HEAD against
arxiv.org/abs/<id>catches well-formed phantom IDs by matching the sentinel title[<id>] Article identifier not recognized. (3) The LLM reviewer's own issues list is post-scanned for hallucination keywords and applies a hard −0.10 penalty. The HEAD verifier is cached toagents/runs/arxiv_cache.jsonso every ID is hit at most once — current cost is well under 1s per page for a typical 5-citation draft. - Cost held at $0.18/page. Sprint 2 was 22 topics + 22 improve passes
- 8 critics × 30 invocations + 6 retrospective runs for ~$7. Per-page economy is genuinely linear at this scale; nothing about the closed loop added overhead.
What we got wrong — and corrected
- We initially flagged
arxiv:2603.01761as a "phantom citation" because the retrospective reported it appeared 6× across 4 different pages. The HEAD verifier confirmed it's real — "Modular Memory is the Key to Continual Learning Agents" (Dorovatas et al. 2026). Citation concentration is not the same as fabrication. A real seminal paper legitimately cited by multiple pages will cluster the same way. The retrospective should keep flagging concentration (it's a hint), but the conclusion has to come from the HEAD check, not from the count. - The first version of the regex validator only blocked future-dated YYMM.
That missed
2605.21058(current month, implausible counter 21058 for May 27 2026), which the LLM reviewer caught as suspicious but didn't hard-fail. The fix: add asuffix > 25000check for the current and previous YYMM, plus the LLM-keyword post-scan. Layered defense — the regex is cheap, the HEAD verifier is authoritative, the LLM is the third opinion.
Universal weakest signal across all 10 tracks
critic-info-architecture is the lowest-scoring critic on every track
in cycle 6 (range 0.38 in 05-stat-prob / 08-causal up to 0.65 in 09-systems).
The recurring complaint: pages are missing a "Where this concept appears"
section linking back to arcs, and a "Connected topics" section linking
sideways to peer concepts. This is the next big lever — fixing the writer
prompt to mandate both sections should lift every track's avg confidence
by ~0.05–0.08.
What we changed today
agents/src/frontier_agents/retrospective.py— new scrum retro agent (aggregator + LLM proposer + safe auto-applier), 720 LOC, wired into the scheduler.agents/src/frontier_agents/nodes.py— H1 fixer inwrite_file_node, layered arxiv validator + HEAD verifier + LLM-keyword post-scan inreview_node.docs/system/architecture.md— rewritten as a closed-loop blueprint; the retrospective is now part of the per-cycle diagram, not a "planned" item in the gap list.docs/overrides/home.html— hero rewritten with Made-to-Stick framing around the MVB promise, plus a four-node closed-loop showcase strip.
What's next
- Persona/prompt update so the writer always emits "Where this concept appears" and "Connected topics" — closes the universal weakest-critic gap.
- Author pages for the most-cited researchers (
2604.15469topped this cycle at 8 citations; an anchor page would let every concept page link back to one canonical author write-up). - A
MAX_CYCLESorSTOP_WHEN_COVERAGE_PCTenv var so the server has a declarative stop rule instead of relying on the budget envelope alone.
2026-05-27 — Local-mode reality check on M4 24GB unified RAM¶
Goal of the session: validate that the FAIRE pipeline can run end-to-end against a local LLM (MLX on Apple Silicon) — no cloud spend, no API budget pressure, full offline.
What worked:
- The integration is a one-env-var swap. Set
OPENAI_API_BASE=http://127.0.0.1:8081/v1, change the role MODEL names to local IDs, restartstart.sh. Zero code branches. ChatOpenAI speaks the standard OpenAI Chat Completions protocol that MLX/Ollama/vLLM/LMStudio all implement. - Full FAIRE pipeline (
research → plan → checklist → write → link → review → log) runs againstmlx-community/gemma-3-4b-it-qat-4bitin ~4 min per page on a 24GB M4 base — comparable to cloud wall-clock, at $0 cost. - The v2 structural template lands cleanly: all 6 sections + 5-persona Build it variants. The page-shape works regardless of which model produces it.
What didn't work — and the four real causes:
The 4B model produces structurally-correct but factually unreliable pages. Concrete failures we observed:
| Failure mode | Concrete example |
|---|---|
| Hallucinated arxiv IDs with future-year prefixes | arxiv.org/abs/2604.16324, arxiv.org/abs/2512.22473v4 — invented IDs the model can't actually verify |
| Context bleed from scratch_pad | The Bayesian-inference page's "Where the field is now" section cited FLUX.1, DDPM, Latent Diffusion — none of which are Bayesian-inference papers |
| Missing equations on math-heavy topics | The do-calculus page produced zero LaTeX equations on a topic literally defined by 3 inference rules |
| H1 format drift | Model consistently emits ## Topic Name (H2) where the schema requires # Topic Name (H1) |
| Lenient self-review | Reviewer (also 4B) scores its writer-self at 0.90 confidence on pages with the above defects, doesn't catch them |
| Build it bleed | Backpropagation page's Build it section described training Stable Diffusion on CIFAR-10 — wrong topic entirely |
The causes, ranked by leverage:
- Parameter count. A 4B model has dramatically less factual memory than a ~100B+ cloud model. It knows the shape of a topic but can't recall specific papers, equations, or model IDs.
- Training cutoff. Gemma 3 stops in March 2025. When asked for a recent paper, it invents a future-year arxiv ID instead of admitting ignorance.
- Context confusion under 17K-token prompts. FAIRE's writer prompt is large (SCHEMA.md slice + WRITE_INSTRUCTIONS + scratch_pad). The 4B model's attention struggles to keep "this fact is about Topic A, that one is about adjacent Topic B" cleanly separated and mashes them.
- Same-model reviewer. The reviewer is also 4B Gemma. Same-model self-review rarely catches its own writer's mistakes.
Gemma 4 on MLX is not usable yet (as of mlx-lm 0.31.3, May 2026):
All three Gemma 4 quants we tested (gemma-4-e2b-it-4bit,
gemma-4-e4b-it-4bit, gemma-4-e4b-it-8bit) load partially — missing
v_proj/k_proj/k_norm weights for the upper transformer layers — and
completions hang silently. The model registers at /v1/models but inference
never returns. This is a mlx-lm framework gap, not a quant bug. Use Gemma 3
QAT variants (Google's own QAT weights, more reliable than community quants)
until mlx-lm ships proper Gemma 4 support.
Memory budgeting on 24GB:
Gemma 3 27B QAT works for a single request but OOMs once FAIRE's pipeline drives multiple LLM calls in a cycle. Gemma 3 12B QAT survives single requests but OOMs on the 17K-token writer prompt + parallel critics. Gemma 3 4B QAT is the realistic ceiling on 24GB unified memory with FAIRE's current prompt shape. To go larger requires either an M4 Pro 48GB+ or a meaningful prompt trim.
The trim knobs that made it work:
CRITIC_PANEL_DISABLE=true # skip 8-way critic fan-out (saves ~5GB KV cache duplication)
SPRINT_WORKERS=1 # serial pages; parallel writer calls = Metal OOM
SCHEMA_PROMPT_BYTES=4000 # was 12000; trim agents/SCHEMA.md slice
SCRATCH_PAD_BYTES=8000 # was effectively unlimited
All four default to current cloud values — zero impact on the cloud loop.
The honest recommendation that came out of this:
Local mode is fit for prototyping and structure validation, not for production wiki content as it stands. The pipeline works; the 4B model output isn't publishable because half the citations are fake and adjacent topics bleed in. Three usable paths:
- Hybrid mode (recommended): keep
WRITER_MODEL=openai/gpt-5.1-codex-mini(cloud), route reviewer + critics + research to local. Cuts ~50% of cloud spend without giving up writer accuracy. - Bigger local hardware: M4 Pro 48GB lets you run Gemma 3 12B QAT comfortably; M4 Max 64GB+ runs the 27B. Quality recovers significantly.
- Deterministic post-write validator (TODO): add a no-LLM step in
review_nodethat pings the arxiv API for every citation in the draft and strips/flags ones that don't resolve. Defends against hallucination regardless of model size; benefits cloud-generated pages too.
What we shipped this session:
agents/src/frontier_agents/llm.py—OPENAI_API_BASEenv-overrideable base URLagents/.env.example— local-mode block, four trim knobs documentedscripts/local-setup.sh— one-time MLX install + model download + server startdocs/system/local-mode.md— runtime guide with model recommendations per M4 tieragents/tests/test_local_gemma4.py— 7-test harness (unit + integration) for any local MLX serveragents/src/frontier_agents/nodes.py— four trim knobs (CRITIC_PANEL_DISABLE,CRITIC_PANEL_WORKERS,SCHEMA_PROMPT_BYTES,SCRATCH_PAD_BYTES)
See commits 31df99d, 7ab29ce, and the README "Local-mode" section for the
shipped artifacts.
This log is human-written. The auto-generated per-page record lives in the Agent Changelog.