diff --git a/docs/NARRATIVE.md b/docs/NARRATIVE.md index 87b1caa..07f8ad6 100644 --- a/docs/NARRATIVE.md +++ b/docs/NARRATIVE.md @@ -394,15 +394,183 @@ Expected total: ~46,000-48,000 paragraphs at ~93-95% label accuracy. --- -## Running Cost Ledger +## Phase 8: Pre-Training Strategy — DAPT + TAPT -| Phase | Cost | Notes | -|-------|------|-------| -| Stage 1 production run | $115.88 | 150,009 annotations, 0 failures | -| Stage 1 prompt iteration (pilots) | ~$15 | 12+ versions × 500-sample pilots | -| Judge benchmarking | ~$5 | 8 models × 50-sample gold set | -| Judge prompt iteration | ~$3 | Ongoing | -| **Total to date** | **~$139** | | +### The Decision: Own Filings Over PleIAs/SEC + +For domain-adaptive pre-training (DAPT), we needed a corpus of clean SEC filing text. Two options: + +1. **PleIAs/SEC** (373K full 10-K texts on HuggingFace, going back years, CC0 license) — massive but uncleaned, and a single training pass on ~18B tokens would take weeks on a single RTX 3090. +2. **Our own ~9,000 cached filings** (FY2023-2024, HTML already downloaded during extraction) — smaller but recent, relevant, and we already have the HTML cleaning pipeline. + +We chose option 2. The reasoning: + +- **Recency > volume.** Item 1C didn't exist before FY2023. The cybersecurity disclosure vocabulary, boilerplate patterns, and regulatory framing are all new to this filing cycle. Pre-2023 filings teach the model general SEC language, which ModernBERT already knows from its general pre-training. The marginal value of historical filings is low for our specific task. +- **The scaling laws paper says stop early.** SEC filing scaling laws (arXiv:2512.12384) show the largest DAPT gains in the first 200M tokens, with diminishing returns after. Our 9,000 full filings yield ~450M tokens — already in the sweet spot. 
+- **We control the cleaning quality.** Our `stripHtml()` pipeline handles all the HTML artifacts we fought during extraction (XBRL tags, entity encoding, page breaks, inline element word splits). PleIAs/SEC is a black box — we'd need to audit it anyway. +- **Feasibility on a 3090.** 450M tokens: ~2-3 days. 18B tokens: weeks. Single GPU means we need to be strategic about compute allocation. + +The DAPT corpus preparation is simple: run the existing `stripHtml()` on cached filing HTML (full text, skipping the Item 1C section extraction step) and output clean text as sharded JSONL. + +### Adding TAPT: "Don't Stop Pretraining" + +Gururangan et al. (2020) "Don't Stop Pretraining" demonstrated that task-adaptive pre-training (TAPT) — continued MLM on the unlabeled task data specifically — gives consistent gains on top of DAPT, especially when the task distribution differs from the broader domain. + +Item 1C is a very specific subset of SEC filings. It has its own vocabulary (CISO, NIST CSF, tabletop exercises, materiality assessments), structure (governance → management → process → strategy is a common paragraph sequence), and boilerplate patterns that differ substantially from the rest of a 10-K. TAPT teaches the model this specific distribution before we ask it to classify. + +The cost is negligible: our 72K paragraphs from `paragraphs-clean.jsonl` are already clean text (~5-10M tokens). TAPT takes 2-3 hours on a 3090 — essentially free compared to DAPT. + +### The Training Pipeline + +``` +ModernBERT-large (base, 395M params) + → DAPT on 9K full 10-K filings (~450M tokens, ~2-3 days) → SEC-ModernBERT-large + → TAPT on 72K Item 1C paragraphs (~10M tokens, ~2-3 hours) → SEC-cyBERT-large + → Fine-tune on labeled data with dual classification heads → Final classifier +``` + +This gives us clean ablation rows: base → +DAPT → +TAPT → +SCL, isolating the contribution of each step. 
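The compute budget behind these choices can be sanity-checked with quick arithmetic. A minimal Python sketch, assuming ~50K tokens per full 10-K and ~2,000 tokens/sec of single-3090 MLM training throughput for a ~400M-param encoder (both illustrative figures, not measurements):

```python
# Back-of-envelope DAPT budget: corpus size and single-GPU wall clock.
# ASSUMPTIONS (illustrative, not measured): ~50K tokens per full 10-K,
# ~2,000 tokens/sec sustained MLM throughput on one RTX 3090.

def dapt_budget(n_filings, tokens_per_filing=50_000, tokens_per_sec=2_000):
    total_tokens = n_filings * tokens_per_filing
    days_per_epoch = total_tokens / tokens_per_sec / 86_400
    return total_tokens, days_per_epoch

own_tokens, own_days = dapt_budget(9_000)       # our cached FY2023-2024 filings
full_tokens, full_days = dapt_budget(373_000)   # PleIAs/SEC scale (~18B tokens)

print(f"own corpus:  {own_tokens/1e6:.0f}M tokens, ~{own_days:.1f} days/epoch")
print(f"PleIAs/SEC:  {full_tokens/1e9:.1f}B tokens, ~{full_days:.0f} days/epoch")
```

Under these assumptions one epoch over the in-house corpus costs days while one epoch over the full PleIAs/SEC dump costs months, which is the compute half of the argument independent of recency.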
+
+---
+
+## Cost and Time Ledger
+
+### Tooling
+
+All code was written collaboratively with **Claude Code** (Anthropic's agentic coding CLI). Claude Code was used throughout the project for pipeline development, prompt engineering, data analysis, script writing, documentation, and strategic planning. The tool dramatically accelerated iteration — writing analysis scripts, debugging extraction edge cases, and exploring the annotation data interactively — but all decisions were made by the team, with Claude Code as an implementation partner.
+
+### API Cost Ledger
+
+| Phase | Cost | Annotations | Notes |
+|-------|------|-------------|-------|
+| Stage 1 prompt iteration (pilots) | $7.03 | 9,597 | 12+ versions: 5 × 40-sample + 6 × 500-sample |
+| Stage 1 model bench (6 candidates) | $3.41 | 2,993 | seed, mimo, glm-4.5-air, minimax, mistral, nemotron |
+| Mimo pilot (dedicated comparison) | $0.24 | 500 | `mimo-pilot.ts` — replace-nano scenario modeling |
+| Stage 1 run #1 (with nano) | $112.42 | 150,009 | Full production run with gpt-5.4-nano. Completed, but nano's quality was unacceptable (0 reasoning tokens 64% of the time). Gemini+grok annotations ($91.18) preserved in `stage1-gemini-grok.jsonl`; only nano's annotations ($21.24) were discarded. Full original in `stage1.jsonl.bak`. |
+| Stage 1 run #2 (mimo only) | $24.69 | 50,003 | Ran only mimo to replace nano. Merged with preserved gemini+grok annotations to form final `stage1.jsonl` ($115.88 total value, $24.69 new spend). |
+| Judge model bench (8 candidates) | $5.97 | 505 | GLM-5 (4 configs), gpt-5.4-mini, gpt-5.4, sonnet-4.6, gemini-3-flash, grok-4.20, mimo-v2-pro, kimi-k2.5 |
+| **Total API spend** | **~$154** | **~213K unique** | Sum of rows above ($153.76); nano waste: $21.24 |
+
+Only nano's portion ($21.24) of the first run was wasted — the gemini and grok annotations were preserved and merged with the new mimo annotations. Still, $21.24 was thrown away on a model that wasn't thinking. 
The lesson: benchmark model candidates rigorously *before* committing to a production run. The 40-sample pilots showed nano was the weakest link but were misleadingly optimistic about the magnitude of the problem.
+
+### Time Ledger
+
+| Phase | Hours | Notes |
+|-------|-------|-------|
+| Data acquisition + HTML cleaning | ~6h | Extraction pipeline, HTML artifact handling, dedup, 8-K discovery. The messiest phase — SEC filing HTML variability required extensive regex heuristics and iteration. |
+| Stage 1 annotation run #1 (nano) | ~5h | Production run wall clock (~300 min). Completed but results were below quality bar. |
+| Stage 1 annotation run #2 (mimo) | ~1h | Only needed mimo annotations at higher concurrency (gemini+grok reused). |
+| Prompt iteration + model benchmarking | ~4h | 12+ prompt versions, 6 model candidates, pilot analysis |
+| Post-Stage 1 analysis + Stage 2 planning | ~5h | Distributional analysis, model bias discovery, codebook v3.0 rulings, judge benchmarking, strategy revision |
+| Documentation + narrative | ~2h | Codebook updates, narrative writing, technical guide updates |
+| **Total to date** | **~23h** | |
+
+### Remaining Work (estimated)
+
+| Phase | Est. Hours | Est. Cost |
+|-------|-----------|-----------|
+| Human labeling site + 1,200 labels | ~8-10h | $0 (team labor) |
+| Stage 2 judge production run (~3-5K paragraphs) | ~1h | ~$20-40 |
+| Training data assembly | ~2h | $0 |
+| DAPT pre-training | ~48-72h GPU | $0 (own 3090) |
+| TAPT pre-training | ~2-3h GPU | $0 |
+| Fine-tuning + ablations (7 experiments) | ~12-20h GPU | $0 |
+| Full GenAI benchmark on 1,200 holdout (9 models) | ~1h | ~$30-50 |
+| Evaluation + comparison + write-up | ~6-8h | $0 |
+
+---
+
+## Model Census — Every Model We Tried
+
+Over the course of the project, we evaluated **25 distinct models** across the phases below: the initial smoke test, panel selection and prompt iteration, the Stage 1 replacement bench, and the Stage 2 judge bench. 
Each decision narrowed the field based on empirical evidence.
+
+### Phase 0: Smoke Test (model-probe.ts) — 9 candidates
+
+Tested basic structured output compliance on a single paragraph before committing to expensive benchmarks.
+
+| Model | Provider | Result |
+|-------|----------|--------|
+| google/gemini-3.1-flash-lite-preview | Google | ✅ Pass — selected for panel |
+| x-ai/grok-4.1-fast | xAI | ✅ Pass — selected for panel |
+| openai/gpt-4.1-mini | OpenAI | ✅ Pass — not selected (cost) |
+| openai/gpt-4.1-nano | OpenAI | ✅ Pass — later replaced by gpt-5.4-nano |
+| anthropic/claude-haiku-4.5 | Anthropic | ✅ Pass — not selected (cost tier) |
+| google/gemini-3.1-flash-preview | Google | ✅ Pass — too expensive for Stage 1 |
+| deepseek/deepseek-chat-v3-0324:free | DeepSeek | Tested — free tier limitations |
+| meta-llama/llama-4-maverick | Meta | Tested |
+| qwen/qwen3-235b-a22b | Alibaba | Tested |
+
+### Phase 1: Early Pilots (v1.0-v1.2) — Original panel
+
+The very first panel used **gpt-oss-120b** (OpenAI's open-source 120B model), not nano:
+- `google/gemini-3.1-flash-lite-preview`
+- `openai/gpt-oss-120b` (also tested with `:exacto` routing suffix)
+- `x-ai/grok-4.1-fast`
+
+gpt-oss-120b was replaced by gpt-5.4-nano between v1.2 and v2.1 — nano was cheaper and appeared to perform comparably on the small (n=40) pilot samples.
+
+### Phase 2: 500-Sample Pilots (v2.2-v2.7) — Nano era
+
+Panel during the main prompt iteration:
+- `google/gemini-3.1-flash-lite-preview`
+- `openai/gpt-5.4-nano` ← the problem model
+- `x-ai/grok-4.1-fast`
+
+Nano's issues (0 reasoning tokens 64% of the time, erratic specificity) persisted throughout, but went undetected: quality swings across the 40→500 sample transition were attributed to prompt changes rather than to model inadequacy. 
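One cheap guardrail this suggests: compute each panel model's zero-reasoning rate directly from pilot logs before any production run. A minimal Python sketch; the record fields (`model`, `reasoning_tokens`) are hypothetical stand-ins for whatever the real pilot output actually contains:

```python
# Sketch: flag annotator models that frequently return zero reasoning tokens,
# the failure mode that sank nano. Field names are hypothetical, not the
# real pilot log schema.

def zero_reasoning_rate(records, model):
    """Fraction of a model's annotations produced with 0 reasoning tokens."""
    rows = [r for r in records if r["model"] == model]
    if not rows:
        return 0.0
    return sum(1 for r in rows if r["reasoning_tokens"] == 0) / len(rows)

pilot = [
    {"model": "nano", "reasoning_tokens": 0},
    {"model": "nano", "reasoning_tokens": 0},
    {"model": "nano", "reasoning_tokens": 412},
    {"model": "mimo", "reasoning_tokens": 1290},
]
rate = zero_reasoning_rate(pilot, "nano")
print(f"nano zero-reasoning rate: {rate:.0%}")  # 2 of 3 → 67%
```

A hard threshold on this rate (e.g. reject any candidate above a few percent) would have disqualified nano at the 40-sample stage, before prompt quality ever entered the picture.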
+ +### Phase 3: Stage 1 Replacement Bench (model-bench.ts) — 6 candidates + +After locking prompt v2.5, formally benchmarked replacements for nano: + +| Model | Provider | Reasoning Tokens | Cost/ann | Outcome | +|-------|----------|-----------------|----------|---------| +| xiaomi/mimo-v2-flash | Xiaomi | 1,346 | $0.00048 | **✅ Winner** — best value, lowest outlier rate | +| bytedance-seed/seed-2.0-lite | ByteDance | 658 | $0.00227 | Runner-up — highest accuracy but 4.7x more expensive | +| z-ai/glm-4.5-air | Zhipu AI | 854 | $0.00136 | Mediocre — barely moved the needle (+0.8pp) | +| minimax/minimax-m2.5 | MiniMax | 590 | $0.00106 | Mediocre — slightly worse than nano (-1.0pp) | +| mistralai/mistral-small-2603 | Mistral | **0** | $0.00015 | ❌ Zero reasoning tokens. Cheapest but useless. | +| nvidia/nemotron-3-super-120b-a12b | NVIDIA | 942 | $0.00152 | ❌ Worst performer despite being expensive. 21% outlier rate. | + +### Phase 4: Production Stage 1 — Final panel + +- `google/gemini-3.1-flash-lite-preview` (Google) +- `xiaomi/mimo-v2-flash` (Xiaomi) ← replaced nano +- `x-ai/grok-4.1-fast` (xAI) + +Three models from three providers — minimizes correlated errors. 
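The three-provider panel implies a simple consensus rule: accept unanimous or 2-of-3 labels, escalate three-way splits. A minimal Python sketch of that idea; the label values are illustrative and the real routing logic (and the codebook's categories) may differ:

```python
from collections import Counter

# Sketch of 3-model panel resolution: a majority label (2-of-3 or unanimous)
# is accepted; a three-way disagreement is escalated to the Stage 2 judge.
# Label strings are illustrative placeholders, not the real codebook values.

def resolve(labels):
    """labels: one label per panel model, e.g. [gemini, mimo, grok]."""
    (top, count), = Counter(labels).most_common(1)
    if count >= 2:
        return {"label": top, "escalate": False}
    return {"label": None, "escalate": True}  # 3-way split → judge decides

print(resolve(["governance", "governance", "process"]))  # majority wins
print(resolve(["governance", "process", "strategy"]))    # escalated
```

Using three providers matters precisely here: if two models share a systematic bias, "2-of-3 agreement" silently encodes that bias, so minimizing correlated errors keeps the escalation signal honest.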
+
+### Phase 5: Stage 2 Judge Bench (judge-bench.ts) — 8 candidates
+
+| Model | Provider | Mode | Both vs Gold | Fails | Outcome |
+|-------|----------|------|-------------|-------|---------|
+| z-ai/glm-5 | Zhipu AI | structured | 77-80% | 4-12% | Best accuracy but unreliable structured output |
+| z-ai/glm-5 | Zhipu AI | tool calling | 72% | 0% | Reliable but -7pp accuracy |
+| openai/gpt-5.4-mini | OpenAI | structured | 68% | 0% | Reliable, weaker on specificity |
+| openai/gpt-5.4 | OpenAI | structured | Tested | 0% | Expensive, diminishing returns over mini |
+| anthropic/claude-sonnet-4.6 | Anthropic | structured | Used for gold | 0% | Gold label creation, too expensive for production judge |
+| google/gemini-3-flash-preview | Google | structured | Tested | — | Rubber-stamped majority — added zero value |
+| x-ai/grok-4.20-beta | xAI | structured | Tested | — | Benchmarked |
+| xiaomi/mimo-v2-pro | Xiaomi | structured | Tested | — | Benchmarked |
+| moonshotai/kimi-k2.5 | Moonshot AI | structured | Tested | — | Only 26/50 completed — high failure rate |
+
+### Summary: 25 Models, 14 Providers
+
+| Provider | Models Tested | Models in Production |
+|----------|--------------|---------------------|
+| Google | gemini-3.1-flash-lite, gemini-3.1-flash, gemini-3-flash | gemini-3.1-flash-lite (Stage 1) |
+| OpenAI | gpt-oss-120b, gpt-5.4-nano, gpt-4.1-mini, gpt-4.1-nano, gpt-5.4-mini, gpt-5.4 | — (nano dropped) |
+| xAI | grok-4.1-fast, grok-4.20-beta | grok-4.1-fast (Stage 1) |
+| Xiaomi | mimo-v2-flash, mimo-v2-pro | mimo-v2-flash (Stage 1) |
+| Anthropic | claude-haiku-4.5, claude-sonnet-4.6 | sonnet-4.6 (gold labels) |
+| Zhipu AI | glm-4.5-air, glm-5 | TBD (Stage 2 judge) |
+| ByteDance | seed-2.0-lite | — (too expensive for scale) |
+| NVIDIA | nemotron-3-super-120b | — (worst performer) |
+| Mistral | mistral-small-2603 | — (zero reasoning) |
+| MiniMax | minimax-m2.5 | — (mediocre) |
+| Moonshot AI | kimi-k2.5 | — (high failure rate) |
+| Meta | 
llama-4-maverick | — (smoke test only) | +| Alibaba | qwen3-235b-a22b | — (smoke test only) | +| DeepSeek | deepseek-chat-v3-0324 | — (smoke test only) | --- diff --git a/docs/TECHNICAL-GUIDE.md b/docs/TECHNICAL-GUIDE.md index 13cc48e..ef11b4c 100644 --- a/docs/TECHNICAL-GUIDE.md +++ b/docs/TECHNICAL-GUIDE.md @@ -209,7 +209,9 @@ const result = await generateObject({ **Step 1 — Domain-Adaptive Pre-Training (DAPT):** Continue MLM pre-training on SEC filing text to create "SEC-ModernBERT-large": -- **Training corpus:** 200-500M tokens from PleIAs/SEC or own EDGAR download. Include 10-Ks, 10-Qs, 8-Ks, proxy statements. +- **Training corpus:** ~450M tokens from our own 9,000 cached 10-K filings (FY2023-2024), full filing text (not just Item 1C). These are the same filings we extracted Item 1C from — we already have the raw HTML cached locally and the cleaning pipeline built (`stripHtml()` in `fast-reparse.ts`). +- **Why not PleIAs/SEC?** The PleIAs/SEC dataset (373K filings, ~18B tokens) goes back much further in time, but: (a) one pass would take weeks on a single 3090, (b) Item 1C didn't exist before FY2023 so pre-2023 filings lack the cybersecurity disclosure vocabulary that matters most for our task, (c) the SEC filing scaling laws paper (arXiv:2512.12384) shows the largest gains come in the first 200M tokens — our 450M from recent, relevant filings is already in the sweet spot. +- **Corpus preparation:** `dapt-corpus.ts` runs `stripHtml()` on cached filing HTML (full text, no section extraction) and outputs clean text as sharded JSONL. Same HTML cleaning that handles XBRL tags, entity decoding, page artifacts, inline element boundaries — just without the Item 1C section boundary step. 
- **MLM objective:** 30% masking rate (ModernBERT convention) - **Learning rate:** ~5e-5 (search range: 1e-5 to 1e-4) - **Hardware (RTX 3090):** bf16, gradient checkpointing, seq_len=1024-2048, batch_size=2-4 + gradient accumulation to effective batch 16-32 @@ -217,8 +219,16 @@ Continue MLM pre-training on SEC filing text to create "SEC-ModernBERT-large": - **Duration:** ~2-3 days on single 3090 - **Framework:** HuggingFace Trainer + `DataCollatorForLanguageModeling` (Python script, not notebook) -**Evidence DAPT works:** -- Gururangan et al. (2020): consistent improvements across all tested domains +**Step 1b — Task-Adaptive Pre-Training (TAPT):** + +After DAPT, continue MLM pre-training specifically on the task distribution to create "SEC-cyBERT-large": +- **Training corpus:** 72K Item 1C paragraphs from `paragraphs-clean.jsonl` (~5-10M tokens). Already clean text — no additional processing needed. +- **Why TAPT on top of DAPT?** Gururangan et al. (2020) "Don't Stop Pretraining" showed that TAPT gives consistent gains on top of DAPT, especially when the task distribution differs from the broader domain. Item 1C is a very specific subset of 10-K filings — it has its own vocabulary (CISO, NIST CSF, tabletop exercises, materiality assessments), structure (governance → management → process → strategy pattern), and boilerplate patterns that differ substantially from the rest of a 10-K. TAPT teaches the model this specific distribution. +- **Duration:** ~2-3 hours on a 3090 (tiny corpus vs DAPT) +- **Same hyperparameters as DAPT** (learning rate, masking rate, etc.) + +**Evidence DAPT + TAPT works:** +- Gururangan et al. (2020): DAPT + TAPT together outperforms either alone across all tested domains. TAPT alone gives 1-3 F1 points; combined with DAPT, gains are additive. 
- Clinical ModernBERT, BioClinical ModernBERT: successful continued MLM on medical text - Patent domain ModernBERT (arXiv:2509.14926): +0.9 to +2.8 F1 from continued pre-training on 31.6B tokens - SEC filing scaling laws (arXiv:2512.12384): consistent improvement, largest gains in first 200M tokens @@ -280,15 +290,16 @@ All BERT-base (110M params, 512 context) — architecturally outdated: ### 3.6 Ablation Design -| # | Experiment | Model | Context | DAPT | SCL | Purpose | -|---|-----------|-------|---------|------|-----|---------| -| 1 | Baseline | DeBERTa-v3-large | 512 | No | No | Standard approach per syllabus | -| 2 | + Long context | ModernBERT-large | 2048 | No | No | Context window benefit | -| 3 | + Domain adapt | SEC-ModernBERT-large | 2048 | Yes | No | DAPT benefit | -| 4 | + Contrastive | SEC-ModernBERT-large | 2048 | Yes | Yes | SCL benefit | -| 5 | Efficiency | NeoBERT (+ DAPT) | 2048 | Yes | Yes | 40% fewer params | -| 6 | Decoder | Qwen3.5 LoRA | 2048 | No | No | Encoder vs decoder | -| 7 | **Ensemble** | SEC-ModernBERT + DeBERTa | mixed | mixed | — | Maximum performance | +| # | Experiment | Model | Context | DAPT | TAPT | SCL | Purpose | +|---|-----------|-------|---------|------|------|-----|---------| +| 1 | Baseline | DeBERTa-v3-large | 512 | No | No | No | Standard approach per syllabus | +| 2 | + Long context | ModernBERT-large | 2048 | No | No | No | Context window benefit | +| 3 | + Domain adapt | SEC-ModernBERT-large | 2048 | Yes | No | No | DAPT benefit | +| 3b | + Task adapt | SEC-cyBERT-large | 2048 | Yes | Yes | No | TAPT benefit (Gururangan et al.) | +| 4 | + Contrastive | SEC-cyBERT-large | 2048 | Yes | Yes | Yes | SCL benefit | +| 5 | Efficiency | NeoBERT (+ DAPT + TAPT) | 2048 | Yes | Yes | Yes | 40% fewer params | +| 6 | Decoder | Qwen3.5 LoRA | 2048 | No | No | No | Encoder vs decoder | +| 7 | **Ensemble** | SEC-cyBERT + DeBERTa | mixed | mixed | mixed | — | Maximum performance | ### 3.7 Hyperparameter Search (Autoresearch Pattern)