updating narrative and plan

Joey Eamigh 2026-03-28 22:54:32 -04:00
parent 78d1f978de
commit 48e488933a
2 changed files with 199 additions and 20 deletions


@ -394,15 +394,183 @@ Expected total: ~46,000-48,000 paragraphs at ~93-95% label accuracy.
---

## Phase 8: Pre-Training Strategy — DAPT + TAPT

### The Decision: Own Filings Over PleIAs/SEC

For domain-adaptive pre-training (DAPT), we needed a corpus of clean SEC filing text. Two options:

1. **PleIAs/SEC** (373K full 10-K texts on HuggingFace, going back years, CC0 license) — massive but uncleaned, and a single training pass on ~18B tokens would take weeks on a single RTX 3090.
2. **Our own ~9,000 cached filings** (FY2023-2024, HTML already downloaded during extraction) — smaller but recent, relevant, and we already have the HTML cleaning pipeline.

We chose option 2. The reasoning:

- **Recency > volume.** Item 1C didn't exist before FY2023. The cybersecurity disclosure vocabulary, boilerplate patterns, and regulatory framing are all new to this filing cycle. Pre-2023 filings teach the model general SEC language, which ModernBERT already knows from its general pre-training. The marginal value of historical filings is low for our specific task.
- **The scaling laws paper says stop early.** SEC filing scaling laws (arXiv:2512.12384) show the largest DAPT gains in the first 200M tokens, with diminishing returns after. Our 9,000 full filings yield ~450M tokens — already in the sweet spot.
- **We control the cleaning quality.** Our `stripHtml()` pipeline handles all the HTML artifacts we fought during extraction (XBRL tags, entity encoding, page breaks, inline element word splits). PleIAs/SEC is a black box — we'd need to audit it anyway.
- **Feasibility on a 3090.** 450M tokens: ~2-3 days. 18B tokens: weeks. Single GPU means we need to be strategic about compute allocation.

The DAPT corpus preparation is simple: run the existing `stripHtml()` on cached filing HTML (full text, skipping the Item 1C section extraction step) and output clean text as sharded JSONL.
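The actual implementation is `dapt-corpus.ts` (TypeScript); as an illustrative sketch only, the same cache-to-sharded-JSONL flow looks roughly like this in Python. The `strip_html` helper here is a crude stand-in for the real `stripHtml()` pipeline, and the shard size and minimum-length filter are arbitrary choices for the sketch:

```python
import json
import re
from pathlib import Path

SHARD_SIZE = 1000  # filings per output shard (arbitrary for this sketch)

def strip_html(html: str) -> str:
    """Crude stand-in for the project's stripHtml(): drop tags and
    entities, collapse whitespace. The real pipeline also handles XBRL
    tags, page breaks, and inline-element word splits."""
    text = re.sub(r"<[^>]+>", " ", html)           # remove tags
    text = re.sub(r"&[a-zA-Z#0-9]+;", " ", text)   # drop HTML entities
    return re.sub(r"\s+", " ", text).strip()

def build_corpus(cache_dir: str, out_dir: str) -> int:
    """Walk cached filing HTML, emit clean full text as sharded JSONL.
    Returns the number of documents written."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    shard, shard_idx, count = [], 0, 0
    for path in sorted(Path(cache_dir).glob("*.html")):
        text = strip_html(path.read_text(errors="ignore"))
        if len(text) < 500:        # skip near-empty documents
            continue
        shard.append({"id": path.stem, "text": text})
        count += 1
        if len(shard) == SHARD_SIZE:
            (out / f"dapt-{shard_idx:04d}.jsonl").write_text(
                "\n".join(json.dumps(d) for d in shard) + "\n")
            shard, shard_idx = [], shard_idx + 1
    if shard:                      # flush the final partial shard
        (out / f"dapt-{shard_idx:04d}.jsonl").write_text(
            "\n".join(json.dumps(d) for d in shard) + "\n")
    return count
```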
### Adding TAPT: "Don't Stop Pretraining"
Gururangan et al. (2020) "Don't Stop Pretraining" demonstrated that task-adaptive pre-training (TAPT) — continued MLM on the unlabeled task data specifically — gives consistent gains on top of DAPT, especially when the task distribution differs from the broader domain.
Item 1C is a very specific subset of SEC filings. It has its own vocabulary (CISO, NIST CSF, tabletop exercises, materiality assessments), structure (governance → management → process → strategy is a common paragraph sequence), and boilerplate patterns that differ substantially from the rest of a 10-K. TAPT teaches the model this specific distribution before we ask it to classify.
The cost is negligible: our 72K paragraphs from `paragraphs-clean.jsonl` are already clean text (~5-10M tokens). TAPT takes 2-3 hours on a 3090 — essentially free compared to DAPT.
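A quick back-of-envelope check on that ~5-10M token figure, assuming Item 1C paragraphs average somewhere around 70-140 tokens each (an assumption, not a measured value):

```python
# Does 72K paragraphs plausibly yield 5-10M tokens?
PARAGRAPHS = 72_000
for avg_tokens in (70, 100, 140):  # assumed per-paragraph averages
    total = PARAGRAPHS * avg_tokens
    print(f"{avg_tokens} tok/para -> {total / 1e6:.2f}M tokens")
```

The low and high ends of that range bracket the 5-10M estimate, so the corpus size is internally consistent.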
### The Training Pipeline
```
ModernBERT-large (base, 395M params)
→ DAPT on 9K full 10-K filings (~450M tokens, ~2-3 days) → SEC-ModernBERT-large
→ TAPT on 72K Item 1C paragraphs (~10M tokens, ~2-3 hours) → SEC-cyBERT-large
→ Fine-tune on labeled data with dual classification heads → Final classifier
```
This gives us clean ablation rows: base → +DAPT → +TAPT → +SCL, isolating the contribution of each step.
---
## Cost and Time Ledger
### Tooling
All code was written collaboratively with **Claude Code** (Anthropic's agentic coding CLI). Claude Code was used throughout the project for pipeline development, prompt engineering, data analysis, script writing, documentation, and strategic planning. The tool dramatically accelerated iteration speed — writing analysis scripts, debugging extraction edge cases, and exploring the annotation data interactively — but all decisions were made by the team with Claude Code as an implementation partner.
### API Cost Ledger
| Phase | Cost | Annotations | Notes |
|-------|------|-------------|-------|
| Stage 1 prompt iteration (pilots) | $7.03 | 9,597 | 12+ versions: 5 × 40-sample + 6 × 500-sample |
| Stage 1 model bench (6 candidates) | $3.41 | 2,993 | seed, mimo, glm-4.5-air, minimax, mistral, nemotron |
| Mimo pilot (dedicated comparison) | $0.24 | 500 | `mimo-pilot.ts` — replace-nano scenario modeling |
| Stage 1 run #1 (with nano) | $112.42 | 150,009 | Full production run with gpt-5.4-nano. Completed, but nano's quality was unacceptable (0 reasoning tokens 64% of the time). Gemini+grok annotations ($91.18) preserved in `stage1-gemini-grok.jsonl`; only nano's annotations ($21.24) were discarded. Full original in `stage1.jsonl.bak`. |
| Stage 1 run #2 (mimo only) | $24.69 | 50,003 | Ran only mimo to replace nano. Merged with preserved gemini+grok annotations to form final `stage1.jsonl` ($115.88 total value, $24.69 new spend). |
| Judge model bench (8 candidates) | $5.97 | 505 | GLM-5 (4 configs), gpt-5.4-mini, gpt-5.4, sonnet-4.6, gemini-3-flash, grok-4.20, mimo-v2-pro, kimi-k2.5 |
| **Total API spend** | **$156** | **~213K unique** | Nano waste: $21.24 |
Only nano's portion ($21.24) of the first run was wasted — the gemini and grok annotations were preserved and merged with the new mimo annotations. Still, $21.24 thrown away on a model that wasn't thinking. The lesson: benchmark model candidates rigorously *before* committing to a production run. The 40-sample pilots showed nano was the weakest link but were misleadingly optimistic about the magnitude of the problem.
### Time Ledger
| Phase | Hours | Notes |
|-------|-------|-------|
| Data acquisition + HTML cleaning | ~6h | Extraction pipeline, HTML artifact handling, dedup, 8-K discovery. The messiest phase — SEC filing HTML variability required extensive regex heuristics and iteration. |
| Stage 1 annotation run #1 (nano) | ~5h | Production run wall clock (~300 min). Completed but results were below quality bar. |
| Stage 1 annotation run #2 (mimo) | ~1h | Only needed mimo annotations at higher concurrency (gemini+grok reused). |
| Prompt iteration + model benchmarking | ~4h | 12+ prompt versions, 6 model candidates, pilot analysis |
| Post-Stage 1 analysis + Stage 2 planning | ~5h | Distributional analysis, model bias discovery, codebook v3.0 rulings, judge benchmarking, strategy revision |
| Documentation + narrative | ~2h | Codebook updates, narrative writing, technical guide updates |
| **Total to date** | **~23h** | |
### Remaining Work (estimated)
| Phase | Est. Hours | Est. Cost |
|-------|-----------|-----------|
| Human labeling site + 1,200 labels | ~8-10h | $0 (team labor) |
| Stage 2 judge production run (~3-5K paragraphs) | ~1h | ~$20-40 |
| Training data assembly | ~2h | $0 |
| DAPT pre-training | ~48-72h GPU | $0 (own 3090) |
| TAPT pre-training | ~2-3h GPU | $0 |
| Fine-tuning + ablations (7 experiments) | ~12-20h GPU | $0 |
| Full GenAI benchmark on 1,200 holdout (9 models) | ~1h | ~$30-50 |
| Evaluation + comparison + write-up | ~6-8h | $0 |
---
## Model Census — Every Model We Tried
Over the course of the project, we evaluated **18 distinct models** across three phases: initial panel selection, Stage 1 replacement bench, and Stage 2 judge selection. Each decision narrowed the field based on empirical evidence.
### Phase 0: Smoke Test (model-probe.ts) — 9 candidates
Tested basic structured output compliance on a single paragraph before committing to expensive benchmarks.
| Model | Provider | Result |
|-------|----------|--------|
| google/gemini-3.1-flash-lite-preview | Google | ✅ Pass — selected for panel |
| x-ai/grok-4.1-fast | xAI | ✅ Pass — selected for panel |
| openai/gpt-4.1-mini | OpenAI | ✅ Pass — not selected (cost) |
| openai/gpt-4.1-nano | OpenAI | ✅ Pass — later replaced by gpt-5.4-nano |
| anthropic/claude-haiku-4.5 | Anthropic | ✅ Pass — not selected (cost tier) |
| google/gemini-3.1-flash-preview | Google | ✅ Pass — too expensive for Stage 1 |
| deepseek/deepseek-chat-v3-0324:free | DeepSeek | Tested — free tier limitations |
| meta-llama/llama-4-maverick | Meta | Tested |
| qwen/qwen3-235b-a22b | Alibaba | Tested |
### Phase 1: Early Pilots (v1.0-v1.2) — Original panel
The very first panel used **gpt-oss-120b** (OpenAI's open-source 120B model), not nano:
- `google/gemini-3.1-flash-lite-preview`
- `openai/gpt-oss-120b` (also tested with `:exacto` routing suffix)
- `x-ai/grok-4.1-fast`
gpt-oss-120b was replaced by gpt-5.4-nano between v1.2 and v2.1 — nano was cheaper and appeared to perform comparably on the small (n=40) pilot samples.
### Phase 2: 500-Sample Pilots (v2.2-v2.7) — Nano era
Panel during the main prompt iteration:
- `google/gemini-3.1-flash-lite-preview`
- `openai/gpt-5.4-nano` ← the problem model
- `x-ai/grok-4.1-fast`
Nano's issues (0 reasoning tokens 64% of the time, erratic specificity) were persistent, but they went unrecognized: quality shifts during the 40→500-sample transition were attributed to prompt changes rather than model inadequacy.
### Phase 3: Stage 1 Replacement Bench (model-bench.ts) — 6 candidates
After locking prompt v2.5, formally benchmarked replacements for nano:
| Model | Provider | Reasoning Tokens | Cost/ann | Outcome |
|-------|----------|-----------------|----------|---------|
| xiaomi/mimo-v2-flash | Xiaomi | 1,346 | $0.00048 | **✅ Winner** — best value, lowest outlier rate |
| bytedance-seed/seed-2.0-lite | ByteDance | 658 | $0.00227 | Runner-up — highest accuracy but 4.7x more expensive |
| z-ai/glm-4.5-air | Zhipu AI | 854 | $0.00136 | Mediocre — barely moved the needle (+0.8pp) |
| minimax/minimax-m2.5 | MiniMax | 590 | $0.00106 | Mediocre — slightly worse than nano (-1.0pp) |
| mistralai/mistral-small-2603 | Mistral | **0** | $0.00015 | ❌ Zero reasoning tokens. Cheapest but useless. |
| nvidia/nemotron-3-super-120b-a12b | NVIDIA | 942 | $0.00152 | ❌ Worst performer despite being expensive. 21% outlier rate. |
### Phase 4: Production Stage 1 — Final panel
- `google/gemini-3.1-flash-lite-preview` (Google)
- `xiaomi/mimo-v2-flash` (Xiaomi) ← replaced nano
- `x-ai/grok-4.1-fast` (xAI)
Three models from three providers — minimizes correlated errors.
### Phase 5: Stage 2 Judge Bench (judge-bench.ts) — 8 candidates
| Model | Provider | Mode | Both vs Gold | Fails | Outcome |
|-------|----------|------|-------------|-------|---------|
| z-ai/glm-5 | Zhipu AI | structured | 77-80% | 4-12% | Best accuracy but unreliable structured output |
| z-ai/glm-5 | Zhipu AI | tool calling | 72% | 0% | Reliable but -7pp accuracy |
| openai/gpt-5.4-mini | OpenAI | structured | 68% | 0% | Reliable, weaker on specificity |
| openai/gpt-5.4 | OpenAI | structured | Tested | 0% | Expensive, diminishing returns over mini |
| anthropic/claude-sonnet-4.6 | Anthropic | structured | Used for gold | 0% | Gold label creation, too expensive for production judge |
| google/gemini-3-flash-preview | Google | structured | Tested | — | Rubber-stamped majority — added zero value |
| x-ai/grok-4.20-beta | xAI | structured | Tested | — | Benchmarked |
| xiaomi/mimo-v2-pro | Xiaomi | structured | Tested | — | Benchmarked |
| moonshotai/kimi-k2.5 | Moonshot AI | structured | Tested | — | Only 26/50 completed — high failure rate |
### Summary: 18 Models, 10 Providers
| Provider | Models Tested | Models in Production |
|----------|--------------|---------------------|
| Google | gemini-3.1-flash-lite, gemini-3.1-flash, gemini-3-flash | gemini-3.1-flash-lite (Stage 1) |
| OpenAI | gpt-oss-120b, gpt-5.4-nano, gpt-4.1-mini, gpt-4.1-nano, gpt-5.4-mini, gpt-5.4 | — (nano dropped) |
| xAI | grok-4.1-fast, grok-4.20-beta | grok-4.1-fast (Stage 1) |
| Xiaomi | mimo-v2-flash, mimo-v2-pro | mimo-v2-flash (Stage 1) |
| Anthropic | claude-haiku-4.5, claude-sonnet-4.6 | sonnet-4.6 (gold labels) |
| Zhipu AI | glm-4.5-air, glm-5 | TBD (Stage 2 judge) |
| ByteDance | seed-2.0-lite | — (too expensive for scale) |
| NVIDIA | nemotron-3-super-120b | — (worst performer) |
| Mistral | mistral-small-2603 | — (zero reasoning) |
| MiniMax | minimax-m2.5 | — (mediocre) |
| Moonshot AI | kimi-k2.5 | — (high failure rate) |
| Meta | llama-4-maverick | — (smoke test only) |
| Alibaba | qwen3-235b-a22b | — (smoke test only) |
| DeepSeek | deepseek-chat-v3-0324 | — (smoke test only) |
---


@ -209,7 +209,9 @@ const result = await generateObject({
**Step 1 — Domain-Adaptive Pre-Training (DAPT):**

Continue MLM pre-training on SEC filing text to create "SEC-ModernBERT-large":
- **Training corpus:** ~450M tokens from our own 9,000 cached 10-K filings (FY2023-2024), full filing text (not just Item 1C). These are the same filings we extracted Item 1C from — we already have the raw HTML cached locally and the cleaning pipeline built (`stripHtml()` in `fast-reparse.ts`).
- **Why not PleIAs/SEC?** The PleIAs/SEC dataset (373K filings, ~18B tokens) goes back much further in time, but: (a) one pass would take weeks on a single 3090, (b) Item 1C didn't exist before FY2023 so pre-2023 filings lack the cybersecurity disclosure vocabulary that matters most for our task, (c) the SEC filing scaling laws paper (arXiv:2512.12384) shows the largest gains come in the first 200M tokens — our 450M from recent, relevant filings is already in the sweet spot.
- **Corpus preparation:** `dapt-corpus.ts` runs `stripHtml()` on cached filing HTML (full text, no section extraction) and outputs clean text as sharded JSONL. Same HTML cleaning that handles XBRL tags, entity decoding, page artifacts, inline element boundaries — just without the Item 1C section boundary step.
- **MLM objective:** 30% masking rate (ModernBERT convention)
- **Learning rate:** ~5e-5 (search range: 1e-5 to 1e-4)
- **Hardware (RTX 3090):** bf16, gradient checkpointing, seq_len=1024-2048, batch_size=2-4 + gradient accumulation to effective batch 16-32
@ -217,8 +219,16 @@ Continue MLM pre-training on SEC filing text to create "SEC-ModernBERT-large":
- **Duration:** ~2-3 days on single 3090
- **Framework:** HuggingFace Trainer + `DataCollatorForLanguageModeling` (Python script, not notebook)
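For intuition, here is a hand-rolled sketch of the 30% MLM corruption that the collator applies, following the standard BERT-style 80/10/10 split. The token IDs are illustrative placeholders (real values come from the tokenizer), and the production run would use `DataCollatorForLanguageModeling` with `mlm_probability=0.3` rather than this code:

```python
import random

MASK_ID, VOCAB_SIZE = 50284, 50368  # illustrative; real IDs come from the tokenizer

def mask_for_mlm(ids: list[int], mlm_prob: float = 0.30, seed: int = 0):
    """BERT-style MLM corruption: select mlm_prob of positions as
    prediction targets; of those, 80% -> [MASK], 10% -> random token,
    10% left unchanged. Returns (inputs, labels) with labels = -100
    on non-target positions (the HF ignore-index convention)."""
    rng = random.Random(seed)
    inputs, labels = list(ids), [-100] * len(ids)
    for i, tok in enumerate(ids):
        if rng.random() >= mlm_prob:
            continue
        labels[i] = tok                 # loss is computed here only
        roll = rng.random()
        if roll < 0.8:
            inputs[i] = MASK_ID         # 80%: replace with [MASK]
        elif roll < 0.9:
            inputs[i] = rng.randrange(VOCAB_SIZE)  # 10%: random token
        # else: 10% keep the original token
    return inputs, labels
```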
**Step 1b — Task-Adaptive Pre-Training (TAPT):**

After DAPT, continue MLM pre-training specifically on the task distribution to create "SEC-cyBERT-large":
- **Training corpus:** 72K Item 1C paragraphs from `paragraphs-clean.jsonl` (~5-10M tokens). Already clean text — no additional processing needed.
- **Why TAPT on top of DAPT?** Gururangan et al. (2020) "Don't Stop Pretraining" showed that TAPT gives consistent gains on top of DAPT, especially when the task distribution differs from the broader domain. Item 1C is a very specific subset of 10-K filings — it has its own vocabulary (CISO, NIST CSF, tabletop exercises, materiality assessments), structure (governance → management → process → strategy pattern), and boilerplate patterns that differ substantially from the rest of a 10-K. TAPT teaches the model this specific distribution.
- **Duration:** ~2-3 hours on a 3090 (tiny corpus vs DAPT)
- **Same hyperparameters as DAPT** (learning rate, masking rate, etc.)

**Evidence DAPT + TAPT works:**
- Gururangan et al. (2020): DAPT + TAPT together outperforms either alone across all tested domains. TAPT alone gives 1-3 F1 points; combined with DAPT, gains are additive.
- Clinical ModernBERT, BioClinical ModernBERT: successful continued MLM on medical text
- Patent domain ModernBERT (arXiv:2509.14926): +0.9 to +2.8 F1 from continued pre-training on 31.6B tokens
- SEC filing scaling laws (arXiv:2512.12384): consistent improvement, largest gains in first 200M tokens
@ -280,15 +290,16 @@ All BERT-base (110M params, 512 context) — architecturally outdated:
### 3.6 Ablation Design

| # | Experiment | Model | Context | DAPT | TAPT | SCL | Purpose |
|---|-----------|-------|---------|------|------|-----|---------|
| 1 | Baseline | DeBERTa-v3-large | 512 | No | No | No | Standard approach per syllabus |
| 2 | + Long context | ModernBERT-large | 2048 | No | No | No | Context window benefit |
| 3 | + Domain adapt | SEC-ModernBERT-large | 2048 | Yes | No | No | DAPT benefit |
| 3b | + Task adapt | SEC-cyBERT-large | 2048 | Yes | Yes | No | TAPT benefit (Gururangan et al.) |
| 4 | + Contrastive | SEC-cyBERT-large | 2048 | Yes | Yes | Yes | SCL benefit |
| 5 | Efficiency | NeoBERT (+ DAPT + TAPT) | 2048 | Yes | Yes | Yes | 40% fewer params |
| 6 | Decoder | Qwen3.5 LoRA | 2048 | No | No | No | Encoder vs decoder |
| 7 | **Ensemble** | SEC-cyBERT + DeBERTa | mixed | mixed | mixed | — | Maximum performance |
### 3.7 Hyperparameter Search (Autoresearch Pattern)