diff --git a/docs/NARRATIVE.md b/docs/NARRATIVE.md index 87b1caa..07f8ad6 100644 --- a/docs/NARRATIVE.md +++ b/docs/NARRATIVE.md @@ -394,15 +394,183 @@ Expected total: ~46,000-48,000 paragraphs at ~93-95% label accuracy. --- -## Running Cost Ledger +## Phase 8: Pre-Training Strategy — DAPT + TAPT -| Phase | Cost | Notes | -|-------|------|-------| -| Stage 1 production run | $115.88 | 150,009 annotations, 0 failures | -| Stage 1 prompt iteration (pilots) | ~$15 | 12+ versions × 500-sample pilots | -| Judge benchmarking | ~$5 | 8 models × 50-sample gold set | -| Judge prompt iteration | ~$3 | Ongoing | -| **Total to date** | **~$139** | | +### The Decision: Own Filings Over PleIAs/SEC + +For domain-adaptive pre-training (DAPT), we needed a corpus of clean SEC filing text. Two options: + +1. **PleIAs/SEC** (373K full 10-K texts on HuggingFace, going back years, CC0 license) — massive but uncleaned, and a single training pass on ~18B tokens would take weeks on a single RTX 3090. +2. **Our own ~9,000 cached filings** (FY2023-2024, HTML already downloaded during extraction) — smaller but recent, relevant, and we already have the HTML cleaning pipeline. + +We chose option 2. The reasoning: + +- **Recency > volume.** Item 1C didn't exist before FY2023. The cybersecurity disclosure vocabulary, boilerplate patterns, and regulatory framing are all new to this filing cycle. Pre-2023 filings teach the model general SEC language, which ModernBERT already knows from its general pre-training. The marginal value of historical filings is low for our specific task. +- **The scaling laws paper says stop early.** SEC filing scaling laws (arXiv:2512.12384) show the largest DAPT gains in the first 200M tokens, with diminishing returns after. Our 9,000 full filings yield ~450M tokens — already in the sweet spot. 
+- **We control the cleaning quality.** Our `stripHtml()` pipeline handles all the HTML artifacts we fought during extraction (XBRL tags, entity encoding, page breaks, inline element word splits). PleIAs/SEC is a black box — we'd need to audit it anyway. +- **Feasibility on a 3090.** 450M tokens: ~2-3 days. 18B tokens: weeks. Single GPU means we need to be strategic about compute allocation. + +The DAPT corpus preparation is simple: run the existing `stripHtml()` on cached filing HTML (full text, skipping the Item 1C section extraction step) and output clean text as sharded JSONL. + +### Adding TAPT: "Don't Stop Pretraining" + +Gururangan et al. (2020) "Don't Stop Pretraining" demonstrated that task-adaptive pre-training (TAPT) — continued MLM on the unlabeled task data specifically — gives consistent gains on top of DAPT, especially when the task distribution differs from the broader domain. + +Item 1C is a very specific subset of SEC filings. It has its own vocabulary (CISO, NIST CSF, tabletop exercises, materiality assessments), structure (governance → management → process → strategy is a common paragraph sequence), and boilerplate patterns that differ substantially from the rest of a 10-K. TAPT teaches the model this specific distribution before we ask it to classify. + +The cost is negligible: our 72K paragraphs from `paragraphs-clean.jsonl` are already clean text (~5-10M tokens). TAPT takes 2-3 hours on a 3090 — essentially free compared to DAPT. + +### The Training Pipeline + +``` +ModernBERT-large (base, 395M params) + → DAPT on 9K full 10-K filings (~450M tokens, ~2-3 days) → SEC-ModernBERT-large + → TAPT on 72K Item 1C paragraphs (~10M tokens, ~2-3 hours) → SEC-cyBERT-large + → Fine-tune on labeled data with dual classification heads → Final classifier +``` + +This gives us clean ablation rows: base → +DAPT → +TAPT → +SCL, isolating the contribution of each step. 
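The compute budget behind these choices can be sanity-checked with quick arithmetic. A minimal Python sketch, assuming ~50K tokens per full 10-K and ~2,000 tokens/sec of single-3090 MLM training throughput for a ~400M-param encoder (both illustrative figures, not measurements):

```python
# Back-of-envelope DAPT budget: corpus size and single-GPU wall clock.
# ASSUMPTIONS (illustrative, not measured): ~50K tokens per full 10-K,
# ~2,000 tokens/sec sustained MLM throughput on one RTX 3090.

def dapt_budget(n_filings, tokens_per_filing=50_000, tokens_per_sec=2_000):
    total_tokens = n_filings * tokens_per_filing
    days_per_epoch = total_tokens / tokens_per_sec / 86_400
    return total_tokens, days_per_epoch

own_tokens, own_days = dapt_budget(9_000)       # our cached FY2023-2024 filings
full_tokens, full_days = dapt_budget(373_000)   # PleIAs/SEC scale (~18B tokens)

print(f"own corpus:  {own_tokens/1e6:.0f}M tokens, ~{own_days:.1f} days/epoch")
print(f"PleIAs/SEC:  {full_tokens/1e9:.1f}B tokens, ~{full_days:.0f} days/epoch")
```

Under these assumptions one epoch over the in-house corpus costs days while one epoch over the full PleIAs/SEC dump costs months, which is the compute half of the argument independent of recency.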
+
+---
+
+## Cost and Time Ledger
+
+### Tooling
+
+All code was written collaboratively with **Claude Code** (Anthropic's agentic coding CLI). Claude Code was used throughout the project for pipeline development, prompt engineering, data analysis, script writing, documentation, and strategic planning. The tool dramatically accelerated iteration — writing analysis scripts, debugging extraction edge cases, and exploring the annotation data interactively — but all decisions were made by the team, with Claude Code as an implementation partner.
+
+### API Cost Ledger
+
+| Phase | Cost | Annotations | Notes |
+|-------|------|-------------|-------|
+| Stage 1 prompt iteration (pilots) | $7.03 | 9,597 | 12+ versions: 5 × 40-sample + 6 × 500-sample |
+| Stage 1 model bench (6 candidates) | $3.41 | 2,993 | seed, mimo, glm-4.5-air, minimax, mistral, nemotron |
+| Mimo pilot (dedicated comparison) | $0.24 | 500 | `mimo-pilot.ts` — replace-nano scenario modeling |
+| Stage 1 run #1 (with nano) | $112.42 | 150,009 | Full production run with gpt-5.4-nano. Completed, but nano's quality was unacceptable (0 reasoning tokens 64% of the time). Gemini+grok annotations ($91.18) preserved in `stage1-gemini-grok.jsonl`; only nano's annotations ($21.24) were discarded. Full original in `stage1.jsonl.bak`. |
+| Stage 1 run #2 (mimo only) | $24.69 | 50,003 | Ran only mimo to replace nano. Merged with preserved gemini+grok annotations to form final `stage1.jsonl` ($115.88 total value, $24.69 new spend). |
+| Judge model bench (8 candidates) | $5.97 | 505 | GLM-5 (4 configs), gpt-5.4-mini, gpt-5.4, sonnet-4.6, gemini-3-flash, grok-4.20, mimo-v2-pro, kimi-k2.5 |
+| **Total API spend** | **~$154** | **~213K unique** | Sum of rows above ($153.76); nano waste: $21.24 |
+
+Only nano's portion ($21.24) of the first run was wasted — the gemini and grok annotations were preserved and merged with the new mimo annotations. Still, $21.24 was thrown away on a model that wasn't thinking. 
The lesson: benchmark model candidates rigorously *before* committing to a production run. The 40-sample pilots showed nano was the weakest link but were misleadingly optimistic about the magnitude of the problem.
+
+### Time Ledger
+
+| Phase | Hours | Notes |
+|-------|-------|-------|
+| Data acquisition + HTML cleaning | ~6h | Extraction pipeline, HTML artifact handling, dedup, 8-K discovery. The messiest phase — SEC filing HTML variability required extensive regex heuristics and iteration. |
+| Stage 1 annotation run #1 (nano) | ~5h | Production run wall clock (~300 min). Completed but results were below quality bar. |
+| Stage 1 annotation run #2 (mimo) | ~1h | Only needed mimo annotations at higher concurrency (gemini+grok reused). |
+| Prompt iteration + model benchmarking | ~4h | 12+ prompt versions, 6 model candidates, pilot analysis |
+| Post-Stage 1 analysis + Stage 2 planning | ~5h | Distributional analysis, model bias discovery, codebook v3.0 rulings, judge benchmarking, strategy revision |
+| Documentation + narrative | ~2h | Codebook updates, narrative writing, technical guide updates |
+| **Total to date** | **~23h** | |
+
+### Remaining Work (estimated)
+
+| Phase | Est. Hours | Est. Cost |
+|-------|-----------|-----------|
+| Human labeling site + 1,200 labels | ~8-10h | $0 (team labor) |
+| Stage 2 judge production run (~3-5K paragraphs) | ~1h | ~$20-40 |
+| Training data assembly | ~2h | $0 |
+| DAPT pre-training | ~48-72h GPU | $0 (own 3090) |
+| TAPT pre-training | ~2-3h GPU | $0 |
+| Fine-tuning + ablations (7 experiments) | ~12-20h GPU | $0 |
+| Full GenAI benchmark on 1,200 holdout (9 models) | ~1h | ~$30-50 |
+| Evaluation + comparison + write-up | ~6-8h | $0 |
+
+---
+
+## Model Census — Every Model We Tried
+
+Over the course of the project, we evaluated **25 distinct models** across the phases below: the initial smoke test, panel selection and prompt iteration, the Stage 1 replacement bench, and the Stage 2 judge bench. 
Each decision narrowed the field based on empirical evidence.
+
+### Phase 0: Smoke Test (model-probe.ts) — 9 candidates
+
+Tested basic structured output compliance on a single paragraph before committing to expensive benchmarks.
+
+| Model | Provider | Result |
+|-------|----------|--------|
+| google/gemini-3.1-flash-lite-preview | Google | ✅ Pass — selected for panel |
+| x-ai/grok-4.1-fast | xAI | ✅ Pass — selected for panel |
+| openai/gpt-4.1-mini | OpenAI | ✅ Pass — not selected (cost) |
+| openai/gpt-4.1-nano | OpenAI | ✅ Pass — later replaced by gpt-5.4-nano |
+| anthropic/claude-haiku-4.5 | Anthropic | ✅ Pass — not selected (cost tier) |
+| google/gemini-3.1-flash-preview | Google | ✅ Pass — too expensive for Stage 1 |
+| deepseek/deepseek-chat-v3-0324:free | DeepSeek | Tested — free tier limitations |
+| meta-llama/llama-4-maverick | Meta | Tested |
+| qwen/qwen3-235b-a22b | Alibaba | Tested |
+
+### Phase 1: Early Pilots (v1.0-v1.2) — Original panel
+
+The very first panel used **gpt-oss-120b** (OpenAI's open-source 120B model), not nano:
+- `google/gemini-3.1-flash-lite-preview`
+- `openai/gpt-oss-120b` (also tested with `:exacto` routing suffix)
+- `x-ai/grok-4.1-fast`
+
+gpt-oss-120b was replaced by gpt-5.4-nano between v1.2 and v2.1 — nano was cheaper and appeared to perform comparably on the small (n=40) pilot samples.
+
+### Phase 2: 500-Sample Pilots (v2.2-v2.7) — Nano era
+
+Panel during the main prompt iteration:
+- `google/gemini-3.1-flash-lite-preview`
+- `openai/gpt-5.4-nano` ← the problem model
+- `x-ai/grok-4.1-fast`
+
+Nano's issues (0 reasoning tokens 64% of the time, erratic specificity) persisted throughout, but went undetected: quality swings across the 40→500 sample transition were attributed to prompt changes rather than to model inadequacy. 
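One cheap guardrail this suggests: compute each panel model's zero-reasoning rate directly from pilot logs before any production run. A minimal Python sketch; the record fields (`model`, `reasoning_tokens`) are hypothetical stand-ins for whatever the real pilot output actually contains:

```python
# Sketch: flag annotator models that frequently return zero reasoning tokens,
# the failure mode that sank nano. Field names are hypothetical, not the
# real pilot log schema.

def zero_reasoning_rate(records, model):
    """Fraction of a model's annotations produced with 0 reasoning tokens."""
    rows = [r for r in records if r["model"] == model]
    if not rows:
        return 0.0
    return sum(1 for r in rows if r["reasoning_tokens"] == 0) / len(rows)

pilot = [
    {"model": "nano", "reasoning_tokens": 0},
    {"model": "nano", "reasoning_tokens": 0},
    {"model": "nano", "reasoning_tokens": 412},
    {"model": "mimo", "reasoning_tokens": 1290},
]
rate = zero_reasoning_rate(pilot, "nano")
print(f"nano zero-reasoning rate: {rate:.0%}")  # 2 of 3 → 67%
```

A hard threshold on this rate (e.g. reject any candidate above a few percent) would have disqualified nano at the 40-sample stage, before prompt quality ever entered the picture.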
+ +### Phase 3: Stage 1 Replacement Bench (model-bench.ts) — 6 candidates + +After locking prompt v2.5, formally benchmarked replacements for nano: + +| Model | Provider | Reasoning Tokens | Cost/ann | Outcome | +|-------|----------|-----------------|----------|---------| +| xiaomi/mimo-v2-flash | Xiaomi | 1,346 | $0.00048 | **✅ Winner** — best value, lowest outlier rate | +| bytedance-seed/seed-2.0-lite | ByteDance | 658 | $0.00227 | Runner-up — highest accuracy but 4.7x more expensive | +| z-ai/glm-4.5-air | Zhipu AI | 854 | $0.00136 | Mediocre — barely moved the needle (+0.8pp) | +| minimax/minimax-m2.5 | MiniMax | 590 | $0.00106 | Mediocre — slightly worse than nano (-1.0pp) | +| mistralai/mistral-small-2603 | Mistral | **0** | $0.00015 | ❌ Zero reasoning tokens. Cheapest but useless. | +| nvidia/nemotron-3-super-120b-a12b | NVIDIA | 942 | $0.00152 | ❌ Worst performer despite being expensive. 21% outlier rate. | + +### Phase 4: Production Stage 1 — Final panel + +- `google/gemini-3.1-flash-lite-preview` (Google) +- `xiaomi/mimo-v2-flash` (Xiaomi) ← replaced nano +- `x-ai/grok-4.1-fast` (xAI) + +Three models from three providers — minimizes correlated errors. 
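The three-provider panel implies a simple consensus rule: accept unanimous or 2-of-3 labels, escalate three-way splits. A minimal Python sketch of that idea; the label values are illustrative and the real routing logic (and the codebook's categories) may differ:

```python
from collections import Counter

# Sketch of 3-model panel resolution: a majority label (2-of-3 or unanimous)
# is accepted; a three-way disagreement is escalated to the Stage 2 judge.
# Label strings are illustrative placeholders, not the real codebook values.

def resolve(labels):
    """labels: one label per panel model, e.g. [gemini, mimo, grok]."""
    (top, count), = Counter(labels).most_common(1)
    if count >= 2:
        return {"label": top, "escalate": False}
    return {"label": None, "escalate": True}  # 3-way split → judge decides

print(resolve(["governance", "governance", "process"]))  # majority wins
print(resolve(["governance", "process", "strategy"]))    # escalated
```

Using three providers matters precisely here: if two models share a systematic bias, "2-of-3 agreement" silently encodes that bias, so minimizing correlated errors keeps the escalation signal honest.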
+
+### Phase 5: Stage 2 Judge Bench (judge-bench.ts) — 8 candidates
+
+| Model | Provider | Mode | Both vs Gold | Fails | Outcome |
+|-------|----------|------|-------------|-------|---------|
+| z-ai/glm-5 | Zhipu AI | structured | 77-80% | 4-12% | Best accuracy but unreliable structured output |
+| z-ai/glm-5 | Zhipu AI | tool calling | 72% | 0% | Reliable but -7pp accuracy |
+| openai/gpt-5.4-mini | OpenAI | structured | 68% | 0% | Reliable, weaker on specificity |
+| openai/gpt-5.4 | OpenAI | structured | Tested | 0% | Expensive, diminishing returns over mini |
+| anthropic/claude-sonnet-4.6 | Anthropic | structured | Used for gold | 0% | Gold label creation, too expensive for production judge |
+| google/gemini-3-flash-preview | Google | structured | Tested | — | Rubber-stamped majority — added zero value |
+| x-ai/grok-4.20-beta | xAI | structured | Tested | — | Benchmarked |
+| xiaomi/mimo-v2-pro | Xiaomi | structured | Tested | — | Benchmarked |
+| moonshotai/kimi-k2.5 | Moonshot AI | structured | Tested | — | Only 26/50 completed — high failure rate |
+
+### Summary: 25 Models, 14 Providers
+
+| Provider | Models Tested | Models in Production |
+|----------|--------------|---------------------|
+| Google | gemini-3.1-flash-lite, gemini-3.1-flash, gemini-3-flash | gemini-3.1-flash-lite (Stage 1) |
+| OpenAI | gpt-oss-120b, gpt-5.4-nano, gpt-4.1-mini, gpt-4.1-nano, gpt-5.4-mini, gpt-5.4 | — (nano dropped) |
+| xAI | grok-4.1-fast, grok-4.20-beta | grok-4.1-fast (Stage 1) |
+| Xiaomi | mimo-v2-flash, mimo-v2-pro | mimo-v2-flash (Stage 1) |
+| Anthropic | claude-haiku-4.5, claude-sonnet-4.6 | sonnet-4.6 (gold labels) |
+| Zhipu AI | glm-4.5-air, glm-5 | TBD (Stage 2 judge) |
+| ByteDance | seed-2.0-lite | — (too expensive for scale) |
+| NVIDIA | nemotron-3-super-120b | — (worst performer) |
+| Mistral | mistral-small-2603 | — (zero reasoning) |
+| MiniMax | minimax-m2.5 | — (mediocre) |
+| Moonshot AI | kimi-k2.5 | — (high failure rate) |
+| Meta | 
llama-4-maverick | — (smoke test only) | +| Alibaba | qwen3-235b-a22b | — (smoke test only) | +| DeepSeek | deepseek-chat-v3-0324 | — (smoke test only) | --- diff --git a/docs/TECHNICAL-GUIDE.md b/docs/TECHNICAL-GUIDE.md index 13cc48e..ef11b4c 100644 --- a/docs/TECHNICAL-GUIDE.md +++ b/docs/TECHNICAL-GUIDE.md @@ -209,7 +209,9 @@ const result = await generateObject({ **Step 1 — Domain-Adaptive Pre-Training (DAPT):** Continue MLM pre-training on SEC filing text to create "SEC-ModernBERT-large": -- **Training corpus:** 200-500M tokens from PleIAs/SEC or own EDGAR download. Include 10-Ks, 10-Qs, 8-Ks, proxy statements. +- **Training corpus:** ~450M tokens from our own 9,000 cached 10-K filings (FY2023-2024), full filing text (not just Item 1C). These are the same filings we extracted Item 1C from — we already have the raw HTML cached locally and the cleaning pipeline built (`stripHtml()` in `fast-reparse.ts`). +- **Why not PleIAs/SEC?** The PleIAs/SEC dataset (373K filings, ~18B tokens) goes back much further in time, but: (a) one pass would take weeks on a single 3090, (b) Item 1C didn't exist before FY2023 so pre-2023 filings lack the cybersecurity disclosure vocabulary that matters most for our task, (c) the SEC filing scaling laws paper (arXiv:2512.12384) shows the largest gains come in the first 200M tokens — our 450M from recent, relevant filings is already in the sweet spot. +- **Corpus preparation:** `dapt-corpus.ts` runs `stripHtml()` on cached filing HTML (full text, no section extraction) and outputs clean text as sharded JSONL. Same HTML cleaning that handles XBRL tags, entity decoding, page artifacts, inline element boundaries — just without the Item 1C section boundary step. 
- **MLM objective:** 30% masking rate (ModernBERT convention) - **Learning rate:** ~5e-5 (search range: 1e-5 to 1e-4) - **Hardware (RTX 3090):** bf16, gradient checkpointing, seq_len=1024-2048, batch_size=2-4 + gradient accumulation to effective batch 16-32 @@ -217,8 +219,16 @@ Continue MLM pre-training on SEC filing text to create "SEC-ModernBERT-large": - **Duration:** ~2-3 days on single 3090 - **Framework:** HuggingFace Trainer + `DataCollatorForLanguageModeling` (Python script, not notebook) -**Evidence DAPT works:** -- Gururangan et al. (2020): consistent improvements across all tested domains +**Step 1b — Task-Adaptive Pre-Training (TAPT):** + +After DAPT, continue MLM pre-training specifically on the task distribution to create "SEC-cyBERT-large": +- **Training corpus:** 72K Item 1C paragraphs from `paragraphs-clean.jsonl` (~5-10M tokens). Already clean text — no additional processing needed. +- **Why TAPT on top of DAPT?** Gururangan et al. (2020) "Don't Stop Pretraining" showed that TAPT gives consistent gains on top of DAPT, especially when the task distribution differs from the broader domain. Item 1C is a very specific subset of 10-K filings — it has its own vocabulary (CISO, NIST CSF, tabletop exercises, materiality assessments), structure (governance → management → process → strategy pattern), and boilerplate patterns that differ substantially from the rest of a 10-K. TAPT teaches the model this specific distribution. +- **Duration:** ~2-3 hours on a 3090 (tiny corpus vs DAPT) +- **Same hyperparameters as DAPT** (learning rate, masking rate, etc.) + +**Evidence DAPT + TAPT works:** +- Gururangan et al. (2020): DAPT + TAPT together outperforms either alone across all tested domains. TAPT alone gives 1-3 F1 points; combined with DAPT, gains are additive. 
- Clinical ModernBERT, BioClinical ModernBERT: successful continued MLM on medical text - Patent domain ModernBERT (arXiv:2509.14926): +0.9 to +2.8 F1 from continued pre-training on 31.6B tokens - SEC filing scaling laws (arXiv:2512.12384): consistent improvement, largest gains in first 200M tokens @@ -280,15 +290,16 @@ All BERT-base (110M params, 512 context) — architecturally outdated: ### 3.6 Ablation Design -| # | Experiment | Model | Context | DAPT | SCL | Purpose | -|---|-----------|-------|---------|------|-----|---------| -| 1 | Baseline | DeBERTa-v3-large | 512 | No | No | Standard approach per syllabus | -| 2 | + Long context | ModernBERT-large | 2048 | No | No | Context window benefit | -| 3 | + Domain adapt | SEC-ModernBERT-large | 2048 | Yes | No | DAPT benefit | -| 4 | + Contrastive | SEC-ModernBERT-large | 2048 | Yes | Yes | SCL benefit | -| 5 | Efficiency | NeoBERT (+ DAPT) | 2048 | Yes | Yes | 40% fewer params | -| 6 | Decoder | Qwen3.5 LoRA | 2048 | No | No | Encoder vs decoder | -| 7 | **Ensemble** | SEC-ModernBERT + DeBERTa | mixed | mixed | — | Maximum performance | +| # | Experiment | Model | Context | DAPT | TAPT | SCL | Purpose | +|---|-----------|-------|---------|------|------|-----|---------| +| 1 | Baseline | DeBERTa-v3-large | 512 | No | No | No | Standard approach per syllabus | +| 2 | + Long context | ModernBERT-large | 2048 | No | No | No | Context window benefit | +| 3 | + Domain adapt | SEC-ModernBERT-large | 2048 | Yes | No | No | DAPT benefit | +| 3b | + Task adapt | SEC-cyBERT-large | 2048 | Yes | Yes | No | TAPT benefit (Gururangan et al.) | +| 4 | + Contrastive | SEC-cyBERT-large | 2048 | Yes | Yes | Yes | SCL benefit | +| 5 | Efficiency | NeoBERT (+ DAPT + TAPT) | 2048 | Yes | Yes | Yes | 40% fewer params | +| 6 | Decoder | Qwen3.5 LoRA | 2048 | No | No | No | Encoder vs decoder | +| 7 | **Ensemble** | SEC-cyBERT + DeBERTa | mixed | mixed | mixed | — | Maximum performance | ### 3.7 Hyperparameter Search (Autoresearch Pattern)