updating narrative and plan

Joey Eamigh 2026-03-28 22:54:32 -04:00
parent 78d1f978de
commit 48e488933a
2 changed files with 199 additions and 20 deletions


@ -394,15 +394,183 @@ Expected total: ~46,000-48,000 paragraphs at ~93-95% label accuracy.
---

## Phase 8: Pre-Training Strategy — DAPT + TAPT

### The Decision: Own Filings Over PleIAs/SEC

For domain-adaptive pre-training (DAPT), we needed a corpus of clean SEC filing text. Two options:

1. **PleIAs/SEC** (373K full 10-K texts on HuggingFace, going back years, CC0 license) — massive but uncleaned, and a single training pass on ~18B tokens would take weeks on a single RTX 3090.
2. **Our own ~9,000 cached filings** (FY2023-2024, HTML already downloaded during extraction) — smaller but recent, relevant, and we already have the HTML cleaning pipeline.

We chose option 2. The reasoning:

- **Recency > volume.** Item 1C didn't exist before FY2023. The cybersecurity disclosure vocabulary, boilerplate patterns, and regulatory framing are all new to this filing cycle. Pre-2023 filings teach the model general SEC language, which ModernBERT already knows from its general pre-training. The marginal value of historical filings is low for our specific task.
- **The scaling laws paper says stop early.** SEC filing scaling laws (arXiv:2512.12384) show the largest DAPT gains in the first 200M tokens, with diminishing returns after. Our 9,000 full filings yield ~450M tokens — already in the sweet spot.
- **We control the cleaning quality.** Our `stripHtml()` pipeline handles all the HTML artifacts we fought during extraction (XBRL tags, entity encoding, page breaks, inline element word splits). PleIAs/SEC is a black box — we'd need to audit it anyway.
- **Feasibility on a 3090.** 450M tokens: ~2-3 days. 18B tokens: weeks. Single GPU means we need to be strategic about compute allocation.

The DAPT corpus preparation is simple: run the existing `stripHtml()` on cached filing HTML (full text, skipping the Item 1C section extraction step) and output clean text as sharded JSONL.
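The actual implementation is `dapt-corpus.ts` (TypeScript); as an illustrative sketch only, the same cache-to-sharded-JSONL flow looks roughly like this in Python. The `strip_html` helper here is a crude stand-in for the real `stripHtml()` pipeline, and the shard size and minimum-length filter are arbitrary choices for the sketch:

```python
import json
import re
from pathlib import Path

SHARD_SIZE = 1000  # filings per output shard (arbitrary for this sketch)

def strip_html(html: str) -> str:
    """Crude stand-in for the project's stripHtml(): drop tags and
    entities, collapse whitespace. The real pipeline also handles XBRL
    tags, page breaks, and inline-element word splits."""
    text = re.sub(r"<[^>]+>", " ", html)           # remove tags
    text = re.sub(r"&[a-zA-Z#0-9]+;", " ", text)   # drop HTML entities
    return re.sub(r"\s+", " ", text).strip()

def build_corpus(cache_dir: str, out_dir: str) -> int:
    """Walk cached filing HTML, emit clean full text as sharded JSONL.
    Returns the number of documents written."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    shard, shard_idx, count = [], 0, 0
    for path in sorted(Path(cache_dir).glob("*.html")):
        text = strip_html(path.read_text(errors="ignore"))
        if len(text) < 500:        # skip near-empty documents
            continue
        shard.append({"id": path.stem, "text": text})
        count += 1
        if len(shard) == SHARD_SIZE:
            (out / f"dapt-{shard_idx:04d}.jsonl").write_text(
                "\n".join(json.dumps(d) for d in shard) + "\n")
            shard, shard_idx = [], shard_idx + 1
    if shard:                      # flush the final partial shard
        (out / f"dapt-{shard_idx:04d}.jsonl").write_text(
            "\n".join(json.dumps(d) for d in shard) + "\n")
    return count
```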
### Adding TAPT: "Don't Stop Pretraining"
Gururangan et al. (2020) "Don't Stop Pretraining" demonstrated that task-adaptive pre-training (TAPT) — continued MLM on the unlabeled task data specifically — gives consistent gains on top of DAPT, especially when the task distribution differs from the broader domain.
Item 1C is a very specific subset of SEC filings. It has its own vocabulary (CISO, NIST CSF, tabletop exercises, materiality assessments), structure (governance → management → process → strategy is a common paragraph sequence), and boilerplate patterns that differ substantially from the rest of a 10-K. TAPT teaches the model this specific distribution before we ask it to classify.
The cost is negligible: our 72K paragraphs from `paragraphs-clean.jsonl` are already clean text (~5-10M tokens). TAPT takes 2-3 hours on a 3090 — essentially free compared to DAPT.
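A quick back-of-envelope check on that ~5-10M token figure, assuming Item 1C paragraphs average somewhere around 70-140 tokens each (an assumption, not a measured value):

```python
# Does 72K paragraphs plausibly yield 5-10M tokens?
PARAGRAPHS = 72_000
for avg_tokens in (70, 100, 140):  # assumed per-paragraph averages
    total = PARAGRAPHS * avg_tokens
    print(f"{avg_tokens} tok/para -> {total / 1e6:.2f}M tokens")
```

The low and high ends of that range bracket the 5-10M estimate, so the corpus size is internally consistent.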
### The Training Pipeline
```
ModernBERT-large (base, 395M params)
→ DAPT on 9K full 10-K filings (~450M tokens, ~2-3 days) → SEC-ModernBERT-large
→ TAPT on 72K Item 1C paragraphs (~10M tokens, ~2-3 hours) → SEC-cyBERT-large
→ Fine-tune on labeled data with dual classification heads → Final classifier
```
This gives us clean ablation rows: base → +DAPT → +TAPT → +SCL, isolating the contribution of each step.
---
## Cost and Time Ledger
### Tooling
All code was written collaboratively with **Claude Code** (Anthropic's agentic coding CLI). Claude Code was used throughout the project for pipeline development, prompt engineering, data analysis, script writing, documentation, and strategic planning. The tool dramatically accelerated iteration speed — writing analysis scripts, debugging extraction edge cases, and exploring the annotation data interactively — but all decisions were made by the team with Claude Code as an implementation partner.
### API Cost Ledger
| Phase | Cost | Annotations | Notes |
|-------|------|-------------|-------|
| Stage 1 prompt iteration (pilots) | $7.03 | 9,597 | 12+ versions: 5 × 40-sample + 6 × 500-sample |
| Stage 1 model bench (6 candidates) | $3.41 | 2,993 | seed, mimo, glm-4.5-air, minimax, mistral, nemotron |
| Mimo pilot (dedicated comparison) | $0.24 | 500 | `mimo-pilot.ts` — replace-nano scenario modeling |
| Stage 1 run #1 (with nano) | $112.42 | 150,009 | Full production run with gpt-5.4-nano. Completed, but nano's quality was unacceptable (0 reasoning tokens 64% of the time). Gemini+grok annotations ($91.18) preserved in `stage1-gemini-grok.jsonl`; only nano's annotations ($21.24) were discarded. Full original in `stage1.jsonl.bak`. |
| Stage 1 run #2 (mimo only) | $24.69 | 50,003 | Ran only mimo to replace nano. Merged with preserved gemini+grok annotations to form final `stage1.jsonl` ($115.88 total value, $24.69 new spend). |
| Judge model bench (8 candidates) | $5.97 | 505 | GLM-5 (4 configs), gpt-5.4-mini, gpt-5.4, sonnet-4.6, gemini-3-flash, grok-4.20, mimo-v2-pro, kimi-k2.5 |
| **Total API spend** | **$156** | **~213K unique** | Nano waste: $21.24 |
Only nano's portion ($21.24) of the first run was wasted — the gemini and grok annotations were preserved and merged with the new mimo annotations. Still, $21.24 thrown away on a model that wasn't thinking. The lesson: benchmark model candidates rigorously *before* committing to a production run. The 40-sample pilots showed nano was the weakest link but were misleadingly optimistic about the magnitude of the problem.
### Time Ledger
| Phase | Hours | Notes |
|-------|-------|-------|
| Data acquisition + HTML cleaning | ~6h | Extraction pipeline, HTML artifact handling, dedup, 8-K discovery. The messiest phase — SEC filing HTML variability required extensive regex heuristics and iteration. |
| Stage 1 annotation run #1 (nano) | ~5h | Production run wall clock (~300 min). Completed but results were below quality bar. |
| Stage 1 annotation run #2 (mimo) | ~1h | Only needed mimo annotations at higher concurrency (gemini+grok reused). |
| Prompt iteration + model benchmarking | ~4h | 12+ prompt versions, 6 model candidates, pilot analysis |
| Post-Stage 1 analysis + Stage 2 planning | ~5h | Distributional analysis, model bias discovery, codebook v3.0 rulings, judge benchmarking, strategy revision |
| Documentation + narrative | ~2h | Codebook updates, narrative writing, technical guide updates |
| **Total to date** | **~23h** | |
### Remaining Work (estimated)
| Phase | Est. Hours | Est. Cost |
|-------|-----------|-----------|
| Human labeling site + 1,200 labels | ~8-10h | $0 (team labor) |
| Stage 2 judge production run (~3-5K paragraphs) | ~1h | ~$20-40 |
| Training data assembly | ~2h | $0 |
| DAPT pre-training | ~48-72h GPU | $0 (own 3090) |
| TAPT pre-training | ~2-3h GPU | $0 |
| Fine-tuning + ablations (7 experiments) | ~12-20h GPU | $0 |
| Full GenAI benchmark on 1,200 holdout (9 models) | ~1h | ~$30-50 |
| Evaluation + comparison + write-up | ~6-8h | $0 |
---
## Model Census — Every Model We Tried
Over the course of the project, we evaluated **18 distinct models** across three phases: initial panel selection, Stage 1 replacement bench, and Stage 2 judge selection. Each decision narrowed the field based on empirical evidence.
### Phase 0: Smoke Test (model-probe.ts) — 9 candidates
Tested basic structured output compliance on a single paragraph before committing to expensive benchmarks.
| Model | Provider | Result |
|-------|----------|--------|
| google/gemini-3.1-flash-lite-preview | Google | ✅ Pass — selected for panel |
| x-ai/grok-4.1-fast | xAI | ✅ Pass — selected for panel |
| openai/gpt-4.1-mini | OpenAI | ✅ Pass — not selected (cost) |
| openai/gpt-4.1-nano | OpenAI | ✅ Pass — later replaced by gpt-5.4-nano |
| anthropic/claude-haiku-4.5 | Anthropic | ✅ Pass — not selected (cost tier) |
| google/gemini-3.1-flash-preview | Google | ✅ Pass — too expensive for Stage 1 |
| deepseek/deepseek-chat-v3-0324:free | DeepSeek | Tested — free tier limitations |
| meta-llama/llama-4-maverick | Meta | Tested |
| qwen/qwen3-235b-a22b | Alibaba | Tested |
### Phase 1: Early Pilots (v1.0-v1.2) — Original panel
The very first panel used **gpt-oss-120b** (OpenAI's open-source 120B model), not nano:
- `google/gemini-3.1-flash-lite-preview`
- `openai/gpt-oss-120b` (also tested with `:exacto` routing suffix)
- `x-ai/grok-4.1-fast`
gpt-oss-120b was replaced by gpt-5.4-nano between v1.2 and v2.1 — nano was cheaper and appeared to perform comparably on the small (n=40) pilot samples.
### Phase 2: 500-Sample Pilots (v2.2-v2.7) — Nano era
Panel during the main prompt iteration:
- `google/gemini-3.1-flash-lite-preview`
- `openai/gpt-5.4-nano` ← the problem model
- `x-ai/grok-4.1-fast`
Nano's issues (0 reasoning tokens 64% of the time, erratic specificity) were persistent, but they went unrecognized: quality shifts during the 40→500-sample transition were attributed to prompt changes rather than model inadequacy.
### Phase 3: Stage 1 Replacement Bench (model-bench.ts) — 6 candidates
After locking prompt v2.5, formally benchmarked replacements for nano:
| Model | Provider | Reasoning Tokens | Cost/ann | Outcome |
|-------|----------|-----------------|----------|---------|
| xiaomi/mimo-v2-flash | Xiaomi | 1,346 | $0.00048 | **✅ Winner** — best value, lowest outlier rate |
| bytedance-seed/seed-2.0-lite | ByteDance | 658 | $0.00227 | Runner-up — highest accuracy but 4.7x more expensive |
| z-ai/glm-4.5-air | Zhipu AI | 854 | $0.00136 | Mediocre — barely moved the needle (+0.8pp) |
| minimax/minimax-m2.5 | MiniMax | 590 | $0.00106 | Mediocre — slightly worse than nano (-1.0pp) |
| mistralai/mistral-small-2603 | Mistral | **0** | $0.00015 | ❌ Zero reasoning tokens. Cheapest but useless. |
| nvidia/nemotron-3-super-120b-a12b | NVIDIA | 942 | $0.00152 | ❌ Worst performer despite being expensive. 21% outlier rate. |
### Phase 4: Production Stage 1 — Final panel
- `google/gemini-3.1-flash-lite-preview` (Google)
- `xiaomi/mimo-v2-flash` (Xiaomi) ← replaced nano
- `x-ai/grok-4.1-fast` (xAI)
Three models from three providers — minimizes correlated errors.
### Phase 5: Stage 2 Judge Bench (judge-bench.ts) — 8 candidates
| Model | Provider | Mode | Both vs Gold | Fails | Outcome |
|-------|----------|------|-------------|-------|---------|
| z-ai/glm-5 | Zhipu AI | structured | 77-80% | 4-12% | Best accuracy but unreliable structured output |
| z-ai/glm-5 | Zhipu AI | tool calling | 72% | 0% | Reliable but -7pp accuracy |
| openai/gpt-5.4-mini | OpenAI | structured | 68% | 0% | Reliable, weaker on specificity |
| openai/gpt-5.4 | OpenAI | structured | Tested | 0% | Expensive, diminishing returns over mini |
| anthropic/claude-sonnet-4.6 | Anthropic | structured | Used for gold | 0% | Gold label creation, too expensive for production judge |
| google/gemini-3-flash-preview | Google | structured | Tested | — | Rubber-stamped majority — added zero value |
| x-ai/grok-4.20-beta | xAI | structured | Tested | — | Benchmarked |
| xiaomi/mimo-v2-pro | Xiaomi | structured | Tested | — | Benchmarked |
| moonshotai/kimi-k2.5 | Moonshot AI | structured | Tested | — | Only 26/50 completed — high failure rate |
### Summary: 18 Models, 10 Providers
| Provider | Models Tested | Models in Production |
|----------|--------------|---------------------|
| Google | gemini-3.1-flash-lite, gemini-3.1-flash, gemini-3-flash | gemini-3.1-flash-lite (Stage 1) |
| OpenAI | gpt-oss-120b, gpt-5.4-nano, gpt-4.1-mini, gpt-4.1-nano, gpt-5.4-mini, gpt-5.4 | — (nano dropped) |
| xAI | grok-4.1-fast, grok-4.20-beta | grok-4.1-fast (Stage 1) |
| Xiaomi | mimo-v2-flash, mimo-v2-pro | mimo-v2-flash (Stage 1) |
| Anthropic | claude-haiku-4.5, claude-sonnet-4.6 | sonnet-4.6 (gold labels) |
| Zhipu AI | glm-4.5-air, glm-5 | TBD (Stage 2 judge) |
| ByteDance | seed-2.0-lite | — (too expensive for scale) |
| NVIDIA | nemotron-3-super-120b | — (worst performer) |
| Mistral | mistral-small-2603 | — (zero reasoning) |
| MiniMax | minimax-m2.5 | — (mediocre) |
| Moonshot AI | kimi-k2.5 | — (high failure rate) |
| Meta | llama-4-maverick | — (smoke test only) |
| Alibaba | qwen3-235b-a22b | — (smoke test only) |
| DeepSeek | deepseek-chat-v3-0324 | — (smoke test only) |
---


@ -209,7 +209,9 @@ const result = await generateObject({
**Step 1 — Domain-Adaptive Pre-Training (DAPT):**

Continue MLM pre-training on SEC filing text to create "SEC-ModernBERT-large":
- **Training corpus:** ~450M tokens from our own 9,000 cached 10-K filings (FY2023-2024), full filing text (not just Item 1C). These are the same filings we extracted Item 1C from — we already have the raw HTML cached locally and the cleaning pipeline built (`stripHtml()` in `fast-reparse.ts`).
- **Why not PleIAs/SEC?** The PleIAs/SEC dataset (373K filings, ~18B tokens) goes back much further in time, but: (a) one pass would take weeks on a single 3090, (b) Item 1C didn't exist before FY2023 so pre-2023 filings lack the cybersecurity disclosure vocabulary that matters most for our task, (c) the SEC filing scaling laws paper (arXiv:2512.12384) shows the largest gains come in the first 200M tokens — our 450M from recent, relevant filings is already in the sweet spot.
- **Corpus preparation:** `dapt-corpus.ts` runs `stripHtml()` on cached filing HTML (full text, no section extraction) and outputs clean text as sharded JSONL. Same HTML cleaning that handles XBRL tags, entity decoding, page artifacts, inline element boundaries — just without the Item 1C section boundary step.
- **MLM objective:** 30% masking rate (ModernBERT convention)
- **Learning rate:** ~5e-5 (search range: 1e-5 to 1e-4)
- **Hardware (RTX 3090):** bf16, gradient checkpointing, seq_len=1024-2048, batch_size=2-4 + gradient accumulation to effective batch 16-32
@ -217,8 +219,16 @@ Continue MLM pre-training on SEC filing text to create "SEC-ModernBERT-large":
- **Duration:** ~2-3 days on single 3090
- **Framework:** HuggingFace Trainer + `DataCollatorForLanguageModeling` (Python script, not notebook)
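For intuition, here is a hand-rolled sketch of the 30% MLM corruption that the collator applies, following the standard BERT-style 80/10/10 split. The token IDs are illustrative placeholders (real values come from the tokenizer), and the production run would use `DataCollatorForLanguageModeling` with `mlm_probability=0.3` rather than this code:

```python
import random

MASK_ID, VOCAB_SIZE = 50284, 50368  # illustrative; real IDs come from the tokenizer

def mask_for_mlm(ids: list[int], mlm_prob: float = 0.30, seed: int = 0):
    """BERT-style MLM corruption: select mlm_prob of positions as
    prediction targets; of those, 80% -> [MASK], 10% -> random token,
    10% left unchanged. Returns (inputs, labels) with labels = -100
    on non-target positions (the HF ignore-index convention)."""
    rng = random.Random(seed)
    inputs, labels = list(ids), [-100] * len(ids)
    for i, tok in enumerate(ids):
        if rng.random() >= mlm_prob:
            continue
        labels[i] = tok                 # loss is computed here only
        roll = rng.random()
        if roll < 0.8:
            inputs[i] = MASK_ID         # 80%: replace with [MASK]
        elif roll < 0.9:
            inputs[i] = rng.randrange(VOCAB_SIZE)  # 10%: random token
        # else: 10% keep the original token
    return inputs, labels
```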
**Step 1b — Task-Adaptive Pre-Training (TAPT):**

After DAPT, continue MLM pre-training specifically on the task distribution to create "SEC-cyBERT-large":
- **Training corpus:** 72K Item 1C paragraphs from `paragraphs-clean.jsonl` (~5-10M tokens). Already clean text — no additional processing needed.
- **Why TAPT on top of DAPT?** Gururangan et al. (2020) "Don't Stop Pretraining" showed that TAPT gives consistent gains on top of DAPT, especially when the task distribution differs from the broader domain. Item 1C is a very specific subset of 10-K filings — it has its own vocabulary (CISO, NIST CSF, tabletop exercises, materiality assessments), structure (governance → management → process → strategy pattern), and boilerplate patterns that differ substantially from the rest of a 10-K. TAPT teaches the model this specific distribution.
- **Duration:** ~2-3 hours on a 3090 (tiny corpus vs DAPT)
- **Same hyperparameters as DAPT** (learning rate, masking rate, etc.)

**Evidence DAPT + TAPT works:**
- Gururangan et al. (2020): DAPT + TAPT together outperforms either alone across all tested domains. TAPT alone gives 1-3 F1 points; combined with DAPT, gains are additive.
- Clinical ModernBERT, BioClinical ModernBERT: successful continued MLM on medical text
- Patent domain ModernBERT (arXiv:2509.14926): +0.9 to +2.8 F1 from continued pre-training on 31.6B tokens
- SEC filing scaling laws (arXiv:2512.12384): consistent improvement, largest gains in first 200M tokens
@ -280,15 +290,16 @@ All BERT-base (110M params, 512 context) — architecturally outdated:
### 3.6 Ablation Design

| # | Experiment | Model | Context | DAPT | TAPT | SCL | Purpose |
|---|-----------|-------|---------|------|------|-----|---------|
| 1 | Baseline | DeBERTa-v3-large | 512 | No | No | No | Standard approach per syllabus |
| 2 | + Long context | ModernBERT-large | 2048 | No | No | No | Context window benefit |
| 3 | + Domain adapt | SEC-ModernBERT-large | 2048 | Yes | No | No | DAPT benefit |
| 3b | + Task adapt | SEC-cyBERT-large | 2048 | Yes | Yes | No | TAPT benefit (Gururangan et al.) |
| 4 | + Contrastive | SEC-cyBERT-large | 2048 | Yes | Yes | Yes | SCL benefit |
| 5 | Efficiency | NeoBERT (+ DAPT + TAPT) | 2048 | Yes | Yes | Yes | 40% fewer params |
| 6 | Decoder | Qwen3.5 LoRA | 2048 | No | No | No | Encoder vs decoder |
| 7 | **Ensemble** | SEC-cyBERT + DeBERTa | mixed | mixed | mixed | — | Maximum performance |
### 3.7 Hyperparameter Search (Autoresearch Pattern)