# Project Narrative — SEC Cybersecurity Disclosure Quality Classifier

This document captures the process, roadblocks, decisions, and resolutions from building the SEC cybersecurity disclosure quality classifier. It serves as the source material for the final paper and presentation.

---

## Phase 1: Project Scoping and Construct Design

### The Problem

SEC Release 33-11216 (July 2023) created a new annual cybersecurity disclosure requirement (10-K Item 1C) and an incident disclosure requirement (8-K Item 1.05). By FY2024, ~9,000-10,000 filings exist. No validated classifier or public labeled dataset exists for assessing the quality of these disclosures. Investors, regulators, and compliance officers need scalable tools to distinguish substantive disclosures from boilerplate.

### Methodology Decision: Ringel (2023) "Synthetic Experts"

We adopted the Ringel (2023) "Synthetic Experts" pipeline: use frontier LLMs to generate training labels at scale, then distill into an efficient encoder model. This approach was chosen because:

- Manual labeling of 50,000+ paragraphs is infeasible for a 6-person team
- Multiple cheap LLMs annotating in parallel provide built-in quality control through inter-annotator agreement
- The encoder distillation step produces a model that can classify at inference time without LLM API costs

### Construct: Two Classification Dimensions

We defined two simultaneous classification tasks per paragraph:

1. **Content Category** (7 mutually exclusive classes) — what the paragraph is about, grounded in the SEC rule's own structure (Board Governance, Management Role, Risk Management Process, Third-Party Risk, Incident Disclosure, Strategy Integration, None/Other)
2. **Specificity Level** (4-point ordinal) — how company-specific the disclosure is, from generic boilerplate to quantified-verifiable facts

The construct maps to NIST CSF 2.0 categories for academic grounding.
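The two-dimensional label space is small enough to pin down as types. A minimal TypeScript sketch (identifier names here are ours for illustration, not the project's actual code):

```typescript
// Illustrative label space for the two per-paragraph classification tasks.
// Category names come from the construct above; type/function names are hypothetical.
const CATEGORIES = [
  "Board Governance",
  "Management Role",
  "Risk Management Process",
  "Third-Party Risk",
  "Incident Disclosure",
  "Strategy Integration",
  "None/Other",
] as const;

type Category = (typeof CATEGORIES)[number];

// 1 = generic boilerplate … 4 = quantified-verifiable facts
type Specificity = 1 | 2 | 3 | 4;

interface ParagraphLabel {
  category: Category;
  specificity: Specificity;
}

// Guard used when ingesting labels from any external source (LLM or human).
function isValidLabel(x: { category: string; specificity: number }): x is ParagraphLabel {
  return (
    (CATEGORIES as readonly string[]).includes(x.category) &&
    Number.isInteger(x.specificity) &&
    x.specificity >= 1 &&
    x.specificity <= 4
  );
}
```

Constraining every downstream component to these types means an invalid category or out-of-range specificity fails at the boundary, not deep in analysis code.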
---

## Phase 2: Data Acquisition and Corpus Construction

### The Extraction Problem

SEC filings are not structured data. They're HTML generated from PDFs, XBRL, and Word documents by dozens of different tools, each producing different artifacts. Building a reliable extraction pipeline for ~9,000 filings meant solving a series of messy, real-world data engineering problems.

### Pipeline Architecture

Built in TypeScript (~1,000 lines of extraction code across `parse-item1c.ts`, `segment.ts`, `fast-reparse.ts`, and pipeline orchestration):

```
EDGAR Master Index → enumerate 10-K filings → download HTML → extract Item 1C → segment paragraphs → JSONL
submissions.zip → scan for 8-K Item 1.05 → download HTML → extract → segment → merge with 10-K corpus
```

### Roadblock: HTML Variability

Every filing's HTML is different. The same logical content looks completely different depending on the tool that generated the HTML:

- **Word splitting from inline elements.** XBRL and styling tags break words mid-token: `Item 2` renders correctly in a browser but parses as "Item2" in code. Same with `cybersecurity`. Required detecting adjacent inline element boundaries and inserting spaces selectively.
- **CamelCase joins from PDF converters.** PDF-to-HTML tools merge sentences across formatting boundaries: `sentence.Next sentence` instead of `sentence. Next sentence`. Required regex passes to detect missing spaces after punctuation.
- **Page breaks mid-sentence.** Page numbers (`28`, `- 12 -`, `F-3`), running headers (`ACME CORP — ANNUAL REPORT`), and subsidiary headers (`ENTERGY ARKANSAS, LLC AND SUBSIDIARIES`) get spliced into the middle of content paragraphs. Required filtering a catalog of page-artifact patterns.
- **Table of Contents shadowing.** "Item 1C" appears at least twice in every 10-K — once in the Table of Contents and once in the actual content. Using the first match extracts the wrong section.
Took several iterations to discover we needed the LAST match — this was a silent failure that produced empty or wrong extractions for hundreds of filings before we caught it.
- **XBRL tag pollution.** Inline XBRL wraps financial facts in `ix:header`, `ix:references`, and `ix:nonFraction` tags that carry no display content but add noise. Required stripping all `ix:*` tags before text processing.
- **Entity encoding chaos.** Non-breaking spaces, curly quotes (`“` / `”`), dashes (`—` / `–`), and bullets (`•`) each need correct decoding, and different filing tools use different entity styles for the same characters.

### Paragraph Segmentation

After extracting clean section text, splitting into paragraphs had its own challenges:

- **Bullet list merging.** Disclosures frequently use bullet lists ("Our program includes: • risk assessment • vulnerability scanning"). Bullets need to be merged with their intro sentence; a standalone "• vulnerability scanning" is meaningless.
- **Continuation line detection.** Sentences split across HTML block elements need rejoining. Heuristic: if the previous block lacks terminal punctuation and the next starts lowercase or with a continuation phrase (`and`, `or`, `including`, `such as`), merge.
- **Length boundaries.** Under 20 words → likely a header (filtered). Over 500 words → split at sentence boundaries to keep annotation units manageable.
- **Table-based bullet lists and the cascade failure.** Some generators (notably EFiling/XDX) render bullet lists as HTML tables with one `
` elements with spacers), but the information was lost during text extraction. The data quality audit found 2,210 paragraphs with embedded bullet points across the corpus — most from this class of failure. These paragraphs are still classifiable (the models unanimously labeled one such paragraph as Incident Disclosure / Specificity 4), but the text quality is degraded.

### 8-K Extraction

**Roadblock: EDGAR full-text search misses filings.** The EFTS keyword search doesn't reliably return all cybersecurity 8-Ks. Post-May 2024, companies moved non-material disclosures from Item 1.05 to Items 8.01 or 7.01.

**Resolution:** Built `scan-8k-items.py` to scan the SEC's bulk `submissions.zip` deterministically — a gap-free scan of every 8-K with cybersecurity content. Tries items in priority order (1.05 → 8.01 → 7.01), skips cross-reference stubs.

Result: **207 cybersecurity incident 8-K filings** identified — a complete inventory.

### Paragraph Deduplication

Each paragraph gets a `textHash` (SHA-256 of normalized text). Deduplication at three levels:

1. **Within-filing:** Parser artifacts sometimes produce duplicate blocks. Removed by textHash.
2. **Cross-year (same company):** Companies copy-paste identical paragraphs year-to-year. Detected but kept — the repetition itself is informative for disclosure quality analysis.
3. **Cross-company boilerplate:** Different companies use identical materiality disclaimers. Detected but kept — these are real Specificity 1 examples.

**Result:** Only ~27 excess duplicates removed (0.04%). Most textual similarity is legitimate variation.

### Performance at Scale

Initial extraction with cheerio (DOM parser) was slow for 9,000 filings. Built `fast-reparse.ts` (regex-only HTML stripping, no DOM) and `parallel-reparse.ts` (16 bun workers in parallel). Also deduplicates amendment filings (keeps latest per CIK×FiscalYear).
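The "last match wins" fix for Table-of-Contents shadowing can be sketched in a few lines. This is a hedged illustration, not the production parser; the function name and regexes are ours, and the real pipeline layers many more artifact filters on top:

```typescript
// Illustrative sketch: anchor Item 1C extraction on the LAST heading match
// (the TOC entry always comes first), then cut at the next "Item 2" heading.
function extractItem1C(text: string): string | null {
  // \s* tolerates the inline-element word-splitting artifact ("Item1C")
  const heading = /item\s*1c\b/gi;
  let lastStart = -1;
  for (const m of text.matchAll(heading)) lastStart = m.index!;
  if (lastStart === -1) return null;
  const section = text.slice(lastStart);
  // Skip past our own heading, then end the section at the next "Item 2".
  const tail = section.slice(7);
  const end = tail.search(/item\s*2\b/i);
  return end === -1 ? section : section.slice(0, end + 7);
}
```

Taking the first match instead of the last is exactly the silent failure described above: the extractor returns the Table of Contents stub, which is empty or wrong, with no error raised.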
### Corpus Statistics

- **72,045 paragraphs** from ~9,000 filings (FY2023 + FY2024 + early FY2025)
- All 10-K Item 1C; 207 8-K paragraphs extracted separately
- Median ~7 paragraphs per filing
- 49,795 paragraphs annotated (after filtering to complete filing metadata)

### Roadblock: Truncated Filings

Discovered 72 filings (~0.8%) where section boundary detection cut off mid-sentence. A paragraph about CISSP certifications cut mid-sentence looks like vague boilerplate — this would corrupt specificity labels.

**Resolution:** Exclude from training splits. Filings where the last paragraph doesn't match `/[.!?;")\u201d]\s*$/` are filtered before train/val/test creation.

---

## Phase 3: Codebook Development

### Initial Codebook (v1.0)

Built a detailed labeling codebook (`docs/LABELING-CODEBOOK.md`) grounded in the SEC rule structure. Includes:

- 7 category definitions with SEC basis citations, key markers, and example texts
- 4 specificity levels with boundary rules
- 5 category decision rules for common ambiguities
- 5 borderline cases with worked reasoning
- Gold set protocol for human validation

### Codebook Iteration (v3.0 — 2026-03-29)

After analyzing 150,000+ Stage 1 annotations and identifying systematic disagreement patterns, we made three major codebook rulings:

**Ruling A — Materiality Disclaimers:** Paragraphs with explicit materiality assessments ("have not materially affected our business strategy, results of operations, or financial condition") are Strategy Integration, even if boilerplate. A cross-reference to Risk Factors appended to a materiality assessment does not change the classification. Only pure cross-references with no materiality conclusion are None/Other. *This resolved ~1,094 disputed paragraphs.*

**Ruling B — SPACs and Shell Companies:** Companies explicitly stating they have no operations, no cybersecurity program, or no formal processes receive None/Other regardless of incidental mentions of board oversight or risk acknowledgment.
The absence of a program is not a description of a program. *This resolved ~53 unresolved paragraphs and likely hundreds more.*

**Ruling C — Person vs. Function Test (Management Role vs. RMP):** This was the single most impactful ruling, addressing the #1 disagreement axis (2,290 disputes). The line: if the paragraph is about the *person* (qualifications, credentials, background, tenure, career history) → Management Role. If it's about what the role/program *does* (processes, activities, tools, frameworks) → Risk Management Process, even if a CISO/CIO/CTO title appears. The test: would the paragraph still make sense if you removed the person's name, title, and credentials? If yes → the paragraph is about the function, not the person.

---

## Phase 4: Stage 1 — Synthetic Expert Annotation

### Tech Stack Decision

Chose TypeScript + Vercel AI SDK v6 + OpenRouter over Python + LangChain/LiteLLM because:

- Vercel AI SDK provides native structured output with Zod schema validation
- OpenRouter gives single-API access to all candidate models with real cost tracking
- Bun runtime for fast script execution with native TypeScript support
- JSONL-append pattern for crash-safe resume without data loss or duplicate API spend

### Prompt Engineering (12+ iterations, v1.0 → v2.5)

This was one of the most time-intensive phases. Key lessons:

**What worked:**

- Text enum labels ("Firm-Specific") over ordinals ("3") — universal improvement across all models
- Decision-test format ("ask in order, stop at first yes") for specificity — reduced ambiguity
- ✓ IS / ✗ NOT fact lists with explicit examples — the single biggest lever for specificity accuracy. Reduced overrating from 54 to 21 cases.
- Validation step ("review your specific_facts, remove NOT-list items") — let models catch and correct their own errors
- 13 calibration examples, each targeting a specific observed failure mode — examples outperformed rules
- Explicit Incident↔Strategy tiebreaker — completely eliminated a 20-case confusion pattern
- `specific_facts` chain-of-thought in the schema — forces the model to enumerate evidence before assigning specificity

**What didn't work:**

- Adding more rules (v1.2) — confused models, caused regression from 95% → 88% category accuracy
- Changing category definitions to structural "TEST:" format (v2.6) — regression
- "COMMON MISTAKES" section (v2.7) — improved consensus but reduced unanimity
- Attempting a Management↔RMP tiebreaker in the prompt (v2.5) — made confusion worse (this was ultimately resolved through the v3.0 codebook ruling instead)

**Critical lesson: 40-sample pilots were misleadingly optimistic.** Results that looked good at n=40 fell apart at n=500. We standardized on 500-sample pilots for all prompt evaluation.
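The "ask in order, stop at first yes" decision-test format can be made concrete. The sketch below is an illustration of the pattern, not the codebook's exact rules: the feature names and the two-quantified-facts threshold for Specificity 4 are our assumptions:

```typescript
// Illustrative "ask in order, stop at first yes" decision test for specificity.
// Evidence counts are assumed to be computed AFTER filtering against the NOT-list.
type SpecificityLevel = 1 | 2 | 3 | 4;

interface ParagraphEvidence {
  quantifiedVerifiableFacts: number; // e.g. dollar amounts, dates, named frameworks
  firmSpecificFacts: number;         // facts tied to this company specifically
  sectorAdaptedLanguage: boolean;    // industry-tailored but not firm-specific
}

function decideSpecificity(e: ParagraphEvidence): SpecificityLevel {
  // Walk from most to least specific; the first passing test wins.
  if (e.quantifiedVerifiableFacts >= 2) return 4; // Quantified-Verifiable
  if (e.firmSpecificFacts >= 1) return 3;         // Firm-Specific
  if (e.sectorAdaptedLanguage) return 2;          // Sector-Adapted
  return 1;                                       // Generic boilerplate
}
```

The value of this format in the prompt is that it removes ordering ambiguity: a paragraph can exhibit evidence at several levels at once, and the ordered test makes the highest level authoritative by construction.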
### The Iteration Trajectory

Five 40-sample pilots (v1.0, v1.1, v1.2, v2.1, v2.2-n40) followed by six 500-sample pilots (v2.2–v2.7):

| Version | n | Both Unan | Key Change | Top Confusion Axis |
|---------|---|-----------|------------|--------------------|
| v2.2 | 500 | 51.4% | First 500-sample baseline | Incident↔Strategy (20 cases) |
| v2.3 | 500 | 59.2% | Tightened Sector-Adapted, expanded IS/NOT lists | Inc↔Strat reduced |
| v2.4 | 500 | 66.8% | Validation step, schema constraint on specific_facts | Mgmt↔RMP emerging |
| **v2.5** | **500** | **70.8%** | Incident↔Strategy tiebreaker, QV calibration examples | **Inc↔Strat eliminated**; Mgmt↔RMP now #1 (17 cases) |
| v2.6 | 500 | 67.8% | Changed defs to "TEST:" format — **regression** | — |
| v2.7 | 500 | 67.6% | Added COMMON MISTAKES section — **regression** | — |

The most dramatic single improvement: v2.5's Incident↔Strategy tiebreaker ("DESCRIBES what happened → Incident; ONLY discusses cost/materiality → Strategy") completely eliminated what had been the #1 confusion axis at v2.2 (20 cases → 0). This is a case where a single well-targeted rule outperformed broad prompt restructuring.

v2.5 was locked as the production prompt. v2.6 and v2.7 demonstrated that the prompt had reached its practical ceiling — further structural changes caused regressions. The remaining disagreements (Management↔RMP, specificity boundaries) turned out to be codebook ambiguities and model-capacity issues, not prompt failures.

### The Original Panel and the Nano Problem

The initial Stage 1 panel was:

- `google/gemini-3.1-flash-lite-preview`
- `openai/gpt-5.4-nano`
- `x-ai/grok-4.1-fast`

GPT-5.4-nano was chosen for its low cost and the assumption that even a small model could handle structured classification with a good enough prompt. This assumption was wrong.

**The problem: nano wasn't thinking.** During pilot testing, we discovered nano produced **zero reasoning tokens 64% of the time**.
When it did reason, the output was minimal (34,356 total reasoning tokens across 500 paragraphs, versus grok's 336,993). Without reasoning, nano's classifications were essentially pattern-matching on surface features — it couldn't apply the multi-step decision logic the codebook requires (enumerate facts, filter against IS/NOT lists, count QV-eligible items, apply threshold).

**The symptoms:**

- **Erratic specificity** — nano was simultaneously too conservative on some axes ([1,3,3] disagreements — 21 cases where nano said Generic when gemini+grok said Firm-Specific) and too liberal on others ([3,3,4] — 11 cases where nano said Quantified when the others said Firm-Specific). No prompt change fixed this because it's a model-level capacity issue: without reasoning tokens, the decision test can't execute properly.
- **Lowest pairwise agreement** — gemini×grok agreed on 95.6% of categories and 91.2% of specificity. gemini×nano: 87.4% category, 83.8% specificity. Nano was the consistent outlier.
- **Dragging down unanimity** — the gemini+grok pair was strong, but nano's disagreements broke unanimity on hundreds of paragraphs that would otherwise have been clean.

Despite 12 prompt iterations (v1.0 → v2.7) that improved overall metrics significantly, nano's behavior never stabilized. The prompt was at its practical ceiling for a model that wouldn't reason.

### Smoke Testing: model-probe.ts

Before running an expensive benchmark, we built `model-probe.ts` to test 9 candidate models on a single paragraph for basic structured output compliance:

- gemini-3.1-flash-lite-preview, grok-4.1-fast, gpt-4.1-mini, gpt-4.1-nano, claude-haiku-4.5, gemini-3.1-flash-preview, deepseek-chat-v3-0324:free, llama-4-maverick, qwen3-235b-a22b

This caught schema-level incompatibilities (wrong field names, missing fields, invalid enum values) before we spent money on 500-paragraph bench runs.
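The kind of single-response compliance check a probe script performs can be sketched as follows. This is a dependency-free illustration, not the actual `model-probe.ts` (which works through the Vercel AI SDK); the field names are assumptions:

```typescript
// Illustrative schema-compliance probe: parse one model response, then verify
// field names and enum values before committing to a paid 500-paragraph run.
const VALID_CATEGORIES = new Set([
  "Board Governance", "Management Role", "Risk Management Process",
  "Third-Party Risk", "Incident Disclosure", "Strategy Integration", "None/Other",
]);

function probeResponse(raw: string): { ok: boolean; error?: string } {
  let parsed: { category?: unknown; specificity?: unknown };
  try {
    // Some models wrap JSON in markdown code fences; strip before parsing.
    parsed = JSON.parse(raw.trim().replace(/^```(?:json)?\s*|\s*```$/g, ""));
  } catch {
    return { ok: false, error: "invalid JSON" };
  }
  if (typeof parsed.category !== "string" || !VALID_CATEGORIES.has(parsed.category)) {
    return { ok: false, error: "bad category" };
  }
  if (typeof parsed.specificity !== "number" || ![1, 2, 3, 4].includes(parsed.specificity)) {
    return { ok: false, error: "bad specificity" };
  }
  return { ok: true };
}
```

A model that fails this on a single paragraph will fail it hundreds of times at scale, so the probe pays for itself immediately.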
### Model Benchmark: 6 Candidates to Replace Nano

After locking prompt v2.5, we built `model-bench.ts` to formally evaluate nano replacements. Each candidate was benchmarked against the 500-sample pilot set and compared to the existing gemini+grok annotations.

| Model | Cost/ann | Reasoning Tokens | vs Majority (both) | Cat Outlier | Spec Outlier | Nano→X Delta |
|-------|----------|------------------|--------------------|-------------|--------------|--------------|
| seed-2.0-lite | $0.00227 | 658 | **88.8%** | 2.2% | 3.8% | +11.6pp |
| **mimo-v2-flash** | **$0.00048** | **1,346** | **86.0%** | **5.0%** | **4.0%** | **+8.8pp** |
| glm-4.5-air | $0.00136 | 854 | 76.2% | 8.8% | 9.6% | +0.8pp |
| minimax-m2.5 | $0.00106 | 590 | 73.8% | 7.9% | 12.7% | -1.0pp |
| mistral-small-2603 | $0.00015 | **0** | 66.8% | 9.2% | 17.6% | -6.8pp |
| nemotron-3-super-120b | $0.00152 | 942 | 57.9% | **21.3%** | **20.7%** | **-16.9pp** |

**Key findings:**

- **Reasoning tokens are the strongest predictor of accuracy.** Mistral-small produced literally zero reasoning tokens — not a single one. Its average output was only 136 tokens (vs mimo's 1,463). It had a 17.6% specificity outlier rate. This confirmed that the nano problem wasn't prompt-specific: models that don't reason can't do this task.
- **Price ≠ quality.** Nemotron was the most expensive candidate at $0.00152/annotation with 942 reasoning tokens (it *was* thinking), but thinking badly — 21.3% category outlier rate, worst of any candidate. Only 497/500 completed (3 failures). Replacing nano with nemotron would have been catastrophic: -16.9pp unanimity.
- **The two mediocre options.** GLM-4.5-air (+0.8pp) and minimax-m2.5 (-1.0pp) neither helped nor hurt. Not worth the switch.
- **Seed-2.0-lite was technically the best** at 88.8% agreement with majority, but cost 4.7x more than mimo ($0.00227 vs $0.00048) and was 2x slower (21.5s vs 11.4s latency). For 50K+ paragraphs at scale, this cost differential was significant.
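A quick back-of-envelope makes the cost differential concrete, projecting the table's per-annotation prices to a ~50,000-paragraph corpus (one annotation per paragraph per model):

```typescript
// Corpus-scale cost projection from the bench table's per-annotation prices.
function corpusCost(perAnnotationUsd: number, paragraphs = 50_000): number {
  return Number((perAnnotationUsd * paragraphs).toFixed(2));
}

const seedCost = corpusCost(0.00227); // seed-2.0-lite → $113.50 per full pass
const mimoCost = corpusCost(0.00048); // mimo-v2-flash →  $24.00 per full pass
```

At one full pass each, the 4.7x price ratio is the difference between ~$24 and ~$114 per model slot, and it compounds across pilots, reruns, and multi-model panels.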
### The Winner: mimo-v2-flash

Mimo won the slot on value:

1. **Cheapest viable option** — $0.00048/annotation (3x cheaper than most candidates)
2. **Most reasoning tokens** — 1,346 avg (highest of all 6, more than seed-2.0-lite)
3. **Low outlier rates** — 5.0% category, 4.0% specificity (second only to seed-2.0-lite)
4. **+8.8pp unanimity improvement** over nano
5. **93.4% category agreement with grok** — strongest pairwise alignment of any candidate

**Roadblock: Mimo schema quirks.** Mimo produced non-standard outputs: capitalized confidence labels ("High" instead of "high"), numeric confidence values (0.9 instead of "high"), and flat string arrays instead of structured `{fact, type}` objects for specific_facts. Rather than trying to fix this with prompting (which would waste tokens and might break other behavior), we fixed it with Zod schema transforms — `.transform()` to normalize casing and map numbers to labels, `.union()` to accept both structured and flat fact formats. This took ~30 minutes to implement and handled all edge cases automatically.

A dedicated `mimo-pilot.ts` script modeled the full "replace nano with mimo" scenario before committing to the panel change.

**Final Stage 1 panel:**

- `google/gemini-3.1-flash-lite-preview`
- `xiaomi/mimo-v2-flash` ← replaced `openai/gpt-5.4-nano`
- `x-ai/grok-4.1-fast`

### Production Run Results

Completed 2026-03-28. **150,009 annotations** (50,003 paragraphs × 3 models), **$115.88 total cost**, **0 failures**.

| Metric | Value |
|--------|-------|
| Both-unanimous | 35,204 (70.7%) |
| Majority agreement | 14,182 (28.5%) |
| Unresolved (3-way split) | 409 (0.8%) |
| Total cost | $115.88 |
| Failures | 0 |

---

## Phase 5: Post-Stage 1 Analysis — Discovering Systematic Patterns

After the production run, we conducted a deep distributional analysis of disagreement patterns. This analysis fundamentally changed our approach to Stage 2.
### Model Bias Discovery

Each model has systematic, quantifiable biases:

| Model | Category Outlier Rate | Specificity Outlier Rate | Key Bias |
|-------|-----------------------|--------------------------|----------|
| Mimo | **48.1%** | 32.5% | Over-classifies as Third-Party Risk; under-rates Spec 4 (74.3% of Spec 4 outlier cases) |
| Gemini | 30.9% | **45.7%** | Over-classifies as Management Role (81.1% in Mgmt↔RMP disputes); inflates specificity |
| Grok | 21.0% | 21.8% | Most moderate; slight RMP bias |

These biases are not random — they're predictable by model and confusion axis. This opened the possibility of model-calibrated majority voting (using the known biases to assess when the majority is likely correct).

### Key Distributional Findings

1. **Management Role is the disaster category** — only 51.5% unanimous (every other category is 62-79%). Nearly half of all Management Role paragraphs need resolution.
2. **Spec 4 (Quantified-Verifiable) is the disaster specificity** — only 37.6% unanimous. Models can't agree on what counts as "quantified."
3. **Stage 1 confidence is completely useless** — 95.4% of paragraphs report all-high category confidence. Zero all-low cases. The cheap models are systematically overconfident.
4. **Specificity is effectively a 3-level scale** — Spec 2 (Sector-Adapted) is rarely disputed (82.1% unanimous). The contested boundaries are [1,3] (3,742 disputes) and [3,4] (2,898 disputes), with almost nothing at [1,2] or [2,3].
5. **Longer paragraphs are harder** — Q5 word count (>134 words): 64.1% unanimous vs Q1 (≤51 words): 76.3%.
6. **Small companies (1-3 paragraphs) are noise-prone** — 50.0% unanimous, 10.5% unresolved. Almost all are SPACs or shell companies with non-standard disclosures.
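The model-calibrated majority voting idea can be sketched as a resolution function: on a 2-1 split, if the dissenting model is exhibiting one of its known biases, trust the majority; otherwise flag for judge resolution. The bias table below holds two illustrative entries drawn from the analysis; the exact production rules and thresholds are assumptions:

```typescript
// Illustrative model-calibrated majority resolution for the category task.
interface Vote { model: "gemini" | "mimo" | "grok"; category: string }

// Known per-model over-classification biases (illustrative subset).
const KNOWN_BIASES: Record<string, string[]> = {
  mimo: ["Third-Party Risk"],   // over-classifies vendor mentions
  gemini: ["Management Role"],  // over-classifies person mentions
};

function resolveCategory(votes: Vote[]): { label?: string; flagged: boolean } {
  const tally = new Map<string, number>();
  for (const v of votes) tally.set(v.category, (tally.get(v.category) ?? 0) + 1);
  const [top, count] = [...tally.entries()].sort((a, b) => b[1] - a[1])[0];

  if (count === votes.length) return { label: top, flagged: false }; // unanimous
  if (count === 2) {
    const outlier = votes.find((v) => v.category !== top)!;
    const knownBias = KNOWN_BIASES[outlier.model]?.includes(outlier.category) ?? false;
    // A known bias explains the dissent → majority likely correct; else flag.
    return { label: top, flagged: !knownBias };
  }
  return { flagged: true }; // three-way split → unresolved, judge required
}
```

The point is economic: every 2-1 split explained by a known bias is a paragraph that never needs a judge call.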
### Top Disagreement Axes

| Axis | Disputes | Pattern |
|------|----------|---------|
| Management Role ↔ RMP | 2,290 | Paragraph describes processes but names CISO/CIO |
| RMP ↔ Third-Party Risk | 1,475 | Mimo over-classifies vendor mentions as Third-Party |
| None/Other ↔ Strategy Integration | 1,094 | Materiality disclaimers — genuinely ambiguous in codebook |
| Board Governance ↔ Management Role | 867 | Paragraphs at the board-management interface |
| Spec [1,3] boundary | 3,742 | NOT-list items counted as specific facts |
| Spec [3,4] boundary | 2,898 | Gemini counts roles as QV-eligible; Mimo downgrades |

### Insight: Reading the Actual Paragraphs

We sampled 20 paragraphs across the 4 hardest dispute types and read them in full. Patterns emerged:

- **Management↔RMP:** Every example follows the same structure — a process-focused paragraph that names a CISO/CIO in the opening attribution. The paragraph's content is about what the program does, not who the person is. The v3.0 "person-vs-function" ruling directly addresses this.
- **None/Other↔Strategy:** All 5 sampled paragraphs are "no material incidents" boilerplate. Every single one. The materiality disclaimer ruling resolves this entirely.
- **Spec [3,4]:** Gemini counts "20 years of experience" + "CISO" as 2 QV facts → Spec 4. Grok/Mimo correctly exclude named roles from QV counting → Spec 3. The rule exists in the prompt but Gemini ignores it.
- **Small company unresolved:** All SPACs or blank check companies with "we have no operations" disclaimers. The SPAC ruling handles these.
---

## Phase 6: Stage 2 — Judge Model Evaluation

### Gold Label Construction

Built a 50-paragraph gold set using 3 independent Sonnet agents:

- Agent A: paragraphs 0-24
- Agent B: paragraphs 25-49
- Agent C: all 50 as cross-check
- Adjudicator agent resolved 11 disputes with detailed reasoning
- Inter-annotator agreement: 94% category, 84% specificity, 78% both

**Lesson learned: majority vote ≠ ground truth.** Initially scored judges against the Stage 1 majority, which made gemini-3-flash look great (86% category match). Scoring against gold labels revealed it added zero value — it was rubber-stamping the majority. Always evaluate against adjudicated gold labels.

### Judge Model Benchmarking (8 candidates)

| Model | Mode | n | Cat | Spec | Both | Fails | Cost/call |
|-------|------|---|-----|------|------|-------|-----------|
| Majority vote | — | 50 | 78.0% | 80.0% | 60.0% | 0% | $0 |
| gpt-5.4-mini | structured | 50 | 88.0% | 80.0% | 68.0% | 0% | $0.0046 |
| GLM-5 v2 | structured | 48 | 87.5% | 89.6% | 77.1% | 4% | $0.0078 |
| GLM-5 v4 | structured+req_params | 44 | 90.9% | 88.6% | 79.5% | 12% | $0.0083 |
| GLM-5 v3 | tool calling | 50 | 84.0% | 82.0% | 72.0% | 0% | $0.0070 |

### Roadblock: GLM-5 Structured Output Failures

GLM-5 had the best accuracy (77-80% both-correct) but a 6-12% structured output failure rate. The model intermittently wraps JSON in markdown code blocks.

**Investigation:** Built diagnostic scripts (`judge-diag.ts`, `judge-diag-batch.ts`) to isolate the issue. Tested all 9 failing paragraphs × 2 attempts each. Found 72% success rate, all from the same model variant (`z-ai/glm-5-20260211`). The best OpenRouter provider (Ambient) has a 6% base error rate. This is a model-level behavior, not provider-specific.

**Attempted fixes:**

- Bumped validation retries from 1 to 3 → reduced failures from 18% to ~4-12%
- Tool calling mode → 0% failures, but accuracy dropped ~7pp (72% both). Enum constraints not enforced; `undefined` categories appear.
- `provider: { require_parameters: true }` in OpenRouter → no effect
- Exacto routing → no effect

**Resolution:** Accepted as a model-level constraint. Production strategy will use the best model with retry logic and fall back to a reliable model (gpt-5.4-mini) for persistent failures.

### Judge Prompt Iteration (v1 → v2)

Built a dynamic judge prompt (`buildJudgePrompt()`) with:

- **Disagreement diagnosis:** Tells the judge exactly what's in dispute and the vote distribution
- **Targeted disambiguation rules:** 7 category guidance blocks + 2 specificity guidance blocks, dynamically included only when relevant to the specific dispute
- **Structured analysis steps:** Critique each annotator → enumerate IS-list facts → determine dominant purpose → decide
- **Confidence calibration:** HIGH/MEDIUM/LOW mapped to codebook clarity, used as training weights
- **Anti-bias:** Fisher-Yates shuffle of annotator order

**Results:** Category accuracy improved +10pp over majority vote for both models. Specificity improved +9.8pp for GLM-5 but stayed flat for gpt-5.4-mini. The disambiguation rules work well for category, but specificity needs the codebook v3.0 changes.

### Key Finding: Judge Confidence Is Highly Predictive

| Confidence | GLM-5 Both-Correct | gpt-5.4-mini Both-Correct |
|------------|--------------------|---------------------------|
| High | 82-84% | 80.6% |
| Medium | 25-50% | 35.7% |

This enables confidence-stratified training data: high-confidence judge labels get full weight; medium/low are downweighted or excluded.

---

## Phase 7: Revised Data Quality Strategy

The post-Stage 1 analysis and judge benchmarking led to a fundamental reassessment of our approach.

### The Key Realization

The best judge (77% both-correct) barely beats the raw majority vote (78% category, 80% specificity). Judging all 14,591 disputed paragraphs at 77% accuracy doesn't meaningfully improve on the majority. The judge's real value is concentrated in two places:

1. The 409 unresolved paragraphs where no majority exists
2. Cases where we have specific reason to doubt the majority

### The Revised Plan

**Phase 0: Codebook rulings (completed)** — Three rulings that resolve thousands of disputes at zero inference cost: materiality disclaimers → Strategy Integration, SPACs → None/Other, person-vs-function test for Management↔RMP.

**Phase 1: Model-calibrated majority resolution** — For the 14,182 majority-agreement paragraphs, apply calibration using known model biases. When the known-biased model is the outlier on a known axis → trust the majority. Flag anomalous cases for judge resolution. Expected to auto-resolve ~10,000-12,000 paragraphs.

**Phase 2: Human gold set (1,200 paragraphs)** — The assignment requires 1,200 human-labeled paragraphs. Building a quiz-gated labeling web tool that enforces codebook knowledge before each session. Stratified sampling to ensure all categories, specificity levels, and confusion axes are represented. This becomes the calibration metric for all further work.

**Phase 3: Judge prompt iteration** — Update the judge prompt to mirror codebook v3.0 rulings. Add worked examples from the 11 gold adjudications. Iterate against the expanded gold set. Target: 85%+ both-correct.

**Phase 4: Production judge run** — Judge only the ~3,000-5,000 genuinely hard cases (unresolved + flagged majority + "both" disputes). Two models for cross-validation on the hardest cases.

**Phase 5: Training data assembly** — Confidence-stratified tiers:

| Tier | Source | Est. Accuracy | Paragraphs | Treatment |
|------|--------|---------------|------------|-----------|
| T1 | Both-unanimous | ~97% | 35,204 | Full weight |
| T2 | Calibrated majority | ~85-90% | ~9,000-12,000 | Full weight |
| T3 | Judge high-confidence | ~84% | ~2,000-3,000 | Full weight |
| T4 | Judge medium-confidence | ~40% | ~500-1,000 | Downweight (0.5) or soft labels |
| T5 | Judge low / failure / excluded | ??? | ~500-1,000 | Exclude |

Expected total: ~46,000-48,000 paragraphs at ~93-95% label accuracy.

---

## Phase 8: Human Labeling Webapp (Labelapp)

### Why Build a Webapp?

The project requires 1,200 human-labeled paragraphs as a gold holdout set — the calibration metric for everything downstream. Six student annotators, three per paragraph, 600 per person. The labels need to be reliable enough to benchmark the GenAI pipeline and validate the final classifier.

The alternative was everyone tagging in a shared JSON file or spreadsheet. That would almost certainly produce poor data quality. The failure modes are well-documented in the annotation literature, and we'd hit all of them:

- **Inconsistent category names.** Free-text entry in a spreadsheet means "Risk Management Process" vs "Risk Mgmt" vs "RMP" vs "3" — all referring to the same class but requiring manual reconciliation.
- **Skipped or double-labeled paragraphs.** No enforced assignment tracking means annotators can accidentally skip paragraphs or label the same one twice without anyone noticing until export.
- **No codebook enforcement.** The labeling codebook has 7 categories, 4 specificity levels, 5 decision rules, and 3 codebook rulings (v3.0). Without quiz gating, annotators can start labeling without understanding the materiality disclaimer ruling, the person-vs-function test, or the QV counting threshold — exactly the boundaries where annotation quality lives or dies.
- **No feedback loop.** In a spreadsheet, an annotator who misunderstands the SPAC ruling labels 600 paragraphs before anyone catches it. A webapp with warmup feedback catches the misunderstanding in the first 5 paragraphs.
- **No timing data.** For the writeup, we need per-paragraph labeling times to report annotator effort and identify paragraphs that are disproportionately hard. A spreadsheet gives you nothing; even a basic timer gives you wall-clock time corrupted by idle periods.
A purpose-built labeling tool turns all of these failure modes into solved problems. Constrained radio buttons eliminate typos. Server-side assignment tracking prevents skips and duplicates. Quiz gating enforces codebook knowledge. Warmup paragraphs with gold feedback catch misunderstandings early. Active timing with idle detection gives clean data for the writeup.

### The Onboarding Funnel

Every annotation session follows the same enforced path:

1. **Login** → annotator selects their name, enters password. Session cookie (HMAC-SHA256 signed, 8-hour expiry).
2. **Dashboard** → shows progress, links to training materials or labeling.
3. **Quiz** → 8 questions (2 per type), random draw from a bank of ~30. Four question types target the exact codebook boundaries that cause the most disagreement in the GenAI pipeline:
   - **Person-vs-function** (Management Role vs RMP) — the #1 disagreement axis (2,290 disputes in Stage 1)
   - **Materiality disclaimers** (Strategy Integration vs None/Other) — resolved ~1,094 disputes via codebook ruling
   - **QV fact counting** (Specificity 3 vs 4) — the hardest specificity boundary
   - **SPAC exception** (None/Other for shell companies)

   Pass threshold: 7/8 correct. Immediate feedback with codebook explanation after each answer. Failed → review mistakes → retry.
4. **Warmup** → 5 pre-selected paragraphs with known gold labels. Identical UI to real labeling, but after submit, the annotator sees the gold answer + explanation. This catches systematic misunderstandings before they contaminate 600 labels.
5. **Labeling** → the real thing. 600 assigned paragraphs per annotator.

The quiz questions are not random trivia — they're targeted at the exact confusion axes that the GenAI pipeline struggles with. If an annotator can't reliably distinguish Management Role from RMP, their labels on that axis are noise. Better to catch that before they start than after.
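The signed session cookie from step 1 can be sketched with Node's standard crypto primitives. This is a minimal illustration of the scheme, not the app's actual implementation; the secret, delimiter, and field layout are assumptions:

```typescript
import { createHmac, timingSafeEqual } from "node:crypto";

// Illustrative "name.expiry.tag" cookie: the tag is an HMAC-SHA256 over the
// first two fields, so the server can verify both identity and expiry without
// any server-side session store.
const SECRET = "demo-secret"; // real app: load from environment, never hardcode

function signSession(annotator: string, expiresAtMs: number): string {
  const payload = `${annotator}.${expiresAtMs}`;
  const tag = createHmac("sha256", SECRET).update(payload).digest("hex");
  return `${payload}.${tag}`;
}

function verifySession(cookie: string, nowMs: number): string | null {
  const parts = cookie.split(".");
  if (parts.length !== 3) return null;
  const [annotator, exp, tag] = parts;
  const expected = createHmac("sha256", SECRET)
    .update(`${annotator}.${exp}`)
    .digest("hex");
  const a = Buffer.from(tag, "utf8");
  const b = Buffer.from(expected, "utf8");
  if (a.length !== b.length || !timingSafeEqual(a, b)) return null; // tampered
  if (nowMs > Number(exp)) return null; // expired (8-hour window set at login)
  return annotator;
}
```

Because the expiry is inside the signed payload, an annotator cannot extend their own session by editing the cookie; any change invalidates the tag.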
### Labeling Interface Design The labeling UI prioritizes speed and consistency: - **Paragraph display:** Full text with filing metadata badges (company, ticker, filing type, date, SEC item) in the header bar. - **Constrained input:** Radio buttons for both category (7 options) and specificity (4 options). No free-text entry for classifications. - **Keyboard shortcuts:** 1-7 for category, Q/W/E/R for specificity, N to focus notes, Enter to submit. An experienced annotator never touches the mouse. - **Codebook sidebar:** Floating button opens a slide-out panel with all category definitions, IS/NOT lists, specificity levels, and decision rules. Always one click away — annotators don't need to switch to a separate document. - **Progress bar:** Shows completed/total in the header. Annotators know where they stand. - **Notes field:** Optional free-text for edge cases or uncertainty. Useful for adjudication — if an annotator flags "this could be either Management Role or RMP, went with RMP because the person-vs-function test says..." that reasoning helps the adjudicator. ### Sampling Strategy The 1,200 paragraphs are not randomly sampled. Random sampling from 50K paragraphs would over-represent the easy cases (Board Governance at Specificity 1 is unambiguous) and under-represent the hard cases that actually test annotation quality. 
Instead, the sampling is stratified by the disagreement patterns discovered in the Stage 1 analysis (Phase 5): | Stratum | Count | Why | |---------|-------|-----| | Management ↔ RMP split votes | 120 | #1 disagreement axis — validates the person-vs-function ruling | | None/Other ↔ Strategy splits | 80 | Materiality disclaimer boundary | | Specificity [3,4] splits | 80 | QV counting — the hardest specificity boundary | | Board ↔ Management splits | 80 | Board/management interface | | Rare category guarantee | 120 | ≥15 per category, extra for Incident Disclosure (sparse) | | Proportional stratified random | 720 | Fill remaining from category × specificity cells | This ensures the gold set is informative where it matters most: at the decision boundaries where both humans and models are most likely to disagree. ### Assignment: Balanced Incomplete Block Design (BIBD) Each paragraph gets exactly 3 of 6 annotators. The assignment uses a balanced incomplete block design: - C(6,3) = 20 unique triples. Assign 60 paragraphs to each triple. - Each annotator appears in C(5,2) = 10 triples → 10 × 60 = 600 paragraphs per person. - Every annotator pair shares equal paragraph overlap → pairwise Cohen's Kappa is statistically valid across all 15 pairs. This is important for the writeup: we can report inter-rater reliability as a full pairwise matrix, not just an average that hides weak pairs. ### Active Timer and Idle Detection The initial implementation tracked raw wall-clock `duration_ms` per label — `Date.now()` when the paragraph loaded, minus `Date.now()` at submit. This is corrupted by any idle time (annotator walks away, checks email, gets coffee). We added `useActiveTimer`, a React hook that tracks active vs idle time using mouse/keyboard/scroll/focus events with a 30-second idle threshold. When no activity is detected for 30 seconds, the timer pauses and the header shows an amber "idle" indicator. 
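The idle-aware accounting can be modeled without any UI framework. A Python sketch of one plausible implementation (the real logic is the `useActiveTimer` React hook; representing activity as a sorted list of event timestamps is an assumption made for illustration):

```python
IDLE_THRESHOLD_MS = 30_000  # timer pauses after 30s without activity

def timing_metrics(load_ms, submit_ms, activity_ms):
    """Compute wall-clock and idle-excluded durations for one label.

    `activity_ms` is a sorted list of timestamps of mouse/keyboard/scroll/
    focus events between paragraph load and submit. Each inter-event gap
    contributes at most IDLE_THRESHOLD_MS: the timer runs for 30s after
    the last event, then pauses until the next one.
    """
    duration_ms = submit_ms - load_ms
    active_ms = 0
    prev = load_ms
    for t in list(activity_ms) + [submit_ms]:
        active_ms += min(t - prev, IDLE_THRESHOLD_MS)
        prev = t
    return duration_ms, active_ms
```

Under this accounting, a ten-minute coffee break contributes 30 seconds of active time rather than ten minutes — which is exactly why the idle-excluded metric is the one worth reporting.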
Both `duration_ms` (wall-clock) and `active_ms` (idle-excluded) are submitted with every label. For the writeup, `active_ms` is the metric to report — it reflects actual cognitive effort per paragraph. `duration_ms` is retained for completeness. Pre-existing labels (before the timer change) have `active_ms = NULL` and are excluded from timing analysis. ### Infrastructure Decisions **Stack:** Next.js (App Router) + Drizzle ORM + Postgres + Tailwind + shadcn/ui. Deployed via Docker with a Postgres sidecar. **Migrations:** Switched from `drizzle-kit push --force` (schema diffing at startup) to file-based Drizzle migrations (`drizzle-kit generate` + `drizzle-kit migrate`). A `scripts/ensure-migration-baseline.ts` script handles the transition for existing databases by seeding the migration journal with the baseline hash. **Monorepo:** The labelapp triggered converting the repo to a Bun workspace monorepo with shared Zod schemas (`packages/schemas/`). This ensures the labelapp's category/specificity enums are identical to the GenAI pipeline's — no possibility of a mismatch between what the models label and what the humans label. ### Adjudication After all 3 annotators label a paragraph: - **3/3 agree** on both dimensions → consensus (no intervention needed) - **2/3 agree** on both dimensions → majority rules - **Otherwise** → flagged for admin adjudication The admin page shows disputed paragraphs with all 3 labels side-by-side, annotator notes, and Stage 1 consensus for reference. The adjudicator picks a label, enters a custom one, or marks it for team discussion. Adjudications are stored separately from labels for audit trail. 
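The routing rule above is small enough to state as code. A sketch of the consensus logic (the `(category, specificity)` tuple shape and return values are illustrative; the real implementation lives in the labelapp server):

```python
from collections import Counter

def resolve(labels):
    """Route three (category, specificity) labels per the consensus rule.

    Returns (status, category, specificity). status is 'consensus' when
    all three annotators agree on both dimensions, 'majority' when at
    least 2/3 agree on each dimension, and 'adjudicate' (with None
    values) when either dimension lacks a majority.
    """
    agreement = []
    resolved = []
    for dim in range(2):  # 0 = category, 1 = specificity
        counts = Counter(label[dim] for label in labels)
        value, n = counts.most_common(1)[0]
        if n < 2:  # three-way split on this dimension
            return "adjudicate", None, None
        agreement.append(n)
        resolved.append(value)
    status = "consensus" if all(n == 3 for n in agreement) else "majority"
    return status, resolved[0], resolved[1]
```

Only the 'adjudicate' outcome reaches the admin page; everything else resolves automatically.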
### Key Technical Artifacts | Artifact | Location | |----------|----------| | Implementation plan | `docs/labelapp-plan.md` | | Agent guide | `labelapp/AGENTS.md` | | Database schema | `labelapp/db/schema.ts` | | Active timer hook | `labelapp/hooks/use-active-timer.ts` | | Labeling UI | `labelapp/app/label/page.tsx` | | Quiz questions | `labelapp/lib/quiz-questions.ts` | | Warmup paragraphs | `labelapp/lib/warmup-paragraphs.ts` | | BIBD assignment generator | `labelapp/lib/assignment.ts` | | IRR metrics (Kappa, Alpha) | `labelapp/lib/metrics.ts` | | Stratified sampling | `labelapp/lib/sampling.ts` | | Baseline migration | `labelapp/drizzle/0000_baseline.sql` | | Migration transition script | `labelapp/scripts/ensure-migration-baseline.ts` | | Docker entrypoint | `labelapp/entrypoint.sh` | ### Opus Golden Labeling With the human gold set nearing completion, we added a parallel labeling pass using Claude Opus 4.6 as an additional expert annotator. The motivation is empirical: the GenAI pipeline's Stage 1 consensus + Stage 2 judge combination has shown strong alignment with the codebook throughout development, and Opus represents a significant capability jump over the models used in Stages 1 and 2. Having an independent Opus annotation for every gold-set paragraph gives us a third perspective alongside the human labels and the existing pipeline labels — useful for adjudication, for measuring human-vs-model agreement, and as an upper bound on what automated annotation can achieve. **Implementation:** Rather than routing through OpenRouter (which would cost ~$27-80 depending on the model), we used the Claude Agent SDK (`@anthropic-ai/claude-agent-sdk`) to call Opus 4.6 through the existing Claude Code subscription. The Agent SDK's `query()` function accepts a custom system prompt and structured output schema, so we configured it as a fully isolated classifier: no tools, no hooks, no settings, no session persistence — just a system prompt and a JSON schema response. 
**Key design decisions:** 1. **Full codebook as system prompt.** The Stage 1/2 pipeline uses a condensed v2.5 operational prompt (~4KB). For Opus, we feed the entire labeling codebook (`docs/LABELING-CODEBOOK.md`, ~42KB) plus the operational prompt plus the JSON output schema. Opus has the context window and reasoning depth to actually use the worked examples, borderline cases, and decision rules that cheaper models would ignore. 2. **Reasoning traces saved.** Opus's adaptive thinking produces step-by-step codebook application (e.g., "Count QV-eligible facts: specific date (2020), 24 years (quantified)... two hard verifiable facts → Quantified-Verifiable"). These are saved in the `golden.thinking` field alongside each annotation — valuable both for adjudication and for understanding where the codebook's boundaries create ambiguity. 3. **Raw confidence preserved.** Opus returns numeric confidence (0-1) rather than the categorical high/medium/low that cheaper models produce. We save the raw values (`golden.rawCategoryConfidence`, `golden.rawSpecificityConfidence`) before coercing them through the existing `Confidence` transform. This gives a finer-grained signal for weighting or analysis. 4. **Serial execution at 1 req/s.** The Claude Code subscription has rate limits, so the batch runs serially with a 1-second delay between requests. At ~4 paragraphs/minute (including Opus thinking time), the full 1,200-paragraph set completes in ~5 hours. Crash-safe JSONL checkpoint resume means it can be interrupted and restarted without re-running completed paragraphs. **Output:** `data/annotations/golden/opus.jsonl` — standard `Annotation` records (compatible with the existing pipeline) plus a `golden` block containing thinking traces, raw confidence values, and the model's specific fact extractions. The `provenance.promptVersion` is tagged `v2.5+codebook` to distinguish from standard Stage 1/2 annotations. 
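The crash-safe resume pattern from decision 4 is worth making concrete. A hedged sketch of the checkpoint loop (`classify_fn` stands in for the actual Agent SDK call; the field names are illustrative, not the real record schema):

```python
import json
import time
from pathlib import Path

def run_batch(paragraphs, classify_fn, out_path, delay_s=1.0):
    """Serially annotate paragraphs, appending one JSON line per result.

    On restart, paragraph IDs already present in `out_path` are skipped,
    so an interrupted run resumes without repeating completed work.
    """
    out = Path(out_path)
    done = set()
    if out.exists():
        with out.open() as f:
            done = {json.loads(line)["paragraph_id"] for line in f if line.strip()}
    with out.open("a") as f:
        for para in paragraphs:
            if para["id"] in done:
                continue
            record = classify_fn(para)          # the expensive model call
            record["paragraph_id"] = para["id"]
            f.write(json.dumps(record) + "\n")
            f.flush()  # each completed paragraph is durable immediately
            time.sleep(delay_s)                 # rate-limit pacing
    return out
```

Because each result is flushed as its own line, a crash loses at most the in-flight paragraph, and the restart cost is a single linear scan of the JSONL file.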
--- ## Phase 9: Pre-Training Strategy — DAPT + TAPT ### The Decision: Own Filings Over PleIAs/SEC For domain-adaptive pre-training (DAPT), we needed a corpus of clean SEC filing text. Two options: 1. **PleIAs/SEC** (373K full 10-K texts on HuggingFace, going back years, CC0 license) — massive but uncleaned, and a single training pass on ~18B tokens would take weeks on a single RTX 3090. 2. **Our own ~9,000 cached filings** (FY2023-2024, HTML already downloaded during extraction) — smaller but recent, relevant, and we already have the HTML cleaning pipeline. We chose option 2. The reasoning: - **Recency > volume.** Item 1C didn't exist before FY2023. The cybersecurity disclosure vocabulary, boilerplate patterns, and regulatory framing are all new to this filing cycle. Pre-2023 filings teach the model general SEC language, which ModernBERT already knows from its general pre-training. The marginal value of historical filings is low for our specific task. - **The scaling laws paper says stop early.** SEC filing scaling laws (arXiv:2512.12384) show the largest DAPT gains in the first 200M tokens, with diminishing returns after. Our 9,000 full filings yield ~450M tokens — already in the sweet spot. - **We control the cleaning quality.** Our `stripHtml()` pipeline handles all the HTML artifacts we fought during extraction (XBRL tags, entity encoding, page breaks, inline element word splits). PleIAs/SEC is a black box — we'd need to audit it anyway. - **Feasibility on a 3090.** 450M tokens: ~2-3 days. 18B tokens: weeks. Single GPU means we need to be strategic about compute allocation. The DAPT corpus preparation is simple: run the existing `stripHtml()` on cached filing HTML (full text, skipping the Item 1C section extraction step) and output clean text as sharded JSONL. ### Adding TAPT: "Don't Stop Pretraining" Gururangan et al. 
(2020) "Don't Stop Pretraining" demonstrated that task-adaptive pre-training (TAPT) — continued MLM on the unlabeled task data specifically — gives consistent gains on top of DAPT, especially when the task distribution differs from the broader domain. Item 1C is a very specific subset of SEC filings. It has its own vocabulary (CISO, NIST CSF, tabletop exercises, materiality assessments), structure (governance → management → process → strategy is a common paragraph sequence), and boilerplate patterns that differ substantially from the rest of a 10-K. TAPT teaches the model this specific distribution before we ask it to classify. The cost is negligible: our 72K paragraphs from `paragraphs-clean.jsonl` are already clean text (~5-10M tokens). TAPT takes 2-3 hours on a 3090 — essentially free compared to DAPT. ### The Training Pipeline ``` ModernBERT-large (base, 395M params) → DAPT on 9K full 10-K filings (~450M tokens, ~2-3 days) → SEC-ModernBERT-large → TAPT on 72K Item 1C paragraphs (~10M tokens, ~2-3 hours) → SEC-cyBERT-large → Fine-tune on labeled data with dual classification heads → Final classifier ``` This gives us clean ablation rows: base → +DAPT → +TAPT → +SCL, isolating the contribution of each step. --- ## Phase 10: Data Quality Audit and Corpus Remediation ### The Discovery While preparing the DAPT corpus, we discovered that the paragraph data was less clean than we assumed. The extraction pipeline had been built to handle the worst HTML artifacts (word splits, XBRL tags, page breaks), but two systematic issues had been silently corrupting the training data: 1. **Orphan words.** HTML source wraps text at fixed column width. When an HTML tag consumes most of a line, only the first word fits before the source newline. `stripHtml()` preserved that newline, and the paragraph segmenter dropped the single-word fragment. Result: paragraphs like "sole executive officer and director is responsible for..." instead of "Our sole executive officer..."
— 4.7% of all paragraphs. 2. **Inlined section headings.** The paragraph segmenter didn't strip sub-section headings ("Risk Management and Strategy", "Board Oversight") from paragraph body text. These headings became the first "sentence" of the paragraph. Result: 22% of paragraphs had section titles prepended to body text — a near-perfect predictor of `content_category` that creates shortcut learning risk. ### The Generator Investigation Initial quality metrics showed 45% of filings in an "UNKNOWN" generator bucket. This felt wrong — SEC HTML comes from identifiable tools. We investigated and identified **14 distinct filing generators** covering 99.99% of 14,759 HTML files using meta tags, comments, namespace declarations, CSS patterns, and CIK-based filing agent lookup. The investigation revealed that the worst-quality generator, **EFiling/EDGAR Agent (GoFiler/Novaworks XDX)**, had been hidden in the UNKNOWN bucket. It accounts for 13.5% of all filings but produces 36.8% orphan word rate (8x corpus average), the lowest paragraphs-per-filing (5.7 vs 7.7 avg), and 5.9% fragment rate. The second worst, **CompSci Transform** (6% of filings), had a 14.8% orphan word rate. By contrast, the clean generators — Workiva (24.3%), Donnelley (15.8%), and Inline XBRL (16.4%) — all had <1% orphan word rates. Over 70% of paragraphs came from clean generators. The problem was concentrated, not uniform. Full generator reference: `docs/EDGAR-FILING-GENERATORS.md`. Full audit findings: `docs/DATA-QUALITY-AUDIT.md`. ### Six Surgical Patches All fixes follow the same principle: `paragraphs-clean.jsonl` is **frozen** — never modified. All fixes go through separate `.patched.jsonl` files. Annotations link by paragraph UUID, which never changes. Every patch is documented with scope, method, and validation. | Patch | Method | Paragraphs | Annotated | |-------|--------|-----------|-----------| | 1-2. 
Orphan word restoration | HTML lookback: find paragraph text in stripped HTML, extract preceding word | 2,233 | 1,537 | | 3. Heading strip (space separator) | Pattern match against 71 known Item 1C sub-headings | 7,514 | 5,013 | | 4. Heading strip (colon separator) | "Heading Text: Sentence..." patterns | 370 | 227 | | 5. Heading strip (period/dash/caps) | Extended separator detection | 184 | 133 | | 6. HTML-confirmed headings | Bold/underline/h-tag extraction from source HTML, validated against paragraph starts | 343 | 270 | | **Total** | | **8,411 headings + 2,233 orphans** | **~7,100 of 49,795 (14.3%)** | The heading detection required five progressive passes because no single heuristic caught all separator styles. The HTML-confirmed pass (Patch 6) used a 32-worker parallel extraction script to scan 6,341 filings in 1.7 seconds, caching styled headings per filing for reuse. ### Orphan Word Re-Annotation The orphan word patches weren't just cosmetic. Analysis revealed **label bias** in orphan-word paragraphs: - Strategy Integration 1.55x over-represented (16.1% vs 10.4% baseline) - Management Role 0.49x under-represented - Board Governance 0.60x under-represented Missing subject words like "Our", "We", "The" strip governance context that models rely on for classification. This suggested the original annotations on these paragraphs might be systematically wrong. **Decision: re-run Stage 1 on patched text.** Cost: $3.30 for 4,611 annotations (1,537 paragraphs × 3 models), completed in ~9 minutes at 60 concurrency with zero failures. 
**Results:** - **119 paragraphs (7.7%)** changed consensus category — confirming the bias was real - **37 paragraphs (2.4%)** changed consensus specificity - **152 total (9.9%)** changed on at least one dimension - mimo-v2-flash was most sensitive (14.6% category changes); gemini least affected (6.0%) - 18 original conflicts resolved, 22 new conflicts introduced — roughly a wash on Stage 2 savings - Top transitions: Management Role ↔ Risk Management Process (55/51 each direction), Strategy Integration → None/Other (46), Third-Party Risk → Risk Management Process (34) The re-run annotations are stored separately in `data/annotations/stage1-orphan-rerun.jsonl` — the original `stage1.jsonl` is untouched. For training, the re-run annotations replace the originals for the affected 1,537 paragraphs. ### No-Cyber-Keyword Paragraphs: A False Alarm The quality audit flagged 528 paragraphs (348 annotated) with no cybersecurity keywords at all — suspicious for Item 1C content. Initial expectation: these are section bleed from adjacent filing sections, probably labeled None/Other. **Actual finding:** 65.2% (227 paragraphs) were labeled as real categories — mostly Risk Management Process (44.8%) and Management Role (10.6%). And the labels were **correct.** The paragraphs discuss security topics using synonymous terms: "risk assessment", "access to systems", "theft of intellectual property", "safeguards", "internal notifications" — all legitimate cybersecurity content that doesn't use the literal word "cybersecurity." The keyword filter was too narrow, not the paragraphs. All 348 are kept. ### Heading-Stripped Paragraphs: Labels Still Valid For the ~5,643 annotated paragraphs where headings were stripped, existing labels are retained without re-annotation. The heading was a shortcut learning signal (near-perfect predictor of category), but annotators classified the body text, not the heading. 
Stripping the heading from training data removes a leaky feature without invalidating the label. ### Embedded Bullet Lists: The Cascade Failure A spot-check of a Bancorp 34, Inc. paragraph revealed a class of structural corruption we hadn't detected. The paragraph read as a 114-word run-on: > establishing and maintaining a comprehensive program to oversee and manager external connections and third-party relationships with access to the institution's technology assets maintaining an incident response program intended to enable us to mitigate the impact of, and recover from, any cyberattacks, and facilitate communication to internal and external experienced a single cybersecurity event in June of 2023... The source HTML (filed via EFiling/XDX) had three clearly separate elements: two bullet-list blocks and a third block disclosing a $25,000 cybersecurity incident. The HTML structure was unambiguous — separate table rows with spacers between them. **Root cause: a three-part cascade failure in the extraction pipeline.** 1. **Bullet character not recognized.** The HTML used `·` (middle dot in Symbol font) instead of `•` (standard bullet). `stripHtml()` doesn't decode it, so the bullet-aware merge logic in the segmenter never fires. 2. **Lowercase continuation merge.** Each bullet starts lowercase ("establishing...", "maintaining..."), so the segmenter treats them as continuation fragments of the previous block. 3. **Short-block append.** Individual bullets fall below the 20-word minimum, so they get appended to the previous paragraph. The result: two process-description bullet items and an incident disclosure fused into one incoherent paragraph. Despite this, all 3 Stage 1 models unanimously labeled it Incident Disclosure / Specificity 4 — the $25K incident detail dominated the merged text. We identified two classes of this failure: 1. **Semicolon-separated merges (1,941 paragraphs):** The semicolons from the original list survived, but the bullet characters were stripped. Detectable by heuristic (3+ semicolons, lowercase after each, no bullet markers). 2. **Invisible merges (222 paragraphs):** Even the semicolons were stripped, leaving text that simply runs together with no trace of the original list structure. The Bancorp 34 example falls in this category — "to internal and external experienced a single cybersecurity event" is an impossible English sentence that a regex cannot distinguish from legitimate prose. These were detected by a secondary heuristic (lowercase-start, not orphan-patched, 60+ words), but this is an undercount — some invisible merges start with uppercase text. All 2,163 were reclassified to the "degraded" tier. These aren't worth patching — splitting merged bullets requires per-paragraph HTML structure analysis and re-annotation of every resulting fragment.
Instead, they'll be downweighted (0.5x) during fine-tuning to reduce overfitting to degraded text patterns while preserving their content signal. ### Sample Weighting for Fine-Tuning The quality tier system maps directly to training sample weights: | Tier | Weight | Rationale | |------|--------|-----------| | clean | 1.0 | No issues | | headed | 1.0 | Heading removed, body text intact | | minor | 1.0 | Orphan word restored | | degraded | 0.5 | Labels likely correct, but text structure doesn't match clean inference-time inputs | This is implemented via a `sample_weight` column in the training dataset. The HuggingFace Trainer supports this through a `compute_loss` override: per-sample cross-entropy is computed with reduction `"none"`, multiplied by the tier weight, and averaged before backpropagation. Degraded paragraphs still contribute to learning, but their influence is halved relative to clean data. ### Data Integrity Framework The audit produced a formal data integrity framework: 1. `paragraphs-clean.jsonl` is frozen — the reproducibility anchor 2. All fixes go through `.patched.jsonl` — same schema, same IDs, updated text and hash 3. Annotations link by UUID — stable across patches 4. Never re-run extraction from HTML — cascade effects from merge logic cause thousands of ripple-effect changes 5. Every patch is documented with scope, method, validation, and annotation impact 6. Quality metadata is separate from text data — per-paragraph quality scores in a separate file ### Quality Tier System Each paragraph gets a quality tier based on detected issues: | Tier | Criteria | Count | % | |------|----------|-------|---| | clean | No detected issues | 58,165 | 80.7% | | headed | Had inlined heading (now stripped) | 7,402 | 10.3% | | degraded | Embedded bullets, invisible merges, fragments, truncations | 4,331 | 6.0% | | minor | Had orphan word (now fixed) | 2,147 | 3.0% | All "headed" and "minor" paragraphs have been patched — the tier records what *was* wrong for traceability.
"Degraded" paragraphs are downweighted (0.5x) during fine-tuning. --- ## Phase 11: DAPT Corpus Preparation ### Corpus Cleaning The DAPT corpus is built from 14,759 cached 10-K HTML filings processed through `stripHtml()` + `cleanForDapt()`. Three rounds of cleaning were required: **Round 1** revealed XBRL data blobs (8.7% of docs, up to 33% of document text), page number artifacts, and exhibit listing boilerplate. Added targeted stripping for `iso4217:`, `xbrli:`, CIK-number sequences, and `F-N` page markers. **Round 2** removed URLs (39% of docs → 0.3%) and XBRL exhibit listing lines ("Inline XBRL Taxonomy Extension Calculation Linkbase Document" — present in 85% of filings). Initial investigation claimed these were "legitimate prose mentions of XBRL." Spot-checking showed every single remaining match was exhibit index boilerplate. Stripped any line containing "XBRL" unless it also contained cybersecurity/risk/governance terms. **Round 3** was a verification pass confirming the remaining 7.4% of docs with "XBRL" traces are legitimate prose co-occurrences with security terms. The page number regex initially had a branch matching `[- ]\d{1,3}[- ]` that produced 100% false positives — it was matching negative financial figures (`-1%`) in sensitivity analysis tables. Only the `F-\d+` pattern was genuine. The false-positive branch was removed. ### Corpus Statistics (Final) | Metric | Value | |--------|-------| | Full corpus | 14,568 docs, ~1.056B tokens | | Training subset | ~7,200 docs (newest 500M tokens, FY2024-2025) | | Training sequences (seq_len=8192) | ~60K | | Steps per epoch (eff. batch=32) | ~1,950 | | Actual training time | ~13.5 hours (RTX 3090, 27s/step) | ### Sequence Length Decision ModernBERT was pre-trained at 8192 tokens (Warner et al., 2024). We match this during DAPT to ensure all positional embedding and attention weights — including ModernBERT's alternating local/global attention pattern — receive gradient updates. 
At seq_len=2048, positions 2048-8191 would get no updates, and the global attention layers (every 3rd layer, RoPE theta 160K) would never see long-range context during DAPT. ### Epoch Decision We train for 1 epoch (single pass), following the empirical consensus: - **Gururangan et al. (2020), "Don't Stop Pretraining" (ACL 2020):** Trained DAPT for "12.5K steps, which amounts to a single pass on each domain dataset" across 2-8B token corpora. Sufficient for consistent downstream gains across all four domains tested. - **Ponnock (2025), arXiv:2512.12384:** Found SEC-specific DAPT shows "diminishing marginal returns beyond roughly 250M tokens" within a single epoch. Our 1B token corpus is well past the diminishing-returns threshold. ### Hyperparameters Aligned with Prior ModernBERT DAPT Work We aligned hyperparameters with the ModernBERT paper and two published DAPT efforts: - **MLM probability (30%):** Matches ModernBERT pre-training (Warner et al., 2024). - **Weight decay (1e-5):** Matches ModernBERT pre-training and both BioClinical-ModernBERT (Sounack et al., 2025) and Patent-ModernBERT (Luo et al., 2025). The commonly-cited 0.01 is a BERT/RoBERTa default that doesn't apply to ModernBERT. - **Learning rate (5e-5):** Conservative because we start from the published post-decay checkpoint. BioClinical and Patent-ModernBERT used 3e-4 but started from pre-decay stable-phase checkpoints that the ModernBERT authors released specifically for continued pre-training. ### Training Optimizations Initial training ran at ~47s/step (projected ~56 hours for 1B tokens). Through iterative optimization we brought this down to ~13.5 hours: 1. **Flash Attention 2** (Dao, 2024) — installed via precompiled wheel after upgrading to PyTorch 2.11+cu130 (CUDA 13.0 to match the driver). Without FA2, ModernBERT fell back to O(n²) eager attention at 8192 seq_len. This cut s/step from ~47s to ~27s. 2. **torch.compile** — JIT-compiles non-attention ops into fused CUDA kernels. 
With external FA2, Dynamo hits graph breaks at every attention layer, so there was **no compute speedup**. However, fusing the surrounding ops (FFN, layer norms, residuals) unexpectedly **cut activation memory by a third** (18.2GB → 11.9GB at batch=2) by eliminating intermediate tensor allocations. 3. **Batch size increase** — torch.compile's memory savings freed enough VRAM to increase from batch=2 to batch=4. At seq_len=8192 the GPU is already compute-saturated, so larger batches didn't meaningfully improve s/step (~27s in all configurations). The benefit was marginal reduction in gradient accumulation overhead. 4. **Corpus subsampling** — the single biggest wall-time reduction. Ponnock (2025) showed diminishing returns past 250M tokens for SEC DAPT. Subsampling from 1.06B to 500M tokens (newest filings) halved training from ~29h to ~13.5h. 5. **Fused AdamW + non-reentrant gradient checkpointing + tf32** — minor optimizations (~1-2% combined). Fused optimizer merges parameter updates into a single kernel. Non-reentrant checkpointing enables torch.compile compatibility. **What didn't work:** Increasing batch size beyond 2 provided no s/step improvement because the 3090 is compute-saturated at seq_len=8192 (attention is O(n²) FLOPs even with FA2). SDPA (PyTorch's native attention) couldn't replace external FA2 without OOMing due to different memory allocation patterns. torch.compile couldn't accelerate the attention bottleneck because FA2's custom CUDA kernels are opaque to Dynamo's graph tracer. **The fundamental constraint** is hardware: the RTX 3090's 35.6 bf16 TFLOPS sets a hard ceiling on throughput at 8192 seq_len. An AWS g7e.2xlarge (RTX PRO 6000 Blackwell, 236 bf16 TFLOPS, 96GB VRAM) could complete the same run in ~3.7 hours for ~$5 on spot pricing — the 96GB VRAM allows dropping gradient checkpointing entirely (eliminating activation recomputation) and running batch=16. Full procedure, optimization journey, and cloud cost analysis in `docs/DAPT-PROCEDURE.md`.
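The wall-time figures in this section can be sanity-checked from first principles, using only numbers already stated here (effective batch 32 = per-device batch 4 × gradient accumulation 8):

```python
SEQ_LEN = 8192        # DAPT sequence length
EFF_BATCH = 32        # per-device batch 4 x grad accumulation 8
SEC_PER_STEP = 27     # observed with FA2 + torch.compile

tokens_per_step = SEQ_LEN * EFF_BATCH  # 262,144 tokens consumed per step

def train_hours(corpus_tokens):
    """Estimated wall-clock hours for one epoch over `corpus_tokens`."""
    steps = corpus_tokens / tokens_per_step
    return steps * SEC_PER_STEP / 3600

full_run = train_hours(1.06e9)    # full corpus: roughly 30 hours
subsampled = train_hours(500e6)   # 500M-token subsample: roughly 14 hours
```

Both estimates land close to the reported ~29h and ~13.5h; the small gaps are warmup, evaluation passes, and step-time variation.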
### Early Training Results | Step | Loss | grad_norm | LR | Epoch | Note | |------|------|-----------|-----|-------|------| | 54 | 0.7991 | 0.066 | 2.66e-5 | 0.03 | Warmup phase | | 1280 | 0.7233 | 0.068 | 1.57e-5 | 0.70 | Steady decline | | 1800 | 0.7253 | 0.073 | 1.48e-6 | 0.97 | LR near zero, loss plateaued | | **Final** | **0.7250** | **0.043** | **5.7e-8** | **1.00** | **Eval loss: 0.7250, perplexity: 2.06** | The loss dropped from 0.80 → 0.72 — a gentle 10% decline over one epoch. For comparison, a randomly initialized model would start at ~10.8 (ln(50280 vocab size)). Starting at 0.80 reflects that ModernBERT already knows English; DAPT taught it SEC-specific token co-occurrence patterns ("NIST CSF", "materiality assessment", "tabletop exercise"), not language fundamentals. grad_norm remained stable at 0.04-0.07 throughout with zero instability. Total training time: ~14 hours across two sessions on an RTX 3090 (resumed from checkpoint-1280). The DAPT checkpoint is saved at `checkpoints/dapt/modernbert-large/final/` and is ready for TAPT. ### TAPT Configuration The TAPT corpus is 72K Item 1C paragraphs (~10M tokens) — 50x smaller than the DAPT corpus. This changes several training decisions vs. DAPT. Config file: `python/configs/tapt/modernbert.yaml`. | Parameter | DAPT | TAPT | Rationale for change | |-----------|------|------|---------------------| | `max_seq_length` | 8192 | 512 | Data-driven: paragraphs average 127 tokens (P99=386, 99.6% fit in 512). Using 8192 would mean 98.5% padding — pure waste. See seq_len discussion below. | | `num_train_epochs` | 1 | 5 | Gururangan et al. (2020) ran 100 epochs on 50-500K token TAPT corpora. We match total token exposure: 5 × 10M = 50M tokens ≈ upper bound of their TAPT exposure. | | `whole_word_mask` | false | true | Masks entire words instead of subword pieces. Prevents trivially solvable masking patterns (e.g., masked `cyber` next to unmasked `security`).
The model already knows subword composition from DAPT — TAPT should focus on domain-specific whole words ("CISO", "materiality", "tabletop"). | | `per_device_train_batch_size` | 4 | 32 | Short sequences free VRAM. Tested: batch=32 uses 22.7 GB with torch.compile (vs. OOM at batch=48). | | `gradient_accumulation_steps` | 8 | 1 | Effective batch = 32 in both cases. No accumulation needed since batch=32 fits directly. | | `gradient_checkpointing` | true | false | Not needed at seq_len=512 — activations are small. Gradient checkpointing would slow training 30-40% for no memory benefit. | | `save_strategy` / `eval_strategy` | steps (256) | epoch | 5 epochs; checkpoint and evaluate after each one. | | `validation_split` | 0.02 | 0.05 | Larger val split for a 50x smaller dataset — need enough samples for stable eval loss. | **Sequence length (512 vs. 8192):** The concern with a shorter seq_len is degrading the model's long-range attention capabilities. Three factors make this a non-issue for TAPT: 1. **The data is short.** Paragraphs average 127 tokens. There is no long-range structure to learn — the information simply isn't there. 2. **Scale of exposure.** TAPT is 50M token-exposures (5 epochs × 10M). ModernBERT was pre-trained on ~2T tokens; DAPT added 500M. 50M is 0.0025% of original pre-training — far too small to cause catastrophic forgetting of patterns established over trillions of tokens. 3. **RoPE positions are independent.** ModernBERT uses rotary position embeddings. Positions 0-511 compute identically whether max_length is 512 or 8192. Training at 512 updates the same parameters; positions 512-8191 remain as-is from DAPT, not degraded. **Whole-word masking and tokenization:** Whole-word masking requires `offset_mapping` from the tokenizer to determine word boundaries. This is incompatible with DAPT's concatenate-and-chunk approach (which destroys offset_mapping by merging documents). 
TAPT tokenizes each paragraph individually with truncation, preserving offset_mapping. The data collator handles dynamic padding per batch. This is a different code path from DAPT's concatenation, but the data justifies it: paragraphs are natural self-contained units, unlike DAPT's long filings, which must be chunked.

**Training time:** ~2,139 steps/epoch × 5 epochs = ~10,695 total steps. At ~1.84 it/s on the 3090, ~1.6 hours total.

### TAPT Launch — Whole-Word Masking Bugs

Launching TAPT required fighting through three bugs in `transformers`' `DataCollatorForLanguageModeling` when `whole_word_mask=True`, plus a Python 3.14 incompatibility that forced a version rollback.

**Bug 1: `offset_mapping` stripped before reaching the collator.** The Trainer's default `remove_unused_columns=True` drops any dataset column not in the model's `forward()` signature. Since `offset_mapping` is a collator input (not a model input), it was silently removed, causing the collator to receive a 0-dimensional array and crash with `IndexError: too many indices for array`. Fix: set `remove_unused_columns=False` when whole-word masking is enabled.

**Bug 2: `offset_mapping` can't survive `tokenizer.pad()`.** Even with the column present, the collator's `torch_call()` passes all features — including `offset_mapping` — through `tokenizer.pad()`, which tries to tensorize the variable-length nested lists and crashes with `ValueError`. The collator pops `offset_mapping` *after* padding, but padding has already failed. Fix: subclass `DataCollatorForLanguageModeling` to strip `offset_mapping` before padding.

**Bug 3: word-boundary detection via `offset_mapping` is broken for BPE tokenizers.** This was the most insidious bug — training ran, but loss sat at ~6-8 (near-random, vs. an expected ~1.5-2.0). The upstream `_calc_word_ids_and_prob_mask` detects word boundaries by checking whether `token_start != prev_token_end` in the offset mapping. But BPE tokenizers (like ModernBERT's) absorb leading spaces into tokens, making ALL offsets contiguous: `"The" → (0,3), " company" → (3,11)`. Since 3 == 3, the algorithm treats the entire sequence as one giant "word." When 30% masking is applied to these mega-groups, it masks enormous contiguous spans, making prediction nearly impossible.

**Fix:** Replaced `offset_mapping` entirely with the tokenizer's `word_ids()` method, which correctly identifies word boundaries for any tokenizer type (BPE, WordPiece, SentencePiece). The `WholeWordMaskCollator` in `python/src/dapt/train.py` implements whole-word masking from scratch: it extracts `word_ids` before padding, selects an `mlm_probability` fraction of unique word IDs per sequence, and masks all tokens belonging to the selected words.

**Python 3.14 incompatibility.** Two separate issues forced a rollback to Python 3.13:

1. Python 3.14 changed the default multiprocessing start method from `fork` to `forkserver`, requiring picklable dataloader collators (closures crash with `PicklingError`).
2. Python 3.14 changed `pickle.Pickler._batch_setitems` to take 3 arguments, breaking `dill` (used by `datasets` for config hashing). This was unfixable — even `dill` 0.4.1 and `datasets` 4.8.4 crashed. The breakage is deep in the `datasets` builder machinery and hit every code path (`load_dataset`, `Dataset.from_list`, `dataset.map`).

Rolled `pyproject.toml` from `requires-python = ">=3.14"` to `">=3.13,<3.14"` and updated the flash-attn wheel URL from cp314 to cp313.

---

## Cost and Time Ledger

### Tooling

All code was written collaboratively with **Claude Code** (Anthropic's agentic coding CLI). Claude Code was used throughout the project for pipeline development, prompt engineering, data analysis, script writing, documentation, and strategic planning.
The tool dramatically accelerated iteration speed — writing analysis scripts, debugging extraction edge cases, and exploring the annotation data interactively — but all decisions were made by the team, with Claude Code as an implementation partner.

### API Cost Ledger

| Phase | Cost | Annotations | Notes |
|-------|------|-------------|-------|
| Stage 1 prompt iteration (pilots) | $7.03 | 9,597 | 12+ versions: 5 × 40-sample + 6 × 500-sample |
| Stage 1 model bench (6 candidates) | $3.41 | 2,993 | seed, mimo, glm-4.5-air, minimax, mistral, nemotron |
| Mimo pilot (dedicated comparison) | $0.24 | 500 | `mimo-pilot.ts` — replace-nano scenario modeling |
| Stage 1 run #1 (with nano) | $112.42 | 150,009 | Full production run with gpt-5.4-nano. Completed, but nano's quality was unacceptable (0 reasoning tokens 64% of the time). Gemini+grok annotations ($91.18) preserved in `stage1-gemini-grok.jsonl`; only nano's annotations ($21.24) were discarded. Full original in `stage1.jsonl.bak`. |
| Stage 1 run #2 (mimo only) | $24.69 | 50,003 | Ran only mimo to replace nano. Merged with the preserved gemini+grok annotations to form the final `stage1.jsonl` ($115.88 total value, $24.69 new spend). |
| Judge model bench (8 candidates) | $5.97 | 505 | GLM-5 (4 configs), gpt-5.4-mini, gpt-5.4, sonnet-4.6, gemini-3-flash, grok-4.20, mimo-v2-pro, kimi-k2.5 |
| Orphan word re-annotation | $3.30 | 4,611 | Re-ran Stage 1 on 1,537 patched paragraphs × 3 models. 7.7% changed consensus category. |
| **Total API spend** | **$159** | **~218K unique** | Nano waste: $21.24 |

Only nano's portion ($21.24) of the first run was wasted — the gemini and grok annotations were preserved and merged with the new mimo annotations. Still, $21.24 was thrown away on a model that wasn't thinking. The lesson: benchmark model candidates rigorously *before* committing to a production run. The 40-sample pilots showed nano was the weakest link but were misleadingly optimistic about the magnitude of the problem.
### Time Ledger

| Phase | Hours | Notes |
|-------|-------|-------|
| Data acquisition + HTML cleaning | ~6h | Extraction pipeline, HTML artifact handling, dedup, 8-K discovery. The messiest phase — SEC filing HTML variability required extensive regex heuristics and iteration. |
| Stage 1 annotation run #1 (nano) | ~5h | Production run wall clock (~300 min). Completed, but results were below the quality bar. |
| Stage 1 annotation run #2 (mimo) | ~1h | Only needed mimo annotations, at higher concurrency (gemini+grok reused). |
| Prompt iteration + model benchmarking | ~4h | 12+ prompt versions, 6 model candidates, pilot analysis |
| Post-Stage 1 analysis + Stage 2 planning | ~5h | Distributional analysis, model bias discovery, codebook v3.0 rulings, judge benchmarking, strategy revision |
| Data quality audit + remediation | ~4h | Generator investigation, 6 patches, orphan re-annotation, quality tier system, docs |
| Documentation + narrative | ~2h | Codebook updates, narrative writing, technical guide updates |
| Labelapp build + infrastructure | ~8h | Monorepo restructure, Next.js app, quiz/warmup/labeling flows, BIBD assignment, sampling, Docker deployment, timer + migration infrastructure |
| DAPT pre-training | ~14.5h GPU | 1 epoch on 500M tokens, RTX 3090. Two sessions (resumed from checkpoint-1280). |
| TAPT debugging + pre-training | ~2h dev + ~1.6h GPU | Three transformers whole-word-masking bugs + Python 3.14 rollback. Training: 5 epochs on 72K paragraphs. |
| **Total to date** | **~53h** | Includes ~16h GPU time |

### Remaining Work (estimated)

| Phase | Est. Hours | Est. Cost |
|-------|-----------|-----------|
| Human labeling (1,200 paragraphs, 6 annotators) | ~6-8h | $0 (team labor) |
| Stage 2 judge production run (~3-5K paragraphs) | ~1h | ~$20-40 |
| Training data assembly | ~2h | $0 |
| Fine-tuning + ablations (7 experiments) | ~12-20h GPU | $0 |
| Full GenAI benchmark on 1,200 holdout (9 models) | ~1h | ~$30-50 |
| Evaluation + comparison + write-up | ~6-8h | $0 |

---

## Model Census — Every Model We Tried

Over the course of the project, we evaluated **18 distinct models** across three selection rounds: initial panel selection, the Stage 1 replacement bench, and Stage 2 judge selection. Each decision narrowed the field based on empirical evidence.

### Phase 0: Smoke Test (model-probe.ts) — 9 candidates

Tested basic structured-output compliance on a single paragraph before committing to expensive benchmarks.

| Model | Provider | Result |
|-------|----------|--------|
| google/gemini-3.1-flash-lite-preview | Google | ✅ Pass — selected for panel |
| x-ai/grok-4.1-fast | xAI | ✅ Pass — selected for panel |
| openai/gpt-4.1-mini | OpenAI | ✅ Pass — not selected (cost) |
| openai/gpt-4.1-nano | OpenAI | ✅ Pass — later replaced by gpt-5.4-nano |
| anthropic/claude-haiku-4.5 | Anthropic | ✅ Pass — not selected (cost tier) |
| google/gemini-3.1-flash-preview | Google | ✅ Pass — too expensive for Stage 1 |
| deepseek/deepseek-chat-v3-0324:free | DeepSeek | Tested — free tier limitations |
| meta-llama/llama-4-maverick | Meta | Tested |
| qwen/qwen3-235b-a22b | Alibaba | Tested |

### Phase 1: Early Pilots (v1.0-v1.2) — Original panel

The very first panel used **gpt-oss-120b** (OpenAI's open-source 120B model), not nano:

- `google/gemini-3.1-flash-lite-preview`
- `openai/gpt-oss-120b` (also tested with `:exacto` routing suffix)
- `x-ai/grok-4.1-fast`

gpt-oss-120b was replaced by gpt-5.4-nano between v1.2 and v2.1 — nano was cheaper and appeared to perform comparably on the small (n=40) pilot samples.
### Phase 2: 500-Sample Pilots (v2.2-v2.7) — Nano era

Panel during the main prompt iteration:

- `google/gemini-3.1-flash-lite-preview`
- `openai/gpt-5.4-nano` ← the problem model
- `x-ai/grok-4.1-fast`

Nano's issues (0 reasoning tokens 64% of the time, erratic specificity) were persistent but stayed masked: quality shifts across the 40→500-sample transition were attributed to prompt changes rather than model inadequacy.

### Phase 3: Stage 1 Replacement Bench (model-bench.ts) — 6 candidates

After locking prompt v2.5, we formally benchmarked replacements for nano:

| Model | Provider | Reasoning Tokens | Cost/ann | Outcome |
|-------|----------|-----------------|----------|---------|
| xiaomi/mimo-v2-flash | Xiaomi | 1,346 | $0.00048 | **✅ Winner** — best value, lowest outlier rate |
| bytedance-seed/seed-2.0-lite | ByteDance | 658 | $0.00227 | Runner-up — highest accuracy but 4.7x more expensive |
| z-ai/glm-4.5-air | Zhipu AI | 854 | $0.00136 | Mediocre — barely moved the needle (+0.8pp) |
| minimax/minimax-m2.5 | MiniMax | 590 | $0.00106 | Mediocre — slightly worse than nano (-1.0pp) |
| mistralai/mistral-small-2603 | Mistral | **0** | $0.00015 | ❌ Zero reasoning tokens. Cheapest but useless. |
| nvidia/nemotron-3-super-120b-a12b | NVIDIA | 942 | $0.00152 | ❌ Worst performer despite being expensive. 21% outlier rate. |

### Phase 4: Production Stage 1 — Final panel

- `google/gemini-3.1-flash-lite-preview` (Google)
- `xiaomi/mimo-v2-flash` (Xiaomi) ← replaced nano
- `x-ai/grok-4.1-fast` (xAI)

Three models from three providers — this minimizes correlated errors.
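With three independent annotators per paragraph, labels are combined by majority vote, and three-way splits have no consensus. A minimal sketch of that aggregation (illustrative Python with a hypothetical `consensus` helper; the project's actual pipeline is TypeScript):

```python
from collections import Counter

def consensus(labels):
    """Majority vote over a 3-model panel.

    Returns (label, unanimous) when at least two annotators agree,
    or None on a three-way split (a disputed paragraph).
    """
    label, count = Counter(labels).most_common(1)[0]
    if count == 1:
        return None  # all three disagree: no consensus
    return label, count == len(labels)
```

For example, `consensus(["Board Governance", "Board Governance", "Third-Party Risk"])` returns `("Board Governance", False)`, flagging a non-unanimous majority, while a `None` result marks the paragraph as disputed.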
### Phase 5: Stage 2 Judge Bench (judge-bench.ts) — 8 candidates

| Model | Provider | Mode | Both vs Gold | Fails | Outcome |
|-------|----------|------|-------------|-------|---------|
| z-ai/glm-5 | Zhipu AI | structured | 77-80% | 4-12% | Best accuracy but unreliable structured output |
| z-ai/glm-5 | Zhipu AI | tool calling | 72% | 0% | Reliable but -7pp accuracy |
| openai/gpt-5.4-mini | OpenAI | structured | 68% | 0% | Reliable, weaker on specificity |
| openai/gpt-5.4 | OpenAI | structured | Tested | 0% | Expensive, diminishing returns over mini |
| anthropic/claude-sonnet-4.6 | Anthropic | structured | Used for gold | 0% | Gold label creation; too expensive for a production judge |
| google/gemini-3-flash-preview | Google | structured | Tested | — | Rubber-stamped the majority — added zero value |
| x-ai/grok-4.20-beta | xAI | structured | Tested | — | Benchmarked |
| xiaomi/mimo-v2-pro | Xiaomi | structured | Tested | — | Benchmarked |
| moonshotai/kimi-k2.5 | Moonshot AI | structured | Tested | — | Only 26/50 completed — high failure rate |

### Summary: All Models Tested, by Provider

| Provider | Models Tested | Models in Production |
|----------|--------------|---------------------|
| Google | gemini-3.1-flash-lite, gemini-3.1-flash, gemini-3-flash | gemini-3.1-flash-lite (Stage 1) |
| OpenAI | gpt-oss-120b, gpt-5.4-nano, gpt-4.1-mini, gpt-4.1-nano, gpt-5.4-mini, gpt-5.4 | — (nano dropped) |
| xAI | grok-4.1-fast, grok-4.20-beta | grok-4.1-fast (Stage 1) |
| Xiaomi | mimo-v2-flash, mimo-v2-pro | mimo-v2-flash (Stage 1) |
| Anthropic | claude-haiku-4.5, claude-sonnet-4.6 | sonnet-4.6 (gold labels) |
| Zhipu AI | glm-4.5-air, glm-5 | TBD (Stage 2 judge) |
| ByteDance | seed-2.0-lite | — (too expensive at scale) |
| NVIDIA | nemotron-3-super-120b | — (worst performer) |
| Mistral | mistral-small-2603 | — (zero reasoning) |
| MiniMax | minimax-m2.5 | — (mediocre) |
| Moonshot AI | kimi-k2.5 | — (high failure rate) |
| Meta | llama-4-maverick | — (smoke test only) |
| Alibaba | qwen3-235b-a22b | — (smoke test only) |
| DeepSeek | deepseek-chat-v3-0324 | — (smoke test only) |

---

## Key Technical Artifacts

| Artifact | Location | Description |
|----------|----------|-------------|
| Labeling codebook | `docs/LABELING-CODEBOOK.md` | Authoritative reference, v3.0 with codebook rulings |
| Stage 1 annotations | `data/annotations/stage1.jsonl` | 150,009 annotations (120 MB) |
| Paragraphs | `data/paragraphs/paragraphs-clean.jsonl` | 72,045 paragraphs with filing metadata |
| Gold labels | `data/bench/judges/gold-final.json` | 50 adjudicated gold labels |
| Gold adjudications | `data/bench/judges/gold-adjudicated.json` | 11 detailed adjudication decisions with reasoning |
| Stage 1 prompt | `ts/src/label/prompts.ts` | SYSTEM_PROMPT (v2.5) + buildJudgePrompt() |
| Annotation runner | `ts/scripts/stage1-run.ts` | Resume-safe, configurable concurrency |
| Orphan re-annotation | `ts/scripts/rerun-orphan-stage1.ts` | Re-ran 1,537 patched paragraphs, $3.30 |
| Re-annotation diff | `ts/scripts/diff-orphan-annotations.ts` | Category/specificity change analysis |
| No-cyber analysis | `ts/scripts/analyze-no-cyber.ts` | Label distribution on 348 flagged paragraphs |
| Data quality audit | `docs/DATA-QUALITY-AUDIT.md` | Full audit: generators, patches, quality tiers |
| Generator reference | `docs/EDGAR-FILING-GENERATORS.md` | 14 vendors with signatures and quality profiles |
| Analysis scripts | `ts/scripts/stage1-analyze.ts`, `segment-analysis.ts`, `model-bias-analysis.ts`, `dispute-crosstab.ts`, `sample-disputes.ts` | Deep analytics on annotation data |
| Judge benchmarking | `ts/scripts/judge-bench.ts` | Supports structured/tool modes, gold-label comparison |
| Judge diagnostics | `ts/scripts/judge-diag.ts`, `judge-diag-batch.ts` | GLM-5 failure investigation |
| Model benchmarking | `ts/scripts/model-bench.ts` | Stage 1 candidate evaluation |
| Golden annotation (Opus) | `ts/src/label/golden.ts` | Agent SDK runner for gold set; saves reasoning traces |
| Golden annotations | `data/annotations/golden/opus.jsonl` | Opus 4.6 labels + thinking + raw confidence |

---

## Lessons Learned

### On Prompt Engineering

- Calibration examples beat rules. Each example targets a specific observed failure mode.
- Pilots must be large enough (500+ samples). 40-sample pilots were misleadingly optimistic.
- More rules ≠ better. Once the core structure is right, additional rules cause regressions.
- The `specific_facts` chain-of-thought schema (forcing models to enumerate evidence before deciding) was the single most impactful structural change.

### On Model Selection

- Reasoning tokens are the strongest predictor of accuracy — not price, not model size.
- Schema compliance varies — fix it with Zod transforms, not prompt changes.
- Test both structured output AND tool calling for any candidate. They are not equivalent.

### On Evaluation

- **Never evaluate against majority vote.** Build gold labels. Using majority vote as ground truth makes models that rubber-stamp the majority look good.
- **Judge confidence is highly predictive** of accuracy. Use it to weight training samples.
- **Stage 1 confidence is useless** — cheap models are systematically overconfident (95%+ of labels rated high-confidence).

### On Data Quality at Scale

- The biggest wins come from understanding *where* and *why* models disagree, not from blanket improvements.
- Systematic model biases are quantifiable and predictable. Use them as signal, not noise.
- Codebook ambiguity causes more disagreement than model limitations. Three codebook rulings resolved more disputes than any prompt change.
- Not all labels need the same treatment. Confidence-stratified assembly beats uniform labeling.
- **Freeze originals, patch separately.** The single best data-integrity decision was never modifying `paragraphs-clean.jsonl`. All fixes go through `.patched.jsonl` files with the same UUIDs. This makes every change auditable, reversible, and safe to apply incrementally.
Without this, the 6-patch iteration would have been terrifying.
- **Tag everything you can.** Generator metadata, quality tiers, and anomaly flags cost almost nothing to compute but make targeted remediation possible. Without generator tags, the 36.8% orphan rate in EFiling/XDX would have been invisible — diluted into a 4.7% corpus average.
- **Re-annotation is cheap and validating.** Re-running Stage 1 on 1,537 patched paragraphs cost $3.30 and took 9 minutes. It confirmed that 7.7% of consensus labels were wrong due to the data issue — empirical validation that the patch was necessary, not just cosmetic.

### On Training Infrastructure

- **Whole-word masking in `transformers` is broken for BPE tokenizers.** The upstream `DataCollatorForLanguageModeling(whole_word_mask=True)` uses `offset_mapping` to detect word boundaries by checking for gaps in character offsets. This fails silently for BPE tokenizers that absorb leading spaces — all offsets are contiguous, so the entire sequence becomes one "word." Loss appears to train but sits at ~6-8 (near-random). The fix is to use the tokenizer's `word_ids()` method, which correctly identifies word boundaries for any tokenizer type, and implement the masking yourself.
- **Python 3.14 is not ready for ML.** Both `dill` (via `datasets`) and PyTorch's multiprocessing (`fork` → `forkserver`) have breaking incompatibilities. Rolling back to 3.13 was the only viable path.
- **Flash Attention is mandatory for long sequences.** Without FA2, ModernBERT at seq_len=8192 ran at ~47s/step on an RTX 3090. With FA2, the same configuration ran at ~25s/step — and enabled further optimizations (batch-size increases, torch.compile) that pushed it further.
- **Align hyperparameters with the base model's pre-training config.** ModernBERT was trained with weight_decay=1e-5 and 30% MLM probability. Using the BERT/RoBERTa default of 0.01 weight decay would have been wrong.
Both published ModernBERT DAPT papers (BioClinical, Patent) independently validated these values.
- **torch.compile + gradient_checkpointing together is more than the sum of its parts.** On ModernBERT, this combination resolves a memory anomaly specific to FA2 during MLM training (AnswerDotAI/ModernBERT#172), freeing VRAM for larger batch sizes.
- **Precompiled wheels save hours.** Building flash-attn from source requires matching CUDA toolkit versions, which is fragile. Precompiled wheels for the exact {python, torch, CUDA} combination avoid this entirely.
- **torch.compile's value can be memory, not speed.** When the bottleneck is opaque custom CUDA kernels (like FA2), torch.compile can't accelerate them. But it can still fuse the *surrounding* ops, dramatically reducing activation memory. In our case, compile provided 0% speedup but 35% memory reduction — enough to double the batch size.
- **Corpus subsampling is the biggest lever on consumer hardware.** When you're compute-bound, no software optimization can beat "process less data." The scaling-laws literature (Ponnock 2025) provides empirical justification for stopping early.
- **At long sequence lengths, the GPU saturates at small batches.** Increasing batch from 2→4 at seq_len=8192 provided no s/step improvement on an RTX 3090 — the matmul dimensions are already large enough to fill all 82 SMs. This is the opposite of short-sequence fine-tuning, where batch-size scaling is the primary throughput lever.

---

## References

- Warner, B., Clavié, B., Soldaini, L., et al. (2024). "Smarter, Better, Faster, Longer: A Modern Bidirectional Encoder for Fast, Memory Efficient, and Long Context Fine-tuning and Inference." arXiv:2412.13663.
- Gururangan, S., Marasović, A., Swayamdipta, S., Lo, K., Beltagy, I., Downey, D., & Smith, N.A. (2020). "Don't Stop Pretraining: Adapt Language Models to Domains and Tasks." *Proceedings of ACL 2020*, pp. 8342-8360.
- Ponnock, J. (2025). "The Data Efficiency Frontier of Financial Foundation Models: Scaling Laws from Continued Pretraining." arXiv:2512.12384.
- Sounack, T., et al. (2025). "BioClinical ModernBERT: A Domain-Adapted Encoder for Biomedical and Clinical NLP." arXiv:2506.10896.
- Luo, Z., et al. (2025). "Patent ModernBERT: A Pretrained Language Model for Intellectual Property." arXiv:2509.14926.
- Dao, T. (2024). "FlashAttention-2: Faster Attention with Better Parallelism and Work Partitioning." *Proceedings of ICLR 2024*.
- Ringel, D.M. (2023). "Creating Synthetic Experts with Generative Artificial Intelligence." arXiv:2310.15560.