Project Narrative — SEC Cybersecurity Disclosure Quality Classifier
This document captures the process, roadblocks, decisions, and resolutions from building the SEC cybersecurity disclosure quality classifier. It serves as the source material for the final paper and presentation.
Phase 1: Project Scoping and Construct Design
The Problem
SEC Release 33-11216 (July 2023) created a new annual cybersecurity disclosure requirement (10-K Item 1C) and an incident disclosure requirement (8-K Item 1.05). By FY2024, ~9,000-10,000 filings exist. No validated classifier or public labeled dataset exists for assessing the quality of these disclosures. Investors, regulators, and compliance officers need scalable tools to distinguish substantive disclosures from boilerplate.
Methodology Decision: Ringel (2023) "Synthetic Experts"
We adopted the Ringel (2023) "Synthetic Experts" pipeline: use frontier LLMs to generate training labels at scale, then distill into an efficient encoder model. This approach was chosen because:
- Manual labeling of 50,000+ paragraphs is infeasible for a 6-person team
- Multiple cheap LLMs annotating in parallel provide built-in quality control through inter-annotator agreement
- The encoder distillation step produces a model that can classify at inference time without LLM API costs
Construct: Two Classification Dimensions
We defined two simultaneous classification tasks per paragraph:
- Content Category (7 mutually exclusive classes) — what the paragraph is about, grounded in the SEC rule's own structure (Board Governance, Management Role, Risk Management Process, Third-Party Risk, Incident Disclosure, Strategy Integration, None/Other)
- Specificity Level (4-point ordinal) — how company-specific the disclosure is, from generic boilerplate to quantified-verifiable facts
The construct maps to NIST CSF 2.0 categories for academic grounding.
Phase 2: Data Acquisition and Corpus Construction
The Extraction Problem
SEC filings are not structured data. They're HTML generated from PDFs, XBRL, and Word documents by dozens of different tools, each producing different artifacts. Building a reliable extraction pipeline for ~9,000 filings meant solving a series of messy, real-world data engineering problems.
Pipeline Architecture
Built in TypeScript (~1,000 lines of extraction code across parse-item1c.ts, segment.ts, fast-reparse.ts, and pipeline orchestration):
EDGAR Master Index → enumerate 10-K filings → download HTML → extract Item 1C → segment paragraphs → JSONL
submissions.zip → scan for 8-K Item 1.05 → download HTML → extract → segment → merge with 10-K corpus
Roadblock: HTML Variability
Every filing's HTML is different. The same logical content looks completely different depending on the tool that generated the HTML:
- Word splitting from inline elements. XBRL and styling tags break words mid-token: `<span>It</span><span>em 2</span>` renders correctly in a browser but parses as "Item2" in code; likewise `<b>cyber</b><i>security</i>`. Required detecting adjacent inline element boundaries and inserting spaces selectively.
- CamelCase joins from PDF converters. PDF-to-HTML tools merge sentences across formatting boundaries: `sentence.Next sentence` instead of `sentence. Next sentence`. Required regex passes to detect missing spaces after punctuation.
- Page breaks mid-sentence. Page numbers (`28`, `- 12 -`, `F-3`), running headers (`ACME CORP — ANNUAL REPORT`), and subsidiary headers (`ENTERGY ARKANSAS, LLC AND SUBSIDIARIES`) get spliced into the middle of content paragraphs. Required filtering a catalog of page-artifact patterns.
- Table of Contents shadowing. "Item 1C" appears at least twice in every 10-K — once in the Table of Contents and once in the actual content. Using the first match extracts the wrong section. It took several iterations to discover we needed the LAST match — a silent failure that produced empty or wrong extractions for hundreds of filings before we caught it.
- XBRL tag pollution. Inline XBRL wraps financial facts in `ix:header`, `ix:references`, and `ix:nonFraction` tags that carry no display content but add noise. Required stripping all `ix:*` tags before text processing.
- Entity encoding chaos. Non-breaking spaces, curly quotes (`“`, `”`), em and en dashes (`—`, `–`), and bullets (`•`): each needs correct decoding, and different filing tools use different entity styles for the same characters.
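Two of these cleanup passes can be sketched as small pure functions. The helper names are ours and the real pipeline is more selective, but the regexes illustrate the technique:

```typescript
// Fix missing spaces after sentence punctuation ("sentence.Next" -> "sentence. Next").
// Requiring a lowercase letter before the punctuation avoids mangling
// abbreviations like "U.S." that end on an uppercase letter.
function fixPunctuationJoins(text: string): string {
  return text.replace(/([a-z][.!?])([A-Z])/g, "$1 $2");
}

// Strip inline-XBRL noise: remove ix:header / ix:references blocks wholesale
// (no display content), then unwrap any remaining ix:* tags, keeping inner text.
function stripXbrl(html: string): string {
  return html
    .replace(/<ix:(header|references)\b[\s\S]*?<\/ix:\1>/gi, "")
    .replace(/<\/?ix:[\w-]+[^>]*>/gi, "");
}
```

The word-splitting case (`<span>It</span><span>em 2</span>`) is harder because it needs the DOM boundary, not just the text, so it can't be handled by a post-hoc regex alone.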
Paragraph Segmentation
After extracting clean section text, splitting into paragraphs had its own challenges:
- Bullet list merging. Disclosures frequently use bullet lists ("Our program includes: • risk assessment • vulnerability scanning"). Bullets need to be merged with their intro sentence; a standalone "• vulnerability scanning" is meaningless.
- Continuation line detection. Sentences split across HTML block elements need rejoining. Heuristic: if the previous block lacks terminal punctuation and the next starts lowercase or with a continuation phrase (`and`, `or`, `including`, `such as`), merge.
- Length boundaries. Under 20 words → likely a header (filtered). Over 500 words → split at sentence boundaries to keep annotation units manageable.
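The continuation-merge heuristic can be sketched as a single pass over the extracted blocks (illustrative names, not the actual segment.ts):

```typescript
// Previous block ends a sentence if it closes with terminal punctuation,
// optionally followed by a closing quote or parenthesis.
const TERMINAL = /[.!?:;]["\u201d)]?\s*$/;
const CONTINUATION = /^(and|or|including|such as)\b/i;

function mergeContinuations(blocks: string[]): string[] {
  const out: string[] = [];
  for (const block of blocks) {
    const prev = out[out.length - 1];
    const startsLower = /^[a-z]/.test(block);
    // Merge when the previous block is mid-sentence and this block looks
    // like its continuation; otherwise start a new paragraph.
    if (prev !== undefined && !TERMINAL.test(prev) && (startsLower || CONTINUATION.test(block))) {
      out[out.length - 1] = `${prev} ${block}`;
    } else {
      out.push(block);
    }
  }
  return out;
}
```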
8-K Extraction
Roadblock: EDGAR full-text search misses filings. The EFTS keyword search doesn't reliably return all cybersecurity 8-Ks. Post-May 2024, companies moved non-material disclosures from Item 1.05 to Items 8.01 or 7.01.
Resolution: Built scan-8k-items.py to scan the SEC's bulk submissions.zip deterministically — a gap-free scan of every 8-K with cybersecurity content. Tries items in priority order (1.05 → 8.01 → 7.01), skips cross-reference stubs. Result: 207 cybersecurity incident 8-K filings identified — a complete inventory.
Paragraph Deduplication
Each paragraph gets a textHash (SHA-256 of normalized text). Deduplication at three levels:
- Within-filing: Parser artifacts sometimes produce duplicate blocks. Removed by textHash.
- Cross-year (same company): Companies copy-paste identical paragraphs year-to-year. Detected but kept — the repetition itself is informative for disclosure quality analysis.
- Cross-company boilerplate: Different companies use identical materiality disclaimers. Detected but kept — these are real Specificity 1 examples.
Result: Only ~27 excess duplicates removed (0.04%). Most textual similarity is legitimate variation.
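A minimal sketch of the textHash dedup, assuming normalization means lowercasing plus whitespace collapse (the real pipeline's normalization may differ):

```typescript
import { createHash } from "node:crypto";

// SHA-256 of normalized paragraph text, used as the dedup key.
function textHash(paragraph: string): string {
  const normalized = paragraph.toLowerCase().replace(/\s+/g, " ").trim();
  return createHash("sha256").update(normalized).digest("hex");
}

// Within-filing dedup: drop any paragraph whose hash was already seen.
function dedupeWithinFiling(paragraphs: string[]): string[] {
  const seen = new Set<string>();
  return paragraphs.filter((p) => {
    const h = textHash(p);
    if (seen.has(h)) return false;
    seen.add(h);
    return true;
  });
}
```

Cross-year and cross-company duplicates reuse the same hash but are only flagged, not removed.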
Performance at Scale
Initial extraction with cheerio (DOM parser) was slow for 9,000 filings. Built fast-reparse.ts (regex-only HTML stripping, no DOM) and parallel-reparse.ts (16 bun workers in parallel). Also deduplicates amendment filings (keeps latest per CIK×FiscalYear).
Corpus Statistics
- 72,045 paragraphs from ~9,000 filings (FY2023 + FY2024 + early FY2025)
- All 10-K Item 1C; 207 8-K paragraphs extracted separately
- Median ~7 paragraphs per filing
- 49,795 paragraphs annotated (after filtering to complete filing metadata)
Roadblock: Truncated Filings
Discovered 72 filings (~0.8%) where section boundary detection cut off mid-sentence. A paragraph about CISSP certifications cut mid-sentence looks like vague boilerplate — this would corrupt specificity labels.
Resolution: Exclude from training splits. Filings where the last paragraph doesn't match /[.!?;")\u201d]\s*$/ are filtered before train/val/test creation.
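The completeness filter reduces to one regex test over the filing's last paragraph:

```typescript
// A filing is complete only if its last paragraph ends with terminal
// punctuation, optionally followed by a closing quote or parenthesis.
const COMPLETE_END = /[.!?;")\u201d]\s*$/;

function isTruncated(paragraphs: string[]): boolean {
  const last = paragraphs[paragraphs.length - 1];
  return last === undefined || !COMPLETE_END.test(last);
}
```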
Phase 3: Codebook Development
Initial Codebook (v1.0)
Built a detailed labeling codebook (docs/LABELING-CODEBOOK.md) grounded in the SEC rule structure. Includes:
- 7 category definitions with SEC basis citations, key markers, and example texts
- 4 specificity levels with boundary rules
- 5 category decision rules for common ambiguities
- 5 borderline cases with worked reasoning
- Gold set protocol for human validation
Codebook Iteration (v3.0 — 2026-03-29)
After analyzing 150,000+ Stage 1 annotations and identifying systematic disagreement patterns, we made three major codebook rulings:
Ruling A — Materiality Disclaimers: Paragraphs with explicit materiality assessments ("have not materially affected our business strategy, results of operations, or financial condition") are Strategy Integration, even if boilerplate. A cross-reference to Risk Factors appended to a materiality assessment does not change the classification. Only pure cross-references with no materiality conclusion are None/Other. This resolved ~1,094 disputed paragraphs.
Ruling B — SPACs and Shell Companies: Companies explicitly stating they have no operations, no cybersecurity program, or no formal processes receive None/Other regardless of incidental mentions of board oversight or risk acknowledgment. The absence of a program is not a description of a program. This resolved ~53 unresolved paragraphs and likely hundreds more.
Ruling C — Person vs. Function Test (Management Role vs. RMP): This was the single most impactful ruling, addressing the #1 disagreement axis (2,290 disputes). The line: if the paragraph is about the person (qualifications, credentials, background, tenure, career history) → Management Role. If it's about what the role/program does (processes, activities, tools, frameworks) → Risk Management Process, even if a CISO/CIO/CTO title appears. The test: would the paragraph still make sense if you removed the person's name, title, and credentials? If yes → the paragraph is about the function, not the person.
Phase 4: Stage 1 — Synthetic Expert Annotation
Tech Stack Decision
Chose TypeScript + Vercel AI SDK v6 + OpenRouter over Python + LangChain/LiteLLM because:
- Vercel AI SDK provides native structured output with Zod schema validation
- OpenRouter gives single-API access to all candidate models with real cost tracking
- Bun runtime for fast script execution with native TypeScript support
- JSONL-append pattern for crash-safe resume without data loss or duplicate API spend
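The crash-safe resume pattern can be sketched as follows (illustrative, not the actual stage1-run.ts): completed work is identified by a key already present in the output file, so a restarted run skips finished rows instead of re-spending on them.

```typescript
import { appendFileSync, existsSync, readFileSync } from "node:fs";

// Read back the set of already-annotated paragraph hashes from the JSONL file.
function loadCompleted(path: string): Set<string> {
  if (!existsSync(path)) return new Set();
  return new Set(
    readFileSync(path, "utf8")
      .split("\n")
      .filter((line) => line.trim() !== "")
      .map((line) => JSON.parse(line).textHash as string),
  );
}

// Append one result as a single JSONL line; a crash mid-run loses at most
// the in-flight annotation, never completed ones.
function appendResult(path: string, result: { textHash: string; label: string }): void {
  appendFileSync(path, JSON.stringify(result) + "\n");
}
```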
Prompt Engineering (12+ iterations, v1.0 → v2.7; v2.5 locked for production)
This was one of the most time-intensive phases. Key lessons:
What worked:
- Text enum labels ("Firm-Specific") over ordinals ("3") — universal improvement across all models
- Decision-test format ("ask in order, stop at first yes") for specificity — reduced ambiguity
- ✓ IS / ✗ NOT fact lists with explicit examples — the single biggest lever for specificity accuracy. Reduced overrating from 54 to 21 cases.
- Validation step ("review your specific_facts, remove NOT-list items") — caught model self-correction
- 13 calibration examples, each targeting a specific observed failure mode — examples outperformed rules
- Explicit Incident↔Strategy tiebreaker — completely eliminated a 20-case confusion pattern
- `specific_facts` chain-of-thought in the schema — forces the model to enumerate evidence before assigning specificity
What didn't work:
- Adding more rules (v1.2) — confused models, caused regression from 95%→88% category accuracy
- Changing category definitions to structural "TEST:" format (v2.6) — regression
- "COMMON MISTAKES" section (v2.7) — improved consensus but reduced unanimity
- Attempting a Management↔RMP tiebreaker in the prompt (v2.5) — made confusion worse (this was ultimately resolved through the v3.0 codebook ruling instead)
Critical lesson: 40-sample pilots were misleadingly optimistic. Results that looked good at n=40 fell apart at n=500. We standardized on 500-sample pilots for all prompt evaluation.
The Iteration Trajectory
Five 40-sample pilots (v1.0, v1.1, v1.2, v2.1, v2.2-n40) followed by six 500-sample pilots (v2.2-v2.7):
| Version | n | Both Unan | Key Change | Top Confusion Axis |
|---|---|---|---|---|
| v2.2 | 500 | 51.4% | First 500-sample baseline | Incident↔Strategy (20 cases) |
| v2.3 | 500 | 59.2% | Tightened Sector-Adapted, expanded IS/NOT lists | Inc↔Strat reduced |
| v2.4 | 500 | 66.8% | Validation step, schema constraint on specific_facts | Mgmt↔RMP emerging |
| v2.5 | 500 | 70.8% | Incident↔Strategy tiebreaker, QV calibration examples | Inc↔Strat eliminated; Mgmt↔RMP now #1 (17 cases) |
| v2.6 | 500 | 67.8% | Changed defs to "TEST:" format — regression | — |
| v2.7 | 500 | 67.6% | Added COMMON MISTAKES section — regression | — |
The most dramatic single improvement: v2.5's Incident↔Strategy tiebreaker ("DESCRIBES what happened → Incident; ONLY discusses cost/materiality → Strategy") completely eliminated what had been the #1 confusion axis at v2.2 (20 cases → 0). This is a case where a single well-targeted rule outperformed broad prompt restructuring.
v2.5 was locked as the production prompt. v2.6 and v2.7 demonstrated that the prompt had reached its practical ceiling — further structural changes caused regressions. The remaining disagreements (Management↔RMP, specificity boundaries) turned out to be codebook ambiguities and model-capacity issues, not prompt failures.
The Original Panel and the Nano Problem
The initial Stage 1 panel was:
- google/gemini-3.1-flash-lite-preview
- openai/gpt-5.4-nano
- x-ai/grok-4.1-fast
GPT-5.4-nano was chosen for its low cost and the assumption that even a small model could handle structured classification with a good enough prompt. This assumption was wrong.
The problem: nano wasn't thinking. During pilot testing, we discovered nano produced zero reasoning tokens 64% of the time. When it did reason, the output was minimal (34,356 reasoning tokens total across 500 paragraphs, vs grok's 336,993). Without reasoning, nano's classifications were essentially pattern-matching on surface features — it couldn't apply the multi-step decision logic the codebook requires (enumerate facts, filter against IS/NOT lists, count QV-eligible items, apply threshold).
The symptoms:
- Erratic specificity — nano was simultaneously too conservative on some axes ([1,3,3] disagreements — 21 cases where nano said Generic when gemini+grok said Firm-Specific) and too liberal on others ([3,3,4] — 11 cases where nano said Quantified when the others said Firm-Specific). No prompt change fixed this because it's a model-level capacity issue: without reasoning tokens, the decision test can't execute properly.
- Lowest pairwise agreement — gemini×grok agreed on 95.6% of categories and 91.2% of specificity. gemini×nano: 87.4% category, 83.8% specificity. Nano was the consistent outlier.
- Dragging down unanimity — the gemini+grok pair was strong, but nano's disagreements broke unanimity on hundreds of paragraphs that would otherwise have been clean.
Despite 12 prompt iterations (v1.0→v2.7) that improved overall metrics significantly, nano's behavior never stabilized. The prompt was at its practical ceiling for a model that wouldn't reason.
Smoke Testing: model-probe.ts
Before running an expensive benchmark, we built model-probe.ts to test 9 candidate models on a single paragraph for basic structured output compliance:
- gemini-3.1-flash-lite-preview, grok-4.1-fast, gpt-4.1-mini, gpt-4.1-nano, claude-haiku-4.5, gemini-3.1-flash-preview, deepseek-chat-v3-0324:free, llama-4-maverick, qwen3-235b-a22b
This caught schema-level incompatibilities (wrong field names, missing fields, invalid enum values) before we spent money on 500-paragraph bench runs.
Model Benchmark: 6 Candidates to Replace Nano
After locking prompt v2.5, we built model-bench.ts to formally evaluate nano replacements. Each candidate was benchmarked against the 500-sample pilot set and compared to the existing gemini+grok annotations.
| Model | Cost/ann | Reasoning Tokens | vs Majority (both) | Cat Outlier | Spec Outlier | Nano→X Delta |
|---|---|---|---|---|---|---|
| seed-2.0-lite | $0.00227 | 658 | 88.8% | 2.2% | 3.8% | +11.6pp |
| mimo-v2-flash | $0.00048 | 1,346 | 86.0% | 5.0% | 4.0% | +8.8pp |
| glm-4.5-air | $0.00136 | 854 | 76.2% | 8.8% | 9.6% | +0.8pp |
| minimax-m2.5 | $0.00106 | 590 | 73.8% | 7.9% | 12.7% | -1.0pp |
| mistral-small-2603 | $0.00015 | 0 | 66.8% | 9.2% | 17.6% | -6.8pp |
| nemotron-3-super-120b | $0.00152 | 942 | 57.9% | 21.3% | 20.7% | -16.9pp |
Key findings:
- Reasoning tokens are the strongest predictor of accuracy. Mistral-small produced literally zero reasoning tokens; its average output was only 136 tokens (vs mimo's 1,463), and it had a 17.6% specificity outlier rate. This confirmed that the nano problem wasn't prompt-specific: models that don't reason can't do this task.
- Price ≠ quality. Nemotron was the most expensive candidate at $0.00152/annotation and produced 942 reasoning tokens (it was thinking), but thinking badly: a 21.3% category outlier rate, the worst of any candidate, and only 497/500 completions (3 failures). Replacing nano with nemotron would have been catastrophic: -16.9pp unanimity.
- Two mediocre options. GLM-4.5-air (+0.8pp) and minimax-m2.5 (-1.0pp) neither helped nor hurt. Not worth the switch.
- Seed-2.0-lite was technically the best at 88.8% agreement with the majority, but cost 4.7x more than mimo ($0.00227 vs $0.00048) and was 2x slower (21.5s vs 11.4s latency). For 50K+ paragraphs at scale, this cost differential was significant.
The Winner: mimo-v2-flash
Mimo won the slot on value:
- Cheapest viable option — $0.00048/annotation (3x cheaper than most candidates)
- Most reasoning tokens — 1,346 avg (highest of all 6, more than seed-2.0-lite)
- Lowest outlier rate — 5.0% category, 4.0% specificity
- +8.8pp unanimity improvement over nano
- 93.4% category agreement with grok — strongest pairwise alignment of any candidate
Roadblock: Mimo schema quirks. Mimo produced non-standard outputs: capitalized confidence labels ("High" instead of "high"), numeric confidence values (0.9 instead of "high"), and flat string arrays instead of structured {fact, type} objects for specific_facts. Rather than trying to fix this with prompting (which would waste tokens and might break other behavior), we fixed it with Zod schema transforms — .transform() to normalize casing and map numbers to labels, .union() to accept both structured and flat fact formats. This took ~30 minutes to implement and handled all edge cases automatically.
A dedicated mimo-pilot.ts script modeled the full "replace nano with mimo" scenario before committing to the panel change.
Final Stage 1 panel:
- google/gemini-3.1-flash-lite-preview
- xiaomi/mimo-v2-flash ← replaced openai/gpt-5.4-nano
- x-ai/grok-4.1-fast
Production Run Results
Completed 2026-03-28. 150,009 annotations (50,003 paragraphs × 3 models), $115.88 total cost, 0 failures.
| Metric | Value |
|---|---|
| Both-unanimous | 35,204 (70.7%) |
| Majority agreement | 14,182 (28.5%) |
| Unresolved (3-way split) | 409 (0.8%) |
| Total cost | $115.88 |
| Failures | 0 |
Phase 5: Post-Stage 1 Analysis — Discovering Systematic Patterns
After the production run, we conducted a deep distributional analysis of disagreement patterns. This analysis fundamentally changed our approach to Stage 2.
Model Bias Discovery
Each model has systematic, quantifiable biases:
| Model | Category Outlier Rate | Specificity Outlier Rate | Key Bias |
|---|---|---|---|
| Mimo | 48.1% | 32.5% | Over-classifies as Third-Party Risk; under-rates Spec 4 (74.3% of Spec 4 outlier cases) |
| Gemini | 30.9% | 45.7% | Over-classifies as Management Role (81.1% in Mgmt↔RMP disputes); inflates specificity |
| Grok | 21.0% | 21.8% | Most moderate; slight RMP bias |
These biases are not random — they're predictable by model and confusion axis. This opened the possibility of model-calibrated majority voting (using the known biases to assess when the majority is likely correct).
Key Distributional Findings
- Management Role is the disaster category — only 51.5% unanimous (every other category is 62-79%). Nearly half of all Management Role paragraphs need resolution.
- Spec 4 (Quantified-Verifiable) is the disaster specificity — only 37.6% unanimous. Models can't agree on what counts as "quantified."
- Stage 1 confidence is completely useless — 95.4% of paragraphs report all-high category confidence. Zero all-low cases. The cheap models are systematically overconfident.
- Specificity is effectively a 3-level scale — Spec 2 (Sector-Adapted) is rarely disputed (82.1% unanimous). The contested boundaries are [1,3] (3,742 disputes) and [3,4] (2,898 disputes) with almost nothing at [1,2] or [2,3].
- Longer paragraphs are harder — Q5 word count (>134 words): 64.1% unanimous vs Q1 (≤51 words): 76.3%.
- Small companies (1-3 paragraphs) are noise-prone — 50.0% unanimous, 10.5% unresolved. Almost all are SPACs or shell companies with non-standard disclosures.
Top Disagreement Axes
| Axis | Disputes | Pattern |
|---|---|---|
| Management Role ↔ RMP | 2,290 | Paragraph describes processes but names CISO/CIO |
| RMP ↔ Third-Party Risk | 1,475 | Mimo over-classifies vendor mentions as Third-Party |
| None/Other ↔ Strategy Integration | 1,094 | Materiality disclaimers — genuinely ambiguous in codebook |
| Board Governance ↔ Management Role | 867 | Paragraphs at the board-management interface |
| Spec [1,3] boundary | 3,742 | NOT-list items counted as specific facts |
| Spec [3,4] boundary | 2,898 | Gemini counts roles as QV-eligible; Mimo downgrades |
Insight: Reading the Actual Paragraphs
We sampled 20 paragraphs across the 4 hardest dispute types and read them in full. Patterns emerged:
- Management↔RMP: Every example follows the same structure — a process-focused paragraph that names a CISO/CIO in the opening attribution. The paragraph's content is about what the program does, not who the person is. The v3.0 "person-vs-function" ruling directly addresses this.
- None/Other↔Strategy: All 5 sampled paragraphs are "no material incidents" boilerplate. Every single one. The materiality disclaimer ruling resolves this entirely.
- Spec [3,4]: Gemini counts "20 years of experience" + "CISO" as 2 QV facts → Spec 4. Grok/Mimo correctly exclude named roles from QV counting → Spec 3. The rule exists in the prompt but Gemini ignores it.
- Small company unresolved: All SPACs or blank check companies with "we have no operations" disclaimers. The SPAC ruling handles these.
Phase 6: Stage 2 — Judge Model Evaluation
Gold Label Construction
Built a 50-paragraph gold set using 3 independent Sonnet agents:
- Agent A: paragraphs 0-24
- Agent B: paragraphs 25-49
- Agent C: all 50 as cross-check
- Adjudicator agent resolved 11 disputes with detailed reasoning
- Inter-annotator agreement: 94% category, 84% specificity, 78% both
Lesson learned: majority vote ≠ ground truth. Initially scored judges against Stage 1 majority, which made gemini-3-flash look great (86% category match). Scoring against gold labels revealed it added zero value — it was rubber-stamping the majority. Always evaluate against adjudicated gold labels.
Judge Model Benchmarking (8 candidates)
| Model | Mode | n | Cat | Spec | Both | Fails | Cost/call |
|---|---|---|---|---|---|---|---|
| Majority vote | — | 50 | 78.0% | 80.0% | 60.0% | 0% | $0 |
| gpt-5.4-mini | structured | 50 | 88.0% | 80.0% | 68.0% | 0% | $0.0046 |
| GLM-5 v2 | structured | 48 | 87.5% | 89.6% | 77.1% | 4% | $0.0078 |
| GLM-5 v4 | structured+req_params | 44 | 90.9% | 88.6% | 79.5% | 12% | $0.0083 |
| GLM-5 v3 | tool calling | 50 | 84.0% | 82.0% | 72.0% | 0% | $0.0070 |
Roadblock: GLM-5 Structured Output Failures
GLM-5 had the best accuracy (77-80% both-correct) but a 6-12% structured output failure rate. The model intermittently wraps JSON in markdown code blocks.
Investigation: Built diagnostic scripts (judge-diag.ts, judge-diag-batch.ts) to isolate the issue. Tested all 9 failing paragraphs × 2 attempts each. Found 72% success rate, all from the same model variant (z-ai/glm-5-20260211). The best OpenRouter provider (Ambient) has a 6% base error rate. This is a model-level behavior, not provider-specific.
Attempted fixes:
- Bumped validation retries from 1 to 3 → reduced failures from 18% to ~4-12%
- Tool calling mode → 0% failures, but accuracy dropped ~7pp (72% both); enum constraints are not enforced, so `undefined` categories appear
- `provider: { require_parameters: true }` in OpenRouter → no effect
- Exacto routing → no effect
Resolution: Accepted as a model-level constraint. Production strategy will use the best model with retry logic and fall back to a reliable model (gpt-5.4-mini) for persistent failures.
Judge Prompt Iteration (v1 → v2)
Built a dynamic judge prompt (buildJudgePrompt()) with:
- Disagreement diagnosis: Tells the judge exactly what's in dispute and the vote distribution
- Targeted disambiguation rules: 7 category guidance blocks + 2 specificity guidance blocks, dynamically included only when relevant to the specific dispute
- Structured analysis steps: Critique each annotator → enumerate IS-list facts → determine dominant purpose → decide
- Confidence calibration: HIGH/MEDIUM/LOW mapped to codebook clarity, used as training weights
- Anti-bias: Fisher-Yates shuffle of annotator order
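The anti-bias shuffle is a standard Fisher-Yates pass over annotator order, so the judge can't learn positional habits (e.g. always trusting the first-listed vote):

```typescript
// Unbiased in-place shuffle over a copy; every permutation is equally likely.
function shuffle<T>(items: T[]): T[] {
  const out = [...items];
  for (let i = out.length - 1; i > 0; i--) {
    const j = Math.floor(Math.random() * (i + 1));
    [out[i], out[j]] = [out[j], out[i]];
  }
  return out;
}
```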
Results: Category accuracy improved +10pp over majority vote for both models. Specificity improved +9.8pp for GLM-5 but stayed flat for gpt-5.4-mini. The disambiguation rules work well for category but specificity needs the codebook v3.0 changes.
Key Finding: Judge Confidence Is Highly Predictive
| Confidence | GLM-5 Both-Correct | gpt-5.4-mini Both-Correct |
|---|---|---|
| High | 82-84% | 80.6% |
| Medium | 25-50% | 35.7% |
This enables confidence-stratified training data: high-confidence judge labels get full weight; medium/low are downweighted or excluded.
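The weighting rule implied by this table can be sketched as a tiny mapping (the exact weights are illustrative policy choices, not measured constants):

```typescript
type JudgeConfidence = "high" | "medium" | "low";

// Map judge confidence to a training-sample weight; null means exclude.
function sampleWeight(conf: JudgeConfidence): number | null {
  switch (conf) {
    case "high":
      return 1.0; // ~80-84% both-correct: full weight
    case "medium":
      return 0.5; // ~25-50% both-correct: downweight (or convert to soft labels)
    case "low":
      return null; // too unreliable to train on
  }
}
```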
Phase 7: Revised Data Quality Strategy (Current)
The post-Stage 1 analysis and judge benchmarking led to a fundamental reassessment of our approach.
The Key Realization
The best judge (77% both-correct) barely beats the raw majority vote (78% category, 80% specificity). Judging all 14,591 disputed paragraphs at 77% accuracy doesn't meaningfully improve on the majority. The judge's real value is concentrated in two places:
- The 409 unresolved paragraphs where no majority exists
- Cases where we have specific reason to doubt the majority
The Revised Plan
Phase 0: Codebook rulings (completed) — Three rulings that resolve thousands of disputes at zero inference cost: materiality disclaimers → Strategy Integration, SPACs → None/Other, person-vs-function test for Management↔RMP.
Phase 1: Model-calibrated majority resolution — For the 14,182 majority-agreement paragraphs, apply calibration using known model biases. When the known-biased model is the outlier on a known axis → trust majority. Flag anomalous cases for judge resolution. Expected to auto-resolve ~10,000-12,000 paragraphs.
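The calibrated-majority rule can be sketched as follows. The bias table is abbreviated and the names are illustrative; the idea is that a 2-1 split is trusted when the outlier is behaving exactly as its known bias predicts, and flagged for the judge otherwise:

```typescript
type Vote = { model: string; category: string };

// Known outlier tendencies from the post-Stage 1 analysis (abbreviated).
const KNOWN_BIAS: Record<string, string[]> = {
  mimo: ["Third-Party Risk"],
  gemini: ["Management Role"],
};

function resolveMajority(votes: Vote[]): { label: string; flagged: boolean } | null {
  const counts = new Map<string, number>();
  for (const v of votes) counts.set(v.category, (counts.get(v.category) ?? 0) + 1);
  const majority = [...counts.entries()].find(([, n]) => n >= 2);
  if (!majority) return null; // 3-way split: always goes to the judge
  const outlier = votes.find((v) => v.category !== majority[0]);
  const expected = outlier !== undefined && (KNOWN_BIAS[outlier.model] ?? []).includes(outlier.category);
  // Unanimous, or an outlier matching its predicted bias: accept the majority.
  return { label: majority[0], flagged: outlier !== undefined && !expected };
}
```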
Phase 2: Human gold set (1,200 paragraphs) — Assignment requires 1,200 human-labeled paragraphs. Building a quiz-gated labeling web tool that enforces codebook knowledge before each session. Stratified sampling to ensure all categories, specificity levels, and confusion axes are represented. This becomes the calibration metric for all further work.
Phase 3: Judge prompt iteration — Update judge prompt to mirror codebook v3.0 rulings. Add worked examples from the 11 gold adjudications. Iterate against expanded gold set. Target: 85%+ both-correct.
Phase 4: Production judge run — Judge only the ~3,000-5,000 genuinely hard cases (unresolved + flagged majority + "both" disputes). Two models for cross-validation on the hardest cases.
Phase 5: Training data assembly — Confidence-stratified tiers:
| Tier | Source | Est. Accuracy | Paragraphs | Treatment |
|---|---|---|---|---|
| T1 | Both-unanimous | ~97% | 35,204 | Full weight |
| T2 | Calibrated majority | ~85-90% | ~9,000-12,000 | Full weight |
| T3 | Judge high-confidence | ~84% | ~2,000-3,000 | Full weight |
| T4 | Judge medium-confidence | ~40% | ~500-1,000 | Downweight (0.5) or soft labels |
| T5 | Judge low / failure / excluded | ??? | ~500-1,000 | Exclude |
Expected total: ~46,000-48,000 paragraphs at ~93-95% label accuracy.
Running Cost Ledger
| Phase | Cost | Notes |
|---|---|---|
| Stage 1 production run | $115.88 | 150,009 annotations, 0 failures |
| Stage 1 prompt iteration (pilots) | ~$15 | 12+ versions × 500-sample pilots |
| Judge benchmarking | ~$5 | 8 models × 50-sample gold set |
| Judge prompt iteration | ~$3 | Ongoing |
| Total to date | ~$139 | |
Key Technical Artifacts
| Artifact | Location | Description |
|---|---|---|
| Labeling codebook | docs/LABELING-CODEBOOK.md | Authoritative reference, v3.0 with codebook rulings |
| Stage 1 annotations | data/annotations/stage1.jsonl | 150,009 annotations (120 MB) |
| Paragraphs | data/paragraphs/paragraphs-clean.jsonl | 72,045 paragraphs with filing metadata |
| Gold labels | data/bench/judges/gold-final.json | 50 adjudicated gold labels |
| Gold adjudications | data/bench/judges/gold-adjudicated.json | 11 detailed adjudication decisions with reasoning |
| Stage 1 prompt | ts/src/label/prompts.ts | SYSTEM_PROMPT (v2.5) + buildJudgePrompt() |
| Annotation runner | ts/scripts/stage1-run.ts | Resume-safe, configurable concurrency |
| Analysis scripts | ts/scripts/stage1-analyze.ts, segment-analysis.ts, model-bias-analysis.ts, dispute-crosstab.ts, sample-disputes.ts | Deep analytics on annotation data |
| Judge benchmarking | ts/scripts/judge-bench.ts | Supports structured/tool modes, gold label comparison |
| Judge diagnostics | ts/scripts/judge-diag.ts, judge-diag-batch.ts | GLM-5 failure investigation |
| Model benchmarking | ts/scripts/model-bench.ts | Stage 1 candidate evaluation |
Lessons Learned
On Prompt Engineering
- Calibration examples beat rules. Each example targets a specific observed failure mode.
- Pilots must be large enough (500+). 40-sample pilots were misleadingly optimistic.
- More rules ≠ better. After the core structure is right, additional rules cause regression.
- The `specific_facts` chain-of-thought schema (forcing models to enumerate evidence before deciding) was the single most impactful structural change.
On Model Selection
- Reasoning tokens are the strongest predictor of accuracy, not price or model size.
- Schema compliance varies — fix with Zod transforms, not prompt changes.
- Test both structured output AND tool calling for any candidate. They are not equivalent.
On Evaluation
- Never evaluate against majority vote. Build gold labels. Majority vote as ground truth makes models that rubber-stamp the majority look good.
- Judge confidence is highly predictive of accuracy. Use it to weight training samples.
- Stage 1 confidence is useless — cheap models are systematically overconfident (95%+ all-high).
On Data Quality at Scale
- The biggest wins come from understanding where and why models disagree, not from blanket improvements.
- Systematic model biases are quantifiable and predictable. Use them as signal, not noise.
- Codebook ambiguity causes more disagreement than model limitations. Three codebook rulings resolved more disputes than any prompt change.
- Not all labels need the same treatment. Confidence-stratified assembly beats uniform labeling.