Project Narrative — SEC Cybersecurity Disclosure Quality Classifier
This document captures the process, roadblocks, decisions, and resolutions from building the SEC cybersecurity disclosure quality classifier. It serves as the source material for the final paper and presentation.
Phase 1: Project Scoping and Construct Design
The Problem
SEC Release 33-11216 (July 2023) created a new annual cybersecurity disclosure requirement (10-K Item 1C) and an incident disclosure requirement (8-K Item 1.05). By FY2024, ~9,000-10,000 filings exist. No validated classifier or public labeled dataset exists for assessing the quality of these disclosures. Investors, regulators, and compliance officers need scalable tools to distinguish substantive disclosures from boilerplate.
Methodology Decision: Ringel (2023) "Synthetic Experts"
We adopted the Ringel (2023) "Synthetic Experts" pipeline: use frontier LLMs to generate training labels at scale, then distill into an efficient encoder model. This approach was chosen because:
- Manual labeling of 50,000+ paragraphs is infeasible for a 6-person team
- Multiple cheap LLMs annotating in parallel provide built-in quality control through inter-annotator agreement
- The encoder distillation step produces a model that can classify at inference time without LLM API costs
Construct: Two Classification Dimensions
We defined two simultaneous classification tasks per paragraph:
- Content Category (7 mutually exclusive classes) — what the paragraph is about, grounded in the SEC rule's own structure (Board Governance, Management Role, Risk Management Process, Third-Party Risk, Incident Disclosure, Strategy Integration, None/Other)
- Specificity Level (4-point ordinal) — how company-specific the disclosure is, from generic boilerplate to quantified-verifiable facts
The construct maps to NIST CSF 2.0 categories for academic grounding.
Phase 2: Data Acquisition and Corpus Construction
The Extraction Problem
SEC filings are not structured data. They're HTML generated from PDFs, XBRL, and Word documents by dozens of different tools, each producing different artifacts. Building a reliable extraction pipeline for ~9,000 filings meant solving a series of messy, real-world data engineering problems.
Pipeline Architecture
Built in TypeScript (~1,000 lines of extraction code across parse-item1c.ts, segment.ts, fast-reparse.ts, and pipeline orchestration):
EDGAR Master Index → enumerate 10-K filings → download HTML → extract Item 1C → segment paragraphs → JSONL
submissions.zip → scan for 8-K Item 1.05 → download HTML → extract → segment → merge with 10-K corpus
Roadblock: HTML Variability
Every filing's HTML is different. The same logical content looks completely different depending on the tool that generated the HTML:
- Word splitting from inline elements. XBRL and styling tags break words mid-token: `<span>It</span><span>em 2</span>` renders correctly in a browser but parses as "Item2" in code; likewise `<b>cyber</b><i>security</i>`. Required detecting adjacent inline element boundaries and inserting spaces selectively.
- CamelCase joins from PDF converters. PDF-to-HTML tools merge sentences across formatting boundaries: `sentence.Next sentence` instead of `sentence. Next sentence`. Required regex passes to detect missing spaces after punctuation.
- Page breaks mid-sentence. Page numbers (`28`, `- 12 -`, `F-3`), running headers (`ACME CORP — ANNUAL REPORT`), and subsidiary headers (`ENTERGY ARKANSAS, LLC AND SUBSIDIARIES`) get spliced into the middle of content paragraphs. Required filtering a catalog of page-artifact patterns.
- Table of Contents shadowing. "Item 1C" appears at least twice in every 10-K — once in the Table of Contents and once in the actual content. Using the first match extracts the wrong section. It took several iterations to discover we needed the LAST match — a silent failure that produced empty or wrong extractions for hundreds of filings before we caught it.
- XBRL tag pollution. Inline XBRL wraps financial facts in `ix:header`, `ix:references`, and `ix:nonFraction` tags that carry no display content but add noise. Required stripping all `ix:*` tags before text processing.
- Entity encoding chaos. Non-breaking spaces, curly quotes (`“`, `”`), em and en dashes (`—`, `–`), and bullets (`•`): each needs correct decoding, and different filing tools use different entity styles for the same characters.
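Two of these cleanup passes can be sketched as small pure functions. The helper names are ours and the real pipeline is more selective, but the regexes illustrate the technique:

```typescript
// Fix missing spaces after sentence punctuation ("sentence.Next" -> "sentence. Next").
// Requiring a lowercase letter before the punctuation avoids mangling
// abbreviations like "U.S." that end on an uppercase letter.
function fixPunctuationJoins(text: string): string {
  return text.replace(/([a-z][.!?])([A-Z])/g, "$1 $2");
}

// Strip inline-XBRL noise: remove ix:header / ix:references blocks wholesale
// (no display content), then unwrap any remaining ix:* tags, keeping inner text.
function stripXbrl(html: string): string {
  return html
    .replace(/<ix:(header|references)\b[\s\S]*?<\/ix:\1>/gi, "")
    .replace(/<\/?ix:[\w-]+[^>]*>/gi, "");
}
```

The word-splitting case (`<span>It</span><span>em 2</span>`) is harder because it needs the DOM boundary, not just the text, so it can't be handled by a post-hoc regex alone.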
Paragraph Segmentation
After extracting clean section text, splitting into paragraphs had its own challenges:
- Bullet list merging. Disclosures frequently use bullet lists ("Our program includes: • risk assessment • vulnerability scanning"). Bullets need to be merged with their intro sentence; a standalone "• vulnerability scanning" is meaningless.
- Continuation line detection. Sentences split across HTML block elements need rejoining. Heuristic: if the previous block lacks terminal punctuation and the next starts lowercase or with a continuation phrase (`and`, `or`, `including`, `such as`), merge.
- Length boundaries. Under 20 words → likely a header (filtered). Over 500 words → split at sentence boundaries to keep annotation units manageable.
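The continuation-merge heuristic can be sketched as a single pass over the extracted blocks (illustrative names, not the actual segment.ts):

```typescript
// Previous block ends a sentence if it closes with terminal punctuation,
// optionally followed by a closing quote or parenthesis.
const TERMINAL = /[.!?:;]["\u201d)]?\s*$/;
const CONTINUATION = /^(and|or|including|such as)\b/i;

function mergeContinuations(blocks: string[]): string[] {
  const out: string[] = [];
  for (const block of blocks) {
    const prev = out[out.length - 1];
    const startsLower = /^[a-z]/.test(block);
    // Merge when the previous block is mid-sentence and this block looks
    // like its continuation; otherwise start a new paragraph.
    if (prev !== undefined && !TERMINAL.test(prev) && (startsLower || CONTINUATION.test(block))) {
      out[out.length - 1] = `${prev} ${block}`;
    } else {
      out.push(block);
    }
  }
  return out;
}
```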
8-K Extraction
Roadblock: EDGAR full-text search misses filings. The EFTS keyword search doesn't reliably return all cybersecurity 8-Ks. Post-May 2024, companies moved non-material disclosures from Item 1.05 to Items 8.01 or 7.01.
Resolution: Built scan-8k-items.py to scan the SEC's bulk submissions.zip deterministically — a gap-free scan of every 8-K with cybersecurity content. Tries items in priority order (1.05 → 8.01 → 7.01), skips cross-reference stubs. Result: 207 cybersecurity incident 8-K filings identified — a complete inventory.
Paragraph Deduplication
Each paragraph gets a textHash (SHA-256 of normalized text). Deduplication at three levels:
- Within-filing: Parser artifacts sometimes produce duplicate blocks. Removed by textHash.
- Cross-year (same company): Companies copy-paste identical paragraphs year-to-year. Detected but kept — the repetition itself is informative for disclosure quality analysis.
- Cross-company boilerplate: Different companies use identical materiality disclaimers. Detected but kept — these are real Specificity 1 examples.
Result: Only ~27 excess duplicates removed (0.04%). Most textual similarity is legitimate variation.
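A minimal sketch of the textHash dedup, assuming normalization means lowercasing plus whitespace collapse (the real pipeline's normalization may differ):

```typescript
import { createHash } from "node:crypto";

// SHA-256 of normalized paragraph text, used as the dedup key.
function textHash(paragraph: string): string {
  const normalized = paragraph.toLowerCase().replace(/\s+/g, " ").trim();
  return createHash("sha256").update(normalized).digest("hex");
}

// Within-filing dedup: drop any paragraph whose hash was already seen.
function dedupeWithinFiling(paragraphs: string[]): string[] {
  const seen = new Set<string>();
  return paragraphs.filter((p) => {
    const h = textHash(p);
    if (seen.has(h)) return false;
    seen.add(h);
    return true;
  });
}
```

Cross-year and cross-company duplicates reuse the same hash but are only flagged, not removed.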
Performance at Scale
Initial extraction with cheerio (DOM parser) was slow for 9,000 filings. Built fast-reparse.ts (regex-only HTML stripping, no DOM) and parallel-reparse.ts (16 bun workers in parallel). Also deduplicates amendment filings (keeps latest per CIK×FiscalYear).
Corpus Statistics
- 72,045 paragraphs from ~9,000 filings (FY2023 + FY2024 + early FY2025)
- All 10-K Item 1C; 207 8-K paragraphs extracted separately
- Median ~7 paragraphs per filing
- 49,795 paragraphs annotated (after filtering to complete filing metadata)
Roadblock: Truncated Filings
Discovered 72 filings (~0.8%) where section boundary detection cut off mid-sentence. A paragraph about CISSP certifications cut mid-sentence looks like vague boilerplate — this would corrupt specificity labels.
Resolution: Exclude from training splits. Filings where the last paragraph doesn't match /[.!?;")\u201d]\s*$/ are filtered before train/val/test creation.
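The completeness filter reduces to one regex test over the filing's last paragraph:

```typescript
// A filing is complete only if its last paragraph ends with terminal
// punctuation, optionally followed by a closing quote or parenthesis.
const COMPLETE_END = /[.!?;")\u201d]\s*$/;

function isTruncated(paragraphs: string[]): boolean {
  const last = paragraphs[paragraphs.length - 1];
  return last === undefined || !COMPLETE_END.test(last);
}
```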
Phase 3: Codebook Development
Initial Codebook (v1.0)
Built a detailed labeling codebook (docs/LABELING-CODEBOOK.md) grounded in the SEC rule structure. Includes:
- 7 category definitions with SEC basis citations, key markers, and example texts
- 4 specificity levels with boundary rules
- 5 category decision rules for common ambiguities
- 5 borderline cases with worked reasoning
- Gold set protocol for human validation
Codebook Iteration (v3.0 — 2026-03-29)
After analyzing 150,000+ Stage 1 annotations and identifying systematic disagreement patterns, we made three major codebook rulings:
Ruling A — Materiality Disclaimers: Paragraphs with explicit materiality assessments ("have not materially affected our business strategy, results of operations, or financial condition") are Strategy Integration, even if boilerplate. A cross-reference to Risk Factors appended to a materiality assessment does not change the classification. Only pure cross-references with no materiality conclusion are None/Other. This resolved ~1,094 disputed paragraphs.
Ruling B — SPACs and Shell Companies: Companies explicitly stating they have no operations, no cybersecurity program, or no formal processes receive None/Other regardless of incidental mentions of board oversight or risk acknowledgment. The absence of a program is not a description of a program. This resolved ~53 unresolved paragraphs and likely hundreds more.
Ruling C — Person vs. Function Test (Management Role vs. RMP): This was the single most impactful ruling, addressing the #1 disagreement axis (2,290 disputes). The line: if the paragraph is about the person (qualifications, credentials, background, tenure, career history) → Management Role. If it's about what the role/program does (processes, activities, tools, frameworks) → Risk Management Process, even if a CISO/CIO/CTO title appears. The test: would the paragraph still make sense if you removed the person's name, title, and credentials? If yes → the paragraph is about the function, not the person.
Phase 4: Stage 1 — Synthetic Expert Annotation
Tech Stack Decision
Chose TypeScript + Vercel AI SDK v6 + OpenRouter over Python + LangChain/LiteLLM because:
- Vercel AI SDK provides native structured output with Zod schema validation
- OpenRouter gives single-API access to all candidate models with real cost tracking
- Bun runtime for fast script execution with native TypeScript support
- JSONL-append pattern for crash-safe resume without data loss or duplicate API spend
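The crash-safe resume pattern can be sketched as follows (illustrative, not the actual stage1-run.ts): completed work is identified by a key already present in the output file, so a restarted run skips finished rows instead of re-spending on them.

```typescript
import { appendFileSync, existsSync, readFileSync } from "node:fs";

// Read back the set of already-annotated paragraph hashes from the JSONL file.
function loadCompleted(path: string): Set<string> {
  if (!existsSync(path)) return new Set();
  return new Set(
    readFileSync(path, "utf8")
      .split("\n")
      .filter((line) => line.trim() !== "")
      .map((line) => JSON.parse(line).textHash as string),
  );
}

// Append one result as a single JSONL line; a crash mid-run loses at most
// the in-flight annotation, never completed ones.
function appendResult(path: string, result: { textHash: string; label: string }): void {
  appendFileSync(path, JSON.stringify(result) + "\n");
}
```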
Prompt Engineering (12+ iterations, v1.0 → v2.7; v2.5 locked for production)
This was one of the most time-intensive phases. Key lessons:
What worked:
- Text enum labels ("Firm-Specific") over ordinals ("3") — universal improvement across all models
- Decision-test format ("ask in order, stop at first yes") for specificity — reduced ambiguity
- ✓ IS / ✗ NOT fact lists with explicit examples — the single biggest lever for specificity accuracy. Reduced overrating from 54 to 21 cases.
- Validation step ("review your specific_facts, remove NOT-list items") — caught model self-correction
- 13 calibration examples, each targeting a specific observed failure mode — examples outperformed rules
- Explicit Incident↔Strategy tiebreaker — completely eliminated a 20-case confusion pattern
- `specific_facts` chain-of-thought in the schema — forces the model to enumerate evidence before assigning specificity
What didn't work:
- Adding more rules (v1.2) — confused models, caused regression from 95%→88% category accuracy
- Changing category definitions to structural "TEST:" format (v2.6) — regression
- "COMMON MISTAKES" section (v2.7) — improved consensus but reduced unanimity
- Attempting a Management↔RMP tiebreaker in the prompt (v2.5) — made confusion worse (this was ultimately resolved through the v3.0 codebook ruling instead)
Critical lesson: 40-sample pilots were misleadingly optimistic. Results that looked good at n=40 fell apart at n=500. We standardized on 500-sample pilots for all prompt evaluation.
The Iteration Trajectory
Five 40-sample pilots (v1.0, v1.1, v1.2, v2.1, v2.2-n40) followed by six 500-sample pilots (v2.2-v2.7):
| Version | n | Both Unan | Key Change | Top Confusion Axis |
|---|---|---|---|---|
| v2.2 | 500 | 51.4% | First 500-sample baseline | Incident↔Strategy (20 cases) |
| v2.3 | 500 | 59.2% | Tightened Sector-Adapted, expanded IS/NOT lists | Inc↔Strat reduced |
| v2.4 | 500 | 66.8% | Validation step, schema constraint on specific_facts | Mgmt↔RMP emerging |
| v2.5 | 500 | 70.8% | Incident↔Strategy tiebreaker, QV calibration examples | Inc↔Strat eliminated; Mgmt↔RMP now #1 (17 cases) |
| v2.6 | 500 | 67.8% | Changed defs to "TEST:" format — regression | — |
| v2.7 | 500 | 67.6% | Added COMMON MISTAKES section — regression | — |
The most dramatic single improvement: v2.5's Incident↔Strategy tiebreaker ("DESCRIBES what happened → Incident; ONLY discusses cost/materiality → Strategy") completely eliminated what had been the #1 confusion axis at v2.2 (20 cases → 0). This is a case where a single well-targeted rule outperformed broad prompt restructuring.
v2.5 was locked as the production prompt. v2.6 and v2.7 demonstrated that the prompt had reached its practical ceiling — further structural changes caused regressions. The remaining disagreements (Management↔RMP, specificity boundaries) turned out to be codebook ambiguities and model-capacity issues, not prompt failures.
The Original Panel and the Nano Problem
The initial Stage 1 panel was:
- google/gemini-3.1-flash-lite-preview
- openai/gpt-5.4-nano
- x-ai/grok-4.1-fast
GPT-5.4-nano was chosen for its low cost and the assumption that even a small model could handle structured classification with a good enough prompt. This assumption was wrong.
The problem: nano wasn't thinking. During pilot testing, we discovered nano produced zero reasoning tokens 64% of the time. When it did reason, the output was minimal (34,356 reasoning tokens total across 500 paragraphs, vs grok's 336,993). Without reasoning, nano's classifications were essentially pattern-matching on surface features — it couldn't apply the multi-step decision logic the codebook requires (enumerate facts, filter against IS/NOT lists, count QV-eligible items, apply threshold).
The symptoms:
- Erratic specificity — nano was simultaneously too conservative on some axes ([1,3,3] disagreements — 21 cases where nano said Generic when gemini+grok said Firm-Specific) and too liberal on others ([3,3,4] — 11 cases where nano said Quantified when the others said Firm-Specific). No prompt change fixed this because it's a model-level capacity issue: without reasoning tokens, the decision test can't execute properly.
- Lowest pairwise agreement — gemini×grok agreed on 95.6% of categories and 91.2% of specificity. gemini×nano: 87.4% category, 83.8% specificity. Nano was the consistent outlier.
- Dragging down unanimity — the gemini+grok pair was strong, but nano's disagreements broke unanimity on hundreds of paragraphs that would otherwise have been clean.
Despite 12 prompt iterations (v1.0→v2.7) that improved overall metrics significantly, nano's behavior never stabilized. The prompt was at its practical ceiling for a model that wouldn't reason.
Smoke Testing: model-probe.ts
Before running an expensive benchmark, we built model-probe.ts to test 9 candidate models on a single paragraph for basic structured output compliance:
- gemini-3.1-flash-lite-preview, grok-4.1-fast, gpt-4.1-mini, gpt-4.1-nano, claude-haiku-4.5, gemini-3.1-flash-preview, deepseek-chat-v3-0324:free, llama-4-maverick, qwen3-235b-a22b
This caught schema-level incompatibilities (wrong field names, missing fields, invalid enum values) before we spent money on 500-paragraph bench runs.
Model Benchmark: 6 Candidates to Replace Nano
After locking prompt v2.5, we built model-bench.ts to formally evaluate nano replacements. Each candidate was benchmarked against the 500-sample pilot set and compared to the existing gemini+grok annotations.
| Model | Cost/ann | Reasoning Tokens | vs Majority (both) | Cat Outlier | Spec Outlier | Nano→X Delta |
|---|---|---|---|---|---|---|
| seed-2.0-lite | $0.00227 | 658 | 88.8% | 2.2% | 3.8% | +11.6pp |
| mimo-v2-flash | $0.00048 | 1,346 | 86.0% | 5.0% | 4.0% | +8.8pp |
| glm-4.5-air | $0.00136 | 854 | 76.2% | 8.8% | 9.6% | +0.8pp |
| minimax-m2.5 | $0.00106 | 590 | 73.8% | 7.9% | 12.7% | -1.0pp |
| mistral-small-2603 | $0.00015 | 0 | 66.8% | 9.2% | 17.6% | -6.8pp |
| nemotron-3-super-120b | $0.00152 | 942 | 57.9% | 21.3% | 20.7% | -16.9pp |
Key findings:
- Reasoning tokens are the strongest predictor of accuracy. Mistral-small produced literally zero reasoning tokens; its average output was only 136 tokens (vs mimo's 1,463), and it had a 17.6% specificity outlier rate. This confirmed that the nano problem wasn't prompt-specific: models that don't reason can't do this task.
- Price ≠ quality. Nemotron was the most expensive candidate at $0.00152/annotation and produced 942 reasoning tokens (it was thinking), but thinking badly: a 21.3% category outlier rate, the worst of any candidate, and only 497/500 completions (3 failures). Replacing nano with nemotron would have been catastrophic: -16.9pp unanimity.
- Two mediocre options. GLM-4.5-air (+0.8pp) and minimax-m2.5 (-1.0pp) neither helped nor hurt. Not worth the switch.
- Seed-2.0-lite was technically the best at 88.8% agreement with the majority, but cost 4.7x more than mimo ($0.00227 vs $0.00048) and was 2x slower (21.5s vs 11.4s latency). For 50K+ paragraphs at scale, this cost differential was significant.
The Winner: mimo-v2-flash
Mimo won the slot on value:
- Cheapest viable option — $0.00048/annotation (3x cheaper than most candidates)
- Most reasoning tokens — 1,346 avg (highest of all 6, more than seed-2.0-lite)
- Lowest outlier rate — 5.0% category, 4.0% specificity
- +8.8pp unanimity improvement over nano
- 93.4% category agreement with grok — strongest pairwise alignment of any candidate
Roadblock: Mimo schema quirks. Mimo produced non-standard outputs: capitalized confidence labels ("High" instead of "high"), numeric confidence values (0.9 instead of "high"), and flat string arrays instead of structured {fact, type} objects for specific_facts. Rather than trying to fix this with prompting (which would waste tokens and might break other behavior), we fixed it with Zod schema transforms — .transform() to normalize casing and map numbers to labels, .union() to accept both structured and flat fact formats. This took ~30 minutes to implement and handled all edge cases automatically.
A dedicated mimo-pilot.ts script modeled the full "replace nano with mimo" scenario before committing to the panel change.
Final Stage 1 panel:
- google/gemini-3.1-flash-lite-preview
- xiaomi/mimo-v2-flash ← replaced openai/gpt-5.4-nano
- x-ai/grok-4.1-fast
Production Run Results
Completed 2026-03-28. 150,009 annotations (50,003 paragraphs × 3 models), $115.88 total cost, 0 failures.
| Metric | Value |
|---|---|
| Both-unanimous | 35,204 (70.7%) |
| Majority agreement | 14,182 (28.5%) |
| Unresolved (3-way split) | 409 (0.8%) |
| Total cost | $115.88 |
| Failures | 0 |
Phase 5: Post-Stage 1 Analysis — Discovering Systematic Patterns
After the production run, we conducted a deep distributional analysis of disagreement patterns. This analysis fundamentally changed our approach to Stage 2.
Model Bias Discovery
Each model has systematic, quantifiable biases:
| Model | Category Outlier Rate | Specificity Outlier Rate | Key Bias |
|---|---|---|---|
| Mimo | 48.1% | 32.5% | Over-classifies as Third-Party Risk; under-rates Spec 4 (74.3% of Spec 4 outlier cases) |
| Gemini | 30.9% | 45.7% | Over-classifies as Management Role (81.1% in Mgmt↔RMP disputes); inflates specificity |
| Grok | 21.0% | 21.8% | Most moderate; slight RMP bias |
These biases are not random — they're predictable by model and confusion axis. This opened the possibility of model-calibrated majority voting (using the known biases to assess when the majority is likely correct).
Key Distributional Findings
- Management Role is the disaster category — only 51.5% unanimous (every other category is 62-79%). Nearly half of all Management Role paragraphs need resolution.
- Spec 4 (Quantified-Verifiable) is the disaster specificity — only 37.6% unanimous. Models can't agree on what counts as "quantified."
- Stage 1 confidence is completely useless — 95.4% of paragraphs report all-high category confidence. Zero all-low cases. The cheap models are systematically overconfident.
- Specificity is effectively a 3-level scale — Spec 2 (Sector-Adapted) is rarely disputed (82.1% unanimous). The contested boundaries are [1,3] (3,742 disputes) and [3,4] (2,898 disputes) with almost nothing at [1,2] or [2,3].
- Longer paragraphs are harder — Q5 word count (>134 words): 64.1% unanimous vs Q1 (≤51 words): 76.3%.
- Small companies (1-3 paragraphs) are noise-prone — 50.0% unanimous, 10.5% unresolved. Almost all are SPACs or shell companies with non-standard disclosures.
Top Disagreement Axes
| Axis | Disputes | Pattern |
|---|---|---|
| Management Role ↔ RMP | 2,290 | Paragraph describes processes but names CISO/CIO |
| RMP ↔ Third-Party Risk | 1,475 | Mimo over-classifies vendor mentions as Third-Party |
| None/Other ↔ Strategy Integration | 1,094 | Materiality disclaimers — genuinely ambiguous in codebook |
| Board Governance ↔ Management Role | 867 | Paragraphs at the board-management interface |
| Spec [1,3] boundary | 3,742 | NOT-list items counted as specific facts |
| Spec [3,4] boundary | 2,898 | Gemini counts roles as QV-eligible; Mimo downgrades |
Insight: Reading the Actual Paragraphs
We sampled 20 paragraphs across the 4 hardest dispute types and read them in full. Patterns emerged:
- Management↔RMP: Every example follows the same structure — a process-focused paragraph that names a CISO/CIO in the opening attribution. The paragraph's content is about what the program does, not who the person is. The v3.0 "person-vs-function" ruling directly addresses this.
- None/Other↔Strategy: All 5 sampled paragraphs are "no material incidents" boilerplate. Every single one. The materiality disclaimer ruling resolves this entirely.
- Spec [3,4]: Gemini counts "20 years of experience" + "CISO" as 2 QV facts → Spec 4. Grok/Mimo correctly exclude named roles from QV counting → Spec 3. The rule exists in the prompt but Gemini ignores it.
- Small company unresolved: All SPACs or blank check companies with "we have no operations" disclaimers. The SPAC ruling handles these.
Phase 6: Stage 2 — Judge Model Evaluation
Gold Label Construction
Built a 50-paragraph gold set using 3 independent Sonnet agents:
- Agent A: paragraphs 0-24
- Agent B: paragraphs 25-49
- Agent C: all 50 as cross-check
- Adjudicator agent resolved 11 disputes with detailed reasoning
- Inter-annotator agreement: 94% category, 84% specificity, 78% both
Lesson learned: majority vote ≠ ground truth. Initially scored judges against Stage 1 majority, which made gemini-3-flash look great (86% category match). Scoring against gold labels revealed it added zero value — it was rubber-stamping the majority. Always evaluate against adjudicated gold labels.
Judge Model Benchmarking (8 candidates)
| Model | Mode | n | Cat | Spec | Both | Fails | Cost/call |
|---|---|---|---|---|---|---|---|
| Majority vote | — | 50 | 78.0% | 80.0% | 60.0% | 0% | $0 |
| gpt-5.4-mini | structured | 50 | 88.0% | 80.0% | 68.0% | 0% | $0.0046 |
| GLM-5 v2 | structured | 48 | 87.5% | 89.6% | 77.1% | 4% | $0.0078 |
| GLM-5 v4 | structured+req_params | 44 | 90.9% | 88.6% | 79.5% | 12% | $0.0083 |
| GLM-5 v3 | tool calling | 50 | 84.0% | 82.0% | 72.0% | 0% | $0.0070 |
Roadblock: GLM-5 Structured Output Failures
GLM-5 had the best accuracy (77-80% both-correct) but a 6-12% structured output failure rate. The model intermittently wraps JSON in markdown code blocks.
Investigation: Built diagnostic scripts (judge-diag.ts, judge-diag-batch.ts) to isolate the issue. Tested all 9 failing paragraphs × 2 attempts each. Found 72% success rate, all from the same model variant (z-ai/glm-5-20260211). The best OpenRouter provider (Ambient) has a 6% base error rate. This is a model-level behavior, not provider-specific.
Attempted fixes:
- Bumped validation retries from 1 to 3 → reduced failures from 18% to ~4-12%
- Tool calling mode → 0% failures, but accuracy dropped ~7pp (72% both); enum constraints are not enforced, so `undefined` categories appear
- `provider: { require_parameters: true }` in OpenRouter → no effect
- Exacto routing → no effect
Resolution: Accepted as a model-level constraint. Production strategy will use the best model with retry logic and fall back to a reliable model (gpt-5.4-mini) for persistent failures.
Judge Prompt Iteration (v1 → v2)
Built a dynamic judge prompt (buildJudgePrompt()) with:
- Disagreement diagnosis: Tells the judge exactly what's in dispute and the vote distribution
- Targeted disambiguation rules: 7 category guidance blocks + 2 specificity guidance blocks, dynamically included only when relevant to the specific dispute
- Structured analysis steps: Critique each annotator → enumerate IS-list facts → determine dominant purpose → decide
- Confidence calibration: HIGH/MEDIUM/LOW mapped to codebook clarity, used as training weights
- Anti-bias: Fisher-Yates shuffle of annotator order
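The anti-bias shuffle is a standard Fisher-Yates pass over annotator order, so the judge can't learn positional habits (e.g. always trusting the first-listed vote):

```typescript
// Unbiased in-place shuffle over a copy; every permutation is equally likely.
function shuffle<T>(items: T[]): T[] {
  const out = [...items];
  for (let i = out.length - 1; i > 0; i--) {
    const j = Math.floor(Math.random() * (i + 1));
    [out[i], out[j]] = [out[j], out[i]];
  }
  return out;
}
```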
Results: Category accuracy improved +10pp over majority vote for both models. Specificity improved +9.8pp for GLM-5 but stayed flat for gpt-5.4-mini. The disambiguation rules work well for category but specificity needs the codebook v3.0 changes.
Key Finding: Judge Confidence Is Highly Predictive
| Confidence | GLM-5 Both-Correct | gpt-5.4-mini Both-Correct |
|---|---|---|
| High | 82-84% | 80.6% |
| Medium | 25-50% | 35.7% |
This enables confidence-stratified training data: high-confidence judge labels get full weight; medium/low are downweighted or excluded.
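The weighting rule implied by this table can be sketched as a tiny mapping (the exact weights are illustrative policy choices, not measured constants):

```typescript
type JudgeConfidence = "high" | "medium" | "low";

// Map judge confidence to a training-sample weight; null means exclude.
function sampleWeight(conf: JudgeConfidence): number | null {
  switch (conf) {
    case "high":
      return 1.0; // ~80-84% both-correct: full weight
    case "medium":
      return 0.5; // ~25-50% both-correct: downweight (or convert to soft labels)
    case "low":
      return null; // too unreliable to train on
  }
}
```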
Phase 7: Revised Data Quality Strategy (Current)
The post-Stage 1 analysis and judge benchmarking led to a fundamental reassessment of our approach.
The Key Realization
The best judge (77% both-correct) barely beats the raw majority vote (78% category, 80% specificity). Judging all 14,591 disputed paragraphs at 77% accuracy doesn't meaningfully improve on the majority. The judge's real value is concentrated in two places:
- The 409 unresolved paragraphs where no majority exists
- Cases where we have specific reason to doubt the majority
The Revised Plan
Phase 0: Codebook rulings (completed) — Three rulings that resolve thousands of disputes at zero inference cost: materiality disclaimers → Strategy Integration, SPACs → None/Other, person-vs-function test for Management↔RMP.
Phase 1: Model-calibrated majority resolution — For the 14,182 majority-agreement paragraphs, apply calibration using known model biases. When the known-biased model is the outlier on a known axis → trust majority. Flag anomalous cases for judge resolution. Expected to auto-resolve ~10,000-12,000 paragraphs.
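The calibrated-majority rule can be sketched as follows. The bias table is abbreviated and the names are illustrative; the idea is that a 2-1 split is trusted when the outlier is behaving exactly as its known bias predicts, and flagged for the judge otherwise:

```typescript
type Vote = { model: string; category: string };

// Known outlier tendencies from the post-Stage 1 analysis (abbreviated).
const KNOWN_BIAS: Record<string, string[]> = {
  mimo: ["Third-Party Risk"],
  gemini: ["Management Role"],
};

function resolveMajority(votes: Vote[]): { label: string; flagged: boolean } | null {
  const counts = new Map<string, number>();
  for (const v of votes) counts.set(v.category, (counts.get(v.category) ?? 0) + 1);
  const majority = [...counts.entries()].find(([, n]) => n >= 2);
  if (!majority) return null; // 3-way split: always goes to the judge
  const outlier = votes.find((v) => v.category !== majority[0]);
  const expected = outlier !== undefined && (KNOWN_BIAS[outlier.model] ?? []).includes(outlier.category);
  // Unanimous, or an outlier matching its predicted bias: accept the majority.
  return { label: majority[0], flagged: outlier !== undefined && !expected };
}
```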
Phase 2: Human gold set (1,200 paragraphs) — Assignment requires 1,200 human-labeled paragraphs. Building a quiz-gated labeling web tool that enforces codebook knowledge before each session. Stratified sampling to ensure all categories, specificity levels, and confusion axes are represented. This becomes the calibration metric for all further work.
Phase 3: Judge prompt iteration — Update judge prompt to mirror codebook v3.0 rulings. Add worked examples from the 11 gold adjudications. Iterate against expanded gold set. Target: 85%+ both-correct.
Phase 4: Production judge run — Judge only the ~3,000-5,000 genuinely hard cases (unresolved + flagged majority + "both" disputes). Two models for cross-validation on the hardest cases.
Phase 5: Training data assembly — Confidence-stratified tiers:
| Tier | Source | Est. Accuracy | Paragraphs | Treatment |
|---|---|---|---|---|
| T1 | Both-unanimous | ~97% | 35,204 | Full weight |
| T2 | Calibrated majority | ~85-90% | ~9,000-12,000 | Full weight |
| T3 | Judge high-confidence | ~84% | ~2,000-3,000 | Full weight |
| T4 | Judge medium-confidence | ~40% | ~500-1,000 | Downweight (0.5) or soft labels |
| T5 | Judge low / failure / excluded | ??? | ~500-1,000 | Exclude |
Expected total: ~46,000-48,000 paragraphs at ~93-95% label accuracy.
Running Cost Ledger
| Phase | Cost | Notes |
|---|---|---|
| Stage 1 production run | $115.88 | 150,009 annotations, 0 failures |
| Stage 1 prompt iteration (pilots) | ~$15 | 12+ versions × 500-sample pilots |
| Judge benchmarking | ~$5 | 8 models × 50-sample gold set |
| Judge prompt iteration | ~$3 | Ongoing |
| Total to date | ~$139 | |
Key Technical Artifacts
| Artifact | Location | Description |
|---|---|---|
| Labeling codebook | docs/LABELING-CODEBOOK.md | Authoritative reference, v3.0 with codebook rulings |
| Stage 1 annotations | data/annotations/stage1.jsonl | 150,009 annotations (120 MB) |
| Paragraphs | data/paragraphs/paragraphs-clean.jsonl | 72,045 paragraphs with filing metadata |
| Gold labels | data/bench/judges/gold-final.json | 50 adjudicated gold labels |
| Gold adjudications | data/bench/judges/gold-adjudicated.json | 11 detailed adjudication decisions with reasoning |
| Stage 1 prompt | ts/src/label/prompts.ts | SYSTEM_PROMPT (v2.5) + buildJudgePrompt() |
| Annotation runner | ts/scripts/stage1-run.ts | Resume-safe, configurable concurrency |
| Analysis scripts | ts/scripts/stage1-analyze.ts, segment-analysis.ts, model-bias-analysis.ts, dispute-crosstab.ts, sample-disputes.ts | Deep analytics on annotation data |
| Judge benchmarking | ts/scripts/judge-bench.ts | Supports structured/tool modes, gold label comparison |
| Judge diagnostics | ts/scripts/judge-diag.ts, judge-diag-batch.ts | GLM-5 failure investigation |
| Model benchmarking | ts/scripts/model-bench.ts | Stage 1 candidate evaluation |
Lessons Learned
On Prompt Engineering
- Calibration examples beat rules. Each example targets a specific observed failure mode.
- Pilots must be large enough (500+). 40-sample pilots were misleadingly optimistic.
- More rules ≠ better. After the core structure is right, additional rules cause regression.
- The `specific_facts` chain-of-thought schema (forcing models to enumerate evidence before deciding) was the single most impactful structural change.
On Model Selection
- Reasoning tokens are the strongest predictor of accuracy, not price or model size.
- Schema compliance varies — fix with Zod transforms, not prompt changes.
- Test both structured output AND tool calling for any candidate. They are not equivalent.
On Evaluation
- Never evaluate against majority vote. Build gold labels. Majority vote as ground truth makes models that rubber-stamp the majority look good.
- Judge confidence is highly predictive of accuracy. Use it to weight training samples.
- Stage 1 confidence is useless — cheap models are systematically overconfident (95%+ all-high).
On Data Quality at Scale
- The biggest wins come from understanding where and why models disagree, not from blanket improvements.
- Systematic model biases are quantifiable and predictable. Use them as signal, not noise.
- Codebook ambiguity causes more disagreement than model limitations. Three codebook rulings resolved more disputes than any prompt change.
- Not all labels need the same treatment. Confidence-stratified assembly beats uniform labeling.