SEC-cyBERT/docs/NARRATIVE.md

Project Narrative — SEC Cybersecurity Disclosure Quality Classifier

This document captures the process, roadblocks, decisions, and resolutions from building the SEC cybersecurity disclosure quality classifier. It serves as the source material for the final paper and presentation.


Phase 1: Project Scoping and Construct Design

The Problem

SEC Release 33-11216 (July 2023) created a new annual cybersecurity disclosure requirement (10-K Item 1C) and an incident disclosure requirement (8-K Item 1.05). By FY2024, ~9,000-10,000 filings exist. No validated classifier or public labeled dataset exists for assessing the quality of these disclosures. Investors, regulators, and compliance officers need scalable tools to distinguish substantive disclosures from boilerplate.

Methodology Decision: Ringel (2023) "Synthetic Experts"

We adopted the Ringel (2023) "Synthetic Experts" pipeline: use frontier LLMs to generate training labels at scale, then distill into an efficient encoder model. This approach was chosen because:

  • Manual labeling of 50,000+ paragraphs is infeasible for a 6-person team
  • Multiple cheap LLMs annotating in parallel provide built-in quality control through inter-annotator agreement
  • The encoder distillation step produces a model that can classify at inference time without LLM API costs

Construct: Two Classification Dimensions

We defined two simultaneous classification tasks per paragraph:

  1. Content Category (7 mutually exclusive classes) — what the paragraph is about, grounded in the SEC rule's own structure (Board Governance, Management Role, Risk Management Process, Third-Party Risk, Incident Disclosure, Strategy Integration, None/Other)
  2. Specificity Level (4-point ordinal) — how company-specific the disclosure is, from generic boilerplate to quantified-verifiable facts

The construct maps to NIST CSF 2.0 categories for academic grounding.


Phase 2: Data Acquisition and Corpus Construction

The Extraction Problem

SEC filings are not structured data. They're HTML generated from PDFs, XBRL, and Word documents by dozens of different tools, each producing different artifacts. Building a reliable extraction pipeline for ~9,000 filings meant solving a series of messy, real-world data engineering problems.

Pipeline Architecture

Built in TypeScript (~1,000 lines of extraction code across parse-item1c.ts, segment.ts, fast-reparse.ts, and pipeline orchestration):

EDGAR Master Index → enumerate 10-K filings → download HTML → extract Item 1C → segment paragraphs → JSONL
submissions.zip → scan for 8-K Item 1.05 → download HTML → extract → segment → merge with 10-K corpus

Roadblock: HTML Variability

Every filing's HTML is different. The same logical content looks completely different depending on the tool that generated the HTML:

  • Word splitting from inline elements. XBRL and styling tags break words mid-token: <span>It</span><span>em 2</span> renders correctly in a browser, but a stripper that blindly inserts whitespace at element boundaries parses it as "It em 2", while one that never inserts whitespace glues genuinely separate words together. Required detecting adjacent inline element boundaries and inserting spaces selectively.
  • CamelCase joins from PDF converters. PDF-to-HTML tools merge sentences across formatting boundaries: sentence.Next sentence instead of sentence. Next sentence. Required regex passes to detect missing spaces after punctuation.
  • Page breaks mid-sentence. Page numbers, running headers, and subsidiary headers get spliced into the middle of content paragraphs. Required filtering a catalog of page artifact patterns.
  • Table of Contents shadowing. "Item 1C" appears at least twice in every 10-K — once in the Table of Contents and once in the actual content. Using the first match extracts the wrong section. Required the LAST match — a silent failure that produced empty or wrong extractions for hundreds of filings before we caught it.
  • XBRL tag pollution. Inline XBRL wraps financial facts in ix:header, ix:references, and ix:nonFraction tags that carry no display content but add noise.
  • Entity encoding chaos. &nbsp;, &#160;, &ldquo;, &rdquo;, &mdash;, &ndash;, &bull; — each needs correct decoding, and different filing tools use different entity styles for the same characters.
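The cleanup passes above can be condensed into a minimal Python illustration (the real pipeline is TypeScript; `strip_tags` and `find_item_1c` are hypothetical helper names, not the project's actual functions):

```python
import html
import re

def strip_tags(raw: str) -> str:
    """Illustrative tag stripper combining the cleanup passes described above."""
    # Drop inline-XBRL metadata blocks that carry no display content.
    raw = re.sub(r"<ix:(header|references)[^>]*>.*?</ix:\1>", " ", raw, flags=re.S)
    # Insert a space only at BLOCK element boundaries; adjacent inline
    # elements (<span>It</span><span>em 2</span>) are joined without one.
    raw = re.sub(r"</(p|div|td|tr|li|h[1-6])>", " ", raw)
    text = re.sub(r"<[^>]+>", "", raw)
    text = html.unescape(text)        # &nbsp; &#160; &ldquo; &mdash; ... -> characters
    text = text.replace("\xa0", " ")  # non-breaking spaces to plain spaces
    # Re-split camelCase joins left by PDF converters: "sentence.Next" -> "sentence. Next"
    # (a sketch; the real passes must also avoid breaking abbreviations like "U.S.")
    text = re.sub(r"([.!?])([A-Z])", r"\1 \2", text)
    return re.sub(r"\s+", " ", text).strip()

def find_item_1c(text: str) -> int:
    """Use the LAST 'Item 1C' match so the Table of Contents entry is skipped."""
    matches = list(re.finditer(r"Item\s*1C", text, flags=re.I))
    return matches[-1].start() if matches else -1
```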

Paragraph Segmentation

After extracting clean section text, splitting into paragraphs had its own challenges:

  • Bullet list merging. Disclosures frequently use bullet lists. Bullets need to be merged with their intro sentence; a standalone "• vulnerability scanning" is meaningless.
  • Continuation line detection. Sentences split across HTML block elements need rejoining.
  • Length boundaries. Under 20 words → likely a header (filtered). Over 500 words → split at sentence boundaries.
  • Table-based bullet lists and the cascade failure. Some generators render bullet lists as HTML tables with non-standard bullet characters. Since stripHtml() doesn't recognize &#183; as a bullet marker, the merge logic never fires, causing multi-element run-on paragraphs. Found 2,210 paragraphs affected.
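A toy version of the segmentation rules above (bullet merging plus the under-20-word header filter; the over-500-word sentence-splitting pass is omitted). Names and the bullet-character set are illustrative:

```python
import re

BULLET = re.compile(r"^\s*(?:[•·▪–-]|&#183;)\s*")

def segment(blocks: list[str]) -> list[str]:
    """Toy segmenter illustrating the merge/filter rules (the real
    pipeline is TypeScript; this is a hypothetical sketch)."""
    paras: list[str] = []
    for block in blocks:
        block = block.strip()
        if not block:
            continue
        if BULLET.match(block) and paras:
            # Bullets are merged into their intro sentence.
            paras[-1] += " " + BULLET.sub("", block)
        elif len(block.split()) < 20:
            continue  # likely a header or page artifact
        else:
            paras.append(block)
    return paras
```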

8-K Extraction

Roadblock: EDGAR full-text search misses filings. The EFTS keyword search doesn't reliably return all cybersecurity 8-Ks. Post-May 2024, companies moved non-material disclosures from Item 1.05 to Items 8.01 or 7.01.

Resolution: Built scan-8k-items.py to scan the SEC's bulk submissions.zip deterministically — a gap-free scan of every 8-K with cybersecurity content. Result: 207 cybersecurity incident 8-K filings identified.
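A sketch of the deterministic scan, assuming the parallel-array layout of the SEC's bulk submissions JSON (`filings.recent` with `form`, `items`, `accessionNumber` fields). Catching the post-May-2024 Item 8.01/7.01 disclosures additionally requires scanning filing text, which this fragment omits:

```python
def cyber_8k_accessions(submission: dict) -> list[str]:
    """Return accession numbers of 8-K filings whose item list includes
    Item 1.05. Field names follow the SEC bulk submissions.zip JSON
    layout (parallel arrays under filings.recent)."""
    recent = submission["filings"]["recent"]
    hits = []
    for form, items, acc in zip(recent["form"], recent["items"], recent["accessionNumber"]):
        # 'items' is a comma-separated string such as "1.05,9.01".
        if form == "8-K" and "1.05" in (items or ""):
            hits.append(acc)
    return hits
```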

Corpus Statistics

  • 72,045 paragraphs from ~9,000 filings (FY2023 + FY2024 + early FY2025)
  • All from 10-K Item 1C; paragraphs from the 207 incident 8-K filings extracted separately
  • Median ~7 paragraphs per filing
  • 49,795 paragraphs annotated (after filtering to complete filing metadata)

Phase 3: Data Quality Audit and Corpus Remediation

The Discovery

While preparing the DAPT corpus, we discovered two systematic issues silently corrupting the data:

  1. Orphan words. HTML source wraps text at fixed column width. When a <span> tag consumes most of a line, only the first word fits before the source newline. 4.7% of all paragraphs affected.
  2. Inlined section headings. 22% of paragraphs had section titles prepended to body text — a near-perfect predictor of content_category that creates shortcut learning risk.

Generator Investigation

Identified 14 distinct filing generators covering 99.99% of 14,759 HTML files. The worst generator (EFiling/EDGAR Agent) accounted for 13.5% of filings but 36.8% orphan word rate (8x corpus average). Clean generators (Workiva, Donnelley, Inline XBRL) all had <1% rates. Full reference: docs/EDGAR-FILING-GENERATORS.md.

Six Surgical Patches

All fixes follow the principle: paragraphs-clean.jsonl is frozen. All fixes go through .patched.jsonl files linked by paragraph UUID.

| Patch | Method | Paragraphs |
|---|---|---|
| 1-2. Orphan word restoration | HTML lookback extraction | 2,233 |
| 3-6. Heading strip (4 passes) | Pattern match + HTML-confirmed | 8,411 |

Quality Tier System

| Tier | Criteria | Count | % |
|---|---|---|---|
| clean | No detected issues | 58,165 | 80.7% |
| headed | Had inlined heading (now stripped) | 7,402 | 10.3% |
| degraded | Embedded bullets, invisible merges, fragments, truncations | 4,331 | 6.0% |
| minor | Had orphan word (now fixed) | 2,147 | 3.0% |

Degraded paragraphs downweighted 0.5x during fine-tuning.


Phase 4: Pre-Training — DAPT + TAPT

DAPT: Domain-Adaptive Pre-Training

Chose our own ~9,000 cached filings over PleIAs/SEC (373K on HuggingFace):

  • Recency > volume — Item 1C didn't exist before FY2023
  • Diminishing returns past 250M tokens (Ponnock 2025)
  • We control cleaning quality
  • Feasible on a single RTX 3090

Corpus: 14,568 docs, ~1.056B tokens. Subsampled to newest 500M tokens.

Key optimizations: Flash Attention 2 (47s→27s/step), torch.compile (halved activation memory), corpus subsampling (29h→13.5h).
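The optimizations might be wired up roughly like this (model id, batch size, and all argument values are illustrative, not the project's actual config):

```python
import torch
from transformers import AutoModelForMaskedLM, TrainingArguments

# Sketch of the DAPT speed/memory knobs described above.
model = AutoModelForMaskedLM.from_pretrained(
    "answerdotai/ModernBERT-large",           # illustrative checkpoint id
    attn_implementation="flash_attention_2",  # the 47s -> 27s/step change
    torch_dtype=torch.bfloat16,
)
model = torch.compile(model)  # activation-memory reduction

args = TrainingArguments(
    output_dir="checkpoints/dapt/modernbert-large",
    per_device_train_batch_size=8,  # illustrative
    bf16=True,
    num_train_epochs=1,
)
```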

Results: Eval loss 0.7250, perplexity 1.65. 1 epoch, ~14.5h on RTX 3090. Checkpoint: checkpoints/dapt/modernbert-large/final/.

TAPT: Task-Adaptive Pre-Training

72K Item 1C paragraphs (~10M tokens). 5 epochs with whole-word masking at seq_len=512.

Bugs fought: four bugs in the transformers whole-word-masking implementation for BPE tokenizers, plus a Python 3.14 incompatibility. Resolved by building a custom WholeWordMaskCollator from scratch.
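The core idea of the collator — group subtokens into words, then mask whole groups together — can be sketched as follows. This toy uses the WordPiece "##" continuation convention for readability; handling BPE-style continuation markers instead is exactly where the stock implementation broke:

```python
import random

def whole_word_mask(tokens: list[str], mask_prob: float = 0.30, seed: int = 0) -> list[str]:
    """Toy whole-word masking: continuation pieces (leading '##') are
    grouped with their head token, and whole groups are masked together.
    Hypothetical stand-in for the custom WholeWordMaskCollator."""
    rng = random.Random(seed)
    # Group subtoken indices into words.
    words: list[list[int]] = []
    for i, tok in enumerate(tokens):
        if tok.startswith("##") and words:
            words[-1].append(i)
        else:
            words.append([i])
    out = list(tokens)
    for group in words:
        if rng.random() < mask_prob:
            for i in group:
                out[i] = "[MASK]"  # every piece of the word is masked
    return out
```

The invariant that matters: a continuation piece is never masked independently of its head token.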

Results: Loss 1.46→1.08, eval loss 1.0754, perplexity 2.11. 50 minutes on RTX 3090. Checkpoint: checkpoints/tapt/modernbert-large/final/.

Training Pipeline

ModernBERT-large (base, 395M params)
    → DAPT on 9K full 10-K filings (~500M tokens, ~14.5h) → SEC-ModernBERT-large
    → TAPT on 72K Item 1C paragraphs (~10M tokens, ~50min) → SEC-cyBERT-large
    → Fine-tune on labeled data with dual classification heads → Final classifier

Phase 5: Truncated Filing Exclusion

Identified 72 filings (~0.8%) where section boundary detection cut the text off mid-sentence. These are excluded from training splits: any filing whose last paragraph does not end in terminal punctuation is filtered.
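A minimal version of the terminal-punctuation filter (the exact pattern is illustrative):

```python
import re

# Terminal punctuation, optionally followed by a closing quote/bracket.
TERMINAL = re.compile(r"""[.!?]["')\]]?\s*$""")

def is_truncated(last_paragraph: str) -> bool:
    """Flag a filing whose final paragraph does not end in terminal punctuation."""
    return not TERMINAL.search(last_paragraph.rstrip())
```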



Phase 6: The v2 Reboot — Why We Started Over

What v1 Taught Us

The v1 pipeline produced 150K Stage 1 annotations, a 10-model benchmark, human labels from 6 annotators, and extensive gold adjudication. It worked — but evaluation revealed structural problems that no amount of prompt iteration could fix:

  1. Specificity Level 2 was too narrow. Our codebook defined Level 2 as "names a recognized standard" — but the professor's construct says "references industry." Domain-specific practices (penetration testing, vulnerability scanning, SIEM) were classified as Level 1. Level 2 ended up at 3.9% of the holdout (47 samples) — too few for reliable per-class F1.

  2. Level 4 required 2+ QV facts. The construct lists types of qualifying facts, not a minimum count. The artificial threshold created a narrow class and forced annotators into a counting exercise.

  3. The BG/MR/RMP triangle was patched, not fixed. Six decision rules and ten borderline cases accumulated as patches on unchanged definitions. Models processed increasingly complex instructions with diminishing returns.

  4. The holdout was adversarial by design. Stratified to over-sample confusion-axis paragraphs — great for stress-testing the codebook, terrible for evaluation. Combined with narrow Level 2, this structurally depressed F1.

  5. Human specificity agreement was poor. Krippendorff's α = 0.546 on specificity (target: 0.67). The narrow Level 2 definition made it hard for anyone to agree.

The Decision

Rather than continue patching, we decided to:

  • Revise the codebook with systemic changes (broaden Level 2, loosen Level 4, reframe category rules)
  • Take a new random stratified holdout (equal per category class, not overindexed on hard cases)
  • Re-run Stage 1 with the improved codebook/prompt
  • Have humans re-label the new holdout
  • Re-run the benchmark panel
  • Then train

The v1 data pipeline, corpus, DAPT checkpoint, and TAPT checkpoint are all unchanged and carried forward. Only the labeling and evaluation are redone.

What Changed in v2

Codebook (LABELING-CODEBOOK.md):

  • Level 2 broadened from "names a standard" to "uses cybersecurity domain terminology" (the ERM test)
  • Level 4 threshold lowered from 2+ to 1+ QV-eligible fact (the external verifiability test)
  • Category primary test changed to "What question does this paragraph answer?"
  • MR headline changed from "who a specific person is" to "how management is organized to handle cybersecurity"
  • Person-removal test reframed as confirmation tool, not primary rule
  • Materiality rules cleaned up (assessment vs. speculation distinction became a clean rule, not a ruling)
  • IS/NOT lists restructured for new Level 2 boundary
  • Codebook + Ethos split: rules in LABELING-CODEBOOK.md, reasoning in CODEBOOK-ETHOS.md

Holdout:

  • Random stratified sample: ~170 per category class × 7 ≈ 1,190
  • Secondary constraint: minimum ~100 per specificity level
  • NOT overindexed on confusion-axis cases
  • Separate ~200-paragraph dev set for prompt iteration (excluded from holdout)

Phase 7: Holdout Selection & Prompt Engineering

Holdout Sampling

Used v1 Stage 1 consensus labels (50,003 paragraphs, 3-model majority vote under v2.5 prompt) as a sampling guide. Applied heuristic v2 specificity prediction: keyword scan for domain terminology to identify v1 Level 1 paragraphs that would become Level 2 under v2 rules, and QV indicator scan for Level 3→4 promotions.

Allocation: 185 per non-ID category, 90 for Incident Disclosure (only 166 available in the annotated corpus) = 1,200 exact. Max 2 paragraphs per company per category stratum to prevent boilerplate clustering. All specificity floors met (≥100 per level). 1,042 unique companies represented.
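The per-company cap can be sketched as rejection sampling within each category stratum (hypothetical schema: each paragraph dict carries a `company` key):

```python
import random
from collections import defaultdict

def sample_stratum(paras: list[dict], n: int, max_per_company: int = 2, seed: int = 0) -> list[dict]:
    """Random sample within one category stratum with a per-company cap,
    mirroring the anti-boilerplate-clustering rule described above."""
    rng = random.Random(seed)
    pool = paras[:]
    rng.shuffle(pool)
    picked, per_company = [], defaultdict(int)
    for p in pool:
        if per_company[p["company"]] < max_per_company:
            picked.append(p)
            per_company[p["company"]] += 1
        if len(picked) == n:
            break
    return picked
```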

The v1 holdout had been intentionally oversampled on confusion-axis cases (split votes between MR/RMP, N/O/SI, etc.) — useful for codebook development but structurally hostile to F1. The v2 holdout is random within each category stratum: hard cases appear at their natural frequency, not overweighted.

Prompt Iteration: From List-Matching to Principle-Based Reasoning

The v2 prompt underwent 5 iterations (v4.0→v4.4) tested against a 200-paragraph dev batch from the holdout with GPT-5.4 (~$6 total pilot cost).

v4.0 (baseline rewrite): Translated the v2 codebook into the system prompt. Category section used the "what question?" test — worked well at 87% agreement with v1 consensus. Specificity section used exhaustive IS/NOT lists, matching the v1 approach. Result: Level 2 grew from 6% to 16% (domain terminology broadening) and Level 4 grew from 5% to 22% (1+ QV rule). But audit revealed the model was pattern-matching against the lists rather than reasoning about the underlying principles. Two errors: "Vice President, Information Systems and Technology" and "Senior Vice President of Information Technology" classified as Level 1 because neither exactly matched the IS list entry "VP of IT/Security."

The list-matching problem: The category section — built around reasoning principles ("what question does this paragraph answer?", person-removal test, materiality linguistic test) — achieved 87% agreement. The specificity section — built around exhaustive checklists — caught listed items but missed unlisted items that satisfied the same principle. The model was executing a lookup table, not applying the ERM test.

v4.1 (principle-first restructure): Restructured all three specificity levels to lead with the principle and compress lists to boundary-case disambiguation only:

  • Level 2: "Apply the ERM test — would a non-security ERM professional use this language?" with illustrative examples
  • Level 3: "Would this detail help narrow down which company wrote it?" with the VP-or-above bright line
  • Level 4: "Could someone outside the company verify this?" with boundary cases

Result: +12 Level 1→2 catches (model reasoning about vocabulary level, not scanning a list), VP/SVP titles fixed. But Level 4 regressed — the model started reasoning about whether QV facts were "relevant to the paragraph's main point" instead of treating specificity as a presence check.

The independence insight: Category and specificity are independent dimensions. Category captures what the paragraph is ABOUT. Specificity captures how informative it is AS A WHOLE. A paragraph classified as RMP that mentions a CISO's CISSP in a subordinate clause is RMP at Level 4 — the certification is verifiable regardless of whether it serves the category. The model was conflating "this fact is secondary to the paragraph's purpose" with "this fact doesn't count for specificity." This is wrong: specificity is a presence check on the entire paragraph, not a relevance judgment.

This also raised a methodological question: SHOULD specificity be category-conditional? The steelman for category-conditional specificity: "Board Governance at Level 4" should mean the governance disclosure is highly specific, not that a tangential financial fact inflated the score. The steelman against: SEC paragraphs interleave topics, you can't cleanly decompose facts into category buckets, and conditional specificity introduces cascading errors (wrong category → wrong specificity). For this project, paragraph-level specificity is the right choice — it matches the construct, is simpler to annotate, and produces higher agreement. Acknowledged as a limitation for the paper.

v4.2–v4.4 (surgical fixes): Added explicit presence-check framing, hard vs. soft number boundary ("12 professionals" → QV, "approximately 20 departments" → not QV), and the "various certifications including CISSP → YES" rule (named certifications are QV regardless of surrounding hedge words). Final prompt (v4.4) recovers Level 4 to within 1 of baseline while retaining all principle-based gains at Levels 2 and 3.

v4.4 pilot results (200 paragraphs, GPT-5.4):

| Specificity | v4.0 (list) | v4.4 (principle) | Change |
|---|---|---|---|
| L1 | 81 (40.5%) | 65 (32.5%) | -16 |
| L2 | 32 (16.0%) | 41 (20.5%) | +9 |
| L3 | 43 (21.5%) | 51 (25.5%) | +8 |
| L4 | 44 (22.0%) | 43 (21.5%) | -1 |

Category: 95.5% agreement with v1 consensus. Specificity: 84.5% agreement (expected divergence given broadened L2 and 1+ QV rule). The 200-paragraph dev batch is now contaminated by prompt examples that target specific cases in it — further iteration requires the unseen 1,000 paragraphs from the full holdout.

Full Holdout Validation & v4.5

Running v4.4 on the full 1,200 holdout ($5.70) revealed three problems not visible in the 200-paragraph pilot:

Problem 1: 34.5% medium-confidence specificity. The model was uncertain on 414 of 1,200 paragraphs, concentrated at the L1/L2 boundary (59% of L2 calls were medium-confidence) and L2/L3 boundary (51% of L3). Third-Party Risk was worst: 74% medium-confidence on specificity. The model's reasoning showed it listing zero specific facts but still assigning L2 based on vibes — the paragraph "felt" domain-adapted because the topic was cybersecurity, even when the vocabulary was generic ERM language.

Problem 2: SI materiality assertions falsely promoted to L4. Paragraphs like "As of December 28, 2024, we have not had any material cybersecurity incidents" were classified L4 because a specific date anchored the claim. But negative self-assertions are not externally verifiable — you cannot independently confirm the absence of something. These are Strategy Integration at Level 1, not Level 4.

Problem 3: specific_facts discarded from stored output. The toLabelOutput() function stripped the specific_facts array before writing to disk. The model was generating facts during inference (the schema required it), but we couldn't verify the mechanical bridge between facts and specificity level because the evidence was thrown away.

v4.5 fixes:

  1. Mechanical bridge enforced. Restructured the specificity protocol as a scan-tag-max pipeline: scan for facts, tag each as [DOMAIN]/[FIRM]/[VERIFIABLE], assign specificity = max(tags). Added explicit rule: "if specific_facts is empty, specificity MUST be Generic Boilerplate." Result: 100% consistency — L1 always empty, L2+ always populated with supporting facts. The bridge prevents the model from overriding its own fact-finding with holistic vibes.

  2. Expertise vs. topic clarification for L1/L2. Added: "The ERM test evaluates whether the paragraph demonstrates cybersecurity EXPERTISE, not whether it discusses a cybersecurity TOPIC. Every paragraph in these filings discusses cybersecurity — that's what the filing requires. L1 means generic oversight language any business professional could write. L2 means the writer shows they understand HOW cybersecurity works." With TP-specific examples: "We conduct vendor security assessments" → L1 (generic process description); "We review vendors' SOC 2 attestations and require encryption at rest" → L2 (specific security evidence requiring domain knowledge).

  3. SI negative assertions excluded from L4. Added explicit NOT-verifiable examples: "We have not experienced any material cybersecurity incidents" → NOT QV (cannot externally verify absence); "In 2023, we did not experience a material incident" → NOT QV (a year does not make a negative assertion verifiable). Also added lower bounds as verifiable: "more than 20 years" → YES (checkable threshold, unlike "approximately 20" which is hedged both directions).

  4. Fact storage. Updated toLabelOutput() and LabelOutput schema to preserve specific_facts in stored output. Added domain_term to the FactType enum for L2-level vocabulary evidence.
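The scan-tag-max bridge reduces to a one-line rule once facts are tagged (tag names follow the prompt's [DOMAIN]/[FIRM]/[VERIFIABLE] scheme; this is a sketch of the rule, not project code):

```python
# Mechanical bridge: specificity = max over fact tags; no facts -> L1.
LEVEL = {"DOMAIN": 2, "FIRM": 3, "VERIFIABLE": 4}

def specificity_from_facts(tags: list[str]) -> int:
    """Scan-tag-max rule from the v4.5 prompt: if specific_facts is empty,
    the paragraph MUST be Generic Boilerplate (L1); otherwise the level is
    the maximum level implied by any tag."""
    return max((LEVEL[t] for t in tags), default=1)
```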

v4.5 results (1,200 paragraphs, GPT-5.4, $6.88):

| Metric | v4.4 | v4.5 |
|---|---|---|
| L1 | 546 (45.5%) | 618 (51.5%) |
| L2 | 229 (19.1%) | 168 (14.0%) |
| L3 | 225 (18.8%) | 207 (17.2%) |
| L4 | 200 (16.7%) | 207 (17.2%) |
| Medium confidence | 414 (34.5%) | 211 (17.6%) |
| Bridge consistency | unknown | 100% |
| SI false L4s | ~6 | 0 |
| Category stability (v4.4→v4.5) | | 96.8% |

L2 at 14% is below the 15% holdout target, but the holdout oversamples TP (14.4% vs 5% in corpus) and TP is where 55 of 61 L2→L1 drops concentrated. On the full corpus (46% RMP, 5% TP), L2 should be ~15-17%. The TP drops are correct — verified by inspecting the facts: survivors list SOC reports, vulnerability scans, penetration testing; drops use only generic vendor management language ("contractual requirements", "vendor due diligence").

Key architectural insight: With reasoning models, structured output fields are results, not reasoning steps. The model decides everything in reasoning tokens before generating JSON. The mechanical bridge works by influencing the reasoning process through prompt text, not through schema field ordering. The specific_facts field captures the model's evidence for our debugging, but the actual bridge enforcement happens in the model's internal reasoning guided by the prompt's explicit consistency rules.

v2 Holdout Benchmark (10 models, 8 providers)

With v4.5 locked, we ran the full BENCHMARK_MODELS panel on the 1,200-paragraph v2 holdout to evaluate model quality before committing to the ~$100 Stage 1 re-run. GPT-5.4 (v4.5) is the reference — our best-validated model on the holdout, the one whose prompt iterations we hand-verified.

Full benchmark results (vs GPT-5.4 reference):

| Model | N | Cat% | Cat κ | Spec% | Spec κw | Both% | 50K proj | Reasoning tok/para |
|---|---|---|---|---|---|---|---|---|
| Grok 4.1 Fast | 1200 | 93.7% | 0.925 | 91.6% | 0.929 | 86.1% | $32 | 584 |
| Opus 4.6 (prompt-only) | 1184 | 93.7% | 0.925 | 90.1% | 0.910 | 85.2% | $0 (sub) | |
| Gemini 3.1 Pro | 1200 | 93.8% | 0.926 | 89.4% | 0.906 | 84.2% | $735 | 502 |
| GLM-5 | 1200 | 92.8% | 0.915 | 88.3% | 0.898 | 82.8% | $364 | 1421 |
| Kimi K2.5 | 1200 | 92.6% | 0.912 | 88.1% | 0.894 | 82.8% | $353 | 2832 |
| Gemini 3.1 Flash Lite | 1200 | 91.8% | 0.904 | 83.0% | 0.844 | 76.5% | $79 | 363 |
| MIMO v2 Flash | 794 | 92.7% | 0.911 | 85.3%* | 0.662 | 79.7% | $26 | 1423 |
| MIMO v2 Pro | 980 | 94.0% | | 90.7% | | 85.9% | $274 | 1439 |
| MiniMax M2.7 | 1198 | 87.6% | 0.855 | 76.5% | 0.756 | 68.5% | $70 | 615 |

*MIMO Flash spec% is misleading — 91.1% of its labels are L1 (collapsed distribution). κw = 0.662 reflects this.

Pilot candidates (200-paragraph tests):

| Model | Cat% | Spec% | Both% | 50K proj | Verdict |
|---|---|---|---|---|---|
| Qwen3-235B MoE | 89.9% | 62.6% | 56.1% | $18 | Dead — 0 reasoning tokens, 34% L4 |
| Seed 1.6 Flash | 87.5% | 74.7% | 67.7% | $24 | Weak — below Flash Lite |
| Qwen3.5 Flash | 92.9% | n/a | n/a | $70 | Dead — 100% L1 collapse |

Key findings from the benchmark:

  1. Clear quality tiers. Grok Fast stands alone as the best affordable model (86.1% both-match, $32/50K). There's a 9pp gap to the next affordable option (Flash Lite at 76.5%, $79). Everything in between costs $350+.

  2. MIMO Flash specificity is broken. Category agreement is fine (92.7%) but specificity collapses to 91.1% L1 — it simply doesn't differentiate specificity levels. The v1 Stage 1 panel included MIMO Flash; this means v1 specificity consensus was partially degraded by one broken voter.

  3. Opus performs better without the codebook. We ran Opus via Agent SDK in two configurations: (a) full v2 codebook + operational prompt (37.7KB system prompt), (b) operational prompt only (16.2KB). Prompt-only was significantly better: 85.2% vs 82.4% both-match, 49.2% vs 40.5% facts coverage. The codebook was actively diluting the operational prompt's bridge instruction. This is a counterintuitive but important finding for the paper — more context can hurt performance when the operational prompt has been carefully engineered.

  4. Reasoning tokens correlate with quality, but not linearly. Kimi K2.5 reasons the most (2832 tokens/para) but ranks 5th. Grok reasons modestly (584 tokens) and ranks 1st. The quality seems to depend more on the model's internal architecture than on raw reasoning volume. Models with 0 reasoning tokens (Qwen3-235B) or with reasoning that doesn't engage with specificity (Qwen3.5 Flash — 4381 tokens, all L1) are categorically broken for this task.

  5. No viable cheap third model exists. We searched OpenRouter exhaustively for models under $50/50K that support structured output and reasoning. Every candidate (Qwen, ByteDance Seed, etc.) performed below Flash Lite, which was already the weakest panel member.

  6. Category agreement is high across all non-broken models (>91% vs reference, κ > 0.90). The hard problem is specificity, where the mechanical bridge helps good models but can't save models that don't reason about it properly.

Model Selection: Grok ×3 Self-Consistency

The budget constraint ($175 remaining for Stage 1 + Stage 2 + everything else) eliminated all multi-model panels except Grok + Flash Lite ($111). But Flash Lite's 76.5% both-match and inflated L2 distribution (19.1% vs 14% reference) made it a weak second voter.

We investigated whether running Grok multiple times could produce independent signals. The temperature question turned out to be irrelevant: reasoning models have internal stochastic chain-of-thought that produces different outputs on repeated identical calls regardless of temperature settings. Most providers silently ignore temperature: 0 for reasoning models (OpenAI explicitly rejects it; others drop it). Our temperature: 0 was cosmetic the entire time.

Empirical verification: We re-ran 47 holdout paragraphs through Grok 4.1 Fast with identical inputs. Results:

  • Category: 47/47 identical (100% deterministic)
  • Specificity: 43/47 identical (91.5%), 4 diverged
  • Divergence: 8.5% of paragraphs got different specificity labels
  • All divergence was on specificity (L1↔L2, L1→L3, L3→L4) — exactly the ambiguous boundary cases where multiple runs provide real tiebreaking value

This 8.5% per-pair divergence rate means:

  • ~90% of paragraphs will be 3/3 unanimous → strong consensus
  • ~10% will be 2-1 split → majority vote resolves boundary cases
  • Category was unanimous on every pilot pair → consensus category quality ≈ single-run Grok quality (93.7%, κ=0.925)

Self-consistency is a well-established pattern (Wang et al. 2022). The weakness vs multi-model consensus is shared systematic biases — all three runs make the same systematic errors. But with κ=0.925 on category and κw=0.929 on specificity, Grok's systematic errors are rare. The 8.5% stochastic variation is concentrated exactly where we want it: ambiguous specificity boundaries.

Cost: $96 for Grok ×3 (3 × $32 through OpenRouter). Leaves $80 for Stage 2 judge and any reruns. An alternative — xAI's Batch API at 50% off — would reduce this to $48, but requires bypassing OpenRouter.

Stage 1 Results: Grok ×3 Self-Consistency (72,045 paragraphs)

We ran 3 independent Grok 4.1 Fast passes over the full 72,045-paragraph corpus at concurrency 200. Each run completed in ~33 minutes. Total cost: $129.75 ($43.12–$43.62 per run).

Cross-run agreement:

| Dimension | Unanimous (3/3) | Majority (2/3) | All disagree |
|---|---|---|---|
| Category | 68,394 (94.9%) | 3,583 (5.0%) | 68 (0.09%) |
| Specificity | 65,780 (91.3%) | 6,120 (8.5%) | 145 (0.20%) |

Category is near-deterministic — 94.9% unanimous, and the 5% majority cases are concentrated at the BG↔MR and MR↔RMP boundaries (exactly the confusion axes identified during codebook development). Specificity shows the expected stochastic variation at 8.5% majority-only, matching the 8.5% divergence rate observed in the 47-paragraph pilot.

Consensus resolution:

  • 62,510 (86.8%) — both unanimous, direct consensus
  • 9,323 (12.9%) — majority vote on at least one dimension
  • 212 (0.3%) — no majority on at least one dimension, resolved by GPT-5.4 judge

The 212 tiebreaker paragraphs were run through GPT-5.4 with the full judge prompt (disagreement-aware disambiguation rules, shuffled prior annotations). GPT-5.4 agreed with one of the 3 Grok labels on 100% of paragraphs — never inventing a novel answer. This validates that the Grok runs produce reasonable labels and the disagreements are genuine boundary cases, not model failures. Judge cost: $5.76.

Final consensus distribution:

| Category | Count | % |
|---|---|---|
| RMP | 31,201 | 43.3% |
| BG | 13,876 | 19.3% |
| MR | 10,591 | 14.7% |
| SI | 7,470 | 10.4% |
| N/O | 4,576 | 6.4% |
| TP | 4,094 | 5.7% |
| ID | 237 | 0.3% |

| Specificity | Count | % |
|---|---|---|
| L1: Generic Boilerplate | 29,593 | 41.1% |
| L2: Domain-Adapted | 16,344 | 22.7% |
| L3: Firm-Specific | 17,911 | 24.9% |
| L4: Quantified-Verifiable | 8,197 | 11.4% |

v1→v2 category shifts: BG rose from 16.0%→19.3% and N/O from 5.0%→6.4%, likely driven by the 22,250 paragraphs in the full corpus that v1 never annotated. RMP dropped from 45.8%→43.3%, partly because the v2 codebook's sharper BG/MR/RMP boundaries reclassified some borderline paragraphs.

Specificity is well-distributed. L2 at 22.7% (above the 15% holdout target — the full corpus has more domain-rich paragraphs than the stratified holdout). L3 at 24.9% and L4 at 11.4% reflect the v2 codebook's tightened verifiability standards.

Category × specificity interaction (see figures/stage1-category-specificity-heatmap.png): MR is 87% L3/L4 (people have names, titles, and credentials). SI is 92% L1 (materiality boilerplate with no specific facts). ID is 86% L4 (incidents have dates, named threat actors, forensic firms). These patterns are exactly what the codebook predicts and match the holdout validation.

Specificity boundary analysis: The 6,265 paragraphs where runs diverged on specificity are concentrated at adjacent levels: L1↔L2 (2,485), L1↔L3 (1,423), L2↔L3 (1,160), L3↔L4 (707). Cross-level jumps (L1↔L4, L2↔L4) are rare (~280 total). This confirms the self-consistency mechanism is working as intended — it provides tiebreaking signal exactly at the ambiguous boundaries where different reasoning paths legitimately land on different answers.

Cost of the Reboot (final)

| Item | Estimated Cost | Actual Cost |
|---|---|---|
| Prompt iteration (v4.0–v4.5, ~8 rounds) | ~$10 | $19.59 |
| v2 holdout benchmark (10 models + 3 pilots) | ~$45 | $45.47 |
| Stage 1 re-run (Grok ×3, 72K paragraphs) | ~$96 | $129.75 |
| Stage 2 judge (212 tiebreaker paragraphs) | ~$20-40 | $5.76 |
| Human re-labeling | $0 (team labor) | pending |
| Total additional API | ~$175-185 | $200.57 |

Against the ~$120 already spent on v1 API calls (not recovered). Total project API cost: $320.57 of $360 budget. Remaining: $39.43 — sufficient for any reruns or additional analysis.

The cost overshoot ($200 vs $175 estimate) is entirely from annotating 72K paragraphs instead of the estimated 50K. The per-paragraph cost was actually lower than estimated ($0.60/paragraph for the full 3-run self-consistency + judge pipeline vs $0.64 estimated).


Phase 8: Fine-Tuning — From 0.52 to 0.94 Specificity F1

Training Data Assembly

Built python/src/finetune/data.py to merge Stage 1 consensus labels (72,045 paragraphs) with paragraph text, quality tiers, and specificity confidence metadata.

Exclusions:

  • 1,200 holdout paragraphs (reserved for evaluation)
  • 614 individually truncated paragraphs (initial plan was to exclude 72 entire filings, but paragraph-level filtering is more targeted and preserves more data)

Sample weighting: clean/headed/minor = 1.0×, degraded = 0.5× (4,331 paragraphs at half weight).

Result: 70,231 training paragraphs, stratified 90/10 into 63,214 train / 7,024 val.
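A sketch of a stratified split on the label pair (whether stratification used the joint category × specificity label is an assumption — the text only states "stratified 90/10"; the schema keys are illustrative):

```python
import random
from collections import defaultdict

def stratified_split(rows: list[dict], val_frac: float = 0.10, seed: int = 0):
    """90/10 split stratified so each (category, specificity) label pair
    appears in train and val at the same proportion."""
    rng = random.Random(seed)
    by_label = defaultdict(list)
    for r in rows:
        by_label[(r["category"], r["specificity"])].append(r)
    train, val = [], []
    for group in by_label.values():
        rng.shuffle(group)
        k = round(len(group) * val_frac)
        val.extend(group[:k])
        train.extend(group[k:])
    return train, val
```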

Architecture: Dual-Head ModernBERT

The model architecture: ModernBERT-large backbone (395M params) → pooled representation → dropout → two independent classification heads:

  1. Category head: Linear(1024, 7) with weighted cross-entropy loss. Standard multi-class classification.
  2. Specificity head: Ordinal classification. The specificity dimension (L1→L2→L3→L4) has natural ordering — predicting L1 when truth is L4 is worse than predicting L3. This ordering should be reflected in the model architecture and loss function.

The initial architecture used CORAL (Cao et al. 2020) for the specificity head: a single shared weight vector with learned bias offsets for each ordinal threshold. This is the standard approach for ordinal regression.
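To make the structure concrete, here is a minimal numpy sketch of a CORAL-style head (toy dimensions, untrained random weights; an illustration, not the training code):

```python
import numpy as np

rng = np.random.default_rng(0)

x = rng.normal(size=(4, 8))     # toy pooled representations (batch 4, dim 8)

# CORAL: ONE shared weight vector, one learned bias per ordinal threshold.
w = rng.normal(size=8)
b = np.array([1.0, 0.0, -1.0])  # b_k for thresholds L2+, L3+, L4

logits = x @ w[:, None] + b              # logit_k = w·x + b_k, shape (4, 3)
probs = 1 / (1 + np.exp(-logits))        # P(level > k)
levels = 1 + (probs > 0.5).sum(axis=1)   # predicted level in 1..4
```

Because every threshold shares the same direction w, the per-row threshold probabilities are monotone by construction — the head cannot use different evidence for different transitions.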

Ablation Grid: 12 Configurations × 1 Epoch

Ran a systematic ablation over three axes:

  • Checkpoint: base ModernBERT-large vs DAPT checkpoint vs TAPT checkpoint
  • Class weighting: inverse-frequency weights vs uniform
  • Loss type: cross-entropy vs focal loss (γ=2.0)

Results (1 epoch each, ~15 min/run, ~3 hours total):

| Rank | Configuration | Combined F1 | Cat F1 | Spec F1 |
|---|---|---|---|---|
| 1 | base + weighted + CE | 0.685 | 0.900 | 0.469 |
| 2 | DAPT + unweighted + focal | 0.684 | 0.892 | 0.476 |
| 3 | DAPT + weighted + CE | 0.681 | 0.896 | 0.466 |
| 4 | base + unweighted + CE | 0.680 | 0.892 | 0.467 |
| 5 | TAPT + weighted + CE | 0.675 | 0.896 | 0.455 |
| ... | | | | |
| 12 | TAPT + weighted + focal | 0.649 | 0.849 | 0.449 |

Finding 1: DAPT/TAPT pre-training did not help. Base ModernBERT-large outperformed both domain-adapted checkpoints. This is a noteworthy null result. ModernBERT-large was already pre-trained on a massive, diverse web corpus that likely includes SEC filings. Additional narrow-domain pre-training appears to cause mild catastrophic forgetting — the model loses general linguistic features while gaining domain-specific ones that the fine-tuning task doesn't benefit from. TAPT was consistently worst, suggesting the small corpus (72K paragraphs × 5 epochs at 30% masking) caused overfitting during MLM pre-training.

Finding 2: Weighted CE is the best loss combination. Class weighting helps category F1 significantly (0.900 vs 0.892 for base). Focal loss helps specificity slightly but hurts category. Weighted + focal = too much correction (consistently bottom tier) — both mechanisms independently reduce majority-class influence, and combining them over-corrects.
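The two loss variants and the inverse-frequency weighting can be sketched in numpy (toy values, not the training implementation):

```python
import numpy as np

def weighted_ce(probs, y, class_weights):
    """Weighted cross-entropy: -w_y * log p_y per sample."""
    p_y = probs[np.arange(len(y)), y]
    return -class_weights[y] * np.log(p_y)

def focal(probs, y, gamma=2.0):
    """Focal loss: the (1 - p_y)^gamma factor down-weights easy, confident examples."""
    p_y = probs[np.arange(len(y)), y]
    return -((1.0 - p_y) ** gamma) * np.log(p_y)

# Inverse-frequency class weights from training label counts (toy counts):
counts = np.array([900.0, 100.0])
class_weights = counts.sum() / (len(counts) * counts)  # minority class upweighted
```

Both mechanisms shrink the majority class's share of the gradient, which is why stacking them over-corrects.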

Full Training: The CORAL Wall (5 Epochs)

Trained the top 2 configurations for 5 epochs each (~1.5 hours per run):

base_weighted_ce (5 epochs):

| Epoch | Combined | Cat F1 | Spec F1 | QWK |
|---|---|---|---|---|
| 1 | 0.670 | 0.879 | 0.461 | 0.800 |
| 3 | 0.704 | 0.924 | 0.485 | 0.833 |
| 5 | 0.724 | 0.932 | 0.517 | 0.840 |

Category F1 reached 0.932 — well above the 0.80 target. But specificity F1 plateaued at 0.517. Per-class breakdown revealed the problem:

| Level | Specificity F1 |
|---|---|
| L1 (Generic) | 0.79 |
| L2 (Domain-Adapted) | 0.29 |
| L3 (Firm-Specific) | 0.31 |
| L4 (Quantified) | 0.55 |

L2 and L3 were dragging macro F1 down to 0.52. QWK was 0.84 — meaning the model's ordinal ranking was good (rarely confusing L1 with L4), but the exact boundary placement between adjacent levels was fuzzy.
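For reference, QWK penalizes disagreements by the squared distance between levels; a minimal numpy sketch (0-indexed levels; illustrative, not the project's eval code):

```python
import numpy as np

def quadratic_weighted_kappa(y_true, y_pred, n_levels=4):
    """QWK = 1 - sum(W*O) / sum(W*E), with quadratic distance weights W."""
    O = np.zeros((n_levels, n_levels))        # observed confusion matrix
    for t, p in zip(y_true, y_pred):
        O[t, p] += 1
    # expected counts under independent marginals
    E = np.outer(O.sum(axis=1), O.sum(axis=0)) / len(y_true)
    i, j = np.indices((n_levels, n_levels))
    W = (i - j) ** 2 / (n_levels - 1) ** 2    # quadratic disagreement weights
    return 1.0 - (W * O).sum() / (W * E).sum()
```

An adjacent miss (L3 for true L4) costs 1/9 of a worst-case miss (L1 for true L4), which is why a model with fuzzy adjacent boundaries can still post a high QWK alongside a low macro F1.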

The CORAL Diagnosis

CORAL uses a single weight vector w with shifted biases: logit_k = w·x + b_k. This means the same features separate L1 from L2 as separate L3 from L4. But the three specificity transitions require fundamentally different evidence:

  • L1→L2: Cybersecurity terminology detection (the ERM test — does the paragraph use language a general business professional wouldn't?)
  • L2→L3: Firm-unique fact detection (named roles, specific systems, internal programs)
  • L3→L4: Quantified/verifiable claim detection (dollar amounts, dates, third-party firm names)

A single shared weight vector cannot simultaneously encode "presence of domain terminology," "presence of named entities," and "presence of numerical quantities" — these are orthogonal signal types in the embedding space. CORAL's structural constraint was forcing the model to find one feature direction that approximates all three, resulting in blurry boundaries everywhere.

Additionally, [CLS] token pooling loses distributed signals. A paragraph that mentions "CISO" once in a subordinate clause should be L3, but [CLS] may not attend strongly to that one token.

Architecture Iteration: Independent Thresholds

Replaced CORAL with four changes (implemented in python/src/finetune/model.py):

  1. Independent threshold heads. Three separate binary classifiers, each with its own Linear(1024→256→1) MLP:

    • threshold_L2plus: "Has any qualifying facts?" (L1 vs L2+)
    • threshold_L3plus: "Has firm-specific facts?" (≤L2 vs L3+)
    • threshold_L4: "Has quantified facts?" (≤L3 vs L4)

    Same cumulative binary targets as CORAL (label k → [1]×k + [0]×(3−k)), but each threshold learns independent features. The prediction is: level = count(sigmoid(logit_k) > 0.5).

  2. Attention pooling. Replaced [CLS] with a learned attention pool over all token representations. This lets the model attend to specific evidence tokens (CISO, $2M, NIST) distributed anywhere in the paragraph.

  3. Specificity confidence filtering. Only compute specificity loss on paragraphs where all 3 Grok runs agreed on specificity (91.3% of training data, as tracked in consensus specificityAgreement.agreed). The ~6K disagreement cases are exactly the noisy boundary labels that confuse the model. Category loss still uses all samples.

  4. Ordinal consistency regularization. Penalty (weight 0.1) when threshold k fires but threshold k-1 doesn't — e.g., the model says "has firm-specific facts" but not "has domain terms." This enforces the cumulative structure without the rigidity of CORAL's shared weights.
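Changes 1, 2, and 4 can be sketched in numpy (toy dimensions, untrained random weights; the real model.py implementation is in PyTorch):

```python
import numpy as np

rng = np.random.default_rng(0)
H, HID, T = 16, 8, 6      # stand-ins for hidden 1024, MLP 256, sequence length

tokens = rng.normal(size=(T, H))         # token representations, one paragraph

# (2) Attention pooling: a learned query scores each token; softmax-weighted sum.
q = rng.normal(size=H)
scores = tokens @ q
a = np.exp(scores - scores.max())
a /= a.sum()
x = a @ tokens                           # pooled representation, shape (H,)

# (1) Three INDEPENDENT threshold heads, each its own Linear(H -> HID -> 1) MLP.
heads = [(rng.normal(size=(H, HID)), rng.normal(size=HID)) for _ in range(3)]
logits = np.array([np.maximum(x @ W1, 0.0) @ W2 for W1, W2 in heads])
probs = 1.0 / (1.0 + np.exp(-logits))    # P(level > k), one per threshold

# Cumulative targets: label k (0..3) -> [1]*k + [0]*(3-k); prediction mirrors it.
def targets(k):
    return np.array([1.0] * k + [0.0] * (3 - k))

level = int((probs > 0.5).sum())         # predicted level index 0..3 (L1..L4)

# (4) Ordinal consistency: penalize threshold k firing when k-1 does not.
consistency_penalty = 0.1 * np.maximum(probs[1:] - probs[:-1], 0.0).sum()
```

Unlike CORAL, nothing forces the three probabilities to be monotone, so the soft penalty supplies the ordinal structure while each head keeps its own feature direction.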

Results: The Independent Threshold Breakthrough

Config: configs/finetune/iter1-independent.yaml — base ModernBERT-large, independent thresholds with 256-dim MLP, attention pooling, spec confidence filtering, 15 epochs.

| Epoch | Combined | Cat F1 | Spec F1 | QWK | L2 F1 | L3 F1 |
|---|---|---|---|---|---|---|
| 1 | 0.855 | 0.867 | 0.844 | 0.874 | 0.782 | 0.821 |
| 2 | 0.913 | 0.909 | 0.918 | 0.935 | 0.887 | 0.911 |
| 3 | 0.925 | 0.919 | 0.931 | 0.945 | 0.893 | 0.926 |
| 5 | 0.938 | 0.936 | 0.940 | 0.949 | | |
| 8 | 0.944 | 0.943 | 0.945 | 0.952 | 0.923 | 0.940 |
| 10 | 0.944 | 0.943 | 0.945 | 0.952 | | |

The model exceeded 0.80 on both heads at epoch 1. By epoch 8 it plateaued at 0.944 combined F1 (cat=0.943, spec=0.945, QWK=0.952). Training was stopped at epoch 11 — the train-eval loss gap (0.06 vs 0.49, ~8×) indicated the model was memorizing without further improving eval metrics.

The improvement was transformative. Spec F1: 0.517 → 0.945 (+0.428). L2 F1: 0.29 → 0.92. L3 F1: 0.31 → 0.94. The independent thresholds + attention pooling + confidence filtering combination addressed all three root causes simultaneously.

What mattered most? The independent thresholds were the primary driver. CORAL's shared weight vector was the bottleneck — when we let each ordinal transition learn its own features, the model immediately distinguished the three types of specificity evidence. Attention pooling and confidence filtering likely contributed meaningful improvements, but we did not run an ablation to isolate their individual contributions (the combined effect was so strong that decomposition was deprioritized).

Overfitting Observations

Encoder models absolutely can overfit. The 8× train-eval loss gap by epoch 10 is substantial. However, eval metrics (F1, QWK) remained stable from epoch 811, exhibiting "benign overfitting" — the model becomes more confident on training examples (lower train loss) without changing its decision boundaries (stable eval F1). The practical implication: monitor eval F1 for model selection, not eval loss.

For future runs: increase save_total_limit to preserve all epoch checkpoints, and add early stopping with patience ≥ 3 on spec_macro_f1.
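That stopping rule can be sketched framework-agnostically (a hypothetical helper, not part of the current training loop):

```python
def should_stop(metric_history, patience=3):
    """Stop once the metric (e.g. spec_macro_f1 per epoch) has failed to beat
    its previous best for `patience` consecutive epochs."""
    if len(metric_history) <= patience:
        return False
    best_before = max(metric_history[:-patience])
    return all(m <= best_before for m in metric_history[-patience:])
```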

Training Configuration Reference

| Parameter | Value |
|---|---|
| Backbone | answerdotai/ModernBERT-large (395M params) |
| Pooling | Learned attention |
| Category head | Linear(1024, 7) + weighted CE |
| Specificity head | 3× Independent(Linear(1024→256→1)) + cumulative BCE |
| Ordinal consistency | 0.1 weight |
| Spec confidence filter | Unanimous labels only (91.3% of data) |
| Batch size | 32 |
| Learning rate | 5e-5 |
| Warmup | 10% of total steps |
| Precision | bf16 + tf32 |
| Attention | Flash Attention 2 |
| Compilation | torch.compile |
| Optimizer | AdamW (fused) |
| Peak VRAM | ~18 GB / 24.6 GB (RTX 3090) |
| Training speed | ~2.1 it/s (batch 32, seq 512) |
| Best epoch | 8 (stable through 11) |

Checkpoint: checkpoints/finetune/iter1-independent/final/

What Remains

These metrics are on the validation set — same distribution as training (Grok ×3 consensus labels). The true test is the holdout gold set with human labels, which may reveal:

  • Systematic Grok-vs-human disagreements (especially at L2/L3 boundaries)
  • Whether the model learned Grok's biases rather than the underlying construct
  • Per-class F1 on the more diverse holdout distribution (the training data overrepresents RMP at 43%)

As a proxy before human labels arrive, evaluation against GPT-5.4 and Opus benchmark labels on the holdout will provide an intermediate signal.


v1 Reference

The complete v1 narrative — Stage 1 prompt engineering (12+ iterations), model benchmarking (21+ models, 12 providers), human labeling webapp, gold set adjudication (13-signal cross-analysis), codebook iterations v1.0→v3.5 — is preserved at docs/NARRATIVE-v1.md.

Key v1 deliverables carried forward:

  • 72,045-paragraph corpus with quality tiers
  • DAPT checkpoint (eval loss 0.7250, perplexity 1.65)
  • TAPT checkpoint (eval loss 1.0754, perplexity 2.11)
  • Model census: 21+ models evaluated across 12 providers
  • Human labeling webapp (labelapp) — will be updated for v2 codebook
  • Empirical evidence for every v2 codebook decision

References

  • Warner, B., Clavié, B., Soldaini, L., et al. (2024). "Smarter, Better, Faster, Longer: A Modern Bidirectional Encoder for Fast, Memory Efficient, and Long Context Fine-tuning and Inference." arXiv:2412.13663.
  • Gururangan, S., Marasovic, A., Swayamdipta, S., Lo, K., Beltagy, I., Downey, D., & Smith, N.A. (2020). "Don't Stop Pretraining: Adapt Language Models to Domains and Tasks." Proceedings of ACL 2020, pp. 8342-8360.
  • Ponnock, J. (2025). "The Data Efficiency Frontier of Financial Foundation Models: Scaling Laws from Continued Pretraining." arXiv:2512.12384.
  • Sounack, T., et al. (2025). "BioClinical ModernBERT: A Domain-Adapted Encoder for Biomedical and Clinical NLP." arXiv:2506.10896.
  • Luo, Z., et al. (2025). "Patent ModernBERT: A Pretrained Language Model for Intellectual Property." arXiv:2509.14926.
  • Dao, T. (2024). "FlashAttention-2: Faster Attention with Better Parallelism and Work Partitioning." Proceedings of ICLR 2024.
  • Ringel, D.M. (2023). "Creating Synthetic Experts with Generative Artificial Intelligence." arXiv:2310.15560.
  • Cao, W., Mirjalili, V., & Raschka, S. (2020). "Rank Consistent Ordinal Regression for Neural Networks with Application to Age Estimation." Pattern Recognition Letters.