SEC-cyBERT/docs/NARRATIVE.md

Project Narrative — SEC Cybersecurity Disclosure Quality Classifier

This document captures the process, roadblocks, decisions, and resolutions from building the SEC cybersecurity disclosure quality classifier. It serves as the source material for the final paper and presentation.


Phase 1: Project Scoping and Construct Design

The Problem

SEC Release 33-11216 (July 2023) created a new annual cybersecurity disclosure requirement (10-K Item 1C) and an incident disclosure requirement (8-K Item 1.05). By FY2024, ~9,000-10,000 filings exist. No validated classifier or public labeled dataset exists for assessing the quality of these disclosures. Investors, regulators, and compliance officers need scalable tools to distinguish substantive disclosures from boilerplate.

Methodology Decision: Ringel (2023) "Synthetic Experts"

We adopted the Ringel (2023) "Synthetic Experts" pipeline: use frontier LLMs to generate training labels at scale, then distill into an efficient encoder model. This approach was chosen because:

  • Manual labeling of 50,000+ paragraphs is infeasible for a 6-person team
  • Multiple cheap LLMs annotating in parallel provide built-in quality control through inter-annotator agreement
  • The encoder distillation step produces a model that can classify at inference time without LLM API costs

Construct: Two Classification Dimensions

We defined two simultaneous classification tasks per paragraph:

  1. Content Category (7 mutually exclusive classes) — what the paragraph is about, grounded in the SEC rule's own structure (Board Governance, Management Role, Risk Management Process, Third-Party Risk, Incident Disclosure, Strategy Integration, None/Other)
  2. Specificity Level (4-point ordinal) — how company-specific the disclosure is, from generic boilerplate to quantified-verifiable facts

The construct maps to NIST CSF 2.0 categories for academic grounding.


Phase 2: Data Acquisition and Corpus Construction

The Extraction Problem

SEC filings are not structured data. They're HTML generated from PDFs, XBRL, and Word documents by dozens of different tools, each producing different artifacts. Building a reliable extraction pipeline for ~9,000 filings meant solving a series of messy, real-world data engineering problems.

Pipeline Architecture

Built in TypeScript (~1,000 lines of extraction code across parse-item1c.ts, segment.ts, fast-reparse.ts, and pipeline orchestration):

EDGAR Master Index → enumerate 10-K filings → download HTML → extract Item 1C → segment paragraphs → JSONL
submissions.zip → scan for 8-K Item 1.05 → download HTML → extract → segment → merge with 10-K corpus

Roadblock: HTML Variability

Every filing's HTML is different. The same logical content looks completely different depending on the tool that generated the HTML:

  • Word splitting from inline elements. XBRL and styling tags break words mid-token: <span>It</span><span>em 2</span> renders correctly in a browser, but a stripper that blindly inserts whitespace at element boundaries parses it as "It em 2", while one that never inserts whitespace glues genuinely separate words together. Required detecting adjacent inline element boundaries and inserting spaces selectively.
  • CamelCase joins from PDF converters. PDF-to-HTML tools merge sentences across formatting boundaries: sentence.Next sentence instead of sentence. Next sentence. Required regex passes to detect missing spaces after punctuation.
  • Page breaks mid-sentence. Page numbers, running headers, and subsidiary headers get spliced into the middle of content paragraphs. Required filtering a catalog of page artifact patterns.
  • Table of Contents shadowing. "Item 1C" appears at least twice in every 10-K — once in the Table of Contents and once in the actual content. Using the first match extracts the wrong section. Required the LAST match — a silent failure that produced empty or wrong extractions for hundreds of filings before we caught it.
  • XBRL tag pollution. Inline XBRL wraps financial facts in ix:header, ix:references, and ix:nonFraction tags that carry no display content but add noise.
  • Entity encoding chaos. &nbsp;, &#160;, &ldquo;, &rdquo;, &mdash;, &ndash;, &bull; — each needs correct decoding, and different filing tools use different entity styles for the same characters.
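The cleanup passes above can be condensed into a minimal Python illustration (the real pipeline is TypeScript; `strip_tags` and `find_item_1c` are hypothetical helper names, not the project's actual functions):

```python
import html
import re

def strip_tags(raw: str) -> str:
    """Illustrative tag stripper combining the cleanup passes described above."""
    # Drop inline-XBRL metadata blocks that carry no display content.
    raw = re.sub(r"<ix:(header|references)[^>]*>.*?</ix:\1>", " ", raw, flags=re.S)
    # Insert a space only at BLOCK element boundaries; adjacent inline
    # elements (<span>It</span><span>em 2</span>) are joined without one.
    raw = re.sub(r"</(p|div|td|tr|li|h[1-6])>", " ", raw)
    text = re.sub(r"<[^>]+>", "", raw)
    text = html.unescape(text)        # &nbsp; &#160; &ldquo; &mdash; ... -> characters
    text = text.replace("\xa0", " ")  # non-breaking spaces to plain spaces
    # Re-split camelCase joins left by PDF converters: "sentence.Next" -> "sentence. Next"
    # (a sketch; the real passes must also avoid breaking abbreviations like "U.S.")
    text = re.sub(r"([.!?])([A-Z])", r"\1 \2", text)
    return re.sub(r"\s+", " ", text).strip()

def find_item_1c(text: str) -> int:
    """Use the LAST 'Item 1C' match so the Table of Contents entry is skipped."""
    matches = list(re.finditer(r"Item\s*1C", text, flags=re.I))
    return matches[-1].start() if matches else -1
```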

Paragraph Segmentation

After extracting clean section text, splitting into paragraphs had its own challenges:

  • Bullet list merging. Disclosures frequently use bullet lists. Bullets need to be merged with their intro sentence; a standalone "• vulnerability scanning" is meaningless.
  • Continuation line detection. Sentences split across HTML block elements need rejoining.
  • Length boundaries. Under 20 words → likely a header (filtered). Over 500 words → split at sentence boundaries.
  • Table-based bullet lists and the cascade failure. Some generators render bullet lists as HTML tables with non-standard bullet characters. Since stripHtml() doesn't recognize &#183; as a bullet marker, the merge logic never fires, causing multi-element run-on paragraphs. Found 2,210 paragraphs affected.
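A toy version of the segmentation rules above (bullet merging plus the under-20-word header filter; the over-500-word sentence-splitting pass is omitted). Names and the bullet-character set are illustrative:

```python
import re

BULLET = re.compile(r"^\s*(?:[•·▪–-]|&#183;)\s*")

def segment(blocks: list[str]) -> list[str]:
    """Toy segmenter illustrating the merge/filter rules (the real
    pipeline is TypeScript; this is a hypothetical sketch)."""
    paras: list[str] = []
    for block in blocks:
        block = block.strip()
        if not block:
            continue
        if BULLET.match(block) and paras:
            # Bullets are merged into their intro sentence.
            paras[-1] += " " + BULLET.sub("", block)
        elif len(block.split()) < 20:
            continue  # likely a header or page artifact
        else:
            paras.append(block)
    return paras
```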

8-K Extraction

Roadblock: EDGAR full-text search misses filings. The EFTS keyword search doesn't reliably return all cybersecurity 8-Ks. Post-May 2024, companies moved non-material disclosures from Item 1.05 to Items 8.01 or 7.01.

Resolution: Built scan-8k-items.py to scan the SEC's bulk submissions.zip deterministically — a gap-free scan of every 8-K with cybersecurity content. Result: 207 cybersecurity incident 8-K filings identified.
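A sketch of the deterministic scan, assuming the parallel-array layout of the SEC's bulk submissions JSON (`filings.recent` with `form`, `items`, `accessionNumber` fields). Catching the post-May-2024 Item 8.01/7.01 disclosures additionally requires scanning filing text, which this fragment omits:

```python
def cyber_8k_accessions(submission: dict) -> list[str]:
    """Return accession numbers of 8-K filings whose item list includes
    Item 1.05. Field names follow the SEC bulk submissions.zip JSON
    layout (parallel arrays under filings.recent)."""
    recent = submission["filings"]["recent"]
    hits = []
    for form, items, acc in zip(recent["form"], recent["items"], recent["accessionNumber"]):
        # 'items' is a comma-separated string such as "1.05,9.01".
        if form == "8-K" and "1.05" in (items or ""):
            hits.append(acc)
    return hits
```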

Corpus Statistics

  • 72,045 paragraphs from ~9,000 filings (FY2023 + FY2024 + early FY2025)
  • All from 10-K Item 1C; paragraphs from the 207 incident 8-K filings extracted separately
  • Median ~7 paragraphs per filing
  • 49,795 paragraphs annotated (after filtering to complete filing metadata)

Phase 3: Data Quality Audit and Corpus Remediation

The Discovery

While preparing the DAPT corpus, we discovered two systematic issues silently corrupting the data:

  1. Orphan words. HTML source wraps text at fixed column width. When a <span> tag consumes most of a line, only the first word fits before the source newline. 4.7% of all paragraphs affected.
  2. Inlined section headings. 22% of paragraphs had section titles prepended to body text — a near-perfect predictor of content_category that creates shortcut learning risk.

Generator Investigation

Identified 14 distinct filing generators covering 99.99% of 14,759 HTML files. The worst generator (EFiling/EDGAR Agent) accounted for 13.5% of filings but 36.8% orphan word rate (8x corpus average). Clean generators (Workiva, Donnelley, Inline XBRL) all had <1% rates. Full reference: docs/EDGAR-FILING-GENERATORS.md.

Six Surgical Patches

All fixes follow the principle: paragraphs-clean.jsonl is frozen. All fixes go through .patched.jsonl files linked by paragraph UUID.

| Patch | Method | Paragraphs |
|---|---|---|
| 1-2. Orphan word restoration | HTML lookback extraction | 2,233 |
| 3-6. Heading strip (4 passes) | Pattern match + HTML-confirmed | 8,411 |

Quality Tier System

| Tier | Criteria | Count | % |
|---|---|---|---|
| clean | No detected issues | 58,165 | 80.7% |
| headed | Had inlined heading (now stripped) | 7,402 | 10.3% |
| degraded | Embedded bullets, invisible merges, fragments, truncations | 4,331 | 6.0% |
| minor | Had orphan word (now fixed) | 2,147 | 3.0% |

Degraded paragraphs downweighted 0.5x during fine-tuning.


Phase 4: Pre-Training — DAPT + TAPT

DAPT: Domain-Adaptive Pre-Training

Chose our own ~9,000 cached filings over PleIAs/SEC (373K on HuggingFace):

  • Recency > volume — Item 1C didn't exist before FY2023
  • Diminishing returns past 250M tokens (Ponnock 2025)
  • We control cleaning quality
  • Feasible on a single RTX 3090

Corpus: 14,568 docs, ~1.056B tokens. Subsampled to newest 500M tokens.

Key optimizations: Flash Attention 2 (47s→27s/step), torch.compile (halved activation memory), corpus subsampling (29h→13.5h).
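The optimizations might be wired up roughly like this (model id, batch size, and all argument values are illustrative, not the project's actual config):

```python
import torch
from transformers import AutoModelForMaskedLM, TrainingArguments

# Sketch of the DAPT speed/memory knobs described above.
model = AutoModelForMaskedLM.from_pretrained(
    "answerdotai/ModernBERT-large",           # illustrative checkpoint id
    attn_implementation="flash_attention_2",  # the 47s -> 27s/step change
    torch_dtype=torch.bfloat16,
)
model = torch.compile(model)  # activation-memory reduction

args = TrainingArguments(
    output_dir="checkpoints/dapt/modernbert-large",
    per_device_train_batch_size=8,  # illustrative
    bf16=True,
    num_train_epochs=1,
)
```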

Results: Eval loss 0.7250, perplexity 1.65. 1 epoch, ~14.5h on RTX 3090. Checkpoint: checkpoints/dapt/modernbert-large/final/.

TAPT: Task-Adaptive Pre-Training

72K Item 1C paragraphs (~10M tokens). 5 epochs with whole-word masking at seq_len=512.

Bugs fought: four bugs in the transformers whole-word-masking implementation for BPE tokenizers, plus a Python 3.14 incompatibility. Resolved by building a custom WholeWordMaskCollator from scratch.
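The core idea of the collator — group subtokens into words, then mask whole groups together — can be sketched as follows. This toy uses the WordPiece "##" continuation convention for readability; handling BPE-style continuation markers instead is exactly where the stock implementation broke:

```python
import random

def whole_word_mask(tokens: list[str], mask_prob: float = 0.30, seed: int = 0) -> list[str]:
    """Toy whole-word masking: continuation pieces (leading '##') are
    grouped with their head token, and whole groups are masked together.
    Hypothetical stand-in for the custom WholeWordMaskCollator."""
    rng = random.Random(seed)
    # Group subtoken indices into words.
    words: list[list[int]] = []
    for i, tok in enumerate(tokens):
        if tok.startswith("##") and words:
            words[-1].append(i)
        else:
            words.append([i])
    out = list(tokens)
    for group in words:
        if rng.random() < mask_prob:
            for i in group:
                out[i] = "[MASK]"  # every piece of the word is masked
    return out
```

The invariant that matters: a continuation piece is never masked independently of its head token.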

Results: Loss 1.46→1.08, eval loss 1.0754, perplexity 2.11. 50 minutes on RTX 3090. Checkpoint: checkpoints/tapt/modernbert-large/final/.

Training Pipeline

ModernBERT-large (base, 395M params)
    → DAPT on 9K full 10-K filings (~500M tokens, ~14.5h) → SEC-ModernBERT-large
    → TAPT on 72K Item 1C paragraphs (~10M tokens, ~50min) → SEC-cyBERT-large
    → Fine-tune on labeled data with dual classification heads → Final classifier

Phase 5: Truncated Filing Exclusion

Identified 72 filings (~0.8%) where section boundary detection cut the text off mid-sentence. These are excluded from training splits: any filing whose last paragraph does not end in terminal punctuation is filtered.
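A minimal version of the terminal-punctuation filter (the exact pattern is illustrative):

```python
import re

# Terminal punctuation, optionally followed by a closing quote/bracket.
TERMINAL = re.compile(r"""[.!?]["')\]]?\s*$""")

def is_truncated(last_paragraph: str) -> bool:
    """Flag a filing whose final paragraph does not end in terminal punctuation."""
    return not TERMINAL.search(last_paragraph.rstrip())
```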



Phase 6: The v2 Reboot — Why We Started Over

What v1 Taught Us

The v1 pipeline produced 150K Stage 1 annotations, a 10-model benchmark, human labels from 6 annotators, and extensive gold adjudication. It worked — but evaluation revealed structural problems that no amount of prompt iteration could fix:

  1. Specificity Level 2 was too narrow. Our codebook defined Level 2 as "names a recognized standard" — but the professor's construct says "references industry." Domain-specific practices (penetration testing, vulnerability scanning, SIEM) were classified as Level 1. Level 2 ended up at 3.9% of the holdout (47 samples) — too few for reliable per-class F1.

  2. Level 4 required 2+ QV facts. The construct lists types of qualifying facts, not a minimum count. The artificial threshold created a narrow class and forced annotators into a counting exercise.

  3. The BG/MR/RMP triangle was patched, not fixed. Six decision rules and ten borderline cases accumulated as patches on unchanged definitions. Models processed increasingly complex instructions with diminishing returns.

  4. The holdout was adversarial by design. Stratified to over-sample confusion-axis paragraphs — great for stress-testing the codebook, terrible for evaluation. Combined with narrow Level 2, this structurally depressed F1.

  5. Human specificity agreement was poor. Krippendorff's α = 0.546 on specificity (target: 0.67). The narrow Level 2 definition made it hard for anyone to agree.

The Decision

Rather than continue patching, we decided to:

  • Revise the codebook with systemic changes (broaden Level 2, loosen Level 4, reframe category rules)
  • Take a new random stratified holdout (equal per category class, not overindexed on hard cases)
  • Re-run Stage 1 with the improved codebook/prompt
  • Have humans re-label the new holdout
  • Re-run the benchmark panel
  • Then train

The v1 data pipeline, corpus, DAPT checkpoint, and TAPT checkpoint are all unchanged and carried forward. Only the labeling and evaluation are redone.

What Changed in v2

Codebook (LABELING-CODEBOOK.md):

  • Level 2 broadened from "names a standard" to "uses cybersecurity domain terminology" (the ERM test)
  • Level 4 threshold lowered from 2+ to 1+ QV-eligible fact (the external verifiability test)
  • Category primary test changed to "What question does this paragraph answer?"
  • MR headline changed from "who a specific person is" to "how management is organized to handle cybersecurity"
  • Person-removal test reframed as confirmation tool, not primary rule
  • Materiality rules cleaned up (assessment vs. speculation distinction became a clean rule, not a ruling)
  • IS/NOT lists restructured for new Level 2 boundary
  • Codebook + Ethos split: rules in LABELING-CODEBOOK.md, reasoning in CODEBOOK-ETHOS.md

Holdout:

  • Random stratified sample: ~170 per category class × 7 ≈ 1,190
  • Secondary constraint: minimum ~100 per specificity level
  • NOT overindexed on confusion-axis cases
  • Separate ~200-paragraph dev set for prompt iteration (excluded from holdout)

Phase 7: Holdout Selection & Prompt Engineering

Holdout Sampling

Used v1 Stage 1 consensus labels (50,003 paragraphs, 3-model majority vote under v2.5 prompt) as a sampling guide. Applied heuristic v2 specificity prediction: keyword scan for domain terminology to identify v1 Level 1 paragraphs that would become Level 2 under v2 rules, and QV indicator scan for Level 3→4 promotions.

Allocation: 185 per non-ID category, 90 for Incident Disclosure (only 166 available in the annotated corpus) = 1,200 exact. Max 2 paragraphs per company per category stratum to prevent boilerplate clustering. All specificity floors met (≥100 per level). 1,042 unique companies represented.
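The per-company cap can be sketched as rejection sampling within each category stratum (hypothetical schema: each paragraph dict carries a `company` key):

```python
import random
from collections import defaultdict

def sample_stratum(paras: list[dict], n: int, max_per_company: int = 2, seed: int = 0) -> list[dict]:
    """Random sample within one category stratum with a per-company cap,
    mirroring the anti-boilerplate-clustering rule described above."""
    rng = random.Random(seed)
    pool = paras[:]
    rng.shuffle(pool)
    picked, per_company = [], defaultdict(int)
    for p in pool:
        if per_company[p["company"]] < max_per_company:
            picked.append(p)
            per_company[p["company"]] += 1
        if len(picked) == n:
            break
    return picked
```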

The v1 holdout had been intentionally oversampled on confusion-axis cases (split votes between MR/RMP, N/O/SI, etc.) — useful for codebook development but structurally hostile to F1. The v2 holdout is random within each category stratum: hard cases appear at their natural frequency, not overweighted.

Prompt Iteration: From List-Matching to Principle-Based Reasoning

The v2 prompt underwent 5 iterations (v4.0→v4.4) tested against a 200-paragraph dev batch from the holdout with GPT-5.4 (~$6 total pilot cost).

v4.0 (baseline rewrite): Translated the v2 codebook into the system prompt. Category section used the "what question?" test — worked well at 87% agreement with v1 consensus. Specificity section used exhaustive IS/NOT lists, matching the v1 approach. Result: Level 2 grew from 6% to 16% (domain terminology broadening) and Level 4 grew from 5% to 22% (1+ QV rule). But audit revealed the model was pattern-matching against the lists rather than reasoning about the underlying principles. Two errors: "Vice President, Information Systems and Technology" and "Senior Vice President of Information Technology" classified as Level 1 because neither exactly matched the IS list entry "VP of IT/Security."

The list-matching problem: The category section — built around reasoning principles ("what question does this paragraph answer?", person-removal test, materiality linguistic test) — achieved 87% agreement. The specificity section — built around exhaustive checklists — caught listed items but missed unlisted items that satisfied the same principle. The model was executing a lookup table, not applying the ERM test.

v4.1 (principle-first restructure): Restructured all three specificity levels to lead with the principle and compress lists to boundary-case disambiguation only:

  • Level 2: "Apply the ERM test — would a non-security ERM professional use this language?" with illustrative examples
  • Level 3: "Would this detail help narrow down which company wrote it?" with the VP-or-above bright line
  • Level 4: "Could someone outside the company verify this?" with boundary cases

Result: +12 Level 1→2 catches (model reasoning about vocabulary level, not scanning a list), VP/SVP titles fixed. But Level 4 regressed — the model started reasoning about whether QV facts were "relevant to the paragraph's main point" instead of treating specificity as a presence check.

The independence insight: Category and specificity are independent dimensions. Category captures what the paragraph is ABOUT. Specificity captures how informative it is AS A WHOLE. A paragraph classified as RMP that mentions a CISO's CISSP in a subordinate clause is RMP at Level 4 — the certification is verifiable regardless of whether it serves the category. The model was conflating "this fact is secondary to the paragraph's purpose" with "this fact doesn't count for specificity." This is wrong: specificity is a presence check on the entire paragraph, not a relevance judgment.

This also raised a methodological question: SHOULD specificity be category-conditional? The steelman for category-conditional specificity: "Board Governance at Level 4" should mean the governance disclosure is highly specific, not that a tangential financial fact inflated the score. The steelman against: SEC paragraphs interleave topics, you can't cleanly decompose facts into category buckets, and conditional specificity introduces cascading errors (wrong category → wrong specificity). For this project, paragraph-level specificity is the right choice — it matches the construct, is simpler to annotate, and produces higher agreement. Acknowledged as a limitation for the paper.

v4.2–v4.4 (surgical fixes): Added explicit presence-check framing, hard vs. soft number boundary ("12 professionals" → QV, "approximately 20 departments" → not QV), and the "various certifications including CISSP → YES" rule (named certifications are QV regardless of surrounding hedge words). Final prompt (v4.4) recovers Level 4 to within 1 of baseline while retaining all principle-based gains at Levels 2 and 3.

v4.4 pilot results (200 paragraphs, GPT-5.4):

| Specificity | v4.0 (list) | v4.4 (principle) | Change |
|---|---|---|---|
| L1 | 81 (40.5%) | 65 (32.5%) | -16 |
| L2 | 32 (16.0%) | 41 (20.5%) | +9 |
| L3 | 43 (21.5%) | 51 (25.5%) | +8 |
| L4 | 44 (22.0%) | 43 (21.5%) | -1 |

Category: 95.5% agreement with v1 consensus. Specificity: 84.5% agreement (expected divergence given broadened L2 and 1+ QV rule). The 200-paragraph dev batch is now contaminated by prompt examples that target specific cases in it — further iteration requires the unseen 1,000 paragraphs from the full holdout.

Full Holdout Validation & v4.5

Running v4.4 on the full 1,200 holdout ($5.70) revealed three problems not visible in the 200-paragraph pilot:

Problem 1: 34.5% medium-confidence specificity. The model was uncertain on 414 of 1,200 paragraphs, concentrated at the L1/L2 boundary (59% of L2 calls were medium-confidence) and L2/L3 boundary (51% of L3). Third-Party Risk was worst: 74% medium-confidence on specificity. The model's reasoning showed it listing zero specific facts but still assigning L2 based on vibes — the paragraph "felt" domain-adapted because the topic was cybersecurity, even when the vocabulary was generic ERM language.

Problem 2: SI materiality assertions falsely promoted to L4. Paragraphs like "As of December 28, 2024, we have not had any material cybersecurity incidents" were classified L4 because a specific date anchored the claim. But negative self-assertions are not externally verifiable — you cannot independently confirm the absence of something. These are Strategy Integration at Level 1, not Level 4.

Problem 3: specific_facts discarded from stored output. The toLabelOutput() function stripped the specific_facts array before writing to disk. The model was generating facts during inference (the schema required it), but we couldn't verify the mechanical bridge between facts and specificity level because the evidence was thrown away.

v4.5 fixes:

  1. Mechanical bridge enforced. Restructured the specificity protocol as a scan-tag-max pipeline: scan for facts, tag each as [DOMAIN]/[FIRM]/[VERIFIABLE], assign specificity = max(tags). Added explicit rule: "if specific_facts is empty, specificity MUST be Generic Boilerplate." Result: 100% consistency — L1 always empty, L2+ always populated with supporting facts. The bridge prevents the model from overriding its own fact-finding with holistic vibes.

  2. Expertise vs. topic clarification for L1/L2. Added: "The ERM test evaluates whether the paragraph demonstrates cybersecurity EXPERTISE, not whether it discusses a cybersecurity TOPIC. Every paragraph in these filings discusses cybersecurity — that's what the filing requires. L1 means generic oversight language any business professional could write. L2 means the writer shows they understand HOW cybersecurity works." With TP-specific examples: "We conduct vendor security assessments" → L1 (generic process description); "We review vendors' SOC 2 attestations and require encryption at rest" → L2 (specific security evidence requiring domain knowledge).

  3. SI negative assertions excluded from L4. Added explicit NOT-verifiable examples: "We have not experienced any material cybersecurity incidents" → NOT QV (cannot externally verify absence); "In 2023, we did not experience a material incident" → NOT QV (a year does not make a negative assertion verifiable). Also added lower bounds as verifiable: "more than 20 years" → YES (checkable threshold, unlike "approximately 20" which is hedged both directions).

  4. Fact storage. Updated toLabelOutput() and LabelOutput schema to preserve specific_facts in stored output. Added domain_term to the FactType enum for L2-level vocabulary evidence.
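The scan-tag-max bridge reduces to a one-line rule once facts are tagged (tag names follow the prompt's [DOMAIN]/[FIRM]/[VERIFIABLE] scheme; this is a sketch of the rule, not project code):

```python
# Mechanical bridge: specificity = max over fact tags; no facts -> L1.
LEVEL = {"DOMAIN": 2, "FIRM": 3, "VERIFIABLE": 4}

def specificity_from_facts(tags: list[str]) -> int:
    """Scan-tag-max rule from the v4.5 prompt: if specific_facts is empty,
    the paragraph MUST be Generic Boilerplate (L1); otherwise the level is
    the maximum level implied by any tag."""
    return max((LEVEL[t] for t in tags), default=1)
```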

v4.5 results (1,200 paragraphs, GPT-5.4, $6.88):

| Metric | v4.4 | v4.5 |
|---|---|---|
| L1 | 546 (45.5%) | 618 (51.5%) |
| L2 | 229 (19.1%) | 168 (14.0%) |
| L3 | 225 (18.8%) | 207 (17.2%) |
| L4 | 200 (16.7%) | 207 (17.2%) |
| Medium confidence | 414 (34.5%) | 211 (17.6%) |
| Bridge consistency | unknown | 100% |
| SI false L4s | ~6 | 0 |
| Category stability (v4.4→v4.5) | | 96.8% |

L2 at 14% is below the 15% holdout target, but the holdout oversamples TP (14.4% vs 5% in corpus) and TP is where 55 of 61 L2→L1 drops concentrated. On the full corpus (46% RMP, 5% TP), L2 should be ~15-17%. The TP drops are correct — verified by inspecting the facts: survivors list SOC reports, vulnerability scans, penetration testing; drops use only generic vendor management language ("contractual requirements", "vendor due diligence").

Key architectural insight: With reasoning models, structured output fields are results, not reasoning steps. The model decides everything in reasoning tokens before generating JSON. The mechanical bridge works by influencing the reasoning process through prompt text, not through schema field ordering. The specific_facts field captures the model's evidence for our debugging, but the actual bridge enforcement happens in the model's internal reasoning guided by the prompt's explicit consistency rules.

v2 Holdout Benchmark (10 models, 8 providers)

With v4.5 locked, we ran the full BENCHMARK_MODELS panel on the 1,200-paragraph v2 holdout to evaluate model quality before committing to the ~$100 Stage 1 re-run. GPT-5.4 (v4.5) is the reference — our best-validated model on the holdout, the one whose prompt iterations we hand-verified.

Full benchmark results (vs GPT-5.4 reference):

| Model | N | Cat% | Cat κ | Spec% | Spec κw | Both% | 50K proj | Reasoning tok/para |
|---|---|---|---|---|---|---|---|---|
| Grok 4.1 Fast | 1200 | 93.7% | 0.925 | 91.6% | 0.929 | 86.1% | $32 | 584 |
| Opus 4.6 (prompt-only) | 1184 | 93.7% | 0.925 | 90.1% | 0.910 | 85.2% | $0 (sub) | |
| Gemini 3.1 Pro | 1200 | 93.8% | 0.926 | 89.4% | 0.906 | 84.2% | $735 | 502 |
| GLM-5 | 1200 | 92.8% | 0.915 | 88.3% | 0.898 | 82.8% | $364 | 1421 |
| Kimi K2.5 | 1200 | 92.6% | 0.912 | 88.1% | 0.894 | 82.8% | $353 | 2832 |
| Gemini 3.1 Flash Lite | 1200 | 91.8% | 0.904 | 83.0% | 0.844 | 76.5% | $79 | 363 |
| MIMO v2 Flash | 794 | 92.7% | 0.911 | 85.3%* | 0.662 | 79.7% | $26 | 1423 |
| MIMO v2 Pro | 980 | 94.0% | | 90.7% | | 85.9% | $274 | 1439 |
| MiniMax M2.7 | 1198 | 87.6% | 0.855 | 76.5% | 0.756 | 68.5% | $70 | 615 |

*MIMO Flash spec% is misleading — 91.1% of its labels are L1 (collapsed distribution). κw = 0.662 reflects this.

Pilot candidates (200-paragraph tests):

| Model | Cat% | Spec% | Both% | 50K proj | Verdict |
|---|---|---|---|---|---|
| Qwen3-235B MoE | 89.9% | 62.6% | 56.1% | $18 | Dead — 0 reasoning tokens, 34% L4 |
| Seed 1.6 Flash | 87.5% | 74.7% | 67.7% | $24 | Weak — below Flash Lite |
| Qwen3.5 Flash | 92.9% | n/a | n/a | $70 | Dead — 100% L1 collapse |

Key findings from the benchmark:

  1. Clear quality tiers. Grok Fast stands alone as the best affordable model (86.1% both-match, $32/50K). There's a 9pp gap to the next affordable option (Flash Lite at 76.5%, $79). Everything in between costs $350+.

  2. MIMO Flash specificity is broken. Category agreement is fine (92.7%) but specificity collapses to 91.1% L1 — it simply doesn't differentiate specificity levels. The v1 Stage 1 panel included MIMO Flash; this means v1 specificity consensus was partially degraded by one broken voter.

  3. Opus performs better without the codebook. We ran Opus via Agent SDK in two configurations: (a) full v2 codebook + operational prompt (37.7KB system prompt), (b) operational prompt only (16.2KB). Prompt-only was significantly better: 85.2% vs 82.4% both-match, 49.2% vs 40.5% facts coverage. The codebook was actively diluting the operational prompt's bridge instruction. This is a counterintuitive but important finding for the paper — more context can hurt performance when the operational prompt has been carefully engineered.

  4. Reasoning tokens correlate with quality, but not linearly. Kimi K2.5 reasons the most (2832 tokens/para) but ranks 5th. Grok reasons modestly (584 tokens) and ranks 1st. The quality seems to depend more on the model's internal architecture than on raw reasoning volume. Models with 0 reasoning tokens (Qwen3-235B) or with reasoning that doesn't engage with specificity (Qwen3.5 Flash — 4381 tokens, all L1) are categorically broken for this task.

  5. No viable cheap third model exists. We searched OpenRouter exhaustively for models under $50/50K that support structured output and reasoning. Every candidate (Qwen, ByteDance Seed, etc.) performed below Flash Lite, which was already the weakest panel member.

  6. Category agreement is high across all non-broken models (>91% vs reference, κ > 0.90). The hard problem is specificity, where the mechanical bridge helps good models but can't save models that don't reason about it properly.

Model Selection: Grok ×3 Self-Consistency

The budget constraint ($175 remaining for Stage 1 + Stage 2 + everything else) eliminated all multi-model panels except Grok + Flash Lite ($111). But Flash Lite's 76.5% both-match and inflated L2 distribution (19.1% vs 14% reference) made it a weak second voter.

We investigated whether running Grok multiple times could produce independent signals. The temperature question turned out to be irrelevant: reasoning models have internal stochastic chain-of-thought that produces different outputs on repeated identical calls regardless of temperature settings. Most providers silently ignore temperature: 0 for reasoning models (OpenAI explicitly rejects it; others drop it). Our temperature: 0 was cosmetic the entire time.

Empirical verification: We re-ran 47 holdout paragraphs through Grok 4.1 Fast with identical inputs. Results:

  • Category: 47/47 identical (100% deterministic)
  • Specificity: 43/47 identical (91.5%), 4 diverged
  • Divergence: 8.5% of paragraphs got different specificity labels
  • All divergence was on specificity (L1↔L2, L1→L3, L3→L4) — exactly the ambiguous boundary cases where multiple runs provide real tiebreaking value

This 8.5% per-pair divergence rate means:

  • ~90% of paragraphs will be 3/3 unanimous → strong consensus
  • ~10% will be 2-1 split → majority vote resolves boundary cases
  • Category was unanimous on every pilot pair → consensus category quality ≈ single-run Grok quality (93.7%, κ=0.925)

Self-consistency is a well-established pattern (Wang et al. 2022). The weakness vs multi-model consensus is shared systematic biases — all three runs make the same systematic errors. But with κ=0.925 on category and κw=0.929 on specificity, Grok's systematic errors are rare. The 8.5% stochastic variation is concentrated exactly where we want it: ambiguous specificity boundaries.

Cost: $96 for Grok ×3 (3 × $32 through OpenRouter). Leaves $80 for Stage 2 judge and any reruns. An alternative — xAI's Batch API at 50% off — would reduce this to $48, but requires bypassing OpenRouter.

Stage 1 Results: Grok ×3 Self-Consistency (72,045 paragraphs)

We ran 3 independent Grok 4.1 Fast passes over the full 72,045-paragraph corpus at concurrency 200. Each run completed in ~33 minutes. Total cost: $129.75 ($43.12–$43.62 per run).

Cross-run agreement:

| Dimension | Unanimous (3/3) | Majority (2/3) | All disagree |
|---|---|---|---|
| Category | 68,394 (94.9%) | 3,583 (5.0%) | 68 (0.09%) |
| Specificity | 65,780 (91.3%) | 6,120 (8.5%) | 145 (0.20%) |

Category is near-deterministic — 94.9% unanimous, and the 5% majority cases are concentrated at the BG↔MR and MR↔RMP boundaries (exactly the confusion axes identified during codebook development). Specificity shows the expected stochastic variation at 8.5% majority-only, matching the 8.5% divergence rate observed in the 47-paragraph pilot.

Consensus resolution:

  • 62,510 (86.8%) — both unanimous, direct consensus
  • 9,323 (12.9%) — majority vote on at least one dimension
  • 212 (0.3%) — no majority on at least one dimension, resolved by GPT-5.4 judge

The 212 tiebreaker paragraphs were run through GPT-5.4 with the full judge prompt (disagreement-aware disambiguation rules, shuffled prior annotations). GPT-5.4 agreed with one of the 3 Grok labels on 100% of paragraphs — never inventing a novel answer. This validates that the Grok runs produce reasonable labels and the disagreements are genuine boundary cases, not model failures. Judge cost: $5.76.

Final consensus distribution:

| Category | Count | % |
|---|---|---|
| RMP | 31,201 | 43.3% |
| BG | 13,876 | 19.3% |
| MR | 10,591 | 14.7% |
| SI | 7,470 | 10.4% |
| N/O | 4,576 | 6.4% |
| TP | 4,094 | 5.7% |
| ID | 237 | 0.3% |

| Specificity | Count | % |
|---|---|---|
| L1: Generic Boilerplate | 29,593 | 41.1% |
| L2: Domain-Adapted | 16,344 | 22.7% |
| L3: Firm-Specific | 17,911 | 24.9% |
| L4: Quantified-Verifiable | 8,197 | 11.4% |

v1→v2 category shifts: BG rose from 16.0%→19.3% and N/O from 5.0%→6.4%, likely driven by the 22,250 paragraphs in the full corpus that v1 never annotated. RMP dropped from 45.8%→43.3%, partly because the v2 codebook's sharper BG/MR/RMP boundaries reclassified some borderline paragraphs.

Specificity is well-distributed. L2 at 22.7% (above the 15% holdout target — the full corpus has more domain-rich paragraphs than the stratified holdout). L3 at 24.9% and L4 at 11.4% reflect the v2 codebook's tightened verifiability standards.

Category × specificity interaction (see figures/stage1-category-specificity-heatmap.png): MR is 87% L3/L4 (people have names, titles, and credentials). SI is 92% L1 (materiality boilerplate with no specific facts). ID is 86% L4 (incidents have dates, named threat actors, forensic firms). These patterns are exactly what the codebook predicts and match the holdout validation.

Specificity boundary analysis: The 6,265 paragraphs where runs diverged on specificity are concentrated at adjacent levels: L1↔L2 (2,485), L1↔L3 (1,423), L2↔L3 (1,160), L3↔L4 (707). Cross-level jumps (L1↔L4, L2↔L4) are rare (~280 total). This confirms the self-consistency mechanism is working as intended — it provides tiebreaking signal exactly at the ambiguous boundaries where different reasoning paths legitimately land on different answers.

Cost of the Reboot (final)

| Item | Estimated Cost | Actual Cost |
|---|---|---|
| Prompt iteration (v4.0–v4.5, ~8 rounds) | ~$10 | $19.59 |
| v2 holdout benchmark (10 models + 3 pilots) | ~$45 | $45.47 |
| Stage 1 re-run (Grok ×3, 72K paragraphs) | ~$96 | $129.75 |
| Stage 2 judge (212 tiebreaker paragraphs) | ~$20-40 | $5.76 |
| Human re-labeling | $0 (team labor) | pending |
| Total additional API | ~$175-185 | $200.57 |

Against the ~$120 already spent on v1 API calls (not recovered). Total project API cost: $320.57 of $360 budget. Remaining: $39.43 — sufficient for any reruns or additional analysis.

The cost overshoot ($200 vs $175 estimate) is entirely from annotating 72K paragraphs instead of the estimated 50K. The per-paragraph cost was actually lower than estimated ($0.60/paragraph for the full 3-run self-consistency + judge pipeline vs $0.64 estimated).


Phase 8: Fine-Tuning — From 0.52 to 0.94 Specificity F1

Training Data Assembly

Built python/src/finetune/data.py to merge Stage 1 consensus labels (72,045 paragraphs) with paragraph text, quality tiers, and specificity confidence metadata.

Exclusions:

  • 1,200 holdout paragraphs (reserved for evaluation)
  • 614 individually truncated paragraphs (initial plan was to exclude 72 entire filings, but paragraph-level filtering is more targeted and preserves more data)

Sample weighting: clean/headed/minor = 1.0×, degraded = 0.5× (4,331 paragraphs at half weight).

Result: 70,231 training paragraphs, stratified 90/10 into 63,214 train / 7,024 val.
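A sketch of a stratified split on the label pair (whether stratification used the joint category × specificity label is an assumption — the text only states "stratified 90/10"; the schema keys are illustrative):

```python
import random
from collections import defaultdict

def stratified_split(rows: list[dict], val_frac: float = 0.10, seed: int = 0):
    """90/10 split stratified so each (category, specificity) label pair
    appears in train and val at the same proportion."""
    rng = random.Random(seed)
    by_label = defaultdict(list)
    for r in rows:
        by_label[(r["category"], r["specificity"])].append(r)
    train, val = [], []
    for group in by_label.values():
        rng.shuffle(group)
        k = round(len(group) * val_frac)
        val.extend(group[:k])
        train.extend(group[k:])
    return train, val
```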

Architecture: Dual-Head ModernBERT

The model architecture: ModernBERT-large backbone (395M params) → pooled representation → dropout → two independent classification heads:

  1. Category head: Linear(1024, 7) with weighted cross-entropy loss. Standard multi-class classification.
  2. Specificity head: Ordinal classification. The specificity dimension (L1→L2→L3→L4) has natural ordering — predicting L1 when truth is L4 is worse than predicting L3. This ordering should be reflected in the model architecture and loss function.

The initial architecture used CORAL (Cao et al. 2020) for the specificity head: a single shared weight vector with learned bias offsets for each ordinal threshold. This is the standard approach for ordinal regression.
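To make the structure concrete, here is a minimal numpy sketch of a CORAL-style head (toy dimensions, untrained random weights; an illustration, not the training code):

```python
import numpy as np

rng = np.random.default_rng(0)

x = rng.normal(size=(4, 8))     # toy pooled representations (batch 4, dim 8)

# CORAL: ONE shared weight vector, one learned bias per ordinal threshold.
w = rng.normal(size=8)
b = np.array([1.0, 0.0, -1.0])  # b_k for thresholds L2+, L3+, L4

logits = x @ w[:, None] + b              # logit_k = w·x + b_k, shape (4, 3)
probs = 1 / (1 + np.exp(-logits))        # P(level > k)
levels = 1 + (probs > 0.5).sum(axis=1)   # predicted level in 1..4
```

Because every threshold shares the same direction w, the per-row threshold probabilities are monotone by construction — the head cannot use different evidence for different transitions.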

Ablation Grid: 12 Configurations × 1 Epoch

Ran a systematic ablation over three axes:

  • Checkpoint: base ModernBERT-large vs DAPT checkpoint vs TAPT checkpoint
  • Class weighting: inverse-frequency weights vs uniform
  • Loss type: cross-entropy vs focal loss (γ=2.0)

Results (1 epoch each, ~15 min/run, ~3 hours total):

| Rank | Configuration | Combined F1 | Cat F1 | Spec F1 |
|---|---|---|---|---|
| 1 | base + weighted + CE | 0.685 | 0.900 | 0.469 |
| 2 | DAPT + unweighted + focal | 0.684 | 0.892 | 0.476 |
| 3 | DAPT + weighted + CE | 0.681 | 0.896 | 0.466 |
| 4 | base + unweighted + CE | 0.680 | 0.892 | 0.467 |
| 5 | TAPT + weighted + CE | 0.675 | 0.896 | 0.455 |
| ... | | | | |
| 12 | TAPT + weighted + focal | 0.649 | 0.849 | 0.449 |

Finding 1: DAPT/TAPT pre-training did not help. Base ModernBERT-large outperformed both domain-adapted checkpoints. This is a noteworthy null result. ModernBERT-large was already pre-trained on a massive, diverse web corpus that likely includes SEC filings. Additional narrow-domain pre-training appears to cause mild catastrophic forgetting — the model loses general linguistic features while gaining domain-specific ones that the fine-tuning task doesn't benefit from. TAPT was consistently worst, suggesting the small corpus (72K paragraphs × 5 epochs at 30% masking) caused overfitting during MLM pre-training.

Finding 2: Weighted CE is the best loss combination. Class weighting helps category F1 significantly (0.900 vs 0.892 for base). Focal loss helps specificity slightly but hurts category. Weighted + focal = too much correction (consistently bottom tier) — both mechanisms independently reduce majority-class influence, and combining them over-corrects.
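The two loss variants and the inverse-frequency weighting can be sketched in numpy (toy values, not the training implementation):

```python
import numpy as np

def weighted_ce(probs, y, class_weights):
    """Weighted cross-entropy: -w_y * log p_y per sample."""
    p_y = probs[np.arange(len(y)), y]
    return -class_weights[y] * np.log(p_y)

def focal(probs, y, gamma=2.0):
    """Focal loss: the (1 - p_y)^gamma factor down-weights easy, confident examples."""
    p_y = probs[np.arange(len(y)), y]
    return -((1.0 - p_y) ** gamma) * np.log(p_y)

# Inverse-frequency class weights from training label counts (toy counts):
counts = np.array([900.0, 100.0])
class_weights = counts.sum() / (len(counts) * counts)  # minority class upweighted
```

Both mechanisms shrink the majority class's share of the gradient, which is why stacking them over-corrects.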

Full Training: The CORAL Wall (5 Epochs)

Trained the top 2 configurations for 5 epochs each (~1.5 hours per run):

base_weighted_ce (5 epochs):

| Epoch | Combined | Cat F1 | Spec F1 | QWK |
|---|---|---|---|---|
| 1 | 0.670 | 0.879 | 0.461 | 0.800 |
| 3 | 0.704 | 0.924 | 0.485 | 0.833 |
| 5 | 0.724 | 0.932 | 0.517 | 0.840 |

Category F1 reached 0.932 — well above the 0.80 target. But specificity F1 plateaued at 0.517. Per-class breakdown revealed the problem:

| Level | Specificity F1 |
|---|---|
| L1 (Generic) | 0.79 |
| L2 (Domain-Adapted) | 0.29 |
| L3 (Firm-Specific) | 0.31 |
| L4 (Quantified) | 0.55 |

L2 and L3 were dragging macro F1 down to 0.52. QWK was 0.84 — meaning the model's ordinal ranking was good (rarely confusing L1 with L4), but the exact boundary placement between adjacent levels was fuzzy.
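For reference, QWK penalizes disagreements by the squared distance between levels; a minimal numpy sketch (0-indexed levels; illustrative, not the project's eval code):

```python
import numpy as np

def quadratic_weighted_kappa(y_true, y_pred, n_levels=4):
    """QWK = 1 - sum(W*O) / sum(W*E), with quadratic distance weights W."""
    O = np.zeros((n_levels, n_levels))        # observed confusion matrix
    for t, p in zip(y_true, y_pred):
        O[t, p] += 1
    # expected counts under independent marginals
    E = np.outer(O.sum(axis=1), O.sum(axis=0)) / len(y_true)
    i, j = np.indices((n_levels, n_levels))
    W = (i - j) ** 2 / (n_levels - 1) ** 2    # quadratic disagreement weights
    return 1.0 - (W * O).sum() / (W * E).sum()
```

An adjacent miss (L3 for true L4) costs 1/9 of a worst-case miss (L1 for true L4), which is why a model with fuzzy adjacent boundaries can still post a high QWK alongside a low macro F1.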

The CORAL Diagnosis

CORAL uses a single weight vector w with shifted biases: logit_k = w·x + b_k. This means the same features separate L1 from L2 as separate L3 from L4. But the three specificity transitions require fundamentally different evidence:

  • L1→L2: Cybersecurity terminology detection (the ERM test — does the paragraph use language a general business professional wouldn't?)
  • L2→L3: Firm-unique fact detection (named roles, specific systems, internal programs)
  • L3→L4: Quantified/verifiable claim detection (dollar amounts, dates, third-party firm names)

A single shared weight vector cannot simultaneously encode "presence of domain terminology," "presence of named entities," and "presence of numerical quantities" — these are orthogonal signal types in the embedding space. CORAL's structural constraint was forcing the model to find one feature direction that approximates all three, resulting in blurry boundaries everywhere.

Additionally, [CLS] token pooling loses distributed signals. A paragraph that mentions "CISO" once in a subordinate clause should be L3, but [CLS] may not attend strongly to that one token.

Architecture Iteration: Independent Thresholds

Replaced CORAL with four changes (implemented in python/src/finetune/model.py):

  1. Independent threshold heads. Three separate binary classifiers, each with its own Linear(1024→256→1) MLP:

    • threshold_L2plus: "Has any qualifying facts?" (L1 vs L2+)
    • threshold_L3plus: "Has firm-specific facts?" (≤L2 vs L3+)
    • threshold_L4: "Has quantified facts?" (≤L3 vs L4)

    Same cumulative binary targets as CORAL (label k → [1]×k + [0]×(3−k)), but each threshold learns independent features. The prediction is: level = count(sigmoid(logit_k) > 0.5).

  2. Attention pooling. Replaced [CLS] with a learned attention pool over all token representations. This lets the model attend to specific evidence tokens (CISO, $2M, NIST) distributed anywhere in the paragraph.

  3. Specificity confidence filtering. Only compute specificity loss on paragraphs where all 3 Grok runs agreed on specificity (91.3% of training data, as tracked in consensus specificityAgreement.agreed). The ~6K disagreement cases are exactly the noisy boundary labels that confuse the model. Category loss still uses all samples.

  4. Ordinal consistency regularization. Penalty (weight 0.1) when threshold k fires but threshold k-1 doesn't — e.g., the model says "has firm-specific facts" but not "has domain terms." This enforces the cumulative structure without the rigidity of CORAL's shared weights.
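Changes 1, 2, and 4 can be sketched in numpy (toy dimensions, untrained random weights; the real model.py implementation is in PyTorch):

```python
import numpy as np

rng = np.random.default_rng(0)
H, HID, T = 16, 8, 6      # stand-ins for hidden 1024, MLP 256, sequence length

tokens = rng.normal(size=(T, H))         # token representations, one paragraph

# (2) Attention pooling: a learned query scores each token; softmax-weighted sum.
q = rng.normal(size=H)
scores = tokens @ q
a = np.exp(scores - scores.max())
a /= a.sum()
x = a @ tokens                           # pooled representation, shape (H,)

# (1) Three INDEPENDENT threshold heads, each its own Linear(H -> HID -> 1) MLP.
heads = [(rng.normal(size=(H, HID)), rng.normal(size=HID)) for _ in range(3)]
logits = np.array([np.maximum(x @ W1, 0.0) @ W2 for W1, W2 in heads])
probs = 1.0 / (1.0 + np.exp(-logits))    # P(level > k), one per threshold

# Cumulative targets: label k (0..3) -> [1]*k + [0]*(3-k); prediction mirrors it.
def targets(k):
    return np.array([1.0] * k + [0.0] * (3 - k))

level = int((probs > 0.5).sum())         # predicted level index 0..3 (L1..L4)

# (4) Ordinal consistency: penalize threshold k firing when k-1 does not.
consistency_penalty = 0.1 * np.maximum(probs[1:] - probs[:-1], 0.0).sum()
```

Unlike CORAL, nothing forces the three probabilities to be monotone, so the soft penalty supplies the ordinal structure while each head keeps its own feature direction.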

Results: The Independent Threshold Breakthrough

Config: configs/finetune/iter1-independent.yaml — base ModernBERT-large, independent thresholds with 256-dim MLP, attention pooling, spec confidence filtering, 15 epochs.

| Epoch | Combined | Cat F1 | Spec F1 | QWK | L2 F1 | L3 F1 |
|---|---|---|---|---|---|---|
| 1 | 0.855 | 0.867 | 0.844 | 0.874 | 0.782 | 0.821 |
| 2 | 0.913 | 0.909 | 0.918 | 0.935 | 0.887 | 0.911 |
| 3 | 0.925 | 0.919 | 0.931 | 0.945 | 0.893 | 0.926 |
| 5 | 0.938 | 0.936 | 0.940 | 0.949 | | |
| 8 | 0.944 | 0.943 | 0.945 | 0.952 | 0.923 | 0.940 |
| 10 | 0.944 | 0.943 | 0.945 | 0.952 | | |

The model exceeded 0.80 on both heads at epoch 1. By epoch 8 it plateaued at 0.944 combined F1 (cat=0.943, spec=0.945, QWK=0.952). Training was stopped at epoch 11 — the train-eval loss gap (0.06 vs 0.49, ~8×) indicated the model was memorizing without further improving eval metrics.

The improvement was transformative. Spec F1: 0.517 → 0.945 (+0.428). L2 F1: 0.29 → 0.92. L3 F1: 0.31 → 0.94. The independent thresholds + attention pooling + confidence filtering combination addressed all three root causes simultaneously.

What mattered most? The independent thresholds were the primary driver. CORAL's shared weight vector was the bottleneck — when we let each ordinal transition learn its own features, the model immediately distinguished the three types of specificity evidence. Attention pooling and confidence filtering likely contributed meaningful improvements, but we did not run an ablation to isolate their individual contributions (the combined effect was so strong that decomposition was deprioritized).

Overfitting Observations

Encoder models absolutely can overfit. The 8× train-eval loss gap by epoch 10 is substantial. However, eval metrics (F1, QWK) remained stable from epoch 811, exhibiting "benign overfitting" — the model becomes more confident on training examples (lower train loss) without changing its decision boundaries (stable eval F1). The practical implication: monitor eval F1 for model selection, not eval loss.

For future runs: increase save_total_limit to preserve all epoch checkpoints, and add early stopping with patience ≥ 3 on spec_macro_f1.
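That stopping rule can be sketched framework-agnostically (a hypothetical helper, not part of the current training loop):

```python
def should_stop(metric_history, patience=3):
    """Stop once the metric (e.g. spec_macro_f1 per epoch) has failed to beat
    its previous best for `patience` consecutive epochs."""
    if len(metric_history) <= patience:
        return False
    best_before = max(metric_history[:-patience])
    return all(m <= best_before for m in metric_history[-patience:])
```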

Training Configuration Reference

| Parameter | Value |
|---|---|
| Backbone | answerdotai/ModernBERT-large (395M params) |
| Pooling | Learned attention |
| Category head | Linear(1024, 7) + weighted CE |
| Specificity head | 3× Independent(Linear(1024→256→1)) + cumulative BCE |
| Ordinal consistency | 0.1 weight |
| Spec confidence filter | Unanimous labels only (91.3% of data) |
| Batch size | 32 |
| Learning rate | 5e-5 |
| Warmup | 10% of total steps |
| Precision | bf16 + tf32 |
| Attention | Flash Attention 2 |
| Compilation | torch.compile |
| Optimizer | AdamW (fused) |
| Peak VRAM | ~18 GB / 24.6 GB (RTX 3090) |
| Training speed | ~2.1 it/s (batch 32, seq 512) |
| Best epoch | 8 (stable through 11) |

Checkpoint: checkpoints/finetune/iter1-independent/final/

What Remains

These metrics are on the validation set — same distribution as training (Grok ×3 consensus labels). The true test is the holdout gold set with human labels, which may reveal:

  • Systematic Grok-vs-human disagreements (especially at L2/L3 boundaries)
  • Whether the model learned Grok's biases rather than the underlying construct
  • Per-class F1 on the more diverse holdout distribution (the training data overrepresents RMP at 43%)

As a proxy before human labels arrive, evaluation against GPT-5.4 and Opus benchmark labels on the holdout will provide an intermediate signal.


v1 Reference

The complete v1 narrative — Stage 1 prompt engineering (12+ iterations), model benchmarking (21+ models, 12 providers), human labeling webapp, gold set adjudication (13-signal cross-analysis), codebook iterations v1.0→v3.5 — is preserved at docs/NARRATIVE-v1.md.

Key v1 deliverables carried forward:

  • 72,045-paragraph corpus with quality tiers
  • DAPT checkpoint (eval loss 0.7250, perplexity 1.65)
  • TAPT checkpoint (eval loss 1.0754, perplexity 2.11)
  • Model census: 21+ models evaluated across 12 providers
  • Human labeling webapp (labelapp) — will be updated for v2 codebook
  • Empirical evidence for every v2 codebook decision

References

  • Warner, B., Clavié, B., Soldaini, L., et al. (2024). "Smarter, Better, Faster, Longer: A Modern Bidirectional Encoder for Fast, Memory Efficient, and Long Context Fine-tuning and Inference." arXiv:2412.13663.
  • Gururangan, S., Marasovic, A., Swayamdipta, S., Lo, K., Beltagy, I., Downey, D., & Smith, N.A. (2020). "Don't Stop Pretraining: Adapt Language Models to Domains and Tasks." Proceedings of ACL 2020, pp. 8342-8360.
  • Ponnock, J. (2025). "The Data Efficiency Frontier of Financial Foundation Models: Scaling Laws from Continued Pretraining." arXiv:2512.12384.
  • Sounack, T., et al. (2025). "BioClinical ModernBERT: A Domain-Adapted Encoder for Biomedical and Clinical NLP." arXiv:2506.10896.
  • Luo, Z., et al. (2025). "Patent ModernBERT: A Pretrained Language Model for Intellectual Property." arXiv:2509.14926.
  • Dao, T. (2024). "FlashAttention-2: Faster Attention with Better Parallelism and Work Partitioning." Proceedings of ICLR 2024.
  • Ringel, D.M. (2023). "Creating Synthetic Experts with Generative Artificial Intelligence." arXiv:2310.15560.
  • Cao, W., Mirjalili, V., & Raschka, S. (2020). "Rank Consistent Ordinal Regression for Neural Networks with Application to Age Estimation." Pattern Recognition Letters.