# Project Narrative — SEC Cybersecurity Disclosure Quality Classifier

This document captures the process, roadblocks, decisions, and resolutions from building the SEC cybersecurity disclosure quality classifier. It serves as the source material for the final paper and presentation.

---

## Phase 1: Project Scoping and Construct Design

### The Problem

SEC Release 33-11216 (July 2023) created a new annual cybersecurity disclosure requirement (10-K Item 1C) and an incident disclosure requirement (8-K Item 1.05). By FY2024, ~9,000–10,000 filings exist. No validated classifier or public labeled dataset exists for assessing the quality of these disclosures. Investors, regulators, and compliance officers need scalable tools to distinguish substantive disclosures from boilerplate.

### Methodology Decision: Ringel (2023) "Synthetic Experts"

We adopted the Ringel (2023) "Synthetic Experts" pipeline: use frontier LLMs to generate training labels at scale, then distill into an efficient encoder model. This approach was chosen because:

- Manual labeling of 50,000+ paragraphs is infeasible for a 6-person team
- Multiple cheap LLMs annotating in parallel provide built-in quality control through inter-annotator agreement
- The encoder distillation step produces a model that can classify at inference time without LLM API costs

### Construct: Two Classification Dimensions

We defined two simultaneous classification tasks per paragraph:

1. **Content Category** (7 mutually exclusive classes) — what the paragraph is about, grounded in the SEC rule's own structure (Board Governance, Management Role, Risk Management Process, Third-Party Risk, Incident Disclosure, Strategy Integration, None/Other)
2. **Specificity Level** (4-point ordinal) — how company-specific the disclosure is, from generic boilerplate to quantified-verifiable facts

The construct maps to NIST CSF 2.0 categories for academic grounding.
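The two dimensions can be sketched as a label schema. This is a minimal Python sketch; the class names, enum values, and validation are illustrative, not the pipeline's actual identifiers:

```python
from dataclasses import dataclass
from enum import Enum

class ContentCategory(str, Enum):
    """The 7 mutually exclusive content categories (abbreviations illustrative)."""
    BOARD_GOVERNANCE = "BG"
    MANAGEMENT_ROLE = "MR"
    RISK_MANAGEMENT_PROCESS = "RMP"
    THIRD_PARTY_RISK = "TP"
    INCIDENT_DISCLOSURE = "ID"
    STRATEGY_INTEGRATION = "SI"
    NONE_OTHER = "N/O"

@dataclass
class ParagraphLabel:
    """One paragraph gets both labels: a category and an ordinal specificity."""
    category: ContentCategory
    specificity: int  # 1 = generic boilerplate .. 4 = quantified-verifiable

    def __post_init__(self):
        if not 1 <= self.specificity <= 4:
            raise ValueError("specificity must be in 1..4")
```

The key design point is that specificity is ordinal (1 < 2 < 3 < 4) while category is nominal — a distinction that matters later for both agreement metrics and the classifier head.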
---

## Phase 2: Data Acquisition and Corpus Construction

### The Extraction Problem

SEC filings are not structured data. They're HTML generated from PDFs, XBRL, and Word documents by dozens of different tools, each producing different artifacts. Building a reliable extraction pipeline for ~9,000 filings meant solving a series of messy, real-world data engineering problems.

### Pipeline Architecture

Built in TypeScript (~1,000 lines of extraction code across `parse-item1c.ts`, `segment.ts`, `fast-reparse.ts`, and pipeline orchestration):

```
EDGAR Master Index → enumerate 10-K filings → download HTML → extract Item 1C → segment paragraphs → JSONL
submissions.zip → scan for 8-K Item 1.05 → download HTML → extract → segment → merge with 10-K corpus
```

### Roadblock: HTML Variability

Every filing's HTML is different. The same logical content looks completely different depending on the tool that generated the HTML:

- **Word splitting from inline elements.** XBRL and styling tags break words mid-token: `Item 2` renders correctly in a browser but parses as "Item2" in code. Required detecting adjacent inline element boundaries and inserting spaces selectively.
- **CamelCase joins from PDF converters.** PDF-to-HTML tools merge sentences across formatting boundaries: `sentence.Next sentence` instead of `sentence. Next sentence`. Required regex passes to detect missing spaces after punctuation.
- **Page breaks mid-sentence.** Page numbers, running headers, and subsidiary headers get spliced into the middle of content paragraphs. Required filtering a catalog of page artifact patterns.
- **Table of Contents shadowing.** "Item 1C" appears at least twice in every 10-K — once in the Table of Contents and once in the actual content. Using the first match extracts the wrong section. Required the LAST match — a silent failure that produced empty or wrong extractions for hundreds of filings before we caught it.
- **XBRL tag pollution.** Inline XBRL wraps financial facts in `ix:header`, `ix:references`, and `ix:nonFraction` tags that carry no display content but add noise.
- **Entity encoding chaos.** ` `, ` `, `“`, `”`, `—`, `–`, `•` — each needs correct decoding, and different filing tools use different entity styles for the same characters.

### Paragraph Segmentation

After extracting clean section text, splitting into paragraphs had its own challenges:

- **Bullet list merging.** Disclosures frequently use bullet lists. Bullets need to be merged with their intro sentence; a standalone "• vulnerability scanning" is meaningless.
- **Continuation line detection.** Sentences split across HTML block elements need rejoining.
- **Length boundaries.** Under 20 words → likely a header (filtered). Over 500 words → split at sentence boundaries.
- **Table-based bullet lists and the cascade failure.** Some generators render bullet lists as HTML tables with non-standard bullet characters. Since `stripHtml()` doesn't recognize `·` as a bullet marker, the merge logic never fires, producing multi-element run-on paragraphs. 2,210 paragraphs were affected.

### 8-K Extraction

**Roadblock: EDGAR full-text search misses filings.** The EFTS keyword search doesn't reliably return all cybersecurity 8-Ks. Post-May 2024, companies moved non-material disclosures from Item 1.05 to Items 8.01 or 7.01.

**Resolution:** Built `scan-8k-items.py` to scan the SEC's bulk `submissions.zip` deterministically — a gap-free scan of every 8-K with cybersecurity content. Result: **207 cybersecurity incident 8-K filings** identified.
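Two of the fixes above — taking the LAST "Item 1C" match to dodge the Table of Contents, and repairing camelCase joins after punctuation — can be sketched as follows. This is Python for illustration (the actual pipeline is TypeScript), and the function names are hypothetical; the real code additionally guards against abbreviations and heading variants:

```python
import re

def locate_item_1c(text: str) -> int:
    """Return the offset of the LAST 'Item 1C' heading match.

    The first match is almost always the Table of Contents entry;
    using it silently extracts the wrong section.
    """
    matches = list(re.finditer(r"Item\s*1C\.?", text, flags=re.IGNORECASE))
    if not matches:
        raise ValueError("Item 1C heading not found")
    return matches[-1].start()

def repair_camelcase_joins(text: str) -> str:
    """Re-insert the space PDF converters drop after sentence punctuation,
    e.g. 'sentence.Next sentence' -> 'sentence. Next sentence'."""
    return re.sub(r"([.!?])([A-Z])", r"\1 \2", text)
```

Note that the naive punctuation regex would also split abbreviations like "U.S.Government"; the production passes need exception lists around patterns like this.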
### Corpus Statistics

- **72,045 paragraphs** from ~9,000 filings (FY2023 + FY2024 + early FY2025)
- All 10-K Item 1C; 207 8-K paragraphs extracted separately
- Median ~7 paragraphs per filing
- 49,795 paragraphs annotated (after filtering to complete filing metadata)

---

## Phase 3: Data Quality Audit and Corpus Remediation

### The Discovery

While preparing the DAPT corpus, we discovered two systematic issues silently corrupting the data:

1. **Orphan words.** HTML source wraps text at fixed column width. When a `` tag consumes most of a line, only the first word fits before the source newline. 4.7% of all paragraphs were affected.
2. **Inlined section headings.** 22% of paragraphs had section titles prepended to body text — a near-perfect predictor of `content_category` that creates shortcut-learning risk.

### Generator Investigation

Identified **14 distinct filing generators** covering 99.99% of 14,759 HTML files. The worst generator (EFiling/EDGAR Agent) accounted for 13.5% of filings but had a 36.8% orphan-word rate (8× the corpus average). Clean generators (Workiva, Donnelley, Inline XBRL) all had <1% rates. Full reference: `docs/EDGAR-FILING-GENERATORS.md`.

### Six Surgical Patches

All fixes follow one principle: `paragraphs-clean.jsonl` is **frozen**. All fixes go through `.patched.jsonl` files linked by paragraph UUID.

| Patch | Method | Paragraphs |
|-------|--------|-----------|
| 1–2. Orphan word restoration | HTML lookback extraction | 2,233 |
| 3–6. Heading strip (4 passes) | Pattern match + HTML-confirmed | 8,411 |

### Quality Tier System

| Tier | Criteria | Count | % |
|------|----------|-------|---|
| clean | No detected issues | 58,165 | 80.7% |
| headed | Had inlined heading (now stripped) | 7,402 | 10.3% |
| degraded | Embedded bullets, invisible merges, fragments, truncations | 4,331 | 6.0% |
| minor | Had orphan word (now fixed) | 2,147 | 3.0% |

Degraded paragraphs are downweighted 0.5× during fine-tuning.
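The frozen-file-plus-overlay convention can be sketched as below. Field names like `uuid` and `text` are assumptions for illustration; the point is that the frozen file is never rewritten and later patch files win on conflict:

```python
import json
from pathlib import Path

def load_with_patches(clean_path: Path, patch_paths: list[Path]) -> dict[str, dict]:
    """Load the frozen paragraph file, then overlay patch files by UUID.

    Patches are applied in order, so a later .patched.jsonl can refine
    an earlier one without anyone ever editing paragraphs-clean.jsonl.
    """
    paragraphs: dict[str, dict] = {}
    with open(clean_path, encoding="utf-8") as f:
        for line in f:
            rec = json.loads(line)
            paragraphs[rec["uuid"]] = rec
    for patch_path in patch_paths:
        with open(patch_path, encoding="utf-8") as f:
            for line in f:
                patch = json.loads(line)
                # Overlay: only the fields present in the patch are replaced.
                paragraphs[patch["uuid"]].update(patch)
    return paragraphs
```

This keeps every remediation auditable — diffing a patch file against the frozen file shows exactly which paragraphs each of the six patches touched.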
---

## Phase 4: Pre-Training — DAPT + TAPT

### DAPT: Domain-Adaptive Pre-Training

Chose our own ~9,000 cached filings over PleIAs/SEC (373K on HuggingFace):

- Recency > volume — Item 1C didn't exist before FY2023
- Diminishing returns past 250M tokens (Ponnock 2025)
- We control cleaning quality
- Feasible on a single RTX 3090

**Corpus:** 14,568 docs, ~1.056B tokens. Subsampled to the newest 500M tokens.

**Key optimizations:** Flash Attention 2 (47s→27s/step), torch.compile (halved activation memory), corpus subsampling (29h→13.5h).

**Results:** Eval loss 0.7250, perplexity 1.65. 1 epoch, ~14.5h on an RTX 3090. Checkpoint: `checkpoints/dapt/modernbert-large/final/`.

### TAPT: Task-Adaptive Pre-Training

72K Item 1C paragraphs (~10M tokens). 5 epochs with whole-word masking at seq_len=512.

**Bugs fought:** 4 bugs in `transformers` whole-word masking for BPE tokenizers, plus a Python 3.14 incompatibility. Built a custom `WholeWordMaskCollator` from scratch.

**Results:** Loss 1.46→1.08, eval loss 1.0754, perplexity 2.11. 50 minutes on an RTX 3090. Checkpoint: `checkpoints/tapt/modernbert-large/final/`.

### Training Pipeline

```
ModernBERT-large (base, 395M params)
  → DAPT on 9K full 10-K filings (~500M tokens, ~14.5h)   → SEC-ModernBERT-large
  → TAPT on 72K Item 1C paragraphs (~10M tokens, ~50min)  → SEC-cyBERT-large
  → Fine-tune on labeled data with dual classification heads → Final classifier
```

---

## Phase 5: Truncated Filing Exclusion

72 filings (~0.8%) had section-boundary detection cut off mid-sentence. These are excluded from training splits — filings whose last paragraph doesn't end in terminal punctuation are filtered.

---

## Phase 6: The v2 Reboot — Why We Started Over

### What v1 Taught Us

The v1 pipeline produced 150K Stage 1 annotations, a 10-model benchmark, human labels from 6 annotators, and extensive gold adjudication. It worked — but evaluation revealed structural problems that no amount of prompt iteration could fix:

1. **Specificity Level 2 was too narrow.** Our codebook defined Level 2 as "names a recognized standard" — but the professor's construct says "references industry." Domain-specific practices (penetration testing, vulnerability scanning, SIEM) were classified as Level 1. Level 2 ended up at 3.9% of the holdout (47 samples) — too few for reliable per-class F1.
2. **Level 4 required 2+ QV facts.** The construct lists types of qualifying facts, not a minimum count. The artificial threshold created a narrow class and forced annotators into a counting exercise.
3. **The BG/MR/RMP triangle was patched, not fixed.** Six decision rules and ten borderline cases accumulated as patches on unchanged definitions. Models processed increasingly complex instructions with diminishing returns.
4. **The holdout was adversarial by design.** Stratified to over-sample confusion-axis paragraphs — great for stress-testing the codebook, terrible for evaluation. Combined with the narrow Level 2, this structurally depressed F1.
5. **Human specificity agreement was poor.** Krippendorff's α = 0.546 on specificity (target: 0.67). The narrow Level 2 definition made it hard for anyone to agree.

### The Decision

Rather than continue patching, we decided to:

- Revise the codebook with systemic changes (broaden Level 2, loosen Level 4, reframe category rules)
- Take a new random stratified holdout (equal per category class, not overindexed on hard cases)
- Re-run Stage 1 with the improved codebook/prompt
- Have humans re-label the new holdout
- Re-run the benchmark panel
- Then train

The v1 data pipeline, corpus, DAPT checkpoint, and TAPT checkpoint all carry forward unchanged. Only the labeling and evaluation are redone.
### What Changed in v2

**Codebook (LABELING-CODEBOOK.md):**

- Level 2 broadened from "names a standard" to "uses cybersecurity domain terminology" (the ERM test)
- Level 4 threshold lowered from 2+ to 1+ QV-eligible fact (the external verifiability test)
- Category primary test changed to "What question does this paragraph answer?"
- MR headline changed from "who a specific person is" to "how management is organized to handle cybersecurity"
- Person-removal test reframed as a confirmation tool, not a primary rule
- Materiality rules cleaned up (the assessment-vs.-speculation distinction became a clean rule, not a ruling)
- IS/NOT lists restructured for the new Level 2 boundary
- Codebook + Ethos split: rules in LABELING-CODEBOOK.md, reasoning in CODEBOOK-ETHOS.md

**Holdout:**

- Random stratified sample: ~170 per category class × 7 ≈ 1,190
- Secondary constraint: minimum ~100 per specificity level
- NOT overindexed on confusion-axis cases
- Separate ~200-paragraph dev set for prompt iteration (excluded from holdout)

---

## Phase 7: Holdout Selection & Prompt Engineering

### Holdout Sampling

Used v1 Stage 1 consensus labels (50,003 paragraphs, 3-model majority vote under the v2.5 prompt) as a sampling guide. Applied heuristic v2 specificity prediction: a keyword scan for domain terminology to identify v1 Level 1 paragraphs that would become Level 2 under v2 rules, and a QV-indicator scan for Level 3→4 promotions.

**Allocation:** 185 per non-ID category, 90 for Incident Disclosure (only 166 available in the annotated corpus) = 1,200 exact. Max 2 paragraphs per company per category stratum to prevent boilerplate clustering. All specificity floors met (≥100 per level). 1,042 unique companies represented.

The v1 holdout had been intentionally oversampled on confusion-axis cases (split votes between MR/RMP, N/O/SI, etc.) — useful for codebook development but structurally hostile to F1.
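The per-stratum draw with the per-company cap can be sketched as follows. The candidate tuples and function name are illustrative, not the actual sampling code:

```python
import random
from collections import defaultdict

def sample_stratum(candidates, n, max_per_company=2, seed=0):
    """Draw up to n paragraphs for one category stratum, capping each
    company at max_per_company to prevent boilerplate clustering.

    candidates: list of (paragraph_id, company_id) pairs.
    Returns the selected paragraph_ids.
    """
    rng = random.Random(seed)
    pool = candidates[:]
    rng.shuffle(pool)  # random within the stratum, no hard-case overweighting
    taken, per_company = [], defaultdict(int)
    for pid, company in pool:
        if per_company[company] >= max_per_company:
            continue  # company already contributed its cap for this stratum
        taken.append(pid)
        per_company[company] += 1
        if len(taken) == n:
            break
    return taken
```

The same loop, run once per category with n = 185 (90 for ID), naturally satisfies the specificity floors as a post-hoc check rather than a hard constraint.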
The v2 holdout is random within each category stratum: hard cases appear at their natural frequency, not overweighted.

### Prompt Iteration: From List-Matching to Principle-Based Reasoning

The v2 prompt underwent 5 iterations (v4.0→v4.4), tested against a 200-paragraph dev batch from the holdout with GPT-5.4 (~$6 total pilot cost).

**v4.0 (baseline rewrite):** Translated the v2 codebook into the system prompt. The category section used the "what question?" test — it worked well, at 87% agreement with v1 consensus. The specificity section used exhaustive IS/NOT lists, matching the v1 approach. Result: Level 2 grew from 6% to 16% (domain-terminology broadening) and Level 4 grew from 5% to 22% (the 1+ QV rule). But an audit revealed the model was pattern-matching against the lists rather than reasoning about the underlying principles. Two errors: "Vice President, Information Systems and Technology" and "Senior Vice President of Information Technology" were classified as Level 1 because neither exactly matched the IS-list entry "VP of IT/Security."

**The list-matching problem:** The category section — built around reasoning principles ("what question does this paragraph answer?", the person-removal test, the materiality linguistic test) — achieved 87% agreement. The specificity section — built around exhaustive checklists — caught listed items but missed unlisted items that satisfied the same principle. The model was executing a lookup table, not applying the ERM test.

**v4.1 (principle-first restructure):** Restructured all three specificity levels to lead with the principle and compress lists to boundary-case disambiguation only:

- Level 2: "Apply the ERM test — would a non-security ERM professional use this language?" with illustrative examples
- Level 3: "Would this detail help narrow down which company wrote it?" with the VP-or-above bright line
- Level 4: "Could someone outside the company verify this?" with boundary cases

Result: +12 Level 1→2 catches (the model reasoning about vocabulary level, not scanning a list), and the VP/SVP titles were fixed. But Level 4 regressed — the model started reasoning about whether QV facts were "relevant to the paragraph's main point" instead of treating specificity as a presence check.

**The independence insight:** Category and specificity are independent dimensions. Category captures what the paragraph is ABOUT. Specificity captures how informative it is AS A WHOLE. A paragraph classified as RMP that mentions a CISO's CISSP in a subordinate clause is RMP at Level 4 — the certification is verifiable regardless of whether it serves the category. The model was conflating "this fact is secondary to the paragraph's purpose" with "this fact doesn't count for specificity." This is wrong: specificity is a presence check on the entire paragraph, not a relevance judgment.

This also raised a methodological question: SHOULD specificity be category-conditional? The steelman for category-conditional specificity: "Board Governance at Level 4" should mean the governance disclosure is highly specific, not that a tangential financial fact inflated the score. The steelman against: SEC paragraphs interleave topics, facts can't be cleanly decomposed into category buckets, and conditional specificity introduces cascading errors (wrong category → wrong specificity). For this project, paragraph-level specificity is the right choice — it matches the construct, is simpler to annotate, and produces higher agreement. Acknowledged as a limitation for the paper.

**v4.2–v4.4 (surgical fixes):** Added explicit presence-check framing, a hard-vs.-soft number boundary ("12 professionals" → QV; "approximately 20 departments" → not QV), and the "various certifications including CISSP → YES" rule (named certifications are QV regardless of surrounding hedge words).
The final prompt (v4.4) recovers Level 4 to within 1 of baseline while retaining all principle-based gains at Levels 2 and 3.

**v4.4 pilot results (200 paragraphs, GPT-5.4):**

| Specificity | v4.0 (list) | v4.4 (principle) | Change |
|-------------|-------------|-----------------|--------|
| L1 | 81 (40.5%) | 65 (32.5%) | -16 |
| L2 | 32 (16.0%) | 41 (20.5%) | +9 |
| L3 | 43 (21.5%) | 51 (25.5%) | +8 |
| L4 | 44 (22.0%) | 43 (21.5%) | -1 |

Category: 95.5% agreement with v1 consensus. Specificity: 84.5% agreement (expected divergence given the broadened L2 and the 1+ QV rule). The 200-paragraph dev batch is now contaminated by prompt examples that target specific cases in it — further iteration requires the unseen 1,000 paragraphs from the full holdout.

### Full Holdout Validation & v4.5

Running v4.4 on the full 1,200-paragraph holdout ($5.70) revealed three problems not visible in the 200-paragraph pilot:

**Problem 1: 34.5% medium-confidence specificity.** The model was uncertain on 414 of 1,200 paragraphs, concentrated at the L1/L2 boundary (59% of L2 calls were medium-confidence) and the L2/L3 boundary (51% of L3). Third-Party Risk was worst: 74% medium-confidence on specificity. The model's reasoning showed it listing zero specific facts but still assigning L2 based on vibes — the paragraph "felt" domain-adapted because the topic was cybersecurity, even when the vocabulary was generic ERM language.

**Problem 2: SI materiality assertions falsely promoted to L4.** Paragraphs like "As of December 28, 2024, we have not had any material cybersecurity incidents" were classified L4 because a specific date anchored the claim. But negative self-assertions are not externally verifiable — you cannot independently confirm the absence of something. These are Strategy Integration at Level 1, not Level 4.

**Problem 3: specific_facts discarded from stored output.** The `toLabelOutput()` function stripped the `specific_facts` array before writing to disk.
The model was generating facts during inference (the schema required it), but we couldn't verify the mechanical bridge between facts and specificity level because the evidence was thrown away.

**v4.5 fixes:**

1. **Mechanical bridge enforced.** Restructured the specificity protocol as a scan-tag-max pipeline: scan for facts, tag each as [DOMAIN]/[FIRM]/[VERIFIABLE], assign specificity = max(tags). Added an explicit rule: "if specific_facts is empty, specificity MUST be Generic Boilerplate." Result: 100% consistency — L1 always empty, L2+ always populated with supporting facts. The bridge prevents the model from overriding its own fact-finding with holistic vibes.
2. **Expertise vs. topic clarification for L1/L2.** Added: "The ERM test evaluates whether the paragraph demonstrates cybersecurity EXPERTISE, not whether it discusses a cybersecurity TOPIC. Every paragraph in these filings discusses cybersecurity — that's what the filing requires. L1 means generic oversight language any business professional could write. L2 means the writer shows they understand HOW cybersecurity works." With TP-specific examples: "We conduct vendor security assessments" → L1 (generic process description); "We review vendors' SOC 2 attestations and require encryption at rest" → L2 (specific security evidence requiring domain knowledge).
3. **SI negative assertions excluded from L4.** Added explicit NOT-verifiable examples: "We have not experienced any material cybersecurity incidents" → NOT QV (cannot externally verify absence); "In 2023, we did not experience a material incident" → NOT QV (a year does not make a negative assertion verifiable). Also added lower bounds as verifiable: "more than 20 years" → YES (a checkable threshold, unlike "approximately 20," which is hedged in both directions).
4. **Fact storage.** Updated `toLabelOutput()` and the `LabelOutput` schema to preserve `specific_facts` in stored output. Added `domain_term` to the `FactType` enum for L2-level vocabulary evidence.
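The scan-tag-max rule reduces to a few lines. This sketch is illustrative — in the pipeline the rule lives in prompt text and the model's own reasoning, with this invariant checked post-hoc over the stored `specific_facts`:

```python
def specificity_from_facts(fact_tags: list[str]) -> int:
    """Scan-tag-max bridge: specificity is the max over tagged facts.

    An empty fact list MUST mean Level 1 (Generic Boilerplate) -- the
    model may never override its own fact-finding with a holistic read.
    Tag names mirror the prompt's [DOMAIN]/[FIRM]/[VERIFIABLE] markers.
    """
    level = {"DOMAIN": 2, "FIRM": 3, "VERIFIABLE": 4}
    if not fact_tags:
        return 1  # no facts found -> Generic Boilerplate
    return max(level[tag] for tag in fact_tags)
```

Running this check over stored output is how the "100% bridge consistency" number below is computed: the assigned level must equal the max over the model's own tags.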
**v4.5 results (1,200 paragraphs, GPT-5.4, $6.88):**

| Metric | v4.4 | v4.5 |
|--------|------|------|
| L1 | 546 (45.5%) | 618 (51.5%) |
| L2 | 229 (19.1%) | 168 (14.0%) |
| L3 | 225 (18.8%) | 207 (17.2%) |
| L4 | 200 (16.7%) | 207 (17.2%) |
| Medium confidence | 414 (34.5%) | 211 (17.6%) |
| Bridge consistency | unknown | 100% |
| SI false L4s | ~6 | 0 |
| Category stability | — | 96.8% |

L2 at 14% is below the 15% holdout target, but the holdout oversamples TP (14.4% vs. 5% in the corpus), and TP is where 55 of the 61 L2→L1 drops concentrated. On the full corpus (46% RMP, 5% TP), L2 should be ~15–17%. The TP drops are correct — verified by inspecting the facts: survivors list SOC reports, vulnerability scans, and penetration testing; drops use only generic vendor-management language ("contractual requirements", "vendor due diligence").

**Key architectural insight:** With reasoning models, structured output fields are results, not reasoning steps. The model decides everything in reasoning tokens before generating the JSON. The mechanical bridge works by influencing the reasoning process through prompt text, not through schema field ordering. The specific_facts field captures the model's evidence for our debugging, but the actual bridge enforcement happens in the model's internal reasoning, guided by the prompt's explicit consistency rules.

### v2 Holdout Benchmark (10 models, 8 providers)

With v4.5 locked, we ran the full BENCHMARK_MODELS panel on the 1,200-paragraph v2 holdout to evaluate model quality before committing to the ~$100 Stage 1 re-run. GPT-5.4 (v4.5) is the reference — our best-validated model on the holdout, the one whose prompt iterations we hand-verified.
**Full benchmark results (vs. GPT-5.4 reference):**

| Model | N | Cat% | Cat κ | Spec% | Spec κw | Both% | 50K proj | Reasoning |
|-------|---|------|-------|-------|---------|-------|----------|-----------|
| Grok 4.1 Fast | 1200 | 93.7% | 0.925 | 91.6% | 0.929 | 86.1% | $32 | 584 |
| Opus 4.6 (prompt-only) | 1184 | 93.7% | 0.925 | 90.1% | 0.910 | 85.2% | $0 (sub) | — |
| Gemini 3.1 Pro | 1200 | 93.8% | 0.926 | 89.4% | 0.906 | 84.2% | $735 | 502 |
| GLM-5 | 1200 | 92.8% | 0.915 | 88.3% | 0.898 | 82.8% | $364 | 1421 |
| Kimi K2.5 | 1200 | 92.6% | 0.912 | 88.1% | 0.894 | 82.8% | $353 | 2832 |
| Gemini 3.1 Flash Lite | 1200 | 91.8% | 0.904 | 83.0% | 0.844 | 76.5% | $79 | 363 |
| MIMO v2 Flash | 794 | 92.7% | 0.911 | 85.3%\* | 0.662 | 79.7% | $26 | 1423 |
| MIMO v2 Pro | 980 | 94.0% | — | 90.7% | — | 85.9% | $274 | 1439 |
| MiniMax M2.7 | 1198 | 87.6% | 0.855 | 76.5% | 0.756 | 68.5% | $70 | 615 |

\*MIMO Flash spec% is misleading — 91.1% of its labels are L1 (a collapsed distribution). κw = 0.662 reflects this.

**Pilot candidates (200-paragraph tests):**

| Model | Cat% | Spec% | Both% | 50K proj | Verdict |
|-------|------|-------|-------|----------|---------|
| Qwen3-235B MoE | 89.9% | 62.6% | 56.1% | $18 | Dead — 0 reasoning tokens, 34% L4 |
| Seed 1.6 Flash | 87.5% | 74.7% | 67.7% | $24 | Weak — below Flash Lite |
| Qwen3.5 Flash | 92.9% | n/a | n/a | $70 | Dead — 100% L1 collapse |

**Key findings from the benchmark:**

1. **Clear quality tiers.** Grok Fast stands alone as the best affordable model (86.1% both-match, $32/50K). There's a 9pp gap to the next affordable option (Flash Lite at 76.5%, $79). Everything in between costs $350+.
2. **MIMO Flash specificity is broken.** Category agreement is fine (92.7%) but specificity collapses to 91.1% L1 — it simply doesn't differentiate specificity levels. The v1 Stage 1 panel included MIMO Flash; this means v1 specificity consensus was partially degraded by one broken voter.
3. **Opus performs better without the codebook.** We ran Opus via the Agent SDK in two configurations: (a) full v2 codebook + operational prompt (37.7KB system prompt), (b) operational prompt only (16.2KB). Prompt-only was significantly better: 85.2% vs. 82.4% both-match, 49.2% vs. 40.5% facts coverage. The codebook was actively diluting the operational prompt's bridge instruction. This is a counterintuitive but important finding for the paper — more context can hurt performance when the operational prompt has been carefully engineered.
4. **Reasoning tokens correlate with quality, but not linearly.** Kimi K2.5 reasons the most (2832 tokens/para) but ranks 5th. Grok reasons modestly (584 tokens) and ranks 1st. Quality seems to depend more on the model's internal architecture than on raw reasoning volume. Models with 0 reasoning tokens (Qwen3-235B) or with reasoning that doesn't engage with specificity (Qwen3.5 Flash — 4381 tokens, all L1) are categorically broken for this task.
5. **No viable cheap third model exists.** We searched OpenRouter exhaustively for models under $50/50K that support structured output and reasoning. Every candidate (Qwen, ByteDance Seed, etc.) performed below Flash Lite, which was already the weakest panel member.
6. **Category agreement is high across all non-broken models** (>91% vs. reference, κ > 0.90). The hard problem is specificity, where the mechanical bridge helps good models but can't save models that don't reason about it properly.

### Model Selection: Grok ×3 Self-Consistency

The budget constraint ($175 remaining for Stage 1 + Stage 2 + everything else) eliminated all multi-model panels except Grok + Flash Lite ($111). But Flash Lite's 76.5% both-match and inflated L2 distribution (19.1% vs. 14% reference) made it a weak second voter. We investigated whether running Grok multiple times could produce independent signals.
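For reference, the Spec κw columns in the benchmark tables above are weighted Cohen's κ on the ordinal specificity labels. Assuming the quadratic weighting used elsewhere in the project (the QWK metric in Phase 8), a minimal numpy sketch of the computation:

```python
import numpy as np

def quadratic_weighted_kappa(a, b, n_levels=4):
    """Quadratic-weighted Cohen's kappa for two raters on ordinal labels 1..n_levels.

    Disagreements are penalized by squared ordinal distance, so an L1-vs-L4
    confusion costs nine times as much as an adjacent-level one.
    """
    a = np.asarray(a) - 1
    b = np.asarray(b) - 1
    observed = np.zeros((n_levels, n_levels))
    for i, j in zip(a, b):
        observed[i, j] += 1
    observed /= observed.sum()
    pa = observed.sum(axis=1)            # rater A's marginal distribution
    pb = observed.sum(axis=0)            # rater B's marginal distribution
    expected = np.outer(pa, pb)          # chance-agreement matrix
    idx = np.arange(n_levels)
    weights = (idx[:, None] - idx[None, :]) ** 2 / (n_levels - 1) ** 2
    return 1 - (weights * observed).sum() / (weights * expected).sum()
```

This is why a collapsed distribution (MIMO Flash's 91% L1) can post a decent raw Spec% yet a poor κw: the expected-agreement term strips out the credit for always guessing the majority level.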
The temperature question turned out to be irrelevant: reasoning models have internal stochastic chain-of-thought that produces different outputs on repeated identical calls regardless of temperature settings. Most providers silently ignore `temperature: 0` for reasoning models (OpenAI explicitly rejects it; others drop it). Our `temperature: 0` was cosmetic the entire time.

**Empirical verification:** We re-ran 47 holdout paragraphs through Grok 4.1 Fast with identical inputs. Results:

- Category: 47/47 identical (100% deterministic)
- Specificity: 43/47 identical (91.5%), 4 diverged
- Divergence: 8.5% of paragraphs got different specificity labels
- All divergence was on specificity (L1↔L2, L1→L3, L3→L4) — exactly the ambiguous boundary cases where multiple runs provide real tiebreaking value

This 8.5% per-pair divergence rate means:

- ~90% of paragraphs will be 3/3 unanimous → strong consensus
- ~10% will be 2-1 split → majority vote resolves boundary cases
- Category is always unanimous → category quality = Grok's quality (93.7%, κ=0.925)

**Self-consistency is a well-established pattern** (Wang et al. 2022). The weakness vs. multi-model consensus is shared systematic biases — all three runs make the same systematic errors. But with κ=0.925 on category and κw=0.929 on specificity, Grok's systematic errors are rare. The 8.5% stochastic variation is concentrated exactly where we want it: ambiguous specificity boundaries.

**Cost: $96 for Grok ×3** (3 × $32 through OpenRouter). Leaves $80 for the Stage 2 judge and any reruns. An alternative — xAI's Batch API at 50% off — would reduce this to $48, but requires bypassing OpenRouter.

### Stage 1 Results: Grok ×3 Self-Consistency (72,045 paragraphs)

We ran 3 independent Grok 4.1 Fast passes over the full 72,045-paragraph corpus at concurrency 200. Each run completed in ~33 minutes. Total cost: $129.75 ($43.12–$43.62 per run).
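The per-dimension consensus rule is a straightforward majority vote with an explicit tiebreak path. A sketch (`resolve` is a hypothetical name):

```python
from collections import Counter

def resolve(labels):
    """Resolve one dimension (category or specificity) across the 3 runs.

    Returns (label, status) where status is 'unanimous' (3/3),
    'majority' (2/3), or 'tiebreak' (all three disagree -> escalate
    the paragraph to the judge model).
    """
    counts = Counter(labels).most_common()
    if counts[0][1] == 3:
        return counts[0][0], "unanimous"
    if counts[0][1] == 2:
        return counts[0][0], "majority"
    return None, "tiebreak"
```

A paragraph goes to the judge only if *either* dimension returns `tiebreak` — which is how the 212 three-way disagreements below fall out of 72,045 paragraphs.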
**Cross-run agreement:**

| Dimension | Unanimous (3/3) | Majority (2/3) | All disagree |
|-----------|-----------------|----------------|--------------|
| Category | 68,394 (94.9%) | 3,583 (5.0%) | 68 (0.09%) |
| Specificity | 65,780 (91.3%) | 6,120 (8.5%) | 145 (0.20%) |

Category is near-deterministic — 94.9% unanimous, and the 5% majority cases are concentrated at the BG↔MR and MR↔RMP boundaries (exactly the confusion axes identified during codebook development). Specificity shows the expected stochastic variation at 8.5% majority-only, matching the 8.5% divergence rate observed in the 47-paragraph pilot.

**Consensus resolution:**

- **62,510 (86.8%)** — both dimensions unanimous, direct consensus
- **9,323 (12.9%)** — majority vote on at least one dimension
- **212 (0.3%)** — no majority on at least one dimension, resolved by the GPT-5.4 judge

The 212 tiebreaker paragraphs were run through GPT-5.4 with the full judge prompt (disagreement-aware disambiguation rules, shuffled prior annotations). GPT-5.4 agreed with one of the 3 Grok labels on 100% of paragraphs — never inventing a novel answer. This validates that the Grok runs produce reasonable labels and that the disagreements are genuine boundary cases, not model failures. Judge cost: $5.76.

**Final consensus distribution:**

| Category | Count | % |
|----------|-------|---|
| RMP | 31,201 | 43.3% |
| BG | 13,876 | 19.3% |
| MR | 10,591 | 14.7% |
| SI | 7,470 | 10.4% |
| N/O | 4,576 | 6.4% |
| TP | 4,094 | 5.7% |
| ID | 237 | 0.3% |

| Specificity | Count | % |
|-------------|-------|---|
| L1: Generic Boilerplate | 29,593 | 41.1% |
| L2: Domain-Adapted | 16,344 | 22.7% |
| L3: Firm-Specific | 17,911 | 24.9% |
| L4: Quantified-Verifiable | 8,197 | 11.4% |

**v1→v2 category shifts:** BG rose from 16.0%→19.3% and N/O from 5.0%→6.4%, likely driven by the 22,250 paragraphs in the full corpus that v1 never annotated.
RMP dropped from 45.8%→43.3%, partly because the v2 codebook's sharper BG/MR/RMP boundaries reclassified some borderline paragraphs.

**Specificity is well-distributed.** L2 sits at 22.7% (above the 15% holdout target — the full corpus has more domain-rich paragraphs than the stratified holdout). L3 at 24.9% and L4 at 11.4% reflect the v2 codebook's tightened verifiability standards.

**Category × specificity interaction (see `figures/stage1-category-specificity-heatmap.png`):** MR is 87% L3/L4 (people have names, titles, and credentials). SI is 92% L1 (materiality boilerplate with no specific facts). ID is 86% L4 (incidents have dates, named threat actors, forensic firms). These patterns are exactly what the codebook predicts and match the holdout validation.

**Specificity boundary analysis:** The 6,265 paragraphs where runs diverged on specificity are concentrated at adjacent levels: L1↔L2 (2,485), L1↔L3 (1,423), L2↔L3 (1,160), L3↔L4 (707). Cross-level jumps (L1↔L4, L2↔L4) are rare (~280 total). This confirms the self-consistency mechanism is working as intended — it provides tiebreaking signal exactly at the ambiguous boundaries where different reasoning paths legitimately land on different answers.

### Cost of the Reboot (final)

| Item | Estimated Cost | Actual Cost |
|------|---------------|-------------|
| Prompt iteration (v4.0–v4.5, ~8 rounds) | ~$10 | $19.59 |
| v2 holdout benchmark (10 models + 3 pilots) | ~$45 | $45.47 |
| Stage 1 re-run (Grok ×3, 72K paragraphs) | ~$96 | $129.75 |
| Stage 2 judge (212 tiebreaker paragraphs) | ~$20–40 | $5.76 |
| Human re-labeling | $0 (team labor) | pending |
| **Total additional API** | **~$175–185** | **$200.57** |

This is against the ~$120 already spent on v1 API calls (not recovered). Total project API cost: **$320.57 of the $360 budget**. Remaining: **$39.43** — sufficient for any reruns or additional analysis.

The cost overshoot ($200 vs. the $175 estimate) comes entirely from annotating 72K paragraphs instead of the estimated 50K.
Unit cost actually came in below estimate: $0.60 per 1,000 paragraphs per Grok run versus the $0.64 estimated ($43 per run over 72K paragraphs vs. $32 over 50K).

---

## Phase 8: Fine-Tuning — From 0.52 to 0.94 Specificity F1

### Training Data Assembly

Built `python/src/finetune/data.py` to merge Stage 1 consensus labels (72,045 paragraphs) with paragraph text, quality tiers, and specificity-confidence metadata.

**Exclusions:**

- 1,200 holdout paragraphs (reserved for evaluation)
- 614 individually truncated paragraphs (the initial plan was to exclude 72 entire filings, but paragraph-level filtering is more targeted and preserves more data)

**Sample weighting:** clean/headed/minor = 1.0×, degraded = 0.5× (4,331 paragraphs at half weight).

**Result:** 70,231 training paragraphs, stratified 90/10 into 63,214 train / 7,024 val.

### Architecture: Dual-Head ModernBERT

The model architecture: ModernBERT-large backbone (395M params) → pooled representation → dropout → two independent classification heads:

1. **Category head:** Linear(1024, 7) with weighted cross-entropy loss. Standard multi-class classification.
2. **Specificity head:** Ordinal classification. The specificity dimension (L1→L2→L3→L4) has a natural ordering — predicting L1 when the truth is L4 is worse than predicting L3. This ordering should be reflected in the model architecture and loss function.

The initial architecture used **CORAL** (Cao et al. 2020) for the specificity head: a single shared weight vector with learned bias offsets for each ordinal threshold. This is the standard approach for ordinal regression.
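CORAL's structure — one shared weight vector **w** and K−1 threshold biases, with the predicted level equal to one plus the number of cutpoints passed — can be sketched at inference time in numpy (illustrative dimensions; in the actual head, x is the pooled ModernBERT representation):

```python
import numpy as np

def coral_predict(x, w, b):
    """CORAL ordinal prediction over levels 1..K (here K=4, so b has 3 entries).

    logit_k = w.x + b_k models P(level > k) with a SINGLE shared weight
    vector w -- only the bias shifts per threshold. The predicted level
    is 1 + the number of cutpoints where P(level > k) exceeds 0.5.
    """
    logits = x @ w + b                       # (K-1,) cumulative logits
    probs = 1.0 / (1.0 + np.exp(-logits))    # P(level > k) per cutpoint
    return 1 + int((probs > 0.5).sum())
```

Note that because **w** is shared across all cutpoints, every ordinal boundary separates examples along the same single feature direction.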
### Ablation Grid: 12 Configurations × 1 Epoch

Ran a systematic ablation over three axes:

- **Checkpoint:** base ModernBERT-large vs DAPT checkpoint vs TAPT checkpoint
- **Class weighting:** inverse-frequency weights vs uniform
- **Loss type:** cross-entropy vs focal loss (γ=2.0)

Results (1 epoch each, ~15 min/run, ~3 hours total):

| Rank | Configuration | Combined F1 | Cat F1 | Spec F1 |
|------|-------------|-------------|--------|---------|
| 1 | base + weighted + CE | **0.685** | **0.900** | 0.469 |
| 2 | DAPT + unweighted + focal | 0.684 | 0.892 | **0.476** |
| 3 | DAPT + weighted + CE | 0.681 | 0.896 | 0.466 |
| 4 | base + unweighted + CE | 0.680 | 0.892 | 0.467 |
| 5 | TAPT + weighted + CE | 0.675 | 0.896 | 0.455 |
| ... | | | | |
| 12 | TAPT + weighted + focal | 0.649 | 0.849 | 0.449 |

**Finding 1: DAPT/TAPT pre-training did not help.** Base ModernBERT-large outperformed both domain-adapted checkpoints. This is a noteworthy null result. ModernBERT-large was already pre-trained on a massive, diverse web corpus that likely includes SEC filings. Additional narrow-domain pre-training appears to cause mild catastrophic forgetting — the model loses general linguistic features while gaining domain-specific ones that the fine-tuning task doesn't benefit from. TAPT was consistently worst, suggesting the small corpus (72K paragraphs × 5 epochs at 30% masking) caused overfitting during MLM pre-training.

**Finding 2: Weighted CE is the best loss combination.** Class weighting gives a small but consistent lift to category F1 (0.900 vs 0.892 on the base checkpoint). Focal loss helps specificity slightly but hurts category. Weighted + focal over-corrects (consistently bottom tier) — both mechanisms independently reduce majority-class influence, and combining them suppresses it too far.
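For reference, the inverse-frequency weighting in the winning "weighted CE" configuration can be sketched as follows. The exact normalization used in the project's code is an assumption; the label counts are a toy distribution.

```python
import torch
import torch.nn as nn
from collections import Counter

def inverse_frequency_weights(labels, num_classes: int) -> torch.Tensor:
    """Weight class c by N / (C * n_c): rare classes get proportionally
    larger gradients, the majority class (RMP at ~43%) smaller ones."""
    counts = Counter(labels)
    n = len(labels)
    return torch.tensor(
        [n / (num_classes * counts.get(c, 1)) for c in range(num_classes)],
        dtype=torch.float32,
    )

# toy label distribution over the 7 content-category classes
labels = [0] * 43 + [1] * 20 + [2] * 15 + [3] * 10 + [4] * 6 + [5] * 4 + [6] * 2
weights = inverse_frequency_weights(labels, num_classes=7)
loss_fn = nn.CrossEntropyLoss(weight=weights)  # the "weighted CE" variant
```

With this scheme a class at exactly average frequency gets weight 1.0, which keeps the overall loss scale comparable to the unweighted runs.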
### Full Training: The CORAL Wall (5 Epochs) Trained the top 2 configurations for 5 epochs each (~1.5 hours per run): **base_weighted_ce (5 epochs):** | Epoch | Combined | Cat F1 | Spec F1 | QWK | |-------|----------|--------|---------|-----| | 1 | 0.670 | 0.879 | 0.461 | 0.800 | | 3 | 0.704 | 0.924 | 0.485 | 0.833 | | 5 | **0.724** | **0.932** | **0.517** | **0.840** | Category F1 reached 0.932 — well above the 0.80 target. But specificity F1 plateaued at 0.517. Per-class breakdown revealed the problem: | Specificity | F1 | |-------------|-----| | L1 (Generic) | 0.79 | | L2 (Domain-Adapted) | **0.29** | | L3 (Firm-Specific) | **0.31** | | L4 (Quantified) | 0.55 | L2 and L3 were dragging macro F1 down to 0.52. QWK was 0.84 — meaning the model's ordinal *ranking* was good (rarely confusing L1 with L4), but the exact *boundary placement* between adjacent levels was fuzzy. ### The CORAL Diagnosis CORAL uses a single weight vector **w** with shifted biases: logit_k = **w**·**x** + b_k. This means the *same features* separate L1 from L2 as separate L3 from L4. But the three specificity transitions require fundamentally different evidence: - **L1→L2:** Cybersecurity terminology detection (the ERM test — does the paragraph use language a general business professional wouldn't?) - **L2→L3:** Firm-unique fact detection (named roles, specific systems, internal programs) - **L3→L4:** Quantified/verifiable claim detection (dollar amounts, dates, third-party firm names) A single shared weight vector cannot simultaneously encode "presence of domain terminology," "presence of named entities," and "presence of numerical quantities" — these are orthogonal signal types in the embedding space. CORAL's structural constraint was forcing the model to find one feature direction that approximates all three, resulting in blurry boundaries everywhere. Additionally, [CLS] token pooling loses distributed signals. 
A paragraph that mentions "CISO" once in a subordinate clause should be L3, but [CLS] may not attend strongly to that one token.

### Architecture Iteration: Independent Thresholds

Replaced CORAL with four changes (implemented in `python/src/finetune/model.py`):

1. **Independent threshold heads.** Three separate binary classifiers, each with its own `Linear(1024→256→1)` MLP:
   - threshold_L2plus: "Has any qualifying facts?" (L1 vs L2+)
   - threshold_L3plus: "Has firm-specific facts?" (≤L2 vs L3+)
   - threshold_L4: "Has quantified facts?" (≤L3 vs L4)

   Same cumulative binary targets as CORAL (label k → [1]×k + [0]×(3−k)), but each threshold learns independent features. The prediction is: level = count(sigmoid(logit_k) > 0.5).

2. **Attention pooling.** Replaced [CLS] with a learned attention pool over all token representations. This lets the model attend to specific evidence tokens (CISO, $2M, NIST) distributed anywhere in the paragraph.

3. **Specificity confidence filtering.** Only compute specificity loss on paragraphs where all 3 Grok runs agreed on specificity (91.3% of training data, as tracked in consensus `specificityAgreement.agreed`). The ~6K disagreement cases are exactly the noisy boundary labels that confuse the model. Category loss still uses all samples.

4. **Ordinal consistency regularization.** Penalty (weight 0.1) when threshold k fires but threshold k−1 doesn't — e.g., the model says "has firm-specific facts" but not "has domain terms." This enforces the cumulative structure without the rigidity of CORAL's shared weights.

### Results: The Independent Threshold Breakthrough

**Config:** `configs/finetune/iter1-independent.yaml` — base ModernBERT-large, independent thresholds with 256-dim MLP, attention pooling, spec confidence filtering, 15 epochs.
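The threshold heads, attention pooling, and consistency penalty can be sketched in PyTorch. This is an illustrative reconstruction of the ideas in `model.py`, not the file itself; the GELU activation, class names, and masking details are assumptions.

```python
import torch
import torch.nn as nn

class AttentionPool(nn.Module):
    """Learned attention pooling over all tokens (replaces [CLS]), so
    single evidence tokens (CISO, $2M, NIST) can dominate the summary."""

    def __init__(self, hidden_dim: int):
        super().__init__()
        self.scorer = nn.Linear(hidden_dim, 1)

    def forward(self, hidden: torch.Tensor, mask: torch.Tensor) -> torch.Tensor:
        scores = self.scorer(hidden).squeeze(-1)               # (batch, seq)
        scores = scores.masked_fill(mask == 0, float("-inf"))  # ignore padding
        attn = torch.softmax(scores, dim=-1)
        return torch.einsum("bs,bsh->bh", attn, hidden)        # (batch, hidden)

class IndependentThresholdHead(nn.Module):
    """Three independent binary classifiers for the cumulative L1<L2<L3<L4
    targets; unlike CORAL, each threshold learns its own features."""

    def __init__(self, hidden_dim: int = 1024, mlp_dim: int = 256, num_thresholds: int = 3):
        super().__init__()
        self.thresholds = nn.ModuleList(
            nn.Sequential(nn.Linear(hidden_dim, mlp_dim), nn.GELU(), nn.Linear(mlp_dim, 1))
            for _ in range(num_thresholds)
        )

    def forward(self, pooled: torch.Tensor) -> torch.Tensor:
        return torch.cat([t(pooled) for t in self.thresholds], dim=-1)  # (batch, 3)

    def predict_level(self, pooled: torch.Tensor) -> torch.Tensor:
        # level index = number of thresholds that fire (0 → L1 ... 3 → L4)
        return (torch.sigmoid(self.forward(pooled)) > 0.5).sum(dim=-1)

def ordinal_consistency_penalty(logits: torch.Tensor) -> torch.Tensor:
    """Soft penalty when threshold k fires but threshold k-1 does not."""
    p = torch.sigmoid(logits)
    return torch.relu(p[:, 1:] - p[:, :-1]).mean()
```

The penalty replaces CORAL's hard rank-consistency constraint with a soft one: thresholds are free to use different feature directions, and inconsistent orderings are merely discouraged (at weight 0.1) rather than impossible.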
| Epoch | Combined | Cat F1 | Spec F1 | QWK | L2 F1 | L3 F1 | |-------|----------|--------|---------|-----|-------|-------| | 1 | 0.855 | 0.867 | **0.844** | 0.874 | 0.782 | 0.821 | | 2 | 0.913 | 0.909 | **0.918** | 0.935 | 0.887 | 0.911 | | 3 | 0.925 | 0.919 | 0.931 | 0.945 | 0.893 | 0.926 | | 5 | 0.938 | 0.936 | 0.940 | 0.949 | — | — | | **8** | **0.944** | **0.943** | **0.945** | **0.952** | **0.923** | **0.940** | | 10 | 0.944 | 0.943 | 0.945 | 0.952 | — | — | The model exceeded 0.80 on both heads **at epoch 1**. By epoch 8 it plateaued at **0.944 combined F1 (cat=0.943, spec=0.945, QWK=0.952)**. Training was stopped at epoch 11 — the train-eval loss gap (0.06 vs 0.49, ~8×) indicated the model was memorizing without further improving eval metrics. **The improvement was transformative.** Spec F1: 0.517 → 0.945 (+0.428). L2 F1: 0.29 → 0.92. L3 F1: 0.31 → 0.94. The independent thresholds + attention pooling + confidence filtering combination addressed all three root causes simultaneously. **What mattered most?** The independent thresholds were the primary driver. CORAL's shared weight vector was the bottleneck — when we let each ordinal transition learn its own features, the model immediately distinguished the three types of specificity evidence. Attention pooling and confidence filtering likely contributed meaningful improvements, but we did not run an ablation to isolate their individual contributions (the combined effect was so strong that decomposition was deprioritized). ### Overfitting Observations Encoder models absolutely can overfit. The 8× train-eval loss gap by epoch 10 is substantial. However, eval *metrics* (F1, QWK) remained stable from epoch 8–11, exhibiting "benign overfitting" — the model becomes more confident on training examples (lower train loss) without changing its decision boundaries (stable eval F1). The practical implication: monitor eval F1 for model selection, not eval loss. 
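The monitoring rule can be made concrete with a small early-stopping sketch. The helper is hypothetical, and eval-F1 values for epochs not shown in the table above are interpolated for illustration.

```python
def early_stop(f1_by_epoch, patience: int = 3, min_delta: float = 0.0):
    """Stop once `patience` epochs pass without eval spec macro F1
    improving by more than `min_delta`; return (stop_epoch, best_epoch).
    Selection is driven by eval F1, never by eval loss."""
    best_f1, best_epoch, stale = float("-inf"), 0, 0
    for epoch, f1 in enumerate(f1_by_epoch, start=1):
        if f1 > best_f1 + min_delta:
            best_f1, best_epoch, stale = f1, epoch, 0
        else:
            stale += 1
            if stale >= patience:
                return epoch, best_epoch
    return len(f1_by_epoch), best_epoch

# spec F1 by epoch (values for epochs 4, 6, 7, 9, 11 interpolated):
# improves through epoch 8, then flat while train loss keeps falling
spec_f1 = [0.844, 0.918, 0.931, 0.936, 0.940, 0.942, 0.944, 0.945, 0.945, 0.945, 0.945]
stop_epoch, best_epoch = early_stop(spec_f1, patience=3)  # → (11, 8)
```

Under benign overfitting this rule halts at epoch 11 while keeping the epoch-8 checkpoint, matching what the run above did manually.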
For future runs: increase `save_total_limit` to preserve all epoch checkpoints, and add early stopping with patience ≥ 3 on `spec_macro_f1`. ### Training Configuration Reference | Parameter | Value | |-----------|-------| | Backbone | answerdotai/ModernBERT-large (395M params) | | Pooling | Learned attention | | Category head | Linear(1024, 7) + weighted CE | | Specificity head | 3× Independent(Linear(1024→256→1)) + cumulative BCE | | Ordinal consistency | 0.1 weight | | Spec confidence filter | Unanimous labels only (91.3% of data) | | Batch size | 32 | | Learning rate | 5e-5 | | Warmup | 10% of total steps | | Precision | bf16 + tf32 | | Attention | Flash Attention 2 | | Compilation | torch.compile | | Optimizer | AdamW (fused) | | Peak VRAM | ~18 GB / 24.6 GB (RTX 3090) | | Training speed | ~2.1 it/s (batch 32, seq 512) | | Best epoch | 8 (stable through 11) | **Checkpoint:** `checkpoints/finetune/iter1-independent/final/` ### What Remains These metrics are on the validation set — same distribution as training (Grok ×3 consensus labels). The true test is the **holdout gold set** with human labels, which may reveal: - Systematic Grok-vs-human disagreements (especially at L2/L3 boundaries) - Whether the model learned Grok's biases rather than the underlying construct - Per-class F1 on the more diverse holdout distribution (the training data overrepresents RMP at 43%) As a proxy before human labels arrive, evaluation against GPT-5.4 and Opus benchmark labels on the holdout will provide an intermediate signal. --- ## Phase 9: Holdout Evaluation — Proxy Gold Results ### Evaluation Setup Built a comprehensive evaluation pipeline (`python/src/finetune/eval.py`) to test the trained model on the 1,200-paragraph holdout set. 
Since human gold labels were not yet available, we used two frontier API models as proxy references: - **GPT-5.4** (1,200 labels, ~$3,400/1M texts, ~2,900ms/sample) - **Opus-4.6** (1,200 labels, ~$5,000/1M texts, ~6,000ms/sample) Both references used the same v4.5 prompt as the Grok training labels but are different model families — they provide independent validation that the fine-tuned model learned the construct, not just Grok's idiosyncrasies. The evaluation computed: macro/weighted F1, per-class F1, precision, recall, MCC, AUC (one-vs-rest), QWK, MAE, Krippendorff's alpha (nominal for category, ordinal for specificity), confusion matrices, and calibration (ECE). ### Results: Independent Thresholds (Epoch 8, Best Model) | Metric | vs GPT-5.4 | vs Opus-4.6 | |--------|-----------|-------------| | **Cat Macro F1** | **0.934** | **0.923** | | **Spec Macro F1** | **0.895** | **0.883** | | Cat MCC | 0.923 | 0.909 | | Cat AUC (OvR) | 0.992 | 0.994 | | Spec QWK | 0.932 | 0.923 | | Spec MAE | 0.118 | 0.136 | | Cat Kripp α | 0.922 | 0.909 | | Spec Kripp α | 0.918 | 0.907 | | Cat ECE | 0.054 | 0.066 | | Throughput | **178 samples/sec** | — | | Latency | **5.6ms/sample** | — | Both heads pass the 0.80 macro F1 target by wide margins on held-out data against independent reference models. Per-class category F1 (vs GPT-5.4): Board Gov. 0.972, Incident Disc. 0.961, Mgmt Role 0.941, None/Other 0.888, Risk Mgmt Proc. 0.856, Strategy Int. 0.958, Third-Party 0.959. RMP is the weakest category (0.856) due to MR↔RMP boundary ambiguity, but still comfortably above target. Per-class specificity F1 (vs GPT-5.4): L1 0.936, L2 0.798, L3 0.894, L4 0.954. L2 is the weakest level — analyzed in detail below. 
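The headline specificity metrics are standard scikit-learn calls; a sketch of the metric definitions (not the project's `eval.py`) with a toy eight-paragraph example:

```python
import numpy as np
from sklearn.metrics import cohen_kappa_score, f1_score, mean_absolute_error

def specificity_metrics(ref_levels, pred_levels):
    """Macro F1, quadratic-weighted kappa (QWK), and MAE over the ordinal
    levels 1-4, as reported in the evaluation tables. QWK penalizes
    L1-vs-L4 confusions far more than adjacent-level ones."""
    return {
        "macro_f1": f1_score(ref_levels, pred_levels, average="macro"),
        "qwk": cohen_kappa_score(ref_levels, pred_levels, weights="quadratic"),
        "mae": mean_absolute_error(ref_levels, pred_levels),
    }

# toy example: one adjacent-level disagreement out of eight paragraphs
ref = np.array([1, 2, 3, 4, 2, 1, 3, 4])
pred = np.array([1, 2, 3, 4, 3, 1, 3, 4])
metrics = specificity_metrics(ref, pred)  # mae = 0.125
```

The quadratic weighting is why a model can show high QWK (0.84 for the CORAL run) alongside low macro F1: ranking errors are rare even when exact boundary placement is wrong.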
### Results: CORAL Baseline (Epoch 5) — For Comparison | Metric | vs GPT-5.4 | vs Opus-4.6 | |--------|-----------|-------------| | Cat Macro F1 | 0.936 | 0.928 | | **Spec Macro F1** | **0.597** | **0.596** | | Spec QWK | 0.876 | 0.872 | The category heads are essentially identical between models — the backbone handles category well regardless of specificity architecture. The +0.298 spec F1 improvement is entirely attributable to the independent threshold heads. CORAL's confusion matrix reveals the mechanism: it collapses L2 (F1=0.407) and L3 (F1=0.369) into L1 and L4, predicting extreme levels because the shared weight vector can't represent the intermediate transitions. The independent threshold model's confusion matrix shows clean diagonals across all four levels. ### Reference Agreement Ceiling A critical finding: **the model agrees with the references more than the references agree with each other.** | Comparison | Macro Spec F1 | L2 F1 | |-----------|---------------|-------| | GPT-5.4 vs Opus-4.6 | **0.885** | **0.805** | | Our model vs GPT-5.4 | **0.895** | 0.798 | | Our model vs Opus-4.6 | 0.883 | 0.776 | | Stage 1 Consensus vs GPT-5.4 | 0.911 | 0.845 | Our model's macro spec F1 (0.895) exceeds the inter-reference agreement (0.885). This means the model learned a "consensus position" that is more consistent than either individual reference. Further improvements against these proxy references are not meaningful — they would represent overfitting to one reference's idiosyncrasies rather than genuine improvement. The L2 F1 of 0.798 is within 0.007 of the reference ceiling (0.805). The L1↔L2 boundary is the hardest in the construct — it hinges on whether language is "domain-specific" enough to qualify (the ERM test). Paragraphs using quasi-domain language (e.g., "risk management program for cybersecurity") sit in a genuine gray zone where even frontier models disagree. ### L2 Error Analysis The L2 confusion is directional. 
Against GPT-5.4: - 29 L2 paragraphs misclassified as L1 (model under-calls domain terminology) - 23 L1 paragraphs misclassified as L2 (model over-calls domain terminology) - Only 7 L2→L3 and 2 L2→L4 errors (higher transitions are clean) This is the L1↔L2 boundary problem in isolation — the model handles L2↔L3 and L3↔L4 transitions with high accuracy. The ERM test ("would an employee relations manager understand this language?") is inherently subjective at the margin. ### Category × Specificity Joint Distribution The holdout set reveals strong correlation between category and specificity: | Category | L1 | L2 | L3 | L4 | |---------|-----|-----|-----|-----| | None/Other | **100%** | 0% | 0% | 0% | | Strategy Integration | **85%** | 10% | 2% | 3% | | Third-Party Risk | 62% | **22%** | 12% | 5% | | Risk Mgmt Process | 34% | **44%** | 16% | 6% | | Board Governance | 42% | 4% | **45%** | 9% | | Management Role | 13% | 3% | 29% | **54%** | | Incident Disclosure | 0% | 8% | 2% | **90%** | Despite this correlation, the current architecture treats specificity as category-independent (by design — per the codebook, specificity measures "how specific" regardless of "what about"). Making specificity category-dependent was considered but rejected: the cell sizes for many (category, spec_level) combinations are too small for reliable conditional modeling, and error propagation from category mistakes would corrupt specificity predictions. The strong correlations are already captured implicitly by the shared backbone. This remains a potential direction for future investigation with a larger dataset. 
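A joint table like the one above is a row-normalized crosstab of the two label columns; a minimal sketch with hypothetical labels:

```python
import pandas as pd

# hypothetical paragraph labels; the real holdout has 1,200 rows
df = pd.DataFrame({
    "category": ["SI", "SI", "MR", "MR", "ID", "RMP", "RMP", "BG"],
    "spec":     ["L1", "L1", "L4", "L3", "L4", "L2", "L1", "L3"],
})

# each cell is P(spec level | category); rows sum to 1.0
joint = pd.crosstab(df["category"], df["spec"], normalize="index")
```

`normalize="index"` is what makes each category row a conditional distribution, which is the form the holdout table uses.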
### Sequence Length Analysis At max_seq_length=512, truncation is negligible: | Dataset | Mean tokens | P95 | P99 | Max | Truncated (>512) | |---------|------------|-----|-----|-----|-----------------| | All paragraphs (72K) | 114.6 | 240 | 350 | 678 | 139 (0.19%) | | Holdout (1,200) | 117.9 | 236 | 329 | 603 | 1 (0.08%) | SEC cybersecurity disclosure paragraphs are short by nature (median ~100 tokens). The 512-token limit is more than sufficient — increasing to 1024 would affect only 139 training paragraphs and 1 holdout paragraph. ### Speed and Cost Comparison | System | Latency | Throughput | Cost/1M texts | Reproducible | |--------|---------|-----------|---------------|-------------| | **Fine-tuned specialist** | **5.6ms** | **178/sec** | **~$5** | **Yes** | | GPT-5.4 (API) | ~2,900ms | ~0.3/sec | ~$3,400 | No | | Opus-4.6 (API) | ~6,000ms | ~0.2/sec | ~$5,000 | No | The fine-tuned model is **520× faster** than GPT-5.4 and **1,070× faster** than Opus-4.6, at **~680-1,000× lower cost**, with comparable or better accuracy and full determinism. ### Calibration The model is well-calibrated for category (ECE=0.054 vs GPT-5.4) and reasonably calibrated for specificity (ECE=0.071). The calibration plot shows slight overconfidence in the 0.7-0.9 range — consistent with the "benign overfitting" observed during training where the model became more confident without changing decision boundaries. Temperature scaling could improve calibration without affecting predictions (a single scalar adjustment on validation logits), which would be valuable for deployment confidence thresholds. ### Remaining Opportunities **Threshold tuning (free, post-gold):** Once human gold labels arrive, grid-search the per-threshold sigmoid cutoffs. Currently all thresholds use 0.5 — the optimal L1→L2 cutoff may differ. This requires no retraining and could gain +0.01-0.02 on L2 F1. **Ensemble (3 seeds, +0.01-0.03 F1):** Train 3 models with seeds 42/43/44, average sigmoid outputs. 
Reduces variance on boundary cases and provides confidence intervals for reported metrics. Cost: 3× training time (~24h total), 3× inference time (~17ms/sample). **Temperature scaling (free, improves calibration only):** Fit a single temperature parameter on the validation set. Reduces ECE without changing predictions — relevant for deployment where confidence scores matter. **Larger specificity MLP (future investigation):** The current 256-dim MLP is efficient but may not capture the full complexity of subtle specificity distinctions. Larger heads (512-dim or 3-layer) could help if the dataset grows, but risk overfitting at current data scale. ### Figures Generated All evaluation figures saved to `results/eval/`: - `iter1-independent/figures/` — confusion matrices (cat + spec), calibration reliability diagrams, per-class F1 bar charts (vs GPT-5.4 and Opus-4.6 separately) - `coral-baseline/figures/` — same set for CORAL baseline comparison - `comparison/` — side-by-side CORAL vs Independent (per-class F1 bars, all-metrics comparison, improvement delta chart, confusion matrix comparison, summary table) --- ## v1 Reference The complete v1 narrative — Stage 1 prompt engineering (12+ iterations), model benchmarking (21+ models, 12 providers), human labeling webapp, gold set adjudication (13-signal cross-analysis), codebook iterations v1.0–v3.5 — is preserved at `docs/NARRATIVE-v1.md`. Key v1 deliverables carried forward: - 72,045-paragraph corpus with quality tiers - DAPT checkpoint (eval loss 0.7250, perplexity 1.65) - TAPT checkpoint (eval loss 1.0754, perplexity 2.11) - Model census: 21+ models evaluated across 12 providers - Human labeling webapp (labelapp) — will be updated for v2 codebook - Empirical evidence for every v2 codebook decision --- ## References - Warner, B., Clavié, B., Soldaini, L., et al. (2024). "Smarter, Better, Faster, Longer: A Modern Bidirectional Encoder for Fast, Memory Efficient, and Long Context Fine-tuning and Inference." 
arXiv:2412.13663.
- Gururangan, S., Marasović, A., Swayamdipta, S., Lo, K., Beltagy, I., Downey, D., & Smith, N.A. (2020). "Don't Stop Pretraining: Adapt Language Models to Domains and Tasks." *Proceedings of ACL 2020*, pp. 8342-8360.
- Cao, W., Mirjalili, V., & Raschka, S. (2020). "Rank Consistent Ordinal Regression for Neural Networks with Application to Age Estimation." *Pattern Recognition Letters*, 140, 325-331.
- Ponnock, J. (2025). "The Data Efficiency Frontier of Financial Foundation Models: Scaling Laws from Continued Pretraining." arXiv:2512.12384.
- Sounack, T., et al. (2025). "BioClinical ModernBERT: A Domain-Adapted Encoder for Biomedical and Clinical NLP." arXiv:2506.10896.
- Luo, Z., et al. (2025). "Patent ModernBERT: A Pretrained Language Model for Intellectual Property." arXiv:2509.14926.
- Dao, T. (2024). "FlashAttention-2: Faster Attention with Better Parallelism and Work Partitioning." *Proceedings of ICLR 2024*.
- Ringel, D.M. (2023). "Creating Synthetic Experts with Generative Artificial Intelligence." arXiv:2310.15560.