
Project Narrative — SEC Cybersecurity Disclosure Quality Classifier

This document captures the process, roadblocks, decisions, and resolutions from building the SEC cybersecurity disclosure quality classifier. It serves as the source material for the final paper and presentation.


Phase 1: Project Scoping and Construct Design

The Problem

SEC Release 33-11216 (July 2023) created a new annual cybersecurity disclosure requirement (10-K Item 1C) and an incident disclosure requirement (8-K Item 1.05). By FY2024, ~9,000-10,000 filings exist. No validated classifier or public labeled dataset exists for assessing the quality of these disclosures. Investors, regulators, and compliance officers need scalable tools to distinguish substantive disclosures from boilerplate.

Methodology Decision: Ringel (2023) "Synthetic Experts"

We adopted the Ringel (2023) "Synthetic Experts" pipeline: use frontier LLMs to generate training labels at scale, then distill into an efficient encoder model. This approach was chosen because:

  • Manual labeling of 50,000+ paragraphs is infeasible for a 6-person team
  • Multiple cheap LLMs annotating in parallel provide built-in quality control through inter-annotator agreement
  • The encoder distillation step produces a model that can classify at inference time without LLM API costs

Construct: Two Classification Dimensions

We defined two simultaneous classification tasks per paragraph:

  1. Content Category (7 mutually exclusive classes) — what the paragraph is about, grounded in the SEC rule's own structure (Board Governance, Management Role, Risk Management Process, Third-Party Risk, Incident Disclosure, Strategy Integration, None/Other)
  2. Specificity Level (4-point ordinal) — how company-specific the disclosure is, from generic boilerplate to quantified-verifiable facts

The construct maps to NIST CSF 2.0 categories for academic grounding.


Phase 2: Data Acquisition and Corpus Construction

The Extraction Problem

SEC filings are not structured data. They're HTML generated from PDFs, XBRL, and Word documents by dozens of different tools, each producing different artifacts. Building a reliable extraction pipeline for ~9,000 filings meant solving a series of messy, real-world data engineering problems.

Pipeline Architecture

Built in TypeScript (~1,000 lines of extraction code across parse-item1c.ts, segment.ts, fast-reparse.ts, and pipeline orchestration):

EDGAR Master Index → enumerate 10-K filings → download HTML → extract Item 1C → segment paragraphs → JSONL
submissions.zip → scan for 8-K Item 1.05 → download HTML → extract → segment → merge with 10-K corpus

Roadblock: HTML Variability

Every filing's HTML is different. The same logical content looks completely different depending on the tool that generated the HTML:

  • Word splitting from inline elements. XBRL and styling tags break words mid-token: <span>It</span><span>em 2</span> renders correctly in a browser, but naive text extraction either runs fragments together or inserts a space at every element boundary ("It em 2"). Required detecting adjacent inline element boundaries and inserting spaces selectively.
  • CamelCase joins from PDF converters. PDF-to-HTML tools merge sentences across formatting boundaries: sentence.Next sentence instead of sentence. Next sentence. Required regex passes to detect missing spaces after punctuation.
  • Page breaks mid-sentence. Page numbers, running headers, and subsidiary headers get spliced into the middle of content paragraphs. Required filtering a catalog of page artifact patterns.
  • Table of Contents shadowing. "Item 1C" appears at least twice in every 10-K — once in the Table of Contents and once in the actual content. Using the first match extracts the wrong section, so the LAST match is required. This was a silent failure that produced empty or wrong extractions for hundreds of filings before we caught it.
  • XBRL tag pollution. Inline XBRL wraps financial facts in ix:header, ix:references, and ix:nonFraction tags that carry no display content but add noise.
  • Entity encoding chaos. &nbsp;, &#160;, &ldquo;, &rdquo;, &mdash;, &ndash;, &bull; — each needs correct decoding, and different filing tools use different entity styles for the same characters.
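
The fixes above can be sketched in a few illustrative passes (a simplified Python sketch; the actual rules live in the project's TypeScript parse-item1c.ts and are more selective):

```python
import html
import re

def clean_filing_text(raw: str) -> str:
    """Illustrative cleanup passes for EDGAR filing HTML (simplified)."""
    # Drop inline-XBRL bookkeeping tags that carry no display content.
    raw = re.sub(r"<ix:(header|references)\b.*?</ix:\1>", "", raw, flags=re.S | re.I)
    # Rejoin words split across adjacent inline spans: <span>It</span><span>em 2</span>.
    raw = re.sub(r"</span>\s*<span[^>]*>", "", raw, flags=re.I)
    # Strip remaining tags, then decode entities (&nbsp;, &#160;, &ldquo;, ...).
    text = html.unescape(re.sub(r"<[^>]+>", " ", raw)).replace("\xa0", " ")
    # Fix CamelCase joins from PDF converters: missing space after punctuation.
    text = re.sub(r"([.!?])([A-Z])", r"\1 \2", text)
    return re.sub(r"\s+", " ", text).strip()

def find_item_1c_start(text: str) -> int:
    """Take the LAST 'Item 1C' match: the first one usually sits in the
    Table of Contents, not at the section itself."""
    matches = list(re.finditer(r"Item\s+1C", text, flags=re.I))
    return matches[-1].start() if matches else -1
```

Each pass maps to one bullet above; ordering matters (span rejoining must happen before tags are stripped).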

Paragraph Segmentation

After extracting clean section text, splitting into paragraphs had its own challenges:

  • Bullet list merging. Disclosures frequently use bullet lists. Bullets need to be merged with their intro sentence; a standalone "• vulnerability scanning" is meaningless.
  • Continuation line detection. Sentences split across HTML block elements need rejoining.
  • Length boundaries. Under 20 words → likely a header (filtered). Over 500 words → split at sentence boundaries.
  • Table-based bullet lists and the cascade failure. Some generators render bullet lists as HTML tables with non-standard bullet characters. Since stripHtml() doesn't recognize &#183; as a bullet marker, the merge logic never fires, causing multi-element run-on paragraphs. Found 2,210 paragraphs affected.
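
The segmentation rules above can be sketched as follows (an illustrative Python sketch of the logic described, not the project's TypeScript segment.ts):

```python
import re

# Bullet markers, including the non-standard &#183; seen in table-based lists.
BULLET = re.compile(r"^\s*(?:•|·|&#183;|-)\s+")

def segment_paragraphs(blocks: list[str]) -> list[str]:
    """Merge bullets into their intro sentence, drop header-like blocks,
    and split over-long paragraphs at sentence boundaries."""
    merged: list[str] = []
    for block in (b.strip() for b in blocks if b.strip()):
        if BULLET.match(block) and merged:
            # A standalone "• vulnerability scanning" is meaningless alone;
            # attach it to the preceding intro sentence.
            merged[-1] += " " + BULLET.sub("", block)
        else:
            merged.append(block)
    paras = [p for p in merged if len(p.split()) >= 20]  # <20 words: likely a header
    out: list[str] = []
    for p in paras:
        if len(p.split()) <= 500:
            out.append(p)
        else:
            # >500 words: split at sentence boundaries (crude splitter).
            out.extend(re.split(r"(?<=[.!?])\s+", p))
    return out
```

Note the ordering: bullets must be merged before the length filter runs, otherwise a short intro line would be discarded as a header before its bullets attach.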

8-K Extraction

Roadblock: EDGAR full-text search misses filings. The EFTS keyword search doesn't reliably return all cybersecurity 8-Ks. Post-May 2024, companies moved non-material disclosures from Item 1.05 to Items 8.01 or 7.01.

Resolution: Built scan-8k-items.py to scan the SEC's bulk submissions.zip deterministically — a gap-free scan of every 8-K with cybersecurity content. Result: 207 cybersecurity incident 8-K filings identified.
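
The deterministic scan can be sketched like this (the parallel-array layout matches the EDGAR submissions API; scan-8k-items.py's actual logic, including the content-keyword pass needed to catch Items 8.01/7.01, is more involved):

```python
import json

def find_item_105_8ks(company_json: str) -> list[str]:
    """Return accession numbers of a company's 8-Ks tagged with Item 1.05.
    `company_json` is one CIK's file from the SEC bulk submissions.zip;
    the 'recent' block stores filings as parallel arrays."""
    recent = json.loads(company_json)["filings"]["recent"]
    hits = []
    for form, items, accession in zip(
        recent["form"], recent["items"], recent["accessionNumber"]
    ):
        # `items` is a comma-separated string like "1.05,9.01".
        if form.startswith("8-K") and "1.05" in items.split(","):
            hits.append(accession)
    return hits
```

Because every 8-K in the bulk archive is enumerated, nothing depends on the full-text search index, which is where the EFTS gaps came from.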

Corpus Statistics

  • 72,045 paragraphs from ~9,000 filings (FY2023 + FY2024 + early FY2025)
  • All from 10-K Item 1C; paragraphs from the 207 8-K filings extracted separately
  • Median ~7 paragraphs per filing
  • 49,795 paragraphs annotated (after filtering to complete filing metadata)

Phase 3: Data Quality Audit and Corpus Remediation

The Discovery

While preparing the DAPT corpus, we discovered two systematic issues silently corrupting the data:

  1. Orphan words. HTML source wraps text at fixed column width. When a <span> tag consumes most of a line, only the first word fits before the source newline. 4.7% of all paragraphs affected.
  2. Inlined section headings. 22% of paragraphs had section titles prepended to body text — a near-perfect predictor of content_category that creates shortcut learning risk.

Generator Investigation

Identified 14 distinct filing generators covering 99.99% of 14,759 HTML files. The worst generator (EFiling/EDGAR Agent) accounted for 13.5% of filings but had a 36.8% orphan-word rate (8x the corpus average). Clean generators (Workiva, Donnelley, Inline XBRL) all had <1% rates. Full reference: docs/EDGAR-FILING-GENERATORS.md.

Six Surgical Patches

All fixes follow one principle: paragraphs-clean.jsonl is frozen. Every fix goes through .patched.jsonl files linked by paragraph UUID.

| Patch | Method | Paragraphs |
|---|---|---|
| 1-2. Orphan word restoration | HTML lookback extraction | 2,233 |
| 3-6. Heading strip (4 passes) | Pattern match + HTML-confirmed | 8,411 |
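
The frozen-file principle can be sketched as a read-time overlay keyed by UUID (field names are illustrative):

```python
import json

def load_with_patches(base_lines: list[str], patch_lines: list[str]) -> list[dict]:
    """paragraphs-clean.jsonl is never edited; fixes live in .patched.jsonl
    files keyed by paragraph UUID and are merged when the corpus is read."""
    patches = {p["uuid"]: p for p in map(json.loads, patch_lines)}
    rows = []
    for line in base_lines:
        row = json.loads(line)
        patch = patches.get(row["uuid"])
        if patch is not None:
            row["text"] = patch["text"]  # patched text wins over the frozen base
        rows.append(row)
    return rows
```

Because the base file never changes, any patch can be audited, re-run, or reverted independently of the others.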

Quality Tier System

| Tier | Criteria | Count | % |
|---|---|---|---|
| clean | No detected issues | 58,165 | 80.7% |
| headed | Had inlined heading (now stripped) | 7,402 | 10.3% |
| degraded | Embedded bullets, invisible merges, fragments, truncations | 4,331 | 6.0% |
| minor | Had orphan word (now fixed) | 2,147 | 3.0% |

Degraded paragraphs downweighted 0.5x during fine-tuning.
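
A sketch of how the downweighting can enter the fine-tuning loss (weights for the other tiers are an assumption; only degraded = 0.5x is stated above):

```python
# Assumed weight map: only the degraded tier's 0.5x is given in the text.
TIER_WEIGHTS = {"clean": 1.0, "headed": 1.0, "minor": 1.0, "degraded": 0.5}

def weighted_mean_loss(losses: list[float], tiers: list[str]) -> float:
    """Weighted average of per-example losses, so degraded paragraphs
    contribute half as much to the gradient as clean ones."""
    weights = [TIER_WEIGHTS.get(t, 1.0) for t in tiers]
    return sum(l * w for l, w in zip(losses, weights)) / sum(weights)
```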


Phase 4: Pre-Training — DAPT + TAPT

DAPT: Domain-Adaptive Pre-Training

Chose our own ~9,000 cached filings over PleIAs/SEC (373K on HuggingFace):

  • Recency > volume — Item 1C didn't exist before FY2023
  • Diminishing returns past 250M tokens (Ponnock 2025)
  • We control cleaning quality
  • Feasible on a single RTX 3090

Corpus: 14,568 docs, ~1.056B tokens. Subsampled to newest 500M tokens.

Key optimizations: Flash Attention 2 (47s→27s/step), torch.compile (halved activation memory), corpus subsampling (29h→13.5h).

Results: Eval loss 0.7250, perplexity 1.65. 1 epoch, ~14.5h on RTX 3090. Checkpoint: checkpoints/dapt/modernbert-large/final/.

TAPT: Task-Adaptive Pre-Training

72K Item 1C paragraphs (~10M tokens). 5 epochs with whole-word masking at seq_len=512.

Bugs fought: four bugs in the transformers whole-word masking implementation for BPE tokenizers, plus a Python 3.14 incompatibility. A custom WholeWordMaskCollator was built from scratch.

Results: Loss 1.46→1.08, eval loss 1.0754, perplexity 2.11. 50 minutes on RTX 3090. Checkpoint: checkpoints/tapt/modernbert-large/final/.

Training Pipeline

ModernBERT-large (base, 395M params)
    → DAPT on 9K full 10-K filings (~500M tokens, ~14.5h) → SEC-ModernBERT-large
    → TAPT on 72K Item 1C paragraphs (~10M tokens, ~50min) → SEC-cyBERT-large
    → Fine-tune on labeled data with dual classification heads → Final classifier

Phase 5: Truncated Filing Exclusion

72 filings (~0.8%) had section boundary detection cut off mid-sentence. These are excluded from training splits: filings whose last paragraph doesn't end in terminal punctuation are filtered.



Phase 6: The v2 Reboot — Why We Started Over

What v1 Taught Us

The v1 pipeline produced 150K Stage 1 annotations, a 10-model benchmark, human labels from 6 annotators, and extensive gold adjudication. It worked — but evaluation revealed structural problems that no amount of prompt iteration could fix:

  1. Specificity Level 2 was too narrow. Our codebook defined Level 2 as "names a recognized standard" — but the professor's construct says "references industry." Domain-specific practices (penetration testing, vulnerability scanning, SIEM) were classified as Level 1. Level 2 ended up at 3.9% of the holdout (47 samples) — too few for reliable per-class F1.

  2. Level 4 required 2+ QV facts. The construct lists types of qualifying facts, not a minimum count. The artificial threshold created a narrow class and forced annotators into a counting exercise.

  3. The BG/MR/RMP triangle was patched, not fixed. Six decision rules and ten borderline cases accumulated as patches on unchanged definitions. Models processed increasingly complex instructions with diminishing returns.

  4. The holdout was adversarial by design. Stratified to over-sample confusion-axis paragraphs — great for stress-testing the codebook, terrible for evaluation. Combined with narrow Level 2, this structurally depressed F1.

  5. Human specificity agreement was poor. Krippendorff's α = 0.546 on specificity (target: 0.67). The narrow Level 2 definition made it hard for anyone to agree.

The Decision

Rather than continue patching, we decided to:

  • Revise the codebook with systemic changes (broaden Level 2, loosen Level 4, reframe category rules)
  • Take a new random stratified holdout (equal per category class, not overindexed on hard cases)
  • Re-run Stage 1 with the improved codebook/prompt
  • Have humans re-label the new holdout
  • Re-run the benchmark panel
  • Then train

The v1 data pipeline, corpus, DAPT checkpoint, and TAPT checkpoint are all unchanged and carried forward. Only the labeling and evaluation are redone.

What Changed in v2

Codebook (LABELING-CODEBOOK.md):

  • Level 2 broadened from "names a standard" to "uses cybersecurity domain terminology" (the ERM test)
  • Level 4 threshold lowered from 2+ to 1+ QV-eligible fact (the external verifiability test)
  • Category primary test changed to "What question does this paragraph answer?"
  • MR headline changed from "who a specific person is" to "how management is organized to handle cybersecurity"
  • Person-removal test reframed as confirmation tool, not primary rule
  • Materiality rules cleaned up (assessment vs. speculation distinction became a clean rule, not a ruling)
  • IS/NOT lists restructured for new Level 2 boundary
  • Codebook + Ethos split: rules in LABELING-CODEBOOK.md, reasoning in CODEBOOK-ETHOS.md

Holdout:

  • Random stratified sample: ~170 per category class × 7 ≈ 1,190
  • Secondary constraint: minimum ~100 per specificity level
  • NOT overindexed on confusion-axis cases
  • Separate ~200-paragraph dev set for prompt iteration (excluded from holdout)
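
The holdout draw can be sketched as seeded per-class sampling (the field name and seed are illustrative, and the secondary per-specificity-level constraint is not shown):

```python
import random
from collections import defaultdict

def stratified_holdout(rows: list[dict], per_class: int = 170,
                       seed: int = 42) -> list[dict]:
    """Random stratified sample: an equal quota per content category,
    drawn with a fixed seed so the split is reproducible."""
    by_cat: dict[str, list[dict]] = defaultdict(list)
    for row in rows:
        by_cat[row["content_category"]].append(row)
    rng = random.Random(seed)
    sample: list[dict] = []
    for cat in sorted(by_cat):  # sorted: deterministic iteration order
        pool = by_cat[cat]
        sample.extend(rng.sample(pool, min(per_class, len(pool))))
    return sample
```

With ~170 per class across 7 categories this yields the ~1,190-paragraph holdout described above.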

Cost of the Reboot

| Item | Estimated Cost |
|---|---|
| Stage 1 re-run (full corpus) | ~$120 |
| Benchmark re-run (holdout) | ~$45 |
| Prompt iteration | ~$10 |
| Human re-labeling | $0 (team labor) |
| Total additional API | ~$175 |

This is set against the ~$200 already spent on v1 API calls. The DAPT/TAPT compute (~15h GPU) is not redone.


v1 Reference

The complete v1 narrative — Stage 1 prompt engineering (12+ iterations), model benchmarking (21+ models, 12 providers), human labeling webapp, gold set adjudication (13-signal cross-analysis), codebook iterations v1.0–v3.5 — is preserved at docs/NARRATIVE-v1.md.

Key v1 deliverables carried forward:

  • 72,045-paragraph corpus with quality tiers
  • DAPT checkpoint (eval loss 0.7250, perplexity 1.65)
  • TAPT checkpoint (eval loss 1.0754, perplexity 2.11)
  • Model census: 21+ models evaluated across 12 providers
  • Human labeling webapp (labelapp) — will be updated for v2 codebook
  • Empirical evidence for every v2 codebook decision

References

  • Warner, B., Clavié, B., Soldaini, L., et al. (2024). "Smarter, Better, Faster, Longer: A Modern Bidirectional Encoder for Fast, Memory Efficient, and Long Context Fine-tuning and Inference." arXiv:2412.13663.
  • Gururangan, S., Marasovic, A., Swayamdipta, S., Lo, K., Beltagy, I., Downey, D., & Smith, N.A. (2020). "Don't Stop Pretraining: Adapt Language Models to Domains and Tasks." Proceedings of ACL 2020, pp. 8342-8360.
  • Ponnock, J. (2025). "The Data Efficiency Frontier of Financial Foundation Models: Scaling Laws from Continued Pretraining." arXiv:2512.12384.
  • Sounack, T., et al. (2025). "BioClinical ModernBERT: A Domain-Adapted Encoder for Biomedical and Clinical NLP." arXiv:2506.10896.
  • Luo, Z., et al. (2025). "Patent ModernBERT: A Pretrained Language Model for Intellectual Property." arXiv:2509.14926.
  • Dao, T. (2024). "FlashAttention-2: Faster Attention with Better Parallelism and Work Partitioning." Proceedings of ICLR 2024.
  • Ringel, D.M. (2023). "Creating Synthetic Experts with Generative Artificial Intelligence." arXiv:2310.15560.