SEC-cyBERT/docs/PROJECT-OVERVIEW.md
2026-03-28 20:39:36 -04:00


SEC Cybersecurity Disclosure Quality Classifier

Project Summary

Build a validated, reusable classifier that labels SEC cybersecurity disclosures by content category and specificity level, then fine-tune an open-weights encoder model for deployment at scale.

Methodology: Ringel (2023) "Synthetic Experts" pipeline — use frontier LLMs to generate training labels, then distill into a small open-weights encoder model.

Construct: Project 3 from the Capstone Constructs document — "Cybersecurity Governance and Incident Disclosure Quality (SEC-Aligned)."

Three publishable artifacts:

  1. A novel dataset of extracted Item 1C disclosures (no public HuggingFace dataset exists)
  2. A labeling methodology for cybersecurity disclosure quality
  3. A SOTA classifier (SEC-ModernBERT-large — first SEC-specific ModernBERT)

Why This Matters

Cybersecurity risk is among the most financially material operational risks facing firms. In July 2023, the SEC adopted Release 33-11216 requiring:

  • Annual disclosure of cybersecurity risk management, strategy, and governance (10-K Item 1C)
  • Incident disclosure within 4 business days of materiality determination (8-K Item 1.05)

Investors, boards, and regulators need tools to assess whether disclosures are substantive or boilerplate, whether governance structures are robust or ceremonial, and whether incident reports are timely and informative. No validated, construct-aligned classifier exists for this purpose.

Stakeholder

Compliance officers, investor relations teams, institutional investors, and regulators who need to assess disclosure quality at scale across thousands of filings.

What Decisions Classification Enables

  • Investors: Screen for governance quality; identify firms with weak cyber posture before incidents
  • Regulators: Flag filings that may not meet the spirit of the rule
  • Boards: Benchmark their own disclosures against peers
  • Researchers: Large-scale empirical studies of disclosure quality

Error Consequences

  • False positive (labels boilerplate as specific): overstates disclosure quality; less harmful.
  • False negative (labels specific as boilerplate): understates quality and could unfairly penalize well-governed firms; more harmful for investment decisions.
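Because the two error types carry unequal costs, a deployed classifier need not use a symmetric 0.5 decision threshold. A minimal sketch, assuming illustrative (uncalibrated) cost values — the function name and the 2:1 cost ratio are placeholders, not project decisions:

```python
# Illustrative sketch: bias the decision threshold when a false negative
# (calling a genuinely specific disclosure boilerplate) is costlier than a
# false positive. The cost values are placeholders, not calibrated figures.

def predict_specific(p_specific: float, cost_fn: float = 2.0, cost_fp: float = 1.0) -> bool:
    """Predict 'specific' when the expected cost of predicting 'boilerplate'
    (p_specific * cost_fn) exceeds the expected cost of predicting
    'specific' ((1 - p_specific) * cost_fp)."""
    return p_specific * cost_fn > (1 - p_specific) * cost_fp

# With cost_fn = 2 * cost_fp, the effective threshold drops from 0.5 to 1/3,
# so borderline paragraphs are resolved toward "specific".
```

The same asymmetry could instead be expressed as class weights in the fine-tuning loss; the threshold form is shown here only because it requires no retraining.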

Why Now

  • ~9,000-10,000 filings exist (FY2023 + FY2024 cycles)
  • iXBRL CYD taxonomy went live Dec 2024 — programmatic extraction now possible
  • Volume makes manual review infeasible; leadership needs scalable measurement

Construct Definition

Theoretical foundation: Disclosure theory (Verrecchia, 2001) and regulatory compliance as information provision. The SEC rule itself provides a natural taxonomy — its structured requirements map directly to a multi-class classification task.

Unit of analysis: The paragraph within Item 1C (10-K) or Item 1.05 (8-K).

Two classification dimensions applied simultaneously:

Dimension 1: Content Category (single-label, 7 classes)

| Category | SEC Basis | What It Covers |
|---|---|---|
| Board Governance | 106(c)(1) | Board/committee oversight, briefing frequency, board cyber expertise |
| Management Role | 106(c)(2) | CISO/CTO identification, qualifications, reporting structure |
| Risk Management Process | 106(b) | Assessment processes, ERM integration, framework references |
| Third-Party Risk | 106(b) | Vendor oversight, external assessors, supply chain risk |
| Incident Disclosure | 8-K 1.05 | Nature/scope/timing of incidents, material impact, remediation |
| Strategy Integration | 106(b)(2) | Material impact on business strategy, cyber insurance, resource allocation |
| None/Other | n/a | Boilerplate intros, legal disclaimers, non-cybersecurity content |

Dimension 2: Specificity (4-point ordinal scale)

| Level | Label | Decision Test |
|---|---|---|
| 1 | Generic Boilerplate | "Could I paste this into a different company's filing unchanged?" → Yes |
| 2 | Sector-Adapted | "Does this name something specific but not unique to THIS company?" → Yes |
| 3 | Firm-Specific | "Does this contain at least one fact unique to THIS company?" → Yes |
| 4 | Quantified-Verifiable | "Could an outsider verify a specific claim in this paragraph?" → Yes |

Full rubric details, examples, and boundary rules are in LABELING-CODEBOOK.md.
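The two dimensions above can be carried through the pipeline as one typed record per paragraph, so every labeling stage emits the same shape. A minimal sketch — the enum values and field names are our own choices, not prescribed by the rubric:

```python
# Sketch of the two-dimensional label schema as a typed record. Category and
# level names follow the tables above; identifiers are illustrative.
from dataclasses import dataclass
from enum import Enum

class ContentCategory(str, Enum):
    BOARD_GOVERNANCE = "board_governance"
    MANAGEMENT_ROLE = "management_role"
    RISK_MANAGEMENT_PROCESS = "risk_management_process"
    THIRD_PARTY_RISK = "third_party_risk"
    INCIDENT_DISCLOSURE = "incident_disclosure"
    STRATEGY_INTEGRATION = "strategy_integration"
    NONE_OTHER = "none_other"

SPECIFICITY_LEVELS = {1: "Generic Boilerplate", 2: "Sector-Adapted",
                      3: "Firm-Specific", 4: "Quantified-Verifiable"}

@dataclass
class ParagraphLabel:
    paragraph_id: str
    category: ContentCategory   # Dimension 1: single-label, 7 classes
    specificity: int            # Dimension 2: ordinal, 1-4

    def __post_init__(self):
        if self.specificity not in SPECIFICITY_LEVELS:
            raise ValueError(f"specificity must be 1-4, got {self.specificity}")
```

Validating at construction time means malformed GenAI outputs fail loudly before they reach the training set.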


Deliverables Checklist

A) Executive Memo (max 5 pages)

  • Construct definition + why it matters + theoretical grounding
  • Data source + governance/ethics
  • Label schema overview
  • Results summary: best GenAI vs best specialist
  • Cost/time/reproducibility comparison
  • Recommendation for a real firm

B) Technical Appendix (slides or PDF)

  • Pipeline diagram (data → labels → model → evaluation)
  • Label codebook
  • Benchmark table (6+ GenAI models from 3+ suppliers)
  • Fine-tuning experiments + results
  • Error analysis: where does it fail and why?

C) Code + Artifacts

  • Reproducible notebooks
  • Datasets: holdout with human labels, train/test with GenAI labels, all model labels per run + majority labels
  • Saved fine-tuned model + inference script (link to shared drive, not Canvas)
  • Cost/time log

Grading Rubric (100%)

| Component | Weight |
|---|---|
| Business framing & construct clarity | 20% |
| Data pipeline quality + documentation | 15% |
| Human labeling process + reliability | 15% |
| GenAI benchmarking rigor | 20% |
| Fine-tuning rigor + evaluation discipline | 20% |
| Final comparison + recommendation quality | 10% |

Grade Targets

C range: F1 > 0.80, performance comparison, labeled datasets, documentation, reproducible notebooks

B range (C + 3 of these):

  • Cost, time, reproducibility analysis
  • 6+ models from 3+ suppliers
  • Contemporary data you collected (not off-the-shelf)
  • Compelling business case

A range (B + 3 of these):

  • Error analysis (corner cases, rare/complex texts)
  • Mitigation strategy for identified model weaknesses
  • Additional baselines (dictionaries, topic models, etc.)
  • Comparison to amateur labels

Corpus Size

| Filing Type | Estimated Count |
|---|---|
| 10-K with Item 1C (FY2023 cycle) | ~4,500 |
| 10-K with Item 1C (FY2024 cycle) | ~4,500 |
| 8-K cybersecurity incidents | ~80 filings |
| Total filings | ~9,000-10,000 |
| Estimated paragraphs | ~50,000-80,000 |

Data Targets (per syllabus)

  • 20,000 texts for train/test (GenAI-labeled)
  • 1,200 texts for locked holdout (human-labeled, 3 annotators each)
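Drawing the 1,200-text holdout can be sketched as proportional stratified sampling; the strata key used here (whatever grouping field the metadata table provides, e.g. sector) and the function name are illustrative, and a real draw might also stratify by filing year or market cap:

```python
# Sketch: proportional stratified sampling for the human-labeled holdout.
# The strata key is caller-supplied; grouping fields are assumptions.
import random
from collections import defaultdict

def stratified_sample(items, key, n_total, seed=0):
    """Draw ~n_total items, allocating proportionally across strata,
    with at least one item per stratum."""
    rng = random.Random(seed)          # fixed seed -> reproducible holdout
    strata = defaultdict(list)
    for it in items:
        strata[key(it)].append(it)
    sample = []
    for group in strata.values():
        k = max(1, round(n_total * len(group) / len(items)))
        sample.extend(rng.sample(group, min(k, len(group))))
    return sample
```

Freezing the seed matters here: the holdout is meant to be locked, so the draw must be reproducible from the corpus alone.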

Team Roles (6 people)

| Role | Responsibility |
|---|---|
| Data Lead | EDGAR extraction pipeline, paragraph segmentation, data cleaning |
| Data Support | 8-K extraction, breach database cross-referencing, dataset QA |
| Labeling Lead | Rubric refinement, GenAI prompt engineering, MMC pipeline orchestration |
| Annotation | Gold set human labeling, inter-rater reliability, active learning review |
| Model Lead | DAPT pre-training, classification fine-tuning, ablation experiments |
| Eval & Writing | Validation tests, metrics computation, final presentation, documentation |

3-Week Schedule

Week 1: Data + Rubric

  • Set up EDGAR extraction pipeline (edgar-crawler + sec-edgar-downloader)
  • Set up 8-K extraction (sec-8k-item105)
  • Draft and pilot labeling rubric v1 on 30 paragraphs
  • Begin bulk 10-K download (FY2023 + FY2024 cycles)
  • Extract all 8-K cyber filings (Items 1.05, 8.01, 7.01)
  • Build company metadata table (CIK → ticker → GICS sector → market cap)
  • Compare pilot labels, compute initial inter-rater agreement, revise rubric → v2
  • Begin DAPT pre-training (SEC-ModernBERT-large, ~2-3 days on 3090)
  • Friday milestone: Full paragraph corpus ready (~50K+), 8-K dataset complete, evaluation framework ready
  • Launch Stage 1 dual annotation (Sonnet + Gemini Flash) on full corpus
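The Week 1 agreement check ("compute initial inter-rater agreement") for the two pilot annotators can be computed directly; this is a from-scratch Cohen's kappa rather than a library call, so the logic is visible:

```python
# Sketch: Cohen's kappa for a two-annotator pilot agreement check.
# Works for any hashable label values (category strings or 1-4 levels).
from collections import Counter

def cohens_kappa(labels_a, labels_b):
    assert len(labels_a) == len(labels_b) and labels_a, "need paired labels"
    n = len(labels_a)
    # Observed agreement: fraction of items where the annotators match.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement: product of each annotator's marginal label rates.
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    expected = sum(freq_a[c] * freq_b[c] for c in freq_a) / (n * n)
    if expected == 1.0:
        return 1.0
    return (observed - expected) / (1 - expected)
```

For the 4-point specificity scale, a weighted variant (quadratic weights) would credit near-misses; plain kappa is shown here for brevity.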

Week 2: Labeling + Training

  • Monitor and complete dual annotation
  • Gold set human labeling (300-500 paragraphs, stratified, 2+ annotators)
  • Extract disagreements (~17%), run Stage 2 judge panel (Opus + GPT-5 + Gemini Pro)
  • Active learning pass on low-confidence cases
  • Fine-tuning experiments: DeBERTa baseline → ModernBERT → SEC-ModernBERT → NeoBERT → Ensemble
  • Wednesday milestone: Gold set validated, Kappa computed
  • Friday milestone: Labeled dataset finalized, all training complete
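The two-stage resolution logic above (accept Stage 1 labels on agreement; send disagreements to a three-model judge panel and take the majority) can be sketched as one function. The model names in the plan are abstracted to generic label lists here, and the return codes are illustrative names:

```python
# Sketch of two-stage label resolution: Stage 1 agreement, else Stage 2
# judge majority, else flag for human review. Return codes are illustrative.
from collections import Counter

def resolve_label(stage1, judge_panel=None):
    """stage1: two labels from the dual annotators.
    judge_panel: three labels from the judge models (None until Stage 2 runs).
    Returns (final_label_or_None, source)."""
    a, b = stage1
    if a == b:
        return a, "stage1_agreement"
    if not judge_panel:
        return None, "needs_judges"       # the ~17% disagreement queue
    (top, count), = Counter(judge_panel).most_common(1)
    if count >= 2:
        return top, "judge_majority"
    return None, "unresolved"             # 3-way split -> human review
```

Keeping the `source` tag per paragraph also gives the error analysis a free stratifier: specialist-model accuracy can be reported separately on easy (Stage 1) versus contested (Stage 2) labels.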

Week 3: Evaluation + Presentation

  • Publish dataset to HuggingFace
  • Run validation tests (breach prediction, known-groups, boilerplate index)
  • Write all sections, create figures
  • Code cleanup, README
  • Thursday: Full team review and rehearsal
  • Friday: Presentation day
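One of the Week 3 validation tests, the boilerplate index, can be sketched as cross-firm text similarity: a paragraph that closely matches other firms' paragraphs should score low on specificity. Token-set Jaccard is used here as a deliberately simple proxy; the function names are illustrative and a real implementation might use embeddings or n-gram shingles:

```python
# Sketch of a "boilerplate index" validation check: maximum token-Jaccard
# similarity between a paragraph and paragraphs from *other* firms.
# High cross-firm similarity suggests boilerplate (specificity level 1).
import re

def _tokens(text):
    return set(re.findall(r"[a-z0-9]+", text.lower()))

def boilerplate_index(paragraph, other_firm_paragraphs):
    """Max Jaccard similarity against other firms' paragraphs, in [0, 1]."""
    toks = _tokens(paragraph)
    best = 0.0
    for other in other_firm_paragraphs:
        o = _tokens(other)
        if toks or o:
            best = max(best, len(toks & o) / len(toks | o))
    return best
```

The validation claim being tested: paragraphs labeled Generic Boilerplate should have systematically higher index values than Firm-Specific ones.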

Critical Path

```
Data extraction → Paragraph corpus → GenAI labeling → Judge panel → Final labels
                                                                        ↓
Rubric design → Pilot → Rubric v2 ──────────────────────────────────→ Gold set validation
                                                                        ↓
DAPT pre-training ─────→ Fine-tuning experiments ──→ Evaluation ──→ Final comparison
```

Budget

| Item | Cost |
|---|---|
| GenAI Stage 1 dual annotation (50K × 2 models, batch) | ~$115 |
| GenAI Stage 2 judge panel (~8.5K × 3 models, batch) | ~$55 |
| Prompt caching savings | -$30 to -$40 |
| SEC EDGAR data | $0 |
| Breach databases | $0 |
| Compute (RTX 3090, owned) | $0 |
| Total | ~$130-170 |

GPU-Free Work (next 2 days)

Everything below can proceed without GPU:

  • Set up project repo structure, dependencies, environment
  • Build EDGAR extraction pipeline (download + parse Item 1C)
  • Build 8-K extraction pipeline
  • Paragraph segmentation logic
  • Company metadata table (CIK → ticker → GICS sector)
  • Download PleIAs/SEC corpus for future DAPT
  • Refine labeling rubric, create pilot samples
  • Set up GenAI labeling scripts (batch API calls)
  • Set up evaluation framework (metrics computation code)
  • Download breach databases (PRC, VCDB, CISA KEV)
  • Gold set sampling strategy
  • Begin human labeling of pilot set
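The core of the GPU-free extraction work (parse Item 1C, then segment into paragraphs) can be sketched on plain text. This is a sketch only: real filings arrive as HTML and need cleanup first, the heading patterns below are simplified, and a table-of-contents line matching "Item 1C" would need to be skipped in practice:

```python
# Sketch: slice Item 1C out of a 10-K's plain text and split it into
# labelable paragraphs. Regexes are simplified; real EDGAR documents need
# HTML stripping and table-of-contents handling before this step.
import re

ITEM_1C = re.compile(r"item\s+1c\.?\s*cybersecurity", re.IGNORECASE)
NEXT_ITEM = re.compile(r"item\s+2\.?\s*properties", re.IGNORECASE)

def extract_item_1c(filing_text):
    """Return the text between the Item 1C heading and the Item 2 heading."""
    start = ITEM_1C.search(filing_text)
    if not start:
        return ""
    end = NEXT_ITEM.search(filing_text, start.end())
    stop = end.start() if end else len(filing_text)
    return filing_text[start.end():stop].strip()

def split_paragraphs(section, min_chars=40):
    """Split on blank lines; drop fragments too short to label usefully."""
    parts = re.split(r"\n\s*\n", section)
    return [p.strip() for p in parts if len(p.strip()) >= min_chars]
```

The `min_chars` floor is a tunable assumption; very short fragments (headings, page artifacts) otherwise inflate the None/Other class.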

GPU-Required (deferred)

  • DAPT pre-training of SEC-ModernBERT-large (~2-3 days on 3090)
  • All classification fine-tuning experiments
  • Model inference and evaluation