SEC-cyBERT/docs/PROJECT-OVERVIEW.md
2026-03-28 20:39:36 -04:00


SEC Cybersecurity Disclosure Quality Classifier

Project Summary

Build a validated, reusable classifier that labels SEC cybersecurity disclosures by content category and specificity level, then fine-tune an open-weights encoder model for deployment at scale.

Methodology: Ringel (2023) "Synthetic Experts" pipeline — use frontier LLMs to generate training labels, then distill into a small open-weights encoder model.

Construct: Project 3 from the Capstone Constructs document — "Cybersecurity Governance and Incident Disclosure Quality (SEC-Aligned)."

Three publishable artifacts:

  1. A novel dataset of extracted Item 1C disclosures (no public HuggingFace dataset exists)
  2. A labeling methodology for cybersecurity disclosure quality
  3. A SOTA classifier (SEC-ModernBERT-large — first SEC-specific ModernBERT)

Why This Matters

Cybersecurity risk is among the most financially material operational risks facing firms. In July 2023, the SEC adopted Release 33-11216 requiring:

  • Annual disclosure of cybersecurity risk management, strategy, and governance (10-K Item 1C)
  • Incident disclosure within 4 business days of materiality determination (8-K Item 1.05)

Investors, boards, and regulators need tools to assess whether disclosures are substantive or boilerplate, whether governance structures are robust or ceremonial, and whether incident reports are timely and informative. No validated, construct-aligned classifier exists for this purpose.

Stakeholder

Compliance officers, investor relations teams, institutional investors, and regulators who need to assess disclosure quality at scale across thousands of filings.

What Decisions Classification Enables

  • Investors: Screen for governance quality; identify firms with weak cyber posture before incidents
  • Regulators: Flag filings that may not meet the spirit of the rule
  • Boards: Benchmark their own disclosures against peers
  • Researchers: Large-scale empirical studies of disclosure quality

Error Consequences

  • False positive (labels boilerplate as specific): overstates disclosure quality; less harmful.
  • False negative (labels specific as boilerplate): understates quality and could unfairly penalize well-governed firms; more harmful for investment decisions.
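Because the two error types carry unequal costs, a deployed classifier need not use a symmetric 0.5 decision threshold. A minimal sketch, assuming illustrative (uncalibrated) cost values — the function name and the 2:1 cost ratio are placeholders, not project decisions:

```python
# Illustrative sketch: bias the decision threshold when a false negative
# (calling a genuinely specific disclosure boilerplate) is costlier than a
# false positive. The cost values are placeholders, not calibrated figures.

def predict_specific(p_specific: float, cost_fn: float = 2.0, cost_fp: float = 1.0) -> bool:
    """Predict 'specific' when the expected cost of predicting 'boilerplate'
    (p_specific * cost_fn) exceeds the expected cost of predicting
    'specific' ((1 - p_specific) * cost_fp)."""
    return p_specific * cost_fn > (1 - p_specific) * cost_fp

# With cost_fn = 2 * cost_fp, the effective threshold drops from 0.5 to 1/3,
# so borderline paragraphs are resolved toward "specific".
```

The same asymmetry could instead be expressed as class weights in the fine-tuning loss; the threshold form is shown here only because it requires no retraining.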

Why Now

  • ~9,000-10,000 filings exist (FY2023 + FY2024 cycles)
  • iXBRL CYD taxonomy went live Dec 2024 — programmatic extraction now possible
  • Volume makes manual review infeasible; leadership needs scalable measurement

Construct Definition

Theoretical foundation: Disclosure theory (Verrecchia, 2001) and regulatory compliance as information provision. The SEC rule itself provides a natural taxonomy — its structured requirements map directly to a multi-class classification task.

Unit of analysis: The paragraph within Item 1C (10-K) or Item 1.05 (8-K).

Two classification dimensions applied simultaneously:

Dimension 1: Content Category (single-label, 7 classes)

| Category | SEC Basis | What It Covers |
|---|---|---|
| Board Governance | 106(c)(1) | Board/committee oversight, briefing frequency, board cyber expertise |
| Management Role | 106(c)(2) | CISO/CTO identification, qualifications, reporting structure |
| Risk Management Process | 106(b) | Assessment processes, ERM integration, framework references |
| Third-Party Risk | 106(b) | Vendor oversight, external assessors, supply chain risk |
| Incident Disclosure | 8-K 1.05 | Nature/scope/timing of incidents, material impact, remediation |
| Strategy Integration | 106(b)(2) | Material impact on business strategy, cyber insurance, resource allocation |
| None/Other | n/a | Boilerplate intros, legal disclaimers, non-cybersecurity content |

Dimension 2: Specificity (4-point ordinal scale)

| Level | Label | Decision Test |
|---|---|---|
| 1 | Generic Boilerplate | "Could I paste this into a different company's filing unchanged?" → Yes |
| 2 | Sector-Adapted | "Does this name something specific but not unique to THIS company?" → Yes |
| 3 | Firm-Specific | "Does this contain at least one fact unique to THIS company?" → Yes |
| 4 | Quantified-Verifiable | "Could an outsider verify a specific claim in this paragraph?" → Yes |

Full rubric details, examples, and boundary rules are in LABELING-CODEBOOK.md.
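The two dimensions above can be carried through the pipeline as one typed record per paragraph, so every labeling stage emits the same shape. A minimal sketch — the enum values and field names are our own choices, not prescribed by the rubric:

```python
# Sketch of the two-dimensional label schema as a typed record. Category and
# level names follow the tables above; identifiers are illustrative.
from dataclasses import dataclass
from enum import Enum

class ContentCategory(str, Enum):
    BOARD_GOVERNANCE = "board_governance"
    MANAGEMENT_ROLE = "management_role"
    RISK_MANAGEMENT_PROCESS = "risk_management_process"
    THIRD_PARTY_RISK = "third_party_risk"
    INCIDENT_DISCLOSURE = "incident_disclosure"
    STRATEGY_INTEGRATION = "strategy_integration"
    NONE_OTHER = "none_other"

SPECIFICITY_LEVELS = {1: "Generic Boilerplate", 2: "Sector-Adapted",
                      3: "Firm-Specific", 4: "Quantified-Verifiable"}

@dataclass
class ParagraphLabel:
    paragraph_id: str
    category: ContentCategory   # Dimension 1: single-label, 7 classes
    specificity: int            # Dimension 2: ordinal, 1-4

    def __post_init__(self):
        if self.specificity not in SPECIFICITY_LEVELS:
            raise ValueError(f"specificity must be 1-4, got {self.specificity}")
```

Validating at construction time means malformed GenAI outputs fail loudly before they reach the training set.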


Deliverables Checklist

A) Executive Memo (max 5 pages)

  • Construct definition + why it matters + theoretical grounding
  • Data source + governance/ethics
  • Label schema overview
  • Results summary: best GenAI vs best specialist
  • Cost/time/reproducibility comparison
  • Recommendation for a real firm

B) Technical Appendix (slides or PDF)

  • Pipeline diagram (data → labels → model → evaluation)
  • Label codebook
  • Benchmark table (6+ GenAI models from 3+ suppliers)
  • Fine-tuning experiments + results
  • Error analysis: where does it fail and why?

C) Code + Artifacts

  • Reproducible notebooks
  • Datasets: holdout with human labels, train/test with GenAI labels, all model labels per run + majority labels
  • Saved fine-tuned model + inference script (link to shared drive, not Canvas)
  • Cost/time log

Grading Rubric (100%)

| Component | Weight |
|---|---|
| Business framing & construct clarity | 20% |
| Data pipeline quality + documentation | 15% |
| Human labeling process + reliability | 15% |
| GenAI benchmarking rigor | 20% |
| Fine-tuning rigor + evaluation discipline | 20% |
| Final comparison + recommendation quality | 10% |

Grade Targets

C range: F1 > 0.80, performance comparison, labeled datasets, documentation, reproducible notebooks

B range (C + 3 of these):

  • Cost, time, reproducibility analysis
  • 6+ models from 3+ suppliers
  • Contemporary data you collected (not off-the-shelf)
  • Compelling business case

A range (B + 3 of these):

  • Error analysis (corner cases, rare/complex texts)
  • Mitigation strategy for identified model weaknesses
  • Additional baselines (dictionaries, topic models, etc.)
  • Comparison to amateur labels

Corpus Size

| Filing Type | Estimated Count |
|---|---|
| 10-K with Item 1C (FY2023 cycle) | ~4,500 |
| 10-K with Item 1C (FY2024 cycle) | ~4,500 |
| 8-K cybersecurity incidents | ~80 filings |
| Total filings | ~9,000-10,000 |
| Estimated paragraphs | ~50,000-80,000 |

Data Targets (per syllabus)

  • 20,000 texts for train/test (GenAI-labeled)
  • 1,200 texts for locked holdout (human-labeled, 3 annotators each)
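Drawing the 1,200-text holdout can be sketched as proportional stratified sampling; the strata key used here (whatever grouping field the metadata table provides, e.g. sector) and the function name are illustrative, and a real draw might also stratify by filing year or market cap:

```python
# Sketch: proportional stratified sampling for the human-labeled holdout.
# The strata key is caller-supplied; grouping fields are assumptions.
import random
from collections import defaultdict

def stratified_sample(items, key, n_total, seed=0):
    """Draw ~n_total items, allocating proportionally across strata,
    with at least one item per stratum."""
    rng = random.Random(seed)          # fixed seed -> reproducible holdout
    strata = defaultdict(list)
    for it in items:
        strata[key(it)].append(it)
    sample = []
    for group in strata.values():
        k = max(1, round(n_total * len(group) / len(items)))
        sample.extend(rng.sample(group, min(k, len(group))))
    return sample
```

Freezing the seed matters here: the holdout is meant to be locked, so the draw must be reproducible from the corpus alone.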

Team Roles (6 people)

| Role | Responsibility |
|---|---|
| Data Lead | EDGAR extraction pipeline, paragraph segmentation, data cleaning |
| Data Support | 8-K extraction, breach database cross-referencing, dataset QA |
| Labeling Lead | Rubric refinement, GenAI prompt engineering, MMC pipeline orchestration |
| Annotation | Gold set human labeling, inter-rater reliability, active learning review |
| Model Lead | DAPT pre-training, classification fine-tuning, ablation experiments |
| Eval & Writing | Validation tests, metrics computation, final presentation, documentation |

3-Week Schedule

Week 1: Data + Rubric

  • Set up EDGAR extraction pipeline (edgar-crawler + sec-edgar-downloader)
  • Set up 8-K extraction (sec-8k-item105)
  • Draft and pilot labeling rubric v1 on 30 paragraphs
  • Begin bulk 10-K download (FY2023 + FY2024 cycles)
  • Extract all 8-K cyber filings (Items 1.05, 8.01, 7.01)
  • Build company metadata table (CIK → ticker → GICS sector → market cap)
  • Compare pilot labels, compute initial inter-rater agreement, revise rubric → v2
  • Begin DAPT pre-training (SEC-ModernBERT-large, ~2-3 days on 3090)
  • Friday milestone: Full paragraph corpus ready (~50K+), 8-K dataset complete, evaluation framework ready
  • Launch Stage 1 dual annotation (Sonnet + Gemini Flash) on full corpus
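The Week 1 agreement check ("compute initial inter-rater agreement") for the two pilot annotators can be computed directly; this is a from-scratch Cohen's kappa rather than a library call, so the logic is visible:

```python
# Sketch: Cohen's kappa for a two-annotator pilot agreement check.
# Works for any hashable label values (category strings or 1-4 levels).
from collections import Counter

def cohens_kappa(labels_a, labels_b):
    assert len(labels_a) == len(labels_b) and labels_a, "need paired labels"
    n = len(labels_a)
    # Observed agreement: fraction of items where the annotators match.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement: product of each annotator's marginal label rates.
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    expected = sum(freq_a[c] * freq_b[c] for c in freq_a) / (n * n)
    if expected == 1.0:
        return 1.0
    return (observed - expected) / (1 - expected)
```

For the 4-point specificity scale, a weighted variant (quadratic weights) would credit near-misses; plain kappa is shown here for brevity.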

Week 2: Labeling + Training

  • Monitor and complete dual annotation
  • Gold set human labeling (300-500 paragraphs, stratified, 2+ annotators)
  • Extract disagreements (~17%), run Stage 2 judge panel (Opus + GPT-5 + Gemini Pro)
  • Active learning pass on low-confidence cases
  • Fine-tuning experiments: DeBERTa baseline → ModernBERT → SEC-ModernBERT → NeoBERT → Ensemble
  • Wednesday milestone: Gold set validated, Kappa computed
  • Friday milestone: Labeled dataset finalized, all training complete
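The two-stage resolution logic above (accept Stage 1 labels on agreement; send disagreements to a three-model judge panel and take the majority) can be sketched as one function. The model names in the plan are abstracted to generic label lists here, and the return codes are illustrative names:

```python
# Sketch of two-stage label resolution: Stage 1 agreement, else Stage 2
# judge majority, else flag for human review. Return codes are illustrative.
from collections import Counter

def resolve_label(stage1, judge_panel=None):
    """stage1: two labels from the dual annotators.
    judge_panel: three labels from the judge models (None until Stage 2 runs).
    Returns (final_label_or_None, source)."""
    a, b = stage1
    if a == b:
        return a, "stage1_agreement"
    if not judge_panel:
        return None, "needs_judges"       # the ~17% disagreement queue
    (top, count), = Counter(judge_panel).most_common(1)
    if count >= 2:
        return top, "judge_majority"
    return None, "unresolved"             # 3-way split -> human review
```

Keeping the `source` tag per paragraph also gives the error analysis a free stratifier: specialist-model accuracy can be reported separately on easy (Stage 1) versus contested (Stage 2) labels.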

Week 3: Evaluation + Presentation

  • Publish dataset to HuggingFace
  • Run validation tests (breach prediction, known-groups, boilerplate index)
  • Write all sections, create figures
  • Code cleanup, README
  • Thursday: Full team review and rehearsal
  • Friday: Presentation day
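One of the Week 3 validation tests, the boilerplate index, can be sketched as cross-firm text similarity: a paragraph that closely matches other firms' paragraphs should score low on specificity. Token-set Jaccard is used here as a deliberately simple proxy; the function names are illustrative and a real implementation might use embeddings or n-gram shingles:

```python
# Sketch of a "boilerplate index" validation check: maximum token-Jaccard
# similarity between a paragraph and paragraphs from *other* firms.
# High cross-firm similarity suggests boilerplate (specificity level 1).
import re

def _tokens(text):
    return set(re.findall(r"[a-z0-9]+", text.lower()))

def boilerplate_index(paragraph, other_firm_paragraphs):
    """Max Jaccard similarity against other firms' paragraphs, in [0, 1]."""
    toks = _tokens(paragraph)
    best = 0.0
    for other in other_firm_paragraphs:
        o = _tokens(other)
        if toks or o:
            best = max(best, len(toks & o) / len(toks | o))
    return best
```

The validation claim being tested: paragraphs labeled Generic Boilerplate should have systematically higher index values than Firm-Specific ones.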

Critical Path

```
Data extraction → Paragraph corpus → GenAI labeling → Judge panel → Final labels
                                                                        ↓
Rubric design → Pilot → Rubric v2 ──────────────────────────────────→ Gold set validation
                                                                        ↓
DAPT pre-training ─────→ Fine-tuning experiments ──→ Evaluation ──→ Final comparison
```

Budget

| Item | Cost |
|---|---|
| GenAI Stage 1 dual annotation (50K × 2 models, batch) | ~$115 |
| GenAI Stage 2 judge panel (~8.5K × 3 models, batch) | ~$55 |
| Prompt caching savings | -$30 to -$40 |
| SEC EDGAR data | $0 |
| Breach databases | $0 |
| Compute (RTX 3090, owned) | $0 |
| Total | ~$130-170 |

GPU-Free Work (next 2 days)

Everything below can proceed without GPU:

  • Set up project repo structure, dependencies, environment
  • Build EDGAR extraction pipeline (download + parse Item 1C)
  • Build 8-K extraction pipeline
  • Paragraph segmentation logic
  • Company metadata table (CIK → ticker → GICS sector)
  • Download PleIAs/SEC corpus for future DAPT
  • Refine labeling rubric, create pilot samples
  • Set up GenAI labeling scripts (batch API calls)
  • Set up evaluation framework (metrics computation code)
  • Download breach databases (PRC, VCDB, CISA KEV)
  • Gold set sampling strategy
  • Begin human labeling of pilot set
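The core of the GPU-free extraction work (parse Item 1C, then segment into paragraphs) can be sketched on plain text. This is a sketch only: real filings arrive as HTML and need cleanup first, the heading patterns below are simplified, and a table-of-contents line matching "Item 1C" would need to be skipped in practice:

```python
# Sketch: slice Item 1C out of a 10-K's plain text and split it into
# labelable paragraphs. Regexes are simplified; real EDGAR documents need
# HTML stripping and table-of-contents handling before this step.
import re

ITEM_1C = re.compile(r"item\s+1c\.?\s*cybersecurity", re.IGNORECASE)
NEXT_ITEM = re.compile(r"item\s+2\.?\s*properties", re.IGNORECASE)

def extract_item_1c(filing_text):
    """Return the text between the Item 1C heading and the Item 2 heading."""
    start = ITEM_1C.search(filing_text)
    if not start:
        return ""
    end = NEXT_ITEM.search(filing_text, start.end())
    stop = end.start() if end else len(filing_text)
    return filing_text[start.end():stop].strip()

def split_paragraphs(section, min_chars=40):
    """Split on blank lines; drop fragments too short to label usefully."""
    parts = re.split(r"\n\s*\n", section)
    return [p.strip() for p in parts if len(p.strip()) >= min_chars]
```

The `min_chars` floor is a tunable assumption; very short fragments (headings, page artifacts) otherwise inflate the None/Other class.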

GPU-Required (deferred)

  • DAPT pre-training of SEC-ModernBERT-large (~2-3 days on 3090)
  • All classification fine-tuning experiments
  • Model inference and evaluation