# SEC Cybersecurity Disclosure Quality Classifier

## Project Summary

Build a validated, reusable classifier that labels SEC cybersecurity disclosures by **content category** and **specificity level**, then fine-tune an open-weights encoder model for deployment at scale.

**Methodology:** Ringel (2023) "Synthetic Experts" pipeline — use frontier LLMs to generate training labels, then distill into a small open-weights encoder model.

**Construct:** Project 3 from the Capstone Constructs document — "Cybersecurity Governance and Incident Disclosure Quality (SEC-Aligned)."

**Three publishable artifacts:**

1. A novel dataset of extracted Item 1C disclosures (no public HuggingFace dataset exists)
2. A labeling methodology for cybersecurity disclosure quality
3. A SOTA classifier (SEC-ModernBERT-large — the first SEC-specific ModernBERT)

---

## Why This Matters

Cybersecurity risk is among the most financially material operational risks facing firms. In July 2023, the SEC adopted Release 33-11216 requiring:

- **Annual disclosure** of cybersecurity risk management, strategy, and governance (10-K Item 1C)
- **Incident disclosure** within 4 business days of materiality determination (8-K Item 1.05)

Investors, boards, and regulators need tools to assess whether disclosures are substantive or boilerplate, whether governance structures are robust or ceremonial, and whether incident reports are timely and informative. **No validated, construct-aligned classifier exists for this purpose.**

### Stakeholder

Compliance officers, investor relations teams, institutional investors, and regulators who need to assess disclosure quality at scale across thousands of filings.

### What Decisions Classification Enables

- **Investors:** Screen for governance quality; identify firms with weak cyber posture before incidents
- **Regulators:** Flag filings that may not meet the spirit of the rule
- **Boards:** Benchmark their own disclosures against peers
- **Researchers:** Large-scale empirical studies of disclosure quality

### Error Consequences

- **False positive (labels boilerplate as specific):** Overstates disclosure quality — less harmful
- **False negative (labels specific as boilerplate):** Understates quality and could unfairly penalize well-governed firms — more harmful for investment decisions

### Why Now

- ~9,000-10,000 filings exist (FY2023 + FY2024 cycles)
- The iXBRL CYD taxonomy went live in December 2024 — programmatic extraction is now possible
- Volume makes manual review infeasible; leadership needs scalable measurement

---

## Construct Definition

**Theoretical foundation:** Disclosure theory (Verrecchia, 2001) and regulatory compliance as information provision. The SEC rule itself provides a natural taxonomy — its structured requirements map directly to a multi-class classification task.

**Unit of analysis:** The paragraph within Item 1C (10-K) or Item 1.05 (8-K).

**Two classification dimensions applied simultaneously:**

### Dimension 1: Content Category (single-label, 7 classes)

| Category | SEC Basis | What It Covers |
|----------|-----------|----------------|
| Board Governance | 106(c)(1) | Board/committee oversight, briefing frequency, board cyber expertise |
| Management Role | 106(c)(2) | CISO/CTO identification, qualifications, reporting structure |
| Risk Management Process | 106(b) | Assessment processes, ERM integration, framework references |
| Third-Party Risk | 106(b) | Vendor oversight, external assessors, supply chain risk |
| Incident Disclosure | 8-K 1.05 | Nature/scope/timing of incidents, material impact, remediation |
| Strategy Integration | 106(b)(2) | Material impact on business strategy, cyber insurance, resource allocation |
| None/Other | — | Boilerplate intros, legal disclaimers, non-cybersecurity content |

### Dimension 2: Specificity (4-point ordinal scale)

| Level | Label | Decision Test |
|-------|-------|---------------|
| 1 | Generic Boilerplate | "Could I paste this into a different company's filing unchanged?" → Yes |
| 2 | Sector-Adapted | "Does this name something specific but not unique to THIS company?" → Yes |
| 3 | Firm-Specific | "Does this contain at least one fact unique to THIS company?" → Yes |
| 4 | Quantified-Verifiable | "Could an outsider verify a specific claim in this paragraph?" → Yes |

Full rubric details, examples, and boundary rules are in [LABELING-CODEBOOK.md](LABELING-CODEBOOK.md).

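One way to make the two-dimension schema concrete in code is a small validated record type. This is a minimal sketch with illustrative label names — the authoritative category definitions and boundary rules live in LABELING-CODEBOOK.md:

```python
from dataclasses import dataclass

# Illustrative machine-readable names for the 7 content categories;
# the codebook, not this set, is the source of truth.
CATEGORIES = {
    "board_governance", "management_role", "risk_management_process",
    "third_party_risk", "incident_disclosure", "strategy_integration",
    "none_other",
}

@dataclass(frozen=True)
class ParagraphLabel:
    """One labeled paragraph: content category plus 1-4 ordinal specificity."""
    paragraph_id: str
    category: str      # one of the 7 content categories
    specificity: int   # 1 = generic boilerplate ... 4 = quantified-verifiable

    def __post_init__(self):
        if self.category not in CATEGORIES:
            raise ValueError(f"unknown category: {self.category}")
        if self.specificity not in (1, 2, 3, 4):
            raise ValueError(f"specificity must be 1-4, got {self.specificity}")
```

Validating at construction time keeps schema drift out of the labeled datasets, since every GenAI or human label passes through the same gate.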
---

## Deliverables Checklist

### A) Executive Memo (max 5 pages)

- [ ] Construct definition + why it matters + theoretical grounding
- [ ] Data source + governance/ethics
- [ ] Label schema overview
- [ ] Results summary: best GenAI vs best specialist
- [ ] Cost/time/reproducibility comparison
- [ ] Recommendation for a real firm

### B) Technical Appendix (slides or PDF)

- [ ] Pipeline diagram (data → labels → model → evaluation)
- [ ] Label codebook
- [ ] Benchmark table (6+ GenAI models from 3+ suppliers)
- [ ] Fine-tuning experiments + results
- [ ] Error analysis: where does it fail and why?

### C) Code + Artifacts

- [ ] Reproducible notebooks
- [ ] Datasets: holdout with human labels, train/test with GenAI labels, all model labels per run + majority labels
- [ ] Saved fine-tuned model + inference script (link to shared drive, not Canvas)
- [ ] Cost/time log

---

## Grading Rubric (100%)

| Component | Weight |
|-----------|--------|
| Business framing & construct clarity | 20% |
| Data pipeline quality + documentation | 15% |
| Human labeling process + reliability | 15% |
| GenAI benchmarking rigor | 20% |
| Fine-tuning rigor + evaluation discipline | 20% |
| Final comparison + recommendation quality | 10% |

### Grade Targets

**C range:** F1 > 0.80, performance comparison, labeled datasets, documentation, reproducible notebooks

**B range (C + 3 of these):**

- Cost, time, reproducibility analysis
- 6+ models from 3+ suppliers
- Contemporary data you collected (not off-the-shelf)
- Compelling business case

**A range (B + 3 of these):**

- Error analysis (corner cases, rare/complex texts)
- Mitigation strategy for identified model weaknesses
- Additional baselines (dictionaries, topic models, etc.)
- Comparison to amateur labels

---

## Corpus Size

| Filing Type | Estimated Count |
|-------------|-----------------|
| 10-K with Item 1C (FY2023 cycle) | ~4,500 |
| 10-K with Item 1C (FY2024 cycle) | ~4,500 |
| 8-K cybersecurity incidents | ~80 filings |
| **Total filings** | **~9,000-10,000** |
| **Estimated paragraphs** | **~50,000-80,000** |

### Data Targets (per syllabus)

- **20,000 texts** for train/test (GenAI-labeled)
- **1,200 texts** for locked holdout (human-labeled, 3 annotators each)

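The 1,200-text locked holdout should mirror the corpus composition rather than be drawn uniformly. A stdlib sketch of proportional stratified sampling — the stratum function (e.g. GICS sector) and seed are illustrative choices, not from the syllabus:

```python
import random
from collections import defaultdict

def stratified_sample(items, stratum_of, n_total, seed=13):
    """Draw ~n_total items, allocated proportionally across strata.

    stratum_of maps an item to its stratum key (e.g. GICS sector).
    A fixed seed keeps the holdout reproducible and lockable.
    """
    by_stratum = defaultdict(list)
    for item in items:
        by_stratum[stratum_of(item)].append(item)
    rng = random.Random(seed)
    sample = []
    total = len(items)
    for _, members in sorted(by_stratum.items()):
        # at least 1 per stratum so rare sectors are never dropped entirely
        k = max(1, round(n_total * len(members) / total))
        sample.extend(rng.sample(members, min(k, len(members))))
    return sample
```

Rounding per stratum means the final count can drift slightly from `n_total`; trim or top up after the fact if an exact 1,200 is required.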
---

## Team Roles (6 people)

| Role | Responsibility |
|------|----------------|
| Data Lead | EDGAR extraction pipeline, paragraph segmentation, data cleaning |
| Data Support | 8-K extraction, breach database cross-referencing, dataset QA |
| Labeling Lead | Rubric refinement, GenAI prompt engineering, MMC pipeline orchestration |
| Annotation | Gold set human labeling, inter-rater reliability, active learning review |
| Model Lead | DAPT pre-training, classification fine-tuning, ablation experiments |
| Eval & Writing | Validation tests, metrics computation, final presentation, documentation |

---

## 3-Week Schedule

### Week 1: Data + Rubric

- Set up EDGAR extraction pipeline (edgar-crawler + sec-edgar-downloader)
- Set up 8-K extraction (sec-8k-item105)
- Draft and pilot labeling rubric v1 on 30 paragraphs
- Begin bulk 10-K download (FY2023 + FY2024 cycles)
- Extract all 8-K cyber filings (Items 1.05, 8.01, 7.01)
- Build company metadata table (CIK → ticker → GICS sector → market cap)
- Compare pilot labels, compute initial inter-rater agreement, revise rubric → v2
- Begin DAPT pre-training (SEC-ModernBERT-large, ~2-3 days on the RTX 3090)
- **Friday milestone:** Full paragraph corpus ready (~50K+), 8-K dataset complete, evaluation framework ready
- Launch Stage 1 dual annotation (Sonnet + Gemini Flash) on full corpus

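The Week 1 corpus milestone turns raw Item 1C text into labelable paragraphs. A minimal segmentation sketch, assuming the filing text has already been extracted to plain text — the blank-line split and the 15-word floor are assumptions to tune against real filings, not rules from the pipeline:

```python
import re

def segment_paragraphs(text: str, min_words: int = 15) -> list[str]:
    """Split extracted Item 1C text on blank lines into candidate paragraphs.

    Fragments below min_words (headings, page artifacts, stray captions)
    are dropped since they carry too little signal to label reliably.
    """
    chunks = re.split(r"\n\s*\n", text)
    paragraphs = []
    for chunk in chunks:
        flat = " ".join(chunk.split())  # collapse internal whitespace
        if len(flat.split()) >= min_words:
            paragraphs.append(flat)
    return paragraphs
```

Real 10-K HTML will need an extraction step (edgar-crawler or similar) before this; the segmenter only assumes clean text in, one paragraph string out.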
### Week 2: Labeling + Training

- Monitor and complete dual annotation
- Gold set human labeling (300-500 paragraphs, stratified, 2+ annotators)
- Extract disagreements (~17%), run Stage 2 judge panel (Opus + GPT-5 + Gemini Pro)
- Active learning pass on low-confidence cases
- Fine-tuning experiments: DeBERTa baseline → ModernBERT → SEC-ModernBERT → NeoBERT → Ensemble
- **Wednesday milestone:** Gold set validated, Kappa computed
- **Friday milestone:** Labeled dataset finalized, all training complete

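The Wednesday milestone requires an agreement statistic. A dependency-free Cohen's kappa for two annotators over the same items — with the 3-annotator holdout you would report pairwise kappas or switch to Fleiss' kappa / Krippendorff's alpha, and this stdlib version is mainly useful as a cross-check against a library implementation:

```python
from collections import Counter

def cohens_kappa(a: list[str], b: list[str]) -> float:
    """Cohen's kappa: chance-corrected agreement between two annotators."""
    assert len(a) == len(b) and a, "need equal-length, non-empty label lists"
    n = len(a)
    observed = sum(x == y for x, y in zip(a, b)) / n
    counts_a, counts_b = Counter(a), Counter(b)
    # expected agreement if each annotator labeled independently at random
    # according to their own marginal label frequencies
    expected = sum(counts_a[l] * counts_b[l] for l in set(a) | set(b)) / (n * n)
    if expected == 1.0:
        return 1.0  # degenerate case: both annotators use a single label
    return (observed - expected) / (1 - expected)
```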
### Week 3: Evaluation + Presentation

- Publish dataset to HuggingFace
- Run validation tests (breach prediction, known-groups, boilerplate index)
- Write all sections, create figures
- Code cleanup, README
- **Thursday:** Full team review and rehearsal
- **Friday:** Presentation day

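The headline metric behind the grade targets is F1 (the C-range bar is F1 > 0.80). A dependency-free macro-F1 sketch that treats every class equally, which matters here because categories like Incident Disclosure will be rare; results should be cross-checked against scikit-learn's `f1_score(average="macro")`:

```python
def macro_f1(y_true: list[str], y_pred: list[str]) -> float:
    """Unweighted mean of per-class F1 over all labels seen in either list."""
    labels = sorted(set(y_true) | set(y_pred))
    scores = []
    for label in labels:
        tp = sum(t == label and p == label for t, p in zip(y_true, y_pred))
        fp = sum(t != label and p == label for t, p in zip(y_true, y_pred))
        fn = sum(t == label and p != label for t, p in zip(y_true, y_pred))
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)
```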
### Critical Path

```
Data extraction → Paragraph corpus → GenAI labeling → Judge panel → Final labels
                                          ↓
Rubric design → Pilot → Rubric v2 ───────────────────────→ Gold set validation
                                          ↓
DAPT pre-training ──→ Fine-tuning experiments ──→ Evaluation ──→ Final comparison
```

---

## Budget

| Item | Cost |
|------|------|
| GenAI Stage 1 dual annotation (50K × 2 models, batch) | ~$115 |
| GenAI Stage 2 judge panel (~8.5K × 3 models, batch) | ~$55 |
| Prompt caching savings | -$30 to -$40 |
| SEC EDGAR data | $0 |
| Breach databases | $0 |
| Compute (RTX 3090, owned) | $0 |
| **Total** | **~$130-170** |

---

## GPU-Free Work (next 2 days)

Everything below can proceed without GPU:

- [ ] Set up project repo structure, dependencies, environment
- [ ] Build EDGAR extraction pipeline (download + parse Item 1C)
- [ ] Build 8-K extraction pipeline
- [ ] Paragraph segmentation logic
- [ ] Company metadata table (CIK → ticker → GICS sector)
- [ ] Download PleIAs/SEC corpus for future DAPT
- [ ] Refine labeling rubric, create pilot samples
- [ ] Set up GenAI labeling scripts (batch API calls)
- [ ] Set up evaluation framework (metrics computation code)
- [ ] Download breach databases (PRC, VCDB, CISA KEV)
- [ ] Gold set sampling strategy
- [ ] Begin human labeling of pilot set

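For the CIK → ticker step, SEC EDGAR publishes a `company_tickers.json` mapping; the structure shown in the comment reflects that file at the time of writing and should be verified before relying on it. GICS sector and market cap are not in that file and would come from a separate source. A normalization sketch that produces the zero-padded 10-digit CIK form EDGAR URLs use:

```python
# company_tickers.json entries look roughly like:
#   {"0": {"cik_str": 320193, "ticker": "AAPL", "title": "Apple Inc."}, ...}
# Fetch from https://www.sec.gov/files/company_tickers.json with a
# descriptive User-Agent, per SEC fair-access guidelines; parsed offline here.

def cik_to_ticker(company_tickers: dict) -> dict[str, str]:
    """Map zero-padded 10-digit CIK strings to tickers."""
    return {
        str(row["cik_str"]).zfill(10): row["ticker"]
        for row in company_tickers.values()
    }
```

Keying everything on the padded CIK string avoids the int-vs-string mismatches that otherwise creep in when joining against filing indexes.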
### GPU-Required (deferred)

- DAPT pre-training of SEC-ModernBERT-large (~2-3 days on the RTX 3090)
- All classification fine-tuning experiments
- Model inference and evaluation