# SEC Cybersecurity Disclosure Quality Classifier

## Project Summary

Build a validated, reusable classifier that labels SEC cybersecurity disclosures by **content category** and **specificity level**, then fine-tune an open-weights encoder model for deployment at scale.

**Methodology:** Ringel (2023) "Synthetic Experts" pipeline — use frontier LLMs to generate training labels, then distill into a small open-weights encoder model.

**Construct:** Project 3 from the Capstone Constructs document — "Cybersecurity Governance and Incident Disclosure Quality (SEC-Aligned)."

**Three publishable artifacts:**

1. A novel dataset of extracted Item 1C disclosures (no public HuggingFace dataset exists)
2. A labeling methodology for cybersecurity disclosure quality
3. A SOTA classifier (SEC-ModernBERT-large — the first SEC-specific ModernBERT)

---

## Why This Matters

Cybersecurity risk is among the most financially material operational risks facing firms. In July 2023, the SEC adopted Release 33-11216 requiring:

- **Annual disclosure** of cybersecurity risk management, strategy, and governance (10-K Item 1C)
- **Incident disclosure** within 4 business days of materiality determination (8-K Item 1.05)

Investors, boards, and regulators need tools to assess whether disclosures are substantive or boilerplate, whether governance structures are robust or ceremonial, and whether incident reports are timely and informative. **No validated, construct-aligned classifier exists for this purpose.**

### Stakeholder

Compliance officers, investor relations teams, institutional investors, and regulators who need to assess disclosure quality at scale across thousands of filings.

### What Decisions Classification Enables

- **Investors:** Screen for governance quality; identify firms with weak cyber posture before incidents
- **Regulators:** Flag filings that may not meet the spirit of the rule
- **Boards:** Benchmark their own disclosures against peers
- **Researchers:** Large-scale empirical studies of disclosure quality

### Error Consequences

- **False positive (labels boilerplate as specific):** Overstates disclosure quality — less harmful
- **False negative (labels specific as boilerplate):** Understates quality and could unfairly penalize well-governed firms — more harmful for investment decisions

### Why Now

- ~9,000-10,000 filings exist (FY2023 + FY2024 cycles)
- The iXBRL CYD taxonomy went live in December 2024 — programmatic extraction is now possible
- Volume makes manual review infeasible; leadership needs scalable measurement

---

## Construct Definition

**Theoretical foundation:** Disclosure theory (Verrecchia, 2001) and regulatory compliance as information provision. The SEC rule itself provides a natural taxonomy — its structured requirements map directly to a multi-class classification task.

**Unit of analysis:** The paragraph within Item 1C (10-K) or Item 1.05 (8-K).

**Two classification dimensions applied simultaneously:**

### Dimension 1: Content Category (single-label, 7 classes)

| Category | SEC Basis | What It Covers |
|----------|-----------|----------------|
| Board Governance | 106(c)(1) | Board/committee oversight, briefing frequency, board cyber expertise |
| Management Role | 106(c)(2) | CISO/CTO identification, qualifications, reporting structure |
| Risk Management Process | 106(b) | Assessment processes, ERM integration, framework references |
| Third-Party Risk | 106(b) | Vendor oversight, external assessors, supply chain risk |
| Incident Disclosure | 8-K 1.05 | Nature/scope/timing of incidents, material impact, remediation |
| Strategy Integration | 106(b)(2) | Material impact on business strategy, cyber insurance, resource allocation |
| None/Other | — | Boilerplate intros, legal disclaimers, non-cybersecurity content |

### Dimension 2: Specificity (4-point ordinal scale)

| Level | Label | Decision Test |
|-------|-------|---------------|
| 1 | Generic Boilerplate | "Could I paste this into a different company's filing unchanged?" → Yes |
| 2 | Sector-Adapted | "Does this name something specific but not unique to THIS company?" → Yes |
| 3 | Firm-Specific | "Does this contain at least one fact unique to THIS company?" → Yes |
| 4 | Quantified-Verifiable | "Could an outsider verify a specific claim in this paragraph?" → Yes |

Full rubric details, examples, and boundary rules are in [LABELING-CODEBOOK.md](LABELING-CODEBOOK.md).

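One way to make the two-dimension schema concrete in code is a small validated record type. This is a minimal sketch with illustrative label names — the authoritative category definitions and boundary rules live in LABELING-CODEBOOK.md:

```python
from dataclasses import dataclass

# Illustrative machine-readable names for the 7 content categories;
# the codebook, not this set, is the source of truth.
CATEGORIES = {
    "board_governance", "management_role", "risk_management_process",
    "third_party_risk", "incident_disclosure", "strategy_integration",
    "none_other",
}

@dataclass(frozen=True)
class ParagraphLabel:
    """One labeled paragraph: content category plus 1-4 ordinal specificity."""
    paragraph_id: str
    category: str      # one of the 7 content categories
    specificity: int   # 1 = generic boilerplate ... 4 = quantified-verifiable

    def __post_init__(self):
        if self.category not in CATEGORIES:
            raise ValueError(f"unknown category: {self.category}")
        if self.specificity not in (1, 2, 3, 4):
            raise ValueError(f"specificity must be 1-4, got {self.specificity}")
```

Validating at construction time keeps schema drift out of the labeled datasets, since every GenAI or human label passes through the same gate.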
---

## Deliverables Checklist

### A) Executive Memo (max 5 pages)

- [ ] Construct definition + why it matters + theoretical grounding
- [ ] Data source + governance/ethics
- [ ] Label schema overview
- [ ] Results summary: best GenAI vs best specialist
- [ ] Cost/time/reproducibility comparison
- [ ] Recommendation for a real firm

### B) Technical Appendix (slides or PDF)

- [ ] Pipeline diagram (data → labels → model → evaluation)
- [ ] Label codebook
- [ ] Benchmark table (6+ GenAI models from 3+ suppliers)
- [ ] Fine-tuning experiments + results
- [ ] Error analysis: where does it fail and why?

### C) Code + Artifacts

- [ ] Reproducible notebooks
- [ ] Datasets: holdout with human labels, train/test with GenAI labels, all model labels per run + majority labels
- [ ] Saved fine-tuned model + inference script (link to shared drive, not Canvas)
- [ ] Cost/time log

---

## Grading Rubric (100%)

| Component | Weight |
|-----------|--------|
| Business framing & construct clarity | 20% |
| Data pipeline quality + documentation | 15% |
| Human labeling process + reliability | 15% |
| GenAI benchmarking rigor | 20% |
| Fine-tuning rigor + evaluation discipline | 20% |
| Final comparison + recommendation quality | 10% |

### Grade Targets

**C range:** F1 > 0.80, performance comparison, labeled datasets, documentation, reproducible notebooks

**B range (C + 3 of these):**

- Cost, time, reproducibility analysis
- 6+ models from 3+ suppliers
- Contemporary data you collected (not off-the-shelf)
- Compelling business case

**A range (B + 3 of these):**

- Error analysis (corner cases, rare/complex texts)
- Mitigation strategy for identified model weaknesses
- Additional baselines (dictionaries, topic models, etc.)
- Comparison to amateur labels

---

## Corpus Size

| Filing Type | Estimated Count |
|-------------|-----------------|
| 10-K with Item 1C (FY2023 cycle) | ~4,500 |
| 10-K with Item 1C (FY2024 cycle) | ~4,500 |
| 8-K cybersecurity incidents | ~80 filings |
| **Total filings** | **~9,000-10,000** |
| **Estimated paragraphs** | **~50,000-80,000** |

### Data Targets (per syllabus)

- **20,000 texts** for train/test (GenAI-labeled)
- **1,200 texts** for locked holdout (human-labeled, 3 annotators each)

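The 1,200-text locked holdout should mirror the corpus composition rather than be drawn uniformly. A stdlib sketch of proportional stratified sampling — the stratum function (e.g. GICS sector) and seed are illustrative choices, not from the syllabus:

```python
import random
from collections import defaultdict

def stratified_sample(items, stratum_of, n_total, seed=13):
    """Draw ~n_total items, allocated proportionally across strata.

    stratum_of maps an item to its stratum key (e.g. GICS sector).
    A fixed seed keeps the holdout reproducible and lockable.
    """
    by_stratum = defaultdict(list)
    for item in items:
        by_stratum[stratum_of(item)].append(item)
    rng = random.Random(seed)
    sample = []
    total = len(items)
    for _, members in sorted(by_stratum.items()):
        # at least 1 per stratum so rare sectors are never dropped entirely
        k = max(1, round(n_total * len(members) / total))
        sample.extend(rng.sample(members, min(k, len(members))))
    return sample
```

Rounding per stratum means the final count can drift slightly from `n_total`; trim or top up after the fact if an exact 1,200 is required.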
---

## Team Roles (6 people)

| Role | Responsibility |
|------|----------------|
| Data Lead | EDGAR extraction pipeline, paragraph segmentation, data cleaning |
| Data Support | 8-K extraction, breach database cross-referencing, dataset QA |
| Labeling Lead | Rubric refinement, GenAI prompt engineering, MMC pipeline orchestration |
| Annotation | Gold set human labeling, inter-rater reliability, active learning review |
| Model Lead | DAPT pre-training, classification fine-tuning, ablation experiments |
| Eval & Writing | Validation tests, metrics computation, final presentation, documentation |

---

## 3-Week Schedule

### Week 1: Data + Rubric

- Set up EDGAR extraction pipeline (edgar-crawler + sec-edgar-downloader)
- Set up 8-K extraction (sec-8k-item105)
- Draft and pilot labeling rubric v1 on 30 paragraphs
- Begin bulk 10-K download (FY2023 + FY2024 cycles)
- Extract all 8-K cyber filings (Items 1.05, 8.01, 7.01)
- Build company metadata table (CIK → ticker → GICS sector → market cap)
- Compare pilot labels, compute initial inter-rater agreement, revise rubric → v2
- Begin DAPT pre-training (SEC-ModernBERT-large, ~2-3 days on the RTX 3090)
- **Friday milestone:** Full paragraph corpus ready (~50K+), 8-K dataset complete, evaluation framework ready
- Launch Stage 1 dual annotation (Sonnet + Gemini Flash) on full corpus

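The Week 1 corpus milestone turns raw Item 1C text into labelable paragraphs. A minimal segmentation sketch, assuming the filing text has already been extracted to plain text — the blank-line split and the 15-word floor are assumptions to tune against real filings, not rules from the pipeline:

```python
import re

def segment_paragraphs(text: str, min_words: int = 15) -> list[str]:
    """Split extracted Item 1C text on blank lines into candidate paragraphs.

    Fragments below min_words (headings, page artifacts, stray captions)
    are dropped since they carry too little signal to label reliably.
    """
    chunks = re.split(r"\n\s*\n", text)
    paragraphs = []
    for chunk in chunks:
        flat = " ".join(chunk.split())  # collapse internal whitespace
        if len(flat.split()) >= min_words:
            paragraphs.append(flat)
    return paragraphs
```

Real 10-K HTML will need an extraction step (edgar-crawler or similar) before this; the segmenter only assumes clean text in, one paragraph string out.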
### Week 2: Labeling + Training

- Monitor and complete dual annotation
- Gold set human labeling (300-500 paragraphs, stratified, 2+ annotators)
- Extract disagreements (~17%), run Stage 2 judge panel (Opus + GPT-5 + Gemini Pro)
- Active learning pass on low-confidence cases
- Fine-tuning experiments: DeBERTa baseline → ModernBERT → SEC-ModernBERT → NeoBERT → Ensemble
- **Wednesday milestone:** Gold set validated, Kappa computed
- **Friday milestone:** Labeled dataset finalized, all training complete

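The Wednesday milestone requires an agreement statistic. A dependency-free Cohen's kappa for two annotators over the same items — with the 3-annotator holdout you would report pairwise kappas or switch to Fleiss' kappa / Krippendorff's alpha, and this stdlib version is mainly useful as a cross-check against a library implementation:

```python
from collections import Counter

def cohens_kappa(a: list[str], b: list[str]) -> float:
    """Cohen's kappa: chance-corrected agreement between two annotators."""
    assert len(a) == len(b) and a, "need equal-length, non-empty label lists"
    n = len(a)
    observed = sum(x == y for x, y in zip(a, b)) / n
    counts_a, counts_b = Counter(a), Counter(b)
    # expected agreement if each annotator labeled independently at random
    # according to their own marginal label frequencies
    expected = sum(counts_a[l] * counts_b[l] for l in set(a) | set(b)) / (n * n)
    if expected == 1.0:
        return 1.0  # degenerate case: both annotators use a single label
    return (observed - expected) / (1 - expected)
```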
### Week 3: Evaluation + Presentation

- Publish dataset to HuggingFace
- Run validation tests (breach prediction, known-groups, boilerplate index)
- Write all sections, create figures
- Code cleanup, README
- **Thursday:** Full team review and rehearsal
- **Friday:** Presentation day

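The headline metric behind the grade targets is F1 (the C-range bar is F1 > 0.80). A dependency-free macro-F1 sketch that treats every class equally, which matters here because categories like Incident Disclosure will be rare; results should be cross-checked against scikit-learn's `f1_score(average="macro")`:

```python
def macro_f1(y_true: list[str], y_pred: list[str]) -> float:
    """Unweighted mean of per-class F1 over all labels seen in either list."""
    labels = sorted(set(y_true) | set(y_pred))
    scores = []
    for label in labels:
        tp = sum(t == label and p == label for t, p in zip(y_true, y_pred))
        fp = sum(t != label and p == label for t, p in zip(y_true, y_pred))
        fn = sum(t == label and p != label for t, p in zip(y_true, y_pred))
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)
```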
### Critical Path

```
Data extraction → Paragraph corpus → GenAI labeling → Judge panel → Final labels
                                          ↓
Rubric design → Pilot → Rubric v2 ───────────────────────→ Gold set validation
                                          ↓
DAPT pre-training ──→ Fine-tuning experiments ──→ Evaluation ──→ Final comparison
```

---

## Budget

| Item | Cost |
|------|------|
| GenAI Stage 1 dual annotation (50K × 2 models, batch) | ~$115 |
| GenAI Stage 2 judge panel (~8.5K × 3 models, batch) | ~$55 |
| Prompt caching savings | -$30 to -$40 |
| SEC EDGAR data | $0 |
| Breach databases | $0 |
| Compute (RTX 3090, owned) | $0 |
| **Total** | **~$130-170** |

---

## GPU-Free Work (next 2 days)

Everything below can proceed without GPU:

- [ ] Set up project repo structure, dependencies, environment
- [ ] Build EDGAR extraction pipeline (download + parse Item 1C)
- [ ] Build 8-K extraction pipeline
- [ ] Paragraph segmentation logic
- [ ] Company metadata table (CIK → ticker → GICS sector)
- [ ] Download PleIAs/SEC corpus for future DAPT
- [ ] Refine labeling rubric, create pilot samples
- [ ] Set up GenAI labeling scripts (batch API calls)
- [ ] Set up evaluation framework (metrics computation code)
- [ ] Download breach databases (PRC, VCDB, CISA KEV)
- [ ] Gold set sampling strategy
- [ ] Begin human labeling of pilot set

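For the CIK → ticker step, SEC EDGAR publishes a `company_tickers.json` mapping; the structure shown in the comment reflects that file at the time of writing and should be verified before relying on it. GICS sector and market cap are not in that file and would come from a separate source. A normalization sketch that produces the zero-padded 10-digit CIK form EDGAR URLs use:

```python
# company_tickers.json entries look roughly like:
#   {"0": {"cik_str": 320193, "ticker": "AAPL", "title": "Apple Inc."}, ...}
# Fetch from https://www.sec.gov/files/company_tickers.json with a
# descriptive User-Agent, per SEC fair-access guidelines; parsed offline here.

def cik_to_ticker(company_tickers: dict) -> dict[str, str]:
    """Map zero-padded 10-digit CIK strings to tickers."""
    return {
        str(row["cik_str"]).zfill(10): row["ticker"]
        for row in company_tickers.values()
    }
```

Keying everything on the padded CIK string avoids the int-vs-string mismatches that otherwise creep in when joining against filing indexes.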
### GPU-Required (deferred)

- DAPT pre-training of SEC-ModernBERT-large (~2-3 days on the RTX 3090)
- All classification fine-tuning experiments
- Model inference and evaluation