# Project 3: SEC Cybersecurity Disclosure Quality Classifier
## Capstone 2026 — BUSI488/COMP488 — Team Knowledge Transfer

**Project:** Build a validated, reusable classifier that labels SEC cybersecurity disclosures by content category and specificity level, then fine-tune an open-weights model for deployment at scale.

**Methodology:** Ringel (2023) "Synthetic Experts" pipeline — use frontier LLMs to generate training labels, then distill into a small open-weights encoder model.

**Why this project:** No HuggingFace dataset of extracted Item 1C disclosures exists. No trained classifier for cybersecurity disclosure quality exists. No domain-adapted ModernBERT on SEC filings exists. The iXBRL CYD taxonomy just went live (Dec 2024). We produce **three publishable artifacts**: a novel dataset, a labeling methodology, and a SOTA classifier.

---

## Table of Contents

1. [Regulatory Background](#1-regulatory-background)
2. [Labeling Rubric](#2-labeling-rubric)
3. [Data Acquisition](#3-data-acquisition)
4. [GenAI Labeling Pipeline](#4-genai-labeling-pipeline)
5. [Model Strategy](#5-model-strategy)
6. [Evaluation & Validation](#6-evaluation--validation)
7. [Release Artifacts](#7-release-artifacts)
8. [3-Week Schedule (6 People)](#8-3-week-schedule-6-people)
9. [Budget](#9-budget)
10. [Reference Links](#10-reference-links)

---
## 1. Regulatory Background
### The Rule: SEC Release 33-11216 (July 2023)

The SEC adopted final rules requiring public companies to disclose cybersecurity risk management, strategy, governance, and material incidents. This created a massive new text corpus with natural variation in quality — perfect for classification.

Full rule PDF: <https://www.sec.gov/files/rules/final/2023/33-11216.pdf>

Fact sheet: <https://www.sec.gov/files/33-11216-fact-sheet.pdf>
### Item 1C — Annual Disclosure (10-K)
Appears as **Regulation S-K Item 106**, reported in **Item 1C** of the 10-K. Two mandated subsections:

**Item 106(b) — Risk Management and Strategy:**

1. Processes for assessing, identifying, and managing material cybersecurity risks
2. Whether/how cybersecurity processes integrate into overall enterprise risk management (ERM)
3. Whether the company engages external assessors, consultants, or auditors
4. Processes to oversee/identify risks from third-party service providers
5. Whether cybersecurity risks (including prior incidents) have materially affected or are reasonably likely to affect business strategy, results, or financial condition

**Item 106(c) — Governance:**

*Board Oversight (106(c)(1)):*

- Description of board's oversight of cybersecurity risks
- Identification of responsible board committee/subcommittee
- Processes by which the board/committee is informed about risks

*Management's Role (106(c)(2)):*

- Which management positions/committees are responsible
- Relevant expertise of those persons
- How management monitors prevention, detection, mitigation, and remediation
- Whether and how frequently management reports to the board

**Key design note:** The SEC uses "describe" — it does not prescribe specific items. The enumerated sub-items are non-exclusive suggestions. This principles-based approach creates natural variation in specificity and content, which is exactly what our rubric captures.
### Item 1.05 — Incident Disclosure (8-K)
Required within **4 business days** of determining a cybersecurity incident is material:

1. Material aspects of the nature, scope, and timing of the incident
2. Material impact or reasonably likely material impact on the registrant

**Key nuances:**

- The 4-day clock starts at the **materiality determination**, not the incident itself
- Companies explicitly do NOT need to disclose technical details that would impede response/remediation
- The U.S. Attorney General can delay disclosure up to 120 days for national security
- Companies must amend the 8-K when new material information becomes available

**The May 2024 shift:** After SEC Director Erik Gerding clarified that Item 1.05 is only for *material* incidents, companies pivoted from Item 1.05 to Items 8.01/7.01 for non-material disclosures:

- Pre-guidance: 72% used Item 1.05, 28% used 8.01/7.01
- Post-guidance: 34% used Item 1.05, 66% used 8.01/7.01

**Our extraction must capture all three item types.**
### Compliance Timeline
| Date | Milestone |
|------|-----------|
| Jul 26, 2023 | Rule adopted |
| Sep 5, 2023 | Rule effective |
| Dec 15, 2023 | Item 1C required in 10-Ks (FY ending on/after this date) |
| Dec 18, 2023 | Item 1.05 required in 8-Ks |
| Jun 15, 2024 | Item 1.05 required for smaller reporting companies |
| Dec 15, 2024 | iXBRL tagging of Item 106 (CYD taxonomy) required |
| Dec 18, 2024 | iXBRL tagging of 8-K Item 1.05 required |
### iXBRL CYD Taxonomy
The SEC published the **Cybersecurity Disclosure (CYD) Taxonomy** on Sep 16, 2024. Starting with annual reports for fiscal years ending on or after Dec 15, 2024, Item 1C disclosures are tagged in Inline XBRL using the `cyd` prefix. This means 2025 filings can be parsed programmatically via XBRL rather than HTML scraping.

Taxonomy schema: `http://xbrl.sec.gov/cyd/2024`

Taxonomy guide: <https://xbrl.sec.gov/cyd/2024/cyd-taxonomy-guide-2024-09-16.pdf>
### Corpus Size
| Filing Type | Estimated Count (as of early 2026) |
|-------------|-----------------------------------|
| 10-K with Item 1C (FY2023 cycle) | ~4,500 |
| 10-K with Item 1C (FY2024 cycle) | ~4,500 |
| 8-K cybersecurity incidents | ~80 filings (55 incidents + amendments) |
| **Total filings** | **~9,000-10,000** |
| **Estimated paragraphs** (from Item 1C) | **~50,000-80,000** |

---
## 2. Labeling Rubric
### Dimension 1: Content Category (single-label per paragraph)

Derived directly from the SEC rule structure. Each paragraph receives exactly one category:

| Category | SEC Basis | What It Covers | Example Markers |
|----------|-----------|----------------|-----------------|
| **Board Governance** | 106(c)(1) | Board/committee oversight, briefing frequency, board cyber expertise | "Audit Committee," "Board of Directors oversees," "quarterly briefings" |
| **Management Role** | 106(c)(2) | CISO/CTO identification, qualifications, reporting structure | "Chief Information Security Officer," "reports to," "years of experience" |
| **Risk Management Process** | 106(b) | Assessment/identification processes, ERM integration, framework references | "NIST CSF," "ISO 27001," "risk assessment," "vulnerability management" |
| **Third-Party Risk** | 106(b) | Vendor oversight, external assessors/consultants, supply chain risk | "third-party," "service providers," "penetration testing by," "external auditors" |
| **Incident Disclosure** | 8-K 1.05 | Nature/scope/timing of incidents, material impact, remediation | "unauthorized access," "detected," "incident," "remediation," "impacted" |
| **Strategy Integration** | 106(b)(2) | Material impact on business strategy, cyber insurance, resource allocation | "business strategy," "insurance," "investment," "material," "financial condition" |
| **None/Other** | — | Boilerplate intros, legal disclaimers, non-cybersecurity content | Forward-looking statement disclaimers, general risk language |
### Dimension 2: Specificity (4-point ordinal per paragraph)
Grounded in Berkman et al. (2018), Gibson Dunn surveys, and PwC quality tiers:

| Level | Label | Definition | Decision Test |
|-------|-------|------------|---------------|
| **1** | **Generic Boilerplate** | Could apply to any company. Conditional language ("may," "could"). No named entities. Passive voice. | "Could I paste this into a different company's filing unchanged?" → Yes |
| **2** | **Sector-Adapted** | References industry context or named frameworks (NIST, ISO) but no firm-specific detail. | "Does this name something specific but not unique to THIS company?" → Yes |
| **3** | **Firm-Specific** | Names roles (CISO by name), committees, reporting lines, specific programs, or processes unique to the firm. Active voice with accountability. | "Does this contain at least one fact unique to THIS company?" → Yes |
| **4** | **Quantified-Verifiable** | Includes metrics, dollar amounts, dates, frequencies, third-party audit references, or independently verifiable facts. Multiple firm-specific facts with operational detail. | "Could an outsider verify a specific claim in this paragraph?" → Yes |

**Boundary rules for annotators:**

- If torn between 1 and 2: "Does it name ANY framework, standard, or industry term?" → Yes = 2
- If torn between 2 and 3: "Does it mention anything unique to THIS company?" → Yes = 3
- If torn between 3 and 4: "Does it contain TWO OR MORE specific, verifiable facts?" → Yes = 4

**Important:** EvasionBench (Ma et al., 2026) found that a 5-level ordinal scale failed (kappa < 0.5) and had to be collapsed to 3 levels. **Pilot test this 4-level scale on 50 paragraphs early.** Be prepared to merge levels 1-2 or 3-4 if inter-annotator agreement is poor.
### Boilerplate vs. Substantive Markers (from the literature)
**Boilerplate indicators:**

- Conditional language: "may," "could," "might"
- Generic risk statements without company-specific context
- No named individuals, committees, or frameworks
- Identical language across same-industry filings (cosine similarity > 0.8)
- Passive voice: "cybersecurity risks are managed"

**Substantive indicators:**

- Named roles and reporting structures ("Our CISO, Jane Smith, reports quarterly to the Audit Committee")
- Specific frameworks by name (NIST CSF, ISO 27001, SOC 2, PCI-DSS)
- Concrete processes (penetration testing frequency, tabletop exercises)
- Quantification (dollar investment, headcount, incident counts, training completion rates)
- Third-party names or types of assessments
- Temporal specificity (dates, frequencies, durations)
### Mapping to NIST CSF 2.0
For academic grounding, our content categories map to NIST CSF 2.0 functions:

| Our Category | NIST CSF 2.0 |
|-------------|-------------|
| Board Governance | GOVERN (GV.OV, GV.RR) |
| Management Role | GOVERN (GV.RR, GV.RM) |
| Risk Management Process | IDENTIFY (ID.RA), GOVERN (GV.RM), PROTECT (all) |
| Third-Party Risk | GOVERN (GV.SC) |
| Incident Disclosure | DETECT, RESPOND, RECOVER |
| Strategy Integration | GOVERN (GV.OC, GV.RM) |

---
## 3. Data Acquisition
### 3.1 Extracting 10-K Item 1C

**Recommended pipeline:**

```
sec-edgar-downloader  →  edgar-crawler    →  paragraph segmentation  →  dataset
  (bulk download)        (parse Item 1C)      (split into units)
```
**Tools:**
| Tool | Purpose | Install | Notes |
|------|---------|---------|-------|
| `sec-edgar-downloader` | Bulk download 10-K filings by CIK | `pip install sec-edgar-downloader` | Pure downloader, no parsing |
| `edgar-crawler` | Extract specific item sections to JSON | `git clone github.com/lefterisloukas/edgar-crawler` | Best for bulk extraction; configure `['1C']` in items list |
| `edgartools` | Interactive exploration, XBRL parsing | `pip install edgartools` | `tenk['Item 1C']` accessor; great for prototyping |
| `sec-api` | Commercial API, zero parsing headaches | `pip install sec-api` | `extractorApi.get_section(url, "1C", "text")` — paid, free tier available |
**EDGAR API requirements:**

- Rate limit: 10 requests/second
- Required: Custom `User-Agent` header with name and email (e.g., `"TeamName team@email.com"`)
- SEC blocks requests without proper User-Agent (returns 403)
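The requirements above fit in a few lines of stdlib Python. This is a minimal sketch, not the full download pipeline; `EDGAR_HEADERS` uses the placeholder contact string from the bullet above, which you should replace with your own team's name and email.

```python
import time
import urllib.request

# SEC requires a descriptive User-Agent; anonymous requests get HTTP 403.
# The contact string is a placeholder -- substitute your own name and email.
EDGAR_HEADERS = {"User-Agent": "TeamName team@email.com"}

class RateLimiter:
    """Enforces a minimum interval between requests (EDGAR allows 10 req/s)."""
    def __init__(self, max_per_second=10.0):
        self.min_interval = 1.0 / max_per_second
        self._next_ok = 0.0

    def wait(self, now=None):
        """Sleep until the next request is allowed; returns seconds slept."""
        now = time.monotonic() if now is None else now
        delay = max(0.0, self._next_ok - now)
        if delay:
            time.sleep(delay)
        self._next_ok = now + delay + self.min_interval
        return delay

def edgar_get(url, limiter):
    """Fetch one EDGAR URL politely (rate-limited, proper User-Agent)."""
    limiter.wait()
    req = urllib.request.Request(url, headers=EDGAR_HEADERS)
    with urllib.request.urlopen(req) as resp:
        return resp.read()
```

Wrap every EDGAR request (including those issued by your own scripts around the tools above) through one shared `RateLimiter` so the whole team process stays under the 10 req/s ceiling.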
**For iXBRL-tagged filings (2025+):** Use `edgartools` XBRL parser to extract CYD taxonomy elements directly. This gives pre-structured data aligned with regulatory categories.

**Fallback corpus:** `PleIAs/SEC` on HuggingFace (373K 10-K full texts, CC0 license) — but sections are NOT pre-parsed; you must extract Item 1C yourself.
### 3.2 Extracting 8-K Incident Disclosures
| Tool | Purpose | URL |
|------|---------|-----|
| `sec-8k-item105` | Extract Item 1.05 from 8-Ks, iXBRL + HTML fallback | `github.com/JMousqueton/sec-8k-item105` |
| `SECurityTr8Ker` | Monitor SEC RSS for new cyber 8-Ks, Slack/Teams alerts | `github.com/pancak3lullz/SECurityTr8Ker` |
| Debevoise 8-K Tracker | Curated list with filing links, dates, amendments | `debevoisedatablog.com/2024/03/06/cybersecurity-form-8-k-tracker/` |
| Board Cybersecurity Tracker | Links filings to MITRE ATT&CK, impact assessments | `board-cybersecurity.com/incidents/tracker` |

**Critical:** Must capture Item 1.05 AND Items 8.01/7.01 (post-May 2024 shift).
### 3.3 Paragraph Segmentation
Once Item 1C text is extracted, segment into paragraphs:

- Split on double newlines or `<p>` tags (depending on extraction format)
- Minimum paragraph length: 20 words (filter out headers, whitespace)
- Maximum paragraph length: 500 words (split longer blocks at sentence boundaries)
- Preserve metadata: company name, CIK, ticker, filing date, fiscal year

Expected yield: ~5-8 paragraphs per Item 1C disclosure × ~9,000 filings = **~50,000-70,000 paragraphs**
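The segmentation rules above can be sketched for the plain-text case (the `<p>`-tag path needs an HTML parser instead of the blank-line split; the function names are our own):

```python
import re

MIN_WORDS, MAX_WORDS = 20, 500

def split_long(paragraph, max_words=MAX_WORDS):
    """Split an over-long paragraph at sentence boundaries."""
    sentences = re.split(r"(?<=[.!?])\s+", paragraph)
    chunks, current, count = [], [], 0
    for sent in sentences:
        n = len(sent.split())
        if current and count + n > max_words:
            chunks.append(" ".join(current))
            current, count = [], 0
        current.append(sent)
        count += n
    if current:
        chunks.append(" ".join(current))
    return chunks

def segment(item_1c_text):
    """Blank-line split, drop short fragments, break up long blocks."""
    paragraphs = re.split(r"\n\s*\n", item_1c_text)
    out = []
    for p in paragraphs:
        p = " ".join(p.split())           # normalize whitespace
        if len(p.split()) < MIN_WORDS:    # drops headers and stray lines
            continue
        out.extend(split_long(p))
    return out
```

Attach the filing metadata (company, CIK, ticker, date, fiscal year) to each returned paragraph before writing the dataset.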
### 3.4 Pre-Existing Datasets and Resources
| Resource | What It Is | URL |
|----------|-----------|-----|
| PleIAs/SEC | 373K full 10-K texts (CC0) | `huggingface.co/datasets/PleIAs/SEC` |
| EDGAR-CORPUS | 220K filings with sections pre-parsed (Apache 2.0) | `huggingface.co/datasets/eloukas/edgar-corpus` |
| Board Cybersecurity 23-Feature Analysis | Regex-based extraction of 23 governance/security features from 4,538 10-Ks | `board-cybersecurity.com/research/insights/` |
| Gibson Dunn S&P 100 Survey | Detailed feature analysis of disclosure content | `corpgov.law.harvard.edu/2025/01/09/cybersecurity-disclosure-overview-...` |
| Florackis et al. (2023) "Cybersecurity Risk" | Firm-level cyber risk measure from 10-K text, RFS publication | SSRN: 3725130, data companion: 4319606 |
| zeroshot/cybersecurity-corpus | General cybersecurity text (not SEC-specific, useful for DAPT) | `huggingface.co/datasets/zeroshot/cybersecurity-corpus` |

---
## 4. GenAI Labeling Pipeline
### 4.1 Multi-Model Consensus (EvasionBench Architecture)

We follow Ma et al. (2026, arXiv:2601.09142) — the EvasionBench pipeline, designed for an almost identical task (ordinal classification of financial text). Their approach achieved Cohen's Kappa = 0.835 with human annotators.

**Stage 1 — Dual Independent Annotation (all ~50K paragraphs):**

- Annotator A: **Claude Sonnet 4.6** (batch API — $1.50/$7.50 per M input/output tokens)
- Annotator B: **Gemini 2.5 Flash** ($0.30/$2.50 per M tokens)
- Architectural diversity (Anthropic vs. Google) minimizes correlated errors
- ~83% of paragraphs will have immediate agreement

**Stage 2 — Judge Panel for Disagreements (~17% = ~8,500 cases):**

- Judge 1: **Claude Opus 4.6** (batch — $2.50/$12.50 per M tokens)
- Judge 2: **GPT-5** (batch — $0.63/$5.00 per M tokens)
- Judge 3: **Gemini 2.5 Pro** (~$2-4/$12-18 per M tokens)
- Majority vote (2/3) resolves disagreements
- Anti-bias: randomize label presentation order

**Stage 3 — Active Learning Pass:**

- Cluster remaining low-confidence cases
- Human-review ~5% (~2,500 cases) to identify systematic errors
- Iterate rubric if needed, re-run affected subsets
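The Stage 1/Stage 2 routing logic reduces to a few lines. A minimal sketch (the real pipeline would also log per-model reasoning and route three-way judge splits into the Stage 3 human queue):

```python
from collections import Counter

def consensus(label_a, label_b, judge_labels=None):
    """Stage 1: accept when the two primary annotators agree.
    Stage 2: otherwise take the 2/3 majority of the judge panel.
    Returns None when the case still needs judges or human review."""
    if label_a == label_b:
        return label_a
    if judge_labels is None:
        return None  # disagreement: send to the judge panel
    top, votes = Counter(judge_labels).most_common(1)[0]
    return top if votes >= 2 else None  # 3-way split -> human review
```

Running this over the corpus also gives the agreement rate for free, so the ~83%/~17% split above can be verified on our own data rather than assumed.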
### 4.2 Prompt Template
```
SYSTEM PROMPT:
You are an expert annotator classifying paragraphs from SEC cybersecurity
disclosures (10-K Item 1C and 8-K Item 1.05 filings).

For each paragraph, assign:
(a) content_category: exactly one of ["Board Governance", "Management Role",
    "Risk Management Process", "Third-Party Risk", "Incident Disclosure",
    "Strategy Integration", "None/Other"]
(b) specificity_level: integer 1-4

CONTENT CATEGORIES:
- Board Governance: Board/committee oversight of cybersecurity risks, briefing
  frequency, board member cyber expertise
- Management Role: CISO/CTO/CIO identification, qualifications, reporting
  structure, management committees
- Risk Management Process: Risk assessment methodology, framework adoption
  (NIST, ISO, etc.), vulnerability management, monitoring, incident response
  planning, tabletop exercises
- Third-Party Risk: Vendor/supplier risk oversight, external assessor engagement,
  contractual security requirements, supply chain risk
- Incident Disclosure: Description of cybersecurity incidents, scope, timing,
  impact, remediation actions
- Strategy Integration: Material impact on business strategy or financials,
  cyber insurance, investment/resource allocation
- None/Other: Boilerplate introductions, legal disclaimers, forward-looking
  statement warnings, non-cybersecurity content

SPECIFICITY SCALE:
1 - Generic Boilerplate: Could apply to any company. Conditional language
    ("may," "could"). No named entities.
    Example: "We face cybersecurity risks that could materially affect our
    business operations."

2 - Sector-Adapted: References industry context or named frameworks but no
    firm-specific details.
    Example: "We employ a cybersecurity framework aligned with the NIST
    Cybersecurity Framework to manage cyber risk."

3 - Firm-Specific: Contains facts unique to this company — named roles,
    committees, specific programs, reporting lines.
    Example: "Our CISO reports quarterly to the Audit Committee on
    cybersecurity risk posture and incident trends."

4 - Quantified-Verifiable: Includes metrics, dollar amounts, dates,
    frequencies, third-party audit references, or independently verifiable facts.
    Example: "Following the March 2024 incident affecting our payment systems,
    we engaged CrowdStrike and implemented network segmentation at a cost of
    $4.2M, completing remediation in Q3 2024."

BOUNDARY RULES:
- If torn between 1 and 2: "Does it name ANY framework, standard, or industry
  term?" If yes → 2
- If torn between 2 and 3: "Does it mention anything unique to THIS company?"
  If yes → 3
- If torn between 3 and 4: "Does it contain TWO OR MORE specific, verifiable
  facts?" If yes → 4

Respond with valid JSON only. Include a brief reasoning field.

USER PROMPT:
Company: {company_name}
Filing Date: {filing_date}
Paragraph:
{paragraph_text}
```
**Expected output:**

```json
{
  "content_category": "Board Governance",
  "specificity_level": 3,
  "reasoning": "Identifies Audit Committee by name and describes quarterly briefing cadence, both firm-specific facts."
}
```
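Even with structured-output mode, validate every response against the rubric before it enters the dataset. A minimal sketch (`parse_label` is our own name; rejected rows go back into the batch queue):

```python
import json

CATEGORIES = {
    "Board Governance", "Management Role", "Risk Management Process",
    "Third-Party Risk", "Incident Disclosure", "Strategy Integration",
    "None/Other",
}

def parse_label(raw):
    """Parse one model response; raise ValueError on anything off-rubric
    so bad rows can be re-queued instead of silently polluting the dataset."""
    record = json.loads(raw)
    if record.get("content_category") not in CATEGORIES:
        raise ValueError(f"unknown category: {record.get('content_category')!r}")
    level = record.get("specificity_level")
    if not isinstance(level, int) or not 1 <= level <= 4:
        raise ValueError(f"specificity_level out of range: {level!r}")
    return record
```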
### 4.3 Practical Labeling Notes
- **Always use Batch API.** Both OpenAI and Anthropic offer 50% discount for async/batch processing (24-hour turnaround). No reason to use real-time.
- **Prompt caching:** The system prompt (~800 tokens) is identical for every request. With Anthropic's prompt caching, cached reads cost 10% of base price. Combined with batch discount = 5% of standard price.
- **Structured output mode:** Use JSON mode / structured outputs on all providers. Reduces parsing errors by ~90%.
- **Reasoning models (o3, extended thinking):** Use ONLY as judges for disagreement cases, not as primary annotators. They're overkill for clear-cut classification and expensive due to reasoning token consumption.
### 4.4 Gold Set Protocol
**Non-negotiable for publication quality.**

1. Sample 300-500 paragraphs, stratified by:
   - Expected content category (ensure all 7 represented)
   - Expected specificity level (ensure all 4 represented)
   - Industry (financial services, tech, healthcare, manufacturing)
   - Filing year (FY2023 vs FY2024)

2. Two team members independently label the full gold set

3. Compute:
   - Cohen's Kappa (binary/nominal categories)
   - Krippendorff's Alpha (ordinal specificity scale)
   - Per-class confusion matrices
   - Target: Kappa > 0.75 ("substantial agreement")

4. Adjudicate disagreements with a third team member

5. Run the full MMC pipeline on the gold set and compare
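For step 3, `sklearn.metrics.cohen_kappa_score` and the `krippendorff` package are the usual library routes; the statistic itself is simple enough to sketch from scratch, which also makes a handy sanity check:

```python
from collections import Counter

def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators over nominal labels:
    (observed agreement - chance agreement) / (1 - chance agreement)."""
    assert labels_a and len(labels_a) == len(labels_b)
    n = len(labels_a)
    p_o = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    p_e = sum(freq_a[c] * freq_b[c] for c in freq_a) / (n * n)
    if p_e == 1.0:
        return 1.0  # degenerate case: a single label used throughout
    return (p_o - p_e) / (1 - p_e)
```

For the ordinal specificity scale, plain kappa undercounts near-misses (a 3-vs-4 disagreement is not as bad as 1-vs-4), which is why Krippendorff's Alpha with an ordinal distance metric is listed alongside it.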
---
## 5. Model Strategy
### 5.1 Primary: SEC-ModernBERT-large

**This model does not exist publicly. Building it is a core contribution.**

**Base model:** `answerdotai/ModernBERT-large`

- 395M parameters
- 8,192-token native context (vs. 512 for DeBERTa-v3-large)
- RoPE + alternating local/global attention + FlashAttention
- 2-4x faster than DeBERTa-v3-large
- Apache 2.0 license
- GLUE: 90.4 (only 1 point behind DeBERTa-v3-large's 91.4)
**Step 1 — Domain-Adaptive Pre-Training (DAPT):**
Continue MLM pre-training on SEC filing text to create "SEC-ModernBERT-large":

- **Training corpus:** 200-500M tokens of SEC filings (from PleIAs/SEC or your own EDGAR download). Include 10-Ks, 10-Qs, 8-Ks, proxy statements.
- **MLM objective:** 30% masking rate (ModernBERT convention)
- **Learning rate:** ~5e-5 (much lower than from-scratch pre-training)
- **Hardware (RTX 3090):** bf16, gradient checkpointing, seq_len=1024-2048, batch_size=2-4 + gradient accumulation to effective batch 16-32
- **VRAM estimate:** ~12-15GB at batch=4, seq=2048 with gradient checkpointing — fits on 3090
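In practice `transformers.DataCollatorForLanguageModeling(tokenizer, mlm_probability=0.3)` handles the masking; the pure-Python sketch below just makes the 30% MLM objective concrete (the `MASK_ID` value is illustrative, check the actual ModernBERT tokenizer):

```python
import random

MASK_ID = 50284  # illustrative [MASK] id -- look it up from the real tokenizer

def mask_tokens(token_ids, vocab_size, mask_prob=0.30, rng=None):
    """BERT-style masking at ModernBERT's 30% rate. Of the selected positions:
    80% -> [MASK], 10% -> random token, 10% -> left unchanged.
    Returns (corrupted_ids, labels), labels = -100 where no loss applies."""
    rng = rng or random.Random()
    corrupted, labels = list(token_ids), [-100] * len(token_ids)
    for i, tok in enumerate(token_ids):
        if rng.random() >= mask_prob:
            continue
        labels[i] = tok  # loss is computed only at selected positions
        roll = rng.random()
        if roll < 0.8:
            corrupted[i] = MASK_ID
        elif roll < 0.9:
            corrupted[i] = rng.randrange(vocab_size)
        # else: keep the original token (the 10% "unchanged" slice)
    return corrupted, labels
```

Because masking is sampled per step, every epoch sees a different corruption of the same filings, which is effectively free data augmentation for DAPT.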
**Evidence DAPT works:**
- Gururangan et al. (2020): consistent improvements across all tested domains
- Patent domain ModernBERT (arXiv:2509.14926): +0.9 to +2.8 F1 from continued pre-training on 31.6B tokens
- Scaling-law analysis on SEC filings (arXiv:2512.12384): consistent improvement with largest gains in first 200M tokens
- Databricks customer report: 70% → 95% accuracy with domain-specific pre-training
**Step 2 — Classification Fine-Tuning:**
Fine-tune SEC-ModernBERT-large on the 50K labeled paragraphs:

- **Sequence length:** 2048 tokens (captures full regulatory paragraphs that 512-token models truncate)
- **Two classification heads:** content_category (7-class softmax) + specificity_level (4-class ordinal or softmax)
- **Add supervised contrastive loss (SCL):** Combine standard cross-entropy with SCL that pulls same-class embeddings together. Gunel et al. (2020) showed +0.5-1.5% improvement, especially for rare/imbalanced classes.
- **VRAM:** ~11-13GB at batch=8, seq=2048 in bf16 — comfortable on 3090
- **3090 supports bf16** natively via Ampere Tensor Cores. Use `bf16=True` in HuggingFace Trainer. No loss scaling needed (unlike fp16).
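The SCL term (Khosla et al., 2020, as applied by Gunel et al.) can be sketched in NumPy to pin down exactly what gets added to cross-entropy; in training you would compute the same thing on the batch embeddings in PyTorch and use `loss = ce + lam * scl` with a tuned weight `lam` (our placeholder name):

```python
import numpy as np

def supcon_loss(embeddings, labels, temperature=0.1):
    """Supervised contrastive loss: for each anchor, maximize the softmax
    probability of its same-class neighbors relative to all other samples."""
    z = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    labels = np.asarray(labels)
    n = len(labels)
    sim = z @ z.T / temperature
    eye = np.eye(n, dtype=bool)
    sim = np.where(eye, -np.inf, sim)        # exclude self-similarity
    # row-wise log-softmax over all other samples
    m = sim.max(axis=1, keepdims=True)
    log_prob = sim - (m + np.log(np.exp(sim - m).sum(axis=1, keepdims=True)))
    pos = (labels[:, None] == labels[None, :]) & ~eye
    counts = pos.sum(axis=1)
    valid = counts > 0                       # anchors with >= 1 positive
    per_anchor = np.where(pos, log_prob, 0.0).sum(axis=1)[valid] / counts[valid]
    return float(-per_anchor.mean())
```

The loss drops when same-class embeddings cluster, which is exactly the geometry that helps the rare categories (Incident Disclosure, Strategy Integration) where cross-entropy alone sees few examples.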
### 5.2 Dark Horse: NeoBERT
`chandar-lab/NeoBERT`

- **250M parameters** (145M fewer than ModernBERT-large, 185M fewer than DeBERTa-v3-large)
- 4,096-token context
- SwiGLU, RoPE, Pre-RMSNorm, FlashAttention
- GLUE: 89.0 (close to DeBERTa-v3-large's 91.4)
- MTEB: 51.3 (crushes everything else — ModernBERT-large is 46.9)
- MIT license
- Requires `trust_remote_code=True`
- Almost nobody is using it for domain-specific tasks

Same DAPT + fine-tuning pipeline as ModernBERT-large, with even less VRAM.
### 5.3 Baseline: DeBERTa-v3-large
`microsoft/deberta-v3-large`

- 304M backbone + 131M embedding = ~435M total
- 512-token native context (can push to ~1024)
- Disentangled attention + ELECTRA-style RTD pre-training
- GLUE: **91.4** — still the highest among all encoders
- MIT license
- **Weakness:** no long context support, completely fails at retrieval tasks

Include as baseline to show improvement from (a) long context and (b) DAPT.
### 5.4 Ablation Design
| Experiment | Model | Context | DAPT | SCL | Purpose |
|-----------|-------|---------|------|-----|---------|
| Baseline | DeBERTa-v3-large | 512 | No | No | "Standard" approach per syllabus |
| + Long context | ModernBERT-large | 2048 | No | No | Shows context window benefit |
| + Domain adapt | SEC-ModernBERT-large | 2048 | Yes | No | Shows DAPT benefit |
| + Contrastive | SEC-ModernBERT-large | 2048 | Yes | Yes | Shows SCL benefit |
| Efficiency | NeoBERT (+ DAPT) | 2048 | Yes | Yes | 40% fewer params, comparable? |
| **Ensemble** | SEC-ModernBERT + DeBERTa | mixed | mixed | — | Maximum performance |

The ensemble averages logits from SEC-ModernBERT-large (long context, domain-adapted) and DeBERTa-v3-large (highest raw NLU). Their architecturally different attention mechanisms should produce less correlated errors.
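A minimal sketch of the ensemble step (averaging class probabilities rather than raw logits, since the two models are trained separately and their logit scales need not match; the equal weight `w=0.5` is a starting assumption to tune on validation data):

```python
import numpy as np

def softmax(logits, axis=-1):
    z = logits - logits.max(axis=axis, keepdims=True)  # numerical stability
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def ensemble_predict(logits_modernbert, logits_deberta, w=0.5):
    """Weighted average of the two models' class probabilities, then argmax."""
    probs = w * softmax(logits_modernbert) + (1 - w) * softmax(logits_deberta)
    return probs.argmax(axis=-1), probs
```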
### 5.5 Training Framework
- **Encoder fine-tuning:** HuggingFace `transformers` + `Trainer` with `AutoModelForSequenceClassification`
- **DAPT continued pre-training:** HuggingFace `transformers` with `DataCollatorForLanguageModeling`
- **SCL implementation:** Custom training loop or modify Trainer with dual loss
- **Few-shot prototyping:** `SetFit` (sentence-transformers based) for rapid baseline in <30 seconds

**Key reference:** Phil Schmid's ModernBERT fine-tuning tutorial: <https://www.philschmid.de/fine-tune-modern-bert-in-2025>
### 5.6 Domain-Specific Encoder Models (for comparison only)
These exist but are all BERT-base (110M params, 512 context) — architecturally outdated:

| Model | HuggingFace ID | Domain | Params |
|-------|---------------|--------|--------|
| SEC-BERT | `nlpaueb/sec-bert-base` | 260K 10-K filings | 110M |
| SEC-BERT-SHAPE | `nlpaueb/sec-bert-shape` | Same, with number normalization | 110M |
| FinBERT | `ProsusAI/finbert` | Financial sentiment | 110M |
| Legal-BERT | `nlpaueb/legal-bert-base-uncased` | 12GB legal text | 110M |
| SecureBERT | arXiv:2204.02685 | Cybersecurity text | 110M |

Our DAPT approach on a modern architecture (ModernBERT-large or NeoBERT) should outperform all of these. Include SEC-BERT as an additional baseline if time permits.

---
## 6. Evaluation & Validation
### 6.1 Required Metrics (from syllabus)

| Metric | Target | Notes |
|--------|--------|-------|
| **Macro-F1** on human holdout | Report per-class and overall | Minimum 1.2K holdout examples |
| **Per-class F1** | Identify weak categories | Expect "None/Other" to be noisiest |
| **Krippendorff's Alpha** | > 0.67 (adequate), > 0.75 (good) | GenAI labels vs. human gold set |
| **Calibration plots** | Reliability diagrams | For probabilistic outputs (softmax) |
| **Robustness splits** | Report by time period, industry, filing size | FY2023 vs FY2024; GICS sector; word count quartiles |
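`sklearn.calibration.calibration_curve` gives the reliability-diagram points directly; the sketch below computes the same binned statistics plus expected calibration error (ECE) from scratch, so the numbers behind each plot are explicit (`reliability_bins` is our own name):

```python
import numpy as np

def reliability_bins(confidences, correct, n_bins=10):
    """Per-bin (mean confidence, accuracy, count) for a reliability diagram,
    plus expected calibration error (ECE)."""
    conf = np.asarray(confidences, dtype=float)
    hit = np.asarray(correct, dtype=float)
    # assign each prediction to an equal-width confidence bin
    idx = np.minimum((conf * n_bins).astype(int), n_bins - 1)
    rows, ece = [], 0.0
    for b in range(n_bins):
        mask = idx == b
        if not mask.any():
            continue
        c, a = conf[mask].mean(), hit[mask].mean()
        rows.append((c, a, int(mask.sum())))
        ece += mask.mean() * abs(c - a)  # bin weight * calibration gap
    return rows, float(ece)
```

Plotting bin accuracy against bin confidence (a well-calibrated model hugs the diagonal) produces the reliability diagrams the syllabus asks for.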
### 6.2 Downstream Validity Tests
These demonstrate that the classifier's predictions correlate with real-world outcomes:

**Test 1 — Breach Prediction (strongest):**

- Do firms with lower specificity scores subsequently appear in breach databases?
- Cross-reference with:
  - **Privacy Rights Clearinghouse** (80K+ breaches; Mendeley dataset provides ticker/CIK matching: `doi.org/10.17632/w33nhh3282.1`)
  - **VCDB** (8K+ incidents, VERIS schema: `github.com/vz-risk/VCDB`)
  - **Board Cybersecurity Incident Tracker** (direct SEC filing links: `board-cybersecurity.com/incidents/tracker`)
  - **CISA KEV Catalog** (known exploited vulnerabilities: `cisa.gov/known-exploited-vulnerabilities-catalog`)

**Test 2 — Market Reaction (if time permits):**

- Event study: abnormal returns in [-1, +3] window around 8-K Item 1.05 filing
- Does prior Item 1C disclosure quality predict magnitude of reaction?
- Small sample (~55 incidents) but high signal
- Regression: CAR = f(specificity_score, incident_severity, firm_size, industry)

**Test 3 — Known-Groups Validity (easy, always include):**

- Do regulated industries (financial services under NYDFS, healthcare under HIPAA) produce systematically higher-specificity disclosures?
- Do larger firms (by market cap) have more specific disclosures?
- These are expected results — confirming them validates the measure

**Test 4 — Boilerplate Index (easy, always include):**

- Compute cosine similarity of each company's Item 1C to the industry-median disclosure
- Does our specificity score inversely correlate with this similarity measure?
- This is an independent, construct-free validation of the "uniqueness" dimension
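Test 4 needs no model at all. A minimal bag-of-words sketch (a TF-IDF vectorizer is the natural upgrade; `boilerplate_index` and the median-over-peers aggregation are our own illustrative choices for "similarity to the industry-median disclosure"):

```python
import math
import re
import statistics
from collections import Counter

def _tokens(text):
    return re.findall(r"[a-z0-9]+", text.lower())

def cosine_sim(text_a, text_b):
    """Bag-of-words cosine similarity between two disclosure texts."""
    ca, cb = Counter(_tokens(text_a)), Counter(_tokens(text_b))
    dot = sum(ca[t] * cb[t] for t in ca)
    na = math.sqrt(sum(v * v for v in ca.values()))
    nb = math.sqrt(sum(v * v for v in cb.values()))
    return dot / (na * nb) if na and nb else 0.0

def boilerplate_index(filing_text, industry_peer_texts):
    """Median similarity to industry peers: high values suggest copied
    boilerplate (cf. the >0.8 threshold in the rubric section)."""
    sims = [cosine_sim(filing_text, p) for p in industry_peer_texts]
    return statistics.median(sims)
```

Correlating this index against the model's specificity scores (expecting a negative relationship) is the validation step described above.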
### 6.3 External Benchmark
Per syllabus: "include an external benchmark approach (i.e., previous best practice)."

- **Board Cybersecurity's 23-feature regex extraction** is the natural benchmark. Their binary (present/absent) feature coding is the prior best practice. Our classifier should capture everything their regex captures plus the quality/specificity dimension they cannot measure.
- **Florackis et al. (2023) cybersecurity risk measure** from Item 1A text is another comparison — different section (1A vs 1C), different methodology (dictionary vs. classifier), different era (pre-rule vs. post-rule).

---
## 7. Release Artifacts
By project end, publish:

1. **HuggingFace Dataset:** Extracted Item 1C paragraphs with labels — first public dataset of its kind
2. **SEC-ModernBERT-large:** Domain-adapted model weights — first SEC-specific ModernBERT
3. **Fine-tuned classifiers:** Content category + specificity models, ready to deploy
4. **Labeling rubric + prompt templates:** Reusable for future SEC disclosure research
5. **Extraction pipeline code:** EDGAR → structured paragraphs → labeled dataset
6. **Evaluation notebook:** All metrics, ablations, validation tests

---
## 8. 3-Week Schedule (6 People)
### Team Roles

| Role | Person(s) | Primary Responsibility |
|------|-----------|----------------------|
| **Data Lead** | Person A | EDGAR extraction pipeline, paragraph segmentation, data cleaning |
| **Data Support** | Person B | 8-K extraction, breach database cross-referencing, dataset QA |
| **Labeling Lead** | Person C | Rubric refinement, GenAI prompt engineering, MMC pipeline orchestration |
| **Annotation** | Person D | Gold set human labeling, inter-rater reliability, active learning review |
| **Model Lead** | Person E | DAPT pre-training, classification fine-tuning, ablation experiments |
| **Eval & Writing** | Person F | Validation tests, metrics computation, final presentation, documentation |
### Week 1: Data + Rubric
|
||
|
||
| Day | Person A (Data Lead) | Person B (Data Support) | Person C (Labeling Lead) | Person D (Annotation) | Person E (Model Lead) | Person F (Eval & Writing) |
|-----|---------------------|------------------------|-------------------------|----------------------|----------------------|--------------------------|
| **Mon** | Set up EDGAR extraction pipeline (edgar-crawler + sec-edgar-downloader) | Set up 8-K extraction (sec-8k-item105) | Draft labeling rubric v1 from SEC rule | Read SEC rule + Gibson Dunn survey | Download ModernBERT-large, set up training env | Outline evaluation plan, identify breach databases |
| **Tue** | Begin bulk 10-K download (FY2023 cycle) | Extract all 8-K cyber filings (Items 1.05, 8.01, 7.01) | Pilot rubric on 30 paragraphs with Claude Opus | Pilot rubric on same 30 paragraphs independently | Download PleIAs/SEC corpus, prepare DAPT data | Download PRC Mendeley dataset, VCDB, set up cross-ref |
| **Wed** | Continue download (FY2024 cycle), begin Item 1C parsing | Build company metadata table (CIK → ticker → GICS sector → market cap) | Compare pilot labels with Person D, revise rubric boundary rules | Compute initial inter-rater agreement, flag problem areas | Begin DAPT pre-training (SEC-ModernBERT-large, ~2-3 days on 3090) | Map VCDB incidents to SEC filers by name matching |
| **Thu** | Paragraph segmentation pipeline, quality checks | Merge 8-K incidents with Board Cybersecurity Tracker data | Rubric v2 finalized; set up batch API calls for dual annotation | Begin gold set sampling (300-500 paragraphs, stratified) | DAPT continues (monitor loss, checkpoint) | Draft presentation outline |
| **Fri** | **Milestone: Full paragraph corpus ready (~50K+ paragraphs)** | **Milestone: 8-K incident dataset complete** | Launch Stage 1 dual annotation (Sonnet + Gemini Flash) on full corpus | Continue gold set labeling (target: finish 150/300) | DAPT continues | **Milestone: Evaluation framework + breach cross-ref ready** |
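Person F's Wednesday task (mapping VCDB incidents to SEC filers by name) can start from plain fuzzy string similarity before reaching for heavier tooling. A stdlib-only sketch; the helper name, example names, and cutoff are illustrative, and real matching should also normalize suffixes (Corp/Corporation/Inc) and fall back to CIK lookups:

```python
from difflib import SequenceMatcher

def best_filer_match(incident_name, filer_names, cutoff=0.75):
    """Return the most similar SEC filer name, or None if nothing clears the cutoff."""
    def score(a, b):
        # Case-insensitive similarity ratio in [0, 1]
        return SequenceMatcher(None, a.lower(), b.lower()).ratio()
    best = max(filer_names, key=lambda name: score(incident_name, name))
    return best if score(incident_name, best) >= cutoff else None

print(best_filer_match("Microsoft Corp", ["Apple Inc.", "Microsoft Corporation"]))
# → Microsoft Corporation
```

Matches below the cutoff return `None` and go to a manual review queue rather than being silently dropped.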
### Week 2: Labeling + Training

| Day | Person A | Person B | Person C | Person D | Person E | Person F |
|-----|----------|----------|----------|----------|----------|----------|
| **Mon** | Data cleaning — fix extraction errors, handle edge cases | Assist Person D with gold set labeling (second annotator) | Monitor dual annotation results (should be ~60% complete) | Continue gold set labeling, begin second pass | DAPT finishes; begin DeBERTa-v3-large baseline fine-tuning | Compute gold set inter-rater reliability (Cohen's Kappa, Krippendorff's Alpha) |
| **Tue** | Build train/holdout split logic (stratified by industry, year, specificity) | Continue gold set second-annotator pass | Dual annotation complete → extract disagreements (~17%) | Finish gold set, adjudicate disagreements with Person C | Baseline results in; begin ModernBERT-large (no DAPT) fine-tuning | Analyze gold set confusion patterns, recommend rubric tweaks |
| **Wed** | Final dataset assembly | Assist Person C with judge panel setup | Launch Stage 2 judge panel (Opus + GPT-5 + Gemini Pro) on disagreements | Run MMC pipeline on gold set, compare with human labels | ModernBERT-large done; begin SEC-ModernBERT-large fine-tuning | **Milestone: Gold set validated, Kappa computed** |
| **Thu** | Prepare HuggingFace dataset card | Begin active learning — cluster low-confidence cases | Judge panel results in; assemble final labeled dataset | Human-review ~500 low-confidence cases from active learning | SEC-ModernBERT-large done; begin NeoBERT experiment | Robustness split analysis (by industry, year, filing size) |
| **Fri** | **Milestone: Labeled dataset finalized (~50K paragraphs)** | **Milestone: Active learning pass complete** | QA final labels — spot-check 100 random samples | Assist Person E with evaluation | Begin ensemble experiment (SEC-ModernBERT + DeBERTa) | **Milestone: All baseline + ablation training complete** |
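The gold-set agreement milestones (Monday's reliability computation, Wednesday's validated Kappa) reduce to a single scikit-learn call once the two annotators' labels are aligned paragraph-by-paragraph. A minimal sketch with made-up labels:

```python
from sklearn.metrics import cohen_kappa_score

# Made-up specificity labels from two annotators for the same five paragraphs
annotator_a = ["specific", "boilerplate", "specific", "intermediate", "boilerplate"]
annotator_b = ["specific", "boilerplate", "intermediate", "intermediate", "boilerplate"]

# Chance-corrected agreement; 0 = chance level, 1 = perfect agreement
kappa = cohen_kappa_score(annotator_a, annotator_b)
print(f"Cohen's kappa: {kappa:.2f}")  # → Cohen's kappa: 0.71
```

Krippendorff's Alpha (which handles missing annotations and more than two raters) needs a separate package, but Cohen's Kappa covers the two-annotator case used for the gold set.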
### Week 3: Evaluation + Presentation

| Day | Person A | Person B | Person C | Person D | Person E | Person F |
|-----|----------|----------|----------|----------|----------|----------|
| **Mon** | Publish dataset to HuggingFace | Run breach prediction validation (PRC + VCDB cross-ref) | Write labeling methodology section | Calibration plots for all models | Final ensemble tuning; publish model weights to HuggingFace | Compile all metrics into evaluation tables |
| **Tue** | Write data acquisition section | Run known-groups validity (industry, size effects) | Write GenAI labeling section | Boilerplate index validation (cosine similarity) | Write model strategy section | Draft full results section |
| **Wed** | Code cleanup, README for extraction pipeline | Market reaction analysis if feasible (optional) | Review/edit all written sections | Create figures: confusion matrices, calibration plots | Review/edit model section | Assemble presentation slides |
| **Thu** | **Full team: review presentation, rehearse, polish** | | | | | |
| **Fri** | **Presentation day** | | | | | |
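For Person D's Monday calibration plots, scikit-learn's `calibration_curve` produces the reliability-diagram coordinates directly from predicted probabilities. A sketch on synthetic scores that are well calibrated by construction:

```python
import numpy as np
from sklearn.calibration import calibration_curve

# Synthetic scores: the positive event fires with probability equal to the
# predicted score, so the model is perfectly calibrated by construction.
rng = np.random.default_rng(0)
y_prob = rng.uniform(size=2000)
y_true = (rng.uniform(size=2000) < y_prob).astype(int)

# Bin predictions and compare mean predicted vs. observed positive rate
prob_true, prob_pred = calibration_curve(y_true, y_prob, n_bins=10)
# Plot prob_pred (x) against prob_true (y); a calibrated model hugs the diagonal.
```

For the real classifiers, `y_prob` would be the softmax probability of the predicted class on the held-out gold set, one curve per model.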
### Critical Path & Dependencies

```
Week 1:
Data extraction (A,B) ──────────────────┐
Rubric design (C,D) ───→ Pilot test ───→ Rubric v2 ──→ GenAI labeling launch (Fri)
DAPT pre-training (E) ──────────────────────────────────→ (continues into Week 2)
Eval framework (F) ─────────────────────────────────────→ (ready for Week 2)

Week 2:
GenAI labeling (C) ───→ Judge panel ───→ Active learning ───→ Final labels (Fri)
Gold set (D + B) ──────────────────────→ Validated (Wed)
Fine-tuning experiments (E) ───→ Baseline → ModernBERT → SEC-ModernBERT → NeoBERT → Ensemble
Metrics (F) ───────────────────→ Robustness splits

Week 3:
Validation tests (B,D,F) ───→ Breach prediction, known-groups, boilerplate index
Writing (all) ──────────────→ Sections → Review → Presentation
Release (A,E) ──────────────→ HuggingFace dataset + model weights
```
---
## 9. Budget

| Item | Cost |
|------|------|
| GenAI labeling — Stage 1 dual annotation (50K × 2 models, batch) | ~$115 |
| GenAI labeling — Stage 2 judge panel (~8.5K × 3 models, batch) | ~$55 |
| Prompt caching savings | -$30 to -$40 |
| SEC EDGAR data | $0 (public domain) |
| Breach databases (PRC open data, VCDB, CISA KEV) | $0 |
| Compute (RTX 3090, already owned) | $0 |
| **Total** | **~$130-170** (lower end assumes full caching savings) |

For comparison, human annotation at $0.50/label would cost $25,000+ to single-annotate the corpus and $75,000+ to triple-annotate it.
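The budget rows reduce to simple arithmetic. A sketch with assumed per-paragraph batch prices chosen to reproduce the table's totals; they are placeholders, not vendor quotes:

```python
# Back-of-envelope cost model for the labeling budget.
N_PARAGRAPHS = 50_000
DISAGREEMENT_RATE = 0.17      # share of corpus sent to the Stage 2 judge panel

STAGE1_PRICE = 0.00115        # $/paragraph/model, batch tier (assumed)
STAGE2_PRICE = 0.00215        # $/paragraph/judge, batch tier (assumed)

stage1 = N_PARAGRAPHS * 2 * STAGE1_PRICE                      # two annotator models
stage2 = N_PARAGRAPHS * DISAGREEMENT_RATE * 3 * STAGE2_PRICE  # three judge models
cache_savings = (30, 40)                                      # prompt-cache range

print(f"Stage 1: ${stage1:.0f}, Stage 2: ${stage2:.0f}")  # → Stage 1: $115, Stage 2: $55
print(f"Total: ${stage1 + stage2 - cache_savings[1]:.0f}-${stage1 + stage2:.0f}")
```

Swapping in current per-token batch prices (and the team's actual prompt lengths) turns this into a live estimate.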
---
## 10. Reference Links

### SEC Rule & Guidance
- [SEC Final Rule 33-11216 (PDF)](https://www.sec.gov/files/rules/final/2023/33-11216.pdf)
- [SEC Fact Sheet](https://www.sec.gov/files/33-11216-fact-sheet.pdf)
- [SEC Small Business Compliance Guide](https://www.sec.gov/resources-small-businesses/small-business-compliance-guides/cybersecurity-risk-management-strategy-governance-incident-disclosure)
- [CYD iXBRL Taxonomy Guide (PDF)](https://xbrl.sec.gov/cyd/2024/cyd-taxonomy-guide-2024-09-16.pdf)
### Law Firm Surveys & Analysis
- [Gibson Dunn S&P 100 Survey (Harvard Law Forum)](https://corpgov.law.harvard.edu/2025/01/09/cybersecurity-disclosure-overview-a-survey-of-form-10-k-cybersecurity-disclosures-by-sp-100-companies/)
- [PwC First Wave of 10-K Cyber Disclosures](https://www.pwc.com/us/en/services/consulting/cybersecurity-risk-regulatory/sec-final-cybersecurity-disclosure-rules/sec-10-k-cyber-disclosures.html)
- [Debevoise 8-K Lessons Learned](https://www.debevoisedatablog.com/2024/03/06/cybersecurity-form-8-k-tracker/)
- [Greenberg Traurig 2025 Trends Update](https://www.gtlaw.com/en/insights/2025/2/sec-cybersecurity-disclosure-trends-2025-update-on-corporate-reporting-practices)
- [Known Trends: First Year of 8-K Filings](https://www.knowntrends.com/2025/02/snapshot-the-first-year-of-cybersecurity-incident-filings-on-form-8-k-since-adoption-of-new-rules/)
- [NYU: Lessons Learned from 8-K Reporting](https://wp.nyu.edu/compliance_enforcement/2025/03/25/lessons-learned-one-year-of-form-8-k-material-cybersecurity-incident-reporting/)
### Data Extraction Tools
- [edgar-crawler (GitHub)](https://github.com/lefterisloukas/edgar-crawler)
- [edgartools (GitHub)](https://github.com/dgunning/edgartools)
- [sec-edgar-downloader (PyPI)](https://pypi.org/project/sec-edgar-downloader/)
- [sec-8k-item105 (GitHub)](https://github.com/JMousqueton/sec-8k-item105)
- [SECurityTr8Ker (GitHub)](https://github.com/pancak3lullz/SECurityTr8Ker)
- [SEC EDGAR APIs](https://www.sec.gov/search-filings/edgar-application-programming-interfaces)
- [SEC EDGAR Full-Text Search](https://efts.sec.gov/LATEST/search-index)
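The full-text search endpoint above is driven by URL query parameters. The parameter names below mirror what the EDGAR search UI sends (`q`, `forms`, `startdt`, `enddt`) and should be verified against the current API; this sketch only builds the query URL:

```python
from urllib.parse import urlencode

# EDGAR full-text search query for post-rule 10-K cybersecurity language.
# Note: EDGAR requires a descriptive User-Agent header on actual requests.
BASE = "https://efts.sec.gov/LATEST/search-index"

params = {
    "q": '"cybersecurity risk management"',  # exact phrase
    "forms": "10-K",
    "startdt": "2023-12-15",  # rule applies to fiscal years ending on/after this date
    "enddt": "2025-06-30",
}
url = f"{BASE}?{urlencode(params)}"
print(url)
```

Paging through the JSON results from this endpoint gives the accession numbers to feed into the bulk downloaders above.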
### Datasets
- [PleIAs/SEC — 373K 10-K texts (HuggingFace, CC0)](https://huggingface.co/datasets/PleIAs/SEC)
- [EDGAR-CORPUS — 220K filings, sections parsed (HuggingFace, Apache 2.0)](https://huggingface.co/datasets/eloukas/edgar-corpus)
- [Board Cybersecurity 23-Feature Analysis](https://www.board-cybersecurity.com/research/insights/risk-frameworks-security-standards-in-10k-item-1c-cybersecurity-disclosures-through-2024-06-30/)
- [Board Cybersecurity Incident Tracker](https://www.board-cybersecurity.com/incidents/tracker)
- [PRC Mendeley Breach Dataset (with tickers)](http://dx.doi.org/10.17632/w33nhh3282.1)
- [VCDB (GitHub)](https://github.com/vz-risk/VCDB)
- [CISA KEV Catalog](https://www.cisa.gov/known-exploited-vulnerabilities-catalog)
- [zeroshot/cybersecurity-corpus (HuggingFace)](https://huggingface.co/datasets/zeroshot/cybersecurity-corpus)
### Models
- [ModernBERT-large (HuggingFace, Apache 2.0)](https://huggingface.co/answerdotai/ModernBERT-large)
- [ModernBERT-base (HuggingFace, Apache 2.0)](https://huggingface.co/answerdotai/ModernBERT-base)
- [NeoBERT (HuggingFace, MIT)](https://huggingface.co/chandar-lab/NeoBERT)
- [DeBERTa-v3-large (HuggingFace, MIT)](https://huggingface.co/microsoft/deberta-v3-large)
- [SEC-BERT (HuggingFace)](https://huggingface.co/nlpaueb/sec-bert-base)
- [ProsusAI FinBERT (HuggingFace)](https://huggingface.co/ProsusAI/finbert)
- [EvasionBench Eva-4B-V2 (HuggingFace)](https://huggingface.co/FutureMa/Eva-4B-V2)
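Any of the encoder checkpoints above load the same way through `transformers`. A hedged sketch, wrapped in a function because `from_pretrained` downloads weights on first call; the label names are illustrative, not the final rubric categories:

```python
def build_classifier(model_id="answerdotai/ModernBERT-large",
                     labels=("boilerplate", "intermediate", "specific")):
    """Load tokenizer + sequence-classification head for paragraph labeling.

    Downloads weights on first call. Swap model_id for any checkpoint in
    the list above to run the same fine-tuning ablation.
    """
    from transformers import AutoTokenizer, AutoModelForSequenceClassification

    id2label = dict(enumerate(labels))
    tokenizer = AutoTokenizer.from_pretrained(model_id)
    model = AutoModelForSequenceClassification.from_pretrained(
        model_id,
        num_labels=len(labels),
        id2label=id2label,
        label2id={v: k for k, v in id2label.items()},
    )
    return tokenizer, model
```

Keeping the loader parameterized by `model_id` means the baseline, DAPT, and NeoBERT runs differ only in the checkpoint string.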
### Key Papers
- Ringel (2023), "Creating Synthetic Experts with Generative AI" — [SSRN:4542949](https://papers.ssrn.com/sol3/papers.cfm?abstract_id=4542949)
- Ludwig et al. (2026), "Extracting Consumer Insight from Text" — [arXiv:2602.15312](https://arxiv.org/abs/2602.15312)
- Ma et al. (2026), "EvasionBench" — [arXiv:2601.09142](https://arxiv.org/abs/2601.09142)
- Florackis et al. (2023), "Cybersecurity Risk" — [SSRN:3725130](https://papers.ssrn.com/sol3/papers.cfm?abstract_id=3725130)
- Gururangan et al. (2020), "Don't Stop Pretraining" — [arXiv:2004.10964](https://arxiv.org/abs/2004.10964)
- ModernBERT paper — [arXiv:2412.13663](https://arxiv.org/abs/2412.13663)
- NeoBERT paper — [arXiv:2502.19587](https://arxiv.org/abs/2502.19587)
- ModernBERT vs DeBERTa-v3 comparison — [arXiv:2504.08716](https://arxiv.org/abs/2504.08716)
- Patent domain ModernBERT DAPT — [arXiv:2509.14926](https://arxiv.org/abs/2509.14926)
- SEC filing scaling laws for continued pre-training — [arXiv:2512.12384](https://arxiv.org/abs/2512.12384)
- Gunel et al. (2020), Supervised Contrastive Learning for fine-tuning — [OpenReview](https://openreview.net/forum?id=cu7IUiOhujH)
- Phil Schmid, "Fine-tune classifier with ModernBERT in 2025" — [philschmid.de](https://www.philschmid.de/fine-tune-modern-bert-in-2025)
- Berkman et al. (2018), Cybersecurity disclosure quality scoring
- Li, No, and Boritz (2023), BERT-based classification of cybersecurity disclosures
- Scalable 10-K Analysis with LLMs — [arXiv:2409.17581](https://arxiv.org/abs/2409.17581)
- SecureBERT — [arXiv:2204.02685](https://arxiv.org/abs/2204.02685)
- Gilardi et al. (2023), "ChatGPT Outperforms Crowd-Workers" (PNAS) — [arXiv:2303.15056](https://arxiv.org/abs/2303.15056)
- Pangakis et al. (2023), "Automated Annotation Requires Validation" — [arXiv:2306.00176](https://arxiv.org/abs/2306.00176)
### Methodological Playbook
- [Ringel 2026 Capstone Pipeline Example (ZIP)](http://ringel.ai/UNC/2026/helpers/Ringel_2026_VerticalAI_Capstone_Pipeline_Example.zip)
- [Class 21 Exemplary Presentation (PDF)](http://www.ringel.ai/UNC/2026/BUSI488/Class21/Ringel_488-2026_Class21.pdf)