# Technical Guide — SEC Cybersecurity Disclosure Classifier
Everything needed to build the pipeline: data acquisition, GenAI labeling, model training, evaluation, and references.
**Stack:** TypeScript (bun) for data/labeling/eval, Python (uv) for training. Vercel AI SDK v6 + OpenRouter for all LLM calls. HuggingFace Trainer for encoder training, Unsloth for decoder experiment.
---
## 1. Data Acquisition
### 1.1 Extracting 10-K Item 1C
**Pipeline:**
```
EDGAR API → download 10-K HTML → extract Item 1C → paragraph segmentation → JSONL
```
**Tools:**
| Tool | Purpose | Install | Notes |
|------|---------|---------|-------|
| `sec-edgar-downloader` | Bulk download 10-K filings by CIK | `uv add sec-edgar-downloader` | Pure downloader, no parsing |
| `edgar-crawler` | Extract specific item sections to JSON | `git clone github.com/lefterisloukas/edgar-crawler` | Configure `['1C']` in items list |
| `edgartools` | Interactive exploration, XBRL parsing | `uv add edgartools` | `tenk['Item 1C']` accessor; great for prototyping |
**EDGAR API requirements:**
- Rate limit: 10 requests/second
- Required: Custom `User-Agent` header with name and email (e.g., `"sec-cyBERT team@email.com"`)
- SEC blocks requests without proper User-Agent (returns 403)
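The two requirements above can be sketched as a client-side throttle plus a fixed header. This is an illustrative sketch, not SEC-published code; `EdgarThrottle` and the injectable `clock`/`sleep` parameters are our own names (injection makes the limiter testable without real waiting):

```python
import time
from collections import deque

class EdgarThrottle:
    """Client-side limiter: at most max_requests per window seconds."""

    def __init__(self, max_requests=10, window=1.0,
                 clock=time.monotonic, sleep=time.sleep):
        self.max_requests = max_requests
        self.window = window
        self.clock = clock          # injectable for testing
        self.sleep = sleep
        self.sent = deque()         # timestamps of recent requests

    def acquire(self):
        now = self.clock()
        # Drop timestamps that have aged out of the window.
        while self.sent and now - self.sent[0] >= self.window:
            self.sent.popleft()
        if len(self.sent) >= self.max_requests:
            # Wait until the oldest request falls out of the window.
            self.sleep(self.window - (now - self.sent[0]))
            now = self.clock()
            while self.sent and now - self.sent[0] >= self.window:
                self.sent.popleft()
        self.sent.append(now)

# SEC returns 403 without a proper User-Agent.
HEADERS = {"User-Agent": "sec-cyBERT team@email.com"}
```

Call `throttle.acquire()` immediately before each download and pass `HEADERS` on every request, whatever HTTP client the downloader uses.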
**For iXBRL-tagged filings (2025+):** Use `edgartools` XBRL parser to extract CYD taxonomy elements directly. The `cyd` prefix tags give pre-structured data aligned with regulatory categories.
**Fallback corpus:** `PleIAs/SEC` on HuggingFace (373K 10-K full texts, CC0 license) — sections NOT pre-parsed; must extract Item 1C yourself.
### 1.2 Extracting 8-K Incident Disclosures
| Tool | Purpose |
|------|---------|
| `sec-8k-item105` | Extract Item 1.05 from 8-Ks, iXBRL + HTML fallback — `github.com/JMousqueton/sec-8k-item105` |
| `SECurityTr8Ker` | Monitor SEC RSS for new cyber 8-Ks — `github.com/pancak3lullz/SECurityTr8Ker` |
| Debevoise 8-K Tracker | Curated list with filing links — `debevoisedatablog.com` |
| Board Cybersecurity Tracker | Links filings to MITRE ATT&CK — `board-cybersecurity.com/incidents/tracker` |
**Critical:** Must capture Item 1.05 AND Items 8.01/7.01 (post-May 2024 shift where companies moved non-material disclosures away from 1.05).
### 1.3 Paragraph Segmentation
Once Item 1C text is extracted:
- Split on double newlines or `<p>` tags (depending on extraction format)
- **Minimum** paragraph length: 20 words (filter out headers, whitespace)
- **Maximum** paragraph length: 500 words (split longer blocks at sentence boundaries)
- Preserve metadata: company name, CIK, ticker, filing date, fiscal year
**Expected yield:** ~5-8 paragraphs per Item 1C × ~9,000 filings = **~45,000-72,000 paragraphs**
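The segmentation rules above reduce to a short function. A minimal sketch (word counts via whitespace split; the sentence splitter is deliberately naive — a production version would use a proper sentence tokenizer):

```python
import re

MIN_WORDS, MAX_WORDS = 20, 500

def segment(item_1c_text: str) -> list[str]:
    """Split Item 1C text into paragraphs per the rules above."""
    out = []
    for block in re.split(r"\n\s*\n", item_1c_text):
        words = block.split()
        if len(words) < MIN_WORDS:
            continue                     # headers, stray whitespace
        if len(words) <= MAX_WORDS:
            out.append(" ".join(words))
            continue
        # Split over-long blocks at sentence boundaries.
        sentences = re.split(r"(?<=[.!?])\s+", block.strip())
        chunk, n = [], 0
        for s in sentences:
            w = len(s.split())
            if chunk and n + w > MAX_WORDS:
                out.append(" ".join(chunk))
                chunk, n = [], 0
            chunk.append(s)
            n += w
        if chunk:
            out.append(" ".join(chunk))
    return out
```

Metadata (company, CIK, ticker, filing date, fiscal year) attaches per paragraph downstream when writing JSONL.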
### 1.4 Pre-Existing Datasets
| Resource | What It Is | License |
|----------|-----------|---------|
| [PleIAs/SEC](https://huggingface.co/datasets/PleIAs/SEC) | 373K full 10-K texts | CC0 |
| [EDGAR-CORPUS](https://huggingface.co/datasets/eloukas/edgar-corpus) | 220K filings with sections pre-parsed | Apache 2.0 |
| [Board Cybersecurity 23-Feature Analysis](https://www.board-cybersecurity.com/research/insights/) | Regex extraction of 23 governance features from 4,538 10-Ks | Research |
| [Gibson Dunn S&P 100 Survey](https://corpgov.law.harvard.edu/2025/01/09/cybersecurity-disclosure-overview-a-survey-of-form-10-k-cybersecurity-disclosures-by-sp-100-companies/) | Detailed disclosure feature analysis | Research |
| [Florackis et al. (2023)](https://papers.ssrn.com/sol3/papers.cfm?abstract_id=3725130) | Firm-level cyber risk measure from 10-K text | SSRN |
| [zeroshot/cybersecurity-corpus](https://huggingface.co/datasets/zeroshot/cybersecurity-corpus) | General cybersecurity text (useful for DAPT) | HuggingFace |
---
## 2. GenAI Labeling Pipeline
All LLM calls go through **OpenRouter** via `@openrouter/ai-sdk-provider` + Vercel AI SDK v6 `generateObject`. OpenRouter returns actual cost in `usage.cost` — no estimation needed.
### 2.1 Model Panel
**Stage 1 — Three Independent Annotators (all ~50K paragraphs):**
All three are reasoning models. Use low reasoning effort to get a cheap thinking pass without blowing up token costs.
| Model | OpenRouter ID | Role | Reasoning |
|-------|--------------|------|-----------|
| Gemini 3.1 Flash Lite | `google/gemini-3.1-flash-lite-preview` | Cheap + capable | Low effort |
| MiMo-V2-Flash | `xiaomi/mimo-v2-flash` | Xiaomi reasoning flash | Low effort |
| Grok 4.1 Fast | `x-ai/grok-4.1-fast` | xAI fast tier | Low effort |
Provider diversity: Google, Xiaomi, xAI — three independent model families, which minimizes correlated labeling errors.
**Stage 2 — Judge for Disagreements (~15-20% of paragraphs):**
| Model | OpenRouter ID | Role | Reasoning |
|-------|--------------|------|-----------|
| Claude Sonnet 4.6 | `anthropic/claude-sonnet-4.6` | Tiebreaker judge | Medium effort |
**Full Benchmarking Panel (run on 1,200 holdout alongside human labels):**
The Stage 1 models plus 6 SOTA frontier models — 9 total from 7 providers.
| Model | OpenRouter ID | Provider | Reasoning |
|-------|--------------|----------|-----------|
| Gemini 3.1 Flash Lite | `google/gemini-3.1-flash-lite-preview` | Google | Low |
| MiMo-V2-Flash | `xiaomi/mimo-v2-flash` | Xiaomi | Low |
| Grok 4.1 Fast | `x-ai/grok-4.1-fast` | xAI | Low |
| GPT-5.4 | `openai/gpt-5.4` | OpenAI | Medium |
| Claude Sonnet 4.6 | `anthropic/claude-sonnet-4.6` | Anthropic | Medium |
| Gemini 3.1 Pro Preview | `google/gemini-3.1-pro-preview` | Google | Medium |
| GLM-5 | `zhipu/glm-5` | Zhipu AI | Medium |
| MiniMax-M2.7 | `minimax/minimax-m2.7` | MiniMax | Medium |
| MiMo-V2-Pro | `xiaomi/mimo-v2-pro` | Xiaomi | Medium |
That's **9 models from 7 providers**, exceeding the 6-from-3 requirement. All support structured outputs on OpenRouter.
### 2.2 Consensus Algorithm
**Stage 1: 3-model majority vote.**
- Each of 3 models independently labels every paragraph via `generateObject` with the `LabelOutput` Zod schema (includes per-dimension confidence ratings).
- For each paragraph, compare the 3 labels on both dimensions (category + specificity).
- If 2/3 or 3/3 agree on BOTH dimensions → consensus reached.
- Expected agreement rate: ~80-85%.
- **Confidence-aware routing:** Even when models agree, if all 3 report "low" confidence on either dimension, route to Stage 2 judge anyway. These are hard cases that deserve a stronger model's opinion.
**Stage 2: Judge tiebreaker.**
- Claude Sonnet 4.6 (medium reasoning effort) receives the paragraph + all 3 Stage 1 labels (randomized order for anti-bias).
- Judge's label is treated as authoritative — if judge agrees with any Stage 1 model on both dimensions, that label wins. Otherwise judge's label is used directly.
- Remaining unresolved cases (~1-2%) flagged for human review.
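The Stage 1 vote and Stage 2 routing above are pure logic; a sketch (field names loosely follow the `LabelOutput` schema but the exact `Label` shape here is ours):

```python
from dataclasses import dataclass

@dataclass
class Label:
    category: str
    specificity: int
    category_confidence: str        # "low" | "medium" | "high"
    specificity_confidence: str

def consensus(labels):
    """Return (consensus label or None, route) for 3 Stage 1 annotations."""
    assert len(labels) == 3
    for i in range(3):
        agree = [l for l in labels
                 if l.category == labels[i].category
                 and l.specificity == labels[i].specificity]
        if len(agree) >= 2:
            # 2/3 or 3/3 agree on BOTH dimensions.
            all_low = all(l.category_confidence == "low"
                          or l.specificity_confidence == "low"
                          for l in labels)
            if all_low:
                return None, "judge"     # unanimous low confidence → Stage 2 anyway
            return agree[0], "consensus"
    return None, "judge"                 # disagreement → Stage 2 judge
```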
**Stage 3: Active learning pass.**
- Cluster low-confidence cases by embedding similarity.
- Human-review ~2-5% of total to identify systematic rubric failures.
- Iterate rubric if patterns emerge, re-run affected subsets.
### 2.3 Reasoning Configuration
All Stage 1 and benchmark models are reasoning-capable. We use provider-appropriate "low" or "medium" effort settings to balance quality and cost.
**OpenRouter reasoning params** (passed via `providerOptions` or model-specific params):
- **Google Gemini**: `thinkingConfig: { thinkingBudget: 256 }` (low) / `1024` (medium)
- **Xiaomi MiMo**: Thinking is default-on; use `reasoning_effort: "low"` / `"medium"` if supported
- **xAI Grok**: `reasoning_effort: "low"` / `"medium"`
- **OpenAI GPT-5.4**: `reasoning: { effort: "low" }` / `"medium"`
- **Anthropic Claude**: `thinking: { budgetTokens: 512 }` (low) / `2048` (medium)
Exact param names may vary per model on OpenRouter — verify during pilot. The reasoning tokens are tracked separately in `usage.completion_tokens_details.reasoning_tokens`.
### 2.4 Cost Tracking
OpenRouter returns **actual cost** in `usage.cost` for every response. No estimation needed. Reasoning tokens are included in cost automatically.
### 2.5 Rate Limiting
OpenRouter uses **credit-based limiting** for paid accounts, not fixed RPM. Your key shows `requests: -1` (unlimited). There is no hard request-per-second cap — only Cloudflare DDoS protection if you dramatically exceed reasonable usage.
**Our approach:** Use `p-limit` concurrency control, starting at 10-15 concurrent requests. Ramp up if no 429s or latency degradation. Monitor account usage via `GET /api/v1/key`.
### 2.6 Technical Implementation
**Core pattern:** `generateObject` with Zod schema via OpenRouter.
```typescript
import { generateObject } from "ai";
import { createOpenRouter } from "@openrouter/ai-sdk-provider";
import { LabelOutput } from "../schemas/label";

const openrouter = createOpenRouter(); // reads OPENROUTER_API_KEY from env

const result = await generateObject({
  model: openrouter("google/gemini-3.1-flash-lite-preview"),
  schema: LabelOutput,
  system: SYSTEM_PROMPT,
  prompt: buildUserPrompt(paragraph),
  temperature: 0,
  mode: "json",
  // Reasoning effort — model-specific, set per provider
  providerOptions: {
    google: { thinkingConfig: { thinkingBudget: 256 } },
  },
});

// result.object: { content_category, specificity_level, category_confidence, specificity_confidence, reasoning }
// result.usage: { promptTokens, completionTokens }
// OpenRouter response body also includes usage.cost (actual USD)
// and usage.completion_tokens_details.reasoning_tokens
```
**Generation ID tracking:** Every OpenRouter response includes an `id` field (the generation ID). We store this in every annotation record for audit trail and `GET /api/v1/generation?id={id}` lookup.
**Batch processing:** Concurrency-limited via `p-limit` (start at 10-15 concurrent). Each successful annotation is appended immediately to JSONL (crash-safe checkpoint). On resume, completed paragraph IDs are read from the output file and skipped. Graceful shutdown on SIGINT — wait for in-flight requests, write session summary.
**Structured output:** All panel models support `structured_outputs` on OpenRouter. Use `mode: "json"` in `generateObject`. Response Healing plugin (`plugins: [{ id: 'response-healing' }]`) available for edge cases.
**Live observability:** Every script that hits APIs renders a live dashboard to stderr (progress, ETA, session cost, latency percentiles, reasoning token usage). Session summaries append to `data/metadata/sessions.jsonl`.
**Prompt tuning before scale:** See LABELING-CODEBOOK.md for the 4-phase iterative prompt tuning protocol. Micro-pilot (30 paragraphs) → prompt revision → scale pilot (200 paragraphs) → green light. Do not fire the full 50K run until the scale pilot passes agreement targets.
---
## 3. Model Strategy
### 3.1 Primary: SEC-ModernBERT-large
**This model does not exist publicly. Building it is a core contribution.**
**Base model:** [`answerdotai/ModernBERT-large`](https://huggingface.co/answerdotai/ModernBERT-large)
- 395M parameters
- 8,192-token native context (vs. 512 for DeBERTa-v3-large)
- RoPE + alternating local/global attention + FlashAttention
- 2-4x faster than DeBERTa-v3-large
- Apache 2.0 license
- GLUE: 90.4
**Step 1 — Domain-Adaptive Pre-Training (DAPT):**
Continue MLM pre-training on SEC filing text to create "SEC-ModernBERT-large":
- **Training corpus:** ~450M tokens from our own 9,000 cached 10-K filings (FY2023-2024), full filing text (not just Item 1C). These are the same filings we extracted Item 1C from — we already have the raw HTML cached locally and the cleaning pipeline built (`stripHtml()` in `fast-reparse.ts`).
- **Why not PleIAs/SEC?** The PleIAs/SEC dataset (373K filings, ~18B tokens) goes back much further in time, but: (a) one pass would take weeks on a single 3090, (b) Item 1C didn't exist before FY2023 so pre-2023 filings lack the cybersecurity disclosure vocabulary that matters most for our task, (c) the SEC filing scaling laws paper (arXiv:2512.12384) shows the largest gains come in the first 200M tokens — our 450M from recent, relevant filings is already in the sweet spot.
- **Corpus preparation:** `bun run ts/scripts/dapt-corpus-prep.ts` runs `stripHtml()` (from `ts/src/extract/html-cleaner.ts`) on all cached filing HTML (full text, no section extraction) and outputs clean text as sharded JSONL to `data/dapt-corpus/`. Same HTML cleaning that handles XBRL tags, entity decoding, page artifacts, inline element boundaries — just without the Item 1C section boundary step.
- **MLM objective:** 30% masking rate (ModernBERT convention)
- **Learning rate:** ~5e-5 (search range: 1e-5 to 1e-4)
- **Hardware (RTX 3090):** bf16, gradient checkpointing, seq_len=1024-2048, batch_size=2-4 + gradient accumulation to effective batch 16-32
- **VRAM estimate:** ~12-15GB at batch=4, seq=2048 with gradient checkpointing — fits on 3090
- **Duration:** ~2-3 days on single 3090
- **Framework:** HuggingFace Trainer + `DataCollatorForLanguageModeling` (Python script, not notebook)
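The MLM objective that `DataCollatorForLanguageModeling` applies can be sketched in plain Python: select ~30% of positions, then corrupt them with the standard BERT 80/10/10 split (80% `[MASK]`, 10% random token, 10% unchanged). The `RANDOM_POOL` vocabulary here is a stand-in for the real tokenizer vocabulary:

```python
import random

MASK = "[MASK]"
RANDOM_POOL = ["cyber", "risk", "board", "process", "incident"]  # stand-in vocab

def mlm_mask(tokens, mlm_probability=0.30, seed=0):
    """Corrupt tokens for MLM. Returns (corrupted tokens, labels);
    labels hold the original token at selected positions, None elsewhere."""
    rng = random.Random(seed)
    out, labels = [], []
    for tok in tokens:
        if rng.random() < mlm_probability:
            labels.append(tok)               # prediction target
            r = rng.random()
            if r < 0.8:
                out.append(MASK)             # 80%: mask token
            elif r < 0.9:
                out.append(rng.choice(RANDOM_POOL))  # 10%: random token
            else:
                out.append(tok)              # 10%: kept, still predicted
        else:
            labels.append(None)              # ignored by the loss
            out.append(tok)
    return out, labels
```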
**Step 1b — Task-Adaptive Pre-Training (TAPT):**
After DAPT, continue MLM pre-training specifically on the task distribution to create "SEC-cyBERT-large":
- **Training corpus:** 72K Item 1C paragraphs from `paragraphs-clean.jsonl` (~5-10M tokens). Already clean text — no additional processing needed.
- **Why TAPT on top of DAPT?** Gururangan et al. (2020) "Don't Stop Pretraining" showed that TAPT gives consistent gains on top of DAPT, especially when the task distribution differs from the broader domain. Item 1C is a very specific subset of 10-K filings — it has its own vocabulary (CISO, NIST CSF, tabletop exercises, materiality assessments), structure (governance → management → process → strategy pattern), and boilerplate patterns that differ substantially from the rest of a 10-K. TAPT teaches the model this specific distribution.
- **Duration:** ~2-3 hours on a 3090 (tiny corpus vs DAPT)
- **Same hyperparameters as DAPT** (learning rate, masking rate, etc.)
**Evidence DAPT + TAPT works:**
- Gururangan et al. (2020): DAPT + TAPT together outperforms either alone across all tested domains. TAPT alone gives 1-3 F1 points; combined with DAPT, gains are additive.
- Clinical ModernBERT, BioClinical ModernBERT: successful continued MLM on medical text
- Patent domain ModernBERT (arXiv:2509.14926): +0.9 to +2.8 F1 from continued pre-training on 31.6B tokens
- SEC filing scaling laws (arXiv:2512.12384): consistent improvement, largest gains in first 200M tokens
**Step 2 — Classification Fine-Tuning:**
Fine-tune SEC-ModernBERT-large on the labeled paragraphs:
- **Architecture:** Shared encoder backbone → dropout → two linear classification heads
- `category_head`: 7-class softmax (content category)
- `specificity_head`: 4-class softmax (specificity level)
- **Loss:** `α × CE(category) + (1-α) × CE(specificity) + β × SCL`
- `α` (category_weight): default 0.5, searchable
- `β` (scl_weight): default 0, searchable (ablation)
- **Sequence length:** 2048 tokens
- **VRAM:** ~11-13GB at batch=8, seq=2048 in bf16 — comfortable on 3090
- **bf16=True** in HuggingFace Trainer (3090 Ampere supports natively)
- **Framework:** Custom `MultiHeadClassifier` model + HuggingFace Trainer subclass
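A numeric sketch of the combined loss for one example with β=0 (the SCL term omitted). This is pure Python over softmax probabilities to make the weighting concrete; the real implementation applies `torch.nn.CrossEntropyLoss` to the two heads' logits:

```python
import math

def cross_entropy(probs, true_idx):
    """CE for one example given softmax probabilities."""
    return -math.log(probs[true_idx])

def combined_loss(cat_probs, cat_true, spec_probs, spec_true, alpha=0.5):
    """alpha * CE(category, 7-way) + (1 - alpha) * CE(specificity, 4-way)."""
    return (alpha * cross_entropy(cat_probs, cat_true)
            + (1 - alpha) * cross_entropy(spec_probs, spec_true))
```

At `alpha=1.0` the specificity head contributes nothing; the searchable default of 0.5 weights both tasks equally.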
### 3.2 Dark Horse: NeoBERT
[`chandar-lab/NeoBERT`](https://huggingface.co/chandar-lab/NeoBERT)
- 250M parameters (145M fewer than ModernBERT-large's 395M)
- 4,096-token context
- SwiGLU, RoPE, Pre-RMSNorm, FlashAttention
- GLUE: 89.0 | MTEB: 51.3 (best in class — ModernBERT is 46.9)
- MIT license
- Requires `trust_remote_code=True`
Same DAPT + fine-tuning pipeline, even less VRAM. Interesting efficiency vs. quality tradeoff.
### 3.3 Baseline: DeBERTa-v3-large
[`microsoft/deberta-v3-large`](https://huggingface.co/microsoft/deberta-v3-large)
- ~435M total parameters
- 512-token context (can push to ~1024)
- GLUE: 91.4 (highest among encoders)
- MIT license
- **Weakness:** no long context, fails at retrieval
Include as baseline to show improvement from (a) long context and (b) DAPT.
### 3.4 Decoder Experiment: Qwen3.5 via Unsloth
Experimental comparison of encoder vs. decoder approach:
- **Model:** Qwen3.5-1.5B or Qwen3.5-7B (smallest viable decoder)
- **Framework:** Unsloth (2x faster than Axolotl, 80% less VRAM, optimized for Qwen)
- **Method:** QLoRA fine-tuning — train the model to output the same JSON schema as the GenAI labelers
- **Purpose:** "Additional baseline" for A-grade requirement + demonstrates encoder advantage for classification
### 3.5 Domain-Specific Baselines (for comparison)
All BERT-base (110M params, 512 context) — architecturally outdated:
| Model | HuggingFace ID | Domain |
|-------|---------------|--------|
| SEC-BERT | `nlpaueb/sec-bert-base` | 260K 10-K filings |
| FinBERT | `ProsusAI/finbert` | Financial sentiment |
| SecureBERT | arXiv:2204.02685 | Cybersecurity text |
### 3.6 Ablation Design
| # | Experiment | Model | Context | DAPT | TAPT | SCL | Purpose |
|---|-----------|-------|---------|------|------|-----|---------|
| 1 | Baseline | DeBERTa-v3-large | 512 | No | No | No | Standard approach per syllabus |
| 2 | + Long context | ModernBERT-large | 2048 | No | No | No | Context window benefit |
| 3 | + Domain adapt | SEC-ModernBERT-large | 2048 | Yes | No | No | DAPT benefit |
| 3b | + Task adapt | SEC-cyBERT-large | 2048 | Yes | Yes | No | TAPT benefit (Gururangan et al.) |
| 4 | + Contrastive | SEC-cyBERT-large | 2048 | Yes | Yes | Yes | SCL benefit |
| 5 | Efficiency | NeoBERT (+ DAPT + TAPT) | 2048 | Yes | Yes | Yes | 40% fewer params |
| 6 | Decoder | Qwen3.5 LoRA | 2048 | No | No | No | Encoder vs decoder |
| 7 | **Ensemble** | SEC-cyBERT + DeBERTa | mixed | mixed | mixed | — | Maximum performance |
### 3.7 Hyperparameter Search (Autoresearch Pattern)
Inspired by Karpathy's [autoresearch](https://github.com/karpathy/autoresearch): an agent autonomously iterates on training configs using a `program.md` directive.
**How it works:**
1. Agent reads `program.md` which defines: fixed time budget (30 min), evaluation metric (`val_macro_f1`), what can be modified (YAML config values), what cannot (data splits, eval script, seed)
2. Agent modifies one hyperparameter in the YAML config
3. Agent runs training for 30 minutes
4. Agent evaluates on validation set
5. If `val_macro_f1` improved by ≥ 0.002 → keep checkpoint, else discard
6. Agent logs result to `results/experiments.tsv` and repeats
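Steps 5-6 amount to a greedy hill-climb over configs. A sketch of just that accept/reject loop (`hill_climb` and `train_and_eval` are our names; the agent supplies the actual budgeted training run):

```python
def hill_climb(configs, train_and_eval, min_delta=0.002):
    """Greedy HP search: keep a config only if val_macro_f1 improves by >= min_delta."""
    best_cfg, best_f1, log = None, float("-inf"), []
    for cfg in configs:
        f1 = train_and_eval(cfg)          # 30-min budgeted run + validation eval
        kept = f1 >= best_f1 + min_delta
        if kept:
            best_cfg, best_f1 = cfg, f1   # keep checkpoint
        log.append((cfg, f1, kept))       # → results/experiments.tsv
    return best_cfg, best_f1, log
```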
**Search spaces:**
DAPT:
- learning_rate: [1e-5, 2e-5, 5e-5, 1e-4]
- mlm_probability: [0.15, 0.20, 0.30]
- max_seq_length: [1024, 2048]
- effective batch size: [8, 16, 32]
Encoder fine-tuning:
- learning_rate: [1e-5, 2e-5, 3e-5, 5e-5]
- category_weight: [0.3, 0.4, 0.5, 0.6, 0.7]
- label_smoothing: [0, 0.05, 0.1]
- scl_weight: [0, 0.1, 0.2, 0.5]
- dropout: [0.05, 0.1, 0.2]
- pool_strategy: ["cls", "mean"]
- max_seq_length: [512, 1024, 2048]
Decoder (Unsloth LoRA):
- lora_r: [8, 16, 32, 64]
- lora_alpha: [16, 32, 64]
- learning_rate: [1e-4, 2e-4, 5e-4]
### 3.8 Python Training Package (`python/`)
Structured Python package for all training stages. All commands run from `python/` via `uv run main.py <command>`.
**Package layout:**
```
python/
├── main.py                       # CLI: uv run main.py {dapt,finetune,eval,decoder}
├── pyproject.toml                # Dependencies (torch, transformers, datasets, accelerate)
├── configs/
│   ├── dapt/
│   │   ├── modernbert.yaml       # ModernBERT DAPT hyperparams
│   │   └── neobert.yaml          # NeoBERT DAPT hyperparams
│   ├── finetune/                 # Fine-tuning configs (per ablation)
│   └── decoder/                  # Qwen LoRA config
└── src/
    ├── common/
    │   └── config.py             # Typed dataclass configs, YAML loading
    ├── data/
    │   └── corpus.py             # DAPT corpus loading, tokenization, chunking
    ├── dapt/
    │   └── train.py              # DAPT + TAPT (same MLM objective, different data)
    ├── finetune/
    │   ├── model.py              # Dual-head classifier architecture
    │   └── train.py              # Classification fine-tuning
    ├── eval/
    │   └── metrics.py            # Macro-F1, MCC, Krippendorff's Alpha
    └── decoder/
        └── train.py              # Qwen LoRA experiment (Unsloth)
```
**DAPT/TAPT usage:**
```bash
# DAPT: full 10-K filings → SEC-ModernBERT-large
uv run main.py dapt --config configs/dapt/modernbert.yaml

# TAPT: continue from DAPT checkpoint on Item 1C paragraphs → SEC-cyBERT-large
uv run main.py dapt --config configs/dapt/modernbert.yaml \
  --model-path ../checkpoints/dapt/modernbert-large/final \
  --data-path ../data/paragraphs/paragraphs-clean.jsonl \
  --output-dir ../checkpoints/tapt/modernbert-large \
  --stage tapt
```
**Config design:** YAML files define all hyperparameters (reproducible, diffable). CLI flags override key fields (`--model-path`, `--data-path`, `--output-dir`, `--stage`) for TAPT or experimentation without duplicating config files.
**Corpus preparation (prerequisite):** Run `bun run ts/scripts/dapt-corpus-prep.ts` from repo root to generate `data/dapt-corpus/` shards from cached HTML. This reuses the same `stripHtml()` from `ts/src/extract/html-cleaner.ts` that powers paragraph extraction.
**Checkpoints:** Saved to `checkpoints/` (gitignored). Each stage writes to `{output_dir}/final/` with the model and tokenizer.
---
## 4. Evaluation & Validation
### 4.1 Required Metrics
| Metric | Target | Notes |
|--------|--------|-------|
| Macro-F1 on holdout | > 0.80 for C, higher for A | Per-class and overall |
| Per-class F1 | Identify weak categories | Expect "None/Other" noisiest |
| Krippendorff's Alpha | > 0.67 adequate, > 0.75 good | GenAI vs human gold set |
| MCC | Report alongside F1 | More robust for imbalanced classes |
| Specificity MAE | Report for ordinal dimension | Mean absolute error: |pred - true| |
| Calibration plots | Reliability diagrams | For softmax outputs |
| Robustness splits | By time, industry, filing size | FY2023 vs FY2024; GICS sector; word count quartiles |
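Two of the table's metrics are worth pinning down exactly. Reference implementations in pure Python (in practice, use scikit-learn's `f1_score` and `matthews_corrcoef`; this sketch just makes the definitions concrete):

```python
def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1, classes drawn from y_true ∪ y_pred."""
    classes = sorted(set(y_true) | set(y_pred))
    f1s = []
    for c in classes:
        tp = sum(t == c and p == c for t, p in zip(y_true, y_pred))
        fp = sum(t != c and p == c for t, p in zip(y_true, y_pred))
        fn = sum(t == c and p != c for t, p in zip(y_true, y_pred))
        f1s.append(2 * tp / (2 * tp + fp + fn) if tp else 0.0)
    return sum(f1s) / len(f1s)

def specificity_mae(y_true, y_pred):
    """Mean |pred - true| over the ordinal specificity levels."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)
```

Macro-F1 weights every class equally regardless of support, which is exactly why it (plus MCC) is preferred over accuracy here: "None/Other" and rare categories count as much as the dominant ones.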
### 4.2 Downstream Validity Tests
**Test 1 — Breach Prediction (strongest):**
Do firms with lower specificity scores subsequently appear in breach databases?
- [Privacy Rights Clearinghouse](http://dx.doi.org/10.17632/w33nhh3282.1) — 80K+ breaches, ticker/CIK matching
- [VCDB](https://github.com/vz-risk/VCDB) — 8K+ incidents, VERIS schema
- [Board Cybersecurity Incident Tracker](https://www.board-cybersecurity.com/incidents/tracker) — direct SEC filing links
- [CISA KEV Catalog](https://www.cisa.gov/known-exploited-vulnerabilities-catalog) — known exploited vulnerabilities
**Test 2 — Market Reaction (optional):**
Event study: abnormal returns around 8-K Item 1.05 filing. Does prior Item 1C quality predict reaction magnitude? Small sample (~55 incidents) but high signal.
**Test 3 — Known-Groups Validity (easy, always include):**
Do regulated industries (NYDFS, HIPAA) produce higher-specificity disclosures? Do larger firms have more specific disclosures? Expected results that validate the measure.
**Test 4 — Boilerplate Index (easy, always include):**
Cosine similarity of each company's Item 1C to industry-median disclosure. Specificity score should inversely correlate — independent, construct-free validation.
### 4.3 External Benchmark
Per syllabus requirement:
- **Board Cybersecurity's 23-feature regex extraction** — natural benchmark. Their binary feature coding is prior best practice. Our classifier captures everything their regex does plus quality/specificity.
- **Florackis et al. (2023) cyber risk measure** — different section (1A vs 1C), different methodology, different era.
---
## 5. SEC Regulatory Context
### The Rule: SEC Release 33-11216 (July 2023)
**Item 1C (10-K Annual Disclosure) — Regulation S-K Item 106:**
*Item 106(b) — Risk Management and Strategy:*
1. Processes for assessing, identifying, and managing material cybersecurity risks
2. Whether cybersecurity processes integrate into overall ERM
3. Whether the company engages external assessors, consultants, or auditors
4. Processes to oversee risks from third-party service providers
5. Whether cybersecurity risks have materially affected business strategy, results, or financial condition
*Item 106(c) — Governance:*
- Board oversight (106(c)(1)): oversight description, responsible committee, information processes
- Management's role (106(c)(2)): responsible positions, expertise, monitoring processes, board reporting frequency
**Item 1.05 (8-K Incident Disclosure):**
- Required within 4 business days of materiality determination
- Material aspects of nature, scope, timing + material impact
- No technical details that would impede response/remediation
- The U.S. Attorney General can delay disclosure up to 120 days on national security or public safety grounds
**Key design note:** The SEC uses "describe" — non-exclusive suggestions create natural variation in specificity and content. This is what makes the construct classifiable.
### Compliance Timeline
| Date | Milestone |
|------|-----------|
| Jul 26, 2023 | Rule adopted |
| Dec 15, 2023 | Item 1C required in 10-Ks |
| Dec 18, 2023 | Item 1.05 required in 8-Ks |
| Jun 15, 2024 | Item 1.05 required for smaller reporting companies |
| Dec 15, 2024 | iXBRL tagging of Item 106 (CYD taxonomy) required |
### iXBRL CYD Taxonomy
Published Sep 16, 2024. Starting Dec 15, 2024, Item 1C tagged in Inline XBRL with `cyd` prefix.
- Schema: `http://xbrl.sec.gov/cyd/2024`
- [Taxonomy guide (PDF)](https://xbrl.sec.gov/cyd/2024/cyd-taxonomy-guide-2024-09-16.pdf)
---
## 6. References
### SEC Rule & Guidance
- [SEC Final Rule 33-11216 (PDF)](https://www.sec.gov/files/rules/final/2023/33-11216.pdf)
- [SEC Fact Sheet](https://www.sec.gov/files/33-11216-fact-sheet.pdf)
- [SEC Small Business Compliance Guide](https://www.sec.gov/resources-small-businesses/small-business-compliance-guides/cybersecurity-risk-management-strategy-governance-incident-disclosure)
- [CYD iXBRL Taxonomy Guide (PDF)](https://xbrl.sec.gov/cyd/2024/cyd-taxonomy-guide-2024-09-16.pdf)
### Law Firm Surveys & Analysis
- [Gibson Dunn S&P 100 Survey](https://corpgov.law.harvard.edu/2025/01/09/cybersecurity-disclosure-overview-a-survey-of-form-10-k-cybersecurity-disclosures-by-sp-100-companies/)
- [PwC First Wave of 10-K Cyber Disclosures](https://www.pwc.com/us/en/services/consulting/cybersecurity-risk-regulatory/sec-final-cybersecurity-disclosure-rules/sec-10-k-cyber-disclosures.html)
- [Debevoise 8-K Tracker](https://www.debevoisedatablog.com/2024/03/06/cybersecurity-form-8-k-tracker/)
- [Greenberg Traurig 2025 Trends](https://www.gtlaw.com/en/insights/2025/2/sec-cybersecurity-disclosure-trends-2025-update-on-corporate-reporting-practices)
- [Known Trends: First Year of 8-K Filings](https://www.knowntrends.com/2025/02/snapshot-the-first-year-of-cybersecurity-incident-filings-on-form-8-k-since-adoption-of-new-rules/)
- [NYU: Lessons Learned from 8-K Reporting](https://wp.nyu.edu/compliance_enforcement/2025/03/25/lessons-learned-one-year-of-form-8-k-material-cybersecurity-incident-reporting/)
### Data Extraction Tools
- [edgar-crawler](https://github.com/lefterisloukas/edgar-crawler)
- [edgartools](https://github.com/dgunning/edgartools)
- [sec-edgar-downloader](https://pypi.org/project/sec-edgar-downloader/)
- [sec-8k-item105](https://github.com/JMousqueton/sec-8k-item105)
- [SECurityTr8Ker](https://github.com/pancak3lullz/SECurityTr8Ker)
- [SEC EDGAR APIs](https://www.sec.gov/search-filings/edgar-application-programming-interfaces)
- [SEC EDGAR Full-Text Search](https://efts.sec.gov/LATEST/search-index)
### Datasets
- [PleIAs/SEC — 373K 10-K texts (CC0)](https://huggingface.co/datasets/PleIAs/SEC)
- [EDGAR-CORPUS — 220K filings, sections parsed (Apache 2.0)](https://huggingface.co/datasets/eloukas/edgar-corpus)
- [Board Cybersecurity 23-Feature Analysis](https://www.board-cybersecurity.com/research/insights/risk-frameworks-security-standards-in-10k-item-1c-cybersecurity-disclosures-through-2024-06-30/)
- [Board Cybersecurity Incident Tracker](https://www.board-cybersecurity.com/incidents/tracker)
- [PRC Mendeley Breach Dataset](http://dx.doi.org/10.17632/w33nhh3282.1)
- [VCDB](https://github.com/vz-risk/VCDB)
- [CISA KEV Catalog](https://www.cisa.gov/known-exploited-vulnerabilities-catalog)
- [zeroshot/cybersecurity-corpus](https://huggingface.co/datasets/zeroshot/cybersecurity-corpus)
### Models
- [ModernBERT-large (Apache 2.0)](https://huggingface.co/answerdotai/ModernBERT-large)
- [ModernBERT-base (Apache 2.0)](https://huggingface.co/answerdotai/ModernBERT-base)
- [NeoBERT (MIT)](https://huggingface.co/chandar-lab/NeoBERT)
- [DeBERTa-v3-large (MIT)](https://huggingface.co/microsoft/deberta-v3-large)
- [SEC-BERT](https://huggingface.co/nlpaueb/sec-bert-base)
- [FinBERT](https://huggingface.co/ProsusAI/finbert)
- [EvasionBench Eva-4B-V2](https://huggingface.co/FutureMa/Eva-4B-V2)
### Key Papers
- Ringel (2023), "Creating Synthetic Experts with Generative AI" — [SSRN:4542949](https://papers.ssrn.com/sol3/papers.cfm?abstract_id=4542949)
- Ludwig et al. (2026), "Extracting Consumer Insight from Text" — [arXiv:2602.15312](https://arxiv.org/abs/2602.15312)
- Ma et al. (2026), "EvasionBench" — [arXiv:2601.09142](https://arxiv.org/abs/2601.09142)
- Florackis et al. (2023), "Cybersecurity Risk" — [SSRN:3725130](https://papers.ssrn.com/sol3/papers.cfm?abstract_id=3725130)
- Gururangan et al. (2020), "Don't Stop Pretraining" — [arXiv:2004.10964](https://arxiv.org/abs/2004.10964)
- ModernBERT — [arXiv:2412.13663](https://arxiv.org/abs/2412.13663)
- NeoBERT — [arXiv:2502.19587](https://arxiv.org/abs/2502.19587)
- ModernBERT vs DeBERTa-v3 — [arXiv:2504.08716](https://arxiv.org/abs/2504.08716)
- Patent domain ModernBERT DAPT — [arXiv:2509.14926](https://arxiv.org/abs/2509.14926)
- SEC filing scaling laws — [arXiv:2512.12384](https://arxiv.org/abs/2512.12384)
- Gunel et al. (2020), Supervised Contrastive Learning — [OpenReview](https://openreview.net/forum?id=cu7IUiOhujH)
- Phil Schmid, "Fine-tune ModernBERT" — [philschmid.de](https://www.philschmid.de/fine-tune-modern-bert-in-2025)
- Berkman et al. (2018), Cybersecurity disclosure quality scoring
- SecureBERT — [arXiv:2204.02685](https://arxiv.org/abs/2204.02685)
- Gilardi et al. (2023), "ChatGPT Outperforms Crowd-Workers" — [arXiv:2303.15056](https://arxiv.org/abs/2303.15056)
- Kiefer et al. (2025), ESG-Activities benchmark — [arXiv:2502.21112](https://arxiv.org/abs/2502.21112)
### Methodological Resources
- [Ringel 2026 Capstone Pipeline Example (ipynb)](http://ringel.ai/UNC/2026/helpers/Ringel_2026_VerticalAI_Capstone_Pipeline_Example.ipynb)
- [Ringel 2026 Capstone Pipeline Example (zip)](http://ringel.ai/UNC/2026/helpers/Ringel_2026_VerticalAI_Capstone_Pipeline_Example.zip)
- [Class 21 Exemplary Presentation (PDF)](http://www.ringel.ai/UNC/2026/BUSI488/Class21/Ringel_488-2026_Class21.pdf)
- [Karpathy autoresearch](https://github.com/karpathy/autoresearch) — autonomous HP search pattern