# Technical Guide — SEC Cybersecurity Disclosure Classifier

Everything needed to build the pipeline: data acquisition, GenAI labeling, model training, evaluation, and references.

**Stack:** TypeScript (bun) for data/labeling/eval, Python (uv) for training. Vercel AI SDK v6 + OpenRouter for all LLM calls. HuggingFace Trainer for encoder training, Unsloth for the decoder experiment.

---
## 1. Data Acquisition

### 1.1 Extracting 10-K Item 1C

**Pipeline:**

```
EDGAR API → download 10-K HTML → extract Item 1C → paragraph segmentation → JSONL
```

**Tools:**

| Tool | Purpose | Install | Notes |
|------|---------|---------|-------|
| `sec-edgar-downloader` | Bulk download 10-K filings by CIK | `uv add sec-edgar-downloader` | Pure downloader, no parsing |
| `edgar-crawler` | Extract specific item sections to JSON | `git clone github.com/lefterisloukas/edgar-crawler` | Configure `['1C']` in items list |
| `edgartools` | Interactive exploration, XBRL parsing | `uv add edgartools` | `tenk['Item 1C']` accessor; great for prototyping |

**EDGAR API requirements:**
- Rate limit: 10 requests/second
- Required: custom `User-Agent` header with name and email (e.g., `"sec-cyBERT team@email.com"`)
- SEC blocks requests without a proper User-Agent (returns 403)

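A minimal sketch of an EDGAR-compliant fetcher in Python (the `Throttle` class, `fetch` helper, and contact string are illustrative, not part of the actual pipeline):

```python
import time
import urllib.request

# SEC requires a descriptive User-Agent with a contact address; requests
# without one are rejected with a 403. The address below is a placeholder.
EDGAR_HEADERS = {"User-Agent": "sec-cyBERT team@email.com"}
MAX_RPS = 10  # EDGAR's published fair-access limit


class Throttle:
    """Enforce a minimum interval between successive requests."""

    def __init__(self, max_rps: int):
        self.min_interval = 1.0 / max_rps
        self.last = 0.0

    def wait(self) -> None:
        remaining = self.min_interval - (time.monotonic() - self.last)
        if remaining > 0:
            time.sleep(remaining)
        self.last = time.monotonic()


def fetch(url: str, throttle: Throttle) -> bytes:
    """Throttled GET with the required headers (no retries, for brevity)."""
    throttle.wait()
    req = urllib.request.Request(url, headers=EDGAR_HEADERS)
    with urllib.request.urlopen(req) as resp:
        return resp.read()
```

In practice you would add retries and on-disk caching; the point here is the header and the 10 req/s pacing.
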
**For iXBRL-tagged filings (2025+):** use the `edgartools` XBRL parser to extract CYD taxonomy elements directly. The `cyd`-prefixed tags give pre-structured data aligned with the regulatory categories.

**Fallback corpus:** `PleIAs/SEC` on HuggingFace (373K 10-K full texts, CC0 license) — sections are NOT pre-parsed; you must extract Item 1C yourself.

### 1.2 Extracting 8-K Incident Disclosures

| Tool | Purpose |
|------|---------|
| `sec-8k-item105` | Extract Item 1.05 from 8-Ks, iXBRL + HTML fallback — `github.com/JMousqueton/sec-8k-item105` |
| `SECurityTr8Ker` | Monitor the SEC RSS feed for new cyber 8-Ks — `github.com/pancak3lullz/SECurityTr8Ker` |
| Debevoise 8-K Tracker | Curated list with filing links — `debevoisedatablog.com` |
| Board Cybersecurity Tracker | Links filings to MITRE ATT&CK — `board-cybersecurity.com/incidents/tracker` |

**Critical:** capture Item 1.05 AND Items 8.01/7.01 — after May 2024, companies shifted non-material disclosures out of Item 1.05 and into 8.01/7.01.

### 1.3 Paragraph Segmentation

Once Item 1C text is extracted:
- Split on double newlines or `<p>` tags (depending on extraction format)
- **Minimum** paragraph length: 20 words (filters out headers and stray whitespace)
- **Maximum** paragraph length: 500 words (split longer blocks at sentence boundaries)
- Preserve metadata: company name, CIK, ticker, filing date, fiscal year

**Expected yield:** ~5-8 paragraphs per Item 1C × ~9,000 filings = **~50,000-70,000 paragraphs**

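The rules above can be sketched as a small filter — a hypothetical `segment` helper, assuming the extraction step yields double-newline-separated text:

```python
import re

MIN_WORDS, MAX_WORDS = 20, 500


def split_long(block: str) -> list[str]:
    """Split an oversized block at sentence boundaries into <= MAX_WORDS chunks."""
    sentences = re.split(r"(?<=[.!?])\s+", block)
    chunks, cur, count = [], [], 0
    for s in sentences:
        w = len(s.split())
        if cur and count + w > MAX_WORDS:
            chunks.append(" ".join(cur))
            cur, count = [], 0
        cur.append(s)
        count += w
    if cur:
        chunks.append(" ".join(cur))
    return chunks


def segment(text: str) -> list[str]:
    """Double-newline split, then the length filters described above."""
    paragraphs = []
    for block in re.split(r"\n\s*\n", text):
        block = " ".join(block.split())  # normalize internal whitespace
        n = len(block.split())
        if n < MIN_WORDS:
            continue  # drop headers, page artifacts
        paragraphs.extend(split_long(block) if n > MAX_WORDS else [block])
    return paragraphs
```
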
### 1.4 Pre-Existing Datasets

| Resource | What It Is | License |
|----------|-----------|---------|
| [PleIAs/SEC](https://huggingface.co/datasets/PleIAs/SEC) | 373K full 10-K texts | CC0 |
| [EDGAR-CORPUS](https://huggingface.co/datasets/eloukas/edgar-corpus) | 220K filings with sections pre-parsed | Apache 2.0 |
| [Board Cybersecurity 23-Feature Analysis](https://www.board-cybersecurity.com/research/insights/) | Regex extraction of 23 governance features from 4,538 10-Ks | Research |
| [Gibson Dunn S&P 100 Survey](https://corpgov.law.harvard.edu/2025/01/09/cybersecurity-disclosure-overview-a-survey-of-form-10-k-cybersecurity-disclosures-by-sp-100-companies/) | Detailed disclosure feature analysis | Research |
| [Florackis et al. (2023)](https://papers.ssrn.com/sol3/papers.cfm?abstract_id=3725130) | Firm-level cyber risk measure from 10-K text | SSRN |
| [zeroshot/cybersecurity-corpus](https://huggingface.co/datasets/zeroshot/cybersecurity-corpus) | General cybersecurity text (useful for DAPT) | HuggingFace |

---
## 2. GenAI Labeling Pipeline

All LLM calls go through **OpenRouter** via `@openrouter/ai-sdk-provider` + Vercel AI SDK v6 `generateObject`. OpenRouter returns actual cost in `usage.cost` — no estimation needed.

### 2.1 Model Panel

**Stage 1 — Three Independent Annotators (all ~50K paragraphs):**

All three are reasoning models. Use low reasoning effort to get a cheap thinking pass without blowing up token costs.

| Model | OpenRouter ID | Role | Reasoning |
|-------|--------------|------|-----------|
| Gemini 3.1 Flash Lite | `google/gemini-3.1-flash-lite-preview` | Cheap + capable | Low effort |
| MiMo-V2-Flash | `xiaomi/mimo-v2-flash` | Xiaomi reasoning flash | Low effort |
| Grok 4.1 Fast | `x-ai/grok-4.1-fast` | xAI fast tier | Low effort |

Provider diversity: Google, Xiaomi, xAI — three different architectures, minimizing correlated errors.

**Stage 2 — Judge for Disagreements (~15-20% of paragraphs):**

| Model | OpenRouter ID | Role | Reasoning |
|-------|--------------|------|-----------|
| Claude Sonnet 4.6 | `anthropic/claude-sonnet-4.6` | Tiebreaker judge | Medium effort |

**Full Benchmarking Panel (run on the 1,200-paragraph holdout alongside human labels):**

The Stage 1 models plus 6 SOTA frontier models — 9 models total from 7 providers.

| Model | OpenRouter ID | Provider | Reasoning |
|-------|--------------|----------|-----------|
| Gemini 3.1 Flash Lite | `google/gemini-3.1-flash-lite-preview` | Google | Low |
| MiMo-V2-Flash | `xiaomi/mimo-v2-flash` | Xiaomi | Low |
| Grok 4.1 Fast | `x-ai/grok-4.1-fast` | xAI | Low |
| GPT-5.4 | `openai/gpt-5.4` | OpenAI | Medium |
| Claude Sonnet 4.6 | `anthropic/claude-sonnet-4.6` | Anthropic | Medium |
| Gemini 3.1 Pro Preview | `google/gemini-3.1-pro-preview` | Google | Medium |
| GLM-5 | `zhipu/glm-5` | Zhipu AI | Medium |
| MiniMax-M2.7 | `minimax/minimax-m2.7` | MiniMax | Medium |
| MiMo-V2-Pro | `xiaomi/mimo-v2-pro` | Xiaomi | Medium |

That's **9 models from 7 providers**, far exceeding the 6-from-3 requirement. All support structured outputs on OpenRouter.

### 2.2 Consensus Algorithm

**Stage 1: 3-model majority vote.**
- Each of the 3 models independently labels every paragraph via `generateObject` with the `LabelOutput` Zod schema (which includes per-dimension confidence ratings).
- For each paragraph, compare the 3 labels on both dimensions (category + specificity).
- If 2/3 or 3/3 agree on BOTH dimensions → consensus reached.
- Expected agreement rate: ~80-85%.
- **Confidence-aware routing:** even when the models agree, if all 3 report "low" confidence on either dimension, route to the Stage 2 judge anyway. These are hard cases that deserve a stronger model's opinion.

**Stage 2: Judge tiebreaker.**
- Claude Sonnet 4.6 (medium reasoning effort) receives the paragraph plus all 3 Stage 1 labels (order randomized to avoid position bias).
- The judge's label is authoritative — if the judge agrees with any Stage 1 model on both dimensions, that label wins; otherwise the judge's label is used directly.
- Remaining unresolved cases (~1-2%) are flagged for human review.

**Stage 3: Active learning pass.**
- Cluster low-confidence cases by embedding similarity.
- Human-review ~2-5% of the total to identify systematic rubric failures.
- If patterns emerge, iterate the rubric and re-run the affected subsets.

### 2.3 Reasoning Configuration

All Stage 1 and benchmark models are reasoning-capable. We use provider-appropriate "low" or "medium" effort settings to balance quality and cost.

**OpenRouter reasoning params** (passed via `providerOptions` or model-specific params):
- **Google Gemini**: `thinkingConfig: { thinkingBudget: 256 }` (low) / `1024` (medium)
- **Xiaomi MiMo**: thinking is on by default; use `reasoning_effort: "low"` / `"medium"` if supported
- **xAI Grok**: `reasoning_effort: "low"` / `"medium"`
- **OpenAI GPT-5.4**: `reasoning: { effort: "low" }` / `"medium"`
- **Anthropic Claude**: `thinking: { budgetTokens: 512 }` (low) / `2048` (medium)

Exact param names may vary per model on OpenRouter — verify during the pilot. Reasoning tokens are tracked separately in `usage.completion_tokens_details.reasoning_tokens`.

### 2.4 Cost Tracking

OpenRouter returns the **actual cost** in `usage.cost` for every response — no estimation needed. Reasoning tokens are included in the cost automatically.

### 2.5 Rate Limiting

OpenRouter uses **credit-based limiting** for paid accounts, not fixed RPM. Our key shows `requests: -1` (unlimited). There is no hard requests-per-second cap — only Cloudflare DDoS protection if you dramatically exceed reasonable usage.

**Our approach:** use `p-limit` concurrency control, starting at 10-15 concurrent requests. Ramp up if there are no 429s or latency degradation. Monitor account usage via `GET /api/v1/key`.

### 2.6 Technical Implementation

**Core pattern:** `generateObject` with a Zod schema via OpenRouter.

```typescript
import { generateObject } from "ai";
import { createOpenRouter } from "@openrouter/ai-sdk-provider";
import { LabelOutput } from "../schemas/label";

// Reads OPENROUTER_API_KEY from the environment
const openrouter = createOpenRouter();

const result = await generateObject({
  model: openrouter("google/gemini-3.1-flash-lite-preview"),
  schema: LabelOutput,
  system: SYSTEM_PROMPT,
  prompt: buildUserPrompt(paragraph),
  temperature: 0,
  mode: "json",
  // Reasoning effort — model-specific, set per provider
  providerOptions: {
    google: { thinkingConfig: { thinkingBudget: 256 } },
  },
});

// result.object: { content_category, specificity_level, category_confidence, specificity_confidence, reasoning }
// result.usage: { promptTokens, completionTokens }
// The raw OpenRouter response body also includes usage.cost (actual USD)
// and usage.completion_tokens_details.reasoning_tokens
```

**Generation ID tracking:** every OpenRouter response includes an `id` field (the generation ID). We store it in every annotation record for an audit trail and for `GET /api/v1/generation?id={id}` lookups.

**Batch processing:** concurrency-limited via `p-limit` (start at 10-15 concurrent). Each successful annotation is appended immediately to JSONL (a crash-safe checkpoint). On resume, completed paragraph IDs are read from the output file and skipped. Graceful shutdown on SIGINT — wait for in-flight requests, then write the session summary.

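The checkpoint/resume behavior can be sketched as follows (Python for illustration — the real pipeline is TypeScript; the `paragraph_id` record field is an assumed name):

```python
import json
from pathlib import Path


def completed_ids(out_path: Path) -> set[str]:
    """Paragraph IDs already annotated, read back from the append-only JSONL."""
    if not out_path.exists():
        return set()
    with out_path.open() as f:
        return {json.loads(line)["paragraph_id"] for line in f if line.strip()}


def annotate_all(paragraphs: list[dict], out_path: Path, annotate) -> int:
    """Skip finished work on resume; append each result as soon as it lands."""
    done = completed_ids(out_path)
    written = 0
    with out_path.open("a") as f:
        for p in paragraphs:
            if p["id"] in done:
                continue
            record = {"paragraph_id": p["id"], "label": annotate(p)}
            f.write(json.dumps(record) + "\n")
            f.flush()  # a crash loses at most the in-flight request
            written += 1
    return written
```
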
**Structured output:** all panel models support `structured_outputs` on OpenRouter. Use `mode: "json"` in `generateObject`. The Response Healing plugin (`plugins: [{ id: 'response-healing' }]`) is available for edge cases.

**Live observability:** every script that hits APIs renders a live dashboard to stderr (progress, ETA, session cost, latency percentiles, reasoning token usage). Session summaries append to `data/metadata/sessions.jsonl`.

**Prompt tuning before scale:** see LABELING-CODEBOOK.md for the 4-phase iterative prompt-tuning protocol: micro-pilot (30 paragraphs) → prompt revision → scale pilot (200 paragraphs) → green light. Do not fire the full 50K run until the scale pilot passes the agreement targets.

---
## 3. Model Strategy

### 3.1 Primary: SEC-ModernBERT-large

**This model does not exist publicly. Building it is a core contribution.**

**Base model:** [`answerdotai/ModernBERT-large`](https://huggingface.co/answerdotai/ModernBERT-large)
- 395M parameters
- 8,192-token native context (vs. 512 for DeBERTa-v3-large)
- RoPE + alternating local/global attention + FlashAttention
- 2-4x faster than DeBERTa-v3-large
- Apache 2.0 license
- GLUE: 90.4

**Step 1 — Domain-Adaptive Pre-Training (DAPT):**

Continue MLM pre-training on SEC filing text to create "SEC-ModernBERT-large":
- **Training corpus:** ~450M tokens from our own 9,000 cached 10-K filings (FY2023-2024), full filing text (not just Item 1C). These are the same filings we extracted Item 1C from — the raw HTML is already cached locally and the cleaning pipeline is built (`stripHtml()` in `fast-reparse.ts`).
- **Why not PleIAs/SEC?** The PleIAs/SEC dataset (373K filings, ~18B tokens) reaches much further back in time, but: (a) one pass would take weeks on a single 3090; (b) Item 1C didn't exist before FY2023, so pre-2023 filings lack the cybersecurity disclosure vocabulary that matters most for our task; (c) the SEC filing scaling laws paper (arXiv:2512.12384) shows the largest gains come in the first 200M tokens — our 450M from recent, relevant filings is already in the sweet spot.
- **Corpus preparation:** `bun run ts/scripts/dapt-corpus-prep.ts` runs `stripHtml()` (from `ts/src/extract/html-cleaner.ts`) on all cached filing HTML (full text, no section extraction) and writes clean text as sharded JSONL to `data/dapt-corpus/`. Same HTML cleaning that handles XBRL tags, entity decoding, page artifacts, and inline element boundaries — just without the Item 1C section-boundary step.
- **MLM objective:** 30% masking rate (the ModernBERT convention)
- **Learning rate:** ~5e-5 (search range: 1e-5 to 1e-4)
- **Hardware (RTX 3090):** bf16, gradient checkpointing, seq_len=1024-2048, batch_size=2-4 + gradient accumulation to an effective batch of 16-32
- **VRAM estimate:** ~12-15GB at batch=4, seq=2048 with gradient checkpointing — fits on a 3090
- **Duration:** ~2-3 days on a single 3090
- **Framework:** HuggingFace Trainer + `DataCollatorForLanguageModeling` (Python script, not a notebook)

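The corpus side of this step reduces to concatenating tokenized filings and slicing fixed-length training sequences; a dependency-free sketch (`pack_sequences` is a hypothetical helper, with integer token IDs standing in for real tokenizer output):

```python
def pack_sequences(docs: list[list[int]], seq_len: int) -> list[list[int]]:
    """Concatenate tokenized documents, then slice fixed-length chunks.

    Standard MLM data prep: no padding waste; the trailing partial
    chunk is dropped.
    """
    stream = [tok for doc in docs for tok in doc]
    n_chunks = len(stream) // seq_len
    return [stream[i * seq_len:(i + 1) * seq_len] for i in range(n_chunks)]


# Effective batch on the 3090, per the hardware bullet above:
# micro-batch 4 × gradient-accumulation 8 = 32 sequences per optimizer step.
```
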
**Step 1b — Task-Adaptive Pre-Training (TAPT):**

After DAPT, continue MLM pre-training on the task distribution itself to create "SEC-cyBERT-large":
- **Training corpus:** 72K Item 1C paragraphs from `paragraphs-clean.jsonl` (~5-10M tokens). Already clean text — no additional processing needed.
- **Why TAPT on top of DAPT?** Gururangan et al. (2020), "Don't Stop Pretraining," showed that TAPT gives consistent gains on top of DAPT, especially when the task distribution differs from the broader domain. Item 1C is a very specific subset of 10-K filings — it has its own vocabulary (CISO, NIST CSF, tabletop exercises, materiality assessments), structure (the governance → management → process → strategy pattern), and boilerplate patterns that differ substantially from the rest of a 10-K. TAPT teaches the model this specific distribution.
- **Duration:** ~2-3 hours on a 3090 (tiny corpus vs. DAPT)
- **Hyperparameters:** same as DAPT (learning rate, masking rate, etc.)

**Evidence that DAPT + TAPT works:**
- Gururangan et al. (2020): DAPT + TAPT together outperforms either alone across all tested domains. TAPT alone gives 1-3 F1 points; combined with DAPT, the gains are additive.
- Clinical ModernBERT and BioClinical ModernBERT: successful continued MLM on medical text
- Patent-domain ModernBERT (arXiv:2509.14926): +0.9 to +2.8 F1 from continued pre-training on 31.6B tokens
- SEC filing scaling laws (arXiv:2512.12384): consistent improvement, with the largest gains in the first 200M tokens

**Step 2 — Classification Fine-Tuning:**

Fine-tune SEC-ModernBERT-large on the labeled paragraphs:
- **Architecture:** shared encoder backbone → dropout → two linear classification heads
  - `category_head`: 7-class softmax (content category)
  - `specificity_head`: 4-class softmax (specificity level)
- **Loss:** `α × CE(category) + (1-α) × CE(specificity) + β × SCL`
  - `α` (category_weight): default 0.5, searchable
  - `β` (scl_weight): default 0, searchable (ablation)
- **Sequence length:** 2048 tokens
- **VRAM:** ~11-13GB at batch=8, seq=2048 in bf16 — comfortable on a 3090
- **bf16=True** in the HuggingFace Trainer (the 3090's Ampere architecture supports it natively)
- **Framework:** custom `MultiHeadClassifier` model + a HuggingFace Trainer subclass

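The loss formula above, evaluated in pure Python on toy probabilities (the real implementation computes CE from logits in PyTorch; the SCL term is omitted here, i.e. β = 0):

```python
import math


def cross_entropy(probs: list[float], target: int) -> float:
    """CE for a single example, given already-softmaxed probabilities."""
    return -math.log(probs[target])


def dual_head_loss(cat_probs: list[float], cat_y: int,
                   spec_probs: list[float], spec_y: int,
                   alpha: float = 0.5) -> float:
    """alpha * CE(category) + (1 - alpha) * CE(specificity); beta = 0 here."""
    return (alpha * cross_entropy(cat_probs, cat_y)
            + (1 - alpha) * cross_entropy(spec_probs, spec_y))
```

With uniform (maximally uncertain) predictions, the loss is 0.5·ln(7) + 0.5·ln(4) ≈ 1.67 — a useful sanity check for the first training steps.
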
### 3.2 Dark Horse: NeoBERT

[`chandar-lab/NeoBERT`](https://huggingface.co/chandar-lab/NeoBERT)
- 250M parameters (~145M fewer than ModernBERT-large)
- 4,096-token context
- SwiGLU, RoPE, Pre-RMSNorm, FlashAttention
- GLUE: 89.0 | MTEB: 51.3 (best in class — ModernBERT scores 46.9)
- MIT license
- Requires `trust_remote_code=True`

Same DAPT + fine-tuning pipeline, even less VRAM. An interesting efficiency-vs-quality tradeoff.

### 3.3 Baseline: DeBERTa-v3-large

[`microsoft/deberta-v3-large`](https://huggingface.co/microsoft/deberta-v3-large)
- ~435M total parameters
- 512-token context (can be pushed to ~1024)
- GLUE: 91.4 (highest among encoders)
- MIT license
- **Weakness:** no long context, fails at retrieval

Included as a baseline to show the improvement from (a) long context and (b) DAPT.

### 3.4 Decoder Experiment: Qwen3.5 via Unsloth

An experimental comparison of the encoder vs. decoder approach:
- **Model:** Qwen3.5-1.5B or Qwen3.5-7B (smallest viable decoder)
- **Framework:** Unsloth (2x faster than Axolotl, 80% less VRAM, optimized for Qwen)
- **Method:** QLoRA fine-tuning — train the model to emit the same JSON schema as the GenAI labelers
- **Purpose:** an "additional baseline" for the A-grade requirement; also demonstrates the encoder advantage for classification

### 3.5 Domain-Specific Baselines (for comparison)

All BERT-base (110M params, 512-token context) — architecturally outdated:

| Model | ID / Source | Domain |
|-------|-------------|--------|
| SEC-BERT | `nlpaueb/sec-bert-base` | 260K 10-K filings |
| FinBERT | `ProsusAI/finbert` | Financial sentiment |
| SecureBERT | arXiv:2204.02685 | Cybersecurity text |

### 3.6 Ablation Design

| # | Experiment | Model | Context | DAPT | TAPT | SCL | Purpose |
|---|-----------|-------|---------|------|------|-----|---------|
| 1 | Baseline | DeBERTa-v3-large | 512 | No | No | No | Standard approach per syllabus |
| 2 | + Long context | ModernBERT-large | 2048 | No | No | No | Context window benefit |
| 3 | + Domain adapt | SEC-ModernBERT-large | 2048 | Yes | No | No | DAPT benefit |
| 3b | + Task adapt | SEC-cyBERT-large | 2048 | Yes | Yes | No | TAPT benefit (Gururangan et al.) |
| 4 | + Contrastive | SEC-cyBERT-large | 2048 | Yes | Yes | Yes | SCL benefit |
| 5 | Efficiency | NeoBERT (+ DAPT + TAPT) | 2048 | Yes | Yes | Yes | ~40% fewer params |
| 6 | Decoder | Qwen3.5 LoRA | 2048 | No | No | No | Encoder vs. decoder |
| 7 | **Ensemble** | SEC-cyBERT + DeBERTa | mixed | mixed | mixed | — | Maximum performance |

### 3.7 Hyperparameter Search (Autoresearch Pattern)

Inspired by Karpathy's [autoresearch](https://github.com/karpathy/autoresearch): an agent autonomously iterates on training configs using a `program.md` directive.

**How it works:**
1. The agent reads `program.md`, which defines: a fixed time budget (30 min), the evaluation metric (`val_macro_f1`), what can be modified (YAML config values), and what cannot (data splits, eval script, seed)
2. The agent modifies one hyperparameter in the YAML config
3. The agent runs training for 30 minutes
4. The agent evaluates on the validation set
5. If `val_macro_f1` improved by ≥ 0.002 → keep the checkpoint, else discard
6. The agent logs the result to `results/experiments.tsv` and repeats

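The accept/reject loop can be sketched as below (`hp_search` and `train_and_eval` are stand-ins — a real trial is a 30-minute budgeted training run, and the TSV columns are simplified):

```python
from pathlib import Path

IMPROVE_EPS = 0.002  # minimum val_macro_f1 gain required to keep a checkpoint


def hp_search(configs, train_and_eval, log_path: Path):
    """Greedy accept/reject loop over candidate configs, logging every trial."""
    best_cfg, best_f1 = None, float("-inf")
    with log_path.open("a") as log:
        for cfg in configs:
            f1 = train_and_eval(cfg)  # stands in for a budgeted run + validation
            kept = f1 >= best_f1 + IMPROVE_EPS
            log.write(f"{cfg}\t{f1:.4f}\t{'keep' if kept else 'discard'}\n")
            if kept:
                best_cfg, best_f1 = cfg, f1
    return best_cfg, best_f1
```
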
**Search spaces:**

DAPT:
- learning_rate: [1e-5, 2e-5, 5e-5, 1e-4]
- mlm_probability: [0.15, 0.20, 0.30]
- max_seq_length: [1024, 2048]
- effective batch size: [8, 16, 32]

Encoder fine-tuning:
- learning_rate: [1e-5, 2e-5, 3e-5, 5e-5]
- category_weight: [0.3, 0.4, 0.5, 0.6, 0.7]
- label_smoothing: [0, 0.05, 0.1]
- scl_weight: [0, 0.1, 0.2, 0.5]
- dropout: [0.05, 0.1, 0.2]
- pool_strategy: ["cls", "mean"]
- max_seq_length: [512, 1024, 2048]

Decoder (Unsloth LoRA):
- lora_r: [8, 16, 32, 64]
- lora_alpha: [16, 32, 64]
- learning_rate: [1e-4, 2e-4, 5e-4]

### 3.8 Python Training Package (`python/`)

A structured Python package for all training stages. All commands run from `python/` via `uv run main.py <command>`.

**Package layout:**

```
python/
├── main.py               # CLI: uv run main.py {dapt,finetune,eval,decoder}
├── pyproject.toml        # Dependencies (torch, transformers, datasets, accelerate)
├── configs/
│   ├── dapt/
│   │   ├── modernbert.yaml   # ModernBERT DAPT hyperparams
│   │   └── neobert.yaml      # NeoBERT DAPT hyperparams
│   ├── finetune/             # Fine-tuning configs (per ablation)
│   └── decoder/              # Qwen LoRA config
└── src/
    ├── common/
    │   └── config.py         # Typed dataclass configs, YAML loading
    ├── data/
    │   └── corpus.py         # DAPT corpus loading, tokenization, chunking
    ├── dapt/
    │   └── train.py          # DAPT + TAPT (same MLM objective, different data)
    ├── finetune/
    │   ├── model.py          # Dual-head classifier architecture
    │   └── train.py          # Classification fine-tuning
    ├── eval/
    │   └── metrics.py        # Macro-F1, MCC, Krippendorff's Alpha
    └── decoder/
        └── train.py          # Qwen LoRA experiment (Unsloth)
```

**DAPT/TAPT usage:**

```bash
# DAPT: full 10-K filings → SEC-ModernBERT-large
uv run main.py dapt --config configs/dapt/modernbert.yaml

# TAPT: continue from the DAPT checkpoint on Item 1C paragraphs → SEC-cyBERT-large
uv run main.py dapt --config configs/dapt/modernbert.yaml \
  --model-path ../checkpoints/dapt/modernbert-large/final \
  --data-path ../data/paragraphs/paragraphs-clean.jsonl \
  --output-dir ../checkpoints/tapt/modernbert-large \
  --stage tapt
```

**Config design:** YAML files define all hyperparameters (reproducible, diffable). CLI flags override key fields (`--model-path`, `--data-path`, `--output-dir`, `--stage`) for TAPT or experimentation without duplicating config files.

**Corpus preparation (prerequisite):** run `bun run ts/scripts/dapt-corpus-prep.ts` from the repo root to generate `data/dapt-corpus/` shards from the cached HTML. This reuses the same `stripHtml()` from `ts/src/extract/html-cleaner.ts` that powers paragraph extraction.

**Checkpoints:** saved to `checkpoints/` (gitignored). Each stage writes the model and tokenizer to `{output_dir}/final/`.

---
## 4. Evaluation & Validation

### 4.1 Required Metrics

| Metric | Target | Notes |
|--------|--------|-------|
| Macro-F1 on holdout | > 0.80 for C, higher for A | Per-class and overall |
| Per-class F1 | Identify weak categories | Expect "None/Other" to be noisiest |
| Krippendorff's Alpha | > 0.67 adequate, > 0.75 good | GenAI vs. human gold set |
| MCC | Report alongside F1 | More robust for imbalanced classes |
| Specificity MAE | Report for the ordinal dimension | Mean absolute error: \|pred - true\| |
| Calibration plots | Reliability diagrams | For softmax outputs |
| Robustness splits | By time, industry, filing size | FY2023 vs. FY2024; GICS sector; word-count quartiles |

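Minimal pure-Python reference implementations for two of these metrics (a real pipeline would likely use the `sklearn.metrics` equivalents):

```python
def macro_f1(y_true: list[int], y_pred: list[int]) -> float:
    """Unweighted mean of per-class F1 — every class counts equally."""
    classes = sorted(set(y_true) | set(y_pred))
    f1s = []
    for c in classes:
        tp = sum(t == c and p == c for t, p in zip(y_true, y_pred))
        fp = sum(t != c and p == c for t, p in zip(y_true, y_pred))
        fn = sum(t == c and p != c for t, p in zip(y_true, y_pred))
        f1s.append(2 * tp / (2 * tp + fp + fn) if tp else 0.0)
    return sum(f1s) / len(f1s)


def specificity_mae(y_true: list[int], y_pred: list[int]) -> float:
    """Mean |pred - true| over the ordinal specificity levels."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)
```
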
### 4.2 Downstream Validity Tests

**Test 1 — Breach Prediction (strongest):**

Do firms with lower specificity scores subsequently appear in breach databases?
- [Privacy Rights Clearinghouse](http://dx.doi.org/10.17632/w33nhh3282.1) — 80K+ breaches, ticker/CIK matching
- [VCDB](https://github.com/vz-risk/VCDB) — 8K+ incidents, VERIS schema
- [Board Cybersecurity Incident Tracker](https://www.board-cybersecurity.com/incidents/tracker) — direct SEC filing links
- [CISA KEV Catalog](https://www.cisa.gov/known-exploited-vulnerabilities-catalog) — known exploited vulnerabilities

**Test 2 — Market Reaction (optional):**

Event study: abnormal returns around the 8-K Item 1.05 filing. Does prior Item 1C quality predict the reaction magnitude? Small sample (~55 incidents) but high signal.

**Test 3 — Known-Groups Validity (easy, always include):**

Do regulated industries (NYDFS, HIPAA) produce higher-specificity disclosures? Do larger firms disclose more specifically? Expected results that validate the measure.

**Test 4 — Boilerplate Index (easy, always include):**

Cosine similarity of each company's Item 1C to the industry-median disclosure. The specificity score should correlate inversely — an independent, construct-free validation.

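The boilerplate index reduces to a cosine similarity between term vectors; a stdlib sketch on raw counts (`boilerplate_index` is a hypothetical helper — a production version would likely use TF-IDF vectors):

```python
import math
from collections import Counter


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two bag-of-words count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0


def boilerplate_index(filing: str, industry_median: str) -> float:
    """Higher = closer to the industry-median disclosure (more boilerplate)."""
    return cosine(Counter(filing.lower().split()),
                  Counter(industry_median.lower().split()))
```
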
### 4.3 External Benchmark

Per the syllabus requirement:
- **Board Cybersecurity's 23-feature regex extraction** — the natural benchmark. Their binary feature coding is the prior best practice. Our classifier captures everything their regexes do, plus quality/specificity.
- **Florackis et al. (2023) cyber risk measure** — different section (1A vs. 1C), different methodology, different era.

---

## 5. SEC Regulatory Context

### The Rule: SEC Release 33-11216 (July 2023)

**Item 1C (10-K Annual Disclosure) — Regulation S-K Item 106:**

*Item 106(b) — Risk Management and Strategy:*
1. Processes for assessing, identifying, and managing material cybersecurity risks
2. Whether cybersecurity processes are integrated into overall ERM
3. Whether the company engages external assessors, consultants, or auditors
4. Processes to oversee risks from third-party service providers
5. Whether cybersecurity risks have materially affected business strategy, results, or financial condition

*Item 106(c) — Governance:*
- Board oversight (106(c)(1)): oversight description, responsible committee, information processes
- Management's role (106(c)(2)): responsible positions, expertise, monitoring processes, board reporting frequency

**Item 1.05 (8-K Incident Disclosure):**
- Required within 4 business days of the materiality determination
- Material aspects of nature, scope, and timing, plus material impact
- No technical details that would impede response or remediation
- The Attorney General can delay disclosure up to 120 days for national security

**Key design note:** the SEC uses "describe" — non-exclusive suggestions create natural variation in specificity and content. This is what makes the construct classifiable.

### Compliance Timeline

| Date | Milestone |
|------|-----------|
| Jul 26, 2023 | Rule adopted |
| Dec 15, 2023 | Item 1C required in 10-Ks |
| Dec 18, 2023 | Item 1.05 required in 8-Ks |
| Jun 15, 2024 | Item 1.05 required for smaller reporting companies |
| Dec 15, 2024 | iXBRL tagging of Item 106 (CYD taxonomy) required |

### iXBRL CYD Taxonomy

Published Sep 16, 2024. Starting Dec 15, 2024, Item 1C is tagged in Inline XBRL with the `cyd` prefix.
- Schema: `http://xbrl.sec.gov/cyd/2024`
- [Taxonomy guide (PDF)](https://xbrl.sec.gov/cyd/2024/cyd-taxonomy-guide-2024-09-16.pdf)

---

## 6. References

### SEC Rule & Guidance
- [SEC Final Rule 33-11216 (PDF)](https://www.sec.gov/files/rules/final/2023/33-11216.pdf)
- [SEC Fact Sheet](https://www.sec.gov/files/33-11216-fact-sheet.pdf)
- [SEC Small Business Compliance Guide](https://www.sec.gov/resources-small-businesses/small-business-compliance-guides/cybersecurity-risk-management-strategy-governance-incident-disclosure)
- [CYD iXBRL Taxonomy Guide (PDF)](https://xbrl.sec.gov/cyd/2024/cyd-taxonomy-guide-2024-09-16.pdf)

### Law Firm Surveys & Analysis
- [Gibson Dunn S&P 100 Survey](https://corpgov.law.harvard.edu/2025/01/09/cybersecurity-disclosure-overview-a-survey-of-form-10-k-cybersecurity-disclosures-by-sp-100-companies/)
- [PwC First Wave of 10-K Cyber Disclosures](https://www.pwc.com/us/en/services/consulting/cybersecurity-risk-regulatory/sec-final-cybersecurity-disclosure-rules/sec-10-k-cyber-disclosures.html)
- [Debevoise 8-K Tracker](https://www.debevoisedatablog.com/2024/03/06/cybersecurity-form-8-k-tracker/)
- [Greenberg Traurig 2025 Trends](https://www.gtlaw.com/en/insights/2025/2/sec-cybersecurity-disclosure-trends-2025-update-on-corporate-reporting-practices)
- [Known Trends: First Year of 8-K Filings](https://www.knowntrends.com/2025/02/snapshot-the-first-year-of-cybersecurity-incident-filings-on-form-8-k-since-adoption-of-new-rules/)
- [NYU: Lessons Learned from 8-K Reporting](https://wp.nyu.edu/compliance_enforcement/2025/03/25/lessons-learned-one-year-of-form-8-k-material-cybersecurity-incident-reporting/)

### Data Extraction Tools
- [edgar-crawler](https://github.com/lefterisloukas/edgar-crawler)
- [edgartools](https://github.com/dgunning/edgartools)
- [sec-edgar-downloader](https://pypi.org/project/sec-edgar-downloader/)
- [sec-8k-item105](https://github.com/JMousqueton/sec-8k-item105)
- [SECurityTr8Ker](https://github.com/pancak3lullz/SECurityTr8Ker)
- [SEC EDGAR APIs](https://www.sec.gov/search-filings/edgar-application-programming-interfaces)
- [SEC EDGAR Full-Text Search](https://efts.sec.gov/LATEST/search-index)

### Datasets
- [PleIAs/SEC — 373K 10-K texts (CC0)](https://huggingface.co/datasets/PleIAs/SEC)
- [EDGAR-CORPUS — 220K filings, sections parsed (Apache 2.0)](https://huggingface.co/datasets/eloukas/edgar-corpus)
- [Board Cybersecurity 23-Feature Analysis](https://www.board-cybersecurity.com/research/insights/risk-frameworks-security-standards-in-10k-item-1c-cybersecurity-disclosures-through-2024-06-30/)
- [Board Cybersecurity Incident Tracker](https://www.board-cybersecurity.com/incidents/tracker)
- [PRC Mendeley Breach Dataset](http://dx.doi.org/10.17632/w33nhh3282.1)
- [VCDB](https://github.com/vz-risk/VCDB)
- [CISA KEV Catalog](https://www.cisa.gov/known-exploited-vulnerabilities-catalog)
- [zeroshot/cybersecurity-corpus](https://huggingface.co/datasets/zeroshot/cybersecurity-corpus)

### Models
- [ModernBERT-large (Apache 2.0)](https://huggingface.co/answerdotai/ModernBERT-large)
- [ModernBERT-base (Apache 2.0)](https://huggingface.co/answerdotai/ModernBERT-base)
- [NeoBERT (MIT)](https://huggingface.co/chandar-lab/NeoBERT)
- [DeBERTa-v3-large (MIT)](https://huggingface.co/microsoft/deberta-v3-large)
- [SEC-BERT](https://huggingface.co/nlpaueb/sec-bert-base)
- [FinBERT](https://huggingface.co/ProsusAI/finbert)
- [EvasionBench Eva-4B-V2](https://huggingface.co/FutureMa/Eva-4B-V2)

### Key Papers
- Ringel (2023), "Creating Synthetic Experts with Generative AI" — [SSRN:4542949](https://papers.ssrn.com/sol3/papers.cfm?abstract_id=4542949)
- Ludwig et al. (2026), "Extracting Consumer Insight from Text" — [arXiv:2602.15312](https://arxiv.org/abs/2602.15312)
- Ma et al. (2026), "EvasionBench" — [arXiv:2601.09142](https://arxiv.org/abs/2601.09142)
- Florackis et al. (2023), "Cybersecurity Risk" — [SSRN:3725130](https://papers.ssrn.com/sol3/papers.cfm?abstract_id=3725130)
- Gururangan et al. (2020), "Don't Stop Pretraining" — [arXiv:2004.10964](https://arxiv.org/abs/2004.10964)
- ModernBERT — [arXiv:2412.13663](https://arxiv.org/abs/2412.13663)
- NeoBERT — [arXiv:2502.19587](https://arxiv.org/abs/2502.19587)
- ModernBERT vs. DeBERTa-v3 — [arXiv:2504.08716](https://arxiv.org/abs/2504.08716)
- Patent-domain ModernBERT DAPT — [arXiv:2509.14926](https://arxiv.org/abs/2509.14926)
- SEC filing scaling laws — [arXiv:2512.12384](https://arxiv.org/abs/2512.12384)
- Gunel et al. (2020), Supervised Contrastive Learning — [OpenReview](https://openreview.net/forum?id=cu7IUiOhujH)
- Phil Schmid, "Fine-tune ModernBERT" — [philschmid.de](https://www.philschmid.de/fine-tune-modern-bert-in-2025)
- Berkman et al. (2018), Cybersecurity disclosure quality scoring
- SecureBERT — [arXiv:2204.02685](https://arxiv.org/abs/2204.02685)
- Gilardi et al. (2023), "ChatGPT Outperforms Crowd-Workers" — [arXiv:2303.15056](https://arxiv.org/abs/2303.15056)
- Kiefer et al. (2025), ESG-Activities benchmark — [arXiv:2502.21112](https://arxiv.org/abs/2502.21112)

### Methodological Resources
- [Ringel 2026 Capstone Pipeline Example (ipynb)](http://ringel.ai/UNC/2026/helpers/Ringel_2026_VerticalAI_Capstone_Pipeline_Example.ipynb)
- [Ringel 2026 Capstone Pipeline Example (zip)](http://ringel.ai/UNC/2026/helpers/Ringel_2026_VerticalAI_Capstone_Pipeline_Example.zip)
- [Class 21 Exemplary Presentation (PDF)](http://www.ringel.ai/UNC/2026/BUSI488/Class21/Ringel_488-2026_Class21.pdf)
- [Karpathy autoresearch](https://github.com/karpathy/autoresearch) — autonomous HP search pattern