Technical Guide — SEC Cybersecurity Disclosure Classifier
Everything needed to build the pipeline: data acquisition, GenAI labeling, model training, evaluation, and references.
Stack: TypeScript (bun) for data/labeling/eval, Python (uv) for training. Vercel AI SDK v6 + OpenRouter for all LLM calls. HuggingFace Trainer for encoder training, Unsloth for decoder experiment.
1. Data Acquisition
1.1 Extracting 10-K Item 1C
Pipeline:
EDGAR API → download 10-K HTML → extract Item 1C → paragraph segmentation → JSONL
Tools:
| Tool | Purpose | Install | Notes |
|---|---|---|---|
| sec-edgar-downloader | Bulk download 10-K filings by CIK | `uv add sec-edgar-downloader` | Pure downloader, no parsing |
| edgar-crawler | Extract specific item sections to JSON | `git clone github.com/lefterisloukas/edgar-crawler` | Configure `['1C']` in items list |
| edgartools | Interactive exploration, XBRL parsing | `uv add edgartools` | `tenk['Item 1C']` accessor; great for prototyping |
EDGAR API requirements:
- Rate limit: 10 requests/second
- Required: custom `User-Agent` header with name and email (e.g., `"sec-cyBERT team@email.com"`)
- SEC blocks requests without a proper User-Agent (returns 403)
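The header and throttle requirements can be sketched with a minimal stdlib helper (the bulk-download tools above already handle this for you; the User-Agent string reuses the guide's example):

```python
import time
import urllib.request

# SEC requires a descriptive User-Agent with contact info on every request.
EDGAR_HEADERS = {
    "User-Agent": "sec-cyBERT team@email.com",  # name + email, per EDGAR policy
    "Accept-Encoding": "gzip, deflate",
}

class EdgarThrottle:
    """Spaces requests to stay under EDGAR's 10 requests/second limit."""

    def __init__(self, max_per_second: int = 10):
        self.min_interval = 1.0 / max_per_second
        self.last_request = 0.0

    def wait(self) -> None:
        elapsed = time.monotonic() - self.last_request
        if elapsed < self.min_interval:
            time.sleep(self.min_interval - elapsed)
        self.last_request = time.monotonic()

def fetch(url: str, throttle: EdgarThrottle) -> bytes:
    """Throttled GET with the required headers."""
    throttle.wait()
    req = urllib.request.Request(url, headers=EDGAR_HEADERS)
    with urllib.request.urlopen(req) as resp:
        return resp.read()
```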
For iXBRL-tagged filings (2025+): Use edgartools XBRL parser to extract CYD taxonomy elements directly. The cyd prefix tags give pre-structured data aligned with regulatory categories.
Fallback corpus: PleIAs/SEC on HuggingFace (373K 10-K full texts, CC0 license) — sections NOT pre-parsed; must extract Item 1C yourself.
1.2 Extracting 8-K Incident Disclosures
| Tool | Purpose |
|---|---|
| sec-8k-item105 | Extract Item 1.05 from 8-Ks, iXBRL + HTML fallback — github.com/JMousqueton/sec-8k-item105 |
| SECurityTr8Ker | Monitor SEC RSS for new cyber 8-Ks — github.com/pancak3lullz/SECurityTr8Ker |
| Debevoise 8-K Tracker | Curated list with filing links — debevoisedatablog.com |
| Board Cybersecurity Tracker | Links filings to MITRE ATT&CK — board-cybersecurity.com/incidents/tracker |
Critical: Must capture Item 1.05 AND Items 8.01/7.01 (post-May 2024 shift where companies moved non-material disclosures away from 1.05).
1.3 Paragraph Segmentation
Once Item 1C text is extracted:
- Split on double newlines or `<p>` tags (depending on extraction format)
- Minimum paragraph length: 20 words (filter out headers, whitespace)
- Maximum paragraph length: 500 words (split longer blocks at sentence boundaries)
- Preserve metadata: company name, CIK, ticker, filing date, fiscal year
Expected yield: ~5-8 paragraphs per Item 1C × ~9,000 filings = ~50,000-70,000 paragraphs
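The rules above can be sketched in Python; the thresholds mirror the list, while the regexes and record shape are illustrative rather than the final pipeline:

```python
import re

MIN_WORDS, MAX_WORDS = 20, 500

def split_long(paragraph: str, max_words: int = MAX_WORDS) -> list[str]:
    """Split an over-long paragraph at sentence boundaries."""
    sentences = re.split(r"(?<=[.!?])\s+", paragraph)
    chunks, current = [], []
    for sentence in sentences:
        if current and len(" ".join(current + [sentence]).split()) > max_words:
            chunks.append(" ".join(current))
            current = []
        current.append(sentence)
    if current:
        chunks.append(" ".join(current))
    return chunks

def segment(item_1c_text: str, meta: dict) -> list[dict]:
    """Double-newline split, length filter, then attach filing metadata."""
    records = []
    for block in re.split(r"\n\s*\n", item_1c_text):
        block = " ".join(block.split())  # normalize whitespace
        if len(block.split()) < MIN_WORDS:
            continue  # drop headers and stray whitespace
        for chunk in split_long(block):
            records.append({"text": chunk, **meta})
    return records
```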
1.4 Pre-Existing Datasets
| Resource | What It Is | License |
|---|---|---|
| PleIAs/SEC | 373K full 10-K texts | CC0 |
| EDGAR-CORPUS | 220K filings with sections pre-parsed | Apache 2.0 |
| Board Cybersecurity 23-Feature Analysis | Regex extraction of 23 governance features from 4,538 10-Ks | Research |
| Gibson Dunn S&P 100 Survey | Detailed disclosure feature analysis | Research |
| Florackis et al. (2023) | Firm-level cyber risk measure from 10-K text | SSRN |
| zeroshot/cybersecurity-corpus | General cybersecurity text (useful for DAPT) | HuggingFace |
2. GenAI Labeling Pipeline
All LLM calls go through OpenRouter via @openrouter/ai-sdk-provider + Vercel AI SDK v6 generateObject. OpenRouter returns actual cost in usage.cost — no estimation needed.
2.1 Model Panel
Stage 1 — Three Independent Annotators (all ~50K paragraphs):
All three are reasoning models. Use low reasoning effort to get a cheap thinking pass without blowing up token costs.
| Model | OpenRouter ID | Role | Reasoning |
|---|---|---|---|
| Gemini 3.1 Flash Lite | google/gemini-3.1-flash-lite-preview | Cheap + capable | Low effort |
| MiMo-V2-Flash | xiaomi/mimo-v2-flash | Xiaomi reasoning flash | Low effort |
| Grok 4.1 Fast | x-ai/grok-4.1-fast | xAI fast tier | Low effort |
Provider diversity: Google, Xiaomi, xAI — three different architectures, minimizes correlated errors.
Stage 2 — Judge for Disagreements (~15-20% of paragraphs):
| Model | OpenRouter ID | Role | Reasoning |
|---|---|---|---|
| Claude Sonnet 4.6 | anthropic/claude-sonnet-4.6 | Tiebreaker judge | Medium effort |
Full Benchmarking Panel (run on 1,200 holdout alongside human labels):
The Stage 1 models plus 6 SOTA frontier models — 9 total from 7 providers.
| Model | OpenRouter ID | Provider | Reasoning |
|---|---|---|---|
| Gemini 3.1 Flash Lite | google/gemini-3.1-flash-lite-preview | Google | Low |
| MiMo-V2-Flash | xiaomi/mimo-v2-flash | Xiaomi | Low |
| Grok 4.1 Fast | x-ai/grok-4.1-fast | xAI | Low |
| GPT-5.4 | openai/gpt-5.4 | OpenAI | Medium |
| Claude Sonnet 4.6 | anthropic/claude-sonnet-4.6 | Anthropic | Medium |
| Gemini 3.1 Pro Preview | google/gemini-3.1-pro-preview | Google | Medium |
| GLM-5 | zhipu/glm-5 | Zhipu AI | Medium |
| MiniMax-M2.7 | minimax/minimax-m2.7 | MiniMax | Medium |
| MiMo-V2-Pro | xiaomi/mimo-v2-pro | Xiaomi | Medium |
That's 9 models from 7 providers, comfortably exceeding the 6-from-3 requirement. All support structured outputs on OpenRouter.
2.2 Consensus Algorithm
Stage 1: 3-model majority vote.
- Each of the 3 models independently labels every paragraph via `generateObject` with the `LabelOutput` Zod schema (includes per-dimension confidence ratings).
- For each paragraph, compare the 3 labels on both dimensions (category + specificity).
- If 2/3 or 3/3 agree on BOTH dimensions → consensus reached.
- Expected agreement rate: ~80-85%.
- Confidence-aware routing: Even when models agree, if all 3 report "low" confidence on either dimension, route to Stage 2 judge anyway. These are hard cases that deserve a stronger model's opinion.
Stage 2: Judge tiebreaker.
- Claude Sonnet 4.6 (medium reasoning effort) receives the paragraph + all 3 Stage 1 labels (randomized order for anti-bias).
- Judge's label is treated as authoritative — if judge agrees with any Stage 1 model on both dimensions, that label wins. Otherwise judge's label is used directly.
- Remaining unresolved cases (~1-2%) flagged for human review.
Stage 3: Active learning pass.
- Cluster low-confidence cases by embedding similarity.
- Human-review ~2-5% of total to identify systematic rubric failures.
- Iterate rubric if patterns emerge, re-run affected subsets.
2.3 Reasoning Configuration
All Stage 1 and benchmark models are reasoning-capable. We use provider-appropriate "low" or "medium" effort settings to balance quality and cost.
OpenRouter reasoning params (passed via providerOptions or model-specific params):
- Google Gemini: `thinkingConfig: { thinkingBudget: 256 }` (low) / `1024` (medium)
- Xiaomi MiMo: thinking is default-on; use `reasoning_effort: "low"` / `"medium"` if supported
- xAI Grok: `reasoning_effort: "low"` / `"medium"`
- OpenAI GPT-5.4: `reasoning: { effort: "low" }` / `"medium"`
- Anthropic Claude: `thinking: { budgetTokens: 512 }` (low) / `2048` (medium)
Exact param names may vary per model on OpenRouter — verify during pilot. The reasoning tokens are tracked separately in usage.completion_tokens_details.reasoning_tokens.
2.4 Cost Tracking
OpenRouter returns actual cost in usage.cost for every response. No estimation needed. Reasoning tokens are included in cost automatically.
2.5 Rate Limiting
OpenRouter uses credit-based limiting for paid accounts, not fixed RPM. Your key shows requests: -1 (unlimited). There is no hard request-per-second cap — only Cloudflare DDoS protection if you dramatically exceed reasonable usage.
Our approach: Use p-limit concurrency control, starting at 10-15 concurrent requests. Ramp up if no 429s or latency degradation. Monitor account usage via GET /api/v1/key.
2.6 Technical Implementation
Core pattern: generateObject with Zod schema via OpenRouter.
```ts
import { generateObject } from "ai";
import { createOpenRouter } from "@openrouter/ai-sdk-provider";
import { LabelOutput } from "../schemas/label";

const openrouter = createOpenRouter();

const result = await generateObject({
  model: openrouter("google/gemini-3.1-flash-lite-preview"),
  schema: LabelOutput,
  system: SYSTEM_PROMPT,
  prompt: buildUserPrompt(paragraph),
  temperature: 0,
  mode: "json",
  // Reasoning effort — model-specific, set per provider
  providerOptions: {
    google: { thinkingConfig: { thinkingBudget: 256 } },
  },
});

// result.object: { content_category, specificity_level, category_confidence, specificity_confidence, reasoning }
// result.usage: { promptTokens, completionTokens }
// OpenRouter response body also includes usage.cost (actual USD)
// and usage.completion_tokens_details.reasoning_tokens
```
Generation ID tracking: Every OpenRouter response includes an id field (the generation ID). We store this in every annotation record for audit trail and GET /api/v1/generation?id={id} lookup.
Batch processing: Concurrency-limited via p-limit (start at 10-15 concurrent). Each successful annotation is appended immediately to JSONL (crash-safe checkpoint). On resume, completed paragraph IDs are read from the output file and skipped. Graceful shutdown on SIGINT — wait for in-flight requests, write session summary.
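A minimal sketch of the crash-safe checkpoint logic described above, in Python for illustration (the `paragraph_id` field name is an assumption):

```python
import json
from pathlib import Path

def completed_ids(output_path: Path) -> set[str]:
    """On resume, read paragraph IDs already annotated so they are skipped."""
    if not output_path.exists():
        return set()
    with output_path.open() as f:
        return {json.loads(line)["paragraph_id"] for line in f if line.strip()}

def append_annotation(output_path: Path, record: dict) -> None:
    """Append one annotation immediately; the file itself is the checkpoint."""
    with output_path.open("a") as f:
        f.write(json.dumps(record) + "\n")
```

Because each record is flushed as its own line, a crash mid-run loses at most the in-flight requests, and a restart resumes from `completed_ids`.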
Structured output: All panel models support structured_outputs on OpenRouter. Use mode: "json" in generateObject. Response Healing plugin (plugins: [{ id: 'response-healing' }]) available for edge cases.
Live observability: Every script that hits APIs renders a live dashboard to stderr (progress, ETA, session cost, latency percentiles, reasoning token usage). Session summaries append to data/metadata/sessions.jsonl.
Prompt tuning before scale: See LABELING-CODEBOOK.md for the 4-phase iterative prompt tuning protocol. Micro-pilot (30 paragraphs) → prompt revision → scale pilot (200 paragraphs) → green light. Do not fire the full 50K run until the scale pilot passes agreement targets.
3. Model Strategy
3.1 Primary: SEC-ModernBERT-large
This model does not exist publicly. Building it is a core contribution.
Base model: answerdotai/ModernBERT-large
- 395M parameters
- 8,192-token native context (vs. 512 for DeBERTa-v3-large)
- RoPE + alternating local/global attention + FlashAttention
- 2-4x faster than DeBERTa-v3-large
- Apache 2.0 license
- GLUE: 90.4
Step 1 — Domain-Adaptive Pre-Training (DAPT):
Continue MLM pre-training on SEC filing text to create "SEC-ModernBERT-large":
- Training corpus: 200-500M tokens from PleIAs/SEC or own EDGAR download. Include 10-Ks, 10-Qs, 8-Ks, proxy statements.
- MLM objective: 30% masking rate (ModernBERT convention)
- Learning rate: ~5e-5 (search range: 1e-5 to 1e-4)
- Hardware (RTX 3090): bf16, gradient checkpointing, seq_len=1024-2048, batch_size=2-4 + gradient accumulation to effective batch 16-32
- VRAM estimate: ~12-15GB at batch=4, seq=2048 with gradient checkpointing — fits on 3090
- Duration: ~2-3 days on single 3090
- Framework: HuggingFace Trainer + `DataCollatorForLanguageModeling` (Python script, not notebook)
Evidence DAPT works:
- Gururangan et al. (2020): consistent improvements across all tested domains
- Clinical ModernBERT, BioClinical ModernBERT: successful continued MLM on medical text
- Patent domain ModernBERT (arXiv:2509.14926): +0.9 to +2.8 F1 from continued pre-training on 31.6B tokens
- SEC filing scaling laws (arXiv:2512.12384): consistent improvement, largest gains in first 200M tokens
Step 2 — Classification Fine-Tuning:
Fine-tune SEC-ModernBERT-large on the labeled paragraphs:
- Architecture: shared encoder backbone → dropout → two linear classification heads
  - `category_head`: 7-class softmax (content category)
  - `specificity_head`: 4-class softmax (specificity level)
- Loss: `α × CE(category) + (1-α) × CE(specificity) + β × SCL`
  - `α` (`category_weight`): default 0.5, searchable
  - `β` (`scl_weight`): default 0, searchable (ablation)
- Sequence length: 2048 tokens
- VRAM: ~11-13GB at batch=8, seq=2048 in bf16 — comfortable on 3090
- bf16=True in HuggingFace Trainer (3090 Ampere supports natively)
- Framework: custom `MultiHeadClassifier` model + HuggingFace Trainer subclass
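In training this loss is computed in PyTorch over logits; a plain-Python sketch (with the SCL term passed in precomputed rather than implemented) makes the α/β weighting concrete:

```python
import math

def cross_entropy(probs: list[float], target: int) -> float:
    """CE for one example, given already-softmaxed probabilities."""
    return -math.log(probs[target])

def multitask_loss(cat_probs: list[float], cat_target: int,
                   spec_probs: list[float], spec_target: int,
                   alpha: float = 0.5, beta: float = 0.0,
                   scl: float = 0.0) -> float:
    """alpha * CE(category) + (1 - alpha) * CE(specificity) + beta * SCL."""
    return (alpha * cross_entropy(cat_probs, cat_target)
            + (1 - alpha) * cross_entropy(spec_probs, spec_target)
            + beta * scl)
```

With `alpha = 0.5` the two heads contribute equally; `beta = 0` recovers the plain two-head CE used outside the SCL ablation.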
3.2 Dark Horse: NeoBERT
- 250M parameters (100M fewer than ModernBERT-large)
- 4,096-token context
- SwiGLU, RoPE, Pre-RMSNorm, FlashAttention
- GLUE: 89.0 | MTEB: 51.3 (best in class — ModernBERT is 46.9)
- MIT license
- Requires `trust_remote_code=True`
Same DAPT + fine-tuning pipeline, even less VRAM. Interesting efficiency vs. quality tradeoff.
3.3 Baseline: DeBERTa-v3-large
- ~435M total parameters
- 512-token context (can push to ~1024)
- GLUE: 91.4 (highest among encoders)
- MIT license
- Weakness: no long context, fails at retrieval
Include as baseline to show improvement from (a) long context and (b) DAPT.
3.4 Decoder Experiment: Qwen3.5 via Unsloth
Experimental comparison of encoder vs. decoder approach:
- Model: Qwen3.5-1.5B or Qwen3.5-7B (smallest viable decoder)
- Framework: Unsloth (2x faster than Axolotl, 80% less VRAM, optimized for Qwen)
- Method: QLoRA fine-tuning — train the model to output the same JSON schema as the GenAI labelers
- Purpose: "Additional baseline" for A-grade requirement + demonstrates encoder advantage for classification
3.5 Domain-Specific Baselines (for comparison)
All BERT-base (110M params, 512 context) — architecturally outdated:
| Model | HuggingFace ID | Domain |
|---|---|---|
| SEC-BERT | nlpaueb/sec-bert-base | 260K 10-K filings |
| FinBERT | ProsusAI/finbert | Financial sentiment |
| SecureBERT | arXiv:2204.02685 | Cybersecurity text |
3.6 Ablation Design
| # | Experiment | Model | Context | DAPT | SCL | Purpose |
|---|---|---|---|---|---|---|
| 1 | Baseline | DeBERTa-v3-large | 512 | No | No | Standard approach per syllabus |
| 2 | + Long context | ModernBERT-large | 2048 | No | No | Context window benefit |
| 3 | + Domain adapt | SEC-ModernBERT-large | 2048 | Yes | No | DAPT benefit |
| 4 | + Contrastive | SEC-ModernBERT-large | 2048 | Yes | Yes | SCL benefit |
| 5 | Efficiency | NeoBERT (+ DAPT) | 2048 | Yes | Yes | 40% fewer params |
| 6 | Decoder | Qwen3.5 LoRA | 2048 | No | No | Encoder vs decoder |
| 7 | Ensemble | SEC-ModernBERT + DeBERTa | mixed | mixed | — | Maximum performance |
3.7 Hyperparameter Search (Autoresearch Pattern)
Inspired by Karpathy's autoresearch: an agent autonomously iterates on training configs using a program.md directive.
How it works:
- Agent reads `program.md`, which defines: fixed time budget (30 min), evaluation metric (`val_macro_f1`), what can be modified (YAML config values), what cannot (data splits, eval script, seed)
- Agent modifies one hyperparameter in the YAML config
- Agent runs training for 30 minutes
- Agent evaluates on the validation set
- If `val_macro_f1` improved by ≥ 0.002 → keep the checkpoint, else discard
- Agent logs the result to `results/experiments.tsv` and repeats
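The accept/reject core of the loop is small; training itself is out of scope here, so this sketch shows only the threshold logic and TSV logging:

```python
import csv
from pathlib import Path

ACCEPT_THRESHOLD = 0.002  # minimum val_macro_f1 gain to keep a checkpoint

def autoresearch_step(best_f1: float, trial_f1: float, config_change: str,
                      log_path: Path) -> tuple[float, bool]:
    """One step: keep the config change iff it improves val_macro_f1 enough."""
    accepted = (trial_f1 - best_f1) >= ACCEPT_THRESHOLD
    with log_path.open("a", newline="") as f:
        csv.writer(f, delimiter="\t").writerow(
            [config_change, f"{trial_f1:.4f}", "keep" if accepted else "discard"])
    return (trial_f1 if accepted else best_f1), accepted
```

The fixed threshold guards against keeping checkpoints for noise-level improvements from 30-minute runs.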
Search spaces:
DAPT:
- learning_rate: [1e-5, 2e-5, 5e-5, 1e-4]
- mlm_probability: [0.15, 0.20, 0.30]
- max_seq_length: [1024, 2048]
- effective batch size: [8, 16, 32]
Encoder fine-tuning:
- learning_rate: [1e-5, 2e-5, 3e-5, 5e-5]
- category_weight: [0.3, 0.4, 0.5, 0.6, 0.7]
- label_smoothing: [0, 0.05, 0.1]
- scl_weight: [0, 0.1, 0.2, 0.5]
- dropout: [0.05, 0.1, 0.2]
- pool_strategy: ["cls", "mean"]
- max_seq_length: [512, 1024, 2048]
Decoder (Unsloth LoRA):
- lora_r: [8, 16, 32, 64]
- lora_alpha: [16, 32, 64]
- learning_rate: [1e-4, 2e-4, 5e-4]
4. Evaluation & Validation
4.1 Required Metrics
| Metric | Target | Notes |
|---|---|---|
| Macro-F1 on holdout | > 0.80 for C, higher for A | Per-class and overall |
| Per-class F1 | Identify weak categories | Expect "None/Other" noisiest |
| Krippendorff's Alpha | > 0.67 adequate, > 0.75 good | GenAI vs human gold set |
| MCC | Report alongside F1 | More robust for imbalanced classes |
| Specificity MAE | Report for ordinal dimension | Mean absolute error on the 4-level ordinal scale |
| Calibration plots | Reliability diagrams | For softmax outputs |
| Robustness splits | By time, industry, filing size | FY2023 vs FY2024; GICS sector; word count quartiles |
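The ordinal MAE can be computed directly once specificity levels are encoded as integers (the 0-3 encoding is an assumption):

```python
def specificity_mae(pred: list[int], gold: list[int]) -> float:
    """MAE over ordinal specificity labels (0-3).

    Unlike F1, this penalizes a one-level miss less than a three-level
    miss, which is the point of reporting it for the ordinal dimension.
    """
    assert len(pred) == len(gold) and pred, "need equal-length, non-empty lists"
    return sum(abs(p - g) for p, g in zip(pred, gold)) / len(pred)
```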
4.2 Downstream Validity Tests
Test 1 — Breach Prediction (strongest): Do firms with lower specificity scores subsequently appear in breach databases?
- Privacy Rights Clearinghouse — 80K+ breaches, ticker/CIK matching
- VCDB — 8K+ incidents, VERIS schema
- Board Cybersecurity Incident Tracker — direct SEC filing links
- CISA KEV Catalog — known exploited vulnerabilities
Test 2 — Market Reaction (optional): Event study: abnormal returns around 8-K Item 1.05 filing. Does prior Item 1C quality predict reaction magnitude? Small sample (~55 incidents) but high signal.
Test 3 — Known-Groups Validity (easy, always include): Do regulated industries (NYDFS, HIPAA) produce higher-specificity disclosures? Do larger firms have more specific disclosures? Expected results that validate the measure.
Test 4 — Boilerplate Index (easy, always include): Cosine similarity of each company's Item 1C to industry-median disclosure. Specificity score should inversely correlate — independent, construct-free validation.
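A raw term-frequency sketch of the boilerplate index (a production version would likely use TF-IDF or embeddings rather than raw counts):

```python
import math
from collections import Counter

def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(a[t] * b[t] for t in a)
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0

def boilerplate_index(disclosure: str, industry_median: str) -> float:
    """~1.0 means the filing tracks the industry-median text (boilerplate);
    lower values mean distinctive, firm-specific language."""
    return cosine(Counter(disclosure.lower().split()),
                  Counter(industry_median.lower().split()))
```

High specificity scores from the classifier should correlate with low values of this index, giving a validation signal that never touches the label rubric.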
4.3 External Benchmark
Per syllabus requirement:
- Board Cybersecurity's 23-feature regex extraction — natural benchmark. Their binary feature coding is prior best practice. Our classifier captures everything their regex does plus quality/specificity.
- Florackis et al. (2023) cyber risk measure — different section (1A vs 1C), different methodology, different era.
5. SEC Regulatory Context
The Rule: SEC Release 33-11216 (July 2023)
Item 1C (10-K Annual Disclosure) — Regulation S-K Item 106:
Item 106(b) — Risk Management and Strategy:
- Processes for assessing, identifying, and managing material cybersecurity risks
- Whether cybersecurity processes integrate into overall ERM
- Whether the company engages external assessors, consultants, or auditors
- Processes to oversee risks from third-party service providers
- Whether cybersecurity risks have materially affected business strategy, results, or financial condition
Item 106(c) — Governance:
- Board oversight (106(c)(1)): oversight description, responsible committee, information processes
- Management's role (106(c)(2)): responsible positions, expertise, monitoring processes, board reporting frequency
Item 1.05 (8-K Incident Disclosure):
- Required within 4 business days of materiality determination
- Material aspects of nature, scope, timing + material impact
- No technical details that would impede response/remediation
- AG can delay up to 120 days for national security
Key design note: The SEC uses "describe" — non-exclusive suggestions create natural variation in specificity and content. This is what makes the construct classifiable.
Compliance Timeline
| Date | Milestone |
|---|---|
| Jul 26, 2023 | Rule adopted |
| Dec 15, 2023 | Item 1C required in 10-Ks |
| Dec 18, 2023 | Item 1.05 required in 8-Ks |
| Jun 15, 2024 | Item 1.05 required for smaller reporting companies |
| Dec 15, 2024 | iXBRL tagging of Item 106 (CYD taxonomy) required |
iXBRL CYD Taxonomy
Published Sep 16, 2024. Starting Dec 15, 2024, Item 1C tagged in Inline XBRL with cyd prefix.
- Schema: http://xbrl.sec.gov/cyd/2024
- Taxonomy guide (PDF)
6. References
SEC Rule & Guidance
- SEC Final Rule 33-11216 (PDF)
- SEC Fact Sheet
- SEC Small Business Compliance Guide
- CYD iXBRL Taxonomy Guide (PDF)
Law Firm Surveys & Analysis
- Gibson Dunn S&P 100 Survey
- PwC First Wave of 10-K Cyber Disclosures
- Debevoise 8-K Tracker
- Greenberg Traurig 2025 Trends
- Known Trends: First Year of 8-K Filings
- NYU: Lessons Learned from 8-K Reporting
Data Extraction Tools
- edgar-crawler
- edgartools
- sec-edgar-downloader
- sec-8k-item105
- SECurityTr8Ker
- SEC EDGAR APIs
- SEC EDGAR Full-Text Search
Datasets
- PleIAs/SEC — 373K 10-K texts (CC0)
- EDGAR-CORPUS — 220K filings, sections parsed (Apache 2.0)
- Board Cybersecurity 23-Feature Analysis
- Board Cybersecurity Incident Tracker
- PRC Mendeley Breach Dataset
- VCDB
- CISA KEV Catalog
- zeroshot/cybersecurity-corpus
Models
- ModernBERT-large (Apache 2.0)
- ModernBERT-base (Apache 2.0)
- NeoBERT (MIT)
- DeBERTa-v3-large (MIT)
- SEC-BERT
- FinBERT
- EvasionBench Eva-4B-V2
Key Papers
- Ringel (2023), "Creating Synthetic Experts with Generative AI" — SSRN:4542949
- Ludwig et al. (2026), "Extracting Consumer Insight from Text" — arXiv:2602.15312
- Ma et al. (2026), "EvasionBench" — arXiv:2601.09142
- Florackis et al. (2023), "Cybersecurity Risk" — SSRN:3725130
- Gururangan et al. (2020), "Don't Stop Pretraining" — arXiv:2004.10964
- ModernBERT — arXiv:2412.13663
- NeoBERT — arXiv:2502.19587
- ModernBERT vs DeBERTa-v3 — arXiv:2504.08716
- Patent domain ModernBERT DAPT — arXiv:2509.14926
- SEC filing scaling laws — arXiv:2512.12384
- Gunel et al. (2020), Supervised Contrastive Learning — OpenReview
- Phil Schmid, "Fine-tune ModernBERT" — philschmid.de
- Berkman et al. (2018), Cybersecurity disclosure quality scoring
- SecureBERT — arXiv:2204.02685
- Gilardi et al. (2023), "ChatGPT Outperforms Crowd-Workers" — arXiv:2303.15056
- Kiefer et al. (2025), ESG-Activities benchmark — arXiv:2502.21112