Technical Guide — SEC Cybersecurity Disclosure Classifier
Everything needed to build the pipeline: data acquisition, GenAI labeling, model training, evaluation, and references.
Stack: TypeScript (bun) for data/labeling/eval, Python (uv) for training. Vercel AI SDK v6 + OpenRouter for all LLM calls. HuggingFace Trainer for encoder training, Unsloth for decoder experiment.
1. Data Acquisition
1.1 Extracting 10-K Item 1C
Pipeline:
EDGAR API → download 10-K HTML → extract Item 1C → paragraph segmentation → JSONL
Tools:
| Tool | Purpose | Install | Notes |
|---|---|---|---|
| sec-edgar-downloader | Bulk download 10-K filings by CIK | `uv add sec-edgar-downloader` | Pure downloader, no parsing |
| edgar-crawler | Extract specific item sections to JSON | `git clone github.com/lefterisloukas/edgar-crawler` | Configure `['1C']` in items list |
| edgartools | Interactive exploration, XBRL parsing | `uv add edgartools` | `tenk['Item 1C']` accessor; great for prototyping |
EDGAR API requirements:
- Rate limit: 10 requests/second
- Required: custom `User-Agent` header with name and email (e.g., `"sec-cyBERT team@email.com"`)
- SEC blocks requests without a proper User-Agent (returns 403)
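The header and throttle requirements can be sketched with a minimal stdlib helper (the bulk-download tools above already handle this for you; the User-Agent string reuses the guide's example):

```python
import time
import urllib.request

# SEC requires a descriptive User-Agent with contact info on every request.
EDGAR_HEADERS = {
    "User-Agent": "sec-cyBERT team@email.com",  # name + email, per EDGAR policy
    "Accept-Encoding": "gzip, deflate",
}

class EdgarThrottle:
    """Spaces requests to stay under EDGAR's 10 requests/second limit."""

    def __init__(self, max_per_second: int = 10):
        self.min_interval = 1.0 / max_per_second
        self.last_request = 0.0

    def wait(self) -> None:
        elapsed = time.monotonic() - self.last_request
        if elapsed < self.min_interval:
            time.sleep(self.min_interval - elapsed)
        self.last_request = time.monotonic()

def fetch(url: str, throttle: EdgarThrottle) -> bytes:
    """Throttled GET with the required headers."""
    throttle.wait()
    req = urllib.request.Request(url, headers=EDGAR_HEADERS)
    with urllib.request.urlopen(req) as resp:
        return resp.read()
```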
For iXBRL-tagged filings (2025+): Use edgartools XBRL parser to extract CYD taxonomy elements directly. The cyd prefix tags give pre-structured data aligned with regulatory categories.
Fallback corpus: PleIAs/SEC on HuggingFace (373K 10-K full texts, CC0 license) — sections NOT pre-parsed; must extract Item 1C yourself.
1.2 Extracting 8-K Incident Disclosures
| Tool | Purpose |
|---|---|
| sec-8k-item105 | Extract Item 1.05 from 8-Ks, iXBRL + HTML fallback — github.com/JMousqueton/sec-8k-item105 |
| SECurityTr8Ker | Monitor SEC RSS for new cyber 8-Ks — github.com/pancak3lullz/SECurityTr8Ker |
| Debevoise 8-K Tracker | Curated list with filing links — debevoisedatablog.com |
| Board Cybersecurity Tracker | Links filings to MITRE ATT&CK — board-cybersecurity.com/incidents/tracker |
Critical: Must capture Item 1.05 AND Items 8.01/7.01 (post-May 2024 shift where companies moved non-material disclosures away from 1.05).
1.3 Paragraph Segmentation
Once Item 1C text is extracted:
- Split on double newlines or `<p>` tags (depending on extraction format)
- Minimum paragraph length: 20 words (filter out headers, whitespace)
- Maximum paragraph length: 500 words (split longer blocks at sentence boundaries)
- Preserve metadata: company name, CIK, ticker, filing date, fiscal year
Expected yield: ~5-8 paragraphs per Item 1C × ~9,000 filings = ~50,000-70,000 paragraphs
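The rules above can be sketched in Python; the thresholds mirror the list, while the regexes and record shape are illustrative rather than the final pipeline:

```python
import re

MIN_WORDS, MAX_WORDS = 20, 500

def split_long(paragraph: str, max_words: int = MAX_WORDS) -> list[str]:
    """Split an over-long paragraph at sentence boundaries."""
    sentences = re.split(r"(?<=[.!?])\s+", paragraph)
    chunks, current = [], []
    for sentence in sentences:
        if current and len(" ".join(current + [sentence]).split()) > max_words:
            chunks.append(" ".join(current))
            current = []
        current.append(sentence)
    if current:
        chunks.append(" ".join(current))
    return chunks

def segment(item_1c_text: str, meta: dict) -> list[dict]:
    """Double-newline split, length filter, then attach filing metadata."""
    records = []
    for block in re.split(r"\n\s*\n", item_1c_text):
        block = " ".join(block.split())  # normalize whitespace
        if len(block.split()) < MIN_WORDS:
            continue  # drop headers and stray whitespace
        for chunk in split_long(block):
            records.append({"text": chunk, **meta})
    return records
```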
1.4 Pre-Existing Datasets
| Resource | What It Is | License |
|---|---|---|
| PleIAs/SEC | 373K full 10-K texts | CC0 |
| EDGAR-CORPUS | 220K filings with sections pre-parsed | Apache 2.0 |
| Board Cybersecurity 23-Feature Analysis | Regex extraction of 23 governance features from 4,538 10-Ks | Research |
| Gibson Dunn S&P 100 Survey | Detailed disclosure feature analysis | Research |
| Florackis et al. (2023) | Firm-level cyber risk measure from 10-K text | SSRN |
| zeroshot/cybersecurity-corpus | General cybersecurity text (useful for DAPT) | HuggingFace |
2. GenAI Labeling Pipeline
All LLM calls go through OpenRouter via @openrouter/ai-sdk-provider + Vercel AI SDK v6 generateObject. OpenRouter returns actual cost in usage.cost — no estimation needed.
2.1 Model Panel
Stage 1 — Three Independent Annotators (all ~50K paragraphs):
All three are reasoning models. Use low reasoning effort to get a cheap thinking pass without blowing up token costs.
| Model | OpenRouter ID | Role | Reasoning |
|---|---|---|---|
| Gemini 3.1 Flash Lite | google/gemini-3.1-flash-lite-preview | Cheap + capable | Low effort |
| MiMo-V2-Flash | xiaomi/mimo-v2-flash | Xiaomi reasoning flash | Low effort |
| Grok 4.1 Fast | x-ai/grok-4.1-fast | xAI fast tier | Low effort |
Provider diversity: Google, Xiaomi, xAI — three different architectures, minimizes correlated errors.
Stage 2 — Judge for Disagreements (~15-20% of paragraphs):
| Model | OpenRouter ID | Role | Reasoning |
|---|---|---|---|
| Claude Sonnet 4.6 | anthropic/claude-sonnet-4.6 | Tiebreaker judge | Medium effort |
Full Benchmarking Panel (run on 1,200 holdout alongside human labels):
The Stage 1 models plus 6 SOTA frontier models — 9 total from 7 providers.
| Model | OpenRouter ID | Provider | Reasoning |
|---|---|---|---|
| Gemini 3.1 Flash Lite | google/gemini-3.1-flash-lite-preview | Google | Low |
| MiMo-V2-Flash | xiaomi/mimo-v2-flash | Xiaomi | Low |
| Grok 4.1 Fast | x-ai/grok-4.1-fast | xAI | Low |
| GPT-5.4 | openai/gpt-5.4 | OpenAI | Medium |
| Claude Sonnet 4.6 | anthropic/claude-sonnet-4.6 | Anthropic | Medium |
| Gemini 3.1 Pro Preview | google/gemini-3.1-pro-preview | Google | Medium |
| GLM-5 | zhipu/glm-5 | Zhipu AI | Medium |
| MiniMax-M2.7 | minimax/minimax-m2.7 | MiniMax | Medium |
| MiMo-V2-Pro | xiaomi/mimo-v2-pro | Xiaomi | Medium |
That's 9 models from 7 providers, comfortably exceeding the 6-from-3 requirement. All support structured outputs on OpenRouter.
2.2 Consensus Algorithm
Stage 1: 3-model majority vote.
- Each of the 3 models independently labels every paragraph via `generateObject` with the `LabelOutput` Zod schema (includes per-dimension confidence ratings).
- For each paragraph, compare the 3 labels on both dimensions (category + specificity).
- If 2/3 or 3/3 agree on BOTH dimensions → consensus reached.
- Expected agreement rate: ~80-85%.
- Confidence-aware routing: Even when models agree, if all 3 report "low" confidence on either dimension, route to Stage 2 judge anyway. These are hard cases that deserve a stronger model's opinion.
Stage 2: Judge tiebreaker.
- Claude Sonnet 4.6 (medium reasoning effort) receives the paragraph + all 3 Stage 1 labels (randomized order for anti-bias).
- Judge's label is treated as authoritative — if judge agrees with any Stage 1 model on both dimensions, that label wins. Otherwise judge's label is used directly.
- Remaining unresolved cases (~1-2%) flagged for human review.
Stage 3: Active learning pass.
- Cluster low-confidence cases by embedding similarity.
- Human-review ~2-5% of total to identify systematic rubric failures.
- Iterate rubric if patterns emerge, re-run affected subsets.
2.3 Reasoning Configuration
All Stage 1 and benchmark models are reasoning-capable. We use provider-appropriate "low" or "medium" effort settings to balance quality and cost.
OpenRouter reasoning params (passed via providerOptions or model-specific params):
- Google Gemini: `thinkingConfig: { thinkingBudget: 256 }` (low) / `1024` (medium)
- Xiaomi MiMo: thinking is default-on; use `reasoning_effort: "low"` / `"medium"` if supported
- xAI Grok: `reasoning_effort: "low"` / `"medium"`
- OpenAI GPT-5.4: `reasoning: { effort: "low" }` / `"medium"`
- Anthropic Claude: `thinking: { budgetTokens: 512 }` (low) / `2048` (medium)
Exact param names may vary per model on OpenRouter — verify during pilot. The reasoning tokens are tracked separately in usage.completion_tokens_details.reasoning_tokens.
2.4 Cost Tracking
OpenRouter returns actual cost in usage.cost for every response. No estimation needed. Reasoning tokens are included in cost automatically.
2.5 Rate Limiting
OpenRouter uses credit-based limiting for paid accounts, not fixed RPM. Your key shows requests: -1 (unlimited). There is no hard request-per-second cap — only Cloudflare DDoS protection if you dramatically exceed reasonable usage.
Our approach: Use p-limit concurrency control, starting at 10-15 concurrent requests. Ramp up if no 429s or latency degradation. Monitor account usage via GET /api/v1/key.
2.6 Technical Implementation
Core pattern: generateObject with Zod schema via OpenRouter.
```ts
import { generateObject } from "ai";
import { createOpenRouter } from "@openrouter/ai-sdk-provider";
import { LabelOutput } from "../schemas/label";

const openrouter = createOpenRouter();

const result = await generateObject({
  model: openrouter("google/gemini-3.1-flash-lite-preview"),
  schema: LabelOutput,
  system: SYSTEM_PROMPT,
  prompt: buildUserPrompt(paragraph),
  temperature: 0,
  mode: "json",
  // Reasoning effort — model-specific, set per provider
  providerOptions: {
    google: { thinkingConfig: { thinkingBudget: 256 } },
  },
});

// result.object: { content_category, specificity_level, category_confidence, specificity_confidence, reasoning }
// result.usage: { promptTokens, completionTokens }
// OpenRouter response body also includes usage.cost (actual USD)
// and usage.completion_tokens_details.reasoning_tokens
```
Generation ID tracking: Every OpenRouter response includes an id field (the generation ID). We store this in every annotation record for audit trail and GET /api/v1/generation?id={id} lookup.
Batch processing: Concurrency-limited via p-limit (start at 10-15 concurrent). Each successful annotation is appended immediately to JSONL (crash-safe checkpoint). On resume, completed paragraph IDs are read from the output file and skipped. Graceful shutdown on SIGINT — wait for in-flight requests, write session summary.
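A minimal sketch of the crash-safe checkpoint logic described above, in Python for illustration (the `paragraph_id` field name is an assumption):

```python
import json
from pathlib import Path

def completed_ids(output_path: Path) -> set[str]:
    """On resume, read paragraph IDs already annotated so they are skipped."""
    if not output_path.exists():
        return set()
    with output_path.open() as f:
        return {json.loads(line)["paragraph_id"] for line in f if line.strip()}

def append_annotation(output_path: Path, record: dict) -> None:
    """Append one annotation immediately; the file itself is the checkpoint."""
    with output_path.open("a") as f:
        f.write(json.dumps(record) + "\n")
```

Because each record is flushed as its own line, a crash mid-run loses at most the in-flight requests, and a restart resumes from `completed_ids`.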
Structured output: All panel models support structured_outputs on OpenRouter. Use mode: "json" in generateObject. Response Healing plugin (plugins: [{ id: 'response-healing' }]) available for edge cases.
Live observability: Every script that hits APIs renders a live dashboard to stderr (progress, ETA, session cost, latency percentiles, reasoning token usage). Session summaries append to data/metadata/sessions.jsonl.
Prompt tuning before scale: See LABELING-CODEBOOK.md for the 4-phase iterative prompt tuning protocol. Micro-pilot (30 paragraphs) → prompt revision → scale pilot (200 paragraphs) → green light. Do not fire the full 50K run until the scale pilot passes agreement targets.
3. Model Strategy
3.1 Primary: SEC-ModernBERT-large
This model does not exist publicly. Building it is a core contribution.
Base model: answerdotai/ModernBERT-large
- 395M parameters
- 8,192-token native context (vs. 512 for DeBERTa-v3-large)
- RoPE + alternating local/global attention + FlashAttention
- 2-4x faster than DeBERTa-v3-large
- Apache 2.0 license
- GLUE: 90.4
Step 1 — Domain-Adaptive Pre-Training (DAPT):
Continue MLM pre-training on SEC filing text to create "SEC-ModernBERT-large":
- Training corpus: 200-500M tokens from PleIAs/SEC or own EDGAR download. Include 10-Ks, 10-Qs, 8-Ks, proxy statements.
- MLM objective: 30% masking rate (ModernBERT convention)
- Learning rate: ~5e-5 (search range: 1e-5 to 1e-4)
- Hardware (RTX 3090): bf16, gradient checkpointing, seq_len=1024-2048, batch_size=2-4 + gradient accumulation to effective batch 16-32
- VRAM estimate: ~12-15GB at batch=4, seq=2048 with gradient checkpointing — fits on 3090
- Duration: ~2-3 days on single 3090
- Framework: HuggingFace Trainer + `DataCollatorForLanguageModeling` (Python script, not notebook)
Evidence DAPT works:
- Gururangan et al. (2020): consistent improvements across all tested domains
- Clinical ModernBERT, BioClinical ModernBERT: successful continued MLM on medical text
- Patent domain ModernBERT (arXiv:2509.14926): +0.9 to +2.8 F1 from continued pre-training on 31.6B tokens
- SEC filing scaling laws (arXiv:2512.12384): consistent improvement, largest gains in first 200M tokens
Step 2 — Classification Fine-Tuning:
Fine-tune SEC-ModernBERT-large on the labeled paragraphs:
- Architecture: shared encoder backbone → dropout → two linear classification heads
  - `category_head`: 7-class softmax (content category)
  - `specificity_head`: 4-class softmax (specificity level)
- Loss: `α × CE(category) + (1-α) × CE(specificity) + β × SCL`
  - `α` (`category_weight`): default 0.5, searchable
  - `β` (`scl_weight`): default 0, searchable (ablation)
- Sequence length: 2048 tokens
- VRAM: ~11-13GB at batch=8, seq=2048 in bf16 — comfortable on 3090
- bf16=True in HuggingFace Trainer (3090 Ampere supports natively)
- Framework: custom `MultiHeadClassifier` model + HuggingFace Trainer subclass
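In training this loss is computed in PyTorch over logits; a plain-Python sketch (with the SCL term passed in precomputed rather than implemented) makes the α/β weighting concrete:

```python
import math

def cross_entropy(probs: list[float], target: int) -> float:
    """CE for one example, given already-softmaxed probabilities."""
    return -math.log(probs[target])

def multitask_loss(cat_probs: list[float], cat_target: int,
                   spec_probs: list[float], spec_target: int,
                   alpha: float = 0.5, beta: float = 0.0,
                   scl: float = 0.0) -> float:
    """alpha * CE(category) + (1 - alpha) * CE(specificity) + beta * SCL."""
    return (alpha * cross_entropy(cat_probs, cat_target)
            + (1 - alpha) * cross_entropy(spec_probs, spec_target)
            + beta * scl)
```

With `alpha = 0.5` the two heads contribute equally; `beta = 0` recovers the plain two-head CE used outside the SCL ablation.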
3.2 Dark Horse: NeoBERT
- 250M parameters (100M fewer than ModernBERT-large)
- 4,096-token context
- SwiGLU, RoPE, Pre-RMSNorm, FlashAttention
- GLUE: 89.0 | MTEB: 51.3 (best in class — ModernBERT is 46.9)
- MIT license
- Requires `trust_remote_code=True`
Same DAPT + fine-tuning pipeline, even less VRAM. Interesting efficiency vs. quality tradeoff.
3.3 Baseline: DeBERTa-v3-large
- ~435M total parameters
- 512-token context (can push to ~1024)
- GLUE: 91.4 (highest among encoders)
- MIT license
- Weakness: no long context, fails at retrieval
Include as baseline to show improvement from (a) long context and (b) DAPT.
3.4 Decoder Experiment: Qwen3.5 via Unsloth
Experimental comparison of encoder vs. decoder approach:
- Model: Qwen3.5-1.5B or Qwen3.5-7B (smallest viable decoder)
- Framework: Unsloth (2x faster than Axolotl, 80% less VRAM, optimized for Qwen)
- Method: QLoRA fine-tuning — train the model to output the same JSON schema as the GenAI labelers
- Purpose: "Additional baseline" for A-grade requirement + demonstrates encoder advantage for classification
3.5 Domain-Specific Baselines (for comparison)
All BERT-base (110M params, 512 context) — architecturally outdated:
| Model | HuggingFace ID | Domain |
|---|---|---|
| SEC-BERT | nlpaueb/sec-bert-base | 260K 10-K filings |
| FinBERT | ProsusAI/finbert | Financial sentiment |
| SecureBERT | arXiv:2204.02685 | Cybersecurity text |
3.6 Ablation Design
| # | Experiment | Model | Context | DAPT | SCL | Purpose |
|---|---|---|---|---|---|---|
| 1 | Baseline | DeBERTa-v3-large | 512 | No | No | Standard approach per syllabus |
| 2 | + Long context | ModernBERT-large | 2048 | No | No | Context window benefit |
| 3 | + Domain adapt | SEC-ModernBERT-large | 2048 | Yes | No | DAPT benefit |
| 4 | + Contrastive | SEC-ModernBERT-large | 2048 | Yes | Yes | SCL benefit |
| 5 | Efficiency | NeoBERT (+ DAPT) | 2048 | Yes | Yes | 40% fewer params |
| 6 | Decoder | Qwen3.5 LoRA | 2048 | No | No | Encoder vs decoder |
| 7 | Ensemble | SEC-ModernBERT + DeBERTa | mixed | mixed | — | Maximum performance |
3.7 Hyperparameter Search (Autoresearch Pattern)
Inspired by Karpathy's autoresearch: an agent autonomously iterates on training configs using a program.md directive.
How it works:
- Agent reads `program.md`, which defines: fixed time budget (30 min), evaluation metric (`val_macro_f1`), what can be modified (YAML config values), what cannot (data splits, eval script, seed)
- Agent modifies one hyperparameter in the YAML config
- Agent runs training for 30 minutes
- Agent evaluates on the validation set
- If `val_macro_f1` improved by ≥ 0.002 → keep the checkpoint, else discard
- Agent logs the result to `results/experiments.tsv` and repeats
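The accept/reject core of the loop is small; training itself is out of scope here, so this sketch shows only the threshold logic and TSV logging:

```python
import csv
from pathlib import Path

ACCEPT_THRESHOLD = 0.002  # minimum val_macro_f1 gain to keep a checkpoint

def autoresearch_step(best_f1: float, trial_f1: float, config_change: str,
                      log_path: Path) -> tuple[float, bool]:
    """One step: keep the config change iff it improves val_macro_f1 enough."""
    accepted = (trial_f1 - best_f1) >= ACCEPT_THRESHOLD
    with log_path.open("a", newline="") as f:
        csv.writer(f, delimiter="\t").writerow(
            [config_change, f"{trial_f1:.4f}", "keep" if accepted else "discard"])
    return (trial_f1 if accepted else best_f1), accepted
```

The fixed threshold guards against keeping checkpoints for noise-level improvements from 30-minute runs.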
Search spaces:
DAPT:
- learning_rate: [1e-5, 2e-5, 5e-5, 1e-4]
- mlm_probability: [0.15, 0.20, 0.30]
- max_seq_length: [1024, 2048]
- effective batch size: [8, 16, 32]
Encoder fine-tuning:
- learning_rate: [1e-5, 2e-5, 3e-5, 5e-5]
- category_weight: [0.3, 0.4, 0.5, 0.6, 0.7]
- label_smoothing: [0, 0.05, 0.1]
- scl_weight: [0, 0.1, 0.2, 0.5]
- dropout: [0.05, 0.1, 0.2]
- pool_strategy: ["cls", "mean"]
- max_seq_length: [512, 1024, 2048]
Decoder (Unsloth LoRA):
- lora_r: [8, 16, 32, 64]
- lora_alpha: [16, 32, 64]
- learning_rate: [1e-4, 2e-4, 5e-4]
4. Evaluation & Validation
4.1 Required Metrics
| Metric | Target | Notes |
|---|---|---|
| Macro-F1 on holdout | > 0.80 for C, higher for A | Per-class and overall |
| Per-class F1 | Identify weak categories | Expect "None/Other" noisiest |
| Krippendorff's Alpha | > 0.67 adequate, > 0.75 good | GenAI vs human gold set |
| MCC | Report alongside F1 | More robust for imbalanced classes |
| Specificity MAE | Report for ordinal dimension | Mean absolute error on the 4-level ordinal scale |
| Calibration plots | Reliability diagrams | For softmax outputs |
| Robustness splits | By time, industry, filing size | FY2023 vs FY2024; GICS sector; word count quartiles |
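The ordinal MAE can be computed directly once specificity levels are encoded as integers (the 0-3 encoding is an assumption):

```python
def specificity_mae(pred: list[int], gold: list[int]) -> float:
    """MAE over ordinal specificity labels (0-3).

    Unlike F1, this penalizes a one-level miss less than a three-level
    miss, which is the point of reporting it for the ordinal dimension.
    """
    assert len(pred) == len(gold) and pred, "need equal-length, non-empty lists"
    return sum(abs(p - g) for p, g in zip(pred, gold)) / len(pred)
```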
4.2 Downstream Validity Tests
Test 1 — Breach Prediction (strongest): Do firms with lower specificity scores subsequently appear in breach databases?
- Privacy Rights Clearinghouse — 80K+ breaches, ticker/CIK matching
- VCDB — 8K+ incidents, VERIS schema
- Board Cybersecurity Incident Tracker — direct SEC filing links
- CISA KEV Catalog — known exploited vulnerabilities
Test 2 — Market Reaction (optional): Event study: abnormal returns around 8-K Item 1.05 filing. Does prior Item 1C quality predict reaction magnitude? Small sample (~55 incidents) but high signal.
Test 3 — Known-Groups Validity (easy, always include): Do regulated industries (NYDFS, HIPAA) produce higher-specificity disclosures? Do larger firms have more specific disclosures? Expected results that validate the measure.
Test 4 — Boilerplate Index (easy, always include): Cosine similarity of each company's Item 1C to industry-median disclosure. Specificity score should inversely correlate — independent, construct-free validation.
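A raw term-frequency sketch of the boilerplate index (a production version would likely use TF-IDF or embeddings rather than raw counts):

```python
import math
from collections import Counter

def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(a[t] * b[t] for t in a)
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0

def boilerplate_index(disclosure: str, industry_median: str) -> float:
    """~1.0 means the filing tracks the industry-median text (boilerplate);
    lower values mean distinctive, firm-specific language."""
    return cosine(Counter(disclosure.lower().split()),
                  Counter(industry_median.lower().split()))
```

High specificity scores from the classifier should correlate with low values of this index, giving a validation signal that never touches the label rubric.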
4.3 External Benchmark
Per syllabus requirement:
- Board Cybersecurity's 23-feature regex extraction — natural benchmark. Their binary feature coding is prior best practice. Our classifier captures everything their regex does plus quality/specificity.
- Florackis et al. (2023) cyber risk measure — different section (1A vs 1C), different methodology, different era.
5. SEC Regulatory Context
The Rule: SEC Release 33-11216 (July 2023)
Item 1C (10-K Annual Disclosure) — Regulation S-K Item 106:
Item 106(b) — Risk Management and Strategy:
- Processes for assessing, identifying, and managing material cybersecurity risks
- Whether cybersecurity processes integrate into overall ERM
- Whether the company engages external assessors, consultants, or auditors
- Processes to oversee risks from third-party service providers
- Whether cybersecurity risks have materially affected business strategy, results, or financial condition
Item 106(c) — Governance:
- Board oversight (106(c)(1)): oversight description, responsible committee, information processes
- Management's role (106(c)(2)): responsible positions, expertise, monitoring processes, board reporting frequency
Item 1.05 (8-K Incident Disclosure):
- Required within 4 business days of materiality determination
- Material aspects of nature, scope, timing + material impact
- No technical details that would impede response/remediation
- AG can delay up to 120 days for national security
Key design note: The SEC uses "describe" — non-exclusive suggestions create natural variation in specificity and content. This is what makes the construct classifiable.
Compliance Timeline
| Date | Milestone |
|---|---|
| Jul 26, 2023 | Rule adopted |
| Dec 15, 2023 | Item 1C required in 10-Ks |
| Dec 18, 2023 | Item 1.05 required in 8-Ks |
| Jun 15, 2024 | Item 1.05 required for smaller reporting companies |
| Dec 15, 2024 | iXBRL tagging of Item 106 (CYD taxonomy) required |
iXBRL CYD Taxonomy
Published Sep 16, 2024. Starting Dec 15, 2024, Item 1C tagged in Inline XBRL with cyd prefix.
- Schema: http://xbrl.sec.gov/cyd/2024
- Taxonomy guide (PDF)
6. References
SEC Rule & Guidance
- SEC Final Rule 33-11216 (PDF)
- SEC Fact Sheet
- SEC Small Business Compliance Guide
- CYD iXBRL Taxonomy Guide (PDF)
Law Firm Surveys & Analysis
- Gibson Dunn S&P 100 Survey
- PwC First Wave of 10-K Cyber Disclosures
- Debevoise 8-K Tracker
- Greenberg Traurig 2025 Trends
- Known Trends: First Year of 8-K Filings
- NYU: Lessons Learned from 8-K Reporting
Data Extraction Tools
- edgar-crawler
- edgartools
- sec-edgar-downloader
- sec-8k-item105
- SECurityTr8Ker
- SEC EDGAR APIs
- SEC EDGAR Full-Text Search
Datasets
- PleIAs/SEC — 373K 10-K texts (CC0)
- EDGAR-CORPUS — 220K filings, sections parsed (Apache 2.0)
- Board Cybersecurity 23-Feature Analysis
- Board Cybersecurity Incident Tracker
- PRC Mendeley Breach Dataset
- VCDB
- CISA KEV Catalog
- zeroshot/cybersecurity-corpus
Models
- ModernBERT-large (Apache 2.0)
- ModernBERT-base (Apache 2.0)
- NeoBERT (MIT)
- DeBERTa-v3-large (MIT)
- SEC-BERT
- FinBERT
- EvasionBench Eva-4B-V2
Key Papers
- Ringel (2023), "Creating Synthetic Experts with Generative AI" — SSRN:4542949
- Ludwig et al. (2026), "Extracting Consumer Insight from Text" — arXiv:2602.15312
- Ma et al. (2026), "EvasionBench" — arXiv:2601.09142
- Florackis et al. (2023), "Cybersecurity Risk" — SSRN:3725130
- Gururangan et al. (2020), "Don't Stop Pretraining" — arXiv:2004.10964
- ModernBERT — arXiv:2412.13663
- NeoBERT — arXiv:2502.19587
- ModernBERT vs DeBERTa-v3 — arXiv:2504.08716
- Patent domain ModernBERT DAPT — arXiv:2509.14926
- SEC filing scaling laws — arXiv:2512.12384
- Gunel et al. (2020), Supervised Contrastive Learning — OpenReview
- Phil Schmid, "Fine-tune ModernBERT" — philschmid.de
- Berkman et al. (2018), Cybersecurity disclosure quality scoring
- SecureBERT — arXiv:2204.02685
- Gilardi et al. (2023), "ChatGPT Outperforms Crowd-Workers" — arXiv:2303.15056
- Kiefer et al. (2025), ESG-Activities benchmark — arXiv:2502.21112