SEC-cyBERT/docs/data-pipeline/TECHNICAL-GUIDE.md

Technical Guide — SEC Cybersecurity Disclosure Classifier

Everything needed to build the pipeline: data acquisition, GenAI labeling, model training, evaluation, and references.

Stack: TypeScript (bun) for data/labeling/eval, Python (uv) for training. Vercel AI SDK v6 + OpenRouter for all LLM calls. HuggingFace Trainer for encoder training, Unsloth for decoder experiment.


1. Data Acquisition

1.1 Extracting 10-K Item 1C

Pipeline:

EDGAR API  →  download 10-K HTML  →  extract Item 1C  →  paragraph segmentation  →  JSONL

Tools:

| Tool | Purpose | Install | Notes |
|------|---------|---------|-------|
| sec-edgar-downloader | Bulk download 10-K filings by CIK | `uv add sec-edgar-downloader` | Pure downloader, no parsing |
| edgar-crawler | Extract specific item sections to JSON | `git clone github.com/lefterisloukas/edgar-crawler` | Configure `['1C']` in items list |
| edgartools | Interactive exploration, XBRL parsing | `uv add edgartools` | `tenk['Item 1C']` accessor; great for prototyping |

EDGAR API requirements:

  • Rate limit: 10 requests/second
  • Required: Custom User-Agent header with name and email (e.g., "sec-cyBERT team@email.com")
  • SEC blocks requests without proper User-Agent (returns 403)
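The requirements above can be sketched with the standard library only; the contact string and URL are illustrative, and production code should use a real name and email:

```python
import time
from urllib.request import Request

# Illustrative contact string; SEC requires a real name and email address.
HEADERS = {"User-Agent": "sec-cyBERT team@email.com"}


def edgar_request(url: str) -> Request:
    """Build an EDGAR request carrying the required User-Agent header."""
    return Request(url, headers=HEADERS)


class Throttle:
    """Spaces successive calls so we stay under EDGAR's 10 requests/second limit."""

    def __init__(self, rps: float = 10.0) -> None:
        self.min_interval = 1.0 / rps
        self.last = 0.0

    def wait(self) -> None:
        sleep_for = self.min_interval - (time.monotonic() - self.last)
        if sleep_for > 0:
            time.sleep(sleep_for)
        self.last = time.monotonic()


req = edgar_request("https://data.sec.gov/submissions/CIK0000320193.json")
```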

For iXBRL-tagged filings (2025+): Use edgartools XBRL parser to extract CYD taxonomy elements directly. The cyd prefix tags give pre-structured data aligned with regulatory categories.

Fallback corpus: PleIAs/SEC on HuggingFace (373K 10-K full texts, CC0 license) — sections NOT pre-parsed; must extract Item 1C yourself.

1.2 Extracting 8-K Incident Disclosures

| Tool | Purpose | Link |
|------|---------|------|
| sec-8k-item105 | Extract Item 1.05 from 8-Ks, iXBRL + HTML fallback | github.com/JMousqueton/sec-8k-item105 |
| SECurityTr8Ker | Monitor SEC RSS for new cyber 8-Ks | github.com/pancak3lullz/SECurityTr8Ker |
| Debevoise 8-K Tracker | Curated list with filing links | debevoisedatablog.com |
| Board Cybersecurity Tracker | Links filings to MITRE ATT&CK | board-cybersecurity.com/incidents/tracker |

Critical: Must capture Item 1.05 AND Items 8.01/7.01 (post-May 2024 shift where companies moved non-material disclosures away from 1.05).

1.3 Paragraph Segmentation

Once Item 1C text is extracted:

  • Split on double newlines or <p> tags (depending on extraction format)
  • Minimum paragraph length: 20 words (filter out headers, whitespace)
  • Maximum paragraph length: 500 words (split longer blocks at sentence boundaries)
  • Preserve metadata: company name, CIK, ticker, filing date, fiscal year

Expected yield: ~5-8 paragraphs per Item 1C × ~9,000 filings = ~50,000-70,000 paragraphs
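A minimal segmentation sketch in Python implementing the rules above (thresholds as stated; the sentence splitter is a simple regex stand-in, not the production extractor):

```python
import re


def segment(item_1c_text: str, min_words: int = 20, max_words: int = 500) -> list[str]:
    """Split extracted Item 1C text into paragraphs per the rules above."""
    paragraphs: list[str] = []
    for block in re.split(r"\n\s*\n", item_1c_text):  # double-newline boundaries
        block = block.strip()
        words = block.split()
        if len(words) < min_words:  # drop headers / whitespace fragments
            continue
        if len(words) <= max_words:
            paragraphs.append(block)
            continue
        # Over-long block: split at sentence boundaries, greedily packing
        current: list[str] = []
        for sentence in re.split(r"(?<=[.!?])\s+", block):
            if current and len(current) + len(sentence.split()) > max_words:
                paragraphs.append(" ".join(current))
                current = []
            current.extend(sentence.split())
        if len(current) >= min_words:
            paragraphs.append(" ".join(current))
    return paragraphs
```

Metadata (company, CIK, ticker, filing date, fiscal year) would be attached to each paragraph record before writing JSONL.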

1.4 Pre-Existing Datasets

| Resource | What It Is | License |
|----------|------------|---------|
| PleIAs/SEC | 373K full 10-K texts | CC0 |
| EDGAR-CORPUS | 220K filings with sections pre-parsed | Apache 2.0 |
| Board Cybersecurity 23-Feature Analysis | Regex extraction of 23 governance features from 4,538 10-Ks | Research |
| Gibson Dunn S&P 100 Survey | Detailed disclosure feature analysis | Research |
| Florackis et al. (2023) | Firm-level cyber risk measure from 10-K text | SSRN |
| zeroshot/cybersecurity-corpus | General cybersecurity text (useful for DAPT) | HuggingFace |

2. GenAI Labeling Pipeline

All LLM calls go through OpenRouter via @openrouter/ai-sdk-provider + Vercel AI SDK v6 generateObject. OpenRouter returns actual cost in usage.cost — no estimation needed.

2.1 Model Panel

Stage 1 — Three Independent Annotators (all ~50K paragraphs):

All three are reasoning models. Use low reasoning effort to get a cheap thinking pass without blowing up token costs.

| Model | OpenRouter ID | Role | Reasoning |
|-------|---------------|------|-----------|
| Gemini 3.1 Flash Lite | google/gemini-3.1-flash-lite-preview | Cheap + capable | Low effort |
| MiMo-V2-Flash | xiaomi/mimo-v2-flash | Xiaomi reasoning flash | Low effort |
| Grok 4.1 Fast | x-ai/grok-4.1-fast | xAI fast tier | Low effort |

Provider diversity: Google, Xiaomi, xAI — three different architectures, minimizes correlated errors.

Stage 2 — Judge for Disagreements (~15-20% of paragraphs):

| Model | OpenRouter ID | Role | Reasoning |
|-------|---------------|------|-----------|
| Claude Sonnet 4.6 | anthropic/claude-sonnet-4.6 | Tiebreaker judge | Medium effort |

Full Benchmarking Panel (run on 1,200 holdout alongside human labels):

The Stage 1 models plus 6 SOTA frontier models: 9 total from 7 providers.

| Model | OpenRouter ID | Provider | Reasoning |
|-------|---------------|----------|-----------|
| Gemini 3.1 Flash Lite | google/gemini-3.1-flash-lite-preview | Google | Low |
| MiMo-V2-Flash | xiaomi/mimo-v2-flash | Xiaomi | Low |
| Grok 4.1 Fast | x-ai/grok-4.1-fast | xAI | Low |
| GPT-5.4 | openai/gpt-5.4 | OpenAI | Medium |
| Claude Sonnet 4.6 | anthropic/claude-sonnet-4.6 | Anthropic | Medium |
| Gemini 3.1 Pro Preview | google/gemini-3.1-pro-preview | Google | Medium |
| GLM-5 | zhipu/glm-5 | Zhipu AI | Medium |
| MiniMax-M2.7 | minimax/minimax-m2.7 | MiniMax | Medium |
| MiMo-V2-Pro | xiaomi/mimo-v2-pro | Xiaomi | Medium |

That's 9 models from 7 providers, exceeding the 6-from-3 requirement. All support structured outputs on OpenRouter.

2.2 Consensus Algorithm

Stage 1: 3-model majority vote.

  • Each of 3 models independently labels every paragraph via generateObject with the LabelOutput Zod schema (includes per-dimension confidence ratings).
  • For each paragraph, compare the 3 labels on both dimensions (category + specificity).
  • If 2/3 or 3/3 agree on BOTH dimensions → consensus reached.
  • Expected agreement rate: ~80-85%.
  • Confidence-aware routing: Even when models agree, if all 3 report "low" confidence on either dimension, route to Stage 2 judge anyway. These are hard cases that deserve a stronger model's opinion.

Stage 2: Judge tiebreaker.

  • Claude Sonnet 4.6 (medium reasoning effort) receives the paragraph + all 3 Stage 1 labels (randomized order for anti-bias).
  • Judge's label is treated as authoritative — if judge agrees with any Stage 1 model on both dimensions, that label wins. Otherwise judge's label is used directly.
  • Remaining unresolved cases (~1-2%) flagged for human review.

Stage 3: Active learning pass.

  • Cluster low-confidence cases by embedding similarity.
  • Human-review ~2-5% of total to identify systematic rubric failures.
  • Iterate rubric if patterns emerge, re-run affected subsets.

2.3 Reasoning Configuration

All Stage 1 and benchmark models are reasoning-capable. We use provider-appropriate "low" or "medium" effort settings to balance quality and cost.

OpenRouter reasoning params (passed via providerOptions or model-specific params):

  • Google Gemini: thinkingConfig: { thinkingBudget: 256 } (low) / 1024 (medium)
  • Xiaomi MiMo: Thinking is default-on; use reasoning_effort: "low" / "medium" if supported
  • xAI Grok: reasoning_effort: "low" / "medium"
  • OpenAI GPT-5.4: reasoning: { effort: "low" } / "medium"
  • Anthropic Claude: thinking: { budgetTokens: 512 } (low) / 2048 (medium)

Exact param names may vary per model on OpenRouter — verify during pilot. The reasoning tokens are tracked separately in usage.completion_tokens_details.reasoning_tokens.

2.4 Cost Tracking

OpenRouter returns actual cost in usage.cost for every response. No estimation needed. Reasoning tokens are included in cost automatically.

2.5 Rate Limiting

OpenRouter uses credit-based limiting for paid accounts, not fixed RPM. Your key shows requests: -1 (unlimited). There is no hard request-per-second cap — only Cloudflare DDoS protection if you dramatically exceed reasonable usage.

Our approach: Use p-limit concurrency control, starting at 10-15 concurrent requests. Ramp up if no 429s or latency degradation. Monitor account usage via GET /api/v1/key.
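The pipeline itself uses p-limit in TypeScript; the same bounded-concurrency pattern, sketched in Python with asyncio.Semaphore for illustration:

```python
import asyncio


async def run_with_limit(tasks, limit: int = 10):
    """p-limit equivalent: at most `limit` coroutines in flight at once."""
    sem = asyncio.Semaphore(limit)

    async def guarded(coro):
        async with sem:
            return await coro

    # gather preserves input order regardless of completion order
    return await asyncio.gather(*(guarded(t) for t in tasks))
```

Start with limit=10-15 and raise it only if no 429s or latency degradation appear.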

2.6 Technical Implementation

Core pattern: generateObject with Zod schema via OpenRouter.

import { generateObject } from "ai";
import { createOpenRouter } from "@openrouter/ai-sdk-provider";
import { LabelOutput } from "../schemas/label";

const openrouter = createOpenRouter(); // reads OPENROUTER_API_KEY from the environment by default

const result = await generateObject({
  model: openrouter("google/gemini-3.1-flash-lite-preview"),
  schema: LabelOutput,
  system: SYSTEM_PROMPT,
  prompt: buildUserPrompt(paragraph),
  temperature: 0,
  mode: "json",
  // Reasoning effort — model-specific, set per provider
  providerOptions: {
    google: { thinkingConfig: { thinkingBudget: 256 } },
  },
});

// result.object: { content_category, specificity_level, category_confidence, specificity_confidence, reasoning }
// result.usage: { promptTokens, completionTokens }
// OpenRouter response body also includes usage.cost (actual USD)
// and usage.completion_tokens_details.reasoning_tokens

Generation ID tracking: Every OpenRouter response includes an id field (the generation ID). We store this in every annotation record for audit trail and GET /api/v1/generation?id={id} lookup.

Batch processing: Concurrency-limited via p-limit (start at 10-15 concurrent). Each successful annotation is appended immediately to JSONL (crash-safe checkpoint). On resume, completed paragraph IDs are read from the output file and skipped. Graceful shutdown on SIGINT — wait for in-flight requests, write session summary.
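The crash-safe checkpoint logic can be sketched in Python (the field name `paragraph_id` is illustrative; the TypeScript pipeline defines the actual record shape):

```python
import json
from pathlib import Path


def completed_ids(output_path: Path) -> set[str]:
    """Read paragraph IDs already annotated, so a resumed run skips them."""
    done: set[str] = set()
    if output_path.exists():
        for line in output_path.read_text().splitlines():
            if line.strip():
                done.add(json.loads(line)["paragraph_id"])
    return done


def append_annotation(output_path: Path, record: dict) -> None:
    """Append one record immediately: the JSONL file itself is the checkpoint."""
    with output_path.open("a") as f:
        f.write(json.dumps(record) + "\n")
```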

Structured output: All panel models support structured_outputs on OpenRouter. Use mode: "json" in generateObject. Response Healing plugin (plugins: [{ id: 'response-healing' }]) available for edge cases.

Live observability: Every script that hits APIs renders a live dashboard to stderr (progress, ETA, session cost, latency percentiles, reasoning token usage). Session summaries append to data/metadata/sessions.jsonl.

Prompt tuning before scale: See LABELING-CODEBOOK.md for the 4-phase iterative prompt tuning protocol. Micro-pilot (30 paragraphs) → prompt revision → scale pilot (200 paragraphs) → green light. Do not fire the full 50K run until the scale pilot passes agreement targets.


3. Model Strategy

3.1 Primary: SEC-ModernBERT-large

This model does not exist publicly. Building it is a core contribution.

Base model: answerdotai/ModernBERT-large

  • 395M parameters
  • 8,192-token native context (vs. 512 for DeBERTa-v3-large)
  • RoPE + alternating local/global attention + FlashAttention
  • 2-4x faster than DeBERTa-v3-large
  • Apache 2.0 license
  • GLUE: 90.4

Step 1 — Domain-Adaptive Pre-Training (DAPT):

Continue MLM pre-training on SEC filing text to create "SEC-ModernBERT-large":

  • Training corpus: ~450M tokens from our own 9,000 cached 10-K filings (FY2023-2024), full filing text (not just Item 1C). These are the same filings we extracted Item 1C from — we already have the raw HTML cached locally and the cleaning pipeline built (stripHtml() in fast-reparse.ts).
  • Why not PleIAs/SEC? The PleIAs/SEC dataset (373K filings, ~18B tokens) goes back much further in time, but: (a) one pass would take weeks on a single 3090, (b) Item 1C didn't exist before FY2023 so pre-2023 filings lack the cybersecurity disclosure vocabulary that matters most for our task, (c) the SEC filing scaling laws paper (arXiv:2512.12384) shows the largest gains come in the first 200M tokens — our 450M from recent, relevant filings is already in the sweet spot.
  • Corpus preparation: bun run ts/scripts/dapt-corpus-prep.ts runs stripHtml() (from ts/src/extract/html-cleaner.ts) on all cached filing HTML (full text, no section extraction) and outputs clean text as sharded JSONL to data/dapt-corpus/. Same HTML cleaning that handles XBRL tags, entity decoding, page artifacts, inline element boundaries — just without the Item 1C section boundary step.
  • MLM objective: 30% masking rate (ModernBERT convention)
  • Learning rate: ~5e-5 (search range: 1e-5 to 1e-4)
  • Hardware (RTX 3090): bf16, gradient checkpointing, seq_len=1024-2048, batch_size=2-4 + gradient accumulation to effective batch 16-32
  • VRAM estimate: ~12-15GB at batch=4, seq=2048 with gradient checkpointing — fits on 3090
  • Duration: ~2-3 days on single 3090
  • Framework: HuggingFace Trainer + DataCollatorForLanguageModeling (Python script, not notebook)
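In practice DataCollatorForLanguageModeling handles the masking; for intuition, a minimal sketch of the 30% BERT-style masking objective (80/10/10 split among mask token, random token, and unchanged):

```python
import random


def mask_for_mlm(token_ids, mask_id, vocab_size, mlm_prob=0.30, rng=None):
    """BERT-style masking at ModernBERT's 30% rate.

    Of selected positions: 80% become [MASK], 10% a random token, 10% unchanged.
    Returns (masked_input, labels) with -100 at unselected positions.
    """
    rng = rng or random.Random(0)
    inputs, labels = [], []
    for tok in token_ids:
        if rng.random() < mlm_prob:
            labels.append(tok)  # the model must predict the original token
            roll = rng.random()
            if roll < 0.8:
                inputs.append(mask_id)
            elif roll < 0.9:
                inputs.append(rng.randrange(vocab_size))
            else:
                inputs.append(tok)
        else:
            labels.append(-100)  # ignored by the loss
            inputs.append(tok)
    return inputs, labels
```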

Step 1b — Task-Adaptive Pre-Training (TAPT):

After DAPT, continue MLM pre-training specifically on the task distribution to create "SEC-cyBERT-large":

  • Training corpus: 72K Item 1C paragraphs from paragraphs-clean.jsonl (~5-10M tokens). Already clean text — no additional processing needed.
  • Why TAPT on top of DAPT? Gururangan et al. (2020) "Don't Stop Pretraining" showed that TAPT gives consistent gains on top of DAPT, especially when the task distribution differs from the broader domain. Item 1C is a very specific subset of 10-K filings — it has its own vocabulary (CISO, NIST CSF, tabletop exercises, materiality assessments), structure (governance → management → process → strategy pattern), and boilerplate patterns that differ substantially from the rest of a 10-K. TAPT teaches the model this specific distribution.
  • Duration: ~2-3 hours on a 3090 (tiny corpus vs DAPT)
  • Same hyperparameters as DAPT (learning rate, masking rate, etc.)

Evidence DAPT + TAPT works:

  • Gururangan et al. (2020): DAPT + TAPT together outperforms either alone across all tested domains. TAPT alone gives 1-3 F1 points; combined with DAPT, gains are additive.
  • Clinical ModernBERT, BioClinical ModernBERT: successful continued MLM on medical text
  • Patent domain ModernBERT (arXiv:2509.14926): +0.9 to +2.8 F1 from continued pre-training on 31.6B tokens
  • SEC filing scaling laws (arXiv:2512.12384): consistent improvement, largest gains in first 200M tokens

Step 2 — Classification Fine-Tuning:

Fine-tune SEC-ModernBERT-large on the labeled paragraphs:

  • Architecture: Shared encoder backbone → dropout → two linear classification heads
    • category_head: 7-class softmax (content category)
    • specificity_head: 4-class softmax (specificity level)
  • Loss: α × CE(category) + (1-α) × CE(specificity) + β × SCL
    • α (category_weight): default 0.5, searchable
    • β (scl_weight): default 0, searchable (ablation)
  • Sequence length: 2048 tokens
  • VRAM: ~11-13GB at batch=8, seq=2048 in bf16 — comfortable on 3090
  • bf16=True in HuggingFace Trainer (3090 Ampere supports natively)
  • Framework: Custom MultiHeadClassifier model + HuggingFace Trainer subclass
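A worked sketch of the loss above in plain Python (the real implementation uses torch cross-entropy inside the Trainer subclass; here the SCL term is passed in as a precomputed scalar):

```python
import math


def cross_entropy(logits: list[float], target: int) -> float:
    """Softmax cross-entropy for a single example, numerically stable."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return log_z - logits[target]


def multitask_loss(cat_logits, cat_y, spec_logits, spec_y,
                   alpha: float = 0.5, beta: float = 0.0, scl: float = 0.0) -> float:
    """alpha * CE(category, 7-class) + (1 - alpha) * CE(specificity, 4-class) + beta * SCL."""
    return (alpha * cross_entropy(cat_logits, cat_y)
            + (1 - alpha) * cross_entropy(spec_logits, spec_y)
            + beta * scl)
```

With uniform logits and alpha = 0.5 the loss reduces to 0.5·ln 7 + 0.5·ln 4, a useful sanity check that the heads are wired up correctly.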

3.2 Dark Horse: NeoBERT

chandar-lab/NeoBERT

  • 250M parameters (145M fewer than ModernBERT-large)
  • 4,096-token context
  • SwiGLU, RoPE, Pre-RMSNorm, FlashAttention
  • GLUE: 89.0 | MTEB: 51.3 (best in class; ModernBERT is 46.9)
  • MIT license
  • Requires trust_remote_code=True

Same DAPT + fine-tuning pipeline, even less VRAM. Interesting efficiency vs. quality tradeoff.

3.3 Baseline: DeBERTa-v3-large

microsoft/deberta-v3-large

  • ~435M total parameters
  • 512-token context (can push to ~1024)
  • GLUE: 91.4 (highest among encoders)
  • MIT license
  • Weakness: no long context, fails at retrieval

Include as baseline to show improvement from (a) long context and (b) DAPT.

3.4 Decoder Experiment: Qwen3.5 via Unsloth

Experimental comparison of encoder vs. decoder approach:

  • Model: Qwen3.5-1.5B or Qwen3.5-7B (smallest viable decoder)
  • Framework: Unsloth (2x faster than Axolotl, 80% less VRAM, optimized for Qwen)
  • Method: QLoRA fine-tuning — train the model to output the same JSON schema as the GenAI labelers
  • Purpose: "Additional baseline" for A-grade requirement + demonstrates encoder advantage for classification

3.5 Domain-Specific Baselines (for comparison)

All BERT-base (110M params, 512 context) — architecturally outdated:

| Model | HuggingFace ID | Domain |
|-------|----------------|--------|
| SEC-BERT | nlpaueb/sec-bert-base | 260K 10-K filings |
| FinBERT | ProsusAI/finbert | Financial sentiment |
| SecureBERT | arXiv:2204.02685 | Cybersecurity text |

3.6 Ablation Design

| # | Experiment | Model | Context | DAPT | TAPT | SCL | Purpose |
|---|------------|-------|---------|------|------|-----|---------|
| 1 | Baseline | DeBERTa-v3-large | 512 | No | No | No | Standard approach per syllabus |
| 2 | + Long context | ModernBERT-large | 2048 | No | No | No | Context window benefit |
| 3 | + Domain adapt | SEC-ModernBERT-large | 2048 | Yes | No | No | DAPT benefit |
| 3b | + Task adapt | SEC-cyBERT-large | 2048 | Yes | Yes | No | TAPT benefit (Gururangan et al.) |
| 4 | + Contrastive | SEC-cyBERT-large | 2048 | Yes | Yes | Yes | SCL benefit |
| 5 | Efficiency | NeoBERT (+ DAPT + TAPT) | 2048 | Yes | Yes | Yes | 40% fewer params |
| 6 | Decoder | Qwen3.5 LoRA | 2048 | No | No | No | Encoder vs decoder |
| 7 | Ensemble | SEC-cyBERT + DeBERTa | mixed | mixed | mixed | mixed | Maximum performance |

3.7 Hyperparameter Search (Autoresearch Pattern)

Inspired by Karpathy's autoresearch: an agent autonomously iterates on training configs using a program.md directive.

How it works:

  1. Agent reads program.md which defines: fixed time budget (30 min), evaluation metric (val_macro_f1), what can be modified (YAML config values), what cannot (data splits, eval script, seed)
  2. Agent modifies one hyperparameter in the YAML config
  3. Agent runs training for 30 minutes
  4. Agent evaluates on validation set
  5. If val_macro_f1 improved by ≥ 0.002 → keep checkpoint, else discard
  6. Agent logs result to results/experiments.tsv and repeats
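The keep-or-discard rule in step 5 is a one-liner; a minimal sketch:

```python
def update_best(best_f1: float, new_f1: float, min_delta: float = 0.002):
    """Returns (new_best, keep): keep the checkpoint only if
    val_macro_f1 improved by at least min_delta."""
    if new_f1 - best_f1 >= min_delta:
        return new_f1, True
    return best_f1, False
```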

Search spaces:

DAPT:

  • learning_rate: [1e-5, 2e-5, 5e-5, 1e-4]
  • mlm_probability: [0.15, 0.20, 0.30]
  • max_seq_length: [1024, 2048]
  • effective batch size: [8, 16, 32]

Encoder fine-tuning:

  • learning_rate: [1e-5, 2e-5, 3e-5, 5e-5]
  • category_weight: [0.3, 0.4, 0.5, 0.6, 0.7]
  • label_smoothing: [0, 0.05, 0.1]
  • scl_weight: [0, 0.1, 0.2, 0.5]
  • dropout: [0.05, 0.1, 0.2]
  • pool_strategy: ["cls", "mean"]
  • max_seq_length: [512, 1024, 2048]

Decoder (Unsloth LoRA):

  • lora_r: [8, 16, 32, 64]
  • lora_alpha: [16, 32, 64]
  • learning_rate: [1e-4, 2e-4, 5e-4]

3.8 Python Training Package (python/)

Structured Python package for all training stages. All commands run from python/ via uv run main.py <command>.

Package layout:

python/
├── main.py                         # CLI: uv run main.py {dapt,finetune,eval,decoder}
├── pyproject.toml                  # Dependencies (torch, transformers, datasets, accelerate)
├── configs/
│   ├── dapt/
│   │   ├── modernbert.yaml         # ModernBERT DAPT hyperparams
│   │   └── neobert.yaml            # NeoBERT DAPT hyperparams
│   ├── finetune/                   # Fine-tuning configs (per ablation)
│   └── decoder/                    # Qwen LoRA config
└── src/
    ├── common/
    │   └── config.py               # Typed dataclass configs, YAML loading
    ├── data/
    │   └── corpus.py               # DAPT corpus loading, tokenization, chunking
    ├── dapt/
    │   └── train.py                # DAPT + TAPT (same MLM objective, different data)
    ├── finetune/
    │   ├── model.py                # Dual-head classifier architecture
    │   └── train.py                # Classification fine-tuning
    ├── eval/
    │   └── metrics.py              # Macro-F1, MCC, Krippendorff's Alpha
    └── decoder/
        └── train.py                # Qwen LoRA experiment (Unsloth)

DAPT/TAPT usage:

# DAPT: full 10-K filings → SEC-ModernBERT-large
uv run main.py dapt --config configs/dapt/modernbert.yaml

# TAPT: continue from DAPT checkpoint on Item 1C paragraphs → SEC-cyBERT-large
uv run main.py dapt --config configs/dapt/modernbert.yaml \
  --model-path ../checkpoints/dapt/modernbert-large/final \
  --data-path ../data/paragraphs/paragraphs-clean.jsonl \
  --output-dir ../checkpoints/tapt/modernbert-large \
  --stage tapt

Config design: YAML files define all hyperparameters (reproducible, diffable). CLI flags override key fields (--model-path, --data-path, --output-dir, --stage) for TAPT or experimentation without duplicating config files.

Corpus preparation (prerequisite): Run bun run ts/scripts/dapt-corpus-prep.ts from repo root to generate data/dapt-corpus/ shards from cached HTML. This reuses the same stripHtml() from ts/src/extract/html-cleaner.ts that powers paragraph extraction.

Checkpoints: Saved to checkpoints/ (gitignored). Each stage writes to {output_dir}/final/ with the model and tokenizer.


4. Evaluation & Validation

4.1 Required Metrics

| Metric | Target | Notes |
|--------|--------|-------|
| Macro-F1 on holdout | > 0.80 for C, higher for A | Per-class and overall |
| Per-class F1 | Identify weak categories | Expect "None/Other" noisiest |
| Krippendorff's Alpha | > 0.67 adequate, > 0.75 good | GenAI vs human gold set |
| MCC | Report alongside F1 | More robust for imbalanced classes |
| Specificity MAE | Report for ordinal dimension | Mean absolute error on the 4-level scale |
| Calibration plots | Reliability diagrams | For softmax outputs |
| Robustness splits | By time, industry, filing size | FY2023 vs FY2024; GICS sector; word count quartiles |
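Specificity MAE and macro-F1 are simple enough to sketch without dependencies (the actual evaluation lives in src/eval/metrics.py):

```python
def specificity_mae(pred: list[int], gold: list[int]) -> float:
    """Mean absolute error on the ordinal specificity scale:
    confusing level 1 with level 4 costs more than 1 with 2."""
    assert len(pred) == len(gold) and pred
    return sum(abs(p - g) for p, g in zip(pred, gold)) / len(pred)


def macro_f1(pred, gold, labels) -> float:
    """Unweighted mean of per-class F1 (treats rare classes equally)."""
    f1s = []
    for c in labels:
        tp = sum(1 for p, g in zip(pred, gold) if p == c and g == c)
        fp = sum(1 for p, g in zip(pred, gold) if p == c and g != c)
        fn = sum(1 for p, g in zip(pred, gold) if p != c and g == c)
        prec = tp / (tp + fp) if tp + fp else 0.0
        rec = tp / (tp + fn) if tp + fn else 0.0
        f1s.append(2 * prec * rec / (prec + rec) if prec + rec else 0.0)
    return sum(f1s) / len(labels)
```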

4.2 Downstream Validity Tests

Test 1 — Breach Prediction (strongest): Do firms with lower specificity scores subsequently appear in breach databases?

Test 2 — Market Reaction (optional): Event study: abnormal returns around 8-K Item 1.05 filing. Does prior Item 1C quality predict reaction magnitude? Small sample (~55 incidents) but high signal.

Test 3 — Known-Groups Validity (easy, always include): Do regulated industries (NYDFS, HIPAA) produce higher-specificity disclosures? Do larger firms have more specific disclosures? Expected results that validate the measure.

Test 4 — Boilerplate Index (easy, always include): Cosine similarity of each company's Item 1C to industry-median disclosure. Specificity score should inversely correlate — independent, construct-free validation.
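A minimal bag-of-words version of the similarity behind the boilerplate index (a production index would likely use TF-IDF or embeddings; this just illustrates the construct):

```python
import math
from collections import Counter


def cosine(a: str, b: str) -> float:
    """Bag-of-words cosine similarity between two disclosure texts.

    The boilerplate index is each company's similarity to the
    industry-median disclosure; specificity should correlate inversely.
    """
    va, vb = Counter(a.lower().split()), Counter(b.lower().split())
    dot = sum(va[w] * vb[w] for w in va)
    na = math.sqrt(sum(v * v for v in va.values()))
    nb = math.sqrt(sum(v * v for v in vb.values()))
    return dot / (na * nb) if na and nb else 0.0
```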

4.3 External Benchmark

Per syllabus requirement:

  • Board Cybersecurity's 23-feature regex extraction — natural benchmark. Their binary feature coding is prior best practice. Our classifier captures everything their regex does plus quality/specificity.
  • Florackis et al. (2023) cyber risk measure — different section (1A vs 1C), different methodology, different era.

5. SEC Regulatory Context

The Rule: SEC Release 33-11216 (July 2023)

Item 1C (10-K Annual Disclosure) — Regulation S-K Item 106:

Item 106(b) — Risk Management and Strategy:

  1. Processes for assessing, identifying, and managing material cybersecurity risks
  2. Whether cybersecurity processes integrate into overall ERM
  3. Whether the company engages external assessors, consultants, or auditors
  4. Processes to oversee risks from third-party service providers
  5. Whether cybersecurity risks have materially affected business strategy, results, or financial condition

Item 106(c) — Governance:

  • Board oversight (106(c)(1)): oversight description, responsible committee, information processes
  • Management's role (106(c)(2)): responsible positions, expertise, monitoring processes, board reporting frequency

Item 1.05 (8-K Incident Disclosure):

  • Required within 4 business days of materiality determination
  • Material aspects of nature, scope, timing + material impact
  • No technical details that would impede response/remediation
  • AG can delay up to 120 days for national security

Key design note: The SEC uses "describe" — non-exclusive suggestions create natural variation in specificity and content. This is what makes the construct classifiable.

Compliance Timeline

| Date | Milestone |
|------|-----------|
| Jul 26, 2023 | Rule adopted |
| Dec 15, 2023 | Item 1C required in 10-Ks |
| Dec 18, 2023 | Item 1.05 required in 8-Ks |
| Jun 15, 2024 | Item 1.05 required for smaller reporting companies |
| Dec 15, 2024 | iXBRL tagging of Item 106 (CYD taxonomy) required |

iXBRL CYD Taxonomy

Published Sep 16, 2024. Starting Dec 15, 2024, Item 1C tagged in Inline XBRL with cyd prefix.


6. References

SEC Rule & Guidance

Law Firm Surveys & Analysis

Data Extraction Tools

Datasets

Models

Key Papers

Methodological Resources