sec-cyBERT

Classifier for SEC cybersecurity disclosure quality. It extracts Item 1C / Item 1.05 paragraphs from 10-K and 8-K filings, labels them along two dimensions (content category and specificity), and fine-tunes ModernBERT-large through domain-adaptive pre-training (DAPT), task-adaptive pre-training (TAPT), and supervised fine-tuning with a dual classification head (one head per label dimension).
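
The dual-head idea can be illustrated in plain Python: one shared encoder output feeds two independent linear heads, one per label dimension. The label names, toy embedding, and weights below are placeholders for illustration, not the project's real taxonomy or model.

```python
# Sketch of dual-head classification: one shared encoding, two
# independent linear heads (content category, specificity).
# Label sets here are illustrative placeholders only.

def linear(x, weights, bias):
    """One linear head: logits[j] = sum_i x[i] * weights[j][i] + bias[j]."""
    return [sum(xi * wj for xi, wj in zip(x, row)) + b
            for row, b in zip(weights, bias)]

def argmax(logits):
    return max(range(len(logits)), key=lambda i: logits[i])

CATEGORIES = ["risk-management", "governance", "incident"]  # placeholder labels
SPECIFICITY = ["boilerplate", "specific"]                   # placeholder labels

def classify(embedding, cat_head, spec_head):
    """Run both heads over the same encoder output and decode each."""
    cat_logits = linear(embedding, *cat_head)
    spec_logits = linear(embedding, *spec_head)
    return CATEGORIES[argmax(cat_logits)], SPECIFICITY[argmax(spec_logits)]

# Toy 2-d "embedding" and hand-picked weights, just to exercise the path.
cat_head = ([[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]], [0.0, 0.0, 0.0])
spec_head = ([[1.0, -1.0], [-1.0, 1.0]], [0.0, 0.0])
print(classify([0.2, 0.9], cat_head, spec_head))  # → ('governance', 'specific')
```

The point of the shared encoder is that both label dimensions are predicted from the same representation; only the final projections differ.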

Three-stage labeling pipeline: synthetic expert panel (3 LLMs via OpenRouter) → judge resolution → human annotation with adjudication.
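
The panel-to-judge handoff amounts to a majority vote with escalation: unanimous or 2-of-3 agreement resolves a label, a three-way split goes to the judge. A minimal sketch (the real resolution logic lives in the ts/ pipeline; the return shape here is illustrative):

```python
from collections import Counter

def panel_consensus(labels):
    """Resolve a 3-model panel vote.

    Unanimous or 2-of-3 majority wins; a full 3-way split is flagged
    for escalation to the judge model (Stage 2).
    """
    counts = Counter(labels)
    label, votes = counts.most_common(1)[0]
    if votes >= 2:
        return {"label": label, "votes": votes, "needs_judge": False}
    return {"label": None, "votes": votes, "needs_judge": True}
```

Stage 3 (human annotation with adjudication) then produces the gold labels independently of this automated vote.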

Quick start

# Clone and install
git clone <repo-url> sec-cyBERT && cd sec-cyBERT
bun install

# Pull data (no credentials needed, ~700 MB compressed download)
bun run data:pull

That gives you all extracted paragraphs, annotations, the DAPT corpus, benchmark results, and pilot experiments. See data/README.md for the full manifest.

Prerequisites

Tool        Install
Bun ≥1.1    curl -fsSL https://bun.sh/install | bash
zstd ≥1.5   apt install zstd / brew install zstd

Additional prerequisites depending on what you're running:

Tool                      Needed for                   Install
uv ≥0.5                   Training pipeline            curl -LsSf https://astral.sh/uv/install.sh | sh
Docker ≥24                Labelapp (Postgres)          Package manager or Docker Desktop
NVIDIA GPU + CUDA ≥13.0   DAPT / TAPT / fine-tuning    —

Project structure

sec-cyBERT/
├── packages/schemas/     # Shared Zod schemas (@sec-cybert/schemas)
├── ts/                   # GenAI labeling pipeline (Vercel AI SDK, OpenRouter)
├── python/               # Training pipeline (HuggingFace Trainer, PyTorch)
│   └── configs/          # YAML training configs
├── labelapp/             # Next.js human labeling webapp
├── data/                 # All data artifacts (DVC-managed, see data/README.md)
├── checkpoints/          # Model training checkpoints
├── scripts/              # Data packaging and utility scripts
└── docs/                 # Project documentation

Pipeline

SEC EDGAR (14,759 filings)
    │
    ▼
[1] Extract paragraphs ──→ data/paragraphs/ (72,045 paragraphs)
    │
    ▼
[2] Quality audit + patch ──→ data/paragraphs/quality/, patches/
    │
    ├──→ [3] Stage 1: 3-model annotation ──→ data/annotations/stage1.patched.jsonl
    │         │
    │         ▼
    │    [4] Stage 2: judge resolution ──→ data/annotations/stage2/
    │         │
    │         ▼
    │    [5] Human labeling ──→ data/gold/gold-labels.jsonl
    │
    ├──→ [6] DAPT corpus prep ──→ data/dapt-corpus/ (1.06B tokens)
    │         │
    │         ▼
    │    [7] DAPT ──→ checkpoints/dapt/
    │         │
    │         ▼
    │    [8] TAPT ──→ checkpoints/tapt/
    │
    └──→ [9] Fine-tune dual-head classifier ──→ final model

Scripts

All commands run from the repo root via bun run <script>.

Data extraction and labeling (ts:*)

bun run ts:sec extract:10k    # Extract 10-K Item 1C paragraphs from EDGAR
bun run ts:sec extract:8k     # Extract 8-K Item 1.05 disclosures
bun run ts:sec extract:merge  # Merge + deduplicate
bun run ts:sec label:annotate-all  # Stage 1: 3-model panel annotation (~$116)
bun run ts:sec label:consensus     # Compute consensus from panel
bun run ts:sec label:judge         # Stage 2: judge resolution
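
The merge step's deduplication can be sketched as hashing over normalized text, so trivially reformatted copies of the same paragraph collapse to one record. The actual merge logic lives in the ts/ pipeline, and the record field names here are assumptions:

```python
import hashlib

def normalize(text):
    """Collapse whitespace and lowercase so trivially reformatted
    duplicates hash to the same key."""
    return " ".join(text.lower().split())

def dedup(paragraphs):
    """Keep the first occurrence of each normalized paragraph text.

    `paragraphs` is assumed to be a list of dicts with a "text" field.
    """
    seen, unique = set(), []
    for p in paragraphs:
        key = hashlib.sha256(normalize(p["text"]).encode()).hexdigest()
        if key not in seen:
            seen.add(key)
            unique.append(p)
    return unique
```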

Training (py:*)

cd python && uv sync --extra flash  # Install Python deps + flash-attn (pre-built wheel, CUDA ≥13.0)
cd ..
bun run py:train dapt --config configs/dapt/modernbert.yaml      # DAPT (~13.5h on RTX 3090)
bun run py:train tapt --config configs/tapt/modernbert.yaml      # TAPT (~2h)
bun run py:train finetune --config configs/ft/modernbert.yaml    # Fine-tune classifier
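
Both DAPT and TAPT are masked-language-model pre-training over domain text; the pipeline would do this through HuggingFace's trainer and collators, but the underlying corruption scheme is the standard one (of each selected position: 80% replaced with [MASK], 10% with a random token, 10% left unchanged). A pure-Python sketch of that scheme:

```python
import random

MASK = "[MASK]"

def mask_tokens(tokens, mask_prob=0.15, seed=None):
    """Standard MLM corruption sketch.

    Each token is selected with probability `mask_prob`; selected tokens
    become [MASK] 80% of the time, a random vocab token 10%, and stay
    unchanged 10%. Returns (corrupted, labels), where labels hold the
    original token at selected positions and None elsewhere.
    """
    rng = random.Random(seed)
    vocab = sorted(set(tokens))
    corrupted, labels = [], []
    for tok in tokens:
        if rng.random() < mask_prob:
            labels.append(tok)  # model must reconstruct this token
            r = rng.random()
            if r < 0.8:
                corrupted.append(MASK)
            elif r < 0.9:
                corrupted.append(rng.choice(vocab))
            else:
                corrupted.append(tok)
        else:
            labels.append(None)  # position excluded from the loss
            corrupted.append(tok)
    return corrupted, labels
```

DAPT runs this objective over the full 10-K corpus; TAPT continues it on text distributed like the classification task's inputs.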

Data management (data:*)

bun run data:pull       # Download from R2 + decompress (no auth needed)
bun run data:push       # Compress + upload to R2 via DVC (needs R2 write keys)
bun run data:package    # Build standalone .tar.zst archives for offline distribution

Data

Data is versioned with DVC and stored compressed (zstd-19) on Cloudflare R2. bun run data:pull fetches everything with no credentials required.

Dataset              Records        Description
Paragraphs           72,045         Extracted SEC filing paragraphs with filing metadata
Stage 1 annotations  150,009        3-model panel labels (category + specificity)
DAPT corpus          14,756 docs    Full 10-K text for masked-language-model pre-training
Gold labels          (in progress)  Human-adjudicated ground truth (1,200 paragraphs)

See data/README.md for schemas, row counts, and reproduction steps for every file.
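
One quick sanity check after pulling the data is to measure how often the Stage 1 panel agrees. The record layout assumed below (a JSONL line per paragraph with an `annotations` array carrying a `category` field) is hypothetical; see data/README.md for the real schemas.

```python
import json

def agreement_rate(jsonl_lines, field="category"):
    """Fraction of records where all panel annotations agree on `field`.

    Field names are assumptions for illustration, not the real schema.
    """
    agree = total = 0
    for line in jsonl_lines:
        rec = json.loads(line)
        labels = [a[field] for a in rec["annotations"]]
        total += 1
        agree += len(set(labels)) == 1
    return agree / total if total else 0.0
```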

Labelapp

The human labeling webapp lives in labelapp/. It requires Postgres (via Docker) and has its own setup:

docker compose up -d       # Start Postgres
bun run la:db:migrate      # Apply migrations
bun run la:seed            # Seed paragraphs
bun run la:assign          # Generate annotator assignments (BIBD)
bun run la:dev             # Start dev server
bun run la:export          # Export adjudicated gold labels
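
la:assign distributes paragraphs via a balanced incomplete block design (BIBD). A much simpler round-robin sketch that shows only the balance property, not a true BIBD:

```python
from itertools import cycle

def assign(paragraph_ids, annotators, per_item=3):
    """Give each paragraph `per_item` distinct annotators, keeping
    per-annotator workloads within one item of each other.

    Round-robin illustration only; the real la:assign uses a BIBD,
    which additionally balances how often annotator pairs co-occur.
    Assumes per_item <= len(annotators).
    """
    ring = cycle(annotators)
    assignments = {a: [] for a in annotators}
    for pid in paragraph_ids:
        chosen = set()
        while len(chosen) < per_item:
            chosen.add(next(ring))
        for a in chosen:
            assignments[a].append(pid)
    return assignments
```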

See labelapp/AGENTS.md for labelapp-specific development notes.

Environment variables

Copy .env.example to .env and fill in the values you need:

Variable                                  Needed for
OPENROUTER_API_KEY                        GenAI labeling pipeline (extraction is free)
R2_ACCESS_KEY_ID / R2_SECRET_ACCESS_KEY   Pushing data to DVC (pulling is anonymous)
DATABASE_URL                              Labelapp only (defaults to local Postgres)