sec-cyBERT
Classifier for SEC cybersecurity disclosure quality. Extracts Item 1C / Item 1.05 paragraphs from 10-K and 8-K filings, labels them along two dimensions (content category and specificity), and fine-tunes a ModernBERT-large model via domain-adaptive pre-training (DAPT), task-adaptive pre-training (TAPT), and supervised dual-head classification.
Three-stage labeling pipeline: synthetic expert panel (3 LLMs via OpenRouter) → judge resolution → human annotation with adjudication.
Quick start
# Clone and install
git clone <repo-url> sec-cyBERT && cd sec-cyBERT
bun install
# Pull data (no credentials needed, ~700 MB compressed download)
bun run data:pull
That gives you all extracted paragraphs, annotations, the DAPT corpus, benchmark results, and pilot experiments. See data/README.md for the full manifest.
Prerequisites
| Tool | Install |
|---|---|
| Bun ≥1.1 | curl -fsSL https://bun.sh/install \| bash |
| zstd ≥1.5 | apt install zstd / brew install zstd |
Additional prerequisites depending on what you're running:
| Tool | Needed for | Install |
|---|---|---|
| uv ≥0.5 | Training pipeline | curl -LsSf https://astral.sh/uv/install.sh \| sh |
| Docker ≥24 | Labelapp (Postgres) | Package manager or Docker Desktop |
| NVIDIA GPU + CUDA ≥13.0 | DAPT / TAPT / fine-tuning | — |
Project structure
sec-cyBERT/
├── packages/schemas/ # Shared Zod schemas (@sec-cybert/schemas)
├── ts/ # GenAI labeling pipeline (Vercel AI SDK, OpenRouter)
├── python/ # Training pipeline (HuggingFace Trainer, PyTorch)
│ └── configs/ # YAML training configs
├── labelapp/ # Next.js human labeling webapp
├── data/ # All data artifacts (DVC-managed, see data/README.md)
├── checkpoints/ # Model training checkpoints
├── scripts/ # Data packaging and utility scripts
└── docs/ # Project documentation
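The packages/schemas workspace centralizes the record shapes shared between the TypeScript and Python sides. As an illustrative sketch (the field names below are assumptions, and the real package defines these as Zod schemas rather than plain interfaces), a paragraph record might cover:

```typescript
// Illustrative shape of a record in data/paragraphs/ — field names are
// assumptions, not the actual @sec-cybert/schemas definitions.
interface FilingParagraph {
  accessionNumber: string; // EDGAR accession number
  form: "10-K" | "8-K";    // source filing type
  item: "1C" | "1.05";     // disclosure item the paragraph came from
  text: string;            // paragraph body
  filedAt: string;         // ISO date of the filing
}

// Minimal runtime guard (the real package derives validation from Zod).
function isFilingParagraph(x: unknown): x is FilingParagraph {
  if (typeof x !== "object" || x === null) return false;
  const p = x as Partial<FilingParagraph>;
  return (
    typeof p.accessionNumber === "string" &&
    (p.form === "10-K" || p.form === "8-K") &&
    (p.item === "1C" || p.item === "1.05") &&
    typeof p.text === "string" &&
    typeof p.filedAt === "string"
  );
}
```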
Pipeline
SEC EDGAR (14,759 filings)
│
▼
[1] Extract paragraphs ──→ data/paragraphs/ (72,045 paragraphs)
│
▼
[2] Quality audit + patch ──→ data/paragraphs/quality/, patches/
│
├──→ [3] Stage 1: 3-model annotation ──→ data/annotations/stage1.patched.jsonl
│ │
│ ▼
│ [4] Stage 2: judge resolution ──→ data/annotations/stage2/
│ │
│ ▼
│ [5] Human labeling ──→ data/gold/gold-labels.jsonl
│
├──→ [6] DAPT corpus prep ──→ data/dapt-corpus/ (1.06B tokens)
│ │
│ ▼
│ [7] DAPT ──→ checkpoints/dapt/
│ │
│ ▼
│ [8] TAPT ──→ checkpoints/tapt/
│
└──→ [9] Fine-tune dual-head classifier ──→ final model
Scripts
All commands run from repo root via bun run <script>.
Data extraction and labeling (ts:*)
bun run ts:sec extract:10k # Extract 10-K Item 1C paragraphs from EDGAR
bun run ts:sec extract:8k # Extract 8-K Item 1.05 disclosures
bun run ts:sec extract:merge # Merge + deduplicate
bun run ts:sec label:annotate-all # Stage 1: 3-model panel annotation (~$116)
bun run ts:sec label:consensus # Compute consensus from panel
bun run ts:sec label:judge # Stage 2: judge resolution
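Conceptually, label:consensus checks whether the three panel models agree before anything reaches the judge. A minimal majority-vote sketch (the function and type names are illustrative, not the actual pipeline API):

```typescript
// One label per panel model; a label wins on strict majority (2 of 3),
// otherwise the paragraph is escalated to Stage 2 judge resolution.
type PanelVotes = { paragraphId: string; labels: string[] };

function resolveConsensus(v: PanelVotes): { label: string } | { escalate: true } {
  const counts = new Map<string, number>();
  for (const l of v.labels) counts.set(l, (counts.get(l) ?? 0) + 1);
  for (const [label, n] of counts) {
    if (n * 2 > v.labels.length) return { label }; // strict majority
  }
  return { escalate: true }; // no majority → judge resolution
}
```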
Training (py:*)
cd python && uv sync --extra flash # Install Python deps + flash-attn (pre-built wheel, CUDA ≥13.0)
cd ..
bun run py:train dapt --config configs/dapt/modernbert.yaml # DAPT (~13.5h on RTX 3090)
bun run py:train tapt --config configs/tapt/modernbert.yaml # TAPT (~2h)
bun run py:train finetune --config configs/ft/modernbert.yaml # Fine-tune classifier
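Each stage is driven by a YAML config under python/configs/. As a rough sketch of the shape such a file might take (the keys below are assumptions based on common HuggingFace Trainer settings, not the contents of the actual configs/dapt/modernbert.yaml):

```yaml
# Illustrative DAPT config — keys are assumptions, not the real file.
model_name: answerdotai/ModernBERT-large
corpus_dir: data/dapt-corpus
output_dir: checkpoints/dapt
mlm_probability: 0.3          # masking rate for masked-LM pre-training
per_device_train_batch_size: 8
gradient_accumulation_steps: 8
learning_rate: 1.0e-4
bf16: true
```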
Data management (data:*)
bun run data:pull # Download from R2 + decompress (no auth needed)
bun run data:push # Compress + upload to R2 via DVC (needs R2 write keys)
bun run data:package # Build standalone .tar.zst archives for offline distribution
Data
Data is versioned with DVC and stored compressed (zstd-19) on Cloudflare R2. bun run data:pull fetches everything with no credentials required.
| Dataset | Records | Description |
|---|---|---|
| Paragraphs | 72,045 | Extracted SEC filing paragraphs with filing metadata |
| Stage 1 annotations | 150,009 | 3-model panel labels (category + specificity) |
| DAPT corpus | 14,756 docs | Full 10-K text for masked language model pre-training |
| Gold labels | (in progress) | Human-adjudicated ground truth (1,200 paragraphs) |
See data/README.md for schemas, row counts, and reproduction steps for every file.
Labelapp
The human labeling webapp lives in labelapp/. It requires Postgres (via Docker) and has its own setup:
docker compose up -d # Start Postgres
bun run la:db:migrate # Apply migrations
bun run la:seed # Seed paragraphs
bun run la:assign # Generate annotator assignments (BIBD)
bun run la:dev # Start dev server
bun run la:export # Export adjudicated gold labels
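la:assign distributes paragraphs so each one receives the same number of independent labels while annotator workloads stay even. A simplified cyclic sketch of that balance property (the real command uses a BIBD construction; names here are illustrative):

```typescript
// Cycle paragraphs through annotators so every paragraph gets exactly r
// distinct annotators (requires r <= annotators.length) and workloads
// differ by at most one assignment.
function assignBalanced(
  paragraphIds: string[],
  annotators: string[],
  r: number,
): Map<string, string[]> {
  const byAnnotator = new Map<string, string[]>(annotators.map((a) => [a, [] as string[]]));
  paragraphIds.forEach((pid, i) => {
    for (let k = 0; k < r; k++) {
      const annotator = annotators[(i * r + k) % annotators.length];
      byAnnotator.get(annotator)!.push(pid);
    }
  });
  return byAnnotator;
}
```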
See labelapp/AGENTS.md for labelapp-specific development notes.
Environment variables
Copy .env.example to .env and fill in the values you need:
| Variable | Needed for |
|---|---|
| OPENROUTER_API_KEY | GenAI labeling pipeline (extraction is free) |
| R2_ACCESS_KEY_ID / R2_SECRET_ACCESS_KEY | Pushing data to DVC (pulling is anonymous) |
| DATABASE_URL | Labelapp only (defaults to local Postgres) |