# sec-cyBERT

Classifier for SEC cybersecurity disclosure quality. Extracts Item 1C paragraphs from 10-K filings and Item 1.05 disclosures from 8-K filings, labels them along two dimensions (content category and specificity), and fine-tunes a ModernBERT-large model via domain-adaptive pre-training (DAPT), task-adaptive pre-training (TAPT), and supervised dual-head classification.

Labeling is a three-stage pipeline: synthetic expert panel (3 LLMs via OpenRouter) → judge resolution → human annotation with adjudication.

## Quick start

```bash
# Clone and install
git clone https://git.claiborne.soy/joey/SEC-cyBERT.git sec-cyBERT && cd sec-cyBERT
bun install

# Pull data (~750 MB compressed via Git LFS, ~20 GB with checkpoints)
bun run data:pull
```

That gives you all extracted paragraphs, annotations, the DAPT corpus, benchmark results, and pilot experiments. See [`data/README.md`](data/README.md) for the full manifest.

### Prerequisites

| Tool | Install |
|------|---------|
| [Bun](https://bun.sh) ≥1.1 | `curl -fsSL https://bun.sh/install \| bash` |
| [Git LFS](https://git-lfs.com) ≥3.0 | `apt install git-lfs` / `brew install git-lfs` |
| [zstd](https://github.com/facebook/zstd) ≥1.5 | `apt install zstd` / `brew install zstd` |

Additional prerequisites depending on what you're running:

| Tool | Needed for | Install |
|------|-----------|---------|
| [uv](https://docs.astral.sh/uv/) ≥0.5 | Training pipeline | `curl -LsSf https://astral.sh/uv/install.sh \| sh` |
| [Docker](https://docs.docker.com/get-docker/) ≥24 | Labelapp (Postgres) | Package manager or Docker Desktop |
| NVIDIA GPU + CUDA ≥13.0 | DAPT / TAPT / fine-tuning | — |

## Project structure

```
sec-cyBERT/
├── packages/schemas/   # Shared Zod schemas (@sec-cybert/schemas)
├── ts/                 # GenAI labeling pipeline (Vercel AI SDK, OpenRouter)
├── python/             # Training pipeline (HuggingFace Trainer, PyTorch)
│   └── configs/        # YAML training configs
├── labelapp/           # Next.js human labeling webapp
├── data/               # All data artifacts (gitignored working copies)
├── checkpoints/        # Model training checkpoints (gitignored working copies)
├── .lfs-store/         # Compressed data + checkpoints (Git LFS-tracked)
├── scripts/            # Data packaging and utility scripts
└── docs/               # Project documentation
```

## Pipeline

```
SEC EDGAR (14,759 filings)
    │
    ▼
[1] Extract paragraphs ──→ data/paragraphs/ (72,045 paragraphs)
    │
    ▼
[2] Quality audit + patch ──→ data/paragraphs/quality/, patches/
    │
    ├──→ [3] Stage 1: 3-model annotation ──→ data/annotations/stage1.patched.jsonl
    │         │
    │         ▼
    │    [4] Stage 2: judge resolution ──→ data/annotations/stage2/
    │         │
    │         ▼
    │    [5] Human labeling ──→ data/gold/gold-labels.jsonl
    │
    ├──→ [6] DAPT corpus prep ──→ data/dapt-corpus/ (1.06B tokens)
    │         │
    │         ▼
    │    [7] DAPT ──→ checkpoints/dapt/
    │         │
    │         ▼
    │    [8] TAPT ──→ checkpoints/tapt/
    │
    └──→ [9] Fine-tune dual-head classifier ──→ final model
```

## Scripts

All commands run from repo root via `bun run <script>`.

### Data extraction and labeling (`ts:*`)

```bash
bun run ts:sec extract:10k          # Extract 10-K Item 1C paragraphs from EDGAR
bun run ts:sec extract:8k           # Extract 8-K Item 1.05 disclosures
bun run ts:sec extract:merge        # Merge + deduplicate
bun run ts:sec label:annotate-all   # Stage 1: 3-model panel annotation (~$116)
bun run ts:sec label:consensus      # Compute consensus from panel
bun run ts:sec label:judge          # Stage 2: judge resolution
```

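
As a rough illustration of what a consensus step over a 3-model panel can look like, the sketch below takes a strict majority vote per dimension and flags paragraphs without a majority for the Stage 2 judge. The record fields (`paragraph_id`, `model`, `category`, `specificity`), the label values, and the majority rule itself are assumptions made for this example; the actual logic lives in `ts/` and may differ.

```python
from collections import Counter, defaultdict


def majority(labels):
    """Return the label held by a strict majority, or None if there is none."""
    label, count = Counter(labels).most_common(1)[0]
    return label if count > len(labels) / 2 else None


def consensus(annotations):
    """Group panel annotations by paragraph and vote on each dimension."""
    by_paragraph = defaultdict(list)
    for ann in annotations:
        by_paragraph[ann["paragraph_id"]].append(ann)

    results = {}
    for pid, anns in by_paragraph.items():
        category = majority([a["category"] for a in anns])
        specificity = majority([a["specificity"] for a in anns])
        results[pid] = {
            "category": category,
            "specificity": specificity,
            # Hypothetical rule: any dimension without a majority goes to the judge.
            "needs_judge": category is None or specificity is None,
        }
    return results


# Hypothetical panel output: three models, two paragraphs.
panel = [
    {"paragraph_id": "p1", "model": "model-a", "category": "governance", "specificity": 2},
    {"paragraph_id": "p1", "model": "model-b", "category": "governance", "specificity": 2},
    {"paragraph_id": "p1", "model": "model-c", "category": "risk", "specificity": 3},
    {"paragraph_id": "p2", "model": "model-a", "category": "governance", "specificity": 1},
    {"paragraph_id": "p2", "model": "model-b", "category": "risk", "specificity": 1},
    {"paragraph_id": "p2", "model": "model-c", "category": "incident", "specificity": 1},
]

resolved = consensus(panel)
for pid, result in resolved.items():
    print(pid, result)
```

Here `p1` resolves on both dimensions, while `p2` has a three-way split on category and would be escalated.
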
### Training (`py:*`)

```bash
cd python && uv sync --extra flash                              # Install Python deps + flash-attn (pre-built wheel, CUDA ≥13.0)
cd ..
bun run py:train dapt --config configs/dapt/modernbert.yaml     # DAPT (~13.5h on RTX 3090)
bun run py:train tapt --config configs/tapt/modernbert.yaml     # TAPT (~2h)
bun run py:train finetune --config configs/ft/modernbert.yaml   # Fine-tune classifier
```

### Data & checkpoint management

```bash
bun run data:pull      # Pull data from LFS + decompress
bun run data:push      # Compress changed data + push via LFS
bun run ckpt:pull      # Pull checkpoints from LFS + decompress
bun run ckpt:push      # Compress changed checkpoints + push via LFS
bun run sync:pull      # Pull everything (data + checkpoints)
bun run sync:push      # Push everything (data + checkpoints)
bun run data:package   # Build standalone .tar.zst archives for offline distribution
```

All push/pull scripts support `--dry-run` and use per-file xxh3 hashing to skip unchanged files.

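
The idea behind hash-based skipping can be sketched in a few lines: hash each file, compare against a stored manifest, and only process files whose digest changed. This is an illustration under stated assumptions, not the repo's implementation. It uses BLAKE2b from the Python standard library as a stand-in for xxh3 (which needs a third-party package), and the manifest layout (relative path mapped to hex digest) is hypothetical.

```python
import hashlib
import tempfile
from pathlib import Path


def file_digest(path, chunk_size=1 << 20):
    """Hash a file in chunks. BLAKE2b stands in for xxh3 here (assumption)."""
    h = hashlib.blake2b(digest_size=16)
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()


def changed_files(root, manifest):
    """Return relative paths whose digest is missing from or differs from the manifest."""
    root = Path(root)
    changed = []
    for path in sorted(root.rglob("*")):
        if path.is_file():
            rel = path.relative_to(root).as_posix()
            if manifest.get(rel) != file_digest(path):
                changed.append(rel)
    return changed


# Demo in a temporary directory so the sketch runs without the real data.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "a.jsonl").write_text("first\n")
    (root / "b.jsonl").write_text("second\n")

    # Hypothetical manifest: relative path -> digest at the time of the last push.
    manifest = {p.name: file_digest(p) for p in root.iterdir()}
    before_edit = changed_files(root, manifest)

    (root / "b.jsonl").write_text("second, edited\n")
    after_edit = changed_files(root, manifest)

print(before_edit, after_edit)
```

Only `b.jsonl` is reported after the edit, so a push built on this check would recompress one file instead of two.
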
## Data

Data is stored compressed (zstd-19) in `.lfs-store/` and tracked via [Git LFS](https://git-lfs.com) on Gitea. `bun run data:pull` decompresses from the LFS store into `data/`.

| Dataset | Records | Description |
|---------|---------|-------------|
| Paragraphs | 72,045 | Extracted SEC filing paragraphs with filing metadata |
| Stage 1 annotations | 150,009 | 3-model panel labels (category + specificity) |
| DAPT corpus | 14,756 docs | Full 10-K text for masked language model pre-training |
| Gold labels | *(in progress)* | Human-adjudicated ground truth (1,200 paragraphs) |

See [`data/README.md`](data/README.md) for schemas, row counts, and reproduction steps for every file.

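
For a quick sanity check of a JSONL artifact after pulling, something along these lines may be enough. It is a minimal sketch: the field names (`paragraph_id`, `category`, `specificity`) and label values are placeholders, so check `data/README.md` for the real schemas. The sample is read from an in-memory string so the snippet runs without the data.

```python
import io
import json
from collections import Counter


def read_jsonl(stream):
    """Yield one dict per non-empty line of a JSONL stream."""
    for line in stream:
        line = line.strip()
        if line:
            yield json.loads(line)


def tally(records, field):
    """Count how often each value of `field` occurs."""
    return Counter(record[field] for record in records)


# Hypothetical records; real field names may differ.
sample = io.StringIO(
    '{"paragraph_id": "p1", "category": "governance", "specificity": 2}\n'
    '{"paragraph_id": "p2", "category": "risk", "specificity": 1}\n'
    "\n"
    '{"paragraph_id": "p3", "category": "governance", "specificity": 3}\n'
)

records = list(read_jsonl(sample))
category_counts = tally(records, "category")
print(len(records), dict(category_counts))
```

To run it against a real file, replace the `io.StringIO` sample with an opened file handle, for example `open("data/annotations/stage1.patched.jsonl", encoding="utf-8")`.
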
## Labelapp

The human labeling webapp lives in `labelapp/`. It requires Postgres (via Docker) and has its own setup:

```bash
docker compose up -d    # Start Postgres
bun run la:db:migrate   # Apply migrations
bun run la:seed         # Seed paragraphs
bun run la:assign       # Generate annotator assignments (BIBD)
bun run la:dev          # Start dev server
bun run la:export       # Export adjudicated gold labels
```

See [`labelapp/AGENTS.md`](labelapp/AGENTS.md) for labelapp-specific development notes.

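
For readers unfamiliar with balanced incomplete block designs (BIBD), the property that matters for annotation is that every annotator carries the same load and every pair of annotators overlaps on the same number of paragraphs. The sketch below illustrates this with the simplest possible construction, all 3-person subsets of 6 annotators. The annotator count, block size, and round-robin assignment are assumptions for the example; `la:assign` may use a different design.

```python
from collections import Counter
from itertools import combinations


def complete_design(annotators, block_size):
    """All block_size-subsets of the annotators; this is trivially a BIBD."""
    return list(combinations(annotators, block_size))


def assign(paragraph_ids, blocks):
    """Cycle through the blocks; shares are equal when len(blocks) divides the total."""
    return {pid: blocks[i % len(blocks)] for i, pid in enumerate(paragraph_ids)}


def pair_counts(assignment):
    """Count how many paragraphs each pair of annotators labels together."""
    counts = Counter()
    for block in assignment.values():
        counts.update(combinations(sorted(block), 2))
    return counts


annotators = ["A", "B", "C", "D", "E", "F"]   # hypothetical annotator pool
blocks = complete_design(annotators, 3)       # 20 blocks of 3 annotators
paragraphs = [f"p{i}" for i in range(1200)]   # gold set size from the Data table

assignment = assign(paragraphs, blocks)
load = Counter(a for block in assignment.values() for a in block)
pairs = pair_counts(assignment)

print(len(blocks), set(load.values()), set(pairs.values()))
```

With these numbers each annotator labels 600 paragraphs and each of the 15 annotator pairs shares 240, which is what makes pairwise agreement statistics comparable across annotators.
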
## Environment variables

Copy `.env.example` to `.env` and fill in the values you need:

| Variable | Needed for |
|----------|-----------|
| `OPENROUTER_API_KEY` | GenAI labeling pipeline (extraction is free) |
| `DATABASE_URL` | Labelapp only (defaults to local Postgres) |