# sec-cyBERT

Classifier for SEC cybersecurity disclosure quality. Extracts Item 1C / Item 1.05 paragraphs from 10-K and 8-K filings, labels them along two dimensions (content category and specificity), and fine-tunes a ModernBERT-large model via domain-adaptive pre-training (DAPT), task-adaptive pre-training (TAPT), and supervised dual-head classification.

Three-stage labeling pipeline: synthetic expert panel (3 LLMs via OpenRouter) → judge resolution → human annotation with adjudication.

## Quick start

```bash
# Clone and install
git clone sec-cyBERT && cd sec-cyBERT
bun install

# Pull data (no credentials needed, ~700 MB compressed download)
bun run data:pull
```

That gives you all extracted paragraphs, annotations, the DAPT corpus, benchmark results, and pilot experiments. See [`data/README.md`](data/README.md) for the full manifest.

### Prerequisites

| Tool | Install |
|------|---------|
| [Bun](https://bun.sh) ≥1.1 | `curl -fsSL https://bun.sh/install \| bash` |
| [zstd](https://github.com/facebook/zstd) ≥1.5 | `apt install zstd` / `brew install zstd` |

Additional prerequisites depending on what you're running:

| Tool | Needed for | Install |
|------|-----------|---------|
| [uv](https://docs.astral.sh/uv/) ≥0.5 | Training pipeline | `curl -LsSf https://astral.sh/uv/install.sh \| sh` |
| [Docker](https://docs.docker.com/get-docker/) ≥24 | Labelapp (Postgres) | Package manager or Docker Desktop |
| NVIDIA GPU + CUDA ≥13.0 | DAPT / TAPT / fine-tuning | — |

## Project structure

```
sec-cyBERT/
├── packages/schemas/   # Shared Zod schemas (@sec-cybert/schemas)
├── ts/                 # GenAI labeling pipeline (Vercel AI SDK, OpenRouter)
├── python/             # Training pipeline (HuggingFace Trainer, PyTorch)
│   └── configs/        # YAML training configs
├── labelapp/           # Next.js human labeling webapp
├── data/               # All data artifacts (DVC-managed, see data/README.md)
├── checkpoints/        # Model training checkpoints
├── scripts/            # Data packaging and utility scripts
└── docs/               # Project documentation
```

## Pipeline
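The labeling pipeline's stage 2 hinges on resolving disagreement among the three panel models. The sketch below shows one plausible resolution rule — simple majority vote, escalating three-way splits to the judge model. This is an illustrative assumption, not the repo's actual rule (which lives in `ts/`), and the `Specificity` label set is hypothetical; the real categories are defined as Zod schemas in `packages/schemas/`.

```typescript
// Hypothetical label set for one head; the real categories live in
// packages/schemas/ (as Zod schemas) and may differ.
type Specificity = "boilerplate" | "generic" | "specific";

type Resolution<L> =
  | { kind: "unanimous" | "majority"; label: L }
  | { kind: "escalate" }; // three-way split → judge model decides

// Resolve one label dimension from the three panel votes.
function resolvePanel<L>(votes: [L, L, L]): Resolution<L> {
  const counts = new Map<L, number>();
  for (const v of votes) counts.set(v, (counts.get(v) ?? 0) + 1);
  for (const [label, n] of counts) {
    if (n >= 2) return { kind: n === 3 ? "unanimous" : "majority", label };
  }
  return { kind: "escalate" };
}

// Example: two of three models agree → majority, no judge call needed.
const r = resolvePanel<Specificity>(["specific", "generic", "specific"]);
// r.kind === "majority", r.label === "specific"
```

In this framing each paragraph would carry two such resolutions, one per classification head (content category and specificity), so a judge call can be triggered by a split on either dimension independently.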
```
SEC EDGAR (14,759 filings)
  │
  ▼
[1] Extract paragraphs ──→ data/paragraphs/ (72,045 paragraphs)
  │
  ▼
[2] Quality audit + patch ──→ data/paragraphs/quality/, patches/
  │
  ├──→ [3] Stage 1: 3-model annotation ──→ data/annotations/stage1.patched.jsonl
  │            │
  │            ▼
  │     [4] Stage 2: judge resolution ──→ data/annotations/stage2/
  │            │
  │            ▼
  │     [5] Human labeling ──→ data/gold/gold-labels.jsonl
  │
  ├──→ [6] DAPT corpus prep ──→ data/dapt-corpus/ (1.06B tokens)
  │            │
  │            ▼
  │     [7] DAPT ──→ checkpoints/dapt/
  │            │
  │            ▼
  │     [8] TAPT ──→ checkpoints/tapt/
  │
  └──→ [9] Fine-tune dual-head classifier ──→ final model
```

## Scripts

All commands run from repo root via `bun run