# SEC-cyBERT Implementation Plan

## Context

Building an SEC cybersecurity disclosure quality classifier for the BUSI488/COMP488 capstone, following the Ringel (2023) "Synthetic Experts" pipeline: frontier LLMs label ~50K paragraphs, which are then distilled into a small encoder model. Labels cover two dimensions: content category (7-class) and specificity (4-point ordinal). The GPU is offline for two days, so all data/labeling/eval infrastructure is GPU-free and should be built now.

---
## Tech Stack

| Layer | Tool | Notes |
|-------|------|-------|
| Data/labeling pipeline | TypeScript, Vercel AI SDK 6.0.108, `@openrouter/ai-sdk-provider`, Zod | `generateObject` with Zod schemas for structured output |
| Stage 1 annotators | gpt-oss-120b, mimo-v2-flash, grok-4.1-fast | Via OpenRouter |
| Stage 2 judge | Claude Sonnet 4.6 | Via OpenRouter, called only on disagreements |
| Encoder training | HuggingFace Trainer, Python scripts | ModernBERT-large, NeoBERT, DeBERTa-v3-large |
| DAPT | HuggingFace Trainer + DataCollatorForLanguageModeling | Continued MLM on SEC filings |
| Decoder experiment | Unsloth (not Axolotl, which is decoder-only and slower) | Qwen3.5 LoRA |
| HP search | Autoresearch-style `program.md` directives | Agent edits YAML, trains for a fixed budget, evaluates, keeps/discards |
| Runtime | bun (TS), uv (Python) | |

---
## Project Structure

```
sec-cyBERT/
├── docs/
│   ├── PROJECT-OVERVIEW.md
│   ├── LABELING-CODEBOOK.md
│   └── TECHNICAL-GUIDE.md
│
├── ts/                              # TypeScript: data pipeline, labeling, eval
│   ├── package.json
│   ├── tsconfig.json
│   ├── src/
│   │   ├── schemas/                 # Zod schemas (single source of truth)
│   │   │   ├── filing.ts
│   │   │   ├── paragraph.ts
│   │   │   ├── label.ts             # LabelOutput — passed to generateObject
│   │   │   ├── annotation.ts        # Label + provenance (model, cost, latency)
│   │   │   ├── consensus.ts         # Multi-model agreement result
│   │   │   ├── gold.ts              # Human-labeled holdout entry
│   │   │   ├── benchmark.ts         # Model performance metrics
│   │   │   ├── experiment.ts        # Autoresearch training tracker
│   │   │   └── index.ts
│   │   ├── extract/                 # Phase 1: EDGAR extraction
│   │   │   ├── download-10k.ts
│   │   │   ├── parse-item1c.ts
│   │   │   ├── parse-8k.ts
│   │   │   ├── segment.ts
│   │   │   └── metadata.ts
│   │   ├── label/                   # Phase 2: GenAI labeling
│   │   │   ├── annotate.ts          # generateObject + OpenRouter per paragraph
│   │   │   ├── batch.ts             # Concurrency control + JSONL checkpointing
│   │   │   ├── consensus.ts         # Stage 1 majority vote logic
│   │   │   ├── judge.ts             # Stage 2 tiebreaker (Sonnet 4.6)
│   │   │   ├── prompts.ts           # System/user prompt builders
│   │   │   └── cost.ts              # Cost tracking aggregation
│   │   ├── gold/                    # Phase 3: Gold set
│   │   │   ├── sample.ts            # Stratified sampling
│   │   │   ├── human-label.ts       # Human label import
│   │   │   └── agreement.ts         # Krippendorff's alpha, Cohen's kappa
│   │   ├── benchmark/               # Phase 4: GenAI benchmarking
│   │   │   ├── run.ts
│   │   │   └── metrics.ts           # F1, AUC, MCC computation
│   │   ├── lib/                     # Shared utilities
│   │   │   ├── openrouter.ts        # Singleton + model registry with pricing
│   │   │   ├── jsonl.ts             # Read/write/append JSONL
│   │   │   ├── checkpoint.ts        # Resume from last completed ID
│   │   │   └── retry.ts             # Exponential backoff
│   │   └── cli.ts                   # CLI entry point
│   └── tests/
│
├── python/                          # Python: training, DAPT, inference
│   ├── pyproject.toml
│   ├── configs/
│   │   ├── dapt/modernbert-large.yaml
│   │   ├── finetune/
│   │   │   ├── modernbert-large.yaml
│   │   │   ├── neobert.yaml
│   │   │   └── deberta-v3-large.yaml
│   │   └── decoder/qwen3.5-lora.yaml
│   ├── src/
│   │   ├── dapt/train_mlm.py
│   │   ├── finetune/
│   │   │   ├── model.py             # Multi-head classifier (shared backbone)
│   │   │   ├── train.py             # HF Trainer script with --time-budget
│   │   │   ├── data.py
│   │   │   ├── losses.py            # SCL + ordinal + multi-head balancing
│   │   │   └── trainer.py           # Custom Trainer subclass
│   │   ├── decoder/train_lora.py    # Unsloth
│   │   └── eval/
│   │       ├── predict.py
│   │       ├── metrics.py
│   │       └── error_analysis.py
│   └── program.md                   # Autoresearch agent directive
│
├── data/                            # Gitignored heavy files
│   ├── raw/{10k,8k}/
│   ├── extracted/{item1c,item105}/
│   ├── paragraphs/paragraphs.jsonl
│   ├── annotations/
│   │   ├── stage1/{model}.jsonl
│   │   ├── stage2/judge.jsonl
│   │   └── consensus.jsonl
│   ├── gold/
│   │   ├── gold-sample.jsonl
│   │   ├── human-labels/annotator-{1,2,3}.jsonl
│   │   └── gold-adjudicated.jsonl
│   ├── benchmark/runs/{model}.jsonl
│   ├── splits/{train,val,test}.jsonl
│   └── dapt-corpus/sec-texts.jsonl
│
├── models/                          # Gitignored checkpoints
├── results/
│   ├── experiments.tsv              # Autoresearch log
│   └── figures/
└── .gitignore
```

---
## Core Schemas (Zod)

**`label.ts`** — the contract passed to `generateObject`:

```typescript
export const ContentCategory = z.enum([
  "Board Governance", "Management Role", "Risk Management Process",
  "Third-Party Risk", "Incident Disclosure", "Strategy Integration", "None/Other",
]);

export const SpecificityLevel = z.union([z.literal(1), z.literal(2), z.literal(3), z.literal(4)]);

export const LabelOutput = z.object({
  content_category: ContentCategory,
  specificity_level: SpecificityLevel,
  reasoning: z.string().max(500),
});
```

**`annotation.ts`** — label + full provenance:

```typescript
export const Annotation = z.object({
  paragraphId: z.string().uuid(),
  label: LabelOutput,
  provenance: z.object({
    modelId: z.string(),
    provider: z.string(),
    stage: z.enum(["stage1", "stage2-judge"]),
    runId: z.string().uuid(),
    promptVersion: z.string(),
    inputTokens: z.number(),
    outputTokens: z.number(),
    estimatedCostUsd: z.number(),
    latencyMs: z.number(),
    requestedAt: z.string().datetime(),
  }),
});
```

**`consensus.ts`** — multi-model agreement:

```typescript
export const ConsensusResult = z.object({
  paragraphId: z.string().uuid(),
  finalLabel: LabelOutput,
  method: z.enum(["unanimous", "majority", "judge-resolved", "unresolved"]),
  categoryAgreement: z.object({ votes: z.record(z.number()), agreed: z.boolean() }),
  specificityAgreement: z.object({ votes: z.record(z.number()), agreed: z.boolean(), spread: z.number() }),
  stage1ModelIds: z.array(z.string()),
  stage2JudgeModelId: z.string().nullable(),
  confidence: z.number().min(0).max(1),
});
```

Full schemas for filing, paragraph, gold, benchmark, and experiment types follow the same pattern — see the full plan agent output for complete definitions.
---

## Data Flow

```
Phase 1: EXTRACTION (GPU-free)
  EDGAR API → download 10-K/8-K → parse Item 1C/1.05 → segment into paragraphs
  → enrich with company metadata → data/paragraphs/paragraphs.jsonl (~50-70K records)

Phase 2: LABELING (GPU-free)
  paragraphs.jsonl → Stage 1: 3 models annotate all → consensus (expect ~83% agree)
  → disagreements → Stage 2: Sonnet 4.6 judges → final consensus.jsonl

Phase 3: GOLD SET (GPU-free)
  Stratified sample 1,200 → 3 humans label independently → compute agreement
  → adjudicate → gold-adjudicated.jsonl (LOCKED holdout)

Phase 4: BENCHMARKING (GPU-free)
  Run 6+ models on holdout → compute F1/AUC/MCC/Krippendorff's α → comparison table

Phase 5: TRAINING (REQUIRES GPU)
  DAPT: SEC-ModernBERT-large (continued MLM on SEC filings)
  Encoder FT: SEC-ModernBERT, ModernBERT, NeoBERT, DeBERTa (5 ablations)
  Decoder FT: Qwen3.5 via Unsloth LoRA
  HP search: autoresearch program.md — agent iterates autonomously

Phase 6: EVALUATION (REQUIRES GPU)
  Inference on holdout → metrics → error analysis → validity tests → final comparison
```

---
## Key Architecture Patterns

### Annotation: `generateObject` + OpenRouter

```typescript
const result = await generateObject({
  model: openrouter(modelId),
  schema: LabelOutput,
  system: buildSystemPrompt(),
  prompt: buildUserPrompt(paragraph),
  temperature: 0,
  mode: "json",
});
```
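Transient OpenRouter failures (429s, timeouts, 5xx) should not abort a batch run, which is what `lib/retry.ts` is for. A minimal sketch of such a helper, assuming a generic `withRetry` shape (the name and options are illustrative, not the actual module):

```typescript
// Generic retry with exponential backoff and jitter.
// Hypothetical shape for lib/retry.ts; names are illustrative.
export interface RetryOptions {
  maxAttempts?: number; // total tries, including the first
  baseDelayMs?: number; // delay before the first retry
  maxDelayMs?: number;  // cap on any single delay
}

export async function withRetry<T>(
  fn: () => Promise<T>,
  { maxAttempts = 5, baseDelayMs = 500, maxDelayMs = 30_000 }: RetryOptions = {},
): Promise<T> {
  let lastError: unknown;
  for (let attempt = 0; attempt < maxAttempts; attempt++) {
    try {
      return await fn();
    } catch (err) {
      lastError = err;
      if (attempt === maxAttempts - 1) break;
      // 500ms, 1s, 2s, ... doubled each attempt, capped, plus up to 25% jitter
      const delay = Math.min(baseDelayMs * 2 ** attempt, maxDelayMs);
      await new Promise((r) => setTimeout(r, delay * (1 + Math.random() * 0.25)));
    }
  }
  throw lastError;
}
```

`annotate.ts` would then wrap its `generateObject` call in `withRetry(() => generateObject({ ... }))`, so only paragraphs that fail all attempts surface as errors to the batch runner.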
### Batch Processing: Append-per-record checkpoint

Each successful annotation is appended to the output JSONL immediately. On crash/resume, the runner reads the completed paragraph IDs from the output file and skips them. Concurrency is controlled with `p-limit` (default 5).
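The loop described above can be sketched as follows; `readCompletedIds` and `annotateBatch` are illustrative names, and a tiny inline worker pool stands in for `p-limit` so the snippet is self-contained:

```typescript
import { appendFileSync, existsSync, readFileSync } from "node:fs";

// Collect IDs already written, so a resumed run skips them.
export function readCompletedIds(outPath: string): Set<string> {
  if (!existsSync(outPath)) return new Set();
  return new Set(
    readFileSync(outPath, "utf8")
      .split("\n")
      .filter(Boolean)
      .map((line) => JSON.parse(line).paragraphId as string),
  );
}

export async function annotateBatch(
  paragraphs: { id: string; text: string }[],
  annotate: (p: { id: string; text: string }) => Promise<unknown>,
  outPath: string,
  concurrency = 5, // the plan uses p-limit for this; inlined here
): Promise<number> {
  const done = readCompletedIds(outPath);
  const queue = paragraphs.filter((p) => !done.has(p.id));
  let written = 0;
  // Simple worker pool standing in for p-limit.
  const worker = async () => {
    for (let p = queue.shift(); p; p = queue.shift()) {
      const label = await annotate(p);
      // Append immediately so a crash loses at most the in-flight requests.
      appendFileSync(outPath, JSON.stringify({ paragraphId: p.id, label }) + "\n");
      written++;
    }
  };
  await Promise.all(Array.from({ length: concurrency }, worker));
  return written;
}
```

Because the output file doubles as the checkpoint, re-running the same command resumes automatically; there is no separate checkpoint state to keep in sync.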
### Consensus: Stage 1 majority → Stage 2 judge

- Stage 1: 3 models vote. If 2/3 agree on BOTH dimensions → consensus.
- Stage 2: For disagreements, Sonnet 4.6 receives the paragraph plus all 3 annotations (in randomized order to avoid position bias). The judge's label is treated as the authoritative tiebreaker.
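The Stage 1 rule above reduces to a grouped vote count over (category, specificity) pairs; a sketch with illustrative names, not the actual `consensus.ts` API:

```typescript
type Vote = { category: string; specificity: number };

// Returns the winning label when at least 2 of the 3 Stage 1 models agree
// on BOTH dimensions; null means escalate to the Stage 2 judge.
export function majorityVote(
  votes: Vote[],
): (Vote & { method: "unanimous" | "majority" }) | null {
  const counts = new Map<string, { vote: Vote; n: number }>();
  for (const v of votes) {
    const key = `${v.category}|${v.specificity}`;
    const entry = counts.get(key) ?? { vote: v, n: 0 };
    entry.n++;
    counts.set(key, entry);
  }
  for (const { vote, n } of counts.values()) {
    if (n === votes.length) return { ...vote, method: "unanimous" };
    if (n >= 2) return { ...vote, method: "majority" };
  }
  return null; // no pair of models agrees on both dimensions
}
```

A `null` result routes the paragraph to `judge.ts`; a non-null result already carries the `unanimous`/`majority` tag that `ConsensusResult.method` records.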
### Training: Multi-head classifier

Shared encoder backbone (ModernBERT/NeoBERT/DeBERTa) → dropout → two linear heads:

- `category_head`: 7-class softmax
- `specificity_head`: 4-class ordinal/softmax

Loss: `α * CE(category) + (1-α) * CE(specificity) + β * SCL`
### HP Search: Autoresearch `program.md`

- Fixed 30-min time budget per experiment
- Metric: `val_macro_f1`
- Agent modifies ONLY YAML configs, not training scripts
- TSV results log: experiment_id, metric, hyperparameters, verdict (keep/discard)
- Vary ONE hyperparameter per experiment (controlled ablation)
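The TSV log row implied by the bullets above could be serialized like this; the fields mirror the listed columns, and the helper names are illustrative rather than an existing API:

```typescript
// One row of results/experiments.tsv, matching the columns listed above.
export interface ExperimentRow {
  experimentId: string;
  metric: number;          // val_macro_f1
  hyperparameters: string; // the single varied HP, e.g. "lr=2e-5"
  verdict: "keep" | "discard";
}

export function toTsvRow(r: ExperimentRow): string {
  return [r.experimentId, r.metric.toFixed(4), r.hyperparameters, r.verdict].join("\t");
}

export function fromTsvRow(line: string): ExperimentRow {
  const [experimentId, metric, hyperparameters, verdict] = line.split("\t");
  return {
    experimentId,
    metric: Number(metric),
    hyperparameters,
    verdict: verdict as ExperimentRow["verdict"],
  };
}
```

Keeping the log as flat TSV means the agent can append a row with a single shell redirect and the human can sort/grep it without tooling.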
---

## Quality Gates

| Gate | When | Key Check | Threshold | If Failed |
|------|------|-----------|-----------|-----------|
| Extraction QA | After Phase 1 | Spot-check 20 filings manually | 18/20 correct | Fix parser |
| Labeling Pilot | 50 paragraphs | Human review of LLM labels | ≥80% agreement | Revise prompt/rubric |
| Scale Pilot | 200 paragraphs | Inter-model Fleiss' kappa | ≥0.60 | Replace weakest model or revise prompt |
| Human Labeling | Phase 3 | Krippendorff's α (specificity) | ≥0.67 | Collapse 4-pt to 3-pt scale |
| Human Labeling | Phase 3 | Cohen's κ (category) | ≥0.75 | Revise rubric boundaries |
| DAPT | Phase 5 | Perplexity decrease + GLUE check | PPL ↓, GLUE drop <2% | Reduce LR |
| Fine-tuning | Phase 5 | val_macro_f1 by epoch 3 | >0.75 | Check data quality |
| Final | Phase 6 | Holdout macro-F1 (category) | ≥0.80 | Error analysis, iterate |
| Final | Phase 6 | Calibration (ECE) | <0.10 | Temperature scaling |
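The Cohen's κ gate reduces to observed agreement versus chance agreement over two annotators' paired category labels. A sketch of what `gold/agreement.ts` might contain (Krippendorff's α for the ordinal dimension needs a distance-weighted formulation and is omitted here):

```typescript
// Cohen's kappa for two annotators over nominal labels.
// kappa = (p_o - p_e) / (1 - p_e), where p_o is observed agreement and
// p_e is chance agreement from each annotator's marginal distribution.
export function cohensKappa(a: string[], b: string[]): number {
  if (a.length !== b.length || a.length === 0) {
    throw new Error("paired, non-empty label arrays required");
  }
  const n = a.length;
  const marginalA = new Map<string, number>();
  const marginalB = new Map<string, number>();
  let observed = 0;
  for (let i = 0; i < n; i++) {
    if (a[i] === b[i]) observed++;
    marginalA.set(a[i], (marginalA.get(a[i]) ?? 0) + 1);
    marginalB.set(b[i], (marginalB.get(b[i]) ?? 0) + 1);
  }
  const pO = observed / n;
  let pE = 0;
  for (const [label, countA] of marginalA) {
    pE += (countA / n) * ((marginalB.get(label) ?? 0) / n);
  }
  if (pE === 1) return 1; // degenerate: both annotators used one identical label
  return (pO - pE) / (1 - pE);
}
```

With three human annotators, the gate would be checked pairwise (or with Fleiss' kappa for the pooled view, as in the scale pilot).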
---

## CLI Commands

```bash
# Extraction
bun sec extract:download-10k --fiscal-year 2023
bun sec extract:parse --type 10k
bun sec extract:segment
bun sec extract:metadata

# Labeling
bun sec label:annotate --model openai/gpt-oss-120b --limit 50   # pilot
bun sec label:annotate-all                                      # full run
bun sec label:consensus
bun sec label:judge
bun sec label:cost

# Gold set
bun sec gold:sample --n 1200
bun sec gold:import-human --annotator annotator-1 --input labels.csv
bun sec gold:agreement

# Benchmarking
bun sec benchmark:run-all
bun sec benchmark:evaluate
bun sec benchmark:table

# Splits
bun sec splits:create

# Python training (GPU required)
uv run python/src/dapt/train_mlm.py --config python/configs/dapt/modernbert-large.yaml
uv run python/src/finetune/train.py --config python/configs/finetune/modernbert-large.yaml --time-budget 1800
uv run python/src/decoder/train_lora.py --config python/configs/decoder/qwen3.5-lora.yaml
uv run python/src/eval/predict.py --split test
uv run python/src/eval/metrics.py
```
---

## Implementation Sequence

### Day 1 (GPU-free) — Foundation
1. `bun init` in ts/, `uv init` in python/, create the full directory tree
2. All Zod schemas
3. JSONL utilities, OpenRouter singleton, model registry
4. Prompt builders (from LABELING-CODEBOOK.md)
5. `annotate.ts` + `batch.ts` with checkpoint/resume
6. Test: dry-run 3 paragraphs

### Day 2 (GPU-free) — Extraction + Labeling Pilot
7. EDGAR extraction pipeline (download, parse, segment)
8. Run extraction on a small sample (~100 filings)
9. **Quality Gate 1**: Verify extraction
10. Labeling pilot: 50 paragraphs × 3 models
11. `consensus.ts` + `judge.ts`
12. **Quality Gate 2**: Manual review
13. Scale pilot: 200 paragraphs
14. **Quality Gate 3**: Inter-model agreement
15. If gates pass → launch full Stage 1 annotation

### Day 3+ (GPU-free, labeling runs) — Gold Set + Benchmarking
16. Gold set sampling, human label infrastructure
17. Benchmark runner + metrics
18. Consensus + judge on full corpus
19. Begin human labeling
20. Prepare DAPT corpus

### GPU Available — Training
21. Python training scripts (model.py, train.py, losses.py)
22. `program.md` for autoresearch
23. DAPT (~2-3 days)
24. Fine-tuning ablations via autoresearch
25. Unsloth decoder experiment
26. Final evaluation + error analysis

---
## Verification

After implementation, verify end-to-end:

1. `bun sec extract:segment --limit 10` produces valid Paragraph JSONL
2. `bun sec label:annotate --model openai/gpt-oss-120b --limit 5` returns valid Annotations with cost tracking
3. `bun sec label:consensus` correctly identifies agreement/disagreement
4. `bun sec validate:schema --input data/annotations/stage1/gpt-oss-120b.jsonl --schema annotation` passes
5. The Python training script loads JSONL splits and begins training without errors
6. `results/experiments.tsv` is populated after one autoresearch iteration