
# SEC-cyBERT Implementation Plan
## Context
Building an SEC cybersecurity disclosure quality classifier for the BUSI488/COMP488 capstone, following the Ringel (2023) "Synthetic Experts" pipeline: frontier LLMs label ~50K paragraphs, and the labels are then distilled into a small encoder model. Labels cover two dimensions: content category (7-class) and specificity (4-point ordinal). The GPU is offline for 2 days, so all data/labeling/eval infrastructure is GPU-free and should be built now.
---
## Tech Stack
| Layer | Tool | Notes |
|-------|------|-------|
| Data/labeling pipeline | TypeScript, Vercel AI SDK 6.0.108, `@openrouter/ai-sdk-provider`, Zod | `generateObject` with Zod schemas for structured output |
| Stage 1 annotators | gpt-oss-120b, mimo-v2-flash, grok-4.1-fast | Via OpenRouter |
| Stage 2 judge | Claude Sonnet 4.6 | Via OpenRouter, called only on disagreements |
| Encoder training | HuggingFace Trainer, Python scripts | ModernBERT-large, NeoBERT, DeBERTa-v3-large |
| DAPT | HuggingFace Trainer + DataCollatorForLanguageModeling | Continued MLM on SEC filings |
| Decoder experiment | Unsloth (not Axolotl, which is likewise decoder-only but slower) | Qwen3.5 LoRA |
| HP search | Autoresearch-style `program.md` directives | Agent edits YAML, trains for fixed budget, evaluates, keeps/discards |
| Runtime | bun (TS), uv (Python) | |
---
## Project Structure
```
sec-cyBERT/
├── docs/
│   ├── PROJECT-OVERVIEW.md
│   ├── LABELING-CODEBOOK.md
│   └── TECHNICAL-GUIDE.md
├── ts/                              # TypeScript: data pipeline, labeling, eval
│   ├── package.json
│   ├── tsconfig.json
│   ├── src/
│   │   ├── schemas/                 # Zod schemas (single source of truth)
│   │   │   ├── filing.ts
│   │   │   ├── paragraph.ts
│   │   │   ├── label.ts             # LabelOutput: passed to generateObject
│   │   │   ├── annotation.ts        # Label + provenance (model, cost, latency)
│   │   │   ├── consensus.ts         # Multi-model agreement result
│   │   │   ├── gold.ts              # Human-labeled holdout entry
│   │   │   ├── benchmark.ts         # Model performance metrics
│   │   │   ├── experiment.ts        # Autoresearch training tracker
│   │   │   └── index.ts
│   │   ├── extract/                 # Phase 1: EDGAR extraction
│   │   │   ├── download-10k.ts
│   │   │   ├── parse-item1c.ts
│   │   │   ├── parse-8k.ts
│   │   │   ├── segment.ts
│   │   │   └── metadata.ts
│   │   ├── label/                   # Phase 2: GenAI labeling
│   │   │   ├── annotate.ts          # generateObject + OpenRouter per paragraph
│   │   │   ├── batch.ts             # Concurrency control + JSONL checkpointing
│   │   │   ├── consensus.ts         # Stage 1 majority vote logic
│   │   │   ├── judge.ts             # Stage 2 tiebreaker (Sonnet 4.6)
│   │   │   ├── prompts.ts           # System/user prompt builders
│   │   │   └── cost.ts              # Cost tracking aggregation
│   │   ├── gold/                    # Phase 3: Gold set
│   │   │   ├── sample.ts            # Stratified sampling
│   │   │   ├── human-label.ts       # Human label import
│   │   │   └── agreement.ts         # Krippendorff's alpha, Cohen's kappa
│   │   ├── benchmark/               # Phase 4: GenAI benchmarking
│   │   │   ├── run.ts
│   │   │   └── metrics.ts           # F1, AUC, MCC computation
│   │   ├── lib/                     # Shared utilities
│   │   │   ├── openrouter.ts        # Singleton + model registry with pricing
│   │   │   ├── jsonl.ts             # Read/write/append JSONL
│   │   │   ├── checkpoint.ts        # Resume from last completed ID
│   │   │   └── retry.ts             # Exponential backoff
│   │   └── cli.ts                   # CLI entry point
│   └── tests/
├── python/                          # Python: training, DAPT, inference
│   ├── pyproject.toml
│   ├── configs/
│   │   ├── dapt/modernbert-large.yaml
│   │   ├── finetune/
│   │   │   ├── modernbert-large.yaml
│   │   │   ├── neobert.yaml
│   │   │   └── deberta-v3-large.yaml
│   │   └── decoder/qwen3.5-lora.yaml
│   ├── src/
│   │   ├── dapt/train_mlm.py
│   │   ├── finetune/
│   │   │   ├── model.py             # Multi-head classifier (shared backbone)
│   │   │   ├── train.py             # HF Trainer script with --time-budget
│   │   │   ├── data.py
│   │   │   ├── losses.py            # SCL + ordinal + multi-head balancing
│   │   │   └── trainer.py           # Custom Trainer subclass
│   │   ├── decoder/train_lora.py    # Unsloth
│   │   └── eval/
│   │       ├── predict.py
│   │       ├── metrics.py
│   │       └── error_analysis.py
│   └── program.md                   # Autoresearch agent directive
├── data/                            # Gitignored heavy files
│   ├── raw/{10k,8k}/
│   ├── extracted/{item1c,item105}/
│   ├── paragraphs/paragraphs.jsonl
│   ├── annotations/
│   │   ├── stage1/{model}.jsonl
│   │   ├── stage2/judge.jsonl
│   │   └── consensus.jsonl
│   ├── gold/
│   │   ├── gold-sample.jsonl
│   │   ├── human-labels/annotator-{1,2,3}.jsonl
│   │   └── gold-adjudicated.jsonl
│   ├── benchmark/runs/{model}.jsonl
│   ├── splits/{train,val,test}.jsonl
│   └── dapt-corpus/sec-texts.jsonl
├── models/                          # Gitignored checkpoints
├── results/
│   ├── experiments.tsv              # Autoresearch log
│   └── figures/
└── .gitignore
```
---
## Core Schemas (Zod)
**`label.ts`** — the contract passed to `generateObject`:
```typescript
import { z } from "zod";

export const ContentCategory = z.enum([
  "Board Governance", "Management Role", "Risk Management Process",
  "Third-Party Risk", "Incident Disclosure", "Strategy Integration", "None/Other",
]);

export const SpecificityLevel = z.union([
  z.literal(1), z.literal(2), z.literal(3), z.literal(4),
]);

export const LabelOutput = z.object({
  content_category: ContentCategory,
  specificity_level: SpecificityLevel,
  reasoning: z.string().max(500),
});
```
**`annotation.ts`** — label + full provenance:
```typescript
import { z } from "zod";
import { LabelOutput } from "./label";

export const Annotation = z.object({
  paragraphId: z.string().uuid(),
  label: LabelOutput,
  provenance: z.object({
    modelId: z.string(),
    provider: z.string(),
    stage: z.enum(["stage1", "stage2-judge"]),
    runId: z.string().uuid(),
    promptVersion: z.string(),
    inputTokens: z.number(),
    outputTokens: z.number(),
    estimatedCostUsd: z.number(),
    latencyMs: z.number(),
    requestedAt: z.string().datetime(),
  }),
});
```
**`consensus.ts`** — multi-model agreement:
```typescript
import { z } from "zod";
import { LabelOutput } from "./label";

export const ConsensusResult = z.object({
  paragraphId: z.string().uuid(),
  finalLabel: LabelOutput,
  method: z.enum(["unanimous", "majority", "judge-resolved", "unresolved"]),
  categoryAgreement: z.object({ votes: z.record(z.number()), agreed: z.boolean() }),
  specificityAgreement: z.object({ votes: z.record(z.number()), agreed: z.boolean(), spread: z.number() }),
  stage1ModelIds: z.array(z.string()),
  stage2JudgeModelId: z.string().nullable(),
  confidence: z.number().min(0).max(1),
});
```
Full schemas for the filing, paragraph, gold, benchmark, and experiment types follow the same pattern; see the full planning-agent output for complete definitions.
---
## Data Flow
```
Phase 1: EXTRACTION (GPU-free)
  EDGAR API → download 10-K/8-K → parse Item 1C/1.05 → segment into paragraphs
  → enrich with company metadata → data/paragraphs/paragraphs.jsonl (~50-70K records)

Phase 2: LABELING (GPU-free)
  paragraphs.jsonl → Stage 1: 3 models annotate all → consensus (expect ~83% agree)
  → disagreements → Stage 2: Sonnet 4.6 judges → final consensus.jsonl

Phase 3: GOLD SET (GPU-free)
  Stratified sample 1,200 → 3 humans label independently → compute agreement
  → adjudicate → gold-adjudicated.jsonl (LOCKED holdout)

Phase 4: BENCHMARKING (GPU-free)
  Run 6+ models on holdout → compute F1/AUC/MCC/Krippendorff's α → comparison table

Phase 5: TRAINING (REQUIRES GPU)
  DAPT: SEC-ModernBERT-large (continued MLM on SEC filings)
  Encoder FT: SEC-ModernBERT, ModernBERT, NeoBERT, DeBERTa (5 ablations)
  Decoder FT: Qwen3.5 via Unsloth LoRA
  HP search: autoresearch program.md — agent iterates autonomously

Phase 6: EVALUATION (REQUIRES GPU)
  Inference on holdout → metrics → error analysis → validity tests → final comparison
```
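The stratified sample in Phase 3 (`gold/sample.ts`) can be sketched as proportional allocation across stratum keys (e.g. category × specificity). This is a minimal sketch, not the actual implementation: the `Para` shape and field names are assumptions, and it takes the first k of each bucket, so shuffling is assumed to happen upstream.

```typescript
// Proportional stratified sampling: allocate the target n across strata
// in proportion to stratum size; guarantee at least one per stratum so
// rare classes appear in the gold set. Names here are illustrative.
interface Para { id: string; stratum: string } // e.g. "Incident Disclosure|level-3"

function stratifiedSample(paras: Para[], n: number): Para[] {
  const byStratum = new Map<string, Para[]>();
  for (const p of paras) {
    const bucket = byStratum.get(p.stratum) ?? [];
    bucket.push(p);
    byStratum.set(p.stratum, bucket);
  }
  const total = paras.length;
  const out: Para[] = [];
  for (const [, bucket] of byStratum) {
    const k = Math.max(1, Math.round((bucket.length / total) * n));
    out.push(...bucket.slice(0, k));
  }
  return out.slice(0, n); // trim rounding overshoot
}
```

Because the minimum of one per stratum can overshoot `n` when there are many tiny strata, the final trim keeps the sample size exact at the cost of dropping some late strata; a real implementation would allocate remainders more carefully.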
---
## Key Architecture Patterns
### Annotation: `generateObject` + OpenRouter
```typescript
import { generateObject } from "ai";
import { LabelOutput } from "../schemas/label";
import { openrouter } from "../lib/openrouter";

const result = await generateObject({
  model: openrouter(modelId),
  schema: LabelOutput,
  system: buildSystemPrompt(),
  prompt: buildUserPrompt(paragraph),
  temperature: 0,
  mode: "json",
});
```
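The `lib/retry.ts` utility wraps calls like the one above so transient OpenRouter 429/5xx errors do not abort a batch. A minimal exponential-backoff sketch; the signature and defaults are assumptions, not the actual module:

```typescript
// Retry an async operation with exponential backoff: the delay doubles
// on each failed attempt (baseMs, 2*baseMs, 4*baseMs, ...). The last
// error is rethrown once the attempt budget is exhausted.
async function withRetry<T>(
  fn: () => Promise<T>,
  { attempts = 4, baseMs = 500 }: { attempts?: number; baseMs?: number } = {},
): Promise<T> {
  let lastErr: unknown;
  for (let i = 0; i < attempts; i++) {
    try {
      return await fn();
    } catch (err) {
      lastErr = err;
      if (i < attempts - 1) {
        await new Promise((resolve) => setTimeout(resolve, baseMs * 2 ** i));
      }
    }
  }
  throw lastErr;
}
```

In `annotate.ts` this would wrap the `generateObject` call, e.g. `withRetry(() => generateObject({...}))`; adding jitter to the delay would help when many concurrent workers fail at once.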
### Batch Processing: Append-per-record checkpoint
Each successful annotation is appended immediately to the output JSONL. On crash or resume, completed IDs are read from the output file and skipped. Concurrency is controlled with `p-limit` (default 5 concurrent requests).
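The resume logic reduces to a pure function over the output file's contents. A sketch, assuming records carry the `paragraphId` field from the Annotation schema; the function name is illustrative:

```typescript
// Given all paragraph IDs and the raw text of the output JSONL, return
// the IDs still pending. A crash mid-append can leave a truncated final
// line, so unparseable lines are skipped rather than treated as fatal.
function pendingIds(allIds: string[], outputJsonl: string): string[] {
  const done = new Set<string>();
  for (const line of outputJsonl.split("\n")) {
    if (!line.trim()) continue;
    try {
      const rec = JSON.parse(line);
      if (typeof rec.paragraphId === "string") done.add(rec.paragraphId);
    } catch {
      // partial last line from an interrupted write: ignore
    }
  }
  return allIds.filter((id) => !done.has(id));
}
```

Skipping the torn final record means it is simply re-annotated on resume, which is safe because annotation is idempotent per paragraph.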
### Consensus: Stage 1 majority → Stage 2 judge
- Stage 1: 3 models vote. If 2/3 agree on BOTH dimensions → consensus.
- Stage 2: For disagreements, Sonnet 4.6 receives the paragraph plus all three Stage 1 annotations, presented in randomized order to avoid position bias. The judge's label is treated as the authoritative tiebreaker.
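The Stage 1 vote can be sketched as a pure function; the names and return shape here are illustrative, not the actual `consensus.ts`:

```typescript
// Majority vote over one dimension: returns the winning value when a
// strict majority of annotators agree, else null.
function majority<T>(votes: T[]): T | null {
  const counts = new Map<T, number>();
  for (const v of votes) counts.set(v, (counts.get(v) ?? 0) + 1);
  for (const [value, n] of counts) {
    if (n * 2 > votes.length) return value; // 2/3 or 3/3
  }
  return null;
}

// Consensus requires agreement on BOTH dimensions; otherwise escalate
// the paragraph to the Stage 2 judge.
function stage1Consensus(labels: { category: string; specificity: number }[]) {
  const category = majority(labels.map((l) => l.category));
  const specificity = majority(labels.map((l) => l.specificity));
  if (category === null || specificity === null) return null; // → judge
  const unanimous = labels.every(
    (l) => l.category === category && l.specificity === specificity,
  );
  return { method: unanimous ? "unanimous" : "majority", category, specificity };
}
```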
### Training: Multi-head classifier
Shared encoder backbone (ModernBERT/NeoBERT/DeBERTa) → dropout → two linear heads:
- `category_head`: 7-class softmax
- `specificity_head`: 4-class ordinal/softmax
Loss: `α * CE(category) + (1-α) * CE(specificity) + β * SCL`, where α balances the two heads and β weights the supervised contrastive (SCL) term.
### HP Search: Autoresearch `program.md`
- Fixed 30-min time budget per experiment
- Metric: `val_macro_f1`
- Agent modifies ONLY YAML configs, not training scripts
- TSV results log: experiment_id, metric, hyperparameters, verdict (keep/discard)
- Vary ONE hyperparameter per experiment (controlled ablation)
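The keep/discard verdict above can be sketched as a comparison against the best kept run in `results/experiments.tsv`; the exact column names are an assumption mirroring the fields listed:

```typescript
// Parse the TSV experiment log (header row + one row per experiment) and
// decide whether a new run's val_macro_f1 beats the best "keep" so far.
function verdict(tsv: string, newMetric: number): "keep" | "discard" {
  const [header, ...rows] = tsv.trim().split("\n").map((r) => r.split("\t"));
  const mIdx = header.indexOf("metric");
  const vIdx = header.indexOf("verdict");
  let best = -Infinity;
  for (const row of rows) {
    if (row[vIdx] === "keep") best = Math.max(best, Number(row[mIdx]));
  }
  return newMetric > best ? "keep" : "discard";
}
```

Keeping the comparison strictly greater-than means a tie is discarded, which matches the one-hyperparameter-per-experiment discipline: a change that does not improve the metric is not retained.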
---
## Quality Gates
| Gate | When | Key Check | Threshold | If Failed |
|------|------|-----------|-----------|-----------|
| Extraction QA | After Phase 1 | Spot-check 20 filings manually | 18/20 correct | Fix parser |
| Labeling Pilot | 50 paragraphs | Human review of LLM labels | ≥80% agreement | Revise prompt/rubric |
| Scale Pilot | 200 paragraphs | Inter-model Fleiss' Kappa | ≥0.60 | Replace weakest model or revise prompt |
| Human Labeling | Phase 3 | Krippendorff's α (specificity) | ≥0.67 | Collapse 4-pt to 3-pt scale |
| Human Labeling | Phase 3 | Cohen's κ (category) | ≥0.75 | Revise rubric boundaries |
| DAPT | Phase 5 | Perplexity decrease + GLUE check | PPL ↓, GLUE drop <2% | Reduce LR |
| Fine-tuning | Phase 5 | val_macro_f1 by epoch 3 | >0.75 | Check data quality |
| Final | Phase 6 | Holdout macro-F1 (category) | ≥0.80 | Error analysis, iterate |
| Final | Phase 6 | Calibration (ECE) | <0.10 | Temperature scaling |
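The κ ≥ 0.75 category gate uses Cohen's kappa, which discounts raw agreement by the agreement expected by chance from each annotator's label marginals. A self-contained two-annotator sketch (assumes expected chance agreement < 1):

```typescript
// Cohen's kappa for two annotators over the same items:
//   kappa = (p_o - p_e) / (1 - p_e)
// where p_o is observed agreement and p_e is chance agreement computed
// from the two annotators' per-category marginal frequencies.
function cohensKappa(a: string[], b: string[]): number {
  const n = a.length;
  const cats = [...new Set([...a, ...b])];
  let agree = 0;
  const margA = new Map<string, number>();
  const margB = new Map<string, number>();
  for (let i = 0; i < n; i++) {
    if (a[i] === b[i]) agree++;
    margA.set(a[i], (margA.get(a[i]) ?? 0) + 1);
    margB.set(b[i], (margB.get(b[i]) ?? 0) + 1);
  }
  const po = agree / n;
  let pe = 0;
  for (const c of cats) {
    pe += ((margA.get(c) ?? 0) / n) * ((margB.get(c) ?? 0) / n);
  }
  return (po - pe) / (1 - pe);
}
```

`gold/agreement.ts` would pair this with Krippendorff's α for the ordinal specificity dimension, since κ treats all disagreements as equal while α can weight a 1-vs-4 disagreement more heavily than 2-vs-3.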
---
## CLI Commands
```bash
# Extraction
bun sec extract:download-10k --fiscal-year 2023
bun sec extract:parse --type 10k
bun sec extract:segment
bun sec extract:metadata
# Labeling
bun sec label:annotate --model openai/gpt-oss-120b --limit 50 # pilot
bun sec label:annotate-all # full run
bun sec label:consensus
bun sec label:judge
bun sec label:cost
# Gold set
bun sec gold:sample --n 1200
bun sec gold:import-human --annotator annotator-1 --input labels.csv
bun sec gold:agreement
# Benchmarking
bun sec benchmark:run-all
bun sec benchmark:evaluate
bun sec benchmark:table
# Splits
bun sec splits:create
# Python training (GPU required)
uv run python/src/dapt/train_mlm.py --config python/configs/dapt/modernbert-large.yaml
uv run python/src/finetune/train.py --config python/configs/finetune/modernbert-large.yaml --time-budget 1800
uv run python/src/decoder/train_lora.py --config python/configs/decoder/qwen3.5-lora.yaml
uv run python/src/eval/predict.py --split test
uv run python/src/eval/metrics.py
```
---
## Implementation Sequence
### Day 1 (GPU-free) — Foundation
1. `bun init` in ts/, `uv init` in python/, create full directory tree
2. All Zod schemas
3. JSONL utilities, OpenRouter singleton, model registry
4. Prompt builders (from LABELING-CODEBOOK.md)
5. `annotate.ts` + `batch.ts` with checkpoint/resume
6. Test: dry-run 3 paragraphs
### Day 2 (GPU-free) — Extraction + Labeling Pilot
7. EDGAR extraction pipeline (download, parse, segment)
8. Run extraction on a small sample (~100 filings)
9. **Quality Gate 1**: Verify extraction
10. Labeling pilot: 50 paragraphs × 3 models
11. `consensus.ts` + `judge.ts`
12. **Quality Gate 2**: Manual review
13. Scale pilot: 200 paragraphs
14. **Quality Gate 3**: Inter-model agreement
15. If gates pass, launch full Stage 1 annotation
### Day 3+ (GPU-free, labeling runs) — Gold Set + Benchmarking
16. Gold set sampling, human label infrastructure
17. Benchmark runner + metrics
18. Consensus + judge on full corpus
19. Begin human labeling
20. Prepare DAPT corpus
### GPU Available — Training
21. Python training scripts (model.py, train.py, losses.py)
22. `program.md` for autoresearch
23. DAPT (~2-3 days)
24. Fine-tuning ablations via autoresearch
25. Unsloth decoder experiment
26. Final evaluation + error analysis
---
## Verification
After implementation, verify end-to-end:
1. `bun sec extract:segment --limit 10` produces valid Paragraph JSONL
2. `bun sec label:annotate --model openai/gpt-oss-120b --limit 5` returns valid Annotations with cost tracking
3. `bun sec label:consensus` correctly identifies agreement/disagreement
4. `bun sec validate:schema --input data/annotations/stage1/gpt-oss-120b.jsonl --schema annotation` passes
5. Python training script loads JSONL splits and begins training without errors
6. `results/experiments.tsv` gets populated after one autoresearch iteration