
# SEC-cyBERT Implementation Plan
## Context
Building an SEC cybersecurity disclosure quality classifier for the BUSI488/COMP488 capstone, following the Ringel (2023) "Synthetic Experts" pipeline: frontier LLMs label ~50K paragraphs, and the labels are then distilled into a small encoder model. Labels cover two dimensions: content category (7-class) and specificity (4-point ordinal). The GPU is offline for 2 days, so all data/labeling/eval infrastructure is GPU-free and should be built now.
---
## Tech Stack
| Layer | Tool | Notes |
|-------|------|-------|
| Data/labeling pipeline | TypeScript, Vercel AI SDK 6.0.108, `@openrouter/ai-sdk-provider`, Zod | `generateObject` with Zod schemas for structured output |
| Stage 1 annotators | gpt-oss-120b, mimo-v2-flash, grok-4.1-fast | Via OpenRouter |
| Stage 2 judge | Claude Sonnet 4.6 | Via OpenRouter, called only on disagreements |
| Encoder training | HuggingFace Trainer, Python scripts | ModernBERT-large, NeoBERT, DeBERTa-v3-large |
| DAPT | HuggingFace Trainer + DataCollatorForLanguageModeling | Continued MLM on SEC filings |
| Decoder experiment | Unsloth (not Axolotl, which is likewise decoder-only but slower) | Qwen3.5 LoRA |
| HP search | Autoresearch-style `program.md` directives | Agent edits YAML, trains for fixed budget, evaluates, keeps/discards |
| Runtime | bun (TS), uv (Python) | |
---
## Project Structure
```
sec-cyBERT/
├── docs/
│   ├── PROJECT-OVERVIEW.md
│   ├── LABELING-CODEBOOK.md
│   └── TECHNICAL-GUIDE.md
├── ts/                              # TypeScript: data pipeline, labeling, eval
│   ├── package.json
│   ├── tsconfig.json
│   ├── src/
│   │   ├── schemas/                 # Zod schemas (single source of truth)
│   │   │   ├── filing.ts
│   │   │   ├── paragraph.ts
│   │   │   ├── label.ts             # LabelOutput: passed to generateObject
│   │   │   ├── annotation.ts        # Label + provenance (model, cost, latency)
│   │   │   ├── consensus.ts         # Multi-model agreement result
│   │   │   ├── gold.ts              # Human-labeled holdout entry
│   │   │   ├── benchmark.ts         # Model performance metrics
│   │   │   ├── experiment.ts        # Autoresearch training tracker
│   │   │   └── index.ts
│   │   ├── extract/                 # Phase 1: EDGAR extraction
│   │   │   ├── download-10k.ts
│   │   │   ├── parse-item1c.ts
│   │   │   ├── parse-8k.ts
│   │   │   ├── segment.ts
│   │   │   └── metadata.ts
│   │   ├── label/                   # Phase 2: GenAI labeling
│   │   │   ├── annotate.ts          # generateObject + OpenRouter per paragraph
│   │   │   ├── batch.ts             # Concurrency control + JSONL checkpointing
│   │   │   ├── consensus.ts         # Stage 1 majority vote logic
│   │   │   ├── judge.ts             # Stage 2 tiebreaker (Sonnet 4.6)
│   │   │   ├── prompts.ts           # System/user prompt builders
│   │   │   └── cost.ts              # Cost tracking aggregation
│   │   ├── gold/                    # Phase 3: Gold set
│   │   │   ├── sample.ts            # Stratified sampling
│   │   │   ├── human-label.ts       # Human label import
│   │   │   └── agreement.ts         # Krippendorff's alpha, Cohen's kappa
│   │   ├── benchmark/               # Phase 4: GenAI benchmarking
│   │   │   ├── run.ts
│   │   │   └── metrics.ts           # F1, AUC, MCC computation
│   │   ├── lib/                     # Shared utilities
│   │   │   ├── openrouter.ts        # Singleton + model registry with pricing
│   │   │   ├── jsonl.ts             # Read/write/append JSONL
│   │   │   ├── checkpoint.ts        # Resume from last completed ID
│   │   │   └── retry.ts             # Exponential backoff
│   │   └── cli.ts                   # CLI entry point
│   └── tests/
├── python/                          # Python: training, DAPT, inference
│   ├── pyproject.toml
│   ├── configs/
│   │   ├── dapt/modernbert-large.yaml
│   │   ├── finetune/
│   │   │   ├── modernbert-large.yaml
│   │   │   ├── neobert.yaml
│   │   │   └── deberta-v3-large.yaml
│   │   └── decoder/qwen3.5-lora.yaml
│   ├── src/
│   │   ├── dapt/train_mlm.py
│   │   ├── finetune/
│   │   │   ├── model.py             # Multi-head classifier (shared backbone)
│   │   │   ├── train.py             # HF Trainer script with --time-budget
│   │   │   ├── data.py
│   │   │   ├── losses.py            # SCL + ordinal + multi-head balancing
│   │   │   └── trainer.py           # Custom Trainer subclass
│   │   ├── decoder/train_lora.py    # Unsloth
│   │   └── eval/
│   │       ├── predict.py
│   │       ├── metrics.py
│   │       └── error_analysis.py
│   └── program.md                   # Autoresearch agent directive
├── data/                            # Gitignored heavy files
│   ├── raw/{10k,8k}/
│   ├── extracted/{item1c,item105}/
│   ├── paragraphs/paragraphs.jsonl
│   ├── annotations/
│   │   ├── stage1/{model}.jsonl
│   │   ├── stage2/judge.jsonl
│   │   └── consensus.jsonl
│   ├── gold/
│   │   ├── gold-sample.jsonl
│   │   ├── human-labels/annotator-{1,2,3}.jsonl
│   │   └── gold-adjudicated.jsonl
│   ├── benchmark/runs/{model}.jsonl
│   ├── splits/{train,val,test}.jsonl
│   └── dapt-corpus/sec-texts.jsonl
├── models/                          # Gitignored checkpoints
├── results/
│   ├── experiments.tsv              # Autoresearch log
│   └── figures/
└── .gitignore
```
---
## Core Schemas (Zod)
**`label.ts`** — the contract passed to `generateObject`:
```typescript
import { z } from "zod";

export const ContentCategory = z.enum([
  "Board Governance", "Management Role", "Risk Management Process",
  "Third-Party Risk", "Incident Disclosure", "Strategy Integration", "None/Other",
]);

export const SpecificityLevel = z.union([
  z.literal(1), z.literal(2), z.literal(3), z.literal(4),
]);

export const LabelOutput = z.object({
  content_category: ContentCategory,
  specificity_level: SpecificityLevel,
  reasoning: z.string().max(500),
});
```
**`annotation.ts`** — label + full provenance:
```typescript
import { z } from "zod";
import { LabelOutput } from "./label";

export const Annotation = z.object({
  paragraphId: z.string().uuid(),
  label: LabelOutput,
  provenance: z.object({
    modelId: z.string(),
    provider: z.string(),
    stage: z.enum(["stage1", "stage2-judge"]),
    runId: z.string().uuid(),
    promptVersion: z.string(),
    inputTokens: z.number(),
    outputTokens: z.number(),
    estimatedCostUsd: z.number(),
    latencyMs: z.number(),
    requestedAt: z.string().datetime(),
  }),
});
```
**`consensus.ts`** — multi-model agreement:
```typescript
import { z } from "zod";
import { LabelOutput } from "./label";

export const ConsensusResult = z.object({
  paragraphId: z.string().uuid(),
  finalLabel: LabelOutput,
  method: z.enum(["unanimous", "majority", "judge-resolved", "unresolved"]),
  categoryAgreement: z.object({ votes: z.record(z.number()), agreed: z.boolean() }),
  specificityAgreement: z.object({ votes: z.record(z.number()), agreed: z.boolean(), spread: z.number() }),
  stage1ModelIds: z.array(z.string()),
  stage2JudgeModelId: z.string().nullable(),
  confidence: z.number().min(0).max(1),
});
```
Full schemas for the filing, paragraph, gold, benchmark, and experiment types follow the same pattern; see the full planning-agent output for complete definitions.
---
## Data Flow
```
Phase 1: EXTRACTION (GPU-free)
  EDGAR API → download 10-K/8-K → parse Item 1C/1.05 → segment into paragraphs
  → enrich with company metadata → data/paragraphs/paragraphs.jsonl (~50-70K records)

Phase 2: LABELING (GPU-free)
  paragraphs.jsonl → Stage 1: 3 models annotate all → consensus (expect ~83% agree)
  → disagreements → Stage 2: Sonnet 4.6 judges → final consensus.jsonl

Phase 3: GOLD SET (GPU-free)
  Stratified sample 1,200 → 3 humans label independently → compute agreement
  → adjudicate → gold-adjudicated.jsonl (LOCKED holdout)

Phase 4: BENCHMARKING (GPU-free)
  Run 6+ models on holdout → compute F1/AUC/MCC/Krippendorff's α → comparison table

Phase 5: TRAINING (REQUIRES GPU)
  DAPT: SEC-ModernBERT-large (continued MLM on SEC filings)
  Encoder FT: SEC-ModernBERT, ModernBERT, NeoBERT, DeBERTa (5 ablations)
  Decoder FT: Qwen3.5 via Unsloth LoRA
  HP search: autoresearch program.md — agent iterates autonomously

Phase 6: EVALUATION (REQUIRES GPU)
  Inference on holdout → metrics → error analysis → validity tests → final comparison
```
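The stratified sample in Phase 3 (`gold/sample.ts`) can be sketched as proportional allocation across stratum keys (e.g. category × specificity). This is a minimal sketch, not the actual implementation: the `Para` shape and field names are assumptions, and it takes the first k of each bucket, so shuffling is assumed to happen upstream.

```typescript
// Proportional stratified sampling: allocate the target n across strata
// in proportion to stratum size; guarantee at least one per stratum so
// rare classes appear in the gold set. Names here are illustrative.
interface Para { id: string; stratum: string } // e.g. "Incident Disclosure|level-3"

function stratifiedSample(paras: Para[], n: number): Para[] {
  const byStratum = new Map<string, Para[]>();
  for (const p of paras) {
    const bucket = byStratum.get(p.stratum) ?? [];
    bucket.push(p);
    byStratum.set(p.stratum, bucket);
  }
  const total = paras.length;
  const out: Para[] = [];
  for (const [, bucket] of byStratum) {
    const k = Math.max(1, Math.round((bucket.length / total) * n));
    out.push(...bucket.slice(0, k));
  }
  return out.slice(0, n); // trim rounding overshoot
}
```

Because the minimum of one per stratum can overshoot `n` when there are many tiny strata, the final trim keeps the sample size exact at the cost of dropping some late strata; a real implementation would allocate remainders more carefully.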
---
## Key Architecture Patterns
### Annotation: `generateObject` + OpenRouter
```typescript
import { generateObject } from "ai";
import { LabelOutput } from "../schemas/label";
import { openrouter } from "../lib/openrouter";

const result = await generateObject({
  model: openrouter(modelId),
  schema: LabelOutput,
  system: buildSystemPrompt(),
  prompt: buildUserPrompt(paragraph),
  temperature: 0,
  mode: "json",
});
```
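The `lib/retry.ts` utility wraps calls like the one above so transient OpenRouter 429/5xx errors do not abort a batch. A minimal exponential-backoff sketch; the signature and defaults are assumptions, not the actual module:

```typescript
// Retry an async operation with exponential backoff: the delay doubles
// on each failed attempt (baseMs, 2*baseMs, 4*baseMs, ...). The last
// error is rethrown once the attempt budget is exhausted.
async function withRetry<T>(
  fn: () => Promise<T>,
  { attempts = 4, baseMs = 500 }: { attempts?: number; baseMs?: number } = {},
): Promise<T> {
  let lastErr: unknown;
  for (let i = 0; i < attempts; i++) {
    try {
      return await fn();
    } catch (err) {
      lastErr = err;
      if (i < attempts - 1) {
        await new Promise((resolve) => setTimeout(resolve, baseMs * 2 ** i));
      }
    }
  }
  throw lastErr;
}
```

In `annotate.ts` this would wrap the `generateObject` call, e.g. `withRetry(() => generateObject({...}))`; adding jitter to the delay would help when many concurrent workers fail at once.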
### Batch Processing: Append-per-record checkpoint
Each successful annotation is appended immediately to the output JSONL. On crash or resume, completed IDs are read from the output file and skipped. Concurrency is controlled with `p-limit` (default 5 concurrent requests).
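The resume logic reduces to a pure function over the output file's contents. A sketch, assuming records carry the `paragraphId` field from the Annotation schema; the function name is illustrative:

```typescript
// Given all paragraph IDs and the raw text of the output JSONL, return
// the IDs still pending. A crash mid-append can leave a truncated final
// line, so unparseable lines are skipped rather than treated as fatal.
function pendingIds(allIds: string[], outputJsonl: string): string[] {
  const done = new Set<string>();
  for (const line of outputJsonl.split("\n")) {
    if (!line.trim()) continue;
    try {
      const rec = JSON.parse(line);
      if (typeof rec.paragraphId === "string") done.add(rec.paragraphId);
    } catch {
      // partial last line from an interrupted write: ignore
    }
  }
  return allIds.filter((id) => !done.has(id));
}
```

Skipping the torn final record means it is simply re-annotated on resume, which is safe because annotation is idempotent per paragraph.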
### Consensus: Stage 1 majority → Stage 2 judge
- Stage 1: 3 models vote. If 2/3 agree on BOTH dimensions → consensus.
- Stage 2: For disagreements, Sonnet 4.6 receives the paragraph plus all three Stage 1 annotations, presented in randomized order to avoid position bias. The judge's label is treated as the authoritative tiebreaker.
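The Stage 1 vote can be sketched as a pure function; the names and return shape here are illustrative, not the actual `consensus.ts`:

```typescript
// Majority vote over one dimension: returns the winning value when a
// strict majority of annotators agree, else null.
function majority<T>(votes: T[]): T | null {
  const counts = new Map<T, number>();
  for (const v of votes) counts.set(v, (counts.get(v) ?? 0) + 1);
  for (const [value, n] of counts) {
    if (n * 2 > votes.length) return value; // 2/3 or 3/3
  }
  return null;
}

// Consensus requires agreement on BOTH dimensions; otherwise escalate
// the paragraph to the Stage 2 judge.
function stage1Consensus(labels: { category: string; specificity: number }[]) {
  const category = majority(labels.map((l) => l.category));
  const specificity = majority(labels.map((l) => l.specificity));
  if (category === null || specificity === null) return null; // → judge
  const unanimous = labels.every(
    (l) => l.category === category && l.specificity === specificity,
  );
  return { method: unanimous ? "unanimous" : "majority", category, specificity };
}
```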
### Training: Multi-head classifier
Shared encoder backbone (ModernBERT/NeoBERT/DeBERTa) → dropout → two linear heads:
- `category_head`: 7-class softmax
- `specificity_head`: 4-class ordinal/softmax
Loss: `α * CE(category) + (1-α) * CE(specificity) + β * SCL`, where α balances the two heads and β weights the supervised contrastive (SCL) term.
### HP Search: Autoresearch `program.md`
- Fixed 30-min time budget per experiment
- Metric: `val_macro_f1`
- Agent modifies ONLY YAML configs, not training scripts
- TSV results log: experiment_id, metric, hyperparameters, verdict (keep/discard)
- Vary ONE hyperparameter per experiment (controlled ablation)
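The keep/discard verdict above can be sketched as a comparison against the best kept run in `results/experiments.tsv`; the exact column names are an assumption mirroring the fields listed:

```typescript
// Parse the TSV experiment log (header row + one row per experiment) and
// decide whether a new run's val_macro_f1 beats the best "keep" so far.
function verdict(tsv: string, newMetric: number): "keep" | "discard" {
  const [header, ...rows] = tsv.trim().split("\n").map((r) => r.split("\t"));
  const mIdx = header.indexOf("metric");
  const vIdx = header.indexOf("verdict");
  let best = -Infinity;
  for (const row of rows) {
    if (row[vIdx] === "keep") best = Math.max(best, Number(row[mIdx]));
  }
  return newMetric > best ? "keep" : "discard";
}
```

Keeping the comparison strictly greater-than means a tie is discarded, which matches the one-hyperparameter-per-experiment discipline: a change that does not improve the metric is not retained.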
---
## Quality Gates
| Gate | When | Key Check | Threshold | If Failed |
|------|------|-----------|-----------|-----------|
| Extraction QA | After Phase 1 | Spot-check 20 filings manually | 18/20 correct | Fix parser |
| Labeling Pilot | 50 paragraphs | Human review of LLM labels | ≥80% agreement | Revise prompt/rubric |
| Scale Pilot | 200 paragraphs | Inter-model Fleiss' Kappa | ≥0.60 | Replace weakest model or revise prompt |
| Human Labeling | Phase 3 | Krippendorff's α (specificity) | ≥0.67 | Collapse 4-pt to 3-pt scale |
| Human Labeling | Phase 3 | Cohen's κ (category) | ≥0.75 | Revise rubric boundaries |
| DAPT | Phase 5 | Perplexity decrease + GLUE check | PPL ↓, GLUE drop <2% | Reduce LR |
| Fine-tuning | Phase 5 | val_macro_f1 by epoch 3 | >0.75 | Check data quality |
| Final | Phase 6 | Holdout macro-F1 (category) | ≥0.80 | Error analysis, iterate |
| Final | Phase 6 | Calibration (ECE) | <0.10 | Temperature scaling |
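The κ ≥ 0.75 category gate uses Cohen's kappa, which discounts raw agreement by the agreement expected by chance from each annotator's label marginals. A self-contained two-annotator sketch (assumes expected chance agreement < 1):

```typescript
// Cohen's kappa for two annotators over the same items:
//   kappa = (p_o - p_e) / (1 - p_e)
// where p_o is observed agreement and p_e is chance agreement computed
// from the two annotators' per-category marginal frequencies.
function cohensKappa(a: string[], b: string[]): number {
  const n = a.length;
  const cats = [...new Set([...a, ...b])];
  let agree = 0;
  const margA = new Map<string, number>();
  const margB = new Map<string, number>();
  for (let i = 0; i < n; i++) {
    if (a[i] === b[i]) agree++;
    margA.set(a[i], (margA.get(a[i]) ?? 0) + 1);
    margB.set(b[i], (margB.get(b[i]) ?? 0) + 1);
  }
  const po = agree / n;
  let pe = 0;
  for (const c of cats) {
    pe += ((margA.get(c) ?? 0) / n) * ((margB.get(c) ?? 0) / n);
  }
  return (po - pe) / (1 - pe);
}
```

`gold/agreement.ts` would pair this with Krippendorff's α for the ordinal specificity dimension, since κ treats all disagreements as equal while α can weight a 1-vs-4 disagreement more heavily than 2-vs-3.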
---
## CLI Commands
```bash
# Extraction
bun sec extract:download-10k --fiscal-year 2023
bun sec extract:parse --type 10k
bun sec extract:segment
bun sec extract:metadata
# Labeling
bun sec label:annotate --model openai/gpt-oss-120b --limit 50 # pilot
bun sec label:annotate-all # full run
bun sec label:consensus
bun sec label:judge
bun sec label:cost
# Gold set
bun sec gold:sample --n 1200
bun sec gold:import-human --annotator annotator-1 --input labels.csv
bun sec gold:agreement
# Benchmarking
bun sec benchmark:run-all
bun sec benchmark:evaluate
bun sec benchmark:table
# Splits
bun sec splits:create
# Python training (GPU required)
uv run python/src/dapt/train_mlm.py --config python/configs/dapt/modernbert-large.yaml
uv run python/src/finetune/train.py --config python/configs/finetune/modernbert-large.yaml --time-budget 1800
uv run python/src/decoder/train_lora.py --config python/configs/decoder/qwen3.5-lora.yaml
uv run python/src/eval/predict.py --split test
uv run python/src/eval/metrics.py
```
---
## Implementation Sequence
### Day 1 (GPU-free) — Foundation
1. `bun init` in ts/, `uv init` in python/, create full directory tree
2. All Zod schemas
3. JSONL utilities, OpenRouter singleton, model registry
4. Prompt builders (from LABELING-CODEBOOK.md)
5. `annotate.ts` + `batch.ts` with checkpoint/resume
6. Test: dry-run 3 paragraphs
### Day 2 (GPU-free) — Extraction + Labeling Pilot
7. EDGAR extraction pipeline (download, parse, segment)
8. Run extraction on a small sample (~100 filings)
9. **Quality Gate 1**: Verify extraction
10. Labeling pilot: 50 paragraphs × 3 models
11. `consensus.ts` + `judge.ts`
12. **Quality Gate 2**: Manual review
13. Scale pilot: 200 paragraphs
14. **Quality Gate 3**: Inter-model agreement
15. If gates pass, launch full Stage 1 annotation
### Day 3+ (GPU-free, labeling runs) — Gold Set + Benchmarking
16. Gold set sampling, human label infrastructure
17. Benchmark runner + metrics
18. Consensus + judge on full corpus
19. Begin human labeling
20. Prepare DAPT corpus
### GPU Available — Training
21. Python training scripts (model.py, train.py, losses.py)
22. `program.md` for autoresearch
23. DAPT (~2-3 days)
24. Fine-tuning ablations via autoresearch
25. Unsloth decoder experiment
26. Final evaluation + error analysis
---
## Verification
After implementation, verify end-to-end:
1. `bun sec extract:segment --limit 10` produces valid Paragraph JSONL
2. `bun sec label:annotate --model openai/gpt-oss-120b --limit 5` returns valid Annotations with cost tracking
3. `bun sec label:consensus` correctly identifies agreement/disagreement
4. `bun sec validate:schema --input data/annotations/stage1/gpt-oss-120b.jsonl --schema annotation` passes
5. Python training script loads JSONL splits and begins training without errors
6. `results/experiments.tsv` gets populated after one autoresearch iteration