
SEC-cyBERT Implementation Plan

Context

Building an SEC cybersecurity disclosure quality classifier for the BUSI488/COMP488 capstone. It follows the Ringel (2023) "Synthetic Experts" pipeline: frontier LLMs label ~50K paragraphs, which are then distilled into a small encoder model. Labels cover two dimensions: content category (7-class) and specificity (4-point ordinal). The GPU is offline for 2 days, so all data/labeling/eval infrastructure is GPU-free and should be built now.


Tech Stack

| Layer | Tool | Notes |
| --- | --- | --- |
| Data/labeling pipeline | TypeScript, Vercel AI SDK 6.0.108, @openrouter/ai-sdk-provider, Zod | generateObject with Zod schemas for structured output |
| Stage 1 annotators | gpt-oss-120b, mimo-v2-flash, grok-4.1-fast | Via OpenRouter |
| Stage 2 judge | Claude Sonnet 4.6 | Via OpenRouter, called only on disagreements |
| Encoder training | HuggingFace Trainer, Python scripts | ModernBERT-large, NeoBERT, DeBERTa-v3-large |
| DAPT | HuggingFace Trainer + DataCollatorForLanguageModeling | Continued MLM on SEC filings |
| Decoder experiment | Unsloth (not Axolotl; decoder-only and slower) | Qwen3.5 LoRA |
| HP search | Autoresearch-style program.md directives | Agent edits YAML, trains for a fixed budget, evaluates, keeps/discards |
| Runtime | bun (TS), uv (Python) | |

Project Structure

sec-cyBERT/
├── docs/
│   ├── PROJECT-OVERVIEW.md
│   ├── LABELING-CODEBOOK.md
│   └── TECHNICAL-GUIDE.md
│
├── ts/                              # TypeScript: data pipeline, labeling, eval
│   ├── package.json
│   ├── tsconfig.json
│   ├── src/
│   │   ├── schemas/                 # Zod schemas (single source of truth)
│   │   │   ├── filing.ts
│   │   │   ├── paragraph.ts
│   │   │   ├── label.ts            # LabelOutput — passed to generateObject
│   │   │   ├── annotation.ts       # Label + provenance (model, cost, latency)
│   │   │   ├── consensus.ts        # Multi-model agreement result
│   │   │   ├── gold.ts             # Human-labeled holdout entry
│   │   │   ├── benchmark.ts        # Model performance metrics
│   │   │   ├── experiment.ts       # Autoresearch training tracker
│   │   │   └── index.ts
│   │   ├── extract/                 # Phase 1: EDGAR extraction
│   │   │   ├── download-10k.ts
│   │   │   ├── parse-item1c.ts
│   │   │   ├── parse-8k.ts
│   │   │   ├── segment.ts
│   │   │   └── metadata.ts
│   │   ├── label/                   # Phase 2: GenAI labeling
│   │   │   ├── annotate.ts          # generateObject + OpenRouter per paragraph
│   │   │   ├── batch.ts             # Concurrency control + JSONL checkpointing
│   │   │   ├── consensus.ts         # Stage 1 majority vote logic
│   │   │   ├── judge.ts             # Stage 2 tiebreaker (Sonnet 4.6)
│   │   │   ├── prompts.ts           # System/user prompt builders
│   │   │   └── cost.ts              # Cost tracking aggregation
│   │   ├── gold/                    # Phase 3: Gold set
│   │   │   ├── sample.ts            # Stratified sampling
│   │   │   ├── human-label.ts       # Human label import
│   │   │   └── agreement.ts         # Krippendorff's alpha, Cohen's kappa
│   │   ├── benchmark/               # Phase 4: GenAI benchmarking
│   │   │   ├── run.ts
│   │   │   └── metrics.ts           # F1, AUC, MCC computation
│   │   ├── lib/                     # Shared utilities
│   │   │   ├── openrouter.ts        # Singleton + model registry with pricing
│   │   │   ├── jsonl.ts             # Read/write/append JSONL
│   │   │   ├── checkpoint.ts        # Resume from last completed ID
│   │   │   └── retry.ts             # Exponential backoff
│   │   └── cli.ts                   # CLI entry point
│   └── tests/
│
├── python/                          # Python: training, DAPT, inference
│   ├── pyproject.toml
│   ├── configs/
│   │   ├── dapt/modernbert-large.yaml
│   │   ├── finetune/
│   │   │   ├── modernbert-large.yaml
│   │   │   ├── neobert.yaml
│   │   │   └── deberta-v3-large.yaml
│   │   └── decoder/qwen3.5-lora.yaml
│   ├── src/
│   │   ├── dapt/train_mlm.py
│   │   ├── finetune/
│   │   │   ├── model.py             # Multi-head classifier (shared backbone)
│   │   │   ├── train.py             # HF Trainer script with --time-budget
│   │   │   ├── data.py
│   │   │   ├── losses.py            # SCL + ordinal + multi-head balancing
│   │   │   └── trainer.py           # Custom Trainer subclass
│   │   ├── decoder/train_lora.py    # Unsloth
│   │   └── eval/
│   │       ├── predict.py
│   │       ├── metrics.py
│   │       └── error_analysis.py
│   └── program.md                   # Autoresearch agent directive
│
├── data/                            # Gitignored heavy files
│   ├── raw/{10k,8k}/
│   ├── extracted/{item1c,item105}/
│   ├── paragraphs/paragraphs.jsonl
│   ├── annotations/
│   │   ├── stage1/{model}.jsonl
│   │   ├── stage2/judge.jsonl
│   │   └── consensus.jsonl
│   ├── gold/
│   │   ├── gold-sample.jsonl
│   │   ├── human-labels/annotator-{1,2,3}.jsonl
│   │   └── gold-adjudicated.jsonl
│   ├── benchmark/runs/{model}.jsonl
│   ├── splits/{train,val,test}.jsonl
│   └── dapt-corpus/sec-texts.jsonl
│
├── models/                          # Gitignored checkpoints
├── results/
│   ├── experiments.tsv              # Autoresearch log
│   └── figures/
└── .gitignore

Core Schemas (Zod)

label.ts — the contract passed to generateObject:

import { z } from "zod";

export const ContentCategory = z.enum([
  "Board Governance", "Management Role", "Risk Management Process",
  "Third-Party Risk", "Incident Disclosure", "Strategy Integration", "None/Other",
]);
export const SpecificityLevel = z.union([z.literal(1), z.literal(2), z.literal(3), z.literal(4)]);
export const LabelOutput = z.object({
  content_category: ContentCategory,
  specificity_level: SpecificityLevel,
  reasoning: z.string().max(500),
});

annotation.ts — label + full provenance:

export const Annotation = z.object({
  paragraphId: z.string().uuid(),
  label: LabelOutput,
  provenance: z.object({
    modelId: z.string(),
    provider: z.string(),
    stage: z.enum(["stage1", "stage2-judge"]),
    runId: z.string().uuid(),
    promptVersion: z.string(),
    inputTokens: z.number(),
    outputTokens: z.number(),
    estimatedCostUsd: z.number(),
    latencyMs: z.number(),
    requestedAt: z.string().datetime(),
  }),
});

consensus.ts — multi-model agreement:

export const ConsensusResult = z.object({
  paragraphId: z.string().uuid(),
  finalLabel: LabelOutput,
  method: z.enum(["unanimous", "majority", "judge-resolved", "unresolved"]),
  categoryAgreement: z.object({ votes: z.record(z.number()), agreed: z.boolean() }),
  specificityAgreement: z.object({ votes: z.record(z.number()), agreed: z.boolean(), spread: z.number() }),
  stage1ModelIds: z.array(z.string()),
  stage2JudgeModelId: z.string().nullable(),
  confidence: z.number().min(0).max(1),
});

Full schemas for filing, paragraph, gold, benchmark, and experiment types follow the same pattern — see the full plan agent output for complete definitions.


Data Flow

Phase 1: EXTRACTION (GPU-free)
  EDGAR API → download 10-K/8-K → parse Item 1C/1.05 → segment into paragraphs
  → enrich with company metadata → data/paragraphs/paragraphs.jsonl (~50-70K records)
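The segmentation step above can be sketched as below. The blank-line delimiter and the 200-character minimum are illustrative assumptions, not settled parsing rules.

```typescript
// Hypothetical core of segment.ts: split an extracted Item 1C/1.05 section
// into paragraph records. Delimiter regex and minChars are assumed values.
export interface ParagraphRecord {
  text: string;
  index: number; // position among the kept paragraphs
}

export function segmentParagraphs(sectionText: string, minChars = 200): ParagraphRecord[] {
  return sectionText
    .split(/\n\s*\n/)                          // blank lines delimit paragraphs
    .map((p) => p.replace(/\s+/g, " ").trim()) // collapse internal whitespace
    .filter((p) => p.length >= minChars)       // drop headings and fragments
    .map((text, index) => ({ text, index }));
}
```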

Phase 2: LABELING (GPU-free)
  paragraphs.jsonl → Stage 1: 3 models annotate all → consensus (expect ~83% agree)
  → disagreements → Stage 2: Sonnet 4.6 judges → final consensus.jsonl

Phase 3: GOLD SET (GPU-free)
  Stratified sample 1,200 → 3 humans label independently → compute agreement
  → adjudicate → gold-adjudicated.jsonl (LOCKED holdout)
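The stratified draw of 1,200 gold paragraphs needs a per-stratum allocation first. A hedged sketch, in which proportional allocation with largest-remainder rounding is an assumed design choice:

```typescript
// Hypothetical allocation step for gold/sample.ts: decide how many gold
// paragraphs to draw from each category stratum, proportional to stratum
// size, rounding so the counts sum exactly to the target.
export function allocate(strata: Record<string, number>, target: number): Record<string, number> {
  const total = Object.values(strata).reduce((a, b) => a + b, 0);
  const exact = Object.entries(strata).map(([k, n]) => ({ k, raw: (n / total) * target }));
  const alloc: Record<string, number> = {};
  let used = 0;
  for (const e of exact) { alloc[e.k] = Math.floor(e.raw); used += alloc[e.k]; }
  // hand the remainder to the strata with the largest fractional parts
  exact
    .sort((a, b) => (b.raw - Math.floor(b.raw)) - (a.raw - Math.floor(a.raw)))
    .slice(0, target - used)
    .forEach((e) => { alloc[e.k] += 1; });
  return alloc;
}
```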

Phase 4: BENCHMARKING (GPU-free)
  Run 6+ models on holdout → compute F1/AUC/MCC/Krippendorff's α → comparison table

Phase 5: TRAINING (REQUIRES GPU)
  DAPT: SEC-ModernBERT-large (continued MLM on SEC filings)
  Encoder FT: SEC-ModernBERT, ModernBERT, NeoBERT, DeBERTa (5 ablations)
  Decoder FT: Qwen3.5 via Unsloth LoRA
  HP search: autoresearch program.md — agent iterates autonomously

Phase 6: EVALUATION (REQUIRES GPU)
  Inference on holdout → metrics → error analysis → validity tests → final comparison
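The metric runs in Phases 4 and 6 center on macro-F1. A minimal sketch of that computation; the real metrics.ts would also need AUC, MCC, and the agreement statistics:

```typescript
// Macro-F1: per-class F1 averaged with equal class weight, which is why it
// suits the imbalanced 7-class category dimension.
export function macroF1(gold: string[], pred: string[]): number {
  const labels = [...new Set([...gold, ...pred])];
  const f1s = labels.map((c) => {
    let tp = 0, fp = 0, fn = 0;
    gold.forEach((g, i) => {
      if (pred[i] === c && g === c) tp++;
      else if (pred[i] === c) fp++;
      else if (g === c) fn++;
    });
    const prec = tp + fp === 0 ? 0 : tp / (tp + fp);
    const rec = tp + fn === 0 ? 0 : tp / (tp + fn);
    return prec + rec === 0 ? 0 : (2 * prec * rec) / (prec + rec);
  });
  return f1s.reduce((a, b) => a + b, 0) / f1s.length;
}
```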

Key Architecture Patterns

Annotation: generateObject + OpenRouter

import { generateObject } from "ai";
import { openrouter } from "../lib/openrouter";   // provider singleton (lib/openrouter.ts)
import { LabelOutput } from "../schemas/label";

const result = await generateObject({
  model: openrouter(modelId),
  schema: LabelOutput,
  system: buildSystemPrompt(),
  prompt: buildUserPrompt(paragraph),
  temperature: 0,
  mode: "json",
});

Batch Processing: Append-per-record checkpoint

Each successful annotation is appended to the output JSONL immediately. On crash/resume, the completed IDs are read from the output file and skipped. Concurrency is controlled with p-limit (default 5).
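The resume logic reduces to a pure function over the output file's contents. A sketch, assuming each line is an Annotation record with the paragraphId field from the schema above:

```typescript
// Given the JSONL already written and the full work list, return only the
// paragraph IDs still to annotate. checkpoint.ts would wrap this with the
// actual file read.
export function pendingIds(allIds: string[], completedJsonl: string): string[] {
  const done = new Set(
    completedJsonl
      .split("\n")
      .filter((line) => line.trim().length > 0)
      .map((line) => JSON.parse(line).paragraphId as string),
  );
  return allIds.filter((id) => !done.has(id));
}
```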

Consensus: Stage 1 majority → Stage 2 judge

  • Stage 1: 3 models vote. If 2/3 agree on BOTH dimensions → consensus.
  • Stage 2: For disagreements, Sonnet 4.6 receives the paragraph plus all 3 Stage 1 annotations (in randomized order to avoid position bias). The judge's label is treated as the authoritative tiebreaker.
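The Stage 1 vote can be sketched as follows. Taking the majority independently per dimension is one reading of the 2-of-3 rule, so treat that as an assumption:

```typescript
// Hedged sketch of consensus.ts: 2-of-3 agreement, checked per dimension.
// Returns the agreed label, or null to escalate to the Stage 2 judge.
interface Vote { content_category: string; specificity_level: number }

function majority<T>(values: T[]): T | null {
  const counts = new Map<T, number>();
  for (const v of values) counts.set(v, (counts.get(v) ?? 0) + 1);
  for (const [v, n] of counts) if (n >= 2) return v;
  return null;
}

export function stage1Consensus(votes: Vote[]): Vote | null {
  const category = majority(votes.map((v) => v.content_category));
  const specificity = majority(votes.map((v) => v.specificity_level));
  if (category === null || specificity === null) return null; // → Stage 2 judge
  return { content_category: category, specificity_level: specificity };
}
```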

Training: Multi-head classifier

Shared encoder backbone (ModernBERT/NeoBERT/DeBERTa) → dropout → two linear heads:

  • category_head: 7-class softmax
  • specificity_head: 4-class ordinal/softmax

Loss: α * CE(category) + (1-α) * CE(specificity) + β * SCL

HP Search: Autoresearch program.md

  • Fixed 30-min time budget per experiment
  • Metric: val_macro_f1
  • Agent modifies ONLY YAML configs, not training scripts
  • TSV results log: experiment_id, metric, hyperparameters, verdict (keep/discard)
  • Vary ONE hyperparameter per experiment (controlled ablation)
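The bookkeeping for one iteration might look like this; the column order and the strictly-greater keep rule are assumptions:

```typescript
// Hypothetical helper for the autoresearch loop: format one experiments.tsv
// row and decide keep/discard by comparing val_macro_f1 to the best so far.
export function experimentRow(
  id: string,
  metric: number,
  hyperparams: Record<string, string | number>,
  bestSoFar: number,
): { line: string; verdict: "keep" | "discard" } {
  const verdict = metric > bestSoFar ? "keep" : "discard";
  const hp = Object.entries(hyperparams).map(([k, v]) => `${k}=${v}`).join(",");
  return { line: [id, metric.toFixed(4), hp, verdict].join("\t"), verdict };
}
```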

Quality Gates

| Gate | When | Key Check | Threshold | If Failed |
| --- | --- | --- | --- | --- |
| Extraction QA | After Phase 1 | Spot-check 20 filings manually | 18/20 correct | Fix parser |
| Labeling Pilot | 50 paragraphs | Human review of LLM labels | ≥80% agreement | Revise prompt/rubric |
| Scale Pilot | 200 paragraphs | Inter-model Fleiss' κ | ≥0.60 | Replace weakest model or revise prompt |
| Human Labeling | Phase 3 | Krippendorff's α (specificity) | ≥0.67 | Collapse 4-pt to 3-pt scale |
| Human Labeling | Phase 3 | Cohen's κ (category) | ≥0.75 | Revise rubric boundaries |
| DAPT | Phase 5 | Perplexity decrease + GLUE check | PPL ↓, GLUE drop <2% | Reduce LR |
| Fine-tuning | Phase 5 | val_macro_f1 by epoch 3 | >0.75 | Check data quality |
| Final | Phase 6 | Holdout macro-F1 (category) | ≥0.80 | Error analysis, iterate |
| Final | Phase 6 | Calibration (ECE) | <0.10 | Temperature scaling |
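The category-agreement gate (Cohen's κ ≥ 0.75) is cheap to compute directly. A sketch for two annotators; agreement.ts would also need Krippendorff's α for the ordinal specificity dimension:

```typescript
// Cohen's kappa: observed agreement corrected for the agreement expected
// under independent marginal label distributions.
export function cohensKappa(a: string[], b: string[]): number {
  const n = a.length;
  const labels = [...new Set([...a, ...b])];
  const po = a.filter((x, i) => x === b[i]).length / n; // observed agreement
  const pe = labels.reduce((sum, c) => {
    const pa = a.filter((x) => x === c).length / n;
    const pb = b.filter((x) => x === c).length / n;
    return sum + pa * pb;                               // chance agreement
  }, 0);
  return (po - pe) / (1 - pe);
}
```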

CLI Commands

# Extraction
bun sec extract:download-10k --fiscal-year 2023
bun sec extract:parse --type 10k
bun sec extract:segment
bun sec extract:metadata

# Labeling
bun sec label:annotate --model openai/gpt-oss-120b --limit 50   # pilot
bun sec label:annotate-all                                        # full run
bun sec label:consensus
bun sec label:judge
bun sec label:cost

# Gold set
bun sec gold:sample --n 1200
bun sec gold:import-human --annotator annotator-1 --input labels.csv
bun sec gold:agreement

# Benchmarking
bun sec benchmark:run-all
bun sec benchmark:evaluate
bun sec benchmark:table

# Splits
bun sec splits:create

# Python training (GPU required)
uv run python/src/dapt/train_mlm.py --config python/configs/dapt/modernbert-large.yaml
uv run python/src/finetune/train.py --config python/configs/finetune/modernbert-large.yaml --time-budget 1800
uv run python/src/decoder/train_lora.py --config python/configs/decoder/qwen3.5-lora.yaml
uv run python/src/eval/predict.py --split test
uv run python/src/eval/metrics.py

Implementation Sequence

Day 1 (GPU-free) — Foundation

  1. bun init in ts/, uv init in python/, create full directory tree
  2. All Zod schemas
  3. JSONL utilities, OpenRouter singleton, model registry
  4. Prompt builders (from LABELING-CODEBOOK.md)
  5. annotate.ts + batch.ts with checkpoint/resume
  6. Test: dry-run 3 paragraphs

Day 2 (GPU-free) — Extraction + Labeling Pilot

  1. EDGAR extraction pipeline (download, parse, segment)
  2. Run extraction on a small sample (~100 filings)
  3. Quality Gate 1: Verify extraction
  4. Labeling pilot: 50 paragraphs × 3 models
  5. consensus.ts + judge.ts
  6. Quality Gate 2: Manual review
  7. Scale pilot: 200 paragraphs
  8. Quality Gate 3: Inter-model agreement
  9. If gates pass → launch full Stage 1 annotation

Day 3+ (GPU-free, labeling runs) — Gold Set + Benchmarking

  1. Gold set sampling, human label infrastructure
  2. Benchmark runner + metrics
  3. Consensus + judge on full corpus
  4. Begin human labeling
  5. Prepare DAPT corpus

GPU Available — Training

  1. Python training scripts (model.py, train.py, losses.py)
  2. program.md for autoresearch
  3. DAPT (~2-3 days)
  4. Fine-tuning ablations via autoresearch
  5. Unsloth decoder experiment
  6. Final evaluation + error analysis

Verification

After implementation, verify end-to-end:

  1. bun sec extract:segment --limit 10 produces valid Paragraph JSONL
  2. bun sec label:annotate --model openai/gpt-oss-120b --limit 5 returns valid Annotations with cost tracking
  3. bun sec label:consensus correctly identifies agreement/disagreement
  4. bun sec validate:schema --input data/annotations/stage1/gpt-oss-120b.jsonl --schema annotation passes
  5. Python training script loads JSONL splits and begins training without errors
  6. results/experiments.tsv gets populated after one autoresearch iteration