# SEC-cyBERT Implementation Plan

## Context

Building an SEC cybersecurity disclosure quality classifier for the BUSI488/COMP488 capstone, following the Ringel (2023) "Synthetic Experts" pipeline: frontier LLMs label ~50K paragraphs, then we distill the labels into a small encoder model. Two label dimensions: content category (7-class) + specificity (4-point ordinal).

The GPU is offline for 2 days, so all data/labeling/eval infrastructure is GPU-free and should be built now.

---

## Tech Stack

| Layer | Tool | Notes |
|-------|------|-------|
| Data/labeling pipeline | TypeScript, Vercel AI SDK 6.0.108, `@openrouter/ai-sdk-provider`, Zod | `generateObject` with Zod schemas for structured output |
| Stage 1 annotators | gpt-oss-120b, mimo-v2-flash, grok-4.1-fast | Via OpenRouter |
| Stage 2 judge | Claude Sonnet 4.6 | Via OpenRouter, called only on disagreements |
| Encoder training | HuggingFace Trainer, Python scripts | ModernBERT-large, NeoBERT, DeBERTa-v3-large |
| DAPT | HuggingFace Trainer + DataCollatorForLanguageModeling | Continued MLM on SEC filings |
| Decoder experiment | Unsloth (NOT Axolotl; it's decoder-only and slower) | Qwen3.5 LoRA |
| HP search | Autoresearch-style `program.md` directives | Agent edits YAML, trains for fixed budget, evaluates, keeps/discards |
| Runtime | bun (TS), uv (Python) | |

---

## Project Structure

```
sec-cyBERT/
├── docs/
│   ├── PROJECT-OVERVIEW.md
│   ├── LABELING-CODEBOOK.md
│   └── TECHNICAL-GUIDE.md
│
├── ts/                            # TypeScript: data pipeline, labeling, eval
│   ├── package.json
│   ├── tsconfig.json
│   ├── src/
│   │   ├── schemas/               # Zod schemas (single source of truth)
│   │   │   ├── filing.ts
│   │   │   ├── paragraph.ts
│   │   │   ├── label.ts           # LabelOutput — passed to generateObject
│   │   │   ├── annotation.ts      # Label + provenance (model, cost, latency)
│   │   │   ├── consensus.ts       # Multi-model agreement result
│   │   │   ├── gold.ts            # Human-labeled holdout entry
│   │   │   ├── benchmark.ts       # Model performance metrics
│   │   │   ├── experiment.ts      # Autoresearch training tracker
│   │   │   └── index.ts
│   │   ├── extract/               # Phase 1: EDGAR extraction
│   │   │   ├── download-10k.ts
│   │   │   ├── parse-item1c.ts
│   │   │   ├── parse-8k.ts
│   │   │   ├── segment.ts
│   │   │   └── metadata.ts
│   │   ├── label/                 # Phase 2: GenAI labeling
│   │   │   ├── annotate.ts        # generateObject + OpenRouter per paragraph
│   │   │   ├── batch.ts           # Concurrency control + JSONL checkpointing
│   │   │   ├── consensus.ts       # Stage 1 majority vote logic
│   │   │   ├── judge.ts           # Stage 2 tiebreaker (Sonnet 4.6)
│   │   │   ├── prompts.ts         # System/user prompt builders
│   │   │   └── cost.ts            # Cost tracking aggregation
│   │   ├── gold/                  # Phase 3: Gold set
│   │   │   ├── sample.ts          # Stratified sampling
│   │   │   ├── human-label.ts     # Human label import
│   │   │   └── agreement.ts       # Krippendorff's alpha, Cohen's kappa
│   │   ├── benchmark/             # Phase 4: GenAI benchmarking
│   │   │   ├── run.ts
│   │   │   └── metrics.ts         # F1, AUC, MCC computation
│   │   ├── lib/                   # Shared utilities
│   │   │   ├── openrouter.ts      # Singleton + model registry with pricing
│   │   │   ├── jsonl.ts           # Read/write/append JSONL
│   │   │   ├── checkpoint.ts      # Resume from last completed ID
│   │   │   └── retry.ts           # Exponential backoff
│   │   └── cli.ts                 # CLI entry point
│   └── tests/
│
├── python/                        # Python: training, DAPT, inference
│   ├── pyproject.toml
│   ├── configs/
│   │   ├── dapt/modernbert-large.yaml
│   │   ├── finetune/
│   │   │   ├── modernbert-large.yaml
│   │   │   ├── neobert.yaml
│   │   │   └── deberta-v3-large.yaml
│   │   └── decoder/qwen3.5-lora.yaml
│   ├── src/
│   │   ├── dapt/train_mlm.py
│   │   ├── finetune/
│   │   │   ├── model.py           # Multi-head classifier (shared backbone)
│   │   │   ├── train.py           # HF Trainer script with --time-budget
│   │   │   ├── data.py
│   │   │   ├── losses.py          # SCL + ordinal + multi-head balancing
│   │   │   └── trainer.py         # Custom Trainer subclass
│   │   ├── decoder/train_lora.py  # Unsloth
│   │   └── eval/
│   │       ├── predict.py
│   │       ├── metrics.py
│   │       └── error_analysis.py
│   └── program.md                 # Autoresearch agent directive
│
├── data/                          # Gitignored heavy files
│   ├── raw/{10k,8k}/
│   ├── extracted/{item1c,item105}/
│   ├── paragraphs/paragraphs.jsonl
│   ├── annotations/
│   │   ├── stage1/{model}.jsonl
│   │   ├── stage2/judge.jsonl
│   │   └── consensus.jsonl
│   ├── gold/
│   │   ├── gold-sample.jsonl
│   │   ├── human-labels/annotator-{1,2,3}.jsonl
│   │   └── gold-adjudicated.jsonl
│   ├── benchmark/runs/{model}.jsonl
│   ├── splits/{train,val,test}.jsonl
│   └── dapt-corpus/sec-texts.jsonl
│
├── models/                        # Gitignored checkpoints
├── results/
│   ├── experiments.tsv            # Autoresearch log
│   └── figures/
└── .gitignore
```

---

## Core Schemas (Zod)

**`label.ts`** — the contract passed to `generateObject`:

```typescript
import { z } from "zod";

export const ContentCategory = z.enum([
  "Board Governance",
  "Management Role",
  "Risk Management Process",
  "Third-Party Risk",
  "Incident Disclosure",
  "Strategy Integration",
  "None/Other",
]);

export const SpecificityLevel = z.union([
  z.literal(1),
  z.literal(2),
  z.literal(3),
  z.literal(4),
]);

export const LabelOutput = z.object({
  content_category: ContentCategory,
  specificity_level: SpecificityLevel,
  reasoning: z.string().max(500),
});
```

**`annotation.ts`** — label + full provenance:

```typescript
import { z } from "zod";
import { LabelOutput } from "./label";

export const Annotation = z.object({
  paragraphId: z.string().uuid(),
  label: LabelOutput,
  provenance: z.object({
    modelId: z.string(),
    provider: z.string(),
    stage: z.enum(["stage1", "stage2-judge"]),
    runId: z.string().uuid(),
    promptVersion: z.string(),
    inputTokens: z.number(),
    outputTokens: z.number(),
    estimatedCostUsd: z.number(),
    latencyMs: z.number(),
    requestedAt: z.string().datetime(),
  }),
});
```

**`consensus.ts`** — multi-model agreement:

```typescript
import { z } from "zod";
import { LabelOutput } from "./label";

export const ConsensusResult = z.object({
  paragraphId: z.string().uuid(),
  finalLabel: LabelOutput,
  method: z.enum(["unanimous", "majority", "judge-resolved", "unresolved"]),
  categoryAgreement: z.object({
    votes: z.record(z.number()),
    agreed: z.boolean(),
  }),
  specificityAgreement: z.object({
    votes: z.record(z.number()),
    agreed: z.boolean(),
    spread: z.number(),
  }),
  stage1ModelIds: z.array(z.string()),
  stage2JudgeModelId: z.string().nullable(),
  confidence: z.number().min(0).max(1),
});
```

Full schemas for filing,
paragraph, gold, benchmark, and experiment types follow the same pattern — see the full plan agent output for complete definitions.

---

## Data Flow

```
Phase 1: EXTRACTION (GPU-free)
  EDGAR API → download 10-K/8-K → parse Item 1C/1.05
  → segment into paragraphs → enrich with company metadata
  → data/paragraphs/paragraphs.jsonl (~50-70K records)

Phase 2: LABELING (GPU-free)
  paragraphs.jsonl → Stage 1: 3 models annotate all
  → consensus (expect ~83% agree) → disagreements
  → Stage 2: Sonnet 4.6 judges → final consensus.jsonl

Phase 3: GOLD SET (GPU-free)
  Stratified sample 1,200 → 3 humans label independently
  → compute agreement → adjudicate
  → gold-adjudicated.jsonl (LOCKED holdout)

Phase 4: BENCHMARKING (GPU-free)
  Run 6+ models on holdout
  → compute F1/AUC/MCC/Krippendorff's α → comparison table

Phase 5: TRAINING (REQUIRES GPU)
  DAPT: SEC-ModernBERT-large (continued MLM on SEC filings)
  Encoder FT: SEC-ModernBERT, ModernBERT, NeoBERT, DeBERTa (5 ablations)
  Decoder FT: Qwen3.5 via Unsloth LoRA
  HP search: autoresearch program.md — agent iterates autonomously

Phase 6: EVALUATION (REQUIRES GPU)
  Inference on holdout → metrics → error analysis
  → validity tests → final comparison
```

---

## Key Architecture Patterns

### Annotation: `generateObject` + OpenRouter

```typescript
import { generateObject } from "ai";
import { openrouter } from "../lib/openrouter";

const result = await generateObject({
  model: openrouter(modelId),
  schema: LabelOutput,
  system: buildSystemPrompt(),
  prompt: buildUserPrompt(paragraph),
  temperature: 0,
  mode: "json",
});
```

### Batch Processing: Append-per-record checkpoint

Each successful annotation appends immediately to JSONL. On crash/resume, read completed IDs from the output file and skip them. Uses `p-limit` for concurrency control (default 5).

### Consensus: Stage 1 majority → Stage 2 judge

- Stage 1: 3 models vote. If 2/3 agree on BOTH dimensions → consensus.
- Stage 2: For disagreements, Sonnet 4.6 gets the paragraph + all 3 annotations (randomized order to avoid position bias). The judge's label is treated as the authoritative tiebreaker.
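The Stage 1 vote described above can be sketched in plain TypeScript. This is an illustrative sketch, not the actual `consensus.ts`; names like `stage1Consensus` and the `needs-judge` marker are assumptions:

```typescript
type Vote = { category: string; specificity: number };
type VoteResult =
  | { method: "unanimous" | "majority"; final: Vote }
  | { method: "needs-judge" };

// Return the value at least 2 of the 3 annotators agree on, or null.
function majorityOf<T>(values: T[]): T | null {
  const counts = new Map<T, number>();
  for (const v of values) counts.set(v, (counts.get(v) ?? 0) + 1);
  for (const [v, n] of counts) if (n >= 2) return v;
  return null;
}

// Stage 1 consensus: BOTH dimensions must independently reach 2/3 agreement,
// otherwise the paragraph is escalated to the Stage 2 judge.
export function stage1Consensus(votes: Vote[]): VoteResult {
  const category = majorityOf(votes.map((v) => v.category));
  const specificity = majorityOf(votes.map((v) => v.specificity));
  if (category === null || specificity === null) {
    return { method: "needs-judge" };
  }
  const unanimous = votes.every(
    (v) => v.category === category && v.specificity === specificity,
  );
  return { method: unanimous ? "unanimous" : "majority", final: { category, specificity } };
}
```

Note that voting per dimension means the `final` label can be a combination no single annotator produced; whether that is acceptable or should also trigger the judge is a design choice to settle in the real implementation.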
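The append-per-record checkpoint pattern is small enough to sketch end to end. A minimal version using Node's synchronous fs API, assuming one JSON object per line with a `paragraphId` field (the real `checkpoint.ts`/`jsonl.ts` split may differ):

```typescript
import { existsSync, readFileSync, appendFileSync } from "node:fs";

// Collect IDs already written to the output JSONL, so a re-run can skip them.
export function completedIds(outPath: string): Set<string> {
  if (!existsSync(outPath)) return new Set();
  const ids = new Set<string>();
  for (const line of readFileSync(outPath, "utf8").split("\n")) {
    if (!line.trim()) continue; // tolerate a trailing newline
    ids.add(JSON.parse(line).paragraphId as string);
  }
  return ids;
}

// Append one record immediately after each successful annotation.
export function checkpoint(outPath: string, record: { paragraphId: string }): void {
  appendFileSync(outPath, JSON.stringify(record) + "\n");
}

// Filter the work queue down to paragraphs not yet annotated.
export function pending<T extends { paragraphId: string }>(
  all: T[],
  done: Set<string>,
): T[] {
  return all.filter((p) => !done.has(p.paragraphId));
}
```

Because each append is a single line, a crash mid-run loses at most the in-flight requests; resuming is just `pending(all, completedIds(outPath))` before handing work to `p-limit`.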
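The shared `retry.ts` utility backing these patterns can be a simple exponential-backoff wrapper with jitter. A sketch under assumed defaults (attempt count and base delay are illustrative, not the real values):

```typescript
// Retry an async operation with exponential backoff plus jitter.
export async function withRetry<T>(
  fn: () => Promise<T>,
  maxAttempts = 4,
  baseMs = 500,
): Promise<T> {
  let lastError: unknown;
  for (let attempt = 0; attempt < maxAttempts; attempt++) {
    try {
      return await fn();
    } catch (err) {
      lastError = err;
      // Delay doubles each attempt; jitter spreads concurrent retries apart.
      const delay = baseMs * 2 ** attempt * (0.5 + Math.random());
      await new Promise((resolve) => setTimeout(resolve, delay));
    }
  }
  throw lastError;
}
```

A production version would likely also inspect the error (retry 429/5xx, fail fast on 400s) and respect any `Retry-After` header OpenRouter returns.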
### Training: Multi-head classifier

Shared encoder backbone (ModernBERT/NeoBERT/DeBERTa) → dropout → two linear heads:

- `category_head`: 7-class softmax
- `specificity_head`: 4-class ordinal/softmax

Loss: `α * CE(category) + (1-α) * CE(specificity) + β * SCL`

### HP Search: Autoresearch `program.md`

- Fixed 30-min time budget per experiment
- Metric: `val_macro_f1`
- Agent modifies ONLY YAML configs, not training scripts
- TSV results log: experiment_id, metric, hyperparameters, verdict (keep/discard)
- Vary ONE hyperparameter per experiment (controlled ablation)

---

## Quality Gates

| Gate | When | Key Check | Threshold | If Failed |
|------|------|-----------|-----------|-----------|
| Extraction QA | After Phase 1 | Spot-check 20 filings manually | 18/20 correct | Fix parser |
| Labeling Pilot | 50 paragraphs | Human review of LLM labels | ≥80% agreement | Revise prompt/rubric |
| Scale Pilot | 200 paragraphs | Inter-model Fleiss' kappa | ≥0.60 | Replace weakest model or revise prompt |
| Human Labeling | Phase 3 | Krippendorff's α (specificity) | ≥0.67 | Collapse 4-pt to 3-pt scale |
| Human Labeling | Phase 3 | Cohen's κ (category) | ≥0.75 | Revise rubric boundaries |
| DAPT | Phase 5 | Perplexity decrease + GLUE check | PPL ↓, GLUE drop <2% | Reduce LR |
| Fine-tuning | Phase 5 | val_macro_f1 by epoch 3 | >0.75 | Check data quality |
| Final | Phase 6 | Holdout macro-F1 (category) | ≥0.80 | Error analysis, iterate |
| Final | Phase 6 | Calibration (ECE) | <0.10 | Temperature scaling |

---

## CLI Commands

```bash
# Extraction
bun sec extract:download-10k --fiscal-year 2023
bun sec extract:parse --type 10k
bun sec extract:segment
bun sec extract:metadata

# Labeling
bun sec label:annotate --model openai/gpt-oss-120b --limit 50   # pilot
bun sec label:annotate-all                                      # full run
bun sec label:consensus
bun sec label:judge
bun sec label:cost

# Gold set
bun sec gold:sample --n 1200
bun sec gold:import-human --annotator annotator-1 --input labels.csv
bun sec gold:agreement

# Benchmarking
bun sec benchmark:run-all
bun sec benchmark:evaluate
bun sec benchmark:table

# Splits
bun sec splits:create

# Python training (GPU required)
uv run python/src/dapt/train_mlm.py --config python/configs/dapt/modernbert-large.yaml
uv run python/src/finetune/train.py --config python/configs/finetune/modernbert-large.yaml --time-budget 1800
uv run python/src/decoder/train_lora.py --config python/configs/decoder/qwen3.5-lora.yaml
uv run python/src/eval/predict.py --split test
uv run python/src/eval/metrics.py
```

---

## Implementation Sequence

### Day 1 (GPU-free) — Foundation

1. `bun init` in ts/, `uv init` in python/, create full directory tree
2. All Zod schemas
3. JSONL utilities, OpenRouter singleton, model registry
4. Prompt builders (from LABELING-CODEBOOK.md)
5. `annotate.ts` + `batch.ts` with checkpoint/resume
6. Test: dry-run 3 paragraphs

### Day 2 (GPU-free) — Extraction + Labeling Pilot

7. EDGAR extraction pipeline (download, parse, segment)
8. Run extraction on a small sample (~100 filings)
9. **Quality Gate 1**: Verify extraction
10. Labeling pilot: 50 paragraphs × 3 models
11. `consensus.ts` + `judge.ts`
12. **Quality Gate 2**: Manual review
13. Scale pilot: 200 paragraphs
14. **Quality Gate 3**: Inter-model agreement
15. If gates pass → launch full Stage 1 annotation

### Day 3+ (GPU-free, labeling runs) — Gold Set + Benchmarking

16. Gold set sampling, human label infrastructure
17. Benchmark runner + metrics
18. Consensus + judge on full corpus
19. Begin human labeling
20. Prepare DAPT corpus

### GPU Available — Training

21. Python training scripts (model.py, train.py, losses.py)
22. `program.md` for autoresearch
23. DAPT (~2-3 days)
24. Fine-tuning ablations via autoresearch
25. Unsloth decoder experiment
26. Final evaluation + error analysis

---

## Verification

After implementation, verify end-to-end:

1. `bun sec extract:segment --limit 10` produces valid Paragraph JSONL
2. `bun sec label:annotate --model openai/gpt-oss-120b --limit 5` returns valid Annotations with cost tracking
3. `bun sec label:consensus` correctly identifies agreement/disagreement
4. `bun sec validate:schema --input data/annotations/stage1/gpt-oss-120b.jsonl --schema annotation` passes
5. Python training script loads JSONL splits and begins training without errors
6. `results/experiments.tsv` gets populated after one autoresearch iteration
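Since macro-F1 is both the HP-search objective (`val_macro_f1`) and the final quality-gate metric, it helps to have a reference implementation to sanity-check `metrics.ts` against. A generic sketch (not the project code; `macroF1` is an assumed name):

```typescript
// Macro-F1: unweighted mean of per-class F1 scores over the union of
// classes seen in the gold labels and the predictions.
export function macroF1(yTrue: string[], yPred: string[]): number {
  const classes = [...new Set([...yTrue, ...yPred])];
  const f1s = classes.map((c) => {
    let tp = 0, fp = 0, fn = 0;
    for (let i = 0; i < yTrue.length; i++) {
      if (yPred[i] === c && yTrue[i] === c) tp++;
      else if (yPred[i] === c) fp++;
      else if (yTrue[i] === c) fn++;
    }
    // F1 = 2TP / (2TP + FP + FN); defined as 0 when the denominator is 0.
    const denom = 2 * tp + fp + fn;
    return denom === 0 ? 0 : (2 * tp) / denom;
  });
  return f1s.reduce((a, b) => a + b, 0) / classes.length;
}
```

Because it averages per-class scores without frequency weighting, rare categories like Incident Disclosure count as much as common ones, which is exactly why it was chosen over accuracy for the gates.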