# SEC-cyBERT Implementation Plan

## Context

Building an SEC cybersecurity disclosure quality classifier for the BUSI488/COMP488 capstone, following the Ringel (2023) "Synthetic Experts" pipeline: frontier LLMs label ~50K paragraphs, and the labels are then distilled into a small encoder model. Labels have two dimensions: content category (7-class) and specificity (4-point ordinal). The GPU is offline for 2 days, so all data/labeling/eval infrastructure is GPU-free and should be built now.
## Tech Stack

| Layer | Tool | Notes |
|---|---|---|
| Data/labeling pipeline | TypeScript, Vercel AI SDK 6.0.108, @openrouter/ai-sdk-provider, Zod | `generateObject` with Zod schemas for structured output |
| Stage 1 annotators | gpt-oss-120b, mimo-v2-flash, grok-4.1-fast | Via OpenRouter |
| Stage 2 judge | Claude Sonnet 4.6 | Via OpenRouter, called only on disagreements |
| Encoder training | HuggingFace Trainer, Python scripts | ModernBERT-large, NeoBERT, DeBERTa-v3-large |
| DAPT | HuggingFace Trainer + DataCollatorForLanguageModeling | Continued MLM on SEC filings |
| Decoder experiment | Unsloth (NOT Axolotl; it's decoder-only and slower) | Qwen3.5 LoRA |
| HP search | Autoresearch-style program.md directives | Agent edits YAML, trains for fixed budget, evaluates, keeps/discards |
| Runtime | bun (TS), uv (Python) | |
## Project Structure

```
sec-cyBERT/
├── docs/
│   ├── PROJECT-OVERVIEW.md
│   ├── LABELING-CODEBOOK.md
│   └── TECHNICAL-GUIDE.md
│
├── ts/                              # TypeScript: data pipeline, labeling, eval
│   ├── package.json
│   ├── tsconfig.json
│   ├── src/
│   │   ├── schemas/                 # Zod schemas (single source of truth)
│   │   │   ├── filing.ts
│   │   │   ├── paragraph.ts
│   │   │   ├── label.ts             # LabelOutput — passed to generateObject
│   │   │   ├── annotation.ts        # Label + provenance (model, cost, latency)
│   │   │   ├── consensus.ts         # Multi-model agreement result
│   │   │   ├── gold.ts              # Human-labeled holdout entry
│   │   │   ├── benchmark.ts         # Model performance metrics
│   │   │   ├── experiment.ts        # Autoresearch training tracker
│   │   │   └── index.ts
│   │   ├── extract/                 # Phase 1: EDGAR extraction
│   │   │   ├── download-10k.ts
│   │   │   ├── parse-item1c.ts
│   │   │   ├── parse-8k.ts
│   │   │   ├── segment.ts
│   │   │   └── metadata.ts
│   │   ├── label/                   # Phase 2: GenAI labeling
│   │   │   ├── annotate.ts          # generateObject + OpenRouter per paragraph
│   │   │   ├── batch.ts             # Concurrency control + JSONL checkpointing
│   │   │   ├── consensus.ts         # Stage 1 majority vote logic
│   │   │   ├── judge.ts             # Stage 2 tiebreaker (Sonnet 4.6)
│   │   │   ├── prompts.ts           # System/user prompt builders
│   │   │   └── cost.ts              # Cost tracking aggregation
│   │   ├── gold/                    # Phase 3: Gold set
│   │   │   ├── sample.ts            # Stratified sampling
│   │   │   ├── human-label.ts       # Human label import
│   │   │   └── agreement.ts         # Krippendorff's alpha, Cohen's kappa
│   │   ├── benchmark/               # Phase 4: GenAI benchmarking
│   │   │   ├── run.ts
│   │   │   └── metrics.ts           # F1, AUC, MCC computation
│   │   ├── lib/                     # Shared utilities
│   │   │   ├── openrouter.ts        # Singleton + model registry with pricing
│   │   │   ├── jsonl.ts             # Read/write/append JSONL
│   │   │   ├── checkpoint.ts        # Resume from last completed ID
│   │   │   └── retry.ts             # Exponential backoff
│   │   └── cli.ts                   # CLI entry point
│   └── tests/
│
├── python/                          # Python: training, DAPT, inference
│   ├── pyproject.toml
│   ├── configs/
│   │   ├── dapt/modernbert-large.yaml
│   │   ├── finetune/
│   │   │   ├── modernbert-large.yaml
│   │   │   ├── neobert.yaml
│   │   │   └── deberta-v3-large.yaml
│   │   └── decoder/qwen3.5-lora.yaml
│   ├── src/
│   │   ├── dapt/train_mlm.py
│   │   ├── finetune/
│   │   │   ├── model.py             # Multi-head classifier (shared backbone)
│   │   │   ├── train.py             # HF Trainer script with --time-budget
│   │   │   ├── data.py
│   │   │   ├── losses.py            # SCL + ordinal + multi-head balancing
│   │   │   └── trainer.py           # Custom Trainer subclass
│   │   ├── decoder/train_lora.py    # Unsloth
│   │   └── eval/
│   │       ├── predict.py
│   │       ├── metrics.py
│   │       └── error_analysis.py
│   └── program.md                   # Autoresearch agent directive
│
├── data/                            # Gitignored heavy files
│   ├── raw/{10k,8k}/
│   ├── extracted/{item1c,item105}/
│   ├── paragraphs/paragraphs.jsonl
│   ├── annotations/
│   │   ├── stage1/{model}.jsonl
│   │   ├── stage2/judge.jsonl
│   │   └── consensus.jsonl
│   ├── gold/
│   │   ├── gold-sample.jsonl
│   │   ├── human-labels/annotator-{1,2,3}.jsonl
│   │   └── gold-adjudicated.jsonl
│   ├── benchmark/runs/{model}.jsonl
│   ├── splits/{train,val,test}.jsonl
│   └── dapt-corpus/sec-texts.jsonl
│
├── models/                          # Gitignored checkpoints
├── results/
│   ├── experiments.tsv              # Autoresearch log
│   └── figures/
└── .gitignore
```
## Core Schemas (Zod)

`label.ts` — the contract passed to `generateObject`:
```typescript
import { z } from "zod";

export const ContentCategory = z.enum([
  "Board Governance", "Management Role", "Risk Management Process",
  "Third-Party Risk", "Incident Disclosure", "Strategy Integration", "None/Other",
]);

export const SpecificityLevel = z.union([z.literal(1), z.literal(2), z.literal(3), z.literal(4)]);

export const LabelOutput = z.object({
  content_category: ContentCategory,
  specificity_level: SpecificityLevel,
  reasoning: z.string().max(500),
});
```
`annotation.ts` — label + full provenance:
```typescript
export const Annotation = z.object({
  paragraphId: z.string().uuid(),
  label: LabelOutput,
  provenance: z.object({
    modelId: z.string(),
    provider: z.string(),
    stage: z.enum(["stage1", "stage2-judge"]),
    runId: z.string().uuid(),
    promptVersion: z.string(),
    inputTokens: z.number(),
    outputTokens: z.number(),
    estimatedCostUsd: z.number(),
    latencyMs: z.number(),
    requestedAt: z.string().datetime(),
  }),
});
```
`consensus.ts` — multi-model agreement:
```typescript
export const ConsensusResult = z.object({
  paragraphId: z.string().uuid(),
  finalLabel: LabelOutput,
  method: z.enum(["unanimous", "majority", "judge-resolved", "unresolved"]),
  categoryAgreement: z.object({ votes: z.record(z.number()), agreed: z.boolean() }),
  specificityAgreement: z.object({ votes: z.record(z.number()), agreed: z.boolean(), spread: z.number() }),
  stage1ModelIds: z.array(z.string()),
  stage2JudgeModelId: z.string().nullable(),
  confidence: z.number().min(0).max(1),
});
```
Full schemas for filing, paragraph, gold, benchmark, and experiment types follow the same pattern — see the full plan agent output for complete definitions.
## Data Flow

**Phase 1: EXTRACTION (GPU-free)**
EDGAR API → download 10-K/8-K → parse Item 1C/1.05 → segment into paragraphs → enrich with company metadata → `data/paragraphs/paragraphs.jsonl` (~50-70K records)

**Phase 2: LABELING (GPU-free)**
paragraphs.jsonl → Stage 1: 3 models annotate all → consensus (expect ~83% agreement) → disagreements → Stage 2: Sonnet 4.6 judges → final `consensus.jsonl`

**Phase 3: GOLD SET (GPU-free)**
Stratified sample of 1,200 → 3 humans label independently → compute agreement → adjudicate → `gold-adjudicated.jsonl` (LOCKED holdout)

**Phase 4: BENCHMARKING (GPU-free)**
Run 6+ models on the holdout → compute F1/AUC/MCC/Krippendorff's α → comparison table

**Phase 5: TRAINING (REQUIRES GPU)**
- DAPT: SEC-ModernBERT-large (continued MLM on SEC filings)
- Encoder FT: SEC-ModernBERT, ModernBERT, NeoBERT, DeBERTa (5 ablations)
- Decoder FT: Qwen3.5 via Unsloth LoRA
- HP search: autoresearch `program.md`, agent iterates autonomously

**Phase 6: EVALUATION (REQUIRES GPU)**
Inference on holdout → metrics → error analysis → validity tests → final comparison
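As a concrete instance of the Phase 3/4 agreement metrics, Cohen's κ for the category dimension can be computed from two annotators' label sequences. This is a standalone sketch; the project's version would live in `ts/src/gold/agreement.ts`, and the function name here is an assumption.

```typescript
// Cohen's kappa: chance-corrected agreement between two annotators
// over the same items. Returns 1 for perfect agreement, ~0 for
// agreement no better than chance.
export function cohensKappa(a: string[], b: string[]): number {
  if (a.length !== b.length || a.length === 0) throw new Error("length mismatch");
  const n = a.length;
  const labels = [...new Set([...a, ...b])];
  // Observed agreement: fraction of items where both annotators agree.
  let observed = 0;
  for (let i = 0; i < n; i++) if (a[i] === b[i]) observed++;
  const po = observed / n;
  // Expected agreement under independent marginal label distributions.
  let pe = 0;
  for (const label of labels) {
    const pa = a.filter((x) => x === label).length / n;
    const pb = b.filter((x) => x === label).length / n;
    pe += pa * pb;
  }
  return (po - pe) / (1 - pe);
}
```

Krippendorff's α for the ordinal specificity dimension follows the same observed-vs-expected structure but weights disagreements by distance on the 4-point scale.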
## Key Architecture Patterns

### Annotation: generateObject + OpenRouter
```typescript
const result = await generateObject({
  model: openrouter(modelId),
  schema: LabelOutput,
  system: buildSystemPrompt(),
  prompt: buildUserPrompt(paragraph),
  temperature: 0,
  mode: "json",
});
```
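Calls like this will occasionally hit rate limits or transient provider errors, which is what `lib/retry.ts` is for. A minimal sketch of that wrapper, assuming exponential backoff with full jitter (the helper name and default values are assumptions, not part of the plan):

```typescript
// Retry an async operation with exponential backoff and full jitter.
// Delay before attempt k is uniform in [0, min(capMs, baseMs * 2^k)).
export async function withRetry<T>(
  fn: () => Promise<T>,
  { retries = 5, baseMs = 500, capMs = 30_000 } = {},
): Promise<T> {
  let lastError: unknown;
  for (let attempt = 0; attempt <= retries; attempt++) {
    try {
      return await fn();
    } catch (err) {
      lastError = err;
      if (attempt === retries) break;
      const delay = Math.random() * Math.min(capMs, baseMs * 2 ** attempt);
      await new Promise((resolve) => setTimeout(resolve, delay));
    }
  }
  throw lastError;
}
```

Jitter matters here because a batch run retries many requests at once; without it, failed requests re-fire in synchronized waves.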
### Batch Processing: Append-per-record checkpoint

Each successful annotation appends immediately to JSONL. On crash/resume, read completed IDs from the output file and skip them. Uses p-limit for concurrency control (default 5).
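The resume step can be sketched as a pure function, assuming each output record is an Annotation carrying a `paragraphId` as in the schema above (the function name is hypothetical):

```typescript
// Filter the input down to paragraphs not yet present in the output
// JSONL, so an interrupted batch run can resume where it left off.
export function pendingParagraphs<T extends { id: string }>(
  paragraphs: T[],
  outputJsonl: string,
): T[] {
  const done = new Set<string>();
  for (const line of outputJsonl.split("\n")) {
    if (!line.trim()) continue;
    // Each completed record is one Annotation per line.
    done.add(JSON.parse(line).paragraphId as string);
  }
  return paragraphs.filter((p) => !done.has(p.id));
}
```

Because records are appended one at a time, a crash can at worst lose the in-flight requests; everything already on disk is skipped on the next run.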
### Consensus: Stage 1 majority → Stage 2 judge

- Stage 1: 3 models vote. If 2/3 agree on BOTH dimensions → consensus.
- Stage 2: for disagreements, Sonnet 4.6 receives the paragraph plus all 3 annotations (presented in randomized order to avoid position bias). The judge's label is treated as the authoritative tiebreaker.
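The Stage 1 voting rule can be sketched as follows; this is an illustrative standalone version of the logic in `ts/src/label/consensus.ts`, with hypothetical names:

```typescript
type Vote = { category: string; specificity: number };

// Strict majority over one dimension's votes, or null if none exists.
function majority<K extends string | number>(votes: K[]): K | null {
  const counts = new Map<K, number>();
  for (const v of votes) counts.set(v, (counts.get(v) ?? 0) + 1);
  for (const [value, n] of counts) if (n * 2 > votes.length) return value;
  return null;
}

// Consensus requires a 2-of-3 majority on BOTH dimensions;
// otherwise the paragraph is escalated to the Stage 2 judge.
export function stage1Consensus(votes: Vote[]) {
  const category = majority(votes.map((v) => v.category));
  const specificity = majority(votes.map((v) => v.specificity));
  if (category !== null && specificity !== null) {
    const unanimous =
      votes.every((v) => v.category === category) &&
      votes.every((v) => v.specificity === specificity);
    return { method: unanimous ? "unanimous" : "majority", category, specificity };
  }
  return { method: "needs-judge" as const };
}
```

Note the two dimensions vote independently: three models can majority-agree on category while splitting three ways on specificity, which still escalates to the judge.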
### Training: Multi-head classifier

Shared encoder backbone (ModernBERT/NeoBERT/DeBERTa) → dropout → two linear heads:

- `category_head`: 7-class softmax
- `specificity_head`: 4-class ordinal/softmax

Loss: `α * CE(category) + (1-α) * CE(specificity) + β * SCL`
### HP Search: Autoresearch program.md

- Fixed 30-min time budget per experiment
- Metric: `val_macro_f1`
- Agent modifies ONLY YAML configs, not training scripts
- TSV results log: experiment_id, metric, hyperparameters, verdict (keep/discard)
- Vary ONE hyperparameter per experiment (controlled ablation)
## Quality Gates
| Gate | When | Key Check | Threshold | If Failed |
|---|---|---|---|---|
| Extraction QA | After Phase 1 | Spot-check 20 filings manually | 18/20 correct | Fix parser |
| Labeling Pilot | 50 paragraphs | Human review of LLM labels | ≥80% agreement | Revise prompt/rubric |
| Scale Pilot | 200 paragraphs | Inter-model Fleiss' Kappa | ≥0.60 | Replace weakest model or revise prompt |
| Human Labeling | Phase 3 | Krippendorff's α (specificity) | ≥0.67 | Collapse 4-pt to 3-pt scale |
| Human Labeling | Phase 3 | Cohen's κ (category) | ≥0.75 | Revise rubric boundaries |
| DAPT | Phase 5 | Perplexity decrease + GLUE check | PPL ↓, GLUE drop <2% | Reduce LR |
| Fine-tuning | Phase 5 | val_macro_f1 by epoch 3 | >0.75 | Check data quality |
| Final | Phase 6 | Holdout macro-F1 (category) | ≥0.80 | Error analysis, iterate |
| Final | Phase 6 | Calibration (ECE) | <0.10 | Temperature scaling |
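The final calibration gate (ECE < 0.10) can be computed with equal-width confidence bins. A sketch, in the pipeline's language for consistency even though the real check would run on the Python side; the binning choice and names are assumptions:

```typescript
// Expected Calibration Error: bin predictions by confidence, then take
// the weighted mean of |accuracy - mean confidence| across bins.
export function ece(
  confidences: number[],
  correct: boolean[],
  bins = 10,
): number {
  const n = confidences.length;
  const binTotals = new Array<number>(bins).fill(0);
  const binConf = new Array<number>(bins).fill(0);
  const binHits = new Array<number>(bins).fill(0);
  for (let i = 0; i < n; i++) {
    // Confidence 1.0 falls in the top bin rather than overflowing.
    const b = Math.min(bins - 1, Math.floor(confidences[i] * bins));
    binTotals[b]++;
    binConf[b] += confidences[i];
    binHits[b] += correct[i] ? 1 : 0;
  }
  let total = 0;
  for (let b = 0; b < bins; b++) {
    if (binTotals[b] === 0) continue;
    const acc = binHits[b] / binTotals[b];
    const conf = binConf[b] / binTotals[b];
    total += (binTotals[b] / n) * Math.abs(acc - conf);
  }
  return total;
}
```

If the gate fails, temperature scaling (the remedy in the table) adjusts the softmax temperature on the validation set to shrink exactly this accuracy-confidence gap.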
## CLI Commands

```shell
# Extraction
bun sec extract:download-10k --fiscal-year 2023
bun sec extract:parse --type 10k
bun sec extract:segment
bun sec extract:metadata

# Labeling
bun sec label:annotate --model openai/gpt-oss-120b --limit 50  # pilot
bun sec label:annotate-all                                     # full run
bun sec label:consensus
bun sec label:judge
bun sec label:cost

# Gold set
bun sec gold:sample --n 1200
bun sec gold:import-human --annotator annotator-1 --input labels.csv
bun sec gold:agreement

# Benchmarking
bun sec benchmark:run-all
bun sec benchmark:evaluate
bun sec benchmark:table

# Splits
bun sec splits:create

# Python training (GPU required)
uv run python/src/dapt/train_mlm.py --config python/configs/dapt/modernbert-large.yaml
uv run python/src/finetune/train.py --config python/configs/finetune/modernbert-large.yaml --time-budget 1800
uv run python/src/decoder/train_lora.py --config python/configs/decoder/qwen3.5-lora.yaml
uv run python/src/eval/predict.py --split test
uv run python/src/eval/metrics.py
```
## Implementation Sequence

### Day 1 (GPU-free) — Foundation

- `bun init` in `ts/`, `uv init` in `python/`, create full directory tree
- All Zod schemas
- JSONL utilities, OpenRouter singleton, model registry
- Prompt builders (from LABELING-CODEBOOK.md)
- `annotate.ts` + `batch.ts` with checkpoint/resume
- Test: dry-run 3 paragraphs
### Day 2 (GPU-free) — Extraction + Labeling Pilot

- EDGAR extraction pipeline (download, parse, segment)
- Run extraction on a small sample (~100 filings)
- Quality Gate 1: verify extraction
- Labeling pilot: 50 paragraphs × 3 models
- `consensus.ts` + `judge.ts`
- Quality Gate 2: manual review
- Scale pilot: 200 paragraphs
- Quality Gate 3: inter-model agreement
- If gates pass → launch full Stage 1 annotation
### Day 3+ (GPU-free, labeling runs) — Gold Set + Benchmarking

- Gold set sampling, human label infrastructure
- Benchmark runner + metrics
- Consensus + judge on full corpus
- Begin human labeling
- Prepare DAPT corpus
### GPU Available — Training

- Python training scripts (model.py, train.py, losses.py)
- `program.md` for autoresearch
- DAPT (~2-3 days)
- Fine-tuning ablations via autoresearch
- Unsloth decoder experiment
- Final evaluation + error analysis
## Verification

After implementation, verify end-to-end:

- `bun sec extract:segment --limit 10` produces valid Paragraph JSONL
- `bun sec label:annotate --model openai/gpt-oss-120b --limit 5` returns valid Annotations with cost tracking
- `bun sec label:consensus` correctly identifies agreement/disagreement
- `bun sec validate:schema --input data/annotations/stage1/gpt-oss-120b.jsonl --schema annotation` passes
- Python training script loads JSONL splits and begins training without errors
- `results/experiments.tsv` gets populated after one autoresearch iteration