SEC-cyBERT/docs/archive/v1/F1-STRATEGY.md
2026-04-05 21:00:40 -04:00


F1 Strategy — Passing the Class

The assignment requires macro F1 > 0.80 on category, measured against the human-labeled 1,200-paragraph holdout. This document lays out the concrete plan for getting there.


The Situation

What we have

  • Training data: 150,009 Stage 1 annotations across 50,003 paragraphs (3 models each). ~35K paragraphs with unanimous category agreement (all 3 models agree).
  • Pre-trained backbone: ModernBERT-large with DAPT (1B tokens of SEC filings) + TAPT (labeled paragraphs). Domain-adapted and task-adapted.
  • Gold holdout: 1,200 paragraphs with 13 independent annotations each (3 human + 10 GenAI). Adjudication tiers computed: 81% auto-resolvable.
  • Complete benchmark: 10 GenAI models from 8 suppliers, all on holdout.

The ceiling

The best individual GenAI models agree with human majority on ~83-87% of category labels. Our fine-tuned model is trained on GenAI labels, so its accuracy is bounded by how well GenAI labels match human labels. With DAPT+TAPT, the model should approach or slightly exceed this ceiling because:

  1. It learns decision boundaries directly from representations, not through generative reasoning
  2. It's specialized on the exact domain (SEC filings) and task distribution
  3. The training data (35K+ unanimous labels) is cleaner than any single model's output

The threat

The holdout was deliberately stratified to over-sample hard decision boundaries (MR<->RMP splits, N/O<->SI splits, Spec 3/4 splits). This means raw F1 on this holdout is lower than on a random sample. Macro F1 also weights all 7 classes equally — rare categories (TPR at ~5%, ID at ~8%) get the same influence as RMP at ~35%.

Estimated range: category macro F1 of 0.78-0.85. The plan below is designed to push toward the top of that range.


Action 1: Clean the Gold Labels

Priority: highest. This directly caps F1 from above.

If the gold label is wrong, even a perfect model gets penalized. The gold label quality depends on how we adjudicate the 1,200 holdout paragraphs.

Aaryan correction

Aaryan has a 40.9% odd-one-out rate (vs 8-16% for other annotators), specificity kappa of 0.03-0.25, and +1.30 specificity bias vs Opus. On his 600 paragraphs, when the other 2 annotators agree and he disagrees, the other-2 majority should be the human signal. This is not "throwing out" his data — it's using the objective reliability metrics to weight it appropriately.

Excluding his label on his paragraphs pushes the both-dimensions-unanimous rate from 5% to 50% (+45pp). This single correction likely improves effective gold label quality by 5-10% on the paragraphs he touched.

Tiered adjudication

| Tier | Count | % | Gold label source |
|------|-------|-----|-------------------|
| 1 | 756 | 63% | 13-signal consensus (10+/13 agree on both dimensions) |
| 2 | 216 | 18% | Human majority + GenAI majority agree — take consensus |
| 3 | 26 | 2% | Expert review with Opus reasoning traces |
| 4 | 202 | 17% | Expert review, documented reasoning |

For Tier 1+2 (972 paragraphs, 81%), the gold label is objectively strong — at least 10 of 13 annotators agree, or both human and GenAI majorities independently converge. These labels can be treated as settled.

For Tier 3+4 (228 paragraphs), an expert adjudicates each paragraph using:

  1. Opus reasoning trace (why did the best model choose this category?)
  2. GenAI consensus direction (what do 7+/10 models say?)
  3. The paragraph text itself
  4. Codebook boundary rules (MR vs RMP person-vs-function test, materiality disclaimers -> SI, etc.)

Document reasoning for every Tier 4 decision. These 202 paragraphs become the error analysis corpus.


Action 2: Handle Class Imbalance

Priority: critical. This is the difference between 0.76 and 0.83 on macro F1.

The problem

The training data class distribution is heavily skewed:

| Category | Est. % of training | Macro F1 weight |
|----------|--------------------|-----------------|
| RMP | ~35% | 14.3% (1/7) |
| BG | ~15% | 14.3% |
| MR | ~14% | 14.3% |
| SI | ~13% | 14.3% |
| N/O | ~10% | 14.3% |
| ID | ~8% | 14.3% |
| TPR | ~5% | 14.3% |

Without correction, the model will over-predict RMP (the majority class) and under-predict TPR/ID. Since macro F1 weights all 7 equally, poor performance on rare classes tanks the overall score.

Solutions (use in combination)

Focal loss (gamma=2). Down-weights easy/confident examples, up-weights hard/uncertain ones. The model spends more gradient on the examples it's getting wrong — which are disproportionately from rare classes and boundary cases. Better than static class weights because it adapts as training progresses.
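
A minimal NumPy sketch of the focal-loss idea on softmax probabilities (illustrative only; in training this would be a loss on logits inside the framework's training loop):

```python
import numpy as np

def focal_loss(probs, labels, gamma=2.0):
    """Mean focal loss. probs: (N, C) softmax outputs; labels: (N,) class ids.
    gamma=0 recovers plain cross-entropy; gamma=2 down-weights confident
    examples by (1 - p_t)^2, concentrating gradient on hard/rare cases."""
    p_t = probs[np.arange(len(labels)), labels]   # probability of the true class
    return float(np.mean(-((1.0 - p_t) ** gamma) * np.log(p_t)))

probs = np.array([[0.90, 0.05, 0.05],   # easy, confident example
                  [0.40, 0.30, 0.30]])  # hard, uncertain example
labels = np.array([0, 0])
# With gamma=2 the easy example contributes ~0.001 vs ~0.33 for the hard one,
# while plain CE (gamma=0) gives ~0.105 vs ~0.916, a much flatter ratio.
```

The (1 - p_t)^gamma factor is exactly what makes it adaptive: as the model masters the majority class, those examples fade out of the loss automatically.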

Class-weighted sampling. Over-sample rare categories during training so the model sees roughly equal numbers of each class per epoch. Alternatively, use class-weighted cross-entropy with weights inversely proportional to frequency.
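
Inverse-frequency over-sampling can be sketched in NumPy (in practice this would be torch's WeightedRandomSampler; the simulated distribution below mirrors the class table above):

```python
import numpy as np

rng = np.random.default_rng(0)
# Simulated labels matching the skewed class distribution above (RMP..TPR).
labels = rng.choice(7, size=10_000, p=[0.35, 0.15, 0.14, 0.13, 0.10, 0.08, 0.05])

counts = np.bincount(labels, minlength=7)
weights = 1.0 / counts[labels]        # each example weighted by 1 / its class count
weights /= weights.sum()

# Resampling with these weights yields a roughly uniform class mix per epoch.
resampled = rng.choice(labels, size=10_000, p=weights)
```

Each class's total sampling probability becomes counts[c] * (1/counts[c]) / 7 = 1/7, so TPR is seen about as often as RMP.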

Stratified validation split. Ensure the validation set used for early stopping and threshold optimization has proportional representation of all classes. Don't let the model optimize for RMP accuracy at the expense of TPR.


Action 3: Supervised Contrastive Learning (SCL)

Priority: high. Directly attacks the #1 confusion axis.

The problem

MR<->RMP is the dominant confusion axis for humans, Stage 1, all GenAI models, and will be the dominant confusion axis for our fine-tuned model. These two categories share vocabulary (both discuss "cybersecurity" in a management/process context) and differ primarily in whether the paragraph describes a person's role (MR) or a process/procedure (RMP).

BG<->MR is the #2 axis — both involve governance/management but differ in whether it's board-level or management-level.

How SCL helps

SCL adds a contrastive loss that pulls representations of same-class paragraphs together and pushes different-class paragraphs apart in the embedding space. This is especially valuable when:

  • Two classes share surface-level vocabulary (MR/RMP, BG/MR)
  • The distinguishing features are subtle (person vs function, board vs management)
  • The model needs to learn discriminative features, not just predictive ones

Implementation

Dual loss: L = L_classification + lambda * L_contrastive

The contrastive loss uses the [CLS] representation from the shared backbone. Lambda should be tuned (start with 0.1-0.5) on the validation set.
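
A NumPy sketch of the supervised contrastive term (Khosla-style SupCon); the real loss would run on [CLS] embeddings inside the training loop, and `tau` is a hyperparameter to tune:

```python
import numpy as np

def supcon_loss(z, labels, tau=0.1):
    """Supervised contrastive loss on embeddings z: (N, D), labels: (N,).
    Pulls same-class pairs together, pushes different-class pairs apart.
    Anchors with no same-class positive in the batch are skipped."""
    z = z / np.linalg.norm(z, axis=1, keepdims=True)   # L2-normalize
    sim = (z @ z.T) / tau
    n, losses = len(labels), []
    for i in range(n):
        others = np.arange(n) != i
        pos = others & (labels == labels[i])
        if not pos.any():
            continue
        log_denom = np.log(np.exp(sim[i][others]).sum())
        losses.append(np.mean(log_denom - sim[i][pos]))  # -log(exp(pos) / denom)
    return float(np.mean(losses))
```

As a sanity check, embeddings that cluster by class should score lower than the same embeddings with mismatched labels, which is exactly the gradient signal that separates MR from RMP in representation space.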


Action 4: Ordinal Specificity

Priority: medium. Matters for specificity F1, not directly for category F1 (which is the pass/fail metric).

The problem

Specificity is a 4-point ordinal scale (1=Generic Boilerplate, 2=Sector-Standard, 3=Firm-Specific, 4=Quantified-Verifiable). Treating it as flat classification ignores the ordering — a Spec 1->4 error is worse than a Spec 2->3 error.

Human alpha on specificity is only 0.546 (unreliable). The Spec 3<->4 boundary is genuinely ambiguous. Even frontier models only agree 75-91% on specificity.

Solution

Use CORAL (Consistent Rank Logits) ordinal regression for the specificity head. CORAL converts a K-class ordinal problem into K-1 binary problems (is this >= 2? is this >= 3? is this >= 4?) and trains shared representations across all thresholds. This:

  • Respects the ordinal structure
  • Eliminates impossible predictions (e.g., predicting "yes >= 4" but "no >= 3")
  • Handles the noisy Spec 3<->4 boundary gracefully
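
The encoding/decoding side of CORAL can be sketched as follows (the shared-weight head itself lives in the model; these hypothetical helpers just show the K-1 binary framing):

```python
import numpy as np

def coral_targets(levels, k=4):
    """Encode ordinal level l in 1..k as k-1 binary targets [l>=2, ..., l>=k]."""
    levels = np.asarray(levels)
    return (levels[:, None] >= np.arange(2, k + 1)[None, :]).astype(int)

def coral_decode(threshold_probs):
    """threshold_probs: (N, k-1) = P(level >= 2), ..., P(level >= k).
    Predicted level = 1 + number of thresholds crossed. CORAL's shared
    weights keep these probabilities monotone, so no impossible patterns."""
    return 1 + (np.asarray(threshold_probs) > 0.5).sum(axis=1)
```

For example, Spec 3 encodes as [1, 1, 0] (yes >= 2, yes >= 3, no >= 4), and decoding simply counts crossed thresholds.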

Managed expectations

Specificity macro F1 will be 0.65-0.75 regardless of what we do. This is not a model failure — it's a gold label quality issue (alpha=0.546). Report specificity F1 separately and frame it as a finding about construct reliability.


Action 5: Training Data Curation

Priority: high. Garbage in, garbage out.

Confidence-stratified assembly

| Source | Count | Sample weight | Rationale |
|--------|-------|---------------|-----------|
| Unanimous Stage 1 (3/3 agree on both) | ~35K | 1.0 | Highest confidence |
| Majority Stage 1 (2/3 agree on cat) | ~9-12K | 0.8 | Good but not certain |
| Judge labels (high confidence) | ~2-3K | 0.7 | Disputed, resolved by stronger model |
| All-disagree | ~2-3K | 0.0 (exclude) | Too noisy |
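
Folding the sample weights into the loss is straightforward; a minimal NumPy sketch (helper name ours; in training this would be a weighted cross-entropy in the framework):

```python
import numpy as np

def weighted_ce(probs, labels, weights):
    """Cross-entropy where each example carries its source-tier weight
    (1.0 unanimous, 0.8 majority, 0.7 judge, per the table above)."""
    p_t = probs[np.arange(len(labels)), labels]
    return float(np.sum(weights * -np.log(p_t)) / np.sum(weights))
```

A weight of 0.0 removes an example entirely, which is how the all-disagree rows drop out without a separate filtering pass.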

Quality tier weighting

| Paragraph quality | Weight |
|-------------------|--------|
| Clean | 1.0 |
| Headed | 1.0 |
| Minor issues | 1.0 |
| Degraded (embedded bullets, orphan words) | 0.5 |

What NOT to include

  • Paragraphs where all 3 Stage 1 models disagree on category (pure noise)
  • Paragraphs from truncated filings (72 identified and removed pre-DAPT)
  • Paragraphs shorter than 10 words (tend to be parsing artifacts)

Action 6: Ablation Design

The assignment requires at least 4 configurations. We'll run 6-8 to isolate each contribution.

| # | Backbone | Focal Loss | SCL | Notes |
|---|----------|------------|-----|-------|
| 1 | ModernBERT-large (base) | No | No | Baseline — no domain adaptation |
| 2 | +DAPT | No | No | Isolate domain pre-training effect |
| 3 | +DAPT+TAPT | No | No | Isolate task-adaptive pre-training effect |
| 4 | +DAPT+TAPT | Yes | No | Isolate class imbalance handling |
| 5 | +DAPT+TAPT | Yes | Yes | Full pipeline |
| 6 | +DAPT+TAPT | Yes | Yes | Ensemble (3 seeds, averaged softmax per Action 7) |

Expected pattern: 1 < 2 < 3 (pre-training helps), 3 < 4 (focal loss helps rare classes), 4 < 5 (SCL helps confusion boundaries), 5 < 6 (ensemble smooths variance).

Each experiment trains for ~30-60 min on the RTX 3090. Total ablation time: ~4-8 hours.

Hyperparameters (starting points, tune on validation)

  • Learning rate: 2e-5 (standard for BERT fine-tuning)
  • Batch size: 16-32 (depending on VRAM with dual heads)
  • Max sequence length: 512 (most paragraphs are <200 tokens; 8192 is unnecessary for classification)
  • Epochs: 5-10 with early stopping (patience=3)
  • Warmup: 10% of steps
  • Weight decay: 1e-5 (matching ModernBERT pre-training config)
  • Focal loss gamma: 2.0
  • SCL lambda: 0.1-0.5 (tune)
  • Label smoothing: 0.05

Action 7: Inference-Time Techniques

Ensemble (3 seeds)

Train 3 instances of the best configuration (experiment 5) with different random seeds. At inference, average the softmax probabilities and take argmax. Typically adds 1-3pp macro F1 over a single model. The variance across seeds also gives confidence intervals for reported metrics.
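
Probability averaging is a one-liner; a small sketch showing how it can overrule a single seed's borderline call:

```python
import numpy as np

def ensemble_predict(prob_stack):
    """prob_stack: (n_seeds, N, C) softmax outputs. Average, then argmax."""
    return np.mean(prob_stack, axis=0).argmax(axis=1)

seeds = np.array([[[0.60, 0.40]],    # seed 1 leans class 0
                  [[0.45, 0.55]],    # seed 2 leans class 1
                  [[0.40, 0.60]]])   # seed 3 leans class 1
```

Averaging probabilities (rather than majority-voting hard labels) also preserves confidence information for the calibration and threshold steps below.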

Per-class threshold optimization

After training, don't use argmax. Instead, optimize per-class classification thresholds on the validation set to maximize macro F1 directly. The optimal threshold for RMP (high prevalence, high precision needed) is different from TPR (low prevalence, high recall needed). Use a grid search or Bayesian optimization over the 7 thresholds.
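
One way to run the per-class search is coordinate ascent over a threshold grid, scoring each candidate by validation macro F1 (a sketch; helper names ours, and Bayesian optimization could replace the grid):

```python
import numpy as np

def macro_f1(y_true, y_pred, k=7):
    """Unweighted mean of per-class F1 (classes never predicted score 0)."""
    f1 = []
    for c in range(k):
        tp = np.sum((y_pred == c) & (y_true == c))
        fp = np.sum((y_pred == c) & (y_true != c))
        fn = np.sum((y_pred != c) & (y_true == c))
        f1.append(2 * tp / (2 * tp + fp + fn) if tp else 0.0)
    return float(np.mean(f1))

def tune_thresholds(probs, y_true, k=7, grid=np.linspace(0.2, 1.0, 17), sweeps=3):
    """Coordinate ascent on per-class thresholds t; predict argmax(p_c / t_c),
    so lowering t_c makes class c easier to predict. Each step keeps the
    incumbent threshold in the grid, so validation macro F1 never decreases."""
    t = np.ones(k)
    for _ in range(sweeps):
        for c in range(k):
            def score(g):
                cand = np.where(np.arange(k) == c, g, t)
                return macro_f1(y_true, np.argmax(probs / cand, axis=1), k)
            t[c] = max(grid, key=score)
    return t
```

Starting from t = 1 (plain argmax) guarantees the tuned thresholds are at least as good as argmax on the validation set; the risk to manage is overfitting the thresholds to a small validation split.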

Post-hoc calibration

Apply temperature scaling on the validation set. This doesn't change predictions (and therefore doesn't change F1), but it makes the model's confidence scores meaningful for:

  • Calibration plots (recommended evaluation metric)
  • AUC computation
  • The error analysis narrative
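
Temperature scaling itself is a few lines; a NumPy sketch on validation logits (grid search stands in for the usual gradient-based fit of T):

```python
import numpy as np

def softmax(logits):
    z = logits - logits.max(axis=1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def fit_temperature(logits, labels, grid=np.linspace(0.5, 5.0, 91)):
    """Pick the scalar T minimizing validation NLL of softmax(logits / T).
    Dividing by T > 0 never changes the argmax, so F1 is untouched;
    it only spreads or sharpens the confidence scores."""
    def nll(T):
        p = softmax(logits / T)
        return -np.mean(np.log(p[np.arange(len(labels)), labels] + 1e-12))
    return float(min(grid, key=nll))
```

For an overconfident model (say, 99% confidence but 80% accuracy), the fitted T comes out above 1 and pulls reported confidence down toward the true hit rate.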

Action 8: Evaluation & Reporting

Primary metrics (what determines the grade)

  • Category macro F1 on full 1,200 holdout — must exceed 0.80
  • Per-class F1 — breakdown showing which categories are strong/weak
  • Krippendorff's alpha — model vs human labels (should approach GenAI panel's alpha)
  • MCC — robust to class imbalance
  • AUC — from calibrated probabilities

Dual F1 reporting (adverse incentive mitigation)

Report F1 on both:

  1. Full 1,200 holdout (stratified, over-samples hard cases)
  2. ~720-paragraph proportional subsample (random draw matching corpus class proportions)

The delta between these two numbers quantifies how much the model degrades at decision boundaries. This directly serves the A-grade "error analysis" criterion and is methodologically honest about the stratified design.
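
The proportional subsample can be drawn reproducibly from the holdout labels (a sketch; `corpus_props` would come from the full-corpus class distribution, and the helper name is ours):

```python
import numpy as np

def proportional_subsample(labels, corpus_props, size, seed=0):
    """Return indices of a subsample whose per-class counts match corpus
    proportions. Fixing the seed makes the reported number reproducible."""
    rng = np.random.default_rng(seed)
    picks = []
    for c, p in enumerate(corpus_props):
        pool = np.flatnonzero(labels == c)
        picks.append(rng.choice(pool, size=int(round(size * p)), replace=False))
    return np.concatenate(picks)
```

Reporting F1 on both the full stratified holdout and this proportional draw makes the boundary-degradation delta a single, auditable number.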

Error analysis corpus

The 202 Tier 4 paragraphs (universal disagreement) are the natural error analysis set. For each:

  • What did the model predict?
  • What is the gold label?
  • What do the 13 signals show?
  • What is the confusion axis?
  • Is the gold label itself debatable?

This analysis will show that most "errors" fall on the MR<->RMP, BG<->MR, and N/O<->SI axes — the same axes where humans disagree. The model is not failing randomly; it's failing where the construct itself is ambiguous.

GenAI vs specialist comparison (assignment Step 10)

| Dimension | GenAI Panel (10 models) | Fine-tuned Specialist |
|-----------|-------------------------|-----------------------|
| Category macro F1 | ~0.82-0.87 (per model) | Target: 0.80-0.85 |
| Cost per 1M texts | ~$5,000-13,000 | ~$5 (GPU inference) |
| Latency per text | 3-76 seconds | ~5 ms |
| Reproducibility | Varies (temperature, routing) | Deterministic |
| Setup cost | $165 (one-time labeling) | $165 labeling + ~8h GPU training |

The specialist wins on cost (1000x cheaper), speed (1000x faster), and reproducibility (deterministic). The GenAI panel wins on raw accuracy by a few points. This is the core Ringel (2023) thesis: the specialist approximates the GenAI labeler at near-zero marginal cost.


Risk Assessment

| Risk | Likelihood | Impact | Mitigation |
|------|------------|--------|------------|
| Macro F1 lands at 0.78-0.80 (just below threshold) | Medium | High | Ensemble + threshold optimization should add 2-3pp |
| TPR per-class F1 tanks macro average | Medium | Medium | Focal loss + over-sampling TPR in training |
| Gold label noise on Tier 4 paragraphs | Low | Medium | Conservative adjudication + dual F1 reporting |
| MR<->RMP confusion not resolved by SCL | Low | Medium | Person-vs-function test baked into training data via v3.0 codebook |
| DAPT+TAPT doesn't help (base model is already good enough) | Low | Low | Still meets 0.80 threshold; the ablation result itself is publishable |

Timeline

| Task | Duration | Target Date |
|------|----------|-------------|
| Gold set adjudication (Tier 3+4 expert review) | 2-3h | Apr 3-4 |
| Training data assembly | 1-2h | Apr 4 |
| Fine-tuning ablations (6 configs) | 4-8h GPU | Apr 5-8 |
| Final evaluation on holdout | 1h | Apr 9 |
| Error analysis writeup | 2h | Apr 10 |
| Executive memo draft | 3h | Apr 11-12 |
| IGNITE slides (20 slides) | 2h | Apr 13-14 |
| Final review + submission | 2h | Apr 22 |
| Due date | | Apr 23, 12pm |