# F1 Strategy — Passing the Class

The assignment requires **macro F1 > 0.80** on category, measured against the human-labeled 1,200-paragraph holdout. This document lays out the concrete plan for getting there.

---

## The Situation

### What we have

- **Training data:** 150,009 Stage 1 annotations across 50,003 paragraphs (3 models each). ~35K paragraphs with unanimous category agreement (all 3 models agree).
- **Pre-trained backbone:** ModernBERT-large with DAPT (1B tokens of SEC filings) + TAPT (labeled paragraphs). Domain-adapted and task-adapted.
- **Gold holdout:** 1,200 paragraphs with 13 independent annotations each (3 human + 10 GenAI). Adjudication tiers computed: 81% auto-resolvable.
- **Complete benchmark:** 10 GenAI models from 8 suppliers, all evaluated on the holdout.

### The ceiling

The best individual GenAI models agree with the human majority on ~83-87% of category labels. Our fine-tuned model is trained on GenAI labels, so its accuracy is bounded by how well GenAI labels match human labels. With DAPT+TAPT, the model should approach or slightly exceed this ceiling because:

1. It learns decision boundaries directly from representations, not through generative reasoning.
2. It is specialized on the exact domain (SEC filings) and task distribution.
3. The training data (35K+ unanimous labels) is cleaner than any single model's output.

### The threat

The holdout was deliberately stratified to over-sample hard decision boundaries (MR<->RMP splits, N/O<->SI splits, Spec 3/4 splits). This means raw F1 on this holdout is **lower** than on a random sample. Macro F1 also weights all 7 classes equally — rare categories (TPR at ~5%, ID at ~8%) get the same influence as RMP at ~35%.

**Estimated range: category macro F1 of 0.78-0.85.** The plan below is designed to push toward the top of that range.

---

## Action 1: Clean the Gold Labels

**Priority: highest. This directly caps F1 from above.**

If the gold label is wrong, even a perfect model gets penalized.
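To make that cap concrete, here is back-of-envelope arithmetic only (the uniform-error assumption and the function name are ours, for illustration):

```python
def expected_measured_accuracy(true_acc: float, gold_err: float, k: int = 7) -> float:
    """Expected agreement with noisy gold labels, assuming gold errors are
    uniform over the other k-1 classes and independent of the model's errors."""
    return true_acc * (1 - gold_err) + (1 - true_acc) * gold_err / (k - 1)

# A perfect model scored against gold with a 5% error rate measures ~0.95.
print(round(expected_measured_accuracy(1.0, 0.05), 3))  # 0.95
```

Macro F1 can fall further than this if gold errors concentrate in rare classes, which is exactly why adjudication quality matters most on the hard tiers.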
The gold label quality depends on how we adjudicate the 1,200 holdout paragraphs.

### Aaryan correction

Aaryan has a 40.9% odd-one-out rate (vs 8-16% for the other annotators), specificity kappa of 0.03-0.25, and a +1.30 specificity bias vs Opus. On his 600 paragraphs, when the other 2 annotators agree and he disagrees, the other-2 majority should be the human signal. This is not "throwing out" his data — it is using the objective reliability metrics to weight it appropriately.

Excluding his label on his paragraphs pushes both-unanimous agreement from 5% to 50% (+45pp). This single correction likely improves effective gold label quality by 5-10% on the paragraphs he touched.

### Tiered adjudication

| Tier | Count | % | Gold label source |
|------|-------|---|-------------------|
| 1 | 756 | 63% | 13-signal consensus (10+/13 agree on both dimensions) |
| 2 | 216 | 18% | Human majority + GenAI majority agree — take consensus |
| 3 | 26 | 2% | Expert review with Opus reasoning traces |
| 4 | 202 | 17% | Expert review, documented reasoning |

For Tier 1+2 (972 paragraphs, 81%), the gold label is objectively strong — at least 10 of 13 annotators agree, or the human and GenAI majorities independently converge. These labels are very unlikely to be wrong.

For Tier 3+4 (228 paragraphs), expert adjudication uses:

1. The Opus reasoning trace (why did the best model choose this category?)
2. The GenAI consensus direction (what do 7+/10 models say?)
3. The paragraph text itself
4. Codebook boundary rules (the MR vs RMP person-vs-function test, materiality disclaimers -> SI, etc.)

Document the reasoning for every Tier 4 decision. These 202 paragraphs become the error analysis corpus.

---

## Action 2: Handle Class Imbalance

**Priority: critical. This is the difference between 0.76 and 0.83 on macro F1.**

### The problem

The training data class distribution is heavily skewed:

| Category | Est. % of training | Macro F1 weight |
|----------|--------------------|-----------------|
| RMP | ~35% | 14.3% (1/7) |
| BG | ~15% | 14.3% |
| MR | ~14% | 14.3% |
| SI | ~13% | 14.3% |
| N/O | ~10% | 14.3% |
| ID | ~8% | 14.3% |
| TPR | ~5% | 14.3% |

Without correction, the model will over-predict RMP (the majority class) and under-predict TPR/ID. Since macro F1 weights all 7 classes equally, poor performance on rare classes tanks the overall score.

### Solutions (use in combination)

**Focal loss (gamma=2).** Down-weights easy/confident examples, up-weights hard/uncertain ones. The model spends more gradient on the examples it is getting wrong — which are disproportionately from rare classes and boundary cases. Better than static class weights because it adapts as training progresses.

**Class-weighted sampling.** Over-sample rare categories during training so the model sees roughly equal numbers of each class per epoch. Alternatively, use class-weighted cross-entropy with weights inversely proportional to class frequency.

**Stratified validation split.** Ensure the validation set used for early stopping and threshold optimization has proportional representation of all classes. Don't let the model optimize for RMP accuracy at the expense of TPR.

---

## Action 3: Supervised Contrastive Learning (SCL)

**Priority: high. Directly attacks the #1 confusion axis.**

### The problem

MR<->RMP is the dominant confusion axis for humans, Stage 1, and all GenAI models, and it will be the dominant confusion axis for our fine-tuned model too. These two categories share vocabulary (both discuss "cybersecurity" in a management/process context) and differ primarily in whether the paragraph describes a **person's role** (MR) or a **process/procedure** (RMP).

BG<->MR is the #2 axis — both involve governance/management but differ in whether it is board-level or management-level.
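Before attacking these axes, it helps to measure them from any set of predictions. A minimal sketch for ranking off-diagonal confusion pairs (the labels below are toy data, not real holdout output):

```python
from collections import Counter

def confusion_pairs(gold, pred):
    """Rank off-diagonal (gold, predicted) label pairs to surface confusion axes."""
    pairs = Counter((g, p) for g, p in zip(gold, pred) if g != p)
    return pairs.most_common()

# Toy labels for illustration; real inputs are holdout gold vs model predictions.
gold = ["MR", "RMP", "RMP", "BG", "MR", "RMP"]
pred = ["RMP", "RMP", "MR", "MR", "MR", "RMP"]
print(confusion_pairs(gold, pred))
```

Run on the real holdout, the top pairs should reproduce the MR<->RMP and BG<->MR axes described above.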
### How SCL helps

SCL adds a contrastive loss that pulls representations of same-class paragraphs together and pushes different-class paragraphs apart in the embedding space. This is especially valuable when:

- Two classes share surface-level vocabulary (MR/RMP, BG/MR)
- The distinguishing features are subtle (person vs function, board vs management)
- The model needs to learn discriminative features, not just predictive ones

### Implementation

Dual loss: `L = L_classification + lambda * L_contrastive`

The contrastive loss uses the [CLS] representation from the shared backbone. Lambda should be tuned on the validation set (start with 0.1-0.5).

---

## Action 4: Ordinal Specificity

**Priority: medium. Matters for specificity F1, not directly for category F1 (which is the pass/fail metric).**

### The problem

Specificity is a 4-point ordinal scale (1=Generic Boilerplate, 2=Sector-Standard, 3=Firm-Specific, 4=Quantified-Verifiable). Treating it as flat classification ignores the ordering — a Spec 1->4 error is worse than a Spec 2->3 error.

Human alpha on specificity is only 0.546 (unreliable). The Spec 3<->4 boundary is genuinely ambiguous. Even frontier models agree only 75-91% of the time on specificity.

### Solution

Use CORAL (Consistent Rank Logits) ordinal regression for the specificity head. CORAL converts a K-class ordinal problem into K-1 binary problems (is this >= 2? is this >= 3? is this >= 4?) and trains shared representations across all thresholds. This:

- Respects the ordinal structure
- Eliminates impossible predictions (e.g., predicting "yes, >= 4" but "no, not >= 3")
- Handles the noisy Spec 3<->4 boundary gracefully

### Managed expectations

Specificity macro F1 will land around 0.65-0.75 regardless of what we do. This is not a model failure — it is a gold label quality issue (alpha=0.546). Report specificity F1 separately and frame it as a finding about construct reliability.

---

## Action 5: Training Data Curation

**Priority: high. Garbage in, garbage out.**

### Confidence-stratified assembly

| Source | Count | Sample weight | Rationale |
|--------|-------|---------------|-----------|
| Unanimous Stage 1 (3/3 agree on both) | ~35K | 1.0 | Highest confidence |
| Majority Stage 1 (2/3 agree on cat) | ~9-12K | 0.8 | Good but not certain |
| Judge labels (high confidence) | ~2-3K | 0.7 | Disputed, resolved by a stronger model |
| All-disagree | ~2-3K | 0.0 (exclude) | Too noisy |

### Quality tier weighting

| Paragraph quality | Weight |
|-------------------|--------|
| Clean | 1.0 |
| Headed | 1.0 |
| Minor issues | 1.0 |
| Degraded (embedded bullets, orphan words) | 0.5 |

### What NOT to include

- Paragraphs where all 3 Stage 1 models disagree on category (pure noise)
- Paragraphs from truncated filings (72 identified and removed pre-DAPT)
- Paragraphs shorter than 10 words (these tend to be parsing artifacts)

---

## Action 6: Ablation Design

**The assignment requires at least 4 configurations. We'll run 6-8 to isolate each contribution.**

| # | Backbone | Focal loss | SCL | Notes |
|---|----------|------------|-----|-------|
| 1 | ModernBERT-large (base) | No | No | Baseline — no domain adaptation |
| 2 | +DAPT | No | No | Isolate the domain pre-training effect |
| 3 | +DAPT+TAPT | No | No | Isolate the task-adaptive pre-training effect |
| 4 | +DAPT+TAPT | Yes | No | Isolate class imbalance handling |
| 5 | +DAPT+TAPT | Yes | Yes | Full pipeline |
| 6 | +DAPT+TAPT | Yes | Yes | Ensemble (3 seeds, majority vote) |

**Expected pattern:** 1 < 2 < 3 (pre-training helps), 3 < 4 (focal loss helps rare classes), 4 < 5 (SCL helps confusion boundaries), 5 < 6 (ensemble smooths variance).

Each experiment trains for ~30-60 min on the RTX 3090. Total ablation time: ~4-8 hours.
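The two loss components toggled in the ablation table can be sketched in PyTorch. This is a minimal, untuned sketch: the simplified SupCon form follows Khosla et al. (2020), and `total_loss` mirrors the dual-loss form from Action 3 (`lam` is the SCL lambda to tune):

```python
import torch
import torch.nn.functional as F

def focal_loss(logits, targets, gamma=2.0):
    """Focal loss: scales cross-entropy by (1 - p_t)^gamma, so confident
    (easy) examples contribute less gradient than hard ones."""
    log_p = F.log_softmax(logits, dim=-1)
    log_pt = log_p.gather(1, targets.unsqueeze(1)).squeeze(1)
    pt = log_pt.exp()
    return ((1 - pt) ** gamma * -log_pt).mean()

def supcon_loss(features, targets, temperature=0.1):
    """Supervised contrastive loss over L2-normalized [CLS] features:
    same-class pairs are pulled together, different-class pairs pushed apart."""
    f = F.normalize(features, dim=-1)
    sim = f @ f.t() / temperature
    n = f.size(0)
    eye = torch.eye(n, dtype=torch.bool, device=f.device)
    sim = sim.masked_fill(eye, float("-inf"))   # exclude self-pairs
    log_prob = sim - sim.logsumexp(dim=1, keepdim=True)
    pos = (targets.unsqueeze(0) == targets.unsqueeze(1)) & ~eye
    pos_counts = pos.sum(1)
    valid = pos_counts > 0                      # anchors with >= 1 positive
    per_anchor = log_prob.masked_fill(~pos, 0.0).sum(1)[valid] / pos_counts[valid]
    return -per_anchor.mean()

def total_loss(logits, features, targets, lam=0.3):
    """L = L_classification + lambda * L_contrastive (lambda tuned on validation)."""
    return focal_loss(logits, targets) + lam * supcon_loss(features, targets)
```

In ablation configs 1-3 the objective is plain cross-entropy; config 4 swaps in `focal_loss`; configs 5-6 use `total_loss`.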
### Hyperparameters (starting points, tune on validation)

- Learning rate: 2e-5 (standard for BERT fine-tuning)
- Batch size: 16-32 (depending on VRAM with dual heads)
- Max sequence length: 512 (most paragraphs are <200 tokens; 8192 is unnecessary for classification)
- Epochs: 5-10 with early stopping (patience=3)
- Warmup: 10% of steps
- Weight decay: 1e-5 (matching the ModernBERT pre-training config)
- Focal loss gamma: 2.0
- SCL lambda: 0.1-0.5 (tune)
- Label smoothing: 0.05

---

## Action 7: Inference-Time Techniques

### Ensemble (3 seeds)

Train 3 instances of the best configuration (experiment 5) with different random seeds. At inference, average the softmax probabilities and take the argmax. This typically adds 1-3pp macro F1 over a single model. The variance across seeds also gives confidence intervals for the reported metrics.

### Per-class threshold optimization

After training, don't use plain argmax. Instead, optimize per-class classification thresholds on the validation set to maximize macro F1 directly. The optimal threshold for RMP (high prevalence, precision matters) differs from TPR (low prevalence, recall matters). Use a grid search or Bayesian optimization over the 7 thresholds.

### Post-hoc calibration

Apply temperature scaling on the validation set. This doesn't change predictions (and therefore doesn't change F1), but it makes the model's confidence scores meaningful for:

- Calibration plots (a recommended evaluation metric)
- AUC computation
- The error analysis narrative

---

## Action 8: Evaluation & Reporting

### Primary metrics (what determines the grade)

- **Category macro F1** on the full 1,200 holdout — must exceed 0.80
- **Per-class F1** — breakdown showing which categories are strong/weak
- **Krippendorff's alpha** — model vs human labels (should approach the GenAI panel's alpha)
- **MCC** — robust to class imbalance
- **AUC** — from calibrated probabilities

### Dual F1 reporting (adverse incentive mitigation)

Report F1 on both:

1. The **full 1,200 holdout** (stratified, over-samples hard cases)
2. A **~720-paragraph proportional subsample** (random draw matching corpus class proportions)

The delta between these two numbers quantifies how much the model degrades at decision boundaries. This directly serves the A-grade "error analysis" criterion and is methodologically honest about the stratified design.

### Error analysis corpus

The 202 Tier 4 paragraphs (universal disagreement) are the natural error analysis set. For each:

- What did the model predict?
- What is the gold label?
- What do the 13 signals show?
- What is the confusion axis?
- Is the gold label itself debatable?

This analysis will show that most "errors" fall on the MR<->RMP, BG<->MR, and N/O<->SI axes — the same axes where humans disagree. The model is not failing randomly; it is failing where the construct itself is ambiguous.

### GenAI vs specialist comparison (assignment Step 10)

| Dimension | GenAI panel (10 models) | Fine-tuned specialist |
|-----------|-------------------------|-----------------------|
| Category macro F1 | ~0.82-0.87 (per model) | Target: 0.80-0.85 |
| Cost per 1M texts | ~$5,000-13,000 | ~$5 (GPU inference) |
| Latency per text | 3-76 seconds | ~5ms |
| Reproducibility | Varies (temperature, routing) | Deterministic |
| Setup cost | $165 (one-time labeling) | + ~8h GPU training |

The specialist wins on cost (~1000x cheaper), speed (~1000x faster), and reproducibility (deterministic). The GenAI panel wins on raw accuracy by a few points. This is the core Ringel (2023) thesis: the specialist approximates the GenAI labeler at near-zero marginal cost.
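Action 7's per-class threshold search can be sketched as a greedy coordinate search over NumPy arrays. This is one simple scheme among several, not a prescribed method: `probs` stands for the validation softmax matrix, and rescaling each class's probability by an inverse threshold before the argmax is our illustrative mechanism for trading precision against recall per class:

```python
import numpy as np

def macro_f1(y_true, y_pred, n_classes=7):
    """Unweighted mean of per-class F1 scores."""
    f1s = []
    for c in range(n_classes):
        tp = np.sum((y_pred == c) & (y_true == c))
        fp = np.sum((y_pred == c) & (y_true != c))
        fn = np.sum((y_pred != c) & (y_true == c))
        denom = 2 * tp + fp + fn
        f1s.append(2 * tp / denom if denom else 0.0)
    return float(np.mean(f1s))

def tune_thresholds(probs, y_true, grid=np.linspace(0.05, 0.95, 19)):
    """Greedy coordinate search: lower one class's threshold at a time.
    Prediction = argmax over (prob / threshold), so a threshold below 1.0
    boosts that class's recall at the expense of precision."""
    thresholds = np.ones(probs.shape[1])
    best = macro_f1(y_true, np.argmax(probs, axis=1), probs.shape[1])
    for c in range(probs.shape[1]):
        for t in grid:
            trial = thresholds.copy()
            trial[c] = t
            pred = np.argmax(probs / trial, axis=1)
            score = macro_f1(y_true, pred, probs.shape[1])
            if score > best:
                best, thresholds = score, trial
    return thresholds, best
```

On a toy 2-class example where the rare class is systematically under-predicted, lowering its threshold recovers the missed positives; on the real task, the tuned thresholds from the validation set are then frozen before touching the holdout.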
---

## Risk Assessment

| Risk | Likelihood | Impact | Mitigation |
|------|------------|--------|------------|
| Macro F1 lands at 0.78-0.80 (just below threshold) | Medium | High | Ensemble + threshold optimization should add 2-3pp |
| TPR per-class F1 tanks macro average | Medium | Medium | Focal loss + over-sampling TPR in training |
| Gold label noise on Tier 4 paragraphs | Low | Medium | Conservative adjudication + dual F1 reporting |
| MR<->RMP confusion not resolved by SCL | Low | Medium | Person-vs-function test baked into training data via v3.0 codebook |
| DAPT+TAPT doesn't help (base model is already good enough) | Low | Low | Still meets 0.80 threshold; the ablation result itself is publishable |

---

## Timeline

| Task | Duration | Target date |
|------|----------|-------------|
| Gold set adjudication (Tier 3+4 expert review) | 2-3h | Apr 3-4 |
| Training data assembly | 1-2h | Apr 4 |
| Fine-tuning ablations (6 configs) | 4-8h GPU | Apr 5-8 |
| Final evaluation on holdout | 1h | Apr 9 |
| Error analysis writeup | 2h | Apr 10 |
| Executive memo draft | 3h | Apr 11-12 |
| IGNITE slides (20 slides) | 2h | Apr 13-14 |
| Final review + submission | 2h | Apr 22 |
| **Due date** | | **Apr 23, 12pm** |