# Post-Labeling Plan — Gold Set, Fine-Tuning & F1 Strategy

Updated 2026-04-02 with actual benchmark results and 13-signal analysis.

---

## Human Labeling Results (Complete)

3,600 labels (1,200 paragraphs x 3 annotators via BIBD), 21.5 active hours total.

| Metric | Category | Specificity | Both |
|--------|----------|-------------|------|
| Consensus (3/3 agree) | 56.8% | 42.3% | 27.0% |
| Krippendorff's alpha | **0.801** | 0.546 | -- |
| Avg Cohen's kappa | 0.612 | 0.440 | -- |

**Category is reliable.** Alpha = 0.801 exceeds the conventional 0.80 threshold. **Specificity is unreliable.** Alpha = 0.546, driven by one outlier annotator (+1.28 specificity bias) and a genuinely hard Spec 3-4 boundary.
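
These reliability numbers are reproducible from the raw label matrix. A minimal stdlib sketch of Krippendorff's alpha for nominal data (the category dimension), following the standard coincidence-matrix definition; the label values and function name are illustrative, not from the project codebase:

```python
from collections import Counter
from itertools import permutations

def nominal_alpha(units):
    """Krippendorff's alpha for nominal data.

    `units` is a list of label lists, one per paragraph (missing votes dropped,
    as under the BIBD design where not every annotator sees every paragraph).
    """
    # Coincidence matrix: every ordered pair of labels within a unit
    # contributes 1 / (m_u - 1), where m_u is the unit's label count.
    coincidence = Counter()
    for labels in units:
        m = len(labels)
        if m < 2:
            continue  # a single vote carries no agreement information
        for a, b in permutations(labels, 2):
            coincidence[(a, b)] += 1.0 / (m - 1)

    n_c = Counter()  # marginal label totals
    for (a, _b), w in coincidence.items():
        n_c[a] += w
    n = sum(n_c.values())

    observed = sum(w for (a, b), w in coincidence.items() if a != b) / n
    expected = sum(n_c[a] * n_c[b] for a in n_c for b in n_c if a != b) / (n * (n - 1))
    return 1.0 - observed / expected

# Perfect agreement across two categories -> alpha = 1.0
print(nominal_alpha([["MR", "MR", "MR"], ["BG", "BG", "BG"]]))  # -> 1.0
```

The same routine run per dimension on the 3,600-label matrix is what the 0.801 / 0.546 split comes from.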

---

## GenAI Benchmark Results (Complete)

10 models from 8 suppliers on 1,200 holdout paragraphs. $45.63 total benchmark cost.

### Per-Model Accuracy (Leave-One-Out: each source vs the majority of the other 12 signals on each paragraph)

| Rank | Source | Cat % | Spec % | Both % | Odd-One-Out % |
|------|--------|-------|--------|--------|---------------|
| 1 | Opus 4.6 | 92.6 | 90.8 | 84.0 | 7.4% |
| 2 | Kimi K2.5 | 91.6 | 91.1 | 83.3 | 8.4% |
| 3 | Gemini Pro | 91.1 | 90.1 | 82.3 | 8.9% |
| 4 | GPT-5.4 | 91.4 | 88.8 | 82.1 | 8.6% |
| 5 | GLM-5 | 91.9 | 88.4 | 81.4 | 8.1% |
| 6 | MIMO Pro | 91.1 | 89.4 | 81.4 | 8.9% |
| 7 | Grok Fast | 88.9 | 89.6 | 80.0 | 11.1% |
| 8 | Xander (best human) | 91.3 | 83.9 | 76.9 | 8.7% |
| 9 | Elisabeth | 85.5 | 84.6 | 72.3 | 14.5% |
| 10 | Gemini Lite | 83.0 | 86.1 | 71.7 | 17.0% |
| 11 | MIMO Flash | 80.4 | 86.4 | 69.2 | 19.6% |
| 12 | Meghan | 86.3 | 76.8 | 66.5 | 13.7% |
| 13 | MiniMax M2.7 | 87.9 | 75.6 | 66.1 | 12.1% |
| 14 | Joey | 84.0 | 77.2 | 65.8 | 16.0% |
| 15 | Anuj | 72.7 | 60.6 | 42.8 | 27.3% |
| 16 | Aaryan (outlier) | 59.1 | 24.7 | 15.8 | 40.9% |

Opus earns #1 without being privileged -- it genuinely disagrees with the crowd least.
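
The leave-one-out ranking is a few lines of code: for each source, compare its label on each paragraph against the majority of the remaining sources. A hedged sketch (data shapes and source names are illustrative; ties here break by first-seen order, whereas the real analysis would need an explicit tie rule):

```python
from collections import Counter

def leave_one_out_agreement(votes):
    """votes: dict source -> list of labels, aligned by paragraph.

    Returns source -> fraction of paragraphs where the source matches the
    majority of all *other* sources ("odd-one-out" rate = 1 - this).
    """
    sources = list(votes)
    n_paragraphs = len(next(iter(votes.values())))
    scores = {}
    for s in sources:
        hits = 0
        for i in range(n_paragraphs):
            others = [votes[t][i] for t in sources if t != s]
            majority = Counter(others).most_common(1)[0][0]
            hits += votes[s][i] == majority
        scores[s] = hits / n_paragraphs
    return scores

# Toy data with three sources and four paragraphs:
votes = {
    "opus": ["MR", "BG", "MR", "SI"],
    "kimi": ["MR", "BG", "RMP", "SI"],
    "human": ["MR", "BG", "MR", "SI"],
}
print(leave_one_out_agreement(votes))
```

Because each source is scored only against the *others*, no single source can win merely by being included in its own reference majority.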

### Cross-Source Agreement

| Comparison | Category |
|------------|----------|
| Human maj = S1 maj | 81.7% |
| Human maj = Opus | 83.2% |
| Human maj = GenAI maj (10) | 82.2% |
| GenAI maj = Opus | 86.8% |
| 13-signal maj = 10-GenAI maj | 99.5% |

### Confusion Axes (same order for all source types)

1. MR <-> RMP (dominant)
2. BG <-> MR
3. N/O <-> SI

---

## Adjudication Strategy (13 Signals)

### Sources per paragraph

| Source | Count | Prompt |
|--------|-------|--------|
| Human annotators | 3 | Codebook v3.0 |
| Stage 1 (gemini-lite, mimo-flash, grok-fast) | 3 | v2.5 |
| Opus 4.6 golden | 1 | v3.0+codebook |
| Benchmark (gpt-5.4, kimi-k2.5, gemini-pro, glm-5, minimax-m2.7, mimo-pro) | 6 | v3.0 |
| **Total** | **13** | |

### Tier breakdown (actual counts)

| Tier | Rule | Count | % |
|------|------|-------|---|
| 1 | 10+/13 agree on both dimensions | 756 | 63.0% |
| 2 | Human majority + GenAI majority agree | 216 | 18.0% |
| 3 | Humans split, GenAI converges | 26 | 2.2% |
| 4 | Universal disagreement | 202 | 16.8% |

**81% auto-resolvable.** Only 228 paragraphs (19%) need expert review.
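
The tier rules can be expressed as a small routing function. A sketch under assumed data shapes (3 human and 10 GenAI `(category, specificity)` pairs per paragraph); the 2/3 human-majority and 6/10 GenAI-convergence thresholds are assumptions of this sketch, not numbers from the plan:

```python
from collections import Counter

def majority(labels):
    """Return (label, count) for the most common label, or (None, 0) if empty."""
    if not labels:
        return None, 0
    return Counter(labels).most_common(1)[0]

def assign_tier(human, genai):
    """human: 3 (category, spec) pairs; genai: 10 (category, spec) pairs."""
    top, count = majority(human + genai)
    if count >= 10:                          # Tier 1: 10+/13 agree on both dimensions
        return 1
    h_top, h_count = majority(human)
    g_top, g_count = majority(genai)
    if h_count >= 2 and g_count >= 6 and h_top == g_top:
        return 2                             # Tier 2: human majority + GenAI majority agree
    if h_count < 2 and g_count >= 6:
        return 3                             # Tier 3: humans split, GenAI converges
    return 4                                 # Tier 4: universal disagreement -> expert review
```

Running this over all 1,200 paragraphs is what produces the 756 / 216 / 26 / 202 breakdown above.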

### Aaryan correction

On Aaryan's 600 paragraphs: when the other 2 annotators agree and Aaryan disagrees, the other-2 majority becomes the human signal for adjudication. This is justified by his 40.9% odd-one-out rate (vs 8-16% for other annotators) and alpha = 0.03-0.25 on specificity.
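
The override rule is mechanical. A minimal sketch (annotator names and the `None` sentinel for "no override" are illustrative choices of this sketch):

```python
def corrected_human_signal(labels, outlier="aaryan"):
    """labels: dict annotator -> (category, specificity) for one paragraph.

    If the two non-outlier annotators agree and the outlier disagrees,
    their shared label becomes the human signal for adjudication.
    """
    others = [lab for name, lab in labels.items() if name != outlier]
    if len(others) == 2 and others[0] == others[1] and labels.get(outlier) != others[0]:
        return others[0]
    return None  # no override; fall back to the raw 3-vote majority downstream

# Outlier overruled when the other two agree:
print(corrected_human_signal(
    {"aaryan": ("SI", 1), "xander": ("MR", 3), "joey": ("MR", 3)}
))  # -> ('MR', 3)
```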

### Adjudication process for Tier 3+4

1. Pull the Opus reasoning trace for the paragraph
2. Check the GenAI consensus (which category do 7+/10 models agree on?)
3. Expert reads the paragraph and all signals, makes the final call
4. Document reasoning for Tier 4 paragraphs (these are the error analysis corpus)

---

## F1 Strategy — How to Pass

### The requirement

- **C grade minimum:** fine-tuned model with macro F1 > 0.80 on holdout
- **Gold standard:** human-labeled holdout (1,200 paragraphs)
- **Metrics to report:** macro F1, per-class F1, Krippendorff's alpha, AUC, MCC
- The fine-tuned "specialist" must be compared head-to-head with GenAI labeling

### The challenge

The holdout was deliberately stratified to over-sample hard decision boundaries (MR<->RMP, N/O<->SI, Spec 3<->4). This means raw F1 on this holdout will be **lower** than on a random sample. Additionally:

- The best individual GenAI models only agree with human majority ~83-87% on category
- Our model is trained on GenAI labels, so its ceiling is bounded by GenAI-vs-human agreement
- Macro F1 weights all 7 classes equally -- rare classes (TPR, ID) get equal influence
- The MR<->RMP confusion axis is the #1 challenge across all source types

### Why F1 > 0.80 is achievable

1. **DAPT + TAPT give a domain advantage.** The model has seen 1B tokens of SEC filings (DAPT) and all labeled paragraphs (TAPT). It understands SEC disclosure language at a depth that generic BERT models don't.
2. **35K+ high-confidence training examples.** Unanimous Stage 1 labels where all 3 models agreed on both dimensions. These are cleaner than any single model's labels.
3. **Encoder classification outperforms generative labeling on fine-tuned domains.** The model doesn't need to "reason" about the codebook -- it learns the decision boundaries directly from representations. This is the core thesis of Ringel (2023).
4. **The hard cases are a small fraction.** 63% of the holdout is Tier 1 (10+/13 agree). The model only needs reasonable performance on the remaining 37% to clear 0.80.

### Critical actions

#### 1. Gold label quality (highest priority)

Noisy gold labels directly cap F1. If the gold label is wrong, even a perfect model gets penalized.

- **Tier 1+2 (972 paragraphs):** Use 13-signal consensus. These are essentially guaranteed correct.
- **Tier 3+4 (228 paragraphs):** Expert adjudication with documented reasoning. Prioritize Opus reasoning traces + GenAI consensus as evidence.
- **Aaryan correction:** On his 600 paragraphs, replace his vote with the other-2 majority when they agree. This alone should improve gold label quality substantially.
- **Document the process:** The adjudication methodology itself is a deliverable (IRR report + reliability analysis).

#### 2. Training data curation

- **Primary corpus:** Unanimous Stage 1 labels (all 3 models agree on both cat+spec) -- ~35K paragraphs
- **Secondary:** Majority labels (2/3 agree) with 0.8x sample weight -- ~9-12K
- **Tertiary:** Judge labels with high confidence -- ~2-3K
- **Exclude:** Paragraphs where all 3 models disagree (too noisy for training)
- **Quality weighting:** clean/headed/minor = 1.0, degraded = 0.5
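
The primary/secondary/exclude rules and the quality multiplier compose into a single per-paragraph weight. A sketch covering the Stage 1 tiers (the judge-label tertiary path is omitted; function name and data shapes are illustrative):

```python
def training_weight(stage1_labels, quality):
    """stage1_labels: 3 (category, spec) pairs from the Stage 1 models.
    quality: extraction-quality flag for the paragraph.

    Returns (label, sample_weight) for training, or None if excluded.
    """
    quality_w = {"clean": 1.0, "headed": 1.0, "minor": 1.0, "degraded": 0.5}[quality]
    counts = {}
    for lab in stage1_labels:
        counts[lab] = counts.get(lab, 0) + 1
    label, votes = max(counts.items(), key=lambda kv: kv[1])
    if votes == 3:       # unanimous: primary corpus, full weight
        return label, 1.0 * quality_w
    if votes == 2:       # majority: secondary corpus, down-weighted
        return label, 0.8 * quality_w
    return None          # all three disagree: excluded as too noisy

print(training_weight([("MR", 3)] * 3, "degraded"))  # -> (('MR', 3), 0.5)
```

The returned weight multiplies the per-example loss during fine-tuning, so unanimous clean paragraphs dominate the gradient.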

#### 3. Architecture and loss

- **Dual-head classifier:** Shared ModernBERT backbone -> category head (7-class softmax) + specificity head (4-class ordinal)
- **Category loss:** Focal loss (gamma=2) or class-weighted cross-entropy. The model must not ignore rare categories (TPR, ID). Weights are inversely proportional to class frequency in the training data.
- **Specificity loss:** Ordinal regression (CORAL) -- penalizes Spec 1->4 errors more than Spec 2->3. This respects the ordinal nature of the scale and handles the noisy Spec 3<->4 boundary gracefully.
- **Combined loss:** L = L_cat + 0.5 * L_spec (category gets more gradient weight because it's the more reliable dimension and the primary metric)
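
The loss design can be pinned down with scalar arithmetic. A stdlib-only sketch of the three pieces (a real implementation would be vectorized in PyTorch; all probabilities, logits, and class weights below are illustrative):

```python
import math

def focal_loss(probs, target, class_weights, gamma=2.0):
    """Focal loss for one example: -w_c * (1 - p_t)**gamma * log(p_t).
    Confident correct predictions are down-weighted by (1 - p_t)**gamma."""
    p_t = probs[target]
    return -class_weights[target] * (1.0 - p_t) ** gamma * math.log(p_t)

def coral_loss(cum_logits, target):
    """CORAL-style ordinal loss for specificity levels 0..3.

    Three cumulative binary tasks ("is spec > k?"): a Spec 1 -> 4 error is
    wrong on all three thresholds, a Spec 2 -> 3 error on only one."""
    loss = 0.0
    for k, z in enumerate(cum_logits):
        y_k = 1.0 if target > k else 0.0
        p = 1.0 / (1.0 + math.exp(-z))  # sigmoid of the k-th cumulative logit
        loss += -(y_k * math.log(p) + (1.0 - y_k) * math.log(1.0 - p))
    return loss

def combined_loss(cat_probs, cat_target, cat_weights, spec_logits, spec_target):
    """L = L_cat + 0.5 * L_spec, per the weighting above."""
    return (focal_loss(cat_probs, cat_target, cat_weights)
            + 0.5 * coral_loss(spec_logits, spec_target))
```

With gamma = 2, a category predicted correctly at p = 0.9 contributes almost no gradient, which is exactly the mechanism that frees capacity for the rare TPR/ID classes.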

#### 4. Ablation experiments (need >=4 configurations)

| # | Backbone | Class Weights | SCL | Notes |
|---|----------|---------------|-----|-------|
| 1 | Base ModernBERT-large | No | No | Baseline |
| 2 | +DAPT | No | No | Domain adaptation effect |
| 3 | +DAPT+TAPT | No | No | Full pre-training pipeline |
| 4 | +DAPT+TAPT | Yes (focal) | No | Class imbalance handling |
| 5 | +DAPT+TAPT | Yes (focal) | Yes | Supervised contrastive learning |
| 6 | +DAPT+TAPT | Yes (focal) | Yes | + ensemble (3 seeds) |

Experiments 1-3 isolate the pre-training contribution, 4-5 isolate the training strategy, and 6 is the final system.

#### 5. Evaluation strategy

- **Primary metric:** Category macro F1 on the full 1,200-paragraph holdout (must exceed 0.80)
- **Secondary metrics:** Per-class F1, specificity F1 (reported separately), MCC, Krippendorff's alpha vs human labels
- **Dual reporting (adverse incentive mitigation):** Also report F1 on a 720-paragraph proportional subsample (a random draw matching corpus class proportions). The delta quantifies degradation on hard boundary cases. This serves the A-grade "error analysis" criterion.
- **Error analysis corpus:** The 202 Tier 4 paragraphs are the natural error analysis set. Where the model fails on these, the 13-signal disagreement pattern explains why.
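
Macro and per-class F1 are simple enough to compute without a library, which also makes the rare-class sensitivity concrete. A sketch (toy labels; a real run would use sklearn):

```python
def per_class_f1(y_true, y_pred, classes):
    """Per-class F1 from raw label lists; a class absent from both truth
    and predictions scores 0.0."""
    f1 = {}
    for c in classes:
        tp = sum(t == c and p == c for t, p in zip(y_true, y_pred))
        fp = sum(t != c and p == c for t, p in zip(y_true, y_pred))
        fn = sum(t == c and p != c for t, p in zip(y_true, y_pred))
        f1[c] = 2 * tp / (2 * tp + fp + fn) if (2 * tp + fp + fn) else 0.0
    return f1

def macro_f1(y_true, y_pred, classes):
    """Unweighted mean over all classes -- rare classes count fully."""
    scores = per_class_f1(y_true, y_pred, classes)
    return sum(scores.values()) / len(classes)

# Toy example: one MR paragraph mislabeled as RMP drags two class F1s at once.
y_true = ["MR", "MR", "RMP", "BG"]
y_pred = ["MR", "RMP", "RMP", "BG"]
print(macro_f1(y_true, y_pred, ["MR", "RMP", "BG"]))  # ~0.78 (MR and RMP each 2/3, BG 1.0)
```

Note how a single MR<->RMP confusion lowers *both* class scores, which is why that confusion axis dominates the macro F1 budget.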

#### 6. Inference-time techniques

- **Ensemble:** Train 3 models with different random seeds on the best config. Majority vote at inference. Typically adds 1-3pp F1.
- **Threshold optimization:** After training, optimize per-class classification thresholds on a validation set (not the holdout) to maximize macro F1. Don't use raw argmax -- use thresholds that balance precision and recall per class.
- **Post-hoc calibration:** Temperature scaling on the validation set. Important for AUC and calibration plots.
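
Temperature scaling and per-class thresholds are both small post-hoc steps. A stdlib sketch (the grid range and threshold values are illustrative; a real fit would optimize on held-out validation logits):

```python
import math

def softmax(logits, temperature=1.0):
    """Temperature-scaled softmax; T > 1 flattens an over-confident model."""
    z = [l / temperature for l in logits]
    m = max(z)
    exps = [math.exp(v - m) for v in z]
    s = sum(exps)
    return [e / s for e in exps]

def fit_temperature(val_logits, val_labels, grid=None):
    """Pick T minimising validation NLL (coarse grid-search sketch)."""
    grid = grid or [0.5 + 0.1 * i for i in range(31)]  # T in [0.5, 3.5]
    def nll(T):
        return -sum(math.log(softmax(z, T)[y]) for z, y in zip(val_logits, val_labels))
    return min(grid, key=nll)

def predict_with_thresholds(probs, thresholds):
    """Instead of raw argmax, pick the class with the largest margin over its
    tuned threshold (thresholds come from a validation-set search)."""
    return max(range(len(probs)), key=lambda c: probs[c] - thresholds[c])

probs = softmax([2.0, 1.0, 0.5])
print(predict_with_thresholds(probs, [0.5, 0.1, 0.1]))  # -> 1 (raw argmax would pick 0)
```

Raising the threshold on a dominant class and lowering it on rare ones is how per-class recall gets rebalanced without retraining.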

### Specificity dimension -- managed expectations

Specificity F1 will be lower than category F1. This is not a model failure:

- Human alpha on specificity is only 0.546 (unreliable gold)
- Even frontier models only agree 75-91% on specificity
- The Spec 3<->4 boundary is genuinely ambiguous

Strategy: report specificity F1 separately, explain why it's lower, and frame it as a finding about construct reliability (the specificity dimension needs more operational clarity, not better models). This is honest and scientifically interesting.

### Concrete F1 estimate

Based on GenAI-vs-human agreement rates and the typical BERT fine-tuning premium:

- **Category macro F1:** 0.78-0.85 (depends on class imbalance handling and gold quality)
- **Specificity macro F1:** 0.65-0.75 (ceiling-limited by human disagreement)
- **Combined (cat x spec) accuracy:** 0.55-0.70

The swing categories for macro F1 are MR (~65-80% per-class F1), TPR (~70-90%), and N/O (~60-85%). Focal loss + SCL should push MR and N/O into the range where macro F1 clears 0.80.

---

## The Meta-Narrative

The finding that trained student annotators achieve alpha = 0.801 on category but only 0.546 on specificity, while calibrated LLM panels achieve higher consistency (60.1% spec unanimity vs 42.3% for humans), validates the synthetic-experts hypothesis for rule-heavy classification tasks. The low specificity agreement is not annotator incompetence -- it is evidence that the specificity construct requires systematic attention to IS/NOT lists and counting rules, attention that humans don't consistently invest at a 15s/paragraph pace. GenAI's advantage on multi-step reasoning tasks is itself a key finding.

The leave-one-out analysis showing that Opus earns the top rank without being privileged is the strongest validation of using frontier LLMs as "gold" annotators: they are not just consistent with each other, they are the most consistent with the emergent consensus of all 16 sources combined.

---

## Timeline

| Task | Target | Status |
|------|--------|--------|
| Human labeling | 2026-04-01 | Done |
| GenAI benchmark (10 models) | 2026-04-02 | Done |
| 13-signal analysis | 2026-04-02 | Done |
| Gold set adjudication | 2026-04-03 to 04-04 | Next |
| Training data assembly | 2026-04-04 | |
| Fine-tuning ablations (6 configs) | 2026-04-05 to 04-08 | |
| Final evaluation on holdout | 2026-04-09 | |
| Executive memo + IGNITE slides | 2026-04-10 to 04-14 | |
| Submission | 2026-04-23 | |