# Post-Labeling Plan — Gold Set, Fine-Tuning & F1 Strategy
Updated 2026-04-02 with actual benchmark results and 13-signal analysis.
---
## Human Labeling Results (Complete)
3,600 labels (1,200 paragraphs x 3 annotators each, assigned via a balanced incomplete block design, BIBD), 21.5 active hours total.
| Metric | Category | Specificity | Both |
|--------|----------|-------------|------|
| Consensus (3/3 agree) | 56.8% | 42.3% | 27.0% |
| Krippendorff's alpha | **0.801** | 0.546 | -- |
| Avg Cohen's kappa | 0.612 | 0.440 | -- |
**Category is reliable.** Alpha = 0.801 exceeds the conventional 0.80 threshold. **Specificity is unreliable.** Alpha = 0.546, driven by one outlier annotator (+1.28 specificity bias) and a genuinely hard Spec 3-4 boundary.
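As a sanity check, the alpha figures are reproducible straight from the rating matrix. A minimal sketch, assuming the `krippendorff` PyPI package and an annotators x paragraphs matrix with `np.nan` for BIBD cells an annotator never saw (the toy values below are illustrative only):

```python
import numpy as np
import krippendorff  # pip install krippendorff

# Toy reliability matrix: rows = annotators, columns = paragraphs,
# np.nan = this annotator was not assigned this paragraph (BIBD).
category = np.array([
    [0, 2, 2, np.nan, 5, 1],
    [0, 2, 3, 1,      5, 1],
    [0, 1, 2, 1,      np.nan, 1],
])

# Category labels are unordered -> nominal metric.
print(krippendorff.alpha(reliability_data=category,
                         level_of_measurement="nominal"))
# Specificity (1-4) is ordered: run the same call on the specificity
# matrix with level_of_measurement="ordinal".
```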
---
## GenAI Benchmark Results (Complete)
10 models from 8 suppliers on 1,200 holdout paragraphs. $45.63 total benchmark cost.
### Per-Source Accuracy (Leave-One-Out: each source vs the majority of the other 12 signals)
| Rank | Source | Cat % | Spec % | Both % | Odd-One-Out % |
|------|--------|-------|--------|--------|---------------|
| 1 | Opus 4.6 | 92.6 | 90.8 | 84.0 | 7.4% |
| 2 | Kimi K2.5 | 91.6 | 91.1 | 83.3 | 8.4% |
| 3 | Gemini Pro | 91.1 | 90.1 | 82.3 | 8.9% |
| 4 | GPT-5.4 | 91.4 | 88.8 | 82.1 | 8.6% |
| 5 | GLM-5 | 91.9 | 88.4 | 81.4 | 8.1% |
| 6 | MIMO Pro | 91.1 | 89.4 | 81.4 | 8.9% |
| 7 | Grok Fast | 88.9 | 89.6 | 80.0 | 11.1% |
| 8 | Xander (best human) | 91.3 | 83.9 | 76.9 | 8.7% |
| 9 | Elisabeth | 85.5 | 84.6 | 72.3 | 14.5% |
| 10 | Gemini Lite | 83.0 | 86.1 | 71.7 | 17.0% |
| 11 | MIMO Flash | 80.4 | 86.4 | 69.2 | 19.6% |
| 12 | Meghan | 86.3 | 76.8 | 66.5 | 13.7% |
| 13 | MiniMax M2.7 | 87.9 | 75.6 | 66.1 | 12.1% |
| 14 | Joey | 84.0 | 77.2 | 65.8 | 16.0% |
| 15 | Anuj | 72.7 | 60.6 | 42.8 | 27.3% |
| 16 | Aaryan (outlier) | 59.1 | 24.7 | 15.8 | 40.9% |
Opus earns #1 without being privileged -- it genuinely disagrees with the crowd least.
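The leave-one-out scores reduce to a few lines. A sketch, assuming a hypothetical `votes_by_paragraph` mapping of `{paragraph_id: {source: category}}` holding the 13 signals available for each paragraph (tie-breaking among the other 12 is left unspecified):

```python
from collections import Counter

def leave_one_out_agreement(votes_by_paragraph):
    """Per source: % of its paragraphs where its label matches the
    majority vote of the remaining sources on that paragraph."""
    hits, totals = Counter(), Counter()
    for votes in votes_by_paragraph.values():
        for source, label in votes.items():
            others = [v for s, v in votes.items() if s != source]
            majority, _ = Counter(others).most_common(1)[0]  # ties: arbitrary
            totals[source] += 1
            hits[source] += int(label == majority)
    return {s: 100.0 * hits[s] / totals[s] for s in totals}
# Odd-one-out % is just 100 minus each source's agreement.
```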
### Cross-Source Agreement
| Comparison | Category |
|------------|----------|
| Human maj = Stage 1 maj | 81.7% |
| Human maj = Opus | 83.2% |
| Human maj = GenAI maj (10) | 82.2% |
| GenAI maj = Opus | 86.8% |
| 13-signal maj = 10-GenAI maj | 99.5% |
### Confusion Axes (same order for all source types)
1. MR <-> RMP (dominant)
2. BG <-> MR
3. N/O <-> SI
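These axes fall out mechanically from a symmetrized confusion matrix. A sketch, assuming a hypothetical 7x7 count matrix `cm` between any two sources and a `labels` list of category names:

```python
import numpy as np

def top_confusion_pairs(cm, labels, top_k=3):
    """Rank unordered label pairs by total cross-confusion."""
    sym = cm + cm.T                       # count i->j and j->i together
    np.fill_diagonal(sym, 0)              # drop agreements
    i, j = np.triu_indices_from(sym, k=1)
    order = np.argsort(sym[i, j])[::-1][:top_k]
    return [(labels[i[o]], labels[j[o]], int(sym[i[o], j[o]])) for o in order]
```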
---
## Adjudication Strategy (13 Signals)
### Sources per paragraph
| Source | Count | Prompt |
|--------|-------|--------|
| Human annotators | 3 | Codebook v3.0 |
| Stage 1 (gemini-lite, mimo-flash, grok-fast) | 3 | v2.5 |
| Opus 4.6 golden | 1 | v3.0+codebook |
| Benchmark (gpt-5.4, kimi-k2.5, gemini-pro, glm-5, minimax-m2.7, mimo-pro) | 6 | v3.0 |
| **Total** | **13** | |
### Tier breakdown (actual counts)
| Tier | Rule | Count | % |
|------|------|-------|---|
| 1 | 10+/13 agree on both dimensions | 756 | 63.0% |
| 2 | Human majority + GenAI majority agree | 216 | 18.0% |
| 3 | Humans split, GenAI converges | 26 | 2.2% |
| 4 | Universal disagreement | 202 | 16.8% |
**81% auto-resolvable.** Only 228 paragraphs (19%) need expert review.
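A minimal sketch of the tier rules, over hypothetical per-paragraph vote lists of `(category, specificity)` tuples (3 human, 10 GenAI). The simple-majority thresholds for Tiers 2-3 (2/3 human, 6/10 GenAI) are assumptions the table above does not pin down:

```python
from collections import Counter

def simple_majority(votes, threshold):
    label, n = Counter(votes).most_common(1)[0]
    return label if n >= threshold else None

def assign_tier(human_votes, genai_votes):
    top_count = Counter(human_votes + genai_votes).most_common(1)[0][1]
    if top_count >= 10:
        return 1                          # 10+/13 agree on both dimensions
    h = simple_majority(human_votes, 2)   # assumed 2/3 threshold
    g = simple_majority(genai_votes, 6)   # assumed 6/10 threshold
    if h is not None and h == g:
        return 2                          # human and GenAI majorities agree
    if h is None and g is not None:
        return 3                          # humans split, GenAI converges
    return 4                              # universal disagreement
```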
### Aaryan correction
On Aaryan's 600 paragraphs: when the other two annotators agree and Aaryan disagrees, the other-2 majority becomes the human signal for adjudication. This is justified by his 40.9% odd-one-out rate (vs 8-16% for the other annotators) and his alpha of 0.03-0.25 on specificity.
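As a sketch, reusing `simple_majority` from the tier code above (the annotator keys are placeholders), the correction is a one-line override before tiering:

```python
def human_signal(votes):  # votes: {"aaryan": (cat, spec), "xander": ..., ...}
    others = [v for a, v in votes.items() if a != "aaryan"]
    if len(others) == 2 and others[0] == others[1] and votes.get("aaryan") != others[0]:
        return others[0]  # other-2 majority overrides the outlier vote
    return simple_majority(list(votes.values()), 2)  # otherwise the normal 2/3 rule
```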
### Adjudication process for Tier 3+4
1. Pull Opus reasoning trace for the paragraph
2. Check the GenAI consensus (which category do 7+/10 models agree on?)
3. Expert reads the paragraph and all signals, makes final call
4. Document reasoning for Tier 4 paragraphs (these are the error analysis corpus)
---
## F1 Strategy — How to Pass
### The requirement
- **C grade minimum:** fine-tuned model with macro F1 > 0.80 on holdout
- **Gold standard:** human-labeled holdout (1,200 paragraphs)
- **Metrics to report:** macro F1, per-class F1, Krippendorff's alpha, AUC, MCC
- The fine-tuned "specialist" must be compared head-to-head with GenAI labeling
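A sketch of the reporting metrics with scikit-learn (array names and toy values are placeholders; Krippendorff's alpha comes from the same `krippendorff` call shown in the labeling section):

```python
import numpy as np
from sklearn.metrics import f1_score, matthews_corrcoef, roc_auc_score

rng = np.random.default_rng(0)
y_true = np.array([0, 1, 2, 3, 4, 5, 6, 2, 2, 5])     # gold category IDs
y_prob = rng.dirichlet(np.ones(7), size=len(y_true))  # stand-in softmax output
y_pred = y_prob.argmax(axis=1)

macro_f1 = f1_score(y_true, y_pred, average="macro")   # the pass/fail number
per_class_f1 = f1_score(y_true, y_pred, average=None)  # one F1 per category
mcc = matthews_corrcoef(y_true, y_pred)
auc = roc_auc_score(y_true, y_prob, multi_class="ovr", average="macro")
print(f"macro F1={macro_f1:.3f}  MCC={mcc:.3f}  AUC={auc:.3f}")
```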
### The challenge
The holdout was deliberately stratified to over-sample hard decision boundaries (MR<->RMP, N/O<->SI, Spec 3<->4). This means raw F1 on this holdout will be **lower** than on a random sample. Additionally:
- The best individual GenAI models only agree with human majority ~83-87% on category
- Our model is trained on GenAI labels, so its ceiling is bounded by GenAI-vs-human agreement
- Macro F1 weights all 7 classes equally -- rare classes (TPR, ID) get equal influence
- The MR<->RMP confusion axis is the #1 challenge across all source types
### Why F1 > 0.80 is achievable
1. **DAPT + TAPT give domain advantage.** The model has seen 1B tokens of SEC filings (DAPT) and all labeled paragraphs (TAPT). It understands SEC disclosure language at a depth that generic BERT models don't.
2. **35K+ high-confidence training examples.** Unanimous Stage 1 labels where all 3 models agreed on both dimensions. These are cleaner than any single model's labels.
3. **Encoder classification outperforms generative labeling on fine-tuned domains.** The model doesn't need to "reason" about the codebook -- it learns the decision boundaries directly from representations. This is the core thesis of Ringel (2023).
4. **The hard cases are a small fraction.** 63% of the holdout is Tier 1 (10+/13 agree). The model only needs reasonable performance on the remaining 37% to clear 0.80.
### Critical actions
#### 1. Gold label quality (highest priority)
Noisy gold labels directly cap F1. If the gold label is wrong, even a perfect model gets penalized.
- **Tier 1+2 (972 paragraphs):** Use 13-signal consensus. These are essentially guaranteed correct.
- **Tier 3+4 (228 paragraphs):** Expert adjudication with documented reasoning. Prioritize Opus reasoning traces + GenAI consensus as evidence.
- **Aaryan correction:** On his 600 paragraphs, replace his vote with the other-2 majority when they agree. This alone should improve gold label quality substantially.
- **Document the process:** The adjudication methodology itself is a deliverable (IRR report + reliability analysis).
#### 2. Training data curation
- **Primary corpus:** Unanimous Stage 1 labels (all 3 models agree on both cat+spec) -- ~35K paragraphs
- **Secondary:** Majority labels (2/3 agree) with 0.8x sample weight -- ~9-12K
- **Tertiary:** Judge labels with high confidence -- ~2-3K
- **Exclude:** Paragraphs where all 3 models disagree (too noisy for training)
- **Quality weighting:** clean/headed/minor = 1.0, degraded = 0.5
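A sketch of these rules as a single filter, assuming a hypothetical record shape: `stage1` holds the three Stage 1 `(cat, spec)` votes, `judge` an optional high-confidence judge label (its 1.0 weight is an assumption), and `quality` one of the bucket names above:

```python
from collections import Counter

QUALITY_WEIGHT = {"clean": 1.0, "headed": 1.0, "minor": 1.0, "degraded": 0.5}

def curate(record):
    """Return (label, sample_weight), or None to exclude the paragraph."""
    label, n = Counter(record["stage1"]).most_common(1)[0]
    if n == 3:
        base = 1.0                           # unanimous -> primary corpus
    elif n == 2:
        base = 0.8                           # 2/3 majority -> down-weighted
    elif record.get("judge") is not None:
        label, base = record["judge"], 1.0   # tertiary: judge label
    else:
        return None                          # all three disagree -> excluded
    return label, base * QUALITY_WEIGHT[record["quality"]]
```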
#### 3. Architecture and loss
- **Dual-head classifier:** Shared ModernBERT backbone -> category head (7-class softmax) + specificity head (4-class ordinal)
- **Category loss:** Focal loss (gamma=2) or class-weighted cross-entropy. The model must not ignore rare categories (TPR, ID). Weights inversely proportional to class frequency in training data.
- **Specificity loss:** Ordinal regression (CORAL) -- penalizes Spec 1->4 errors more than Spec 2->3. This respects the ordinal nature and handles the noisy Spec 3<->4 boundary gracefully.
- **Combined loss:** L = L_cat + 0.5 * L_spec (category gets more gradient weight because it's the more reliable dimension and the primary metric)
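A sketch in PyTorch with Hugging Face `transformers`. The checkpoint name is a placeholder, and the ordinal head uses independent cumulative logits rather than CORAL's weight-tied formulation -- a simplification:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F
from transformers import AutoModel

class DualHeadClassifier(nn.Module):
    def __init__(self, backbone="answerdotai/ModernBERT-large", n_cat=7, n_spec=4):
        super().__init__()
        self.encoder = AutoModel.from_pretrained(backbone)
        hidden = self.encoder.config.hidden_size
        self.cat_head = nn.Linear(hidden, n_cat)        # 7-class softmax head
        self.spec_head = nn.Linear(hidden, n_spec - 1)  # 3 cumulative logits

    def forward(self, input_ids, attention_mask):
        out = self.encoder(input_ids=input_ids, attention_mask=attention_mask)
        h = out.last_hidden_state[:, 0]                 # [CLS] pooling
        return self.cat_head(h), self.spec_head(h)

def focal_loss(logits, target, gamma=2.0, weight=None):
    ce = F.cross_entropy(logits, target, weight=weight, reduction="none")
    pt = torch.exp(-ce)                  # probability of the true class
    return ((1.0 - pt) ** gamma * ce).mean()

def ordinal_loss(logits, target):
    # target in {0..3}; binary subtask k asks "is specificity > k?" (k = 0,1,2)
    ks = torch.arange(logits.size(1), device=target.device)
    levels = (target.unsqueeze(1) > ks).float()
    return F.binary_cross_entropy_with_logits(logits, levels)

def combined_loss(cat_logits, spec_logits, cat_y, spec_y, class_weights=None):
    # L = L_cat + 0.5 * L_spec: category carries more gradient weight
    return focal_loss(cat_logits, cat_y, weight=class_weights) \
        + 0.5 * ordinal_loss(spec_logits, spec_y)
```

At inference the specificity prediction is `(spec_logits.sigmoid() > 0.5).sum(dim=1)`, which maps the cumulative logits back to a 0-3 index (add 1 for the 1-4 scale).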
#### 4. Ablation experiments (need >=4 configurations)
| # | Backbone | Class Weights | SCL | Notes |
|---|----------|--------------|-----|-------|
| 1 | Base ModernBERT-large | No | No | Baseline |
| 2 | +DAPT | No | No | Domain adaptation effect |
| 3 | +DAPT+TAPT | No | No | Full pre-training pipeline |
| 4 | +DAPT+TAPT | Yes (focal) | No | Class imbalance handling |
| 5 | +DAPT+TAPT | Yes (focal) | Yes | Supervised contrastive learning |
| 6 | +DAPT+TAPT | Yes (focal) | Yes | + ensemble (3 seeds) |
Experiments 1-3 isolate the pre-training contribution, experiments 4-5 isolate the training strategy, and experiment 6 is the final system.
#### 5. Evaluation strategy
- **Primary metric:** Category macro F1 on full 1,200 holdout (must exceed 0.80)
- **Secondary metrics:** Per-class F1, specificity F1 (report separately), MCC, Krippendorff's alpha vs human labels
- **Dual reporting (adverse incentive mitigation):** Also report F1 on a 720-paragraph proportional subsample (random draw matching corpus class proportions). The delta quantifies degradation on hard boundary cases. This serves the A-grade "error analysis" criterion.
- **Error analysis corpus:** Tier 4 paragraphs (202) are the natural error analysis set. Where the model fails on these, the 13-signal disagreement pattern explains why.
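A sketch of the proportional subsample draw, assuming hypothetical `gold_cat` (gold category per holdout paragraph) and `corpus_props` (`{class: corpus share}`):

```python
import numpy as np

def proportional_subsample(gold_cat, corpus_props, n=720, seed=0):
    """Draw a random subsample whose class mix matches the corpus."""
    rng = np.random.default_rng(seed)
    idx = []
    for cls, share in corpus_props.items():
        pool = np.flatnonzero(gold_cat == cls)
        k = min(len(pool), round(share * n))  # capped by availability
        idx.extend(rng.choice(pool, size=k, replace=False))
    return np.array(idx)  # report macro F1 here and on the full 1,200
```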
#### 6. Inference-time techniques
- **Ensemble:** Train 3 models with different random seeds on the best config. Majority vote at inference. Typically adds 1-3pp F1.
- **Threshold optimization:** After training, optimize per-class classification thresholds on a validation set (not holdout) to maximize macro F1. Don't use argmax -- use thresholds that balance precision and recall per class.
- **Post-hoc calibration:** Temperature scaling on validation set. Important for AUC and calibration plots.
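A sketch of the last two techniques, fit on the validation split only (all names are placeholders). The threshold rule divides each class probability by its threshold and takes the argmax, so every paragraph still receives exactly one label:

```python
import numpy as np
from sklearn.metrics import f1_score

def tune_thresholds(val_prob, val_true, grid=np.linspace(0.05, 0.95, 19)):
    """Greedy per-class threshold search maximizing macro F1."""
    thr = np.full(val_prob.shape[1], 0.5)
    for c in range(len(thr)):
        scores = []
        for t in grid:
            trial = thr.copy()
            trial[c] = t
            pred = (val_prob / trial).argmax(axis=1)  # scaled-argmax decision
            scores.append((f1_score(val_true, pred, average="macro"), t))
        thr[c] = max(scores)[1]                       # keep the best t for class c
    return thr

def temperature_scale(logits, T):
    """Post-hoc calibration: T > 1 softens overconfident probabilities.
    Fit the scalar T by minimizing NLL on the validation set."""
    return logits / T
```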
### Specificity dimension -- managed expectations
Specificity F1 will be lower than category F1. This is not a model failure:
- Human alpha on specificity is only 0.546 (unreliable gold)
- Even frontier models only agree 75-91% on specificity
- The Spec 3<->4 boundary is genuinely ambiguous
Strategy: report specificity F1 separately, explain why it's lower, and frame it as a finding about construct reliability (the specificity dimension needs more operational clarity, not better models). This is honest and scientifically interesting.
### Concrete F1 estimate
Based on GenAI-vs-human agreement rates and the typical BERT fine-tuning premium:
- **Category macro F1:** 0.78-0.85 (depends on class imbalance handling and gold quality)
- **Specificity macro F1:** 0.65-0.75 (ceiling-limited by human disagreement)
- **Combined (cat x spec) accuracy:** 0.55-0.70
The swing categories for macro F1 are MR (~65-80% per-class F1), TPR (~70-90%), and N/O (~60-85%). Focal loss + SCL should push MR and N/O into the range where macro F1 clears 0.80.
---
## The Meta-Narrative
The finding that trained student annotators achieve alpha = 0.801 on category but only 0.546 on specificity, while calibrated LLM panels achieve higher consistency (60.1% spec unanimity vs 42.3% for humans), validates the synthetic-experts hypothesis for rule-heavy classification tasks. The low specificity agreement is not annotator incompetence -- it is evidence that the specificity construct demands systematic attention to IS/NOT lists and counting rules, attention humans cannot consistently sustain at a 15-second-per-paragraph pace. GenAI's advantage on multi-step reasoning tasks is itself a key finding.
The leave-one-out analysis showing that Opus earns the top rank without being privileged is the strongest validation of using frontier LLMs as "gold" annotators: they're not just consistent with each other, they're the most consistent with the emergent consensus of all 16 sources combined.
---
## Timeline
| Task | Target | Status |
|------|--------|--------|
| Human labeling | 2026-04-01 | Done |
| GenAI benchmark (10 models) | 2026-04-02 | Done |
| 13-signal analysis | 2026-04-02 | Done |
| Gold set adjudication | 2026-04-03 to 04-04 | Next |
| Training data assembly | 2026-04-04 | |
| Fine-tuning ablations (6 configs) | 2026-04-05 to 04-08 | |
| Final evaluation on holdout | 2026-04-09 | |
| Executive memo + IGNITE slides | 2026-04-10 to 04-14 | |
| Submission | 2026-04-23 | |