SEC-cyBERT/docs/archive/v1/POST-LABELING-PLAN.md

Post-Labeling Plan — Gold Set, Fine-Tuning & F1 Strategy

Updated 2026-04-02 with actual benchmark results and 13-signal analysis.


Human Labeling Results (Complete)

3,600 labels (1,200 paragraphs x 3 annotators, assigned via a balanced incomplete block design, BIBD), 21.5 active hours total.

| Metric | Category | Specificity | Both |
|---|---|---|---|
| Consensus (3/3 agree) | 56.8% | 42.3% | 27.0% |
| Krippendorff's alpha | 0.801 | 0.546 | -- |
| Avg Cohen's kappa | 0.612 | 0.440 | -- |

Category is reliable: alpha = 0.801 clears the conventional 0.80 threshold. Specificity is unreliable: alpha = 0.546, dragged down by one outlier annotator (a +1.28 average specificity bias) and a genuinely hard Spec 3<->4 boundary.


GenAI Benchmark Results (Complete)

10 models from 8 providers on the 1,200-paragraph holdout. $45.63 total benchmark cost.

Per-Source Accuracy (leave-one-out: each source vs the majority of the other 12 signals on its paragraphs)

| Rank | Source | Cat % | Spec % | Both % | Odd-One-Out % |
|---|---|---|---|---|---|
| 1 | Opus 4.6 | 92.6 | 90.8 | 84.0 | 7.4% |
| 2 | Kimi K2.5 | 91.6 | 91.1 | 83.3 | 8.4% |
| 3 | Gemini Pro | 91.1 | 90.1 | 82.3 | 8.9% |
| 4 | GPT-5.4 | 91.4 | 88.8 | 82.1 | 8.6% |
| 5 | GLM-5 | 91.9 | 88.4 | 81.4 | 8.1% |
| 6 | MIMO Pro | 91.1 | 89.4 | 81.4 | 8.9% |
| 7 | Grok Fast | 88.9 | 89.6 | 80.0 | 11.1% |
| 8 | Xander (best human) | 91.3 | 83.9 | 76.9 | 8.7% |
| 9 | Elisabeth | 85.5 | 84.6 | 72.3 | 14.5% |
| 10 | Gemini Lite | 83.0 | 86.1 | 71.7 | 17.0% |
| 11 | MIMO Flash | 80.4 | 86.4 | 69.2 | 19.6% |
| 12 | Meghan | 86.3 | 76.8 | 66.5 | 13.7% |
| 13 | MiniMax M2.7 | 87.9 | 75.6 | 66.1 | 12.1% |
| 14 | Joey | 84.0 | 77.2 | 65.8 | 16.0% |
| 15 | Anuj | 72.7 | 60.6 | 42.8 | 27.3% |
| 16 | Aaryan (outlier) | 59.1 | 24.7 | 15.8 | 40.9% |

Opus earns #1 without being privileged -- it genuinely disagrees with the crowd least.

Cross-Source Agreement

| Comparison | Category |
|---|---|
| Human maj = S1 maj | 81.7% |
| Human maj = Opus | 83.2% |
| Human maj = GenAI maj (10) | 82.2% |
| GenAI maj = Opus | 86.8% |
| 13-signal maj = 10-GenAI maj | 99.5% |

Confusion Axes (same order for all source types)

  1. MR <-> RMP (dominant)
  2. BG <-> MR
  3. N/O <-> SI

Adjudication Strategy (13 Signals)

Sources per paragraph

| Source | Count | Prompt |
|---|---|---|
| Human annotators | 3 | Codebook v3.0 |
| Stage 1 (gemini-lite, mimo-flash, grok-fast) | 3 | v2.5 |
| Opus 4.6 golden | 1 | v3.0 + codebook |
| Benchmark (gpt-5.4, kimi-k2.5, gemini-pro, glm-5, minimax-m2.7, mimo-pro) | 6 | v3.0 |
| Total | 13 | |

Tier breakdown (actual counts)

| Tier | Rule | Count | % |
|---|---|---|---|
| 1 | 10+/13 agree on both dimensions | 756 | 63.0% |
| 2 | Human majority + GenAI majority agree | 216 | 18.0% |
| 3 | Humans split, GenAI converges | 26 | 2.2% |
| 4 | Universal disagreement | 202 | 16.8% |

81% auto-resolvable. Only 228 paragraphs (19%) need expert review.
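
The tier rules can be sketched as follows. This is a minimal reconstruction, assuming each signal is a (category, specificity) tuple and assuming 2/3 as the human majority cutoff and >5/10 as the GenAI majority cutoff (the plan does not state the GenAI cutoff explicitly):

```python
from collections import Counter

def majority(labels):
    """Return the most common label and its count."""
    counts = Counter(labels)
    if not counts:
        return None, 0
    return counts.most_common(1)[0]

def assign_tier(human, genai):
    """Tier rules from the table above.
    human: 3 (cat, spec) tuples; genai: 10 (cat, spec) tuples."""
    label, n = majority(human + genai)
    if n >= 10:                                  # Tier 1: 10+/13 agree on both dimensions
        return 1
    h_label, h_n = majority(human)
    g_label, g_n = majority(genai)
    h_maj = h_label if h_n >= 2 else None        # human majority: 2/3 (assumed cutoff)
    g_maj = g_label if g_n > len(genai) / 2 else None  # GenAI majority: >5/10 (assumed)
    if h_maj is not None and h_maj == g_maj:     # Tier 2: the two majorities agree
        return 2
    if h_maj is None and g_maj is not None:      # Tier 3: humans split, GenAI converges
        return 3
    return 4                                     # Tier 4: universal disagreement
```

Tier 1 is checked first on all 13 signals jointly, so a paragraph where humans and GenAI agree but only 8/13 match still falls through to the Tier 2 rule.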

Aaryan correction

On Aaryan's 600 paragraphs: when the other 2 annotators agree and Aaryan disagrees, the other-2 majority becomes the human signal for adjudication. This is justified by his 40.9% odd-one-out rate (vs 8-16% for other annotators) and α=0.03-0.25 on specificity.
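
A sketch of that vote replacement, assuming votes arrive as an annotator-to-label dict (the helper name is illustrative):

```python
from collections import Counter

def corrected_human_signal(votes, outlier="Aaryan"):
    """votes: dict of annotator -> label. Per the correction above: when
    the two non-outlier annotators agree, their shared label becomes the
    human signal, overriding the outlier's vote. Otherwise fall back to a
    plain 2/3 majority (None if all three disagree)."""
    others = [v for a, v in votes.items() if a != outlier]
    if outlier in votes and len(others) == 2 and others[0] == others[1]:
        return others[0]
    label, n = Counter(votes.values()).most_common(1)[0]
    return label if n >= 2 else None
```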

Adjudication process for Tier 3+4

  1. Pull Opus reasoning trace for the paragraph
  2. Check the GenAI consensus (which category do 7+/10 models agree on?)
  3. Expert reads the paragraph and all signals, makes final call
  4. Document reasoning for Tier 4 paragraphs (these are the error analysis corpus)

F1 Strategy — How to Pass

The requirement

  • C grade minimum: fine-tuned model with macro F1 > 0.80 on holdout
  • Gold standard: human-labeled holdout (1,200 paragraphs)
  • Metrics to report: macro F1, per-class F1, Krippendorff's alpha, AUC, MCC
  • The fine-tuned "specialist" must be compared head-to-head with GenAI labeling

The challenge

The holdout was deliberately stratified to over-sample hard decision boundaries (MR<->RMP, N/O<->SI, Spec 3<->4). This means raw F1 on this holdout will be lower than on a random sample. Additionally:

  • The best individual GenAI models only agree with human majority ~83-87% on category
  • Our model is trained on GenAI labels, so its ceiling is bounded by GenAI-vs-human agreement
  • Macro F1 weights all 7 classes equally -- rare classes (TPR, ID) get equal influence
  • The MR<->RMP confusion axis is the #1 challenge across all source types
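
To see why equal class weighting bites, here is macro F1 computed from scratch on a toy split where a rare class (a stand-in for TPR) is always missed. The numbers are illustrative, not project data:

```python
def per_class_f1(y_true, y_pred, cls):
    """F1 for one class, treating it as the positive label."""
    tp = sum(t == cls and p == cls for t, p in zip(y_true, y_pred))
    fp = sum(t != cls and p == cls for t, p in zip(y_true, y_pred))
    fn = sum(t == cls and p != cls for t, p in zip(y_true, y_pred))
    return 2 * tp / (2 * tp + fp + fn) if tp else 0.0

def macro_f1(y_true, y_pred, classes):
    """Unweighted mean of per-class F1 -- every class counts equally."""
    return sum(per_class_f1(y_true, y_pred, c) for c in classes) / len(classes)

# 95 common-class paragraphs, all correct; 5 rare ones, all missed
y_true = ["MR"] * 95 + ["TPR"] * 5
y_pred = ["MR"] * 100
accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)  # 0.95
score = macro_f1(y_true, y_pred, ["MR", "TPR"])  # ~0.49 despite 95% accuracy
```

With 7 classes, a single zeroed-out rare class costs about 0.14 of macro F1 on its own, which is why the class-imbalance handling in the architecture section matters so much.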

Why F1 > 0.80 is achievable

  1. DAPT + TAPT give domain advantage. The model has seen 1B tokens of SEC filings (DAPT) and all labeled paragraphs (TAPT). It understands SEC disclosure language at a depth that generic BERT models don't.

  2. 35K+ high-confidence training examples. Unanimous Stage 1 labels where all 3 models agreed on both dimensions. These are cleaner than any single model's labels.

  3. Encoder classification outperforms generative labeling on fine-tuned domains. The model doesn't need to "reason" about the codebook -- it learns the decision boundaries directly from representations. This is the core thesis of Ringel (2023).

  4. The hard cases are a small fraction. 63% of the holdout is Tier 1 (10+/13 agree). The model only needs reasonable performance on the remaining 37% to clear 0.80.

Critical actions

1. Gold label quality (highest priority)

Noisy gold labels directly cap F1. If the gold label is wrong, even a perfect model gets penalized.

  • Tier 1+2 (972 paragraphs): Use 13-signal consensus. These are essentially guaranteed correct.
  • Tier 3+4 (228 paragraphs): Expert adjudication with documented reasoning. Prioritize Opus reasoning traces + GenAI consensus as evidence.
  • Aaryan correction: On his 600 paragraphs, replace his vote with the other-2 majority when they agree. This alone should improve gold label quality substantially.
  • Document the process: The adjudication methodology itself is a deliverable (IRR report + reliability analysis).

2. Training data curation

  • Primary corpus: Unanimous Stage 1 labels (all 3 models agree on both cat+spec) -- ~35K paragraphs
  • Secondary: Majority labels (2/3 agree) with 0.8x sample weight -- ~9-12K
  • Tertiary: Judge labels with high confidence -- ~2-3K
  • Exclude: Paragraphs where all 3 models disagree (too noisy for training)
  • Quality weighting: clean/headed/minor = 1.0, degraded = 0.5
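
The curation rules above compose into a single per-paragraph weight; a minimal sketch, noting that the plan gives no explicit weight for judge labels (1.0 is an assumption here):

```python
def sample_weight(agreement, quality):
    """Training weight per the curation rules above.
    agreement: 'unanimous' (3/3 Stage 1), 'majority' (2/3), 'judge'
    (high-confidence judge label), or 'split' (all 3 disagree).
    quality: 'clean' | 'headed' | 'minor' | 'degraded'.
    Judge-label weight of 1.0 is an assumption, not stated in the plan."""
    agree_w = {"unanimous": 1.0, "majority": 0.8, "judge": 1.0}.get(agreement)
    if agree_w is None:
        return None  # split paragraphs are excluded from training
    qual_w = 0.5 if quality == "degraded" else 1.0
    return agree_w * qual_w
```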

3. Architecture and loss

  • Dual-head classifier: Shared ModernBERT backbone -> category head (7-class softmax) + specificity head (4-class ordinal)
  • Category loss: Focal loss (gamma=2) or class-weighted cross-entropy. The model must not ignore rare categories (TPR, ID). Weights inversely proportional to class frequency in training data.
  • Specificity loss: Ordinal regression (CORAL) -- penalizes Spec 1->4 errors more than Spec 2->3. This respects the ordinal nature and handles the noisy Spec 3<->4 boundary gracefully.
  • Combined loss: L = L_cat + 0.5 * L_spec (category gets more gradient weight because it's the more reliable dimension and the primary metric)
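
A numpy sketch of the combined objective, assuming the category head emits softmax probabilities and the specificity head emits 3 cumulative logits for P(spec > k), CORAL-style. Class weights and reduction details are simplified; a real training loop would implement this over logits with autograd, but the arithmetic is the same:

```python
import numpy as np

def focal_loss(probs, y, gamma=2.0, weights=None):
    """Multiclass focal loss. probs: (N, C) softmax outputs; y: (N,) class ids.
    The (1 - p_t)^gamma factor down-weights easy, confident examples."""
    p_t = probs[np.arange(len(y)), y]
    w = 1.0 if weights is None else weights[y]
    return float(np.mean(-w * (1.0 - p_t) ** gamma * np.log(p_t)))

def coral_loss(logits, y, n_levels=4):
    """CORAL-style ordinal loss. logits: (N, n_levels-1) cumulative logits
    for P(spec > k); y: (N,) int levels in 0..n_levels-1. A far miss flips
    more binary sub-tasks than a near miss, so Spec 1->4 errors cost more."""
    targets = (y[:, None] > np.arange(n_levels - 1)[None, :]).astype(float)
    log_sig = -np.logaddexp(0.0, -logits)       # log sigmoid(logits)
    log_one_minus = -np.logaddexp(0.0, logits)  # log(1 - sigmoid(logits))
    return float(-np.mean(targets * log_sig + (1.0 - targets) * log_one_minus))

def combined_loss(cat_probs, cat_y, spec_logits, spec_y):
    # L = L_cat + 0.5 * L_spec, per the weighting above
    return focal_loss(cat_probs, cat_y) + 0.5 * coral_loss(spec_logits, spec_y)
```

The ordinal behavior is visible directly: for a true Spec 4, logits that predict Spec 1 incur a much larger loss than logits that predict Spec 3.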

4. Ablation experiments (need >=4 configurations)

| # | Backbone | Class Weights | SCL | Notes |
|---|---|---|---|---|
| 1 | Base ModernBERT-large | No | No | Baseline |
| 2 | +DAPT | No | No | Domain adaptation effect |
| 3 | +DAPT+TAPT | No | No | Full pre-training pipeline |
| 4 | +DAPT+TAPT | Yes (focal) | No | Class imbalance handling |
| 5 | +DAPT+TAPT | Yes (focal) | Yes | Supervised contrastive learning |
| 6 | +DAPT+TAPT | Yes (focal) | Yes | + ensemble (3 seeds) |

Experiments 1-3 isolate the pre-training contribution. 4-5 isolate training strategy. 6 is the final system.

5. Evaluation strategy

  • Primary metric: Category macro F1 on full 1,200 holdout (must exceed 0.80)
  • Secondary metrics: Per-class F1, specificity F1 (report separately), MCC, Krippendorff's alpha vs human labels
  • Dual reporting (adverse incentive mitigation): Also report F1 on a 720-paragraph proportional subsample (random draw matching corpus class proportions). The delta quantifies degradation on hard boundary cases. This serves the A-grade "error analysis" criterion.
  • Error analysis corpus: Tier 4 paragraphs (202) are the natural error analysis set. Where the model fails on these, the 13-signal disagreement pattern explains why.
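
The proportional subsample for dual reporting can be drawn per class; a minimal sketch, where the proportions dict and seed are illustrative:

```python
import random

def proportional_subsample(items, labels, proportions, n=720, seed=42):
    """Random draw whose class mix matches `proportions` (class -> share,
    summing to 1), for the dual-reporting comparison above."""
    rng = random.Random(seed)
    by_class = {}
    for item, lab in zip(items, labels):
        by_class.setdefault(lab, []).append(item)
    sample = []
    for cls, share in proportions.items():
        k = round(n * share)                  # per-class quota
        sample.extend(rng.sample(by_class[cls], k))
    rng.shuffle(sample)
    return sample
```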

6. Inference-time techniques

  • Ensemble: Train 3 models with different random seeds on the best config. Majority vote at inference. Typically adds 1-3pp F1.
  • Threshold optimization: After training, optimize per-class classification thresholds on a validation set (not holdout) to maximize macro F1. Don't use argmax -- use thresholds that balance precision and recall per class.
  • Post-hoc calibration: Temperature scaling on validation set. Important for AUC and calibration plots.
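
One simple way to implement the per-class thresholds is to subtract a per-class offset before argmax and greedily search the offsets on the validation split for the best macro F1. The grid and greedy order here are illustrative, not a prescribed recipe (`macro_f1` is redefined locally so the sketch is self-contained):

```python
import numpy as np

def macro_f1(y_true, y_pred, n_classes):
    """Unweighted mean of per-class F1 over integer class ids."""
    f1s = []
    for c in range(n_classes):
        tp = np.sum((y_true == c) & (y_pred == c))
        fp = np.sum((y_true != c) & (y_pred == c))
        fn = np.sum((y_true == c) & (y_pred != c))
        f1s.append(2 * tp / (2 * tp + fp + fn) if tp else 0.0)
    return float(np.mean(f1s))

def predict_with_offsets(probs, offsets):
    # subtracting a per-class offset before argmax lets rare classes win
    return np.argmax(probs - offsets, axis=1)

def tune_offsets(probs, y_val, grid=np.linspace(-0.2, 0.2, 9)):
    """Greedy per-class search on a *validation* set (never the holdout)."""
    n_classes = probs.shape[1]
    offsets = np.zeros(n_classes)
    best = macro_f1(y_val, predict_with_offsets(probs, offsets), n_classes)
    for c in range(n_classes):
        for g in grid:
            trial = offsets.copy()
            trial[c] = g
            score = macro_f1(y_val, predict_with_offsets(probs, trial), n_classes)
            if score > best:
                best, offsets = score, trial
    return offsets, best
```

On a toy validation set where a rare class is systematically under-predicted, a small positive offset on the dominant class recovers it without hurting the majority class.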

Specificity dimension -- managed expectations

Specificity F1 will be lower than category F1. This is not a model failure:

  • Human alpha on specificity is only 0.546 (unreliable gold)
  • Even frontier models only agree 75-91% on specificity
  • The Spec 3<->4 boundary is genuinely ambiguous

Strategy: report specificity F1 separately, explain why it's lower, and frame it as a finding about construct reliability (the specificity dimension needs more operational clarity, not better models). This is honest and scientifically interesting.

Concrete F1 estimate

Based on GenAI-vs-human agreement rates and the typical BERT fine-tuning premium:

  • Category macro F1: 0.78-0.85 (depends on class imbalance handling and gold quality)
  • Specificity macro F1: 0.65-0.75 (ceiling-limited by human disagreement)
  • Combined (cat x spec) accuracy: 0.55-0.70

The swing categories for macro F1 are MR (~65-80% per-class F1), TPR (~70-90%), and N/O (~60-85%). Focal loss + SCL should push MR and N/O into the range where macro F1 clears 0.80.


The Meta-Narrative

The finding that trained student annotators achieve alpha = 0.801 on category but only 0.546 on specificity, while calibrated LLM panels achieve higher consistency (60.1% spec unanimity vs 42.3% for humans), validates the synthetic-experts hypothesis for rule-heavy classification tasks. The low specificity agreement is not annotator incompetence -- it is evidence that the specificity construct demands systematic application of IS/NOT lists and counting rules, which humans don't consistently sustain at a 15s/paragraph pace. GenAI's advantage on this kind of multi-step rule-following is itself a key finding.

The leave-one-out analysis showing that Opus earns the top rank without being privileged is the strongest validation of using frontier LLMs as "gold" annotators: they're not just consistent with each other, they're the most consistent with the emergent consensus of all 16 sources combined.


Timeline

| Task | Target | Status |
|---|---|---|
| Human labeling | 2026-04-01 | Done |
| GenAI benchmark (10 models) | 2026-04-02 | Done |
| 13-signal analysis | 2026-04-02 | Done |
| Gold set adjudication | 2026-04-03 to 04-04 | Next |
| Training data assembly | 2026-04-04 | |
| Fine-tuning ablations (6 configs) | 2026-04-05 to 04-08 | |
| Final evaluation on holdout | 2026-04-09 | |
| Executive memo + IGNITE slides | 2026-04-10 to 04-14 | |
| Submission | 2026-04-23 | |