Post-Labeling Plan — Gold Set, Fine-Tuning & F1 Strategy
Updated 2026-04-02 with actual benchmark results and 13-signal analysis.
Human Labeling Results (Complete)
3,600 labels (1,200 paragraphs x 3 annotators via BIBD), 21.5 active hours total.
| Metric | Category | Specificity | Both |
|---|---|---|---|
| Consensus (3/3 agree) | 56.8% | 42.3% | 27.0% |
| Krippendorff's alpha | 0.801 | 0.546 | -- |
| Avg Cohen's kappa | 0.612 | 0.440 | -- |
Category is reliable: alpha = 0.801 clears the conventional 0.80 threshold. Specificity is not: alpha = 0.546, driven by one outlier annotator (a +1.28 mean specificity bias) and a genuinely hard Spec 3<->4 boundary.
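For reference, nominal Krippendorff's alpha can be computed from scratch via the coincidence-matrix formulation; the production numbers presumably come from a standard package, and the function name and input shape here are illustrative. Units with fewer than two ratings (possible under BIBD assignment) drop out of the calculation.

```python
from collections import Counter
from itertools import permutations

def krippendorff_alpha_nominal(units):
    """units: list of per-paragraph label lists (one entry per annotator who saw it)."""
    coincidence = Counter()              # o[c, k] over ordered value pairs
    for values in units:
        m = len(values)
        if m < 2:
            continue                     # unpairable units drop out of alpha
        for i, j in permutations(range(m), 2):
            coincidence[(values[i], values[j])] += 1.0 / (m - 1)
    n = sum(coincidence.values())        # total number of pairable values
    totals = Counter()                   # marginal frequency n_c per label
    for (c, _), w in coincidence.items():
        totals[c] += w
    d_o = sum(w for (c, k), w in coincidence.items() if c != k) / n
    d_e = sum(totals[c] * totals[k] for c, k in permutations(totals, 2)) / (n * (n - 1))
    return 1.0 - d_o / d_e
```

Perfect agreement yields alpha = 1.0; systematic disagreement can push alpha below 0.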
GenAI Benchmark Results (Complete)
10 models from 8 suppliers on 1,200 holdout paragraphs. $45.63 total benchmark cost.
Per-Model Accuracy (Leave-One-Out: each source vs majority of other 12)
| Rank | Source | Cat % | Spec % | Both % | Odd-One-Out % |
|---|---|---|---|---|---|
| 1 | Opus 4.6 | 92.6 | 90.8 | 84.0 | 7.4% |
| 2 | Kimi K2.5 | 91.6 | 91.1 | 83.3 | 8.4% |
| 3 | Gemini Pro | 91.1 | 90.1 | 82.3 | 8.9% |
| 4 | GPT-5.4 | 91.4 | 88.8 | 82.1 | 8.6% |
| 5 | GLM-5 | 91.9 | 88.4 | 81.4 | 8.1% |
| 6 | MIMO Pro | 91.1 | 89.4 | 81.4 | 8.9% |
| 7 | Grok Fast | 88.9 | 89.6 | 80.0 | 11.1% |
| 8 | Xander (best human) | 91.3 | 83.9 | 76.9 | 8.7% |
| 9 | Elisabeth | 85.5 | 84.6 | 72.3 | 14.5% |
| 10 | Gemini Lite | 83.0 | 86.1 | 71.7 | 17.0% |
| 11 | MIMO Flash | 80.4 | 86.4 | 69.2 | 19.6% |
| 12 | Meghan | 86.3 | 76.8 | 66.5 | 13.7% |
| 13 | MiniMax M2.7 | 87.9 | 75.6 | 66.1 | 12.1% |
| 14 | Joey | 84.0 | 77.2 | 65.8 | 16.0% |
| 15 | Anuj | 72.7 | 60.6 | 42.8 | 27.3% |
| 16 | Aaryan (outlier) | 59.1 | 24.7 | 15.8 | 40.9% |
Opus earns #1 without being privileged -- it genuinely disagrees with the crowd least.
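The leave-one-out score behind this ranking can be sketched as follows, assuming for simplicity that every source labeled every paragraph (under BIBD, humans only cover a subset, so the real computation restricts to each source's paragraphs). Ties in the rest-of-crowd majority resolve arbitrarily here.

```python
from collections import Counter

def leave_one_out_agreement(labels_by_source):
    """labels_by_source: {source_name: [label per paragraph]} (equal-length lists).
    Returns {source: fraction of paragraphs matching the majority of the rest};
    odd-one-out rate is 1 minus this score."""
    sources = list(labels_by_source)
    n_items = len(next(iter(labels_by_source.values())))
    scores = {}
    for held_out in sources:
        hits = 0
        for i in range(n_items):
            votes = Counter(labels_by_source[s][i] for s in sources if s != held_out)
            majority = votes.most_common(1)[0][0]
            hits += labels_by_source[held_out][i] == majority
        scores[held_out] = hits / n_items
    return scores
```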
Cross-Source Agreement
| Comparison | Category |
|---|---|
| Human maj = S1 maj | 81.7% |
| Human maj = Opus | 83.2% |
| Human maj = GenAI maj (10) | 82.2% |
| GenAI maj = Opus | 86.8% |
| 13-signal maj = 10-GenAI maj | 99.5% |
Confusion Axes (same order for all source types)
- MR <-> RMP (dominant)
- BG <-> MR
- N/O <-> SI
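Confusion axes like these fall out of counting symmetric disagreement pairs between any two label sequences; a minimal sketch (label strings are illustrative):

```python
from collections import Counter

def top_confusion_axes(labels_a, labels_b, k=3):
    """Count unordered disagreement pairs between two aligned label sequences."""
    pairs = Counter()
    for a, b in zip(labels_a, labels_b):
        if a != b:
            pairs[tuple(sorted((a, b)))] += 1
    return pairs.most_common(k)
```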
Adjudication Strategy (13 Signals)
Sources per paragraph
| Source | Count | Prompt |
|---|---|---|
| Human annotators | 3 | Codebook v3.0 |
| Stage 1 (gemini-lite, mimo-flash, grok-fast) | 3 | v2.5 |
| Opus 4.6 golden | 1 | v3.0+codebook |
| Benchmark (gpt-5.4, kimi-k2.5, gemini-pro, glm-5, minimax-m2.7, mimo-pro) | 6 | v3.0 |
| Total | 13 | |
Tier breakdown (actual counts)
| Tier | Rule | Count | % |
|---|---|---|---|
| 1 | 10+/13 agree on both dimensions | 756 | 63.0% |
| 2 | Human majority + GenAI majority agree | 216 | 18.0% |
| 3 | Humans split, GenAI converges | 26 | 2.2% |
| 4 | Universal disagreement | 202 | 16.8% |
81% auto-resolvable. Only 228 paragraphs (19%) need expert review.
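The tier rules can be expressed directly over the 13 (category, specificity) votes. This sketch reads "human majority" as 2/3 and "GenAI converges" as 6/10; adjust the thresholds if the actual adjudication rules differ.

```python
from collections import Counter

def majority(votes, threshold):
    """Return the modal vote if it reaches threshold, else None."""
    label, count = Counter(votes).most_common(1)[0]
    return label if count >= threshold else None

def assign_tier(human_votes, genai_votes):
    """human_votes: 3 (category, specificity) tuples; genai_votes: 10 such tuples."""
    all_votes = human_votes + genai_votes
    if majority(all_votes, 10) is not None:      # Tier 1: 10+/13 agree on both dims
        return 1
    h_maj = majority(human_votes, 2)             # 2/3 humans (assumed threshold)
    g_maj = majority(genai_votes, 6)             # 6/10 models (assumed threshold)
    if h_maj is not None and h_maj == g_maj:     # Tier 2: both majorities agree
        return 2
    if h_maj is None and g_maj is not None:      # Tier 3: humans split, GenAI converges
        return 3
    return 4                                     # Tier 4: universal disagreement
```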
Aaryan correction
On Aaryan's 600 paragraphs: when the other 2 annotators agree and Aaryan disagrees, the other-2 majority becomes the human signal for adjudication. This is justified by his 40.9% odd-one-out rate (vs 8-16% for other annotators) and α=0.03-0.25 on specificity.
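The correction is a small pre-processing step before tier assignment; annotator names and the vote encoding here are illustrative:

```python
def corrected_human_signal(votes_by_annotator, outlier="Aaryan"):
    """If the two non-outlier annotators agree and the outlier dissents, the
    other-2 majority replaces the outlier's vote; otherwise all votes stand.
    Paragraphs the outlier didn't label pass through unchanged."""
    others = [v for a, v in votes_by_annotator.items() if a != outlier]
    outlier_vote = votes_by_annotator.get(outlier)
    if len(others) == 2 and others[0] == others[1] and outlier_vote != others[0]:
        return others + [others[0]]   # other-2 majority becomes the human signal
    return list(votes_by_annotator.values())
```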
Adjudication process for Tier 3+4
- Pull Opus reasoning trace for the paragraph
- Check the GenAI consensus (which category do 7+/10 models agree on?)
- Expert reads the paragraph and all signals, makes final call
- Document reasoning for Tier 4 paragraphs (these are the error analysis corpus)
F1 Strategy — How to Pass
The requirement
- C grade minimum: fine-tuned model with macro F1 > 0.80 on holdout
- Gold standard: human-labeled holdout (1,200 paragraphs)
- Metrics to report: macro F1, per-class F1, Krippendorff's alpha, AUC, MCC
- The fine-tuned "specialist" must be compared head-to-head with GenAI labeling
The challenge
The holdout was deliberately stratified to over-sample hard decision boundaries (MR<->RMP, N/O<->SI, Spec 3<->4). This means raw F1 on this holdout will be lower than on a random sample. Additionally:
- The best individual GenAI models only agree with human majority ~83-87% on category
- Our model is trained on GenAI labels, so its ceiling is bounded by GenAI-vs-human agreement
- Macro F1 weights all 7 classes equally -- rare classes (TPR, ID) get equal influence
- The MR<->RMP confusion axis is the #1 challenge across all source types
Why F1 > 0.80 is achievable
- DAPT + TAPT give domain advantage. The model has seen 1B tokens of SEC filings (DAPT) and all labeled paragraphs (TAPT). It understands SEC disclosure language at a depth that generic BERT models don't.
- 35K+ high-confidence training examples. Unanimous Stage 1 labels where all 3 models agreed on both dimensions. These are cleaner than any single model's labels.
- Encoder classification outperforms generative labeling on fine-tuned domains. The model doesn't need to "reason" about the codebook -- it learns the decision boundaries directly from representations. This is the core thesis of Ringel (2023).
- The hard cases are a small fraction. 63% of the holdout is Tier 1 (10+/13 agree). The model only needs reasonable performance on the remaining 37% to clear 0.80.
Critical actions
1. Gold label quality (highest priority)
Noisy gold labels directly cap F1. If the gold label is wrong, even a perfect model gets penalized.
- Tier 1+2 (972 paragraphs): Use 13-signal consensus. These are essentially guaranteed correct.
- Tier 3+4 (228 paragraphs): Expert adjudication with documented reasoning. Prioritize Opus reasoning traces + GenAI consensus as evidence.
- Aaryan correction: On his 600 paragraphs, replace his vote with the other-2 majority when they agree. This alone should improve gold label quality substantially.
- Document the process: The adjudication methodology itself is a deliverable (IRR report + reliability analysis).
2. Training data curation
- Primary corpus: Unanimous Stage 1 labels (all 3 models agree on both cat+spec) -- ~35K paragraphs
- Secondary: Majority labels (2/3 agree) with 0.8x sample weight -- ~9-12K
- Tertiary: Judge labels with high confidence -- ~2-3K
- Exclude: Paragraphs where all 3 models disagree (too noisy for training)
- Quality weighting: clean/headed/minor = 1.0, degraded = 0.5
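The curation rules reduce to a per-paragraph mapping from the 3 Stage-1 votes plus a quality flag to a (label, sample_weight) pair; a sketch, with the judge-label tier omitted and the quality flag names taken from the list above:

```python
from collections import Counter

def curated_example(stage1_votes, quality):
    """Return (label, sample_weight) for a paragraph, or None if excluded."""
    label, count = Counter(stage1_votes).most_common(1)[0]
    if count == 3:
        base = 1.0           # unanimous -> primary corpus
    elif count == 2:
        base = 0.8           # 2/3 majority -> secondary, down-weighted
    else:
        return None          # three-way disagreement -> excluded as too noisy
    quality_weight = {"clean": 1.0, "headed": 1.0, "minor": 1.0, "degraded": 0.5}
    return label, base * quality_weight[quality]
```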
3. Architecture and loss
- Dual-head classifier: Shared ModernBERT backbone -> category head (7-class softmax) + specificity head (4-class ordinal)
- Category loss: Focal loss (gamma=2) or class-weighted cross-entropy. The model must not ignore rare categories (TPR, ID). Weights inversely proportional to class frequency in training data.
- Specificity loss: Ordinal regression (CORAL) -- penalizes Spec 1->4 errors more than Spec 2->3. This respects the ordinal nature and handles the noisy Spec 3<->4 boundary gracefully.
- Combined loss: L = L_cat + 0.5 * L_spec (category gets more gradient weight because it's the more reliable dimension and the primary metric)
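A NumPy sketch of the loss design (the real heads would live in a PyTorch model on the ModernBERT backbone; only the loss math is shown). Focal loss down-weights easy examples via the (1 - p_t)^gamma factor; CORAL encodes specificity s in {1..4} as three cumulative binary targets (s>1, s>2, s>3), so distant ordinal errors accrue more penalty than adjacent ones.

```python
import numpy as np

def focal_loss(logits, y, gamma=2.0, class_weights=None):
    """Focal loss for the 7-class category head; y holds integer class indices."""
    z = logits - logits.max(axis=1, keepdims=True)          # stable softmax
    p = np.exp(z) / np.exp(z).sum(axis=1, keepdims=True)
    p_t = p[np.arange(len(y)), y]                           # prob of true class
    w = 1.0 if class_weights is None else class_weights[y]
    return np.mean(-w * (1 - p_t) ** gamma * np.log(p_t + 1e-12))

def coral_loss(spec_logits, spec):
    """CORAL ordinal loss: spec_logits has one column per threshold (>1, >2, >3)."""
    levels = np.arange(1, spec_logits.shape[1] + 1)
    targets = (spec[:, None] > levels[None, :]).astype(float)
    log_sig = -np.logaddexp(0.0, -spec_logits)              # log sigmoid(x)
    log_one_minus = -np.logaddexp(0.0, spec_logits)         # log(1 - sigmoid(x))
    return np.mean(-(targets * log_sig + (1 - targets) * log_one_minus))

def combined_loss(cat_logits, y_cat, spec_logits, y_spec):
    """L = L_cat + 0.5 * L_spec, per the weighting above."""
    return focal_loss(cat_logits, y_cat) + 0.5 * coral_loss(spec_logits, y_spec)
```

Note the ordinal behavior: with fixed spec logits predicting level 2, the loss is larger when the true level is 4 than when it is 3.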
4. Ablation experiments (need >=4 configurations)
| # | Backbone | Class Weights | SCL | Notes |
|---|---|---|---|---|
| 1 | Base ModernBERT-large | No | No | Baseline |
| 2 | +DAPT | No | No | Domain adaptation effect |
| 3 | +DAPT+TAPT | No | No | Full pre-training pipeline |
| 4 | +DAPT+TAPT | Yes (focal) | No | Class imbalance handling |
| 5 | +DAPT+TAPT | Yes (focal) | Yes | Supervised contrastive learning |
| 6 | +DAPT+TAPT | Yes (focal) | Yes | + ensemble (3 seeds) |
Experiments 1-3 isolate the pre-training contribution. 4-5 isolate training strategy. 6 is the final system.
5. Evaluation strategy
- Primary metric: Category macro F1 on full 1,200 holdout (must exceed 0.80)
- Secondary metrics: Per-class F1, specificity F1 (report separately), MCC, Krippendorff's alpha vs human labels
- Dual reporting (adverse incentive mitigation): Also report F1 on a 720-paragraph proportional subsample (random draw matching corpus class proportions). The delta quantifies degradation on hard boundary cases. This serves the A-grade "error analysis" criterion.
- Error analysis corpus: Tier 4 paragraphs (202) are the natural error analysis set. Where the model fails on these, the 13-signal disagreement pattern explains why.
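Macro F1 is the unweighted mean of per-class F1 scores, which is exactly why rare classes carry full weight; a dependency-free sketch (in practice `sklearn.metrics.f1_score(average="macro")` does the same):

```python
def macro_f1(gold, pred, labels):
    """Unweighted mean of per-class F1 over the given label set."""
    f1s = []
    for c in labels:
        tp = sum(g == c and p == c for g, p in zip(gold, pred))
        fp = sum(g != c and p == c for g, p in zip(gold, pred))
        fn = sum(g == c and p != c for g, p in zip(gold, pred))
        denom = 2 * tp + fp + fn
        f1s.append(2 * tp / denom if denom else 0.0)   # absent class scores 0
    return sum(f1s) / len(f1s)
```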
6. Inference-time techniques
- Ensemble: Train 3 models with different random seeds on the best config. Majority vote at inference. Typically adds 1-3pp F1.
- Threshold optimization: After training, optimize per-class classification thresholds on a validation set (not holdout) to maximize macro F1. Don't use argmax -- use thresholds that balance precision and recall per class.
- Post-hoc calibration: Temperature scaling on validation set. Important for AUC and calibration plots.
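One way to implement the per-class threshold search is coordinate ascent on validation macro F1, with the decision rule "argmax of probability divided by threshold" replacing plain argmax; the grid and sweep count are illustrative hyperparameters.

```python
import numpy as np

def predict_with_thresholds(probs, thresholds):
    """Divide each class probability by its threshold, then take the argmax."""
    return np.argmax(probs / thresholds[None, :], axis=1)

def macro_f1(gold, pred, n_classes):
    f1s = []
    for c in range(n_classes):
        tp = np.sum((gold == c) & (pred == c))
        fp = np.sum((gold != c) & (pred == c))
        fn = np.sum((gold == c) & (pred != c))
        denom = 2 * tp + fp + fn
        f1s.append(2 * tp / denom if denom else 0.0)
    return float(np.mean(f1s))

def tune_thresholds(probs, gold, grid=np.linspace(1.0, 0.2, 9), sweeps=2):
    """Coordinate ascent over per-class thresholds on the validation set.
    The grid is descending so ties keep the larger (more conservative) threshold."""
    n_classes = probs.shape[1]
    thresholds = np.ones(n_classes)
    for _ in range(sweeps):
        for c in range(n_classes):
            def score(t):
                cand = thresholds.copy()
                cand[c] = t
                return macro_f1(gold, predict_with_thresholds(probs, cand), n_classes)
            thresholds[c] = max(grid, key=score)
    return thresholds
```

Because the current threshold is always among the candidates, each sweep can only hold or improve validation macro F1 relative to plain argmax.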
Specificity dimension -- managed expectations
Specificity F1 will be lower than category F1. This is not a model failure:
- Human alpha on specificity is only 0.546 (unreliable gold)
- Even frontier models only agree 75-91% on specificity
- The Spec 3<->4 boundary is genuinely ambiguous
Strategy: report specificity F1 separately, explain why it's lower, and frame it as a finding about construct reliability (the specificity dimension needs more operational clarity, not better models). This is honest and scientifically interesting.
Concrete F1 estimate
Based on GenAI-vs-human agreement rates and the typical BERT fine-tuning premium:
- Category macro F1: 0.78-0.85 (depends on class imbalance handling and gold quality)
- Specificity macro F1: 0.65-0.75 (ceiling-limited by human disagreement)
- Combined (cat x spec) accuracy: 0.55-0.70
The swing categories for macro F1 are MR (~65-80% per-class F1), TPR (~70-90%), and N/O (~60-85%). Focal loss + SCL should push MR and N/O into the range where macro F1 clears 0.80.
The Meta-Narrative
The finding that trained student annotators achieve alpha = 0.801 on category but only 0.546 on specificity, while calibrated LLM panels achieve higher consistency (60.1% specificity unanimity vs 42.3% for humans), validates the synthetic-experts hypothesis for rule-heavy classification tasks. The low specificity agreement is not annotator incompetence -- it is evidence that the specificity construct demands systematic attention to IS/NOT lists and counting rules that humans don't consistently apply at a 15s/paragraph pace. GenAI's advantage on such multi-step rule application is itself a key finding.
The leave-one-out analysis showing that Opus earns the top rank without being privileged is the strongest validation of using frontier LLMs as "gold" annotators: they're not just consistent with each other, they're the most consistent with the emergent consensus of all 16 sources combined.
Timeline
| Task | Target | Status |
|---|---|---|
| Human labeling | 2026-04-01 | Done |
| GenAI benchmark (10 models) | 2026-04-02 | Done |
| 13-signal analysis | 2026-04-02 | Done |
| Gold set adjudication | 2026-04-03-04 | Next |
| Training data assembly | 2026-04-04 | |
| Fine-tuning ablations (6 configs) | 2026-04-05-08 | |
| Final evaluation on holdout | 2026-04-09 | |
| Executive memo + IGNITE slides | 2026-04-10-14 | |
| Submission | 2026-04-23 | |