# Specificity F1 Improvement Plan

- **Goal:** Macro F1 > 0.80 on both category and specificity heads
- **Current:** Cat F1 = 0.932 (passing); Spec F1 = 0.517 (needs ~+0.28)
- **Constraint:** Specificity is paragraph-level and category-independent by design

## Diagnosis

Per-class spec F1 (best run, epoch 5):

- L1 (Generic): ~0.79
- L2 (Domain-Adapted): ~0.29
- L3 (Firm-Specific): ~0.31
- L4 (Quantified): ~0.55

L2 and L3 drag macro F1 down to 0.52, far below the ~0.67 that L1 and L4 average on their own. QWK = 0.840 shows the ordinal ranking is strong; the problem is exact boundary placement between adjacent levels.

### Root causes

1. **CORAL's shared weight vector.** CORAL computes `logit_k = w·x + b_k`: one shared weight vector `w` for every threshold, with only the bias `b_k` varying. But the three transitions require different features:
   - L1→L2: cybersecurity terminology detection (ERM test)
   - L2→L3: firm-unique fact detection (named roles, systems)
   - L3→L4: quantified/verifiable claim detection (numbers, dates)

   A single `w` cannot capture all three signal types (see the sketch after this list).
2. **[CLS] pooling loses distributed signals.** A single "CISO" mention anywhere in a paragraph should bump the label to L3, but [CLS] may not attend to it.

3. **Label noise at boundaries.** 8.7% of training labels had Grok specificity disagreement, concentrated at the L1/L2 and L2/L3 boundaries.

4. **Insufficient training.** The model was still improving at epoch 5; it had not converged.
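For concreteness, a minimal sketch of the CORAL-style head described in cause 1 (PyTorch; the class and names are illustrative, not the repo's actual code):

```python
import torch
import torch.nn as nn

class CoralHead(nn.Module):
    """CORAL ordinal head: one shared weight vector, per-threshold biases.

    With K = 4 levels there are K - 1 = 3 cumulative thresholds, and every
    threshold reuses the same projection w: logit_k = w·x + b_k.
    """

    def __init__(self, hidden_size: int, num_levels: int = 4):
        super().__init__()
        self.w = nn.Linear(hidden_size, 1, bias=False)      # shared across thresholds
        self.b = nn.Parameter(torch.zeros(num_levels - 1))  # only b_k differs

    def forward(self, pooled: torch.Tensor) -> torch.Tensor:
        # pooled: (batch, hidden) -> cumulative logits: (batch, K-1)
        return self.w(pooled) + self.b
```

Because the three logits differ only by a scalar bias, this head can rank paragraphs along a single direction but cannot use different evidence for different boundaries.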

## Ideas (ordered by estimated ROI)

### Tier 1 — Implement first

**A. Independent threshold heads (replace CORAL).** Replace the single CORAL weight vector with 3 independent binary classifiers, each with its own learned features:

- `threshold_L2plus`: `Linear(hidden, 1)` — "has any qualifying facts?"
- `threshold_L3plus`: `Linear(hidden, 1)` — "has firm-specific facts?"
- `threshold_L4`: `Linear(hidden, 1)` — "has quantified/verifiable facts?"

These heads train on the same cumulative binary targets as CORAL, but each threshold has independent weights. Optionally upgrade each head to a 2-layer MLP (hidden→256→1) for richer decision boundaries, as in the sketch below.
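A minimal sketch of the MLP variant (PyTorch; class and dimension names are illustrative, not the repo's actual code):

```python
import torch
import torch.nn as nn

class IndependentThresholdHead(nn.Module):
    """Three independent binary classifiers, one per ordinal threshold.

    Unlike CORAL's shared w, each threshold (L2+, L3+, L4) has its own
    weights, so each can learn a different kind of evidence detector.
    """

    def __init__(self, hidden_size: int = 1024, mlp_dim: int = 256):
        super().__init__()
        self.thresholds = nn.ModuleList(
            nn.Sequential(
                nn.Linear(hidden_size, mlp_dim),
                nn.GELU(),
                nn.Linear(mlp_dim, 1),
            )
            for _ in range(3)  # threshold_L2plus, threshold_L3plus, threshold_L4
        )

    def forward(self, pooled: torch.Tensor) -> torch.Tensor:
        # (batch, hidden) -> (batch, 3) cumulative threshold logits
        return torch.cat([head(pooled) for head in self.thresholds], dim=-1)
```

Targets stay cumulative: a level-3 paragraph trains against `[1, 1, 0]` with per-threshold binary cross-entropy, and the predicted level is one plus the number of thresholds whose sigmoid exceeds 0.5.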

**B. High-confidence label filtering.** Only train specificity on paragraphs where all 3 Grok runs agreed on the specificity level (~91.3% of data, ~59K of 65K). The ~6K disagreement cases are exactly the noisy boundary labels that confuse the model. Category labels can still use all data.
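One way to implement this is a per-example mask on the specificity loss, so the category head still trains on every paragraph (a sketch; how the unanimity mask is built from the three Grok runs is assumed):

```python
import torch
import torch.nn.functional as F

def masked_spec_loss(
    spec_logits: torch.Tensor,   # (batch, 3) cumulative threshold logits
    spec_targets: torch.Tensor,  # (batch, 3) cumulative binary targets
    unanimous: torch.Tensor,     # (batch,) 1.0 if all 3 Grok runs agreed, else 0.0
) -> torch.Tensor:
    """Binary cross-entropy over thresholds, zeroed where labels were disputed."""
    per_example = F.binary_cross_entropy_with_logits(
        spec_logits, spec_targets, reduction="none"
    ).mean(dim=-1)                          # (batch,)
    denom = unanimous.sum().clamp(min=1.0)  # avoid divide-by-zero
    return (per_example * unanimous).sum() / denom
```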

**C. More epochs + early stopping on spec F1.** Run 15-20 epochs. Switch the model-selection metric from `combined_macro_f1` to `spec_macro_f1` (since category already exceeds 0.80). Use patience=5.
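In a hand-rolled training loop the change amounts to swapping the selection metric and tracking patience (a sketch; `train_fn`, `eval_fn`, and the `spec_macro_f1` key are assumed interfaces, not the repo's actual code):

```python
import copy

def train_with_early_stopping(model, train_fn, eval_fn, max_epochs=20, patience=5):
    """Select the checkpoint on spec_macro_f1 rather than combined_macro_f1."""
    best_f1, best_state, bad_epochs = 0.0, None, 0
    for epoch in range(1, max_epochs + 1):
        train_fn(model)                            # one epoch of training
        spec_f1 = eval_fn(model)["spec_macro_f1"]  # new selection metric
        if spec_f1 > best_f1:
            best_f1, bad_epochs = spec_f1, 0
            best_state = copy.deepcopy(model.state_dict())
        else:
            bad_epochs += 1
            if bad_epochs >= patience:             # patience = 5
                break
    if best_state is not None:
        model.load_state_dict(best_state)          # restore best epoch
    return best_f1
```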

**D. Attention pooling.** Replace [CLS] token pooling with learned attention pooling over all tokens. This lets the model attend to specific evidence tokens (CISO, $2M, NIST) distributed anywhere in the paragraph.
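A minimal sketch of learned attention pooling (illustrative, not the repo's implementation):

```python
import torch
import torch.nn as nn

class AttentionPooling(nn.Module):
    """Weighted sum over all token states instead of taking [CLS] alone.

    Evidence tokens (e.g. "CISO", "$2M", "NIST") can dominate the pooled
    vector regardless of where they appear in the paragraph.
    """

    def __init__(self, hidden_size: int):
        super().__init__()
        self.scorer = nn.Linear(hidden_size, 1)

    def forward(self, hidden_states: torch.Tensor, attention_mask: torch.Tensor) -> torch.Tensor:
        # hidden_states: (batch, seq, hidden); attention_mask: (batch, seq)
        scores = self.scorer(hidden_states).squeeze(-1)          # (batch, seq)
        scores = scores.masked_fill(attention_mask == 0, -1e9)   # ignore padding
        weights = torch.softmax(scores, dim=-1).unsqueeze(-1)    # (batch, seq, 1)
        return (weights * hidden_states).sum(dim=1)              # (batch, hidden)
```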

### Tier 2 — If Tier 1 insufficient

**E. Ordinal consistency regularization.** Add a penalty when threshold k fires but threshold k-1 doesn't (e.g., the model says "has firm-specific facts" but not "has domain terms"). Weight ~0.1.
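Since the cumulative probabilities should be non-increasing (p(L≥2) ≥ p(L≥3) ≥ p(L≥4)), the penalty can be written directly on the sigmoid outputs (a sketch):

```python
import torch

def ordinal_consistency_penalty(threshold_logits: torch.Tensor) -> torch.Tensor:
    """Penalize any threshold whose probability exceeds the one below it.

    threshold_logits: (batch, 3) cumulative logits for L2+, L3+, L4.
    """
    probs = torch.sigmoid(threshold_logits)
    violation = torch.relu(probs[:, 1:] - probs[:, :-1])  # > 0 only on violations
    return violation.mean()

# loss = task_loss + 0.1 * ordinal_consistency_penalty(spec_logits)
```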

**F. Differential learning rates.** Backbone: 1e-5; heads: 5e-4. Let the heads learn the classification task quickly while the backbone makes only fine adjustments.
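With PyTorch parameter groups this is a small optimizer change (assuming the model exposes `backbone` and `heads` submodules; those attribute names are illustrative):

```python
from torch.optim import AdamW

optimizer = AdamW([
    {"params": model.backbone.parameters(), "lr": 1e-5},  # gentle backbone updates
    {"params": model.heads.parameters(), "lr": 5e-4},     # fast-learning heads
])
```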

**G. Softmax head comparison.** Try standard 4-class cross-entropy with no ordinal constraint at all. If it outperforms both CORAL and independent thresholds, the ordinal structure isn't helping.

**H. Multi-sample dropout.** Apply N different dropout masks and average the logits. This reduces variance in the specificity head's predictions, especially for boundary cases.
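A sketch of such a head (illustrative names; dropout is a no-op at eval time, so the averaging acts as a training regularizer):

```python
import torch
import torch.nn as nn

class MultiSampleDropoutHead(nn.Module):
    """Run the classifier under N independent dropout masks and average logits."""

    def __init__(self, hidden_size: int, num_outputs: int, n_samples: int = 5, p: float = 0.3):
        super().__init__()
        self.dropouts = nn.ModuleList(nn.Dropout(p) for _ in range(n_samples))
        self.classifier = nn.Linear(hidden_size, num_outputs)

    def forward(self, pooled: torch.Tensor) -> torch.Tensor:
        logits = [self.classifier(drop(pooled)) for drop in self.dropouts]
        return torch.stack(logits, dim=0).mean(dim=0)  # average over masks
```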

### Tier 3 — If nothing else works

**I. Specificity-focused auxiliary task.** The consensus labels include `specific_facts` arrays with classified fact types (`domain_term`, `named_role`, `quantified`, etc.). Add a token-level auxiliary task that detects these fact types. Specificity then becomes "what is the highest-level fact type present?", making the ordinal structure explicit.

**J. Separate specificity model.** Train a dedicated model just for specificity, with a larger head, more specificity-focused features, or a different architecture (e.g., token-level fact extraction followed by aggregation).

**K. Re-annotate boundary cases.** Use GPT-5.4 to re-judge the ~9,323 majority-vote cases where Grok had specificity disagreement. Cleaner labels at the boundaries.

## Experiment Log

### Experiment 1: Independent thresholds + attention pooling + MLP + filtering (15 epochs)

Config: `configs/finetune/iter1-independent.yaml`

- Specificity head: independent (3 separate `Linear(1024→256→1)` binary classifiers)
- Pooling: attention (learned attention over all tokens)
- Confidence filtering: train the specificity head only on unanimous labels
- Ordinal consistency regularization: weight 0.1
- Class weighting: yes
- Base checkpoint: ModernBERT-large (no DAPT/TAPT)
- Epochs: 15

Results:

| Epoch | Combined | Cat F1 | Spec F1 | QWK |
|------:|---------:|-------:|--------:|----:|
| 1 | 0.855 | 0.867 | 0.844 | 0.874 |
| 2 | 0.913 | 0.909 | 0.918 | 0.935 |
| 3 | 0.925 | 0.919 | 0.931 | 0.945 |
| 4 | 0.936 | 0.932 | 0.940 | 0.950 |
| 5 | 0.938 | 0.936 | 0.940 | 0.949 |
| 8 | 0.944 | 0.943 | 0.945 | 0.952 |
| 10 | 0.944 | 0.943 | 0.945 | 0.952 |
| 11 | 0.944 | 0.945 | 0.944 | 0.952 |

Stopped at epoch 11: the train-eval loss gap had grown to 8x (0.06 train vs 0.49 eval) with no further eval F1 improvement. Best checkpoint: epoch 8 (spec F1 = 0.945).

Conclusion: A massive improvement. Spec F1 went from 0.517 (CORAL baseline) to 0.945 at epoch 8, and both targets (>0.80 cat and spec F1) were already exceeded by epoch 1. Independent thresholds were the key insight: CORAL's shared weight vector was the primary bottleneck. Attention pooling, MLP heads, and confidence filtering all contributed. Apart from the ordinal consistency regularizer (idea E), which this run already included, the Tier 2 and Tier 3 ideas were not needed.

## Holdout Evaluation (1,200 paragraphs, proxy gold)

Validated on held-out data against two independent frontier model references:

| Model | Ref | Cat F1 | Spec F1 | L2 F1 | Spec QWK |
|-------|-----|-------:|--------:|------:|---------:|
| Independent (ep8) | GPT-5.4 | 0.934 | 0.895 | 0.798 | 0.932 |
| Independent (ep8) | Opus-4.6 | 0.923 | 0.883 | 0.776 | 0.923 |
| CORAL (ep5) | GPT-5.4 | 0.936 | 0.597 | 0.407 | 0.876 |
| CORAL (ep5) | Opus-4.6 | 0.928 | 0.596 | 0.418 | 0.872 |
| GPT-5.4 | Opus-4.6 | | 0.885 | 0.805 | 0.919 |

**Key finding:** The model's holdout spec F1 (0.895) exceeds the inter-reference agreement (0.885 between GPT-5.4 and Opus-4.6). The model has reached the construct-reliability ceiling: further improvement requires cleaner reference labels, not a better model.

**L2 is at ceiling:** The model's L2 F1 (0.798) is within 0.007 of the inter-reference agreement (0.805). The L1↔L2 boundary is genuinely ambiguous. Remaining opportunity: tuning each threshold's sigmoid cutoff against human gold labels (potential +0.01-0.02).