SEC-cyBERT/docs/SPECIFICITY-IMPROVEMENT-PLAN.md
2026-04-05 15:37:50 -04:00

# Specificity F1 Improvement Plan
**Goal:** Macro F1 > 0.80 on both category and specificity heads
**Current:** Cat F1=0.932 (passing), Spec F1=0.517 (needs ~+0.28)
**Constraint:** Specificity is paragraph-level and category-independent by design
## Diagnosis
Per-class spec F1 (best run, epoch 5):
- L1 (Generic): ~0.79
- L2 (Domain-Adapted): ~0.29
- L3 (Firm-Specific): ~0.31
- L4 (Quantified): ~0.55
L2 and L3 drag macro F1 down to 0.52, well below the ~0.67 average of the L1 and
L4 scores. QWK=0.840 shows ordinal ranking is strong — the problem is exact
boundary placement between adjacent levels.
### Root causes
1. **CORAL's shared weight vector.** CORAL uses `logit_k = w·x + b_k` — one weight
vector for all thresholds. But the three transitions require different features:
- L1→L2: cybersecurity terminology detection (ERM test)
- L2→L3: firm-unique fact detection (named roles, systems)
- L3→L4: quantified/verifiable claim detection (numbers, dates)
A single w can't capture all three signal types.
2. **[CLS] pooling loses distributed signals.** A single "CISO" mention anywhere
in a paragraph should bump to L3, but [CLS] may not attend to it.
3. **Label noise at boundaries.** 8.7% of training labels had Grok specificity
disagreement, concentrated at L1/L2 and L2/L3 boundaries.
4. **Insufficient training.** Model was still improving at epoch 5 — not converged.
## Ideas (ordered by estimated ROI)
### Tier 1 — Implement first
**A. Independent threshold heads (replace CORAL)**
Replace the single CORAL weight vector with 3 independent binary classifiers,
each with its own learned features:
- `threshold_L2plus`: `Linear(hidden, 1)` — "has any qualifying facts?"
- `threshold_L3plus`: `Linear(hidden, 1)` — "has firm-specific facts?"
- `threshold_L4`: `Linear(hidden, 1)` — "has quantified/verifiable facts?"
Same cumulative binary targets as CORAL, but each threshold has independent weights.
Optionally upgrade to 2-layer MLP (hidden→256→1) for richer decision boundaries.
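A minimal PyTorch sketch of the idea (class and function names are illustrative, not the project's actual code): three MLP binary classifiers, one per threshold, decoded with the same cumulative rule CORAL uses.

```python
import torch
import torch.nn as nn

class IndependentThresholdHead(nn.Module):
    """Three independent binary classifiers, one per ordinal threshold
    (L2+, L3+, L4). Unlike CORAL's logit_k = w*x + b_k (shared w), each
    threshold learns its own features; the 2-layer MLP variant shown here
    gives richer decision boundaries. Sizes are illustrative."""

    def __init__(self, hidden: int = 1024, mlp_dim: int = 256):
        super().__init__()
        self.thresholds = nn.ModuleList([
            nn.Sequential(nn.Linear(hidden, mlp_dim), nn.GELU(),
                          nn.Linear(mlp_dim, 1))
            for _ in range(3)  # threshold_L2plus, threshold_L3plus, threshold_L4
        ])

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, hidden) pooled paragraph vector -> (batch, 3) threshold logits
        return torch.cat([head(x) for head in self.thresholds], dim=-1)

def decode_level(logits: torch.Tensor) -> torch.Tensor:
    # Same cumulative decoding as CORAL: level = 1 + number of thresholds passed
    return 1 + (torch.sigmoid(logits) > 0.5).sum(dim=-1)
```

Training targets stay cumulative (an L3 paragraph is positive for L2+ and L3+, negative for L4), so only the parameterization changes, not the loss.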
**B. High-confidence label filtering**
Only train specificity on paragraphs where all 3 Grok runs agreed on specificity
level (~91.3% of data, ~59K of 65K). The ~6K disagreement cases are exactly the
noisy boundary labels that confuse the model. Category labels can still use all data.
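One way to implement this without dropping rows (so category can still train on everything) is to mask the specificity loss per example. A hedged sketch, assuming cumulative binary targets and a boolean unanimity mask:

```python
import torch
import torch.nn.functional as F

def masked_spec_loss(spec_logits: torch.Tensor,
                     spec_targets: torch.Tensor,
                     unanimous_mask: torch.Tensor) -> torch.Tensor:
    """BCE over cumulative threshold targets, averaged only over examples
    where all annotation runs agreed on the specificity level.
    unanimous_mask: (batch,) bool. Names are illustrative."""
    per_example = F.binary_cross_entropy_with_logits(
        spec_logits, spec_targets, reduction="none"
    ).mean(dim=-1)                       # (batch,) per-example loss
    if unanimous_mask.any():
        return per_example[unanimous_mask].mean()
    return spec_logits.new_zeros(())     # batch has no confident spec labels
```

The category loss is computed separately on the full batch, so no data is wasted on that head.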
**C. More epochs + early stopping on spec F1**
Run 15-20 epochs. Switch model selection metric from combined_macro_f1 to
spec_macro_f1 (since category already exceeds 0.80). Use patience=5.
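The selection logic amounts to tracking the best `spec_macro_f1` and stopping after `patience` epochs without improvement. A pure-Python sketch (metric dict keys are assumed to match the training logs):

```python
def early_stop_select(epoch_metrics, patience=5, key="spec_macro_f1"):
    """Return (best_epoch_index, best_value), scanning epochs in order and
    stopping once `patience` consecutive epochs fail to improve `key`."""
    best_epoch, best_val, since_best = 0, float("-inf"), 0
    for i, metrics in enumerate(epoch_metrics):
        if metrics[key] > best_val:
            best_epoch, best_val, since_best = i, metrics[key], 0
        else:
            since_best += 1
        if since_best >= patience:
            break  # early stop: keep the checkpoint from best_epoch
    return best_epoch, best_val
```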
**D. Attention pooling**
Replace [CLS] token pooling with learned attention pooling over all tokens.
This lets the model attend to specific evidence tokens (CISO, $2M, NIST)
distributed anywhere in the paragraph.
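A standard formulation, sketched below: a learned scoring vector weights every token state, so a single high-evidence token can dominate the pooled vector regardless of position. Module name and interface are assumptions.

```python
import torch
import torch.nn as nn

class AttentionPooling(nn.Module):
    """Learned attention pooling over all token states (replaces [CLS]).
    A linear scorer assigns each token a weight; softmax normalizes over
    real tokens only, so one 'CISO' or '$2M' token anywhere in the
    paragraph can carry the pooled representation."""

    def __init__(self, hidden: int):
        super().__init__()
        self.score = nn.Linear(hidden, 1)

    def forward(self, states: torch.Tensor, mask: torch.Tensor) -> torch.Tensor:
        # states: (batch, seq, hidden); mask: (batch, seq), 1 = real token
        scores = self.score(states).squeeze(-1)             # (batch, seq)
        scores = scores.masked_fill(mask == 0, float("-inf"))
        weights = torch.softmax(scores, dim=-1)             # padding gets weight 0
        return torch.einsum("bs,bsh->bh", weights, states)  # (batch, hidden)
```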
### Tier 2 — If Tier 1 insufficient
**E. Ordinal consistency regularization**
Add a penalty when threshold k fires but threshold k-1 doesn't (e.g., model
says "has firm-specific" but not "has domain terms"). Weight ~0.1.
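One plausible form (the exact penalty is an assumption; the 0.1 weight is from the plan): hinge on adjacent threshold probabilities so that p(threshold k) exceeding p(threshold k−1) is penalized.

```python
import torch

def ordinal_consistency_penalty(logits: torch.Tensor,
                                weight: float = 0.1) -> torch.Tensor:
    """Penalize ordinal violations between adjacent thresholds, e.g.
    p(firm-specific) > p(domain terms). logits: (batch, 3) threshold
    logits ordered L2+, L3+, L4. Zero when probabilities are
    monotonically non-increasing, as a consistent ordinal model implies."""
    probs = torch.sigmoid(logits)               # (batch, 3)
    gap = probs[:, 1:] - probs[:, :-1]          # p_k - p_{k-1}
    return weight * torch.clamp(gap, min=0.0).mean()
```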
**F. Differential learning rates**
Backbone: 1e-5, heads: 5e-4. Let the heads learn classification faster while
the backbone makes only fine adjustments.
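In PyTorch this is two optimizer parameter groups. A sketch, assuming head/pooling parameters are identifiable by name (adjust the substrings to the real module names):

```python
import torch
import torch.nn as nn

def build_optimizer(model: nn.Module,
                    backbone_lr: float = 1e-5,
                    head_lr: float = 5e-4) -> torch.optim.Optimizer:
    """AdamW with a slow backbone group and a fast head group. Assumes
    head/pooling parameter names contain 'head' or 'pool'."""
    head_params, backbone_params = [], []
    for name, param in model.named_parameters():
        if "head" in name or "pool" in name:
            head_params.append(param)
        else:
            backbone_params.append(param)
    return torch.optim.AdamW([
        {"params": backbone_params, "lr": backbone_lr},
        {"params": head_params, "lr": head_lr},
    ])
```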
**G. Softmax head comparison**
Try standard 4-class CE (no ordinal constraint at all). If it outperforms both
CORAL and independent thresholds, the ordinal structure isn't helping.
**H. Multi-sample dropout**
Apply N different dropout masks, average logits. Reduces variance in the
specificity head's predictions, especially for boundary cases.
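A minimal sketch of the technique (the dropout rate and sample count are illustrative): run the head N times under different dropout masks and average the logits.

```python
import torch
import torch.nn as nn

def multi_sample_forward(head: nn.Module, x: torch.Tensor,
                         p: float = 0.3, n: int = 5) -> torch.Tensor:
    """Average head logits over n independent dropout masks of the pooled
    representation. Because nn.Dropout scales by 1/(1-p), the average is
    an unbiased, lower-variance estimate of the no-dropout forward pass."""
    dropout = nn.Dropout(p)  # freshly constructed modules are in train mode
    return torch.stack([head(dropout(x)) for _ in range(n)]).mean(dim=0)
```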
### Tier 3 — If nothing else works
**I. Specificity-focused auxiliary task**
The consensus labels include `specific_facts` arrays with classified fact types
(domain_term, named_role, quantified, etc.). Add a token-level auxiliary task
that detects these fact types. Specificity becomes "what's the highest-level
fact type present?" — making the ordinal structure explicit.
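The aggregation step is simple once fact types are detected. A sketch using a hypothetical rank mapping (the fact-type names follow the `specific_facts` taxonomy above; the numeric ranks are an assumption):

```python
# Hypothetical mapping from detected fact type to specificity level.
FACT_RANK = {"domain_term": 2, "named_role": 3, "quantified": 4}

def level_from_facts(fact_types) -> int:
    """Specificity = highest-ranked fact type present in the paragraph,
    defaulting to L1 (generic) when no qualifying facts are detected.
    Unknown fact types fall back to L1 rather than raising."""
    return max((FACT_RANK.get(t, 1) for t in fact_types), default=1)
```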
**J. Separate specificity model**
Train a dedicated model just for specificity with a larger head, more
specificity-focused features, or a different architecture (e.g., token-level
fact extraction → aggregation).
**K. Re-annotate boundary cases**
Use GPT-5.4 to re-judge the ~9,323 majority-vote cases where Grok had
specificity disagreement. Cleaner labels at boundaries.
## Experiment Log
### Experiment 1: Independent thresholds + attention pooling + MLP + filtering (15 epochs)
**Config:** `configs/finetune/iter1-independent.yaml`
- Specificity head: independent (3 separate Linear(1024→256→1) binary classifiers)
- Pooling: attention (learned attention over all tokens)
- Confidence filtering: only train spec on unanimous labels
- Ordinal consistency regularization: 0.1
- Class weighting: yes
- Base checkpoint: ModernBERT-large (no DAPT/TAPT)
- Epochs: 15
**Results:**
| Epoch | Combined | Cat F1 | Spec F1 | QWK |
|-------|----------|--------|---------|-----|
| 1 | 0.855 | 0.867 | 0.844 | 0.874 |
| 2 | 0.913 | 0.909 | 0.918 | 0.935 |
| 3 | 0.925 | 0.919 | 0.931 | 0.945 |
| 4 | 0.936 | 0.932 | 0.940 | 0.950 |
| 5 | 0.938 | 0.936 | 0.940 | 0.949 |
| **8** | **0.944** | **0.943** | **0.945** | **0.952** |
| 10 | 0.944 | 0.943 | 0.945 | 0.952 |
| 11 | 0.944 | 0.945 | 0.944 | 0.952 |
Stopped at epoch 11 — train-eval loss gap was 8× (0.06 vs 0.49) with no further
eval F1 improvement. Best checkpoint: epoch 8 (spec F1=0.945).
**Conclusion:** Massive improvement — spec F1 went from 0.517 (CORAL baseline) to
0.945 at epoch 8. Both targets (>0.80 cat and spec F1) exceeded by epoch 1.
Independent thresholds were the key insight — CORAL's shared weight vector was
the primary bottleneck. Attention pooling, MLP heads, and confidence filtering
all contributed. Tier 2 and Tier 3 ideas were not needed.
### Holdout Evaluation (1,200 paragraphs, proxy gold)
Validated on held-out data against two independent frontier model references:
| Model | Ref | Cat F1 | Spec F1 | L2 F1 | Spec QWK |
|-------|-----|--------|---------|-------|----------|
| Independent (ep8) | GPT-5.4 | 0.934 | **0.895** | 0.798 | 0.932 |
| Independent (ep8) | Opus-4.6 | 0.923 | **0.883** | 0.776 | 0.923 |
| CORAL (ep5) | GPT-5.4 | 0.936 | 0.597 | 0.407 | 0.876 |
| CORAL (ep5) | Opus-4.6 | 0.928 | 0.596 | 0.418 | 0.872 |
| GPT-5.4 | Opus-4.6 | — | **0.885** | **0.805** | 0.919 |
**Key finding:** The model's holdout spec F1 (0.895) exceeds the inter-reference
agreement (0.885 between GPT-5.4 and Opus-4.6). The model has reached the
construct reliability ceiling — further improvement requires cleaner reference
labels, not a better model.
**L2 is at ceiling:** Model L2 F1 (0.798) is within 0.007 of reference agreement
(0.805). The L1↔L2 boundary is genuinely ambiguous. Remaining opportunity:
per-threshold sigmoid tuning against human gold labels (potential +0.01-0.02).