# Specificity F1 Improvement Plan
**Goal:** Macro F1 > 0.80 on both the category and specificity heads.

**Current:** Cat F1 = 0.932 (passing); Spec F1 = 0.517 (needs ~+0.28).

**Constraint:** Specificity is paragraph-level and category-independent by design.

## Diagnosis

Per-class spec F1 (best run, epoch 5):
- L1 (Generic): ~0.79
- L2 (Domain-Adapted): ~0.29
- L3 (Firm-Specific): ~0.31
- L4 (Quantified): ~0.55

L2 and L3 drag macro F1 down to 0.52, well below the ~0.67 that L1 and L4 average on
their own. QWK = 0.840 shows ordinal ranking is strong; the problem is exact boundary
placement between adjacent levels.

### Root causes

1. **CORAL's shared weight vector.** CORAL uses `logit_k = w·x + b_k`: one weight
   vector for all thresholds. But the three transitions require different features:
   - L1→L2: cybersecurity terminology detection (ERM test)
   - L2→L3: firm-unique fact detection (named roles, systems)
   - L3→L4: quantified/verifiable claim detection (numbers, dates)

   A single `w` can't capture all three signal types.

2. **[CLS] pooling loses distributed signals.** A single "CISO" mention anywhere
   in a paragraph should bump it to L3, but the [CLS] token may not attend to it.

3. **Label noise at boundaries.** 8.7% of training labels had Grok specificity
   disagreement, concentrated at the L1/L2 and L2/L3 boundaries.

4. **Insufficient training.** The model was still improving at epoch 5; it had not converged.

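The shared-weight formulation in root cause 1 can be sketched as a minimal PyTorch module (an illustrative sketch, not the project's actual code; names and sizes are assumptions):

```python
import torch
import torch.nn as nn

class CoralHead(nn.Module):
    """Minimal CORAL-style ordinal head: one shared weight vector w,
    plus one learned bias b_k per threshold (logit_k = w·x + b_k)."""
    def __init__(self, hidden: int, num_levels: int = 4):
        super().__init__()
        self.w = nn.Linear(hidden, 1, bias=False)           # shared across all thresholds
        self.b = nn.Parameter(torch.zeros(num_levels - 1))  # one bias per threshold

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # (batch, 1) + (num_levels-1,) broadcasts to (batch, num_levels-1)
        return self.w(x) + self.b
```

Because the thresholds differ only by a scalar bias, every threshold ranks paragraphs along the same direction in feature space, which is consistent with high QWK but poor exact-boundary placement.
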
## Ideas (ordered by estimated ROI)

### Tier 1 — Implement first

**A. Independent threshold heads (replace CORAL)**
Replace the single CORAL weight vector with 3 independent binary classifiers,
each with its own learned features:
- threshold_L2plus: Linear(hidden, 1) — "has any qualifying facts?"
- threshold_L3plus: Linear(hidden, 1) — "has firm-specific facts?"
- threshold_L4: Linear(hidden, 1) — "has quantified/verifiable facts?"

Same cumulative binary targets as CORAL, but each threshold gets independent weights.
Optionally upgrade each head to a 2-layer MLP (hidden→256→1) for richer decision boundaries.

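A minimal sketch of this head with the optional MLP upgrade (module and function names are illustrative assumptions):

```python
import torch
import torch.nn as nn

class IndependentThresholdHead(nn.Module):
    """Three independent binary classifiers over cumulative ordinal targets:
    P(level>=2), P(level>=3), P(level=4). Each threshold learns its own
    features via a small MLP (hidden -> mlp_dim -> 1)."""
    def __init__(self, hidden: int = 1024, mlp_dim: int = 256, num_levels: int = 4):
        super().__init__()
        self.thresholds = nn.ModuleList(
            nn.Sequential(nn.Linear(hidden, mlp_dim), nn.GELU(), nn.Linear(mlp_dim, 1))
            for _ in range(num_levels - 1)
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # concatenate the K-1 per-threshold logits -> (batch, K-1)
        return torch.cat([head(x) for head in self.thresholds], dim=-1)

def decode_level(logits: torch.Tensor) -> torch.Tensor:
    """Level = 1 + number of thresholds predicted positive."""
    return 1 + (torch.sigmoid(logits) > 0.5).sum(dim=-1)
```

Decoding stays the same as CORAL (count positive thresholds), so only the head changes.
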
**B. High-confidence label filtering**
Only train specificity on paragraphs where all 3 Grok runs agreed on the specificity
level (~91.3% of the data, ~59K of 65K paragraphs). The ~6K disagreement cases are
exactly the noisy boundary labels that confuse the model. Category labels can still
use all the data.

**C. More epochs + early stopping on spec F1**
Run 15-20 epochs. Switch the model-selection metric from combined_macro_f1 to
spec_macro_f1 (since category already exceeds 0.80). Use patience=5.

**D. Attention pooling**
Replace [CLS]-token pooling with learned attention pooling over all tokens.
This lets the model attend to specific evidence tokens (CISO, $2M, NIST)
distributed anywhere in the paragraph.

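Attention pooling can be sketched as a single learned scoring vector over the token states (sizes are illustrative assumptions):

```python
import torch
import torch.nn as nn

class AttentionPooling(nn.Module):
    """Learned attention pooling: a linear scorer rates every token, and the
    pooled vector is the attention-weighted sum of token states, so a single
    strong evidence token anywhere in the paragraph can dominate."""
    def __init__(self, hidden: int = 1024):
        super().__init__()
        self.scorer = nn.Linear(hidden, 1)

    def forward(self, hidden_states, attention_mask):
        # hidden_states: (batch, seq, hidden); attention_mask: (batch, seq)
        scores = self.scorer(hidden_states).squeeze(-1)          # (batch, seq)
        scores = scores.masked_fill(attention_mask == 0, float("-inf"))
        weights = torch.softmax(scores, dim=-1)                  # (batch, seq)
        return torch.einsum("bs,bsh->bh", weights, hidden_states)
```

Padding tokens are masked to -inf before the softmax, so they receive zero weight.
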
### Tier 2 — If Tier 1 insufficient

**E. Ordinal consistency regularization**
Add a penalty when threshold k fires but threshold k-1 doesn't (e.g., the model
says "has firm-specific facts" but not "has domain terms"). Weight ~0.1.

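A simple differentiable form clamps violations of cumulative-probability monotonicity (a sketch; the function name is an illustrative assumption, weight 0.1 per the plan):

```python
import torch

def ordinal_consistency_penalty(logits: torch.Tensor, weight: float = 0.1):
    """Penalize P(level >= k+1) exceeding P(level >= k): with independent
    thresholds the cumulative probabilities should be non-increasing."""
    probs = torch.sigmoid(logits)                          # (batch, K-1)
    # positive only where a later threshold is more confident than an earlier one
    violation = (probs[:, 1:] - probs[:, :-1]).clamp(min=0)
    return weight * violation.mean()
```

The penalty is zero for any already-consistent prediction, so it only constrains the inconsistent cases.
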
**F. Differential learning rates**
Backbone: 1e-5; heads: 5e-4. Let the heads learn classification faster while
the backbone makes only fine adjustments.

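With AdamW this is just two parameter groups (a sketch assuming the model exposes `backbone` and `heads` submodules, which are illustrative names):

```python
import torch
import torch.nn as nn

def build_optimizer(model, backbone_lr=1e-5, head_lr=5e-4):
    """AdamW with two parameter groups: slow backbone, fast heads."""
    return torch.optim.AdamW([
        {"params": model.backbone.parameters(), "lr": backbone_lr},
        {"params": model.heads.parameters(), "lr": head_lr},
    ])
```
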
**G. Softmax head comparison**
Try a standard 4-class CE head (no ordinal constraint at all). If it outperforms both
CORAL and independent thresholds, the ordinal structure isn't helping.

**H. Multi-sample dropout**
Apply N different dropout masks and average the logits. This reduces variance in the
specificity head's predictions, especially for boundary cases.

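A minimal multi-sample dropout head (sizes, sample count, and dropout rate are illustrative assumptions):

```python
import torch
import torch.nn as nn

class MultiSampleDropoutHead(nn.Module):
    """Apply N independent dropout masks to the pooled vector and average
    the resulting logits; at eval time dropout is a no-op, so the head
    behaves like a plain linear classifier."""
    def __init__(self, hidden: int = 1024, out: int = 3, n_samples: int = 5, p: float = 0.3):
        super().__init__()
        self.dropouts = nn.ModuleList(nn.Dropout(p) for _ in range(n_samples))
        self.fc = nn.Linear(hidden, out)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        logits = [self.fc(d(x)) for d in self.dropouts]
        return torch.stack(logits, dim=0).mean(dim=0)
```
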
### Tier 3 — If nothing else works

**I. Specificity-focused auxiliary task**
The consensus labels include `specific_facts` arrays with classified fact types
(domain_term, named_role, quantified, etc.). Add a token-level auxiliary task
that detects these fact types. Specificity then becomes "what is the highest-level
fact type present?", which makes the ordinal structure explicit.

**J. Separate specificity model**
Train a dedicated model just for specificity, with a larger head, more
specificity-focused features, or a different architecture (e.g., token-level
fact extraction → aggregation).

**K. Re-annotate boundary cases**
Use GPT-5.4 to re-judge the ~9,323 majority-vote cases where the Grok runs
disagreed on specificity, giving cleaner labels at the boundaries.

## Experiment Log

### Experiment 1: Independent thresholds + attention pooling + MLP + filtering (15 epochs)

**Config:** `configs/finetune/iter1-independent.yaml`
- Specificity head: independent (3 separate 1024→256→1 MLP binary classifiers)
- Pooling: attention (learned attention over all tokens)
- Confidence filtering: only train spec on unanimous labels
- Ordinal consistency regularization: 0.1
- Class weighting: yes
- Base checkpoint: ModernBERT-large (no DAPT/TAPT)
- Epochs: 15

**Results:**

| Epoch | Combined | Cat F1 | Spec F1 | QWK |
|-------|----------|--------|---------|-----|
| 1 | 0.855 | 0.867 | 0.844 | 0.874 |
| 2 | 0.913 | 0.909 | 0.918 | 0.935 |
| 3 | 0.925 | 0.919 | 0.931 | 0.945 |
| 4 | 0.936 | 0.932 | 0.940 | 0.950 |
| 5 | 0.938 | 0.936 | 0.940 | 0.949 |
| **8** | **0.944** | **0.943** | **0.945** | **0.952** |
| 10 | 0.944 | 0.943 | 0.945 | 0.952 |
| 11 | 0.944 | 0.945 | 0.944 | 0.952 |

Stopped at epoch 11: the train-eval loss gap was 8× (0.06 vs 0.49) with no further
eval F1 improvement. Best checkpoint: epoch 8 (spec F1 = 0.945).

**Conclusion:** Massive improvement: spec F1 went from 0.517 (CORAL baseline) to
0.945 at epoch 8. Both targets (cat and spec F1 > 0.80) were exceeded by epoch 1.
Independent thresholds were the key insight; CORAL's shared weight vector was
the primary bottleneck. Attention pooling, MLP heads, and confidence filtering
all contributed. Tier 2 and Tier 3 ideas were not needed.

### Holdout Evaluation (1,200 paragraphs, proxy gold)

Validated on held-out data against two independent frontier-model references:

| Model | Ref | Cat F1 | Spec F1 | L2 F1 | Spec QWK |
|-------|-----|--------|---------|-------|----------|
| Independent (ep8) | GPT-5.4 | 0.934 | **0.895** | 0.798 | 0.932 |
| Independent (ep8) | Opus-4.6 | 0.923 | **0.883** | 0.776 | 0.923 |
| CORAL (ep5) | GPT-5.4 | 0.936 | 0.597 | 0.407 | 0.876 |
| CORAL (ep5) | Opus-4.6 | 0.928 | 0.596 | 0.418 | 0.872 |
| GPT-5.4 | Opus-4.6 | — | **0.885** | **0.805** | 0.919 |

**Key finding:** The model's holdout spec F1 (0.895) exceeds the inter-reference
agreement (0.885 between GPT-5.4 and Opus-4.6). The model has reached the
construct-reliability ceiling; further improvement requires cleaner reference
labels, not a better model.

**L2 is at ceiling:** Model L2 F1 (0.798) is within 0.007 of the reference agreement
(0.805). The L1↔L2 boundary is genuinely ambiguous. Remaining opportunity:
per-threshold sigmoid tuning against human gold labels (potential +0.01 to +0.02).
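That tuning step could look like the following sketch: a greedy per-threshold grid search against a small gold set (all names are illustrative assumptions, not an existing utility):

```python
import numpy as np

def macro_f1(gold: np.ndarray, pred: np.ndarray, levels=(1, 2, 3, 4)) -> float:
    """Unweighted mean of per-level F1 (absent levels score 0)."""
    f1s = []
    for lv in levels:
        tp = int(np.sum((pred == lv) & (gold == lv)))
        fp = int(np.sum((pred == lv) & (gold != lv)))
        fn = int(np.sum((pred != lv) & (gold == lv)))
        f1s.append(2 * tp / max(2 * tp + fp + fn, 1))
    return float(np.mean(f1s))

def tune_cutoffs(probs: np.ndarray, gold: np.ndarray,
                 grid=np.linspace(0.3, 0.7, 41)) -> np.ndarray:
    """Greedily tune one sigmoid cutoff per threshold, holding the others
    fixed, to maximize macro F1. A level decodes as 1 + the number of
    thresholds whose probability exceeds its cutoff."""
    cutoffs = np.full(probs.shape[1], 0.5)
    for k in range(probs.shape[1]):
        scored = []
        for c in grid:
            trial = cutoffs.copy()
            trial[k] = c
            pred = 1 + (probs > trial).sum(axis=1)
            scored.append((macro_f1(gold, pred), c))
        cutoffs[k] = max(scored)[1]
    return cutoffs
```

Because the search only moves decision boundaries, it cannot hurt QWK much but can recover a point or two of exact-boundary F1.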