From 41df5923f2c5794ec5345a9745a2bf6087fd4010 Mon Sep 17 00:00:00 2001 From: Joey Eamigh <55670930+JoeyEamigh@users.noreply.github.com> Date: Sun, 5 Apr 2026 12:17:03 -0400 Subject: [PATCH] adding old strategy notes for posterity (need to clean git history anyway) --- .gitignore | 4 +- docs/STRATEGY-NOTES.md | 196 +++++++++++++++++++++++++++++++++++++++++ 2 files changed, 197 insertions(+), 3 deletions(-) create mode 100644 docs/STRATEGY-NOTES.md diff --git a/.gitignore b/.gitignore index 380efcb..f561072 100644 --- a/.gitignore +++ b/.gitignore @@ -50,12 +50,10 @@ report.[0-9]_.[0-9]_.[0-9]_.[0-9]_.json .eslintcache .cache *.tsbuildinfo +unsloth_compiled_cache/ # Finder (MacOS) folder config .DS_Store python/*.whl /.dvc-store -# Personal notes -docs/STRATEGY-NOTES.md -unsloth_compiled_cache/ diff --git a/docs/STRATEGY-NOTES.md b/docs/STRATEGY-NOTES.md new file mode 100644 index 0000000..76524ce --- /dev/null +++ b/docs/STRATEGY-NOTES.md @@ -0,0 +1,196 @@ +# Strategy Notes — F1 Threshold Analysis & Decision Framework + +**Date:** 2026-04-03 +**Context:** Pre-training planning session. 21 days to deadline. + +--- + +## The Core Question + +Can our dual-head classifier (7-class category + 4-class specificity) hit macro F1 >= 0.80 on both dimensions? The rubric says "fine-tuned model with f1 score > .80" — singular, ambiguous whether it applies to one or both dimensions. Playing it safe: assume both. 
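To keep the target concrete, here is a minimal sketch of the macro F1 check under the cautious "both heads" reading. Pure Python for transparency; in practice sklearn's `f1_score(average="macro")` computes the same thing. Function names are placeholders, not project code.

```python
# Minimal sketch of macro-averaged F1: the unweighted mean of per-class F1.
# Pure Python; sklearn's f1_score(average="macro") is the production route.

def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 over the classes present in gold."""
    scores = []
    for c in sorted(set(y_true)):
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == c and p == c)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != c and p == c)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == c and p != c)
        scores.append(2 * tp / (2 * tp + fp + fn))  # denominator >= 1 here
    return sum(scores) / len(scores)

def passes_rubric(category_f1, specificity_f1, threshold=0.80):
    """The cautious reading: both heads must clear the bar."""
    return category_f1 >= threshold and specificity_f1 >= threshold
```

The unweighted averaging is exactly why the small classes matter: a 47-sample class counts as much toward the mean as a 535-sample class.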
+ +--- + +## Category (7-class): Likely Fine + +### Holdout distribution +| Category | Count | % | +|----------|-------|---| +| Risk Management Process | 472 | 39.3% | +| Board Governance | 228 | 19.0% | +| Management Role | 156 | 13.0% | +| Strategy Integration | 143 | 11.9% | +| Incident Disclosure | 94 | 7.8% | +| None/Other | 59 | 4.9% | +| Third-Party Risk | 48 | 4.0% | + +### Estimated macro F1 + +| Scenario | Macro F1 | Worst class | +|----------|----------|-------------| +| Pessimistic | 0.797 | MR (0.74) | +| Moderate | 0.833 | MR (0.78) | +| Optimistic | 0.866 | MR (0.82) | + +**Bottleneck:** MR, due to BG/MR/RMP triangle confusion. NOT the small classes (TP at 48, N/O at 59) — those are semantically distinct. + +**Insurance:** 5-class merge (BG+MR+RMP → "Governance & Management") yields ~0.85 macro F1. Soft merge — no retraining, just remap predictions at eval time. + +### BG/MR/RMP "irreducible ambiguity" +- 85 T5-plurality cases (7.1% of holdout) where humans and models disagree +- 71% on the BG/MR/RMP triangle +- This reflects genuine structural ambiguity in SEC Item 1C (companies interleave governance, roles, and processes in single paragraphs) +- Not a labeling failure — a domain property. Documented in T5-ANALYSIS.md. +- Impact on macro F1: ~2-3 percentage points. Not fatal. 
+ +--- + +## Specificity (4-class): The Real Risk + +### Holdout distribution +| Level | Count | % | Description | +|-------|-------|---|-------------| +| 1 — Generic Boilerplate | 535 | 44.6% | Could paste into any filing | +| 2 — Sector-Adapted | 47 | 3.9% | Names a standard, no firm detail | +| 3 — Firm-Specific | 449 | 37.4% | Unique-to-company facts | +| 4 — Quantified-Verifiable | 169 | 14.1% | 2+ hard verifiable facts | + +**Level 2 at 3.9% (47 samples) is structurally below the reliability threshold for per-class F1.** + +### Estimated macro F1 + +| Scenario | Macro F1 | +|----------|----------| +| Current 4-class (L2=3.9%) | **0.742** — below threshold | +| Broadened L2 (~20%) | **0.812** — clears | +| 3-class merge (1+2) | **0.837** — comfortable | + +### Why Level 2 is so small +Our codebook defined Level 2 as "names a specific recognized standard (NIST, ISO, SOC 2, etc.) but nothing unique to THIS company." That's a very narrow band. + +The professor's construct says: "(2) Sector-adapted - references industry but no firm-specific details." "References industry" is broader than "names a standard." Things like penetration testing, vulnerability scanning, SIEM, incident response plans — these are industry-specific practices that our codebook classifies as Level 1. + +We took the construct too literally and squeezed Level 2 into a sliver. + +### Level 4 is also tighter than the construct +Our codebook: 2+ QV-eligible facts required. +Professor's construct: "(4) Quantified and verifiable - includes specific metrics, dollar amounts, incident timelines, or third-party audit references" — that's a list of qualifying facts, not a "count two" rule. + +Loosening to 1+ QV fact would grow Level 4 from ~14% to ~18%. 
+ +--- + +## The Decision Fork + +### Option A: Broaden Level 2 + Loosen Level 4 (re-run Stage 1) + +**Codebook changes:** +- Level 2: "Uses cybersecurity domain terminology (specific practices, tool categories, or methodologies) recognizable to a security professional, but nothing unique to THIS company" +- Level 4: 1+ QV-eligible fact (not 2+) +- Keep v3.5 category improvements (materiality assessment rule, MR/RMP 3-step chain, BG purpose test) + +**What changes:** +- Stage 1 re-run: ~$120, ~20 minutes +- Benchmark re-run on holdout: ~$45 +- Gold specificity: use model consensus (6 models at 89% agreement > 3 humans at 42% consensus) +- Human labels: untouched for category; not re-done for specificity +- Distribution shifts to ~30/20/34/16 — healthy for macro F1 + +**Estimated specificity macro F1: 0.81** + +**Risk:** Broadened Level 1↔2 boundary might be fuzzier. Mitigated by IS/NOT lists (same pattern that already works) and DAPT+TAPT embeddings that encode domain vocabulary. + +### Option B: 3-class merge (no re-run) + +Merge Level 1+2 → "Non-specific." Three classes: Non-specific / Firm-specific / Quantified-verifiable. + +**What changes:** Nothing in training data. Soft merge at eval time. + +**Estimated specificity macro F1: 0.84** + +**Risk:** Professor's construct has 4 levels. Merging might be seen as not meeting the construct. Defense: "Level 2 represents 3.9% of real-world SEC paragraphs — below statistical reliability for per-class F1. Our own codebook protocol specifies collapsing when alpha < threshold." + +### Option C: Train first, decide after (pragmatic) + +Train with current data. Get real F1 numbers. If specificity 4-class > 0.80: done. If not: apply merge or re-run based on empirical evidence. + +**Risk:** If re-run is needed, lose 2-3 days. But gain information. + +### Option D: Pivot to different project (nuclear) + +Abandon SEC cyBERT. Pick a simpler single-head project from the professor's 7 options. Start from scratch. 
**Assessment: Bad idea.** Already have 10-model benchmark, DAPT+TAPT, gold adjudication, human labels. All A-grade criteria met except F1 threshold. A new project in 21 days, with a team that doesn't understand the technical work, would be weaker on every criterion. The F1 problem is solvable without pivoting.

---

## The Representative Sample Problem

Took a representative (proportional) holdout sample. This means rare classes have few test samples:
- TP: 48 (4.0%)
- N/O: 59 (4.9%)
- Level 2: 47 (3.9%)

An equal-per-class sample would have given ~171 per category class and ~300 per specificity class.

**Impact on category:** Minimal. TP and N/O are semantically distinct — the model should classify them well even with few samples. MR (156 samples) is the actual bottleneck, and it has enough samples.

**Impact on specificity:** This IS the problem. With only 47 Level 2 samples, a swing of just ±3 correct predictions moves the per-class F1 by ~0.06. The measurement is too noisy to be reliable. Fixed by broadening Level 2 (same paragraphs get new labels under new definition) or merging.

**Key insight:** The holdout paragraphs don't change. The LABELS change if we update the codebook. A paragraph that says "we conduct penetration testing and vulnerability assessments" is currently Level 1. Under a broadened definition, it becomes Level 2. Same paragraph, different label. Level 2 grows from 47 to ~200 without adding any new paragraphs.

---

## Human Labels Strategy

**Category:** Keep as-is. Human alpha = 0.825 (strong). No re-labeling needed.

**Specificity:** Two options:
1. Have teammates re-label specificity on updated codebook (2-3 days of their time)
2. Use model consensus for specificity gold (6 frontier models at 89% agreement vs 3 humans at 42% consensus)

Option 2 is defensible: "Human inter-annotator agreement on specificity (alpha=0.661) was below model panel agreement (89%).
Gold specificity labels use model consensus with human override only where humans were unanimous." The existing human alpha gets reported as-is — it's a metric about human agreement, separate from model evaluation gold. + +--- + +## Rubric Checklist + +### C (f1 > .80): achievable +- [ ] Fine-tuned model with F1 > .80 — category: 85-90% confident; specificity: needs Level 2 fix +- [ ] Performance comparison GenAI vs fine-tuned — have 10 models benchmarked +- [ ] Labeled datasets — 150K Stage 1 + 1,200 gold +- [ ] Documentation — extensive +- [ ] Python notebooks for replication — need to create + +### B (3+ of 4): already have all 4 +- [x] Cost, time, reproducibility — dollar amounts for every API call, session logs +- [x] 6+ models, 3+ suppliers — 10 models, 8 suppliers +- [x] Contemporary self-collected data — 72K paragraphs from SEC EDGAR +- [x] Compelling use case — SEC cyber disclosure quality assessment + +### A (3+ of 4): already have 3-4 +- [x] Error analysis — T5 deep-dive, confusion axis analysis, model reasoning examination +- [x] Mitigation strategy — codebook v3.0→v3.5, text-based BG removal, specificity hybrid, experimental validation +- [ ] Additional baselines — dictionary/keyword approach (the specificity IS/NOT lists ARE a dictionary baseline) +- [x] Comparison to amateur labels — Aaryan before/after (50.3% labels changed, alpha improvement documented) + +--- + +## Recommended Path + +1. Revise codebook specificity (broaden Level 2 IS/NOT list, loosen Level 4 to 1+ QV fact). 1 day. +2. Pilot on ~50 paragraphs with 2-3 models. Verify no disasters. +3. Re-run Stage 1 (~$120, 20 mins). Also picks up v3.5 category improvements. +4. Re-run benchmark models on holdout (~$45). New gold specificity = model consensus. +5. Don't re-run human labels. Use existing for category, model consensus for specificity. +6. Assemble training data. Train. Ablation matrix: {base, +DAPT, +DAPT+TAPT} x {+/- class weighting} x {CE vs focal loss}. +7. 
First real F1 number within 3-4 days of starting training. +8. If specificity 4-class < 0.80 after broadening: 3-class merge is still available as fallback. + +--- + +## Key Insight + +The F1 threshold is achievable. The project is strong. The specificity distribution is the only structural problem, and it's fixable by aligning the codebook with the professor's construct (which we drifted from by being too precise). Everything else — the T5 ambiguity, the representative sample, the small classes — is manageable. + +The worst thing to do right now is panic and pivot. The second worst thing is to agonize and delay. Pick a path, execute, get real numbers.
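Appendix: the ablation matrix from step 6 of the recommended path, enumerated mechanically so no cell gets skipped. Config keys and values are placeholders; each dict would be wired into the actual trainer.

```python
# Sketch: the {base, +DAPT, +DAPT+TAPT} x {+/- class weighting} x {CE, focal}
# ablation matrix as explicit run configs. Names here are placeholders.
from itertools import product

BACKBONES = ["base", "dapt", "dapt_tapt"]
CLASS_WEIGHTING = [False, True]
LOSSES = ["ce", "focal"]

def ablation_grid():
    """All 3 x 2 x 2 = 12 run configurations."""
    return [
        {"backbone": b, "class_weighting": w, "loss": loss}
        for b, w, loss in product(BACKBONES, CLASS_WEIGHTING, LOSSES)
    ]
```

Twelve runs total, which bounds the training budget for the first real F1 numbers.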