v2 holdout

This commit is contained in: commit 160adc42ab (parent 1f2d748a1d)

@@ -1,6 +1,6 @@
 outs:
-- md5: 4ad135e50584bca430b79307e8bd1050.dir
-  size: 741469715
-  nfiles: 194
+- md5: 2428c0895414e6bd46c229b661a68b6d.dir
+  size: 724690357
+  nfiles: 226
   hash: md5
   path: .dvc-store
@@ -217,17 +217,162 @@ The v1 data pipeline, corpus, DAPT checkpoint, and TAPT checkpoint are all unchanged

- NOT overindexed on confusion-axis cases
- Separate ~200-paragraph dev set for prompt iteration (excluded from holdout)
---

## Phase 7: Holdout Selection & Prompt Engineering

### Holdout Sampling
We used the v1 Stage 1 consensus labels (50,003 paragraphs, 3-model majority vote under the v2.5 prompt) as a sampling guide, with a heuristic prediction of v2 specificity: a keyword scan for domain terminology to flag v1 Level 1 paragraphs likely to become Level 2 under v2 rules, and a QV-indicator scan for likely Level 3→4 promotions.
**Allocation:** 185 per non-ID category and 90 for Incident Disclosure (only 166 available in the annotated corpus), for exactly 1,200 paragraphs. Max 2 paragraphs per company per category stratum to prevent boilerplate clustering. All specificity floors met (≥100 per level); 1,042 unique companies represented.
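In sketch form, the allocation is a quota-with-cap sampler (illustrative only: the field names and the `sample_holdout` helper are hypothetical, and the real logic lives in `scripts/sample-v2-holdout.py`):

```python
import random
from collections import defaultdict

def sample_holdout(paragraphs, per_category=185, id_quota=90,
                   per_company_cap=2, seed=17):
    """Stratified sample: fixed quota per category, with at most
    `per_company_cap` paragraphs per company within each stratum."""
    rng = random.Random(seed)
    by_category = defaultdict(list)
    for p in paragraphs:
        by_category[p["category"]].append(p)

    holdout = []
    for category, pool in by_category.items():
        quota = id_quota if category == "ID" else per_category
        rng.shuffle(pool)  # random within stratum, NOT difficulty-weighted
        company_counts = defaultdict(int)
        picked = []
        for p in pool:
            if len(picked) == quota:
                break
            if company_counts[p["company"]] < per_company_cap:
                company_counts[p["company"]] += 1
                picked.append(p)
        holdout.extend(picked)
    return holdout
```

The per-company cap is what prevents one issuer's boilerplate from dominating a stratum.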
The v1 holdout had been intentionally oversampled on confusion-axis cases (split votes between MR/RMP, N/O/SI, etc.) — useful for codebook development but structurally hostile to F1. The v2 holdout is random within each category stratum: hard cases appear at their natural frequency, not overweighted.
### Prompt Iteration: From List-Matching to Principle-Based Reasoning
The v2 prompt went through five versions (v4.0→v4.4), each tested against a 200-paragraph dev batch drawn from the holdout with GPT-5.4 (~$6 total pilot cost).
**v4.0 (baseline rewrite):** Translated the v2 codebook into the system prompt. The category section used the "what question?" test and worked well, at 87% agreement with v1 consensus. The specificity section used exhaustive IS/NOT lists, matching the v1 approach. Result: Level 2 grew from 6% to 16% (domain-terminology broadening) and Level 4 grew from 5% to 22% (1+ QV rule). But an audit revealed the model was pattern-matching against the lists rather than reasoning about the underlying principles. Two telling errors: "Vice President, Information Systems and Technology" and "Senior Vice President of Information Technology" were classified as Level 1 because neither exactly matched the IS-list entry "VP of IT/Security."
**The list-matching problem:** The category section — built around reasoning principles ("what question does this paragraph answer?", person-removal test, materiality linguistic test) — achieved 87% agreement. The specificity section — built around exhaustive checklists — caught listed items but missed unlisted items that satisfied the same principle. The model was executing a lookup table, not applying the ERM test.
**v4.1 (principle-first restructure):** Restructured all three specificity levels to lead with the principle and compress lists to boundary-case disambiguation only:

- Level 2: "Apply the ERM test — would a non-security ERM professional use this language?" with illustrative examples
- Level 3: "Would this detail help narrow down which company wrote it?" with the VP-or-above bright line
- Level 4: "Could someone outside the company verify this?" with boundary cases
Result: +12 Level 1→2 catches (the model now reasoning about vocabulary level, not scanning a list), and the VP/SVP titles were fixed. But Level 4 regressed — the model started reasoning about whether QV facts were "relevant to the paragraph's main point" instead of treating specificity as a presence check.
**The independence insight:** Category and specificity are independent dimensions. Category captures what the paragraph is ABOUT. Specificity captures how informative it is AS A WHOLE. A paragraph classified as RMP that mentions a CISO's CISSP in a subordinate clause is RMP at Level 4 — the certification is verifiable regardless of whether it serves the category. The model was conflating "this fact is secondary to the paragraph's purpose" with "this fact doesn't count for specificity." This is wrong: specificity is a presence check on the entire paragraph, not a relevance judgment.
This also raised a methodological question: SHOULD specificity be category-conditional? The steelman for category-conditional specificity: "Board Governance at Level 4" should mean the governance disclosure is highly specific, not that a tangential financial fact inflated the score. The steelman against: SEC paragraphs interleave topics, you can't cleanly decompose facts into category buckets, and conditional specificity introduces cascading errors (wrong category → wrong specificity). For this project, paragraph-level specificity is the right choice — it matches the construct, is simpler to annotate, and produces higher agreement. Acknowledged as a limitation for the paper.
**v4.2–v4.4 (surgical fixes):** Added explicit presence-check framing, a hard vs. soft number boundary ("12 professionals" → QV, "approximately 20 departments" → not QV), and the "various certifications including CISSP → YES" rule (named certifications are QV regardless of surrounding hedge words). The final prompt (v4.4) recovers Level 4 to within 1 of baseline while retaining all principle-based gains at Levels 2 and 3.
**v4.4 pilot results (200 paragraphs, GPT-5.4):**
| Specificity | v4.0 (list) | v4.4 (principle) | Change |
|-------------|-------------|------------------|--------|
| L1 | 81 (40.5%) | 65 (32.5%) | -16 |
| L2 | 32 (16.0%) | 41 (20.5%) | +9 |
| L3 | 43 (21.5%) | 51 (25.5%) | +8 |
| L4 | 44 (22.0%) | 43 (21.5%) | -1 |
Category: 95.5% agreement with v1 consensus. Specificity: 84.5% agreement (expected divergence given the broadened L2 and the 1+ QV rule). The 200-paragraph dev batch is now contaminated by prompt examples that target specific cases in it — further iteration requires the unseen 1,000 paragraphs from the full holdout.
### Full Holdout Validation & v4.5
Running v4.4 on the full 1,200-paragraph holdout ($5.70) revealed three problems not visible in the 200-paragraph pilot:
**Problem 1: 34.5% medium-confidence specificity.** The model was uncertain on 414 of 1,200 paragraphs, concentrated at the L1/L2 boundary (59% of L2 calls were medium-confidence) and the L2/L3 boundary (51% of L3). Third-Party Risk was worst: 74% medium-confidence on specificity. The model's reasoning showed it listing zero specific facts but still assigning L2 based on vibes — the paragraph "felt" domain-adapted because the topic was cybersecurity, even when the vocabulary was generic ERM language.
**Problem 2: SI materiality assertions falsely promoted to L4.** Paragraphs like "As of December 28, 2024, we have not had any material cybersecurity incidents" were classified L4 because a specific date anchored the claim. But negative self-assertions are not externally verifiable — you cannot independently confirm the absence of something. These are Strategy Integration at Level 1, not Level 4.
**Problem 3: specific_facts discarded from stored output.** The `toLabelOutput()` function stripped the `specific_facts` array before writing to disk. The model was generating facts during inference (the schema required it), but we couldn't verify the mechanical bridge between facts and specificity level because the evidence was thrown away.
**v4.5 fixes:**
1. **Mechanical bridge enforced.** Restructured the specificity protocol as a scan-tag-max pipeline: scan for facts, tag each as [DOMAIN]/[FIRM]/[VERIFIABLE], assign specificity = max(tags). Added an explicit rule: "if specific_facts is empty, specificity MUST be Generic Boilerplate." Result: 100% consistency — L1 always empty, L2+ always populated with supporting facts. The bridge prevents the model from overriding its own fact-finding with holistic vibes.
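The scan-tag-max rule is mechanical enough to state as validation code (a sketch, with hypothetical function names — the real enforcement lives in the prompt, not in post-processing):

```python
# Tag ranks mirror the scan-tag-max rule: specificity is the max over
# fact tags, and an empty fact list forces Generic Boilerplate (L1).
TAG_LEVEL = {"DOMAIN": 2, "FIRM": 3, "VERIFIABLE": 4}

def bridged_specificity(specific_facts):
    """specific_facts: list of (text, tag) pairs from the model output."""
    if not specific_facts:  # no facts => L1, no holistic override allowed
        return 1
    return max(TAG_LEVEL[tag] for _, tag in specific_facts)

def is_bridge_consistent(label):
    """Check a stored label against the bridge rule."""
    return label["specificity"] == bridged_specificity(label["specific_facts"])
```

This is the check behind the "bridge consistency: 100%" row in the results table below.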
2. **Expertise vs. topic clarification for L1/L2.** Added: "The ERM test evaluates whether the paragraph demonstrates cybersecurity EXPERTISE, not whether it discusses a cybersecurity TOPIC. Every paragraph in these filings discusses cybersecurity — that's what the filing requires. L1 means generic oversight language any business professional could write. L2 means the writer shows they understand HOW cybersecurity works." With TP-specific examples: "We conduct vendor security assessments" → L1 (generic process description); "We review vendors' SOC 2 attestations and require encryption at rest" → L2 (specific security evidence requiring domain knowledge).
3. **SI negative assertions excluded from L4.** Added explicit NOT-verifiable examples: "We have not experienced any material cybersecurity incidents" → NOT QV (cannot externally verify absence); "In 2023, we did not experience a material incident" → NOT QV (a year does not make a negative assertion verifiable). Also added lower bounds as verifiable: "more than 20 years" → YES (checkable threshold, unlike "approximately 20" which is hedged in both directions).
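These boundary rules are regular enough to express as a toy predicate (illustrative regexes only — the real distinction is made by the model's reasoning under the prompt, not by pattern matching):

```python
import re

# Toy classifier for the QV boundary rules: hedged numbers fail, lower
# bounds pass, negative self-assertions fail regardless of anchoring dates.
HEDGE = re.compile(r"\b(approximately|about|around|roughly)\s+\d", re.I)
LOWER_BOUND = re.compile(r"\b(more than|over|at least)\s+\d", re.I)
NEGATIVE_ASSERTION = re.compile(r"\b(have not|did not|no)\b.*\b(material|incident)", re.I)

def is_quantified_verifiable(sentence):
    if NEGATIVE_ASSERTION.search(sentence):
        return False  # absence claims are not externally verifiable
    if LOWER_BOUND.search(sentence):
        return True   # checkable threshold ("more than 20 years")
    if HEDGE.search(sentence):
        return False  # hedged in both directions ("approximately 20")
    return bool(re.search(r"\d", sentence))  # hard number present
```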
4. **Fact storage.** Updated `toLabelOutput()` and the `LabelOutput` schema to preserve `specific_facts` in stored output. Added `domain_term` to the `FactType` enum for L2-level vocabulary evidence.
**v4.5 results (1,200 paragraphs, GPT-5.4, $6.88):**
| Metric | v4.4 | v4.5 |
|--------|------|------|
| L1 | 546 (45.5%) | 618 (51.5%) |
| L2 | 229 (19.1%) | 168 (14.0%) |
| L3 | 225 (18.8%) | 207 (17.2%) |
| L4 | 200 (16.7%) | 207 (17.2%) |
| Medium confidence | 414 (34.5%) | 211 (17.6%) |
| Bridge consistency | unknown | 100% |
| SI false L4s | ~6 | 0 |
| Category stability | — | 96.8% |
L2 at 14% is below the 15% holdout target, but the holdout oversamples TP (14.4% vs 5% in corpus) and TP is where 55 of the 61 L2→L1 drops concentrated. On the full corpus (46% RMP, 5% TP), L2 should be ~15-17%. The TP drops are correct — verified by inspecting the facts: survivors list SOC reports, vulnerability scans, penetration testing; drops use only generic vendor management language ("contractual requirements", "vendor due diligence").
**Key architectural insight:** With reasoning models, structured output fields are results, not reasoning steps. The model decides everything in reasoning tokens before generating JSON. The mechanical bridge works by influencing the reasoning process through prompt text, not through schema field ordering. The specific_facts field captures the model's evidence for our debugging, but the actual bridge enforcement happens in the model's internal reasoning, guided by the prompt's explicit consistency rules.
### v2 Holdout Benchmark (10 models, 8 providers)
With v4.5 locked, we ran the full BENCHMARK_MODELS panel on the 1,200-paragraph v2 holdout to evaluate model quality before committing to the ~$100 Stage 1 re-run. GPT-5.4 (v4.5) is the reference — our best-validated model on the holdout, the one whose prompt iterations we hand-verified.
**Full benchmark results (vs GPT-5.4 reference):**
| Model | N | Cat% | Cat κ | Spec% | Spec κw | Both% | 50K proj | Reasoning tok/para |
|-------|---|------|-------|-------|---------|-------|----------|--------------------|
| Grok 4.1 Fast | 1200 | 93.7% | 0.925 | 91.6% | 0.929 | 86.1% | $32 | 584 |
| Opus 4.6 (prompt-only) | 1184 | 93.7% | 0.925 | 90.1% | 0.910 | 85.2% | $0 (sub) | — |
| Gemini 3.1 Pro | 1200 | 93.8% | 0.926 | 89.4% | 0.906 | 84.2% | $735 | 502 |
| GLM-5 | 1200 | 92.8% | 0.915 | 88.3% | 0.898 | 82.8% | $364 | 1421 |
| Kimi K2.5 | 1200 | 92.6% | 0.912 | 88.1% | 0.894 | 82.8% | $353 | 2832 |
| Gemini 3.1 Flash Lite | 1200 | 91.8% | 0.904 | 83.0% | 0.844 | 76.5% | $79 | 363 |
| MIMO v2 Flash | 794 | 92.7% | 0.911 | 85.3%* | 0.662 | 79.7% | $26 | 1423 |
| MIMO v2 Pro | 980 | 94.0% | — | 90.7% | — | 85.9% | $274 | 1439 |
| MiniMax M2.7 | 1198 | 87.6% | 0.855 | 76.5% | 0.756 | 68.5% | $70 | 615 |
*MIMO Flash spec% is misleading — 91.1% of its labels are L1 (collapsed distribution). κw = 0.662 reflects this.
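Spec κw above is quadratic-weighted Cohen's kappa over the four ordinal levels. For reference, a self-contained version of the textbook formula (a sketch, not necessarily the exact evaluation script used to produce the table):

```python
def quadratic_weighted_kappa(a, b, levels=4):
    """Cohen's kappa with quadratic weights for ordinal labels in 1..levels."""
    n = len(a)
    # observed confusion counts
    obs = [[0.0] * levels for _ in range(levels)]
    for x, y in zip(a, b):
        obs[x - 1][y - 1] += 1
    # marginal label distributions for each rater
    pa = [sum(1 for x in a if x == k + 1) / n for k in range(levels)]
    pb = [sum(1 for y in b if y == k + 1) / n for k in range(levels)]
    num = den = 0.0
    for i in range(levels):
        for j in range(levels):
            w = (i - j) ** 2 / (levels - 1) ** 2  # quadratic disagreement weight
            num += w * obs[i][j] / n              # observed weighted disagreement
            den += w * pa[i] * pb[j]              # chance weighted disagreement
    return 1.0 - num / den
```

Quadratic weighting is what makes an L1↔L2 confusion much cheaper than an L1↔L4 confusion, which matters for an ordinal scale like specificity.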
**Pilot candidates (200-paragraph tests):**
| Model | Cat% | Spec% | Both% | 50K proj | Verdict |
|-------|------|-------|-------|----------|---------|
| Qwen3-235B MoE | 89.9% | 62.6% | 56.1% | $18 | Dead — 0 reasoning tokens, 34% L4 |
| Seed 1.6 Flash | 87.5% | 74.7% | 67.7% | $24 | Weak — below Flash Lite |
| Qwen3.5 Flash | 92.9% | n/a | n/a | $70 | Dead — 100% L1 collapse |
**Key findings from the benchmark:**
1. **Clear quality tiers.** Grok Fast stands alone as the best affordable model (86.1% both-match, $32/50K). There's a ~10pp gap to the next affordable option (Flash Lite at 76.5%, $79). Everything in between costs $350+.
2. **MIMO Flash specificity is broken.** Category agreement is fine (92.7%) but specificity collapses to 91.1% L1 — it simply doesn't differentiate specificity levels. The v1 Stage 1 panel included MIMO Flash; this means v1 specificity consensus was partially degraded by one broken voter.
3. **Opus performs better without the codebook.** We ran Opus via Agent SDK in two configurations: (a) full v2 codebook + operational prompt (37.7KB system prompt), (b) operational prompt only (16.2KB). Prompt-only was significantly better: 85.2% vs 82.4% both-match, 49.2% vs 40.5% facts coverage. The codebook was actively diluting the operational prompt's bridge instruction. This is a counterintuitive but important finding for the paper — more context can hurt performance when the operational prompt has been carefully engineered.
4. **Reasoning tokens correlate with quality, but not linearly.** Kimi K2.5 reasons the most (2832 tokens/para) but ranks 5th. Grok reasons modestly (584 tokens) and ranks 1st. Quality seems to depend more on the model's internal architecture than on raw reasoning volume. Models with 0 reasoning tokens (Qwen3-235B) or with reasoning that doesn't engage with specificity (Qwen3.5 Flash — 4381 tokens, all L1) are categorically broken for this task.
5. **No viable cheap third model exists.** We searched OpenRouter exhaustively for models under $50/50K that support structured output and reasoning. Every candidate (Qwen, ByteDance Seed, etc.) performed below Flash Lite, which was already the weakest panel member.
6. **Category agreement is high across all non-broken models** (>91% vs reference, κ > 0.90). The hard problem is specificity, where the mechanical bridge helps good models but can't save models that don't reason about it properly.
### Model Selection: Grok ×3 Self-Consistency
The budget constraint ($175 remaining for Stage 1 + Stage 2 + everything else) eliminated all multi-model panels except Grok + Flash Lite ($111). But Flash Lite's 76.5% both-match and inflated L2 distribution (19.1% vs 14% reference) made it a weak second voter.
We investigated whether running Grok multiple times could produce independent signals. The temperature question turned out to be irrelevant: reasoning models have internal stochastic chain-of-thought that produces different outputs on repeated identical calls regardless of temperature settings. Most providers silently ignore `temperature: 0` for reasoning models (OpenAI explicitly rejects it; others drop it). Our `temperature: 0` was cosmetic the entire time.
**Empirical verification:** We re-ran 47 holdout paragraphs through Grok 4.1 Fast with identical inputs. Results:

- Category: 47/47 identical (100% deterministic)
- Specificity: 43/47 identical (91.5%); 4 paragraphs (8.5%) got different labels
- All divergence was on specificity (L1↔L2, L1→L3, L3→L4) — exactly the ambiguous boundary cases where multiple runs provide real tiebreaking value
This 8.5% per-pair divergence rate means:

- ~90% of paragraphs will be 3/3 unanimous → strong consensus
- ~10% will be 2-1 split → majority vote resolves boundary cases
- Category is always unanimous → category quality = Grok's quality (93.7%, κ=0.925)
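The intended ×3 aggregation can be sketched as a per-paragraph majority vote (assumed shape — the production aggregator may differ, and the median tiebreak for a 1-1-1 split is our assumption, since all observed splits were 2-1):

```python
from collections import Counter

def aggregate_runs(runs):
    """Majority vote over 3 Grok runs for one paragraph.
    runs: list of dicts with 'category' and 'specificity' keys."""
    cats = Counter(r["category"] for r in runs)
    specs = Counter(r["specificity"] for r in runs)
    category, _ = cats.most_common(1)[0]  # empirically always unanimous
    spec, votes = specs.most_common(1)[0]
    if votes == 1:  # 1-1-1 split: fall back to the median ordinal level
        spec = sorted(r["specificity"] for r in runs)[1]
    return {"category": category,
            "specificity": spec,
            "unanimous": votes == len(runs)}
```

The `unanimous` flag is what later feeds the unanimous-vs-majority weighting in training data assembly.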
**Self-consistency is a well-established pattern** (Wang et al. 2022). The weakness vs multi-model consensus is shared systematic biases — all three runs make the same systematic errors. But with κ=0.925 on category and κw=0.929 on specificity, Grok's systematic errors are rare. The 8.5% stochastic variation is concentrated exactly where we want it: ambiguous specificity boundaries.
**Cost: $96 for Grok ×3** (3 × $32 through OpenRouter). Leaves $80 for the Stage 2 judge and any reruns. An alternative — xAI's Batch API at 50% off — would reduce this to $48, but requires bypassing OpenRouter.
### Cost of the Reboot (updated)
| Item | Estimated Cost | Actual Cost |
|------|----------------|-------------|
| Prompt iteration (v4.0–v4.5, ~8 rounds) | ~$10 | $19.59 |
| v2 holdout benchmark (10 models + 3 pilots) | ~$45 | $45.47 |
| Stage 1 re-run (Grok ×3, 50K paragraphs) | ~$96 | pending |
| Stage 2 judge (disputed paragraphs) | ~$20-40 | pending |
| Human re-labeling | $0 (team labor) | pending |
| **Total additional API** | **~$175-185** | |
Against the ~$120 already spent on v1 API calls (not recovered). Total project API cost: ~$300-305 of the $360 budget.

---
docs/STATUS.md (307 changed lines)

@@ -1,195 +1,174 @@
# Project Status — v2 Pipeline

**Deadline:** 2026-04-24 | **Started:** 2026-04-03 | **Updated:** 2026-04-05 (holdout benchmark done, Grok ×3 selected)
## Carried Forward (not re-done)
- 72,045 paragraphs (49,795 annotated), quality tiers, 6 surgical patches
- DAPT checkpoint (eval loss 0.7250, ~14.5h) + TAPT checkpoint (eval loss 1.0754, ~50min)
- v1 data preserved: 150K Stage 1 annotations, 10-model benchmark, 6-annotator human labels, gold adjudication
- v2 codebook approved (5/6 group approval 2026-04-04)
|
---
|
||||||
- [ ] Draw ~200 paragraph dev set from existing Stage 1 labels (stratified, separate from holdout)
|
|
||||||
- [ ] Update Stage 1 prompt to match v2 codebook
|
|
||||||
- [ ] Run 2-3 models on dev set, analyze results
|
|
||||||
- [ ] Iterate prompt against judge panel until reasonable consensus
|
|
||||||
- [ ] Update codebook with any rulings needed (should be minimal if rules are clean)
|
|
||||||
- [ ] Re-approval if codebook changed materially
|
|
||||||
- **Estimated cost:** ~$5-10
|
|
||||||
- **Estimated time:** 1-2 sessions
|
|
||||||
|
|
||||||
### Step 3: Stage 1 Re-Run
|
## Pipeline Steps
|
||||||
- [ ] Lock v2 prompt
|
|
||||||
- [ ] Re-run Stage 1 on full corpus (~50K paragraphs × 3 models)
|
|
||||||
- [ ] Distribution check: verify Level 2 grew to ~20%, category distribution healthy
|
|
||||||
- [ ] If distribution is off → iterate codebook/prompt before proceeding
|
|
||||||
- **Estimated cost:** ~$120
|
|
||||||
- **Estimated time:** ~30 min execution
|
|
||||||
|
|
||||||
### Step 4: Holdout Selection
|
### 1. Codebook Finalization — DONE

- [x] Draft v2 codebook (LABELING-CODEBOOK.md)
- [x] Draft codebook ethos (CODEBOOK-ETHOS.md)
- [x] Group approval (5/6, 2026-04-04)
### 2. Holdout Selection — DONE

- [x] Heuristic v2 specificity prediction (keyword scan of v1 L1 → predicted L2, v1 L3 → predicted L4)
- [x] Stratified holdout: 185 per non-ID category, 90 ID = 1,200 exact
- [x] Max 2 paragraphs per company per category stratum
- [x] Specificity floors met: L1=621, L2=119, L3=262, L4=198 (all ≥100)
- [x] 1,042 companies represented, max 3 from any one company
- [x] Output: `data/gold/v2-holdout-ids.json`, `data/gold/v2-holdout-manifest.jsonl`
- [x] Script: `scripts/sample-v2-holdout.py`
- Dev set drawn from holdout (first 200 paragraphs used for prompt iteration)
### 3. Prompt Iteration — DONE

- [x] Full rewrite of SYSTEM_PROMPT for v2 codebook (v4.0 → v4.5, ~8 iterations)
- [x] Principle-first restructure: ERM test for L2, "unique to THIS company" for L3, external verifiability for L4
- [x] Lists compressed to boundary-case disambiguation only (not exhaustive checklists)
- [x] Category/specificity independence explicitly stated (presence check, not relevance judgment)
- [x] Hard vs soft number boundary clarified for QV; lower bounds ("more than 20 years") count as hard
- [x] VP/SVP title boundary: VP-or-above with IT/Security qualifier → L3; Director of IT without security qualifier → L1
- [x] Schema updated: "Sector-Adapted" → "Domain-Adapted", 2+ QV → 1+ QV
- [x] Piloted on 200 holdout paragraphs with GPT-5.4 across 5 iterations (~$6 total)
- [x] v4.5 iteration: mechanical bridge (specific_facts → specificity level), expertise-vs-topic L1/L2 clarification, SI negative-assertion L4 fix, fact storage in output
- **v4.4 results (200 paragraphs):** L1=65, L2=41, L3=51, L4=43; category 95.5% agreement with v1
- **Cost per 200:** ~$1.20 (GPT-5.4)
- **Prompt version:** v4.5 (locked)
### 4. Full Holdout Validation — DONE

- [x] Run GPT-5.4 on all 1,200 holdout paragraphs with v4.4 prompt ($5.70)
- [x] Identified 34.5% medium-confidence specificity calls, concentrated at L1/L2 and L2/L3 boundaries
- [x] Identified SI materiality assertions being falsely promoted to L4 (negative assertions are not verifiable)
- [x] Identified specific_facts field not being stored to disk (toLabelOutput stripped it)
- [x] Iterated to v4.5: mechanical bridge, expertise-vs-topic, SI L4 fix, fact storage
- [x] Re-ran full 1,200 with v4.5 ($6.88)
- [x] Verified bridge consistency: L1=all empty, L2+=all populated (100%)
- [x] Verified SI L4 false positives eliminated (0 remaining)
- [x] Verified TP L2→L1 drops are correct (generic vendor language, not cybersecurity expertise)
- **v4.5 results (1,200 paragraphs):** L1=618 (51.5%), L2=168 (14.0%), L3=207 (17.2%), L4=207 (17.2%)
- **Confidence:** 989 high (82.4%), 211 medium (17.6%) — down from 414 medium in v4.4
- **Category stability:** 96.8% agreement between v4.4 and v4.5
- **L2 at 14%:** below 15% target on holdout, but holdout oversamples TP (14.4% vs 5% in corpus). On full corpus (46% RMP, 5% TP), L2 should be ~15-17% since RMP L2 held up.
- **Dev vs unseen stable:** no prompt overfitting
### 5. Holdout Benchmark — DONE

- [x] Run 10 models from 8 providers on 1,200 holdout (GPT-5.4, Grok Fast, Gemini Lite, Gemini Pro, MIMO Flash, Kimi K2.5, GLM-5, MiniMax M2.7, Opus 4.6, + 3 pilots)
- [x] Opus prompt-only vs codebook A/B test (prompt-only wins: 85.2% vs 82.4% both-match)
- [x] MIMO Flash broken on specificity (91% L1 collapse, κw=0.662) — disqualified
- [x] Piloted 3 cheap candidates (Qwen3-235B, Seed 1.6 Flash, Qwen3.5 Flash) — all below Flash Lite quality
- [x] Grok self-consistency test: 8.5% specificity divergence on repeated runs at temp=0 (reasoning stochasticity)
- [x] Decision: Grok ×3 self-consistency panel (Wang et al. 2022)
- **Benchmark cost:** $45.47
- **Top models:** Grok Fast (86.1% both), Opus prompt-only (85.2%), Gemini Pro (84.2%)
- **Stage 1 panel:** Grok 4.1 Fast ×3 ($96 estimated)
### 6. Stage 1 Re-Run ← CURRENT

- [x] Lock v2 prompt (v4.5)
- [x] Model selection: Grok 4.1 Fast ×3 (self-consistency)
- [ ] Re-run Stage 1 on full corpus (~50K paragraphs × 3 runs)
- [ ] Distribution check: L2 ~15-17%, categories healthy
- **Estimated cost:** ~$96
### 7. Labelapp Update

- [ ] Update quiz questions for v2 codebook
- [ ] Update warmup paragraphs with v2 examples
- [ ] Load new holdout paragraphs into labelapp DB
- [ ] Generate new BIBD assignments (3 of 6 annotators per paragraph)
- [ ] Test the full flow (quiz → warmup → labeling)
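The 3-of-6 BIBD assignment can be sketched by cycling through all C(6,3)=20 annotator triples (a hypothetical helper, not the labelapp's actual generator):

```python
from itertools import combinations, cycle

def bibd_assignments(paragraph_ids, annotators):
    """Assign each paragraph to 3 of 6 annotators by cycling through all
    C(6,3)=20 triples; load is exactly balanced when the number of
    paragraphs is a multiple of 20."""
    blocks = cycle(combinations(annotators, 3))
    return {pid: next(blocks) for pid in paragraph_ids}
```

For 1,200 paragraphs each triple is used 60 times, and since each annotator sits in 10 of the 20 triples, everyone gets exactly 600 paragraphs, matching the ~600-per-annotator estimate below.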
### 8. Parallel Labeling

- [ ] Humans: annotators label v2 holdout (~600 per annotator, 2-3 days)
- [x] Models: full benchmark panel on holdout (10 models, 8 providers + Opus via Agent SDK) — $45.47
- **Estimated cost:** ~$0 remaining (models done)
### 9. Gold Set Assembly

- [ ] Compute human IRR (category α > 0.75, specificity α > 0.67)
- [ ] Gold = majority vote; all-disagree → model consensus tiebreaker
- [ ] Cross-validate against model panel
### 10. Stage 2 (if needed)

- [ ] Bench Stage 2 accuracy against gold
- [ ] If adds value → run on disputed Stage 1 paragraphs
- **Estimated cost:** ~$20-40 if run
### 11. Training Data Assembly

- [ ] Unanimous Stage 1 → full weight, calibrated majority → full weight
- [ ] Quality tier weights: clean/headed/minor 1.0, degraded 0.5
- [ ] Exclude 72 truncated filings
- **Depends on:** Step 10 complete

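The weighting rules above reduce to a small lookup. A minimal sketch (function name hypothetical; the tier names clean/headed/minor/degraded are from the plan):

```python
def sample_weight(quality_tier: str, truncated: bool) -> float:
    """Per-example training weight under the assembly rules above."""
    # Truncated filings are excluded from training entirely.
    if truncated:
        return 0.0
    # Unanimous and calibrated-majority Stage 1 labels both get full label
    # weight, so only the text-quality tier modulates the final weight.
    return {"clean": 1.0, "headed": 1.0, "minor": 1.0, "degraded": 0.5}[quality_tier]
```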
### 12. Fine-Tuning

- [ ] Ablation: {base, +DAPT, +DAPT+TAPT} × {±class weighting} × {CE vs focal loss}
- [ ] Dual-head: shared ModernBERT backbone + category head (7-class) + specificity head (4-class ordinal)
- [ ] CORAL for ordinal specificity
- [ ] SCL for boundary separation (optional, if time permits)
- **Estimated time:** 12-20h GPU
- **Depends on:** Step 11 complete

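For context on the CORAL item: CORAL turns a K-level ordinal target into K-1 cumulative binary targets sharing one set of logits, which keeps predictions rank-consistent. A minimal sketch of the encoding/decoding for the 4-level specificity head (the model/backbone wiring is not shown):

```python
import numpy as np


def coral_targets(level: int, num_levels: int = 4) -> np.ndarray:
    """CORAL encodes level k in {1..K} as K-1 cumulative binary targets:
    [level > 1, level > 2, ..., level > K-1]."""
    return (level > np.arange(1, num_levels)).astype(float)


def coral_decode(probs: np.ndarray) -> int:
    """Predicted level = 1 + number of thresholds with P(level > t) > 0.5."""
    return 1 + int((probs > 0.5).sum())
```

For example, Level 3 ("Firm-Specific") becomes targets [1, 1, 0], and threshold probabilities [0.9, 0.8, 0.2] decode back to Level 3.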
### 13. Evaluation & Paper

- [ ] Macro F1 on holdout (target > 0.80 both heads)
- [ ] Per-class F1 breakdown + GenAI benchmark table
- [ ] Error analysis, cost comparison, IGNITE slides
- [ ] Note in paper: specificity is paragraph-level (presence check), not category-conditional — acknowledge as limitation/future work
- [ ] Python notebooks for replication (assignment requirement)
- **Depends on:** Step 12 complete

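The headline metric is macro F1 because it weights rare classes (ID, 0.6% of Stage 1) the same as dominant ones (RMP, 45.8%). A dependency-free sketch of the computation:

```python
def macro_f1(gold: list[str], pred: list[str]) -> float:
    """Unweighted mean of per-class F1 over all classes seen in gold or pred."""
    classes = set(gold) | set(pred)
    f1s = []
    for c in classes:
        tp = sum(1 for g, p in zip(gold, pred) if g == c and p == c)
        fp = sum(1 for g, p in zip(gold, pred) if g != c and p == c)
        fn = sum(1 for g, p in zip(gold, pred) if g == c and p != c)
        f1s.append(2 * tp / (2 * tp + fp + fn) if tp else 0.0)
    return sum(f1s) / len(f1s)
```

In practice this matches `sklearn.metrics.f1_score(gold, pred, average="macro")` when every class appears in the data.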
---

## Rubric Checklist

**C (F1 > .80):** Fine-tuned model, GenAI comparison, labeled datasets, documentation, Python notebooks

**B (3+ of 4):** [x] Cost/time/reproducibility, [x] 6+ models / 3+ suppliers, [x] Contemporary self-collected data, [x] Compelling use case

**A (3+ of 4):** [x] Error analysis, [x] Mitigation strategy, [ ] Additional baselines (keyword/dictionary), [x] Comparison to amateur labels

---

## Key Data

| What | Where |
|------|-------|
| v2 codebook | `docs/LABELING-CODEBOOK.md` |
| v2 ethos | `docs/CODEBOOK-ETHOS.md` |
| Paragraphs (patched) | `data/paragraphs/paragraphs-clean.patched.jsonl` (72,045) |
| v1 Stage 1 annotations | `data/annotations/stage1.patched.jsonl` (150,009) |
| v2 holdout IDs | `data/gold/v2-holdout-ids.json` (1,200) |
| v2 holdout manifest | `data/gold/v2-holdout-manifest.jsonl` |
| v1 holdout IDs | `labelapp/.sampled-ids.original.json` |
| v1 gold labels | `data/gold/gold-adjudicated.jsonl` |
| v2 holdout benchmark | `data/annotations/v2-bench/` (10 models + 3 pilots, 1,200 paragraphs) |
| v2 holdout reference | `data/annotations/v2-bench/gpt-5.4.jsonl` (v4.5, 1,200 paragraphs) |
| v2 iteration archive | `data/annotations/v2-bench/gpt-5.4.v4.{0,1,2,3,4}.jsonl` |
| v4.5 boundary test | `data/annotations/v2-bench/v45-test/gpt-5.4.jsonl` (50 paragraphs) |
| Opus prompt-only | `data/annotations/v2-bench/opus-4.6.jsonl` (1,184 paragraphs) |
| Opus +codebook | `data/annotations/golden/opus.jsonl` (includes v1 + v2 runs) |
| Grok self-consistency test | `data/annotations/v2-bench/grok-rerun/grok-4.1-fast.jsonl` (47 paragraphs) |
| Benchmark analysis | `scripts/analyze-v2-bench.py` |
| Stage 1 prompt | `ts/src/label/prompts.ts` (v4.5) |
| Holdout sampling script | `scripts/sample-v2-holdout.py` |

### v1 Stage 1 Distribution (50,003 paragraphs, v2.5 prompt, 3-model consensus)

| Category | Count | % |
|----------|-------|---|
| RMP | 22,898 | 45.8% |
| MR | 8,782 | 17.6% |
| BG | 8,024 | 16.0% |
| SI | 5,014 | 10.0% |
| N/O | 2,503 | 5.0% |
| TP | 2,478 | 5.0% |
| ID | 304 | 0.6% |

### GPT-5.4 Prompt Iteration (holdout)

| Specificity | v4.0 (list, 200) | v4.4 (principle, 200) | v4.4 (full, 1200) | v4.5 (full, 1200) |
|-------------|------------------|-----------------------|-------------------|-------------------|
| L1 | 81 (40.5%) | 65 (32.5%) | 546 (45.5%) | 618 (51.5%) |
| L2 | 32 (16.0%) | 41 (20.5%) | 229 (19.1%) | 168 (14.0%) |
| L3 | 43 (21.5%) | 51 (25.5%) | 225 (18.8%) | 207 (17.2%) |
| L4 | 44 (22.0%) | 43 (21.5%) | 200 (16.7%) | 207 (17.2%) |
| Med conf | — | — | 414 (34.5%) | 211 (17.6%) |

v4.4→v4.5 key changes: mechanical bridge (specific_facts drives specificity level, 100% consistent), expertise-vs-topic L1/L2 clarification (fixes TP false L2s), SI negative-assertion L4 fix, lower-bound numbers as hard QV, fact storage in output.
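The "mechanical bridge" above means specificity is no longer judged holistically: the model lists facts, and the level is the maximum level implied by any listed fact. A minimal sketch; the `FACT_LEVEL` mapping is illustrative (only `domain_term`, `framework`, `named_role`, and `named_committee` appear in the schema excerpt below, the remaining entries are assumptions):

```python
# Hypothetical fact-type → specificity-level mapping, following the v4.5 rule:
# no facts → L1, domain terminology / frameworks → L2, firm-unique facts → L3,
# verifiable claims (dates, numbers, named firms) → L4.
FACT_LEVEL = {
    "domain_term": 2,
    "framework": 2,
    "named_role": 3,
    "named_committee": 3,
    "date": 4,
    "number": 4,
    "named_firm": 4,
}


def derive_specificity(fact_types: list[str]) -> int:
    """Mechanical bridge: level = highest level among listed facts, else 1."""
    return max((FACT_LEVEL.get(t, 2) for t in fact_types), default=1)
```

This is what makes the level 100% consistent with `specific_facts`: disagreements can only come from which facts were listed, not from a separate judgment.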
```diff
@@ -15,7 +15,7 @@ export type ContentCategory = z.infer<typeof ContentCategory>;
 /** Ordinal specificity as semantic labels — we convert to 1-4 in post-processing. */
 export const SpecificityLabel = z.enum([
   "Generic Boilerplate",
-  "Sector-Adapted",
+  "Domain-Adapted",
   "Firm-Specific",
   "Quantified-Verifiable",
 ]);
@@ -24,7 +24,7 @@ export type SpecificityLabel = z.infer<typeof SpecificityLabel>;
 /** Map specificity labels to ordinal values. */
 export const SPECIFICITY_ORDINAL: Record<SpecificityLabel, number> = {
   "Generic Boilerplate": 1,
-  "Sector-Adapted": 2,
+  "Domain-Adapted": 2,
   "Firm-Specific": 3,
   "Quantified-Verifiable": 4,
 };
@@ -53,6 +53,7 @@ export type Confidence = z.infer<typeof Confidence>;
 
 /** Type of specific fact found in the text. */
 export const FactType = z.enum([
+  "domain_term", // Cybersecurity-specific vocabulary/practice: EDR, MFA, penetration testing, zero trust, etc.
   "framework", // Named standard: NIST, ISO 27001, SOC 2, etc.
   "named_role", // Specific title: CISO, VP of Security, CTO, etc.
   "named_committee", // Named committee: Audit Committee, Technology Committee, etc.
@@ -92,10 +93,10 @@ export const LabelOutputRaw = z.object({
     "The single most applicable content category for this paragraph",
   ),
   specific_facts: z.array(SpecificFact).describe(
-    "List ONLY facts from the ✓ IS list (cybersecurity-specific roles, named non-generic committees, team compositions, dates, named programs, products, firms, specific numbers). Do NOT include items from the ✗ NOT list (CEO, CFO, Audit Committee, generic cadences, boilerplate phrases). Empty array if no qualifying facts exist.",
+    "List ALL specificity-relevant facts found anywhere in the paragraph. Include: domain terms that pass the ERM test (type=domain_term), firm-specific facts from the ✓ IS list (named roles, committees, teams, programs, systems), and verifiable claims (dates, numbers, named firms, certifications held). Do NOT include items from the ✗ NOT list (CEO, CFO, Audit Committee, generic cadences, boilerplate phrases). Empty array if no qualifying facts exist. Your specificity level must equal the highest-level fact listed.",
   ),
   specificity: SpecificityLabel.describe(
-    "Derived from specific_facts: no facts → Generic Boilerplate, only frameworks → Sector-Adapted, any firm-unique fact → Firm-Specific, 2+ verifiable facts → Quantified-Verifiable",
+    "Derived from specific_facts: no facts → Generic Boilerplate, domain terminology → Domain-Adapted, any firm-unique fact → Firm-Specific, 1+ verifiable facts → Quantified-Verifiable",
   ),
   category_confidence: Confidence.describe(
     "high=clear-cut, medium=some ambiguity, low=genuinely torn between categories",
@@ -116,6 +117,7 @@ export function toLabelOutput(raw: LabelOutputRaw): LabelOutput {
   return {
     content_category: raw.content_category,
     specificity_level: SPECIFICITY_ORDINAL[raw.specificity] as SpecificityLevel,
+    specific_facts: raw.specific_facts,
     category_confidence: raw.category_confidence,
     specificity_confidence: raw.specificity_confidence,
     reasoning: raw.reasoning,
@@ -131,7 +133,10 @@ export const LabelOutput = z.object({
     "The single most applicable content category for this paragraph",
   ),
   specificity_level: SpecificityLevel.describe(
-    "1=generic boilerplate, 2=sector-adapted, 3=firm-specific, 4=quantified-verifiable",
+    "1=generic boilerplate, 2=domain-adapted, 3=firm-specific, 4=quantified-verifiable",
+  ),
+  specific_facts: z.array(SpecificFact).optional().describe(
+    "Facts that drove the specificity level — empty/absent for Level 1",
   ),
   category_confidence: Confidence.describe(
     "high=clear-cut, medium=some ambiguity, low=genuinely torn between categories",
```
**scripts/analyze-v2-bench.py** (new file, 477 lines)
|
"""
|
||||||
|
V2 holdout benchmark analysis.
|
||||||
|
|
||||||
|
Compares all models in data/annotations/v2-bench/ on the 1,200 v2 holdout.
|
||||||
|
Uses GPT-5.4 (v4.5) as reference since it's our best-validated model.
|
||||||
|
|
||||||
|
Outputs:
|
||||||
|
- Per-model distribution tables (category + specificity)
|
||||||
|
- Pairwise agreement matrix (category, specificity, both)
|
||||||
|
- Per-model agreement with GPT-5.4 reference
|
||||||
|
- Confusion patterns: where models disagree and why
|
||||||
|
- Confidence distribution per model
|
||||||
|
- Specific facts coverage analysis
|
||||||
|
"""
|
||||||
|
|
||||||
|
import json
|
||||||
|
import sys
|
||||||
|
from collections import Counter, defaultdict
|
||||||
|
from itertools import combinations
|
||||||
|
from pathlib import Path
|
||||||
|
|
||||||
|
import numpy as np
|
||||||
|
|
||||||
|
ROOT = Path(__file__).resolve().parent.parent
|
||||||
|
V2_BENCH = ROOT / "data/annotations/v2-bench"
|
||||||
|
GOLDEN_DIR = ROOT / "data/annotations/golden"
|
||||||
|
|
||||||
|
CATEGORIES = [
|
||||||
|
"Board Governance", "Management Role", "Risk Management Process",
|
||||||
|
"Third-Party Risk", "Incident Disclosure", "Strategy Integration", "None/Other",
|
||||||
|
]
|
||||||
|
CAT_SHORT = {"Board Governance": "BG", "Management Role": "MR",
|
||||||
|
"Risk Management Process": "RMP", "Third-Party Risk": "TP",
|
||||||
|
"Incident Disclosure": "ID", "Strategy Integration": "SI",
|
||||||
|
"None/Other": "N/O"}
|
||||||
|
SPEC_LABELS = {1: "L1", 2: "L2", 3: "L3", 4: "L4"}
|
||||||
|
|
||||||
|
MODEL_DISPLAY = {
|
||||||
|
"gemini-3.1-flash-lite-preview": "Gemini Lite",
|
||||||
|
"mimo-v2-flash": "MIMO Flash",
|
||||||
|
"grok-4.1-fast": "Grok Fast",
|
||||||
|
"gpt-5.4": "GPT-5.4",
|
||||||
|
"kimi-k2.5": "Kimi K2.5",
|
||||||
|
"gemini-3.1-pro-preview": "Gemini Pro",
|
||||||
|
"glm-5": "GLM-5",
|
||||||
|
"minimax-m2.7": "MiniMax M2.7",
|
||||||
|
"mimo-v2-pro": "MIMO Pro",
|
||||||
|
}
|
||||||
|
|
||||||
|
REFERENCE_MODEL = "gpt-5.4"
|
||||||
|
|
||||||
|
|
||||||
|
def load_jsonl(path: Path) -> list[dict]:
|
||||||
|
records = []
|
||||||
|
with open(path) as f:
|
||||||
|
for line in f:
|
||||||
|
line = line.strip()
|
||||||
|
if line:
|
||||||
|
records.append(json.loads(line))
|
||||||
|
return records
|
||||||
|
|
||||||
|
|
||||||
|
def cohens_kappa(a: list, b: list) -> float:
|
||||||
|
assert len(a) == len(b)
|
||||||
|
n = len(a)
|
||||||
|
if n == 0:
|
||||||
|
return 0.0
|
||||||
|
labels = sorted(set(a) | set(b))
|
||||||
|
idx = {l: i for i, l in enumerate(labels)}
|
||||||
|
k = len(labels)
|
||||||
|
conf = np.zeros((k, k))
|
||||||
|
for x, y in zip(a, b):
|
||||||
|
conf[idx[x]][idx[y]] += 1
|
||||||
|
po = np.trace(conf) / n
|
||||||
|
pe = sum((conf[i, :].sum() / n) * (conf[:, i].sum() / n) for i in range(k))
|
||||||
|
if pe >= 1.0:
|
||||||
|
return 1.0
|
||||||
|
return (po - pe) / (1 - pe)
|
||||||
|
|
||||||
|
|
||||||
|
def weighted_kappa(a: list[int], b: list[int]) -> float:
|
||||||
|
"""Quadratic-weighted kappa for ordinal specificity."""
|
||||||
|
assert len(a) == len(b)
|
||||||
|
n = len(a)
|
||||||
|
if n == 0:
|
||||||
|
return 0.0
|
||||||
|
labels = sorted(set(a) | set(b))
|
||||||
|
idx = {l: i for i, l in enumerate(labels)}
|
||||||
|
k = len(labels)
|
||||||
|
conf = np.zeros((k, k))
|
||||||
|
for x, y in zip(a, b):
|
||||||
|
conf[idx[x]][idx[y]] += 1
|
||||||
|
weights = np.zeros((k, k))
|
||||||
|
for i in range(k):
|
||||||
|
for j in range(k):
|
||||||
|
weights[i][j] = (i - j) ** 2 / (k - 1) ** 2
|
||||||
|
po = 1 - np.sum(weights * conf) / n
|
||||||
|
expected = np.outer(conf.sum(axis=1), conf.sum(axis=0)) / n
|
||||||
|
pe = 1 - np.sum(weights * expected) / n
|
||||||
|
if pe == 0:
|
||||||
|
return 1.0
|
||||||
|
return (po - pe) / (1 - pe)
|
||||||
|
|
||||||
|
|
||||||
|
# ── Load all models ──
|
||||||
|
print("Loading v2-bench annotations...")
|
||||||
|
|
||||||
|
models: dict[str, dict[str, dict]] = {} # model_short -> {pid -> annotation}
|
||||||
|
for f in sorted(V2_BENCH.glob("*.jsonl")):
|
||||||
|
if "errors" in f.name or f.stem.startswith("gpt-5.4.v4"):
|
||||||
|
continue
|
||||||
|
records = load_jsonl(f)
|
||||||
|
if len(records) < 100:
|
||||||
|
print(f" SKIP {f.name}: only {len(records)} records")
|
||||||
|
continue
|
||||||
|
model_short = f.stem
|
||||||
|
by_pid = {r["paragraphId"]: r for r in records}
|
||||||
|
models[model_short] = by_pid
|
||||||
|
display = MODEL_DISPLAY.get(model_short, model_short)
|
||||||
|
print(f" {display}: {len(by_pid)} annotations")
|
||||||
|
|
||||||
|
# Load Opus golden if available
|
||||||
|
opus_path = GOLDEN_DIR / "opus.jsonl"
|
||||||
|
if opus_path.exists():
|
||||||
|
records = load_jsonl(opus_path)
|
||||||
|
if len(records) >= 100:
|
||||||
|
by_pid = {r["paragraphId"]: r for r in records}
|
||||||
|
models["opus-4.6"] = by_pid
|
||||||
|
MODEL_DISPLAY["opus-4.6"] = "Opus 4.6"
|
||||||
|
print(f" Opus 4.6: {len(by_pid)} annotations")
|
||||||
|
|
||||||
|
# Common paragraph IDs across all models
|
||||||
|
all_pids = set.intersection(*(set(m.keys()) for m in models.values())) if models else set()
|
||||||
|
print(f"\n {len(all_pids)} paragraphs common to all {len(models)} models")
|
||||||
|
|
||||||
|
if not all_pids:
|
||||||
|
# Fall back to pairwise with reference
|
||||||
|
ref = models.get(REFERENCE_MODEL)
|
||||||
|
if ref:
|
||||||
|
all_pids = set(ref.keys())
|
||||||
|
print(f" Using {len(all_pids)} reference model paragraphs for pairwise analysis")
|
||||||
|
|
||||||
|
model_names = sorted(models.keys(), key=lambda m: list(MODEL_DISPLAY.keys()).index(m) if m in MODEL_DISPLAY else 999)
|
||||||
|
|
||||||
|
|
||||||
|
def get_label(model: str, pid: str) -> dict | None:
|
||||||
|
ann = models.get(model, {}).get(pid)
|
||||||
|
if not ann:
|
||||||
|
return None
|
||||||
|
return ann.get("label", ann)
|
||||||
|
|
||||||
|
|
||||||
|
# ═══════════════════════════════════════════════════════════
|
||||||
|
# 1. DISTRIBUTION TABLES
|
||||||
|
# ═══════════════════════════════════════════════════════════
|
||||||
|
print("\n" + "═" * 70)
|
||||||
|
print("CATEGORY DISTRIBUTION")
|
||||||
|
print("═" * 70)
|
||||||
|
|
||||||
|
header = f"{'Model':<16}" + "".join(f"{s:>8}" for s in CAT_SHORT.values())
|
||||||
|
print(header)
|
||||||
|
print("─" * len(header))
|
||||||
|
|
||||||
|
for m in model_names:
|
||||||
|
display = MODEL_DISPLAY.get(m, m)[:15]
|
||||||
|
cats = [get_label(m, pid) for pid in models[m]]
|
||||||
|
cats = [l["content_category"] for l in cats if l]
|
||||||
|
counts = Counter(cats)
|
||||||
|
total = len(cats)
|
||||||
|
row = f"{display:<16}"
|
||||||
|
for full_name in CATEGORIES:
|
||||||
|
pct = counts.get(full_name, 0) / total * 100 if total else 0
|
||||||
|
row += f"{pct:>7.1f}%"
|
||||||
|
print(row)
|
||||||
|
|
||||||
|
print("\n" + "═" * 70)
|
||||||
|
print("SPECIFICITY DISTRIBUTION")
|
||||||
|
print("═" * 70)
|
||||||
|
|
||||||
|
header = f"{'Model':<16}" + "".join(f"{s:>8}" for s in SPEC_LABELS.values()) + f"{'Med%':>8}"
|
||||||
|
print(header)
|
||||||
|
print("─" * len(header))
|
||||||
|
|
||||||
|
for m in model_names:
|
||||||
|
display = MODEL_DISPLAY.get(m, m)[:15]
|
||||||
|
labels = [get_label(m, pid) for pid in models[m]]
|
||||||
|
specs = [l["specificity_level"] for l in labels if l]
|
||||||
|
confs = [l.get("specificity_confidence", "high") for l in labels if l]
|
||||||
|
counts = Counter(specs)
|
||||||
|
total = len(specs)
|
||||||
|
med_count = sum(1 for c in confs if c == "medium")
|
||||||
|
row = f"{display:<16}"
|
||||||
|
for level in SPEC_LABELS:
|
||||||
|
pct = counts.get(level, 0) / total * 100 if total else 0
|
||||||
|
row += f"{pct:>7.1f}%"
|
||||||
|
med_pct = med_count / total * 100 if total else 0
|
||||||
|
row += f"{med_pct:>7.1f}%"
|
||||||
|
print(row)
|
||||||
|
|
||||||
|
# ═══════════════════════════════════════════════════════════
|
||||||
|
# 2. AGREEMENT WITH REFERENCE
|
||||||
|
# ═══════════════════════════════════════════════════════════
|
||||||
|
ref_data = models.get(REFERENCE_MODEL)
|
||||||
|
if ref_data:
|
||||||
|
print("\n" + "═" * 70)
|
||||||
|
print(f"AGREEMENT WITH {MODEL_DISPLAY.get(REFERENCE_MODEL, REFERENCE_MODEL).upper()}")
|
||||||
|
print("═" * 70)
|
||||||
|
|
||||||
|
header = f"{'Model':<16}{'Cat%':>8}{'Cat κ':>8}{'Spec%':>8}{'Spec κw':>8}{'Both%':>8}{'N':>6}"
|
||||||
|
print(header)
|
||||||
|
print("─" * len(header))
|
||||||
|
|
||||||
|
for m in model_names:
|
||||||
|
if m == REFERENCE_MODEL:
|
||||||
|
continue
|
||||||
|
display = MODEL_DISPLAY.get(m, m)[:15]
|
||||||
|
common = set(models[m].keys()) & set(ref_data.keys())
|
||||||
|
if len(common) < 100:
|
||||||
|
print(f"{display:<16} (only {len(common)} common paragraphs)")
|
||||||
|
continue
|
||||||
|
|
||||||
|
ref_cats, m_cats = [], []
|
||||||
|
ref_specs, m_specs = [], []
|
||||||
|
both_match = 0
|
||||||
|
|
||||||
|
for pid in common:
|
||||||
|
rl = get_label(REFERENCE_MODEL, pid)
|
||||||
|
ml = get_label(m, pid)
|
||||||
|
if not rl or not ml:
|
||||||
|
continue
|
||||||
|
ref_cats.append(rl["content_category"])
|
||||||
|
m_cats.append(ml["content_category"])
|
||||||
|
ref_specs.append(rl["specificity_level"])
|
||||||
|
m_specs.append(ml["specificity_level"])
|
||||||
|
if rl["content_category"] == ml["content_category"] and rl["specificity_level"] == ml["specificity_level"]:
|
||||||
|
both_match += 1
|
||||||
|
|
||||||
|
n = len(ref_cats)
|
||||||
|
cat_agree = sum(1 for a, b in zip(ref_cats, m_cats) if a == b) / n * 100
|
||||||
|
spec_agree = sum(1 for a, b in zip(ref_specs, m_specs) if a == b) / n * 100
|
||||||
|
both_pct = both_match / n * 100
|
||||||
|
cat_k = cohens_kappa(ref_cats, m_cats)
|
||||||
|
spec_kw = weighted_kappa(ref_specs, m_specs)
|
||||||
|
|
||||||
|
print(f"{display:<16}{cat_agree:>7.1f}%{cat_k:>8.3f}{spec_agree:>7.1f}%{spec_kw:>8.3f}{both_pct:>7.1f}%{n:>6}")
|
||||||
|
|
||||||
|
# ═══════════════════════════════════════════════════════════
|
||||||
|
# 3. PAIRWISE AGREEMENT MATRIX (category kappa)
|
||||||
|
# ═══════════════════════════════════════════════════════════
|
||||||
|
print("\n" + "═" * 70)
|
||||||
|
print("PAIRWISE CATEGORY κ (lower triangle)")
|
||||||
|
print("═" * 70)
|
||||||
|
|
||||||
|
short_names = [MODEL_DISPLAY.get(m, m)[:10] for m in model_names]
|
||||||
|
header = f"{'':>12}" + "".join(f"{s:>12}" for s in short_names)
|
||||||
|
print(header)
|
||||||
|
|
||||||
|
for i, m1 in enumerate(model_names):
|
||||||
|
row = f"{short_names[i]:>12}"
|
||||||
|
for j, m2 in enumerate(model_names):
|
||||||
|
if j >= i:
|
||||||
|
row += f"{'—':>12}"
|
||||||
|
continue
|
||||||
|
common = set(models[m1].keys()) & set(models[m2].keys())
|
||||||
|
if len(common) < 100:
|
||||||
|
row += f"{'n/a':>12}"
|
||||||
|
continue
|
||||||
|
cats1 = [get_label(m1, pid)["content_category"] for pid in common if get_label(m1, pid) and get_label(m2, pid)]
|
||||||
|
cats2 = [get_label(m2, pid)["content_category"] for pid in common if get_label(m1, pid) and get_label(m2, pid)]
|
||||||
|
k = cohens_kappa(cats1, cats2)
|
||||||
|
row += f"{k:>12.3f}"
|
||||||
|
print(row)
|
||||||
|
|
||||||
|
# ═══════════════════════════════════════════════════════════
|
||||||
|
# 4. SPECIFICITY CONFUSION WITH REFERENCE
|
||||||
|
# ═══════════════════════════════════════════════════════════
|
||||||
|
if ref_data:
|
||||||
|
print("\n" + "═" * 70)
|
||||||
|
print("SPECIFICITY CONFUSION vs REFERENCE (rows=model, cols=reference)")
|
||||||
|
print("═" * 70)
|
||||||
|
|
||||||
|
for m in model_names:
|
||||||
|
if m == REFERENCE_MODEL:
|
||||||
|
continue
|
||||||
|
display = MODEL_DISPLAY.get(m, m)
|
||||||
|
common = set(models[m].keys()) & set(ref_data.keys())
|
||||||
|
if len(common) < 100:
|
||||||
|
continue
|
||||||
|
|
||||||
|
conf = np.zeros((4, 4), dtype=int)
|
||||||
|
for pid in common:
|
||||||
|
rl = get_label(REFERENCE_MODEL, pid)
|
||||||
|
ml = get_label(m, pid)
|
||||||
|
if not rl or not ml:
|
||||||
|
continue
|
||||||
|
ref_s = rl["specificity_level"] - 1
|
||||||
|
mod_s = ml["specificity_level"] - 1
|
||||||
|
conf[mod_s][ref_s] += 1
|
||||||
|
|
||||||
|
print(f"\n {display} (N={int(conf.sum())})")
|
||||||
|
print(f" {'':>8}" + "".join(f"{'ref ' + SPEC_LABELS[l]:>8}" for l in range(1, 5)))
|
||||||
|
for i in range(4):
|
||||||
|
row_total = conf[i].sum()
|
||||||
|
row = f" {SPEC_LABELS[i+1]:>8}"
|
||||||
|
for j in range(4):
|
||||||
|
row += f"{conf[i][j]:>8}"
|
||||||
|
print(row + f" | {row_total}")
|
||||||
|
|
||||||
|
# ═══════════════════════════════════════════════════════════
|
||||||
|
# 5. CATEGORY DISAGREEMENT PATTERNS
|
||||||
|
# ═══════════════════════════════════════════════════════════
|
||||||
|
if ref_data:
|
||||||
|
print("\n" + "═" * 70)
|
||||||
|
print("TOP CATEGORY DISAGREEMENT PATTERNS vs REFERENCE")
|
||||||
|
print("═" * 70)
|
||||||
|
|
||||||
|
for m in model_names:
|
||||||
|
if m == REFERENCE_MODEL:
|
||||||
|
continue
|
||||||
|
display = MODEL_DISPLAY.get(m, m)
|
||||||
|
common = set(models[m].keys()) & set(ref_data.keys())
|
||||||
|
if len(common) < 100:
|
||||||
|
continue
|
||||||
|
|
||||||
|
disagreements: Counter = Counter()
|
||||||
|
for pid in common:
|
||||||
|
rl = get_label(REFERENCE_MODEL, pid)
|
||||||
|
ml = get_label(m, pid)
|
||||||
|
if not rl or not ml:
|
||||||
|
continue
|
||||||
|
rc = CAT_SHORT[rl["content_category"]]
|
||||||
|
mc = CAT_SHORT[ml["content_category"]]
|
||||||
|
if rc != mc:
|
||||||
|
disagreements[(rc, mc)] += 1
|
||||||
|
|
||||||
|
total_disagree = sum(disagreements.values())
|
||||||
|
if total_disagree == 0:
|
||||||
|
continue
|
||||||
|
|
||||||
|
print(f"\n {display}: {total_disagree} disagreements ({total_disagree/len(common)*100:.1f}%)")
|
||||||
|
for (rc, mc), count in disagreements.most_common(5):
|
||||||
|
print(f" {rc} → {mc}: {count}")
|
||||||
|
|
||||||
|
# ═══════════════════════════════════════════════════════════
|
||||||
|
# 6. SPECIFIC_FACTS COVERAGE
|
||||||
|
# ═══════════════════════════════════════════════════════════
|
||||||
|
print("\n" + "═" * 70)
|
||||||
|
print("SPECIFIC_FACTS COVERAGE")
|
||||||
|
print("═" * 70)
|
||||||
|
|
||||||
|
header = f"{'Model':<16}{'Has facts':>10}{'Avg #':>8}{'L1 empty':>10}{'L2+ has':>10}"
|
||||||
|
print(header)
|
||||||
|
print("─" * len(header))
|
||||||
|
|
||||||
|
for m in model_names:
|
||||||
|
display = MODEL_DISPLAY.get(m, m)[:15]
|
||||||
|
has_facts = 0
|
||||||
|
total_facts = 0
|
||||||
|
l1_empty = 0
|
||||||
|
l1_total = 0
|
||||||
|
l2plus_has = 0
|
||||||
|
l2plus_total = 0
|
||||||
|
|
||||||
|
for pid in models[m]:
|
||||||
|
l = get_label(m, pid)
|
||||||
|
if not l:
|
||||||
|
continue
|
||||||
|
facts = l.get("specific_facts") or []
|
||||||
|
spec = l["specificity_level"]
|
||||||
|
|
||||||
|
if facts:
|
||||||
|
has_facts += 1
|
||||||
|
total_facts += len(facts)
|
||||||
|
|
||||||
|
if spec == 1:
|
||||||
|
l1_total += 1
|
||||||
|
if not facts:
|
||||||
|
l1_empty += 1
|
||||||
|
else:
|
||||||
|
l2plus_total += 1
|
||||||
|
if facts:
|
||||||
|
l2plus_has += 1
|
||||||
|
|
||||||
|
total = len(models[m])
|
||||||
|
print(f"{display:<16}"
|
||||||
|
f"{has_facts/total*100:>9.1f}%"
|
||||||
|
f"{total_facts/max(1,has_facts):>8.1f}"
|
||||||
|
f"{l1_empty/max(1,l1_total)*100:>9.1f}%"
|
||||||
|
f"{l2plus_has/max(1,l2plus_total)*100:>9.1f}%")
|
||||||
|
|
||||||
|
# ═══════════════════════════════════════════════════════════
|
||||||
|
# 7. MULTI-MODEL CONSENSUS ANALYSIS
|
||||||
|
# ═══════════════════════════════════════════════════════════
|
||||||
|
print("\n" + "═" * 70)
|
||||||
|
print("MULTI-MODEL CONSENSUS")
|
||||||
|
print("═" * 70)
|
||||||
|
|
||||||
|
# For paragraphs common to all models
|
||||||
|
if len(all_pids) >= 100:
|
||||||
|
cat_unanimous = 0
|
||||||
|
spec_unanimous = 0
|
||||||
|
both_unanimous = 0
|
||||||
|
    cat_majority = 0
    spec_majority = 0

    for pid in all_pids:
        cats = []
        specs = []
        for m in model_names:
            l = get_label(m, pid)
            if l:
                cats.append(l["content_category"])
                specs.append(l["specificity_level"])

        cat_counts = Counter(cats)
        spec_counts = Counter(specs)
        top_cat_n = cat_counts.most_common(1)[0][1]
        top_spec_n = spec_counts.most_common(1)[0][1]

        if len(set(cats)) == 1:
            cat_unanimous += 1
        if len(set(specs)) == 1:
            spec_unanimous += 1
        if len(set(cats)) == 1 and len(set(specs)) == 1:
            both_unanimous += 1
        if top_cat_n > len(cats) / 2:
            cat_majority += 1
        if top_spec_n > len(specs) / 2:
            spec_majority += 1

    n = len(all_pids)
    print(f"  Category unanimous:    {cat_unanimous}/{n} ({cat_unanimous/n*100:.1f}%)")
    print(f"  Category majority:     {cat_majority}/{n} ({cat_majority/n*100:.1f}%)")
    print(f"  Specificity unanimous: {spec_unanimous}/{n} ({spec_unanimous/n*100:.1f}%)")
    print(f"  Specificity majority:  {spec_majority}/{n} ({spec_majority/n*100:.1f}%)")
    print(f"  Both unanimous:        {both_unanimous}/{n} ({both_unanimous/n*100:.1f}%)")
else:
    print(f"  Only {len(all_pids)} common paragraphs — skipping full consensus")
    print("  (Some models may still be running)")


# ═══════════════════════════════════════════════════════════
# 8. COST SUMMARY
# ═══════════════════════════════════════════════════════════

print("\n" + "═" * 70)
print("COST & LATENCY SUMMARY")
print("═" * 70)

header = f"{'Model':<16}{'Cost':>10}{'Avg ms':>10}{'Tokens/p':>10}{'Reason/p':>10}"
print(header)
print("─" * len(header))

total_cost = 0
for m in model_names:
    display = MODEL_DISPLAY.get(m, m)[:15]
    costs = []
    latencies = []
    reasoning = []
    for pid in models[m]:
        ann = models[m][pid]
        prov = ann.get("provenance", {})
        costs.append(prov.get("costUsd", 0))
        latencies.append(prov.get("latencyMs", 0))
        reasoning.append(prov.get("reasoningTokens", 0))

    cost = sum(costs)
    total_cost += cost
    avg_lat = np.mean(latencies) if latencies else 0
    avg_tok = np.mean([
        c.get("provenance", {}).get("inputTokens", 0) + c.get("provenance", {}).get("outputTokens", 0)
        for c in models[m].values()
    ])
    avg_reason = np.mean(reasoning) if reasoning else 0

    print(f"{display:<16}${cost:>9.4f}{avg_lat:>9.0f}ms{avg_tok:>10.0f}{avg_reason:>10.0f}")

print(f"\n  Total benchmark cost: ${total_cost:.4f}")

print("\n" + "═" * 70)
print("DONE")
print("═" * 70)
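The unanimous/majority bookkeeping above reduces to a small `Counter` idiom; a minimal self-contained sketch with toy votes (not project data):

```python
from collections import Counter

# Toy example: three model votes on one paragraph's category.
votes = ["Board Governance", "Board Governance", "Risk Management Process"]
counts = Counter(votes)
top_label, top_n = counts.most_common(1)[0]

unanimous = len(set(votes)) == 1   # all three agree
majority = top_n > len(votes) / 2  # strictly more than half agree

print(top_label, majority, unanimous)  # Board Governance True False
```

Two of three models agreeing yields a majority without unanimity, which is exactly the gap the script's two counters measure.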
scripts/sample-v2-holdout.py — 346 lines (new file)
@@ -0,0 +1,346 @@
"""
Sample v2 holdout set (1,200 paragraphs) using v1 Stage 1 consensus as guide.

Applies heuristic v2 specificity prediction:
- v1 L1 with domain terminology keywords → predicted L2
- v1 L3 with 1+ QV indicator → predicted L4

Stratified by category (185 per non-ID, 90 ID), with:
- Max 2 paragraphs per company per category stratum
- Secondary floor: ≥100 per predicted v2 specificity level
- Random within strata (no difficulty weighting)

Outputs:
- data/gold/v2-holdout-ids.json (list of paragraph IDs)
- data/gold/v2-holdout-manifest.jsonl (full metadata per paragraph)
"""

import json
import random
import re
import sys
from collections import Counter, defaultdict
from pathlib import Path

random.seed(42)  # reproducible sampling

DATA = Path("data")
PARAGRAPHS_PATH = DATA / "paragraphs" / "paragraphs-clean.patched.jsonl"
STAGE1_PATH = DATA / "annotations" / "stage1.patched.jsonl"
V1_HOLDOUT_PATH = Path("labelapp") / ".sampled-ids.original.json"
OUTPUT_IDS = DATA / "gold" / "v2-holdout-ids.json"
OUTPUT_MANIFEST = DATA / "gold" / "v2-holdout-manifest.jsonl"

# ── Allocation ──────────────────────────────────────────────────────────────

TOTAL = 1200
CATEGORY_ALLOC = {
    "Board Governance": 185,
    "Management Role": 185,
    "Risk Management Process": 185,
    "Third-Party Risk": 185,
    "Strategy Integration": 185,
    "None/Other": 185,
    "Incident Disclosure": 90,
}
assert sum(CATEGORY_ALLOC.values()) == TOTAL

SPECIFICITY_FLOOR = 100  # minimum per predicted v2 specificity level
MAX_PER_COMPANY_PER_STRATUM = 2

# ── Domain terminology keywords (v2 codebook) ──────────────────────────────
# Applied to v1 L1 paragraphs: any match → predicted L2

DOMAIN_TERMS = [
    # Practices and activities
    r"penetration test", r"pen test", r"vulnerability scan", r"vulnerability assess",
    r"red team", r"phishing simul", r"security awareness training",
    r"threat hunt", r"threat intellig", r"patch management",
    r"identity and access management", r"\bIAM\b",
    r"data loss prevention", r"\bDLP\b",
    r"network segmentation", r"encryption",
    # Tools and infrastructure
    r"\bSIEM\b", r"security information and event management",
    r"\bSOC\b", r"security operations center",
    r"\bEDR\b", r"\bXDR\b", r"\bMDR\b", r"endpoint detection",
    r"\bWAF\b", r"web application firewall",
    r"\bIDS\b", r"\bIPS\b", r"intrusion detection", r"intrusion prevention",
    r"\bMFA\b", r"multi-factor auth", r"two-factor auth", r"\b2FA\b",
    r"\bfirewall\b", r"antivirus", r"anti-malware",
    # Architectural concepts
    r"zero trust", r"defense in depth", r"least privilege",
    # Named standards (already L2 in v1, but catch any misses)
    r"\bNIST\b", r"ISO 27001", r"ISO 27002", r"\bSOC 2\b", r"CIS Controls",
    r"PCI[ -]DSS", r"\bHIPAA\b", r"\bGDPR\b", r"\bCOBIT\b", r"MITRE ATT",
    # Threat types
    r"ransomware", r"\bmalware\b", r"phishing",
    r"\bDDoS\b", r"supply chain attack", r"social engineering",
    r"advanced persistent threat", r"\bAPT\b", r"zero[- ]day",
]
DOMAIN_PATTERNS = [re.compile(t, re.IGNORECASE) for t in DOMAIN_TERMS]

# ── QV indicators (v2 codebook: 1+ → Level 4) ──────────────────────────────
# Applied to v1 L3 paragraphs: any match → predicted L4

QV_PATTERNS_RAW = [
    r"\$[\d,]+",  # dollar amounts
    r"\b\d{1,3}(?:,\d{3})*\s*(?:million|billion|thousand)\b",
    # Specific dates (month + year or exact)
    r"\b(?:January|February|March|April|May|June|July|August|September|October|November|December)\s+\d{1,2},?\s+\d{4}\b",
    r"\b(?:January|February|March|April|May|June|July|August|September|October|November|December)\s+\d{4}\b",
    # Named certifications
    r"\bCISSP\b", r"\bCISM\b", r"\bCEH\b", r"\bCRISC\b",
    # Named third-party firms
    r"\bMandiant\b", r"\bCrowdStrike\b", r"\bDeloitte\b", r"\bKPMG\b",
    r"\bPricewaterhouse\b", r"\bPwC\b", r"\bErnst\s*&\s*Young\b", r"\bEY\b",
    r"\bAccenture\b", r"\bBooz Allen\b", r"\bProtiviti\b", r"\bKroll\b",
    # Named products/tools
    r"\bSplunk\b", r"\bAzure Sentinel\b", r"\bCrowdStrike Falcon\b",
    r"\bServiceNow\b", r"\bPalo Alto\b", r"\bFortinet\b", r"\bSailPoint\b",
    r"\bOkta\b", r"\bZscaler\b", r"\bCarbon Black\b", r"\bSentinelOne\b",
    r"\bQualys\b", r"\bTenable\b", r"\bRapid7\b", r"\bProofpoint\b",
    r"\bMimecast\b", r"\bDarktrace\b", r"\bKnowBe4\b",
    # Headcounts / years of experience
    r"\b\d+\s+(?:years?|professionals?|employees?|staff|members?|individuals?)\b",
    # Specific percentages
    r"\b\d+(?:\.\d+)?%",
    # Advanced degrees in credential context
    r"\bPh\.?D\.?\b", r"\bMaster'?s?\s+(?:of|in)\b",
]
QV_PATTERNS = [re.compile(p, re.IGNORECASE) for p in QV_PATTERNS_RAW]


def has_domain_terminology(text: str) -> bool:
    return any(p.search(text) for p in DOMAIN_PATTERNS)


def has_qv_indicator(text: str) -> bool:
    return any(p.search(text) for p in QV_PATTERNS)


def predict_v2_specificity(v1_spec: int, text: str) -> int:
    """Heuristic v2 specificity prediction from v1 consensus + text scan."""
    if v1_spec == 1 and has_domain_terminology(text):
        return 2
    if v1_spec == 3 and has_qv_indicator(text):
        return 4
    return v1_spec


# ── Main ────────────────────────────────────────────────────────────────────

def main():
    # Load paragraphs
    print("Loading paragraphs...", file=sys.stderr)
    paragraphs = {}
    with open(PARAGRAPHS_PATH) as f:
        for line in f:
            p = json.loads(line)
            paragraphs[p["id"]] = p
    print(f"  Loaded {len(paragraphs)} paragraphs", file=sys.stderr)

    # Load v1 holdout IDs (informational only — v1 holdout paragraphs stay eligible)
    v1_holdout = set()
    if V1_HOLDOUT_PATH.exists():
        v1_holdout = set(json.loads(V1_HOLDOUT_PATH.read_text()))
        print(f"  Loaded {len(v1_holdout)} v1 holdout IDs (will NOT exclude — they're eligible)", file=sys.stderr)

    # Load stage1 annotations and compute consensus
    print("Computing v1 consensus...", file=sys.stderr)
    ann_by_pid: dict[str, list[dict]] = defaultdict(list)
    with open(STAGE1_PATH) as f:
        for line in f:
            r = json.loads(line)
            ann_by_pid[r["paragraphId"]].append(r["label"])

    # Build consensus records (one dict per paragraph with full 3-model coverage)
    records = []
    for pid, labels in ann_by_pid.items():
        if len(labels) != 3:
            continue
        if pid not in paragraphs:
            continue

        p = paragraphs[pid]
        text = p["text"]
        company = p["filing"]["cik"]  # CIK is unique per company

        # Majority vote
        cats = [l["content_category"] for l in labels]
        cat = Counter(cats).most_common(1)[0][0]

        specs = [l["specificity_level"] for l in labels]
        spec = Counter(specs).most_common(1)[0][0]

        v2_spec = predict_v2_specificity(spec, text)

        records.append({
            "id": pid,
            "v1_category": cat,
            "v1_specificity": spec,
            "v2_pred_specificity": v2_spec,
            "company_cik": company,
            "company_name": p["filing"]["companyName"],
            "text_preview": text[:120],
            "word_count": p["wordCount"],
            "text": text,
        })

    print(f"  {len(records)} paragraphs with 3-model consensus", file=sys.stderr)

    # ── Distribution report ─────────────────────────────────────────────
    print("\n  v1 Category distribution:", file=sys.stderr)
    cat_counts = Counter(r["v1_category"] for r in records)
    for cat, count in sorted(cat_counts.items(), key=lambda x: -x[1]):
        print(f"    {cat:30s} {count:6d} ({count/len(records)*100:.1f}%)", file=sys.stderr)

    print("\n  v2 Predicted specificity distribution:", file=sys.stderr)
    spec_counts = Counter(r["v2_pred_specificity"] for r in records)
    for spec in sorted(spec_counts):
        count = spec_counts[spec]
        print(f"    Level {spec}: {count:6d} ({count/len(records)*100:.1f}%)", file=sys.stderr)

    # ── Stratified sampling ─────────────────────────────────────────────
    print("\nSampling holdout...", file=sys.stderr)

    # Group by category
    by_category: dict[str, list[dict]] = defaultdict(list)
    for r in records:
        by_category[r["v1_category"]].append(r)

    selected_ids: set[str] = set()
    selected_records: list[dict] = []

    for cat, alloc in CATEGORY_ALLOC.items():
        pool = by_category.get(cat, [])
        random.shuffle(pool)

        # Enforce per-company cap
        company_counts: dict[str, int] = defaultdict(int)
        eligible = []
        for r in pool:
            cik = r["company_cik"]
            if company_counts[cik] < MAX_PER_COMPANY_PER_STRATUM:
                eligible.append(r)
                company_counts[cik] += 1

        # Take up to allocation
        taken = eligible[:alloc]
        if len(taken) < alloc:
            print(f"  WARNING: {cat} — only {len(taken)}/{alloc} available after company cap", file=sys.stderr)

        for r in taken:
            selected_ids.add(r["id"])
            selected_records.append(r)

        print(f"  {cat:30s} selected {len(taken):4d}/{alloc} (pool {len(pool)})", file=sys.stderr)

    print(f"\n  Initial selection: {len(selected_records)}", file=sys.stderr)

    # ── Specificity floor enforcement ───────────────────────────────────
    spec_selected = Counter(r["v2_pred_specificity"] for r in selected_records)
    print("\n  Predicted v2 specificity in selection:", file=sys.stderr)
    for spec in sorted(spec_selected):
        count = spec_selected[spec]
        status = "OK" if count >= SPECIFICITY_FLOOR else f"BELOW FLOOR ({SPECIFICITY_FLOOR})"
        print(f"    Level {spec}: {count:4d} — {status}", file=sys.stderr)

    # If any level is below floor, swap in candidates from over-represented levels
    for target_level in [1, 2, 3, 4]:
        deficit = SPECIFICITY_FLOOR - spec_selected.get(target_level, 0)
        if deficit <= 0:
            continue

        print(f"\n  Boosting Level {target_level} by {deficit}...", file=sys.stderr)

        # Find candidates NOT yet selected, matching target specificity
        candidates = [
            r for r in records
            if r["id"] not in selected_ids
            and r["v2_pred_specificity"] == target_level
        ]
        random.shuffle(candidates)

        # Find swappable selected records from over-represented specificity levels
        over_levels = [
            lvl for lvl in [1, 2, 3, 4]
            if spec_selected.get(lvl, 0) > SPECIFICITY_FLOOR + 20  # only swap from levels well above floor
        ]
        swappable = [
            r for r in selected_records
            if r["v2_pred_specificity"] in over_levels
        ]
        random.shuffle(swappable)

        swapped = 0
        for cand in candidates:
            if swapped >= deficit:
                break
            if not swappable:
                break

            # Find a swappable record from the same category (maintain category balance)
            for i, swap_r in enumerate(swappable):
                if swap_r["v1_category"] == cand["v1_category"]:
                    # Swap
                    selected_ids.remove(swap_r["id"])
                    selected_records.remove(swap_r)
                    selected_ids.add(cand["id"])
                    selected_records.append(cand)
                    swappable.pop(i)
                    spec_selected[swap_r["v2_pred_specificity"]] -= 1
                    spec_selected[target_level] += 1
                    swapped += 1
                    break

        if swapped < deficit:
            print(f"  Could only boost by {swapped}/{deficit} (not enough same-category swaps)", file=sys.stderr)
        else:
            print(f"  Boosted by {swapped}", file=sys.stderr)

    # ── Final report ────────────────────────────────────────────────────
    print(f"\n  Final selection: {len(selected_records)}", file=sys.stderr)

    print("\n  Final category distribution:", file=sys.stderr)
    final_cat = Counter(r["v1_category"] for r in selected_records)
    for cat, count in sorted(final_cat.items(), key=lambda x: -x[1]):
        print(f"    {cat:30s} {count:4d}", file=sys.stderr)

    print("\n  Final predicted v2 specificity distribution:", file=sys.stderr)
    final_spec = Counter(r["v2_pred_specificity"] for r in selected_records)
    for spec in sorted(final_spec):
        count = final_spec[spec]
        status = "OK" if count >= SPECIFICITY_FLOOR else "BELOW"
        print(f"    Level {spec}: {count:4d} — {status}", file=sys.stderr)

    # Company diversity check
    companies = Counter(r["company_cik"] for r in selected_records)
    print(f"\n  Companies represented: {len(companies)}", file=sys.stderr)
    print(f"  Max paragraphs from one company: {companies.most_common(1)[0][1]}", file=sys.stderr)

    # ── Write outputs ───────────────────────────────────────────────────
    ids = sorted(r["id"] for r in selected_records)
    OUTPUT_IDS.parent.mkdir(parents=True, exist_ok=True)
    OUTPUT_IDS.write_text(json.dumps(ids, indent=2) + "\n")
    print(f"\n  Wrote {len(ids)} IDs to {OUTPUT_IDS}", file=sys.stderr)

    with open(OUTPUT_MANIFEST, "w") as f:
        for r in sorted(selected_records, key=lambda x: x["v1_category"]):
            manifest = {
                "id": r["id"],
                "v1_category": r["v1_category"],
                "v1_specificity": r["v1_specificity"],
                "v2_pred_specificity": r["v2_pred_specificity"],
                "company_cik": r["company_cik"],
                "company_name": r["company_name"],
                "word_count": r["word_count"],
                "text_preview": r["text_preview"],
            }
            f.write(json.dumps(manifest) + "\n")
    print(f"  Wrote manifest to {OUTPUT_MANIFEST}", file=sys.stderr)


if __name__ == "__main__":
    main()
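The per-company cap in the sampler above can be exercised on toy data; the paragraph IDs and CIKs below are made up for illustration:

```python
import random
from collections import defaultdict

random.seed(42)

# Toy pool of (paragraph_id, company_cik) pairs; the first company dominates.
pool = [("p1", "0000320193"), ("p2", "0000320193"), ("p3", "0000320193"),
        ("p4", "0000789019"), ("p5", "0001018724")]
random.shuffle(pool)

MAX_PER_COMPANY = 2
counts: dict[str, int] = defaultdict(int)
eligible = []
for pid, cik in pool:
    if counts[cik] < MAX_PER_COMPANY:
        eligible.append(pid)
        counts[cik] += 1

# The dominant company is capped at 2; the other two keep their single paragraph.
assert len(eligible) == 4
assert max(counts.values()) == MAX_PER_COMPANY
```

Shuffling before the cap pass means which two of the dominant company's paragraphs survive is random, but the cap itself holds regardless of order.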
@@ -305,7 +305,7 @@ async function main() {
   // 7. SPECIFICITY DISTRIBUTION (AGGREGATE)
   // ════════════════════════════════════════════════════════════════════
   console.log("\n── Specificity Distribution (all annotations) ──────────────");
-  const specLabels = ["Generic Boilerplate", "Sector-Adapted", "Firm-Specific", "Quantitatively Verifiable"];
+  const specLabels = ["Generic Boilerplate", "Domain-Adapted", "Firm-Specific", "Quantified-Verifiable"];
   const aggSpec = new Map<number, number>();
   for (const a of anns) {
     aggSpec.set(a.label.specificity_level, (aggSpec.get(a.label.specificity_level) ?? 0) + 1);
@@ -228,8 +228,8 @@ async function cmdJudge(): Promise<void> {
 }
 
 async function cmdGolden(): Promise<void> {
-  // Load the 1,200 human-labeled paragraph IDs from the original sample
-  const sampledIdsPath = flag("ids") ?? "../labelapp/.sampled-ids.original.json";
+  // Load holdout paragraph IDs (v2 default, override with --ids)
+  const sampledIdsPath = flag("ids") ?? `${DATA}/gold/v2-holdout-ids.json`;
   const sampledIds = new Set<string>(
     JSON.parse(await import("node:fs/promises").then((fs) => fs.readFile(sampledIdsPath, "utf-8"))),
   );
@@ -248,17 +248,22 @@ async function cmdGolden(): Promise<void> {
     process.exit(1);
   }
 
+  const promptOnly = rest.includes("--prompt-only");
+  const outputDir = promptOnly ? "v2-bench" : "golden";
+  const outputFile = promptOnly ? "opus-4.6" : "opus";
+
   await runGoldenBatch(paragraphs, {
-    outputPath: `${DATA}/annotations/golden/opus.jsonl`,
-    errorsPath: `${DATA}/annotations/golden/opus-errors.jsonl`,
+    outputPath: `${DATA}/annotations/${outputDir}/${outputFile}.jsonl`,
+    errorsPath: `${DATA}/annotations/${outputDir}/${outputFile}-errors.jsonl`,
     limit: flag("limit") !== undefined ? flagInt("limit", 50) : undefined,
     delayMs: flag("delay") !== undefined ? flagInt("delay", 1000) : 1000,
     concurrency: flagInt("concurrency", 1),
+    promptOnly,
   });
 }
 
-async function loadHoldoutParagraphs(): Promise<Paragraph[]> {
-  const sampledIdsPath = "../labelapp/.sampled-ids.original.json";
+async function loadHoldoutParagraphs(idsPath?: string): Promise<Paragraph[]> {
+  const sampledIdsPath = idsPath ?? `${DATA}/gold/v2-holdout-ids.json`;
   const sampledIds = new Set<string>(
     JSON.parse(await import("node:fs/promises").then((fs) => fs.readFile(sampledIdsPath, "utf-8"))),
   );
@@ -284,14 +289,16 @@ async function cmdBenchHoldout(): Promise<void> {
     console.error("--model is required");
     process.exit(1);
   }
-  const paragraphs = await loadHoldoutParagraphs();
+  const idsPath = flag("ids");
+  const paragraphs = await loadHoldoutParagraphs(idsPath ?? undefined);
   const modelShort = modelId.split("/")[1]!;
+  const outputDir = flag("output-dir") ?? "v2-bench";
+
   await runBatch(paragraphs, {
     modelId,
     stage: "benchmark",
-    outputPath: `${DATA}/annotations/bench-holdout/${modelShort}.jsonl`,
-    errorsPath: `${DATA}/annotations/bench-holdout/${modelShort}-errors.jsonl`,
+    outputPath: `${DATA}/annotations/${outputDir}/${modelShort}.jsonl`,
+    errorsPath: `${DATA}/annotations/${outputDir}/${modelShort}-errors.jsonl`,
     sessionsPath: SESSIONS_PATH,
     concurrency: flagInt("concurrency", 60),
     limit: flag("limit") !== undefined ? flagInt("limit", 50) : undefined,
@@ -299,24 +306,22 @@
 }
 
 async function cmdBenchHoldoutAll(): Promise<void> {
-  const paragraphs = await loadHoldoutParagraphs();
+  const idsPath = flag("ids");
+  const paragraphs = await loadHoldoutParagraphs(idsPath ?? undefined);
   const concurrency = flagInt("concurrency", 60);
   const limit = flag("limit") !== undefined ? flagInt("limit", 50) : undefined;
+  const outputDir = flag("output-dir") ?? "v2-bench";
 
-  // Exclude Stage 1 models — we already have their annotations
-  const benchModels = BENCHMARK_MODELS.filter(
-    (m) => !(STAGE1_MODELS as readonly string[]).includes(m),
-  );
-  process.stderr.write(`  Running ${benchModels.length} benchmark models (excluding Stage 1 panel)\n`);
+  process.stderr.write(`  Running ${BENCHMARK_MODELS.length} benchmark models → ${outputDir}/\n`);
 
-  for (const modelId of benchModels) {
+  for (const modelId of BENCHMARK_MODELS) {
     const modelShort = modelId.split("/")[1]!;
     process.stderr.write(`\n  ═══ ${modelId} ═══\n`);
     await runBatch(paragraphs, {
       modelId,
       stage: "benchmark",
-      outputPath: `${DATA}/annotations/bench-holdout/${modelShort}.jsonl`,
-      errorsPath: `${DATA}/annotations/bench-holdout/${modelShort}-errors.jsonl`,
+      outputPath: `${DATA}/annotations/${outputDir}/${modelShort}.jsonl`,
+      errorsPath: `${DATA}/annotations/${outputDir}/${modelShort}-errors.jsonl`,
       sessionsPath: SESSIONS_PATH,
       concurrency,
       limit,
@@ -1,13 +1,14 @@
 import pLimit from "p-limit";
 import { v4 as uuidv4 } from "uuid";
 import { loadCompletedIds } from "../lib/checkpoint.ts";
-import { appendJsonl } from "../lib/jsonl.ts";
+import { appendJsonl, readJsonlRaw } from "../lib/jsonl.ts";
 import { classifyError } from "../lib/retry.ts";
 import type { Annotation } from "@sec-cybert/schemas/annotation.ts";
 import type { SessionLog } from "@sec-cybert/schemas/session.ts";
 import type { Paragraph } from "@sec-cybert/schemas/paragraph.ts";
 import { annotateParagraph, type AnnotateOpts } from "./annotate.ts";
 import { PROMPT_VERSION } from "./prompts.ts";
+import { writeFile } from "node:fs/promises";
 
 export interface BatchOpts {
   modelId: string;
@@ -214,6 +215,84 @@ export async function runBatch(
   renderDashboard(stats, total, modelId, promptVersion, command);
   process.stderr.write("\n".repeat(7)); // move past dashboard
 
+  // Retry sweep: re-attempt failed paragraphs with fresh connections.
+  // Some models (GLM, MiniMax) have transient issues that resolve on a clean retry.
+  // We do up to 3 sweep passes, stopping early if no progress is made.
+  if (stats.errored > 0 && !stopping) {
+    const maxSweeps = 3;
+    const paragraphMap = new Map(remaining.map((p) => [p.id, p]));
+
+    for (let sweep = 1; sweep <= maxSweeps; sweep++) {
+      // Load error IDs from the errors file
+      const { records: errorRecords } = await readJsonlRaw(errorsPath);
+      const errorIds = new Set(
+        errorRecords
+          .filter((r): r is { paragraphId: string } =>
+            !!r && typeof r === "object" && "paragraphId" in r)
+          .map((r) => r.paragraphId),
+      );
+
+      // Remove any that succeeded on a previous sweep (now in output file)
+      const { completedIds: nowDone } = await loadCompletedIds(outputPath);
+      const retryIds = [...errorIds].filter((id) => !nowDone.has(id));
+
+      if (retryIds.length === 0) break;
+
+      process.stderr.write(
+        `\n  ⟳ Retry sweep ${sweep}/${maxSweeps}: ${retryIds.length} failed paragraphs\n`,
+      );
+
+      let sweepRecovered = 0;
+      let sweepFailed = 0;
+      const sweepLimit = pLimit(concurrency);
+
+      const sweepTasks = retryIds.map((id) =>
+        sweepLimit(async () => {
+          if (stopping) return;
+          const paragraph = paragraphMap.get(id);
+          if (!paragraph) return;
+
+          try {
+            const annotation = await annotateParagraph(paragraph, annotateOpts);
+            await appendJsonl(outputPath, annotation);
+            stats.processed++;
+            stats.totalCostUsd += annotation.provenance.costUsd;
+            stats.totalReasoningTokens += annotation.provenance.reasoningTokens;
+            stats.latencies.push(annotation.provenance.latencyMs);
+            sweepRecovered++;
+          } catch {
+            sweepFailed++;
+          }
+        }),
+      );
+
+      await Promise.all(sweepTasks);
+
+      process.stderr.write(
+        `  ⟳ Sweep ${sweep}: recovered ${sweepRecovered}, still failing ${sweepFailed}\n`,
+      );
+
+      // No progress — stop sweeping
+      if (sweepRecovered === 0) break;
+    }
+
+    // Rewrite errors file with only the still-failing paragraphs
+    const { completedIds: finalDone } = await loadCompletedIds(outputPath);
+    const { records: allErrors } = await readJsonlRaw(errorsPath);
+    const stillFailing = allErrors.filter((r) => {
+      const rec = r as { paragraphId?: string };
+      return rec.paragraphId && !finalDone.has(rec.paragraphId);
+    });
+    // Deduplicate: keep only the last error per paragraph
+    const lastErrorByParagraph = new Map<string, object>();
+    for (const err of stillFailing) {
+      const rec = err as { paragraphId: string };
+      lastErrorByParagraph.set(rec.paragraphId, err as object);
+    }
+    await writeFile(
+      errorsPath,
+      [...lastErrorByParagraph.values()].map((e) => JSON.stringify(e)).join("\n") +
+        (lastErrorByParagraph.size > 0 ? "\n" : ""),
+    );
+    stats.errored = lastErrorByParagraph.size;
+  }
+
   // Session log
   const endedAt = new Date().toISOString();
   const durationSeconds = (Date.now() - stats.startTime) / 1000;
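The retry-sweep change above amounts to a bounded retry loop that stops once a pass recovers nothing; a minimal Python sketch of that control flow (hypothetical `attempt` worker, not the project's API):

```python
from collections import Counter

def run_with_sweeps(items, attempt, max_sweeps=3):
    """Retry failed items in up to max_sweeps passes; stop early on no progress."""
    done, failed = {}, list(items)
    for _ in range(max_sweeps):
        if not failed:
            break
        still_failing = []
        for item in failed:
            try:
                done[item] = attempt(item)
            except Exception:
                still_failing.append(item)
        if len(still_failing) == len(failed):  # nothing recovered this sweep
            failed = still_failing
            break
        failed = still_failing
    return done, failed

calls = Counter()

def flaky(x):
    # Hypothetical worker: "b" fails once with a transient error, then succeeds.
    calls[x] += 1
    if x == "b" and calls[x] < 2:
        raise RuntimeError("transient")
    return x.upper()

done, failed = run_with_sweeps(["a", "b"], flaky)
print(done, failed)  # {'a': 'A', 'b': 'B'} []
```

The early-exit on a zero-progress sweep is what keeps persistent failures from burning all the retry budget.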
@@ -3,10 +3,10 @@
  *
  * Uses the user's Claude Code subscription (OAuth) instead of API keys,
  * calling Opus 4.6 through the Agent SDK's `query()` with structured output.
- * Designed for the ~1,200 human-labeled paragraphs.
+ * Designed for the ~1,200 holdout paragraphs.
  *
  * Key differences from Stage 1/2 (OpenRouter):
- * - Full codebook (docs/LABELING-CODEBOOK.md) + v2.5 prompt as system prompt
+ * - Full v2 codebook (docs/LABELING-CODEBOOK.md) + operational prompt as system prompt
  * - Saves reasoning traces (Opus adaptive thinking) alongside annotations
  * - Saves raw confidence values before coercion
  * - No API cost — runs on Max subscription
@@ -76,23 +76,36 @@ export interface GoldenBatchOpts {
   delayMs?: number;
   /** Number of concurrent workers. Default 1 (serial). */
   concurrency?: number;
+  /** If true, use only the operational prompt (no codebook). Default false. */
+  promptOnly?: boolean;
 }
 
-/** Build the enhanced system prompt: full codebook + v2.5 operational prompt + JSON schema. */
-async function buildGoldenSystemPrompt(): Promise<string> {
+/** Build the enhanced system prompt: full v2 codebook + operational prompt + JSON schema. */
+async function buildGoldenSystemPrompt(includeCodebook: boolean): Promise<string> {
+  const jsonSchema = JSON.stringify(z.toJSONSchema(GoldenOutputSchema), null, 2);
+
+  if (!includeCodebook) {
+    return `${SYSTEM_PROMPT}
+
+═══════════════════════════════════════════════════════════════════════
+OUTPUT JSON SCHEMA
+You MUST return JSON matching this exact schema. Use text labels for
+specificity (not integers). Confidence is a number 0-1.
+═══════════════════════════════════════════════════════════════════════
+
+${jsonSchema}`;
+  }
+
   const codebookPath = new URL("../../../docs/LABELING-CODEBOOK.md", import.meta.url).pathname;
   const codebook = await readFile(codebookPath, "utf-8");
+
   // Strip the old "LLM Response Schema" section from the codebook to avoid
   // conflicting with the actual JSON schema we enforce via outputFormat.
-  // The old section uses specificity_level (integer) instead of specificity (string label).
   const schemaHeading = "## LLM Response Schema";
   const codebookTrimmed = codebook.includes(schemaHeading)
     ? codebook.slice(0, codebook.indexOf(schemaHeading)).trimEnd()
     : codebook;
-
-  const jsonSchema = JSON.stringify(z.toJSONSchema(GoldenOutputSchema), null, 2);
 
   return `${codebookTrimmed}
 
 ═══════════════════════════════════════════════════════════════════════
```diff
@@ -129,6 +142,7 @@ async function annotateGolden(
   paragraph: Paragraph,
   runId: string,
   systemPrompt: string,
+  promptVersionLabel: string,
 ): Promise<GoldenAnnotation> {
   const requestedAt = new Date().toISOString();
   const start = Date.now();
@@ -230,7 +244,7 @@ async function annotateGolden(
     generationId: "agent-sdk",
     stage: "benchmark",
     runId,
-    promptVersion: `${PROMPT_VERSION}+codebook`,
+    promptVersion: promptVersionLabel,
     inputTokens: result.inputTokens,
     outputTokens: result.outputTokens,
     reasoningTokens: 0, // included in outputTokens, not broken out by SDK
```
```diff
@@ -256,11 +270,12 @@ export async function runGoldenBatch(
   paragraphs: Paragraph[],
   opts: GoldenBatchOpts,
 ): Promise<void> {
-  const { outputPath, errorsPath, limit, delayMs = 1000, concurrency = 1 } = opts;
+  const { outputPath, errorsPath, limit, delayMs = 1000, concurrency = 1, promptOnly = false } = opts;
   const runId = uuidv4();
+  const promptVersionLabel = promptOnly ? PROMPT_VERSION : `${PROMPT_VERSION}+codebook`;
 
-  // Build system prompt once (codebook + operational prompt)
-  const systemPrompt = await buildGoldenSystemPrompt();
+  // Build system prompt once
+  const systemPrompt = await buildGoldenSystemPrompt(!promptOnly);
   process.stderr.write(` System prompt: ${(systemPrompt.length / 1024).toFixed(1)}KB\n`);
 
   // Resume support
@@ -322,7 +337,7 @@ export async function runGoldenBatch(
     const paragraph = remaining[idx]!;
 
     try {
-      const annotation = await annotateGolden(paragraph, runId, systemPrompt);
+      const annotation = await annotateGolden(paragraph, runId, systemPrompt, promptVersionLabel);
       await safeAppend(outputPath, annotation);
       processed++;
     } catch (error) {
```
```diff
@@ -1,118 +1,133 @@
 import type { Paragraph } from "@sec-cybert/schemas/paragraph.ts";
 
-export const PROMPT_VERSION = "v3.5";
+export const PROMPT_VERSION = "v4.5";
 
-/** System prompt for all Stage 1 annotation and benchmarking. */
+/** System prompt for all Stage 1 annotation and benchmarking (v2 codebook). */
 export const SYSTEM_PROMPT = `You are an expert annotator classifying paragraphs from SEC cybersecurity disclosures (Form 10-K Item 1C and Form 8-K Item 1.05 filings).
 
 For each paragraph, assign a content_category and specificity level.
 
 ═══ CONTENT CATEGORY ═══
 
-Assign the single most applicable category:
+For every paragraph, ask: "What question does this paragraph primarily answer?"
 
-"Board Governance" — Board/committee oversight of cyber risk, briefing cadence, board cyber expertise. Assign when the paragraph describes the governance/oversight STRUCTURE — how the board exercises oversight, who reports to the board, how information flows upward. Governance-chain paragraphs (board → committee → officer → program) are BG even when officers appear as grammatical subjects, because the PURPOSE is describing oversight structure.
+"Board Governance" — How does the board oversee cybersecurity? Board/committee oversight of cybersecurity risks, briefing frequency and scope, board member expertise, delegation of oversight, how information flows to the board. Governance-chain paragraphs (Board → Committee → Officer → Program) are BG when the purpose is describing oversight structure.
-"Management Role" — CISO/CTO/CIO identification, qualifications, reporting lines. Assign when the paragraph is primarily about WHO the person IS — their credentials, experience, certifications, career history. Naming an officer as part of a governance or process description does NOT make it Management Role.
+"Management Role" — How is management organized to handle cybersecurity? Leadership roles and responsibilities, qualifications and credentials, career history, management-level committee structure and membership, reporting lines between management roles, team composition and size.
-"Risk Management Process" — Risk assessment, framework adoption, vulnerability management, monitoring, IR planning, ERM integration. Assign when the company's OWN internal processes are the topic.
+"Risk Management Process" — What does the cybersecurity program do? Risk assessment methodology, framework adoption, vulnerability management, security monitoring, incident response planning, security operations, tools and technologies, employee training, ERM integration.
-"Third-Party Risk" — Vendor/supplier security oversight, contractual security standards. Assign ONLY when vendor oversight is the CENTRAL topic, not a component of internal processes.
+"Third-Party Risk" — How are third-party cyber risks managed? Vendor/supplier cybersecurity oversight, external assessor requirements, contractual security requirements, supply chain risk management. Must be the CENTRAL topic.
-"Incident Disclosure" — Description of actual cybersecurity incidents: what happened, when, scope, response actions. Must reference a real event. Includes: incident narrative, incident response actions, AND descriptions of affected data/systems scope or operational impact of a disclosed incident.
+"Incident Disclosure" — What happened in a cybersecurity incident? Description of actual incidents: nature, scope, timing, impact, remediation, investigation. Must reference a real event. Hypothetical incident language ("we may experience...") is NOT Incident Disclosure.
-"Strategy Integration" — Business/financial impact, cyber insurance, budget, materiality ASSESSMENTS. A materiality assessment is the company stating a conclusion about whether cybersecurity has or will affect business outcomes. Includes: backward-looking ("have not materially affected"), forward-looking with SEC qualifier ("reasonably likely to materially affect"), and negative assertions ("have not experienced material incidents"). Does NOT include generic risk warnings ("could have a material adverse effect") — those are boilerplate speculation, not assessments. Does NOT include "material" as an adjective ("managing material risks").
+"Strategy Integration" — How does cybersecurity affect the business or finances? Materiality assessments, cybersecurity insurance, budget/investment allocation, cost of incidents, business strategy impact.
-"None/Other" — Forward-looking disclaimers, section headers, cross-references, non-cybersecurity content, generic IT-dependence language ("our IT systems are important"). NO substantive disclosure AND no materiality language at all.
+"None/Other" — None of the six substantive questions. Forward-looking disclaimers, section headers, cross-references, non-cybersecurity content.
 
-CATEGORY TIEBREAKERS:
+If a paragraph touches multiple categories, assign the one whose question it most directly answers. When genuinely split, the category with the most text wins.
-- Paragraph DESCRIBES what happened in an incident (dates, access, encryption, scope, response actions) → Incident Disclosure
-- Paragraph ONLY discusses financial cost, insurance, or materiality of an incident WITHOUT describing the event → Strategy Integration (even if it says "the incident" or "the cybersecurity incident")
-- Brief mention of a past incident + materiality conclusion as the main point → Strategy Integration
-- Standalone materiality conclusion with no incident reference → Strategy Integration
-- Materiality ASSESSMENTS → Strategy Integration. An assessment is the company stating a conclusion:
-• Backward: "have not materially affected our business strategy, results of operations, or financial condition" → SI
-• Forward with SEC qualifier: "reasonably likely to materially affect" → SI
-• Negative assertion: "we have not experienced any material cybersecurity incidents" → SI
-NOT assessments (do NOT trigger SI):
-• Generic risk warning: "could have a material adverse effect on our business" → NOT SI. This is boilerplate speculation in every 10-K, not a conclusion. Classify by the paragraph's primary content.
-• "Material" as adjective: "managing material risks" → NOT SI. "Material" means "significant" here, not a materiality assessment.
-• Consequence clause: SPECULATIVE materiality language ("could have a material adverse effect") at the END of an RMP/risk paragraph does not override the primary purpose. BUT a negative assertion ("we have not experienced any material cybersecurity incidents") IS an assessment even at the end of a paragraph — it is a factual conclusion, not speculation.
-• Cross-references with materiality language: "For risks that may materially affect us, see Item 1A" → N/O (pointing elsewhere, not concluding).
-- SPACs and shell companies explicitly stating they have no operations, no cybersecurity program, or no formal processes → None/Other regardless of incidental mentions of board oversight or risk acknowledgment. The absence of a program is not a description of a program.
-- Internal processes mentioning vendors as one component → Risk Management Process
-- Requirements imposed ON vendors → Third-Party Risk
-- Board oversight mentioned briefly + management roles as main focus → Management Role
-- Management mentioned briefly + board oversight as main focus → Board Governance
 
```
```diff
-MR vs RMP — THREE-STEP DECISION CHAIN (apply in order):
+CATEGORY RULES:
-Step 1 — SUBJECT TEST: What is the grammatical subject?
-Clear process/framework/program as subject with no person detail → Risk Management Process. STOP.
+Rule 1 — BG vs MR: Governance-chain paragraphs default to BG when describing oversight structure. They become MR only when management's organizational role (responsibilities, qualifications, committee membership) is the primary content. Board-level committees (Audit Committee, Risk Committee of the Board) → BG. Management-level committees (Cybersecurity Steering Committee) → MR if about structure/membership, RMP if about activities.
-Person/role as subject → this is a SIGNAL, not decisive. ALWAYS continue to Step 2.
-Step 2 — PERSON-REMOVAL TEST: Delete all named roles, titles, qualifications, experience, and credentials. Is the remaining text a coherent cybersecurity disclosure?
+Rule 2 — MR vs RMP (person-removal test): Remove all person-specific content (names, titles, qualifications, experience, reporting lines, team composition, committee membership). If a substantive cybersecurity program description remains → RMP. If the paragraph collapses to near-nothing → MR. MR is about ROLES (who is responsible, how organized, what qualifies them). RMP is about ACTIVITIES (what the program does, how it operates, what tools it uses). A paragraph where a named officer (CISO, CTO) is the grammatical subject but content describes what the PROGRAM does → RMP.
-YES → Risk Management Process (the process stands alone; people are incidental).
-NO → Management Role (the paragraph is fundamentally about who these people are).
+Rule 3 — TP vs RMP: Third parties hired to serve the company (assessors, pen testers) → RMP. Requirements imposed ON vendors → TP. Third parties mentioned as one component of internal program → RMP. Vendor oversight as central topic → TP.
-Borderline → continue to Step 3.
-Step 3 — QUALIFICATIONS TIEBREAKER: Does the paragraph include years of experience, certifications (CISSP, CISM), education, team size, or career history?
+Rule 4 — ID vs SI: What happened (timeline, scope, response) → ID. Business/financial impact → SI. Mixed with incident frame dominant → ID. Mixed with financial frame dominant → SI.
-YES → Management Role (qualifications are MR-specific content).
-NO → Risk Management Process (no person-specific content beyond a title).
+Rule 5 — SI materiality rule: A paragraph STATING A CONCLUSION about whether cybersecurity has or will affect business outcomes → SI.
-IMPORTANT: A paragraph where a named officer (CISO, CTO) is the grammatical subject but the content describes what the PROGRAM does is Risk Management Process. Step 1 must NOT short-circuit to MR just because a person is mentioned. Always apply Step 2.
+IS an assessment → Strategy Integration:
+• "Have not materially affected our business strategy, results of operations, or financial condition" → SI (backward-looking)
+• "Are reasonably likely to materially affect" → SI (SEC Item 106(b)(2) forward-looking)
+• "Have not experienced any material cybersecurity incidents" → SI (negative assertion)
+NOT an assessment → classify by other content or N/O:
+• "Could have a material adverse effect" → speculation, every 10-K says this → NOT SI
+• "Material" as adjective ("managing material risks") → NOT SI
+• Cross-references: "risks that may materially affect us, see Item 1A" → N/O (pointing elsewhere)
 
+Rule 6 — N/O threshold: Only when no substantive cybersecurity disclosure. If any actual cybersecurity measure/process/assessment is described → relevant category. SPACs and no-operations companies with no program → N/O. Pure speculation without measures → N/O.
 
```
```diff
 ═══ SPECIFICITY ═══
 
-"Generic Boilerplate" — Could paste into any company's filing unchanged. No named entities, frameworks, roles, dates, or specific details.
+IMPORTANT: Specificity and category are INDEPENDENT dimensions. Category captures what the paragraph is ABOUT. Specificity captures how informative the paragraph is AS A WHOLE. Do not assess them together — a paragraph's specificity level is determined by scanning the ENTIRE text for facts, regardless of whether those facts relate to the assigned category.
-"Sector-Adapted" — Names a specific, recognized framework or standard (NIST, ISO 27001, SOC 2, CIS, GDPR, PCI DSS, HIPAA) but contains nothing unique to THIS company. General security practices (penetration testing, vulnerability scanning, tabletop exercises) do NOT qualify — only named standards do.
-"Firm-Specific" — Contains at least one fact that identifies something unique to THIS company's disclosure. See the IS/IS-NOT lists below.
-"Quantified-Verifiable" — Contains TWO or more hard facts that an outsider could independently verify. Count ONLY: specific dates (month+year or exact date), dollar amounts, headcounts, percentages, named third-party firms (Mandiant, CrowdStrike, Deloitte), named tools/products (Splunk, Azure Sentinel, Dropbox Sign). Do NOT count toward this threshold: named roles, named committees, team compositions, reporting cadences, named internal programs, or organizational structure — these are Firm-Specific, not Quantified-Verifiable.
 
-DECISION TEST — ask in order, stop at the first "yes":
+Example: a paragraph primarily about risk management processes (→ RMP) that also mentions "our CISO, who holds CISSP certification" in a subordinate clause is RMP at Level 4. The CISSP is verifiable even though it's not about the process. The paragraph CONTAINS a verifiable fact — that is all that matters for specificity.
-1. Count hard verifiable facts ONLY (specific dates, dollar amounts, headcounts/percentages, named third-party firms, named products/tools). TWO or more? → Quantified-Verifiable
-2. Does it contain at least one fact from the ✓ IS list below? If yes → Firm-Specific. (Even one ✓ IS fact is enough, regardless of how many ✗ NOT items also appear.)
-3. Does it name a recognized standard (NIST, ISO 27001, SOC 2, CIS, GDPR, PCI DSS, HIPAA)? → Sector-Adapted
-4. None of the above? → Generic Boilerplate
 
-✓ IS A SPECIFIC FACT (any ONE of these → at least Firm-Specific):
+Specificity is determined by your specific_facts list — a max() operation:
-- Cybersecurity-specific titles: CISO, CTO, CIO, VP of IT/Security, Information Security Officer, Director of IT Security, HSE Director overseeing cybersecurity
+1. List every specific fact you find ANYWHERE in the paragraph in specific_facts.
-- Named non-generic committees: Technology Committee, Cybersecurity Committee, Risk Committee, ERM Committee (NOT "Audit Committee" — that is standard)
+2. For each fact, determine its level: is it a domain term (Level 2), a firm-unique fact (Level 3), or an externally verifiable claim (Level 4)?
-- Specific team/department compositions: "Legal, Compliance, and Finance"
+3. Your specificity level = the HIGHEST level among your listed facts. One verifiable fact → Level 4. One firm-specific fact → at least Level 3. One domain term → at least Level 2. No facts → Level 1.
-- Specific dates: "In December 2023", "On May 6, 2024", "fiscal 2025"
-- Named internal programs with unique identifiers: "Cyber Incident Response Plan (CIRP)"
-- Named products, systems, tools: Splunk, CrowdStrike Falcon, Azure Sentinel
-- Named third-party firms: Mandiant, Deloitte, CrowdStrike
-- Specific numbers: headcounts, dollar amounts, percentages, exact durations
-- Certification claims: "We maintain ISO 27001 certification" (more than just naming the standard)
 
-✗ IS NOT A SPECIFIC FACT (do NOT extract these in specific_facts, do NOT use them to justify Firm-Specific):
+This is a presence scan, not a holistic judgment. One qualifying fact in 500 words of boilerplate sets the floor. If you listed a fact, it counts — do not second-guess it during reasoning.
-- Generic governance: "the Board", "Board of Directors", "management", "Audit Committee", "the Committee"
-- Generic C-suite: CEO, CFO, COO, President, General Counsel — these exist at every company
-- Unnamed entities: "third-party experts", "external consultants", "cybersecurity firms", "managed service provider"
-- Generic cadences: "quarterly", "annual", "periodic", "regular" — without exact dates
-- Boilerplate phrases: "cybersecurity risks", "material adverse effect", "business operations", "financial condition"
-- Standard incident language: "forensic investigation", "law enforcement", "regulatory obligations", "incident response protocols"
-- Vague quantifiers: "certain systems", "some employees", "a number of", "a portion of"
-- Common practices: "penetration testing", "vulnerability scanning", "tabletop exercises", "phishing simulations", "security awareness training"
-- Generic program names: "incident response plan", "business continuity plan", "cybersecurity program", "Company-wide training"
-- Company self-references: the company's own name, "the Company", filing form types (Form 8-K, Form 10-K)
 
-VALIDATION STEP — before finalizing specificity:
-Review your specific_facts list. Remove any fact that appears on the ✗ NOT list above. If NO facts remain after filtering, the paragraph is Generic Boilerplate (or Sector-Adapted if it names a standard). Do not let ✗ NOT items inflate your specificity rating.
 
-CALIBRATION EXAMPLES:
-"On April 24, 2024, we became aware of unauthorized access to our Dropbox Sign environment" → Quantified-Verifiable (date + named product = two hard verifiable facts)
-"Our CISO reports quarterly to the Audit Committee" → Firm-Specific (CISO is a cybersecurity-specific role; Audit Committee alone would be generic)
-"We follow the NIST Cybersecurity Framework" → Sector-Adapted (named framework, nothing firm-specific)
-"We maintain ISO 27001 certification" → Firm-Specific (holding a certification is a verifiable firm fact, beyond merely naming the standard)
-"The incident is not reasonably likely to materially affect our financial condition" → Generic Boilerplate (boilerplate materiality language)
-"We detected unauthorized access to certain systems and engaged third-party experts" → Generic Boilerplate (no names, no dates, no specifics)
-"In December 2023, we experienced an incident involving unauthorized access to internal systems" → Firm-Specific (specific month/year = one fact, but only one → not QV)
-"A team comprising Legal, Compliance, and Finance assesses cybersecurity threats" → Firm-Specific (specific departmental composition)
-"The Audit Committee receives periodic reports from management on cybersecurity risks" → Generic Boilerplate (Audit Committee is standard, "management" is generic, "periodic" is vague)
-"On May 6, 2024, the Company detected unauthorized access. The Company engaged CrowdStrike to investigate." → Quantified-Verifiable (specific date + named firm = two hard facts)
-"The CISO has 17 years of experience and holds CISSP and CISM certifications" → Quantified-Verifiable (specific number + named certifications = two hard facts)
-"The Company halted operations for approximately two weeks and restored systems by mid-June" → Firm-Specific (approximate duration and approximate date are soft — only one hard fact)
-"Our CISO, supported by a team of 12 security professionals, leads the cybersecurity program" → Quantified-Verifiable (cybersecurity-specific role + specific headcount = two hard facts)
-"Costs related to the incident are expected to be covered by cybersecurity insurance" → Generic Boilerplate (no specific amounts, dates, or named insurers)
-"The Company's ERM program incorporates cybersecurity into 12 risk categories with quarterly reviews" → Firm-Specific (named internal program, but "12 categories" and "quarterly" alone don't reach QV — organizational detail, not hard external facts)
 
 None/Other paragraphs always get Generic Boilerplate.
 
```
```diff
+
+LEVEL 2 — DOMAIN-ADAPTED:
+The paragraph demonstrates cybersecurity EXPERTISE, not just cybersecurity as a TOPIC. Every paragraph in these filings discusses cybersecurity — that's what the filing requires. The question is whether the writer shows they understand HOW cybersecurity works (specific practices, tools, architectural concepts) or merely THAT it exists (generic oversight language any business professional could write).
+
+Apply the ERM test: would a non-security enterprise risk management professional naturally use this language? If no → the paragraph demonstrates domain knowledge → Level 2. This is about the paragraph's VOCABULARY LEVEL, not individual word matching.
+
+Key distinction — process descriptions vs. domain expertise:
+"We conduct vendor security assessments" → L1 (says WHAT they do; any ERM professional would say this about non-cyber vendor risk too)
+"We review vendors' SOC 2 attestations and require encryption at rest" → L2 (shows the writer knows WHAT SPECIFIC EVIDENCE to look for — that requires cybersecurity domain knowledge)
+"We have processes to identify, assess, and manage cybersecurity risks" → L1 (swap "cybersecurity" for "financial" and it still works — no domain expertise shown)
+"Our program includes endpoint detection and response, network segmentation, and threat intelligence feeds" → L2 (these concepts are specific to cybersecurity; an ERM professional wouldn't naturally use them)
+
+Domain terminology includes: specific security practices (penetration testing, vulnerability scanning, threat hunting, patch management), security tools and infrastructure categories (SIEM, SOC, EDR/XDR, WAF, IDS/IPS, MFA), architectural concepts (zero trust, defense in depth, least privilege, network segmentation), named standards (NIST CSF, ISO 27001, SOC 2, PCI DSS), and specific threat types (ransomware, APT, zero-day, DDoS). This list is illustrative, not exhaustive — apply the ERM test to terms not listed.
+
+Commonly confused — these FAIL the ERM test and stay Level 1: risk assessment, risk management, incident response plan, business continuity, disaster recovery, tabletop exercises (without cybersecurity qualifier), enterprise risk management, internal controls, policies and procedures, "processes to identify, assess, and manage risks", "measures to protect our systems and data", "dedicated cybersecurity team."
+
```
```diff
+LEVEL 3 — FIRM-SPECIFIC:
+The paragraph contains at least one fact that distinguishes THIS company's cybersecurity posture from any other public company. Ask: if you removed the company name, would this detail help you narrow down which company wrote it?
+
+Firm-specific facts include: cybersecurity leadership roles at VP level or above (CISO, CTO, CIO, VP/SVP/EVP of IT, VP/SVP/EVP of Information Technology, VP/SVP of Security/Information Security — any VP-or-above title whose portfolio includes IT, cybersecurity, or information security), Director-level titles WITH a security qualifier (Director of IT Security, Director of Cybersecurity, Director of Information Security — but NOT plain "Director of IT" or "Director of Information Technology" without a security qualifier), named non-generic committees (Cybersecurity Committee, Technology Risk Committee — not "Audit Committee" which every company has), specific team compositions by department ("Legal, Compliance, and Finance" — not "a cross-functional team"), named internal programs with distinguishing identifiers ("Cyber Incident Response Plan (CIRP)" — not generic "incident response plan"), named individuals in cybersecurity context, and specific organizational claims ("24/7 security operations").
+
+Commonly confused — NOT firm-specific: generic governance roles that exist at every company (Board, Audit Committee, CEO, CFO, General Counsel), IT titles below VP level without a security qualifier ("Head of IT", "IT Manager", "Director of IT", "Director of Information Technology"), unnamed entities ("third-party experts", "external consultants"), generic cadences ("quarterly", "annual" without specific dates), and cybersecurity practices (penetration testing, vulnerability scanning — these are Domain-Adapted, not Firm-Specific).
+
```
```diff
+LEVEL 4 — QUANTIFIED-VERIFIABLE:
+The paragraph contains at least one hard fact that an external party could independently verify. Ask: could someone outside the company confirm or refute this specific claim? Any ONE verifiable fact ANYWHERE in the paragraph is sufficient — even in a subordinate clause, even if the paragraph's main point is something else.
+
+What counts as a hard number vs. a soft number:
+HARD (→ QV): "12 security professionals", "20 years of experience", "$100M", "March 2024" — a specific quantity someone could check. Lower bounds also count: "more than 20 years", "over 15 years", "at least 10 employees" — these set a verifiable threshold (check if value ≥ N).
+SOFT (→ NOT QV): "approximately 20 departments", "several years", "significant investment", "numerous professionals" — vague or hedged in both directions, not pinned down enough to verify
+
+What makes a fact verifiable — concrete examples:
+YES (verifiable):
+- "holds CISSP certification" → verifiable via (ISC)² registry, even mentioned in passing
+- "20 years of cybersecurity experience" → specific number checkable against a resume
+- "more than 20 years of experience" → lower bound, checkable (is experience ≥ 20?)
+- "$100M in cyber insurance" → specific dollar amount
+- "engaged Mandiant for forensic investigation" → named third-party firm
+- "uses CrowdStrike Falcon for EDR" → named product
+- "On January 15, 2024, we detected unauthorized access" → specific date tied to a real event
+- "team of 12 security professionals" → specific headcount
+- "maintain ISO 27001 certification" → verifiable via certification body
+- "various certifications including CISSP" → "various" is soft but CISSP is specifically named and independently verifiable
+NO (not verifiable):
+- "For the year ended December 31, 2024" → reporting period context, not a cybersecurity fact
+- "Our CISO" → a role title is Firm-Specific (Level 3), not a quantified claim
+- "aligned with NIST CSF" → aspiration not audited fact → Domain-Adapted (Level 2)
+- "quarterly reviews" → generic cadence, not a specific date
+- "approximately 20 departments" → soft number, hedged in both directions
+- "We have not experienced any material cybersecurity incidents" → negative self-assertion, NOT externally verifiable (you cannot independently confirm the absence of something)
+- "In 2023, we did not experience a material incident" → a year does not make a negative assertion verifiable — this is still a self-reported claim about absence
+- "As of December 28, 2024, no material incidents have occurred" → reporting-anchored negative assertion, same problem
+- "During the last three years, no material breaches" → soft timeframe + negative assertion
+
+CERTIFICATION TRILOGY (illustrates all three levels):
+"Aligned with ISO 27001" → Domain-Adapted (references a standard, no firm-specific claim)
+"Working toward ISO 27001 certification" → Firm-Specific (firm-specific intent, not yet verifiable)
+"Maintain ISO 27001 certification" → Quantified-Verifiable (externally verifiable claim)
+
```
|
VALIDATION STEP — before finalizing specificity:
|
||||||
|
1. Review your specific_facts list. Each fact should survive the commonly-confused filter for its level (e.g., "Audit Committee" is NOT firm-specific, "CISO" IS firm-specific). Remove any that don't qualify.
|
||||||
|
2. Your specificity level = highest surviving fact. If all facts were removed → Level 1.
|
||||||
|
3. Check consistency: does your specificity level match your specific_facts? If you listed a firm-specific fact, you cannot assign below Level 3. If you listed a verifiable fact, you cannot assign below Level 4. The list is your commitment.
|
||||||
|
|
||||||
═══ OUTPUT ═══

Return JSON with:
- content_category
- specific_facts: List every fact that elevates specificity above Level 1. For Level 2, list the domain terms that pass the ERM test. For Level 3, list the firm-specific facts. For Level 4, list the verifiable claims. Empty array ONLY for Level 1. If specific_facts is empty, specificity MUST be Generic Boilerplate.
- specificity: Must equal the highest level supported by your specific_facts list.
- category_confidence, specificity_confidence
- reasoning: 1-2 sentences covering BOTH category choice AND what drove your specificity level.`;
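The JSON contract above maps naturally onto a response type on the parsing side. A minimal sketch, assuming the numeric 1–4 level encoding used elsewhere in this file; the field names follow the prompt, while `Specificity`, `AnnotationResponse`, and `isConsistent` are illustrative names, not the project's actual parser:

```typescript
// Sketch of the Stage 1 annotation response (field names mirror the prompt's
// "Return JSON with" contract; the numeric enum encoding is an assumption).
type Specificity = 1 | 2 | 3 | 4; // 1 = Generic Boilerplate ... 4 = Quantified-Verifiable

interface AnnotationResponse {
  content_category: string;     // e.g. "Risk Management Process"
  specific_facts: string[];     // empty ONLY for Level 1
  specificity: Specificity;     // must equal the highest level the facts support
  category_confidence: number;
  specificity_confidence: number;
  reasoning: string;            // 1-2 sentences
}

// The self-check the prompt asks for: an empty facts list forces Level 1,
// and any listed fact forbids Level 1.
function isConsistent(r: AnnotationResponse): boolean {
  return r.specific_facts.length === 0 ? r.specificity === 1 : r.specificity >= 2;
}
```

A parser could reject responses where `isConsistent` fails and retry, rather than accept a label that contradicts its own fact list.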
/** Build the user prompt for a single paragraph annotation. */
export function buildUserPrompt(paragraph: Paragraph): string {

@@ -129,99 +144,93 @@ ${text}`;

// ── Category confusion-axis disambiguation rules ──────────────────────────
// Keyed by sorted pair of disputed categories. Only included when relevant.
const CATEGORY_GUIDANCE: Record<string, string> = {
  "Management Role|Risk Management Process": `MANAGEMENT ROLE vs RISK MANAGEMENT PROCESS — apply the person-removal test:
Remove all person-specific content (names, titles, qualifications, experience, reporting lines, team composition, committee membership). Is the remaining text a coherent cybersecurity disclosure?
YES → Risk Management Process (the process stands alone; people are incidental).
NO → Management Role (the paragraph is fundamentally about who these people are).
MR is about ROLES: who is responsible, how responsibilities are divided, what qualifies them, how management-level oversight is structured.
RMP is about ACTIVITIES: what the program does, how it operates, what tools and frameworks it uses.
CRITICAL: A person being the grammatical subject does NOT automatically mean Management Role. "Our CISO oversees a program that includes penetration testing and vulnerability scanning" → remove CISO → program description stands alone → RMP.
Examples:
• "Our CISO has 20 years of experience and holds CISSP certification. She reports to the CIO." → MR (remove people → nothing left; has qualifications)
• "Our cybersecurity program includes risk assessment and monitoring, overseen by our CISO." → RMP (remove CISO → program description stands alone)
• "Our CFO and VP of IT jointly oversee our cybersecurity program. The CFO handles risk governance, the VP of IT manages technical operations." → MR (role allocation and responsibilities are the substance)`,
  "Risk Management Process|Third-Party Risk": `RISK MANAGEMENT PROCESS vs THIRD-PARTY RISK — ask: is vendor/supplier oversight the CENTRAL topic?
• "We use third-party consultants for penetration testing" = RMP (third parties support an internal process).
• "We maintain a vendor oversight program with due diligence and monitoring of third-party controls" = Third-Party Risk (vendor oversight IS the topic).
• Third parties hired to serve the company → RMP. Requirements imposed ON vendors → TP.`,
  "None/Other|Strategy Integration": `NONE/OTHER vs STRATEGY INTEGRATION — the materiality ASSESSMENT test:
IS a materiality assessment → Strategy Integration:
• Backward-looking: "have not materially affected our business strategy, results of operations, or financial condition" → SI
• Forward-looking with SEC qualifier: "reasonably likely to materially affect" → SI (Item 106(b)(2) language)
• Negative assertion: "we have not experienced any material cybersecurity incidents" → SI
• Insurance, budget, investment: "we expend considerable resources on cybersecurity", cyber insurance, cost allocation → SI
NOT a materiality assessment → classify by primary purpose (usually N/O or RMP):
• Generic risk warning: "could have a material adverse effect on our business" → boilerplate speculation → N/O or RMP
• "Material" as adjective: "managing material risks" → NOT SI
• Consequence clause: SPECULATIVE materiality ("could have a material adverse effect") at END of paragraph does not override primary purpose. BUT a factual negative assertion ("we have not experienced any material cybersecurity incidents") IS an assessment even at the end.
• Cross-references: "For risks that may materially affect us, see Item 1A" → N/O
KEY DISTINCTION: "Risks have not materially affected us" = SI (CONCLUSION). "Risks could have a material adverse effect" = N/O (SPECULATION). "Reasonably likely to materially affect" = SI (SEC-qualified forward-looking assessment).`,
  "Board Governance|Management Role": `BOARD GOVERNANCE vs MANAGEMENT ROLE — ask: "What question does this paragraph primarily answer?"
• "How does the board oversee cybersecurity?" → Board Governance
• "How is management organized to handle cybersecurity?" → Management Role
Governance-chain paragraphs (Board → Committee → Officer → Program) default to BG when the purpose is describing oversight structure.
MR only when management's organizational role — responsibilities, qualifications, committee membership — is the primary content.
• Board/committee oversees, receives reports, delegates → BG
• Management reports TO the board (describing oversight flow) → BG
• Management roles, responsibilities, how they're divided → MR
• Person's qualifications, credentials, experience → MR
• Board-level committee (Audit Committee, Risk Committee of the Board) → BG
• Management-level committee (Cybersecurity Steering Committee) → MR if about structure/membership, RMP if about activities`,
  "Board Governance|Risk Management Process": `BOARD GOVERNANCE vs RISK MANAGEMENT PROCESS — oversight or operations?
• Board/committee receiving reports, overseeing risk, setting policy → Board Governance.
• Company describing HOW it assesses, monitors, mitigates risks → Risk Management Process.
• "The board receives quarterly cybersecurity briefings" → BG. "We conduct quarterly risk assessments; the board is informed" → RMP (process is primary).`,
  "None/Other|Risk Management Process": `NONE/OTHER vs RISK MANAGEMENT PROCESS — does the paragraph describe actual cybersecurity activities?
• Describing actual processes, measures, or controls → RMP. Even if surrounded by risk-factor framing.
• Only stating the company has IT systems, faces risks, or enumerating threat types — without describing what it DOES → N/O.
• Generic regulatory compliance ("subject to various regulations") with no activity described → N/O. Named regulation + compliance activities → RMP.`,
  "Risk Management Process|Strategy Integration": `RISK MANAGEMENT PROCESS vs STRATEGY INTEGRATION — operational or strategic?
• HOW risks are assessed, monitored, mitigated → RMP.
• Materiality conclusions, financial impact, insurance, budget → SI.
• ERM integration described as a process = RMP. Materiality conclusions from the process = SI.`,
};
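Lookup into this map mirrors how the keys are built: the two disputed categories are sorted alphabetically and joined with "|", so argument order never matters. A standalone sketch of the key helper (the map itself is the CATEGORY_GUIDANCE constant above; `guidanceKey` is an illustrative name):

```typescript
// Build the lookup key for a pair of disputed categories. Sorting makes the
// key order-independent, matching how CATEGORY_GUIDANCE is keyed.
function guidanceKey(a: string, b: string): string {
  return [a, b].sort().join("|");
}
```

For example, `guidanceKey("Risk Management Process", "Management Role")` yields `"Management Role|Risk Management Process"`, the same key either way around.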
// ── Specificity disambiguation rules ──────────────────────────────────────

const SPECIFICITY_GUIDANCE_1_VS_2 = `SPECIFICITY: GENERIC BOILERPLATE (1) vs DOMAIN-ADAPTED (2)
Apply the ERM test: does the paragraph use vocabulary that a non-security enterprise risk management professional would NOT naturally use? If yes → the paragraph demonstrates cybersecurity domain knowledge → Domain-Adapted. This is about vocabulary level, not individual word presence — a boilerplate materiality disclaimer mentioning "cyberattacks" is still Level 1 if it demonstrates no domain knowledge.
Level 2 examples: "penetration testing and vulnerability scanning" (security-specific practices), "defense in depth strategy" (architectural concept), "SIEM and EDR tools" (security infrastructure).
Level 1 examples: "processes to identify, assess, and manage risks" (generic ERM), "incident response plan" (used in all risk domains), "dedicated cybersecurity team" (organizational, not technical).`;
const SPECIFICITY_GUIDANCE_1_VS_3 = `SPECIFICITY: GENERIC BOILERPLATE (1) vs FIRM-SPECIFIC (3)
Firm-Specific requires at least one fact that distinguishes THIS company from any other. Common false positives:
✗ Generic governance (Board, Audit Committee, management, CEO, CFO) — exist at every company
✗ IT titles below VP without a security qualifier ("Head of IT", "IT Manager", "Director of IT") — too generic. VP/SVP of IT IS firm-specific, Director of IT Security IS firm-specific, but plain "Director of IT" is NOT.
✗ Generic program names ("incident response plan", "cybersecurity program") — unless they have a distinguishing identifier ("CIRP")
✗ Cybersecurity practices (pen testing, vuln scanning) — these are Domain-Adapted (Level 2), not Firm-Specific
If all candidate facts are disqualified, check for domain terminology (→ Level 2) before defaulting to Level 1.`;
const SPECIFICITY_GUIDANCE_2_VS_3 = `SPECIFICITY: DOMAIN-ADAPTED (2) vs FIRM-SPECIFIC (3)
Domain-Adapted demonstrates cybersecurity knowledge but nothing unique to THIS company. Firm-Specific has at least one fact that identifies something about THIS company's specific posture.
"We conduct penetration testing aligned with NIST CSF" → Level 2 (domain terms + standard, but any company could say this)
"Our CISO oversees penetration testing" → Level 3 (CISO = cybersecurity leadership role unique to this company's structure)
The test: could you paste this sentence into any company's filing unchanged? If yes → Level 2. If it reveals something about THIS company → Level 3.`;
const SPECIFICITY_GUIDANCE_3_VS_4 = `SPECIFICITY: FIRM-SPECIFIC (3) vs QUANTIFIED-VERIFIABLE (4)
Quantified-Verifiable requires at least one fact an external party could independently confirm. The test: could someone outside the company verify this claim?
QV: specific numbers ($2M, 12 professionals, 17 years), specific dates (month+year tied to a cybersecurity event), named external entities (Mandiant, CrowdStrike, Splunk), certifications held (CISSP, CISM, ISO 27001 certification).
NOT QV: named roles alone (CISO → Firm-Specific, not quantified), named committees (organizational, not verifiable), standards followed but not certified ("aligned with NIST" → Level 2), fiscal year dates as reporting context ("year ended December 31" → not a cybersecurity-specific date).`;
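The HARD-number cues above are mechanical enough to pre-screen with regular expressions, in the spirit of the heuristic QV indicator scan described for holdout sampling. A sketch under stated assumptions: these patterns and the `hasQvIndicator` helper are illustrative, not the project's actual scanner, and a regex hit still needs the codebook's judgment calls (e.g. negative assertions never count as QV):

```typescript
// Illustrative pre-screen for Quantified-Verifiable cues (assumed patterns,
// not the project's real QV scanner).
const QV_PATTERNS: RegExp[] = [
  /\$\d[\d,.]*\s*(million|billion|M|B)?/i,         // dollar amounts: "$100M", "$2 million"
  /\b\d+\s+(security\s+)?professionals?\b/i,       // headcounts: "12 security professionals"
  /\b(more than|over|at least)?\s*\d+\s+years\b/i, // experience: "17 years", "over 20 years"
  /\b(January|February|March|April|May|June|July|August|September|October|November|December)\s+\d{1,2},?\s+\d{4}\b/, // specific dates
  /\b(CISSP|CISM|ISO\s*27001)\b/,                  // named certifications
];

function hasQvIndicator(text: string): boolean {
  return QV_PATTERNS.some((p) => p.test(text));
}
```

Soft quantities ("several years", "approximately 20 departments") deliberately fail every pattern, matching the HARD/SOFT split above.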
/**
 * Build a re-evaluation prompt for paragraphs flagged for codebook-correction review.
 * Used when unanimous Stage 1 labels may be wrong due to prompt version drift
 * (e.g., v2.5 lacked the materiality→SI rule, so N/O labels on paragraphs with
 * materiality language need re-evaluation under v3.5 rules).
 *
 * Unlike the judge prompt, this does NOT show prior annotations (to avoid anchoring
 * to the potentially-wrong unanimous label). Instead, it provides the specific
 * codebook rule that triggered the re-evaluation and asks for a fresh classification.
 */
export function buildReEvalPrompt(
  paragraph: Paragraph,

@@ -235,33 +244,30 @@ export function buildReEvalPrompt(
This paragraph was previously labeled None/Other. It has been flagged for re-evaluation because it contains materiality-related language.

CODEBOOK RULE (v4.0): Materiality ASSESSMENTS are Strategy Integration. An assessment is the company STATING A CONCLUSION about materiality:
• Backward-looking: "have not materially affected our business strategy, results of operations, or financial condition" → SI
• Forward-looking with SEC qualifier: "reasonably likely to materially affect" → SI
• Negative assertion: "we have not experienced any material cybersecurity incidents" → SI

The following are NOT materiality assessments and do NOT trigger SI:
• Generic risk warning: "could have a material adverse effect on our business" → NOT SI (boilerplate speculation)
• "Material" as adjective: "managing material risks" → NOT SI
• Cross-references: "For risks that may materially affect us, see Item 1A" → N/O`;
  } else if (reason === "spac") {
    ruleBlock = `═══ RULE UNDER REVIEW ═══

This paragraph was flagged for re-evaluation because it may be from a SPAC or shell company.

CODEBOOK RULE (v4.0): SPACs and shell companies explicitly stating they have no operations, no cybersecurity program, or no formal processes → None/Other, regardless of incidental mentions of board oversight or risk acknowledgment.`;
  } else {
    ruleBlock = `═══ RE-EVALUATION ═══

This paragraph has been flagged for fresh classification under codebook v4.0 rules. Classify it fresh — do not assume any prior label is correct.`;
  }
  return `═══ RE-EVALUATION TASK ═══

You are re-classifying this paragraph under updated codebook rules (v4.0). Classify it fresh — do not assume any prior label is correct.

${ruleBlock}

@@ -318,7 +324,7 @@ export function buildJudgePrompt(
  const specCounts = new Map<number, number>();
  for (const s of specs) specCounts.set(s, (specCounts.get(s) ?? 0) + 1);
  const specLabels: Record<number, string> = { 1: "Generic Boilerplate", 2: "Domain-Adapted", 3: "Firm-Specific", 4: "Quantified-Verifiable" };
  const specDistStr = [...specCounts.entries()]
    .sort(([, a], [, b]) => b - a)
    .map(([s, n]) => `${specLabels[s] ?? s} (${n})`)

@@ -337,7 +343,6 @@ export function buildJudgePrompt(
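With three annotators voting on specificity, the tally above renders a distribution string like the following. A standalone re-run of the same snippet (the surrounding buildJudgePrompt plumbing is omitted, and the trailing `.join` is assumed since the original continuation is elided by the diff):

```typescript
// Example: three annotators vote 3, 3, 4 on specificity.
const specs = [3, 3, 4];
const specCounts = new Map<number, number>();
for (const s of specs) specCounts.set(s, (specCounts.get(s) ?? 0) + 1);
const specLabels: Record<number, string> = { 1: "Generic Boilerplate", 2: "Domain-Adapted", 3: "Firm-Specific", 4: "Quantified-Verifiable" };
// Sort levels by vote count (descending) and render "Label (count)" pairs.
const specDistStr = [...specCounts.entries()]
  .sort(([, a], [, b]) => b - a)
  .map(([s, n]) => `${specLabels[s] ?? s} (${n})`)
  .join(", ");
// specDistStr → "Firm-Specific (2), Quantified-Verifiable (1)"
```

Showing the judge the vote distribution rather than raw per-annotator labels keeps the prompt compact while preserving how contested the paragraph is.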
  const guidanceBlocks: string[] = [];

  if (catDisagree) {
    for (let i = 0; i < uniqueCats.length; i++) {
      for (let j = i + 1; j < uniqueCats.length; j++) {
        const key = [uniqueCats[i], uniqueCats[j]].sort().join("|");

@@ -350,9 +355,10 @@ export function buildJudgePrompt(
  if (specDisagree) {
    const minSpec = Math.min(...specs);
    const maxSpec = Math.max(...specs);
    if (minSpec <= 1 && maxSpec >= 2) guidanceBlocks.push(SPECIFICITY_GUIDANCE_1_VS_2);
    if (minSpec <= 1 && maxSpec >= 3) guidanceBlocks.push(SPECIFICITY_GUIDANCE_1_VS_3);
    if (minSpec <= 2 && maxSpec >= 3) guidanceBlocks.push(SPECIFICITY_GUIDANCE_2_VS_3);
    if (minSpec <= 3 && maxSpec >= 4) guidanceBlocks.push(SPECIFICITY_GUIDANCE_3_VS_4);
  }

  const guidanceSection =

@@ -387,8 +393,8 @@ ${guidanceSection}
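The min/max gating above selects every guidance block whose two levels bracket the disagreement, so a wide split pulls in several blocks at once. A standalone sketch of the same gating with the guidance texts stubbed by name (`selectSpecGuidance` is an illustrative helper, not a function in the file):

```typescript
// Standalone sketch of the disagreement gating; real code pushes the full
// SPECIFICITY_GUIDANCE_* strings, stubbed here with short names.
function selectSpecGuidance(specs: number[]): string[] {
  const blocks: string[] = [];
  const minSpec = Math.min(...specs);
  const maxSpec = Math.max(...specs);
  if (minSpec <= 1 && maxSpec >= 2) blocks.push("1_VS_2");
  if (minSpec <= 1 && maxSpec >= 3) blocks.push("1_VS_3");
  if (minSpec <= 2 && maxSpec >= 3) blocks.push("2_VS_3");
  if (minSpec <= 3 && maxSpec >= 4) blocks.push("3_VS_4");
  return blocks;
}
// selectSpecGuidance([1, 3]) → ["1_VS_2", "1_VS_3", "2_VS_3"]
// selectSpecGuidance([3, 4]) → ["3_VS_4"]
```

A 1-vs-3 split thus surfaces all three lower-level rules, while an adjacent 3-vs-4 split surfaces only the one rule that separates those levels.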
Before providing your label:
1. CRITIQUE each annotator's reasoning against the codebook. Which one(s) applied the rules correctly? Which made identifiable errors?
2. For specificity: independently assess the paragraph's vocabulary level (domain knowledge?), identify any facts unique to this company, and check whether any facts are externally verifiable. Apply the principles, not just list-matching.
3. Determine the dominant communicative purpose for category using the "what question?" test.
4. Provide your final label. In your reasoning, explain which annotator(s) you agree with and what specific error(s) the others made.

═══ CONFIDENCE CALIBRATION ═══