pivot point

parent 26367a8e86
commit d653ed9a20
@@ -1,6 +1,6 @@
 outs:
-- md5: d64ad0c8040d75230a3013c4751910eb.dir
-  size: 740635168
-  nfiles: 174
+- md5: 4ad135e50584bca430b79307e8bd1050.dir
+  size: 741469715
+  nfiles: 194
   hash: md5
   path: .dvc-store

3 .gitignore vendored

@@ -55,3 +55,6 @@ report.[0-9]_.[0-9]_.[0-9]_.[0-9]_.json
 .DS_Store
 python/*.whl
 /.dvc-store
+
+# Personal notes
+docs/STRATEGY-NOTES.md

@@ -123,7 +123,7 @@ Each paragraph is assigned exactly **one** content category. If a paragraph span
 - **Covers:** Material impact (or lack thereof) on business strategy or financials, cybersecurity insurance, investment/resource allocation, cost of incidents
 - **Key markers:** "business strategy," "insurance," "investment," "material," "financial condition," "budget," "not materially affected," "results of operations"
 - **Assign when:** The paragraph primarily discusses business/financial consequences or strategic response to cyber risk, not the risk management activities themselves
-- **Includes materiality disclaimers:** Any paragraph that explicitly assesses whether cybersecurity risks have or could "materially affect" the company's business, strategy, financial condition, or results of operations is Strategy Integration — even if the assessment is boilerplate. The company is making a strategic judgment about cyber risk impact, which is the essence of this category. A cross-reference to Risk Factors appended to a materiality assessment does not change the classification.
+- **Includes materiality ASSESSMENTS:** A materiality assessment is the company stating a conclusion about whether cybersecurity has or will affect business outcomes. Backward-looking ("have not materially affected"), forward-looking with SEC qualifier ("reasonably likely to materially affect"), and negative assertions ("have not experienced material incidents") are all assessments → SI. Generic risk warnings ("could have a material adverse effect") are NOT assessments — they are boilerplate speculation that appears in every 10-K → classify by primary content. "Material" as an adjective ("managing material risks") is also not an assessment.

 **Example texts:**

@@ -170,19 +170,49 @@ Each paragraph is assigned exactly **one** content category. If a paragraph span
 ### Rule 1: Dominant Category
 If a paragraph spans multiple categories, assign the one whose topic occupies the most text or is the paragraph's primary communicative purpose.

-### Rule 2: Board vs. Management
+### Rule 2: Board vs. Management (the board-line test)
+
+**Core principle:** The governance hierarchy has distinct layers — board/committee oversight at the top, management execution below. The paragraph's category depends on which layer is the primary focus.
+
+| Layer | Category | Key signals |
+|-------|----------|-------------|
+| Board/committee directing, receiving reports, or overseeing | Board Governance | "Board oversees," "Committee reviews," "reports to the Board" (board is recipient) |
+| Named officer's qualifications, responsibilities, reporting lines | Management Role | "CISO has 20 years experience," "responsible for," credentials |
+| Program/framework/controls described | Risk Management Process | "program is designed to," "framework includes," "controls aligned with" |
+
+**When a paragraph spans layers** (governance chain paragraphs): apply the **purpose test** — what is the paragraph's communicative purpose?
+
+- **Purpose = describing oversight/reporting structure** (who reports to whom, briefing cadence, committee responsibilities, how information flows to the board) → **Board Governance**, even if officers appear as grammatical subjects. The officers are intermediaries in the governance chain, not the focus.
+- **Purpose = describing who a person is** (qualifications, credentials, experience, career history) → **Management Role**.
+- **Governance-chain paragraphs are almost always Board Governance.** They become Management Role ONLY when the officer's personal qualifications/credentials are the dominant content.
+
 | Signal | Category |
 |--------|----------|
 | Board/committee is the grammatical subject | Board Governance |
 | Board delegates responsibility to management | Board Governance |
-| Management role reports TO the board | Management Role |
+| Management role reports TO the board (describing reporting structure) | Board Governance (the purpose is describing how oversight works) |
-| Management role's qualifications are described | Management Role |
+| Management role's qualifications, experience, credentials described | Management Role |
-| "Board oversees... CISO reports to Board quarterly" | Board Governance (board is primary actor) |
+| "Board oversees... CISO reports to Board quarterly" | Board Governance (oversight structure) |
-| "CISO reports quarterly to the Board on..." | Management Role (CISO is primary actor) |
+| "CISO reports quarterly to the Board on..." | Board Governance (reporting structure, not about who the CISO is) |
+| "The CISO has 20 years of experience and reports to the CIO" | Management Role (person's qualifications are the content) |
+| Governance overview spanning board → committee → officer → program | **Board Governance** (purpose is describing the structure) |

-### Rule 2b: Management Role vs. Risk Management Process (the person-vs-function test)
+### Rule 2b: Management Role vs. Risk Management Process (three-step decision chain)

-This is the single most common source of annotator disagreement. The line is: **is the paragraph about the person or about the function?**
+This is the single most common source of annotator disagreement. Apply the following tests in order — stop at the first decisive result (a sketch of the chain follows the steps).
+
+**Step 1 — Subject test:** What is the paragraph's grammatical subject?
+
+- Clear process/framework/program as subject with no person detail → **Risk Management Process**. Stop.
+- Person/role as subject → this is a **signal**, not decisive. Always continue to Step 2. Many SEC disclosures name an officer then describe the program — Step 2 determines which is the actual content.
+
+**Step 2 — Person-removal test:** Could you delete all named roles, titles, qualifications, experience descriptions, and credentials from the paragraph and still have a coherent cybersecurity disclosure?
+
+- **YES** → **Risk Management Process** (the process stands on its own; people are incidental)
+- **NO** → **Management Role** (the paragraph is fundamentally about who these people are)
+- Borderline → continue to Step 3
+
+**Step 3 — Qualifications tiebreaker:** Does the paragraph include experience (years), certifications (CISSP, CISM), education, team size, or career history for named individuals?
+
+- **YES** → **Management Role** (qualifications are MR-specific content; the SEC requires management role disclosure specifically because investors want to know WHO is responsible)
+- **NO** → **Risk Management Process** (no person-specific content beyond a title attribution)
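A minimal sketch of the chain in Python — the three inputs are hypothetical stand-ins for the per-paragraph judgments the steps describe, not part of the codebook itself:

```python
from typing import Optional

# Hypothetical sketch of the Rule 2b chain; the three flags stand in for
# judgments a human annotator (or model) makes about the paragraph.
def rule_2b(subject_is_process: bool,
            person_removal_coherent: Optional[bool],
            has_qualifications: bool) -> str:
    # Step 1 (subject test): only decisive when a process/framework is the
    # subject with no person detail. A person as subject is just a signal.
    if subject_is_process:
        return "Risk Management Process"
    # Step 2 (person-removal test): None encodes "borderline".
    if person_removal_coherent is True:
        return "Risk Management Process"
    if person_removal_coherent is False:
        return "Management Role"
    # Step 3 (qualifications tiebreaker).
    return "Management Role" if has_qualifications else "Risk Management Process"
```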

 | Signal | Category |
 |--------|----------|
@@ -216,8 +246,27 @@ Assign None/Other ONLY when the paragraph contains no substantive cybersecurity
 **Exception — SPACs and no-operations companies:** A paragraph that explicitly states the company has no cybersecurity program, no operations, or no formal processes is None/Other even if it perfunctorily mentions board oversight or risk acknowledgment. The absence of a program is not substantive disclosure.

-### Rule 6: Materiality Disclaimers → Strategy Integration
-Any paragraph that explicitly assesses whether cybersecurity risks or incidents have "materially affected" (or are "reasonably likely to materially affect") the company's business strategy, results of operations, or financial condition is **Strategy Integration** — even when the assessment is boilerplate. The materiality assessment is the substantive content. A cross-reference to Risk Factors appended to a materiality assessment does not change the classification to None/Other. Only a *pure* cross-reference with no materiality conclusion is None/Other.
+### Rule 6: Materiality Language → Strategy Integration
+Any paragraph that explicitly connects cybersecurity to business materiality is **Strategy Integration** — regardless of tense, mood, or how generic the language is. This includes:
+
+- **Backward-looking assessments:** "have not materially affected our business strategy, results of operations, or financial condition"
+- **Forward-looking assessments with SEC qualifier:** "are reasonably likely to materially affect," "if realized, are reasonably likely to materially affect"
+- **Negative assertions with materiality framing:** "we have not experienced any material cybersecurity incidents"
+
+**The test:** Is the company STATING A CONCLUSION about materiality?
+
+- "Risks have not materially affected our business strategy" → YES, conclusion → SI
+- "Risks are reasonably likely to materially affect us" → YES, forward-looking conclusion → SI
+- "Risks could have a material adverse effect on our business" → NO, speculation → not SI (classify by primary content)
+- "Managing material risks associated with cybersecurity" → NO, adjective → not SI
+
+The key phrase is "reasonably likely" — that's the SEC's Item 106(b)(2) threshold for forward-looking materiality. Bare "could" is speculation, not an assessment.
+
+**Why this is SI and not N/O:** The company is fulfilling its SEC Item 106(b)(2) obligation to assess whether cyber risks affect business strategy. The fact that the language is generic makes it Specificity 1, not None/Other. Category captures WHAT the paragraph discloses (a materiality assessment); specificity captures HOW specific that disclosure is (generic boilerplate = Spec 1).
+
+**What remains N/O:** A cross-reference is N/O even if it contains materiality language — "For a description of the risks from cybersecurity threats that may materially affect the Company, see Item 1A" is N/O because the paragraph's purpose is pointing the reader elsewhere, not making an assessment. The word "materially" here describes what Item 1A discusses, not the company's own conclusion. Also N/O: generic IT-dependence language ("our IT systems are important to operations") with no materiality claim, and forward-looking boilerplate about risks generally without invoking materiality ("we face various risks").
+
+**The distinction:** "Risks that may materially affect us — see Item 1A" = N/O (cross-reference). "Risks have not materially affected us. See Item 1A" = SI (the first sentence IS an assessment). The test is whether the company is MAKING a materiality conclusion vs DESCRIBING what another section covers.

 ---

@@ -271,7 +320,26 @@ No materiality assessment. Pure cross-reference. → **None/Other, Specificity 1
 Despite touching RMP (no program), Board Governance (board is responsible), and Strategy Integration (no incidents), the paragraph contains no substantive disclosure. The company explicitly has no program, and the board mention is perfunctory ("generally responsible... if any"). The absence of a program is not a program description. → **None/Other, Specificity 1.**

-### Case 9: Generic regulatory compliance language
+### Case 9: Materiality language — assessment vs. speculation (v3.5 revision)
+
+> *"We face risks from cybersecurity threats that, if realized and material, are reasonably likely to materially affect us, including our operations, business strategy, results of operations, or financial condition."*
+
+The phrase "reasonably likely to materially affect" is the SEC's Item 106(b)(2) qualifier — this is a forward-looking materiality **assessment**, not speculation. → **Strategy Integration, Specificity 1.**
+
+> *"We have not identified any risks from cybersecurity threats that have materially affected or are reasonably likely to materially affect the Company."*
+
+Backward-looking negative assertion + SEC-qualified forward-looking assessment. → **Strategy Integration, Specificity 1.**
+
+> *"Information systems can be vulnerable to a range of cybersecurity threats that could potentially have a material impact on our business strategy, results of operations and financial condition."*
+
+Despite mentioning "material impact" and "business strategy," the operative verb is "could" — this is boilerplate **speculation** present in virtually every 10-K risk factor section. The company is not stating a conclusion about whether cybersecurity HAS or IS REASONABLY LIKELY TO affect them; it is describing a hypothetical. → **None/Other, Specificity 1.** (Per Rule 6: "could have a material adverse effect" = speculation, not assessment.)
+
+> *"We face various risks related to our IT systems."*
+
+No materiality language, no connection to business strategy/financial condition. This is generic IT-dependence language. → **None/Other, Specificity 1.**
+
+**The distinction:** "reasonably likely to materially affect" (SEC qualifier, forward-looking assessment) ≠ "could potentially have a material impact" (speculation). The former uses the SEC's required assessment language; the latter uses conditional language that every company uses regardless of actual risk.
+
+### Case 10: Generic regulatory compliance language
 > *"Regulatory Compliance: The Company is subject to various regulatory requirements related to cybersecurity, data protection, and privacy. Non-compliance with these regulations could result in financial penalties, legal liabilities, and reputational damage."*

 This acknowledges that regulations exist and non-compliance would be bad — a truism for every public company. It does not describe any process, program, or framework the company uses to comply. It does not make a materiality assessment. It names no specific regulation. → **None/Other, Specificity 1.**
@@ -605,6 +673,7 @@ Track prompt changes so we can attribute label quality to specific prompt versio
 | v2.6 | 2026-03-28 | 500 | Changed category defs to TEST: format. REGRESSED (Both 67.8%). |
 | v2.7 | 2026-03-28 | 500 | Added COMMON MISTAKES section. 100% consensus but Both 67.6%. |
 | v3.0 | 2026-03-29 | — | **Codebook overhaul.** Three rulings: (A) materiality disclaimers → Strategy Integration, (B) SPACs/no-ops → None/Other, (C) person-vs-function test for Mgmt Role vs RMP. Added full IS/NOT lists and QV-eligible list to codebook. Added Rule 2b, Rule 6, 4 new borderline cases. Prompt update pending. |
+| v3.5 | 2026-04-02 | 26 | **Post-gold-analysis rulings, 6 iteration rounds on 26 regression paragraphs ($1.02).** Driven by 13-signal cross-analysis + targeted prompt iteration. (A) Rule 6 refined: materiality ASSESSMENTS → SI (backward-looking conclusions + "reasonably likely" forward-looking). Generic "could have a material adverse effect" is NOT an assessment — it stays N/O/RMP. Cross-references with materiality language also stay N/O. (B) Rule 2 expanded: purpose test for BG — governance structure descriptions are BG, but a one-sentence committee mention doesn't flip the category. (C) Rule 2b expanded: three-step MR↔RMP decision chain; Step 1 only decisive for RMP (process is subject), never short-circuits to MR. (D) N/O vs RMP clarified: actual measures implemented = RMP even in risk-factor framing. Result: +4pp on 26 hardest paragraphs vs v3.0 (18→22/26). |

 When the prompt changes (after pilot testing, rubric revision, etc.), bump the version and log what changed. Every annotation record carries `promptVersion` so we can filter/compare.

@@ -1106,6 +1106,137 @@ Key risk: the stratified holdout over-samples hard cases, depressing F1 vs a ran
 ---

+## Phase 15: Codebook v3.5 — The Prompt Drift Discovery
+
+### The Problem
+
+Cross-analysis of human vs GenAI labels on the holdout revealed a systematic, directional disagreement on three axes:
+
+1. **SI↔N/O (23:0 asymmetry):** When humans and GenAI disagreed on this axis, humans ALWAYS called it SI and GenAI called it N/O. Never the reverse. Root cause: the labelapp trained humans that any language connecting cybersecurity to business materiality — even forward-looking ("could materially affect") — is SI at Specificity 1. Stage 1 models (v2.5 prompt) lacked this rule entirely. Even v3.0 benchmark models, which had the backward-looking materiality rule, were conservative about forward-looking variants.
+
+2. **MR↔RMP (253 paragraphs, 38:13 asymmetry):** GenAI systematically calls MR paragraphs RMP. The v3.0 "person-vs-function test" helps but leaves genuinely mixed paragraphs (both person and process as grammatical subjects) unresolved. These near-even splits need a deterministic tiebreaker chain.
+
+3. **BG↔MR (149 paragraphs, 33:6 asymmetry):** GenAI systematically under-calls BG. The problem is governance chain paragraphs that describe the board receiving reports from management — is this about the board's oversight function or the officer's reporting duty?
+
+### The Audit
+
+A Stage 1 audit found ~1,076 paragraphs (649 unanimous + 383 majority N/O) with materiality language that should be SI under the broadened rule. 1.3% of the corpus overall — but potentially concentrated on exactly the boundary cases the holdout over-samples. On the holdout, mimo-v2-flash was actually the most accurate Stage 1 model on this axis, dissenting toward SI 263 times when the other two said N/O.
+
+The MR↔RMP and BG↔MR axes are cleaner in Stage 1 unanimity — only 0.2% of unanimous BG labels are problematic, and the MR/RMP tiebreaker mainly affects disputed labels (already going to Stage 2). The v2.5→v3.5 gap is primarily an SI↔N/O problem.
+
+### Initial v3.5 Rulings (Round 1)
+
+Three rulings, all driven by the 13-signal cross-analysis:
+
+**Rule 6 broadened (SI↔N/O):** ALL materiality language → SI, not just backward-looking disclaimers. Forward-looking ("could materially affect"), conditional ("reasonably likely to"), and negative assertions ("have not experienced material incidents") are all Strategy Integration at Specificity 1.
+
+**Rule 2 expanded (BG↔MR):** Added the board-line test with governance hierarchy layers and a dominant-subject test for cross-layer paragraphs.
+
+**Rule 2b expanded (MR↔RMP):** Three-step decision chain: subject test → person-removal test → qualifications tiebreaker.
+
+These rulings were tested by re-running all 7 benchmark models (6 OpenRouter + Opus) on 359 confusion-axis holdout paragraphs with the v3.5 prompt ($18, stored separately from v3.0 data).
+
+### The Prompt Drift Lesson
+
+Running Stage 1 (150K annotations) before human labeling created a subtle but significant problem: the codebook evolved through v2.5 → v3.0 → v3.5, but the training data is frozen at v2.5. Each codebook revision was driven by empirical analysis of disagreement patterns — which required the Stage 1 data AND human labels to exist first. The dependency is circular: you can't know what rules are needed until you see where annotators disagree, but you can't undo the labels already collected.
+
+### Iteration: 6 Rounds on 26 Regression Paragraphs ($1.02)
+
+The initial v3.5 re-run revealed that the rulings over-corrected. We identified 26 "regression" paragraphs — cases where v3.0 matched human majority but v3.5 did not — and iterated the prompt using GPT-5.4 on these 26 paragraphs ($0.17/round) to diagnose and fix each over-correction.
+
+**Round 1 (v3.5a) — 5/26.** Catastrophic. All three rulings over-fired simultaneously. SI was called on every paragraph with the word "material." BG was called whenever a committee was named. MR was called whenever a person was a grammatical subject. The rulings were correct in intent but models interpreted them too aggressively.
+
+**Round 2 (v3.5b) — 13/25.** Three fixes: (A) Replaced the BG "dominant-subject test" with a "purpose test" — if the paragraph describes oversight structure, it's BG; mere committee mentions don't flip the category. (B) Made MR↔RMP Step 1 non-decisive — a person being the grammatical subject is a signal, not a conclusion; always proceed to Step 2 (person-removal test). (C) Added cross-reference exception for SI. Improvement: +8.
+
+**Round 3 (v3.5c) — 20/26.** The cross-reference exception eliminated the 5 most egregious SI over-predictions — paragraphs like "For a description of risks that may materially affect us, see Item 1A" that v3.5a called SI but are obviously N/O. These were pure pointers with materiality language embedded in the cross-reference text, not materiality assessments. +7.
+
+**Round 4 (v3.5d) — 22/26.** The critical insight: not all materiality language is a materiality *assessment*. Reading the 6 remaining errors revealed a spectrum:
+
+- "Cybersecurity risks have not materially affected our business strategy" → **Assessment** (conclusion about actual impact) → SI ✓
+- "Risks are reasonably likely to materially affect us" → **Assessment** (SEC Item 106(b)(2) standard) → SI ✓
+- "Cybersecurity threats could have a material adverse effect on our business" → **Speculation** (generic risk warning in every 10-K) → NOT SI ✗
+- "Managing material risks associated with cybersecurity" → **Adjective** ("material" means "significant") → NOT SI ✗
+- "...which could result in material adverse effects" at the end of an RMP paragraph → **Consequence clause** (doesn't override primary purpose) → NOT SI ✗
+
+The tightened rule: only backward-looking conclusions and SEC-qualified forward-looking ("reasonably likely to") trigger SI. Generic "could have a material adverse effect" does not. This distinction — assessment vs. speculation — resolved 3 errors without breaking any correct calls. +2.
+
+We also verified each error against human annotator votes. All 6 remaining errors had the human majority correct (checked by reading the actual paragraph text and codebook rules). Interestingly, on 3 of the 6, the project lead's own label was the dissenting human vote — he had been the one calling these SI, validating that the over-calling pattern was a real and consistent interpretation difference, not random noise.
+
+**Round 5 (v3.5e) — 19/25.** Regression. We attempted to add an explicit BG↔RMP example ("CISO assists the ERMC in monitoring... → RMP") to the disambiguation guidance. This caused 3 previously-correct paragraphs to flip to BG — the example made models hyper-aware of committee mentions and triggered BG more broadly. Lesson: **targeted examples can backfire when the pattern is too specific.** The model generalizes from the example in unpredictable ways.
+
+**Round 6 (v3.5f) — 21/26.** Reverted the Round 5 BG↔RMP example. Kept the N/O↔RMP "actual measures" clarification from Round 5 (if a paragraph describes specific security measures the company implemented, it's RMP even in risk-factor framing). This stabilized at 21-22/26, with the 2-paragraph swing attributable to LLM non-determinism at temperature=0.
+
+### The 4 Irreducible Errors
+
+The remaining errors after Rounds 4 and 6 fall into two patterns:
+
+**BG over-call on process paragraphs (2 errors):** A paragraph describing monitoring methods (threat intelligence, security tools, detection capabilities) where a management committee (ERMC) is woven throughout as the entity being assisted. Content is clearly RMP but the committee mention triggers BG. These are genuinely dual-coded — the monitoring IS part of the committee's function. Human majority says RMP (2-1 in both cases).
+
+**N/O over-call on borderline RMP paragraphs (2 errors):** Paragraphs that describe risk management activities ("assessing, identifying, and managing material risks") but are framed as risk-factor discussions with threat enumeration. The SI tightening correctly stopped calling them SI, but they overcorrected to N/O instead of RMP. The N/O↔RMP boundary depends on whether the paragraph describes what the company DOES (→ RMP) vs. what risks it faces (→ N/O). These paragraphs do both.
+
+All 4 have human 2-1 splits — reasonable annotators disagree on these. Further prompt iteration risks over-fitting to these 4 specific paragraphs at the cost of breaking the other 355 correctly-classified ones.
+
+### The SI Rule: Assessment vs. Speculation
+
+The most important finding from the iteration is the distinction between materiality *assessments* and materiality *language*:
+
+| Pattern | Classification | Reasoning |
+|---------|---------------|-----------|
+| "have not materially affected our business strategy" | **SI** | Backward-looking conclusion — the company is reporting on actual impact |
+| "reasonably likely to materially affect" | **SI** | Forward-looking with SEC qualifier — Item 106(b)(2) disclosure |
+| "have not experienced material cybersecurity incidents" | **SI** | Negative assertion — materiality conclusion about past events |
+| "could have a material adverse effect" | **NOT SI** | Generic speculation — appears in every 10-K, not an assessment |
+| "managing material risks" | **NOT SI** | Adjective — "material" means "significant," not a materiality assessment |
+| "For risks that may materially affect us, see Item 1A" | **NOT SI** | Cross-reference — pointing elsewhere, not making a conclusion |
+| "...which could result in material losses" (at end of RMP paragraph) | **NOT SI** | Consequence clause — doesn't override the paragraph's primary purpose |
+
+This distinction reduced the Stage 1 correction set from ~1,014 to 308 paragraphs. The original broad flag ("any paragraph with the word 'material'") caught ~700 paragraphs that were correctly labeled N/O by Stage 1 — they contained generic "could have a material adverse effect" boilerplate that is NOT a materiality assessment. Only 180 paragraphs contain actual backward-looking or SEC-qualified assessments that v2.5 miscoded.
+
+### Final v3.5 Gold Re-Run
+
+After locking the prompt at v3.5f, all 7 models (Opus + 6 benchmark) were re-run on the 359 confusion-axis holdout paragraphs with the final prompt (~$18). v3.0 data preserved in original paths (`bench-holdout/`, `golden/`). v3.5f results stored separately (`bench-holdout-v35/`, `golden-v35/`). The v3.0→v3.5 comparison — per model, per axis — is itself a publishable finding about how prompt engineering systematically shifts classification boundaries in frontier LLMs.
+
+### The SI↔N/O Paradox — Resolved
+
+The v3.5f re-run showed a troubling result: SI↔N/O accuracy *dropped* 6pp vs v3.0 (60% vs 66%), with the H=SI/M=N/O asymmetry worsening from 20 to 25 cases. The initial hypothesis was that models became globally conservative when told to distinguish assessment from speculation.
+
+A paragraph-by-paragraph investigation of all 27 SI↔N/O errors revealed the opposite: **the models are correct, and the humans are systematically wrong.**
+
+Of the 25 H=SI / M=N/O cases:
+
+- ~20 are pure "could have a material adverse effect" speculation, cross-references to Item 1A, or generic threat enumeration — none containing actual materiality assessments. All 6 models unanimously call N/O.
+- ~3 are genuinely ambiguous (SPACs with assessment language, past disruption without explicit materiality language).
+- ~2 are edge cases (negative assertions embedded at end of BG paragraphs).
+
+Of the 2 H=N/O / M=SI cases:
+
+- Both contain clear negative assertions ("not aware of having experienced any prior material data breaches," "did not experience any cybersecurity incident during 2024") — textbook SI. All 6 models unanimously call SI.
+
+**Root cause of human error:** Annotators systematically treat ANY mention of "material" + "business strategy" + "financial condition" as SI — even when wrapped in pure speculation ("could," "if," "may"). The codebook's assessment-vs-speculation distinction is correct; humans weren't consistently applying it.
+
+**Codebook Case 9 contradiction fixed:** The investigation also discovered that Case 9 ("could potentially have a material impact" → SI) directly contradicted Rule 6 ("could" = speculation, not assessment). Case 9 has been corrected: the "could" example is now N/O, with an explanation of why "reasonably likely to materially affect" (SEC qualifier) ≠ "could potentially" (speculation).
+
+Two minor prompt clarifications were added (consequence clause refinement for negative assertions, investment/resource SI signal) and tested on 83 SI↔N/O paragraphs ($0.55). Net effect: within stochastic noise — confirming the prompt was already correct.
+
+### Implications for Training
+
+- **Gold adjudication on SI↔N/O:** Trust model consensus over human majority. When 6/6 models unanimously agree and the paragraph contains only speculative language → use model label. Apply SI deterministically via regex for backward-looking assessments and SEC qualifiers (see the sketch after this list). Expected impact: SI↔N/O accuracy rises from ~60% to ~95%+ against corrected gold labels.
+- **Stage 2 judge** must use v3.5 prompt. This is where the codebook evolution actually matters for training data quality.
+- **Stage 1 corrections re-flagged:** Tightened criteria reduced flagged paragraphs from 1,014 to 308 (180 materiality assessments + 128 SPACs). The 706 excluded paragraphs contained generic "could" boilerplate that was correctly labeled N/O by v2.5.
+- **Gold adjudication on other axes:** On MR↔RMP and BG↔MR, v3.5 improves alignment with humans by ~4pp on hard cases but the improvement is more modest on easy cases.
+- **MiniMax exclusion:** MiniMax M2.7 is a statistical outlier (z=−2.07 in inter-model agreement) and the most volatile model across prompt versions (40.7% category change rate). Data retained per assignment requirements but excluded from gold scoring majority.
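A hedged sketch of such a regex gate — the patterns are illustrative stand-ins, not the project's production rules:

```python
import re

# Catch backward-looking conclusions, the SEC "reasonably likely" qualifier,
# and negative assertions; bare "could ..." speculation never matches.
ASSESSMENT = re.compile(
    r"ha(?:ve|s)\s+not\s+materially\s+affected"
    r"|reasonably\s+likely\s+to\s+materially\s+affect"
    r"|not\s+experienced\s+any\s+material",
    re.IGNORECASE,
)

def is_materiality_assessment(text: str) -> bool:
    return ASSESSMENT.search(text) is not None

print(is_materiality_assessment("Risks have not materially affected our business strategy."))  # True
print(is_materiality_assessment("Threats could have a material adverse effect on us."))        # False
```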
+
+### Cost Ledger Update
+
+| Phase | Cost | Time |
+|-------|------|------|
+| v3.5 initial re-run (7 × 359) | ~$18 | ~10 min |
+| v3.5 iteration (6 × 26 × GPT-5.4) | $1.02 | ~15 min |
+| v3.5f final re-run (7 × 359) | ~$18 | ~10 min |
+| SI↔N/O investigation (37 + 83 × GPT-5.4) | $0.55 | ~1 min |
+| **v3.5 subtotal** | **~$37.57** | |
+| **Running total API** | **~$202.57** | |
+
+---
+

 ## Lessons Learned

 ### On Prompt Engineering
@@ -1113,6 +1244,10 @@ Key risk: the stratified holdout over-samples hard cases, depressing F1 vs a ran
 - Pilots must be large enough (500+). 40-sample pilots were misleadingly optimistic.
 - More rules ≠ better. After the core structure is right, additional rules cause regression.
 - The `specific_facts` chain-of-thought schema (forcing models to enumerate evidence before deciding) was the single most impactful structural change.
+- **Rules over-correct before they converge.** The v3.5 iteration showed a consistent pattern: a new rule fixes the target problem but creates 2-3 new errors on adjacent cases. Each fix required a counter-fix. "Materiality language → SI" fixed the 23:0 asymmetry but created cross-reference false positives and speculation false positives that each required their own exception. Six rounds of test-fix-test were needed to reach equilibrium.
+- **Targeted examples backfire.** Adding a specific example to a disambiguation rule ("CISO assists the ERMC in monitoring → RMP") caused regression elsewhere — models generalize from examples in unpredictable ways. General principles ("content matters more than names") are safer than specific examples in disambiguation guidance.
+- **Assessment vs. language is a fundamental distinction.** The word "material" appears in thousands of SEC paragraphs but carries different force in different grammatical contexts. "Have not materially affected" (conclusion) vs. "could have a material adverse effect" (speculation) vs. "material risks" (adjective) are three different speech acts. Models don't naturally distinguish these without explicit guidance.
+- **Check the humans — they can be systematically wrong.** On SI↔N/O, human annotators systematically over-called SI on any paragraph mentioning "material" + "business strategy," even when the language was pure speculation. The 25:2 asymmetry initially looked like model failure but was actually human failure to apply the assessment-vs-speculation distinction. When all 6 frontier models unanimously disagree with a 2/3 human majority, investigate before assuming the humans are right. The models' consistency (unanimous agreement across architectures and providers) is itself strong evidence.

 ### On Model Selection
 - Reasoning tokens are the strongest predictor of accuracy, not price or model size.

103 docs/STATUS.md

@@ -87,26 +87,95 @@ Plus Stage 1 panel already on file = **10 models, 8 suppliers**.
 **Key finding:** Opus earns the #1 spot through leave-one-out — it's not special because we designated it as gold; it genuinely disagrees with the crowd least (7.4% odd-one-out rate).

+### Codebook v3.5 & Prompt Iteration — Complete
+
+- [x] Cross-analysis: GenAI vs human systematic errors identified (SI↔N/O 23:0, MR↔RMP 38:13, BG↔MR 33:6)
+- [x] v3.5 rulings: SI materiality assessment test, BG purpose test, MR↔RMP 3-step chain
+- [x] v3.5 gold re-run: 7 models × 359 confusion-axis holdout paragraphs ($18)
+- [x] 6 rounds prompt iteration on 26 regression paragraphs ($1.02): v3.0=18/26 → v3.5=22/26
+- [x] SI rule tightened: "could have material adverse effect" = NOT SI (speculation, not assessment)
+- [x] Cross-reference exception: materiality language in cross-refs = N/O
+- [x] BG threshold: one-sentence committee mention doesn't flip to BG
+- [x] Stage 1 corrections flagged: 308 paragraphs (180 materiality + 128 SPACs)
+- [x] Prompt locked at v3.5, codebook updated, version history documented
+- [x] SI↔N/O paradox investigated and resolved: models correct, humans systematically over-call SI on speculation
+- [x] Codebook Case 9 contradiction with Rule 6 fixed ("could" example → N/O)
+- [x] Gold adjudication strategy for SI↔N/O defined: trust model consensus, apply SI via regex for assessments
+
+| Data asset | Location |
+|-----------|----------|
+| v3.5 bench annotations | `data/annotations/bench-holdout-v35/*.jsonl` (7 models × 359) |
+| v3.5 Opus annotations | `data/annotations/golden-v35/opus.jsonl` (359) |
+| Stage 1 correction flags | `data/annotations/stage1-corrections.jsonl` (308) |
+| Holdout re-run IDs | `data/gold/holdout-rerun-v35.jsonl` (359) |
+### Gold Set Adjudication v1 — Complete
+
+- [x] Aaryan redo integrated: 50.3% of labels changed, α 0.801→0.825 (cat), 0.546→0.661 (spec)
+- [x] Old Aaryan labels preserved in `data/gold/human-labels-aaryan-v1.jsonl`
+- [x] Cross-axis systematic error analysis: models correct ~85% on MR↔RMP, MR↔BG, RMP↔BG, TP↔RMP, SI↔N/O
+- [x] 5-tier adjudication: T1 super-consensus (911), T2 cross-validated (108), T3 rule-based (30), T4 model-unanimous (59), T5 plurality (92)
+- [x] 30 rule-based overrides (27 SI↔N/O + 3 T5 codebook resolutions)
+
+### Gold Set Adjudication v2 — Complete (T5 deep analysis)
+
+- [x] Full model disagreement analysis: 6-model vote vectors on all 1,200 paragraphs
+- [x] Gemini identified as systematic MR outlier (z≈+2.3, 302 MR vs ~192 avg, drives 45% MR↔RMP confusion)
+- [x] Gemini exclusion experiment: NULL RESULT at T5 (human MR bias makes it redundant; tiering already neutralizes at T4)
+- [x] v3.5 prompt impact: unanimity 25%→60%, but created new BG↔RMP hotspot (+171%)
+- [x] **Text-based BG vote removal**: automated, verifiable — if "board" absent from text, BG model votes removed. 13 labels corrected, source accuracy UP for 10/12 sources
+- [x] **10 new codebook tiebreaker overrides**: ID↔SI (negative assertions), SPAC rule, board-removal test, committee-level test
+- [x] **Specificity hybrid**: human unanimous → human label, human split → model majority. 195 specificity labels updated
+- [x] All changes validated experimentally (one variable at a time, acceptance criteria checked)
+- [x] T5: 92 → 85, gold≠human: 151 → 144
+
+| Source | Accuracy vs Gold (v1) | Accuracy vs Gold (v2) | Δ |
+|--------|----------------------|----------------------|---|
+| Xander | 91.0% | 91.5% | +0.5% |
+| Opus | 88.6% | 89.1% | +0.5% |
+| GPT-5.4 | 87.4% | 88.5% | +1.1% |
+| GLM-5 | 86.0% | 86.5% | +0.5% |
+| Elisabeth | 85.8% | 86.5% | +0.7% |
+| MIMO | 85.8% | 86.2% | +0.5% |
+| Meghan | 85.3% | 86.0% | +0.7% |
+| Kimi | 84.5% | 84.9% | +0.4% |
+| Gemini | 84.0% | 84.6% | +0.6% |
+| Joey | 80.7% | 80.2% | -0.5% |
+| Aaryan | 75.2% | 74.2% | -1.0% |
+| Anuj | 69.3% | 69.7% | +0.3% |
+
+| Data asset | Location |
+|-----------|----------|
+| Adjudicated gold labels | `data/gold/gold-adjudicated.jsonl` (1,200) |
+| Old Aaryan labels | `data/gold/human-labels-aaryan-v1.jsonl` (600) |
+| Adjudication charts | `data/gold/charts/gold-*.png` (4 charts) |
+| Adjudication script | `scripts/adjudicate-gold.py` (v2) |
+| Experiment harness | `scripts/adjudicate-gold-experiment.py` |
+| T5 analysis docs | `docs/T5-ANALYSIS.md` |

 ## What's Next (in dependency order)

-### 1. Gold set adjudication
+### 1. (Optional) Manual review of remaining 85 T5-plurality paragraphs
-- Tier 1+2 (972 paragraphs, 81%) → auto-resolved from 13-signal consensus
+- 85 paragraphs resolved by signal plurality — lowest confidence tier
-- Tier 3+4 (228 paragraphs, 19%) → expert review with Opus reasoning traces
+- 71% on the BG↔MR↔RMP triangle (irreducible ambiguity)
-- For Aaryan's 600 paragraphs: use other-2-annotator majority when they agree and he disagrees
+- 62 have weak plurality (4-5/9) — diminishing returns
+- Could improve gold set by ~1-3% if reviewed, but diminishing returns

-### 2. Training data assembly
+### 2. Stage 2 re-eval on training data
+- Pilot gpt-5.4-mini vs gpt-5.4 on holdout validation sample
+- Run on 308 flagged Stage 1 corrections (180 materiality + 128 SPACs)
+- Also run standard Stage 2 judge on existing disagreements with v3.5 prompt
+
+### 3. Training data assembly
 - Unanimous Stage 1 labels (35,204 paragraphs) → full weight
 - Calibrated majority labels (~9-12K) → full weight
 - Judge high-confidence labels (~2-3K) → full weight
 - Quality tier weights: clean/headed/minor=1.0, degraded=0.5

-### 3. Fine-tuning + ablations
+### 4. Fine-tuning + ablations
 - 8+ experiments: {base, +DAPT, +DAPT+TAPT} × {±SCL} × {±class weighting}
 - Dual-head classifier: shared ModernBERT backbone + category head (7-class) + specificity head (4-class ordinal) — see the sketch after this list
 - Focal loss / class-weighted CE for category imbalance
 - Ordinal regression (CORAL) for specificity
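A minimal sketch of the planned dual-head architecture, assuming a Hugging Face `AutoModel` encoder and [CLS]-token pooling (both assumptions; the loss wiring, including the CORAL ordinal loss, is omitted):

```python
import torch
import torch.nn as nn
from transformers import AutoModel

class DualHeadClassifier(nn.Module):
    """Shared encoder, category head (7-class), specificity head (4-class)."""

    def __init__(self, backbone: str = "answerdotai/ModernBERT-large"):
        super().__init__()
        self.encoder = AutoModel.from_pretrained(backbone)
        hidden = self.encoder.config.hidden_size
        self.category_head = nn.Linear(hidden, 7)      # 7 content categories
        self.specificity_head = nn.Linear(hidden, 4)   # 4 ordinal levels

    def forward(self, input_ids: torch.Tensor, attention_mask: torch.Tensor):
        out = self.encoder(input_ids=input_ids, attention_mask=attention_mask)
        cls = out.last_hidden_state[:, 0]              # [CLS]-token pooling
        return self.category_head(cls), self.specificity_head(cls)
```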

-### 4. Evaluation + paper
+### 5. Evaluation + paper
 - Macro F1 + per-class F1 on holdout (must exceed 0.80 for category)
 - Full GenAI benchmark table (10 models × 1,200 holdout)
 - Cost/time/reproducibility comparison
@@ -116,13 +185,15 @@ Plus Stage 1 panel already on file = **10 models, 8 suppliers**.
 ## Parallel Tracks

 ```
-Track A (GPU): DAPT ✓ → TAPT ✓ ──────────────→ Fine-tuning → Eval
+Track A (GPU): DAPT ✓ → TAPT ✓ ─────────────────────────────→ Fine-tuning → Eval
 ↑
 Track B (API): Opus re-run ✓─┐ │
-├→ Gold adjudication ─────┤
+├→ v3.5 re-run ✓ → SI paradox ✓ ───┐ │
-Track C (API): 6-model bench ✓┘ │
+Track C (API): 6-model bench ✓┘ │ │
-│
+Gold adjud. ✓ ┤ │
-Track D (Human): Labeling ✓ → IRR ✓ → 13-signal ✓ ─────┘
+Track E (API): v3.5 prompt ✓ → S1 flags ✓ → Stage 2 re-eval ───┘───┘
+
+Track D (Human): Labeling ✓ → IRR ✓ → 13-signal ✓ → Aaryan redo ✓
 ```

 ## Key File Locations
@@ -142,5 +213,9 @@ Track D (Human): Labeling ✓ → IRR ✓ → 13-signal ✓ ─────┘
 | DAPT corpus | `data/dapt-corpus/shard-*.jsonl` (14,756 docs) |
 | DAPT checkpoint | `checkpoints/dapt/modernbert-large/final/` |
 | TAPT checkpoint | `checkpoints/tapt/modernbert-large/final/` |
+| v3.5 bench annotations | `data/annotations/bench-holdout-v35/*.jsonl` (7 × 359) |
+| v3.5 Opus golden | `data/annotations/golden-v35/opus.jsonl` (359) |
+| Stage 1 correction flags | `data/annotations/stage1-corrections.jsonl` (1,014) |
+| Holdout re-run IDs | `data/gold/holdout-rerun-v35.jsonl` (359) |
 | Analysis script | `scripts/analyze-gold.py` (30-chart, 13-signal analysis) |
 | Data dump script | `labelapp/scripts/dump-all.ts` |

243 docs/T5-ANALYSIS.md Normal file

@@ -0,0 +1,243 @@
+# T5 Plurality Analysis & Model Disagreement Deep-Dive
+
+**Date:** 2026-04-02
+**Author:** Claude (analysis), Joey (direction)
+
+## Methodology
+
+### Data Sources
+
+| Source | File | Records |
+|--------|------|---------|
+| Gold adjudication | `data/gold/gold-adjudicated.jsonl` | 1,200 (92 T5) |
+| Human labels | `data/gold/human-labels-raw.jsonl` | 3,600 (3 per paragraph) |
+| Holdout paragraphs | `data/gold/paragraphs-holdout.jsonl` | 1,200 |
+| Opus v3.0 | `data/annotations/golden/opus.jsonl` | 1,200 |
+| GPT-5.4 v3.0 | `data/annotations/bench-holdout/gpt-5.4.jsonl` | 1,200 |
+| Gemini v3.0 | `data/annotations/bench-holdout/gemini-3.1-pro-preview.jsonl` | 1,200 |
+| GLM-5 v3.0 | `data/annotations/bench-holdout/glm-5:exacto.jsonl` | 1,200 |
+| Kimi v3.0 | `data/annotations/bench-holdout/kimi-k2.5.jsonl` | 1,200 |
+| MIMO v3.0 | `data/annotations/bench-holdout/mimo-v2-pro:exacto.jsonl` | 1,200 |
+| v3.5 re-runs | `data/annotations/{golden,bench-holdout}-v35/*.jsonl` | 7 × 359 |
+
+### Analysis 1: T5 Case Decomposition
+
+All 92 T5-plurality cases extracted and categorized by:
+
+- **Confusion axis**: which categories are competing (e.g., MR↔RMP, BG↔MR)
+- **Vote distribution**: human votes (3 per paragraph) and model votes (6 per paragraph)
+- **Plurality strength**: how many of 9 signals support the winning label
+- **Human-model alignment**: whether human and model majorities agree (spoiler: 0/92)
+
+### Analysis 2: Model Disagreement Patterns (Full 1,200)
+
+For all 1,200 holdout paragraphs (see the sketch after this list):
+
+1. Built 6-model vote vectors
+2. Categorized by agreement level (6/6, 5/1, 4/2, 3/3)
+3. For splits, identified which model(s) dissented
+4. Computed per-model dissent rates (how often each model is the odd one out)
+5. Mapped dissent to confusion axes
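A toy illustration of steps 1-4, with made-up labels (the real analysis runs over the six benchmark models on all 1,200 paragraphs):

```python
from collections import Counter

votes = {
    "p1": {"opus": "RMP", "gpt": "RMP", "gemini": "MR",
           "glm": "RMP", "kimi": "RMP", "mimo": "RMP"},   # a 5/1 split
    "p2": {"opus": "BG", "gpt": "BG", "gemini": "BG",
           "glm": "BG", "kimi": "BG", "mimo": "BG"},      # 6/6 unanimous
}

odd_one_out = Counter()
for para, v in votes.items():
    tally = Counter(v.values())                 # the 6-model vote vector
    label, n = tally.most_common(1)[0]
    if n == len(v) - 1:                         # 5/1 split: one dissenter
        odd_one_out.update(m for m, lab in v.items() if lab != label)

print(odd_one_out)  # Counter({'gemini': 1}) -> per-model dissent tallies
```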
+
+### Analysis 3: Model Reasoning Examination
+
+For T5 cases, read the `reasoning` field from Opus, GPT-5.4, and Gemini annotations to understand:
+
+- What textual features each model cites when classifying
+- Whether models apply codebook decision tests (person-removal, board-line) or keyword-anchor
+- How v3.0 vs v3.5 reasoning differs for the same paragraphs
+
+### Analysis 4: v3.0 vs v3.5 Prompt Impact
+
+Compared model agreement on the 359 confusion-axis paragraphs between v3.0 and v3.5:
+
+- Agreement distribution shifts
+- Per-axis dissent changes
+- Per-model improvement rates
+- Category distribution shifts (over/under-correction)
+
+---
+
+## Key Findings
+
+### Finding 1: Gemini is a Systematic MR Outlier
+
+| Metric | Gemini | Average of other 5 |
+|--------|--------|---------------------|
+| Dissent rate (5/1 + 4/2 splits) | 35.5% | ~20.6% |
+| MR labels (v3.0, full 1200) | 302 | ~192 |
+| MR↔RMP dissenter-votes | 69 (45% of axis total) | ~17 each |
+| Accuracy vs adjudicated gold | 84.0% | ~86.5% |
+
+**Mechanism**: Gemini's reasoning fields show keyword-anchoring on credentials (CISSP, CISM, years of experience) and named titles. When these appear, Gemini's reasoning literally states "which triggers the Management Role category" regardless of surrounding content. It does not consistently apply the person-removal test.
+
+**Comparison to MiniMax exclusion**: MiniMax was excluded at z=-2.07 (statistical outlier on overall accuracy). Gemini's MR frequency is z≈+2.3 vs other models. Its overall accuracy (84.0%) is the lowest of the top 6. On the MR↔RMP axis specifically, gold labels resolve to RMP 14/20 times when MR↔RMP is the dispute — Gemini's MR bias is systematically wrong.
+
+### Finding 2: v3.5 Prompt Created BG↔RMP Over-Correction
+
+| Metric | v3.0 (359 subset) | v3.5 (359 subset) |
+|--------|--------------------|--------------------|
+| 6/6 unanimity | 25% | 60% |
+| MR↔RMP dissent-votes | 146 | 54 (-63%) |
+| N/O↔SI dissent-votes | 39 | 4 (-90%) |
+| **BG↔RMP dissent-votes** | **21** | **57 (+171%)** |
+
+v3.5's "board-line test" caused GPT (and sometimes Opus) to classify paragraphs as BG whenever any reporting-to-board language exists, even when 80%+ of the paragraph describes process activities. MIMO is the primary driver of the new BG↔RMP confusion under v3.5 (20 dissenter-votes).
+
+### Finding 3: All Model Splits Reduce to Subject-vs-Predicate
+
+Every confusion axis is the same underlying question:
+
+| Axis | Subject framing | Predicate framing |
+|------|----------------|-------------------|
+| MR↔RMP | Who does it (CISO, team) | What they do (monitor, detect) |
+| BG↔RMP | Oversight structure (committee) | Activities described (risk assessment) |
+| BG↔MR | Governance body (board committee) | Personnel details (qualifications) |
+| ID↔SI | Event described (breach, attack) | Assessment made (no material impact) |
+
+Models disagree on whether to classify by the grammatical subject or the semantic predicate of the paragraph.
+
+### Finding 4: T5 Cases Are 100% Human-Model Misalignment
+
+92/92 T5 cases have human majority ≠ model majority. This is not coincidental — T5 is literally the tier where the two signal groups disagree and no higher tier resolves it.
+
+- 75% resolved by weak plurality (4-5/9 votes)
+- 71% involve the BG↔MR↔RMP triangle
+- BG↔MR↔RMP gold distribution: BG 25, RMP 28, MR 12
+
+### Finding 5: Model Reasoning Reveals Specific Anchor Points
+
+| Model | Consistent anchors | Axis effect |
+|-------|-------------------|-------------|
+| Gemini | Credentials, titles, committee names | Over-calls MR |
+| GPT-5.4 (v3.5) | Board mentions, oversight language | Over-calls BG |
+| Opus | Process descriptions, decision tests | Most balanced |
+| GLM-5 | Generic risk language | Over-calls N/O on SI boundary |
+| Kimi | Third-party mentions | Over-splits TP from RMP |
+| MIMO | Committee structure | Over-calls BG under v3.5 |
+
+---
+
## Proposed Interventions
|
||||||
|
|
||||||
|
### Intervention 1: Exclude Gemini from MR↔RMP Adjudication
|
||||||
|
|
||||||
|
**Justification**: Same evidence-based logic as MiniMax exclusion. Gemini's MR bias is systematic (z≈+2.3), its mechanism is documented (credential-anchoring), and gold labels confirm it's wrong 70% of the time on this axis.
|
||||||
|
|
||||||
|
**Scope**: Only when the T5 dispute is MR vs RMP and Gemini voted MR. Gemini remains in the panel for all other axes.
|
||||||
|
|
||||||
|
### Intervention 2: Board-Removal Test
|
||||||
|
|
||||||
|
**Rule**: For BG↔RMP disputes, mentally remove the 1-2 sentences mentioning the board. If what remains is a coherent process paragraph → RMP. If the paragraph is primarily *about* board oversight → BG.
|
||||||
|
|
||||||
|
**Rationale**: Dual of the person-removal test. Operationalizes existing BG threshold rule.
|
||||||
|
|
||||||
|
### Intervention 3: Committee-Level Test
|
||||||
|
|
||||||
|
**Rule**: A board committee (committee *of* the Board, board subcommittee) → BG. A management committee (*reports to* board but composed of management) → apply person-removal test.
|
||||||
|
|
||||||
|
### Intervention 4: ID↔SI Tiebreaker
|
||||||
|
|
||||||
|
**Rule**: "Describes what happened" → ID. "Only discusses cost/materiality" → SI. "Both" → whichever dominates by volume.

### Intervention 5: Specificity Hybrid

**Rule**: Human 3/3 unanimous → human label. Human split → model majority.
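
A minimal sketch of the hybrid, assuming specificity labels are integers. This is simplified: the harness below additionally falls back to the human majority when no model specificity votes are available.

```python
from collections import Counter

def hybrid_specificity(human_specs: list[int], model_specs: list[int]) -> int:
    """Human 3/3 unanimous -> human label; human split -> model majority."""
    if len(set(human_specs)) == 1:
        return human_specs[0]
    return Counter(model_specs).most_common(1)[0][0]
```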

---

## Experimental Design

Each intervention is tested independently, one variable at a time. Acceptance criteria:

1. T5 count decreases or stays constant (fewer arbitrary resolutions)
2. Source accuracy: no model or human annotator drops >1% (the intervention isn't distorting)
3. Category distribution: no category shifts >±5% of its baseline count
4. Each change has a documented codebook justification

Experiment harness: `scripts/adjudicate-gold-experiment.py` (run as `uv run scripts/adjudicate-gold-experiment.py [experiment_name|all]`)

---

## Experiment Results

### Exp 1: Exclude Gemini from MR↔RMP axis — NULL RESULT

Gemini over-labels MR (z≈+2.3, 302 labels vs a ~192 average). Hypothesis: removing Gemini's MR vote at T5 plurality would flip MR→RMP for disputed cases.

**Result:** Zero label changes. Gemini's MR bias is redundant with human MR bias at T5: when both humans and Gemini vote MR, removing Gemini doesn't change the plurality because the human votes still carry MR. The tiering system already neutralizes Gemini's outlier behavior at T4 (where all 6 models unanimously override humans).

**Conclusion:** Gemini exclusion is unnecessary. The tiering system is already doing this work.

### Exp 2b: No-board BG vote removal — PASS (strongest intervention)

Automated, verifiable test: if "board" (case-insensitive) is absent from the paragraph text, remove BG model votes before the T5 plurality. Rationale: a paragraph can't be Board Governance if it never mentions the board.
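
Condensed, this is the rule as the harness below implements it (note the implementation drops all BG signals, human and model alike, provided at least one signal remains):

```python
from collections import Counter

def t5_plurality(signals: list[str], paragraph_text: str) -> str:
    """T5 plurality with the no-board BG vote removal applied first."""
    if "board" not in paragraph_text.lower():
        remaining = [s for s in signals if s != "Board Governance"]
        if remaining:  # never reduce to an empty signal set
            signals = remaining
    return Counter(signals).most_common(1)[0][0]
```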

| Metric | Baseline | Exp 2b | Δ |
|--------|----------|--------|---|
| T5 count | 92 | 92 | 0 |
| Gold ≠ human | 151 | 145 | -6 |
| BG labels | 244 | 231 | -13 |
| Xander accuracy | 91.0% | 91.5% | +0.5% |
| GPT-5.4 accuracy | 87.4% | 88.1% | +0.7% |
| GLM-5 accuracy | 86.0% | 86.8% | +0.8% |

13 labels changed (all BG → other). Source accuracy UP for 10/12 sources.
### Exp 2: Manual board-removal + committee-level test — PASS

For 5 paragraphs that mention "board" but where the board reference is incidental:

- 22da6695: BG→RMP (board = 1/5 sentences, CISO/incident response dominates)
- a2ff7e1e: BG→MR (titled "Management's Role," board is notification destination)
- cb518f47: BG→MR (management oversees, board is incident notification only)

| Metric | Baseline | Exp 2 | Δ |
|--------|----------|-------|---|
| T5 count | 92 | 89 | -3 |
| Source accuracy | — | all up or neutral | +0.1-0.2% |
### Exp 4: Codebook tiebreaker overrides — PASS

4 T5 cases resolved by applying codebook rules:

- 0ceeb618: ID→SI (negative assertion with brief incident context)
- cc82eb9f: ID→SI (negative assertion dominates; incident is example)
- 203ccd43: MR→N/O (SPAC rule: "once the Company commences operations")
- f549fd64: ID→RMP (post-incident improvements, no incident described)

| Metric | Baseline | Exp 4 | Δ |
|--------|----------|-------|---|
| T5 count | 92 | 88 | -4 |
| Opus accuracy | 88.6% | 88.8% | +0.2% |
| GPT-5.4 accuracy | 87.4% | 87.8% | +0.3% |
### Exp 5: Specificity hybrid — PASS

Human 3/3 unanimous → human label. Human split → model majority. 195 specificity labels changed. Zero impact on category distribution (as expected).

### Combined: All validated interventions — APPLIED

| Metric | Baseline | Combined | Δ |
|--------|----------|----------|---|
| T5 count | 92 | 85 | **-7** |
| Gold ≠ human | 151 | 144 | **-7** |
| T3 rule-based | 30 | 37 | +7 |
| Xander accuracy | 91.0% | 91.5% | **+0.5%** |
| Opus accuracy | 88.6% | 89.1% | **+0.5%** |
| GPT-5.4 accuracy | 87.4% | 88.5% | **+1.1%** |
| Elisabeth accuracy | 85.8% | 86.5% | +0.7% |
| Meghan accuracy | 85.3% | 86.0% | +0.7% |
| Specificity changes | 0 | 195 | — |

20 category labels changed. 195 specificity labels changed. Source accuracy improved for 10/12 sources.

**Borderline criteria:** The BG category shift is -6.6% (threshold ±5%), but justified: 11 of the 13 reclassified paragraphs literally never mention "board." Aaryan's accuracy drops -1.0% (at the <1% threshold), but Aaryan is the weakest annotator and was already aligned with the incorrect BG labels.

---
## Remaining T5 Cases (85)

| Axis | Count | Notes |
|------|-------|-------|
| BG↔MR↔RMP (3-way) | 31 | Irreducible: SEC Item 1C naturally blends governance/management/process |
| MR↔RMP (pure) | 20 | Person-removal test applicable but not automatable |
| BG↔MR | 6 | Board committees vs management committees |
| BG↔RMP | 5 | Governance structure vs process content |
| ID↔SI | 4 | Borderline incident/assessment paragraphs |
| Other | 19 | Various minor axes |

The 85 remaining T5 cases represent 7.1% of the holdout set. Most are on the BG↔MR↔RMP triangle, which reflects genuine structural ambiguity in SEC Item 1C disclosures (companies describe governance, management roles, and risk processes in interleaved paragraphs). This is a methodological finding worth documenting in the paper.
164 docs/V35-ITERATION-LOG.md Normal file
@ -0,0 +1,164 @@
# v3.5 Prompt Iteration Log

## Status: Locked at v3.5f, pending SI↔N/O investigation

## Final v3.5f Re-Run Results (7 models × 359 confusion-axis holdout paragraphs)

### Per-Model Accuracy vs Human Majority (358 common paragraphs)

| Model | v3.0 acc | v3.5f acc | Δ | Notes |
|-------|---------|----------|---|-------|
| Opus | ~63% | 63.4% | ~0 | most stable |
| Gemini Pro | ~59% | ~62% | +3 | |
| Kimi K2.5 | ~55% | ~62% | +7.0 | |
| GLM-5 | ~55% | ~62% | +6.7 | |
| MIMO Pro | ~57% | ~60% | +3 | |
| GPT-5.4 | ~62% | ~60% | -1.7 | |
| MiniMax | ~50% | ~57% | +7 | outlier, excluded from gold scoring |
### Per-Axis Accuracy (6-model majority, excl MiniMax)

| Axis | Paragraphs | v3.0 acc | v3.5f acc | Δ |
|------|-----------|---------|----------|---|
| BG↔MR | 104 | ~45% | ~67% | **+22.1** |
| BG↔RMP | 59 | ~40% | ~65% | **+25.4** |
| MR↔RMP | 191 | ~58% | ~56% | -2.1 |
| SI↔N/O | 83 | ~66% | ~60% | **-6.0** |

### Model Convergence

- All 7 models pairwise agreement: 61.7% → 79.1% (+17.3pp)
- Top 6 (excl MiniMax): 63.1% → 80.9% (+17.8pp)
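
For reference, a minimal sketch of how these pairwise-agreement figures can be computed, assuming `labels` maps each model name to its {paragraphId: category} dict:

```python
from itertools import combinations

def mean_pairwise_agreement(labels: dict[str, dict[str, str]]) -> float:
    """Average, over model pairs, of the share of shared paragraphs labeled identically."""
    rates = []
    for a, b in combinations(sorted(labels), 2):
        common = labels[a].keys() & labels[b].keys()
        if common:
            rates.append(sum(labels[a][p] == labels[b][p] for p in common) / len(common))
    return sum(rates) / len(rates)
```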

### Cost

| Model | v3.5f cost |
|-------|-----------|
| GPT-5.4 | $2.14 |
| Gemini Pro | $5.35 |
| GLM-5 | $3.06 |
| Kimi K2.5 | $2.80 |
| MIMO Pro | $2.21 |
| MiniMax | $0.54 |
| Opus | $0 (subscription) |
| **Total** | **$16.10** |

---
## The SI↔N/O Paradox — RESOLVED

### The original problem

We started this exercise because of a 23:0 SI↔N/O asymmetry (humans say SI, GenAI says N/O, never the reverse). The v3.5 iteration made it worse (25:2 in v3.5f vs 20:1 in v3.0).

### Investigation (post-v3.5f)

Paragraph-by-paragraph analysis of all 27 SI↔N/O errors revealed that **the models are correct, not the humans.**

**Of the 25 Human=SI / Model=N/O cases:**

- **~20 cases: Models correct.** These are "could have a material adverse effect" boilerplate speculation, cross-references to Item 1A, or generic threat enumeration — none containing actual materiality assessments. Every model unanimously calls N/O.
- **~2 cases: Genuinely ambiguous.** One SPAC with materiality language, one past-disruption mention without explicit materiality language.
- **~2 cases: Edge cases.** Negative assertions embedded at the end of BG/risk paragraphs (debatable whether the assertion or the surrounding content dominates).
- **~1 case: Wrong axis entirely.** Should be RMP (describes resource commitment), not SI or N/O.

**Of the 2 Human=N/O / Model=SI cases:**

- **Both: Models correct.** Both contain clear negative assertions ("not aware of having experienced any prior material data breaches", "did not experience any cybersecurity incident during 2024") — textbook SI per the codebook. All 6 models unanimously call SI.

**Root cause of human error:** Human annotators systematically treat ANY mention of "material," "business strategy," "results of operations," or "financial condition" as SI — even when the surrounding language is purely speculative ("could," "if," "may"). The codebook's assessment-vs-speculation distinction (v3.5 Rule 6) is correct, but humans weren't consistently applying it.
### Codebook Case 9 contradiction — FIXED

The investigation discovered that **Codebook Case 9 directly contradicted Rule 6:**

- Case 9 said: "could potentially have a material impact on our business strategy" → SI
- Rule 6 said: "could have a material adverse effect" → NOT SI (speculation)

Case 9 has been updated: the "could potentially" example is now correctly labeled N/O, with an explanation of why "reasonably likely to materially affect" (SEC qualifier) ≠ "could potentially have a material impact" (speculation).

### Prompt clarifications applied (within v3.5, no version bump)

Two minor clarifications were added to the locked prompt (net effect on GPT-5.4: within stochastic noise):

1. **Consequence clause refinement:** Speculative materiality language at the end of a paragraph = ignore. But factual negative assertions ("have not experienced any material incidents") = SI even at the end of a paragraph.
2. **Investment/resource SI signal:** "expend considerable resources on cybersecurity" is a strategic resource commitment (SI marker), not speculation.
### What this means for gold adjudication

**The "paradox" is resolved: there is no systematic model error on SI↔N/O.** The 25:2 asymmetry reflects human over-calling of SI, not model under-calling.

**Gold adjudication strategy for SI↔N/O:**

1. When all 6 models unanimously say N/O and the paragraph contains only "could/if/may" speculation → **gold = N/O** (models correct, humans wrong)
2. When all 6 models unanimously say SI and the paragraph contains a negative assertion → **gold = SI** (models correct, humans wrong)
3. For the ~3-5 genuinely ambiguous cases → expert review
4. Backward-looking assessments ("have not materially affected") and SEC-qualified forward-looking ("reasonably likely to materially affect") → **always SI** via deterministic regex, regardless of model or human vote
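
A hedged sketch of such a regex gate; the patterns below are illustrative and the production set may differ:

```python
import re

# Illustrative patterns only -- the production regex set may differ.
SI_ASSESSMENT_PATTERNS = [
    re.compile(r"have\s+not\s+materially\s+affected", re.I),              # backward-looking
    re.compile(r"reasonably\s+likely\s+to\s+materially\s+affect", re.I),  # SEC qualifier
    re.compile(r"(did|have)\s+not\s+experienced?\b.*\bmaterial", re.I),   # negative assertion
]

def force_si(paragraph: str) -> bool:
    """True when the paragraph contains a deterministic SI assessment marker."""
    return any(p.search(paragraph) for p in SI_ASSESSMENT_PATTERNS)
```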

**Expected impact:** Flipping ~22 of the 27 SI↔N/O errors from human-majority to model-consensus would raise SI↔N/O accuracy from ~60% to ~95%+ (measured against corrected gold labels).

### What this means for Stage 1 training data

The 180 materiality-flagged paragraphs should still be corrected via deterministic regex for backward-looking assessments and SEC qualifiers. The 128 SPAC paragraphs should still be corrected via the Stage 2 judge. The prompt is NOT the bottleneck — the corrections target v2.5→v3.5 codebook drift, not prompt failure.

---
## Iteration History (6 rounds, $1.02 on 26 regression paragraphs)

| Round | Prompt | Score | Key change |
|-------|--------|-------|-----------|
| 1 | v3.5a | 5/26 | Initial rulings — catastrophic over-correction |
| 2 | v3.5b | 13/25 | Purpose test for BG, Step 1 non-decisive for MR, cross-ref exception |
| 3 | v3.5c | 20/26 | Cross-reference materiality exception |
| 4 | v3.5d | 22/26 | SI tightened: assessment vs speculation distinction |
| 5 | v3.5e | 19/25 | BG/RMP example added — REGRESSED, reverted |
| 6 | v3.5f | 21/26 | Reverted R5; kept R4 SI fix + N/O↔RMP measures fix |

### Stable fixes (consistently correct across R4-R6)

- 5 SI cross-reference over-predictions eliminated
- 3-4 BG purpose test corrections
- 3-4 MR Step 1 non-short-circuiting corrections

### Stable errors (4, genuinely ambiguous — human 2-1 splits)

- 2× BG over-call on process paragraphs with committee mentions
- 2× N/O over-call on borderline RMP paragraphs

### Root causes identified per error

1. **17f2cc:** Fragment/truncated paragraph; "committees" triggers BG but process verbs dominate
2. **8adfde:** 300-word risk paragraph with embedded security measures → N/O instead of RMP
3. **eca862:** CISO+ERMC monitoring methods → BG instead of RMP (ERMC woven throughout)
4. **fcc65c:** "Material risks" + threat enumeration → N/O instead of RMP (borderline)

---
## Stage 1 Impact Summary

| Metric | Original flag | Tightened flag |
|--------|-------------|---------------|
| Total flagged | 1,014 | 308 |
| Materiality | 886 | 180 |
| SPAC | 128 | 128 |
| Excluded (generic "could" boilerplate) | — | 706 |

The 706 excluded paragraphs contain generic "could have a material adverse effect" language that is correctly N/O under both v2.5 and v3.5. Only 180 contain actual backward-looking or SEC-qualified assessments.

**Recommendation:** Correct the 180 materiality paragraphs via deterministic regex (label as SI), not via model re-evaluation. Correct the 128 SPACs via the Stage 2 judge (a model is needed to determine the correct label for paragraphs that should not have been coded as substantive categories).

---
## Files Created/Modified

| File | Purpose |
|------|---------|
| `ts/src/label/prompts.ts` | v3.5f locked prompt (PROMPT_VERSION="v3.5") |
| `data/annotations/bench-holdout-v35/*.jsonl` | 7 models × 359 paragraphs, v3.5f |
| `data/annotations/golden-v35/opus.jsonl` | Opus v3.5f on 359 paragraphs |
| `data/annotations/bench-holdout-v35b/gpt-5.4.jsonl` | Iteration test data (26 paragraphs, multiple rounds) |
| `data/annotations/stage1-corrections.jsonl` | 308 flagged paragraphs (tightened criteria) |
| `data/gold/holdout-rerun-v35.jsonl` | 359 confusion-axis paragraph IDs |
| `data/gold/holdout-rerun-v35b.jsonl` | 26 regression paragraph IDs |
| `data/gold/regression-pids.json` | Regression PIDs by axis |
| `scripts/compare-v30-v35.py` | v3.0 vs v3.5a comparison |
| `scripts/compare-v30-v35-final.py` | v3.0 vs v3.5f comparison |
| `scripts/examine-v35-errors.py` | Error analysis for iteration |
| `scripts/extract-regression-pids.py` | Identify regression paragraphs |
| `scripts/flag-stage1-corrections.py` | Flag Stage 1 corrections (tightened) |
| `scripts/identify-holdout-rerun.py` | Identify confusion-axis holdout paragraphs |
| `docs/LABELING-CODEBOOK.md` | v3.5 rulings + version history |
| `docs/NARRATIVE.md` | Phase 15 with full iteration detail |
| `docs/STATUS.md` | v3.5 section added |
625 scripts/adjudicate-gold-experiment.py Normal file
@ -0,0 +1,625 @@
"""
|
||||||
|
Gold Set Adjudication — Experimental Harness
|
||||||
|
=============================================
|
||||||
|
|
||||||
|
Runs the adjudication pipeline with toggleable interventions, one variable
|
||||||
|
at a time, and produces comparable metrics for each configuration.
|
||||||
|
|
||||||
|
Experiments:
|
||||||
|
baseline — Current production adjudication (92 T5 cases)
|
||||||
|
exp1_gemini — Exclude Gemini from MR↔RMP axis when Gemini voted MR
|
||||||
|
exp2_board — Board-removal test overrides for BG↔RMP T5 cases
|
||||||
|
exp3_committee — Committee-level test overrides for BG↔MR T5 cases
|
||||||
|
exp4_idsi — ID↔SI volume-dominant tiebreaker
|
||||||
|
exp5_spec — Specificity hybrid (human unanimous → human, split → model)
|
||||||
|
combined — All validated interventions stacked
|
||||||
|
|
||||||
|
Usage:
|
||||||
|
uv run scripts/adjudicate-gold-experiment.py [experiment_name|all]
|
||||||
|
"""
|
||||||
|
import json
|
||||||
|
import sys
|
||||||
|
from collections import Counter, defaultdict
|
||||||
|
from dataclasses import dataclass, field
|
||||||
|
from pathlib import Path
|
||||||
|
|
||||||
|
ROOT = Path(__file__).resolve().parent.parent
|
||||||
|
|
||||||
|
# ── IMPORTS FROM PRODUCTION SCRIPT ──────────────────────────────────────
# These are the existing overrides from adjudicate-gold.py, kept identical
# so the baseline matches production exactly.

SI_NO_OVERRIDES: dict[str, tuple[str, str]] = {
    "026c8eca": ("None/Other", "Speculation: 'could potentially result in' -- no materiality assessment"),
    "160fec46": ("None/Other", "Resource lament: 'do not have manpower' -- no materiality assessment"),
    "1f29ea8c": ("None/Other", "Speculation: 'could have material adverse effect' boilerplate"),
    "20c70335": ("None/Other", "Risk list: 'A breach could lead to...' -- enumeration, not assessment"),
    "303685cf": ("None/Other", "Speculation: 'could materially adversely affect'"),
    "7d021fcc": ("None/Other", "Speculation: 'could...have a material adverse effect'"),
    "7ef53cab": ("None/Other", "Risk enumeration: 'could lead to... could disrupt... could steal...'"),
    "a0d01951": ("None/Other", "Speculation: 'could adversely affect our business'"),
    "aaa8974b": ("None/Other", "Speculation: 'could potentially have a material impact' -- Case 9 fix"),
    "b058dca1": ("None/Other", "Speculation: 'could disrupt our operations'"),
    "b1b216b6": ("None/Other", "Speculation: 'could materially adversely affect'"),
    "dc8a2798": ("None/Other", "Speculation: 'If compromised, we could be subject to...'"),
    "e4bd0e2f": ("None/Other", "Speculation: 'could have material adverse impact'"),
    "f4656a7e": ("None/Other", "Threat enumeration under SI-sounding header -- no assessment"),
    "2e8cbdbf": ("None/Other", "Cross-ref: 'We describe whether and how... under the headings [risk factors]'"),
    "75de7441": ("None/Other", "Cross-ref: 'We describe whether and how... under the heading [risk factor]'"),
    "78cad2a1": ("None/Other", "Cross-ref: 'In our Risk Factors, we describe whether and how...'"),
    "3879887f": ("None/Other", "Brief incident mention + 'See Item 1A' cross-reference"),
    "f026f2be": ("None/Other", "Risk factor heading/cross-reference -- not an assessment"),
    "5df3a6c9": ("None/Other", "IT importance statement -- no assessment. H=1/3 SI"),
    "d5dc17c2": ("None/Other", "Risk enumeration -- no assessment. H=1/3 SI"),
    "c10f2a54": ("None/Other", "Early-stage/SPAC + weak negative assertion. SPAC rule dominates"),
    "45961c99": ("None/Other", "Past disruption but no materiality language. Primarily speculation"),
    "1673f332": ("None/Other", "SPAC with assessment at end -- SPAC rule dominates per Case 8"),
    "f75ac78a": ("Risk Management Process", "Resource expenditure on cybersecurity -- RMP per person-removal test"),
    "367108c2": ("Strategy Integration", "Negative assertion: 'not aware of having experienced any prior material data breaches'"),
    "837e31d5": ("Strategy Integration", "Negative assertion: 'did not experience any cybersecurity incident during 2024'"),
}

T5_CODEBOOK_OVERRIDES: dict[str, tuple[str, str]] = {
    "15e7cf99": ("Strategy Integration", "SI/ID tiebreaker: 'have not encountered any risks' -- materiality assessment, no specific incident described"),
    "6dc6bb4a": ("Incident Disclosure", "SI/ID tiebreaker: 'ransomware attack in October 2021' -- describes specific incident with date"),
    "c71739a9": ("Risk Management Process", "TP/RMP: Fund relies on CCO and adviser's risk management expertise -- third parties supporting internal process"),
}
# ── EXPERIMENT-SPECIFIC OVERRIDES ───────────────────────────────────────

# Exp 2/3: Board-removal + committee-level test overrides (with-board paragraphs)
# These 5 paragraphs mention "board" so the automated no-board test can't catch them.
# Each read manually; board-removal test applied to determine if board mention is
# incidental or substantive.
MANUAL_BOARD_OVERRIDES: dict[str, tuple[str, str]] = {
    # Board = 1/5 sentences + final notification clause. CISO/ISIRT/incident
    # response plan dominate the content. Board oversight is incidental attribution.
    "22da6695": ("Risk Management Process",
                 "Board-removal: 'Board is also responsible for approval' (1 sentence) + "
                 "'notifying the Board' (final clause). Remove → CISO + IS Program + incident "
                 "response plan. Process dominates."),
    # Titled 'Management's Role.' Compliance Committee = management-level (CIO,
    # executives). Board mentioned 2x as information destination only.
    "a2ff7e1e": ("Management Role",
                 "Committee-level: Compliance Committee is management-level (O'Reilly executives). "
                 "Board is incidental destination (2 clauses). Titled 'Management's Role.'"),
    # Very brief (3 sentences). Management oversees + board notification + 'Public
    # Offering' (registration statement). Board is incident notification only.
    "cb518f47": ("Management Role",
                 "Board-removal: remove notification sentence → 'management oversees cybersecurity.' "
                 "Board is incident notification destination only. Brief paragraph."),
}

# Exp 4: Codebook tiebreaker overrides (beyond existing T5_CODEBOOK_OVERRIDES)
# Each paragraph read in full and classified by codebook rules.
CODEBOOK_OVERRIDES: dict[str, tuple[str, str]] = {
    # ── ID↔SI: negative assertion = materiality assessment → SI ──────────
    "0ceeb618": ("Strategy Integration",
                 "ID/SI: Opens with negative assertion ('no material incidents'), Feb 2025 "
                 "incident is brief context + 'has not had material impact' conclusion. "
                 "Materiality assessment frame dominates → SI"),
    "cc82eb9f": ("Strategy Integration",
                 "ID/SI: June 2018 incident is example within broader negative materiality "
                 "assertion ('have not materially affected us'). Assessment frame dominates → SI"),
    # ── SPAC rule (Case 8): pre-revenue company → N/O ────────────────────
    "203ccd43": ("None/Other",
                 "SPAC: 'once the Company commences operations' — pre-revenue company. "
                 "Case 8: SPAC → N/O regardless of management role language"),
    # ── ID→RMP: post-incident improvements, no incident described ────────
    "f549fd64": ("Risk Management Process",
                 "ID/RMP: 'Following this cybersecurity event' — refers to incident without "
                 "describing it. 100% of content is hardening, training, MFA, EDR — pure RMP"),
}
@dataclass
class ExperimentConfig:
    name: str
    description: str
    exclude_gemini_mr_rmp: bool = False
    apply_board_removal: bool = False
    apply_committee_level: bool = False
    apply_idsi_tiebreaker: bool = False
    apply_specificity_hybrid: bool = False
    # Text-based: remove BG model votes when "board" absent from paragraph text
    apply_no_board_bg_removal: bool = False


@dataclass
class ExperimentResult:
    config: ExperimentConfig
    total: int = 0
    tier_counts: dict[str, int] = field(default_factory=dict)
    category_dist: dict[str, int] = field(default_factory=dict)
    human_maj_dist: dict[str, int] = field(default_factory=dict)
    flipped_from_human: int = 0
    source_accuracy: dict[str, float] = field(default_factory=dict)
    t5_by_axis: dict[str, int] = field(default_factory=dict)
    t5_weak_plurality: int = 0  # 4-5/9
    results: list[dict] = field(default_factory=list)
    spec_changes: int = 0
def load_jsonl(path: Path) -> list[dict]:
    with open(path) as f:
        return [json.loads(line) for line in f]


def majority_vote(votes: list[str]) -> str | None:
    if not votes:
        return None
    return Counter(votes).most_common(1)[0][0]


def get_confusion_axis(human_votes: dict, model_votes: dict) -> str:
    """Identify the confusion axis from vote distributions."""
    all_cats = sorted(set(list(human_votes.keys()) + list(model_votes.keys())))
    if len(all_cats) == 2:
        return f"{all_cats[0]}↔{all_cats[1]}"
    return "↔".join(all_cats)
def run_experiment(config: ExperimentConfig) -> ExperimentResult:
    """Run adjudication with a specific experimental configuration."""

    # ── Load data ─────────────────────────────────────────────────────
    human_labels: dict[str, list[dict]] = defaultdict(list)
    for r in load_jsonl(ROOT / "data/gold/human-labels-raw.jsonl"):
        human_labels[r["paragraphId"]].append({
            "cat": r["contentCategory"],
            "spec": r["specificityLevel"],
            "annotator": r["annotatorName"],
        })

    confusion_pids = {r["paragraphId"] for r in load_jsonl(ROOT / "data/gold/holdout-rerun-v35.jsonl")}

    TOP6 = ["Opus", "GPT-5.4", "Gemini", "GLM-5", "Kimi", "MIMO"]

    def load_model_cats(files: dict[str, Path]) -> dict[str, dict[str, str]]:
        result: dict[str, dict[str, str]] = {}
        for name, path in files.items():
            result[name] = {}
            if path.exists():
                for r in load_jsonl(path):
                    cat = r.get("label", {}).get("content_category") or r.get("content_category")
                    if cat:
                        result[name][r["paragraphId"]] = cat
            # Also load specificity for exp5
            result[f"{name}_spec"] = {}
            if path.exists():
                for r in load_jsonl(path):
                    spec = r.get("label", {}).get("specificity_level") or r.get("specificity_level")
                    if spec is not None:
                        result[f"{name}_spec"][r["paragraphId"]] = spec
        return result

    v30_cats = load_model_cats({
        "Opus": ROOT / "data/annotations/golden/opus.jsonl",
        "GPT-5.4": ROOT / "data/annotations/bench-holdout/gpt-5.4.jsonl",
        "Gemini": ROOT / "data/annotations/bench-holdout/gemini-3.1-pro-preview.jsonl",
        "GLM-5": ROOT / "data/annotations/bench-holdout/glm-5:exacto.jsonl",
        "Kimi": ROOT / "data/annotations/bench-holdout/kimi-k2.5.jsonl",
        "MIMO": ROOT / "data/annotations/bench-holdout/mimo-v2-pro:exacto.jsonl",
    })

    v35_cats = load_model_cats({
        "Opus": ROOT / "data/annotations/golden-v35/opus.jsonl",
        "GPT-5.4": ROOT / "data/annotations/bench-holdout-v35/gpt-5.4.jsonl",
        "Gemini": ROOT / "data/annotations/bench-holdout-v35/gemini-3.1-pro-preview.jsonl",
        "GLM-5": ROOT / "data/annotations/bench-holdout-v35/glm-5:exacto.jsonl",
        "Kimi": ROOT / "data/annotations/bench-holdout-v35/kimi-k2.5.jsonl",
        "MIMO": ROOT / "data/annotations/bench-holdout-v35/mimo-v2-pro:exacto.jsonl",
    })

    # Merge v3.0 + v3.5 (v3.5 for confusion PIDs)
    model_cats: dict[str, dict[str, str]] = {}
    model_specs: dict[str, dict[str, int]] = {}
    for m in TOP6:
        model_cats[m] = {}
        model_specs[m] = {}
        for pid in human_labels:
            if pid in confusion_pids and pid in v35_cats.get(m, {}):
                model_cats[m][pid] = v35_cats[m][pid]
            elif pid in v30_cats.get(m, {}):
                model_cats[m][pid] = v30_cats[m][pid]
            # Specificity (always v3.0 for full coverage)
            if pid in v30_cats.get(f"{m}_spec", {}):
                model_specs[m][pid] = v30_cats[f"{m}_spec"][pid]
    # ── Adjudicate ────────────────────────────────────────────────────
    result = ExperimentResult(config=config)
    tier_counts: Counter[str] = Counter()

    for pid in sorted(human_labels.keys()):
        h_cats = [l["cat"] for l in human_labels[pid]]
        h_specs = [l["spec"] for l in human_labels[pid]]
        h_cat_maj = majority_vote(h_cats)
        h_spec_maj = majority_vote(h_specs)
        h_spec_unanimous = len(set(h_specs)) == 1

        # Use full model panel for tier calculation (T1-T4 stability)
        active_models = list(TOP6)

        m_cats_list = [model_cats[m][pid] for m in active_models if pid in model_cats[m]]
        m_cat_maj = majority_vote(m_cats_list)
        m_cat_unanimous = len(set(m_cats_list)) == 1 and len(m_cats_list) == len(active_models)

        all_signals = h_cats + m_cats_list
        signal_counter = Counter(all_signals)
        total_signals = len(all_signals)
        top_signal, top_count = signal_counter.most_common(1)[0]

        short_pid = pid[:8]
        si_override = SI_NO_OVERRIDES.get(short_pid)

        gold_cat: str | None = None
        tier: str = ""
        reason: str = ""

        if si_override:
            gold_cat = si_override[0]
            tier = "T3-rule"
            reason = f"SI/NO override: {si_override[1]}"
        elif top_count >= 8 and total_signals >= 8:
            gold_cat = top_signal
            tier = "T1-super"
            reason = f"{top_count}/{total_signals} signals agree"
        elif h_cat_maj == m_cat_maj:
            gold_cat = h_cat_maj
            tier = "T2-cross"
            reason = "Human + model majority agree"
        elif m_cat_unanimous:
            gold_cat = m_cat_maj
            tier = "T4-model"
            h_count = Counter(h_cats).most_common(1)[0][1]
            reason = f"{len(m_cats_list)}/{len(m_cats_list)} models unanimous ({m_cat_maj}) vs human {h_count}/3 ({h_cat_maj})"
        else:
            # Check rule-based overrides
            t5_override = T5_CODEBOOK_OVERRIDES.get(short_pid)

            # Exp 2/3: Manual board-removal + committee-level test (with-board paragraphs)
            board_override = MANUAL_BOARD_OVERRIDES.get(short_pid) if (config.apply_board_removal or config.apply_committee_level) else None

            # Exp 4: Codebook tiebreaker overrides
            codebook_override = CODEBOOK_OVERRIDES.get(short_pid) if config.apply_idsi_tiebreaker else None

            if t5_override:
                gold_cat = t5_override[0]
                tier = "T3-rule"
                reason = f"T5 codebook override: {t5_override[1]}"
            elif board_override:
                gold_cat = board_override[0]
                tier = "T3-rule"
                reason = f"Board/committee test: {board_override[1]}"
            elif codebook_override:
                gold_cat = codebook_override[0]
                tier = "T3-rule"
                reason = f"Codebook tiebreaker: {codebook_override[1]}"
            else:
                t5_signals = list(all_signals)
                t5_total = total_signals
                suffix = ""

                # ── Exp 1: Gemini exclusion at T5 resolution only ─────
                if config.exclude_gemini_mr_rmp:
                    gemini_cat = model_cats.get("Gemini", {}).get(pid)
                    if gemini_cat == "Management Role":
                        other_m_cats = [model_cats[m][pid] for m in TOP6 if m != "Gemini" and pid in model_cats[m]]
                        other_m_maj = majority_vote(other_m_cats) if other_m_cats else None
                        if other_m_maj != "Management Role":
                            t5_signals = h_cats + other_m_cats
                            t5_total = len(t5_signals)
                            suffix += " [Gemini MR excluded]"

                # ── Exp 2b: No-board BG vote removal ─────────────────
                # If "board" (case-insensitive) doesn't appear in the paragraph
                # text, BG model votes are provably unsupported — the paragraph
                # can't be about board governance if it never mentions the board.
                # Remove those BG signals and recalculate plurality.
                if config.apply_no_board_bg_removal:
                    para_texts = load_paragraph_texts()
                    para_text = para_texts.get(pid, "")
                    if "board" not in para_text.lower():
                        bg_count = sum(1 for s in t5_signals if s == "Board Governance")
                        if bg_count > 0:
                            t5_signals = [s for s in t5_signals if s != "Board Governance"]
                            t5_total = len(t5_signals)
                            if t5_signals:
                                suffix += f" [BG removed: no 'board' in text, {bg_count} votes dropped]"

                if t5_signals:
                    t5_counter = Counter(t5_signals)
                    t5_top, t5_top_count = t5_counter.most_common(1)[0]
                else:
                    t5_top, t5_top_count = top_signal, top_count

                gold_cat = t5_top
                tier = "T5-plurality"
                reason = f"Mixed: human={h_cat_maj}, model={m_cat_maj}, plurality={t5_top} ({t5_top_count}/{t5_total}){suffix}"
        # ── Specificity ───────────────────────────────────────────────
        if config.apply_specificity_hybrid and not h_spec_unanimous:
            # Human split → use model majority
            m_specs = [model_specs[m][pid] for m in TOP6 if pid in model_specs[m]]
            if m_specs:
                gold_spec = majority_vote([str(s) for s in m_specs])
                gold_spec = int(gold_spec) if gold_spec else h_spec_maj
                if gold_spec != h_spec_maj:
                    result.spec_changes += 1
            else:
                gold_spec = h_spec_maj
        else:
            gold_spec = h_spec_maj

        tier_counts[tier] += 1

        row = {
            "paragraphId": pid,
            "gold_category": gold_cat,
            "gold_specificity": gold_spec,
            "tier": tier,
            "reason": reason,
            "human_majority": h_cat_maj,
            "model_majority": m_cat_maj,
            "human_votes": dict(Counter(h_cats)),
            "model_votes": dict(Counter(m_cats_list)),
        }
        result.results.append(row)

        if tier == "T5-plurality":
            axis = get_confusion_axis(dict(Counter(h_cats)), dict(Counter(m_cats_list)))
            result.t5_by_axis[axis] = result.t5_by_axis.get(axis, 0) + 1
            if top_count <= 5:
                result.t5_weak_plurality += 1

    result.total = len(result.results)
    result.tier_counts = dict(sorted(tier_counts.items()))
    result.flipped_from_human = sum(1 for r in result.results if r["gold_category"] != r["human_majority"])
    result.category_dist = dict(Counter(r["gold_category"] for r in result.results))
    result.human_maj_dist = dict(Counter(r["human_majority"] for r in result.results))

    # Source accuracy vs gold
    gold_by_pid = {r["paragraphId"]: r["gold_category"] for r in result.results}

    # Human annotators
    annotator_names = sorted(set(l["annotator"] for labels in human_labels.values() for l in labels))
    for ann in annotator_names:
        agree = total = 0
        for pid, labels in human_labels.items():
            for l in labels:
                if l["annotator"] == ann and pid in gold_by_pid:
                    total += 1
                    if l["cat"] == gold_by_pid[pid]:
                        agree += 1
        if total > 0:
            result.source_accuracy[f"H:{ann}"] = agree / total

    # Models (v3.0 on full 1200)
    for m in TOP6:
        agree = total = 0
        for pid in gold_by_pid:
            if pid in v30_cats.get(m, {}):
                total += 1
                if v30_cats[m][pid] == gold_by_pid[pid]:
                    agree += 1
        if total > 0:
            result.source_accuracy[f"M:{m}"] = agree / total

    return result
def print_result(r: ExperimentResult, baseline: ExperimentResult | None = None) -> None:
    """Print experiment results with optional delta from baseline."""
    print(f"\n{'=' * 90}")
    print(f"EXPERIMENT: {r.config.name}")
    print(f"  {r.config.description}")
    print(f"{'=' * 90}")

    print(f"\nTier distribution:")
    for tier in ["T1-super", "T2-cross", "T3-rule", "T4-model", "T5-plurality"]:
        count = r.tier_counts.get(tier, 0)
        pct = count / r.total * 100
        delta = ""
        if baseline:
            bc = baseline.tier_counts.get(tier, 0)
            if count != bc:
                delta = f" (Δ {count - bc:+d})"
        print(f"  {tier:<16} {count:>5} ({pct:.1f}%){delta}")

    print(f"\nGold ≠ human majority: {r.flipped_from_human} ({r.flipped_from_human / r.total:.1%})")
    if baseline and r.flipped_from_human != baseline.flipped_from_human:
        print(f"  (Δ {r.flipped_from_human - baseline.flipped_from_human:+d})")

    if r.t5_by_axis:
        t5_total = sum(r.t5_by_axis.values())
        print(f"\nT5 remaining ({t5_total} cases):")
        for axis, count in sorted(r.t5_by_axis.items(), key=lambda x: -x[1])[:10]:
            print(f"  {axis:<60} {count:>3}")
        print(f"  Weak plurality (4-5/9): {r.t5_weak_plurality}")

    print(f"\nCategory distribution (gold):")
    all_cats = sorted(set(list(r.category_dist.keys()) + list(r.human_maj_dist.keys())))
    print(f"  {'Category':<25} {'Gold':>6} {'H-Maj':>6} {'Δ':>5}", end="")
    if baseline:
        print(f" {'Prev':>6} {'ΔExp':>5}", end="")
    print()
    for cat in all_cats:
        g = r.category_dist.get(cat, 0)
        h = r.human_maj_dist.get(cat, 0)
        line = f"  {cat:<25} {g:>6} {h:>6} {g - h:>+5}"
        if baseline:
            bg = baseline.category_dist.get(cat, 0)
            line += f" {bg:>6} {g - bg:>+5}"
        print(line)

    print(f"\nSource accuracy vs gold:")
    # Sort by accuracy descending
    for source, acc in sorted(r.source_accuracy.items(), key=lambda x: -x[1]):
        delta = ""
        if baseline and source in baseline.source_accuracy:
            ba = baseline.source_accuracy[source]
            diff = acc - ba
            if abs(diff) >= 0.0005:
                delta = f" (Δ {diff:+.1%})"
        print(f"  {source:<16} {acc:.1%}{delta}")

    if r.config.apply_specificity_hybrid:
        print(f"\nSpecificity: {r.spec_changes} labels changed from human majority to model majority")
def diff_results(a: ExperimentResult, b: ExperimentResult) -> list[dict]:
    """Find paragraphs where gold_category differs between two experiments."""
    a_map = {r["paragraphId"]: r for r in a.results}
    b_map = {r["paragraphId"]: r for r in b.results}
    diffs = []
    for pid in sorted(a_map.keys()):
        if a_map[pid]["gold_category"] != b_map[pid]["gold_category"]:
            diffs.append({
                "paragraphId": pid,
                "before": a_map[pid]["gold_category"],
                "after": b_map[pid]["gold_category"],
                "before_tier": a_map[pid]["tier"],
                "after_tier": b_map[pid]["tier"],
                "human_majority": a_map[pid]["human_majority"],
                "reason_after": b_map[pid]["reason"],
            })
    return diffs


# ── PARAGRAPH TEXT LOADER (for text-based tests) ───────────────────────
_paragraph_texts: dict[str, str] | None = None

def load_paragraph_texts() -> dict[str, str]:
    global _paragraph_texts
    if _paragraph_texts is None:
        _paragraph_texts = {}
        for r in load_jsonl(ROOT / "data/gold/paragraphs-holdout.jsonl"):
            _paragraph_texts[r["id"]] = r["text"]
    return _paragraph_texts
EXPERIMENTS = {
    "baseline": ExperimentConfig(
        name="baseline",
        description="Current production adjudication (no changes)",
    ),
    "exp1_gemini": ExperimentConfig(
        name="exp1_gemini",
        description="Exclude Gemini from MR↔RMP axis when Gemini voted MR",
        exclude_gemini_mr_rmp=True,
    ),
    "exp2_board": ExperimentConfig(
        name="exp2_board",
        description="Board-removal test overrides for BG↔RMP T5 cases",
        apply_board_removal=True,
    ),
    "exp2b_noboard": ExperimentConfig(
        name="exp2b_noboard",
        description="Remove BG model votes when 'board' absent from paragraph text (automated, verifiable)",
        apply_no_board_bg_removal=True,
    ),
    "exp3_committee": ExperimentConfig(
        name="exp3_committee",
        description="Committee-level test overrides for BG↔MR T5 cases",
        apply_committee_level=True,
    ),
    "exp4_idsi": ExperimentConfig(
        name="exp4_idsi",
        description="ID↔SI volume-dominant tiebreaker",
        apply_idsi_tiebreaker=True,
    ),
    "exp5_spec": ExperimentConfig(
        name="exp5_spec",
        description="Specificity hybrid: human unanimous → human, split → model majority",
        apply_specificity_hybrid=True,
    ),
    "combined": ExperimentConfig(
        name="combined",
        description="All validated interventions: no-board BG removal + manual board overrides + codebook tiebreakers + specificity hybrid",
        apply_no_board_bg_removal=True,
        apply_board_removal=True,
        apply_idsi_tiebreaker=True,
        apply_specificity_hybrid=True,
    ),
}
def main() -> None:
    experiments_to_run = sys.argv[1:] if len(sys.argv) > 1 else ["all"]

    if "all" in experiments_to_run:
        experiments_to_run = list(EXPERIMENTS.keys())

    # Always run baseline first
    if "baseline" not in experiments_to_run:
        experiments_to_run.insert(0, "baseline")

    results: dict[str, ExperimentResult] = {}
    baseline: ExperimentResult | None = None

    for exp_name in experiments_to_run:
        if exp_name not in EXPERIMENTS:
            print(f"Unknown experiment: {exp_name}")
            continue

        r = run_experiment(EXPERIMENTS[exp_name])
        results[exp_name] = r

        if exp_name == "baseline":
            baseline = r
            print_result(r)
        else:
            print_result(r, baseline)

            # Show specific label changes
            if baseline:
                diffs = diff_results(baseline, r)
                if diffs:
                    print(f"\n  Label changes ({len(diffs)}):")
                    for d in diffs:
                        print(f"    {d['paragraphId'][:8]}: {d['before']:<25} → {d['after']:<25} (H={d['human_majority']}) [{d['after_tier']}]")

    # ── Acceptance criteria check ─────────────────────────────────────
    if baseline and len(results) > 1:
        print(f"\n{'=' * 90}")
        print("ACCEPTANCE CRITERIA SUMMARY")
        print(f"{'=' * 90}")
        print(f"\nCriteria:")
        print(f"  1. T5 count decreases (fewer arbitrary resolutions)")
        print(f"  2. Source accuracy: no model/human drops >1% (intervention isn't distorting)")
        print(f"  3. Category distribution: no category shifts >±5% of its baseline count")
        print(f"  4. Changes are principled (each has documented codebook justification)")
        print()

        for exp_name, r in results.items():
            if exp_name == "baseline":
                continue
            t5_base = baseline.tier_counts.get("T5-plurality", 0)
            t5_exp = r.tier_counts.get("T5-plurality", 0)
            t5_pass = t5_exp <= t5_base

            max_acc_drop = 0.0
            for source in baseline.source_accuracy:
                if source in r.source_accuracy:
                    drop = baseline.source_accuracy[source] - r.source_accuracy[source]
                    max_acc_drop = max(max_acc_drop, drop)
            acc_pass = max_acc_drop < 0.01

            max_cat_shift_pct = 0.0
            for cat in baseline.category_dist:
                base_n = baseline.category_dist.get(cat, 0)
                exp_n = r.category_dist.get(cat, 0)
                if base_n > 0:
                    shift = abs(exp_n - base_n) / base_n
                    max_cat_shift_pct = max(max_cat_shift_pct, shift)
            cat_pass = max_cat_shift_pct < 0.05

            status = "✓ PASS" if (t5_pass and acc_pass and cat_pass) else "✗ FAIL"
            print(f"  {exp_name:<20} {status}")
            print(f"    T5: {t5_base} → {t5_exp} (Δ {t5_exp - t5_base:+d}) {'✓' if t5_pass else '✗'}")
            print(f"    Max accuracy drop: {max_acc_drop:.2%} {'✓' if acc_pass else '✗'}")
            print(f"    Max category shift: {max_cat_shift_pct:.1%} {'✓' if cat_pass else '✗'}")


if __name__ == "__main__":
    main()
393 scripts/adjudicate-gold.py Normal file
@ -0,0 +1,393 @@
"""
|
||||||
|
Gold Set Adjudication Script (v2)
|
||||||
|
==================================
|
||||||
|
|
||||||
|
Produces gold labels for the 1,200 holdout paragraphs using a tiered adjudication
|
||||||
|
strategy that combines 6 human annotators (3 per paragraph via BIBD) + 6 GenAI
|
||||||
|
models (MiniMax excluded per documented statistical outlier analysis, z=-2.07).
|
||||||
|
|
||||||
|
Each paragraph has up to 9 signals: 3 human + 6 model.
|
||||||
|
|
||||||
|
Tier system:
|
||||||
|
T1: Super-consensus — >=8/9 signals agree -> auto-gold (near-unanimous)
|
||||||
|
T2: Human majority + model majority agree -> cross-validated gold
|
||||||
|
T3: Rule-based override — 27 SI<->N/O paragraphs + 10 codebook tiebreakers,
|
||||||
|
each analyzed paragraph-by-paragraph against codebook rules and actual text.
|
||||||
|
T4: Model unanimous (6/6) + human majority disagree -> model label.
|
||||||
|
T5: Remaining disagreements -> plurality with text-based BG vote removal.
|
||||||
|
|
||||||
|
v2 changes (experimentally validated, see docs/T5-ANALYSIS.md):
|
||||||
|
- 10 new T5 codebook overrides (ID/SI, SPAC, board-removal, committee-level)
|
||||||
|
- Text-based BG vote removal: if "board" absent from paragraph text, BG model
|
||||||
|
votes are removed before T5 plurality. 13 labels changed, source accuracy UP
|
||||||
|
for 10/12 sources (+0.5-1.1% for top sources).
|
||||||
|
- Specificity hybrid: human unanimous -> human label, human split -> model majority.
|
||||||
|
195 specificity labels updated. Model-model spec agreement is 87-91% vs
|
||||||
|
human consensus of 52.5%.
|
||||||
|
|
||||||
|
Net effect: T5 reduced 92->85 (-7), source accuracy: Opus 88.6->89.1%, GPT-5.4
|
||||||
|
87.4->88.5%, gold!=human 151->144. 20 category labels changed, 195 specificity.
|
||||||
|
|
||||||
|
Usage:
|
||||||
|
uv run scripts/adjudicate-gold.py
|
||||||
|
"""
|
||||||
|
import json
|
||||||
|
from collections import Counter, defaultdict
|
||||||
|
from pathlib import Path
|
||||||
|
|
||||||
|
ROOT = Path(__file__).resolve().parent.parent
|
||||||
|
|
||||||
|
# ── SI<->N/O RULE-BASED OVERRIDES ────────────────────────────────────────
#
# These 27 paragraphs were analyzed INDIVIDUALLY against codebook rules and
# actual paragraph text. This is NOT a blanket override -- each paragraph was
# read, assessed against the assessment-vs-speculation distinction (Rule 6),
# the cross-reference exception, and the SPAC rule (Case 8).
#
# The analysis found that ~20/25 "Human=SI, Model=N/O" cases are human errors:
# annotators systematically treat ANY mention of "material" + "business strategy"
# as SI, even when the language is pure "could/if/may" speculation. The codebook's
# distinction is correct; humans weren't consistently applying it.
#
# The 2 "Human=N/O, Model=SI" cases are also human errors: both contain clear
# negative assertions ("not aware of having experienced any prior material
# incidents") which are textbook SI per Rule 6.
#
# Full analysis: docs/V35-ITERATION-LOG.md "The SI<->N/O Paradox -- Resolved"

SI_NO_OVERRIDES: dict[str, tuple[str, str]] = {
    # ── Speculation, not assessment (Human=SI -> N/O) ─────────────────────
    "026c8eca": ("None/Other", "Speculation: 'could potentially result in' -- no materiality assessment"),
    "160fec46": ("None/Other", "Resource lament: 'do not have manpower' -- no materiality assessment"),
    "1f29ea8c": ("None/Other", "Speculation: 'could have material adverse effect' boilerplate"),
    "20c70335": ("None/Other", "Risk list: 'A breach could lead to...' -- enumeration, not assessment"),
    "303685cf": ("None/Other", "Speculation: 'could materially adversely affect'"),
    "7d021fcc": ("None/Other", "Speculation: 'could...have a material adverse effect'"),
    "7ef53cab": ("None/Other", "Risk enumeration: 'could lead to... could disrupt... could steal...'"),
    "a0d01951": ("None/Other", "Speculation: 'could adversely affect our business'"),
    "aaa8974b": ("None/Other", "Speculation: 'could potentially have a material impact' -- Case 9 fix"),
    "b058dca1": ("None/Other", "Speculation: 'could disrupt our operations'"),
    "b1b216b6": ("None/Other", "Speculation: 'could materially adversely affect'"),
    "dc8a2798": ("None/Other", "Speculation: 'If compromised, we could be subject to...'"),
    "e4bd0e2f": ("None/Other", "Speculation: 'could have material adverse impact'"),
    "f4656a7e": ("None/Other", "Threat enumeration under SI-sounding header -- no assessment"),
    # ── Cross-references (Human=SI -> N/O) ────────────────────────────────
    "2e8cbdbf": ("None/Other", "Cross-ref: 'We describe whether and how... under the headings [risk factors]'"),
    "75de7441": ("None/Other", "Cross-ref: 'We describe whether and how... under the heading [risk factor]'"),
    "78cad2a1": ("None/Other", "Cross-ref: 'In our Risk Factors, we describe whether and how...'"),
    "3879887f": ("None/Other", "Brief incident mention + 'See Item 1A' cross-reference"),
    "f026f2be": ("None/Other", "Risk factor heading/cross-reference -- not an assessment"),
    # ── No materiality assessment present (Human=SI -> N/O) ───────────────
    "5df3a6c9": ("None/Other", "IT importance statement -- no assessment. H=1/3 SI"),
    "d5dc17c2": ("None/Other", "Risk enumeration -- no assessment. H=1/3 SI"),
    "c10f2a54": ("None/Other", "Early-stage/SPAC + weak negative assertion. SPAC rule dominates"),
    "45961c99": ("None/Other", "Past disruption but no materiality language. Primarily speculation"),
    "1673f332": ("None/Other", "SPAC with assessment at end -- SPAC rule dominates per Case 8"),
    "f75ac78a": ("Risk Management Process", "Resource expenditure on cybersecurity -- RMP per person-removal test"),
    # ── Negative assertions ARE assessments (Human=N/O -> SI) ─────────────
    "367108c2": ("Strategy Integration", "Negative assertion: 'not aware of having experienced any prior material data breaches'"),
    "837e31d5": ("Strategy Integration", "Negative assertion: 'did not experience any cybersecurity incident during 2024'"),
}
# ── T5 CODEBOOK RESOLUTIONS ──────────────────────────────────────────────
|
||||||
|
#
|
||||||
|
# Additional rule-based overrides for T5-plurality cases where codebook
|
||||||
|
# tiebreakers clearly resolve the disagreement. Applied AFTER plurality
|
||||||
|
# resolution as a correction layer.
|
||||||
|
#
|
||||||
|
# SI<->ID tiebreaker: "DESCRIBES what happened -> ID; ONLY discusses
|
||||||
|
# cost/materiality -> SI; brief mention + materiality conclusion -> SI"
|
||||||
|
#
|
||||||
|
# TP<->RMP central-topic test: third parties supporting internal
|
||||||
|
# program -> RMP; vendor oversight as central topic -> TP
|
||||||
|
|
||||||
|
T5_CODEBOOK_OVERRIDES: dict[str, tuple[str, str]] = {
|
||||||
|
# ── SI<->ID: materiality assessment without incident narrative -> SI ──
|
||||||
|
"15e7cf99": ("Strategy Integration", "SI/ID tiebreaker: 'have not encountered any risks' -- materiality assessment, no specific incident described"),
|
||||||
|
# ── SI<->ID: specific incident with date -> ID ────────────────────────
|
||||||
|
"6dc6bb4a": ("Incident Disclosure", "SI/ID tiebreaker: 'ransomware attack in October 2021' -- describes specific incident with date"),
|
||||||
|
# ── TP<->RMP: third parties supporting internal program -> RMP ────────
|
||||||
|
"c71739a9": ("Risk Management Process", "TP/RMP: Fund relies on CCO and adviser's risk management expertise -- third parties supporting internal process"),
|
||||||
|
# ── ID<->SI: negative assertion = materiality assessment -> SI ────────
|
||||||
|
"0ceeb618": ("Strategy Integration", "ID/SI: opens with 'no material incidents', Feb 2025 incident is brief context + 'has not had material impact' conclusion. Materiality assessment frame dominates"),
|
||||||
|
"cc82eb9f": ("Strategy Integration", "ID/SI: June 2018 incident is example within broader negative materiality assertion ('have not materially affected us'). Assessment frame dominates"),
|
||||||
|
# ── SPAC rule (Case 8): pre-revenue company -> N/O ────────────────────
|
||||||
|
"203ccd43": ("None/Other", "SPAC: 'once the Company commences operations' -- pre-revenue company. Case 8: SPAC -> N/O regardless of management role language"),
|
||||||
|
# ── ID->RMP: post-incident improvements, no incident described ────────
|
||||||
|
"f549fd64": ("Risk Management Process", "ID/RMP: 'Following this cybersecurity event' -- refers to incident without describing it. 100% of content is hardening, training, MFA, EDR -- pure RMP"),
|
||||||
|
# ── Board-removal test: BG override where board mention is incidental ──
|
||||||
|
"22da6695": ("Risk Management Process", "Board-removal: 'Board is also responsible' (1 sentence) + 'notifying the Board' (final clause). Remove -> CISO + IS Program + incident response plan. Process dominates"),
|
||||||
|
"a2ff7e1e": ("Management Role", "Committee-level: Compliance Committee is management-level (O'Reilly executives). Board is incidental destination (2 clauses). Titled 'Management's Role'"),
|
||||||
|
"cb518f47": ("Management Role", "Board-removal: remove notification sentence -> 'management oversees cybersecurity.' Board is incident notification destination only"),
|
||||||
|
}
|
||||||
|
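
# Keys in both override dicts are 8-character paragraphId prefixes; the
# adjudication loop below looks them up via pid[:8].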


def load_jsonl(path: Path) -> list[dict]:
    with open(path) as f:
        return [json.loads(line) for line in f]


def load_paragraph_texts() -> dict[str, str]:
    """Load holdout paragraph texts for text-based adjudication rules."""
    return {r["id"]: r["text"] for r in load_jsonl(ROOT / "data/gold/paragraphs-holdout.jsonl")}


def majority_vote(votes: list[str]) -> str | None:
    if not votes:
        return None
    return Counter(votes).most_common(1)[0][0]
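

# A minimal sanity check of the tie behavior (hypothetical votes):
# Counter.most_common keeps first-insertion order for equal counts, so a
# 1-1 tie resolves to the earliest-seen label.
assert majority_vote(["SI", "NO", "SI"]) == "SI"
assert majority_vote(["SI", "NO"]) == "SI"  # tie -> first seen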


def main() -> None:
    # ── Load data ─────────────────────────────────────────────────────────

    human_labels: dict[str, list[dict]] = defaultdict(list)
    for r in load_jsonl(ROOT / "data/gold/human-labels-raw.jsonl"):
        human_labels[r["paragraphId"]].append({
            "cat": r["contentCategory"],
            "spec": r["specificityLevel"],
            "annotator": r["annotatorName"],
        })

    confusion_pids = {r["paragraphId"] for r in load_jsonl(ROOT / "data/gold/holdout-rerun-v35.jsonl")}

    TOP6 = ["Opus", "GPT-5.4", "Gemini", "GLM-5", "Kimi", "MIMO"]

    def load_model_cats(files: dict[str, Path]) -> dict[str, dict[str, str]]:
        result: dict[str, dict[str, str]] = {}
        for name, path in files.items():
            result[name] = {}
            if path.exists():
                for r in load_jsonl(path):
                    cat = r.get("label", {}).get("content_category") or r.get("content_category")
                    if cat:
                        result[name][r["paragraphId"]] = cat
        return result
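
    # Annotation records come in two shapes -- category nested under "label"
    # or flat at the top level; the `or` fallback above accepts either.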

    v30_cats = load_model_cats({
        "Opus": ROOT / "data/annotations/golden/opus.jsonl",
        "GPT-5.4": ROOT / "data/annotations/bench-holdout/gpt-5.4.jsonl",
        "Gemini": ROOT / "data/annotations/bench-holdout/gemini-3.1-pro-preview.jsonl",
        "GLM-5": ROOT / "data/annotations/bench-holdout/glm-5:exacto.jsonl",
        "Kimi": ROOT / "data/annotations/bench-holdout/kimi-k2.5.jsonl",
        "MIMO": ROOT / "data/annotations/bench-holdout/mimo-v2-pro:exacto.jsonl",
    })

    v35_cats = load_model_cats({
        "Opus": ROOT / "data/annotations/golden-v35/opus.jsonl",
        "GPT-5.4": ROOT / "data/annotations/bench-holdout-v35/gpt-5.4.jsonl",
        "Gemini": ROOT / "data/annotations/bench-holdout-v35/gemini-3.1-pro-preview.jsonl",
        "GLM-5": ROOT / "data/annotations/bench-holdout-v35/glm-5:exacto.jsonl",
        "Kimi": ROOT / "data/annotations/bench-holdout-v35/kimi-k2.5.jsonl",
        "MIMO": ROOT / "data/annotations/bench-holdout-v35/mimo-v2-pro:exacto.jsonl",
    })

    # Use v3.5 labels for confusion-axis PIDs (codebook-corrected), v3.0 for rest
    model_cats: dict[str, dict[str, str]] = {}
    for m in TOP6:
        model_cats[m] = {}
        for pid in human_labels:
            if pid in confusion_pids and pid in v35_cats.get(m, {}):
                model_cats[m][pid] = v35_cats[m][pid]
            elif pid in v30_cats.get(m, {}):
                model_cats[m][pid] = v30_cats[m][pid]

    # Load model specificity for hybrid specificity (v3.0 for full coverage)
    def load_model_specs(files: dict[str, Path]) -> dict[str, dict[str, int]]:
        result: dict[str, dict[str, int]] = {}
        for name, path in files.items():
            result[name] = {}
            if path.exists():
                for r in load_jsonl(path):
                    spec = r.get("label", {}).get("specificity_level") or r.get("specificity_level")
                    if spec is not None:
                        result[name][r["paragraphId"]] = spec
        return result

    model_specs = load_model_specs({
        "Opus": ROOT / "data/annotations/golden/opus.jsonl",
        "GPT-5.4": ROOT / "data/annotations/bench-holdout/gpt-5.4.jsonl",
        "Gemini": ROOT / "data/annotations/bench-holdout/gemini-3.1-pro-preview.jsonl",
        "GLM-5": ROOT / "data/annotations/bench-holdout/glm-5:exacto.jsonl",
        "Kimi": ROOT / "data/annotations/bench-holdout/kimi-k2.5.jsonl",
        "MIMO": ROOT / "data/annotations/bench-holdout/mimo-v2-pro:exacto.jsonl",
    })

    # Load paragraph texts for text-based adjudication rules
    para_texts = load_paragraph_texts()

    # ── Adjudicate ────────────────────────────────────────────────────────

    results: list[dict] = []
    tier_counts: Counter[str] = Counter()

    for pid in sorted(human_labels.keys()):
        h_cats = [l["cat"] for l in human_labels[pid]]
        h_specs = [l["spec"] for l in human_labels[pid]]
        h_cat_maj = majority_vote(h_cats)
        h_spec_maj = majority_vote(h_specs)
        h_cat_unanimous = len(set(h_cats)) == 1

        m_cats_list = [model_cats[m][pid] for m in TOP6 if pid in model_cats[m]]
        m_cat_maj = majority_vote(m_cats_list)
        m_cat_unanimous = len(set(m_cats_list)) == 1 and len(m_cats_list) == 6

        all_signals = h_cats + m_cats_list
        signal_counter = Counter(all_signals)
        total_signals = len(all_signals)
        top_signal, top_count = signal_counter.most_common(1)[0]

        short_pid = pid[:8]
        si_override = SI_NO_OVERRIDES.get(short_pid)
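
        # Tier cascade, in precedence order:
        #   T3-rule (SI/NO override) -> T1-super (>=8 signals agree)
        #   -> T2-cross (human and model majorities match)
        #   -> T4-model (6/6 models unanimous)
        #   -> T3-rule (T5 codebook override)
        #   -> T5-plurality (plurality after dropping unsupported BG votes).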
        gold_cat: str | None = None
        tier: str = ""
        reason: str = ""

        if si_override:
            gold_cat = si_override[0]
            tier = "T3-rule"
            reason = f"SI/NO override: {si_override[1]}"
        elif top_count >= 8 and total_signals >= 8:
            gold_cat = top_signal
            tier = "T1-super"
            reason = f"{top_count}/{total_signals} signals agree"
        elif h_cat_maj == m_cat_maj:
            gold_cat = h_cat_maj
            tier = "T2-cross"
            reason = "Human + model majority agree"
        elif m_cat_unanimous:
            # All 6 models unanimous. Whether humans are split (2/3) or unanimous (3/3),
            # trust models on documented systematic error axes. Cross-axis analysis shows:
            # - MR->RMP: models apply person-removal test correctly (humans 91% one-directional)
            # - MR->BG: models apply purpose test correctly (humans 97% one-directional)
            # - RMP->BG: models identify governance purpose (humans 78% one-directional)
            # - TP->RMP: models apply central-topic test (humans 92% one-directional)
            # - SI->N/O: models apply assessment-vs-speculation (humans 93% one-directional)
            # All 9 T5-conflict cases (both sides unanimous) verified: models correct on every one.
            gold_cat = m_cat_maj
            tier = "T4-model"
            h_count = Counter(h_cats).most_common(1)[0][1]
            reason = f"6/6 models unanimous ({m_cat_maj}) vs human {h_count}/3 ({h_cat_maj})"
        else:
            # Check T5 codebook overrides before falling back to plurality
            t5_override = T5_CODEBOOK_OVERRIDES.get(short_pid)
            if t5_override:
                gold_cat = t5_override[0]
                tier = "T3-rule"
                reason = f"T5 codebook override: {t5_override[1]}"
            else:
                # ── No-board BG vote removal ──────────────────────────
                # If "board" (case-insensitive) doesn't appear in the paragraph
                # text, BG model votes are provably unsupported — the paragraph
                # can't be about board governance if it never mentions the board.
                # Remove those BG signals and recalculate plurality.
                # Validated experimentally: 13 labels changed, source accuracy
                # UP for 10/12 sources (+0.5-0.8% for top annotators/models).
                t5_signals = list(all_signals)
                para_text = para_texts.get(pid, "")
                if "board" not in para_text.lower():
                    bg_count = sum(1 for s in t5_signals if s == "Board Governance")
                    if bg_count > 0:
                        t5_signals = [s for s in t5_signals if s != "Board Governance"]

                if t5_signals:
                    t5_counter = Counter(t5_signals)
                    t5_top, t5_top_count = t5_counter.most_common(1)[0]
                    t5_total = len(t5_signals)
                else:
                    t5_top, t5_top_count, t5_total = top_signal, top_count, total_signals

                gold_cat = t5_top
                tier = "T5-plurality"
                reason = f"Mixed: human={h_cat_maj}, model={m_cat_maj}, plurality={t5_top} ({t5_top_count}/{t5_total})"

        # ── Specificity: hybrid human/model ──────────────────────────
        # Human consensus on specificity is only 52.5%, while model-model
        # agreement is 87-91%. When humans are unanimous (3/3), trust their
        # label. When humans split, use model majority (more reliable).
        h_spec_unanimous = len(set(h_specs)) == 1
        if h_spec_unanimous:
            gold_spec = h_spec_maj
        else:
            m_specs = [model_specs[m][pid] for m in TOP6 if pid in model_specs[m]]
            if m_specs:
                gold_spec = int(majority_vote([str(s) for s in m_specs]) or h_spec_maj)
            else:
                gold_spec = h_spec_maj

        tier_counts[tier] += 1
        results.append({
            "paragraphId": pid,
            "gold_category": gold_cat,
            "gold_specificity": gold_spec,
            "tier": tier,
            "reason": reason,
            "human_majority": h_cat_maj,
            "model_majority": m_cat_maj,
            "human_votes": dict(Counter(h_cats)),
            "model_votes": dict(Counter(m_cats_list)),
        })

    # ── Write output ──────────────────────────────────────────────────────

    output_path = ROOT / "data/gold/gold-adjudicated.jsonl"
    with open(output_path, "w") as f:
        for r in results:
            f.write(json.dumps(r) + "\n")

    # ── Summary ───────────────────────────────────────────────────────────

    print("=" * 90)
    print("GOLD SET ADJUDICATION SUMMARY")
    print("=" * 90)
    print(f"\nTotal paragraphs: {len(results)}")
    print(f"\nTier breakdown:")
    for tier, count in sorted(tier_counts.items()):
        pct = count / len(results) * 100
        print(f" {tier:<16} {count:>5} ({pct:.1f}%)")

    flipped = sum(1 for r in results if r["gold_category"] != r["human_majority"])
    print(f"\nGold labels differing from human majority: {flipped} ({flipped / len(results):.1%})")

    print(f"\nCategory distribution:")
    h_dist = Counter(r["human_majority"] for r in results)
    g_dist = Counter(r["gold_category"] for r in results)
    print(f" {'Category':<25} {'Human Maj':>10} {'Gold':>10} {'Delta':>6}")
    for cat in sorted(set(list(h_dist.keys()) + list(g_dist.keys()))):
        print(f" {cat:<25} {h_dist.get(cat, 0):>10} {g_dist.get(cat, 0):>10} {g_dist.get(cat, 0) - h_dist.get(cat, 0):>+6}")

    gold_by_pid = {r["paragraphId"]: r["gold_category"] for r in results}

    print(f"\n{'=' * 90}")
    print("SOURCE ACCURACY vs ADJUDICATED GOLD")
    print(f"{'=' * 90}")

    annotator_names = sorted(set(l["annotator"] for labels in human_labels.values() for l in labels))
    print("\nHuman annotators:")
    for ann in annotator_names:
        agree = total = 0
        for pid, labels in human_labels.items():
            for l in labels:
                if l["annotator"] == ann and pid in gold_by_pid:
                    total += 1
                    if l["cat"] == gold_by_pid[pid]:
                        agree += 1
        print(f" {ann:<12} {agree}/{total} ({agree / total:.1%})")

    print("\nModels (v3.0 on full 1200):")
    for m in TOP6:
        agree = total = 0
        for pid in gold_by_pid:
            if pid in v30_cats.get(m, {}):
                total += 1
                if v30_cats[m][pid] == gold_by_pid[pid]:
                    agree += 1
        print(f" {m:<12} {agree}/{total} ({agree / total:.1%})")

    print(f"\nOutput: {output_path}")


if __name__ == "__main__":
    main()
620 scripts/audit-stage1-labels.py (new file)
@@ -0,0 +1,620 @@
"""
Audit Stage 1 annotations for systematic SI↔N/O miscoding.

Stage 1 used prompt v2.5 which lacked the rule "materiality disclaimers → SI."
This script quantifies how many N/O labels likely should have been SI, plus
other potential miscoding axes.

Run: uv run scripts/audit-stage1-labels.py
"""

import json
import re
from collections import Counter, defaultdict
from pathlib import Path

# ── Paths ──────────────────────────────────────────────────────────────
ROOT = Path(__file__).resolve().parent.parent
ANNOTATIONS = ROOT / "data" / "annotations" / "stage1.patched.jsonl"
PARAGRAPHS = ROOT / "data" / "paragraphs" / "paragraphs-clean.patched.jsonl"
PARAGRAPHS_FALLBACK = ROOT / "data" / "paragraphs" / "paragraphs-clean.jsonl"
HOLDOUT = ROOT / "data" / "gold" / "paragraphs-holdout.jsonl"
HUMAN_LABELS = ROOT / "data" / "gold" / "human-labels-raw.jsonl"


# ── Materiality regex patterns ─────────────────────────────────────────
# Pattern 1: "material" near business/strategy language (within ~15 words)
PAT_MATERIAL_NEAR_BIZ = re.compile(
    r"material(?:ly)?\b.{0,100}\b(?:business\s+strategy|results\s+of\s+operations|financial\s+condition|business|operations)"
    r"|"
    r"(?:business\s+strategy|results\s+of\s+operations|financial\s+condition)\b.{0,100}\bmaterial(?:ly)?",
    re.IGNORECASE,
)

# Pattern 2: specific materiality disclaimer phrases
PAT_MATERIALITY_DISCLAIMER = re.compile(
    r"have\s+not\s+materially\s+affected"
    r"|has\s+not\s+materially\s+affected"
    r"|could\s+materially\s+affect"
    r"|could\s+have\s+a\s+material\s+(?:adverse\s+)?(?:effect|impact)"
    r"|may\s+(?:materially|have\s+a\s+material)\s+(?:adverse\s+)?(?:effect|impact|affect)"
    r"|reasonably\s+likely\s+to\s+materially\s+affect"
    r"|not\s+reasonably\s+likely"
    r"|materially\s+(?:adverse(?:ly)?|impact|affect)"
    r"|material\s+adverse\s+(?:effect|impact)"
    r"|no\s+material\s+(?:adverse\s+)?(?:effect|impact)"
    r"|did\s+not\s+(?:have\s+a\s+)?material(?:ly)?\s+(?:adverse\s+)?(?:effect|impact|affect)",
    re.IGNORECASE,
)
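
# Note: this pattern casts a wide net -- it matches backward-looking
# assessments ("have not materially affected") as well as boilerplate
# speculation ("could have a material adverse effect"), so hits are
# candidates for review, not confirmed miscodes.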

# Pattern 3: explicit SI-relevant phrases
PAT_SI_PHRASES = re.compile(
    r"business\s+strategy"
    r"|results\s+of\s+operations"
    r"|financial\s+condition"
    r"|integrated\s+(?:into|with)\s+(?:our\s+)?(?:overall|business)"
    r"|part\s+of\s+(?:our\s+)?(?:overall|broader)\s+(?:risk|enterprise|business)",
    re.IGNORECASE,
)


def has_materiality_language(text: str) -> bool:
    """Returns True if text contains materiality-related language indicative of SI."""
    return bool(
        PAT_MATERIALITY_DISCLAIMER.search(text)
        or PAT_SI_PHRASES.search(text)
        or PAT_MATERIAL_NEAR_BIZ.search(text)
    )
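

# Quick sanity checks with hypothetical phrases (not drawn from any filing);
# cheap enough to run at import time:
assert has_materiality_language("These risks have not materially affected our business.")
assert has_materiality_language("Cybersecurity is integrated into our overall risk program.")
assert not has_materiality_language("We deploy firewalls and multi-factor authentication.")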


# ── Insurance / budget / incident patterns ─────────────────────────────
PAT_INSURANCE = re.compile(r"\binsurance\b", re.IGNORECASE)
PAT_BUDGET = re.compile(r"\b(?:budget|investment(?:s)?)\b", re.IGNORECASE)
PAT_INCIDENT = re.compile(
    r"\bwe\s+(?:experienced|suffered|detected|identified|discovered|encountered|were\s+subject\s+to)\b",
    re.IGNORECASE,
)

# ── Cross-category confusion patterns ──────────────────────────────────
PAT_PROGRAM_FRAMEWORK = re.compile(
    r"\b(?:program|framework|process(?:es)?|procedure(?:s)?)\b", re.IGNORECASE
)
PAT_TITLE = re.compile(
    r"\b(?:Chief\s+(?:Information|Technology|Executive|Financial|Security|Operating|Risk)\s+(?:Officer|Security\s+Officer))"
    r"|(?:CISO|CIO|CTO|CFO|CEO|COO|CRO)\b"
    r"|\b(?:Vice\s+President|Director|Senior\s+Vice\s+President|EVP|SVP)\b",
    re.IGNORECASE,
)
PAT_MANAGEMENT_OFFICERS = re.compile(
    r"\b(?:management|officer(?:s)?|executive(?:s)?|leader(?:s)?(?:hip)?)\b",
    re.IGNORECASE,
)


def separator(title: str) -> None:
    width = 80
    print()
    print("=" * width)
    print(f" {title}")
    print("=" * width)


def print_example(idx: int, pid: str, text: str, extra: str = "") -> None:
    print(f"\n [{idx}] paragraphId: {pid}")
    if extra:
        print(f" {extra}")
    # Truncate long texts (~500 chars) for readability
    wrapped = text
    if len(wrapped) > 500:
        wrapped = wrapped[:500] + "..."
    print(f" TEXT: {wrapped}")


# ── Load data ──────────────────────────────────────────────────────────
def load_annotations() -> dict[str, list[dict]]:
    """Returns {paragraphId: [annotation, ...]}"""
    by_para: dict[str, list[dict]] = defaultdict(list)
    with open(ANNOTATIONS) as f:
        for line in f:
            d = json.loads(line)
            pid = d["paragraphId"]
            cat = d["label"]["content_category"]
            model = d["provenance"]["modelId"]
            by_para[pid].append({"category": cat, "model": model})
    return dict(by_para)


def load_paragraphs() -> dict[str, str]:
    """Returns {paragraphId: text}"""
    texts: dict[str, str] = {}
    path = PARAGRAPHS if PARAGRAPHS.exists() else PARAGRAPHS_FALLBACK
    with open(path) as f:
        for line in f:
            d = json.loads(line)
            texts[d["id"]] = d["text"]
    return texts


def load_holdout() -> dict[str, dict]:
    """Returns {paragraphId: {text, stage1Category, stage1Method, ...}}"""
    holdout: dict[str, dict] = {}
    with open(HOLDOUT) as f:
        for line in f:
            d = json.loads(line)
            holdout[d["id"]] = d
    return holdout


def load_human_labels() -> dict[str, list[dict]]:
    """Returns {paragraphId: [{annotatorName, contentCategory}, ...]}"""
    labels: dict[str, list[dict]] = defaultdict(list)
    with open(HUMAN_LABELS) as f:
        for line in f:
            d = json.loads(line)
            labels[d["paragraphId"]].append(
                {
                    "annotator": d["annotatorName"],
                    "category": d["contentCategory"],
                    "specificity": d["specificityLevel"],
                }
            )
    return dict(labels)


def main() -> None:
    print("Loading data...")
    annotations = load_annotations()
    texts = load_paragraphs()
    holdout = load_holdout()
    human_labels = load_human_labels()

    print(f" Annotations: {sum(len(v) for v in annotations.values())} across {len(annotations)} paragraphs")
    print(f" Paragraph texts loaded: {len(texts)}")
    print(f" Holdout paragraphs: {len(holdout)}")
    print(f" Human-labeled paragraphs: {len(human_labels)}")

    # ── Classify each paragraph by voting ──────────────────────────────
    unanimous_no: list[str] = []
    majority_no: list[str] = []  # 2/3 N/O
    unanimous_si: list[str] = []
    unanimous_mr: list[str] = []
    unanimous_rmp: list[str] = []
    unanimous_bg: list[str] = []
    all_unanimous: dict[str, str] = {}  # pid -> category for unanimous

    for pid, anns in annotations.items():
        cats = [a["category"] for a in anns]
        cat_counts = Counter(cats)

        if len(cats) != 3:
            continue  # skip incomplete

        if cat_counts.get("None/Other", 0) == 3:
            unanimous_no.append(pid)
            all_unanimous[pid] = "None/Other"
        elif cat_counts.get("None/Other", 0) == 2:
            majority_no.append(pid)
        elif cat_counts.get("Strategy Integration", 0) == 3:
            unanimous_si.append(pid)
            all_unanimous[pid] = "Strategy Integration"
        elif cat_counts.get("Management Role", 0) == 3:
            unanimous_mr.append(pid)
            all_unanimous[pid] = "Management Role"
        elif cat_counts.get("Risk Management Process", 0) == 3:
            unanimous_rmp.append(pid)
            all_unanimous[pid] = "Risk Management Process"
        elif cat_counts.get("Board Governance", 0) == 3:
            unanimous_bg.append(pid)
            all_unanimous[pid] = "Board Governance"

        # Track all unanimous
        if len(cat_counts) == 1:
            all_unanimous[pid] = cats[0]
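
    # Note: only N/O gets a 2/3-majority bucket; the other categories are
    # tracked only when all three Stage 1 models agree.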
print(f"\n Unanimous N/O: {len(unanimous_no)}")
|
||||||
|
print(f" Majority N/O (2/3): {len(majority_no)}")
|
||||||
|
print(f" Unanimous SI: {len(unanimous_si)}")
|
||||||
|
print(f" Unanimous MR: {len(unanimous_mr)}")
|
||||||
|
print(f" Unanimous RMP: {len(unanimous_rmp)}")
|
||||||
|
print(f" Unanimous BG: {len(unanimous_bg)}")
|
||||||
|
print(f" Total unanimous (any): {len(all_unanimous)}")
|
||||||
|
|
||||||
|
# ════════════════════════════════════════════════════════════════════
|
||||||
|
# 1. Unanimous N/O with materiality language
|
||||||
|
# ════════════════════════════════════════════════════════════════════
|
||||||
|
separator("1. UNANIMOUS N/O WITH MATERIALITY LANGUAGE")
|
||||||
|
|
||||||
|
no_with_mat: list[tuple[str, str]] = []
|
||||||
|
no_without_text = 0
|
||||||
|
for pid in unanimous_no:
|
||||||
|
text = texts.get(pid)
|
||||||
|
if text is None:
|
||||||
|
no_without_text += 1
|
||||||
|
continue
|
||||||
|
if has_materiality_language(text):
|
||||||
|
no_with_mat.append((pid, text))
|
||||||
|
|
||||||
|
print(f"\n Total unanimous N/O: {len(unanimous_no)}")
|
||||||
|
print(f" Missing text: {no_without_text}")
|
||||||
|
print(f" With materiality language: {len(no_with_mat)}")
|
||||||
|
print(f" Percentage of unanimous N/O: {len(no_with_mat) / max(1, len(unanimous_no)) * 100:.1f}%")
|
||||||
|
|
||||||
|
print(f"\n --- 10 representative examples ---")
|
||||||
|
# Pick a diverse sample: take every Nth
|
||||||
|
step = max(1, len(no_with_mat) // 10)
|
||||||
|
shown = 0
|
||||||
|
for i in range(0, len(no_with_mat), step):
|
||||||
|
if shown >= 10:
|
||||||
|
break
|
||||||
|
pid, text = no_with_mat[i]
|
||||||
|
print_example(shown + 1, pid, text)
|
||||||
|
shown += 1
|
||||||
|
|
||||||
|
# ════════════════════════════════════════════════════════════════════
|
||||||
|
# 2. Majority N/O with materiality language
|
||||||
|
# ════════════════════════════════════════════════════════════════════
|
||||||
|
separator("2. MAJORITY N/O (2/3) WITH MATERIALITY LANGUAGE")
|
||||||
|
|
||||||
|
maj_no_with_mat: list[tuple[str, str, str, str]] = [] # pid, text, dissenting_model, dissenting_cat
|
||||||
|
for pid in majority_no:
|
||||||
|
text = texts.get(pid)
|
||||||
|
if text is None:
|
||||||
|
continue
|
||||||
|
if has_materiality_language(text):
|
||||||
|
anns = annotations[pid]
|
||||||
|
for a in anns:
|
||||||
|
if a["category"] != "None/Other":
|
||||||
|
maj_no_with_mat.append((pid, text, a["model"], a["category"]))
|
||||||
|
break
|
||||||
|
|
||||||
|
print(f"\n Total majority N/O (2/3): {len(majority_no)}")
|
||||||
|
print(f" With materiality language: {len(maj_no_with_mat)}")
|
||||||
|
print(f" Percentage: {len(maj_no_with_mat) / max(1, len(majority_no)) * 100:.1f}%")
|
||||||
|
|
||||||
|
# Count dissenting categories
|
||||||
|
dissent_cats = Counter(x[3] for x in maj_no_with_mat)
|
||||||
|
print(f"\n Dissenting model voted:")
|
||||||
|
for cat, cnt in dissent_cats.most_common():
|
||||||
|
print(f" {cat}: {cnt}")
|
||||||
|
|
||||||
|
# Count dissenting models
|
||||||
|
dissent_models = Counter(x[2] for x in maj_no_with_mat)
|
||||||
|
print(f"\n Which models dissented:")
|
||||||
|
for model, cnt in dissent_models.most_common():
|
||||||
|
print(f" {model}: {cnt}")
|
||||||
|
|
||||||
|
print(f"\n --- 5 examples ---")
|
||||||
|
step = max(1, len(maj_no_with_mat) // 5)
|
||||||
|
shown = 0
|
||||||
|
for i in range(0, len(maj_no_with_mat), step):
|
||||||
|
if shown >= 5:
|
||||||
|
break
|
||||||
|
pid, text, model, cat = maj_no_with_mat[i]
|
||||||
|
print_example(shown + 1, pid, text, f"Dissent: {model} → {cat}")
|
||||||
|
shown += 1
|
||||||
|
|
||||||
|
# ════════════════════════════════════════════════════════════════════
|
||||||
|
# 3. Unanimous SI examples (contrast)
|
||||||
|
# ════════════════════════════════════════════════════════════════════
|
||||||
|
separator("3. UNANIMOUS SI — WHAT CLEAN SI LOOKS LIKE")
|
||||||
|
|
||||||
|
si_examples: list[tuple[str, str]] = []
|
||||||
|
for pid in unanimous_si:
|
||||||
|
text = texts.get(pid)
|
||||||
|
if text:
|
||||||
|
si_examples.append((pid, text))
|
||||||
|
if len(si_examples) >= 20:
|
||||||
|
break
|
||||||
|
|
||||||
|
print(f"\n Total unanimous SI: {len(unanimous_si)}")
|
||||||
|
print(f"\n --- 5 examples ---")
|
||||||
|
for i, (pid, text) in enumerate(si_examples[:5]):
|
||||||
|
print_example(i + 1, pid, text)
|
||||||
|
|
||||||
|
# Analyze SI language patterns
|
||||||
|
si_has_materiality = sum(1 for pid in unanimous_si if pid in texts and has_materiality_language(texts[pid]))
|
||||||
|
si_has_insurance = sum(1 for pid in unanimous_si if pid in texts and PAT_INSURANCE.search(texts[pid]))
|
||||||
|
si_has_budget = sum(1 for pid in unanimous_si if pid in texts and PAT_BUDGET.search(texts[pid]))
|
||||||
|
print(f"\n SI language patterns:")
|
||||||
|
print(f" With materiality language: {si_has_materiality} / {len(unanimous_si)} ({si_has_materiality / max(1, len(unanimous_si)) * 100:.1f}%)")
|
||||||
|
print(f" Mention insurance: {si_has_insurance} / {len(unanimous_si)}")
|
||||||
|
print(f" Mention budget/investment: {si_has_budget} / {len(unanimous_si)}")
|
||||||
|
|
||||||
|
# ════════════════════════════════════════════════════════════════════
|
||||||
|
# 4. N/O with other potential miscoding
|
||||||
|
# ════════════════════════════════════════════════════════════════════
|
||||||
|
separator("4. N/O PARAGRAPHS WITH OTHER POTENTIAL MISCODING")
|
||||||
|
|
||||||
|
no_insurance: list[tuple[str, str]] = []
|
||||||
|
no_budget: list[tuple[str, str]] = []
|
||||||
|
no_incident: list[tuple[str, str]] = []
|
||||||
|
|
||||||
|
for pid in unanimous_no:
|
||||||
|
text = texts.get(pid)
|
||||||
|
if text is None:
|
||||||
|
continue
|
||||||
|
if PAT_INSURANCE.search(text):
|
||||||
|
no_insurance.append((pid, text))
|
||||||
|
if PAT_BUDGET.search(text):
|
||||||
|
no_budget.append((pid, text))
|
||||||
|
if PAT_INCIDENT.search(text):
|
||||||
|
no_incident.append((pid, text))
|
||||||
|
|
||||||
|
print(f"\n Unanimous N/O mentioning insurance: {len(no_insurance)}")
|
||||||
|
print(f" Unanimous N/O mentioning budget/investment: {len(no_budget)}")
|
||||||
|
print(f" Unanimous N/O mentioning incidents ('we experienced...'): {len(no_incident)}")
|
||||||
|
|
||||||
|
# Show examples for each
|
||||||
|
print(f"\n --- Insurance examples (up to 3) ---")
|
||||||
|
for i, (pid, text) in enumerate(no_insurance[:3]):
|
||||||
|
print_example(i + 1, pid, text)
|
||||||
|
|
||||||
|
print(f"\n --- Budget/investment examples (up to 3) ---")
|
||||||
|
for i, (pid, text) in enumerate(no_budget[:3]):
|
||||||
|
print_example(i + 1, pid, text)
|
||||||
|
|
||||||
|
print(f"\n --- Incident examples (up to 3) ---")
|
||||||
|
for i, (pid, text) in enumerate(no_incident[:3]):
|
||||||
|
print_example(i + 1, pid, text)
|
||||||
|
|
||||||
|
# ════════════════════════════════════════════════════════════════════
|
||||||
|
# 5. Scale the problem
|
||||||
|
# ════════════════════════════════════════════════════════════════════
|
||||||
|
separator("5. SCALE THE PROBLEM")
|
||||||
|
|
||||||
|
# Deduplicate: some paragraphs may hit multiple patterns
|
||||||
|
no_any_miscoded = set()
|
||||||
|
for pid, _ in no_with_mat:
|
||||||
|
no_any_miscoded.add(pid)
|
||||||
|
for pid, _ in no_insurance:
|
||||||
|
no_any_miscoded.add(pid)
|
||||||
|
for pid, _ in no_budget:
|
||||||
|
no_any_miscoded.add(pid)
|
||||||
|
no_incident_pids = set(pid for pid, _ in no_incident)
|
||||||
|
|
||||||
|
# Materiality-only (not already insurance/budget)
|
||||||
|
mat_only = set(pid for pid, _ in no_with_mat)
|
||||||
|
ins_only = set(pid for pid, _ in no_insurance) - mat_only
|
||||||
|
bud_only = set(pid for pid, _ in no_budget) - mat_only - ins_only
|
||||||
|
|
||||||
|
total_unanimous = len(all_unanimous)
|
||||||
|
total_annotations = len(annotations)
|
||||||
|
|
||||||
|
print(f"\n Total paragraphs with 3 annotations: {total_annotations}")
|
||||||
|
print(f" Total unanimous (any category): {total_unanimous}")
|
||||||
|
print(f" Total unanimous N/O: {len(unanimous_no)}")
|
||||||
|
print()
|
||||||
|
print(f" Potentially miscoded unanimous N/O:")
|
||||||
|
print(f" Materiality language (likely SI): {len(no_with_mat)}")
|
||||||
|
print(f" Insurance (likely SI): {len(no_insurance)}")
|
||||||
|
print(f" Budget/investment (likely SI): {len(no_budget)}")
|
||||||
|
print(f" Incident language (likely SI or ID): {len(no_incident)}")
|
||||||
|
print(f" Any of above (deduplicated): {len(no_any_miscoded)}")
|
||||||
|
print(f" Incident (separate concern): {len(no_incident_pids)}")
|
||||||
|
print()
|
||||||
|
|
||||||
|
# Overlap analysis
|
||||||
|
mat_set = set(pid for pid, _ in no_with_mat)
|
||||||
|
ins_set = set(pid for pid, _ in no_insurance)
|
||||||
|
bud_set = set(pid for pid, _ in no_budget)
|
||||||
|
print(f" Overlap analysis:")
|
||||||
|
print(f" Materiality ∩ Insurance: {len(mat_set & ins_set)}")
|
||||||
|
print(f" Materiality ∩ Budget: {len(mat_set & bud_set)}")
|
||||||
|
print(f" Insurance ∩ Budget: {len(ins_set & bud_set)}")
|
||||||
|
print()
|
||||||
|
|
||||||
|
pct_no_affected = len(no_any_miscoded) / max(1, len(unanimous_no)) * 100
|
||||||
|
pct_total_affected = len(no_any_miscoded) / max(1, total_unanimous) * 100
|
||||||
|
pct_all_affected = len(no_any_miscoded) / max(1, total_annotations) * 100
|
||||||
|
|
||||||
|
print(f" Impact estimates:")
|
||||||
|
print(f" % of unanimous N/O potentially miscoded: {pct_no_affected:.1f}%")
|
||||||
|
print(f" % of all unanimous labels affected: {pct_total_affected:.1f}%")
|
||||||
|
print(f" % of all paragraphs affected: {pct_all_affected:.1f}%")
|
||||||
|
|
||||||
|
# Also check majority N/O
|
||||||
|
maj_no_any = set()
|
||||||
|
for pid in majority_no:
|
||||||
|
text = texts.get(pid)
|
||||||
|
if text is None:
|
||||||
|
continue
|
||||||
|
if has_materiality_language(text) or PAT_INSURANCE.search(text) or PAT_BUDGET.search(text):
|
||||||
|
maj_no_any.add(pid)
|
||||||
|
|
||||||
|
print(f"\n Majority N/O (2/3) potentially miscoded: {len(maj_no_any)} / {len(majority_no)}")
|
||||||
|
print(f" Combined (unanimous + majority) potentially miscoded N/O: {len(no_any_miscoded) + len(maj_no_any)}")
|
||||||
|
|
||||||
|
# ════════════════════════════════════════════════════════════════════
|
||||||
|
# 6. Cross-check with holdout / human labels
|
||||||
|
# ════════════════════════════════════════════════════════════════════
|
||||||
|
separator("6. HOLDOUT CROSS-CHECK WITH HUMAN LABELS")
|
||||||
|
|
||||||
|
# Find holdout paragraphs that Stage 1 unanimously called N/O but contain materiality language
|
||||||
|
holdout_no_mat: list[tuple[str, str]] = []
|
||||||
|
holdout_no_mat_with_human: list[tuple[str, str, list[dict]]] = []
|
||||||
|
|
||||||
|
for pid, para in holdout.items():
|
||||||
|
if para.get("stage1Category") == "None/Other" and para.get("stage1Method") == "unanimous":
|
||||||
|
text = para["text"]
|
||||||
|
if has_materiality_language(text):
|
||||||
|
holdout_no_mat.append((pid, text))
|
||||||
|
if pid in human_labels:
|
||||||
|
holdout_no_mat_with_human.append((pid, text, human_labels[pid]))
|
||||||
|
|
||||||
|
print(f"\n Holdout paragraphs with stage1 unanimous N/O: "
|
||||||
|
f"{sum(1 for p in holdout.values() if p.get('stage1Category') == 'None/Other' and p.get('stage1Method') == 'unanimous')}")
|
||||||
|
print(f" Of those, with materiality language: {len(holdout_no_mat)}")
|
||||||
|
print(f" Of those, with human labels: {len(holdout_no_mat_with_human)}")
|
||||||
|
|
||||||
|
# What did humans call these?
|
||||||
|
if holdout_no_mat_with_human:
|
||||||
|
human_cats_for_flagged = Counter()
|
||||||
|
for pid, text, hlabels in holdout_no_mat_with_human:
|
||||||
|
for hl in hlabels:
|
||||||
|
human_cats_for_flagged[hl["category"]] += 1
|
||||||
|
|
||||||
|
print(f"\n Human labels for flagged paragraphs (Stage1=unanimous N/O, has materiality language):")
|
||||||
|
total_human = sum(human_cats_for_flagged.values())
|
||||||
|
for cat, cnt in human_cats_for_flagged.most_common():
|
||||||
|
print(f" {cat}: {cnt} ({cnt / total_human * 100:.1f}%)")
|
||||||
|
|
||||||
|
print(f"\n --- Examples where humans disagreed with Stage 1 N/O ---")
|
||||||
|
shown = 0
|
||||||
|
for pid, text, hlabels in holdout_no_mat_with_human:
|
||||||
|
non_no = [hl for hl in hlabels if hl["category"] != "None/Other"]
|
||||||
|
if non_no:
|
||||||
|
human_str = ", ".join(f"{hl['annotator']}={hl['category']}" for hl in hlabels)
|
||||||
|
print_example(shown + 1, pid, text, f"Human labels: {human_str}")
|
||||||
|
shown += 1
|
||||||
|
if shown >= 5:
|
||||||
|
break
|
||||||
|
|
||||||
|
# Also show ones where humans agreed it IS N/O
|
||||||
|
print(f"\n --- Examples where humans also said N/O (materiality language is ambiguous) ---")
|
||||||
|
shown = 0
|
||||||
|
for pid, text, hlabels in holdout_no_mat_with_human:
|
||||||
|
all_no = all(hl["category"] == "None/Other" for hl in hlabels)
|
||||||
|
if all_no and len(hlabels) >= 2:
|
||||||
|
print_example(shown + 1, pid, text, "All humans agreed: N/O")
|
||||||
|
shown += 1
|
||||||
|
if shown >= 3:
|
||||||
|
break
|
||||||
|
else:
|
||||||
|
print("\n No human labels available for flagged holdout paragraphs.")
|
||||||
|
|
||||||
|
# Broader holdout analysis: all cases where Stage 1 said N/O but humans said something else
|
||||||
|
separator("6b. HOLDOUT: ALL Stage1=N/O vs HUMAN DISAGREEMENTS")
|
||||||
|
|
||||||
|
holdout_no_all = [pid for pid, p in holdout.items()
|
||||||
|
if p.get("stage1Category") == "None/Other"]
|
||||||
|
stage1_no_human_disagree = []
|
||||||
|
for pid in holdout_no_all:
|
||||||
|
if pid in human_labels:
|
||||||
|
hlabels = human_labels[pid]
|
||||||
|
non_no = [hl for hl in hlabels if hl["category"] != "None/Other"]
|
||||||
|
if non_no:
|
||||||
|
stage1_no_human_disagree.append((pid, holdout[pid]["text"], hlabels))
|
||||||
|
|
||||||
|
print(f"\n All holdout paragraphs with Stage1=N/O (any method): {len(holdout_no_all)}")
|
||||||
|
print(f" Of those with human labels that disagree: {len(stage1_no_human_disagree)}")
|
||||||
|
|
||||||
|
if stage1_no_human_disagree:
|
||||||
|
# What did humans call them?
|
||||||
|
human_override = Counter()
|
||||||
|
for pid, text, hlabels in stage1_no_human_disagree:
|
||||||
|
for hl in hlabels:
|
||||||
|
if hl["category"] != "None/Other":
|
||||||
|
human_override[hl["category"]] += 1
|
||||||
|
print(f"\n Humans' non-N/O labels for Stage1=N/O paragraphs:")
|
||||||
|
for cat, cnt in human_override.most_common():
|
||||||
|
print(f" {cat}: {cnt}")
|
||||||
|
|
||||||
|
# ════════════════════════════════════════════════════════════════════
|
||||||
|
# 7. Other confusion axes
|
||||||
|
# ════════════════════════════════════════════════════════════════════
|
||||||
|
separator("7. OTHER CONFUSION AXES IN STAGE 1")
|
||||||
|
|
||||||
|
# 7a. Unanimous MR with program/framework/process language (potential RMP)
|
||||||
|
mr_with_process = []
|
||||||
|
for pid in unanimous_mr:
|
||||||
|
text = texts.get(pid)
|
||||||
|
if text is None:
|
||||||
|
continue
|
||||||
|
matches = PAT_PROGRAM_FRAMEWORK.findall(text)
|
||||||
|
if len(matches) >= 2: # Multiple mentions = likely process-focused
|
||||||
|
mr_with_process.append((pid, text, matches))
|
||||||
|
|
||||||
|
print(f"\n 7a. Unanimous MR with prominent program/framework/process language")
|
||||||
|
print(f" (>=2 mentions — potentially should be RMP)")
|
||||||
|
print(f" Count: {len(mr_with_process)} / {len(unanimous_mr)} ({len(mr_with_process) / max(1, len(unanimous_mr)) * 100:.1f}%)")
|
||||||
|
print(f"\n --- 3 examples ---")
|
||||||
|
for i, (pid, text, matches) in enumerate(mr_with_process[:3]):
|
||||||
|
print_example(i + 1, pid, text, f"Pattern matches: {matches[:6]}")
|
||||||
|
|
||||||
|
# 7b. Unanimous RMP with specific titles (potential MR)
|
||||||
|
rmp_with_titles = []
|
||||||
|
for pid in unanimous_rmp:
|
||||||
|
text = texts.get(pid)
|
||||||
|
if text is None:
|
||||||
|
continue
|
||||||
|
titles = PAT_TITLE.findall(text)
|
||||||
|
if titles:
|
||||||
|
rmp_with_titles.append((pid, text, titles))
|
||||||
|
|
||||||
|
print(f"\n 7b. Unanimous RMP mentioning specific people/titles")
|
||||||
|
print(f" (potentially should be MR)")
|
||||||
|
print(f" Count: {len(rmp_with_titles)} / {len(unanimous_rmp)} ({len(rmp_with_titles) / max(1, len(unanimous_rmp)) * 100:.1f}%)")
|
||||||
|
print(f"\n --- 3 examples ---")
|
||||||
|
for i, (pid, text, titles) in enumerate(rmp_with_titles[:3]):
|
||||||
|
print_example(i + 1, pid, text, f"Titles found: {titles[:5]}")
|
||||||
|
|
||||||
|

    # 7c. Unanimous BG primarily about management officers
    # (precompiled once; flags paragraphs with management language but no board language)
    board_pattern = re.compile(r"\b(?:board|director(?:s)?|committee|audit)\b", re.IGNORECASE)
    bg_about_mgmt = []
    for pid in unanimous_bg:
        text = texts.get(pid)
        if text is None:
            continue
        has_titles = PAT_TITLE.findall(text)
        has_mgmt = PAT_MANAGEMENT_OFFICERS.findall(text)
        has_board = board_pattern.findall(text)
        if (has_titles or has_mgmt) and not has_board:
            bg_about_mgmt.append((pid, text, has_titles + has_mgmt))

    print(f"\n 7c. Unanimous BG primarily about management (no board/committee language)")
    print(f" Count: {len(bg_about_mgmt)} / {len(unanimous_bg)} ({len(bg_about_mgmt) / max(1, len(unanimous_bg)) * 100:.1f}%)")
    if bg_about_mgmt:
        print(f"\n --- 3 examples ---")
        for i, (pid, text, matches) in enumerate(bg_about_mgmt[:3]):
            print_example(i + 1, pid, text, f"Matches: {matches[:5]}")

    # ════════════════════════════════════════════════════════════════════
    # SUMMARY
    # ════════════════════════════════════════════════════════════════════
    separator("SUMMARY")

    print(f"""
DATASET OVERVIEW
Total paragraphs annotated (3 models each): {total_annotations:,}
Total unanimous labels: {total_unanimous:,}
Unanimous N/O: {len(unanimous_no):,}
Majority N/O (2/3): {len(majority_no):,}

PRIMARY CONCERN: N/O → SI MISCODING
Unanimous N/O with materiality language: {len(no_with_mat):,} ({len(no_with_mat) / max(1, len(unanimous_no)) * 100:.1f}% of unanimous N/O)
Majority N/O with materiality language: {len(maj_no_with_mat):,} ({len(maj_no_with_mat) / max(1, len(majority_no)) * 100:.1f}% of majority N/O)
Unanimous N/O with insurance: {len(no_insurance):,}
Unanimous N/O with budget/investment: {len(no_budget):,}
Unanimous N/O with incident language: {len(no_incident):,}
Total potentially miscoded (deduplicated): {len(no_any_miscoded):,}

IMPACT ON TRAINING SET
% of unanimous N/O affected: {pct_no_affected:.1f}%
% of all unanimous labels affected: {pct_total_affected:.1f}%
% of all paragraphs affected: {pct_all_affected:.1f}%

OTHER CONFUSION AXES
MR ↔ RMP confusion (MR with process language): {len(mr_with_process):,} / {len(unanimous_mr):,}
RMP ↔ MR confusion (RMP with titles): {len(rmp_with_titles):,} / {len(unanimous_rmp):,}
BG about management (no board language): {len(bg_about_mgmt):,} / {len(unanimous_bg):,}

HOLDOUT VALIDATION
Stage1=unanimous N/O with materiality language: {len(holdout_no_mat):,}
Of those with human labels: {len(holdout_no_mat_with_human):,}
""")

    if holdout_no_mat_with_human:
        human_cats_for_flagged = Counter()
        for pid, text, hlabels in holdout_no_mat_with_human:
            for hl in hlabels:
                human_cats_for_flagged[hl["category"]] += 1
        print(" HUMAN VALIDATION (flagged holdout paragraphs):")
        total_h = sum(human_cats_for_flagged.values())
        for cat, cnt in human_cats_for_flagged.most_common():
            print(f" {cat}: {cnt} ({cnt / total_h * 100:.1f}%)")


if __name__ == "__main__":
    main()
333 scripts/compare-v30-v35-final.py (new file)
@@ -0,0 +1,333 @@
"""
Comprehensive comparison of v3.0 vs v3.5f prompt on the 359 confusion-axis holdout paragraphs.
Covers per-model accuracy, per-axis breakdown, SI/NO asymmetry, rankings, convergence, and cost.
"""

import json
from collections import Counter
from itertools import combinations
from pathlib import Path

import numpy as np

ROOT = Path(__file__).resolve().parent.parent

# ---------------------------------------------------------------------------
# Model definitions
# ---------------------------------------------------------------------------
MODELS = [
    ("Opus", "golden", "opus"),
    ("GPT-5.4", "bench-holdout", "gpt-5.4"),
    ("Gemini-3.1-Pro", "bench-holdout", "gemini-3.1-pro-preview"),
    ("GLM-5", "bench-holdout", "glm-5:exacto"),
    ("Kimi-K2.5", "bench-holdout", "kimi-k2.5"),
    ("MIMO-v2-Pro", "bench-holdout", "mimo-v2-pro:exacto"),
    ("MiniMax-M2.7", "bench-holdout", "minimax-m2.7:exacto"),
]
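
# MiniMax-M2.7 is scored per-model below but excluded from the 6-model
# majority (see TOP6_NAMES).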

CATEGORY_ABBREV = {
    "None/Other": "N/O",
    "Background": "BG",
    "Risk Management Process": "RMP",
    "Management Role": "MR",
    "Strategy Integration": "SI",
}


def abbrev(cat: str) -> str:
    return CATEGORY_ABBREV.get(cat, cat)


# ---------------------------------------------------------------------------
# Data loading
# ---------------------------------------------------------------------------
def load_jsonl(path: Path) -> list[dict]:
    rows = []
    with open(path) as f:
        for line in f:
            line = line.strip()
            if line:
                rows.append(json.loads(line))
    return rows


def load_model_labels(version_suffix: str, subdir: str, filename: str) -> dict[str, str]:
    """Return {paragraphId: content_category} for a model file."""
    if version_suffix:
        base = ROOT / "data" / "annotations" / f"{subdir}-{version_suffix}" / f"{filename}.jsonl"
    else:
        base = ROOT / "data" / "annotations" / subdir / f"{filename}.jsonl"
    rows = load_jsonl(base)
    return {r["paragraphId"]: r["label"]["content_category"] for r in rows}


def load_model_rows(version_suffix: str, subdir: str, filename: str) -> list[dict]:
    if version_suffix:
        base = ROOT / "data" / "annotations" / f"{subdir}-{version_suffix}" / f"{filename}.jsonl"
    else:
        base = ROOT / "data" / "annotations" / subdir / f"{filename}.jsonl"
    return load_jsonl(base)


# Load holdout PIDs and axes
holdout_rows = load_jsonl(ROOT / "data" / "gold" / "holdout-rerun-v35.jsonl")
HOLDOUT_PIDS = {r["paragraphId"] for r in holdout_rows}
PID_AXES: dict[str, list[str]] = {r["paragraphId"]: r["axes"] for r in holdout_rows}

# Human labels → majority vote per PID
human_raw = load_jsonl(ROOT / "data" / "gold" / "human-labels-raw.jsonl")
human_by_pid: dict[str, list[str]] = {}
for row in human_raw:
    pid = row["paragraphId"]
    if pid in HOLDOUT_PIDS:
        human_by_pid.setdefault(pid, []).append(row["contentCategory"])

human_majority: dict[str, str] = {}
for pid, cats in human_by_pid.items():
    counter = Counter(cats)
    human_majority[pid] = counter.most_common(1)[0][0]

# Load v3.0 and v3.5f labels for all models
v30_labels: dict[str, dict[str, str]] = {}  # model_name -> {pid: cat}
v35_labels: dict[str, dict[str, str]] = {}
v35_rows_by_model: dict[str, list[dict]] = {}

for name, subdir, filename in MODELS:
    # v3.0: full 1200 file, filter to 359
    all_v30 = load_model_labels("", subdir, filename)
    v30_labels[name] = {pid: cat for pid, cat in all_v30.items() if pid in HOLDOUT_PIDS}

    # v3.5f
    suffix = "v35"
    sub = "golden" if subdir == "golden" else "bench-holdout"
    v35_all = load_model_labels(suffix, sub, filename)
    v35_labels[name] = {pid: cat for pid, cat in v35_all.items() if pid in HOLDOUT_PIDS}

    v35_rows_by_model[name] = load_model_rows(suffix, sub, filename)


# Common PID set (intersection of all models in both versions + human majority)
common_pids = set(HOLDOUT_PIDS)
for name in [m[0] for m in MODELS]:
    common_pids &= set(v30_labels[name].keys())
    common_pids &= set(v35_labels[name].keys())
common_pids &= set(human_majority.keys())
common_pids_sorted = sorted(common_pids)

N = len(common_pids_sorted)
print(f"Common paragraphs across all models + human majority: {N}")
print()

# ---------------------------------------------------------------------------
# Helper: 6-model majority (excl MiniMax)
# ---------------------------------------------------------------------------
TOP6_NAMES = [m[0] for m in MODELS if m[0] != "MiniMax-M2.7"]


def majority_vote(labels_dict: dict[str, dict[str, str]], model_names: list[str], pid: str) -> str | None:
    cats = []
    for mn in model_names:
        if pid in labels_dict[mn]:
            cats.append(labels_dict[mn][pid])
    if not cats:
        return None
    counter = Counter(cats)
    return counter.most_common(1)[0][0]
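

# Ties in most_common resolve by first insertion -- here, toward the model
# listed earliest in `model_names` (Opus first for TOP6_NAMES).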


# ===========================================================================
# 1. Per-model summary table
# ===========================================================================
print("=" * 90)
print("1. PER-MODEL SUMMARY TABLE (vs human majority)")
print("=" * 90)
header = f"{'Model':<20} {'v3.0 Acc':>10} {'v3.5f Acc':>10} {'Delta':>8} {'Change%':>9}"
print(header)
print("-" * len(header))

model_v30_acc = {}
model_v35_acc = {}

for name, _, _ in MODELS:
    correct_30 = sum(1 for pid in common_pids_sorted if v30_labels[name][pid] == human_majority[pid])
    correct_35 = sum(1 for pid in common_pids_sorted if v35_labels[name][pid] == human_majority[pid])
    changed = sum(1 for pid in common_pids_sorted if v30_labels[name][pid] != v35_labels[name][pid])

    acc30 = correct_30 / N
    acc35 = correct_35 / N
    delta = acc35 - acc30
    change_rate = changed / N

    model_v30_acc[name] = acc30
    model_v35_acc[name] = acc35

    print(f"{name:<20} {acc30:>9.1%} {acc35:>9.1%} {delta:>+7.1%} {change_rate:>8.1%}")

# 6-model majority row
correct_30_maj = 0
correct_35_maj = 0
changed_maj = 0
for pid in common_pids_sorted:
    m30 = majority_vote(v30_labels, TOP6_NAMES, pid)
    m35 = majority_vote(v35_labels, TOP6_NAMES, pid)
    if m30 == human_majority[pid]:
        correct_30_maj += 1
    if m35 == human_majority[pid]:
        correct_35_maj += 1
    if m30 != m35:
        changed_maj += 1

acc30_maj = correct_30_maj / N
acc35_maj = correct_35_maj / N
delta_maj = acc35_maj - acc30_maj
change_maj_rate = changed_maj / N

model_v30_acc["6-model majority"] = acc30_maj
model_v35_acc["6-model majority"] = acc35_maj

print("-" * len(header))
print(f"{'6-model maj (no MM)':<20} {acc30_maj:>9.1%} {acc35_maj:>9.1%} {delta_maj:>+7.1%} {change_maj_rate:>8.1%}")
print()

# ===========================================================================
# 2. Per-axis breakdown (6-model majority excl MiniMax)
# ===========================================================================
print("=" * 90)
print("2. PER-AXIS BREAKDOWN (6-model majority excl MiniMax vs human majority)")
print("=" * 90)

all_axes = sorted({ax for axes in PID_AXES.values() for ax in axes})
header2 = f"{'Axis':<12} {'N':>5} {'v3.0 Acc':>10} {'v3.5f Acc':>10} {'Delta':>8}"
print(header2)
print("-" * len(header2))

for axis in all_axes:
    axis_pids = [pid for pid in common_pids_sorted if axis in PID_AXES.get(pid, [])]
    n_axis = len(axis_pids)
    if n_axis == 0:
        continue
    correct_30 = sum(1 for pid in axis_pids if majority_vote(v30_labels, TOP6_NAMES, pid) == human_majority[pid])
    correct_35 = sum(1 for pid in axis_pids if majority_vote(v35_labels, TOP6_NAMES, pid) == human_majority[pid])
    a30 = correct_30 / n_axis
    a35 = correct_35 / n_axis
    d = a35 - a30
    print(f"{axis:<12} {n_axis:>5} {a30:>9.1%} {a35:>9.1%} {d:>+7.1%}")

print()

# ===========================================================================
# 3. SI ↔ N/O asymmetry check
# ===========================================================================
print("=" * 90)
print("3. SI <-> N/O ASYMMETRY CHECK")
print("=" * 90)

si_no_pids = [pid for pid in common_pids_sorted if "SI_NO" in PID_AXES.get(pid, [])]
print(f"SI↔N/O paragraphs in common set: {len(si_no_pids)}")
print()

for version_label, labels_dict in [("v3.0", v30_labels), ("v3.5f", v35_labels)]:
    human_si_model_no = 0
    human_no_model_si = 0
    for pid in si_no_pids:
        h = human_majority[pid]
        m = majority_vote(labels_dict, TOP6_NAMES, pid)
        if h == "Strategy Integration" and m == "None/Other":
            human_si_model_no += 1
        elif h == "None/Other" and m == "Strategy Integration":
            human_no_model_si += 1
    print(f"{version_label}:")
    print(f" Human=SI, 6-model=N/O: {human_si_model_no}")
    print(f" Human=N/O, 6-model=SI: {human_no_model_si}")
    print()

# Also show per-model breakdown for SI↔N/O
print("Per-model SI↔N/O errors:")
header3 = f"{'Model':<20} {'v3.0 H=SI,M=NO':>16} {'v3.0 H=NO,M=SI':>16} {'v3.5 H=SI,M=NO':>16} {'v3.5 H=NO,M=SI':>16}"
print(header3)
print("-" * len(header3))
for name, _, _ in MODELS:
    counts = []
    for labels_dict in [v30_labels, v35_labels]:
        hsi_mno = 0
        hno_msi = 0
        for pid in si_no_pids:
            h = human_majority[pid]
            m = labels_dict[name].get(pid)
            if m is None:
                continue
            if h == "Strategy Integration" and m == "None/Other":
                hsi_mno += 1
            elif h == "None/Other" and m == "Strategy Integration":
                hno_msi += 1
        counts.extend([hsi_mno, hno_msi])
    print(f"{name:<20} {counts[0]:>16} {counts[1]:>16} {counts[2]:>16} {counts[3]:>16}")

print()

# ===========================================================================
# 4. Per-model ranking
# ===========================================================================
print("=" * 90)
print("4. PER-MODEL RANKING")
print("=" * 90)

all_names = [m[0] for m in MODELS]

rank_v30 = sorted(all_names, key=lambda n: model_v30_acc[n], reverse=True)
rank_v35 = sorted(all_names, key=lambda n: model_v35_acc[n], reverse=True)

header4 = f"{'Rank':>4} {'v3.0 Model':<20} {'Acc':>8} {'v3.5f Model':<20} {'Acc':>8}"
print(header4)
print("-" * len(header4))
for i in range(len(all_names)):
    n30 = rank_v30[i]
    n35 = rank_v35[i]
    print(f"{i+1:>4} {n30:<20} {model_v30_acc[n30]:>7.1%} {n35:<20} {model_v35_acc[n35]:>7.1%}")

print()

# ===========================================================================
# 5. Model convergence (average pairwise agreement)
# ===========================================================================
print("=" * 90)
print("5. MODEL CONVERGENCE (average pairwise agreement)")
print("=" * 90)


def avg_pairwise_agreement(labels_dict: dict[str, dict[str, str]], model_names: list[str], pids: list[str]) -> float:
    agreements = []
    for m1, m2 in combinations(model_names, 2):
        agree = sum(1 for pid in pids if labels_dict[m1].get(pid) == labels_dict[m2].get(pid))
        agreements.append(agree / len(pids))
    return float(np.mean(agreements))
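

# With k models there are k*(k-1)/2 pairs: 21 for all seven, 15 for the top six.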
for group_label, group_names in [("All 7 models", all_names), ("Top 6 (excl MiniMax)", TOP6_NAMES)]:
|
||||||
|
a30 = avg_pairwise_agreement(v30_labels, group_names, common_pids_sorted)
|
||||||
|
a35 = avg_pairwise_agreement(v35_labels, group_names, common_pids_sorted)
|
||||||
|
delta = a35 - a30
|
||||||
|
print(f"{group_label}:")
|
||||||
|
print(f" v3.0 avg pairwise agreement: {a30:.1%}")
|
||||||
|
print(f" v3.5f avg pairwise agreement: {a35:.1%}")
|
||||||
|
print(f" Delta: {delta:+.1%}")
|
||||||
|
print()
|
||||||
|
|
||||||
|
# ===========================================================================
# 6. Cost summary
# ===========================================================================
print("=" * 90)
print("6. v3.5f RE-RUN COST SUMMARY")
print("=" * 90)

total_cost = 0.0
header6 = f"{'Model':<20} {'Records':>8} {'Cost ($)':>10}"
print(header6)
print("-" * len(header6))
for name, _, _ in MODELS:
    rows = v35_rows_by_model[name]
    cost = sum(r.get("provenance", {}).get("costUsd", 0) for r in rows)
    total_cost += cost
    print(f"{name:<20} {len(rows):>8} {cost:>10.4f}")

print("-" * len(header6))
print(f"{'TOTAL':<20} {'':<8} {total_cost:>10.4f}")
print()
518 scripts/compare-v30-v35.py Normal file
@ -0,0 +1,518 @@
"""Compare v3.0 vs v3.5 annotations on 359 confusion-axis holdout paragraphs."""
|
||||||
|
|
||||||
|
import json
|
||||||
|
from collections import Counter, defaultdict
|
||||||
|
from pathlib import Path
|
||||||
|
|
||||||
|
import numpy as np
|
||||||
|
|
||||||
|
# ── Paths ──────────────────────────────────────────────────────────────────────
|
||||||
|
ROOT = Path(__file__).resolve().parent.parent
|
||||||
|
|
||||||
|
V30_GOLDEN = ROOT / "data/annotations/golden/opus.jsonl"
|
||||||
|
V35_GOLDEN = ROOT / "data/annotations/golden-v35/opus.jsonl"
|
||||||
|
|
||||||
|
V30_BENCH = ROOT / "data/annotations/bench-holdout"
|
||||||
|
V35_BENCH = ROOT / "data/annotations/bench-holdout-v35"
|
||||||
|
|
||||||
|
HUMAN_LABELS = ROOT / "data/gold/human-labels-raw.jsonl"
|
||||||
|
HOLDOUT_META = ROOT / "data/gold/holdout-rerun-v35.jsonl"
|
||||||
|
|
||||||
|
MODEL_FILES = [
|
||||||
|
"opus.jsonl", # golden dirs
|
||||||
|
"gpt-5.4.jsonl",
|
||||||
|
"gemini-3.1-pro-preview.jsonl",
|
||||||
|
"glm-5:exacto.jsonl",
|
||||||
|
"kimi-k2.5.jsonl",
|
||||||
|
"mimo-v2-pro:exacto.jsonl",
|
||||||
|
"minimax-m2.7:exacto.jsonl",
|
||||||
|
]
|
||||||
|
|
||||||
|
MODEL_NAMES = [
|
||||||
|
"Opus",
|
||||||
|
"GPT-5.4",
|
||||||
|
"Gemini-3.1-Pro",
|
||||||
|
"GLM-5",
|
||||||
|
"Kimi-K2.5",
|
||||||
|
"Mimo-v2-Pro",
|
||||||
|
"MiniMax-M2.7",
|
||||||
|
]
|
||||||
|
|
||||||
|
# Category abbreviations used in axes
|
||||||
|
CAT_ABBREV = {
|
||||||
|
"BG": "Board Governance",
|
||||||
|
"MR": "Management Role",
|
||||||
|
"RMP": "Risk Management Process",
|
||||||
|
"SI": "Strategy Integration",
|
||||||
|
"NO": "None/Other",
|
||||||
|
"ID": "Incident Disclosure",
|
||||||
|
"TPR": "Third-Party Risk",
|
||||||
|
}
|
||||||
|
|
||||||
|
ABBREV_CAT = {v: k for k, v in CAT_ABBREV.items()}
|
||||||
|
|
||||||
|
|
||||||
|
def abbrev(cat: str) -> str:
|
||||||
|
return ABBREV_CAT.get(cat, cat)
|
||||||
|
|
||||||
|
|
||||||
|
def full_cat(ab: str) -> str:
|
||||||
|
return CAT_ABBREV.get(ab, ab)
|
||||||
|
|
||||||
|
|
||||||
|
# ── Load data ──────────────────────────────────────────────────────────────────
|
||||||
|
|
||||||
|
def load_jsonl(path: Path) -> list[dict]:
|
||||||
|
with open(path) as f:
|
||||||
|
return [json.loads(line) for line in f if line.strip()]
|
||||||
|
|
||||||
|
|
||||||
|
def load_annotations(base_dir: Path, filename: str) -> dict[str, str]:
|
||||||
|
"""Load paragraphId → content_category mapping."""
|
||||||
|
path = base_dir / filename
|
||||||
|
records = load_jsonl(path)
|
||||||
|
return {r["paragraphId"]: r["label"]["content_category"] for r in records}
|
||||||
|
|
||||||
|
|
||||||
|
def load_golden(path: Path) -> dict[str, str]:
|
||||||
|
records = load_jsonl(path)
|
||||||
|
return {r["paragraphId"]: r["label"]["content_category"] for r in records}
|
||||||
|
|
||||||
|
|
||||||
|
# Load holdout metadata
|
||||||
|
holdout_records = load_jsonl(HOLDOUT_META)
|
||||||
|
holdout_pids = {r["paragraphId"] for r in holdout_records}
|
||||||
|
pid_axes = {r["paragraphId"]: r["axes"] for r in holdout_records}
|
||||||
|
pid_materiality = {r["paragraphId"]: r.get("hasMaterialityLanguage", False) for r in holdout_records}
|
||||||
|
|
||||||
|
assert len(holdout_pids) == 359, f"Expected 359 holdout PIDs, got {len(holdout_pids)}"
|
||||||
|
|
||||||
|
# Load v3.0 annotations per model (filtered to 359 holdout PIDs)
v30: dict[str, dict[str, str]] = {}  # model_name → {pid → category}
v35: dict[str, dict[str, str]] = {}

for fname, mname in zip(MODEL_FILES, MODEL_NAMES):
    if fname == "opus.jsonl":
        v30_all = load_golden(V30_GOLDEN)
        v30[mname] = {pid: v30_all[pid] for pid in holdout_pids if pid in v30_all}
        v35[mname] = load_golden(V35_GOLDEN)
    else:
        v30_all = load_annotations(V30_BENCH, fname)
        v30[mname] = {pid: v30_all[pid] for pid in holdout_pids if pid in v30_all}
        v35[mname] = load_annotations(V35_BENCH, fname)

# Load human labels
human_raw = load_jsonl(HUMAN_LABELS)
# Group by paragraphId, compute majority
human_labels_by_pid: dict[str, list[str]] = defaultdict(list)
for rec in human_raw:
    human_labels_by_pid[rec["paragraphId"]].append(rec["contentCategory"])

human_majority: dict[str, str] = {}
for pid, labels in human_labels_by_pid.items():
    counts = Counter(labels)
    human_majority[pid] = counts.most_common(1)[0][0]

# Axes grouping
axis_pids: dict[str, set[str]] = defaultdict(set)
for pid, axes in pid_axes.items():
    for ax in axes:
        axis_pids[ax].add(pid)

AXIS_LABELS = {
    "SI_NO": "SI↔N/O",
    "MR_RMP": "MR↔RMP",
    "BG_MR": "BG↔MR",
    "BG_RMP": "BG↔RMP",
}


# ── Helpers ────────────────────────────────────────────────────────────────────

def majority_vote(model_cats: dict[str, dict[str, str]], pid: str) -> str | None:
    """Get majority category across all models for a PID."""
    votes = [model_cats[m].get(pid) for m in MODEL_NAMES if pid in model_cats[m]]
    votes = [v for v in votes if v is not None]
    if not votes:
        return None
    counts = Counter(votes)
    return counts.most_common(1)[0][0]
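# Tie note: Counter.most_common breaks ties by first insertion, so a tied
# plurality (e.g. a 3-3-1 split across the seven models) resolves to whichever
# of the tied categories appears first in MODEL_NAMES order.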


def agreement_rate(model_cats: dict[str, dict[str, str]], pids: set[str]) -> float:
    """Average pairwise agreement among 7 models on given PIDs."""
    total_pairs = 0
    agree_pairs = 0
    for pid in pids:
        cats = [model_cats[m].get(pid) for m in MODEL_NAMES if pid in model_cats[m]]
        cats = [c for c in cats if c is not None]
        n = len(cats)
        for i in range(n):
            for j in range(i + 1, n):
                total_pairs += 1
                if cats[i] == cats[j]:
                    agree_pairs += 1
    return agree_pairs / total_pairs if total_pairs > 0 else 0.0


def pairwise_agreement_matrix(model_cats: dict[str, dict[str, str]], pids: set[str]) -> np.ndarray:
    """Return 7x7 pairwise agreement matrix."""
    n = len(MODEL_NAMES)
    mat = np.zeros((n, n))
    for i in range(n):
        for j in range(n):
            if i == j:
                mat[i, j] = 1.0
                continue
            agree = 0
            total = 0
            for pid in pids:
                ci = model_cats[MODEL_NAMES[i]].get(pid)
                cj = model_cats[MODEL_NAMES[j]].get(pid)
                if ci is not None and cj is not None:
                    total += 1
                    if ci == cj:
                        agree += 1
            mat[i, j] = agree / total if total > 0 else 0.0
    return mat


# ── Section 1: Per-model category change rate ─────────────────────────────────

print("=" * 80)
print("1. PER-MODEL CATEGORY CHANGE RATE (v3.0 → v3.5)")
print("=" * 80)
print()

header = f"{'Model':<18} {'Changed':>8} {'Total':>6} {'% Changed':>10}"
print(header)
print("-" * len(header))

for mname in MODEL_NAMES:
    changed = 0
    total = 0
    for pid in holdout_pids:
        c30 = v30[mname].get(pid)
        c35 = v35[mname].get(pid)
        if c30 is not None and c35 is not None:
            total += 1
            if c30 != c35:
                changed += 1
    pct = (changed / total * 100) if total > 0 else 0
    print(f"{mname:<18} {changed:>8} {total:>6} {pct:>9.1f}%")

print()

# Top transitions per model
print("Top category transitions per model:")
print()
for mname in MODEL_NAMES:
    transitions: Counter = Counter()
    for pid in holdout_pids:
        c30 = v30[mname].get(pid)
        c35 = v35[mname].get(pid)
        if c30 is not None and c35 is not None and c30 != c35:
            transitions[(abbrev(c30), abbrev(c35))] += 1
    if transitions:
        top = transitions.most_common(5)
        parts = [f"{a}→{b} ({n})" for (a, b), n in top]
        print(f" {mname:<18} {', '.join(parts)}")

print()

# ── Section 2: Per-axis resolution analysis ───────────────────────────────────

print("=" * 80)
print("2. PER-AXIS RESOLUTION ANALYSIS")
print("=" * 80)
print()

for axis_key, axis_label in AXIS_LABELS.items():
    pids_on_axis = axis_pids[axis_key]

    print(f"--- {axis_label} ({len(pids_on_axis)} paragraphs) ---")
    print()

    # v3.0 and v3.5 majorities
    v30_maj = {pid: majority_vote(v30, pid) for pid in pids_on_axis}
    v35_maj = {pid: majority_vote(v35, pid) for pid in pids_on_axis}

    # Majority distribution
    v30_dist = Counter(v for v in v30_maj.values() if v)
    v35_dist = Counter(v for v in v35_maj.values() if v)

    print(" v3.0 majority distribution: ", end="")
    print(", ".join(f"{abbrev(k)}={v}" for k, v in v30_dist.most_common()))

    print(" v3.5 majority distribution: ", end="")
    print(", ".join(f"{abbrev(k)}={v}" for k, v in v35_dist.most_common()))

    # Flipped majority
    flipped = sum(
        1 for pid in pids_on_axis
        if v30_maj.get(pid) and v35_maj.get(pid) and v30_maj[pid] != v35_maj[pid]
    )
    print(f" Paragraphs with flipped majority: {flipped}/{len(pids_on_axis)} ({flipped / len(pids_on_axis) * 100:.1f}%)")

    # New agreement rate (7-model)
    v30_agree = agreement_rate(v30, pids_on_axis)
    v35_agree = agreement_rate(v35, pids_on_axis)
    print(f" 7-model avg pairwise agreement: v3.0={v30_agree:.3f} → v3.5={v35_agree:.3f} (Δ={v35_agree - v30_agree:+.3f})")
    print()

# ── Section 3: Human alignment improvement ───────────────────────────────────

print("=" * 80)
print("3. HUMAN ALIGNMENT IMPROVEMENT")
print("=" * 80)
print()

# Overall
pids_with_human = holdout_pids & set(human_majority.keys())

v30_agree_human = 0
v35_agree_human = 0
total_human = 0

for pid in pids_with_human:
    hm = human_majority[pid]
    m30 = majority_vote(v30, pid)
    m35 = majority_vote(v35, pid)
    if m30 is not None and m35 is not None:
        total_human += 1
        if m30 == hm:
            v30_agree_human += 1
        if m35 == hm:
            v35_agree_human += 1

v30_pct = v30_agree_human / total_human * 100 if total_human else 0
v35_pct = v35_agree_human / total_human * 100 if total_human else 0

print(f"Overall (n={total_human}):")
print(f" v3.0 GenAI majority vs human majority: {v30_agree_human}/{total_human} ({v30_pct:.1f}%)")
print(f" v3.5 GenAI majority vs human majority: {v35_agree_human}/{total_human} ({v35_pct:.1f}%)")
print(f" Delta: {v35_pct - v30_pct:+.1f}pp")
print()

# By axis
print("By axis:")
header = f"{'Axis':<12} {'n':>4} {'v3.0 %':>8} {'v3.5 %':>8} {'Delta':>8}"
print(header)
print("-" * len(header))

for axis_key, axis_label in AXIS_LABELS.items():
    pids_ax = axis_pids[axis_key] & pids_with_human
    a30 = 0
    a35 = 0
    tot = 0
    for pid in pids_ax:
        hm = human_majority[pid]
        m30 = majority_vote(v30, pid)
        m35 = majority_vote(v35, pid)
        if m30 is not None and m35 is not None:
            tot += 1
            if m30 == hm:
                a30 += 1
            if m35 == hm:
                a35 += 1
    p30 = a30 / tot * 100 if tot else 0
    p35 = a35 / tot * 100 if tot else 0
    print(f"{axis_label:<12} {tot:>4} {p30:>7.1f}% {p35:>7.1f}% {p35 - p30:>+7.1f}pp")

print()

# ── Section 4: SI↔N/O specific analysis ──────────────────────────────────────

print("=" * 80)
print("4. SI↔N/O SPECIFIC ANALYSIS")
print("=" * 80)
print()

si_no_pids = axis_pids["SI_NO"]
print(f"Paragraphs on SI↔N/O axis: {len(si_no_pids)}")
print()

# Per-model SI call rate
print("Per-model SI call rate:")
header = f"{'Model':<18} {'v3.0 SI':>8} {'v3.0 NO':>8} {'v3.5 SI':>8} {'v3.5 NO':>8} {'v3.0 SI%':>9} {'v3.5 SI%':>9}"
print(header)
print("-" * len(header))

for mname in MODEL_NAMES:
    si30 = sum(1 for pid in si_no_pids if v30[mname].get(pid) == "Strategy Integration")
    no30 = sum(1 for pid in si_no_pids if v30[mname].get(pid) == "None/Other")
    si35 = sum(1 for pid in si_no_pids if v35[mname].get(pid) == "Strategy Integration")
    no35 = sum(1 for pid in si_no_pids if v35[mname].get(pid) == "None/Other")
    pct30 = si30 / len(si_no_pids) * 100
    pct35 = si35 / len(si_no_pids) * 100
    print(f"{mname:<18} {si30:>8} {no30:>8} {si35:>8} {no35:>8} {pct30:>8.1f}% {pct35:>8.1f}%")

print()
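# Note: SI% uses all paragraphs on the axis as the denominator, not SI+NO
# calls only, so a model that routes some of these paragraphs to a third
# category shows a lower SI% even with an identical SI/NO balance.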

# N/O → SI switches per model
print("Models switching N/O → SI on SI↔N/O paragraphs:")
for mname in MODEL_NAMES:
    switches = sum(
        1 for pid in si_no_pids
        if v30[mname].get(pid) == "None/Other" and v35[mname].get(pid) == "Strategy Integration"
    )
    reverse = sum(
        1 for pid in si_no_pids
        if v30[mname].get(pid) == "Strategy Integration" and v35[mname].get(pid) == "None/Other"
    )
    print(f" {mname:<18} N/O→SI: {switches:>3}, SI→N/O: {reverse:>3}")

print()

# Per-paragraph tally shift
print("Per-paragraph SI vs N/O tally (v3.0 → v3.5), showing shifts:")
print()
header = f"{'ParagraphId':<38} {'v3.0 SI':>7} {'v3.0 NO':>7} {'v3.5 SI':>7} {'v3.5 NO':>7} {'Human':>6} {'Resolved?':>10}"
print(header)
print("-" * len(header))

resolved_count = 0
total_si_no_with_human = 0
for pid in sorted(si_no_pids):
    si30 = sum(1 for m in MODEL_NAMES if v30[m].get(pid) == "Strategy Integration")
    no30 = sum(1 for m in MODEL_NAMES if v30[m].get(pid) == "None/Other")
    si35 = sum(1 for m in MODEL_NAMES if v35[m].get(pid) == "Strategy Integration")
    no35 = sum(1 for m in MODEL_NAMES if v35[m].get(pid) == "None/Other")
    hm = human_majority.get(pid, "?")
    hm_ab = abbrev(hm) if hm != "?" else "?"

    # "Resolved" = v3.5 majority matches human majority
    v35_maj = "SI" if si35 > no35 else ("NO" if no35 > si35 else "TIE")
    resolved = "YES" if hm_ab == v35_maj else ("" if hm == "?" else "no")
    if hm != "?":
        total_si_no_with_human += 1
        if hm_ab == v35_maj:
            resolved_count += 1

    print(f"{pid[:36]:<38} {si30:>7} {no30:>7} {si35:>7} {no35:>7} {hm_ab:>6} {resolved:>10}")

print()
if total_si_no_with_human:
    print(f"SI↔N/O resolution rate (v3.5 majority matches human): {resolved_count}/{total_si_no_with_human} ({resolved_count / total_si_no_with_human * 100:.1f}%)")
else:
    print("No human labels for SI↔N/O paragraphs")

# 23:0 asymmetry check
print()
print("23:0 asymmetry check:")
# In v3.0, how many SI↔N/O paragraphs had human=SI but GenAI majority=N/O?
asym_30 = sum(
    1 for pid in si_no_pids
    if human_majority.get(pid) == "Strategy Integration" and majority_vote(v30, pid) == "None/Other"
)
asym_35 = sum(
    1 for pid in si_no_pids
    if human_majority.get(pid) == "Strategy Integration" and majority_vote(v35, pid) == "None/Other"
)
print(f" v3.0: Human=SI but GenAI majority=N/O: {asym_30}")
print(f" v3.5: Human=SI but GenAI majority=N/O: {asym_35}")
rev_30 = sum(
    1 for pid in si_no_pids
    if human_majority.get(pid) == "None/Other" and majority_vote(v30, pid) == "Strategy Integration"
)
rev_35 = sum(
    1 for pid in si_no_pids
    if human_majority.get(pid) == "None/Other" and majority_vote(v35, pid) == "Strategy Integration"
)
print(f" v3.0: Human=N/O but GenAI majority=SI: {rev_30}")
print(f" v3.5: Human=N/O but GenAI majority=SI: {rev_35}")

print()

# ── Section 5: Per-model quality on confusion axes ───────────────────────────

print("=" * 80)
print("5. PER-MODEL ACCURACY ON CONFUSION-AXIS PARAGRAPHS (vs human majority)")
print("=" * 80)
print()

model_results = []
for mname in MODEL_NAMES:
    correct_30 = 0
    correct_35 = 0
    total = 0
    for pid in holdout_pids:
        hm = human_majority.get(pid)
        c30 = v30[mname].get(pid)
        c35 = v35[mname].get(pid)
        if hm and c30 and c35:
            total += 1
            if c30 == hm:
                correct_30 += 1
            if c35 == hm:
                correct_35 += 1
    acc30 = correct_30 / total * 100 if total else 0
    acc35 = correct_35 / total * 100 if total else 0
    model_results.append((mname, total, acc30, acc35, acc35 - acc30))

# Sort by v3.5 accuracy descending
model_results.sort(key=lambda x: -x[3])

header = f"{'Rank':>4} {'Model':<18} {'n':>5} {'v3.0 Acc':>9} {'v3.5 Acc':>9} {'Delta':>8}"
print(header)
print("-" * len(header))
for rank, (mname, total, acc30, acc35, delta) in enumerate(model_results, 1):
    print(f"{rank:>4} {mname:<18} {total:>5} {acc30:>8.1f}% {acc35:>8.1f}% {delta:>+7.1f}pp")

print()

# ── Section 6: Model convergence ─────────────────────────────────────────────

print("=" * 80)
print("6. MODEL CONVERGENCE (pairwise agreement)")
print("=" * 80)
print()

v30_avg = agreement_rate(v30, holdout_pids)
v35_avg = agreement_rate(v35, holdout_pids)

print("Average pairwise agreement among 7 models:")
print(f" v3.0: {v30_avg:.3f}")
print(f" v3.5: {v35_avg:.3f}")
print(f" Delta: {v35_avg - v30_avg:+.3f}")
print()

# Per-model average agreement with others
print("Per-model average agreement with other 6 models:")
header = f"{'Model':<18} {'v3.0':>8} {'v3.5':>8} {'Delta':>8}"
print(header)
print("-" * len(header))

v30_mat = pairwise_agreement_matrix(v30, holdout_pids)
v35_mat = pairwise_agreement_matrix(v35, holdout_pids)

for i, mname in enumerate(MODEL_NAMES):
    # Average agreement with other models (exclude self)
    others_30 = [v30_mat[i, j] for j in range(len(MODEL_NAMES)) if j != i]
    others_35 = [v35_mat[i, j] for j in range(len(MODEL_NAMES)) if j != i]
    avg30 = np.mean(others_30)
    avg35 = np.mean(others_35)
    print(f"{mname:<18} {avg30:>7.3f} {avg35:>7.3f} {avg35 - avg30:>+7.3f}")

print()

# Outlier detection
print("Outlier check (models with lowest v3.5 agreement):")
v35_avgs = []
for i, mname in enumerate(MODEL_NAMES):
    others = [v35_mat[i, j] for j in range(len(MODEL_NAMES)) if j != i]
    v35_avgs.append((mname, np.mean(others)))

v35_avgs.sort(key=lambda x: x[1])
mean_agree = np.mean([x[1] for x in v35_avgs])
std_agree = np.std([x[1] for x in v35_avgs])

for mname, avg in v35_avgs:
    z = (avg - mean_agree) / std_agree if std_agree > 0 else 0
    flag = " *** OUTLIER" if z < -1.5 else ""
    print(f" {mname:<18} {avg:.3f} (z={z:+.2f}){flag}")

print()
print("=" * 80)
print("DONE")
print("=" * 80)
714 scripts/cross-analyze-human-vs-genai.py Normal file
@ -0,0 +1,714 @@
"""
|
||||||
|
Cross-analysis: Human annotators vs GenAI models on 1,200-paragraph holdout set.
|
||||||
|
|
||||||
|
Categories: BG, ID, MR, N/O, RMP, SI, TPR
|
||||||
|
Specificity: 1-4
|
||||||
|
13 signals per paragraph: 3 human (BIBD), 3 Stage 1, 1 Opus 4.6, 6 benchmark
|
||||||
|
"""
|
||||||
|
|
||||||
|
import json
|
||||||
|
import sys
|
||||||
|
from collections import Counter, defaultdict
|
||||||
|
from pathlib import Path
|
||||||
|
|
||||||
|
# ── Category abbreviation mapping ────────────────────────────────────────────
|
||||||
|
FULL_TO_ABBR = {
|
||||||
|
"Board Governance": "BG",
|
||||||
|
"Incident Disclosure": "ID",
|
||||||
|
"Management Role": "MR",
|
||||||
|
"None/Other": "N/O",
|
||||||
|
"Risk Management Process": "RMP",
|
||||||
|
"Strategy Integration": "SI",
|
||||||
|
"Third-Party Risk": "TPR",
|
||||||
|
}
|
||||||
|
ABBR_TO_FULL = {v: k for k, v in FULL_TO_ABBR.items()}
|
||||||
|
CATS = ["BG", "ID", "MR", "N/O", "RMP", "SI", "TPR"]
|
||||||
|
|
||||||
|
DATA = Path("data")
|
||||||
|
|
||||||
|
|
||||||
|
def abbr(cat: str) -> str:
|
||||||
|
return FULL_TO_ABBR.get(cat, cat)
|
||||||
|
|
||||||
|
|
||||||
|
def majority_vote(labels: list[str]) -> str:
    """Return majority label or 'split' if no majority."""
    c = Counter(labels)
    top = c.most_common(1)[0]
    if top[1] > len(labels) / 2:
        return top[0]
    # Check for a plurality with tie-break: if top 2 are tied, it's split
    if len(c) >= 2:
        top2 = c.most_common(2)
        if top2[0][1] == top2[1][1]:
            return "split"
    return top[0]
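# Behaviour sketch (illustrative calls):
#   majority_vote(["SI", "SI", "NO"])       -> "SI"    (strict majority)
#   majority_vote(["SI", "NO", "MR"])       -> "split" (top two tied)
#   majority_vote(["SI", "SI", "NO", "MR"]) -> "SI"    (plurality, no tie)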

def median_spec(specs: list[int]) -> float:
    s = sorted(specs)
    n = len(s)
    if n % 2 == 1:
        return float(s[n // 2])
    return (s[n // 2 - 1] + s[n // 2]) / 2.0


def mean_spec(specs: list[int]) -> float:
    return sum(specs) / len(specs) if specs else 0.0

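# Quick check of the two helpers (illustrative): median_spec([1, 2, 3, 4])
# -> 2.5 and mean_spec([1, 2, 3, 4]) -> 2.5. Despite the names they are
# generic median/mean; the word-count analysis in section 7 reuses them as such.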

# ── Load data ────────────────────────────────────────────────────────────────

print("Loading data...\n")

# Human labels: paragraphId → list of (annotatorName, category, specificity)
human_labels: dict[str, list[tuple[str, str, int]]] = defaultdict(list)
with open(DATA / "gold" / "human-labels-raw.jsonl") as f:
    for line in f:
        d = json.loads(line)
        human_labels[d["paragraphId"]].append(
            (d["annotatorName"], abbr(d["contentCategory"]), d["specificityLevel"])
        )

holdout_pids = sorted(human_labels.keys())
assert len(holdout_pids) == 1200, f"Expected 1200 holdout paragraphs, got {len(holdout_pids)}"

# GenAI labels: paragraphId → list of (modelName, category, specificity)
genai_labels: dict[str, list[tuple[str, str, int]]] = defaultdict(list)

# Stage 1 (filter to holdout only)
holdout_set = set(holdout_pids)
with open(DATA / "annotations" / "stage1.patched.jsonl") as f:
    for line in f:
        d = json.loads(line)
        pid = d["paragraphId"]
        if pid in holdout_set:
            model = d["provenance"]["modelId"].split("/")[-1]
            genai_labels[pid].append(
                (model, abbr(d["label"]["content_category"]), d["label"]["specificity_level"])
            )

# Opus
with open(DATA / "annotations" / "golden" / "opus.jsonl") as f:
    for line in f:
        d = json.loads(line)
        genai_labels[d["paragraphId"]].append(
            ("opus-4.6", abbr(d["label"]["content_category"]), d["label"]["specificity_level"])
        )

# Bench-holdout models
bench_files = [
    "gpt-5.4.jsonl",
    "gemini-3.1-pro-preview.jsonl",
    "glm-5:exacto.jsonl",
    "kimi-k2.5.jsonl",
    "mimo-v2-pro:exacto.jsonl",
    "minimax-m2.7:exacto.jsonl",
]
for fname in bench_files:
    fpath = DATA / "annotations" / "bench-holdout" / fname
    model_name = fname.replace(".jsonl", "")
    with open(fpath) as f:
        for line in f:
            d = json.loads(line)
            genai_labels[d["paragraphId"]].append(
                (model_name, abbr(d["label"]["content_category"]), d["label"]["specificity_level"])
            )

# Paragraph metadata
para_meta: dict[str, dict] = {}
with open(DATA / "gold" / "paragraphs-holdout.jsonl") as f:
    for line in f:
        d = json.loads(line)
        if d["id"] in holdout_set:
            para_meta[d["id"]] = d

# ── Compute per-paragraph aggregates ─────────────────────────────────────────

results = []
for pid in holdout_pids:
    h = human_labels[pid]
    g = genai_labels[pid]

    h_cats = [x[1] for x in h]
    h_specs = [x[2] for x in h]
    g_cats = [x[1] for x in g]
    g_specs = [x[2] for x in g]

    all_cats = h_cats + g_cats
    all_specs = h_specs + g_specs

    h_maj = majority_vote(h_cats)
    g_maj = majority_vote(g_cats)
    all_maj = majority_vote(all_cats)

    h_mean_spec = mean_spec(h_specs)
    g_mean_spec = mean_spec(g_specs)
    all_mean_spec = mean_spec(all_specs)

    # Agreement count: how many of 13 agree with overall majority
    agree_count = sum(1 for c in all_cats if c == all_maj) if all_maj != "split" else 0

    meta = para_meta.get(pid, {})

    results.append({
        "pid": pid,
        "h_maj": h_maj,
        "g_maj": g_maj,
        "all_maj": all_maj,
        "h_cats": h_cats,
        "g_cats": g_cats,
        "h_specs": h_specs,
        "g_specs": g_specs,
        "h_mean_spec": h_mean_spec,
        "g_mean_spec": g_mean_spec,
        "all_mean_spec": all_mean_spec,
        "agree_count": agree_count,
        "word_count": meta.get("wordCount", 0),
        "text": meta.get("text", ""),
        "human_annotators": [x[0] for x in h],
        "genai_models": [x[0] for x in g],
        "human_labels": h,
        "genai_labels": g,
    })


def fmt_table(headers: list[str], rows: list[list], align: list[str] | None = None):
    """Format a simple text table."""
    col_widths = [len(h) for h in headers]
    str_rows = []
    for row in rows:
        sr = [str(x) for x in row]
        str_rows.append(sr)
        for i, s in enumerate(sr):
            col_widths[i] = max(col_widths[i], len(s))

    if align is None:
        align = ["r"] * len(headers)

    def fmt_cell(s, w, a):
        return s.rjust(w) if a == "r" else s.ljust(w)

    sep = "+-" + "-+-".join("-" * w for w in col_widths) + "-+"
    hdr = "| " + " | ".join(fmt_cell(h, col_widths[i], "l") for i, h in enumerate(headers)) + " |"
    lines = [sep, hdr, sep]
    for sr in str_rows:
        line = "| " + " | ".join(fmt_cell(sr[i], col_widths[i], align[i]) for i in range(len(headers))) + " |"
        lines.append(line)
    lines.append(sep)
    return "\n".join(lines)
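# Rendering sketch with made-up data:
#   print(fmt_table(["Cat", "N"], [["SI", 42]], ["l", "r"]))
# +-----+----+
# | Cat | N  |
# +-----+----+
# | SI  | 42 |
# +-----+----+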


# ══════════════════════════════════════════════════════════════════════════════
# 1. PER-CATEGORY CONFUSION MATRIX: HUMAN MAJORITY vs GENAI MAJORITY
# ══════════════════════════════════════════════════════════════════════════════

print("=" * 80)
print("1. CONFUSION MATRIX: Human Majority (rows) vs GenAI Majority (cols)")
print("=" * 80)

cats_plus = CATS + ["split"]
cm = defaultdict(lambda: defaultdict(int))
for r in results:
    cm[r["h_maj"]][r["g_maj"]] += 1

headers = ["H\\G"] + cats_plus + ["Total"]
rows = []
for hc in cats_plus:
    row = [hc]
    total = 0
    for gc in cats_plus:
        v = cm[hc][gc]
        row.append(v if v else ".")
        total += v
    row.append(total)
    rows.append(row)

# Column totals
col_totals = ["Total"]
for gc in cats_plus:
    col_totals.append(sum(cm[hc][gc] for hc in cats_plus))
col_totals.append(sum(sum(cm[hc][gc] for gc in cats_plus) for hc in cats_plus))
rows.append(col_totals)

align = ["l"] + ["r"] * (len(headers) - 1)
print(fmt_table(headers, rows, align))

# Diagonal agreement
diag = sum(cm[c][c] for c in cats_plus)
total_paras = len(results)
print(f"\nDiagonal agreement: {diag}/{total_paras} = {diag/total_paras:.1%}")
print(f"Disagreement: {total_paras - diag}/{total_paras} = {(total_paras - diag)/total_paras:.1%}")

# Over/under prediction
print("\nGenAI over/under-prediction relative to human majority:")
headers2 = ["Category", "Human N", "GenAI N", "Diff", "Direction"]
rows2 = []
for c in CATS:
    h_n = sum(cm[c][gc] for gc in cats_plus)
    g_n = sum(cm[hc][c] for hc in cats_plus)
    diff = g_n - h_n
    direction = "OVER" if diff > 0 else ("UNDER" if diff < 0 else "MATCH")
    rows2.append([c, h_n, g_n, f"{diff:+d}", direction])
align2 = ["l", "r", "r", "r", "l"]
print(fmt_table(headers2, rows2, align2))


# ══════════════════════════════════════════════════════════════════════════════
# 2. DIRECTIONAL DISAGREEMENT ANALYSIS
# ══════════════════════════════════════════════════════════════════════════════

print("\n" + "=" * 80)
print("2. DIRECTIONAL DISAGREEMENT: Human Majority -> GenAI Majority transitions")
print("=" * 80)

disagree = [(r["h_maj"], r["g_maj"]) for r in results if r["h_maj"] != r["g_maj"]]
print(f"\nTotal disagreements: {len(disagree)}/{total_paras}")

trans = Counter(disagree)
print("\nTop transitions (H_maj -> G_maj):")
headers3 = ["From (Human)", "To (GenAI)", "Count", "Reverse", "Net", "Symmetric?"]
rows3 = []
seen = set()
for (a, b), cnt in sorted(trans.items(), key=lambda x: -x[1]):
    pair = tuple(sorted([a, b]))
    if pair in seen:
        continue
    seen.add(pair)
    rev = trans.get((b, a), 0)
    net = cnt - rev
    # Heuristic: call the pair symmetric when the net flow is within ~30% of
    # the smaller direction, with a floor of 1 to tolerate tiny counts.
    sym = "Yes" if abs(net) <= max(1, min(cnt, rev) * 0.3) else "No"
    rows3.append([a, b, cnt, rev, f"{net:+d}", sym])
align3 = ["l", "l", "r", "r", "r", "l"]
print(fmt_table(headers3, rows3, align3))


# ══════════════════════════════════════════════════════════════════════════════
# 3. PER-CATEGORY PRECISION/RECALL (Human majority as truth)
# ══════════════════════════════════════════════════════════════════════════════

print("\n" + "=" * 80)
print("3. PER-CATEGORY PRECISION/RECALL (Human majority as ground truth)")
print("=" * 80)

# Filter out splits for clean P/R
valid = [(r["h_maj"], r["g_maj"]) for r in results if r["h_maj"] != "split" and r["g_maj"] != "split"]
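# One-vs-rest bookkeeping for each category c, human majority as reference:
#   TP: h == c and g == c    FP: h != c but g == c    FN: h == c but g != c
# precision = TP / (TP + FP); recall = TP / (TP + FN); F1 is their harmonic
# mean, 2PR / (P + R).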

headers4 = ["Category", "TP", "FP", "FN", "Precision", "Recall", "F1"]
rows4 = []
for c in CATS:
    tp = sum(1 for h, g in valid if h == c and g == c)
    fp = sum(1 for h, g in valid if h != c and g == c)
    fn = sum(1 for h, g in valid if h == c and g != c)
    prec = tp / (tp + fp) if (tp + fp) > 0 else 0
    rec = tp / (tp + fn) if (tp + fn) > 0 else 0
    f1 = 2 * prec * rec / (prec + rec) if (prec + rec) > 0 else 0
    rows4.append([c, tp, fp, fn, f"{prec:.3f}", f"{rec:.3f}", f"{f1:.3f}"])
align4 = ["l", "r", "r", "r", "r", "r", "r"]
print("\nGenAI predictions evaluated against human majority:")
print(fmt_table(headers4, rows4, align4))

# Macro averages
macro_p = sum(float(r[4]) for r in rows4) / len(CATS)
macro_r = sum(float(r[5]) for r in rows4) / len(CATS)
macro_f1 = sum(float(r[6]) for r in rows4) / len(CATS)
print(f"\nMacro-avg: P={macro_p:.3f} R={macro_r:.3f} F1={macro_f1:.3f}")

# Vice versa: GenAI as truth
print("\n--- Vice versa: Human predictions evaluated against GenAI majority ---")
rows4b = []
for c in CATS:
    tp = sum(1 for h, g in valid if g == c and h == c)
    fp = sum(1 for h, g in valid if g != c and h == c)
    fn = sum(1 for h, g in valid if g == c and h != c)
    prec = tp / (tp + fp) if (tp + fp) > 0 else 0
    rec = tp / (tp + fn) if (tp + fn) > 0 else 0
    f1 = 2 * prec * rec / (prec + rec) if (prec + rec) > 0 else 0
    rows4b.append([c, tp, fp, fn, f"{prec:.3f}", f"{rec:.3f}", f"{f1:.3f}"])
print(fmt_table(headers4, rows4b, align4))


# ══════════════════════════════════════════════════════════════════════════════
# 4. SPECIFICITY SYSTEMATIC BIAS
# ══════════════════════════════════════════════════════════════════════════════

print("\n" + "=" * 80)
print("4. SPECIFICITY SYSTEMATIC BIAS: Human vs GenAI")
print("=" * 80)

# Overall
all_h_specs = [s for r in results for s in r["h_specs"]]
all_g_specs = [s for r in results for s in r["g_specs"]]
h_avg = mean_spec(all_h_specs)
g_avg = mean_spec(all_g_specs)
print(f"\nOverall mean specificity: Human={h_avg:.3f} GenAI={g_avg:.3f} Diff={g_avg - h_avg:+.3f}")
print(f"Overall median: Human={median_spec(all_h_specs):.1f} GenAI={median_spec(all_g_specs):.1f}")

# Distribution
print("\nSpecificity distribution:")
h_dist = Counter(all_h_specs)
g_dist = Counter(all_g_specs)
headers5 = ["Spec", "Human N", "Human %", "GenAI N", "GenAI %", "Diff %"]
rows5 = []
for s in [1, 2, 3, 4]:
    hn = h_dist.get(s, 0)
    gn = g_dist.get(s, 0)
    hp = hn / len(all_h_specs) * 100
    gp = gn / len(all_g_specs) * 100
    rows5.append([s, hn, f"{hp:.1f}%", gn, f"{gp:.1f}%", f"{gp - hp:+.1f}%"])
print(fmt_table(headers5, rows5, ["r", "r", "r", "r", "r", "r"]))

# By category
print("\nMean specificity by category:")
headers6 = ["Category", "Human", "GenAI", "Diff", "H count", "G count"]
rows6 = []
for c in CATS:
    h_s = [ann[2] for r in results for ann in r["human_labels"] if ann[1] == c]
    g_s = [ann[2] for r in results for ann in r["genai_labels"] if ann[1] == c]
    if h_s and g_s:
        hm = mean_spec(h_s)
        gm = mean_spec(g_s)
        rows6.append([c, f"{hm:.3f}", f"{gm:.3f}", f"{gm - hm:+.3f}", len(h_s), len(g_s)])
    else:
        rows6.append([c, "N/A", "N/A", "N/A", len(h_s), len(g_s)])
print(fmt_table(headers6, rows6, ["l", "r", "r", "r", "r", "r"]))

# Per-paragraph directional bias
h_higher = sum(1 for r in results if r["h_mean_spec"] > r["g_mean_spec"])
g_higher = sum(1 for r in results if r["g_mean_spec"] > r["h_mean_spec"])
same = sum(1 for r in results if abs(r["h_mean_spec"] - r["g_mean_spec"]) < 0.01)
print(f"\nPer-paragraph: Human higher spec={h_higher} GenAI higher={g_higher} Same={same}")


# ══════════════════════════════════════════════════════════════════════════════
# 5. DIFFICULTY-STRATIFIED ANALYSIS
# ══════════════════════════════════════════════════════════════════════════════

print("\n" + "=" * 80)
print("5. DIFFICULTY-STRATIFIED ANALYSIS")
print("=" * 80)

# Tiers based on 13-signal agreement
# Tier 1: 10+ agree, Tier 2: 7-9 agree, Tier 3: 5-6 agree, Tier 4: <5 agree
def get_tier(agree_count: int) -> str:
    if agree_count >= 10:
        return "T1-Easy"
    elif agree_count >= 7:
        return "T2-Medium"
    elif agree_count >= 5:
        return "T3-Hard"
    else:
        return "T4-VHard"

for r in results:
    r["tier"] = get_tier(r["agree_count"])

tier_counts = Counter(r["tier"] for r in results)
print("\nTier distribution:")
for t in ["T1-Easy", "T2-Medium", "T3-Hard", "T4-VHard"]:
    print(f" {t}: {tier_counts.get(t, 0)} paragraphs")

print("\nHuman-GenAI category agreement rate by difficulty tier:")
headers7 = ["Tier", "N", "Agree", "Agree%", "H=consensus%", "G=consensus%"]
rows7 = []
for t in ["T1-Easy", "T2-Medium", "T3-Hard", "T4-VHard"]:
    tier_r = [r for r in results if r["tier"] == t]
    n = len(tier_r)
    if n == 0:
        continue
    agree = sum(1 for r in tier_r if r["h_maj"] == r["g_maj"])
    h_match_cons = sum(1 for r in tier_r if r["h_maj"] == r["all_maj"])
    g_match_cons = sum(1 for r in tier_r if r["g_maj"] == r["all_maj"])
    rows7.append([
        t, n, agree, f"{agree/n:.1%}",
        f"{h_match_cons/n:.1%}", f"{g_match_cons/n:.1%}"
    ])
print(fmt_table(headers7, rows7, ["l", "r", "r", "r", "r", "r"]))

# On hard paragraphs, who is the odd one out?
print("\nOn hard paragraphs (T3+T4), disagreement breakdown:")
hard = [r for r in results if r["tier"] in ("T3-Hard", "T4-VHard")]
h_odd = sum(1 for r in hard if r["g_maj"] == r["all_maj"] and r["h_maj"] != r["all_maj"])
g_odd = sum(1 for r in hard if r["h_maj"] == r["all_maj"] and r["g_maj"] != r["all_maj"])
both_off = sum(1 for r in hard if r["h_maj"] != r["all_maj"] and r["g_maj"] != r["all_maj"])
both_on = sum(1 for r in hard if r["h_maj"] == r["all_maj"] and r["g_maj"] == r["all_maj"])
print(f" Human is odd-one-out (GenAI=consensus, Human!=consensus): {h_odd}")
print(f" GenAI is odd-one-out (Human=consensus, GenAI!=consensus): {g_odd}")
print(f" Both match consensus: {both_on}")
print(f" Both differ from consensus: {both_off}")
print(f" Total hard: {len(hard)}")


# ══════════════════════════════════════════════════════════════════════════════
# 6. ANNOTATOR-LEVEL PATTERNS
# ══════════════════════════════════════════════════════════════════════════════

print("\n" + "=" * 80)
print("6. ANNOTATOR-LEVEL PATTERNS")
print("=" * 80)

annotators = ["Anuj", "Elisabeth", "Joey", "Meghan", "Xander", "Aaryan"]

# For each annotator, compute agreement with GenAI majority
print("\nPer-annotator agreement with GenAI majority (category):")
headers8 = ["Annotator", "N labels", "Agree w/G_maj", "Agree%", "Agree w/13_maj", "13_maj%", "Avg Spec", "Note"]
rows8 = []
for ann in annotators:
    agree_g = 0
    agree_all = 0
    total = 0
    specs = []
    for r in results:
        for name, cat, spec in r["human_labels"]:
            if name == ann:
                total += 1
                specs.append(spec)
                if cat == r["g_maj"]:
                    agree_g += 1
                if cat == r["all_maj"]:
                    agree_all += 1
    if total == 0:
        continue
    note = "(excluded from aggregates)" if ann == "Aaryan" else ""
    rows8.append([
        ann, total,
        agree_g, f"{agree_g/total:.1%}",
        agree_all, f"{agree_all/total:.1%}",
        f"{mean_spec(specs):.2f}",
        note,
    ])
align8 = ["l", "r", "r", "r", "r", "r", "r", "l"]
print(fmt_table(headers8, rows8, align8))

# Annotator category distributions
print("\nPer-annotator category distribution:")
for ann in annotators:
    cat_counts = Counter()
    for r in results:
        for name, cat, _ in r["human_labels"]:
            if name == ann:
                cat_counts[cat] += 1
    if not cat_counts:
        continue
    total = sum(cat_counts.values())
    dist = " ".join(f"{c}:{cat_counts.get(c, 0):3d}({cat_counts.get(c, 0)/total:.0%})" for c in CATS)
    flag = " ** OUTLIER" if ann == "Aaryan" else ""
    print(f" {ann:10s} (n={total:3d}): {dist}{flag}")


# ══════════════════════════════════════════════════════════════════════════════
# 7. TEXT-FEATURE CORRELATIONS
# ══════════════════════════════════════════════════════════════════════════════

print("\n" + "=" * 80)
print("7. TEXT-FEATURE CORRELATIONS WITH DISAGREEMENT")
print("=" * 80)

agree_r = [r for r in results if r["h_maj"] == r["g_maj"]]
disagree_r = [r for r in results if r["h_maj"] != r["g_maj"]]

# Word count
agree_wc = [r["word_count"] for r in agree_r if r["word_count"] > 0]
disagree_wc = [r["word_count"] for r in disagree_r if r["word_count"] > 0]
print("\nWord count (agree vs disagree):")
print(f" Agreement paragraphs: mean={mean_spec(agree_wc):.1f} median={median_spec(agree_wc):.0f} n={len(agree_wc)}")
print(f" Disagreement paragraphs: mean={mean_spec(disagree_wc):.1f} median={median_spec(disagree_wc):.0f} n={len(disagree_wc)}")

# Word count buckets
print("\nDisagreement rate by word count bucket:")
buckets = [(0, 30, "0-30"), (31, 60, "31-60"), (61, 100, "61-100"), (101, 150, "101-150"), (151, 250, "151-250"), (251, 9999, "251+")]
headers9 = ["WC Bucket", "N", "Disagree", "Disagree%"]
rows9 = []
for lo, hi, label in buckets:
    in_bucket = [r for r in results if lo <= r["word_count"] <= hi]
    dis = sum(1 for r in in_bucket if r["h_maj"] != r["g_maj"])
    if in_bucket:
        rows9.append([label, len(in_bucket), dis, f"{dis/len(in_bucket):.1%}"])
print(fmt_table(headers9, rows9, ["l", "r", "r", "r"]))

# Stage1 method (unanimous vs majority) as proxy for quality tier
print("\nDisagreement rate by Stage 1 confidence method:")
for method in ["unanimous", "majority"]:
    in_method = [r for r in results if para_meta.get(r["pid"], {}).get("stage1Method") == method]
    dis = sum(1 for r in in_method if r["h_maj"] != r["g_maj"])
    if in_method:
        print(f" {method:10s}: {dis}/{len(in_method)} = {dis/len(in_method):.1%} disagree")

# Keyword analysis
print("\nDisagreement rate for paragraphs containing key terms:")
keywords = ["material", "NIST", "CISO", "board", "third party", "third-party", "incident",
            "insurance", "audit", "framework", "breach", "ransomware"]
headers10 = ["Keyword", "N", "Disagree", "Disagree%"]
rows10 = []
for kw in keywords:
    matching = [r for r in results if kw.lower() in r["text"].lower()]
    if not matching:
        continue
    dis = sum(1 for r in matching if r["h_maj"] != r["g_maj"])
    rows10.append([kw, len(matching), dis, f"{dis/len(matching):.1%}"])
rows10.sort(key=lambda x: -int(x[2]))
print(fmt_table(headers10, rows10, ["l", "r", "r", "r"]))


# ══════════════════════════════════════════════════════════════════════════════
# 8. "HUMAN RIGHT, GenAI WRONG" vs "GenAI RIGHT, HUMAN WRONG"
# ══════════════════════════════════════════════════════════════════════════════

print("\n" + "=" * 80)
print("8. HUMAN RIGHT/GENAI WRONG vs GENAI RIGHT/HUMAN WRONG (13-signal consensus)")
print("=" * 80)

# Only consider paragraphs where all_maj is not split and h/g disagree with each other or consensus
h_right_g_wrong = [r for r in results if r["all_maj"] != "split" and r["h_maj"] == r["all_maj"] and r["g_maj"] != r["all_maj"]]
g_right_h_wrong = [r for r in results if r["all_maj"] != "split" and r["g_maj"] == r["all_maj"] and r["h_maj"] != r["all_maj"]]
both_right = [r for r in results if r["all_maj"] != "split" and r["h_maj"] == r["all_maj"] and r["g_maj"] == r["all_maj"]]
both_wrong = [r for r in results if r["all_maj"] != "split" and r["h_maj"] != r["all_maj"] and r["g_maj"] != r["all_maj"]]
has_split = [r for r in results if r["all_maj"] == "split"]

print(f"\n Both correct: {len(both_right)}")
print(f" Human right, GenAI wrong: {len(h_right_g_wrong)}")
print(f" GenAI right, Human wrong: {len(g_right_h_wrong)}")
print(f" Both wrong: {len(both_wrong)}")
print(f" 13-signal split (no consensus): {len(has_split)}")

# Category breakdown
print("\nCategory breakdown of 'Human right, GenAI wrong':")
cat_dist_hrg = Counter(r["all_maj"] for r in h_right_g_wrong)
for c in CATS:
    n = cat_dist_hrg.get(c, 0)
    if n > 0:
        print(f" {c}: {n}")

print("\nCategory breakdown of 'GenAI right, Human wrong':")
cat_dist_grh = Counter(r["all_maj"] for r in g_right_h_wrong)
for c in CATS:
    n = cat_dist_grh.get(c, 0)
    if n > 0:
        print(f" {c}: {n}")

# What did the wrong side predict?
print("\nWhen GenAI is wrong, what does it predict instead?")
wrong_g = Counter(r["g_maj"] for r in h_right_g_wrong)
for label, cnt in wrong_g.most_common():
    print(f" {label}: {cnt}")

print("\nWhen Human is wrong, what do they predict instead?")
wrong_h = Counter(r["h_maj"] for r in g_right_h_wrong)
for label, cnt in wrong_h.most_common():
    print(f" {label}: {cnt}")


# ══════════════════════════════════════════════════════════════════════════════
# 9. SPECIFICITY BY SOURCE TYPE
# ══════════════════════════════════════════════════════════════════════════════

print("\n" + "=" * 80)
print("9. SPECIFICITY BY SOURCE TYPE AND CATEGORY")
print("=" * 80)

# Group models into source types
stage1_models = {"gemini-3.1-flash-lite-preview", "grok-4.1-fast", "mimo-v2-flash"}
frontier_models = {"opus-4.6", "gpt-5.4", "gemini-3.1-pro-preview", "kimi-k2.5"}
budget_models = {"glm-5:exacto", "mimo-v2-pro:exacto", "minimax-m2.7:exacto"}

# Collect specs by source type and category
source_specs: dict[str, dict[str, list[int]]] = {
    "Human": defaultdict(list),
    "Stage1": defaultdict(list),
    "Frontier": defaultdict(list),
    "Budget": defaultdict(list),
}

for r in results:
    for name, cat, spec in r["human_labels"]:
        source_specs["Human"][cat].append(spec)
        source_specs["Human"]["ALL"].append(spec)

    for model, cat, spec in r["genai_labels"]:
        if model in stage1_models:
            src = "Stage1"
        elif model in frontier_models:
            src = "Frontier"
        elif model in budget_models:
            src = "Budget"
        else:
            src = "Budget"  # fallback
        source_specs[src][cat].append(spec)
        source_specs[src]["ALL"].append(spec)

print("\nMean specificity by source type and category:")
src_order = ["Human", "Stage1", "Frontier", "Budget"]
headers11 = ["Category"] + src_order
rows11 = []
for c in CATS + ["ALL"]:
    row = [c]
    for src in src_order:
        specs = source_specs[src].get(c, [])
        if specs:
            row.append(f"{mean_spec(specs):.3f}")
        else:
            row.append("N/A")
    rows11.append(row)
align11 = ["l"] + ["r"] * len(src_order)
print(fmt_table(headers11, rows11, align11))

# Specificity standard deviation by source (math is imported at the top)
print("\nSpecificity std dev by source type:")
for src in src_order:
    specs = source_specs[src]["ALL"]
    if specs:
        m = mean_spec(specs)
        var = sum((s - m) ** 2 for s in specs) / len(specs)
        std = math.sqrt(var)
        print(f" {src:10s}: mean={m:.3f} std={std:.3f} n={len(specs)}")

# ── Per-model specificity rankings ───────────────────────────────────────────
print("\nPer-model mean specificity (all categories):")
model_specs: dict[str, list[int]] = defaultdict(list)
for r in results:
    for name, cat, spec in r["human_labels"]:
        model_specs[f"H:{name}"].append(spec)
    for model, cat, spec in r["genai_labels"]:
        model_specs[f"G:{model}"].append(spec)

headers12 = ["Model", "Mean Spec", "N"]
rows12 = []
for model, specs in sorted(model_specs.items(), key=lambda x: mean_spec(x[1])):
    rows12.append([model, f"{mean_spec(specs):.3f}", len(specs)])
print(fmt_table(headers12, rows12, ["l", "r", "r"]))


# ══════════════════════════════════════════════════════════════════════════════
# SUMMARY
# ══════════════════════════════════════════════════════════════════════════════

print("\n" + "=" * 80)
print("SUMMARY OF KEY FINDINGS")
print("=" * 80)

print(f"""
Dataset: {total_paras} paragraphs, 13 signals each (3 human, 10 GenAI)

1. CATEGORY AGREEMENT: Human majority and GenAI majority agree on {diag/total_paras:.1%} of
   paragraphs. The biggest confusions are in the off-diagonal cells above.

2. DIRECTIONAL DISAGREEMENTS: The most common category swaps reveal systematic
   differences in how humans and GenAI interpret boundary cases.

3. PRECISION/RECALL: GenAI macro F1={macro_f1:.3f} against human majority.

4. SPECIFICITY BIAS: Human mean={h_avg:.3f}, GenAI mean={g_avg:.3f}
   (diff={g_avg - h_avg:+.3f}). {"GenAI rates higher" if g_avg > h_avg else "Humans rate higher"} on average.

5. DIFFICULTY: On easy paragraphs (T1, 10+/13 agree), agreement is very high.
   On hard paragraphs, {"humans" if h_odd > g_odd else "GenAI"} are more often the odd-one-out.

6. ANNOTATORS: See table above for individual alignment with GenAI and consensus.

7. TEXT FEATURES: {"Longer" if mean_spec(disagree_wc) > mean_spec(agree_wc) else "Shorter"} paragraphs
   tend to produce more disagreement.

8. RIGHT/WRONG: Human right & GenAI wrong: {len(h_right_g_wrong)}, GenAI right &
   Human wrong: {len(g_right_h_wrong)}. {"Humans are more often right" if len(h_right_g_wrong) > len(g_right_h_wrong) else "GenAI is more often right"} when they disagree.
""")
736 scripts/examine-hard-cases.py Normal file
@ -0,0 +1,736 @@
#!/usr/bin/env python3
"""Examine hardest disagreement cases in the SEC cybersecurity holdout dataset.

Identifies paragraphs where the 13 annotation sources split on the three main
confusion axes (MR<->RMP, BG<->MR, SI<->N/O), shows representative examples,
extracts linguistic patterns, and recommends codebook rulings.

Run: uv run --with numpy scripts/examine-hard-cases.py
"""

import json
import os
import re
import textwrap
from collections import Counter, defaultdict
from pathlib import Path

import numpy as np

# ── Constants ──────────────────────────────────────────────────────────────────

ROOT = Path(__file__).resolve().parent.parent

CAT_ABBREV = {
    "Board Governance": "BG",
    "Incident Disclosure": "ID",
    "Management Role": "MR",
    "None/Other": "N/O",
    "Risk Management Process": "RMP",
    "Strategy Integration": "SI",
    "Third-Party Risk": "TPR",
}
ABBREV_CAT = {v: k for k, v in CAT_ABBREV.items()}

AXES = [
    ("MR", "RMP", "MR <-> RMP"),
    ("BG", "MR", "BG <-> MR"),
    ("SI", "N/O", "SI <-> N/O"),
]

BENCH_FILES = [
    "gpt-5.4.jsonl",
    "gemini-3.1-pro-preview.jsonl",
    "glm-5:exacto.jsonl",
    "kimi-k2.5.jsonl",
    "mimo-v2-pro:exacto.jsonl",
    "minimax-m2.7:exacto.jsonl",
]

STAGE1_MODEL_SHORT = {
    "google/gemini-3.1-flash-lite-preview": "s1:gemini-flash",
    "x-ai/grok-4.1-fast": "s1:grok-fast",
    "xiaomi/mimo-v2-flash": "s1:mimo-flash",
}

BENCH_MODEL_SHORT = {
    "gpt-5.4.jsonl": "bench:gpt5.4",
    "gemini-3.1-pro-preview.jsonl": "bench:gemini-pro",
    "glm-5:exacto.jsonl": "bench:glm5",
    "kimi-k2.5.jsonl": "bench:kimi",
    "mimo-v2-pro:exacto.jsonl": "bench:mimo-pro",
    "minimax-m2.7:exacto.jsonl": "bench:minimax",
}


# ── Load data ──────────────────────────────────────────────────────────────────

def load_jsonl(path: str | Path) -> list[dict]:
    records = []
    with open(path) as f:
        for line in f:
            line = line.strip()
            if line:
                records.append(json.loads(line))
    return records


def abbrev(cat: str) -> str:
    return CAT_ABBREV.get(cat, cat)


def build_signal_matrix() -> tuple[dict[str, dict[str, str]], dict[str, dict[str, int]]]:
    """Build paragraphId -> {source: category_abbrev} and {source: specificity}."""
    # Only for the 1200 gold PIDs
    gold_pids: set[str] = set()
    human_labels = load_jsonl(ROOT / "data/gold/human-labels-raw.jsonl")
    for rec in human_labels:
        gold_pids.add(rec["paragraphId"])

    cat_matrix: dict[str, dict[str, str]] = defaultdict(dict)
    spec_matrix: dict[str, dict[str, int]] = defaultdict(dict)

    # 1) Human annotators (3 per paragraph)
    for rec in human_labels:
        pid = rec["paragraphId"]
        src = f"human:{rec['annotatorName']}"
        cat_matrix[pid][src] = abbrev(rec["contentCategory"])
        spec_matrix[pid][src] = rec["specificityLevel"]

    # 2) Stage 1 models (filter to gold PIDs)
    stage1_path = ROOT / "data/annotations/stage1.patched.jsonl"
    with open(stage1_path) as f:
        for line in f:
            rec = json.loads(line)
            pid = rec["paragraphId"]
            if pid not in gold_pids:
                continue
            model_id = rec["provenance"]["modelId"]
            src = STAGE1_MODEL_SHORT.get(model_id, model_id)
            cat_matrix[pid][src] = abbrev(rec["label"]["content_category"])
            spec_matrix[pid][src] = rec["label"]["specificity_level"]

    # 3) Opus
    for rec in load_jsonl(ROOT / "data/annotations/golden/opus.jsonl"):
        pid = rec["paragraphId"]
        if pid in gold_pids:
            cat_matrix[pid]["opus"] = abbrev(rec["label"]["content_category"])
            spec_matrix[pid]["opus"] = rec["label"]["specificity_level"]

    # 4) Bench-holdout models
    for fn in BENCH_FILES:
        src = BENCH_MODEL_SHORT[fn]
        for rec in load_jsonl(ROOT / "data/annotations/bench-holdout" / fn):
            pid = rec["paragraphId"]
            if pid in gold_pids:
                cat_matrix[pid][src] = abbrev(rec["label"]["content_category"])
                spec_matrix[pid][src] = rec["label"]["specificity_level"]

    return dict(cat_matrix), dict(spec_matrix)
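
# Composition note: per gold paragraph the matrix holds up to 13 signals —
# 3 human annotators, 3 stage-1 models, opus, and 6 bench-holdout models.
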
def load_paragraphs(gold_pids: set[str]) -> dict[str, dict]:
    """Load paragraph text for gold PIDs."""
    paragraphs = {}
    for rec in load_jsonl(ROOT / "data/gold/paragraphs-holdout.jsonl"):
        if rec["id"] in gold_pids:
            paragraphs[rec["id"]] = rec
    return paragraphs


# ── Analysis helpers ───────────────────────────────────────────────────────────

def find_axis_paragraphs(
    cat_matrix: dict[str, dict[str, str]], a: str, b: str
) -> list[tuple[str, dict[str, str], int, int]]:
    """Find paragraphs where the primary disagreement is between categories a and b.

    Returns list of (pid, signals, count_a, count_b) sorted by disagreement strength.
    """
    results = []
    for pid, signals in cat_matrix.items():
        cats = list(signals.values())
        counts = Counter(cats)
        ca, cb = counts.get(a, 0), counts.get(b, 0)
        if ca >= 1 and cb >= 1 and ca + cb >= len(cats) * 0.5:
            # This paragraph has a meaningful split on this axis
            results.append((pid, signals, ca, cb))
    # Sort by how evenly split (closer to 50/50 = harder)
    results.sort(key=lambda x: -min(x[2], x[3]))
    return results
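
# Worked example (hypothetical counts): a 13-signal paragraph split 7 MR / 6 RMP
# gets sort key -min(7, 6) = -6 and ranks ahead of a 12/1 split (key -1), so
# the most evenly divided paragraphs surface first.
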
def truncate_text(text: str, max_chars: int = 200) -> str:
    if len(text) <= max_chars:
        return text
    return text[:max_chars].rstrip() + "..."


def source_order() -> list[str]:
    """Canonical order for displaying sources."""
    humans = [f"human:{n}" for n in ["Joey", "Anuj", "Aaryan", "Elisabeth", "Meghan", "Xander"]]
    stage1 = ["s1:gemini-flash", "s1:grok-fast", "s1:mimo-flash"]
    opus = ["opus"]
    bench = [BENCH_MODEL_SHORT[fn] for fn in BENCH_FILES]
    return humans + stage1 + opus + bench


def format_signal_breakdown(
    signals: dict[str, str], axis_cats: tuple[str, str]
) -> str:
    """Format which sources said which category."""
    a, b = axis_cats
    a_sources = []
    b_sources = []
    other_sources = []
    for src in source_order():
        if src not in signals:
            continue
        cat = signals[src]
        if cat == a:
            a_sources.append(src)
        elif cat == b:
            b_sources.append(src)
        else:
            other_sources.append(f"{src}={cat}")

    parts = [
        f"  {a} ({len(a_sources)}): {', '.join(a_sources)}",
        f"  {b} ({len(b_sources)}): {', '.join(b_sources)}",
    ]
    if other_sources:
        parts.append(f"  Other: {', '.join(other_sources)}")
    return "\n".join(parts)
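
# Example breakdown (hypothetical sources/categories):
#   MR (4): human:Joey, s1:grok-fast, opus, bench:glm5
#   RMP (2): human:Anuj, bench:kimi
#   Other: s1:mimo-flash=BG
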
def extract_keyword_frequencies(
    paragraphs: dict[str, dict],
    axis_pids: list[str],
    cat_matrix: dict[str, dict[str, str]],
    cat_a: str,
    cat_b: str,
) -> tuple[Counter, Counter, Counter]:
    """Extract keyword frequencies for paragraphs leaning toward cat_a vs cat_b."""
    # Keywords to look for (domain-relevant)
    all_keywords = [
        "board", "director", "committee", "audit", "oversee", "oversight",
        "ciso", "officer", "chief", "vp", "vice president", "manager",
        "manage", "manages", "managing", "management", "responsible",
        "program", "team", "department", "staff", "personnel",
        "report", "reports", "reporting", "brief", "briefing", "informed",
        "incident", "breach", "attack", "compromise", "unauthorized",
        "material", "immaterial", "not material", "no material",
        "strategy", "strategic", "integrate", "integration", "aligned",
        "risk", "assess", "assessment", "framework", "nist", "iso",
        "policy", "policies", "procedure", "procedures",
        "third party", "third-party", "vendor", "supplier", "service provider",
        "insurance", "cyber insurance",
        "training", "awareness", "employee",
        "monitor", "monitoring", "detect", "detection",
        "govern", "governance",
        "experience", "experienced", "background", "qualification", "expertise",
        "day-to-day", "daily", "operational",
        "enterprise", "enterprise-wide",
        "designate", "designated", "appoint", "appointed",
    ]

    lean_a_pids = []
    lean_b_pids = []
    for pid in axis_pids:
        signals = cat_matrix[pid]
        counts = Counter(signals.values())
        if counts.get(cat_a, 0) > counts.get(cat_b, 0):
            lean_a_pids.append(pid)
        elif counts.get(cat_b, 0) > counts.get(cat_a, 0):
            lean_b_pids.append(pid)

    def count_keywords(pids: list[str]) -> Counter:
        kw_counts = Counter()
        for pid in pids:
            if pid not in paragraphs:
                continue
            text_lower = paragraphs[pid]["text"].lower()
            for kw in all_keywords:
                if kw in text_lower:
                    kw_counts[kw] += 1
        return kw_counts

    freq_a = count_keywords(lean_a_pids)
    freq_b = count_keywords(lean_b_pids)
    freq_all = count_keywords(axis_pids)

    return freq_a, freq_b, freq_all
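
# Example (hypothetical counts): if "board" appears in 8 of 10 cat_a-leaning
# paragraphs and 2 of 10 cat_b-leaning ones, callers can compare the rates
# 0.80 vs 0.20 after normalising by group size.
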
def analyze_human_vs_genai_splits(
    axis_pids: list[str],
    cat_matrix: dict[str, dict[str, str]],
    cat_a: str,
    cat_b: str,
) -> tuple[list[str], list[str]]:
    """Find cases where humans lean one way but GenAI leans the other."""
    human_a_genai_b = []  # humans say A, GenAI says B
    human_b_genai_a = []  # humans say B, GenAI says A

    human_prefixes = ["human:"]
    genai_prefixes = ["s1:", "opus", "bench:"]

    for pid in axis_pids:
        signals = cat_matrix[pid]
        human_cats = []
        genai_cats = []
        for src, cat in signals.items():
            if any(src.startswith(p) for p in human_prefixes):
                human_cats.append(cat)
            else:
                genai_cats.append(cat)

        human_a = sum(1 for c in human_cats if c == cat_a)
        human_b = sum(1 for c in human_cats if c == cat_b)
        genai_a = sum(1 for c in genai_cats if c == cat_a)
        genai_b = sum(1 for c in genai_cats if c == cat_b)

        if human_a > human_b and genai_b > genai_a:
            human_a_genai_b.append(pid)
        elif human_b > human_a and genai_a > genai_b:
            human_b_genai_a.append(pid)

    return human_a_genai_b, human_b_genai_a
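
# Example (hypothetical): humans vote MR, MR, RMP while GenAI votes RMP x7,
# MR x3 — humans lean MR, GenAI leans RMP, so the pid lands in human_a_genai_b
# for the ("MR", "RMP") axis.
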
# ── Main analysis ──────────────────────────────────────────────────────────────


def main():
    print("=" * 100)
    print("HARDEST CASES ANALYSIS: SEC CYBERSECURITY HOLDOUT DATASET")
    print("Examining disagreements across 13 annotation sources to inform codebook rulings")
    print("=" * 100)

    # Load data
    print("\nLoading data...")
    cat_matrix, spec_matrix = build_signal_matrix()
    gold_pids = set(cat_matrix.keys())
    paragraphs = load_paragraphs(gold_pids)
    print(f"  Loaded {len(gold_pids)} gold paragraphs with {len(source_order())} potential sources each")

    # Verify source coverage
    source_coverage = Counter()
    for pid in gold_pids:
        for src in cat_matrix[pid]:
            source_coverage[src] += 1
    print("\n  Source coverage:")
    for src in source_order():
        print(f"    {src}: {source_coverage.get(src, 0)} paragraphs")

    # ── Overall disagreement stats ─────────────────────────────────────────

    print("\n" + "=" * 100)
    print("OVERALL DISAGREEMENT STATISTICS")
    print("=" * 100)

    unanimous = 0
    near_unanimous = 0  # 1 dissenter
    split = 0
    for pid in gold_pids:
        cats = list(cat_matrix[pid].values())
        counts = Counter(cats)
        top = counts.most_common(1)[0][1]
        n = len(cats)
        if top == n:
            unanimous += 1
        elif top >= n - 1:
            near_unanimous += 1
        else:
            split += 1

    print(f"\n  Unanimous (all sources agree): {unanimous} ({unanimous/len(gold_pids)*100:.1f}%)")
    print(f"  Near-unanimous (1 dissenter): {near_unanimous} ({near_unanimous/len(gold_pids)*100:.1f}%)")
    print(f"  Split (2+ dissenters): {split} ({split/len(gold_pids)*100:.1f}%)")

    # Count all pairwise disagreement axes
    axis_counts = Counter()
    for pid in gold_pids:
        cats = list(cat_matrix[pid].values())
        unique = set(cats)
        if len(unique) >= 2:
            for c1 in unique:
                for c2 in unique:
                    if c1 < c2:
                        axis_counts[(c1, c2)] += 1

    print("\n  All disagreement axes (paragraph has at least 1 source saying each):")
    for (c1, c2), ct in axis_counts.most_common(30):
        print(f"    {c1} <-> {c2}: {ct} paragraphs")

    # ── Axis-specific analysis ─────────────────────────────────────────────

    all_axis_results = {}

    for cat_a, cat_b, axis_name in AXES:
        print("\n" + "=" * 100)
        print(f"AXIS: {axis_name}")
        print("=" * 100)

        axis_pids_data = find_axis_paragraphs(cat_matrix, cat_a, cat_b)
        axis_pids = [x[0] for x in axis_pids_data]
        all_axis_results[axis_name] = axis_pids

        print(f"\n  Paragraphs with primary {cat_a}/{cat_b} disagreement: {len(axis_pids)}")

        if not axis_pids:
            print("    No paragraphs found on this axis.")
            continue

        # ── Signal split statistics ────────────────────────────────────────

        # Count how the split goes (majority A vs majority B)
        majority_a = sum(1 for _, _, ca, cb in axis_pids_data if ca > cb)
        majority_b = sum(1 for _, _, ca, cb in axis_pids_data if cb > ca)
        tied = sum(1 for _, _, ca, cb in axis_pids_data if ca == cb)
        print(f"  Majority {cat_a}: {majority_a} | Majority {cat_b}: {majority_b} | Tied: {tied}")

        # ── Human vs GenAI splits ──────────────────────────────────────────

        human_a_genai_b, human_b_genai_a = analyze_human_vs_genai_splits(
            axis_pids, cat_matrix, cat_a, cat_b
        )
        print("\n  Human/GenAI disagreements:")
        print(f"    Humans say {cat_a}, GenAI says {cat_b}: {len(human_a_genai_b)}")
        print(f"    Humans say {cat_b}, GenAI says {cat_a}: {len(human_b_genai_a)}")

        # ── Representative examples ────────────────────────────────────────

        # Show hardest cases (most evenly split)
        n_examples = min(10, len(axis_pids_data))
        print(f"\n  {'─' * 90}")
        print(f"  TOP {n_examples} MOST CONTENTIOUS PARAGRAPHS")
        print(f"  {'─' * 90}")

        for i, (pid, signals, ca, cb) in enumerate(axis_pids_data[:n_examples]):
            para = paragraphs.get(pid, {})
            text = para.get("text", "[text not found]")
            company = para.get("companyName", "?")
            word_count = para.get("wordCount", "?")

            print(f"\n  [{i+1}] PID: {pid[:12]}... Company: {company}")
            print(f"      Words: {word_count} | Split: {ca} say {cat_a}, {cb} say {cat_b}, {len(signals)-ca-cb} say other")
            print(f"      Text: {truncate_text(text, 250)}")
            print(format_signal_breakdown(signals, (cat_a, cat_b)))

        # ── Human-A / GenAI-B examples ─────────────────────────────────────

        if human_a_genai_b:
            print(f"\n  {'─' * 90}")
            print(f"  HUMANS SAY {cat_a}, GenAI SAYS {cat_b} (up to 5 examples)")
            print(f"  {'─' * 90}")
            for pid in human_a_genai_b[:5]:
                para = paragraphs.get(pid, {})
                text = para.get("text", "[text not found]")
                print(f"\n    PID: {pid[:12]}...")
                print(f"    Text: {truncate_text(text, 250)}")
                print(format_signal_breakdown(cat_matrix[pid], (cat_a, cat_b)))

        if human_b_genai_a:
            print(f"\n  {'─' * 90}")
            print(f"  HUMANS SAY {cat_b}, GenAI SAYS {cat_a} (up to 5 examples)")
            print(f"  {'─' * 90}")
            for pid in human_b_genai_a[:5]:
                para = paragraphs.get(pid, {})
                text = para.get("text", "[text not found]")
                print(f"\n    PID: {pid[:12]}...")
                print(f"    Text: {truncate_text(text, 250)}")
                print(format_signal_breakdown(cat_matrix[pid], (cat_a, cat_b)))

        # ── Keyword / linguistic patterns ──────────────────────────────────

        print(f"\n  {'─' * 90}")
        print("  LINGUISTIC PATTERNS")
        print(f"  {'─' * 90}")

        freq_a, freq_b, freq_all = extract_keyword_frequencies(
            paragraphs, axis_pids, cat_matrix, cat_a, cat_b
        )

        # Compute over-representation: keywords more common when majority says A vs B
        lean_a_ct = sum(
            1 for pid in axis_pids
            if Counter(cat_matrix[pid].values()).get(cat_a, 0) > Counter(cat_matrix[pid].values()).get(cat_b, 0)
        )
        lean_b_ct = sum(
            1 for pid in axis_pids
            if Counter(cat_matrix[pid].values()).get(cat_b, 0) > Counter(cat_matrix[pid].values()).get(cat_a, 0)
        )

        print(f"\n  Paragraphs leaning {cat_a}: {lean_a_ct} | leaning {cat_b}: {lean_b_ct}")

        # Show keywords sorted by differential
        all_kws = set(freq_a.keys()) | set(freq_b.keys())
        diffs = []
        for kw in all_kws:
            fa = freq_a.get(kw, 0)
            fb = freq_b.get(kw, 0)
            total = freq_all.get(kw, 0)
            if total < 3:
                continue
            # Normalize by group size
            rate_a = fa / max(lean_a_ct, 1)
            rate_b = fb / max(lean_b_ct, 1)
            diff = rate_a - rate_b
            diffs.append((kw, fa, fb, total, rate_a, rate_b, diff))

        diffs.sort(key=lambda x: -abs(x[6]))

        print(f"\n  Keywords by differential (rate in {cat_a}-leaning vs {cat_b}-leaning paragraphs):")
        print(f"    {'Keyword':<22} {'In '+cat_a:>8} {'In '+cat_b:>8} {'Total':>8} {'Rate '+cat_a:>10} {'Rate '+cat_b:>10} {'Diff':>8}")
        print(f"    {'─'*22} {'─'*8} {'─'*8} {'─'*8} {'─'*10} {'─'*10} {'─'*8}")
        for kw, fa, fb, total, ra, rb, diff in diffs[:25]:
            marker = f"<- {cat_a}" if diff > 0.05 else (f"<- {cat_b}" if diff < -0.05 else "")
            print(f"    {kw:<22} {fa:>8} {fb:>8} {total:>8} {ra:>10.2%} {rb:>10.2%} {diff:>+8.2%} {marker}")

    # ── Other notable axes ─────────────────────────────────────────────────

    print("\n" + "=" * 100)
    print("OTHER NOTABLE DISAGREEMENT AXES (10+ paragraphs)")
    print("=" * 100)

    primary_axis_set = {("BG", "MR"), ("MR", "BG"), ("MR", "RMP"), ("RMP", "MR"), ("N/O", "SI"), ("SI", "N/O")}

    other_axes = []
    for (c1, c2), ct in axis_counts.most_common():
        if (c1, c2) not in primary_axis_set and ct >= 10:
            other_axes.append((c1, c2, ct))

    if not other_axes:
        print("\n  No other axes with 10+ paragraphs.")
    else:
        for cat_a, cat_b, count in other_axes:
            print(f"\n  {'─' * 90}")
            print(f"  {cat_a} <-> {cat_b}: {count} paragraphs")
            print(f"  {'─' * 90}")

            axis_pids_data = find_axis_paragraphs(cat_matrix, cat_a, cat_b)
            # Show up to 5 examples
            for i, (pid, signals, ca, cb) in enumerate(axis_pids_data[:5]):
                para = paragraphs.get(pid, {})
                text = para.get("text", "[text not found]")
                print(f"\n    [{i+1}] {truncate_text(text, 200)}")
                print(f"        Split: {ca}x {cat_a}, {cb}x {cat_b}")
                print(format_signal_breakdown(signals, (cat_a, cat_b)))

    # ── Summary statistics ─────────────────────────────────────────────────

    print("\n" + "=" * 100)
    print("SUMMARY STATISTICS")
    print("=" * 100)

    # Per-axis counts
    print("\n  Paragraphs on each primary confusion axis:")
    for cat_a, cat_b, axis_name in AXES:
        axis_data = find_axis_paragraphs(cat_matrix, cat_a, cat_b)
        print(f"    {axis_name}: {len(axis_data)} paragraphs")

    # How many could potentially be resolved by keyword rules?
    print("\n  Keyword-resolvable estimate (paragraphs containing strong discriminator keywords):")

    mr_rmp_data = find_axis_paragraphs(cat_matrix, "MR", "RMP")
    mr_rmp_pids = [x[0] for x in mr_rmp_data]
    resolvable_mr_rmp = 0
    mr_keywords = {"ciso", "chief information security", "chief security", "vp", "vice president",
                   "officer", "director of", "head of", "reports to", "reporting to"}
    rmp_keywords = {"framework", "nist", "iso", "soc 2", "assessment", "penetration test",
                    "vulnerability scan", "audit", "tabletop"}
    for pid in mr_rmp_pids:
        text_lower = paragraphs.get(pid, {}).get("text", "").lower()
        has_mr = any(kw in text_lower for kw in mr_keywords)
        has_rmp = any(kw in text_lower for kw in rmp_keywords)
        if has_mr != has_rmp:  # One side but not the other
            resolvable_mr_rmp += 1
    print(f"    MR <-> RMP: {resolvable_mr_rmp}/{len(mr_rmp_pids)} have clear keyword signal ({resolvable_mr_rmp/max(len(mr_rmp_pids),1)*100:.0f}%)")
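
    # A paragraph counts as resolvable only when exactly one keyword set fires,
    # e.g. "ciso" present with no framework terms; hits on both sides (or
    # neither) leave the paragraph ambiguous.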

    bg_mr_data = find_axis_paragraphs(cat_matrix, "BG", "MR")
    bg_mr_pids = [x[0] for x in bg_mr_data]
    resolvable_bg_mr = 0
    bg_keywords = {"board", "director", "committee", "audit committee", "board of directors"}
    mr_only_keywords = {"ciso", "chief information security", "officer", "vp", "management",
                        "team", "department", "staff", "day-to-day", "operational"}
    for pid in bg_mr_pids:
        text_lower = paragraphs.get(pid, {}).get("text", "").lower()
        has_bg = any(kw in text_lower for kw in bg_keywords)
        has_mr_only = any(kw in text_lower for kw in mr_only_keywords)
        if has_bg and not has_mr_only:
            resolvable_bg_mr += 1
        elif has_mr_only and not has_bg:
            resolvable_bg_mr += 1
    print(f"    BG <-> MR: {resolvable_bg_mr}/{len(bg_mr_pids)} have clear keyword signal ({resolvable_bg_mr/max(len(bg_mr_pids),1)*100:.0f}%)")

    si_no_data = find_axis_paragraphs(cat_matrix, "SI", "N/O")
    si_no_pids = [x[0] for x in si_no_data]
    resolvable_si_no = 0
    si_keywords = {"incident", "breach", "attack", "compromise", "unauthorized access",
                   "ransomware", "malware", "phishing", "data loss", "disruption"}
    no_keywords = {"no material", "not material", "have not experienced", "no known",
                   "not aware of any", "not been subject"}
    for pid in si_no_pids:
        text_lower = paragraphs.get(pid, {}).get("text", "").lower()
        has_si = any(kw in text_lower for kw in si_keywords)
        has_no = any(kw in text_lower for kw in no_keywords)
        if has_no:
            resolvable_si_no += 1
        elif has_si and not has_no:
            resolvable_si_no += 1
    print(f"    SI <-> N/O: {resolvable_si_no}/{len(si_no_pids)} have clear keyword signal ({resolvable_si_no/max(len(si_no_pids),1)*100:.0f}%)")

    # ── Specificity disagreements on confused paragraphs ───────────────────

    print("\n" + "=" * 100)
    print("SPECIFICITY DISAGREEMENT ON CONFUSED PARAGRAPHS")
    print("=" * 100)

    for cat_a, cat_b, axis_name in AXES:
        axis_data = find_axis_paragraphs(cat_matrix, cat_a, cat_b)
        if not axis_data:
            continue
        spec_ranges = []
        for pid, signals, _, _ in axis_data:
            specs = list(spec_matrix.get(pid, {}).values())
            if specs:
                spec_ranges.append(max(specs) - min(specs))
        if spec_ranges:
            avg_range = np.mean(spec_ranges)
            print(f"\n  {axis_name}: avg specificity range = {avg_range:.2f} (0=agree, 3=max disagree)")
            range_dist = Counter(spec_ranges)
            for r in sorted(range_dist.keys()):
                print(f"      Range {r}: {range_dist[r]} paragraphs")

    # ── Recommended codebook rulings ───────────────────────────────────────

    print("\n" + "=" * 100)
    print("RECOMMENDED CODEBOOK RULINGS")
    print("=" * 100)

    print("""
Based on the analysis above, the following rulings would resolve the most cases:

RULING 1: MR vs RMP — "Named-role test"
──────────────────────────────────────────
If the paragraph's PRIMARY subject is a named individual, titled role (CISO, VP,
CTO, etc.), or a specific person's responsibilities/qualifications/experience,
classify as MR. If the paragraph's PRIMARY subject is a process, program, system,
or methodology (even if it mentions who runs it), classify as RMP.

Disambiguator: Ask "Is this paragraph ABOUT a person/role, or ABOUT a process?"
- "Our CISO oversees our cybersecurity program" → MR (about the CISO)
- "Our cybersecurity program includes monitoring, led by the CISO" → RMP (about the program)

RULING 2: BG vs MR — "Board-line test"
──────────────────────────────────────────
If the paragraph describes oversight, reporting, or governance AT or ABOVE the
board/committee level, classify as BG. If it describes responsibilities BELOW
the board level (C-suite officers reporting TO the board, management teams,
operational roles), classify as MR.

Disambiguator: "Does this paragraph describe what the board/committee DOES,
or what someone REPORTS TO the board?"
- "The Audit Committee oversees cybersecurity risk" → BG
- "The CISO reports quarterly to the Audit Committee" → BG (board's receiving mechanism)
- "The CISO manages a team of security analysts" → MR

Key edge case: When a paragraph describes BOTH board oversight AND management
roles, classify by the paragraph's PRIMARY focus. If roughly equal, prefer BG
when board action is the grammatical subject.

RULING 3: SI vs N/O — "Negative-incident test"
──────────────────────────────────────────
Negative incident statements ("we have not experienced any material cybersecurity
incidents") should be classified as N/O, NOT as SI. SI requires disclosure of an
ACTUAL incident that occurred. The mere mention of incidents in a negation context
does not constitute incident disclosure.

However: If the paragraph describes a SPECIFIC past incident (even if resolved or
deemed immaterial), classify as SI. The test is: "Did something actually happen?"
- "We have not experienced material incidents" → N/O
- "In 2023, we experienced a ransomware attack that..." → SI
- "We experienced incidents but none were material" → SI (something happened)
""")
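
    # Sketch of how the negative-incident test could be approximated
    # mechanically (hypothetical phrase list, not the adopted codebook wording):
    #   NEGATED = ("have not experienced", "no material", "not aware of any")
    #   negated_only = any(p in text.lower() for p in NEGATED)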

    # ── Deep dive: the very hardest cases ──────────────────────────────────

    print("=" * 100)
    print("DEEP DIVE: PARAGRAPHS WITH MAXIMUM ENTROPY (4+ DISTINCT CATEGORIES)")
    print("=" * 100)

    high_entropy = []
    for pid in gold_pids:
        cats = list(cat_matrix[pid].values())
        n_unique = len(set(cats))
        if n_unique >= 4:
            high_entropy.append((pid, n_unique, Counter(cats)))

    high_entropy.sort(key=lambda x: -x[1])
    print(f"\n  {len(high_entropy)} paragraphs with 4+ distinct category labels")

    for i, (pid, n_unique, counts) in enumerate(high_entropy[:10]):
        para = paragraphs.get(pid, {})
        text = para.get("text", "[text not found]")
        print(f"\n  [{i+1}] PID: {pid[:12]}... ({n_unique} categories)")
        print(f"      Text: {truncate_text(text, 250)}")
        print(f"      Distribution: {dict(counts.most_common())}")
        # Show all sources
        for src in source_order():
            if src in cat_matrix[pid]:
                cat = cat_matrix[pid][src]
                spec = spec_matrix.get(pid, {}).get(src, "?")
                print(f"        {src:<25} {cat:<5} spec={spec}")

    # ── Per-source accuracy vs human majority ──────────────────────────────

    print("\n" + "=" * 100)
    print("GENAI SOURCE AGREEMENT WITH HUMAN MAJORITY (on axis-confused paragraphs only)")
    print("=" * 100)

    for cat_a, cat_b, axis_name in AXES:
        axis_data = find_axis_paragraphs(cat_matrix, cat_a, cat_b)
        if not axis_data:
            continue

        print(f"\n  {axis_name} ({len(axis_data)} paragraphs):")

        # For each paragraph, determine human majority
        genai_sources = [s for s in source_order() if not s.startswith("human:")]
        source_agree = {s: 0 for s in genai_sources}
        source_total = {s: 0 for s in genai_sources}

        for pid, signals, _, _ in axis_data:
            # Human majority on this axis
            human_cats = [
                signals[s] for s in signals
                if s.startswith("human:") and signals[s] in (cat_a, cat_b)
            ]
            if not human_cats:
                continue
            human_majority = Counter(human_cats).most_common(1)[0][0]

            for src in genai_sources:
                if src in signals:
                    source_total[src] += 1
                    if signals[src] == human_majority:
                        source_agree[src] += 1

        print(f"    {'Source':<25} {'Agree':>8} {'Total':>8} {'Rate':>8}")
        print(f"    {'─'*25} {'─'*8} {'─'*8} {'─'*8}")
        for src in genai_sources:
            total = source_total[src]
            agree = source_agree[src]
            rate = agree / max(total, 1)
            print(f"    {src:<25} {agree:>8} {total:>8} {rate:>8.1%}")

    print("\n" + "=" * 100)
    print("END OF ANALYSIS")
    print("=" * 100)


if __name__ == "__main__":
    main()
764
scripts/examine-v35-errors.py
Normal file
@ -0,0 +1,764 @@
"""Examine specific paragraphs where v3.5 performed WORSE than v3.0 against human labels.
|
||||||
|
|
||||||
|
Focus on BG↔MR and MR↔RMP confusion axes.
|
||||||
|
"""
|
||||||
|
|
||||||
|
import json
|
||||||
|
import textwrap
|
||||||
|
from collections import Counter, defaultdict
|
||||||
|
from pathlib import Path
|
||||||
|
|
||||||
|
# ── Paths ──────────────────────────────────────────────────────────────────────
|
||||||
|
ROOT = Path(__file__).resolve().parent.parent
|
||||||
|
|
||||||
|
V30_GOLDEN = ROOT / "data/annotations/golden/opus.jsonl"
|
||||||
|
V35_GOLDEN = ROOT / "data/annotations/golden-v35/opus.jsonl"
|
||||||
|
|
||||||
|
V30_BENCH = ROOT / "data/annotations/bench-holdout"
|
||||||
|
V35_BENCH = ROOT / "data/annotations/bench-holdout-v35"
|
||||||
|
|
||||||
|
HUMAN_LABELS = ROOT / "data/gold/human-labels-raw.jsonl"
|
||||||
|
HOLDOUT_META = ROOT / "data/gold/holdout-rerun-v35.jsonl"
|
||||||
|
PARAGRAPHS = ROOT / "data/gold/paragraphs-holdout.jsonl"
|
||||||
|
|
||||||
|
MODEL_FILES = [
|
||||||
|
"opus.jsonl",
|
||||||
|
"gpt-5.4.jsonl",
|
||||||
|
"gemini-3.1-pro-preview.jsonl",
|
||||||
|
"glm-5:exacto.jsonl",
|
||||||
|
"kimi-k2.5.jsonl",
|
||||||
|
"mimo-v2-pro:exacto.jsonl",
|
||||||
|
"minimax-m2.7:exacto.jsonl",
|
||||||
|
]
|
||||||
|
|
||||||
|
MODEL_NAMES = [
|
||||||
|
"Opus",
|
||||||
|
"GPT-5.4",
|
||||||
|
"Gemini",
|
||||||
|
"GLM-5",
|
||||||
|
"Kimi",
|
||||||
|
"Mimo",
|
||||||
|
"MiniMax",
|
||||||
|
]
|
||||||
|
|
||||||
|
# Models to EXCLUDE from majority calculation
|
||||||
|
EXCLUDED_FROM_MAJORITY = {"MiniMax"}
|
||||||
|
|
||||||
|

ABBREV_CAT = {
    "BG": "Board Governance",
    "MR": "Management Role",
    "RMP": "Risk Management Process",
    "SI": "Strategy Integration",
    "NO": "None/Other",
    "ID": "Incident Disclosure",
    "TPR": "Third-Party Risk",
}

CAT_ABBREV = {v: k for k, v in ABBREV_CAT.items()}


def abbrev(cat: str) -> str:
    return CAT_ABBREV.get(cat, cat)

def load_jsonl(path: Path) -> list[dict]:
    with open(path) as f:
        return [json.loads(line) for line in f if line.strip()]


def load_annotations(base_dir: Path, filename: str) -> dict[str, str]:
    """Load paragraphId → content_category mapping."""
    path = base_dir / filename
    records = load_jsonl(path)
    return {r["paragraphId"]: r["label"]["content_category"] for r in records}


def load_golden(path: Path) -> dict[str, str]:
    records = load_jsonl(path)
    return {r["paragraphId"]: r["label"]["content_category"] for r in records}


# ── Load all data ─────────────────────────────────────────────────────────────

print("Loading data...")

# Confusion axis metadata
meta_records = load_jsonl(HOLDOUT_META)
pid_axes: dict[str, list[str]] = {r["paragraphId"]: r["axes"] for r in meta_records}
all_pids = set(pid_axes.keys())
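
# Axis tags such as "BG_MR" mark which confusion axis a paragraph was rerun
# for (format inferred from the membership checks below).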

# Human labels: paragraphId → list of (annotator, category)
human_raw = load_jsonl(HUMAN_LABELS)
human_labels: dict[str, list[tuple[str, str]]] = defaultdict(list)
for r in human_raw:
    if r["paragraphId"] in all_pids:
        human_labels[r["paragraphId"]].append(
            (r["annotatorName"], r["contentCategory"])
        )


def human_majority(pid: str) -> str | None:
    """Return majority category from human annotators, or None if no data."""
    labels = human_labels.get(pid)
    if not labels:
        return None
    cats = [c for _, c in labels]
    counts = Counter(cats)
    top = counts.most_common(1)[0]
    return top[0]
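
# Note: Counter.most_common breaks ties by first occurrence, so a 1-1-1 split
# across three annotators returns whichever label was recorded first; such
# "majorities" are weak signals.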

# Paragraph text
para_records = load_jsonl(PARAGRAPHS)
para_text: dict[str, str] = {r["id"]: r["text"] for r in para_records}

# v3.0 signals: model_idx → {pid: category}
v30_signals: list[dict[str, str]] = []
for fname in MODEL_FILES:
    if fname == "opus.jsonl":
        v30_signals.append(load_golden(V30_GOLDEN))
    else:
        v30_signals.append(load_annotations(V30_BENCH, fname))

# v3.5 signals
v35_signals: list[dict[str, str]] = []
for fname in MODEL_FILES:
    if fname == "opus.jsonl":
        v35_signals.append(load_golden(V35_GOLDEN))
    else:
        v35_signals.append(load_annotations(V35_BENCH, fname))


def get_signals(signals: list[dict[str, str]], pid: str) -> list[str | None]:
    """Get category from each model for a paragraph."""
    return [s.get(pid) for s in signals]


def majority_vote(signals: list[str | None], exclude_minimax: bool = True) -> str | None:
    """Compute majority from 6 models (excluding minimax which is index 6)."""
    cats = []
    for i, s in enumerate(signals):
        if s is None:
            continue
        if exclude_minimax and MODEL_NAMES[i] in EXCLUDED_FROM_MAJORITY:
            continue
        cats.append(s)
    if not cats:
        return None
    counts = Counter(cats)
    return counts.most_common(1)[0][0]


def unanimity_score(signals: list[str | None], exclude_minimax: bool = True) -> float:
    """Fraction of models agreeing with majority (0-1)."""
    cats = []
    for i, s in enumerate(signals):
        if s is None:
            continue
        if exclude_minimax and MODEL_NAMES[i] in EXCLUDED_FROM_MAJORITY:
            continue
        cats.append(s)
    if not cats:
        return 0.0
    counts = Counter(cats)
    top_count = counts.most_common(1)[0][1]
    return top_count / len(cats)
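
# Worked example (hypothetical): signals ["BG", "BG", "MR", None, "BG", "BG", "RMP"]
# drop the None and MiniMax (index 6), leaving 5 votes, so majority_vote gives
# "BG" and unanimity_score gives 4/5 = 0.8.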

def format_signals(signals: list[str | None]) -> str:
    """Compact model signal display."""
    parts = []
    for name, cat in zip(MODEL_NAMES, signals):
        if cat is None:
            parts.append(f"{name}=??")
        else:
            parts.append(f"{name}={abbrev(cat)}")
    return ", ".join(parts)


def wrap_text(text: str, width: int = 100) -> str:
    return "\n  ".join(textwrap.wrap(text, width=width))


def print_paragraph_analysis(
    pid: str,
    v30_sigs: list[str | None],
    v35_sigs: list[str | None],
    header: str = "",
):
    """Print detailed analysis for a single paragraph."""
    text = para_text.get(pid, "[TEXT NOT FOUND]")
    h_labels = human_labels.get(pid, [])
    h_maj = human_majority(pid)
    v30_maj = majority_vote(v30_sigs)
    v35_maj = majority_vote(v35_sigs)
    axes = pid_axes.get(pid, [])

    if header:
        print(f"\n{'─' * 110}")
        print(f"  {header}")
        print(f"{'─' * 110}")
    else:
        print(f"\n{'─' * 110}")

    print(f"  PID: {pid}")
    print(f"  Axes: {', '.join(axes)}")
    print("\n  TEXT:")
    print(f"  {wrap_text(text)}")

    print("\n  HUMAN VOTES:")
    for name, cat in h_labels:
        marker = " ✓" if cat == h_maj else ""
        print(f"    {name:12s} → {abbrev(cat):5s}{marker}")
    print(f"    Majority → {abbrev(h_maj) if h_maj else '??'}")

    print(f"\n  v3.0 signals: {format_signals(v30_sigs)}")
    print(f"  v3.0 majority (excl. MiniMax): {abbrev(v30_maj) if v30_maj else '??'}")

    print(f"  v3.5 signals: {format_signals(v35_sigs)}")
    print(f"  v3.5 majority (excl. MiniMax): {abbrev(v35_maj) if v35_maj else '??'}")

    # What changed
    changed_models = []
    for i, (old, new) in enumerate(zip(v30_sigs, v35_sigs)):
        if old is not None and new is not None and old != new:
            changed_models.append(f"{MODEL_NAMES[i]}: {abbrev(old)}→{abbrev(new)}")
    if changed_models:
        print(f"\n  CHANGES: {', '.join(changed_models)}")

    correct_v30 = v30_maj == h_maj if v30_maj and h_maj else None
    correct_v35 = v35_maj == h_maj if v35_maj and h_maj else None
    print(
        f"  v3.0 {'CORRECT' if correct_v30 else 'WRONG'} | "
        f"v3.5 {'CORRECT' if correct_v35 else 'WRONG'}"
    )

# ══════════════════════════════════════════════════════════════════════════════
# SECTION 1: BG↔MR Regression Cases
# ══════════════════════════════════════════════════════════════════════════════

print("\n" + "═" * 110)
print(" SECTION 1: BG↔MR AXIS — REGRESSION CASES")
print(" (v3.0 matched human majority, but v3.5 does NOT)")
print("═" * 110)

bg_mr_pids = [pid for pid, axes in pid_axes.items() if "BG_MR" in axes]
print(f"\nTotal BG↔MR paragraphs: {len(bg_mr_pids)}")

# Filter to those with human labels
bg_mr_pids = [pid for pid in bg_mr_pids if human_majority(pid) is not None]
print(f"With human labels: {len(bg_mr_pids)}")

regressions_bg_mr = []
improvements_bg_mr = []
both_correct_bg_mr = []
both_wrong_bg_mr = []

for pid in bg_mr_pids:
    v30_sigs = get_signals(v30_signals, pid)
    v35_sigs = get_signals(v35_signals, pid)
    v30_maj = majority_vote(v30_sigs)
    v35_maj = majority_vote(v35_sigs)
    h_maj = human_majority(pid)

    if v30_maj is None or v35_maj is None or h_maj is None:
        continue

    v30_correct = abbrev(v30_maj) == abbrev(h_maj)
    v35_correct = abbrev(v35_maj) == abbrev(h_maj)

    if v30_correct and not v35_correct:
        regressions_bg_mr.append(pid)
    elif not v30_correct and v35_correct:
        improvements_bg_mr.append(pid)
    elif v30_correct and v35_correct:
        both_correct_bg_mr.append(pid)
    else:
        both_wrong_bg_mr.append(pid)

print("\nBG↔MR Summary:")
print(f"  Both correct: {len(both_correct_bg_mr)}")
print(f"  Both wrong: {len(both_wrong_bg_mr)}")
print(f"  v3.0 correct → v3.5 WRONG (REGRESSIONS): {len(regressions_bg_mr)}")
print(f"  v3.0 wrong → v3.5 correct (IMPROVEMENTS): {len(improvements_bg_mr)}")

print(f"\n{'━' * 110}")
print(" BG↔MR REGRESSIONS (showing up to 20)")
print(f"{'━' * 110}")

for i, pid in enumerate(regressions_bg_mr[:20]):
    v30_sigs = get_signals(v30_signals, pid)
    v35_sigs = get_signals(v35_signals, pid)
    print_paragraph_analysis(pid, v30_sigs, v35_sigs, f"REGRESSION #{i+1}")

# BG↔MR improvements
print(f"\n{'━' * 110}")
print(" BG↔MR IMPROVEMENTS (showing up to 5)")
print(f"{'━' * 110}")

for i, pid in enumerate(improvements_bg_mr[:5]):
    v30_sigs = get_signals(v30_signals, pid)
    v35_sigs = get_signals(v35_signals, pid)
    print_paragraph_analysis(pid, v30_sigs, v35_sigs, f"IMPROVEMENT #{i+1}")

# ══════════════════════════════════════════════════════════════════════════════
# SECTION 2: MR↔RMP Non-Convergence Cases
# ══════════════════════════════════════════════════════════════════════════════

print("\n\n" + "═" * 110)
print(" SECTION 2: MR↔RMP AXIS — NON-CONVERGENCE AND REGRESSIONS")
print("═" * 110)

mr_rmp_pids = [pid for pid, axes in pid_axes.items() if "MR_RMP" in axes]
print(f"\nTotal MR↔RMP paragraphs: {len(mr_rmp_pids)}")
mr_rmp_pids = [pid for pid in mr_rmp_pids if human_majority(pid) is not None]
print(f"With human labels: {len(mr_rmp_pids)}")

# Find: less unanimous in v3.5 OR flipped away from human majority
non_convergence_mr_rmp = []
regressions_mr_rmp = []
improvements_mr_rmp = []

for pid in mr_rmp_pids:
    v30_sigs = get_signals(v30_signals, pid)
    v35_sigs = get_signals(v35_signals, pid)
    v30_maj = majority_vote(v30_sigs)
    v35_maj = majority_vote(v35_sigs)
    h_maj = human_majority(pid)
    v30_unanimity = unanimity_score(v30_sigs)
    v35_unanimity = unanimity_score(v35_sigs)

    if v30_maj is None or v35_maj is None or h_maj is None:
        continue

    v30_correct = abbrev(v30_maj) == abbrev(h_maj)
    v35_correct = abbrev(v35_maj) == abbrev(h_maj)

    # Regression: was correct, now wrong
    if v30_correct and not v35_correct:
        regressions_mr_rmp.append((pid, v30_unanimity, v35_unanimity))

    # Non-convergence: less unanimous OR flipped away
    if v35_unanimity < v30_unanimity or (v30_correct and not v35_correct):
        non_convergence_mr_rmp.append((pid, v30_unanimity, v35_unanimity))

    if not v30_correct and v35_correct:
        improvements_mr_rmp.append((pid, v30_unanimity, v35_unanimity))

# Sort non-convergence by delta (worst first)
non_convergence_mr_rmp.sort(key=lambda x: x[1] - x[2], reverse=True)

print("\nMR↔RMP Summary:")
print(f"  Regressions (correct→wrong): {len(regressions_mr_rmp)}")
print(f"  Non-convergence (less unanimous or regressed): {len(non_convergence_mr_rmp)}")
print(f"  Improvements (wrong→correct): {len(improvements_mr_rmp)}")

print(f"\n{'━' * 110}")
print(" MR↔RMP NON-CONVERGENCE / REGRESSION CASES (showing up to 10)")
print(f"{'━' * 110}")

shown = set()
count = 0
for pid, v30_u, v35_u in non_convergence_mr_rmp:
    if count >= 10:
        break
    if pid in shown:
        continue
    shown.add(pid)
    v30_sigs = get_signals(v30_signals, pid)
    v35_sigs = get_signals(v35_signals, pid)
    v30_maj = majority_vote(v30_sigs)
    v35_maj = majority_vote(v35_sigs)
    h_maj = human_majority(pid)
    label = (
        "REGRESSION"
        if abbrev(v30_maj) == abbrev(h_maj) and abbrev(v35_maj) != abbrev(h_maj)
        else "LESS UNANIMOUS"
    )
    print_paragraph_analysis(
        pid, v30_sigs, v35_sigs,
        f"{label} #{count+1} (unanimity: v3.0={v30_u:.0%} → v3.5={v35_u:.0%})"
    )
    count += 1

print(f"\n{'━' * 110}")
print(" MR↔RMP IMPROVEMENTS (showing up to 5)")
print(f"{'━' * 110}")

for i, (pid, v30_u, v35_u) in enumerate(improvements_mr_rmp[:5]):
    v30_sigs = get_signals(v30_signals, pid)
    v35_sigs = get_signals(v35_signals, pid)
    print_paragraph_analysis(
        pid, v30_sigs, v35_sigs,
        f"IMPROVEMENT #{i+1} (unanimity: v3.0={v30_u:.0%} → v3.5={v35_u:.0%})"
    )

# ══════════════════════════════════════════════════════════════════════════════
# SECTION 3: Error Pattern Analysis
# ══════════════════════════════════════════════════════════════════════════════

print("\n\n" + "═" * 110)
print(" SECTION 3: ERROR PATTERN ANALYSIS")
print("═" * 110)

# ── BG↔MR regression patterns ───────────────────────────────────────────────
print(f"\n{'━' * 110}")
print(" 3A: BG↔MR REGRESSION PATTERNS")
print(f"{'━' * 110}")

if regressions_bg_mr:
    # Analyze what the human majority is and what v3.5 switched to
    regression_directions = Counter()
    regression_model_flips = Counter()

    for pid in regressions_bg_mr:
        h_maj = human_majority(pid)
        v30_sigs = get_signals(v30_signals, pid)
        v35_sigs = get_signals(v35_signals, pid)
        v30_maj = majority_vote(v30_sigs)
        v35_maj = majority_vote(v35_sigs)
        direction = f"{abbrev(v30_maj)}→{abbrev(v35_maj)} (human={abbrev(h_maj)})"
        regression_directions[direction] += 1

        # Which models flipped?
        for i, (old, new) in enumerate(zip(v30_sigs, v35_sigs)):
            if old and new and old != new:
                regression_model_flips[MODEL_NAMES[i]] += 1

    print("\n  Regression directions (v3.0→v3.5, human ground truth):")
    for direction, count in regression_directions.most_common():
        print(f"    {direction}: {count}")

    print("\n  Models that flipped most on regressions:")
    for model, count in regression_model_flips.most_common():
        print(f"    {model}: {count} flips")

    # Text pattern analysis
    print("\n  Common textual signals in regression paragraphs:")
    signal_words = {
        "board": 0, "committee": 0, "oversee": 0, "oversight": 0,
        "report": 0, "director": 0, "officer": 0, "CISO": 0,
        "governance": 0, "responsible": 0, "qualif": 0, "experience": 0,
        "manage": 0, "program": 0, "framework": 0, "process": 0,
        "audit": 0,
    }
    for pid in regressions_bg_mr:
        text = para_text.get(pid, "").lower()
        for word in signal_words:
            if word.lower() in text:
                signal_words[word] += 1

    total_reg = len(regressions_bg_mr)
    for word, count in sorted(signal_words.items(), key=lambda x: -x[1]):
        if count > 0:
            print(f"    '{word}': {count}/{total_reg} ({count/total_reg:.0%})")

    # Check if humans are split on these
    print("\n  Human agreement on regressions:")
    unanimous_human = 0
    split_human = 0
    for pid in regressions_bg_mr:
        labels = human_labels.get(pid, [])
        cats = [c for _, c in labels]
        if len(set(cats)) == 1:
            unanimous_human += 1
        else:
            split_human += 1
    print(f"    Unanimous human: {unanimous_human}")
    print(f"    Split human (2-1): {split_human}")

    if split_human > 0:
        print("\n  Split-human regression details:")
        for pid in regressions_bg_mr:
            labels = human_labels.get(pid, [])
            cats = [c for _, c in labels]
            if len(set(cats)) > 1:
                votes = ", ".join(f"{n}={abbrev(c)}" for n, c in labels)
                print(f"      {pid[:12]}... → {votes}")
else:
    print("\n  No BG↔MR regressions found.")

# ── MR↔RMP patterns ─────────────────────────────────────────────────────────
print(f"\n{'━' * 110}")
print(" 3B: MR↔RMP NON-CONVERGENCE PATTERNS")
print(f"{'━' * 110}")

if non_convergence_mr_rmp:
    # Regression directions
    nc_directions = Counter()
    nc_model_flips = Counter()

    for pid, _, _ in non_convergence_mr_rmp:
        h_maj = human_majority(pid)
        v30_sigs = get_signals(v30_signals, pid)
        v35_sigs = get_signals(v35_signals, pid)
        v30_maj = majority_vote(v30_sigs)
        v35_maj = majority_vote(v35_sigs)
        direction = f"{abbrev(v30_maj)}→{abbrev(v35_maj)} (human={abbrev(h_maj)})"
        nc_directions[direction] += 1

        for i, (old, new) in enumerate(zip(v30_sigs, v35_sigs)):
            if old and new and old != new:
                nc_model_flips[MODEL_NAMES[i]] += 1

    print("\n  Direction of non-convergent shifts:")
    for direction, count in nc_directions.most_common():
        print(f"    {direction}: {count}")

    print("\n  Models that flipped most:")
    for model, count in nc_model_flips.most_common():
        print(f"    {model}: {count} flips")

    # Text pattern analysis — compare what helped vs what didn't
    print("\n  Text signals in NON-CONVERGENT vs IMPROVED paragraphs:")

    keywords = ["CISO", "officer", "responsible", "oversee", "report",
                "program", "framework", "qualif", "experience", "certif",
                "manage", "assess", "monitor", "team", "director"]

    nc_pids_set = {pid for pid, _, _ in non_convergence_mr_rmp}
    imp_pids_set = {pid for pid, _, _ in improvements_mr_rmp}

    print(f"\n    {'Keyword':<16} {'Non-conv':>10} {'Improved':>10}")
    print(f"    {'─'*16} {'─'*10} {'─'*10}")
    for kw in keywords:
        nc_count = sum(1 for pid in nc_pids_set if kw.lower() in para_text.get(pid, "").lower())
        imp_count = sum(1 for pid in imp_pids_set if kw.lower() in para_text.get(pid, "").lower())
        nc_pct = f"{nc_count}/{len(nc_pids_set)}" if nc_pids_set else "0"
        imp_pct = f"{imp_count}/{len(imp_pids_set)}" if imp_pids_set else "0"
        print(f"    {kw:<16} {nc_pct:>10} {imp_pct:>10}")

    # Person-removal test analysis
    print("\n  Person-removal test applicability:")
    print("    Checking if regression paragraphs have person as ONLY subject...")
    for pid, _, _ in regressions_mr_rmp:
        text = para_text.get(pid, "")
        has_person_subject = any(
            marker in text.lower()
            for marker in ["ciso", "chief information", "chief technology",
                           "vice president", "director of", "officer"]
        )
        has_process_subject = any(
            marker in text.lower()
            for marker in ["program", "framework", "process", "system",
                           "controls", "policies", "procedures"]
        )
        h_maj = human_majority(pid)
        v35_maj = majority_vote(get_signals(v35_signals, pid))
        print(
            f"      {pid[:12]}... person_subj={has_person_subject} "
            f"process_subj={has_process_subject} "
            f"human={abbrev(h_maj)} v3.5={abbrev(v35_maj)}"
        )
else:
    print("\n  No MR↔RMP non-convergence cases found.")

# ══════════════════════════════════════════════════════════════════════════════
# SECTION 4: Ruling Recommendations
# ══════════════════════════════════════════════════════════════════════════════

print("\n\n" + "═" * 110)
print(" SECTION 4: RULING RECOMMENDATIONS")
print("═" * 110)

print("""
Based on the error analysis above, here are the specific ruling observations:

━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
4A: BG↔MR Board-Line Test
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━

CURRENT RULING (Rule 2):
"When a paragraph spans layers (governance chain paragraphs): apply the
dominant-subject test — which layer occupies the most sentence-subjects?"

"Governance overview spanning board → committee → officer → program →
Board Governance if the board/committee occupies more sentence-subjects;
Management Role if the officer does; Risk Management Process if the
program does"
""")

# Analyze the specific regressions to give targeted advice
if regressions_bg_mr:
    # Count what direction the regressions went
    bg_to_mr = sum(
        1 for pid in regressions_bg_mr
        if abbrev(majority_vote(get_signals(v35_signals, pid))) == "MR"
        and abbrev(human_majority(pid)) == "BG"
    )
    mr_to_bg = sum(
        1 for pid in regressions_bg_mr
        if abbrev(majority_vote(get_signals(v35_signals, pid))) == "BG"
        and abbrev(human_majority(pid)) == "MR"
    )
    other_dir = len(regressions_bg_mr) - bg_to_mr - mr_to_bg

    print("  EMPIRICAL FINDING:")
    print(f"    Regressions that moved BG→MR (human says BG): {bg_to_mr}")
    print(f"    Regressions that moved MR→BG (human says MR): {mr_to_bg}")
    print(f"    Other directions: {other_dir}")

    if bg_to_mr > mr_to_bg:
        print("""
DIAGNOSIS: The dominant-subject test is OVER-CORRECTING toward MR.
When a governance chain mentions a CISO or officer, models are counting that
mention as a "sentence subject" even when the paragraph's primary purpose is
describing the board/committee oversight structure.

PROPOSED FIX — add a "purpose test" before the subject count:
"Before counting sentence-subjects, ask: what is the paragraph's PRIMARY
COMMUNICATIVE PURPOSE? If it is to describe the oversight/reporting
structure (who oversees whom, what gets reported where), the paragraph
is Board Governance even if individual officers are named as intermediaries.
The dominant-subject count applies only when the paragraph's purpose is
genuinely ambiguous between describing the oversight structure and
describing the officer's role."

Alternatively, add a carve-out:
"A governance chain paragraph (board → committee → officer → program)
defaults to Board Governance unless the officer section constitutes
MORE THAN HALF the paragraph's content AND includes qualifications,
credentials, or personal background."
""")
    elif mr_to_bg > bg_to_mr:
        print("""
DIAGNOSIS: The dominant-subject test is OVER-CORRECTING toward BG.
Paragraphs that are primarily about management roles are being pulled
toward BG because they mention board oversight.

PROPOSED FIX:
"When a paragraph's primary content is about a management role (CISO,
CIO, etc.) and mentions board oversight only as context for the
reporting relationship, classify as Management Role. Board Governance
requires the board/committee to be the PRIMARY ACTOR, not merely
the recipient of reports."
""")

print("""
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
4B: MR↔RMP Three-Step Chain
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━

CURRENT RULING (Rule 2b):
"Step 1 — Subject test: What is the paragraph's grammatical subject?
Step 2 — Person-removal test: Could you delete all named roles, titles,
qualifications, experience descriptions, and credentials from the
paragraph and still have a coherent cybersecurity disclosure?
Step 3 — Qualifications tiebreaker: Does the paragraph include experience
(years), certifications (CISSP, CISM), education, team size, or career
history for named individuals?"
""")

if regressions_mr_rmp:
    mr_to_rmp = sum(
        1 for pid, _, _ in regressions_mr_rmp
        if abbrev(majority_vote(get_signals(v35_signals, pid))) == "RMP"
        and abbrev(human_majority(pid)) == "MR"
    )
    rmp_to_mr = sum(
        1 for pid, _, _ in regressions_mr_rmp
        if abbrev(majority_vote(get_signals(v35_signals, pid))) == "MR"
        and abbrev(human_majority(pid)) == "RMP"
    )

    print("  EMPIRICAL FINDING:")
    print(f"    Regressions that moved MR→RMP (human says MR): {mr_to_rmp}")
    print(f"    Regressions that moved RMP→MR (human says RMP): {rmp_to_mr}")

    if mr_to_rmp > rmp_to_mr:
        print("""
DIAGNOSIS: The person-removal test is TOO AGGRESSIVE at removing people.
When a paragraph describes a CISO's monitoring activities, the person-removal
test says "yes, the monitoring process stands alone," but the HUMANS recognize
that the paragraph is fundamentally about the management role's responsibilities.

PROPOSED FIX — tighten the person-removal test:
"Step 2 — Person-removal test: Delete all named roles AND their associated
|
||||||
|
ACTIVITIES. If the paragraph still describes a cybersecurity process or
|
||||||
|
framework, it is Risk Management Process. If deleting the roles and their
|
||||||
|
activities leaves nothing substantive, it is Management Role.
|
||||||
|
Key distinction: 'The CISO monitors threat intelligence' — removing the
|
||||||
|
CISO removes the monitoring activity, so this is Management Role.
|
||||||
|
'The company monitors threat intelligence under the direction of the CISO'
|
||||||
|
— removing the CISO leaves the monitoring intact, so this is RMP."
|
||||||
|
""")
|
||||||
|
elif rmp_to_mr > mr_to_rmp:
|
||||||
|
print("""
|
||||||
|
DIAGNOSIS: The three-step chain is UNDER-APPLYING the person-removal test.
|
||||||
|
Models are stopping at Step 1 (subject test) when they see a role title,
|
||||||
|
without proceeding to the person-removal test.
|
||||||
|
|
||||||
|
PROPOSED FIX:
|
||||||
|
"Step 1 should only produce a STRONG signal, not a decisive result.
|
||||||
|
Always proceed to Step 2 unless the paragraph is ENTIRELY about
|
||||||
|
a person's credentials with no process content whatsoever."
|
||||||
|
""")
|
||||||
|
|
||||||
|
if not regressions_mr_rmp:
|
||||||
|
print("""
|
||||||
|
No MR↔RMP regressions found. The three-step chain may be working correctly,
|
||||||
|
or the non-convergence is increasing uncertainty without changing majority votes.
|
||||||
|
Focus on whether the increased model disagreement reflects genuine ambiguity
|
||||||
|
or whether the step instructions need to be more prescriptive.
|
||||||
|
""")
|
||||||
|
|
||||||
|
# ── Final summary stats ──────────────────────────────────────────────────────
|
||||||
|
print("\n" + "═" * 110)
|
||||||
|
print(" FINAL SUMMARY")
|
||||||
|
print("═" * 110)
|
||||||
|
|
||||||
|
# Overall accuracy comparison
|
||||||
|
total_with_human = 0
|
||||||
|
v30_correct_total = 0
|
||||||
|
v35_correct_total = 0
|
||||||
|
|
||||||
|
for pid in all_pids:
|
||||||
|
h_maj = human_majority(pid)
|
||||||
|
if h_maj is None:
|
||||||
|
continue
|
||||||
|
v30_sigs = get_signals(v30_signals, pid)
|
||||||
|
v35_sigs = get_signals(v35_signals, pid)
|
||||||
|
v30_maj = majority_vote(v30_sigs)
|
||||||
|
v35_maj = majority_vote(v35_sigs)
|
||||||
|
if v30_maj is None or v35_maj is None:
|
||||||
|
continue
|
||||||
|
total_with_human += 1
|
||||||
|
if abbrev(v30_maj) == abbrev(h_maj):
|
||||||
|
v30_correct_total += 1
|
||||||
|
if abbrev(v35_maj) == abbrev(h_maj):
|
||||||
|
v35_correct_total += 1
|
||||||
|
|
||||||
|
print(f"\n Overall accuracy on {total_with_human} confusion-axis paragraphs:")
|
||||||
|
print(f" v3.0: {v30_correct_total}/{total_with_human} ({v30_correct_total/total_with_human:.1%})")
|
||||||
|
print(f" v3.5: {v35_correct_total}/{total_with_human} ({v35_correct_total/total_with_human:.1%})")
|
||||||
|
print(f" Delta: {v35_correct_total - v30_correct_total:+d}")
|
||||||
|
|
||||||
|
# Per-axis breakdown
|
||||||
|
for axis_name in ["BG_MR", "MR_RMP", "BG_RMP", "SI_NO"]:
|
||||||
|
axis_pids = [pid for pid, axes in pid_axes.items() if axis_name in axes]
|
||||||
|
v30_c = 0
|
||||||
|
v35_c = 0
|
||||||
|
n = 0
|
||||||
|
for pid in axis_pids:
|
||||||
|
h_maj = human_majority(pid)
|
||||||
|
if h_maj is None:
|
||||||
|
continue
|
||||||
|
v30_sigs = get_signals(v30_signals, pid)
|
||||||
|
v35_sigs = get_signals(v35_signals, pid)
|
||||||
|
v30_maj = majority_vote(v30_sigs)
|
||||||
|
v35_maj = majority_vote(v35_sigs)
|
||||||
|
if v30_maj is None or v35_maj is None:
|
||||||
|
continue
|
||||||
|
n += 1
|
||||||
|
if abbrev(v30_maj) == abbrev(h_maj):
|
||||||
|
v30_c += 1
|
||||||
|
if abbrev(v35_maj) == abbrev(h_maj):
|
||||||
|
v35_c += 1
|
||||||
|
|
||||||
|
if n > 0:
|
||||||
|
print(f"\n {axis_name} ({n} paragraphs):")
|
||||||
|
print(f" v3.0: {v30_c}/{n} ({v30_c/n:.1%})")
|
||||||
|
print(f" v3.5: {v35_c}/{n} ({v35_c/n:.1%})")
|
||||||
|
print(f" Delta: {v35_c - v30_c:+d}")
|
||||||
|
|
||||||
|
print()
|
||||||
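Aside: the person-removal distinction quoted in ruling 4B can be made executable. The sketch below is illustrative only — `person_removal_test` and its regex are hypothetical helpers, not part of this commit — and it crudely treats a role at the very start of a sentence as the grammatical subject.

import re

# Hypothetical helper sketching ruling 4B's person-removal test: if the named
# role is the sentence subject, removing it also removes the activity (→ MR);
# otherwise the process stands on its own (→ RMP).
ROLE_RE = re.compile(r"\b(the\s+)?(CISO|CIO|CTO|Chief\s+\w+\s+Officer)\b", re.IGNORECASE)

def person_removal_test(sentence: str) -> str:
    match = ROLE_RE.search(sentence)
    if match is None:
        return "RMP"  # no named role at all — the process content stands alone
    # Crude subject check: the role appears at the start of the sentence.
    return "MR" if match.start() <= 4 else "RMP"

# The two contrast cases quoted in the proposed fix:
assert person_removal_test("The CISO monitors threat intelligence") == "MR"
assert person_removal_test(
    "The company monitors threat intelligence under the direction of the CISO"
) == "RMP"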
scripts/extract-regression-pids.py — new file, 167 lines
@@ -0,0 +1,167 @@
"""Identify paragraph IDs where the v3.5 6-model majority regressed vs v3.0.

A "regression" = the v3.0 majority matched the human majority but the v3.5 majority does not.

We compute the category majority from 6 models (excluding minimax):
opus, gpt-5.4, gemini-3.1-pro-preview, glm-5:exacto, kimi-k2.5, mimo-v2-pro:exacto.

v3.0 annotations are filtered to the 359 PIDs present in holdout-rerun-v35.jsonl.
"""

from __future__ import annotations

import json
from collections import Counter
from pathlib import Path

ROOT = Path(__file__).resolve().parent.parent
DATA = ROOT / "data"

# ── Model files (excluding minimax) ──────────────────────────────────────────

V30_FILES = [
    DATA / "annotations" / "golden" / "opus.jsonl",
    DATA / "annotations" / "bench-holdout" / "gpt-5.4.jsonl",
    DATA / "annotations" / "bench-holdout" / "gemini-3.1-pro-preview.jsonl",
    DATA / "annotations" / "bench-holdout" / "glm-5:exacto.jsonl",
    DATA / "annotations" / "bench-holdout" / "kimi-k2.5.jsonl",
    DATA / "annotations" / "bench-holdout" / "mimo-v2-pro:exacto.jsonl",
]

V35_FILES = [
    DATA / "annotations" / "golden-v35" / "opus.jsonl",
    DATA / "annotations" / "bench-holdout-v35" / "gpt-5.4.jsonl",
    DATA / "annotations" / "bench-holdout-v35" / "gemini-3.1-pro-preview.jsonl",
    DATA / "annotations" / "bench-holdout-v35" / "glm-5:exacto.jsonl",
    DATA / "annotations" / "bench-holdout-v35" / "kimi-k2.5.jsonl",
    DATA / "annotations" / "bench-holdout-v35" / "mimo-v2-pro:exacto.jsonl",
]


def load_annotations(files: list[Path]) -> dict[str, list[str]]:
    """Load annotations, returning {pid: [category, ...]} across models."""
    result: dict[str, list[str]] = {}
    for f in files:
        with open(f) as fh:
            for line in fh:
                rec = json.loads(line)
                pid = rec["paragraphId"]
                cat = rec["label"]["content_category"]
                result.setdefault(pid, []).append(cat)
    return result


def majority_vote(labels: list[str]) -> str | None:
    """Return the most common label, or None if tied."""
    counts = Counter(labels)
    top = counts.most_common(2)
    if len(top) == 1:
        return top[0][0]
    if top[0][1] > top[1][1]:
        return top[0][0]
    return None  # tie
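# Quick reference (hypothetical inputs) for the tie rule above:
#   majority_vote(["BG", "BG", "BG", "BG", "MR", "MR"]) -> "BG"  (clear 4-2)
#   majority_vote(["BG", "BG", "BG", "MR", "MR", "MR"]) -> None  (3-3 tie: PID is skipped)
#   majority_vote(["BG", "BG", "MR", "MR", "RMP", "RMP"]) -> None (2-2 tie at the top)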

def load_human_majority() -> dict[str, str]:
    """Compute the human majority label per PID from the 3-annotator raw labels."""
    pid_labels: dict[str, list[str]] = {}
    with open(DATA / "gold" / "human-labels-raw.jsonl") as f:
        for line in f:
            rec = json.loads(line)
            pid = rec["paragraphId"]
            pid_labels.setdefault(pid, []).append(rec["contentCategory"])
    return {
        pid: maj
        for pid, labels in pid_labels.items()
        if (maj := majority_vote(labels)) is not None
    }


def load_holdout_pids() -> dict[str, list[str]]:
    """Load the 359 confusion-axis PIDs and their axes."""
    result: dict[str, list[str]] = {}
    with open(DATA / "gold" / "holdout-rerun-v35.jsonl") as f:
        for line in f:
            rec = json.loads(line)
            result[rec["paragraphId"]] = rec["axes"]
    return result


# Axis name → output key mapping
AXIS_TO_KEY = {
    "BG_MR": "bg_mr_regressions",
    "BG_RMP": "bg_mr_regressions",  # both BG confusion axes go to the bg_mr bucket
    "MR_RMP": "mr_rmp_regressions",
    "SI_NO": "mr_rmp_regressions",  # SI/NO doesn't fit neatly; grouped with mr_rmp
}


def main() -> None:
    holdout = load_holdout_pids()
    holdout_pids = set(holdout.keys())

    human_maj = load_human_majority()

    v30_ann = load_annotations(V30_FILES)
    v35_ann = load_annotations(V35_FILES)

    # Compute model majorities filtered to holdout PIDs; require all 6 models
    v30_maj: dict[str, str | None] = {}
    for pid in holdout_pids:
        labels = v30_ann.get(pid, [])
        v30_maj[pid] = majority_vote(labels) if len(labels) == 6 else None

    v35_maj: dict[str, str | None] = {}
    for pid in holdout_pids:
        labels = v35_ann.get(pid, [])
        v35_maj[pid] = majority_vote(labels) if len(labels) == 6 else None

    # Find regressions
    bg_mr_regressions: list[str] = []
    mr_rmp_regressions: list[str] = []

    for pid in sorted(holdout_pids):
        h = human_maj.get(pid)
        v30 = v30_maj.get(pid)
        v35 = v35_maj.get(pid)

        if h is None or v30 is None or v35 is None:
            continue

        # Regression: v3.0 matched the human majority, v3.5 does not
        if v30 == h and v35 != h:
            axes = holdout[pid]
            # Assign to a bucket based on the axes
            is_bg_mr = any(a in ("BG_MR", "BG_RMP") for a in axes)
            is_mr_rmp = any(a in ("MR_RMP", "SI_NO") for a in axes)

            if is_bg_mr:
                bg_mr_regressions.append(pid)
            if is_mr_rmp:
                mr_rmp_regressions.append(pid)
            # If neither axis matched, still count the PID (fallback: mr_rmp)
            if not is_bg_mr and not is_mr_rmp:
                mr_rmp_regressions.append(pid)

    all_regressions = sorted(set(bg_mr_regressions + mr_rmp_regressions))

    output = {
        "bg_mr_regressions": sorted(bg_mr_regressions),
        "mr_rmp_regressions": sorted(mr_rmp_regressions),
        "all_regressions": all_regressions,
    }

    out_path = DATA / "gold" / "regression-pids.json"
    with open(out_path, "w") as f:
        json.dump(output, f, indent=2)
        f.write("\n")

    print(f"BG/MR regressions:  {len(bg_mr_regressions)}")
    print(f"MR/RMP regressions: {len(mr_rmp_regressions)}")
    print(f"Total unique:       {len(all_regressions)}")
    print(f"Written to {out_path}")


if __name__ == "__main__":
    main()
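Aside: a toy check (invented PIDs and labels) of the regression predicate the script applies — v3.0 agreed with the humans, v3.5 no longer does:

human = {"p1": "BG", "p2": "MR", "p3": "RMP"}
v30 = {"p1": "BG", "p2": "MR", "p3": "MR"}
v35 = {"p1": "MR", "p2": "MR", "p3": "RMP"}

regressions = [p for p in sorted(human) if v30[p] == human[p] and v35[p] != human[p]]
assert regressions == ["p1"]  # p2 is stable; p3 is an improvement, not a regression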
scripts/flag-stage1-corrections.py — new file, 305 lines
@@ -0,0 +1,305 @@
"""
Flag Stage 1 paragraphs needing Stage 2 re-evaluation due to codebook v2.5 -> v3.5 drift.

Two categories of flags:
1. Materiality assessment language in N/O or RMP paragraphs — backward-looking
   conclusions or SEC-qualified forward-looking statements that constitute a
   materiality assessment (should be Strategy Integration under the v3.5 codebook).
2. SPAC/shell company paragraphs coded as substantive categories — should be None/Other.

Materiality rule (tightened after 6 rounds of prompt iteration):
  IS an assessment: "have not materially affected/impacted", "not materially affected",
    "reasonably likely to materially affect/impact",
    "have not experienced any material cybersecurity" (unless in cross-reference context).
  NOT an assessment: "could/may ... material adverse effect" (boilerplate speculation),
    "material" as an adjective ("material risks"), cross-references ("see Item 1A"),
    consequence clauses at the end of RMP descriptions.

Usage: uv run scripts/flag-stage1-corrections.py
"""

import json
import re
from collections import Counter, defaultdict
from pathlib import Path

DATA_DIR = Path(__file__).resolve().parent.parent / "data"
STAGE1_PATH = DATA_DIR / "annotations" / "stage1.patched.jsonl"
PARAGRAPHS_PATH = DATA_DIR / "paragraphs" / "paragraphs-clean.patched.jsonl"
HOLDOUT_PATH = DATA_DIR / "gold" / "human-labels-raw.jsonl"
OUTPUT_PATH = DATA_DIR / "annotations" / "stage1-corrections.jsonl"

# Category abbreviation mapping for the stage1Labels output
CATEGORY_ABBREV = {
    "None/Other": "N/O",
    "Board Governance": "BG",
    "Management Role": "MR",
    "Risk Management Process": "RMP",
    "Third-Party Risk": "TPR",
    "Incident Disclosure": "ID",
}

# --- Materiality patterns (strict, assessment-only) ---

# Positive patterns: backward-looking conclusions and SEC-qualified forward-looking statements
MATERIALITY_ASSESSMENT_RE = re.compile(
    r"(?:"
    # Backward-looking conclusions
    r"have\s+not\s+materially\s+affected"
    r"|has\s+not\s+materially\s+affected"
    r"|not\s+materially\s+affected"
    r"|have\s+not\s+materially\s+impacted"
    # SEC-qualified forward-looking
    r"|reasonably\s+likely\s+to\s+materially\s+affect"
    r"|reasonably\s+likely\s+to\s+materially\s+impact"
    # Negative assertions about incidents
    r"|have\s+not\s+experienced\s+any\s+material\s+cybersecurity"
    r")",
    re.IGNORECASE,
)

# Negative filter: cross-reference context near the match (within 200 chars after)
CROSS_REF_RE = re.compile(
    r"(?:see\s+Item|see\s+Part|see\s+our\s+risk\s+factors|refer\s+to)",
    re.IGNORECASE,
)

# Negative filter: speculative/boilerplate "could/may + material adverse" patterns
SPECULATIVE_RE = re.compile(
    r"(?:could|may|might|can)\s+(?:\w+\s+){0,3}material\s+adverse\s+effect",
    re.IGNORECASE,
)


def has_materiality_language(text: str) -> str | None:
    """Return a snippet around the matched materiality assessment, or None if no match."""
    match = MATERIALITY_ASSESSMENT_RE.search(text)
    if match is None:
        return None

    match_start = match.start()
    match_end = match.end()

    # Check cross-reference context: look within 200 chars after the match
    post_context = text[match_end : match_end + 200]
    if CROSS_REF_RE.search(post_context):
        return None

    # Also check for a cross-reference before the match (within 100 chars)
    pre_context = text[max(0, match_start - 100) : match_start]
    if CROSS_REF_RE.search(pre_context):
        return None

    # Extract a snippet around the match for context
    snippet_start = max(0, match_start - 30)
    snippet_end = min(len(text), match_end + 30)
    return text[snippet_start:snippet_end].strip()


# --- SPAC patterns ---

SPAC_PHRASES = [
    "special purpose acquisition",
    "blank check",
    "no business operations",
    "shell company",
    "have not adopted any cybersecurity",
    "no operations",
]


def has_spac_language(text: str) -> str | None:
    """Return a snippet around the matched SPAC indicator, or None."""
    text_lower = text.lower()
    for phrase in SPAC_PHRASES:
        if phrase in text_lower:
            idx = text_lower.index(phrase)
            start = max(0, idx - 20)
            end = min(len(text), idx + len(phrase) + 20)
            return text[start:end].strip()
    return None


def main() -> None:
    # Load holdout IDs
    print("Loading holdout IDs...")
    holdout_ids: set[str] = set()
    with open(HOLDOUT_PATH) as f:
        for line in f:
            rec = json.loads(line)
            holdout_ids.add(rec["paragraphId"])
    print(f"  Holdout paragraphs: {len(holdout_ids)}")

    # Load paragraph texts
    print("Loading paragraph texts...")
    para_texts: dict[str, str] = {}
    with open(PARAGRAPHS_PATH) as f:
        for line in f:
            rec = json.loads(line)
            para_texts[rec["id"]] = rec["text"]
    print(f"  Paragraphs loaded: {len(para_texts)}")

    # Load the old corrections for comparison
    old_materiality_pids: set[str] = set()
    old_spac_pids: set[str] = set()
    if OUTPUT_PATH.exists():
        with open(OUTPUT_PATH) as f:
            for line in f:
                rec = json.loads(line)
                if rec["reason"] == "materiality_language":
                    old_materiality_pids.add(rec["paragraphId"])
                elif rec["reason"] == "spac":
                    old_spac_pids.add(rec["paragraphId"])

    # Load Stage 1 annotations and group by paragraphId
    print("Loading Stage 1 annotations...")
    annotations: dict[str, list[str]] = defaultdict(list)
    with open(STAGE1_PATH) as f:
        for line in f:
            rec = json.loads(line)
            pid = rec["paragraphId"]
            if pid in holdout_ids:
                continue
            cat = rec["label"]["content_category"]
            annotations[pid].append(cat)

    total_paragraphs = len(annotations)
    print(f"  Stage 1 paragraphs (excluding holdout): {total_paragraphs}")

    # Process each paragraph
    flagged: list[dict] = []
    materiality_flagged: list[dict] = []
    spac_flagged: list[dict] = []

    # Track paragraphs that HAD assessment language under the old rule but pass
    # the new filters (for showing "newly excluded" examples)
    newly_excluded: list[dict] = []

    for pid, labels in annotations.items():
        text = para_texts.get(pid)
        if text is None:
            continue

        label_abbrevs = [CATEGORY_ABBREV.get(l, l) for l in labels]
        no_count = sum(1 for l in labels if l == "None/Other")
        total = len(labels)

        # --- Check 1: Materiality assessment in N/O paragraphs ---
        if no_count > total / 2:  # majority or unanimous N/O
            matched = has_materiality_language(text)
            if matched:
                consensus = "unanimous" if no_count == total else "majority"
                record = {
                    "paragraphId": pid,
                    "reason": "materiality_language",
                    "originalConsensus": consensus,
                    "originalCategory": "None/Other",
                    "matchedPattern": matched,
                    "stage1Labels": label_abbrevs,
                }
                flagged.append(record)
                materiality_flagged.append(record)
            elif pid in old_materiality_pids:
                # Was flagged before, now excluded — collect for comparison.
                # Find what the old broad matcher would have caught.
                text_lower = text.lower()
                broad_match = None
                for phrase in [
                    "material adverse", "materially affect", "material impact",
                    "material cybersecurity", "material effect", "not materially",
                    "materially impacted", "materially affected",
                ]:
                    if phrase in text_lower:
                        idx = text_lower.index(phrase)
                        s = max(0, idx - 30)
                        e = min(len(text), idx + len(phrase) + 30)
                        broad_match = text[s:e].strip()
                        break
                if broad_match is None:
                    broad_match = "(proximity match)"
                newly_excluded.append({
                    "paragraphId": pid,
                    "reason": "excluded_materiality",
                    "oldMatch": broad_match,
                    "stage1Labels": label_abbrevs,
                })

        # --- Check 2: SPAC paragraphs coded as non-N/O ---
        if no_count <= total / 2:
            matched = has_spac_language(text)
            if matched:
                cat_counts = Counter(labels)
                majority_cat = cat_counts.most_common(1)[0][0]
                consensus = "unanimous" if cat_counts[majority_cat] == total else "majority"
                record = {
                    "paragraphId": pid,
                    "reason": "spac",
                    "originalConsensus": consensus,
                    "originalCategory": majority_cat,
                    "matchedPattern": matched,
                    "stage1Labels": label_abbrevs,
                }
                flagged.append(record)
                spac_flagged.append(record)

    # Write output
    print(f"\nWriting {len(flagged)} flagged paragraphs to {OUTPUT_PATH}...")
    with open(OUTPUT_PATH, "w") as f:
        for rec in flagged:
            f.write(json.dumps(rec) + "\n")

    # --- Print summary ---
    print("\n" + "=" * 70)
    print("STAGE 1 CORRECTION FLAGS — SUMMARY")
    print("=" * 70)
    print(f"Total Stage 1 paragraphs (excluding holdout): {total_paragraphs:,}")
    print()
    print("  Comparison (old broad rule -> new strict rule):")
    print(f"    Materiality flags: {len(old_materiality_pids):>5} -> {len(materiality_flagged):>5} (delta: {len(materiality_flagged) - len(old_materiality_pids):+d})")
    print(f"    SPAC flags:        {len(old_spac_pids):>5} -> {len(spac_flagged):>5} (delta: {len(spac_flagged) - len(old_spac_pids):+d})")
    old_total = len(old_materiality_pids) + len(old_spac_pids)
    print(f"    Total flags:       {old_total:>5} -> {len(flagged):>5} (delta: {len(flagged) - old_total:+d})")
    print()

    new_materiality_pids = {r["paragraphId"] for r in materiality_flagged}
    added = new_materiality_pids - old_materiality_pids
    removed = old_materiality_pids - new_materiality_pids
    kept = new_materiality_pids & old_materiality_pids
    print("  Materiality breakdown:")
    print(f"    Kept from old:      {len(kept):>5}")
    print(f"    Newly flagged:      {len(added):>5}")
    print(f"    Excluded (dropped): {len(removed):>5}")

    # Show examples
    def show_examples(title: str, records: list[dict], n: int = 5, text_key: str = "matchedPattern") -> None:
        print(f"\n--- {title} (showing {min(n, len(records))} of {len(records)}) ---")
        for rec in records[:n]:
            pid = rec["paragraphId"]
            text = para_texts.get(pid, "")
            snippet = text[:150] + "..." if len(text) > 150 else text
            print(f"  {pid[:16]}...")
            print(f"    Labels: {rec['stage1Labels']}")
            print(f"    Match:  {rec.get(text_key, rec.get('oldMatch', ''))}")
            print(f"    Text:   {snippet}")
            print()

    show_examples(
        "Newly flagged materiality assessments (assessment patterns)",
        [r for r in materiality_flagged if r["paragraphId"] in added] or materiality_flagged,
    )
    show_examples(
        "Previously flagged, NOW EXCLUDED (boilerplate/speculation)",
        newly_excluded,
        text_key="oldMatch",
    )
    show_examples("SPAC/shell coded as non-N/O", spac_flagged)


if __name__ == "__main__":
    main()
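Aside: the strict rule's behavior is easy to spot-check. A self-contained, abridged re-statement (invented example sentences; the patterns below are condensed from the script above, not identical to it):

import re

ASSESSMENT_RE = re.compile(
    r"(?:have|has)\s+not\s+materially\s+affected"
    r"|reasonably\s+likely\s+to\s+materially\s+(?:affect|impact)",
    re.IGNORECASE,
)
CROSS_REF_RE = re.compile(r"see\s+Item|refer\s+to", re.IGNORECASE)

def is_assessment(text: str) -> bool:
    m = ASSESSMENT_RE.search(text)
    if m is None:
        return False
    # Suppress matches followed by a cross-reference, as the script does.
    return not CROSS_REF_RE.search(text[m.end() : m.end() + 200])

assert is_assessment("Cyber threats have not materially affected our business strategy.")
assert not is_assessment("An incident could have a material adverse effect on operations.")
assert not is_assessment("Threats have not materially affected us; see Item 1A of this report.")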
scripts/identify-holdout-rerun.py — new file, 201 lines
@@ -0,0 +1,201 @@
"""
Identify holdout paragraphs on confusion axes that need v3.5 re-annotation.

Builds a 13-signal matrix from all available sources:
- 3 human annotators (per paragraph)
- 1 Opus golden annotation
- Up to 6 bench-holdout model annotations
- Stage 1 patched annotations (filtered to holdout PIDs)

Flags paragraphs splitting on:
1. SI <-> N/O (at least 2 signals on each side)
2. MR <-> RMP (at least 2 signals on each side)
3. BG <-> MR (at least 2 signals on each side)
4. BG <-> RMP (at least 2 signals on each side)
5. Materiality language present but the majority says N/O
"""

import json
import re
from collections import Counter
from pathlib import Path

ROOT = Path(__file__).resolve().parent.parent
DATA = ROOT / "data"

# Short names for categories
ABBREV = {
    "Board Governance": "BG",
    "Incident Disclosure": "ID",
    "Management Role": "MR",
    "None/Other": "NO",
    "Risk Management Process": "RMP",
    "Strategy Integration": "SI",
    "Third-Party Risk": "TPR",
}

# Materiality language patterns
MATERIALITY_PATTERNS = [
    re.compile(r"material(ly)?\s+(adverse|impact|effect|affect)", re.IGNORECASE),
    re.compile(r"materially\s+affect(ed)?", re.IGNORECASE),
    re.compile(r"material\s+cybersecurity\s+(incident|threat|event)", re.IGNORECASE),
    re.compile(r"not\s+(experienced|had|identified)\s+.{0,40}material", re.IGNORECASE),
    re.compile(r"reasonably\s+likely\s+to\s+materially", re.IGNORECASE),
    re.compile(r"material(ity)?\s+(assessment|conclusion|determination)", re.IGNORECASE),
    re.compile(r"no\s+material\s+(impact|effect|cybersecurity)", re.IGNORECASE),
    re.compile(r"have\s+not\s+.{0,30}materially\s+affect(ed)?", re.IGNORECASE),
]


def has_materiality_language(text: str) -> bool:
    return any(p.search(text) for p in MATERIALITY_PATTERNS)


def majority_category(tally: Counter) -> str:
    if not tally:
        return "UNKNOWN"
    return tally.most_common(1)[0][0]


def main():
    # 1. Determine the 1,200 holdout PIDs from the human labels
    holdout_pids: set[str] = set()
    human_labels: dict[str, list[str]] = {}  # pid -> list of abbreviated categories
    with open(DATA / "gold" / "human-labels-raw.jsonl") as f:
        for line in f:
            d = json.loads(line)
            pid = d["paragraphId"]
            holdout_pids.add(pid)
            human_labels.setdefault(pid, []).append(
                ABBREV.get(d["contentCategory"], d["contentCategory"])
            )

    # Load paragraph texts for the holdout PIDs
    holdout_paragraphs: dict[str, str] = {}
    with open(DATA / "gold" / "paragraphs-holdout.jsonl") as f:
        for line in f:
            d = json.loads(line)
            if d["id"] in holdout_pids:
                holdout_paragraphs[d["id"]] = d["text"]

    print(f"Total holdout paragraphs: {len(holdout_pids)}")

    # 2. Build the signal matrix: pid -> list of category strings (abbreviated)
    signals: dict[str, list[str]] = {pid: list(cats) for pid, cats in human_labels.items()}

    # 2a. Human labels already loaded above
    print(f"Paragraphs with human labels: {len(human_labels)}")

    # 2b. Opus golden
    with open(DATA / "annotations" / "golden" / "opus.jsonl") as f:
        for line in f:
            d = json.loads(line)
            pid = d["paragraphId"]
            if pid in holdout_pids:
                cat = ABBREV.get(
                    d["label"]["content_category"], d["label"]["content_category"]
                )
                signals[pid].append(cat)

    # 2c. Bench-holdout model annotations (skip error files)
    bench_dir = DATA / "annotations" / "bench-holdout"
    for fpath in sorted(bench_dir.glob("*.jsonl")):
        if "-errors" in fpath.name:
            continue
        with open(fpath) as f:
            for line in f:
                d = json.loads(line)
                pid = d.get("paragraphId")
                if pid and pid in holdout_pids and "label" in d:
                    cat = ABBREV.get(
                        d["label"]["content_category"],
                        d["label"]["content_category"],
                    )
                    signals[pid].append(cat)

    # 2d. Stage 1 patched (filtered to holdout PIDs)
    with open(DATA / "annotations" / "stage1.patched.jsonl") as f:
        for line in f:
            d = json.loads(line)
            pid = d["paragraphId"]
            if pid in holdout_pids:
                cat = ABBREV.get(
                    d["label"]["content_category"], d["label"]["content_category"]
                )
                signals[pid].append(cat)

    # Report signal counts
    signal_counts = [len(signals[pid]) for pid in holdout_pids]
    print(
        f"Signals per paragraph: min={min(signal_counts)}, max={max(signal_counts)}, "
        f"mean={sum(signal_counts)/len(signal_counts):.1f}"
    )

    # 3. Check the confusion axes
    AXES = {
        "SI_NO": ("SI", "NO"),
        "MR_RMP": ("MR", "RMP"),
        "BG_MR": ("BG", "MR"),
        "BG_RMP": ("BG", "RMP"),
    }

    axis_counts: dict[str, int] = {k: 0 for k in AXES}
    materiality_no_count = 0
    results: list[dict] = []

    for pid in sorted(holdout_pids):
        tally = Counter(signals[pid])
        maj = majority_category(tally)
        text = holdout_paragraphs[pid]
        mat_lang = has_materiality_language(text)

        # Check each axis
        flagged_axes: list[str] = []
        for axis_name, (cat_a, cat_b) in AXES.items():
            if tally.get(cat_a, 0) >= 2 and tally.get(cat_b, 0) >= 2:
                flagged_axes.append(axis_name)

        # Materiality language + majority N/O
        mat_no_flag = mat_lang and maj == "NO"

        if flagged_axes or mat_no_flag:
            for axis_name in flagged_axes:
                axis_counts[axis_name] += 1
            if mat_no_flag:
                materiality_no_count += 1

            results.append(
                {
                    "paragraphId": pid,
                    "axes": flagged_axes,
                    # Tally in most-common order for output readability
                    "signalTally": dict(tally.most_common()),
                    "hasMaterialityLanguage": mat_lang,
                    "currentMajority": maj,
                    "materialityNoFlag": mat_no_flag,
                }
            )

    # 4. Output
    out_path = DATA / "gold" / "holdout-rerun-v35.jsonl"
    with open(out_path, "w") as f:
        for r in results:
            f.write(json.dumps(r) + "\n")

    print("\n--- Confusion Axis Summary ---")
    print(f"SI <-> N/O splits:  {axis_counts['SI_NO']}")
    print(f"MR <-> RMP splits:  {axis_counts['MR_RMP']}")
    print(f"BG <-> MR splits:   {axis_counts['BG_MR']}")
    print(f"BG <-> RMP splits:  {axis_counts['BG_RMP']}")
    print(f"Materiality lang + majority N/O: {materiality_no_count}")
    print(f"\nTotal unique paragraphs needing re-run: {len(results)}")
    cost = len(results) * 0.005 * 5
    print(f"Estimated cost at $0.005/paragraph x 5 models: ${cost:.2f}")


if __name__ == "__main__":
    main()
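Aside: a toy run (invented tally) of the 2-signals-per-side axis rule above:

from collections import Counter

AXES = {
    "SI_NO": ("SI", "NO"),
    "MR_RMP": ("MR", "RMP"),
    "BG_MR": ("BG", "MR"),
    "BG_RMP": ("BG", "RMP"),
}

def flag_axes(tally: Counter) -> list[str]:
    # Same rule as the script: an axis fires when both sides have >= 2 signals.
    return [
        name for name, (a, b) in AXES.items()
        if tally.get(a, 0) >= 2 and tally.get(b, 0) >= 2
    ]

# A 13-signal paragraph split 6 MR / 5 RMP / 2 BG fires three axes
# (BG sits exactly at the 2-signal threshold):
assert flag_axes(Counter({"MR": 6, "RMP": 5, "BG": 2})) == ["MR_RMP", "BG_MR", "BG_RMP"]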
scripts/show-hard-examples.py — new file, 530 lines
@@ -0,0 +1,530 @@
"""
Show carefully selected hard-case paragraphs from the holdout set for each confusion axis.
Displays the full paragraph text plus a compact 13-signal label table and vote tally.

Run: uv run --with numpy scripts/show-hard-examples.py
"""

import json
import os
from collections import Counter, defaultdict
from pathlib import Path
from textwrap import fill

import numpy as np

ROOT = Path(__file__).resolve().parent.parent

# ── Category abbreviation map ──────────────────────────────────────────────
FULL_TO_ABBR = {
    "Board Governance": "BG",
    "Incident Disclosure": "ID",
    "Management Role": "MR",
    "None/Other": "N/O",
    "Risk Management Process": "RMP",
    "Strategy Integration": "SI",
    "Third-Party Risk": "TPR",
}

# ── Short source-name helpers ──────────────────────────────────────────────
S1_MODEL_SHORT = {
    "google/gemini-3.1-flash-lite-preview": "gemini-lite",
    "x-ai/grok-4.1-fast": "grok-fast",
    "xiaomi/mimo-v2-flash": "mimo-flash",
}

BENCH_FILE_SHORT = {
    "gpt-5.4": "gpt-5.4",
    "gemini-3.1-pro-preview": "gemini-pro",
    "glm-5:exacto": "glm-5",
    "kimi-k2.5": "kimi",
    "mimo-v2-pro:exacto": "mimo-pro",
    "minimax-m2.7:exacto": "minimax",
}

BENCH_FILES = [
    "gpt-5.4",
    "gemini-3.1-pro-preview",
    "glm-5:exacto",
    "kimi-k2.5",
    "mimo-v2-pro:exacto",
    "minimax-m2.7:exacto",
]


def load_jsonl(path: str | Path) -> list[dict]:
    rows = []
    with open(path) as f:
        for line in f:
            line = line.strip()
            if line:
                rows.append(json.loads(line))
    return rows


# ── Load data ──────────────────────────────────────────────────────────────
print("Loading data...")
paragraphs_raw = load_jsonl(ROOT / "data/gold/paragraphs-holdout.jsonl")
para_map: dict[str, dict] = {p["id"]: p for p in paragraphs_raw}
holdout_pids = set(para_map.keys())

human_raw = load_jsonl(ROOT / "data/gold/human-labels-raw.jsonl")
opus_raw = load_jsonl(ROOT / "data/annotations/golden/opus.jsonl")
stage1_raw = load_jsonl(ROOT / "data/annotations/stage1.patched.jsonl")

# ── Build signal matrix: pid → {source_label: category_abbr} ───────────────
signals: dict[str, dict[str, str]] = defaultdict(dict)

# 1) Human annotators
for row in human_raw:
    pid = row["paragraphId"]
    name = row["annotatorName"]
    cat = FULL_TO_ABBR.get(row["contentCategory"], row["contentCategory"])
    signals[pid][f"H:{name}"] = cat

# 2) Opus
for row in opus_raw:
    pid = row["paragraphId"]
    cat = FULL_TO_ABBR.get(row["label"]["content_category"], row["label"]["content_category"])
    signals[pid]["Opus"] = cat

# 3) Stage 1 (filtered to holdout PIDs)
for row in stage1_raw:
    pid = row["paragraphId"]
    if pid not in holdout_pids:
        continue
    model_id = row["provenance"]["modelId"]
    short = S1_MODEL_SHORT.get(model_id, model_id)
    cat = FULL_TO_ABBR.get(row["label"]["content_category"], row["label"]["content_category"])
    signals[pid][f"S1:{short}"] = cat

# 4) Benchmark models
for bench_name in BENCH_FILES:
    path = ROOT / f"data/annotations/bench-holdout/{bench_name}.jsonl"
    short = BENCH_FILE_SHORT[bench_name]
    for row in load_jsonl(path):
        pid = row["paragraphId"]
        cat = FULL_TO_ABBR.get(row["label"]["content_category"], row["label"]["content_category"])
        signals[pid][short] = cat

# ── Ordered source list (for display) ──────────────────────────────────────
HUMAN_NAMES = sorted({r["annotatorName"] for r in human_raw})
ORDERED_SOURCES = (
    [f"H:{n}" for n in HUMAN_NAMES]
    + ["Opus"]
    + [f"S1:{S1_MODEL_SHORT[m]}" for m in sorted(S1_MODEL_SHORT)]
    + [BENCH_FILE_SHORT[b] for b in BENCH_FILES]
)

# ── Utility: compute axis stats ────────────────────────────────────────────


def axis_candidates(cat_a: str, cat_b: str, extra_cat: str | None = None) -> list[tuple[str, dict, Counter]]:
    """Find PIDs where both cat_a and cat_b appear among the 13 signals.

    Returns a list of (pid, signals_dict, vote_counter) sorted by closeness of split.
    """
    results = []
    for pid, sigs in signals.items():
        if pid not in holdout_pids:
            continue
        counts = Counter(sigs.values())
        cats_present = set(counts.keys())
        if cat_a in cats_present and cat_b in cats_present:
            if extra_cat is not None and extra_cat not in cats_present:
                continue
            # closeness = min(count_a, count_b) / total — higher means a closer split
            total = sum(counts.values())
            closeness = min(counts[cat_a], counts[cat_b]) / total
            results.append((pid, sigs, counts, closeness))
    # Sort by closeness (descending), then by total signal count (descending) as a tiebreaker
    results.sort(key=lambda x: (-x[3], -sum(x[2].values())))
    return [(pid, sigs, counts) for pid, sigs, counts, _ in results]


def print_example(pid: str, sigs: dict, counts: Counter, sub_pattern: str, note: str = ""):
    """Print one example paragraph with its signals."""
    para = para_map.get(pid)
    if not para:
        print(f"  [paragraph {pid} not found]")
        return

    print(f"  ┌─ Paragraph {pid}")
    print(f"  │ Company: {para.get('companyName', '?')} | Filing: {para.get('filingType', '?')} {para.get('filingDate', '?')}")
    print(f"  │ Sub-pattern: {sub_pattern}")
    print("  │")

    # Full text — wrap at 100 chars, indented
    for line in para["text"].split("\n"):
        print(fill(line, width=100, initial_indent="  │ ", subsequent_indent="  │ "))
    print("  │")

    # Signal table — compact single line
    parts = [f"{src}={sigs[src]}" for src in ORDERED_SOURCES if src in sigs]
    print(f"  │ Signals: {', '.join(parts)}")

    # Vote tally
    tally_parts = [f"{cat}: {n}" for cat, n in counts.most_common()]
    print(f"  │ Tally: {', '.join(tally_parts)} (out of {sum(counts.values())})")

    if note:
        print("  │")
        for line in note.split("\n"):
            print(fill(line, width=100, initial_indent="  │ ▸ ", subsequent_indent="  │ "))

    print(f"  └{'─' * 78}")
    print()


def pick_diverse(candidates: list[tuple[str, dict, Counter]], n: int, min_signals: int = 10) -> list[tuple[str, dict, Counter]]:
    """Pick n diverse examples from candidates (different companies; prefer many signals)."""
    if len(candidates) <= n:
        return candidates
    # Filter to examples with enough signals for a meaningful table
    rich = [(pid, sigs, counts) for pid, sigs, counts in candidates if sum(counts.values()) >= min_signals]
    if len(rich) < n:
        rich = candidates  # fall back if there aren't enough rich examples
    # Diversify by company
    seen_companies: set[str] = set()
    selected = []
    for pid, sigs, counts in rich:
        company = para_map.get(pid, {}).get("companyName", "")
        if company in seen_companies and len(rich) > n * 2:
            continue
        selected.append((pid, sigs, counts))
        seen_companies.add(company)
        if len(selected) >= n * 3:
            break
    return selected[:n]
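# Example (hypothetical data): pick_diverse keeps signal-rich, company-diverse rows.
#   a = ("p1", {...}, Counter({"MR": 7, "RMP": 6}))   # 13 signals
#   b = ("p2", {...}, Counter({"MR": 2, "RMP": 1}))   # only 3 signals
#   pick_diverse([a, b], 1) -> [a]                    # b filtered out by min_signals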
||||||
|
|
||||||
|
|
||||||
|
# ══════════════════════════════════════════════════════════════════════════
|
||||||
|
# AXIS 1: MR ↔ RMP
|
||||||
|
# ══════════════════════════════════════════════════════════════════════════
|
||||||
|
print("\n" + "=" * 80)
|
||||||
|
print(" AXIS 1: MR ↔ RMP — Management Role vs. Risk Management Process")
|
||||||
|
print("=" * 80)
|
||||||
|
|
||||||
|
mr_rmp = axis_candidates("MR", "RMP")
|
||||||
|
print(f"\n Total paragraphs with both MR and RMP in signals: {len(mr_rmp)}\n")
|
||||||
|
|
||||||
|
|
||||||
|
def classify_mr_rmp_subpattern(text: str) -> str:
|
||||||
|
"""Heuristic to guess sub-pattern for MR↔RMP confusion."""
|
||||||
|
text_lower = text.lower()
|
||||||
|
sentences = [s.strip() for s in text.replace("\n", " ").split(".") if s.strip()]
|
||||||
|
|
||||||
|
person_keywords = [
|
||||||
|
"ciso", "chief information security", "chief information officer",
|
||||||
|
"cio", "vp ", "vice president", "director", "officer", "head of",
|
||||||
|
"manager", "leader", "executive", "cto", "chief technology",
|
||||||
|
]
|
||||||
|
process_keywords = [
|
||||||
|
"program", "framework", "process", "policy", "policies",
|
||||||
|
"procedures", "controls", "assessment", "monitoring",
|
||||||
|
"risk management", "incident response", "vulnerability",
|
||||||
|
]
|
||||||
|
|
||||||
|
person_subject_sentences = 0
|
||||||
|
process_subject_sentences = 0
|
||||||
|
|
||||||
|
for sent in sentences:
|
||||||
|
sent_lower = sent.lower().strip()
|
||||||
|
has_person = any(kw in sent_lower[:80] for kw in person_keywords)
|
||||||
|
has_process = any(kw in sent_lower[:80] for kw in process_keywords)
|
||||||
|
if has_person:
|
||||||
|
person_subject_sentences += 1
|
||||||
|
if has_process:
|
||||||
|
process_subject_sentences += 1
|
||||||
|
|
||||||
|
if person_subject_sentences > 0 and process_subject_sentences == 0:
|
||||||
|
return "person-subject"
|
||||||
|
elif process_subject_sentences > 0 and person_subject_sentences == 0:
|
||||||
|
return "process-subject"
|
||||||
|
elif person_subject_sentences > 0 and process_subject_sentences > 0:
|
||||||
|
return "mixed"
|
||||||
|
else:
|
||||||
|
return "other"
|
||||||
|
|
||||||
|
|
||||||
|
# Bucket candidates by sub-pattern
|
||||||
|
buckets: dict[str, list] = {"person-subject": [], "process-subject": [], "mixed": [], "other": []}
|
||||||
|
for pid, sigs, counts in mr_rmp:
|
||||||
|
text = para_map.get(pid, {}).get("text", "")
|
||||||
|
sp = classify_mr_rmp_subpattern(text)
|
||||||
|
buckets[sp].append((pid, sigs, counts))
|
||||||
|
|
||||||
|
print(f" Sub-pattern distribution: person-subject={len(buckets['person-subject'])}, "
|
||||||
|
f"process-subject={len(buckets['process-subject'])}, mixed={len(buckets['mixed'])}, "
|
||||||
|
f"other={len(buckets['other'])}")
|
||||||
|
print()
|
||||||
|
|
||||||
|
# (a) Person is grammatical subject
|
||||||
|
print(" ── (a) Person is the grammatical subject, doing process-like things ──\n")
|
||||||
|
for pid, sigs, counts in pick_diverse(buckets["person-subject"], 2):
|
||||||
|
text = para_map[pid]["text"]
|
||||||
|
# Subject test note
|
||||||
|
note = "SUBJECT TEST → MR (person is the main subject)"
|
||||||
|
print_example(pid, sigs, counts, "Person as subject doing process-like things", note)
|
||||||
|
|
||||||
|
# (b) Process/framework is subject
|
||||||
|
print(" ── (b) Process/framework is the subject, person mentioned as responsible ──\n")
|
||||||
|
for pid, sigs, counts in pick_diverse(buckets["process-subject"], 2):
|
||||||
|
text = para_map[pid]["text"]
|
||||||
|
note = "SUBJECT TEST → RMP (process/framework is the main subject)"
|
||||||
|
print_example(pid, sigs, counts, "Process as subject, person mentioned", note)
|
||||||
|
|
||||||
|
# (c) Mixed
|
||||||
|
print(" ── (c) Mixed — both person and process are subjects ──\n")
|
||||||
|
for pid, sigs, counts in pick_diverse(buckets["mixed"], 2):
|
||||||
|
note = "SUBJECT TEST → AMBIGUOUS (both person and process serve as subjects)"
|
||||||
|
print_example(pid, sigs, counts, "Mixed subjects", note)
|
||||||
|
|
||||||
|
# (d) Edge cases — closest splits from "other" or overall closest
|
||||||
|
print(" ── (d) Edge cases — genuinely hard to call ──\n")
|
||||||
|
# Take from overall closest that aren't already shown
|
||||||
|
shown_pids = set()
|
||||||
|
for bucket in buckets.values():
|
||||||
|
for pid, _, _ in bucket[:2]:
|
||||||
|
shown_pids.add(pid)
|
||||||
|
edge_cases = [(p, s, c) for p, s, c in mr_rmp if p not in shown_pids][:20]
|
||||||
|
for pid, sigs, counts in pick_diverse(edge_cases, 2):
|
||||||
|
mr_count = counts.get("MR", 0)
|
||||||
|
rmp_count = counts.get("RMP", 0)
|
||||||
|
note = f"SUBJECT TEST → unclear; split is {mr_count}-{rmp_count} MR-RMP"
|
||||||
|
print_example(pid, sigs, counts, "Edge case", note)
|
||||||
|
|
||||||
|
|
||||||
|
# ══════════════════════════════════════════════════════════════════════════
|
||||||
|
# AXIS 2: BG ↔ MR
|
||||||
|
# ══════════════════════════════════════════════════════════════════════════
|
||||||
|
print("\n" + "=" * 80)
|
||||||
|
print(" AXIS 2: BG ↔ MR — Board Governance vs. Management Role")
|
||||||
|
print("=" * 80)
|
||||||
|
|
||||||
|
bg_mr = axis_candidates("BG", "MR")
|
||||||
|
print(f"\n Total paragraphs with both BG and MR in signals: {len(bg_mr)}\n")
|
||||||
|
|
||||||
|
|
||||||
|
def classify_bg_mr_subpattern(text: str) -> str:
|
||||||
|
text_lower = text.lower()
|
||||||
|
board_words = ["board", "committee", "audit committee", "directors"]
|
||||||
|
mgmt_words = ["ciso", "chief information", "officer", "vp", "vice president",
|
||||||
|
"director of", "head of", "reports to", "briefing", "briefs",
|
||||||
|
"presents to", "reporting"]
|
||||||
|
|
||||||
|
has_board_actor = any(w in text_lower for w in board_words)
|
||||||
|
has_mgmt_reporting = any(w in text_lower for w in mgmt_words)
|
||||||
|
|
||||||
|
if has_board_actor and not has_mgmt_reporting:
|
||||||
|
return "board-actor"
|
||||||
|
elif has_mgmt_reporting and has_board_actor:
|
||||||
|
return "mgmt-reporting-to-board"
|
||||||
|
elif has_mgmt_reporting:
|
||||||
|
return "mgmt-only"
|
||||||
|
else:
|
||||||
|
return "mixed-governance"
|
||||||
|
|
||||||
|
|
||||||
|
buckets_bg: dict[str, list] = defaultdict(list)
|
||||||
|
for pid, sigs, counts in bg_mr:
|
||||||
|
sp = classify_bg_mr_subpattern(para_map.get(pid, {}).get("text", ""))
|
||||||
|
buckets_bg[sp].append((pid, sigs, counts))
|
||||||
|
|
||||||
|
print(f" Sub-pattern distribution: {dict((k, len(v)) for k, v in buckets_bg.items())}")
|
||||||
|
print()
|
||||||
|
|
||||||
|
# (a) Board/committee is clearly the actor
|
||||||
|
print(" ── (a) Board/committee is clearly the actor ──\n")
|
||||||
|
pool = buckets_bg.get("board-actor", []) or buckets_bg.get("mixed-governance", [])
|
||||||
|
for pid, sigs, counts in pick_diverse(pool, 2):
|
||||||
|
print_example(pid, sigs, counts, "Board as actor")
|
||||||
|
|
||||||
|
# (b) Management officer reporting TO the board
|
||||||
|
print(" ── (b) Management officer reporting TO/briefing the board ──\n")
|
||||||
|
pool = buckets_bg.get("mgmt-reporting-to-board", [])
|
||||||
|
for pid, sigs, counts in pick_diverse(pool, 2):
|
||||||
|
note = "KEY QUESTION: Is this BG (board receiving info) or MR (officer doing the briefing)?"
|
||||||
|
print_example(pid, sigs, counts, "Management reporting to board", note)
|
||||||
|
|
||||||
|
# (c) Mixed governance
|
||||||
|
print(" ── (c) Mixed governance language ──\n")
|
||||||
|
remaining = [x for x in bg_mr if x[0] not in {p for bucket in buckets_bg.values() for p, _, _ in bucket[:2]}]
|
||||||
|
for pid, sigs, counts in pick_diverse(remaining, 2):
|
||||||
|
note = "Could be BG, MR, or RMP depending on interpretation"
|
||||||
|
print_example(pid, sigs, counts, "Mixed governance", note)
|
||||||
|
|
||||||
|
|
||||||
|
# ══════════════════════════════════════════════════════════════════════════
|
||||||
|
# AXIS 3: SI ↔ N/O
|
||||||
|
# ══════════════════════════════════════════════════════════════════════════
|
||||||
|
print("\n" + "=" * 80)
|
||||||
|
print(" AXIS 3: SI ↔ N/O — Strategy Integration vs. None/Other")
|
||||||
|
print("=" * 80)
|
||||||
|
|
||||||
|
si_no = axis_candidates("SI", "N/O")
|
||||||
|
print(f"\n Total paragraphs with both SI and N/O in signals: {len(si_no)}\n")
|
||||||
|
|
||||||
|
|
||||||
|
def classify_si_no_subpattern(text: str) -> str:
|
||||||
|
text_lower = text.lower()
|
||||||
|
|
||||||
|
incident_words = ["incident", "breach", "attack", "compromised", "unauthorized access",
|
||||||
|
"data breach", "ransomware", "phishing"]
|
||||||
|
negative_words = ["have not experienced", "not experienced", "no material",
|
||||||
|
"has not been materially", "not been the subject",
|
||||||
|
"not aware of any", "no known", "have not had"]
|
||||||
|
hypothetical_words = ["could", "may", "might", "would", "if ", "potential",
|
||||||
|
"face threats", "subject to"]
|
||||||
|
specific_words = ["$", "million", "vendor", "contract", "insurance",
|
||||||
|
"specific", "particular", "named"]
|
||||||
|
|
||||||
|
has_incident = any(w in text_lower for w in incident_words)
|
||||||
|
has_negative = any(w in text_lower for w in negative_words)
|
||||||
|
has_hypothetical = any(w in text_lower for w in hypothetical_words)
|
||||||
|
has_specific = any(w in text_lower for w in specific_words)
|
||||||
|
|
||||||
|
if has_incident and not has_negative:
|
||||||
|
return "actual-incident"
|
||||||
|
elif has_negative:
|
||||||
|
return "negative-assertion"
|
||||||
|
elif has_hypothetical and not has_specific:
|
||||||
|
return "hypothetical"
|
||||||
|
elif has_specific:
|
||||||
|
return "specific-no-incident"
|
||||||
|
else:
|
||||||
|
return "other"
|
||||||
|
|
||||||
|
|
||||||
|
buckets_si: dict[str, list] = defaultdict(list)
|
||||||
|
for pid, sigs, counts in si_no:
|
||||||
|
sp = classify_si_no_subpattern(para_map.get(pid, {}).get("text", ""))
|
||||||
|
buckets_si[sp].append((pid, sigs, counts))
|
||||||
|
|
||||||
|
print(f" Sub-pattern distribution: {dict((k, len(v)) for k, v in buckets_si.items())}")
|
||||||
|
print()
|
||||||
|
|
||||||
|
# Also find the 23 cases where humans=SI but GenAI=N/O
|
||||||
|
human_si_genai_no = []
|
||||||
|
for pid, sigs, counts in si_no:
|
||||||
|
human_cats = [sigs.get(f"H:{n}") for n in HUMAN_NAMES if f"H:{n}" in sigs]
|
||||||
|
genai_cats = [v for k, v in sigs.items() if not k.startswith("H:")]
|
||||||
|
human_si = sum(1 for c in human_cats if c == "SI")
|
||||||
|
human_no = sum(1 for c in human_cats if c == "N/O")
|
||||||
|
genai_si = sum(1 for c in genai_cats if c == "SI")
|
||||||
|
genai_no = sum(1 for c in genai_cats if c == "N/O")
|
||||||
|
if human_si > human_no and genai_no > genai_si:
|
||||||
|
human_si_genai_no.append((pid, sigs, counts))
|
||||||
|
|
||||||
|
print(f" Cases where humans lean SI but GenAI leans N/O: {len(human_si_genai_no)}")
|
||||||
|
print()

# (a) Clear actual incident
print("  ── (a) Clear actual incident described ──\n")
for pid, sigs, counts in pick_diverse(buckets_si.get("actual-incident", []), 2):
    print_example(pid, sigs, counts, "Actual incident")

# (b) Negative assertion
print("  ── (b) Negative assertion — 'we have not experienced material incidents' ──\n")
neg_pool = buckets_si.get("negative-assertion", [])
# Prefer ones in the human-SI-genAI-NO set
neg_human_si = [x for x in neg_pool if x[0] in {p for p, _, _ in human_si_genai_no}]
neg_other = [x for x in neg_pool if x[0] not in {p for p, _, _ in human_si_genai_no}]
pool = neg_human_si[:2] if len(neg_human_si) >= 2 else (neg_human_si + neg_other)[:2]
for pid, sigs, counts in pool:
    human_cats = [sigs.get(f"H:{n}") for n in HUMAN_NAMES if f"H:{n}" in sigs]
    genai_cats = [v for k, v in sigs.items() if not k.startswith("H:")]
    note = (f"CRUX: Humans keyed on the materiality assessment language. "
            f"Human votes: {Counter(human_cats).most_common()}, "
            f"GenAI votes: {Counter(genai_cats).most_common()}")
    print_example(pid, sigs, counts, "Negative assertion", note)

# (c) Hypothetical/conditional
print("  ── (c) Hypothetical/conditional language ──\n")
for pid, sigs, counts in pick_diverse(buckets_si.get("hypothetical", []), 2):
    print_example(pid, sigs, counts, "Hypothetical/conditional")

# (d) Specific programs/vendors/amounts but no incident
print("  ── (d) Specific programs/vendors/amounts but no incident ──\n")
spec_pool = buckets_si.get("specific-no-incident", [])
if len(spec_pool) < 2:
    spec_pool += buckets_si.get("other", [])
for pid, sigs, counts in pick_diverse(spec_pool, 2):
    note = "SI because specific details? Or N/O because no event/strategy content?"
    print_example(pid, sigs, counts, "Specific but no incident", note)

# Extra: show human-SI / genAI-N/O cases not already shown
shown_si = set()
for bucket in buckets_si.values():
    for p, _, _ in bucket[:2]:
        shown_si.add(p)
extra_human_si = [x for x in human_si_genai_no if x[0] not in shown_si]
if extra_human_si:
    print("  ── (extra) Additional human=SI, GenAI=N/O cases ──\n")
    for pid, sigs, counts in pick_diverse(extra_human_si, 2):
        human_cats = [sigs.get(f"H:{n}") for n in HUMAN_NAMES if f"H:{n}" in sigs]
        genai_cats = [v for k, v in sigs.items() if not k.startswith("H:")]
        note = (f"Humans: {Counter(human_cats).most_common()}, "
                f"GenAI: {Counter(genai_cats).most_common()}")
        print_example(pid, sigs, counts, "Human=SI, GenAI=N/O", note)

# ══════════════════════════════════════════════════════════════════════════
# AXIS 4: Three-way BG ↔ MR ↔ RMP
# ══════════════════════════════════════════════════════════════════════════
print("\n" + "=" * 80)
print(" AXIS 4: Three-way BG ↔ MR ↔ RMP")
print("=" * 80)

three_way = []
for pid, sigs in signals.items():
    if pid not in holdout_pids:
        continue
    counts = Counter(sigs.values())
    if "BG" in counts and "MR" in counts and "RMP" in counts:
        # Score by how evenly split the three are
        vals = [counts["BG"], counts["MR"], counts["RMP"]]
        total_3 = sum(vals)
        evenness = min(vals) / max(vals) if max(vals) > 0 else 0
        three_way.append((pid, sigs, counts, evenness))

three_way.sort(key=lambda x: (-x[3], -sum(x[2].values())))
print(f"\n  Total paragraphs with all three of BG, MR, RMP: {len(three_way)}\n")
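
# Worked example of the evenness score (invented counts): with
# counts = {"BG": 4, "MR": 3, "RMP": 3}, vals = [4, 3, 3] and
# evenness = min(vals) / max(vals) = 3 / 4 = 0.75. A perfect three-way tie
# (e.g. [3, 3, 3]) scores 1.0, so the sort surfaces the most contested
# paragraphs first, with total signal count as the tiebreaker.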

# Pick diverse examples with enough signals
seen_co: set[str] = set()
three_way_selected = []
for pid, sigs, counts, evenness in three_way:
    if sum(counts.values()) < 10:
        continue
    co = para_map.get(pid, {}).get("companyName", "")
    if co in seen_co:
        continue
    seen_co.add(co)
    three_way_selected.append((pid, sigs, counts, evenness))
    if len(three_way_selected) >= 3:
        break

for pid, sigs, counts, evenness in three_way_selected:
    bg_c, mr_c, rmp_c = counts["BG"], counts["MR"], counts["RMP"]
    note = (f"Three-way split: BG={bg_c}, MR={mr_c}, RMP={rmp_c}. "
            f"This paragraph intertwines governance, management roles, and process descriptions.")
    print_example(pid, sigs, counts, "Three-way BG/MR/RMP", note)

# ── Summary statistics ────────────────────────────────────────────────────
print("\n" + "=" * 80)
print(" SUMMARY")
print("=" * 80)
print(f"""
  Axis 1 (MR↔RMP):      {len(mr_rmp)} paragraphs with split signals
  Axis 2 (BG↔MR):       {len(bg_mr)} paragraphs with split signals
  Axis 3 (SI↔N/O):      {len(si_no)} paragraphs with split signals
  Axis 4 (BG↔MR↔RMP):   {len(three_way)} paragraphs with three-way split
  Human=SI/GenAI=N/O:   {len(human_si_genai_no)} cases (directional asymmetry)
""")
ts/src/cli.ts
@ -5,7 +5,7 @@ import { STAGE1_MODELS, BENCHMARK_MODELS } from "./lib/openrouter.ts";
import { runBatch } from "./label/batch.ts";
import { runGoldenBatch } from "./label/golden.ts";
import { computeConsensus } from "./label/consensus.ts";
import { judgeParagraph } from "./label/annotate.ts";
import { judgeParagraph, annotateParagraph, reEvalParagraph } from "./label/annotate.ts";
import { appendJsonl, readJsonlRaw } from "./lib/jsonl.ts";
import { v4 as uuidv4 } from "uuid";
import { PROMPT_VERSION } from "./label/prompts.ts";
@ -29,6 +29,9 @@ Commands:
  label:golden [--paragraphs <path>] [--limit N] [--delay N] [--concurrency N] (Opus via Agent SDK)
  label:bench-holdout --model <id> [--concurrency N] [--limit N] (benchmark model on holdout)
  label:bench-holdout-all [--concurrency N] [--limit N] (all BENCHMARK_MODELS on holdout)
  label:bench-holdout-v35 --model <id> [--concurrency N] [--limit N] (v3.5 re-run on confusion-axis holdout)
  label:golden-v35 [--limit N] [--delay N] [--concurrency N] (Opus v3.5 re-run on confusion-axis holdout)
  label:reeval --model <id> [--concurrency N] [--limit N] (re-evaluate flagged Stage 1 paragraphs)
label:cost`);
  process.exit(1);
}
@ -321,6 +324,145 @@ async function cmdBenchHoldoutAll(): Promise<void> {
  }
}

async function loadConfusionAxisParagraphs(rerunFile?: string): Promise<Paragraph[]> {
  const rerunPath = rerunFile ?? `${DATA}/gold/holdout-rerun-v35.jsonl`;
  const { records: rerunRecords } = await readJsonlRaw(rerunPath);
  const rerunIds = new Set(
    rerunRecords
      .filter((r): r is { paragraphId: string } =>
        !!r && typeof r === "object" && "paragraphId" in r)
      .map((r) => r.paragraphId),
  );
  process.stderr.write(`  Loaded ${rerunIds.size} confusion-axis paragraph IDs\n`);

  const allHoldout = await loadHoldoutParagraphs();
  const paragraphs = allHoldout.filter((p) => rerunIds.has(p.id));
  process.stderr.write(`  Matched ${paragraphs.length} paragraphs for v3.5 re-run\n`);
  return paragraphs;
}
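
// For reference, each line of the rerun file is only required to carry a
// paragraphId; any other fields are ignored by the filter above. An invented
// example record:
//   {"paragraphId":"para-000123","axis":"SI|N/O"}
// Records without a paragraphId are silently dropped rather than failing the
// whole load.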

async function cmdGoldenV35(): Promise<void> {
  const paragraphs = await loadConfusionAxisParagraphs();

  if (paragraphs.length === 0) {
    process.stderr.write("  ✖ No confusion-axis paragraphs found\n");
    process.exit(1);
  }

  await runGoldenBatch(paragraphs, {
    outputPath: `${DATA}/annotations/golden-v35/opus.jsonl`,
    errorsPath: `${DATA}/annotations/golden-v35/opus-errors.jsonl`,
    limit: flag("limit") !== undefined ? flagInt("limit", 50) : undefined,
    delayMs: flag("delay") !== undefined ? flagInt("delay", 500) : 500,
    concurrency: flagInt("concurrency", 20),
  });
}

async function cmdBenchHoldoutV35(): Promise<void> {
  const modelId = flag("model");
  if (!modelId) {
    console.error("--model is required");
    process.exit(1);
  }

  const rerunFile = flag("rerun-file");
  const outputDir = flag("output-dir") ?? "bench-holdout-v35";
  const paragraphs = await loadConfusionAxisParagraphs(rerunFile ?? undefined);
  const modelShort = modelId.split("/")[1]!;
  await runBatch(paragraphs, {
    modelId,
    stage: "benchmark",
    outputPath: `${DATA}/annotations/${outputDir}/${modelShort}.jsonl`,
    errorsPath: `${DATA}/annotations/${outputDir}/${modelShort}-errors.jsonl`,
    sessionsPath: SESSIONS_PATH,
    concurrency: flagInt("concurrency", 60),
    limit: flag("limit") !== undefined ? flagInt("limit", 50) : undefined,
  });
}
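
// Hypothetical invocations (runner name and model id invented for
// illustration):
//   cli label:golden-v35 --limit 20 --concurrency 10
//   cli label:bench-holdout-v35 --model someprovider/some-model --concurrency 30
// Both commands read the same confusion-axis ID list, so their outputs are
// comparable paragraph-for-paragraph.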

async function cmdReeval(): Promise<void> {
  const modelId = flag("model");
  if (!modelId) {
    console.error("--model is required");
    process.exit(1);
  }

  // Load flagged paragraphs
  const correctionsPath = `${DATA}/annotations/stage1-corrections.jsonl`;
  const { records: corrections } = await readJsonlRaw(correctionsPath);
  const correctionMap = new Map<string, { reason: string }>();
  for (const r of corrections) {
    const rec = r as { paragraphId?: string; reason?: string };
    if (rec.paragraphId && rec.reason) {
      correctionMap.set(rec.paragraphId, { reason: rec.reason });
    }
  }
  process.stderr.write(`  Loaded ${correctionMap.size} flagged paragraphs from ${correctionsPath}\n`);

  // Load all paragraphs and filter to flagged ones
  const paragraphs = await loadParagraphs();
  const flaggedParagraphs = paragraphs.filter((p) => correctionMap.has(p.id));
  process.stderr.write(`  Matched ${flaggedParagraphs.length} paragraphs for re-evaluation\n`);

  const modelShort = modelId.split("/")[1]!;
  const outputPath = `${DATA}/annotations/stage2/reeval-${modelShort}.jsonl`;
  const errorsPath = `${DATA}/annotations/stage2/reeval-${modelShort}-errors.jsonl`;

  // Resume support
  const { records: existing } = await readJsonlRaw(outputPath);
  const doneIds = new Set(
    existing
      .filter((r): r is { paragraphId: string } =>
        !!r && typeof r === "object" && "paragraphId" in r)
      .map((r) => r.paragraphId),
  );

  let remaining = flaggedParagraphs.filter((p) => !doneIds.has(p.id));
  const limit = flag("limit") !== undefined ? flagInt("limit", 50) : undefined;
  if (limit !== undefined) remaining = remaining.slice(0, limit);

  process.stderr.write(`  ${remaining.length} paragraphs to re-evaluate (${doneIds.size} already done)\n`);

  const runId = uuidv4();
  const concurrency = flagInt("concurrency", 12);
  let processed = 0;
  let errored = 0;
  let totalCost = 0;

  // Process in batches respecting concurrency
  for (let i = 0; i < remaining.length; i += concurrency) {
    const batch = remaining.slice(i, i + concurrency);
    const results = await Promise.allSettled(
      batch.map(async (paragraph) => {
        const correction = correctionMap.get(paragraph.id)!;
        const reason = correction.reason as "materiality_language" | "spac" | "other";
        const ann = await reEvalParagraph(paragraph, {
          modelId,
          runId,
          reason,
          promptVersion: PROMPT_VERSION,
        });
        await appendJsonl(outputPath, ann);
        totalCost += ann.provenance.costUsd;
        processed++;
      }),
    );

    for (const r of results) {
      if (r.status === "rejected") {
        errored++;
        process.stderr.write(`  ✖ Error: ${r.reason}\n`);
      }
    }

    if (processed % 50 === 0 || i + concurrency >= remaining.length) {
      process.stderr.write(`  ${processed}/${remaining.length} re-evaluated (${errored} errors, $${totalCost.toFixed(4)} cost)\n`);
    }
  }

  process.stderr.write(`\n  ✓ Re-evaluation done: ${processed} processed, ${errored} errors, $${totalCost.toFixed(4)} total cost\n`);
}
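
// Note on the resume pattern above: each annotation is appended to JSONL as
// soon as it completes, so an interrupted run can simply be re-invoked —
// doneIds is rebuilt from the output file and finished paragraphs are skipped.
// Each batch is awaited before the next starts, so the processed/errored
// counters are read only between batches.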

async function cmdCost(): Promise<void> {
  const modelCosts: Record<string, { cost: number; count: number }> = {};
  const stageCosts: Record<string, { cost: number; count: number }> = {};
@ -435,6 +577,15 @@ switch (command) {
  case "label:bench-holdout-all":
    await cmdBenchHoldoutAll();
    break;
  case "label:bench-holdout-v35":
    await cmdBenchHoldoutV35();
    break;
  case "label:golden-v35":
    await cmdGoldenV35();
    break;
  case "label:reeval":
    await cmdReeval();
    break;
  case "label:cost":
    await cmdCost();
    break;
ts/src/label/annotate.ts
@ -3,7 +3,7 @@ import { openrouter, providerOf } from "../lib/openrouter.ts";
import { LabelOutputRaw, toLabelOutput } from "@sec-cybert/schemas/label.ts";
import type { Annotation } from "@sec-cybert/schemas/annotation.ts";
import type { Paragraph } from "@sec-cybert/schemas/paragraph.ts";
import { SYSTEM_PROMPT, buildUserPrompt, buildJudgePrompt, PROMPT_VERSION } from "./prompts.ts";
import { SYSTEM_PROMPT, buildUserPrompt, buildJudgePrompt, buildReEvalPrompt, PROMPT_VERSION } from "./prompts.ts";
import { withRetry } from "../lib/retry.ts";

/** OpenRouter reasoning effort levels. */
@ -125,6 +125,88 @@ export interface JudgeOpts {
 * Run the Stage 2 judge on a paragraph where Stage 1 models disagreed.
 * Receives the paragraph + all 3 prior annotations in randomized order.
 */
export interface ReEvalOpts {
  modelId: string;
  runId: string;
  reason: "materiality_language" | "spac" | "other";
  promptVersion?: string;
  reasoningEffort?: ReasoningEffort;
}

/**
 * Re-evaluate a paragraph under v3.5 codebook rules.
 * Used for correcting Stage 1 labels affected by prompt version drift.
 */
export async function reEvalParagraph(
  paragraph: Paragraph,
  opts: ReEvalOpts,
): Promise<Annotation> {
  const {
    modelId,
    runId,
    reason,
    promptVersion = PROMPT_VERSION,
    reasoningEffort = "medium",
  } = opts;
  const requestedAt = new Date().toISOString();
  const start = Date.now();

  const useRawText = modelId.startsWith("minimax/") || modelId.startsWith("moonshotai/");

  const result = await withRetry(
    async () => {
      if (useRawText) {
        const r = await generateText({
          model: openrouter(modelId),
          system: SYSTEM_PROMPT,
          prompt: buildReEvalPrompt(paragraph, reason),
          temperature: 0,
          providerOptions: buildProviderOptions(reasoningEffort),
          abortSignal: AbortSignal.timeout(360_000),
        });
        const text = r.text.trim();
        const fenceMatch = text.match(/```(?:json)?\s*\n?([\s\S]*?)\n?```/);
        const jsonStr = fenceMatch ? fenceMatch[1]! : text;
        const parsed = LabelOutputRaw.parse(JSON.parse(jsonStr));
        return { ...r, output: parsed, usage: r.usage, response: r.response, providerMetadata: r.providerMetadata };
      }
      return generateText({
        model: openrouter(modelId),
        output: Output.object({ schema: LabelOutputRaw }),
        system: SYSTEM_PROMPT,
        prompt: buildReEvalPrompt(paragraph, reason),
        temperature: 0,
        providerOptions: buildProviderOptions(reasoningEffort),
        abortSignal: AbortSignal.timeout(360_000),
      });
    },
    { label: `reeval:${modelId}:${paragraph.id}` },
  );

  const latencyMs = Date.now() - start;
  const rawOutput = result.output as LabelOutputRaw;
  if (!rawOutput) throw new Error(`No output from ${modelId} for ${paragraph.id}`);

  return {
    paragraphId: paragraph.id,
    label: toLabelOutput(rawOutput),
    provenance: {
      modelId,
      provider: providerOf(modelId),
      generationId: result.response?.id ?? "unknown",
      stage: "stage2-judge",
      runId,
      promptVersion,
      inputTokens: result.usage?.inputTokens ?? 0,
      outputTokens: result.usage?.outputTokens ?? 0,
      reasoningTokens: result.usage?.outputTokenDetails?.reasoningTokens ?? 0,
      costUsd: extractCost(result),
      latencyMs,
      requestedAt,
    },
  };
}
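
// Design note (inferred from the branch above, not stated in this diff): the
// raw-text path appears to exist for providers whose endpoints do not reliably
// honor structured output, so the reply is parsed out of an optional ```json
// fence instead of relying on Output.object(); when no fence is found, the
// entire reply is parsed as JSON.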

export async function judgeParagraph(
  paragraph: Paragraph,
  priorAnnotations: Array<{
ts/src/label/prompts.ts
@ -1,6 +1,6 @@
import type { Paragraph } from "@sec-cybert/schemas/paragraph.ts";

export const PROMPT_VERSION = "v3.0";
export const PROMPT_VERSION = "v3.5";

/** System prompt for all Stage 1 annotation and benchmarking. */
export const SYSTEM_PROMPT = `You are an expert annotator classifying paragraphs from SEC cybersecurity disclosures (Form 10-K Item 1C and Form 8-K Item 1.05 filings).
@ -11,30 +11,46 @@ For each paragraph, assign a content_category and specificity level.

Assign the single most applicable category:

"Board Governance" — Board/committee oversight of cyber risk, briefing cadence, board cyber expertise. Assign when the board or a board committee is the grammatical subject.
"Board Governance" — Board/committee oversight of cyber risk, briefing cadence, board cyber expertise. Assign when the paragraph describes the governance/oversight STRUCTURE — how the board exercises oversight, who reports to the board, how information flows upward. Governance-chain paragraphs (board → committee → officer → program) are BG even when officers appear as grammatical subjects, because the PURPOSE is describing oversight structure.
"Management Role" — CISO/CTO/CIO identification, qualifications, reporting lines. Assign when a management role is the grammatical subject.
"Management Role" — CISO/CTO/CIO identification, qualifications, reporting lines. Assign when the paragraph is primarily about WHO the person IS — their credentials, experience, certifications, career history. Naming an officer as part of a governance or process description does NOT make it Management Role.
"Risk Management Process" — Risk assessment, framework adoption, vulnerability management, monitoring, IR planning, ERM integration. Assign when the company's OWN internal processes are the topic.
"Third-Party Risk" — Vendor/supplier security oversight, contractual security standards. Assign ONLY when vendor oversight is the CENTRAL topic, not a component of internal processes.
"Incident Disclosure" — Description of actual cybersecurity incidents: what happened, when, scope, response actions. Must reference a real event. Includes: incident narrative, incident response actions, AND descriptions of affected data/systems scope or operational impact of a disclosed incident.
"Strategy Integration" — Business/financial impact, cyber insurance, budget, materiality assessments. Includes standalone materiality conclusions with no incident narrative.
"Strategy Integration" — Business/financial impact, cyber insurance, budget, materiality ASSESSMENTS. A materiality assessment is the company stating a conclusion about whether cybersecurity has or will affect business outcomes. Includes: backward-looking ("have not materially affected"), forward-looking with SEC qualifier ("reasonably likely to materially affect"), and negative assertions ("have not experienced material incidents"). Does NOT include generic risk warnings ("could have a material adverse effect") — those are boilerplate speculation, not assessments. Does NOT include "material" as an adjective ("managing material risks").
"None/Other" — Forward-looking disclaimers, section headers, cross-references, non-cybersecurity content. NO substantive disclosure at all.
"None/Other" — Forward-looking disclaimers, section headers, cross-references, non-cybersecurity content, generic IT-dependence language ("our IT systems are important"). NO substantive disclosure AND no materiality language at all.

CATEGORY TIEBREAKERS:
- Paragraph DESCRIBES what happened in an incident (dates, access, encryption, scope, response actions) → Incident Disclosure
- Paragraph ONLY discusses financial cost, insurance, or materiality of an incident WITHOUT describing the event → Strategy Integration (even if it says "the incident" or "the cybersecurity incident")
- Brief mention of a past incident + materiality conclusion as the main point → Strategy Integration
- Standalone materiality conclusion with no incident reference → Strategy Integration
- Materiality disclaimers ("have not materially affected our business strategy, results of operations, or financial condition") → Strategy Integration, even if boilerplate. A cross-reference to Risk Factors appended to a materiality assessment does NOT change the classification. Only pure cross-references with no materiality conclusion are None/Other.
- Materiality ASSESSMENTS → Strategy Integration. An assessment is the company stating a conclusion:
  • Backward: "have not materially affected our business strategy, results of operations, or financial condition" → SI
  • Forward with SEC qualifier: "reasonably likely to materially affect" → SI
  • Negative assertion: "we have not experienced any material cybersecurity incidents" → SI
  NOT assessments (do NOT trigger SI):
  • Generic risk warning: "could have a material adverse effect on our business" → NOT SI. This is boilerplate speculation in every 10-K, not a conclusion. Classify by the paragraph's primary content.
  • "Material" as adjective: "managing material risks" → NOT SI. "Material" means "significant" here, not a materiality assessment.
  • Consequence clause: SPECULATIVE materiality language ("could have a material adverse effect") at the END of an RMP/risk paragraph does not override the primary purpose. BUT a negative assertion ("we have not experienced any material cybersecurity incidents") IS an assessment even at the end of a paragraph — it is a factual conclusion, not speculation.
  • Cross-references with materiality language: "For risks that may materially affect us, see Item 1A" → N/O (pointing elsewhere, not concluding).
- SPACs and shell companies explicitly stating they have no operations, no cybersecurity program, or no formal processes → None/Other regardless of incidental mentions of board oversight or risk acknowledgment. The absence of a program is not a description of a program.
- Internal processes mentioning vendors as one component → Risk Management Process
- Requirements imposed ON vendors → Third-Party Risk
- Board oversight mentioned briefly + management roles as main focus → Management Role
- Management mentioned briefly + board oversight as main focus → Board Governance

PERSON-VS-FUNCTION TEST (Management Role vs Risk Management Process):
If a paragraph is about the PERSON (qualifications, credentials, background, tenure, career history) → Management Role.
If it's about what the role/program DOES (processes, activities, tools, frameworks) → Risk Management Process, even if a CISO/CIO/CTO title appears.
Test: would the paragraph still make sense if you removed the person's name, title, and credentials? If yes → the paragraph is about the function, not the person → Risk Management Process.
MR vs RMP — THREE-STEP DECISION CHAIN (apply in order):
Step 1 — SUBJECT TEST: What is the grammatical subject?
  Clear process/framework/program as subject with no person detail → Risk Management Process. STOP.
  Person/role as subject → this is a SIGNAL, not decisive. ALWAYS continue to Step 2.
Step 2 — PERSON-REMOVAL TEST: Delete all named roles, titles, qualifications, experience, and credentials. Is the remaining text a coherent cybersecurity disclosure?
  YES → Risk Management Process (the process stands alone; people are incidental).
  NO → Management Role (the paragraph is fundamentally about who these people are).
  Borderline → continue to Step 3.
Step 3 — QUALIFICATIONS TIEBREAKER: Does the paragraph include years of experience, certifications (CISSP, CISM), education, team size, or career history?
  YES → Management Role (qualifications are MR-specific content).
  NO → Risk Management Process (no person-specific content beyond a title).
IMPORTANT: A paragraph where a named officer (CISO, CTO) is the grammatical subject but the content describes what the PROGRAM does is Risk Management Process. Step 1 must NOT short-circuit to MR just because a person is mentioned. Always apply Step 2.

═══ SPECIFICITY ═══
@ -113,26 +129,46 @@ ${text}`;

// ── Category confusion-axis disambiguation rules ──────────────────────────
// Keyed by sorted pair of disputed categories. Only included when relevant.
const CATEGORY_GUIDANCE: Record<string, string> = {
  "Management Role|Risk Management Process": `MANAGEMENT ROLE vs RISK MANAGEMENT PROCESS — ask: what is the DOMINANT communicative purpose?
• A named manager (CISO, VP) mentioned once at the beginning, followed by extensive process description → Risk Management Process. The role mention is incidental.
• Management Role requires the manager's identity, qualifications, or reporting structure to be the PRIMARY content — not just a brief attribution.
• Test: remove the role mention. Does the paragraph still make sense as a process description? If yes → RMP.`,
  "Management Role|Risk Management Process": `MANAGEMENT ROLE vs RISK MANAGEMENT PROCESS — apply the decision chain:
Step 1 — SUBJECT TEST: Is the process/framework clearly the subject with no person detail? → RMP. STOP. If a person is the subject → this is only a signal. ALWAYS continue to Step 2.
Step 2 — PERSON-REMOVAL TEST: Delete all people/titles/qualifications. Still a coherent disclosure? YES → RMP. NO → MR. Borderline → Step 3.
Step 3 — QUALIFICATIONS TIEBREAKER: Does it mention years of experience, certifications, education, team size, career history? YES → MR. NO → RMP.
CRITICAL: A person being the grammatical subject does NOT automatically mean Management Role. Many SEC disclosures say "Our CISO oversees..." then describe the program. Apply Step 2.
Examples:
• "Our CISO has 20 years of experience and holds CISSP certification. She reports to the CIO." → MR (remove people → nothing left; has qualifications)
• "Our cybersecurity program includes risk assessment and monitoring, overseen by our CISO." → RMP (remove CISO → program description stands alone)
• "Our CISO oversees the Company's cybersecurity program, which includes risk assessments, vulnerability scanning, and incident response planning." → RMP (person is subject BUT remove CISO → "the Company's cybersecurity program includes..." still works. Content is about the program.)`,
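
// Presumably looked up with a key built as [catA, catB].sort().join("|") by
// the judge-prompt builder (the lookup code sits outside this hunk); e.g. a
// dispute between "Risk Management Process" and "Management Role" resolves to
// the key "Management Role|Risk Management Process".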

  "Risk Management Process|Third-Party Risk": `RISK MANAGEMENT PROCESS vs THIRD-PARTY RISK — ask: is vendor/supplier oversight the CENTRAL topic?
• "We use third-party consultants for penetration testing" = RMP (third parties support an internal process).
• "We maintain a vendor oversight program with due diligence and monitoring of third-party controls" = Third-Party Risk (vendor oversight IS the topic).
• The paragraph must be PRIMARILY about managing vendor/supplier cyber risk to qualify as Third-Party Risk.`,

  "None/Other|Strategy Integration": `NONE/OTHER vs STRATEGY INTEGRATION — ask: is there substantive cybersecurity disclosure?
• None/Other = NO substantive disclosure at all: section headers, disclaimers, generic IT-dependence language ("our IT systems are important to operations"), forward-looking boilerplate, generic regulatory compliance language ("subject to various regulatory requirements... non-compliance could result in penalties").
• Strategy Integration = actual discussion of business/financial impact, cyber insurance, budget allocation, or materiality assessment.
• Generic regulatory risk language (acknowledging regulations exist, non-compliance would be bad) is None/Other — it makes no materiality assessment and describes no strategy. It only becomes Strategy Integration if it explicitly assesses whether regulatory risks have "materially affected" the business.
• If the paragraph only establishes that the company has IT systems and data without describing any program, process, or strategy → None/Other.`,

  "None/Other|Strategy Integration": `NONE/OTHER vs STRATEGY INTEGRATION — the materiality ASSESSMENT test:
The test is whether the company is MAKING A MATERIALITY CONCLUSION, not whether the word "material" appears.

IS a materiality assessment or SI marker → Strategy Integration:
• Backward-looking: "have not materially affected our business strategy, results of operations, or financial condition" (company reporting on actual impact)
• Forward-looking with SEC qualifier: "reasonably likely to materially affect" (Item 106(b)(2) language — the company is making a forward-looking assessment)
• Negative assertions: "we have not experienced any material cybersecurity incidents" (materiality conclusion about past events — SI even if at end of paragraph)
• Insurance, budget, investment discussion: "we expend considerable resources on cybersecurity", cyber insurance, cost allocation (strategic resource commitment)

Is NOT a materiality assessment → classify by primary purpose (usually N/O or RMP):
• Generic risk warning: "could have a material adverse effect on our business" — this is boilerplate risk factor language that appears in every 10-K. The word "could" indicates speculation, not an assessment. → N/O or RMP depending on surrounding content.
• "Material" as adjective: "managing material risks associated with cybersecurity" — "material" here means "significant," not a materiality assessment. → RMP.
• Consequence clause: SPECULATIVE materiality language ("could have a material adverse effect") at the END of a paragraph does not override primary purpose. BUT a factual negative assertion ("we have not experienced any material cybersecurity incidents") IS an assessment even at the end — it states a conclusion. If a paragraph contains BOTH speculative consequence language AND a factual negative assertion, the negative assertion triggers SI.
• Cross-references: "For a description of risks that may materially affect the Company, see Item 1A" → N/O (pointing elsewhere, not making an assessment).

KEY DISTINCTION: "Risks have not materially affected us" = SI (CONCLUSION). "Risks could have a material adverse effect" = N/O (SPECULATION). "Risks are reasonably likely to materially affect us" = SI (FORWARD-LOOKING CONCLUSION with SEC qualifier).`,

  "Board Governance|Management Role": `BOARD GOVERNANCE vs MANAGEMENT ROLE — ask: who is the grammatical subject?
• Board or board committee taking oversight actions (receiving briefings, reviewing risks) → Board Governance.
• Named executive with qualifications, experience, or reporting lines → Management Role.
• When both appear, the PRIMARY focus wins: board oversight with a brief management mention → Board Governance, and vice versa.`,

  "Board Governance|Management Role": `BOARD GOVERNANCE vs MANAGEMENT ROLE — the PURPOSE test:
Ask: what is the paragraph's COMMUNICATIVE PURPOSE?
• PURPOSE = describing the oversight/reporting STRUCTURE (who reports to whom, how the board exercises oversight, briefing cadence, committee responsibilities) → Board Governance. The board/committee's actions must be a SIGNIFICANT part of the paragraph (multiple sentences describing what the board/committee does, receives, or directs).
• PURPOSE = describing WHO a specific person IS (qualifications, credentials, experience, career history, team they lead) → Management Role.
• CRITICAL THRESHOLD: A one-sentence mention of a board/committee does NOT make a paragraph Board Governance. Test: if you removed the committee sentence, would the paragraph lose its main point? If NO → the committee mention is incidental; classify based on the remaining content.
• "Our management team oversees cybersecurity technologies and processes. Our Audit Committee also provides oversight." → NOT BG. The committee mention is a brief addendum. The paragraph is about what management does → MR or RMP.
• "The Audit Committee receives quarterly briefings from the CISO and conducts annual reviews of the cybersecurity program." → BG. The committee's oversight actions ARE the content.
• Governance-chain paragraphs where the board/committee spans multiple sentences ARE Board Governance. Single-sentence mentions are NOT enough.`,

  "Board Governance|Risk Management Process": `BOARD GOVERNANCE vs RISK MANAGEMENT PROCESS — ask: oversight or operations?
• Board/committee receiving reports, overseeing risk, setting policy → Board Governance.
@ -140,9 +176,10 @@ const CATEGORY_GUIDANCE: Record<string, string> = {
• "The board receives quarterly cybersecurity briefings" → Board Governance. "We conduct quarterly risk assessments; the board is informed" → RMP (process is primary content).`,

  "None/Other|Risk Management Process": `NONE/OTHER vs RISK MANAGEMENT PROCESS — ask: does the paragraph describe actual cybersecurity activities?
• Describing actual processes (monitoring, assessment, vulnerability management, training programs) → RMP.
• Describing actual processes, measures, or controls the company has implemented → RMP. Key signals: "we have implemented," "we use," "we maintain," "we have taken steps to," "our program includes," "we engage." Even if surrounded by risk-factor framing, ACTUAL MEASURES = RMP.
• Only stating the company has IT systems, collects data, or faces cyber risks — without describing what it DOES about them → None/Other.
• Only stating the company has IT systems, faces cyber risks, or enumerating threat types — without describing what it DOES about them → None/Other.
• Generic regulatory compliance language ("subject to various regulations... non-compliance could result in penalties") is None/Other — it describes no actual compliance activities. If a specific regulation is named (GDPR, HIPAA, PCI DSS) but no company-specific program is described → RMP at Specificity 2 (named standard).`,
• Generic regulatory compliance language ("subject to various regulations... non-compliance could result in penalties") is None/Other — it describes no actual compliance activities. If a specific regulation is named (GDPR, HIPAA, PCI DSS) but no company-specific program is described → RMP at Specificity 2 (named standard).
• A paragraph that BOTH enumerates threats AND describes measures taken is RMP — the measures are the substantive content.`,

  "Risk Management Process|Strategy Integration": `RISK MANAGEMENT PROCESS vs STRATEGY INTEGRATION — ask: operational or strategic?
• Describing HOW risks are assessed, monitored, mitigated → Risk Management Process.
@ -176,6 +213,81 @@ Do NOT count toward QV:
✗ Generic degrees without named university
Need 2+ QV-eligible facts. One fact = stays at Firm-Specific.`;

/**
 * Build a re-evaluation prompt for paragraphs flagged for codebook-correction review.
 * Used when unanimous Stage 1 labels may be wrong due to prompt version drift
 * (e.g., v2.5 lacked the materiality→SI rule, so N/O labels on paragraphs with
 * materiality language need re-evaluation under v3.5 rules).
 *
 * Unlike the judge prompt, this does NOT show prior annotations (to avoid anchoring
 * to the potentially-wrong unanimous label). Instead, it provides the specific
 * codebook rule that triggered the re-evaluation and asks for a fresh classification.
 */
export function buildReEvalPrompt(
  paragraph: Paragraph,
  reason: "materiality_language" | "spac" | "other",
): string {
  const { filing, text } = paragraph;

  let ruleBlock: string;
  if (reason === "materiality_language") {
    ruleBlock = `═══ RULE UNDER REVIEW ═══

This paragraph was previously labeled None/Other. It has been flagged for re-evaluation because it contains materiality-related language.

CODEBOOK RULE 6 (v3.5): Materiality ASSESSMENTS are Strategy Integration. An assessment is the company STATING A CONCLUSION about materiality:
• Backward-looking: "have not materially affected our business strategy, results of operations, or financial condition" → SI
• Forward-looking with SEC qualifier: "reasonably likely to materially affect" → SI
• Negative assertion: "we have not experienced any material cybersecurity incidents" → SI

The following are NOT materiality assessments and do NOT trigger SI:
• Generic risk warning: "could have a material adverse effect on our business" → NOT SI (boilerplate speculation, not a conclusion)
• "Material" as adjective: "managing material risks" → NOT SI ("material" means "significant" here)
• Consequence clause: SPECULATIVE materiality language ("could have a material adverse effect") at the END of a paragraph does not override primary purpose. BUT a factual negative assertion ("we have not experienced any material cybersecurity incidents") IS an assessment even at the end.
• Cross-references: "For risks that may materially affect us, see Item 1A" → N/O

KEY DISTINCTION: "Risks have not materially affected us" = SI (conclusion). "Risks could have a material adverse effect" = N/O (speculation). "Reasonably likely to materially affect" = SI (SEC-qualified forward-looking assessment).`;
  } else if (reason === "spac") {
    ruleBlock = `═══ RULE UNDER REVIEW ═══

This paragraph was flagged for re-evaluation because it may be from a SPAC or shell company.

CODEBOOK RULE (v3.0+): SPACs and shell companies explicitly stating they have no operations, no cybersecurity program, or no formal processes → None/Other, regardless of incidental mentions of board oversight or risk acknowledgment. The absence of a program is not a description of a program.`;
  } else {
    ruleBlock = `═══ RE-EVALUATION ═══

This paragraph has been flagged for fresh classification under codebook v3.5 rules. Apply all current rules without anchoring to any prior label.`;
  }

  return `═══ RE-EVALUATION TASK ═══

You are re-classifying this paragraph under updated codebook rules (v3.5). Classify it fresh — do not assume any prior label is correct.

${ruleBlock}

═══ ANALYSIS STEPS ═══

1. Read the paragraph carefully.
2. Apply the specific rule described above to determine if it changes the classification.
3. Apply all standard codebook rules for both category and specificity.
4. Provide your classification with reasoning.

═══ CONFIDENCE CALIBRATION ═══

HIGH = the rule clearly applies (or clearly doesn't) — the answer is unambiguous
MEDIUM = the rule is relevant but the paragraph is borderline
LOW = genuinely ambiguous even with the updated rule

═══ PARAGRAPH ═══

Company: ${filing.companyName} (${filing.ticker})
Filing type: ${filing.filingType}
Filing date: ${filing.filingDate}
Section: ${filing.secItem}

${text}`;
}
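
// Hypothetical usage — building the re-eval prompt for a flagged paragraph:
//   const prompt = buildReEvalPrompt(paragraph, "materiality_language");
// Only the RULE UNDER REVIEW block varies by reason; the analysis steps,
// confidence calibration, and paragraph framing are shared across all three.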

/**
 * Build the Stage 2 judge prompt with disagreement-aware disambiguation.
 * Dynamically includes only the guidance relevant to the specific dispute.