pivot point

parent 26367a8e86
commit d653ed9a20
@@ -1,6 +1,6 @@
 outs:
-- md5: d64ad0c8040d75230a3013c4751910eb.dir
-  size: 740635168
-  nfiles: 174
+- md5: 4ad135e50584bca430b79307e8bd1050.dir
+  size: 741469715
+  nfiles: 194
   hash: md5
   path: .dvc-store

3 .gitignore vendored

@@ -55,3 +55,6 @@ report.[0-9]_.[0-9]_.[0-9]_.[0-9]_.json
 .DS_Store
 python/*.whl
 /.dvc-store
+
+# Personal notes
+docs/STRATEGY-NOTES.md

@@ -123,7 +123,7 @@ Each paragraph is assigned exactly **one** content category. If a paragraph span
 - **Covers:** Material impact (or lack thereof) on business strategy or financials, cybersecurity insurance, investment/resource allocation, cost of incidents
 - **Key markers:** "business strategy," "insurance," "investment," "material," "financial condition," "budget," "not materially affected," "results of operations"
 - **Assign when:** The paragraph primarily discusses business/financial consequences or strategic response to cyber risk, not the risk management activities themselves
-- **Includes materiality disclaimers:** Any paragraph that explicitly assesses whether cybersecurity risks have or could "materially affect" the company's business, strategy, financial condition, or results of operations is Strategy Integration — even if the assessment is boilerplate. The company is making a strategic judgment about cyber risk impact, which is the essence of this category. A cross-reference to Risk Factors appended to a materiality assessment does not change the classification.
+- **Includes materiality ASSESSMENTS:** A materiality assessment is the company stating a conclusion about whether cybersecurity has or will affect business outcomes. Backward-looking ("have not materially affected"), forward-looking with SEC qualifier ("reasonably likely to materially affect"), and negative assertions ("have not experienced material incidents") are all assessments → SI. Generic risk warnings ("could have a material adverse effect") are NOT assessments — they are boilerplate speculation that appears in every 10-K → classify by primary content. "Material" as an adjective ("managing material risks") is also not an assessment.

 **Example texts:**

@@ -170,19 +170,49 @@ Each paragraph is assigned exactly **one** content category. If a paragraph span
 ### Rule 1: Dominant Category
 If a paragraph spans multiple categories, assign the one whose topic occupies the most text or is the paragraph's primary communicative purpose.

-### Rule 2: Board vs. Management
+### Rule 2: Board vs. Management (the board-line test)
+
+**Core principle:** The governance hierarchy has distinct layers — board/committee oversight at the top, management execution below. The paragraph's category depends on which layer is the primary focus.
+
+| Layer | Category | Key signals |
+|-------|----------|-------------|
+| Board/committee directing, receiving reports, or overseeing | Board Governance | "Board oversees," "Committee reviews," "reports to the Board" (board is recipient) |
+| Named officer's qualifications, responsibilities, reporting lines | Management Role | "CISO has 20 years experience," "responsible for," credentials |
+| Program/framework/controls described | Risk Management Process | "program is designed to," "framework includes," "controls aligned with" |
+
+**When a paragraph spans layers** (governance chain paragraphs): apply the **purpose test** — what is the paragraph's communicative purpose?
+
+- **Purpose = describing oversight/reporting structure** (who reports to whom, briefing cadence, committee responsibilities, how information flows to the board) → **Board Governance**, even if officers appear as grammatical subjects. The officers are intermediaries in the governance chain, not the focus.
+- **Purpose = describing who a person is** (qualifications, credentials, experience, career history) → **Management Role**.
+- **Governance-chain paragraphs are almost always Board Governance.** They become Management Role ONLY when the officer's personal qualifications/credentials are the dominant content.
+
 | Signal | Category |
 |--------|----------|
 | Board/committee is the grammatical subject | Board Governance |
 | Board delegates responsibility to management | Board Governance |
-| Management role reports TO the board | Management Role |
+| Management role reports TO the board (describing reporting structure) | Board Governance (the purpose is describing how oversight works) |
-| Management role's qualifications are described | Management Role |
+| Management role's qualifications, experience, credentials described | Management Role |
-| "Board oversees... CISO reports to Board quarterly" | Board Governance (board is primary actor) |
+| "Board oversees... CISO reports to Board quarterly" | Board Governance (oversight structure) |
-| "CISO reports quarterly to the Board on..." | Management Role (CISO is primary actor) |
+| "CISO reports quarterly to the Board on..." | Board Governance (reporting structure, not about who the CISO is) |
+| "The CISO has 20 years of experience and reports to the CIO" | Management Role (person's qualifications are the content) |
+| Governance overview spanning board → committee → officer → program | **Board Governance** (purpose is describing the structure) |

-### Rule 2b: Management Role vs. Risk Management Process (the person-vs-function test)
+### Rule 2b: Management Role vs. Risk Management Process (three-step decision chain)

-This is the single most common source of annotator disagreement. The line is: **is the paragraph about the person or about the function?**
+This is the single most common source of annotator disagreement. Apply the following tests in order — stop at the first decisive result (a sketch of the chain follows the steps).
+
+**Step 1 — Subject test:** What is the paragraph's grammatical subject?
+
+- Clear process/framework/program as subject with no person detail → **Risk Management Process**. Stop.
+- Person/role as subject → this is a **signal**, not decisive. Always continue to Step 2. Many SEC disclosures name an officer then describe the program — Step 2 determines which is the actual content.
+
+**Step 2 — Person-removal test:** Could you delete all named roles, titles, qualifications, experience descriptions, and credentials from the paragraph and still have a coherent cybersecurity disclosure?
+
+- **YES** → **Risk Management Process** (the process stands on its own; people are incidental)
+- **NO** → **Management Role** (the paragraph is fundamentally about who these people are)
+- Borderline → continue to Step 3
+
+**Step 3 — Qualifications tiebreaker:** Does the paragraph include experience (years), certifications (CISSP, CISM), education, team size, or career history for named individuals?
+
+- **YES** → **Management Role** (qualifications are MR-specific content; the SEC requires management role disclosure specifically because investors want to know WHO is responsible)
+- **NO** → **Risk Management Process** (no person-specific content beyond a title attribution)
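A minimal sketch of the chain in Python — the three inputs are hypothetical stand-ins for the per-paragraph judgments the steps describe, not part of the codebook itself:

```python
from typing import Optional

# Hypothetical sketch of the Rule 2b chain; the three flags stand in for
# judgments a human annotator (or model) makes about the paragraph.
def rule_2b(subject_is_process: bool,
            person_removal_coherent: Optional[bool],
            has_qualifications: bool) -> str:
    # Step 1 (subject test): only decisive when a process/framework is the
    # subject with no person detail. A person as subject is just a signal.
    if subject_is_process:
        return "Risk Management Process"
    # Step 2 (person-removal test): None encodes "borderline".
    if person_removal_coherent is True:
        return "Risk Management Process"
    if person_removal_coherent is False:
        return "Management Role"
    # Step 3 (qualifications tiebreaker).
    return "Management Role" if has_qualifications else "Risk Management Process"
```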

 | Signal | Category |
 |--------|----------|
@@ -216,8 +246,27 @@ Assign None/Other ONLY when the paragraph contains no substantive cybersecurity
 **Exception — SPACs and no-operations companies:** A paragraph that explicitly states the company has no cybersecurity program, no operations, or no formal processes is None/Other even if it perfunctorily mentions board oversight or risk acknowledgment. The absence of a program is not substantive disclosure.

-### Rule 6: Materiality Disclaimers → Strategy Integration
-Any paragraph that explicitly assesses whether cybersecurity risks or incidents have "materially affected" (or are "reasonably likely to materially affect") the company's business strategy, results of operations, or financial condition is **Strategy Integration** — even when the assessment is boilerplate. The materiality assessment is the substantive content. A cross-reference to Risk Factors appended to a materiality assessment does not change the classification to None/Other. Only a *pure* cross-reference with no materiality conclusion is None/Other.
+### Rule 6: Materiality Language → Strategy Integration
+Any paragraph that explicitly connects cybersecurity to business materiality is **Strategy Integration** — regardless of tense, mood, or how generic the language is. This includes:
+
+- **Backward-looking assessments:** "have not materially affected our business strategy, results of operations, or financial condition"
+- **Forward-looking assessments with SEC qualifier:** "are reasonably likely to materially affect," "if realized, are reasonably likely to materially affect"
+- **Negative assertions with materiality framing:** "we have not experienced any material cybersecurity incidents"
+
+**The test:** Is the company STATING A CONCLUSION about materiality?
+
+- "Risks have not materially affected our business strategy" → YES, conclusion → SI
+- "Risks are reasonably likely to materially affect us" → YES, forward-looking conclusion → SI
+- "Risks could have a material adverse effect on our business" → NO, speculation → not SI (classify by primary content)
+- "Managing material risks associated with cybersecurity" → NO, adjective → not SI
+
+The key phrase is "reasonably likely" — that's the SEC's Item 106(b)(2) threshold for forward-looking materiality. Bare "could" is speculation, not an assessment.
+
+**Why this is SI and not N/O:** The company is fulfilling its SEC Item 106(b)(2) obligation to assess whether cyber risks affect business strategy. The fact that the language is generic makes it Specificity 1, not None/Other. Category captures WHAT the paragraph discloses (a materiality assessment); specificity captures HOW specific that disclosure is (generic boilerplate = Spec 1).
+
+**What remains N/O:** A cross-reference is N/O even if it contains materiality language — "For a description of the risks from cybersecurity threats that may materially affect the Company, see Item 1A" is N/O because the paragraph's purpose is pointing the reader elsewhere, not making an assessment. The word "materially" here describes what Item 1A discusses, not the company's own conclusion. Also N/O: generic IT-dependence language ("our IT systems are important to operations") with no materiality claim, and forward-looking boilerplate about risks generally without invoking materiality ("we face various risks").
+
+**The distinction:** "Risks that may materially affect us — see Item 1A" = N/O (cross-reference). "Risks have not materially affected us. See Item 1A" = SI (the first sentence IS an assessment). The test is whether the company is MAKING a materiality conclusion vs DESCRIBING what another section covers.

 ---

@@ -271,7 +320,26 @@ No materiality assessment. Pure cross-reference. → **None/Other, Specificity 1
 Despite touching RMP (no program), Board Governance (board is responsible), and Strategy Integration (no incidents), the paragraph contains no substantive disclosure. The company explicitly has no program, and the board mention is perfunctory ("generally responsible... if any"). The absence of a program is not a program description. → **None/Other, Specificity 1.**

-### Case 9: Generic regulatory compliance language
+### Case 9: Materiality language — assessment vs. speculation (v3.5 revision)
+
+> *"We face risks from cybersecurity threats that, if realized and material, are reasonably likely to materially affect us, including our operations, business strategy, results of operations, or financial condition."*
+
+The phrase "reasonably likely to materially affect" is the SEC's Item 106(b)(2) qualifier — this is a forward-looking materiality **assessment**, not speculation. → **Strategy Integration, Specificity 1.**
+
+> *"We have not identified any risks from cybersecurity threats that have materially affected or are reasonably likely to materially affect the Company."*
+
+Backward-looking negative assertion + SEC-qualified forward-looking assessment. → **Strategy Integration, Specificity 1.**
+
+> *"Information systems can be vulnerable to a range of cybersecurity threats that could potentially have a material impact on our business strategy, results of operations and financial condition."*
+
+Despite mentioning "material impact" and "business strategy," the operative verb is "could" — this is boilerplate **speculation** present in virtually every 10-K risk factor section. The company is not stating a conclusion about whether cybersecurity HAS or IS REASONABLY LIKELY TO affect them; it is describing a hypothetical. → **None/Other, Specificity 1.** (Per Rule 6: "could have a material adverse effect" = speculation, not assessment.)
+
+> *"We face various risks related to our IT systems."*
+
+No materiality language, no connection to business strategy/financial condition. This is generic IT-dependence language. → **None/Other, Specificity 1.**
+
+**The distinction:** "reasonably likely to materially affect" (SEC qualifier, forward-looking assessment) ≠ "could potentially have a material impact" (speculation). The former uses the SEC's required assessment language; the latter uses conditional language that every company uses regardless of actual risk.
+
+### Case 10: Generic regulatory compliance language
 > *"Regulatory Compliance: The Company is subject to various regulatory requirements related to cybersecurity, data protection, and privacy. Non-compliance with these regulations could result in financial penalties, legal liabilities, and reputational damage."*

 This acknowledges that regulations exist and non-compliance would be bad — a truism for every public company. It does not describe any process, program, or framework the company uses to comply. It does not make a materiality assessment. It names no specific regulation. → **None/Other, Specificity 1.**
@@ -605,6 +673,7 @@ Track prompt changes so we can attribute label quality to specific prompt versio
 | v2.6 | 2026-03-28 | 500 | Changed category defs to TEST: format. REGRESSED (Both 67.8%). |
 | v2.7 | 2026-03-28 | 500 | Added COMMON MISTAKES section. 100% consensus but Both 67.6%. |
 | v3.0 | 2026-03-29 | — | **Codebook overhaul.** Three rulings: (A) materiality disclaimers → Strategy Integration, (B) SPACs/no-ops → None/Other, (C) person-vs-function test for Mgmt Role vs RMP. Added full IS/NOT lists and QV-eligible list to codebook. Added Rule 2b, Rule 6, 4 new borderline cases. Prompt update pending. |
+| v3.5 | 2026-04-02 | 26 | **Post-gold-analysis rulings, 6 iteration rounds on 26 regression paragraphs ($1.02).** Driven by 13-signal cross-analysis + targeted prompt iteration. (A) Rule 6 refined: materiality ASSESSMENTS → SI (backward-looking conclusions + "reasonably likely" forward-looking). Generic "could have a material adverse effect" is NOT an assessment — it stays N/O/RMP. Cross-references with materiality language also stay N/O. (B) Rule 2 expanded: purpose test for BG — governance structure descriptions are BG, but a one-sentence committee mention doesn't flip the category. (C) Rule 2b expanded: three-step MR↔RMP decision chain; Step 1 only decisive for RMP (process is subject), never short-circuits to MR. (D) N/O vs RMP clarified: actual measures implemented = RMP even in risk-factor framing. Result: +4pp on 26 hardest paragraphs vs v3.0 (18→22/26). |

 When the prompt changes (after pilot testing, rubric revision, etc.), bump the version and log what changed. Every annotation record carries `promptVersion` so we can filter/compare.

@@ -1106,6 +1106,137 @@ Key risk: the stratified holdout over-samples hard cases, depressing F1 vs a ran
 ---

+## Phase 15: Codebook v3.5 — The Prompt Drift Discovery
+
+### The Problem
+
+Cross-analysis of human vs GenAI labels on the holdout revealed a systematic, directional disagreement on three axes:
+
+1. **SI↔N/O (23:0 asymmetry):** When humans and GenAI disagreed on this axis, humans ALWAYS called it SI and GenAI called it N/O. Never the reverse. Root cause: the labelapp trained humans that any language connecting cybersecurity to business materiality — even forward-looking ("could materially affect") — is SI at Specificity 1. Stage 1 models (v2.5 prompt) lacked this rule entirely. Even v3.0 benchmark models, which had the backward-looking materiality rule, were conservative about forward-looking variants.
+
+2. **MR↔RMP (253 paragraphs, 38:13 asymmetry):** GenAI systematically calls MR paragraphs RMP. The v3.0 "person-vs-function test" helps but leaves genuinely mixed paragraphs (both person and process as grammatical subjects) unresolved. These near-even splits need a deterministic tiebreaker chain.
+
+3. **BG↔MR (149 paragraphs, 33:6 asymmetry):** GenAI systematically under-calls BG. The problem is governance chain paragraphs that describe the board receiving reports from management — is this about the board's oversight function or the officer's reporting duty?
+
+### The Audit
+
+A Stage 1 audit found ~1,076 paragraphs (649 unanimous + 383 majority N/O) with materiality language that should be SI under the broadened rule. 1.3% of the corpus overall — but potentially concentrated on exactly the boundary cases the holdout over-samples. On the holdout, mimo-v2-flash was actually the most accurate Stage 1 model on this axis, dissenting toward SI 263 times when the other two said N/O.
+
+The MR↔RMP and BG↔MR axes are cleaner in Stage 1 unanimity — only 0.2% of unanimous BG labels are problematic, and the MR/RMP tiebreaker mainly affects disputed labels (already going to Stage 2). The v2.5→v3.5 gap is primarily an SI↔N/O problem.
+
+### Initial v3.5 Rulings (Round 1)
+
+Three rulings, all driven by the 13-signal cross-analysis:
+
+**Rule 6 broadened (SI↔N/O):** ALL materiality language → SI, not just backward-looking disclaimers. Forward-looking ("could materially affect"), conditional ("reasonably likely to"), and negative assertions ("have not experienced material incidents") are all Strategy Integration at Specificity 1.
+
+**Rule 2 expanded (BG↔MR):** Added the board-line test with governance hierarchy layers and a dominant-subject test for cross-layer paragraphs.
+
+**Rule 2b expanded (MR↔RMP):** Three-step decision chain: subject test → person-removal test → qualifications tiebreaker.
+
+These rulings were tested by re-running all 7 benchmark models (6 OpenRouter + Opus) on 359 confusion-axis holdout paragraphs with the v3.5 prompt ($18, stored separately from v3.0 data).
+
+### The Prompt Drift Lesson
+
+Running Stage 1 (150K annotations) before human labeling created a subtle but significant problem: the codebook evolved through v2.5 → v3.0 → v3.5, but the training data is frozen at v2.5. Each codebook revision was driven by empirical analysis of disagreement patterns — which required the Stage 1 data AND human labels to exist first. The dependency is circular: you can't know what rules are needed until you see where annotators disagree, but you can't undo the labels already collected.
+
+### Iteration: 6 Rounds on 26 Regression Paragraphs ($1.02)
+
+The initial v3.5 re-run revealed that the rulings over-corrected. We identified 26 "regression" paragraphs — cases where v3.0 matched human majority but v3.5 did not — and iterated the prompt using GPT-5.4 on these 26 paragraphs ($0.17/round) to diagnose and fix each over-correction.
+
+**Round 1 (v3.5a) — 5/26.** Catastrophic. All three rulings over-fired simultaneously. SI was called on every paragraph with the word "material." BG was called whenever a committee was named. MR was called whenever a person was a grammatical subject. The rulings were correct in intent but models interpreted them too aggressively.
+
+**Round 2 (v3.5b) — 13/25.** Three fixes: (A) Replaced the BG "dominant-subject test" with a "purpose test" — if the paragraph describes oversight structure, it's BG; mere committee mentions don't flip the category. (B) Made MR↔RMP Step 1 non-decisive — a person being the grammatical subject is a signal, not a conclusion; always proceed to Step 2 (person-removal test). (C) Added cross-reference exception for SI. Improvement: +8.
+
+**Round 3 (v3.5c) — 20/26.** The cross-reference exception eliminated the 5 most egregious SI over-predictions — paragraphs like "For a description of risks that may materially affect us, see Item 1A" that v3.5a called SI but are obviously N/O. These were pure pointers with materiality language embedded in the cross-reference text, not materiality assessments. +7.
+
+**Round 4 (v3.5d) — 22/26.** The critical insight: not all materiality language is a materiality *assessment*. Reading the 6 remaining errors revealed a spectrum:
+
+- "Cybersecurity risks have not materially affected our business strategy" → **Assessment** (conclusion about actual impact) → SI ✓
+- "Risks are reasonably likely to materially affect us" → **Assessment** (SEC Item 106(b)(2) standard) → SI ✓
+- "Cybersecurity threats could have a material adverse effect on our business" → **Speculation** (generic risk warning in every 10-K) → NOT SI ✗
+- "Managing material risks associated with cybersecurity" → **Adjective** ("material" means "significant") → NOT SI ✗
+- "...which could result in material adverse effects" at the end of an RMP paragraph → **Consequence clause** (doesn't override primary purpose) → NOT SI ✗
+
+The tightened rule: only backward-looking conclusions and SEC-qualified forward-looking ("reasonably likely to") trigger SI. Generic "could have a material adverse effect" does not. This distinction — assessment vs. speculation — resolved 3 errors without breaking any correct calls. +2.
+
+We also verified each error against human annotator votes. All 6 remaining errors had the human majority correct (checked by reading the actual paragraph text and codebook rules). Interestingly, on 3 of the 6, the project lead's own label was the dissenting human vote — he had been the one calling these SI, validating that the over-calling pattern was a real and consistent interpretation difference, not random noise.
+
+**Round 5 (v3.5e) — 19/25.** Regression. We attempted to add an explicit BG↔RMP example ("CISO assists the ERMC in monitoring... → RMP") to the disambiguation guidance. This caused 3 previously-correct paragraphs to flip to BG — the example made models hyper-aware of committee mentions and triggered BG more broadly. Lesson: **targeted examples can backfire when the pattern is too specific.** The model generalizes from the example in unpredictable ways.
+
+**Round 6 (v3.5f) — 21/26.** Reverted the Round 5 BG↔RMP example. Kept the N/O↔RMP "actual measures" clarification from Round 5 (if a paragraph describes specific security measures the company implemented, it's RMP even in risk-factor framing). This stabilized at 21-22/26, with the 2-paragraph swing attributable to LLM non-determinism at temperature=0.
+
+### The 4 Irreducible Errors
+
+The remaining errors after Rounds 4 and 6 fall into two patterns:
+
+**BG over-call on process paragraphs (2 errors):** A paragraph describing monitoring methods (threat intelligence, security tools, detection capabilities) where a management committee (ERMC) is woven throughout as the entity being assisted. Content is clearly RMP but the committee mention triggers BG. These are genuinely dual-coded — the monitoring IS part of the committee's function. Human majority says RMP (2-1 in both cases).
+
+**N/O over-call on borderline RMP paragraphs (2 errors):** Paragraphs that describe risk management activities ("assessing, identifying, and managing material risks") but are framed as risk-factor discussions with threat enumeration. The SI tightening correctly stopped calling them SI, but they overcorrected to N/O instead of RMP. The N/O↔RMP boundary depends on whether the paragraph describes what the company DOES (→ RMP) vs. what risks it faces (→ N/O). These paragraphs do both.
+
+All 4 have human 2-1 splits — reasonable annotators disagree on these. Further prompt iteration risks over-fitting to these 4 specific paragraphs at the cost of breaking the other 355 correctly-classified ones.
+
+### The SI Rule: Assessment vs. Speculation
+
+The most important finding from the iteration is the distinction between materiality *assessments* and materiality *language*:
+
+| Pattern | Classification | Reasoning |
+|---------|---------------|-----------|
+| "have not materially affected our business strategy" | **SI** | Backward-looking conclusion — the company is reporting on actual impact |
+| "reasonably likely to materially affect" | **SI** | Forward-looking with SEC qualifier — Item 106(b)(2) disclosure |
+| "have not experienced material cybersecurity incidents" | **SI** | Negative assertion — materiality conclusion about past events |
+| "could have a material adverse effect" | **NOT SI** | Generic speculation — appears in every 10-K, not an assessment |
+| "managing material risks" | **NOT SI** | Adjective — "material" means "significant," not a materiality assessment |
+| "For risks that may materially affect us, see Item 1A" | **NOT SI** | Cross-reference — pointing elsewhere, not making a conclusion |
+| "...which could result in material losses" (at end of RMP paragraph) | **NOT SI** | Consequence clause — doesn't override the paragraph's primary purpose |
+
+This distinction reduced the Stage 1 correction set from ~1,014 to 308 paragraphs. The original broad flag ("any paragraph with the word 'material'") caught ~700 paragraphs that were correctly labeled N/O by Stage 1 — they contained generic "could have a material adverse effect" boilerplate that is NOT a materiality assessment. Only 180 paragraphs contain actual backward-looking or SEC-qualified assessments that v2.5 miscoded.
+
+### Final v3.5 Gold Re-Run
+
+After locking the prompt at v3.5f, all 7 models (Opus + 6 benchmark) were re-run on the 359 confusion-axis holdout paragraphs with the final prompt (~$18). v3.0 data preserved in original paths (`bench-holdout/`, `golden/`). v3.5f results stored separately (`bench-holdout-v35/`, `golden-v35/`). The v3.0→v3.5 comparison — per model, per axis — is itself a publishable finding about how prompt engineering systematically shifts classification boundaries in frontier LLMs.
+
+### The SI↔N/O Paradox — Resolved
+
+The v3.5f re-run showed a troubling result: SI↔N/O accuracy *dropped* 6pp vs v3.0 (60% vs 66%), with the H=SI/M=N/O asymmetry worsening from 20 to 25 cases. The initial hypothesis was that models became globally conservative when told to distinguish assessment from speculation.
+
+A paragraph-by-paragraph investigation of all 27 SI↔N/O errors revealed the opposite: **the models are correct, and the humans are systematically wrong.**
+
+Of the 25 H=SI / M=N/O cases:
+
+- ~20 are pure "could have a material adverse effect" speculation, cross-references to Item 1A, or generic threat enumeration — none containing actual materiality assessments. All 6 models unanimously call N/O.
+- ~3 are genuinely ambiguous (SPACs with assessment language, past disruption without explicit materiality language).
+- ~2 are edge cases (negative assertions embedded at end of BG paragraphs).
+
+Of the 2 H=N/O / M=SI cases:
+
+- Both contain clear negative assertions ("not aware of having experienced any prior material data breaches," "did not experience any cybersecurity incident during 2024") — textbook SI. All 6 models unanimously call SI.
+
+**Root cause of human error:** Annotators systematically treat ANY mention of "material" + "business strategy" + "financial condition" as SI — even when wrapped in pure speculation ("could," "if," "may"). The codebook's assessment-vs-speculation distinction is correct; humans weren't consistently applying it.
+
+**Codebook Case 9 contradiction fixed:** The investigation also discovered that Case 9 ("could potentially have a material impact" → SI) directly contradicted Rule 6 ("could" = speculation, not assessment). Case 9 has been corrected: the "could" example is now N/O, with an explanation of why "reasonably likely to materially affect" (SEC qualifier) ≠ "could potentially" (speculation).
+
+Two minor prompt clarifications were added (consequence clause refinement for negative assertions, investment/resource SI signal) and tested on 83 SI↔N/O paragraphs ($0.55). Net effect: within stochastic noise — confirming the prompt was already correct.
+
+### Implications for Training
+
+- **Gold adjudication on SI↔N/O:** Trust model consensus over human majority. When 6/6 models unanimously agree and the paragraph contains only speculative language → use model label. Apply SI deterministically via regex for backward-looking assessments and SEC qualifiers (see the sketch after this list). Expected impact: SI↔N/O accuracy rises from ~60% to ~95%+ against corrected gold labels.
+- **Stage 2 judge** must use v3.5 prompt. This is where the codebook evolution actually matters for training data quality.
+- **Stage 1 corrections re-flagged:** Tightened criteria reduced flagged paragraphs from 1,014 to 308 (180 materiality assessments + 128 SPACs). The 706 excluded paragraphs contained generic "could" boilerplate that was correctly labeled N/O by v2.5.
+- **Gold adjudication on other axes:** On MR↔RMP and BG↔MR, v3.5 improves alignment with humans by ~4pp on hard cases but the improvement is more modest on easy cases.
+- **MiniMax exclusion:** MiniMax M2.7 is a statistical outlier (z=−2.07 in inter-model agreement) and the most volatile model across prompt versions (40.7% category change rate). Data retained per assignment requirements but excluded from gold scoring majority.
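A hedged sketch of such a regex gate — the patterns are illustrative stand-ins, not the project's production rules:

```python
import re

# Catch backward-looking conclusions, the SEC "reasonably likely" qualifier,
# and negative assertions; bare "could ..." speculation never matches.
ASSESSMENT = re.compile(
    r"ha(?:ve|s)\s+not\s+materially\s+affected"
    r"|reasonably\s+likely\s+to\s+materially\s+affect"
    r"|not\s+experienced\s+any\s+material",
    re.IGNORECASE,
)

def is_materiality_assessment(text: str) -> bool:
    return ASSESSMENT.search(text) is not None

print(is_materiality_assessment("Risks have not materially affected our business strategy."))  # True
print(is_materiality_assessment("Threats could have a material adverse effect on us."))        # False
```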
+
+### Cost Ledger Update
+
+| Phase | Cost | Time |
+|-------|------|------|
+| v3.5 initial re-run (7 × 359) | ~$18 | ~10 min |
+| v3.5 iteration (6 × 26 × GPT-5.4) | $1.02 | ~15 min |
+| v3.5f final re-run (7 × 359) | ~$18 | ~10 min |
+| SI↔N/O investigation (37 + 83 × GPT-5.4) | $0.55 | ~1 min |
+| **v3.5 subtotal** | **~$37.57** | |
+| **Running total API** | **~$202.57** | |
+
+---
+

 ## Lessons Learned

 ### On Prompt Engineering
@@ -1113,6 +1244,10 @@ Key risk: the stratified holdout over-samples hard cases, depressing F1 vs a ran
 - Pilots must be large enough (500+). 40-sample pilots were misleadingly optimistic.
 - More rules ≠ better. After the core structure is right, additional rules cause regression.
 - The `specific_facts` chain-of-thought schema (forcing models to enumerate evidence before deciding) was the single most impactful structural change.
+- **Rules over-correct before they converge.** The v3.5 iteration showed a consistent pattern: a new rule fixes the target problem but creates 2-3 new errors on adjacent cases. Each fix required a counter-fix. "Materiality language → SI" fixed the 23:0 asymmetry but created cross-reference false positives and speculation false positives that each required their own exception. Six rounds of test-fix-test were needed to reach equilibrium.
+- **Targeted examples backfire.** Adding a specific example to a disambiguation rule ("CISO assists the ERMC in monitoring → RMP") caused regression elsewhere — models generalize from examples in unpredictable ways. General principles ("content matters more than names") are safer than specific examples in disambiguation guidance.
+- **Assessment vs. language is a fundamental distinction.** The word "material" appears in thousands of SEC paragraphs but carries different force in different grammatical contexts. "Have not materially affected" (conclusion) vs. "could have a material adverse effect" (speculation) vs. "material risks" (adjective) are three different speech acts. Models don't naturally distinguish these without explicit guidance.
+- **Check the humans — they can be systematically wrong.** On SI↔N/O, human annotators systematically over-called SI on any paragraph mentioning "material" + "business strategy," even when the language was pure speculation. The 25:2 asymmetry initially looked like model failure but was actually human failure to apply the assessment-vs-speculation distinction. When all 6 frontier models unanimously disagree with a 2/3 human majority, investigate before assuming the humans are right. The models' consistency (unanimous agreement across architectures and providers) is itself strong evidence.

 ### On Model Selection
 - Reasoning tokens are the strongest predictor of accuracy, not price or model size.

103 docs/STATUS.md

@@ -87,26 +87,95 @@ Plus Stage 1 panel already on file = **10 models, 8 suppliers**.
 **Key finding:** Opus earns the #1 spot through leave-one-out — it's not special because we designated it as gold; it genuinely disagrees with the crowd least (7.4% odd-one-out rate).

+### Codebook v3.5 & Prompt Iteration — Complete
+
+- [x] Cross-analysis: GenAI vs human systematic errors identified (SI↔N/O 23:0, MR↔RMP 38:13, BG↔MR 33:6)
+- [x] v3.5 rulings: SI materiality assessment test, BG purpose test, MR↔RMP 3-step chain
+- [x] v3.5 gold re-run: 7 models × 359 confusion-axis holdout paragraphs ($18)
+- [x] 6 rounds prompt iteration on 26 regression paragraphs ($1.02): v3.0=18/26 → v3.5=22/26
+- [x] SI rule tightened: "could have material adverse effect" = NOT SI (speculation, not assessment)
+- [x] Cross-reference exception: materiality language in cross-refs = N/O
+- [x] BG threshold: one-sentence committee mention doesn't flip to BG
+- [x] Stage 1 corrections flagged: 308 paragraphs (180 materiality + 128 SPACs)
+- [x] Prompt locked at v3.5, codebook updated, version history documented
+- [x] SI↔N/O paradox investigated and resolved: models correct, humans systematically over-call SI on speculation
+- [x] Codebook Case 9 contradiction with Rule 6 fixed ("could" example → N/O)
+- [x] Gold adjudication strategy for SI↔N/O defined: trust model consensus, apply SI via regex for assessments
+
+| Data asset | Location |
+|-----------|----------|
+| v3.5 bench annotations | `data/annotations/bench-holdout-v35/*.jsonl` (7 models × 359) |
+| v3.5 Opus annotations | `data/annotations/golden-v35/opus.jsonl` (359) |
+| Stage 1 correction flags | `data/annotations/stage1-corrections.jsonl` (308) |
+| Holdout re-run IDs | `data/gold/holdout-rerun-v35.jsonl` (359) |
+### Gold Set Adjudication v1 — Complete
+
+- [x] Aaryan redo integrated: 50.3% of labels changed, α 0.801→0.825 (cat), 0.546→0.661 (spec)
+- [x] Old Aaryan labels preserved in `data/gold/human-labels-aaryan-v1.jsonl`
+- [x] Cross-axis systematic error analysis: models correct ~85% on MR↔RMP, MR↔BG, RMP↔BG, TP↔RMP, SI↔N/O
+- [x] 5-tier adjudication: T1 super-consensus (911), T2 cross-validated (108), T3 rule-based (30), T4 model-unanimous (59), T5 plurality (92)
+- [x] 30 rule-based overrides (27 SI↔N/O + 3 T5 codebook resolutions)
+
+### Gold Set Adjudication v2 — Complete (T5 deep analysis)
+
+- [x] Full model disagreement analysis: 6-model vote vectors on all 1,200 paragraphs
+- [x] Gemini identified as systematic MR outlier (z≈+2.3, 302 MR vs ~192 avg, drives 45% MR↔RMP confusion)
+- [x] Gemini exclusion experiment: NULL RESULT at T5 (human MR bias makes it redundant; tiering already neutralizes at T4)
+- [x] v3.5 prompt impact: unanimity 25%→60%, but created new BG↔RMP hotspot (+171%)
+- [x] **Text-based BG vote removal**: automated, verifiable — if "board" absent from text, BG model votes removed. 13 labels corrected, source accuracy UP for 10/12 sources
+- [x] **10 new codebook tiebreaker overrides**: ID↔SI (negative assertions), SPAC rule, board-removal test, committee-level test
+- [x] **Specificity hybrid**: human unanimous → human label, human split → model majority. 195 specificity labels updated
+- [x] All changes validated experimentally (one variable at a time, acceptance criteria checked)
+- [x] T5: 92 → 85, gold≠human: 151 → 144
+
+| Source | Accuracy vs Gold (v1) | Accuracy vs Gold (v2) | Δ |
+|--------|----------------------|----------------------|---|
+| Xander | 91.0% | 91.5% | +0.5% |
+| Opus | 88.6% | 89.1% | +0.5% |
+| GPT-5.4 | 87.4% | 88.5% | +1.1% |
+| GLM-5 | 86.0% | 86.5% | +0.5% |
+| Elisabeth | 85.8% | 86.5% | +0.7% |
+| MIMO | 85.8% | 86.2% | +0.5% |
+| Meghan | 85.3% | 86.0% | +0.7% |
+| Kimi | 84.5% | 84.9% | +0.4% |
+| Gemini | 84.0% | 84.6% | +0.6% |
+| Joey | 80.7% | 80.2% | -0.5% |
+| Aaryan | 75.2% | 74.2% | -1.0% |
+| Anuj | 69.3% | 69.7% | +0.3% |
+
+| Data asset | Location |
+|-----------|----------|
+| Adjudicated gold labels | `data/gold/gold-adjudicated.jsonl` (1,200) |
+| Old Aaryan labels | `data/gold/human-labels-aaryan-v1.jsonl` (600) |
+| Adjudication charts | `data/gold/charts/gold-*.png` (4 charts) |
+| Adjudication script | `scripts/adjudicate-gold.py` (v2) |
+| Experiment harness | `scripts/adjudicate-gold-experiment.py` |
+| T5 analysis docs | `docs/T5-ANALYSIS.md` |

 ## What's Next (in dependency order)

-### 1. Gold set adjudication
+### 1. (Optional) Manual review of remaining 85 T5-plurality paragraphs
-- Tier 1+2 (972 paragraphs, 81%) → auto-resolved from 13-signal consensus
+- 85 paragraphs resolved by signal plurality — lowest confidence tier
-- Tier 3+4 (228 paragraphs, 19%) → expert review with Opus reasoning traces
+- 71% on the BG↔MR↔RMP triangle (irreducible ambiguity)
-- For Aaryan's 600 paragraphs: use other-2-annotator majority when they agree and he disagrees
+- 62 have weak plurality (4-5/9) — diminishing returns
+- Could improve gold set by ~1-3% if reviewed, but diminishing returns

-### 2. Training data assembly
+### 2. Stage 2 re-eval on training data
+- Pilot gpt-5.4-mini vs gpt-5.4 on holdout validation sample
+- Run on 308 flagged Stage 1 corrections (180 materiality + 128 SPACs)
+- Also run standard Stage 2 judge on existing disagreements with v3.5 prompt
+
+### 3. Training data assembly
 - Unanimous Stage 1 labels (35,204 paragraphs) → full weight
 - Calibrated majority labels (~9-12K) → full weight
 - Judge high-confidence labels (~2-3K) → full weight
 - Quality tier weights: clean/headed/minor=1.0, degraded=0.5

-### 3. Fine-tuning + ablations
+### 4. Fine-tuning + ablations
 - 8+ experiments: {base, +DAPT, +DAPT+TAPT} × {±SCL} × {±class weighting}
 - Dual-head classifier: shared ModernBERT backbone + category head (7-class) + specificity head (4-class ordinal) — see the sketch after this list
 - Focal loss / class-weighted CE for category imbalance
 - Ordinal regression (CORAL) for specificity
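A minimal sketch of the planned dual-head architecture, assuming a Hugging Face `AutoModel` encoder and [CLS]-token pooling (both assumptions; the loss wiring, including the CORAL ordinal loss, is omitted):

```python
import torch
import torch.nn as nn
from transformers import AutoModel

class DualHeadClassifier(nn.Module):
    """Shared encoder, category head (7-class), specificity head (4-class)."""

    def __init__(self, backbone: str = "answerdotai/ModernBERT-large"):
        super().__init__()
        self.encoder = AutoModel.from_pretrained(backbone)
        hidden = self.encoder.config.hidden_size
        self.category_head = nn.Linear(hidden, 7)      # 7 content categories
        self.specificity_head = nn.Linear(hidden, 4)   # 4 ordinal levels

    def forward(self, input_ids: torch.Tensor, attention_mask: torch.Tensor):
        out = self.encoder(input_ids=input_ids, attention_mask=attention_mask)
        cls = out.last_hidden_state[:, 0]              # [CLS]-token pooling
        return self.category_head(cls), self.specificity_head(cls)
```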

-### 4. Evaluation + paper
+### 5. Evaluation + paper
 - Macro F1 + per-class F1 on holdout (must exceed 0.80 for category)
 - Full GenAI benchmark table (10 models × 1,200 holdout)
 - Cost/time/reproducibility comparison
@@ -116,13 +185,15 @@ Plus Stage 1 panel already on file = **10 models, 8 suppliers**.
 ## Parallel Tracks

 ```
-Track A (GPU): DAPT ✓ → TAPT ✓ ──────────────→ Fine-tuning → Eval
+Track A (GPU): DAPT ✓ → TAPT ✓ ─────────────────────────────→ Fine-tuning → Eval
 ↑
 Track B (API): Opus re-run ✓─┐ │
-├→ Gold adjudication ─────┤
+├→ v3.5 re-run ✓ → SI paradox ✓ ───┐ │
-Track C (API): 6-model bench ✓┘ │
+Track C (API): 6-model bench ✓┘ │ │
-│
+Gold adjud. ✓ ┤ │
-Track D (Human): Labeling ✓ → IRR ✓ → 13-signal ✓ ─────┘
+Track E (API): v3.5 prompt ✓ → S1 flags ✓ → Stage 2 re-eval ───┘───┘
+
+Track D (Human): Labeling ✓ → IRR ✓ → 13-signal ✓ → Aaryan redo ✓
 ```

 ## Key File Locations
@@ -142,5 +213,9 @@ Track D (Human): Labeling ✓ → IRR ✓ → 13-signal ✓ ─────┘
 | DAPT corpus | `data/dapt-corpus/shard-*.jsonl` (14,756 docs) |
 | DAPT checkpoint | `checkpoints/dapt/modernbert-large/final/` |
 | TAPT checkpoint | `checkpoints/tapt/modernbert-large/final/` |
+| v3.5 bench annotations | `data/annotations/bench-holdout-v35/*.jsonl` (7 × 359) |
+| v3.5 Opus golden | `data/annotations/golden-v35/opus.jsonl` (359) |
+| Stage 1 correction flags | `data/annotations/stage1-corrections.jsonl` (1,014) |
+| Holdout re-run IDs | `data/gold/holdout-rerun-v35.jsonl` (359) |
 | Analysis script | `scripts/analyze-gold.py` (30-chart, 13-signal analysis) |
 | Data dump script | `labelapp/scripts/dump-all.ts` |

243 docs/T5-ANALYSIS.md Normal file

@@ -0,0 +1,243 @@
+# T5 Plurality Analysis & Model Disagreement Deep-Dive
+
+**Date:** 2026-04-02
+**Author:** Claude (analysis), Joey (direction)
+
+## Methodology
+
+### Data Sources
+
+| Source | File | Records |
+|--------|------|---------|
+| Gold adjudication | `data/gold/gold-adjudicated.jsonl` | 1,200 (92 T5) |
+| Human labels | `data/gold/human-labels-raw.jsonl` | 3,600 (3 per paragraph) |
+| Holdout paragraphs | `data/gold/paragraphs-holdout.jsonl` | 1,200 |
+| Opus v3.0 | `data/annotations/golden/opus.jsonl` | 1,200 |
+| GPT-5.4 v3.0 | `data/annotations/bench-holdout/gpt-5.4.jsonl` | 1,200 |
+| Gemini v3.0 | `data/annotations/bench-holdout/gemini-3.1-pro-preview.jsonl` | 1,200 |
+| GLM-5 v3.0 | `data/annotations/bench-holdout/glm-5:exacto.jsonl` | 1,200 |
+| Kimi v3.0 | `data/annotations/bench-holdout/kimi-k2.5.jsonl` | 1,200 |
+| MIMO v3.0 | `data/annotations/bench-holdout/mimo-v2-pro:exacto.jsonl` | 1,200 |
+| v3.5 re-runs | `data/annotations/{golden,bench-holdout}-v35/*.jsonl` | 7 × 359 |
+
+### Analysis 1: T5 Case Decomposition
+
+All 92 T5-plurality cases extracted and categorized by:
+
+- **Confusion axis**: which categories are competing (e.g., MR↔RMP, BG↔MR)
+- **Vote distribution**: human votes (3 per paragraph) and model votes (6 per paragraph)
+- **Plurality strength**: how many of 9 signals support the winning label
+- **Human-model alignment**: whether human and model majorities agree (spoiler: 0/92)
+
+### Analysis 2: Model Disagreement Patterns (Full 1,200)
+
+For all 1,200 holdout paragraphs (see the sketch after this list):
+
+1. Built 6-model vote vectors
+2. Categorized by agreement level (6/6, 5/1, 4/2, 3/3)
+3. For splits, identified which model(s) dissented
+4. Computed per-model dissent rates (how often each model is the odd one out)
+5. Mapped dissent to confusion axes
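A toy illustration of steps 1-4, with made-up labels (the real analysis runs over the six benchmark models on all 1,200 paragraphs):

```python
from collections import Counter

votes = {
    "p1": {"opus": "RMP", "gpt": "RMP", "gemini": "MR",
           "glm": "RMP", "kimi": "RMP", "mimo": "RMP"},   # a 5/1 split
    "p2": {"opus": "BG", "gpt": "BG", "gemini": "BG",
           "glm": "BG", "kimi": "BG", "mimo": "BG"},      # 6/6 unanimous
}

odd_one_out = Counter()
for para, v in votes.items():
    tally = Counter(v.values())                 # the 6-model vote vector
    label, n = tally.most_common(1)[0]
    if n == len(v) - 1:                         # 5/1 split: one dissenter
        odd_one_out.update(m for m, lab in v.items() if lab != label)

print(odd_one_out)  # Counter({'gemini': 1}) -> per-model dissent tallies
```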
+
+### Analysis 3: Model Reasoning Examination
+
+For T5 cases, read the `reasoning` field from Opus, GPT-5.4, and Gemini annotations to understand:
+
+- What textual features each model cites when classifying
+- Whether models apply codebook decision tests (person-removal, board-line) or keyword-anchor
+- How v3.0 vs v3.5 reasoning differs for the same paragraphs
+
+### Analysis 4: v3.0 vs v3.5 Prompt Impact
+
+Compared model agreement on the 359 confusion-axis paragraphs between v3.0 and v3.5:
+
+- Agreement distribution shifts
+- Per-axis dissent changes
+- Per-model improvement rates
+- Category distribution shifts (over/under-correction)
+
+---
+
+## Key Findings
+
+### Finding 1: Gemini is a Systematic MR Outlier
+
+| Metric | Gemini | Average of other 5 |
+|--------|--------|---------------------|
+| Dissent rate (5/1 + 4/2 splits) | 35.5% | ~20.6% |
+| MR labels (v3.0, full 1200) | 302 | ~192 |
+| MR↔RMP dissenter-votes | 69 (45% of axis total) | ~17 each |
+| Accuracy vs adjudicated gold | 84.0% | ~86.5% |
+
+**Mechanism**: Gemini's reasoning fields show keyword-anchoring on credentials (CISSP, CISM, years of experience) and named titles. When these appear, Gemini's reasoning literally states "which triggers the Management Role category" regardless of surrounding content. It does not consistently apply the person-removal test.
+
+**Comparison to MiniMax exclusion**: MiniMax was excluded at z=-2.07 (statistical outlier on overall accuracy). Gemini's MR frequency is z≈+2.3 vs other models. Its overall accuracy (84.0%) is the lowest of the top 6. On the MR↔RMP axis specifically, gold labels resolve to RMP 14/20 times when MR↔RMP is the dispute — Gemini's MR bias is systematically wrong.
+
+### Finding 2: v3.5 Prompt Created BG↔RMP Over-Correction
+
+| Metric | v3.0 (359 subset) | v3.5 (359 subset) |
+|--------|--------------------|--------------------|
+| 6/6 unanimity | 25% | 60% |
+| MR↔RMP dissent-votes | 146 | 54 (-63%) |
+| N/O↔SI dissent-votes | 39 | 4 (-90%) |
+| **BG↔RMP dissent-votes** | **21** | **57 (+171%)** |
+
+v3.5's "board-line test" caused GPT (and sometimes Opus) to classify paragraphs as BG whenever any reporting-to-board language exists, even when 80%+ of the paragraph describes process activities. MIMO is the primary driver of the new BG↔RMP confusion under v3.5 (20 dissenter-votes).
+
+### Finding 3: All Model Splits Reduce to Subject-vs-Predicate
+
+Every confusion axis is the same underlying question:
+
+| Axis | Subject framing | Predicate framing |
+|------|----------------|-------------------|
+| MR↔RMP | Who does it (CISO, team) | What they do (monitor, detect) |
+| BG↔RMP | Oversight structure (committee) | Activities described (risk assessment) |
+| BG↔MR | Governance body (board committee) | Personnel details (qualifications) |
+| ID↔SI | Event described (breach, attack) | Assessment made (no material impact) |
+
+Models disagree on whether to classify by the grammatical subject or the semantic predicate of the paragraph.
+
+### Finding 4: T5 Cases Are 100% Human-Model Misalignment
+
+92/92 T5 cases have human majority ≠ model majority. This is not coincidental — T5 is literally the tier where the two signal groups disagree and no higher tier resolves it.
+
+- 75% resolved by weak plurality (4-5/9 votes)
+- 71% involve the BG↔MR↔RMP triangle
+- BG↔MR↔RMP gold distribution: BG 25, RMP 28, MR 12
+
+### Finding 5: Model Reasoning Reveals Specific Anchor Points
+
+| Model | Consistent anchors | Axis effect |
+|-------|-------------------|-------------|
+| Gemini | Credentials, titles, committee names | Over-calls MR |
+| GPT-5.4 (v3.5) | Board mentions, oversight language | Over-calls BG |
+| Opus | Process descriptions, decision tests | Most balanced |
+| GLM-5 | Generic risk language | Over-calls N/O on SI boundary |
+| Kimi | Third-party mentions | Over-splits TP from RMP |
+| MIMO | Committee structure | Over-calls BG under v3.5 |
+
+---
+
## Proposed Interventions
|
||||||
|
|
||||||
|
### Intervention 1: Exclude Gemini from MR↔RMP Adjudication
|
||||||
|
|
||||||
|
**Justification**: Same evidence-based logic as MiniMax exclusion. Gemini's MR bias is systematic (z≈+2.3), its mechanism is documented (credential-anchoring), and gold labels confirm it's wrong 70% of the time on this axis.
|
||||||
|
|
||||||
|
**Scope**: Only when the T5 dispute is MR vs RMP and Gemini voted MR. Gemini remains in the panel for all other axes.
|
||||||
|
|
||||||
|
### Intervention 2: Board-Removal Test
|
||||||
|
|
||||||
|
**Rule**: For BG↔RMP disputes, mentally remove the 1-2 sentences mentioning the board. If what remains is a coherent process paragraph → RMP. If the paragraph is primarily *about* board oversight → BG.
|
||||||
|
|
||||||
|
**Rationale**: Dual of the person-removal test. Operationalizes existing BG threshold rule.
|
||||||
|
|
||||||
|
### Intervention 3: Committee-Level Test
|
||||||
|
|
||||||
|
**Rule**: A board committee (committee *of* the Board, board subcommittee) → BG. A management committee (*reports to* board but composed of management) → apply person-removal test.
|
||||||
|
|
||||||
|
### Intervention 4: ID↔SI Tiebreaker
|
||||||
|
|
||||||
|
**Rule**: "Describes what happened" → ID. "Only discusses cost/materiality" → SI. "Both" → whichever dominates by volume.

### Intervention 5: Specificity Hybrid

**Rule**: Human 3/3 unanimous → human label. Human split → model majority.
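
A minimal sketch of the hybrid, assuming specificity labels are integers. This is simplified: the harness below additionally falls back to the human majority when no model specificity votes are available.

```python
from collections import Counter

def hybrid_specificity(human_specs: list[int], model_specs: list[int]) -> int:
    """Human 3/3 unanimous -> human label; human split -> model majority."""
    if len(set(human_specs)) == 1:
        return human_specs[0]
    return Counter(model_specs).most_common(1)[0][0]
```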

---

## Experimental Design

Each intervention is tested independently, one variable at a time. Acceptance criteria:

1. T5 count decreases or stays constant (fewer arbitrary resolutions)
2. Source accuracy: no model or human annotator drops >1% (the intervention isn't distorting)
3. Category distribution: no category shifts >±5% of its baseline count
4. Each change has a documented codebook justification

Experiment harness: `scripts/adjudicate-gold-experiment.py` (run as `uv run scripts/adjudicate-gold-experiment.py [experiment_name|all]`)

---

## Experiment Results

### Exp 1: Exclude Gemini from MR↔RMP axis — NULL RESULT

Gemini over-labels MR (z≈+2.3, 302 labels vs a ~192 average). Hypothesis: removing Gemini's MR vote at T5 plurality would flip MR→RMP for disputed cases.

**Result:** Zero label changes. Gemini's MR bias is redundant with human MR bias at T5: when both humans and Gemini vote MR, removing Gemini doesn't change the plurality because the human votes still carry MR. The tiering system already neutralizes Gemini's outlier behavior at T4 (where all 6 models unanimously override humans).

**Conclusion:** Gemini exclusion is unnecessary. The tiering system is already doing this work.

### Exp 2b: No-board BG vote removal — PASS (strongest intervention)

Automated, verifiable test: if "board" (case-insensitive) is absent from the paragraph text, remove BG model votes before the T5 plurality. Rationale: a paragraph can't be Board Governance if it never mentions the board.
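
Condensed, this is the rule as the harness below implements it (note the implementation drops all BG signals, human and model alike, provided at least one signal remains):

```python
from collections import Counter

def t5_plurality(signals: list[str], paragraph_text: str) -> str:
    """T5 plurality with the no-board BG vote removal applied first."""
    if "board" not in paragraph_text.lower():
        remaining = [s for s in signals if s != "Board Governance"]
        if remaining:  # never reduce to an empty signal set
            signals = remaining
    return Counter(signals).most_common(1)[0][0]
```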

| Metric | Baseline | Exp 2b | Δ |
|--------|----------|--------|---|
| T5 count | 92 | 92 | 0 |
| Gold ≠ human | 151 | 145 | -6 |
| BG labels | 244 | 231 | -13 |
| Xander accuracy | 91.0% | 91.5% | +0.5% |
| GPT-5.4 accuracy | 87.4% | 88.1% | +0.7% |
| GLM-5 accuracy | 86.0% | 86.8% | +0.8% |

13 labels changed (all BG → other). Source accuracy UP for 10/12 sources.
### Exp 2: Manual board-removal + committee-level test — PASS

For 5 paragraphs that mention "board" but where the board reference is incidental:

- 22da6695: BG→RMP (board = 1/5 sentences, CISO/incident response dominates)
- a2ff7e1e: BG→MR (titled "Management's Role," board is notification destination)
- cb518f47: BG→MR (management oversees, board is incident notification only)

| Metric | Baseline | Exp 2 | Δ |
|--------|----------|-------|---|
| T5 count | 92 | 89 | -3 |
| Source accuracy | — | all up or neutral | +0.1-0.2% |
### Exp 4: Codebook tiebreaker overrides — PASS

4 T5 cases resolved by applying codebook rules:

- 0ceeb618: ID→SI (negative assertion with brief incident context)
- cc82eb9f: ID→SI (negative assertion dominates; incident is example)
- 203ccd43: MR→N/O (SPAC rule: "once the Company commences operations")
- f549fd64: ID→RMP (post-incident improvements, no incident described)

| Metric | Baseline | Exp 4 | Δ |
|--------|----------|-------|---|
| T5 count | 92 | 88 | -4 |
| Opus accuracy | 88.6% | 88.8% | +0.2% |
| GPT-5.4 accuracy | 87.4% | 87.8% | +0.3% |
### Exp 5: Specificity hybrid — PASS

Human 3/3 unanimous → human label. Human split → model majority. 195 specificity labels changed. Zero impact on category distribution (as expected).

### Combined: All validated interventions — APPLIED

| Metric | Baseline | Combined | Δ |
|--------|----------|----------|---|
| T5 count | 92 | 85 | **-7** |
| Gold ≠ human | 151 | 144 | **-7** |
| T3 rule-based | 30 | 37 | +7 |
| Xander accuracy | 91.0% | 91.5% | **+0.5%** |
| Opus accuracy | 88.6% | 89.1% | **+0.5%** |
| GPT-5.4 accuracy | 87.4% | 88.5% | **+1.1%** |
| Elisabeth accuracy | 85.8% | 86.5% | +0.7% |
| Meghan accuracy | 85.3% | 86.0% | +0.7% |
| Specificity changes | 0 | 195 | — |

20 category labels changed. 195 specificity labels changed. Source accuracy improved for 10/12 sources.

**Borderline criteria:** The BG category shift is -6.6% (threshold ±5%), but justified: 11 of the 13 reclassified paragraphs literally never mention "board." Aaryan's accuracy drops -1.0% (at the <1% threshold), but Aaryan is the weakest annotator and was already aligned with the incorrect BG labels.

---
## Remaining T5 Cases (85)

| Axis | Count | Notes |
|------|-------|-------|
| BG↔MR↔RMP (3-way) | 31 | Irreducible: SEC Item 1C naturally blends governance/management/process |
| MR↔RMP (pure) | 20 | Person-removal test applicable but not automatable |
| BG↔MR | 6 | Board committees vs management committees |
| BG↔RMP | 5 | Governance structure vs process content |
| ID↔SI | 4 | Borderline incident/assessment paragraphs |
| Other | 19 | Various minor axes |

The 85 remaining T5 cases represent 7.1% of the holdout set. Most are on the BG↔MR↔RMP triangle, which reflects genuine structural ambiguity in SEC Item 1C disclosures (companies describe governance, management roles, and risk processes in interleaved paragraphs). This is a methodological finding worth documenting in the paper.
164 docs/V35-ITERATION-LOG.md Normal file
@ -0,0 +1,164 @@
# v3.5 Prompt Iteration Log

## Status: Locked at v3.5f, pending SI↔N/O investigation

## Final v3.5f Re-Run Results (7 models × 359 confusion-axis holdout paragraphs)

### Per-Model Accuracy vs Human Majority (358 common paragraphs)

| Model | v3.0 acc | v3.5f acc | Δ | Notes |
|-------|---------|----------|---|-------|
| Opus | ~63% | 63.4% | ~0 | most stable |
| Gemini Pro | ~59% | ~62% | +3 | |
| Kimi K2.5 | ~55% | ~62% | +7.0 | |
| GLM-5 | ~55% | ~62% | +6.7 | |
| MIMO Pro | ~57% | ~60% | +3 | |
| GPT-5.4 | ~62% | ~60% | -1.7 | |
| MiniMax | ~50% | ~57% | +7 | outlier, excluded from gold scoring |
### Per-Axis Accuracy (6-model majority, excl MiniMax)

| Axis | Paragraphs | v3.0 acc | v3.5f acc | Δ |
|------|-----------|---------|----------|---|
| BG↔MR | 104 | ~45% | ~67% | **+22.1** |
| BG↔RMP | 59 | ~40% | ~65% | **+25.4** |
| MR↔RMP | 191 | ~58% | ~56% | -2.1 |
| SI↔N/O | 83 | ~66% | ~60% | **-6.0** |

### Model Convergence

- All 7 models pairwise agreement: 61.7% → 79.1% (+17.3pp)
- Top 6 (excl MiniMax): 63.1% → 80.9% (+17.8pp)
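
For reference, a minimal sketch of how these pairwise-agreement figures can be computed, assuming `labels` maps each model name to its {paragraphId: category} dict:

```python
from itertools import combinations

def mean_pairwise_agreement(labels: dict[str, dict[str, str]]) -> float:
    """Average, over model pairs, of the share of shared paragraphs labeled identically."""
    rates = []
    for a, b in combinations(sorted(labels), 2):
        common = labels[a].keys() & labels[b].keys()
        if common:
            rates.append(sum(labels[a][p] == labels[b][p] for p in common) / len(common))
    return sum(rates) / len(rates)
```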

### Cost

| Model | v3.5f cost |
|-------|-----------|
| GPT-5.4 | $2.14 |
| Gemini Pro | $5.35 |
| GLM-5 | $3.06 |
| Kimi K2.5 | $2.80 |
| MIMO Pro | $2.21 |
| MiniMax | $0.54 |
| Opus | $0 (subscription) |
| **Total** | **$16.10** |

---
## The SI↔N/O Paradox — RESOLVED

### The original problem

We started this exercise because of a 23:0 SI↔N/O asymmetry (humans say SI, GenAI says N/O, never the reverse). The v3.5 iteration made it worse (25:2 in v3.5f vs 20:1 in v3.0).

### Investigation (post-v3.5f)

Paragraph-by-paragraph analysis of all 27 SI↔N/O errors revealed that **the models are correct, not the humans.**

**Of the 25 Human=SI / Model=N/O cases:**

- **~20 cases: Models correct.** These are "could have a material adverse effect" boilerplate speculation, cross-references to Item 1A, or generic threat enumeration — none containing actual materiality assessments. Every model unanimously calls N/O.
- **~2 cases: Genuinely ambiguous.** One SPAC with materiality language, one past-disruption mention without explicit materiality language.
- **~2 cases: Edge cases.** Negative assertions embedded at the end of BG/risk paragraphs (debatable whether the assertion or the surrounding content dominates).
- **~1 case: Wrong axis entirely.** Should be RMP (describes resource commitment), not SI or N/O.

**Of the 2 Human=N/O / Model=SI cases:**

- **Both: Models correct.** Both contain clear negative assertions ("not aware of having experienced any prior material data breaches", "did not experience any cybersecurity incident during 2024") — textbook SI per the codebook. All 6 models unanimously call SI.

**Root cause of human error:** Human annotators systematically treat ANY mention of "material," "business strategy," "results of operations," or "financial condition" as SI — even when the surrounding language is purely speculative ("could," "if," "may"). The codebook's assessment-vs-speculation distinction (v3.5 Rule 6) is correct, but humans weren't consistently applying it.
### Codebook Case 9 contradiction — FIXED

The investigation discovered that **Codebook Case 9 directly contradicted Rule 6:**

- Case 9 said: "could potentially have a material impact on our business strategy" → SI
- Rule 6 said: "could have a material adverse effect" → NOT SI (speculation)

Case 9 has been updated: the "could potentially" example is now correctly labeled N/O, with an explanation of why "reasonably likely to materially affect" (SEC qualifier) ≠ "could potentially have a material impact" (speculation).

### Prompt clarifications applied (within v3.5, no version bump)

Two minor clarifications were added to the locked prompt (net effect on GPT-5.4: within stochastic noise):

1. **Consequence clause refinement:** Speculative materiality language at the end of a paragraph = ignore. But factual negative assertions ("have not experienced any material incidents") = SI even at the end of a paragraph.
2. **Investment/resource SI signal:** "expend considerable resources on cybersecurity" is a strategic resource commitment (SI marker), not speculation.
### What this means for gold adjudication

**The "paradox" is resolved: there is no systematic model error on SI↔N/O.** The 25:2 asymmetry reflects human over-calling of SI, not model under-calling.

**Gold adjudication strategy for SI↔N/O:**

1. When all 6 models unanimously say N/O and the paragraph contains only "could/if/may" speculation → **gold = N/O** (models correct, humans wrong)
2. When all 6 models unanimously say SI and the paragraph contains a negative assertion → **gold = SI** (models correct, humans wrong)
3. For the ~3-5 genuinely ambiguous cases → expert review
4. Backward-looking assessments ("have not materially affected") and SEC-qualified forward-looking ("reasonably likely to materially affect") → **always SI** via deterministic regex, regardless of model or human vote
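
A hedged sketch of such a regex gate; the patterns below are illustrative and the production set may differ:

```python
import re

# Illustrative patterns only -- the production regex set may differ.
SI_ASSESSMENT_PATTERNS = [
    re.compile(r"have\s+not\s+materially\s+affected", re.I),              # backward-looking
    re.compile(r"reasonably\s+likely\s+to\s+materially\s+affect", re.I),  # SEC qualifier
    re.compile(r"(did|have)\s+not\s+experienced?\b.*\bmaterial", re.I),   # negative assertion
]

def force_si(paragraph: str) -> bool:
    """True when the paragraph contains a deterministic SI assessment marker."""
    return any(p.search(paragraph) for p in SI_ASSESSMENT_PATTERNS)
```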

**Expected impact:** Flipping ~22 of the 27 SI↔N/O errors from human-majority to model-consensus would raise SI↔N/O accuracy from ~60% to ~95%+ (measured against corrected gold labels).

### What this means for Stage 1 training data

The 180 materiality-flagged paragraphs should still be corrected via deterministic regex for backward-looking assessments and SEC qualifiers. The 128 SPAC paragraphs should still be corrected via the Stage 2 judge. The prompt is NOT the bottleneck — the corrections target v2.5→v3.5 codebook drift, not prompt failure.

---
## Iteration History (6 rounds, $1.02 on 26 regression paragraphs)

| Round | Prompt | Score | Key change |
|-------|--------|-------|-----------|
| 1 | v3.5a | 5/26 | Initial rulings — catastrophic over-correction |
| 2 | v3.5b | 13/25 | Purpose test for BG, Step 1 non-decisive for MR, cross-ref exception |
| 3 | v3.5c | 20/26 | Cross-reference materiality exception |
| 4 | v3.5d | 22/26 | SI tightened: assessment vs speculation distinction |
| 5 | v3.5e | 19/25 | BG/RMP example added — REGRESSED, reverted |
| 6 | v3.5f | 21/26 | Reverted R5; kept R4 SI fix + N/O↔RMP measures fix |

### Stable fixes (consistently correct across R4-R6)

- 5 SI cross-reference over-predictions eliminated
- 3-4 BG purpose test corrections
- 3-4 MR Step 1 non-short-circuiting corrections

### Stable errors (4, genuinely ambiguous — human 2-1 splits)

- 2× BG over-call on process paragraphs with committee mentions
- 2× N/O over-call on borderline RMP paragraphs

### Root causes identified per error

1. **17f2cc:** Fragment/truncated paragraph; "committees" triggers BG but process verbs dominate
2. **8adfde:** 300-word risk paragraph with embedded security measures → N/O instead of RMP
3. **eca862:** CISO+ERMC monitoring methods → BG instead of RMP (ERMC woven throughout)
4. **fcc65c:** "Material risks" + threat enumeration → N/O instead of RMP (borderline)

---
## Stage 1 Impact Summary

| Metric | Original flag | Tightened flag |
|--------|-------------|---------------|
| Total flagged | 1,014 | 308 |
| Materiality | 886 | 180 |
| SPAC | 128 | 128 |
| Excluded (generic "could" boilerplate) | — | 706 |

The 706 excluded paragraphs contain generic "could have a material adverse effect" language that is correctly N/O under both v2.5 and v3.5. Only 180 contain actual backward-looking or SEC-qualified assessments.

**Recommendation:** Correct the 180 materiality paragraphs via deterministic regex (label as SI), not via model re-evaluation. Correct the 128 SPACs via the Stage 2 judge (a model is needed to determine the correct label for paragraphs that should not have been coded as substantive categories).

---
## Files Created/Modified

| File | Purpose |
|------|---------|
| `ts/src/label/prompts.ts` | v3.5f locked prompt (PROMPT_VERSION="v3.5") |
| `data/annotations/bench-holdout-v35/*.jsonl` | 7 models × 359 paragraphs, v3.5f |
| `data/annotations/golden-v35/opus.jsonl` | Opus v3.5f on 359 paragraphs |
| `data/annotations/bench-holdout-v35b/gpt-5.4.jsonl` | Iteration test data (26 paragraphs, multiple rounds) |
| `data/annotations/stage1-corrections.jsonl` | 308 flagged paragraphs (tightened criteria) |
| `data/gold/holdout-rerun-v35.jsonl` | 359 confusion-axis paragraph IDs |
| `data/gold/holdout-rerun-v35b.jsonl` | 26 regression paragraph IDs |
| `data/gold/regression-pids.json` | Regression PIDs by axis |
| `scripts/compare-v30-v35.py` | v3.0 vs v3.5a comparison |
| `scripts/compare-v30-v35-final.py` | v3.0 vs v3.5f comparison |
| `scripts/examine-v35-errors.py` | Error analysis for iteration |
| `scripts/extract-regression-pids.py` | Identify regression paragraphs |
| `scripts/flag-stage1-corrections.py` | Flag Stage 1 corrections (tightened) |
| `scripts/identify-holdout-rerun.py` | Identify confusion-axis holdout paragraphs |
| `docs/LABELING-CODEBOOK.md` | v3.5 rulings + version history |
| `docs/NARRATIVE.md` | Phase 15 with full iteration detail |
| `docs/STATUS.md` | v3.5 section added |
625 scripts/adjudicate-gold-experiment.py Normal file
@ -0,0 +1,625 @@
"""
|
||||||
|
Gold Set Adjudication — Experimental Harness
|
||||||
|
=============================================
|
||||||
|
|
||||||
|
Runs the adjudication pipeline with toggleable interventions, one variable
|
||||||
|
at a time, and produces comparable metrics for each configuration.
|
||||||
|
|
||||||
|
Experiments:
|
||||||
|
baseline — Current production adjudication (92 T5 cases)
|
||||||
|
exp1_gemini — Exclude Gemini from MR↔RMP axis when Gemini voted MR
|
||||||
|
exp2_board — Board-removal test overrides for BG↔RMP T5 cases
|
||||||
|
exp3_committee — Committee-level test overrides for BG↔MR T5 cases
|
||||||
|
exp4_idsi — ID↔SI volume-dominant tiebreaker
|
||||||
|
exp5_spec — Specificity hybrid (human unanimous → human, split → model)
|
||||||
|
combined — All validated interventions stacked
|
||||||
|
|
||||||
|
Usage:
|
||||||
|
uv run scripts/adjudicate-gold-experiment.py [experiment_name|all]
|
||||||
|
"""
|
||||||
|
import json
|
||||||
|
import sys
|
||||||
|
from collections import Counter, defaultdict
|
||||||
|
from dataclasses import dataclass, field
|
||||||
|
from pathlib import Path
|
||||||
|
|
||||||
|
ROOT = Path(__file__).resolve().parent.parent
|
||||||
|
|
||||||
|
# ── IMPORTS FROM PRODUCTION SCRIPT ──────────────────────────────────────
# These are the existing overrides from adjudicate-gold.py, kept identical
# so the baseline matches production exactly.

SI_NO_OVERRIDES: dict[str, tuple[str, str]] = {
    "026c8eca": ("None/Other", "Speculation: 'could potentially result in' -- no materiality assessment"),
    "160fec46": ("None/Other", "Resource lament: 'do not have manpower' -- no materiality assessment"),
    "1f29ea8c": ("None/Other", "Speculation: 'could have material adverse effect' boilerplate"),
    "20c70335": ("None/Other", "Risk list: 'A breach could lead to...' -- enumeration, not assessment"),
    "303685cf": ("None/Other", "Speculation: 'could materially adversely affect'"),
    "7d021fcc": ("None/Other", "Speculation: 'could...have a material adverse effect'"),
    "7ef53cab": ("None/Other", "Risk enumeration: 'could lead to... could disrupt... could steal...'"),
    "a0d01951": ("None/Other", "Speculation: 'could adversely affect our business'"),
    "aaa8974b": ("None/Other", "Speculation: 'could potentially have a material impact' -- Case 9 fix"),
    "b058dca1": ("None/Other", "Speculation: 'could disrupt our operations'"),
    "b1b216b6": ("None/Other", "Speculation: 'could materially adversely affect'"),
    "dc8a2798": ("None/Other", "Speculation: 'If compromised, we could be subject to...'"),
    "e4bd0e2f": ("None/Other", "Speculation: 'could have material adverse impact'"),
    "f4656a7e": ("None/Other", "Threat enumeration under SI-sounding header -- no assessment"),
    "2e8cbdbf": ("None/Other", "Cross-ref: 'We describe whether and how... under the headings [risk factors]'"),
    "75de7441": ("None/Other", "Cross-ref: 'We describe whether and how... under the heading [risk factor]'"),
    "78cad2a1": ("None/Other", "Cross-ref: 'In our Risk Factors, we describe whether and how...'"),
    "3879887f": ("None/Other", "Brief incident mention + 'See Item 1A' cross-reference"),
    "f026f2be": ("None/Other", "Risk factor heading/cross-reference -- not an assessment"),
    "5df3a6c9": ("None/Other", "IT importance statement -- no assessment. H=1/3 SI"),
    "d5dc17c2": ("None/Other", "Risk enumeration -- no assessment. H=1/3 SI"),
    "c10f2a54": ("None/Other", "Early-stage/SPAC + weak negative assertion. SPAC rule dominates"),
    "45961c99": ("None/Other", "Past disruption but no materiality language. Primarily speculation"),
    "1673f332": ("None/Other", "SPAC with assessment at end -- SPAC rule dominates per Case 8"),
    "f75ac78a": ("Risk Management Process", "Resource expenditure on cybersecurity -- RMP per person-removal test"),
    "367108c2": ("Strategy Integration", "Negative assertion: 'not aware of having experienced any prior material data breaches'"),
    "837e31d5": ("Strategy Integration", "Negative assertion: 'did not experience any cybersecurity incident during 2024'"),
}

T5_CODEBOOK_OVERRIDES: dict[str, tuple[str, str]] = {
    "15e7cf99": ("Strategy Integration", "SI/ID tiebreaker: 'have not encountered any risks' -- materiality assessment, no specific incident described"),
    "6dc6bb4a": ("Incident Disclosure", "SI/ID tiebreaker: 'ransomware attack in October 2021' -- describes specific incident with date"),
    "c71739a9": ("Risk Management Process", "TP/RMP: Fund relies on CCO and adviser's risk management expertise -- third parties supporting internal process"),
}
# ── EXPERIMENT-SPECIFIC OVERRIDES ───────────────────────────────────────

# Exp 2/3: Board-removal + committee-level test overrides (with-board paragraphs)
# These 5 paragraphs mention "board" so the automated no-board test can't catch them.
# Each read manually; board-removal test applied to determine if board mention is
# incidental or substantive.
MANUAL_BOARD_OVERRIDES: dict[str, tuple[str, str]] = {
    # Board = 1/5 sentences + final notification clause. CISO/ISIRT/incident
    # response plan dominate the content. Board oversight is incidental attribution.
    "22da6695": ("Risk Management Process",
                 "Board-removal: 'Board is also responsible for approval' (1 sentence) + "
                 "'notifying the Board' (final clause). Remove → CISO + IS Program + incident "
                 "response plan. Process dominates."),
    # Titled 'Management's Role.' Compliance Committee = management-level (CIO,
    # executives). Board mentioned 2x as information destination only.
    "a2ff7e1e": ("Management Role",
                 "Committee-level: Compliance Committee is management-level (O'Reilly executives). "
                 "Board is incidental destination (2 clauses). Titled 'Management's Role.'"),
    # Very brief (3 sentences). Management oversees + board notification + 'Public
    # Offering' (registration statement). Board is incident notification only.
    "cb518f47": ("Management Role",
                 "Board-removal: remove notification sentence → 'management oversees cybersecurity.' "
                 "Board is incident notification destination only. Brief paragraph."),
}

# Exp 4: Codebook tiebreaker overrides (beyond existing T5_CODEBOOK_OVERRIDES)
# Each paragraph read in full and classified by codebook rules.
CODEBOOK_OVERRIDES: dict[str, tuple[str, str]] = {
    # ── ID↔SI: negative assertion = materiality assessment → SI ──────────
    "0ceeb618": ("Strategy Integration",
                 "ID/SI: Opens with negative assertion ('no material incidents'), Feb 2025 "
                 "incident is brief context + 'has not had material impact' conclusion. "
                 "Materiality assessment frame dominates → SI"),
    "cc82eb9f": ("Strategy Integration",
                 "ID/SI: June 2018 incident is example within broader negative materiality "
                 "assertion ('have not materially affected us'). Assessment frame dominates → SI"),
    # ── SPAC rule (Case 8): pre-revenue company → N/O ────────────────────
    "203ccd43": ("None/Other",
                 "SPAC: 'once the Company commences operations' — pre-revenue company. "
                 "Case 8: SPAC → N/O regardless of management role language"),
    # ── ID→RMP: post-incident improvements, no incident described ────────
    "f549fd64": ("Risk Management Process",
                 "ID/RMP: 'Following this cybersecurity event' — refers to incident without "
                 "describing it. 100% of content is hardening, training, MFA, EDR — pure RMP"),
}
@dataclass
class ExperimentConfig:
    name: str
    description: str
    exclude_gemini_mr_rmp: bool = False
    apply_board_removal: bool = False
    apply_committee_level: bool = False
    apply_idsi_tiebreaker: bool = False
    apply_specificity_hybrid: bool = False
    # Text-based: remove BG model votes when "board" absent from paragraph text
    apply_no_board_bg_removal: bool = False


@dataclass
class ExperimentResult:
    config: ExperimentConfig
    total: int = 0
    tier_counts: dict[str, int] = field(default_factory=dict)
    category_dist: dict[str, int] = field(default_factory=dict)
    human_maj_dist: dict[str, int] = field(default_factory=dict)
    flipped_from_human: int = 0
    source_accuracy: dict[str, float] = field(default_factory=dict)
    t5_by_axis: dict[str, int] = field(default_factory=dict)
    t5_weak_plurality: int = 0  # 4-5/9
    results: list[dict] = field(default_factory=list)
    spec_changes: int = 0
def load_jsonl(path: Path) -> list[dict]:
    with open(path) as f:
        return [json.loads(line) for line in f]


def majority_vote(votes: list[str]) -> str | None:
    if not votes:
        return None
    return Counter(votes).most_common(1)[0][0]


def get_confusion_axis(human_votes: dict, model_votes: dict) -> str:
    """Identify the confusion axis from vote distributions."""
    all_cats = sorted(set(list(human_votes.keys()) + list(model_votes.keys())))
    if len(all_cats) == 2:
        return f"{all_cats[0]}↔{all_cats[1]}"
    return "↔".join(all_cats)
def run_experiment(config: ExperimentConfig) -> ExperimentResult:
    """Run adjudication with a specific experimental configuration."""

    # ── Load data ─────────────────────────────────────────────────────
    human_labels: dict[str, list[dict]] = defaultdict(list)
    for r in load_jsonl(ROOT / "data/gold/human-labels-raw.jsonl"):
        human_labels[r["paragraphId"]].append({
            "cat": r["contentCategory"],
            "spec": r["specificityLevel"],
            "annotator": r["annotatorName"],
        })

    confusion_pids = {r["paragraphId"] for r in load_jsonl(ROOT / "data/gold/holdout-rerun-v35.jsonl")}

    TOP6 = ["Opus", "GPT-5.4", "Gemini", "GLM-5", "Kimi", "MIMO"]

    def load_model_cats(files: dict[str, Path]) -> dict[str, dict[str, str]]:
        result: dict[str, dict[str, str]] = {}
        for name, path in files.items():
            result[name] = {}
            if path.exists():
                for r in load_jsonl(path):
                    cat = r.get("label", {}).get("content_category") or r.get("content_category")
                    if cat:
                        result[name][r["paragraphId"]] = cat
            # Also load specificity for exp5
            result[f"{name}_spec"] = {}
            if path.exists():
                for r in load_jsonl(path):
                    spec = r.get("label", {}).get("specificity_level") or r.get("specificity_level")
                    if spec is not None:
                        result[f"{name}_spec"][r["paragraphId"]] = spec
        return result

    v30_cats = load_model_cats({
        "Opus": ROOT / "data/annotations/golden/opus.jsonl",
        "GPT-5.4": ROOT / "data/annotations/bench-holdout/gpt-5.4.jsonl",
        "Gemini": ROOT / "data/annotations/bench-holdout/gemini-3.1-pro-preview.jsonl",
        "GLM-5": ROOT / "data/annotations/bench-holdout/glm-5:exacto.jsonl",
        "Kimi": ROOT / "data/annotations/bench-holdout/kimi-k2.5.jsonl",
        "MIMO": ROOT / "data/annotations/bench-holdout/mimo-v2-pro:exacto.jsonl",
    })

    v35_cats = load_model_cats({
        "Opus": ROOT / "data/annotations/golden-v35/opus.jsonl",
        "GPT-5.4": ROOT / "data/annotations/bench-holdout-v35/gpt-5.4.jsonl",
        "Gemini": ROOT / "data/annotations/bench-holdout-v35/gemini-3.1-pro-preview.jsonl",
        "GLM-5": ROOT / "data/annotations/bench-holdout-v35/glm-5:exacto.jsonl",
        "Kimi": ROOT / "data/annotations/bench-holdout-v35/kimi-k2.5.jsonl",
        "MIMO": ROOT / "data/annotations/bench-holdout-v35/mimo-v2-pro:exacto.jsonl",
    })

    # Merge v3.0 + v3.5 (v3.5 for confusion PIDs)
    model_cats: dict[str, dict[str, str]] = {}
    model_specs: dict[str, dict[str, int]] = {}
    for m in TOP6:
        model_cats[m] = {}
        model_specs[m] = {}
        for pid in human_labels:
            if pid in confusion_pids and pid in v35_cats.get(m, {}):
                model_cats[m][pid] = v35_cats[m][pid]
            elif pid in v30_cats.get(m, {}):
                model_cats[m][pid] = v30_cats[m][pid]
            # Specificity (always v3.0 for full coverage)
            if pid in v30_cats.get(f"{m}_spec", {}):
                model_specs[m][pid] = v30_cats[f"{m}_spec"][pid]
    # ── Adjudicate ────────────────────────────────────────────────────
    result = ExperimentResult(config=config)
    tier_counts: Counter[str] = Counter()

    for pid in sorted(human_labels.keys()):
        h_cats = [l["cat"] for l in human_labels[pid]]
        h_specs = [l["spec"] for l in human_labels[pid]]
        h_cat_maj = majority_vote(h_cats)
        h_spec_maj = majority_vote(h_specs)
        h_spec_unanimous = len(set(h_specs)) == 1

        # Use full model panel for tier calculation (T1-T4 stability)
        active_models = list(TOP6)

        m_cats_list = [model_cats[m][pid] for m in active_models if pid in model_cats[m]]
        m_cat_maj = majority_vote(m_cats_list)
        m_cat_unanimous = len(set(m_cats_list)) == 1 and len(m_cats_list) == len(active_models)

        all_signals = h_cats + m_cats_list
        signal_counter = Counter(all_signals)
        total_signals = len(all_signals)
        top_signal, top_count = signal_counter.most_common(1)[0]

        short_pid = pid[:8]
        si_override = SI_NO_OVERRIDES.get(short_pid)

        gold_cat: str | None = None
        tier: str = ""
        reason: str = ""

        if si_override:
            gold_cat = si_override[0]
            tier = "T3-rule"
            reason = f"SI/NO override: {si_override[1]}"
        elif top_count >= 8 and total_signals >= 8:
            gold_cat = top_signal
            tier = "T1-super"
            reason = f"{top_count}/{total_signals} signals agree"
        elif h_cat_maj == m_cat_maj:
            gold_cat = h_cat_maj
            tier = "T2-cross"
            reason = "Human + model majority agree"
        elif m_cat_unanimous:
            gold_cat = m_cat_maj
            tier = "T4-model"
            h_count = Counter(h_cats).most_common(1)[0][1]
            reason = f"{len(m_cats_list)}/{len(m_cats_list)} models unanimous ({m_cat_maj}) vs human {h_count}/3 ({h_cat_maj})"
        else:
            # Check rule-based overrides
            t5_override = T5_CODEBOOK_OVERRIDES.get(short_pid)

            # Exp 2/3: Manual board-removal + committee-level test (with-board paragraphs)
            board_override = MANUAL_BOARD_OVERRIDES.get(short_pid) if (config.apply_board_removal or config.apply_committee_level) else None

            # Exp 4: Codebook tiebreaker overrides
            codebook_override = CODEBOOK_OVERRIDES.get(short_pid) if config.apply_idsi_tiebreaker else None

            if t5_override:
                gold_cat = t5_override[0]
                tier = "T3-rule"
                reason = f"T5 codebook override: {t5_override[1]}"
            elif board_override:
                gold_cat = board_override[0]
                tier = "T3-rule"
                reason = f"Board/committee test: {board_override[1]}"
            elif codebook_override:
                gold_cat = codebook_override[0]
                tier = "T3-rule"
                reason = f"Codebook tiebreaker: {codebook_override[1]}"
            else:
                t5_signals = list(all_signals)
                t5_total = total_signals
                suffix = ""

                # ── Exp 1: Gemini exclusion at T5 resolution only ─────
                if config.exclude_gemini_mr_rmp:
                    gemini_cat = model_cats.get("Gemini", {}).get(pid)
                    if gemini_cat == "Management Role":
                        other_m_cats = [model_cats[m][pid] for m in TOP6 if m != "Gemini" and pid in model_cats[m]]
                        other_m_maj = majority_vote(other_m_cats) if other_m_cats else None
                        if other_m_maj != "Management Role":
                            t5_signals = h_cats + other_m_cats
                            t5_total = len(t5_signals)
                            suffix += " [Gemini MR excluded]"

                # ── Exp 2b: No-board BG vote removal ─────────────────
                # If "board" (case-insensitive) doesn't appear in the paragraph
                # text, BG model votes are provably unsupported — the paragraph
                # can't be about board governance if it never mentions the board.
                # Remove those BG signals and recalculate plurality.
                if config.apply_no_board_bg_removal:
                    para_texts = load_paragraph_texts()
                    para_text = para_texts.get(pid, "")
                    if "board" not in para_text.lower():
                        bg_count = sum(1 for s in t5_signals if s == "Board Governance")
                        if bg_count > 0:
                            t5_signals = [s for s in t5_signals if s != "Board Governance"]
                            t5_total = len(t5_signals)
                            if t5_signals:
                                suffix += f" [BG removed: no 'board' in text, {bg_count} votes dropped]"

                if t5_signals:
                    t5_counter = Counter(t5_signals)
                    t5_top, t5_top_count = t5_counter.most_common(1)[0]
                else:
                    t5_top, t5_top_count = top_signal, top_count

                gold_cat = t5_top
                tier = "T5-plurality"
                reason = f"Mixed: human={h_cat_maj}, model={m_cat_maj}, plurality={t5_top} ({t5_top_count}/{t5_total}){suffix}"
        # ── Specificity ───────────────────────────────────────────────
        if config.apply_specificity_hybrid and not h_spec_unanimous:
            # Human split → use model majority
            m_specs = [model_specs[m][pid] for m in TOP6 if pid in model_specs[m]]
            if m_specs:
                gold_spec = majority_vote([str(s) for s in m_specs])
                gold_spec = int(gold_spec) if gold_spec else h_spec_maj
                if gold_spec != h_spec_maj:
                    result.spec_changes += 1
            else:
                gold_spec = h_spec_maj
        else:
            gold_spec = h_spec_maj

        tier_counts[tier] += 1

        row = {
            "paragraphId": pid,
            "gold_category": gold_cat,
            "gold_specificity": gold_spec,
            "tier": tier,
            "reason": reason,
            "human_majority": h_cat_maj,
            "model_majority": m_cat_maj,
            "human_votes": dict(Counter(h_cats)),
            "model_votes": dict(Counter(m_cats_list)),
        }
        result.results.append(row)

        if tier == "T5-plurality":
            axis = get_confusion_axis(dict(Counter(h_cats)), dict(Counter(m_cats_list)))
            result.t5_by_axis[axis] = result.t5_by_axis.get(axis, 0) + 1
            if top_count <= 5:
                result.t5_weak_plurality += 1

    result.total = len(result.results)
    result.tier_counts = dict(sorted(tier_counts.items()))
    result.flipped_from_human = sum(1 for r in result.results if r["gold_category"] != r["human_majority"])
    result.category_dist = dict(Counter(r["gold_category"] for r in result.results))
    result.human_maj_dist = dict(Counter(r["human_majority"] for r in result.results))

    # Source accuracy vs gold
    gold_by_pid = {r["paragraphId"]: r["gold_category"] for r in result.results}

    # Human annotators
    annotator_names = sorted(set(l["annotator"] for labels in human_labels.values() for l in labels))
    for ann in annotator_names:
        agree = total = 0
        for pid, labels in human_labels.items():
            for l in labels:
                if l["annotator"] == ann and pid in gold_by_pid:
                    total += 1
                    if l["cat"] == gold_by_pid[pid]:
                        agree += 1
        if total > 0:
            result.source_accuracy[f"H:{ann}"] = agree / total

    # Models (v3.0 on full 1200)
    for m in TOP6:
        agree = total = 0
        for pid in gold_by_pid:
            if pid in v30_cats.get(m, {}):
                total += 1
                if v30_cats[m][pid] == gold_by_pid[pid]:
                    agree += 1
        if total > 0:
            result.source_accuracy[f"M:{m}"] = agree / total

    return result
def print_result(r: ExperimentResult, baseline: ExperimentResult | None = None) -> None:
    """Print experiment results with optional delta from baseline."""
    print(f"\n{'=' * 90}")
    print(f"EXPERIMENT: {r.config.name}")
    print(f"  {r.config.description}")
    print(f"{'=' * 90}")

    print(f"\nTier distribution:")
    for tier in ["T1-super", "T2-cross", "T3-rule", "T4-model", "T5-plurality"]:
        count = r.tier_counts.get(tier, 0)
        pct = count / r.total * 100
        delta = ""
        if baseline:
            bc = baseline.tier_counts.get(tier, 0)
            if count != bc:
                delta = f" (Δ {count - bc:+d})"
        print(f"  {tier:<16} {count:>5} ({pct:.1f}%){delta}")

    print(f"\nGold ≠ human majority: {r.flipped_from_human} ({r.flipped_from_human / r.total:.1%})")
    if baseline and r.flipped_from_human != baseline.flipped_from_human:
        print(f"  (Δ {r.flipped_from_human - baseline.flipped_from_human:+d})")

    if r.t5_by_axis:
        t5_total = sum(r.t5_by_axis.values())
        print(f"\nT5 remaining ({t5_total} cases):")
        for axis, count in sorted(r.t5_by_axis.items(), key=lambda x: -x[1])[:10]:
            print(f"  {axis:<60} {count:>3}")
        print(f"  Weak plurality (4-5/9): {r.t5_weak_plurality}")

    print(f"\nCategory distribution (gold):")
    all_cats = sorted(set(list(r.category_dist.keys()) + list(r.human_maj_dist.keys())))
    print(f"  {'Category':<25} {'Gold':>6} {'H-Maj':>6} {'Δ':>5}", end="")
    if baseline:
        print(f" {'Prev':>6} {'ΔExp':>5}", end="")
    print()
    for cat in all_cats:
        g = r.category_dist.get(cat, 0)
        h = r.human_maj_dist.get(cat, 0)
        line = f"  {cat:<25} {g:>6} {h:>6} {g - h:>+5}"
        if baseline:
            bg = baseline.category_dist.get(cat, 0)
            line += f" {bg:>6} {g - bg:>+5}"
        print(line)

    print(f"\nSource accuracy vs gold:")
    # Sort by accuracy descending
    for source, acc in sorted(r.source_accuracy.items(), key=lambda x: -x[1]):
        delta = ""
        if baseline and source in baseline.source_accuracy:
            ba = baseline.source_accuracy[source]
            diff = acc - ba
            if abs(diff) >= 0.0005:
                delta = f" (Δ {diff:+.1%})"
        print(f"  {source:<16} {acc:.1%}{delta}")

    if r.config.apply_specificity_hybrid:
        print(f"\nSpecificity: {r.spec_changes} labels changed from human majority to model majority")
def diff_results(a: ExperimentResult, b: ExperimentResult) -> list[dict]:
    """Find paragraphs where gold_category differs between two experiments."""
    a_map = {r["paragraphId"]: r for r in a.results}
    b_map = {r["paragraphId"]: r for r in b.results}
    diffs = []
    for pid in sorted(a_map.keys()):
        if a_map[pid]["gold_category"] != b_map[pid]["gold_category"]:
            diffs.append({
                "paragraphId": pid,
                "before": a_map[pid]["gold_category"],
                "after": b_map[pid]["gold_category"],
                "before_tier": a_map[pid]["tier"],
                "after_tier": b_map[pid]["tier"],
                "human_majority": a_map[pid]["human_majority"],
                "reason_after": b_map[pid]["reason"],
            })
    return diffs


# ── PARAGRAPH TEXT LOADER (for text-based tests) ───────────────────────
_paragraph_texts: dict[str, str] | None = None

def load_paragraph_texts() -> dict[str, str]:
    global _paragraph_texts
    if _paragraph_texts is None:
        _paragraph_texts = {}
        for r in load_jsonl(ROOT / "data/gold/paragraphs-holdout.jsonl"):
            _paragraph_texts[r["id"]] = r["text"]
    return _paragraph_texts
EXPERIMENTS = {
    "baseline": ExperimentConfig(
        name="baseline",
        description="Current production adjudication (no changes)",
    ),
    "exp1_gemini": ExperimentConfig(
        name="exp1_gemini",
        description="Exclude Gemini from MR↔RMP axis when Gemini voted MR",
        exclude_gemini_mr_rmp=True,
    ),
    "exp2_board": ExperimentConfig(
        name="exp2_board",
        description="Board-removal test overrides for BG↔RMP T5 cases",
        apply_board_removal=True,
    ),
    "exp2b_noboard": ExperimentConfig(
        name="exp2b_noboard",
        description="Remove BG model votes when 'board' absent from paragraph text (automated, verifiable)",
        apply_no_board_bg_removal=True,
    ),
    "exp3_committee": ExperimentConfig(
        name="exp3_committee",
        description="Committee-level test overrides for BG↔MR T5 cases",
        apply_committee_level=True,
    ),
    "exp4_idsi": ExperimentConfig(
        name="exp4_idsi",
        description="ID↔SI volume-dominant tiebreaker",
        apply_idsi_tiebreaker=True,
    ),
    "exp5_spec": ExperimentConfig(
        name="exp5_spec",
        description="Specificity hybrid: human unanimous → human, split → model majority",
        apply_specificity_hybrid=True,
    ),
    "combined": ExperimentConfig(
        name="combined",
        description="All validated interventions: no-board BG removal + manual board overrides + codebook tiebreakers + specificity hybrid",
        apply_no_board_bg_removal=True,
        apply_board_removal=True,
        apply_idsi_tiebreaker=True,
        apply_specificity_hybrid=True,
    ),
}
def main() -> None:
    experiments_to_run = sys.argv[1:] if len(sys.argv) > 1 else ["all"]

    if "all" in experiments_to_run:
        experiments_to_run = list(EXPERIMENTS.keys())

    # Always run baseline first
    if "baseline" not in experiments_to_run:
        experiments_to_run.insert(0, "baseline")

    results: dict[str, ExperimentResult] = {}
    baseline: ExperimentResult | None = None

    for exp_name in experiments_to_run:
        if exp_name not in EXPERIMENTS:
            print(f"Unknown experiment: {exp_name}")
            continue

        r = run_experiment(EXPERIMENTS[exp_name])
        results[exp_name] = r

        if exp_name == "baseline":
            baseline = r
            print_result(r)
        else:
            print_result(r, baseline)

            # Show specific label changes
            if baseline:
                diffs = diff_results(baseline, r)
                if diffs:
                    print(f"\n  Label changes ({len(diffs)}):")
                    for d in diffs:
                        print(f"    {d['paragraphId'][:8]}: {d['before']:<25} → {d['after']:<25} (H={d['human_majority']}) [{d['after_tier']}]")

    # ── Acceptance criteria check ─────────────────────────────────────
    if baseline and len(results) > 1:
        print(f"\n{'=' * 90}")
        print("ACCEPTANCE CRITERIA SUMMARY")
        print(f"{'=' * 90}")
        print(f"\nCriteria:")
        print(f"  1. T5 count decreases (fewer arbitrary resolutions)")
        print(f"  2. Source accuracy: no model/human drops >1% (intervention isn't distorting)")
        print(f"  3. Category distribution: no category shifts >±5% of its baseline count")
        print(f"  4. Changes are principled (each has documented codebook justification)")
        print()

        for exp_name, r in results.items():
            if exp_name == "baseline":
                continue
            t5_base = baseline.tier_counts.get("T5-plurality", 0)
            t5_exp = r.tier_counts.get("T5-plurality", 0)
            t5_pass = t5_exp <= t5_base

            max_acc_drop = 0.0
            for source in baseline.source_accuracy:
                if source in r.source_accuracy:
                    drop = baseline.source_accuracy[source] - r.source_accuracy[source]
                    max_acc_drop = max(max_acc_drop, drop)
            acc_pass = max_acc_drop < 0.01

            max_cat_shift_pct = 0.0
            for cat in baseline.category_dist:
                base_n = baseline.category_dist.get(cat, 0)
                exp_n = r.category_dist.get(cat, 0)
                if base_n > 0:
                    shift = abs(exp_n - base_n) / base_n
                    max_cat_shift_pct = max(max_cat_shift_pct, shift)
            cat_pass = max_cat_shift_pct < 0.05

            status = "✓ PASS" if (t5_pass and acc_pass and cat_pass) else "✗ FAIL"
            print(f"  {exp_name:<20} {status}")
            print(f"    T5: {t5_base} → {t5_exp} (Δ {t5_exp - t5_base:+d}) {'✓' if t5_pass else '✗'}")
            print(f"    Max accuracy drop: {max_acc_drop:.2%} {'✓' if acc_pass else '✗'}")
            print(f"    Max category shift: {max_cat_shift_pct:.1%} {'✓' if cat_pass else '✗'}")


if __name__ == "__main__":
    main()
393 scripts/adjudicate-gold.py Normal file
@ -0,0 +1,393 @@
"""
|
||||||
|
Gold Set Adjudication Script (v2)
|
||||||
|
==================================
|
||||||
|
|
||||||
|
Produces gold labels for the 1,200 holdout paragraphs using a tiered adjudication
|
||||||
|
strategy that combines 6 human annotators (3 per paragraph via BIBD) + 6 GenAI
|
||||||
|
models (MiniMax excluded per documented statistical outlier analysis, z=-2.07).
|
||||||
|
|
||||||
|
Each paragraph has up to 9 signals: 3 human + 6 model.
|
||||||
|
|
||||||
|
Tier system:
|
||||||
|
T1: Super-consensus — >=8/9 signals agree -> auto-gold (near-unanimous)
|
||||||
|
T2: Human majority + model majority agree -> cross-validated gold
|
||||||
|
T3: Rule-based override — 27 SI<->N/O paragraphs + 10 codebook tiebreakers,
|
||||||
|
each analyzed paragraph-by-paragraph against codebook rules and actual text.
|
||||||
|
T4: Model unanimous (6/6) + human majority disagree -> model label.
|
||||||
|
T5: Remaining disagreements -> plurality with text-based BG vote removal.
|
||||||
|
|
||||||
|
v2 changes (experimentally validated, see docs/T5-ANALYSIS.md):
|
||||||
|
- 10 new T5 codebook overrides (ID/SI, SPAC, board-removal, committee-level)
|
||||||
|
- Text-based BG vote removal: if "board" absent from paragraph text, BG model
|
||||||
|
votes are removed before T5 plurality. 13 labels changed, source accuracy UP
|
||||||
|
for 10/12 sources (+0.5-1.1% for top sources).
|
||||||
|
- Specificity hybrid: human unanimous -> human label, human split -> model majority.
|
||||||
|
195 specificity labels updated. Model-model spec agreement is 87-91% vs
|
||||||
|
human consensus of 52.5%.
|
||||||
|
|
||||||
|
Net effect: T5 reduced 92->85 (-7), source accuracy: Opus 88.6->89.1%, GPT-5.4
|
||||||
|
87.4->88.5%, gold!=human 151->144. 20 category labels changed, 195 specificity.
|
||||||
|
|
||||||
|
Usage:
|
||||||
|
uv run scripts/adjudicate-gold.py
|
||||||
|
"""
|
||||||
|
import json
|
||||||
|
from collections import Counter, defaultdict
|
||||||
|
from pathlib import Path
|
||||||
|
|
||||||
|
ROOT = Path(__file__).resolve().parent.parent
|
||||||
|
|
||||||
|
# ── SI<->N/O RULE-BASED OVERRIDES ────────────────────────────────────────
#
# These 27 paragraphs were analyzed INDIVIDUALLY against codebook rules and
# actual paragraph text. This is NOT a blanket override -- each paragraph was
# read, assessed against the assessment-vs-speculation distinction (Rule 6),
# the cross-reference exception, and the SPAC rule (Case 8).
#
# The analysis found that ~20/25 "Human=SI, Model=N/O" cases are human errors:
# annotators systematically treat ANY mention of "material" + "business strategy"
# as SI, even when the language is pure "could/if/may" speculation. The codebook's
# distinction is correct; humans weren't consistently applying it.
#
# The 2 "Human=N/O, Model=SI" cases are also human errors: both contain clear
# negative assertions ("not aware of having experienced any prior material
# incidents") which are textbook SI per Rule 6.
#
# Full analysis: docs/V35-ITERATION-LOG.md "The SI<->N/O Paradox -- Resolved"

SI_NO_OVERRIDES: dict[str, tuple[str, str]] = {
    # ── Speculation, not assessment (Human=SI -> N/O) ─────────────────────
    "026c8eca": ("None/Other", "Speculation: 'could potentially result in' -- no materiality assessment"),
    "160fec46": ("None/Other", "Resource lament: 'do not have manpower' -- no materiality assessment"),
    "1f29ea8c": ("None/Other", "Speculation: 'could have material adverse effect' boilerplate"),
    "20c70335": ("None/Other", "Risk list: 'A breach could lead to...' -- enumeration, not assessment"),
    "303685cf": ("None/Other", "Speculation: 'could materially adversely affect'"),
    "7d021fcc": ("None/Other", "Speculation: 'could...have a material adverse effect'"),
    "7ef53cab": ("None/Other", "Risk enumeration: 'could lead to... could disrupt... could steal...'"),
    "a0d01951": ("None/Other", "Speculation: 'could adversely affect our business'"),
    "aaa8974b": ("None/Other", "Speculation: 'could potentially have a material impact' -- Case 9 fix"),
    "b058dca1": ("None/Other", "Speculation: 'could disrupt our operations'"),
    "b1b216b6": ("None/Other", "Speculation: 'could materially adversely affect'"),
    "dc8a2798": ("None/Other", "Speculation: 'If compromised, we could be subject to...'"),
    "e4bd0e2f": ("None/Other", "Speculation: 'could have material adverse impact'"),
    "f4656a7e": ("None/Other", "Threat enumeration under SI-sounding header -- no assessment"),
    # ── Cross-references (Human=SI -> N/O) ────────────────────────────────
    "2e8cbdbf": ("None/Other", "Cross-ref: 'We describe whether and how... under the headings [risk factors]'"),
    "75de7441": ("None/Other", "Cross-ref: 'We describe whether and how... under the heading [risk factor]'"),
    "78cad2a1": ("None/Other", "Cross-ref: 'In our Risk Factors, we describe whether and how...'"),
    "3879887f": ("None/Other", "Brief incident mention + 'See Item 1A' cross-reference"),
    "f026f2be": ("None/Other", "Risk factor heading/cross-reference -- not an assessment"),
    # ── No materiality assessment present (Human=SI -> N/O) ───────────────
    "5df3a6c9": ("None/Other", "IT importance statement -- no assessment. H=1/3 SI"),
    "d5dc17c2": ("None/Other", "Risk enumeration -- no assessment. H=1/3 SI"),
    "c10f2a54": ("None/Other", "Early-stage/SPAC + weak negative assertion. SPAC rule dominates"),
    "45961c99": ("None/Other", "Past disruption but no materiality language. Primarily speculation"),
    "1673f332": ("None/Other", "SPAC with assessment at end -- SPAC rule dominates per Case 8"),
    "f75ac78a": ("Risk Management Process", "Resource expenditure on cybersecurity -- RMP per person-removal test"),
    # ── Negative assertions ARE assessments (Human=N/O -> SI) ─────────────
    "367108c2": ("Strategy Integration", "Negative assertion: 'not aware of having experienced any prior material data breaches'"),
    "837e31d5": ("Strategy Integration", "Negative assertion: 'did not experience any cybersecurity incident during 2024'"),
}
# ── T5 CODEBOOK RESOLUTIONS ──────────────────────────────────────────────
|
||||||
|
#
|
||||||
|
# Additional rule-based overrides for T5-plurality cases where codebook
|
||||||
|
# tiebreakers clearly resolve the disagreement. Applied AFTER plurality
|
||||||
|
# resolution as a correction layer.
|
||||||
|
#
|
||||||
|
# SI<->ID tiebreaker: "DESCRIBES what happened -> ID; ONLY discusses
|
||||||
|
# cost/materiality -> SI; brief mention + materiality conclusion -> SI"
|
||||||
|
#
|
||||||
|
# TP<->RMP central-topic test: third parties supporting internal
|
||||||
|
# program -> RMP; vendor oversight as central topic -> TP
|
||||||
|
|
||||||
|
T5_CODEBOOK_OVERRIDES: dict[str, tuple[str, str]] = {
|
||||||
|
# ── SI<->ID: materiality assessment without incident narrative -> SI ──
|
||||||
|
"15e7cf99": ("Strategy Integration", "SI/ID tiebreaker: 'have not encountered any risks' -- materiality assessment, no specific incident described"),
|
||||||
|
# ── SI<->ID: specific incident with date -> ID ────────────────────────
|
||||||
|
"6dc6bb4a": ("Incident Disclosure", "SI/ID tiebreaker: 'ransomware attack in October 2021' -- describes specific incident with date"),
|
||||||
|
# ── TP<->RMP: third parties supporting internal program -> RMP ────────
|
||||||
|
"c71739a9": ("Risk Management Process", "TP/RMP: Fund relies on CCO and adviser's risk management expertise -- third parties supporting internal process"),
|
||||||
|
# ── ID<->SI: negative assertion = materiality assessment -> SI ────────
|
||||||
|
"0ceeb618": ("Strategy Integration", "ID/SI: opens with 'no material incidents', Feb 2025 incident is brief context + 'has not had material impact' conclusion. Materiality assessment frame dominates"),
|
||||||
|
"cc82eb9f": ("Strategy Integration", "ID/SI: June 2018 incident is example within broader negative materiality assertion ('have not materially affected us'). Assessment frame dominates"),
|
||||||
|
# ── SPAC rule (Case 8): pre-revenue company -> N/O ────────────────────
|
||||||
|
"203ccd43": ("None/Other", "SPAC: 'once the Company commences operations' -- pre-revenue company. Case 8: SPAC -> N/O regardless of management role language"),
|
||||||
|
# ── ID->RMP: post-incident improvements, no incident described ────────
|
||||||
|
"f549fd64": ("Risk Management Process", "ID/RMP: 'Following this cybersecurity event' -- refers to incident without describing it. 100% of content is hardening, training, MFA, EDR -- pure RMP"),
|
||||||
|
# ── Board-removal test: BG override where board mention is incidental ──
|
||||||
|
"22da6695": ("Risk Management Process", "Board-removal: 'Board is also responsible' (1 sentence) + 'notifying the Board' (final clause). Remove -> CISO + IS Program + incident response plan. Process dominates"),
|
||||||
|
"a2ff7e1e": ("Management Role", "Committee-level: Compliance Committee is management-level (O'Reilly executives). Board is incidental destination (2 clauses). Titled 'Management's Role'"),
|
||||||
|
"cb518f47": ("Management Role", "Board-removal: remove notification sentence -> 'management oversees cybersecurity.' Board is incident notification destination only"),
|
||||||
|
}
|
||||||
|
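
# Keys in both override dicts are 8-character paragraphId prefixes; the
# adjudication loop below looks them up via pid[:8].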


def load_jsonl(path: Path) -> list[dict]:
    with open(path) as f:
        return [json.loads(line) for line in f]


def load_paragraph_texts() -> dict[str, str]:
    """Load holdout paragraph texts for text-based adjudication rules."""
    return {r["id"]: r["text"] for r in load_jsonl(ROOT / "data/gold/paragraphs-holdout.jsonl")}


def majority_vote(votes: list[str]) -> str | None:
    if not votes:
        return None
    return Counter(votes).most_common(1)[0][0]
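

# A minimal sanity check of the tie behavior (hypothetical votes):
# Counter.most_common keeps first-insertion order for equal counts, so a
# 1-1 tie resolves to the earliest-seen label.
assert majority_vote(["SI", "NO", "SI"]) == "SI"
assert majority_vote(["SI", "NO"]) == "SI"  # tie -> first seen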


def main() -> None:
    # ── Load data ─────────────────────────────────────────────────────────

    human_labels: dict[str, list[dict]] = defaultdict(list)
    for r in load_jsonl(ROOT / "data/gold/human-labels-raw.jsonl"):
        human_labels[r["paragraphId"]].append({
            "cat": r["contentCategory"],
            "spec": r["specificityLevel"],
            "annotator": r["annotatorName"],
        })

    confusion_pids = {r["paragraphId"] for r in load_jsonl(ROOT / "data/gold/holdout-rerun-v35.jsonl")}

    TOP6 = ["Opus", "GPT-5.4", "Gemini", "GLM-5", "Kimi", "MIMO"]

    def load_model_cats(files: dict[str, Path]) -> dict[str, dict[str, str]]:
        result: dict[str, dict[str, str]] = {}
        for name, path in files.items():
            result[name] = {}
            if path.exists():
                for r in load_jsonl(path):
                    cat = r.get("label", {}).get("content_category") or r.get("content_category")
                    if cat:
                        result[name][r["paragraphId"]] = cat
        return result
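
    # Annotation records come in two shapes -- category nested under "label"
    # or flat at the top level; the `or` fallback above accepts either.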

    v30_cats = load_model_cats({
        "Opus": ROOT / "data/annotations/golden/opus.jsonl",
        "GPT-5.4": ROOT / "data/annotations/bench-holdout/gpt-5.4.jsonl",
        "Gemini": ROOT / "data/annotations/bench-holdout/gemini-3.1-pro-preview.jsonl",
        "GLM-5": ROOT / "data/annotations/bench-holdout/glm-5:exacto.jsonl",
        "Kimi": ROOT / "data/annotations/bench-holdout/kimi-k2.5.jsonl",
        "MIMO": ROOT / "data/annotations/bench-holdout/mimo-v2-pro:exacto.jsonl",
    })

    v35_cats = load_model_cats({
        "Opus": ROOT / "data/annotations/golden-v35/opus.jsonl",
        "GPT-5.4": ROOT / "data/annotations/bench-holdout-v35/gpt-5.4.jsonl",
        "Gemini": ROOT / "data/annotations/bench-holdout-v35/gemini-3.1-pro-preview.jsonl",
        "GLM-5": ROOT / "data/annotations/bench-holdout-v35/glm-5:exacto.jsonl",
        "Kimi": ROOT / "data/annotations/bench-holdout-v35/kimi-k2.5.jsonl",
        "MIMO": ROOT / "data/annotations/bench-holdout-v35/mimo-v2-pro:exacto.jsonl",
    })

    # Use v3.5 labels for confusion-axis PIDs (codebook-corrected), v3.0 for rest
    model_cats: dict[str, dict[str, str]] = {}
    for m in TOP6:
        model_cats[m] = {}
        for pid in human_labels:
            if pid in confusion_pids and pid in v35_cats.get(m, {}):
                model_cats[m][pid] = v35_cats[m][pid]
            elif pid in v30_cats.get(m, {}):
                model_cats[m][pid] = v30_cats[m][pid]

    # Load model specificity for hybrid specificity (v3.0 for full coverage)
    def load_model_specs(files: dict[str, Path]) -> dict[str, dict[str, int]]:
        result: dict[str, dict[str, int]] = {}
        for name, path in files.items():
            result[name] = {}
            if path.exists():
                for r in load_jsonl(path):
                    spec = r.get("label", {}).get("specificity_level") or r.get("specificity_level")
                    if spec is not None:
                        result[name][r["paragraphId"]] = spec
        return result

    model_specs = load_model_specs({
        "Opus": ROOT / "data/annotations/golden/opus.jsonl",
        "GPT-5.4": ROOT / "data/annotations/bench-holdout/gpt-5.4.jsonl",
        "Gemini": ROOT / "data/annotations/bench-holdout/gemini-3.1-pro-preview.jsonl",
        "GLM-5": ROOT / "data/annotations/bench-holdout/glm-5:exacto.jsonl",
        "Kimi": ROOT / "data/annotations/bench-holdout/kimi-k2.5.jsonl",
        "MIMO": ROOT / "data/annotations/bench-holdout/mimo-v2-pro:exacto.jsonl",
    })

    # Load paragraph texts for text-based adjudication rules
    para_texts = load_paragraph_texts()

    # ── Adjudicate ────────────────────────────────────────────────────────

    results: list[dict] = []
    tier_counts: Counter[str] = Counter()

    for pid in sorted(human_labels.keys()):
        h_cats = [l["cat"] for l in human_labels[pid]]
        h_specs = [l["spec"] for l in human_labels[pid]]
        h_cat_maj = majority_vote(h_cats)
        h_spec_maj = majority_vote(h_specs)
        h_cat_unanimous = len(set(h_cats)) == 1

        m_cats_list = [model_cats[m][pid] for m in TOP6 if pid in model_cats[m]]
        m_cat_maj = majority_vote(m_cats_list)
        m_cat_unanimous = len(set(m_cats_list)) == 1 and len(m_cats_list) == 6

        all_signals = h_cats + m_cats_list
        signal_counter = Counter(all_signals)
        total_signals = len(all_signals)
        top_signal, top_count = signal_counter.most_common(1)[0]

        short_pid = pid[:8]
        si_override = SI_NO_OVERRIDES.get(short_pid)
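
        # Tier cascade, in precedence order:
        #   T3-rule (SI/NO override) -> T1-super (>=8 signals agree)
        #   -> T2-cross (human and model majorities match)
        #   -> T4-model (6/6 models unanimous)
        #   -> T3-rule (T5 codebook override)
        #   -> T5-plurality (plurality after dropping unsupported BG votes).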
        gold_cat: str | None = None
        tier: str = ""
        reason: str = ""

        if si_override:
            gold_cat = si_override[0]
            tier = "T3-rule"
            reason = f"SI/NO override: {si_override[1]}"
        elif top_count >= 8 and total_signals >= 8:
            gold_cat = top_signal
            tier = "T1-super"
            reason = f"{top_count}/{total_signals} signals agree"
        elif h_cat_maj == m_cat_maj:
            gold_cat = h_cat_maj
            tier = "T2-cross"
            reason = "Human + model majority agree"
        elif m_cat_unanimous:
            # All 6 models unanimous. Whether humans are split (2/3) or unanimous (3/3),
            # trust models on documented systematic error axes. Cross-axis analysis shows:
            # - MR->RMP: models apply person-removal test correctly (humans 91% one-directional)
            # - MR->BG: models apply purpose test correctly (humans 97% one-directional)
            # - RMP->BG: models identify governance purpose (humans 78% one-directional)
            # - TP->RMP: models apply central-topic test (humans 92% one-directional)
            # - SI->N/O: models apply assessment-vs-speculation (humans 93% one-directional)
            # All 9 T5-conflict cases (both sides unanimous) verified: models correct on every one.
            gold_cat = m_cat_maj
            tier = "T4-model"
            h_count = Counter(h_cats).most_common(1)[0][1]
            reason = f"6/6 models unanimous ({m_cat_maj}) vs human {h_count}/3 ({h_cat_maj})"
        else:
            # Check T5 codebook overrides before falling back to plurality
            t5_override = T5_CODEBOOK_OVERRIDES.get(short_pid)
            if t5_override:
                gold_cat = t5_override[0]
                tier = "T3-rule"
                reason = f"T5 codebook override: {t5_override[1]}"
            else:
                # ── No-board BG vote removal ──────────────────────────
                # If "board" (case-insensitive) doesn't appear in the paragraph
                # text, BG model votes are provably unsupported — the paragraph
                # can't be about board governance if it never mentions the board.
                # Remove those BG signals and recalculate plurality.
                # Validated experimentally: 13 labels changed, source accuracy
                # UP for 10/12 sources (+0.5-0.8% for top annotators/models).
                t5_signals = list(all_signals)
                para_text = para_texts.get(pid, "")
                if "board" not in para_text.lower():
                    bg_count = sum(1 for s in t5_signals if s == "Board Governance")
                    if bg_count > 0:
                        t5_signals = [s for s in t5_signals if s != "Board Governance"]

                if t5_signals:
                    t5_counter = Counter(t5_signals)
                    t5_top, t5_top_count = t5_counter.most_common(1)[0]
                    t5_total = len(t5_signals)
                else:
                    t5_top, t5_top_count, t5_total = top_signal, top_count, total_signals

                gold_cat = t5_top
                tier = "T5-plurality"
                reason = f"Mixed: human={h_cat_maj}, model={m_cat_maj}, plurality={t5_top} ({t5_top_count}/{t5_total})"

        # ── Specificity: hybrid human/model ──────────────────────────
        # Human consensus on specificity is only 52.5%, while model-model
        # agreement is 87-91%. When humans are unanimous (3/3), trust their
        # label. When humans split, use model majority (more reliable).
        h_spec_unanimous = len(set(h_specs)) == 1
        if h_spec_unanimous:
            gold_spec = h_spec_maj
        else:
            m_specs = [model_specs[m][pid] for m in TOP6 if pid in model_specs[m]]
            if m_specs:
                gold_spec = int(majority_vote([str(s) for s in m_specs]) or h_spec_maj)
            else:
                gold_spec = h_spec_maj

        tier_counts[tier] += 1
        results.append({
            "paragraphId": pid,
            "gold_category": gold_cat,
            "gold_specificity": gold_spec,
            "tier": tier,
            "reason": reason,
            "human_majority": h_cat_maj,
            "model_majority": m_cat_maj,
            "human_votes": dict(Counter(h_cats)),
            "model_votes": dict(Counter(m_cats_list)),
        })

    # ── Write output ──────────────────────────────────────────────────────

    output_path = ROOT / "data/gold/gold-adjudicated.jsonl"
    with open(output_path, "w") as f:
        for r in results:
            f.write(json.dumps(r) + "\n")

    # ── Summary ───────────────────────────────────────────────────────────

    print("=" * 90)
    print("GOLD SET ADJUDICATION SUMMARY")
    print("=" * 90)
    print(f"\nTotal paragraphs: {len(results)}")
    print(f"\nTier breakdown:")
    for tier, count in sorted(tier_counts.items()):
        pct = count / len(results) * 100
        print(f" {tier:<16} {count:>5} ({pct:.1f}%)")

    flipped = sum(1 for r in results if r["gold_category"] != r["human_majority"])
    print(f"\nGold labels differing from human majority: {flipped} ({flipped / len(results):.1%})")

    print(f"\nCategory distribution:")
    h_dist = Counter(r["human_majority"] for r in results)
    g_dist = Counter(r["gold_category"] for r in results)
    print(f" {'Category':<25} {'Human Maj':>10} {'Gold':>10} {'Delta':>6}")
    for cat in sorted(set(list(h_dist.keys()) + list(g_dist.keys()))):
        print(f" {cat:<25} {h_dist.get(cat, 0):>10} {g_dist.get(cat, 0):>10} {g_dist.get(cat, 0) - h_dist.get(cat, 0):>+6}")

    gold_by_pid = {r["paragraphId"]: r["gold_category"] for r in results}

    print(f"\n{'=' * 90}")
    print("SOURCE ACCURACY vs ADJUDICATED GOLD")
    print(f"{'=' * 90}")

    annotator_names = sorted(set(l["annotator"] for labels in human_labels.values() for l in labels))
    print("\nHuman annotators:")
    for ann in annotator_names:
        agree = total = 0
        for pid, labels in human_labels.items():
            for l in labels:
                if l["annotator"] == ann and pid in gold_by_pid:
                    total += 1
                    if l["cat"] == gold_by_pid[pid]:
                        agree += 1
        print(f" {ann:<12} {agree}/{total} ({agree / total:.1%})")

    print("\nModels (v3.0 on full 1200):")
    for m in TOP6:
        agree = total = 0
        for pid in gold_by_pid:
            if pid in v30_cats.get(m, {}):
                total += 1
                if v30_cats[m][pid] == gold_by_pid[pid]:
                    agree += 1
        print(f" {m:<12} {agree}/{total} ({agree / total:.1%})")

    print(f"\nOutput: {output_path}")


if __name__ == "__main__":
    main()
620 scripts/audit-stage1-labels.py (new file)
@@ -0,0 +1,620 @@
"""
Audit Stage 1 annotations for systematic SI↔N/O miscoding.

Stage 1 used prompt v2.5 which lacked the rule "materiality disclaimers → SI."
This script quantifies how many N/O labels likely should have been SI, plus
other potential miscoding axes.

Run: uv run scripts/audit-stage1-labels.py
"""

import json
import re
from collections import Counter, defaultdict
from pathlib import Path

# ── Paths ──────────────────────────────────────────────────────────────
ROOT = Path(__file__).resolve().parent.parent
ANNOTATIONS = ROOT / "data" / "annotations" / "stage1.patched.jsonl"
PARAGRAPHS = ROOT / "data" / "paragraphs" / "paragraphs-clean.patched.jsonl"
PARAGRAPHS_FALLBACK = ROOT / "data" / "paragraphs" / "paragraphs-clean.jsonl"
HOLDOUT = ROOT / "data" / "gold" / "paragraphs-holdout.jsonl"
HUMAN_LABELS = ROOT / "data" / "gold" / "human-labels-raw.jsonl"


# ── Materiality regex patterns ─────────────────────────────────────────
# Pattern 1: "material" near business/strategy language (within ~15 words)
PAT_MATERIAL_NEAR_BIZ = re.compile(
    r"material(?:ly)?\b.{0,100}\b(?:business\s+strategy|results\s+of\s+operations|financial\s+condition|business|operations)"
    r"|"
    r"(?:business\s+strategy|results\s+of\s+operations|financial\s+condition)\b.{0,100}\bmaterial(?:ly)?",
    re.IGNORECASE,
)

# Pattern 2: specific materiality disclaimer phrases
PAT_MATERIALITY_DISCLAIMER = re.compile(
    r"have\s+not\s+materially\s+affected"
    r"|has\s+not\s+materially\s+affected"
    r"|could\s+materially\s+affect"
    r"|could\s+have\s+a\s+material\s+(?:adverse\s+)?(?:effect|impact)"
    r"|may\s+(?:materially|have\s+a\s+material)\s+(?:adverse\s+)?(?:effect|impact|affect)"
    r"|reasonably\s+likely\s+to\s+materially\s+affect"
    r"|not\s+reasonably\s+likely"
    r"|materially\s+(?:adverse(?:ly)?|impact|affect)"
    r"|material\s+adverse\s+(?:effect|impact)"
    r"|no\s+material\s+(?:adverse\s+)?(?:effect|impact)"
    r"|did\s+not\s+(?:have\s+a\s+)?material(?:ly)?\s+(?:adverse\s+)?(?:effect|impact|affect)",
    re.IGNORECASE,
)
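
# Note: this pattern casts a wide net -- it matches backward-looking
# assessments ("have not materially affected") as well as boilerplate
# speculation ("could have a material adverse effect"), so hits are
# candidates for review, not confirmed miscodes.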

# Pattern 3: explicit SI-relevant phrases
PAT_SI_PHRASES = re.compile(
    r"business\s+strategy"
    r"|results\s+of\s+operations"
    r"|financial\s+condition"
    r"|integrated\s+(?:into|with)\s+(?:our\s+)?(?:overall|business)"
    r"|part\s+of\s+(?:our\s+)?(?:overall|broader)\s+(?:risk|enterprise|business)",
    re.IGNORECASE,
)


def has_materiality_language(text: str) -> bool:
    """Returns True if text contains materiality-related language indicative of SI."""
    return bool(
        PAT_MATERIALITY_DISCLAIMER.search(text)
        or PAT_SI_PHRASES.search(text)
        or PAT_MATERIAL_NEAR_BIZ.search(text)
    )
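

# Quick sanity checks with hypothetical phrases (not drawn from any filing);
# cheap enough to run at import time:
assert has_materiality_language("These risks have not materially affected our business.")
assert has_materiality_language("Cybersecurity is integrated into our overall risk program.")
assert not has_materiality_language("We deploy firewalls and multi-factor authentication.")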


# ── Insurance / budget / incident patterns ─────────────────────────────
PAT_INSURANCE = re.compile(r"\binsurance\b", re.IGNORECASE)
PAT_BUDGET = re.compile(r"\b(?:budget|investment(?:s)?)\b", re.IGNORECASE)
PAT_INCIDENT = re.compile(
    r"\bwe\s+(?:experienced|suffered|detected|identified|discovered|encountered|were\s+subject\s+to)\b",
    re.IGNORECASE,
)

# ── Cross-category confusion patterns ──────────────────────────────────
PAT_PROGRAM_FRAMEWORK = re.compile(
    r"\b(?:program|framework|process(?:es)?|procedure(?:s)?)\b", re.IGNORECASE
)
PAT_TITLE = re.compile(
    r"\b(?:Chief\s+(?:Information|Technology|Executive|Financial|Security|Operating|Risk)\s+(?:Officer|Security\s+Officer))"
    r"|(?:CISO|CIO|CTO|CFO|CEO|COO|CRO)\b"
    r"|\b(?:Vice\s+President|Director|Senior\s+Vice\s+President|EVP|SVP)\b",
    re.IGNORECASE,
)
PAT_MANAGEMENT_OFFICERS = re.compile(
    r"\b(?:management|officer(?:s)?|executive(?:s)?|leader(?:s)?(?:hip)?)\b",
    re.IGNORECASE,
)


def separator(title: str) -> None:
    width = 80
    print()
    print("=" * width)
    print(f" {title}")
    print("=" * width)


def print_example(idx: int, pid: str, text: str, extra: str = "") -> None:
    print(f"\n [{idx}] paragraphId: {pid}")
    if extra:
        print(f" {extra}")
    # Truncate long texts (~500 chars) for readability
    wrapped = text
    if len(wrapped) > 500:
        wrapped = wrapped[:500] + "..."
    print(f" TEXT: {wrapped}")


# ── Load data ──────────────────────────────────────────────────────────
def load_annotations() -> dict[str, list[dict]]:
    """Returns {paragraphId: [annotation, ...]}"""
    by_para: dict[str, list[dict]] = defaultdict(list)
    with open(ANNOTATIONS) as f:
        for line in f:
            d = json.loads(line)
            pid = d["paragraphId"]
            cat = d["label"]["content_category"]
            model = d["provenance"]["modelId"]
            by_para[pid].append({"category": cat, "model": model})
    return dict(by_para)


def load_paragraphs() -> dict[str, str]:
    """Returns {paragraphId: text}"""
    texts: dict[str, str] = {}
    path = PARAGRAPHS if PARAGRAPHS.exists() else PARAGRAPHS_FALLBACK
    with open(path) as f:
        for line in f:
            d = json.loads(line)
            texts[d["id"]] = d["text"]
    return texts


def load_holdout() -> dict[str, dict]:
    """Returns {paragraphId: {text, stage1Category, stage1Method, ...}}"""
    holdout: dict[str, dict] = {}
    with open(HOLDOUT) as f:
        for line in f:
            d = json.loads(line)
            holdout[d["id"]] = d
    return holdout


def load_human_labels() -> dict[str, list[dict]]:
    """Returns {paragraphId: [{annotatorName, contentCategory}, ...]}"""
    labels: dict[str, list[dict]] = defaultdict(list)
    with open(HUMAN_LABELS) as f:
        for line in f:
            d = json.loads(line)
            labels[d["paragraphId"]].append(
                {
                    "annotator": d["annotatorName"],
                    "category": d["contentCategory"],
                    "specificity": d["specificityLevel"],
                }
            )
    return dict(labels)


def main() -> None:
    print("Loading data...")
    annotations = load_annotations()
    texts = load_paragraphs()
    holdout = load_holdout()
    human_labels = load_human_labels()

    print(f" Annotations: {sum(len(v) for v in annotations.values())} across {len(annotations)} paragraphs")
    print(f" Paragraph texts loaded: {len(texts)}")
    print(f" Holdout paragraphs: {len(holdout)}")
    print(f" Human-labeled paragraphs: {len(human_labels)}")

    # ── Classify each paragraph by voting ──────────────────────────────
    unanimous_no: list[str] = []
    majority_no: list[str] = []  # 2/3 N/O
    unanimous_si: list[str] = []
    unanimous_mr: list[str] = []
    unanimous_rmp: list[str] = []
    unanimous_bg: list[str] = []
    all_unanimous: dict[str, str] = {}  # pid -> category for unanimous

    for pid, anns in annotations.items():
        cats = [a["category"] for a in anns]
        cat_counts = Counter(cats)

        if len(cats) != 3:
            continue  # skip incomplete

        if cat_counts.get("None/Other", 0) == 3:
            unanimous_no.append(pid)
            all_unanimous[pid] = "None/Other"
        elif cat_counts.get("None/Other", 0) == 2:
            majority_no.append(pid)
        elif cat_counts.get("Strategy Integration", 0) == 3:
            unanimous_si.append(pid)
            all_unanimous[pid] = "Strategy Integration"
        elif cat_counts.get("Management Role", 0) == 3:
            unanimous_mr.append(pid)
            all_unanimous[pid] = "Management Role"
        elif cat_counts.get("Risk Management Process", 0) == 3:
            unanimous_rmp.append(pid)
            all_unanimous[pid] = "Risk Management Process"
        elif cat_counts.get("Board Governance", 0) == 3:
            unanimous_bg.append(pid)
            all_unanimous[pid] = "Board Governance"

        # Track all unanimous
        if len(cat_counts) == 1:
            all_unanimous[pid] = cats[0]
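
    # Note: only N/O gets a 2/3-majority bucket; the other categories are
    # tracked only when all three Stage 1 models agree.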
print(f"\n Unanimous N/O: {len(unanimous_no)}")
|
||||||
|
print(f" Majority N/O (2/3): {len(majority_no)}")
|
||||||
|
print(f" Unanimous SI: {len(unanimous_si)}")
|
||||||
|
print(f" Unanimous MR: {len(unanimous_mr)}")
|
||||||
|
print(f" Unanimous RMP: {len(unanimous_rmp)}")
|
||||||
|
print(f" Unanimous BG: {len(unanimous_bg)}")
|
||||||
|
print(f" Total unanimous (any): {len(all_unanimous)}")
|
||||||
|
|
||||||
|
# ════════════════════════════════════════════════════════════════════
|
||||||
|
# 1. Unanimous N/O with materiality language
|
||||||
|
# ════════════════════════════════════════════════════════════════════
|
||||||
|
separator("1. UNANIMOUS N/O WITH MATERIALITY LANGUAGE")
|
||||||
|
|
||||||
|
no_with_mat: list[tuple[str, str]] = []
|
||||||
|
no_without_text = 0
|
||||||
|
for pid in unanimous_no:
|
||||||
|
text = texts.get(pid)
|
||||||
|
if text is None:
|
||||||
|
no_without_text += 1
|
||||||
|
continue
|
||||||
|
if has_materiality_language(text):
|
||||||
|
no_with_mat.append((pid, text))
|
||||||
|
|
||||||
|
print(f"\n Total unanimous N/O: {len(unanimous_no)}")
|
||||||
|
print(f" Missing text: {no_without_text}")
|
||||||
|
print(f" With materiality language: {len(no_with_mat)}")
|
||||||
|
print(f" Percentage of unanimous N/O: {len(no_with_mat) / max(1, len(unanimous_no)) * 100:.1f}%")
|
||||||
|
|
||||||
|
print(f"\n --- 10 representative examples ---")
|
||||||
|
# Pick a diverse sample: take every Nth
|
||||||
|
step = max(1, len(no_with_mat) // 10)
|
||||||
|
shown = 0
|
||||||
|
for i in range(0, len(no_with_mat), step):
|
||||||
|
if shown >= 10:
|
||||||
|
break
|
||||||
|
pid, text = no_with_mat[i]
|
||||||
|
print_example(shown + 1, pid, text)
|
||||||
|
shown += 1
|
||||||
|
|
||||||
|
# ════════════════════════════════════════════════════════════════════
|
||||||
|
# 2. Majority N/O with materiality language
|
||||||
|
# ════════════════════════════════════════════════════════════════════
|
||||||
|
separator("2. MAJORITY N/O (2/3) WITH MATERIALITY LANGUAGE")
|
||||||
|
|
||||||
|
maj_no_with_mat: list[tuple[str, str, str, str]] = [] # pid, text, dissenting_model, dissenting_cat
|
||||||
|
for pid in majority_no:
|
||||||
|
text = texts.get(pid)
|
||||||
|
if text is None:
|
||||||
|
continue
|
||||||
|
if has_materiality_language(text):
|
||||||
|
anns = annotations[pid]
|
||||||
|
for a in anns:
|
||||||
|
if a["category"] != "None/Other":
|
||||||
|
maj_no_with_mat.append((pid, text, a["model"], a["category"]))
|
||||||
|
break
|
||||||
|
|
||||||
|
print(f"\n Total majority N/O (2/3): {len(majority_no)}")
|
||||||
|
print(f" With materiality language: {len(maj_no_with_mat)}")
|
||||||
|
print(f" Percentage: {len(maj_no_with_mat) / max(1, len(majority_no)) * 100:.1f}%")
|
||||||
|
|
||||||
|
# Count dissenting categories
|
||||||
|
dissent_cats = Counter(x[3] for x in maj_no_with_mat)
|
||||||
|
print(f"\n Dissenting model voted:")
|
||||||
|
for cat, cnt in dissent_cats.most_common():
|
||||||
|
print(f" {cat}: {cnt}")
|
||||||
|
|
||||||
|
# Count dissenting models
|
||||||
|
dissent_models = Counter(x[2] for x in maj_no_with_mat)
|
||||||
|
print(f"\n Which models dissented:")
|
||||||
|
for model, cnt in dissent_models.most_common():
|
||||||
|
print(f" {model}: {cnt}")
|
||||||
|
|
||||||
|
print(f"\n --- 5 examples ---")
|
||||||
|
step = max(1, len(maj_no_with_mat) // 5)
|
||||||
|
shown = 0
|
||||||
|
for i in range(0, len(maj_no_with_mat), step):
|
||||||
|
if shown >= 5:
|
||||||
|
break
|
||||||
|
pid, text, model, cat = maj_no_with_mat[i]
|
||||||
|
print_example(shown + 1, pid, text, f"Dissent: {model} → {cat}")
|
||||||
|
shown += 1
|
||||||
|
|
||||||
|
# ════════════════════════════════════════════════════════════════════
|
||||||
|
# 3. Unanimous SI examples (contrast)
|
||||||
|
# ════════════════════════════════════════════════════════════════════
|
||||||
|
separator("3. UNANIMOUS SI — WHAT CLEAN SI LOOKS LIKE")
|
||||||
|
|
||||||
|
si_examples: list[tuple[str, str]] = []
|
||||||
|
for pid in unanimous_si:
|
||||||
|
text = texts.get(pid)
|
||||||
|
if text:
|
||||||
|
si_examples.append((pid, text))
|
||||||
|
if len(si_examples) >= 20:
|
||||||
|
break
|
||||||
|
|
||||||
|
print(f"\n Total unanimous SI: {len(unanimous_si)}")
|
||||||
|
print(f"\n --- 5 examples ---")
|
||||||
|
for i, (pid, text) in enumerate(si_examples[:5]):
|
||||||
|
print_example(i + 1, pid, text)
|
||||||
|
|
||||||
|
# Analyze SI language patterns
|
||||||
|
si_has_materiality = sum(1 for pid in unanimous_si if pid in texts and has_materiality_language(texts[pid]))
|
||||||
|
si_has_insurance = sum(1 for pid in unanimous_si if pid in texts and PAT_INSURANCE.search(texts[pid]))
|
||||||
|
si_has_budget = sum(1 for pid in unanimous_si if pid in texts and PAT_BUDGET.search(texts[pid]))
|
||||||
|
print(f"\n SI language patterns:")
|
||||||
|
print(f" With materiality language: {si_has_materiality} / {len(unanimous_si)} ({si_has_materiality / max(1, len(unanimous_si)) * 100:.1f}%)")
|
||||||
|
print(f" Mention insurance: {si_has_insurance} / {len(unanimous_si)}")
|
||||||
|
print(f" Mention budget/investment: {si_has_budget} / {len(unanimous_si)}")
|
||||||
|
|
||||||
|
# ════════════════════════════════════════════════════════════════════
|
||||||
|
# 4. N/O with other potential miscoding
|
||||||
|
# ════════════════════════════════════════════════════════════════════
|
||||||
|
separator("4. N/O PARAGRAPHS WITH OTHER POTENTIAL MISCODING")
|
||||||
|
|
||||||
|
no_insurance: list[tuple[str, str]] = []
|
||||||
|
no_budget: list[tuple[str, str]] = []
|
||||||
|
no_incident: list[tuple[str, str]] = []
|
||||||
|
|
||||||
|
for pid in unanimous_no:
|
||||||
|
text = texts.get(pid)
|
||||||
|
if text is None:
|
||||||
|
continue
|
||||||
|
if PAT_INSURANCE.search(text):
|
||||||
|
no_insurance.append((pid, text))
|
||||||
|
if PAT_BUDGET.search(text):
|
||||||
|
no_budget.append((pid, text))
|
||||||
|
if PAT_INCIDENT.search(text):
|
||||||
|
no_incident.append((pid, text))
|
||||||
|
|
||||||
|
print(f"\n Unanimous N/O mentioning insurance: {len(no_insurance)}")
|
||||||
|
print(f" Unanimous N/O mentioning budget/investment: {len(no_budget)}")
|
||||||
|
print(f" Unanimous N/O mentioning incidents ('we experienced...'): {len(no_incident)}")
|
||||||
|
|
||||||
|
# Show examples for each
|
||||||
|
print(f"\n --- Insurance examples (up to 3) ---")
|
||||||
|
for i, (pid, text) in enumerate(no_insurance[:3]):
|
||||||
|
print_example(i + 1, pid, text)
|
||||||
|
|
||||||
|
print(f"\n --- Budget/investment examples (up to 3) ---")
|
||||||
|
for i, (pid, text) in enumerate(no_budget[:3]):
|
||||||
|
print_example(i + 1, pid, text)
|
||||||
|
|
||||||
|
print(f"\n --- Incident examples (up to 3) ---")
|
||||||
|
for i, (pid, text) in enumerate(no_incident[:3]):
|
||||||
|
print_example(i + 1, pid, text)
|
||||||
|
|
||||||
|
# ════════════════════════════════════════════════════════════════════
|
||||||
|
# 5. Scale the problem
|
||||||
|
# ════════════════════════════════════════════════════════════════════
|
||||||
|
separator("5. SCALE THE PROBLEM")
|
||||||
|
|
||||||
|
# Deduplicate: some paragraphs may hit multiple patterns
|
||||||
|
no_any_miscoded = set()
|
||||||
|
for pid, _ in no_with_mat:
|
||||||
|
no_any_miscoded.add(pid)
|
||||||
|
for pid, _ in no_insurance:
|
||||||
|
no_any_miscoded.add(pid)
|
||||||
|
for pid, _ in no_budget:
|
||||||
|
no_any_miscoded.add(pid)
|
||||||
|
no_incident_pids = set(pid for pid, _ in no_incident)
|
||||||
|
|
||||||
|
# Materiality-only (not already insurance/budget)
|
||||||
|
mat_only = set(pid for pid, _ in no_with_mat)
|
||||||
|
ins_only = set(pid for pid, _ in no_insurance) - mat_only
|
||||||
|
bud_only = set(pid for pid, _ in no_budget) - mat_only - ins_only
|
||||||
|
|
||||||
|
total_unanimous = len(all_unanimous)
|
||||||
|
total_annotations = len(annotations)
|
||||||
|
|
||||||
|
print(f"\n Total paragraphs with 3 annotations: {total_annotations}")
|
||||||
|
print(f" Total unanimous (any category): {total_unanimous}")
|
||||||
|
print(f" Total unanimous N/O: {len(unanimous_no)}")
|
||||||
|
print()
|
||||||
|
print(f" Potentially miscoded unanimous N/O:")
|
||||||
|
print(f" Materiality language (likely SI): {len(no_with_mat)}")
|
||||||
|
print(f" Insurance (likely SI): {len(no_insurance)}")
|
||||||
|
print(f" Budget/investment (likely SI): {len(no_budget)}")
|
||||||
|
print(f" Incident language (likely SI or ID): {len(no_incident)}")
|
||||||
|
print(f" Any of above (deduplicated): {len(no_any_miscoded)}")
|
||||||
|
print(f" Incident (separate concern): {len(no_incident_pids)}")
|
||||||
|
print()
|
||||||
|
|
||||||
|
# Overlap analysis
|
||||||
|
mat_set = set(pid for pid, _ in no_with_mat)
|
||||||
|
ins_set = set(pid for pid, _ in no_insurance)
|
||||||
|
bud_set = set(pid for pid, _ in no_budget)
|
||||||
|
print(f" Overlap analysis:")
|
||||||
|
print(f" Materiality ∩ Insurance: {len(mat_set & ins_set)}")
|
||||||
|
print(f" Materiality ∩ Budget: {len(mat_set & bud_set)}")
|
||||||
|
print(f" Insurance ∩ Budget: {len(ins_set & bud_set)}")
|
||||||
|
print()
|
||||||
|
|
||||||
|
pct_no_affected = len(no_any_miscoded) / max(1, len(unanimous_no)) * 100
|
||||||
|
pct_total_affected = len(no_any_miscoded) / max(1, total_unanimous) * 100
|
||||||
|
pct_all_affected = len(no_any_miscoded) / max(1, total_annotations) * 100
|
||||||
|
|
||||||
|
print(f" Impact estimates:")
|
||||||
|
print(f" % of unanimous N/O potentially miscoded: {pct_no_affected:.1f}%")
|
||||||
|
print(f" % of all unanimous labels affected: {pct_total_affected:.1f}%")
|
||||||
|
print(f" % of all paragraphs affected: {pct_all_affected:.1f}%")
|
||||||
|
|
||||||
|
# Also check majority N/O
|
||||||
|
maj_no_any = set()
|
||||||
|
for pid in majority_no:
|
||||||
|
text = texts.get(pid)
|
||||||
|
if text is None:
|
||||||
|
continue
|
||||||
|
if has_materiality_language(text) or PAT_INSURANCE.search(text) or PAT_BUDGET.search(text):
|
||||||
|
maj_no_any.add(pid)
|
||||||
|
|
||||||
|
print(f"\n Majority N/O (2/3) potentially miscoded: {len(maj_no_any)} / {len(majority_no)}")
|
||||||
|
print(f" Combined (unanimous + majority) potentially miscoded N/O: {len(no_any_miscoded) + len(maj_no_any)}")
|
||||||
|
|
||||||
|
# ════════════════════════════════════════════════════════════════════
|
||||||
|
# 6. Cross-check with holdout / human labels
|
||||||
|
# ════════════════════════════════════════════════════════════════════
|
||||||
|
separator("6. HOLDOUT CROSS-CHECK WITH HUMAN LABELS")
|
||||||
|
|
||||||
|
# Find holdout paragraphs that Stage 1 unanimously called N/O but contain materiality language
|
||||||
|
holdout_no_mat: list[tuple[str, str]] = []
|
||||||
|
holdout_no_mat_with_human: list[tuple[str, str, list[dict]]] = []
|
||||||
|
|
||||||
|
for pid, para in holdout.items():
|
||||||
|
if para.get("stage1Category") == "None/Other" and para.get("stage1Method") == "unanimous":
|
||||||
|
text = para["text"]
|
||||||
|
if has_materiality_language(text):
|
||||||
|
holdout_no_mat.append((pid, text))
|
||||||
|
if pid in human_labels:
|
||||||
|
holdout_no_mat_with_human.append((pid, text, human_labels[pid]))
|
||||||
|
|
||||||
|
print(f"\n Holdout paragraphs with stage1 unanimous N/O: "
|
||||||
|
f"{sum(1 for p in holdout.values() if p.get('stage1Category') == 'None/Other' and p.get('stage1Method') == 'unanimous')}")
|
||||||
|
print(f" Of those, with materiality language: {len(holdout_no_mat)}")
|
||||||
|
print(f" Of those, with human labels: {len(holdout_no_mat_with_human)}")
|
||||||
|
|
||||||
|
# What did humans call these?
|
||||||
|
if holdout_no_mat_with_human:
|
||||||
|
human_cats_for_flagged = Counter()
|
||||||
|
for pid, text, hlabels in holdout_no_mat_with_human:
|
||||||
|
for hl in hlabels:
|
||||||
|
human_cats_for_flagged[hl["category"]] += 1
|
||||||
|
|
||||||
|
print(f"\n Human labels for flagged paragraphs (Stage1=unanimous N/O, has materiality language):")
|
||||||
|
total_human = sum(human_cats_for_flagged.values())
|
||||||
|
for cat, cnt in human_cats_for_flagged.most_common():
|
||||||
|
print(f" {cat}: {cnt} ({cnt / total_human * 100:.1f}%)")
|
||||||
|
|
||||||
|
print(f"\n --- Examples where humans disagreed with Stage 1 N/O ---")
|
||||||
|
shown = 0
|
||||||
|
for pid, text, hlabels in holdout_no_mat_with_human:
|
||||||
|
non_no = [hl for hl in hlabels if hl["category"] != "None/Other"]
|
||||||
|
if non_no:
|
||||||
|
human_str = ", ".join(f"{hl['annotator']}={hl['category']}" for hl in hlabels)
|
||||||
|
print_example(shown + 1, pid, text, f"Human labels: {human_str}")
|
||||||
|
shown += 1
|
||||||
|
if shown >= 5:
|
||||||
|
break
|
||||||
|
|
||||||
|
# Also show ones where humans agreed it IS N/O
|
||||||
|
print(f"\n --- Examples where humans also said N/O (materiality language is ambiguous) ---")
|
||||||
|
shown = 0
|
||||||
|
for pid, text, hlabels in holdout_no_mat_with_human:
|
||||||
|
all_no = all(hl["category"] == "None/Other" for hl in hlabels)
|
||||||
|
if all_no and len(hlabels) >= 2:
|
||||||
|
print_example(shown + 1, pid, text, "All humans agreed: N/O")
|
||||||
|
shown += 1
|
||||||
|
if shown >= 3:
|
||||||
|
break
|
||||||
|
else:
|
||||||
|
print("\n No human labels available for flagged holdout paragraphs.")
|
||||||
|
|
||||||
|
# Broader holdout analysis: all cases where Stage 1 said N/O but humans said something else
|
||||||
|
separator("6b. HOLDOUT: ALL Stage1=N/O vs HUMAN DISAGREEMENTS")
|
||||||
|
|
||||||
|
holdout_no_all = [pid for pid, p in holdout.items()
|
||||||
|
if p.get("stage1Category") == "None/Other"]
|
||||||
|
stage1_no_human_disagree = []
|
||||||
|
for pid in holdout_no_all:
|
||||||
|
if pid in human_labels:
|
||||||
|
hlabels = human_labels[pid]
|
||||||
|
non_no = [hl for hl in hlabels if hl["category"] != "None/Other"]
|
||||||
|
if non_no:
|
||||||
|
stage1_no_human_disagree.append((pid, holdout[pid]["text"], hlabels))
|
||||||
|
|
||||||
|
print(f"\n All holdout paragraphs with Stage1=N/O (any method): {len(holdout_no_all)}")
|
||||||
|
print(f" Of those with human labels that disagree: {len(stage1_no_human_disagree)}")
|
||||||
|
|
||||||
|
if stage1_no_human_disagree:
|
||||||
|
# What did humans call them?
|
||||||
|
human_override = Counter()
|
||||||
|
for pid, text, hlabels in stage1_no_human_disagree:
|
||||||
|
for hl in hlabels:
|
||||||
|
if hl["category"] != "None/Other":
|
||||||
|
human_override[hl["category"]] += 1
|
||||||
|
print(f"\n Humans' non-N/O labels for Stage1=N/O paragraphs:")
|
||||||
|
for cat, cnt in human_override.most_common():
|
||||||
|
print(f" {cat}: {cnt}")
|
||||||
|
|
||||||
|
# ════════════════════════════════════════════════════════════════════
|
||||||
|
# 7. Other confusion axes
|
||||||
|
# ════════════════════════════════════════════════════════════════════
|
||||||
|
separator("7. OTHER CONFUSION AXES IN STAGE 1")
|
||||||
|
|
||||||
|
# 7a. Unanimous MR with program/framework/process language (potential RMP)
|
||||||
|
mr_with_process = []
|
||||||
|
for pid in unanimous_mr:
|
||||||
|
text = texts.get(pid)
|
||||||
|
if text is None:
|
||||||
|
continue
|
||||||
|
matches = PAT_PROGRAM_FRAMEWORK.findall(text)
|
||||||
|
if len(matches) >= 2: # Multiple mentions = likely process-focused
|
||||||
|
mr_with_process.append((pid, text, matches))
|
||||||
|
|
||||||
|
print(f"\n 7a. Unanimous MR with prominent program/framework/process language")
|
||||||
|
print(f" (>=2 mentions — potentially should be RMP)")
|
||||||
|
print(f" Count: {len(mr_with_process)} / {len(unanimous_mr)} ({len(mr_with_process) / max(1, len(unanimous_mr)) * 100:.1f}%)")
|
||||||
|
print(f"\n --- 3 examples ---")
|
||||||
|
for i, (pid, text, matches) in enumerate(mr_with_process[:3]):
|
||||||
|
print_example(i + 1, pid, text, f"Pattern matches: {matches[:6]}")
|
||||||
|
|
||||||
|
# 7b. Unanimous RMP with specific titles (potential MR)
|
||||||
|
rmp_with_titles = []
|
||||||
|
for pid in unanimous_rmp:
|
||||||
|
text = texts.get(pid)
|
||||||
|
if text is None:
|
||||||
|
continue
|
||||||
|
titles = PAT_TITLE.findall(text)
|
||||||
|
if titles:
|
||||||
|
rmp_with_titles.append((pid, text, titles))
|
||||||
|
|
||||||
|
print(f"\n 7b. Unanimous RMP mentioning specific people/titles")
|
||||||
|
print(f" (potentially should be MR)")
|
||||||
|
print(f" Count: {len(rmp_with_titles)} / {len(unanimous_rmp)} ({len(rmp_with_titles) / max(1, len(unanimous_rmp)) * 100:.1f}%)")
|
||||||
|
print(f"\n --- 3 examples ---")
|
||||||
|
for i, (pid, text, titles) in enumerate(rmp_with_titles[:3]):
|
||||||
|
print_example(i + 1, pid, text, f"Titles found: {titles[:5]}")
|
||||||
|
|
||||||
|

    # 7c. Unanimous BG primarily about management officers
    # (precompiled once; flags paragraphs with management language but no board language)
    board_pattern = re.compile(r"\b(?:board|director(?:s)?|committee|audit)\b", re.IGNORECASE)
    bg_about_mgmt = []
    for pid in unanimous_bg:
        text = texts.get(pid)
        if text is None:
            continue
        has_titles = PAT_TITLE.findall(text)
        has_mgmt = PAT_MANAGEMENT_OFFICERS.findall(text)
        has_board = board_pattern.findall(text)
        if (has_titles or has_mgmt) and not has_board:
            bg_about_mgmt.append((pid, text, has_titles + has_mgmt))

    print(f"\n 7c. Unanimous BG primarily about management (no board/committee language)")
    print(f" Count: {len(bg_about_mgmt)} / {len(unanimous_bg)} ({len(bg_about_mgmt) / max(1, len(unanimous_bg)) * 100:.1f}%)")
    if bg_about_mgmt:
        print(f"\n --- 3 examples ---")
        for i, (pid, text, matches) in enumerate(bg_about_mgmt[:3]):
            print_example(i + 1, pid, text, f"Matches: {matches[:5]}")

    # ════════════════════════════════════════════════════════════════════
    # SUMMARY
    # ════════════════════════════════════════════════════════════════════
    separator("SUMMARY")

    print(f"""
DATASET OVERVIEW
Total paragraphs annotated (3 models each): {total_annotations:,}
Total unanimous labels: {total_unanimous:,}
Unanimous N/O: {len(unanimous_no):,}
Majority N/O (2/3): {len(majority_no):,}

PRIMARY CONCERN: N/O → SI MISCODING
Unanimous N/O with materiality language: {len(no_with_mat):,} ({len(no_with_mat) / max(1, len(unanimous_no)) * 100:.1f}% of unanimous N/O)
Majority N/O with materiality language: {len(maj_no_with_mat):,} ({len(maj_no_with_mat) / max(1, len(majority_no)) * 100:.1f}% of majority N/O)
Unanimous N/O with insurance: {len(no_insurance):,}
Unanimous N/O with budget/investment: {len(no_budget):,}
Unanimous N/O with incident language: {len(no_incident):,}
Total potentially miscoded (deduplicated): {len(no_any_miscoded):,}

IMPACT ON TRAINING SET
% of unanimous N/O affected: {pct_no_affected:.1f}%
% of all unanimous labels affected: {pct_total_affected:.1f}%
% of all paragraphs affected: {pct_all_affected:.1f}%

OTHER CONFUSION AXES
MR ↔ RMP confusion (MR with process language): {len(mr_with_process):,} / {len(unanimous_mr):,}
RMP ↔ MR confusion (RMP with titles): {len(rmp_with_titles):,} / {len(unanimous_rmp):,}
BG about management (no board language): {len(bg_about_mgmt):,} / {len(unanimous_bg):,}

HOLDOUT VALIDATION
Stage1=unanimous N/O with materiality language: {len(holdout_no_mat):,}
Of those with human labels: {len(holdout_no_mat_with_human):,}
""")

    if holdout_no_mat_with_human:
        human_cats_for_flagged = Counter()
        for pid, text, hlabels in holdout_no_mat_with_human:
            for hl in hlabels:
                human_cats_for_flagged[hl["category"]] += 1
        print(" HUMAN VALIDATION (flagged holdout paragraphs):")
        total_h = sum(human_cats_for_flagged.values())
        for cat, cnt in human_cats_for_flagged.most_common():
            print(f" {cat}: {cnt} ({cnt / total_h * 100:.1f}%)")


if __name__ == "__main__":
    main()
333 scripts/compare-v30-v35-final.py (new file)
@@ -0,0 +1,333 @@
"""
Comprehensive comparison of v3.0 vs v3.5f prompt on the 359 confusion-axis holdout paragraphs.
Covers per-model accuracy, per-axis breakdown, SI/NO asymmetry, rankings, convergence, and cost.
"""

import json
from collections import Counter
from itertools import combinations
from pathlib import Path

import numpy as np

ROOT = Path(__file__).resolve().parent.parent

# ---------------------------------------------------------------------------
# Model definitions
# ---------------------------------------------------------------------------
MODELS = [
    ("Opus", "golden", "opus"),
    ("GPT-5.4", "bench-holdout", "gpt-5.4"),
    ("Gemini-3.1-Pro", "bench-holdout", "gemini-3.1-pro-preview"),
    ("GLM-5", "bench-holdout", "glm-5:exacto"),
    ("Kimi-K2.5", "bench-holdout", "kimi-k2.5"),
    ("MIMO-v2-Pro", "bench-holdout", "mimo-v2-pro:exacto"),
    ("MiniMax-M2.7", "bench-holdout", "minimax-m2.7:exacto"),
]
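
# MiniMax-M2.7 is scored per-model below but excluded from the 6-model
# majority (see TOP6_NAMES).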

CATEGORY_ABBREV = {
    "None/Other": "N/O",
    "Background": "BG",
    "Risk Management Process": "RMP",
    "Management Role": "MR",
    "Strategy Integration": "SI",
}


def abbrev(cat: str) -> str:
    return CATEGORY_ABBREV.get(cat, cat)


# ---------------------------------------------------------------------------
# Data loading
# ---------------------------------------------------------------------------
def load_jsonl(path: Path) -> list[dict]:
    rows = []
    with open(path) as f:
        for line in f:
            line = line.strip()
            if line:
                rows.append(json.loads(line))
    return rows


def load_model_labels(version_suffix: str, subdir: str, filename: str) -> dict[str, str]:
    """Return {paragraphId: content_category} for a model file."""
    if version_suffix:
        base = ROOT / "data" / "annotations" / f"{subdir}-{version_suffix}" / f"{filename}.jsonl"
    else:
        base = ROOT / "data" / "annotations" / subdir / f"{filename}.jsonl"
    rows = load_jsonl(base)
    return {r["paragraphId"]: r["label"]["content_category"] for r in rows}


def load_model_rows(version_suffix: str, subdir: str, filename: str) -> list[dict]:
    if version_suffix:
        base = ROOT / "data" / "annotations" / f"{subdir}-{version_suffix}" / f"{filename}.jsonl"
    else:
        base = ROOT / "data" / "annotations" / subdir / f"{filename}.jsonl"
    return load_jsonl(base)


# Load holdout PIDs and axes
holdout_rows = load_jsonl(ROOT / "data" / "gold" / "holdout-rerun-v35.jsonl")
HOLDOUT_PIDS = {r["paragraphId"] for r in holdout_rows}
PID_AXES: dict[str, list[str]] = {r["paragraphId"]: r["axes"] for r in holdout_rows}

# Human labels → majority vote per PID
human_raw = load_jsonl(ROOT / "data" / "gold" / "human-labels-raw.jsonl")
human_by_pid: dict[str, list[str]] = {}
for row in human_raw:
    pid = row["paragraphId"]
    if pid in HOLDOUT_PIDS:
        human_by_pid.setdefault(pid, []).append(row["contentCategory"])

human_majority: dict[str, str] = {}
for pid, cats in human_by_pid.items():
    counter = Counter(cats)
    human_majority[pid] = counter.most_common(1)[0][0]

# Load v3.0 and v3.5f labels for all models
v30_labels: dict[str, dict[str, str]] = {}  # model_name -> {pid: cat}
v35_labels: dict[str, dict[str, str]] = {}
v35_rows_by_model: dict[str, list[dict]] = {}

for name, subdir, filename in MODELS:
    # v3.0: full 1200 file, filter to 359
    all_v30 = load_model_labels("", subdir, filename)
    v30_labels[name] = {pid: cat for pid, cat in all_v30.items() if pid in HOLDOUT_PIDS}

    # v3.5f
    suffix = "v35"
    sub = "golden" if subdir == "golden" else "bench-holdout"
    v35_all = load_model_labels(suffix, sub, filename)
    v35_labels[name] = {pid: cat for pid, cat in v35_all.items() if pid in HOLDOUT_PIDS}

    v35_rows_by_model[name] = load_model_rows(suffix, sub, filename)


# Common PID set (intersection of all models in both versions + human majority)
common_pids = set(HOLDOUT_PIDS)
for name in [m[0] for m in MODELS]:
    common_pids &= set(v30_labels[name].keys())
    common_pids &= set(v35_labels[name].keys())
common_pids &= set(human_majority.keys())
common_pids_sorted = sorted(common_pids)

N = len(common_pids_sorted)
print(f"Common paragraphs across all models + human majority: {N}")
print()

# ---------------------------------------------------------------------------
# Helper: 6-model majority (excl MiniMax)
# ---------------------------------------------------------------------------
TOP6_NAMES = [m[0] for m in MODELS if m[0] != "MiniMax-M2.7"]


def majority_vote(labels_dict: dict[str, dict[str, str]], model_names: list[str], pid: str) -> str | None:
    cats = []
    for mn in model_names:
        if pid in labels_dict[mn]:
            cats.append(labels_dict[mn][pid])
    if not cats:
        return None
    counter = Counter(cats)
    return counter.most_common(1)[0][0]
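

# Ties in most_common resolve by first insertion -- here, toward the model
# listed earliest in `model_names` (Opus first for TOP6_NAMES).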


# ===========================================================================
# 1. Per-model summary table
# ===========================================================================
print("=" * 90)
print("1. PER-MODEL SUMMARY TABLE (vs human majority)")
print("=" * 90)
header = f"{'Model':<20} {'v3.0 Acc':>10} {'v3.5f Acc':>10} {'Delta':>8} {'Change%':>9}"
print(header)
print("-" * len(header))

model_v30_acc = {}
model_v35_acc = {}

for name, _, _ in MODELS:
    correct_30 = sum(1 for pid in common_pids_sorted if v30_labels[name][pid] == human_majority[pid])
    correct_35 = sum(1 for pid in common_pids_sorted if v35_labels[name][pid] == human_majority[pid])
    changed = sum(1 for pid in common_pids_sorted if v30_labels[name][pid] != v35_labels[name][pid])

    acc30 = correct_30 / N
    acc35 = correct_35 / N
    delta = acc35 - acc30
    change_rate = changed / N

    model_v30_acc[name] = acc30
    model_v35_acc[name] = acc35

    print(f"{name:<20} {acc30:>9.1%} {acc35:>9.1%} {delta:>+7.1%} {change_rate:>8.1%}")

# 6-model majority row
correct_30_maj = 0
correct_35_maj = 0
changed_maj = 0
for pid in common_pids_sorted:
    m30 = majority_vote(v30_labels, TOP6_NAMES, pid)
    m35 = majority_vote(v35_labels, TOP6_NAMES, pid)
    if m30 == human_majority[pid]:
        correct_30_maj += 1
    if m35 == human_majority[pid]:
        correct_35_maj += 1
    if m30 != m35:
        changed_maj += 1

acc30_maj = correct_30_maj / N
acc35_maj = correct_35_maj / N
delta_maj = acc35_maj - acc30_maj
change_maj_rate = changed_maj / N

model_v30_acc["6-model majority"] = acc30_maj
model_v35_acc["6-model majority"] = acc35_maj

print("-" * len(header))
print(f"{'6-model maj (no MM)':<20} {acc30_maj:>9.1%} {acc35_maj:>9.1%} {delta_maj:>+7.1%} {change_maj_rate:>8.1%}")
print()

# ===========================================================================
# 2. Per-axis breakdown (6-model majority excl MiniMax)
# ===========================================================================
print("=" * 90)
print("2. PER-AXIS BREAKDOWN (6-model majority excl MiniMax vs human majority)")
print("=" * 90)

all_axes = sorted({ax for axes in PID_AXES.values() for ax in axes})
header2 = f"{'Axis':<12} {'N':>5} {'v3.0 Acc':>10} {'v3.5f Acc':>10} {'Delta':>8}"
print(header2)
print("-" * len(header2))

for axis in all_axes:
    axis_pids = [pid for pid in common_pids_sorted if axis in PID_AXES.get(pid, [])]
    n_axis = len(axis_pids)
    if n_axis == 0:
        continue
    correct_30 = sum(1 for pid in axis_pids if majority_vote(v30_labels, TOP6_NAMES, pid) == human_majority[pid])
    correct_35 = sum(1 for pid in axis_pids if majority_vote(v35_labels, TOP6_NAMES, pid) == human_majority[pid])
    a30 = correct_30 / n_axis
    a35 = correct_35 / n_axis
    d = a35 - a30
    print(f"{axis:<12} {n_axis:>5} {a30:>9.1%} {a35:>9.1%} {d:>+7.1%}")

print()

# ===========================================================================
# 3. SI ↔ N/O asymmetry check
# ===========================================================================
print("=" * 90)
print("3. SI <-> N/O ASYMMETRY CHECK")
print("=" * 90)

si_no_pids = [pid for pid in common_pids_sorted if "SI_NO" in PID_AXES.get(pid, [])]
print(f"SI↔N/O paragraphs in common set: {len(si_no_pids)}")
print()

for version_label, labels_dict in [("v3.0", v30_labels), ("v3.5f", v35_labels)]:
    human_si_model_no = 0
    human_no_model_si = 0
    for pid in si_no_pids:
        h = human_majority[pid]
        m = majority_vote(labels_dict, TOP6_NAMES, pid)
        if h == "Strategy Integration" and m == "None/Other":
            human_si_model_no += 1
        elif h == "None/Other" and m == "Strategy Integration":
            human_no_model_si += 1
    print(f"{version_label}:")
    print(f" Human=SI, 6-model=N/O: {human_si_model_no}")
    print(f" Human=N/O, 6-model=SI: {human_no_model_si}")
    print()

# Also show per-model breakdown for SI↔N/O
print("Per-model SI↔N/O errors:")
header3 = f"{'Model':<20} {'v3.0 H=SI,M=NO':>16} {'v3.0 H=NO,M=SI':>16} {'v3.5 H=SI,M=NO':>16} {'v3.5 H=NO,M=SI':>16}"
print(header3)
print("-" * len(header3))
for name, _, _ in MODELS:
    counts = []
    for labels_dict in [v30_labels, v35_labels]:
        hsi_mno = 0
        hno_msi = 0
        for pid in si_no_pids:
            h = human_majority[pid]
            m = labels_dict[name].get(pid)
            if m is None:
                continue
            if h == "Strategy Integration" and m == "None/Other":
                hsi_mno += 1
            elif h == "None/Other" and m == "Strategy Integration":
                hno_msi += 1
        counts.extend([hsi_mno, hno_msi])
    print(f"{name:<20} {counts[0]:>16} {counts[1]:>16} {counts[2]:>16} {counts[3]:>16}")

print()

# ===========================================================================
# 4. Per-model ranking
# ===========================================================================
print("=" * 90)
print("4. PER-MODEL RANKING")
print("=" * 90)

all_names = [m[0] for m in MODELS]

rank_v30 = sorted(all_names, key=lambda n: model_v30_acc[n], reverse=True)
rank_v35 = sorted(all_names, key=lambda n: model_v35_acc[n], reverse=True)

header4 = f"{'Rank':>4} {'v3.0 Model':<20} {'Acc':>8} {'v3.5f Model':<20} {'Acc':>8}"
print(header4)
print("-" * len(header4))
for i in range(len(all_names)):
    n30 = rank_v30[i]
    n35 = rank_v35[i]
    print(f"{i+1:>4} {n30:<20} {model_v30_acc[n30]:>7.1%} {n35:<20} {model_v35_acc[n35]:>7.1%}")

print()

# ===========================================================================
# 5. Model convergence (average pairwise agreement)
# ===========================================================================
print("=" * 90)
print("5. MODEL CONVERGENCE (average pairwise agreement)")
print("=" * 90)


def avg_pairwise_agreement(labels_dict: dict[str, dict[str, str]], model_names: list[str], pids: list[str]) -> float:
    agreements = []
    for m1, m2 in combinations(model_names, 2):
        agree = sum(1 for pid in pids if labels_dict[m1].get(pid) == labels_dict[m2].get(pid))
        agreements.append(agree / len(pids))
    return float(np.mean(agreements))
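

# With k models there are k*(k-1)/2 pairs: 21 for all seven, 15 for the top six.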
for group_label, group_names in [("All 7 models", all_names), ("Top 6 (excl MiniMax)", TOP6_NAMES)]:
|
||||||
|
a30 = avg_pairwise_agreement(v30_labels, group_names, common_pids_sorted)
|
||||||
|
a35 = avg_pairwise_agreement(v35_labels, group_names, common_pids_sorted)
|
||||||
|
delta = a35 - a30
|
||||||
|
print(f"{group_label}:")
|
||||||
|
print(f" v3.0 avg pairwise agreement: {a30:.1%}")
|
||||||
|
print(f" v3.5f avg pairwise agreement: {a35:.1%}")
|
||||||
|
print(f" Delta: {delta:+.1%}")
|
||||||
|
print()
|
||||||
|
|
||||||
|
# ===========================================================================
# 6. Cost summary
# ===========================================================================
print("=" * 90)
print("6. v3.5f RE-RUN COST SUMMARY")
print("=" * 90)

total_cost = 0.0
header6 = f"{'Model':<20} {'Records':>8} {'Cost ($)':>10}"
print(header6)
print("-" * len(header6))
for name, _, _ in MODELS:
    rows = v35_rows_by_model[name]
    cost = sum(r.get("provenance", {}).get("costUsd", 0) for r in rows)
    total_cost += cost
    print(f"{name:<20} {len(rows):>8} {cost:>10.4f}")

print("-" * len(header6))
print(f"{'TOTAL':<20} {'':<8} {total_cost:>10.4f}")
print()
518 scripts/compare-v30-v35.py Normal file
@ -0,0 +1,518 @@
"""Compare v3.0 vs v3.5 annotations on 359 confusion-axis holdout paragraphs."""
|
||||||
|
|
||||||
|
import json
|
||||||
|
from collections import Counter, defaultdict
|
||||||
|
from pathlib import Path
|
||||||
|
|
||||||
|
import numpy as np
|
||||||
|
|
||||||
|
# ── Paths ──────────────────────────────────────────────────────────────────────
|
||||||
|
ROOT = Path(__file__).resolve().parent.parent
|
||||||
|
|
||||||
|
V30_GOLDEN = ROOT / "data/annotations/golden/opus.jsonl"
|
||||||
|
V35_GOLDEN = ROOT / "data/annotations/golden-v35/opus.jsonl"
|
||||||
|
|
||||||
|
V30_BENCH = ROOT / "data/annotations/bench-holdout"
|
||||||
|
V35_BENCH = ROOT / "data/annotations/bench-holdout-v35"
|
||||||
|
|
||||||
|
HUMAN_LABELS = ROOT / "data/gold/human-labels-raw.jsonl"
|
||||||
|
HOLDOUT_META = ROOT / "data/gold/holdout-rerun-v35.jsonl"
|
||||||
|
|
||||||
|
MODEL_FILES = [
|
||||||
|
"opus.jsonl", # golden dirs
|
||||||
|
"gpt-5.4.jsonl",
|
||||||
|
"gemini-3.1-pro-preview.jsonl",
|
||||||
|
"glm-5:exacto.jsonl",
|
||||||
|
"kimi-k2.5.jsonl",
|
||||||
|
"mimo-v2-pro:exacto.jsonl",
|
||||||
|
"minimax-m2.7:exacto.jsonl",
|
||||||
|
]
|
||||||
|
|
||||||
|
MODEL_NAMES = [
|
||||||
|
"Opus",
|
||||||
|
"GPT-5.4",
|
||||||
|
"Gemini-3.1-Pro",
|
||||||
|
"GLM-5",
|
||||||
|
"Kimi-K2.5",
|
||||||
|
"Mimo-v2-Pro",
|
||||||
|
"MiniMax-M2.7",
|
||||||
|
]
|
||||||
|
|
||||||
|
# Category abbreviations used in axes
|
||||||
|
CAT_ABBREV = {
|
||||||
|
"BG": "Board Governance",
|
||||||
|
"MR": "Management Role",
|
||||||
|
"RMP": "Risk Management Process",
|
||||||
|
"SI": "Strategy Integration",
|
||||||
|
"NO": "None/Other",
|
||||||
|
"ID": "Incident Disclosure",
|
||||||
|
"TPR": "Third-Party Risk",
|
||||||
|
}
|
||||||
|
|
||||||
|
ABBREV_CAT = {v: k for k, v in CAT_ABBREV.items()}
|
||||||
|
|
||||||
|
|
||||||
|
def abbrev(cat: str) -> str:
|
||||||
|
return ABBREV_CAT.get(cat, cat)
|
||||||
|
|
||||||
|
|
||||||
|
def full_cat(ab: str) -> str:
|
||||||
|
return CAT_ABBREV.get(ab, ab)
|
||||||
|
|
||||||
|
|
||||||
|
# ── Load data ──────────────────────────────────────────────────────────────────
|
||||||
|
|
||||||
|
def load_jsonl(path: Path) -> list[dict]:
|
||||||
|
with open(path) as f:
|
||||||
|
return [json.loads(line) for line in f if line.strip()]
|
||||||
|
|
||||||
|
|
||||||
|
def load_annotations(base_dir: Path, filename: str) -> dict[str, str]:
|
||||||
|
"""Load paragraphId → content_category mapping."""
|
||||||
|
path = base_dir / filename
|
||||||
|
records = load_jsonl(path)
|
||||||
|
return {r["paragraphId"]: r["label"]["content_category"] for r in records}
|
||||||
|
|
||||||
|
|
||||||
|
def load_golden(path: Path) -> dict[str, str]:
|
||||||
|
records = load_jsonl(path)
|
||||||
|
return {r["paragraphId"]: r["label"]["content_category"] for r in records}
|
||||||
|
|
||||||
|
|
||||||
|
# Load holdout metadata
|
||||||
|
holdout_records = load_jsonl(HOLDOUT_META)
|
||||||
|
holdout_pids = {r["paragraphId"] for r in holdout_records}
|
||||||
|
pid_axes = {r["paragraphId"]: r["axes"] for r in holdout_records}
|
||||||
|
pid_materiality = {r["paragraphId"]: r.get("hasMaterialityLanguage", False) for r in holdout_records}
|
||||||
|
|
||||||
|
assert len(holdout_pids) == 359, f"Expected 359 holdout PIDs, got {len(holdout_pids)}"
|
||||||
|
|
||||||
|
# Load v3.0 annotations per model (filtered to 359 holdout PIDs)
v30: dict[str, dict[str, str]] = {}  # model_name → {pid → category}
v35: dict[str, dict[str, str]] = {}

for fname, mname in zip(MODEL_FILES, MODEL_NAMES):
    if fname == "opus.jsonl":
        v30_all = load_golden(V30_GOLDEN)
        v30[mname] = {pid: v30_all[pid] for pid in holdout_pids if pid in v30_all}
        v35[mname] = load_golden(V35_GOLDEN)
    else:
        v30_all = load_annotations(V30_BENCH, fname)
        v30[mname] = {pid: v30_all[pid] for pid in holdout_pids if pid in v30_all}
        v35[mname] = load_annotations(V35_BENCH, fname)

# Load human labels
human_raw = load_jsonl(HUMAN_LABELS)
# Group by paragraphId, compute majority
human_labels_by_pid: dict[str, list[str]] = defaultdict(list)
for rec in human_raw:
    human_labels_by_pid[rec["paragraphId"]].append(rec["contentCategory"])

human_majority: dict[str, str] = {}
for pid, labels in human_labels_by_pid.items():
    counts = Counter(labels)
    human_majority[pid] = counts.most_common(1)[0][0]

# Axes grouping
axis_pids: dict[str, set[str]] = defaultdict(set)
for pid, axes in pid_axes.items():
    for ax in axes:
        axis_pids[ax].add(pid)

AXIS_LABELS = {
    "SI_NO": "SI↔N/O",
    "MR_RMP": "MR↔RMP",
    "BG_MR": "BG↔MR",
    "BG_RMP": "BG↔RMP",
}


# ── Helpers ────────────────────────────────────────────────────────────────────

def majority_vote(model_cats: dict[str, dict[str, str]], pid: str) -> str | None:
    """Get majority category across all models for a PID."""
    votes = [model_cats[m].get(pid) for m in MODEL_NAMES if pid in model_cats[m]]
    votes = [v for v in votes if v is not None]
    if not votes:
        return None
    counts = Counter(votes)
    return counts.most_common(1)[0][0]
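# Tie note: Counter.most_common breaks ties by first insertion, so a tied
# plurality (e.g. a 3-3-1 split across the seven models) resolves to whichever
# of the tied categories appears first in MODEL_NAMES order.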


def agreement_rate(model_cats: dict[str, dict[str, str]], pids: set[str]) -> float:
    """Average pairwise agreement among 7 models on given PIDs."""
    total_pairs = 0
    agree_pairs = 0
    for pid in pids:
        cats = [model_cats[m].get(pid) for m in MODEL_NAMES if pid in model_cats[m]]
        cats = [c for c in cats if c is not None]
        n = len(cats)
        for i in range(n):
            for j in range(i + 1, n):
                total_pairs += 1
                if cats[i] == cats[j]:
                    agree_pairs += 1
    return agree_pairs / total_pairs if total_pairs > 0 else 0.0


def pairwise_agreement_matrix(model_cats: dict[str, dict[str, str]], pids: set[str]) -> np.ndarray:
    """Return 7x7 pairwise agreement matrix."""
    n = len(MODEL_NAMES)
    mat = np.zeros((n, n))
    for i in range(n):
        for j in range(n):
            if i == j:
                mat[i, j] = 1.0
                continue
            agree = 0
            total = 0
            for pid in pids:
                ci = model_cats[MODEL_NAMES[i]].get(pid)
                cj = model_cats[MODEL_NAMES[j]].get(pid)
                if ci is not None and cj is not None:
                    total += 1
                    if ci == cj:
                        agree += 1
            mat[i, j] = agree / total if total > 0 else 0.0
    return mat


# ── Section 1: Per-model category change rate ─────────────────────────────────

print("=" * 80)
print("1. PER-MODEL CATEGORY CHANGE RATE (v3.0 → v3.5)")
print("=" * 80)
print()

header = f"{'Model':<18} {'Changed':>8} {'Total':>6} {'% Changed':>10}"
print(header)
print("-" * len(header))

for mname in MODEL_NAMES:
    changed = 0
    total = 0
    for pid in holdout_pids:
        c30 = v30[mname].get(pid)
        c35 = v35[mname].get(pid)
        if c30 is not None and c35 is not None:
            total += 1
            if c30 != c35:
                changed += 1
    pct = (changed / total * 100) if total > 0 else 0
    print(f"{mname:<18} {changed:>8} {total:>6} {pct:>9.1f}%")

print()

# Top transitions per model
print("Top category transitions per model:")
print()
for mname in MODEL_NAMES:
    transitions: Counter = Counter()
    for pid in holdout_pids:
        c30 = v30[mname].get(pid)
        c35 = v35[mname].get(pid)
        if c30 is not None and c35 is not None and c30 != c35:
            transitions[(abbrev(c30), abbrev(c35))] += 1
    if transitions:
        top = transitions.most_common(5)
        parts = [f"{a}→{b} ({n})" for (a, b), n in top]
        print(f" {mname:<18} {', '.join(parts)}")

print()

# ── Section 2: Per-axis resolution analysis ───────────────────────────────────

print("=" * 80)
print("2. PER-AXIS RESOLUTION ANALYSIS")
print("=" * 80)
print()

for axis_key, axis_label in AXIS_LABELS.items():
    pids_on_axis = axis_pids[axis_key]

    print(f"--- {axis_label} ({len(pids_on_axis)} paragraphs) ---")
    print()

    # v3.0 and v3.5 majorities
    v30_maj = {pid: majority_vote(v30, pid) for pid in pids_on_axis}
    v35_maj = {pid: majority_vote(v35, pid) for pid in pids_on_axis}

    # Majority distribution
    v30_dist = Counter(v for v in v30_maj.values() if v)
    v35_dist = Counter(v for v in v35_maj.values() if v)

    print(" v3.0 majority distribution: ", end="")
    print(", ".join(f"{abbrev(k)}={v}" for k, v in v30_dist.most_common()))

    print(" v3.5 majority distribution: ", end="")
    print(", ".join(f"{abbrev(k)}={v}" for k, v in v35_dist.most_common()))

    # Flipped majority
    flipped = sum(
        1 for pid in pids_on_axis
        if v30_maj.get(pid) and v35_maj.get(pid) and v30_maj[pid] != v35_maj[pid]
    )
    print(f" Paragraphs with flipped majority: {flipped}/{len(pids_on_axis)} ({flipped / len(pids_on_axis) * 100:.1f}%)")

    # New agreement rate (7-model)
    v30_agree = agreement_rate(v30, pids_on_axis)
    v35_agree = agreement_rate(v35, pids_on_axis)
    print(f" 7-model avg pairwise agreement: v3.0={v30_agree:.3f} → v3.5={v35_agree:.3f} (Δ={v35_agree - v30_agree:+.3f})")
    print()

# ── Section 3: Human alignment improvement ───────────────────────────────────

print("=" * 80)
print("3. HUMAN ALIGNMENT IMPROVEMENT")
print("=" * 80)
print()

# Overall
pids_with_human = holdout_pids & set(human_majority.keys())

v30_agree_human = 0
v35_agree_human = 0
total_human = 0

for pid in pids_with_human:
    hm = human_majority[pid]
    m30 = majority_vote(v30, pid)
    m35 = majority_vote(v35, pid)
    if m30 is not None and m35 is not None:
        total_human += 1
        if m30 == hm:
            v30_agree_human += 1
        if m35 == hm:
            v35_agree_human += 1

v30_pct = v30_agree_human / total_human * 100 if total_human else 0
v35_pct = v35_agree_human / total_human * 100 if total_human else 0

print(f"Overall (n={total_human}):")
print(f" v3.0 GenAI majority vs human majority: {v30_agree_human}/{total_human} ({v30_pct:.1f}%)")
print(f" v3.5 GenAI majority vs human majority: {v35_agree_human}/{total_human} ({v35_pct:.1f}%)")
print(f" Delta: {v35_pct - v30_pct:+.1f}pp")
print()

# By axis
print("By axis:")
header = f"{'Axis':<12} {'n':>4} {'v3.0 %':>8} {'v3.5 %':>8} {'Delta':>8}"
print(header)
print("-" * len(header))

for axis_key, axis_label in AXIS_LABELS.items():
    pids_ax = axis_pids[axis_key] & pids_with_human
    a30 = 0
    a35 = 0
    tot = 0
    for pid in pids_ax:
        hm = human_majority[pid]
        m30 = majority_vote(v30, pid)
        m35 = majority_vote(v35, pid)
        if m30 is not None and m35 is not None:
            tot += 1
            if m30 == hm:
                a30 += 1
            if m35 == hm:
                a35 += 1
    p30 = a30 / tot * 100 if tot else 0
    p35 = a35 / tot * 100 if tot else 0
    print(f"{axis_label:<12} {tot:>4} {p30:>7.1f}% {p35:>7.1f}% {p35 - p30:>+7.1f}pp")

print()

# ── Section 4: SI↔N/O specific analysis ──────────────────────────────────────

print("=" * 80)
print("4. SI↔N/O SPECIFIC ANALYSIS")
print("=" * 80)
print()

si_no_pids = axis_pids["SI_NO"]
print(f"Paragraphs on SI↔N/O axis: {len(si_no_pids)}")
print()

# Per-model SI call rate
print("Per-model SI call rate:")
header = f"{'Model':<18} {'v3.0 SI':>8} {'v3.0 NO':>8} {'v3.5 SI':>8} {'v3.5 NO':>8} {'v3.0 SI%':>9} {'v3.5 SI%':>9}"
print(header)
print("-" * len(header))

for mname in MODEL_NAMES:
    si30 = sum(1 for pid in si_no_pids if v30[mname].get(pid) == "Strategy Integration")
    no30 = sum(1 for pid in si_no_pids if v30[mname].get(pid) == "None/Other")
    si35 = sum(1 for pid in si_no_pids if v35[mname].get(pid) == "Strategy Integration")
    no35 = sum(1 for pid in si_no_pids if v35[mname].get(pid) == "None/Other")
    pct30 = si30 / len(si_no_pids) * 100
    pct35 = si35 / len(si_no_pids) * 100
    print(f"{mname:<18} {si30:>8} {no30:>8} {si35:>8} {no35:>8} {pct30:>8.1f}% {pct35:>8.1f}%")

print()
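# Note: SI% uses all paragraphs on the axis as the denominator, not SI+NO
# calls only, so a model that routes some of these paragraphs to a third
# category shows a lower SI% even with an identical SI/NO balance.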

# N/O → SI switches per model
print("Models switching N/O → SI on SI↔N/O paragraphs:")
for mname in MODEL_NAMES:
    switches = sum(
        1 for pid in si_no_pids
        if v30[mname].get(pid) == "None/Other" and v35[mname].get(pid) == "Strategy Integration"
    )
    reverse = sum(
        1 for pid in si_no_pids
        if v30[mname].get(pid) == "Strategy Integration" and v35[mname].get(pid) == "None/Other"
    )
    print(f" {mname:<18} N/O→SI: {switches:>3}, SI→N/O: {reverse:>3}")

print()

# Per-paragraph tally shift
print("Per-paragraph SI vs N/O tally (v3.0 → v3.5), showing shifts:")
print()
header = f"{'ParagraphId':<38} {'v3.0 SI':>7} {'v3.0 NO':>7} {'v3.5 SI':>7} {'v3.5 NO':>7} {'Human':>6} {'Resolved?':>10}"
print(header)
print("-" * len(header))

resolved_count = 0
total_si_no_with_human = 0
for pid in sorted(si_no_pids):
    si30 = sum(1 for m in MODEL_NAMES if v30[m].get(pid) == "Strategy Integration")
    no30 = sum(1 for m in MODEL_NAMES if v30[m].get(pid) == "None/Other")
    si35 = sum(1 for m in MODEL_NAMES if v35[m].get(pid) == "Strategy Integration")
    no35 = sum(1 for m in MODEL_NAMES if v35[m].get(pid) == "None/Other")
    hm = human_majority.get(pid, "?")
    hm_ab = abbrev(hm) if hm != "?" else "?"

    # "Resolved" = v3.5 majority matches human majority
    v35_maj = "SI" if si35 > no35 else ("NO" if no35 > si35 else "TIE")
    resolved = "YES" if hm_ab == v35_maj else ("" if hm == "?" else "no")
    if hm != "?":
        total_si_no_with_human += 1
        if hm_ab == v35_maj:
            resolved_count += 1

    print(f"{pid[:36]:<38} {si30:>7} {no30:>7} {si35:>7} {no35:>7} {hm_ab:>6} {resolved:>10}")

print()
if total_si_no_with_human:
    print(f"SI↔N/O resolution rate (v3.5 majority matches human): {resolved_count}/{total_si_no_with_human} ({resolved_count / total_si_no_with_human * 100:.1f}%)")
else:
    print("No human labels for SI↔N/O paragraphs")

# 23:0 asymmetry check
print()
print("23:0 asymmetry check:")
# In v3.0, how many SI↔N/O paragraphs had human=SI but GenAI majority=N/O?
asym_30 = sum(
    1 for pid in si_no_pids
    if human_majority.get(pid) == "Strategy Integration" and majority_vote(v30, pid) == "None/Other"
)
asym_35 = sum(
    1 for pid in si_no_pids
    if human_majority.get(pid) == "Strategy Integration" and majority_vote(v35, pid) == "None/Other"
)
print(f" v3.0: Human=SI but GenAI majority=N/O: {asym_30}")
print(f" v3.5: Human=SI but GenAI majority=N/O: {asym_35}")
rev_30 = sum(
    1 for pid in si_no_pids
    if human_majority.get(pid) == "None/Other" and majority_vote(v30, pid) == "Strategy Integration"
)
rev_35 = sum(
    1 for pid in si_no_pids
    if human_majority.get(pid) == "None/Other" and majority_vote(v35, pid) == "Strategy Integration"
)
print(f" v3.0: Human=N/O but GenAI majority=SI: {rev_30}")
print(f" v3.5: Human=N/O but GenAI majority=SI: {rev_35}")

print()

# ── Section 5: Per-model quality on confusion axes ───────────────────────────

print("=" * 80)
print("5. PER-MODEL ACCURACY ON CONFUSION-AXIS PARAGRAPHS (vs human majority)")
print("=" * 80)
print()

model_results = []
for mname in MODEL_NAMES:
    correct_30 = 0
    correct_35 = 0
    total = 0
    for pid in holdout_pids:
        hm = human_majority.get(pid)
        c30 = v30[mname].get(pid)
        c35 = v35[mname].get(pid)
        if hm and c30 and c35:
            total += 1
            if c30 == hm:
                correct_30 += 1
            if c35 == hm:
                correct_35 += 1
    acc30 = correct_30 / total * 100 if total else 0
    acc35 = correct_35 / total * 100 if total else 0
    model_results.append((mname, total, acc30, acc35, acc35 - acc30))

# Sort by v3.5 accuracy descending
model_results.sort(key=lambda x: -x[3])

header = f"{'Rank':>4} {'Model':<18} {'n':>5} {'v3.0 Acc':>9} {'v3.5 Acc':>9} {'Delta':>8}"
print(header)
print("-" * len(header))
for rank, (mname, total, acc30, acc35, delta) in enumerate(model_results, 1):
    print(f"{rank:>4} {mname:<18} {total:>5} {acc30:>8.1f}% {acc35:>8.1f}% {delta:>+7.1f}pp")

print()

# ── Section 6: Model convergence ─────────────────────────────────────────────

print("=" * 80)
print("6. MODEL CONVERGENCE (pairwise agreement)")
print("=" * 80)
print()

v30_avg = agreement_rate(v30, holdout_pids)
v35_avg = agreement_rate(v35, holdout_pids)

print("Average pairwise agreement among 7 models:")
print(f" v3.0: {v30_avg:.3f}")
print(f" v3.5: {v35_avg:.3f}")
print(f" Delta: {v35_avg - v30_avg:+.3f}")
print()

# Per-model average agreement with others
print("Per-model average agreement with other 6 models:")
header = f"{'Model':<18} {'v3.0':>8} {'v3.5':>8} {'Delta':>8}"
print(header)
print("-" * len(header))

v30_mat = pairwise_agreement_matrix(v30, holdout_pids)
v35_mat = pairwise_agreement_matrix(v35, holdout_pids)

for i, mname in enumerate(MODEL_NAMES):
    # Average agreement with other models (exclude self)
    others_30 = [v30_mat[i, j] for j in range(len(MODEL_NAMES)) if j != i]
    others_35 = [v35_mat[i, j] for j in range(len(MODEL_NAMES)) if j != i]
    avg30 = np.mean(others_30)
    avg35 = np.mean(others_35)
    print(f"{mname:<18} {avg30:>7.3f} {avg35:>7.3f} {avg35 - avg30:>+7.3f}")

print()

# Outlier detection
print("Outlier check (models with lowest v3.5 agreement):")
v35_avgs = []
for i, mname in enumerate(MODEL_NAMES):
    others = [v35_mat[i, j] for j in range(len(MODEL_NAMES)) if j != i]
    v35_avgs.append((mname, np.mean(others)))

v35_avgs.sort(key=lambda x: x[1])
mean_agree = np.mean([x[1] for x in v35_avgs])
std_agree = np.std([x[1] for x in v35_avgs])

for mname, avg in v35_avgs:
    z = (avg - mean_agree) / std_agree if std_agree > 0 else 0
    flag = " *** OUTLIER" if z < -1.5 else ""
    print(f" {mname:<18} {avg:.3f} (z={z:+.2f}){flag}")

print()
print("=" * 80)
print("DONE")
print("=" * 80)
714 scripts/cross-analyze-human-vs-genai.py Normal file
@ -0,0 +1,714 @@
"""
|
||||||
|
Cross-analysis: Human annotators vs GenAI models on 1,200-paragraph holdout set.
|
||||||
|
|
||||||
|
Categories: BG, ID, MR, N/O, RMP, SI, TPR
|
||||||
|
Specificity: 1-4
|
||||||
|
13 signals per paragraph: 3 human (BIBD), 3 Stage 1, 1 Opus 4.6, 6 benchmark
|
||||||
|
"""
|
||||||
|
|
||||||
|
import json
|
||||||
|
import sys
|
||||||
|
from collections import Counter, defaultdict
|
||||||
|
from pathlib import Path
|
||||||
|
|
||||||
|
# ── Category abbreviation mapping ────────────────────────────────────────────
|
||||||
|
FULL_TO_ABBR = {
|
||||||
|
"Board Governance": "BG",
|
||||||
|
"Incident Disclosure": "ID",
|
||||||
|
"Management Role": "MR",
|
||||||
|
"None/Other": "N/O",
|
||||||
|
"Risk Management Process": "RMP",
|
||||||
|
"Strategy Integration": "SI",
|
||||||
|
"Third-Party Risk": "TPR",
|
||||||
|
}
|
||||||
|
ABBR_TO_FULL = {v: k for k, v in FULL_TO_ABBR.items()}
|
||||||
|
CATS = ["BG", "ID", "MR", "N/O", "RMP", "SI", "TPR"]
|
||||||
|
|
||||||
|
DATA = Path("data")
|
||||||
|
|
||||||
|
|
||||||
|
def abbr(cat: str) -> str:
|
||||||
|
return FULL_TO_ABBR.get(cat, cat)
|
||||||
|
|
||||||
|
|
||||||
|
def majority_vote(labels: list[str]) -> str:
    """Return majority label or 'split' if no majority."""
    c = Counter(labels)
    top = c.most_common(1)[0]
    if top[1] > len(labels) / 2:
        return top[0]
    # Check for a plurality with tie-break: if top 2 are tied, it's split
    if len(c) >= 2:
        top2 = c.most_common(2)
        if top2[0][1] == top2[1][1]:
            return "split"
    return top[0]
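# Behaviour sketch (illustrative calls):
#   majority_vote(["SI", "SI", "NO"])       -> "SI"    (strict majority)
#   majority_vote(["SI", "NO", "MR"])       -> "split" (top two tied)
#   majority_vote(["SI", "SI", "NO", "MR"]) -> "SI"    (plurality, no tie)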

def median_spec(specs: list[int]) -> float:
    s = sorted(specs)
    n = len(s)
    if n % 2 == 1:
        return float(s[n // 2])
    return (s[n // 2 - 1] + s[n // 2]) / 2.0


def mean_spec(specs: list[int]) -> float:
    return sum(specs) / len(specs) if specs else 0.0

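# Quick check of the two helpers (illustrative): median_spec([1, 2, 3, 4])
# -> 2.5 and mean_spec([1, 2, 3, 4]) -> 2.5. Despite the names they are
# generic median/mean; the word-count analysis in section 7 reuses them as such.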

# ── Load data ────────────────────────────────────────────────────────────────

print("Loading data...\n")

# Human labels: paragraphId → list of (annotatorName, category, specificity)
human_labels: dict[str, list[tuple[str, str, int]]] = defaultdict(list)
with open(DATA / "gold" / "human-labels-raw.jsonl") as f:
    for line in f:
        d = json.loads(line)
        human_labels[d["paragraphId"]].append(
            (d["annotatorName"], abbr(d["contentCategory"]), d["specificityLevel"])
        )

holdout_pids = sorted(human_labels.keys())
assert len(holdout_pids) == 1200, f"Expected 1200 holdout paragraphs, got {len(holdout_pids)}"

# GenAI labels: paragraphId → list of (modelName, category, specificity)
genai_labels: dict[str, list[tuple[str, str, int]]] = defaultdict(list)

# Stage 1 (filter to holdout only)
holdout_set = set(holdout_pids)
with open(DATA / "annotations" / "stage1.patched.jsonl") as f:
    for line in f:
        d = json.loads(line)
        pid = d["paragraphId"]
        if pid in holdout_set:
            model = d["provenance"]["modelId"].split("/")[-1]
            genai_labels[pid].append(
                (model, abbr(d["label"]["content_category"]), d["label"]["specificity_level"])
            )

# Opus
with open(DATA / "annotations" / "golden" / "opus.jsonl") as f:
    for line in f:
        d = json.loads(line)
        genai_labels[d["paragraphId"]].append(
            ("opus-4.6", abbr(d["label"]["content_category"]), d["label"]["specificity_level"])
        )

# Bench-holdout models
bench_files = [
    "gpt-5.4.jsonl",
    "gemini-3.1-pro-preview.jsonl",
    "glm-5:exacto.jsonl",
    "kimi-k2.5.jsonl",
    "mimo-v2-pro:exacto.jsonl",
    "minimax-m2.7:exacto.jsonl",
]
for fname in bench_files:
    fpath = DATA / "annotations" / "bench-holdout" / fname
    model_name = fname.replace(".jsonl", "")
    with open(fpath) as f:
        for line in f:
            d = json.loads(line)
            genai_labels[d["paragraphId"]].append(
                (model_name, abbr(d["label"]["content_category"]), d["label"]["specificity_level"])
            )

# Paragraph metadata
para_meta: dict[str, dict] = {}
with open(DATA / "gold" / "paragraphs-holdout.jsonl") as f:
    for line in f:
        d = json.loads(line)
        if d["id"] in holdout_set:
            para_meta[d["id"]] = d

# ── Compute per-paragraph aggregates ─────────────────────────────────────────

results = []
for pid in holdout_pids:
    h = human_labels[pid]
    g = genai_labels[pid]

    h_cats = [x[1] for x in h]
    h_specs = [x[2] for x in h]
    g_cats = [x[1] for x in g]
    g_specs = [x[2] for x in g]

    all_cats = h_cats + g_cats
    all_specs = h_specs + g_specs

    h_maj = majority_vote(h_cats)
    g_maj = majority_vote(g_cats)
    all_maj = majority_vote(all_cats)

    h_mean_spec = mean_spec(h_specs)
    g_mean_spec = mean_spec(g_specs)
    all_mean_spec = mean_spec(all_specs)

    # Agreement count: how many of 13 agree with overall majority
    agree_count = sum(1 for c in all_cats if c == all_maj) if all_maj != "split" else 0

    meta = para_meta.get(pid, {})

    results.append({
        "pid": pid,
        "h_maj": h_maj,
        "g_maj": g_maj,
        "all_maj": all_maj,
        "h_cats": h_cats,
        "g_cats": g_cats,
        "h_specs": h_specs,
        "g_specs": g_specs,
        "h_mean_spec": h_mean_spec,
        "g_mean_spec": g_mean_spec,
        "all_mean_spec": all_mean_spec,
        "agree_count": agree_count,
        "word_count": meta.get("wordCount", 0),
        "text": meta.get("text", ""),
        "human_annotators": [x[0] for x in h],
        "genai_models": [x[0] for x in g],
        "human_labels": h,
        "genai_labels": g,
    })


def fmt_table(headers: list[str], rows: list[list], align: list[str] | None = None):
    """Format a simple text table."""
    col_widths = [len(h) for h in headers]
    str_rows = []
    for row in rows:
        sr = [str(x) for x in row]
        str_rows.append(sr)
        for i, s in enumerate(sr):
            col_widths[i] = max(col_widths[i], len(s))

    if align is None:
        align = ["r"] * len(headers)

    def fmt_cell(s, w, a):
        return s.rjust(w) if a == "r" else s.ljust(w)

    sep = "+-" + "-+-".join("-" * w for w in col_widths) + "-+"
    hdr = "| " + " | ".join(fmt_cell(h, col_widths[i], "l") for i, h in enumerate(headers)) + " |"
    lines = [sep, hdr, sep]
    for sr in str_rows:
        line = "| " + " | ".join(fmt_cell(sr[i], col_widths[i], align[i]) for i in range(len(headers))) + " |"
        lines.append(line)
    lines.append(sep)
    return "\n".join(lines)
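# Rendering sketch with made-up data:
#   print(fmt_table(["Cat", "N"], [["SI", 42]], ["l", "r"]))
# +-----+----+
# | Cat | N  |
# +-----+----+
# | SI  | 42 |
# +-----+----+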


# ══════════════════════════════════════════════════════════════════════════════
# 1. PER-CATEGORY CONFUSION MATRIX: HUMAN MAJORITY vs GENAI MAJORITY
# ══════════════════════════════════════════════════════════════════════════════

print("=" * 80)
print("1. CONFUSION MATRIX: Human Majority (rows) vs GenAI Majority (cols)")
print("=" * 80)

cats_plus = CATS + ["split"]
cm = defaultdict(lambda: defaultdict(int))
for r in results:
    cm[r["h_maj"]][r["g_maj"]] += 1

headers = ["H\\G"] + cats_plus + ["Total"]
rows = []
for hc in cats_plus:
    row = [hc]
    total = 0
    for gc in cats_plus:
        v = cm[hc][gc]
        row.append(v if v else ".")
        total += v
    row.append(total)
    rows.append(row)

# Column totals
col_totals = ["Total"]
for gc in cats_plus:
    col_totals.append(sum(cm[hc][gc] for hc in cats_plus))
col_totals.append(sum(sum(cm[hc][gc] for gc in cats_plus) for hc in cats_plus))
rows.append(col_totals)

align = ["l"] + ["r"] * (len(headers) - 1)
print(fmt_table(headers, rows, align))

# Diagonal agreement
diag = sum(cm[c][c] for c in cats_plus)
total_paras = len(results)
print(f"\nDiagonal agreement: {diag}/{total_paras} = {diag/total_paras:.1%}")
print(f"Disagreement: {total_paras - diag}/{total_paras} = {(total_paras - diag)/total_paras:.1%}")

# Over/under prediction
print("\nGenAI over/under-prediction relative to human majority:")
headers2 = ["Category", "Human N", "GenAI N", "Diff", "Direction"]
rows2 = []
for c in CATS:
    h_n = sum(cm[c][gc] for gc in cats_plus)
    g_n = sum(cm[hc][c] for hc in cats_plus)
    diff = g_n - h_n
    direction = "OVER" if diff > 0 else ("UNDER" if diff < 0 else "MATCH")
    rows2.append([c, h_n, g_n, f"{diff:+d}", direction])
align2 = ["l", "r", "r", "r", "l"]
print(fmt_table(headers2, rows2, align2))


# ══════════════════════════════════════════════════════════════════════════════
# 2. DIRECTIONAL DISAGREEMENT ANALYSIS
# ══════════════════════════════════════════════════════════════════════════════

print("\n" + "=" * 80)
print("2. DIRECTIONAL DISAGREEMENT: Human Majority -> GenAI Majority transitions")
print("=" * 80)

disagree = [(r["h_maj"], r["g_maj"]) for r in results if r["h_maj"] != r["g_maj"]]
print(f"\nTotal disagreements: {len(disagree)}/{total_paras}")

trans = Counter(disagree)
print("\nTop transitions (H_maj -> G_maj):")
headers3 = ["From (Human)", "To (GenAI)", "Count", "Reverse", "Net", "Symmetric?"]
rows3 = []
seen = set()
for (a, b), cnt in sorted(trans.items(), key=lambda x: -x[1]):
    pair = tuple(sorted([a, b]))
    if pair in seen:
        continue
    seen.add(pair)
    rev = trans.get((b, a), 0)
    net = cnt - rev
    # Heuristic: call the pair symmetric when the net flow is within ~30% of
    # the smaller direction, with a floor of 1 to tolerate tiny counts.
    sym = "Yes" if abs(net) <= max(1, min(cnt, rev) * 0.3) else "No"
    rows3.append([a, b, cnt, rev, f"{net:+d}", sym])
align3 = ["l", "l", "r", "r", "r", "l"]
print(fmt_table(headers3, rows3, align3))


# ══════════════════════════════════════════════════════════════════════════════
# 3. PER-CATEGORY PRECISION/RECALL (Human majority as truth)
# ══════════════════════════════════════════════════════════════════════════════

print("\n" + "=" * 80)
print("3. PER-CATEGORY PRECISION/RECALL (Human majority as ground truth)")
print("=" * 80)

# Filter out splits for clean P/R
valid = [(r["h_maj"], r["g_maj"]) for r in results if r["h_maj"] != "split" and r["g_maj"] != "split"]
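# One-vs-rest bookkeeping for each category c, human majority as reference:
#   TP: h == c and g == c    FP: h != c but g == c    FN: h == c but g != c
# precision = TP / (TP + FP); recall = TP / (TP + FN); F1 is their harmonic
# mean, 2PR / (P + R).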

headers4 = ["Category", "TP", "FP", "FN", "Precision", "Recall", "F1"]
rows4 = []
for c in CATS:
    tp = sum(1 for h, g in valid if h == c and g == c)
    fp = sum(1 for h, g in valid if h != c and g == c)
    fn = sum(1 for h, g in valid if h == c and g != c)
    prec = tp / (tp + fp) if (tp + fp) > 0 else 0
    rec = tp / (tp + fn) if (tp + fn) > 0 else 0
    f1 = 2 * prec * rec / (prec + rec) if (prec + rec) > 0 else 0
    rows4.append([c, tp, fp, fn, f"{prec:.3f}", f"{rec:.3f}", f"{f1:.3f}"])
align4 = ["l", "r", "r", "r", "r", "r", "r"]
print("\nGenAI predictions evaluated against human majority:")
print(fmt_table(headers4, rows4, align4))

# Macro averages
macro_p = sum(float(r[4]) for r in rows4) / len(CATS)
macro_r = sum(float(r[5]) for r in rows4) / len(CATS)
macro_f1 = sum(float(r[6]) for r in rows4) / len(CATS)
print(f"\nMacro-avg: P={macro_p:.3f} R={macro_r:.3f} F1={macro_f1:.3f}")

# Vice versa: GenAI as truth
print("\n--- Vice versa: Human predictions evaluated against GenAI majority ---")
rows4b = []
for c in CATS:
    tp = sum(1 for h, g in valid if g == c and h == c)
    fp = sum(1 for h, g in valid if g != c and h == c)
    fn = sum(1 for h, g in valid if g == c and h != c)
    prec = tp / (tp + fp) if (tp + fp) > 0 else 0
    rec = tp / (tp + fn) if (tp + fn) > 0 else 0
    f1 = 2 * prec * rec / (prec + rec) if (prec + rec) > 0 else 0
    rows4b.append([c, tp, fp, fn, f"{prec:.3f}", f"{rec:.3f}", f"{f1:.3f}"])
print(fmt_table(headers4, rows4b, align4))


# ══════════════════════════════════════════════════════════════════════════════
# 4. SPECIFICITY SYSTEMATIC BIAS
# ══════════════════════════════════════════════════════════════════════════════

print("\n" + "=" * 80)
print("4. SPECIFICITY SYSTEMATIC BIAS: Human vs GenAI")
print("=" * 80)

# Overall
all_h_specs = [s for r in results for s in r["h_specs"]]
all_g_specs = [s for r in results for s in r["g_specs"]]
h_avg = mean_spec(all_h_specs)
g_avg = mean_spec(all_g_specs)
print(f"\nOverall mean specificity: Human={h_avg:.3f} GenAI={g_avg:.3f} Diff={g_avg - h_avg:+.3f}")
print(f"Overall median: Human={median_spec(all_h_specs):.1f} GenAI={median_spec(all_g_specs):.1f}")

# Distribution
print("\nSpecificity distribution:")
h_dist = Counter(all_h_specs)
g_dist = Counter(all_g_specs)
headers5 = ["Spec", "Human N", "Human %", "GenAI N", "GenAI %", "Diff %"]
rows5 = []
for s in [1, 2, 3, 4]:
    hn = h_dist.get(s, 0)
    gn = g_dist.get(s, 0)
    hp = hn / len(all_h_specs) * 100
    gp = gn / len(all_g_specs) * 100
    rows5.append([s, hn, f"{hp:.1f}%", gn, f"{gp:.1f}%", f"{gp - hp:+.1f}%"])
print(fmt_table(headers5, rows5, ["r", "r", "r", "r", "r", "r"]))

# By category
print("\nMean specificity by category:")
headers6 = ["Category", "Human", "GenAI", "Diff", "H count", "G count"]
rows6 = []
for c in CATS:
    h_s = [ann[2] for r in results for ann in r["human_labels"] if ann[1] == c]
    g_s = [ann[2] for r in results for ann in r["genai_labels"] if ann[1] == c]
    if h_s and g_s:
        hm = mean_spec(h_s)
        gm = mean_spec(g_s)
        rows6.append([c, f"{hm:.3f}", f"{gm:.3f}", f"{gm - hm:+.3f}", len(h_s), len(g_s)])
    else:
        rows6.append([c, "N/A", "N/A", "N/A", len(h_s), len(g_s)])
print(fmt_table(headers6, rows6, ["l", "r", "r", "r", "r", "r"]))

# Per-paragraph directional bias
h_higher = sum(1 for r in results if r["h_mean_spec"] > r["g_mean_spec"])
g_higher = sum(1 for r in results if r["g_mean_spec"] > r["h_mean_spec"])
same = sum(1 for r in results if abs(r["h_mean_spec"] - r["g_mean_spec"]) < 0.01)
print(f"\nPer-paragraph: Human higher spec={h_higher} GenAI higher={g_higher} Same={same}")


# ══════════════════════════════════════════════════════════════════════════════
# 5. DIFFICULTY-STRATIFIED ANALYSIS
# ══════════════════════════════════════════════════════════════════════════════

print("\n" + "=" * 80)
print("5. DIFFICULTY-STRATIFIED ANALYSIS")
print("=" * 80)

# Tiers based on 13-signal agreement
# Tier 1: 10+ agree, Tier 2: 7-9 agree, Tier 3: 5-6 agree, Tier 4: <5 agree
def get_tier(agree_count: int) -> str:
    if agree_count >= 10:
        return "T1-Easy"
    elif agree_count >= 7:
        return "T2-Medium"
    elif agree_count >= 5:
        return "T3-Hard"
    else:
        return "T4-VHard"

for r in results:
    r["tier"] = get_tier(r["agree_count"])

tier_counts = Counter(r["tier"] for r in results)
print("\nTier distribution:")
for t in ["T1-Easy", "T2-Medium", "T3-Hard", "T4-VHard"]:
    print(f" {t}: {tier_counts.get(t, 0)} paragraphs")

print("\nHuman-GenAI category agreement rate by difficulty tier:")
headers7 = ["Tier", "N", "Agree", "Agree%", "H=consensus%", "G=consensus%"]
rows7 = []
for t in ["T1-Easy", "T2-Medium", "T3-Hard", "T4-VHard"]:
    tier_r = [r for r in results if r["tier"] == t]
    n = len(tier_r)
    if n == 0:
        continue
    agree = sum(1 for r in tier_r if r["h_maj"] == r["g_maj"])
    h_match_cons = sum(1 for r in tier_r if r["h_maj"] == r["all_maj"])
    g_match_cons = sum(1 for r in tier_r if r["g_maj"] == r["all_maj"])
    rows7.append([
        t, n, agree, f"{agree/n:.1%}",
        f"{h_match_cons/n:.1%}", f"{g_match_cons/n:.1%}"
    ])
print(fmt_table(headers7, rows7, ["l", "r", "r", "r", "r", "r"]))

# On hard paragraphs, who is the odd one out?
print("\nOn hard paragraphs (T3+T4), disagreement breakdown:")
hard = [r for r in results if r["tier"] in ("T3-Hard", "T4-VHard")]
h_odd = sum(1 for r in hard if r["g_maj"] == r["all_maj"] and r["h_maj"] != r["all_maj"])
g_odd = sum(1 for r in hard if r["h_maj"] == r["all_maj"] and r["g_maj"] != r["all_maj"])
both_off = sum(1 for r in hard if r["h_maj"] != r["all_maj"] and r["g_maj"] != r["all_maj"])
both_on = sum(1 for r in hard if r["h_maj"] == r["all_maj"] and r["g_maj"] == r["all_maj"])
print(f" Human is odd-one-out (GenAI=consensus, Human!=consensus): {h_odd}")
print(f" GenAI is odd-one-out (Human=consensus, GenAI!=consensus): {g_odd}")
print(f" Both match consensus: {both_on}")
print(f" Both differ from consensus: {both_off}")
print(f" Total hard: {len(hard)}")


# ══════════════════════════════════════════════════════════════════════════════
# 6. ANNOTATOR-LEVEL PATTERNS
# ══════════════════════════════════════════════════════════════════════════════

print("\n" + "=" * 80)
print("6. ANNOTATOR-LEVEL PATTERNS")
print("=" * 80)

annotators = ["Anuj", "Elisabeth", "Joey", "Meghan", "Xander", "Aaryan"]

# For each annotator, compute agreement with GenAI majority
print("\nPer-annotator agreement with GenAI majority (category):")
headers8 = ["Annotator", "N labels", "Agree w/G_maj", "Agree%", "Agree w/13_maj", "13_maj%", "Avg Spec", "Note"]
rows8 = []
for ann in annotators:
    agree_g = 0
    agree_all = 0
    total = 0
    specs = []
    for r in results:
        for name, cat, spec in r["human_labels"]:
            if name == ann:
                total += 1
                specs.append(spec)
                if cat == r["g_maj"]:
                    agree_g += 1
                if cat == r["all_maj"]:
                    agree_all += 1
    if total == 0:
        continue
    note = "(excluded from aggregates)" if ann == "Aaryan" else ""
    rows8.append([
        ann, total,
        agree_g, f"{agree_g/total:.1%}",
        agree_all, f"{agree_all/total:.1%}",
        f"{mean_spec(specs):.2f}",
        note,
    ])
align8 = ["l", "r", "r", "r", "r", "r", "r", "l"]
print(fmt_table(headers8, rows8, align8))

# Annotator category distributions
print("\nPer-annotator category distribution:")
for ann in annotators:
    cat_counts = Counter()
    for r in results:
        for name, cat, _ in r["human_labels"]:
            if name == ann:
                cat_counts[cat] += 1
    if not cat_counts:
        continue
    total = sum(cat_counts.values())
    dist = " ".join(f"{c}:{cat_counts.get(c, 0):3d}({cat_counts.get(c, 0)/total:.0%})" for c in CATS)
    flag = " ** OUTLIER" if ann == "Aaryan" else ""
    print(f" {ann:10s} (n={total:3d}): {dist}{flag}")


# ══════════════════════════════════════════════════════════════════════════════
# 7. TEXT-FEATURE CORRELATIONS
# ══════════════════════════════════════════════════════════════════════════════

print("\n" + "=" * 80)
print("7. TEXT-FEATURE CORRELATIONS WITH DISAGREEMENT")
print("=" * 80)

agree_r = [r for r in results if r["h_maj"] == r["g_maj"]]
disagree_r = [r for r in results if r["h_maj"] != r["g_maj"]]

# Word count
agree_wc = [r["word_count"] for r in agree_r if r["word_count"] > 0]
disagree_wc = [r["word_count"] for r in disagree_r if r["word_count"] > 0]
print("\nWord count (agree vs disagree):")
print(f" Agreement paragraphs: mean={mean_spec(agree_wc):.1f} median={median_spec(agree_wc):.0f} n={len(agree_wc)}")
print(f" Disagreement paragraphs: mean={mean_spec(disagree_wc):.1f} median={median_spec(disagree_wc):.0f} n={len(disagree_wc)}")

# Word count buckets
print("\nDisagreement rate by word count bucket:")
buckets = [(0, 30, "0-30"), (31, 60, "31-60"), (61, 100, "61-100"), (101, 150, "101-150"), (151, 250, "151-250"), (251, 9999, "251+")]
headers9 = ["WC Bucket", "N", "Disagree", "Disagree%"]
rows9 = []
for lo, hi, label in buckets:
    in_bucket = [r for r in results if lo <= r["word_count"] <= hi]
    dis = sum(1 for r in in_bucket if r["h_maj"] != r["g_maj"])
    if in_bucket:
        rows9.append([label, len(in_bucket), dis, f"{dis/len(in_bucket):.1%}"])
print(fmt_table(headers9, rows9, ["l", "r", "r", "r"]))

# Stage1 method (unanimous vs majority) as proxy for quality tier
print("\nDisagreement rate by Stage 1 confidence method:")
for method in ["unanimous", "majority"]:
    in_method = [r for r in results if para_meta.get(r["pid"], {}).get("stage1Method") == method]
    dis = sum(1 for r in in_method if r["h_maj"] != r["g_maj"])
    if in_method:
        print(f" {method:10s}: {dis}/{len(in_method)} = {dis/len(in_method):.1%} disagree")

# Keyword analysis
print("\nDisagreement rate for paragraphs containing key terms:")
keywords = ["material", "NIST", "CISO", "board", "third party", "third-party", "incident",
            "insurance", "audit", "framework", "breach", "ransomware"]
headers10 = ["Keyword", "N", "Disagree", "Disagree%"]
rows10 = []
for kw in keywords:
    matching = [r for r in results if kw.lower() in r["text"].lower()]
    if not matching:
        continue
    dis = sum(1 for r in matching if r["h_maj"] != r["g_maj"])
    rows10.append([kw, len(matching), dis, f"{dis/len(matching):.1%}"])
rows10.sort(key=lambda x: -int(x[2]))
print(fmt_table(headers10, rows10, ["l", "r", "r", "r"]))


# ══════════════════════════════════════════════════════════════════════════════
# 8. "HUMAN RIGHT, GenAI WRONG" vs "GenAI RIGHT, HUMAN WRONG"
# ══════════════════════════════════════════════════════════════════════════════

print("\n" + "=" * 80)
print("8. HUMAN RIGHT/GENAI WRONG vs GENAI RIGHT/HUMAN WRONG (13-signal consensus)")
print("=" * 80)

# Only consider paragraphs where all_maj is not split and h/g disagree with each other or consensus
h_right_g_wrong = [r for r in results if r["all_maj"] != "split" and r["h_maj"] == r["all_maj"] and r["g_maj"] != r["all_maj"]]
g_right_h_wrong = [r for r in results if r["all_maj"] != "split" and r["g_maj"] == r["all_maj"] and r["h_maj"] != r["all_maj"]]
both_right = [r for r in results if r["all_maj"] != "split" and r["h_maj"] == r["all_maj"] and r["g_maj"] == r["all_maj"]]
both_wrong = [r for r in results if r["all_maj"] != "split" and r["h_maj"] != r["all_maj"] and r["g_maj"] != r["all_maj"]]
has_split = [r for r in results if r["all_maj"] == "split"]

print(f"\n Both correct: {len(both_right)}")
print(f" Human right, GenAI wrong: {len(h_right_g_wrong)}")
print(f" GenAI right, Human wrong: {len(g_right_h_wrong)}")
print(f" Both wrong: {len(both_wrong)}")
print(f" 13-signal split (no consensus): {len(has_split)}")

# Category breakdown
print("\nCategory breakdown of 'Human right, GenAI wrong':")
cat_dist_hrg = Counter(r["all_maj"] for r in h_right_g_wrong)
for c in CATS:
    n = cat_dist_hrg.get(c, 0)
    if n > 0:
        print(f" {c}: {n}")

print("\nCategory breakdown of 'GenAI right, Human wrong':")
cat_dist_grh = Counter(r["all_maj"] for r in g_right_h_wrong)
for c in CATS:
    n = cat_dist_grh.get(c, 0)
    if n > 0:
        print(f" {c}: {n}")

# What did the wrong side predict?
print("\nWhen GenAI is wrong, what does it predict instead?")
wrong_g = Counter(r["g_maj"] for r in h_right_g_wrong)
for label, cnt in wrong_g.most_common():
    print(f" {label}: {cnt}")

print("\nWhen Human is wrong, what do they predict instead?")
wrong_h = Counter(r["h_maj"] for r in g_right_h_wrong)
for label, cnt in wrong_h.most_common():
    print(f" {label}: {cnt}")


# ══════════════════════════════════════════════════════════════════════════════
# 9. SPECIFICITY BY SOURCE TYPE
# ══════════════════════════════════════════════════════════════════════════════

print("\n" + "=" * 80)
print("9. SPECIFICITY BY SOURCE TYPE AND CATEGORY")
print("=" * 80)

# Group models into source types
stage1_models = {"gemini-3.1-flash-lite-preview", "grok-4.1-fast", "mimo-v2-flash"}
frontier_models = {"opus-4.6", "gpt-5.4", "gemini-3.1-pro-preview", "kimi-k2.5"}
budget_models = {"glm-5:exacto", "mimo-v2-pro:exacto", "minimax-m2.7:exacto"}

# Collect specs by source type and category
source_specs: dict[str, dict[str, list[int]]] = {
    "Human": defaultdict(list),
    "Stage1": defaultdict(list),
    "Frontier": defaultdict(list),
    "Budget": defaultdict(list),
}

for r in results:
    for name, cat, spec in r["human_labels"]:
        source_specs["Human"][cat].append(spec)
        source_specs["Human"]["ALL"].append(spec)

    for model, cat, spec in r["genai_labels"]:
        if model in stage1_models:
            src = "Stage1"
        elif model in frontier_models:
            src = "Frontier"
        elif model in budget_models:
            src = "Budget"
        else:
            src = "Budget"  # fallback
        source_specs[src][cat].append(spec)
        source_specs[src]["ALL"].append(spec)

print("\nMean specificity by source type and category:")
src_order = ["Human", "Stage1", "Frontier", "Budget"]
headers11 = ["Category"] + src_order
rows11 = []
for c in CATS + ["ALL"]:
    row = [c]
    for src in src_order:
        specs = source_specs[src].get(c, [])
        if specs:
            row.append(f"{mean_spec(specs):.3f}")
        else:
            row.append("N/A")
    rows11.append(row)
align11 = ["l"] + ["r"] * len(src_order)
print(fmt_table(headers11, rows11, align11))

# Specificity standard deviation by source (math is imported at the top)
print("\nSpecificity std dev by source type:")
for src in src_order:
    specs = source_specs[src]["ALL"]
    if specs:
        m = mean_spec(specs)
        var = sum((s - m) ** 2 for s in specs) / len(specs)
        std = math.sqrt(var)
        print(f" {src:10s}: mean={m:.3f} std={std:.3f} n={len(specs)}")

# ── Per-model specificity rankings ───────────────────────────────────────────
print("\nPer-model mean specificity (all categories):")
model_specs: dict[str, list[int]] = defaultdict(list)
for r in results:
    for name, cat, spec in r["human_labels"]:
        model_specs[f"H:{name}"].append(spec)
    for model, cat, spec in r["genai_labels"]:
        model_specs[f"G:{model}"].append(spec)

headers12 = ["Model", "Mean Spec", "N"]
rows12 = []
for model, specs in sorted(model_specs.items(), key=lambda x: mean_spec(x[1])):
    rows12.append([model, f"{mean_spec(specs):.3f}", len(specs)])
print(fmt_table(headers12, rows12, ["l", "r", "r"]))


# ══════════════════════════════════════════════════════════════════════════════
# SUMMARY
# ══════════════════════════════════════════════════════════════════════════════

print("\n" + "=" * 80)
print("SUMMARY OF KEY FINDINGS")
print("=" * 80)

print(f"""
Dataset: {total_paras} paragraphs, 13 signals each (3 human, 10 GenAI)

1. CATEGORY AGREEMENT: Human majority and GenAI majority agree on {diag/total_paras:.1%} of
   paragraphs. The biggest confusions are in the off-diagonal cells above.

2. DIRECTIONAL DISAGREEMENTS: The most common category swaps reveal systematic
   differences in how humans and GenAI interpret boundary cases.

3. PRECISION/RECALL: GenAI macro F1={macro_f1:.3f} against human majority.

4. SPECIFICITY BIAS: Human mean={h_avg:.3f}, GenAI mean={g_avg:.3f}
   (diff={g_avg - h_avg:+.3f}). {"GenAI rates higher" if g_avg > h_avg else "Humans rate higher"} on average.

5. DIFFICULTY: On easy paragraphs (T1, 10+/13 agree), agreement is very high.
   On hard paragraphs, {"humans" if h_odd > g_odd else "GenAI"} are more often the odd-one-out.

6. ANNOTATORS: See table above for individual alignment with GenAI and consensus.

7. TEXT FEATURES: {"Longer" if mean_spec(disagree_wc) > mean_spec(agree_wc) else "Shorter"} paragraphs
   tend to produce more disagreement.

8. RIGHT/WRONG: Human right & GenAI wrong: {len(h_right_g_wrong)}, GenAI right &
   Human wrong: {len(g_right_h_wrong)}. {"Humans are more often right" if len(h_right_g_wrong) > len(g_right_h_wrong) else "GenAI is more often right"} when they disagree.
""")
736 scripts/examine-hard-cases.py Normal file
@ -0,0 +1,736 @@
#!/usr/bin/env python3
"""Examine hardest disagreement cases in the SEC cybersecurity holdout dataset.

Identifies paragraphs where the 13 annotation sources split on the three main
confusion axes (MR<->RMP, BG<->MR, SI<->N/O), shows representative examples,
extracts linguistic patterns, and recommends codebook rulings.

Run: uv run --with numpy scripts/examine-hard-cases.py
"""

import json
import os
import re
import textwrap
from collections import Counter, defaultdict
from pathlib import Path

import numpy as np

# ── Constants ──────────────────────────────────────────────────────────────────

ROOT = Path(__file__).resolve().parent.parent

CAT_ABBREV = {
    "Board Governance": "BG",
    "Incident Disclosure": "ID",
    "Management Role": "MR",
    "None/Other": "N/O",
    "Risk Management Process": "RMP",
    "Strategy Integration": "SI",
    "Third-Party Risk": "TPR",
}
ABBREV_CAT = {v: k for k, v in CAT_ABBREV.items()}

AXES = [
    ("MR", "RMP", "MR <-> RMP"),
    ("BG", "MR", "BG <-> MR"),
    ("SI", "N/O", "SI <-> N/O"),
]

BENCH_FILES = [
    "gpt-5.4.jsonl",
    "gemini-3.1-pro-preview.jsonl",
    "glm-5:exacto.jsonl",
    "kimi-k2.5.jsonl",
    "mimo-v2-pro:exacto.jsonl",
    "minimax-m2.7:exacto.jsonl",
]

STAGE1_MODEL_SHORT = {
    "google/gemini-3.1-flash-lite-preview": "s1:gemini-flash",
    "x-ai/grok-4.1-fast": "s1:grok-fast",
    "xiaomi/mimo-v2-flash": "s1:mimo-flash",
}

BENCH_MODEL_SHORT = {
    "gpt-5.4.jsonl": "bench:gpt5.4",
    "gemini-3.1-pro-preview.jsonl": "bench:gemini-pro",
    "glm-5:exacto.jsonl": "bench:glm5",
    "kimi-k2.5.jsonl": "bench:kimi",
    "mimo-v2-pro:exacto.jsonl": "bench:mimo-pro",
    "minimax-m2.7:exacto.jsonl": "bench:minimax",
}


# ── Load data ──────────────────────────────────────────────────────────────────

def load_jsonl(path: str | Path) -> list[dict]:
    records = []
    with open(path) as f:
        for line in f:
            line = line.strip()
            if line:
                records.append(json.loads(line))
    return records


def abbrev(cat: str) -> str:
    return CAT_ABBREV.get(cat, cat)


def build_signal_matrix() -> tuple[dict[str, dict[str, str]], dict[str, dict[str, int]]]:
    """Build paragraphId -> {source: category_abbrev} and {source: specificity}."""
    # Only for the 1200 gold PIDs
    gold_pids: set[str] = set()
    human_labels = load_jsonl(ROOT / "data/gold/human-labels-raw.jsonl")
    for rec in human_labels:
        gold_pids.add(rec["paragraphId"])

    cat_matrix: dict[str, dict[str, str]] = defaultdict(dict)
    spec_matrix: dict[str, dict[str, int]] = defaultdict(dict)

    # 1) Human annotators (3 per paragraph)
    for rec in human_labels:
        pid = rec["paragraphId"]
        src = f"human:{rec['annotatorName']}"
        cat_matrix[pid][src] = abbrev(rec["contentCategory"])
        spec_matrix[pid][src] = rec["specificityLevel"]

    # 2) Stage 1 models (filter to gold PIDs)
    stage1_path = ROOT / "data/annotations/stage1.patched.jsonl"
    with open(stage1_path) as f:
        for line in f:
            rec = json.loads(line)
            pid = rec["paragraphId"]
            if pid not in gold_pids:
                continue
            model_id = rec["provenance"]["modelId"]
            src = STAGE1_MODEL_SHORT.get(model_id, model_id)
            cat_matrix[pid][src] = abbrev(rec["label"]["content_category"])
            spec_matrix[pid][src] = rec["label"]["specificity_level"]

    # 3) Opus
    for rec in load_jsonl(ROOT / "data/annotations/golden/opus.jsonl"):
        pid = rec["paragraphId"]
        if pid in gold_pids:
            cat_matrix[pid]["opus"] = abbrev(rec["label"]["content_category"])
            spec_matrix[pid]["opus"] = rec["label"]["specificity_level"]

    # 4) Bench-holdout models
    for fn in BENCH_FILES:
        src = BENCH_MODEL_SHORT[fn]
        for rec in load_jsonl(ROOT / "data/annotations/bench-holdout" / fn):
            pid = rec["paragraphId"]
            if pid in gold_pids:
                cat_matrix[pid][src] = abbrev(rec["label"]["content_category"])
                spec_matrix[pid][src] = rec["label"]["specificity_level"]

    return dict(cat_matrix), dict(spec_matrix)
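
# Composition note: per gold paragraph the matrix holds up to 13 signals —
# 3 human annotators, 3 stage-1 models, opus, and 6 bench-holdout models.
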
def load_paragraphs(gold_pids: set[str]) -> dict[str, dict]:
    """Load paragraph text for gold PIDs."""
    paragraphs = {}
    for rec in load_jsonl(ROOT / "data/gold/paragraphs-holdout.jsonl"):
        if rec["id"] in gold_pids:
            paragraphs[rec["id"]] = rec
    return paragraphs


# ── Analysis helpers ───────────────────────────────────────────────────────────

def find_axis_paragraphs(
    cat_matrix: dict[str, dict[str, str]], a: str, b: str
) -> list[tuple[str, dict[str, str], int, int]]:
    """Find paragraphs where the primary disagreement is between categories a and b.

    Returns list of (pid, signals, count_a, count_b) sorted by disagreement strength.
    """
    results = []
    for pid, signals in cat_matrix.items():
        cats = list(signals.values())
        counts = Counter(cats)
        ca, cb = counts.get(a, 0), counts.get(b, 0)
        if ca >= 1 and cb >= 1 and ca + cb >= len(cats) * 0.5:
            # This paragraph has a meaningful split on this axis
            results.append((pid, signals, ca, cb))
    # Sort by how evenly split (closer to 50/50 = harder)
    results.sort(key=lambda x: -min(x[2], x[3]))
    return results
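
# Worked example (hypothetical counts): a 13-signal paragraph split 7 MR / 6 RMP
# gets sort key -min(7, 6) = -6 and ranks ahead of a 12/1 split (key -1), so
# the most evenly divided paragraphs surface first.
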
def truncate_text(text: str, max_chars: int = 200) -> str:
    if len(text) <= max_chars:
        return text
    return text[:max_chars].rstrip() + "..."


def source_order() -> list[str]:
    """Canonical order for displaying sources."""
    humans = [f"human:{n}" for n in ["Joey", "Anuj", "Aaryan", "Elisabeth", "Meghan", "Xander"]]
    stage1 = ["s1:gemini-flash", "s1:grok-fast", "s1:mimo-flash"]
    opus = ["opus"]
    bench = [BENCH_MODEL_SHORT[fn] for fn in BENCH_FILES]
    return humans + stage1 + opus + bench


def format_signal_breakdown(
    signals: dict[str, str], axis_cats: tuple[str, str]
) -> str:
    """Format which sources said which category."""
    a, b = axis_cats
    a_sources = []
    b_sources = []
    other_sources = []
    for src in source_order():
        if src not in signals:
            continue
        cat = signals[src]
        if cat == a:
            a_sources.append(src)
        elif cat == b:
            b_sources.append(src)
        else:
            other_sources.append(f"{src}={cat}")

    parts = [
        f"  {a} ({len(a_sources)}): {', '.join(a_sources)}",
        f"  {b} ({len(b_sources)}): {', '.join(b_sources)}",
    ]
    if other_sources:
        parts.append(f"  Other: {', '.join(other_sources)}")
    return "\n".join(parts)
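
# Example breakdown (hypothetical sources/categories):
#   MR (4): human:Joey, s1:grok-fast, opus, bench:glm5
#   RMP (2): human:Anuj, bench:kimi
#   Other: s1:mimo-flash=BG
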
def extract_keyword_frequencies(
    paragraphs: dict[str, dict],
    axis_pids: list[str],
    cat_matrix: dict[str, dict[str, str]],
    cat_a: str,
    cat_b: str,
) -> tuple[Counter, Counter, Counter]:
    """Extract keyword frequencies for paragraphs leaning toward cat_a vs cat_b."""
    # Keywords to look for (domain-relevant)
    all_keywords = [
        "board", "director", "committee", "audit", "oversee", "oversight",
        "ciso", "officer", "chief", "vp", "vice president", "manager",
        "manage", "manages", "managing", "management", "responsible",
        "program", "team", "department", "staff", "personnel",
        "report", "reports", "reporting", "brief", "briefing", "informed",
        "incident", "breach", "attack", "compromise", "unauthorized",
        "material", "immaterial", "not material", "no material",
        "strategy", "strategic", "integrate", "integration", "aligned",
        "risk", "assess", "assessment", "framework", "nist", "iso",
        "policy", "policies", "procedure", "procedures",
        "third party", "third-party", "vendor", "supplier", "service provider",
        "insurance", "cyber insurance",
        "training", "awareness", "employee",
        "monitor", "monitoring", "detect", "detection",
        "govern", "governance",
        "experience", "experienced", "background", "qualification", "expertise",
        "day-to-day", "daily", "operational",
        "enterprise", "enterprise-wide",
        "designate", "designated", "appoint", "appointed",
    ]

    lean_a_pids = []
    lean_b_pids = []
    for pid in axis_pids:
        signals = cat_matrix[pid]
        counts = Counter(signals.values())
        if counts.get(cat_a, 0) > counts.get(cat_b, 0):
            lean_a_pids.append(pid)
        elif counts.get(cat_b, 0) > counts.get(cat_a, 0):
            lean_b_pids.append(pid)

    def count_keywords(pids: list[str]) -> Counter:
        kw_counts = Counter()
        for pid in pids:
            if pid not in paragraphs:
                continue
            text_lower = paragraphs[pid]["text"].lower()
            for kw in all_keywords:
                if kw in text_lower:
                    kw_counts[kw] += 1
        return kw_counts

    freq_a = count_keywords(lean_a_pids)
    freq_b = count_keywords(lean_b_pids)
    freq_all = count_keywords(axis_pids)

    return freq_a, freq_b, freq_all
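
# Example (hypothetical counts): if "board" appears in 8 of 10 cat_a-leaning
# paragraphs and 2 of 10 cat_b-leaning ones, callers can compare the rates
# 0.80 vs 0.20 after normalising by group size.
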
def analyze_human_vs_genai_splits(
    axis_pids: list[str],
    cat_matrix: dict[str, dict[str, str]],
    cat_a: str,
    cat_b: str,
) -> tuple[list[str], list[str]]:
    """Find cases where humans lean one way but GenAI leans the other."""
    human_a_genai_b = []  # humans say A, GenAI says B
    human_b_genai_a = []  # humans say B, GenAI says A

    human_prefixes = ["human:"]
    genai_prefixes = ["s1:", "opus", "bench:"]

    for pid in axis_pids:
        signals = cat_matrix[pid]
        human_cats = []
        genai_cats = []
        for src, cat in signals.items():
            if any(src.startswith(p) for p in human_prefixes):
                human_cats.append(cat)
            else:
                genai_cats.append(cat)

        human_a = sum(1 for c in human_cats if c == cat_a)
        human_b = sum(1 for c in human_cats if c == cat_b)
        genai_a = sum(1 for c in genai_cats if c == cat_a)
        genai_b = sum(1 for c in genai_cats if c == cat_b)

        if human_a > human_b and genai_b > genai_a:
            human_a_genai_b.append(pid)
        elif human_b > human_a and genai_a > genai_b:
            human_b_genai_a.append(pid)

    return human_a_genai_b, human_b_genai_a
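
# Example (hypothetical): humans vote MR, MR, RMP while GenAI votes RMP x7,
# MR x3 — humans lean MR, GenAI leans RMP, so the pid lands in human_a_genai_b
# for the ("MR", "RMP") axis.
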
# ── Main analysis ──────────────────────────────────────────────────────────────


def main():
    print("=" * 100)
    print("HARDEST CASES ANALYSIS: SEC CYBERSECURITY HOLDOUT DATASET")
    print("Examining disagreements across 13 annotation sources to inform codebook rulings")
    print("=" * 100)

    # Load data
    print("\nLoading data...")
    cat_matrix, spec_matrix = build_signal_matrix()
    gold_pids = set(cat_matrix.keys())
    paragraphs = load_paragraphs(gold_pids)
    print(f"  Loaded {len(gold_pids)} gold paragraphs with {len(source_order())} potential sources each")

    # Verify source coverage
    source_coverage = Counter()
    for pid in gold_pids:
        for src in cat_matrix[pid]:
            source_coverage[src] += 1
    print("\n  Source coverage:")
    for src in source_order():
        print(f"    {src}: {source_coverage.get(src, 0)} paragraphs")

    # ── Overall disagreement stats ─────────────────────────────────────────

    print("\n" + "=" * 100)
    print("OVERALL DISAGREEMENT STATISTICS")
    print("=" * 100)

    unanimous = 0
    near_unanimous = 0  # 1 dissenter
    split = 0
    for pid in gold_pids:
        cats = list(cat_matrix[pid].values())
        counts = Counter(cats)
        top = counts.most_common(1)[0][1]
        n = len(cats)
        if top == n:
            unanimous += 1
        elif top >= n - 1:
            near_unanimous += 1
        else:
            split += 1

    print(f"\n  Unanimous (all sources agree): {unanimous} ({unanimous/len(gold_pids)*100:.1f}%)")
    print(f"  Near-unanimous (1 dissenter): {near_unanimous} ({near_unanimous/len(gold_pids)*100:.1f}%)")
    print(f"  Split (2+ dissenters): {split} ({split/len(gold_pids)*100:.1f}%)")

    # Count all pairwise disagreement axes
    axis_counts = Counter()
    for pid in gold_pids:
        cats = list(cat_matrix[pid].values())
        unique = set(cats)
        if len(unique) >= 2:
            for c1 in unique:
                for c2 in unique:
                    if c1 < c2:
                        axis_counts[(c1, c2)] += 1

    print("\n  All disagreement axes (paragraph has at least 1 source saying each):")
    for (c1, c2), ct in axis_counts.most_common(30):
        print(f"    {c1} <-> {c2}: {ct} paragraphs")

    # ── Axis-specific analysis ─────────────────────────────────────────────

    all_axis_results = {}

    for cat_a, cat_b, axis_name in AXES:
        print("\n" + "=" * 100)
        print(f"AXIS: {axis_name}")
        print("=" * 100)

        axis_pids_data = find_axis_paragraphs(cat_matrix, cat_a, cat_b)
        axis_pids = [x[0] for x in axis_pids_data]
        all_axis_results[axis_name] = axis_pids

        print(f"\n  Paragraphs with primary {cat_a}/{cat_b} disagreement: {len(axis_pids)}")

        if not axis_pids:
            print("    No paragraphs found on this axis.")
            continue

        # ── Signal split statistics ────────────────────────────────────────

        # Count how the split goes (majority A vs majority B)
        majority_a = sum(1 for _, _, ca, cb in axis_pids_data if ca > cb)
        majority_b = sum(1 for _, _, ca, cb in axis_pids_data if cb > ca)
        tied = sum(1 for _, _, ca, cb in axis_pids_data if ca == cb)
        print(f"  Majority {cat_a}: {majority_a} | Majority {cat_b}: {majority_b} | Tied: {tied}")

        # ── Human vs GenAI splits ──────────────────────────────────────────

        human_a_genai_b, human_b_genai_a = analyze_human_vs_genai_splits(
            axis_pids, cat_matrix, cat_a, cat_b
        )
        print("\n  Human/GenAI disagreements:")
        print(f"    Humans say {cat_a}, GenAI says {cat_b}: {len(human_a_genai_b)}")
        print(f"    Humans say {cat_b}, GenAI says {cat_a}: {len(human_b_genai_a)}")

        # ── Representative examples ────────────────────────────────────────

        # Show hardest cases (most evenly split)
        n_examples = min(10, len(axis_pids_data))
        print(f"\n  {'─' * 90}")
        print(f"  TOP {n_examples} MOST CONTENTIOUS PARAGRAPHS")
        print(f"  {'─' * 90}")

        for i, (pid, signals, ca, cb) in enumerate(axis_pids_data[:n_examples]):
            para = paragraphs.get(pid, {})
            text = para.get("text", "[text not found]")
            company = para.get("companyName", "?")
            word_count = para.get("wordCount", "?")

            print(f"\n  [{i+1}] PID: {pid[:12]}... Company: {company}")
            print(f"      Words: {word_count} | Split: {ca} say {cat_a}, {cb} say {cat_b}, {len(signals)-ca-cb} say other")
            print(f"      Text: {truncate_text(text, 250)}")
            print(format_signal_breakdown(signals, (cat_a, cat_b)))

        # ── Human-A / GenAI-B examples ─────────────────────────────────────

        if human_a_genai_b:
            print(f"\n  {'─' * 90}")
            print(f"  HUMANS SAY {cat_a}, GenAI SAYS {cat_b} (up to 5 examples)")
            print(f"  {'─' * 90}")
            for pid in human_a_genai_b[:5]:
                para = paragraphs.get(pid, {})
                text = para.get("text", "[text not found]")
                print(f"\n    PID: {pid[:12]}...")
                print(f"    Text: {truncate_text(text, 250)}")
                print(format_signal_breakdown(cat_matrix[pid], (cat_a, cat_b)))

        if human_b_genai_a:
            print(f"\n  {'─' * 90}")
            print(f"  HUMANS SAY {cat_b}, GenAI SAYS {cat_a} (up to 5 examples)")
            print(f"  {'─' * 90}")
            for pid in human_b_genai_a[:5]:
                para = paragraphs.get(pid, {})
                text = para.get("text", "[text not found]")
                print(f"\n    PID: {pid[:12]}...")
                print(f"    Text: {truncate_text(text, 250)}")
                print(format_signal_breakdown(cat_matrix[pid], (cat_a, cat_b)))

        # ── Keyword / linguistic patterns ──────────────────────────────────

        print(f"\n  {'─' * 90}")
        print("  LINGUISTIC PATTERNS")
        print(f"  {'─' * 90}")

        freq_a, freq_b, freq_all = extract_keyword_frequencies(
            paragraphs, axis_pids, cat_matrix, cat_a, cat_b
        )

        # Compute over-representation: keywords more common when majority says A vs B
        lean_a_ct = sum(
            1 for pid in axis_pids
            if Counter(cat_matrix[pid].values()).get(cat_a, 0) > Counter(cat_matrix[pid].values()).get(cat_b, 0)
        )
        lean_b_ct = sum(
            1 for pid in axis_pids
            if Counter(cat_matrix[pid].values()).get(cat_b, 0) > Counter(cat_matrix[pid].values()).get(cat_a, 0)
        )

        print(f"\n  Paragraphs leaning {cat_a}: {lean_a_ct} | leaning {cat_b}: {lean_b_ct}")

        # Show keywords sorted by differential
        all_kws = set(freq_a.keys()) | set(freq_b.keys())
        diffs = []
        for kw in all_kws:
            fa = freq_a.get(kw, 0)
            fb = freq_b.get(kw, 0)
            total = freq_all.get(kw, 0)
            if total < 3:
                continue
            # Normalize by group size
            rate_a = fa / max(lean_a_ct, 1)
            rate_b = fb / max(lean_b_ct, 1)
            diff = rate_a - rate_b
            diffs.append((kw, fa, fb, total, rate_a, rate_b, diff))

        diffs.sort(key=lambda x: -abs(x[6]))

        print(f"\n  Keywords by differential (rate in {cat_a}-leaning vs {cat_b}-leaning paragraphs):")
        print(f"    {'Keyword':<22} {'In '+cat_a:>8} {'In '+cat_b:>8} {'Total':>8} {'Rate '+cat_a:>10} {'Rate '+cat_b:>10} {'Diff':>8}")
        print(f"    {'─'*22} {'─'*8} {'─'*8} {'─'*8} {'─'*10} {'─'*10} {'─'*8}")
        for kw, fa, fb, total, ra, rb, diff in diffs[:25]:
            marker = f"<- {cat_a}" if diff > 0.05 else (f"<- {cat_b}" if diff < -0.05 else "")
            print(f"    {kw:<22} {fa:>8} {fb:>8} {total:>8} {ra:>10.2%} {rb:>10.2%} {diff:>+8.2%} {marker}")

    # ── Other notable axes ─────────────────────────────────────────────────

    print("\n" + "=" * 100)
    print("OTHER NOTABLE DISAGREEMENT AXES (10+ paragraphs)")
    print("=" * 100)

    primary_axis_set = {("BG", "MR"), ("MR", "BG"), ("MR", "RMP"), ("RMP", "MR"), ("N/O", "SI"), ("SI", "N/O")}

    other_axes = []
    for (c1, c2), ct in axis_counts.most_common():
        if (c1, c2) not in primary_axis_set and ct >= 10:
            other_axes.append((c1, c2, ct))

    if not other_axes:
        print("\n  No other axes with 10+ paragraphs.")
    else:
        for cat_a, cat_b, count in other_axes:
            print(f"\n  {'─' * 90}")
            print(f"  {cat_a} <-> {cat_b}: {count} paragraphs")
            print(f"  {'─' * 90}")

            axis_pids_data = find_axis_paragraphs(cat_matrix, cat_a, cat_b)
            # Show up to 5 examples
            for i, (pid, signals, ca, cb) in enumerate(axis_pids_data[:5]):
                para = paragraphs.get(pid, {})
                text = para.get("text", "[text not found]")
                print(f"\n    [{i+1}] {truncate_text(text, 200)}")
                print(f"        Split: {ca}x {cat_a}, {cb}x {cat_b}")
                print(format_signal_breakdown(signals, (cat_a, cat_b)))

    # ── Summary statistics ─────────────────────────────────────────────────

    print("\n" + "=" * 100)
    print("SUMMARY STATISTICS")
    print("=" * 100)

    # Per-axis counts
    print("\n  Paragraphs on each primary confusion axis:")
    for cat_a, cat_b, axis_name in AXES:
        axis_data = find_axis_paragraphs(cat_matrix, cat_a, cat_b)
        print(f"    {axis_name}: {len(axis_data)} paragraphs")

    # How many could potentially be resolved by keyword rules?
    print("\n  Keyword-resolvable estimate (paragraphs containing strong discriminator keywords):")

    mr_rmp_data = find_axis_paragraphs(cat_matrix, "MR", "RMP")
    mr_rmp_pids = [x[0] for x in mr_rmp_data]
    resolvable_mr_rmp = 0
    mr_keywords = {"ciso", "chief information security", "chief security", "vp", "vice president",
                   "officer", "director of", "head of", "reports to", "reporting to"}
    rmp_keywords = {"framework", "nist", "iso", "soc 2", "assessment", "penetration test",
                    "vulnerability scan", "audit", "tabletop"}
    for pid in mr_rmp_pids:
        text_lower = paragraphs.get(pid, {}).get("text", "").lower()
        has_mr = any(kw in text_lower for kw in mr_keywords)
        has_rmp = any(kw in text_lower for kw in rmp_keywords)
        if has_mr != has_rmp:  # One side but not the other
            resolvable_mr_rmp += 1
    print(f"    MR <-> RMP: {resolvable_mr_rmp}/{len(mr_rmp_pids)} have clear keyword signal ({resolvable_mr_rmp/max(len(mr_rmp_pids),1)*100:.0f}%)")
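
    # A paragraph counts as resolvable only when exactly one keyword set fires,
    # e.g. "ciso" present with no framework terms; hits on both sides (or
    # neither) leave the paragraph ambiguous.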

    bg_mr_data = find_axis_paragraphs(cat_matrix, "BG", "MR")
    bg_mr_pids = [x[0] for x in bg_mr_data]
    resolvable_bg_mr = 0
    bg_keywords = {"board", "director", "committee", "audit committee", "board of directors"}
    mr_only_keywords = {"ciso", "chief information security", "officer", "vp", "management",
                        "team", "department", "staff", "day-to-day", "operational"}
    for pid in bg_mr_pids:
        text_lower = paragraphs.get(pid, {}).get("text", "").lower()
        has_bg = any(kw in text_lower for kw in bg_keywords)
        has_mr_only = any(kw in text_lower for kw in mr_only_keywords)
        if has_bg and not has_mr_only:
            resolvable_bg_mr += 1
        elif has_mr_only and not has_bg:
            resolvable_bg_mr += 1
    print(f"    BG <-> MR: {resolvable_bg_mr}/{len(bg_mr_pids)} have clear keyword signal ({resolvable_bg_mr/max(len(bg_mr_pids),1)*100:.0f}%)")

    si_no_data = find_axis_paragraphs(cat_matrix, "SI", "N/O")
    si_no_pids = [x[0] for x in si_no_data]
    resolvable_si_no = 0
    si_keywords = {"incident", "breach", "attack", "compromise", "unauthorized access",
                   "ransomware", "malware", "phishing", "data loss", "disruption"}
    no_keywords = {"no material", "not material", "have not experienced", "no known",
                   "not aware of any", "not been subject"}
    for pid in si_no_pids:
        text_lower = paragraphs.get(pid, {}).get("text", "").lower()
        has_si = any(kw in text_lower for kw in si_keywords)
        has_no = any(kw in text_lower for kw in no_keywords)
        if has_no:
            resolvable_si_no += 1
        elif has_si and not has_no:
            resolvable_si_no += 1
    print(f"    SI <-> N/O: {resolvable_si_no}/{len(si_no_pids)} have clear keyword signal ({resolvable_si_no/max(len(si_no_pids),1)*100:.0f}%)")

    # ── Specificity disagreements on confused paragraphs ───────────────────

    print("\n" + "=" * 100)
    print("SPECIFICITY DISAGREEMENT ON CONFUSED PARAGRAPHS")
    print("=" * 100)

    for cat_a, cat_b, axis_name in AXES:
        axis_data = find_axis_paragraphs(cat_matrix, cat_a, cat_b)
        if not axis_data:
            continue
        spec_ranges = []
        for pid, signals, _, _ in axis_data:
            specs = list(spec_matrix.get(pid, {}).values())
            if specs:
                spec_ranges.append(max(specs) - min(specs))
        if spec_ranges:
            avg_range = np.mean(spec_ranges)
            print(f"\n  {axis_name}: avg specificity range = {avg_range:.2f} (0=agree, 3=max disagree)")
            range_dist = Counter(spec_ranges)
            for r in sorted(range_dist.keys()):
                print(f"      Range {r}: {range_dist[r]} paragraphs")

    # ── Recommended codebook rulings ───────────────────────────────────────

    print("\n" + "=" * 100)
    print("RECOMMENDED CODEBOOK RULINGS")
    print("=" * 100)

    print("""
Based on the analysis above, the following rulings would resolve the most cases:

RULING 1: MR vs RMP — "Named-role test"
──────────────────────────────────────────
If the paragraph's PRIMARY subject is a named individual, titled role (CISO, VP,
CTO, etc.), or a specific person's responsibilities/qualifications/experience,
classify as MR. If the paragraph's PRIMARY subject is a process, program, system,
or methodology (even if it mentions who runs it), classify as RMP.

Disambiguator: Ask "Is this paragraph ABOUT a person/role, or ABOUT a process?"
- "Our CISO oversees our cybersecurity program" → MR (about the CISO)
- "Our cybersecurity program includes monitoring, led by the CISO" → RMP (about the program)

RULING 2: BG vs MR — "Board-line test"
──────────────────────────────────────────
If the paragraph describes oversight, reporting, or governance AT or ABOVE the
board/committee level, classify as BG. If it describes responsibilities BELOW
the board level (C-suite officers reporting TO the board, management teams,
operational roles), classify as MR.

Disambiguator: "Does this paragraph describe what the board/committee DOES,
or what someone REPORTS TO the board?"
- "The Audit Committee oversees cybersecurity risk" → BG
- "The CISO reports quarterly to the Audit Committee" → BG (board's receiving mechanism)
- "The CISO manages a team of security analysts" → MR

Key edge case: When a paragraph describes BOTH board oversight AND management
roles, classify by the paragraph's PRIMARY focus. If roughly equal, prefer BG
when board action is the grammatical subject.

RULING 3: SI vs N/O — "Negative-incident test"
──────────────────────────────────────────
Negative incident statements ("we have not experienced any material cybersecurity
incidents") should be classified as N/O, NOT as SI. SI requires disclosure of an
ACTUAL incident that occurred. The mere mention of incidents in a negation context
does not constitute incident disclosure.

However: If the paragraph describes a SPECIFIC past incident (even if resolved or
deemed immaterial), classify as SI. The test is: "Did something actually happen?"
- "We have not experienced material incidents" → N/O
- "In 2023, we experienced a ransomware attack that..." → SI
- "We experienced incidents but none were material" → SI (something happened)
""")
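
    # Sketch of how the negative-incident test could be approximated
    # mechanically (hypothetical phrase list, not the adopted codebook wording):
    #   NEGATED = ("have not experienced", "no material", "not aware of any")
    #   negated_only = any(p in text.lower() for p in NEGATED)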

    # ── Deep dive: the very hardest cases ──────────────────────────────────

    print("=" * 100)
    print("DEEP DIVE: PARAGRAPHS WITH MAXIMUM ENTROPY (4+ DISTINCT CATEGORIES)")
    print("=" * 100)

    high_entropy = []
    for pid in gold_pids:
        cats = list(cat_matrix[pid].values())
        n_unique = len(set(cats))
        if n_unique >= 4:
            high_entropy.append((pid, n_unique, Counter(cats)))

    high_entropy.sort(key=lambda x: -x[1])
    print(f"\n  {len(high_entropy)} paragraphs with 4+ distinct category labels")

    for i, (pid, n_unique, counts) in enumerate(high_entropy[:10]):
        para = paragraphs.get(pid, {})
        text = para.get("text", "[text not found]")
        print(f"\n  [{i+1}] PID: {pid[:12]}... ({n_unique} categories)")
        print(f"      Text: {truncate_text(text, 250)}")
        print(f"      Distribution: {dict(counts.most_common())}")
        # Show all sources
        for src in source_order():
            if src in cat_matrix[pid]:
                cat = cat_matrix[pid][src]
                spec = spec_matrix.get(pid, {}).get(src, "?")
                print(f"        {src:<25} {cat:<5} spec={spec}")

    # ── Per-source accuracy vs human majority ──────────────────────────────

    print("\n" + "=" * 100)
    print("GENAI SOURCE AGREEMENT WITH HUMAN MAJORITY (on axis-confused paragraphs only)")
    print("=" * 100)

    for cat_a, cat_b, axis_name in AXES:
        axis_data = find_axis_paragraphs(cat_matrix, cat_a, cat_b)
        if not axis_data:
            continue

        print(f"\n  {axis_name} ({len(axis_data)} paragraphs):")

        # For each paragraph, determine human majority
        genai_sources = [s for s in source_order() if not s.startswith("human:")]
        source_agree = {s: 0 for s in genai_sources}
        source_total = {s: 0 for s in genai_sources}

        for pid, signals, _, _ in axis_data:
            # Human majority on this axis
            human_cats = [
                signals[s] for s in signals
                if s.startswith("human:") and signals[s] in (cat_a, cat_b)
            ]
            if not human_cats:
                continue
            human_majority = Counter(human_cats).most_common(1)[0][0]

            for src in genai_sources:
                if src in signals:
                    source_total[src] += 1
                    if signals[src] == human_majority:
                        source_agree[src] += 1

        print(f"    {'Source':<25} {'Agree':>8} {'Total':>8} {'Rate':>8}")
        print(f"    {'─'*25} {'─'*8} {'─'*8} {'─'*8}")
        for src in genai_sources:
            total = source_total[src]
            agree = source_agree[src]
            rate = agree / max(total, 1)
            print(f"    {src:<25} {agree:>8} {total:>8} {rate:>8.1%}")

    print("\n" + "=" * 100)
    print("END OF ANALYSIS")
    print("=" * 100)


if __name__ == "__main__":
    main()
764
scripts/examine-v35-errors.py
Normal file
@ -0,0 +1,764 @@
"""Examine specific paragraphs where v3.5 performed WORSE than v3.0 against human labels.
|
||||||
|
|
||||||
|
Focus on BG↔MR and MR↔RMP confusion axes.
|
||||||
|
"""
|
||||||
|
|
||||||
|
import json
|
||||||
|
import textwrap
|
||||||
|
from collections import Counter, defaultdict
|
||||||
|
from pathlib import Path
|
||||||
|
|
||||||
|
# ── Paths ──────────────────────────────────────────────────────────────────────
|
||||||
|
ROOT = Path(__file__).resolve().parent.parent
|
||||||
|
|
||||||
|
V30_GOLDEN = ROOT / "data/annotations/golden/opus.jsonl"
|
||||||
|
V35_GOLDEN = ROOT / "data/annotations/golden-v35/opus.jsonl"
|
||||||
|
|
||||||
|
V30_BENCH = ROOT / "data/annotations/bench-holdout"
|
||||||
|
V35_BENCH = ROOT / "data/annotations/bench-holdout-v35"
|
||||||
|
|
||||||
|
HUMAN_LABELS = ROOT / "data/gold/human-labels-raw.jsonl"
|
||||||
|
HOLDOUT_META = ROOT / "data/gold/holdout-rerun-v35.jsonl"
|
||||||
|
PARAGRAPHS = ROOT / "data/gold/paragraphs-holdout.jsonl"
|
||||||
|
|
||||||
|
MODEL_FILES = [
|
||||||
|
"opus.jsonl",
|
||||||
|
"gpt-5.4.jsonl",
|
||||||
|
"gemini-3.1-pro-preview.jsonl",
|
||||||
|
"glm-5:exacto.jsonl",
|
||||||
|
"kimi-k2.5.jsonl",
|
||||||
|
"mimo-v2-pro:exacto.jsonl",
|
||||||
|
"minimax-m2.7:exacto.jsonl",
|
||||||
|
]
|
||||||
|
|
||||||
|
MODEL_NAMES = [
|
||||||
|
"Opus",
|
||||||
|
"GPT-5.4",
|
||||||
|
"Gemini",
|
||||||
|
"GLM-5",
|
||||||
|
"Kimi",
|
||||||
|
"Mimo",
|
||||||
|
"MiniMax",
|
||||||
|
]
|
||||||
|
|
||||||
|
# Models to EXCLUDE from majority calculation
|
||||||
|
EXCLUDED_FROM_MAJORITY = {"MiniMax"}
|
||||||
|
|
||||||
|

ABBREV_CAT = {
    "BG": "Board Governance",
    "MR": "Management Role",
    "RMP": "Risk Management Process",
    "SI": "Strategy Integration",
    "NO": "None/Other",
    "ID": "Incident Disclosure",
    "TPR": "Third-Party Risk",
}

CAT_ABBREV = {v: k for k, v in ABBREV_CAT.items()}


def abbrev(cat: str) -> str:
    return CAT_ABBREV.get(cat, cat)

def load_jsonl(path: Path) -> list[dict]:
    with open(path) as f:
        return [json.loads(line) for line in f if line.strip()]


def load_annotations(base_dir: Path, filename: str) -> dict[str, str]:
    """Load paragraphId → content_category mapping."""
    path = base_dir / filename
    records = load_jsonl(path)
    return {r["paragraphId"]: r["label"]["content_category"] for r in records}


def load_golden(path: Path) -> dict[str, str]:
    records = load_jsonl(path)
    return {r["paragraphId"]: r["label"]["content_category"] for r in records}


# ── Load all data ─────────────────────────────────────────────────────────────

print("Loading data...")

# Confusion axis metadata
meta_records = load_jsonl(HOLDOUT_META)
pid_axes: dict[str, list[str]] = {r["paragraphId"]: r["axes"] for r in meta_records}
all_pids = set(pid_axes.keys())
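
# Axis tags such as "BG_MR" mark which confusion axis a paragraph was rerun
# for (format inferred from the membership checks below).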

# Human labels: paragraphId → list of (annotator, category)
human_raw = load_jsonl(HUMAN_LABELS)
human_labels: dict[str, list[tuple[str, str]]] = defaultdict(list)
for r in human_raw:
    if r["paragraphId"] in all_pids:
        human_labels[r["paragraphId"]].append(
            (r["annotatorName"], r["contentCategory"])
        )


def human_majority(pid: str) -> str | None:
    """Return majority category from human annotators, or None if no data."""
    labels = human_labels.get(pid)
    if not labels:
        return None
    cats = [c for _, c in labels]
    counts = Counter(cats)
    top = counts.most_common(1)[0]
    return top[0]
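
# Note: Counter.most_common breaks ties by first occurrence, so a 1-1-1 split
# across three annotators returns whichever label was recorded first; such
# "majorities" are weak signals.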

# Paragraph text
para_records = load_jsonl(PARAGRAPHS)
para_text: dict[str, str] = {r["id"]: r["text"] for r in para_records}

# v3.0 signals: model_idx → {pid: category}
v30_signals: list[dict[str, str]] = []
for fname in MODEL_FILES:
    if fname == "opus.jsonl":
        v30_signals.append(load_golden(V30_GOLDEN))
    else:
        v30_signals.append(load_annotations(V30_BENCH, fname))

# v3.5 signals
v35_signals: list[dict[str, str]] = []
for fname in MODEL_FILES:
    if fname == "opus.jsonl":
        v35_signals.append(load_golden(V35_GOLDEN))
    else:
        v35_signals.append(load_annotations(V35_BENCH, fname))


def get_signals(signals: list[dict[str, str]], pid: str) -> list[str | None]:
    """Get category from each model for a paragraph."""
    return [s.get(pid) for s in signals]


def majority_vote(signals: list[str | None], exclude_minimax: bool = True) -> str | None:
    """Compute majority from 6 models (excluding minimax which is index 6)."""
    cats = []
    for i, s in enumerate(signals):
        if s is None:
            continue
        if exclude_minimax and MODEL_NAMES[i] in EXCLUDED_FROM_MAJORITY:
            continue
        cats.append(s)
    if not cats:
        return None
    counts = Counter(cats)
    return counts.most_common(1)[0][0]


def unanimity_score(signals: list[str | None], exclude_minimax: bool = True) -> float:
    """Fraction of models agreeing with majority (0-1)."""
    cats = []
    for i, s in enumerate(signals):
        if s is None:
            continue
        if exclude_minimax and MODEL_NAMES[i] in EXCLUDED_FROM_MAJORITY:
            continue
        cats.append(s)
    if not cats:
        return 0.0
    counts = Counter(cats)
    top_count = counts.most_common(1)[0][1]
    return top_count / len(cats)
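
# Worked example (hypothetical): signals ["BG", "BG", "MR", None, "BG", "BG", "RMP"]
# drop the None and MiniMax (index 6), leaving 5 votes, so majority_vote gives
# "BG" and unanimity_score gives 4/5 = 0.8.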

def format_signals(signals: list[str | None]) -> str:
    """Compact model signal display."""
    parts = []
    for name, cat in zip(MODEL_NAMES, signals):
        if cat is None:
            parts.append(f"{name}=??")
        else:
            parts.append(f"{name}={abbrev(cat)}")
    return ", ".join(parts)


def wrap_text(text: str, width: int = 100) -> str:
    return "\n  ".join(textwrap.wrap(text, width=width))


def print_paragraph_analysis(
    pid: str,
    v30_sigs: list[str | None],
    v35_sigs: list[str | None],
    header: str = "",
):
    """Print detailed analysis for a single paragraph."""
    text = para_text.get(pid, "[TEXT NOT FOUND]")
    h_labels = human_labels.get(pid, [])
    h_maj = human_majority(pid)
    v30_maj = majority_vote(v30_sigs)
    v35_maj = majority_vote(v35_sigs)
    axes = pid_axes.get(pid, [])

    if header:
        print(f"\n{'─' * 110}")
        print(f"  {header}")
        print(f"{'─' * 110}")
    else:
        print(f"\n{'─' * 110}")

    print(f"  PID: {pid}")
    print(f"  Axes: {', '.join(axes)}")
    print("\n  TEXT:")
    print(f"  {wrap_text(text)}")

    print("\n  HUMAN VOTES:")
    for name, cat in h_labels:
        marker = " ✓" if cat == h_maj else ""
        print(f"    {name:12s} → {abbrev(cat):5s}{marker}")
    print(f"    Majority → {abbrev(h_maj) if h_maj else '??'}")

    print(f"\n  v3.0 signals: {format_signals(v30_sigs)}")
    print(f"  v3.0 majority (excl. MiniMax): {abbrev(v30_maj) if v30_maj else '??'}")

    print(f"  v3.5 signals: {format_signals(v35_sigs)}")
    print(f"  v3.5 majority (excl. MiniMax): {abbrev(v35_maj) if v35_maj else '??'}")

    # What changed
    changed_models = []
    for i, (old, new) in enumerate(zip(v30_sigs, v35_sigs)):
        if old is not None and new is not None and old != new:
            changed_models.append(f"{MODEL_NAMES[i]}: {abbrev(old)}→{abbrev(new)}")
    if changed_models:
        print(f"\n  CHANGES: {', '.join(changed_models)}")

    correct_v30 = v30_maj == h_maj if v30_maj and h_maj else None
    correct_v35 = v35_maj == h_maj if v35_maj and h_maj else None
    print(
        f"  v3.0 {'CORRECT' if correct_v30 else 'WRONG'} | "
        f"v3.5 {'CORRECT' if correct_v35 else 'WRONG'}"
    )

# ══════════════════════════════════════════════════════════════════════════════
# SECTION 1: BG↔MR Regression Cases
# ══════════════════════════════════════════════════════════════════════════════

print("\n" + "═" * 110)
print(" SECTION 1: BG↔MR AXIS — REGRESSION CASES")
print(" (v3.0 matched human majority, but v3.5 does NOT)")
print("═" * 110)

bg_mr_pids = [pid for pid, axes in pid_axes.items() if "BG_MR" in axes]
print(f"\nTotal BG↔MR paragraphs: {len(bg_mr_pids)}")

# Filter to those with human labels
bg_mr_pids = [pid for pid in bg_mr_pids if human_majority(pid) is not None]
print(f"With human labels: {len(bg_mr_pids)}")

regressions_bg_mr = []
improvements_bg_mr = []
both_correct_bg_mr = []
both_wrong_bg_mr = []

for pid in bg_mr_pids:
    v30_sigs = get_signals(v30_signals, pid)
    v35_sigs = get_signals(v35_signals, pid)
    v30_maj = majority_vote(v30_sigs)
    v35_maj = majority_vote(v35_sigs)
    h_maj = human_majority(pid)

    if v30_maj is None or v35_maj is None or h_maj is None:
        continue

    v30_correct = abbrev(v30_maj) == abbrev(h_maj)
    v35_correct = abbrev(v35_maj) == abbrev(h_maj)

    if v30_correct and not v35_correct:
        regressions_bg_mr.append(pid)
    elif not v30_correct and v35_correct:
        improvements_bg_mr.append(pid)
    elif v30_correct and v35_correct:
        both_correct_bg_mr.append(pid)
    else:
        both_wrong_bg_mr.append(pid)

print("\nBG↔MR Summary:")
print(f"  Both correct: {len(both_correct_bg_mr)}")
print(f"  Both wrong: {len(both_wrong_bg_mr)}")
print(f"  v3.0 correct → v3.5 WRONG (REGRESSIONS): {len(regressions_bg_mr)}")
print(f"  v3.0 wrong → v3.5 correct (IMPROVEMENTS): {len(improvements_bg_mr)}")

print(f"\n{'━' * 110}")
print(" BG↔MR REGRESSIONS (showing up to 20)")
print(f"{'━' * 110}")

for i, pid in enumerate(regressions_bg_mr[:20]):
    v30_sigs = get_signals(v30_signals, pid)
    v35_sigs = get_signals(v35_signals, pid)
    print_paragraph_analysis(pid, v30_sigs, v35_sigs, f"REGRESSION #{i+1}")

# BG↔MR improvements
print(f"\n{'━' * 110}")
print(" BG↔MR IMPROVEMENTS (showing up to 5)")
print(f"{'━' * 110}")

for i, pid in enumerate(improvements_bg_mr[:5]):
    v30_sigs = get_signals(v30_signals, pid)
    v35_sigs = get_signals(v35_signals, pid)
    print_paragraph_analysis(pid, v30_sigs, v35_sigs, f"IMPROVEMENT #{i+1}")

# ══════════════════════════════════════════════════════════════════════════════
# SECTION 2: MR↔RMP Non-Convergence Cases
# ══════════════════════════════════════════════════════════════════════════════

print("\n\n" + "═" * 110)
print(" SECTION 2: MR↔RMP AXIS — NON-CONVERGENCE AND REGRESSIONS")
print("═" * 110)

mr_rmp_pids = [pid for pid, axes in pid_axes.items() if "MR_RMP" in axes]
print(f"\nTotal MR↔RMP paragraphs: {len(mr_rmp_pids)}")
mr_rmp_pids = [pid for pid in mr_rmp_pids if human_majority(pid) is not None]
print(f"With human labels: {len(mr_rmp_pids)}")

# Find: less unanimous in v3.5 OR flipped away from human majority
non_convergence_mr_rmp = []
regressions_mr_rmp = []
improvements_mr_rmp = []

for pid in mr_rmp_pids:
    v30_sigs = get_signals(v30_signals, pid)
    v35_sigs = get_signals(v35_signals, pid)
    v30_maj = majority_vote(v30_sigs)
    v35_maj = majority_vote(v35_sigs)
    h_maj = human_majority(pid)
    v30_unanimity = unanimity_score(v30_sigs)
    v35_unanimity = unanimity_score(v35_sigs)

    if v30_maj is None or v35_maj is None or h_maj is None:
        continue

    v30_correct = abbrev(v30_maj) == abbrev(h_maj)
    v35_correct = abbrev(v35_maj) == abbrev(h_maj)

    # Regression: was correct, now wrong
    if v30_correct and not v35_correct:
        regressions_mr_rmp.append((pid, v30_unanimity, v35_unanimity))

    # Non-convergence: less unanimous OR flipped away
    if v35_unanimity < v30_unanimity or (v30_correct and not v35_correct):
        non_convergence_mr_rmp.append((pid, v30_unanimity, v35_unanimity))

    if not v30_correct and v35_correct:
        improvements_mr_rmp.append((pid, v30_unanimity, v35_unanimity))

# Sort non-convergence by delta (worst first)
non_convergence_mr_rmp.sort(key=lambda x: x[1] - x[2], reverse=True)

print("\nMR↔RMP Summary:")
print(f"  Regressions (correct→wrong): {len(regressions_mr_rmp)}")
print(f"  Non-convergence (less unanimous or regressed): {len(non_convergence_mr_rmp)}")
print(f"  Improvements (wrong→correct): {len(improvements_mr_rmp)}")

print(f"\n{'━' * 110}")
print(" MR↔RMP NON-CONVERGENCE / REGRESSION CASES (showing up to 10)")
print(f"{'━' * 110}")

shown = set()
count = 0
for pid, v30_u, v35_u in non_convergence_mr_rmp:
    if count >= 10:
        break
    if pid in shown:
        continue
    shown.add(pid)
    v30_sigs = get_signals(v30_signals, pid)
    v35_sigs = get_signals(v35_signals, pid)
    v30_maj = majority_vote(v30_sigs)
    v35_maj = majority_vote(v35_sigs)
    h_maj = human_majority(pid)
    label = (
        "REGRESSION"
        if abbrev(v30_maj) == abbrev(h_maj) and abbrev(v35_maj) != abbrev(h_maj)
        else "LESS UNANIMOUS"
    )
    print_paragraph_analysis(
        pid, v30_sigs, v35_sigs,
        f"{label} #{count+1} (unanimity: v3.0={v30_u:.0%} → v3.5={v35_u:.0%})"
    )
    count += 1

print(f"\n{'━' * 110}")
print(" MR↔RMP IMPROVEMENTS (showing up to 5)")
print(f"{'━' * 110}")

for i, (pid, v30_u, v35_u) in enumerate(improvements_mr_rmp[:5]):
    v30_sigs = get_signals(v30_signals, pid)
    v35_sigs = get_signals(v35_signals, pid)
    print_paragraph_analysis(
        pid, v30_sigs, v35_sigs,
        f"IMPROVEMENT #{i+1} (unanimity: v3.0={v30_u:.0%} → v3.5={v35_u:.0%})"
    )

# ══════════════════════════════════════════════════════════════════════════════
# SECTION 3: Error Pattern Analysis
# ══════════════════════════════════════════════════════════════════════════════

print("\n\n" + "═" * 110)
print(" SECTION 3: ERROR PATTERN ANALYSIS")
print("═" * 110)

# ── BG↔MR regression patterns ───────────────────────────────────────────────
print(f"\n{'━' * 110}")
print(" 3A: BG↔MR REGRESSION PATTERNS")
print(f"{'━' * 110}")

if regressions_bg_mr:
    # Analyze what the human majority is and what v3.5 switched to
    regression_directions = Counter()
    regression_model_flips = Counter()

    for pid in regressions_bg_mr:
        h_maj = human_majority(pid)
        v30_sigs = get_signals(v30_signals, pid)
        v35_sigs = get_signals(v35_signals, pid)
        v30_maj = majority_vote(v30_sigs)
        v35_maj = majority_vote(v35_sigs)
        direction = f"{abbrev(v30_maj)}→{abbrev(v35_maj)} (human={abbrev(h_maj)})"
        regression_directions[direction] += 1

        # Which models flipped?
        for i, (old, new) in enumerate(zip(v30_sigs, v35_sigs)):
            if old and new and old != new:
                regression_model_flips[MODEL_NAMES[i]] += 1

    print("\n  Regression directions (v3.0→v3.5, human ground truth):")
    for direction, count in regression_directions.most_common():
        print(f"    {direction}: {count}")

    print("\n  Models that flipped most on regressions:")
    for model, count in regression_model_flips.most_common():
        print(f"    {model}: {count} flips")

    # Text pattern analysis
    print("\n  Common textual signals in regression paragraphs:")
    signal_words = {
        "board": 0, "committee": 0, "oversee": 0, "oversight": 0,
        "report": 0, "director": 0, "officer": 0, "CISO": 0,
        "governance": 0, "responsible": 0, "qualif": 0, "experience": 0,
        "manage": 0, "program": 0, "framework": 0, "process": 0,
        "audit": 0,
    }
    for pid in regressions_bg_mr:
        text = para_text.get(pid, "").lower()
        for word in signal_words:
            if word.lower() in text:
                signal_words[word] += 1

    total_reg = len(regressions_bg_mr)
    for word, count in sorted(signal_words.items(), key=lambda x: -x[1]):
        if count > 0:
            print(f"    '{word}': {count}/{total_reg} ({count/total_reg:.0%})")

    # Check if humans are split on these
    print("\n  Human agreement on regressions:")
    unanimous_human = 0
    split_human = 0
    for pid in regressions_bg_mr:
        labels = human_labels.get(pid, [])
        cats = [c for _, c in labels]
        if len(set(cats)) == 1:
            unanimous_human += 1
        else:
            split_human += 1
    print(f"    Unanimous human: {unanimous_human}")
    print(f"    Split human (2-1): {split_human}")

    if split_human > 0:
        print("\n  Split-human regression details:")
        for pid in regressions_bg_mr:
            labels = human_labels.get(pid, [])
            cats = [c for _, c in labels]
            if len(set(cats)) > 1:
                votes = ", ".join(f"{n}={abbrev(c)}" for n, c in labels)
                print(f"      {pid[:12]}... → {votes}")
else:
    print("\n  No BG↔MR regressions found.")

# ── MR↔RMP patterns ─────────────────────────────────────────────────────────
print(f"\n{'━' * 110}")
print(" 3B: MR↔RMP NON-CONVERGENCE PATTERNS")
print(f"{'━' * 110}")

if non_convergence_mr_rmp:
    # Regression directions
    nc_directions = Counter()
    nc_model_flips = Counter()

    for pid, _, _ in non_convergence_mr_rmp:
        h_maj = human_majority(pid)
        v30_sigs = get_signals(v30_signals, pid)
        v35_sigs = get_signals(v35_signals, pid)
        v30_maj = majority_vote(v30_sigs)
        v35_maj = majority_vote(v35_sigs)
        direction = f"{abbrev(v30_maj)}→{abbrev(v35_maj)} (human={abbrev(h_maj)})"
        nc_directions[direction] += 1

        for i, (old, new) in enumerate(zip(v30_sigs, v35_sigs)):
            if old and new and old != new:
                nc_model_flips[MODEL_NAMES[i]] += 1

    print("\n  Direction of non-convergent shifts:")
    for direction, count in nc_directions.most_common():
        print(f"    {direction}: {count}")

    print("\n  Models that flipped most:")
    for model, count in nc_model_flips.most_common():
        print(f"    {model}: {count} flips")

    # Text pattern analysis — compare what helped vs what didn't
    print("\n  Text signals in NON-CONVERGENT vs IMPROVED paragraphs:")

    keywords = ["CISO", "officer", "responsible", "oversee", "report",
                "program", "framework", "qualif", "experience", "certif",
                "manage", "assess", "monitor", "team", "director"]

    nc_pids_set = {pid for pid, _, _ in non_convergence_mr_rmp}
    imp_pids_set = {pid for pid, _, _ in improvements_mr_rmp}

    print(f"\n    {'Keyword':<16} {'Non-conv':>10} {'Improved':>10}")
    print(f"    {'─'*16} {'─'*10} {'─'*10}")
    for kw in keywords:
        nc_count = sum(1 for pid in nc_pids_set if kw.lower() in para_text.get(pid, "").lower())
        imp_count = sum(1 for pid in imp_pids_set if kw.lower() in para_text.get(pid, "").lower())
        nc_pct = f"{nc_count}/{len(nc_pids_set)}" if nc_pids_set else "0"
        imp_pct = f"{imp_count}/{len(imp_pids_set)}" if imp_pids_set else "0"
        print(f"    {kw:<16} {nc_pct:>10} {imp_pct:>10}")

    # Person-removal test analysis
    print("\n  Person-removal test applicability:")
    print("    Checking if regression paragraphs have person as ONLY subject...")
    for pid, _, _ in regressions_mr_rmp:
        text = para_text.get(pid, "")
        has_person_subject = any(
            marker in text.lower()
            for marker in ["ciso", "chief information", "chief technology",
                           "vice president", "director of", "officer"]
        )
        has_process_subject = any(
            marker in text.lower()
            for marker in ["program", "framework", "process", "system",
                           "controls", "policies", "procedures"]
        )
        h_maj = human_majority(pid)
        v35_maj = majority_vote(get_signals(v35_signals, pid))
        print(
            f"      {pid[:12]}... person_subj={has_person_subject} "
            f"process_subj={has_process_subject} "
            f"human={abbrev(h_maj)} v3.5={abbrev(v35_maj)}"
        )
else:
    print("\n  No MR↔RMP non-convergence cases found.")

# ══════════════════════════════════════════════════════════════════════════════
# SECTION 4: Ruling Recommendations
# ══════════════════════════════════════════════════════════════════════════════

print("\n\n" + "═" * 110)
print(" SECTION 4: RULING RECOMMENDATIONS")
print("═" * 110)

print("""
Based on the error analysis above, here are the specific ruling observations:

━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
4A: BG↔MR Board-Line Test
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━

CURRENT RULING (Rule 2):
"When a paragraph spans layers (governance chain paragraphs): apply the
dominant-subject test — which layer occupies the most sentence-subjects?"

"Governance overview spanning board → committee → officer → program →
Board Governance if the board/committee occupies more sentence-subjects;
Management Role if the officer does; Risk Management Process if the
program does"
""")

# Analyze the specific regressions to give targeted advice
if regressions_bg_mr:
    # Count what direction the regressions went
    bg_to_mr = sum(
        1 for pid in regressions_bg_mr
        if abbrev(majority_vote(get_signals(v35_signals, pid))) == "MR"
        and abbrev(human_majority(pid)) == "BG"
    )
    mr_to_bg = sum(
        1 for pid in regressions_bg_mr
        if abbrev(majority_vote(get_signals(v35_signals, pid))) == "BG"
        and abbrev(human_majority(pid)) == "MR"
    )
    other_dir = len(regressions_bg_mr) - bg_to_mr - mr_to_bg

    print("  EMPIRICAL FINDING:")
    print(f"    Regressions that moved BG→MR (human says BG): {bg_to_mr}")
    print(f"    Regressions that moved MR→BG (human says MR): {mr_to_bg}")
    print(f"    Other directions: {other_dir}")

    if bg_to_mr > mr_to_bg:
        print("""
DIAGNOSIS: The dominant-subject test is OVER-CORRECTING toward MR.
When a governance chain mentions a CISO or officer, models are counting that
mention as a "sentence subject" even when the paragraph's primary purpose is
describing the board/committee oversight structure.

PROPOSED FIX — add a "purpose test" before the subject count:
"Before counting sentence-subjects, ask: what is the paragraph's PRIMARY
COMMUNICATIVE PURPOSE? If it is to describe the oversight/reporting
structure (who oversees whom, what gets reported where), the paragraph
is Board Governance even if individual officers are named as intermediaries.
The dominant-subject count applies only when the paragraph's purpose is
genuinely ambiguous between describing the oversight structure and
describing the officer's role."

Alternatively, add a carve-out:
"A governance chain paragraph (board → committee → officer → program)
defaults to Board Governance unless the officer section constitutes
MORE THAN HALF the paragraph's content AND includes qualifications,
credentials, or personal background."
""")
    elif mr_to_bg > bg_to_mr:
        print("""
DIAGNOSIS: The dominant-subject test is OVER-CORRECTING toward BG.
Paragraphs that are primarily about management roles are being pulled
toward BG because they mention board oversight.

PROPOSED FIX:
"When a paragraph's primary content is about a management role (CISO,
CIO, etc.) and mentions board oversight only as context for the
reporting relationship, classify as Management Role. Board Governance
requires the board/committee to be the PRIMARY ACTOR, not merely
the recipient of reports."
""")

print("""
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
4B: MR↔RMP Three-Step Chain
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━

CURRENT RULING (Rule 2b):
"Step 1 — Subject test: What is the paragraph's grammatical subject?
Step 2 — Person-removal test: Could you delete all named roles, titles,
qualifications, experience descriptions, and credentials from the
paragraph and still have a coherent cybersecurity disclosure?
Step 3 — Qualifications tiebreaker: Does the paragraph include experience
(years), certifications (CISSP, CISM), education, team size, or career
history for named individuals?"
""")

if regressions_mr_rmp:
    mr_to_rmp = sum(
        1 for pid, _, _ in regressions_mr_rmp
        if abbrev(majority_vote(get_signals(v35_signals, pid))) == "RMP"
        and abbrev(human_majority(pid)) == "MR"
    )
    rmp_to_mr = sum(
        1 for pid, _, _ in regressions_mr_rmp
        if abbrev(majority_vote(get_signals(v35_signals, pid))) == "MR"
        and abbrev(human_majority(pid)) == "RMP"
    )

    print("  EMPIRICAL FINDING:")
    print(f"    Regressions that moved MR→RMP (human says MR): {mr_to_rmp}")
    print(f"    Regressions that moved RMP→MR (human says RMP): {rmp_to_mr}")

    if mr_to_rmp > rmp_to_mr:
        print("""
DIAGNOSIS: The person-removal test is TOO AGGRESSIVE at removing people.
When a paragraph describes a CISO's monitoring activities, the person-removal
test says "yes, the monitoring process stands alone," but the HUMANS recognize
that the paragraph is fundamentally about the management role's responsibilities.

PROPOSED FIX — tighten the person-removal test:
"Step 2 — Person-removal test: Delete all named roles AND their associated
|
||||||
|
ACTIVITIES. If the paragraph still describes a cybersecurity process or
|
||||||
|
framework, it is Risk Management Process. If deleting the roles and their
|
||||||
|
activities leaves nothing substantive, it is Management Role.
|
||||||
|
Key distinction: 'The CISO monitors threat intelligence' — removing the
|
||||||
|
CISO removes the monitoring activity, so this is Management Role.
|
||||||
|
'The company monitors threat intelligence under the direction of the CISO'
|
||||||
|
— removing the CISO leaves the monitoring intact, so this is RMP."
|
||||||
|
""")
|
||||||
|
elif rmp_to_mr > mr_to_rmp:
|
||||||
|
print("""
|
||||||
|
DIAGNOSIS: The three-step chain is UNDER-APPLYING the person-removal test.
|
||||||
|
Models are stopping at Step 1 (subject test) when they see a role title,
|
||||||
|
without proceeding to the person-removal test.
|
||||||
|
|
||||||
|
PROPOSED FIX:
|
||||||
|
"Step 1 should only produce a STRONG signal, not a decisive result.
|
||||||
|
Always proceed to Step 2 unless the paragraph is ENTIRELY about
|
||||||
|
a person's credentials with no process content whatsoever."
|
||||||
|
""")
|
||||||
|
|
||||||
|
if not regressions_mr_rmp:
|
||||||
|
print("""
|
||||||
|
No MR↔RMP regressions found. The three-step chain may be working correctly,
|
||||||
|
or the non-convergence is increasing uncertainty without changing majority votes.
|
||||||
|
Focus on whether the increased model disagreement reflects genuine ambiguity
|
||||||
|
or whether the step instructions need to be more prescriptive.
|
||||||
|
""")
|
||||||
|
|
||||||
|
# ── Final summary stats ──────────────────────────────────────────────────────
|
||||||
|
print("\n" + "═" * 110)
|
||||||
|
print(" FINAL SUMMARY")
|
||||||
|
print("═" * 110)
|
||||||
|
|
||||||
|
# Overall accuracy comparison
|
||||||
|
total_with_human = 0
|
||||||
|
v30_correct_total = 0
|
||||||
|
v35_correct_total = 0
|
||||||
|
|
||||||
|
for pid in all_pids:
|
||||||
|
h_maj = human_majority(pid)
|
||||||
|
if h_maj is None:
|
||||||
|
continue
|
||||||
|
v30_sigs = get_signals(v30_signals, pid)
|
||||||
|
v35_sigs = get_signals(v35_signals, pid)
|
||||||
|
v30_maj = majority_vote(v30_sigs)
|
||||||
|
v35_maj = majority_vote(v35_sigs)
|
||||||
|
if v30_maj is None or v35_maj is None:
|
||||||
|
continue
|
||||||
|
total_with_human += 1
|
||||||
|
if abbrev(v30_maj) == abbrev(h_maj):
|
||||||
|
v30_correct_total += 1
|
||||||
|
if abbrev(v35_maj) == abbrev(h_maj):
|
||||||
|
v35_correct_total += 1
|
||||||
|
|
||||||
|
print(f"\n Overall accuracy on {total_with_human} confusion-axis paragraphs:")
|
||||||
|
print(f" v3.0: {v30_correct_total}/{total_with_human} ({v30_correct_total/total_with_human:.1%})")
|
||||||
|
print(f" v3.5: {v35_correct_total}/{total_with_human} ({v35_correct_total/total_with_human:.1%})")
|
||||||
|
print(f" Delta: {v35_correct_total - v30_correct_total:+d}")
|
||||||
|
|
||||||
|
# Per-axis breakdown
|
||||||
|
for axis_name in ["BG_MR", "MR_RMP", "BG_RMP", "SI_NO"]:
|
||||||
|
axis_pids = [pid for pid, axes in pid_axes.items() if axis_name in axes]
|
||||||
|
v30_c = 0
|
||||||
|
v35_c = 0
|
||||||
|
n = 0
|
||||||
|
for pid in axis_pids:
|
||||||
|
h_maj = human_majority(pid)
|
||||||
|
if h_maj is None:
|
||||||
|
continue
|
||||||
|
v30_sigs = get_signals(v30_signals, pid)
|
||||||
|
v35_sigs = get_signals(v35_signals, pid)
|
||||||
|
v30_maj = majority_vote(v30_sigs)
|
||||||
|
v35_maj = majority_vote(v35_sigs)
|
||||||
|
if v30_maj is None or v35_maj is None:
|
||||||
|
continue
|
||||||
|
n += 1
|
||||||
|
if abbrev(v30_maj) == abbrev(h_maj):
|
||||||
|
v30_c += 1
|
||||||
|
if abbrev(v35_maj) == abbrev(h_maj):
|
||||||
|
v35_c += 1
|
||||||
|
|
||||||
|
if n > 0:
|
||||||
|
print(f"\n {axis_name} ({n} paragraphs):")
|
||||||
|
print(f" v3.0: {v30_c}/{n} ({v30_c/n:.1%})")
|
||||||
|
print(f" v3.5: {v35_c}/{n} ({v35_c/n:.1%})")
|
||||||
|
print(f" Delta: {v35_c - v30_c:+d}")
|
||||||
|
|
||||||
|
print()
|
||||||
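Aside: the person-removal distinction quoted in ruling 4B can be made executable. The sketch below is illustrative only — `person_removal_test` and its regex are hypothetical helpers, not part of this commit — and it crudely treats a role at the very start of a sentence as the grammatical subject.

import re

# Hypothetical helper sketching ruling 4B's person-removal test: if the named
# role is the sentence subject, removing it also removes the activity (→ MR);
# otherwise the process stands on its own (→ RMP).
ROLE_RE = re.compile(r"\b(the\s+)?(CISO|CIO|CTO|Chief\s+\w+\s+Officer)\b", re.IGNORECASE)

def person_removal_test(sentence: str) -> str:
    match = ROLE_RE.search(sentence)
    if match is None:
        return "RMP"  # no named role at all — the process content stands alone
    # Crude subject check: the role appears at the start of the sentence.
    return "MR" if match.start() <= 4 else "RMP"

# The two contrast cases quoted in the proposed fix:
assert person_removal_test("The CISO monitors threat intelligence") == "MR"
assert person_removal_test(
    "The company monitors threat intelligence under the direction of the CISO"
) == "RMP"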
scripts/extract-regression-pids.py — new file, 167 lines
@@ -0,0 +1,167 @@
"""Identify paragraph IDs where the v3.5 6-model majority regressed vs v3.0.

A "regression" = the v3.0 majority matched the human majority but the v3.5 majority does not.

We compute the category majority from 6 models (excluding minimax):
opus, gpt-5.4, gemini-3.1-pro-preview, glm-5:exacto, kimi-k2.5, mimo-v2-pro:exacto.

v3.0 annotations are filtered to the 359 PIDs present in holdout-rerun-v35.jsonl.
"""

from __future__ import annotations

import json
from collections import Counter
from pathlib import Path

ROOT = Path(__file__).resolve().parent.parent
DATA = ROOT / "data"

# ── Model files (excluding minimax) ──────────────────────────────────────────

V30_FILES = [
    DATA / "annotations" / "golden" / "opus.jsonl",
    DATA / "annotations" / "bench-holdout" / "gpt-5.4.jsonl",
    DATA / "annotations" / "bench-holdout" / "gemini-3.1-pro-preview.jsonl",
    DATA / "annotations" / "bench-holdout" / "glm-5:exacto.jsonl",
    DATA / "annotations" / "bench-holdout" / "kimi-k2.5.jsonl",
    DATA / "annotations" / "bench-holdout" / "mimo-v2-pro:exacto.jsonl",
]

V35_FILES = [
    DATA / "annotations" / "golden-v35" / "opus.jsonl",
    DATA / "annotations" / "bench-holdout-v35" / "gpt-5.4.jsonl",
    DATA / "annotations" / "bench-holdout-v35" / "gemini-3.1-pro-preview.jsonl",
    DATA / "annotations" / "bench-holdout-v35" / "glm-5:exacto.jsonl",
    DATA / "annotations" / "bench-holdout-v35" / "kimi-k2.5.jsonl",
    DATA / "annotations" / "bench-holdout-v35" / "mimo-v2-pro:exacto.jsonl",
]


def load_annotations(files: list[Path]) -> dict[str, list[str]]:
    """Load annotations, returning {pid: [category, ...]} across models."""
    result: dict[str, list[str]] = {}
    for f in files:
        with open(f) as fh:
            for line in fh:
                rec = json.loads(line)
                pid = rec["paragraphId"]
                cat = rec["label"]["content_category"]
                result.setdefault(pid, []).append(cat)
    return result


def majority_vote(labels: list[str]) -> str | None:
    """Return the most common label, or None if tied."""
    counts = Counter(labels)
    top = counts.most_common(2)
    if len(top) == 1:
        return top[0][0]
    if top[0][1] > top[1][1]:
        return top[0][0]
    return None  # tie
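# Quick reference (hypothetical inputs) for the tie rule above:
#   majority_vote(["BG", "BG", "BG", "BG", "MR", "MR"]) -> "BG"  (clear 4-2)
#   majority_vote(["BG", "BG", "BG", "MR", "MR", "MR"]) -> None  (3-3 tie: PID is skipped)
#   majority_vote(["BG", "BG", "MR", "MR", "RMP", "RMP"]) -> None (2-2 tie at the top)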

def load_human_majority() -> dict[str, str]:
    """Compute the human majority label per PID from the 3-annotator raw labels."""
    pid_labels: dict[str, list[str]] = {}
    with open(DATA / "gold" / "human-labels-raw.jsonl") as f:
        for line in f:
            rec = json.loads(line)
            pid = rec["paragraphId"]
            pid_labels.setdefault(pid, []).append(rec["contentCategory"])
    return {
        pid: maj
        for pid, labels in pid_labels.items()
        if (maj := majority_vote(labels)) is not None
    }


def load_holdout_pids() -> dict[str, list[str]]:
    """Load the 359 confusion-axis PIDs and their axes."""
    result: dict[str, list[str]] = {}
    with open(DATA / "gold" / "holdout-rerun-v35.jsonl") as f:
        for line in f:
            rec = json.loads(line)
            result[rec["paragraphId"]] = rec["axes"]
    return result


# Axis name → output key mapping
AXIS_TO_KEY = {
    "BG_MR": "bg_mr_regressions",
    "BG_RMP": "bg_mr_regressions",  # both BG confusion axes go to the bg_mr bucket
    "MR_RMP": "mr_rmp_regressions",
    "SI_NO": "mr_rmp_regressions",  # SI/NO doesn't fit neatly; grouped with mr_rmp
}


def main() -> None:
    holdout = load_holdout_pids()
    holdout_pids = set(holdout.keys())

    human_maj = load_human_majority()

    v30_ann = load_annotations(V30_FILES)
    v35_ann = load_annotations(V35_FILES)

    # Compute model majorities filtered to holdout PIDs; require all 6 models
    v30_maj: dict[str, str | None] = {}
    for pid in holdout_pids:
        labels = v30_ann.get(pid, [])
        v30_maj[pid] = majority_vote(labels) if len(labels) == 6 else None

    v35_maj: dict[str, str | None] = {}
    for pid in holdout_pids:
        labels = v35_ann.get(pid, [])
        v35_maj[pid] = majority_vote(labels) if len(labels) == 6 else None

    # Find regressions
    bg_mr_regressions: list[str] = []
    mr_rmp_regressions: list[str] = []

    for pid in sorted(holdout_pids):
        h = human_maj.get(pid)
        v30 = v30_maj.get(pid)
        v35 = v35_maj.get(pid)

        if h is None or v30 is None or v35 is None:
            continue

        # Regression: v3.0 matched the human majority, v3.5 does not
        if v30 == h and v35 != h:
            axes = holdout[pid]
            # Assign to a bucket based on the axes
            is_bg_mr = any(a in ("BG_MR", "BG_RMP") for a in axes)
            is_mr_rmp = any(a in ("MR_RMP", "SI_NO") for a in axes)

            if is_bg_mr:
                bg_mr_regressions.append(pid)
            if is_mr_rmp:
                mr_rmp_regressions.append(pid)
            # If neither axis matched, still count the PID (fallback: mr_rmp)
            if not is_bg_mr and not is_mr_rmp:
                mr_rmp_regressions.append(pid)

    all_regressions = sorted(set(bg_mr_regressions + mr_rmp_regressions))

    output = {
        "bg_mr_regressions": sorted(bg_mr_regressions),
        "mr_rmp_regressions": sorted(mr_rmp_regressions),
        "all_regressions": all_regressions,
    }

    out_path = DATA / "gold" / "regression-pids.json"
    with open(out_path, "w") as f:
        json.dump(output, f, indent=2)
        f.write("\n")

    print(f"BG/MR regressions:  {len(bg_mr_regressions)}")
    print(f"MR/RMP regressions: {len(mr_rmp_regressions)}")
    print(f"Total unique:       {len(all_regressions)}")
    print(f"Written to {out_path}")


if __name__ == "__main__":
    main()
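Aside: a toy check (invented PIDs and labels) of the regression predicate the script applies — v3.0 agreed with the humans, v3.5 no longer does:

human = {"p1": "BG", "p2": "MR", "p3": "RMP"}
v30 = {"p1": "BG", "p2": "MR", "p3": "MR"}
v35 = {"p1": "MR", "p2": "MR", "p3": "RMP"}

regressions = [p for p in sorted(human) if v30[p] == human[p] and v35[p] != human[p]]
assert regressions == ["p1"]  # p2 is stable; p3 is an improvement, not a regression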
scripts/flag-stage1-corrections.py — new file, 305 lines
@@ -0,0 +1,305 @@
"""
Flag Stage 1 paragraphs needing Stage 2 re-evaluation due to codebook v2.5 -> v3.5 drift.

Two categories of flags:
1. Materiality assessment language in N/O or RMP paragraphs — backward-looking
   conclusions or SEC-qualified forward-looking statements that constitute a
   materiality assessment (should be Strategy Integration under the v3.5 codebook).
2. SPAC/shell company paragraphs coded as substantive categories — should be None/Other.

Materiality rule (tightened after 6 rounds of prompt iteration):
  IS an assessment: "have not materially affected/impacted", "not materially affected",
    "reasonably likely to materially affect/impact",
    "have not experienced any material cybersecurity" (unless in cross-reference context).
  NOT an assessment: "could/may ... material adverse effect" (boilerplate speculation),
    "material" as an adjective ("material risks"), cross-references ("see Item 1A"),
    consequence clauses at the end of RMP descriptions.

Usage: uv run scripts/flag-stage1-corrections.py
"""

import json
import re
from collections import Counter, defaultdict
from pathlib import Path

DATA_DIR = Path(__file__).resolve().parent.parent / "data"
STAGE1_PATH = DATA_DIR / "annotations" / "stage1.patched.jsonl"
PARAGRAPHS_PATH = DATA_DIR / "paragraphs" / "paragraphs-clean.patched.jsonl"
HOLDOUT_PATH = DATA_DIR / "gold" / "human-labels-raw.jsonl"
OUTPUT_PATH = DATA_DIR / "annotations" / "stage1-corrections.jsonl"

# Category abbreviation mapping for the stage1Labels output
CATEGORY_ABBREV = {
    "None/Other": "N/O",
    "Board Governance": "BG",
    "Management Role": "MR",
    "Risk Management Process": "RMP",
    "Third-Party Risk": "TPR",
    "Incident Disclosure": "ID",
}

# --- Materiality patterns (strict, assessment-only) ---

# Positive patterns: backward-looking conclusions and SEC-qualified forward-looking statements
MATERIALITY_ASSESSMENT_RE = re.compile(
    r"(?:"
    # Backward-looking conclusions
    r"have\s+not\s+materially\s+affected"
    r"|has\s+not\s+materially\s+affected"
    r"|not\s+materially\s+affected"
    r"|have\s+not\s+materially\s+impacted"
    # SEC-qualified forward-looking
    r"|reasonably\s+likely\s+to\s+materially\s+affect"
    r"|reasonably\s+likely\s+to\s+materially\s+impact"
    # Negative assertions about incidents
    r"|have\s+not\s+experienced\s+any\s+material\s+cybersecurity"
    r")",
    re.IGNORECASE,
)

# Negative filter: cross-reference context near the match (within 200 chars after)
CROSS_REF_RE = re.compile(
    r"(?:see\s+Item|see\s+Part|see\s+our\s+risk\s+factors|refer\s+to)",
    re.IGNORECASE,
)

# Negative filter: speculative/boilerplate "could/may + material adverse" patterns
SPECULATIVE_RE = re.compile(
    r"(?:could|may|might|can)\s+(?:\w+\s+){0,3}material\s+adverse\s+effect",
    re.IGNORECASE,
)


def has_materiality_language(text: str) -> str | None:
    """Return a snippet around the matched materiality assessment, or None if no match."""
    match = MATERIALITY_ASSESSMENT_RE.search(text)
    if match is None:
        return None

    match_start = match.start()
    match_end = match.end()

    # Check cross-reference context: look within 200 chars after the match
    post_context = text[match_end : match_end + 200]
    if CROSS_REF_RE.search(post_context):
        return None

    # Also check for a cross-reference before the match (within 100 chars)
    pre_context = text[max(0, match_start - 100) : match_start]
    if CROSS_REF_RE.search(pre_context):
        return None

    # Extract a snippet around the match for context
    snippet_start = max(0, match_start - 30)
    snippet_end = min(len(text), match_end + 30)
    return text[snippet_start:snippet_end].strip()


# --- SPAC patterns ---

SPAC_PHRASES = [
    "special purpose acquisition",
    "blank check",
    "no business operations",
    "shell company",
    "have not adopted any cybersecurity",
    "no operations",
]


def has_spac_language(text: str) -> str | None:
    """Return a snippet around the matched SPAC indicator, or None."""
    text_lower = text.lower()
    for phrase in SPAC_PHRASES:
        if phrase in text_lower:
            idx = text_lower.index(phrase)
            start = max(0, idx - 20)
            end = min(len(text), idx + len(phrase) + 20)
            return text[start:end].strip()
    return None


def main() -> None:
    # Load holdout IDs
    print("Loading holdout IDs...")
    holdout_ids: set[str] = set()
    with open(HOLDOUT_PATH) as f:
        for line in f:
            rec = json.loads(line)
            holdout_ids.add(rec["paragraphId"])
    print(f"  Holdout paragraphs: {len(holdout_ids)}")

    # Load paragraph texts
    print("Loading paragraph texts...")
    para_texts: dict[str, str] = {}
    with open(PARAGRAPHS_PATH) as f:
        for line in f:
            rec = json.loads(line)
            para_texts[rec["id"]] = rec["text"]
    print(f"  Paragraphs loaded: {len(para_texts)}")

    # Load the old corrections for comparison
    old_materiality_pids: set[str] = set()
    old_spac_pids: set[str] = set()
    if OUTPUT_PATH.exists():
        with open(OUTPUT_PATH) as f:
            for line in f:
                rec = json.loads(line)
                if rec["reason"] == "materiality_language":
                    old_materiality_pids.add(rec["paragraphId"])
                elif rec["reason"] == "spac":
                    old_spac_pids.add(rec["paragraphId"])

    # Load Stage 1 annotations and group by paragraphId
    print("Loading Stage 1 annotations...")
    annotations: dict[str, list[str]] = defaultdict(list)
    with open(STAGE1_PATH) as f:
        for line in f:
            rec = json.loads(line)
            pid = rec["paragraphId"]
            if pid in holdout_ids:
                continue
            cat = rec["label"]["content_category"]
            annotations[pid].append(cat)

    total_paragraphs = len(annotations)
    print(f"  Stage 1 paragraphs (excluding holdout): {total_paragraphs}")

    # Process each paragraph
    flagged: list[dict] = []
    materiality_flagged: list[dict] = []
    spac_flagged: list[dict] = []

    # Track paragraphs that HAD assessment language under the old rule but pass
    # the new filters (for showing "newly excluded" examples)
    newly_excluded: list[dict] = []

    for pid, labels in annotations.items():
        text = para_texts.get(pid)
        if text is None:
            continue

        label_abbrevs = [CATEGORY_ABBREV.get(l, l) for l in labels]
        no_count = sum(1 for l in labels if l == "None/Other")
        total = len(labels)

        # --- Check 1: Materiality assessment in N/O paragraphs ---
        if no_count > total / 2:  # majority or unanimous N/O
            matched = has_materiality_language(text)
            if matched:
                consensus = "unanimous" if no_count == total else "majority"
                record = {
                    "paragraphId": pid,
                    "reason": "materiality_language",
                    "originalConsensus": consensus,
                    "originalCategory": "None/Other",
                    "matchedPattern": matched,
                    "stage1Labels": label_abbrevs,
                }
                flagged.append(record)
                materiality_flagged.append(record)
            elif pid in old_materiality_pids:
                # Was flagged before, now excluded — collect for comparison.
                # Find what the old broad matcher would have caught.
                text_lower = text.lower()
                broad_match = None
                for phrase in [
                    "material adverse", "materially affect", "material impact",
                    "material cybersecurity", "material effect", "not materially",
                    "materially impacted", "materially affected",
                ]:
                    if phrase in text_lower:
                        idx = text_lower.index(phrase)
                        s = max(0, idx - 30)
                        e = min(len(text), idx + len(phrase) + 30)
                        broad_match = text[s:e].strip()
                        break
                if broad_match is None:
                    broad_match = "(proximity match)"
                newly_excluded.append({
                    "paragraphId": pid,
                    "reason": "excluded_materiality",
                    "oldMatch": broad_match,
                    "stage1Labels": label_abbrevs,
                })

        # --- Check 2: SPAC paragraphs coded as non-N/O ---
        if no_count <= total / 2:
            matched = has_spac_language(text)
            if matched:
                cat_counts = Counter(labels)
                majority_cat = cat_counts.most_common(1)[0][0]
                consensus = "unanimous" if cat_counts[majority_cat] == total else "majority"
                record = {
                    "paragraphId": pid,
                    "reason": "spac",
                    "originalConsensus": consensus,
                    "originalCategory": majority_cat,
                    "matchedPattern": matched,
                    "stage1Labels": label_abbrevs,
                }
                flagged.append(record)
                spac_flagged.append(record)

    # Write output
    print(f"\nWriting {len(flagged)} flagged paragraphs to {OUTPUT_PATH}...")
    with open(OUTPUT_PATH, "w") as f:
        for rec in flagged:
            f.write(json.dumps(rec) + "\n")

    # --- Print summary ---
    print("\n" + "=" * 70)
    print("STAGE 1 CORRECTION FLAGS — SUMMARY")
    print("=" * 70)
    print(f"Total Stage 1 paragraphs (excluding holdout): {total_paragraphs:,}")
    print()
    print("  Comparison (old broad rule -> new strict rule):")
    print(f"    Materiality flags: {len(old_materiality_pids):>5} -> {len(materiality_flagged):>5} (delta: {len(materiality_flagged) - len(old_materiality_pids):+d})")
    print(f"    SPAC flags:        {len(old_spac_pids):>5} -> {len(spac_flagged):>5} (delta: {len(spac_flagged) - len(old_spac_pids):+d})")
    old_total = len(old_materiality_pids) + len(old_spac_pids)
    print(f"    Total flags:       {old_total:>5} -> {len(flagged):>5} (delta: {len(flagged) - old_total:+d})")
    print()

    new_materiality_pids = {r["paragraphId"] for r in materiality_flagged}
    added = new_materiality_pids - old_materiality_pids
    removed = old_materiality_pids - new_materiality_pids
    kept = new_materiality_pids & old_materiality_pids
    print("  Materiality breakdown:")
    print(f"    Kept from old:      {len(kept):>5}")
    print(f"    Newly flagged:      {len(added):>5}")
    print(f"    Excluded (dropped): {len(removed):>5}")

    # Show examples
    def show_examples(title: str, records: list[dict], n: int = 5, text_key: str = "matchedPattern") -> None:
        print(f"\n--- {title} (showing {min(n, len(records))} of {len(records)}) ---")
        for rec in records[:n]:
            pid = rec["paragraphId"]
            text = para_texts.get(pid, "")
            snippet = text[:150] + "..." if len(text) > 150 else text
            print(f"  {pid[:16]}...")
            print(f"    Labels: {rec['stage1Labels']}")
            print(f"    Match:  {rec.get(text_key, rec.get('oldMatch', ''))}")
            print(f"    Text:   {snippet}")
            print()

    show_examples(
        "Newly flagged materiality assessments (assessment patterns)",
        [r for r in materiality_flagged if r["paragraphId"] in added] or materiality_flagged,
    )
    show_examples(
        "Previously flagged, NOW EXCLUDED (boilerplate/speculation)",
        newly_excluded,
        text_key="oldMatch",
    )
    show_examples("SPAC/shell coded as non-N/O", spac_flagged)


if __name__ == "__main__":
    main()
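Aside: the strict rule's behavior is easy to spot-check. A self-contained, abridged re-statement (invented example sentences; the patterns below are condensed from the script above, not identical to it):

import re

ASSESSMENT_RE = re.compile(
    r"(?:have|has)\s+not\s+materially\s+affected"
    r"|reasonably\s+likely\s+to\s+materially\s+(?:affect|impact)",
    re.IGNORECASE,
)
CROSS_REF_RE = re.compile(r"see\s+Item|refer\s+to", re.IGNORECASE)

def is_assessment(text: str) -> bool:
    m = ASSESSMENT_RE.search(text)
    if m is None:
        return False
    # Suppress matches followed by a cross-reference, as the script does.
    return not CROSS_REF_RE.search(text[m.end() : m.end() + 200])

assert is_assessment("Cyber threats have not materially affected our business strategy.")
assert not is_assessment("An incident could have a material adverse effect on operations.")
assert not is_assessment("Threats have not materially affected us; see Item 1A of this report.")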
scripts/identify-holdout-rerun.py — new file, 201 lines
@@ -0,0 +1,201 @@
"""
Identify holdout paragraphs on confusion axes that need v3.5 re-annotation.

Builds a 13-signal matrix from all available sources:
- 3 human annotators (per paragraph)
- 1 Opus golden annotation
- Up to 6 bench-holdout model annotations
- Stage 1 patched annotations (filtered to holdout PIDs)

Flags paragraphs splitting on:
1. SI <-> N/O (at least 2 signals on each side)
2. MR <-> RMP (at least 2 signals on each side)
3. BG <-> MR (at least 2 signals on each side)
4. BG <-> RMP (at least 2 signals on each side)
5. Materiality language present but the majority says N/O
"""

import json
import re
from collections import Counter
from pathlib import Path

ROOT = Path(__file__).resolve().parent.parent
DATA = ROOT / "data"

# Short names for categories
ABBREV = {
    "Board Governance": "BG",
    "Incident Disclosure": "ID",
    "Management Role": "MR",
    "None/Other": "NO",
    "Risk Management Process": "RMP",
    "Strategy Integration": "SI",
    "Third-Party Risk": "TPR",
}

# Materiality language patterns
MATERIALITY_PATTERNS = [
    re.compile(r"material(ly)?\s+(adverse|impact|effect|affect)", re.IGNORECASE),
    re.compile(r"materially\s+affect(ed)?", re.IGNORECASE),
    re.compile(r"material\s+cybersecurity\s+(incident|threat|event)", re.IGNORECASE),
    re.compile(r"not\s+(experienced|had|identified)\s+.{0,40}material", re.IGNORECASE),
    re.compile(r"reasonably\s+likely\s+to\s+materially", re.IGNORECASE),
    re.compile(r"material(ity)?\s+(assessment|conclusion|determination)", re.IGNORECASE),
    re.compile(r"no\s+material\s+(impact|effect|cybersecurity)", re.IGNORECASE),
    re.compile(r"have\s+not\s+.{0,30}materially\s+affect(ed)?", re.IGNORECASE),
]


def has_materiality_language(text: str) -> bool:
    return any(p.search(text) for p in MATERIALITY_PATTERNS)


def majority_category(tally: Counter) -> str:
    if not tally:
        return "UNKNOWN"
    return tally.most_common(1)[0][0]


def main():
    # 1. Determine the 1,200 holdout PIDs from the human labels
    holdout_pids: set[str] = set()
    human_labels: dict[str, list[str]] = {}  # pid -> list of abbreviated categories
    with open(DATA / "gold" / "human-labels-raw.jsonl") as f:
        for line in f:
            d = json.loads(line)
            pid = d["paragraphId"]
            holdout_pids.add(pid)
            human_labels.setdefault(pid, []).append(
                ABBREV.get(d["contentCategory"], d["contentCategory"])
            )

    # Load paragraph texts for the holdout PIDs
    holdout_paragraphs: dict[str, str] = {}
    with open(DATA / "gold" / "paragraphs-holdout.jsonl") as f:
        for line in f:
            d = json.loads(line)
            if d["id"] in holdout_pids:
                holdout_paragraphs[d["id"]] = d["text"]

    print(f"Total holdout paragraphs: {len(holdout_pids)}")

    # 2. Build the signal matrix: pid -> list of category strings (abbreviated)
    signals: dict[str, list[str]] = {pid: list(cats) for pid, cats in human_labels.items()}

    # 2a. Human labels already loaded above
    print(f"Paragraphs with human labels: {len(human_labels)}")

    # 2b. Opus golden
    with open(DATA / "annotations" / "golden" / "opus.jsonl") as f:
        for line in f:
            d = json.loads(line)
            pid = d["paragraphId"]
            if pid in holdout_pids:
                cat = ABBREV.get(
                    d["label"]["content_category"], d["label"]["content_category"]
                )
                signals[pid].append(cat)

    # 2c. Bench-holdout model annotations (skip error files)
    bench_dir = DATA / "annotations" / "bench-holdout"
    for fpath in sorted(bench_dir.glob("*.jsonl")):
        if "-errors" in fpath.name:
            continue
        with open(fpath) as f:
            for line in f:
                d = json.loads(line)
                pid = d.get("paragraphId")
                if pid and pid in holdout_pids and "label" in d:
                    cat = ABBREV.get(
                        d["label"]["content_category"],
                        d["label"]["content_category"],
                    )
                    signals[pid].append(cat)

    # 2d. Stage 1 patched (filtered to holdout PIDs)
    with open(DATA / "annotations" / "stage1.patched.jsonl") as f:
        for line in f:
            d = json.loads(line)
            pid = d["paragraphId"]
            if pid in holdout_pids:
                cat = ABBREV.get(
                    d["label"]["content_category"], d["label"]["content_category"]
                )
                signals[pid].append(cat)

    # Report signal counts
    signal_counts = [len(signals[pid]) for pid in holdout_pids]
    print(
        f"Signals per paragraph: min={min(signal_counts)}, max={max(signal_counts)}, "
        f"mean={sum(signal_counts)/len(signal_counts):.1f}"
    )

    # 3. Check the confusion axes
    AXES = {
        "SI_NO": ("SI", "NO"),
        "MR_RMP": ("MR", "RMP"),
        "BG_MR": ("BG", "MR"),
        "BG_RMP": ("BG", "RMP"),
    }

    axis_counts: dict[str, int] = {k: 0 for k in AXES}
    materiality_no_count = 0
    results: list[dict] = []

    for pid in sorted(holdout_pids):
        tally = Counter(signals[pid])
        maj = majority_category(tally)
        text = holdout_paragraphs[pid]
        mat_lang = has_materiality_language(text)

        # Check each axis
        flagged_axes: list[str] = []
        for axis_name, (cat_a, cat_b) in AXES.items():
            if tally.get(cat_a, 0) >= 2 and tally.get(cat_b, 0) >= 2:
                flagged_axes.append(axis_name)

        # Materiality language + majority N/O
        mat_no_flag = mat_lang and maj == "NO"

        if flagged_axes or mat_no_flag:
            for axis_name in flagged_axes:
                axis_counts[axis_name] += 1
            if mat_no_flag:
                materiality_no_count += 1

            results.append(
                {
                    "paragraphId": pid,
                    "axes": flagged_axes,
                    # Tally in most-common order for output readability
                    "signalTally": dict(tally.most_common()),
                    "hasMaterialityLanguage": mat_lang,
                    "currentMajority": maj,
                    "materialityNoFlag": mat_no_flag,
                }
            )

    # 4. Output
    out_path = DATA / "gold" / "holdout-rerun-v35.jsonl"
    with open(out_path, "w") as f:
        for r in results:
            f.write(json.dumps(r) + "\n")

    print("\n--- Confusion Axis Summary ---")
    print(f"SI <-> N/O splits:  {axis_counts['SI_NO']}")
    print(f"MR <-> RMP splits:  {axis_counts['MR_RMP']}")
    print(f"BG <-> MR splits:   {axis_counts['BG_MR']}")
    print(f"BG <-> RMP splits:  {axis_counts['BG_RMP']}")
    print(f"Materiality lang + majority N/O: {materiality_no_count}")
    print(f"\nTotal unique paragraphs needing re-run: {len(results)}")
    cost = len(results) * 0.005 * 5
    print(f"Estimated cost at $0.005/paragraph x 5 models: ${cost:.2f}")


if __name__ == "__main__":
    main()
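Aside: a toy run (invented tally) of the 2-signals-per-side axis rule above:

from collections import Counter

AXES = {
    "SI_NO": ("SI", "NO"),
    "MR_RMP": ("MR", "RMP"),
    "BG_MR": ("BG", "MR"),
    "BG_RMP": ("BG", "RMP"),
}

def flag_axes(tally: Counter) -> list[str]:
    # Same rule as the script: an axis fires when both sides have >= 2 signals.
    return [
        name for name, (a, b) in AXES.items()
        if tally.get(a, 0) >= 2 and tally.get(b, 0) >= 2
    ]

# A 13-signal paragraph split 6 MR / 5 RMP / 2 BG fires three axes
# (BG sits exactly at the 2-signal threshold):
assert flag_axes(Counter({"MR": 6, "RMP": 5, "BG": 2})) == ["MR_RMP", "BG_MR", "BG_RMP"]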
scripts/show-hard-examples.py — new file, 530 lines
@@ -0,0 +1,530 @@
"""
Show carefully selected hard-case paragraphs from the holdout set for each confusion axis.
Displays the full paragraph text plus a compact 13-signal label table and vote tally.

Run: uv run --with numpy scripts/show-hard-examples.py
"""

import json
import os
from collections import Counter, defaultdict
from pathlib import Path
from textwrap import fill

import numpy as np

ROOT = Path(__file__).resolve().parent.parent

# ── Category abbreviation map ──────────────────────────────────────────────
FULL_TO_ABBR = {
    "Board Governance": "BG",
    "Incident Disclosure": "ID",
    "Management Role": "MR",
    "None/Other": "N/O",
    "Risk Management Process": "RMP",
    "Strategy Integration": "SI",
    "Third-Party Risk": "TPR",
}

# ── Short source-name helpers ──────────────────────────────────────────────
S1_MODEL_SHORT = {
    "google/gemini-3.1-flash-lite-preview": "gemini-lite",
    "x-ai/grok-4.1-fast": "grok-fast",
    "xiaomi/mimo-v2-flash": "mimo-flash",
}

BENCH_FILE_SHORT = {
    "gpt-5.4": "gpt-5.4",
    "gemini-3.1-pro-preview": "gemini-pro",
    "glm-5:exacto": "glm-5",
    "kimi-k2.5": "kimi",
    "mimo-v2-pro:exacto": "mimo-pro",
    "minimax-m2.7:exacto": "minimax",
}

BENCH_FILES = [
    "gpt-5.4",
    "gemini-3.1-pro-preview",
    "glm-5:exacto",
    "kimi-k2.5",
    "mimo-v2-pro:exacto",
    "minimax-m2.7:exacto",
]


def load_jsonl(path: str | Path) -> list[dict]:
    rows = []
    with open(path) as f:
        for line in f:
            line = line.strip()
            if line:
                rows.append(json.loads(line))
    return rows


# ── Load data ──────────────────────────────────────────────────────────────
print("Loading data...")
paragraphs_raw = load_jsonl(ROOT / "data/gold/paragraphs-holdout.jsonl")
para_map: dict[str, dict] = {p["id"]: p for p in paragraphs_raw}
holdout_pids = set(para_map.keys())

human_raw = load_jsonl(ROOT / "data/gold/human-labels-raw.jsonl")
opus_raw = load_jsonl(ROOT / "data/annotations/golden/opus.jsonl")
stage1_raw = load_jsonl(ROOT / "data/annotations/stage1.patched.jsonl")

# ── Build signal matrix: pid → {source_label: category_abbr} ───────────────
signals: dict[str, dict[str, str]] = defaultdict(dict)

# 1) Human annotators
for row in human_raw:
    pid = row["paragraphId"]
    name = row["annotatorName"]
    cat = FULL_TO_ABBR.get(row["contentCategory"], row["contentCategory"])
    signals[pid][f"H:{name}"] = cat

# 2) Opus
for row in opus_raw:
    pid = row["paragraphId"]
    cat = FULL_TO_ABBR.get(row["label"]["content_category"], row["label"]["content_category"])
    signals[pid]["Opus"] = cat

# 3) Stage 1 (filtered to holdout PIDs)
for row in stage1_raw:
    pid = row["paragraphId"]
    if pid not in holdout_pids:
        continue
    model_id = row["provenance"]["modelId"]
    short = S1_MODEL_SHORT.get(model_id, model_id)
    cat = FULL_TO_ABBR.get(row["label"]["content_category"], row["label"]["content_category"])
    signals[pid][f"S1:{short}"] = cat

# 4) Benchmark models
for bench_name in BENCH_FILES:
    path = ROOT / f"data/annotations/bench-holdout/{bench_name}.jsonl"
    short = BENCH_FILE_SHORT[bench_name]
    for row in load_jsonl(path):
        pid = row["paragraphId"]
        cat = FULL_TO_ABBR.get(row["label"]["content_category"], row["label"]["content_category"])
        signals[pid][short] = cat

# ── Ordered source list (for display) ──────────────────────────────────────
HUMAN_NAMES = sorted({r["annotatorName"] for r in human_raw})
ORDERED_SOURCES = (
    [f"H:{n}" for n in HUMAN_NAMES]
    + ["Opus"]
    + [f"S1:{S1_MODEL_SHORT[m]}" for m in sorted(S1_MODEL_SHORT)]
    + [BENCH_FILE_SHORT[b] for b in BENCH_FILES]
)

# ── Utility: compute axis stats ────────────────────────────────────────────


def axis_candidates(cat_a: str, cat_b: str, extra_cat: str | None = None) -> list[tuple[str, dict, Counter]]:
    """Find PIDs where both cat_a and cat_b appear among the 13 signals.

    Returns a list of (pid, signals_dict, vote_counter) sorted by closeness of split.
    """
    results = []
    for pid, sigs in signals.items():
        if pid not in holdout_pids:
            continue
        counts = Counter(sigs.values())
        cats_present = set(counts.keys())
        if cat_a in cats_present and cat_b in cats_present:
            if extra_cat is not None and extra_cat not in cats_present:
                continue
            # closeness = min(count_a, count_b) / total — higher means a closer split
            total = sum(counts.values())
            closeness = min(counts[cat_a], counts[cat_b]) / total
            results.append((pid, sigs, counts, closeness))
    # Sort by closeness (descending), then by total signal count (descending) as a tiebreaker
    results.sort(key=lambda x: (-x[3], -sum(x[2].values())))
    return [(pid, sigs, counts) for pid, sigs, counts, _ in results]


def print_example(pid: str, sigs: dict, counts: Counter, sub_pattern: str, note: str = ""):
    """Print one example paragraph with its signals."""
    para = para_map.get(pid)
    if not para:
        print(f"  [paragraph {pid} not found]")
        return

    print(f"  ┌─ Paragraph {pid}")
    print(f"  │ Company: {para.get('companyName', '?')} | Filing: {para.get('filingType', '?')} {para.get('filingDate', '?')}")
    print(f"  │ Sub-pattern: {sub_pattern}")
    print("  │")

    # Full text — wrap at 100 chars, indented
    for line in para["text"].split("\n"):
        print(fill(line, width=100, initial_indent="  │ ", subsequent_indent="  │ "))
    print("  │")

    # Signal table — compact single line
    parts = [f"{src}={sigs[src]}" for src in ORDERED_SOURCES if src in sigs]
    print(f"  │ Signals: {', '.join(parts)}")

    # Vote tally
    tally_parts = [f"{cat}: {n}" for cat, n in counts.most_common()]
    print(f"  │ Tally: {', '.join(tally_parts)} (out of {sum(counts.values())})")

    if note:
        print("  │")
        for line in note.split("\n"):
            print(fill(line, width=100, initial_indent="  │ ▸ ", subsequent_indent="  │ "))

    print(f"  └{'─' * 78}")
    print()


def pick_diverse(candidates: list[tuple[str, dict, Counter]], n: int, min_signals: int = 10) -> list[tuple[str, dict, Counter]]:
    """Pick n diverse examples from candidates (different companies; prefer many signals)."""
    if len(candidates) <= n:
        return candidates
    # Filter to examples with enough signals for a meaningful table
    rich = [(pid, sigs, counts) for pid, sigs, counts in candidates if sum(counts.values()) >= min_signals]
    if len(rich) < n:
        rich = candidates  # fall back if there aren't enough rich examples
    # Diversify by company
    seen_companies: set[str] = set()
    selected = []
    for pid, sigs, counts in rich:
        company = para_map.get(pid, {}).get("companyName", "")
        if company in seen_companies and len(rich) > n * 2:
            continue
        selected.append((pid, sigs, counts))
        seen_companies.add(company)
        if len(selected) >= n * 3:
            break
    return selected[:n]
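# Example (hypothetical data): pick_diverse keeps signal-rich, company-diverse rows.
#   a = ("p1", {...}, Counter({"MR": 7, "RMP": 6}))   # 13 signals
#   b = ("p2", {...}, Counter({"MR": 2, "RMP": 1}))   # only 3 signals
#   pick_diverse([a, b], 1) -> [a]                    # b filtered out by min_signals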
||||||
|
|
||||||
|
|
||||||
|
# ══════════════════════════════════════════════════════════════════════════
|
||||||
|
# AXIS 1: MR ↔ RMP
|
||||||
|
# ══════════════════════════════════════════════════════════════════════════
|
||||||
|
print("\n" + "=" * 80)
|
||||||
|
print(" AXIS 1: MR ↔ RMP — Management Role vs. Risk Management Process")
|
||||||
|
print("=" * 80)
|
||||||
|
|
||||||
|
mr_rmp = axis_candidates("MR", "RMP")
|
||||||
|
print(f"\n Total paragraphs with both MR and RMP in signals: {len(mr_rmp)}\n")
|
||||||
|
|
||||||
|
|
||||||
|
def classify_mr_rmp_subpattern(text: str) -> str:
|
||||||
|
"""Heuristic to guess sub-pattern for MR↔RMP confusion."""
|
||||||
|
text_lower = text.lower()
|
||||||
|
sentences = [s.strip() for s in text.replace("\n", " ").split(".") if s.strip()]
|
||||||
|
|
||||||
|
person_keywords = [
|
||||||
|
"ciso", "chief information security", "chief information officer",
|
||||||
|
"cio", "vp ", "vice president", "director", "officer", "head of",
|
||||||
|
"manager", "leader", "executive", "cto", "chief technology",
|
||||||
|
]
|
||||||
|
process_keywords = [
|
||||||
|
"program", "framework", "process", "policy", "policies",
|
||||||
|
"procedures", "controls", "assessment", "monitoring",
|
||||||
|
"risk management", "incident response", "vulnerability",
|
||||||
|
]
|
||||||
|
|
||||||
|
person_subject_sentences = 0
|
||||||
|
process_subject_sentences = 0
|
||||||
|
|
||||||
|
for sent in sentences:
|
||||||
|
sent_lower = sent.lower().strip()
|
||||||
|
has_person = any(kw in sent_lower[:80] for kw in person_keywords)
|
||||||
|
has_process = any(kw in sent_lower[:80] for kw in process_keywords)
|
||||||
|
if has_person:
|
||||||
|
person_subject_sentences += 1
|
||||||
|
if has_process:
|
||||||
|
process_subject_sentences += 1
|
||||||
|
|
||||||
|
if person_subject_sentences > 0 and process_subject_sentences == 0:
|
||||||
|
return "person-subject"
|
||||||
|
elif process_subject_sentences > 0 and person_subject_sentences == 0:
|
||||||
|
return "process-subject"
|
||||||
|
elif person_subject_sentences > 0 and process_subject_sentences > 0:
|
||||||
|
return "mixed"
|
||||||
|
else:
|
||||||
|
return "other"
|
||||||
|
|
||||||
|
|
||||||
|
# Bucket candidates by sub-pattern
|
||||||
|
buckets: dict[str, list] = {"person-subject": [], "process-subject": [], "mixed": [], "other": []}
|
||||||
|
for pid, sigs, counts in mr_rmp:
|
||||||
|
text = para_map.get(pid, {}).get("text", "")
|
||||||
|
sp = classify_mr_rmp_subpattern(text)
|
||||||
|
buckets[sp].append((pid, sigs, counts))
|
||||||
|
|
||||||
|
print(f" Sub-pattern distribution: person-subject={len(buckets['person-subject'])}, "
|
||||||
|
f"process-subject={len(buckets['process-subject'])}, mixed={len(buckets['mixed'])}, "
|
||||||
|
f"other={len(buckets['other'])}")
|
||||||
|
print()
|
||||||
|
|
||||||
|
# (a) Person is grammatical subject
|
||||||
|
print(" ── (a) Person is the grammatical subject, doing process-like things ──\n")
|
||||||
|
for pid, sigs, counts in pick_diverse(buckets["person-subject"], 2):
|
||||||
|
text = para_map[pid]["text"]
|
||||||
|
# Subject test note
|
||||||
|
note = "SUBJECT TEST → MR (person is the main subject)"
|
||||||
|
print_example(pid, sigs, counts, "Person as subject doing process-like things", note)
|
||||||
|
|
||||||
|
# (b) Process/framework is subject
|
||||||
|
print(" ── (b) Process/framework is the subject, person mentioned as responsible ──\n")
|
||||||
|
for pid, sigs, counts in pick_diverse(buckets["process-subject"], 2):
|
||||||
|
text = para_map[pid]["text"]
|
||||||
|
note = "SUBJECT TEST → RMP (process/framework is the main subject)"
|
||||||
|
print_example(pid, sigs, counts, "Process as subject, person mentioned", note)
|
||||||
|
|
||||||
|
# (c) Mixed
|
||||||
|
print(" ── (c) Mixed — both person and process are subjects ──\n")
|
||||||
|
for pid, sigs, counts in pick_diverse(buckets["mixed"], 2):
|
||||||
|
note = "SUBJECT TEST → AMBIGUOUS (both person and process serve as subjects)"
|
||||||
|
print_example(pid, sigs, counts, "Mixed subjects", note)
|
||||||
|
|
||||||
|
# (d) Edge cases — closest splits from "other" or overall closest
|
||||||
|
print(" ── (d) Edge cases — genuinely hard to call ──\n")
|
||||||
|
# Take from overall closest that aren't already shown
|
||||||
|
shown_pids = set()
|
||||||
|
for bucket in buckets.values():
|
||||||
|
for pid, _, _ in bucket[:2]:
|
||||||
|
shown_pids.add(pid)
|
||||||
|
edge_cases = [(p, s, c) for p, s, c in mr_rmp if p not in shown_pids][:20]
|
||||||
|
for pid, sigs, counts in pick_diverse(edge_cases, 2):
|
||||||
|
mr_count = counts.get("MR", 0)
|
||||||
|
rmp_count = counts.get("RMP", 0)
|
||||||
|
note = f"SUBJECT TEST → unclear; split is {mr_count}-{rmp_count} MR-RMP"
|
||||||
|
print_example(pid, sigs, counts, "Edge case", note)
|
||||||
|
|
||||||
|
|
||||||
|
# ══════════════════════════════════════════════════════════════════════════
|
||||||
|
# AXIS 2: BG ↔ MR
|
||||||
|
# ══════════════════════════════════════════════════════════════════════════
|
||||||
|
print("\n" + "=" * 80)
|
||||||
|
print(" AXIS 2: BG ↔ MR — Board Governance vs. Management Role")
|
||||||
|
print("=" * 80)
|
||||||
|
|
||||||
|
bg_mr = axis_candidates("BG", "MR")
|
||||||
|
print(f"\n Total paragraphs with both BG and MR in signals: {len(bg_mr)}\n")
|
||||||
|
|
||||||
|
|
||||||
|
def classify_bg_mr_subpattern(text: str) -> str:
|
||||||
|
text_lower = text.lower()
|
||||||
|
board_words = ["board", "committee", "audit committee", "directors"]
|
||||||
|
mgmt_words = ["ciso", "chief information", "officer", "vp", "vice president",
|
||||||
|
"director of", "head of", "reports to", "briefing", "briefs",
|
||||||
|
"presents to", "reporting"]
|
||||||
|
|
||||||
|
has_board_actor = any(w in text_lower for w in board_words)
|
||||||
|
has_mgmt_reporting = any(w in text_lower for w in mgmt_words)
|
||||||
|
|
||||||
|
if has_board_actor and not has_mgmt_reporting:
|
||||||
|
return "board-actor"
|
||||||
|
elif has_mgmt_reporting and has_board_actor:
|
||||||
|
return "mgmt-reporting-to-board"
|
||||||
|
elif has_mgmt_reporting:
|
||||||
|
return "mgmt-only"
|
||||||
|
else:
|
||||||
|
return "mixed-governance"
|
||||||
|
|
||||||
|
|
||||||
|
buckets_bg: dict[str, list] = defaultdict(list)
|
||||||
|
for pid, sigs, counts in bg_mr:
|
||||||
|
sp = classify_bg_mr_subpattern(para_map.get(pid, {}).get("text", ""))
|
||||||
|
buckets_bg[sp].append((pid, sigs, counts))
|
||||||
|
|
||||||
|
print(f" Sub-pattern distribution: {dict((k, len(v)) for k, v in buckets_bg.items())}")
|
||||||
|
print()
|
||||||
|
|
||||||
|
# (a) Board/committee is clearly the actor
|
||||||
|
print(" ── (a) Board/committee is clearly the actor ──\n")
|
||||||
|
pool = buckets_bg.get("board-actor", []) or buckets_bg.get("mixed-governance", [])
|
||||||
|
for pid, sigs, counts in pick_diverse(pool, 2):
|
||||||
|
print_example(pid, sigs, counts, "Board as actor")
|
||||||
|
|
||||||
|
# (b) Management officer reporting TO the board
|
||||||
|
print(" ── (b) Management officer reporting TO/briefing the board ──\n")
|
||||||
|
pool = buckets_bg.get("mgmt-reporting-to-board", [])
|
||||||
|
for pid, sigs, counts in pick_diverse(pool, 2):
|
||||||
|
note = "KEY QUESTION: Is this BG (board receiving info) or MR (officer doing the briefing)?"
|
||||||
|
print_example(pid, sigs, counts, "Management reporting to board", note)
|
||||||
|
|
||||||
|
# (c) Mixed governance
|
||||||
|
print(" ── (c) Mixed governance language ──\n")
|
||||||
|
remaining = [x for x in bg_mr if x[0] not in {p for bucket in buckets_bg.values() for p, _, _ in bucket[:2]}]
|
||||||
|
for pid, sigs, counts in pick_diverse(remaining, 2):
|
||||||
|
note = "Could be BG, MR, or RMP depending on interpretation"
|
||||||
|
print_example(pid, sigs, counts, "Mixed governance", note)
|
||||||
|
|
||||||
|
|
||||||
|
# ══════════════════════════════════════════════════════════════════════════
|
||||||
|
# AXIS 3: SI ↔ N/O
|
||||||
|
# ══════════════════════════════════════════════════════════════════════════
|
||||||
|
print("\n" + "=" * 80)
|
||||||
|
print(" AXIS 3: SI ↔ N/O — Strategy Integration vs. None/Other")
|
||||||
|
print("=" * 80)
|
||||||
|
|
||||||
|
si_no = axis_candidates("SI", "N/O")
|
||||||
|
print(f"\n Total paragraphs with both SI and N/O in signals: {len(si_no)}\n")
|
||||||
|
|
||||||
|
|
||||||
|
def classify_si_no_subpattern(text: str) -> str:
|
||||||
|
text_lower = text.lower()
|
||||||
|
|
||||||
|
incident_words = ["incident", "breach", "attack", "compromised", "unauthorized access",
|
||||||
|
"data breach", "ransomware", "phishing"]
|
||||||
|
negative_words = ["have not experienced", "not experienced", "no material",
|
||||||
|
"has not been materially", "not been the subject",
|
||||||
|
"not aware of any", "no known", "have not had"]
|
||||||
|
hypothetical_words = ["could", "may", "might", "would", "if ", "potential",
|
||||||
|
"face threats", "subject to"]
|
||||||
|
specific_words = ["$", "million", "vendor", "contract", "insurance",
|
||||||
|
"specific", "particular", "named"]
|
||||||
|
|
||||||
|
has_incident = any(w in text_lower for w in incident_words)
|
||||||
|
has_negative = any(w in text_lower for w in negative_words)
|
||||||
|
has_hypothetical = any(w in text_lower for w in hypothetical_words)
|
||||||
|
has_specific = any(w in text_lower for w in specific_words)
|
||||||
|
|
||||||
|
if has_incident and not has_negative:
|
||||||
|
return "actual-incident"
|
||||||
|
elif has_negative:
|
||||||
|
return "negative-assertion"
|
||||||
|
elif has_hypothetical and not has_specific:
|
||||||
|
return "hypothetical"
|
||||||
|
elif has_specific:
|
||||||
|
return "specific-no-incident"
|
||||||
|
else:
|
||||||
|
return "other"
|
||||||
|
|
||||||
|
|
||||||
|
buckets_si: dict[str, list] = defaultdict(list)
|
||||||
|
for pid, sigs, counts in si_no:
|
||||||
|
sp = classify_si_no_subpattern(para_map.get(pid, {}).get("text", ""))
|
||||||
|
buckets_si[sp].append((pid, sigs, counts))
|
||||||
|
|
||||||
|
print(f" Sub-pattern distribution: {dict((k, len(v)) for k, v in buckets_si.items())}")
|
||||||
|
print()
|
||||||
|
|
||||||
|
# Also find the 23 cases where humans=SI but GenAI=N/O
|
||||||
|
human_si_genai_no = []
|
||||||
|
for pid, sigs, counts in si_no:
|
||||||
|
human_cats = [sigs.get(f"H:{n}") for n in HUMAN_NAMES if f"H:{n}" in sigs]
|
||||||
|
genai_cats = [v for k, v in sigs.items() if not k.startswith("H:")]
|
||||||
|
human_si = sum(1 for c in human_cats if c == "SI")
|
||||||
|
human_no = sum(1 for c in human_cats if c == "N/O")
|
||||||
|
genai_si = sum(1 for c in genai_cats if c == "SI")
|
||||||
|
genai_no = sum(1 for c in genai_cats if c == "N/O")
|
||||||
|
if human_si > human_no and genai_no > genai_si:
|
||||||
|
human_si_genai_no.append((pid, sigs, counts))
|
||||||
|
|
||||||
|
print(f" Cases where humans lean SI but GenAI leans N/O: {len(human_si_genai_no)}")
|
||||||
|
print()

# (a) Clear actual incident
print("  ── (a) Clear actual incident described ──\n")
for pid, sigs, counts in pick_diverse(buckets_si.get("actual-incident", []), 2):
    print_example(pid, sigs, counts, "Actual incident")

# (b) Negative assertion
print("  ── (b) Negative assertion — 'we have not experienced material incidents' ──\n")
neg_pool = buckets_si.get("negative-assertion", [])
# Prefer ones in the human-SI-genAI-NO set
neg_human_si = [x for x in neg_pool if x[0] in {p for p, _, _ in human_si_genai_no}]
neg_other = [x for x in neg_pool if x[0] not in {p for p, _, _ in human_si_genai_no}]
pool = neg_human_si[:2] if len(neg_human_si) >= 2 else (neg_human_si + neg_other)[:2]
for pid, sigs, counts in pool:
    human_cats = [sigs.get(f"H:{n}") for n in HUMAN_NAMES if f"H:{n}" in sigs]
    genai_cats = [v for k, v in sigs.items() if not k.startswith("H:")]
    note = (f"CRUX: Humans keyed on the materiality assessment language. "
            f"Human votes: {Counter(human_cats).most_common()}, "
            f"GenAI votes: {Counter(genai_cats).most_common()}")
    print_example(pid, sigs, counts, "Negative assertion", note)

# (c) Hypothetical/conditional
print("  ── (c) Hypothetical/conditional language ──\n")
for pid, sigs, counts in pick_diverse(buckets_si.get("hypothetical", []), 2):
    print_example(pid, sigs, counts, "Hypothetical/conditional")

# (d) Specific programs/vendors/amounts but no incident
print("  ── (d) Specific programs/vendors/amounts but no incident ──\n")
spec_pool = buckets_si.get("specific-no-incident", [])
if len(spec_pool) < 2:
    spec_pool += buckets_si.get("other", [])
for pid, sigs, counts in pick_diverse(spec_pool, 2):
    note = "SI because specific details? Or N/O because no event/strategy content?"
    print_example(pid, sigs, counts, "Specific but no incident", note)

# Extra: show human-SI / genAI-N/O cases not already shown
shown_si = set()
for bucket in buckets_si.values():
    for p, _, _ in bucket[:2]:
        shown_si.add(p)
extra_human_si = [x for x in human_si_genai_no if x[0] not in shown_si]
if extra_human_si:
    print("  ── (extra) Additional human=SI, GenAI=N/O cases ──\n")
    for pid, sigs, counts in pick_diverse(extra_human_si, 2):
        human_cats = [sigs.get(f"H:{n}") for n in HUMAN_NAMES if f"H:{n}" in sigs]
        genai_cats = [v for k, v in sigs.items() if not k.startswith("H:")]
        note = (f"Humans: {Counter(human_cats).most_common()}, "
                f"GenAI: {Counter(genai_cats).most_common()}")
        print_example(pid, sigs, counts, "Human=SI, GenAI=N/O", note)

# ══════════════════════════════════════════════════════════════════════════
# AXIS 4: Three-way BG ↔ MR ↔ RMP
# ══════════════════════════════════════════════════════════════════════════
print("\n" + "=" * 80)
print(" AXIS 4: Three-way BG ↔ MR ↔ RMP")
print("=" * 80)

three_way = []
for pid, sigs in signals.items():
    if pid not in holdout_pids:
        continue
    counts = Counter(sigs.values())
    if "BG" in counts and "MR" in counts and "RMP" in counts:
        # Score by how evenly split the three are
        vals = [counts["BG"], counts["MR"], counts["RMP"]]
        total_3 = sum(vals)
        evenness = min(vals) / max(vals) if max(vals) > 0 else 0
        three_way.append((pid, sigs, counts, evenness))

three_way.sort(key=lambda x: (-x[3], -sum(x[2].values())))
print(f"\n  Total paragraphs with all three of BG, MR, RMP: {len(three_way)}\n")
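
# Worked example of the evenness score (invented counts): with
# counts = {"BG": 4, "MR": 3, "RMP": 3}, vals = [4, 3, 3] and
# evenness = min(vals) / max(vals) = 3 / 4 = 0.75. A perfect three-way tie
# (e.g. [3, 3, 3]) scores 1.0, so the sort surfaces the most contested
# paragraphs first, with total signal count as the tiebreaker.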

# Pick diverse examples with enough signals
seen_co: set[str] = set()
three_way_selected = []
for pid, sigs, counts, evenness in three_way:
    if sum(counts.values()) < 10:
        continue
    co = para_map.get(pid, {}).get("companyName", "")
    if co in seen_co:
        continue
    seen_co.add(co)
    three_way_selected.append((pid, sigs, counts, evenness))
    if len(three_way_selected) >= 3:
        break

for pid, sigs, counts, evenness in three_way_selected:
    bg_c, mr_c, rmp_c = counts["BG"], counts["MR"], counts["RMP"]
    note = (f"Three-way split: BG={bg_c}, MR={mr_c}, RMP={rmp_c}. "
            f"This paragraph intertwines governance, management roles, and process descriptions.")
    print_example(pid, sigs, counts, "Three-way BG/MR/RMP", note)

# ── Summary statistics ────────────────────────────────────────────────────
print("\n" + "=" * 80)
print(" SUMMARY")
print("=" * 80)
print(f"""
  Axis 1 (MR↔RMP):      {len(mr_rmp)} paragraphs with split signals
  Axis 2 (BG↔MR):       {len(bg_mr)} paragraphs with split signals
  Axis 3 (SI↔N/O):      {len(si_no)} paragraphs with split signals
  Axis 4 (BG↔MR↔RMP):   {len(three_way)} paragraphs with three-way split
  Human=SI/GenAI=N/O:   {len(human_si_genai_no)} cases (directional asymmetry)
""")
ts/src/cli.ts
@ -5,7 +5,7 @@ import { STAGE1_MODELS, BENCHMARK_MODELS } from "./lib/openrouter.ts";
import { runBatch } from "./label/batch.ts";
import { runGoldenBatch } from "./label/golden.ts";
import { computeConsensus } from "./label/consensus.ts";
import { judgeParagraph } from "./label/annotate.ts";
import { judgeParagraph, annotateParagraph, reEvalParagraph } from "./label/annotate.ts";
import { appendJsonl, readJsonlRaw } from "./lib/jsonl.ts";
import { v4 as uuidv4 } from "uuid";
import { PROMPT_VERSION } from "./label/prompts.ts";
@ -29,6 +29,9 @@ Commands:
  label:golden [--paragraphs <path>] [--limit N] [--delay N] [--concurrency N] (Opus via Agent SDK)
  label:bench-holdout --model <id> [--concurrency N] [--limit N] (benchmark model on holdout)
  label:bench-holdout-all [--concurrency N] [--limit N] (all BENCHMARK_MODELS on holdout)
  label:bench-holdout-v35 --model <id> [--concurrency N] [--limit N] (v3.5 re-run on confusion-axis holdout)
  label:golden-v35 [--limit N] [--delay N] [--concurrency N] (Opus v3.5 re-run on confusion-axis holdout)
  label:reeval --model <id> [--concurrency N] [--limit N] (re-evaluate flagged Stage 1 paragraphs)
label:cost`);
  process.exit(1);
}
@ -321,6 +324,145 @@ async function cmdBenchHoldoutAll(): Promise<void> {
  }
}

async function loadConfusionAxisParagraphs(rerunFile?: string): Promise<Paragraph[]> {
  const rerunPath = rerunFile ?? `${DATA}/gold/holdout-rerun-v35.jsonl`;
  const { records: rerunRecords } = await readJsonlRaw(rerunPath);
  const rerunIds = new Set(
    rerunRecords
      .filter((r): r is { paragraphId: string } =>
        !!r && typeof r === "object" && "paragraphId" in r)
      .map((r) => r.paragraphId),
  );
  process.stderr.write(`  Loaded ${rerunIds.size} confusion-axis paragraph IDs\n`);

  const allHoldout = await loadHoldoutParagraphs();
  const paragraphs = allHoldout.filter((p) => rerunIds.has(p.id));
  process.stderr.write(`  Matched ${paragraphs.length} paragraphs for v3.5 re-run\n`);
  return paragraphs;
}
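
// For reference, each line of the rerun file is only required to carry a
// paragraphId; any other fields are ignored by the filter above. An invented
// example record:
//   {"paragraphId":"para-000123","axis":"SI|N/O"}
// Records without a paragraphId are silently dropped rather than failing the
// whole load.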

async function cmdGoldenV35(): Promise<void> {
  const paragraphs = await loadConfusionAxisParagraphs();

  if (paragraphs.length === 0) {
    process.stderr.write("  ✖ No confusion-axis paragraphs found\n");
    process.exit(1);
  }

  await runGoldenBatch(paragraphs, {
    outputPath: `${DATA}/annotations/golden-v35/opus.jsonl`,
    errorsPath: `${DATA}/annotations/golden-v35/opus-errors.jsonl`,
    limit: flag("limit") !== undefined ? flagInt("limit", 50) : undefined,
    delayMs: flag("delay") !== undefined ? flagInt("delay", 500) : 500,
    concurrency: flagInt("concurrency", 20),
  });
}

async function cmdBenchHoldoutV35(): Promise<void> {
  const modelId = flag("model");
  if (!modelId) {
    console.error("--model is required");
    process.exit(1);
  }

  const rerunFile = flag("rerun-file");
  const outputDir = flag("output-dir") ?? "bench-holdout-v35";
  const paragraphs = await loadConfusionAxisParagraphs(rerunFile ?? undefined);
  const modelShort = modelId.split("/")[1]!;
  await runBatch(paragraphs, {
    modelId,
    stage: "benchmark",
    outputPath: `${DATA}/annotations/${outputDir}/${modelShort}.jsonl`,
    errorsPath: `${DATA}/annotations/${outputDir}/${modelShort}-errors.jsonl`,
    sessionsPath: SESSIONS_PATH,
    concurrency: flagInt("concurrency", 60),
    limit: flag("limit") !== undefined ? flagInt("limit", 50) : undefined,
  });
}
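
// Hypothetical invocations (runner name and model id invented for
// illustration):
//   cli label:golden-v35 --limit 20 --concurrency 10
//   cli label:bench-holdout-v35 --model someprovider/some-model --concurrency 30
// Both commands read the same confusion-axis ID list, so their outputs are
// comparable paragraph-for-paragraph.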

async function cmdReeval(): Promise<void> {
  const modelId = flag("model");
  if (!modelId) {
    console.error("--model is required");
    process.exit(1);
  }

  // Load flagged paragraphs
  const correctionsPath = `${DATA}/annotations/stage1-corrections.jsonl`;
  const { records: corrections } = await readJsonlRaw(correctionsPath);
  const correctionMap = new Map<string, { reason: string }>();
  for (const r of corrections) {
    const rec = r as { paragraphId?: string; reason?: string };
    if (rec.paragraphId && rec.reason) {
      correctionMap.set(rec.paragraphId, { reason: rec.reason });
    }
  }
  process.stderr.write(`  Loaded ${correctionMap.size} flagged paragraphs from ${correctionsPath}\n`);

  // Load all paragraphs and filter to flagged ones
  const paragraphs = await loadParagraphs();
  const flaggedParagraphs = paragraphs.filter((p) => correctionMap.has(p.id));
  process.stderr.write(`  Matched ${flaggedParagraphs.length} paragraphs for re-evaluation\n`);

  const modelShort = modelId.split("/")[1]!;
  const outputPath = `${DATA}/annotations/stage2/reeval-${modelShort}.jsonl`;
  const errorsPath = `${DATA}/annotations/stage2/reeval-${modelShort}-errors.jsonl`;

  // Resume support
  const { records: existing } = await readJsonlRaw(outputPath);
  const doneIds = new Set(
    existing
      .filter((r): r is { paragraphId: string } =>
        !!r && typeof r === "object" && "paragraphId" in r)
      .map((r) => r.paragraphId),
  );

  let remaining = flaggedParagraphs.filter((p) => !doneIds.has(p.id));
  const limit = flag("limit") !== undefined ? flagInt("limit", 50) : undefined;
  if (limit !== undefined) remaining = remaining.slice(0, limit);

  process.stderr.write(`  ${remaining.length} paragraphs to re-evaluate (${doneIds.size} already done)\n`);

  const runId = uuidv4();
  const concurrency = flagInt("concurrency", 12);
  let processed = 0;
  let errored = 0;
  let totalCost = 0;

  // Process in batches respecting concurrency
  for (let i = 0; i < remaining.length; i += concurrency) {
    const batch = remaining.slice(i, i + concurrency);
    const results = await Promise.allSettled(
      batch.map(async (paragraph) => {
        const correction = correctionMap.get(paragraph.id)!;
        const reason = correction.reason as "materiality_language" | "spac" | "other";
        const ann = await reEvalParagraph(paragraph, {
          modelId,
          runId,
          reason,
          promptVersion: PROMPT_VERSION,
        });
        await appendJsonl(outputPath, ann);
        totalCost += ann.provenance.costUsd;
        processed++;
      }),
    );

    for (const r of results) {
      if (r.status === "rejected") {
        errored++;
        process.stderr.write(`  ✖ Error: ${r.reason}\n`);
      }
    }

    if (processed % 50 === 0 || i + concurrency >= remaining.length) {
      process.stderr.write(`  ${processed}/${remaining.length} re-evaluated (${errored} errors, $${totalCost.toFixed(4)} cost)\n`);
    }
  }

  process.stderr.write(`\n  ✓ Re-evaluation done: ${processed} processed, ${errored} errors, $${totalCost.toFixed(4)} total cost\n`);
}
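
// Note on the resume pattern above: each annotation is appended to JSONL as
// soon as it completes, so an interrupted run can simply be re-invoked —
// doneIds is rebuilt from the output file and finished paragraphs are skipped.
// Each batch is awaited before the next starts, so the processed/errored
// counters are read only between batches.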

async function cmdCost(): Promise<void> {
  const modelCosts: Record<string, { cost: number; count: number }> = {};
  const stageCosts: Record<string, { cost: number; count: number }> = {};
@ -435,6 +577,15 @@ switch (command) {
  case "label:bench-holdout-all":
    await cmdBenchHoldoutAll();
    break;
  case "label:bench-holdout-v35":
    await cmdBenchHoldoutV35();
    break;
  case "label:golden-v35":
    await cmdGoldenV35();
    break;
  case "label:reeval":
    await cmdReeval();
    break;
  case "label:cost":
    await cmdCost();
    break;
ts/src/label/annotate.ts
@ -3,7 +3,7 @@ import { openrouter, providerOf } from "../lib/openrouter.ts";
import { LabelOutputRaw, toLabelOutput } from "@sec-cybert/schemas/label.ts";
import type { Annotation } from "@sec-cybert/schemas/annotation.ts";
import type { Paragraph } from "@sec-cybert/schemas/paragraph.ts";
import { SYSTEM_PROMPT, buildUserPrompt, buildJudgePrompt, PROMPT_VERSION } from "./prompts.ts";
import { SYSTEM_PROMPT, buildUserPrompt, buildJudgePrompt, buildReEvalPrompt, PROMPT_VERSION } from "./prompts.ts";
import { withRetry } from "../lib/retry.ts";

/** OpenRouter reasoning effort levels. */
@ -125,6 +125,88 @@ export interface JudgeOpts {
 * Run the Stage 2 judge on a paragraph where Stage 1 models disagreed.
 * Receives the paragraph + all 3 prior annotations in randomized order.
 */
export interface ReEvalOpts {
  modelId: string;
  runId: string;
  reason: "materiality_language" | "spac" | "other";
  promptVersion?: string;
  reasoningEffort?: ReasoningEffort;
}

/**
 * Re-evaluate a paragraph under v3.5 codebook rules.
 * Used for correcting Stage 1 labels affected by prompt version drift.
 */
export async function reEvalParagraph(
  paragraph: Paragraph,
  opts: ReEvalOpts,
): Promise<Annotation> {
  const {
    modelId,
    runId,
    reason,
    promptVersion = PROMPT_VERSION,
    reasoningEffort = "medium",
  } = opts;
  const requestedAt = new Date().toISOString();
  const start = Date.now();

  const useRawText = modelId.startsWith("minimax/") || modelId.startsWith("moonshotai/");

  const result = await withRetry(
    async () => {
      if (useRawText) {
        const r = await generateText({
          model: openrouter(modelId),
          system: SYSTEM_PROMPT,
          prompt: buildReEvalPrompt(paragraph, reason),
          temperature: 0,
          providerOptions: buildProviderOptions(reasoningEffort),
          abortSignal: AbortSignal.timeout(360_000),
        });
        const text = r.text.trim();
        const fenceMatch = text.match(/```(?:json)?\s*\n?([\s\S]*?)\n?```/);
        const jsonStr = fenceMatch ? fenceMatch[1]! : text;
        const parsed = LabelOutputRaw.parse(JSON.parse(jsonStr));
        return { ...r, output: parsed, usage: r.usage, response: r.response, providerMetadata: r.providerMetadata };
      }
      return generateText({
        model: openrouter(modelId),
        output: Output.object({ schema: LabelOutputRaw }),
        system: SYSTEM_PROMPT,
        prompt: buildReEvalPrompt(paragraph, reason),
        temperature: 0,
        providerOptions: buildProviderOptions(reasoningEffort),
        abortSignal: AbortSignal.timeout(360_000),
      });
    },
    { label: `reeval:${modelId}:${paragraph.id}` },
  );

  const latencyMs = Date.now() - start;
  const rawOutput = result.output as LabelOutputRaw;
  if (!rawOutput) throw new Error(`No output from ${modelId} for ${paragraph.id}`);

  return {
    paragraphId: paragraph.id,
    label: toLabelOutput(rawOutput),
    provenance: {
      modelId,
      provider: providerOf(modelId),
      generationId: result.response?.id ?? "unknown",
      stage: "stage2-judge",
      runId,
      promptVersion,
      inputTokens: result.usage?.inputTokens ?? 0,
      outputTokens: result.usage?.outputTokens ?? 0,
      reasoningTokens: result.usage?.outputTokenDetails?.reasoningTokens ?? 0,
      costUsd: extractCost(result),
      latencyMs,
      requestedAt,
    },
  };
}
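
// Design note (inferred from the branch above, not stated in this diff): the
// raw-text path appears to exist for providers whose endpoints do not reliably
// honor structured output, so the reply is parsed out of an optional ```json
// fence instead of relying on Output.object(); when no fence is found, the
// entire reply is parsed as JSON.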

export async function judgeParagraph(
  paragraph: Paragraph,
  priorAnnotations: Array<{
ts/src/label/prompts.ts
@ -1,6 +1,6 @@
import type { Paragraph } from "@sec-cybert/schemas/paragraph.ts";

export const PROMPT_VERSION = "v3.0";
export const PROMPT_VERSION = "v3.5";

/** System prompt for all Stage 1 annotation and benchmarking. */
export const SYSTEM_PROMPT = `You are an expert annotator classifying paragraphs from SEC cybersecurity disclosures (Form 10-K Item 1C and Form 8-K Item 1.05 filings).
@ -11,30 +11,46 @@ For each paragraph, assign a content_category and specificity level.

Assign the single most applicable category:

"Board Governance" — Board/committee oversight of cyber risk, briefing cadence, board cyber expertise. Assign when the board or a board committee is the grammatical subject.
"Board Governance" — Board/committee oversight of cyber risk, briefing cadence, board cyber expertise. Assign when the paragraph describes the governance/oversight STRUCTURE — how the board exercises oversight, who reports to the board, how information flows upward. Governance-chain paragraphs (board → committee → officer → program) are BG even when officers appear as grammatical subjects, because the PURPOSE is describing oversight structure.
"Management Role" — CISO/CTO/CIO identification, qualifications, reporting lines. Assign when a management role is the grammatical subject.
"Management Role" — CISO/CTO/CIO identification, qualifications, reporting lines. Assign when the paragraph is primarily about WHO the person IS — their credentials, experience, certifications, career history. Naming an officer as part of a governance or process description does NOT make it Management Role.
"Risk Management Process" — Risk assessment, framework adoption, vulnerability management, monitoring, IR planning, ERM integration. Assign when the company's OWN internal processes are the topic.
"Third-Party Risk" — Vendor/supplier security oversight, contractual security standards. Assign ONLY when vendor oversight is the CENTRAL topic, not a component of internal processes.
"Incident Disclosure" — Description of actual cybersecurity incidents: what happened, when, scope, response actions. Must reference a real event. Includes: incident narrative, incident response actions, AND descriptions of affected data/systems scope or operational impact of a disclosed incident.
"Strategy Integration" — Business/financial impact, cyber insurance, budget, materiality assessments. Includes standalone materiality conclusions with no incident narrative.
"Strategy Integration" — Business/financial impact, cyber insurance, budget, materiality ASSESSMENTS. A materiality assessment is the company stating a conclusion about whether cybersecurity has or will affect business outcomes. Includes: backward-looking ("have not materially affected"), forward-looking with SEC qualifier ("reasonably likely to materially affect"), and negative assertions ("have not experienced material incidents"). Does NOT include generic risk warnings ("could have a material adverse effect") — those are boilerplate speculation, not assessments. Does NOT include "material" as an adjective ("managing material risks").
"None/Other" — Forward-looking disclaimers, section headers, cross-references, non-cybersecurity content. NO substantive disclosure at all.
"None/Other" — Forward-looking disclaimers, section headers, cross-references, non-cybersecurity content, generic IT-dependence language ("our IT systems are important"). NO substantive disclosure AND no materiality language at all.

CATEGORY TIEBREAKERS:
- Paragraph DESCRIBES what happened in an incident (dates, access, encryption, scope, response actions) → Incident Disclosure
- Paragraph ONLY discusses financial cost, insurance, or materiality of an incident WITHOUT describing the event → Strategy Integration (even if it says "the incident" or "the cybersecurity incident")
- Brief mention of a past incident + materiality conclusion as the main point → Strategy Integration
- Standalone materiality conclusion with no incident reference → Strategy Integration
- Materiality disclaimers ("have not materially affected our business strategy, results of operations, or financial condition") → Strategy Integration, even if boilerplate. A cross-reference to Risk Factors appended to a materiality assessment does NOT change the classification. Only pure cross-references with no materiality conclusion are None/Other.
- Materiality ASSESSMENTS → Strategy Integration. An assessment is the company stating a conclusion:
  • Backward: "have not materially affected our business strategy, results of operations, or financial condition" → SI
  • Forward with SEC qualifier: "reasonably likely to materially affect" → SI
  • Negative assertion: "we have not experienced any material cybersecurity incidents" → SI
  NOT assessments (do NOT trigger SI):
  • Generic risk warning: "could have a material adverse effect on our business" → NOT SI. This is boilerplate speculation in every 10-K, not a conclusion. Classify by the paragraph's primary content.
  • "Material" as adjective: "managing material risks" → NOT SI. "Material" means "significant" here, not a materiality assessment.
  • Consequence clause: SPECULATIVE materiality language ("could have a material adverse effect") at the END of an RMP/risk paragraph does not override the primary purpose. BUT a negative assertion ("we have not experienced any material cybersecurity incidents") IS an assessment even at the end of a paragraph — it is a factual conclusion, not speculation.
  • Cross-references with materiality language: "For risks that may materially affect us, see Item 1A" → N/O (pointing elsewhere, not concluding).
- SPACs and shell companies explicitly stating they have no operations, no cybersecurity program, or no formal processes → None/Other regardless of incidental mentions of board oversight or risk acknowledgment. The absence of a program is not a description of a program.
- Internal processes mentioning vendors as one component → Risk Management Process
- Requirements imposed ON vendors → Third-Party Risk
- Board oversight mentioned briefly + management roles as main focus → Management Role
- Management mentioned briefly + board oversight as main focus → Board Governance

PERSON-VS-FUNCTION TEST (Management Role vs Risk Management Process):
If a paragraph is about the PERSON (qualifications, credentials, background, tenure, career history) → Management Role.
If it's about what the role/program DOES (processes, activities, tools, frameworks) → Risk Management Process, even if a CISO/CIO/CTO title appears.
Test: would the paragraph still make sense if you removed the person's name, title, and credentials? If yes → the paragraph is about the function, not the person → Risk Management Process.
MR vs RMP — THREE-STEP DECISION CHAIN (apply in order):
Step 1 — SUBJECT TEST: What is the grammatical subject?
  Clear process/framework/program as subject with no person detail → Risk Management Process. STOP.
  Person/role as subject → this is a SIGNAL, not decisive. ALWAYS continue to Step 2.
Step 2 — PERSON-REMOVAL TEST: Delete all named roles, titles, qualifications, experience, and credentials. Is the remaining text a coherent cybersecurity disclosure?
  YES → Risk Management Process (the process stands alone; people are incidental).
  NO → Management Role (the paragraph is fundamentally about who these people are).
  Borderline → continue to Step 3.
Step 3 — QUALIFICATIONS TIEBREAKER: Does the paragraph include years of experience, certifications (CISSP, CISM), education, team size, or career history?
  YES → Management Role (qualifications are MR-specific content).
  NO → Risk Management Process (no person-specific content beyond a title).
IMPORTANT: A paragraph where a named officer (CISO, CTO) is the grammatical subject but the content describes what the PROGRAM does is Risk Management Process. Step 1 must NOT short-circuit to MR just because a person is mentioned. Always apply Step 2.

═══ SPECIFICITY ═══
@ -113,26 +129,46 @@ ${text}`;

// ── Category confusion-axis disambiguation rules ──────────────────────────
// Keyed by sorted pair of disputed categories. Only included when relevant.
const CATEGORY_GUIDANCE: Record<string, string> = {
  "Management Role|Risk Management Process": `MANAGEMENT ROLE vs RISK MANAGEMENT PROCESS — ask: what is the DOMINANT communicative purpose?
• A named manager (CISO, VP) mentioned once at the beginning, followed by extensive process description → Risk Management Process. The role mention is incidental.
• Management Role requires the manager's identity, qualifications, or reporting structure to be the PRIMARY content — not just a brief attribution.
• Test: remove the role mention. Does the paragraph still make sense as a process description? If yes → RMP.`,
  "Management Role|Risk Management Process": `MANAGEMENT ROLE vs RISK MANAGEMENT PROCESS — apply the decision chain:
Step 1 — SUBJECT TEST: Is the process/framework clearly the subject with no person detail? → RMP. STOP. If a person is the subject → this is only a signal. ALWAYS continue to Step 2.
Step 2 — PERSON-REMOVAL TEST: Delete all people/titles/qualifications. Still a coherent disclosure? YES → RMP. NO → MR. Borderline → Step 3.
Step 3 — QUALIFICATIONS TIEBREAKER: Does it mention years of experience, certifications, education, team size, career history? YES → MR. NO → RMP.
CRITICAL: A person being the grammatical subject does NOT automatically mean Management Role. Many SEC disclosures say "Our CISO oversees..." then describe the program. Apply Step 2.
Examples:
• "Our CISO has 20 years of experience and holds CISSP certification. She reports to the CIO." → MR (remove people → nothing left; has qualifications)
• "Our cybersecurity program includes risk assessment and monitoring, overseen by our CISO." → RMP (remove CISO → program description stands alone)
• "Our CISO oversees the Company's cybersecurity program, which includes risk assessments, vulnerability scanning, and incident response planning." → RMP (person is subject BUT remove CISO → "the Company's cybersecurity program includes..." still works. Content is about the program.)`,
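
// Presumably looked up with a key built as [catA, catB].sort().join("|") by
// the judge-prompt builder (the lookup code sits outside this hunk); e.g. a
// dispute between "Risk Management Process" and "Management Role" resolves to
// the key "Management Role|Risk Management Process".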

  "Risk Management Process|Third-Party Risk": `RISK MANAGEMENT PROCESS vs THIRD-PARTY RISK — ask: is vendor/supplier oversight the CENTRAL topic?
• "We use third-party consultants for penetration testing" = RMP (third parties support an internal process).
• "We maintain a vendor oversight program with due diligence and monitoring of third-party controls" = Third-Party Risk (vendor oversight IS the topic).
• The paragraph must be PRIMARILY about managing vendor/supplier cyber risk to qualify as Third-Party Risk.`,

  "None/Other|Strategy Integration": `NONE/OTHER vs STRATEGY INTEGRATION — ask: is there substantive cybersecurity disclosure?
• None/Other = NO substantive disclosure at all: section headers, disclaimers, generic IT-dependence language ("our IT systems are important to operations"), forward-looking boilerplate, generic regulatory compliance language ("subject to various regulatory requirements... non-compliance could result in penalties").
• Strategy Integration = actual discussion of business/financial impact, cyber insurance, budget allocation, or materiality assessment.
• Generic regulatory risk language (acknowledging regulations exist, non-compliance would be bad) is None/Other — it makes no materiality assessment and describes no strategy. It only becomes Strategy Integration if it explicitly assesses whether regulatory risks have "materially affected" the business.
• If the paragraph only establishes that the company has IT systems and data without describing any program, process, or strategy → None/Other.`,

  "None/Other|Strategy Integration": `NONE/OTHER vs STRATEGY INTEGRATION — the materiality ASSESSMENT test:
The test is whether the company is MAKING A MATERIALITY CONCLUSION, not whether the word "material" appears.

IS a materiality assessment or SI marker → Strategy Integration:
• Backward-looking: "have not materially affected our business strategy, results of operations, or financial condition" (company reporting on actual impact)
• Forward-looking with SEC qualifier: "reasonably likely to materially affect" (Item 106(b)(2) language — the company is making a forward-looking assessment)
• Negative assertions: "we have not experienced any material cybersecurity incidents" (materiality conclusion about past events — SI even if at end of paragraph)
• Insurance, budget, investment discussion: "we expend considerable resources on cybersecurity", cyber insurance, cost allocation (strategic resource commitment)

Is NOT a materiality assessment → classify by primary purpose (usually N/O or RMP):
• Generic risk warning: "could have a material adverse effect on our business" — this is boilerplate risk factor language that appears in every 10-K. The word "could" indicates speculation, not an assessment. → N/O or RMP depending on surrounding content.
• "Material" as adjective: "managing material risks associated with cybersecurity" — "material" here means "significant," not a materiality assessment. → RMP.
• Consequence clause: SPECULATIVE materiality language ("could have a material adverse effect") at the END of a paragraph does not override primary purpose. BUT a factual negative assertion ("we have not experienced any material cybersecurity incidents") IS an assessment even at the end — it states a conclusion. If a paragraph contains BOTH speculative consequence language AND a factual negative assertion, the negative assertion triggers SI.
• Cross-references: "For a description of risks that may materially affect the Company, see Item 1A" → N/O (pointing elsewhere, not making an assessment).

KEY DISTINCTION: "Risks have not materially affected us" = SI (CONCLUSION). "Risks could have a material adverse effect" = N/O (SPECULATION). "Risks are reasonably likely to materially affect us" = SI (FORWARD-LOOKING CONCLUSION with SEC qualifier).`,

  "Board Governance|Management Role": `BOARD GOVERNANCE vs MANAGEMENT ROLE — ask: who is the grammatical subject?
• Board or board committee taking oversight actions (receiving briefings, reviewing risks) → Board Governance.
• Named executive with qualifications, experience, or reporting lines → Management Role.
• When both appear, the PRIMARY focus wins: board oversight with a brief management mention → Board Governance, and vice versa.`,

  "Board Governance|Management Role": `BOARD GOVERNANCE vs MANAGEMENT ROLE — the PURPOSE test:
Ask: what is the paragraph's COMMUNICATIVE PURPOSE?
• PURPOSE = describing the oversight/reporting STRUCTURE (who reports to whom, how the board exercises oversight, briefing cadence, committee responsibilities) → Board Governance. The board/committee's actions must be a SIGNIFICANT part of the paragraph (multiple sentences describing what the board/committee does, receives, or directs).
• PURPOSE = describing WHO a specific person IS (qualifications, credentials, experience, career history, team they lead) → Management Role.
• CRITICAL THRESHOLD: A one-sentence mention of a board/committee does NOT make a paragraph Board Governance. Test: if you removed the committee sentence, would the paragraph lose its main point? If NO → the committee mention is incidental; classify based on the remaining content.
• "Our management team oversees cybersecurity technologies and processes. Our Audit Committee also provides oversight." → NOT BG. The committee mention is a brief addendum. The paragraph is about what management does → MR or RMP.
• "The Audit Committee receives quarterly briefings from the CISO and conducts annual reviews of the cybersecurity program." → BG. The committee's oversight actions ARE the content.
• Governance-chain paragraphs where the board/committee spans multiple sentences ARE Board Governance. Single-sentence mentions are NOT enough.`,

  "Board Governance|Risk Management Process": `BOARD GOVERNANCE vs RISK MANAGEMENT PROCESS — ask: oversight or operations?
• Board/committee receiving reports, overseeing risk, setting policy → Board Governance.
@ -140,9 +176,10 @@ const CATEGORY_GUIDANCE: Record<string, string> = {
• "The board receives quarterly cybersecurity briefings" → Board Governance. "We conduct quarterly risk assessments; the board is informed" → RMP (process is primary content).`,

  "None/Other|Risk Management Process": `NONE/OTHER vs RISK MANAGEMENT PROCESS — ask: does the paragraph describe actual cybersecurity activities?
• Describing actual processes (monitoring, assessment, vulnerability management, training programs) → RMP.
• Describing actual processes, measures, or controls the company has implemented → RMP. Key signals: "we have implemented," "we use," "we maintain," "we have taken steps to," "our program includes," "we engage." Even if surrounded by risk-factor framing, ACTUAL MEASURES = RMP.
• Only stating the company has IT systems, collects data, or faces cyber risks — without describing what it DOES about them → None/Other.
• Only stating the company has IT systems, faces cyber risks, or enumerating threat types — without describing what it DOES about them → None/Other.
• Generic regulatory compliance language ("subject to various regulations... non-compliance could result in penalties") is None/Other — it describes no actual compliance activities. If a specific regulation is named (GDPR, HIPAA, PCI DSS) but no company-specific program is described → RMP at Specificity 2 (named standard).`,
• Generic regulatory compliance language ("subject to various regulations... non-compliance could result in penalties") is None/Other — it describes no actual compliance activities. If a specific regulation is named (GDPR, HIPAA, PCI DSS) but no company-specific program is described → RMP at Specificity 2 (named standard).
• A paragraph that BOTH enumerates threats AND describes measures taken is RMP — the measures are the substantive content.`,

  "Risk Management Process|Strategy Integration": `RISK MANAGEMENT PROCESS vs STRATEGY INTEGRATION — ask: operational or strategic?
• Describing HOW risks are assessed, monitored, mitigated → Risk Management Process.
@ -176,6 +213,81 @@ Do NOT count toward QV:
✗ Generic degrees without named university
Need 2+ QV-eligible facts. One fact = stays at Firm-Specific.`;

/**
 * Build a re-evaluation prompt for paragraphs flagged for codebook-correction review.
 * Used when unanimous Stage 1 labels may be wrong due to prompt version drift
 * (e.g., v2.5 lacked the materiality→SI rule, so N/O labels on paragraphs with
 * materiality language need re-evaluation under v3.5 rules).
 *
 * Unlike the judge prompt, this does NOT show prior annotations (to avoid anchoring
 * to the potentially-wrong unanimous label). Instead, it provides the specific
 * codebook rule that triggered the re-evaluation and asks for a fresh classification.
 */
export function buildReEvalPrompt(
  paragraph: Paragraph,
  reason: "materiality_language" | "spac" | "other",
): string {
  const { filing, text } = paragraph;

  let ruleBlock: string;
  if (reason === "materiality_language") {
    ruleBlock = `═══ RULE UNDER REVIEW ═══

This paragraph was previously labeled None/Other. It has been flagged for re-evaluation because it contains materiality-related language.

CODEBOOK RULE 6 (v3.5): Materiality ASSESSMENTS are Strategy Integration. An assessment is the company STATING A CONCLUSION about materiality:
• Backward-looking: "have not materially affected our business strategy, results of operations, or financial condition" → SI
• Forward-looking with SEC qualifier: "reasonably likely to materially affect" → SI
• Negative assertion: "we have not experienced any material cybersecurity incidents" → SI

The following are NOT materiality assessments and do NOT trigger SI:
• Generic risk warning: "could have a material adverse effect on our business" → NOT SI (boilerplate speculation, not a conclusion)
• "Material" as adjective: "managing material risks" → NOT SI ("material" means "significant" here)
• Consequence clause: SPECULATIVE materiality language ("could have a material adverse effect") at the END of a paragraph does not override primary purpose. BUT a factual negative assertion ("we have not experienced any material cybersecurity incidents") IS an assessment even at the end.
• Cross-references: "For risks that may materially affect us, see Item 1A" → N/O

KEY DISTINCTION: "Risks have not materially affected us" = SI (conclusion). "Risks could have a material adverse effect" = N/O (speculation). "Reasonably likely to materially affect" = SI (SEC-qualified forward-looking assessment).`;
  } else if (reason === "spac") {
    ruleBlock = `═══ RULE UNDER REVIEW ═══

This paragraph was flagged for re-evaluation because it may be from a SPAC or shell company.

CODEBOOK RULE (v3.0+): SPACs and shell companies explicitly stating they have no operations, no cybersecurity program, or no formal processes → None/Other, regardless of incidental mentions of board oversight or risk acknowledgment. The absence of a program is not a description of a program.`;
  } else {
    ruleBlock = `═══ RE-EVALUATION ═══

This paragraph has been flagged for fresh classification under codebook v3.5 rules. Apply all current rules without anchoring to any prior label.`;
  }

  return `═══ RE-EVALUATION TASK ═══

You are re-classifying this paragraph under updated codebook rules (v3.5). Classify it fresh — do not assume any prior label is correct.

${ruleBlock}

═══ ANALYSIS STEPS ═══

1. Read the paragraph carefully.
2. Apply the specific rule described above to determine if it changes the classification.
3. Apply all standard codebook rules for both category and specificity.
4. Provide your classification with reasoning.

═══ CONFIDENCE CALIBRATION ═══

HIGH = the rule clearly applies (or clearly doesn't) — the answer is unambiguous
MEDIUM = the rule is relevant but the paragraph is borderline
LOW = genuinely ambiguous even with the updated rule

═══ PARAGRAPH ═══

Company: ${filing.companyName} (${filing.ticker})
Filing type: ${filing.filingType}
Filing date: ${filing.filingDate}
Section: ${filing.secItem}

${text}`;
}
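
// Hypothetical usage — building the re-eval prompt for a flagged paragraph:
//   const prompt = buildReEvalPrompt(paragraph, "materiality_language");
// Only the RULE UNDER REVIEW block varies by reason; the analysis steps,
// confidence calibration, and paragraph framing are shared across all three.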

/**
 * Build the Stage 2 judge prompt with disagreement-aware disambiguation.
 * Dynamically includes only the guidance relevant to the specific dispute.