joey/SEC-cyBERT

Fork 0

Joey Eamigh d653ed9a20

pivot point

2026-04-03 14:43:53 -04:00

11 KiB

Raw Blame History

T5 Plurality Analysis & Model Disagreement Deep-Dive

Date: 2026-04-02 Author: Claude (analysis), Joey (direction)

Methodology

Data Sources

Source	File	Records
Gold adjudication	`data/gold/gold-adjudicated.jsonl`	1,200 (92 T5)
Human labels	`data/gold/human-labels-raw.jsonl`	3,600 (3 per paragraph)
Holdout paragraphs	`data/gold/paragraphs-holdout.jsonl`	1,200
Opus v3.0	`data/annotations/golden/opus.jsonl`	1,200
GPT-5.4 v3.0	`data/annotations/bench-holdout/gpt-5.4.jsonl`	1,200
Gemini v3.0	`data/annotations/bench-holdout/gemini-3.1-pro-preview.jsonl`	1,200
GLM-5 v3.0	`data/annotations/bench-holdout/glm-5:exacto.jsonl`	1,200
Kimi v3.0	`data/annotations/bench-holdout/kimi-k2.5.jsonl`	1,200
MIMO v3.0	`data/annotations/bench-holdout/mimo-v2-pro:exacto.jsonl`	1,200
v3.5 re-runs	`data/annotations/{golden,bench-holdout}-v35/*.jsonl`	7 × 359

Analysis 1: T5 Case Decomposition

All 92 T5-plurality cases extracted and categorized by:

Confusion axis: which categories are competing (e.g., MR↔RMP, BG↔MR)
Vote distribution: human votes (3 per paragraph) and model votes (6 per paragraph)
Plurality strength: how many of 9 signals support the winning label
Human-model alignment: whether human and model majorities agree (spoiler: 0/92)

Analysis 2: Model Disagreement Patterns (Full 1,200)

For all 1,200 holdout paragraphs:

Built 6-model vote vectors
Categorized by agreement level (6/6, 5/1, 4/2, 3/3)
For splits, identified which model(s) dissented
Computed per-model dissent rates (how often each model is the odd one out)
Mapped dissent to confusion axes

Analysis 3: Model Reasoning Examination

For T5 cases, read the reasoning field from Opus, GPT-5.4, and Gemini annotations to understand:

What textual features each model cites when classifying
Whether models apply codebook decision tests (person-removal, board-line) or keyword-anchor
How v3.0 vs v3.5 reasoning differs for the same paragraphs

Analysis 4: v3.0 vs v3.5 Prompt Impact

Compared model agreement on the 359 confusion-axis paragraphs between v3.0 and v3.5:

Agreement distribution shifts
Per-axis dissent changes
Per-model improvement rates
Category distribution shifts (over/under-correction)

Key Findings

Finding 1: Gemini is a Systematic MR Outlier

Metric	Gemini	Average of other 5
Dissent rate (5/1 + 4/2 splits)	35.5%	~20.6%
MR labels (v3.0, full 1200)	302	~192
MR↔RMP dissenter-votes	69 (45% of axis total)	~17 each
Accuracy vs adjudicated gold	84.0%	~86.5%

Mechanism: Gemini's reasoning fields show keyword-anchoring on credentials (CISSP, CISM, years of experience) and named titles. When these appear, Gemini's reasoning literally states "which triggers the Management Role category" regardless of surrounding content. It does not consistently apply the person-removal test.

Comparison to MiniMax exclusion: MiniMax was excluded at z=-2.07 (statistical outlier on overall accuracy). Gemini's MR frequency is z≈+2.3 vs other models. Its overall accuracy (84.0%) is the lowest of the top 6. On the MR↔RMP axis specifically, gold labels resolve to RMP 14/20 times when MR↔RMP is the dispute — Gemini's MR bias is systematically wrong.

Finding 2: v3.5 Prompt Created BG↔RMP Over-Correction

Metric	v3.0 (359 subset)	v3.5 (359 subset)
6/6 unanimity	25%	60%
MR↔RMP dissent-votes	146	54 (-63%)
N/O↔SI dissent-votes	39	4 (-90%)
BG↔RMP dissent-votes	21	57 (+171%)

v3.5's "board-line test" caused GPT (and sometimes Opus) to classify paragraphs as BG whenever any reporting-to-board language exists, even when 80%+ of the paragraph describes process activities. MIMO is the primary driver of the new BG↔RMP confusion under v3.5 (20 dissenter-votes).

Finding 3: All Model Splits Reduce to Subject-vs-Predicate

Every confusion axis is the same underlying question:

Axis	Subject framing	Predicate framing
MR↔RMP	Who does it (CISO, team)	What they do (monitor, detect)
BG↔RMP	Oversight structure (committee)	Activities described (risk assessment)
BG↔MR	Governance body (board committee)	Personnel details (qualifications)
ID↔SI	Event described (breach, attack)	Assessment made (no material impact)

Models disagree on whether to classify by the grammatical subject or the semantic predicate of the paragraph.

Finding 4: T5 Cases Are 100% Human-Model Misalignment

92/92 T5 cases have human majority ≠ model majority. This is not coincidental — T5 is literally the tier where the two signal groups disagree and no higher tier resolves it.

75% resolved by weak plurality (4-5/9 votes)
71% involve the BG↔MR↔RMP triangle
BG↔MR↔RMP gold distribution: BG 25, RMP 28, MR 12

Finding 5: Model Reasoning Reveals Specific Anchor Points

Model	Consistent anchors	Axis effect
Gemini	Credentials, titles, committee names	Over-calls MR
GPT-5.4 (v3.5)	Board mentions, oversight language	Over-calls BG
Opus	Process descriptions, decision tests	Most balanced
GLM-5	Generic risk language	Over-calls N/O on SI boundary
Kimi	Third-party mentions	Over-splits TP from RMP
MIMO	Committee structure	Over-calls BG under v3.5

Proposed Interventions

Intervention 1: Exclude Gemini from MR↔RMP Adjudication

Justification: Same evidence-based logic as MiniMax exclusion. Gemini's MR bias is systematic (z≈+2.3), its mechanism is documented (credential-anchoring), and gold labels confirm it's wrong 70% of the time on this axis.

Scope: Only when the T5 dispute is MR vs RMP and Gemini voted MR. Gemini remains in the panel for all other axes.

Intervention 2: Board-Removal Test

Rule: For BG↔RMP disputes, mentally remove the 1-2 sentences mentioning the board. If what remains is a coherent process paragraph → RMP. If the paragraph is primarily about board oversight → BG.

Rationale: Dual of the person-removal test. Operationalizes existing BG threshold rule.

Intervention 3: Committee-Level Test

Rule: A board committee (committee of the Board, board subcommittee) → BG. A management committee (reports to board but composed of management) → apply person-removal test.

Intervention 4: ID↔SI Tiebreaker

Rule: "Describes what happened" → ID. "Only discusses cost/materiality" → SI. "Both" → whichever dominates by volume.

Intervention 5: Specificity Hybrid

Rule: Human 3/3 unanimous → human label. Human split → model majority.

Experimental Design

Each intervention tested independently, one variable at a time. Acceptance criteria:

T5 count decreases or stays constant (fewer arbitrary resolutions)
Source accuracy: no model/human drops >1% (intervention isn't distorting)
Category distribution: no category shifts >±5% of baseline count
Each change has documented codebook justification

Experiment harness: scripts/adjudicate-gold-experiment.py

Experiment Results

Exp 1: Exclude Gemini from MR↔RMP axis — NULL RESULT

Gemini over-labels MR (z≈+2.3, 302 labels vs ~192 average). Hypothesis: removing Gemini's MR vote at T5 plurality would flip MR→RMP for disputed cases.

Result: Zero label changes. Gemini's MR bias is redundant with human MR bias at T5. When both humans AND Gemini vote MR, removing Gemini doesn't change the plurality because human votes still carry MR. The tiering system already neutralizes Gemini's outlier at T4 (where all 6 models unanimously override humans).

Conclusion: Gemini exclusion is unnecessary. The tiering system is already doing this work.

Exp 2b: No-board BG vote removal — PASS (strongest intervention)

Automated, verifiable test: if "board" (case-insensitive) is absent from the paragraph text, remove BG model votes before T5 plurality. Rationale: a paragraph can't be Board Governance if it never mentions the board.

Metric	Baseline	Exp 2b	Δ
T5 count	92	92	0
Gold ≠ human	151	145	-6
BG labels	244	231	-13
Xander accuracy	91.0%	91.5%	+0.5%
GPT-5.4 accuracy	87.4%	88.1%	+0.7%
GLM-5 accuracy	86.0%	86.8%	+0.8%

13 labels changed (all BG → other). Source accuracy UP for 10/12 sources.

Exp 2: Manual board-removal + committee-level test — PASS

For 5 paragraphs that mention "board" but where the board reference is incidental:

22da6695: BG→RMP (board = 1/5 sentences, CISO/incident response dominates)
a2ff7e1e: BG→MR (titled "Management's Role," board is notification destination)
cb518f47: BG→MR (management oversees, board is incident notification only)

Metric	Baseline	Exp 2	Δ
T5 count	92	89	-3
Source accuracy	all ≥ baseline	all UP or neutral	+0.1-0.2%

Exp 4: Codebook tiebreaker overrides — PASS

4 T5 cases resolved by applying codebook rules:

0ceeb618: ID→SI (negative assertion with brief incident context)
cc82eb9f: ID→SI (negative assertion dominates; incident is example)
203ccd43: MR→N/O (SPAC rule: "once the Company commences operations")
f549fd64: ID→RMP (post-incident improvements, no incident described)

Metric	Baseline	Exp 4	Δ
T5 count	92	88	-4
Opus accuracy	88.6%	88.8%	+0.2%
GPT-5.4 accuracy	87.4%	87.8%	+0.3%

Exp 5: Specificity hybrid — PASS

Human 3/3 unanimous → human label. Human split → model majority. 195 specificity labels changed. Zero impact on category distribution (as expected).

Combined: All validated interventions — APPLIED

Metric	Baseline	Combined	Δ
T5 count	92	85	-7
Gold ≠ human	151	144	-7
T3 rule-based	30	37	+7
Xander accuracy	91.0%	91.5%	+0.5%
Opus accuracy	88.6%	89.1%	+0.5%
GPT-5.4 accuracy	87.4%	88.5%	+1.1%
Elisabeth	85.8%	86.5%	+0.7%
Meghan	85.3%	86.0%	+0.7%
Specificity changes	0	195	—

20 category labels changed. 195 specificity labels changed. Source accuracy improved for 10/12 sources.

Borderline criteria: BG category shift = -6.6% (threshold 5%), but justified by 11/13 paragraphs literally not mentioning "board." Aaryan accuracy = -1.0% (threshold <1%), but Aaryan is the weakest annotator already aligned with wrong BG labels.

Remaining T5 Cases (85)

Axis	Count	Notes
BG↔MR↔RMP (3-way)	31	Irreducible: SEC Item 1C naturally blends governance/management/process
MR↔RMP (pure)	20	Person-removal test applicable but not automatable
BG↔MR	6	Board committees vs management committees
BG↔RMP	5	Governance structure vs process content
ID↔SI	4	Borderline incident/assessment paragraphs
Other	19	Various minor axes

The 85 remaining T5 cases represent 7.1% of the holdout set. Most are on the BG↔MR↔RMP triangle, which reflects genuine structural ambiguity in SEC Item 1C disclosures (companies describe governance, management roles, and risk processes in interleaved paragraphs). This is a methodological finding worth documenting in the paper.

11 KiB Raw Blame History Unescape Escape