SEC-cyBERT/docs/T5-ANALYSIS.md
2026-04-03 14:43:53 -04:00

11 KiB
Raw Blame History

T5 Plurality Analysis & Model Disagreement Deep-Dive

Date: 2026-04-02 Author: Claude (analysis), Joey (direction)

Methodology

Data Sources

Source File Records
Gold adjudication data/gold/gold-adjudicated.jsonl 1,200 (92 T5)
Human labels data/gold/human-labels-raw.jsonl 3,600 (3 per paragraph)
Holdout paragraphs data/gold/paragraphs-holdout.jsonl 1,200
Opus v3.0 data/annotations/golden/opus.jsonl 1,200
GPT-5.4 v3.0 data/annotations/bench-holdout/gpt-5.4.jsonl 1,200
Gemini v3.0 data/annotations/bench-holdout/gemini-3.1-pro-preview.jsonl 1,200
GLM-5 v3.0 data/annotations/bench-holdout/glm-5:exacto.jsonl 1,200
Kimi v3.0 data/annotations/bench-holdout/kimi-k2.5.jsonl 1,200
MIMO v3.0 data/annotations/bench-holdout/mimo-v2-pro:exacto.jsonl 1,200
v3.5 re-runs data/annotations/{golden,bench-holdout}-v35/*.jsonl 7 × 359

Analysis 1: T5 Case Decomposition

All 92 T5-plurality cases extracted and categorized by:

  • Confusion axis: which categories are competing (e.g., MR↔RMP, BG↔MR)
  • Vote distribution: human votes (3 per paragraph) and model votes (6 per paragraph)
  • Plurality strength: how many of 9 signals support the winning label
  • Human-model alignment: whether human and model majorities agree (spoiler: 0/92)

Analysis 2: Model Disagreement Patterns (Full 1,200)

For all 1,200 holdout paragraphs:

  1. Built 6-model vote vectors
  2. Categorized by agreement level (6/6, 5/1, 4/2, 3/3)
  3. For splits, identified which model(s) dissented
  4. Computed per-model dissent rates (how often each model is the odd one out)
  5. Mapped dissent to confusion axes

Analysis 3: Model Reasoning Examination

For T5 cases, read the reasoning field from Opus, GPT-5.4, and Gemini annotations to understand:

  • What textual features each model cites when classifying
  • Whether models apply codebook decision tests (person-removal, board-line) or keyword-anchor
  • How v3.0 vs v3.5 reasoning differs for the same paragraphs

Analysis 4: v3.0 vs v3.5 Prompt Impact

Compared model agreement on the 359 confusion-axis paragraphs between v3.0 and v3.5:

  • Agreement distribution shifts
  • Per-axis dissent changes
  • Per-model improvement rates
  • Category distribution shifts (over/under-correction)

Key Findings

Finding 1: Gemini is a Systematic MR Outlier

Metric Gemini Average of other 5
Dissent rate (5/1 + 4/2 splits) 35.5% ~20.6%
MR labels (v3.0, full 1200) 302 ~192
MR↔RMP dissenter-votes 69 (45% of axis total) ~17 each
Accuracy vs adjudicated gold 84.0% ~86.5%

Mechanism: Gemini's reasoning fields show keyword-anchoring on credentials (CISSP, CISM, years of experience) and named titles. When these appear, Gemini's reasoning literally states "which triggers the Management Role category" regardless of surrounding content. It does not consistently apply the person-removal test.

Comparison to MiniMax exclusion: MiniMax was excluded at z=-2.07 (statistical outlier on overall accuracy). Gemini's MR frequency is z≈+2.3 vs other models. Its overall accuracy (84.0%) is the lowest of the top 6. On the MR↔RMP axis specifically, gold labels resolve to RMP 14/20 times when MR↔RMP is the dispute — Gemini's MR bias is systematically wrong.

Finding 2: v3.5 Prompt Created BG↔RMP Over-Correction

Metric v3.0 (359 subset) v3.5 (359 subset)
6/6 unanimity 25% 60%
MR↔RMP dissent-votes 146 54 (-63%)
N/O↔SI dissent-votes 39 4 (-90%)
BG↔RMP dissent-votes 21 57 (+171%)

v3.5's "board-line test" caused GPT (and sometimes Opus) to classify paragraphs as BG whenever any reporting-to-board language exists, even when 80%+ of the paragraph describes process activities. MIMO is the primary driver of the new BG↔RMP confusion under v3.5 (20 dissenter-votes).

Finding 3: All Model Splits Reduce to Subject-vs-Predicate

Every confusion axis is the same underlying question:

Axis Subject framing Predicate framing
MR↔RMP Who does it (CISO, team) What they do (monitor, detect)
BG↔RMP Oversight structure (committee) Activities described (risk assessment)
BG↔MR Governance body (board committee) Personnel details (qualifications)
ID↔SI Event described (breach, attack) Assessment made (no material impact)

Models disagree on whether to classify by the grammatical subject or the semantic predicate of the paragraph.

Finding 4: T5 Cases Are 100% Human-Model Misalignment

92/92 T5 cases have human majority ≠ model majority. This is not coincidental — T5 is literally the tier where the two signal groups disagree and no higher tier resolves it.

  • 75% resolved by weak plurality (4-5/9 votes)
  • 71% involve the BG↔MR↔RMP triangle
  • BG↔MR↔RMP gold distribution: BG 25, RMP 28, MR 12

Finding 5: Model Reasoning Reveals Specific Anchor Points

Model Consistent anchors Axis effect
Gemini Credentials, titles, committee names Over-calls MR
GPT-5.4 (v3.5) Board mentions, oversight language Over-calls BG
Opus Process descriptions, decision tests Most balanced
GLM-5 Generic risk language Over-calls N/O on SI boundary
Kimi Third-party mentions Over-splits TP from RMP
MIMO Committee structure Over-calls BG under v3.5

Proposed Interventions

Intervention 1: Exclude Gemini from MR↔RMP Adjudication

Justification: Same evidence-based logic as MiniMax exclusion. Gemini's MR bias is systematic (z≈+2.3), its mechanism is documented (credential-anchoring), and gold labels confirm it's wrong 70% of the time on this axis.

Scope: Only when the T5 dispute is MR vs RMP and Gemini voted MR. Gemini remains in the panel for all other axes.

Intervention 2: Board-Removal Test

Rule: For BG↔RMP disputes, mentally remove the 1-2 sentences mentioning the board. If what remains is a coherent process paragraph → RMP. If the paragraph is primarily about board oversight → BG.

Rationale: Dual of the person-removal test. Operationalizes existing BG threshold rule.

Intervention 3: Committee-Level Test

Rule: A board committee (committee of the Board, board subcommittee) → BG. A management committee (reports to board but composed of management) → apply person-removal test.

Intervention 4: ID↔SI Tiebreaker

Rule: "Describes what happened" → ID. "Only discusses cost/materiality" → SI. "Both" → whichever dominates by volume.

Intervention 5: Specificity Hybrid

Rule: Human 3/3 unanimous → human label. Human split → model majority.


Experimental Design

Each intervention tested independently, one variable at a time. Acceptance criteria:

  1. T5 count decreases or stays constant (fewer arbitrary resolutions)
  2. Source accuracy: no model/human drops >1% (intervention isn't distorting)
  3. Category distribution: no category shifts >±5% of baseline count
  4. Each change has documented codebook justification

Experiment harness: scripts/adjudicate-gold-experiment.py


Experiment Results

Exp 1: Exclude Gemini from MR↔RMP axis — NULL RESULT

Gemini over-labels MR (z≈+2.3, 302 labels vs ~192 average). Hypothesis: removing Gemini's MR vote at T5 plurality would flip MR→RMP for disputed cases.

Result: Zero label changes. Gemini's MR bias is redundant with human MR bias at T5. When both humans AND Gemini vote MR, removing Gemini doesn't change the plurality because human votes still carry MR. The tiering system already neutralizes Gemini's outlier at T4 (where all 6 models unanimously override humans).

Conclusion: Gemini exclusion is unnecessary. The tiering system is already doing this work.

Exp 2b: No-board BG vote removal — PASS (strongest intervention)

Automated, verifiable test: if "board" (case-insensitive) is absent from the paragraph text, remove BG model votes before T5 plurality. Rationale: a paragraph can't be Board Governance if it never mentions the board.

Metric Baseline Exp 2b Δ
T5 count 92 92 0
Gold ≠ human 151 145 -6
BG labels 244 231 -13
Xander accuracy 91.0% 91.5% +0.5%
GPT-5.4 accuracy 87.4% 88.1% +0.7%
GLM-5 accuracy 86.0% 86.8% +0.8%

13 labels changed (all BG → other). Source accuracy UP for 10/12 sources.

Exp 2: Manual board-removal + committee-level test — PASS

For 5 paragraphs that mention "board" but where the board reference is incidental:

  • 22da6695: BG→RMP (board = 1/5 sentences, CISO/incident response dominates)
  • a2ff7e1e: BG→MR (titled "Management's Role," board is notification destination)
  • cb518f47: BG→MR (management oversees, board is incident notification only)
Metric Baseline Exp 2 Δ
T5 count 92 89 -3
Source accuracy all ≥ baseline all UP or neutral +0.1-0.2%

Exp 4: Codebook tiebreaker overrides — PASS

4 T5 cases resolved by applying codebook rules:

  • 0ceeb618: ID→SI (negative assertion with brief incident context)
  • cc82eb9f: ID→SI (negative assertion dominates; incident is example)
  • 203ccd43: MR→N/O (SPAC rule: "once the Company commences operations")
  • f549fd64: ID→RMP (post-incident improvements, no incident described)
Metric Baseline Exp 4 Δ
T5 count 92 88 -4
Opus accuracy 88.6% 88.8% +0.2%
GPT-5.4 accuracy 87.4% 87.8% +0.3%

Exp 5: Specificity hybrid — PASS

Human 3/3 unanimous → human label. Human split → model majority. 195 specificity labels changed. Zero impact on category distribution (as expected).

Combined: All validated interventions — APPLIED

Metric Baseline Combined Δ
T5 count 92 85 -7
Gold ≠ human 151 144 -7
T3 rule-based 30 37 +7
Xander accuracy 91.0% 91.5% +0.5%
Opus accuracy 88.6% 89.1% +0.5%
GPT-5.4 accuracy 87.4% 88.5% +1.1%
Elisabeth 85.8% 86.5% +0.7%
Meghan 85.3% 86.0% +0.7%
Specificity changes 0 195

20 category labels changed. 195 specificity labels changed. Source accuracy improved for 10/12 sources.

Borderline criteria: BG category shift = -6.6% (threshold 5%), but justified by 11/13 paragraphs literally not mentioning "board." Aaryan accuracy = -1.0% (threshold <1%), but Aaryan is the weakest annotator already aligned with wrong BG labels.


Remaining T5 Cases (85)

Axis Count Notes
BG↔MR↔RMP (3-way) 31 Irreducible: SEC Item 1C naturally blends governance/management/process
MR↔RMP (pure) 20 Person-removal test applicable but not automatable
BG↔MR 6 Board committees vs management committees
BG↔RMP 5 Governance structure vs process content
ID↔SI 4 Borderline incident/assessment paragraphs
Other 19 Various minor axes

The 85 remaining T5 cases represent 7.1% of the holdout set. Most are on the BG↔MR↔RMP triangle, which reflects genuine structural ambiguity in SEC Item 1C disclosures (companies describe governance, management roles, and risk processes in interleaved paragraphs). This is a methodological finding worth documenting in the paper.