SEC-cyBERT/docs/archive/v1/T5-ANALYSIS.md
2026-04-05 21:00:40 -04:00

244 lines
11 KiB
Markdown
Raw Blame History

This file contains ambiguous Unicode characters

This file contains Unicode characters that might be confused with other characters. If you think that this is intentional, you can safely ignore this warning. Use the Escape button to reveal them.

# T5 Plurality Analysis & Model Disagreement Deep-Dive
**Date:** 2026-04-02
**Author:** Claude (analysis), Joey (direction)
## Methodology
### Data Sources
| Source | File | Records |
|--------|------|---------|
| Gold adjudication | `data/gold/gold-adjudicated.jsonl` | 1,200 (92 T5) |
| Human labels | `data/gold/human-labels-raw.jsonl` | 3,600 (3 per paragraph) |
| Holdout paragraphs | `data/gold/paragraphs-holdout.jsonl` | 1,200 |
| Opus v3.0 | `data/annotations/golden/opus.jsonl` | 1,200 |
| GPT-5.4 v3.0 | `data/annotations/bench-holdout/gpt-5.4.jsonl` | 1,200 |
| Gemini v3.0 | `data/annotations/bench-holdout/gemini-3.1-pro-preview.jsonl` | 1,200 |
| GLM-5 v3.0 | `data/annotations/bench-holdout/glm-5:exacto.jsonl` | 1,200 |
| Kimi v3.0 | `data/annotations/bench-holdout/kimi-k2.5.jsonl` | 1,200 |
| MIMO v3.0 | `data/annotations/bench-holdout/mimo-v2-pro:exacto.jsonl` | 1,200 |
| v3.5 re-runs | `data/annotations/{golden,bench-holdout}-v35/*.jsonl` | 7 × 359 |
### Analysis 1: T5 Case Decomposition
All 92 T5-plurality cases extracted and categorized by:
- **Confusion axis**: which categories are competing (e.g., MR↔RMP, BG↔MR)
- **Vote distribution**: human votes (3 per paragraph) and model votes (6 per paragraph)
- **Plurality strength**: how many of 9 signals support the winning label
- **Human-model alignment**: whether human and model majorities agree (spoiler: 0/92)
### Analysis 2: Model Disagreement Patterns (Full 1,200)
For all 1,200 holdout paragraphs:
1. Built 6-model vote vectors
2. Categorized by agreement level (6/6, 5/1, 4/2, 3/3)
3. For splits, identified which model(s) dissented
4. Computed per-model dissent rates (how often each model is the odd one out)
5. Mapped dissent to confusion axes
### Analysis 3: Model Reasoning Examination
For T5 cases, read the `reasoning` field from Opus, GPT-5.4, and Gemini annotations to understand:
- What textual features each model cites when classifying
- Whether models apply codebook decision tests (person-removal, board-line) or keyword-anchor
- How v3.0 vs v3.5 reasoning differs for the same paragraphs
### Analysis 4: v3.0 vs v3.5 Prompt Impact
Compared model agreement on the 359 confusion-axis paragraphs between v3.0 and v3.5:
- Agreement distribution shifts
- Per-axis dissent changes
- Per-model improvement rates
- Category distribution shifts (over/under-correction)
---
## Key Findings
### Finding 1: Gemini is a Systematic MR Outlier
| Metric | Gemini | Average of other 5 |
|--------|--------|---------------------|
| Dissent rate (5/1 + 4/2 splits) | 35.5% | ~20.6% |
| MR labels (v3.0, full 1200) | 302 | ~192 |
| MR↔RMP dissenter-votes | 69 (45% of axis total) | ~17 each |
| Accuracy vs adjudicated gold | 84.0% | ~86.5% |
**Mechanism**: Gemini's reasoning fields show keyword-anchoring on credentials (CISSP, CISM, years of experience) and named titles. When these appear, Gemini's reasoning literally states "which triggers the Management Role category" regardless of surrounding content. It does not consistently apply the person-removal test.
**Comparison to MiniMax exclusion**: MiniMax was excluded at z=-2.07 (statistical outlier on overall accuracy). Gemini's MR frequency is z≈+2.3 vs other models. Its overall accuracy (84.0%) is the lowest of the top 6. On the MR↔RMP axis specifically, gold labels resolve to RMP 14/20 times when MR↔RMP is the dispute — Gemini's MR bias is systematically wrong.
### Finding 2: v3.5 Prompt Created BG↔RMP Over-Correction
| Metric | v3.0 (359 subset) | v3.5 (359 subset) |
|--------|--------------------|--------------------|
| 6/6 unanimity | 25% | 60% |
| MR↔RMP dissent-votes | 146 | 54 (-63%) |
| N/O↔SI dissent-votes | 39 | 4 (-90%) |
| **BG↔RMP dissent-votes** | **21** | **57 (+171%)** |
v3.5's "board-line test" caused GPT (and sometimes Opus) to classify paragraphs as BG whenever any reporting-to-board language exists, even when 80%+ of the paragraph describes process activities. MIMO is the primary driver of the new BG↔RMP confusion under v3.5 (20 dissenter-votes).
### Finding 3: All Model Splits Reduce to Subject-vs-Predicate
Every confusion axis is the same underlying question:
| Axis | Subject framing | Predicate framing |
|------|----------------|-------------------|
| MR↔RMP | Who does it (CISO, team) | What they do (monitor, detect) |
| BG↔RMP | Oversight structure (committee) | Activities described (risk assessment) |
| BG↔MR | Governance body (board committee) | Personnel details (qualifications) |
| ID↔SI | Event described (breach, attack) | Assessment made (no material impact) |
Models disagree on whether to classify by the grammatical subject or the semantic predicate of the paragraph.
### Finding 4: T5 Cases Are 100% Human-Model Misalignment
92/92 T5 cases have human majority ≠ model majority. This is not coincidental — T5 is literally the tier where the two signal groups disagree and no higher tier resolves it.
- 75% resolved by weak plurality (4-5/9 votes)
- 71% involve the BG↔MR↔RMP triangle
- BG↔MR↔RMP gold distribution: BG 25, RMP 28, MR 12
### Finding 5: Model Reasoning Reveals Specific Anchor Points
| Model | Consistent anchors | Axis effect |
|-------|-------------------|-------------|
| Gemini | Credentials, titles, committee names | Over-calls MR |
| GPT-5.4 (v3.5) | Board mentions, oversight language | Over-calls BG |
| Opus | Process descriptions, decision tests | Most balanced |
| GLM-5 | Generic risk language | Over-calls N/O on SI boundary |
| Kimi | Third-party mentions | Over-splits TP from RMP |
| MIMO | Committee structure | Over-calls BG under v3.5 |
---
## Proposed Interventions
### Intervention 1: Exclude Gemini from MR↔RMP Adjudication
**Justification**: Same evidence-based logic as MiniMax exclusion. Gemini's MR bias is systematic (z≈+2.3), its mechanism is documented (credential-anchoring), and gold labels confirm it's wrong 70% of the time on this axis.
**Scope**: Only when the T5 dispute is MR vs RMP and Gemini voted MR. Gemini remains in the panel for all other axes.
### Intervention 2: Board-Removal Test
**Rule**: For BG↔RMP disputes, mentally remove the 1-2 sentences mentioning the board. If what remains is a coherent process paragraph → RMP. If the paragraph is primarily *about* board oversight → BG.
**Rationale**: Dual of the person-removal test. Operationalizes existing BG threshold rule.
### Intervention 3: Committee-Level Test
**Rule**: A board committee (committee *of* the Board, board subcommittee) → BG. A management committee (*reports to* board but composed of management) → apply person-removal test.
### Intervention 4: ID↔SI Tiebreaker
**Rule**: "Describes what happened" → ID. "Only discusses cost/materiality" → SI. "Both" → whichever dominates by volume.
### Intervention 5: Specificity Hybrid
**Rule**: Human 3/3 unanimous → human label. Human split → model majority.
---
## Experimental Design
Each intervention tested independently, one variable at a time. Acceptance criteria:
1. T5 count decreases or stays constant (fewer arbitrary resolutions)
2. Source accuracy: no model/human drops >1% (intervention isn't distorting)
3. Category distribution: no category shifts >±5% of baseline count
4. Each change has documented codebook justification
Experiment harness: `scripts/adjudicate-gold-experiment.py`
---
## Experiment Results
### Exp 1: Exclude Gemini from MR↔RMP axis — NULL RESULT
Gemini over-labels MR (z≈+2.3, 302 labels vs ~192 average). Hypothesis: removing Gemini's MR vote at T5 plurality would flip MR→RMP for disputed cases.
**Result:** Zero label changes. Gemini's MR bias is redundant with human MR bias at T5. When both humans AND Gemini vote MR, removing Gemini doesn't change the plurality because human votes still carry MR. The tiering system already neutralizes Gemini's outlier at T4 (where all 6 models unanimously override humans).
**Conclusion:** Gemini exclusion is unnecessary. The tiering system is already doing this work.
### Exp 2b: No-board BG vote removal — PASS (strongest intervention)
Automated, verifiable test: if "board" (case-insensitive) is absent from the paragraph text, remove BG model votes before T5 plurality. Rationale: a paragraph can't be Board Governance if it never mentions the board.
| Metric | Baseline | Exp 2b | Δ |
|--------|----------|--------|---|
| T5 count | 92 | 92 | 0 |
| Gold ≠ human | 151 | 145 | -6 |
| BG labels | 244 | 231 | -13 |
| Xander accuracy | 91.0% | 91.5% | +0.5% |
| GPT-5.4 accuracy | 87.4% | 88.1% | +0.7% |
| GLM-5 accuracy | 86.0% | 86.8% | +0.8% |
13 labels changed (all BG → other). Source accuracy UP for 10/12 sources.
### Exp 2: Manual board-removal + committee-level test — PASS
For 5 paragraphs that mention "board" but where the board reference is incidental:
- 22da6695: BG→RMP (board = 1/5 sentences, CISO/incident response dominates)
- a2ff7e1e: BG→MR (titled "Management's Role," board is notification destination)
- cb518f47: BG→MR (management oversees, board is incident notification only)
| Metric | Baseline | Exp 2 | Δ |
|--------|----------|-------|---|
| T5 count | 92 | 89 | -3 |
| Source accuracy | all ≥ baseline | all UP or neutral | +0.1-0.2% |
### Exp 4: Codebook tiebreaker overrides — PASS
4 T5 cases resolved by applying codebook rules:
- 0ceeb618: ID→SI (negative assertion with brief incident context)
- cc82eb9f: ID→SI (negative assertion dominates; incident is example)
- 203ccd43: MR→N/O (SPAC rule: "once the Company commences operations")
- f549fd64: ID→RMP (post-incident improvements, no incident described)
| Metric | Baseline | Exp 4 | Δ |
|--------|----------|-------|---|
| T5 count | 92 | 88 | -4 |
| Opus accuracy | 88.6% | 88.8% | +0.2% |
| GPT-5.4 accuracy | 87.4% | 87.8% | +0.3% |
### Exp 5: Specificity hybrid — PASS
Human 3/3 unanimous → human label. Human split → model majority. 195 specificity labels changed. Zero impact on category distribution (as expected).
### Combined: All validated interventions — APPLIED
| Metric | Baseline | Combined | Δ |
|--------|----------|----------|---|
| T5 count | 92 | 85 | **-7** |
| Gold ≠ human | 151 | 144 | **-7** |
| T3 rule-based | 30 | 37 | +7 |
| Xander accuracy | 91.0% | 91.5% | **+0.5%** |
| Opus accuracy | 88.6% | 89.1% | **+0.5%** |
| GPT-5.4 accuracy | 87.4% | 88.5% | **+1.1%** |
| Elisabeth | 85.8% | 86.5% | +0.7% |
| Meghan | 85.3% | 86.0% | +0.7% |
| Specificity changes | 0 | 195 | — |
20 category labels changed. 195 specificity labels changed. Source accuracy improved for 10/12 sources.
**Borderline criteria:** BG category shift = -6.6% (threshold 5%), but justified by 11/13 paragraphs literally not mentioning "board." Aaryan accuracy = -1.0% (threshold <1%), but Aaryan is the weakest annotator already aligned with wrong BG labels.
---
## Remaining T5 Cases (85)
| Axis | Count | Notes |
|------|-------|-------|
| BGMRRMP (3-way) | 31 | Irreducible: SEC Item 1C naturally blends governance/management/process |
| MRRMP (pure) | 20 | Person-removal test applicable but not automatable |
| BGMR | 6 | Board committees vs management committees |
| BGRMP | 5 | Governance structure vs process content |
| IDSI | 4 | Borderline incident/assessment paragraphs |
| Other | 19 | Various minor axes |
The 85 remaining T5 cases represent 7.1% of the holdout set. Most are on the BGMRRMP triangle, which reflects genuine structural ambiguity in SEC Item 1C disclosures (companies describe governance, management roles, and risk processes in interleaved paragraphs). This is a methodological finding worth documenting in the paper.