# v3.5 Prompt Iteration Log

## Status: Locked at v3.5f; SI↔N/O investigation complete (resolved below)

## Final v3.5f Re-Run Results (7 models × 359 confusion-axis holdout paragraphs)

### Per-Model Accuracy vs Human Majority (358 paragraphs common to both runs)

| Model | v3.0 acc | v3.5f acc | Δ | Notes |
|-------|----------|-----------|---|-------|
| Opus | ~63% | 63.4% | ~0 | most stable |
| Gemini Pro | ~59% | ~62% | +3 | |
| Kimi K2.5 | ~55% | ~62% | +7.0 | |
| GLM-5 | ~55% | ~62% | +6.7 | |
| MIMO Pro | ~57% | ~60% | +3 | |
| GPT-5.4 | ~62% | ~60% | -1.7 | |
| MiniMax | ~50% | ~57% | +7 | outlier, excluded from gold scoring |

### Per-Axis Accuracy (6-model majority, excluding MiniMax)

| Axis | Paragraphs | v3.0 acc | v3.5f acc | Δ |
|------|-----------|----------|-----------|---|
| BG↔MR | 104 | ~45% | ~67% | **+22.1** |
| BG↔RMP | 59 | ~40% | ~65% | **+25.4** |
| MR↔RMP | 191 | ~58% | ~56% | -2.1 |
| SI↔N/O | 83 | ~66% | ~60% | **-6.0** |

### Model Convergence

- All 7 models, mean pairwise agreement: 61.7% → 79.1% (+17.3pp)
- Top 6 (excluding MiniMax): 63.1% → 80.9% (+17.8pp)

### Cost

| Model | v3.5f cost |
|-------|-----------|
| GPT-5.4 | $2.14 |
| Gemini Pro | $5.35 |
| GLM-5 | $3.06 |
| Kimi K2.5 | $2.80 |
| MIMO Pro | $2.21 |
| MiniMax | $0.54 |
| Opus | $0 (subscription) |
| **Total** | **$16.10** |

---

## The SI↔N/O Paradox: RESOLVED

### The original problem

We started this exercise because of a 23:0 SI↔N/O asymmetry (humans say SI, GenAI says N/O, never the reverse). The v3.5 iteration made it worse (25:2 in v3.5f vs 20:1 in v3.0).

### Investigation (post-v3.5f)

Paragraph-by-paragraph analysis of all 27 SI↔N/O errors revealed that **the models are correct, not the humans.**

**Of the 25 Human=SI / Model=N/O cases:**

- **~20 cases: Models correct.** These are "could have a material adverse effect" boilerplate speculation, cross-references to Item 1A, or generic threat enumeration; none contains an actual materiality assessment. All models unanimously call N/O.
- **~2 cases: Genuinely ambiguous.** One SPAC with materiality language, one past-disruption mention without explicit materiality language.
- **~2 cases: Edge cases.** Negative assertions embedded at the end of BG/risk paragraphs (debatable whether the assertion or the surrounding content dominates).
- **~1 case: Wrong axis entirely.** Should be RMP (describes a resource commitment), not SI or N/O.

**Of the 2 Human=N/O / Model=SI cases:**

- **Both: Models correct.** Both contain clear negative assertions ("not aware of having experienced any prior material data breaches", "did not experience any cybersecurity incident during 2024"), which is textbook SI per the codebook. All 6 models unanimously call SI.

**Root cause of human error:** Human annotators systematically treat ANY mention of "material," "business strategy," "results of operations," or "financial condition" as SI, even when the surrounding language is purely speculative ("could," "if," "may"). The codebook's assessment-vs-speculation distinction (v3.5 Rule 6) is correct, but humans weren't applying it consistently.
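The Rule 6 distinction is mechanical enough to capture with the deterministic regex referenced later (gold adjudication rule 4 and the Stage 1 corrections). Below is a minimal Python sketch; the patterns and the `rule6_signal` helper are illustrative assumptions, not the project's actual expressions:

```python
import re

# Illustrative patterns only; the project's actual deterministic regex,
# used for gold rule 4 and the Stage 1 corrections, may differ.
SPECULATIVE = re.compile(
    r"\b(?:could|may|might|if)\b[^.]*\bmaterial(?:ly)?\b[^.]*"
    r"\b(?:adverse|affect|impact)\w*",
    re.IGNORECASE,
)
# Backward-looking assessments and negative assertions, e.g. "have not
# experienced any material incidents", "did not experience any incident".
ASSESSMENT = re.compile(
    r"\b(?:have\s+not|has\s+not|did\s+not|not\s+aware\s+of)\b[^.]*"
    r"\b(?:experienc|incident|breach)\w*",
    re.IGNORECASE,
)
# The SEC qualifier from Item 1C: "reasonably likely to materially affect".
SEC_QUALIFIED = re.compile(
    r"reasonably\s+likely\s+to\s+materially\s+affect",
    re.IGNORECASE,
)


def rule6_signal(paragraph: str) -> str | None:
    """Apply the Rule 6 distinction: assessments are SI, speculation is not."""
    # Check assertions first: "have not experienced any material incidents"
    # mentions "material" but is an assessment, not boilerplate.
    if ASSESSMENT.search(paragraph) or SEC_QUALIFIED.search(paragraph):
        return "SI"
    if SPECULATIVE.search(paragraph):
        return "N/O"
    return None  # Rule 6 is silent; fall through to the full codebook
```

The ordering is the point: assertion patterns are checked before the speculation pattern so that a negative assertion containing the word "material" is not swallowed by the boilerplate test.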
### Codebook Case 9 contradiction: FIXED

The investigation discovered that **Codebook Case 9 directly contradicted Rule 6:**

- Case 9 said: "could potentially have a material impact on our business strategy" → SI
- Rule 6 said: "could have a material adverse effect" → NOT SI (speculation)

Case 9 has been updated: the "could potentially" example is now correctly labeled N/O, with an explanation of why "reasonably likely to materially affect" (the SEC qualifier) ≠ "could potentially have a material impact" (speculation).

### Prompt clarifications applied (within v3.5, no version bump)

Two minor clarifications were added to the locked prompt (net effect on GPT-5.4: within stochastic noise):

1. **Consequence clause refinement:** Speculative materiality language at the end of a paragraph is ignored, but a factual negative assertion ("have not experienced any material incidents") is SI even at the end of a paragraph.
2. **Investment/resource SI signal:** "expend considerable resources on cybersecurity" is a strategic resource commitment (an SI marker), not speculation.

### What this means for gold adjudication

**The "paradox" is resolved: there is no systematic model error on SI↔N/O.** The 25:2 asymmetry reflects human over-calling of SI, not model under-calling.

**Gold adjudication strategy for SI↔N/O** (sketched in code after this list):

1. When all 6 models unanimously say N/O and the paragraph contains only "could/if/may" speculation → **gold = N/O** (models correct, humans wrong)
2. When all 6 models unanimously say SI and the paragraph contains a negative assertion → **gold = SI** (models correct, humans wrong)
3. For the ~3-5 genuinely ambiguous cases → expert review
4. Backward-looking assessments ("have not materially affected") and SEC-qualified forward-looking language ("reasonably likely to materially affect") → **always SI** via deterministic regex (the Rule 6 sketch above), regardless of model or human vote

**Expected impact:** Flipping ~22 of the 27 SI↔N/O errors from human-majority to model-consensus would raise SI↔N/O accuracy from ~60% to ~95%+ (measured against corrected gold labels).
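A minimal sketch of how rules 1-4 compose per disagreement, assuming a hypothetical `adjudicate_si_no` helper and the `rule6_signal` verdict from the earlier sketch; the actual adjudication pipeline may differ:

```python
from collections import Counter


def adjudicate_si_no(models: dict[str, str], signal: str | None) -> str:
    """Resolve one Human=SI/Model=N/O (or reverse) disagreement into gold.

    `models` maps model name -> label for the six scored models (MiniMax
    excluded); `signal` is the rule6_signal() verdict from the earlier
    sketch ("SI", "N/O", or None).
    """
    consensus, n = Counter(models.values()).most_common(1)[0]
    unanimous = n == len(models)

    if signal == "SI":
        # Rule 4: a backward-looking assessment or SEC qualifier is always
        # SI, regardless of how the models or the human voted.
        return "SI"
    if unanimous and consensus == "N/O" and signal == "N/O":
        # Rule 1: nothing but "could/if/may" speculation; models correct.
        return "N/O"
    if unanimous and consensus == "SI":
        # Rule 2: all six models read a negative assertion the regex missed.
        return "SI"
    # Rule 3: the ~3-5 genuinely ambiguous cases go to a human expert.
    return "EXPERT_REVIEW"
```

Rule 4 is checked first because the deterministic patterns override every vote; rule 2 then only has to cover negative assertions that the regex patterns miss.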
### What this means for Stage 1 training data

The 180 materiality-flagged paragraphs should still be corrected via deterministic regex for backward-looking assessments and SEC qualifiers. The 128 SPAC paragraphs should still be corrected via the Stage 2 judge. The prompt is NOT the bottleneck: the corrections target v2.5→v3.5 codebook drift, not prompt failure.

---

## Iteration History (6 rounds, $1.02 on 26 regression paragraphs)

| Round | Prompt | Score | Key change |
|-------|--------|-------|-----------|
| 1 | v3.5a | 5/26 | Initial rulings; catastrophic over-correction |
| 2 | v3.5b | 13/25 | Purpose test for BG, Step 1 non-decisive for MR, cross-ref exception |
| 3 | v3.5c | 20/26 | Cross-reference materiality exception |
| 4 | v3.5d | 22/26 | SI tightened: assessment-vs-speculation distinction |
| 5 | v3.5e | 19/25 | BG/RMP example added; REGRESSED, reverted |
| 6 | v3.5f | 21/26 | Reverted R5, kept R4 SI fix + N/O↔RMP measures fix |

### Stable fixes (consistently correct across R4-R6)

- 5 SI cross-reference over-predictions eliminated
- 3-4 BG purpose-test corrections
- 3-4 MR Step 1 non-short-circuiting corrections

### Stable errors (4, genuinely ambiguous; human 2-1 splits)

- 2× BG over-call on process paragraphs with committee mentions
- 2× N/O over-call on borderline RMP paragraphs

### Root causes identified per error

1. **17f2cc:** Fragment/truncated paragraph; "committees" triggers BG but process verbs dominate
2. **8adfde:** 300-word risk paragraph with embedded security measures → N/O instead of RMP
3. **eca862:** CISO+ERMC monitoring methods → BG instead of RMP (ERMC woven throughout)
4. **fcc65c:** "Material risks" + threat enumeration → N/O instead of RMP (borderline)

---

## Stage 1 Impact Summary

| Metric | Original flag | Tightened flag |
|--------|--------------|----------------|
| Total flagged | 1,014 | 308 |
| Materiality | 886 | 180 |
| SPAC | 128 | 128 |
| Excluded (generic "could" boilerplate) | — | 706 |

The 706 excluded paragraphs contain generic "could have a material adverse effect" language that is correctly N/O under both v2.5 and v3.5. Only 180 contain actual backward-looking or SEC-qualified assessments.

**Recommendation:** Correct the 180 materiality paragraphs via deterministic regex (label them SI), not via model re-evaluation. Correct the 128 SPACs via the Stage 2 judge (a model is needed to determine the correct non-N/O label for paragraphs that shouldn't have been coded as substantive categories).

---

## Files Created/Modified

| File | Purpose |
|------|---------|
| `ts/src/label/prompts.ts` | v3.5f locked prompt (PROMPT_VERSION="v3.5") |
| `data/annotations/bench-holdout-v35/*.jsonl` | 7 models × 359 paragraphs, v3.5f |
| `data/annotations/golden-v35/opus.jsonl` | Opus v3.5f on 359 paragraphs |
| `data/annotations/bench-holdout-v35b/gpt-5.4.jsonl` | Iteration test data (26 paragraphs, multiple rounds) |
| `data/annotations/stage1-corrections.jsonl` | 308 flagged paragraphs (tightened criteria) |
| `data/gold/holdout-rerun-v35.jsonl` | 359 confusion-axis paragraph IDs |
| `data/gold/holdout-rerun-v35b.jsonl` | 26 regression paragraph IDs |
| `data/gold/regression-pids.json` | Regression PIDs by axis |
| `scripts/compare-v30-v35.py` | v3.0 vs v3.5a comparison |
| `scripts/compare-v30-v35-final.py` | v3.0 vs v3.5f comparison |
| `scripts/examine-v35-errors.py` | Error analysis for iteration |
| `scripts/extract-regression-pids.py` | Identify regression paragraphs |
| `scripts/flag-stage1-corrections.py` | Flag Stage 1 corrections (tightened) |
| `scripts/identify-holdout-rerun.py` | Identify confusion-axis holdout paragraphs |
| `docs/LABELING-CODEBOOK.md` | v3.5 rulings + version history |
| `docs/NARRATIVE.md` | Phase 15 with full iteration detail |
| `docs/STATUS.md` | v3.5 section added |