# v3.5 Prompt Iteration Log

## Status: Locked at v3.5f; SI↔N/O investigation complete (resolved below)

## Final v3.5f Re-Run Results (7 models × 359 confusion-axis holdout paragraphs)

### Per-Model Accuracy vs Human Majority (358 paragraphs common to both runs)

| Model | v3.0 acc | v3.5f acc | Δ | Notes |
|-------|----------|-----------|---|-------|
| Opus | ~63% | 63.4% | ~0 | most stable |
| Gemini Pro | ~59% | ~62% | +3 | |
| Kimi K2.5 | ~55% | ~62% | +7.0 | |
| GLM-5 | ~55% | ~62% | +6.7 | |
| MIMO Pro | ~57% | ~60% | +3 | |
| GPT-5.4 | ~62% | ~60% | -1.7 | |
| MiniMax | ~50% | ~57% | +7 | outlier, excluded from gold scoring |

### Per-Axis Accuracy (6-model majority, excluding MiniMax)

| Axis | Paragraphs | v3.0 acc | v3.5f acc | Δ |
|------|-----------|----------|-----------|---|
| BG↔MR | 104 | ~45% | ~67% | **+22.1** |
| BG↔RMP | 59 | ~40% | ~65% | **+25.4** |
| MR↔RMP | 191 | ~58% | ~56% | -2.1 |
| SI↔N/O | 83 | ~66% | ~60% | **-6.0** |

### Model Convergence

- All 7 models, mean pairwise agreement: 61.7% → 79.1% (+17.3pp)
- Top 6 (excluding MiniMax): 63.1% → 80.9% (+17.8pp)

### Cost

| Model | v3.5f cost |
|-------|-----------|
| GPT-5.4 | $2.14 |
| Gemini Pro | $5.35 |
| GLM-5 | $3.06 |
| Kimi K2.5 | $2.80 |
| MIMO Pro | $2.21 |
| MiniMax | $0.54 |
| Opus | $0 (subscription) |
| **Total** | **$16.10** |

---

## The SI↔N/O Paradox: RESOLVED

### The original problem

We started this exercise because of a 23:0 SI↔N/O asymmetry (humans say SI, GenAI says N/O, never the reverse). The v3.5 iteration made it worse (25:2 in v3.5f vs 20:1 in v3.0).

### Investigation (post-v3.5f)

Paragraph-by-paragraph analysis of all 27 SI↔N/O errors revealed that **the models are correct, not the humans.**

**Of the 25 Human=SI / Model=N/O cases:**

- **~20 cases: Models correct.** These are "could have a material adverse effect" boilerplate speculation, cross-references to Item 1A, or generic threat enumeration; none contains an actual materiality assessment. All models unanimously call N/O.
- **~2 cases: Genuinely ambiguous.** One SPAC with materiality language, one past-disruption mention without explicit materiality language.
- **~2 cases: Edge cases.** Negative assertions embedded at the end of BG/risk paragraphs (debatable whether the assertion or the surrounding content dominates).
- **~1 case: Wrong axis entirely.** Should be RMP (describes a resource commitment), not SI or N/O.

**Of the 2 Human=N/O / Model=SI cases:**

- **Both: Models correct.** Both contain clear negative assertions ("not aware of having experienced any prior material data breaches", "did not experience any cybersecurity incident during 2024"), which is textbook SI per the codebook. All 6 models unanimously call SI.

**Root cause of human error:** Human annotators systematically treat ANY mention of "material," "business strategy," "results of operations," or "financial condition" as SI, even when the surrounding language is purely speculative ("could," "if," "may"). The codebook's assessment-vs-speculation distinction (v3.5 Rule 6) is correct, but humans weren't applying it consistently.
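The Rule 6 distinction is mechanical enough to capture with the deterministic regex referenced later (gold adjudication rule 4 and the Stage 1 corrections). Below is a minimal Python sketch; the patterns and the `rule6_signal` helper are illustrative assumptions, not the project's actual expressions:

```python
import re

# Illustrative patterns only; the project's actual deterministic regex,
# used for gold rule 4 and the Stage 1 corrections, may differ.
SPECULATIVE = re.compile(
    r"\b(?:could|may|might|if)\b[^.]*\bmaterial(?:ly)?\b[^.]*"
    r"\b(?:adverse|affect|impact)\w*",
    re.IGNORECASE,
)
# Backward-looking assessments and negative assertions, e.g. "have not
# experienced any material incidents", "did not experience any incident".
ASSESSMENT = re.compile(
    r"\b(?:have\s+not|has\s+not|did\s+not|not\s+aware\s+of)\b[^.]*"
    r"\b(?:experienc|incident|breach)\w*",
    re.IGNORECASE,
)
# The SEC qualifier from Item 1C: "reasonably likely to materially affect".
SEC_QUALIFIED = re.compile(
    r"reasonably\s+likely\s+to\s+materially\s+affect",
    re.IGNORECASE,
)


def rule6_signal(paragraph: str) -> str | None:
    """Apply the Rule 6 distinction: assessments are SI, speculation is not."""
    # Check assertions first: "have not experienced any material incidents"
    # mentions "material" but is an assessment, not boilerplate.
    if ASSESSMENT.search(paragraph) or SEC_QUALIFIED.search(paragraph):
        return "SI"
    if SPECULATIVE.search(paragraph):
        return "N/O"
    return None  # Rule 6 is silent; fall through to the full codebook
```

The ordering is the point: assertion patterns are checked before the speculation pattern so that a negative assertion containing the word "material" is not swallowed by the boilerplate test.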
### Codebook Case 9 contradiction: FIXED

The investigation discovered that **Codebook Case 9 directly contradicted Rule 6:**

- Case 9 said: "could potentially have a material impact on our business strategy" → SI
- Rule 6 said: "could have a material adverse effect" → NOT SI (speculation)

Case 9 has been updated: the "could potentially" example is now correctly labeled N/O, with an explanation of why "reasonably likely to materially affect" (the SEC qualifier) ≠ "could potentially have a material impact" (speculation).

### Prompt clarifications applied (within v3.5, no version bump)

Two minor clarifications were added to the locked prompt (net effect on GPT-5.4: within stochastic noise):

1. **Consequence clause refinement:** Speculative materiality language at the end of a paragraph is ignored, but a factual negative assertion ("have not experienced any material incidents") is SI even at the end of a paragraph.
2. **Investment/resource SI signal:** "expend considerable resources on cybersecurity" is a strategic resource commitment (an SI marker), not speculation.

### What this means for gold adjudication

**The "paradox" is resolved: there is no systematic model error on SI↔N/O.** The 25:2 asymmetry reflects human over-calling of SI, not model under-calling.

**Gold adjudication strategy for SI↔N/O** (sketched in code after this list):

1. When all 6 models unanimously say N/O and the paragraph contains only "could/if/may" speculation → **gold = N/O** (models correct, humans wrong)
2. When all 6 models unanimously say SI and the paragraph contains a negative assertion → **gold = SI** (models correct, humans wrong)
3. For the ~3-5 genuinely ambiguous cases → expert review
4. Backward-looking assessments ("have not materially affected") and SEC-qualified forward-looking language ("reasonably likely to materially affect") → **always SI** via deterministic regex (the Rule 6 sketch above), regardless of model or human vote

**Expected impact:** Flipping ~22 of the 27 SI↔N/O errors from human-majority to model-consensus would raise SI↔N/O accuracy from ~60% to ~95%+ (measured against corrected gold labels).
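A minimal sketch of how rules 1-4 compose per disagreement, assuming a hypothetical `adjudicate_si_no` helper and the `rule6_signal` verdict from the earlier sketch; the actual adjudication pipeline may differ:

```python
from collections import Counter


def adjudicate_si_no(models: dict[str, str], signal: str | None) -> str:
    """Resolve one Human=SI/Model=N/O (or reverse) disagreement into gold.

    `models` maps model name -> label for the six scored models (MiniMax
    excluded); `signal` is the rule6_signal() verdict from the earlier
    sketch ("SI", "N/O", or None).
    """
    consensus, n = Counter(models.values()).most_common(1)[0]
    unanimous = n == len(models)

    if signal == "SI":
        # Rule 4: a backward-looking assessment or SEC qualifier is always
        # SI, regardless of how the models or the human voted.
        return "SI"
    if unanimous and consensus == "N/O" and signal == "N/O":
        # Rule 1: nothing but "could/if/may" speculation; models correct.
        return "N/O"
    if unanimous and consensus == "SI":
        # Rule 2: all six models read a negative assertion the regex missed.
        return "SI"
    # Rule 3: the ~3-5 genuinely ambiguous cases go to a human expert.
    return "EXPERT_REVIEW"
```

Rule 4 is checked first because the deterministic patterns override every vote; rule 2 then only has to cover negative assertions that the regex patterns miss.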
### What this means for Stage 1 training data

The 180 materiality-flagged paragraphs should still be corrected via deterministic regex for backward-looking assessments and SEC qualifiers. The 128 SPAC paragraphs should still be corrected via the Stage 2 judge. The prompt is NOT the bottleneck: the corrections target v2.5→v3.5 codebook drift, not prompt failure.

---

## Iteration History (6 rounds, $1.02 on 26 regression paragraphs)

| Round | Prompt | Score | Key change |
|-------|--------|-------|-----------|
| 1 | v3.5a | 5/26 | Initial rulings; catastrophic over-correction |
| 2 | v3.5b | 13/25 | Purpose test for BG, Step 1 non-decisive for MR, cross-ref exception |
| 3 | v3.5c | 20/26 | Cross-reference materiality exception |
| 4 | v3.5d | 22/26 | SI tightened: assessment-vs-speculation distinction |
| 5 | v3.5e | 19/25 | BG/RMP example added; REGRESSED, reverted |
| 6 | v3.5f | 21/26 | Reverted R5, kept R4 SI fix + N/O↔RMP measures fix |

### Stable fixes (consistently correct across R4-R6)

- 5 SI cross-reference over-predictions eliminated
- 3-4 BG purpose-test corrections
- 3-4 MR Step 1 non-short-circuiting corrections

### Stable errors (4, genuinely ambiguous; human 2-1 splits)

- 2× BG over-call on process paragraphs with committee mentions
- 2× N/O over-call on borderline RMP paragraphs

### Root causes identified per error

1. **17f2cc:** Fragment/truncated paragraph; "committees" triggers BG but process verbs dominate
2. **8adfde:** 300-word risk paragraph with embedded security measures → N/O instead of RMP
3. **eca862:** CISO+ERMC monitoring methods → BG instead of RMP (ERMC woven throughout)
4. **fcc65c:** "Material risks" + threat enumeration → N/O instead of RMP (borderline)

---

## Stage 1 Impact Summary

| Metric | Original flag | Tightened flag |
|--------|--------------|----------------|
| Total flagged | 1,014 | 308 |
| Materiality | 886 | 180 |
| SPAC | 128 | 128 |
| Excluded (generic "could" boilerplate) | — | 706 |

The 706 excluded paragraphs contain generic "could have a material adverse effect" language that is correctly N/O under both v2.5 and v3.5. Only 180 contain actual backward-looking or SEC-qualified assessments.

**Recommendation:** Correct the 180 materiality paragraphs via deterministic regex (label them SI), not via model re-evaluation. Correct the 128 SPACs via the Stage 2 judge (a model is needed to determine the correct non-N/O label for paragraphs that shouldn't have been coded as substantive categories).

---

## Files Created/Modified

| File | Purpose |
|------|---------|
| `ts/src/label/prompts.ts` | v3.5f locked prompt (PROMPT_VERSION="v3.5") |
| `data/annotations/bench-holdout-v35/*.jsonl` | 7 models × 359 paragraphs, v3.5f |
| `data/annotations/golden-v35/opus.jsonl` | Opus v3.5f on 359 paragraphs |
| `data/annotations/bench-holdout-v35b/gpt-5.4.jsonl` | Iteration test data (26 paragraphs, multiple rounds) |
| `data/annotations/stage1-corrections.jsonl` | 308 flagged paragraphs (tightened criteria) |
| `data/gold/holdout-rerun-v35.jsonl` | 359 confusion-axis paragraph IDs |
| `data/gold/holdout-rerun-v35b.jsonl` | 26 regression paragraph IDs |
| `data/gold/regression-pids.json` | Regression PIDs by axis |
| `scripts/compare-v30-v35.py` | v3.0 vs v3.5a comparison |
| `scripts/compare-v30-v35-final.py` | v3.0 vs v3.5f comparison |
| `scripts/examine-v35-errors.py` | Error analysis for iteration |
| `scripts/extract-regression-pids.py` | Identify regression paragraphs |
| `scripts/flag-stage1-corrections.py` | Flag Stage 1 corrections (tightened) |
| `scripts/identify-holdout-rerun.py` | Identify confusion-axis holdout paragraphs |
| `docs/LABELING-CODEBOOK.md` | v3.5 rulings + version history |
| `docs/NARRATIVE.md` | Phase 15 with full iteration detail |
| `docs/STATUS.md` | v3.5 section added |