# v3.5 Prompt Iteration Log
Status: Locked at v3.5f; SI↔N/O investigation complete (resolved below)
## Final v3.5f Re-Run Results (7 models × 359 confusion-axis holdout paragraphs)
### Per-Model Accuracy vs Human Majority (358 paragraphs common to both runs)
| Model | v3.0 acc | v3.5f acc | Δ (pp) | Notes |
|---|---|---|---|---|
| Opus | ~63% | 63.4% | ~0 | most stable |
| Gemini Pro | ~59% | ~62% | +3 | |
| Kimi K2.5 | ~55% | ~62% | +7.0 | |
| GLM-5 | ~55% | ~62% | +6.7 | |
| MIMO Pro | ~57% | ~60% | +3 | |
| GPT-5.4 | ~62% | ~60% | -1.7 | |
| MiniMax | ~50% | ~57% | +7 | outlier, excluded from gold scoring |
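For reference, a minimal sketch of how these per-model numbers can be reproduced, assuming each file in data/annotations/bench-holdout-v35/ is JSONL with pid and label fields (the field names are assumptions, not the repo's confirmed schema):

```python
import json
from pathlib import Path

def load_labels(path: Path) -> dict[str, str]:
    """Map paragraph id -> label from one model's JSONL annotation file."""
    labels: dict[str, str] = {}
    with path.open() as f:
        for line in f:
            rec = json.loads(line)
            labels[rec["pid"]] = rec["label"]  # assumed field names
    return labels

def accuracy_vs_human(model_path: Path, human_majority: dict[str, str]) -> float:
    """Share of shared paragraphs where the model matches the human majority."""
    preds = load_labels(model_path)
    common = preds.keys() & human_majority.keys()
    return sum(preds[p] == human_majority[p] for p in common) / len(common)
```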
### Per-Axis Accuracy (6-model majority, excl. MiniMax)
| Axis | Paragraphs | v3.0 acc | v3.5f acc | Δ (pp) |
|---|---|---|---|---|
| BG↔MR | 104 | ~45% | ~67% | +22.1 |
| BG↔RMP | 59 | ~40% | ~65% | +25.4 |
| MR↔RMP | 191 | ~58% | ~56% | -2.1 |
| SI↔N/O | 83 | ~66% | ~60% | -6.0 |
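The per-axis rows score a plain majority vote of the six retained models against the human-majority label, restricted to each axis's paragraph IDs. A sketch under the same schema assumptions as above; the tie handling here is a guess:

```python
from collections import Counter

def majority_label(models: list[dict[str, str]], pid: str) -> str | None:
    """Most common label across models for one paragraph; None on a tie."""
    votes = Counter(m[pid] for m in models if pid in m)
    top = votes.most_common(2)
    if not top or (len(top) == 2 and top[0][1] == top[1][1]):
        return None  # tie handling is a guess; the real scoring may differ
    return top[0][0]

def axis_accuracy(models: list[dict[str, str]],
                  human_majority: dict[str, str],
                  axis_pids: list[str]) -> float:
    """Majority-vote accuracy vs the human majority on one confusion axis."""
    pairs = [(majority_label(models, p), human_majority[p]) for p in axis_pids]
    pairs = [(m, h) for m, h in pairs if m is not None]
    return sum(m == h for m, h in pairs) / len(pairs)
```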
### Model Convergence
- All 7 models, pairwise agreement: 61.7% → 79.1% (+17.4pp)
- Top 6 (excl. MiniMax): 63.1% → 80.9% (+17.8pp)
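Pairwise agreement can be read as the mean, over all model pairs, of the share of shared paragraphs on which both models emit the same label; that averaging choice is an assumption of this sketch:

```python
from itertools import combinations

def mean_pairwise_agreement(models: list[dict[str, str]]) -> float:
    """Mean over model pairs of label agreement on their shared paragraphs."""
    rates = []
    for a, b in combinations(models, 2):
        common = a.keys() & b.keys()
        rates.append(sum(a[p] == b[p] for p in common) / len(common))
    return sum(rates) / len(rates)
```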
### Cost
| Model | v3.5f cost |
|---|---|
| GPT-5.4 | $2.14 |
| Gemini Pro | $5.35 |
| GLM-5 | $3.06 |
| Kimi K2.5 | $2.80 |
| MIMO Pro | $2.21 |
| MiniMax | $0.54 |
| Opus | $0 (subscription) |
| Total | $16.10 |
## The SI↔N/O Paradox — RESOLVED
### The original problem
We started this exercise because of a 23:0 SI↔N/O asymmetry (humans say SI, GenAI says N/O, never the reverse). The v3.5 iteration made it worse: 25:2 in v3.5f vs. 20:1 in v3.0.
### Investigation (post-v3.5f)
Paragraph-by-paragraph analysis of all 27 SI↔N/O errors revealed that, in the large majority of cases, the models are correct, not the humans.
Of the 25 Human=SI / Model=N/O cases:
- ~20 cases: Models correct. These are "could have a material adverse effect" boilerplate speculation, cross-references to Item 1A, or generic threat enumeration — none containing actual materiality assessments. Every model unanimously calls N/O.
- ~2 cases: Genuinely ambiguous. One SPAC with materiality language, one past-disruption mention without explicit materiality language.
- ~2 cases: Edge cases. Negative assertions embedded at end of BG/risk paragraphs (debatable whether the assertion or the surrounding content dominates).
- ~1 case: Wrong axis entirely. Should be RMP (describes resource commitment), not SI or N/O.
Of the 2 Human=N/O / Model=SI cases:
- Both: Models correct. Both contain clear negative assertions ("not aware of having experienced any prior material data breaches", "did not experience any cybersecurity incident during 2024") — textbook SI per the codebook. All 6 models unanimously call SI.
Root cause of human error: Human annotators systematically treat ANY mention of "material," "business strategy," "results of operations," or "financial condition" as SI — even when the surrounding language is purely speculative ("could," "if," "may"). The codebook's assessment-vs-speculation distinction (v3.5 Rule 6) is correct, but humans weren't consistently applying it.
### Codebook Case 9 contradiction — FIXED
The investigation discovered that Codebook Case 9 directly contradicted Rule 6:
- Case 9 said: "could potentially have a material impact on our business strategy" → SI
- Rule 6 said: "could have a material adverse effect" → NOT SI (speculation)
Case 9 has been updated: the "could potentially" example is now correctly labeled N/O, with an explanation of why "reasonably likely to materially affect" (SEC qualifier) ≠ "could potentially have a material impact" (speculation).
### Prompt clarifications applied (within v3.5, no version bump)
Two minor clarifications added to the locked prompt (net effect on GPT-5.4: within stochastic noise):
- Consequence clause refinement: Speculative materiality language at end of paragraph = ignore. But factual negative assertions ("have not experienced any material incidents") = SI even at end of paragraph.
- Investment/resource SI signal: "expend considerable resources on cybersecurity" is a strategic resource commitment (SI marker), not speculation.
### What this means for gold adjudication
The "paradox" is resolved: there is no systematic model error on SI↔N/O. The 25:2 asymmetry reflects human over-calling of SI, not model under-calling.
Gold adjudication strategy for SI↔N/O:
- When all 6 models unanimously say N/O and the paragraph contains only "could/if/may" speculation → gold = N/O (models correct, humans wrong)
- When all 6 models unanimously say SI and the paragraph contains a negative assertion → gold = SI (models correct, humans wrong)
- For the ~3-5 genuinely ambiguous cases → expert review
- Backward-looking assessments ("have not materially affected") and SEC-qualified forward-looking ("reasonably likely to materially affect") → always SI via deterministic regex, regardless of model or human vote
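A minimal sketch of what that deterministic SI trigger could look like; the patterns below are illustrative paraphrases of the examples in this log, not the production regexes:

```python
import re

# Illustrative paraphrases of the log's examples; not the production patterns.
SI_TRIGGERS = [
    # Backward-looking assessments
    re.compile(r"\b(?:have|has)\s+not\s+materially\s+affected\b", re.I),
    re.compile(r"\bdid\s+not\s+experience\s+any\b.{0,80}?\bincident", re.I | re.S),
    re.compile(r"\bnot\s+aware\s+of\s+having\s+experienced\b", re.I),
    # SEC-qualified forward-looking language (vs bare "could/if/may" speculation)
    re.compile(r"\breasonably\s+likely\s+to\s+materially\s+affect\b", re.I),
]

def deterministic_si(paragraph: str) -> bool:
    """True if the paragraph hits a trigger that forces gold = SI."""
    return any(p.search(paragraph) for p in SI_TRIGGERS)
```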
Expected impact: Flipping ~22 of 27 SI↔N/O errors from human-majority to model-consensus would raise SI↔N/O accuracy from ~60% to ~95%+ (measured against corrected gold labels).
### What this means for Stage 1 training data
The 180 materiality-flagged paragraphs should still be corrected via deterministic regex for backward-looking assessments and SEC qualifiers. The 128 SPAC paragraphs should still be corrected via Stage 2 judge. The prompt is NOT the bottleneck — the corrections target v2.5→v3.5 codebook drift, not prompt failure.
## Iteration History (6 rounds, $1.02 on 26 regression paragraphs)
| Round | Prompt | Score | Key change |
|---|---|---|---|
| 1 | v3.5a | 5/26 | Initial rulings — catastrophic over-correction |
| 2 | v3.5b | 13/25 | Purpose test for BG, Step 1 non-decisive for MR, cross-ref exception |
| 3 | v3.5c | 20/26 | Cross-reference materiality exception |
| 4 | v3.5d | 22/26 | SI tightened: assessment vs speculation distinction |
| 5 | v3.5e | 19/25 | BG/RMP example added — REGRESSED, reverted |
| 6 | v3.5f | 21/26 | Reverted R5, kept R4 SI + N/O↔RMP measures fix |
### Stable fixes (consistently correct across R4-R6)
- 5 SI cross-reference over-predictions eliminated
- 3-4 BG purpose test corrections
- 3-4 MR Step 1 non-short-circuiting corrections
### Stable errors (4, genuinely ambiguous — human 2-1 splits)
- 2× BG over-call on process paragraphs with committee mentions
- 2× N/O over-call on borderline RMP paragraphs
### Root causes identified per error
- 17f2cc: Fragment/truncated paragraph, "committees" triggers BG but process verbs dominate
- 8adfde: 300-word risk paragraph with embedded security measures → N/O instead of RMP
- eca862: CISO+ERMC monitoring methods → BG instead of RMP (ERMC woven throughout)
- fcc65c: "Material risks" + threat enumeration → N/O instead of RMP (borderline)
## Stage 1 Impact Summary
| Metric | Original flag | Tightened flag |
|---|---|---|
| Total flagged | 1,014 | 308 |
| Materiality | 886 | 180 |
| SPAC | 128 | 128 |
| Excluded (generic "could" boilerplate) | — | 706 |
The 706 excluded paragraphs contain generic "could have a material adverse effect" boilerplate that is correctly N/O under both v2.5 and v3.5. Only the remaining 180 contain actual backward-looking or SEC-qualified assessments.
Recommendation: Correct the 180 materiality paragraphs via deterministic regex (label as SI), not via model re-evaluation. Correct the 128 SPACs via the Stage 2 judge (a model is needed to determine the correct non-N/O label for paragraphs that should not have been coded into substantive categories).
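A sketch of how the two correction paths could be routed over data/annotations/stage1-corrections.jsonl, reusing deterministic_si from the sketch above; the flag, text, and label field names are assumptions:

```python
import json
from pathlib import Path

def route_stage1_corrections(path: Path) -> tuple[list[dict], list[dict]]:
    """Split the flagged paragraphs into regex-corrected vs Stage 2 judge queue."""
    regex_fixed, judge_queue = [], []
    with path.open() as f:
        for line in f:
            rec = json.loads(line)
            if rec.get("flag") == "spac":         # assumed flag value
                judge_queue.append(rec)           # judge picks the non-N/O label
            elif deterministic_si(rec["text"]):   # sketch from the section above
                rec["label"] = "SI"
                regex_fixed.append(rec)
    return regex_fixed, judge_queue
```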
## Files Created/Modified
| File | Purpose |
|---|---|
| ts/src/label/prompts.ts | v3.5f locked prompt (PROMPT_VERSION="v3.5") |
| data/annotations/bench-holdout-v35/*.jsonl | 7 models × 359 paragraphs, v3.5f |
| data/annotations/golden-v35/opus.jsonl | Opus v3.5f on 359 paragraphs |
| data/annotations/bench-holdout-v35b/gpt-5.4.jsonl | Iteration test data (26 paragraphs, multiple rounds) |
| data/annotations/stage1-corrections.jsonl | 308 flagged paragraphs (tightened criteria) |
| data/gold/holdout-rerun-v35.jsonl | 359 confusion-axis paragraph IDs |
| data/gold/holdout-rerun-v35b.jsonl | 26 regression paragraph IDs |
| data/gold/regression-pids.json | Regression PIDs by axis |
| scripts/compare-v30-v35.py | v3.0 vs v3.5a comparison |
| scripts/compare-v30-v35-final.py | v3.0 vs v3.5f comparison |
| scripts/examine-v35-errors.py | Error analysis for iteration |
| scripts/extract-regression-pids.py | Identify regression paragraphs |
| scripts/flag-stage1-corrections.py | Flag Stage 1 corrections (tightened) |
| scripts/identify-holdout-rerun.py | Identify confusion-axis holdout paragraphs |
| docs/LABELING-CODEBOOK.md | v3.5 rulings + version history |
| docs/NARRATIVE.md | Phase 15 with full iteration detail |
| docs/STATUS.md | v3.5 section added |