# v3.5 Prompt Iteration Log

## Status: Locked at v3.5f; SI↔N/O paradox investigated and resolved (see below)

## Final v3.5f Re-Run Results (7 models × 359 confusion-axis holdout paragraphs)

### Per-Model Accuracy vs Human Majority (358 common paragraphs)

| Model | v3.0 acc | v3.5f acc | Δ (pp) | Notes |
|-------|----------|-----------|--------|-------|
| Opus | ~63% | 63.4% | ~0 | most stable |
| Gemini Pro | ~59% | ~62% | +3 | |
| Kimi K2.5 | ~55% | ~62% | +7.0 | |
| GLM-5 | ~55% | ~62% | +6.7 | |
| MIMO Pro | ~57% | ~60% | +3 | |
| GPT-5.4 | ~62% | ~60% | -1.7 | |
| MiniMax | ~50% | ~57% | +7 | outlier, excluded from gold scoring |

### Per-Axis Accuracy (6-model majority, excl MiniMax)

| Axis | Paragraphs | v3.0 acc | v3.5f acc | Δ (pp) |
|------|------------|----------|-----------|--------|
| BG↔MR | 104 | ~45% | ~67% | **+22.1** |
| BG↔RMP | 59 | ~40% | ~65% | **+25.4** |
| MR↔RMP | 191 | ~58% | ~56% | -2.1 |
| SI↔N/O | 83 | ~66% | ~60% | **-6.0** |

### Model Convergence

- All 7 models pairwise agreement: 61.7% → 79.1% (+17.3pp) (computation sketched after this list)
- Top 6 (excl MiniMax): 63.1% → 80.9% (+17.8pp)

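Pairwise agreement here is presumably the average, over all model pairs, of the share of commonly-scored paragraphs that the two models label identically. A minimal sketch under that assumption (the function and data layout are illustrative, not the actual benchmark scripts):

```python
from itertools import combinations

def mean_pairwise_agreement(labels_by_model: dict[str, dict[str, str]]) -> float:
    """Average over all model pairs of the fraction of shared paragraphs labeled identically.

    labels_by_model maps model name -> {paragraph_id: predicted_label}.
    """
    rates = []
    for a, b in combinations(sorted(labels_by_model), 2):
        shared = labels_by_model[a].keys() & labels_by_model[b].keys()
        same = sum(labels_by_model[a][pid] == labels_by_model[b][pid] for pid in shared)
        rates.append(same / len(shared))
    return sum(rates) / len(rates)
```
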
### Cost

| Model | v3.5f cost |
|-------|------------|
| GPT-5.4 | $2.14 |
| Gemini Pro | $5.35 |
| GLM-5 | $3.06 |
| Kimi K2.5 | $2.80 |
| MIMO Pro | $2.21 |
| MiniMax | $0.54 |
| Opus | $0 (subscription) |
| **Total** | **$16.10** |

---

## The SI↔N/O Paradox — RESOLVED

### The original problem

We started this exercise because of a 23:0 SI↔N/O asymmetry (humans say SI, GenAI says N/O, never the reverse). The v3.5 iteration made it worse (25:2 in v3.5f vs 20:1 in v3.0).

### Investigation (post-v3.5f)

Paragraph-by-paragraph analysis of all 27 SI↔N/O errors revealed **the models are correct, not the humans.**

**Of the 25 Human=SI / Model=N/O cases:**

- **~20 cases: Models correct.** These are "could have a material adverse effect" boilerplate speculation, cross-references to Item 1A, or generic threat enumeration — none containing actual materiality assessments. Every model unanimously calls N/O.
- **~2 cases: Genuinely ambiguous.** One SPAC with materiality language, one past-disruption mention without explicit materiality language.
- **~2 cases: Edge cases.** Negative assertions embedded at the end of BG/risk paragraphs (debatable whether the assertion or the surrounding content dominates).
- **~1 case: Wrong axis entirely.** Should be RMP (describes resource commitment), not SI or N/O.

**Of the 2 Human=N/O / Model=SI cases:**

- **Both: Models correct.** Both contain clear negative assertions ("not aware of having experienced any prior material data breaches", "did not experience any cybersecurity incident during 2024") — textbook SI per the codebook. All 6 models unanimously call SI.

**Root cause of human error:** Human annotators systematically treat ANY mention of "material," "business strategy," "results of operations," or "financial condition" as SI — even when the surrounding language is purely speculative ("could," "if," "may"). The codebook's assessment-vs-speculation distinction (v3.5 Rule 6) is correct, but humans weren't consistently applying it.

### Codebook Case 9 contradiction — FIXED

The investigation discovered that **Codebook Case 9 directly contradicted Rule 6:**

- Case 9 said: "could potentially have a material impact on our business strategy" → SI
- Rule 6 said: "could have a material adverse effect" → NOT SI (speculation)

Case 9 has been updated: the "could potentially" example is now correctly labeled N/O, with an explanation of why "reasonably likely to materially affect" (SEC qualifier) ≠ "could potentially have a material impact" (speculation).

### Prompt clarifications applied (within v3.5, no version bump)

Two minor clarifications were added to the locked prompt (net effect on GPT-5.4: within stochastic noise):

1. **Consequence-clause refinement:** Speculative materiality language at the end of a paragraph is ignored, but a factual negative assertion ("have not experienced any material incidents") counts as SI even at the end of a paragraph.
2. **Investment/resource SI signal:** "expend considerable resources on cybersecurity" is a strategic resource commitment (SI marker), not speculation.

### What this means for gold adjudication

**The "paradox" is resolved: there is no systematic model error on SI↔N/O.** The 25:2 asymmetry reflects human over-calling of SI, not model under-calling.

**Gold adjudication strategy for SI↔N/O** (a sketch of this decision logic follows the list):

1. When all 6 models unanimously say N/O and the paragraph contains only "could/if/may" speculation → **gold = N/O** (models correct, humans wrong)
2. When all 6 models unanimously say SI and the paragraph contains a negative assertion → **gold = SI** (models correct, humans wrong)
3. For the ~3-5 genuinely ambiguous cases → expert review
4. Backward-looking assessments ("have not materially affected") and SEC-qualified forward-looking language ("reasonably likely to materially affect") → **always SI** via deterministic regex, regardless of model or human vote

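A minimal sketch of how these rules could be applied mechanically, assuming each record carries the paragraph text and the six model votes. The function name, regex patterns, and label strings are illustrative, and rule 1's "only speculation" condition is simplified here to a presence check:

```python
import re

# Illustrative patterns only; the production rules may be broader.
ALWAYS_SI = re.compile(
    r"(have not materially affected|reasonably likely to materially affect)", re.I
)
SPECULATION = re.compile(r"\b(could|may|if|might)\b[^.]*\bmaterial", re.I)
NEGATIVE_ASSERTION = re.compile(
    r"\b(have not|did not|not aware of having)\b[^.]*\b(experienced|incident|breach)", re.I
)

def adjudicate_si_no(text: str, model_votes: list[str]) -> str:
    """Apply the SI vs N/O gold adjudication rules in priority order."""
    if ALWAYS_SI.search(text):               # Rule 4: deterministic regex overrides all votes
        return "SI"
    if set(model_votes) == {"N/O"} and SPECULATION.search(text):
        return "N/O"                          # Rule 1: unanimous N/O + speculative materiality language present
    if set(model_votes) == {"SI"} and NEGATIVE_ASSERTION.search(text):
        return "SI"                           # Rule 2: unanimous SI + factual negative assertion
    return "EXPERT_REVIEW"                    # Rule 3: remaining ambiguous cases go to expert review
```
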
**Expected impact:** Flipping ~22 of 27 SI↔N/O errors from human-majority to model-consensus would raise SI↔N/O accuracy from ~60% to ~95%+ (measured against corrected gold labels).

### What this means for Stage 1 training data

The 180 materiality-flagged paragraphs should still be corrected via deterministic regex for backward-looking assessments and SEC qualifiers. The 128 SPAC paragraphs should still be corrected via Stage 2 judge. The prompt is NOT the bottleneck — the corrections target v2.5→v3.5 codebook drift, not prompt failure.

---

## Iteration History (6 rounds, $1.02 on 26 regression paragraphs)

| Round | Prompt | Score | Key change |
|-------|--------|-------|------------|
| 1 | v3.5a | 5/26 | Initial rulings — catastrophic over-correction |
| 2 | v3.5b | 13/25 | Purpose test for BG, Step 1 non-decisive for MR, cross-ref exception |
| 3 | v3.5c | 20/26 | Cross-reference materiality exception |
| 4 | v3.5d | 22/26 | SI tightened: assessment vs speculation distinction |
| 5 | v3.5e | 19/25 | BG/RMP example added — REGRESSED, reverted |
| 6 | v3.5f | 21/26 | Reverted R5, kept R4 SI + N/O↔RMP measures fix |

### Stable fixes (consistently correct across R4-R6)

- 5 SI cross-reference over-predictions eliminated
- 3-4 BG purpose test corrections
- 3-4 MR Step 1 non-short-circuiting corrections

### Stable errors (4, genuinely ambiguous — human 2-1 splits)

- 2× BG over-call on process paragraphs with committee mentions
- 2× N/O over-call on borderline RMP paragraphs

### Root causes identified per error

1. **17f2cc:** Fragment/truncated paragraph; "committees" triggers BG, but process verbs dominate
2. **8adfde:** 300-word risk paragraph with embedded security measures → N/O instead of RMP
3. **eca862:** CISO+ERMC monitoring methods → BG instead of RMP (ERMC woven throughout)
4. **fcc65c:** "Material risks" + threat enumeration → N/O instead of RMP (borderline)

---

## Stage 1 Impact Summary

| Metric | Original flag | Tightened flag |
|--------|---------------|----------------|
| Total flagged | 1,014 | 308 |
| Materiality | 886 | 180 |
| SPAC | 128 | 128 |
| Excluded (generic "could" boilerplate) | — | 706 |

The 706 excluded paragraphs contain generic "could have a material adverse effect" language that is correctly N/O under both v2.5 and v3.5. Only 180 contain actual backward-looking or SEC-qualified assessments.

**Recommendation:** Correct the 180 materiality paragraphs via deterministic regex (label as SI), not via model re-evaluation; a sketch of candidate patterns follows. Correct the 128 SPACs via Stage 2 judge (need model to determine correct non-N/O label for paragraphs that shouldn't have been coded as substantive categories).
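
A minimal sketch of that deterministic pass, assuming the flagged paragraphs sit in a JSONL file with `text` and `label` fields. The patterns, field names, and function are illustrative, not the project's actual correction logic:

```python
import json
import re

# Illustrative patterns; the real deterministic rules may differ.
BACKWARD_LOOKING = re.compile(
    r"\bhave not (materially affected|experienced any material)", re.I
)
SEC_QUALIFIER = re.compile(r"\breasonably likely to materially affect\b", re.I)

def apply_regex_corrections(in_path: str, out_path: str) -> int:
    """Relabel flagged paragraphs with backward-looking or SEC-qualified materiality language as SI."""
    corrected = 0
    with open(in_path) as fin, open(out_path, "w") as fout:
        for line in fin:
            rec = json.loads(line)
            if BACKWARD_LOOKING.search(rec["text"]) or SEC_QUALIFIER.search(rec["text"]):
                if rec["label"] != "SI":
                    rec["label"] = "SI"
                    corrected += 1
            fout.write(json.dumps(rec) + "\n")
    return corrected
```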

---

## Files Created/Modified

| File | Purpose |
|------|---------|
| `ts/src/label/prompts.ts` | v3.5f locked prompt (PROMPT_VERSION="v3.5") |
| `data/annotations/bench-holdout-v35/*.jsonl` | 7 models × 359 paragraphs, v3.5f |
| `data/annotations/golden-v35/opus.jsonl` | Opus v3.5f on 359 paragraphs |
| `data/annotations/bench-holdout-v35b/gpt-5.4.jsonl` | Iteration test data (26 paragraphs, multiple rounds) |
| `data/annotations/stage1-corrections.jsonl` | 308 flagged paragraphs (tightened criteria) |
| `data/gold/holdout-rerun-v35.jsonl` | 359 confusion-axis paragraph IDs |
| `data/gold/holdout-rerun-v35b.jsonl` | 26 regression paragraph IDs |
| `data/gold/regression-pids.json` | Regression PIDs by axis |
| `scripts/compare-v30-v35.py` | v3.0 vs v3.5a comparison |
| `scripts/compare-v30-v35-final.py` | v3.0 vs v3.5f comparison |
| `scripts/examine-v35-errors.py` | Error analysis for iteration |
| `scripts/extract-regression-pids.py` | Identify regression paragraphs |
| `scripts/flag-stage1-corrections.py` | Flag Stage 1 corrections (tightened) |
| `scripts/identify-holdout-rerun.py` | Identify confusion-axis holdout paragraphs |
| `docs/LABELING-CODEBOOK.md` | v3.5 rulings + version history |
| `docs/NARRATIVE.md` | Phase 15 with full iteration detail |
| `docs/STATUS.md` | v3.5 section added |