# v3.5 Prompt Iteration Log
## Status: Locked at v3.5f; SI↔N/O investigation complete (paradox resolved, see below)
## Final v3.5f Re-Run Results (7 models × 359 confusion-axis holdout paragraphs)
### Per-Model Accuracy vs Human Majority (358 common paragraphs)
| Model | v3.0 acc | v3.5f acc | Δ (pp) | Notes |
|-------|---------|----------|---|-------------|
| Opus | ~63% | 63.4% | ~0 | most stable |
| Gemini Pro | ~59% | ~62% | +3 | |
| Kimi K2.5 | ~55% | ~62% | +7.0 | |
| GLM-5 | ~55% | ~62% | +6.7 | |
| MIMO Pro | ~57% | ~60% | +3 | |
| GPT-5.4 | ~62% | ~60% | -1.7 | |
| MiniMax | ~50% | ~57% | +7 | outlier, excluded from gold scoring |
### Per-Axis Accuracy (6-model majority, excl MiniMax)
| Axis | Paragraphs | v3.0 acc | v3.5f acc | Δ (pp) |
|------|-----------|---------|----------|---|
| BG↔MR | 104 | ~45% | ~67% | **+22.1** |
| BG↔RMP | 59 | ~40% | ~65% | **+25.4** |
| MR↔RMP | 191 | ~58% | ~56% | -2.1 |
| SI↔N/O | 83 | ~66% | ~60% | **-6.0** |
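
For reproducibility, the sketch below shows one way to compute the per-axis majority-vote accuracy from the per-model JSONL annotation files. The field names (`pid`, `axis`, `label`, `human_label`) are illustrative assumptions, not the actual schema consumed by `scripts/compare-v30-v35-final.py`.

```python
# Sketch: accuracy of the 6-model majority vote vs. the human-majority label,
# broken out by confusion axis. Field names are assumptions, not the real schema.
import json
from collections import Counter, defaultdict
from pathlib import Path

def majority(labels):
    """Most common label; ties broken by first-seen order."""
    return Counter(labels).most_common(1)[0][0]

def per_axis_accuracy(model_files, human_file):
    human = {}  # pid -> (axis, human-majority label)
    for line in Path(human_file).read_text().splitlines():
        r = json.loads(line)
        human[r["pid"]] = (r["axis"], r["human_label"])

    votes = defaultdict(list)  # pid -> one label per model
    for f in model_files:
        for line in Path(f).read_text().splitlines():
            r = json.loads(line)
            votes[r["pid"]].append(r["label"])

    hits, totals = Counter(), Counter()
    for pid, (axis, gold) in human.items():
        if pid in votes:
            totals[axis] += 1
            hits[axis] += majority(votes[pid]) == gold
    return {axis: hits[axis] / totals[axis] for axis in totals}
```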
### Model Convergence
- Pairwise agreement, all 7 models: 61.7% → 79.1% (+17.3pp)
- Pairwise agreement, top 6 (excl. MiniMax): 63.1% → 80.9% (+17.8pp)
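
A matching sketch of the pairwise-agreement metric, assuming labels have already been loaded into a `model -> {pid: label}` mapping; each pair is compared only on the paragraphs both models labeled.

```python
# Sketch: mean pairwise agreement across models. Input shape is an assumption.
from itertools import combinations

def pairwise_agreement(labels_by_model):
    """Average per-pair agreement rate over shared paragraphs."""
    rates = []
    for (_, a), (_, b) in combinations(labels_by_model.items(), 2):
        shared = a.keys() & b.keys()
        if shared:
            rates.append(sum(a[p] == b[p] for p in shared) / len(shared))
    return sum(rates) / len(rates)
```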
### Cost
| Model | v3.5f cost |
|-------|-----------|
| GPT-5.4 | $2.14 |
| Gemini Pro | $5.35 |
| GLM-5 | $3.06 |
| Kimi K2.5 | $2.80 |
| MIMO Pro | $2.21 |
| MiniMax | $0.54 |
| Opus | $0 (subscription) |
| **Total** | **$16.10** |
---
## The SI↔N/O Paradox — RESOLVED
### The original problem
We started this exercise because of a 23:0 SI↔N/O asymmetry (humans say SI, GenAI says N/O, never the reverse). The v3.5 iteration made it worse (25:2 in v3.5f vs 20:1 in v3.0).
### Investigation (post-v3.5f)
Paragraph-by-paragraph analysis of all 27 SI↔N/O errors revealed **the models are correct, not the humans.**
**Of the 25 Human=SI / Model=N/O cases:**
- **~20 cases: Models correct.** These are "could have a material adverse effect" boilerplate speculation, cross-references to Item 1A, or generic threat enumeration — none containing actual materiality assessments. Every model unanimously calls N/O.
- **~2 cases: Genuinely ambiguous.** One SPAC with materiality language, one past-disruption mention without explicit materiality language.
- **~2 cases: Edge cases.** Negative assertions embedded at end of BG/risk paragraphs (debatable whether the assertion or the surrounding content dominates).
- **~1 case: Wrong axis entirely.** Should be RMP (describes resource commitment), not SI or N/O.
**Of the 2 Human=N/O / Model=SI cases:**
- **Both: Models correct.** Both contain clear negative assertions ("not aware of having experienced any prior material data breaches", "did not experience any cybersecurity incident during 2024") — textbook SI per the codebook. All 6 models unanimously call SI.
**Root cause of human error:** Human annotators systematically treat ANY mention of "material," "business strategy," "results of operations," or "financial condition" as SI — even when the surrounding language is purely speculative ("could," "if," "may"). The codebook's assessment-vs-speculation distinction (v3.5 Rule 6) is correct, but humans weren't consistently applying it.
### Codebook Case 9 contradiction — FIXED
The investigation discovered that **Codebook Case 9 directly contradicted Rule 6:**
- Case 9 said: "could potentially have a material impact on our business strategy" → SI
- Rule 6 said: "could have a material adverse effect" → NOT SI (speculation)
Case 9 has been updated: the "could potentially" example is now correctly labeled N/O, with an explanation of why "reasonably likely to materially affect" (SEC qualifier) ≠ "could potentially have a material impact" (speculation).
### Prompt clarifications applied (within v3.5, no version bump)
Two minor clarifications added to the locked prompt (net effect on GPT-5.4: within stochastic noise):
1. **Consequence clause refinement:** Speculative materiality language at end of paragraph = ignore. But factual negative assertions ("have not experienced any material incidents") = SI even at end of paragraph.
2. **Investment/resource SI signal:** "expend considerable resources on cybersecurity" is a strategic resource commitment (SI marker), not speculation.
### What this means for gold adjudication
**The "paradox" is resolved: there is no systematic model error on SI↔N/O.** The 25:2 asymmetry reflects human over-calling of SI, not model under-calling.
**Gold adjudication strategy for SI↔N/O:**
1. When all 6 models unanimously say N/O and the paragraph contains only "could/if/may" speculation → **gold = N/O** (models correct, humans wrong)
2. When all 6 models unanimously say SI and the paragraph contains a negative assertion → **gold = SI** (models correct, humans wrong)
3. For the ~3-5 genuinely ambiguous cases → expert review
4. Backward-looking assessments ("have not materially affected") and SEC-qualified forward-looking ("reasonably likely to materially affect") → **always SI** via deterministic regex, regardless of model or human vote
**Expected impact:** Flipping ~22 of 27 SI↔N/O errors from human-majority to model-consensus would raise SI↔N/O accuracy from ~60% to ~95%+ (measured against corrected gold labels).
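
A minimal sketch of how rules 1-4 could be encoded for adjudication. The speculation, negative-assertion, and SEC-qualifier patterns below are illustrative stand-ins for the deterministic regexes described above, not the ones actually used in the pipeline.

```python
# Sketch of the SI vs N/O gold-adjudication rules. Patterns are illustrative only.
import re

SPECULATION   = re.compile(r"\b(could|may|might|if)\b.*\bmaterial", re.I | re.S)
NEG_ASSERTION = re.compile(r"\b(have|has|did)\s+not\b.*\b(material|cybersecurity incident)",
                           re.I | re.S)
SEC_QUALIFIER = re.compile(r"reasonably likely to materially affect", re.I)
BACKWARD      = re.compile(r"have not materially affected", re.I)

def adjudicate(paragraph, model_labels):
    """Gold label for an SI/N-O disagreement, or 'EXPERT' for manual review."""
    # Rule 4: deterministic overrides win, regardless of model or human vote.
    if BACKWARD.search(paragraph) or SEC_QUALIFIER.search(paragraph):
        return "SI"
    unanimous = len(set(model_labels)) == 1
    # Rule 1: unanimous N/O + speculation-only materiality language -> N/O.
    if (unanimous and model_labels[0] == "N/O"
            and SPECULATION.search(paragraph) and not NEG_ASSERTION.search(paragraph)):
        return "N/O"
    # Rule 2: unanimous SI + a factual negative assertion -> SI.
    if unanimous and model_labels[0] == "SI" and NEG_ASSERTION.search(paragraph):
        return "SI"
    # Rule 3: everything else goes to expert review.
    return "EXPERT"
```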
### What this means for Stage 1 training data
The 180 materiality-flagged paragraphs should still be corrected via deterministic regex for backward-looking assessments and SEC qualifiers. The 128 SPAC paragraphs should still be corrected via the Stage 2 judge. The prompt is NOT the bottleneck: the corrections target v2.5→v3.5 codebook drift, not prompt failure.
---
## Iteration History (6 rounds, $1.02 on 26 regression paragraphs)
| Round | Prompt | Score | Key change |
|-------|--------|-------|-----------|
| 1 | v3.5a | 5/26 | Initial rulings — catastrophic over-correction |
| 2 | v3.5b | 13/25 | Purpose test for BG, Step 1 non-decisive for MR, cross-ref exception |
| 3 | v3.5c | 20/26 | Cross-reference materiality exception |
| 4 | v3.5d | 22/26 | SI tightened: assessment vs speculation distinction |
| 5 | v3.5e | 19/25 | BG/RMP example added — REGRESSED, reverted |
| 6 | v3.5f | 21/26 | Reverted R5, kept R4 SI + N/O↔RMP measures fix |
### Stable fixes (consistently correct across R4-R6)
- 5 SI cross-reference over-predictions eliminated
- 3-4 BG purpose test corrections
- 3-4 MR Step 1 non-short-circuiting corrections
### Stable errors (4, genuinely ambiguous — human 2-1 splits)
- 2× BG over-call on process paragraphs with committee mentions
- 2× N/O over-call on borderline RMP paragraphs
### Root causes identified per error
1. **17f2cc:** Fragment/truncated paragraph, "committees" triggers BG but process verbs dominate
2. **8adfde:** 300-word risk paragraph with embedded security measures → N/O instead of RMP
3. **eca862:** CISO+ERMC monitoring methods → BG instead of RMP (ERMC woven throughout)
4. **fcc65c:** "Material risks" + threat enumeration → N/O instead of RMP (borderline)
---
## Stage 1 Impact Summary
| Category | Original flags | Tightened flags |
|--------|-------------|---------------|
| Total flagged | 1,014 | 308 |
| Materiality | 886 | 180 |
| SPAC | 128 | 128 |
| Excluded (generic "could" boilerplate) | — | 706 |
The 706 excluded paragraphs contain only generic "could have a material adverse effect" boilerplate, which is correctly N/O under both v2.5 and v3.5. Only 180 contain actual backward-looking or SEC-qualified materiality assessments.
**Recommendation:** Correct the 180 materiality paragraphs via deterministic regex (relabel as SI), not via model re-evaluation. Correct the 128 SPAC paragraphs via the Stage 2 judge (a model is needed to determine the correct non-N/O label for paragraphs that should not have been coded under substantive categories).
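
A hedged sketch of that deterministic correction: flip a flagged paragraph to SI only on a backward-looking or SEC-qualified match, and leave generic boilerplate untouched. The patterns approximate the intent described above; `scripts/flag-stage1-corrections.py` may implement them differently.

```python
# Sketch of the deterministic regex correction for the 180 materiality paragraphs.
# Patterns are approximations of the rules above, not the production regexes.
import re

SI_PATTERNS = [
    re.compile(r"have not materially affected", re.I),            # backward-looking
    re.compile(r"reasonably likely to materially affect", re.I),  # SEC qualifier
]
BOILERPLATE = re.compile(r"could .{0,40}material adverse effect", re.I)

def correct_label(paragraph, current_label):
    """Relabel as SI only on a deterministic match; otherwise keep the label."""
    if any(p.search(paragraph) for p in SI_PATTERNS):
        return "SI"
    if BOILERPLATE.search(paragraph):
        return current_label  # generic speculation stays N/O under v2.5 and v3.5
    return current_label
```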
---
## Files Created/Modified
| File | Purpose |
|------|---------|
| `ts/src/label/prompts.ts` | v3.5f locked prompt (PROMPT_VERSION="v3.5") |
| `data/annotations/bench-holdout-v35/*.jsonl` | 7 models × 359 paragraphs, v3.5f |
| `data/annotations/golden-v35/opus.jsonl` | Opus v3.5f on 359 paragraphs |
| `data/annotations/bench-holdout-v35b/gpt-5.4.jsonl` | Iteration test data (26 paragraphs, multiple rounds) |
| `data/annotations/stage1-corrections.jsonl` | 308 flagged paragraphs (tightened criteria) |
| `data/gold/holdout-rerun-v35.jsonl` | 359 confusion-axis paragraph IDs |
| `data/gold/holdout-rerun-v35b.jsonl` | 26 regression paragraph IDs |
| `data/gold/regression-pids.json` | Regression PIDs by axis |
| `scripts/compare-v30-v35.py` | v3.0 vs v3.5a comparison |
| `scripts/compare-v30-v35-final.py` | v3.0 vs v3.5f comparison |
| `scripts/examine-v35-errors.py` | Error analysis for iteration |
| `scripts/extract-regression-pids.py` | Identify regression paragraphs |
| `scripts/flag-stage1-corrections.py` | Flag Stage 1 corrections (tightened) |
| `scripts/identify-holdout-rerun.py` | Identify confusion-axis holdout paragraphs |
| `docs/LABELING-CODEBOOK.md` | v3.5 rulings + version history |
| `docs/NARRATIVE.md` | Phase 15 with full iteration detail |
| `docs/STATUS.md` | v3.5 section added |