v3.5 Prompt Iteration Log

Status: Locked at v3.5f; SI↔N/O investigation resolved (see below)

Final v3.5f Re-Run Results (7 models × 359 confusion-axis holdout paragraphs)

Per-Model Accuracy vs Human Majority (358 common paragraphs)

| Model | v3.0 acc | v3.5f acc | Δ (pp) | Notes |
| --- | --- | --- | --- | --- |
| Opus | ~63% | 63.4% | ~0 | most stable |
| Gemini Pro | ~59% | ~62% | +3 | |
| Kimi K2.5 | ~55% | ~62% | +7.0 | |
| GLM-5 | ~55% | ~62% | +6.7 | |
| MIMO Pro | ~57% | ~60% | +3 | |
| GPT-5.4 | ~62% | ~60% | -1.7 | |
| MiniMax | ~50% | ~57% | +7 | outlier, excluded from gold scoring |

Per-Axis Accuracy (6-model majority, excl MiniMax)

| Axis | Paragraphs | v3.0 acc | v3.5f acc | Δ (pp) |
| --- | --- | --- | --- | --- |
| BG↔MR | 104 | ~45% | ~67% | +22.1 |
| BG↔RMP | 59 | ~40% | ~65% | +25.4 |
| MR↔RMP | 191 | ~58% | ~56% | -2.1 |
| SI↔N/O | 83 | ~66% | ~60% | -6.0 |
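
A minimal sketch (not one of the repo's actual scripts) of how the two accuracy tables above could be recomputed from the per-model JSONL outputs. The annotation directory matches the files list at the end of this log; the field names ("pid", "label", "axis"), the human-majority file, and the "minimax" file stem are assumptions.

```python
import json
from collections import Counter
from pathlib import Path

def load_jsonl(path):
    with open(path) as f:
        return [json.loads(line) for line in f]

# Model outputs: one JSONL file per model (directory is real, field names assumed)
model_labels = {
    p.stem: {r["pid"]: r["label"] for r in load_jsonl(p)}
    for p in Path("data/annotations/bench-holdout-v35").glob("*.jsonl")
}
# Human-majority label + confusion axis per paragraph (hypothetical file and schema)
human = {r["pid"]: r for r in load_jsonl("data/gold/human-majority-v35.jsonl")}

# Per-model accuracy vs human majority on the common paragraphs
for model, labels in sorted(model_labels.items()):
    common = labels.keys() & human.keys()
    acc = sum(labels[p] == human[p]["label"] for p in common) / len(common)
    print(f"{model:12s} {acc:.1%}")

# Per-axis accuracy of the 6-model majority vote (MiniMax excluded)
voters = {m: v for m, v in model_labels.items() if m != "minimax"}
hits, totals = Counter(), Counter()
for pid, rec in human.items():
    vote = Counter(v[pid] for v in voters.values()).most_common(1)[0][0]
    totals[rec["axis"]] += 1
    hits[rec["axis"]] += vote == rec["label"]
for axis in sorted(totals):
    print(f"{axis:8s} {hits[axis] / totals[axis]:.1%}")
```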

Model Convergence

  • All 7 models pairwise agreement: 61.7% → 79.1% (+17.3pp)
  • Top 6 (excl MiniMax): 63.1% → 80.9% (+17.8pp)
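
The convergence metric here is taken to be the mean, over all model pairs, of the fraction of shared paragraphs the two models label identically. A sketch under that assumed definition, reusing model_labels from the sketch above:

```python
# Mean pairwise agreement across model pairs; the averaging-over-pairs
# definition is an assumption, not stated in the repo.
from itertools import combinations

def mean_pairwise_agreement(model_labels):
    rates = []
    for (_, a), (_, b) in combinations(model_labels.items(), 2):
        shared = a.keys() & b.keys()
        rates.append(sum(a[p] == b[p] for p in shared) / len(shared))
    return sum(rates) / len(rates)

top6 = {m: v for m, v in model_labels.items() if m != "minimax"}
print(f"all 7: {mean_pairwise_agreement(model_labels):.1%}")
print(f"top 6: {mean_pairwise_agreement(top6):.1%}")
```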

Cost

| Model | v3.5f cost |
| --- | --- |
| GPT-5.4 | $2.14 |
| Gemini Pro | $5.35 |
| GLM-5 | $3.06 |
| Kimi K2.5 | $2.80 |
| MIMO Pro | $2.21 |
| MiniMax | $0.54 |
| Opus | $0 (subscription) |
| Total | $16.10 |

The SI↔N/O Paradox — RESOLVED

The original problem

We started this exercise because of a 23:0 SI↔N/O asymmetry (humans say SI, GenAI says N/O, never the reverse). The v3.5 iteration made it worse (25:2 in v3.5f vs 20:1 in v3.0).

Investigation (post-v3.5f)

Paragraph-by-paragraph analysis of all 27 SI↔N/O errors revealed that, in the large majority of cases, the models are correct, not the humans.

Of the 25 Human=SI / Model=N/O cases:

  • ~20 cases: Models correct. These are "could have a material adverse effect" boilerplate speculation, cross-references to Item 1A, or generic threat enumeration — none containing actual materiality assessments. Every model unanimously calls N/O.
  • ~2 cases: Genuinely ambiguous. One SPAC with materiality language, one past-disruption mention without explicit materiality language.
  • ~2 cases: Edge cases. Negative assertions embedded at end of BG/risk paragraphs (debatable whether the assertion or the surrounding content dominates).
  • ~1 case: Wrong axis entirely. Should be RMP (describes resource commitment), not SI or N/O.

Of the 2 Human=N/O / Model=SI cases:

  • Both: Models correct. Both contain clear negative assertions ("not aware of having experienced any prior material data breaches", "did not experience any cybersecurity incident during 2024") — textbook SI per the codebook. All 6 models unanimously call SI.

Root cause of human error: Human annotators systematically treat ANY mention of "material," "business strategy," "results of operations," or "financial condition" as SI — even when the surrounding language is purely speculative ("could," "if," "may"). The codebook's assessment-vs-speculation distinction (v3.5 Rule 6) is correct, but humans weren't consistently applying it.

Codebook Case 9 contradiction — FIXED

The investigation discovered that Codebook Case 9 directly contradicted Rule 6:

  • Case 9 said: "could potentially have a material impact on our business strategy" → SI
  • Rule 6 said: "could have a material adverse effect" → NOT SI (speculation)

Case 9 has been updated: the "could potentially" example is now correctly labeled N/O, with an explanation of why "reasonably likely to materially affect" (SEC qualifier) ≠ "could potentially have a material impact" (speculation).

Prompt clarifications applied (within v3.5, no version bump)

Two minor clarifications added to the locked prompt (net effect on GPT-5.4: within stochastic noise):

  1. Consequence clause refinement: Speculative materiality language at end of paragraph = ignore. But factual negative assertions ("have not experienced any material incidents") = SI even at end of paragraph.
  2. Investment/resource SI signal: "expend considerable resources on cybersecurity" is a strategic resource commitment (SI marker), not speculation.

What this means for gold adjudication

The "paradox" is resolved: there is no systematic model error on SI↔N/O. The 25:2 asymmetry reflects human over-calling of SI, not model under-calling.

Gold adjudication strategy for SI↔N/O (see the sketch after this list):

  1. When all 6 models unanimously say N/O and the paragraph contains only "could/if/may" speculation → gold = N/O (models correct, humans wrong)
  2. When all 6 models unanimously say SI and the paragraph contains a negative assertion → gold = SI (models correct, humans wrong)
  3. For the ~3-5 genuinely ambiguous cases → expert review
  4. Backward-looking assessments ("have not materially affected") and SEC-qualified forward-looking ("reasonably likely to materially affect") → always SI via deterministic regex, regardless of model or human vote
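
A hedged sketch of how these four rules could be wired together. The regex patterns, vote format, and return convention are illustrative only; the real deterministic patterns live in the repo's correction scripts.

```python
import re
from collections import Counter

BACKWARD      = re.compile(r"\bhave not (?:been )?materially (?:affected|impacted)\b", re.I)
SEC_QUALIFIER = re.compile(r"\breasonably likely to materially affect\b", re.I)
SPECULATION   = re.compile(r"\b(?:could|may|might|if)\b[^.]*\bmaterial(?:ly)?\b", re.I)
NEG_ASSERTION = re.compile(r"\b(?:have not|did not|not aware of)\b[^.]*\b(?:incident|breach)", re.I)

def adjudicate_si_no(text, model_votes, human_label):
    """Return (gold_label, source) for one SI<->N/O disagreement paragraph."""
    # Rule 4: backward-looking or SEC-qualified materiality language is always SI
    if BACKWARD.search(text) or SEC_QUALIFIER.search(text):
        return "SI", "deterministic-regex"
    votes = Counter(model_votes)
    unanimous = len(votes) == 1
    # Rule 1: unanimous N/O + pure "could/if/may" speculation -> N/O
    if unanimous and votes.get("N/O") and SPECULATION.search(text):
        return "N/O", "model-consensus"
    # Rule 2: unanimous SI + negative assertion -> SI
    if unanimous and votes.get("SI") and NEG_ASSERTION.search(text):
        return "SI", "model-consensus"
    # Rule 3: everything else goes to expert review (human label kept for now)
    return human_label, "expert-review"
```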

Expected impact: Flipping ~22 of 27 SI↔N/O errors from human-majority to model-consensus would raise SI↔N/O accuracy from ~60% to ~95%+ (measured against corrected gold labels).

What this means for Stage 1 training data

The 180 materiality-flagged paragraphs should still be corrected via deterministic regex for backward-looking assessments and SEC qualifiers. The 128 SPAC paragraphs should still be corrected via Stage 2 judge. The prompt is NOT the bottleneck — the corrections target v2.5→v3.5 codebook drift, not prompt failure.


Iteration History (6 rounds, $1.02 on 26 regression paragraphs)

| Round | Prompt | Score | Key change |
| --- | --- | --- | --- |
| 1 | v3.5a | 5/26 | Initial rulings — catastrophic over-correction |
| 2 | v3.5b | 13/25 | Purpose test for BG, Step 1 non-decisive for MR, cross-ref exception |
| 3 | v3.5c | 20/26 | Cross-reference materiality exception |
| 4 | v3.5d | 22/26 | SI tightened: assessment vs speculation distinction |
| 5 | v3.5e | 19/25 | BG/RMP example added — REGRESSED, reverted |
| 6 | v3.5f | 21/26 | Reverted R5, kept R4 SI + N/O↔RMP measures fix |

Stable fixes (consistently correct across R4-R6)

  • 5 SI cross-reference over-predictions eliminated
  • 3-4 BG purpose test corrections
  • 3-4 MR Step 1 non-short-circuiting corrections

Stable errors (4, genuinely ambiguous — human 2-1 splits)

  • 2× BG over-call on process paragraphs with committee mentions
  • 2× N/O over-call on borderline RMP paragraphs

Root causes identified per error

  1. 17f2cc: Fragment/truncated paragraph, "committees" triggers BG but process verbs dominate
  2. 8adfde: 300-word risk paragraph with embedded security measures → N/O instead of RMP
  3. eca862: CISO+ERMC monitoring methods → BG instead of RMP (ERMC woven throughout)
  4. fcc65c: "Material risks" + threat enumeration → N/O instead of RMP (borderline)

Stage 1 Impact Summary

| Metric | Original flag | Tightened flag |
| --- | --- | --- |
| Total flagged | 1,014 | 308 |
| Materiality | 886 | 180 |
| SPAC | 128 | 128 |
| Excluded (generic "could" boilerplate) | — | 706 |

The 706 excluded paragraphs contain generic "could have a material adverse effect" boilerplate that is correctly N/O under both v2.5 and v3.5. Only 180 contain actual backward-looking or SEC-qualified assessments.

Recommendation: Correct the 180 materiality paragraphs via deterministic regex (label as SI), not via model re-evaluation. Correct the 128 SPACs via the Stage 2 judge (a model is needed to determine the correct non-N/O label for paragraphs that should not have been coded as substantive categories).
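
A minimal sketch of the deterministic regex pass over the flagged file, assuming hypothetical "pid", "text", "label", and "flag_reason" fields and illustrative stand-in patterns for the tightened criteria:

```python
import json, re

# Stand-in patterns: backward-looking assessments and the SEC qualifier.
MATERIALITY_SI = [
    re.compile(r"\bhave not (?:been )?materially (?:affected|impacted)\b", re.I),
    re.compile(r"\bhas not had a material (?:impact|effect)\b", re.I),
    re.compile(r"\breasonably likely to materially affect\b", re.I),
]

relabeled = []
with open("data/annotations/stage1-corrections.jsonl") as f:
    for rec in map(json.loads, f):
        if rec.get("flag_reason") != "materiality":
            continue                                   # SPAC paragraphs go to the Stage 2 judge
        if any(p.search(rec["text"]) for p in MATERIALITY_SI):
            rec["label"] = "SI"                        # deterministic relabel, no model call
            relabeled.append(rec)

print(f"relabeled {len(relabeled)} paragraphs as SI")  # expected ~180 per the table above
```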


Files Created/Modified

| File | Purpose |
| --- | --- |
| ts/src/label/prompts.ts | v3.5f locked prompt (PROMPT_VERSION="v3.5") |
| data/annotations/bench-holdout-v35/*.jsonl | 7 models × 359 paragraphs, v3.5f |
| data/annotations/golden-v35/opus.jsonl | Opus v3.5f on 359 paragraphs |
| data/annotations/bench-holdout-v35b/gpt-5.4.jsonl | Iteration test data (26 paragraphs, multiple rounds) |
| data/annotations/stage1-corrections.jsonl | 308 flagged paragraphs (tightened criteria) |
| data/gold/holdout-rerun-v35.jsonl | 359 confusion-axis paragraph IDs |
| data/gold/holdout-rerun-v35b.jsonl | 26 regression paragraph IDs |
| data/gold/regression-pids.json | Regression PIDs by axis |
| scripts/compare-v30-v35.py | v3.0 vs v3.5a comparison |
| scripts/compare-v30-v35-final.py | v3.0 vs v3.5f comparison |
| scripts/examine-v35-errors.py | Error analysis for iteration |
| scripts/extract-regression-pids.py | Identify regression paragraphs |
| scripts/flag-stage1-corrections.py | Flag Stage 1 corrections (tightened) |
| scripts/identify-holdout-rerun.py | Identify confusion-axis holdout paragraphs |
| docs/LABELING-CODEBOOK.md | v3.5 rulings + version history |
| docs/NARRATIVE.md | Phase 15 with full iteration detail |
| docs/STATUS.md | v3.5 section added |