# v3.5 Prompt Iteration Log
Status: Locked at v3.5f; SI↔N/O investigation complete (resolved below)
## Final v3.5f Re-Run Results (7 models × 359 confusion-axis holdout paragraphs)
### Per-Model Accuracy vs Human Majority (358 paragraphs common to both runs)
| Model | v3.0 acc | v3.5f acc | Δ (pp) | Notes |
|---|---|---|---|---|
| Opus | ~63% | 63.4% | ~0 | most stable |
| Gemini Pro | ~59% | ~62% | +3 | |
| Kimi K2.5 | ~55% | ~62% | +7.0 | |
| GLM-5 | ~55% | ~62% | +6.7 | |
| MIMO Pro | ~57% | ~60% | +3 | |
| GPT-5.4 | ~62% | ~60% | -1.7 | |
| MiniMax | ~50% | ~57% | +7 | outlier, excluded from gold scoring |
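For reference, a minimal sketch of how these per-model numbers can be reproduced, assuming each file in data/annotations/bench-holdout-v35/ is JSONL with pid and label fields (the field names are assumptions, not the repo's confirmed schema):

```python
import json
from pathlib import Path

def load_labels(path: Path) -> dict[str, str]:
    """Map paragraph id -> label from one model's JSONL annotation file."""
    labels: dict[str, str] = {}
    with path.open() as f:
        for line in f:
            rec = json.loads(line)
            labels[rec["pid"]] = rec["label"]  # assumed field names
    return labels

def accuracy_vs_human(model_path: Path, human_majority: dict[str, str]) -> float:
    """Share of shared paragraphs where the model matches the human majority."""
    preds = load_labels(model_path)
    common = preds.keys() & human_majority.keys()
    return sum(preds[p] == human_majority[p] for p in common) / len(common)
```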
### Per-Axis Accuracy (6-model majority, excl. MiniMax)
| Axis | Paragraphs | v3.0 acc | v3.5f acc | Δ (pp) |
|---|---|---|---|---|
| BG↔MR | 104 | ~45% | ~67% | +22.1 |
| BG↔RMP | 59 | ~40% | ~65% | +25.4 |
| MR↔RMP | 191 | ~58% | ~56% | -2.1 |
| SI↔N/O | 83 | ~66% | ~60% | -6.0 |
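The per-axis rows score a plain majority vote of the six retained models against the human-majority label, restricted to each axis's paragraph IDs. A sketch under the same schema assumptions as above; the tie handling here is a guess:

```python
from collections import Counter

def majority_label(models: list[dict[str, str]], pid: str) -> str | None:
    """Most common label across models for one paragraph; None on a tie."""
    votes = Counter(m[pid] for m in models if pid in m)
    top = votes.most_common(2)
    if not top or (len(top) == 2 and top[0][1] == top[1][1]):
        return None  # tie handling is a guess; the real scoring may differ
    return top[0][0]

def axis_accuracy(models: list[dict[str, str]],
                  human_majority: dict[str, str],
                  axis_pids: list[str]) -> float:
    """Majority-vote accuracy vs the human majority on one confusion axis."""
    pairs = [(majority_label(models, p), human_majority[p]) for p in axis_pids]
    pairs = [(m, h) for m, h in pairs if m is not None]
    return sum(m == h for m, h in pairs) / len(pairs)
```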
### Model Convergence
- All 7 models, pairwise agreement: 61.7% → 79.1% (+17.4pp)
- Top 6 (excl. MiniMax): 63.1% → 80.9% (+17.8pp)
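Pairwise agreement can be read as the mean, over all model pairs, of the share of shared paragraphs on which both models emit the same label; that averaging choice is an assumption of this sketch:

```python
from itertools import combinations

def mean_pairwise_agreement(models: list[dict[str, str]]) -> float:
    """Mean over model pairs of label agreement on their shared paragraphs."""
    rates = []
    for a, b in combinations(models, 2):
        common = a.keys() & b.keys()
        rates.append(sum(a[p] == b[p] for p in common) / len(common))
    return sum(rates) / len(rates)
```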
### Cost
| Model | v3.5f cost |
|---|---|
| GPT-5.4 | $2.14 |
| Gemini Pro | $5.35 |
| GLM-5 | $3.06 |
| Kimi K2.5 | $2.80 |
| MIMO Pro | $2.21 |
| MiniMax | $0.54 |
| Opus | $0 (subscription) |
| Total | $16.10 |
## The SI↔N/O Paradox — RESOLVED
### The original problem
We started this exercise because of a 23:0 SI↔N/O asymmetry (humans say SI, GenAI says N/O, never the reverse). The v3.5 iteration made it worse: 25:2 in v3.5f vs. 20:1 in v3.0.
### Investigation (post-v3.5f)
Paragraph-by-paragraph analysis of all 27 SI↔N/O errors revealed that, in the large majority of cases, the models are correct, not the humans.
Of the 25 Human=SI / Model=N/O cases:
- ~20 cases: Models correct. These are "could have a material adverse effect" boilerplate speculation, cross-references to Item 1A, or generic threat enumeration — none containing actual materiality assessments. Every model unanimously calls N/O.
- ~2 cases: Genuinely ambiguous. One SPAC with materiality language, one past-disruption mention without explicit materiality language.
- ~2 cases: Edge cases. Negative assertions embedded at end of BG/risk paragraphs (debatable whether the assertion or the surrounding content dominates).
- ~1 case: Wrong axis entirely. Should be RMP (describes resource commitment), not SI or N/O.
Of the 2 Human=N/O / Model=SI cases:
- Both: Models correct. Both contain clear negative assertions ("not aware of having experienced any prior material data breaches", "did not experience any cybersecurity incident during 2024") — textbook SI per the codebook. All 6 models unanimously call SI.
Root cause of human error: Human annotators systematically treat ANY mention of "material," "business strategy," "results of operations," or "financial condition" as SI — even when the surrounding language is purely speculative ("could," "if," "may"). The codebook's assessment-vs-speculation distinction (v3.5 Rule 6) is correct, but humans weren't consistently applying it.
### Codebook Case 9 contradiction — FIXED
The investigation discovered that Codebook Case 9 directly contradicted Rule 6:
- Case 9 said: "could potentially have a material impact on our business strategy" → SI
- Rule 6 said: "could have a material adverse effect" → NOT SI (speculation)
Case 9 has been updated: the "could potentially" example is now correctly labeled N/O, with an explanation of why "reasonably likely to materially affect" (SEC qualifier) ≠ "could potentially have a material impact" (speculation).
### Prompt clarifications applied (within v3.5, no version bump)
Two minor clarifications added to the locked prompt (net effect on GPT-5.4: within stochastic noise):
- Consequence clause refinement: Speculative materiality language at end of paragraph = ignore. But factual negative assertions ("have not experienced any material incidents") = SI even at end of paragraph.
- Investment/resource SI signal: "expend considerable resources on cybersecurity" is a strategic resource commitment (SI marker), not speculation.
### What this means for gold adjudication
The "paradox" is resolved: there is no systematic model error on SI↔N/O. The 25:2 asymmetry reflects human over-calling of SI, not model under-calling.
Gold adjudication strategy for SI↔N/O:
- When all 6 models unanimously say N/O and the paragraph contains only "could/if/may" speculation → gold = N/O (models correct, humans wrong)
- When all 6 models unanimously say SI and the paragraph contains a negative assertion → gold = SI (models correct, humans wrong)
- For the ~3-5 genuinely ambiguous cases → expert review
- Backward-looking assessments ("have not materially affected") and SEC-qualified forward-looking ("reasonably likely to materially affect") → always SI via deterministic regex, regardless of model or human vote
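A minimal sketch of what that deterministic SI trigger could look like; the patterns below are illustrative paraphrases of the examples in this log, not the production regexes:

```python
import re

# Illustrative paraphrases of the log's examples; not the production patterns.
SI_TRIGGERS = [
    # Backward-looking assessments
    re.compile(r"\b(?:have|has)\s+not\s+materially\s+affected\b", re.I),
    re.compile(r"\bdid\s+not\s+experience\s+any\b.{0,80}?\bincident", re.I | re.S),
    re.compile(r"\bnot\s+aware\s+of\s+having\s+experienced\b", re.I),
    # SEC-qualified forward-looking language (vs bare "could/if/may" speculation)
    re.compile(r"\breasonably\s+likely\s+to\s+materially\s+affect\b", re.I),
]

def deterministic_si(paragraph: str) -> bool:
    """True if the paragraph hits a trigger that forces gold = SI."""
    return any(p.search(paragraph) for p in SI_TRIGGERS)
```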
Expected impact: Flipping ~22 of 27 SI↔N/O errors from human-majority to model-consensus would raise SI↔N/O accuracy from ~60% to ~95%+ (measured against corrected gold labels).
### What this means for Stage 1 training data
The 180 materiality-flagged paragraphs should still be corrected via deterministic regex for backward-looking assessments and SEC qualifiers. The 128 SPAC paragraphs should still be corrected via Stage 2 judge. The prompt is NOT the bottleneck — the corrections target v2.5→v3.5 codebook drift, not prompt failure.
## Iteration History (6 rounds, $1.02 on 26 regression paragraphs)
| Round | Prompt | Score | Key change |
|---|---|---|---|
| 1 | v3.5a | 5/26 | Initial rulings — catastrophic over-correction |
| 2 | v3.5b | 13/25 | Purpose test for BG, Step 1 non-decisive for MR, cross-ref exception |
| 3 | v3.5c | 20/26 | Cross-reference materiality exception |
| 4 | v3.5d | 22/26 | SI tightened: assessment vs speculation distinction |
| 5 | v3.5e | 19/25 | BG/RMP example added — REGRESSED, reverted |
| 6 | v3.5f | 21/26 | Reverted R5, kept R4 SI + N/O↔RMP measures fix |
### Stable fixes (consistently correct across R4-R6)
- 5 SI cross-reference over-predictions eliminated
- 3-4 BG purpose test corrections
- 3-4 MR Step 1 non-short-circuiting corrections
### Stable errors (4, genuinely ambiguous — human 2-1 splits)
- 2× BG over-call on process paragraphs with committee mentions
- 2× N/O over-call on borderline RMP paragraphs
### Root causes identified per error
- 17f2cc: Fragment/truncated paragraph, "committees" triggers BG but process verbs dominate
- 8adfde: 300-word risk paragraph with embedded security measures → N/O instead of RMP
- eca862: CISO+ERMC monitoring methods → BG instead of RMP (ERMC woven throughout)
- fcc65c: "Material risks" + threat enumeration → N/O instead of RMP (borderline)
## Stage 1 Impact Summary
| Metric | Original flag | Tightened flag |
|---|---|---|
| Total flagged | 1,014 | 308 |
| Materiality | 886 | 180 |
| SPAC | 128 | 128 |
| Excluded (generic "could" boilerplate) | — | 706 |
The 706 excluded paragraphs contain generic "could have a material adverse effect" boilerplate that is correctly N/O under both v2.5 and v3.5. Only the remaining 180 contain actual backward-looking or SEC-qualified assessments.
Recommendation: Correct the 180 materiality paragraphs via deterministic regex (label as SI), not via model re-evaluation. Correct the 128 SPACs via the Stage 2 judge (a model is needed to determine the correct non-N/O label for paragraphs that should not have been coded into substantive categories).
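A sketch of how the two correction paths could be routed over data/annotations/stage1-corrections.jsonl, reusing deterministic_si from the sketch above; the flag, text, and label field names are assumptions:

```python
import json
from pathlib import Path

def route_stage1_corrections(path: Path) -> tuple[list[dict], list[dict]]:
    """Split the flagged paragraphs into regex-corrected vs Stage 2 judge queue."""
    regex_fixed, judge_queue = [], []
    with path.open() as f:
        for line in f:
            rec = json.loads(line)
            if rec.get("flag") == "spac":         # assumed flag value
                judge_queue.append(rec)           # judge picks the non-N/O label
            elif deterministic_si(rec["text"]):   # sketch from the section above
                rec["label"] = "SI"
                regex_fixed.append(rec)
    return regex_fixed, judge_queue
```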
## Files Created/Modified
| File | Purpose |
|---|---|
| ts/src/label/prompts.ts | v3.5f locked prompt (PROMPT_VERSION="v3.5") |
| data/annotations/bench-holdout-v35/*.jsonl | 7 models × 359 paragraphs, v3.5f |
| data/annotations/golden-v35/opus.jsonl | Opus v3.5f on 359 paragraphs |
| data/annotations/bench-holdout-v35b/gpt-5.4.jsonl | Iteration test data (26 paragraphs, multiple rounds) |
| data/annotations/stage1-corrections.jsonl | 308 flagged paragraphs (tightened criteria) |
| data/gold/holdout-rerun-v35.jsonl | 359 confusion-axis paragraph IDs |
| data/gold/holdout-rerun-v35b.jsonl | 26 regression paragraph IDs |
| data/gold/regression-pids.json | Regression PIDs by axis |
| scripts/compare-v30-v35.py | v3.0 vs v3.5a comparison |
| scripts/compare-v30-v35-final.py | v3.0 vs v3.5f comparison |
| scripts/examine-v35-errors.py | Error analysis for iteration |
| scripts/extract-regression-pids.py | Identify regression paragraphs |
| scripts/flag-stage1-corrections.py | Flag Stage 1 corrections (tightened) |
| scripts/identify-holdout-rerun.py | Identify confusion-axis holdout paragraphs |
| docs/LABELING-CODEBOOK.md | v3.5 rulings + version history |
| docs/NARRATIVE.md | Phase 15 with full iteration detail |
| docs/STATUS.md | v3.5 section added |