corpus labeled
@ -1,6 +1,6 @@
|
|||||||
outs:
|
outs:
|
||||||
- md5: 2428c0895414e6bd46c229b661a68b6d.dir
|
- md5: b52c8929353b5ed374f10aab8c4e7837.dir
|
||||||
size: 724690357
|
size: 753948666
|
||||||
nfiles: 226
|
nfiles: 234
|
||||||
hash: md5
|
hash: md5
|
||||||
path: .dvc-store
|
path: .dvc-store
|
||||||
|
|||||||
@ -361,18 +361,60 @@ This 8.5% per-pair divergence rate means:
|
|||||||
|
|
||||||
**Cost: $96 for Grok ×3** (3 × $32 through OpenRouter). Leaves $80 for Stage 2 judge and any reruns. An alternative — xAI's Batch API at 50% off — would reduce this to $48, but requires bypassing OpenRouter.
|
**Cost: $96 for Grok ×3** (3 × $32 through OpenRouter). Leaves $80 for Stage 2 judge and any reruns. An alternative — xAI's Batch API at 50% off — would reduce this to $48, but requires bypassing OpenRouter.
|
||||||
|
|
||||||
### Cost of the Reboot (updated)
|
### Stage 1 Results: Grok ×3 Self-Consistency (72,045 paragraphs)
|
||||||
|
|
||||||
|
We ran 3 independent Grok 4.1 Fast passes over the full 72,045-paragraph corpus at concurrency 200. Each run completed in ~33 minutes. Total cost: $129.75 ($43.12–$43.62 per run).
|
||||||
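The fan-out itself is simple; a minimal sketch of a concurrency-capped annotation loop (assumptions flagged: `annotate_paragraph` is a hypothetical async wrapper around the OpenRouter call, not the project's actual client):

```python
import asyncio

CONCURRENCY = 200  # matches the run configuration above


async def annotate_all(paragraphs, annotate_paragraph):
    """Fan out annotation calls with a fixed concurrency cap.

    `annotate_paragraph` is a hypothetical stand-in for the model API
    call; any awaitable returning a label dict works. gather() preserves
    input order, so results line up with `paragraphs`.
    """
    sem = asyncio.Semaphore(CONCURRENCY)

    async def worker(p):
        async with sem:
            return await annotate_paragraph(p)

    return await asyncio.gather(*(worker(p) for p in paragraphs))
```

The semaphore bounds in-flight requests at 200 while still issuing them as fast as slots free up.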
|
|
||||||
|
**Cross-run agreement:**
|
||||||
|
|
||||||
|
| Dimension | Unanimous (3/3) | Majority (2/3) | All disagree |
|
||||||
|
|-----------|-----------------|----------------|--------------|
|
||||||
|
| Category | 68,394 (94.9%) | 3,583 (5.0%) | 68 (0.09%) |
|
||||||
|
| Specificity | 65,780 (91.3%) | 6,120 (8.5%) | 145 (0.20%) |
|
||||||
|
|
||||||
|
Category is near-deterministic — 94.9% unanimous, and the 5% majority cases are concentrated at the BG↔MR and MR↔RMP boundaries (exactly the confusion axes identified during codebook development). Specificity shows the expected stochastic variation at 8.5% majority-only, matching the 8.5% divergence rate observed in the 47-paragraph pilot.
|
||||||
|
|
||||||
|
**Consensus resolution:**
|
||||||
|
- **62,510 (86.8%)** — both unanimous, direct consensus
|
||||||
|
- **9,323 (12.9%)** — majority vote on at least one dimension
|
||||||
|
- **212 (0.3%)** — no majority on at least one dimension, resolved by GPT-5.4 judge
|
||||||
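The resolution rule above can be sketched per paragraph. This mirrors the consensus logic in `scripts/plot-stage1-distributions.py`, assuming each run label is a dict with `content_category` and `specificity_level` keys:

```python
from collections import Counter


def resolve(labels):
    """Classify the 3 run labels for one paragraph.

    Returns (method, category, specificity) where method is
    'unanimous', 'majority', or 'judge' (no 2/3 majority on at least
    one dimension). For 'judge', the plurality values returned here
    are placeholders pending the tiebreaker model.
    """
    cats = Counter(l["content_category"] for l in labels)
    specs = Counter(l["specificity_level"] for l in labels)
    if max(cats.values()) == 3 and max(specs.values()) == 3:
        method = "unanimous"
    elif max(cats.values()) >= 2 and max(specs.values()) >= 2:
        method = "majority"
    else:
        method = "judge"  # escalate to the tiebreaker model
    return method, cats.most_common(1)[0][0], specs.most_common(1)[0][0]
```

Note that a paragraph counts as "majority" if either dimension drops below unanimity, and goes to the judge only when some dimension has three distinct answers.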
|
|
||||||
|
The 212 tiebreaker paragraphs were run through GPT-5.4 with the full judge prompt (disagreement-aware disambiguation rules, shuffled prior annotations). GPT-5.4 agreed with one of the 3 Grok labels on 100% of paragraphs — never inventing a novel answer. This validates that the Grok runs produce reasonable labels and the disagreements are genuine boundary cases, not model failures. Judge cost: $5.76.
|
||||||
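A sketch of the shuffled-prior-annotations idea, which keeps the judge from favoring a label by its position in the prompt (the prompt wording and label strings below are illustrative, not the project's actual judge prompt):

```python
import random


def build_judge_input(paragraph_text, run_labels, seed=None):
    """Assemble the judge's view of one disputed paragraph.

    Prior annotations are shuffled so the judge cannot anchor on run
    order; pass a seed only for reproducible tests.
    """
    rng = random.Random(seed)
    labels = list(run_labels)
    rng.shuffle(labels)
    lines = [f"Annotation {i + 1}: {lab}" for i, lab in enumerate(labels)]
    return (paragraph_text
            + "\n\nPrior annotations (order randomized):\n"
            + "\n".join(lines))
```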
|
|
||||||
|
**Final consensus distribution:**
|
||||||
|
|
||||||
|
| Category | Count | % | | Specificity | Count | % |
|
||||||
|
|----------|-------|---|---|-------------|-------|---|
|
||||||
|
| RMP | 31,201 | 43.3% | | L1: Generic Boilerplate | 29,593 | 41.1% |
|
||||||
|
| BG | 13,876 | 19.3% | | L2: Domain-Adapted | 16,344 | 22.7% |
|
||||||
|
| MR | 10,591 | 14.7% | | L3: Firm-Specific | 17,911 | 24.9% |
|
||||||
|
| SI | 7,470 | 10.4% | | L4: Quantified-Verifiable | 8,197 | 11.4% |
|
||||||
|
| N/O | 4,576 | 6.4% | | | | |
|
||||||
|
| TP | 4,094 | 5.7% | | | | |
|
||||||
|
| ID | 237 | 0.3% | | | | |
|
||||||
|
|
||||||
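Both halves of the table should sum to the corpus size; a quick integrity check over the published counts:

```python
# Counts copied from the final consensus distribution table above.
cat_counts = {"RMP": 31_201, "BG": 13_876, "MR": 10_591, "SI": 7_470,
              "N/O": 4_576, "TP": 4_094, "ID": 237}
spec_counts = {"L1": 29_593, "L2": 16_344, "L3": 17_911, "L4": 8_197}

# Every paragraph gets exactly one category and one specificity level.
assert sum(cat_counts.values()) == 72_045
assert sum(spec_counts.values()) == 72_045
```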
|
**v1→v2 category shifts:** BG rose from 16.0%→19.3% and N/O from 5.0%→6.4%, likely driven by the 22,250 paragraphs in the full corpus that v1 never annotated. RMP dropped from 45.8%→43.3%, partly because the v2 codebook's sharper BG/MR/RMP boundaries reclassified some borderline paragraphs.
|
||||||
|
|
||||||
|
**Specificity is well-distributed.** L2 lands at 22.7%, above the 15% holdout target (the full corpus has more domain-rich paragraphs than the stratified holdout). L3 at 24.9% and L4 at 11.4% reflect the v2 codebook's tightened verifiability standards.
|
||||||
|
|
||||||
|
**Category × specificity interaction (see `figures/stage1-category-specificity-heatmap.png`):** MR is 87% L3/L4 (people have names, titles, and credentials). SI is 92% L1 (materiality boilerplate with no specific facts). ID is 86% L4 (incidents have dates, named threat actors, forensic firms). These patterns are exactly what the codebook predicts and match the holdout validation.
|
||||||
|
|
||||||
|
**Specificity boundary analysis:** The 6,265 paragraphs where runs diverged on specificity are concentrated at adjacent levels: L1↔L2 (2,485), L1↔L3 (1,423), L2↔L3 (1,160), L3↔L4 (707). Cross-level jumps (L1↔L4, L2↔L4) are rare (~280 total). This confirms the self-consistency mechanism is working as intended — it provides tiebreaking signal exactly at the ambiguous boundaries where different reasoning paths legitimately land on different answers.
|
||||||
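Divergent paragraphs are binned by the min/max specificity across the three runs, as in the plotting script, so a spread like `[1, 4, 2]` counts as an L1↔L4 jump even though an intermediate label is present. A minimal sketch of the binning:

```python
def boundary_key(specs):
    """Label the span of divergent specificity levels from 3 runs,
    e.g. [1, 2, 2] -> 'L1↔L2'. Returns None when all runs agree."""
    if len(set(specs)) == 1:
        return None
    return f"L{min(specs)}↔L{max(specs)}"
```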
|
|
||||||
|
### Cost of the Reboot (final)
|
||||||
|
|
||||||
| Item | Estimated Cost | Actual Cost |
|
| Item | Estimated Cost | Actual Cost |
|
||||||
|------|---------------|-------------|
|
|------|---------------|-------------|
|
||||||
| Prompt iteration (v4.0–v4.5, ~8 rounds) | ~$10 | $19.59 |
|
| Prompt iteration (v4.0–v4.5, ~8 rounds) | ~$10 | $19.59 |
|
||||||
| v2 holdout benchmark (10 models + 3 pilots) | ~$45 | $45.47 |
|
| v2 holdout benchmark (10 models + 3 pilots) | ~$45 | $45.47 |
|
||||||
| Stage 1 re-run (Grok ×3, 50K paragraphs) | ~$96 | pending |
|
| Stage 1 re-run (Grok ×3, 72K paragraphs) | ~$96 | $129.75 |
|
||||||
| Stage 2 judge (disputed paragraphs) | ~$20-40 | pending |
|
| Stage 2 judge (212 tiebreaker paragraphs) | ~$20-40 | $5.76 |
|
||||||
| Human re-labeling | $0 (team labor) | pending |
|
| Human re-labeling | $0 (team labor) | pending |
|
||||||
| **Total additional API** | **~$175-185** | |
|
| **Total additional API** | **~$175-185** | **$200.57** |
|
||||||
|
|
||||||
Against the ~$120 already spent on v1 API calls (not recovered). Total project API cost: ~$300-305 of $360 budget.
|
Against the ~$120 already spent on v1 API calls (not recovered). Total project API cost: **$320.57 of $360 budget**. Remaining: **$39.43** — sufficient for any reruns or additional analysis.
|
||||||
|
|
||||||
|
The cost overshoot ($200.57 vs the ~$175-185 estimate) is driven almost entirely by annotating 72K paragraphs instead of the estimated 50K. The unit cost actually came in below estimate: $0.60 per 1,000 paragraphs per run vs the $0.64 estimated ($32/run over 50K), or $1.88 per 1,000 paragraphs for the full 3-run self-consistency + judge pipeline.
|
||||||
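Spelled out with figures from the cost tables above (a back-of-envelope check, not project code):

```python
runs_cost = 129.75   # 3 Grok runs, from the cost table
judge_cost = 5.76    # GPT-5.4 tiebreaker
paragraphs = 72_045

per_run = runs_cost / 3                          # ~$43.25 per run
per_1k_per_run = per_run / (paragraphs / 1000)   # actual unit cost
est_per_1k_per_run = (96 / 3) / (50_000 / 1000)  # the original 50K estimate
pipeline_per_1k = (runs_cost + judge_cost) / (paragraphs / 1000)

print(f"actual ${per_1k_per_run:.2f}/1K per run "
      f"vs ${est_per_1k_per_run:.2f}/1K estimated; "
      f"full pipeline ${pipeline_per_1k:.2f}/1K")
```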
|
|
||||||
---
|
---
|
||||||
|
|
||||||
|
|||||||
@ -1,12 +1,12 @@
|
|||||||
# Project Status — v2 Pipeline
|
# Project Status — v2 Pipeline
|
||||||
|
|
||||||
**Deadline:** 2026-04-24 | **Started:** 2026-04-03 | **Updated:** 2026-04-05 (holdout benchmark done, Grok ×3 selected)
|
**Deadline:** 2026-04-24 | **Started:** 2026-04-03 | **Updated:** 2026-04-05 (Stage 1 complete, 72K×3 + judge)
|
||||||
|
|
||||||
---
|
---
|
||||||
|
|
||||||
## Carried Forward (not re-done)
|
## Carried Forward (not re-done)
|
||||||
|
|
||||||
- 72,045 paragraphs (49,795 annotated), quality tiers, 6 surgical patches
|
- 72,045 paragraphs (all annotated in v2), quality tiers, 6 surgical patches
|
||||||
- DAPT checkpoint (eval loss 0.7250, ~14.5h) + TAPT checkpoint (eval loss 1.0754, ~50min)
|
- DAPT checkpoint (eval loss 0.7250, ~14.5h) + TAPT checkpoint (eval loss 1.0754, ~50min)
|
||||||
- v1 data preserved: 150K Stage 1 annotations, 10-model benchmark, 6-annotator human labels, gold adjudication
|
- v1 data preserved: 150K Stage 1 annotations, 10-model benchmark, 6-annotator human labels, gold adjudication
|
||||||
- v2 codebook approved (5/6 group approval 2026-04-04)
|
- v2 codebook approved (5/6 group approval 2026-04-04)
|
||||||
@ -71,14 +71,18 @@
|
|||||||
- **Top models:** Grok Fast (86.1% both), Opus prompt-only (85.2%), Gemini Pro (84.2%)
|
- **Top models:** Grok Fast (86.1% both), Opus prompt-only (85.2%), Gemini Pro (84.2%)
|
||||||
- **Stage 1 panel:** Grok 4.1 Fast ×3 ($96 estimated)
|
- **Stage 1 panel:** Grok 4.1 Fast ×3 ($96 estimated)
|
||||||
|
|
||||||
### 6. Stage 1 Re-Run ← CURRENT
|
### 6. Stage 1 Re-Run — DONE
|
||||||
- [x] Lock v2 prompt (v4.5)
|
- [x] Lock v2 prompt (v4.5)
|
||||||
- [x] Model selection: Grok 4.1 Fast ×3 (self-consistency)
|
- [x] Model selection: Grok 4.1 Fast ×3 (self-consistency)
|
||||||
- [ ] Re-run Stage 1 on full corpus (~50K paragraphs × 3 runs)
|
- [x] Re-run Stage 1 on full corpus (72,045 paragraphs × 3 runs, concurrency 200)
|
||||||
- [ ] Distribution check: L2 ~15-17%, categories healthy
|
- [x] Cross-run agreement: category 94.9% unanimous, specificity 91.3% unanimous
|
||||||
- **Estimated cost:** ~$96
|
- [x] Consensus: 62,510 unanimous (86.8%), 9,323 majority (12.9%), 212 judge tiebreaker (0.3%)
|
||||||
|
- [x] GPT-5.4 judge on 212 unresolved paragraphs — 100% agreed with a Grok label
|
||||||
|
- [x] Distribution check: L2=22.7% (above 15% target), categories healthy
|
||||||
|
- **Stage 1 cost:** $129.75 (3 runs) + $5.76 (judge) = $135.51
|
||||||
|
- **Run time:** ~33 min per run at concurrency 200
|
||||||
|
|
||||||
### 7. Labelapp Update
|
### 7. Labelapp Update ← CURRENT
|
||||||
- [x] Update quiz questions for v2 codebook (v2 specificity rules, fixed impossible qv-3, all 4 levels as options)
|
- [x] Update quiz questions for v2 codebook (v2 specificity rules, fixed impossible qv-3, all 4 levels as options)
|
||||||
- [x] Update warmup paragraphs with v2 explanations
|
- [x] Update warmup paragraphs with v2 explanations
|
||||||
- [x] Update onboarding content for v2 (Domain-Adapted, 1+ QV, domain terminology lists)
|
- [x] Update onboarding content for v2 (Domain-Adapted, 1+ QV, domain terminology lists)
|
||||||
@ -100,10 +104,10 @@
|
|||||||
- [ ] Gold = majority vote; all-disagree → model consensus tiebreaker
|
- [ ] Gold = majority vote; all-disagree → model consensus tiebreaker
|
||||||
- [ ] Cross-validate against model panel
|
- [ ] Cross-validate against model panel
|
||||||
|
|
||||||
### 10. Stage 2 (if needed)
|
### 10. Stage 2
|
||||||
- [ ] Bench Stage 2 accuracy against gold
|
- [x] GPT-5.4 judge resolved 212 tiebreaker paragraphs during Stage 1 consensus ($5.76)
|
||||||
- [ ] If adds value → run on disputed Stage 1 paragraphs
|
- [ ] Bench Stage 2 accuracy against gold (if needed for additional disputed paragraphs)
|
||||||
- **Estimated cost:** ~$20-40 if run
|
- **Cost so far:** $5.76 | **Remaining budget:** ~$39
|
||||||
|
|
||||||
### 11. Training Data Assembly
|
### 11. Training Data Assembly
|
||||||
- [ ] Unanimous Stage 1 → full weight, calibrated majority → full weight
|
- [ ] Unanimous Stage 1 → full weight, calibrated majority → full weight
|
||||||
@ -154,6 +158,32 @@
|
|||||||
| Benchmark analysis | `scripts/analyze-v2-bench.py` |
|
| Benchmark analysis | `scripts/analyze-v2-bench.py` |
|
||||||
| Stage 1 prompt | `ts/src/label/prompts.ts` (v4.5) |
|
| Stage 1 prompt | `ts/src/label/prompts.ts` (v4.5) |
|
||||||
| Holdout sampling script | `scripts/sample-v2-holdout.py` |
|
| Holdout sampling script | `scripts/sample-v2-holdout.py` |
|
||||||
|
| v2 Stage 1 run 1 | `data/annotations/v2-stage1/grok-4.1-fast.run1.jsonl` (72,045) |
|
||||||
|
| v2 Stage 1 run 2 | `data/annotations/v2-stage1/grok-4.1-fast.run2.jsonl` (72,045) |
|
||||||
|
| v2 Stage 1 run 3 | `data/annotations/v2-stage1/grok-4.1-fast.run3.jsonl` (72,045) |
|
||||||
|
| v2 Stage 1 consensus | `data/annotations/v2-stage1/consensus.jsonl` (72,045) |
|
||||||
|
| v2 Stage 1 judge | `data/annotations/v2-stage1/judge.jsonl` (212 tiebreakers) |
|
||||||
|
| Stage 1 distribution charts | `figures/stage1-*.png` (7 charts) |
|
||||||
|
| Stage 1 chart script | `scripts/plot-stage1-distributions.py` |
|
||||||
|
|
||||||
|
### v2 Stage 1 Distribution (72,045 paragraphs, v4.5 prompt, Grok ×3 consensus + GPT-5.4 judge)
|
||||||
|
|
||||||
|
| Category | Count | % |
|
||||||
|
|----------|-------|---|
|
||||||
|
| RMP | 31,201 | 43.3% |
|
||||||
|
| BG | 13,876 | 19.3% |
|
||||||
|
| MR | 10,591 | 14.7% |
|
||||||
|
| SI | 7,470 | 10.4% |
|
||||||
|
| N/O | 4,576 | 6.4% |
|
||||||
|
| TP | 4,094 | 5.7% |
|
||||||
|
| ID | 237 | 0.3% |
|
||||||
|
|
||||||
|
| Specificity | Count | % |
|
||||||
|
|-------------|-------|---|
|
||||||
|
| L1 | 29,593 | 41.1% |
|
||||||
|
| L2 | 16,344 | 22.7% |
|
||||||
|
| L3 | 17,911 | 24.9% |
|
||||||
|
| L4 | 8,197 | 11.4% |
|
||||||
|
|
||||||
### v1 Stage 1 Distribution (50,003 paragraphs, v2.5 prompt, 3-model consensus)
|
### v1 Stage 1 Distribution (50,003 paragraphs, v2.5 prompt, 3-model consensus)
|
||||||
|
|
||||||
|
|||||||
BIN
figures/stage1-category-distribution.png
Normal file
|
After Width: | Height: | Size: 93 KiB |
BIN
figures/stage1-category-specificity-heatmap.png
Normal file
|
After Width: | Height: | Size: 148 KiB |
BIN
figures/stage1-consensus-methods.png
Normal file
|
After Width: | Height: | Size: 96 KiB |
BIN
figures/stage1-cross-run-agreement.png
Normal file
|
After Width: | Height: | Size: 78 KiB |
BIN
figures/stage1-specificity-boundaries.png
Normal file
|
After Width: | Height: | Size: 79 KiB |
BIN
figures/stage1-specificity-distribution.png
Normal file
|
After Width: | Height: | Size: 85 KiB |
BIN
figures/stage1-v1-vs-v2-categories.png
Normal file
|
After Width: | Height: | Size: 78 KiB |
373
scripts/plot-stage1-distributions.py
Normal file
@ -0,0 +1,373 @@
|
|||||||
|
"""
|
||||||
|
Stage 1 (v2) distribution charts for the writeup.
|
||||||
|
Generates: category distribution, specificity distribution,
|
||||||
|
cross-run agreement, consensus method breakdown, and
|
||||||
|
specificity disagreement boundary analysis.
|
||||||
|
|
||||||
|
Usage: uv run --with matplotlib --with numpy scripts/plot-stage1-distributions.py
|
||||||
|
"""
|
||||||
|
|
||||||
|
import json
|
||||||
|
import collections
|
||||||
|
from pathlib import Path
|
||||||
|
import matplotlib
|
||||||
|
matplotlib.use("Agg")
|
||||||
|
import matplotlib.pyplot as plt
|
||||||
|
import numpy as np
|
||||||
|
|
||||||
|
DATA = Path(__file__).resolve().parent.parent / "data"
|
||||||
|
FIGS = Path(__file__).resolve().parent.parent / "figures"
|
||||||
|
FIGS.mkdir(exist_ok=True)
|
||||||
|
|
||||||
|
# ── Color palette ────────────────────────────────────────────────────────────
|
||||||
|
CAT_COLORS = {
|
||||||
|
"Risk Management Process": "#2196F3",
|
||||||
|
"Board Governance": "#4CAF50",
|
||||||
|
"Management Role": "#FF9800",
|
||||||
|
"Strategy Integration": "#9C27B0",
|
||||||
|
"None/Other": "#607D8B",
|
||||||
|
"Third-Party Risk": "#F44336",
|
||||||
|
"Incident Disclosure": "#00BCD4",
|
||||||
|
}
|
||||||
|
CAT_ABBREV = {
|
||||||
|
"Risk Management Process": "RMP",
|
||||||
|
"Board Governance": "BG",
|
||||||
|
"Management Role": "MR",
|
||||||
|
"Strategy Integration": "SI",
|
||||||
|
"None/Other": "N/O",
|
||||||
|
"Third-Party Risk": "TP",
|
||||||
|
"Incident Disclosure": "ID",
|
||||||
|
}
|
||||||
|
SPEC_COLORS = ["#BDBDBD", "#64B5F6", "#FFB74D", "#EF5350"]
|
||||||
|
SPEC_LABELS = ["L1: Generic\nBoilerplate", "L2: Domain-\nAdapted", "L3: Firm-\nSpecific", "L4: Quantified-\nVerifiable"]
|
||||||
|
|
||||||
|
# ── Load data ────────────────────────────────────────────────────────────────
|
||||||
|
runs = {}
|
||||||
|
for run in [1, 2, 3]:
|
||||||
|
path = DATA / f"annotations/v2-stage1/grok-4.1-fast.run{run}.jsonl"
|
||||||
|
runs[run] = {}
|
||||||
|
with open(path) as f:
|
||||||
|
for line in f:
|
||||||
|
r = json.loads(line)
|
||||||
|
runs[run][r["paragraphId"]] = r["label"]
|
||||||
|
|
||||||
|
# Load judge results
|
||||||
|
judge = {}
|
||||||
|
judge_path = DATA / "annotations/v2-stage1/judge.jsonl"
|
||||||
|
if judge_path.exists():
|
||||||
|
with open(judge_path) as f:
|
||||||
|
for line in f:
|
||||||
|
r = json.loads(line)
|
||||||
|
judge[r["paragraphId"]] = r["label"]
|
||||||
|
|
||||||
|
all_ids = sorted(set(runs[1]) & set(runs[2]) & set(runs[3]))
|
||||||
|
N = len(all_ids)
|
||||||
|
print(f"Loaded {N} paragraphs across 3 runs, {len(judge)} judge results")
|
||||||
|
|
||||||
|
# ── Compute consensus labels ─────────────────────────────────────────────────
|
||||||
|
final_cats = []
|
||||||
|
final_specs = []
|
||||||
|
consensus_methods = collections.Counter()
|
||||||
|
|
||||||
|
for pid in all_ids:
|
||||||
|
cats = [runs[r][pid]["content_category"] for r in [1, 2, 3]]
|
||||||
|
specs = [runs[r][pid]["specificity_level"] for r in [1, 2, 3]]
|
||||||
|
cat_counts = collections.Counter(cats)
|
||||||
|
spec_counts = collections.Counter(specs)
|
||||||
|
cat_max = max(cat_counts.values())
|
||||||
|
spec_max = max(spec_counts.values())
|
||||||
|
|
||||||
|
if cat_max == 3 and spec_max == 3:
|
||||||
|
consensus_methods["Unanimous (3/3)"] += 1
|
||||||
|
final_cats.append(cat_counts.most_common(1)[0][0])
|
||||||
|
final_specs.append(spec_counts.most_common(1)[0][0])
|
||||||
|
elif cat_max >= 2 and spec_max >= 2:
|
||||||
|
consensus_methods["Majority (2/3)"] += 1
|
||||||
|
final_cats.append(cat_counts.most_common(1)[0][0])
|
||||||
|
final_specs.append(spec_counts.most_common(1)[0][0])
|
||||||
|
else:
|
||||||
|
# Judge tiebreaker
|
||||||
|
if pid in judge:
|
||||||
|
consensus_methods["Judge tiebreaker"] += 1
|
||||||
|
final_cats.append(judge[pid]["content_category"])
|
||||||
|
final_specs.append(judge[pid]["specificity_level"])
|
||||||
|
else:
|
||||||
|
consensus_methods["Unresolved"] += 1
|
||||||
|
final_cats.append(cat_counts.most_common(1)[0][0])
|
||||||
|
final_specs.append(spec_counts.most_common(1)[0][0])
|
||||||
|
|
||||||
|
plt.rcParams.update({
|
||||||
|
"font.family": "sans-serif",
|
||||||
|
"font.size": 11,
|
||||||
|
"axes.titlesize": 13,
|
||||||
|
"axes.titleweight": "bold",
|
||||||
|
"figure.facecolor": "white",
|
||||||
|
})
|
||||||
|
|
||||||
|
|
||||||
|
# ══════════════════════════════════════════════════════════════════════════════
|
||||||
|
# FIGURE 1: Category Distribution (final consensus)
|
||||||
|
# ══════════════════════════════════════════════════════════════════════════════
|
||||||
|
cat_counts_final = collections.Counter(final_cats)
|
||||||
|
cat_order = ["Risk Management Process", "Board Governance", "Management Role",
|
||||||
|
"Strategy Integration", "None/Other", "Third-Party Risk", "Incident Disclosure"]
|
||||||
|
|
||||||
|
fig, ax = plt.subplots(figsize=(10, 5))
|
||||||
|
x = np.arange(len(cat_order))
|
||||||
|
counts = [cat_counts_final[c] for c in cat_order]
|
||||||
|
colors = [CAT_COLORS[c] for c in cat_order]
|
||||||
|
bars = ax.bar(x, counts, color=colors, edgecolor="white", linewidth=0.5)
|
||||||
|
|
||||||
|
for bar, count in zip(bars, counts):
|
||||||
|
pct = count / N * 100
|
||||||
|
ax.text(bar.get_x() + bar.get_width() / 2, bar.get_height() + 200,
|
||||||
|
f"{count:,}\n({pct:.1f}%)", ha="center", va="bottom", fontsize=9)
|
||||||
|
|
||||||
|
ax.set_xticks(x)
|
||||||
|
ax.set_xticklabels([CAT_ABBREV[c] for c in cat_order], fontsize=11)
|
||||||
|
ax.set_ylabel("Paragraphs")
|
||||||
|
ax.set_title("Content Category Distribution — Stage 1 Consensus (72,045 paragraphs)")
|
||||||
|
ax.set_ylim(0, max(counts) * 1.18)
|
||||||
|
ax.spines["top"].set_visible(False)
|
||||||
|
ax.spines["right"].set_visible(False)
|
||||||
|
fig.tight_layout()
|
||||||
|
fig.savefig(FIGS / "stage1-category-distribution.png", dpi=200)
|
||||||
|
plt.close(fig)
|
||||||
|
print(" ✓ stage1-category-distribution.png")
|
||||||
|
|
||||||
|
|
||||||
|
# ══════════════════════════════════════════════════════════════════════════════
|
||||||
|
# FIGURE 2: Specificity Distribution (final consensus)
|
||||||
|
# ══════════════════════════════════════════════════════════════════════════════
|
||||||
|
spec_counts_final = collections.Counter(final_specs)
|
||||||
|
|
||||||
|
fig, ax = plt.subplots(figsize=(8, 5))
|
||||||
|
x = np.arange(4)
|
||||||
|
counts = [spec_counts_final.get(i + 1, 0) for i in range(4)]
|
||||||
|
bars = ax.bar(x, counts, color=SPEC_COLORS, edgecolor="white", linewidth=0.5)
|
||||||
|
|
||||||
|
for bar, count in zip(bars, counts):
|
||||||
|
pct = count / N * 100
|
||||||
|
ax.text(bar.get_x() + bar.get_width() / 2, bar.get_height() + 200,
|
||||||
|
f"{count:,}\n({pct:.1f}%)", ha="center", va="bottom", fontsize=9)
|
||||||
|
|
||||||
|
ax.set_xticks(x)
|
||||||
|
ax.set_xticklabels(SPEC_LABELS, fontsize=10)
|
||||||
|
ax.set_ylabel("Paragraphs")
|
||||||
|
ax.set_title("Specificity Distribution — Stage 1 Consensus (72,045 paragraphs)")
|
||||||
|
ax.set_ylim(0, max(counts) * 1.18)
|
||||||
|
ax.spines["top"].set_visible(False)
|
||||||
|
ax.spines["right"].set_visible(False)
|
||||||
|
fig.tight_layout()
|
||||||
|
fig.savefig(FIGS / "stage1-specificity-distribution.png", dpi=200)
|
||||||
|
plt.close(fig)
|
||||||
|
print(" ✓ stage1-specificity-distribution.png")
|
||||||
|
|
||||||
|
|
||||||
|
# ══════════════════════════════════════════════════════════════════════════════
|
||||||
|
# FIGURE 3: Cross-run agreement (stacked bar showing unanimity rates)
|
||||||
|
# ══════════════════════════════════════════════════════════════════════════════
|
||||||
|
cat_agreement = {"Unanimous": 0, "Majority": 0, "All disagree": 0}
|
||||||
|
spec_agreement = {"Unanimous": 0, "Majority": 0, "All disagree": 0}
|
||||||
|
|
||||||
|
for pid in all_ids:
|
||||||
|
cats = [runs[r][pid]["content_category"] for r in [1, 2, 3]]
|
||||||
|
specs = [runs[r][pid]["specificity_level"] for r in [1, 2, 3]]
|
||||||
|
cat_n = len(set(cats))
|
||||||
|
spec_n = len(set(specs))
|
||||||
|
|
||||||
|
if cat_n == 1: cat_agreement["Unanimous"] += 1
|
||||||
|
elif cat_n == 2: cat_agreement["Majority"] += 1
|
||||||
|
else: cat_agreement["All disagree"] += 1
|
||||||
|
|
||||||
|
if spec_n == 1: spec_agreement["Unanimous"] += 1
|
||||||
|
elif spec_n == 2: spec_agreement["Majority"] += 1
|
||||||
|
else: spec_agreement["All disagree"] += 1
|
||||||
|
|
||||||
|
fig, ax = plt.subplots(figsize=(8, 5))
|
||||||
|
dims = ["Category", "Specificity"]
|
||||||
|
unanimous = [cat_agreement["Unanimous"] / N * 100, spec_agreement["Unanimous"] / N * 100]
|
||||||
|
majority = [cat_agreement["Majority"] / N * 100, spec_agreement["Majority"] / N * 100]
|
||||||
|
alldis = [cat_agreement["All disagree"] / N * 100, spec_agreement["All disagree"] / N * 100]
|
||||||
|
|
||||||
|
x = np.arange(len(dims))
|
||||||
|
w = 0.5
|
||||||
|
b1 = ax.bar(x, unanimous, w, label="Unanimous (3/3)", color="#4CAF50")
|
||||||
|
b2 = ax.bar(x, majority, w, bottom=unanimous, label="Majority (2/3)", color="#FFC107")
|
||||||
|
b3 = ax.bar(x, alldis, w, bottom=[u + m for u, m in zip(unanimous, majority)],
|
||||||
|
label="All disagree", color="#F44336")
|
||||||
|
|
||||||
|
for i, (u, m, a) in enumerate(zip(unanimous, majority, alldis)):
|
||||||
|
ax.text(i, u / 2, f"{u:.1f}%", ha="center", va="center", fontsize=11, fontweight="bold", color="white")
|
||||||
|
if m > 2:
|
||||||
|
ax.text(i, u + m / 2, f"{m:.1f}%", ha="center", va="center", fontsize=10, color="black")
|
||||||
|
if a > 0.5:
|
||||||
|
ax.text(i, u + m + a / 2, f"{a:.2f}%", ha="center", va="center", fontsize=8, color="white")
|
||||||
|
|
||||||
|
ax.set_xticks(x)
|
||||||
|
ax.set_xticklabels(dims, fontsize=12)
|
||||||
|
ax.set_ylabel("Percentage of paragraphs")
|
||||||
|
ax.set_title("Grok ×3 Cross-Run Agreement (72,045 paragraphs)")
|
||||||
|
ax.set_ylim(0, 105)
|
||||||
|
ax.legend(loc="upper right", fontsize=10)
|
||||||
|
ax.spines["top"].set_visible(False)
|
||||||
|
ax.spines["right"].set_visible(False)
|
||||||
|
fig.tight_layout()
|
||||||
|
fig.savefig(FIGS / "stage1-cross-run-agreement.png", dpi=200)
|
||||||
|
plt.close(fig)
|
||||||
|
print(" ✓ stage1-cross-run-agreement.png")
|
||||||
|
|
||||||
|
|
||||||
|
# ══════════════════════════════════════════════════════════════════════════════
|
||||||
|
# FIGURE 4: Consensus method breakdown (pie/donut)
|
||||||
|
# ══════════════════════════════════════════════════════════════════════════════
|
||||||
|
fig, ax = plt.subplots(figsize=(7, 7))
|
||||||
|
method_order = ["Unanimous (3/3)", "Majority (2/3)", "Judge tiebreaker"]
|
||||||
|
method_counts = [consensus_methods.get(m, 0) for m in method_order]
|
||||||
|
method_colors = ["#4CAF50", "#FFC107", "#2196F3"]
|
||||||
|
|
||||||
|
wedges, texts, autotexts = ax.pie(
|
||||||
|
method_counts, labels=method_order, colors=method_colors,
|
||||||
|
autopct=lambda p: f"{p:.1f}%\n({int(round(p * N / 100)):,})",
|
||||||
|
startangle=90, pctdistance=0.75,
|
||||||
|
wedgeprops=dict(width=0.45, edgecolor="white", linewidth=2),
|
||||||
|
)
|
||||||
|
for t in autotexts:
|
||||||
|
t.set_fontsize(10)
|
||||||
|
ax.set_title("Consensus Resolution Method — Stage 1")
|
||||||
|
fig.tight_layout()
|
||||||
|
fig.savefig(FIGS / "stage1-consensus-methods.png", dpi=200)
|
||||||
|
plt.close(fig)
|
||||||
|
print(" ✓ stage1-consensus-methods.png")
|
||||||
|
|
||||||
|
|
||||||
|
# ══════════════════════════════════════════════════════════════════════════════
|
||||||
|
# FIGURE 5: Specificity boundary disagreements (where runs diverge)
|
||||||
|
# ══════════════════════════════════════════════════════════════════════════════
|
||||||
|
boundary_counts = collections.Counter()
|
||||||
|
for pid in all_ids:
|
||||||
|
specs = [runs[r][pid]["specificity_level"] for r in [1, 2, 3]]
|
||||||
|
if len(set(specs)) == 1:
|
||||||
|
continue
|
||||||
|
low, high = min(specs), max(specs)
|
||||||
|
boundary_counts[f"L{low}↔L{high}"] += 1
|
||||||
|
|
||||||
|
fig, ax = plt.subplots(figsize=(8, 5))
|
||||||
|
boundaries = sorted(boundary_counts.keys())
|
||||||
|
counts = [boundary_counts[b] for b in boundaries]
|
||||||
|
colors_b = ["#90CAF9" if "1↔2" in b else "#FFE082" if "2↔3" in b or "1↔3" in b
|
||||||
|
else "#EF9A9A" if "3↔4" in b else "#CE93D8" for b in boundaries]
|
||||||
|
bars = ax.barh(boundaries, counts, color=colors_b, edgecolor="white")
|
||||||
|
|
||||||
|
for bar, count in zip(bars, counts):
|
||||||
|
ax.text(bar.get_width() + 20, bar.get_y() + bar.get_height() / 2,
|
||||||
|
f"{count:,} ({count / N * 100:.1f}%)", va="center", fontsize=10)
|
||||||
|
|
||||||
|
ax.set_xlabel("Paragraphs with divergent specificity")
|
||||||
|
ax.set_title("Specificity Boundary Disagreements Across 3 Grok Runs")
|
||||||
|
ax.spines["top"].set_visible(False)
|
||||||
|
ax.spines["right"].set_visible(False)
|
||||||
|
ax.set_xlim(0, max(counts) * 1.25)
|
||||||
|
fig.tight_layout()
|
||||||
|
fig.savefig(FIGS / "stage1-specificity-boundaries.png", dpi=200)
|
||||||
|
plt.close(fig)
|
||||||
|
print(" ✓ stage1-specificity-boundaries.png")
|
||||||
|
|
||||||
|
|
||||||
|
# ══════════════════════════════════════════════════════════════════════════════
|
||||||
|
# FIGURE 6: Category × Specificity heatmap (final consensus)
|
||||||
|
# ══════════════════════════════════════════════════════════════════════════════
|
||||||
|
cat_spec_matrix = np.zeros((len(cat_order), 4))
|
||||||
|
for cat, spec in zip(final_cats, final_specs):
|
||||||
|
i = cat_order.index(cat)
|
||||||
|
cat_spec_matrix[i, spec - 1] += 1
|
||||||
|
|
||||||
|
# Normalize to row percentages
|
||||||
|
row_sums = cat_spec_matrix.sum(axis=1, keepdims=True)
|
||||||
|
cat_spec_pct = cat_spec_matrix / row_sums * 100
|
||||||
|
|
||||||
|
fig, ax = plt.subplots(figsize=(9, 6))
|
||||||
|
im = ax.imshow(cat_spec_pct, cmap="YlOrRd", aspect="auto")
|
||||||
|
|
||||||
|
for i in range(len(cat_order)):
|
||||||
|
for j in range(4):
|
||||||
|
count = int(cat_spec_matrix[i, j])
|
||||||
|
pct = cat_spec_pct[i, j]
|
||||||
|
color = "white" if pct > 50 else "black"
|
||||||
|
ax.text(j, i, f"{count:,}\n({pct:.0f}%)", ha="center", va="center",
|
||||||
|
fontsize=8, color=color)
|
||||||
|
|
||||||
|
ax.set_xticks(range(4))
|
||||||
|
ax.set_xticklabels(["L1", "L2", "L3", "L4"], fontsize=11)
|
||||||
|
ax.set_yticks(range(len(cat_order)))
|
||||||
|
ax.set_yticklabels([CAT_ABBREV[c] for c in cat_order], fontsize=11)
|
||||||
|
ax.set_xlabel("Specificity Level")
|
||||||
|
ax.set_ylabel("Content Category")
|
||||||
|
ax.set_title("Category × Specificity — Stage 1 Consensus (row %)")
|
||||||
|
fig.colorbar(im, ax=ax, label="Row %", shrink=0.8)
|
||||||
|
fig.tight_layout()
|
||||||
|
fig.savefig(FIGS / "stage1-category-specificity-heatmap.png", dpi=200)
|
||||||
|
plt.close(fig)
|
||||||
|
print(" ✓ stage1-category-specificity-heatmap.png")
|
||||||
|
|
||||||
|
|
||||||
|
# ══════════════════════════════════════════════════════════════════════════════
|
||||||
|
# FIGURE 7: v1 vs v2 category comparison
|
||||||
|
# ══════════════════════════════════════════════════════════════════════════════
|
||||||
|
# v1 distribution from STATUS.md (50,003 paragraphs, different base)
|
||||||
|
v1_pct = {
|
||||||
|
"Risk Management Process": 45.8,
|
||||||
|
"Management Role": 17.6,
|
||||||
|
"Board Governance": 16.0,
|
||||||
|
"Strategy Integration": 10.0,
|
||||||
|
"None/Other": 5.0,
|
||||||
|
"Third-Party Risk": 5.0,
|
||||||
|
"Incident Disclosure": 0.6,
|
||||||
|
}
|
||||||
|
v2_pct = {c: cat_counts_final[c] / N * 100 for c in cat_order}
|
||||||
|
|
||||||
|
fig, ax = plt.subplots(figsize=(10, 5))
|
||||||
|
x = np.arange(len(cat_order))
|
||||||
|
w = 0.35
|
||||||
|
b1 = ax.bar(x - w / 2, [v1_pct[c] for c in cat_order], w, label="v1 (50K, 3-model panel)",
|
||||||
|
color="#90CAF9", edgecolor="white")
|
||||||
|
b2 = ax.bar(x + w / 2, [v2_pct[c] for c in cat_order], w, label="v2 (72K, Grok ×3)",
|
||||||
|
color="#2196F3", edgecolor="white")
|
||||||
|
|
||||||
|
for bar_group in [b1, b2]:
|
||||||
|
for bar in bar_group:
|
||||||
|
h = bar.get_height()
|
||||||
|
ax.text(bar.get_x() + bar.get_width() / 2, h + 0.3,
|
||||||
|
f"{h:.1f}%", ha="center", va="bottom", fontsize=8)
|
||||||
|
|
||||||
|
ax.set_xticks(x)
|
||||||
|
ax.set_xticklabels([CAT_ABBREV[c] for c in cat_order], fontsize=11)
|
||||||
|
ax.set_ylabel("Percentage")
|
||||||
|
ax.set_title("Category Distribution: v1 vs v2 Stage 1")
|
||||||
|
ax.legend(fontsize=10)
|
||||||
|
ax.spines["top"].set_visible(False)
|
||||||
|
ax.spines["right"].set_visible(False)
|
||||||
|
fig.tight_layout()
|
||||||
|
fig.savefig(FIGS / "stage1-v1-vs-v2-categories.png", dpi=200)
|
||||||
|
plt.close(fig)
|
||||||
|
print(" ✓ stage1-v1-vs-v2-categories.png")
|
||||||
|
|
||||||
|
|
||||||
|
# ── Print summary stats ──────────────────────────────────────────────────────
|
||||||
|
print(f"\n{'═' * 60}")
|
||||||
|
print(f"Stage 1 Consensus Summary")
|
||||||
|
print(f"{'═' * 60}")
|
||||||
|
print(f"Total paragraphs: {N:,}")
|
||||||
|
print(f"\nConsensus methods:")
|
||||||
|
for m in ["Unanimous (3/3)", "Majority (2/3)", "Judge tiebreaker"]:
|
||||||
|
c = consensus_methods.get(m, 0)
|
||||||
|
print(f" {m}: {c:,} ({c / N * 100:.1f}%)")
|
||||||
|
print(f"\nCategory distribution (consensus):")
|
||||||
|
for c in cat_order:
|
||||||
|
n = cat_counts_final[c]
|
||||||
|
print(f" {CAT_ABBREV[c]:4s} {n:>6,} ({n / N * 100:.1f}%)")
|
||||||
|
print(f"\nSpecificity distribution (consensus):")
|
||||||
|
for i in range(4):
|
||||||
|
n = spec_counts_final.get(i + 1, 0)
|
||||||
|
print(f" L{i + 1} {n:>6,} ({n / N * 100:.1f}%)")
|
||||||
```diff
@@ -1,3 +1,4 @@
+import pLimit from "p-limit";
 import { readJsonl } from "./lib/jsonl.ts";
 import { Paragraph } from "@sec-cybert/schemas/paragraph.ts";
 import { Annotation } from "@sec-cybert/schemas/annotation.ts";
@@ -142,7 +143,7 @@ async function cmdConsensus(): Promise<void> {
   // Only process paragraphs with all 3 annotations
   let consensus = 0;
   let needsJudge = 0;
-  const outputPath = `${DATA}/annotations/consensus.jsonl`;
+  const outputPath = `${DATA}/annotations/${inputDir}/consensus.jsonl`;

   for (const [paragraphId, anns] of allAnnotations) {
     if (anns.length !== 3) continue;
@@ -163,20 +164,23 @@

 async function cmdJudge(): Promise<void> {
   // Load paragraphs and consensus results needing judge
-  const paragraphs = await loadParagraphs();
+  const paragraphsPath = `${DATA}/paragraphs/paragraphs-clean.patched.jsonl`;
+  const { records: paragraphs, skipped: pSkipped } = await readJsonl(paragraphsPath, Paragraph);
+  if (pSkipped > 0) process.stderr.write(`  ⚠ Skipped ${pSkipped} invalid paragraph lines\n`);
+  process.stderr.write(`  Loaded ${paragraphs.length} paragraphs\n`);
   const paragraphMap = new Map(paragraphs.map((p) => [p.id, p]));

-  const consensusPath = `${DATA}/annotations/consensus.jsonl`;
+  const judgeDir = flag("input-dir") ?? "v2-stage1";
+  const consensusPath = `${DATA}/annotations/${judgeDir}/consensus.jsonl`;
   const { records: rawConsensus } = await readJsonlRaw(consensusPath);

   // Load all stage 1 annotations for lookup (3 self-consistency runs)
   const stage1Map: Map<string, Annotation[]> = new Map();
-  const judgeInputDir = flag("input-dir") ?? "v2-stage1";
   const judgeModelShort = STAGE1_MODEL.split("/")[1]!;

   for (let run = 1; run <= STAGE1_RUNS; run++) {
     const { records } = await readJsonl(
-      `${DATA}/annotations/${judgeInputDir}/${judgeModelShort}.run${run}.jsonl`,
+      `${DATA}/annotations/${judgeDir}/${judgeModelShort}.run${run}.jsonl`,
       Annotation,
     );
     for (const ann of records) {
@@ -198,7 +202,7 @@ async function cmdJudge(): Promise<void> {
   }

   // Check what's already judged
-  const judgePath = `${DATA}/annotations/stage2/judge.jsonl`;
+  const judgePath = `${DATA}/annotations/${judgeDir}/judge.jsonl`;
   const { records: existing } = await readJsonlRaw(judgePath);
   const judgedIds = new Set(
     existing
@@ -208,43 +212,52 @@ async function cmdJudge(): Promise<void> {
   );

   const toJudge = unresolvedIds.filter((id) => !judgedIds.has(id));
-  process.stderr.write(`  ${toJudge.length} paragraphs to judge (${judgedIds.size} already done)\n`);
+  const judgeModelId = flag("model") ?? "openai/gpt-5.4";
+  const concurrency = flagInt("concurrency", 30);
+  process.stderr.write(`  ${toJudge.length} paragraphs to judge (${judgedIds.size} already done) │ ${judgeModelId} │ concurrency=${concurrency}\n`);

   const runId = uuidv4();
   let processed = 0;
+  let errored = 0;
+  const limit_ = pLimit(concurrency);

-  for (const paragraphId of toJudge) {
-    const paragraph = paragraphMap.get(paragraphId);
-    if (!paragraph) continue;
+  const tasks = toJudge.map((paragraphId) =>
+    limit_(async () => {
+      const paragraph = paragraphMap.get(paragraphId);
+      if (!paragraph) return;

-    const stage1Anns = stage1Map.get(paragraphId);
-    if (!stage1Anns || stage1Anns.length < 3) continue;
+      const stage1Anns = stage1Map.get(paragraphId);
+      if (!stage1Anns || stage1Anns.length < 3) return;

-    const priorLabels = stage1Anns.map((a) => ({
-      content_category: a.label.content_category,
-      specificity_level: a.label.specificity_level,
-      reasoning: a.label.reasoning,
-    }));
+      const priorLabels = stage1Anns.map((a) => ({
+        content_category: a.label.content_category,
+        specificity_level: a.label.specificity_level,
+        reasoning: a.label.reasoning,
+      }));

-    try {
-      const judgeAnn = await judgeParagraph(paragraph, priorLabels, {
-        runId,
-        promptVersion: PROMPT_VERSION,
-      });
-      await appendJsonl(judgePath, judgeAnn);
-      processed++;
+      try {
+        const judgeAnn = await judgeParagraph(paragraph, priorLabels, {
+          runId,
+          promptVersion: PROMPT_VERSION,
+          modelId: judgeModelId,
+        });
+        await appendJsonl(judgePath, judgeAnn);
+        processed++;

-      if (processed % 10 === 0) {
-        process.stderr.write(`  Judged ${processed}/${toJudge.length}\n`);
-      }
-    } catch (error) {
-      process.stderr.write(
-        `  ✖ Judge error for ${paragraphId}: ${error instanceof Error ? error.message : String(error)}\n`,
-      );
-    }
-  }
+        if (processed % 10 === 0) {
+          process.stderr.write(`  Judged ${processed}/${toJudge.length}\n`);
+        }
+      } catch (error) {
+        errored++;
+        process.stderr.write(
+          `  ✖ Judge error for ${paragraphId}: ${error instanceof Error ? error.message : String(error)}\n`,
+        );
+      }
+    }),
+  );

-  process.stderr.write(`\n  ✓ Judged ${processed} paragraphs\n`);
+  await Promise.all(tasks);
+  process.stderr.write(`\n  ✓ Judged ${processed} paragraphs (${errored} errors)\n`);
 }

 async function cmdGolden(): Promise<void> {
```
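The diff above swaps a sequential for-loop for tasks gated through `pLimit(concurrency)`, so up to `concurrency` judge calls are in flight at once while errors are counted instead of aborting the run. A sketch of the same pattern in Python, with `asyncio.Semaphore` standing in for `p-limit` and a stub `judge` in place of the real model call:

```python
import asyncio


async def judge(paragraph_id: str) -> str:
    """Stand-in for the real judge call (no network here)."""
    await asyncio.sleep(0)
    return f"judged:{paragraph_id}"


async def judge_all(ids, concurrency=30):
    sem = asyncio.Semaphore(concurrency)  # plays the role of pLimit(concurrency)
    results, errored = [], 0

    async def task(pid):
        nonlocal errored
        async with sem:  # at most `concurrency` calls in flight
            try:
                results.append(await judge(pid))
            except Exception:
                errored += 1  # count the failure, keep the run going

    await asyncio.gather(*(task(i) for i in ids))
    return results, errored


results, errored = asyncio.run(judge_all([f"p{i}" for i in range(5)], concurrency=2))
print(len(results), errored)  # → 5 0
```

The key design point mirrored from the diff: the limiter wraps each task at creation time, so the full task list can be built up front and awaited with a single `gather`/`Promise.all`, rather than batching manually.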
```diff
@@ -119,6 +119,8 @@ export async function annotateParagraph(
 export interface JudgeOpts {
   runId: string;
   promptVersion?: string;
+  /** Override judge model. Defaults to Claude Sonnet 4.6. */
+  modelId?: string;
 }

 /**
@@ -216,8 +218,7 @@ export async function judgeParagraph(
   }>,
   opts: JudgeOpts,
 ): Promise<Annotation> {
-  const { runId, promptVersion = PROMPT_VERSION } = opts;
-  const modelId = "anthropic/claude-sonnet-4.6";
+  const { runId, promptVersion = PROMPT_VERSION, modelId = "anthropic/claude-sonnet-4.6" } = opts;
   const requestedAt = new Date().toISOString();
   const start = Date.now();
```
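The `JudgeOpts` change above turns the hard-coded judge model into an optional override with a default, so the CLI's `--model` flag can swap judges without touching the library. A rough Python equivalent of that options pattern (class and function names are illustrative, not the pipeline's):

```python
from dataclasses import dataclass

DEFAULT_JUDGE_MODEL = "anthropic/claude-sonnet-4.6"  # default kept in the library


@dataclass
class JudgeOpts:
    run_id: str
    model_id: str = DEFAULT_JUDGE_MODEL  # caller may override per invocation


def judge_model_for(opts: JudgeOpts) -> str:
    # the library only ever reads opts.model_id; callers pick the judge
    return opts.model_id


print(judge_model_for(JudgeOpts(run_id="r1")))                             # → anthropic/claude-sonnet-4.6
print(judge_model_for(JudgeOpts(run_id="r1", model_id="openai/gpt-5.4")))  # → openai/gpt-5.4
```

Keeping the default inside the library (rather than in the CLI) means every entry point that forgets to pass a model still gets a sane judge.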