human labeling done

parent b4319845e4 · commit e2c7a21c99
@@ -1,6 +1,6 @@
 outs:
-- md5: c633654a20f23d76af34689f7e27d58a.dir
-  size: 729964105
-  nfiles: 111
+- md5: 6147599f136e4781a2de20adcb2aba1f.dir
+  size: 737313104
+  nfiles: 135
   hash: md5
   path: .dvc-store
docs/POST-LABELING-PLAN.md — 111 lines, new file
@@ -0,0 +1,111 @@

# Post-Labeling Plan — Gold Set Repair & Final Pipeline
Written 2026-04-01 while waiting for the last human annotator to finish.

---

## The Situation

Human labeling is nearly complete (1,200 paragraphs, 6 annotators, 3 per paragraph via BIBD). Current inter-annotator agreement:

- **Cohen's Kappa (avg):** 0.622
- **Krippendorff's alpha:** 0.616

These numbers are at the floor of "substantial agreement" (Landis & Koch) but below the 0.667 threshold Krippendorff recommends for tentative conclusions. The holdout was deliberately stratified to over-sample hard cases (120 Management↔RMP splits, 80 None/Other↔Strategy splits, 80 Spec [3,4] splits, etc.), so raw consensus reflects sampling difficulty, not pure annotator quality.
The task is genuinely hard: 7 categories, 4 specificity levels, 5 decision rules, 3 codebook rulings, and multi-step reasoning (the person-vs-function test, QV fact counting). The GenAI panel struggled with the same boundaries.

---

## Immediate Analysis (once the last annotator finishes)
1. **Export labels** from labelapp (`bun run la:export`)
2. **Per-dimension alpha:** Compute Krippendorff's alpha for category and specificity separately. Hypothesis: category alpha is significantly higher than specificity alpha (matching the GenAI pattern, where Spec 4 was only 37.6% unanimous). A sketch of this computation follows the list.
3. **Pairwise Kappa matrix:** All 15 annotator pairs. Identify whether one annotator is a systematic outlier or disagreement is uniform.
4. **Stratum-level agreement:** Break down consensus rates by sampling stratum (Management↔RMP, None/Other↔Strategy, Spec [3,4], proportional random, etc.). The hard strata should show lower agreement; the proportional random stratum should be higher.
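A minimal sketch of step 2, reusing the `krippendorffsAlpha` helper from `labelapp/lib/metrics` (the same one `dump-all.ts` uses). The `Label` shape mirrors `human-labels-raw.jsonl`; `labels` is assumed to be the parsed export:

```ts
import { krippendorffsAlpha } from "../lib/metrics";

interface Label {
  paragraphId: string;
  annotatorId: string;
  contentCategory: string;
  specificityLevel: number;
}

declare const labels: Label[]; // parsed human-labels-raw.jsonl

// Build an annotators × paragraphs reliability matrix for one dimension;
// null marks cells the BIBD never assigned.
function alphaFor(encode: (l: Label) => number): number {
  const annotators = [...new Set(labels.map((l) => l.annotatorId))].sort();
  const paragraphs = [...new Set(labels.map((l) => l.paragraphId))];
  const byKey = new Map(labels.map((l) => [`${l.annotatorId}|${l.paragraphId}`, l]));
  const matrix: (number | null)[][] = annotators.map((a) =>
    paragraphs.map((p) => {
      const l = byKey.get(`${a}|${p}`);
      return l ? encode(l) : null;
    }),
  );
  return krippendorffsAlpha(matrix);
}

const CATEGORIES = [
  "Board Governance", "Management Role", "Risk Management Process",
  "Third-Party Risk", "Incident Disclosure", "Strategy Integration", "None/Other",
];
const catIndex = new Map(CATEGORIES.map((c, i) => [c, i + 1]));

const categoryAlpha = alphaFor((l) => catIndex.get(l.contentCategory)!);
const specificityAlpha = alphaFor((l) => l.specificityLevel);
```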
---

## The Adverse Incentive Problem
The assignment requires F1 > 0.80 on the holdout to pass. This creates a perverse incentive: pick easy, unambiguous paragraphs for the holdout → high human agreement, high GenAI scores, high fine-tuned model F1 → passing grade, meaningless evaluation.

We did the opposite: stratified to stress-test decision boundaries. This produces a harder holdout with lower headline numbers but an actually informative evaluation.

**Mitigation:** Report F1 on both the full 1,200 holdout AND the 720-paragraph "proportional stratified random" subsample separately. The proportional subsample approximates what a random holdout would look like. The delta between the two quantifies exactly how much performance degrades at decision boundaries. This isn't gaming — it's rigorous reporting.
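A sketch of the dual report. The `stratum` tag per holdout paragraph is assumed to come from the sampling manifest (the `proportional_random` name matches the stratum classifier in `dump-all.ts`); `macroF1` is an illustrative helper, not existing repo code:

```ts
interface Scored { stratum: string; gold: string; pred: string }

// Macro-F1: unweighted mean of per-class F1 scores.
function macroF1(examples: Scored[], classes: string[]): number {
  const f1s = classes.map((c) => {
    const tp = examples.filter((e) => e.pred === c && e.gold === c).length;
    const fp = examples.filter((e) => e.pred === c && e.gold !== c).length;
    const fn = examples.filter((e) => e.pred !== c && e.gold === c).length;
    const p = tp + fp > 0 ? tp / (tp + fp) : 0;
    const r = tp + fn > 0 ? tp / (tp + fn) : 0;
    return p + r > 0 ? (2 * p * r) / (p + r) : 0;
  });
  return f1s.reduce((s, v) => s + v, 0) / f1s.length;
}

declare const holdout: Scored[]; // 1,200 paragraphs with model predictions
declare const classes: string[]; // 7 categories, or the 28 joint labels

const fullF1 = macroF1(holdout, classes);
const proportionalF1 = macroF1(
  holdout.filter((e) => e.stratum === "proportional_random"),
  classes,
);
// The delta is the boundary-degradation number we report.
console.log(`full=${fullF1.toFixed(3)} proportional=${proportionalF1.toFixed(3)} Δ=${(proportionalF1 - fullF1).toFixed(3)}`);
```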
The A-grade criteria ("error analysis," "comparison to amateur labels") are directly served by our approach. The low human agreement rate is a finding, not a failure.

---

## Gold Set Repair Strategy: 13+ Signals Per Paragraph
### Existing signals (7 per paragraph)

- 3 human labels (from labelapp, with notes and timing)
- 3 Stage 1 GenAI labels (gemini-flash-lite, mimo-v2-flash, grok-4.1-fast)
- 1 Opus golden label (with full reasoning trace)

### New signals from GenAI benchmark (6+ additional)

The assignment requires benchmarking 6+ models from 3+ suppliers against the holdout. This serves triple duty:

1. Assignment deliverable (GenAI benchmark table)
2. Gold set repair evidence (6+ more annotation signals for adjudication)
3. "GenAI vs amateur" comparison (A-grade criterion)
**Candidate models (6+ from 3+ suppliers)** — a config sketch follows this list:

- OpenAI: gpt-5.4-mini, gpt-5.4
- Google: gemini-3-flash, gemini-3-pro (or similar)
- Anthropic: claude-sonnet-4.6, claude-haiku-4.5
- xAI: grok-4.20 (or similar)
- Others as needed for count
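A sketch of how the panel might be declared in `scripts/lib/openrouter.ts` alongside `STAGE1_MODELS`; the `supplier/model` ID format matches what `cmdBenchHoldout` splits on, but the exact IDs here are assumptions based on the candidates above, not a confirmed roster:

```ts
// Hypothetical roster — final models to be confirmed after pricing check.
export const BENCHMARK_MODELS = [
  "openai/gpt-5.4-mini",
  "openai/gpt-5.4",
  "google/gemini-3-flash",
  "google/gemini-3-pro",
  "anthropic/claude-sonnet-4.6",
  "anthropic/claude-haiku-4.5",
  "x-ai/grok-4.20",
] as const;
```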
After the benchmark, each paragraph has **13+ independent annotations**. This is an absurdly rich signal for adjudication.
### Adjudication tiers

**Tier 1 — High confidence:** 10+/13 annotators agree on both dimensions. Gold label, no intervention needed. Expected: ~500-600 paragraphs.

**Tier 2 — Clear majority with cross-validation:** A human majority exists (2/3) and matches the GenAI consensus (majority of the 10 GenAI labels). Strong signal — take the consensus. Expected: ~300-400 paragraphs.

**Tier 3 — Human split, GenAI consensus:** Humans disagree but the GenAI labels converge. Use the Opus reasoning trace + GenAI consensus to inform expert adjudication. A human (Joey) makes the final call. Expected: ~100-200 paragraphs.

**Tier 4 — Universal disagreement:** Humans and GenAI both split. Genuinely ambiguous. Expert adjudication with documented reasoning, or flag as inherently ambiguous and report in the error analysis. Expected: ~50-100 paragraphs.
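A sketch of the tier routing, assuming each paragraph carries 3 human and 10 GenAI joint labels as `category|specificity` strings. Thresholds mirror the tier definitions above; edge cases (e.g. a human majority that contradicts a GenAI consensus) fall through to Tier 4 for expert adjudication:

```ts
type Tier = 1 | 2 | 3 | 4;

// Most frequent value and its count.
function mode(xs: string[]): { value: string; count: number } {
  const counts = new Map<string, number>();
  for (const x of xs) counts.set(x, (counts.get(x) ?? 0) + 1);
  let best = { value: xs[0]!, count: 0 };
  for (const [value, count] of counts) if (count > best.count) best = { value, count };
  return best;
}

function assignTier(human: string[], genai: string[]): Tier {
  const overall = mode([...human, ...genai]);
  if (overall.count >= 10) return 1;                 // Tier 1: 10+/13 agree
  const h = mode(human);
  const g = mode(genai);
  const humanMajority = h.count >= 2;                // 2/3 humans
  const genaiConsensus = g.count > genai.length / 2; // majority of the GenAI panel
  if (humanMajority && genaiConsensus && h.value === g.value) return 2;
  if (!humanMajority && genaiConsensus) return 3;    // humans split, GenAI converges
  return 4;                                          // genuinely ambiguous
}
```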
The GenAI labels are evidence for adjudication, not the gold label itself. The final label is always a human decision. This avoids circularity — we're not evaluating GenAI against GenAI-derived labels; we're using GenAI agreement patterns to identify which human label is most likely correct in cases of human disagreement.

If we can't produce reliable gold labels from 13+ signals per paragraph, the construct itself is ill-defined. That would be an important finding too — but given that the GenAI panel achieved 70.8% both-unanimous on 50K paragraphs (unstratified), and the hardest axes have clear codebook resolutions, the construct should hold.

---

## The Meta-Narrative

The finding that trained student annotators achieve α = 0.616 while calibrated LLM panels achieve 70.8%+ unanimity on the same task supports the synthetic-experts hypothesis: for complex, rule-heavy classification tasks requiring multi-step reasoning, LLMs with reasoning tokens can match or exceed human annotation quality.

This isn't a failure of the humans — it's the whole point of the project. The Ringel pipeline exists because these tasks are too cognitively demanding for consistent human annotation at scale. The human labels are essential as a calibration anchor, but GenAI's advantage on rule-application tasks is a key finding.

---

## Task Sequence (dependency order)
### Can start now (no blockers)

- [ ] Judge prompt v3.0 update (codebook rulings → `buildJudgePrompt()`)
- [ ] Fine-tuning pipeline code (dual-head classifier, sample weighting, train/val/test split)
- [ ] GenAI benchmark infrastructure (scripts to run 6+ models on holdout)

### After last annotator finishes

- [ ] Export + per-dimension alpha + pairwise Kappa matrix + stratum breakdown
- [ ] Run GenAI benchmark on 1,200 holdout (6+ models, 3+ suppliers)
- [ ] Gold set adjudication using 13+ signals per paragraph
- [ ] Judge v3.0 validation against adjudicated gold set

### After gold set is finalized

- [ ] Training data assembly (unanimous + calibrated majority + judge)
- [ ] Fine-tuning + ablations (7 experiments)
- [ ] Final evaluation on holdout
- [ ] Writeup + IGNITE slides

---
## Open Questions

1. **F1 threshold per dimension?** Worth asking Ringel whether the 0.80 F1 requirement applies to the joint 28-class label or can be reported per dimension (category + specificity separately).
2. **Soft labels for ambiguous cases?** For Tier 4 paragraphs, we could use label distributions as soft targets during training instead of forcing a hard label. More sophisticated, but harder to evaluate. A sketch follows this list.
3. **One bad annotator vs. uniform disagreement?** The pairwise Kappa matrix will answer this. If one annotator is systematically off, their labels could be downweighted during adjudication.
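For question 2, a sketch of what soft targets could look like (hypothetical, not committed code): turn the 13 annotations into a category distribution instead of a forced hard label, with GenAI votes downweighted so the target stays human-anchored.

```ts
// Soft target for a Tier 4 paragraph: probability mass per category.
// genaiWeight is an assumption — half a vote per GenAI label.
function softTarget(
  humanCats: string[],
  genaiCats: string[],
  categories: string[],
  genaiWeight = 0.5,
): number[] {
  const mass = new Map<string, number>();
  for (const c of humanCats) mass.set(c, (mass.get(c) ?? 0) + 1);
  for (const c of genaiCats) mass.set(c, (mass.get(c) ?? 0) + genaiWeight);
  const total = [...mass.values()].reduce((s, v) => s + v, 0);
  return categories.map((c) => (mass.get(c) ?? 0) / total);
}
```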
docs/reference/Ringel 2023 Synthetic Experts.pdf — new file (binary, not shown)
@@ -37,11 +37,11 @@ Our construct of interest is **cybersecurity disclosure quality** in SEC filings
 
 ## 2. Sources and Citations
 
-The construct is **theoretically grounded** in disclosure theory ([Verrecchia, 2001](<https://doi.org/10.1016/S0165-4101(01)00037-4>)) and regulatory compliance as an information-provision mechanism. The SEC's final rule provides the taxonomic backbone: it specifies four content domains — governance, risk management, strategy integration, and incident disclosure — creating a natural multi-class classification task directly from the regulatory text. Our categories further map to [NIST CSF 2.0](https://www.nist.gov/cyberframework) functions (GOVERN, IDENTIFY, PROTECT, DETECT, RESPOND, RECOVER) for independent academic grounding.
+The construct is **theoretically grounded** in disclosure theory ([Verrecchia, 2001](<https://doi.org/10.1016/S0165-4101(01)00025-8>)) and regulatory compliance as an information-provision mechanism. The SEC's final rule provides the taxonomic backbone: it specifies four content domains — governance, risk management, strategy integration, and incident disclosure — creating a natural multi-class classification task directly from the regulatory text. Our categories further map to [NIST CSF 2.0](https://www.nist.gov/cyberframework) functions (GOVERN, IDENTIFY, PROTECT, DETECT, RESPOND, RECOVER) for independent academic grounding.
 
-The **specificity dimension** draws on the disclosure quality literature. [Berkman et al. (2018)](https://doi.org/10.2308/accr-52165) demonstrate that boilerplate risk-factor disclosures are uninformative to investors, while specific disclosures predict future outcomes. [Gordon, Loeb, and Sohail (2010)](https://doi.org/10.1016/j.jaccpubpol.2010.09.013) establish that voluntary IT security disclosures vary in informativeness and that more specific disclosures correlate with market valuations. [Von Solms and Von Solms (2004)](https://doi.org/10.1016/j.cose.2004.07.002) provide the information security governance framework connecting board oversight to operational risk management. The [Gibson Dunn annual surveys](https://www.gibsondunn.com/cybersecurity-disclosure-overview-2024/) of S&P 100 cybersecurity disclosures empirically document the variation in quality across firms, confirming that the specificity gradient is observable in practice.
+The **specificity dimension** draws on the disclosure quality literature. [Hope, Hu, and Lu (2016)](https://doi.org/10.1007/s11142-016-9371-1) demonstrate that boilerplate risk-factor disclosures are uninformative to investors, while specific disclosures predict future outcomes. [Gordon, Loeb, and Sohail (2010)](https://doi.org/10.2307/25750692) establish that voluntary IT security disclosures vary in informativeness and that more specific disclosures correlate with market valuations. [Von Solms and Von Solms (2004)](https://doi.org/10.1016/j.cose.2004.05.002) provide the information security governance framework connecting board oversight to operational risk management. The [Gibson Dunn annual surveys](https://www.gibsondunn.com/cybersecurity-disclosure-survey-of-form-10-k-cybersecurity-disclosures-by-sp-100-cos/) of S&P 100 cybersecurity disclosures empirically document the variation in quality across firms, confirming that the specificity gradient is observable in practice.
 
-The **methodological foundation** is the [Ringel (2023)](https://arxiv.org/abs/2310.15560) synthetic experts pipeline — frontier LLMs generate training labels, then a small open-weights model is fine-tuned to approximate the GenAI labeler at near-zero marginal cost. [Ma et al. (2026)](https://arxiv.org/abs/2601.09142) provide the multi-model consensus labeling architecture we adopt for quality assurance. **No validated classifier or public labeled dataset for SEC cybersecurity disclosure quality currently exists** — this is the gap our project fills.
+The **methodological foundation** is the [Ringel (2023)](https://papers.ssrn.com/sol3/papers.cfm?abstract_id=4542949) synthetic experts pipeline — frontier LLMs generate training labels, then a small open-weights model is fine-tuned to approximate the GenAI labeler at near-zero marginal cost. [Ma et al. (2026)](https://arxiv.org/abs/2601.09142) provide the multi-model consensus labeling architecture we adopt for quality assurance. **No validated classifier or public labeled dataset for SEC cybersecurity disclosure quality currently exists** — this is the gap our project fills.
 
 ## 3. Data Description
labelapp/.sampled-ids.original.json — 1202 lines, new file (diff too large, suppressed)
@@ -16,6 +16,7 @@
     "sample": "bun run scripts/sample.ts",
     "assign": "bun run scripts/assign.ts",
     "export": "bun run scripts/export.ts",
+    "dump": "bun run scripts/dump-all.ts",
     "test": "bun test app/ lib/ && playwright test",
     "test:api": "bun test app/ lib/",
     "test:e2e": "playwright test",
labelapp/scripts/dump-all.ts — 597 lines, new file
@@ -0,0 +1,597 @@

/**
 * Comprehensive data dump from the labelapp database.
 *
 * Exports:
 *   data/gold/human-labels-raw.jsonl   — every individual label with timing
 *   data/gold/paragraphs-holdout.jsonl — paragraph metadata for the 1,200 holdout
 *   data/gold/annotators.json          — annotator profiles + onboarding timestamps
 *   data/gold/quiz-sessions.jsonl      — all quiz attempts
 *   data/gold/metrics.json             — comprehensive IRR: per-dimension alpha/kappa, pairwise matrices, per-category, per-stratum
 */

process.env.DATABASE_URL ??=
  "postgresql://sec_cybert:sec_cybert@10.1.10.10:5432/sec_cybert";

import { writeFile, mkdir } from "node:fs/promises";
import { existsSync } from "node:fs";
import { db } from "../db";
import * as schema from "../db/schema";
import {
  cohensKappa,
  krippendorffsAlpha,
  agreementRate,
  perCategoryAgreement,
} from "../lib/metrics";

const OUT_DIR = "/home/joey/Documents/sec-cyBERT/data/gold";

const CATEGORIES = [
  "Board Governance",
  "Management Role",
  "Risk Management Process",
  "Third-Party Risk",
  "Incident Disclosure",
  "Strategy Integration",
  "None/Other",
];

function toJSONL(records: object[]): string {
  return records.map((r) => JSON.stringify(r)).join("\n") + "\n";
}

async function main() {
  if (!existsSync(OUT_DIR)) await mkdir(OUT_DIR, { recursive: true });

  // ── Load everything ──
  console.log("Loading all data from database...");
  const [allLabels, allAnnotators, allParagraphs, allQuizSessions, allAdjudications] =
    await Promise.all([
      db.select().from(schema.humanLabels),
      db.select().from(schema.annotators),
      db.select().from(schema.paragraphs),
      db.select().from(schema.quizSessions),
      db.select().from(schema.adjudications),
    ]);

  const nonAdminAnnotators = allAnnotators.filter((a) => a.id !== "admin");
  const annotatorIds = nonAdminAnnotators.map((a) => a.id).sort();
  const annotatorNames = new Map(allAnnotators.map((a) => [a.id, a.displayName]));

  // Filter to non-admin labels only
  const labels = allLabels.filter((l) => l.annotatorId !== "admin");

  console.log(`  ${labels.length} human labels (non-admin)`);
  console.log(`  ${allParagraphs.length} paragraphs`);
  console.log(`  ${nonAdminAnnotators.length} annotators`);
  console.log(`  ${allQuizSessions.length} quiz sessions`);
  console.log(`  ${allAdjudications.length} adjudications`);

  // ── 1. Raw labels JSONL ──
  console.log("\nExporting raw labels...");
  const rawLabels = labels.map((l) => ({
    paragraphId: l.paragraphId,
    annotatorId: l.annotatorId,
    annotatorName: annotatorNames.get(l.annotatorId) ?? l.annotatorId,
    contentCategory: l.contentCategory,
    specificityLevel: l.specificityLevel,
    notes: l.notes,
    labeledAt: l.labeledAt?.toISOString() ?? null,
    sessionId: l.sessionId,
    durationMs: l.durationMs,
    activeMs: l.activeMs,
  }));
  await writeFile(`${OUT_DIR}/human-labels-raw.jsonl`, toJSONL(rawLabels));
  console.log(`  ${rawLabels.length} labels → human-labels-raw.jsonl`);

  // ── 2. Paragraph metadata JSONL ──
  console.log("\nExporting paragraph metadata...");
  const paragraphRecords = allParagraphs.map((p) => ({
    id: p.id,
    text: p.text,
    wordCount: p.wordCount,
    paragraphIndex: p.paragraphIndex,
    companyName: p.companyName,
    cik: p.cik,
    ticker: p.ticker,
    filingType: p.filingType,
    filingDate: p.filingDate,
    fiscalYear: p.fiscalYear,
    accessionNumber: p.accessionNumber,
    secItem: p.secItem,
    stage1Category: p.stage1Category,
    stage1Specificity: p.stage1Specificity,
    stage1Method: p.stage1Method,
    stage1Confidence: p.stage1Confidence,
  }));
  await writeFile(`${OUT_DIR}/paragraphs-holdout.jsonl`, toJSONL(paragraphRecords));
  console.log(`  ${paragraphRecords.length} paragraphs → paragraphs-holdout.jsonl`);

  // ── 3. Annotators JSON ──
  console.log("\nExporting annotator profiles...");
  const annotatorProfiles = nonAdminAnnotators.map((a) => ({
    id: a.id,
    displayName: a.displayName,
    onboardedAt: a.onboardedAt?.toISOString() ?? null,
  }));
  await writeFile(`${OUT_DIR}/annotators.json`, JSON.stringify(annotatorProfiles, null, 2));
  console.log(`  ${annotatorProfiles.length} annotators → annotators.json`);

  // ── 4. Quiz sessions JSONL ──
  console.log("\nExporting quiz sessions...");
  const quizRecords = allQuizSessions.map((q) => ({
    id: q.id,
    annotatorId: q.annotatorId,
    annotatorName: annotatorNames.get(q.annotatorId) ?? q.annotatorId,
    startedAt: q.startedAt?.toISOString() ?? null,
    completedAt: q.completedAt?.toISOString() ?? null,
    passed: q.passed,
    score: q.score,
    totalQuestions: q.totalQuestions,
    answers: q.answers,
  }));
  await writeFile(`${OUT_DIR}/quiz-sessions.jsonl`, toJSONL(quizRecords));
  console.log(`  ${quizRecords.length} quiz sessions → quiz-sessions.jsonl`);

  // ── 5. Comprehensive metrics ──
  console.log("\nComputing metrics...");

  // Group labels by paragraph
  const byParagraph = new Map<string, typeof labels>();
  for (const label of labels) {
    const group = byParagraph.get(label.paragraphId);
    if (group) group.push(label);
    else byParagraph.set(label.paragraphId, [label]);
  }

  // Only paragraphs with 3+ labels
  const fullyLabeled = new Map<string, typeof labels>();
  for (const [pid, lbls] of byParagraph) {
    if (lbls.length >= 3) fullyLabeled.set(pid, lbls);
  }

  // Paragraphs with 2+ labels (for pairwise)
  const multiLabeled = new Map<string, typeof labels>();
  for (const [pid, lbls] of byParagraph) {
    if (lbls.length >= 2) multiLabeled.set(pid, lbls);
  }

  const multiLabeledParaIds = [...multiLabeled.keys()];
  // ─── Per-annotator stats ───
  const perAnnotatorStats = annotatorIds.map((aid) => {
    const myLabels = labels.filter((l) => l.annotatorId === aid);
    const activeTimes = myLabels
      .map((l) => l.activeMs)
      .filter((t): t is number => t !== null);
    const wallTimes = myLabels
      .map((l) => l.durationMs)
      .filter((t): t is number => t !== null);
    return {
      id: aid,
      name: annotatorNames.get(aid) ?? aid,
      labelCount: myLabels.length,
      medianActiveMs: activeTimes.length > 0 ? median(activeTimes) : null,
      meanActiveMs: activeTimes.length > 0 ? mean(activeTimes) : null,
      medianDurationMs: wallTimes.length > 0 ? median(wallTimes) : null,
      meanDurationMs: wallTimes.length > 0 ? mean(wallTimes) : null,
      totalActiveMs: activeTimes.length > 0 ? sum(activeTimes) : null,
      totalDurationMs: wallTimes.length > 0 ? sum(wallTimes) : null,
      labelsWithActiveTime: activeTimes.length,
    };
  });

  // ─── Category consensus ───
  const categoryArrays: string[][] = [];
  for (const lbls of fullyLabeled.values()) {
    categoryArrays.push(lbls.map((l) => l.contentCategory));
  }
  const categoryConsensusRate = agreementRate(categoryArrays);

  // ─── Specificity consensus ───
  const specArrays: string[][] = [];
  for (const lbls of fullyLabeled.values()) {
    specArrays.push(lbls.map((l) => String(l.specificityLevel)));
  }
  const specConsensusRate = agreementRate(specArrays);

  // ─── Both consensus ───
  const bothArrays: string[][] = [];
  for (const lbls of fullyLabeled.values()) {
    bothArrays.push(
      lbls.map((l) => `${l.contentCategory}|${l.specificityLevel}`),
    );
  }
  const bothConsensusRate = agreementRate(bothArrays);

  // ─── Krippendorff's Alpha: category ───
  // Encode the nominal categories as integers for the alpha computation
  const catIndex = new Map(CATEGORIES.map((c, i) => [c, i + 1]));

  const categoryRatingsMatrix: (number | null)[][] = annotatorIds.map(
    (annotatorId) =>
      multiLabeledParaIds.map((paraId) => {
        const label = multiLabeled
          .get(paraId)
          ?.find((l) => l.annotatorId === annotatorId);
        if (!label) return null;
        return catIndex.get(label.contentCategory) ?? null;
      }),
  );

  // Krippendorff's alpha for category (note: using ordinal distance on nominal data
  // — this is conservative; nominal distance would give higher alpha)
  const categoryAlpha =
    annotatorIds.length >= 2 && multiLabeledParaIds.length > 0
      ? krippendorffsAlpha(categoryRatingsMatrix)
      : 0;

  // ─── Krippendorff's Alpha: specificity (ordinal) ───
  const specRatingsMatrix: (number | null)[][] = annotatorIds.map(
    (annotatorId) =>
      multiLabeledParaIds.map((paraId) => {
        const label = multiLabeled
          .get(paraId)
          ?.find((l) => l.annotatorId === annotatorId);
        return label?.specificityLevel ?? null;
      }),
  );

  const specAlpha =
    annotatorIds.length >= 2 && multiLabeledParaIds.length > 0
      ? krippendorffsAlpha(specRatingsMatrix)
      : 0;
  // ─── Pairwise Cohen's Kappa — category ───
  const kappaCategory: number[][] = Array.from(
    { length: annotatorIds.length },
    () => new Array(annotatorIds.length).fill(0),
  );
  const kappaCatDetails: {
    a1: string;
    a2: string;
    kappa: number;
    n: number;
  }[] = [];

  for (let i = 0; i < annotatorIds.length; i++) {
    kappaCategory[i][i] = 1;
    for (let j = i + 1; j < annotatorIds.length; j++) {
      const a1 = annotatorIds[i];
      const a2 = annotatorIds[j];
      const shared1: string[] = [];
      const shared2: string[] = [];

      for (const [, lbls] of multiLabeled) {
        const l1 = lbls.find((l) => l.annotatorId === a1);
        const l2 = lbls.find((l) => l.annotatorId === a2);
        if (l1 && l2) {
          shared1.push(l1.contentCategory);
          shared2.push(l2.contentCategory);
        }
      }

      if (shared1.length >= 2) {
        const kappa = cohensKappa(shared1, shared2);
        kappaCategory[i][j] = kappa;
        kappaCategory[j][i] = kappa;
        kappaCatDetails.push({
          a1: annotatorNames.get(a1) ?? a1,
          a2: annotatorNames.get(a2) ?? a2,
          kappa,
          n: shared1.length,
        });
      }
    }
  }

  // ─── Pairwise Cohen's Kappa — specificity ───
  const kappaSpec: number[][] = Array.from(
    { length: annotatorIds.length },
    () => new Array(annotatorIds.length).fill(0),
  );
  const kappaSpecDetails: {
    a1: string;
    a2: string;
    kappa: number;
    n: number;
  }[] = [];

  for (let i = 0; i < annotatorIds.length; i++) {
    kappaSpec[i][i] = 1;
    for (let j = i + 1; j < annotatorIds.length; j++) {
      const a1 = annotatorIds[i];
      const a2 = annotatorIds[j];
      const shared1: string[] = [];
      const shared2: string[] = [];

      for (const [, lbls] of multiLabeled) {
        const l1 = lbls.find((l) => l.annotatorId === a1);
        const l2 = lbls.find((l) => l.annotatorId === a2);
        if (l1 && l2) {
          shared1.push(String(l1.specificityLevel));
          shared2.push(String(l2.specificityLevel));
        }
      }

      if (shared1.length >= 2) {
        const kappa = cohensKappa(shared1, shared2);
        kappaSpec[i][j] = kappa;
        kappaSpec[j][i] = kappa;
        kappaSpecDetails.push({
          a1: annotatorNames.get(a1) ?? a1,
          a2: annotatorNames.get(a2) ?? a2,
          kappa,
          n: shared1.length,
        });
      }
    }
  }
  // ─── Per-category agreement ───
  const perCategory = perCategoryAgreement(
    labels.map((l) => ({
      category: l.contentCategory,
      annotatorId: l.annotatorId,
      paragraphId: l.paragraphId,
    })),
    CATEGORIES,
  );

  // ─── Per-stratum agreement (using stage1 data to identify strata) ───
  const paragraphMeta = new Map(allParagraphs.map((p) => [p.id, p]));

  // Classify each paragraph's stratum based on stage1 data
  function classifyStratum(pid: string): string {
    const para = paragraphMeta.get(pid);
    if (!para) return "unknown";
    const method = para.stage1Method;
    const cat = para.stage1Category;
    const spec = para.stage1Specificity;

    // Check if it was a disputed paragraph based on method
    if (method === "unresolved") return "unresolved";
    if (method === "majority") {
      // Try to identify the dispute type from the category
      if (cat === "Management Role" || cat === "Risk Management Process")
        return "mgmt_rmp_split";
      if (cat === "None/Other" || cat === "Strategy Integration")
        return "noneother_strategy_split";
      if (cat === "Board Governance") return "board_mgmt_split";
      if (spec === 3 || spec === 4) return "spec_34_split";
      return "majority_other";
    }
    if (method === "unanimous") return "unanimous";
    return "proportional_random";
  }

  const strataAgreement: Record<string, { total: number; agreed: number }> = {};
  for (const [pid, lbls] of fullyLabeled) {
    const stratum = classifyStratum(pid);
    if (!strataAgreement[stratum]) {
      strataAgreement[stratum] = { total: 0, agreed: 0 };
    }
    strataAgreement[stratum].total++;
    const allSameCat = lbls.every(
      (l) => l.contentCategory === lbls[0].contentCategory,
    );
    const allSameSpec = lbls.every(
      (l) => l.specificityLevel === lbls[0].specificityLevel,
    );
    if (allSameCat && allSameSpec) strataAgreement[stratum].agreed++;
  }

  const strataRates: Record<string, { total: number; agreed: number; rate: number }> = {};
  for (const [stratum, data] of Object.entries(strataAgreement)) {
    strataRates[stratum] = {
      ...data,
      rate: data.total > 0 ? data.agreed / data.total : 0,
    };
  }

  // ─── Timing summary ───
  const allActiveTimes = labels
    .map((l) => l.activeMs)
    .filter((t): t is number => t !== null);
  const allWallTimes = labels
    .map((l) => l.durationMs)
    .filter((t): t is number => t !== null);

  // ─── Category distribution ───
  const categoryDist: Record<string, number> = {};
  for (const cat of CATEGORIES) categoryDist[cat] = 0;
  for (const l of labels) {
    categoryDist[l.contentCategory] =
      (categoryDist[l.contentCategory] ?? 0) + 1;
  }

  // ─── Specificity distribution ───
  const specDist: Record<string, number> = { "1": 0, "2": 0, "3": 0, "4": 0 };
  for (const l of labels) {
    specDist[String(l.specificityLevel)] =
      (specDist[String(l.specificityLevel)] ?? 0) + 1;
  }

  // ─── Majority label distribution (for fully-labeled paragraphs) ───
  const majorityCategories: Record<string, number> = {};
  for (const cat of CATEGORIES) majorityCategories[cat] = 0;

  for (const lbls of fullyLabeled.values()) {
    const catCounts = new Map<string, number>();
    for (const l of lbls) {
      catCounts.set(l.contentCategory, (catCounts.get(l.contentCategory) ?? 0) + 1);
    }
    let maxCount = 0;
    let majorCat = "";
    for (const [cat, count] of catCounts) {
      if (count > maxCount) {
        maxCount = count;
        majorCat = cat;
      }
    }
    if (majorCat) majorityCategories[majorCat]++;
  }
  const metrics = {
    summary: {
      totalLabels: labels.length,
      totalParagraphs: allParagraphs.length,
      fullyLabeledParagraphs: fullyLabeled.size,
      adjudicatedParagraphs: allAdjudications.length,
      annotatorCount: annotatorIds.length,
    },
    consensus: {
      categoryOnly: round(categoryConsensusRate, 4),
      specificityOnly: round(specConsensusRate, 4),
      both: round(bothConsensusRate, 4),
    },
    krippendorffsAlpha: {
      category: round(categoryAlpha, 4),
      specificity: round(specAlpha, 4),
      note: "Category alpha uses ordinal distance on nominal data (conservative). Specificity alpha uses ordinal distance.",
    },
    pairwiseKappa: {
      category: {
        annotators: annotatorIds.map((id) => annotatorNames.get(id) ?? id),
        matrix: kappaCategory.map((row) => row.map((v) => round(v, 4))),
        pairs: kappaCatDetails.map((d) => ({
          ...d,
          kappa: round(d.kappa, 4),
        })),
        mean: round(
          kappaCatDetails.length > 0
            ? kappaCatDetails.reduce((s, d) => s + d.kappa, 0) /
                kappaCatDetails.length
            : 0,
          4,
        ),
      },
      specificity: {
        annotators: annotatorIds.map((id) => annotatorNames.get(id) ?? id),
        matrix: kappaSpec.map((row) => row.map((v) => round(v, 4))),
        pairs: kappaSpecDetails.map((d) => ({
          ...d,
          kappa: round(d.kappa, 4),
        })),
        mean: round(
          kappaSpecDetails.length > 0
            ? kappaSpecDetails.reduce((s, d) => s + d.kappa, 0) /
                kappaSpecDetails.length
            : 0,
          4,
        ),
      },
    },
    perCategoryAgreement: Object.fromEntries(
      Object.entries(perCategory).map(([k, v]) => [k, round(v, 4)]),
    ),
    perStratumAgreement: strataRates,
    distributions: {
      categoryLabels: categoryDist,
      specificityLabels: specDist,
      majorityCategories,
    },
    timing: {
      overallMedianActiveMs: allActiveTimes.length > 0 ? median(allActiveTimes) : null,
      overallMeanActiveMs: allActiveTimes.length > 0 ? round(mean(allActiveTimes), 0) : null,
      overallMedianDurationMs: allWallTimes.length > 0 ? median(allWallTimes) : null,
      overallMeanDurationMs: allWallTimes.length > 0 ? round(mean(allWallTimes), 0) : null,
      totalActiveHours:
        allActiveTimes.length > 0
          ? round(sum(allActiveTimes) / 3_600_000, 2)
          : null,
      totalWallHours:
        allWallTimes.length > 0
          ? round(sum(allWallTimes) / 3_600_000, 2)
          : null,
      labelsWithActiveTime: allActiveTimes.length,
      labelsWithoutActiveTime: labels.length - allActiveTimes.length,
    },
    perAnnotator: perAnnotatorStats,
  };

  await writeFile(`${OUT_DIR}/metrics.json`, JSON.stringify(metrics, null, 2));
  console.log(`  metrics → metrics.json`);

  // ── Print summary to console ──
  console.log("\n" + "=".repeat(60));
  console.log("HUMAN LABELING SUMMARY");
  console.log("=".repeat(60));
  console.log(`\nParagraphs: ${fullyLabeled.size} fully labeled / ${allParagraphs.length} total`);
  console.log(`Labels: ${labels.length} total`);
  console.log(`\n── Consensus Rates (3/3 agree) ──`);
  console.log(`  Category only: ${(categoryConsensusRate * 100).toFixed(1)}%`);
  console.log(`  Specificity only: ${(specConsensusRate * 100).toFixed(1)}%`);
  console.log(`  Both: ${(bothConsensusRate * 100).toFixed(1)}%`);
  console.log(`\n── Krippendorff's Alpha ──`);
  console.log(`  Category: ${categoryAlpha.toFixed(4)}`);
  console.log(`  Specificity: ${specAlpha.toFixed(4)}`);
  console.log(`\n── Pairwise Kappa (category) ──`);
  console.log(`  Mean: ${metrics.pairwiseKappa.category.mean}`);
  for (const pair of kappaCatDetails) {
    console.log(`  ${pair.a1} × ${pair.a2}: ${pair.kappa.toFixed(4)} (n=${pair.n})`);
  }
  console.log(`\n── Pairwise Kappa (specificity) ──`);
  console.log(`  Mean: ${metrics.pairwiseKappa.specificity.mean}`);
  for (const pair of kappaSpecDetails) {
    console.log(`  ${pair.a1} × ${pair.a2}: ${pair.kappa.toFixed(4)} (n=${pair.n})`);
  }
  console.log(`\n── Per-Category Agreement ──`);
  for (const [cat, rate] of Object.entries(perCategory)) {
    console.log(`  ${cat}: ${(rate * 100).toFixed(1)}%`);
  }
  console.log(`\n── Per-Stratum Agreement ──`);
  for (const [stratum, data] of Object.entries(strataRates)) {
    console.log(
      `  ${stratum}: ${(data.rate * 100).toFixed(1)}% (${data.agreed}/${data.total})`,
    );
  }
  console.log(`\n── Timing ──`);
  if (allActiveTimes.length > 0) {
    console.log(`  Median active time: ${(median(allActiveTimes) / 1000).toFixed(1)}s`);
    console.log(`  Mean active time: ${(mean(allActiveTimes) / 1000).toFixed(1)}s`);
    console.log(`  Total active hours: ${(sum(allActiveTimes) / 3_600_000).toFixed(2)}h`);
    console.log(`  Total wall hours: ${(sum(allWallTimes) / 3_600_000).toFixed(2)}h`);
  }
  console.log(`  Labels with active timer: ${allActiveTimes.length}/${labels.length}`);

  console.log(`\n── Per-Annotator ──`);
  for (const a of perAnnotatorStats) {
    const activeH = a.totalActiveMs ? (a.totalActiveMs / 3_600_000).toFixed(2) : "N/A";
    const medSec = a.medianActiveMs ? (a.medianActiveMs / 1000).toFixed(1) : "N/A";
    console.log(
      `  ${a.name}: ${a.labelCount} labels, median ${medSec}s active, ${activeH}h total`,
    );
  }

  console.log(`\n${"=".repeat(60)}`);
  console.log(`All data exported to ${OUT_DIR}/`);
  console.log("=".repeat(60));

  process.exit(0);
}

function median(arr: number[]): number {
  const sorted = [...arr].sort((a, b) => a - b);
  const mid = Math.floor(sorted.length / 2);
  return sorted.length % 2 !== 0
    ? sorted[mid]
    : (sorted[mid - 1] + sorted[mid]) / 2;
}

function mean(arr: number[]): number {
  return arr.reduce((s, v) => s + v, 0) / arr.length;
}

function sum(arr: number[]): number {
  return arr.reduce((s, v) => s + v, 0);
}

function round(n: number, decimals: number): number {
  const factor = 10 ** decimals;
  return Math.round(n * factor) / factor;
}

main().catch((err) => {
  console.error("Dump failed:", err);
  process.exit(1);
});
@@ -16,6 +16,7 @@
     "la:sample": "bun run --filter labelapp sample",
     "la:assign": "bun run --filter labelapp assign",
     "la:export": "bun run --filter labelapp export",
+    "la:dump": "bun run --filter labelapp dump",
     "la:docker": "docker build -f labelapp/Dockerfile -t registry.claiborne.soy/labelapp:latest . --push",
     "ts:sec": "bun run --filter sec-cybert sec",
     "ts:typecheck": "bun run --filter sec-cybert typecheck",
scripts/analyze-gold.py — 1224 lines, new file (diff too large, suppressed)
@@ -1,7 +1,7 @@
 import { readJsonl } from "./lib/jsonl.ts";
 import { Paragraph } from "@sec-cybert/schemas/paragraph.ts";
 import { Annotation } from "@sec-cybert/schemas/annotation.ts";
-import { STAGE1_MODELS } from "./lib/openrouter.ts";
+import { STAGE1_MODELS, BENCHMARK_MODELS } from "./lib/openrouter.ts";
 import { runBatch } from "./label/batch.ts";
 import { runGoldenBatch } from "./label/golden.ts";
 import { computeConsensus } from "./label/consensus.ts";
@@ -26,7 +26,9 @@ Commands:
   label:annotate-all [--limit N] [--concurrency N]
   label:consensus
   label:judge [--concurrency N]
-  label:golden [--paragraphs <path>] [--limit N] [--delay N] (Opus via Agent SDK)
+  label:golden [--paragraphs <path>] [--limit N] [--delay N] [--concurrency N] (Opus via Agent SDK)
+  label:bench-holdout --model <id> [--concurrency N] [--limit N] (benchmark model on holdout)
+  label:bench-holdout-all [--concurrency N] [--limit N] (all BENCHMARK_MODELS on holdout)
   label:cost`);
   process.exit(1);
 }
@@ -223,8 +225,8 @@ async function cmdJudge(): Promise<void> {
 }
 
 async function cmdGolden(): Promise<void> {
-  // Load the 1,200 human-labeled paragraph IDs from the labelapp sample
-  const sampledIdsPath = "../labelapp/.sampled-ids.json";
+  // Load the 1,200 human-labeled paragraph IDs from the original sample
+  const sampledIdsPath = flag("ids") ?? "../labelapp/.sampled-ids.original.json";
   const sampledIds = new Set<string>(
     JSON.parse(await import("node:fs/promises").then((fs) => fs.readFile(sampledIdsPath, "utf-8"))),
   );
@@ -248,9 +250,77 @@ async function cmdGolden(): Promise<void> {
     errorsPath: `${DATA}/annotations/golden/opus-errors.jsonl`,
     limit: flag("limit") !== undefined ? flagInt("limit", 50) : undefined,
     delayMs: flag("delay") !== undefined ? flagInt("delay", 1000) : 1000,
+    concurrency: flagInt("concurrency", 1),
   });
 }
+
+async function loadHoldoutParagraphs(): Promise<Paragraph[]> {
+  const sampledIdsPath = "../labelapp/.sampled-ids.original.json";
+  const sampledIds = new Set<string>(
+    JSON.parse(await import("node:fs/promises").then((fs) => fs.readFile(sampledIdsPath, "utf-8"))),
+  );
+  process.stderr.write(`  Loaded ${sampledIds.size} holdout IDs from ${sampledIdsPath}\n`);
+
+  const paragraphsPath = `${DATA}/paragraphs/paragraphs-clean.patched.jsonl`;
+  const { records: allParagraphs, skipped } = await readJsonl(paragraphsPath, Paragraph);
+  if (skipped > 0) process.stderr.write(`  ⚠ Skipped ${skipped} invalid paragraph lines\n`);
+
+  const paragraphs = allParagraphs.filter((p) => sampledIds.has(p.id));
+  process.stderr.write(`  Matched ${paragraphs.length}/${sampledIds.size} holdout paragraphs\n`);
+
+  if (paragraphs.length === 0) {
+    process.stderr.write("  ✖ No matching paragraphs found\n");
+    process.exit(1);
+  }
+  return paragraphs;
+}
+
+async function cmdBenchHoldout(): Promise<void> {
+  const modelId = flag("model");
+  if (!modelId) {
+    console.error("--model is required");
+    process.exit(1);
+  }
+  const paragraphs = await loadHoldoutParagraphs();
+  const modelShort = modelId.split("/")[1]!;
+
+  await runBatch(paragraphs, {
+    modelId,
+    stage: "benchmark",
+    outputPath: `${DATA}/annotations/bench-holdout/${modelShort}.jsonl`,
+    errorsPath: `${DATA}/annotations/bench-holdout/${modelShort}-errors.jsonl`,
+    sessionsPath: SESSIONS_PATH,
+    concurrency: flagInt("concurrency", 60),
+    limit: flag("limit") !== undefined ? flagInt("limit", 50) : undefined,
+  });
+}
+
+async function cmdBenchHoldoutAll(): Promise<void> {
+  const paragraphs = await loadHoldoutParagraphs();
+  const concurrency = flagInt("concurrency", 60);
+  const limit = flag("limit") !== undefined ? flagInt("limit", 50) : undefined;
+
+  // Exclude Stage 1 models — we already have their annotations
+  const benchModels = BENCHMARK_MODELS.filter(
+    (m) => !(STAGE1_MODELS as readonly string[]).includes(m),
+  );
+  process.stderr.write(`  Running ${benchModels.length} benchmark models (excluding Stage 1 panel)\n`);
+
+  for (const modelId of benchModels) {
+    const modelShort = modelId.split("/")[1]!;
+    process.stderr.write(`\n  ═══ ${modelId} ═══\n`);
+    await runBatch(paragraphs, {
+      modelId,
+      stage: "benchmark",
+      outputPath: `${DATA}/annotations/bench-holdout/${modelShort}.jsonl`,
+      errorsPath: `${DATA}/annotations/bench-holdout/${modelShort}-errors.jsonl`,
+      sessionsPath: SESSIONS_PATH,
+      concurrency,
+      limit,
+    });
+  }
+}
+
 async function cmdCost(): Promise<void> {
   const modelCosts: Record<string, { cost: number; count: number }> = {};
   const stageCosts: Record<string, { cost: number; count: number }> = {};
@@ -359,6 +429,12 @@ switch (command) {
   case "label:golden":
     await cmdGolden();
     break;
+  case "label:bench-holdout":
+    await cmdBenchHoldout();
+    break;
+  case "label:bench-holdout-all":
+    await cmdBenchHoldoutAll();
+    break;
   case "label:cost":
     await cmdCost();
     break;
@@ -74,6 +74,8 @@ export interface GoldenBatchOpts {
   limit?: number;
   /** Delay between requests in ms. Default 1000 (1 req/s). */
   delayMs?: number;
+  /** Number of concurrent workers. Default 1 (serial). */
+  concurrency?: number;
 }
 
 /** Build the enhanced system prompt: full codebook + v2.5 operational prompt + JSON schema. */
@@ -138,6 +140,9 @@ async function annotateGolden(
     outputTokens: 0,
   };
 
+  // Prevent git pull and other non-essential traffic when running concurrently
+  process.env.CLAUDE_CODE_DISABLE_NONESSENTIAL_TRAFFIC = "1";
+
   for await (const message of query({
     prompt: buildUserPrompt(paragraph),
     options: {
@@ -150,10 +155,11 @@ async function annotateGolden(
       // No tools — pure classification
       allowedTools: [],
       disallowedTools: ["Bash", "Read", "Write", "Edit", "Glob", "Grep", "WebSearch", "WebFetch", "Agent", "AskUserQuestion"],
-      // Isolation: no hooks, no settings, no session persistence
+      // Isolation: no hooks, no settings, no session persistence, no plugins
       hooks: {},
       settingSources: [],
      persistSession: false,
+      plugins: [],
       // Single-turn: one prompt → one structured response
       maxTurns: 1,
       permissionMode: "dontAsk",
@@ -242,14 +248,15 @@ async function annotateGolden(
 }
 
 /**
- * Run golden set annotation: serial 1-req/s through the Agent SDK.
+ * Run golden set annotation through the Agent SDK.
+ * Supports concurrent workers for parallelism.
  * Crash-safe with JSONL checkpoint resume.
  */
 export async function runGoldenBatch(
   paragraphs: Paragraph[],
   opts: GoldenBatchOpts,
 ): Promise<void> {
-  const { outputPath, errorsPath, limit, delayMs = 1000 } = opts;
+  const { outputPath, errorsPath, limit, delayMs = 1000, concurrency = 1 } = opts;
   const runId = uuidv4();
 
   // Build system prompt once (codebook + operational prompt)
@@ -272,69 +279,86 @@ export async function runGoldenBatch(
   }
 
   process.stderr.write(
-    `  Starting golden annotation │ Opus 4.6 (Agent SDK) │ ${total} remaining of ${paragraphs.length}\n`,
+    `  Starting golden annotation │ Opus 4.6 (Agent SDK) │ ${total} remaining of ${paragraphs.length} │ concurrency=${concurrency}\n`,
   );
 
   let processed = 0;
   let errored = 0;
   const startTime = Date.now();
 
+  // Serialized file writes to prevent corruption
+  let writeQueue = Promise.resolve();
+  function safeAppend(path: string, data: object) {
+    writeQueue = writeQueue.then(() => appendJsonl(path, data));
+    return writeQueue;
+  }
+
   // Graceful shutdown
   let stopping = false;
   const onSignal = () => {
     if (stopping) return;
     stopping = true;
-    process.stderr.write("\n  ⏸ Stopping — finishing current request...\n");
+    process.stderr.write("\n  ⏸ Stopping — finishing in-flight requests...\n");
   };
   process.on("SIGINT", onSignal);
   process.on("SIGTERM", onSignal);
 
-  for (const paragraph of remaining) {
-    if (stopping) break;
-
-    try {
-      const annotation = await annotateGolden(paragraph, runId, systemPrompt);
-      await appendJsonl(outputPath, annotation);
-      processed++;
-
-      if (processed % 10 === 0 || processed === total) {
-        const elapsed = (Date.now() - startTime) / 1000;
-        const rate = (processed / elapsed) * 60;
-        const etaMin = Math.round((total - processed) / rate);
-        process.stderr.write(
-          `  ${processed}/${total} (${((processed / total) * 100).toFixed(1)}%) │ ${rate.toFixed(1)} para/min │ ETA ${etaMin}m │ ${errored} errors\n`,
-        );
-      }
-    } catch (error) {
-      errored++;
-      await appendJsonl(errorsPath, {
-        paragraphId: paragraph.id,
-        error: error instanceof Error ? error.message : String(error),
-        modelId: "anthropic/claude-opus-4-6",
-        timestamp: new Date().toISOString(),
-      });
-      process.stderr.write(
-        `  ✖ Error on ${paragraph.id}: ${error instanceof Error ? error.message : String(error)}\n`,
-      );
-      // 5 consecutive errors with no successes = likely systemic
-      if (errored >= 5 && processed === 0) {
-        process.stderr.write("  ✖ 5 errors with no successes. Stopping.\n");
-        break;
-      }
-    }
-
-    // Rate limit: 1 req/s
-    if (!stopping) {
-      await new Promise((r) => setTimeout(r, delayMs));
-    }
-  }
+  // Dashboard refresh
+  function renderStatus() {
+    const elapsed = (Date.now() - startTime) / 1000;
+    const rate = elapsed > 0 ? (processed / elapsed) * 60 : 0;
+    const etaMin = rate > 0 ? Math.round((total - processed) / rate) : 0;
+    process.stderr.write(
+      `\x1b[2K\r  ${processed}/${total} (${((processed / total) * 100).toFixed(1)}%) │ ${rate.toFixed(1)} para/min │ ETA ${etaMin}m │ ${errored} errors`,
+    );
+  }
+  const dashboardInterval = setInterval(renderStatus, 2000);
+
+  // Worker pool: N concurrent workers pulling from shared queue
+  let nextIdx = 0;
+  async function worker() {
+    while (nextIdx < remaining.length && !stopping) {
+      const idx = nextIdx++;
+      const paragraph = remaining[idx]!;
+
+      try {
+        const annotation = await annotateGolden(paragraph, runId, systemPrompt);
+        await safeAppend(outputPath, annotation);
+        processed++;
+      } catch (error) {
+        errored++;
+        await safeAppend(errorsPath, {
+          paragraphId: paragraph.id,
+          error: error instanceof Error ? error.message : String(error),
+          modelId: "anthropic/claude-opus-4-6",
+          timestamp: new Date().toISOString(),
+        });
+
+        if (errored >= 10 && processed === 0) {
+          stopping = true;
+          process.stderr.write("\n  ✖ 10 errors with no successes. Stopping.\n");
+        }
+      }
+
+      // Per-worker delay between requests
+      if (!stopping) {
+        await new Promise((r) => setTimeout(r, delayMs));
+      }
+    }
+  }
+
+  const workers = Array.from(
+    { length: Math.min(concurrency, remaining.length) },
+    () => worker(),
+  );
+  await Promise.all(workers);
 
   // Cleanup
+  clearInterval(dashboardInterval);
   process.off("SIGINT", onSignal);
   process.off("SIGTERM", onSignal);
 
+  renderStatus();
   process.stderr.write(
     `\n  ✓ Golden annotation done: ${processed} processed, ${errored} errors\n`,
   );