Human Labeling Webapp — Implementation Plan
Context
The SEC cyBERT project needs 1,200 human-labeled paragraphs as a gold holdout set. 6 student annotators, 3 per paragraph, 600 per person. The narrative specifies a "quiz-gated labeling web tool that enforces codebook knowledge before each session." No webapp or monorepo structure exists yet.
Phase 0: Monorepo Restructure
Convert the repo to a Bun workspace monorepo and extract shared schemas into a package.
New repo structure
sec-cyBERT/
package.json # workspace root: { "workspaces": ["packages/*", "ts", "labelapp"] }
bun.lock # moved from ts/bun.lock
packages/
schemas/
package.json # { "name": "@sec-cybert/schemas" }
tsconfig.json
src/
label.ts # moved from ts/src/schemas/label.ts
paragraph.ts
annotation.ts
consensus.ts
gold.ts
benchmark.ts
session.ts
ts/
package.json # adds dependency: "@sec-cybert/schemas": "workspace:*"
tsconfig.json # add project reference to ../packages/schemas
src/
schemas/ # DELETED (moved to packages/schemas)
...
scripts/
...
labelapp/
package.json # Next.js app, depends on @sec-cybert/schemas
...
Import rewrite
All imports in ts/src/ and ts/scripts/ change from relative schema paths to the package:
// Before (ts/src/label/annotate.ts)
import { LabelOutputRaw, toLabelOutput } from "../schemas/label.ts";
import type { Paragraph } from "../schemas/paragraph.ts";
// After
import { LabelOutputRaw, toLabelOutput } from "@sec-cybert/schemas/label.ts";
import type { Paragraph } from "@sec-cybert/schemas/paragraph.ts";
No barrel file. Direct path-based imports. The existing ts/src/schemas/index.ts is deleted.
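As a rough illustration of the rewrite, the prefix substitution can be expressed as a small pure function. This is only a sketch: the helper name rewriteSchemaImports is hypothetical, and it assumes double-quoted import specifiers that start with the relative prefixes `../schemas/`, `./schemas/`, or `../src/schemas/`.

```typescript
// Hypothetical helper (not part of the existing repo): rewrites relative
// schema import prefixes to the workspace package prefix.
const RELATIVE_SCHEMA_PREFIXES = ['"../src/schemas/', '"../schemas/', '"./schemas/'];
const PACKAGE_SCHEMA_PREFIX = '"@sec-cybert/schemas/';

function rewriteSchemaImports(source: string): string {
  let result = source;
  for (const prefix of RELATIVE_SCHEMA_PREFIXES) {
    // split/join replaces every occurrence without regex escaping concerns
    result = result.split(prefix).join(PACKAGE_SCHEMA_PREFIX);
  }
  return result;
}
```

Because the opening quote is part of each pattern, deeper paths such as `"../../schemas/` are left untouched and would need their own pattern.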
Files to rewrite imports in (~30 files)
ts/src/ (15 files): cli.ts, label/annotate.ts, label/batch.ts, label/consensus.ts, label/prompts.ts, analyze/corpus-stats.ts, analyze/data-quality.ts, analyze/dedup-analysis.ts, extract/pipeline.ts, extract/segment.ts, extract/fast-reparse.ts
ts/scripts/ (14 files): dispute-crosstab.ts, model-bench.ts, sample-disputes.ts, mimo-pilot.ts, model-bias-analysis.ts, segment-analysis.ts, mimo-raw-test.ts, mimo-test.ts, judge-bench.ts, judge-diag-batch.ts, judge-diag.ts, pilot.ts, stage1-run.ts, model-probe.ts
Can be done with one sed substitution per pattern:
"../schemas/ → "@sec-cybert/schemas/
"./schemas/ → "@sec-cybert/schemas/
"../src/schemas/ → "@sec-cybert/schemas/
tsconfig setup
packages/schemas/tsconfig.json:
{
  "compilerOptions": {
    "lib": ["ESNext"],
    "target": "ESNext",
    "module": "Preserve",
    "moduleDetection": "force",
    "moduleResolution": "bundler",
    "allowImportingTsExtensions": true,
    "verbatimModuleSyntax": true,
    "emitDeclarationOnly": true, // a referenced composite project may not set noEmit (TS6310)
    "strict": true,
    "skipLibCheck": true,
    "composite": true,
    "rootDir": "src",
    "outDir": "dist"
  },
  "include": ["src/**/*.ts"]
}
ts/tsconfig.json — add project reference:
{
"compilerOptions": { ... },
"references": [{ "path": "../packages/schemas" }]
}
Verification
- bun install from root resolves all workspaces
- bun run --filter ts typecheck passes
- Existing scripts still work:
bun run --filter ts sec -- label:cost
Phase 1: Labelapp — Next.js + Drizzle + Postgres
Stack
- Next.js (App Router, turbopack dev)
- Tailwind CSS v4 + shadcn/ui (components: RadioGroup, Button, Card, Sidebar, Collapsible, Table, Charts)
- lucide-react for icons (shadcn default)
- Drizzle ORM + drizzle-kit (schema push, no migration files needed for this)
- Postgres 18 via docker-compose (named volume for data)
- Bun as package manager (auto-uses Node for Next.js dev/build)
- @sec-cybert/schemas workspace dependency
- Playwright for E2E tests
Postgres (already running)
Docker compose is already set up and running at the repo root. DB: sec_cybert, user: sec_cybert, password: sec_cybert, port 5432.
DATABASE_URL=postgresql://sec_cybert:sec_cybert@localhost:5432/sec_cybert
Drizzle schema (labelapp/db/schema.ts)
import { pgTable, text, integer, real, timestamp, boolean, unique } from "drizzle-orm/pg-core";
export const paragraphs = pgTable("paragraphs", {
id: text("id").primaryKey(), // UUID from Paragraph.id
text: text("text").notNull(),
wordCount: integer("word_count").notNull(),
paragraphIndex: integer("paragraph_index").notNull(),
companyName: text("company_name").notNull(),
cik: text("cik").notNull(),
ticker: text("ticker"),
filingType: text("filing_type").notNull(),
filingDate: text("filing_date").notNull(),
fiscalYear: integer("fiscal_year").notNull(),
accessionNumber: text("accession_number").notNull(),
secItem: text("sec_item").notNull(),
// Stage 1 consensus (for stratification, not shown to annotators during labeling)
stage1Category: text("stage1_category"),
stage1Specificity: integer("stage1_specificity"),
stage1Method: text("stage1_method"),
stage1Confidence: real("stage1_confidence"),
});
export const annotators = pgTable("annotators", {
id: text("id").primaryKey(), // slug: "joey", "alice", etc.
displayName: text("display_name").notNull(),
password: text("password").notNull(), // plaintext (just their name)
});
export const assignments = pgTable("assignments", {
paragraphId: text("paragraph_id").notNull().references(() => paragraphs.id),
annotatorId: text("annotator_id").notNull().references(() => annotators.id),
assignedAt: timestamp("assigned_at").notNull().defaultNow(),
isWarmup: boolean("is_warmup").notNull().default(false),
}, (t) => [unique().on(t.paragraphId, t.annotatorId)]);
export const humanLabels = pgTable("human_labels", {
id: integer("id").primaryKey().generatedAlwaysAsIdentity(),
paragraphId: text("paragraph_id").notNull().references(() => paragraphs.id),
annotatorId: text("annotator_id").notNull().references(() => annotators.id),
contentCategory: text("content_category").notNull(),
specificityLevel: integer("specificity_level").notNull(),
notes: text("notes"),
labeledAt: timestamp("labeled_at").notNull().defaultNow(),
sessionId: text("session_id").notNull(),
durationMs: integer("duration_ms"),
activeMs: integer("active_ms"), // idle-excluded active time
}, (t) => [unique().on(t.paragraphId, t.annotatorId)]);
export const quizSessions = pgTable("quiz_sessions", {
id: text("id").primaryKey(), // UUID
annotatorId: text("annotator_id").notNull().references(() => annotators.id),
startedAt: timestamp("started_at").notNull().defaultNow(),
completedAt: timestamp("completed_at"),
passed: boolean("passed").notNull().default(false),
score: integer("score").notNull().default(0),
totalQuestions: integer("total_questions").notNull(),
answers: text("answers").notNull().default("[]"), // JSON
});
export const adjudications = pgTable("adjudications", {
paragraphId: text("paragraph_id").primaryKey().references(() => paragraphs.id),
finalCategory: text("final_category").notNull(),
finalSpecificity: integer("final_specificity").notNull(),
method: text("method").notNull(), // consensus | majority | discussion
adjudicatorId: text("adjudicator_id"),
notes: text("notes"),
resolvedAt: timestamp("resolved_at").notNull().defaultNow(),
});
Labelapp file structure (no src/ — flat root, consistent with shadcn)
labelapp/
package.json
next.config.ts
tsconfig.json # @/* maps to ./*
drizzle.config.ts
playwright.config.ts
components.json # shadcn config
.env.local # DATABASE_URL
db/
index.ts # drizzle client
schema.ts # tables above
lib/
utils.ts # shadcn cn() helper
sampling.ts # stratified sampling logic
assignment.ts # BIBD assignment generation
metrics.ts # Cohen's kappa, Krippendorff's alpha
quiz-questions.ts # question bank
components/
ui/ # shadcn components
app/
layout.tsx # root layout
globals.css # Tailwind + shadcn theme
page.tsx # login screen
dashboard/
page.tsx # annotator dashboard (progress, start session)
quiz/
page.tsx # quiz flow
label/
page.tsx # main labeling interface
admin/
page.tsx # adjudication queue + metrics dashboard
api/
auth/route.ts # login/logout
quiz/route.ts # start quiz, submit answers
label/route.ts # get next paragraph, submit label
warmup/route.ts # get warmup paragraph, submit + get feedback
adjudicate/route.ts # get queue, resolve
metrics/route.ts # IRR metrics
export/route.ts # trigger gold label export
scripts/
seed.ts # import paragraphs + consensus from JSONL, create annotators
sample.ts # stratified sample → 1,200 paragraphs
assign.ts # BIBD assignment → 3,600 rows
export.ts # dump gold labels to JSONL (GoldLabel schema format)
tests/
helpers/
reset-db.ts
login.ts
00-setup.spec.ts
01-auth.spec.ts
...
Package.json scripts
{
"scripts": {
"dev": "next dev --turbopack",
"build": "next build",
"start": "next start",
"db:push": "drizzle-kit push",
"db:studio": "drizzle-kit studio",
"seed": "bun run scripts/seed.ts",
"sample": "bun run scripts/sample.ts",
"assign": "bun run scripts/assign.ts",
"export": "bun run scripts/export.ts"
}
}
Phase 2: Core Features
Authentication
- Login page: annotator ID dropdown + password field (password = their name)
- Server-side session via Next.js cookies (signed, httpOnly)
- Middleware checks auth on all /dashboard, /quiz, /label, /admin routes
- No external auth library needed, just the cookies() API
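One way to get a signed cookie value without an auth library is an HMAC over the annotator ID. The sketch below is illustrative only: the function names, the `annotatorId.signature` value format, and the idea of reading the secret from an environment variable are all assumptions, and it presumes a Node runtime where node:crypto is available.

```typescript
import { createHmac, timingSafeEqual } from "node:crypto";

// Produces "<annotatorId>.<base64url HMAC-SHA256 of annotatorId>".
// The secret would come from configuration (assumed, e.g. an env var).
function signSessionValue(annotatorId: string, secret: string): string {
  const mac = createHmac("sha256", secret).update(annotatorId).digest("base64url");
  return `${annotatorId}.${mac}`;
}

// Returns the annotator ID if the signature checks out, otherwise null.
function verifySessionValue(cookieValue: string, secret: string): string | null {
  const dot = cookieValue.lastIndexOf(".");
  if (dot <= 0) return null;
  const annotatorId = cookieValue.slice(0, dot);
  const expected = Buffer.from(signSessionValue(annotatorId, secret));
  const actual = Buffer.from(cookieValue);
  // timingSafeEqual throws on length mismatch, so compare lengths first
  if (expected.length !== actual.length) return null;
  return timingSafeEqual(expected, actual) ? annotatorId : null;
}
```

The signed value would then be stored with the httpOnly flag via the cookies() API and verified on each protected request.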
Quiz System
- Question bank (~30 questions) in lib/quiz-questions.ts, 4 types:
  - Person-vs-function (8-10): "Is this Management Role or RMP?"
  - Materiality disclaimers (6-8): "Strategy Integration or None/Other?"
  - QV fact counting (6-8): "Specificity 3 or 4?"
  - SPAC exception (3-4): "What category for this shell company?"
- Per session: 8 questions (2 per type, random draw), pass = 7/8
- Immediate feedback with codebook explanation after each answer
- Failed → review mistakes → retry. Passed → proceed to warmup.
- Session stored in quiz_sessions, referenced by human_labels.session_id
- Session expires on 2-hour idle (checked server-side)
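The per-session draw, pass threshold, and idle expiry described above could look roughly like this. It is a minimal sketch: the type names, the question-type slugs, and the function names are hypothetical, and the real question bank will carry more fields (prompt, options, explanation).

```typescript
type QuizQuestionType = "person-vs-function" | "materiality" | "qv-counting" | "spac";
interface QuizQuestion { id: string; type: QuizQuestionType; }

const QUESTION_TYPES: QuizQuestionType[] = ["person-vs-function", "materiality", "qv-counting", "spac"];
const PASS_THRESHOLD = 7;                      // pass = 7/8
const SESSION_IDLE_LIMIT_MS = 2 * 60 * 60 * 1000; // 2-hour idle expiry

// Fisher-Yates shuffle on a copy; rng is injectable for deterministic tests.
function shuffledCopy<T>(items: readonly T[], rng: () => number): T[] {
  const out = [...items];
  for (let i = out.length - 1; i > 0; i--) {
    const j = Math.floor(rng() * (i + 1));
    const tmp = out[i] as T;
    out[i] = out[j] as T;
    out[j] = tmp;
  }
  return out;
}

// Draws `perType` random questions from each of the 4 types (8 total by default).
function drawQuiz(bank: readonly QuizQuestion[], perType = 2, rng: () => number = Math.random): QuizQuestion[] {
  const picked = QUESTION_TYPES.flatMap((type) =>
    shuffledCopy(bank.filter((q) => q.type === type), rng).slice(0, perType),
  );
  return shuffledCopy(picked, rng); // mix types so the order is not predictable
}

function isPassingScore(score: number): boolean {
  return score >= PASS_THRESHOLD;
}

// Server-side idle check: expired once more than 2 hours pass without activity.
function isSessionExpired(lastActivityMs: number, nowMs: number): boolean {
  return nowMs - lastActivityMs > SESSION_IDLE_LIMIT_MS;
}
```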
Warm-up (5 paragraphs per session)
- Pre-selected paragraphs with known gold labels + feedback text
- Identical UI to labeling, but after submit: shows gold answer + explanation
- Not counted toward gold set (assignments.is_warmup = true)
Labeling Interface
- Top bar: Filing metadata (company, ticker, filing type, date)
- Center: Paragraph text, large and readable
- Form:
- Content category: 7 radio buttons with short labels
- Specificity level: 4 radio buttons (Generic Boilerplate / Sector-Adapted / Firm-Specific / Quantified-Verifiable)
- Notes: optional textarea
- Submit button
- Keyboard shortcuts: 1-7 category, Shift+1-4 specificity, Enter submit
- Sidebar (collapsible): Codebook quick-reference (category defs, IS/NOT lists, decision rules)
- Progress bar: "47 / 600 completed"
- Active timer: visible timer in header bar tracks active labeling time with idle detection (30s threshold). Submits both duration_ms (wall-clock) and active_ms (idle-excluded) per paragraph. Timer pauses and shows amber "idle" indicator when no mouse/keyboard/scroll activity detected.
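To make the difference between duration_ms and active_ms concrete, here is one possible way to compute both from a list of activity timestamps. The exact idle semantics are an assumption: this sketch counts at most 30 seconds of every gap between activity events, on the reasoning that the timer keeps running until the idle threshold is reached and then pauses. The function name computeTimings is hypothetical.

```typescript
const IDLE_THRESHOLD_MS = 30000; // 30s idle threshold from the plan

// shownAt / submittedAt and activity timestamps are epoch milliseconds;
// activityTimestamps is assumed to be sorted ascending.
function computeTimings(
  shownAt: number,
  submittedAt: number,
  activityTimestamps: readonly number[],
  idleThresholdMs: number = IDLE_THRESHOLD_MS,
): { durationMs: number; activeMs: number } {
  const points = [
    shownAt,
    ...activityTimestamps.filter((t) => t > shownAt && t < submittedAt),
    submittedAt,
  ];
  let activeMs = 0;
  for (let i = 1; i < points.length; i++) {
    const gap = (points[i] as number) - (points[i - 1] as number);
    // Assumption: time beyond the idle threshold within a gap is not active
    activeMs += Math.min(gap, idleThresholdMs);
  }
  return { durationMs: submittedAt - shownAt, activeMs };
}
```

For example, with activity at 10s, 70s, and 75s and a submit at 80s, the 60-second gap contributes only 30 seconds, so active time is 50 seconds against an 80-second wall-clock duration.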
Sampling (1,200 from ~50K)
Script reads Stage 1 consensus from JSONL, stratifies:
| Stratum | Count | Source |
|---|---|---|
| Mgmt↔RMP split votes | 120 | Paragraphs where Stage 1 annotators disagreed on this axis |
| None/Other↔Strategy splits | 80 | Materiality disclaimer boundary |
| Spec [3,4] splits | 80 | QV counting boundary |
| Board↔Mgmt splits | 80 | Board/management boundary |
| Rare category guarantee | 120 | ≥15 per category, extra for Incident Disclosure |
| Proportional stratified random | 720 | Fill from category×specificity cells |
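The plan does not say how the 720 proportional slots are split across category×specificity cells. One common option is largest-remainder allocation, sketched below with integer arithmetic so the counts always sum to the requested total. The function name and the cell keys used in the usage note are hypothetical.

```typescript
// Splits sampleTotal across cells in proportion to each cell's population,
// handing leftover slots to the cells with the largest remainders.
function allocateProportional(
  cellSizes: Record<string, number>,
  sampleTotal: number,
): Record<string, number> {
  const cells = Object.keys(cellSizes);
  const population = cells.reduce((sum, cell) => sum + (cellSizes[cell] as number), 0);
  if (population === 0) throw new Error("no paragraphs available to sample from");

  const quotas = cells.map((cell) => {
    const scaled = (cellSizes[cell] as number) * sampleTotal;
    return { cell, count: Math.floor(scaled / population), remainder: scaled % population };
  });

  let leftover = sampleTotal - quotas.reduce((sum, q) => sum + q.count, 0);
  const byRemainder = [...quotas].sort((x, y) => y.remainder - x.remainder);
  for (const quota of byRemainder) {
    if (leftover === 0) break;
    quota.count += 1; // same object as in `quotas`, so the update is visible there
    leftover -= 1;
  }

  const allocation: Record<string, number> = {};
  for (const quota of quotas) allocation[quota.cell] = quota.count;
  return allocation;
}
```

Calling allocateProportional({ cellA: 500, cellB: 300, cellC: 200 }, 720) yields 360, 216, and 144. Paragraphs already taken by the targeted strata would need to be excluded from the cell populations before this step.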
Assignment: BIBD
C(6,3)=20 unique triples of 6 annotators. Assign 60 paragraphs to each triple (20×60 = 1,200). Each annotator appears in C(5,2)=10 triples → 10×60 = 600 paragraphs. Every annotator pair appears together in C(4,1)=4 triples → 4×60 = 240 shared paragraphs, so every pairwise kappa is computed on the same number of co-labeled paragraphs.
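The counting argument can be checked with a short sketch that enumerates all triples and assigns an equal block of paragraphs to each. The function names and the Assignment shape are illustrative rather than the final scripts/assign.ts, and in practice the paragraph IDs would presumably be shuffled first so each triple sees a mix of strata.

```typescript
interface AssignmentRow { paragraphId: string; annotatorId: string; }

// Enumerates all C(n,3) unordered triples.
function allTriples<T>(items: readonly T[]): [T, T, T][] {
  const triples: [T, T, T][] = [];
  for (let i = 0; i < items.length; i++) {
    for (let j = i + 1; j < items.length; j++) {
      for (let k = j + 1; k < items.length; k++) {
        triples.push([items[i] as T, items[j] as T, items[k] as T]);
      }
    }
  }
  return triples;
}

// Gives each triple an equal, contiguous block of paragraphs;
// every paragraph yields 3 assignment rows (one per annotator in its triple).
function assignByTriples(
  paragraphIds: readonly string[],
  annotatorIds: readonly string[],
): AssignmentRow[] {
  const triples = allTriples(annotatorIds);
  if (triples.length === 0 || paragraphIds.length % triples.length !== 0) {
    throw new Error("paragraph count must be a positive multiple of the number of triples");
  }
  const perTriple = paragraphIds.length / triples.length;
  const rows: AssignmentRow[] = [];
  paragraphIds.forEach((paragraphId, index) => {
    const triple = triples[Math.floor(index / perTriple)] as [string, string, string];
    for (const annotatorId of triple) rows.push({ paragraphId, annotatorId });
  });
  return rows;
}
```

With 1,200 paragraph IDs and 6 annotator IDs this produces 3,600 rows, 600 per annotator, and 240 shared paragraphs for every pair.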
Adjudication
Auto-resolve after 3 labels:
- 3/3 agree both dims → consensus
- 2/3 agree both dims → majority
- Otherwise → flagged for admin
Admin page:
- Queue sorted by severity (3-way splits first)
- Shows paragraph + all 3 labels side-by-side + notes + Stage 1 reference
- Resolution: pick a label, enter custom, or mark for team discussion
- Stores in adjudications table
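The auto-resolve rule above is small enough to sketch as a pure function. The type names are hypothetical, and the "flagged" result is an illustrative marker for paragraphs that go to the admin queue; it is not one of the values stored in adjudications.method.

```typescript
interface LabelPair { contentCategory: string; specificityLevel: number; }

type AutoResolution =
  | { method: "consensus" | "majority"; finalCategory: string; finalSpecificity: number }
  | { method: "flagged" }; // hypothetical marker: needs admin adjudication

// Agreement must hold on BOTH dimensions (category and specificity).
function autoResolve(labels: readonly LabelPair[]): AutoResolution {
  if (labels.length !== 3) throw new Error("expected exactly 3 labels");
  const tally = new Map<string, { label: LabelPair; votes: number }>();
  for (const label of labels) {
    const key = `${label.contentCategory}|${label.specificityLevel}`;
    const entry = tally.get(key);
    if (entry) entry.votes += 1;
    else tally.set(key, { label, votes: 1 });
  }
  for (const { label, votes } of Array.from(tally.values())) {
    if (votes >= 2) {
      return {
        method: votes === 3 ? "consensus" : "majority",
        finalCategory: label.contentCategory,
        finalSpecificity: label.specificityLevel,
      };
    }
  }
  return { method: "flagged" };
}
```

Note that two annotators agreeing on category but not on specificity does not count as a majority under this reading of the rule.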
Metrics Dashboard (admin page)
- Overall progress: N/1,200 fully labeled, N adjudicated
- Cohen's Kappa (category): pairwise 6×6 matrix + average. Target ≥ 0.75
- Krippendorff's Alpha (specificity): single number. Target ≥ 0.67
- Raw consensus rate. Target ≥ 75%
- Per-category confusion matrix (7×7)
- Per-annotator stats: completion, agreement rate, distribution
- Confusion axis disagreement rates
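For reference, Cohen's kappa for one annotator pair is (p_o − p_e) / (1 − p_e), where p_o is observed agreement and p_e is the agreement expected from each rater's marginal label frequencies. A minimal sketch of what lib/metrics.ts might contain follows; the function name and signature are assumptions.

```typescript
// a[i] and b[i] are the two annotators' category labels for the same paragraph.
function cohensKappa(a: readonly string[], b: readonly string[]): number {
  if (a.length !== b.length || a.length === 0) {
    throw new Error("label arrays must be non-empty and of equal length");
  }
  const n = a.length;
  const countsA = new Map<string, number>();
  const countsB = new Map<string, number>();
  let agreements = 0;
  for (let i = 0; i < n; i++) {
    const labelA = a[i] as string;
    const labelB = b[i] as string;
    if (labelA === labelB) agreements++;
    countsA.set(labelA, (countsA.get(labelA) ?? 0) + 1);
    countsB.set(labelB, (countsB.get(labelB) ?? 0) + 1);
  }
  const observed = agreements / n;
  let expected = 0;
  countsA.forEach((countA, category) => {
    expected += (countA / n) * ((countsB.get(category) ?? 0) / n);
  });
  // Degenerate case: both raters used one identical category throughout
  if (expected === 1) return 1;
  return (observed - expected) / (1 - expected);
}
```

Each cell of the 6×6 matrix would be computed only over the paragraphs that both annotators in the pair labeled.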
Export
Script dumps adjudicated gold labels to data/gold/gold-labels.jsonl in the existing GoldLabel schema format, readable by readJsonl(path, GoldLabel).
Reused Existing Code
| What | Source | Used by |
|---|---|---|
| ContentCategory, SpecificityLevel, LabelOutput | @sec-cybert/schemas/label.ts | Label validation, quiz answers |
| Paragraph, FilingMeta | @sec-cybert/schemas/paragraph.ts | Seed script, display |
| HumanLabel, GoldLabel | @sec-cybert/schemas/gold.ts | Export script format |
| Annotation | @sec-cybert/schemas/annotation.ts | Seed script (Stage 1 data) |
| readJsonl() | ts/src/lib/jsonl.ts | Seed script (import from JSONL) |
| Codebook content | docs/LABELING-CODEBOOK.md | Quiz questions, sidebar reference |
Note: readJsonl stays in ts/src/lib/ for now. The seed script either imports it directly via a relative path, or we extract it to the schemas package if needed. Since it depends on Zod (which schemas already has), it could live there.
Setup Sequence
# 1. Monorepo setup
bun install # from repo root, resolves all workspaces
# 2. Verify existing pipeline still works
bun run --filter ts typecheck
# 3. Start Postgres (compose file is at the repo root; skip if already running)
docker compose up -d
cd labelapp
# 4. Push schema
bun run db:push
# 5. Seed data
bun run seed # imports paragraphs + consensus + creates annotators
# 6. Sample + assign
bun run sample # stratified sample → 1,200 paragraphs marked
bun run assign # BIBD → 3,600 assignment rows
# 7. Start dev server
bun run dev # Next.js on :3000
# After labeling complete:
bun run export # → data/gold/gold-labels.jsonl
Implementation Order
1. Monorepo restructure: root package.json, extract schemas, rewrite imports, verify typecheck
2. Labelapp scaffold: Next.js init, Drizzle schema, db connection, Playwright setup
3. Seed + sample + assign scripts: data pipeline into Postgres
4. Auth: login page, session cookies, middleware
5. Quiz system: question bank, quiz flow page, session gating
6. Labeling UI: the core (next paragraph, submit label, progress tracking, keyboard shortcuts, codebook sidebar)
7. Warm-up flow: 5 pre-labeled paragraphs with feedback
8. Admin, adjudication: queue, resolution UI
9. Admin, metrics dashboard: kappa, alpha, confusion matrix, per-annotator stats
10. Export script: gold labels to JSONL
Each phase ends with Playwright tests that verify it works E2E before moving on.
Testing Strategy
Philosophy
No unit tests. Integration/E2E only. Two layers:
- Backend route tests (bun test): hit real API routes against real Postgres, verify responses/DB state
- Playwright E2E: click through the real UI in a real browser
Backend Route Tests (bun test)
- Colocated __test__ dirs adjacent to each route handler
- Tests import the route handler directly (or use fetch against the dev server)
- Run against real Postgres (same DATABASE_URL)
- Each test file resets relevant tables via Drizzle before running
labelapp/app/api/
auth/
route.ts
__test__/
auth.test.ts # login/logout, session validation
quiz/
route.ts
__test__/
quiz.test.ts # start quiz, submit answers, pass/fail logic
label/
route.ts
__test__/
label.test.ts # get next paragraph, submit label, skip completed
warmup/
route.ts
__test__/
warmup.test.ts # get warmup, submit + get feedback
adjudicate/
route.ts
__test__/
adjudicate.test.ts # get queue, resolve, verify DB state
metrics/
route.ts
__test__/
metrics.test.ts # kappa/alpha values, progress counts
export/
route.ts
__test__/
export.test.ts # trigger export, verify JSONL output
Playwright E2E
- Playwright installed in labelapp/ as dev dependency
- Tests in labelapp/tests/ (Playwright default)
- playwright.config.ts configured to:
  - Start Next.js dev server automatically via webServer config
  - Use the real Postgres (same DATABASE_URL)
  - Run tests serially (stateful DB)
- Test DB reset: Each test file starts by truncating relevant tables (not dropping, so the schema stays). A tests/helpers/reset-db.ts util handles this via Drizzle.
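As a sketch of what the reset helper might issue, the statement below truncates a chosen set of tables while leaving the schema in place. The helper name is hypothetical; the table names are taken from the Drizzle schema in this plan. How the string gets executed (for example through Drizzle's raw SQL support) depends on how db/index.ts is set up and is left as an assumption.

```typescript
// Table names as declared in labelapp/db/schema.ts
const LABELAPP_TABLES = [
  "adjudications",
  "human_labels",
  "quiz_sessions",
  "assignments",
  "annotators",
  "paragraphs",
] as const;
type LabelappTable = (typeof LABELAPP_TABLES)[number];

// Builds one TRUNCATE statement for the given tables.
// RESTART IDENTITY resets identity columns such as human_labels.id.
// CASCADE also truncates any table with a foreign key into the named tables,
// so truncating "paragraphs" clears labels and assignments as well.
function buildTruncateSql(tables: readonly LabelappTable[]): string {
  if (tables.length === 0) throw new Error("no tables given");
  const quoted = tables.map((table) => `"${table}"`).join(", ");
  return `TRUNCATE TABLE ${quoted} RESTART IDENTITY CASCADE`;
}
```

A test file that only touches labels could call buildTruncateSql(["human_labels", "adjudications"]) and keep the seeded paragraphs and annotators.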
Test Files & What They Cover
tests/00-setup.spec.ts — Data pipeline scripts
- Run bun run seed via execSync, verify paragraphs table has rows
- Run bun run sample, verify exactly 1,200 paragraphs marked (or a sampled flag / separate table)
- Run bun run assign, verify 3,600 assignment rows, each annotator has 600
- Verify BIBD property: every annotator pair shares equal paragraph count
tests/01-auth.spec.ts — Login flow
- Navigate to /, see login form
- Login with wrong password → error message shown
- Login with correct password → redirected to /dashboard
- Access /dashboard without login → redirected to /
- Logout → session cleared, redirected to /
tests/02-quiz.spec.ts — Quiz gating
- Login, navigate to dashboard, click "Start Session"
- Verify quiz page loads with 8 questions
- Answer all correctly → see "Passed" message, "Begin labeling" button appears
- Start new session, answer 2 wrong → see "Failed", can retry
- Verify cannot access /label without a passed quiz session
tests/03-warmup.spec.ts — Warm-up flow
- After passing quiz, verify 5 warm-up paragraphs shown first
- Submit a label → gold answer + explanation revealed
- After 5 warm-ups → transition to real labeling
- Verify warm-up labels are NOT counted in progress stats
tests/04-labeling.spec.ts — Core labeling flow
- Verify paragraph text + filing metadata displayed
- Select category via radio button, select specificity, submit
- Verify redirected to next paragraph, progress increments
- Verify keyboard shortcuts work (press "1" → first category selected, etc.)
- Submit several labels, verify they're stored in DB
- Verify codebook sidebar toggles open/close
- Verify "next" skips already-completed paragraphs (label one, refresh, get a different one)
tests/05-adjudication.spec.ts — Multi-annotator + admin flow
- Seed 3 test annotators with assignments for the same paragraph
- Login as each, pass quiz, label the same paragraph with DIFFERENT labels
- Login as admin, navigate to /admin
- Verify the disputed paragraph appears in the adjudication queue
- Resolve it (pick one label), verify adjudication stored
- Verify it no longer appears in queue
tests/06-metrics.spec.ts — Dashboard metrics
- Seed known label data (pre-computed expected kappa/alpha values)
- Navigate to admin metrics page
- Verify progress numbers match expected
- Verify consensus rate displayed and reasonable
- Verify per-annotator stats shown
tests/07-export.spec.ts — Gold export
- Ensure some adjudicated labels exist (from prior tests or seeded)
- Run bun run export via execSync
- Read the output JSONL file
- Parse each line, verify it matches GoldLabel schema
- Verify paragraph count matches expected
Running Tests
cd labelapp
# Backend route tests (fast, no browser)
bun test # runs all __test__/*.test.ts
bun test app/api/label/__test__/ # run one route's tests
# Playwright E2E (browser)
bunx playwright install --with-deps chromium # one-time browser install
bunx playwright test # runs all tests serially
bunx playwright test tests/04-labeling.spec.ts # run one file
# Both
bun test && bunx playwright test # full suite
Package.json test scripts
{
"scripts": {
"test": "bun test && playwright test",
"test:api": "bun test",
"test:e2e": "playwright test",
"test:e2e:ui": "playwright test --ui"
}
}
How Agents Use This
Each implementation phase follows this cycle:
- Write the feature code
- Write the backend route test (__test__/ adjacent to route) AND/OR Playwright test
- Run bun test for route logic, bunx playwright test for UI flows
- If tests fail, fix the code and re-run
- Only move to the next phase when all tests pass
The test suite is cumulative — tests from earlier phases keep running, ensuring nothing regresses. An agent completing phase 6 (labeling UI) runs the full suite to confirm everything still works.
Test Helpers
Shared (lib/__test__/helpers.ts or similar; labelapp has no src/ directory):
- resetDb(): truncate tables between test files via Drizzle
- seedTestData(): insert known paragraphs/assignments/labels for test scenarios
Playwright-specific (tests/helpers/):
- login.ts: reusable helper that logs in as annotator X, passes the quiz, and gets to labeling
- reset-db.ts: calls resetDb() for Playwright test setup
The login.ts helper is critical — it encapsulates the login → quiz → warmup flow so that labeling/adjudication tests don't have to repeat that ceremony.
Verification (automated)
Two test suites, both must pass:
bun test — Backend route integration tests
| Route | Test file | Verifies |
|---|---|---|
| api/auth | __test__/auth.test.ts | Login/logout, bad password rejection, session cookies |
| api/quiz | __test__/quiz.test.ts | Start quiz, submit answers, pass/fail threshold, session creation |
| api/label | __test__/label.test.ts | Get next paragraph, submit label to DB, skip completed, enforce quiz gate |
| api/warmup | __test__/warmup.test.ts | Get warmup paragraph, submit + receive gold feedback |
| api/adjudicate | __test__/adjudicate.test.ts | Get disagreement queue, resolve, verify DB state |
| api/metrics | __test__/metrics.test.ts | Kappa/alpha with known data, progress counts |
| api/export | __test__/export.test.ts | Trigger export, verify JSONL matches GoldLabel schema |
bunx playwright test — Browser E2E
| Test file | Verifies |
|---|---|
| 00-setup.spec.ts | Seed/sample/assign scripts produce correct DB state |
| 01-auth.spec.ts | Login form, redirect on auth failure, logout |
| 02-quiz.spec.ts | Quiz renders, pass/fail gating, retry flow |
| 03-warmup.spec.ts | 5 warm-ups with feedback, transition to real labeling |
| 04-labeling.spec.ts | Paragraph display, radio buttons, keyboard shortcuts, progress bar, codebook sidebar |
| 05-adjudication.spec.ts | 3 annotators disagree → admin queue → resolution |
| 06-metrics.spec.ts | Dashboard renders with correct numbers |
| 07-export.spec.ts | Export script produces valid JSONL |
Pre-test gates
- bun install from root succeeds
- bun run --filter ts typecheck passes (monorepo didn't break existing pipeline)
- Postgres is reachable, bun run db:push succeeds
Success criteria: cd labelapp && bun test && bunx playwright test exits 0.