2026-03-29 16:37:51 -04:00

Human Labeling Webapp — Implementation Plan

Context

The SEC cyBERT project needs 1,200 human-labeled paragraphs as a gold holdout set. There are 6 student annotators; each paragraph gets 3 labels, so each annotator labels 600 paragraphs (1,200 × 3 / 6). The narrative specifies a "quiz-gated labeling web tool that enforces codebook knowledge before each session." No webapp or monorepo structure exists yet.

Phase 0: Monorepo Restructure

Convert the repo to a Bun workspace monorepo and extract shared schemas into a package.

New repo structure

sec-cyBERT/
  package.json                    # workspace root: { "workspaces": ["packages/*", "ts", "labelapp"] }
  bun.lock                        # moved from ts/bun.lock
  packages/
    schemas/
      package.json                # { "name": "@sec-cybert/schemas" }
      tsconfig.json
      src/
        label.ts                  # moved from ts/src/schemas/label.ts
        paragraph.ts
        annotation.ts
        consensus.ts
        gold.ts
        benchmark.ts
        session.ts
  ts/
    package.json                  # adds dependency: "@sec-cybert/schemas": "workspace:*"
    tsconfig.json                 # add project reference to ../packages/schemas
    src/
      schemas/                    # DELETED (moved to packages/schemas)
      ...
    scripts/
      ...
  labelapp/
    package.json                  # Next.js app, depends on @sec-cybert/schemas
    ...

Import rewrite

All imports in ts/src/ and ts/scripts/ change from relative schema paths to the package:

// Before (ts/src/label/annotate.ts)
import { LabelOutputRaw, toLabelOutput } from "../schemas/label.ts";
import type { Paragraph } from "../schemas/paragraph.ts";

// After
import { LabelOutputRaw, toLabelOutput } from "@sec-cybert/schemas/label.ts";
import type { Paragraph } from "@sec-cybert/schemas/paragraph.ts";

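No barrel file. Direct path-based imports. The existing ts/src/schemas/index.ts is deleted. Because the files live under packages/schemas/src/, the specifier "@sec-cybert/schemas/label.ts" only resolves if packages/schemas/package.json maps subpaths into src/, presumably with something like "exports": { "./*": "./src/*" }.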

Files to rewrite imports in (~30 files)

ts/src/ (15 files, including): cli.ts, label/annotate.ts, label/batch.ts, label/consensus.ts, label/prompts.ts, analyze/corpus-stats.ts, analyze/data-quality.ts, analyze/dedup-analysis.ts, extract/pipeline.ts, extract/segment.ts, extract/fast-reparse.ts

ts/scripts/ (14 files): dispute-crosstab.ts, model-bench.ts, sample-disputes.ts, mimo-pilot.ts, model-bias-analysis.ts, segment-analysis.ts, mimo-raw-test.ts, mimo-test.ts, judge-bench.ts, judge-diag-batch.ts, judge-diag.ts, pilot.ts, stage1-run.ts, model-probe.ts

This can be done with one sed substitution per pattern:

"../schemas/ → "@sec-cybert/schemas/
"./schemas/ → "@sec-cybert/schemas/
"../src/schemas/ → "@sec-cybert/schemas/

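The three prefix rewrites above can also be expressed as a small pure function, which is easier to dry-run than sed. This is a minimal sketch: `rewriteSchemaImports` is a hypothetical helper name, and it assumes every schema import uses double quotes and one of the three listed prefixes (deeper paths such as "../../schemas/" would need their own pattern).

```typescript
// Hypothetical helper: the plan only specifies the three prefix rewrites.
const SCHEMA_PREFIXES = ['"../src/schemas/', '"../schemas/', '"./schemas/'];
const PACKAGE_PREFIX = '"@sec-cybert/schemas/';

export function rewriteSchemaImports(source: string): string {
  let out = source;
  for (const prefix of SCHEMA_PREFIXES) {
    // split/join replaces every literal occurrence without regex escaping.
    out = out.split(prefix).join(PACKAGE_PREFIX);
  }
  return out;
}
```

Running it over each file and diffing the result before writing gives a quick check that nothing outside the import lines was touched.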
tsconfig setup

packages/schemas/tsconfig.json:

{
  "compilerOptions": {
    "lib": ["ESNext"],
    "target": "ESNext",
    "module": "Preserve",
    "moduleDetection": "force",
    "moduleResolution": "bundler",
    "allowImportingTsExtensions": true,
    "verbatimModuleSyntax": true,
    // emitDeclarationOnly rather than noEmit: a referenced (composite) project may not disable emit
    "emitDeclarationOnly": true,
    "strict": true,
    "skipLibCheck": true,
    "composite": true,
    "rootDir": "src",
    "outDir": "dist"
  },
  "include": ["src/**/*.ts"]
}

ts/tsconfig.json — add project reference:

{
  "compilerOptions": { ... },
  "references": [{ "path": "../packages/schemas" }]
}

Verification

bun install from root resolves all workspaces
bun run --filter ts typecheck passes
Existing scripts still work: bun run --filter ts sec -- label:cost

Phase 1: Labelapp — Next.js + Drizzle + Postgres

Stack

Next.js (App Router, turbopack dev)
Tailwind CSS v4 + shadcn/ui (components: RadioGroup, Button, Card, Sidebar, Collapsible, Table, Charts)
lucide-react for icons (shadcn default)
Drizzle ORM + drizzle-kit (schema push, no migration files needed for this)
Postgres 18 via docker-compose (named volume for data)
Bun as package manager (auto-uses Node for Next.js dev/build)
@sec-cybert/schemas workspace dependency
Playwright for E2E tests

Postgres (already running)

Docker compose is already set up and running at the repo root. DB: sec_cybert, user: sec_cybert, password: sec_cybert, port 5432.

DATABASE_URL=postgresql://sec_cybert:sec_cybert@localhost:5432/sec_cybert

Drizzle schema (labelapp/db/schema.ts)

import { pgTable, text, integer, real, timestamp, boolean, unique } from "drizzle-orm/pg-core";

export const paragraphs = pgTable("paragraphs", {
  id: text("id").primaryKey(),                    // UUID from Paragraph.id
  text: text("text").notNull(),
  wordCount: integer("word_count").notNull(),
  paragraphIndex: integer("paragraph_index").notNull(),
  companyName: text("company_name").notNull(),
  cik: text("cik").notNull(),
  ticker: text("ticker"),
  filingType: text("filing_type").notNull(),
  filingDate: text("filing_date").notNull(),
  fiscalYear: integer("fiscal_year").notNull(),
  accessionNumber: text("accession_number").notNull(),
  secItem: text("sec_item").notNull(),
  // Stage 1 consensus (for stratification, not shown to annotators during labeling)
  stage1Category: text("stage1_category"),
  stage1Specificity: integer("stage1_specificity"),
  stage1Method: text("stage1_method"),
  stage1Confidence: real("stage1_confidence"),
});

export const annotators = pgTable("annotators", {
  id: text("id").primaryKey(),                    // slug: "joey", "alice", etc.
  displayName: text("display_name").notNull(),
  password: text("password").notNull(),           // plaintext (just their name)
});

export const assignments = pgTable("assignments", {
  paragraphId: text("paragraph_id").notNull().references(() => paragraphs.id),
  annotatorId: text("annotator_id").notNull().references(() => annotators.id),
  assignedAt: timestamp("assigned_at").notNull().defaultNow(),
  isWarmup: boolean("is_warmup").notNull().default(false),
}, (t) => [unique().on(t.paragraphId, t.annotatorId)]);

export const humanLabels = pgTable("human_labels", {
  id: integer("id").primaryKey().generatedAlwaysAsIdentity(),
  paragraphId: text("paragraph_id").notNull().references(() => paragraphs.id),
  annotatorId: text("annotator_id").notNull().references(() => annotators.id),
  contentCategory: text("content_category").notNull(),
  specificityLevel: integer("specificity_level").notNull(),
  notes: text("notes"),
  labeledAt: timestamp("labeled_at").notNull().defaultNow(),
  sessionId: text("session_id").notNull(),
  durationMs: integer("duration_ms"),
  activeMs: integer("active_ms"),          // idle-excluded active time
}, (t) => [unique().on(t.paragraphId, t.annotatorId)]);

export const quizSessions = pgTable("quiz_sessions", {
  id: text("id").primaryKey(),                    // UUID
  annotatorId: text("annotator_id").notNull().references(() => annotators.id),
  startedAt: timestamp("started_at").notNull().defaultNow(),
  completedAt: timestamp("completed_at"),
  passed: boolean("passed").notNull().default(false),
  score: integer("score").notNull().default(0),
  totalQuestions: integer("total_questions").notNull(),
  answers: text("answers").notNull().default("[]"),  // JSON
});

export const adjudications = pgTable("adjudications", {
  paragraphId: text("paragraph_id").primaryKey().references(() => paragraphs.id),
  finalCategory: text("final_category").notNull(),
  finalSpecificity: integer("final_specificity").notNull(),
  method: text("method").notNull(),               // consensus | majority | discussion
  adjudicatorId: text("adjudicator_id"),
  notes: text("notes"),
  resolvedAt: timestamp("resolved_at").notNull().defaultNow(),
});

Labelapp file structure (no src/ — flat root, consistent with shadcn)

labelapp/
  package.json
  next.config.ts
  tsconfig.json                       # @/* maps to ./*
  drizzle.config.ts
  playwright.config.ts
  components.json                     # shadcn config
  .env.local                          # DATABASE_URL
  db/
    index.ts                          # drizzle client
    schema.ts                         # tables above
  lib/
    utils.ts                          # shadcn cn() helper
    sampling.ts                       # stratified sampling logic
    assignment.ts                     # BIBD assignment generation
    metrics.ts                        # Cohen's kappa, Krippendorff's alpha
    quiz-questions.ts                 # question bank
  components/
    ui/                               # shadcn components
  app/
    layout.tsx                        # root layout
    globals.css                       # Tailwind + shadcn theme
    page.tsx                          # login screen
    dashboard/
      page.tsx                        # annotator dashboard (progress, start session)
    quiz/
      page.tsx                        # quiz flow
    label/
      page.tsx                        # main labeling interface
    admin/
      page.tsx                        # adjudication queue + metrics dashboard
    api/
      auth/route.ts                   # login/logout
      quiz/route.ts                   # start quiz, submit answers
      label/route.ts                  # get next paragraph, submit label
      warmup/route.ts                 # get warmup paragraph, submit + get feedback
      adjudicate/route.ts             # get queue, resolve
      metrics/route.ts                # IRR metrics
      export/route.ts                 # trigger gold label export
  scripts/
    seed.ts                           # import paragraphs + consensus from JSONL, create annotators
    sample.ts                         # stratified sample → 1,200 paragraphs
    assign.ts                         # BIBD assignment → 3,600 rows
    export.ts                         # dump gold labels to JSONL (GoldLabel schema format)
  tests/
    helpers/
      reset-db.ts
      login.ts
    00-setup.spec.ts
    01-auth.spec.ts
    ...

Package.json scripts

{
  "scripts": {
    "dev": "next dev --turbopack",
    "build": "next build",
    "start": "next start",
    "db:push": "drizzle-kit push",
    "db:studio": "drizzle-kit studio",
    "seed": "bun run scripts/seed.ts",
    "sample": "bun run scripts/sample.ts",
    "assign": "bun run scripts/assign.ts",
    "export": "bun run scripts/export.ts"
  }
}

Phase 2: Core Features

Authentication

Login page: annotator ID dropdown + password field (password = their name)
Server-side session via Next.js cookies (signed, httpOnly)
Middleware checks auth on all /dashboard, /quiz, /label, /admin routes
No external auth library needed — just cookies() API
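Since cookies() stores values as given, the "signed" part of the session cookie needs a small helper. The sketch below shows one way to do it with an HMAC from node:crypto. The function names and the idea of a secret read from an environment variable are assumptions, not part of the plan.

```typescript
import { createHmac, timingSafeEqual } from "node:crypto";

// Assumption: `secret` comes from an env var such as SESSION_SECRET (hypothetical name).
function mac(value: string, secret: string): string {
  return createHmac("sha256", secret).update(value).digest("base64url");
}

// Cookie value format: "<annotatorId>.<hmac>"
export function signSession(annotatorId: string, secret: string): string {
  return `${annotatorId}.${mac(annotatorId, secret)}`;
}

// Returns the annotator ID if the signature checks out, otherwise null.
export function verifySession(cookieValue: string, secret: string): string | null {
  const dot = cookieValue.lastIndexOf(".");
  if (dot <= 0) return null;
  const annotatorId = cookieValue.slice(0, dot);
  const given = Buffer.from(cookieValue.slice(dot + 1));
  const expected = Buffer.from(mac(annotatorId, secret));
  // timingSafeEqual throws on length mismatch, so compare lengths first.
  if (given.length !== expected.length) return null;
  return timingSafeEqual(given, expected) ? annotatorId : null;
}
```

The login route would set the result of signSession as an httpOnly cookie, and the middleware would call verifySession on each protected request.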

Quiz System

Question bank (~30 questions) in lib/quiz-questions.ts, 4 types:
- Person-vs-function (8-10): "Is this Management Role or RMP?"
- Materiality disclaimers (6-8): "Strategy Integration or None/Other?"
- QV fact counting (6-8): "Specificity 3 or 4?"
- SPAC exception (3-4): "What category for this shell company?"
Per session: 8 questions (2 per type, random draw), pass = 7/8
Immediate feedback with codebook explanation after each answer
Failed → review mistakes → retry. Passed → proceed to warmup.
Session stored in quiz_sessions, referenced by human_labels.session_id
Session expires on 2-hour idle (checked server-side)
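The draw, pass threshold, and idle expiry above could look roughly like this. It is a sketch under assumptions: the type slugs, the `QuizQuestion` shape, and the function names are illustrative, and the real question bank will carry more fields (prompt, options, explanation).

```typescript
// Illustrative type slugs for the four question types listed above.
export const QUESTION_TYPES = ["person-vs-function", "materiality", "qv-counting", "spac"] as const;
export type QuestionType = (typeof QUESTION_TYPES)[number];

export interface QuizQuestion {
  id: string;
  type: QuestionType;
}

const PER_TYPE = 2;                        // 2 per type x 4 types = 8 questions
const PASS_SCORE = 7;                      // pass = 7/8
const IDLE_LIMIT_MS = 2 * 60 * 60 * 1000;  // 2-hour idle expiry

// Draw PER_TYPE distinct questions of each type, uniformly at random.
export function drawQuiz(bank: QuizQuestion[], rng: () => number = Math.random): QuizQuestion[] {
  const picked: QuizQuestion[] = [];
  for (const type of QUESTION_TYPES) {
    const pool = bank.filter((q) => q.type === type);
    if (pool.length < PER_TYPE) throw new Error(`not enough questions of type ${type}`);
    for (let k = 0; k < PER_TYPE; k++) {
      // Remove the chosen question from the pool so it cannot be drawn twice.
      const [q] = pool.splice(Math.floor(rng() * pool.length), 1);
      if (q !== undefined) picked.push(q);
    }
  }
  return picked;
}

export function isPassed(score: number): boolean {
  return score >= PASS_SCORE;
}

// Server-side check: has the session been idle longer than the limit?
export function isExpired(lastActivityMs: number, nowMs: number): boolean {
  return nowMs - lastActivityMs > IDLE_LIMIT_MS;
}
```

Passing `rng` as a parameter keeps the draw deterministic in tests.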

Warm-up (5 paragraphs per session)

Pre-selected paragraphs with known gold labels + feedback text
Identical UI to labeling, but after submit: shows gold answer + explanation
Not counted toward gold set (assignments.is_warmup = true)

Labeling Interface

Top bar: Filing metadata (company, ticker, filing type, date)
Center: Paragraph text, large and readable
Form:
- Content category: 7 radio buttons with short labels
- Specificity level: 4 radio buttons (Generic Boilerplate / Sector-Adapted / Firm-Specific / Quantified-Verifiable)
- Notes: optional textarea
- Submit button
Keyboard shortcuts: 1-7 category, Shift+1-4 specificity, Enter submit
Sidebar (collapsible): Codebook quick-reference (category defs, IS/NOT lists, decision rules)
Progress bar: "47 / 600 completed"
Active timer: visible timer in header bar tracks active labeling time with idle detection (30s threshold). Submits both duration_ms (wall-clock) and active_ms (idle-excluded) per paragraph. Timer pauses and shows amber "idle" indicator when no mouse/keyboard/scroll activity detected.
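One way to derive duration_ms and active_ms from recorded activity timestamps is sketched below. It assumes that a gap between two events counts as active up to the 30s threshold and as idle beyond it; the plan does not say whether the 30 seconds before the timer pauses should be counted, so treat that as an open choice. `computeTimes` is a hypothetical name.

```typescript
const IDLE_THRESHOLD_MS = 30_000;

export interface LabelTiming {
  durationMs: number; // wall-clock time from display to submit
  activeMs: number;   // idle-excluded time
}

// shownAt / submittedAt / activityAt are epoch milliseconds; activityAt must be sorted ascending
// and contain the mouse/keyboard/scroll events seen between display and submit.
export function computeTimes(shownAt: number, activityAt: number[], submittedAt: number): LabelTiming {
  let activeMs = 0;
  let prev = shownAt;
  for (const t of [...activityAt, submittedAt]) {
    // Each gap contributes at most the idle threshold; anything longer is treated as idle.
    activeMs += Math.min(t - prev, IDLE_THRESHOLD_MS);
    prev = t;
  }
  return { durationMs: submittedAt - shownAt, activeMs };
}
```

For example, with activity at 10s and 100s and a submit at 105s, the 90s gap is capped at 30s, giving 45s of active time out of 105s of wall-clock time.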

Sampling (1,200 from ~50K)

Script reads Stage 1 consensus from JSONL, stratifies:

Stratum	Count	Source
Mgmt↔RMP split votes	120	Paragraphs where Stage 1 annotators disagreed on this axis
None/Other↔Strategy splits	80	Materiality disclaimer boundary
Spec [3,4] splits	80	QV counting boundary
Board↔Mgmt splits	80	Board/management boundary
Rare category guarantee	120	≥15 per category, extra for Incident Disclosure
Proportional stratified random	720	Fill from category×specificity cells
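For the final stratum, the 720 slots have to be split across category×specificity cells in proportion to cell size. A minimal sketch using the largest-remainder method follows. The method itself and the name `allocateProportional` are assumptions, since the plan only says "proportional".

```typescript
// cellSizes maps a cell key (e.g. "Category|3") to the number of candidate paragraphs in it.
export function allocateProportional(
  cellSizes: Record<string, number>,
  total: number,
): Record<string, number> {
  const entries = Object.entries(cellSizes);
  const population = entries.reduce((sum, [, n]) => sum + n, 0);
  if (population === 0) throw new Error("no paragraphs to sample from");

  // Exact (fractional) quota per cell.
  const quotas = entries.map(([cell, n]) => ({ cell, quota: (n * total) / population }));

  // Start from the floor of each quota.
  const result: Record<string, number> = {};
  let assigned = 0;
  for (const { cell, quota } of quotas) {
    result[cell] = Math.floor(quota);
    assigned += Math.floor(quota);
  }

  // Hand the leftover slots to the cells with the largest fractional remainders.
  const byRemainder = [...quotas].sort(
    (a, b) => (b.quota - Math.floor(b.quota)) - (a.quota - Math.floor(a.quota)),
  );
  for (const { cell } of byRemainder.slice(0, total - assigned)) {
    result[cell] = (result[cell] ?? 0) + 1;
  }
  return result;
}
```

The returned counts always sum to `total`; the random draw within each cell is a separate step.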

Assignment: BIBD

C(6,3)=20 unique triples of 6 annotators. Assign 60 paragraphs to each triple (20×60 = 1,200). Each annotator appears in C(5,2)=10 triples → 10×60 = 600 paragraphs. Every annotator pair co-occurs in C(4,1)=4 triples → 4×60 = 240 shared paragraphs, so each pairwise kappa is computed over the same number of items.
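The assignment can be generated by enumerating all triples and giving each an equal block of paragraphs. This is a sketch with hypothetical names (`allTriples`, `assignTriples`); it assumes the caller shuffles paragraphIds first, so that the sampling strata are spread across triples rather than clustered.

```typescript
export interface AssignmentRow {
  paragraphId: string;
  annotatorId: string;
}

// Enumerate every unordered triple of annotators: C(6,3) = 20 for six annotators.
export function allTriples(annotators: string[]): [string, string, string][] {
  const triples: [string, string, string][] = [];
  annotators.forEach((a, i) => {
    annotators.slice(i + 1).forEach((b, j) => {
      // b sits at index i + 1 + j, so c ranges over everything after it.
      annotators.slice(i + j + 2).forEach((c) => triples.push([a, b, c]));
    });
  });
  return triples;
}

// Give each triple an equal, contiguous block of paragraphs (60 each for 1,200 paragraphs).
export function assignTriples(paragraphIds: string[], annotators: string[]): AssignmentRow[] {
  const triples = allTriples(annotators);
  if (triples.length === 0 || paragraphIds.length % triples.length !== 0) {
    throw new Error("paragraph count must be a multiple of the number of triples");
  }
  const perTriple = paragraphIds.length / triples.length;
  const rows: AssignmentRow[] = [];
  paragraphIds.forEach((paragraphId, idx) => {
    const triple = triples[Math.floor(idx / perTriple)];
    if (triple === undefined) throw new Error("triple index out of range");
    for (const annotatorId of triple) rows.push({ paragraphId, annotatorId });
  });
  return rows;
}
```

With 1,200 paragraphs and 6 annotators this yields 3,600 rows, 600 per annotator.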

Adjudication

Auto-resolve after 3 labels:

3/3 agree both dims → consensus
2/3 agree both dims → majority
Otherwise → flagged for admin
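The auto-resolve rule can be written as a small pure function. The sketch below reads "agree both dims" as two or three annotators giving the same (category, specificity) pair; resolving each dimension by its own majority would be a different rule. The types and names are illustrative.

```typescript
export interface SubmittedLabel {
  annotatorId: string;
  category: string;
  specificity: number;
}

export type Resolution =
  | { method: "consensus" | "majority"; category: string; specificity: number }
  | { method: "flagged" };

export function autoResolve(labels: SubmittedLabel[]): Resolution {
  if (labels.length !== 3) throw new Error("auto-resolve expects exactly 3 labels");

  // Count identical (category, specificity) pairs.
  const counts = new Map<string, { n: number; label: SubmittedLabel }>();
  for (const label of labels) {
    const key = JSON.stringify([label.category, label.specificity]);
    const entry = counts.get(key);
    if (entry) entry.n += 1;
    else counts.set(key, { n: 1, label });
  }

  let result: Resolution = { method: "flagged" };
  counts.forEach(({ n, label }) => {
    if (n >= 2) {
      result = {
        method: n === 3 ? "consensus" : "majority",
        category: label.category,
        specificity: label.specificity,
      };
    }
  });
  return result;
}
```

A "flagged" result would put the paragraph in the admin queue instead of writing an adjudications row.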

Admin page:

Queue sorted by severity (3-way splits first)
Shows paragraph + all 3 labels side-by-side + notes + Stage 1 reference
Resolution: pick a label, enter custom, or mark for team discussion
Stores in adjudications table

Metrics Dashboard (admin page)

Overall progress: N/1,200 fully labeled, N adjudicated
Cohen's Kappa (category): pairwise 6×6 matrix + average. Target ≥ 0.75
Krippendorff's Alpha (specificity): single number. Target ≥ 0.67
Raw consensus rate. Target ≥ 75%
Per-category confusion matrix (7×7)
Per-annotator stats: completion, agreement rate, distribution
Confusion axis disagreement rates
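Each cell of the pairwise kappa matrix is Cohen's kappa, (p_o − p_e) / (1 − p_e), computed over the paragraphs both annotators labeled. Below is a minimal sketch of that formula for category labels; `cohensKappa` is a hypothetical name for what lib/metrics.ts might export, and the two arrays are assumed to be aligned by paragraph.

```typescript
// a[i] and b[i] are the two annotators' category labels for the same paragraph.
export function cohensKappa(a: string[], b: string[]): number {
  if (a.length !== b.length || a.length === 0) {
    throw new Error("need two equal-length, non-empty label arrays");
  }
  const n = a.length;
  const countA = new Map<string, number>();
  const countB = new Map<string, number>();
  let agreements = 0;

  a.forEach((labelA, i) => {
    const labelB = b[i] as string; // safe: lengths were checked above
    if (labelA === labelB) agreements += 1;
    countA.set(labelA, (countA.get(labelA) ?? 0) + 1);
    countB.set(labelB, (countB.get(labelB) ?? 0) + 1);
  });

  // Observed agreement.
  const po = agreements / n;

  // Chance agreement from each annotator's marginal label distribution.
  let pe = 0;
  countA.forEach((ca, label) => {
    pe += (ca / n) * ((countB.get(label) ?? 0) / n);
  });

  // If both annotators used a single identical label throughout, pe is 1 and the result is NaN.
  return (po - pe) / (1 - pe);
}
```

For labels [A, A, B, B] versus [A, A, B, A], p_o = 0.75 and p_e = 0.5, so kappa = 0.5.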

Export

Script dumps adjudicated gold labels to data/gold/gold-labels.jsonl in the existing GoldLabel schema format, readable by readJsonl(path, GoldLabel).

Reused Existing Code

What	Source	Used by
`ContentCategory`, `SpecificityLevel`, `LabelOutput`	`@sec-cybert/schemas/label.ts`	Label validation, quiz answers
`Paragraph`, `FilingMeta`	`@sec-cybert/schemas/paragraph.ts`	Seed script, display
`HumanLabel`, `GoldLabel`	`@sec-cybert/schemas/gold.ts`	Export script format
`Annotation`	`@sec-cybert/schemas/annotation.ts`	Seed script (Stage 1 data)
`readJsonl()`	`ts/src/lib/jsonl.ts`	Seed script (import from JSONL)
Codebook content	`docs/LABELING-CODEBOOK.md`	Quiz questions, sidebar reference

Note: readJsonl stays in ts/src/lib/. The seed script either imports it via a relative path, or we extract it into the schemas package if needed; since it depends on Zod (which schemas already has), it could live there.

Setup Sequence

# 1. Monorepo setup
bun install                                  # from repo root, resolves all workspaces

# 2. Verify existing pipeline still works
bun run --filter ts typecheck

# 3. Start Postgres (compose file is at the repo root; skip if already running)
docker compose up -d

# 4. Push schema (remaining commands run from labelapp/)
cd labelapp
bun run db:push

# 5. Seed data
bun run seed                                 # imports paragraphs + consensus + creates annotators

# 6. Sample + assign
bun run sample                               # stratified sample → 1,200 paragraphs marked
bun run assign                               # BIBD → 3,600 assignment rows

# 7. Start dev server
bun run dev                                  # Next.js on :3000

# After labeling complete:
bun run export                               # → data/gold/gold-labels.jsonl

Implementation Order

1. Monorepo restructure: root package.json, extract schemas, rewrite imports, verify typecheck
2. Labelapp scaffold: Next.js init, Drizzle schema, db connection, Playwright setup
3. Seed + sample + assign scripts: data pipeline into Postgres
4. Auth: login page, session cookies, middleware
5. Quiz system: question bank, quiz flow page, session gating
6. Labeling UI: the core (next paragraph, submit label, progress tracking, keyboard shortcuts, codebook sidebar)
7. Warm-up flow: 5 pre-labeled paragraphs with feedback
8. Admin, adjudication: queue, resolution UI
9. Admin, metrics dashboard: kappa, alpha, confusion matrix, per-annotator stats
10. Export script: gold labels to JSONL

Each phase ends with Playwright tests that verify it works E2E before moving on.

Testing Strategy

Philosophy

No unit tests. Integration/E2E only. Two layers:

Backend route tests (bun test) — hit real API routes against real Postgres, verify responses/DB state
Playwright E2E — click through the real UI in a real browser

Backend Route Tests (bun test)

Colocated __test__ dirs adjacent to each route handler
Tests import the route handler directly (or use fetch against the dev server)
Run against real Postgres (same DATABASE_URL)
Each test file resets relevant tables via Drizzle before running

labelapp/app/api/
  auth/
    route.ts
    __test__/
      auth.test.ts         # login/logout, session validation
  quiz/
    route.ts
    __test__/
      quiz.test.ts         # start quiz, submit answers, pass/fail logic
  label/
    route.ts
    __test__/
      label.test.ts        # get next paragraph, submit label, skip completed
  warmup/
    route.ts
    __test__/
      warmup.test.ts       # get warmup, submit + get feedback
  adjudicate/
    route.ts
    __test__/
      adjudicate.test.ts   # get queue, resolve, verify DB state
  metrics/
    route.ts
    __test__/
      metrics.test.ts      # kappa/alpha values, progress counts
  export/
    route.ts
    __test__/
      export.test.ts       # trigger export, verify JSONL output

Playwright E2E

Playwright installed in labelapp/ as dev dependency
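Tests in labelapp/tests/ (the directory Playwright scaffolds by default); testDir should point there so Playwright does not also collect the __test__/*.test.ts route tests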
playwright.config.ts configured to:
- Start Next.js dev server automatically via webServer config
- Use the real Postgres (same DATABASE_URL)
- Run tests serially (stateful DB)
Test DB reset: Each test file starts by truncating relevant tables (not dropping — schema stays). A tests/helpers/reset-db.ts util handles this via Drizzle.

Test Files & What They Cover

tests/00-setup.spec.ts — Data pipeline scripts

Run bun run seed via execSync, verify paragraphs table has rows
Run bun run sample, verify exactly 1,200 paragraphs are marked as sampled (via a sampled flag or a separate table; the Drizzle schema above does not define either yet)
Run bun run assign, verify 3,600 assignment rows, each annotator has 600
Verify BIBD property: every annotator pair shares equal paragraph count

tests/01-auth.spec.ts — Login flow

Navigate to /, see login form
Login with wrong password → error message shown
Login with correct password → redirected to /dashboard
Access /dashboard without login → redirected to /
Logout → session cleared, redirected to /

tests/02-quiz.spec.ts — Quiz gating

Login, navigate to dashboard, click "Start Session"
Verify quiz page loads with 8 questions
Answer all correctly → see "Passed" message, "Begin labeling" button appears
Start new session, answer 2 wrong → see "Failed", can retry
Verify cannot access /label without a passed quiz session

tests/03-warmup.spec.ts — Warm-up flow

After passing quiz, verify 5 warm-up paragraphs shown first
Submit a label → gold answer + explanation revealed
After 5 warm-ups → transition to real labeling
Verify warm-up labels are NOT counted in progress stats

tests/04-labeling.spec.ts — Core labeling flow

Verify paragraph text + filing metadata displayed
Select category via radio button, select specificity, submit
Verify redirected to next paragraph, progress increments
Verify keyboard shortcuts work (press "1" → first category selected, etc.)
Submit several labels, verify they're stored in DB
Verify codebook sidebar toggles open/close
Verify "next" skips already-completed paragraphs (label one, refresh, get a different one)

tests/05-adjudication.spec.ts — Multi-annotator + admin flow

Seed 3 test annotators with assignments for the same paragraph
Login as each, pass quiz, label the same paragraph with DIFFERENT labels
Login as admin, navigate to /admin
Verify the disputed paragraph appears in the adjudication queue
Resolve it (pick one label), verify adjudication stored
Verify it no longer appears in queue

tests/06-metrics.spec.ts — Dashboard metrics

Seed known label data (pre-computed expected kappa/alpha values)
Navigate to admin metrics page
Verify progress numbers match expected
Verify consensus rate displayed and reasonable
Verify per-annotator stats shown

tests/07-export.spec.ts — Gold export

Ensure some adjudicated labels exist (from prior tests or seeded)
Run bun run export via execSync
Read the output JSONL file
Parse each line, verify it matches GoldLabel schema
Verify paragraph count matches expected

Running Tests

cd labelapp

# Backend route tests (fast, no browser)
# Scoped to app/ because a bare `bun test` also matches *.spec.ts, i.e. the Playwright files in tests/
bun test app/                                  # runs all __test__/*.test.ts
bun test app/api/label/__test__/           # run one route's tests

# Playwright E2E (browser)
bunx playwright install --with-deps chromium   # one-time browser install
bunx playwright test                           # runs all tests serially
bunx playwright test tests/04-labeling.spec.ts # run one file

# Both
bun test app/ && bunx playwright test          # full suite

Package.json test scripts

{
  "scripts": {
    "test": "bun test app/ && playwright test",
    "test:api": "bun test app/",
    "test:e2e": "playwright test",
    "test:e2e:ui": "playwright test --ui"
  }
}

How Agents Use This

Each implementation phase follows this cycle:

Write the feature code
Write the backend route test (__test__/ adjacent to route) AND/OR Playwright test
Run bun test for route logic, bunx playwright test for UI flows
If tests fail, fix the code and re-run
Only move to the next phase when all tests pass

The test suite is cumulative — tests from earlier phases keep running, ensuring nothing regresses. An agent completing phase 6 (labeling UI) runs the full suite to confirm everything still works.

Test Helpers

Shared (lib/__test__/helpers.ts or similar; labelapp has no src/ directory):

resetDb() — truncate tables between test files via Drizzle
seedTestData() — insert known paragraphs/assignments/labels for test scenarios

Playwright-specific (tests/helpers/):

login.ts — reusable: login as annotator X, pass quiz, get to labeling
reset-db.ts — calls resetDb() for Playwright test setup

The login.ts helper is critical — it encapsulates the login → quiz → warmup flow so that labeling/adjudication tests don't have to repeat that ceremony.

Verification (automated)

Two test suites, both must pass:

`bun test` — Backend route integration tests

Route	Test file	Verifies
`api/auth`	`__test__/auth.test.ts`	Login/logout, bad password rejection, session cookies
`api/quiz`	`__test__/quiz.test.ts`	Start quiz, submit answers, pass/fail threshold, session creation
`api/label`	`__test__/label.test.ts`	Get next paragraph, submit label to DB, skip completed, enforce quiz gate
`api/warmup`	`__test__/warmup.test.ts`	Get warmup paragraph, submit + receive gold feedback
`api/adjudicate`	`__test__/adjudicate.test.ts`	Get disagreement queue, resolve, verify DB state
`api/metrics`	`__test__/metrics.test.ts`	Kappa/alpha with known data, progress counts
`api/export`	`__test__/export.test.ts`	Trigger export, verify JSONL matches GoldLabel schema

`bunx playwright test` — Browser E2E

Test file	Verifies
`00-setup.spec.ts`	Seed/sample/assign scripts produce correct DB state
`01-auth.spec.ts`	Login form, redirect on auth failure, logout
`02-quiz.spec.ts`	Quiz renders, pass/fail gating, retry flow
`03-warmup.spec.ts`	5 warm-ups with feedback, transition to real labeling
`04-labeling.spec.ts`	Paragraph display, radio buttons, keyboard shortcuts, progress bar, codebook sidebar
`05-adjudication.spec.ts`	3 annotators disagree → admin queue → resolution
`06-metrics.spec.ts`	Dashboard renders with correct numbers
`07-export.spec.ts`	Export script produces valid JSONL

Pre-test gates

bun install from root succeeds
bun run --filter ts typecheck passes (monorepo didn't break existing pipeline)
Postgres is reachable, bun run db:push succeeds

Success criteria: cd labelapp && bun test app/ && bunx playwright test exits 0 (bun test is scoped to app/ so that it does not also collect the Playwright *.spec.ts files).
