SEC-cyBERT/CLAUDE.md
2026-04-05 21:00:40 -04:00

sec-cyBERT

Bun workspace monorepo. Three packages:

  • packages/schemas/ — shared Zod schemas (@sec-cybert/schemas). Import directly by path: from "@sec-cybert/schemas/label.ts"
  • ts/ — GenAI labeling pipeline (CLI scripts, Vercel AI SDK, OpenRouter)
  • labelapp/ — Next.js human labeling webapp (Drizzle, Postgres, shadcn/ui, Playwright)

Quick reference

| What | Where |
| --- | --- |
| Shared schemas (Zod) | packages/schemas/src/ |
| Labeling codebook (source of truth for all category/specificity definitions) | docs/LABELING-CODEBOOK.md |
| Codebook ethos (reasoning behind every codebook decision) | docs/CODEBOOK-ETHOS.md |
| Project narrative (decisions, roadblocks, lessons) | docs/NARRATIVE.md |
| Project status & todo list | docs/STATUS.md |
| Specificity improvement plan (pending threshold tuning) | docs/SPECIFICITY-IMPROVEMENT-PLAN.md |
| Training docs (DAPT procedure, data quality audit, strategy notes) | docs/training/ |
| Data pipeline reference (tech guide, HTML cleaning, filing generators) | docs/data-pipeline/ |
| v1 archive (codebook, narrative, iteration logs, analyses) | docs/archive/v1/ |
| Planning archive (project overview, implementation plan, labelapp plan) | docs/archive/planning/ |
| Professor-provided reference materials | docs/reference/ |
| Labelapp-specific agent guide | labelapp/AGENTS.md |
| Docker compose (Postgres) | docker-compose.yaml (root) |
| DB credentials | sec_cybert / sec_cybert / sec_cybert on localhost:5432 |
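
The DB credentials row can be turned into a connection string along these lines. This is a minimal sketch, not code from the repo: it assumes the three values are user, password, and database name (they are identical, so the order does not change the result), and `LocalDbConfig` and `toPostgresUrl` are hypothetical names introduced only for illustration.

```typescript
// Hypothetical shape for the local Postgres settings listed in the quick reference.
interface LocalDbConfig {
  user: string;
  password: string;
  database: string;
  host: string;
  port: number;
}

// Assumption: user / password / database are all "sec_cybert", per the table row.
const localDb: LocalDbConfig = {
  user: "sec_cybert",
  password: "sec_cybert",
  database: "sec_cybert",
  host: "localhost",
  port: 5432,
};

// Build a postgres:// URL, percent-encoding each credential component.
function toPostgresUrl(c: LocalDbConfig): string {
  const user = encodeURIComponent(c.user);
  const password = encodeURIComponent(c.password);
  const database = encodeURIComponent(c.database);
  return `postgres://${user}:${password}@${c.host}:${c.port}/${database}`;
}

const localDbUrl = toPostgresUrl(localDb);
```

How the repo actually passes this URL to Drizzle or Postgres tooling is not stated here; check labelapp/AGENTS.md before relying on a particular environment variable name.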

Root scripts

All commands run from repo root via bun run <script>. No need to cd into subpackages.

Labelapp (la:*)

| Script | What it does |
| --- | --- |
| la:dev | Start Next.js dev server (Turbopack) |
| la:build | Production build |
| la:typecheck | TypeScript type-check |
| la:lint | ESLint |
| la:test | API tests + Playwright E2E |
| la:test:api | API tests only (bun test) |
| la:test:e2e | Playwright E2E only |
| la:db:generate | Generate Drizzle migration |
| la:db:migrate | Apply Drizzle migrations |
| la:db:studio | Drizzle Studio (DB browser) |
| la:seed | Seed paragraphs + annotations |
| la:assign | Generate annotator assignments |
| la:export | Export labels |
| la:docker | Build + push Docker image |

GenAI pipeline (ts:*)

| Script | What it does |
| --- | --- |
| ts:sec | CLI entrypoint (bun run ts/src/cli.ts) |
| ts:typecheck | TypeScript type-check |

Python training (py:*)

| Script | What it does |
| --- | --- |
| py:train | CLI entrypoint (uv run main.py — pass subcommand as arg, e.g. bun run py:train dapt --config ...) |

Data & checkpoints (data:*, ckpt:*, sync:*)

Large files are compressed with zstd-19 into .lfs-store/ and tracked via Git LFS on Gitea (git.claiborne.soy). Per-file xxh3 hashing ensures only changed files are recompressed/uploaded.
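
The per-file hashing idea above can be sketched as a manifest comparison. This is an illustrative sketch only, not the repo's sync script: it assumes a manifest that maps each relative path to its content hash, and the `Manifest` type, function names, file names, and hash strings are all hypothetical. The hash values are treated as opaque strings, so the sketch does not depend on how xxh3 is computed.

```typescript
// Hypothetical manifest: relative file path -> content hash (opaque hex string).
type Manifest = Record<string, string>;

// Files that are new or whose hash changed since the last push;
// only these would need to be recompressed and uploaded.
function filesToRecompress(previous: Manifest, current: Manifest): string[] {
  return Object.keys(current)
    .filter((path) => previous[path] !== current[path])
    .sort();
}

// Files recorded in the previous manifest that no longer exist locally.
function filesToRemove(previous: Manifest, current: Manifest): string[] {
  return Object.keys(previous)
    .filter((path) => !(path in current))
    .sort();
}

// Made-up example data, for illustration only.
const lastPush: Manifest = { "a.jsonl": "1111", "b.jsonl": "2222", "old.jsonl": "9999" };
const workingTree: Manifest = { "a.jsonl": "1111", "b.jsonl": "3333", "c.jsonl": "4444" };

const changed = filesToRecompress(lastPush, workingTree);
const removed = filesToRemove(lastPush, workingTree);
```

With this example data, `a.jsonl` is skipped because its hash is unchanged, which is the property that keeps repeated pushes cheap.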

| Script | What it does |
| --- | --- |
| data:push | Compress data/ → .lfs-store/, commit + push via LFS (default) |
| data:pull | Git LFS pull + decompress into data/ |
| ckpt:push | Compress checkpoints → .lfs-store/, commit + push via LFS |
| ckpt:pull | Git LFS pull + decompress checkpoints |
| sync:push | Push both data + checkpoints |
| sync:pull | Pull both data + checkpoints |
| data:package | Build standalone .tar.zst archives for submission |

Checkpoint sync excludes optimizer.pt, scheduler.pt, rng_state.pth, and .data_cache/ (resume-only state, regenerable).
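
The exclusion rule above amounts to a path filter. The following is a minimal sketch under stated assumptions, not the repo's implementation: it assumes forward-slash relative paths, matches the three file names by basename, and treats any path under a `.data_cache` directory as excluded. `isSyncedCheckpointFile` and the example paths are hypothetical.

```typescript
// Resume-only files named in the rule above, matched by basename.
const EXCLUDED_FILE_NAMES = new Set(["optimizer.pt", "scheduler.pt", "rng_state.pth"]);
// Regenerable cache directory named in the rule above.
const EXCLUDED_DIR_NAMES = new Set([".data_cache"]);

// Returns true if a checkpoint file should be included in the sync.
function isSyncedCheckpointFile(relativePath: string): boolean {
  const segments = relativePath.split("/").filter((s) => s.length > 0);
  const fileName = segments[segments.length - 1];
  if (fileName === undefined) return false; // empty path: nothing to sync
  if (EXCLUDED_FILE_NAMES.has(fileName)) return false;
  // Exclude anything nested under an excluded directory.
  return !segments.slice(0, -1).some((dir) => EXCLUDED_DIR_NAMES.has(dir));
}

// Hypothetical checkpoint layout, for illustration only.
const candidates = [
  "run1/checkpoint-500/model.safetensors",
  "run1/checkpoint-500/optimizer.pt",
  "run1/checkpoint-500/rng_state.pth",
  "run1/.data_cache/shard0.bin",
];
const synced = candidates.filter(isSyncedCheckpointFile);
```

Only the model weights survive the filter in this example; the excluded files are needed solely to resume training and can be regenerated.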

Cross-package

| Script | What it does |
| --- | --- |
| typecheck | Type-check all TS packages in parallel |

Rules

  • bun for all JS/TS. uv for Python.
  • No barrel files. Direct path-based imports only.
  • No TODO comments. Finish what you start.
  • No parallel codepaths. Find and extend existing code before writing new.
  • Schemas live in packages/schemas/ — do not duplicate type definitions elsewhere.
  • labelapp/ uses flat layout (no src/ dir): app/, db/, lib/, components/ at root.
  • labelapp/ uses file-based Drizzle migrations (drizzle-kit generate + drizzle-kit migrate), not push.
  • Tests: bun test for backend route integration (__test__/ dirs adjacent to routes), Playwright for E2E (tests/).