sec-cyBERT

Bun workspace monorepo. Three packages:

packages/schemas/ — shared Zod schemas (@sec-cybert/schemas). Import directly by path: from "@sec-cybert/schemas/label.ts"
ts/ — GenAI labeling pipeline (CLI scripts, Vercel AI SDK, OpenRouter)
labelapp/ — Next.js human labeling webapp (Drizzle, Postgres, shadcn/ui, Playwright)

Quick reference

What	Where
Shared schemas (Zod)	`packages/schemas/src/`
Labeling codebook (source of truth for all category/specificity definitions)	`docs/LABELING-CODEBOOK.md`
Codebook ethos (reasoning behind every codebook decision)	`docs/CODEBOOK-ETHOS.md`
Project narrative (decisions, roadblocks, lessons)	`docs/NARRATIVE.md`
Project status & todo list	`docs/STATUS.md`
Specificity improvement plan (pending threshold tuning)	`docs/SPECIFICITY-IMPROVEMENT-PLAN.md`
Training docs (DAPT procedure, data quality audit, strategy notes)	`docs/training/`
Data pipeline reference (tech guide, HTML cleaning, filing generators)	`docs/data-pipeline/`
v1 archive (codebook, narrative, iteration logs, analyses)	`docs/archive/v1/`
Planning archive (project overview, implementation plan, labelapp plan)	`docs/archive/planning/`
Professor-provided reference materials	`docs/reference/`
Labelapp-specific agent guide	`labelapp/AGENTS.md`
Docker compose (Postgres)	`docker-compose.yaml` (root)
DB credentials	`sec_cybert` / `sec_cybert` / `sec_cybert` on localhost:5432

Root scripts

All commands run from repo root via bun run <script>. No need to cd into subpackages.

Labelapp (`la:*`)

Script	What it does
`la:dev`	Start Next.js dev server (Turbopack)
`la:build`	Production build
`la:typecheck`	TypeScript type-check
`la:lint`	ESLint
`la:test`	API tests + Playwright E2E
`la:test:api`	API tests only (`bun test`)
`la:test:e2e`	Playwright E2E only
`la:db:generate`	Generate Drizzle migration
`la:db:migrate`	Apply Drizzle migrations
`la:db:studio`	Drizzle Studio (DB browser)
`la:seed`	Seed paragraphs + annotations
`la:assign`	Generate annotator assignments
`la:export`	Export labels
`la:docker`	Build + push Docker image

GenAI pipeline (`ts:*`)

Script	What it does
`ts:sec`	CLI entrypoint (`bun run ts/src/cli.ts`)
`ts:typecheck`	TypeScript type-check

Python training (`py:*`)

Script	What it does
`py:train`	CLI entrypoint (`uv run main.py` — pass subcommand as arg, e.g. `bun run py:train dapt --config ...`)

Data & checkpoints (`data:*`, `ckpt:*`, `sync:*`)

Large files are compressed with zstd-19 into .lfs-store/ and tracked via Git LFS on Gitea (git.claiborne.soy). Per-file xxh3 hashing ensures only changed files are recompressed/uploaded.
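
The change-detection idea above can be sketched roughly as follows. This is an illustration only, not the repo's actual sync script: the `Manifest` shape, the function names, and the example paths are hypothetical, and SHA-256 from `node:crypto` stands in for xxh3 so the snippet runs without extra dependencies.

```typescript
import { createHash } from "node:crypto";

// Hypothetical manifest shape: relative file path -> content hash (hex).
type Manifest = Record<string, string>;

// Stand-in hash. The real scripts use xxh3; SHA-256 is swapped in here
// only because it ships with node:crypto.
function hashContent(content: string | Uint8Array): string {
  return createHash("sha256").update(content).digest("hex");
}

// Paths that are new, or whose hash differs from the previous manifest.
// Only these would need to be recompressed and uploaded.
function changedFiles(previous: Manifest, current: Manifest): string[] {
  return Object.keys(current)
    .filter((path) => previous[path] !== current[path])
    .sort();
}

// Paths that existed before but are gone now.
function removedFiles(previous: Manifest, current: Manifest): string[] {
  return Object.keys(previous)
    .filter((path) => !(path in current))
    .sort();
}

// Example with made-up file names and contents.
const previous: Manifest = {
  "data/a.jsonl": hashContent("v1"),
  "data/b.jsonl": hashContent("same"),
  "data/old.jsonl": hashContent("stale"),
};
const current: Manifest = {
  "data/a.jsonl": hashContent("v2"),
  "data/b.jsonl": hashContent("same"),
  "data/c.jsonl": hashContent("new"),
};

const toCompress = changedFiles(previous, current);
const toDrop = removedFiles(previous, current);
console.log(toCompress, toDrop);
```

Unchanged files (`data/b.jsonl` here) never appear in `toCompress`, which is what keeps repeated pushes cheap.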

Script	What it does
`data:push`	Compress `data/` → `.lfs-store/`, commit + push via LFS (default)
`data:pull`	Git LFS pull + decompress into `data/`
`ckpt:push`	Compress checkpoints → `.lfs-store/`, commit + push via LFS
`ckpt:pull`	Git LFS pull + decompress checkpoints
`sync:push`	Push both data + checkpoints
`sync:pull`	Pull both data + checkpoints
`data:package`	Build standalone `.tar.zst` archives for submission

Checkpoint sync excludes optimizer.pt, scheduler.pt, rng_state.pth, and .data_cache/ (resume-only state, regenerable).
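
A minimal sketch of such an exclusion filter, assuming files are matched by basename and `.data_cache/` by directory segment. The function name and the example paths are hypothetical, and the real sync script may filter differently.

```typescript
// Names taken from the exclusion list above. Matching by basename and by
// directory segment is an assumption, not the confirmed behaviour.
const EXCLUDED_FILES = new Set(["optimizer.pt", "scheduler.pt", "rng_state.pth"]);
const EXCLUDED_DIRS = new Set([".data_cache"]);

// Returns true if a checkpoint-relative path should be included in the sync.
function isSyncedCheckpointFile(relativePath: string): boolean {
  const segments = relativePath.split("/").filter((s) => s.length > 0);
  if (segments.length === 0) return false;

  const basename = segments[segments.length - 1] ?? "";
  if (EXCLUDED_FILES.has(basename)) return false;

  // Skip anything that sits under an excluded directory at any depth.
  const dirs = segments.slice(0, -1);
  return !dirs.some((dir) => EXCLUDED_DIRS.has(dir));
}

// Example with made-up checkpoint paths.
const candidates = [
  "run/checkpoint-500/model.safetensors",
  "run/checkpoint-500/optimizer.pt",
  "run/checkpoint-500/rng_state.pth",
  "run/.data_cache/shard-0.arrow",
];
const synced = candidates.filter(isSyncedCheckpointFile);
console.log(synced);
```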

Cross-package

Script	What it does
`typecheck`	Type-check all TS packages in parallel

Rules

bun for all JS/TS. uv for Python.
No barrel files. Direct path-based imports only.
No TODO comments. Finish what you start.
No parallel codepaths. Find and extend existing code before writing new.
Schemas live in packages/schemas/ — do not duplicate type definitions elsewhere.
labelapp/ uses flat layout (no src/ dir): app/, db/, lib/, components/ at root.
labelapp/ uses file-based Drizzle migrations (drizzle-kit generate + drizzle-kit migrate), not push.
Tests: bun test for backend route integration (__test__/ dirs adjacent to routes), Playwright for E2E (tests/).
