sec-cyBERT
Bun workspace monorepo. Three packages:
- packages/schemas/: shared Zod schemas (@sec-cybert/schemas). Import directly by path: from "@sec-cybert/schemas/label.ts"
- ts/: GenAI labeling pipeline (CLI scripts, Vercel AI SDK, OpenRouter)
- labelapp/: Next.js human labeling webapp (Drizzle, Postgres, shadcn/ui, Playwright)
Quick reference
| What | Where |
| --- | --- |
| Shared schemas (Zod) | packages/schemas/src/ |
| Labeling codebook (source of truth for all category/specificity definitions) | docs/LABELING-CODEBOOK.md |
| Codebook ethos (reasoning behind every codebook decision) | docs/CODEBOOK-ETHOS.md |
| Project narrative (decisions, roadblocks, lessons) | docs/NARRATIVE.md |
| Project status & todo list | docs/STATUS.md |
| v1 codebook (preserved) | docs/LABELING-CODEBOOK-v1.md |
| v1 narrative (preserved) | docs/NARRATIVE-v1.md |
| Implementation plan for labelapp | docs/labelapp-plan.md |
| Labelapp-specific agent guide | labelapp/AGENTS.md |
| Docker compose (Postgres) | docker-compose.yaml (root) |
| DB credentials | sec_cybert / sec_cybert / sec_cybert on localhost:5432 |
Root scripts
All commands run from repo root via bun run <script>. No need to cd into subpackages.
Labelapp (la:*)
| Script | What it does |
| --- | --- |
| la:dev | Start Next.js dev server (Turbopack) |
| la:build | Production build |
| la:typecheck | TypeScript type-check |
| la:lint | ESLint |
| la:test | API tests + Playwright E2E |
| la:test:api | API tests only (bun test) |
| la:test:e2e | Playwright E2E only |
| la:db:generate | Generate Drizzle migration |
| la:db:migrate | Apply Drizzle migrations |
| la:db:studio | Drizzle Studio (DB browser) |
| la:seed | Seed paragraphs + annotations |
| la:assign | Generate annotator assignments |
| la:export | Export labels |
| la:docker | Build + push Docker image |
GenAI pipeline (ts:*)
| Script | What it does |
| --- | --- |
| ts:sec | CLI entrypoint (bun run ts/src/cli.ts) |
| ts:typecheck | TypeScript type-check |
Python training (py:*)
| Script | What it does |
| --- | --- |
| py:train | CLI entrypoint (uv run main.py). Pass the subcommand as an argument, e.g. bun run py:train dapt --config ... |
Data & checkpoints (data:*, ckpt:*, sync:*)
Large files are compressed with zstd-19 into .lfs-store/ and tracked via Git LFS on Gitea (git.claiborne.soy). Per-file xxh3 hashing ensures only changed files are recompressed/uploaded.
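The change detection described above can be sketched roughly as follows. This is an illustration only, not the repo's sync script: the `Manifest` shape and the function names are hypothetical, file contents are passed in memory instead of being read from disk, and SHA-256 from `node:crypto` stands in for xxh3 so the sketch needs no extra dependency.

```typescript
import { createHash } from "node:crypto";

// Hypothetical manifest shape: relative path -> hex digest of the file contents.
type Manifest = Record<string, string>;

// Stand-in for xxh3 (assumption): SHA-256 from the standard library.
function digest(contents: string): string {
  return createHash("sha256").update(contents).digest("hex");
}

// Build a manifest from in-memory contents; a real script would read files from disk.
function buildManifest(files: Record<string, string>): Manifest {
  const manifest: Manifest = {};
  for (const [path, contents] of Object.entries(files)) {
    manifest[path] = digest(contents);
  }
  return manifest;
}

// Paths that are new, or whose digest differs from the previous manifest.
// Only these would need to be recompressed and uploaded.
function changedPaths(previous: Manifest, current: Manifest): string[] {
  return Object.keys(current)
    .filter((path) => previous[path] !== current[path])
    .sort();
}

// Hypothetical file names, for illustration only.
const previousRun = buildManifest({ "a.jsonl": "one", "b.jsonl": "two" });
const currentRun = buildManifest({ "a.jsonl": "one", "b.jsonl": "two!", "c.jsonl": "three" });
console.log(changedPaths(previousRun, currentRun));
```

Comparing digests rather than modification times means a file that was touched but not altered is skipped, which is presumably why hashing is done per file.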
| Script | What it does |
| --- | --- |
| data:push | Compress data/ → .lfs-store/, commit + push via LFS (default) |
| data:pull | Git LFS pull + decompress into data/ |
| ckpt:push | Compress checkpoints → .lfs-store/, commit + push via LFS |
| ckpt:pull | Git LFS pull + decompress checkpoints |
| sync:push | Push both data + checkpoints |
| sync:pull | Pull both data + checkpoints |
| data:package | Build standalone .tar.zst archives for submission |
Checkpoint sync excludes optimizer.pt, scheduler.pt, rng_state.pth, and .data_cache/ (resume-only state, regenerable).
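The exclusion rule above could be expressed as a small path filter. This is a minimal sketch under assumptions: the excluded names come from the sentence above, while `shouldSync`, the constant names, and the example paths are hypothetical and not taken from the actual sync script.

```typescript
// Resume-only files and regenerable cache directories named in the note above.
const EXCLUDED_FILES = new Set(["optimizer.pt", "scheduler.pt", "rng_state.pth"]);
const EXCLUDED_DIRS = new Set([".data_cache"]);

// Hypothetical helper: returns true when a checkpoint-relative path should be synced.
function shouldSync(relativePath: string): boolean {
  const segments = relativePath.split("/").filter((segment) => segment.length > 0);
  if (segments.length === 0) return false;

  const fileName = segments[segments.length - 1] ?? "";
  const parentDirs = segments.slice(0, -1);

  // Skip excluded file names at any depth.
  if (EXCLUDED_FILES.has(fileName)) return false;
  // Skip anything that lives under an excluded directory.
  return !parentDirs.some((dir) => EXCLUDED_DIRS.has(dir));
}

// Hypothetical paths, for illustration only.
const candidates = [
  "run-1/model.safetensors",
  "run-1/optimizer.pt",
  "run-1/.data_cache/shard-0.bin",
];
console.log(candidates.filter(shouldSync));
```

Matching on path segments rather than substrings avoids accidentally excluding a file whose name merely contains one of the excluded names.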
Cross-package
| Script | What it does |
| --- | --- |
| typecheck | Type-check all TS packages in parallel |
Rules
- bun for all JS/TS. uv for Python.
- No barrel files. Direct path-based imports only.
- No TODO comments. Finish what you start.
- No parallel codepaths. Find and extend existing code before writing new.
- Schemas live in packages/schemas/. Do not duplicate type definitions elsewhere.
- labelapp/ uses flat layout (no src/ dir): app/, db/, lib/, components/ at root.
- labelapp/ uses file-based Drizzle migrations (drizzle-kit generate + drizzle-kit migrate), not push.
- Tests: bun test for backend route integration (__test__/ dirs adjacent to routes), Playwright for E2E (tests/).