# sec-cyBERT

Bun workspace monorepo. Three packages:

- `packages/schemas/`: shared Zod schemas (`@sec-cybert/schemas`). Import directly by path: `from "@sec-cybert/schemas/label.ts"`
- `ts/`: GenAI labeling pipeline (CLI scripts, Vercel AI SDK, OpenRouter)
- `labelapp/`: Next.js human labeling webapp (Drizzle, Postgres, shadcn/ui, Playwright)
## Quick reference

| What | Where |
|------|-------|
| Shared schemas (Zod) | `packages/schemas/src/` |
| Labeling codebook (source of truth for all category/specificity definitions) | `docs/LABELING-CODEBOOK.md` |
| Codebook ethos (reasoning behind every codebook decision) | `docs/CODEBOOK-ETHOS.md` |
| Project narrative (decisions, roadblocks, lessons) | `docs/NARRATIVE.md` |
| Project status & todo list | `docs/STATUS.md` |
| v1 codebook (preserved) | `docs/LABELING-CODEBOOK-v1.md` |
| v1 narrative (preserved) | `docs/NARRATIVE-v1.md` |
| Implementation plan for labelapp | `docs/labelapp-plan.md` |
| Labelapp-specific agent guide | `labelapp/AGENTS.md` |
| Docker compose (Postgres) | `docker-compose.yaml` (root) |
| DB credentials | `sec_cybert` / `sec_cybert` / `sec_cybert` on localhost:5432 |
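For tools that expect a connection string, the credentials row can be turned into a URL. This is a minimal sketch, assuming the three values are user, password, and database in that order (the table does not say); the `PgCredentials` type and `postgresUrl` helper are hypothetical names, not part of the repo.

```typescript
// Hypothetical shape for the local Postgres settings listed in the table.
interface PgCredentials {
  user: string;
  password: string;
  database: string;
  host: string;
  port: number;
}

// Build a postgres:// URL, percent-encoding the parts that may contain
// reserved characters.
function postgresUrl(c: PgCredentials): string {
  const user = encodeURIComponent(c.user);
  const password = encodeURIComponent(c.password);
  const database = encodeURIComponent(c.database);
  return `postgres://${user}:${password}@${c.host}:${c.port}/${database}`;
}

// Assumed mapping of the `sec_cybert` / `sec_cybert` / `sec_cybert` row.
const localDev: PgCredentials = {
  user: "sec_cybert",
  password: "sec_cybert",
  database: "sec_cybert",
  host: "localhost",
  port: 5432,
};
```

Because all three values are identical, the resulting URL is the same whichever order the row actually uses.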
## Root scripts

All commands run from the repo root via `bun run <script>`. No need to `cd` into subpackages.
### Labelapp (`la:*`)

| Script | What it does |
|--------|-------------|
| `la:dev` | Start Next.js dev server (Turbopack) |
| `la:build` | Production build |
| `la:typecheck` | TypeScript type-check |
| `la:lint` | ESLint |
| `la:test` | API tests + Playwright E2E |
| `la:test:api` | API tests only (`bun test`) |
| `la:test:e2e` | Playwright E2E only |
| `la:db:generate` | Generate Drizzle migration |
| `la:db:migrate` | Apply Drizzle migrations |
| `la:db:studio` | Drizzle Studio (DB browser) |
| `la:seed` | Seed paragraphs + annotations |
| `la:assign` | Generate annotator assignments |
| `la:export` | Export labels |
| `la:docker` | Build + push Docker image |
### GenAI pipeline (`ts:*`)

| Script | What it does |
|--------|-------------|
| `ts:sec` | CLI entrypoint (`bun run ts/src/cli.ts`) |
| `ts:typecheck` | TypeScript type-check |
### Python training (`py:*`)

| Script | What it does |
|--------|-------------|
| `py:train` | CLI entrypoint (`uv run main.py`). Pass the subcommand as an argument, e.g. `bun run py:train dapt --config ...` |
### Data & checkpoints (`data:*`, `ckpt:*`, `sync:*`)

Large files are compressed with zstd (level 19) into `.lfs-store/` and tracked via Git LFS on Gitea (`git.claiborne.soy`). Per-file xxh3 hashing ensures only changed files are recompressed and uploaded.
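The change-detection idea can be sketched as a manifest of per-file digests compared between runs. This is an illustration under stated assumptions, not the repo's scripts: the `Manifest` type and both function names are hypothetical, and SHA-256 from `node:crypto` stands in for xxh3 so the sketch runs without extra dependencies.

```typescript
import { createHash } from "node:crypto";

// Hypothetical manifest shape: relative path -> content digest.
type Manifest = Record<string, string>;

// Stand-in digest. The real scripts use xxh3; SHA-256 is swapped in here
// only to keep the example dependency-free.
function digest(content: string): string {
  return createHash("sha256").update(content).digest("hex");
}

// Build a manifest from in-memory file contents (path -> content).
function buildManifest(files: Record<string, string>): Manifest {
  const manifest: Manifest = {};
  for (const [path, content] of Object.entries(files)) {
    manifest[path] = digest(content);
  }
  return manifest;
}

// Paths that are new, or whose digest differs from the previous manifest.
// Only these would need to be recompressed and uploaded.
function changedPaths(previous: Manifest, current: Manifest): string[] {
  return Object.keys(current)
    .filter((path) => previous[path] !== current[path])
    .sort();
}
```

Comparing a manifest against itself yields an empty list, so an unchanged tree triggers no compression work. Deleted files are not reported by this sketch and would need a separate pass over `previous`.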
| Script | What it does |
|--------|-------------|
| `data:push` | Compress `data/` → `.lfs-store/`, commit + push via LFS (default) |
| `data:pull` | Git LFS pull + decompress into `data/` |
| `ckpt:push` | Compress checkpoints → `.lfs-store/`, commit + push via LFS |
| `ckpt:pull` | Git LFS pull + decompress checkpoints |
| `sync:push` | Push both data + checkpoints |
| `sync:pull` | Pull both data + checkpoints |
| `data:package` | Build standalone `.tar.zst` archives for submission |

Checkpoint sync excludes `optimizer.pt`, `scheduler.pt`, `rng_state.pth`, and `.data_cache/` (resume-only state, regenerable).
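The exclusion rule above can be expressed as a small path filter. This is a sketch, assuming POSIX-style relative paths; the constant and function names are hypothetical, and only the excluded file and directory names come from this document.

```typescript
// File names excluded from checkpoint sync (resume-only state).
const EXCLUDED_FILES = new Set(["optimizer.pt", "scheduler.pt", "rng_state.pth"]);

// Directory names whose entire contents are excluded (regenerable cache).
const EXCLUDED_DIRS = new Set([".data_cache"]);

// True when a checkpoint-relative path should be left out of the sync.
function isExcludedFromSync(relativePath: string): boolean {
  const segments = relativePath.split("/").filter((s) => s.length > 0);
  const fileName = segments[segments.length - 1] ?? "";
  const dirs = segments.slice(0, -1);
  return EXCLUDED_FILES.has(fileName) || dirs.some((d) => EXCLUDED_DIRS.has(d));
}
```

With this filter, a path such as `checkpoint-500/optimizer.pt` (a made-up example) is skipped, while model weights in the same directory are kept.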
### Cross-package

| Script | What it does |
|--------|-------------|
| `typecheck` | Type-check all TS packages in parallel |
## Rules

- `bun` for all JS/TS. `uv` for Python.
- No barrel files. Direct path-based imports only.
- No TODO comments. Finish what you start.
- No parallel codepaths. Find and extend existing code before writing new.
- Schemas live in `packages/schemas/`. Do not duplicate type definitions elsewhere.
- `labelapp/` uses flat layout (no `src/` dir): `app/`, `db/`, `lib/`, `components/` at root.
- `labelapp/` uses file-based Drizzle migrations (`drizzle-kit generate` + `drizzle-kit migrate`), not `push`.
- Tests: `bun test` for backend route integration (`__test__/` dirs adjacent to routes), Playwright for E2E (`tests/`).