update readme to show git lfs

Joey Eamigh 2026-04-05 20:33:57 -04:00
parent dac00f90db
commit 3e010a6d0c

README.md

@@ -8,10 +8,10 @@ Three-stage labeling pipeline: synthetic expert panel (3 LLMs via OpenRouter)
 ```bash
 # Clone and install
-git clone <repo-url> sec-cyBERT && cd sec-cyBERT
+git clone https://git.claiborne.soy/joey/SEC-cyBERT.git sec-cyBERT && cd sec-cyBERT
 bun install
 
-# Pull data (no credentials needed, ~700 MB compressed download)
+# Pull data (~750 MB compressed via Git LFS, ~20 GB with checkpoints)
 bun run data:pull
 ```
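+
+Git LFS needs a one-time setup per machine before the pointers in `.lfs-store/` will resolve (standard Git LFS usage, not specific to this repo):
+
+```bash
+git lfs install   # registers the LFS smudge/clean filters (once per machine)
+git lfs pull      # fetches .lfs-store/ objects if the repo was cloned before LFS was installed
+```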
@@ -22,6 +22,7 @@ That gives you all extracted paragraphs, annotations, the DAPT corpus, benchmark
 | Tool | Install |
 |------|---------|
 | [Bun](https://bun.sh) ≥1.1 | `curl -fsSL https://bun.sh/install \| bash` |
+| [Git LFS](https://git-lfs.com) ≥3.0 | `apt install git-lfs` / `brew install git-lfs` |
 | [zstd](https://github.com/facebook/zstd) ≥1.5 | `apt install zstd` / `brew install zstd` |
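+
+To confirm the toolchain is in place (standard version flags for each CLI):
+
+```bash
+bun --version     # expect ≥1.1
+git lfs version   # expect ≥3.0
+zstd --version    # expect ≥1.5
+```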
 
 Additional prerequisites depending on what you're running:
@@ -41,8 +42,9 @@ sec-cyBERT/
 ├── python/       # Training pipeline (HuggingFace Trainer, PyTorch)
 │   └── configs/  # YAML training configs
 ├── labelapp/     # Next.js human labeling webapp
-├── data/         # All data artifacts (DVC-managed, see data/README.md)
-├── checkpoints/  # Model training checkpoints
+├── data/         # All data artifacts (gitignored working copies)
+├── checkpoints/  # Model training checkpoints (gitignored working copies)
+├── .lfs-store/   # Compressed data + checkpoints (Git LFS-tracked)
 ├── scripts/      # Data packaging and utility scripts
 └── docs/         # Project documentation
 ```
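+
+Working copies in `data/` and `checkpoints/` stay out of git; only their compressed counterparts in `.lfs-store/` travel through LFS. That routing comes from a `.gitattributes` pattern along these lines (the exact pattern is an assumption, check the repo's `.gitattributes`):
+
+```bash
+# `git lfs track` appends the pattern to .gitattributes
+git lfs track ".lfs-store/**"
+```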
@@ -102,17 +104,23 @@ bun run py:train tapt --config configs/tapt/modernbert.yaml # TAPT (~2h)
 bun run py:train finetune --config configs/ft/modernbert.yaml # Fine-tune classifier
 ```
 
-### Data management (`data:*`)
+### Data & checkpoint management
 
 ```bash
-bun run data:pull    # Download from R2 + decompress (no auth needed)
-bun run data:push    # Compress + upload to R2 via DVC (needs R2 write keys)
+bun run data:pull    # Pull data from LFS + decompress
+bun run data:push    # Compress changed data + push via LFS
+bun run ckpt:pull    # Pull checkpoints from LFS + decompress
+bun run ckpt:push    # Compress changed checkpoints + push via LFS
+bun run sync:pull    # Pull everything (data + checkpoints)
+bun run sync:push    # Push everything (data + checkpoints)
 bun run data:package # Build standalone .tar.zst archives for offline distribution
 ```
+
+All push/pull scripts support `--dry-run` and use per-file xxh3 hashing to skip unchanged files.
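+
+As a hypothetical sketch of that skip logic for a single file (assumes a recent xxhsum, which exposes XXH3 via `-H3`; paths are illustrative, the real logic lives in `scripts/`):
+
+```bash
+src="data/example.jsonl"                  # illustrative input
+cache=".lfs-store/.hashes/example.xxh3"   # hypothetical hash cache
+mkdir -p "$(dirname "$cache")"
+new=$(xxhsum -H3 "$src" | cut -d' ' -f1)
+if [ "$new" != "$(cat "$cache" 2>/dev/null)" ]; then
+  zstd -19 -f -o ".lfs-store/example.jsonl.zst" "$src"
+  echo "$new" > "$cache"
+fi
+```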
 
 ## Data
 
-Data is versioned with [DVC](https://dvc.org/) and stored compressed (zstd-19) on Cloudflare R2. `bun run data:pull` fetches everything with no credentials required.
+Data is stored compressed (zstd-19) in `.lfs-store/` and tracked via [Git LFS](https://git-lfs.com) on Gitea. `bun run data:pull` decompresses from the LFS store into `data/`.
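+
+Per file, the pull step amounts to a plain zstd decompress (paths illustrative):
+
+```bash
+zstd -d -f -o data/example.jsonl .lfs-store/example.jsonl.zst
+```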
 
 | Dataset | Records | Description |
 |---------|---------|-------------|
@@ -145,5 +153,4 @@ Copy `.env.example` to `.env` and fill in the values you need:
 | Variable | Needed for |
 |----------|-----------|
 | `OPENROUTER_API_KEY` | GenAI labeling pipeline (extraction is free) |
-| `R2_ACCESS_KEY_ID` / `R2_SECRET_ACCESS_KEY` | Pushing data to DVC (pulling is anonymous) |
 | `DATABASE_URL` | Labelapp only (defaults to local Postgres) |
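+
+Setting up the env file is the one-line copy described above:
+
+```bash
+cp .env.example .env   # then fill in only the variables you need
+```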