update readme to show git lfs
parent dac00f90db
commit 3e010a6d0c

README.md (25 lines changed)
@@ -8,10 +8,10 @@ Three-stage labeling pipeline: synthetic expert panel (3 LLMs via OpenRouter)
 
 ```bash
 # Clone and install
-git clone <repo-url> sec-cyBERT && cd sec-cyBERT
+git clone https://git.claiborne.soy/joey/SEC-cyBERT.git sec-cyBERT && cd sec-cyBERT
 bun install
 
-# Pull data (no credentials needed, ~700 MB compressed download)
+# Pull data (~750 MB compressed via Git LFS, ~20 GB with checkpoints)
 bun run data:pull
 ```
 
@@ -22,6 +22,7 @@ That gives you all extracted paragraphs, annotations, the DAPT corpus, benchmark
 | Tool | Install |
 |------|---------|
 | [Bun](https://bun.sh) ≥1.1 | `curl -fsSL https://bun.sh/install \| bash` |
+| [Git LFS](https://git-lfs.com) ≥3.0 | `apt install git-lfs` / `brew install git-lfs` |
 | [zstd](https://github.com/facebook/zstd) ≥1.5 | `apt install zstd` / `brew install zstd` |
 
 Additional prerequisites depending on what you're running:
@@ -41,8 +42,9 @@ sec-cyBERT/
 ├── python/        # Training pipeline (HuggingFace Trainer, PyTorch)
 │   └── configs/   # YAML training configs
 ├── labelapp/      # Next.js human labeling webapp
-├── data/          # All data artifacts (DVC-managed, see data/README.md)
-├── checkpoints/   # Model training checkpoints
+├── data/          # All data artifacts (gitignored working copies)
+├── checkpoints/   # Model training checkpoints (gitignored working copies)
+├── .lfs-store/    # Compressed data + checkpoints (Git LFS-tracked)
 ├── scripts/       # Data packaging and utility scripts
 └── docs/          # Project documentation
 ```
@@ -102,17 +104,23 @@ bun run py:train tapt --config configs/tapt/modernbert.yaml  # TAPT (~2h)
 bun run py:train finetune --config configs/ft/modernbert.yaml  # Fine-tune classifier
 ```
 
-### Data management (`data:*`)
+### Data & checkpoint management
 
 ```bash
-bun run data:pull     # Download from R2 + decompress (no auth needed)
-bun run data:push     # Compress + upload to R2 via DVC (needs R2 write keys)
+bun run data:pull     # Pull data from LFS + decompress
+bun run data:push     # Compress changed data + push via LFS
+bun run ckpt:pull     # Pull checkpoints from LFS + decompress
+bun run ckpt:push     # Compress changed checkpoints + push via LFS
+bun run sync:pull     # Pull everything (data + checkpoints)
+bun run sync:push     # Push everything (data + checkpoints)
 bun run data:package  # Build standalone .tar.zst archives for offline distribution
 ```
 
+All push/pull scripts support `--dry-run` and use per-file xxh3 hashing to skip unchanged files.
+
 ## Data
 
-Data is versioned with [DVC](https://dvc.org/) and stored compressed (zstd-19) on Cloudflare R2. `bun run data:pull` fetches everything with no credentials required.
+Data is stored compressed (zstd-19) in `.lfs-store/` and tracked via [Git LFS](https://git-lfs.com) on Gitea. `bun run data:pull` decompresses from the LFS store into `data/`.
 
 | Dataset | Records | Description |
 |---------|---------|-------------|
@@ -145,5 +153,4 @@ Copy `.env.example` to `.env` and fill in the values you need:
 | Variable | Needed for |
 |----------|-----------|
 | `OPENROUTER_API_KEY` | GenAI labeling pipeline (extraction is free) |
-| `R2_ACCESS_KEY_ID` / `R2_SECRET_ACCESS_KEY` | Pushing data to DVC (pulling is anonymous) |
 | `DATABASE_URL` | Labelapp only (defaults to local Postgres) |
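The storage model the diff introduces (zstd-19 archives in `.lfs-store/`, decompressed into gitignored working copies) can be sketched with plain `zstd`. This is an illustrative round-trip only, not the repo's actual `data:push`/`data:pull` scripts; the `demo/` paths and file names are made up for the example:

```shell
mkdir -p demo/.lfs-store demo/data
printf 'paragraph-1\nparagraph-2\n' > demo/data/sample.jsonl

# "push": compress the working copy into the LFS-tracked store (zstd level 19, as in the README)
zstd -19 -q -f demo/data/sample.jsonl -o demo/.lfs-store/sample.jsonl.zst

# "pull": decompress from the store back into the gitignored working tree
zstd -d -q -f demo/.lfs-store/sample.jsonl.zst -o demo/data/restored.jsonl

# Verify the round-trip is lossless
cmp demo/data/sample.jsonl demo/data/restored.jsonl && echo "round-trip OK"
```

The real scripts add Git LFS transfer on top (`git lfs` fetches the `.zst` blobs) plus per-file xxh3 hashing to skip files that have not changed.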