From 3e010a6d0c82387169f2452dac1b03cebfc07fb8 Mon Sep 17 00:00:00 2001
From: Joey Eamigh <55670930+JoeyEamigh@users.noreply.github.com>
Date: Sun, 5 Apr 2026 20:33:57 -0400
Subject: [PATCH] update readme to show git lfs

---
 README.md | 25 ++++++++++++++++---------
 1 file changed, 16 insertions(+), 9 deletions(-)

diff --git a/README.md b/README.md
index 51037ea..05336cb 100644
--- a/README.md
+++ b/README.md
@@ -8,10 +8,10 @@ Three-stage labeling pipeline: synthetic expert panel (3 LLMs via OpenRouter)
 
 ```bash
 # Clone and install
-git clone sec-cyBERT && cd sec-cyBERT
+git clone https://git.claiborne.soy/joey/SEC-cyBERT.git sec-cyBERT && cd sec-cyBERT
 bun install
 
-# Pull data (no credentials needed, ~700 MB compressed download)
+# Pull data (~750 MB compressed via Git LFS, ~20 GB with checkpoints)
 bun run data:pull
 ```
 
@@ -22,6 +22,7 @@ That gives you all extracted paragraphs, annotations, the DAPT corpus, benchmark
 | Tool | Install |
 |------|---------|
 | [Bun](https://bun.sh) ≥1.1 | `curl -fsSL https://bun.sh/install \| bash` |
+| [Git LFS](https://git-lfs.com) ≥3.0 | `apt install git-lfs` / `brew install git-lfs` |
 | [zstd](https://github.com/facebook/zstd) ≥1.5 | `apt install zstd` / `brew install zstd` |
 
 Additional prerequisites depending on what you're running:
@@ -41,8 +42,9 @@
 sec-cyBERT/
 ├── python/      # Training pipeline (HuggingFace Trainer, PyTorch)
 │   └── configs/ # YAML training configs
 ├── labelapp/    # Next.js human labeling webapp
-├── data/        # All data artifacts (DVC-managed, see data/README.md)
-├── checkpoints/ # Model training checkpoints
+├── data/        # All data artifacts (gitignored working copies)
+├── checkpoints/ # Model training checkpoints (gitignored working copies)
+├── .lfs-store/  # Compressed data + checkpoints (Git LFS-tracked)
 ├── scripts/     # Data packaging and utility scripts
 └── docs/        # Project documentation
 ```
@@ -102,17 +104,23 @@ bun run py:train tapt --config configs/tapt/modernbert.yaml  # TAPT (~2h)
 bun run py:train finetune --config configs/ft/modernbert.yaml  # Fine-tune classifier
 ```
 
-### Data management (`data:*`)
+### Data & checkpoint management
 
 ```bash
-bun run data:pull     # Download from R2 + decompress (no auth needed)
-bun run data:push     # Compress + upload to R2 via DVC (needs R2 write keys)
+bun run data:pull     # Pull data from LFS + decompress
+bun run data:push     # Compress changed data + push via LFS
+bun run ckpt:pull     # Pull checkpoints from LFS + decompress
+bun run ckpt:push     # Compress changed checkpoints + push via LFS
+bun run sync:pull     # Pull everything (data + checkpoints)
+bun run sync:push     # Push everything (data + checkpoints)
 bun run data:package  # Build standalone .tar.zst archives for offline distribution
 ```
 
+All push/pull scripts support `--dry-run` and use per-file xxh3 hashing to skip unchanged files.
+
 ## Data
 
-Data is versioned with [DVC](https://dvc.org/) and stored compressed (zstd-19) on Cloudflare R2. `bun run data:pull` fetches everything with no credentials required.
+Data is stored compressed (zstd-19) in `.lfs-store/` and tracked via [Git LFS](https://git-lfs.com) on Gitea. `bun run data:pull` decompresses from the LFS store into `data/`.
 
 | Dataset | Records | Description |
 |---------|---------|-------------|
@@ -145,5 +153,4 @@ Copy `.env.example` to `.env` and fill in the values you need:
 | Variable | Needed for |
 |----------|-----------|
 | `OPENROUTER_API_KEY` | GenAI labeling pipeline (extraction is free) |
-| `R2_ACCESS_KEY_ID` / `R2_SECRET_ACCESS_KEY` | Pushing data to DVC (pulling is anonymous) |
 | `DATABASE_URL` | Labelapp only (defaults to local Postgres) |
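
The "per-file hashing to skip unchanged files" behavior this patch documents could be sketched roughly as below. This is a minimal illustration under stated assumptions, not the repo's actual push/pull script: the manifest path `.lfs-store/.hashes.json` and the function names are hypothetical, and stdlib `hashlib.blake2b` stands in for the xxh3 hash the README mentions (xxh3 needs a third-party package).

```python
# Sketch: skip files whose content hash matches a saved manifest.
# NOTE: hashlib.blake2b is a stand-in for xxh3; manifest path is hypothetical.
import hashlib
import json
from pathlib import Path

MANIFEST = Path(".lfs-store/.hashes.json")  # hypothetical manifest location


def file_digest(path: Path) -> str:
    """Hash a file in 1 MiB chunks so large archives don't load into memory."""
    h = hashlib.blake2b(digest_size=16)
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    return h.hexdigest()


def changed_files(root: Path, manifest: dict[str, str]) -> list[Path]:
    """Return files under `root` whose digest differs from (or is absent in) the manifest."""
    changed = []
    for p in sorted(root.rglob("*")):
        if p.is_file() and manifest.get(str(p)) != file_digest(p):
            changed.append(p)
    return changed


def save_manifest(root: Path, manifest_path: Path = MANIFEST) -> None:
    """After a successful push, record current digests so the next run can skip."""
    digests = {str(p): file_digest(p) for p in sorted(root.rglob("*")) if p.is_file()}
    manifest_path.write_text(json.dumps(digests, indent=2))
```

A `--dry-run` flag in such a script would simply print the `changed_files(...)` result instead of compressing and pushing them.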