From 3e010a6d0c82387169f2452dac1b03cebfc07fb8 Mon Sep 17 00:00:00 2001
From: Joey Eamigh <55670930+JoeyEamigh@users.noreply.github.com>
Date: Sun, 5 Apr 2026 20:33:57 -0400
Subject: [PATCH] update readme to show git lfs

---
 README.md | 25 ++++++++++++++++---------
 1 file changed, 16 insertions(+), 9 deletions(-)

diff --git a/README.md b/README.md
index 51037ea..05336cb 100644
--- a/README.md
+++ b/README.md
@@ -8,10 +8,10 @@ Three-stage labeling pipeline: synthetic expert panel (3 LLMs via OpenRouter)
 
 ```bash
 # Clone and install
-git clone sec-cyBERT && cd sec-cyBERT
+git clone https://git.claiborne.soy/joey/SEC-cyBERT.git sec-cyBERT && cd sec-cyBERT
 bun install
 
-# Pull data (no credentials needed, ~700 MB compressed download)
+# Pull data (~750 MB compressed via Git LFS, ~20 GB with checkpoints)
 bun run data:pull
 ```
 
@@ -22,6 +22,7 @@ That gives you all extracted paragraphs, annotations, the DAPT corpus, benchmark
 | Tool | Install |
 |------|---------|
 | [Bun](https://bun.sh) ≥1.1 | `curl -fsSL https://bun.sh/install \| bash` |
+| [Git LFS](https://git-lfs.com) ≥3.0 | `apt install git-lfs` / `brew install git-lfs` |
 | [zstd](https://github.com/facebook/zstd) ≥1.5 | `apt install zstd` / `brew install zstd` |
 
 Additional prerequisites depending on what you're running:
@@ -41,8 +42,9 @@
 sec-cyBERT/
 ├── python/      # Training pipeline (HuggingFace Trainer, PyTorch)
 │   └── configs/ # YAML training configs
 ├── labelapp/    # Next.js human labeling webapp
-├── data/        # All data artifacts (DVC-managed, see data/README.md)
-├── checkpoints/ # Model training checkpoints
+├── data/        # All data artifacts (gitignored working copies)
+├── checkpoints/ # Model training checkpoints (gitignored working copies)
+├── .lfs-store/  # Compressed data + checkpoints (Git LFS-tracked)
 ├── scripts/     # Data packaging and utility scripts
 └── docs/        # Project documentation
 ```
@@ -102,17 +104,23 @@ bun run py:train tapt --config configs/tapt/modernbert.yaml  # TAPT (~2h)
 bun run py:train finetune --config configs/ft/modernbert.yaml  # Fine-tune classifier
 ```
 
-### Data management (`data:*`)
+### Data & checkpoint management
 
 ```bash
-bun run data:pull     # Download from R2 + decompress (no auth needed)
-bun run data:push     # Compress + upload to R2 via DVC (needs R2 write keys)
+bun run data:pull     # Pull data from LFS + decompress
+bun run data:push     # Compress changed data + push via LFS
+bun run ckpt:pull     # Pull checkpoints from LFS + decompress
+bun run ckpt:push     # Compress changed checkpoints + push via LFS
+bun run sync:pull     # Pull everything (data + checkpoints)
+bun run sync:push     # Push everything (data + checkpoints)
 bun run data:package  # Build standalone .tar.zst archives for offline distribution
 ```
 
+All push/pull scripts support `--dry-run` and use per-file xxh3 hashing to skip unchanged files.
+
 ## Data
 
-Data is versioned with [DVC](https://dvc.org/) and stored compressed (zstd-19) on Cloudflare R2. `bun run data:pull` fetches everything with no credentials required.
+Data is stored compressed (zstd-19) in `.lfs-store/` and tracked via [Git LFS](https://git-lfs.com) on Gitea. `bun run data:pull` decompresses from the LFS store into `data/`.
 
 | Dataset | Records | Description |
 |---------|---------|-------------|
@@ -145,5 +153,4 @@ Copy `.env.example` to `.env` and fill in the values you need:
 | Variable | Needed for |
 |----------|-----------|
 | `OPENROUTER_API_KEY` | GenAI labeling pipeline (extraction is free) |
-| `R2_ACCESS_KEY_ID` / `R2_SECRET_ACCESS_KEY` | Pushing data to DVC (pulling is anonymous) |
 | `DATABASE_URL` | Labelapp only (defaults to local Postgres) |
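
The "per-file hashing to skip unchanged files" behavior this patch documents could be sketched roughly as below. This is a minimal illustration under stated assumptions, not the repo's actual push/pull script: the manifest path `.lfs-store/.hashes.json` and the function names are hypothetical, and stdlib `hashlib.blake2b` stands in for the xxh3 hash the README mentions (xxh3 needs a third-party package).

```python
# Sketch: skip files whose content hash matches a saved manifest.
# NOTE: hashlib.blake2b is a stand-in for xxh3; manifest path is hypothetical.
import hashlib
import json
from pathlib import Path

MANIFEST = Path(".lfs-store/.hashes.json")  # hypothetical manifest location


def file_digest(path: Path) -> str:
    """Hash a file in 1 MiB chunks so large archives don't load into memory."""
    h = hashlib.blake2b(digest_size=16)
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    return h.hexdigest()


def changed_files(root: Path, manifest: dict[str, str]) -> list[Path]:
    """Return files under `root` whose digest differs from (or is absent in) the manifest."""
    changed = []
    for p in sorted(root.rglob("*")):
        if p.is_file() and manifest.get(str(p)) != file_digest(p):
            changed.append(p)
    return changed


def save_manifest(root: Path, manifest_path: Path = MANIFEST) -> None:
    """After a successful push, record current digests so the next run can skip."""
    digests = {str(p): file_digest(p) for p in sorted(root.rglob("*")) if p.is_file()}
    manifest_path.write_text(json.dumps(digests, indent=2))
```

A `--dry-run` flag in such a script would simply print the `changed_files(...)` result instead of compressing and pushing them.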