update readme to show git lfs
parent dac00f90db
commit 3e010a6d0c

README.md (25 lines changed)
@@ -8,10 +8,10 @@ Three-stage labeling pipeline: synthetic expert panel (3 LLMs via OpenRouter)
 
 ```bash
 # Clone and install
-git clone <repo-url> sec-cyBERT && cd sec-cyBERT
+git clone https://git.claiborne.soy/joey/SEC-cyBERT.git sec-cyBERT && cd sec-cyBERT
 bun install
 
-# Pull data (no credentials needed, ~700 MB compressed download)
+# Pull data (~750 MB compressed via Git LFS, ~20 GB with checkpoints)
 bun run data:pull
 ```
 
@@ -22,6 +22,7 @@ That gives you all extracted paragraphs, annotations, the DAPT corpus, benchmark
 | Tool | Install |
 |------|---------|
 | [Bun](https://bun.sh) ≥1.1 | `curl -fsSL https://bun.sh/install \| bash` |
+| [Git LFS](https://git-lfs.com) ≥3.0 | `apt install git-lfs` / `brew install git-lfs` |
 | [zstd](https://github.com/facebook/zstd) ≥1.5 | `apt install zstd` / `brew install zstd` |
 
 Additional prerequisites depending on what you're running:
@@ -41,8 +42,9 @@ sec-cyBERT/
 ├── python/        # Training pipeline (HuggingFace Trainer, PyTorch)
 │   └── configs/   # YAML training configs
 ├── labelapp/      # Next.js human labeling webapp
-├── data/          # All data artifacts (DVC-managed, see data/README.md)
-├── checkpoints/   # Model training checkpoints
+├── data/          # All data artifacts (gitignored working copies)
+├── checkpoints/   # Model training checkpoints (gitignored working copies)
+├── .lfs-store/    # Compressed data + checkpoints (Git LFS-tracked)
 ├── scripts/       # Data packaging and utility scripts
 └── docs/          # Project documentation
 ```
@@ -102,17 +104,23 @@ bun run py:train tapt --config configs/tapt/modernbert.yaml  # TAPT (~2h)
 bun run py:train finetune --config configs/ft/modernbert.yaml  # Fine-tune classifier
 ```
 
-### Data management (`data:*`)
+### Data & checkpoint management
 
 ```bash
-bun run data:pull     # Download from R2 + decompress (no auth needed)
-bun run data:push     # Compress + upload to R2 via DVC (needs R2 write keys)
+bun run data:pull     # Pull data from LFS + decompress
+bun run data:push     # Compress changed data + push via LFS
+bun run ckpt:pull     # Pull checkpoints from LFS + decompress
+bun run ckpt:push     # Compress changed checkpoints + push via LFS
+bun run sync:pull     # Pull everything (data + checkpoints)
+bun run sync:push     # Push everything (data + checkpoints)
 bun run data:package  # Build standalone .tar.zst archives for offline distribution
 ```
 
+All push/pull scripts support `--dry-run` and use per-file xxh3 hashing to skip unchanged files.
+
 ## Data
 
-Data is versioned with [DVC](https://dvc.org/) and stored compressed (zstd-19) on Cloudflare R2. `bun run data:pull` fetches everything with no credentials required.
+Data is stored compressed (zstd-19) in `.lfs-store/` and tracked via [Git LFS](https://git-lfs.com) on Gitea. `bun run data:pull` decompresses from the LFS store into `data/`.
 
 | Dataset | Records | Description |
 |---------|---------|-------------|
@@ -145,5 +153,4 @@ Copy `.env.example` to `.env` and fill in the values you need:
 | Variable | Needed for |
 |----------|-----------|
 | `OPENROUTER_API_KEY` | GenAI labeling pipeline (extraction is free) |
-| `R2_ACCESS_KEY_ID` / `R2_SECRET_ACCESS_KEY` | Pushing data to DVC (pulling is anonymous) |
 | `DATABASE_URL` | Labelapp only (defaults to local Postgres) |
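The storage model the diff introduces (zstd-19 archives in `.lfs-store/`, decompressed into gitignored working copies) can be sketched with plain `zstd`. This is an illustrative round-trip only, not the repo's actual `data:push`/`data:pull` scripts; the `demo/` paths and file names are made up for the example:

```shell
mkdir -p demo/.lfs-store demo/data
printf 'paragraph-1\nparagraph-2\n' > demo/data/sample.jsonl

# "push": compress the working copy into the LFS-tracked store (zstd level 19, as in the README)
zstd -19 -q -f demo/data/sample.jsonl -o demo/.lfs-store/sample.jsonl.zst

# "pull": decompress from the store back into the gitignored working tree
zstd -d -q -f demo/.lfs-store/sample.jsonl.zst -o demo/data/restored.jsonl

# Verify the round-trip is lossless
cmp demo/data/sample.jsonl demo/data/restored.jsonl && echo "round-trip OK"
```

The real scripts add Git LFS transfer on top (`git lfs` fetches the `.zst` blobs) plus per-file xxh3 hashing to skip files that have not changed.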