# Local Deployment
Run Trinity on your local machine for development, testing, and inference with ternary models. This guide covers building from source, running inference, and using the CLI tools.
## Prerequisites
| Requirement | Version | Notes |
|---|---|---|
| Zig | 0.13.0 | Exact version required |
| Git | Any recent | For cloning the repository |
| RAM | 4 GB minimum | 8 GB+ recommended for model inference |
| Disk | 1 GB minimum | Plus model file size |
## Build from Source

### macOS

```bash
# Install Zig (Apple Silicon)
curl -LO https://ziglang.org/download/0.13.0/zig-macos-aarch64-0.13.0.tar.xz
tar -xf zig-macos-aarch64-0.13.0.tar.xz
export PATH="$PWD/zig-macos-aarch64-0.13.0:$PATH"

# Alternatively, use Homebrew
brew install zig@0.13

# Clone and build
git clone https://github.com/gHashTag/trinity.git
cd trinity
zig build
```
### Linux

```bash
# Install Zig
curl -LO https://ziglang.org/download/0.13.0/zig-linux-x86_64-0.13.0.tar.xz
tar -xf zig-linux-x86_64-0.13.0.tar.xz
export PATH="$PWD/zig-linux-x86_64-0.13.0:$PATH"

# Clone and build
git clone https://github.com/gHashTag/trinity.git
cd trinity
zig build
```
### Windows

1. Download Zig 0.13.0 from [ziglang.org/download](https://ziglang.org/download/)
2. Extract to `C:\zig` and add it to your PATH
3. Clone and build:

```bash
git clone https://github.com/gHashTag/trinity.git
cd trinity
zig build
```
## Verify the Build

```bash
zig build test
```

All tests should pass. You can also run specific module tests:

```bash
zig test src/vsa.zig  # VSA operations
zig test src/vm.zig   # Virtual machine
```
## Running Inference with Local Models

### Obtaining GGUF Models
BitNet b1.58 models in GGUF format are available from Hugging Face:

- `microsoft/bitnet-b1.58-2B-4T-gguf`: 2.4B parameter model, ~1.1 GB
- Other ternary models can be converted to GGUF using the tools provided with bitnet.cpp
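The ~1.1 GB figure for a 2.4B-parameter model follows from 1.58-bit ternary quantization: each weight is one of {-1, 0, +1}, and packed formats such as `i2_s` store weights at roughly 2 bits each. The sketch below is purely illustrative (the real GGUF `i2_s` layout differs in detail) but shows the arithmetic:

```python
def pack_ternary(weights):
    """Pack ternary weights (-1, 0, +1) at 2 bits each, 4 per byte.

    Illustrative only -- the actual i2_s GGUF layout differs in detail.
    """
    codes = [w + 1 for w in weights]  # map -1/0/+1 to 0b00/0b01/0b10
    out = bytearray()
    for i in range(0, len(codes), 4):
        b = 0
        for j, c in enumerate(codes[i:i + 4]):
            b |= c << (2 * j)
        out.append(b)
    return bytes(out)

def unpack_ternary(data, n):
    """Recover the first n ternary weights from packed bytes."""
    return [((byte >> (2 * j)) & 0b11) - 1
            for byte in data for j in range(4)][:n]

w = [-1, 0, 1, 1, 0, -1, 1, 0]
packed = pack_ternary(w)
assert unpack_ternary(packed, len(w)) == w
# At 2 bits/weight, 2.4e9 weights pack into ~0.6 GB; embeddings and
# per-block scales, stored at higher precision, account for the rest.
```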
Download via the `huggingface_hub` Python API (or the `huggingface-cli` tool):

```bash
pip install huggingface_hub
python -c "
from huggingface_hub import hf_hub_download
hf_hub_download('microsoft/bitnet-b1.58-2B-4T-gguf', 'ggml-model-i2_s.gguf', local_dir='./models')
"
```
### Chat Mode

Start an interactive chat session with a local model:

```bash
./bin/vibee chat --model ./models/ggml-model-i2_s.gguf
```
### Server Mode

Run Trinity as an HTTP server for API-based inference:

```bash
./bin/vibee serve --port 8080
```
This starts a local HTTP server that accepts inference requests via JSON API.
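The endpoint path and JSON schema are not documented here, so the request shape below is a hypothetical sketch; consult the server's API documentation for the real field names:

```python
import json
from urllib import request

# Hypothetical request shape -- the endpoint name and payload fields
# are assumptions, not the documented Trinity API.
payload = {"prompt": "Hello, Trinity!", "max_tokens": 64}
req = request.Request(
    "http://localhost:8080/generate",
    data=json.dumps(payload).encode(),
    headers={"Content-Type": "application/json"},
)
# With the server running:
# with request.urlopen(req) as resp:
#     print(json.load(resp))
```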
### Memory Requirements by Model Size
| Model | Parameters | GGUF File Size | Min RAM (inference) | Recommended RAM |
|---|---|---|---|---|
| BitNet Small | ~700M | ~350 MB | 2 GB | 4 GB |
| BitNet 2B-4T | 2.4B | 1.1 GB | 4 GB | 8 GB |
| BitNet 3B | ~3B | ~1.4 GB | 4 GB | 8 GB |
| BitNet 7B | ~7B | ~3.2 GB | 8 GB | 16 GB |
These numbers reflect the ternary-packed model weights. During inference, additional memory is required for the KV cache (which scales with context length) and activation buffers.
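The KV cache stores a key and a value vector per layer, per attention head, per token, so its size can be estimated directly. The layer and head dimensions below are hypothetical for a 2B-class model (the real BitNet 2B-4T configuration may differ):

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, context_len, bytes_per_elem=2):
    """Rough KV-cache size: keys + values for every layer and position."""
    return 2 * n_layers * n_kv_heads * head_dim * context_len * bytes_per_elem

# Hypothetical 2B-class config: 30 layers, 5 KV heads of dim 128,
# a 4096-token context, fp16 (2-byte) cache entries.
size = kv_cache_bytes(30, 5, 128, 4096, 2)
print(f"{size / 2**20:.0f} MiB")  # 300 MiB
```

Doubling the context length doubles this figure, which is why long-context runs need headroom beyond the packed weights.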
### CPU Performance Expectations
Local CPU inference is significantly slower than GPU inference. On an Apple M1 Pro or comparable x86 CPU, expect:
- Without optimized kernels: 0.1-0.5 tokens/second (very slow)
- With AVX-512 VNNI (x86): Up to ~15,000 tokens/second
- ARM NEON (Apple Silicon): Performance depends on kernel availability
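These rates translate directly into wall-clock latency for a reply of a given length:

```python
def generation_seconds(n_tokens, tokens_per_sec):
    """Wall-clock time to generate n_tokens at a given decode rate."""
    return n_tokens / tokens_per_sec

# A 200-token reply at the unoptimized rate:
print(generation_seconds(200, 0.5))  # 400.0 seconds, nearly 7 minutes
```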
For production-grade throughput, see the RunPod GPU Deployment guide.
## Other CLI Commands

```bash
# Generate code from a .vibee specification
./bin/vibee gen specs/tri/module.vibee

# Run a program via the bytecode VM
./bin/vibee run program.999

# Build the Firebird LLM CLI in release mode
zig build firebird

# Cross-platform release builds
zig build release
```