Local Deployment

Run Trinity on your local machine for development, testing, and inference with ternary models. This guide covers building from source, running inference, and using the CLI tools.

Prerequisites

Requirement | Version | Notes
Zig | 0.13.0 | Exact version required
Git | Any recent | For cloning the repository
RAM | 4 GB minimum | 8 GB+ recommended for model inference
Disk | 1 GB minimum | Plus model file size
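
Because the exact toolchain version matters, confirm the compiler on your PATH before building:

zig version
# should print: 0.13.0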

Build from Source

macOS

# Install Zig (Apple Silicon)
curl -LO https://ziglang.org/download/0.13.0/zig-macos-aarch64-0.13.0.tar.xz
tar -xf zig-macos-aarch64-0.13.0.tar.xz
export PATH="$PWD/zig-macos-aarch64-0.13.0:$PATH"

# Alternatively, use Homebrew (confirm it installs 0.13.0, since the exact version is required)
brew install zig@0.13

# Clone and build
git clone https://github.com/gHashTag/trinity.git
cd trinity
zig build

Linux

# Install Zig
curl -LO https://ziglang.org/download/0.13.0/zig-linux-x86_64-0.13.0.tar.xz
tar -xf zig-linux-x86_64-0.13.0.tar.xz
export PATH="$PWD/zig-linux-x86_64-0.13.0:$PATH"

# Clone and build
git clone https://github.com/gHashTag/trinity.git
cd trinity
zig build

Windows

  1. Download Zig 0.13.0 from ziglang.org/download
  2. Extract to C:\zig and add to your PATH
  3. Clone and build:
git clone https://github.com/gHashTag/trinity.git
cd trinity
zig build

Verify the Build

zig build test

All tests should pass. You can also run specific module tests:

zig test src/vsa.zig    # VSA operations
zig test src/vm.zig     # Virtual machine

Running Inference with Local Models

Obtaining GGUF Models

BitNet b1.58 models in GGUF format are available from HuggingFace:

  • microsoft/bitnet-b1.58-2B-4T-gguf -- 2.4B parameter model, ~1.1 GB
  • Other ternary models can be converted to GGUF using the tools provided with bitnet.cpp

Download the model using the huggingface_hub Python API (a CLI alternative follows the snippet):

pip install huggingface_hub
python -c "
from huggingface_hub import hf_hub_download
hf_hub_download('microsoft/bitnet-b1.58-2B-4T-gguf', 'ggml-model-i2_s.gguf', local_dir='./models')
"

Chat Mode

Start an interactive chat session with a local model:

./bin/vibee chat --model ./models/ggml-model-i2_s.gguf

Server Mode

Run Trinity as an HTTP server for API-based inference:

./bin/vibee serve --port 8080

This starts a local HTTP server that accepts inference requests through a JSON API.
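
As a minimal sketch of a request, assuming the server exposes an OpenAI-style completions endpoint (the path and JSON fields here are illustrative assumptions, not the documented schema; check the server's API reference for the exact format):

curl -X POST http://localhost:8080/v1/completions \
  -H "Content-Type: application/json" \
  -d '{"prompt": "Explain ternary weights in one sentence.", "max_tokens": 64}'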

Memory Requirements by Model Size

Model | Parameters | GGUF File Size | Min RAM (inference) | Recommended RAM
BitNet Small | ~700M | ~350 MB | 2 GB | 4 GB
BitNet 2B-4T | 2.4B | 1.1 GB | 4 GB | 8 GB
BitNet 3B | ~3B | ~1.4 GB | 4 GB | 8 GB
BitNet 7B | ~7B | ~3.2 GB | 8 GB | 16 GB

These numbers reflect the ternary-packed model weights. During inference, additional memory is required for the KV cache (which scales with context length) and activation buffers.
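
As a rough, hypothetical illustration of how the KV cache grows with context length (the layer count, KV-head count, head dimension, and fp16 cache precision below are placeholders, not the actual BitNet 2B-4T architecture):

# bytes = 2 (K and V) * layers * kv_heads * head_dim * context_len * bytes_per_element
# e.g. 30 layers, 20 KV heads, head_dim 128, 4096-token context, 2-byte (fp16) cache:
echo $((2 * 30 * 20 * 128 * 4096 * 2))   # 1258291200 bytes, about 1.26 GB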

CPU Performance Expectations

Local CPU inference is significantly slower than GPU inference. On an Apple M1 Pro or comparable x86 CPU, expect:

  • Without optimized kernels: 0.1-0.5 tokens/second (very slow)
  • With AVX-512 VNNI (x86): Up to ~15,000 tokens/second
  • ARM NEON (Apple Silicon): Performance depends on kernel availability

For production-grade throughput, see the RunPod GPU Deployment guide.

Other CLI Commands

# Generate code from a .vibee specification
./bin/vibee gen specs/tri/module.vibee

# Run a program via the bytecode VM
./bin/vibee run program.999

# Build the Firebird LLM CLI in release mode
zig build firebird

# Cross-platform release builds
zig build release