GPU Inference Benchmarks

BitNet b1.58 models use ternary weights ({-1, 0, +1}), enabling highly efficient inference on both consumer and datacenter GPUs. This page summarizes performance measurements across different hardware configurations.
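
The efficiency claim is easy to see in the arithmetic: when every weight is -1, 0, or +1, a dot product degenerates into additions, subtractions, and skips. A minimal numpy sketch (a toy illustration, not the actual bitnet.cpp kernel):

```python
import numpy as np

# With weights restricted to {-1, 0, +1}, a dot product needs no
# multiplications: add where the weight is +1, subtract where it is -1,
# skip where it is 0.
w = np.array([1, -1, 0, 1, -1], dtype=np.int8)   # ternary weights
x = np.array([3, 7, 2, -4, 5], dtype=np.int32)   # activations

ref = int(np.dot(w.astype(np.int32), x))           # multiply-accumulate
ternary = int(x[w == 1].sum() - x[w == -1].sum())  # add/sub/skip only

assert ref == ternary
print(ref)  # -13
```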

Hardware Comparison

| GPU | Tokens/sec (eval) | Tokens/sec (prompt) | Memory Usage | Notes |
| --- | --- | --- | --- | --- |
| NVIDIA RTX 3090 | ~298,000 | ~350,000 | ~1.3 GB | Consumer GPU, 24 GB VRAM |
| NVIDIA A100 80GB | ~274,000 | ~320,000 | ~1.3 GB | Datacenter GPU, PCIe/SXM |
| NVIDIA H100 SXM | ~300,000+ | ~380,000 | ~1.3 GB | Datacenter, AVX-512 VNNI on CPU side |
| CPU-only (M1 Pro) | ~0.2 | N/A | ~4 GB | ARM64, no GPU acceleration |
| CPU-only (x86 AVX-512) | ~15,000 | ~18,000 | ~1.3 GB | Server CPU with AVX-512 VNNI |

The numbers above are for the BitNet b1.58-2B-4T model (2.4 billion parameters) using the bitnet.cpp inference engine with I2_S quantization. Actual throughput depends on batch size, sequence length, and system configuration.

caution

These throughput figures represent bitnet.cpp kernel benchmark results (measuring raw computation speed), not end-to-end text generation throughput. End-to-end generation speed is substantially lower due to sequential token generation, memory transfers, and tokenizer overhead. See the BitNet Coherence Report for measured end-to-end generation speeds.

Model Size Scaling

| Model | Parameters | GGUF Size | Min VRAM | Approx. Throughput (RTX 3090) |
| --- | --- | --- | --- | --- |
| BitNet b1.58 Small | ~700M | ~350 MB | 1 GB | ~400K tok/s |
| BitNet b1.58-2B-4T | 2.4B | 1.1 GB | 2 GB | ~298K tok/s |
| BitNet b1.58 3B | ~3B | ~1.4 GB | 2 GB | ~220K tok/s |
| BitNet b1.58 7B | ~7B | ~3.2 GB | 4 GB | ~95K tok/s |

Ternary quantization (I2_S) stores the quantized weight tensors in roughly 2 bits per weight, making them about 16-20x smaller than their float32 equivalents. Whole-file savings are somewhat lower because embeddings, scales, and metadata are stored at higher precision: a 7B-parameter model that would normally require ~28 GB in float32 fits in roughly 3.2 GB as a ternary GGUF.
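
The arithmetic behind those figures, as a quick sketch (the 2-bit-per-weight packing matches I2_S; 1.58 bits is the information-theoretic minimum, log2(3). Real files run higher, as noted above):

```python
# Back-of-the-envelope weight-storage arithmetic for ternary models.
def weight_storage_gb(n_params: float, bits_per_weight: float) -> float:
    """Approximate weight storage in GB for a given bit width."""
    return n_params * bits_per_weight / 8 / 1e9

for n in (0.7e9, 2.4e9, 3e9, 7e9):
    print(f"{n/1e9:.1f}B params: "
          f"float32 = {weight_storage_gb(n, 32):.1f} GB, "
          f"I2_S (2-bit) = {weight_storage_gb(n, 2):.2f} GB")

# 7.0B params: float32 = 28.0 GB, I2_S (2-bit) = 1.75 GB
# Real GGUF files are larger (e.g. ~3.2 GB for 7B) because embeddings,
# scales, and metadata are stored at higher precision.
```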

Batch Size Effects

Batch size has a significant impact on throughput. Single-token generation (batch size 1) is latency-optimized, while larger batch sizes improve aggregate throughput at the cost of per-token latency (the sketch after the table works through the trade-off).

| Batch Size | Throughput Multiplier | Use Case |
| --- | --- | --- |
| 1 | 1x (baseline) | Interactive chat, real-time generation |
| 4 | ~2.5x | Small batch serving |
| 16 | ~6x | Batch processing |
| 64 | ~12x | Offline processing, benchmarks |
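
A quick sketch of what those multipliers mean for individual requests; the baseline figure is the RTX 3090 eval number from above, and the multipliers are the table's approximations, not fresh measurements:

```python
# Illustration of the batch-size trade-off using the multipliers above.
baseline_tps = 298_000  # assumed single-stream baseline (RTX 3090, batch 1)

for batch, mult in [(1, 1.0), (4, 2.5), (16, 6.0), (64, 12.0)]:
    aggregate = baseline_tps * mult   # tokens/sec across the whole batch
    per_stream = aggregate / batch    # tokens/sec seen by one request
    print(f"batch {batch:>2}: {aggregate/1e3:7.0f}K tok/s aggregate, "
          f"{per_stream/1e3:7.1f}K tok/s per stream")

# Aggregate throughput rises with batch size, but each stream slows
# down: batch 64 delivers ~12x total throughput while each request runs
# at ~12/64 = 0.19x of the single-stream rate.
```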

Memory Requirements

The ternary weight format dramatically reduces memory consumption:

  • Model weights: 1.58 bits per parameter (vs 32 bits for float32, 16 bits for float16)
  • KV cache: Standard float16, scales with context length and batch size
  • Activations: 8-bit quantized activations further reduce memory during inference

For the 2B-4T model, peak memory usage is approximately 1.3 GB for the model weights plus KV cache overhead that scales with sequence length. A 4096-token context window adds roughly 200-400 MB depending on the number of KV heads.
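
A rough sizing sketch for that KV-cache overhead; the architecture constants below (layers, KV heads, head dimension) are assumptions for illustration and should be checked against the model's actual config:

```python
# Rough KV-cache sizing for a 2B-class model with grouped-query attention.
n_layers    = 30    # assumed transformer layers
n_kv_heads  = 5     # assumed KV heads (grouped-query attention)
head_dim    = 128   # assumed per-head dimension
seq_len     = 4096  # context window
dtype_bytes = 2     # float16 KV cache

# K and V are each cached per layer, per KV head, per position.
kv_bytes = 2 * n_layers * n_kv_heads * head_dim * seq_len * dtype_bytes
print(f"KV cache @ {seq_len} tokens: {kv_bytes / 1e6:.0f} MB")
# -> ~315 MB, inside the 200-400 MB range quoted above.
```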

Inference Frameworks

Trinity's benchmarks use the official Microsoft bitnet.cpp framework, which provides optimized kernels for ternary inference (a simplified unpacking sketch follows the list):

  • I2_S kernel: Optimized for ternary weight unpacking and accumulation
  • TL2 kernel: Advanced kernel with tiling for better cache utilization on x86
  • ARM NEON: Vectorized path for Apple Silicon and ARM servers
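
For intuition, here is a simplified 2-bit unpacking sketch in the spirit of the I2_S format. The actual bit layout and code values in bitnet.cpp may differ, so treat the mapping below as an assumption:

```python
import numpy as np

# Assumed layout: four ternary weights packed per byte, with the 2-bit
# codes 0b00 -> -1, 0b01 -> 0, 0b10 -> +1 (a hypothetical mapping for
# illustration, not bitnet.cpp's exact on-disk format).
LUT = np.array([-1, 0, 1, 0], dtype=np.int8)  # assumed 2-bit code table

def unpack_i2(packed: np.ndarray) -> np.ndarray:
    """Expand a uint8 array of packed 2-bit codes into ternary weights."""
    shifts = np.array([0, 2, 4, 6], dtype=np.uint8)
    codes = (packed[:, None] >> shifts) & 0b11  # (n_bytes, 4) codes
    return LUT[codes].reshape(-1)               # ternary {-1, 0, +1}

packed = np.array([0b10_01_00_10], dtype=np.uint8)  # four packed weights
print(unpack_i2(packed))  # [ 1 -1  0  1] (low-order bits first)
```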

The RunPod deployment script (scripts/runpod_h100_bitnet.sh) automates benchmarking across thread counts and prompt variations on cloud GPU instances.