GGUF Model Format

Trinity reads model weights from GGUF (GPT-Generated Unified Format) files, the standard format used by the llama.cpp ecosystem. This page documents how Trinity parses GGUF v3 files and what model configurations are supported.

GGUF v3 File Structure

A GGUF file is organized into four sequential sections:

  1. Header: Begins with the magic bytes 0x46554747 ("GGUF" read back as a little-endian u32), followed by the format version number (3), the count of tensors, and the count of metadata key-value pairs.

  2. Metadata: A sequence of key-value pairs that describe the model architecture, tokenizer configuration, and training parameters. Each entry consists of a length-prefixed string key, a type tag, and the corresponding value.

  3. Tensor Descriptors: For each tensor, the file records the name (string), number of dimensions, shape (array of dimension sizes), quantization type (enum), and byte offset into the data section. Tensors are aligned to 32-byte boundaries by default.

  4. Tensor Data: The raw weight data for all tensors, laid out contiguously and aligned according to the alignment parameter. Trinity reads this data directly into memory for inference.
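The fixed-size header described in step 1 can be parsed with a few lines of Python. This is a minimal sketch, not Trinity's actual reader; the function name is illustrative, and the field order (u32 magic, u32 version, u64 tensor count, u64 metadata count, all little-endian) follows the GGUF v3 layout:

```python
import struct

GGUF_MAGIC = 0x46554747  # b"GGUF" read back as a little-endian u32

def read_gguf_header(buf: bytes) -> dict:
    """Parse the fixed GGUF v3 header: u32 magic, u32 version,
    u64 tensor count, u64 metadata key-value count (little-endian)."""
    magic, version, n_tensors, n_kv = struct.unpack_from("<IIQQ", buf, 0)
    if magic != GGUF_MAGIC:
        raise ValueError("not a GGUF file")
    if version != 3:
        raise ValueError(f"unsupported GGUF version {version}")
    return {"version": version, "n_tensors": n_tensors, "n_kv": n_kv}

# Synthetic header: version 3, 291 tensors, 24 metadata pairs.
header = read_gguf_header(struct.pack("<IIQQ", GGUF_MAGIC, 3, 291, 24))
```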

Metadata Value Types

The GGUF format supports the following value types for metadata entries:

| Type ID | Name | Description |
|---------|------|-------------|
| 0 | UINT8 | Unsigned 8-bit integer |
| 1 | INT8 | Signed 8-bit integer |
| 2 | UINT16 | Unsigned 16-bit integer |
| 3 | INT16 | Signed 16-bit integer |
| 4 | UINT32 | Unsigned 32-bit integer |
| 5 | INT32 | Signed 32-bit integer |
| 6 | FLOAT32 | 32-bit IEEE 754 float |
| 7 | BOOL | Boolean (1 byte) |
| 8 | STRING | Length-prefixed UTF-8 string |
| 9 | ARRAY | Typed array with element type and count |
| 10 | UINT64 | Unsigned 64-bit integer |
| 11 | INT64 | Signed 64-bit integer |
| 12 | FLOAT64 | 64-bit IEEE 754 double |
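These type IDs map directly onto fixed-width decoders. A minimal sketch (function names are mine; GGUF strings are a u64 length prefix followed by UTF-8 bytes, and scalars are plain little-endian values):

```python
import struct

def read_string(buf: bytes, off: int):
    """GGUF STRING (type 8): u64 length prefix, then UTF-8 bytes."""
    (n,) = struct.unpack_from("<Q", buf, off)
    s = buf[off + 8 : off + 8 + n].decode("utf-8")
    return s, off + 8 + n

# struct format characters for the fixed-width scalar type IDs.
SCALARS = {0: "<B", 1: "<b", 2: "<H", 3: "<h", 4: "<I", 5: "<i",
           6: "<f", 7: "<?", 10: "<Q", 11: "<q", 12: "<d"}

def read_value(buf: bytes, off: int, type_id: int):
    """Decode one metadata value; returns (value, new_offset)."""
    if type_id == 8:                      # STRING
        return read_string(buf, off)
    fmt = SCALARS[type_id]                # fixed-width scalar
    (v,) = struct.unpack_from(fmt, buf, off)
    return v, off + struct.calcsize(fmt)

# Example: a UINT32 value 2560 followed by a STRING "llama".
blob = struct.pack("<I", 2560) + struct.pack("<Q", 5) + b"llama"
num, off = read_value(blob, 0, 4)
name, off = read_value(blob, off, 8)
```

ARRAY (type 9) is left out of the sketch; it adds an element-type tag and a count, then repeats the element decoder.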

Supported Quantization Types

Trinity's GGUF reader recognizes a wide range of quantization formats. The following types are most relevant for BitNet inference:

Standard Types

| Type | ID | Block Size | Bytes/Block | Description |
|------|----|------------|-------------|-------------|
| F32 | 0 | 1 | 4 | Full-precision 32-bit float |
| F16 | 1 | 1 | 2 | Half-precision 16-bit float |
| BF16 | 30 | 1 | 2 | Brain floating point 16-bit |
| Q8_0 | 8 | 32 | 34 | 8-bit quantization with scale |
| Q4_0 | 2 | 32 | 18 | 4-bit quantization with scale |
| Q4_1 | 3 | 32 | 20 | 4-bit quantization with scale and minimum |
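For the block formats, Bytes/Block is the block's on-disk footprint. Q8_0's 34 bytes, for instance, break down as a 2-byte scale plus 32 quantized values. A dequantization sketch, assuming the llama.cpp block layout (a float16 scale followed by 32 signed 8-bit values):

```python
import struct

def dequant_q8_0(block: bytes) -> list:
    """Dequantize one 34-byte Q8_0 block: f16 scale + 32 int8 values."""
    (scale,) = struct.unpack_from("<e", block, 0)   # 2-byte f16 scale
    qs = struct.unpack_from("<32b", block, 2)       # 32 int8 weights
    return [scale * q for q in qs]

# Build a synthetic block: scale 0.5, values -16..15.
blk = struct.pack("<e", 0.5) + struct.pack("<32b", *range(-16, 16))
vals = dequant_q8_0(blk)
```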

BitNet Ternary Types

| Type | ID | Block Size | Bytes/Block | Description |
|------|----|------------|-------------|-------------|
| I2_S | 36 | 4 | 1 | 2-bit integer with scale; encodes ternary {-1, 0, +1} as 4 values per byte |
| TQ1_0 | 34 | 32 | 8 | Pure ternary packed, no scale factor |
| TQ2_0 | 35 | 32 | 10 | Ternary packed with 2-byte scale factor |
| TL1 | 38 | 4 | 1 | BitNet TL1 format |
| TL2 | 39 | 4 | 1 | BitNet TL2 format |

Low-Bit Quantization Types

| Type | ID | Description |
|------|----|-------------|
| IQ1_S | 19 | 1-bit integer quantization with scale |
| IQ1_M | 29 | 1-bit integer quantization (modified) |
| IQ2_XXS | 16 | Ultra-low 2-bit quantization |
| IQ2_XS | 17 | Extra-small 2-bit quantization |
| IQ2_S | 22 | 2-bit integer quantization with scale |
| IQ3_XXS | 18 | Ultra-low 3-bit quantization |
| IQ3_S | 21 | 3-bit integer quantization with scale |
| IQ4_NL | 20 | 4-bit non-linear quantization |
| IQ4_XS | 23 | 4-bit extra-small quantization |

Ternary Weight Encoding

For BitNet models using I2_S quantization, ternary weights are packed at 4 values per byte using 2-bit encoding:

| Bit Pattern | Trit Value |
|-------------|------------|
| 00 | 0 |
| 01 | +1 |
| 10 | -1 |
| 11 | 0 (unused) |

At inference time, this encoding is decoded through a lookup table (TRIT_LUT), enabling efficient unpacking inside the ternary matrix-vector multiply operation.
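The 2-bit decoding above can be sketched as a small lookup-table unpacker. The TRIT_LUT name comes from the text; the exact table layout and the least-significant-pair-first bit order within each byte are assumptions of this sketch:

```python
# 2-bit ternary decoding per the table above; 0b11 is unused.
TRIT_LUT = {0b00: 0, 0b01: +1, 0b10: -1, 0b11: 0}

def unpack_trits(packed: bytes) -> list:
    """Unpack 4 trits per byte, least-significant 2-bit field first
    (bit order is an assumption of this sketch)."""
    out = []
    for byte in packed:
        for shift in (0, 2, 4, 6):       # four 2-bit fields per byte
            out.append(TRIT_LUT[(byte >> shift) & 0b11])
    return out

# 0b10_01_00_01: fields LSB-first are 01, 00, 01, 10 -> [+1, 0, +1, -1]
trits = unpack_trits(bytes([0b10010001]))
```

A production kernel would typically precompute a 256-entry byte-to-four-trits table instead of masking per field, but the mapping is the same.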

Model Architecture Metadata

Trinity reads the following architecture parameters from GGUF metadata to configure the inference engine:

| Metadata Key | Example Value | Purpose |
|--------------|---------------|---------|
| n_layers / num_hidden_layers | 30 | Number of transformer layers |
| n_heads / num_attention_heads | 20 | Number of query attention heads |
| n_kv_heads / num_key_value_heads | 5 | Number of key-value heads (for GQA) |
| n_embd / hidden_size | 2560 | Hidden dimension size |
| intermediate_size | 6912 | Feed-forward intermediate dimension |
| vocab_size | 128256 | Vocabulary size for embedding/output |
| max_position_embeddings | 4096 | Maximum sequence length |
| rms_norm_eps | 1e-5 | RMSNorm epsilon for numerical stability |
| rope_theta | 500000.0 | Rotary position embedding base frequency |
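Several of the keys above have aliases, so resolving a parameter means trying each name in turn. A sketch of that fallback logic (the helper name and the stand-in metadata dict are illustrative, not Trinity's API):

```python
def lookup(meta: dict, *keys, default=None):
    """Return the first present key's value; fall back to a default
    if given, otherwise raise."""
    for k in keys:
        if k in meta:
            return meta[k]
    if default is not None:
        return default
    raise KeyError(f"none of {keys} present in metadata")

# Stand-in for metadata parsed out of a GGUF file.
meta = {"num_hidden_layers": 30, "hidden_size": 2560}
n_layers = lookup(meta, "n_layers", "num_hidden_layers")
n_embd = lookup(meta, "n_embd", "hidden_size")
eps = lookup(meta, "rms_norm_eps", default=1e-5)
```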

RoPE (Rotary Position Embeddings)

The model uses Rotary Position Embeddings to encode token positions. The RoPE theta parameter (500000.0 for BitNet 2B) controls the frequency base for the sinusoidal position encoding. Higher theta values extend the effective context length. Trinity computes RoPE frequencies on-the-fly during the attention computation.
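The on-the-fly frequency computation can be sketched as follows: for a per-head dimension d, frequency i is theta ** (-2i/d), and token position m is rotated by angle m * freq_i in the (2i, 2i+1) plane. The function name and the head_dim value below are illustrative:

```python
# Rotation angles for one token position under standard RoPE.
def rope_angles(pos: int, head_dim: int, theta: float = 500000.0) -> list:
    """Angle for each of the head_dim // 2 rotation planes at `pos`."""
    return [pos * theta ** (-2.0 * i / head_dim)
            for i in range(head_dim // 2)]

# Position 7 with an assumed per-head dimension of 128.
angles = rope_angles(pos=7, head_dim=128, theta=500000.0)
```

Plane 0 always rotates by the raw position (frequency 1.0); higher planes rotate ever more slowly, which is what lets a larger theta stretch the usable context.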

Obtaining Compatible Models

BitNet GGUF models can be obtained from the Hugging Face model hub. Look for models specifically quantized with I2_S or TQ1_0 quantization types. Models originally published in the Hugging Face Transformers format can be converted to GGUF using the conversion tools provided by the llama.cpp project, with explicit support for BitNet ternary quantization.