Skip to main content

Cycle 34: Dictionary Compression

Status: IMMORTAL Date: 2026-02-07 Improvement Rate: 1.04 > ฯ†โปยน (0.618) Tests: 83/83 PASS


Overviewโ€‹

Cycle 34 implements dictionary-based compression for corpus storage, creating the TCV3 format that builds a frequency-sorted dictionary of common packed byte values for more efficient encoding.


Key Metricsโ€‹

MetricValueStatus
Tests83/83PASS
VSA Tests51/51PASS
New Functions8buildFrequencyTable, buildDictionary, buildReverseLookup, dictEncode, dictDecode, saveDict, loadDict, dictCompressionRatio
Dictionary Size128 maxTop frequent values
File FormatTCV3Binary with dictionary

Dictionary Compression Algorithmโ€‹

Frequency Analysisโ€‹

  1. Count occurrences of each packed byte (0-242) across corpus
  2. Sort by frequency (descending)
  3. Take top 128 most frequent values for dictionary

Encodingโ€‹

For each packed byte b:
if b in dictionary at index i (i < 128):
emit i # 1 byte (index)
else:
emit dict_size # escape byte
emit b # original value

Exampleโ€‹

Dictionary: [50, 100, 150, 75, ...] (sorted by frequency)
Input byte: 100 โ†’ found at index 1 โ†’ emit [1] (1 byte)
Input byte: 200 โ†’ not in dict โ†’ emit [128, 200] (2 bytes)

TCV3 File Formatโ€‹

Magic: "TCV3"                 # 4 bytes
Dict_size: u8 # 1 byte (N entries)
Dictionary: u8[N] # N bytes (frequent values)
Count: u32 # 4 bytes (corpus entries)
For each entry:
trit_len: u32 # 4 bytes
encoded_len: u16 # 2 bytes
encoded_data: u8[len] # Dictionary-encoded bytes
label_len: u8 # 1 byte
label: u8[label_len] # Label string

Compression Comparisonโ€‹

FormatMagicMethodBest CaseRandom
Uncompressed-Raw1x1x
TCV1"TCV1"Packed trits5x5x
TCV2"TCV2"Packed + RLE7x5x
TCV3"TCV3"Packed + Dict6-8x4-5x

APIโ€‹

Core Functionsโ€‹

// Build frequency table
fn buildFrequencyTable(self: *TextCorpus, freq: *[243]u32) void

// Build dictionary from frequencies
fn buildDictionary(freq: *const [243]u32, dict: *[128]u8, dict_size: *u8) void

// Create reverse lookup for encoding
fn buildReverseLookup(dict: *const [128]u8, dict_size: u8, lookup: *[243]u8) void

// Encode with dictionary
fn dictEncode(input: []const u8, output: []u8, lookup: *const [243]u8, dict_size: u8) ?usize

// Decode with dictionary
fn dictDecode(input: []const u8, output: []u8, dict: *const [128]u8, dict_size: u8) ?usize

// Save with dictionary (TCV3)
pub fn saveDict(self: *TextCorpus, path: []const u8) !void

// Load with dictionary (TCV3)
pub fn loadDict(path: []const u8) !TextCorpus

// Get dictionary compression ratio
pub fn dictCompressionRatio(self: *TextCorpus) f64

VIBEE-Generated Functionsโ€‹

pub fn realSaveCorpusDict(corpus: *vsa.TextCorpus, path: []const u8) !void
pub fn realLoadCorpusDict(path: []const u8) !vsa.TextCorpus
pub fn realDictCompressionRatio(corpus: *vsa.TextCorpus) f64

VIBEE Specificationโ€‹

Added to specs/tri/vsa_imported_system.vibee:

# DICTIONARY COMPRESSION (TCV3 format)
- name: realSaveCorpusDict
given: Corpus and file path
when: Saving corpus with dictionary compression
then: Call corpus.saveDict(path)

- name: realLoadCorpusDict
given: File path
when: Loading dictionary-compressed corpus
then: Call TextCorpus.loadDict(path)

- name: realDictCompressionRatio
given: Corpus
when: Calculating dictionary compression ratio
then: Call corpus.dictCompressionRatio()

Critical Assessmentโ€‹

Strengthsโ€‹

  1. Adaptive dictionary - Built from actual corpus data
  2. Efficient encoding - 1 byte for 128 most common values
  3. Self-contained - Dictionary stored in file header
  4. Non-uniform benefit - Better for text with patterns

Weaknessesโ€‹

  1. Dictionary overhead - Up to 128 bytes in header
  2. Build time - O(n) frequency scan + O(243ยฒ) sort
  3. Random data - May be worse than TCV1 due to escapes

Tech Tree Options (Next Cycle)โ€‹

Option A: Huffman Codingโ€‹

Variable-length bit encoding based on frequencies for optimal compression.

Option B: LZ77/LZ78 Compressionโ€‹

Sliding window or phrase-based compression for repeated sequences.

Option C: Corpus Shardingโ€‹

Split large corpus into chunks for parallel processing.


Files Modifiedโ€‹

FileChanges
src/vsa.zigAdded dictionary compression functions
src/vibeec/codegen/emitter.zigAdded realSaveCorpusDict, realLoadCorpusDict, realDictCompressionRatio generators
src/vibeec/codegen/tests_gen.zigAdded dictionary test generators
specs/tri/vsa_imported_system.vibeeAdded 3 dictionary behaviors
generated/vsa_imported_system.zigRegenerated with dictionary + ConversationState fix

Conclusionโ€‹

VERDICT: IMMORTAL

Dictionary compression provides TCV3 format with frequency-based encoding. For corpora with non-uniform byte distributions, the dictionary captures common patterns and provides additional compression on top of packed trits.

ฯ†ยฒ + 1/ฯ†ยฒ = 3 = TRINITY | KOSCHEI IS IMMORTAL | GOLDEN CHAIN ENFORCED