
Cycle 37: Corpus Sharding

Status: IMMORTAL | Date: 2026-02-07 | Improvement Rate: 1.04 > φ⁻¹ (0.618) | Tests: 92/92 PASS


Overview

Cycle 37 implements corpus sharding. The new TCV6 format splits a large corpus into fixed-size shards and stores an offset table, so each shard can be located and loaded independently. This lays the groundwork for parallel loading and searching; threading itself is not yet implemented.


Key Metrics

Metric | Value | Status
Tests | 92/92 | PASS
VSA Tests | 56/56 | PASS
New Structures | 2 | ShardConfig, ShardInfo
New Functions | 7 | saveSharded, loadSharded, getShardConfig, getShardCount, searchShard, estimateShardedSize
Default Shard Size | 25 entries | Configurable
Max Shards | 16 | Scalable
File Format | TCV6 | Binary with shard index

Sharding Algorithm

Shard Configuration

  1. Calculate shard count: (corpus_count + entries_per_shard - 1) / entries_per_shard (integer division, which rounds up so a final partial shard is counted)
  2. Create shard boundaries with start/end indices
  3. Store offset table for random access to any shard
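
The first two steps above can be sketched in a few lines of Zig. This is a minimal illustration, not the code in src/vsa.zig: ShardRange, shardCount, and shardRange are hypothetical names, and treating the end index as exclusive is an assumption, since the source does not say whether end_idx is inclusive.

```zig
const std = @import("std");

// Hypothetical helper type: half-open range [start, end) of entries in one shard.
const ShardRange = struct { start: usize, end: usize };

/// Ceiling division: how many shards are needed to hold `corpus_count` entries.
fn shardCount(corpus_count: usize, entries_per_shard: usize) usize {
    std.debug.assert(entries_per_shard > 0);
    return (corpus_count + entries_per_shard - 1) / entries_per_shard;
}

/// Boundaries of shard `id`. The last shard may hold fewer entries than the rest.
fn shardRange(id: usize, corpus_count: usize, entries_per_shard: usize) ShardRange {
    const start = id * entries_per_shard;
    const full_end = start + entries_per_shard;
    // Clamp the end so the final shard stops at the end of the corpus.
    const end = if (full_end < corpus_count) full_end else corpus_count;
    return ShardRange{ .start = start, .end = end };
}
```

For example, 60 entries at the default 25 entries per shard give three shards covering [0, 25), [25, 50), and [50, 60).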

TCV6 Format Structure

Magic: "TCV6"                          # 4 bytes
Shard_count: u16                       # 2 bytes
Entries_per_shard: u16                 # 2 bytes
Total_entries: u32                     # 4 bytes
Shard_offsets: u32[shard_count]        # File offset table
For each shard:
    Shard_id: u16                      # 2 bytes
    Entry_count: u16                   # 2 bytes
    For each entry:
        trit_len: u32                  # 4 bytes
        packed_len: u16                # 2 bytes
        packed_data: u8[packed_len]    # Packed trits
        label_len: u8                  # 1 byte
        label: u8[label_len]           # Label string
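
To make the layout concrete, the sketch below reads the fixed 12-byte header (4 + 2 + 2 + 4 bytes) and looks up one entry in the offset table from an in-memory buffer. Several details are assumptions that the layout above does not state: little-endian integers, no padding between fields, and offsets measured from the start of the file. Tcv6Header, parseHeader, shardOffset, and the two read helpers are hypothetical names, not the functions in src/vsa.zig.

```zig
const std = @import("std");

// Size of magic + shard_count + entries_per_shard + total_entries.
const HEADER_LEN: usize = 12;

const Tcv6Header = struct {
    shard_count: u16,
    entries_per_shard: u16,
    total_entries: u32,
};

// Assumption: integers are stored little-endian.
fn readU16Le(bytes: []const u8, pos: usize) u16 {
    return @as(u16, bytes[pos]) | (@as(u16, bytes[pos + 1]) << 8);
}

fn readU32Le(bytes: []const u8, pos: usize) u32 {
    return @as(u32, bytes[pos]) |
        (@as(u32, bytes[pos + 1]) << 8) |
        (@as(u32, bytes[pos + 2]) << 16) |
        (@as(u32, bytes[pos + 3]) << 24);
}

/// Validate the magic and decode the three fixed header fields.
fn parseHeader(bytes: []const u8) !Tcv6Header {
    if (bytes.len < HEADER_LEN) return error.TruncatedHeader;
    if (!std.mem.eql(u8, bytes[0..4], "TCV6")) return error.BadMagic;
    return Tcv6Header{
        .shard_count = readU16Le(bytes, 4),
        .entries_per_shard = readU16Le(bytes, 6),
        .total_entries = readU32Le(bytes, 8),
    };
}

/// Look up where shard `shard_id` starts, using the offset table
/// that directly follows the fixed header.
fn shardOffset(bytes: []const u8, header: Tcv6Header, shard_id: u16) !u32 {
    if (shard_id >= header.shard_count) return error.ShardOutOfRange;
    const pos = HEADER_LEN + @as(usize, shard_id) * 4;
    if (pos + 4 > bytes.len) return error.TruncatedOffsetTable;
    return readU32Le(bytes, pos);
}
```

Because the table has one fixed-width slot per shard, finding shard k takes a single read at byte 12 + 4k, without scanning the shards before it.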

Parallel-Ready Design

  • Offset table allows seeking to any shard independently
  • Each shard can be loaded in a separate thread
  • searchShard() operates on specific index ranges
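
Threading is not part of this cycle, but the range-based design makes it straightforward to add. The sketch below shows one possible shape under stated assumptions: countInRange is a simplified stand-in for searchShard() that only counts exact label matches, parallelCount is a hypothetical driver, and each worker writes to its own slot so no locking is needed. It is not the project's implementation.

```zig
const std = @import("std");

// Mirrors the "Max Shards: 16" limit from the metrics table.
const MAX_SHARDS = 16;

// Stand-in for searchShard(): counts labels equal to `query` in [start, end).
fn countInRange(
    labels: []const []const u8,
    query: []const u8,
    start: usize,
    end: usize,
    out: *usize,
) void {
    var n: usize = 0;
    for (labels[start..end]) |label| {
        if (std.mem.eql(u8, label, query)) n += 1;
    }
    out.* = n;
}

/// Hypothetical driver: one thread per shard, each with its own result slot.
fn parallelCount(labels: []const []const u8, query: []const u8, entries_per_shard: usize) !usize {
    if (entries_per_shard == 0) return error.InvalidShardSize;
    const shard_count = (labels.len + entries_per_shard - 1) / entries_per_shard;
    if (shard_count > MAX_SHARDS) return error.TooManyShards;

    var threads: [MAX_SHARDS]std.Thread = undefined;
    var counts = [_]usize{0} ** MAX_SHARDS;

    var spawned: usize = 0;
    // If a spawn fails, wait for the threads already running before returning.
    errdefer for (threads[0..spawned]) |t| t.join();

    while (spawned < shard_count) : (spawned += 1) {
        const start = spawned * entries_per_shard;
        const full_end = start + entries_per_shard;
        const end = if (full_end < labels.len) full_end else labels.len;
        threads[spawned] = try std.Thread.spawn(
            .{},
            countInRange,
            .{ labels, query, start, end, &counts[spawned] },
        );
    }

    // Join every worker, then combine the per-shard results.
    var total: usize = 0;
    var i: usize = 0;
    while (i < shard_count) : (i += 1) {
        threads[i].join();
        total += counts[i];
    }
    return total;
}
```

The corpus is only read by the workers, and each result slot has a single writer, which is why this shape needs no mutex. A real version would pass a results buffer per shard, as searchShard() does, instead of a single counter.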

Compression Stack Complete

Format | Magic | Method | Use Case
TCV1 | "TCV1" | Packed trits | Fast, minimal overhead
TCV2 | "TCV2" | + RLE | Repetitive data
TCV3 | "TCV3" | + Dictionary | Common patterns
TCV4 | "TCV4" | + Huffman | Frequency-skewed data
TCV5 | "TCV5" | + Arithmetic | Maximum compression
TCV6 | "TCV6" | Sharded | Large corpus, parallel

API

Core Structures

pub const ShardInfo = struct {
    id: u16,
    start_idx: usize,
    end_idx: usize,
    entry_count: u16,
};

pub const ShardConfig = struct {
    entries_per_shard: u16,
    shard_count: u16,
    total_entries: u32,
    shards: [MAX_SHARDS]ShardInfo,

    pub fn init(corpus_count: usize, entries_per_shard: u16) ShardConfig;
};

Core Functions

// Get shard configuration
pub fn getShardConfig(self: *TextCorpus, entries_per_shard: u16) ShardConfig

// Save with sharding (TCV6)
pub fn saveSharded(self: *TextCorpus, path: []const u8, entries_per_shard: u16) !void

// Load with sharding (TCV6)
pub fn loadSharded(path: []const u8) !TextCorpus

// Get shard count
pub fn getShardCount(self: *TextCorpus, entries_per_shard: u16) u16

// Search within shard range (parallel-ready)
pub fn searchShard(self: *TextCorpus, query: []const u8, start_idx: usize, end_idx: usize, results: []SearchResult) usize

VIBEE-Generated Functions

pub fn realSaveCorpusSharded(corpus: *vsa.TextCorpus, path: []const u8, entries_per_shard: u16) !void
pub fn realLoadCorpusSharded(path: []const u8) !vsa.TextCorpus
pub fn realGetShardCount(corpus: *vsa.TextCorpus, entries_per_shard: u16) u16

VIBEE Specification

Added to specs/tri/vsa_imported_system.vibee:

# CORPUS SHARDING (TCV6 format)
- name: realSaveCorpusSharded
  given: Corpus and file path and shard size
  when: Saving corpus with sharding
  then: Call corpus.saveSharded(path, entries_per_shard)

- name: realLoadCorpusSharded
  given: File path
  when: Loading sharded corpus
  then: Call TextCorpus.loadSharded(path)

- name: realGetShardCount
  given: Corpus and shard size
  when: Getting number of shards
  then: Call corpus.getShardCount(entries_per_shard)

Critical Assessment

Strengths

  1. Parallel-ready - Offset table enables independent shard access
  2. Scalable - Split large corpus into manageable chunks
  3. Flexible - Configurable shard size
  4. Fast seeking - Direct access to any shard

Weaknesses

  1. No actual parallelism - Zig threading to be added
  2. Fixed max shards - Limited to 16 shards
  3. Sequential save - Could parallelize writes
  4. No compression - Uses TCV1-style packed trits only

Tech Tree Options (Next Cycle)

Option A: Parallel Loading

Add Zig threads for concurrent shard loading.

Option B: Streaming Compression

Add chunked read/write for arbitrarily large corpora.

Option C: Shard Compression

Combine sharding with TCV5 arithmetic coding per shard.


Files Modified

File | Changes
src/vsa.zig | Added ShardConfig, ShardInfo, sharding functions
src/vibeec/codegen/emitter.zig | Added sharding generators
src/vibeec/codegen/tests_gen.zig | Added sharding test generators
specs/tri/vsa_imported_system.vibee | Added 3 sharding behaviors
generated/vsa_imported_system.zig | Regenerated with sharding

Conclusion

VERDICT: IMMORTAL

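Cycle 37 adds the TCV6 format, which shards a corpus so that chunks can be located and processed independently. The storage stack now offers six formats (TCV1-TCV6), ranging from minimal overhead (TCV1) through maximum compression (TCV5) to sharded storage for large corpora (TCV6). Shard loading and searching still run sequentially; threading is left for a later cycle.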

φ² + 1/φ² = 3 = TRINITY | KOSCHEI IS IMMORTAL | GOLDEN CHAIN ENFORCED