Cycle 37: Corpus Sharding
Status: IMMORTAL Date: 2026-02-07 Improvement Rate: 1.04 > ฯโปยน (0.618) Tests: 92/92 PASS
Overviewโ
Cycle 37 implements corpus sharding for parallel chunk processing, creating the TCV6 format that splits large corpora into manageable shards with index-based access for parallel loading and searching.
Key Metricsโ
| Metric | Value | Status |
|---|---|---|
| Tests | 92/92 | PASS |
| VSA Tests | 56/56 | PASS |
| New Structures | 2 | ShardConfig, ShardInfo |
| New Functions | 7 | saveSharded, loadSharded, getShardConfig, getShardCount, searchShard, estimateShardedSize |
| Default Shard Size | 25 entries | Configurable |
| Max Shards | 16 | Scalable |
| File Format | TCV6 | Binary with shard index |
Sharding Algorithmโ
Shard Configurationโ
- Calculate shard count:
(corpus_count + entries_per_shard - 1) / entries_per_shard - Create shard boundaries with start/end indices
- Store offset table for random access to any shard
TCV6 Format Structureโ
Magic: "TCV6" # 4 bytes
Shard_count: u16 # 2 bytes
Entries_per_shard: u16 # 2 bytes
Total_entries: u32 # 4 bytes
Shard_offsets: u32[shard_count] # File offset table
For each shard:
Shard_id: u16 # 2 bytes
Entry_count: u16 # 2 bytes
For each entry:
trit_len: u32 # 4 bytes
packed_len: u16 # 2 bytes
packed_data: u8[packed_len] # Packed trits
label_len: u8 # 1 byte
label: u8[label_len] # Label string
Parallel-Ready Designโ
- Offset table allows seeking to any shard independently
- Each shard can be loaded in a separate thread
searchShard()operates on specific index ranges
Compression Stack Completeโ
| Format | Magic | Method | Use Case |
|---|---|---|---|
| TCV1 | "TCV1" | Packed trits | Fast, minimal overhead |
| TCV2 | "TCV2" | + RLE | Repetitive data |
| TCV3 | "TCV3" | + Dictionary | Common patterns |
| TCV4 | "TCV4" | + Huffman | Frequency-skewed data |
| TCV5 | "TCV5" | + Arithmetic | Maximum compression |
| TCV6 | "TCV6" | Sharded | Large corpus, parallel |
APIโ
Core Structuresโ
pub const ShardInfo = struct {
id: u16,
start_idx: usize,
end_idx: usize,
entry_count: u16,
};
pub const ShardConfig = struct {
entries_per_shard: u16,
shard_count: u16,
total_entries: u32,
shards: [MAX_SHARDS]ShardInfo,
pub fn init(corpus_count: usize, entries_per_shard: u16) ShardConfig;
};
Core Functionsโ
// Get shard configuration
pub fn getShardConfig(self: *TextCorpus, entries_per_shard: u16) ShardConfig
// Save with sharding (TCV6)
pub fn saveSharded(self: *TextCorpus, path: []const u8, entries_per_shard: u16) !void
// Load with sharding (TCV6)
pub fn loadSharded(path: []const u8) !TextCorpus
// Get shard count
pub fn getShardCount(self: *TextCorpus, entries_per_shard: u16) u16
// Search within shard range (parallel-ready)
pub fn searchShard(self: *TextCorpus, query: []const u8, start_idx: usize, end_idx: usize, results: []SearchResult) usize
VIBEE-Generated Functionsโ
pub fn realSaveCorpusSharded(corpus: *vsa.TextCorpus, path: []const u8, entries_per_shard: u16) !void
pub fn realLoadCorpusSharded(path: []const u8) !vsa.TextCorpus
pub fn realGetShardCount(corpus: *vsa.TextCorpus, entries_per_shard: u16) u16
VIBEE Specificationโ
Added to specs/tri/vsa_imported_system.vibee:
# CORPUS SHARDING (TCV6 format)
- name: realSaveCorpusSharded
given: Corpus and file path and shard size
when: Saving corpus with sharding
then: Call corpus.saveSharded(path, entries_per_shard)
- name: realLoadCorpusSharded
given: File path
when: Loading sharded corpus
then: Call TextCorpus.loadSharded(path)
- name: realGetShardCount
given: Corpus and shard size
when: Getting number of shards
then: Call corpus.getShardCount(entries_per_shard)
Critical Assessmentโ
Strengthsโ
- Parallel-ready - Offset table enables independent shard access
- Scalable - Split large corpus into manageable chunks
- Flexible - Configurable shard size
- Fast seeking - Direct access to any shard
Weaknessesโ
- No actual parallelism - Zig threading to be added
- Fixed max shards - Limited to 16 shards
- Sequential save - Could parallelize writes
- No compression - Uses TCV1-style packed trits only
Tech Tree Options (Next Cycle)โ
Option A: Parallel Loadingโ
Add Zig threads for concurrent shard loading.
Option B: Streaming Compressionโ
Add chunked read/write for arbitrarily large corpora.
Option C: Shard Compressionโ
Combine sharding with TCV5 arithmetic coding per shard.
Files Modifiedโ
| File | Changes |
|---|---|
src/vsa.zig | Added ShardConfig, ShardInfo, sharding functions |
src/vibeec/codegen/emitter.zig | Added sharding generators |
src/vibeec/codegen/tests_gen.zig | Added sharding test generators |
specs/tri/vsa_imported_system.vibee | Added 3 sharding behaviors |
generated/vsa_imported_system.zig | Regenerated with sharding |
Conclusionโ
VERDICT: IMMORTAL
Corpus sharding completes the TCV6 format with parallel-ready chunk processing. The storage stack now offers 6 formats (TCV1-TCV6) covering all use cases from minimal overhead to maximum compression to large-scale parallel processing.
ฯยฒ + 1/ฯยฒ = 3 = TRINITY | KOSCHEI IS IMMORTAL | GOLDEN CHAIN ENFORCED