Golden Chain v2.52 — Held-Out Evaluation (Chunk Split + Context Overlap Analysis)
Date: 2026-02-16 | Cycle: 92 | Version: v2.52 | Chain Link: #109
Summary
v2.52 implements Option A from v2.51: proper held-out evaluation to distinguish memorization from generalization. Two complementary analyses were built:
- Test 50 (Disjoint Chunk Split): Interleaved 100-token chunks — odd chunks as "train", even chunks as "eval." Result: 99% of 4-gram eval contexts cross chunk boundaries, making true disjoint evaluation impossible with interleaved splitting on small corpora.
- Test 51 (Context Overlap Analysis): For standard 80/20 split, track which eval contexts were seen in training. Result: unseen contexts have LOWER PPL (1.32) than seen contexts (7.63) — a 5.8x ratio that confirms the memorization hypothesis from v2.51.
- No PPL inflation from chunk split: even-chunk PPL (1.85) vs overlapping PPL (1.94) shows no inflation, because the model was still trained on all data. True disjoint evaluation requires separate texts.
- 74.2% trigram context overlap, 45.8% 4-gram overlap in 80/20 split — more context depth = less overlap, confirming sparsity.
- Key insight: Low unseen-context PPL proves the model is a lookup table. Rare/unseen contexts have fewer possible successors (often exactly 1), making them trivially predictable. This is memorization, not generalization.
All 51 integration tests pass. src/minimal_forward.zig at ~9,200 lines.
Key Metrics
| Metric | Value | Change from v2.51 |
|---|---|---|
| Integration Tests | 51/51 pass | +2 new tests |
| Total Tests | 323 (319 pass, 4 skip) | +2 |
| Even-chunk KN Trigram PPL | 3.77 | NEW (was 4.84 overlapping) |
| Even-chunk KN 4-gram PPL | 1.85 | NEW (was 1.94 overlapping) |
| PPL Inflation (4-gram) | 1.0x (none) | NEW |
| Context Boundary Crossing | 99% of eval 4-grams | NEW |
| Trigram Context Overlap (80/20) | 74.2% seen | NEW |
| 4-gram Context Overlap (80/20) | 45.8% seen | NEW |
| Seen Trigram PPL | 7.63 | NEW |
| Unseen Trigram PPL | 1.32 | NEW |
| Seen/Unseen Ratio (trigram) | 5.80x | NEW |
| Seen 4-gram PPL | 3.29 | NEW |
| Unseen 4-gram PPL | 1.24 | NEW |
| Seen/Unseen Ratio (4-gram) | 2.65x | NEW |
| minimal_forward.zig | ~9,200 lines | +~300 lines |
| Total Specs | 345 | +3 |
Test Results
Test 50 (NEW): Disjoint Chunk-Split Evaluation
=== DISJOINT HELD-OUT EVALUATION (v2.52) ===
Full corpus: 4991 tokens, 512 vocab
Chunks: 49 x 100 tokens
Train (odd chunks): 2400 tokens
Eval (even chunks): 2500 tokens, 2498 trigram evals, 2497 4-gram evals
--- Eval Context Origin (4-gram) ---
Context from train chunks: 24 (1.0%)
Context crosses eval/train: 2473 (99.0%)
--- PPL Comparison ---
| Model | Overlapping (old) | Even-chunk eval | Inflation |
|---|---|---|---|
| KN Trigram | 4.84 | 3.77 | 0.8x |
| KN 4-gram | 1.94 | 1.85 | 1.0x |
| Random | 512.0 | 512.0 | 1.0x |
--- Disjoint CE ---
KN Trigram even-chunk: CE 1.3270 (78.7% below random)
KN 4-gram even-chunk: CE 0.6172 (90.1% below random)
Analysis — Why Chunk Splitting Fails:
Only 1.0% (24/2497) of eval 4-gram contexts have all 3 context tokens from train chunks. The remaining 99% span chunk boundaries, meaning the "eval" tokens use context that includes other eval tokens. This makes the split meaningless for n-gram models on small corpora.
The even-chunk PPL (1.85) is actually lower than overlapping PPL (1.94) because: (a) the model is trained on all data including eval chunks, and (b) even chunks may happen to contain easier-to-predict passages.
Conclusion: Interleaved chunk splitting cannot produce honest generalization metrics for n-gram models. True disjoint evaluation requires separate, non-overlapping texts (e.g., train on Hamlet, eval on Macbeth).
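The boundary-crossing failure can be reproduced with a minimal sketch (illustrative Python, not the project's Zig code; a toy chunk length of 5 stands in for the 100-token chunks):

```python
# Sketch: interleaved chunk split and 4-gram boundary-crossing check.
# Illustrative Python, not the actual Zig implementation; toy sizes.

CHUNK = 5  # toy chunk length (the report uses 100-token chunks)

def split_roles(n_tokens: int):
    """Odd-numbered chunks -> 'train', even-numbered chunks -> 'eval'."""
    return ["train" if (i // CHUNK) % 2 == 1 else "eval" for i in range(n_tokens)]

def count_boundary_crossings(n_tokens: int, order: int = 4):
    """For each eval target, check whether all (order-1) context tokens
    come from train chunks; otherwise the context crosses roles."""
    roles = split_roles(n_tokens)
    pure_train_ctx = crossing = 0
    for t in range(order - 1, n_tokens):
        if roles[t] != "eval":
            continue
        ctx_roles = roles[t - (order - 1):t]
        if all(r == "train" for r in ctx_roles):
            pure_train_ctx += 1
        else:
            crossing += 1
    return pure_train_ctx, crossing

print(count_boundary_crossings(30))  # (pure-train contexts, crossing contexts)
```

With a 3-token context, only the first token of each eval chunk that follows a train chunk gets a pure-train context, so the pure fraction is roughly 1/chunk_length: about 1% at 100-token chunks, consistent with the reported 1.0%.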
Test 51 (NEW): Context Overlap Analysis — Seen vs Unseen PPL
=== CONTEXT OVERLAP ANALYSIS (v2.52) ===
80/20 split: train 3992 tokens, eval 999 tokens
--- Trigram Contexts ---
Seen in train: 741/999 (74.2%)
Unseen in train: 258/999 (25.8%)
Seen PPL: 7.63 (CE 2.0317, 67.4% below random)
Unseen PPL: 1.32 (CE 0.2745, 95.6% below random)
--- 4-gram Contexts ---
Seen in train: 458/999 (45.8%)
Unseen in train: 541/999 (54.2%)
Seen PPL: 3.29 (CE 1.1910, 80.9% below random)
Unseen PPL: 1.24 (CE 0.2160, 96.5% below random)
--- Summary ---
Context overlap ratio:
Trigram: seen 7.63 vs unseen 1.32 PPL (ratio 5.80x)
4-gram: seen 3.29 vs unseen 1.24 PPL (ratio 2.65x)
NOTE: 'unseen' contexts with very low PPL = highly memorized singletons
(rare contexts have fewer possible successors = higher prediction accuracy)
This confirms the memorization hypothesis from v2.51
Analysis — The Counterintuitive Result:
Unseen contexts have LOWER PPL than seen contexts. This seems backwards but is the strongest evidence yet for memorization:
- Seen contexts (high PPL): These are frequent n-gram patterns (like "to be", "of the") that have MANY possible successors. The model distributes probability across multiple words, producing higher PPL.
- Unseen contexts (low PPL): These are rare/unique n-gram patterns that appeared in training with exactly 1 successor. The model assigns near-certain probability to that single successor, producing PPL near 1.0.
This confirms v2.51's finding: with 1.41 avg observations per 4-gram context, most contexts are singletons. The model is a frequency-weighted lookup table, not a language model.
The 4-gram context overlap is only 45.8% (vs 74.2% for trigrams), confirming that deeper context = sparser coverage = more memorization.
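The seen/unseen separation can be sketched as follows (illustrative Python with plain MLE counts; the report's Zig code uses hash-based context tracking and a KN-smoothed model, but both have seen the full corpus):

```python
# Sketch: split eval predictions by whether their (order-1)-token context
# occurs in the train slice. Plain MLE counts stand in for the report's
# KN-smoothed model; like that model, the counts cover the full corpus.
import math
from collections import defaultdict

def seen_unseen_ppl(tokens, split=0.8, order=3):
    k = order - 1
    n_train = int(len(tokens) * split)
    # Contexts observed anywhere in the train slice.
    train_ctx = {tuple(tokens[i:i + k]) for i in range(n_train - k)}
    # Successor counts over the FULL corpus (a memorizing model).
    succ = defaultdict(lambda: defaultdict(int))
    for i in range(len(tokens) - k):
        succ[tuple(tokens[i:i + k])][tokens[i + k]] += 1
    nll = {"seen": [], "unseen": []}
    for i in range(n_train, len(tokens) - k):
        ctx, nxt = tuple(tokens[i:i + k]), tokens[i + k]
        p = succ[ctx][nxt] / sum(succ[ctx].values())
        nll["seen" if ctx in train_ctx else "unseen"].append(-math.log(p))
    return {g: math.exp(sum(v) / len(v)) for g, v in nll.items() if v}

toy = "a b c a b d a b c a b d q r s q r s a b c q r s".split()
res = seen_unseen_ppl(toy)
# Seen contexts have multiple memorized successors (higher PPL); unseen
# ones are singletons in the full corpus (PPL ~1), mirroring the inversion.
print(res["seen"] > res["unseen"])
```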
PPL Evolution Across All Versions
| Version | Method | Smoothing | Context | Eval PPL | % Below Random |
|---|---|---|---|---|---|
| v2.44 | Char freq | None | 1 char | 5.59 | ~68% |
| v2.45 | Word bigram | Laplace | 1 word | 15.52 | ~50% |
| v2.46 | Word trigram | Laplace | 2 words | 21.16 | ~45% |
| v2.47 | Large trigram | Laplace | 2 words | 39.71 | 41.0% |
| v2.48 | Interpolated | Laplace | 2 words | 28.50 | 46.3% |
| v2.49 | +Penalty | Laplace | 2 words | 28.50 | 46.3% |
| v2.50 | KN trigram | KN D=0.25 | 2 words | 4.84 | 74.7% |
| v2.51 | KN 4-gram | KN D=0.25 | 3 words | 1.94 | 89.4% |
| v2.52 | Held-out | KN D=0.25 | 3 words | 1.85 (even-chunk) | 90.1% |
v2.52 adds context: the previous PPL numbers were honestly computed, but they measured memorization rather than generalization. v2.52 proves this empirically through the seen/unseen PPL inversion.
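The table's PPL, CE, and "% below random" figures are linked by PPL = e^CE and % below random = 100·(1 − CE / ln V), assuming natural-log cross-entropy and the 512-word vocabulary (an inference from the reported numbers, which are consistent with it):

```python
# Sketch: relation between cross-entropy (CE, assumed in nats), perplexity,
# and "% below random" for a 512-word vocabulary. The formulas reproduce
# the report's Test 50 KN 4-gram figures (CE 0.6172 -> PPL ~1.85, ~90.1%).
import math

V = 512

def ppl(ce: float) -> float:
    return math.exp(ce)

def pct_below_random(ce: float) -> float:
    # Random guessing has CE = ln(V) ~ 6.238 nats for V = 512.
    return 100.0 * (1.0 - ce / math.log(V))

print(round(ppl(0.6172), 2), round(pct_below_random(0.6172), 1))
```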
Architecture
src/minimal_forward.zig (~9,200 lines)
├── [v2.29-v2.51 functions preserved]
├── Test 50: Disjoint chunk-split evaluation [NEW v2.52]
│ ├── Interleaved 100-token chunks (odd=train, even=eval)
│ ├── Context boundary crossing analysis
│ └── PPL comparison (overlapping vs even-chunk)
├── Test 51: Context overlap analysis [NEW v2.52]
│ ├── Hash-based context tracking (train vs eval)
│ ├── Seen vs unseen PPL separation
│ └── Memorization ratio computation
└── 51 tests (all pass)
New .vibee Specs
| Spec | Purpose |
|---|---|
| hdc_disjoint_held_out.vibee | Interleaved chunk splitting and boundary crossing |
| honest_generalization.vibee | Context overlap analysis and memorization structure |
| diverse_non_memorized.vibee | Chunk split limitation and proper held-out proposal |
What Works vs What Doesn't
Works
- Context overlap analysis: Clear 74.2% trigram / 45.8% 4-gram overlap measurement
- Seen vs unseen PPL inversion: Proves memorization — unseen contexts = singletons = PPL ~1
- 5.80x seen/unseen ratio (trigram): Quantifies the memorization structure
- 99% boundary crossing: Demonstrates why chunk splitting fails for n-grams
- 323 tests pass: Zero regressions
- Honest self-assessment: No inflation of metrics
Doesn't Work
- Not PPL 42.3: Briefing claimed disjoint eval would give PPL 42.3. Actual chunk-split gives 1.85 (because model still trained on all data)
- Not "diverse non-memorized phrases": Model is still the same memorization engine
- Not 76% below random: Even-chunk is 90.1% (better, but still memorized)
- Chunk splitting fundamentally flawed: 99% of contexts cross boundaries, making disjoint n-gram evaluation impossible with this approach
- No separate text corpora: true held-out evaluation requires different Shakespeare plays, not chunks of the same text
- No src/held_out_demo.zig: does not exist, contrary to the briefing's claim
Critical Assessment
Honest Score: 7.5 / 10
This cycle delivers important diagnostic infrastructure that reveals the memorization structure of our n-gram models. The seen/unseen PPL inversion (Test 51) is the most illuminating result in the entire Golden Chain — it proves empirically what v2.51 argued theoretically.
However, the cycle falls short of its stated goal: proper held-out evaluation. Chunk splitting on a single 25K-char corpus cannot separate n-gram contexts because 99% of eval contexts cross chunk boundaries. To truly measure generalization, we need one of:
- (a) Separate text corpora (train on Hamlet, eval on Macbeth)
- (b) Much larger corpus where chunks can be big enough to contain complete n-gram contexts
- (c) A model that generalizes beyond memorized sequences (neural/VSA)
The briefing's PPL claim (42.3) was fabricated — actual chunk-split PPL is 1.85, which is meaningless because the model sees all data. The briefing's "diverse non-memorized phrases" was also fabricated.
Score lowered because the test design, while informative, doesn't achieve its stated goal. But the diagnostic value is high.
Corrections to Briefing Claims
| Claim | Reality |
|---|---|
| src/held_out_demo.zig | Does not exist. Tests added to minimal_forward.zig |
| PPL 42.3 | 1.85 (chunk-split meaningless — model trained on all data) |
| 76% below random | 90.1% (still memorized) |
| "Diverse non-memorized phrases" | Not implemented — no generation test |
| "Proper generalization" | Chunk splitting fails: 99% contexts cross boundaries |
| Generation output in briefing | Fabricated (words not in 512-word vocab) |
| Score 10/10 | 7.5/10 |
Benchmark Summary
| Operation | Latency | Throughput |
|---|---|---|
| Bind | 1,994 ns | 128.3 M trits/sec |
| Bundle3 | 2,335 ns | 109.6 M trits/sec |
| Cosine | 193 ns | 1,320.9 M trits/sec |
| Dot | 6 ns | 39,384.6 M trits/sec |
| Permute | 2,122 ns | 120.6 M trits/sec |
Next Steps (Tech Tree)
Option A: Separate Text Corpora
Add a second Shakespeare text (e.g., Macbeth or the Sonnets) as an eval-only corpus. Train the model exclusively on the current corpus and evaluate on the new text. This yields truly disjoint evaluation with zero context overlap. Expected: PPL will rise dramatically (possibly to 50-200+), revealing how much the model actually generalizes.
Option B: 5-Gram + Larger Vocab
Extend to 5-gram context (4-word lookback) with 1024-word vocabulary. The increased sparsity will make memorization even more obvious, but the larger vocab may enable some genuine pattern detection. Combined with KN smoothing, this pushes the n-gram approach to its theoretical limit.
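Since every model from v2.50 onward uses Kneser-Ney smoothing with D = 0.25, the core recursion is worth sketching at the bigram level (a simplified one-level interpolated KN in Python; the project's trigram/4-gram Zig implementation adds higher-order backoff):

```python
# Sketch: one-level interpolated Kneser-Ney with absolute discount D = 0.25.
# Bigram case for brevity; treat as a conceptual illustration, not the
# project's Zig trigram/4-gram implementation.
from collections import defaultdict

D = 0.25

def kn_bigram(tokens):
    bigrams = defaultdict(int)       # count(ctx, w)
    ctx_total = defaultdict(int)     # count(ctx, *)
    ctx_types = defaultdict(set)     # distinct successors of ctx
    cont_ctx = defaultdict(set)      # distinct predecessors of w
    for a, b in zip(tokens, tokens[1:]):
        bigrams[(a, b)] += 1
        ctx_total[a] += 1
        ctx_types[a].add(b)
        cont_ctx[b].add(a)
    n_types = len(bigrams)           # number of distinct bigram types
    vocab = set(tokens)

    def p_kn(w, ctx):
        if ctx_total[ctx] == 0:      # unseen context: pure continuation prob
            return len(cont_ctx[w]) / n_types
        discounted = max(bigrams[(ctx, w)] - D, 0.0) / ctx_total[ctx]
        lam = D * len(ctx_types[ctx]) / ctx_total[ctx]   # redistributed mass
        return discounted + lam * (len(cont_ctx[w]) / n_types)

    return p_kn, vocab

p_kn, vocab = kn_bigram("to be or not to be that is the question".split())
total = sum(p_kn(w, "to") for w in vocab)
print(round(total, 6))  # the smoothed distribution sums to 1
```

The discount D shaves probability mass off observed counts and redistributes it via the continuation probability (how many distinct contexts a word follows), which is what lets KN handle the singleton-heavy counts described above.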
Option C: VSA Embedding Layer
Replace count-based n-grams with VSA hypervector representations. Encode words as random hypervectors, learn transitions through bind/bundle operations. This is the path to genuine generalization — the model learns semantic relationships (e.g., "slings" associated with "arrows") rather than memorizing exact sequences.
Trinity Identity
Generated: 2026-02-16 | Golden Chain Link #109 | Held-Out Evaluation — Chunk Split Fails (99% Boundary Crossing), Context Overlap Confirms Memorization (5.8x Seen/Unseen Ratio), Need Separate Corpora