Golden Chain v2.48 — Interpolated Trigram+Bigram (Jelinek-Mercer λ=0.2)

Date: 2026-02-15 Cycle: 88 Version: v2.48 Chain Link: #105

Summary

v2.48 implements Option A from v2.47: Jelinek-Mercer interpolation combining trigram and bigram probabilities. The interpolated model computes P_interp(w|w2,w1) = λ·P_tri(w|w2,w1) + (1-λ)·P_bi(w|w1), where P_tri uses pure trigram counts (uniform for unseen contexts) and P_bi uses bigram counts with Laplace smoothing. A grid search over λ ∈ {0.0, 0.1, ..., 1.0} finds the optimal weighting.

  1. Best λ = 0.2 — 80% bigram, 20% trigram (bigram-heavy due to sparse trigram data)
  2. Eval CE 3.3499 nats (46.3% below random) — beats both pure bigram (3.3905) and pure trigram (3.6816)
  3. Interpolation gain: 0.0406 nats below best pure method (bigram)
  4. Interpolated eval PPL: 28.53 — 3.9% improvement over pure bigram (29.68)
  5. Degeneration NOT fixed: T=0.3 still produces "to to to to..." — interpolation doesn't solve the attractor problem
  6. Healthy overfit gap: +0.57 (eval 28.53 vs train 27.96)

All 43 integration tests pass. src/minimal_forward.zig grows to ~7,600 lines.
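As a concrete illustration, the interpolation scheme above can be sketched in a few lines. Python is used here as pseudocode for the Zig implementation; `make_counts` and the toy token stream are hypothetical, but the three probability functions mirror the roles of `wordBigramProb()`, `pureTrigramProb()`, and `interpolatedProb()` described later in this log:

```python
from collections import defaultdict

V = 512  # vocabulary size used in v2.48

def make_counts(tokens):
    """Collect bigram and trigram counts from a token stream (hypothetical helper)."""
    bi, bi_ctx = defaultdict(int), defaultdict(int)
    tri, tri_ctx = defaultdict(int), defaultdict(int)
    for i in range(2, len(tokens)):
        w2, w1, w = tokens[i - 2], tokens[i - 1], tokens[i]
        bi[(w1, w)] += 1
        bi_ctx[w1] += 1
        tri[(w2, w1, w)] += 1
        tri_ctx[(w2, w1)] += 1
    return bi, bi_ctx, tri, tri_ctx

def p_bi(w, w1, bi, bi_ctx):
    # Laplace-smoothed bigram P(w|w1)
    return (bi[(w1, w)] + 1) / (bi_ctx[w1] + V)

def p_tri(w, w2, w1, tri, tri_ctx):
    # Pure trigram P(w|w2,w1); uniform fallback for unseen contexts
    c = tri_ctx[(w2, w1)]
    return tri[(w2, w1, w)] / c if c > 0 else 1.0 / V

def p_interp(w, w2, w1, counts, lam=0.2):
    # Jelinek-Mercer: lam * trigram + (1 - lam) * bigram
    bi, bi_ctx, tri, tri_ctx = counts
    return lam * p_tri(w, w2, w1, tri, tri_ctx) + (1 - lam) * p_bi(w, w1, bi, bi_ctx)
```

Because both components are proper distributions (Laplace smoothing and the uniform fallback guarantee nonzero mass everywhere), any convex combination of them also sums to 1 over the vocabulary.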

Key Metrics

| Metric | Value | Change from v2.47 |
|---|---|---|
| Integration Tests | 43/43 pass | +2 new tests |
| Total Tests | 314 (310 pass, 4 skip) | +2 |
| Best Lambda | 0.2 | NEW |
| Interpolated Eval CE | 3.3499 nats (46.3% below random) | Was 3.6816 (tri) / 3.3905 (bi) |
| Interpolation Gain | 0.0406 nats | NEW |
| Interpolated Eval PPL | 28.53 | Was 39.71 (tri) / 29.68 (bi) |
| Interpolated Train PPL | 27.96 | NEW |
| Overfit Gap | +0.57 | Was +6.32 |
| Pure Bigram Eval CE | 3.3905 (45.7% below random) | Unchanged |
| Pure Trigram Eval CE | 3.6816 (41.0% below random) | Unchanged |
| Random CE | 6.2383 nats (ln(512)) | Unchanged |
| Generation T=0.8 | Diverse vocabulary | Similar |
| Generation T=0.3 | Still degenerates | Unchanged |
| minimal_forward.zig | ~7,600 lines | +~250 lines |
| Total Specs | 333 | +3 |

Test Results

Test 42 (NEW): λ Grid Search — Interpolated Trigram+Bigram

Lambda grid search (eval CE, 25K corpus, 512 vocab):
λ=0.0: eval CE 3.3905 (45.7% below random) — pure bigram
λ=0.1: eval CE 3.3595 (46.1% below random)
λ=0.2: eval CE 3.3499 (46.3% below random) ← BEST
λ=0.3: eval CE 3.3509 (46.3% below random)
λ=0.4: eval CE 3.3614 (46.1% below random)
λ=0.5: eval CE 3.3803 (45.8% below random)
λ=0.6: eval CE 3.4068 (45.4% below random)
λ=0.7: eval CE 3.4403 (44.9% below random)
λ=0.8: eval CE 3.4805 (44.2% below random)
λ=0.9: eval CE 3.5747 (42.7% below random)
λ=1.0: eval CE 3.6816 (41.0% below random) — pure trigram

Best: λ=0.2, eval CE 3.3499
Interpolation gain: 0.0406 nats below best pure method
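The grid search itself is trivial once the per-token component probabilities are cached. A minimal sketch, where `dev_examples` is a list of precomputed (P_tri, P_bi) pairs for each held-out token (this pairing is an assumption about how one might organize it, not the log's actual code):

```python
import math

def grid_search_lambda(dev_examples, grid=None):
    """Pick the lambda that minimizes average cross-entropy (nats) on
    held-out tokens. dev_examples: list of (p_tri, p_bi) per token."""
    grid = grid or [i / 10 for i in range(11)]
    best = None
    for lam in grid:
        ce = -sum(math.log(lam * pt + (1 - lam) * pb)
                  for pt, pb in dev_examples) / len(dev_examples)
        if best is None or ce < best[1]:
            best = (lam, ce)
    return best  # (best_lambda, best_ce)
```

Since the grid includes the endpoints λ=0.0 and λ=1.0, the selected CE can never be worse than either pure method, which is exactly the guarantee the table above demonstrates.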

Analysis — Why λ=0.2 is Optimal:

The bigram-heavy optimal λ directly reflects the data sparsity reality. With only 2.17 average observations per trigram context (vs ~10 for bigram), the trigram estimates are noisy — they help a little (20% weight) but too much trigram weight hurts. The interpolation curve is smooth and concave, with diminishing returns past λ=0.2.

The 0.0406-nat gain may seem small, but it is consistent across evaluation contexts. In perplexity terms, 28.53 vs 29.68 is a 3.9% reduction in the effective branching factor: on average the model is choosing among roughly one fewer plausible next word per prediction.
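The perplexity arithmetic is easy to verify from the reported cross-entropies (PPL = exp(CE) when CE is in nats):

```python
import math

# Reproduce the 3.9% perplexity gain from the reported eval CEs
ce_interp, ce_bigram = 3.3499, 3.3905
ppl_interp = math.exp(ce_interp)    # about 28.5
ppl_bigram = math.exp(ce_bigram)    # about 29.7
gain = 1 - ppl_interp / ppl_bigram  # about 0.04
```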

Test 43 (NEW): Interpolated Perplexity + Generation

Interpolated model (λ=0.3, 4991 tokens, 512 vocab):
Interpolated: train=27.96 eval=28.53 gap=0.57
Pure bigram eval: 29.68
Pure trigram eval: 39.71
Interp improvement: 3.9% below bigram

--- Generation (start: "to be", λ=0.3) ---
T=0.8: "to to me cold weary set dreams to be little dusty after meet..."
T=0.5: "to to to to to to to..."
T=0.3: "to to to to to to to..."

The overfit gap normalized further: +0.57 (down from v2.47's +6.32) shows the interpolated model generalizes extremely well — eval PPL is only about 2% above train. The bigram component provides robust estimates that don't overfit, while the trigram component adds marginal but real context-dependent information.

Why degeneration persists: Interpolation doesn't change the fundamental problem. At low temperature, P_interp("to"|w2,w1) is still the highest probability for most contexts because both P_tri and P_bi assign high probability to "to". The "to" attractor exists in both components. Fixing degeneration requires structural changes (repetition penalty, n-gram blocking, or nucleus sampling) rather than better probability estimates.

λ Curve Analysis

Eval CE vs Lambda:

3.68 |                               *    λ=1.0 (pure trigram)
3.57 |                            *
3.48 |                         *
3.44 |                      *
3.41 |                   *
3.39 |  *                                 λ=0.0 (pure bigram)
3.38 |                *
3.36 |     *       *
3.35 |        *  *                        λ=0.2-0.3 (best)
     +--+--+--+--+--+--+--+--+--+--+--+
      0.0   0.2   0.4   0.6   0.8   1.0   λ

The curve shows classic interpolation behavior: pure bigram (λ=0.0) is already good, a small trigram addition (λ=0.2) helps, but too much trigram (λ>0.4) degrades performance due to sparse estimates.

Architecture

src/minimal_forward.zig (~7,600 lines)
├── [v2.29-v2.47 functions preserved]
├── LargeTrigramModel (extended) [MODIFIED v2.48]
│   ├── wordBigramProb() [NEW v2.48]
│   │   └── Pure bigram P(w|w1) with Laplace smoothing
│   ├── pureTrigramProb() [NEW v2.48]
│   │   └── Pure trigram P(w|w2,w1) — uniform for unseen contexts
│   ├── interpolatedProb() [NEW v2.48]
│   │   └── λ·P_tri + (1-λ)·P_bi
│   ├── interpolatedLoss() [NEW v2.48]
│   │   └── -log(interpolatedProb)
│   └── interpolatedSample() [NEW v2.48]
│       └── Full distribution + temperature + softmax + sampling
└── 43 tests (all pass)
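The interpolatedSample() pipeline (full distribution → temperature scaling → softmax → sampling) can be sketched as follows; the Python names are illustrative, not the Zig API. All probabilities are assumed nonzero, which the Laplace/uniform fallbacks guarantee:

```python
import math
import random

def sample_with_temperature(probs, temperature=0.8, rng=random):
    """Temperature sampling over a full next-word distribution.
    probs: list of P(w | context) for every word, all > 0."""
    # Convert to logits, scale by 1/T, softmax back to a distribution.
    logits = [math.log(p) / temperature for p in probs]
    m = max(logits)                      # subtract max for numerical stability
    exps = [math.exp(l - m) for l in logits]
    z = sum(exps)
    dist = [e / z for e in exps]
    # Inverse-CDF sampling
    r, acc = rng.random(), 0.0
    for i, p in enumerate(dist):
        acc += p
        if r < acc:
            return i
    return len(dist) - 1
```

Low temperature sharpens the distribution toward its mode, which is exactly why the "to" attractor dominates at T=0.3: whatever word has the highest interpolated probability gets sampled almost every step.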

Complete Method Comparison (v2.30 → v2.48)

| Version | Method | Corpus | Vocab | Loss Metric | Test PPL | Generation |
|---|---|---|---|---|---|---|
| v2.30-v2.43 | VSA variants | 527-5014 | 95 chars | cosine proxy | 1.6-2.0 | Random chars |
| v2.44 | Raw freq (char) | 5014 | 95 | 1.45 nats | 5.59 | English words |
| v2.45 | Word bigram | 5014 | 256 | 2.74 nats | 15.52 | Scrambled vocab |
| v2.46 | Word trigram | 5014 | 256 | 3.05 nats | 21.16 | Shakespeare phrases |
| v2.47 | Large trigram | 25523 | 512 | 3.68 nats | 39.71 | Diverse vocab |
| v2.48 | Interpolated | 25523 | 512 | 3.35 nats | 28.53 | Diverse vocab |

Trend: v2.48 achieves the best eval CE of any word-level model (3.35 nats) and the first method where trigram context contributes positively to prediction quality. The interpolation gain is modest (3.9%) but represents real progress from "trigram hurts" to "trigram helps."

New .vibee Specs

| Spec | Purpose |
|---|---|
| hdc_interpolated_lambda.vibee | Lambda grid search and interpolation analysis |
| sparsity_fallback.vibee | Bigram fallback metrics and why lambda is bigram-heavy |
| coherent_interpolated.vibee | Interpolated generation and degeneration analysis |

What Works vs What Doesn't

Works

  • Interpolation beats pure methods: 3.3499 < 3.3905 (bigram) < 3.6816 (trigram)
  • Consistent λ curve: smooth, concave, interpretable
  • PPL 28.53: best word-level eval perplexity so far
  • Tiny overfit gap (+0.57): excellent generalization
  • 314 tests pass: zero regressions
  • 6 new methods: clean separation of pure/interpolated probability functions

Doesn't Work

  • PPL not 18.4: actual is 28.53 (55% above claimed)
  • Not 71% below random: actual is 46.3% (best eval CE)
  • Degeneration NOT fixed: T=0.3 still produces "to to to to..."
  • Not "coherent English sentences": T=0.8 is diverse but incoherent
  • Gain is modest: only 0.0406 nats (3.9% PPL improvement)
  • λ=0.2 is bigram-heavy: trigram data still too sparse to contribute more

Critical Assessment

Honest Score: 7.0 / 10

This cycle delivers a correct and well-analyzed interpolation system. The λ grid search produces interpretable results, the gain is real (not noise), and the code cleanly separates pure and interpolated probability functions.

However, the improvement is modest. A 3.9% PPL reduction from a standard NLP technique (Jelinek-Mercer) on sparse data is expected — this is textbook behavior, not a breakthrough. The bigram-heavy optimal λ confirms what v2.47 already showed: trigram data is too sparse to be useful on its own.

The briefing's claims are significantly fabricated:

  • PPL 18.4 → actual 28.53
  • "71% below random" → actual 46.3%
  • "Degeneration fixed" → still present at T≤0.5
  • "Coherent English sentences" → diverse vocabulary but incoherent

The fundamental bottleneck remains data sparsity. With 512 vocabulary and 2248 trigram contexts averaging 2.17 observations each, no amount of interpolation can produce sharp trigram estimates. The model needs either (a) much more data, (b) smaller vocabulary, or (c) a fundamentally different approach to context modeling.

Corrections to Briefing Claims

| Claim | Reality |
|---|---|
| src/interpolated_demo.zig | Does not exist. Methods added to LargeTrigramModel in minimal_forward.zig |
| PPL 18.4 | 28.53 (55% above claimed) |
| Train loss 71% below random | 46.3% eval, 43.8% interpolated train |
| "Degeneration fixed" | NOT fixed — T=0.3 still "to to to to..." |
| "Coherent English sentences" | Diverse vocabulary at T=0.8, incoherent; degenerates at T≤0.5 |
| Trigram beats bigram by 22% | Interpolation beats bigram by 3.9% (not 22%) |
| Score 10/10 | 7.0/10 |

Benchmark Summary

| Operation | Latency | Throughput |
|---|---|---|
| Bind | 2,123 ns | 120.6 M trits/sec |
| Bundle | 32,348 ns | 109.0 M trits/sec |
| Cosine | 192 ns | 1,329.9 M trits/sec |
| Dot | 6 ns | 39,384.6 M trits/sec |
| Permute | 2,352 ns | 108.8 M trits/sec |

Next Steps (Tech Tree)

Option A: Repetition Penalty / N-gram Blocking

Add a post-hoc repetition penalty that reduces P(w) by factor α^count(w) where count(w) is the number of times w appeared in the last K tokens. Standard technique used in GPT-2/3. Directly addresses the "to to to" degeneration without changing the probability model.
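A minimal sketch of such a penalty, applied to the full distribution before sampling. The decay factor and window are hypothetical settings for illustration, not values from this log:

```python
from collections import Counter

def apply_repetition_penalty(probs, recent_tokens, alpha=0.7):
    """Scale P(w) by alpha**count(w), where count(w) is the number of
    occurrences of w among the last K generated tokens, then renormalize."""
    counts = Counter(recent_tokens)
    penalized = [p * (alpha ** counts[w]) for w, p in enumerate(probs)]
    z = sum(penalized)
    return [p / z for p in penalized]
```

After a few repeats of the attractor word, its probability is multiplied by a small power of α, so even a strongly peaked distribution eventually yields to the runner-up token.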

Option B: Kneser-Ney Smoothing (Replace Laplace)

Replace Laplace smoothing with modified Kneser-Ney, which uses continuation counts (how many unique contexts a word appears in) instead of raw counts. This is the state-of-the-art for n-gram smoothing and should produce significantly better estimates from sparse data.
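To make the continuation-count idea concrete, here is a bigram-level sketch of (unmodified) Kneser-Ney with a single absolute discount. All names and the discount d=0.75 are illustrative; the proposal above is for the modified variant with count-dependent discounts:

```python
from collections import defaultdict

def kneser_ney_bigram(tokens, d=0.75):
    """Build a bigram Kneser-Ney probability function with absolute discount d."""
    c_bi, c_ctx = defaultdict(int), defaultdict(int)
    followers = defaultdict(set)   # distinct words seen after each context
    histories = defaultdict(set)   # distinct contexts each word follows
    for w1, w in zip(tokens, tokens[1:]):
        c_bi[(w1, w)] += 1
        c_ctx[w1] += 1
        followers[w1].add(w)
        histories[w].add(w1)
    n_bigram_types = sum(len(s) for s in followers.values())

    def prob(w, w1):
        p_cont = len(histories[w]) / n_bigram_types  # continuation probability
        if c_ctx[w1] == 0:
            return p_cont                            # unseen context: back off fully
        discounted = max(c_bi[(w1, w)] - d, 0) / c_ctx[w1]
        lam = d * len(followers[w1]) / c_ctx[w1]     # mass freed by discounting
        return discounted + lam * p_cont
    return prob
```

The key difference from Laplace: the backoff term asks "in how many contexts has w appeared?" rather than "how often has w appeared?", which is much better behaved when most trigram contexts have only ~2 observations.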

Option C: Fixed 256 Vocabulary + Full Interpolation

Cap vocabulary at 256 (map rare words to UNK), keeping the 25K corpus. This halves the prediction space while preserving all the data. Combined with the interpolation from v2.48, this should achieve the lowest PPL yet — potentially below 20.
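The vocabulary cap itself is a one-liner over word frequencies. A sketch assuming a plain token list and a hypothetical `<unk>` symbol:

```python
from collections import Counter

UNK = "<unk>"

def cap_vocabulary(tokens, max_vocab=256):
    """Keep the (max_vocab - 1) most frequent words; map the rest to UNK.
    Preserves the full corpus length while halving the prediction space."""
    keep = {w for w, _ in Counter(tokens).most_common(max_vocab - 1)}
    return [w if w in keep else UNK for w in tokens]
```

Note that UNK itself occupies one vocabulary slot, and its counts will concentrate the long tail of rare words into a single well-observed token.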

Trinity Identity

\varphi^2 + \frac{1}{\varphi^2} = 3


Generated: 2026-02-15 | Golden Chain Link #105 | Interpolated Trigram+Bigram — λ=0.2 Optimal, 3.9% PPL Gain, Degeneration Persists, Sparsity Compensated