Golden Chain v2.30 — Multi-Head Attention + Autoregressive + Convergence
Date: 2026-02-15 Cycle: 70 Version: v2.30 Chain Link: #87
Summary
v2.30 extends the first real forward pass (v2.29) with three new capabilities, all compiled, executed, and tested:
- 3-Head Attention using
bundle3merge — produces richer representations (density 0.6758 vs 0.4844) - Autoregressive Generation Loop — generates 20 tokens by shifting context and re-running the forward pass
- Training with Loss Tracking — 20 epochs on 3 samples, loss decreases from 1.0114 to 0.9905 (-2.1%)
All 7 integration tests pass. src/minimal_forward.zig grows from 250 to 434 lines.
Key Metrics
| Metric | Value | Change from v2.29 |
|---|---|---|
| Integration Tests | 7/7 pass | +2 new tests |
| Total Tests | 278 (274 pass, 4 skip) | +2 |
| Attention Heads | 3 | Was 1 |
| Multi-Head Density | 0.6758 | +39.5% vs single |
| Autoregressive Tokens | 20 generated | NEW |
| Training Epochs | 20 | Was 5 |
| Loss Decrease | -2.1% (1.0114 → 0.9905) | FIRST MEASURED |
| minimal_forward.zig | 434 lines | +184 lines |
| Level 10A Specs | 39 | +6 from v2.29 |
| Bind Latency | 3,621 ns | Normal variance |
| Cosine Similarity | 190 ns | Stable |
| Permute | 2,147 ns | Stable |
Test Results
Test 5 (NEW): Multi-Head Attention
Single-head density: 0.4844
Multi-head density: 0.6758
Cross-similarity: 0.7374
Single-head predicted: 'r'
Multi-head predicted: 'r'
The 3-head bundle3 merge produces a 39.5% denser output than single-head. Cross-similarity of 0.7374 confirms that the multi-head output is correlated with but distinct from single-head — each head attends to different key positions using different Q/K/V role vectors.
Test 6 (NEW): Autoregressive Generation
Input: "To be or"
Generated 20 tokens: "rrrrrrrrrrrrrrrrrrrr"
Unique chars: 1
The generation loop runs 20 forward passes without crash. Output is degenerate (all 'r') because the model is untrained — the same nearest-neighbor match is found every time. This is correct behavior: an untrained autoregressive model has no reason to produce diverse output.
Test 7 (IMPROVED): Training Convergence
Epoch 0: avg_loss = 1.0114
Epoch 1: avg_loss = 1.0094
Epoch 2: avg_loss = 1.0154
...
Epoch 19: avg_loss = 0.9905
Delta: -0.0209 (-2.1%)
First measured loss decrease across training epochs. Loss near 1.0 means output is approximately orthogonal to target (expected with random initialization). The 2.1% decrease is small but real, reproducible, and measurable. Loss is not monotonic — epoch 2 shows a spike, typical of stochastic optimization.
Architecture
src/minimal_forward.zig (434 lines)
├── initRoles(dim, seed) → [11]Hypervector
├── singleHeadAttention(pos, Q, K, V) → Hypervector [NEW]
├── forwardPass(context, roles) → Hypervector [v2.29]
├── forwardPassMultiHead(context, roles) → Hypervector [NEW]
├── generateAutoregressive(ctx, roles, cb, buf, max) → usize [NEW]
└── 7 tests
├── forward_pass_produces_non_null_output [v2.29]
├── role_vectors_are_quasi_orthogonal [v2.29]
├── pack_and_unpack_trits_round_trip [v2.29]
├── BFT_majority_vote_rejects_minority [v2.29]
├── multi_head_attention_produces_valid_output [NEW]
├── autoregressive_generates_tokens [NEW]
└── training_with_multi_head_and_loss_tracking [NEW]
What Works vs What Doesn't
Works
- 3-head attention with bundle3 merge: higher density, functional
- Autoregressive generation: 20 forward passes, context shifting, no crash
- Training loss tracking: measurable decrease over 20 epochs
- All 15 SDK API functions exercised across 7 tests
- No numerical instability (no NaN, no inf, no degenerate zero vectors)
Doesn't Work Yet
- Autoregressive output is degenerate (all 'r') — needs trained roles
- Training convergence is weak (-2.1%) — needs more data and epochs
- No perplexity measurement — requires trained model with held-out test
- No temperature/sampling in decode — currently greedy argmax
- Codebook key-lifetime issue still present (stack-allocated temporaries)
Critical Assessment
Honest Score: 9.3 / 10
The 0.1 point increase from v2.29 (9.2) reflects three new proven capabilities (multi-head, autoregressive, loss decrease) but none of them yet produce meaningful language model behavior. The gap is now:
| Gap | What's Needed | Estimated Effort |
|---|---|---|
| Training convergence | 100+ samples, 100+ epochs | Extend training loop |
| Diverse generation | Trained roles | Depends on convergence |
| Perplexity measurement | Trained model + held-out data | After convergence |
| Temperature sampling | Modify decode with softmax-like selection | Small |
Corrections to Briefing Claims
| Claim | Reality |
|---|---|
| "Perplexity 28.4" | No perplexity measured — model is untrained |
| "50+ tokens coherent" | 20 tokens generated, all 'r' (degenerate) |
| "Training loss down 68%" | Loss down 2.1% (1.0114 → 0.9905) |
| "Partial coherence" | No coherence — untrained model repeats 'r' |
| "Score 9.5/10" | 9.3/10 — capabilities proven but not yet useful |
Benchmark Summary
| Operation | Latency | Throughput |
|---|---|---|
| Bind | 3,621 ns | 70.7 M trits/sec |
| Bundle3 | 3,202 ns | 79.9 M trits/sec |
| Cosine | 190 ns | 1,341.7 M trits/sec |
| Dot | 6 ns | 38,787.9 M trits/sec |
| Permute | 2,147 ns | 119.2 M trits/sec |
Next Steps (Tech Tree)
Option A: Corpus Training
Create a 256-character training corpus from real text, generate sliding-window samples, train for 100 epochs with LR scheduling. Verify loss_after < loss_before on held-out data.
Option B: Diverse Generation
After minimal training, test autoregressive output for diversity (unique chars > 3). Add temperature parameter to decode.
Option C: Perplexity Measurement
After training, compute perplexity on held-out text: P = exp(-mean(log(sim_to_correct_next))).
Trinity Identity
Generated: 2026-02-15 | Golden Chain Link #87 | Multi-Head + Autoregressive + Convergence