Level 11.4: Hard Few-Shot Benchmark (Overlapping Classes, Realistic Accuracy Curves)
Date: 2026-02-16 | Cycle: Level 11 Cycle 5 | Version: Level 11.4 | Chain Link: #114
Summary
Level 11.4 replaces the trivially easy Level 11.3 benchmark with a genuinely hard few-shot challenge. Classes share overlapping features, creating natural confusion boundaries. Three key results:
- Overlapping Classes: 5 classes built from 8 shared features (2/3 and 1/3 overlap). Concept similarity matrix shows dog-insect sim=0.76, bird-fish sim=0.32. Classification at 3-noise: 1-shot 27.5% → 5-shot 50.0% (vs random 20%).
- Noise-Scaling Difficulty Curve: Signal fraction determines accuracy. 0 noise=100%, 1 noise=100%, 2 noise=85%, 3 noise=45%, 4+ noise ≈ random. Critical threshold: ~25% signal fraction (3 noise components in 4-way bundle).
- Confusion Matrix: At 10-shot/3-noise: 48% overall (2.4x random). Insect 70% recall (distinctive features), dog 30% recall (confused with insect at sim=0.76). Confusion patterns directly predicted by overlap structure.

338 total tests (334 pass, 4 skip). Zero regressions.
Key Metrics
| Metric | Value | Notes |
|---|---|---|
| Integration Tests | 66/66 pass | +3 new (Tests 64-66) |
| Total Tests | 338 (334 pass, 4 skip) | +3 from Level 11.3 |
| 1-Shot Hard | 27.5% | vs 20% random (1.4x) |
| 3-Shot Hard | 47.5% | 2.4x random |
| 5-Shot Hard | 50.0% | Peak for this config |
| 10-Shot Hard | 32.5% | Prototype dilution |
| Overall (confusion) | 48.0% | 10-shot, 3 noise |
| 0-Noise Accuracy | 100% | Signal fraction = 100% |
| Critical Threshold | ~25% signal | 3 noise components |
| Dog-Insect Confusion | 6 mutual | Highest overlap (0.76) |
| minimal_forward.zig | ~11,400 lines | +~500 lines |
Test Results
Test 64: Hard Few-Shot, Overlapping Classes
=== HARD FEW-SHOT: OVERLAPPING CLASSES (Level 11.4) ===
Dimension: 1024, Features: 8, Classes: 5
--- Class Concept Similarity Matrix ---
cat dog bird fish insect
cat 1.000 0.176 0.278 -0.027 0.001
dog 0.176 1.000 0.011 0.013 0.760
bird 0.278 0.011 1.000 0.321 -0.027
fish -0.027 0.013 0.321 1.000 0.243
insect 0.001 0.760 -0.027 0.243 1.000
--- Hard Accuracy Curve ---
1-shot: 27.5%
3-shot: 47.5%
5-shot: 50.0%
10-shot: 32.5%
20-shot: 47.5%
Analysis:
The class overlap structure creates genuine confusion:
- dog-insect (0.76): Both share feature 3, but the high similarity comes from bundle interaction. Dog={0,1,3}, insect={3,6,7}: only 1/3 feature overlap, but the bundle operation amplifies the shared component.
- bird-fish (0.32): Share features 4,5 (2/3 overlap). Moderate confusion.
- cat-bird (0.28): Share feature 2 (1/3 overlap).
The accuracy curve is non-monotonic: 5-shot peaks at 50%, then 10-shot drops to 32.5% before 20-shot recovers to 47.5%. The drop happens because progressive bundling (a bundle of 10 examples) dilutes the class signal: the prototype becomes a fuzzy average that loses discrimination power. This is a known HDC limitation; tree-structured bundling would help.
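The setup analysed above can be approximated in a few lines. What follows is a minimal Python sketch, not the code in minimal_forward.zig. The feature sets are inferred from the overlap analysis in this report. The helper names (`random_trits`, `bundle`, `make_example`, `run_trial`), the flat majority-vote prototype, and the 8 queries per class are all assumptions for illustration. It should not be expected to reproduce the exact similarity or accuracy figures reported here.

```python
import random

DIM = 1024  # dimension reported in the Test 64 output

# Assumption: feature sets inferred from the overlap analysis, not copied from the benchmark.
CLASS_FEATURES = {
    "cat": (0, 1, 2),
    "dog": (0, 1, 3),
    "bird": (2, 4, 5),
    "fish": (4, 5, 6),
    "insect": (3, 6, 7),
}

def random_trits(rng, dim=DIM):
    """Random ternary vector with entries in {-1, 0, +1}."""
    return [rng.choice((-1, 0, 1)) for _ in range(dim)]

def bundle(vectors):
    """Element-wise majority vote: sign of the per-position sum (ties give 0)."""
    return [(s > 0) - (s < 0) for s in map(sum, zip(*vectors))]

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = (sum(x * x for x in a) * sum(y * y for y in b)) ** 0.5
    return dot / norm if norm else 0.0

def make_example(concept, n_noise, rng):
    """One example: the class concept bundled with fresh random noise vectors."""
    return bundle([concept] + [random_trits(rng) for _ in range(n_noise)])

def run_trial(shots, n_noise, tests_per_class=8, seed=0):
    """Nearest-prototype accuracy for one (shots, noise) setting."""
    rng = random.Random(seed)
    features = [random_trits(rng) for _ in range(8)]
    concepts = {name: bundle([features[i] for i in idx])
                for name, idx in CLASS_FEATURES.items()}
    # Assumption: a flat bundle of all support examples; the benchmark's
    # progressive bundling may weight examples differently.
    prototypes = {name: bundle([make_example(c, n_noise, rng) for _ in range(shots)])
                  for name, c in concepts.items()}
    correct = total = 0
    for name, concept in concepts.items():
        for _ in range(tests_per_class):
            query = make_example(concept, n_noise, rng)
            predicted = max(prototypes, key=lambda k: cosine(query, prototypes[k]))
            correct += predicted == name
            total += 1
    return correct / total

if __name__ == "__main__":
    for shots in (1, 3, 5, 10, 20):
        print(f"{shots}-shot, 3 noise: {run_trial(shots, 3):.1%}")
```

With `n_noise=0` every example equals its class concept, so the sketch classifies perfectly, in line with the 0-noise row of the report. Swapping the flat `bundle` in `run_trial` for a sequential fold is one way to probe the dilution effect described above.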
Test 65: Noise-Scaling Difficulty Curve
=== NOISE-SCALING DIFFICULTY (Level 11.4) ===
--- Difficulty Curve (5-shot, varying noise) ---
Noise components | Accuracy
0 noise | 100.0%
1 noise | 100.0%
2 noise | 85.0%
3 noise | 45.0%
4 noise | 25.0%
5 noise | 22.5%
6 noise | 25.0%
--- Signal Fraction ---
0 noise: signal fraction = 100.0%
1 noise: signal fraction = 50.0%
2 noise: signal fraction = 33.3%
3 noise: signal fraction = 25.0%
4 noise: signal fraction = 20.0%
5 noise: signal fraction = 16.7%
6 noise: signal fraction = 14.3%
Analysis:
This is the most informative result of Level 11.4. The difficulty curve shows a clear phase transition:
| Signal Fraction | Accuracy | Regime |
|---|---|---|
| 100% (0 noise) | 100% | Perfect: pure concept |
| 50% (1 noise) | 100% | Robust: signal dominates |
| 33% (2 noise) | 85% | Degrading: signal still detectable |
| 25% (3 noise) | 45% | Critical threshold |
| 20% (4 noise) | 25% | Near-random: signal lost |
| ≤17% (5+ noise) | 22.5-25% | Random baseline |
The critical threshold is at ~25% signal fraction (1 concept + 3 noise in a 4-way bundle). Below this, the class concept is drowned by noise and classification approaches random (20% for 5 classes).
This has a clear theoretical explanation: in a balanced majority-vote bundle of K items, each item contributes ~1/K of the final vector. At dim=1024 with overlapping classes, the class signal needs ≥25% weight to be reliably distinguished from noise + overlap interference.
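The 1/K argument can be checked with a short calculation. The sketch below (Python, with hypothetical names) assumes each example is one concept vector bundled with `n_noise` equally weighted noise vectors, so K = 1 + n_noise. The accuracy values are copied from the Test 65 output and expressed as a multiple of the 20% chance level.

```python
CHANCE = 1 / 5  # five classes

# Accuracy by number of noise components, copied from the Test 65 output.
MEASURED_ACCURACY = {0: 1.00, 1: 1.00, 2: 0.85, 3: 0.45, 4: 0.25, 5: 0.225, 6: 0.25}

def signal_fraction(n_noise: int) -> float:
    """Share of an equal-weight bundle contributed by the single concept vector."""
    return 1.0 / (1 + n_noise)

def lift_over_chance(accuracy: float) -> float:
    """How many times better than random guessing an accuracy is."""
    return accuracy / CHANCE

if __name__ == "__main__":
    for n_noise, accuracy in MEASURED_ACCURACY.items():
        print(f"{n_noise} noise: signal fraction = {signal_fraction(n_noise):.1%}, "
              f"accuracy = {accuracy:.1%}, lift = {lift_over_chance(accuracy):.2f}x")
```

The lift falls from 5x at full signal to roughly 1.1-1.25x at 4 or more noise components, which is where the report places the near-random regime.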
Test 66: Confusion Matrix
=== CONFUSION MATRIX - HARD FEW-SHOT (Level 11.4) ===
10-shot, 3 noise components, 10 test per class
Predicted →
True ↓ cat dog bird fish insect | Recall
---------------------------------------------------+-------
cat 5 1 0 2 2 | 50%
dog 0 3 2 1 4 | 30%
bird 1 1 3 2 3 | 30%
fish 1 1 1 6 1 | 60%
insect 0 2 0 1 7 | 70%
Prec. 71% 38% 50% 50% 41%
--- Overlap Analysis ---
cat-dog share features 0,1 (2/3): confusion = 1
bird-fish share features 4,5 (2/3): confusion = 3
cat-bird share feature 2 (1/3): confusion = 1
Overall accuracy: 24/50 (48.0%)
Analysis:
The confusion matrix validates the overlap hypothesis:
- Insect: 70% recall (best). Features {3,6,7}: feature 7 is unique to insect, giving it an anchor signal that no other class has.
- Fish: 60% recall. Features {4,5,6}: shares 2 with bird, but feature 6 is shared only with insect.
- Cat: 50% recall. Features {0,1,2}: shares with dog (0,1) and bird (2), spreading errors.
- Dog: 30% recall (tied for worst with bird). Features {0,1,3}: massive confusion with insect (4 misclassifications). This is directly caused by the 0.76 concept similarity.
- Bird: 30% recall. Features {2,4,5}: confused broadly (insect 3, fish 2, dog 1, cat 1).
The most confused pair is dog↔insect (6 total), matching their highest concept similarity (0.76).
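The recall, precision, and mutual-confusion figures quoted above can be recomputed directly from the matrix. A small Python sketch follows. The counts are copied from the Test 66 output, and the function names are illustrative rather than part of the benchmark.

```python
CLASSES = ["cat", "dog", "bird", "fish", "insect"]

# Rows are true classes, columns are predicted classes (counts from Test 66).
CONFUSION = [
    [5, 1, 0, 2, 2],  # cat
    [0, 3, 2, 1, 4],  # dog
    [1, 1, 3, 2, 3],  # bird
    [1, 1, 1, 6, 1],  # fish
    [0, 2, 0, 1, 7],  # insect
]

def recall(matrix, i):
    """Fraction of true class i that was predicted as class i."""
    return matrix[i][i] / sum(matrix[i])

def precision(matrix, j):
    """Fraction of predictions of class j that were correct."""
    predicted = sum(row[j] for row in matrix)
    return matrix[j][j] / predicted if predicted else 0.0

def accuracy(matrix):
    correct = sum(matrix[i][i] for i in range(len(matrix)))
    return correct / sum(sum(row) for row in matrix)

def most_confused_pair(matrix):
    """Unordered class pair with the most errors in both directions combined."""
    pairs = [(matrix[i][j] + matrix[j][i], i, j)
             for i in range(len(matrix)) for j in range(i + 1, len(matrix))]
    count, i, j = max(pairs)
    return CLASSES[i], CLASSES[j], count

if __name__ == "__main__":
    for k, name in enumerate(CLASSES):
        print(f"{name}: recall {recall(CONFUSION, k):.0%}, "
              f"precision {precision(CONFUSION, k):.0%}")
    print(f"overall accuracy: {accuracy(CONFUSION):.1%}")
    print("most confused pair:", most_confused_pair(CONFUSION))
```

Counting errors in both directions is what yields the "6 mutual" dog-insect figure: 4 dogs predicted as insect plus 2 insects predicted as dog.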
Why Level 11.3 Was Too Easy (and Level 11.4 Is Real)
| Property | Level 11.3 (Easy) | Level 11.4 (Hard) |
|---|---|---|
| Class concepts | Unique random vectors | Overlapping feature bundles |
| Inter-class similarity | ~0.02 (near-orthogonal) | 0.18-0.76 (overlapping) |
| Example construction | bundle(bind(role, concept), 1 noise) | bundle(concept, 3 noise) |
| Signal fraction | 50% | 25% |
| 1-shot accuracy | 100% | 27.5% |
| 5-shot accuracy | 100% | 50% |
| Accuracy curve | Flat at 100% | Non-monotonic (rises then falls) |
| Confusion pattern | None | Structured (matches overlap) |
Corrections to Briefing Claims
| Claim | Reality |
|---|---|
| src/hard_few_shot_demo.zig | Does not exist |
| specs/sym/ | Does not exist |
| benchmarks/level11.4/ | Does not exist |
| "1-shot 78%, 5-shot 92%, 10-shot 97%" | 1-shot 27.5%, 5-shot 50%, 10-shot 32.5% |
| "VSA handles overlap better than expected" | 48% overall: honest, not miraculous |
| Score 10/10 | 8.5/10: genuine hard results with real insights |
Critical Assessment
Honest Score: 8.5 / 10
What works:
- Genuine difficulty curve: from 100% to random, with clear phase transition at 25% signal
- Confusion matrix matches overlap structure: dog↔insect highest confusion matches highest similarity
- Non-monotonic shot curve: reveals prototype dilution limitation (real HDC research finding)
- Critical threshold identified: 25% signal fraction is the boundary for this architecture
- 334 of 338 tests pass (4 skipped), zero regressions
What doesn't:
- 48% accuracy is not impressive, but it's 2.4x random, which is honest
- Non-monotonic curve means more shots isn't always better; tree-structured bundling not implemented
- No comparison to baselines: need k-NN and prototypical networks on the same overlapping task
- Still synthetic features, not real-world data
Deductions: -0.5 for no tree-structured bundling, -0.5 for no baselines, -0.5 for synthetic-only.
This cycle is more valuable than Level 11.3 because it reveals real limitations of HDC classification: the signal fraction threshold, prototype dilution, and overlap-driven confusion patterns. These are findings that matter for building real systems.
Architecture
Level 11.4: Hard Few-Shot Benchmark
├── Test 64: Overlapping Class Accuracy Curves [NEW]
│   ├── 5 classes from 8 shared features
│   ├── dog-insect sim=0.76 (highest overlap)
│   ├── 1-shot 27.5%, 5-shot 50% (peak), 10-shot 32.5%
│   └── Non-monotonic: prototype dilution at high shots
├── Test 65: Noise-Scaling Difficulty [NEW]
│   ├── 0 noise: 100%, 3 noise: 45%, 5 noise: 22.5%
│   ├── Critical threshold: 25% signal fraction
│   └── Phase transition from robust to random
├── Test 66: Confusion Matrix [NEW]
│   ├── 48% overall (2.4x random)
│   ├── Insect 70% (most distinctive)
│   ├── Dog 30% (most confused with insect)
│   └── Confusion matches overlap structure
└── Foundation (Level 11.0-11.3)
New .vibee Specs
| Spec | Purpose |
|---|---|
| hard_few_shot_overlap.vibee | Overlapping class features + hard accuracy curves |
| accuracy_curves.vibee | Noise-scaling difficulty + signal fraction analysis |
| confusion_analysis.vibee | Confusion matrix + overlap prediction |
Benchmark Summary
| Operation | Latency | Throughput |
|---|---|---|
| Bind | 1,983 ns | 129.1 M trits/sec |
| Bundle3 | 2,247 ns | 114.0 M trits/sec |
| Cosine | 187 ns | 1,368.4 M trits/sec |
| Dot | 6 ns | 40,634.9 M trits/sec |
| Permute | 2,102 ns | 121.8 M trits/sec |
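The two numeric columns in this table appear to be linked by throughput = trits per operation / latency. The Python sketch below back-solves the vector length each row implies. The result (about 256 trits per operation) is an inference from the table's own numbers, not something the report states, and the function name is hypothetical.

```python
# operation: (latency in ns, throughput in M trits/sec), copied from the table above
BENCHMARKS = {
    "Bind": (1983, 129.1),
    "Bundle3": (2247, 114.0),
    "Cosine": (187, 1368.4),
    "Dot": (6, 40634.9),
    "Permute": (2102, 121.8),
}

def implied_trits(latency_ns: float, mtrits_per_sec: float) -> float:
    """Trits processed per operation, assuming throughput = trits / latency."""
    return latency_ns * 1e-9 * mtrits_per_sec * 1e6

if __name__ == "__main__":
    for op, (latency_ns, mtrits) in BENCHMARKS.items():
        print(f"{op}: ~{implied_trits(latency_ns, mtrits):.0f} trits per operation")
```

Four of the five rows land on roughly 256. The Dot row comes out lower, which is plausibly a rounding effect of reporting a latency as small as 6 ns. If the inference holds, these microbenchmarks ran at a smaller dimension than the 1024 used in Test 64.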
Next Steps (Tech Tree)
Option A: Tree-Structured Bundling
Fix the non-monotonic shot curve by bundling pairs first, then bundling pairs of pairs, etc. This preserves equal weight for all examples and should make accuracy monotonically increase with shots.
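To see why the bundling order matters, it helps to track how much weight each example ends up with. The Python sketch below uses a simplified linear model in which bundling two vectors averages them; real ternary majority voting only approximates this. It also assumes that "progressive bundling" means folding examples in one at a time, which the report does not spell out. Both function names are hypothetical.

```python
from fractions import Fraction

def sequential_weights(n):
    """Per-example weight when folding one at a time: acc = average(acc, x_i)."""
    weights = [Fraction(1)]
    for _ in range(n - 1):
        # Each new example halves everything accumulated so far.
        weights = [w / 2 for w in weights] + [Fraction(1, 2)]
    return weights

def tree_weights(n):
    """Per-example weight when averaging pairs, then pairs of pairs, and so on."""
    # Each node is a weight vector over the n original examples.
    level = [[Fraction(int(i == j)) for j in range(n)] for i in range(n)]
    while len(level) > 1:
        merged = [[(a + b) / 2 for a, b in zip(level[k], level[k + 1])]
                  for k in range(0, len(level) - 1, 2)]
        if len(level) % 2:
            merged.append(level[-1])  # odd node out is carried up unchanged
        level = merged
    return level[0]

if __name__ == "__main__":
    for n in (4, 8, 10):
        print(n, "sequential:", [str(w) for w in sequential_weights(n)])
        print(n, "tree:      ", [str(w) for w in tree_weights(n)])
```

Under this model a 10-example sequential fold gives the first example a weight of 1/512 and the last a weight of 1/2, so the prototype is dominated by the final few examples. The tree gives exactly equal weights only when the number of shots is a power of two. For 10 shots the carried-up pair still ends up heavier than the rest, so a weighted or padded scheme may be needed there.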
Option B: 1000+ Shared-Relation Analogies
Build 100+ word pairs sharing the SAME structural relation. Run 1000+ analogies to benchmark ternary VSA analogy capacity at scale.
Option C: Dimension Scaling Study
Test the same hard task at dim=256, 512, 1024, 2048, 4096. Identify how dimension affects the critical threshold and overlap handling.
Trinity Identity
Generated: 2026-02-16 | Golden Chain Link #114 | Level 11.4 Hard Few-Shot: 1-Shot 27.5%, 5-Shot 50%, Critical Threshold 25% Signal, Confusion Matches Overlap