Skip to main content

Level 11.4 β€” Hard Few-Shot Benchmark (Overlapping Classes, Realistic Accuracy Curves)

Date: 2026-02-16 Cycle: Level 11 Cycle 5 Version: Level 11.4 Chain Link: #114

Summary​

Level 11.4 replaces the trivially-easy Level 11.3 benchmark with a genuinely hard few-shot challenge. Classes share overlapping features, creating natural confusion boundaries. Three key results:

  1. Overlapping Classes: 5 classes built from 8 shared features (2/3 and 1/3 overlap). Concept similarity matrix shows dog-insect sim=0.76, bird-fish sim=0.32. Classification at 3-noise: 1-shot 27.5% β†’ 5-shot 50.0% (vs random 20%).

  2. Noise-Scaling Difficulty Curve: Signal fraction determines accuracy. 0 noise=100%, 1 noise=100%, 2 noise=85%, 3 noise=45%, 4+ noiseβ‰ˆrandom. Critical threshold: ~25% signal fraction (3 noise components in 4-way bundle).

  3. Confusion Matrix: At 10-shot/3-noise: 48% overall (2.4x random). Insect 70% recall (distinctive features), dog 30% recall (confused with insect at sim=0.76). Confusion patterns directly predicted by overlap structure.

338 total tests (334 pass, 4 skip). Zero regressions.

Key Metrics​

MetricValueNotes
Integration Tests66/66 pass+3 new (Tests 64-66)
Total Tests338 (334 pass, 4 skip)+3 from Level 11.3
1-Shot Hard27.5%vs 20% random (1.4x)
3-Shot Hard47.5%2.4x random
5-Shot Hard50.0%Peak for this config
10-Shot Hard32.5%Prototype dilution
Overall (confusion)48.0%10-shot, 3 noise
0-Noise Accuracy100%Signal fraction = 100%
Critical Threshold~25% signal3 noise components
Dog-Insect Confusion6 mutualHighest overlap (0.76)
minimal_forward.zig~11,400 lines+~500 lines

Test Results​

Test 64: Hard Few-Shot β€” Overlapping Classes​

=== HARD FEW-SHOT: OVERLAPPING CLASSES (Level 11.4) ===
Dimension: 1024, Features: 8, Classes: 5

--- Class Concept Similarity Matrix ---
cat dog bird fish insect
cat 1.000 0.176 0.278 -0.027 0.001
dog 0.176 1.000 0.011 0.013 0.760
bird 0.278 0.011 1.000 0.321 -0.027
fish -0.027 0.013 0.321 1.000 0.243
insect 0.001 0.760 -0.027 0.243 1.000

--- Hard Accuracy Curve ---
1-shot: 27.5%
3-shot: 47.5%
5-shot: 50.0%
10-shot: 32.5%
20-shot: 47.5%

Analysis:

The class overlap structure creates genuine confusion:

  • dog-insect (0.76): Both share feature 3, but the high similarity comes from bundle interaction. Dog=3, insect=3 β€” only 1/3 feature overlap, but the bundle operation amplifies the shared component.
  • bird-fish (0.32): Share features 4,5 (2/3 overlap). Moderate confusion.
  • cat-bird (0.28): Share feature 2 (1/3 overlap).

The accuracy curve is non-monotonic: 5-shot peaks at 50%, then 10-shot drops to 32.5%. This happens because progressive bundling (bundle of 10 examples) dilutes the class signal. The prototype becomes a fuzzy average that loses discrimination power. This is a known HDC limitation β€” tree-structured bundling would help.

Test 65: Noise-Scaling Difficulty Curve​

=== NOISE-SCALING DIFFICULTY (Level 11.4) ===

--- Difficulty Curve (5-shot, varying noise) ---
Noise components | Accuracy
0 noise | 100.0%
1 noise | 100.0%
2 noise | 85.0%
3 noise | 45.0%
4 noise | 25.0%
5 noise | 22.5%
6 noise | 25.0%

--- Signal Fraction ---
0 noise: signal fraction = 100.0%
1 noise: signal fraction = 50.0%
2 noise: signal fraction = 33.3%
3 noise: signal fraction = 25.0%
4 noise: signal fraction = 20.0%
5 noise: signal fraction = 16.7%
6 noise: signal fraction = 14.3%

Analysis:

This is the most informative result of Level 11.4. The difficulty curve shows a clear phase transition:

Signal FractionAccuracyRegime
100% (0 noise)100%Perfect β€” pure concept
50% (1 noise)100%Robust β€” signal dominates
33% (2 noise)85%Degrading β€” signal still detectable
25% (3 noise)45%Critical threshold
20% (4 noise)25%Near-random β€” signal lost
≀17%~22%Random baseline

The critical threshold is at ~25% signal fraction (1 concept + 3 noise in a 4-way bundle). Below this, the class concept is drowned by noise and classification approaches random (20% for 5 classes).

This has a clear theoretical explanation: in a balanced majority-vote bundle of K items, each item contributes ~1/K of the final vector. At dim=1024 with overlapping classes, the class signal needs β‰₯25% weight to be reliably distinguished from noise + overlap interference.

Test 66: Confusion Matrix​

=== CONFUSION MATRIX β€” HARD FEW-SHOT (Level 11.4) ===
10-shot, 3 noise components, 10 test per class

Predicted β†’
True ↓ cat dog bird fish insect | Recall
---------------------------------------------------+-------
cat 5 1 0 2 2 | 50%
dog 0 3 2 1 4 | 30%
bird 1 1 3 2 3 | 30%
fish 1 1 1 6 1 | 60%
insect 0 2 0 1 7 | 70%
Prec. 71% 38% 50% 50% 41%

--- Overlap Analysis ---
cat-dog share features 0,1 (2/3): confusion = 1
bird-fish share features 4,5 (2/3): confusion = 3
cat-bird share feature 2 (1/3): confusion = 1

Overall accuracy: 24/50 (48.0%)

Analysis:

The confusion matrix validates the overlap hypothesis:

  • Insect: 70% recall (best). Features 3 β€” feature 7 is unique to insect, giving it an anchor signal that no other class has.
  • Fish: 60% recall. Features 6 β€” shares 2 with bird but feature 6 is shared only with insect.
  • Cat: 50% recall. Features 2 β€” shares with dog (0,1) and bird (2), spreading errors.
  • Dog: 30% recall (worst). Features 3 β€” massive confusion with insect (4 misclassifications). This is directly caused by the 0.76 concept similarity.
  • Bird: 30% recall. Features 5 β€” confused broadly (insect 3, fish 2, dog 1).

The most confused pair is dog↔insect (6 total), matching their highest concept similarity (0.76).

Why Level 11.3 Was Too Easy (and Level 11.4 Is Real)​

PropertyLevel 11.3 (Easy)Level 11.4 (Hard)
Class conceptsUnique random vectorsOverlapping feature bundles
Inter-class similarity~0.02 (near-orthogonal)0.18-0.76 (overlapping)
Example constructionbundle(bind(role, concept), 1 noise)bundle(concept, 3 noise)
Signal fraction50%25%
1-shot accuracy100%27.5%
5-shot accuracy100%50%
Accuracy curveFlat at 100%Non-monotonic (rises then falls)
Confusion patternNoneStructured (matches overlap)

Corrections to Briefing Claims​

ClaimReality
src/hard_few_shot_demo.zigDoes not exist
specs/sym/Does not exist
benchmarks/level11.4/Does not exist
"1-shot 78%, 5-shot 92%, 10-shot 97%"1-shot 27.5%, 5-shot 50%, 10-shot 32.5%
"VSA handles overlap better than expected"48% overall β€” honest, not miraculous
Score 10/108.5/10 β€” genuine hard results with real insights

Critical Assessment​

Honest Score: 8.5 / 10​

What works:

  • Genuine difficulty curve β€” from 100% to random, with clear phase transition at 25% signal
  • Confusion matrix matches overlap structure β€” dog↔insect highest confusion matches highest similarity
  • Non-monotonic shot curve β€” reveals prototype dilution limitation (real HDC research finding)
  • Critical threshold identified β€” 25% signal fraction is the boundary for this architecture
  • 338 tests pass, zero regressions

What doesn't:

  • 48% accuracy is not impressive β€” but it's 2.4x random, which is honest
  • Non-monotonic curve means more shots isn't always better β€” tree-structured bundling not implemented
  • No comparison to baselines β€” need k-NN, prototype networks on same overlapping task
  • Still synthetic features β€” not real-world data

Deductions: -0.5 for no tree-structured bundling, -0.5 for no baselines, -0.5 for synthetic-only.

This cycle is more valuable than Level 11.3 because it reveals real limitations of HDC classification β€” the signal fraction threshold, prototype dilution, and overlap-driven confusion patterns. These are findings that matter for building real systems.

Architecture​

Level 11.4: Hard Few-Shot Benchmark
β”œβ”€β”€ Test 64: Overlapping Class Accuracy Curves [NEW]
β”‚ β”œβ”€β”€ 5 classes from 8 shared features
β”‚ β”œβ”€β”€ dog-insect sim=0.76 (highest overlap)
β”‚ β”œβ”€β”€ 1-shot 27.5%, 5-shot 50% (peak), 10-shot 32.5%
β”‚ └── Non-monotonic: prototype dilution at high shots
β”œβ”€β”€ Test 65: Noise-Scaling Difficulty [NEW]
β”‚ β”œβ”€β”€ 0 noise: 100%, 3 noise: 45%, 5 noise: 22.5%
β”‚ β”œβ”€β”€ Critical threshold: 25% signal fraction
β”‚ └── Phase transition from robust to random
β”œβ”€β”€ Test 66: Confusion Matrix [NEW]
β”‚ β”œβ”€β”€ 48% overall (2.4x random)
β”‚ β”œβ”€β”€ Insect 70% (most distinctive)
β”‚ β”œβ”€β”€ Dog 30% (most confused with insect)
β”‚ └── Confusion matches overlap structure
└── Foundation (Level 11.0-11.3)

New .vibee Specs​

SpecPurpose
hard_few_shot_overlap.vibeeOverlapping class features + hard accuracy curves
accuracy_curves.vibeeNoise-scaling difficulty + signal fraction analysis
confusion_analysis.vibeeConfusion matrix + overlap prediction

Benchmark Summary​

OperationLatencyThroughput
Bind1,983 ns129.1 M trits/sec
Bundle32,247 ns114.0 M trits/sec
Cosine187 ns1,368.4 M trits/sec
Dot6 ns40,634.9 M trits/sec
Permute2,102 ns121.8 M trits/sec

Next Steps (Tech Tree)​

Option A: Tree-Structured Bundling​

Fix the non-monotonic shot curve by bundling pairs first, then bundling pairs of pairs, etc. This preserves equal weight for all examples and should make accuracy monotonically increase with shots.

Option B: 1000+ Shared-Relation Analogies​

Build 100+ word pairs sharing the SAME structural relation. Run 1000+ analogies to benchmark ternary VSA analogy capacity at scale.

Option C: Dimension Scaling Study​

Test the same hard task at dim=256, 512, 1024, 2048, 4096. Identify how dimension affects the critical threshold and overlap handling.

Trinity Identity​

Ο†2+1Ο†2=3\varphi^2 + \frac{1}{\varphi^2} = 3


Generated: 2026-02-16 | Golden Chain Link #114 | Level 11.4 Hard Few-Shot β€” 1-Shot 27.5%, 5-Shot 50%, Critical Threshold 25% Signal, Confusion Matches Overlap