The Agzamov Test

Watch Live

Games are streamed live as they happen. Follow us to get notified.

▶ YouTube (coming soon) ▶ Twitch (coming soon)

The Problem

Most AI benchmarks test naked models in a lab. Fixed questions, known answers, no pressure.

But that's not how anyone uses AI. In the real world, models have tools, memory, and orchestration. They face problems that can't be memorized. They work against opponents who adapt.

The Agzamov Test measures the gap. Strip everything away — how good is the model alone? Now add tools and memory back — how much better does it get? That gap is the Agzamov Score (0–100).

A smart model is not a press release. It's a number.

Real-time Stockfish evaluation

Model's actual reasoning

Every move with timing + cost

AI commentary and analysis

Chess960 — complete information game

Heads-Up No-Limit Hold'em — hidden information game

How It Works

Four phases. Each one adds more augmentation. The delta tells you what actually helps.

0 DONE

Sanity Check

Model vs random opponent. Does the harness even work? Can the model play legal moves?

96.7% win rate · 0 errors

1 IN PROGRESS

Baseline

Naked model vs naked model. No tools, no memory. Just raw capability. This is the baseline score, E₀.

2 NEXT

Augmented vs Naked

Give one model tools + memory, keep the other naked. The score difference = how much augmentation actually helps.

3 NEXT

Arms Race

Both models get tools + memory. Does augmentation still help when the opponent has it too?

What You Get

Δ_a

The Delta

Δ_a = score(with tools) − score(without tools)
Positive = augmentation helps. Zero = your RAG is useless. Negative = your tools make it worse.
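
As a rough illustration of the delta, here is a minimal sketch in Python. The scoring scale (win = 1, draw = 0.5, loss = 0) and the function names `mean_score` and `augmentation_delta` are assumptions made for this example; the page does not specify how the per-game score or the 0–100 Agzamov Score is computed.

```python
# Hypothetical scoring scale, assumed for illustration only.
POINTS = {"win": 1.0, "draw": 0.5, "loss": 0.0}


def mean_score(results):
    """Average points per game over a list of 'win' / 'draw' / 'loss' results."""
    if not results:
        raise ValueError("need at least one game result")
    return sum(POINTS[r] for r in results) / len(results)


def augmentation_delta(with_tools, without_tools):
    """score(with tools) - score(without tools), on the assumed 0..1 scale."""
    return mean_score(with_tools) - mean_score(without_tools)


# Made-up results, not real measurements.
naked = ["win", "loss", "draw", "loss"]      # 1.5 points over 4 games
augmented = ["win", "win", "draw", "loss"]   # 2.5 points over 4 games
delta = augmentation_delta(augmented, naked)
print(f"delta = {delta:+.3f}")
```

A positive value means the augmented run scored higher, zero means no measurable difference, and a negative value means the tools hurt.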

τ

Learning Speed

How many games until augmentation kicks in. Some models figure out their tools in 5 games. Some never do.
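
One possible way to turn "games until augmentation kicks in" into a number is sketched below. The rolling-window definition, the `window` and `threshold` parameters, and the name `learning_speed` are all assumptions for illustration; the page does not define how τ is actually measured.

```python
def learning_speed(aug_scores, baseline, window=5, threshold=0.2):
    """Return the game count at which the rolling mean of the augmented
    model's scores first beats the baseline by at least `threshold`.

    aug_scores: per-game scores of the augmented model, in play order
    baseline:   mean score of the naked model (assumed known from Phase 1)
    Returns None if the threshold is never reached.
    """
    for end in range(window, len(aug_scores) + 1):
        recent = aug_scores[end - window:end]
        if sum(recent) / window - baseline >= threshold:
            return end
    return None


# Made-up score sequences, not real measurements.
improving = [0, 0, 1, 0, 1, 1, 1, 1, 1, 1]
flat = [0, 1, 0, 1, 0, 1, 0, 1]

tau_improving = learning_speed(improving, baseline=0.5)
tau_flat = learning_speed(flat, baseline=0.5)
print(tau_improving, tau_flat)
```

Under this definition, a model that never pulls ahead of its baseline has no τ at all, which matches the "some never do" case.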

M × A

Compatibility Matrix

Claude + BrainOps Memory = great. GPT + same memory = meh. Not all models benefit from the same tools.
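
The matrix idea can be sketched as a table of deltas indexed by model and augmentation. The names (`model_a`, `memory_x`, `search_y`) and the numbers are placeholders invented for this example, not measured results, and `build_matrix` and `best_augmentation` are hypothetical helpers.

```python
def build_matrix(deltas):
    """Turn {(model, augmentation): delta} into {model: {augmentation: delta}}."""
    matrix = {}
    for (model, augmentation), delta in deltas.items():
        matrix.setdefault(model, {})[augmentation] = delta
    return matrix


def best_augmentation(matrix, model):
    """Return the augmentation with the highest delta for the given model."""
    row = matrix[model]
    return max(row, key=row.get)


# Placeholder values for illustration only.
deltas = {
    ("model_a", "memory_x"): 0.25,
    ("model_a", "search_y"): 0.05,
    ("model_b", "memory_x"): 0.00,
    ("model_b", "search_y"): 0.10,
}

matrix = build_matrix(deltas)
for model in matrix:
    print(model, "->", best_augmentation(matrix, model))
```

Reading across a row shows which tools help a given model; reading down a column shows which models benefit from a given tool.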
