
Last week, a six-person startup called Poetiq did something that wasn’t supposed to happen yet.
They achieved 54% accuracy on ARC-AGI-2, a benchmark specifically designed to measure genuine reasoning ability in AI systems1. The previous best was 45%, held by Google’s Gemini 3 Deep Think. Poetiq did it at half the cost. And they did it within hours of Gemini 3’s release, integrating the new model into their system and breaking the 50% barrier that researchers assumed was years away.
I used the plot they shared in their announcement blog as the banner image for this article. Look closely at how fast AI has progressed, and if you squint, you may even notice a real person in there. I digress.
What makes this breakthrough remarkable isn’t the score. It’s how they got there.
Reasoning Over Recall
Poetiq didn’t build a bigger model. They didn’t throw more compute at the problem. They built what they call a “meta-system”: a layer that wraps around existing frontier models and extracts more from them.
Their winning approach is grounded in a deep understanding of how modern attention-based LLMs work. They view LLMs as “amazing databases” containing much of humanity’s digitized knowledge. But that knowledge is fragmented, scattered across the model’s parameters in ways that naive prompting cannot reliably access. As Poetiq put it: “The prompt is an interface, not the intelligence.”2
Their system works through refinement loops (they open-sourced their AI harness on GitHub)3. Generate a solution, receive feedback, analyze the feedback, and improve the answer. Repeat. This process improved Gemini 3 Pro’s baseline performance from 31% to 54% on the ARC-AGI-2 benchmark, not by adding new knowledge to the underlying model, but by improving its ability to extract and assemble the knowledge that was already there.
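To make the shape of such a loop concrete, here is a minimal, hypothetical sketch. This is not Poetiq’s harness: `call_model` and `feedback` are toy stand-ins (a real loop would call an LLM API and a verifier), but the generate → feedback → refine structure is the one described above.

```python
# Hypothetical refinement loop: generate, get feedback, fold the
# feedback back into the prompt, and try again.

def call_model(prompt: str, attempt: int) -> int:
    # Toy stand-in for an LLM call: each refinement "improves"
    # a numeric answer so the loop is runnable end to end.
    return min(100, 31 + attempt * 23)

def feedback(answer: int, target: int = 100) -> str:
    # Toy stand-in for a verifier / test harness.
    return "correct" if answer >= target else f"off by {target - answer}"

def refine_loop(task: str, max_iters: int = 5) -> int:
    answer = call_model(task, attempt=0)
    for attempt in range(1, max_iters + 1):
        fb = feedback(answer)
        if fb == "correct":
            break
        # Fold the feedback into the next prompt and regenerate.
        answer = call_model(f"{task}\nFeedback: {fb}", attempt=attempt)
    return answer

print(refine_loop("toy task"))  # → 100
```

The point of the sketch is the control flow, not the stubs: the model is called repeatedly, and each call sees a critique of the previous answer.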
The implication is significant: the bottleneck isn’t what models know. It’s how effectively we can get them to reason with what they know.
The Debate This Opens
Poetiq’s success lands in the middle of an ongoing debate in the AI community. Since late 2023, every expansion of context windows has prompted the same question: Do we still need retrieval? With models now handling 200K to 1 million tokens, why bother with RAG pipelines and vector databases? Why retrieve when you can dump everything into context? And that’s before even touching the growing discussion around in-context learning.
Yet here we are at the end of 2025, and retrieval is more alive than ever. Studies show RAG remains significantly cheaper for most workloads4. Long-context approaches can cost up to $20 per request. More importantly, research suggests that longer context and retrieval are synergistic, not competing. The real question isn’t which one wins. It’s when to use which, and how to combine them effectively.
But here’s what strikes me: we’re debating the architecture of these systems while most technical practitioners may not fully appreciate or understand the primitive that powers them.
The Primitive That Requires Deeper Understanding
Whether you’re building a RAG pipeline or trying to extract knowledge directly from an LLM, you’re relying on the same fundamental mechanism: vector similarity. This isn’t just about retrieval systems. It’s about how LLMs themselves store and recall information.
LLMs encode knowledge as dense, overlapping vector representations in high-dimensional latent spaces. When an LLM “remembers” something, it’s not pulling rows from a database. Since the “attention is all you need” era, transformer layers have implemented what’s essentially a key-value memory architecture5. The attention mechanism, the core of every modern language model, works by comparing query vectors against key vectors using similarity measures, then retrieving the associated values.
Query. Key. Value. Match by similarity. Retrieve.
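That loop can be sketched in a few lines. The example below is a toy illustration with made-up two-dimensional vectors, not production transformer code: one query is scored against each key by dot product, the scores are softmaxed, and the output is a weighted blend of the values, i.e. a soft retrieval.

```python
import math

# Attention as similarity-based key-value lookup, in miniature.

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def softmax(scores):
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attend(query, keys, values):
    # Similarity of the query to each key, scaled by sqrt(d)
    # as in the transformer's scaled dot-product attention.
    scale = math.sqrt(len(query))
    weights = softmax([dot(query, k) / scale for k in keys])
    # Weighted sum of values: a soft, differentiable retrieval.
    dim = len(values[0])
    return [sum(w * v[i] for w, v in zip(weights, values)) for i in range(dim)]

# A query aligned with the first key retrieves mostly the first value.
keys = [[1.0, 0.0], [0.0, 1.0]]
values = [[10.0, 0.0], [0.0, 10.0]]
print(attend([5.0, 0.0], keys, values))
```

Nothing here is exotic: strip away the softmax and you have a nearest-neighbor lookup, which is exactly why the same primitive reappears in vector databases.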
This is the same primitive that powers vector databases. The same primitive that makes RAG work. The same primitive that determines whether your semantic search returns relevant results or confidently wrong ones. Vector similarity is not just a technique for building retrieval systems. It’s the substrate of how these models organize and access knowledge.
And most people using these systems have never examined what “similar” means.
What follows gets technical. If you’re not a practitioner, feel free to skip to the conclusion. But if you build systems that touch AI in any capacity, this is where the leverage lives. Understanding the “Similarity” primitive is how you stop being a passenger.
What “Similar” Actually Means
When two documents share a topic, they might be “similar.” When they share a writing style, they might be “similar.” When they express the same sentiment, “similar” again. A medical research paper and a patient forum post might discuss the same disease, but share almost nothing else.
Similarity is not a fact you discover. It’s a choice you make: a lens for what to pay attention to and what to ignore. The math encodes this choice in ways that are easy to miss.
When you use cosine similarity, you’re measuring direction. Two vectors pointing the same way are “similar” regardless of how far apart they sit in space. When you use Euclidean distance, you’re measuring position. Two vectors close together are “similar” even if they point in different directions: same data, different answers.
Consider the words “dog” and “canine.” They should be nearly identical, not just directionally aligned. Cosine similarity can only tell you they point the same way; a small Euclidean distance tells you they occupy nearly the same point. Neither metric is wrong, but depending on your application, one better captures what you actually want.
For unit vectors, these metrics produce the same ranking. Minimizing Euclidean distance becomes mathematically equivalent to maximizing cosine similarity. But most embedding models don’t produce unit vectors by default. And whether you normalize, when you normalize, and how the model was trained all affect what the similarity scores mean.
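A small, self-contained example (plain Python, contrived vectors) makes both the disagreement and the unit-vector equivalence concrete. For unit vectors, the identity is ‖a − b‖² = 2 − 2·cos(a, b), so the two metrics induce the same ranking.

```python
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.hypot(*a) * math.hypot(*b))

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def normalize(v):
    n = math.hypot(*v)
    return [x / n for x in v]

# Same direction, different magnitude: cosine calls them identical,
# Euclidean says they are far apart.
a, b = [1.0, 2.0], [3.0, 6.0]
print(cosine(a, b))      # ≈ 1.0 (perfectly aligned)
print(euclidean(a, b))   # ≈ 4.47 (not close in space)

# After normalization the disagreement vanishes:
# ||a - b||^2 = 2 - 2*cos(a, b) for unit vectors.
print(euclidean(normalize(a), normalize(b)))  # ≈ 0.0
```

Whether that collapse of the two metrics is what you want depends entirely on whether magnitude carried meaning in your embedding space.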
Recent research from Netflix made this concrete: cosine similarity can produce arbitrary results depending on how the embedding model was trained6. Models trained with dot-product objectives have degrees of freedom that cosine similarity is sensitive to in unintended ways. The metric you use at inference should match the training objective. Many production systems get this wrong without knowing it.
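As a toy illustration of why the metric matters (my own contrived vectors, not the paper’s setup): when embedding magnitudes carry information, as they can under a dot-product training objective, dot product and cosine similarity can rank the same candidates in opposite orders.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def cosine(a, b):
    return dot(a, b) / (math.hypot(*a) * math.hypot(*b))

query = [1.0, 0.0]
cand_aligned = [0.9, 0.1]   # points almost exactly at the query
cand_long = [2.0, 2.0]      # less aligned, but large magnitude

# Cosine prefers the aligned candidate; dot product prefers the
# long one. Scoring with the "wrong" metric flips the ranking.
print(cosine(query, cand_aligned) > cosine(query, cand_long))  # True
print(dot(query, cand_aligned) > dot(query, cand_long))        # False
```

If the model was trained to make dot products meaningful, re-scoring with cosine at inference time silently changes which results come back first.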
Most practitioners inherit these choices by default. The tutorial used cosine similarity, so they did too. The library normalizes embeddings, so their embeddings are normalized. It works until it doesn’t. And when it doesn’t, the lack of foundational understanding makes diagnosis nearly impossible.
The difference between a retrieval system that works and one that silently fails often comes down to whether someone understood the primitive or inherited the default.
I wrote a longer piece exploring this in depth: The Geometry of Meaning: Vector Similarity from First Principles. It covers the math behind similarity metrics, when they disagree, and the practical implications for anyone building systems that depend on them.
The Questions Worth Sitting With
Poetiq’s breakthrough wasn’t about access to better models. It was about understanding models deeply enough to build intelligence on top of them. They knew what the models were good at, what they struggled with, and how to compensate for the gaps.
That understanding was the leverage. Everything else was execution.
The same principle applies whether you’re orchestrating reasoning loops across frontier models or choosing a similarity metric for a retrieval pipeline. Complex systems emerge from simple primitives. The people who understand the primitives, their assumptions and their limitations, are the ones who gain leverage over the larger systems.
So here are the questions I’d leave you with:
What primitives are you building on without fully understanding? When your system fails, do you have the foundation to diagnose why, or are you pasting symptoms into the latest LLM? Are you in the driver’s seat, or just along for the ride?
The tools will keep getting easier. The defaults will keep getting better. The abstractions will keep getting higher. Whether that makes you more capable or more dependent is a choice you’re making right now, whether you realize it or not.
If you’re interested in this kind of foundational thinking and learning new perspectives, you’ll find a lot of similar content on my blog. I wrote recently about a unified framework for Bayesian reasoning and ML. Same underlying theme: the fundamentals aren’t obstacles to practical work. They’re the source of leverage over it and the secret to staying in the driver’s seat.
Footnotes
1. ARC Prize Foundation. (2025). “ARC-AGI-2 Benchmark.” A benchmark specifically designed to measure genuine reasoning ability in AI systems, requiring novel problem-solving rather than pattern matching from training data. ↩
2. Poetiq. (2025). “Traversing the Frontier of Superintelligence.” https://poetiq.ai/posts/arcagi_announcement/ Poetiq’s philosophy of treating LLMs as knowledge databases, where the prompt serves as an interface to extract and assemble existing knowledge. ↩
3. Poetiq. (2025). “Poetiq ARC-AGI Solver.” https://github.com/poetiq-ai/poetiq-arc-agi-solver Open-source implementation of their refinement-loop system that achieved state-of-the-art results on ARC-AGI-2. ↩
4. Wang, L., et al. (2024). “RAG vs Long-Context LLMs: A Comprehensive Study.” arXiv:2407.16833. Research showing RAG remains significantly more cost-effective than long-context approaches for most retrieval workloads. ↩
5. Vaswani, A., et al. (2017). “Attention Is All You Need.” The foundational paper introducing the transformer architecture, in which attention mechanisms implement key-value memory retrieval through vector similarity. ↩
6. Steck, H., et al. (2024). “Is Cosine-Similarity of Embeddings Really About Similarity?” Netflix Research. arXiv:2403.05440. Analysis showing how cosine similarity can produce arbitrary results depending on embedding-model training objectives. ↩