The prologue ended with a question. The kid was still pulling.
What IS intelligence?
Not “can it pass the bar exam.” Not “did it score 90 on the benchmark.” The thing underneath the word. You would think, given how long humans have been thinking about thinking, someone would have nailed this down by now. They have not. What they have done is more interesting: multiple independent traditions (separate research lineages, each with its own assumptions, methods, and vocabulary), starting from different axioms and asking different questions, circled the same territory and arrived at what looks like the same answer. They just stated it in different languages.
I want to walk through four of these traditions. They span information theory, psychometrics, neuroscience, and complex systems. They use different vocabularies. They rarely cite each other. And when you lay them side by side, they converge on a single operation.
I am not the first person to notice this. I am probably not even right about all of it. But a first-principles walk through old questions is always rewarding, even when the answer turns out to be the same one someone else already found. And when independent traditions converge, the convergence itself is worth taking seriously.
Before narrowing to four, I surveyed over thirty frameworks that have attempted to define, measure, or explain intelligence. They span information theory, psychometrics, neuroscience, cybernetics, embodied cognition, classical AI, philosophy of mind, and thermodynamics. The full survey is in the reference tables below. What follows is the distillation: four lenses that, between them, capture what every other framework is pointing at.
A Map of Intelligence Frameworks (33 frameworks across 8 traditions)
Information-Theoretic
| Framework | Summary | Key Reference |
|---|---|---|
| Kolmogorov Complexity (1965) | The complexity of data is the length of the shortest program that produces it; incompressible data is random by definition. | Kolmogorov, A.N. (1965). “Three Approaches to the Quantitative Definition of Information.” Problems of Information Transmission, 1(1), 1-7. |
| Solomonoff Induction (1964) | Optimal prediction assigns probability to hypotheses inversely by program length; compression and prediction are mathematically dual. | Solomonoff, R. (1964). “A Formal Theory of Inductive Inference.” Information and Control, 7(1), 1-22; 7(2), 224-254. |
| AIXI (2005) | The theoretically optimal agent combines Solomonoff compression with sequential decision-making to maximize expected reward. | Hutter, M. (2005). Universal Artificial Intelligence. Springer. |
| SP Theory (2013) | Intelligence is fundamentally information compression via pattern matching and unification across multiple representations. | Wolff, J.G. (2013). “The SP Theory of Intelligence: An Overview.” Information, 4(3), 283-341. |
| Universal Intelligence (2007) | Intelligence is an agent’s expected performance across all computable environments, weighted by the simplicity of each environment. | Legg, S. & Hutter, M. (2007). “Universal Intelligence: A Definition of Machine Intelligence.” Minds and Machines, 17(4), 391-444. |
Psychometric and Cognitive
| Framework | Summary | Key Reference |
|---|---|---|
| g Factor (1904) | A single general factor underlies performance across all cognitive tasks, suggesting a common underlying operation. | Spearman, C. (1904). “‘General Intelligence,’ Objectively Determined and Measured.” American Journal of Psychology, 15(2), 201-293. |
| Wechsler’s Definition (1939) | “The aggregate or global capacity of the individual to act purposefully, to think rationally, and to deal effectively with his environment.” | Wechsler, D. (1939). The Measurement of Adult Intelligence. Williams & Wilkins. |
| Developmental Schemas (1952) | Intelligence is progressive construction of cognitive schemas through assimilation (fitting new data to existing schemas) and accommodation (adjusting schemas to new data). | Piaget, J. (1952). The Origins of Intelligence in Children. International Universities Press. |
| Fluid and Crystallized (1963) | Fluid intelligence (novel problem-solving) and crystallized intelligence (accumulated knowledge) are distinct but correlated capacities. | Cattell, R.B. (1963). “Theory of Fluid and Crystallized Intelligence.” Journal of Educational Psychology, 54(1), 1-22. |
| Multiple Intelligences (1983) | Intelligence is not a single capacity but multiple independent faculties: linguistic, logical-mathematical, spatial, musical, bodily-kinesthetic, interpersonal, intrapersonal, naturalist. | Gardner, H. (1983). Frames of Mind. Basic Books. |
| Triarchic Theory (1985) | Intelligence comprises analytical, creative, and practical components; practical intelligence adds a real-world deployment dimension. | Sternberg, R.J. (1985). Beyond IQ: A Triarchic Theory of Human Intelligence. Cambridge University Press. |
| On the Measure of Intelligence (2019) | Intelligence is the rate at which a system acquires skill on novel tasks given minimal prior knowledge; a conversion efficiency from experience to generalization. | Chollet, F. (2019). “On the Measure of Intelligence.” arXiv:1911.01547. |
Neuroscience and Adaptive
| Framework | Summary | Key Reference |
|---|---|---|
| Free Energy Principle (2010) | Biological agents minimize variational free energy, an upper bound on surprise; perception and learning are forms of model optimization. | Friston, K. (2010). “The Free-Energy Principle: A Unified Brain Theory?” Nature Reviews Neuroscience, 11(2), 127-138. |
| Active Inference (2017) | Perception (updating models) and action (changing the world) form a single loop; both minimize prediction error. | Friston, K. et al. (2017). “Active Inference: A Process Theory.” Neural Computation, 29(1), 1-49. |
| Predictive Processing (2013) | The brain is a hierarchical prediction machine; cognition is the ongoing minimization of prediction errors across multiple levels. | Clark, A. (2013). “Whatever Next? Predictive Brains, Situated Agents, and the Future of Cognitive Science.” Behavioral and Brain Sciences, 36(3), 181-204. |
Systems, Cybernetics, and Emergence
| Framework | Summary | Key Reference |
|---|---|---|
| Cybernetics (1948) | Intelligence is inseparable from communication, feedback, and control; circular causality between agent and environment is fundamental. | Wiener, N. (1948). Cybernetics: Or Control and Communication in the Animal and the Machine. MIT Press. |
| Requisite Variety (1956) | A controller must have at least as much internal variety as the environment it regulates; effective intelligence requires matching environmental complexity. | Ashby, W.R. (1956). An Introduction to Cybernetics. Chapman & Hall. |
| Society of Mind (1986) | Intelligence emerges from interaction of many simple, specialized agents; no single agent is intelligent, but their society is. | Minsky, M. (1986). The Society of Mind. Simon & Schuster. |
Embodied and Enactive
| Framework | Summary | Key Reference |
|---|---|---|
| Skill Acquisition (1972) | Expert performance is embodied know-how that resists formalization; skilled intelligence operates without rule-following or explicit representations. | Dreyfus, H.L. (1972). What Computers Can’t Do. Harper & Row. |
| Enactive Cognition (1991) | Cognition is not representation of an independent world but the enactment of a world through embodied interaction; rejects the representation assumption. | Varela, F.J., Thompson, E. & Rosch, E. (1991). The Embodied Mind. MIT Press. |
| Intelligence without Representation (1991) | Intelligent behavior arises from layered reactive behaviors coupled directly to the environment; “the world is its own best model.” | Brooks, R.A. (1991). “Intelligence without Representation.” Artificial Intelligence, 47(1-3), 139-159. |
Classical AI and Computation
| Framework | Summary | Key Reference |
|---|---|---|
| Turing Test (1950) | A machine is intelligent if its behavior is indistinguishable from a human’s in conversation; a behavioral test, not a definition of mechanism. | Turing, A.M. (1950). “Computing Machinery and Intelligence.” Mind, 59(236), 433-460. |
| Physical Symbol System (1976) | A physical symbol system has the necessary and sufficient means for general intelligent action; intelligence is symbol manipulation. | Newell, A. & Simon, H.A. (1976). “Computer Science as Empirical Inquiry: Symbols and Search.” Communications of the ACM, 19(3), 113-126. |
| NARS / AIKR (2019) | Intelligence is the capacity for adaptation under insufficient knowledge and resources; real intelligence operates under severe computational and informational constraints. | Wang, P. (2019). “On Defining Artificial Intelligence.” Journal of Artificial General Intelligence, 10(2), 1-37. |
| Patternism (2006) | Mind is a set of patterns in a complex system; intelligence is the ability to recognize and exploit patterns in the world and in itself. | Goertzel, B. (2006). The Hidden Pattern. BrownWalker Press. |
Philosophy of Mind and Consciousness
| Framework | Summary | Key Reference |
|---|---|---|
| Chinese Room (1980) | Syntactic manipulation of symbols is insufficient for semantic understanding; a system can simulate intelligence without possessing it. | Searle, J.R. (1980). “Minds, Brains, and Programs.” Behavioral and Brain Sciences, 3(3), 417-457. |
| Emperor’s New Mind (1989) | Mathematical understanding involves non-computable processes; Gödel’s incompleteness theorem limits what machine intelligence can achieve. | Penrose, R. (1989). The Emperor’s New Mind. Oxford University Press. |
| Integrated Information Theory (2004) | Consciousness is measured by integrated information (Φ); a system can be a perfect compressor with Φ = 0. Addresses consciousness, not intelligence. | Tononi, G. (2004). “An Information Integration Theory of Consciousness.” BMC Neuroscience, 5, 42. |
| Global Workspace Theory (1988) | Consciousness arises from global broadcasting of information across specialized brain modules; explains conscious access, not intelligence per se. | Baars, B.J. (1988). A Cognitive Theory of Consciousness. Cambridge University Press. |
Thermodynamic (Covered in Part 2)
| Framework | Summary | Key Reference |
|---|---|---|
| What is Life? (1944) | Living organisms maintain low internal entropy by importing low-entropy energy and exporting high-entropy waste; life is organized dissipation. | Schrödinger, E. (1944). What is Life? Cambridge University Press. |
| Dissipative Structures (1977) | Systems far from equilibrium spontaneously create organized structures that accelerate entropy production; order arises because of the second law, not despite it. | Prigogine, I. (1977). Nobel Lecture. Published as “Time, Structure, and Fluctuations.” Science, 201(4358), 777-785 (1978). |
| Irreversibility and Computation (1961) | Erasing one bit of information necessarily dissipates at least kT ln 2 joules of heat; computation is a physical, thermodynamic process. | Landauer, R. (1961). “Irreversibility and Heat Generation in the Computing Process.” IBM Journal of Research and Development, 5(3), 183-191. |
| Dissipation-Driven Adaptation (2013) | Groups of atoms driven by an external energy source tend to self-organize into configurations that dissipate energy more efficiently; this tendency precedes biology. | England, J.L. (2013). “Statistical Physics of Self-Replication.” Journal of Chemical Physics, 139(12), 121923. |
The Compression Lens
Start with what might be the oldest deep question in computer science: given a string of data, what is the shortest program that produces it?
The length of that shortest program, the Kolmogorov complexity of the data, draws the line between structure and noise. Data that can be generated by a short program has pattern, regularity, something worth capturing. Data that cannot be compressed at all is, by definition, random.
Solomonoff connected this to intelligence in 1964: the optimal way to predict the next observation in a data stream is to weight your predictions by the inverse of program length. Shorter programs get more weight. This means compression and prediction are not just related. They are mathematically dual. A system that compresses well predicts well, and vice versa. Not metaphor. Provable.
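Kolmogorov complexity itself is uncomputable, but any real compressor gives a crude upper bound, and even that crude proxy makes both the structure/noise line and the compression-prediction duality concrete. A minimal sketch in Python, using zlib as the stand-in compressor (exact byte counts will vary across zlib builds):

```python
import random
import zlib

def compressed_size(data: bytes) -> int:
    """Length of zlib-compressed data: a crude, computable
    upper bound on Kolmogorov complexity (which is uncomputable)."""
    return len(zlib.compress(data, 9))

# Structured data: a short program ("repeat 'ab' 500 times") explains it.
structured = b"ab" * 500
# Random data: by definition, no description shorter than itself.
random.seed(0)
noise = bytes(random.randrange(256) for _ in range(1000))

print(compressed_size(structured))   # small: the pattern compresses away
print(compressed_size(noise))        # ~1000: incompressible, i.e. random

# Solomonoff's duality in miniature: predict the next symbol by asking
# which continuation keeps the whole sequence most compressible.
def predict_next(history: bytes, alphabet: bytes = b"ab") -> str:
    best = min(alphabet, key=lambda s: compressed_size(history + bytes([s])))
    return chr(best)

print(predict_next(b"ab" * 500))     # typically 'a': it continues the pattern
```

The predictor works because compression and prediction are the same bet made in two directions: whatever continuation the compressor finds cheapest is the one the model behind it considers most likely.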
Hutter extended this into AIXI[^1], a theoretical agent that makes optimal decisions by combining Solomonoff’s compression with sequential decision-making. AIXI is uncomputable (you cannot build it), but it serves as a ceiling: the best an agent can do is find the simplest model consistent with all observations and act on it. Wolff pushes the same thread further with his SP Theory[^2], arguing explicitly that intelligence, at its core, is information compression through pattern matching and unification. He calls compression the “double helix of intelligence.”
If you have read Modeling and Compression on this blog, you have seen a version of this: models are exquisitely fussy compressors. Feature selection is curation. Regularization is compression discipline. The best models know what to forget.
What this lens tells us: intelligence produces compact, efficient encodings of the regularities in the world.
The Generalization Lens
François Chollet noticed something important: we confuse skill with intelligence constantly.
A chess engine that plays at grandmaster level is not necessarily intelligent in any general sense. It is skilled at chess. A language model that writes passable legal briefs is skilled at generating text that looks like legal briefs. Skill is an output. Intelligence is the machinery that produces it. And the most telling property of that machinery is not how well it performs on any single task, but how efficiently it converts limited experience into competence on new tasks it has never seen.
Chollet’s definition, from his 2019 paper “On the Measure of Intelligence”: intelligence is the rate at which a system acquires skill on novel tasks, given minimal prior knowledge. A conversion ratio. How much performance per unit of experience? How far does past learning travel?
This is grounded in the same mathematical substrate as Solomonoff’s work: algorithmic information theory. Chollet builds on it explicitly. Compressed representations capture general structure rather than specific instances. Generalization is what happens when compression works. Memorization is what happens when it does not.
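The memorization/generalization split is easy to see in a toy experiment. A sketch assuming numpy, with a target function and polynomial degrees of my own choosing (an illustration, not anything from Chollet’s paper): a compact cubic transfers to novel inputs, while a polynomial with one coefficient per data point reproduces the training set and fails everywhere else.

```python
import numpy as np

rng = np.random.default_rng(0)

def world(x):
    """The hidden regularity the models are trying to capture."""
    return np.sin(2 * np.pi * x)

x_train = rng.uniform(0, 1, size=12)
y_train = world(x_train) + rng.normal(0, 0.1, size=12)   # limited, noisy experience
x_novel = rng.uniform(0, 1, size=200)                    # contexts never seen

for degree in (3, 11):
    coeffs = np.polyfit(x_train, y_train, degree)
    train_mse = np.mean((np.polyval(coeffs, x_train) - y_train) ** 2)
    novel_mse = np.mean((np.polyval(coeffs, x_novel) - world(x_novel)) ** 2)
    print(f"degree {degree:2d}: train MSE {train_mse:.4f}, novel MSE {novel_mse:.4f}")

# degree 3 is a compact encoding of the regularity: it transfers.
# degree 11 (one coefficient per point) memorizes the noise: near-zero
# training error, large error on novel inputs.
```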
Legg and Hutter[^3], in what became the most cited formal definition of machine intelligence, push toward the question Chollet does not quite ask: intelligence as expected performance across all computable environments, weighted by simplicity. This adds a goal dimension the compression lens alone misses. Intelligence is not just about capturing regularities. It is about deploying them to achieve objectives across diverse contexts. Pei Wang’s work on adaptive reasoning[^4] makes a related point from yet another direction: real intelligence operates under severe limits on time, memory, and knowledge. Compression is not a luxury. It is a survival necessity.
What this lens tells us: intelligence is measured by how well captured structure transfers to contexts beyond its origin. And transfer, ultimately, is in service of action.
The Prediction Lens
Karl Friston’s Free Energy Principle[^5] arrives from neuroscience, not information theory.
The core idea: an intelligent agent minimizes the difference between what it predicts and what it observes. In information-theoretic terms, it minimizes surprise. An agent that consistently fails to predict its environment will not persist for long. At the limit, prediction failure is death.
This gives intelligence a survival logic the compression lens alone lacks. Solomonoff tells you that compression is optimal for prediction. Friston tells you why agents compress: because prediction failure is existentially costly.
The mechanism Friston proposes, predictive processing, works through a hierarchy of top-down predictions and bottom-up error signals. When sensory input conflicts with predictions, error propagates upward. The system resolves it by either updating its model (perception) or acting on the world to make reality match its predictions (action). Both serve the same function: minimizing surprise. This second route, acting on the world, is what Friston calls active inference[^6], and it matters. Intelligence is not passive model-building. It is a loop of capturing and deploying, of perceiving and acting.
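Here is a deliberately minimal caricature of that loop, not Friston’s variational machinery: an agent holds a single scalar belief about the world and can shrink its prediction error through either route, updating the belief toward the observation (perception) or pushing the world toward the belief (action).

```python
class World:
    def __init__(self, state: float):
        self.state = state

class Agent:
    def __init__(self, belief: float, rate: float = 0.2):
        self.belief = belief      # the agent's prediction of the world's state
        self.rate = rate

    def perceive(self, observation: float) -> None:
        """Route 1: update the model to better match the world."""
        self.belief += self.rate * (observation - self.belief)

    def act(self, world: World) -> None:
        """Route 2: change the world to better match the model."""
        world.state += self.rate * (self.belief - world.state)

world = World(state=20.0)         # say, room temperature
agent = Agent(belief=24.0)        # the agent predicts (and so prefers) 24

for step in range(10):
    agent.perceive(world.state)   # perception drags belief toward reality
    agent.act(world)              # action drags reality toward belief
    error = abs(agent.belief - world.state)
    print(f"step {step}: belief={agent.belief:.2f} "
          f"world={world.state:.2f} error={error:.3f}")

# The prediction error shrinks every step; belief and world meet in the
# middle. Both routes minimize the same quantity.
```

The point of the toy is the symmetry: perception and action are the same error-reduction move applied to opposite sides of the equation.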
From cybernetics, Ashby’s Law of Requisite Variety[^7] arrives at a related insight from a different angle: a controller must have at least as much internal variety as the environment it regulates. This creates a productive tension with compression. Compression reduces variety. Requisite variety demands it. A good intelligence must do both: compress to capture the essential structure, while maintaining enough internal variety to handle the environment’s complexity. The two constraints together define the sweet spot.
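Ashby’s law can be checked by brute force in a toy setting. The sketch below assumes a simple additive outcome model of my own choosing (not Ashby’s formalism): six possible disturbances hit a system, the regulator answers each with one of its available responses, and the outcome is disturbance plus response, modulo six. Perfect regulation, holding the outcome to a single value, is possible exactly when the regulator has as many responses as there are disturbances.

```python
import itertools

def best_regulation(n_disturbances: int, n_responses: int) -> int:
    """Fewest distinct outcomes achievable over all policies mapping
    each disturbance to one of the regulator's responses.
    Outcome model: outcome = (disturbance + response) % n_disturbances."""
    responses = range(n_responses)
    best = n_disturbances
    for policy in itertools.product(responses, repeat=n_disturbances):
        outcomes = {(d + r) % n_disturbances for d, r in enumerate(policy)}
        best = min(best, len(outcomes))
    return best

for n_responses in (2, 3, 6):
    achieved = best_regulation(6, n_responses)
    print(f"6 disturbances, {n_responses} responses -> "
          f"{achieved} outcome(s) at best")

# 2 responses -> 3 outcomes, 3 -> 2, 6 -> 1: outcome variety never drops
# below disturbance variety divided by regulator variety, which is
# Ashby's law in miniature.
```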
I should note that Friston’s framework has drawn criticism for being so general that any self-organizing system can be described as “minimizing free energy.” That criticism has teeth. But the specific mechanism, hierarchical prediction error minimization[^8], has generated testable neuroscience and real empirical support. The broad formalism may overreach. The specific predictions are real.
What this lens tells us: intelligence persists because failing to capture and deploy regularities is costly. Survival drives the operation.
The Emergence Lens
The first three lenses are formal definitions with mathematical content. This one is different. Emergence is not a definition of intelligence. It is an observation about how intelligence arises.
Three local rules (separation, alignment, cohesion) produce a murmuration of starlings. One update rule (gradient descent) applied to billions of parameters produces a system that writes coherent prose and solves differential equations. Mutation and selection, repeated over billions of years, produce organisms that model their ecological niches with staggering fidelity. None of these rules mention intelligence. None mention understanding. And yet the system-level behavior captures regularities, encodes them, and deploys them.
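The starling example is directly simulable. A bare-bones sketch assuming numpy, with coefficients and a neighborhood radius that are arbitrary tuning knobs rather than anything canonical: the update rule never mentions flocking, yet a shared heading emerges.

```python
import numpy as np

rng = np.random.default_rng(1)
pos = rng.uniform(0, 100, size=(50, 2))   # 50 birds in a 2-D world
vel = rng.normal(0, 1, size=(50, 2))

def step(pos, vel, radius=15.0, dt=0.5):
    """One tick of the three local rules. No global coordinator."""
    new_vel = vel.copy()
    for i in range(len(pos)):
        dist = np.linalg.norm(pos - pos[i], axis=1)
        near = (dist > 0) & (dist < radius)
        if not near.any():
            continue
        cohesion   = pos[near].mean(axis=0) - pos[i]    # drift toward neighbors
        alignment  = vel[near].mean(axis=0) - vel[i]    # match their heading
        separation = (pos[i] - pos[near]).sum(axis=0)   # avoid crowding
        new_vel[i] += 0.01 * cohesion + 0.05 * alignment + 0.002 * separation
    return pos + dt * new_vel, new_vel

for _ in range(200):
    pos, vel = step(pos, vel)

# Order parameter: the length of the mean unit heading.
# Near 0 means headings cancel (chaos); near 1 means a coherent flock.
headings = vel / np.linalg.norm(vel, axis=1, keepdims=True)
print(f"heading coherence after 200 steps: "
      f"{np.linalg.norm(headings.mean(axis=0)):.2f}")
```

Nothing stored anywhere in that code corresponds to “the flock”; the coherence lives entirely in the interaction.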
If you have read From Bird Flocks to Intelligence, you have seen this argument in detail. But I want to add something the earlier post did not address.
There is an entire tradition in cognitive science, the enactive and embodied school of Varela, Thompson, and Rosch[^9], and the “intelligence without representation” work of roboticist Rodney Brooks[^10], that rejects the idea that intelligence requires internal models at all. Brooks built robots that navigate real environments using layered reactive behaviors with no world model. His slogan: “the world is its own best model.” The skilled martial artist executing a throw, the jazz musician improvising over changes, the insect navigating by pheromone gradient: these systems act intelligently without consulting anything that resembles an explicit internal model.
This tradition matters. It is the strongest challenge to any convergence thesis built on “model-building.” But I think it belongs here, under the emergence lens, because it is making a closely related point: intelligence arises from interaction between a system and its environment, not from top-down design. The martial artist’s body has been shaped by thousands of hours of practice. That shaping is a form of regularity capture: the physics of combat, distilled through repetition into refined motor programs that generalize across opponents. Whether you call that a “representation” or “embodied know-how” is partly a vocabulary dispute. The operation underneath, capturing regularities and deploying them in new contexts, looks the same from either side.
What this lens tells us: intelligence arises from interaction at sufficient scale. It does not require a designer. It does not require explicit internal representations. The structure can be distributed, embodied, implicit. And it did not need anyone’s permission to appear.
The Convergence
Four lenses. Four starting points. Four vocabularies. What do they share? Each one, stripped to its core, describes the same operation:
Intelligence is the capacity to distill regularities into compact structure and deploy that structure in new contexts.
The four lenses decompose this into its fundamental aspects. What, how well, why, and how. Four questions about one thing:
- Compression describes what the operation produces: compact structure, short programs, efficient encodings.
- Generalization describes how well the operation transfers: deployment efficiency across novel contexts.
- Prediction describes why the operation persists: because failure to capture and deploy is existentially costly.
- Emergence describes how the operation arises: from interaction at sufficient scale, without requiring a designer.
I want to be honest about two things. First, some of this convergence is by construction. Chollet builds explicitly on Solomonoff’s mathematical framework. They were not developed in isolation. The stronger evidence comes from the agreement between the information-theoretic tradition and Friston’s neuroscience tradition, which developed independently and arrived at compatible conclusions from completely different starting axioms. Second, there is a risk that “distilling regularities into compact structure” is broad enough that everything qualifies. A thermostat captures a regularity (temperature) and deploys it (turns on heat). Is a thermostat intelligent? On this framework: yes, minimally. It sits at a vanishingly low position on what Part 3 will argue is a continuous spectrum. I consider that a feature, not a bug. But the risk of vacuity is real, and I would rather name it than hide from it.
The Question They Do Not Answer
All four lenses describe the same operation. They converge on what intelligence is, how well it transfers, why it persists, and how it arises. But none of them answers the deeper question. Why does this operation exist at all?
Not “why is it useful” or “why does natural selection favor it.” Something more fundamental. In a universe that is, by every account we have, winding down into uniform nothing, dissipating every gradient, flattening every structure, why does matter keep arranging itself into configurations that capture regularities and deploy them? What are these configurations for? The four lenses describe the shape of intelligence. They say nothing about why the universe keeps producing it. For that, you need a different kind of answer. One that starts, of all places, with thermodynamics.
Part 2 follows the thread.
For the information-theoretic deep dive, see Modeling and Compression. For emergence in detail, From Bird Flocks to Intelligence. For where sophistication peaks between order and chaos, Complextropy and Complexodynamics.
Footnotes
[^1]: Hutter, M. (2005). Universal Artificial Intelligence: Sequential Decisions Based On Algorithmic Probability. Springer.

[^2]: Wolff, J.G. (2013). “The SP Theory of Intelligence: An Overview.” Information, 4(3), 283-341.

[^3]: Legg, S. & Hutter, M. (2007). “Universal Intelligence: A Definition of Machine Intelligence.” Minds and Machines, 17(4), 391-444.

[^4]: Wang, P. (2019). “On Defining Artificial Intelligence.” Journal of Artificial General Intelligence, 10(2), 1-37.

[^5]: Friston, K. (2010). “The Free-Energy Principle: A Unified Brain Theory?” Nature Reviews Neuroscience, 11(2), 127-138.

[^6]: Friston, K. et al. (2017). “Active Inference: A Process Theory.” Neural Computation, 29(1), 1-49.

[^7]: Ashby, W.R. (1956). An Introduction to Cybernetics. Chapman & Hall.

[^8]: Clark, A. (2013). “Whatever Next? Predictive Brains, Situated Agents, and the Future of Cognitive Science.” Behavioral and Brain Sciences, 36(3), 181-204.

[^9]: Varela, F.J., Thompson, E. & Rosch, E. (1991). The Embodied Mind. MIT Press.

[^10]: Brooks, R.A. (1991). “Intelligence without Representation.” Artificial Intelligence, 47(1-3), 139-159.