The Data Paradox

GPT-4 is estimated to have been trained on roughly 15 trillion tokens (OpenAI has not published the exact figure): about 1.5 quadrillion bits, equivalent to thousands of human lifetimes of reading. That sounds like an enormous amount of information. And it is, if you’re thinking about information the way programmers think about it: as text to be processed.

But here’s what is missing from the narrative: a 4-year-old has already processed more data than GPT-4. Much more.

Consider what happens when you are awake. By some estimates, your visual cortex alone processes about 1 billion bits per second. Over four years, assuming 12 waking hours per day, that’s about 63 quadrillion bits just from vision. Add in the other senses and you’re likely approaching 65–70 quadrillion bits total, more than 40 times what we feed our largest language models.
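
The arithmetic behind that comparison is easy to check. Here is a minimal sketch in Python that takes the figures quoted above at face value. The bit rate, the waking hours, and the token-to-bit conversion are all rough estimates rather than measurements, and the variable names are purely illustrative.

```python
# Back-of-the-envelope check of the essay's comparison.
# Every input below is a rough estimate quoted in the text, not a measurement.

VISUAL_BITS_PER_SECOND = 1e9   # assumed visual input rate
WAKING_HOURS_PER_DAY = 12      # assumed waking time
DAYS_PER_YEAR = 365
YEARS = 4

LLM_TRAINING_TOKENS = 15e12    # assumed training-set size in tokens
LLM_TRAINING_BITS = 1.5e15     # the same training set expressed in bits

# Seconds a child has been awake by age four.
waking_seconds = YEARS * DAYS_PER_YEAR * WAKING_HOURS_PER_DAY * 3600

# Total visual input over that time.
visual_bits = VISUAL_BITS_PER_SECOND * waking_seconds

# How many times larger the child's visual input is than the training set.
ratio = visual_bits / LLM_TRAINING_BITS

# Conversion factor implied by the two training-set figures.
bits_per_token = LLM_TRAINING_BITS / LLM_TRAINING_TOKENS

print(f"waking seconds:  {waking_seconds:,}")      # 63,072,000
print(f"visual bits:     {visual_bits:.2e}")       # 6.31e+16
print(f"child / model:   {ratio:.0f}x")            # 42x
print(f"bits per token:  {bits_per_token:.0f}")    # 100
```

Vision alone comes out at roughly 42 times the training set, which is where the "more than 40 times" figure comes from. Note that the result is sensitive to the inputs: the 1.5 quadrillion bit figure implies about 100 bits per token, and a smaller conversion factor would make the gap wider still.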

The difference isn’t just in the volume of data. It’s in the type of learning that data enables. When a child sees a ball roll behind a couch, they learn something fundamental about object permanence and spatial reasoning. When an LLM reads “the ball rolled behind the couch,” it learns statistical patterns about word sequences.

This explains something that seems to puzzle many people: why language models can write sophisticated prose about physics but can’t figure out that a marble will fall if you let go of it. They’ve learned to manipulate symbols without understanding what those symbols refer to.

The child’s advantage isn’t just the amount of data; it’s that the data is the right kind. Every photon hitting their retina, every sensation of pressure in their fingertips, is grounded in physical reality. They’re not learning about the world; they’re learning from the world.

This is why I think the most interesting AI work today is multimodal. Those who figure out how to give their models genuine sensory experience—not just image classification, but the kind of rich, embodied interaction that lets you build causal models of reality—will have a significant advantage.

There’s another lesson here about learning efficiency. Humans are remarkably good at extracting understanding from a small number of experiences. Show a child a few examples of something and they generalise. LLMs need thousands of examples to learn the same patterns.

This suggests we’re not just facing a data problem, but an architecture problem. Current models learn by gradient descent over massive datasets. Humans learn through active exploration, hypothesis formation, and real-time feedback. These are fundamentally different approaches.

Those building the next generation of AI systems are probably thinking less about how to gather more text and more about how to create systems that learn the way children do: through rich, multimodal interaction with their environment.

The fact that a 4-year-old has processed more data, and done so more meaningfully, than our most sophisticated AI systems isn’t just an interesting observation. It’s a roadmap.