The Brain Is Not a Black Box Anymore

On TRIBE v2, the strange mirror Meta just held up to the human mind — and what a video game taught me about why that matters.

Published April 2026

There are moments in the history of science that are easy to miss precisely because they don't announce themselves the way we expect them to. No press conference. No breathless keynote. No carefully staged demo for a crowd of ten thousand. Just a paper, a date — March 25, 2026 — and a set of results so quietly extraordinary that you have to sit with them for a while before the full weight of what you're reading becomes clear.

I almost missed TRIBE v2. I suspect most people did. We are living, after all, in the middle of a period of such relentless model releases, benchmark announcements, and industry reshufflings — Anthropic's Mythos making everyone recalibrate, the race accelerating in every direction at once — that it has become genuinely difficult to separate the significant from the spectacular. But sometimes the most significant thing in the room is also the quietest. And TRIBE v2 is very, very quiet. It is not trying to talk like you. It is not trying to think like you. It is trying to understand what happens inside your skull when you simply watch a film.

That distinction, I want to argue, is one of the most important ones in the whole landscape of contemporary AI. And I want to get at why — not just technically, but philosophically — by making a detour through something that might seem entirely unrelated: a video game called Faceminer, which I played some weeks ago and haven't been able to stop thinking about since.

What Faceminer taught me about the unmappable

Faceminer is a small game. That's part of what makes it devastating. The premise is simple enough: you are an analyst. You have data. The data is about people — faces, patterns, identities compressed into numbers. And as you play, the scale of what that compression implies slowly dawns on you. There are no explosions in Faceminer. There is no narrative in any conventional sense. What there is, instead, is a growing, creeping sensation that the most intimate thing about a person — the particular configuration of their experience, their recognition, their self — can be represented. Can be encoded. Can be predicted.

It left me with a question I couldn't shake: how much of us is actually mappable? Not in a dystopian surveillance sense, but in a deeper, older sense. When we say that someone is irreducible — that human experience exceeds any model of it — are we asserting a philosophical truth, or are we simply describing the limits of our current tools?

"The history of neuroscience is, in many ways, a history of subdivision. We map the part of the brain that loves faces. We map the part that recognizes written words. We map and map and map — and what we are left with is a beautiful atlas of fragments."

TRIBE v2 makes this question urgent in a way I wasn't prepared for.

The problem that TRIBE v2 is solving

To appreciate what Meta's FAIR team has built, you have to first understand the state of the field they entered. Cognitive neuroscience has, for most of its history, operated through a kind of magnificent reductionism. Researchers would design a controlled experiment (show subjects images of faces, or play them isolated words, or have them solve a specific kind of problem) and then use fMRI, functional magnetic resonance imaging, to see which regions of the brain lit up. Over decades, this painstaking "divide-and-conquer" approach produced a rich map: the fusiform face area responds to faces, the visual word-form area processes written characters, the temporo-parietal junction handles social cognition. Each discovery was real, hard-won, important.

But it was also inherently fragmentary. The brain, when you're watching a film with your partner, eating dinner, laughing at something, feeling a passing sadness — that brain is not running one neat experiment at a time. It is integrating everything simultaneously: the sound, the image, the language, the memory, the anticipation. The fragmented study of individual functions could not capture that integration. What was needed was not a better map of one province. What was needed was an entirely different kind of cartography.

TRIBE v2 — architecture at a glance

Modalities: Video, audio, language (simultaneously)
Training data: 1,117 hours of fMRI · 720 subjects · 5,094 sessions
Video model: V-JEPA 2 (frozen), self-supervised video understanding
Audio model: Wav2Vec-BERT 2.0 (frozen)
Language model: Llama 3.2-3B (frozen)
Trainable core: 1B-parameter Transformer that learns to aggregate across time
Output: Predicted fMRI activity at ~29,000 cortical & subcortical targets
Competition result: 1st place, Algonauts 2025, out of 263 teams

The architecture of TRIBE v2 reflects this ambition with a kind of structural elegance. Three of the most capable pretrained AI models in the world — one for video, one for audio, one for language — are held frozen, used as feature extractors, their representations fed into a single trainable Transformer that learns, over time, to weave all three streams together and predict what a particular brain region will do next. The frozen models are the sensory organs. The Transformer is the mind learning to integrate them. And the target is fMRI data — the blood-oxygen-level-dependent signal that proxies for neural activity across the whole brain, at roughly 29,000 locations simultaneously.
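For readers who think in code, the data flow in that paragraph can be sketched in a few lines. This is a toy stand-in under loud assumptions, not the paper's model: the random feature lists are placeholders for what the frozen extractors would produce, every size is invented, and a single linear readout takes the place of the trainable Transformer. It shows only the shape of the idea: three streams aligned in time, joined per time step, and mapped to one predicted value per brain target.

```python
import random

def fuse(video, audio, text):
    """Concatenate per-time-step features from the three frozen streams."""
    if not (len(video) == len(audio) == len(text)):
        raise ValueError("all streams must cover the same time steps")
    return [v + a + t for v, a, t in zip(video, audio, text)]

def linear_readout(fused, weights):
    """Toy replacement for the trainable Transformer.

    weights[k] is the weight vector for brain target k; the result holds
    one predicted value per target for every time step.
    """
    return [[sum(w * x for w, x in zip(w_k, step)) for w_k in weights]
            for step in fused]

def random_features(rng, steps, dim):
    """Placeholder for features a frozen extractor would produce."""
    return [[rng.random() for _ in range(dim)] for _ in range(steps)]

# Hypothetical sizes, far smaller than anything in the real system.
STEPS, DIM_VIDEO, DIM_AUDIO, DIM_TEXT, N_TARGETS = 4, 3, 2, 2, 5

rng = random.Random(0)
video = random_features(rng, STEPS, DIM_VIDEO)
audio = random_features(rng, STEPS, DIM_AUDIO)
text = random_features(rng, STEPS, DIM_TEXT)

fused = fuse(video, audio, text)
weights = [[rng.gauss(0, 1) for _ in range(len(fused[0]))]
           for _ in range(N_TARGETS)]
predicted = linear_readout(fused, weights)

print(len(predicted), "time steps x", len(predicted[0]), "brain targets")
```

The real system replaces the linear readout with a model that can look across time, which is exactly the part a sketch this small cannot capture.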

The scale of the training data alone is worth pausing on. Over 1,000 hours of fMRI recordings. 720 subjects. People watching movies, listening to podcasts, seeing flashed objects, reading sentences. The breadth of that data is what allows the model to begin doing something that prior systems simply couldn't: generalising. Not just predicting the brain activity of people it has already seen, but predicting the brain responses of people it has never encountered, to stimuli it has never processed. Zero-shot generalisation — the model's ability to predict a brand new brain's responses with no prior exposure to that individual — is perhaps TRIBE v2's most quietly staggering capability.
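What makes a test "zero-shot" in this sense is mostly a matter of how the data is split. A minimal sketch, assuming recordings are simple records tagged with a subject and a stimulus (the field names and labels here are hypothetical, not the paper's): the evaluation set contains only held-out people responding to held-out stimuli, and anything touching either is kept out of training.

```python
def zero_shot_split(recordings, held_out_subjects, held_out_stimuli):
    """Split so that test subjects and test stimuli never appear in training."""
    train = [r for r in recordings
             if r["subject"] not in held_out_subjects
             and r["stimulus"] not in held_out_stimuli]
    test = [r for r in recordings
            if r["subject"] in held_out_subjects
            and r["stimulus"] in held_out_stimuli]
    return train, test

# Hypothetical records for illustration only.
recordings = [
    {"subject": "s01", "stimulus": "movie_a"},
    {"subject": "s01", "stimulus": "podcast_b"},
    {"subject": "s02", "stimulus": "movie_a"},
    {"subject": "s02", "stimulus": "movie_c"},  # seen subject, unseen stimulus: dropped
    {"subject": "s03", "stimulus": "movie_c"},  # unseen subject, unseen stimulus: test
]

train, test = zero_shot_split(recordings, {"s03"}, {"movie_c"})

train_subjects = {r["subject"] for r in train}
test_subjects = {r["subject"] for r in test}
train_stimuli = {r["stimulus"] for r in train}
test_stimuli = {r["stimulus"] for r in test}

print(len(train), "training recordings,", len(test), "zero-shot test recordings")
```

Records that mix a seen subject with an unseen stimulus, or the reverse, belong to neither set in this sketch; they would test a weaker form of generalisation.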

In-silico experimentation, or: the library that writes its own books

Here is where it gets genuinely strange, and where I want to slow down, because I think this is the part that most coverage will rush past in its eagerness to move on to the next benchmark.

TRIBE v2 doesn't just predict brain activity under naturalistic conditions. It enables what the researchers call in-silico experimentation — the ability to run neuroscience experiments without a human in the scanner at all. Want to know which brain areas respond to faces versus places? You don't need to recruit subjects, design a protocol, apply for scanner time, wait months for ethics approval, and then spend days analysing the resulting data. You can ask TRIBE v2. Flash it the images. Observe its predictions. And then compare those predictions against the decades of empirical literature.

The researchers did exactly this. They tested TRIBE v2 on classic functional localizer experiments from the Individual Brain Charting dataset — experiments designed to identify, with precision, the brain areas involved in face recognition, place processing, body perception, written language, emotional processing, and syntactic complexity. TRIBE v2 recovered all of them. The fusiform face area lit up for faces. The parahippocampal place area responded to places. The visual word-form area activated for written characters. The temporo-parietal junction distinguished emotional from physical pain. These are not guesses or approximations — the spatial correlations between TRIBE v2's predictions and the actual experimental results were statistically significant and, in several cases, remarkably precise.
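The comparison behind a claim like that is simple to state. A minimal sketch, assuming each condition yields one predicted value per brain location: subtract the predicted map for places from the predicted map for faces, then compute the Pearson correlation between that contrast and a measured contrast map. The six-location maps below are made-up numbers for illustration, not values from the paper or from any dataset.

```python
import math

def contrast(map_a, map_b):
    """Location-by-location difference between two condition maps."""
    return [a - b for a, b in zip(map_a, map_b)]

def pearson(x, y):
    """Pearson correlation between two equal-length lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    spread_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    spread_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (spread_x * spread_y)

# Made-up predicted responses at six locations (illustration only).
predicted_faces = [0.9, 0.8, 0.2, 0.1, 0.3, 0.2]
predicted_places = [0.2, 0.1, 0.9, 0.7, 0.3, 0.2]

# Made-up "measured" faces-minus-places contrast at the same locations.
measured_contrast = [0.6, 0.8, -0.5, -0.7, 0.1, -0.1]

predicted_contrast = contrast(predicted_faces, predicted_places)
r = pearson(predicted_contrast, measured_contrast)

print(f"spatial correlation: {r:.2f}")
```

A value near 1 means the predicted map puts its peaks and troughs where the measured map does; it says nothing, on its own, about whether the magnitudes agree.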

"TRIBE v2 is recovering, from a model trained entirely on movies and podcasts, the results of controlled experiments it has never seen. It is, in some sense, rediscovering neuroscience from scratch."

Think about what that means. A model trained on naturalistic viewing (people watching Friends, BBC nature documentaries, and a Bourne film) can, without any additional data, replicate experiments that took the field decades to establish. The implicit knowledge of how the brain works, which is encoded in the statistical patterns of how brains respond to rich, real-world stimuli, turns out to contain the results of isolated, controlled experiments as a kind of subset. The general contains the specific. The naturalistic contains the experimental.

This is the library that writes its own books. Or rather: this is what happens when you have a model comprehensive enough that the questions you want to ask of the brain become, in some sense, already answered within it.

The colour of the mind: multisensory integration made visible

One of the most beautiful results in the paper is also one of the most visually arresting. To understand how the three modalities — video, audio, language — contribute to the brain's overall encoding, the researchers did something simple and ingenious: they assigned each modality a colour. Red for text. Blue for video. Green for audio. Then they mapped, across the entire cortical surface, which modality's encoding score was highest at each location. The result is a colour-coded brain — and it is, quite literally, a map of how your senses are spatially arranged inside your skull.

The occipital and parietal regions are deeply blue: dominated by video, as you'd expect from regions associated with visual processing. The auditory cortex is green. The language processing regions and large swaths of the prefrontal cortex go red, dominated by text, which, the researchers note, likely contains the most semantic information of the three modalities. And then the mixing happens. Yellow (text plus audio) appears in the superior temporal lobe, where language and sound converge. Cyan (video plus audio) shows up in the ventral and dorsal visual cortices and in the hippocampus, areas where what you see and what you hear are bound into a single memory, a single moment.
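One plausible way to produce such a map, offered as a sketch rather than as the authors' procedure: treat the three per-location encoding scores as colour channels (red for text, green for audio, blue for video) and scale them by the strongest of the three. The blending rule and the example scores are assumptions made for illustration.

```python
def modality_colour(text, video, audio):
    """Return an (R, G, B) tuple for one brain location.

    Channel assignment follows the essay: red = text, green = audio,
    blue = video. Scaling by the strongest score is an assumed rule.
    """
    peak = max(text, video, audio)
    if peak <= 0:
        return (0, 0, 0)  # nothing encodes this location

    def channel(score):
        return round(255 * max(score, 0.0) / peak)

    return (channel(text), channel(audio), channel(video))

# Hypothetical scores for four kinds of location.
print(modality_colour(text=0.0, video=0.4, audio=0.0))  # video only
print(modality_colour(text=0.3, video=0.0, audio=0.3))  # text plus audio
print(modality_colour(text=0.0, video=0.2, audio=0.2))  # video plus audio
print(modality_colour(text=0.5, video=0.0, audio=0.0))  # text only
```

Under this rule a location driven equally by text and audio comes out yellow, and one driven equally by video and audio comes out cyan, which matches the mixtures described above.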

The largest gains from combining modalities, up to fifty percent improvement in encoding score over the best single-modality encoder, are clustered around the temporo-parieto-occipital junction, a region long understood to be where the brain's streams of sensory information converge and are stitched together into unified experience. TRIBE v2 didn't know this. It learned it. The neuroscience fell out of the data.
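It helps to be concrete about what a "fifty percent improvement" is measured against. A minimal sketch, assuming the gain is computed relative to the best single-modality score at a given location; the scores are invented for illustration and the function name is my own.

```python
def multimodal_gain(multimodal_score, unimodal_scores):
    """Relative improvement of the multimodal encoder over the best
    single-modality encoder at one location."""
    best_single = max(unimodal_scores.values())
    if best_single <= 0:
        raise ValueError("gain is undefined when no single modality scores above zero")
    return (multimodal_score - best_single) / best_single

# Invented scores for one location (illustration only).
unimodal = {"text": 0.20, "video": 0.15, "audio": 0.10}
gain = multimodal_gain(0.30, unimodal)

print(f"gain over best single modality: {gain:.0%}")
```

Note that the baseline is the best single modality, not the average of the three, so a large gain means the combination knows something none of the streams knows alone.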

Back to Faceminer. Back to the question.

Playing Faceminer, the thing that unsettles you isn't the surveillance. It isn't the privacy violation, or the scale of the database, or any of the familiar dystopian registers we've been trained to reach for. What unsettles you is the compression. The fact that something as irreducibly particular as a face — as a person's face, which you recognise instantly and which carries for you decades of specific emotional weight — can be reduced to a vector. A point in a space of numbers. And that the distance between two such points can meaningfully represent the similarity or difference between two human beings.

TRIBE v2 is doing something related, but pointed in a different direction. Faceminer compresses the person into data. TRIBE v2 uses data to reconstruct the inner experience of perception — to predict not just what a brain does, but what a brain does while it is having an experience. It is compression run in reverse. Or perhaps it is something more unsettling: it suggests that the two directions are closer together than we thought. That the map and the territory have begun to converge.

"What the scaling law tells you is that this is not a ceiling. This is a beginning. Every additional hour of fMRI data moves the curve."

The paper documents a log-linear scaling law: as you feed TRIBE v2 more training data, its encoding accuracy increases, without any sign of plateau. This mirrors what has been observed in large language models — that more data, more parameters, more compute produce predictably better performance along a smooth curve. The implications are significant. If brain encoding follows the same scaling laws as language modelling, then the current state of TRIBE v2 is not the endpoint. It is the early point on the curve. The ceiling, if there is one, has not come into view.
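Reading "log-linear" as accuracy growing linearly in the logarithm of training hours, the curve has the form score = a + b · ln(hours), and fitting it is ordinary least squares on the log of the x-axis. The sketch below makes that concrete. The data points are synthetic, generated from a curve I chose myself so the fit can be checked; they are not results from the paper.

```python
import math

def fit_log_linear(hours, scores):
    """Least-squares fit of score = a + b * ln(hours)."""
    xs = [math.log(h) for h in hours]
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(scores) / n
    b = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, scores))
         / sum((x - mean_x) ** 2 for x in xs))
    a = mean_y - b * mean_x
    return a, b

# Synthetic curve with known coefficients (assumed values, illustration only).
TRUE_A, TRUE_B = 0.05, 0.03
hours = [10, 30, 100, 300, 1000]
scores = [TRUE_A + TRUE_B * math.log(h) for h in hours]

a, b = fit_log_linear(hours, scores)
gain_per_doubling = b * math.log(2)

print(f"a = {a:.3f}, b = {b:.3f}")
print(f"each doubling of data adds {gain_per_doubling:.4f} to the score")
```

The last line is the practical meaning of such a law: every doubling of data buys the same fixed increment, so progress continues but each step costs twice as much as the one before.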

What the model can't yet do — and why that matters

The researchers are honest about the limits, and the limits are themselves philosophically interesting. TRIBE v2 is constrained by fMRI's resolution — it cannot capture the millisecond-level dynamics of individual neurons, only the sluggish blood-flow proxy that fMRI measures, lagging several seconds behind the actual neural event. Its inputs are limited to visual, auditory, and semantic information — it knows nothing of smell, of proprioception, of balance, of the vast somatosensory world that floods the brain constantly from the body itself.
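The sluggishness is easy to see in a toy simulation. A minimal sketch, assuming a single-gamma haemodynamic response with a shape parameter of 6 and a scale of 1 second (a common textbook simplification that I have chosen here; the canonical response used in practice also includes a later undershoot): a neural event lasting one time step is smeared into a blood-flow response that peaks seconds later.

```python
import math

def hrf(t, shape=6.0, scale=1.0):
    """Single-gamma haemodynamic response at time t (seconds).

    Peaks at (shape - 1) * scale seconds; parameter values are assumptions.
    """
    if t <= 0:
        return 0.0
    return (t ** (shape - 1)) * math.exp(-t / scale) / (math.gamma(shape) * scale ** shape)

def bold_response(neural, dt=1.0, duration=20.0):
    """Convolve a neural time series with the response kernel."""
    kernel = [hrf(i * dt) for i in range(int(duration / dt) + 1)]
    out = [0.0] * len(neural)
    for i, activity in enumerate(neural):
        if activity == 0:
            continue
        for j, weight in enumerate(kernel):
            if i + j < len(out):
                out[i + j] += activity * weight * dt
    return out

# One brief neural event at t = 2 s, sampled once per second.
EVENT_TIME = 2
neural = [0.0] * 30
neural[EVENT_TIME] = 1.0

bold = bold_response(neural)
lag = bold.index(max(bold)) - EVENT_TIME

print(f"signal peaks {lag} s after the neural event")
```

Anything that happens faster than that smear is invisible in the signal, which is why a model trained on fMRI inherits the same blind spot.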

But the most fundamental limitation is this: the model currently treats the brain as a passive observer. It models the brain receiving the world. It does not model the brain acting on it, choosing, producing, moving, deciding. The next horizon — which the researchers name explicitly — is to model the brain as an active agent. A brain that is not just watching Bourne, but that is reaching for a coffee cup while watching Bourne, deciding to pause, remembering a similar scene from a different film, feeling a flicker of something it can't quite name.

That is a very different and much harder problem. But the fact that it is now the next problem — rather than a distant fantasy — is itself a measure of how far we have come.

The paradigm shift, and what comes after

The paper's discussion section contains a comparison that I keep returning to. The authors draw a parallel between what TRIBE v2 represents for neuroscience and what AlphaFold represented for structural biology. AlphaFold didn't just get better at predicting protein structures; it met a fifty-year-old challenge so decisively that the competition built around it became essentially irrelevant. The problem wasn't just solved; it was dissolved. The field moved on to harder problems.

TRIBE v2 is not AlphaFold — not yet, not quite. The brain is orders of magnitude more complex than any protein, and the fMRI signal is orders of magnitude noisier than a crystallography result. But the trajectory is analogous. We are watching the moment where a highly fragmented field — where each lab studies its own task, its own brain region, its own subject pool, with its own model trained from scratch on its own data — begins to cohere around a shared foundation. A universal encoder of whole-brain activity. A single model that knows enough about brains in general to be useful for any particular brain, any particular question.

The researchers describe this as a shift from "passive observer" to foundation model — from a tool that describes brain activity to one that can generate hypotheses, pre-screen experimental designs, augment statistical power, and recover known results without a single human having to enter a scanner. That is not a modest change. That is a change in what neuroscience is, structurally, as a practice.

The distance between biological and digital

Faceminer ends — or rather, it doesn't end so much as it simply stops, leaving you with the weight of what you've been doing — and the question it leaves you with is something like: was the distance always this small? Were we always this compressible, this representable, this close to being a model that another model could predict?

TRIBE v2 makes me feel that question acutely, and from the other direction. Not: can the person be compressed into data? But: can the data recover the person? Can enough fMRI, trained into a large enough model, begin to reconstruct not just which brain region activated, but something of what it was like to be that brain, in that moment, watching that film?

We are not there. Let's be careful about that. What TRIBE v2 predicts is a BOLD signal — oxygenated blood flow — not experience. There is a very long distance between the two, and the hard problem of consciousness sits somewhere in that distance, enormous and unresolved. But here is what strikes me as genuinely new: for the first time, the distance is being measured. Not philosophised about, not gestured toward — measured, in Pearson correlations and scaling curves and spatial maps of multisensory integration.

The gap between biological and digital isn't closing because digital is becoming more like biological. It's closing because we now have tools that can hold both at once — that can take the full richness of human perception and map it, with increasing precision, onto a surface that both can share. TRIBE v2 is that surface. And Faceminer, for all its smallness, intuited the same thing: that the distance, whatever its ultimate nature, is traversable. That something essential passes through the encoding.

I don't know if that should comfort us or disturb us. I suspect, as with most things that are genuinely true, it is both.

    Paper: d'Ascoli et al. — "A foundation model of vision, audition, and language for in-silico neuroscience" — FAIR at Meta, March 25 2026

    Code: github.com/facebookresearch/tribev2  ·  Demo: aid