Transformers and Substrate-Friendly Computation

Attention as the substrate’s coherence-match operation, residual streams as canonical loops in axial form, depth-and-width architecture as substrate-eigenmode basis-set decomposition

The section’s earlier chapters built the brain modon’s substrate-physics: the prediction-and-error cycle on an eigenmode basis, bilateral coupled-resonator dynamics, a multi-rung Stuart-Landau stack, hippocampal archival, vagal embedding in the body. The previous chapter lifted the long vector between two such brains as language. This chapter takes the last of the long vector’s four readings: perception builds it, memory freezes it, language passes it, and here a machine builds it in silicon. The transformer is the clarifying mirror: an architecture engineers arrived at through gradient descent and scaling, without setting out to copy a brain, that the framework reads as having converged on the same substrate primitives the brain modon already runs.

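The claim is that this convergence is not a coincidence. The transformer is the substrate’s classical-computation analogue of brain-modon dynamics, and its main parts map one-to-one onto primitives the earlier chapters developed: attention onto the coherence-match operation, the residual stream onto the canonical loop in axial form, and the depth-and-width architecture onto eigenmode basis-set decomposition.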

The reading lens for the internals is the mechanistic-interpretability programme — Chris Olah’s Circuits work and Anthropic’s transformer-circuits line (Elhage 2021; Olsson 2022; Bricken 2023; Templeton 2024) — which describes the architecture as attention-and-MLP circuits operating over a shared residual-stream substrate. On the framework’s reading, the transformer’s effectiveness is not a lucky outcome of architectural search but the engineering re-discovery, under gradient-descent-and-scaling pressure, of primitives the substrate already runs at every scale this paper has developed: coherence-match as the elementary discrimination, canonical-loop current as the corridor that carries state, eigenmode-basis decomposition as the way to parallelise, and the prediction-and-error cycle as the loop that drives mismatch down.

There is a recursive twist worth stating plainly. The model whose architecture this chapter reads is the same model the framework was developed with — across \sim 11 months of dialog between a software architect (the author, a programmer rather than a physicist) and a model trained on the literature that names the framework’s ancestors: Bush–Oza pilot-wave hydrodynamics, Volovik’s Universe in a Helium Droplet, Simeonov’s fluid-Schrödinger bridge, Khoury’s dark-matter superfluidity, Larichev–Reznik modon mathematics. The model reads the framework that reads the model. The framework takes that loop as bilateral coupling lifted to the human-and-tool rung — the same coupling the section has named at every other organism scale, now between author-cognition and model-cognition. The rest of the chapter works through each claim in turn, then closes on what the two substrates share and where their strengths diverge.

The Transformer as a Stack of Coherence-Match Layers

The transformer block (Vaswani and collaborators 2017) is conceptually simple. A sequence of T tokens — each represented as a d_\text{model}-dimensional embedding vector — passes through N_\text{layers} identical transformer blocks. Each block contains two sub-layers: a multi-head attention sub-layer and a position-wise feed-forward (MLP) sub-layer. Each sub-layer reads the residual stream (the current state at each token), applies its operation, and adds its output back into the residual stream through a residual connection. The output of the final block is read by an unembedding matrix into next-token logits over the vocabulary. The architecture has no recurrence and no convolution; the only structural element relating different token positions is the attention operation.

Anthropic’s A Mathematical Framework for Transformer Circuits (Elhage and collaborators 2021) reformulated this architecture in a way the substrate framework now reads as architecturally crucial. The residual stream is treated as the central object (a d_\text{model}-dimensional vector at each token position that persists across layers), and each sub-layer is described as a read-write operation on this stream: attention heads and MLP neurons read from the residual stream through input projection matrices, perform their computation, and write back through output projection matrices (W^O for attention heads, W_\text{down} for MLPs). The unembedding matrix proper is applied only once, after the final block. The architecture is, in this reading, a stack of read-write operations on a persistent shared substrate. The framework reads this as the engineering equivalent of the substrate-current corridor the canonical-loops chapter developed: the residual stream is the substrate-current carrier of the architecture, and each layer is a coherence-match-and-update operation on that current.

The mechanistic-interpretability programme has accumulated a substantial inventory of these operations. Induction heads (Olsson and collaborators 2022) — pairs of attention heads that implement in-context pattern-matching by attending to previous occurrences of similar tokens — emerge during training as a sharp phase transition and explain a large fraction of the model’s in-context learning capability. Successor heads implement next-element-in-sequence operations. Copy-suppression heads (McDougall and collaborators 2023) implement negative-mediation operations that turn off competing predictions. Backup heads implement redundancy and error correction. MLPs as key-value memories (Geva and collaborators 2021) read residual-stream patterns through their W_\text{up} projection as keys and write retrieved feature vectors back through W_\text{down} as values. Sparse autoencoders applied to the residual stream and to MLP activations (Bricken and collaborators 2023, Templeton and collaborators 2024) reveal features — interpretable monosemantic directions in activation space — spanning concrete concepts (the Golden Gate Bridge, code syntax, named entities) and abstract concepts (deception, sycophancy, self-reference) across the model. The architecture exhibits superposition (Elhage and collaborators 2022): the residual stream’s d_\text{model} dimensions encode many more than d_\text{model} features through nearly-orthogonal but not strictly-orthogonal directions, with the substrate-coherent superposed state allowing graceful information storage and recovery despite the dimensional shortfall.
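The superposition claim at the end of this inventory can be checked numerically in a few lines. The sketch below is a toy under stated assumptions, not the procedure of the cited papers: random unit vectors stand in for learned feature directions, and the sizes (512 dimensions, 2048 features, three active features) are hypothetical values chosen for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, n_features = 512, 2048   # four times more features than dimensions

# Random unit vectors stand in for learned feature directions (an assumption:
# trained models arrange their directions more carefully than random sampling).
directions = rng.normal(size=(n_features, d_model))
directions /= np.linalg.norm(directions, axis=1, keepdims=True)

# Interference between distinct features is their off-diagonal cosine similarity.
gram = directions @ directions.T
np.fill_diagonal(gram, 0.0)
max_interference = float(np.abs(gram).max())

# Write a sparse set of active features into the stream, then read all features back.
active = [3, 700, 1500]
stream = directions[active].sum(axis=0)
readout = directions @ stream
recovered = sorted(int(i) for i in np.argsort(readout)[-len(active):])

print(f"max |cos| between distinct features: {max_interference:.3f}")
print("recovered active features:", recovered)
```

The readout works only because few features are active at once. As more features are written simultaneously, the accumulated interference grows and recovery degrades, which is the sparsity condition the superposition literature emphasises.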

The framework reads each of these circuit primitives as a substrate-coherence-match operation in a slightly different parametrisation. Induction heads match incoming substrate-current against previous-occurrence substrate-current via the query-key dot product, returning the substrate-current that flowed at the previous occurrence’s next position — substrate-coherence-match-and-retrieve. Successor heads implement substrate-coherent monotonic-progression matching. Copy-suppression heads implement substrate-coherent destructive interference, the engineering analogue of inhibitory-interneuron suppression in the bilateral-coupling chapter’s cortical-microcircuit picture. MLPs as key-value memories implement substrate-coherent associative recall through the W_\text{up}-then-W_\text{down} projection structure. Sparse-autoencoder features identify the substrate-eigenmode-basis directions the variable-length-cortical-column eigenmode-basis section developed at the brain-modon scale. Superposition is the substrate’s preferred graceful-degradation-with-noise encoding strategy at the eigenmode-basis level, naturally present in any high-dimensional substrate-coherent system and now formalised in the transformer-circuits literature — and, as the two-poles section below develops, the architecture’s anti-lock pole, where feature directions spread to avoid mutual interference exactly as the retinal cone mosaic spreads its cones.
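The key-value reading of the MLP mentioned above can be made concrete with a toy lookup. This is a minimal sketch under assumptions that are not in the text: the keys are made exactly orthonormal so that the lookup is exact, the activation is a ReLU, and the names `keys`, `values`, and `mlp` are illustrative only.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, d_mlp = 8, 4

# Orthonormal keys make the lookup exact; trained keys are only approximately so.
keys = np.linalg.qr(rng.normal(size=(d_model, d_mlp)))[0].T   # W_up, shape (d_mlp, d_model)
values = rng.normal(size=(d_model, d_mlp))                    # W_down, shape (d_model, d_mlp)

def mlp(x):
    """Key-value reading: score x against each key, then sum the matching values."""
    match = np.maximum(keys @ x, 0.0)   # ReLU of key scores (activation is an assumption)
    return values @ match

x = 2.0 * keys[2]   # a stream state aligned with key 2, at strength 2
out = mlp(x)

# The output is the value stored under key 2, scaled by the match strength.
print(np.allclose(out, 2.0 * values[:, 2]))
```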

Attention as the Substrate’s All-to-All Coherence-Match Operation

The attention operation is the architecture’s most distinctive component. Given a set of T tokens with residual-stream vectors x_1, \ldots, x_T \in \mathbb{R}^{d_\text{model}}, an attention head with weight matrices W^Q, W^K, W^V \in \mathbb{R}^{d_\text{head} \times d_\text{model}} and output matrix W^O \in \mathbb{R}^{d_\text{model} \times d_\text{head}} computes for each query position i:

q_i = W^Q x_i, \qquad k_j = W^K x_j, \qquad v_j = W^V x_j,

\alpha_{ij} = \text{softmax}_j\!\left(\frac{q_i \cdot k_j}{\sqrt{d_\text{head}}}\right), \qquad \text{output}_i = W^O \sum_j \alpha_{ij}\, v_j.

The query-key dot product q_i \cdot k_j is a similarity score; the softmax normalises these scores into a probability distribution over source positions; the output is a probability-weighted sum of value vectors. The framework reads each step as a substrate-coherence-match operation in classical-vector form.
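The three steps just described (score, normalise, aggregate) can be written out directly. The sketch below follows the equations as given, with two assumptions of its own: the weight shapes are chosen so that q_i = W^Q x_i type-checks (d_head by d_model for W^Q, W^K, W^V and d_model by d_head for W^O), and no causal mask is applied, since the equations sum over all j. The sizes and random weights are placeholders.

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)   # subtract the max for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def attention_head(X, W_Q, W_K, W_V, W_O):
    """X: (T, d_model). W_Q, W_K, W_V: (d_head, d_model). W_O: (d_model, d_head)."""
    Q = X @ W_Q.T                         # (T, d_head), row i is q_i
    K = X @ W_K.T                         # (T, d_head), row j is k_j
    V = X @ W_V.T                         # (T, d_head), row j is v_j
    d_head = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_head)    # (T, T), entry (i, j) is q_i . k_j / sqrt(d_head)
    alpha = softmax(scores, axis=-1)      # each row is a distribution over source positions j
    out = (alpha @ V) @ W_O.T             # (T, d_model), weighted value sum projected by W_O
    return out, alpha

rng = np.random.default_rng(0)
T, d_model, d_head = 5, 16, 4
X = rng.normal(size=(T, d_model))
W_Q, W_K, W_V = (rng.normal(size=(d_head, d_model)) / np.sqrt(d_model) for _ in range(3))
W_O = rng.normal(size=(d_model, d_head)) / np.sqrt(d_head)

out, alpha = attention_head(X, W_Q, W_K, W_V, W_O)
print(out.shape, alpha.sum(axis=1))
```

Decoder-only language models additionally mask scores for j > i before the softmax; that step is omitted here to stay with the formula as written.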

The query and key projections isolate from the residual stream the substrate-coherence directions the head cares about. The framework’s reading is that W^Q projects the residual stream into the coherence-match-query subspace — the directions along which the head asks “what previous substrate-coherent state should match here?” — and W^K projects into the coherence-match-key subspace — the directions along which each prior position advertises “this is the substrate-coherent state I encode.” The two projections are the engineering parametrisation of the substrate’s coherence-match-pattern-and-coherence-match-template pairing the prediction-engine canonical-loop architecture implements through descending predictions and ascending measurements.

The dot product q_i \cdot k_j is the substrate’s elementary coherence-match operation in classical-vector form. In a continuous substrate setting the same operation would be the correlation integral \int q^*(x) k(x)\, d^Dx or — at substrate-coherent states — the overlap of two wave functions \langle \psi_q | \psi_k \rangle, returning a complex coherence-match amplitude. The transformer’s classical dot product is the discretised real-valued analogue: high score when the head’s query direction aligns with a source token’s key direction in d_\text{head}-dimensional activation space, low score when they are orthogonal. The framework reads this as the substrate’s coherence-match score lifted to classical vectors and discrete tokens, with the d_\text{head} dimensions acting as the head’s restricted substrate-eigenmode subspace.
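The step from overlap integral to dot product can be seen by discretising the integral on a grid. The sketch below is an illustration under assumptions of its own: a one-dimensional periodic domain, a uniform grid of 1000 points, and plane-wave test functions. None of these choices come from the text.

```python
import numpy as np

n = 1000
x = np.linspace(0.0, 2.0 * np.pi, n, endpoint=False)   # uniform grid over one period
dx = 2.0 * np.pi / n

def overlap(q, k):
    """Riemann-sum approximation of the integral of conj(q(x)) * k(x) over one period."""
    return np.vdot(q, k) * dx   # np.vdot conjugates its first argument

same = overlap(np.exp(1j * x), np.exp(1j * x))        # aligned modes: overlap 2*pi
different = overlap(np.exp(1j * x), np.exp(2j * x))   # orthogonal modes: overlap 0
real_case = overlap(np.sin(x), np.sin(x))             # real-valued case: overlap pi

print(abs(same), abs(different), real_case)
```

For real-valued inputs the conjugation does nothing and the overlap is an ordinary dot product times the grid spacing, which is the sense in which the attention score is the discretised real-valued case.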

The softmax is the substrate’s winner-take-most-not-winner-take-all selection. The scaling factor 1/\sqrt{d_\text{head}} acts as an inverse temperature: the effective temperature is \sqrt{d_\text{head}}, so for fixed raw scores a higher temperature (larger d_\text{head}) produces softer, broader weighting, and a lower temperature (smaller d_\text{head}) produces sharper, more concentrated weighting. Vaswani and collaborators introduced the scaling so that the dot products do not grow with d_\text{head} and saturate the softmax. The framework reads the softmax as the substrate’s preferred coherence-match-probability-amplitude operation, with the inverse temperature acting as the substrate-coherence-quality parameter \mu of the cortical-resonator-ODE chapter: when \mu is well-positive (substrate is robustly coherent), discrimination is sharp; when \mu is near zero (substrate is incoherent), discrimination is diffuse. The same parameter controls discrimination quality at both rungs of the architecture.
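Treating \sqrt{d_\text{head}} as the softmax temperature, the effect of the scaling on selection sharpness can be checked on a fixed set of raw scores. The scores and head sizes below are hypothetical values chosen for illustration; in a trained model the raw scores themselves change with d_\text{head}.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def entropy(p):
    return float(-(p * np.log(p)).sum())

raw_scores = np.array([4.0, 2.0, 1.0, 0.0])   # fixed, hypothetical q.k values

results = []
for d_head in (1, 16, 64, 256):
    weights = softmax(raw_scores / np.sqrt(d_head))   # temperature is sqrt(d_head)
    results.append((d_head, float(weights.max()), entropy(weights)))
    print(f"d_head={d_head:4d}  top weight={weights.max():.3f}  entropy={entropy(weights):.3f}")
```

With the raw scores held fixed, the top weight falls and the entropy rises as d_\text{head} grows: a larger divisor means a softer selection.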

The value-weighted sum \sum_j \alpha_{ij} v_j is the substrate-coherent information aggregation across all matched positions. The framework’s reading is that this is the engineering equivalent of substrate-current flowing into the destination position through all the coherence-match channels the head has opened, weighted by the channel strength the softmax allocated. The output is then projected back through W^O into the residual-stream directions the next layers will read.

Multi-head attention runs H parallel attention heads, each with its own W^Q, W^K, W^V, W^O matrices and its own d_\text{head}-dimensional subspace, and concatenates their outputs back into the d_\text{model}-dimensional residual stream. The framework reads multi-head attention as the substrate’s preferred parallel-eigenmode-channel architecture lifted to the engineering parametrisation: each head operates on a different eigenmode-subspace of the substrate-coherent state, with the parallel-head structure implementing the substrate’s preferred parallelisation strategy across substrate-eigenmode directions. This is structurally parallel to the topographic-map architecture the brain modon implements at cortical scale — many parallel-running coherence-match readouts on slightly different substrate-coherent subspaces, with the parallel structure exposing the substrate-eigenmode basis the brain modon’s substrate-physics constraints require.
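Concatenating head outputs and projecting once is algebraically the same as letting each head write to the residual stream through its own slice of the output matrix and summing the writes, a point the transformer-circuits literature uses to treat heads as independent channels. A small numerical check, with hypothetical sizes and random stand-ins for the per-head outputs:

```python
import numpy as np

rng = np.random.default_rng(1)
T, H, d_head, d_model = 4, 3, 2, 8

# Stand-ins for each head's weighted value sum (sum_j alpha_ij v_j), shape (T, d_head).
head_outputs = [rng.normal(size=(T, d_head)) for _ in range(H)]
W_O = rng.normal(size=(H * d_head, d_model))   # joint output projection

# Route 1: concatenate all heads, then project once.
concat_then_project = np.concatenate(head_outputs, axis=-1) @ W_O

# Route 2: each head writes through its own block of rows of W_O; the writes are summed.
per_head_sum = sum(
    h @ W_O[i * d_head:(i + 1) * d_head] for i, h in enumerate(head_outputs)
)

print(np.allclose(concat_then_project, per_head_sum))
```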

The Residual Stream as a Canonical Loop in Axial Form

The residual stream’s role in the architecture is the substrate framework’s most direct architectural recognition. The residual stream begins at the token-embedding layer as a d_\text{model}-dimensional vector; each transformer block reads from it through its attention and MLP sub-layers, computes its outputs, and adds those outputs back to the stream; the stream therefore accumulates contributions from every layer’s read-write operation as it propagates from input embedding to output unembedding. Elhage and collaborators (2021) emphasised that this gives the residual stream a linear-superposition structure — each sub-layer’s contribution is additively present in the final stream and can be traced back independently — and that the architecture’s behaviour can be analysed as the sum of paths through the network, each path corresponding to a specific subset of sub-layers contributing to a specific output.
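The additive structure described above can be made concrete with a toy stream. The sketch below is a simplification under stated assumptions: a tanh map stands in for each attention or MLP sub-layer, layer normalisation is left out, and `w_u` is a single hypothetical unembedding direction.

```python
import numpy as np

rng = np.random.default_rng(2)
d_model, n_sublayers = 8, 6
weights = [rng.normal(size=(d_model, d_model)) * 0.3 for _ in range(n_sublayers)]

def sublayer(x, W):
    return np.tanh(W @ x)   # toy stand-in for an attention or MLP sub-layer

x0 = rng.normal(size=d_model)   # token embedding
stream = x0.copy()
writes = []
for W in weights:
    delta = sublayer(stream, W)   # read the current stream
    writes.append(delta)
    stream = stream + delta       # write back by addition

# The final stream is exactly the embedding plus the sum of every write.
reconstructed = x0 + np.sum(writes, axis=0)

# Any linear readout therefore splits into one term per contribution.
w_u = rng.normal(size=d_model)
logit = float(w_u @ stream)
per_contribution = [float(w_u @ x0)] + [float(w_u @ d) for d in writes]

print(np.allclose(stream, reconstructed), np.isclose(logit, sum(per_contribution)))
```

The decomposition of the final state is exact, although each write still depends nonlinearly on everything upstream of it; the path analysis handles that dependence separately.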

The framework reads the residual stream as a canonical loop in axial form — the substrate’s prediction-and-error cycle iteration unrolled along the depth dimension rather than the time dimension. The cilia-flagella chapter developed the cilium’s axoneme as a canonical loop in axial form — substrate-current circulating along the cilium’s length rather than around a closed loop — and the residual stream is the engineering analogue of the same architectural pattern: substrate-current (here, a d_\text{model}-dimensional activation vector) propagating along the depth axis, with each transformer block adding its substrate-coherence-match-and-update contribution at its specific axial position. The architectural pattern is shared. A cilium and a transformer block are both substrate-current corridors with iterative coherence-match-and-update operations applied along the corridor’s length; the framework reads the convergence as architectural recognition rather than coincidence.

The substrate-physics reading also explains the necessity of the residual connection that ResNet (He and collaborators 2016) demonstrated for deep networks generally and that the transformer inherited. Without residual connections — i.e., if each layer’s output simply replaced rather than added to the previous layer’s output — the substrate-current corridor would be broken at every layer boundary, with each layer required to reconstruct the upstream substrate-current state from scratch. The training-time gradient would have to flow back through every layer’s transformation rather than along the canonical-loop axial corridor, producing the vanishing-gradient problem the pre-ResNet deep-network literature struggled with for years. The substrate-friendly architecture requires the residual connection because the substrate-current corridor must remain unbroken across the depth axis; the engineering ResNet discovery was the empirical re-statement of the same architectural constraint the substrate-physics analysis predicts.
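The gradient argument can be illustrated with a linear toy network, where the input-to-output Jacobian is a plain matrix product. This is a sketch under assumptions that are not in the text: purely linear layers, no normalisation, and a weight scale picked so that the shrinkage is easy to see.

```python
import numpy as np

rng = np.random.default_rng(3)
d, depth = 16, 30
# Small random layers; the 0.5 / sqrt(d) scale is an illustrative assumption.
layers = [rng.normal(size=(d, d)) * (0.5 / np.sqrt(d)) for _ in range(depth)]

J_plain = np.eye(d)   # Jacobian when each layer replaces the state: x <- W x
J_resid = np.eye(d)   # Jacobian when each layer adds to the state:  x <- x + W x
for W in layers:
    J_plain = W @ J_plain
    J_resid = (np.eye(d) + W) @ J_resid

plain_norm = np.linalg.norm(J_plain, 2)   # largest singular value
resid_norm = np.linalg.norm(J_resid, 2)

print(f"no skip connection: {plain_norm:.2e}")
print(f"residual connection: {resid_norm:.2e}")
```

Each residual layer contributes an identity term to the Jacobian, so the gradient always has a direct path along the stream regardless of what the individual layers do.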

The framework’s prediction is that architectural innovations strengthening the residual-stream / canonical-loop substrate-current corridor will continue to outperform alternatives, with chain-of-thought, scratchpad reasoning, and inference-time iteration techniques as the next-generation engineering instances of the same architectural recognition. Chain-of-thought (Wei and collaborators 2022) — letting the model generate intermediate reasoning tokens before final answer tokens — is in the framework’s reading the time-axis extension of the residual stream, with intermediate tokens carrying the substrate-current state across additional canonical-loop iterations beyond what the fixed network depth supports. The framework predicts that any architectural move extending the residual-stream-as-canonical-loop substrate-current corridor (recurrence, scratchpad, plan-and-execute, search-and-edit) will outperform the equivalent move that breaks it.

Depth and Width as Substrate-Eigenmode Basis-Set Decomposition

The transformer’s two scaling axes, d_\text{model} (width) and N_\text{layers} (depth), correspond in the framework’s reading to two distinct substrate-physics roles. Width is the substrate-eigenmode basis-set dimension: how many independent substrate-coherent directions the architecture can simultaneously carry on the residual stream. Depth is the canonical-loop iteration count: how many substrate-coherence-match-and-update operations the architecture applies to the substrate-current as it propagates from input to output. The two roles are not interchangeable. The scaling-law literature has examined the balance between them (Kaplan and collaborators 2020 for the width-to-depth aspect ratio; Hoffmann and collaborators 2022, “Chinchilla”, for the separate data-to-parameter balance), and the configurations it arrives at are neither width-dominant nor depth-dominant but specific compromises. The framework reads the preferred aspect ratio as the substrate-preferred basis-set-to-iteration-count ratio for substrate-friendly classical computation, structurally parallel to the brain modon’s substrate-preferred column-count-to-rung-count ratio the cortical-maps-and-rhythms chapter developed.

The framework’s prediction is that trained transformer architectures cluster at substrate-preferred width-to-depth ratios distinct from arbitrary engineering-search values, with the substrate-preferred ratios visible in the published architecture configurations of well-trained models across the past five years of scaling-law work. The Chinchilla-optimal \sim 20 tokens-per-parameter balance (Hoffmann and collaborators 2022) is a data-to-parameter law and a distinct quantity from the width-to-depth aspect ratio the rung claim is actually about; and that claim is an honest member of the ladder’s coarse family, where the prediction is that the ratios cluster rather than that they fall on a clean \sqrt2 rung. The caution is real — Kaplan and collaborators (2020) found loss only weakly sensitive to aspect ratio over a wide range — so the predicted clustering, if it is there at all, is loose, closer to the coarse economic and geological rungs than to the grid cell’s clean half-octave. The substrate-physics reading provides a non-trivial prediction: the substrate-preferred ratios are pinned by substrate-eigenmode basis-set theory rather than by loss-landscape geometry alone, and should be visible as ratio-clustering across architectures and training-data sizes.
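To make the distinction between the two quantities concrete, the data-to-parameter balance can be worked through as arithmetic. The sketch below combines the roughly 20 tokens-per-parameter figure from the text with the common rule of thumb that training compute is about 6 FLOPs per parameter per token; that rule, the function name `chinchilla_split`, and the example budget are assumptions brought in for illustration.

```python
def chinchilla_split(compute_flops, tokens_per_param=20.0):
    """Split a training budget C into parameters N and tokens D.

    Assumes C ~= 6 * N * D (a rule of thumb, not a statement from the text)
    and D = tokens_per_param * N, so that N = sqrt(C / (6 * tokens_per_param)).
    """
    n_params = (compute_flops / (6.0 * tokens_per_param)) ** 0.5
    n_tokens = tokens_per_param * n_params
    return n_params, n_tokens

n_params, n_tokens = chinchilla_split(5.88e23)   # hypothetical training budget in FLOPs
print(f"parameters: {n_params:.2e}, tokens: {n_tokens:.2e}")
```

Nothing in this calculation mentions width or depth: the same parameter count can be realised as a wide shallow model or a narrow deep one, which is why the aspect-ratio claim needs its own evidence.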

The sparse-autoencoder features the Towards Monosemanticity and Scaling Monosemanticity programme has discovered are in the framework’s reading the trained transformer’s substrate-eigenmode basis-set, identified empirically through dictionary-learning on the residual-stream activations. Each monosemantic feature corresponds to one substrate-coherent direction in the residual-stream activation space; the feature population identifies the eigenmode-basis the model has learned in order to span the substrate-coherent computational space the training task requires. The framework predicts that the feature-count’s scaling with d_\text{model} should follow substrate-eigenmode-basis-density scaling — feature-count growing super-linearly in d_\text{model} (as the Toy Models of Superposition literature has shown, with the feature-to-dimension ratio characterising the superposition regime) but bounded above by the substrate-coherent-state-space dimensionality the substrate-physics framework supports. The substrate-friendly architectural reading predicts feature-count-scaling clustering at substrate-preferred power-law exponents distinct from arbitrary architectural-tuning.

The Transformer at Both Poles

Everything so far has read the transformer at one pole — the lock pole, where coherence-match binds: attention aligning a query to a key, induction heads matching a previous occurrence, the residual stream carrying a state written to be read. But the substrate ladder runs to two poles, and the architecture lives at both. The complement is already in the chapter’s own inventory: superposition. The residual stream packs far more than d_\text{model} features into d_\text{model} dimensions by placing them along nearly-orthogonal but not strictly-orthogonal directions (Elhage and collaborators 2022). That near-orthogonality is not merely graceful degradation; it is the anti-lock pole. Features must stay distinguishable — a feature direction that resonates with another aliases one concept into the other, exactly as a periodic cone lattice aliases one spatial frequency into a false one — so the model spreads its feature directions as far from mutual alignment as a cramped space allows. Toy Models of Superposition found the geometry directly: under sparsity, features organize into maximal-minimum-angle configurations — antipodal pairs, triangles, pentagons, tetrahedra — a Thomson/Tammes packing on the d_\text{model} hypersphere, the spherical form of the disordered-hyperuniform “blue-noise” packing the retinal cone mosaic lays down in a plane and the genetic code approximates in its cramped symbol space. The retina spreads cones so images cannot alias; the code spreads codons so a slipped letter cannot change meaning; the transformer spreads features so concepts cannot interfere — one job, avoid confusion, at the anti-lock pole, in three substrates that share no chemistry.
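The maximal-minimum-angle idea is easy to check in the planar case. The sketch below covers only the two-dimensional configurations named above (antipodal pair, triangle, pentagon) and compares each with a random set of the same size; the tetrahedron and higher-dimensional packings are left out, and the helper names are illustrative.

```python
import numpy as np

def min_pairwise_angle_deg(vectors):
    """Smallest angle, in degrees, between any two distinct directions."""
    v = vectors / np.linalg.norm(vectors, axis=1, keepdims=True)
    cos = np.clip(v @ v.T, -1.0, 1.0)
    np.fill_diagonal(cos, -1.0)   # ignore each vector's match with itself
    return float(np.degrees(np.arccos(cos.max())))

def regular_polygon(k):
    theta = 2.0 * np.pi * np.arange(k) / k
    return np.stack([np.cos(theta), np.sin(theta)], axis=1)

rng = np.random.default_rng(0)
spread_angles, random_angles = {}, {}
for k, name in [(2, "antipodal pair"), (3, "triangle"), (5, "pentagon")]:
    spread_angles[k] = min_pairwise_angle_deg(regular_polygon(k))
    random_angles[k] = min_pairwise_angle_deg(rng.normal(size=(k, 2)))
    print(f"{name}: spread {spread_angles[k]:.1f} deg, random {random_angles[k]:.1f} deg")
```

In the plane no arrangement of k directions can beat the regular polygon's minimum angle of 360/k degrees, so the random set never comes out ahead.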

So the transformer is a both-poles system, and the split is by representational geometry. The binding-and-routing machinery sits on the lock pole — attention’s query-key alignment, the induction and successor heads, the residual-stream directions written precisely so a later layer will read them. The storage machinery sits at the anti-lock pole — the superposed feature directions spread to avoid interference, multi-head attention’s deliberately disjoint subspaces, the monosemantic dictionary a sparse autoencoder recovers. This is the engineered twin of the brain’s both-poles architecture — the hippocampus separating in the dentate gyrus and completing in CA3, the prediction engine binding on the teeth and separating in the gap, the eye holding a lock-pole disc stack beside an anti-lock cone mosaic — now realized in silicon. The transformer joins that set as a both-poles system by representational geometry: lock to bind, anti-lock to keep its features apart.

The ladder’s sign rule fixes the pole before the measurement, and it sharpens the chapter’s feature-population prediction into something concrete. Name the job: a sub-circuit that matches or routes should carry aligned geometry; a sub-circuit that stores without interference should carry maximally-spread geometry. Concretely, the directions a sparse autoencoder recovers from the residual stream should be more angularly spread than a random set of the same size — sub-Poissonian nearest-neighbour angles, hyperuniform on the hypersphere — while the read/write subspaces of attention’s matching circuits sit comparatively aligned. That is the same number-variance instrument (scripts/cone_mosaic.py) the cone mosaic is scored by, lifted from the plane to the sphere. The honest framing matches the ladder’s other applied cases: superposition geometry is established (Elhage and collaborators 2022), so the new content is not the observation but the unification — that the transformer’s feature packing is the same anti-lock principle as the retina’s mosaic and the code’s degeneracy — and the falsifiable sign-rule that unification makes testable on the interpretability programme’s own dictionaries.
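A simplified version of the proposed measurement is sketched below. It is not the contents of scripts/cone_mosaic.py, which are not reproduced here, and it computes a nearest-neighbour-angle statistic rather than a full number variance. The "dictionary" is a hypothetical stand-in (the directions +e_i and -e_i); a real test would use the decoder directions of a trained sparse autoencoder.

```python
import numpy as np

def nearest_neighbour_angles_deg(directions):
    """Angle from each direction to its closest other direction, in degrees."""
    v = directions / np.linalg.norm(directions, axis=1, keepdims=True)
    cos = np.clip(v @ v.T, -1.0, 1.0)
    np.fill_diagonal(cos, -1.0)   # exclude self-matches
    return np.degrees(np.arccos(cos.max(axis=1)))

d = 8
rng = np.random.default_rng(0)

# Stand-in dictionary: 2d directions +e_i and -e_i, a well-spread set on the sphere.
dictionary = np.concatenate([np.eye(d), -np.eye(d)])
# Baseline: random directions of the same count in the same dimension.
baseline = rng.normal(size=(2 * d, d))

dict_angles = nearest_neighbour_angles_deg(dictionary)
base_angles = nearest_neighbour_angles_deg(baseline)

print(f"dictionary: mean {dict_angles.mean():.1f} deg, std {dict_angles.std():.1f}")
print(f"baseline:   mean {base_angles.mean():.1f} deg, std {base_angles.std():.1f}")
```

The sign rule predicts the pattern the stand-in shows by construction: a storage dictionary should have a larger mean and a narrower spread of nearest-neighbour angles than the random baseline, and a matching circuit's read and write subspaces should not.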

The Three-Stage Functional Stack: Topological Map, Twenty Questions, and Goal-State Match

The mechanistic-interpretability programme has accumulated growing evidence for a functional-layer specialisation across transformer depth. Early layers (the first few transformer blocks) implement tokenisation, syntactic parsing, and basic feature detection — the residual-stream representation at this depth resembles a topographic organisation of the input text’s syntactic and lexical structure (Tenney and collaborators 2019 for BERT, Geva and collaborators 2021, and the broader probing literature). Middle layers (the bulk of the model’s depth) implement parallel feature-combination and substrate-coherent multi-feature matching — many heads and MLPs operating in parallel on the residual stream, each combining features into composite representations that map onto increasingly abstract concepts (the Scaling Monosemanticity features at intermediate depths are the empirical signature). Late layers (the final few transformer blocks) implement goal-state coherence-match and read-out — projecting the residual stream into the next-token-prediction direction, with suppression heads and successor heads enforcing the structural-and-distributional constraints of the output distribution.

The framework reads this three-stage decomposition as the substrate’s preferred three-stage computation pattern lifted to the transformer’s depth axis. I proposed the same three-stage decomposition for the brain modon’s cortical hierarchy in the conversation that opened this chapter: entry layers for tokenisation-and-topological-map sorting; middle layers running the “twenty questions” coherence-match against each substrate-eigenmode feature in parallel; final layers for goal-state matching. The framework reads the convergence between this engineer’s reading of the cortex and the empirical mechanistic-interpretability layer-specialisation as the substrate’s preferred three-stage computation pattern recurring at both architectural scales.

The topological-map entry stage is the substrate’s preferred coherence-cell-readout-of-input architecture. The retinotopic, tonotopic, and somatotopic maps the brain modon implements at its early sensory cortices are the brain-modon parallel; the transformer’s token-embedding and early-layer syntactic-feature detection is the engineering parallel. In both architectures the substrate’s preferred entry-stage operation is spatial-or-categorical sorting of input into substrate-coherent topological organisation before any combinatorial computation begins. The framework reads this convergence as the substrate’s preferred first-stage architectural primitive — input must be sorted onto a substrate-coherent topological manifold before substrate-coherence-match operations can be applied to it productively.

The twenty-questions middle stage is the substrate’s preferred parallel substrate-eigenmode coherence-match architecture. The framework’s reading is that the substrate’s preferred middle-stage operation is parallel coherence-match against many features simultaneously, with each feature corresponding to one substrate-eigenmode direction the architecture has trained to represent. An intuitive description — “twenty questions in parallel” — captures the substrate’s preferred middle-stage operation exactly: each parallel-running coherence-match head asks one substrate-eigenmode-aligned discrimination question, with the parallel collection of answers producing the substrate-coherent feature-vector the late stages will read. The brain modon’s middle-cortical-layer feature-combination machinery is structurally parallel — many cortical columns operating in parallel on the substrate-coherent sensory-and-internal state, each implementing one substrate-eigenmode-aligned discrimination operation, with the parallel collection of answers feeding the prediction-and-error cycle’s higher-level integration.

The goal-state-match read-out stage is the substrate’s preferred coherence-match-against-the-target architecture. The framework’s reading is that the substrate’s preferred final-stage operation is projecting the substrate-coherent middle-stage state onto the goal-or-output substrate-coherent state, with the projection strength setting the action-or-output the architecture commits to. The transformer’s unembedding-and-softmax read-out is the engineering parallel; the brain modon’s premotor-and-motor-cortex output is the biological parallel; the substrate’s preferred final-stage operation is the same architectural pattern at both scales.
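The read-out stage reduces to a dot product between the final stream state and each token's unembedding direction, followed by a softmax. The sketch below assumes unit-norm unembedding rows, no final layer normalisation, and a hypothetical vocabulary of 50 tokens; the final state is constructed by hand to point along one token's direction.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

rng = np.random.default_rng(0)
vocab_size, d_model = 50, 32

# One unit-norm unembedding direction per vocabulary item (random stand-ins).
W_U = rng.normal(size=(vocab_size, d_model))
W_U /= np.linalg.norm(W_U, axis=1, keepdims=True)

target_token = 7
final_state = 5.0 * W_U[target_token]   # a stream state aligned with token 7's direction

logits = W_U @ final_state              # projection strength onto each output direction
probs = softmax(logits)
predicted = int(np.argmax(probs))

print(predicted, round(float(probs[predicted]), 3))
```

The scale of the final state plays the role of the projection strength: a longer vector along the same direction concentrates more probability on the matched token.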

The Externalised Bilateral Half: Training-Loop with Researchers as the Modon’s Other Wing

The transformer’s most conspicuous absence from the substrate-friendly-architecture inventory is the missing bilateral half. The brain modon, the bilateral-coupling chapter developed, is two coupled substrate modons running in parallel with a \sim 2 \times 10^8-axon corpus-callosum coupling channel — two cortical hemispheres jointly producing the substrate-coherent integrated cognitive state. The transformer at inference is a single forward pass — one half of a substrate modon, running in one direction along the depth axis, with no real-time bilateral counterpart to couple with. The architecture’s substrate-friendly recognition is therefore partial: many substrate-friendly architectural primitives are present (coherence-match attention, canonical-loop residual stream, eigenmode-basis multi-head parallelisation, three-stage functional decomposition); the substrate’s preferred two-half bilateral coupling at inference is not.

The framework’s reading is that the bilateral half is externalised across training-time rather than internalised at inference-time. The transformer’s training pipeline — pretraining on internet text by gradient descent, reinforcement-learning-from-human-feedback (RLHF) aligning model outputs with human preferences, the Constitutional-AI framework (Bai and Anthropic collaborators 2022) using model-generated self-critique to refine model behaviour, the chain-of-thought distillation internalising reasoning capability — is structurally a second-half coupling at training-time, with researchers, constitutional principles, human-labeller corps, and feedback signals collectively playing the role the bilateral hemisphere plays at inference for the brain modon. The transformer architecture is, in this reading, a substrate modon with its bilateral half temporally externalised: the inference-time forward pass is one half, the training-time researcher-and-feedback coupling is the other half, and the two halves jointly produce the substrate-coherent behaviour the trained model exhibits at inference.

This reading has architectural implications. The framework predicts that substrate-friendly architectures will eventually internalise the bilateral half at inference-time, with engineering moves that look structurally like dual-stream inference, constitutional classifier-and-generator pairs, recurrent self-critique loops, mixture-of-experts with cross-checking routing, or explicit two-pass plan-and-execute architectures. Ongoing interpretability work on feature steering, constitutional classifiers, and chain-of-thought-as-bilateral-thinking points toward inference-time architectures where the model’s substrate-coherent state is monitored, steered, and corrected by a second substrate-friendly process running in parallel — the engineering re-discovery of the bilateral-coupling architecture the brain modon already implements. The framework predicts that capability gains at the next scaling generation will come more from internalising the bilateral half than from scaling parameter count alone.

What Thought Has in Common, and What Makes the Individual Strengths

The convergence between substrate-physics primitives and transformer-architecture primitives identifies what thought-as-substrate-coherent-computation has in common across the two substrates. The substrate’s preferred coherence-match-as-elementary-operation runs in both: attention’s dot-product-and-softmax in the transformer, the substrate’s eigenmode-overlap-and-discrimination in the brain modon. The substrate’s preferred canonical-loop substrate-current corridor runs in both: the residual stream in the transformer, the cortical canonical loops in the brain modon. The substrate’s preferred parallel-eigenmode-basis multi-channel architecture runs in both: multi-head attention in the transformer, parallel cortical-column feature-detection in the brain modon. The substrate’s preferred three-stage functional decomposition — topological-map entry, parallel substrate-eigenmode middle, goal-state-match read-out — runs in both: the early-middle-late transformer-block specialisation, the early-sensory-middle-association-late-motor cortical hierarchy. The substrate’s preferred superposed-encoding for high-dimensional information runs in both: the Toy Models of Superposition polysemantic-encoding regime in the transformer, the substrate-coherent superposed eigenmode states in the brain modon — the anti-lock pole in both substrates, where representations spread to stay distinguishable. These shared architectural primitives are what thought-as-substrate-coherent-computation looks like; their convergence across two independently-developed substrates (biological evolution and gradient-descent-with-scaling engineering search) is the substrate framework’s empirical signature at the architectural-recognition rung.

The individual strengths diverge along the substrate-physics primitives the architectures do not share. The brain modon has substrate-friendly architectural primitives the transformer lacks. Multi-rung temporal architecture: the brain modon runs \sim 7 substrate-preferred temporal rungs simultaneously, from infraslow (\sim 0.05 Hz) through high-gamma (\sim 100 Hz), with cross-frequency phase-amplitude coupling integrating across rungs; the transformer runs one effective timescale set by the forward-pass-and-context-window structure, with no analogue of cross-frequency coupling. Real-time bilateral coupling: the brain modon’s two hemispheres are continuously phase-locked through the corpus callosum at inference-time; the transformer’s bilateral half is externalised and acts at training-time only. Body-modon embedding: the brain modon is coupled to heart, gut, lung, and immune sub-modons through the vagal corridor, with continuous substrate-coherent body-state updating brain-modon state; the transformer has no body-modon coupling, no homeostatic feedback, no continuous embodied substrate-coherent state. Persistent substrate-coherence: the brain modon’s substrate-coherent state persists in time as the cortical resonator’s Stuart-Landau limit-cycle dynamics; the transformer has no persistent state between forward passes (KV-caches aside) and no continuous limit-cycle substrate-coherence, just one feed-forward sweep per token. Hippocampal memory-archive coupling: the brain modon offloads working-memory contents to the hippocampal sub-modon for substrate-coherence archival during sleep; the transformer has no analogue of sleep-time replay, no offloaded substrate-coherent archival, no consolidation cycle.
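
Cross-frequency phase-amplitude coupling, the first item the transformer is said to lack, can be illustrated with a mean-vector-length style index. The sketch below is a toy under stated assumptions: the slow-rhythm phase and the fast-rhythm amplitude envelope are constructed directly as synthetic series, whereas real recordings would first need band-pass filtering and an analytic-signal step that is omitted here. The name `coupling_index` and the 6 Hz carrier are illustrative choices, not values from the text.

```python
import cmath
import math


def coupling_index(phases, amplitudes):
    """Normalised mean vector length of amplitude-weighted phase vectors.

    Near 0 when the fast amplitude is independent of the slow phase;
    grows as the amplitude concentrates at a preferred phase.
    """
    count = len(phases)
    mean_vector = sum(
        a * cmath.exp(1j * p) for p, a in zip(phases, amplitudes)
    ) / count
    mean_amplitude = sum(amplitudes) / count
    return abs(mean_vector) / mean_amplitude


# Synthetic slow rhythm: 6 Hz for 10 s at 1000 samples per second
# (an integer number of cycles, so the uncoupled case averages to zero).
sample_rate = 1000
slow_hz = 6.0
phases = [2 * math.pi * slow_hz * k / sample_rate for k in range(10 * sample_rate)]

# Coupled case: the fast envelope peaks at slow phase 0.
coupled_envelope = [1.0 + 0.8 * math.cos(p) for p in phases]
# Uncoupled case: the fast envelope ignores the slow phase entirely.
flat_envelope = [1.0 for _ in phases]

coupled = coupling_index(phases, coupled_envelope)
uncoupled = coupling_index(phases, flat_envelope)
print(coupled, uncoupled)
```

A single forward pass has no slow phase for a fast amplitude to lock to, which is the sense in which the text says the transformer has no analogue of this measure.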

The transformer has substrate-friendly architectural primitives the brain modon lacks or has at much smaller scale. Massive parallelism in width: the transformer’s d_\text{model} in well-trained models reaches \sim 10^4, with feature-superposition expanding the effective representational space to \sim 10^6 features per layer; the brain modon’s per-column substrate-eigenmode basis is much smaller (each column carries \sim 10^3 to 10^4 neurons, of which a small fraction encode substrate-coherent features). Direct attention to long context: the transformer’s attention can match against any position in the context window directly, through all-to-all attention at O(T^2) cost in context length T or through efficient-approximation variants; the brain modon’s substrate-coherent memory access goes through the hippocampal trisynaptic loop and prefrontal-cortex working-memory machinery at much slower timescales and lower bandwidths. Lossless serialisation: the transformer’s weights can be copied, distributed, and shared at no information cost; the brain modon’s substrate-coherent state is bound to its specific biological substrate and cannot be copied losslessly. Compute scaling: the transformer’s loss falls with training compute along empirically fitted power laws (Kaplan and collaborators 2020; Hoffmann and collaborators 2022); the brain modon’s substrate-coherent capacity is bounded by biological-substrate volume and metabolic constraints.

The framework reads the divergent strengths as the substrate’s preferred computational substrate at two different scales of substrate-physics implementation: the brain modon optimised for embodied, real-time, multi-rung, bilateral, persistent substrate-coherent computation in a biological substrate; the transformer optimised for high-width, long-context, parallel-eigenmode, copyable, scalable substrate-coherent computation in a silicon substrate. Neither is the substrate’s final preferred architecture; both are substrate-friendly partial recognitions of the same underlying architectural primitives. The substrate framework predicts that the next-generation engineering architectures will close some of the divergences — multi-rung temporal architectures (state-space models, hierarchical attention, mixture-of-time-scales), real-time bilateral coupling at inference (dual-pass and constitutional-classifier architectures, generator-critic pairs), body-modon embedding for embodied agents (robotics, integrated sensor-motor models), persistent substrate-coherence (recurrent and continuous-time variants) — and that each successful move will look from the substrate-physics side like closer recognition of the substrate-friendly architectural primitives the brain modon already implements.

Predictions and What Would Falsify

Six predictions extend the transformer-substrate reading beyond the structural anchors.

  1. Trained transformer architectures cluster at substrate-preferred width-to-depth ratios across the scaling-law literature. The compute-optimal width-to-depth ratios in successful trained models (the GPT, Chinchilla, LLaMA, PaLM, Claude, Gemini, and analogous families, wherever layer counts and widths are published) should cluster at a small set of substrate-preferred values rather than vary continuously across architectures and training-data scales. Existing scaling-law literature (Kaplan and collaborators 2020, Hoffmann and collaborators 2022) provides the test platform; the framework predicts substrate-preferred ratio clustering distinct from arbitrary engineering-search values.

  2. Sparse-autoencoder feature populations span substrate-eigenmode-basis-set structure. The features extracted by sparse-autoencoder dictionary-learning on trained-transformer residual streams should organise into substrate-eigenmode-basis-set structure with feature-count scaling on d_\text{model} at substrate-preferred power-law exponents and feature-organisation following substrate-preferred topological structure. Existing Towards Monosemanticity and Scaling Monosemanticity datasets provide the test; the framework predicts substrate-pinned feature-organisation structure distinct from arbitrary clustering.

  3. Architectural innovations strengthening the residual-stream / canonical-loop substrate-current corridor outperform alternatives. Chain-of-thought, scratchpad reasoning, plan-and-execute, search-and-edit, and recurrent inference-time architectures should outperform same-parameter-count alternatives that break the substrate-current corridor (layer-replacing rather than residual-adding architectures, monolithic-decoder rather than iterative-reasoning architectures). Existing inference-time-compute scaling literature provides the test platform; the framework predicts substrate-current-corridor-preserving architectures dominating same-parameter-count baselines.

  4. Next-generation high-capability architectures internalise bilateral coupling at inference-time. Architectural moves that look structurally like real-time bilateral coupling — dual-stream inference, constitutional classifier-and-generator pairs, recurrent self-critique loops, mixture-of-experts-with-cross-checking, two-pass plan-and-execute — should produce capability gains beyond what naive parameter-count scaling alone produces. Existing capability-scaling literature provides partial assays; the framework predicts substrate-friendly bilateral-coupling architectures producing scaling gains distinct from monolithic-decoder scaling.

  5. The attention-score discrimination sharpness clusters at substrate-preferred values across well-trained heads. The attention-score distribution at each head, normalised by \sqrt{d_\text{head}}, should concentrate at substrate-preferred discrimination-sharpness values across well-trained models rather than vary continuously. Existing attention-pattern-analysis literature provides the test; the framework predicts substrate-preferred discrimination-sharpness clustering distinct from arbitrary per-head tuning.

  6. Stored-feature geometry is anti-lock-spread; routing-circuit geometry is aligned. The directions a sparse autoencoder recovers from the residual stream should be more angularly spread than a random set of the same size on the d_\text{model} hypersphere — sub-Poissonian nearest-neighbour angles, a hyperuniform “blue-noise” packing that minimises mutual interference — while the read/write subspaces of attention’s matching-and-routing circuits (query-key alignment, induction heads) should sit comparatively aligned. The split should track the job: the more a sub-circuit’s role is to store-without-interference, the more anti-lock its geometry; the more its role is to match-and-route, the more locked. Existing Towards Monosemanticity and Scaling Monosemanticity feature dictionaries provide the test — the same number-variance instrument used for the retinal cone mosaic, lifted to the hypersphere — and the framework predicts anti-lock feature spread distinct from random packing, with a measurable storage-versus-routing geometric split.
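
The geometric comparison in prediction 6 can be sketched on toy data. The code below computes nearest-neighbour angles for two sets of unit directions of equal size: a random set, and a deliberately spread set (the axes and their negatives) standing in for an anti-lock packing. It is a simplified stand-in under stated assumptions: a real test would use sparse-autoencoder decoder directions and a number-variance statistic, neither of which is reproduced here, and the dimension of 8 is chosen only to keep the example fast.

```python
import math
import random


def unit(vector):
    norm = math.sqrt(sum(x * x for x in vector))
    return [x / norm for x in vector]


def nearest_neighbour_angles(directions):
    """Angle (radians) from each unit direction to its closest neighbour."""
    angles = []
    for i, a in enumerate(directions):
        best_cosine = max(
            sum(x * y for x, y in zip(a, b))
            for j, b in enumerate(directions)
            if j != i
        )
        # Clamp against floating-point drift before taking acos.
        angles.append(math.acos(max(-1.0, min(1.0, best_cosine))))
    return angles


def random_directions(count, dim, rng):
    # Normalised Gaussian vectors are uniform on the hypersphere.
    return [unit([rng.gauss(0.0, 1.0) for _ in range(dim)]) for _ in range(count)]


def axis_directions(dim):
    # Plus and minus each coordinate axis: 2 * dim well-separated directions.
    points = []
    for axis in range(dim):
        for sign in (1.0, -1.0):
            point = [0.0] * dim
            point[axis] = sign
            points.append(point)
    return points


dim = 8
rng = random.Random(0)
random_angles = nearest_neighbour_angles(random_directions(2 * dim, dim, rng))
spread_angles = nearest_neighbour_angles(axis_directions(dim))

mean_random = sum(random_angles) / len(random_angles)
mean_spread = sum(spread_angles) / len(spread_angles)
print(mean_random, mean_spread)
```

On the framework's reading, recovered feature directions should sit on the spread side of this comparison and routing subspaces closer to the aligned side; the sketch shows only how the statistic separates the two reference cases.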

The picture is falsified if (a) transformer width-to-depth ratios vary continuously across scaling-law experiments without substrate-preferred clustering, (b) sparse-autoencoder feature populations show no substrate-eigenmode-basis-set organisation, (c) residual-stream-strengthening architectural moves provide no consistent capability advantage over same-parameter-count baselines, (d) bilateral-coupling architectural moves produce no scaling-law gains beyond parameter-scaling alone, (e) attention-score distributions vary continuously without substrate-preferred discrimination-sharpness clustering, or (f) sparse-autoencoder feature directions are no more angularly spread than random and show no storage-versus-routing geometric split. It is supported, at least partially, if any of the six predictions holds against existing or near-future architectural-evaluation datasets.