Can human brain connectivity explain verbal working memory?


ABSTRACT

The ability of humans to store spoken words in verbal working memory and build extensive vocabularies is believed to stem from evolutionary changes in cortical connectivity across primate species. However, the underlying neurobiological mechanisms remain unclear. Why can humans acquire vast vocabularies, while non-human primates cannot? This study addresses this question using brain-constrained neural networks that realize between-species differences in cortical connectivity. It investigates how these structural differences support the formation of neural representations for spoken words and the emergence of verbal working memory, crucial for human vocabulary building. We develop comparative models of frontotemporal and occipital cortices, reflecting human and non-human primate neuroanatomy. Using meanfield and spiking neural networks, we simulate auditory word recognition and examine verbal working memory function. The “human models”, characterized by denser inter-area connectivity in core language areas, produced larger cell assemblies than the “monkey models”, with specific topographies reflecting semantic properties of the represented words. Crucially, longer-lasting reverberant neural activity was observed in human versus monkey architectures, compatible with robust verbal working memory, a necessary condition for vocabulary building. Our findings offer insights into the structural basis of human-specific symbol learning and verbal working memory, shedding light on humans’ unique capacity for large vocabulary acquisition.

Introduction

Human cognition is special. Animals may far outperform humans in speed, endurance, or strength, yet when it comes to speaking and to what is commonly described as conceptual thinking, Homo sapiens seems to be privileged. Humans divide up the world along multiple categorial distinctions, which have their correlate in symbolic distinctions. Unlike any other species, humans build huge vocabularies of tens or even hundreds of thousands of symbols and words. These symbols can be combined in diverse complex ways, yielding constructions whose structure even the most complex algorithms struggle to capture (Pinker Citation1994; Brysbaert et al. Citation2016). In contrast, our closest relatives are limited to small vocabularies of up to ca. 100 signs and, even under extreme life-long training, cannot learn more than an additional one or two hundred (Gardner and Gardner Citation1969; Call and Tomasello Citation2020).

What is the reason for this exceptional difference in cognitive ability? When it comes to speed, endurance, or power, there are clear physiological correlates of ability. Strength is manifest in the structure and diameter of muscles, speed in specific mechanistic features along with favourable weight. But what may be the mechanisms behind conceptual thought, combinatorial ability, or huge vocabularies? Among these, the latter feature is probably the most fundamental and serves as an initial step in identifying the contributing biological factors. Huge vocabularies include many thousands of symbols, each of which interlinks a symbol form with meaning, sometimes in an arbitrary manner. Building such a vocabulary requires that symbols – most typically spoken word forms – can be kept in memory for some time, a capacity which, like vast vocabularies themselves, is specific to humans (Gathercole and Baddeley Citation2014; Baddeley et al. Citation2021). The specific human mechanisms for verbal working memory are well understood from a neuroscience perspective, whereby activation is maintained through reverberation in neuronal populations (Fuster and Bressler Citation2012; D’Esposito Citation2007; Goldman-Rakic Citation1995; Pavlov and Kotchoubey Citation2022). A range of neuroimaging and neuropsychological results show that the left perisylvian cortex, including its frontal, parietal, and temporal parts, is essential for storing word forms (Buchsbaum and D’Esposito Citation2019; Emch et al. Citation2019), and that a fibre bundle called the arcuate fascicle (AF), which interconnects frontal and temporal areas of the left perisylvian cortex (Catani et al. Citation2005; Barbeau et al. Citation2020), directly relates to verbal working memory capacity (Yeatman et al. Citation2011; López-Barroso et al. Citation2013; Saygin et al. Citation2013; Teubner-Rhodes et al. Citation2016; Ivanova et al. Citation2021).
A pivotal question is how these facts can help to explain why humans, but not non-human primates, have the capacity for verbal working memory, which enables them to build extensive vocabularies and use these for various cognitive achievements.

One explanatory pathway builds upon evolutionary differences in neuroanatomical structures. In this perspective, a range of different features could be relevant, including sheer brain size or the over-proportional enlargement of perisylvian cortical areas (Rilling Citation2014). Although these quantitative measures may suggest a gradual increase in cognitive capacities, it is less clear how they may explain the emergence of a new capacity, in particular that of verbal working memory. A qualitative difference across primate brains has been suggested by non-invasive tractography research and concerns both the strength and the connection properties of the arcuate fascicle (AF) (Rilling et al. Citation2008, Citation2012; Rilling and van den Heuvel Citation2018).

First, this frontotemporal fibre bundle is weakly developed in monkeys, stronger in apes, and quite prominent in humans, where it is, furthermore, more strongly developed in the left language-dominant hemisphere than in the right (Catani et al. Citation2005; Bain et al. Citation2019; Balezeau et al. Citation2020). Second, the between-area links provided by the AF also vary across species. In monkeys, it primarily connects relatively anterior inferior-prefrontal areas with the lateral superior-temporal auditory parabelt. In contrast, the human connectome includes further between-area links, in particular those linking the premotor cortex with the parabelt and the auditory belt with the prefrontal cortex. These results indicate that humans have not only stronger but also new and qualitatively different between-area connections (see Rilling et al. Citation2012; Thiebaut de Schotten et al. Citation2012).

Schomers and colleagues have closely examined the functional role of the additional connectivity provided by the AF, which is unique to the human brain, by means of brain-constrained neural networks. Figure 1 schematizes the fronto-temporal connectivity structure in humans and monkeys, where the “monkey architecture” has next-neighbour links along with one long-distance link between inferior prefrontal and superior-temporal auditory parabelt. In contrast, their “human architecture” includes multiple fronto-temporal perisylvian links – as indicated by the above-mentioned findings – along with richer connections in both frontal and temporal model cortices (Schomers et al. Citation2017). Both monkey and human models were trained to simulate early stages of articulatory-auditory learning, during which human infants and young monkeys associate their own articulatory movements with the sounds they produce. Note also that this process is analogous to an early stage of language development in human infants, when they babble syllables or repeat words spoken by others (see also Pulvermüller et al. Citation1999; Westermann and Reck Miranda Citation2004). When simulating the association between articulatory-motor and auditory patterns representing spoken and heard word forms or monkey calls, biologically motivated Hebbian algorithms were applied to determine synaptic modification and learning. This resulted in the formation of neural network correlates of oral signs, including spoken word forms, which were distributed across frontotemporal areas.

Figure 1. (A) Schematic depiction of the neural network architectures and the different connectivity structures (human and monkey) used by Schomers et al. (Citation2017). In the monkey model (left), only adjacent areas are connected through reciprocal next-neighbour connections (black), except for the auditory and articulatory systems, which are linked by a long-distance connection (purple). In the human connectivity structure (right), next-neighbour connections are present together with “jumping links” that link second-next-neighbour areas (blue). (B) Connectivity matrices of the monkey and human models. The colours indicate the different kinds of links (light grey: within-area, dark grey: next-neighbour, blue: “jumping”, purple: long-distance).


Although neural assemblies emerged in the monkey model that activated together when stimulated, these neuron sets rapidly lost their activity after activation. In contrast, the neural assemblies developing for each sensorimotor pattern in the human model not only activated as a functional unit after being stimulated, but also maintained their activity for some time (Schomers et al. Citation2017). Maintenance of activity within a set of neurons, also referred to as “reverberant” activity, is an established mechanism underlying working memory (Deco and Rolls Citation2003; Fuster and Bressler Citation2012; Schomers et al. Citation2017; Heeger and Mackey Citation2019). As the human model developed working memory for articulatory-auditory patterns, whereas the monkey models failed to do so, these results provide a candidate explanation for the human ability to memorize sounds individuals can produce themselves, including spoken word forms. This capacity of verbal working memory in humans is explained solely by the distinguishing feature between monkey and human models: the richer vs poorer connectivity structure in the perisylvian cortex provided by the AF. This leads to the more general hypothesis that rich perisylvian connectivity is fundamental to the development of verbal working memory (see Schomers et al. Citation2017).

However, this model-based explanation suffers from some shortcomings. First, even though the deep neural networks employed by Schomers and colleagues were constrained by multiple features known to characterize biological networks in primates, including aspects of their between-area connectivity and further features at the micro- and macrostructural levels, they used a relatively basic neuron model realizing the output of one neural unit as a gradual and continuous variable. This raises the question of whether their results depend on the use of such “meanfield” artificial units or generalize to more biologically realistic network models including spiking integrate-and-fire neurons. Second, Schomers and colleagues simulated only six areas of the perisylvian core language cortex, which is typically defined as the first cortical convolution surrounding the sylvian fissure (Bogen Citation1976). However, it is well known that, apart from the left-perisylvian “language core”, other cortical regions make important contributions to language processing too; these include “extrasylvian” areas in the middle frontal gyrus and the middle and inferior temporal cortex (Pulvermüller et al. Citation1999; Bates et al. Citation2003; Binder et al. Citation2011; Fedorenko and Thompson-Schill Citation2014; Ralph et al. Citation2017; Pulvermüller Citation2018; Riccardi et al. Citation2019; Dreyer et al. Citation2020; Kuhnke et al. Citation2023).

In summary, there are well-documented structural differences in the arcuate fascicle (AF) between humans and non-human primates, with humans exhibiting richer and more complex connectivity (Rilling et al. Citation2008, Citation2012; Rilling and van den Heuvel Citation2018). The AF is well known to play a crucial role in language processing, as shown by a range of previous studies (Yeatman et al. Citation2011; López-Barroso et al. Citation2013; Saygin et al. Citation2013; Teubner-Rhodes et al. Citation2016; Ivanova et al. Citation2021). While a neurocomputational study has begun to address its role in language processing (Schomers et al. Citation2017), it has not yet tackled how this richer connectivity supports the processing of meaningful symbols in spiking neural networks. Hence, a critical gap that remains unexplored is how enhanced connectivity in humans facilitates the formation of cell assemblies that retain their activity for some time, particularly as they expand into extrasylvian regions where semantic information is processed and integrated. Understanding how differences in connectivity impact the development of these cell assemblies could explain the fundamental difference in verbal working memory capacity between humans and non-human primates. Specifically, the richer network of connections in humans may support more robust and longer-lasting reverberatory neural activity, ultimately contributing to greater verbal working memory – an area where non-human primates exhibit substantial limitations.

To address these issues, we employed brain-constrained neural networks that simulate 12 key cortical areas of the frontotemporal and occipital regions relevant for language and symbol processing, incorporating several biological features (Garagnani and Pulvermüller Citation2016; Tomasello et al. Citation2017; Tomasello et al. Citation2019; Pulvermüller et al. Citation2021; Constant et al. Citation2023). Brain-constrained models are grounded in neuroimaging and anatomical connectivity data, offering greater biological plausibility compared to earlier connectionist models or deep neural networks (Christiansen and Chater Citation2001; Dell et al. Citation1999; LeCun et al. Citation2015; Garagnani and Pulvermüller Citation2016; Plaut and Gonnerman Citation2000). Unlike artificial neural networks such as large language models (LLMs) (Vaswani et al. Citation2017; Brown et al. Citation2020), which are highly effective but lack direct correspondence to brain mechanisms, our approach allows for deeper insights into how language is processed in the human brain. Our biologically grounded network was composed of 12 components thought to simulate frontotemporal and occipital cortical areas, 6 perisylvian and 6 extrasylvian, which play a role in linguistic and semantic processing (Deiber et al. Citation1991; Ungerleider and Haxby Citation1994; Zatorre et al. Citation1996; Pulvermüller et al. Citation1999; Fadiga et al. Citation2002; Dum and Strick Citation2002, Citation2005). The perisylvian network part included articulatory motor and auditory systems critical for spoken form processing, whereas the extrasylvian part comprised hand-motor and ventral-visual systems relevant for action execution and object perception, respectively. Each system included one primary (auditory, visual, or motor), one secondary, and one “higher” association area (superior-temporal auditory parabelt, anterior-inferior temporal, inferior and lateral prefrontal).
The simulations used two sets of these 12-area networks, one including meanfield, the other spiking neurons, so as to explore the generalizability of the findings across neural models. Each set of networks consisted of three anatomical variants or “architectures”. The first, called the “human model” (HM), implemented rich connectivity within both perisylvian and extrasylvian model cortex. This means that, within each system, connections interlinked next-neighbour areas and connections between all “association” cortices were realized; in addition, there were connections between second-next neighbours, including those long-range links suggested by the tractography results summarized below (see Figure 2). The second architecture, called the “monkey model” (MM), had a simpler connectivity structure in the perisylvian cortex only, otherwise replicating the connections of the human model. This mirrors the somewhat poorer connectivity structure of ape and monkey brains reported for their perisylvian cortex, described above. The third architecture, the “simple model” (SM), implemented relatively impoverished connectivity in both peri- and extrasylvian regions. The motivation for this third implementation comes from the lack of clarity about the precise size of the projection areas of the arcuate fascicle in frontal and temporal areas of monkey and human cortex, which may in part affect not only inferior frontal and superior temporal areas but also the connectivity of slightly more dorsally (or laterally) located (middle) frontal areas and anterior and posterior areas in the middle and inferior temporal gyrus (Glasser and Rilling Citation2008; Binder et al. Citation2011; Rilling et al. Citation2012). In sum, this led to the implementation of six models: two model variants (made up of meanfield vs spiking neurons) crossed with three architectures (simple, monkey, and human).
Within each model, two cortical regions (peri- vs extrasylvian), two macroscopic brain parts (frontal vs temporal), and three hierarchical area levels (primary, secondary, association/hub) were realized.

Figure 2. (A) Schematic connectivity structure of the left peri-sylvian cortex in macaques, chimpanzees and humans revealed by tractography studies (modified from Rilling et al. Citation2008). (B) Structure of the neural network model with human connectivity. Superior-temporal auditory (blue) and inferior-frontal articulatory (red) systems represent the peri-sylvian region, whereas the extra-sylvian region comprises the lateral dorsal motor (yellow/brown) and the visual “what” (green) systems. The connectivity consists of next-neighbour connections between adjacent areas within each system (black arrows), “jumping links” between non-adjacent areas (blue arrows) and “long-distance links” between pairs of the multimodal areas PB, PFi, AT and PFL (purple arrows; figure taken from Tomasello et al. Citation2018). (C) Schematic depiction of the neural network architecture and the different connectivity structures analysed. In the simple model (SM, left), only adjacent areas are connected with next-neighbour connections (black), along with long-distance links between multimodal areas (purple). In the monkey model (MM, middle) architecture, the connectivity is similar to the SM, with additional “jumping links” (blue) in the extra-sylvian region. Finally, in the human model (HM, right) connectivity structure, “jumping links” are present in both the peri-sylvian and extra-sylvian regions. (D) Connectivity matrices of the SM, MM and HM. The colours represent the different links depicted in (C).


Each network implementation was submitted to a learning phase, during which articulatory-auditory activity patterns were presented together with activation patterns in either the primary visual or the dorsal-motor cortex. This simulated spoken word form learning in the context of visual perception of objects or execution of motor actions to which the word forms relate semantically (Tomasello et al. Citation2017; Tomasello et al. Citation2018, Citation2019, Citation2024; Constant et al. Citation2023). After learning driven by Hebbian plasticity, including both long-term potentiation and depression, network structure and function were mapped to evaluate the neural representations and circuits that had formed along with their dynamics. In particular, we asked whether (i) a definable neuronal circuit had developed for each learned “meaningful word”, (ii) this circuit fully “ignited” after stimulation of the network with a previously learned pattern, and (iii) if so, whether circuit activity was maintained for some time, thus showing the emergence of verbal working memory by neural reverberation.

The main contributions of the article are presented in Table 1.

Table 1. Main contributions of the study.

Material and methods

In the present study, we implemented a neuroanatomically and neurophysiologically constrained neural network model of 12 frontotemporal and occipital areas based on previous modelling work (Tomasello et al. Citation2017; Tomasello et al. Citation2018, Citation2019; Henningsen-Schomers and Pulvermüller Citation2022; Constant et al. Citation2023). The following brain constraints and neuroanatomical principles, discussed as relevant for simulating cognitive functions (Pulvermüller et al. Citation2021), were applied to the model:

  • Model neurons were either spiking or meanfield

    • Spiking model: Neurophysiological dynamics of spiking pyramidal neurons, including spatial and temporal summation of inputs and adaptation (Connors et al. Citation1982; Matthews Citation1998); threshold-based excitatory cells provided a binary output.

    • Meanfield model: The meanfield cells summed up inputs in the same way as spiking model neurons but produced graded responses, representing the average firing rate of a neuronal element or local neuron cluster. Thus, a cell’s output can be interpreted as the average firing rate of a population of neurons (Wilson and Cowan Citation1973; Brunel and Hakim Citation1999).

  • Learning mechanisms

    • Synaptic plasticity followed a Hebbian learning rule that included long-term potentiation (LTP) and long-term depression (LTD) (Artola and Singer Citation1993).

  • Regulation and control

    • Excitatory activity was regulated using two types of inhibitory feedback, lateral local and area-specific global inhibition, to prevent non-physiologically high activity states (Braitenberg Citation1978; Knoblauch and Palm Citation2002). To prevent under-activity, each neuron received uniform, uncorrelated white noise during the learning phase. To mimic realistic variability of inputs, additional static noise was applied in the stimulus patterns supplied to primary areas (Garagnani and Pulvermüller Citation2016; Tomasello et al. Citation2018, Citation2024).

  • Cortical areas

    • The 12 modelled areas included motor (articulatory and hand-motor) and sensory (auditory and visual) systems in the inferior and dorsolateral frontal, superior temporal, and ventral temporal and occipital cortex, along with multimodal areas interlinking these systems (Garagnani and Pulvermüller Citation2016; Tomasello et al. Citation2017).

  • Local connectivity

    • Within-area connectivity was sparse, random, and initially weak, with a neighbourhood bias towards close-by links (Kaas Citation1997; Braitenberg and Schüz Citation2013).

  • Between-area global connectivity

    • Connections between different cortical areas were bidirectional and guided by neuroanatomical principles and experimental studies, as explained in detail below.
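To illustrate the regulation-and-control constraint above, the following sketch combines area-specific global inhibition (a slow leaky integrator of the area's summed activity) with additive white noise. The function name, parameter values, and noise distribution are our own assumptions for illustration, not the model's published equations:

```python
import numpy as np

def regulate_area(inputs, activity, global_inh, k_inh=0.6, tau_inh=8.0,
                  noise_sd=0.02, rng=None):
    """Sketch of activity regulation in one area (hypothetical parameters):
    an area-wide inhibitory variable tracks the summed excitatory activity
    and is subtracted from every cell's input, preventing runaway activation;
    uniform white noise keeps the network from falling silent."""
    rng = rng or np.random.default_rng()
    # Area-specific global inhibition: leaky integration of total activity
    global_inh = global_inh + (k_inh * activity.sum() - global_inh) / tau_inh
    # Inhibition-corrected, noise-perturbed input to each excitatory cell
    regulated = inputs - global_inh + rng.uniform(-noise_sd, noise_sd, inputs.shape)
    return regulated, global_inh
```

With sustained high activity, the inhibitory variable grows over successive steps and increasingly damps the input each cell receives.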

Neuron types

Both neuron models (meanfield and spiking) follow similar membrane potential dynamics but differ in how they compute outputs. The spiking neurons capture the discrete, all-or-none firing behaviour of individual neurons, providing binary outputs based on specific activation thresholds (Connors et al. Citation1982; Matthews Citation1998). In contrast, meanfield neurons simulate the averaged behaviour of a population, producing graded responses that reflect the overall firing rate of a neural group (Wilson and Cowan Citation1973; Brunel and Hakim Citation1999). This approach allows us to explore both fine-grained and population-level neural dynamics in our simulations. The detailed mathematical equations governing each neuron type are provided in the Appendix; the model parameters were adopted from previous successful simulations (Garagnani and Pulvermüller Citation2016; Tomasello et al. Citation2017; Tomasello et al. Citation2018; Constant et al. Citation2023; Nguyen et al. Citation2024), ensuring consistency and comparability with previously established simulation work. For a more in-depth description of the parameters and their impact on CA formation, see Garagnani et al. (Citation2008).
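To make the contrast concrete, here is a minimal sketch of the two output rules on top of a shared leaky integration step. The threshold, time constant, and sigmoid slope are illustrative placeholders rather than the model's actual parameters, which are given in the Appendix:

```python
import numpy as np

TAU = 2.5     # membrane time constant in time steps (illustrative value)
THETA = 0.18  # output threshold (illustrative value)

def update_potential(v, total_input, dt=1.0, tau=TAU):
    """Leaky integration shared by both neuron types:
    forward-Euler step of dv/dt = (input - v) / tau."""
    return v + dt * (total_input - v) / tau

def spiking_output(v, theta=THETA):
    """Spiking model: discrete, all-or-none (binary) output above threshold."""
    return (v > theta).astype(float)

def meanfield_output(v, theta=THETA, slope=4.0):
    """Meanfield model: graded sigmoid output, interpretable as the
    average firing rate of a local neuron population."""
    return 1.0 / (1.0 + np.exp(-slope * (v - theta)))
```

The same membrane potential thus drives either a binary spike train or a continuous rate, which is the only difference between the two model variants at the single-unit level.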

Simulated brain areas and connectivity structure

The network modelled the following 12 cortical areas. The left-perisylvian language region was represented by six areas, including the primary auditory cortex (A1), auditory belt (AB), and modality-general auditory parabelt (PB), which make up the “auditory system”. In addition, the inferior part of the primary motor cortex (M1i), inferior premotor cortex (PMi), and multimodal prefrontal motor cortex (PFi) were included, which together constitute the “articulatory system” (inferior face-motor and adjacent multimodal areas). Six additional areas composed the extra-sylvian region, including the primary visual cortex (V1), temporo-occipital (TO), and anterior-temporal (AT) areas for the ventral visual system. The motor system was represented by the dorsolateral fronto-central motor (M1L), premotor (PML) and prefrontal (PFL) cortices. Each system included a multimodal area central to the neural architecture (PB, PFi, AT, and PFL). These areas can be considered “convergence zones” (Damasio Citation1989) or “connector hub areas” (van den Heuvel and Sporns Citation2013) because each of them has a relatively large number of links to other areas, and each connects within 1–3 synaptic steps to all four modelled modality-specific primary cortices.

Each of the 12 areas consisted of an array of 25 × 25 (=625 in total) excitatory cells (spiking cells in the spiking model, graded-response cells in the meanfield model) and 625 inhibitory cells, simulating a pool of interneurons within the same cortical columns (see Tomasello et al. Citation2017); thus, each network included 15,000 neurons. Implementation of between-area connections was guided by studies using DTI/DWI and tractography (Rilling et al. Citation2008, Citation2012; Glasser and Rilling Citation2008; Thiebaut de Schotten et al. Citation2012; Rilling Citation2014; Ardesch et al. Citation2019), as described in detail below.

Taking into account the neuroanatomical differences of the AF across non-human primate species and humans, as described in the introduction (Rilling et al. Citation2008, Citation2012; Rilling and van den Heuvel Citation2018), we constructed three distinct models: a Simple Model (SM), a Monkey Model (MM), and a Human Model (HM).

Simple model

In the SM architecture, areas in the peri- and extra-sylvian regions were interconnected through (next-neighbour) connections between adjacent areas (black arrows and matrices in Figure 2). These cortical areas were reciprocally linked, as well documented by prior neuroanatomical studies in monkeys (Pandya and Yeterian Citation1985, Citation1985; Arikuni et al. Citation1988; Distler et al. Citation1993; Bressler et al. Citation1993; Lu et al. Citation1994; Young et al. Citation1994; Pandya Citation1995; Kaas and Hackett Citation2000; Rauschecker and Tian Citation2000; Rizzolatti and Luppino Citation2001; Dum and Strick Citation2002, Citation2005). Similarly, the multimodal areas (PB, PFi, AT, and PFL) were also connected with each other through long-distance reciprocal connections (purple arrows), in line with previous findings in monkeys (Fuster et al. Citation1985; Eacott and Gaffan Citation1992; Webster et al. Citation1994; Parker and Gaffan Citation1998; Romanski et al. Citation1999; Rauschecker and Tian Citation2000; Romanski Citation2007; Rilling et al. Citation2008; Petrides and Pandya Citation2009; Rilling Citation2014). This connectivity structure serves as the most basic model; the absence of “jumping links” results in diminished connectivity, establishing a baseline for evaluating the other architectures.

Monkey model

In the MM architecture, in addition to next-neighbour and long-distance connections, areas in the extra-sylvian region only were also connected via “jumping links” (blue arrows) between second-next-neighbour areas (n + 2), whose existence had been reported by previous studies in primates (Fuster and Jervey Citation1981; Fuster et al. Citation1985, Citation1985; Pandya and Barnes Citation1987; Seltzer and Pandya Citation1989; Deacon Citation1992; Chafee and Goldman-Rakic Citation2000; Thiebaut de Schotten et al. Citation2012).

Human model

The HM architecture, which also includes next-neighbour and long-distance connections, was the only one that included jumping links in the peri-sylvian as well as the extra-sylvian region (Guye et al. Citation2003; Rilling et al. Citation2008, Citation2012; Thiebaut de Schotten et al. Citation2012; Rilling and van den Heuvel Citation2018). Thus, the auditory and articulatory systems in the peri-sylvian region were more densely interlinked, mimicking the strongly developed AF fibre bundle in humans (Glasser and Rilling Citation2008; Rilling et al. Citation2012; Rilling and van den Heuvel Citation2018).
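The three architectures can be summarized as binary between-area adjacency matrices. The sketch below follows a simplified reading of the description above: two six-area chains with next-neighbour links, all-pairs long-distance links among the four hub areas, and jumping links added per architecture. The area ordering and the exact link lists are our own assumptions, not the model's published connectivity:

```python
import numpy as np

PERI  = ["A1", "AB", "PB", "PFi", "PMi", "M1i"]   # auditory + articulatory chain
EXTRA = ["V1", "TO", "AT", "PFL", "PML", "M1L"]   # visual + hand-motor chain
AREAS = PERI + EXTRA
HUBS  = ["PB", "PFi", "AT", "PFL"]                # multimodal "connector hub" areas

def build_matrix(architecture):
    """Symmetric 12x12 between-area matrix (1 = reciprocal link present)
    for one of the three architectures: "SM", "MM", or "HM"."""
    idx = {a: i for i, a in enumerate(AREAS)}
    m = np.zeros((12, 12), dtype=int)

    def link(a, b):
        m[idx[a], idx[b]] = m[idx[b], idx[a]] = 1

    for chain in (PERI, EXTRA):
        for a, b in zip(chain, chain[1:]):    # next-neighbour links
            link(a, b)
    for a in HUBS:                            # long-distance hub-to-hub links
        for b in HUBS:
            if a != b:
                link(a, b)
    # Jumping links between second-next neighbours, per architecture
    jumping = {"SM": [], "MM": [EXTRA], "HM": [PERI, EXTRA]}[architecture]
    for chain in jumping:
        for a, b in zip(chain, chain[2:]):
            link(a, b)
    return m

# Link counts increase from SM through MM to HM
n_links = {arch: int(build_matrix(arch).sum()) // 2 for arch in ("SM", "MM", "HM")}
```

Under these assumptions the HM matrix is a strict superset of the MM matrix, which in turn is a superset of the SM matrix, mirroring the graded connectivity differences described in the text.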

Each of the 12 implementations of the 6 model variants (overall 72 networks) was first submitted to a learning phase simulating early stages of word form and semantic learning. Please note that conceptual, linguistic, and semantic learning may in part occur separately, at different developmental stages (Perszyk and Waxman Citation2018). Earlier neurocomputational work has focused on these learning phases (Mayor and Plunkett Citation2010; Constant et al. Citation2023) and found that the type of neural representations that form after three learning steps (Constant et al. Citation2023) resemble those obtained after only one (Tomasello et al. Citation2017; Tomasello et al. Citation2018). Therefore, and to speed up the training procedure, learning of symbol form and meaning was collapsed into one learning phase. After learning, an evaluation followed where we assessed structural and functional properties of the representations built during learning.

Learning phase

Prior to learning, the presence and weights of synaptic connections between neurons were determined using a stochastic algorithm with a bias towards local links (Garagnani et al. Citation2007, Citation2008; Pulvermüller et al. Citation2014), using different seed values for the 12 implementations of each of the 6 model types. During learning, input presentations provided co-activation of neurons. Using Hebbian learning, all weights between excitatory neurons were modified by a learning rule implementing both long-term potentiation (so that “neurons that fired together wired together”) and long-term depression (so that “neurons out of sync delinked”) (Hebb Citation1949; Artola and Singer Citation1993; Wennekers et al. Citation2006). With repeated learning, neurons that were frequently activated together strengthened their mutual connections and became more strongly interlinked than other neuron sets in the networks; we call such strongly interlinked neuron sets cell assembly circuits.
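A minimal version of such a Hebbian rule combining LTP and LTD might look as follows. The thresholds and learning rates are invented for illustration and do not correspond to the model's parameters:

```python
import numpy as np

LTP_RATE = 0.0008   # illustrative learning rates, not the paper's values
LTD_RATE = 0.0008
THETA_PRE = 0.05    # presynaptic activity threshold (illustrative)

def hebbian_update(w, pre, post, theta_post=0.15):
    """Hebbian rule with LTP and LTD, in the spirit of Artola and Singer (1993):
    LTP when pre- and postsynaptic activity are both high ("neurons that fire
    together wire together"); LTD when the presynaptic neuron is active but
    the postsynaptic one is not ("neurons out of sync delink").
    w[i, j] is the weight from presynaptic neuron j to postsynaptic neuron i."""
    pre_active = pre > THETA_PRE
    post_high = post > theta_post
    dw = np.where(np.outer(post_high, pre_active), LTP_RATE,
                  np.where(np.outer(~post_high, pre_active), -LTD_RATE, 0.0))
    return np.clip(w + dw, 0.0, 1.0)   # weights stay in a bounded range
```

Applied over many trials, this rule strengthens the links among frequently co-activated neurons while weakening links from active neurons onto silent ones, which is the mechanism behind cell assembly formation described above.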

In each learning trial, a network received 16 time steps of stimulation. Stimulation patterns activated 22 randomly defined cells in 3 of the 4 primary areas of the model, whereas the 4th primary area received random input changing from time step to time step. The 22 stimulated neurons of a pattern within each primary area of 25 × 25 cells resulted in sparse activation of around 3.5% of the total cells available. Stimulation patterns were applied simultaneously to the primary auditory (A1) and articulatory (M1i) areas, plus either primary visual cortex (V1), to simulate learning of object-related signs, or lateral motor cortex (M1L), for action-related signs. This mimics a word learning situation in which a novel word form is produced while a referent object is visually perceived (Vouloumanos and Werker Citation2009) or a relevant motor action is executed by the language learning infant (Tomasello and Kruger Citation1992). When learning that a word is used to speak about an object (action), hand-motor activity (visual stimulation) can be relatively variable; therefore, stimulation patterns also included an uncorrelated noise pattern in V1 for action words and in M1L for object words, ensuring that the correlation between word form and semantic information was high in the relevant modality but low in the non-relevant one (Tomasello et al., Citation2019; Tomasello et al., Citation2024). We note that the patterns presented to the visual and motor cortex capture experiential and environmental information relevant for learning the meaning of words – e.g., which object or action the word is used to speak about.
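A sketch of how such stimulation patterns might be generated, assuming binary patterns with the 22-of-625 sparsity stated above; the function names and the seed are hypothetical:

```python
import numpy as np

rng = np.random.default_rng(seed=0)  # seed chosen for illustration only

AREA_SHAPE = (25, 25)   # 625 cells per primary area
PATTERN_SIZE = 22       # ~3.5% sparse activation

def fixed_pattern():
    """One stimulation pattern: 22 randomly chosen cells set to 1."""
    idx = rng.choice(np.prod(AREA_SHAPE), size=PATTERN_SIZE, replace=False)
    p = np.zeros(np.prod(AREA_SHAPE))
    p[idx] = 1.0
    return p

def trial_inputs(word_type, a1, m1i, v1_sem, m1l_sem):
    """Inputs for one learning trial: fixed, correlated patterns in A1, M1i
    and the semantically relevant primary area; freshly drawn noise (and
    hence no correlation with the word form) in the irrelevant one."""
    noise = fixed_pattern()  # re-drawn every trial
    if word_type == "object":
        return {"A1": a1, "M1i": m1i, "V1": v1_sem, "M1L": noise}
    else:  # action word
        return {"A1": a1, "M1i": m1i, "M1L": m1l_sem, "V1": noise}
```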

Each network learned 12 distinct triplets of sensory-motor patterns, simulating the learning of 6 signs thought to symbolize actions and 6 more signs related to objects. Additional white noise was continuously presented to all areas during the learning phase, regardless of whether stimulation patterns were applied or not. After the initial 16-timestep pattern presentation, a period without pattern stimulation followed (interstimulus interval, ISI). This period lasted until network activity caused by stimulation had ceased and the level of global inhibition in areas PFi and PB had returned to a baseline level. This was to prevent interference between consecutive patterns and learning trials. There were 5000 learning trials for each of the 12 signs (overall 60,000) for each of the 12 implementations of the 6 network types (thus 4.32 × 10^6 trials). This number of repetitions was chosen based on previous simulations, which showed no substantial changes after 1000 learning episodes per sign (Garagnani et al. Citation2008; Schomers et al. Citation2017).
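The trial structure — fixed-length stimulation followed by an interstimulus interval that lasts until activity and global inhibition have decayed back to baseline — can be illustrated with a toy stand-in for the full network; the decay constants, thresholds, and class names here are arbitrary illustrations:

```python
class ToyNet:
    """Minimal stand-in for the full model: a single activity scalar that
    decays after stimulation, and a slower leaky-integrator 'inhibition'
    scalar standing in for the global inhibition of areas PFi and PB."""
    def __init__(self):
        self.activity = 0.0
        self.inhibition = 0.0

    def step(self, stim=0.0):
        self.activity = 0.6 * self.activity + stim
        self.inhibition = 0.8 * self.inhibition + 0.2 * self.activity

def learning_trial(net, stim=1.0, stim_steps=16, act_floor=1e-3, inhib_floor=1e-3):
    """One trial: 16 steps of pattern stimulation, then an ISI that lasts
    until both activity and inhibition fall back below baseline floors."""
    for _ in range(stim_steps):
        net.step(stim)
    isi = 0
    while net.activity > act_floor or net.inhibition > inhib_floor:
        net.step(0.0)
        isi += 1
    return isi
```

The variable-length ISI is the design point: trials are separated not by a fixed pause but by waiting for the network to return to its resting state, preventing carry-over between patterns.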

Evaluation

After object and action word learning, we examined the neurophysiological mechanisms underpinning these processes in the three different network architectures and model types. To this end, each of the 12 learned “acoustic” patterns was presented to the auditory primary area (A1), and the resultant neural activity brought about by each of the “auditory” stimulation patterns was mapped across the entire network, including all modelled areas.

Mapping of Cell Assembly circuits

Each evaluation trial started with a baseline of 10 time-steps without any inputs, followed by two time-steps of pattern presentation to the primary auditory area A1, and ended with 60 subsequent time-steps during which activation was allowed to spread through the network (Garagnani et al. Citation2008; Schomers et al. Citation2017; Tomasello et al. Citation2017; Tomasello et al. Citation2018). As in the learning phase, all neurons received additional white noise. For the spiking model, a cell was considered active when it fired at least one spike during an evaluation trial. A meanfield neuron was counted as active if its gradual firing probability (or membrane potential) exceeded 75% of that of the most active neuron of a given area. To eliminate spurious results, the maximum firing rate had to be at least 0.01. This procedure was adopted from previous work (Garagnani et al. Citation2017; Tomasello et al. Citation2017; Tomasello et al. Citation2018). To separate activity directly related to external stimulation from activations generated within the network, we also stimulated with the previously learned “motor patterns” in M1i and used activity values of the non-stimulated parts of the perisylvian regions (A1, AB, and PB after M1i stimulation, and M1i, PMi, and PFi after auditory stimulation).

Neurons were counted as cell assembly (CA) members if they became active in both stimulation conditions. This procedure was computed for all networks and averaged across conditions (neuron models, connectivity structures, and word types). Plots were created for each area separately. The number of CA neurons per network and area was statistically analysed as described in the next section. All stimulations were performed with the same noise level within each model type (noise parameters: spiking = 5, meanfield = 27) as in the learning phase.
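The meanfield activity criterion and the two-condition CA membership rule might be implemented as follows. This is a sketch under the thresholds stated above (75% of the area maximum, minimum rate 0.01); for the spiking model the thresholding would be replaced by the at-least-one-spike criterion:

```python
import numpy as np

def active_cells_meanfield(acts, rel_thresh=0.75, min_rate=0.01):
    """A meanfield cell counts as active if its firing estimate exceeds 75%
    of the most active cell in the area, and that maximum is at least 0.01
    (otherwise the whole area is treated as silent)."""
    peak = acts.max()
    if peak < min_rate:                        # discard spurious, near-silent areas
        return np.zeros_like(acts, dtype=bool)
    return acts >= rel_thresh * peak

def cell_assembly(acts_auditory, acts_motor, **kw):
    """CA members: cells active under both auditory (A1) and articulatory
    (M1i) stimulation of the same learned word."""
    return active_cells_meanfield(acts_auditory, **kw) & active_cells_meanfield(acts_motor, **kw)
```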

Mapping of ignition (maximal activation) and reverberation (maintenance of activity)

We also simulated object and action word recognition and comprehension processes and examined the neurophysiological mechanisms underpinning these processes in the three different model architectures for the meanfield and spiking neuron models. To this end, each of the 12 learned acoustic patterns was presented to the auditory primary area (A1), and the resultant neural activity brought about by each of the “auditory” stimulation patterns was mapped across the entire network, including all modelled areas. Following the same procedure as during CA mapping, each evaluation trial consisted of 10 time-steps without any inputs, followed by two time-steps of pattern presentation to A1 and 60 time-steps of activation spreading, during which activity was mapped (Garagnani et al. Citation2008; Schomers et al. Citation2017; Tomasello et al. Citation2017; Tomasello et al. Citation2018). Then, the same process was computed with pattern presentation to M1i, in order to average the two trials and remove the overexcitation produced by direct stimulation of the respective input areas. As for CA mapping, white noise was added throughout. To obtain information about the time course of activations, two parameters were defined: peak activation per area and reverberation time. The peak was defined as the maximum number of neurons activated in a given area during the entire 60 time steps of an evaluation trial. Reverberation time (Rtime) was defined as the latency, in simulation time steps, between the point in time of peak activity and the time point when activity had returned below baseline (2 standard deviations of pre-stimulus baseline activity), following similar VWM measure procedures (Schomers et al. Citation2017). The reverberation period provides a measure of the duration for which activity was held by the CA.
Again, values were extracted for each model and connectivity type, word category, and area; values were averaged across the six words per word type and the standard errors across the 12 model implementations plotted.
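One possible reading of the peak and Rtime definitions, sketched in code; the exact baseline criterion (here: mean plus 2 standard deviations of the 10-step pre-stimulus period) is our interpretation of the text, and the function name is hypothetical:

```python
import numpy as np

def peak_and_rtime(ts, baseline_steps=10):
    """Peak = maximum number of active neurons over the trial.
    Rtime = number of time steps from the peak until activity first drops
    below the baseline threshold (pre-stimulus mean + 2 SD)."""
    base = ts[:baseline_steps]
    thresh = base.mean() + 2 * base.std()
    t_peak = int(np.argmax(ts))
    below = np.nonzero(ts[t_peak:] < thresh)[0]
    # if activity never returns to baseline, count until the trial's end
    rtime = int(below[0]) if below.size else len(ts) - t_peak
    return ts[t_peak], rtime

# toy trace: flat 10-step baseline, then ignition at step 12 and decay
ts = np.concatenate([np.full(10, 0.1), [0.5, 2.0, 5.0, 3.0, 1.0, 0.05, 0.0]])
peak, rtime = peak_and_rtime(ts)
```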

Statistical analyses

To statistically test for significant differences in the number of activated neurons in specific time periods and in the number of CA neurons across the different model types (spiking, meanfield), architectures (HM, MM, SM), word types (object and action), and areas (12 cortical regions), we performed analyses using Generalized Estimating Equations (GEE). GEE is well suited for repeated measures and complex models, as it accounts for within-subject correlations and offers robust standard error estimates (Liang and Zeger Citation1986; Ballinger Citation2004; Hardin and Hilbe Citation2013). This method provides a flexible framework for the analysis of correlated data and is especially beneficial in the presence of clustered or longitudinal data, ensuring more accurate and generalizable results. The different levels of each factor were encoded with sum-to-zero coding (Agresti Citation2012; Fox Citation2015), which ensured that the codes across the levels of each factor summed to 0 and that the GEE was not biased towards any one level. For instance, HM was encoded as 1, MM as 0, and SM as −1. The six factors were as follows: Model Type (two levels: {Spiking, Meanfield}), Architecture (three levels: {HM, MM, SM}), WordType (two levels: {Action, Object}), and the three topological factors PeriExtra (two levels: Peri-sylvian = {A1, AB, PB, PFi, PMi, M1i}, Extra-sylvian = {V1, TO, AT, PFL, PML, M1L}), FrontPost (two levels: frontal = {PFL, PML, M1L, M1i, PMi, PFi}, posterior = {A1, AB, PB, V1, TO, AT}), and Areas (three levels: Primary = {A1, V1, M1L, M1i}, Secondary = {TO, AB, PML, PMi}, and Central = {PB, AT, PFL, PFi}). For model selection, we used the Quasi-likelihood Information Criterion (QIC) (Pan Citation2001), an extension of the AIC adapted to GEE models. The QIC accounts for model complexity while appropriately penalizing overfitting, making it more suitable for the type of repeated and dependent data handled by GEE.
In addition to QIC, we computed the pseudo-R2 to evaluate model fit and the proportion of variance accounted for by each model. All interactions between factors were computed, and if the largest interaction was significant, the GEE was broken down into smaller models containing one factor less. In addition to being more interpretable, the smaller models showed lower QIC and deviance, indicating that they explained the data relatively better. To assess the difference in CA size between architectures, repeated-measures ANOVAs were performed for each area and for each model type, followed by Bonferroni-corrected planned comparison tests (12 comparisons, corrected critical p < .0042).

To study the reverberation time (Rtime), a repeated-measures one-way ANOVA (factor: connectivity structure, three levels: {HM, MM, SM}) was performed for each area and for each model type (spiking and meanfield). In case of significant differences in the ANOVAs, Bonferroni-corrected planned comparison tests were then carried out to identify which specific group means differed significantly from each other. All p-value thresholds were corrected for the number of comparisons using Bonferroni corrections (12 comparisons, corrected critical p < .0042).
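The corrected critical threshold used for the planned comparisons follows directly from the number of tests; a minimal helper (the function name is illustrative):

```python
def bonferroni_significant(p_values, alpha=0.05):
    """Bonferroni correction: divide alpha by the number of planned
    comparisons. With 12 comparisons the corrected critical threshold is
    0.05 / 12 = 0.00417, reported in the text as p < .0042."""
    threshold = alpha / len(p_values)
    return threshold, [p < threshold for p in p_values]

# 12 planned comparisons, three of them with small uncorrected p-values
thr, sig = bonferroni_significant([0.001, 0.01, 0.0041] + [0.5] * 9)
```

Note that a comparison with p = 0.01 survives an uncorrected 0.05 threshold but not the corrected one, which is the point of the correction.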

Results

Cell assemblies size

Across spiking and meanfield model types and across architectures, strongly interlinked cell assemblies (CAs) developed spontaneously as a result of Hebbian correlation learning (). These became manifest as the sets of neurons that jointly activated upon stimulation with an acoustic word form pattern. Intriguingly, the CAs exhibited a higher density of neural material for the fully connected human architecture (blue) and, to a lesser degree, for the weakly connected monkey (orange) and simple (green) architectures in both spiking (right) and meanfield (left) model types. The meanfield model showed an overall higher density of neural material than the spiking model, due to the different neuron model types.

Figure 3. Distribution of cell assemblies (CAs) across six network types. The mean number of CA neurons is plotted for each cortical area in both meanfield (left) and spiking (right) models. Each colour represents a different architecture: human (blue), Monkey (orange), and simple Model (green). The number of neurons is shown separately for action and object words within each model area. Notable topographical differences are observed in the extra-sylvian regions, particularly in the human model’s primary motor area (M1L) with largest neuron numbers for action words and the primary visual cortex (V1) where object word CAs are prominent. Note that the y-axis scale differs across network types, reflecting variations in the number of neurons per architecture.

The distributions of the cells included in these CAs varied across network areas depending on the object and action word learning scenarios. Specifically, the extra-sylvian region showed topographical differences, in particular for the human architecture, while no differences between word types were observed in the peri-sylvian regions across architectures. The CA circuits of action words tended to extend into lateral motor regions, in particular M1L, more strongly than those of object words, and the opposite was true for object-related words in the visual system, in particular V1. These differences in semantic circuit distribution were most obvious in the spiking human model, and they were similarly manifest in primary areas of both the spiking and meanfield human models. In contrast, the other architectures, MM and SM, did not show such a clear topographical distribution.

To assess the difference in CA sizes between the three architectures, Generalized Estimating Equations models and repeated-measures ANOVAs were performed on the number of cell assembly neurons in each model type obtained across all areas and for each area separately.

Meanfield model

For the meanfield networks, the human model produced by far the largest number of neurons per cell assembly (average CA sizes HM: 244.1 (SE: 4.6); MM: 9.2 (SE: 1.3); SM: 8.0 (SE: 0.4); F (2,22) = 11383.5; corrected p-value <0.001) across the entire networks and likewise in each of the areas specifically (F-values >764; corrected p-values <0.0001). A statistically significant difference in CA size was also observed between MM and SM, with MM having larger CA sizes (p < 0.001).

Spiking model

Similar results were obtained in the spiking networks. The human model revealed substantially larger CA sizes across areas compared to the monkey and simple models (HM: 85.9 neurons (SE: 1.5); MM: 25.5 (SE: 0.6); SM: 25.5 (SE: 0.4)), with converging results for each area analysed individually. However, no difference in CA sizes was observed between MM and SM. Post-hoc Tukey’s HSD tests confirmed that, for both the meanfield and spiking models, across areas and for each area specifically, the HM produced larger CAs than the MM and the SM (corrected p-value <0.0001, see ).

Semantic specificity and CA distributions per model

To examine CA topographical distribution across the modelled regions, a 4-way GEE was run for each model type (meanfield and spiking) with factors WordType × PeriExtra × FrontPost × Areas (see methods for more detail). Note that if the 4-way interaction was significant, 3-way GEEs were computed for peri- and extra-sylvian areas separately.

Meanfield model

The human model revealed a significant 4-way interaction (β = −40.028, corrected p-value <0.0001, QIC = 59683.27, R2 = 0.8100) along with a main effect of PeriExtra (corrected p-value <0.0001), indicating generally larger neuron numbers in peri-sylvian compared with extra-sylvian regions. However, no similar significant interaction was observed for either the monkey or the simple model; for these, even the main effect of WordType failed to reach significance (monkey: corrected p-value = 0.41; simple model: corrected p-value = 0.067). Note again the very low numbers of CA cells in both the meanfield MMs and SMs.

The GEE for the human meanfield model was then run for the peri-sylvian and extra-sylvian regions separately, with factors WordType, FrontPost, and Areas. For the peri-sylvian regions, the three-way interaction between WordType, FrontPost, and Areas was not significant (corrected p-value = 0.79). However, the three-way interaction was significant for the extra-sylvian regions (β = −40.6375, corrected p-value <0.0001, QIC = 21252.03, R2 = 0.8640). This indicates that only the extra-sylvian regions show differences in CA distributions between word types across Areas and FrontPost (see also , left panel).

To further explore any significant area-specific differences in CA cell numbers between action and object word circuits across each network, we ran Bonferroni-corrected planned comparison tests (12 comparisons, corrected critical p < 0.0042). Specifically, as reported in earlier work (Tomasello et al. Citation2017; Tomasello et al. Citation2018; Constant et al. Citation2023), we found relatively higher numbers of cell assembly neurons in primary, secondary, and central visual areas for object words and in motor areas for action-related words in the human model (p-value <0.0001). We note, however, that the distribution of CA cell patterns observed in the extrasylvian connector hub areas central to the architecture showed an inverse pattern of neural material density: relatively more CA cells for object words in PFL, and for action words in AT (p-value <0.001).

Spiking model

The 4-way GEE with factors WordType, PeriExtra, FrontPost, and Areas showed a significant 4-way interaction for the human architecture (β = -4.2144, corrected p-value <0.001, QIC = 10047.03, R2 = 0.6917) and also for the simple model (β = -1.9421, corrected p-value <0.001, QIC = 1166.07, R2 = 0.7431). The monkey model did not reveal such a 4-way interaction (corrected p-value = 0.21).

Both the human and simple models also revealed a significant 3-way interaction of the factors WordType, FrontPost, and Areas when run only on the extra-sylvian regions (human: β = −4.0581, corrected p-value <0.001, QIC = 3549.00, R2 = 0.4244; simple model: β = −1.4699, corrected p-value <0.001, QIC = 388.03, R2 = 0.3782). The interaction was not significant when run on the peri-sylvian regions alone (human: corrected p-value = 0.64; simple model: corrected p-value = 0.09).

Bonferroni-corrected planned comparison tests (12 comparisons, corrected critical p < 0.0042) confirmed the different distribution of cell assemblies between object and action words in the human and simple models. Specifically, the human model showed relatively larger numbers of cells across all lateral motor/frontal areas for action words, but larger numbers in visual/temporo-occipital areas for object words (p-value <0.001, see , top-right panel). For the simple model, a distinct pattern was observed, with significant differences between the two word types being restricted to the central hub regions (AT and PFL, p-value <0.001).

Simulation of auditory word comprehension: Time course activation results

After training, we simulated auditory word comprehension by presenting the learnt auditory patterns to the auditory primary area A1 and recorded the neural activation of the reactivated circuits. This allowed us to investigate the activation time course across the different areas and models. shows the dynamics of object-word neural activation across the different areas for the spiking and meanfield networks and for the different connectivity architectures. Visual inspection of these activation plots reveals a main difference across model architectures. The HM showed by far the strongest neural activations for both spiking and meanfield models, outperforming both MM and SM. Object-related CA activation showed a clear activation of the visual system (V1, TO, AT) in the HM for both spiking and meanfield models, with very low activation of the dorsolateral motor system and some activity only in the central prefrontal area (PFL). In contrast, activation is relatively low in MM and SM. The opposite pattern was observed for action-related words, with stronger activation in the motor system (M1L, PML, PFL) and less in the visual system. This is in line with the differences in CA structure described above ().

Figure 4. Model simulation of the comprehension of an object-related spoken word using the 12-area brain-constrained network. Each panel shows the time course of activation in six areas, using different colours (see legend). Activation time courses for perisylvian regions are shown in the left half of the figure, those for extrasylvian areas on the right. For each of these regions, simulation results of the spiking and meanfield model are presented side-by-side. The top row shows results for the human model, the middle row for the monkey model, and the bottom row for the simple model. Object word comprehension in the human model shows robust activation in visual areas (V1, TO, AT), while activation in the motor system (M1L, PFL) remains low. In contrast, the monkey model and simple model exhibit weaker overall activation across both visual and motor systems. Note that the y-axis scale differs across models and regions to capture variations in neural activity.

Differences in reverberation time

Here, we examined the average duration of maintained neural activity (reverberation time, Rtime) after the full ignition of CAs in auditory word recognition and articulatory stimulation “trials” (). CA ignition was defined by the maximum area-specific activation. Reverberation time was defined by the number of time steps between ignition and the point in time when activity went below baseline. This baseline was computed as 2 standard deviations of activity in the 10-time-step baseline period before stimulation. Rtime values showed clear differences across architectures but exhibited similarities between model types. Notably, in the meanfield model, the HM surpassed the MM by a factor of 2.6 and the SM by 2.9, while in the spiking model, HM reverberation time was 4.4 and 5.7 times as long as in the MM and SM, respectively. These observations were statistically confirmed by one-way ANOVAs (factor: connectivity structure) for both model types and for each area (corrected p-value <0.001). Below we report the results of the statistical analyses within each model type. Note, furthermore, that, in both SM and MM, reverberations were primarily present in the perisylvian network parts, whereas persistent activity in extrasylvian space was overall very weak (, middle and bottom panels). For the human model, both peri- and extrasylvian areas made a strong contribution to reverberant activity (, top panels).

Figure 5. Reverberation times of CA activations after peak activation in a given area, for meanfield (left) and spiking models (right) and human (top), monkey (middle) and simple architectures (bottom). Each panel shows how long activity reverberated (or was sustained) in the CA neurons of action and object words within a given model area. Longer reverberation times are generally observed in connector hub areas (PFi, PFL, PB, PML) relative to primary and secondary areas, particularly in the human model. Notable differences are observed between architectures, with the human model demonstrating the longest reverberation times overall, especially in prefrontal and anterior temporal areas, while both the monkey and simple models show significantly shorter reverberation times.

Meanfield model

The HM showed significantly longer reverberation times (average: 10.9 (SE: 0.6) arbitrary time steps) compared to the SM and MM, where activity lasted for substantially shorter durations (MM: 4.2 (SE: 0.5); SM: 3.8 (SE: 0.5)). To examine reverberation time across the different modelled regions, a 4-way GEE with factors WordType, PeriExtra, FrontPost, and Areas was conducted separately for the human, monkey, and simple models, mirroring the approach used for analysing CA sizes. A significant 4-way interaction was observed only for the human model (β = 0.9618, corrected p-value <0.0001, QIC = 66203.81, R2 = 0.2165), whereas the monkey and simple architectures did not reveal a similar significant interaction (monkey: corrected p-value = 0.76; simple model: corrected p-value = 0.33).

To gain further insight into this complex interaction, the GEE for the human meanfield model was again dissected into two 3-factor GEEs (WordType, FrontPost, Areas), focusing on data from the peri-sylvian and extra-sylvian regions separately. Only the extra-sylvian region showed significant interaction effect of these factors (β = 2.6372, corrected p-value <0.0001, QIC = 30285.88, R2 = 0.2802). No significant interaction was observed in the perisylvian regions (corrected p-value = 0.11).

Bonferroni-corrected planned comparison tests (12 comparisons, corrected critical p < 0.0042) confirmed longer sustained activity in model area V1 for object compared to action words (p-value <0.001) and the opposite effect in M1L for action words (p-value <0.001). Additionally, the inverse effect previously observed in the central hub regions (AT and PFL) during the CA size analysis was also present in the reverberatory activity in those regions. Specifically, action words exhibited longer reverberation times in the central area of the ventral visual stream (AT) compared to object words (p-value <0.001). Conversely, object words showed longer reverberation times in the dorsal frontal hub area (PFL) relative to action words (p-value <0.001).

Spiking model

The spiking model (, right) yielded results similar to those of the meanfield model; the richer connectivity structure of the HM produced prolonged CA reverberation activity (average: 5.7 (SE: 0.3)) compared to the SM and MM models, which exhibited short-lived CA activations (MM: 1.3 (SE: 0.2); SM: 1.0 (SE: 0.1)).

For the spiking model, the 4-factor GEE on semantic category specificity yielded significant interactions for the HM and SM (human: β = 0.1441, corrected p-value <0.0001, QIC = 4218.56, R2 = 0.2434; simple model: β = 0.5347, corrected p-value <0.0001, QIC = 12293.76, R2 = 0.4450), while no significant interaction was found for the MM architecture (corrected p-value = 0.052). Thus, we ran further statistical analyses only on the HM and SM architectures, so as to examine the revealed significant interactions more closely. However, the additional 3-factor GEEs (WordType, FrontPost, Areas) for the peri- and extra-sylvian systems, respectively, showed significant interactions only for the extra-sylvian regions for HM and SM (human: β = 0.2170, corrected p-value <0.0001, QIC = 2025.64, R2 = 0.1354; simple model: β = 1.0382, corrected p-value <0.0001, QIC = 7492.04, R2 = 0.4332).

Bonferroni-corrected planned comparison tests (12 comparisons, corrected critical p < 0.0042) were once again applied and showed longer sustained activity within the HM architecture in model area V1 for object compared to action words (p-value <0.001) and the opposite effect in M1L (p-value <0.001); the secondary and central areas did not reveal any significant differences (p-values >0.05). In contrast, the SM failed to show any semantic category differences in reverberation times in the primary, secondary, or central extrasylvian areas (p-value >0.05). All these findings for both model types (spiking and meanfield) align with our previous observations regarding CA sizes.

Discussion

Can the evolutionary change in brain connectivity between non-human and human primates explain their known functional cognitive differences in verbal working memory, symbol learning, and vocabulary sizes? To answer this question, we built brain-constrained neurocomputational models of 12 areas relevant to linguistic and semantic processing in peri- and extra-sylvian cortex. The connectivity between the modelled cortical areas mimicked known features of cortico-cortical connectivity in human and non-human primates, as revealed by tractography studies (Rilling et al. Citation2008, Citation2012; Thiebaut de Schotten et al. Citation2012; Rilling and van den Heuvel Citation2018; Ardesch et al. Citation2019). Three different model architectures were built, a Human, Monkey, and Simple Model. These were used to simulate symbolic learning and understanding. All models included next-neighbour connections between adjacent areas and between hub regions, but otherwise varied: the human model (HM) had additional inter-area connections, including jumping links between second-next areas in both peri- and extra-sylvian regions; the monkey model (MM) had such jumping links in the extra-sylvian region but not in perisylvian cortices, thereby imitating a much-discussed evolutionary difference, and the simple model (SM) lacked any jumping links (see ). All simulations were performed with two types of networks consisting of either meanfield or spiking neurons. The simulations led to the following main results:

  1. All networks built strongly connected neuronal ensembles, called cell assemblies (CAs), for word forms and their associated meanings by means of unsupervised Hebbian learning. Upon stimulation with only an acoustic pattern, CAs became active and fully ignited. These CAs were spread out across primary, secondary, and multimodal perisylvian areas in all models, but their extension to extrasylvian areas varied across models (). The densely connected human architecture built much larger CAs than the weaker connected monkey and simple models.

  2. Only the human model architecture gave rise to CA circuits with different semantically specific distributions across extra-sylvian areas. These CAs linked information about symbol form to either visually perceived or action-related information, thus reflecting the semantic grounding of object and action words in visual and motor systems, respectively. Such specificity was not found in the non-human models.

  3. CA activation was maintained and reverberated for a substantially longer period of time in the HM architecture than in the SM and MM ().

  4. These results were consistently obtained across meanfield and spiking model variants, thus pointing to a degree of generality of the findings.

Below we discuss the model results in the light of previous simulation results, empirical evidence and novel predictions.

CA formation across models

Word or symbol learning was implemented by concordant stimulation of the primary auditory and articulatory motor cortex of the model, thought to simulate word production and perception of corresponding acoustic information, along with visual or hand-motor cortex stimulation designed to code for perceptually- or action-related features of the objects and actions to which the words relate semantically. The repeated stimulation of the same neurons led to activation spreading throughout the networks and synaptic strengthening of the connections involved, replicating well-known observations from neural network research (Doursat and Bienenstock Citation2006; Garagnani et al. Citation2008; Lansner Citation2009; Huyck and Passmore Citation2013; Palm et al. Citation2014). As co-activated neurons were present in all network areas, the resultant CAs were distributed across network areas. Our results contrast with earlier simulations using layered networks, which arguably do not build distinct and discrete sets of neurons related to symbols, word forms, or meanings, but instead “represent” such cognitive entities as fully distributed dynamic activation patterns (Farah and McClelland Citation1991; Elman Citation1996, Citation2004; Rogers and McClelland Citation2004; Westermann et al. Citation2006; Ralph et al. Citation2017). Although distributed dynamic activation patterns can be informative to some extent, the use of more biologically well-founded models of the brain that include reciprocal topographical links between areas, excitatory within-area connections, and Hebbian correlation-based learning entails discrete circuit formation, offering a more accurate model of brain-based neural circuit formation (see Garagnani et al. Citation2009; Pulvermüller Citation2023).
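The core of this circuit-formation mechanism can be illustrated with a minimal sketch: repeated co-stimulation of a fixed set of units, combined with a Hebbian outer-product weight update, strengthens within-pattern synapses far above the background level, yielding a discrete, strongly interlinked cell assembly. All sizes, learning rates, and variable names below are illustrative assumptions, not parameters of the actual model.

```python
import numpy as np

rng = np.random.default_rng(0)

N = 120                                    # neurons spread across model "areas"
W = rng.uniform(0.0, 0.05, (N, N))         # weak random background synapses
np.fill_diagonal(W, 0.0)                   # no self-connections

# Neurons co-stimulated whenever the same "word" is presented
word_pattern = np.zeros(N)
word_pattern[rng.choice(N, 20, replace=False)] = 1.0

eta = 0.01                                 # illustrative learning rate
for _ in range(50):                        # repeated co-stimulation trials
    # Hebbian rule: strengthen synapses between co-active pre/post neurons
    W += eta * np.outer(word_pattern, word_pattern)
    np.fill_diagonal(W, 0.0)
    W = np.clip(W, 0.0, 1.0)               # keep weights bounded

inside = W[word_pattern == 1][:, word_pattern == 1].mean()
outside = W[word_pattern == 1][:, word_pattern == 0].mean()
assert inside > outside                    # a strongly interlinked CA has formed
```

Because only co-active pairs are strengthened, the pattern's internal synapses end up an order of magnitude stronger than its links to the rest of the network, which is the defining signature of a cell assembly.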

In spite of the general observation of CA formation across the biologically constrained networks, we observed great differences in the sizes of these distributed representational units. The HMs produced CAs comprising 50–100 cells in the spiking and 200–300 in the meanfield models. We do not interpret this quantitative difference between the two model types because no attempt was made to match activation strengths across models. The probabilistic nature of spiking activity explains why some presynaptic activations that were efficient in a meanfield model remained without (or with very little) effect in spiking networks. This probabilistic cancelling of the effect of some activity accounts for the generally lower activations and CA sizes in the spiking models. Importantly, across network types, the human architecture led to substantially larger CAs than the monkey and simple models (see ). This difference can only be attributed to the single implemented difference between these model architectures: their structural difference in connectivity. The addition of “jumping links” between non-adjacent areas, links bypassing intermediate regions, has been suggested by neuroanatomical research to be a characteristic feature distinguishing human perisylvian cortex from that of non-human primates (Rilling et al. Citation2008; Rilling and van den Heuvel Citation2018; Ardesch et al. Citation2019). These additional links imply, of course, that more activation spreads across perisylvian areas. However, over and above an activity increase, these links reduce the number of synaptic steps necessary for direct communication between distant cortical areas, thus enabling faster and more parallel neural processing. In the HM, the inclusion of these jumping links during the learning phase led to enhanced synaptic strengthening within distributed cell assemblies.
This strengthening was facilitated by the simultaneous activation of neurons in multiple interconnected regions, allowing for larger CA formation across different modalities. As a result, the CAs in the human-like model were not only larger but also exhibited higher ignition amplitudes, meaning that, when activated, they generated stronger and more widespread neural responses. In contrast, models that lacked these jumping links in the perisylvian cortex (MM and SM) relied solely on next-neighbour connections there, so that information about the sounds and articulatory patterns of spoken words could only be passed sequentially from one adjacent region to the next. This led to less efficient building of neural representations for spoken words and hence to smaller CAs.

CA topographies and meaningful symbol storage

Beyond the quantitative difference in CA sizes, our results indicate qualitative differences between CAs with different functions. When simulating the learning of object-related words, the word form information was co-present with semantic information about the referent objects the words can be used to speak about. Likewise, in action word learning, information about action-related semantic features of the words was co-presented with symbol-related information to the model, mimicking early word learning as observed in language development studies (Tomasello and Kruger Citation1992; Vouloumanos and Werker Citation2009). The results were CA circuits that interlinked word form-related information with semantic information about referent objects and actions, respectively. Note that we simulated these differences between word types not because they capture the entire range of semantics characterizing words and symbols, but to highlight one out of many semantic differences between symbols that may affect cortical representations. Needless to say, there is a wide range of different semantic word types and learning scenarios; the modelling of “pure” object and action words grounded directly in the sensorimotor systems is one way to address such differences. The differences in the simulated semantic grounding process, involving either objects or actions, led to qualitative differences in the related CAs, which extended more strongly into either the primary visual or the primary articulatory-motor cortex, respectively. However, this topographical difference in semantic grounding of object and action words was only found for the human architectures, where it was equally present in the spiking and meanfield models. In contrast, the other architectures, the monkey and simple model, did not show this semantic difference between symbol types.
Only the “simple” architecture of the spiking model showed an interaction of the four factors, but this was due to an entirely different pattern of results. The word type-specific difference was confined to extrasylvian central, connector-hub areas (AT and PFL) and did not extend into the primary areas, as a grounding-related difference would suggest. We do not have a ready explanation for this unexpected result in the SM but conclude that the CA topography difference related to semantic grounding in different modalities (visual vs action, i.e., V1 vs M1L) is not directly reflected in the topographic difference of the “simple” spiking model.

In summary, our observations about the topographies of CAs representing object and action words show clearly that only the human architecture (HM) built CAs whose distribution over the network reflected the type of information relevant for the semantic grounding of symbols. A range of pre-existing neurophysiological, neuroimaging, and neuropsychological studies had previously shown category-specific semantic differences in the human brain (Warrington and Shallice Citation1984; Pulvermüller Citation2018; Grisoni et al. Citation2021; Carota et al. Citation2023), which are consistent with and may be explained by different CA topographies (Garagnani and Pulvermüller Citation2016; Tomasello et al. Citation2017; Tomasello et al. Citation2018, Citation2019). The fact that such semantic differences have model correlates, which are explained by differences in the learning paradigm, provides an argument for the successful formation of meaningful symbolic representations. In this context, our present simulations suggest that the building of discrete and meaningful symbolic representations is supported by the human architecture but not by the monkey or the more “simple” ones. While some studies have shown that the superior temporal sulcus (STS) and superior temporal gyrus (STG) in monkeys seem to be involved in linking visual and auditory information (Joly et al. Citation2012; Froesel et al. Citation2022), these studies did not address the distinction between word types such as action and object words, and these regions do not appear to show such specialization. Our results further support this point, as they showed no difference in activation patterns between these two word types in monkeys. This suggests that, unlike in the human brain, central hubs in the extra-sylvian region might not be modality-specific in monkeys.

A recent study also demonstrated that macaques could reproduce a learned sequence of sounds by playing on a modified piano (Archakov et al. Citation2020). Additionally, it showed activation in the motor cortex after auditory presentation of this sequence. This suggests that monkeys can form internal models linking auditory and motor information. In contrast, our study did not observe such a clear link between the auditory and motor systems in the monkey and simpler models for words or symbols. While stronger activation elicited by sequences of words could potentially reach the motor cortex in these models with further training, it is unclear whether this would establish links comparable to those observed in macaques. Further research is necessary to explore this possibility.

Reverberation times and verbal working memory

Our work follows up on earlier neurocomputational work addressing the putative cortical basis of working memory (Moody et al. Citation1998; Verduzco-Flores et al. Citation2009; Pulvermüller and Garagnani Citation2014; Schomers et al. Citation2017). A clear and strong difference between model architectures was present in the duration of maintained activity within activated cell assembly circuits. The human model outperformed the monkey and simple architectures by factors of 2.9 and 3.7 (meanfield) vs 1.8 and 2.1 (spiking model), respectively, even reaching a factor of 7 when focusing on one specific area. Generally, the longest reverberation times were present in connector hub areas central to the architectures, relative to the primary areas. The predominance of reverberation in “higher” connector hubs and association cortices sits comfortably with previous neurophysiological observations that so-called memory cells are most frequent, and show the most long-lasting sustained activity, in prefrontal and anterior temporal areas (for visual stimuli) (Fuster and Alexander Citation1971; Fuster Citation1973, Citation2022; Fuster et al. Citation1981; Goldman-Rakic Citation1999; Tomasello et al. Citation2018; Shebani et al. Citation2022), and that verbal working memory most strongly draws on perisylvian multimodal areas (Baddeley Citation2003; Fuster Citation2009; Baddeley et al. Citation2021).
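The reverberation-time measure itself can be sketched as the time for which post-stimulus activity remains above a threshold before dying out. The decaying traces and time constants below are purely illustrative, chosen only to mimic the qualitative human-vs-monkey difference reported above; they are not outputs of the actual model.

```python
import numpy as np

def reverberation_time(activity, offset, theta=0.1, dt=1.0):
    """Duration (in steps * dt) that activity after stimulus offset
    stays at or above the threshold theta before first dropping below it."""
    post = activity[offset:]
    below = np.nonzero(post < theta)[0]
    return (below[0] if below.size else post.size) * dt

t = np.arange(200)
stim_off = 50
# Illustrative exponentially decaying traces: slower decay stands in for
# the richer reverberation supported by denser connectivity
human  = np.exp(-(t - stim_off).clip(0) / 60.0)
monkey = np.exp(-(t - stim_off).clip(0) / 20.0)

print(reverberation_time(human, stim_off))   # 139.0 steps
print(reverberation_time(monkey, stim_off))  # 47.0 steps
```

With these assumed time constants the "human" trace reverberates roughly three times longer than the "monkey" trace, on the order of the between-architecture factors reported for the simulations.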

Note, by the way, that the hub areas (PFi, PFL, PB, PML) did not only show the longest reverberation times in the HM; similar between-area differences were also present within both the MM and SM, although reverberation times of these models always remained substantially and statistically reliably below the levels achieved by the HM. The maximal differences per area ranged between factors of 7.6 (meanfield model, PML) and 10.1 (spiking model, PMi) across the different architectures.

These results confirm and extend earlier findings according to which connector hub areas central to a neurocomputational architecture develop the most efficient memory mechanisms (Pulvermüller and Garagnani Citation2014; Schomers et al. Citation2017; Tomasello et al. Citation2018; Dobler et al. Citation2024). This is explained by the relatively larger number of connections converging on these areas, resulting in enhanced activity levels among their neurons and, consequently, in a relatively larger number of neurons being recruited into the CA and thus taking part in CA building (for discussion, see Pulvermüller and Garagnani Citation2014). Note that, comparing across figures, the model areas with the largest number of CA neurons also show the longest reverberation times. The biological correlates of these connector hub model areas are in the prefrontal cortex (PF areas), in the anterior temporal cortex (area AT) and in the superior temporal gyrus and sulcus (area PB). In an extended model architecture that also includes parietal areas, posterior parietal sites would similarly be included. As these are the areas where the most pronounced memory activity has been recorded in neurophysiological experiments (Fuster et al. Citation1985, p. 198; Fuster Citation2009, Citation2022), modelling work with brain-constrained networks may contribute to a mechanistic explanation of the location and cortical distribution of memory cells in primate and human cortex. This explanation, as sketched above, builds on correlation-based learning and network architecture. In this context, it is particularly interesting that not only the human architecture but the others too exhibited the largest number of cell assembly neurons and the most prolonged memory-related activity in areas central to the architecture.

It is also important to note that only the HM meanfield model showed differences in reverberation times between hub areas of the extra-sylvian regions and between semantic word types, which resembled those found for the number of CA neurons. However, the meanfield model exhibited the unexpected effect of longer reverberation times for action words in the visual-stream central area AT, and longer activation maintenance for object words in the dorsal prefrontal area PFL of the dorsal motor (PML) stream. The spiking model did not show this difference in central areas. Importantly, both models overall had longer reverberation times for action words in the primary motor areas and for object words in the primary visual areas.

As mentioned in the introduction, a previous neurocomputational study already found a dependence of verbal working memory on the human-like connectivity structure of the network. That study reported that memory activity is significantly more pronounced in a human model of the perisylvian cortex, including jumping links, than in a monkey architecture with next-neighbour connections only. The earlier study simulated word form learning independent of the symbols’ meaning. This strategy may be criticized because, in almost all cases where infants and children learn symbol forms, they also learn what these forms mean. In the present work, we simulated the learning of word forms in the context of object perception and action execution, showing that correlation-related semantic links are built between activation patterns representing symbol forms and patterns related to referent objects and actions. This captures only quite limited aspects of referential semantics; it leaves unaddressed issues such as the great flexibility of word use across different contexts, as well as different symbol types such as names for individual objects, terms for large conceptual categories or abstract expressions (but see Pulvermüller et al. Citation2021; Pulvermüller Citation2023). Still, the learning of links between symbols and referent objects and actions is an important semantic process which may be necessary for building any semantic representation. Therefore, it is well motivated to simulate referential link building in a semantic learning paradigm. The CA circuits that formed produced reverberant activity dependent on model architecture. As in the study by Schomers and colleagues, the richer human architecture outperformed the simpler models with poorer connectivity. One may argue that this difference is just due to a difference in the number and overall strength of connections between areas and, hence, between the neurons involved.
However, as shown in the earlier study, prolonging learning by a factor of 10, and therefore further strengthening synaptic links, did not significantly improve memory performance of the non-human architecture, thus indicating that connection strength may not be the only relevant factor (see Schomers et al. Citation2017). A different feature distinguishing the architectures is path length, the “distance” – counted in terms of synaptic steps – between areas. In the simple and monkey architectures, it takes five steps from A1 to M1i, whereas the shortcuts provided by the human model’s “jumping links” bring this number down to three. Shorter pathways facilitate quicker and more direct communication between relevant areas, reducing potential delays or disruptions caused by noise in the intermediate areas (Avena-Koenigsberger et al. Citation2018). As a result, the human architecture exhibits enhanced neural communication, potentially leading to swifter and more precise processing of information that may also contribute to memory maintenance.
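This path-length argument can be made concrete with a small breadth-first-search sketch over a six-area perisylvian chain. The area labels follow the model’s naming conventions, but which intermediate area (here AB) sits in the chain and which exact pairs carry the extra “jumping links” are our illustrative assumptions, chosen so that shortcuts connect second-next areas as described above.

```python
from collections import deque

# Next-neighbour perisylvian chain (MM/SM): A1 - AB - PB - PFi - PMi - M1i
chain = [("A1", "AB"), ("AB", "PB"), ("PB", "PFi"), ("PFi", "PMi"), ("PMi", "M1i")]
# Human "jumping links": shortcuts between second-next areas (assumed pairs)
jumps = [("A1", "PB"), ("AB", "PFi"), ("PB", "PMi"), ("PFi", "M1i")]

def path_length(edges, src, dst):
    """Shortest number of synaptic steps between two areas (BFS)."""
    graph = {}
    for a, b in edges:                       # connections are reciprocal
        graph.setdefault(a, set()).add(b)
        graph.setdefault(b, set()).add(a)
    seen, queue = {src}, deque([(src, 0)])
    while queue:
        node, dist = queue.popleft()
        if node == dst:
            return dist
        for nxt in graph[node] - seen:
            seen.add(nxt)
            queue.append((nxt, dist + 1))

print(path_length(chain, "A1", "M1i"))          # 5 steps (monkey/simple)
print(path_length(chain + jumps, "A1", "M1i"))  # 3 steps (human model)
```

The shortcut path A1 → PB → PMi → M1i skips two intermediate relays, which is the structural basis for the faster, less noise-prone communication discussed above.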

Our research offers a significant theoretical contribution by demonstrating how the arcuate fascicle provides the essential connections and neural shortcuts necessary for creating large cell assemblies that exhibit topographical differences important for storing semantic information for action and object words. The richer connectivity within the human language architecture enables the formation of semantically specific neural circuits, unlike the simpler architectures in non-human models. This emphasizes the crucial role of the arcuate fascicle in supporting the semantic grounding of symbols (Harnad Citation1990), contributing to the neurobiological mechanisms that underpin verbal working memory and symbolic thought, which are unique to human cognition. By modelling these processes, the study bridges a gap in understanding how specific brain connections shape the emergence of meaningful symbol processing. From this theoretical framework, several clinical implications arise. Damage to the arcuate fascicle, as seen in conditions like conduction aphasia (Catani et al. Citation2005; Marchina et al. Citation2011), disrupts the formation and maintenance of cell assemblies crucial for producing word and sentence forms. Understanding its role in symbol formation could lead to neuroplasticity-based therapies targeting these disrupted circuits in disorders like aphasia and dyslexia, thereby improving rehabilitation outcomes (Berthier and Pulvermüller Citation2011; Berthier et al. Citation2011). The findings also suggest practical implications for artificial intelligence (AI). Many current large language models (LLMs) rely solely on text input, but our study indicates that, when AI systems begin integrating multimodal inputs – such as text, vision, and motor activities – extra connections between these sensory-motor layers could enhance their ability to “understand” and process meaningful symbols (Adams et al. Citation2014).
This integration could result in improved natural language understanding, bringing AI closer to human-like cognitive abilities.

Limitation and future perspectives

Despite the strengths of this study, several limitations should be acknowledged. While we focus on key frontotemporal and occipital cortices, it is important to note that language processing is inherently more complex and involves other brain areas, such as the parietal and insular cortex, which are implicated in integrating sensory information and emotional valence in language processing (Moseley et al. Citation2012; Vigliocco et al. Citation2014; Hinojosa et al. Citation2020). Emotional factors, which arise from interactions with regions like the posterior parietal cortex and limbic system, may influence the formation and maintenance of cell assemblies, and this dimension was not fully captured in our model. Furthermore, the model does not account for potential bio-cultural evolutionary factors, even though cultural evolution can exert selective pressures on brain development, suggesting that biological and cultural evolution are intertwined (Deacon Citation1998; Richerson and Boyd Citation2008). Such cultural factors – like language exposure and social interaction – could influence the neurobiological mechanisms underlying symbol learning and language processing and, in recent human evolution, may have exerted an evolutionary pressure on brain structure (Christiansen and Chater Citation2008; Fisher and Scharff Citation2009). While we demonstrate that the arcuate fascicle is a key biological factor in symbolic language formation, it is possible that cultural evolution played and continues to play a role in further shaping this brain structure and its function.

We also recognize that one limitation of our model is the absence of any explicit simulation of attention, which is known to play a critical role in language processing and verbal working memory (Deco and Rolls Citation2003). Attention mechanisms help the brain selectively focus on relevant stimuli while ignoring distractions, which can significantly affect how cell assemblies form and sustain their activity. In human language learning, attentional control is essential for processing the flow of information, and its inclusion in future models could provide deeper insights into the selective activation of circuits based on contextual relevance. By not modelling attention, we may be missing an important cognitive layer that influences both symbol formation and working memory performance. However, we should mention that, using a model similar to the one applied in the present set of simulations, Garagnani et al. (Citation2008) simulated the influence of attention on known brain indices that depend on attention, including the N400 and Mismatch Negativity to linguistic stimuli (Garagnani et al. Citation2008; Garagnani and Pulvermüller Citation2011). In these simulations, attention levels were simulated by varying the level of area-specific feedback inhibition. Future work could explore how varying attentional mechanisms in the current model of meaningful symbol processing would affect the richness and persistence of semantic neural representations (Knudsen Citation2007; D’Esposito and Postle Citation2015). A further limitation is that the model currently handles only 12 words; scaling it up to a larger vocabulary is a critical step for future research, specifically for simulating the learning of extensive symbol vocabularies. Lastly, while the parameters used in this study were drawn from previous research, we recognize that a more thorough analysis of other potential parameters is required to optimize and expand the model’s capabilities.

Conclusion

Our simulations show that human-like cortical connectivity is crucial for (1) the emergence of verbal working memory and (2) the formation of large, sustained neuronal circuits. Models with rich perisylvian fronto-temporal cortico-cortical links, as provided specifically by the human arcuate fascicle, formed larger cell assembly circuits for spoken words, and these amply interlinked neuronal ensembles gave rise to prolonged reverberatory activity after stimulation. In contrast, models lacking these rich fronto-temporal connections produced relatively smaller circuits with short-lived activity. Furthermore, human-like cortical connectivity was also necessary (3) for semantic differences in cell assembly topographies across the networks, which map the learned symbolic links between articulated sounds and visually perceived objects and executable actions.

These results suggest that the human-specific ability to maintain spoken meaningful words in verbal working memory, thereby consolidating symbol-related cell assemblies, and to build a larger and larger vocabulary, is driven by the rich connectivity of the human perisylvian cortex – a structural feature that distinguishes language-capable human from non-human primate brains.

Author contributions

MC, RT, and FP conceived the study. MC conducted the experiments, analysed the data, and wrote the paper. RT and FP supervised the whole study and contributed to paper writing and revision.

Disclosure statement

No potential conflict of interest was reported by the author(s).

Data availability statement

All data will be made available in a repository after acceptance.