Comments on “Toward Human-Level Artificial Intelligence”

“the moral of the story is that there can be intelligence without learning” What do you mean by “learning”? If you are taking the utility agent perspective: “Behavior policy is a function that maps a sensory input with the appropriate action”, then I assume you mean the “behavior policy” looks something like $\pi_{\text{behavior}}(a_{t+1} \mid o_t;\, \theta_{\text{experience}}, \theta_{\text{dna}})$ and that “learning” is one of the many variants of $\max_{\theta_{\text{experience}}} \mathbb{E}[\text{reward}]$. If you are saying that there can be intelligence without changing $\theta_{\text{experience}}$, then I agree with you.
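To make that reading concrete, here is a minimal sketch, assuming a policy factored into innate parameters (theta_dna) and experience-dependent parameters (theta_experience); the names, shapes, and the linear-softmax form are my own illustrative assumptions, not the paper’s model.

```python
import numpy as np

# Illustrative stand-in for pi_behavior(a_{t+1} | o_t; theta_experience, theta_dna).
def behavior_policy(obs, theta_experience, theta_dna):
    logits = obs @ (theta_dna + theta_experience)
    logits -= logits.max()                      # numerical stability
    probs = np.exp(logits)
    return probs / probs.sum()

rng = np.random.default_rng(0)
obs_dim, n_actions = 4, 3
theta_dna = rng.normal(size=(obs_dim, n_actions))   # innate, fixed parameters
theta_experience = np.zeros((obs_dim, n_actions))   # would be updated by "learning",
                                                    # i.e. max_{theta_experience} E[reward]

obs = rng.normal(size=obs_dim)
action = rng.choice(n_actions, p=behavior_policy(obs, theta_experience, theta_dna))
# "Intelligence without learning" then reads as: the agent acts competently
# through theta_dna alone, with theta_experience never changed.
```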

“components in the human brain can be divided into universal and specialized parts [...] All other parts except [the] neocortex are specialized.” I feel it is more appropriate to present the neocortex as a more flexible part of the brain, just as the brain is a more flexible signaling organ than the peripheral nervous system, enteric nervous system, or the immune system. The neocortex is more specialized for “universal” allocentric information processing than the subcortical structures, but this does not mean subcortical reorganization cannot take place. Consider: subcortical structures are more specialized for egocentric processing (vital functions, survival behaviors, sensorimotor integration, localization, reward broadcasting), which usually changes little over an organism’s lifespan, yet early blindness, deafness, and other early sensory changes show that subcortical reorganization can still occur.

“Many conjecture that the main function of a cortical column is to predict the next vector signal given the sequence of vectors that is [an] autoregressive (AR) model [37, 38, 39, 40, 41]. Recent advances in language model and image generation with self-supervised learning or semi-supervised learning are based on this [16, 42]. [...] One fundamental issue of the AR model is dealing with long-term dependencies. Hierarchical structures have been used to solve long-term dependencies, such as hierarchical RNN [45, 46, 47] or hierarchical RL [48, 49].” I think it is important to consider what kind of AR model the neocortex behaves like. As you highlight, architectural priors play a significant role in the information flow that a neural network facilitates. For instance, recurrent neural networks such as simple RNNs ($h_t = f([h_{t-1}; x_t]),\; y_t = g(h_t)$), GRUs, or LSTMs are better equipped than transformers to model local dependencies, while the opposite is true when correlating far-separated data. However, neither architecture is a perfect fit for the power-law distributions found in natural language, ocular fixations, hippocampal navigation in cognitive space, or the mix of local and global dependencies at the scale of human activity. (One reason our brain succeeds at perceiving these power laws is that power laws are built into the connectome and readily observable in neural activity signatures: the brain’s internal structure and dynamics reasonably conform to the external world’s structure and dynamics. I’ll come back to this point later.)
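For reference, here is a minimal sketch of the simple-RNN step written out above; the weight shapes and the tanh/linear choices are illustrative assumptions.

```python
import numpy as np

# One step of a simple RNN: h_t = f([h_{t-1}; x_t]), y_t = g(h_t).
def rnn_step(h_prev, x, W, V):
    h = np.tanh(W @ np.concatenate([h_prev, x]))  # f: squashing update on [h_prev; x]
    y = V @ h                                     # g: linear readout
    return h, y

rng = np.random.default_rng(0)
hidden_dim, input_dim, output_dim = 8, 3, 2
W = rng.normal(scale=0.1, size=(hidden_dim, hidden_dim + input_dim))
V = rng.normal(scale=0.1, size=(output_dim, hidden_dim))

h = np.zeros(hidden_dim)
for t in range(5):
    h, y = rnn_step(h, rng.normal(size=input_dim), W, V)
# Everything the network "remembers" must survive repeated passes through f,
# which is why such recurrences favor local over very long-range dependencies.
```

A transformer, by contrast, attends to all past positions directly, which is why the two families trade off in the way described above.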

While hierarchical recurrent approaches do expand a network’s ‘memory’, this usually comes at the cost of information fidelity. Let me give an example: we can often listen to just one out of every two words spoken and still understand the message being communicated. However, there are occasions when this does not work, and we need to ask for repetition, clarification, or pay closer attention. In those failures, our ‘hierarchical’ listening did not obtain all the necessary information. Another example: when we read, our eyes usually make only one or a few forward-traveling fixations on a few letters of each word, without ‘looking at’ the entire word. Along with feedback, those brief fixations are usually sufficient to maintain a flow of information up the visual hierarchy. Again, however, we must occasionally obtain more than a brief ‘hierarchical’ summary to understand long words like “simultanagnosia”.

My point is: whether it is the unconscious process of reading words from letters, the semi-conscious activity of tuning into a lecture at the necessary intervals, or the highly conscious development of human intelligence, we need more than ‘hierarchical’ information processing. We need the ability to backtrack, to zoom in, to pause and think. We need feedback into our perceptual systems. In the context of mHPM, I believe you need to provide feedback to your pooler block. (I am assuming you are using autoencoding to train the pooler, based on Figure 4.) I don’t think unsupervised reconstruction is a sufficient training objective. Let me explain why I think this from neurological and reinforcement learning perspectives:

2.1. Prediction error as curiosity reward

[...] The underlying problem is that the agent is unaware that some parts of the state space simply cannot be modeled and thus the agent can fall into an artificial curiosity trap and stall its exploration. [...] If not the raw observation space, then what is the right feature space for making predictions so that the prediction error provides a good measure of curiosity? To answer this question, let us divide all sources that can modify the agent’s observations into three cases: (1) things that can be controlled by the agent; (2) things that the agent cannot control but that can affect the agent (e.g. a vehicle driven by another agent), and (3) things out of the agent’s control and not affecting the agent (e.g. moving leaves). A good feature space for curiosity should model (1) and (2) and be unaffected by (3). This latter is because, if there is a source of variation that is inconsequential for the agent, then the agent has no incentive to know about it.
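A schematic way to read that recipe, assuming (as placeholders, not the cited implementation) linear maps for the encoder, inverse model, and forward model:

```python
import numpy as np

# Placeholder linear "networks"; shapes and names are illustrative assumptions.
rng = np.random.default_rng(0)
obs_dim, feat_dim, act_dim = 16, 4, 2
W_enc = rng.normal(scale=0.1, size=(feat_dim, obs_dim))            # phi: obs -> features
W_inv = rng.normal(scale=0.1, size=(act_dim, 2 * feat_dim))        # inverse model
W_fwd = rng.normal(scale=0.1, size=(feat_dim, feat_dim + act_dim)) # forward model

def phi(obs):
    return W_enc @ obs

def inverse_prediction(obs_t, obs_t1):
    # Trained to recover a_t from (phi(o_t), phi(o_{t+1})): this is what restricts
    # phi to cases (1) and (2); features irrelevant to the agent's own action
    # receive no training signal, so phi learns to ignore case (3).
    return W_inv @ np.concatenate([phi(obs_t), phi(obs_t1)])

def curiosity_reward(obs_t, action, obs_t1):
    # Forward-model prediction error in feature space, not raw observation space.
    pred = W_fwd @ np.concatenate([phi(obs_t), action])
    return 0.5 * np.sum((pred - phi(obs_t1)) ** 2)

r = curiosity_reward(rng.normal(size=obs_dim), rng.normal(size=act_dim),
                     rng.normal(size=obs_dim))
```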

Obviously, existing deep learning architectures already display remarkable performance, and I expect you will be using that as a baseline to go even further. However, I feel that by adding a few more structural priors - some ‘principles of intelligence’ - to your mHPMs and overall system architecture, it may be able to model more effectively the real distributions we experience at the human scale. Why is this important? Consider a few ‘principles of intelligence’:

The MEP [maximum entropy principle] fits well the needs of biology in the era of big data, where information abounds but general principles—and corresponding mechanistic rules—are often scarce. [...] The core idea of the MEP is to build statistical models that agree with data, but are otherwise as “structureless” as possible. In other words, the MEP provides a method to find the least biased model that is consistent with the data, i.e., the maximally noncommittal with regard to missing information
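As a worked illustration of that idea (my own example, not from the quoted text): matching only a measured expectation $\langle f\rangle_{\text{data}}$ while maximizing entropy yields the least-committal, exponential-family model.

```latex
\max_{p}\; -\sum_x p(x)\ln p(x)
\quad\text{s.t.}\quad \sum_x p(x)=1,\qquad \sum_x p(x)\,f(x)=\langle f\rangle_{\text{data}}
\;\;\Longrightarrow\;\;
p^*(x)=\frac{e^{-\lambda f(x)}}{Z(\lambda)},\qquad Z(\lambda)=\sum_x e^{-\lambda f(x)}
```

Here $\lambda$ is the Lagrange multiplier chosen so that the constraint on $\langle f\rangle$ is met; nothing beyond the measured statistic is encoded in $p^*$.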

We introduce a unified objective for action and perception of intelligent agents. Extending representation learning and control, we minimize the joint divergence between the combined system of agent and environment and a target distribution. Intuitively, such agents use perception to align their beliefs with the world, and use actions to align the world with their beliefs. Minimizing the joint divergence to an expressive target maximizes the mutual information between the agent's representations and inputs, thus inferring representations that are informative of past inputs and exploring future inputs that are informative of the representations. This lets us explain intrinsic objectives, such as representation learning, information gain, empowerment, and skill discovery from minimal assumptions. Moreover, interpreting the target distribution as a latent variable model suggests powerful world models as a path toward highly adaptive agents that seek large niches in their environments, rendering task rewards optional. The framework provides a common language for comparing a wide range of objectives, advances the understanding of latent variables for decision making, and offers a recipe for designing novel objectives.
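In symbols (my paraphrase, with notation chosen here rather than taken from the paper): writing $x$ for sensory inputs and $z$ for latent representations, the combined system of agent and environment induces an actual joint distribution, and the unified objective is its divergence to a target $\tau$,

```latex
\min_{\text{perception},\ \text{action}}\;
\mathrm{KL}\big(\,p_{\text{actual}}(x_{1:T},z_{1:T})\;\big\|\;\tau(x_{1:T},z_{1:T})\,\big)
```

Perception changes the beliefs inside $p_{\text{actual}}$ to move it toward $\tau$; action changes which inputs $x$ the world supplies, which is the “align the world with their beliefs” half.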

Free-energy is a function of a recognition density and sensory input. It comprises two terms; the energy expected under this density and its entropy. The energy is simply the surprise about the joint occurrence of sensory input and its causes. The free-energy depends on two densities; one that generates sensory samples and their causes, and a recognition density on the causes. This density is specified by its sufficient statistics, which we assume are encoded by the brain. This means free-energy induces a generative model for any system and a recognition density over the causes or parameters of that model. Given the functional form of these densities, the free energy can always be evaluated because it is a function of sensory input and the sufficient statistics. The free-energy principle states that all quantities that can change (sufficient statistics and action) minimise free-energy. (math symbols removed)
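Since the math symbols were removed from that passage, here is one standard way to write the quantity it describes (notation mine): with sensory input $s$, causes $\vartheta$, generative density $p(s,\vartheta)$, and recognition density $q(\vartheta)$,

```latex
F(s,q)
= \underbrace{\mathbb{E}_{q(\vartheta)}\!\left[-\ln p(s,\vartheta)\right]}_{\text{expected energy}}
\;-\; \underbrace{H\!\left[q(\vartheta)\right]}_{\text{entropy}}
= -\ln p(s) \;+\; \mathrm{KL}\!\left(q(\vartheta)\,\|\,p(\vartheta\mid s)\right)
```

so minimizing $F$ with respect to the sufficient statistics of $q$ (perception) and with respect to action (which changes $s$) places an upper bound on the surprise $-\ln p(s)$.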

Here, we proposed a new, bottom-up conceptual paradigm: The Energy Homeostasis Principle, where the balance between energy income, expenditure, and availability are the key parameters in determining the dynamics of neuronal phenomena found from molecular to behavioral levels. [...] neurons are highly sensitive to energy limitations [...] the largest energy, by far, is expended by action potentials and post-synaptic potentials; therefore, plasticity can be reinterpreted in terms of their energy context. Consequently, neurons, through their synapses, impose energy demands over post-synaptic neurons in a closed-loop manner, modulating the dynamics of local circuits. Subsequently, the energy dynamics end up impacting the homeostatic mechanisms of neuronal networks. Furthermore, local energy management also emerges as a neural population property, where most of the energy expenses are triggered by sensory or other modulatory inputs. Local energy management in neurons may be sufficient to explain the emergence of behavior, enabling the assessment of which properties arise in neural circuits and how.

The neural criticality hypothesis states that the brain may be poised in a critical state at a boundary between different types of dynamics. (Self-organized criticality as a fundamental property of neural systems)
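One common toy formalization of that boundary, offered here only as an illustration (not the quoted source's model), is a branching process whose branching ratio sigma sits at 1: activity dies out below it, explodes above it, and becomes broadly (power-law-like) distributed right at it.

```python
import numpy as np

# Toy branching process: each active unit activates Poisson(sigma) units next step.
def avalanche_size(sigma, rng, cap=10_000):
    active, total = 1, 1
    while active and total < cap:
        active = rng.poisson(sigma * active)
        total += active
    return total

rng = np.random.default_rng(0)
for sigma in (0.8, 1.0, 1.2):          # subcritical, critical, supercritical
    sizes = [avalanche_size(sigma, rng) for _ in range(1000)]
    print(f"sigma={sigma}: mean avalanche size {np.mean(sizes):.1f}, largest {max(sizes)}")
```

At sigma = 1 the avalanche-size distribution becomes heavy-tailed, which is the kind of activity statistic usually cited in support of the hypothesis.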