The general recipe for developing increasingly advanced machine learning systems is to identify and combine various 'components of intelligence' (datasets, environments, training paradigms, objectives, architectures) under an incremental dev-test cycle while minimizing the human effort required (e.g., data collection, training, architecture design, and experimentation). I propose several iterations of this cycle by describing Computatrum: a data-hunting, self-/unsupervised, neurosymbolic, socially and developmentally learning, Internet-connected artificial intelligence system. This paper motivates the proposal, provides relevant background, describes the general design of Computatrum and its planned experiments, outlines future steps, and invites the reader to contribute to this work.
Full stack artificial intelligence
This proposal distinguishes itself from other artificial general intelligence and human-level AI proposals by attempting to consider a "full stack" (A, B, C) of human-level artificial intelligence rather than individual components (information representation, computation, goal setting).
There is strong motivation to develop human-level artificial intelligence across many domains of human endeavor, including healthcare, science, engineering, and business. However, approaching and surpassing this level of intelligence is currently not a straightforward task: there is no silver bullet or single framework that has generalized across every human problem domain.
Humans possess a diverse array of mechanisms that equip them to engineer complex patches onto open-ended problems. At the neuron level, these include a cell membrane, thousands of synapses, an axon, and an internal signal-interaction network that guides the cycle of stimulus-response-stimulus. Neurons organize into highly modular, heterogeneous networks, and their emergent rhythms of structural and electrochemical activity provide a slate against which external disruptions are compared to build meaningful representations. The human body provides a rich set of sensory-motor modalities, and a natural curriculum of physical, mental, and social development exposes increasingly many degrees of freedom to interaction, accompanied by extrinsic and intrinsic feedback mechanisms. With sufficient social and technological extensions, humans are even able to modify their own environment, physiology, and brain.
The most advanced artificial intelligence systems today pale in comparison to human complexity, diversity, and dexterity. Even superhuman deep neural network models are relatively hardwired considering that they must be spoon-fed datasets, situated in closed environments, and trained under an economic fitness landscape that they have no direct awareness of. Admittedly, there are also many constraints on human intelligence. However, minimizing the constraints intrinsic to human engineering will require liberating many aspects of the ML development cycle to autonomous control.
Rather than focusing on any one of the above-mentioned shortcomings of AI, this paper takes a full stack perspective on engineering human-level intelligence. My main contributions are the following:
Section 2 reviews background and identifies overarching trends and related work. Section 3 introduces the Computatrum architecture. Section 4 presents classical ML experiments and ablation studies. Section 5 proposes a future developmental curriculum. Finally, Section 6 concludes by discussing future work and broader impact.
The brain continually searches for faster, more reliable, and more energy-efficient methods to accomplish tasks. That drive, in part, motivated the development of computing machines to leverage the intelligence of human programmers on tasks whose time and memory complexity exceed reasonable human limits. In turn, the development of high-level programming languages and simple user interfaces significantly lowered the cognitive barriers to operating these machines.
Concomitant with these developments, computer scientists began investigating the very foundations of intelligence as engineered in artificial systems, and a variety of symbolic, subsymbolic, statistical, and biologically-inspired approaches began to emerge under the umbrella of artificial intelligence (AI). Over decades of disappointment and hope in AI’s practical potential, machine learning and deep learning approaches have emerged as particularly successful due to their ability to learn directly from raw numerical data. Deep learning utilizes stacks of functional “layers” to build highly nonlinear function approximators termed “neural networks”. Usually the network’s output is differentiable with respect to its parameters for a given input, which allows optimizing the network in an end-to-end supervised learning pipeline to extract features, correlate, classify, predict, translate, distort, generate, or otherwise map data with minimal hand-tuning, feature engineering, or hardwiring. Neural networks are becoming ubiquitous tools that fill small and large niches in a variety of problem domains. The following subsections discuss three relevant cases: sequence modeling, graph processing, and deep reinforcement learning.\footnote{Please see appendix X for a primer on the mathematical foundations of these powerful tools.}
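Before turning to those cases, here is a minimal sketch of the end-to-end supervised pipeline described above, in PyTorch; all shapes, hyperparameters, and toy data are illustrative assumptions, not details of any system proposed here.

```python
import torch
import torch.nn as nn

# Stacked "layers" form a nonlinear function approximator whose output is
# differentiable with respect to its parameters, so a generic optimizer can
# fit it to (input, target) pairs end to end.
model = nn.Sequential(
    nn.Linear(16, 64),  # feature-extraction layer
    nn.ReLU(),          # nonlinearity between layers
    nn.Linear(64, 3),   # task head (here, 3-way classification)
)
optimizer = torch.optim.SGD(model.parameters(), lr=1e-2)
loss_fn = nn.CrossEntropyLoss()

x = torch.randn(32, 16)         # a batch of raw numerical inputs (toy data)
y = torch.randint(0, 3, (32,))  # supervised labels (toy data)

for _ in range(100):             # optimization loop
    optimizer.zero_grad()
    loss = loss_fn(model(x), y)  # forward pass and scalar objective
    loss.backward()              # gradients flow through the whole stack
    optimizer.step()             # every parameter is updated jointly
```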
Sequence models -- including language models -- have become a particularly popular breed of neural networks. Autoregressive language models are trained using a simple objective of next-token prediction error minimization. Over many optimization loops and dataset examples, large language models can generate very plausible statements such as the remainder of this paragraph. “[Sequence models] have been used for a variety of commercial and research applications, such as speech recognition, text-to-speech synthesis, machine translation, and information retrieval. Such information can be used to predict unseen data, such as tags for images, genres of music, or properties of sales leads.”\footnote{See figure D.X for details.}
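For concreteness, the sketch below spells out the next-token objective; `model` here is an assumed placeholder for any autoregressive sequence model that maps token ids to per-position logits over the vocabulary.

```python
import torch
import torch.nn.functional as F

def next_token_loss(model, tokens):
    """Mean next-token prediction error over a batch of token sequences."""
    inputs, targets = tokens[:, :-1], tokens[:, 1:]  # shift targets by one step
    logits = model(inputs)                           # (batch, seq - 1, vocab)
    return F.cross_entropy(                          # prediction error to minimize
        logits.reshape(-1, logits.size(-1)),
        targets.reshape(-1),
    )
```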
Accurately modeling the underlying partly-local, partly-global dependencies in natural sequences has historically been challenging for deep learning systems. A common approach to this challenge is some variant of the attention operation. A common form of this operation -- self-attention -- computes a key, query, and value for each token in a sequence. Queries are then compared against keys for all combinations of tokens. The values of every token B whose key is numerically similar to some token A’s query are then weight-averaged into the output representation for token A. This mechanism enables every token in a sequence to ‘communicate’ with every other token. More generally, attention mechanisms give neural networks the ability to perform dynamic, differentiable routing between many pieces of data. Cross-attention is another variant where the numerical representations that ‘ask’ queries are different from the numerical representations that ‘answer’ them with keys and values. Since computing the query-key similarity between every pair of tokens in a sequence of length n incurs O(n^2) complexity, significant performance improvements are possible through structural simplifications (fastformer, longformer, reformer) or by utilizing a separate, smaller latent space to store information gathered from a sequence over many queries. [Cite Perceiver, Perceiver IO]
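The sketch below implements single-head self-attention exactly as described, in NumPy with random placeholder projection matrices; it is an illustration of the mechanism, not a performant implementation.

```python
import numpy as np

def self_attention(X, W_q, W_k, W_v):
    Q, K, V = X @ W_q, X @ W_k, X @ W_v         # per-token query, key, value
    scores = Q @ K.T / np.sqrt(K.shape[-1])     # all query-key comparisons: O(n^2)
    scores -= scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)  # softmax over keys
    return weights @ V                          # similarity-weighted average of values

n, d = 8, 16                                    # sequence length, model width
X = np.random.randn(n, d)                       # token representations
W_q, W_k, W_v = (np.random.randn(d, d) for _ in range(3))
out = self_attention(X, W_q, W_k, W_v)          # one output vector per token
```

Cross-attention follows from the same function by computing Q from one set of representations and K, V from another.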
Graph processing. Attention operations may be interpreted as dynamically constructing a graph adjacency matrix between sequence entities (vertices); graph neural networks, by contrast, are a family of more structured neural network architectures for graph processing, performing vertex-centric and, in some variants, edge-centric operations. Some architectures utilize adjacency matrix representations while others use (more compact) adjacency list representations. These networks are applied to various tasks including node classification, edge prediction, graph generation, transformation, and clustering -- theoretically, any problem expressible using graphs (sequences are chain graphs and images are lattice graphs). However, their widespread adoption is presently hindered by the challenge of efficient, differentiable structural representation and computation.\footnote{Pay attention to research in this area.}
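As a deliberately simplified illustration, the sketch below performs one vertex-centric message-passing step over an adjacency-list representation; the projection matrix is a random placeholder, and real graph neural network layers add learned aggregation, edge features, and repeated rounds.

```python
import numpy as np

def message_passing_step(features, adjacency, W):
    """One vertex-centric update: combine each vertex with its neighborhood mean."""
    out = np.zeros_like(features)
    for v, neighbors in adjacency.items():
        agg = (features[neighbors].mean(axis=0) if neighbors
               else np.zeros(features.shape[1]))
        out[v] = np.maximum((features[v] + agg) @ W, 0.0)  # ReLU nonlinearity
    return out

adjacency = {0: [1, 2], 1: [0], 2: [0]}  # a 3-vertex star graph as adjacency lists
features = np.random.randn(3, 8)         # one feature vector per vertex
W = np.random.randn(8, 8)                # placeholder for a learned projection
features = message_passing_step(features, adjacency, W)
```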
Deep reinforcement learning approaches extend the versatility of deep learning to the agent-based paradigm. The utility-agent problem is: given a trajectory of observations, actions, and rewards, how does an agent optimally select future actions? Significantly, pure reinforcement learning (RL) agents are not force-fit to an observation-action mapping; they learn an optimal policy as a product of reward maximization. This enables RL agents to learn in domains where it is infeasible or impractical to collect a labeled dataset. [Reward is Enough] posits that any nontrivial reward signal is sufficient to guide an agent to master all the complexity of its environment. Simply knowing the reward function and environment of an agent reveals much about the behavior it may acquire. However, even designing an appropriate reward function is not always amenable to growth, and when aiming for autonomy that surpasses the intrinsic human error of those training signals, reinforcement learning paradigms can even constrain growth.
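The tabular Q-learning sketch below illustrates this point: the agent is never shown a correct action, yet a policy emerges from reward maximization alone. The `env` object with a reset()/step() interface is an assumed convention, not part of this proposal.

```python
import numpy as np

def q_learning(env, n_states, n_actions, episodes=500,
               lr=0.1, gamma=0.99, epsilon=0.1):
    Q = np.zeros((n_states, n_actions))  # value estimates per (state, action)
    for _ in range(episodes):
        state, done = env.reset(), False
        while not done:
            # Epsilon-greedy: mostly exploit current estimates, sometimes explore.
            if np.random.rand() < epsilon:
                action = np.random.randint(n_actions)
            else:
                action = int(Q[state].argmax())
            next_state, reward, done = env.step(action)
            # Move Q(s, a) toward the reward plus the discounted future estimate.
            target = reward + gamma * Q[next_state].max() * (not done)
            Q[state, action] += lr * (target - Q[state, action])
            state = next_state
    return Q  # the greedy policy is argmax over actions at each state
```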
Various information-theoretic techniques have been developed to combat this constraint, including maximum entropy, control, prediction, curiosity, empowerment, skill diversity, mutual information, and information gain. Fundamentally, these techniques make a few assumptions about the statistical relationships between factors such as observations, actions, and internal states over a trajectory, and then optimize behavior to minimize the divergence between anticipated and actual data distributions. Given its history predating other information-theoretic metrics, autoregressive prediction error minimization is often presented as the single objective required to develop general intelligence in any environment. While information-theoretic metrics are the ‘gold standard’ for self-supervision, their typically long associated factor graphs and complex algorithmic representations limit universal application.
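As one minimal member of this family, the sketch below derives an intrinsic ‘curiosity’ reward from the prediction error of a learned forward model; the architecture, shapes, and names are illustrative assumptions rather than any specific published method.

```python
import torch
import torch.nn as nn

OBS_DIM, ACT_DIM = 20, 4  # illustrative observation and (one-hot) action sizes

# A forward model that anticipates the next observation from (obs, action).
forward_model = nn.Sequential(
    nn.Linear(OBS_DIM + ACT_DIM, 64), nn.ReLU(), nn.Linear(64, OBS_DIM))
optimizer = torch.optim.Adam(forward_model.parameters(), lr=1e-3)

def intrinsic_reward(obs, action, next_obs):
    pred = forward_model(torch.cat([obs, action], dim=-1))
    error = ((pred - next_obs) ** 2).mean()  # anticipated vs. actual divergence
    optimizer.zero_grad()
    error.backward()   # the model chases its own error, so familiar
    optimizer.step()   # transitions pay less and surprising ones pay more
    return error.item()
```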
Various other self-supervised and unsupervised training paradigms have filled important roles in the absence (and presence) of an external feedback signal. Autoencoding trains two neural networks to compute the forward and inverse mappings of a representation while minimizing reconstruction error. Contrastive learning trains a network to produce augmentation-invariant representations for individual training examples, but maximally different representations for different training examples. Self-organizing maps perform clustering at a lower level by competitively assigning each input vector to the nearest codebook vector and then moving that codebook vector in the direction of its claimed input. The self-organizing recurrent neural network uses Hebbian and homeostatic mechanisms to modulate the evolution of excitatory and inhibitory activity. Finally, hierarchical temporal memory represents a specialized set of computational architectures that learn by association, prediction, and competition.
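To ground the self-organizing-map description above, the sketch below performs the competitive codebook update just described; a full SOM would also nudge the winner's topological neighbors, and the sizes and learning rate here are illustrative.

```python
import numpy as np

def som_step(codebook, x, lr=0.05):
    winner = np.argmin(((codebook - x) ** 2).sum(axis=1))  # nearest codebook vector
    codebook[winner] += lr * (x - codebook[winner])        # move it toward the input
    return winner

codebook = np.random.randn(16, 8)   # 16 codebook vectors in an 8-d input space
for x in np.random.randn(1000, 8):  # a stream of unlabeled inputs
    som_step(codebook, x)           # clusters form with no external feedback
```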