Central to any general intelligence is the issue of reward. It must support open-ended learning.
Just as many animals seem to instinctively know which foods they like and which they do not, the energetically grounded artificial agent also has some genetic disposition towards certain system states, which may be induced by environmental effects. However, not all animals within the same species, or even a similar genotypic domain, share the same preferences. This may be likened to the way a loss function inherently knows which samples are true and which are false for a neural network. During training, the loss function transfers this information to the neural network, after which an effectively trained network will remember those samples from the loss function and also extrapolate to novel samples. The neural network has thus learned the fixed genotypic dispositions of its loss function, but it also expresses unique preferences for novel inference samples interpolated or extrapolated between those in the training set.
There is no single handle on biological reward, just as there is no single parameter for emotional state. Unlike simple reinforcement learning agents, biological creatures perceive affect in many dimensions. Affective psychology identifies three components of affect: arousal, valency, and motivational intensity. Arousal has no particular preference for or against an action but increases awareness. Valency identifies a state as positive or negative, but even strong valency or preference may be unassociated with motivational intensity, which disposes an agent to act or remain passive. These three affective components emerge from numerous discrete signaling molecules. It is a mistake, however, to think that the discrete set of signaling molecules limits biological reward systems to a discrete state space. Variations in the concentration of individual chemical signaling molecules allow a practically continuous interpretation of reward system state. To model this I will use a high-dimensional vector to share information between individual regions of the brain. Although I will explicitly code for arousal, valency, and motivational intensity, I will encode this information in a high-dimensional latent space. This may allow for the complex interactions that take place in the brain's reward system.
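As a minimal sketch of this encoding (the `AffectEncoder` name, the fixed random projection, and the 256-dimensional latent space are illustrative assumptions, not settled design), the three explicit components could be projected into a shared latent vector that other brain modules read:

```python
import numpy as np

class AffectEncoder:
    """Illustrative sketch: project explicit affective components
    (arousal, valency, motivational intensity) into a shared
    high-dimensional latent vector readable by other brain modules."""

    def __init__(self, latent_dim=256, seed=0):
        rng = np.random.default_rng(seed)
        # A fixed random projection stands in for a learned encoder.
        self.projection = rng.normal(size=(3, latent_dim))

    def encode(self, arousal, valency, motivational_intensity):
        # Concentrations vary continuously, so the latent state does too.
        components = np.array([arousal, valency, motivational_intensity])
        return components @ self.projection  # shape: (latent_dim,)

# Example: a high-arousal, mildly negative, low-drive state.
encoder = AffectEncoder()
reward_state = encoder.encode(arousal=0.9, valency=-0.2, motivational_intensity=0.1)
```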
Very little reward experience is a direct consequence of physiological state. Extrinsic and intrinsic reward usually appear to follow predictive error minimization; often, however, the prediction error is over physiological variables. For example, the muscular system exhibits heightened performance following a rise in glucose, bringing performance closer to what the brain predicted and thereby minimizing predictive error. More generally, when the energy management systems of the body perform stably, the brain more reliably minimizes predictive coding error.
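A toy sketch of predictive error minimization over a physiological variable might look like the following; the variable, constants, and update rule are illustrative assumptions rather than a claim about the eventual architecture:

```python
# Minimal predictive-coding sketch over one physiological variable (glucose-driven
# muscular performance). All names and constants here are illustrative assumptions.
prediction = 0.8          # the brain's expected performance level
learning_rate = 0.1

def step(observed_performance, prediction):
    error = observed_performance - prediction   # predictive error
    prediction += learning_rate * error          # move the prediction toward observation
    reward = -abs(error)                         # reward tracks error, not raw state
    return prediction, reward

for observed in [0.5, 0.6, 0.75, 0.85, 0.9]:     # glucose rises, performance rises
    prediction, reward = step(observed, prediction)
```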
Building an artificial intelligence system with a connected energy management system would allow for many of the advantages that deep reinforcement learning is currently discovering, together with the open-ended, unsupervised learning capabilities that biological development allows for. For example, deep reinforcement learning research has shown that dense reward feedback can encourage agents to develop locally suboptimal behaviors. The biological reward system surmounts this problem by effectively randomizing reward administration. Satisfying physiological needs often results in reward experience, but chaotic physiological attractors can dampen or even disconnect external rewards from internally experienced reward. This buffers internally experienced reward from externally administered reward. Eating, drinking, breathing, procreation, and satisfying other physiological drives yield only diminishing returns in reward. The effect is a balanced motivational drive (unlike the hypothetical paperclip-producing monster superintelligence). This could be incorporated into an artificial energy management system by building a chaotic attractor system which grounds energy in both the physical world and the agent's environment. The chaotic character is essential to effectively imitate the biological energy management and reward systems. Just as deep reinforcement learning has found that giving agents a random reward increases exploration at the cost of some optimal performance, this chaotic attractor system would help the agent in open-ended learning.
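One way to sketch this, assuming a logistic map stands in for the chaotic physiological attractor and that drive satisfaction satiates exponentially, is to let the attractor's state scale how much of an external reward is internally experienced:

```python
import math

class ChaoticRewardBuffer:
    """Illustrative sketch: a logistic-map attractor modulates how much of an
    externally administered reward is internally experienced, and repeated
    drive satisfaction yields diminishing (rate-decreasing) returns."""

    def __init__(self, r=3.9, x0=0.37):
        self.r = r            # r ~ 3.9 puts the logistic map in its chaotic regime
        self.x = x0           # attractor state in (0, 1)
        self.satiation = 0.0  # grows as a drive is repeatedly satisfied

    def experience(self, external_reward):
        # Advance the chaotic attractor one step.
        self.x = self.r * self.x * (1.0 - self.x)
        # Diminishing returns: each satisfaction is worth less than the last.
        gain = math.exp(-self.satiation)
        self.satiation += external_reward
        # The attractor dampens (or nearly disconnects) the external reward.
        return external_reward * gain * self.x

buffer = ChaoticRewardBuffer()
felt = [buffer.experience(1.0) for _ in range(5)]  # same external reward, varying experience
```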
Although pleasure and aversion are not necessarily communicated explicitly from the outside world to the inside of the agent, biological agents give nociceptive information special attention. It follows two major pathways in the brain: the informative route beginning in the somatosensory cortices, and the affective route, which notably includes the insula. The insula is a critical component in integrating homeostatic and sensory information; it displays similar activation patterns whether the agent is empathizing with another's pain, feeling its own pain, or exerting itself physically or cognitively. To give pain a similar role in artificial agents, there should be privileged pain sensory endpoints in the brain that make direct (but still inhibitable) connections to the motivational system, and signaling in the pain nerve should produce brief, not chronic, disruptions in energetic homeostasis.
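A rough sketch of such a privileged pain channel, with hypothetical class names and parameters, could couple a direct but inhibitable link to the motivational system with a decaying, non-chronic disruption of energetic homeostasis:

```python
from dataclasses import dataclass

@dataclass
class MotivationalSystem:
    drive: float = 0.0
    def urge(self, amount):
        self.drive += amount  # pain directly raises the urge to act

class PainEndpoint:
    """Illustrative sketch of a privileged pain channel: a direct but
    inhibitable connection to the motivational system, plus a brief
    (decaying, non-chronic) disruption of energetic homeostasis."""

    def __init__(self, motivation, inhibition=0.0, decay=0.5):
        self.motivation = motivation
        self.inhibition = inhibition   # 0 = fully open, 1 = fully suppressed
        self.decay = decay             # how quickly the disruption fades
        self.disruption = 0.0

    def signal(self, intensity):
        # Direct but inhibitable connection to motivation.
        self.motivation.urge(intensity * (1.0 - self.inhibition))
        self.disruption += intensity   # transient hit to energetic homeostasis

    def tick(self):
        self.disruption *= self.decay  # brief, not chronic
        return self.disruption

pain = PainEndpoint(MotivationalSystem())
pain.signal(intensity=0.8)
residual = [pain.tick() for _ in range(4)]  # disruption fades over a few steps
```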
Other causes may result in homeostatic deviations as well; their collective effects are generally considered stress. Biological agents spend much of their developing life training the physiological mechanisms of their body to handle whatever load of stress is expected across the developmental span. Even at the beginning of the day, the body prepares for the level of stress it expects to experience during that day. Similar effects are in play over exponentially longer timescales such as weeks and years. I should have a similar exponentially periodic learning rate scheduler for training the physiological mechanisms to minimize free energy, that is, to minimize the difference between produced and consumed energy. Since e is an irrational number, no combination of periodic rhythms with exponentially increasing wavelengths ever exactly repeats. This may help encourage exploration. In fact, I may include a similar form of exponentially increasing waveforms in the raw physiological agent state to encourage exploration.
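A minimal sketch of such a scheduler, assuming wavelengths spaced by powers of e and an arbitrary base rate, sums sinusoids whose incommensurate periods keep the combined rhythm from ever exactly repeating:

```python
import math

def quasi_periodic_lr(step, base_lr=1e-3, num_scales=4):
    """Illustrative learning-rate scheduler: sum sinusoids whose wavelengths
    grow as powers of e. Because the wavelength ratios are irrational, the
    combined rhythm is quasi-periodic and never exactly repeats."""
    modulation = 0.0
    for k in range(num_scales):
        wavelength = math.e ** (k + 1)          # e, e^2, e^3, ...
        modulation += math.sin(2 * math.pi * step / wavelength)
    # Rescale the summed waveform into a positive multiplier around 1.0.
    multiplier = 1.0 + 0.5 * (modulation / num_scales)
    return base_lr * multiplier

lrs = [quasi_periodic_lr(t) for t in range(1000)]  # never settles into a fixed cycle
```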
Finally, employing an energy management system would have the advantage of tying together energy in the real world, energy in the agent's world, and motivational energy. As in the real world, the agent's energy management system would directly receive energy from its (virtual) environment. Greater energy in the physical brain allows for faster mental processing by reducing the amount of accumulated evidence required to surpass the postsynaptic potential threshold. In the artificial agent too, an energy budget allows penalizing computation and training. A reinforcement learning subsystem trained to maximize reward could receive energy as a parameter and then determine whether to spend that energy on other brain modules performing inference or supervised learning; it would also receive as parameters the costs of both actions. Then perhaps a system of reinforcement learning systems could be constructed to grow (and even shrink!) the computation graph of the brain in accordance with available energy. From the physical-world perspective, engineers could increase or decrease the amount of (virtual) energy resources available in the agent's environment to match the computational resources available.
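A sketch of such an energy-budgeted controller, where the epsilon-greedy value table is only a stand-in for a trained reinforcement learning subsystem, might receive the budget and both action costs as parameters and choose how to spend the energy:

```python
import random

class EnergyAllocator:
    """Illustrative sketch: a controller receives its current energy budget and
    the costs of inference and learning, then decides how to spend the budget.
    The epsilon-greedy value table stands in for a trained RL subsystem."""

    def __init__(self, epsilon=0.1):
        self.values = {"infer": 0.0, "learn": 0.0, "idle": 0.0}
        self.epsilon = epsilon

    def choose(self, energy, cost_infer, cost_learn):
        affordable = ["idle"]
        if energy >= cost_infer:
            affordable.append("infer")
        if energy >= cost_learn:
            affordable.append("learn")
        if random.random() < self.epsilon:
            return random.choice(affordable)     # explore
        return max(affordable, key=lambda a: self.values[a])  # exploit

    def update(self, action, reward, lr=0.1):
        # Simple running-average value update from observed reward.
        self.values[action] += lr * (reward - self.values[action])

allocator = EnergyAllocator()
action = allocator.choose(energy=5.0, cost_infer=1.0, cost_learn=3.0)
```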
It would be worth investigating whether the structure of any particular energy management system is important to intelligence. I may just copy animals.
The neuronal-level information/energy minimizing objective is sufficient to account for a broad range of human motivations. However, vertebrate brains additionally possess a globalized reward system which further broadcasts anticipated energy signals and modulates neuronal activity appropriately. The reward system is not a narrow-minded optimizer; it does not bias physiological parameters toward the extreme but rather toward the optimal. Natural behaviors like consumption, exercise, and paying attention are rewarding given their association with an increase in mobile energy. In excess, however, these behaviors fail to contribute to energy regulation and often lose their motivational salience. Additionally, the reward system itself is dynamic. Although initialized by genetic information, the reward landscape is weathered and reinforced by regular feedback. Indeed, it is quite peculiar that the brain should regard certain stimuli as pleasurable and others as aversive. Professional acupuncture or massage may produce nociceptive stimuli and yet be highly pleasurable. While the reward system's complex nature makes it difficult for humans to accurately forecast their affective state, the effect of its feedback is autonomous motivation for advanced skill acquisition. It tunes the exploration/exploitation tradeoff of a noisy fitness function over neural activation patterns, which are then quickly learned by other neurons. Finally, the reward system's desensitization and decay under hyperactivation, such as by nicotine, THC, and other addictive substances, renders it relatively robust to tampering.
The objective or reward function is a central focus of machine learning; as the adage goes, "give me the reward function and I'll tell you what it does." The reward function provides the necessary feedback signal to direct optimization. However, as the paradigm of allostasis emphasizes, there is no single handle on biological reward, just as there is no sole physiological parameter for emotional state. Unlike simple reinforcement learning agents, biological creatures perceive affect in multiple dimensions. Affective psychology commonly identifies three components: arousal, valency, and motivational intensity. Arousal has no particular preference for or against an action but merely increases an agent's awareness. Valency identifies a state as positive or negative, but even strong valency or preference may be associated with a lack of motivational intensity. Motivational intensity, which disposes an agent to act or remain passive, describes the degree of action an agent attempts to perform physically, socially, or intellectually. These three affective components are associated with a discrete yet large set of increasingly well-identified signaling molecules.