What about directly modeling an EBM p(x, a, z)? Given an observation x, first minimally adjust z to lower the energy, then minimally adjust a to lower it further.
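A minimal sketch of that two-step inference, under loud assumptions: the quadratic energy, the matrices `W` and `B`, the proximity weight `lam`, and the step sizes are all hypothetical stand-ins, not something these notes specify. "Minimally adjust" is read here as gradient descent with a quadratic penalty on distance from the previous value.

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.normal(size=(4, 3))  # hypothetical linear map: latent z -> observation x
B = rng.normal(size=(3, 2))  # hypothetical linear map: action a -> latent z

def energy(x, a, z):
    # Toy energy: low when z explains x and a explains z (assumed form).
    return 0.5 * np.sum((x - W @ z) ** 2) + 0.5 * np.sum((z - B @ a) ** 2)

def infer_z(x, a, z_prev, lam=1.0, lr=0.02, steps=200):
    # Step 1: lower the energy in z while staying close to z_prev.
    z = z_prev.copy()
    for _ in range(steps):
        grad = -W.T @ (x - W @ z) + (z - B @ a) + lam * (z - z_prev)
        z -= lr * grad
    return z

def infer_a(z, a_prev, lam=1.0, lr=0.02, steps=200):
    # Step 2: lower the energy in a while staying close to a_prev.
    a = a_prev.copy()
    for _ in range(steps):
        grad = -B.T @ (z - B @ a) + lam * (a - a_prev)
        a -= lr * grad
    return a

x = rng.normal(size=4)
z_prev, a_prev = np.zeros(3), np.zeros(2)
z = infer_z(x, a_prev, z_prev)
a = infer_a(z, a_prev)
e_before = energy(x, a_prev, z_prev)
e_after = energy(x, a, z)
```

With a learned, non-quadratic energy the same loop would presumably use automatic differentiation instead of hand-written gradients.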
The actor is the inverse of the observer.
Train for prediction-error minimization, subject to minimum-action and minimum-latent-change constraints.
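$$\min_{F_{\text{abs}},\,F_{\text{act}}} \; \mathrm{KL}\left(Z \,\|\, Z_{\text{prev}}\right) + \mathrm{KL}\left(Z_{\text{from action}} \,\|\, Z_{\text{target}}\right)$$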
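One way to make the two KL terms concrete is to assume every latent distribution is a diagonal Gaussian, which the notes do not state. Under that assumption each term has a closed form. The function names and the `(mean, variance)` tuple layout below are illustrative only.

```python
import numpy as np

def kl_diag_gauss(mu_p, var_p, mu_q, var_q):
    # Closed-form KL(p || q) for diagonal Gaussians p and q.
    return 0.5 * np.sum(
        np.log(var_q / var_p) + (var_p + (mu_p - mu_q) ** 2) / var_q - 1.0
    )

def objective(z, z_prev, z_from_action, z_target):
    # Each argument is a (mean, variance) pair; this layout is an assumption.
    latent_change = kl_diag_gauss(*z, *z_prev)
    action_error = kl_diag_gauss(*z_from_action, *z_target)
    return latent_change + action_error

ones = np.ones(2)
z = (np.zeros(2), ones)
z_prev = (np.zeros(2), ones)               # identical to z, so the first term is 0
z_from_action = (np.array([1.0, 0.0]), ones)
z_target = (np.zeros(2), ones)             # mean differs by 1 in one dimension

loss = objective(z, z_prev, z_from_action, z_target)
```

Here the first term vanishes because the latent did not move, and the second term is 0.5 from the unit mean offset in one dimension.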
While some SL may be beneficial, it is important to also backprop through the uncontrollable part of the observation. This demands RL.