What about directly modeling an EBM p(x, a, z) over observation x, action a, and latent z? Take the observation x and minimally adjust z, then minimally adjust a, each step lowering the energy.
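A minimal sketch of that two-step inference, assuming a hypothetical scalar quadratic energy and a quadratic proximity penalty as the meaning of "minimal" adjustment. Neither is specified in the note; `energy`, `proximal_descent`, `infer`, and `penalty` are illustrative names only.

```python
def energy(x, a, z):
    # Hypothetical toy energy (an assumption, not from the note):
    # low when z explains x and the action a agrees with z.
    return 0.5 * (x - z) ** 2 + 0.5 * (z - a) ** 2


def grad(f, v, eps=1e-5):
    # Central finite-difference derivative of a scalar function.
    return (f(v + eps) - f(v - eps)) / (2 * eps)


def proximal_descent(f, start, penalty, lr=0.1, steps=200):
    # Gradient descent on f(u) + 0.5 * penalty * (u - start)^2,
    # so the result stays close to its starting value.
    def obj(u):
        return f(u) + 0.5 * penalty * (u - start) ** 2

    v = start
    for _ in range(steps):
        v -= lr * grad(obj, v)
    return v


def infer(x, z_prev, a_prev, penalty=1.0):
    # Step 1: hold x and a fixed, minimally adjust z.
    z = proximal_descent(lambda u: energy(x, a_prev, u), z_prev, penalty)
    # Step 2: hold x and the new z fixed, minimally adjust a.
    a = proximal_descent(lambda u: energy(x, u, z), a_prev, penalty)
    return z, a


z, a = infer(x=1.0, z_prev=0.0, a_prev=0.0)
print(z, a, energy(1.0, a, z))
```

With x = 1 and z_prev = a_prev = 0, this toy settles near z = 1/3 and a = 1/6, and the energy drops from 0.5 to about 0.24. A larger `penalty` keeps z and a closer to their previous values at the cost of a higher final energy.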

The actor is the inverse of the observer.

Train to minimize prediction error, subject to minimum-change constraints on the action and the latent.

min over (F_abs, F_act):  KL(Z || Z_prev) + KL(Z_from_action || Z_target)
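To make the objective above concrete, here is a small sketch that evaluates the two KL terms, assuming every latent is a scalar Gaussian given as a (mean, variance) pair. That parameterization is an assumption; the note does not say what family Z belongs to. `kl_gauss` and `objective` are hypothetical names, and the functions F_abs and F_act that would produce these distributions are not modeled here.

```python
import math


def kl_gauss(mu_p, var_p, mu_q, var_q):
    # Closed-form KL(P || Q) for scalar Gaussians P = N(mu_p, var_p), Q = N(mu_q, var_q).
    return 0.5 * (
        math.log(var_q / var_p) + (var_p + (mu_p - mu_q) ** 2) / var_q - 1.0
    )


def objective(z, z_prev, z_from_action, z_target):
    # Each argument is a (mean, variance) pair (assumed parameterization).
    latent_change = kl_gauss(*z, *z_prev)                 # KL(Z || Z_prev)
    action_error = kl_gauss(*z_from_action, *z_target)    # KL(Z_from_action || Z_target)
    return latent_change + action_error


loss = objective(
    z=(0.0, 1.0),
    z_prev=(0.0, 1.0),
    z_from_action=(0.0, 1.0),
    z_target=(1.0, 1.0),
)
print(loss)
```

In this example the latent did not move, so the first term is zero, and the whole loss comes from the action-predicted latent missing the target mean by one unit.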

While some supervised learning (SL) may be beneficial, it is important to also backprop through the uncontrollable part of the observation. This demands reinforcement learning (RL).