In the paper, the proposed betas of Adam were set to 0.0 & 0.9, so I guess it makes sense to use those as default here as well. I also got more stable results during training using these. But maybe you tweaked the betas on purpose, and the old ones make more sense to you. TY for your implementations btw :).