Created by: suchenzang
- Set hard-coded defaults for the loss scaler logic (they are no longer a function of data parallelism)
- Scale the loss scaler's scale window along with the loss scale
- Stop raising FloatingPointError when the min loss scale is reached; just continue skipping gradients (relying on external monitoring to restart)
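
For reviewers who want a mental model of the behavior described above, here is a minimal sketch, not the code in this PR. The class name `DynamicLossScalerSketch`, its method names, and all default values are hypothetical placeholders. The rule used to resize the window (multiply it by the same ratio as the loss scale) is an assumption, since the exact rule is not spelled out here.

```python
class DynamicLossScalerSketch:
    """Hypothetical illustration of the described behavior; not the PR's code."""

    def __init__(self, init_scale=4.0, scale_factor=2.0,
                 scale_window=256, min_loss_scale=2.0 ** -5):
        # Placeholder defaults: fixed constants, independent of the
        # data-parallel world size. The real values are not given here.
        self.loss_scale = init_scale
        self.scale_factor = scale_factor
        self.scale_window = scale_window
        self.min_loss_scale = min_loss_scale
        self._clean_steps = 0  # overflow-free steps since the last rescale

    def _rescale(self, factor):
        # Clamp the loss scale at the floor, then resize the window by the
        # ratio actually applied (assumed rule, see the note above).
        new_scale = max(self.loss_scale * factor, self.min_loss_scale)
        ratio = new_scale / self.loss_scale
        self.loss_scale = new_scale
        self.scale_window = max(1, int(round(self.scale_window * ratio)))

    def step(self, overflow):
        """Return True if the optimizer step should be applied, else False."""
        if overflow:
            self._clean_steps = 0
            if self.loss_scale > self.min_loss_scale:
                self._rescale(1.0 / self.scale_factor)
            # At the floor: no FloatingPointError is raised. The update is
            # skipped and an external monitor decides whether to restart.
            return False
        self._clean_steps += 1
        if self._clean_steps >= self.scale_window:
            self._rescale(self.scale_factor)
            self._clean_steps = 0
        return True
```

In a training loop, the caller would apply the optimizer update only when `step(overflow)` returns `True`. Repeated overflows at the minimum loss scale then show up as a run of skipped updates rather than a crash.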