There appears to be a slowdown when using reverse-mode differentiation with multiple nested loops. See bezier_curvefit.py
in slang-torch PR#1.
Filing this in the main repository because it's most likely an issue with the recompute/checkpoint policy creating unnecessary intermediates.
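
For context, here is a minimal hand-written sketch of the trade-off a recompute/checkpoint policy makes for a nested loop in reverse mode. It is plain Python and does not reflect the compiler's actual policy or generated code; `step`, `dstep`, `grad_store_all`, and `grad_checkpoint_outer` are hypothetical names made up for illustration.

```python
import math

def step(x):
    # One inner-loop iteration: a cheap nonlinear update (illustrative only).
    return math.sin(x) + 0.5 * x

def dstep(x):
    # Derivative of step with respect to its input.
    return math.cos(x) + 0.5

def grad_store_all(x0, outer, inner):
    # Forward pass: save the input of every inner iteration.
    tape = []
    x = x0
    for _ in range(outer):
        for _ in range(inner):
            tape.append(x)
            x = step(x)
    # Reverse pass: walk the tape backwards, applying the chain rule.
    g = 1.0
    for saved in reversed(tape):
        g *= dstep(saved)
    # Returns the gradient and the number of stored intermediates.
    return g, len(tape)

def grad_checkpoint_outer(x0, outer, inner):
    # Forward pass: save only the state at the start of each outer iteration.
    checkpoints = []
    x = x0
    for _ in range(outer):
        checkpoints.append(x)
        for _ in range(inner):
            x = step(x)
    # Reverse pass: recompute the inner inputs of one outer iteration at a time.
    g = 1.0
    recomputed_steps = 0
    for start in reversed(checkpoints):
        local = []
        x = start
        for _ in range(inner):
            local.append(x)
            x = step(x)
            recomputed_steps += 1
        for saved in reversed(local):
            g *= dstep(saved)
    # Peak storage is the checkpoints plus one inner loop's worth of values.
    return g, len(checkpoints) + inner, recomputed_steps

g_store, n_store = grad_store_all(0.3, outer=4, inner=5)
g_ckpt, n_ckpt, n_recomputed = grad_checkpoint_outer(0.3, outer=4, inner=5)
print(g_store, n_store)
print(g_ckpt, n_ckpt, n_recomputed)
```

In this sketch, storing everything keeps `outer * inner` values, while checkpointing the outer loop keeps `outer + inner` values at the cost of re-running each inner step once. A policy that stores more than it needs, or recomputes more than it needs, at each nesting level could plausibly produce the kind of slowdown described above, though that is an assumption rather than a confirmed diagnosis.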