A good AD (automatic differentiation) library needs not only AD itself, but also operator fusion, especially when running on a GPU.
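To illustrate what operator fusion buys you, here is a toy CPU-side analogy using plain Scala collections (not taken from any of the libraries mentioned here): chained `map` calls each allocate an intermediate collection, while a `view` composes the functions into a single traversal, much like a GPU backend emitting one kernel for several element-wise ops.

```scala
object FusionSketch {
  val input = Vector(1.0, 2.0, 3.0)

  // Unfused: two passes, one intermediate Vector allocated for `_ * 2`.
  val unfused: Vector[Double] = input.map(_ * 2).map(_ + 1)

  // "Fused": the view composes both functions and traverses once.
  val fused: Vector[Double] = input.view.map(_ * 2).map(_ + 1).toVector

  // Both yield Vector(3.0, 5.0, 7.0); only the traversal strategy differs.
}
```

On a GPU the stakes are higher than an extra allocation: each unfused op is a kernel launch plus a round trip through device memory, which is exactly what fusion avoids.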
I created DeepLearning.scala to perform compile-time reverse AD based on RAII Monads, with the help of Each, which performs general CPS translation. I also created Compute.scala for operator fusion at runtime (i.e. JIT).
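For readers unfamiliar with reverse AD, a minimal tape-based sketch may help; this is a generic illustration of the technique, not DeepLearning.scala's actual RAII-monad implementation (the names `Var`, `gradients`, and the mutable tape are all invented for this example):

```scala
import scala.collection.mutable.ArrayBuffer

object ToyAD {
  // Tape of backward steps, recorded during the forward pass.
  private val tape = ArrayBuffer.empty[() => Unit]

  final class Var(val value: Double) {
    var grad: Double = 0.0

    def *(that: Var): Var = {
      val out = new Var(value * that.value)
      tape += { () => // product rule, replayed in reverse
        this.grad += that.value * out.grad
        that.grad += this.value * out.grad
      }
      out
    }

    def +(that: Var): Var = {
      val out = new Var(value + that.value)
      tape += { () => // addition passes the gradient through unchanged
        this.grad += out.grad
        that.grad += out.grad
      }
      out
    }
  }

  // Seed the output gradient with 1 and replay the tape backwards.
  def gradients(loss: Var): Unit = {
    loss.grad = 1.0
    tape.reverseIterator.foreach(_.apply())
    tape.clear()
  }
}

// Example: z = x * y + x, so dz/dx = y + 1 and dz/dy = x.
val x = new ToyAD.Var(3.0)
val y = new ToyAD.Var(4.0)
val z = x * y + x
ToyAD.gradients(z) // x.grad == 5.0, y.grad == 3.0
```

The point of doing this at compile time instead, as DeepLearning.scala does via CPS translation, is that no runtime tape needs to be interpreted at all.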
A better solution would require performing both AD and operator fusion at compile time, which needs the reifiable control flow proposed in Pre SIP: `for` with Control Flow (an alternative to Scala Async, Scala Continuations and Scala Virtualized).
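To show what "reifiable control flow" means in monadic terms, here is a sketch (the `Expr`, `Pure`, and `Bind` names are invented for illustration, not part of the proposal): a `for` comprehension over a custom type desugars to `flatMap`/`map` calls, so the program becomes a data structure that a library or compiler plugin could inspect — e.g. to differentiate it or fuse its operators — before running it.

```scala
// A reified monadic program: building it does not execute anything.
sealed trait Expr[A] {
  def flatMap[B](f: A => Expr[B]): Expr[B] = Bind(this, f)
  def map[B](f: A => B): Expr[B] = Bind(this, (a: A) => Pure(f(a)))
}
final case class Pure[A](a: A) extends Expr[A]
final case class Bind[A, B](fa: Expr[A], f: A => Expr[B]) extends Expr[B]

// One possible interpreter; an AD or fusion pass could walk the same tree.
def run[A](e: Expr[A]): A = e match {
  case Pure(a)       => a
  case b: Bind[a, A] => run(b.f(run(b.fa)))
}

// Desugars to Pure(2).flatMap(x => Pure(3).map(y => x + y)),
// i.e. an Expr tree describing the computation.
val program: Expr[Int] =
  for {
    x <- Pure(2)
    y <- Pure(3)
  } yield x + y
```

What the Pre-SIP would add on top of plain `for`/`yield` is the ability to reify `if`/`while`-style control flow the same way, which is exactly what a compile-time AD pass needs.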
Vote for the proposal if you want Scala to be able to do AD with the best possible performance.