Differentiable Stochastic Gradient Descent Optimizer.

Useful for algorithms such as MAML that need the gradient of functions of post-update parameters with respect to pre-update parameters.

class DifferentiableSGD(module, lr=0.001)

Differentiable Stochastic Gradient Descent.

DifferentiableSGD performs the same optimization step as SGD, but instead of updating parameters in-place, it saves updated parameters in new tensors, so that the gradient of functions of new parameters can flow back to the pre-updated parameters.

  • module (torch.nn.Module) – A torch module whose parameters need to be optimized.

  • lr (float) – Learning rate of stochastic gradient descent.
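
The out-of-place update described above can be illustrated with plain PyTorch. This is a minimal sketch of the idea only, not this class's implementation: the tensor theta, the two toy losses, and the learning rate value are all assumptions chosen for illustration.

```python
import torch

lr = 0.1  # illustrative learning rate (assumption)
theta = torch.tensor([1.0, -2.0], requires_grad=True)  # pre-update parameters

# Toy inner loss: sum(theta^2), whose gradient is 2 * theta.
inner_loss = (theta ** 2).sum()
# create_graph=True keeps the gradient itself differentiable.
(grad,) = torch.autograd.grad(inner_loss, theta, create_graph=True)

# Out-of-place SGD step: the result is a new tensor that stays in the graph,
# instead of overwriting theta in-place.
theta_new = theta - lr * grad

# Toy outer loss evaluated at the post-update parameters.
outer_loss = theta_new.sum()
outer_loss.backward()

# Gradient of the outer loss with respect to the pre-update parameters:
# d(theta_new)/d(theta) = 1 - lr * 2 = 0.8 for every element.
print(theta.grad)
```

An in-place update (as in a standard SGD step) would cut theta_new off from theta, so the outer gradient could not reach the pre-update parameters.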


Take an optimization step.


Sets gradients of all model parameters to zero.


Sets gradients for all model parameters to None.

This is an alternative to zero_grad which sets gradients to zero.
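
The difference between zeroing gradients and setting them to None can be sketched in plain PyTorch as follows. The layer and the loops are hypothetical illustrations and do not show how this class implements either method.

```python
import torch

layer = torch.nn.Linear(2, 1)  # hypothetical module for illustration
layer(torch.ones(1, 2)).sum().backward()  # populate .grad on every parameter

# Zeroing keeps the gradient tensors allocated and fills them with zeros.
for p in layer.parameters():
    p.grad.zero_()
zeroed = [p.grad for p in layer.parameters()]

# Setting to None drops the gradient attribute instead of zeroing it.
for p in layer.parameters():
    p.grad = None
cleared = [p.grad for p in layer.parameters()]
```

After the first loop each parameter still holds a gradient tensor of zeros; after the second, each parameter's grad attribute is None until the next backward pass.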