SGD Optimiser

sgd_optimiser_type

sgd_optimiser_type(
  learning_rate=0.01,
  momentum=0.0,
  nesterov=.false.,
  num_params=...,
  regulariser=...,
  clip_dict=...,
  lr_decay=...
)

Stochastic Gradient Descent (SGD) optimiser with optional momentum and Nesterov acceleration.

The update rule without momentum:

\[\theta_{t+1} = \theta_t - \eta \nabla L(\theta_t)\]

With momentum:

\[\begin{split}v_{t+1} &= \mu v_t + \nabla L(\theta_t) \\ \theta_{t+1} &= \theta_t - \eta v_{t+1}\end{split}\]

With Nesterov momentum:

\[\begin{split}v_{t+1} &= \mu v_t + \nabla L(\theta_t - \eta \mu v_t) \\ \theta_{t+1} &= \theta_t - \eta v_{t+1}\end{split}\]

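where \(\eta\) is the learning rate, \(\mu\) is the momentum coefficient, \(v_t\) is the velocity (the running accumulation of past gradients), and \(\nabla L\) is the gradient of the loss \(L\) with respect to the parameters \(\theta\).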
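The three update rules above can be compared side by side on a toy problem. The following is a minimal, self-contained sketch that does not call sgd_optimiser_type itself: the program name, the variable names, the quadratic loss \(L(\theta) = \tfrac{1}{2}\theta^2\), the step count, and the zero initial velocity are all assumptions made for illustration.

```fortran
program sgd_update_sketch
  use, intrinsic :: iso_fortran_env, only: real32
  implicit none

  ! Illustrative values only; eta and mu mirror the symbols in the equations.
  real(real32), parameter :: eta = 0.01_real32   ! learning rate
  real(real32), parameter :: mu  = 0.9_real32    ! momentum coefficient
  integer, parameter :: num_steps = 200          ! hypothetical step count

  real(real32) :: theta_plain, theta_mom, theta_nag
  real(real32) :: v_mom, v_nag
  integer :: t

  ! All three variants start from the same parameter value.
  theta_plain = 5.0_real32
  theta_mom   = 5.0_real32
  theta_nag   = 5.0_real32

  ! Assumption: the velocity starts at zero.
  v_mom = 0.0_real32
  v_nag = 0.0_real32

  do t = 1, num_steps
     ! Without momentum: theta <- theta - eta * grad(theta)
     theta_plain = theta_plain - eta * grad(theta_plain)

     ! With momentum: accumulate the gradient, then step along the velocity.
     v_mom = mu * v_mom + grad(theta_mom)
     theta_mom = theta_mom - eta * v_mom

     ! With Nesterov momentum: evaluate the gradient at the look-ahead
     ! point theta - eta * mu * v, then step along the new velocity.
     v_nag = mu * v_nag + grad(theta_nag - eta * mu * v_nag)
     theta_nag = theta_nag - eta * v_nag
  end do

  print '(a, es12.4)', 'no momentum:       ', theta_plain
  print '(a, es12.4)', 'momentum:          ', theta_mom
  print '(a, es12.4)', 'Nesterov momentum: ', theta_nag

contains

  ! Gradient of the toy loss L(theta) = 0.5 * theta**2.
  pure function grad(theta) result(g)
    real(real32), intent(in) :: theta
    real(real32) :: g
    g = theta
  end function grad

end program sgd_update_sketch
```

All three variants should drive the parameter towards the minimum at zero. How quickly each one does so depends on the chosen learning rate and momentum coefficient, so the values here are not a recommendation.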

Arguments

  • learning_rate (real(real32)): Step size for parameter updates. Default: 0.01.

  • momentum (real(real32)): Momentum factor. Default: 0.0 (no momentum).

  • nesterov (logical): Whether to use Nesterov momentum. Default: .false..

  • num_params (integer): Number of parameters to optimise.

  • regulariser (class(base_regulariser_type)): Regularisation method (e.g., L2 regularisation).

  • clip_dict (type(clip_type)): Gradient clipping configuration.

  • lr_decay (class(base_lr_decay_type)): Learning rate decay schedule.

Notes:

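SGD is the baseline optimisation algorithm for training neural networks. Momentum accumulates past gradients in the velocity, which typically accelerates convergence along directions where successive gradients agree and damps oscillation along directions where they alternate in sign.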
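Both effects can be read off the momentum update rule in two idealised cases, assuming a zero initial velocity \(v_0 = 0\) (an assumption, not something stated above). If the gradient is a constant \(g\) at every step, the velocity grows as a geometric series:

\[v_t = g \sum_{k=0}^{t-1} \mu^k = g \, \frac{1 - \mu^t}{1 - \mu} \;\longrightarrow\; \frac{g}{1 - \mu}\]

so the step size approaches \(\eta g / (1 - \mu)\), which for \(\mu = 0.9\) is ten times the step taken without momentum. If instead the gradient alternates between \(+g\) and \(-g\), the velocity settles into an alternating pattern \(v_t = \pm a\) with

\[a = g - \mu a \quad\Longrightarrow\quad a = \frac{g}{1 + \mu}\]

so for \(\mu = 0.9\) each step is roughly half the size it would be without momentum.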