# Optimization Functions

Optimization functions help to minimize (or maximize) an objective function by iteratively improving the weights and biases.

## Optimization Descriptions

AdaGrad is an optimization method that allows different step sizes for different features. It increases the influence of rare but informative features.
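The per-feature step size can be illustrated with a minimal sketch in plain Python. The function name `adagrad_step`, its arguments, and the toy objective are assumptions made for illustration and are not part of the ztlearn API.

```python
import math

def adagrad_step(params, grads, cache, lr=0.1, eps=1e-8):
    """Apply one AdaGrad update in place (hypothetical helper, not ztlearn's)."""
    for i, g in enumerate(grads):
        cache[i] += g * g  # running sum of squared gradients, kept per parameter
        params[i] -= lr * g / (math.sqrt(cache[i]) + eps)
    return params

# Toy objective J(w) = 0.5 * (10 * w0^2 + 0.1 * w1^2), gradient (10 * w0, 0.1 * w1).
w = [1.0, 1.0]
cache = [0.0, 0.0]
for _ in range(200):
    grad = [10.0 * w[0], 0.1 * w[1]]
    adagrad_step(w, grad, cache)
```

On the very first step both coordinates move by roughly `lr`, even though their gradients differ by a factor of 100: a parameter with small or infrequent gradients is not penalized for it.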

References

[1] An overview of gradient descent optimization algorithms
Parameters: kwargs – Arbitrary keyword arguments.
optimization_name

Adadelta is an extension of Adagrad that seeks to avoid Adagrad's aggressive, monotonically decreasing learning rate. This is achieved via a dynamic learning rate, i.e. a different learning rate is computed for each parameter at each update.
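A minimal sketch of the update, following the usual formulation with decaying averages of squared gradients and squared updates, is given below. The name `adadelta_step` and the defaults `rho=0.95` and `eps=1e-6` are assumptions for illustration, not values taken from ztlearn.

```python
import math

def adadelta_step(params, grads, avg_sq_grad, avg_sq_delta, rho=0.95, eps=1e-6):
    """Apply one Adadelta update in place (hypothetical helper, not ztlearn's)."""
    for i, g in enumerate(grads):
        # Decaying average of squared gradients replaces AdaGrad's growing sum.
        avg_sq_grad[i] = rho * avg_sq_grad[i] + (1.0 - rho) * g * g
        # The ratio of the two RMS terms plays the role of the learning rate.
        delta = -math.sqrt(avg_sq_delta[i] + eps) / math.sqrt(avg_sq_grad[i] + eps) * g
        avg_sq_delta[i] = rho * avg_sq_delta[i] + (1.0 - rho) * delta * delta
        params[i] += delta
    return params

# Toy objective J(w) = w^2 with gradient 2 * w.
w = [1.0]
avg_sq_grad, avg_sq_delta = [0.0], [0.0]
for _ in range(50):
    adadelta_step(w, [2.0 * w[0]], avg_sq_grad, avg_sq_delta)
```

Note that no learning rate argument appears: the step size is derived from the two running averages.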

References

[1] An overview of gradient descent optimization algorithms
Parameters: kwargs – Arbitrary keyword arguments.
optimization_name

Adam computes adaptive learning rates for each parameter. It stores an exponentially decaying average of past squared gradients and also keeps an exponentially decaying average of past gradients.
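The two decaying averages can be sketched as follows. This follows the update rule from the Adam paper, including bias correction; the function name `adam_step` and the default hyperparameters are assumptions for illustration and may differ from ztlearn's.

```python
import math

def adam_step(params, grads, m, v, t, lr=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    """Apply one Adam update in place; t is the 1-based step count (hypothetical helper)."""
    for i, g in enumerate(grads):
        m[i] = beta1 * m[i] + (1 - beta1) * g        # decaying average of gradients
        v[i] = beta2 * v[i] + (1 - beta2) * g * g    # decaying average of squared gradients
        m_hat = m[i] / (1 - beta1 ** t)              # bias correction for the zero start
        v_hat = v[i] / (1 - beta2 ** t)
        params[i] -= lr * m_hat / (math.sqrt(v_hat) + eps)
    return params

# Toy objective J(w) = w^2 with gradient 2 * w.
w, m, v = [1.0], [0.0], [0.0]
for t in range(1, 101):
    adam_step(w, [2.0 * w[0]], m, v, t)
```

With bias correction, the first step has magnitude close to `lr` regardless of the gradient scale.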

References

[1] An overview of gradient descent optimization algorithms
[2] Adam: A Method for Stochastic Optimization
Parameters: kwargs – Arbitrary keyword arguments.
optimization_name

AdaMax is a variant of Adam based on the infinity norm. The Adam update rule for individual weights scales their gradients inversely proportional to a (scaled) L2 norm of their individual current and past gradients. For AdaMax, the L2 norm based update rule is generalized to an Lp norm based update rule. These variants are numerically unstable for large p, but in the special case where p tends to infinity, a simple and stable algorithm emerges.
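In the limit of p tending to infinity, the second-moment estimate becomes a running maximum, which can be sketched as below. The name `adamax_step` and the defaults are assumptions for illustration; the small `eps` term is an added safeguard against division by zero and is not part of the rule as stated in the Adam paper.

```python
def adamax_step(params, grads, m, u, t, lr=0.002, beta1=0.9, beta2=0.999, eps=1e-8):
    """Apply one AdaMax update in place; t is the 1-based step count (hypothetical helper)."""
    for i, g in enumerate(grads):
        m[i] = beta1 * m[i] + (1 - beta1) * g   # decaying average of gradients, as in Adam
        u[i] = max(beta2 * u[i], abs(g))        # infinity-norm replacement for the L2 term
        params[i] -= (lr / (1 - beta1 ** t)) * m[i] / (u[i] + eps)
    return params

# Toy objective J(w) = w^2 with gradient 2 * w.
w, m, u = [1.0], [0.0], [0.0]
for t in range(1, 101):
    adamax_step(w, [2.0 * w[0]], m, u, t)
```

Because `u` is a maximum rather than an average, it needs no bias correction of its own.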

References

[1] An overview of gradient descent optimization algorithms
[2] Adam: A Method for Stochastic Optimization
Parameters: kwargs – Arbitrary keyword arguments.
optimization_name
class ztlearn.optimizers.GD

Bases: object

GD optimizes the parameters theta of an objective function J(theta) by computing the gradient over all of the training samples in the dataset before each update. The update is performed in the opposite direction of the gradient of the objective function, d/d_theta J(theta), with respect to the parameters theta. The learning rate eta determines the size of the steps taken towards a minimum.
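As a minimal sketch of this full-batch update, the example below fits a single weight to a toy dataset with a mean squared error objective. The function names, the data, and the choice of objective are assumptions for illustration and do not reflect how ztlearn implements `GD`.

```python
def batch_gradient(theta, inputs, targets):
    """Gradient of J(theta) = 1/(2n) * sum((theta * x - y)^2), averaged over all samples."""
    n = len(inputs)
    return sum((theta * x - y) * x for x, y in zip(inputs, targets)) / n

def gradient_descent(inputs, targets, eta=0.1, epochs=50):
    """Run full-batch gradient descent from theta = 0 (hypothetical helper)."""
    theta = 0.0
    for _ in range(epochs):
        # One update per pass: the whole dataset contributes to each step.
        theta -= eta * batch_gradient(theta, inputs, targets)
    return theta

# The targets follow y = 2 * x, so theta should approach 2.
theta = gradient_descent([1.0, 2.0, 3.0, 4.0], [2.0, 4.0, 6.0, 8.0])
```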

References

[1] An overview of gradient descent optimization algorithms

NAG is an improvement on SGDMomentum in which the previous parameter values are smoothed and a gradient descent step is taken from this smoothed value. This gives a more informed way of approaching the minimum.
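One common way to write this is the look-ahead formulation, in which the gradient is evaluated at the point the accumulated velocity is about to carry the parameters to. The sketch below assumes that formulation; `nag_step`, `grad_fn`, and the defaults are hypothetical names and values, not ztlearn's.

```python
def nag_step(theta, velocity, grad_fn, eta=0.1, gamma=0.9):
    """One Nesterov accelerated gradient step for a scalar parameter (hypothetical helper)."""
    lookahead = theta - gamma * velocity                     # where momentum alone would land
    velocity = gamma * velocity + eta * grad_fn(lookahead)   # gradient taken at that point
    return theta - velocity, velocity

# Toy objective J(theta) = theta^2 with gradient 2 * theta.
theta, velocity = 5.0, 0.0
for _ in range(200):
    theta, velocity = nag_step(theta, velocity, lambda t: 2.0 * t)
```

Evaluating the gradient at the look-ahead point lets the update correct itself before the momentum term overshoots.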

References

[1] An overview of gradient descent optimization algorithms
[2] A method for unconstrained convex minimization problem with the rate of convergence
[3] Nesterov’s Accelerated Gradient and Momentum as approximations to Regularised Update Descent
Parameters: kwargs – Arbitrary keyword arguments.
optimization_name
class ztlearn.optimizers.OptimizationFunction(optimizer_kwargs)

Bases: object

name
class ztlearn.optimizers.Optimizer(**kwargs)

Bases: object

get_learning_rate(current_epoch)
class ztlearn.optimizers.RMSprop(**kwargs)

Root Mean Squared Propagation (RMSprop)

RMSprop uses the magnitude of recent gradients to normalize the gradients. A moving average of the squared gradients is kept, and the current gradient is divided by the root of this average (the RMS). The recommended parameter settings are rho = 0.9 and eta (learning rate) = 0.001.
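The normalization can be sketched in a few lines, using the recommended rho and eta as defaults. The function name `rmsprop_step` and the `eps` stabilizer are assumptions for illustration rather than ztlearn's implementation.

```python
import math

def rmsprop_step(params, grads, avg_sq_grad, eta=0.001, rho=0.9, eps=1e-8):
    """Apply one RMSprop update in place (hypothetical helper, not ztlearn's)."""
    for i, g in enumerate(grads):
        # Moving average of squared gradients.
        avg_sq_grad[i] = rho * avg_sq_grad[i] + (1 - rho) * g * g
        # The current gradient is divided by the root of that average.
        params[i] -= eta * g / (math.sqrt(avg_sq_grad[i]) + eps)
    return params

# Toy objective J(w) = w^2 with gradient 2 * w.
w, avg_sq_grad = [1.0], [0.0]
for _ in range(100):
    rmsprop_step(w, [2.0 * w[0]], avg_sq_grad)
```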

References

[1] An overview of gradient descent optimization algorithms
[2] Lecture 6.5 - rmsprop, COURSERA: Neural Networks for Machine Learning
Parameters: kwargs – Arbitrary keyword arguments.
optimization_name
class ztlearn.optimizers.SGD(**kwargs)

SGD optimizes the parameters theta of an objective function J(theta) by performing an update for each training sample, inputs(i) and targets(i), across all samples in the dataset. The update is performed in the opposite direction of the gradient of the objective function, d/d_theta J(theta), with respect to the parameters theta. The learning rate eta determines the size of the steps taken towards a minimum.
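The per-sample update can be sketched as below, fitting a single weight to a toy dataset with a squared error objective. The function name `sgd`, the shuffling scheme, and the data are assumptions for illustration and are not taken from ztlearn.

```python
import random

def sgd(inputs, targets, eta=0.05, epochs=20, seed=0):
    """Fit theta in y = theta * x with one update per sample (hypothetical helper)."""
    rng = random.Random(seed)  # fixed seed so the visiting order is reproducible
    theta = 0.0
    indices = list(range(len(inputs)))
    for _ in range(epochs):
        rng.shuffle(indices)   # visit the samples in a new random order each epoch
        for i in indices:
            # Gradient of 0.5 * (theta * x_i - y_i)^2 for this single sample.
            grad = (theta * inputs[i] - targets[i]) * inputs[i]
            theta -= eta * grad
    return theta

# The targets follow y = 2 * x, so theta should approach 2.
theta = sgd([1.0, 2.0, 3.0, 4.0], [2.0, 4.0, 6.0, 8.0])
```

Compared with a full-batch method, the parameters change after every sample, so a single pass over the data already produces several updates.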

References

[1] An overview of gradient descent optimization algorithms
[2] Large-Scale Machine Learning with Stochastic Gradient Descent
Parameters: kwargs – Arbitrary keyword arguments.
optimization_name
class ztlearn.optimizers.SGDMomentum(**kwargs)

Stochastic Gradient Descent with Momentum (SGDMomentum)

The surface of the objective function often contains regions (ravines) where it curves much more steeply in one dimension than in another. Standard SGD tends to oscillate across the narrow ravine, since the negative gradient points down one of the steep sides rather than along the ravine towards the optimum. Momentum helps to push the parameters more quickly along the shallow ravine towards the minimum.
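The ravine effect can be reproduced on a toy quadratic that is 100 times steeper in one direction than the other. The sketch below uses the classical momentum update; the function `run`, the objective, and the hyperparameters are assumptions chosen for illustration, and the full gradient is used in place of a stochastic one to keep the comparison deterministic.

```python
def run(gamma, steps=100, eta=0.015):
    """Minimize J(w) = 0.5 * (100 * w0^2 + w1^2) with momentum coefficient gamma."""
    w = [1.0, 1.0]
    v = [0.0, 0.0]
    for _ in range(steps):
        grad = [100.0 * w[0], 1.0 * w[1]]  # steep in w0, shallow in w1
        for i in range(2):
            v[i] = gamma * v[i] + eta * grad[i]  # accumulate a velocity per parameter
            w[i] -= v[i]
    return w

w_plain = run(gamma=0.0)      # gamma = 0 reduces to plain gradient descent
w_momentum = run(gamma=0.9)   # velocity builds up along the shallow direction
```

After the same number of steps, the shallow coordinate `w[1]` ends up much closer to zero with momentum than without it.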

References

[1] An overview of gradient descent optimization algorithms
[2] On the Momentum Term in Gradient Descent Learning Algorithms
[3] Two problems with backpropagation and other steepest-descent learning procedures for networks.
Parameters: kwargs – Arbitrary keyword arguments.
optimization_name