Optimization Functions
The optimization functions help us to minimize (or maximize) an objective function by improving the quality of the weights and biases.
Featured Optimizers
Optimizer | Description
SGD (**kwargs) | Stochastic Gradient Descent (SGD)
SGDMomentum (**kwargs) | Stochastic Gradient Descent with Momentum (SGDMomentum)
NesterovAcceleratedGradient (**kwargs) | Nesterov Accelerated Gradient (NAG)
RMSprop (**kwargs) | Root Mean Squared Propagation (RMSprop)
Adam (**kwargs) | Adaptive Moment Estimation (Adam)
Adamax (**kwargs) | AdaMax, a variant of Adam based on the infinity norm
AdaGrad (**kwargs) | Adaptive Gradient Algorithm (AdaGrad)
Adadelta (**kwargs) | An Adaptive Learning Rate Method (Adadelta)
Optimization Descriptions
class ztlearn.optimizers.AdaGrad(**kwargs)
Bases: ztlearn.optimizers.Optimizer

Adaptive Gradient Algorithm (AdaGrad)

AdaGrad is an optimization method that allows different step sizes for different features, increasing the influence of rare but informative features.
References
- [1] An overview of gradient descent optimization algorithms
- [Sebastian Ruder, 2016] https://arxiv.org/abs/1609.04747
- [PDF] https://arxiv.org/pdf/1609.04747.pdf
- [2] Adaptive Subgradient Methods for Online Learning and Stochastic Optimization
- [John Duchi et al., 2011] http://jmlr.org/papers/v12/duchi11a.html
- [PDF] http://www.jmlr.org/papers/volume12/duchi11a/duchi11a.pdf
Parameters: **kwargs – Arbitrary keyword arguments.

optimization_name
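The AdaGrad rule described above can be sketched in plain NumPy. This is an illustrative sketch of the standard algorithm, not the ztlearn implementation; the names theta, grad, cache, eta and epsilon are assumed for the example:

    import numpy as np

    def adagrad_update(theta, grad, cache, eta=0.01, epsilon=1e-8):
        # Accumulate the squared gradients seen so far, then scale the step
        # per parameter: rarely-updated parameters keep a larger step size.
        cache = cache + grad ** 2
        theta = theta - eta * grad / (np.sqrt(cache) + epsilon)
        return theta, cache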
class ztlearn.optimizers.Adadelta(**kwargs)
Bases: ztlearn.optimizers.Optimizer

An Adaptive Learning Rate Method (Adadelta)

Adadelta is an extension of AdaGrad that seeks to avoid setting the learning rate to an aggressively, monotonically decreasing value. This is achieved via a dynamic learning rate, i.e. a different learning rate is computed for each parameter dimension at every step.
References
- [1] An overview of gradient descent optimization algorithms
- [Sebastian Ruder, 2016] https://arxiv.org/abs/1609.04747
- [PDF] https://arxiv.org/pdf/1609.04747.pdf
- [2] ADADELTA: An Adaptive Learning Rate Method
- [Matthew D. Zeiler, 2012] https://arxiv.org/abs/1212.5701
- [PDF] https://arxiv.org/pdf/1212.5701.pdf
Parameters: **kwargs – Arbitrary keyword arguments.

optimization_name
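A minimal NumPy sketch of the standard Adadelta rule (not the ztlearn implementation; eg2 and edx2 are assumed names for the running averages of squared gradients and squared updates):

    import numpy as np

    def adadelta_update(theta, grad, eg2, edx2, rho=0.95, epsilon=1e-6):
        # Running average of squared gradients.
        eg2 = rho * eg2 + (1 - rho) * grad ** 2
        # The step size comes from the ratio of past-update RMS to gradient RMS,
        # so no global learning rate has to be hand-tuned.
        dx = -(np.sqrt(edx2 + epsilon) / np.sqrt(eg2 + epsilon)) * grad
        # Running average of squared updates.
        edx2 = rho * edx2 + (1 - rho) * dx ** 2
        return theta + dx, eg2, edx2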
class ztlearn.optimizers.Adam(**kwargs)
Bases: ztlearn.optimizers.Optimizer

Adaptive Moment Estimation (Adam)

Adam computes adaptive learning rates for each parameter by storing an exponentially decaying average of past squared gradients. Adam also keeps an exponentially decaying average of past gradients.
References
- [1] An overview of gradient descent optimization algorithms
- [Sebastian Ruder, 2016] https://arxiv.org/abs/1609.04747
- [PDF] https://arxiv.org/pdf/1609.04747.pdf
- [2] Adam: A Method for Stochastic Optimization
- [Diederik P. Kingma et al., 2014] https://arxiv.org/abs/1412.6980
- [PDF] https://arxiv.org/pdf/1412.6980.pdf
Parameters: **kwargs – Arbitrary keyword arguments.

optimization_name
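The two decaying averages can be sketched in NumPy as follows (an illustrative sketch of the published Adam rule, not the ztlearn implementation; m, v and the step counter t are assumed names):

    import numpy as np

    def adam_update(theta, grad, m, v, t, eta=0.001,
                    beta1=0.9, beta2=0.999, epsilon=1e-8):
        # Decaying average of past gradients (first moment).
        m = beta1 * m + (1 - beta1) * grad
        # Decaying average of past squared gradients (second moment).
        v = beta2 * v + (1 - beta2) * grad ** 2
        # Bias-correct both moments (t starts at 1).
        m_hat = m / (1 - beta1 ** t)
        v_hat = v / (1 - beta2 ** t)
        theta = theta - eta * m_hat / (np.sqrt(v_hat) + epsilon)
        return theta, m, v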
class ztlearn.optimizers.Adamax(**kwargs)
Bases: ztlearn.optimizers.Optimizer

AdaMax

AdaMax is a variant of Adam based on the infinity norm. The Adam update rule scales the gradients of individual weights inversely proportional to a (scaled) L2 norm of their current and past gradients. AdaMax generalizes this L2-norm-based update rule to an Lp-norm-based update rule. Such variants are numerically unstable for large p, but as p tends to infinity a simple and stable algorithm emerges.
References
- [1] An overview of gradient descent optimization algorithms
- [Sebastian Ruder, 2016] https://arxiv.org/abs/1609.04747
- [PDF] https://arxiv.org/pdf/1609.04747.pdf
- [2] Adam: A Method for Stochastic Optimization
- [Diederik P. Kingma et al., 2014] https://arxiv.org/abs/1412.6980
- [PDF] https://arxiv.org/pdf/1412.6980.pdf
Parameters: **kwargs – Arbitrary keyword arguments.

optimization_name
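As p tends to infinity the Lp accumulator reduces to a running infinity norm, which can be sketched as follows (an illustrative sketch of the published AdaMax rule, not the ztlearn implementation; u is an assumed name for the infinity-norm accumulator):

    import numpy as np

    def adamax_update(theta, grad, m, u, t, eta=0.002,
                      beta1=0.9, beta2=0.999, epsilon=1e-8):
        # Decaying average of past gradients (as in Adam).
        m = beta1 * m + (1 - beta1) * grad
        # Infinity-norm accumulator replaces Adam's squared-gradient average.
        u = np.maximum(beta2 * u, np.abs(grad))
        # Bias-correct the first moment only (t starts at 1).
        theta = theta - (eta / (1 - beta1 ** t)) * m / (u + epsilon)
        return theta, m, u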
class ztlearn.optimizers.GD
Bases: object

Gradient Descent (GD)

GD optimizes the parameters theta of an objective function J(theta) by performing a single update computed over all of the training samples in the dataset. The update is performed in the opposite direction of the gradient of the objective function, d/d_theta J(theta), with respect to the parameters theta. The learning rate eta helps determine the size of the steps we take towards the minimum.
References
- [1] An overview of gradient descent optimization algorithms
- [Sebastian Ruder, 2016] https://arxiv.org/abs/1609.04747
- [PDF] https://arxiv.org/pdf/1609.04747.pdf
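A minimal sketch of this full-batch update (not the ztlearn implementation; grad is assumed to be d/d_theta J(theta) computed over the whole dataset):

    def gd_update(theta, grad, eta=0.01):
        # One full-batch step in the direction opposite to the gradient.
        return theta - eta * grad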
class ztlearn.optimizers.NesterovAcceleratedGradient(**kwargs)
Bases: ztlearn.optimizers.Optimizer

Nesterov Accelerated Gradient (NAG)

NAG is an improvement on SGDMomentum in which the previous parameter values are smoothed and a gradient descent step is taken from this smoothed value. This enables a more intelligent way of arriving at the minimum.
References
- [1] An overview of gradient descent optimization algorithms
- [Sebastian Ruder, 2016] https://arxiv.org/abs/1609.04747
- [PDF] https://arxiv.org/pdf/1609.04747.pdf
- [2] A method for unconstrained convex minimization problem with the rate of convergence O(1/k^2)
- [Nesterov, Y. 1983][PDF] https://goo.gl/X8313t
- [3] Nesterov’s Accelerated Gradient and Momentum as approximations to Regularised Update Descent
- [Aleksandar Botev, 2016] https://arxiv.org/abs/1607.01981
- [PDF] https://arxiv.org/pdf/1607.01981.pdf
Parameters: **kwargs – Arbitrary keyword arguments.

optimization_name
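The look-ahead step can be sketched as follows (an illustrative sketch, not the ztlearn implementation; grad_fn is an assumed callable returning d/d_theta J(theta), and velocity holds the accumulated update):

    def nag_update(theta, grad_fn, velocity, eta=0.01, gamma=0.9):
        # Evaluate the gradient at the smoothed (look-ahead) position
        # theta - gamma * velocity instead of at theta itself.
        lookahead_grad = grad_fn(theta - gamma * velocity)
        velocity = gamma * velocity + eta * lookahead_grad
        return theta - velocity, velocity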
class ztlearn.optimizers.RMSprop(**kwargs)
Bases: ztlearn.optimizers.Optimizer

Root Mean Squared Propagation (RMSprop)

RMSprop utilizes the magnitude of recent gradients to normalize the current gradient. A moving average of the squared gradients is kept, and the current gradient is divided by the root of this average (the RMS). Recommended parameter settings are rho = 0.9 and eta (learning rate) = 0.001.
References
- [1] An overview of gradient descent optimization algorithms
- [Sebastian Ruder, 2016] https://arxiv.org/abs/1609.04747
- [PDF] https://arxiv.org/pdf/1609.04747.pdf
- [2] Lecture 6.5 - rmsprop, COURSERA: Neural Networks for Machine Learning
- [Tieleman, T. and Hinton, G. 2012][PDF] https://goo.gl/Dhkvpk
Parameters: **kwargs – Arbitrary keyword arguments.

optimization_name
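The normalization can be sketched as follows (an illustrative sketch of the standard RMSprop rule, not the ztlearn implementation; cache is an assumed name for the moving average of squared gradients):

    import numpy as np

    def rmsprop_update(theta, grad, cache, eta=0.001, rho=0.9, epsilon=1e-8):
        # Moving average of recent squared gradients.
        cache = rho * cache + (1 - rho) * grad ** 2
        # Divide the current gradient by the root of that average (the RMS).
        theta = theta - eta * grad / (np.sqrt(cache) + epsilon)
        return theta, cache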
class ztlearn.optimizers.SGD(**kwargs)
Bases: ztlearn.optimizers.Optimizer

Stochastic Gradient Descent (SGD)

SGD optimizes the parameters theta of an objective function J(theta) by performing an update for each training sample inputs(i) and targets(i) in the dataset. The update is performed in the opposite direction of the gradient of the objective function, d/d_theta J(theta), with respect to the parameters theta. The learning rate eta helps determine the size of the steps we take towards the minimum.
References
- [1] An overview of gradient descent optimization algorithms
- [Sebastian Ruder, 2016] https://arxiv.org/abs/1609.04747
- [PDF] https://arxiv.org/pdf/1609.04747.pdf
- [2] Large-Scale Machine Learning with Stochastic Gradient Descent
- [Léon Bottou, 2011][PDF] http://leon.bottou.org/publications/pdf/compstat-2010.pdf
Parameters: **kwargs – Arbitrary keyword arguments.

optimization_name
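One epoch of the per-sample update can be sketched as follows (an illustrative sketch, not the ztlearn implementation; grad_fn is an assumed callable returning d/d_theta J(theta; inputs(i), targets(i))):

    import numpy as np

    def sgd_epoch(theta, inputs, targets, grad_fn, eta=0.01):
        # Visit the training samples in random order and update theta once
        # per sample, in the opposite direction of that sample's gradient.
        for i in np.random.permutation(len(inputs)):
            theta = theta - eta * grad_fn(theta, inputs[i], targets[i])
        return theta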
class ztlearn.optimizers.SGDMomentum(**kwargs)
Bases: ztlearn.optimizers.Optimizer

Stochastic Gradient Descent with Momentum (SGDMomentum)

The objective function often forms regions on the contour map where the surface curves far more steeply in one direction than in another (ravines). Standard SGD tends to oscillate across the narrow ravine, since the negative gradient points down one of the steep sides rather than along the ravine towards the optimum. Momentum helps push the objective more quickly along the shallow ravine towards the global minimum.
References
- [1] An overview of gradient descent optimization algorithms
- [Sebastian Ruder, 2016] https://arxiv.org/abs/1609.04747
- [PDF] https://arxiv.org/pdf/1609.04747.pdf
- [2] On the Momentum Term in Gradient Descent Learning Algorithms
- [Ning Qian, 1999] https://goo.gl/7fhr14
- [PDF] https://goo.gl/91HtDt
- [3] Two problems with backpropagation and other steepest-descent learning procedures for networks.
- [Sutton, R. S., 1986][PDF] https://goo.gl/M3VFM1
Parameters: **kwargs – Arbitrary keyword arguments.

optimization_name
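The momentum term can be sketched as follows (an illustrative sketch of the classical momentum rule, not the ztlearn implementation; velocity is an assumed name for the accumulated update and gamma the momentum coefficient):

    def sgd_momentum_update(theta, grad, velocity, eta=0.01, gamma=0.9):
        # Accumulate a velocity vector: gradient components that oscillate
        # across the ravine cancel out, while components pointing along the
        # ravine add up, speeding progress towards the optimum.
        velocity = gamma * velocity + eta * grad
        return theta - velocity, velocity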