devinterp.optim package
Submodules
devinterp.optim.sgld module
- class devinterp.optim.sgld.SGLD(params, lr=0.01, noise_level=1.0, weight_decay=0.0, localization=0.0, temperature: Callable | float = 1.0, bounding_box_size=None, save_noise=False, save_mala_vars=False)[source]
Bases: Optimizer
Implements Stochastic Gradient Langevin Dynamics (SGLD) optimizer.
This optimizer blends Stochastic Gradient Descent (SGD) with Langevin Dynamics by adding Gaussian noise to the gradient updates, so that it samples weights from the posterior distribution rather than optimizing them.
This implementation follows Lau et al. (2023), which modifies Welling and Teh (2011) by omitting the learning rate schedule and adding a localization term that pulls the weights towards their initial values.
The equation for the update is as follows:
\[\Delta w_t = \frac{\epsilon}{2}\left(\frac{\beta n}{m} \sum_{i=1}^m \nabla \log p\left(y_{l_i} \mid x_{l_i}, w_t\right)+\gamma\left(w_0-w_t\right) - \lambda w_t\right) + N(0, \epsilon\sigma^2)\]
where \(w_t\) is the weight at time \(t\), \(\epsilon\) is the learning rate, \(\beta n\) is the inverse temperature (we’re in the tempered Bayes paradigm), \(n\) is the number of training samples, \(m\) is the batch size, \(\gamma\) is the localization strength, \(\lambda\) is the weight decay strength, and \(\sigma\) is the noise level.
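For concreteness, here is a minimal sketch of a single update implementing the equation above in plain PyTorch. This illustrates the math only, not the class's internal implementation; the names sgld_step, sum_grad_log_prob, and inverse_temp are hypothetical.

    import torch

    def sgld_step(w, sum_grad_log_prob, w_init, lr=0.01, inverse_temp=1.0,
                  batch_size=32, localization=0.0, weight_decay=0.0,
                  noise_level=1.0):
        # sum_grad_log_prob: sum over the minibatch of grad_w log p(y_i | x_i, w);
        # inverse_temp: the (beta * n) factor from the equation above.
        drift = (
            (inverse_temp / batch_size) * sum_grad_log_prob
            + localization * (w_init - w)
            - weight_decay * w
        )
        # N(0, eps * sigma^2): Gaussian noise with standard deviation sigma * sqrt(eps).
        noise = noise_level * (lr ** 0.5) * torch.randn_like(w)
        return w + 0.5 * lr * drift + noise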
Example
>>> optimizer = SGLD(model.parameters(), lr=0.1, temperature=utils.optimal_temperature(dataloader))
>>> optimizer.zero_grad()
>>> loss_fn(model(input), target).backward()
>>> optimizer.step()
Note
- localization is unique to this class and serves to guide the weights towards their original values. This is useful for estimating quantities over the local posterior.
- noise_level is not intended to be changed, except when testing! Doing so will raise a warning.
- Although this class is a subclass of torch.optim.Optimizer, this is a bit of a misnomer in this case. It’s not used for optimizing in LLC estimation, but rather for sampling from the posterior distribution around a point.
- Hyperparameter optimization is more of an art than a science. Check out the calibration notebook for how to go about it in a simple case.
- Parameters:
  - params (Iterable) – Iterable of parameters to optimize or dicts defining parameter groups. Either model.parameters() or something more fancy, just like other torch.optim.Optimizer classes.
  - lr (float, optional) – Learning rate \(\epsilon\). Default is 0.01.
  - noise_level (float, optional) – Amount of Gaussian noise \(\sigma\) introduced into gradient updates. Don’t change this unless you know very well what you’re doing! Default is 1.0.
  - weight_decay (float, optional) – L2 regularization term \(\lambda\), applied as weight decay. Default is 0.
  - localization (float, optional) – Strength of the force \(\gamma\) pulling weights back to their initial values. Default is 0.
  - bounding_box_size (float, optional) – The size of the bounding box enclosing the sampling trajectory. Default is None.
  - temperature (float, optional) – Temperature. Default is 1.0; when called through sample(), it is set to utils.optimal_temperature(dataloader).
  - save_noise (bool, optional) – Whether to store the per-parameter noise during optimization. Default is False.
- Raises:
  - Warning – if noise_level is set to anything other than 1
  - Warning – if temperature is set to 1
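Building on the example above, the following hypothetical sketch runs a short SGLD sampling loop around a trained model and records the sampled losses. The model, dataloader, and loss_fn are placeholders, the hyperparameter values are illustrative only, and the import path for optimal_temperature is assumed from the docstring above.

    from devinterp.optim import SGLD
    from devinterp.utils import optimal_temperature

    def sample_losses(model, dataloader, loss_fn, num_draws=100, lr=1e-4):
        # Sample weights from the local posterior around the current parameters
        # and record the loss at each draw.
        optimizer = SGLD(
            model.parameters(),
            lr=lr,
            localization=100.0,  # pull samples back towards the initial weights
            temperature=optimal_temperature(dataloader),
        )
        losses = []
        data_iter = iter(dataloader)
        for _ in range(num_draws):
            try:
                xs, ys = next(data_iter)
            except StopIteration:
                data_iter = iter(dataloader)  # restart the dataloader if exhausted
                xs, ys = next(data_iter)
            optimizer.zero_grad()
            loss = loss_fn(model(xs), ys)
            loss.backward()
            optimizer.step()
            losses.append(loss.item())
        return losses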
devinterp.optim.sgnht module
- class devinterp.optim.sgnht.SGNHT(params, lr=0.01, diffusion_factor=0.01, bounding_box_size=None, save_noise=False, save_mala_vars=False, temperature=1.0)[source]
Bases: Optimizer
Implements the Stochastic Gradient Nosé-Hoover Thermostat (SGNHT) optimizer. This optimizer blends SGD with an adaptive thermostat variable that controls the magnitude of the injected noise, maintaining the kinetic energy of the system.
It follows Ding et al.’s (2014) implementation.
The equations for the update are as follows:
\[\Delta w_t = \epsilon\left(\frac{\beta n}{m} \sum_{i=1}^m \nabla \log p\left(y_{l_i} \mid x_{l_i}, w_t\right) - \xi_t w_t \right) + \sqrt{2A} N(0, \epsilon)\]
\[\Delta\xi_{t} = \epsilon \left( \frac{1}{n} \|w_t\|^2 - 1 \right)\]
where \(w_t\) is the weight at time \(t\), \(\epsilon\) is the learning rate, \(\beta n\) is the inverse temperature (we’re in the tempered Bayes paradigm), \(n\) is the number of samples, \(m\) is the batch size, \(\xi_t\) is the thermostat variable at time \(t\), \(A\) is the diffusion factor, and \(N(0, \epsilon)\) represents Gaussian noise with mean 0 and variance \(\epsilon\).
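As with SGLD above, here is a minimal sketch of one update pair implementing these equations in plain PyTorch. It is illustrative only, not the class internals; the names sgnht_step, sum_grad_log_prob, and inverse_temp are hypothetical.

    import torch

    def sgnht_step(w, xi, sum_grad_log_prob, lr=0.01, inverse_temp=1.0,
                   batch_size=32, num_samples=1000, diffusion_factor=0.01):
        # sum_grad_log_prob: sum over the minibatch of grad_w log p(y_i | x_i, w);
        # inverse_temp: the (beta * n) factor; xi: the thermostat variable.
        # Injected noise sqrt(2A) * N(0, eps), i.e. variance 2 * A * eps.
        noise = (2.0 * diffusion_factor) ** 0.5 * (lr ** 0.5) * torch.randn_like(w)
        w_new = w + lr * (
            (inverse_temp / batch_size) * sum_grad_log_prob - xi * w
        ) + noise
        # Thermostat update, with num_samples playing the role of n in the
        # equations above.
        xi_new = xi + lr * (w.pow(2).sum() / num_samples - 1.0)
        return w_new, xi_new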
Note
- diffusion_factor is unique to this class and functions as a way to allow for random parameter changes while keeping them from blowing up, by guiding parameters back to a slowly-changing thermostat value using a friction term.
- This class does not have an explicit localization term like SGLD() does. If you want to constrain your sampling, use bounding_box_size.
- Although this class is a subclass of torch.optim.Optimizer, this is a bit of a misnomer in this case. It’s not used for optimizing in LLC estimation, but rather for sampling from the posterior distribution around a point.
- Parameters:
  - params (Iterable) – Iterable of parameters to optimize or dicts defining parameter groups. Either model.parameters() or something more fancy, just like other torch.optim.Optimizer classes.
  - lr (float, optional) – Learning rate \(\epsilon\). Default is 0.01.
  - diffusion_factor (float, optional) – The diffusion factor \(A\) of the thermostat. Default is 0.01.
  - bounding_box_size (float, optional) – The size of the bounding box enclosing the sampling trajectory. Default is None.
  - temperature (float, optional) – Temperature. Default is 1.0; when called through sample(), it is set to utils.optimal_temperature(dataloader).
- Raises:
  - Warning – if temperature is set to 1
  - Warning – if the NoiseNorm callback is used
  - Warning – if the MALA callback is used
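For reference, usage mirrors the SGLD example above. A minimal hypothetical sketch, where model, dataloader, input, target, and loss_fn are placeholders and the hyperparameter values are illustrative:
>>> optimizer = SGNHT(model.parameters(), lr=0.001, diffusion_factor=0.01, temperature=utils.optimal_temperature(dataloader))
>>> optimizer.zero_grad()
>>> loss_fn(model(input), target).backward()
>>> optimizer.step()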