devinterp.optim package
Submodules
devinterp.optim.sgld module
- class devinterp.optim.sgld.SGLD(params, lr=0.01, noise_level=1.0, weight_decay=0.0, localization=0.0, temperature: Callable | float = 1.0, bounding_box_size=None, save_noise=False, save_mala_vars=False)[source]
 Bases: Optimizer
Implements the Stochastic Gradient Langevin Dynamics (SGLD) optimizer.
This optimizer blends Stochastic Gradient Descent (SGD) with Langevin Dynamics, introducing Gaussian noise to the gradient updates. This makes it sample weights from the posterior distribution, instead of optimizing weights.
This implementation follows Lau et al. (2023), a modification of Welling and Teh (2011) that omits the learning rate schedule and introduces a localization term that pulls the weights towards their initial values.
The equation for the update is as follows:
\[\Delta w_t = \frac{\epsilon}{2}\left(\frac{\beta n}{m} \sum_{i=1}^m \nabla \log p\left(y_{l_i} \mid x_{l_i}, w_t\right)+\gamma\left(w_0-w_t\right) - \lambda w_t\right) + N(0, \epsilon\sigma^2)\]where \(w_t\) is the weight at time \(t\), \(\epsilon\) is the learning rate, \((\beta n)\) is the inverse temperature (we’re in the tempered Bayes paradigm), \(n\) is the number of training samples, \(m\) is the batch size, \(\gamma\) is the localization strength, \(\lambda\) is the weight decay strength, and \(\sigma\) is the noise term.
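For intuition, the update above can be transcribed into plain PyTorch roughly as follows. This is an illustrative sketch of a single update for one parameter tensor, not the class's actual internals; names such as sgld_step, grad_log_prob, and w_init are hypothetical, and bounding-box handling is omitted.

import torch

def sgld_step(w, grad_log_prob, w_init, lr, nbeta, localization, weight_decay, noise_level=1.0):
    # grad_log_prob: minibatch estimate (1/m) * sum_i grad log p(y_i | x_i, w)
    # nbeta: the inverse temperature (beta * n) from the equation above
    drift = nbeta * grad_log_prob                 # (beta n / m) * summed score term
    drift = drift + localization * (w_init - w)   # localization pull towards initial weights
    drift = drift - weight_decay * w              # weight decay / L2 term
    noise = noise_level * (lr ** 0.5) * torch.randn_like(w)  # N(0, eps * sigma^2)
    return w + (lr / 2) * drift + noise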
Example
>>> optimizer = SGLD(model.parameters(), lr=0.1, temperature=utils.optimal_temperature(dataloader))
>>> optimizer.zero_grad()
>>> loss_fn(model(input), target).backward()
>>> optimizer.step()
Note
- localization is unique to this class and serves to guide the weights towards their original values. This is useful for estimating quantities over the local posterior.
- noise_level is not intended to be changed, except when testing! Doing so will raise a warning.
- Although this class is a subclass of torch.optim.Optimizer, this is a bit of a misnomer in this case. It's not used for optimizing in LLC estimation, but rather for sampling from the posterior distribution around a point.
- Hyperparameter optimization is more of an art than a science. Check out the calibration notebook for how to go about it in a simple case.
- Parameters:
 params (Iterable) – Iterable of parameters to optimize or dicts defining parameter groups. Either model.parameters() or something more fancy, just like other torch.optim.Optimizer classes.
lr (float, optional) – Learning rate \(\epsilon\). Default is 0.01
noise_level (float, optional) – Amount of Gaussian noise \(\sigma\) introduced into gradient updates. Don’t change this unless you know very well what you’re doing! Default is 1
weight_decay (float, optional) – L2 regularization term \(\lambda\), applied as weight decay. Default is 0
localization (float, optional) – Strength of the force \(\gamma\) pulling weights back to their initial values. Default is 0
bounding_box_size (float, optional) – The size of the bounding box enclosing our trajectory. Default is None
temperature (float, optional) – Temperature (default: 1.0; sample() sets this to utils.optimal_temperature(dataloader), which works out to \(n / \log n\) with \(n\) the number of training samples; see the sketch at the end of this entry)
save_noise (bool, optional) – Whether to store the per-parameter noise during optimization. Default is False
- Raises:
 Warning – if noise_level is set to anything other than 1
Warning – if temperature is set to 1
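As a sketch of the temperature default mentioned under the temperature parameter above, the \(n / \log n\) value can be computed from a DataLoader as below. The helper name n_over_log_n is hypothetical; in practice you would call utils.optimal_temperature(dataloader) as in the example above.

import numpy as np
import torch
from torch.utils.data import DataLoader, TensorDataset

def n_over_log_n(dataloader: DataLoader) -> float:
    # n / log(n), with n the number of training samples behind the dataloader
    n = len(dataloader.dataset)
    return n / np.log(n)

dataset = TensorDataset(torch.randn(1000, 10), torch.randn(1000, 1))
dataloader = DataLoader(dataset, batch_size=32)
temperature = n_over_log_n(dataloader)  # 1000 / log(1000), roughly 144.8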
devinterp.optim.sgnht module
- class devinterp.optim.sgnht.SGNHT(params, lr=0.01, diffusion_factor=0.01, bounding_box_size=None, save_noise=False, save_mala_vars=False, temperature=1.0)[source]
 Bases: Optimizer
Implements the Stochastic Gradient Nosé-Hoover Thermostat (SGNHT) optimizer. This optimizer blends SGD with an adaptive thermostat variable that controls the magnitude of the injected noise, maintaining the kinetic energy of the system.
It follows Ding et al.’s (2014) implementation.
The equations for the update are as follows:
\[\Delta w_t = \epsilon\left(\frac{\beta n}{m} \sum_{i=1}^m \nabla \log p\left(y_{l_i} \mid x_{l_i}, w_t\right) - \xi_t w_t \right) + \sqrt{2A}\, N(0, \epsilon)\]
\[\Delta\xi_{t} = \epsilon \left( \frac{1}{n} \|w_t\|^2 - 1 \right)\]
where \(w_t\) is the weight at time \(t\), \(\epsilon\) is the learning rate, \((\beta n)\) is the inverse temperature (we're in the tempered Bayes paradigm), \(n\) is the number of training samples, \(m\) is the batch size, \(\xi_t\) is the thermostat variable at time \(t\), \(A\) is the diffusion factor, and \(N(0, \epsilon)\) represents Gaussian noise with mean 0 and variance \(\epsilon\).
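For intuition, the coupled updates above can be sketched as follows. This is a literal, simplified transcription of the two equations for a single parameter tensor, not the class's actual internals; the name sgnht_step is hypothetical, and bounding-box handling is omitted.

import torch

def sgnht_step(w, xi, grad_log_prob, lr, nbeta, diffusion_factor, num_samples):
    # grad_log_prob: minibatch estimate (1/m) * sum_i grad log p(y_i | x_i, w)
    # xi: scalar thermostat variable; nbeta: the inverse temperature (beta * n)
    drift = nbeta * grad_log_prob - xi * w                                      # (beta n / m) * score - xi * w_t
    noise = (2 * diffusion_factor) ** 0.5 * (lr ** 0.5) * torch.randn_like(w)   # sqrt(2A) * N(0, eps)
    w = w + lr * drift + noise
    xi = xi + lr * (w.pow(2).sum() / num_samples - 1.0)                         # thermostat feedback on ||w||^2
    return w, xi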
Note
- diffusion_factor is unique to this class, and functions as a way to allow for random parameter changes while keeping them from blowing up by guiding parameters back to a slowly-changing thermostat value using a friction term.
- This class does not have an explicit localization term like SGLD() does. If you want to constrain your sampling, use bounding_box_size.
- Although this class is a subclass of torch.optim.Optimizer, this is a bit of a misnomer in this case. It's not used for optimizing in LLC estimation, but rather for sampling from the posterior distribution around a point.
- Parameters:
 params (Iterable) – Iterable of parameters to optimize or dicts defining parameter groups. Either model.parameters() or something more fancy, just like other torch.optim.Optimizer classes.
lr (float, optional) – Learning rate \(\epsilon\). Default is 0.01
diffusion_factor (float, optional) – The diffusion factor \(A\) of the thermostat. Default is 0.01
bounding_box_size (float, optional) – The size of the bounding box enclosing our trajectory. Default is None
temperature (float, optional) – Temperature (default: 1.0; sample() sets this to utils.optimal_temperature(dataloader), which works out to \(n / \log n\) with \(n\) the number of training samples)
- Raises:
 Warning – if temperature is set to 1
Warning – if NoiseNorm callback is used
Warning – if MALA callback is used
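For reference, a usage sketch analogous to the SGLD example above, assuming the same model, loss_fn, input, target, and utils.optimal_temperature are in scope:

>>> optimizer = SGNHT(model.parameters(), lr=0.01, diffusion_factor=0.01, temperature=utils.optimal_temperature(dataloader))
>>> optimizer.zero_grad()
>>> loss_fn(model(input), target).backward()
>>> optimizer.step()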