devinterp.optim package
Submodules
devinterp.optim.sgld module
- class devinterp.optim.sgld.SGLD(params, lr=0.01, noise_level=1.0, weight_decay=0.0, localization=0.0, temperature: Callable | float = 1.0, bounding_box_size=None, save_noise=False, save_mala_vars=False)[source]
Bases: Optimizer
Implements Stochastic Gradient Langevin Dynamics (SGLD) optimizer.
This optimizer blends Stochastic Gradient Descent (SGD) with Langevin Dynamics by adding Gaussian noise to the gradient updates, so that it samples weights from the posterior distribution rather than optimizing them.
This implementation follows Lau et al. (2023), which modifies Welling and Teh (2011) by omitting the learning rate schedule and adding a localization term that pulls the weights towards their initial values.
The equation for the update is as follows:
\[\Delta w_t = \frac{\epsilon}{2}\left(\frac{\beta n}{m} \sum_{i=1}^m \nabla \log p\left(y_{l_i} \mid x_{l_i}, w_t\right)+\gamma\left(w_0-w_t\right) - \lambda w_t\right) + N(0, \epsilon\sigma^2)\]
where \(w_t\) is the weight at time \(t\), \(\epsilon\) is the learning rate, \(\beta n\) is the inverse temperature (we’re in the tempered Bayes paradigm), \(n\) is the number of training samples, \(m\) is the batch size, \(\gamma\) is the localization strength, \(\lambda\) is the weight decay strength, and \(\sigma\) is the noise level.
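For concreteness, here is a minimal sketch of a single update implementing the equation above in plain PyTorch. This illustrates the math only, not the class's internal implementation; the names sgld_step, sum_grad_log_prob, and inverse_temp are hypothetical.

    import torch

    def sgld_step(w, sum_grad_log_prob, w_init, lr=0.01, inverse_temp=1.0,
                  batch_size=32, localization=0.0, weight_decay=0.0,
                  noise_level=1.0):
        # sum_grad_log_prob: sum over the minibatch of grad_w log p(y_i | x_i, w);
        # inverse_temp: the (beta * n) factor from the equation above.
        drift = (
            (inverse_temp / batch_size) * sum_grad_log_prob
            + localization * (w_init - w)
            - weight_decay * w
        )
        # N(0, eps * sigma^2): Gaussian noise with standard deviation sigma * sqrt(eps).
        noise = noise_level * (lr ** 0.5) * torch.randn_like(w)
        return w + 0.5 * lr * drift + noise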
Example
>>> optimizer = SGLD(model.parameters(), lr=0.1, temperature=utils.optimal_temperature(dataloader))
>>> optimizer.zero_grad()
>>> loss_fn(model(input), target).backward()
>>> optimizer.step()
Note
- localization is unique to this class and serves to guide the weights towards their original values. This is useful for estimating quantities over the local posterior.
- noise_level is not intended to be changed, except when testing! Doing so will raise a warning.
- Although this class is a subclass of torch.optim.Optimizer, this is a bit of a misnomer in this case. It’s not used for optimizing in LLC estimation, but rather for sampling from the posterior distribution around a point.
- Hyperparameter optimization is more of an art than a science. Check out the calibration notebook for how to go about it in a simple case.
- Parameters:
  - params (Iterable) – Iterable of parameters to optimize or dicts defining parameter groups. Either model.parameters() or something more fancy, just like other torch.optim.Optimizer classes.
  - lr (float, optional) – Learning rate \(\epsilon\). Default is 0.01.
  - noise_level (float, optional) – Amount of Gaussian noise \(\sigma\) introduced into gradient updates. Don’t change this unless you know very well what you’re doing! Default is 1.0.
  - weight_decay (float, optional) – L2 regularization term \(\lambda\), applied as weight decay. Default is 0.
  - localization (float, optional) – Strength of the force \(\gamma\) pulling weights back to their initial values. Default is 0.
  - bounding_box_size (float, optional) – The size of the bounding box enclosing the sampling trajectory. Default is None.
  - temperature (float, optional) – Temperature. Default is 1.0; when called through sample(), it is set to utils.optimal_temperature(dataloader).
  - save_noise (bool, optional) – Whether to store the per-parameter noise during optimization. Default is False.
- Raises:
  - Warning – if noise_level is set to anything other than 1
  - Warning – if temperature is set to 1
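Building on the example above, the following hypothetical sketch runs a short SGLD sampling loop around a trained model and records the sampled losses. The model, dataloader, and loss_fn are placeholders, the hyperparameter values are illustrative only, and the import path for optimal_temperature is assumed from the docstring above.

    from devinterp.optim import SGLD
    from devinterp.utils import optimal_temperature

    def sample_losses(model, dataloader, loss_fn, num_draws=100, lr=1e-4):
        # Sample weights from the local posterior around the current parameters
        # and record the loss at each draw.
        optimizer = SGLD(
            model.parameters(),
            lr=lr,
            localization=100.0,  # pull samples back towards the initial weights
            temperature=optimal_temperature(dataloader),
        )
        losses = []
        data_iter = iter(dataloader)
        for _ in range(num_draws):
            try:
                xs, ys = next(data_iter)
            except StopIteration:
                data_iter = iter(dataloader)  # restart the dataloader if exhausted
                xs, ys = next(data_iter)
            optimizer.zero_grad()
            loss = loss_fn(model(xs), ys)
            loss.backward()
            optimizer.step()
            losses.append(loss.item())
        return losses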
devinterp.optim.sgnht module
- class devinterp.optim.sgnht.SGNHT(params, lr=0.01, diffusion_factor=0.01, bounding_box_size=None, save_noise=False, save_mala_vars=False, temperature=1.0)[source]
Bases: Optimizer
Implements the Stochastic Gradient Nosé-Hoover Thermostat (SGNHT) optimizer. This optimizer blends SGD with an adaptive thermostat variable that controls the magnitude of the injected noise, maintaining the kinetic energy of the system.
It follows Ding et al.’s (2014) implementation.
The equations for the update are as follows:
\[\Delta w_t = \epsilon\left(\frac{\beta n}{m} \sum_{i=1}^m \nabla \log p\left(y_{l_i} \mid x_{l_i}, w_t\right) - \xi_t w_t \right) + \sqrt{2A} N(0, \epsilon)\]
\[\Delta\xi_{t} = \epsilon \left( \frac{1}{n} \|w_t\|^2 - 1 \right)\]
where \(w_t\) is the weight at time \(t\), \(\epsilon\) is the learning rate, \(\beta n\) is the inverse temperature (we’re in the tempered Bayes paradigm), \(n\) is the number of samples, \(m\) is the batch size, \(\xi_t\) is the thermostat variable at time \(t\), \(A\) is the diffusion factor, and \(N(0, \epsilon)\) represents Gaussian noise with mean 0 and variance \(\epsilon\).
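As with SGLD above, here is a minimal sketch of one update pair implementing these equations in plain PyTorch. It is illustrative only, not the class internals; the names sgnht_step, sum_grad_log_prob, and inverse_temp are hypothetical.

    import torch

    def sgnht_step(w, xi, sum_grad_log_prob, lr=0.01, inverse_temp=1.0,
                   batch_size=32, num_samples=1000, diffusion_factor=0.01):
        # sum_grad_log_prob: sum over the minibatch of grad_w log p(y_i | x_i, w);
        # inverse_temp: the (beta * n) factor; xi: the thermostat variable.
        # Injected noise sqrt(2A) * N(0, eps), i.e. variance 2 * A * eps.
        noise = (2.0 * diffusion_factor) ** 0.5 * (lr ** 0.5) * torch.randn_like(w)
        w_new = w + lr * (
            (inverse_temp / batch_size) * sum_grad_log_prob - xi * w
        ) + noise
        # Thermostat update, with num_samples playing the role of n in the
        # equations above.
        xi_new = xi + lr * (w.pow(2).sum() / num_samples - 1.0)
        return w_new, xi_new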
Note
- diffusion_factor is unique to this class and functions as a way to allow for random parameter changes while keeping them from blowing up, by guiding parameters back to a slowly-changing thermostat value using a friction term.
- This class does not have an explicit localization term like SGLD() does. If you want to constrain your sampling, use bounding_box_size.
- Although this class is a subclass of torch.optim.Optimizer, this is a bit of a misnomer in this case. It’s not used for optimizing in LLC estimation, but rather for sampling from the posterior distribution around a point.
- Parameters:
  - params (Iterable) – Iterable of parameters to optimize or dicts defining parameter groups. Either model.parameters() or something more fancy, just like other torch.optim.Optimizer classes.
  - lr (float, optional) – Learning rate \(\epsilon\). Default is 0.01.
  - diffusion_factor (float, optional) – The diffusion factor \(A\) of the thermostat. Default is 0.01.
  - bounding_box_size (float, optional) – The size of the bounding box enclosing the sampling trajectory. Default is None.
  - temperature (float, optional) – Temperature. Default is 1.0; when called through sample(), it is set to utils.optimal_temperature(dataloader).
- Raises:
  - Warning – if temperature is set to 1
  - Warning – if the NoiseNorm callback is used
  - Warning – if the MALA callback is used
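For reference, usage mirrors the SGLD example above. A minimal hypothetical sketch, where model, dataloader, input, target, and loss_fn are placeholders and the hyperparameter values are illustrative:
>>> optimizer = SGNHT(model.parameters(), lr=0.001, diffusion_factor=0.01, temperature=utils.optimal_temperature(dataloader))
>>> optimizer.zero_grad()
>>> loss_fn(model(input), target).backward()
>>> optimizer.step()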