Bitsandbytes documentation
Adam
Adam (Adaptive Moment Estimation) is an adaptive learning rate optimizer that combines ideas from SGD with momentum and RMSprop to automatically scale the learning rate:
- a weighted average of the past gradients to provide direction (first moment)
- a weighted average of the past squared gradients to adapt the learning rate to each parameter (second moment)
bitsandbytes also supports paged optimizers, which take advantage of CUDA's unified memory to transfer optimizer state from GPU to CPU memory when GPU memory is exhausted.
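The two moment estimates above can be sketched in plain Python. This is a minimal, framework-free illustration of the Adam update rule for a single scalar parameter, not the bitsandbytes implementation:

```python
import math

def adam_step(p, g, m, v, t, lr=0.001, betas=(0.9, 0.999), eps=1e-8):
    """One Adam update for a scalar parameter p with gradient g.

    m: first moment (weighted average of past gradients)
    v: second moment (weighted average of past squared gradients)
    t: 1-based step count, used for bias correction
    """
    beta1, beta2 = betas
    m = beta1 * m + (1 - beta1) * g       # first moment: direction
    v = beta2 * v + (1 - beta2) * g * g   # second moment: per-parameter scale
    m_hat = m / (1 - beta1 ** t)          # bias correction for early steps
    v_hat = v / (1 - beta2 ** t)
    p = p - lr * m_hat / (math.sqrt(v_hat) + eps)
    return p, m, v
```

At step t = 1 with gradient 1.0, the bias-corrected update is lr / (1 + eps), i.e. almost exactly the learning rate.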
Adam
class bitsandbytes.optim.Adam
( params, lr = 0.001, betas = (0.9, 0.999), eps = 1e-08, weight_decay = 0, amsgrad = False, optim_bits = 32, args = None, min_8bit_size = 4096, percentile_clipping = 100, block_wise = True, is_paged = False )
__init__
( params, lr = 0.001, betas = (0.9, 0.999), eps = 1e-08, weight_decay = 0, amsgrad = False, optim_bits = 32, args = None, min_8bit_size = 4096, percentile_clipping = 100, block_wise = True, is_paged = False )
Parameters
- params (`torch.tensor`) – The input parameters to optimize.
- lr (`float`, defaults to 1e-3) – The learning rate.
- betas (`tuple(float, float)`, defaults to (0.9, 0.999)) – The beta values are the decay rates of the first- and second-order moments of the optimizer.
- eps (`float`, defaults to 1e-8) – The epsilon value prevents division by zero in the optimizer.
- weight_decay (`float`, defaults to 0.0) – The weight decay value for the optimizer.
- amsgrad (`bool`, defaults to False) – Whether to use the AMSGrad variant of Adam, which uses the maximum of past squared gradients instead.
- optim_bits (`int`, defaults to 32) – The number of bits of the optimizer state.
- args (`object`, defaults to None) – An object with additional arguments.
- min_8bit_size (`int`, defaults to 4096) – The minimum number of elements a parameter tensor must have for 8-bit optimization.
- percentile_clipping (`int`, defaults to 100) – Adapts the clipping threshold automatically by tracking the last 100 gradient norms and clipping the gradient at a given percentile to improve stability.
- block_wise (`bool`, defaults to True) – Whether to independently quantize each block of tensors to reduce outlier effects and improve stability.
- is_paged (`bool`, defaults to False) – Whether the optimizer is a paged optimizer or not.
Base Adam optimizer.
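The percentile_clipping behavior described above can be illustrated with a small sketch. The helper name and the rolling-window bookkeeping below are hypothetical (the real implementation tracks gradient norms on the GPU); it only shows the idea of clipping against a percentile of recent gradient norms:

```python
from collections import deque

def clip_scale(grad_norm, history, percentile=100, window=100):
    """Track recent gradient norms and return the factor by which the
    current gradient should be scaled down (1.0 means no clipping)."""
    history.append(grad_norm)
    while len(history) > window:      # keep only the last `window` norms
        history.popleft()
    ranked = sorted(history)
    idx = min(len(ranked) - 1, int(len(ranked) * percentile / 100))
    threshold = ranked[idx]           # norm at the requested percentile
    return 1.0 if grad_norm <= threshold else threshold / grad_norm
```

With the default percentile of 100 the threshold is the maximum tracked norm, so no gradient is ever scaled down; lower percentiles clip unusually large gradients toward the recent typical range.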
Adam8bit
class bitsandbytes.optim.Adam8bit
( params, lr = 0.001, betas = (0.9, 0.999), eps = 1e-08, weight_decay = 0, amsgrad = False, optim_bits = 32, args = None, min_8bit_size = 4096, percentile_clipping = 100, block_wise = True, is_paged = False )
__init__
( params, lr = 0.001, betas = (0.9, 0.999), eps = 1e-08, weight_decay = 0, amsgrad = False, optim_bits = 32, args = None, min_8bit_size = 4096, percentile_clipping = 100, block_wise = True, is_paged = False )
Parameters
- params (`torch.tensor`) – The input parameters to optimize.
- lr (`float`, defaults to 1e-3) – The learning rate.
- betas (`tuple(float, float)`, defaults to (0.9, 0.999)) – The beta values are the decay rates of the first- and second-order moments of the optimizer.
- eps (`float`, defaults to 1e-8) – The epsilon value prevents division by zero in the optimizer.
- weight_decay (`float`, defaults to 0.0) – The weight decay value for the optimizer.
- amsgrad (`bool`, defaults to False) – Whether to use the AMSGrad variant of Adam, which uses the maximum of past squared gradients instead.
- optim_bits (`int`, defaults to 32) – The number of bits of the optimizer state.
- args (`object`, defaults to None) – An object with additional arguments.
- min_8bit_size (`int`, defaults to 4096) – The minimum number of elements a parameter tensor must have for 8-bit optimization.
- percentile_clipping (`int`, defaults to 100) – Adapts the clipping threshold automatically by tracking the last 100 gradient norms and clipping the gradient at a given percentile to improve stability.
- block_wise (`bool`, defaults to True) – Whether to independently quantize each block of tensors to reduce outlier effects and improve stability.
- is_paged (`bool`, defaults to False) – Whether the optimizer is a paged optimizer or not.
8-bit Adam optimizer.
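The block_wise idea behind the 8-bit state can be sketched as follows: each block is scaled by its own absolute maximum, so a single outlier only degrades the precision of its own block. The function names and the linear 8-bit code below are illustrative only; bitsandbytes actually uses a non-linear dynamic quantization scheme:

```python
def blockwise_quantize(values, block_size=4):
    """Quantize a flat list to int8 codes per block, using each block's
    absolute maximum as the scale. Returns (codes, per-block scales)."""
    codes, scales = [], []
    for start in range(0, len(values), block_size):
        block = values[start:start + block_size]
        absmax = max(abs(x) for x in block) or 1.0  # avoid div-by-zero
        scales.append(absmax)
        codes.extend(round(x / absmax * 127) for x in block)
    return codes, scales

def blockwise_dequantize(codes, scales, block_size=4):
    """Recover approximate float values from int8 codes and scales."""
    return [c / 127 * scales[i // block_size] for i, c in enumerate(codes)]
```

In a sample input like [0.1, -0.2, 0.05, 0.15, 100.0, 1.0, -1.0, 0.5] the 100.0 outlier dominates only the second block's scale; with a single global scale it would crush the precision of every value, which is why block-wise quantization improves stability.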
Adam32bit
class bitsandbytes.optim.Adam32bit
( params, lr = 0.001, betas = (0.9, 0.999), eps = 1e-08, weight_decay = 0, amsgrad = False, optim_bits = 32, args = None, min_8bit_size = 4096, percentile_clipping = 100, block_wise = True, is_paged = False )
__init__
( params, lr = 0.001, betas = (0.9, 0.999), eps = 1e-08, weight_decay = 0, amsgrad = False, optim_bits = 32, args = None, min_8bit_size = 4096, percentile_clipping = 100, block_wise = True, is_paged = False )
Parameters
- params (`torch.tensor`) – The input parameters to optimize.
- lr (`float`, defaults to 1e-3) – The learning rate.
- betas (`tuple(float, float)`, defaults to (0.9, 0.999)) – The beta values are the decay rates of the first- and second-order moments of the optimizer.
- eps (`float`, defaults to 1e-8) – The epsilon value prevents division by zero in the optimizer.
- weight_decay (`float`, defaults to 0.0) – The weight decay value for the optimizer.
- amsgrad (`bool`, defaults to False) – Whether to use the AMSGrad variant of Adam, which uses the maximum of past squared gradients instead.
- optim_bits (`int`, defaults to 32) – The number of bits of the optimizer state.
- args (`object`, defaults to None) – An object with additional arguments.
- min_8bit_size (`int`, defaults to 4096) – The minimum number of elements a parameter tensor must have for 8-bit optimization.
- percentile_clipping (`int`, defaults to 100) – Adapts the clipping threshold automatically by tracking the last 100 gradient norms and clipping the gradient at a given percentile to improve stability.
- block_wise (`bool`, defaults to True) – Whether to independently quantize each block of tensors to reduce outlier effects and improve stability.
- is_paged (`bool`, defaults to False) – Whether the optimizer is a paged optimizer or not.
32-bit Adam optimizer.
PagedAdam
class bitsandbytes.optim.PagedAdam
( params, lr = 0.001, betas = (0.9, 0.999), eps = 1e-08, weight_decay = 0, amsgrad = False, optim_bits = 32, args = None, min_8bit_size = 4096, percentile_clipping = 100, block_wise = True, is_paged = False )
__init__
( params, lr = 0.001, betas = (0.9, 0.999), eps = 1e-08, weight_decay = 0, amsgrad = False, optim_bits = 32, args = None, min_8bit_size = 4096, percentile_clipping = 100, block_wise = True, is_paged = False )
Parameters
- params (`torch.tensor`) – The input parameters to optimize.
- lr (`float`, defaults to 1e-3) – The learning rate.
- betas (`tuple(float, float)`, defaults to (0.9, 0.999)) – The beta values are the decay rates of the first- and second-order moments of the optimizer.
- eps (`float`, defaults to 1e-8) – The epsilon value prevents division by zero in the optimizer.
- weight_decay (`float`, defaults to 0.0) – The weight decay value for the optimizer.
- amsgrad (`bool`, defaults to False) – Whether to use the AMSGrad variant of Adam, which uses the maximum of past squared gradients instead.
- optim_bits (`int`, defaults to 32) – The number of bits of the optimizer state.
- args (`object`, defaults to None) – An object with additional arguments.
- min_8bit_size (`int`, defaults to 4096) – The minimum number of elements a parameter tensor must have for 8-bit optimization.
- percentile_clipping (`int`, defaults to 100) – Adapts the clipping threshold automatically by tracking the last 100 gradient norms and clipping the gradient at a given percentile to improve stability.
- block_wise (`bool`, defaults to True) – Whether to independently quantize each block of tensors to reduce outlier effects and improve stability.
- is_paged (`bool`, defaults to False) – Whether the optimizer is a paged optimizer or not.
Paged Adam optimizer.
PagedAdam8bit
class bitsandbytes.optim.PagedAdam8bit
( params, lr = 0.001, betas = (0.9, 0.999), eps = 1e-08, weight_decay = 0, amsgrad = False, optim_bits = 32, args = None, min_8bit_size = 4096, percentile_clipping = 100, block_wise = True, is_paged = False )
__init__
( params, lr = 0.001, betas = (0.9, 0.999), eps = 1e-08, weight_decay = 0, amsgrad = False, optim_bits = 32, args = None, min_8bit_size = 4096, percentile_clipping = 100, block_wise = True, is_paged = False )
Parameters
- params (`torch.tensor`) – The input parameters to optimize.
- lr (`float`, defaults to 1e-3) – The learning rate.
- betas (`tuple(float, float)`, defaults to (0.9, 0.999)) – The beta values are the decay rates of the first- and second-order moments of the optimizer.
- eps (`float`, defaults to 1e-8) – The epsilon value prevents division by zero in the optimizer.
- weight_decay (`float`, defaults to 0.0) – The weight decay value for the optimizer.
- amsgrad (`bool`, defaults to False) – Whether to use the AMSGrad variant of Adam, which uses the maximum of past squared gradients instead.
- optim_bits (`int`, defaults to 32) – The number of bits of the optimizer state.
- args (`object`, defaults to None) – An object with additional arguments.
- min_8bit_size (`int`, defaults to 4096) – The minimum number of elements a parameter tensor must have for 8-bit optimization.
- percentile_clipping (`int`, defaults to 100) – Adapts the clipping threshold automatically by tracking the last 100 gradient norms and clipping the gradient at a given percentile to improve stability.
- block_wise (`bool`, defaults to True) – Whether to independently quantize each block of tensors to reduce outlier effects and improve stability.
- is_paged (`bool`, defaults to False) – Whether the optimizer is a paged optimizer or not.
8-bit paged Adam optimizer.
PagedAdam32bit
class bitsandbytes.optim.PagedAdam32bit
( params, lr = 0.001, betas = (0.9, 0.999), eps = 1e-08, weight_decay = 0, amsgrad = False, optim_bits = 32, args = None, min_8bit_size = 4096, percentile_clipping = 100, block_wise = True, is_paged = False )
__init__
( params, lr = 0.001, betas = (0.9, 0.999), eps = 1e-08, weight_decay = 0, amsgrad = False, optim_bits = 32, args = None, min_8bit_size = 4096, percentile_clipping = 100, block_wise = True, is_paged = False )
Parameters
- params (`torch.tensor`) – The input parameters to optimize.
- lr (`float`, defaults to 1e-3) – The learning rate.
- betas (`tuple(float, float)`, defaults to (0.9, 0.999)) – The beta values are the decay rates of the first- and second-order moments of the optimizer.
- eps (`float`, defaults to 1e-8) – The epsilon value prevents division by zero in the optimizer.
- weight_decay (`float`, defaults to 0.0) – The weight decay value for the optimizer.
- amsgrad (`bool`, defaults to False) – Whether to use the AMSGrad variant of Adam, which uses the maximum of past squared gradients instead.
- optim_bits (`int`, defaults to 32) – The number of bits of the optimizer state.
- args (`object`, defaults to None) – An object with additional arguments.
- min_8bit_size (`int`, defaults to 4096) – The minimum number of elements a parameter tensor must have for 8-bit optimization.
- percentile_clipping (`int`, defaults to 100) – Adapts the clipping threshold automatically by tracking the last 100 gradient norms and clipping the gradient at a given percentile to improve stability.
- block_wise (`bool`, defaults to True) – Whether to independently quantize each block of tensors to reduce outlier effects and improve stability.
- is_paged (`bool`, defaults to False) – Whether the optimizer is a paged optimizer or not.
Paged 32-bit Adam optimizer.