Recently, we open-sourced YaFSDP, Yandex’s tool for efficient distributed training of LLMs.
Here are some of the key ideas YaFSDP uses to provide speedups and memory savings over FSDP:
• Allocate just two buffers for all gathered weights and reuse them across the whole transformer, bypassing the torch memory allocator (a sketch follows this list);
• Gather the small normalization layers once at the beginning of the iteration and average their gradients only at the end;
• Move gradient division to the very end of the backward pass.
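To make the first idea concrete, here is a minimal sketch of the double-buffering technique, not YaFSDP's actual implementation: two pre-allocated buffers are reused for the gathered weights of every transformer layer, so no per-layer allocation ever goes through the torch caching allocator. The buffer size, the `gather_layer_weights` helper, and the `shard` argument are hypothetical names introduced only for this example.

```python
import torch
import torch.distributed as dist

# Illustrative sketch only (not YaFSDP's code). Assumes
# dist.init_process_group(...) has already been called and that each rank
# holds a 1-D shard of every transformer layer's parameters.

world_size = dist.get_world_size()
MAX_LAYER_NUMEL = 1_000_000  # hypothetical upper bound on one layer's parameter count

# Two persistent buffers: while layer i computes out of one buffer,
# layer i + 1 can already be all-gathered into the other.
buffers = [
    torch.empty(MAX_LAYER_NUMEL, device="cuda", dtype=torch.bfloat16)
    for _ in range(2)
]

def gather_layer_weights(layer_idx: int, shard: torch.Tensor) -> torch.Tensor:
    """All-gather one layer's parameter shard into a reused buffer."""
    out = buffers[layer_idx % 2][: shard.numel() * world_size]
    dist.all_gather_into_tensor(out, shard)
    return out
```

Because consecutive layers alternate between the two buffers, gathering the next layer's weights can overlap with computing the current one, while peak memory for unsharded weights stays bounded by the two buffers.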
YaFSDP can be used in conjunction with Hugging Face workflows and is up to 25% faster than FSDP.