How we built DeepL’s next-generation LLMs with FP8 for training and inference

When DeepL deployed our current NVIDIA DGX SuperPOD with 544 NVIDIA H100 Tensor Core GPUs, we didn’t just get a big increase in compute power. The H100 also introduces native support for 8-bit floating point (FP8) data types through a new generation of Tensor Cores, which let the GPU perform matrix multiplications and other tensor operations at FP8 precision. Since most of the compute in modern large language models (LLMs) takes the form of matrix multiplications, running them in FP8 lets us increase throughput when training and deploying our LLMs.
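
For readers who have not worked with FP8 yet, here is a minimal pure-Python sketch of the general idea. It is an illustration, not DeepL's implementation. It assumes the E4M3 variant of FP8 (1 sign bit, 4 exponent bits, 3 mantissa bits, largest finite value 448) and simple per-tensor scaling by the absolute maximum. The helper names `round_to_e4m3`, `quantize` and `fp8_dot` are made up for this example.

```python
import math

# Assumed format: FP8 E4M3 (largest finite value 448, smallest normal 2**-6).
E4M3_MAX = 448.0
E4M3_MIN_NORMAL_EXP = -6
E4M3_MANTISSA_BITS = 3


def round_to_e4m3(x: float) -> float:
    """Round x to the nearest E4M3-representable value, saturating at +/-448."""
    if x == 0.0 or math.isnan(x):
        return x
    sign = -1.0 if x < 0 else 1.0
    mag = min(abs(x), E4M3_MAX)
    # frexp gives mag = m * 2**e with 0.5 <= m < 1, so the binade exponent is e - 1.
    _, e = math.frexp(mag)
    exp = max(e - 1, E4M3_MIN_NORMAL_EXP)  # clamp into the subnormal range
    step = 2.0 ** (exp - E4M3_MANTISSA_BITS)  # spacing of values in this binade
    # Python's round() rounds ties to even, like round-to-nearest-even hardware.
    return sign * min(round(mag / step) * step, E4M3_MAX)


def quantize(values):
    """Hypothetical per-tensor scaling: map the largest magnitude onto E4M3_MAX."""
    amax = max(abs(v) for v in values)
    scale = E4M3_MAX / amax if amax > 0 else 1.0
    return [round_to_e4m3(v * scale) for v in values], scale


def fp8_dot(a, b):
    """Dot product of FP8-quantized inputs, accumulated in higher precision."""
    qa, scale_a = quantize(a)
    qb, scale_b = quantize(b)
    acc = sum(x * y for x, y in zip(qa, qb))  # accumulate in Python floats
    return acc / (scale_a * scale_b)  # undo both scales


a = [0.1, -0.5, 0.25, 0.8]
b = [1.5, 2.0, -0.75, 0.3]
exact = sum(x * y for x, y in zip(a, b))
approx = fp8_dot(a, b)
print(f"exact={exact:.4f} fp8={approx:.4f}")
```

With only 3 mantissa bits, each input carries a rounding error of up to a few percent, which is why scaling and higher-precision accumulation matter so much in practice.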

In this post, we walk through how we applied FP8 to training and inference, share some of the tools and techniques behind that work, and give you an idea of the training and inference performance we achieved along the way.

Full post -> https://www.deepl.com/en/blog/tech/next-generation-llm-fp8-training

Have you worked with FP8 as well? What's your experience?