lqdev

https://huggingface.co/blog/quanto-introduction

Quantization is a technique to reduce the computational and memory costs of evaluating Deep Learning Models by representing their weights and activations with low-precision data types like 8-bit integer (int8) instead of the usual 32-bit floating point (float32).

Today, we are excited to introduce quanto, a versatile pytorch quantization toolkit, that provides several unique features:

  • available in eager mode (works with non-traceable models)
  • quantized models can be placed on any device (including CUDA and MPS),
  • automatically inserts quantization and dequantization stubs,
  • automatically inserts quantized functional operations,
  • automatically inserts quantized modules (see below the list of supported modules),
  • provides a seamless workflow for a float model, going from a dynamic to a static quantized model,
  • supports quantized model serialization as a state_dict,
  • supports not only int8 weights, but also int2 and int4,
  • supports not only int8 activations, but also float8.

Send me a message or webmention
Back to feed