arxiv:2310.09259

Towards End-to-end 4-Bit Inference on Generative Large Language Models

Published on Oct 13, 2023

Abstract

With QUIK, a hybrid quantization strategy, large generative models such as LLaMA and OPT achieve significant inference speedups while maintaining accuracy, because most weights and activations are compressed to 4 bits.

We show that the majority of inference computations for large generative models such as LLaMA and OPT can be performed with both weights and activations cast to 4 bits, in a way that yields practical speedups while maintaining good accuracy. We achieve this via a hybrid quantization strategy called QUIK, which compresses most of the weights and activations to 4 bits while keeping some outlier weights and activations in higher precision. Crucially, our scheme is designed with computational efficiency in mind: we provide GPU kernels with highly efficient layer-wise runtimes, which lead to practical end-to-end throughput improvements of up to 3.1x relative to FP16 execution. Code and models are provided at https://github.com/IST-DASLab/QUIK.
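To make the hybrid idea concrete, here is a minimal NumPy sketch of a linear layer in which most input features are quantized to 4 bits while a few outlier features stay in full precision. It is an illustration under stated assumptions, not QUIK's actual algorithm or GPU kernels: the function names `quantize_int4` and `hybrid_matmul` are hypothetical, outliers are chosen simply as the input features with the largest activation magnitude, and quantization is plain symmetric round-to-nearest.

```python
import numpy as np


def quantize_int4(x, axis):
    """Symmetric 4-bit quantization: integers in [-8, 7] plus a float scale."""
    max_abs = np.max(np.abs(x), axis=axis, keepdims=True)
    scale = np.where(max_abs > 0, max_abs / 7.0, 1.0)  # avoid division by zero
    q = np.clip(np.round(x / scale), -8, 7).astype(np.int8)
    return q, scale


def hybrid_matmul(x, w, num_outliers):
    """Approximate x @ w.T, keeping `num_outliers` input features in full precision.

    x: activations, shape (tokens, in_features)
    w: weights, shape (out_features, in_features)
    """
    n_features = x.shape[1]
    if not 0 <= num_outliers < n_features:
        raise ValueError("num_outliers must be in [0, in_features)")

    # Assumption for this sketch: outliers are the features with the
    # largest activation magnitude in the current batch.
    order = np.argsort(np.max(np.abs(x), axis=0))
    base_idx = order[: n_features - num_outliers]
    outlier_idx = order[n_features - num_outliers:]

    # Low-precision path: 4-bit values, integer accumulation, then rescale.
    xq, x_scale = quantize_int4(x[:, base_idx], axis=1)  # one scale per token
    wq, w_scale = quantize_int4(w[:, base_idx], axis=1)  # one scale per output row
    acc = xq.astype(np.int32) @ wq.astype(np.int32).T
    low = acc * x_scale * w_scale.T

    # High-precision path: outlier features are multiplied without quantization.
    high = x[:, outlier_idx] @ w[:, outlier_idx].T
    return low + high


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    x = rng.normal(size=(16, 64))
    x[:, :4] *= 50.0  # inject a few large-magnitude feature columns
    w = rng.normal(size=(32, 64))
    ref = x @ w.T
    for k in (0, 4):
        err = np.linalg.norm(hybrid_matmul(x, w, k) - ref) / np.linalg.norm(ref)
        print(f"outliers kept in full precision: {k}, relative error: {err:.4f}")
```

In this toy setting, the large-magnitude columns inflate the per-token scale, so quantizing everything to 4 bits crushes the remaining features toward zero; routing those few columns through the full-precision path keeps the scale small for the rest. A practical implementation would also need packed 4-bit storage and fused kernels to obtain real speedups, which this sketch does not attempt.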
