I am currently a fourth-year PhD student at IST Austria, supervised by Dan Alistarh, as well as a second-time intern at Google DeepMind. Overall, my research focuses on making massive machine learning models more efficient.

In 2022, I developed the first successful low-bit quantization and sparsification methods for extremely large language models, GPTQ and SparseGPT. In 2023, I identified scaling laws for sparsely-connected foundation models and built QMoE, a tool for efficient compression and execution of trillion-parameter Mixture-of-Experts models in limited-resource settings. Most recently, I implemented Marlin, the first INT4xFP16 LLM inference kernel with near-ideal speedup at medium batch sizes.

In my free time, I love creating super-fast Rubik's Cube solving robots, some of which have beaten long-standing records and collected millions of views on YouTube.

Highlighted Work

GPTQ (ICLR 2023):

  • The first quantization method able to accurately compress massive LLMs to 4- or 3-bit precision.
  • The first open-source GPU kernel demonstrating major generative inference speedup with standard weight-only quantization.
  • Supported by various popular libraries: Hugging Face's transformers, NVIDIA's TensorRT-LLM, and Intel's neural-compressor.
  • 1.5k+ stars on GitHub, with popular forks AutoGPTQ (3k+ stars) and GPTQ-for-LLaMa (2.5k+ stars).

SparseGPT (Oral, ICML 2023):

  • The first algorithm able to accurately induce significant sparsity in 100+ billion parameter models.
  • Featured by Communications of the ACM and on national television.
  • Invited talks at Apple, Amazon and Google.
  • 500+ stars on GitHub.

Rubik's Cube Robots:

  • 10+ million views on YouTube; also presented live on the BBC.
  • Cuboth: the world's fastest robot for solving an unmodified Rubik's Cube, beating the previous record, which had stood for 7 years, by 2x on equivalent hardware.
  • rob-twophase & qphase: the current best computer solving algorithms, and the first to directly account for robot mechanics during the search process.
  • SquidCuber: the first machine built entirely out of Lego to solve a cube in a single second on average, 2x faster than the previous record, which had stood for 5 years.

First-Author Papers

QMoE: Practical Sub-1-Bit Compression of Trillion-Parameter Models

MLSys 2024 - Elias Frantar, Dan Alistarh - [arxiv], [github].
We compress a trillion-parameter Mixture-of-Experts model, for the first time, to less than 1 bit per parameter, using a custom format co-designed with an efficient inference kernel.

Scaling Laws for Sparsely-Connected Foundation Models

SPOTLIGHT, ICLR 2024 - E. Frantar, C. Riquelme, N. Houlsby, D. Alistarh, U. Evci - [arxiv].
We determine the first scaling laws connecting parameter-sparsity, effective model size and amount of training data, in the context of modern Transformers trained on massive datasets.
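
For intuition, such a joint law can be written as a parametric fit of the validation loss L in terms of the sparsity S, the number of non-zero parameters N, and the amount of training data D. The expression below is only an illustrative sketch of this kind of functional form; the exact fitted expression and coefficients are those reported in the paper.

    L(S, N, D) \approx \big( a_S (1 - S)^{b_S} + c_S \big) \cdot (1/N)^{b_N} + (a_D / D)^{b_D} + c

Here the first term models how sparsity modulates the contribution of the N non-zero parameters, the second term captures the amount of training data, and c is an irreducible loss floor.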

SparseGPT: Massive Language Models Can be Accurately Pruned in One-shot

ORAL, ICML 2023 - Elias Frantar, Dan Alistarh - [arxiv], [github].
We introduce the first pruning algorithm that is fast and accurate enough to successfully impose non-trivial amounts of sparsity on 100+ billion parameter models.

GPTQ: Accurate Post-training Quantization for Generative Pretrained Transformers

ICLR 2023 - E. Frantar, S. Ashkboos, T. Hoefler, D. Alistarh - [arxiv], [github].
We develop the first quantization algorithm that is fast and accurate enough to successfully quantize 100+ billion parameter models to 4-bit and 3-bit precision.
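
As background on the weight-only setting this targets, here is a minimal NumPy sketch of plain round-to-nearest 4-bit quantization with per-group scales. This is deliberately not the GPTQ algorithm itself (GPTQ additionally uses approximate second-order information to compensate rounding errors); the helper names and the group size of 128 are assumptions for the example.

    import numpy as np

    def rtn_quantize_4bit(W, group_size=128):
        # Round-to-nearest 4-bit quantization with one scale per row per group of
        # `group_size` columns; assumes W.shape[1] is divisible by group_size.
        # Illustrative baseline only -- not the GPTQ algorithm.
        rows, cols = W.shape
        codes = np.empty((rows, cols), dtype=np.uint8)
        scales = np.empty((rows, cols // group_size), dtype=np.float32)
        for g in range(cols // group_size):
            block = W[:, g * group_size:(g + 1) * group_size]
            s = np.abs(block).max(axis=1, keepdims=True) / 7.0  # symmetric range [-8, 7]
            s = np.where(s == 0, 1e-8, s)
            q = np.clip(np.round(block / s), -8, 7)
            codes[:, g * group_size:(g + 1) * group_size] = (q + 8).astype(np.uint8)
            scales[:, g] = s[:, 0]
        return codes, scales

    def rtn_dequantize_4bit(codes, scales, group_size=128):
        # Reconstruct an approximate FP32 weight matrix from 4-bit codes and scales.
        return (codes.astype(np.float32) - 8.0) * np.repeat(scales, group_size, axis=1)

Usage would be along the lines of codes, scales = rtn_quantize_4bit(W), followed by rtn_dequantize_4bit(codes, scales) to recover an approximate weight matrix.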

Optimal Brain Compression: A Framework for Accurate Pruning and Quantization

NeurIPS 2022 - Elias Frantar, Sidak Pal Singh, Dan Alistarh - [arxiv], [github].
We show that the classical Optimal Brain Surgeon pruning framework can be implemented exactly at the layer-wise level and extended to quantization, leading to state-of-the-art post-training compression results.
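
For reference, the classical Optimal Brain Surgeon step (Hassibi & Stork) that this framework applies exactly at the layer level removes the weight with the smallest saliency and compensates the remaining weights; restated here in LaTeX as textbook background, not as the paper's specific algorithmic contribution:

    q^* = \arg\min_q \frac{w_q^2}{2\,[\mathbf{H}^{-1}]_{qq}},
    \qquad
    \delta \mathbf{w} = -\,\frac{w_q}{[\mathbf{H}^{-1}]_{qq}}\, \mathbf{H}^{-1} \mathbf{e}_q

where H is the Hessian of the layer-wise reconstruction error and e_q is the q-th standard basis vector.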

Accurate Pruning with Speedup Guarantees

ICML 2022 - Elias Frantar, Dan Alistarh - [arxiv], [github].
We introduce new techniques for runtime- and hardware-aware sparsification, with state-of-the-art speedup-vs-accuracy trade-offs, for vision and text domains.

M-FAC: Efficient Matrix-Free Approximations of Second-Order Information

NeurIPS 2021 - Elias Frantar, Eldar Kurtic, Dan Alistarh - [arxiv], [github].
We develop new algorithms for handling empirical Fisher approximations that scale efficiently to arbitrarily large block sizes, with applications to pruning and optimization.
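
As background on the object involved: given an empirical Fisher estimate built from m gradients g_1, ..., g_m plus a damping term, its inverse can in principle be maintained via the Sherman-Morrison recursion below (written here only as a sketch; the matrix-free approach avoids ever materializing these matrices and instead computes the resulting inverse-Hessian-vector products directly).

    \widehat{\mathbf{F}} = \lambda \mathbf{I} + \tfrac{1}{m} \sum_{k=1}^{m} \mathbf{g}_k \mathbf{g}_k^{\top},
    \qquad
    \widehat{\mathbf{F}}_0^{-1} = \lambda^{-1} \mathbf{I},
    \qquad
    \widehat{\mathbf{F}}_k^{-1} = \widehat{\mathbf{F}}_{k-1}^{-1}
      - \frac{\widehat{\mathbf{F}}_{k-1}^{-1} \mathbf{g}_k \mathbf{g}_k^{\top} \widehat{\mathbf{F}}_{k-1}^{-1}}
             {m + \mathbf{g}_k^{\top} \widehat{\mathbf{F}}_{k-1}^{-1} \mathbf{g}_k}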

Other Publications

SpQR: A Sparse-Quantized Representation for Near-Lossless LLM Weight Compression

ICLR 2024 - T. Dettmers, R. Svirschevski, V. Egiazarian, D. Kuznedelev, E. Frantar, ...

ZipLM: Hardware-Aware Structured Pruning of Language Models

NeurIPS 2023 - Eldar Kurtic, Elias Frantar, Dan Alistarh

CAP: Correlation-Aware Pruning for Highly-Accurate Sparse Vision Models

NeurIPS 2023 - Denis Kuznedelev, Eldar Kurtic, Elias Frantar, Dan Alistarh

The Optimal BERT Surgeon: Scalable and Accurate Second-Order Pruning for Large Language Models

EMNLP 2022 - E. Kurtic, D. Campos, T. Nguyen, E. Frantar, M. Kurtz, B. Fineran, ...

On the Sample Complexity of Adversarial Multi-Source PAC Learning

ICML 2020 - Nikola Konstantinov, Elias Frantar, Dan Alistarh, Christoph Lampert

Contact

first-name [dot] last-name [at] (gmail [dot] com OR ist [dot] ac [dot] at)