OPAL: Outlier-Preserved Microscaling Quantization Accelerator for Generative Large Language Models
DC Field Value
dc.contributor.author Koo, Jahyun
dc.contributor.author Park, Dahoon
dc.contributor.author Jung, Sangwoo
dc.contributor.author Kung, Jaeha
dc.date.accessioned 2025-02-03T22:10:17Z
dc.date.available 2025-02-03T22:10:17Z
dc.date.created 2024-12-19
dc.date.issued 2024-06-25
dc.identifier.isbn 9798400706011
dc.identifier.issn 0738-100X
dc.identifier.uri http://hdl.handle.net/20.500.11750/57857
dc.description.abstract To overcome the burden on memory size and bandwidth caused by the ever-increasing size of large language models (LLMs), aggressive weight quantization has recently been studied, whereas research on quantizing activations remains scarce. In this paper, we present a hardware-software co-design method that results in an energy-efficient LLM accelerator, named OPAL, for generation tasks. First, a novel activation quantization method is proposed that leverages the microscaling data format while preserving several outliers per subtensor block (e.g., four out of 128 elements). Second, on top of preserving outliers, mixed precision is employed, assigning 5 bits to the inputs of sensitive layers in the decoder block of an LLM while keeping the inputs of less sensitive layers at 3 bits. Finally, we present the OPAL hardware architecture, which consists of FP units for handling outliers and vectorized INT multipliers for the dominant non-outlier operations. In addition, OPAL uses a log2-based approximation of the softmax operation that requires only shifts and subtractions to maximize power efficiency. As a result, we improve energy efficiency by 1.6∼2.2× and reduce area by 2.4∼3.1× with negligible accuracy loss, i.e., <1 perplexity increase. © 2024 Institute of Electrical and Electronics Engineers Inc. All rights reserved. (A toy sketch of these two ideas appears after the metadata record below.)
dc.language English
dc.publisher Institute of Electrical and Electronics Engineers Inc.
dc.relation.ispartof Proceedings - Design Automation Conference
dc.title OPAL: Outlier-Preserved Microscaling Quantization Accelerator for Generative Large Language Models
dc.type Conference Paper
dc.identifier.doi 10.1145/3649329.3657323
dc.identifier.wosid 001447271200259
dc.identifier.scopusid 2-s2.0-85211145974
dc.identifier.bibliographicCitation Koo, Jahyun. (2024-06-25). OPAL: Outlier-Preserved Microscaling Quantization Accelerator for Generative Large Language Models. Design Automation Conference, 1–6. doi: 10.1145/3649329.3657323
dc.identifier.url https://www.dac.com/About/Conference-Archive/61st-DAC
dc.citation.conferenceDate 2024-06-23
dc.citation.conferencePlace San Francisco, US
dc.citation.endPage 6
dc.citation.startPage 1
dc.citation.title Design Automation Conference
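
The abstract above describes two ideas that a small example can make concrete: per-block quantization that keeps a few floating-point outliers while quantizing the rest to a narrow integer grid with a shared power-of-two scale, and a base-2 softmax approximation built from shifts and subtractions. The NumPy sketch below is purely illustrative and is not the paper's implementation; the function names (quantize_block_outlier_preserved, log2_softmax_approx), the choice of a power-of-two shared scale, and all defaults are assumptions made for demonstration.

import numpy as np

def quantize_block_outlier_preserved(x, n_outliers=4, int_bits=3):
    """Toy sketch: quantize one subtensor block (e.g., 128 elements),
    keeping the n_outliers largest-magnitude values in floating point and
    mapping the rest to a signed int_bits grid with one shared
    power-of-two scale. Returns the dequantized block for inspection."""
    x = np.asarray(x, dtype=np.float32)
    # Indices of the outliers to preserve (largest magnitudes).
    outlier_idx = np.argsort(np.abs(x))[-n_outliers:]
    outlier_vals = x[outlier_idx]

    # Remaining (non-outlier) elements share a single scale.
    mask = np.ones_like(x, dtype=bool)
    mask[outlier_idx] = False
    qmax = 2 ** (int_bits - 1) - 1                 # e.g., 3 for 3-bit signed
    absmax = np.abs(x[mask]).max() if np.any(mask) else 0.0
    # Power-of-two scale, loosely in the spirit of microscaling formats.
    scale = 2.0 ** np.ceil(np.log2(absmax / qmax)) if absmax > 0 else 1.0
    q = np.clip(np.round(x[mask] / scale), -qmax - 1, qmax)

    # Dequantize; outliers pass through unchanged in floating point.
    x_hat = np.empty_like(x)
    x_hat[mask] = q * scale
    x_hat[outlier_idx] = outlier_vals
    return x_hat

def log2_softmax_approx(logits):
    """Toy base-2 softmax approximation: the exponential is replaced by a
    power of two of the rounded, max-subtracted logit, which in integer
    hardware amounts to a subtraction followed by a shift."""
    logits = np.asarray(logits, dtype=np.float32)
    shifted = logits - logits.max()                # subtraction for stability
    pow2 = 2.0 ** np.round(shifted)                # power of two ~ bit shift
    return pow2 / pow2.sum()

# Example: a 128-element block with one large outlier keeps that outlier
# exact while the rest incur only the 3-bit grid's rounding error.
# blk = np.random.randn(128).astype(np.float32); blk[0] = 40.0
# print(np.abs(blk - quantize_block_outlier_preserved(blk)).max())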

File Downloads

  • There are no files associated with this item.
