Detail View

Hardware Acceleration with Microscaling Formats for Deep Learning Inference and Training


DC Field Value Language
dc.contributor.advisor Gain Kim -
dc.contributor.author Jahyun Koo -
dc.date.accessioned 2026-01-23T10:54:21Z -
dc.date.available 2026-01-24T06:00:45Z -
dc.date.issued 2026 -
dc.identifier.uri https://scholar.dgist.ac.kr/handle/20.500.11750/59630 -
dc.identifier.uri http://dgist.dcollection.net/common/orgView/200000944743 -
dc.description AI Accelerator, Low-precision Data Format, Block Floating Point (BFP), Microscaling (MX) Format, Large Language Model (LLM), Hardware-Software Co-design -
dc.description.abstract This thesis proposes a hardware-software co-design methodology for constructing high-efficiency AI accelerators to address the exponentially increasing computation and memory demands of Deep Neural Networks (DNNs) and Large Language Models (LLMs). This research introduces a novel integrated approach: designing specialized accelerator architectures around low-precision data formats, specifically Block Floating Point (BFP) and Microscaling (MX) formats, each optimized for distinct application needs.
First, FlexBlock was conducted as a foundational study to explore the potential and limitations of low-precision training using multi-mode BFP (FB12/16/24). To overcome the hardware under-utilization that prior precision-scalable MAC arrays suffer during 2D operations (e.g., weight updates), FlexBlock introduced a hierarchical structure and a dedicated dual-path reduction unit. While FlexBlock demonstrated significant gains (1.5× to 5.3× higher training speed and 2.4× to 7.0× higher energy efficiency), this study crucially revealed the structural limitations of the rigid BFP format, such as training instability at low precision and inflexibility in handling outliers, motivating the strategic shift to Microscaling (MX) formats.
Second, building upon these insights, OPAL specializes in LLM inference acceleration, tackling the critical challenge of activation outliers during low-precision quantization. OPAL proposes the MX-OPAL format, a hybrid approach that preserves outliers (e.g., four out of 128 elements) in BF16 while aggressively quantizing the remainder to low-bit integers. The OPAL accelerator utilizes heterogeneous compute lanes consisting of sparse FP units and dense INT multipliers, processing 96.9% of all computations with efficient INT operations, and incorporates a log2-based Softmax approximation unit to minimize hardware cost. This architecture achieved up to a 46.5% reduction in total energy consumption compared to weight-only quantization (OWQ) methods.
Finally, MX-SAFE resolves the fundamental trade-off between the wide mantissa bit-width (E2M5) required for accurate inference and the wide dynamic range (E3M2 + bias) required for stable training. The proposed MXSF format dynamically switches between an inference-optimized mode and a training-stabilized mode within a single 8-bit block to support both needs. By employing a 2D tile-based MX block design, it minimizes re-quantization overhead during the backward pass. The MX-SAFE accelerator maintained accuracy comparable to the BF16 baseline while reducing total energy consumption by 24.9% on the DeiT-Tiny training task.
The series of studies presented in this thesis systematically demonstrates that the co-design of data formats and hardware architectures, optimized for specific applications, maximizes the efficiency of deep learning acceleration. Minimal illustrative sketches of the core techniques are shown below.

Keywords: AI Accelerator, Low-precision Data Format, Block Floating Point (BFP), Microscaling (MX) Format, Large Language Model (LLM), Hardware-Software Co-design|This thesis presents a hardware-software co-design methodology for designing high-efficiency AI accelerators to address the exponentially growing computation and memory demands of deep learning models (DNNs and LLMs). Based on low-precision data formats such as Block Floating Point (BFP) and Microscaling (MX), this research covers a new approach that designs accelerator architectures optimized for the characteristics of each application domain and integrates them.
First, FlexBlock was conducted as a foundational study that identified the potential and limits of low-precision training through an architecture supporting multi-mode BFP (FB12/16/24) for general-purpose DNN training acceleration. To resolve the hardware under-utilization that existing precision-scalable MAC arrays suffer in 2D operations (e.g., weight updates), FlexBlock introduced a hierarchical structure and a dual-path reduction unit. It thereby demonstrated 1.5×-5.3× faster training and 2.4×-7.0× higher energy efficiency, while at the same time confirming the training instability caused by the structural rigidity of the BFP format and its limits in handling outliers, thereby establishing the need for next-generation Microscaling (MX) formats.
Second, building on the limitations identified in the preceding study, OPAL specializes in LLM inference acceleration and addresses the activation outlier problem, a key difficulty in low-precision quantization. OPAL proposed the MX-OPAL format, which preserves outliers in BF16 and quantizes the remainder to low-bit integers. The OPAL accelerator uses heterogeneous compute lanes composed of FP units and INT arithmetic units to process 96.9% of all operations with high-efficiency INT arithmetic, and minimizes hardware cost through a log2-based Softmax approximation unit. With this design, OPAL achieved up to 46.5% savings in total energy consumption compared to the existing weight-only quantization (OWQ) technique.
Finally, MX-SAFE presents a unified solution that resolves the fundamental trade-off between the wide mantissa bit-width (E2M5) needed for inference and the wide dynamic range (E3M2) required for training. The proposed MXSF format satisfies both requirements by dynamically switching between an inference-optimized mode and a training-stabilized mode within a single 8-bit block. In addition, a 2D tile-based MX block design minimizes re-quantization overhead during training. The MX-SAFE accelerator maintained training accuracy on par with the BF16 baseline while cutting total energy consumption by 24.9% on the DeiT-Tiny training task.
The series of studies presented in this thesis systematically demonstrates that co-designing data formats and hardware architectures optimized for each application can maximize the efficiency of next-generation deep learning acceleration.

Keywords: AI Accelerator, Low-precision Data Format, Block Floating Point (BFP), Microscaling (MX), Large Language Model (LLM), Hardware-Software Co-design
-
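Block Floating Point, on which the FlexBlock study above is built, stores one shared power-of-two exponent per block of values and a narrow integer mantissa per element. The sketch below is a minimal NumPy illustration of this scheme; the block size of 16, the 4-bit mantissa, and round-to-nearest are illustrative assumptions, not the thesis's FB12/16/24 definitions.

```python
import numpy as np

def bfp_quantize(x: np.ndarray, mantissa_bits: int = 4, block: int = 16) -> np.ndarray:
    """Fake-quantize x block-wise: each block shares one power-of-two exponent."""
    blocks = x.reshape(-1, block)
    # Shared exponent = exponent of the largest-magnitude element in the block.
    max_abs = np.abs(blocks).max(axis=1, keepdims=True)
    shared_exp = np.floor(np.log2(np.maximum(max_abs, 2.0**-126)))
    # Scale so every element fits a signed `mantissa_bits`-bit integer mantissa.
    scale = 2.0 ** (shared_exp - mantissa_bits + 2)
    mant = np.clip(np.round(blocks / scale),
                   -(2 ** (mantissa_bits - 1)),
                   2 ** (mantissa_bits - 1) - 1)
    return (mant * scale).reshape(x.shape)  # dequantized view of the BFP values

rng = np.random.default_rng(0)
w = rng.normal(size=64).astype(np.float32)
print("max abs error:", np.abs(w - bfp_quantize(w)).max())
```

Because every element is shifted to the exponent of its block's largest member, a single outlier can erase the mantissa bits of its neighbors, which is precisely the BFP inflexibility the abstract cites.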
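The MX-OPAL format addresses that weakness by pulling a few outliers per block off the integer path. The sketch below mimics the behavior described in the abstract (four outliers preserved per 128-element block, remainder quantized to low-bit integers); the 4-bit inlier width, symmetric rounding, and power-of-two shared scale are our assumptions. Note that 124 of every 128 elements stay on the integer path, i.e. about 96.9%, matching the fraction of INT operations quoted above.

```python
import numpy as np

def opal_quantize(x: np.ndarray, block: int = 128, n_outliers: int = 4,
                  int_bits: int = 4) -> np.ndarray:
    """Fake-quantize x per block, preserving the largest outliers unquantized."""
    blocks = x.reshape(-1, block)
    out = blocks.copy()
    for b, blk in enumerate(blocks):
        # Indices of the n largest-magnitude elements: kept as-is here,
        # standing in for BF16 storage on OPAL's sparse FP lanes.
        idx = np.argpartition(np.abs(blk), -n_outliers)[-n_outliers:]
        inlier = np.ones(block, dtype=bool)
        inlier[idx] = False
        # Shared power-of-two scale sized to the largest remaining inlier.
        exp = np.floor(np.log2(np.abs(blk[inlier]).max() + 1e-30))
        scale = 2.0 ** (exp - int_bits + 2)
        q = np.clip(np.round(blk[inlier] / scale),
                    -(2 ** (int_bits - 1)), 2 ** (int_bits - 1) - 1)
        out[b, inlier] = q * scale  # dense INT lane, dequantized for comparison
    return out.reshape(x.shape)

a = np.random.default_rng(1).normal(size=256)
a[[3, 200]] *= 50.0                      # inject activation outliers
print("rel. error:", np.linalg.norm(a - opal_quantize(a)) / np.linalg.norm(a))
```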
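The abstract names a log2-based Softmax approximation unit but does not give its algorithm. A common hardware-friendly variant, sketched below purely as an assumption rather than as OPAL's actual circuit, evaluates softmax in base 2 (folding log2(e) into one constant multiply) and approximates 2^f by a shift plus a linear fractional term, so no exponential unit is needed.

```python
import numpy as np

LOG2E = 1.4426950408889634  # log2(e), folded into a single constant multiply

def pow2_approx(f: np.ndarray) -> np.ndarray:
    """Approximate 2**f as 2**floor(f) * (1 + frac(f)): a shift plus an add."""
    i = np.floor(f)
    return 2.0 ** i * (1.0 + (f - i))

def log2_softmax(x: np.ndarray) -> np.ndarray:
    # softmax(x) = 2^(z) / sum 2^(z) with z = (x - max(x)) * log2(e)
    z = (x - x.max()) * LOG2E
    num = pow2_approx(z)
    return num / num.sum()

x = np.array([1.0, 2.0, 3.0, 0.5])
ref = np.exp(x - x.max()); ref /= ref.sum()
print("approx error:", np.abs(log2_softmax(x) - ref).max())
```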
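The E2M5-versus-E3M2 trade-off that motivates MXSF can be made concrete with a little arithmetic. The helper below assumes IEEE-style element encodings with bias 2^(E-1)-1 and no reserved special values; the thesis's exact MXSF encoding, including the extra bias term mentioned above, is not specified in this record.

```python
def fp_stats(e_bits: int, m_bits: int):
    """Max value, smallest normal, and precision near 1.0 for a 1.E.M format."""
    bias = 2 ** (e_bits - 1) - 1
    max_exp = (2 ** e_bits - 1) - bias   # treating the all-ones exponent as normal
    min_exp = 1 - bias                   # smallest normal exponent
    max_val = 2.0 ** max_exp * (2 - 2.0 ** -m_bits)
    min_normal = 2.0 ** min_exp
    ulp_at_1 = 2.0 ** -m_bits            # step size just above 1.0
    return max_val, min_normal, ulp_at_1

for name, (e, m) in {"E2M5": (2, 5), "E3M2": (3, 2)}.items():
    mx, mn, ulp = fp_stats(e, m)
    print(f"{name}: max={mx:g} min_normal={mn:g} ulp@1={ulp:g}")
```

Under these assumptions E2M5 resolves steps of 2^-5 near 1.0 but its normals span only about three binades (1 to ~7.9), while E3M2 spans roughly seven binades (0.25 to 28) at 2^-2 resolution, which is why a single 8-bit block format that switches between the two modes is attractive.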
dc.description.tableofcontents I Introduction
II Background
2.1 Deep Learning Models: Workloads and Bottlenecks
2.1.1 Conventional DNNs (e.g., CNNs)
2.1.2 Transformer-based Models (LLMs and Vision Transformers)
2.1.3 The Cost and Computational Bottlenecks
2.2 Low-Precision Data Formats
2.2.1 Fundamentals of Quantization and Granularity
2.2.2 Standard Formats (FP32, FP16, BF16, INT8)
2.2.3 Block Floating Point (BFP) Theory
2.2.4 From Block Floating Point (BFP) to Microscaling (MX) Formats
2.3 Deep Learning Accelerator Architectures
2.3.1 General-Purpose GPUs (GPGPUs) and Tensor Cores
2.3.2 Systolic Arrays (e.g., Google TPU)
2.3.3 Limitations of Existing Architectures for Low-Precision Training
2.4 Related Work in Low-Precision Training Acceleration
2.4.1 Inflexibility in BFP-based Training Accelerators
2.4.2 The Hardware Cost of Mixed-Precision Training
2.5 Related Work in LLM Inference Acceleration
2.5.1 Limitations of Software-Centric Quantization Solutions
2.5.2 Hardware Bottlenecks in Autoregressive Generation
2.6 Chapter Summary and Identified Research Gaps
III Foundational Study: Investigating Limitations of Block Floating Point in DNN Training Acceleration
3.1 Introduction
3.2 Related Work and Architectural Challenges
3.3 Proposed Architecture for BFP Evaluation
3.3.1 Hierarchical Processing Unit (PU) Design
3.3.2 Shared Exponent Handler
3.3.3 Dual-Path Reduction Unit for 2D/3D Hybrid Dataflow
3.4 Dynamic Precision Control Strategy
3.5 Experimental Methodology for BFP Evaluation
3.5.1 Simulation Framework
3.5.2 Benchmarks and Training Hyperparameters
3.6 Analysis of BFP Limitations
3.6.1 The Gradient Dynamic Range Bottleneck
3.6.2 Inability to Handle Outliers
3.7 Proof-of-Concept Results
3.8 Conclusion
IV Outlier-Preserved Microscaling Quantization for LLM
4.1 Introduction
4.2 Background
4.2.1 Quantization for LLMs
4.2.2 Microscaling Data Format
4.3 Proposed MX-OPAL Data Format
4.3.1 Outlier-Preserved Microscaling Data Format
4.3.2 Impact of Preserving Activation Outliers
4.4 Hardware Architecture of OPAL
4.4.1 OPAL Computation Flow
4.4.2 Proposed Log2-based Softmax Unit
4.4.3 OPAL Microarchitecture
4.5 Experimental Results
4.5.1 Accuracy Analysis of MX-OPAL
4.5.2 Hardware Efficiency of OPAL
4.6 Conclusion
V Versatile Microscaling Hardware for Inference and Training
5.1 Introduction
5.2 Background
5.2.1 Microscaling (MX) Data Format
5.2.2 Optimal MX Format for Inference & Training
5.3 Quantitative Analysis on MX Formats
5.3.1 Analytical Comparison Between MXINT and MXFP
5.3.2 Analysis on Required Bit-precision in Training
5.4 MX-SAFE: Versatile Microscaling Format
5.4.1 Proposed MXSF Data Format
5.4.2 MX Block Tiling for Inference/Training
5.5 MXSF-Based Multi-Format Systolic Tensor Array Accelerator
5.5.1 MX-SAFE Accelerator
5.5.2 MXSF-aware MAC Unit
5.6 Experimental Results
5.6.1 Experimental Setup
5.6.2 Accuracy on Direct-cast Inference
5.6.3 Accuracy on Model Training
5.6.4 Hardware Analysis
5.7 Conclusion
VI Conclusion
6.1 Conclusion
6.2 Future Research Directions
References
-
dc.format.extent 98 -
dc.language eng -
dc.publisher DGIST -
dc.title Hardware Acceleration with Microscaling Formats for Deep Learning Inference and Training -
dc.title.alternative 마이크로스케일링 포맷을 활용한 딥 러닝의 추론과 학습을 위한 가속 하드웨어 -
dc.type Thesis -
dc.identifier.doi 10.22677/THESIS.200000944743 -
dc.description.degree Doctor -
dc.contributor.department Department of Electrical Engineering and Computer Science -
dc.contributor.coadvisor Jaeha Kung -
dc.date.awarded 2026-02-01 -
dc.publisher.location Daegu -
dc.description.database dCollection -
dc.citation XT.ID 구72 202602 -
dc.date.accepted 2026-01-19 -
dc.contributor.alternativeDepartment 전기전자컴퓨터공학과 -
dc.subject.keyword AI Accelerator, Low-precision Data Format, Block Floating Point (BFP), Microscaling (MX) Format, Large Language Model (LLM), Hardware-Software Co-design -
dc.contributor.affiliatedAuthor Jahyun Koo -
dc.contributor.affiliatedAuthor Gain Kim -
dc.contributor.affiliatedAuthor Jaeha Kung -
dc.contributor.alternativeName 구자현 -
dc.contributor.alternativeName Gain Kim -
dc.contributor.alternativeName 궁재하 -

File Downloads

  • There are no files associated with this item.
