DGIST Scholar: Dynamic Precision Control of a DNN Training Accelerator Using RISC-V Processor

Collection: Department of Electrical Engineering and Computer Science > Theses > Master

Title: Dynamic Precision Control of a DNN Training Accelerator Using RISC-V Processor

Alternative Title: Dynamic Control Technique for an AI Training Accelerator Using a RISC-V Processor

Author(s): Jeik Choi

DGIST Authors: Jeik Choi ; Jaeha Kung ; Gain Kim

Advisor: Jaeha Kung

Co-Advisor(s): Gain Kim

Issued Date: 2023

Awarded Date: 2023-02-01

Type: Thesis

Description: DNN Training Accelerator, RISC-V, Instruction Extension, Block Floating Point, FPGA

Abstract: In designing accelerators for deep neural networks (DNNs), previous research has proposed various methods to reduce the cost of general matrix multiplication (GEMM). One of the most common is to reduce the data bit-width, which saves memory space and reduces the number of accesses to off-chip memory. This also saves energy, because accessing external memory requires much more energy than the GEMM computation itself. Unlike DNN inference, DNN training requires sufficiently precise data, such as floating-point data. Although previous research has tried to save hardware cost by reducing the data width, computation with floating-point data still costs more than computation with fixed-point data. This thesis proposes an accelerator that supports the Block Floating Point (BFP) data format. The BFP format allows the accelerator to compute with fixed-point arithmetic logic, so GEMM requires fewer cycles than on a conventional training accelerator that supports only floating-point data. The proposed accelerator can also be configured for different data precisions, and it can accommodate more data at lower precision. To configure the accelerator's block size and precision, a simple processor is added using Rocket-chip, a processor based on RISC-V, an open-source architecture. To make the two work together, the processor is extended with custom instructions. Additionally, a training program is designed that includes special functions to generate the custom instructions. The system is evaluated in simulation using tools provided by the RISC-V platform, and the hardware design is deployed on an FPGA, utilizing its Block RAM (BRAM). The evaluation compares the design against another accelerator: GEMMINI, a systolic-array-based accelerator available on the RISC-V platform.
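Finally, the thesis shows how much latency and hardware cost can be saved by using the BFP data format and controlling the precision dynamically.; In designing deep neural network accelerators, much research has addressed how to perform matrix multiplication efficiently. One of the proposed methods is to reduce the length of each data element. Reducing the data size not only reduces the required memory capacity but also increases the number of data elements obtained per external memory access, which reduces the number of memory accesses. In particular, because accessing external memory requires far more energy than the computation on the data, this is also a way to save power.
Unlike inference with a deep neural network, training a deep neural network requires data with sufficiently high precision, so floating-point formats are often used. The research mentioned above saved hardware cost by shortening floating-point data, but floating-point computation still costs more than fixed-point computation. The accelerator proposed in this thesis supports block floating point. With block floating point, fixed-point arithmetic units can be used for the computation. In addition, because the accelerator can change the precision of the data, it can accommodate and compute on more data when lower-precision data is used.
To create an environment in which the precision and block size applied to the accelerator can be configured, Rocket-chip, a RISC-V-based processor, was used. The system was built by adding instructions so that the processor and the accelerator can train deep neural networks together. To evaluate the accelerator's performance, a training program was designed with added functions that generate the new instructions.
The system was evaluated using tools provided by the RISC-V platform. It is compared against GEMMINI, an open-source systolic-array accelerator. The thesis then describes the process of implementing the designed system directly on an FPGA (Field Programmable Gate Array).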

Table Of Contents: I. INTRODUCTION
II. BACKGROUND
2.1 Deep Neural Networks
2.2 GEMM Accelerators
2.2.1 Systolic-Array-based Accelerator
2.2.2 BFP-based Accelerator
2.3 RISC-V Open-Source Processor
III. Proposed Hardware Design: BFP-based GEMM Accelerator
3.1 Processing Core
3.2 Reduction Unit
3.3 Interface with RISC-V Core
IV. Experimental Setup
4.1 C-program to do GEMM
V. Experimental Results
5.1 Training DNNs
5.2 Performance and Power Analysis
VI. CONCLUSION
References
Abstract (in Korean)