<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:dc="http://purl.org/dc/elements/1.1/" version="2.0">
  <channel>
    <title>Repository Collection: null</title>
    <link>https://scholar.dgist.ac.kr/handle/20.500.11750/6302</link>
    <description />
    <pubDate>Sat, 04 Apr 2026 16:06:28 GMT</pubDate>
    <dc:date>2026-04-04T16:06:28Z</dc:date>
    <item>
      <title>Simplified Compressor and Encoder Designs for Low-Cost Approximate Radix-4 Booth Multiplier</title>
      <link>https://scholar.dgist.ac.kr/handle/20.500.11750/17484</link>
      <description>Title: Simplified Compressor and Encoder Designs for Low-Cost Approximate Radix-4 Booth Multiplier
Author(s): Park, Gunho; Kung, Jaeha; Lee, Youngjoo
Abstract: In this brief, we present a novel design methodology for cost-effective approximate radix-4 Booth multipliers, which can significantly reduce the power consumption of error-resilient signal processing tasks. In contrast to prior studies, which focus only on approximating either the partial product generation with encoders or the partial product reduction with compressors, the proposed method considers the two major processing steps jointly by forcing the generated error directions to be opposite to each other. As the internal errors are naturally balanced to a zero mean, the proposed approximate Booth multiplier minimizes the required processing energy for the same number of approximate bits compared to previous designs. Simulation results on FIR filtering and image classification applications reveal that the proposed approximate Booth multiplier offers the most attractive energy-performance trade-offs, achieving 28% and 34% energy reduction, respectively, compared to the exact Booth multiplier, with negligible accuracy loss.</description>
      <pubDate>Tue, 28 Feb 2023 15:00:00 GMT</pubDate>
      <guid isPermaLink="false">https://scholar.dgist.ac.kr/handle/20.500.11750/17484</guid>
      <dc:date>2023-02-28T15:00:00Z</dc:date>
    </item>
    <item>
      <title>SEMS: Scalable Embedding Memory System for Accelerating Embedding-Based DNNs</title>
      <link>https://scholar.dgist.ac.kr/handle/20.500.11750/17472</link>
      <description>Title: SEMS: Scalable Embedding Memory System for Accelerating Embedding-Based DNNs
Author(s): Kim, Sejin; Kim, Jungwoo; Jang, Yongjoo; Kung, Jaeha; Lee, Sungjin
Abstract: Embedding layers, which are widely used in various deep learning (DL) applications, are very large and continue to grow in size. We propose the scalable embedding memory system (SEMS) to handle inference for DL applications with a large embedding layer. SEMS is built from scalable embedding memory (SEM) modules, each of which includes an FPGA for acceleration. In SEMS, the scalable and versatile PCIe bus is used to expand system memory, and processing within the SEMs reduces the amount of data transferred from the SEMs to the host, improving the effective bandwidth of PCIe. To achieve better performance, we apply various optimization techniques at different levels. We also develop SEMlib, a Python library that makes SEMS convenient to use. We implement a proof-of-concept prototype of SEMS, which yields DLRM execution times 32.85x faster than a CPU-based system when there is insufficient DRAM to hold the entire embedding layer. © 2022 IEEE.</description>
      <pubDate>Thu, 30 Jun 2022 15:00:00 GMT</pubDate>
      <guid isPermaLink="false">https://scholar.dgist.ac.kr/handle/20.500.11750/17472</guid>
      <dc:date>2022-06-30T15:00:00Z</dc:date>
    </item>
    <item>
      <title>Implication of Optimizing NPU Dataflows on Neural Architecture Search for Mobile Devices</title>
      <link>https://scholar.dgist.ac.kr/handle/20.500.11750/17050</link>
      <description>Title: Implication of Optimizing NPU Dataflows on Neural Architecture Search for Mobile Devices
Author(s): Lee, Jooyeon; Park, Junsang; Lee, Seunghyun; Kung, Jaeha
Abstract: Recent advances in deep learning have made it possible to implement artificial intelligence on mobile devices. Many studies have put considerable effort into developing lightweight deep learning models optimized for mobile devices. To overcome the performance limitations of manually designed deep learning models, an automated search algorithm, called neural architecture search (NAS), has been proposed. However, the effect of the mobile device's hardware architecture on the performance of NAS remains less explored. In this article, we show the importance of optimizing a hardware architecture, namely the NPU dataflow, when searching for a more accurate yet fast deep learning model. To do so, we first implement an optimization framework, named FlowOptimizer, that generates the best possible NPU dataflow for a given deep learning operator. We then use this framework during latency-aware NAS to find the model with the highest accuracy that satisfies the latency constraint. As a result, we show that the model searched with FlowOptimizer improves performance by 87.1% and 92.3% on average over models searched with NVDLA and Eyeriss, respectively, with better accuracy on a proxy dataset. We also show that the searched model can be transferred to a larger model to classify a more complex image dataset, i.e., ImageNet, achieving 0.2%/5.4% higher Top-1/Top-5 accuracy than MobileNetV2-1.0 with 3.6x lower latency.</description>
      <pubDate>Wed, 31 Aug 2022 15:00:00 GMT</pubDate>
      <guid isPermaLink="false">https://scholar.dgist.ac.kr/handle/20.500.11750/17050</guid>
      <dc:date>2022-08-31T15:00:00Z</dc:date>
    </item>
    <item>
      <title>High-throughput Near-Memory Processing on CNNs with 3D HBM-like Memory</title>
      <link>https://scholar.dgist.ac.kr/handle/20.500.11750/16436</link>
      <description>Title: High-throughput Near-Memory Processing on CNNs with 3D HBM-like Memory
Author(s): Park, Naebeom; Ryu, Sungju; Kung, Jaeha; Kim, Jae-Joon
Abstract: This article discusses a high-performance near-memory neural network (NN) accelerator architecture utilizing the logic die in three-dimensional (3D) High Bandwidth Memory (HBM)-like memory. As most previously reported 3D-memory-based near-memory NN accelerator designs used Hybrid Memory Cube (HMC) memory, we first focus on identifying the key differences between HBM and HMC in terms of near-memory NN accelerator design. One major difference between the two 3D memories is that HBM has centralized through-silicon-via (TSV) channels, while HMC has TSV channels distributed across separate vaults. Based on this observation, we introduce the Round-Robin Data Fetching and Groupwise Broadcast schemes, which exploit the centralized TSV channels to improve the data feeding rate to the processing elements. Using designs synthesized in a 28-nm CMOS technology, we evaluate the performance and energy consumption of the proposed architectures with various dataflow models. Experimental results show that the proposed schemes reduce runtime by 16.4-39.3% on average and energy consumption by 2.1-5.1% on average compared to conventional data fetching schemes.</description>
      <pubDate>Sun, 31 Oct 2021 15:00:00 GMT</pubDate>
      <guid isPermaLink="false">https://scholar.dgist.ac.kr/handle/20.500.11750/16436</guid>
      <dc:date>2021-10-31T15:00:00Z</dc:date>
    </item>
  </channel>
</rss>

