<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:dc="http://purl.org/dc/elements/1.1/" version="2.0">
  <channel>
    <title>Repository Collection: null</title>
    <link>https://scholar.dgist.ac.kr/handle/20.500.11750/6302</link>
    <description />
    <pubDate>Sat, 04 Apr 2026 16:06:28 GMT</pubDate>
    <dc:date>2026-04-04T16:06:28Z</dc:date>
    <item>
      <title>Simplified Compressor and Encoder Designs for Low-Cost Approximate Radix-4 Booth Multiplier</title>
      <link>https://scholar.dgist.ac.kr/handle/20.500.11750/17484</link>
      <description>Title: Simplified Compressor and Encoder Designs for Low-Cost Approximate Radix-4 Booth Multiplier
Author(s): Park, Gunho; Kung, Jaeha; Lee, Youngjoo
Abstract: In this brief, we present a novel design methodology for cost-effective approximate radix-4 Booth multipliers, which can significantly reduce the power consumption of error-resilient signal processing tasks. In contrast to prior studies, which focus only on approximating either the partial product generation with encoders or the partial product reduction with compressors, the proposed method considers the two major processing steps jointly by forcing the generated error directions to be opposite to each other. As the internal errors are naturally balanced to a zero mean, the proposed approximate Booth multiplier minimizes the required processing energy for the same number of approximate bits compared to previous designs. Simulation results on FIR filtering and image classification applications reveal that the proposed approximate Booth multiplier offers the most attractive energy-performance trade-offs, achieving 28% and 34% energy reduction, respectively, compared to the exact Booth multiplier, with negligible accuracy loss.</description>
      <pubDate>Tue, 28 Feb 2023 15:00:00 GMT</pubDate>
      <guid isPermaLink="false">https://scholar.dgist.ac.kr/handle/20.500.11750/17484</guid>
      <dc:date>2023-02-28T15:00:00Z</dc:date>
    </item>
    <item>
      <title>SEMS: Scalable Embedding Memory System for Accelerating Embedding-Based DNNs</title>
      <link>https://scholar.dgist.ac.kr/handle/20.500.11750/17472</link>
      <description>Title: SEMS: Scalable Embedding Memory System for Accelerating Embedding-Based DNNs
Author(s): Kim, Sejin; Kim, Jungwoo; Jang, Yongjoo; Kung, Jaeha; Lee, Sungjin
Abstract: Embedding layers, which are widely used in various deep learning (DL) applications, are very large and continue to grow in size. We propose the scalable embedding memory system (SEMS) to handle inference for DL applications with a large embedding layer. SEMS is built from scalable embedding memory (SEM) modules, each of which includes an FPGA for acceleration. In SEMS, the scalable and versatile PCIe bus is used to expand system memory, and processing within the SEMs reduces the amount of data transferred from the SEMs to the host, improving the effective bandwidth of PCIe. To achieve better performance, we apply various optimization techniques at different levels. We also develop SEMlib, a Python library that makes SEMS convenient to use. We implement a proof-of-concept prototype of SEMS, which yields DLRM execution times 32.85x faster than a CPU-based system when there is insufficient DRAM to hold the entire embedding layer. © 2022 IEEE.</description>
      <pubDate>Thu, 30 Jun 2022 15:00:00 GMT</pubDate>
      <guid isPermaLink="false">https://scholar.dgist.ac.kr/handle/20.500.11750/17472</guid>
      <dc:date>2022-06-30T15:00:00Z</dc:date>
    </item>
    <item>
      <title>Implication of Optimizing NPU Dataflows on Neural Architecture Search for Mobile Devices</title>
      <link>https://scholar.dgist.ac.kr/handle/20.500.11750/17050</link>
      <description>Title: Implication of Optimizing NPU Dataflows on Neural Architecture Search for Mobile Devices
Author(s): Lee, Jooyeon; Park, Junsang; Lee, Seunghyun; Kung, Jaeha
Abstract: Recent advances in deep learning have made it possible to implement artificial intelligence on mobile devices. Many studies have put considerable effort into developing lightweight deep learning models optimized for mobile devices. To overcome the performance limitations of manually designed deep learning models, an automated search algorithm, called neural architecture search (NAS), has been proposed. However, the effect of the mobile device's hardware architecture on the performance of NAS remains less explored. In this article, we show the importance of optimizing a hardware architecture, namely the NPU dataflow, when searching for a more accurate yet fast deep learning model. To do so, we first implement an optimization framework, named FlowOptimizer, that generates the best possible NPU dataflow for a given deep learning operator. We then use this framework during latency-aware NAS to find the model with the highest accuracy that satisfies the latency constraint. As a result, we show that the model searched with FlowOptimizer improves performance by 87.1% and 92.3% on average over models searched with NVDLA and Eyeriss, respectively, with better accuracy on a proxy dataset. We also show that the searched model can be transferred to a larger model to classify a more complex image dataset, i.e., ImageNet, achieving 0.2%/5.4% higher Top-1/Top-5 accuracy than MobileNetV2-1.0 with 3.6x lower latency.</description>
      <pubDate>Wed, 31 Aug 2022 15:00:00 GMT</pubDate>
      <guid isPermaLink="false">https://scholar.dgist.ac.kr/handle/20.500.11750/17050</guid>
      <dc:date>2022-08-31T15:00:00Z</dc:date>
    </item>
    <item>
      <title>High-throughput Near-Memory Processing on CNNs with 3D HBM-like Memory</title>
      <link>https://scholar.dgist.ac.kr/handle/20.500.11750/16436</link>
      <description>Title: High-throughput Near-Memory Processing on CNNs with 3D HBM-like Memory
Author(s): Park, Naebeom; Ryu, Sungju; Kung, Jaeha; Kim, Jae-Joon
Abstract: This article discusses a high-performance near-memory neural network (NN) accelerator architecture utilizing the logic die in three-dimensional (3D) High Bandwidth Memory (HBM)-like memory. As most previously reported 3D-memory-based near-memory NN accelerator designs used Hybrid Memory Cube (HMC) memory, we first focus on identifying the key differences between HBM and HMC in terms of near-memory NN accelerator design. One major difference between the two 3D memories is that HBM has centralized through-silicon-via (TSV) channels, while HMC has TSV channels distributed across separate vaults. Based on this observation, we introduce the Round-Robin Data Fetching and Groupwise Broadcast schemes, which exploit the centralized TSV channels to improve the data feeding rate to the processing elements. Using designs synthesized in a 28-nm CMOS technology, we evaluate the performance and energy consumption of the proposed architectures with various dataflow models. Experimental results show that the proposed schemes reduce runtime by 16.4-39.3% on average and energy consumption by 2.1-5.1% on average compared to conventional data fetching schemes.</description>
      <pubDate>Sun, 31 Oct 2021 15:00:00 GMT</pubDate>
      <guid isPermaLink="false">https://scholar.dgist.ac.kr/handle/20.500.11750/16436</guid>
      <dc:date>2021-10-31T15:00:00Z</dc:date>
    </item>
  </channel>
</rss>

