<?xml version="1.0" encoding="UTF-8"?>
<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns="http://purl.org/rss/1.0/" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel rdf:about="https://scholar.dgist.ac.kr/handle/20.500.11750/848">
    <title>Repository Collection: null</title>
    <link>https://scholar.dgist.ac.kr/handle/20.500.11750/848</link>
    <description />
    <items>
      <rdf:Seq>
        <rdf:li rdf:resource="https://scholar.dgist.ac.kr/handle/20.500.11750/47772" />
        <rdf:li rdf:resource="https://scholar.dgist.ac.kr/handle/20.500.11750/46712" />
        <rdf:li rdf:resource="https://scholar.dgist.ac.kr/handle/20.500.11750/17490" />
        <rdf:li rdf:resource="https://scholar.dgist.ac.kr/handle/20.500.11750/15436" />
      </rdf:Seq>
    </items>
    <dc:date>2026-04-04T20:55:28Z</dc:date>
  </channel>
  <item rdf:about="https://scholar.dgist.ac.kr/handle/20.500.11750/47772">
    <title>T-CAT: Dynamic Cache Allocation for Tiered Memory Systems with Memory Interleaving</title>
    <link>https://scholar.dgist.ac.kr/handle/20.500.11750/47772</link>
    <description>Title: T-CAT: Dynamic Cache Allocation for Tiered Memory Systems with Memory Interleaving
Author(s): Lee, Hwanjun; Lee, Seunghak; Jung, Yeji; Kim, Daehoon
Abstract: New memory interconnect technologies, such as Intel's Compute Express Link (CXL), help expand memory bandwidth and capacity by adding CPU-less NUMA nodes to the main memory system, addressing the growing memory wall challenge. Consequently, modern computing systems embrace heterogeneity in memory, composing tiered memory systems with near and far memory (e.g., local DRAM and CXL-DRAM). However, applying NUMA interleaving, which can improve performance by exploiting node-level parallelism and aggregate bandwidth, to tiered memory systems can face challenges due to the difference in access latency between the two types of memory, leading to potential performance degradation for memory-intensive workloads. To tackle these challenges, we first investigate the effects of NUMA interleaving on the performance of tiered memory systems. We observe that while NUMA interleaving is essential for applications demanding high memory bandwidth, it can negatively impact the performance of applications demanding low memory bandwidth. Next, we propose a dynamic cache management scheme, called T-CAT, which partitions the last-level cache between near and far memory, aiming to mitigate the performance degradation caused by far-memory accesses. T-CAT attempts to reduce the difference in average access latency between near and far memory by re-sizing the cache partitions. Through dynamic cache management, T-CAT can preserve the performance benefits of NUMA interleaving while mitigating the performance degradation caused by far-memory accesses. Our experimental results show that T-CAT improves performance by up to 17% compared to NUMA interleaving without cache management. © 2023 IEEE</description>
    <dc:date>2023-06-30T15:00:00Z</dc:date>
  </item>
  <item rdf:about="https://scholar.dgist.ac.kr/handle/20.500.11750/46712">
    <title>NoHammer: Preventing Row Hammer with Last-Level Cache Management</title>
    <link>https://scholar.dgist.ac.kr/handle/20.500.11750/46712</link>
    <description>Title: NoHammer: Preventing Row Hammer with Last-Level Cache Management
Author(s): Lee, Seunghak; Kang, Ki-Dong; Park, Gyeongseo; Kim, Nam Sung; Kim, Daehoon
Abstract: Row Hammer (RH) is a circuit-level phenomenon in which repetitive activation of a DRAM row causes bit-flips in adjacent rows. Prior studies that rely on extra refreshes to mitigate the RH vulnerability demonstrate that bit-flips can be prevented effectively. However, their implementation is challenging due to the significant performance degradation and energy overhead caused by the additional refreshes. To overcome these challenges, some studies propose techniques that mitigate RH attacks without relying on extra refreshes, such as delaying the activation of an aggressor row for a certain amount of time or swapping an aggressor row with another row to isolate it from victim rows. Although such techniques do not require extra refreshes, the activation-delaying technique may incur high performance degradation in false-positive cases, and the swapping technique requires high storage overheads to track swap information. We propose NoHammer, an efficient RH mitigation technique that prevents the bit-flips caused by RH attacks by utilizing Last-Level Cache (LLC) management. NoHammer temporarily extends the associativity of the targeted cache set by utilizing another cache set as an extended set, and keeps the cache lines of aggressor rows in the extended set under an eviction-based RH attack. Along with a modification of the LLC replacement policy, NoHammer ensures that the aggressor row&apos;s cache lines are not evicted from the LLC under an RH attack. In our evaluation, we demonstrate that NoHammer achieves 6% higher performance than a baseline without any RH mitigation by replacing the excessive cache misses caused by RH attacks with LLC hits through sophisticated LLC management, while requiring 45% less storage than prior proposals. © 2023 IEEE.</description>
    <dc:date>2023-06-30T15:00:00Z</dc:date>
  </item>
  <item rdf:about="https://scholar.dgist.ac.kr/handle/20.500.11750/17490">
    <title>CoreNap: Energy Efficient Core Allocation for Latency-Critical Workloads</title>
    <link>https://scholar.dgist.ac.kr/handle/20.500.11750/17490</link>
    <description>Title: CoreNap: Energy Efficient Core Allocation for Latency-Critical Workloads
Author(s): Park, Gyeongseo; Kang, Ki-Dong; Kim, Minho; Kim, Daehoon
Abstract: In data-center servers, dynamic core allocation for Latency-Critical (LC) applications can play a crucial role in improving energy efficiency under Service Level Objective (SLO) constraints, allowing cores to enter idle states (i.e., C-states) that consume less power by turning off parts of the processor's hardware components. However, prior studies focus on core allocation for application threads while not considering the cores involved in network packet processing, even though packet processing considerably affects not only response latency but also energy consumption. In this paper, we first investigate the impact of explicit core allocation for network packet processing on tail response latency and energy consumption while running LC applications. We observe that co-adjusting the number of cores for network packet processing along with the number of cores for LC application threads can improve energy efficiency substantially, compared with adjusting the number of cores only for application threads, as prior studies do. We then propose a dynamic core allocation scheme, called CoreNap, which allocates/de-allocates cores for both LC application threads and packet processing. CoreNap measures the CPU utilization of application threads and packet processing individually, and predicts the response latency and power consumption of each candidate core-allocation combination via a lightweight prediction model. Based on the prediction, CoreNap chooses and enforces the most energy-efficient combination. Our experimental results show that CoreNap reduces energy consumption by up to 18.6% compared with a state-of-the-art approach that adjusts cores only for LC application threads in parallel packet-processing environments. © IEEE</description>
    <dc:date>2022-12-31T15:00:00Z</dc:date>
  </item>
  <item rdf:about="https://scholar.dgist.ac.kr/handle/20.500.11750/15436">
    <title>Deep Partitioned Training from Near-Storage Computing to DNN Accelerators</title>
    <link>https://scholar.dgist.ac.kr/handle/20.500.11750/15436</link>
    <description>Title: Deep Partitioned Training from Near-Storage Computing to DNN Accelerators
Author(s): Jang, Yongjoo; Kim, Sejin; Kim, Daehoon; Lee, Sungjin; Kung, Jaeha
Abstract: In this paper, we present deep partitioned training to accelerate the computations involved in training DNN models. This is the first work that partitions a DNN model across storage devices, an NPU, and a host CPU, forming a unified compute node for training workloads. To validate the benefit of the proposed system during DNN training, a trace-based simulator or an FPGA prototype is used to estimate the overall performance and to determine the partition layer index that provides the minimum latency. As a case study, we select two benchmarks, i.e., vision-related tasks and a recommendation system. As a result, training time is reduced by 12.2~31.0% with four near-storage computing devices in the vision-related tasks with a mini-batch size of 512, and by 40.6~44.7% with one near-storage computing device in the selected recommendation system with a mini-batch size of 64. CC BY</description>
    <dc:date>2020-12-31T15:00:00Z</dc:date>
  </item>
</rdf:RDF>

