Fault-Tolerance in CXL-Based Memory Pooling Environments

DC Field Value Language
dc.contributor.advisor Hoon Sung Chwa (좌훈승) -
dc.contributor.author Kyeongmin Kim -
dc.date.accessioned 2026-01-23T10:57:06Z -
dc.date.available 2026-01-24T06:00:39Z -
dc.date.issued 2026 -
dc.identifier.uri https://scholar.dgist.ac.kr/handle/20.500.11750/59725 -
dc.identifier.uri http://dgist.dcollection.net/common/orgView/200000945429 -
dc.description CXL, Disaggregated Memory Systems, Fault Tolerance, Erasure Coding -
dc.description.abstract Datacenters adopt memory disaggregation to cut TCO and raise utilization by pooling CXL memory and elastically allocating it across hosts. CXL makes this practical by enabling load/store access to pooled memory with lower latency than RDMA. However, when many hosts share a CXL memory pool, a single-device failure can impact multiple hosts, so fault tolerance is essential. Furthermore, CXL's HW-managed load/store path leaves little room for software to synchronize parity on writes. This paper addresses this with a HW/SW co-design: the CXL switch performs in-line parity updates, while the kernel manages coding groups, enhanced by 2 MB THP, a huge-page-aware CXL switch SRAM cache, and hotness-based THP placement. In trace-driven evaluation, our design provides fault tolerance with only ~11% latency overhead on write-intensive and ~1% on read-intensive workloads relative to Ideal (no fault tolerance).|[Translated from the Korean abstract] Modern datacenters adopt memory disaggregation, which pools and elastically allocates memory, to reduce TCO and improve memory utilization. However, when many hosts share the same CXL device, the impact of a single-device failure spreads more widely, so fault tolerance is essential. In CXL environments in particular, accesses to CXL devices are handled in hardware, which makes it difficult for software to intervene and synchronize parity. This thesis addresses this with a HW/SW co-design in which the CXL switch performs in-line parity updates and the kernel manages coding-group metadata. In addition, 2 MB THP, a huge-page-aware switch SRAM cache, and page-frequency-based THP placement reduce metadata I/O and the burden on the write path. In trace-based evaluation, the proposed technique provides reliable fault tolerance with only about 10% latency overhead on write-intensive workloads and 1% on read-intensive workloads relative to Ideal (no fault tolerance). -
dc.description.tableofcontents I. Introduction 1
II. Background and Related Work 2
2.1 Compute Express Link for Disaggregated Memory System 2
2.2 Fault Tolerance in Disaggregated Memory System 3
2.3 Erasure Coding 4
III. Design 4
3.1 Hardware-based Design 5
3.1.1 Overall System Structure 5
3.1.2 Drawback of HW-based Design 6
3.2 Software-based Design 7
3.2.1 Overall System Structure 7
3.2.2 Drawback of SW-based Design 8
3.3 HW/SW Co-design for Fault Tolerance 9
3.3.1 Comparison of HW and SW Design 9
3.3.2 Overall System Structure 10
3.3.3 Drawback of HW/SW Co-Design 11
3.3.4 2MB Huge Page-based Optimization 12
3.3.5 Hotness-based Transparent Huge Page 13
IV. Implementation and Experimental Setup 15
4.1 Methodology 15
4.2 Workloads 16
4.3 Configurations 16
V. Evaluation 17
5.1 Overall Performance 17
5.2 Detail Performance Breakdown 18
5.2.1 Performance improvement with each technique 18
5.2.2 CXL-Switch SRAM Cache Hit 19
5.2.3 Proper Data placement in Tiered Memory System 20
5.3 Full-node Recovery 20
VI. Discussion and Future Work 21
VII. Conclusion 22
-
dc.format.extent 26 -
dc.language eng -
dc.publisher DGIST -
dc.title Fault-Tolerance in CXL-Based Memory Pooling Environments -
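dc.title.alternative Compute Express Link (CXL) 기반 메모리 풀링 환경에서의 장애 허용 메커니즘 [English: Fault-Tolerance Mechanism in Compute Express Link (CXL)-Based Memory Pooling Environments] -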
dc.type Thesis -
dc.identifier.doi 10.22677/THESIS.200000945429 -
dc.description.degree Master -
dc.contributor.department Artificial Intelligence Major -
dc.date.awarded 2026-02-01 -
dc.publisher.location Daegu -
dc.description.database dCollection -
dc.citation XT.AM 김14 202602 -
dc.date.accepted 2026-01-19 -
dc.contributor.alternativeDepartment Department of Interdisciplinary Studies, Artificial Intelligence Major (학제학과인공지능전공) -
dc.subject.keyword CXL, Disaggregated Memory Systems, Fault Tolerance, Erasure Coding -
dc.contributor.affiliatedAuthor Kyeongmin Kim -
dc.contributor.affiliatedAuthor Hoon Sung Chwa -
dc.contributor.alternativeName 김경민 -
dc.contributor.alternativeName Hoon Sung Chwa -
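
The in-line parity update mentioned in the abstract can be illustrated with a minimal sketch. The record does not say which erasure code the thesis uses, so this example assumes the simplest case, a single XOR parity block per coding group. All function names are hypothetical and are not taken from the thesis. The point it shows is that a write can refresh parity from the old and new contents of the written block alone, without reading the rest of the group.

```python
from functools import reduce


def xor_bytes(a: bytes, b: bytes) -> bytes:
    """Byte-wise XOR of two equally sized blocks."""
    if len(a) != len(b):
        raise ValueError("blocks must have the same length")
    return bytes(x ^ y for x, y in zip(a, b))


def build_parity(blocks: list[bytes]) -> bytes:
    """Parity of a coding group: XOR of all data blocks (assumed scheme)."""
    return reduce(xor_bytes, blocks)


def inline_parity_update(parity: bytes, old_data: bytes, new_data: bytes) -> bytes:
    """Update parity on a write using only the delta of the written block:
    P_new = P_old XOR D_old XOR D_new."""
    return xor_bytes(parity, xor_bytes(old_data, new_data))


def recover_block(surviving: list[bytes], parity: bytes) -> bytes:
    """Rebuild one lost data block from the surviving blocks and the parity."""
    return reduce(xor_bytes, surviving, parity)


# Hypothetical coding group of three 8-byte blocks, one per memory device.
group = [bytes([value] * 8) for value in (1, 2, 3)]
parity = build_parity(group)

# A write to block 1 updates the parity in-line, then the data itself.
new_block = bytes([9] * 8)
parity = inline_parity_update(parity, group[1], new_block)
group[1] = new_block

# Simulate losing the device that holds block 1 and rebuild its contents.
recovered = recover_block([group[0], group[2]], parity)
```

In this sketch the update touches only the written block and the parity block, which is the property that makes it plausible to perform in the write path. A code that tolerates more than one failure would need a different update rule.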