
Full metadata record

DC Field Value Language
dc.contributor.author Park, Jongho -
dc.contributor.author Kwon, Hyuk-jun -
dc.contributor.author Kim, Seowoo -
dc.contributor.author Lee, Junyoung -
dc.contributor.author Ha, Minho -
dc.contributor.author Lim, Euicheol -
dc.contributor.author Imani, Mohsen -
dc.contributor.author Kim, Yeseong -
dc.date.accessioned 2023-12-26T18:13:00Z -
dc.date.available 2023-12-26T18:13:00Z -
dc.date.created 2022-09-23 -
dc.date.issued 2022-07-14 -
dc.identifier.isbn 9781450391429 -
dc.identifier.issn 0738-100X -
dc.identifier.uri http://hdl.handle.net/20.500.11750/46820 -
dc.description.abstract We have seen many successful deployments of deep learning accelerator designs on different platforms and technologies, e.g., FPGA, ASIC, and Processing In-Memory platforms. However, the size of the deep learning models keeps increasing, making computations a burden on the accelerators. A naive approach to resolve this issue is to design larger accelerators; however, it is not scalable due to high resource requirements, e.g., power consumption and off-chip memory sizes. A promising solution is to utilize multiple accelerators and use them as needed, similar to conventional multiprocessing. For example, for smaller networks, we may use a single accelerator, while we may use multiple accelerators with proper network partitioning for larger networks. However, partitioning DNN models into multiple parts leads to large communication overheads due to inter-layer communications. In this paper, we propose a scalable solution to accelerate DNN models on multiple devices by devising a new model partitioning technique. Our technique transforms a DNN model into layer-wise partitioned models using an autoencoder. Since the autoencoder encodes a tensor output into a smaller dimension, we can split the neural network model into multiple pieces while significantly reducing the communication overhead to pipeline them. Our evaluation results conducted on state-of-the-art deep learning models show that the proposed technique significantly improves performance and energy efficiency. Our solution increases performance and energy efficiency by up to 30.5% and 28.4% with minimal accuracy loss as compared to running the same model on pipelined multi-block accelerators without the autoencoder. © 2022 ACM. -
dc.language English -
dc.publisher Association for Computing Machinery -
dc.title QuiltNet: Efficient Deep Learning Inference on Multi-Chip Accelerators Using Model Partitioning -
dc.type Conference Paper -
dc.identifier.doi 10.1145/3489517.3530589 -
dc.identifier.scopusid 2-s2.0-85137456792 -
dc.identifier.bibliographicCitation Design Automation Conference, pp.1159 - 1164 -
dc.identifier.url https://59dac.conference-program.com/presentation/?id=RESEARCH505&sess=sess130 -
dc.citation.conferencePlace US -
dc.citation.conferencePlace San Francisco -
dc.citation.endPage 1164 -
dc.citation.startPage 1159 -
dc.citation.title Design Automation Conference -
Files in This Item:

There are no files associated with this item.

Appears in Collections:
Department of Electrical Engineering and Computer Science > Computation Efficient Learning Lab. > 2. Conference Papers


