With the rapid advancement of hardware accelerators, the bottleneck in deep learning systems is shifting from computation to I/O. This bottleneck is especially visible in transfer learning with a fixed feature extractor, because feature extraction is an I/O-intensive task. Due to the limited bandwidth of the Direct Media Interface (DMI), GPUs cannot reach their full performance despite the advent of high-performance SSDs.
To address the problem that transfer learning performance is limited by the DMI bandwidth, we propose a novel transfer learning system that adopts in-storage processing: feature extraction is executed inside the SSDs by a high-performance mobile GPU. Because feature extraction can be executed in parallel, running it across aggregated SSDs is fast and scalable. Moreover, the extraction can be accelerated by applying optimization techniques such as 16-bit floating-point quantization, layer fusion, kernel auto-tuning, removal of transformation overhead, and data prefetching. With six aggregated SSDs, our proposed system matches the performance of a conventional GPU system in a power-efficient way.
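To illustrate why extraction across aggregated SSDs scales, the following is a minimal host-side sketch of the dispatch logic. It assumes a hypothetical ComputationalSSD interface with an extract_features call standing in for the GPU embedded in each SSD; all names and the placeholder FP16 "extractor" are illustrative assumptions, not the authors' implementation.

```python
# Hypothetical sketch: parallel feature extraction across computational SSDs.
from concurrent.futures import ThreadPoolExecutor

import numpy as np


class ComputationalSSD:
    """Hypothetical stand-in for one SSD with an embedded mobile GPU."""

    def __init__(self, ssd_id: int):
        self.ssd_id = ssd_id

    def extract_features(self, batch: np.ndarray) -> np.ndarray:
        # In the real system, a fixed feature extractor (e.g. a CNN backbone
        # quantized to FP16) runs inside the SSD. Here we only emulate the
        # 16-bit compute and return placeholder feature vectors.
        batch_fp16 = batch.astype(np.float16)   # 16-bit floating-point quantization
        return batch_fp16.mean(axis=(2, 3))     # placeholder "features"


def extract_in_parallel(ssds, shards):
    """Dispatch one data shard to each SSD.

    Each SSD extracts features from its own locally stored data, so
    throughput scales with the number of aggregated SSDs and the host
    only receives the (much smaller) feature vectors over DMI.
    """
    with ThreadPoolExecutor(max_workers=len(ssds)) as pool:
        futures = [pool.submit(s.extract_features, shard)
                   for s, shard in zip(ssds, shards)]
        return np.concatenate([f.result() for f in futures])


if __name__ == "__main__":
    ssds = [ComputationalSSD(i) for i in range(6)]            # 6 aggregated SSDs
    shards = [np.random.rand(32, 3, 224, 224) for _ in ssds]  # one image batch per SSD
    features = extract_in_parallel(ssds, shards)
    print(features.shape)  # (192, 3) with the placeholder extractor
```

The key design point the sketch captures is that only compact feature vectors cross the DMI link, while the raw, I/O-heavy image data never leaves the SSDs.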