Execution Platforms for Deep Learning


Automatic Parallelization of DNN Models

As the computation and memory requirements of deep neural network (DNN) models grow, multi-GPU and even multi-node training and inference have become essential. However, parallelizing a DNN training workload across multiple GPUs is challenging: it requires sophisticated knowledge of the underlying system architecture, which hinders users from developing large DNN models. To overcome this programming wall, we develop methods that automatically find the best parallelization scheme on a multi-node, multi-GPU system. We also develop a performance model to predict the performance of DNN workloads running on heterogeneous GPUs in a cluster.
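As a concrete illustration of the idea, the host-side sketch below (hypothetical code, not our actual system) enumerates candidate data-parallel/tensor-parallel configurations for a fixed number of GPUs and keeps the one with the lowest iteration time predicted by a toy analytical performance model. The model and all constants are illustrative assumptions.

    // Hypothetical sketch: pick the parallelization scheme with the lowest
    // predicted iteration time from a simple analytical performance model.
    #include <cstdio>
    #include <cfloat>

    struct Config { int dp, tp; };  // data-parallel x tensor-parallel degrees

    // Toy model: compute scales with the number of GPUs; gradient all-reduce
    // traffic appears with data parallelism, activation traffic with tensor
    // parallelism. The constants stand in for measured GPU throughput and
    // interconnect bandwidth.
    double predict_iter_time(Config c, double flops, double param_bytes) {
        const double gpu_flops = 100e12, link_bw = 25e9;
        double compute   = flops / (gpu_flops * c.dp * c.tp);
        double allreduce = (c.dp > 1) ? 2.0 * param_bytes / link_bw : 0.0;
        double tp_comm   = (c.tp > 1) ? 0.1 * param_bytes / link_bw : 0.0;
        return compute + allreduce + tp_comm;
    }

    int main() {
        const int num_gpus = 8;                        // example machine
        const double flops = 3e15, param_bytes = 4e9;  // example workload
        Config best{1, 1};
        double best_t = DBL_MAX;
        for (int dp = 1; dp <= num_gpus; ++dp) {
            if (num_gpus % dp != 0) continue;          // use all GPUs exactly
            Config c{dp, num_gpus / dp};
            double t = predict_iter_time(c, flops, param_bytes);
            if (t < best_t) { best_t = t; best = c; }
        }
        printf("best scheme: dp=%d tp=%d (%.3f s/iter predicted)\n",
               best.dp, best.tp, best_t);
        return 0;
    }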

DeepUM

Deep neural networks (DNNs) continue to get wider and deeper. As a result, they require tremendous amounts of GPU memory and computing power. We propose a framework called DeepUM that exploits CUDA Unified Memory (UM) to allow GPU memory oversubscription for DNNs. While conventional CUDA UM allows memory oversubscription through its page fault mechanism, page migration introduces enormous overhead. DeepUM uses a new correlation prefetching technique to hide the page migration overhead; it is fully automatic and transparent to users. We also propose two optimization techniques to minimize GPU fault handling time. We evaluate DeepUM using nine large-scale DNNs from MLPerf, PyTorch examples, and Hugging Face, and compare it with six state-of-the-art GPU memory swapping approaches. We are currently extending DeepUM to GPU clusters.
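The sketch below illustrates the underlying CUDA UM mechanism that DeepUM builds on: a managed allocation that may exceed GPU memory, on-demand page migration triggered by GPU page faults, and cudaMemPrefetchAsync to move pages before a kernel touches them. DeepUM issues such prefetches automatically from correlations between page faults; the code here is a hand-written example, not DeepUM itself.

    #include <cstdio>
    #include <cuda_runtime.h>

    __global__ void scale(float *x, size_t n, float a) {
        size_t i = blockIdx.x * (size_t)blockDim.x + threadIdx.x;
        if (i < n) x[i] *= a;   // first touch may trigger page migration
    }

    int main() {
        size_t n = 1ull << 28;  // 1 GiB of floats; may exceed GPU memory
        float *x;
        cudaMallocManaged(&x, n * sizeof(float));    // UM allows oversubscription
        for (size_t i = 0; i < n; ++i) x[i] = 1.0f;  // pages populated on the CPU

        int dev;
        cudaGetDevice(&dev);
        // Prefetch pages to the GPU before the kernel runs, hiding migration
        // latency; DeepUM decides what and when to prefetch automatically.
        cudaMemPrefetchAsync(x, n * sizeof(float), dev);
        scale<<<(unsigned)((n + 255) / 256), 256>>>(x, n, 2.0f);
        cudaDeviceSynchronize();
        printf("x[0] = %f\n", x[0]);  // CPU access faults pages back
        cudaFree(x);
        return 0;
    }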

Deep Learning Framework for FPGAs

As energy efficiency becomes one of the most important issues, FPGAs are emerging as promising accelerators for HPC systems. However, low-level hardware description languages and compilation flows built on arcane vendor-provided tools are still the primary ways to program FPGAs, and they require considerable expertise. This programming wall prevents FPGAs from being widely adopted as accelerators. To overcome this obstacle, we are building a full-stack framework for FPGAs. As a first step, we presented SOFF, a high-level synthesis framework that compiles OpenCL/CUDA for FPGAs. It requires no explicit user annotations while achieving high performance. We are also developing a Neural Processing Unit (NPU) and a deep learning framework for FPGAs that automatically generates an optimized circuit for a given deep learning model. With direct FPGA-to-FPGA communication technology, we plan to extend the framework to cluster systems.
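To illustrate the programming model, the kernel below is the kind of input such a high-level synthesis flow accepts: standard CUDA/OpenCL-style code with no FPGA-specific pragmas or vendor annotations. It is an illustrative example, not code from SOFF.

    // Plain kernel with no FPGA-specific annotations: an HLS flow in the
    // style of SOFF compiles such code directly into a pipelined circuit.
    __global__ void vec_add(const float *a, const float *b, float *c, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;  // one work-item per element
        if (i < n) c[i] = a[i] + b[i];
    }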

Mobile ISP Replacement through Deep Learning Model Compression

The latest mobile devices are equipped with high-performance cameras, increasing the image-processing workload. There have been studies on replacing the mobile image signal processor (ISP) with deep learning models such as PyNET. However, such models have a high computation cost and are hard to deploy on mobile devices. We propose model compression techniques for mobile ISP replacement, including Tucker decomposition, quantization, pruning, and knowledge distillation. We focus on combining these compression techniques to significantly reduce computation and memory costs without sacrificing model accuracy.
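As one concrete example of the listed techniques, the sketch below shows post-training symmetric int8 quantization: weights are mapped to 8-bit integers with a per-tensor scale, cutting weight memory by 4x versus fp32. The kernels and the scale choice are illustrative assumptions, not our actual compression pipeline.

    #include <cstdint>

    // Quantize fp32 weights to int8 with a per-tensor scale; a common choice
    // (assumed here) is scale = max(|w|) / 127 to cover the full weight range.
    __global__ void quantize_int8(const float *w, int8_t *q, float scale, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) {
            float v = w[i] / scale;                  // map to the int8 range
            v = fminf(fmaxf(v, -127.0f), 127.0f);    // clamp to representable values
            q[i] = (int8_t)lrintf(v);                // round to nearest integer
        }
    }

    // On use, the approximate weight is recovered as scale * q[i].
    __global__ void dequantize_int8(const int8_t *q, float *w, float scale, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) w[i] = scale * (float)q[i];
    }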

DeepCuts

Conventional deep learning (DL) frameworks, such as TensorFlow and PyTorch, rely heavily on libraries with fixed implementations (e.g., NVIDIA cuDNN), which makes it hard for them to handle diverse DNN models and GPU architectures. We propose DeepCuts, a DL optimization framework for versatile GPU workloads. Unlike conventional DL frameworks, it generates GPU kernels optimized for both the target DL workload and the target hardware architecture. We implement this idea using CUDA and OpenCL. DeepCuts achieves state-of-the-art performance compared with other DL optimization frameworks, such as TVM, TensorRT, and TensorFlow XLA.
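The sketch below illustrates the kernel generation idea: instead of calling a fixed library kernel, emit kernels specialized by tunable parameters (here, a compile-time tile size) and search for the variant that performs best for the target layer and GPU. The tiled matrix-multiply kernel is an illustrative stand-in, not actual DeepCuts output.

    // Kernel specialized by a compile-time tile size; a generator emits and
    // benchmarks several instantiations (e.g., TILE = 8, 16, 32) and keeps
    // the fastest one for the target GPU. Assumes N is a multiple of TILE.
    template <int TILE>
    __global__ void gemm_tiled(const float *A, const float *B, float *C, int N) {
        __shared__ float As[TILE][TILE], Bs[TILE][TILE];
        int row = blockIdx.y * TILE + threadIdx.y;
        int col = blockIdx.x * TILE + threadIdx.x;
        float acc = 0.0f;
        for (int t = 0; t < N; t += TILE) {  // stage tiles through shared memory
            As[threadIdx.y][threadIdx.x] = A[row * N + t + threadIdx.x];
            Bs[threadIdx.y][threadIdx.x] = B[(t + threadIdx.y) * N + col];
            __syncthreads();
            for (int k = 0; k < TILE; ++k)
                acc += As[threadIdx.y][k] * Bs[k][threadIdx.x];
            __syncthreads();
        }
        C[row * N + col] = acc;
    }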

SnuRHAC

We are working on SnuRHAC, a framework that provides the illusion of a single GPU over multiple GPUs in a cluster. SnuRHAC automatically distributes the workload and manages data across the nodes. We also propose several optimization techniques, such as prefetching, to maximize performance. SnuRHAC aims to achieve both ease of programming and high performance for GPU programming. Evaluating SnuRHAC with 18 applications from various sources indicates that it achieves scalable performance in the cluster environment, depending on application characteristics, while significantly reducing the programmer's burden.
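The sketch below shows the intended programming model: the user writes ordinary single-GPU CUDA code, and a runtime in the style of SnuRHAC intercepts the launch, partitions the thread blocks across the GPUs in the cluster, and keeps the data coherent, with prefetching hiding remote transfers. The program itself is an ordinary illustrative CUDA example, not SnuRHAC code.

    #include <cuda_runtime.h>

    __global__ void saxpy(float a, const float *x, float *y, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) y[i] = a * x[i] + y[i];
    }

    int main() {
        int n = 1 << 24;
        float *x, *y;
        cudaMallocManaged(&x, n * sizeof(float));
        cudaMallocManaged(&y, n * sizeof(float));
        for (int i = 0; i < n; ++i) { x[i] = 1.0f; y[i] = 2.0f; }
        // One logical launch; a single-GPU-illusion runtime can split these
        // thread blocks across many GPUs on many nodes behind this call.
        saxpy<<<(n + 255) / 256, 256>>>(3.0f, x, y, n);
        cudaDeviceSynchronize();
        cudaFree(x);
        cudaFree(y);
        return 0;
    }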