Beyond Peak Performance: Comparing the Real Performance of FPGAs and GPUs on Deep Learning Workloads, Eriko Nurvitadhi (Intel)

Monday, December 7, 2020
Location: Zoom
Time: 1:30PM-2:30PM

Abstract

The growing importance and compute demands of artificial intelligence (AI) have led to the emergence of AI-optimized hardware platforms. For example, Nvidia GPUs introduced specialized tensor cores for matrix operations to speed up deep learning (DL) computation, resulting in a high peak throughput of up to 130 int8 TOPS in the T4 GPU. Recently, Intel introduced an AI-optimized 14nm FPGA, the Intel® Stratix® 10 NX, with in-fabric AI tensor blocks that offer an estimated peak performance of up to 143 int8 TOPS, comparable to 12nm GPUs. However, what matters in practice is not just the peak throughput but the real performance achievable on target workloads. This depends mainly on the utilization of the tensor units on the device and on the system-level overheads of sending data to and from the accelerator.
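As a rough illustration of the distinction between peak and achieved performance, the sketch below scales a peak throughput by a tensor-unit utilization and adds a data-transfer overhead to the compute time. The function names, the simple multiplicative model, and any utilization or transfer values passed in are assumptions made for illustration; they are not the evaluation methodology or results from the talk. Only the peak figures (130 and 143 int8 TOPS) come from the abstract.

```python
def effective_tops(peak_tops: float, utilization: float) -> float:
    """Achieved throughput in TOPS: peak scaled by the fraction of
    tensor-unit capacity doing useful work (hypothetical model)."""
    if not 0.0 <= utilization <= 1.0:
        raise ValueError("utilization must be in [0, 1]")
    return peak_tops * utilization


def end_to_end_latency_s(total_ops: float, peak_tops: float,
                         utilization: float, transfer_s: float) -> float:
    """Compute time at the achieved throughput plus the time spent
    moving data to and from the accelerator (hypothetical model)."""
    achieved = effective_tops(peak_tops, utilization)
    if achieved == 0.0:
        raise ValueError("achieved throughput is zero")
    compute_s = total_ops / (achieved * 1e12)  # TOPS -> ops per second
    return compute_s + transfer_s


# Peak values are from the abstract; the 0.5 utilization is made up.
t4_achieved = effective_tops(130.0, 0.5)
nx_achieved = effective_tops(143.0, 0.5)
```

Under this model, a device with a lower peak can still finish sooner if it keeps its tensor units busier or pays less transfer overhead, which is the comparison the abstract argues should be made.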

In this talk, I will first discuss trends in AI and FPGAs. Then, I will talk about the Stratix 10 NX FPGA (S10-NX). Next, I will present our research AI soft processor overlay (NPU) and its optimizations for the S10-NX. Finally, I will report an evaluation of the NPU on the S10-NX in comparison to GPUs (Nvidia T4 and V100), at both the device and system levels, for a suite of real-time AI inference workloads (RNNs, GRUs, LSTMs, MLPs). Our study shows that the S10-NX achieves 24× and 12× average compute speedups over the T4 and V100 GPUs, respectively, at batch-6. At the system level, the FPGA’s fine-grained flexibility with integrated 100 Gbps Ethernet allows remote access at 10× and 2× lower system latency than local access to a GPU via 128 Gbps PCIe, for short- and long-sequence RNNs, respectively.

Bio

Eriko Nurvitadhi is a senior research scientist in the CTO office of the Programmable Solutions Group at Intel. He leads FPGA external/academic and internal research programs. His research focuses on hardware accelerator architectures (e.g., FPGAs, ASICs) for AI and data analytics. He has over 50 academic publications and 15 issued patents in this area. His research has contributed to Intel’s FPGA and ASIC solutions for AI. At Intel, he has received awards for co-founding and growing the FPGA academic program, as well as for his contributions to next-generation FPGA technology. He received his PhD in Electrical and Computer Engineering from Carnegie Mellon University.