====== Beyond Peak Performance: Comparing the Real Performance of FPGAs and GPUs on Deep Learning Workloads, Eriko Nurvitadhi (Intel) ======

Monday, December 7, 2020\\
Location: [[https://cmu.zoom.us/j/92742812874?pwd=RkJrSmxPUUVhWGZkcytTZkkyVENTdz09|Zoom]]\\
Time: 1:30PM-2:30PM\\

=====Abstract=====

The growing importance and compute demands of artificial intelligence (AI) have led to the emergence of AI-optimized hardware platforms. For example, Nvidia GPUs introduced specialized tensor cores for matrix operations to speed up deep learning (DL) computation, delivering a peak throughput of up to 130 int8 TOPS in the T4 GPU. Recently, Intel introduced an AI-optimized 14nm FPGA, the Intel® Stratix® 10 NX, with in-fabric AI tensor blocks that offer an estimated peak performance of up to 143 int8 TOPS, comparable to 12nm GPUs. However, what matters in practice is not just the peak throughput but the achievable real performance on target workloads. This depends mainly on the utilization of the tensor units on the device and on the system-level overheads of moving data to and from the accelerator.
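The relationship between peak and achieved performance can be sketched with a back-of-the-envelope model (illustrative only, not from the talk; all numbers and function names here are hypothetical): effective throughput is the peak scaled by tensor-unit utilization, and end-to-end time adds the cost of moving data over the host link.

```python
# Illustrative model (not from the talk): achieved throughput is peak
# throughput scaled by tensor-unit utilization, and end-to-end latency
# adds the time to move data over the accelerator's link.

def achieved_tops(peak_tops: float, utilization: float) -> float:
    """Effective compute throughput given tensor-unit utilization (0..1)."""
    return peak_tops * utilization

def end_to_end_seconds(ops: float, peak_tops: float, utilization: float,
                       transfer_bytes: float, link_gbps: float) -> float:
    """Compute time at achieved throughput plus data-transfer time."""
    compute_s = ops / (achieved_tops(peak_tops, utilization) * 1e12)
    transfer_s = transfer_bytes * 8 / (link_gbps * 1e9)
    return compute_s + transfer_s

# Hypothetical example: a 130-TOPS peak device running at 20% utilization
# delivers only about 26 effective TOPS.
print(round(achieved_tops(130, 0.20), 3))
```

At small batch sizes, utilization on wide GPU tensor cores tends to drop, which is one reason a device with a lower peak can win on real latency-sensitive workloads.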

In this talk, I will first discuss trends in AI and FPGAs. Then, I will talk about the Stratix 10 NX FPGA (S10-NX). Next, I will present our research AI soft processor overlay (NPU) and its optimizations for the S10-NX. Finally, I will report an evaluation of the NPU on the S10-NX in comparison to GPUs (Nvidia T4 and V100), at both the device and system levels, for a suite of real-time AI inference workloads (RNNs, GRUs, LSTMs, MLPs). Our study shows that the S10-NX achieves 24× and 12× average compute speedups over the T4 and V100 GPUs at batch size 6. At the system level, the FPGA's fine-grained flexibility with integrated 100 Gbps Ethernet allows for remote access at 10× and 2× lower system latency than local access to a GPU via 128 Gbps PCIe, for short and long sequence RNNs, respectively.

=====Bio=====

Eriko Nurvitadhi is a senior research scientist in the CTO office of the Programmable Solutions Group at Intel. He leads FPGA external/academic and internal research programs. His research focuses on hardware accelerator architectures (e.g., FPGAs, ASICs) for AI and data analytics. He has over 50 academic publications and 15 patents issued in this area. His research has contributed to Intel's FPGA and ASIC solutions for AI. At Intel, he has received awards for his contributions to co-founding and growing the FPGA academic program, as well as to next-generation FPGA technology. He received his PhD in Electrical and Computer Engineering from Carnegie Mellon University.