====== Beyond Peak Performance: Comparing the Real Performance of FPGAs and GPUs on Deep Learning Workloads, Eriko Nurvitadhi (Intel) ======

Monday, December 7, 2020\\
Location: [[https://cmu.zoom.us/j/92742812874?pwd=RkJrSmxPUUVhWGZkcytTZkkyVENTdz09 |Zoom]]\\
Time: 1:30PM-2:30PM\\


=====Abstract=====

The growing importance and compute demands of artificial intelligence (AI) have led to the emergence of AI-optimized hardware platforms. For example, Nvidia GPUs introduced specialized tensor cores for matrix operations to speed up deep learning (DL) computation, resulting in a high peak throughput of up to 130 int8 TOPS in the T4 GPU. Recently, Intel introduced an AI-optimized 14nm FPGA, the Intel® Stratix® 10 NX, with in-fabric AI tensor blocks that offer an estimated peak performance of up to 143 int8 TOPS, comparable to 12nm GPUs. However, what matters in practice is not just the peak throughput but the achievable real performance on target workloads. This depends mainly on the utilization of the tensor units on the device and on the system-level overheads of sending data to and from the accelerator.

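The distinction drawn above between peak and achievable performance can be sketched with a back-of-envelope model (my own illustration, not from the talk, with hypothetical utilization figures): effective throughput is the peak rate scaled by tensor-unit utilization, and end-to-end latency adds data-movement overhead on top of compute time.

```python
# Illustrative model only (not from the talk): achievable performance
# depends on tensor-unit utilization and accelerator I/O overheads.

def effective_tops(peak_tops: float, utilization: float) -> float:
    """Achievable throughput = peak throughput x tensor-unit utilization."""
    return peak_tops * utilization

def system_latency_us(compute_us: float, transfer_us: float) -> float:
    """End-to-end latency adds time moving data to/from the accelerator."""
    return compute_us + transfer_us

# Hypothetical utilization figures for a 143 int8 TOPS device:
print(effective_tops(143, 0.25))  # quarter utilized -> 35.75 effective TOPS
print(effective_tops(143, 0.50))  # half utilized    -> 71.5 effective TOPS
```

Under this simple model, a device with lower peak TOPS but higher sustained utilization and cheaper data movement can beat a nominally faster one, which is the comparison the talk sets up.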
In this talk, I will first discuss trends in AI and FPGAs. Then, I will talk about the Stratix 10 NX FPGA (S10-NX). Next, I will present our research AI soft processor overlay (NPU) and its optimizations for the S10-NX. Finally, I will report an evaluation of the NPU on the S10-NX in comparison to GPUs (Nvidia T4 and V100), at both the device and system levels, for a suite of real-time AI inference workloads (RNNs, GRUs, LSTMs, MLPs). Our study shows that the S10-NX achieves 24× and 12× average compute speedups over the T4 and V100 GPUs at batch size 6. At the system level, the FPGA's fine-grained flexibility with integrated 100 Gbps Ethernet allows remote access with 10× and 2× less system latency than local access to a GPU via 128 Gbps PCIe, for short and long sequence RNNs respectively.

=====Bio=====

Eriko Nurvitadhi is a senior research scientist in the CTO office of the Programmable Solutions Group at Intel. He leads FPGA external/academic and internal research programs. His research focuses on hardware accelerator architectures (e.g., FPGAs, ASICs) for AI and data analytics. He has over 50 academic publications and 15 patents issued in this area. His research has contributed to Intel's FPGA and ASIC solutions for AI. At Intel, he has received awards for his contributions to co-founding and growing the FPGA academic program, as well as to next-generation FPGA technology. He received his PhD in Electrical and Computer Engineering from Carnegie Mellon University.