Artificial Intelligence enables a whole new range of applications in areas such as Virtual and Augmented Reality, robotics, IoT, healthcare, retail and logistics, mobile, and automotive. In particular, Deep Neural Networks (DNNs) have enabled major advances in brain-like functions such as speech and image recognition. Today, Neural Networks are typically trained in data centers on standard GPU architectures, which are optimized for maximum throughput, but there is still a need for much higher performance. Inference would also benefit greatly from tailored architectures that deliver faster responses, both in the data center and in embedded applications. However, designing tailored SoC platforms for training and inference of Artificial Intelligence applications is very challenging:
- The fast pace of innovation and differentiation in AI applications requires high flexibility in the underlying architecture to support evolving AI algorithms with varying numbers of layers, filters, and channels, and varying filter sizes.
- The execution of AI algorithms like Neural Network graphs requires very high computational performance and memory bandwidth.
- Embedded applications, especially mobile devices, need low power consumption. Even in data centers, power efficiency is the dominant cost factor.
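To make the compute and bandwidth requirements above concrete, the following minimal sketch estimates the multiply-accumulate (MAC) count and the minimum data volume of a single convolution layer from its number of channels, filters, and filter size. The function name `conv_layer_cost` and the example layer shapes are hypothetical, and the model assumes stride 1, "same" padding, and that each tensor crosses the memory interface exactly once, which is an optimistic lower bound on traffic.

```python
def conv_layer_cost(h_out, w_out, c_in, c_out, k, bytes_per_value=1):
    """Estimate (MACs, bytes moved) for one k x k convolution layer.

    Assumptions (illustrative only): stride 1 and 'same' padding, so the
    input feature map has the same height and width as the output, and
    every weight, input, and output value is transferred exactly once.
    """
    # One MAC per output pixel, output channel, input channel, and filter tap.
    macs = h_out * w_out * c_out * c_in * k * k
    weight_bytes = c_out * c_in * k * k * bytes_per_value
    input_bytes = h_out * w_out * c_in * bytes_per_value
    output_bytes = h_out * w_out * c_out * bytes_per_value
    return macs, weight_bytes + input_bytes + output_bytes


if __name__ == "__main__":
    # Hypothetical layers: a 3x3 convolution and a fully connected layer
    # expressed as a 1x1 convolution on a 1x1 feature map.
    for name, shape in [("conv 3x3", (56, 56, 64, 64, 3)),
                        ("fully connected", (1, 1, 4096, 4096, 1))]:
        macs, traffic = conv_layer_cost(*shape)
        print(f"{name}: {macs} MACs, {traffic} bytes, "
              f"{macs / traffic:.1f} MACs per byte")
```

The ratio of MACs to bytes differs by orders of magnitude between the two example layers, which is one reason a flexible architecture has to cope with both compute-heavy and bandwidth-heavy workloads.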
In the past, scaling the power and performance of programmable architectures came for free by moving to smaller technology nodes. However, as Moore’s Law no longer delivers 2x transistors every 2 years at the same price, the necessary improvements in power, performance, and flexibility need to come from better architectures:
- Customizing the micro-architecture for the algorithmic kernels.
- Designing the macro-architecture with the right level of block-level parallelism to achieve the desired throughput.
- Selecting the best data flow for the Neural Network based on the data handling characteristics. Typical data flows are weight stationary, output stationary, no local reuse, or row stationary.
- Optimizing the implementation of the data transfers with tailored DMA engines and local buffering to get the most out of the limited bandwidth to the external memory. This is particularly important because most Neural Network algorithms are memory-bandwidth limited.
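As an illustration of how the choice of data flow changes data movement, the sketch below computes the same 1D convolution with a weight-stationary and an output-stationary loop order and counts the resulting accesses. It is a deliberately simplified single-processing-element model: the function names and the counters `weight_loads` and `psum_accesses` are hypothetical, and real accelerators distribute these loops across many processing elements and buffer levels.

```python
def conv1d_weight_stationary(x, w):
    """Hold each weight locally and stream all inputs past it."""
    n_out = len(x) - len(w) + 1
    psum = [0] * n_out              # partial sums live in a buffer
    weight_loads = 0
    psum_accesses = 0
    for k, wk in enumerate(w):      # outer loop: each weight is loaded once
        weight_loads += 1
        for i in range(n_out):      # inner loop: update every partial sum
            psum[i] += wk * x[i + k]
            psum_accesses += 1
    return psum, {"weight_loads": weight_loads, "psum_accesses": psum_accesses}


def conv1d_output_stationary(x, w):
    """Hold each output accumulator locally and stream all weights past it."""
    n_out = len(x) - len(w) + 1
    out = []
    weight_loads = 0
    psum_accesses = 0
    for i in range(n_out):          # outer loop: one output at a time
        acc = 0                     # accumulator stays in a local register
        for k, wk in enumerate(w):  # inner loop: weights are re-fetched
            weight_loads += 1
            acc += wk * x[i + k]
        out.append(acc)             # a single write per finished output
        psum_accesses += 1
    return out, {"weight_loads": weight_loads, "psum_accesses": psum_accesses}


if __name__ == "__main__":
    x = [1, 2, 3, 4, 5]
    w = [1, 0, -1]
    print(conv1d_weight_stationary(x, w))
    print(conv1d_output_stationary(x, w))
```

Both loop orders produce identical results; they differ only in which operand stays local and which one generates repeated traffic. Under this simplified model, that is the trade-off the data flow selection has to resolve for a given layer shape.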
This tutorial will show how Virtual Prototyping can help to design accelerators for deep learning and integrate them into the SoC context.