Project Overview
Customer: Wave Computing. The dataflow processing unit project was developed from an initial idea, through a SystemC and NS-3 (TLM-based) co-simulation model of the system, to an FPGA prototype and an ASIC design. A complex solution requires clear status and progress tracking, so an early proof of concept, using a SystemC model of the system and NS-3 for its network part, was crucial in the initial project phase. The next step was an FPGA prototype to demonstrate the real-time capabilities of the system. In parallel, the ASIC design proceeded using the same RTL as the FPGA prototype.
The core of the project was a dataflow processing unit (DPU) containing more than 16,000 processing elements (PEs). The architecture and RTL design specification were created in cooperation with the customer. Each processing element is designed to perform several matrix operations. The processing unit has dedicated local memory that is shared among all processing elements. There is also external shared DDR5 memory for additional storage. Two multicore general-purpose processors are used to configure all modules within the processing unit. Since each processing unit has a universal architecture, the same component can be reused to build a DPU cluster. Communication between DPUs can be achieved in several ways:
- PCIe Gen 5
- 400 Gb/s Ethernet
- Direct DPU to DPU communication
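The interconnect options listed above can be captured in a small data model. The following is a minimal sketch, not the project's configuration format: the names `Interconnect`, `Link`, and `fully_connected`, and the all-to-all topology, are assumptions made purely for illustration.

```python
from dataclasses import dataclass
from enum import Enum
from itertools import combinations
from typing import List


class Interconnect(Enum):
    """The three DPU-to-DPU options named in the text."""
    PCIE_GEN5 = "PCIe Gen 5"
    ETHERNET = "Ethernet"
    DIRECT = "Direct DPU to DPU"


@dataclass(frozen=True)
class Link:
    """One connection between two DPUs, identified by index (hypothetical model)."""
    src: int
    dst: int
    kind: Interconnect


def fully_connected(num_dpus: int, kind: Interconnect) -> List[Link]:
    """Assumed topology: one link of the given kind between every pair of DPUs."""
    return [Link(a, b, kind) for a, b in combinations(range(num_dpus), 2)]


if __name__ == "__main__":
    for link in fully_connected(3, Interconnect.DIRECT):
        print(f"DPU{link.src} <-> DPU{link.dst} via {link.kind.value}")
```

A cluster of `n` DPUs wired this way needs `n * (n - 1) / 2` links, which is one reason a real cluster might mix link types or use a sparser topology.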
The purpose of the DPU cluster architecture is to eliminate the need for a Central Processing Unit (CPU) or a co-processor such as a Graphics Processing Unit (GPU) and to increase the training speed of neural networks, while keeping the system autonomous by minimizing host control complexity. The dataflow processing unit block diagram is shown in Figure 1.
Figure 1. Dataflow processing unit block diagram
Since communication between processing units is very complex, a cycle-based SystemC model of the processing unit was created. Verification was done by an independent verification team. The NS-3 network simulator was used to simulate communication between multiple processing units. Communication is based on Transaction-Level Modeling (TLM). NS-3 is a discrete-event network simulator capable of simulating large-scale networks.
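To illustrate what "discrete-event" means here, the following minimal sketch shows the core idea in plain Python: events sit in a time-ordered queue, and the simulation clock jumps from one event to the next. It does not use NS-3 or SystemC; the `Simulator` class, the `simulate_ring` function, the ring topology, and the fixed link latency are all assumptions for illustration, not the project's model.

```python
import heapq
from dataclasses import dataclass, field
from typing import Callable, List, Tuple


@dataclass(order=True)
class Event:
    time: float
    seq: int  # tie-breaker so equal-time events run in insertion order
    action: Callable[[], None] = field(compare=False)


class Simulator:
    """Minimal discrete-event kernel: a clock plus a time-ordered event queue."""

    def __init__(self) -> None:
        self.now = 0.0
        self._queue: List[Event] = []
        self._seq = 0

    def schedule(self, delay: float, action: Callable[[], None]) -> None:
        heapq.heappush(self._queue, Event(self.now + delay, self._seq, action))
        self._seq += 1

    def run(self) -> None:
        while self._queue:
            event = heapq.heappop(self._queue)
            self.now = event.time  # jump straight to the next event time
            event.action()


def simulate_ring(num_dpus: int, link_latency: float) -> List[Tuple[float, int]]:
    """Pass one message once around a ring of DPUs (hypothetical topology).

    Returns a list of (arrival_time, dpu_id) pairs in arrival order.
    """
    sim = Simulator()
    log: List[Tuple[float, int]] = []

    def arrive(dpu_id: int, hops_left: int) -> None:
        log.append((sim.now, dpu_id))
        if hops_left > 0:
            nxt = (dpu_id + 1) % num_dpus
            sim.schedule(link_latency, lambda: arrive(nxt, hops_left - 1))

    sim.schedule(0.0, lambda: arrive(0, num_dpus - 1))
    sim.run()
    return log


if __name__ == "__main__":
    print(simulate_ring(4, 2.0))
    # [(0.0, 0), (2.0, 1), (4.0, 2), (6.0, 3)]
```

Because the clock advances only when an event fires, idle time between messages costs nothing to simulate. The cost is driven by the number of events, so it grows with the number of units and messages being modeled.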
In parallel, an optimized pre-silicon FPGA prototype of the DPU was designed and implemented on a Xilinx UltraScale+ development board.
The third phase of the project was the ASIC implementation of the dataflow processing unit, which was done in the Verilog Hardware Description Language (HDL).
There were several challenges during development. The main ones were area optimization and reducing the power consumption of each processing element. Therefore, the processing elements are designed to perform only a few matrix operations. Another challenge was emulating complete network communication between processing units. Since this kind of simulation requires a large amount of computing power, it was necessary to acquire a latest-generation computer with the maximum amount of Random Access Memory (RAM).
Conclusion
The project covered the complete development cycle, from IP architecture planning to ASIC implementation. All development steps were conducted by RT-RK in coordination with the customer. Customer requirements were fulfilled as expected. The RT-RK team also proved that it is capable of working with the latest technologies that are still under development.
Disclaimer: Further content reproduced from WAVE computing. Visit the source at https://www.wavecomp.ai/wp-content/uploads/2019/03/WaveComputing_Deskside_3-7-19.pdf
Faster Deep Learning Without IT
Due to strong demand by data scientists, Wave Computing® is introducing a dataflow appliance that is customized for office environments. It is based on the company’s revolutionary dataflow technology that eliminates the need for a CPU or co-processors, such as a GPU. Wave’s dataflow appliance offers extremely fast modeling and training of data sets, which can outperform existing datacenter servers for deep learning workloads. The Wave dataflow appliance is designed to easily fit in existing work spaces. Alternative power configurations are available for Asia and Europe.
A Future Proof Solution
With ONNX interoperability, Wave’s dataflow appliance can support a range of frameworks such as TensorFlow, Caffe, MXNet and more. Also, the Dataflow Processing Unit (DPU) based boards within each appliance are upgradable, allowing for next-generation, high-bandwidth memory clusters and future Wave DPUs.
Go Faster with Dataflow Technology
Wave Computing’s dataflow architecture eliminates the need for a CPU or co-processor, removing bottlenecks such as callouts to accelerators, cross-memory communication and more. The result is neural network training performance that exceeds that of datacenter-class servers. Wave’s dataflow appliance is ideal for use on both recurrent neural networks (RNNs) and convolutional neural networks (CNNs). The appliance includes all the needed dataflow software and agent libraries to get up and running quickly.
Revolutionary Dataflow Architecture
Wave’s native dataflow architecture is the fundamental technology behind each dataflow appliance. It is built upon the company’s unique dataflow computing technology that exploits data and model parallelisms present in deep learning models. Wave’s dataflow appliances utilize Dataflow Processing Units (DPUs), which contain thousands of interconnected dataflow Processing Elements (PEs). The performance and scalability of the Wave appliances make them ideal for organizations using deep learning to develop, test and deploy deep learning models for frameworks such as TensorFlow.
Wave Dataflow Systems
Wave Computing is revolutionizing artificial intelligence (AI) and deep learning with its dataflow-based systems. The company’s innovative solutions leverage dataflow technology to provide high-performance training and high-efficiency inferencing at scale, enabling companies to drive better business value from their data. Wave’s dataflow system solutions are designed for deep learning. Boasting significant improvements over the current standard, they enable data scientists to experiment, develop, test, deploy and run AI applications faster and more economically than ever before.
| Category | Item | Specification |
|---|---|---|
| Performance | Dataflow Processing Elements (PEs) | 64,000 per appliance |
| Memory and Storage | High-Speed Memory | 32 GB High-Speed DRAM |
| | SSD Storage | 8 TB of storage |
| | Bulk Storage | 512 GB DDR4 DRAM |
| Connections and Power Consumption | User Connection | Up to two 10GbE Ports |
| | Power Requirements | Designed for in-office use; power configurations available for Asia and Europe |
| Form Factor | Dimensions | 15″ x 15″ x 30″ |
| | Acoustics | < 30 dB |
| | Cooling System | Liquid cooled |
| Software | Machine Learning Framework | ONNX |
| | Operating System | Ubuntu Linux |
| | Library | WaveFlow Agent Library |
| | Data Runtime | WaveFlow Execution Engine |
About Wave Computing
Wave Computing is the Silicon Valley company that is revolutionizing AI and deep learning from the datacenter to the edge with its dataflow-based systems and embedded solutions. The company enables enterprises to accelerate their AI applications by easily and cost-effectively bringing deep learning to their data, wherever it is. Wave’s innovative AI system solutions provide high-performance training and high-efficiency inferencing at scale.