The world’s fastest deep learning

Dataflow departmental appliance for deep learning, ASIC design

Project Overview

The dataflow processing unit project for the customer, Wave Computing, was developed from the initial idea, through a SystemC/NS-3 (TLM-based) co-simulation model of the system, to an FPGA prototype and an ASIC design. A complex solution requires clear status and progress tracking, so an early proof of concept using a SystemC model of the system, with NS-3 covering its network part, was crucial in the initial project phase. The next step was an FPGA prototype providing the real-time capabilities of the system. In parallel, the ASIC design was carried out using the same RTL as the FPGA prototype.

The core of the project was the dataflow processing unit (DPU), containing more than 16,000 processing elements (PEs). The architecture and RTL design specification were created in cooperation with the customer. Each processing element is designed to perform a small set of matrix operations. The processing unit has dedicated local memory that is shared among the processing elements, as well as external DDR5 shared memory for additional storage. Two multicore general-purpose processors are used to configure all modules within the processing unit. Since each processing unit has a universal architecture, the same component can be reused to create a DPU cluster. Communication between several DPUs can be achieved in several ways:

  • PCIe Gen 5
  • 400 Gb/s Ethernet
  • Direct DPU to DPU communication

The purpose of the DPU cluster architecture is to eliminate the need for a Central Processing Unit (CPU) or a co-processor such as a Graphics Processing Unit (GPU), and to increase the training speed of neural networks while keeping the system autonomous by minimizing host control complexity. The dataflow processing unit block diagram is shown in Figure 1.

Figure 1. Dataflow processing unit block diagram

Since communication between processing units is very complex, a cycle-based SystemC model of the processing unit was created; verification was done by an independent verification team. The NS-3 network simulator was used to simulate communication between multiple processing units, with the communication modeled using Transaction-Level Modeling (TLM). NS-3 is a discrete-event network simulator capable of emulating large-scale networks.

In parallel, an optimized pre-silicon FPGA prototype of the DPU was designed and implemented on a Xilinx UltraScale+ development board.
The third phase of the project was the ASIC implementation of the dataflow processing unit, done in the Verilog Hardware Description Language (HDL).
There were several challenges during development. The main ones were area optimization and reducing the power consumption of each processing element.

For this reason, the processing elements are designed to perform only a small set of matrix operations. Another challenge was emulating the complete network communication between processing units. Since this kind of simulation requires a large amount of computing power, it was necessary to acquire a latest-generation computer with the maximum amount of Random Access Memory (RAM).


The project covered the complete development flow, from IP architecture planning up to ASIC implementation. All development steps were conducted by RT-RK in cooperation with the customer, and the customer requirements were fulfilled at the expected level. At the same time, the RT-RK team proved that it is capable of working with the latest technologies that are still in the development phase.

Disclaimer: The content below is reproduced from Wave Computing. Visit the source at

Faster Deep Learning Without IT

Due to strong demand from data scientists, Wave Computing® is introducing a dataflow appliance customized for office environments. It is based on the company’s revolutionary dataflow technology that eliminates the need for a CPU or co-processors, such as a GPU. Wave’s dataflow appliance offers extremely fast modeling and training of data sets, outperforming existing datacenter servers on deep learning workloads. The Wave dataflow appliance is designed to fit easily into existing workspaces. Alternative power configurations are available for Asia and Europe.

A Future Proof Solution

With ONNX interoperability, Wave’s dataflow appliance can support a range of frameworks such as TensorFlow, Caffe, MXNet and more. Also, the Dataflow Processing Unit (DPU) based boards within each appliance are upgradable, allowing for next-generation, high-bandwidth memory clusters and future Wave DPUs.

Go Faster with Dataflow Technology

Wave Computing’s dataflow architecture eliminates the need for a CPU or co-processor, removing bottlenecks such as callouts to accelerators, cross-memory communication and more. The result is performance improvements of training neural networks that outperform datacenter class servers. Wave’s dataflow appliance is ideal for use on both recurrent neural networks (RNNs) and convolutional neural networks (CNNs). The appliance includes all the needed dataflow software and agent libraries to get up and running quickly.

Revolutionary Dataflow Architecture

Wave’s native dataflow architecture is the fundamental technology behind each dataflow appliance. It is built upon the company’s unique dataflow computing technology that exploits data and model parallelisms present in deep learning models. Wave’s dataflow appliances utilize Dataflow Processing Units (DPUs), which contain thousands of interconnected dataflow Processing Elements (PEs). The performance and scalability of the Wave appliances make them ideal for organizations using deep learning to develop, test and deploy deep learning models for frameworks such as TensorFlow.

Wave Dataflow Systems

Wave Computing is revolutionizing artificial intelligence (AI) and deep learning with its dataflow-based systems. The company’s innovative solutions leverage dataflow technology to provide high-performance training and high-efficiency inferencing at scale, enabling companies to drive better business value from their data. Wave’s dataflow system solutions are designed for deep learning. Boasting significant improvements over the current standard, they enable data scientists to experiment, develop, test, deploy and run AI applications faster and more economically than ever before.


Performance
  • Dataflow Processing Elements (PEs): 64,000 per appliance

Memory and Storage
  • High-Speed Memory: 32 GB high-speed DRAM
  • SSD Storage: 8 TB
  • Bulk Storage: 512 GB DDR4 DRAM

Connections and Power Consumption
  • User Connection: up to two 10GbE ports
  • Power Requirements: designed for in-office use; power configurations available for Asia and Europe

Form Factor
  • Dimensions: 15″ x 15″ x 30″
  • Acoustics: < 30 dB
  • Cooling System: liquid cooled

Software
  • Machine Learning Framework: ONNX
  • Operating System: Ubuntu Linux
  • Library: WaveFlow Agent Library
  • Data Runtime: WaveFlow Execution engine


About Wave Computing

Wave Computing is the Silicon Valley company that is revolutionizing AI and deep learning from the datacenter to the edge with its dataflow-based systems and embedded solutions. The company enables enterprises to accelerate their AI applications by easily and cost-effectively bringing deep learning to their data, wherever it is. Wave’s innovative AI system solutions provide high-performance training and high-efficiency inferencing at scale.
