Embedded AI

Embedded AI brings intelligent technology to resource-limited environments through efficient, resource-conscious solutions tuned for the best achievable performance on the target hardware.

The main goal is to find an optimal mapping of an algorithm to the selected embedded system.

The main steps in porting an algorithm to an embedded system are:

  • Algorithm analysis (elementary processing modules, data paths, control paths, mutual dependencies)
  • Algorithm partitioning (grouping of modules in blocks, organization of modules inside the block, synchronization of blocks and data packets between blocks)
  • Algorithm adaptation to target platform (adaptation of arithmetic to target DSP/RISC cores, error analysis)
  • Performance analysis (identification of conflicts and stalls, processor load analysis, memory load analysis)
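
As an illustration of the arithmetic adaptation and error analysis steps listed above, here is a minimal sketch in portable C. It assumes a 16-bit Q15 fixed-point format as the target arithmetic, which is a common choice on DSP cores but not necessarily the one used in any project described here. The function names (float_to_q15, q15_mul, max_mul_error) are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Convert a float in [-1, 1) to Q15: round to nearest (ties away from
   zero) and saturate at the ends of the 16-bit range. */
static int16_t float_to_q15(float x)
{
    float scaled = x * 32768.0f;
    if (scaled >= 32767.0f) return INT16_MAX;
    if (scaled <= -32768.0f) return INT16_MIN;
    return (int16_t)(scaled + (scaled >= 0.0f ? 0.5f : -0.5f));
}

static float q15_to_float(int16_t q)
{
    return (float)q / 32768.0f;
}

/* Q15 x Q15 multiply with rounding and saturation.
   Assumption: >> on a negative int32_t is an arithmetic shift, which
   holds on mainstream compilers but is implementation-defined in C. */
static int16_t q15_mul(int16_t a, int16_t b)
{
    int32_t p = (int32_t)a * (int32_t)b;   /* Q30 product */
    p = (p + (1 << 14)) >> 15;             /* round back to Q15 */
    if (p > INT16_MAX) p = INT16_MAX;      /* only reached for -1 * -1 */
    return (int16_t)p;
}

/* Error analysis: worst absolute difference between the fixed-point
   product and the floating-point reference over n sample pairs. */
static float max_mul_error(const float *a, const float *b, size_t n)
{
    assert(a != NULL && b != NULL);
    float worst = 0.0f;
    for (size_t i = 0; i < n; ++i) {
        float ref = a[i] * b[i];
        float got = q15_to_float(q15_mul(float_to_q15(a[i]),
                                         float_to_q15(b[i])));
        float err = ref > got ? ref - got : got - ref;
        if (err > worst) worst = err;
    }
    return worst;
}
```

Running max_mul_error over representative input data gives a first estimate of whether the chosen word length is sufficient before any target-specific optimization begins.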

Platforms

Examples of our work

Innoviz

Renesas R-Car – joint usage of IMP-X5+ accelerators: CVe and CNN IP.

Key points:

  • Optimization of 1×1 convolution (+ReLU+pooling): sub-millisecond runtime for 256×256 data with 32 channels on the Computer Vision engine (CVe), an average speedup of 100x over ARM A53 and 4x over CNN IP
  • Parallelization of CNN IP and CVe: 1×1 convolutions are offloaded to CVe while respecting the memory dependencies of the CNN IP pipeline for maximal efficiency. Runtime reduced from 12.8 ms to 5.8 ms
  • Tiling and tunneling: in-place memory processing enabled by logically grouping corresponding spatial tiles of several consecutive layers and computing them sequentially. Average performance boost of 20% on both CNN IP and CVe
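
The idea behind the 1×1 convolution and the tiling of consecutive layers can be sketched in plain C. This is only an illustration under stated assumptions: float data in a planar (channel-major) layout, two consecutive 1×1 convolution + ReLU layers, and hypothetical names (conv1x1_relu, two_layer_tiled, TILE_PIXELS, MAX_MID_CHANNELS). It is not the CVe or CNN IP implementation, which relies on the vendor toolchain.

```c
#include <assert.h>
#include <stddef.h>

#define TILE_PIXELS 64       /* hypothetical tile size */
#define MAX_MID_CHANNELS 32  /* hypothetical cap on intermediate channels */

/* 1x1 convolution + ReLU on planar data. Channel c of the input starts
   at in + c * in_stride; only the first n pixels of each plane are
   processed. w is a row-major c_out x c_in weight matrix. */
static void conv1x1_relu(const float *in, size_t in_stride,
                         const float *w, const float *bias,
                         float *out, size_t out_stride,
                         size_t n, size_t c_in, size_t c_out)
{
    assert(in && w && bias && out);
    for (size_t co = 0; co < c_out; ++co) {
        float *dst = out + co * out_stride;
        for (size_t p = 0; p < n; ++p)
            dst[p] = bias[co];
        for (size_t ci = 0; ci < c_in; ++ci) {
            const float wt = w[co * c_in + ci];
            const float *src = in + ci * in_stride;
            for (size_t p = 0; p < n; ++p)
                dst[p] += wt * src[p];   /* unit-stride, SIMD-friendly */
        }
        for (size_t p = 0; p < n; ++p)
            if (dst[p] < 0.0f) dst[p] = 0.0f;   /* ReLU */
    }
}

/* Two consecutive layers computed tile by tile: the intermediate result
   of a tile lives in a small scratch buffer and is consumed at once,
   so the full intermediate feature map is never written to memory. */
static void two_layer_tiled(const float *in,
                            const float *w1, const float *b1,
                            const float *w2, const float *b2,
                            float *out, size_t pixels,
                            size_t c_in, size_t c_mid, size_t c_out)
{
    float scratch[MAX_MID_CHANNELS * TILE_PIXELS];
    assert(c_mid <= MAX_MID_CHANNELS);
    for (size_t start = 0; start < pixels; start += TILE_PIXELS) {
        size_t left = pixels - start;
        size_t n = left < TILE_PIXELS ? left : TILE_PIXELS;
        conv1x1_relu(in + start, pixels, w1, b1,
                     scratch, TILE_PIXELS, n, c_in, c_mid);
        conv1x1_relu(scratch, TILE_PIXELS, w2, b2,
                     out + start, pixels, n, c_mid, c_out);
    }
}
```

Because a 1×1 kernel has no spatial neighborhood, the tiles need no overlap here. Layers with larger kernels would require overlapping tile borders, which this sketch does not cover.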

Read more in the case study.

FotoNation

FotoNation Driver Monitoring System: implementation and optimization. The reference was highly optimized C code (with DSP assembly intrinsics) already running on the target platform, a single TDA3x DSP.

Key points:

  • The initial version of the C code (with DSP assembly intrinsics), already optimized by the customer, required 86 ms per frame.
  • The final version, which went into the market product, required 28 ms per frame.
  • The most significant improvement came from algorithmic modifications and simplification.

Read more in the case study.

Denso

Denso Camera Mirror Replacement: algorithm porting and optimization. The reference was high-level C/C++ R&D code.

Key points:

  • Code significantly refactored and simplified for use on an embedded platform
  • Code modified to enable parallel execution on different cores
  • The final version runs at 15 fps (on ARM Cortex-A15, TI C66x DSP, TI EVE (Embedded Vision Engine), and ARM Cortex-M4 cores)
  • The most significant improvement came from proper allocation of code to the SoC cores and algorithmic modifications for each target core (float to integer, vectorization, data organization, etc.)
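
To make the float-to-integer and data-organization techniques named above more concrete, here is a small sketch in portable C. It is an illustrative example, not code from the project: it splits interleaved RGB pixels into separate planes and computes luma with integer weights. The weights 77, 150 and 29 are a commonly used 8-bit approximation of the BT.601 coefficients 0.299, 0.587 and 0.114, chosen here as an assumption. The function names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Data organization: split interleaved RGBRGB... into three planes so
   that later loops read each channel with unit stride. */
static void deinterleave_rgb(const uint8_t *rgb,
                             uint8_t *r, uint8_t *g, uint8_t *b, size_t n)
{
    assert(rgb && r && g && b);
    for (size_t i = 0; i < n; ++i) {
        r[i] = rgb[3 * i];
        g[i] = rgb[3 * i + 1];
        b[i] = rgb[3 * i + 2];
    }
}

/* Float to integer: y = 0.299 r + 0.587 g + 0.114 b becomes
   (77 r + 150 g + 29 b + 128) >> 8. The weights sum to 256, so the
   result always fits in 8 bits. The loop body has no branches and
   unit-stride accesses, which makes it a good candidate for
   auto-vectorization or hand-written SIMD. */
static void luma_u8(const uint8_t *r, const uint8_t *g, const uint8_t *b,
                    uint8_t *y, size_t n)
{
    assert(r && g && b && y);
    for (size_t i = 0; i < n; ++i) {
        uint32_t acc = 77u * r[i] + 150u * g[i] + 29u * b[i] + 128u;
        y[i] = (uint8_t)(acc >> 8);
    }
}
```

The integer result can differ from the rounded floating-point value by one level (pure red gives 77 rather than 76), so an error analysis against the float reference is still needed to confirm that the deviation is acceptable for the application.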

Read more in the case study.

Android

Optimization of multimedia libraries: PC-based codecs optimized for RISC architectures.

Key points:

  • Adaptation of data types to architecture
  • Use of the MIPS DSP ASE instruction set
  • Utilization of SIMD instructions
  • Intrinsics and inline assembly
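
To show what a packed SIMD operation does, the following portable C sketch emulates a saturating add on two 16-bit lanes held in one 32-bit word. As far as we know, the MIPS DSP ASE provides a paired-halfword saturating add (ADDQ_S.PH) that performs this in a single instruction, typically reached through a compiler intrinsic; the exact mnemonic and intrinsic name should be checked against the toolchain documentation. All function names below are hypothetical, and the code is a reference model, not target code.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Clamp a 32-bit intermediate to the signed 16-bit range. */
static int16_t sat16(int32_t v)
{
    if (v > INT16_MAX) return INT16_MAX;
    if (v < INT16_MIN) return INT16_MIN;
    return (int16_t)v;
}

/* Pack two signed 16-bit lanes into one 32-bit word. */
static uint32_t pack2x16(int16_t lo, int16_t hi)
{
    return ((uint32_t)(uint16_t)hi << 16) | (uint16_t)lo;
}

/* Extract lanes. Assumption: converting an out-of-range uint16_t to
   int16_t wraps (two's complement), as on mainstream compilers. */
static int16_t lane_lo(uint32_t v) { return (int16_t)(uint16_t)(v & 0xFFFFu); }
static int16_t lane_hi(uint32_t v) { return (int16_t)(uint16_t)(v >> 16); }

/* Reference model of a two-lane saturating add: each 16-bit lane is
   added independently and clamped instead of wrapping around. */
static uint32_t add_sat_2x16(uint32_t a, uint32_t b)
{
    int16_t lo = sat16((int32_t)lane_lo(a) + lane_lo(b));
    int16_t hi = sat16((int32_t)lane_hi(a) + lane_hi(b));
    return pack2x16(lo, hi);
}

/* Add two 16-bit buffers two elements at a time; n must be even. */
static void add_sat_buffer(const int16_t *a, const int16_t *b,
                           int16_t *out, size_t n)
{
    assert(n % 2 == 0);
    for (size_t i = 0; i < n; i += 2) {
        uint32_t sum = add_sat_2x16(pack2x16(a[i], a[i + 1]),
                                    pack2x16(b[i], b[i + 1]));
        out[i] = lane_lo(sum);
        out[i + 1] = lane_hi(sum);
    }
}
```

A reference model like this is also useful as a bit-exact test oracle when the inner loop is later replaced by intrinsics or inline assembly on the target.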

Read more in the case study.

More on the topic