ExecuTorch on CanMv230 v1.1 – Part 1


Introduction

This is the first part of a series of articles exploring the challenges of compiling, optimizing, and executing AI/ML libraries on embedded RISC-V systems. The case study focuses on the CanMv230 v1.1, a readily available, affordable RISC-V development board with the following hardware capabilities:

  • The main control chip is an upgraded version of the K210
  • The K230 uses the RISC-V architecture and is a 64-bit dual-core CPU
    • CPU1 has a main frequency of 1.6GHz and RVV 1.0 support
    • CPU2 has a main frequency of 800MHz
    • It also has a KPU with 6 TOPS of computing power
    • The development version used in this series has 1GB of RAM

The series of articles will contain the following topics:

  1. Bringing up ExecuTorch on CanMv230 v1.1
  2. Optimizing critical kernel operators for CanMv230 v1.1 using RVV
  3. Integrating K230 KPU as an executor path in ExecuTorch
  4. Optimizing memory layout for inference using tiling methods
  5. Optimizing inference using multithreading
  6. Optimizing inference using quantization

In this article we cover topic 1.

Bringing up ExecuTorch on CanMv230 v1.1

Motivation: ExecuTorch vs PyTorch

The main reason for choosing ExecuTorch becomes clear when the two are compared side by side:

  PyTorch                 | ExecuTorch
  ------------------------+-------------------
  Generic ML framework    | PyTorch extension
  Python based            | Resource limited
  Open source             | Open source
  Training AND Inference  | Inference ONLY

On an embedded system with limited resources you would never train an ML model, only run inference with it. ExecuTorch is therefore the clear choice for this bring-up.

The plan: executing ExecuTorch in an RTOS environment

By default, the CanMv230 runs the more powerful CPU1 in RTOS mode. This makes sense given the asymmetric design: CPU1 is the core with the RVV extension as well as access to the KPU. By taking a general-purpose OS out of the execution pipeline, a heavy-duty AI application can maximize performance by interfacing directly with the hardware.

This comes with the challenge of using a custom toolchain with RTOS library support so that the ExecuTorch runtime can be built as an RTOS-ready application. This is not impossible, however, and the key modifications to the ExecuTorch build system are highlighted below.

Note that because a native toolchain is not available for the CanMv230, cross-compilation is the only path forward. Otherwise we would be building ExecuTorch and linking for RISC-V against the Linux/glibc ABI, whereas our use case requires the RT-Smart microkernel userspace ABI.

Cross-Compiling for CanMV (K230)

The CanMV K230 board presented additional challenges due to toolchain incompatibilities in the QEMU environment. Since the SDK cross-compiler is an x86_64 binary, it cannot run inside a RISC-V QEMU instance. The solution is to cross-compile directly from an x86_64 host PC.

1. The Toolchain File (k230_rtsmart.cmake)

We created a file named k230_rtsmart.cmake to tell CMake exactly where the cross-compiler is and what architecture to target:

set(CMAKE_SYSTEM_NAME Generic)
set(CMAKE_SYSTEM_PROCESSOR riscv64)

# Toolchain
set(TC /path_to_file/k230_sdk/toolchain/riscv64-linux-musleabi_for_x86_64-pc-linux-gnu/bin/riscv64-unknown-linux-musl-)

set(CMAKE_C_COMPILER   ${TC}gcc)
set(CMAKE_CXX_COMPILER ${TC}g++)
set(CMAKE_AR           ${TC}ar     CACHE FILEPATH "")
set(CMAKE_RANLIB       ${TC}ranlib CACHE FILEPATH "")
set(CMAKE_STRIP        ${TC}strip  CACHE FILEPATH "")

# Compile flags
set(CMAKE_C_FLAGS   "-mcmodel=medany -march=rv64imafdcv -mabi=lp64d -DET_HAVE_PREAD=0" CACHE STRING "")
set(CMAKE_CXX_FLAGS "-mcmodel=medany -march=rv64imafdcv -mabi=lp64d -DET_HAVE_PREAD=0" CACHE STRING "")

set(CMAKE_EXE_LINKER_FLAGS
    "-T /path_to_file/k230_sdk/src/big/mpp/userapps/sample/linker_scripts/riscv64/link.lds \
     -n --static \
     -L/path_to_file/k230_sdk/src/big/rt-smart/userapps/sdk/rt-thread/lib \
     -Wl,--whole-archive -lrtthread -Wl,--no-whole-archive \
     -L/path_to_file/k230_sdk/src/big/rt-smart/userapps/sdk/lib/risc-v/rv64 \
     -L/path_to_file/k230_sdk/src/big/rt-smart/userapps/sdk/rt-thread/lib/risc-v/rv64 \
     -Wl,--start-group -lrtthread -Wl,--end-group"
    CACHE STRING "")

set(CMAKE_FIND_ROOT_PATH_MODE_PROGRAM NEVER)
set(CMAKE_FIND_ROOT_PATH_MODE_LIBRARY ONLY)
set(CMAKE_FIND_ROOT_PATH_MODE_INCLUDE ONLY)

# ExecuTorch options from executorch/tools/cmake/preset/zephyr.cmake
set(EXECUTORCH_BUILD_COREML             OFF CACHE BOOL "" FORCE)
set(EXECUTORCH_ENABLE_EVENT_TRACER      OFF CACHE BOOL "" FORCE)
set(EXECUTORCH_BUILD_KERNELS_LLM        OFF CACHE BOOL "" FORCE)
set(EXECUTORCH_BUILD_KERNELS_LLM_AOT    OFF CACHE BOOL "" FORCE)
set(EXECUTORCH_BUILD_EXTENSION_DATA_LOADER ON CACHE BOOL "" FORCE)
set(EXECUTORCH_BUILD_EXTENSION_FLAT_TENSOR ON CACHE BOOL "" FORCE)
set(EXECUTORCH_BUILD_EXTENSION_LLM      OFF CACHE BOOL "" FORCE)
set(EXECUTORCH_BUILD_EXTENSION_MODULE   OFF CACHE BOOL "" FORCE)
set(EXECUTORCH_BUILD_EXTENSION_TRAINING OFF CACHE BOOL "" FORCE)
set(EXECUTORCH_BUILD_EXTENSION_APPLE    OFF CACHE BOOL "" FORCE)
set(EXECUTORCH_BUILD_MPS                OFF CACHE BOOL "" FORCE)
set(EXECUTORCH_BUILD_NEURON             OFF CACHE BOOL "" FORCE)
set(EXECUTORCH_BUILD_OPENVINO           OFF CACHE BOOL "" FORCE)
set(EXECUTORCH_BUILD_PYBIND             OFF CACHE BOOL "" FORCE)
set(EXECUTORCH_BUILD_QNN                OFF CACHE BOOL "" FORCE)
set(EXECUTORCH_BUILD_KERNELS_OPTIMIZED  OFF CACHE BOOL "" FORCE)
set(EXECUTORCH_BUILD_KERNELS_QUANTIZED  OFF CACHE BOOL "" FORCE)
set(EXECUTORCH_BUILD_DEVTOOLS           OFF CACHE BOOL "" FORCE)
set(EXECUTORCH_BUILD_TESTS              OFF CACHE BOOL "" FORCE)
set(EXECUTORCH_BUILD_XNNPACK            OFF CACHE BOOL "" FORCE)
set(EXECUTORCH_BUILD_VULKAN             OFF CACHE BOOL "" FORCE)
set(EXECUTORCH_BUILD_PORTABLE_OPS       ON  CACHE BOOL "" FORCE)
set(EXECUTORCH_BUILD_CADENCE            OFF CACHE BOOL "" FORCE)
set(EXECUTORCH_BUILD_PTHREADPOOL        OFF CACHE BOOL "" FORCE)
set(EXECUTORCH_BUILD_CPUINFO            OFF CACHE BOOL "" FORCE)
set(EXECUTORCH_USE_CPP_CODE_COVERAGE    OFF CACHE BOOL "" FORCE)

Our k230_rtsmart.cmake configuration was modeled after zephyr.cmake. We adopted a minimalist strategy commonly found in embedded development: disabling all default features and incrementally introducing only the necessary components as the project evolved.
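With this strategy, bringing a component back later is a single cache override in the toolchain file. For example, once RVV kernel work begins in Part 2 of this series, the optimized kernels could be re-enabled with a one-line change (illustrative; the option name is taken from the list above):

```cmake
# Illustrative: flip a single preset option back on once the
# corresponding component is actually needed by the project.
set(EXECUTORCH_BUILD_KERNELS_OPTIMIZED ON CACHE BOOL "" FORCE)
```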

The main issue was that the RT-Smart microkernel ABI lacks pread() support, causing ExecuTorch to hit an EOF error when reading the model on the K230 board. By analyzing file_data_loader.cpp in the ExecuTorch repository, we noted that platforms like Xtensa and Hexagon set ET_HAVE_PREAD=0, suggesting RT-Smart required the same adjustment.

By adding -DET_HAVE_PREAD=0 to the cross-compilation flags, we forced ExecuTorch to use the lseek() + read() fallback. After rebuilding, the model loads and executes successfully on the RT-Smart RISC-V core.

2. Build Commands

Run the following commands on your x86 host PC to generate the binary for the CanMV K230 board:

cd ~/executorch
rm -rf cmake-outCanMV

cmake -B cmake-outCanMV \
    -DCMAKE_TOOLCHAIN_FILE=~/path_to/executorch/k230_rtsmart.cmake \
    -DCMAKE_BUILD_TYPE=Release \
    -DEXECUTORCH_BUILD_EXECUTOR_RUNNER=ON \
    -DEXECUTORCH_ENABLE_LOGGING=ON \
    -DPYTHON_EXECUTABLE=python3

cmake --build cmake-outCanMV --target executor_runner -j8

After the successful cross-compilation on the x86 host, we transferred both the executor_runner binary and the add.pte model file to the CanMV K230 board. To aid in debugging, we kept EXECUTORCH_ENABLE_LOGGING=ON during the build process to provide detailed runtime feedback.

The ExecuTorch runner successfully loaded and executed the example addition model (add.pte) on the RT-Smart terminal:

msh /sharefs>executor_runner --model_path add.pte
I 00:00:00.004559 executorch:executor_runner.cpp:375] Model file add.pte is loaded.
I 00:00:00.011787 executorch:executor_runner.cpp:385] Using method forward
I 00:00:00.018386 executorch:executor_runner.cpp:436] Setting up planned buffer 0, size 48.
I 00:00:00.026484 executorch:executor_runner.cpp:467] Model loaded in 22.814926 ms.
I 00:00:00.033857 executorch:executor_runner.cpp:525] Iteration 1 of 1: 0.012481 ms
I 00:00:00.041213 executorch:executor_runner.cpp:535] Model executed successfully 1 time(s) in 0.012481 ms.
I 00:00:00.050674 executorch:executor_runner.cpp:544] 1 outputs:
Output 0: tensor(sizes=[1], [2.])

Dusan Stojkovic

Iva Mancev
