Introduction
This article presents a case study on performance gains achieved through the application of the RISC-V Vector Extension (RVV) to software-based video codec libraries. Specifically, we focus on optimizing the AVC (H.264) encoder implementation (libavc) provided by the Android Open Source Project.
Our primary motivation stems from the evolving maturity of the RISC-V software ecosystem. While RISC-V is gaining traction in academia and industry alike, it has yet to match the optimization depth and toolchain maturity found in well-established architectures such as ARM and x86. This work aims to contribute to bridging that gap by exploring the potential of RVV in real-world codec scenarios.
All optimizations were implemented and tested in a fully automated CI/CD environment supporting build, execution, and regression verification. Functional validation was performed by comparing the encoded video output of the optimized implementation with the baseline. Performance metrics were gathered using Linux perf, and visualized through comparative charts and data tables.
Benchmarking was conducted on the Banana Pi BPI-F3 platform, a RISC-V SoC that supports the RVV extension. This platform was selected for its accessibility and alignment with current trends in edge computing. A secondary benchmark was performed using the Semidynamics Atrevido 423.
In this article we aim to demonstrate the untapped performance potential of RISC-V in video processing pipelines and encourage further enablement efforts in the open-source and commercial ecosystems.
RVV-Based Optimizations in the AVC Encoder
Our optimization strategy began with identifying computational hotspots and assessing the effectiveness of compiler-driven vectorization. Candidates for manual optimization were shortlisted based on poor auto-vectorization results or inefficient execution profiles. Selected functions were then re-implemented using RVV intrinsics in C, with correctness validated through bit-accurate comparisons of output video streams. GCC 15.1 was used as the baseline compiler.
The overall implementation steps and the iterative optimization workflow are illustrated in Figures 1a and 1b, respectively.
Figure 1a. Implementation Steps for RISC-V Porting and Optimization
Figure 1b. Performance Optimization Workflow
The following optimization techniques were employed:
- Identifying redundant strided loads: during our analysis, we identified several functions where the compiler defaulted to using strided loads, which introduced unnecessary latency. We manually restructured these loops to enable the use of regular, contiguous vector loads instead. This transformation led to measurable performance improvements, particularly in inner-loop memory-bound kernels.
- Eliminating Redundant Reductions (redsum): in functions such as those implementing sliding-window filters, we initially observed the compiler relying on redsum instructions to accumulate results of vector multiplications. We rewrote these routines to leverage vectorized multiply-accumulate (MAC) patterns instead. This change reduced instruction overhead and improved throughput, especially in repetitive, coefficient-weighted summations.
- Preventing Vector Register Spilling: in complex routines like ime_sub_pel_compute_sad_16x16, we noticed excessive vector register spilling caused by high vector length requirements and irregular memory access. We applied pipelining techniques and aggressively reused vector registers to mitigate spilling. This significantly reduced stack traffic and unlocked higher sustained vector performance.
- Manual Instruction Scheduling: where the compiler failed to generate efficient instruction schedules, we performed manual reordering guided by microarchitectural analysis, including the Top-Down performance methodology (TMA). This allowed us to reduce pipeline stalls and improve instruction-level parallelism in latency-sensitive hotspots.
- Loop Unrolling and Vector Alignment: we identified loop structures where the compiler underutilized the vector co-processor due to conservative unrolling and type inference. Knowing that certain matrix dimensions (e.g., 8-column alignment) were guaranteed, we manually unrolled and aligned loops to optimize data access and vector register usage. These changes yielded improved SIMD occupancy and memory efficiency.
- Load masking: a method of vector load optimization, this principle relies on the fact that if data is contiguous (sequential) and the operation to be applied on the data doesn’t utilize all the vector lanes, it is possible to do one large vector load (usually, multiple vector loads are of higher latency) and do vector sliding operations, effectively a left shift on indexes of vector elements. This reduces the number of vector loads by replacing them with sliding operations thus increasing performance.
- Utilization of Custom Extensions: on the Banana Pi BPI-F3, we evaluated vendor-specific instructions not yet utilized by the compiler. We selectively integrated these custom opcodes into performance-critical paths, bypassing compiler limitations. This led to additional speedups, highlighting the potential of low-level tuning in pre-standard RISC-V environments.
Examples of Applied Function-Level Optimizations
To illustrate the methodology and tangible results of the applied RISC-V vector optimizations, several representative functions from the libavc encoder were selected for detailed analysis and manual optimization. Each demonstrates how targeted use of RVV intrinsics, memory alignment, and instruction-level tuning can overcome compiler limitations and deliver measurable speed-ups.
Functions selected for optimization include:
- ih264e_sixtapfilter_horz
- ih264e_sixtap_filter_2dvh_vert
- ime_calculate_sad4_prog
Horizontal Six-Tap Filter - ih264e_sixtapfilter_horz
This function performs inter-prediction luma filtering in the horizontal direction.
- Problem: The compiler isn’t able to vectorize the function due to the way the algorithm applies the filter.
- Approach: Since the width of the filtered block is known (17), the function can be vectorized by loading 16 elements into a vector, repeating that six times (the six-tap filter reads six input samples per output). Each load is followed by a widening multiply-and-accumulate instruction, again six times. The accumulated values are then clipped and stored. Once the 16 elements are processed using vectors, the remaining element is handled using scalar operations. The snippet demonstrates this for just one vector load and multiply-and-accumulate:
```asm
...
li t1,20                          # loads one of the coefficients
...
vsetvli zero,zero,e8,mf2,ta,ma
...
vle8.v v9,(a0)                    # loads pu1_src starting with 0 index
...
vsetvli zero,zero,e16,m1,ta,ma    # performs the computation of
vwmaccsu.vx v10,t1,v9             # elements with coefficients
...
vadd.vx v9,v10,a6
vsra.vi v9,v9,5                   # performs the clipping of the value
vmin.vx v9,v9,t3                  # min clamp
vmax.vx v9,v9,zero                # max clamp
...
vsetvli zero,zero,e8,mf2,ta,ma    # reconfigure back to 8-bit
vnsrl.wi v9,v9,0                  # truncation
vse8.v v9,(a1)                    # stores the values to pu1_dst
...
```
Cascaded Two-Dimensional Filter - ih264e_sixtap_filter_2dvh_vert
This function implements a two-stage cascaded six-tap filter. It applies the six-tap filter in the vertical direction on the predictor values, followed by applying the same filter in the horizontal direction on the output of the first stage.
- Problem: As with the ih264e_sixtapfilter_horz function, the compiler isn’t able to vectorize any loop due to irregular access patterns and the way the algorithm works.
- Approach: Since the filter is applied twice, the vertical pass is handled using a similar approach to the previous function, with some minor adjustments. First, pointers are moved to align with the inner for loop. Although the starting point of the load varies vertically, 22 elements are loaded horizontally using regular load instructions, thus removing strided loads.
The function ih264e_sixtap_filter_2dvh_vert performs a vertical filter; the naive way to perform vertical filtering using vector instructions would be:
```asm
...
# example vectorization of multiply-accumulate loop of luma filter
vsetivli zero,8,e8,mf4,ta,ma      # configure vector co-processor
vlse8.v v2,0(a6),a5               # load 8 elem. from src, with stride
vwmulsu.vv v1,v4,v2               # widening multiply signed/unsigned of
                                  # src and coeffs
vsetivli zero,8,e16,mf2,ta,ma     # reconfigure vector co-processor
vredsum.vs v1,v1,v3               # sum all the elements in v1 register
vmv.x.s a4,v1                     # a4 = v1[0] <- sum
...
```
The flaw of this approach is that in each iteration of the loop over the columns of the pu1_src array, 8 elements are loaded and a vredsum is applied, which is expensive. Reduction sums should generally be used sparingly, as the following optimizations demonstrate.
The main difference is how the elements from pu1_src get loaded:
col + i * src_strd, where i is from -2 to 1
This is what is called a strided load: each element is loaded at an offset of src_strd. Since strided loads are more expensive than regular loads, we propose a way to use regular loads instead. The benefit of this approach is that it can be combined with the technique described in Eliminating Redundant Reductions. Consider this table:
| Vector | i = 0 | i = 1 | i = 2 | ... | i = 7 |
|---|---|---|---|---|---|
| v_src1 | (src − 3*strd)*c0 | (src − 3*strd + 1)*c0 | (src − 3*strd + 2)*c0 | ... | (src − 3*strd + 7)*c0 |
| v_src2 | (src − 2*strd)*c1 | (src − 2*strd + 1)*c1 | (src − 2*strd + 2)*c1 | ... | (src − 2*strd + 7)*c1 |
| v_src3 | (src − 1*strd)*c2 | (src − 1*strd + 1)*c2 | (src − 1*strd + 2)*c2 | ... | (src − 1*strd + 7)*c2 |
| v_src4 | (src + 0*strd)*c3 | (src + 0*strd + 1)*c3 | (src + 0*strd + 2)*c3 | ... | (src + 0*strd + 7)*c3 |
| v_src5 | (src + 1*strd)*c4 | (src + 1*strd + 1)*c4 | (src + 1*strd + 2)*c4 | ... | (src + 1*strd + 7)*c4 |
| v_src6 | (src + 2*strd)*c5 | (src + 2*strd + 1)*c5 | (src + 2*strd + 2)*c5 | ... | (src + 2*strd + 7)*c5 |
| v_src7 | (src + 3*strd)*c6 | (src + 3*strd + 1)*c6 | (src + 3*strd + 2)*c6 | ... | (src + 3*strd + 7)*c6 |
| v_src8 | (src + 4*strd)*c7 | (src + 4*strd + 1)*c7 | (src + 4*strd + 2)*c7 | ... | (src + 4*strd + 7)*c7 |
Table 1. Strided load
Each row of Table 1 corresponds to one regular, contiguous load multiplied by a broadcast coefficient; accumulating the rows produces the same vertical sum that the strided load of the innermost loop computed, in the spirit of the Eliminating Redundant Reductions technique. This regular-load-plus-multiply-accumulate pattern removes the need for strided loads. If all v_src* vectors are accumulated into an accumulator vector, each of its eight lanes holds one completed sum.
The following assembly snippet demonstrates this computation:
```asm
...
.inner_loop:
vsetvli zero,zero,e8,mf4,ta,ma    # vector unit config vlen=256
                                  # -> for e8,mf4 max=8 elem.
vle8.v v9,(s0)                    # Load v_src1
add a4,s0,a2                      # a4 := pu1_src
vmv1r.v v10,v8                    # v_sum = 0[0..7]
vle8.v v11,(a4)                   # Load v_src2
add a4,a4,a2                      # pu1_src += src_strd
vle8.v v12,(a4)                   # Load v_src3
add a4,a4,a2                      # pu1_src += src_strd
vle8.v v13,(a4)                   # Load v_src4
add a4,a4,a2                      # pu1_src += src_strd
vle8.v v14,(a4)                   # Load v_src5
add a4,a4,a2                      # pu1_src += src_strd
vwmaccsu.vx v10,a7,v9             # v_sum += v_src1*c0
vle8.v v9,(a4)                    # Load v_src6
add a4,a4,a2                      # pu1_src += src_strd
vwmaccsu.vx v10,s2,v11            # v_sum += v_src2*c1
vle8.v v11,(a4)                   # Load v_src7
add a4,a4,a2                      # pu1_src += src_strd
vwmaccsu.vx v10,s3,v12            # v_sum += v_src3*c2
vle8.v v12,(a4)                   # Load v_src8
vwmaccsu.vx v10,s4,v13            # v_sum += v_src4*c3
vwmaccsu.vx v10,t6,v14            # v_sum += v_src5*c4
vwmaccsu.vx v10,s7,v9             # v_sum += v_src6*c5
vwmaccsu.vx v10,s8,v11            # v_sum += v_src7*c6
vwmaccsu.vx v10,s9,v12            # v_sum += v_src8*c7
vsetvli zero,zero,e16,mf2,ta,ma   # reconfigure for CLIP_U8
vadd.vx v9,v10,t2                 # SHIFT + CLIP_U8
vsra.vi v9,v9,6                   #
vmin.vx v9,v9,t3                  #
vmax.vx v9,v9,zero                #
vsetvli zero,zero,e8,mf4,ta,ma    #
vnsrl.wi v9,v9,0                  # Narrow to 8 bit and store
vse8.v v9,(a5)                    # pu1_dst[0..7] = v_sum8bit
addi a5,a5,8                      # pu1_src += 8
addi a1,a1,8                      # pu1_dst += 8
blt t5,a3,.inner_loop             # if col < wd continue
...
```
Some notes regarding the assembly presented in the snippet:

- The s0 register holds the base address, adjusted so that accesses through the pu1_src pointer land on valid addresses.
- The a4 register holds pu1_src, offset by the appropriate stride before each vle8 instruction.
- Because the intermediate buffer is 32 bits wide, the loaded vectors must be widened, followed by a widening multiply-and-accumulate instruction. The results are then stored in the buffer, and pointers are updated accordingly based on the initial adjustments.

The horizontal filter in this function differs from the previous one in two ways: 32-bit values are loaded from the intermediate buffer, and two multiply-and-accumulate instructions are replaced with two add instructions, as reflected in the original function. As in the earlier case, we process 16 elements using vector instructions and handle the remaining elements with scalar operations. Finally, two clipping and storing operations are performed. The first snippet shows the vertically applied filter:
```asm
...
li t6,20                          # loads one of the coefficients
...
vsetvli zero,zero,e8,m1,ta,ma     # set vector unit for 8-bit load
...
vle8.v v16,(a0)                   # loads pu1_src starting with 0 index
...
vsetvli zero,zero,e16,m2,ta,ma    # reconfigure for 16-bit operations
...
vwcvtu.x.x.v v22,v16              # extending to 16 bits
...
vwmaccsu.vx v12,t6,v22            # doing widening vmacc since the
...                               # buffer is 32-bit
vse32.v v12,(t1)                  # stores the 32-bit values to buffer
...
```
Sum of Absolute Differences - ime_calculate_sad4_prog
- Problem: The compiler isn’t able to vectorize the function because of its access pattern; even if it could, a naive vector implementation wouldn’t be cost-efficient.
- Approach: Unlike the previous two functions, data is loaded separately in this case. The original function accumulates, for each position in a diamond pattern, the absolute difference between the source and a value at unit distance. Our solution computes the negation of the subtraction result and then uses vmax to select the maximum of the original result and its negation. This yields the absolute value with fewer operations.
```asm
...
vsetivli zero,16,e8,m1,ta,ma      # set vector length to 16 elements
vle8.v v3,(a1)                    # loading pu1_src
vle8.v v2,(a5)                    # loading left_ptr
...
vzext.vf4 v4,v3                   # zero-extend to wider lanes
vzext.vf4 v24,v2
vsub.vv v24,v4,v24                # diff = src - left
vneg.v v28,v24                    # -diff
vmax.vv v24,v24,v28               # |diff| = max(diff, -diff)
vadd.vv v8,v8,v24                 # accumulate SAD
...
```
Important note: because of the amount of distinct data used for each part of the diamond pattern, vector register spilling can easily occur. It is therefore important to prevent the compiler from reordering memory accesses around this code by inserting a compiler memory barrier using:
```c
asm volatile("" : : : "memory");
```
Results
Performance measurements were obtained using the Linux perf tool with custom scripts for data parsing and visualization. Collected metrics were aggregated into bar charts to illustrate performance differentials between baseline (GCC auto-vectorized) and manually optimized implementations.
The benchmark scenario involves encoding a one-second H.264 video stream at a resolution of 640×360 pixels using the libavc encoder on the Banana Pi BPI-F3 board. The input video stream comprises 25 YUV420p frames, encoded at a fixed frame rate of 25 fps and a target bitrate of approximately 530 kbps. This setup represents a realistic embedded video processing workload at low-to-medium resolution.
Optimized code was verified by comparing its output with that of the original codec; the criterion for successful verification was that the streams be bit-exact.
The table and chart below present the measured function-level performance improvements across key encoder modules. Optimized routines show a clear and consistent advantage over their compiler-generated counterparts, with some functions demonstrating substantial reductions in execution time, validating the effectiveness of the applied RVV-based techniques.
Microarchitecture configuration details:
rv64gcv_zicbom_zicboz_zicntr_zicond_zicsr_zifencei_zihintpause_zihpm_zfh_zfhmin_zca_zcd_zba_zbb_zbc_zbs_zkt_zve32f_zve32x_zve64d_zve64f_zve64x_zvfh_zvfhmin
| VARIANT | TOTAL OPCYCLES | TIME_ELAPSED [s] |
|---|---|---|
| Base (no optimizations) | 1,185,564,857 | 0.695752 |
| Optimized | 821,883,615 | 0.478514 |
| FUNCTION | % BEFORE OPT | % AFTER OPT | OPCYCLE COUNT BEFORE OPT | OPCYCLE COUNT AFTER OPT |
|---|---|---|---|---|
| F1 ime_sub_pel_compute_sad_16x16 | 12.13 | 3.85 | 143,809,017 | 31,642,519 |
| F2 ime_calculate_sad4_prog | 10.1 | 2.44 | 119,742,050 | 20,053,960 |
| F3 ih264e_sixtap_filter_2dvh_vert | 10.07 | 4.64 | 119,386,381 | 38,135,399 |
| F4 ih264_evaluate_intra16x16_modes | 5.3 | 3.89 | 62,834,937 | 31,971,272 |
| F5 ih264_inter_pred_chroma | 3.01 | 2.63 | 35,685,502 | 21,615,539 |
| F6 ih264e_sixtapfilter_horz | 2.48 | 1.22 | 29,402,008 | 10,026,980 |
Table 2. Perf measurement results
Function Legend:
F1 – ime_sub_pel_compute_sad_16x16
F2 – ime_calculate_sad4_prog
F3 – ih264e_sixtap_filter_2dvh_vert
F4 – ih264_evaluate_intra16x16_modes
F5 – ih264_inter_pred_chroma
F6 – ih264e_sixtapfilter_horz
Figure 2. Function-Level CPU Cycle Reduction After Vector Optimization
These results highlight the gap between optimal manual vectorization and current GCC auto-vectorization capabilities, particularly on pre-standard or complex loop structures. They also reinforce the value of targeted, low-level optimization in video processing workloads for emerging RISC-V vector platforms.
Multicore performance
To evaluate the scalability of the applied optimizations in a multicore environment, we extended our benchmark to a parallel configuration. The Banana Pi BPI-F3 features an octa-core RISC-V processor, and for this test, we configured the encoder to utilize 4 cores concurrently.
The table below summarizes the results comparing performance before and after optimization, in terms of total cycle count and elapsed wall-clock time. Measurements were collected using perf and confirm that the applied RVV-based optimizations retain their impact even under multicore execution.
| VARIANT | TOTAL OPCYCLES | TIME_ELAPSED [s] |
|---|---|---|
| Base (no optimizations) | 1,535,375,803 | 0.312551099 |
| Optimized | 955,958,149 | 0.220119798 |
| FUNCTION | % BEFORE OPT | % AFTER OPT |
|---|---|---|
| F1 ime_sub_pel_compute_sad_16x16 | 12.13 | 3.85 |
| F2 ime_calculate_sad4_prog | 10.1 | 2.44 |
| F3 ih264e_sixtap_filter_2dvh_vert | 10.07 | 4.64 |
| F4 ih264_evaluate_intra16x16_modes | 5.3 | 3.89 |
| F5 ih264_inter_pred_chroma | 3.01 | 2.63 |
| F6 ih264e_sixtapfilter_horz | 2.48 | 1.22 |
Table 3. Multicore Performance Comparison (4-Core Configuration)
The results show that the applied optimizations achieve a similar performance improvement in the multicore configuration as well.
Atrevido 430 results
Performance evaluation on the Semidynamics Atrevido 430 FPGA platform was conducted by executing the libavc encoder instrumented with internal software timers, as the perf infrastructure was not available in the FPGA environment. Consequently, frames-per-second (FPS) throughput was used as the primary performance indicator, providing a consistent and reproducible basis for comparing baseline (GCC/LLVM auto-vectorized) and manually optimized implementations.
The encoder was evaluated across two representative workloads: a one-second H.264 video stream at 640×360 resolution and a five-second stream at 1280×720 resolution. Both test cases were executed under the Atrevido FPGA simulation environment, enabling measurement of realistic codec throughput under controlled hardware conditions.
Collected results were aggregated into comparative charts illustrating performance differentials between compiler-generated and hand-optimized RVV implementations, demonstrating the continued benefit of low-level tuning even in pre-silicon validation environments.
Microarchitecture configuration details:
-mtune=atrevido-423t
-march=rv64gcbv_zbc_zfh_zicbop_zicboz_zicntr_zihpm_zicsr_zifencei_zihintpause_zvfh_zvfbfmin_xsmd_xsmdmat
VLEN=512 and VLEN=2048
The tables and charts below present the measured encoder-level performance improvements across key combinations of compiler and optimizations. Optimized routines show a clear and consistent advantage over their compiler-generated counterparts, validating the effectiveness of the applied RVV-based techniques.
| VARIANT | FPS 1280x720 | FPS 640x360 |
|---|---|---|
| GCC Base (no optimizations) | 0.12 | 0.80 |
| GCC + Optimizations | 0.17 | 1.01 |
| Clang Base | 0.12 | 0.89 |
| Clang + Optimizations | 0.14 | 0.91 |
Table 4. SW timer measurement results VLEN=512
Figure 3. SW timer measurement results VLEN=512 encoding 1280x720 video
Figure 4. SW timer measurement results VLEN=512 encoding 640x360 video
| VARIANT | ACHIEVED FPS 640x360 |
|---|---|
| GCC Base (no optimizations) | 0.47 |
| GCC + Optimizations | 0.70 |
| Clang Base | 0.49 |
| Clang + Optimizations | 0.67 |
Table 5. SW timer measurement results VLEN=2048
Figure 5. SW timer measurement results VLEN=2048 encoding 640x360 video
The results obtained on the Semidynamics Atrevido FPGA platform reinforce the observations made on the Banana Pi BPI-F3 system. They underline the current limitations of compiler-based auto-vectorization in Aliado SDK GCC 14.1 and Clang 19 for the RISC-V Vector Extension (RVV), particularly in compute-intensive video-encoding kernels. Across both tested resolutions and vector lengths, the manually optimized implementations consistently outperformed their auto-vectorized counterparts, often achieving significant throughput gains. These findings confirm the effectiveness of targeted RVV-based optimizations even in pre-silicon environments, and further demonstrate the scalability of the applied techniques across different toolchains and hardware configurations.
Conclusion and next steps
The results presented in this article highlight the current limitations of state-of-the-art open-source compiler auto-vectorization for the RISC-V Vector Extension (RVV), particularly in the context of performance-critical video codec functions. Across both single-core and multicore configurations on the Banana Pi BPI-F3 platform, as well as FPGA-executed measurements on the Semidynamics Atrevido platform, manually optimized implementations consistently outperformed their auto-vectorized counterparts, often by significant margins.
By leveraging RVV intrinsics and low-level architectural insights, we achieved substantial reductions in execution time and CPU cycle usage across key AVC encoder functions. These optimizations demonstrate the untapped potential of RVV in accelerating compute-intensive multimedia workloads and validate the value of hands-on, function-specific tuning for RISC-V targets.
Our work builds upon prior efforts to enable RISC-V build and scalar optimization support in open-source codec libraries. Looking ahead, we plan to extend our contributions further by integrating RVV-based optimizations into community-maintained codebases and evaluating their impact on other codecs and architectures. In parallel, we aim to deepen collaboration with ecosystem partners to advance compiler auto-vectorization capabilities and introduce profiling-guided optimization tooling—further bridging the gap between hand-tuned and compiler-generated vector code.
Dusan Stojkovic
Dejan Bokan