This article introduces the concept of microarchitectures and then shows how support for the RISC-V architecture is implemented in GCC's backend.
It answers the following questions:
- What is a microarchitecture?
- Why is it important for a compiler to support microarchitectures?
- How is the RISC-V ISA implemented in GCC?
Microarchitectures
A microarchitecture defines how a processor’s ISA is implemented on a physical level. While the ISA provides a set of instructions a processor can execute, the microarchitecture describes the hardware mechanisms used to perform these instructions. From the compiler’s point of view, the microarchitecture influences how code should be optimized for maximum performance. A compiler needs to consider how instructions are processed and scheduled at the hardware level.
Considerations for Microarchitectures:
- Pipeline: A sequence of stages through which an instruction passes to be executed. Common stages include instruction fetch (IF), instruction decode (ID), execution (EX), memory access (MEM), and write-back (WB).
- Super-pipelining: Splits the pipeline into a larger number of shorter stages, so that the clock frequency can be raised and more instructions can be in flight at once.
- Superscalar Execution: Allows the processor to issue multiple instructions per clock cycle by using multiple execution units. It requires sophisticated scheduling and enough independent instructions to exploit.
- Out-of-Order Execution (OOOE): Many modern processors do not execute instructions strictly in the order they appear in a program. They execute instructions as soon as their operands are ready, which avoids stalls and makes effective use of the available execution resources, while still committing results in program order.
- Branch Prediction: A technique that keeps the instruction pipeline flowing by guessing the outcome of conditional branches, so that processing can continue without waiting for the branch to be resolved.
- Register Renaming: Removes false data dependencies by mapping architectural registers onto a larger pool of physical registers, which eliminates write-after-read (WAR) and write-after-write (WAW) hazards.
- Cache Architecture: Caches are smaller, faster memory units close to the processor that store frequently accessed data and instructions. Microarchitectures typically have multiple cache levels (L1, L2, L3), and their design significantly impacts performance.
- Execution Units: These include the Arithmetic Logic Unit (ALU), floating-point unit (FPU), and specialized execution units such as vector or cryptography units, which process specific instruction types.
- Memory Management: Includes translation lookaside buffers (TLBs) for virtual-to-physical address translation, along with prefetching and other techniques to reduce memory latency.
- Multi-core/Hyper-threading: Modern microarchitectures often include multiple cores, which run separate threads in parallel. Some also support simultaneous multithreading (SMT, marketed by Intel as Hyper-Threading), which lets a single core execute instructions from several threads at the same time.
The image shows an example of how a processor capable of out-of-order execution might rearrange instructions during program execution to avoid stalls. By rearranging the instructions, the processor makes better use of its pipeline and reduces the number of cycles needed to execute the code snippet.
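The effect can be made concrete with a toy timing model. This C sketch assumes a single-issue core, a made-up instruction format (`ToyInsn`) and made-up latencies. The out-of-order variant is idealized, with an unlimited window and no other resource limits. It is not how real hardware or GCC models timing.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical toy instruction: a latency and at most one producer.
   Assumes depends_on refers to an earlier instruction (or is -1). */
typedef struct {
    int latency;    /* cycles until the result is available */
    int depends_on; /* index of the producing instruction, or -1 */
} ToyInsn;

#define MAX_INSNS 16

/* In-order, single-issue: an instruction cannot issue before its
   predecessor in program order, nor before its operand is ready.
   Returns the cycle in which the last result becomes available. */
int cycles_in_order(const ToyInsn *insns, size_t n) {
    int ready[MAX_INSNS];
    int issue = -1, finish = 0;
    if (n > MAX_INSNS) return -1;
    for (size_t i = 0; i < n; i++) {
        int earliest = issue + 1;
        int dep = insns[i].depends_on;
        if (dep >= 0 && ready[dep] > earliest)
            earliest = ready[dep];
        issue = earliest;
        ready[i] = issue + insns[i].latency;
        if (ready[i] > finish) finish = ready[i];
    }
    return finish;
}

/* Idealized out-of-order, single-issue: every cycle, issue the oldest
   instruction whose operand is ready. */
int cycles_out_of_order(const ToyInsn *insns, size_t n) {
    int ready[MAX_INSNS];
    bool issued[MAX_INSNS] = {false};
    size_t done = 0;
    int finish = 0;
    if (n > MAX_INSNS) return -1;
    for (int cycle = 0; done < n; cycle++) {
        for (size_t i = 0; i < n; i++) {
            int dep = insns[i].depends_on;
            if (issued[i]) continue;
            if (dep >= 0 && (!issued[dep] || ready[dep] > cycle)) continue;
            issued[i] = true;
            ready[i] = cycle + insns[i].latency;
            if (ready[i] > finish) finish = ready[i];
            done++;
            break; /* at most one instruction issues per cycle */
        }
    }
    return finish;
}
```

Consider a load (latency 3), followed by an add that depends on it and two independent adds. The in-order model finishes after 6 cycles and the out-of-order model after 4, because the independent adds fill the cycles spent waiting for the load.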
GCC support for RISC-V
Before getting into implementation details, it is worth listing the compilation techniques through which GCC takes microarchitectural characteristics into account:
- Instruction Scheduling: The compiler must schedule instructions intelligently to avoid pipeline stalls, which occur when a resource or result needed by an instruction is not yet available. GCC's scheduler uses the target's pipeline description to order independent instructions so that the pipeline stays full. On RISC-V processors with deep pipelines, scheduling becomes even more critical.
- Loop Unrolling and Vectorization: Modern processors, including RISC-V implementations with multiple execution units, can benefit from loop unrolling. This technique replicates the loop body so that each iteration does more work, which reduces branch overhead and exposes more independent instructions. Additionally, if the processor supports data-parallel instructions through extensions such as RVV (RISC-V Vector Extension), GCC can generate vectorized code that processes several data elements per instruction.
- Register Allocation and Renaming: Efficient register allocation is a critical aspect of optimizing for microarchitectures. Poor register allocation leads to register spilling, where values are temporarily stored in memory, which introduces performance penalties. GCC's register allocator tries to keep values in registers. Its optional register renaming pass (-frename-registers) can remove false dependencies that would otherwise constrain instruction scheduling.
- Branch Prediction Optimization: Even on processors with sophisticated branch predictors, compilers can help. They can use profiling information to learn which branches are likely to be taken and lay out code so that the common path falls through. GCC's profile-guided optimization (PGO) support enables this, which improves performance on RISC-V microarchitectures where mispredicted branches introduce delays.
- Cache and Memory Access Optimization: Optimizing for cache usage is another essential area. GCC can improve memory access patterns to make better use of the cache hierarchy (L1, L2, L3), for example by grouping frequently executed code and, with some loop optimizations, reordering accesses to exploit spatial and temporal locality. Some microarchitectures also support memory prefetching, which GCC can exploit by inserting prefetch instructions (for example with -fprefetch-loop-arrays) to hide memory latency.
- ISA Extensions and Advanced Instructions: Certain microarchitectures support advanced instruction sets that GCC can exploit to accelerate specific operations. For instance, the RISC-V Vector Extension (RVV) and other modular extensions allow compilers to generate specialized code for the target. These extensions are enabled through the -march option (for example, -march=rv64gcv enables RVV), while -mtune selects the tuning.
- Target-Specific Backends: GCC has a highly customizable backend architecture in which different microarchitectures are described through tuning options. For RISC-V, the pipeline models and instruction attributes live in riscv.md and the per-core scheduling files it includes. The supported cores and tuning names are listed in riscv-cores.def, and the per-core cost models are defined in riscv.cc. Together, these configurations let GCC tailor the generated code to a specific RISC-V core.
- Profile-Guided Optimization (PGO): PGO lets the compiler make informed decisions based on profiles collected while running an instrumented build of the program (-fprofile-generate, then -fprofile-use). This helps optimize for the microarchitectural characteristics of the processor, such as branch behavior, cache usage patterns, and hot paths in the code.
- Instruction Set Utilization: Depending on the microarchitecture, compilers may be able to generate code that uses advanced instruction sets such as AVX (Advanced Vector Extensions) on x86 or NEON on ARM. These instruction sets can significantly speed up mathematical operations, but only if the compiler knows that the target supports them.
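Loop unrolling is easiest to see in source form. The following C sketch applies it by hand: the loop is unrolled by 4 and uses independent accumulators. GCC can perform similar transformations itself (for example with -funroll-loops), and with vectorization enabled (-O3, or -ftree-vectorize) and a vector-capable -march such as rv64gcv it may emit RVV code instead. The exact code GCC produces depends on version and flags.

```c
#include <stddef.h>

/* Straightforward loop: one add plus one loop branch per element. */
long sum_simple(const int *a, size_t n) {
    long s = 0;
    for (size_t i = 0; i < n; i++)
        s += a[i];
    return s;
}

/* Unrolled by 4 with independent accumulators: fewer branches per element,
   and four independent add chains that a superscalar or out-of-order core
   can execute in parallel. A remainder loop handles n not divisible by 4. */
long sum_unrolled4(const int *a, size_t n) {
    long s0 = 0, s1 = 0, s2 = 0, s3 = 0;
    size_t i = 0;
    for (; i + 4 <= n; i += 4) {
        s0 += a[i];
        s1 += a[i + 1];
        s2 += a[i + 2];
        s3 += a[i + 3];
    }
    for (; i < n; i++) /* leftover elements */
        s0 += a[i];
    return s0 + s1 + s2 + s3;
}
```

Integer addition can be reassociated without changing the result, so both functions return the same sum. For floating-point data, the compiler may only reorder the additions this way when flags such as -ffast-math permit it.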
From the compiler’s point of view, the microarchitecture influences how instructions are processed and scheduled at the hardware level. In the context of RISC-V, where modularity allows for customization of processor designs, precise machine descriptions are key for efficient code generation. GCC implements various optimizations tailored to these microarchitectures in the back-end (highlighted in blue). GCC also includes custom RTL optimization passes for RISC-V (highlighted in green).
The rest of this article focuses on the back-end and briefly covers how the back-end inserts custom RISC-V RTL passes.
RISC-V Microarchitecture Support in GCC
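For microarchitecture-specific optimizations to occur, GCC relies on machine description (MD) files. RISC-V uses several of them. Examples include vector.md for the vector extension and per-core pipeline descriptions such as sifive-7.md, all of which are included from the main file.
Our focus will mostly be on that main machine description, gcc/config/riscv/riscv.md.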
The main RISC-V machine description file
The RISC-V MD is a great place to start exploring the internals of RISC-V GCC. The microarchitecture-specific descriptions are included from the base RISC-V MD and add their own pipeline models on top of it. The machine description file for RISC-V defines the following aspects of the architecture:
- Instruction Set: Describes how the basic instructions (e.g., addition, subtraction, branching) are implemented.
- Instruction Attributes: Assigns attributes to instructions, such as their type or operand mode. The pipeline descriptions map these attributes to latencies and pipeline resources, and they also mark instructions that need special handling for memory access.
- Optimization Guidance: The MD file provides GCC with hints about how to optimize code generation for specific core features, such as vector processing or floating-point arithmetic.
- ISA Extensions: RISC-V's modular nature allows for optional extensions, such as the standard vector (V) and atomic (A) extensions, as well as vendor-specific custom extensions. Their instructions are defined in the machine description, which allows GCC to generate code that takes full advantage of these features.
- Unspec Operations: The RISC-V machine description also includes unspec operations, which are used to define custom behavior for instructions that do not directly map to standard GCC constructs.
Sections of the RISC-V MD File
1. Instruction Definitions:
Instructions are described in the machine description using the define_insn construct. For example, a simplified version of the addsi3 pattern, which describes the addition of two 32-bit (SImode) integers, is structured like this:
(define_insn "addsi3"
  [(set (match_operand:SI 0 "register_operand" "=r")
        (plus:SI (match_operand:SI 1 "register_operand" "r")
                 (match_operand:SI 2 "arith_operand" "r")))]
  ""
  "add\t%0, %1, %2"
  [(set_attr "type" "arith")])
Each match_operand entry describes one operand: its mode (SI), a predicate such as register_operand, and a constraint such as "r" (a general-purpose register), with "=" marking the output. The string "add\t%0, %1, %2" is the assembly output template, in which %N is replaced by operand N. The instruction attributes guide GCC on how to schedule the instruction for performance.
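The output template mechanism can be sketched in a few lines of C. `expand_template` is a hypothetical helper that only replaces single-digit `%N` placeholders with operand text. GCC's real output routine (output_asm_insn in final.cc) also handles operand modifiers, punctuation codes and multi-digit operand numbers.

```c
#include <ctype.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: copy an output template into out, replacing each
   %N (single digit) with the text of operand N. Returns the length of the
   result, or -1 on an unknown operand or a too-small buffer. */
int expand_template(const char *tmpl, const char *const *ops, size_t n_ops,
                    char *out, size_t out_size) {
    size_t len = 0;
    if (out_size == 0) return -1;
    for (const char *p = tmpl; *p != '\0'; p++) {
        const char *piece = p;   /* by default copy one literal character */
        size_t piece_len = 1;
        if (p[0] == '%' && isdigit((unsigned char)p[1])) {
            size_t idx = (size_t)(p[1] - '0');
            if (idx >= n_ops) return -1; /* no such operand */
            piece = ops[idx];
            piece_len = strlen(piece);
            p++;                         /* skip the digit as well */
        }
        if (len + piece_len + 1 > out_size) return -1; /* keep room for '\0' */
        memcpy(out + len, piece, piece_len);
        len += piece_len;
    }
    out[len] = '\0';
    return (int)len;
}
```

With operands a0, a1 and a2, the addsi3 template "add\t%0, %1, %2" expands to "add\ta0, a1, a2" (14 characters).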
2. Attributes and Scheduling:
Each instruction in the MD file has associated attributes that help GCC understand its execution characteristics. These attributes are used to inform the instruction scheduling pass, ensuring efficient usage of the processor's resources. For example, the type attribute in the addsi3 instruction marks it as an arithmetic operation:
(set_attr "type" "arith")
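Other attributes, such as the operand mode (for example SI for 32-bit or DI for 64-bit values), are set with the same construct. The pipeline models then use define_insn_reservation to map each instruction type to a latency and to the functional units it occupies.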
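The following C sketch shows how a scheduler might use a type-to-latency mapping. It assumes an imaginary single-issue, in-order core. The enum values mirror a few type attribute names, and the latencies are placeholders rather than figures from any real RISC-V pipeline model.

```c
#include <stddef.h>

/* Hypothetical instruction classes, mirroring a few "type" values. */
enum insn_type { T_ARITH, T_LOAD, T_IMUL, T_BRANCH };

/* Placeholder latencies; in GCC they come from define_insn_reservation. */
static int latency_of(enum insn_type t) {
    switch (t) {
    case T_ARITH:  return 1;
    case T_LOAD:   return 3;
    case T_IMUL:   return 4;
    case T_BRANCH: return 1;
    }
    return 1;
}

/* Cycles until the last result of a chain of dependent instructions is
   available, if each one starts as soon as its input is ready. */
int chain_latency(const enum insn_type *chain, size_t n) {
    int total = 0;
    for (size_t i = 0; i < n; i++)
        total += latency_of(chain[i]);
    return total;
}

/* Stall cycles seen by a consumer placed right after `independent`
   unrelated instructions that follow the producer: this is the gap that
   the scheduler tries to fill. */
int stall_cycles(enum insn_type producer, int independent) {
    int stall = latency_of(producer) - 1 - independent;
    return stall > 0 ? stall : 0;
}
```

With these placeholder numbers, a load followed immediately by its consumer stalls for 2 cycles. Scheduling two independent instructions in between hides the load latency completely.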
3. Branch and Jump Instructions:
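The riscv.md file also defines how branching and jumping are handled. In the actual file, conditional branches are covered by a generic pattern that uses a match_operator over the comparison code. A simplified pattern dedicated to bne (branch if not equal), with its operand matching and assembly output, would look like this: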
(define_insn "branch_not_equal"
  [(set (pc)
        (if_then_else (ne (match_operand:SI 0 "register_operand" "r")
                          (match_operand:SI 1 "register_operand" "r"))
                      (label_ref (match_operand 2 ""))
                      (pc)))]
  ""
  "bne\t%0, %1, %2"
  [(set_attr "type" "branch")])
These definitions allow GCC to generate branch instructions correctly based on the target architecture and optimize them accordingly for the pipeline stages.
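The semantics of the branch pattern can be modeled in plain C. This hypothetical function computes the next program counter. When the registers differ, control goes to the label. Otherwise the `(pc)` arm applies, which means falling through to the next sequential instruction, 4 bytes later for an uncompressed RISC-V instruction. As in RISC-V, the branch offset is relative to the address of the branch itself.

```c
#include <stdint.h>

/* Hypothetical model of
     (set (pc) (if_then_else (ne a b) (label_ref L) (pc)))
   for an uncompressed (4-byte) bne instruction at address pc. */
uint64_t next_pc_bne(uint64_t pc, int64_t a, int64_t b, int64_t offset) {
    if (a != b)
        return pc + (uint64_t)offset; /* taken: jump to the label */
    return pc + 4;                    /* not taken: fall through */
}
```

The unsigned addition wraps modulo 2^64, so negative offsets (backward branches, as in loops) work as expected.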
4. Floating-Point and Vector Instructions:
For targets that support floating-point or vector operations, such as the SiFive X280 or other cores implementing the RISC-V Vector extension (RVV), the machine description contains dedicated patterns for these instructions (most RVV patterns live in vector.md). A simplified example:
(define_insn "vfmadd"
  [(set (match_operand:VF 0 "register_operand" "=vr")
        (fma:VF (match_operand:VF 1 "register_operand" "vr")
                (match_operand:VF 2 "register_operand" "0")
                (match_operand:VF 3 "register_operand" "vr")))]
  "TARGET_VECTOR"
  "vfmadd.vv\t%0, %1, %3"
  [(set_attr "type" "vfmuladd")])
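This simplified entry illustrates the fused multiply-add operation for vector floating-point values, allowing GCC to generate optimized code for RISC-V processors with vector support. The real RVV patterns in GCC carry additional operands, for example for the mask and the vector length.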
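In the RVV specification, vfmadd.vv vd, vs1, vs2 overwrites its multiplicand: vd[i] = (vs1[i] * vd[i]) + vs2[i]. The following C reference loop sketches these element-wise semantics for the first vl elements. It ignores masking and tail policy. Unlike the instruction, which rounds once, the plain C expression may round twice unless the compiler contracts it into an FMA.

```c
#include <stddef.h>

/* Element-wise reference for vfmadd.vv vd, vs1, vs2 (unmasked):
   vd[i] = vs1[i] * vd[i] + vs2[i] for i < vl. */
void vfmadd_vv_ref(float *vd, const float *vs1, const float *vs2, size_t vl) {
    for (size_t i = 0; i < vl; i++)
        vd[i] = vs1[i] * vd[i] + vs2[i];
}
```

Because vd is both an input and the output, a machine description pattern for this instruction has to tie the destination to one of the source operands.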
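5. Unspec Operations:
Operations that have no direct equivalent among GCC's standard RTL codes are wrapped in unspec expressions. For example, atomic memory operations or custom instruction extensions like cryptography or bit manipulation can be defined with unspec entries, which ensures that GCC handles them correctly. The unspec codes themselves are listed in an enumeration. An excerpt from riscv.md, whose entries relate to addressing and thread-local storage, looks like this:
(define_c_enum "unspec" [
  UNSPEC_ADDRESS_FIRST
  UNSPEC_FORCE_FOR_MEM
  UNSPEC_PCREL
  UNSPEC_LOAD_GOT
  UNSPEC_TLS
  UNSPEC_TLS_LE
  UNSPEC_TLS_IE
  ...
])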
These custom operations give GCC a flexible framework for supporting new or non-standard instructions as they are added to the RISC-V ecosystem. The machine description is also essential for handling RISC-V's various ISA extensions. Vector and atomic instructions, for example, need their own definitions. The type attribute, declared with define_attr, classifies instructions into kinds such as atomic operations, cryptography instructions, or vector arithmetic (excerpt):
(define_attr "type" "unknown,branch,jump,jalr,ret,call,load,fpload,store,fpstore, atomic,condmove,crypto,vector,bitmanip")
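At build time, GCC's generator programs turn such MD declarations into C enumerations. Backend code then works with these enumerations, for example through `get_attr_type`. The following C sketch is a hand-written approximation of that output, restricted to the excerpts shown above. `type_accesses_memory` is a hypothetical helper of the kind backend code might write against the type attribute.

```c
#include <stdbool.h>

/* Approximation of the enum generated from (define_c_enum "unspec" ...),
   limited to the excerpt above. */
enum unspec {
    UNSPEC_ADDRESS_FIRST,
    UNSPEC_FORCE_FOR_MEM,
    UNSPEC_PCREL,
    UNSPEC_LOAD_GOT,
    UNSPEC_TLS,
    UNSPEC_TLS_LE,
    UNSPEC_TLS_IE
};

/* Approximation of the enum generated from (define_attr "type" ...),
   using GCC's TYPE_ prefix convention and the excerpt above. */
enum attr_type {
    TYPE_UNKNOWN, TYPE_BRANCH, TYPE_JUMP, TYPE_JALR, TYPE_RET, TYPE_CALL,
    TYPE_LOAD, TYPE_FPLOAD, TYPE_STORE, TYPE_FPSTORE, TYPE_ATOMIC,
    TYPE_CONDMOVE, TYPE_CRYPTO, TYPE_VECTOR, TYPE_BITMANIP
};

/* Hypothetical helper: does an instruction of this type access memory? */
bool type_accesses_memory(enum attr_type t) {
    switch (t) {
    case TYPE_LOAD: case TYPE_FPLOAD:
    case TYPE_STORE: case TYPE_FPSTORE:
    case TYPE_ATOMIC:
        return true;
    default:
        return false;
    }
}
```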
Custom RTL passes
Each architecture in GCC can specify its own custom RTL optimization passes. The image shows a simplified view of GCC's internals. These passes operate on the register transfer language (RTL), GCC's low-level intermediate representation. RTL describes each instruction as a transfer of values between registers and memory, and it refers to the target's hardware registers once register allocation has run.
Earlier in the pipeline, the front end represents the source code as trees (GENERIC), which are then lowered to GIMPLE. Most hardware-independent optimizations run on GIMPLE in the middle end.
The custom RISC-V RTL passes are registered in gcc/config/riscv/riscv-passes.def. They perform optimizations specific to RISC-V.
INSERT_PASS_AFTER (pass_rtl_store_motion, 1, pass_shorten_memrefs);
INSERT_PASS_AFTER (pass_split_all_insns, 1, pass_avlprop);
INSERT_PASS_BEFORE (pass_fast_rtl_dce, 1, pass_vsetvl);
The example above registers three RTL passes specific to RISC-V (shorten_memrefs, avlprop and vsetvl), each placed immediately before or after an existing generic pass.
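Conceptually, INSERT_PASS_AFTER and INSERT_PASS_BEFORE splice a new pass into GCC's ordered pass list next to a reference pass. This C sketch models that list as a hypothetical array of pass names. GCC's real pass manager links pass objects rather than strings. The second macro argument (1 above) selects which instance of the reference pass is meant, and this sketch only handles the first instance.

```c
#include <stddef.h>
#include <string.h>

#define MAX_PASSES 32

/* Hypothetical pass list: an ordered array of pass names. */
typedef struct {
    const char *names[MAX_PASSES];
    size_t count;
} PassList;

/* Insert new_pass right before (after == 0) or right after (after != 0)
   the first pass named ref. Returns 0 on success, or -1 if ref is not
   found or the list is full. */
int insert_pass(PassList *list, const char *ref, const char *new_pass,
                int after) {
    if (list->count >= MAX_PASSES) return -1;
    for (size_t i = 0; i < list->count; i++) {
        if (strcmp(list->names[i], ref) != 0) continue;
        size_t pos = after ? i + 1 : i;
        /* shift the tail one slot to the right to make room */
        memmove(&list->names[pos + 1], &list->names[pos],
                (list->count - pos) * sizeof list->names[0]);
        list->names[pos] = new_pass;
        list->count++;
        return 0;
    }
    return -1;
}
```

Applying the first and third registrations to the list rtl_store_motion, split_all_insns, fast_rtl_dce yields rtl_store_motion, shorten_memrefs, split_all_insns, vsetvl, fast_rtl_dce.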
- shorten_memrefs - rewrites groups of loads and stores that share a base register but use large offsets, so that they use a new base register with small offsets instead. This makes them eligible for compressed (RVC) encodings and reduces code size.
- avlprop - optimizes the AVL (application vector length), the requested element count passed to vsetvl, as it is propagated through vector operations. It optimizes vector loops by removing redundant AVL computations while maintaining AVL consistency.
- vsetvl - inserts and optimizes vsetvl instructions, which configure the vector unit (vector length and element type) before vector instructions execute. The goal of this pass is to use as few vsetvl instructions as possible while still configuring every vector instruction correctly.
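The core idea behind avoiding redundant configuration instructions can be sketched for a straight-line sequence. In this C sketch, a vsetvli is needed only where the required configuration differs from the one already in effect. The `VConfig` type is hypothetical and represents LMUL as an integer only. The real pass works across basic blocks using dataflow analysis and also reuses configurations that are compatible without being identical.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical configuration required by one vector instruction:
   element width in bits (SEW), integer register grouping (LMUL) and
   the requested application vector length (AVL). */
typedef struct {
    int sew;
    int lmul;
    long avl;
} VConfig;

static bool same_config(VConfig a, VConfig b) {
    return a.sew == b.sew && a.lmul == b.lmul && a.avl == b.avl;
}

/* Walk the vector instructions in program order and record the indices
   that need a vsetvli in front of them, because their configuration
   differs from the one currently in effect. Returns how many there are. */
size_t plan_vsetvls(const VConfig *cfg, size_t n, size_t *need_vsetvl) {
    size_t count = 0;
    bool have_current = false;
    VConfig current = {0, 0, 0};
    for (size_t i = 0; i < n; i++) {
        if (!have_current || !same_config(current, cfg[i])) {
            need_vsetvl[count++] = i;
            current = cfg[i];
            have_current = true;
        }
    }
    return count;
}
```

For five vector instructions using SEW 32, 32, 64, 64, 32 (same LMUL and AVL), only three vsetvli instructions are needed: before instructions 0, 2 and 4.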
Tuning Options
The main RISC-V MD file includes tuning options that allow GCC to optimize instruction generation based on the target core’s microarchitecture. For example, tuning for SiFive or other custom RISC-V cores is supported through attributes that guide instruction scheduling and selection:
(define_attr "tune"
  "generic,sifive_7,sifive_p400,sifive_p600,xiangshan,generic_ooo"
  (const (symbol_ref "((enum attr_tune) riscv_microarchitecture)")))
The file gcc/config/riscv/riscv-cores.def defines configurations for various cores and tuning options used by the compiler.
These tuning options ensure that the generated code is optimized for the pipeline, cache hierarchy, and execution units of the specific core.
Notice the RISCV_TUNE macro, which registers tuning options for the different microarchitectures and helps the compiler optimize code for various RISC-V cores. Here's what to look for:
#define RISCV_TUNE(TUNE_NAME, PIPELINE_MODEL, TUNE_INFO)
...
RISCV_TUNE("rocket", generic, rocket_tune_info)
- TUNE_NAME: Name of the microarchitecture (e.g. rocket, sifive-7-series).
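- PIPELINE_MODEL: The pipeline model for the microarchitecture. It corresponds to one of the values of the tune attribute in riscv.md, whose scheduling descriptions are kept in files such as sifive-7.md.
- TUNE_INFO: The cost model (tuning parameters such as instruction costs and issue rate) for the microarchitecture, defined in riscv.cc.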
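A .def file like riscv-cores.def is consumed with the "X-macro" technique. The including file defines RISCV_TUNE to expand each entry into whatever it needs, such as a table row, and then includes the .def file. This C sketch is self-contained, so it writes the entries inline instead of including the file. The structure and enum are heavily reduced hypothetical stand-ins, and the numeric values are placeholders rather than GCC's real tuning data.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical, reduced tuning parameters (placeholder values below). */
struct tune_param {
    int issue_rate;
    int branch_cost;
};

/* Hypothetical pipeline model identifiers. */
enum pipeline_model { generic, sifive_7 };

static const struct tune_param rocket_tune_info = {1, 3};   /* placeholder */
static const struct tune_param sifive_7_tune_info = {2, 4}; /* placeholder */

struct tune_entry {
    const char *name;
    enum pipeline_model model;
    const struct tune_param *info;
};

/* Each RISCV_TUNE line expands into one table row. */
#define RISCV_TUNE(TUNE_NAME, PIPELINE_MODEL, TUNE_INFO) \
    {TUNE_NAME, PIPELINE_MODEL, &TUNE_INFO},
static const struct tune_entry tune_table[] = {
    /* In GCC, lines like these come from riscv-cores.def. */
    RISCV_TUNE("rocket", generic, rocket_tune_info)
    RISCV_TUNE("sifive-7-series", sifive_7, sifive_7_tune_info)
};
#undef RISCV_TUNE

/* Find the entry for a -mtune=NAME style string, or NULL if unknown. */
const struct tune_entry *find_tune(const char *name) {
    for (size_t i = 0; i < sizeof tune_table / sizeof tune_table[0]; i++)
        if (strcmp(tune_table[i].name, name) == 0)
            return &tune_table[i];
    return NULL;
}
```

Keeping the entries in a single .def file lets several consumers (tables, option validation, documentation) expand the same list in different ways without duplicating it.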
Tuning, selected with GCC's -mtune option, instructs the compiler to optimize the generated code for the target hardware implementation (the microarchitecture). For example, some processors have branch prediction or cache characteristics that can be exploited to improve the performance of the generated code.
The pipeline model describes how the CPU processes instructions: each instruction passes through several stages, and different instructions can occupy different stages at the same time.
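The benefit of pipelining can be quantified with the idealized textbook formulas, which assume no stalls. For n instructions on a k-stage pipeline, execution without pipelining takes n * k cycles. With pipelining, a new instruction enters every cycle, so the total is k + n - 1 cycles. A w-wide superscalar pipeline that issues w instructions per cycle needs k + ceil(n / w) - 1 cycles. The following C functions evaluate these formulas.

```c
/* Idealized cycle counts for n independent instructions on a k-stage
   pipeline, ignoring stalls, hazards and branch mispredictions. */
long cycles_unpipelined(long n, long k) {
    return n * k; /* each instruction occupies the whole datapath */
}

long cycles_pipelined(long n, long k) {
    return n == 0 ? 0 : k + n - 1; /* fill the pipe, then 1 per cycle */
}

long cycles_superscalar(long n, long k, long w) {
    /* w instructions enter per cycle: ceil(n / w) issue groups */
    return n == 0 ? 0 : k + (n + w - 1) / w - 1;
}
```

For 10 instructions on the classic five-stage IF/ID/EX/MEM/WB pipeline, this gives 50 cycles unpipelined, 14 cycles pipelined, and 9 cycles on a 2-wide superscalar pipeline.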
The RISCV_CORE macro below defines a RISC-V core, its default architecture, and the microarchitecture used for scheduling decisions. Here's what to look for:
#define RISCV_CORE(CORE_NAME, ARCH, MICRO_ARCH)
...
RISCV_CORE("sifive-u74", "rv64imafdc", "sifive-7-series")
- CORE_NAME: Name of the processor core.
- ARCH: Default architecture for the core.
- MICRO_ARCH: Name of the microarchitecture implemented by the core (used for tuning and scheduling decisions), which should match one of the TUNE_NAME values defined above.
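The RISCV_CORE entries can be consumed with the same X-macro technique. This C sketch builds a core table and resolves a -mcpu=NAME style string to its default architecture and to the microarchitecture name used for tuning. The struct and function are hypothetical, and the table contains only the single entry quoted above.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical row type for one RISCV_CORE entry. */
struct core_entry {
    const char *core_name;
    const char *arch;
    const char *micro_arch;
};

#define RISCV_CORE(CORE_NAME, ARCH, MICRO_ARCH) {CORE_NAME, ARCH, MICRO_ARCH},
static const struct core_entry core_table[] = {
    /* In GCC, lines like this come from riscv-cores.def. */
    RISCV_CORE("sifive-u74", "rv64imafdc", "sifive-7-series")
};
#undef RISCV_CORE

/* Return the entry for a core name, or NULL if the core is unknown. */
const struct core_entry *find_core(const char *name) {
    for (size_t i = 0; i < sizeof core_table / sizeof core_table[0]; i++)
        if (strcmp(core_table[i].core_name, name) == 0)
            return &core_table[i];
    return NULL;
}
```

The micro_arch string returned here would then be looked up among the TUNE_NAME entries to find the pipeline model and cost model, which is why the two names must match.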
Conclusion
To conclude, this article serves as an introduction to RISC-V microarchitectures and to how support for RISC-V is included in GCC. This is by no means a complete description of the RISC-V backend: the backend's C++ source code and the more specific MD files are left for the reader to explore. Still, the reader should now have a place to start and sufficient knowledge for us to keep exploring specific RISC-V GCC microarchitectures.
Dusan Stojkovic