Overview of sifive-7-series microarchitecture in RISC-V GCC

Category of: Blog

Tagged with: RISC-V

In the previous article, we went over the concept of microarchitectures and introduced the RISC-V ISA. With an understanding of what machine description files are, we can now look at how a specific microarchitecture is implemented.

At the end of the previous article, a code snippet was presented that defines a sifive-7-series core:

#define RISCV_CORE(CORE_NAME, ARCH, MICRO_ARCH)
...
RISCV_CORE("sifive-u74", "rv64imafdc", "sifive-7-series")

This is only the tip of the iceberg: each RISC-V core definition names a microarchitecture, and that name tells GCC which pipeline model to use when scheduling instructions for the core.
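To make the role of a core definition concrete, here is a minimal Python sketch that reads RISCV_CORE entries and maps each core name to its architecture string and microarchitecture. It is not how GCC consumes the file (GCC expands the macro with the C preprocessor); the parse_cores helper and the CORES_DEF string are hypothetical names used only for illustration, and the single entry is the one quoted above.

```python
import re

# One entry copied from the article; the real definitions file holds many more.
CORES_DEF = '''
RISCV_CORE("sifive-u74", "rv64imafdc", "sifive-7-series")
'''

# Matches RISCV_CORE("<core>", "<arch>", "<micro-arch>")
ENTRY = re.compile(
    r'RISCV_CORE\(\s*"([^"]+)"\s*,\s*"([^"]+)"\s*,\s*"([^"]+)"\s*\)'
)

def parse_cores(text):
    """Return {core_name: (arch, micro_arch)} for every RISCV_CORE entry."""
    return {core: (arch, tune) for core, arch, tune in ENTRY.findall(text)}

cores = parse_cores(CORES_DEF)
arch, tune = cores["sifive-u74"]
print(arch, tune)  # rv64imafdc sifive-7-series
```

The third field is the one that matters for this article: it selects the pipeline description used for scheduling.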

In GCC, RISC-V cores are defined in the following file:

gcc/config/riscv/riscv-cores.def

From riscv-cores.def it can also be deduced that the tuning for the sifive-7-series pipeline can be found here:

gcc/config/riscv/sifive-7.md

Before continuing with the GCC source, the features of the SiFive 7 series cores should be pointed out.

Features of RISC-V SiFive 7 Series Cores

As of the time of writing of this article, the following SiFive 7 Series cores are defined:

The sifive-e76 and sifive-s76 cores

In the following table, it can be seen that sifive-e76 and sifive-s76 are very similar:

Core name	Cores (RISC-V spec)	Mode Support	Pipeline	Memory
sifive-e76	4x RV32IMAFC E76 Cores	Machine and User	In-order, 8-stage pipeline	32KB Instruction Cache, 32KB Instruction Tightly Integrated Memory (ITIM), 32KB Data Cache, 32KB FIO RAM, 256KB L2 Cache
sifive-s76	4x RV64GC S76 Cores	Machine and User	In-order, 8-stage pipeline	32KB Instruction Cache, 32KB Instruction Tightly Integrated Memory (ITIM), 32KB Data Cache, 32KB FIO RAM, 256KB L2 Cache

The notable difference is that one is built from E76 cores and the other from S76 cores:

sifive-e76: The E76-MC Core Complex includes four 32-bit E7 RISC‑V cores, which each have a dual-issue, in-order execution pipeline, with a peak execution rate of two instructions per clock cycle. Each E7 core supports machine and user privilege modes, as well as standard Multiply (M), Single-Precision Floating Point (F), Atomic (A), Compressed (C), and Bit Manipulation (B) RISC‑V extensions (RV32IMAFCB).
- High-performance TileLink Interface
- Benchmark Scores – 2.3 DMIPS/MHz, 4.9 CoreMark/MHz
sifive-s76: Each S7 core supports machine and user privilege modes, as well as standard Multiply (M), Single-Precision Floating Point (F), Double-Precision Floating Point (D), Atomic (A), Compressed (C), and Bit Manipulation (B) RISC‑V extensions (RV64GCB).
- Efficient and flexible interrupts
- Physical Memory Protection (PMP)
- High-performance TileLink Interface
- Benchmark Scores – 2.5 DMIPS/MHz, 4.9 CoreMark/MHz

To conclude, E76 cores are simpler 32-bit cores, while S76 cores are 64-bit and add features such as double-precision floating point.

The sifive-u74 cores

sifive-u74: The U7 core supports machine, supervisor, and user privilege modes, as well as standard Multiply (M), Single-Precision Floating Point (F), Double-Precision Floating Point (D), Atomic (A), Compressed (C), and Bit Manipulation (B) RISC‑V extensions (RV64GCB).
Features:
- Fully-compliant with the RISC-V ISA specification
- 4x RV64GC U74 Application Cores with 32KB L1 I-cache with ECC, 32KB L1 D-cache with ECC, 8x Region Physical Memory Protection, Sv39 Virtual Memory support with 38 Physical Address bits
- 1x RV64IMAC S7 Monitor Core with 16KB L1 I-Cache with ECC, 8KB DTIM with ECC, 8x Region Physical Memory Protection
- U74 and S7 cores are fully-coherent
- Integrated 2MB L2 Cache with ECC
- CLIC for timer and software interrupts
- PLIC with support for up to 128 interrupts with 7 priority levels
- Real-time Capabilities – The L1 Instruction Cache and the L2 Cache can be configured into high-speed deterministic SRAMs
- Debug with instruction trace
- Benchmark Scores – 2.5 DMIPS/MHz, 4.9 CoreMark/MHz

With even more complex features, the U74 cores offer real-time capabilities, allowing them to run time-critical applications.

The sifive-x280 cores

sifive-x280 features:
- 64-bit RISC-V ISA
- 8-stage dual-issue superscalar in-order pipeline for scalar computation
- SiFive Intelligence Extensions, which are custom instructions that accelerate AI/ML performance critical operations
- Multi-core, multi-cluster processor configuration options, with up to 8 cores
- Loosely coupled Vector Computation Pipeline, ALU implementing the RISC-V Vector extension specification 1.0
- INT8, INT16 & INT32, FP16, FP32 & FP64, and Q8.8 to Q15 fixed point data-types
  - Vector FP64 can be made optional for area and power-constrained markets
- 512-bit vector register length (VLEN)
  - Variable length operations, up to 512 bits of data per cycle, offering the ideal balance of control logic and data-parallel compute
- 256-bit Vector ALU and Load/Store architecture
- High performance vector memory subsystem
- Vector data stride L2 prefetcher unit
- Decoupled scalar and vector pipelines for optimum parallel execution of scalar and vector computation
- Memory parallelism provides cache miss tolerance
- Multi-layer caching support for optimum data movement
- Virtual memory support, with up to 48-bit addressing, with precise exceptions
- High performance, flexible connectivity to SoC peripherals

Only the SiFive X280 in the SiFive 7 series supports the vector extension (RVV). The -mtune=sifive-7-series option selects only the scheduling model shared by all of these cores; it does not change which instructions GCC may emit. Code that uses RVV instructions should therefore be built for and run on the X280 only.

The sources for the different cores' features:

Sifive-7 series GCC tunings

In this section, different aspects of the sifive-7-series machine description tuning are examined.

Pipeline Automaton and CPU Units

The file begins by defining a custom automaton (define_automaton "sifive_7") and several CPU units. This establishes the fundamental resources that instructions will contend for:

(define_automaton "sifive_7")
(define_cpu_unit "sifive_7_A" "sifive_7")
(define_cpu_unit "sifive_7_B" "sifive_7")
(define_cpu_unit "sifive_7_idiv" "sifive_7")
(define_cpu_unit "sifive_7_fpu" "sifive_7")

Two-Pipe Structure (A and B): The commentary in the file states that the sifive-7-series core modeled here has two main pipelines, A and B:
- Pipeline A: primarily handles loads, stores, and integer-to-floating-point moves.
- Pipeline B: used for branches, multiplications, divisions, and floating-point operations.
- Integer ALU instructions can execute in either pipeline.

This dual-pipeline setup is key to allowing certain instructions to issue in parallel, assuming no conflicts or hazards occur.

The CPU units serve as building blocks for modeling how instructions claim and release pipeline stages. By splitting the pipeline into these units, the scheduler knows exactly when a resource (like the integer pipeline or the floating-point unit) is busy.

Two additional units are defined for specialized operations (idiv, fpu):

sifive_7_idiv: An integer division unit, which will be reserved for a longer duration by division instructions.
sifive_7_fpu: A floating-point unit resource, reserved by certain FP operations (like divisions or square roots) for multiple cycles.

Instruction Reservations and Latency Modeling

The core of this file lies in the define_insn_reservation lines. Each reservation describes how long a given instruction class occupies certain pipeline units. Some noteworthy examples:

Load Instructions:

(define_insn_reservation "sifive_7_load" 3
  (and (eq_attr "tune" "sifive_7") (eq_attr "type" "load"))
  "sifive_7_A")

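A load has a latency of 3 cycles and issues through the A pipeline. The reservation string names sifive_7_A only once, so the pipeline is claimed just for the issue cycle and another memory operation can issue in the next cycle. The latency of 3 means the loaded value is not available to a dependent instruction until three cycles after issue, which determines how early dependent instructions can be scheduled.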

Multiplication:

(define_insn_reservation "sifive_7_mul" 3
  (and (eq_attr "tune" "sifive_7") (eq_attr "type" "imul"))
  "sifive_7_B")

An integer multiplication (instruction type imul) uses the B pipeline and has a latency of 3 cycles. This models the latency of the hardware multiplier.

Division:

(define_insn_reservation "sifive_7_div" 16
  (and (eq_attr "tune" "sifive_7") (eq_attr "type" "idiv"))
  "sifive_7_B,sifive_7_idiv*15")

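Integer division is more expensive, with a latency of 16 cycles. Its reservation is sifive_7_B followed by sifive_7_idiv*15: the instruction issues through pipeline B in the first cycle and then holds the integer division unit for the next 15 cycles. Pipeline B is free for other instructions during that time, but a second division cannot start until the division unit is released. This reflects the iterative nature of the division hardware.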
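The reservation string of the division can be unrolled cycle by cycle. The following Python sketch does this for a small subset of the reservation syntax: "," moves to the next cycle and "unit*N" repeats a unit for N cycles. The expand_reservation helper is a hypothetical name, and the sketch deliberately ignores the other operators that real reservation expressions allow.

```python
def expand_reservation(regexp):
    """Expand a simple reservation string into one unit name per cycle.

    Supports only ',' (advance to the next cycle) and 'unit*N'
    (repeat the unit for N cycles).
    """
    cycles = []
    for element in regexp.split(","):
        element = element.strip()
        if "*" in element:
            unit, count = element.split("*")
            cycles.extend([unit.strip()] * int(count))
        else:
            cycles.append(element)
    return cycles

div = expand_reservation("sifive_7_B,sifive_7_idiv*15")
print(len(div))                 # 16
print(div[0], div[1], div[-1])  # sifive_7_B sifive_7_idiv sifive_7_idiv
```

The unrolled list has 16 entries, matching the latency of 16: one cycle on pipeline B and fifteen on the division unit.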

Floating-Point Operations:

(define_insn_reservation "sifive_7_dfma" 7
  (and (eq_attr "tune" "sifive_7")
       (and (eq_attr "type" "fadd,fmul,fmadd")
            (eq_attr "mode" "DF")))
  "sifive_7_B")

Floating-point operations have their own latencies. Double-precision add, multiply, and fused multiply-add (types fadd, fmul, and fmadd in DF mode) share a latency of 7 cycles and issue through pipeline B. Other FP instructions, like divisions, can require even more cycles and also reserve the sifive_7_fpu unit for multiple cycles.

All these reservations allow the scheduler to know how long it must wait before dependent instructions can safely execute, minimizing stalls and structural hazards. By associating each instruction type and mode (e.g., load, imul, fdiv, fadd) with specific pipelines and cycle counts, GCC can effectively model and predict when resources become free, orchestrating a more efficient instruction schedule.
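As a rough illustration of how these latencies constrain a schedule, the Python sketch below computes the earliest issue cycle for each instruction in a chain where every instruction depends on the previous one. The latencies are the ones quoted in the reservations above; the issue_cycles helper is a hypothetical name, and the sketch assumes no bypasses and no competition for pipelines, which the real scheduler does take into account.

```python
# Default latencies taken from the reservations quoted above.
LATENCY = {
    "sifive_7_load": 3,
    "sifive_7_mul": 3,
    "sifive_7_div": 16,
    "sifive_7_dfma": 7,
}

def issue_cycles(chain):
    """Earliest issue cycle of each instruction in a dependency chain.

    Each instruction depends on the one before it. Bypasses and
    pipeline conflicts are ignored in this sketch.
    """
    cycle = 0
    result = []
    for reservation in chain:
        result.append(cycle)
        cycle += LATENCY[reservation]
    return result

# A load feeding a multiply feeding a divide.
print(issue_cycles(["sifive_7_load", "sifive_7_mul", "sifive_7_div"]))  # [0, 3, 6]
```

The gaps between issue cycles are the slots the scheduler tries to fill with independent instructions.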

Conditional Resource Usage

Some reservations use logical OR and AND conditions to match multiple instruction types or modes. For example:

(define_insn_reservation "sifive_7_alu" 2
  (and (eq_attr "tune" "sifive_7")
       (eq_attr "type" "unknown,arith,shift,slt,multi,logical,move,bitmanip,\
                    rotate,min,max,minu,maxu,clz,ctz,atomic,condmove,mvpair,zicond"))
  "sifive_7_A|sifive_7_B")

This matches a broad class of integer ALU-like operations. The use of | (OR) in the reservation string means these instructions can be scheduled on either pipeline A or pipeline B. If one pipeline is already claimed in a cycle, for example by a load on A, an ALU instruction can still issue on the other, which gives the scheduler room to reduce stalls and use the available pipeline bandwidth more efficiently.
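The effect of the A|B alternative can be sketched with a small in-order, two-pipe issue model in Python. The assign_pipes helper and the ALLOWED table are hypothetical names; the table follows the reservations quoted above. The greedy first-fit choice is an assumption of the sketch, and data dependencies and latencies are ignored.

```python
# Pipelines each instruction class may use, following the reservations above.
ALLOWED = {
    "load": ["A"],      # "sifive_7_A"
    "imul": ["B"],      # "sifive_7_B"
    "alu": ["A", "B"],  # "sifive_7_A|sifive_7_B"
}

def assign_pipes(instructions):
    """Greedily place instructions, in order, into cycles of two pipes.

    Returns a list of (cycle, pipe, instruction). An instruction that
    finds no free allowed pipe in the current cycle starts a new cycle,
    which keeps the issue order intact.
    """
    schedule = []
    cycle, busy = 0, set()
    for insn in instructions:
        free = [p for p in ALLOWED[insn] if p not in busy]
        if not free:
            cycle, busy = cycle + 1, set()
            free = ALLOWED[insn]
        pipe = free[0]
        busy.add(pipe)
        schedule.append((cycle, pipe, insn))
    return schedule

print(assign_pipes(["load", "alu", "alu"]))
# [(0, 'A', 'load'), (0, 'B', 'alu'), (1, 'A', 'alu')]
print(assign_pipes(["load", "load"]))
# [(0, 'A', 'load'), (1, 'A', 'load')]
```

A load and an ALU instruction share a cycle because the ALU instruction can fall back to pipeline B, whereas two loads must issue in consecutive cycles.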

Bypass Mechanisms

At the bottom of the file, define_bypass statements describe forwarding paths. These are critical for avoiding unnecessary pipeline stalls when instruction results can be delivered directly to subsequent instructions without waiting for register file writes:

(define_bypass 1 "sifive_7_load,sifive_7_alu,sifive_7_mul,sifive_7_f2i,sifive_7_sfb_alu"
  "sifive_7_alu,sifive_7_branch")

In a define_bypass, the first string lists producer reservations and the second lists consumer reservations. If an instruction matching sifive_7_load or sifive_7_alu, for example, produces a result that a following ALU or branch instruction needs, the scheduler uses a latency of 1 cycle instead of the producer's default latency, because the result is forwarded rather than read back from the register file. Similar bypass rules appear for FP operations, store data, and conversions between integer and floating-point. Some bypasses are conditional: a guard function such as riscv_store_data_bypass_p decides whether the bypass applies to a given pair of instructions. This models more complex hardware conditions or microarchitectural optimizations.

Bypass definitions let the compiler’s scheduler place dependent instructions closer together, minimizing pipeline stalls. The presence of multiple and conditional bypasses suggests a sophisticated forwarding network in the hardware.
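The way a bypass overrides a default latency can be expressed as a simple lookup. The Python sketch below uses only the one bypass and the latencies quoted in this article; the latency helper is a hypothetical name, and the sketch assumes no other bypass applies, which is not true of the full file.

```python
# Default latencies from the reservations quoted in this article.
DEFAULT_LATENCY = {"sifive_7_load": 3, "sifive_7_alu": 2, "sifive_7_mul": 3}

# (define_bypass 1 "<producers>" "<consumers>") as quoted above.
BYPASS_LATENCY = 1
PRODUCERS = ("sifive_7_load,sifive_7_alu,sifive_7_mul,"
             "sifive_7_f2i,sifive_7_sfb_alu").split(",")
CONSUMERS = "sifive_7_alu,sifive_7_branch".split(",")

def latency(producer, consumer):
    """Latency between two dependent instructions.

    Uses the bypass latency when the (producer, consumer) pair is
    covered by the bypass, otherwise the producer's default latency.
    """
    if producer in PRODUCERS and consumer in CONSUMERS:
        return BYPASS_LATENCY
    return DEFAULT_LATENCY[producer]

print(latency("sifive_7_load", "sifive_7_alu"))  # 1
print(latency("sifive_7_load", "sifive_7_mul"))  # 3
```

With the bypass, an ALU instruction that consumes a loaded value is modeled as waiting one cycle rather than three.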

Mixed Workloads and Extensions

The file also contains reservations for more advanced operations such as popcount (cpop) and carry-less multiply (clmul), as well as for specific FP modes (HF, SF, and DF for half-, single-, and double-precision). Modeling these ensures that, whatever instruction mix the compiler emits, it can account for the latency and resource usage of each instruction.

Conclusion

Here is a summary of notable features:

Dual-Pipeline Model (A & B):
- Integer instructions can use A or B, whereas memory and FP instructions have specific pipelines.
Dedicated Units for Division & FP Operations:
- Long-latency operations like division and FP division/sqrt claim separate CPU units for multiple cycles.
Flexible ALU Resource Usage:
- ALU instructions can issue on either pipeline to improve throughput.
Detailed Latency Encoding:
- Each instruction type has a latency value and a unit reservation chosen to reflect the hardware.
Extensive Bypass Mechanisms:
- Multiple bypasses reduce stalls by allowing results to be forwarded directly from producers to consumers.

By leveraging these well-defined reservations and bypass mechanisms, GCC can generate code that more closely matches the dynamic behavior of sifive-7-series cores. The net result is improved utilization of pipelines, fewer stalls, and faster execution of compiled programs.

In conclusion, the sifive-7.md file provides a carefully tuned machine description that captures the complexity and nuance of the SiFive 7 series pipelines. Through CPU unit definitions, per-instruction reservations and latencies, and a robust set of bypass paths, the file enables GCC to schedule instructions in a manner that mirrors the real hardware’s capabilities, improving the performance and efficiency of generated code.

Dusan Stojkovic
