

# The European Processor Initiative: **Project overview and co-design for RISC-V accelerators**

Filippo Mantovani, Barcelona Supercomputing Center (BSC)

Co-Design for HPC in Computational Materials and Molecular Science, October 3, 2022 - October 5, 2022 - CECAM-HQ-EPFL, Lausanne, Switzerland



# **EPI OBJECTIVES**



Contribute to the development of European supercomputing technologies that can compete on the global HPC market



Strengthen the competitiveness and leadership of European industry and science



Develop key components for the European Union to equip itself with a world-class supercomputing infrastructure



Develop European microprocessor and accelerator technologies with drastically better performance/power ratios; tackle important segments of broader and/or emerging HPC and Big-Data markets

More details: talk by Jean-Marc Denis, 10:00 – 10:30



# **EPI ORGANIZATION: 3 STREAMS**

- Stream 2 General Purpose Processor (GPP)
  - Arm SVE
  - Atos/Bull system integrator
- Stream 3 EPI Accelerator (EPAC)
  - RISC-V
  - EU design: BSC, Semidynamics, EXTOLL, FORTH, ETHZ, UniBo, Chalmers, ...
- Stream 1 Prototyping EPI technology and co-design of
  - System software
  - Benchmarks and scientific applications



#### **EPAC WITHIN EPI**





#### **EPI TIMELINE**





# **VISIONS AND COLLABORATIONS**

- **VEC -** Self-hosted RISC-V CPU + wide VPU (256 double elements) supporting RVV 0.7.1 / 1.0
- **STX -** RISC-V CPU + specific cores for stencil and neural network computation
- **VRP -** RISC-V CPU with support for variable precision arithmetic (data size up to 512 bit)
- **eFPGA -** On-chip reconfigurable logic
- **Ziptillion -** IP compressing/decompressing data to/from the main memory
- **KVX -** FPGA demonstrator of the Kalray RISC-V CPU targeting HPC and ML

#### VISION



#### **COLLABORATIONS**





#### **EPAC ARCHITECTURE**





#### AVISPADO 220 WITH VPU RISCV64GCV




- SV48
- 16KB I\$
- Decodes upcoming V1.0 vector spec
- 32KB D\$
- Full hardware support for unaligned accesses
- Coherent (CHI)
- Vector memory instructions (vle, vlse, vlxe, vse, …) processed by MIQ/LSU



#### **VECTOR PROCESSING UNIT**

| Architecture    | Vector register size (1 cell = 1 double element) |    |    |    |    |    |    |    |    |     |     |     |     |     |     |     |        |
|-----------------|----|----|----|----|----|----|----|----|----|-----|-----|-----|-----|-----|-----|-----|--------|
| Intel AVX512    | D1 | D2 | D3 | D4 | D5 | D6 | D7 | D8 |    |     |     |     |     |     |     |     |        |
| Arm Neon        | D1 | D2 |    |    |    |    |    |    |    |     |     |     |     |     |     |     |        |
| Arm SVE @ A64FX | D1 | D2 | D3 | D4 | D5 | D6 | D7 | D8 |    |     |     |     |     |     |     |     |        |
| NEC Aurora SX   | D1 | D2 | D3 | D4 | D5 | D6 | D7 | D8 | D9 | D10 | D11 | D12 | D13 | D14 | D15 | D16 | … D256 |
| RISC-V EPAC Vec | D1 | D2 | D3 | D4 | D5 | D6 | D7 | D8 | D9 | D10 | D11 | D12 | D13 | D14 | D15 | D16 | … D256 |

- Implementation
  - #FUs << VL (lanes=8, VL=256)
  - Some OoO
    - Resources to overlap?
      - L/S, FU, shuffling
    - Renaming
      - 40 physical registers
  - Single ported register file
    - Large state
    - 5 banks/lane providing sufficient bandwidth for 1 op/cycle (latency/BW trade-off)
  - Data shuffling: directional ring
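The relation #FUs << VL can be made concrete with a little arithmetic. The sketch below uses the two figures from the slide (8 lanes, 256 double elements) and assumes, purely for illustration, that each lane retires one element per cycle and that elements are interleaved across lanes round-robin. Neither assumption is taken from the EPAC RTL, and the function names are hypothetical.

```python
# Illustrative model only: one element per lane per cycle and round-robin
# interleaving are assumptions of this sketch, not facts about the EPAC RTL.
LANES = 8      # functional-unit lanes (from the slide)
VLMAX = 256    # double-precision elements per vector register (from the slide)


def lane_of(element: int, lanes: int = LANES) -> int:
    """Lane that would process `element` under round-robin interleaving."""
    return element % lanes


def issue_cycles(vl: int, lanes: int = LANES) -> int:
    """Cycles needed to push `vl` elements through `lanes` lanes."""
    return -(-vl // lanes)  # ceiling division


cycles_full = issue_cycles(VLMAX)   # a full-length vector instruction
cycles_short = issue_cycles(20)     # a short vector still occupies whole cycles
elements_per_lane = VLMAX // LANES  # slice of each register held by one lane
```

Under these assumptions a single full-length instruction keeps the functional units busy for many cycles, which is what gives the unit room to overlap memory, arithmetic, and shuffling work.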

"Vitruvius: An Area-Efficient Decoupled Vector Accelerator for High Performance Computing" F. Minervini, O. Palomar. RISC-V Summit 2021







# HOW TO USE THE V-EXTENSION IN OUR PROGRAMS?

- Assembler
  - Always a valid option but not the most pleasant
- C/C++ builtins
  - Low-level mapping to the instructions but allows embedding it into an existing C/C++ codebase
  - Allows relatively quick experimentation
- `#pragma omp simd` (aka "semi-automated vectorization")
  - Relies on vectorization capabilities of the compiler
    - Usually works but gets complicated if the code calls functions
  - Also usable in Fortran
- Autovectorization
  - All bets are off
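Whichever route is taken, vector-length-agnostic code for the V-extension usually ends up as a strip-mined loop: each iteration asks the hardware how many elements it may process and advances by that amount. The following is a minimal Python model of that control flow for a DAXPY kernel. The `vsetvl` function here is a simplified stand-in for the RVV instruction (it always grants `min(avl, vlmax)`, which is one behaviour the specification permits), and `daxpy_stripmined` is a hypothetical name.

```python
def vsetvl(avl: int, vlmax: int) -> int:
    """Simplified model of RVV vsetvl: grant at most `vlmax` elements."""
    return min(avl, vlmax)


def daxpy_stripmined(a, x, y, vlmax=256):
    """Compute y[i] += a * x[i] in chunks of up to `vlmax` elements.

    Returns the list of granted vector lengths, one per loop iteration.
    Assumes len(x) == len(y).
    """
    n = len(x)
    i = 0
    granted = []
    while i < n:
        vl = vsetvl(n - i, vlmax)
        # One iteration stands for: vector load x, vector load y,
        # vector fused multiply-add, vector store y.
        y[i:i + vl] = [a * xi + yi for xi, yi in zip(x[i:i + vl], y[i:i + vl])]
        granted.append(vl)
        i += vl
    return granted


x = [1.0] * 600
y = [2.0] * 600
granted = daxpy_stripmined(3.0, x, y)
```

With 600 elements and a maximum vector length of 256, the loop runs three times and the last iteration handles the remainder without a separate scalar tail.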



# **SOFTWARE EMULATOR: VEHAVE**



#### **PROS:**

- Useful to understand the level of vectorization achieved by the code
- Easy to use and accessible with no need of hardware infrastructure
- Supports RVV-0.7 and RVV-1.0
- Output compatible with Paraver

#### CONS:

- Slow
- No information about performance (no timing)
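As an illustration of what "level of vectorization" can mean in practice, the sketch below computes two simple metrics from an instruction trace: the fraction of vector instructions and the average vector length they ran with. The `Instr` record and `vectorization_metrics` function are hypothetical and do not reflect Vehave's actual trace format or output.

```python
from dataclasses import dataclass


@dataclass
class Instr:
    """Hypothetical trace record; not Vehave's real format."""
    mnemonic: str
    is_vector: bool
    vl: int = 0  # vector length in elements, meaningful for vector instructions


def vectorization_metrics(trace, vlmax=256):
    """Return vector instruction mix, average VL, and VL utilization."""
    vec = [ins for ins in trace if ins.is_vector]
    vector_mix = len(vec) / len(trace) if trace else 0.0
    avg_vl = sum(ins.vl for ins in vec) / len(vec) if vec else 0.0
    return {
        "vector_mix": vector_mix,
        "avg_vl": avg_vl,
        "vl_utilization": avg_vl / vlmax,
    }


trace = [
    Instr("addi", False),
    Instr("vle64.v", True, 256),
    Instr("vfmacc.vv", True, 128),
    Instr("bne", False),
]
metrics = vectorization_metrics(trace)
```

Metrics of this kind say how much of the code is vectorized and how full the vectors are, but, as noted above, nothing about timing.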





# RISC-V PLATFORMS: COMMERCIAL AND FPGA-BASED

#### Self-hosted RISC-V vector node @ 50 MHz

Scalar RISC-V commercial platforms coupled with EPAC development

- Hardware/software infrastructure for Continuous Integration and RTL checks
- Playground to demonstrate a full HPC software stack
  - Linux, compiler, libraries, job scheduler, MPI
- Platform to test latest RTL with complex codes
  - Advanced performance analysis tools
  - Accurate timing available









# **EXAMPLE OF CO-DESIGN STUDY #1**

Sparse Matrix-Vector multiplication



Gómez, Constantino, Filippo Mantovani, Erich Focht, and Marc Casas. "Efficiently running spmv on long vector architectures." In Proceedings of the 26th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, pp. 292-303. 2021.

Divergent Flow Control

- Naïve implementation struggles to take advantage of the vector unit
- Implementation of a "vector-friendly" SpMV algorithm: SELL-C-σ
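The general idea behind SELL-C-σ can be sketched in a few lines: rows are sorted by length inside windows of σ rows, grouped into chunks of C rows, padded to the longest row of each chunk, and stored column-major so that one vector operation advances all C rows at once. The code below is a plain-Python illustration of the format under those assumptions, not the implementation from the cited paper; all function names are hypothetical.

```python
def csr_to_sell(row_ptr, col_idx, vals, C, sigma):
    """Convert a CSR matrix to a list of SELL-C-sigma chunks."""
    n = len(row_ptr) - 1
    lengths = [row_ptr[r + 1] - row_ptr[r] for r in range(n)]
    # Sort rows by decreasing length inside each window of sigma rows.
    perm = []
    for start in range(0, n, sigma):
        window = range(start, min(start + sigma, n))
        perm.extend(sorted(window, key=lambda r: -lengths[r]))
    chunks = []
    for start in range(0, n, C):
        rows = perm[start:start + C]
        width = max(lengths[r] for r in rows)
        # Column-major storage: cols[j][k] is the j-th nonzero of the
        # k-th row of the chunk; padding uses column 0 with value 0.0.
        cols = [[0] * len(rows) for _ in range(width)]
        data = [[0.0] * len(rows) for _ in range(width)]
        for k, r in enumerate(rows):
            for j in range(lengths[r]):
                cols[j][k] = col_idx[row_ptr[r] + j]
                data[j][k] = vals[row_ptr[r] + j]
        chunks.append((rows, cols, data))
    return chunks


def spmv_sell(chunks, x, n):
    """y = A @ x; each inner step models one gather + vector multiply-add."""
    y = [0.0] * n
    for rows, cols, data in chunks:
        acc = [0.0] * len(rows)  # one vector register of partial sums
        for cj, dj in zip(cols, data):
            acc = [a + d * x[c] for a, d, c in zip(acc, dj, cj)]
        for k, r in enumerate(rows):
            y[r] = acc[k]
    return y


def spmv_csr(row_ptr, col_idx, vals, x):
    """Naive row-by-row CSR SpMV, used here as the reference."""
    return [sum(vals[p] * x[col_idx[p]] for p in range(row_ptr[r], row_ptr[r + 1]))
            for r in range(len(row_ptr) - 1)]


# 4x4 example with rows of length 2, 1, 3, 1.
row_ptr = [0, 2, 3, 6, 7]
col_idx = [0, 3, 1, 0, 1, 2, 3]
vals = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0]
x = [1.0, 2.0, 3.0, 4.0]
chunks = csr_to_sell(row_ptr, col_idx, vals, C=2, sigma=4)
y_sell = spmv_sell(chunks, x, 4)
y_ref = spmv_csr(row_ptr, col_idx, vals, x)
```

In the naive CSR loop the vector length is bounded by the number of nonzeros in a single row, whereas in the chunked layout it is bounded by C, which can be chosen to match the hardware vector length. Sorting within σ rows keeps rows of similar length together and so limits the padding.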





# **EXAMPLE OF CO-DESIGN STUDY #2**





#### **PERFORMANCE ANALYSIS TOOLS**

- Fine grained analysis (at level of instructions) is possible
- Graphical representation of timelines
- In-depth study can help highlight:
  - 1. Low usage of the vector unit
    - Feedback to the code developer
  - 2. Suboptimal saturation of resources (FU, mem)
    - Feedback to the **RTL implementation team**
  - 3. Suboptimal overlap of instructions
    - Feedback to the **compiler team** (improve scheduling)





# CONCLUSIONS

- EPI develops the first RISC-V based accelerator targeting HPC leveraging the V-extension
  - RTL design of a Vector Unit
  - LLVM compiler support for the V-extension
- While RTL is becoming actual hardware, EPI develops tools for boosting the co-design cycle: Software Development Vehicles
  - Emulator of V-instructions (Vehave)
  - Commercial RISC-V-based clusters
  - FPGA-sdv: self-hosted RISC-V core coupled with a VPU on FPGA
  - Performance analysis tools



Interested in testing your code on EPAC? Contact Filippo Mantovani: filippo.mantovani@bsc.es





#### CONTACTS

- Stream 3 leader and responsible for SDV:
  - Filippo Mantovani filippo.mantovani@bsc.es
- EPI General Manager:
  - Etienne Walter etienne.walter@atos.net





#### **EPI FUNDING**



This project has received funding from the European High Performance Computing Joint Undertaking (JU) under Framework Partnership Agreement No 800928 and Specific Grant Agreement No 101036168 EPI-SGA2. The JU receives support from the European Union's Horizon 2020 research and innovation programme and from Croatia, France, Germany, Greece, Italy, Netherlands, Portugal, Spain, Sweden, and Switzerland.

