

Barcelona Supercomputing Center Centro Nacional de Supercomputación



#### The RISC-V Vector processor in EPI

#### **Prof. Jesús Labarta**

BSC & UPC

IEEE Computer Society Israel Webinar

Barcelona, Jan. 11<sup>th</sup> 2022

# Age before beauty

#### (disclaimer)

- Behavior (insight/models) before syntax
- Detailed performance analytics before aggregated profiles
- Work instantiation and order before overhead
- Malleability before fitted rigid structure
- Possibilities before how tos
- Elegance before one day shine
- Power before force

• All about programmer mindset !!!





# Outline

- A vision on HPC
- Towards Holistic Co-design
- EPI
  - Architecture
  - Software Development Vehicles (SDVs)
- Conclusion








#### A vision on HPC

# The programming revolution

#### • From the latency age ...

- I need something ... I need it now !!!
- Performance dominated by latency in a broad sense
  - At all levels: sequential and parallel
  - Memory and communication, control flow, synchronizations

#### • ... to the throughput age

- Ability to instantiate "lots" of work and avoid stalling for specific requests
  - I need this and this and that ... and as long as it keeps coming I am ok
  - At all levels
- Performance dominated by overall availability/balance of resources



## Vision

- The multicore and memory revolution
  - ISA leak ...
  - Plethora of architectures
    - Heterogeneity
    - Memory hierarchies

Complexity + variability = Divergence between our mental models and actual system behavior

The power wall made us go multicore and caused the ISA interface to leak → our world is shaking



#### What do programmers need? HOPE !!!



# Code lifetime cycle





### There is HOPE !!!



# The PM osmotic membrane










# Towards Holistic Co-design





#### A buzzword !!!!

#### **NOT**: 1 user/customer + 1 vendor

#### NOT : Top $\rightarrow$ down !!!!

and obsessed by hardware

Limited scope !!!!



#### **Co-design**



#### You have to DESIGN !!!!

#### Co-DESIGN vs. Co-DIMENSION !!!!



# Holistic co-design










# Leverage interfaces and implementations







**MPI** 



- Leverage "standards"
- Opportunity to innovate and contribute








#### Towards "Exascale"















- Balanced hierarchy
- Throughput oriented: asynchrony and overlap
- Malleability and coordinated scheduling
- Homogenizing heterogeneity
- Detailed analysis and insight on behavior



## **Balanced hierarchy**



- 256 elements, 8 lanes per core
- "Limited" number of "general purpose" control flows within tile



# Latency $\rightarrow$ Throughput: asynchrony and overlap





# Malleability & Coordinated scheduling





### Homogenizing Heterogeneity



#### Nested tasked/workshared



Offload regular OpenMP



 HW support: IO coherence



# Detailed analysis and Insight on behavior





#### Some co-design related activities @ BSC

- Performance tools & analysis methodology
  - POP
- Programming model
  - OmpSs & OpenMP



- RISC-V Vector Architecture
  - EPI










#### EPI

# The EPI FPA Objective

- Components (low power microprocessor technologies) ...
  - Arm-based SoC
  - RISC-V-based accelerator
- ... to be combined to target
  - HPC
  - HPDA
  - Emerging
    - Automotive
    - ...







### Three streams

- General purpose
  - ARM SVE
  - BULL: System integrator  $\rightarrow$  chip integrator  $\rightarrow$  SiPearl
- Accelerator
  - RISC-V
  - EU design: BSC, Semidynamics, EXTOLL, ETHZ, UNIBO, Chalmers, ...
- Automotive
  - Infineon, ...



### **Overall architecture**





# Visions and collaborations

- STX:
  - Specific Accelerator devices
    - AI
    - Stencil
- RVV
  - ISA is important, RISC-V Vector
  - "Accelerator"
    - Easier entry, focus
    - $\rightarrow$  Standard self hosted, general purpose vector SMP
- VRP:
  - Extended precision arithmetic





#### The importance of a vision





# **RISC-V** accelerator vision @ EPI

#### High throughput devices (processors)

- Long Vectors
  - "less words, more work"
  - Optimize memory throughput (High BW, B/F), locality
  - Decouple FE-BE
  - Locality & bandwidth at register level
- ISA is important
  - Standard
  - name spaces: single linear coherent memory, registers
  - Vector Length Agnostic (VLA)
- Hierarchical Acceleration
  - Nesting
  - Balanced hierarchy: number of levels, ratio
    - "limited" number of control flows
  - Homogenized heterogeneity
- Low power:
  - ~ low voltage x ~ low frequency
  - Efficiency
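
The Vector Length Agnostic point above can be illustrated with a strip-mined loop. This is a minimal sketch, not EPAC code: `set_vl` is a hypothetical, simplified stand-in for what an RVV `vsetvl` instruction does (grant at most `maxvl` of the remaining elements), and `axpy_vla` is an illustrative name. The same loop body runs unchanged whatever maximum vector length the hardware offers; only the number of strips changes.

```python
def set_vl(remaining, maxvl):
    """Hypothetical, simplified stand-in for RVV vsetvl: grant up to maxvl elements."""
    return min(remaining, maxvl)


def axpy_vla(a, x, y, maxvl):
    """Compute y <- a*x + y in strips whose length is decided at run time."""
    n = len(x)
    i = 0
    strips = 0
    while i < n:
        vl = set_vl(n - i, maxvl)  # vector length granted for this strip
        # One "vector instruction" worth of work: vl elements at once.
        y[i:i + vl] = [a * xi + yi for xi, yi in zip(x[i:i + vl], y[i:i + vl])]
        i += vl
        strips += 1
    return strips


x = [float(i) for i in range(1000)]
y_short = [1.0] * 1000
y_long = [1.0] * 1000
strips_short = axpy_vla(2.0, x, y_short, maxvl=8)   # a short-vector machine
strips_long = axpy_vla(2.0, x, y_long, maxvl=256)   # the 256-element case
```

Both calls produce the same result; the long-vector run simply issues far fewer strips, which is one reading of "less words, more work".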

#### MPI+OpenMP

- Throughput oriented programming approach
- Malleability in application + Dynamic resource (cores, power, BW) management
- Intelligent runtimes & Runtime Aware Architectures
  - Handle overlap and locality (improve B/F)
  - Reduce overhead  $\rightarrow$  hierarchical
  - Architectural support for the runtime










### "Original EPAC Architecture"





### **GPP & Accelerator**

• Integrate into Common Architecture



#### From Global Architecture/GPP Meetings





## **EPAC** architecture

#### RVV

#### • RV64GCV ( $\rightarrow$ 8x)

- 2-way in-order core
- Decoupled VPU
  - 8 lanes
  - Long vectors (256 DP elements)
- → 128 MSHR
- L1 MESI coherency
  - No allocate but coherent for vector L/S
- CHI interface NoC
  - 1 line / cycle
- L2\$: 256KB/module
  - Allocation control mechanisms
- No in tile L3\$
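
One way to read the 128 MSHR and "1 line / cycle" figures together is through Little's law (outstanding requests = throughput × latency). The sketch below assumes that each MSHR tracks one outstanding cache-line request, which the slide does not state; the 200-cycle latency is an invented value used only for illustration.

```python
def latency_covered(outstanding_requests, lines_per_cycle):
    """Little's law solved for latency: cycles that can be hidden at full rate."""
    return outstanding_requests / lines_per_cycle


def requests_needed(latency_cycles, lines_per_cycle):
    """Outstanding requests needed to sustain a rate at a given latency."""
    return latency_cycles * lines_per_cycle


MSHRS = 128           # from the slide
LINES_PER_CYCLE = 1   # CHI interface rate from the slide

covered = latency_covered(MSHRS, LINES_PER_CYCLE)
needed_at_200 = requests_needed(200, LINES_PER_CYCLE)  # 200 cycles: made-up latency
```

Under these assumptions, 128 outstanding lines sustain one line per cycle as long as the memory latency stays at or below 128 cycles; a longer latency would need more requests in flight or would lower the sustained rate.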



#### STX

- DL and stencil specific accelerators
- Extensions to planned NTX
  - Programmable address generators →
  - lightweight RISC-V core + fat FPU + (Streaming Semantics & FR

#### VRP

Variable precision processors





#### **AVISPADO 220 with VPU**

RV64GCV

- Full hardware support for unaligned accesses
- Coherent (CHI)
- SV48
- Vector memory instructions (vle, vlse, vlxe, vse, ...) processed by MIQ/LSU

## VPU: A processor in itself

- Hierarchical "accelerator" integration
  - Program & data served by scalar core (Coherence; ~punch tape program ☺)
  - Fine grain "offloading" of "vector tasks" (directly hardware supported)
  - Homogenized heterogeneity under single "standard" ISA interface defining program order
- Implementation
  - **#FUs << VL** (lanes=8, VL=256)
  - Some OoO
    - Resources to overlap?
      - L/S, FU, shuffling
    - Renaming
      - 40 physical registers
  - Single ported register file
    - Large state
    - 5 banks/lane providing sufficient bandwidth for 1 op/cycle (latency/BW trading)
  - Data shuffling: directional ring



"Vitruvius: An Area-Efficient Decoupled Vector Accelerator for High Performance Computing" F. Minervini, O. Palomar. RISC-V Summit 2021



#### **EPAC** Test Chip





Courtesy V. Papaefstathiou

## Physical design

- GF22
- Final Top level chip floorplan
- Total area:
  - 5943 × 4593 µm²
  - (27.297 mm²)





## Physical design

- 4 "VPU microtiles"
  - 2.517 mm² each





#### Alive ...

- Starting bring up process (sept 2021)
  - First words, ...
  - ... tiny shell ...
  - ... first vector code @ 1 GHz ...






Screenshot: EPAC JTAG console session in the "EPAC TC Bring-Up Shell" (prompt `epac@nikitas$`, tools directory `/Desktop/bringup_20210916/tools_new/bringup_20210917`). Recoverable content:

- `help` prints the available commands, among them `ping`, `banner`, `cpuinfo`, `sleep` and `axpy_vector`, plus commands that echo the input, tell how long the system has been running and run the axpy benchmark
- CPU information: Chip Frequency = 1000 MHz, Core HartID = 0; two further counters are shown with their labels cut off (216041194340 and 1664256219)
- Uptime: ... 03 minutes 46 seconds
- AXPY scalar with 8192 array elements: init time 352338 cycles; axpy scalar reference time legible only as 281334 or 201554 cycles; "Result ok !!!"
- AXPY vector with 8192 array elements (`axpy_vector 8192`): init time 352341 cycles; axpy vector time 9725 cycles; "Result ok !!!"
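
The vector figures in the console log can be turned into a rough throughput estimate. This sketch assumes that the reported cycle count is measured at the 1000 MHz chip frequency shown in the same log, and that axpy costs two floating-point operations per element; neither is stated explicitly on the slide.

```python
N = 8192                # array elements (from the log)
VECTOR_CYCLES = 9725    # "axpy vector time" (from the log)
FREQ_HZ = 1000e6        # "Chip Frequency = 1000 MHz" (from the log)
FLOPS_PER_ELEMENT = 2   # assumption: one multiply and one add for y = a*x + y

cycles_per_element = VECTOR_CYCLES / N
seconds = VECTOR_CYCLES / FREQ_HZ
gflops = N * FLOPS_PER_ELEMENT / seconds / 1e9
```

Under these assumptions the vector kernel takes a little under 1.2 cycles per element, or roughly 1.7 GFlop/s on this first bring-up run.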



#### Special thanks to FORTH & EXTOLL











# **SDVs**

- Functional
  - Full OS, ...
  - On QEMU & scalar RISC-V cores
- Timing model
  - Microarchitectural parameters: MAXVL, OoO, locality management, ...



https://repo.hca.bsc.es/gitlab/epi-public/risc-v-vector-simulation-environment



## **RVV Software emulation**

• Detailed analysis & insight

















## **RVV Software emulation**

• YOLO: AI





## **RVV Software emulation**

• Geophysics













256K cache


## **RVV Software emulation**

- Radar miniapp
- Directives + automatic vectorization
  - Incremental way
- Steering compiler optimizations
  - Complex data types
  - Indexed  $\rightarrow$  strided
  - Reuse through registers
  - Avoid optimizations generating extremely short vector lengths
- Steering code refactoring
  - With productivity in mind !
    - Interchange, collapse, ...
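
The "Indexed → strided" item can be illustrated with interleaved complex data. This is a minimal sketch with hypothetical helper names, not the radar miniapp's code: it shows that an index vector of the form 0, 2, 4, ... carries no more information than a start and a stride, which is what lets a compiler replace a gather by a cheaper strided access.

```python
def gather_indexed(data, indices):
    """Indexed (gather) access: one explicit index per element."""
    return [data[i] for i in indices]


def gather_strided(data, start, stride, count):
    """Strided access: described by start, stride and count only."""
    return data[start:start + stride * count:stride]


# Interleaved complex samples: re0, im0, re1, im1, ...
samples = [1.0, -1.0, 2.0, -2.0, 3.0, -3.0, 4.0, -4.0]
n = len(samples) // 2

real_indexed = gather_indexed(samples, [2 * k for k in range(n)])
real_strided = gather_strided(samples, start=0, stride=2, count=n)
imag_strided = gather_strided(samples, start=1, stride=2, count=n)
```

Both routes return the same real parts; the strided form needs no index array in memory or in a vector register.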











#### **RISC-V Vector extension (RVV) Compiler**

• LLVM support for the evolution of the RISC-V Vector (RVV) Extension



### Heterogeneous ARM + RISC-V

- Linux Boot on both Arm and RISC-V
  - kernel version updates
  - Tracking mainline, and contributing to ongoing patch testing and review (eg. SV48, huge pages)
- OpenMP offloading
  - Asynchronous calls (via service thread pool)
  - Thread teams
- Reverse offloading
  - access to host-side I/O devices
  - Works for Linux process on RISC-V side
  - WIP for offloaded tasks





#### FORTH



## RVV @ FPGA & ecosystem

- HPC software stack @ Commercially available RISC-V platforms
  - SLURM, MPI, OpenMP, BSC tools, RVV software emulator
- EPI SDV platforms
  - Booting Linux
  - Test user codes @ real RTL
  - Give EPI partners and external users early access to EPI technology
    - Two step procedure
- Holistic CI/CD framework
  - HW & SW
  - Functionality & performance





## Single Core SDV





## Memory subsystem

- Mapping
- Protocol
  - DMT or not
- Memory dilation IP
- NoC
  - Injection, credits
  - hops/contention
- Pipelining
  - Bubbles?







- Comparing  $\rightarrow$  Insight
  - Architectures
    - Core vs. core / socket vs. socket / node vs. node
    - Capacities
    - Bandwidths
    - ISO frequency, (power, area, cost, ...)
  - Applications
    - Algorithmic refactoring
    - Programmability
- Insight  $\rightarrow$  "Design" Impact
  - Application
  - Compiler
  - Architecture



### "General Architecture"





### Comparisons ...

- Architecture
  - Core vs. core / socket vs. socket / node vs. node
  - Capacities
  - Bandwidths

|          |                                    | SDV     | SX-Aurora | CTE-Arm  | MareNostrum4        |
|----------|------------------------------------|---------|-----------|----------|---------------------|
|          | System integrator                  | EPI     | NEC       | Fujitsu  | Lenovo              |
|          | Core architecture                  | RISC-V  | SX VE     | Armv8    | Intel x86           |
|          | In-order (IO) / out-of-order (OoO) | IO      | OoO       | OoO      | OoO                 |
|          | SIMD/vector extension              | V       | SX Vector | SVE      | AVX512              |
|          | MAXVL                              | 256     | 256       | 8        | 8                   |
|          | CPU name / model                   | EPAC    | VE10B     | A64FX    | Xeon Platinum 8160  |
|          | Frequency [GHz]                    | 0.05    | 1.4       | 2.2      | 2.1                 |
|          | Turbo Boost                        | -       | Disabled  | Disabled | Disabled            |
|          | Simultaneous Multi-Threading       | -       | Disabled  | Disabled | Disabled            |
|          | Sockets / node                     | 1       | 1         | 1        | 2                   |
|          | Cores / socket                     | 1       | 8         | 48       | 24                  |
|          | Cores / node                       | 1       | 8         | 48       | 48                  |
|          | #Vector functional units           | 8       | 96        | 16       | 16                  |
|          | DP Max / core [GFlop/s]            | 0.8     | 268.8     | 70.4     | 67.2                |
|          | DP Max / node [GFlop/s]            | 0.8     | 2150.4    | 3379.2   | 3225.6              |
| Capacity | Memory / node [GB]                 | 4       | 48        | 32       | 96                  |
|          | Shared L3 [MiB]                    | 1       | 16        | 32       | 33                  |
|          | Private L2 [KiB]                   | 0       | 256       | 0        | 1024                |
|          | Private L1 [KiB]                   | 32      | 32        | 64       | 32                  |
|          | Physical Registers [KiB]           | 80.25   | 512.88    | 9.13     | 11.9                |
|          | Total Node [KiB]                   | 1136.25 | 22791.00  | 36278.00 | 118843.20           |
|          | Total per core [KiB]               | 1136.25 | 2848.88   | 755.79   | 2475.90             |
|          | Total per Flop [B/F]               | 71.02   | 14.84     | 23.62    | 77.37               |
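
The peak and capacity rows of the table can be reproduced from its other rows. The sketch below assumes one fused multiply-add (two flops) per vector functional unit per cycle; that assumption is not stated on the slide, but it reproduces every "DP Max" entry. It also suggests that the "Total per Flop" row is the per-core capacity in KiB divided by the flops per cycle of one core, which is an inference from the numbers rather than something the slide says.

```python
# name: (vector FUs per core, frequency in GHz, cores per node), from the table
systems = {
    "SDV":          (8,  0.05, 1),
    "SX-Aurora":    (96, 1.4,  8),
    "CTE-Arm":      (16, 2.2,  48),
    "MareNostrum4": (16, 2.1,  48),
}
FLOPS_PER_FU_PER_CYCLE = 2  # assumption: one fused multiply-add per unit per cycle


def peak_gflops_per_core(fus, ghz):
    """Peak double-precision GFlop/s of one core."""
    return fus * FLOPS_PER_FU_PER_CYCLE * ghz


peak = {}
for name, (fus, ghz, cores) in systems.items():
    per_core = peak_gflops_per_core(fus, ghz)
    peak[name] = (round(per_core, 1), round(per_core * cores, 1))

# On-chip capacity of the SDV in KiB: shared L3 + private L2 + L1 + registers.
sdv_total_kib = 1 * 1024 + 0 + 32 + 80.25
# Inferred meaning of the last row: KiB per core divided by flops per cycle.
sdv_per_flop = round(sdv_total_kib / (8 * FLOPS_PER_FU_PER_CYCLE), 2)
```

Note that main memory is not part of the "Total" rows; only caches and registers are summed.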





## **Observed Memory Bandwidth**

- Single core Load, Store, Copy
- Comparison to scalar and other architectures
- Comparison to their architectural peak









• Daxpy











- SpMV
  - MKL / NLC / Ellpack based implementation
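
For readers unfamiliar with the ELLPACK layout named above, here is a minimal sketch of the general technique. It is not the implementation benchmarked in the talk, and the function names are illustrative. Every row is padded to the same number of stored entries, so the inner loop runs over all rows with a regular access pattern, which is what makes the format attractive for long vectors.

```python
def to_ellpack(dense):
    """Convert a dense matrix (list of rows) to ELLPACK: padded values and columns."""
    width = max(sum(1 for v in row if v != 0) for row in dense)
    values, columns = [], []
    for row in dense:
        nz = [(v, j) for j, v in enumerate(row) if v != 0]
        nz += [(0.0, 0)] * (width - len(nz))  # pad short rows with explicit zeros
        values.append([v for v, _ in nz])
        columns.append([j for _, j in nz])
    return values, columns


def spmv_ellpack(values, columns, x):
    """y = A @ x, iterating slot by slot so each step spans all rows."""
    n_rows = len(values)
    width = len(values[0]) if n_rows else 0
    y = [0.0] * n_rows
    for k in range(width):       # few iterations ...
        for i in range(n_rows):  # ... each a long, regular "vector" over rows
            y[i] += values[i][k] * x[columns[i][k]]
    return y


A = [
    [4.0, 0.0, 0.0, 1.0],
    [0.0, 3.0, 0.0, 0.0],
    [2.0, 0.0, 5.0, 0.0],
    [0.0, 0.0, 0.0, 6.0],
]
x = [1.0, 2.0, 3.0, 4.0]
values, columns = to_ellpack(A)
y = spmv_ellpack(values, columns, x)
```

The padding costs some wasted multiplications by zero; the trade-off pays off when row lengths are similar and the number of rows is large compared with the vector length.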






• FFT

(Plots; x-axis: FFT size in elements)







• HACC





## The importance of a vision



- Holistic throughput oriented vision based on long vectors and task based models
- Hierarchical concurrency and locality exploitation
  - Not massive concurrency at a given level
  - Push behaviour exploitation to low levels
- Co-ordination between levels
- Make it all look very close to classical sequential programming to ensure productivity












# **Thanks**