

# THE RISC-V VECTOR PROCESSOR IN EPI

JESUS LABARTA (@BSC.ES)



# **DISCLAIMER**

Personal opinions

I grew up in HPCland



# TOWARDS PROCESSOR DESIGN IN EUROPE

**Applications** 

Libraries/Platforms

Schedulers

Compiler/Toolchain

OS

**HW Systems** 

CPUs/GPUs/ASICs







































































# **EPAC**

- RV64GCV (→ 8x)
  - 2-way in-order core
  - Decoupled out-of-order (OoO) VPU
    - 8 lanes
    - Long vectors (256 DP elements)
  - L1 MESI coherency
- CHI-interface NoC
  - 1 line / cycle (high bytes/flop, B/F)
- L2 cache: 256 KB/module
  - Allocation control mechanisms
- Linux
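
To make the long-vector figures concrete, the arithmetic can be sketched as below. It assumes each lane processes one 64-bit element per cycle; the slides do not state the per-lane rate, so treat that as an assumption. The helper names are illustrative, not part of any EPAC tool.

```c
#include <assert.h>

/* Figures from the slide: 256 double-precision elements, 8 lanes. */
enum { EPAC_VL_DP = 256, EPAC_LANES = 8, BYTES_PER_DP = 8 };

/* Cycles one vector instruction keeps the lanes busy, ASSUMING one
 * element per lane per cycle (an assumption, not a slide figure). */
static int lane_cycles(int elements, int lanes) {
    assert(elements >= 0 && lanes > 0);
    return (elements + lanes - 1) / lanes;   /* ceiling division */
}

/* Storage needed by one full-length vector register. */
static int vreg_bytes(int elements, int bytes_per_element) {
    assert(elements >= 0 && bytes_per_element > 0);
    return elements * bytes_per_element;
}
```

Under that assumption, `lane_cycles(EPAC_VL_DP, EPAC_LANES)` gives 32 cycles of work per instruction and `vreg_bytes(EPAC_VL_DP, BYTES_PER_DP)` gives 2048 bytes per register, which is why a simple in-order front end can keep a decoupled vector unit busy.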



- STX: DL and stencil specific accelerators
- VRP: variable precision processors



# TOWARDS HOLISTIC CO-DESIGN

- Can we develop a unified model? Nicely integrate all levels?
- How do we ensure coordination/cooperation between levels at run time?





# HOLISTIC CO-DESIGN

**Applications** 

Libraries/Platforms

Schedulers

Compiler/Toolchain

OS

**HW Systems** 

CPUs/GPUs/ASICs



Best place to address an issue

**Fundamentals** 

Balance

Mindset

**Productivity** 

Efficiency






# LEVERAGE INTERFACES AND IMPLEMENTATIONS

**Applications**

Libraries/Platforms

**Schedulers**

Compiler/Toolchain

OS

**HW Systems**

CPUs/GPUs/ASICs











Leverage "standards"

Opportunity to innovate and contribute






# DETAILED ANALYSIS AND INSIGHT ON BEHAVIOR

**Applications** 

Libraries/Platforms

Schedulers

Compiler/Toolchain

OS

**HW Systems** 

CPUs/GPUs/ASICs









# BALANCED HIERARCHY

**Applications** 

Libraries/Platforms

Schedulers

Compiler/Toolchain

OS

**HW Systems** 

CPUs/GPUs/ASICs

Workflow

MPI

OpenMP

vectors cores

Accelerator specific



"Limited" number of "general purpose" control flows within a tile

Long vectors, 8 lanes per core



# LATENCY → THROUGHPUT: ASYNCHRONY AND OVERLAP

**Applications** 

Libraries/Platforms

**Schedulers** 

Compiler/Toolchain

OS

**HW Systems** 

CPUs/GPUs/ASICs

### Interoperability MPI + OpenMP

Taskify MPI calls

### Task based models

- Single mechanism
  - Concurrency
  - Locality & data management
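
A minimal sketch of how a single mechanism, task dependences, can express both concurrency and data flow. It uses standard OpenMP `task` and `depend` clauses; the function `pipeline_sum`, the two-stage structure, and the block sizes are illustrative and not taken from the talk. Compiled without OpenMP support, the pragmas are ignored and the code runs sequentially with the same result.

```c
#include <assert.h>

enum { NBLOCKS = 4, BLOCK = 256 };

/* Two-stage pipeline over blocks: a producer task fills a block and a
 * consumer task reduces it. The depend clauses order producer before
 * consumer for the SAME block; different blocks may run concurrently. */
static double pipeline_sum(double data[NBLOCKS][BLOCK]) {
    double partial[NBLOCKS] = {0};

    #pragma omp parallel
    #pragma omp single
    {
        for (int b = 0; b < NBLOCKS; b++) {
            double *blk = data[b];
            double *dst = &partial[b];

            /* Producer: writes the block. */
            #pragma omp task depend(out: blk[0]) firstprivate(blk, b)
            for (int i = 0; i < BLOCK; i++)
                blk[i] = (double)(b + 1);

            /* Consumer: reads the block once the producer has finished. */
            #pragma omp task depend(in: blk[0]) firstprivate(blk, dst)
            {
                double s = 0.0;
                for (int i = 0; i < BLOCK; i++)
                    s += blk[i];
                *dst = s;
            }
        }
        #pragma omp taskwait   /* all tasks done before partial[] is read */
    }

    double total = 0.0;
    for (int b = 0; b < NBLOCKS; b++)
        total += partial[b];
    return total;
}
```

The same annotations that expose parallelism also tell the runtime which data each task touches, which is the information a scheduler would need for locality-aware placement.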





### Task based computational workflows





### Long vectors

Decouple the front end from the back end.



Convey access pattern semantics to the architecture, with potential to optimize memory throughput.
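
As a rough illustration of conveying access-pattern semantics, the sketch below sums one column of a row-major matrix. The stride (`cols` elements) is loop-invariant and visible to the compiler, so a vectorizing compiler targeting a vector ISA can, in principle, express the whole access as one strided vector load instead of many independent scalar loads. Whether a given compiler does so is not guaranteed; the function name `column_sum` is illustrative and not from the talk.

```c
#include <assert.h>
#include <stddef.h>

/* Sum column `col` of a rows x cols row-major matrix.
 * Consecutive iterations touch addresses exactly `cols` elements apart:
 * a regular, strided pattern that one vector memory instruction can describe. */
static double column_sum(const double *m, size_t rows, size_t cols, size_t col) {
    assert(col < cols);
    double sum = 0.0;
    #pragma omp simd reduction(+:sum)
    for (size_t r = 0; r < rows; r++)
        sum += m[r * cols + col];
    return sum;
}
```

With long vectors, a single such instruction describes hundreds of accesses at once, giving the memory system a larger window to schedule over.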



# MALLEABILITY & COORDINATED SCHEDULING

**Applications** 

Libraries/Platforms

**Schedulers** 

Compiler/Toolchain

OS

**HW Systems** 

CPUs/GPUs/ASICs









Microarchitecture decides

A wish:

Handoff scheduling

    saxpy:
        vsetvli a4, a0, e32, m8
        vlw.v v0, (a1)
        sub a0, a0, a4
        slli a4, a4, 2
        add a1, a1, a4
        vlw.v v8, (a2)
        vfmacc.vf v8, fa0, v0
        vsw.v v8, (a2)
        add a2, a2, a4
        bnez a0, saxpy
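
The assembly loop above lets `vsetvli` decide how many elements each iteration processes. A scalar C emulation of that strip-mining pattern might look like the sketch below. `emulated_vsetvl` and the `vlmax` parameter are hypothetical stand-ins for the hardware's decision, not real intrinsics, and the min rule used here is a simplification of what an implementation may grant.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for vsetvli: grant min(requested, vlmax) elements. */
static size_t emulated_vsetvl(size_t avl, size_t vlmax) {
    return avl < vlmax ? avl : vlmax;
}

/* y[i] += a * x[i], strip-mined in the same shape as the assembly loop. */
static void saxpy_stripmined(size_t n, float a, const float *x, float *y,
                             size_t vlmax) {
    assert(vlmax > 0);
    while (n > 0) {
        size_t vl = emulated_vsetvl(n, vlmax);   /* "hardware" picks vl */
        for (size_t i = 0; i < vl; i++)          /* one vector op's worth */
            y[i] += a * x[i];
        x += vl;                                 /* bump pointers */
        y += vl;
        n -= vl;                                 /* elements remaining */
    }
}
```

Calling `saxpy_stripmined` with `vlmax` set to 4 or to 256 produces the same result from the same source, which is the property that lets the microarchitecture, rather than the programmer, choose the vector length.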



# HOMOGENIZING HETEROGENEITY

**Applications** 

Libraries/Platforms

**Schedulers** 

Compiler/Toolchain

OS

**HW Systems** 

CPUs/GPUs/ASICs

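```
/* TS: elements per task. Illustrative value; the slide does not give one. */
#define TS 1024

/* Innermost level: vectorized loop. */
void axpy_SIMD(double a, double *dx, double *dy, int n) {
    int i;
    #pragma omp simd
    for (i=0; i<n; i++) dy[i] += a*dx[i];
}

/* Middle level: one task per chunk of TS elements. */
void axpy_omp(double a, double *dx, double *dy, int n) {
    int i;
    #pragma omp taskloop
    for (i=0; i<n; i+=TS) {
        int chunk = n>i+TS ? TS : n-i;   /* last chunk may be shorter */
        axpy_SIMD(a, &dx[i], &dy[i], chunk);
    }
}

/* Outer level: each chunk is offloaded and runs regular OpenMP on the target. */
void axpy_omp_nest(double a, double *dx, double *dy, int n) {
    int i;
    #pragma omp taskloop
    for (i=0; i<n; i+=TS) {
        int chunk = n>i+TS ? TS : n-i;
        /* Array sections are written [start:length]. */
        #pragma omp target map(to:dx[i:chunk]) map(tofrom:dy[i:chunk])
        axpy_omp(a, &dx[i], &dy[i], chunk);
    }
}
```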

### Vector-length-agnostic (VLA) code helps homogenize heterogeneous performance

Big-little cores, ...

#### Nested tasked/workshared



Offload regular OpenMP



HW support: IO coherence



# WHERE ARE WE?

### EPAC test chip tape-out in 22 nm at the beginning of 2021



### LLVM vectorizing compiler & intrinsics



```
/* Compiler-vectorized version. */
void axpy_SIMD(double a, double *dx, double *dy, int n) {
    int i;
    #pragma omp simd
    for (i=0; i<n; i++) {
        dy[i] += a*dx[i];
    }
}

/* Hand-written version using the EPI builtins. */
void axpy_intrinsics(double a, double *dx, double *dy, int n) {
    int i;
    int gvl = __builtin_epi_vsetvl(n, __epi_e64, __epi_m1);
    __epi_1xf64 v_a = __builtin_epi_vbroadcast_1xf64(a, gvl);
    for (i=0; i<n; ) {
        /* Request the remaining elements; the hardware grants gvl of them. */
        gvl = __builtin_epi_vsetvl(n - i, __epi_e64, __epi_m1);
        __epi_1xf64 v_dx = __builtin_epi_vload_1xf64(&dx[i], gvl);
        __epi_1xf64 v_dy = __builtin_epi_vload_1xf64(&dy[i], gvl);
        __epi_1xf64 v_res = __builtin_epi_vfmacc_1xf64(v_dy, v_a, v_dx, gvl);
        __builtin_epi_vstore_1xf64(&dy[i], v_res, gvl);
        i += gvl;
    }
}
```



# WHERE ARE WE?



### Heterogeneous System Software SDV



    # on RISC-V side
    $ mkfs.ext4 -b 4096 /dev/vda
    $ mount -t ext4 /dev/vda /mnt/scratchfs

### EPAC SDV





# **EPAC**

- Holistic throughput oriented vision based on long vectors and task based models
- Hierarchical concurrency and locality exploitation
- Not massive concurrency at a given level
- Push behaviour exploitation to low levels
- Co-ordination between levels
- Make it all look very close to classical sequential programming to ensure productivity
- Contact us if you are interested in evaluating the framework and providing co-design input


