Accelerator Processor Stream

Stream 3 of the EPI SGA2 project builds on the achievements of the first phase, EPI SGA1, to demonstrate completely European IP based on the RISC-V Instruction Set Architecture. The first phase provided the basis for the development and provision of tangible hardware and software and set the path toward EU HPC technology; SGA2 steps in to improve on it and deliver a seamless heterogeneous accelerator based on the EPAC architecture.

The EPAC architecture includes RISC-V vector tiles (VTILE), specialized deep-learning and stencil accelerators (STX), and variable floating-point precision cores (VRP), all carefully engineered in a heterogeneous tile architecture whose subunits comply with the RISC-V standardization efforts.


Figure 1 EPAC1.0 Test Chip in GF22 Technology

EPI SGA1 resulted in a test chip of the first version of the EPAC architecture (EPAC1.0), implemented in GF22 technology and consisting of the following subunits (tiles):

  • Four RISC-V Vector (RVV) tiles, each composed of the scalar, two-way, in-order Avispado core and an 8-lane Vector Processing Unit (VPU) implementing v0.7 of the RISC-V Vector extension ISA.
  • Two STX tiles consisting of stencil/tensor accelerator cores.
  • One VRP tile consisting of a variable floating-point precision core.

Software Development Vehicles (SDVs), provided for early application porting and analysis, include Vehave, a software emulator of the RISC-V Vector ISA, and an FPGA implementation of the RTL design that proved extremely useful for verifying the hardware and the system software, including a vectorising compiler and a Linux kernel. Besides being usable as an accelerator for a general-purpose host, the Avispado RISC-V core with vector extensions is a self-hosted general-purpose HPC node running Linux. Another important component developed under SGA1 is the support for RISC-V Vector extensions added to Compiler Explorer, an open-source web application for interactive code generation and inspection. A continuous integration system based on the FPGA implementation was very useful for identifying and verifying fixes and co-design improvements for the next versions of the chip, which will be produced in SGA2.

Continuing the developments from SGA1, SGA2 aims to produce a new EPAC1.5 test chip that will include improvements and fixes realized through the co-design methodology based on the SDVs described previously. By the end of SGA2, the consortium will deliver the second test chip, EPAC2.0, with additional features such as support for v1.0 of the RISC-V Vector extension and improved microarchitectural features such as branch prediction, new data types, out-of-order execution, and a new interface to the VPU. The new VPU will include more FPUs per tile. Moreover, EPAC2.0 will feature improved cache management policies, support for a very large number of outstanding memory requests, an on-chip memory controller, memory compression capabilities, an improved NoC, chip-to-chip connectivity, and PCIe, all migrated to GF12 technology. The system software infrastructure (compiler, runtimes, libraries, and operating system) will be upgraded and maintained.

SGA2 will also extend the SDV environment and provide an improved framework for partners and external users who want a head start in preparing their codebases for a high-throughput processor with vector extensions based on the EPAC architecture. The remaining tiles will complement EPAC in specific computation kernels from the stencil/deep-learning and approximate-computing domains. Beyond general improvements, the SGA2 work on STX will address challenges identified in SGA1, such as a focus on sparse access patterns, mixed precision, and new number formats including POSIT.

Besides continuing activities from SGA1, SGA2 Stream 3 will also include efforts to integrate other specialized cores and potential accelerator technologies, such as the Kalray processor and Menta FPGA devices, under the EPAC RISC-V framework. Building on top of SGA1 and coordinating its activities with the EU Pilot project, SGA2 will help demonstrate that a very cost-effective, EU-independent technology for HPC and other domains is possible.

RVV: RISC-V Vector Tile

The RVV vector tile consists of the general-purpose, 64-bit RISC-V core Avispado, developed by Semidynamics, tightly coupled with the vector processing unit Vitruvius, developed by BSC and the University of Zagreb. The interface is realized using the purpose-built Open Vector Interface (OVI) 1.0 specification. The RVV tile also contains a physically distributed, logically shared 256 KB L2 cache developed by Chalmers/FORTH. The tile is connected in a coherent fashion to other tiles using a custom CHI.B mesh interconnect developed by EXTOLL. Although the test chip contains only four vector tiles, the architecture is highly scalable and allows up to 512 tiles to be aggregated coherently using a scalable mesh architecture.

Figure 2 EPAC1.0 RVV vector tile die, consisting of the Avispado RV64 core (CORE), the Vitruvius vector processing unit (VPU), and the L2 cache/HN node interface

The Avispado core implements the RV64GCV ISA and features the OVI interface for connecting VPU units, extending the core with support for vector instructions. The core also supports compressed instructions, SV48 virtual memory, and unaligned accesses. A block diagram of the Avispado core with the vector unit is shown below.

Figure 3 Avispado RISC-V core with Vector Unit

Avispado sends arithmetic vector instructions to the vector unit through the Vector-FP Issue Queue (VFIQ). Vector memory instructions (vector loads and stores) are processed in Avispado itself, through the Memory Issue Queue (MIQ) and the vector address generation unit (vAGU). Vector memory addresses are translated to physical addresses according to SV48/SV39, checked against the PMAs, and looked up in the data cache to guarantee coherency between vector and scalar accesses. If the data is not found in the cache, a CHI request is made to the NoC. Upon data return, 512 bits are delivered to the vector unit per clock cycle.

The vector unit, or VPU, communicates with the Avispado core through the Open Vector Interface (OVI). It currently implements version 0.7.1 of the RISC-V V-extension, featuring a maximum vector length (VL) of 256 double-precision (64-bit) elements (16,384 bits). It supports register renaming by implementing 40 physical vector registers whose elements are distributed among a set of parallel, identical vector lanes that communicate through a unidirectional ring. The VPU provides a lightweight out-of-order execution mechanism by splitting arithmetic and memory operations into two concurrent instruction queues, allowing overlapped execution. The VPU is fully configurable, although the baseline design in the EPAC test chip implements eight vector lanes interconnected through an area-efficient unidirectional ring. A detailed block diagram of the VPU is shown below.
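A loop over more elements than the maximum vector length is executed strip by strip, with the hardware granting a vector length of at most 256 double-precision elements per strip. The following sketch models that strip-mining pattern in Python; it is purely illustrative (the function name `vaxpy` and the software loop stand in for the vsetvl-style loop a vectorising compiler would emit), not EPI code.

```python
# Illustrative model of strip-mined vector execution (not EPI code).
# A loop over n elements is processed in chunks of at most MAX_VL
# elements, mirroring the vsetvl-based loop structure of the RISC-V
# V-extension; MAX_VL = 256 matches the VPU's maximum vector length
# for 64-bit elements.

MAX_VL = 256  # max double-precision elements per vector register

def vaxpy(a, x, y):
    """y[i] += a * x[i], processed strip by strip like a vector unit."""
    n = len(x)
    i = 0
    strips = 0
    while i < n:
        vl = min(MAX_VL, n - i)        # hardware grants vl <= MAX_VL
        for j in range(i, i + vl):     # one vector instruction's worth of work
            y[j] += a * x[j]
        i += vl
        strips += 1
    return strips

# 1000 elements need ceil(1000/256) = 4 strips
x = [1.0] * 1000
y = [2.0] * 1000
print(vaxpy(3.0, x, y))  # -> 4
print(y[0])              # -> 5.0
```

With VL exposed this way, the same binary runs unchanged on implementations with different lane counts; only the number of strips varies.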

STX: Stencil/Tensor Accelerator Tile

An important aspect of EPI (both SGA1 and SGA2) has been the consideration of heterogeneous acceleration to achieve even higher energy efficiency for domain-specific applications. Consequently, specialized blocks for deep learning (DL) and stencil acceleration have been an important part of the EPI roadmap. The capabilities brought by these specialized accelerators will address workloads in HPC centres for stencil computations, while the DL block will target learning acceleration as part of the acceleration stream, motivated by “optimised performance and energy efficiency” for “specialised computations”. At the beginning of SGA1, two different domain-specific accelerators for DL and stencil computations were proposed. During the first few months of the project, researchers from the Fraunhofer Institute, ETH Zürich, and the University of Bologna were able to merge the functionality of both units into a very efficient computation engine named STX (stencil/tensor accelerator), with an optional add-on called the SPU (Stencil Processing Unit) to enhance the cluster for stencil loads.

The main goal of STX is to achieve significantly higher (at least 5x-10x) energy efficiency than general-purpose/vector units. Energy efficiency measures how much computation can be performed per unit of energy, and the early target for the STX unit was at least 5x better energy efficiency (TFLOPS/W) than the vector unit on deep-learning applications. In the first few months of the project, it became clear that these estimates are rather conservative and that the effective efficiency within EPI chips will be significantly higher. For applications that require only inference using quantized networks, the efficiency will be another 10x higher.

STX has been designed around a novel RISC-V core named Snitch: a small and efficient 32-bit RISC-V core supported by a capable 64-bit FPU with SIMD support. The Snitch core has been enhanced with hardware-supported loops for floating-point operations (FREP) and streaming semantic registers (SSRs) that allow the FPU to independently fetch and write back operands using a wide range of regular data access patterns. A typical instantiation of STX uses eight such Snitch cores for computation and a further Snitch core with DMA enhancements to help move data to and from the cluster. The system can also be enhanced with the SPU, a VLIW architecture optimized for stencil workloads. The SPU targets extreme efficiency and easy programmability for kernels with static access patterns and only local data dependencies; typical applications are finite-difference solvers, but dense arithmetic and FFTs are also primary targets. The SPUs are developed in a strict hardware/software co-design approach that includes users, application and toolchain developers, and hardware architects and engineers. This way, the SPU-equipped STX achieves more than 70% of the possible peak floating-point throughput on real-world scientific kernels without SPU-specific code modifications.
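The point of SSRs is that once a regular access pattern is configured, reading a designated register implicitly fetches the next stream element, so the inner loop contains only FPU operations (which FREP can then repeat without fetching loop instructions). A minimal conceptual model, assuming affine streams over arrays (this is a sketch of the idea, not the Snitch ISA or its intrinsics):

```python
# Conceptual model of streaming semantic registers (SSRs): reads of a
# designated register implicitly pull the next element of a configured
# affine stream, so the inner loop body is pure FPU work with no
# explicit load instructions. Illustrative only, not the Snitch ISA.

def affine_stream(data, base, stride, length):
    """Address generator: yields data[base + k*stride] for k in range(length)."""
    for k in range(length):
        yield data[base + k * stride]

def dot_with_ssrs(a, b):
    s0 = affine_stream(a, 0, 1, len(a))   # SSR 0 bound to a, unit stride
    s1 = affine_stream(b, 0, 1, len(b))   # SSR 1 bound to b, unit stride
    acc = 0.0
    for _ in range(len(a)):               # FREP-style hardware loop body:
        acc += next(s0) * next(s1)        # one multiply-accumulate, no loads
    return acc

print(dot_with_ssrs([1.0, 2.0, 3.0], [4.0, 5.0, 6.0]))  # -> 32.0
```

Because address generation is lifted out of the instruction stream, the FPU can approach full utilization on such regular kernels.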

STX has been designed as a modular building block with several parametrization options. Each STX accelerator consists of several clusters of computing units; a typical instance would have four such clusters. Each cluster in turn consists of 4-16 compute Snitch RISC-V cores (typically 8), one specialized Snitch RISC-V core for orchestrating data transfers, and 0-4 SPUs. All these units access a local scratchpad memory, or TCDM (64-256 KB), which is filled using the specialized DMA unit. In theory, a 4-cluster system with 8 compute cores per cluster running at a 1 GHz clock can perform 64 double-precision GFLOPS. Practical experiments have shown that an FPU utilization of over 85% can be achieved for common machine-learning tasks. Multiple instances of STX can be instantiated in an EPAC tile.
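The peak figure above can be checked with a back-of-the-envelope calculation, assuming each Snitch core's FPU completes one double-precision fused multiply-add (2 FLOPs) per cycle (an assumption for illustration, not a published specification):

```python
# Back-of-the-envelope check of the quoted peak throughput, assuming
# one double-precision FMA (2 FLOPs) per core per cycle.
clusters = 4
cores_per_cluster = 8
flops_per_cycle = 2          # one FMA = multiply + add
clock_hz = 1e9               # 1 GHz

peak_gflops = clusters * cores_per_cluster * flops_per_cycle * clock_hz / 1e9
print(peak_gflops)           # -> 64.0
```

At the reported 85% FPU utilization, such a system would sustain roughly 54 double-precision GFLOPS on common machine-learning tasks.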

STX is programmed using OpenMP. Solutions exist that allow regular operations to be offloaded to the STX unit from an Arm system (in the GPP) or the 64-bit RISC-V core (in the EPAC tile) using both GCC- and LLVM-based flows, which continue to be developed as part of the EPI project.

VRP: Variable Precision Tile

The VRP tile enables efficient computation in scientific domains with extensive use of iterative linear-algebra kernels, such as physics and chemistry. Increasing accuracy inside the kernel reduces rounding errors and therefore improves the computation's stability. Contemporary solutions to this problem have a very high cost in memory and computation time (e.g., using double precision in intermediate calculations), which motivates specialized hardware acceleration.

Hardware support for a variable-precision, byte-aligned data format for intermediate data optimizes both memory usage and computing efficiency. When the standard-precision unit cannot reach the expected accuracy, the variable-precision unit takes over and continues, gradually increasing precision until the error tolerance constraint is met. Offloading from the host processor (the General Purpose Processor, or GPP, in EPI) to the VRP unit is ensured with a zero-copy handover thanks to IO coherency between EPAC and the GPP.
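The "take over and retry at higher precision" pattern can be sketched in software. In the illustrative example below, Python's `decimal` module stands in for the VRP's variable-precision arithmetic, and the kernel is a Newton iteration for the square root of 2; the function names and the doubling schedule are hypothetical, chosen only to show the control flow.

```python
# Illustrative sketch of precision escalation: rerun a kernel with
# gradually increasing precision until the result meets a tolerance.
# Python's decimal module stands in for variable-precision hardware.
from decimal import Decimal, getcontext

def newton_sqrt2(digits, iters=60):
    """Newton iteration for sqrt(2) at the given decimal precision."""
    getcontext().prec = digits
    x = Decimal(1)
    for _ in range(iters):
        x = (x + Decimal(2) / x) / 2
    return x

def sqrt2_to_tolerance(tol):
    digits = 16                          # start near double precision
    while True:
        x = newton_sqrt2(digits)
        getcontext().prec = digits + 8   # check residual at higher precision
        residual = abs(x * x - 2)
        if residual < tol:
            return x, digits
        digits *= 2                      # augment precision and retry

x, digits = sqrt2_to_tolerance(Decimal("1e-40"))
print(digits)   # decimal digits of precision needed to meet the tolerance
```

On the VRP the escalation happens in hardware with byte-aligned operands, so only the failing iterations pay the cost of higher precision.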

The VRP accelerator is embedded as a functional unit in a 64-bit RISC-V processor pipeline. The unit extends the standard RISC-V instruction set with hardwired basic arithmetic operations in variable precision for scalars: addition, subtraction, multiplication, and type conversion. It also implements additional instructions for comparisons and memory accesses. The representation of data in memory is compliant with the IEEE 754-2008 extendable format, which eases integration with the GPP. The unit features a dedicated register file for storing up to 32 scalars with up to 512 bits of mantissa precision. Its architecture is pipelined for performance, with an internal parallelism of 64 bits; operations with higher precision (multiples of 64 bits) are executed by iterating on the existing hardware. The VRP micro-tile also features a high-throughput memory unit (load/store unit and data cache) with a hardware prefetching mechanism that hides memory access latency when running typically memory-bound scientific applications.
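Iterating a 64-bit datapath over wider operands is the classic multi-word (limb) scheme. As a purely illustrative sketch, a 256-bit addition becomes four 64-bit limb additions with carry propagation, one limb per internal iteration:

```python
# Sketch of how a 64-bit-wide datapath handles wider operands by
# iterating: a 256-bit add performed as four 64-bit limb additions
# with carry propagation. Purely illustrative, not the VRP design.
MASK64 = (1 << 64) - 1

def add_multiword(a_limbs, b_limbs):
    """Add two little-endian lists of 64-bit limbs, one limb per 'cycle'."""
    out, carry = [], 0
    for a, b in zip(a_limbs, b_limbs):
        s = a + b + carry
        out.append(s & MASK64)       # low 64 bits stay in this limb
        carry = s >> 64              # carry feeds the next iteration
    return out, carry

# 256-bit value split into 4 limbs; adding 1 to all-ones ripples a
# carry through every limb and out the top.
ones = [MASK64] * 4
res, carry_out = add_multiword(ones, [1, 0, 0, 0])
print(res, carry_out)   # -> [0, 0, 0, 0] 1
```

The latency of a variable-precision operation thus grows with the operand width in 64-bit steps, while narrow operands run at full speed.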

The VRP programming model is meant for smooth integration with legacy scientific libraries such as BLAS, MAGMA, and linear-solver libraries. Integration into the host memory hierarchy is transparent, avoiding the need for data copies, and the accelerator offers standard support for C programs. The libraries are organised to expose the variable-precision kernels as compatible replacements for their usual counterparts in the BLAS and solver libraries. The complexity of arithmetic operations is confined as much as possible within the lower-level library routines (BLAS); consistently, explicit control of precision is handled exclusively at the solver level.
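The drop-in-replacement organisation can be sketched as follows: solver-level code is written against a BLAS-like kernel interface, and switching to variable precision means passing a different kernel, not rewriting the solver. All names here (`dot`, `vp_dot`, `residual_norm`) are hypothetical, and Python's `decimal` module again stands in for the variable-precision hardware.

```python
# Illustrative library organisation: the solver calls a BLAS-like
# kernel, and a variable-precision kernel is a compatible drop-in
# replacement. Names are hypothetical, not the VRP API.
from decimal import Decimal, getcontext

def dot(x, y):                       # standard-precision BLAS-style kernel
    return sum(a * b for a, b in zip(x, y))

def vp_dot(x, y, digits=64):         # variable-precision replacement
    getcontext().prec = digits
    return sum(Decimal(a) * Decimal(b) for a, b in zip(x, y))

def residual_norm(dot_fn, x, y):
    # Solver-level code: the precision choice is confined to which
    # kernel is passed in; the solver logic itself never changes.
    return dot_fn(x, y)

print(residual_norm(dot, [1.0, 2.0], [3.0, 4.0]))     # -> 11.0
print(residual_norm(vp_dot, [1.0, 2.0], [3.0, 4.0]))
```

This mirrors the division of labour described above: arithmetic complexity lives in the kernels, while the solver only decides which precision to request.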

EPI Programmable Logic based Accelerator

System designers have had to use a multi-chip approach to combine hardwired acceleration of some tasks with reconfigurable implementation of others. US-based Xilinx (now part of AMD) has announced its Versal platform, which includes hardwired accelerators and reconfigurable logic, but its chips have so far mainly targeted the cloud market. Heterogeneous SoCs and platforms integrating manycore CPUs, hardware accelerators, and reconfigurable programmable logic enable:

  • HW/SW co-design and co-development tools, aimed at accurate design-space exploration in the early stages of design and at fast, efficient design and programming of such heterogeneous platforms
  • Complete software stacks, from application programming to AI libraries, runtime systems, and operating systems, which take full advantage of the features of such heterogeneous devices
  • Complete implemented systems, including both hardware and software, that satisfy hard real-time, safety, and security requirements

In HPC applications, reconfigurable programmable logic is mainly useful for support functions that are on the critical path and that require in-field reconfigurability. In general, such hardware acts as a common platform accelerator enabling:

  • Acceleration of specific tasks for datacenter services, boosting ML and AI performance
  • Offloading of a wide variety of small tasks from the CPU to speed processing along
  • Handling of atypical data types, specifically FP16 (half-precision) values used to speed up AI training and inference

Thus, SGA2 will also provide a European field-programmable gate array accelerator developed by Menta, now in its fifth architecture generation. Applications such as cryptography and AI, as well as CPU support functions such as task scheduling and adaptive variable-precision units, will be implemented on the programmable logic core from Menta. Such a solution provides hardware flexibility to the overall system and will enable the HW/SW co-design architecture concept in SGA2. It will be a powerful accelerator for HPC whose applications can evolve in the field.

The technology will be supported by Menta's unique eFPGA configuration software, Origami Programmer, which allows programming the accelerator core and generating new bitstreams in the field.


Menta eFPGA IP V5: EPI programmable logic accelerator core



