LearnHPC and the use of the Fenix resources: Elevating HPC training in Europe

16 Dec 2020

The new European project LearnHPC aims to provide educational access to HPC resources in a consistent way and at a large scale. The use of the Fenix Research Infrastructure within the project will be crucial in generating a web interface that will make it easy to create and manage HPC clusters for training purposes. We interviewed Alan O'Cais from the Juelich Supercomputing Centre, E-CAM Software Manager and Principal Investigator of LearnHPC, to find out more about this exciting initiative and the contribution of the Fenix resources.

● Tell us a few words about the LearnHPC project: how did you come up with the idea, what challenge does it address, and what is its main goal?

Through my involvement in the E-CAM Centre of Excellence and FocusCoE, I am aware that HPC training and education is a hugely important topic in the context of the EuroHPC Joint Undertaking. There is, however, an enormous logistical challenge in extending HPC training of a consistent standard to an ever-growing pool of researchers in 32 countries. One of the biggest hurdles that I foresee is providing educational access to HPC resources in a consistent way at the required scale.

The underpinning concept of LearnHPC is to create temporary HPC clusters in the cloud that look, feel and even perform like real-world HPC resources. The idea of a cluster in the cloud is not so novel in itself; there are already quite a few projects out there that can do this. What makes LearnHPC unique is its opportune timing and how it brings a number of different projects together to target a specific use case. We use the Magic Castle project (from Compute Canada) to create the HPC clusters; the European Environment for Scientific Software Installations (EESSI, pronounced “easy”) to give us a scalable, optimised research computing software environment; and HPC Carpentry to create high-quality, community-maintained training material for instructors to use there.

It’s all about creating a consistent, high-quality training environment: realistic HPC environments, reproducible research-grade software stacks, community-maintained lessons... and most importantly each one of these, individually and as a whole, has to be scalable!

● What are the particular ways in which LearnHPC will help make HPC training easier for students, researchers, and users in the field?

In the context of HPC training, I wouldn’t immediately draw a distinction between “students, researchers, and users”; I would see them all as learners. What LearnHPC will hopefully do for all learners is make the mechanics of accessing HPC training uniform, well documented and as easy as possible. We want to remove, hide or simplify the technical barriers that tend to increase the slope of the learning curve when it comes to HPC. Learners may still ultimately need to know about IP-restricted SSH keys or how to compile the latest GCC from source, but these can be introduced at a more appropriate time in their learning journey.

In terms of what each of these groups takes away from interacting with LearnHPC and the tools it uses, I would hope that:
     - it helps to make sure students are not intimidated by HPC: it’s just another tool in their toolbox;
     - users might recognise that a project like EESSI can mean that they might never have to compile a community code again (and don’t have to sweat about whether they did it correctly);
     - application researchers recognise that they could leverage LearnHPC and EESSI in their development workflows for things like continuous integration, scalability testing and hardware portability.

● What Fenix Research Infrastructure services are you using to run the HPC clusters for educational purposes?

The Fenix Research Infrastructure is partly built upon OpenStack, an open standard cloud computing platform. LearnHPC will leverage the OpenStack API to create a web interface that will allow us to easily create and manage HPC clusters. The federated nature of the Fenix Research Infrastructure also means that we can access different CPU architectures, accelerators such as GPUs, and communication hardware such as InfiniBand. These can be expensive, or convoluted to configure, with commercial cloud providers.

● How important is the contribution of the Fenix resources in creating this educational experience on a virtual HPC infrastructure?

Through the Fenix Research Infrastructure, we see a great opportunity to carve out a niche with respect to hardware capabilities and support: Magic Castle gives us the machinery to support the hardware drivers and EESSI provides the architecture-optimised software stacks (including accelerators and communication hardware). Working together with the Fenix Research Infrastructure, we can ensure that even though LearnHPC clusters may be temporary, they are not merely toys but can have real hardware capabilities.

This would also open up a lot of opportunities for application developers, providing them with a tool that they can leverage to implement continuous integration (even with regard to scalability and portability) in their application development workflows.

● Can you elaborate on the training events you are planning to run once the resources become available to the project?

Introductory level training is perhaps an easy target for LearnHPC since it is unlikely to expose any hardware shortcomings that may exist (like MPI performance or accelerator support). We are very interested in having people take (for example) the Introduction to High Performance Computing lesson from HPC Carpentry and deliver it on LearnHPC resources. We expect to organise a minimum of 2 training events based on this lesson ourselves and are open to providing resources to anyone else who wishes to deliver this lesson.

With respect to more domain-specific training courses, we will organise a further training event related to a Python-based high throughput computing library (developed by E-CAM in collaboration with PRACE and scalable to millions of tasks) which leverages the Dask data analytics framework.

E-CAM and FocusCoE also collaborated to create a community-maintained lesson on Running LAMMPS on HPC systems, which we intend to showcase as an example of an application-specific training event. One of the goals in this case is for it to act as a template event that could be leveraged by other Centres of Excellence for their own applications, and we are very keen to support other CoEs if they would like to do this.

Beyond this set, we are mindful that LearnHPC is still a pilot-stage project and relies heavily on EESSI, which is also a pilot-stage project. In addition to this, we do not yet have a history on the Fenix Research Infrastructure. We are built on solid foundations but have not committed to a huge list of training events while we work on making sure everything functions as we hope it will. We are open to collaborating with anyone who would like to leverage LearnHPC in their training event: just get in touch with me (Alan O’Cais, a.ocais@fz-juelich.de) to see if we can work together to make it happen.