Bruyères-Le-Châtel - France - June 2010
CEA Releases MPC 2.0
a New Unified Parallel Framework for HPC
The HPC community has entered the Petaflop/s era, with machines able to execute up to 10^15 floating-point operations per second. The next major milestone is the Exaflop/s architecture, predicted for the end of this decade. Yet there is a price to pay to reach such performance: while these architectures pack more and more cores (potentially with wider execution units), the amount of memory per core is dramatically decreasing.
Such evolutions are already impacting current large-scale applications. The Pure-MPI model (or MPI-Everywhere) will eventually become unusable because of the intrinsic data duplication of this parallel programming model. These applications will therefore have to mix several parallel programming models to fully exploit new-generation architectures, and MPI + OpenMP is currently a promising candidate. But applications suffer from the diversity of programming-model implementations: MPI libraries and OpenMP compilers are generally unaware of each other, leading to large overheads or, at best, a few knobs to tune the runtime behavior.
To tackle this issue, the Military Applications Department of the French Atomic Energy Authority (CEA) releases version 2.0 of the MPC distribution (freely available on SourceForge). MPC is a unified parallel framework for HPC clusters of NUMA machines. It has two main goals: (i) unifying various parallel programming models for an efficient exploitation of Petaflop/s architectures, and (ii) collaborating with other HPC components.
Unification of Mainstream Parallel Programming Models
Basically, MPC is a thread library supporting multiple parallel programming models, in a unified way to ease their interaction. Included in version 2.0, MPC now proposes the following programming models:
- POSIX threads
- Intel TBB version 2.1
- MPI version 1.3
- OpenMP version 2.5
- Hybrid MPI/OpenMP
First of all, MPC implements its own non-preemptive user-level threads, fully compatible with the POSIX threads standard: if an application using pthreads is compiled with MPC, each new thread runs on top of MPC instead of being managed by the underlying OS kernel. MPC's unified user-level scheduler then takes care of scheduling these threads, without paying the large overhead of crossing the OS layers. In addition, MPC supports Intel's Threading Building Blocks (TBB) programming model: by re-compiling the open-source version of TBB on top of MPC, MPC can run TBB applications and handle their thread scheduling.
Moreover, MPC implements an MPI runtime fully compliant with the MPI 1.3 standard (May 2008). It is a thread-based MPI: thanks to process virtualization, every MPI task is a user-level thread instead of an OS process. This representation leads to more flexible management within a compute node, including facilities for parallel memory allocation and reallocation. Intra-node communications simply use the shared memory available between the threads of the single process: only one copy is necessary to execute a point-to-point communication. Inter-node communications, on the other hand, go through implementations of low-level high-speed interconnect APIs, including TCP, InfiniBand and Elan (Quadrics). Thanks to this communication layer handling both intra-node and inter-node communications, MPC can run an MPI application on a whole cluster. Experiments published in [EuroPVM/MPI'09] show that the latency of intra-node communications is as low as that of production MPI implementations (MPICH2 Nemesis).
Finally, MPC proposes an implementation of OpenMP 2.5 (both runtime and compiler support). For this purpose, MPC introduces a new level of threads. Because this OpenMP implementation aims at integrating well with other programming models like MPI, OpenMP threads are modeled as stack-less, context-less threads for as long as possible (typically until they reach a scheduling point such as an explicit barrier or a lock acquisition). This design leads to a new scheduler level called the microVP, sitting on top of each VP (Virtual Processor). With this infrastructure, the OpenMP runtime can handle oversubscription efficiently (i.e., when there are more OpenMP threads than cores allocated to the OpenMP parallel region). Because OpenMP is a directive-based language, a compiler is needed to lower the directives to calls into the underlying runtime: MPC therefore embeds a patched GCC handling the OpenMP transformation for C, C++ and FORTRAN. Version 2.0 of MPC ships with patches for GCC 4.3 and GCC 4.4 (a port to GCC 4.5.0 is scheduled for the next release).
With these last two programming models, MPC has been optimized to run hybrid MPI + OpenMP applications efficiently. First of all, the main user-level scheduler sees MPI tasks and OpenMP threads in the same way, allowing efficient scheduling across these different models.
Then, both schedulers (the main one and the one dedicated to OpenMP) include a polling method that avoids busy waiting, ensuring fair scheduling of collective operations across both programming models. For example, if an OpenMP thread is waiting at an OpenMP barrier on the same core as an MPI task waiting at an MPI barrier, the scheduler arbitrates which thread should run. In [IWOMP'10], MPC shows good performance on synthetic benchmarks with hybrid MPI/OpenMP programming models. After studying several approaches to hybridizing an application, the experiments illustrate a lower overhead compared to state-of-the-art combinations of MPI implementations (MPICH2, Open MPI, Intel MPI) and OpenMP compilers (GCC, ICC and Sun CC).
Integration with other HPC Components
In addition to unifying parallel programming models, MPC aims at being fully integrated with the other HPC components of the software stack shipped with a supercomputer. A runtime implementation alone cannot reach high performance without interacting with the rest of the stack and with the underlying architecture. That is why the MPC distribution embeds several components, including:
- Patched GCC compiler
- Parallel memory allocator
- Topology module
- Debugger support
First of all, the compiler included in the MPC distribution is mainly used to transform C, C++ and FORTRAN programs using OpenMP. Based on GCC, MPC includes patches for the 4.3 and 4.4 branches. With a total of 2,500 lines of patches, MPC extends GCC's static analyses with a pass that detects global variables. Because the MPI implementation included in MPC is thread-based, a set of MPI tasks located on the same node shares global variables; one way to ensure the thread safety of such MPI tasks is to remove global variables. This new pass therefore warns the programmer about global variables located in C, C++ or FORTRAN applications. Finally, a new directive has been added to the compiler to enable automatic hybrid MPI/OpenMP checkpoint/restart: by adding a simple directive such as #pragma omp checkpoint to the application, the runtime will automatically dump the application and runtime memory and allow restarting from that point. It can be used within an MPI, OpenMP or hybrid MPI/OpenMP program.
This last extension is possible thanks to the parallel memory allocator integrated into MPC. Indeed, MPC ships with its own parallel memory allocator optimized for NUMA architectures and multithreaded applications. [EuroPVM/MPI'09] presents experimental results on the memory gains obtained with this allocator on simple MPI applications. With HERA, a realistic MPI hydrodynamics code, MPC saves up to 47% of memory (more than 1.3 GB out of 2.8 GB) thanks to its ability to recycle memory pages among MPI tasks on the same node.
To guide memory allocation, MPC includes a topology module that discovers the organization of the underlying architecture. This helps the runtime map threads to cores according to the cache hierarchy and the NUMA nodes.
Finally, MPC includes a patch to GDB and its own implementation of the libthread_db interface (designed by Sun) to enable user-level thread debugging. Thanks to these modifications, users can see and follow the threads created by MPC with GDB-based debugger front-ends or with DBX (from Sun). For additional details, the reader may refer to the article [MTAAP'10].
Building on ten years of experience in software development for the large NUMA clusters dedicated to the French nuclear-weapons simulation program, from Tera-1 (2001) to Tera-10 (2005) and now Tera-100 (2010), the Military Applications Department of the French Atomic Energy Authority (CEA) releases version 2.0 of its research framework MPC to exploit such large-scale architectures.
Thanks to its design, MPC allows mixed-mode programming models and efficient interaction with the HPC software stack to reach the highest performance.
About the CEA
The French Alternative Energies and Atomic Energy Commission (CEA) leads research, development and innovation in four main areas: low-carbon energy sources, global defense and security, information technologies and healthcare technologies. The CEA’s leadership position in the world of research is built on a cross-disciplinary culture of engineers and researchers, ideal for creating synergy between fundamental research and technology innovation. With its 15,600 researchers and collaborators, it has internationally recognized expertise in its areas of excellence and has developed many collaborations with national and international, academic and industrial partners.
Links referenced within this article
Marc Pérache, Hervé Jourdren and Raymond Namyst. MPC: A Unified Parallel Runtime for Clusters of NUMA Machines. In Euro-Par 2008 – Parallel Processing, 14th International Euro-Par Conference, Las Palmas de Gran Canaria, Spain, August 26-29, 2008.
Marc Pérache, Patrick Carribault and Hervé Jourdren. MPC-MPI: An MPI Implementation Reducing the Overall Memory Consumption. In Recent Advances in Parallel Virtual Machine and Message Passing Interface, 16th European PVM/MPI Users’ Group Meeting, Espoo, Finland, September 7-10, 2009.
Patrick Carribault, Marc Pérache and Hervé Jourdren. Enabling Low-Overhead Hybrid MPI/OpenMP Parallelism with MPC. To appear in International Workshop on OpenMP (IWOMP’10). Tsukuba, Japan, June 14-16, 2010.
Kevin Pouget, Marc Pérache, Patrick Carribault and Hervé Jourdren. User Level DB: a Debugging API for User-Level Thread Libraries. In Workshop on Multi-Threaded Architectures and Applications (MTAAP 2010), in conjunction with IPDPS 2010, Atlanta, GA, April 23, 2010.