Hybrid Parallel Simulation Solutions with Multiphysics and Intel

*Please note: this is a guest post from COMSOL AB.*

Numerical simulation is the third pillar of scientific discovery. Here's how COMSOL Multiphysics® simulation software takes advantage of hybrid parallelism on Intel® multicore processors and HPC clusters.

State-of-the-Art Technology and Beyond

Alongside theory and experiment, numerical simulation has established itself as the third pillar of scientific discovery. It provides an essential tool for modeling the physical behavior of complex processes in industry and science. To fully simulate real-world applications, many different and mutually dependent physical phenomena often need to be considered.

Multiphysics simulations help engineers and scientists develop safer cars, design more energy-efficient aircraft, search for new energy sources, optimize chemical production processes, advance communication technologies, and create new medical equipment. While it provides a cost-efficient and flexible tool for simulating the physical behavior of real-world processes, multiphysics simulation places high demands on compute power and memory resources.

As the hardware world has turned parallel in the multicore decade, even for desktop computers, parallelism is of paramount importance. A vital feature of compute-intensive software is the ability to scale up to hundreds and thousands of cores. COMSOL Multiphysics® is ready to exploit shared memory and distributed memory parallelism at the same time: hybrid parallel computing.

Let's dive deeper into the specifics of this type of computing.

Living in a Hybrid Parallel World

While the integration density of silicon chips keeps growing, clock frequencies have stagnated. The additional transistors are now used to pack more, and more complex, cores onto a single die. The latest multicore incarnations of the classic in-socket CPU types have more than ten cores, such as the recently introduced 15-core Intel® Xeon® processor E7-8890 v2 (Ivy Bridge).

For the programmer, the tide has turned in such a way that she now needs to address additional levels of parallelism. Modern software needs to account for core-level parallelism, parallelism between sockets on a shared memory node, and parallelism between nodes in a cluster. This boils down to shared memory and distributed memory parallelism. For illustration, consider a small cluster with six distributed memory processes (MPI processes) assigned to three nodes, as shown below. Each process uses shared memory across four cores.


Configuration of a hybrid cluster with three nodes connected by a network, two sockets per node, one quad core processor per socket, and one MPI process with four threads per socket. Image credit: COMSOL, Inc.
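
To make this layout tangible, here is a minimal, generic MPI + OpenMP sketch in plain C (not COMSOL code) that prints which node, process, and thread each piece of work runs on. The compiler and launcher invocations in the comments are assumptions; the exact flags depend on your MPI installation.

```c
/* hybrid_hello.c - minimal MPI + OpenMP sketch making the process/thread
 * layout of the figure visible. Build and run, for example, with
 *   mpicc -fopenmp hybrid_hello.c -o hybrid_hello
 *   OMP_NUM_THREADS=4 mpirun -n 6 ./hybrid_hello
 * to reproduce the 6 x 4 configuration shown above (flags vary by MPI). */
#include <mpi.h>
#include <omp.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int provided, rank, size, hostlen;
    char host[MPI_MAX_PROCESSOR_NAME];

    /* Request an MPI library that tolerates threaded regions between MPI calls. */
    MPI_Init_thread(&argc, &argv, MPI_THREAD_FUNNELED, &provided);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    MPI_Get_processor_name(host, &hostlen);

    /* Fork a team of threads inside each distributed memory process. */
    #pragma omp parallel
    {
        printf("node %s: MPI process %d of %d, thread %d of %d\n",
               host, rank, size, omp_get_thread_num(), omp_get_num_threads());
    }

    MPI_Finalize();
    return 0;
}
```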

When it comes to the algorithms, we also need to think about data parallelism and task parallelism. The overall goal of parallel execution is to perform more work per unit of time and thereby increase user productivity. The user can then either solve the same problem in a shorter amount of time (i.e., run more simulations per day) or use additional resources to solve even larger problems and obtain more accurate results with better resolution.

Data and Task Parallelism in Numerical Simulation

Numerical simulation, in large part, relies on uniform loop-based operations on huge matrices and vectors. Consider, for instance, an iterative solver for a linear system of equations (LSE) with several million degrees of freedom (DOFs). The solver can be put together from vector additions, matrix-vector multiplications, and scalar products. In a parallel iterative solver, all of these routines run in parallel.
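
As a hedged illustration (plain C with OpenMP, not the actual COMSOL implementation), two of these building blocks might look as follows; the loop iterations are simply split among the available threads:

```c
/* Two typical kernels of an iterative solver, parallelized for shared memory
 * with OpenMP. Generic sketches, not COMSOL source code. */

/* Vector update y = y + alpha * x: the iterations are independent, so
 * OpenMP divides them among the threads. */
void axpy(long n, double alpha, const double *x, double *y)
{
    #pragma omp parallel for
    for (long i = 0; i < n; ++i)
        y[i] += alpha * x[i];
}

/* Scalar product: each thread accumulates a private partial sum, and the
 * reduction clause combines the partial sums at the end. */
double dot(long n, const double *x, const double *y)
{
    double sum = 0.0;
    #pragma omp parallel for reduction(+:sum)
    for (long i = 0; i < n; ++i)
        sum += x[i] * y[i];
    return sum;
}
```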

When parallelizing these kernels, you find that blocks of data can be local to one thread or to a group of threads. So, not only can the work be divided, but the data arrays can also be broken into distinct blocks that are kept in different memory locations. This distribution of matrix blocks and division of loop iterations is known as data parallelism.
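
One simple way to picture such a decomposition (a generic sketch assuming a contiguous row-wise split, not COMSOL's actual partitioning scheme) is to give each process or thread its own contiguous block of rows:

```c
/* Compute the contiguous block of rows [*first, *last) owned by process
 * 'rank' out of 'nprocs' when n rows are distributed as evenly as possible.
 * The first (n % nprocs) processes receive one extra row. */
void row_block(long n, int rank, int nprocs, long *first, long *last)
{
    long base = n / nprocs;     /* minimum number of rows per process   */
    long rem  = n % nprocs;     /* processes that receive one extra row */
    *first = rank * base + (rank < rem ? rank : rem);
    *last  = *first + base + (rank < rem ? 1 : 0);
}
```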

In contrast, you can also imagine a case where the LSE to be solved depends on a parameter. Instead of a single LSE, you might then have to solve hundreds of LSEs. Of course, these tasks (solving the LSEs) can be processed in parallel as well, and this kind of parallelism is called task parallelism.
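
A minimal sketch of this idea, assuming a hypothetical solve_lse_for_parameter() routine that assembles and solves one LSE (again, not COMSOL's actual scheduler): the independent tasks are dealt out round-robin to the MPI processes, and no communication is needed during the sweep.

```c
/* Task parallelism sketch: each MPI process handles its own subset of the
 * parameter values. Generic MPI code, not taken from COMSOL. */
#include <mpi.h>

/* Hypothetical placeholder for assembling and solving one LSE. */
extern void solve_lse_for_parameter(double parameter);

void parameter_sweep(const double *parameters, int nparams)
{
    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    /* Parameter k is handled by process (k mod size). */
    for (int k = rank; k < nparams; k += size)
        solve_lse_for_parameter(parameters[k]);
}
```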

However, there are also algorithms that do not contain any kind of parallelism due to dependencies between intermediate results. These sequential parts are known to limit the achievable speedups, both in theory and in practice.
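
This limitation is quantified by the classic Amdahl's law (a general result, not specific to COMSOL): if a fraction p of the work can be parallelized and N cores are used, the achievable speedup is bounded by

S(N) = 1 / ((1 − p) + p / N)

so even with p = 0.95, twenty-four cores can deliver a speedup of at most about eleven.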

Shared Memory Parallelism: A Global View of a Local Part

Shared memory parallelism is based on a global view of the data. A shared memory program typically consists of sequential and parallel parts. In the parallel parts, a fork-join mechanism can be used to generate a team of threads that take over the parallel work by sharing data items and computing thread-private work. Communication between threads is accomplished by means of shared data and synchronization mechanisms.
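
In OpenMP terms, such a fork-join region with shared and thread-private data and explicit synchronization might look like the following generic sketch (an illustration of the concept, not COMSOL internals):

```c
/* Fork-join sketch: a sequential part, a parallel region with shared and
 * thread-private data, and an implicit join at the end of the region. */
#include <omp.h>
#include <stdio.h>

int main(void)
{
    const int n = 1000000;
    static double data[1000000];   /* shared between all threads */
    double total = 0.0;

    /* Sequential part: executed by the initial thread only. */
    for (int i = 0; i < n; ++i)
        data[i] = 1.0 / (i + 1);

    /* Parallel part: fork a team of threads. 'data' and 'total' are shared,
     * 'local' is private to each thread. */
    #pragma omp parallel
    {
        double local = 0.0;        /* thread-private partial result */

        #pragma omp for
        for (int i = 0; i < n; ++i)
            local += data[i];

        /* Synchronization: one thread at a time updates the shared sum. */
        #pragma omp critical
        total += local;
    }   /* implicit join: all threads synchronize here */

    printf("sum = %f\n", total);
    return 0;
}
```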

For the user, it is important to know that every desktop computer nowadays is a shared memory parallel computer, thanks to the multicore processor(s) under its hood. However, she also needs to know that these resources are limited.

Typically, the problem size will be limited by memory capacity and the performance will be limited by the available memory bandwidth. For additional resources, you would need to add more computers or shared memory nodes. To this end, shared memory nodes are interconnected by fast networks and make up a cluster. For cluster type systems, distributed memory parallelism is needed and hybrid parallelism needs to be taken into account for better performance.

Distributed Memory and Hybrid Parallelism: A Discrete View of the Whole Ensemble

For distributed memory computing, the data has to be divided and assigned to distributed memory locations. This requires considerable changes in the algorithms and programs.

Remote data items cannot be accessed directly, since they belong to different memory spaces managed by different processes. If data in other blocks is needed, it must be communicated explicitly between the distributed memory processes via message passing. Common patterns are two-sided communication, with one sending and one receiving process, or global communication (e.g., all-to-all communication). The additional communication takes extra time and resources and should be kept to a minimum. This also calls for new and improved algorithms that focus on data locality and reduced communication. Inside every process, shared memory parallelism can be used in order to fully exploit the available resources.
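
As a generic example of two-sided message passing (plain MPI, not COMSOL code), the following sketch exchanges boundary data between neighboring processes; the function name and the ring-shaped neighbor pattern are illustrative assumptions:

```c
/* Each process sends the boundary entries of its local block to its right
 * neighbor and receives the corresponding boundary from its left neighbor,
 * a typical pattern when remote data items are needed. */
#include <mpi.h>

void exchange_with_right_neighbor(const double *send_boundary,
                                  double *recv_boundary, int count)
{
    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    int right = (rank + 1) % size;        /* owner of the next block     */
    int left  = (rank - 1 + size) % size; /* owner of the previous block */

    /* Combined send/receive avoids the deadlock of two blocking sends. */
    MPI_Sendrecv(send_boundary, count, MPI_DOUBLE, right, 0,
                 recv_boundary, count, MPI_DOUBLE, left,  0,
                 MPI_COMM_WORLD, MPI_STATUS_IGNORE);
}
```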

Due to the hybrid configuration of modern hardware, a single programming and execution model is not sufficient. There are structural differences in communication between two threads on the same socket and two processes on different nodes. The hybrid model reflects the actual hardware in more detail and provides a much more versatile and adaptable tool to express all the mechanisms necessary for good performance. It combines the advantages of the global and the distributed views of memory. Most importantly, the hybrid model helps to reduce overhead and the demand for resources.
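
A compact way to see both levels working together is a global scalar product over a distributed vector. The sketch below (generic MPI + OpenMP, not COMSOL's solver code) reduces the local block with a team of threads and then combines the per-process results with a single collective. Compared to a purely distributed run with one MPI process per core, only one partial sum per process is communicated instead of one per core, which is exactly the kind of reduced overhead described above.

```c
/* Hybrid scalar product: shared memory parallelism inside each process,
 * distributed memory parallelism across processes. */
#include <mpi.h>

double hybrid_dot(const double *x_local, const double *y_local, long n_local)
{
    double local = 0.0, global = 0.0;

    /* Shared memory level: threads split the local loop iterations. */
    #pragma omp parallel for reduction(+:local)
    for (long i = 0; i < n_local; ++i)
        local += x_local[i] * y_local[i];

    /* Distributed memory level: one global communication combines the
     * partial sums from all MPI processes. */
    MPI_Allreduce(&local, &global, 1, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);
    return global;
}
```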

Next, we will show you a benchmarking example to illustrate the benefits of hybrid simulations.

A Hybrid Scalability Example

The scalability of hybrid numerical simulation with COMSOL Multiphysics® is exemplified with an electromagnetic waves model, distributed over frequencies, representing a balanced patch antenna for which the electric field is simulated. We use a small Gigabit Ethernet-connected cluster with three nodes, each with two quad-core Intel® Xeon® processor E5-2609 CPUs (one per socket) and 64 GB RAM, for a total of 24 cores.


The electric field of a balanced patch antenna. The distributed frequency model has 1.1 million DOFs and was solved with the iterative BiCGStab solver. Image credit: COMSOL, Inc.

Our study compares the number of simulations that can be run per day for a number of processes ranging from one to twenty-four and a number of threads per process varying from one to eight. You can see our results in the graph below. Each bar represents an (nn × np) configuration, where nn is the number of distributed memory processes, np is the number of threads per process, and nn × np is the number of active cores.

The graph shows a general performance increase with the number of active cores. For the full system load with twenty-four active cores, the best performance is obtained with one distributed memory process per socket (i.e., six processes in total). The performance and productivity gain on this small system with a hybrid process-thread configuration (case 6×4) is more than a factor of four over a single shared memory node (case 1×8). The hybrid 6×4 configuration is also almost 15% better than the purely distributed case with twenty-four processes (case 24×1).


Benchmarking the electromagnetic wave model using different process × thread configurations in a hybrid model. The y-axis indicates performance in terms of the total number of simulations that can be run per day. The bars indicate different configurations of nn × np, where nn is the number of distributed memory processes and np is the number of threads per process. Image credit: COMSOL, Inc.

For additional reading and further benchmark examples for shared and distributed memory, hybrid computing, batch sweeps, and details on how to set up hybrid parallel runs in COMSOL Multiphysics®, check out the hybrid modeling series on the COMSOL® Blog.

About the Authors

Jan-Philipp Weiss received his diploma degree in mathematics from the University of Freiburg, Germany, in 2000 and a PhD in applied mathematics from the Technical University of Karlsruhe, Germany, in 2006. From 2008 until 2012, Jan-Philipp headed a shared research group with Hewlett-Packard on numerical simulation for multicore technologies at the Karlsruhe Institute of Technology.

Pär Persson Mattsson received his Master's degree, with a major in applied mathematics and a minor in informatics, from the Georg-August-University Göttingen, Germany, in 2013.

Intel and Xeon are trademarks of Intel Corporation in the U.S. and/or other countries. COMSOL and COMSOL Multiphysics are trademarks of COMSOL AB.