Simulation-driven Innovation through Software Scalability: Altair RADIOSS
By Eric Lequiniou, September 27, 2016
Faster and more realistic simulations accelerate the pace of innovation. A clear goal of computer-aided engineering (CAE) is to offer a design environment that answers questions and delivers insights at the pace of human thought. Meeting this goal requires a combination of high-performance computing (HPC) hardware and software able to take advantage of the hardware’s performance potential.
Our customers use Altair’s HyperWorks CAE simulation suite to design and optimize high-performance, weight-efficient, and innovative products. HyperWorks’ solvers are architected to take advantage of HPC capabilities for CPU-demanding applications. The RADIOSS finite element analysis code, primarily used for crash and safety simulations, is a highly parallel solver based on hybrid parallelization with the Message Passing Interface (MPI) and OpenMP.
Explicit calculations performed by RADIOSS are increasing in size and complexity as automotive engineers address new challenges, including CO2 reduction through vehicle weight reduction. Lightweighting demands better predictability of fractures and ruptures, which in turn leads to finer meshes and more complex simulation models. HPC-class capabilities are therefore critical for RADIOSS users. It is typical for users to run RADIOSS on large clusters of high-end dual-CPU nodes, with a minimum of 64 to 128 cores per computation.
CAE jobs are often submitted to large cluster supercomputers as the last task of an engineer’s day. Results are reviewed the next morning, with modeling iterations made during the day leading up to another overnight run. To reduce computation times to a point where more interactivity is possible, using more cores per job is the only practical way forward, since per-core performance tends to flatten due to electrical and thermal constraints (the end of Dennard scaling). Only software able to leverage additional cores per CPU and additional nodes per job can thus continue to increase performance.
Improving scalability is thus critical, and it is why Altair focuses on the parallel performance of our solvers, notably RADIOSS.
Since its inception, RADIOSS has been designed for performance. It first ran on big vector supercomputers of the 1990s, such as the Cray T90 and the NEC SX. The software was then extended to support multiple CPUs under shared memory with our Shared Memory Parallel (SMP) version, at a time before the OpenMP standard appeared.
It took additional effort to adapt RADIOSS to distributed computing. We had to design a new Single Program Multiple Data (SPMD) parallel version based on domain decomposition techniques (the divide-and-conquer paradigm), first using the Parallel Virtual Machine (PVM) library and later MPI. Distributed computing is not intuitive for developers, who are typically more familiar with sequential programming. This effort required some “evangelization” of the RADIOSS development team, with a vision for where 21st-century scaled computing could take us.
Coding for scale had to be done without compromise to our customers’ performance or user experience. The complexities of parallel application setup and failure diagnostics were addressed with additional Quality Assurance (QA) and the use of specific tools like parallel debuggers.
Anticipating the evolution of hardware toward multicore processors, we unified the SMP and SPMD versions into a single parallel version called Hybrid Massive Parallel Processing (HMPP). This version is based on two levels of parallelism: MPI domain decomposition, and OpenMP multithreading inside each domain. Combining the two parallelization approaches allows a maximum level of scalability.
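As a rough illustration of the two-level idea (a hypothetical sketch, not RADIOSS code), the work split can be modeled as partitioning elements first across MPI domains, then across OpenMP threads within each domain:

```python
# Illustrative two-level work split in the spirit of HMPP: elements are
# first partitioned into MPI domains, then each domain's elements are
# divided among OpenMP threads. Function names are made up for this sketch.

def partition(n_items, n_parts):
    """Split n_items as evenly as possible into n_parts contiguous chunks."""
    base, extra = divmod(n_items, n_parts)
    return [base + (1 if i < extra else 0) for i in range(n_parts)]

def hybrid_layout(n_elements, n_mpi_domains, n_threads_per_domain):
    """Return the element count handled by each (domain, thread) pair."""
    layout = {}
    for d, dom_size in enumerate(partition(n_elements, n_mpi_domains)):
        for t, chunk in enumerate(partition(dom_size, n_threads_per_domain)):
            layout[(d, t)] = chunk
    return layout

# 512 domains x 32 threads = 16,384 workers, the scale of the Cray XC40 run
layout = hybrid_layout(10_000_000, n_mpi_domains=512, n_threads_per_domain=32)
print(len(layout), sum(layout.values()))
```

In the real solver each MPI domain also exchanges boundary data with its neighbors; this sketch only shows how the two levels of decomposition nest.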
Last year, we successfully demonstrated RADIOSS scalability up to 16,384 cores, using 512 nodes of a Cray XC40 supercomputer! This was achieved by running a car crash model of 10 million elements, a size expected to become routine for customer models in the future.
Today, typical car crash models contain between 5 and 6 million elements. In an explicit code, the stable time step is limited by the smallest element size (the Courant condition), so as the mesh is refined, the number of elements increases and the time step decreases, which increases the number of iterations as well. Thus, doubling the number of elements requires roughly a factor of 4 in performance to keep the same turnaround time.
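The cost argument can be sketched with a back-of-the-envelope model, assuming the stable time step shrinks in proportion to element size:

```python
# Rough cost model for an explicit solver: total work per unit of simulated
# time is (work per cycle) x (number of cycles), and the cycle count grows
# as the stable time step shrinks.

def relative_cost(element_factor, timestep_factor):
    """Relative compute cost after refinement, versus the original mesh.

    element_factor:  how much the element count grew (work per cycle)
    timestep_factor: how much the stable time step changed (cycles ~ 1/dt)
    """
    return element_factor / timestep_factor

# Doubling the element count while the time step halves -> 4x the work:
print(relative_cost(element_factor=2.0, timestep_factor=0.5))
```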
As hardware evolves toward a broad range of many-core architectures, including Intel’s Xeon Phi, and toward future exascale supercomputers, we continue to push software scalability forward.
In terms of domain decomposition enhancements, optimizing load balancing between domains is critical. Amdahl’s law states that if a code is 99% parallel, it can achieve at most a 100x speedup, even with an infinite number of processors. With this in mind, we are striving for parallelism everywhere.
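Amdahl’s bound is easy to verify numerically; a minimal sketch:

```python
def amdahl_speedup(parallel_fraction, n_cores):
    """Amdahl's law: speedup is limited by the serial fraction of the code."""
    serial = 1.0 - parallel_fraction
    return 1.0 / (serial + parallel_fraction / n_cores)

# A 99%-parallel code approaches, but never exceeds, 1 / 0.01 = 100x:
for n in (100, 1000, 16384):
    print(n, round(amdahl_speedup(0.99, n), 1))
```

At 16,384 cores the 99%-parallel code already sits near its 100x ceiling, which is why even small serial sections must be hunted down at this scale.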
We are also placing emphasis on improving the OpenMP multithreading parallelization. In a multi-node environment, it is beneficial to limit the volume of communication. For instance, with 128 nodes of 32 cores each, instead of running 4,096 MPI processes, it is possible to run 1 or 2 MPI processes per node, reducing the total to 128 or 256. This increases the share of local work performed under OpenMP, with no communication needed, and uses MPI only between domains distributed on different nodes, maintaining a good ratio between computation and communication.
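The rank-count arithmetic from this example can be written out as follows (the function is illustrative, not an Altair API):

```python
# Flat-MPI versus hybrid MPI+OpenMP process layout on a cluster:
# fewer MPI ranks per node means more cores served by shared-memory
# threads, and less inter-process communication.

def mpi_process_count(nodes, cores_per_node, ranks_per_node):
    """Return (total MPI processes, OpenMP threads per process)."""
    threads_per_rank = cores_per_node // ranks_per_node
    return nodes * ranks_per_node, threads_per_rank

flat    = mpi_process_count(128, 32, ranks_per_node=32)  # one rank per core
hybrid1 = mpi_process_count(128, 32, ranks_per_node=1)
hybrid2 = mpi_process_count(128, 32, ranks_per_node=2)
print(flat, hybrid1, hybrid2)
```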
Note that we use the same kind of strategy for all Altair solvers. For example, AcuSolve, our main Computational Fluid Dynamics (CFD) solver, is based on a similar approach combining MPI and OpenMP in two levels of parallelization. OptiStruct, our implicit solver for structural analysis and optimization, has recently been ported to MPI, enabling hybrid parallelism with OpenMP. Finally, FEKO, our main solver for electromagnetics, supports MPI, OpenMP, and also GPU computing.
With HyperWorks 14.0, Altair introduces a new licensing option called “HyperWorks Unlimited Solver Node,” based on the total number of nodes instead of the number of cores. This innovative licensing model aims to maximize our customers’ ROI and is particularly well suited to highly parallel applications running on multi-core and many-core architectures.
Working on parallel code optimization requires a great deal of effort and can be quite challenging. It requires efficient collaboration between software and hardware vendors to adapt software to the evolution and innovation of hardware. When such levels of efficiency are reached, as for Altair’s solvers, our customers gain the opportunity to tackle challenges they could not address in the past for lack of computational resources. This aligns with Altair’s vision to radically change the way organizations design products and make decisions through simulation-driven innovation.
Want to learn more? Watch this on-demand webinar: Optimizing High-Fidelity Crash & Safety Simulation Performance