Introduction - What is MPI?
MPI (Message Passing Interface) is a de facto standard that defines how information is passed across a distributed network. In 1994, the first specification, MPI-1.0, was published by the MPI Forum, a group of academic researchers and industry representatives spanning 20+ years of experience in parallel computing. The standard has now reached version 3.1 (June 2015) and the scope of version 4.0 is already in the works.
MPI enables vendors such as Intel and IBM, as well as open-source communities, to implement libraries that work across a wide range of platforms and hardware to solve the most challenging scientific and engineering problems.
When people talk about MPI, in most cases they mean a particular implementation used by an application. It is widely used by most legacy simulation packages and has traditionally been aimed at those lucky enough to have High Performance Computing (HPC) hardware within their company on which to run their MPI application.
So, if this technology has been around for such a long time and people are already using it, you may be wondering why you should be excited about it now. Cloud technology has reached a turning point: the services have become more accessible and offer near-infinite compute resources (think Amazon Web Services or Google Cloud Platform). If we can take advantage of these immense cloud infrastructures and implement MPI effectively, we can tackle a range of simulation pain points that no one has been able to address before.
We have developed our MPI solvers to work seamlessly with the Cloud, allowing engineers to attempt simulations that have never been possible before.
When to use MPI
MPI is specifically used for solving very large simulations very quickly.
It is difficult to define what a ‘very large’ simulation is in terms of numbers, as it depends not only on model size but also on the types of physics being solved within the model. One model may have double the number of degrees of freedom (DOF)* of another, but not all DOF have the same computational impact.
*DOF is the number of unknown quantities solved for in the model. A node typically has 6 DOF in 3D space (translations X, Y, Z and rotations RX, RY, RZ). For a typical 3D element in OnScale, each node carries 3 translational DOF. Each additional physics (electrical, thermal, etc.) adds an additional DOF to the assigned nodes.
For example, solving a piezoelectric model will be much more computationally intensive than a purely mechanical model of the same size.
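As a rough, illustrative count only (using the per-node DOF figures above and a hypothetical 10-million-node mesh):

10 million nodes x 3 mechanical DOF per node = 30 million DOF
10 million nodes x (3 mechanical + 1 electrical) DOF per node = 40 million DOF for the piezoelectric version of the same mesh

The piezoelectric model has only about a third more DOF, yet, as noted above, it is far heavier to solve.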
Rough guidelines for when to use MPI (any combination of the following):
- Large piezoelectric or non-linear (electrostatic) simulations
- Model size in the range of 100+ million degrees of freedom
- Solve times of 12+ hours when using 16+ CPU cores
What can MPI be used for
MPI can be used to solve a range of physics across various applications:
- Mechanical
  - Mechanical wave propagation models for Non-Destructive Testing
- Piezoelectric
  - 1-3 piezo-composite arrays for transducer simulations
  - Piezoelectric micromachined ultrasonic transducer (PMUT) arrays for fingerprint sensors
  - 3D Film Bulk Acoustic Resonators
  - Surface Acoustic Wave Resonators
- Thermal
  - Thermo-mechanical design problems in electronic packaging
- Electrostatics*
  - Capacitive micromachined ultrasonic transducer (CMUT) arrays
- Flow**
  - Ultrasonic flow meter simulations
*The electrostatic solve must be contained within a part
**Requires flow data to be partitioned
What solver features are not compatible with MPI
Certain physics that are coupled through nodes from one MPI part to another may not be available:
- Electrostatics on the MPI boundaries (piez esta) is not supported.
  - The electrostatic solve window must be contained within an MPI part (not touching the MPI boundary).
  - This will be added to the MPI solver in future releases.
- Data input
  - Imported flow data must be partitioned and loaded into the correct part beforehand; otherwise it is not supported.
  - This will be done automatically in future releases.
- Commands not supported:
  - Extrapolation (extr)
  - Echo (echo)
  - Mode shape (shap)
  - symb #get { }
    - nodal search commands
  - Piezoelectric (piez)
    - wndo auto piez
- Outputs not supported:
  - Animations can be difficult as the model is partitioned into many parts
How to use MPI
MPI Partitioning
Partitioning is the process of sub-dividing the mesh into smaller sections, or parts, so that a set number of CPU cores can be focussed on solving each section.
To partition a model we need to use the part command:
grid 200 10 200
part
max 2 /* Total number of MPI parts
sdiv 1 1 100 1 10 1 200 /* Part 1: nodes I 1-100, J 1-10, K 1-200
sdiv 2 100 200 1 10 1 200 /* Part 2: nodes I 100-200, J 1-10, K 1-200
end
geom
...
The example above defines 2 MPI parts subdivided along the I-direction.
All elements in the model must be associated with a part and must not overlap into multiple parts.
The part command must be defined between the grid and geom commands.
max subcommand
The max subcommand defines the total number of MPI parts, which is also the number of cores that will be used to solve the model. The example above uses a total of 2 parts.
sdiv subcommand
The sdiv subcommand is used to subdivide the mesh into the individual parts ready for MPI. Its arguments are the part number followed by the usual nodal index ranges used in OnScale, given in I, J, K order.
Partitioning can be applied along all 3 axes if necessary. Typically, partitioning along 2 axes is sufficient, as the majority of models are much wider than they are thick; a simple 2-axis example is sketched below.
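As a minimal sketch (reusing the 200 x 10 x 200 grid from the earlier example and assuming the split locations suit your geometry), a 2 x 2 partition along the I and K axes could look like this:
grid 200 10 200
part
max 4 /* 4 parts = 2 along I x 2 along K
sdiv 1 1 100 1 10 1 100 /* Part 1: I 1-100, K 1-100
sdiv 2 100 200 1 10 1 100 /* Part 2: I 100-200, K 1-100
sdiv 3 1 100 1 10 100 200 /* Part 3: I 1-100, K 100-200
sdiv 4 100 200 1 10 100 200 /* Part 4: I 100-200, K 100-200
end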
Using nested do loops, we can automate the partitioning process, with variables controlling the total number of parts and the number of parts along one axis:
c Setting Up Variables
symb nptot = 25 /* Total number of parts
symb npi = 5 /* Number of parts along I
symb npj = $nptot / $npi /* Number of parts along J
symb xpi = $npi
symb ypj = $npj
symb xndgrd = $indgrd /* $indgrd, $jndgrd assumed to hold the last nodal index in I and J
symb yndgrd = $jndgrd
symb xistep = $indgrd / $xpi /* Nodal step per part along I
symb xjstep = $jndgrd / $ypj /* Nodal step per part along J
part
max $nptot
symb ifr = 1 /* Initialise I starting index
/* Loop through parts and partition model
do loopI I 1 $npi 1
symb jfr = 1 /* Re-initialise J Starting Index
do loopJ J 1 $npj 1
/* Calculate part number
symb ip = $J + ( $I - 1 ) * $npj
/* Calculate i and j indices
symb xito = $I * $xistep
symb ito = $xito
symb xjto = $J * $xjstep
symb jto = $xjto
/* Check for last part
if ( $I eq $npi ) then
symb ito = $indgrd
endif
if ( $J eq $npj ) then
symb jto = $jndgrd
endif
/* Define part
sdiv $ip $ifr $ito $jfr $jto 1 $kndgrd /* Each part spans the full K range
symb jfr = $jto
end$ loopJ
symb ifr = $ito
end$ loopI
There are various ways to partition models; use a method you can understand clearly.
Important: the value passed to the max subcommand must be a single integer value, not the result of a mathematical operation between 2 or more variables.
GOOD max subcommand definition
symb npart = 100
part
max $npart
...
BAD max subcommand definition
symb a = 10
symb b = 10
symb npart = $a * $b
part
max $npart
...
Memory requirements (OnScale Enterprise)
Memory is set as the total memory for the simulation. This value is entered in the Cloud Scheduler at the time of job submission. Each part will be assigned a portion of the total memory:
RAM Per Part = Total Memory / Number of Parts
MPI models typically require less memory than regular shared-memory multi-processing models run with the mp omp command. It is difficult to say exactly how much memory is required, so a good starting point is typically around 500 MB per part.
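As a worked example (illustrative numbers only): for a 50-part MPI model, starting from 500 MB per part means entering roughly 50 x 500 MB = 25 GB as the total memory in the Cloud Scheduler, which is then divided back down to 25 GB / 50 = 500 MB per part.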
We recommend using as little RAM as possible.
Best practices
- Check the model is functional first without any MPI partitioning code
- When testing MPI:
  - Reduce meshing for quicker tests
  - Run the model for a small number of time steps to check the MPI code is functional
  - Start with 500 MB RAM per part and increase as required (trial & error)
- Remove the mp omp command from the input deck
Model estimation
Our Cloud estimator does not currently support estimating MPI simulations.
The best way to estimate an MPI simulation is to execute a small number of time steps and extrapolate to the full length of the simulation. This can be done with the following execution code snippet placed after the prcs command:
prcs /* Process command
symb #get { step } timestep /* Retrieve the solver timestep
symb nexec = $full_simulation_time / $step /* Total number of timesteps ($full_simulation_time assumed defined earlier)
symb ts = 100 /* Number of timesteps to run for the estimate
symb #get { time1 } wtime /* Wall-clock time before the short run
exec $ts
symb #get { time2 } wtime /* Wall-clock time after the short run
symb t_solve = $time2 - $time1 /* Wall-clock time taken for $ts timesteps
symb t_est = ( $nexec / $ts ) * $t_solve /* Estimated full solve time in seconds
The variable values are logged in the flex print file (.flxprt), which can be downloaded from Storage. The variable t_est stores an estimate of the full simulation solve time in seconds. This can be converted into hours and multiplied by the total number of MPI parts to give an indication of core-hour usage.
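If you want this conversion logged as well, a minimal sketch (assuming the total number of parts is held in a variable such as $nptot from the partitioning example above) can be appended to the snippet:
symb t_est_hrs = $t_est / 3600 /* Estimated solve time in hours
symb core_hrs = $t_est_hrs * $nptot /* Approximate core-hours = hours x total number of parts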
Considerations when running piezoelectric MPI simulations
With piezoelectric simulations, there are more setup considerations to take into account:
Full Electric Solve
When using MPI for models with a full 3D electric solve (e.g. 1-3 piezocomposites or SAW resonators), if the electric window is continuous across parts, the distributed solver (dstr) must be used:
non-MPI Model
piez
slvr pard
...
MPI Model
piez
slvr dstr /* Must be first command
...
Important: the distributed solver must be the first subcommand used in the piez section of the code for full 3D electric solve models.
1D Electric Solve
For models using a 1D electric solve approximation (e.g. FBARs), the solver command is defined in the usual manner:
non-MPI & MPI models
piez
slvr drct * * j
...
Multiple Circuits for Arrays
For MPI simulations, the circuit solve by default is calculated on the master part.
For simulations containing many circuits (e.g. PMUT and CMUT arrays), it is more efficient to solve them across each individual part:
piez
cslvr lapack dstr
slvr drct
...
The cslvr subcommand is used with the dstr option, and it must be the first subcommand in the piez section when using the distributed circuit solve option.
Important: this can only be used when the electric window is fully contained within an MPI part (i.e. the piezoelectric window is smaller than the nodal boundaries of the MPI part).
Ways of executing MPI models
OnScale Enterprise - Running on the Cloud
The Cloud Scheduler can be used to execute MPI models. Click the Run on Cloud icon to open the Cloud Scheduler:
- Select Single precision
  - Single precision supports simulations of up to 2.5 billion elements; otherwise switch to Double precision
- Enter the total amount of RAM for the full simulation
- Select Run
The Cloud Scheduler auto-detects an MPI simulation when the part command is present in the input file. The total number of parts/cores is determined by the value set by the max subcommand.