Modern GPUs have transformed machine learning and scientific computing by enabling massive computational throughput. Computational Fluid Dynamics (CFD), however, remains one of the more demanding applications because practical engineering simulations often require solving millions of coupled equations over complex geometries. While CPUs have traditionally been the dominant platform for industrial CFD, modern GPU architectures provide an opportunity to rethink how these simulations are performed. In particular, recent advances in GPU sparse linear algebra have made pressure-based incompressible flow solvers a good match for GPU hardware. This article introduces the development of an educational GPU-accelerated unstructured finite-volume solver built around the SIMPLE and PIMPLE algorithms using CUDA and HYPRE. The objective was not just to create an efficient CFD code, but also to investigate and demonstrate how modern sparse linear algebra and accelerator hardware can unlock new possibilities for pressure-based flow simulation.
The starting point for many CFD simulations is the Navier–Stokes equations, which describe conservation of mass and momentum within a fluid. These equations govern a wide range of everyday phenomena, from airflow around vehicles and aircraft to pipe flows, pumps, heat exchangers and environmental transport. In many practical engineering applications the flow speed is much smaller than the speed of sound, meaning density variations become negligible. Under this incompressible assumption, the continuity equation reduces to the simple constraint div(u)=0. While this approximation appears minor, it fundamentally changes the mathematical structure of the problem.
Unlike variables such as velocity or temperature, pressure in incompressible flows does not possess its own transport equation. Instead, pressure acts as a constraint enforcing incompressibility throughout the domain. In a mathematical sense, pressure behaves as a Lagrange multiplier that ensures the velocity field satisfies continuity. This creates a coupled saddle-point system in which velocity and pressure cannot be solved independently. Directly solving such systems can become computationally demanding, particularly for large three-dimensional industrial meshes.
To address this challenge, most practical incompressible flow solvers employ operator splitting methods. Rather than solving velocity and pressure simultaneously as a single monolithic system, the equations are decomposed into smaller and more manageable subproblems. Mathematically this process can be interpreted through a Schur complement factorization of the coupled system. The exact Schur complement operator is expensive to evaluate, but for transient problems and pseudo-transient formulations it can often be approximated using mass matrix information, resulting in efficient split formulations.
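In block form, and in standard notation that is not specific to this solver (A the discrete momentum operator, G the gradient, D the divergence), the coupled system and its pressure Schur complement read:

    \begin{pmatrix} A & G \\ D & 0 \end{pmatrix}
    \begin{pmatrix} \mathbf{u} \\ p \end{pmatrix}
    =
    \begin{pmatrix} \mathbf{f} \\ \mathbf{0} \end{pmatrix},
    \qquad
    S = -D\,A^{-1}G \;\approx\; -D\,\mathrm{diag}(A)^{-1}G

Replacing A^{-1} by the inverse of its diagonal, or by mass matrix information in transient formulations, turns the exact but dense Schur complement into the cheap, sparse Poisson-like operator that SIMPLE-type methods actually solve.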
This operator split viewpoint naturally leads to predictor-corrector methods. A provisional velocity field is first computed, after which a pressure Poisson equation is solved to enforce continuity. The pressure field is then used to correct both velocity and fluxes. Although this decomposition introduces approximation, it dramatically improves robustness and computational efficiency. The resulting framework forms the foundation of many pressure-based algorithms including SIMPLE, PISO and PIMPLE, which remain among the most widely used approaches in industrial CFD.
Among pressure-based methods, SIMPLE is one of the most important algorithms in industrial CFD. SIMPLE stands for Semi-Implicit Method for Pressure Linked Equations. Its strength is that it converts the coupled incompressible Navier-Stokes problem into a sequence of smaller scalar problems: momentum equations for velocity, followed by a pressure correction equation that enforces mass conservation.
The key idea is simple but powerful. First, the momentum equations are solved using the current pressure field to obtain a predicted velocity. This predicted velocity will generally not satisfy div(u)=0 exactly. Therefore, a pressure correction equation is assembled from the continuity error. Solving this pressure equation gives the correction needed to make the velocity field divergence-free. The pressure, velocity and face fluxes are then updated, and the process is repeated until both the velocity change and continuity residual become small.
SIMPLE remains extremely useful in practical CFD: it avoids solving the full saddle-point system directly, while still preserving the pressure-velocity coupling needed for incompressible flow. It is robust, modular and works well with complex unstructured finite-volume meshes. It also maps well to modern CFD software design because the velocity equations behave like scalar transport equations, while the pressure correction equation behaves like a Poisson-type elliptic solve. The GPU solver developed here follows exactly this structure, as outlined below.
SIMPLE loop used in the GPU solver
----------------------------------
Start with U, p and phi
|
v
Compute grad(p)
|
v
Solve momentum equations using current pressure
-----------------------------------------------
A(U) U* = H - grad(p)
Result:
U* = predicted cell-centered velocity
|
v
Extract rAU from the momentum operator
--------------------------------------
rAU tells how strongly pressure correction changes velocity
|
v
Build pressure-correction matrix
--------------------------------
The pressure equation uses rAU as its main coefficient
|
v
Construct predicted face flux
-----------------------------
phiStar = pressure-coupled face flux from the momentum predictor
Rhie-Chow interpolation is used here to avoid pressure-velocity decoupling
on the collocated finite-volume mesh.
|
v
Compute continuity error
------------------------
continuity error = div(phiStar)
|
v
Solve pressure correction equation
----------------------------------
Find the pressure correction needed to remove the continuity error
|
v
Correct face flux
-----------------
phi = phiStar - pressure correction flux
This gives the conservative finite-volume flux.
|
v
Update pressure
---------------
p = p + relaxed pressure correction
|
v
Correct cell-centered velocity
------------------------------
U = U* - rAU grad(pressure correction)
|
v
Check convergence
-----------------
Is div(phi) small?
Is the velocity change small?
|
v
If not converged:
repeat SIMPLE loop
If converged:
write fields and postprocess quantities such as drag and lift
SIMPLE starts by solving the momentum equations using the current pressure field. This produces an intermediate cell-centered velocity field, usually called U*. However, this predicted velocity field is not guaranteed to satisfy incompressibility. In other words, the face fluxes constructed from the momentum predictor may not balance exactly across each control volume.
The next step is therefore to measure this mass imbalance and use it to build a pressure correction equation. This pressure correction equation determines how the pressure field must change in order to remove the continuity error. After this equation is solved, the solver corrects the face fluxes, updates the pressure field, and corrects the cell-centered velocity field. The corrected face flux ensures finite-volume mass conservation, while the corrected velocity field becomes the starting point for the next momentum solve.
This process is repeated until the continuity error and the velocity change become sufficiently small. In this way, SIMPLE avoids solving the full coupled velocity-pressure saddle-point system directly. Instead, it converts the incompressible flow problem into a sequence of more manageable momentum solves and pressure correction solves.
Two details are especially important in making this procedure work well. The first is rAU, the inverse of the diagonal coefficient of the momentum equation. This quantity tells the solver how strongly a pressure correction changes the velocity field. In the pressure correction equation, rAU plays a role similar to a diffusion coefficient because it controls how the pressure correction propagates through the mesh to remove mass imbalance.
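Extracting rAU is a simple elementwise operation: the reciprocal of the momentum matrix diagonal. A minimal CUDA sketch, with identifiers such as diagA and rAU chosen here for illustration rather than taken from the actual code:

    // Illustrative kernel: rAU[i] = 1 / diag(A)[i], where diag(A) holds the
    // assembled diagonal coefficient of the momentum equation in each cell.
    __global__ void extractRAU(const double* __restrict__ diagA,
                               double* __restrict__ rAU,
                               int nCells)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < nCells) {
            rAU[i] = 1.0 / diagA[i];
        }
    }

The per-cell rAU is then interpolated to faces when the pressure correction matrix and the face fluxes are assembled.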
The second important detail is Rhie-Chow interpolation. In a collocated finite-volume method, pressure and velocity are both stored at cell centers. If the face flux is obtained only by simple interpolation of cell-centered velocity, pressure and velocity can decouple, leading to non-physical checkerboard pressure patterns. Rhie-Chow interpolation avoids this by constructing the face flux in a pressure-sensitive way. SIMPLE corrects phi directly because finite-volume conservation is enforced through face fluxes, while the corrected cell-centered velocity keeps the stored velocity field consistent for the next iteration.
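One common way to write the Rhie-Chow face flux is the following (standard collocated finite-volume notation; the exact discretization in the solver may differ in details):

    \phi_f = \overline{\mathbf{u}}_f \cdot \mathbf{S}_f
    \;-\; \overline{r_{AU}}_f \left[ \frac{|\mathbf{S}_f|}{|\mathbf{d}|}\,(p_N - p_P)
    \;-\; \overline{\nabla p}_f \cdot \mathbf{S}_f \right]

Here overbars denote linear interpolation to the face, S_f is the face area vector, and d is the vector from the owner cell P to the neighbour cell N. The bracketed term swaps the interpolated pressure gradient for a compact face-normal pressure difference, and this substitution is what suppresses the checkerboard modes.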
The pressure correction equation that appears in incompressible flow solvers is Poisson-like. This is one reason pressure-based CFD is computationally demanding. A Poisson equation is elliptic, which means information is coupled globally across the domain. A local correction in one region can influence the pressure field everywhere else. As a result, the pressure solve is often the most expensive part of a SIMPLE or PIMPLE iteration.
A useful way to understand this is through the heat equation. If the heat equation is written in terms of Fourier modes, each mode decays at a rate proportional to D k^2, where D is the diffusion coefficient and k is the wave number. High-frequency oscillations have large k, so they decay rapidly. Long-wavelength oscillations have small k, so they decay much more slowly.
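Written out, for the heat equation \partial_t T = D \nabla^2 T each Fourier mode evolves as

    \hat{T}(k, t) = \hat{T}(k, 0)\, e^{-D k^2 t}

so halving the wave number quadruples the time needed for the same amount of decay.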
This same idea is central to multigrid methods. Basic iterative smoothers are very good at removing high-frequency error, but they are much less effective at removing smooth, long-wavelength error. Multigrid solves this problem by transferring the error to coarser grids. On a coarse grid, what looked like a long-wavelength error on the fine grid becomes easier to represent and remove. By moving between fine and coarse levels, multigrid can reduce error across many length scales very efficiently.
In geometric multigrid, these coarse levels are created from actual coarse meshes or structured grid hierarchies. This can be extremely effective, but it is not always convenient for complex unstructured industrial meshes. Creating high-quality coarse grids and transfer operators can become difficult, especially when the mesh contains arbitrary polyhedra, local refinement or complicated geometry.
Algebraic multigrid addresses this problem differently. Instead of requiring explicit geometric coarse grids, AMG builds the multigrid hierarchy from the sparse matrix itself. The matrix contains information about which unknowns are strongly coupled, and AMG uses this algebraic structure to form coarse levels, interpolation operators and smoothers. This makes AMG especially attractive for unstructured finite-volume CFD, where the pressure equation is naturally represented as a sparse matrix.
This is also where GPUs become important. AMG is computationally heavy: it involves sparse matrix-vector products, smoothing operations, residual calculations, restriction, interpolation and coarse-grid corrections. These operations have large throughput requirements and are repeated many times during a pressure solve. Modern GPUs are designed for exactly this kind of high-throughput parallel work, so a pressure-based CFD solver can benefit strongly when its pressure correction equation is solved using a GPU-capable AMG library.
Several GPU-oriented sparse linear algebra and AMG libraries exist, including HYPRE/BoomerAMG, NVIDIA AmgX, PETSc with GPU backends, Trilinos/MueLu and Ginkgo. In this solver, HYPRE was chosen because it provides mature parallel sparse linear solvers and the well-optimized BoomerAMG algebraic multigrid preconditioner. The pressure correction equation is solved using a Krylov method accelerated by BoomerAMG, while the surrounding finite-volume assembly and correction operations are kept on the GPU. This combination allows the solver to preserve the robustness of pressure-based CFD while taking advantage of modern GPU throughput.
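As a rough sketch of what such a setup looks like in HYPRE, the following configures PCG preconditioned by BoomerAMG with commonly used GPU-friendly options. The specific option values are illustrative assumptions, not the actual configuration of this solver, and parA, parB, parX stand for already-assembled ParCSR matrix and vector handles:

    /* Sketch: PCG + BoomerAMG pressure solve on the device, assuming parA,
     * parB and parX have already been assembled as ParCSR objects. */
    HYPRE_Init();
    HYPRE_SetMemoryLocation(HYPRE_MEMORY_DEVICE);  /* device-resident data */
    HYPRE_SetExecutionPolicy(HYPRE_EXEC_DEVICE);   /* setup/solve on the GPU */

    HYPRE_Solver pcg, amg;
    HYPRE_ParCSRPCGCreate(MPI_COMM_WORLD, &pcg);
    HYPRE_PCGSetTol(pcg, 1e-8);
    HYPRE_PCGSetMaxIter(pcg, 200);

    HYPRE_BoomerAMGCreate(&amg);
    HYPRE_BoomerAMGSetCoarsenType(amg, 8);  /* PMIS: GPU-friendly coarsening */
    HYPRE_BoomerAMGSetRelaxType(amg, 18);   /* l1-Jacobi smoother */
    HYPRE_BoomerAMGSetTol(amg, 0.0);        /* act as a preconditioner */
    HYPRE_BoomerAMGSetMaxIter(amg, 1);      /* one V-cycle per PCG iteration */

    HYPRE_PCGSetPrecond(pcg,
        (HYPRE_PtrToSolverFcn) HYPRE_BoomerAMGSolve,
        (HYPRE_PtrToSolverFcn) HYPRE_BoomerAMGSetup, amg);

    HYPRE_ParCSRPCGSetup(pcg, parA, parB, parX);
    HYPRE_ParCSRPCGSolve(pcg, parA, parB, parX);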
The code is organized around keeping the finite-volume physics modular, and keeping the expensive numerical work on the GPU. The solver reads an OpenFOAM polyMesh, applies boundary conditions from runtime case files, assembles the finite-volume equations, and solves the resulting sparse systems using CUDA and HYPRE.
At the top level, the code contains a main flow solver application and a set of reusable numerical libraries. A simplified view of the structure is:
anabasis (link to the github repo)
|
|-- apps/simple_gpu
| |
| |-- main.cu
| |-- boundary condition handling
| |-- patch geometry
| |-- velocity boundary evaluation
|
|-- libpoisson
| |
| |-- mesh utilities
| |-- gradient reconstruction
| |-- scalar elliptic / Poisson utilities
| |-- HYPRE backend
|
|-- libscalar
| |
| |-- scalar transport solver
| |-- convection and diffusion operators
| |-- scalar boundary conditions
|
|-- cases
| |
| |-- reference case files
|
|-- docs
|
|-- build and run notes
The scalar transport library is important because many CFD equations have a similar finite-volume structure. Passive scalar transport, temperature, species concentration, turbulence kinetic energy and dissipation can all be written in convection-diffusion-source form. This makes the scalar transport module a natural path toward additional physics such as heat transfer, species transport, level-set equations and RANS turbulence models.
In simplified form, these scalar equations look like:
convection of phi
+ diffusion of phi
+ source terms
= 0
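In standard notation, and for the steady case, this is the generic convection-diffusion-source equation, with Gamma_phi the diffusivity and S_phi the source term:

    \nabla \cdot (\mathbf{u}\,\phi) \;-\; \nabla \cdot (\Gamma_\phi \nabla \phi) \;=\; S_\phi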
The GPU execution flow can be summarized as:
CUDA kernel:
assemble momentum matrices and right-hand sides
HYPRE:
solve momentum systems
CUDA kernel:
extract rAU from the momentum operator
CUDA kernel:
construct pressure-coupled face fluxes
CUDA kernel:
build pressure correction right-hand side
HYPRE:
solve pressure correction system using PCG + AMG
CUDA kernel:
correct face fluxes
CUDA kernel:
update pressure and velocity
CUDA kernel:
compute residuals and convergence measures
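A hedged host-side sketch of how this sequence might be orchestrated follows; the function and variable names below are illustrative placeholders, not the solver's actual API:

    // Illustrative outer loop; each call launches GPU-resident kernels or a
    // HYPRE solve. All identifiers are placeholders.
    bool converged = false;
    for (int iter = 0; iter < maxOuterIters && !converged; ++iter) {
        assembleMomentum(mesh, fields);     // CUDA: momentum matrices + RHS
        solveMomentum(hypre, fields);       // HYPRE: predicted velocity U*
        extractRAU(fields);                 // CUDA: rAU = 1 / diag(A)
        buildFaceFluxes(mesh, fields);      // CUDA: Rhie-Chow phiStar
        buildPressureRHS(mesh, fields);     // CUDA: div(phiStar)
        solvePressure(hypre, fields);       // HYPRE: PCG + BoomerAMG
        correctFluxes(mesh, fields);        // CUDA: conservative phi
        correctFields(mesh, fields);        // CUDA: update p and U
        converged = checkResiduals(fields); // CUDA: device-side reductions
    }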
The important performance choice is to avoid repeatedly moving large CFD fields between CPU and GPU. Once the mesh and fields are on the device, the major assembly, correction and residual operations remain GPU-resident. This is essential because realistic unstructured CFD cases contain millions of cells and faces. Repeated CPU-GPU transfers would quickly destroy the benefit of acceleration.
The same structure also extends naturally from SIMPLE to PIMPLE. SIMPLE is mainly used for steady-state problems, while PIMPLE places a similar pressure-velocity correction process inside a transient time-stepping loop. This allows the solver to use the robustness of SIMPLE while moving toward unsteady incompressible simulations. From a software point of view, the same core operations remain central: momentum prediction, pressure correction, flux correction and velocity correction.
In summary, the GPU CFD framework is as follows: (1) the scalar transport module provides a route to additional equations and turbulence models; (2) the pressure solver machinery provides the core incompressible flow capability; (3) HYPRE supplies AMG-accelerated sparse linear algebra; and (4) CUDA kernels keep the finite-volume operations close to the data. Together, these pieces form the foundation of a pressure-based unstructured finite-volume solver designed for modern GPU hardware.
To evaluate the GPU solver, a simple pipe-flow case was tested on an NVIDIA A100 GPU using a sequence of increasingly large unstructured meshes, up to a maximum of 66 million cells. The goal of this experiment was to measure how the pressure-based SIMPLE algorithm scales as the number of cells increases.
A useful performance metric for this type of solver is MIUPS, or million iterative updates per second. It measures how many cell updates the solver performs per second:
MIUPS = number of cells x number of iterations / elapsed time / 1,000,000
This metric is useful because it normalizes the result by problem size. Instead of only reporting total runtime, MIUPS shows how much numerical work the solver is processing per second.
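Purely as an arithmetic illustration of the formula above: a 66-million-cell mesh sustaining 66 MIUPS completes one full iteration over the mesh every second, since 66,000,000 cells x 1 iteration / 1 s / 1,000,000 = 66.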
The first observation from the A100 results is that the post-setup SIMPLE iteration time grows almost linearly with the number of cells. This is important. It means that after the one-time setup costs are removed, the solver behaves predictably as the mesh size increases. A larger mesh requires proportionally more work, but the GPU continues to process the problem at a nearly steady rate.
Across the tested pipe-flow cases, the total post-setup SIMPLE-loop throughput lies approximately in the range of 50 to 80 million cell-iterations per second. Double precision gives roughly 48 to 54 MIUPS, while single precision gives roughly 65 to 77 MIUPS. This is a strong result for a pressure-based incompressible solver because each SIMPLE iteration contains multiple expensive operations: momentum solves, pressure correction, Rhie-Chow flux construction, velocity correction and residual evaluation.
Single precision is consistently faster than double precision. This is expected on GPUs because single precision has lower memory traffic and higher arithmetic throughput. In these tests, single precision gives about 1.3 to 1.5 times higher total SIMPLE-loop throughput compared with double precision. This does not mean single precision is always the right choice for CFD, but it shows that for moderate residual targets and well-scaled variables, single precision can provide a useful performance advantage.
The pressure solve remains the most expensive part of the SIMPLE iteration. This is expected because the pressure correction equation is Poisson-like and globally coupled. It requires a Krylov solver accelerated by AMG, and this involves repeated sparse matrix-vector products, smoothing operations, residual evaluations and coarse-grid corrections. Even on a GPU, this pressure correction step dominates the cost of the pressure-based algorithm.
However, the pressure phase itself also shows good throughput. In these tests, the pressure solve phase reaches roughly 75 to 90 million cell-iterations per second in double precision and roughly 115 to 150 million cell-iterations per second in single precision. The pressure solve time also scales almost linearly with mesh size, which indicates that the AMG-preconditioned pressure correction remains efficient across the tested range.
The velocity phase is significantly faster. The visible velocity phase reaches roughly 350 to 380 million cell-iterations per second. This is because the momentum equations are easier to handle than the pressure equation. They are more local, better conditioned under the tested settings, and can be accelerated effectively using methods such as multicolored Gauss-Seidel or Krylov solvers with simple smoothers. Compared with the pressure equation, the velocity solve is not the main bottleneck.
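For illustration, one level of a multicolored Gauss-Seidel sweep in CUDA could look like the sketch below, assuming CSR matrix storage and a precomputed cell coloring; all identifiers here are hypothetical:

    // One sweep over the cells of a single color. Cells sharing a color have
    // no direct matrix coupling, so they can be relaxed in parallel.
    __global__ void gaussSeidelColorSweep(const int* __restrict__ colorCells,
                                          int nColorCells,
                                          const int* __restrict__ rowPtr,
                                          const int* __restrict__ colIdx,
                                          const double* __restrict__ vals,
                                          const double* __restrict__ diag,
                                          const double* __restrict__ b,
                                          double* __restrict__ x)
    {
        int t = blockIdx.x * blockDim.x + threadIdx.x;
        if (t >= nColorCells) return;
        int i = colorCells[t];                 // global cell index in this color
        double sum = b[i];
        for (int k = rowPtr[i]; k < rowPtr[i + 1]; ++k) {
            int j = colIdx[k];
            if (j != i) sum -= vals[k] * x[j]; // off-diagonal contributions
        }
        x[i] = sum / diag[i];                  // relax this unknown
    }

The host loops over the colors and launches one such kernel per color, so updates later in a sweep see the freshest values from earlier colors, which is what gives Gauss-Seidel its smoothing strength.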
These results highlight an important point about GPU-based incompressible CFD. The pressure equation is still the hardest part of the algorithm, but modern GPU sparse linear algebra makes it fast enough that the overall SIMPLE loop can reach tens of millions of cell updates per second on large unstructured meshes. On typical CPU-based unstructured pressure solvers we would expect much lower single-node throughput, sometimes as much as 10x lower (the exact CPU comparison depends strongly on solver settings, mesh quality, linear solver configuration and hardware).
On various industrial meshes solved with the GPU code on a local RTX 3060 versus OpenFOAM on a 16-core AMD Ryzen 9 9950X3D, the GPU code started paying strong dividends above tetrahedral cell counts of about 8 million. Depending on solver settings and mesh size, the double-precision speed-up to convergence can be in the range of 5-6x for optimal inexact settings, and 8-10x if the solves are carried out in a robust, exact manner.
While the industrial cases cannot be shown, here we present results for the classical incompressible flow-past-cylinder benchmark and compare them with reference FEATFLOW data.
This benchmark is useful because it tests more than just the linear solver and time stepping. The lift coefficient is especially sensitive and depends on the pressure field, the viscous stresses, the transient wake dynamics and the accuracy of the pressure-velocity coupling.
The figure shows the lift coefficient obtained using the GPU finite-volume solver on an 800k-cell polyhedral mesh, compared with the FEATFLOW benchmark data. The two curves are nearly overlapping over the full time interval. The initial rise, the sharp transition, the negative lift peak and the later recovery are all captured closely. This agreement indicates that the pressure correction, velocity update, flux correction and transient PIMPLE-type coupling are working together consistently.
The comparison with the benchmark data also shows that the drag coefficient is captured very accurately, with less than 1% error (not shown), while the more sensitive lift coefficient is captured with an error of about 5%. The wall time of 24 hours for this transient benchmark is indicative of the speed of the approach.
Overall, this work demonstrates that pressure-based incompressible CFD can benefit strongly from modern GPU hardware. By combining an unstructured finite-volume formulation, GPU-resident assembly and correction kernels, and HYPRE/BoomerAMG for the pressure solve, the solver preserves the robustness of SIMPLE and PIMPLE while achieving high throughput on large meshes. The A100 tests showed nearly linear scaling and strong MIUPS performance, while the flow-past-cylinder comparison with FEATFLOW data showed that the solver also captures physically meaningful transient behavior. The same code structure can also be extended naturally toward scalar transport, heat transfer, RANS turbulence models and other transport equations, making it a compact but extensible foundation for GPU-accelerated incompressible CFD.