For engineers and researchers in CFD, FVM solvers like OpenFOAM are the natural reference point. They are mature, widely trusted, fast and memory efficient across a wide range of industrial flow problems. It is reasonable to ask: if such tools already exist, why spend time developing finite element solvers for computational fluid dynamics at all?
Outside its bastion of solid mechanics, the finite element method (FEM) remains extremely attractive for fluid flow problems involving fluid-structure interaction, strongly coupled multiphysics, adjoint analysis, POD/ROM, and research-driven model development. FEM's variational framework, flexibility on complex geometries, and natural compatibility with constraints and rigorous mathematical analysis make it a powerful foundation for new development. This is why commercial fluid solvers built on finite element technology do exist.
These solvers, however, are known to be quite slow compared to finite volume solvers, especially in three dimensions and on unstructured meshes. This performance gap is one of the major limitations of FEM-based CFD, and it helps explain why FEM-CFD has remained more common in specialized research and multiphysics settings than in mainstream industrial flow simulation.
To what extent can this picture change?
In the past few years, there has been exciting progress in matrix-free finite element solvers. The idea is powerful: instead of explicitly assembling large sparse matrices and then applying them repeatedly inside iterative linear solvers, we compute the required matrix-vector actions on the fly. For efficient Krylov methods such as GMRES, this can significantly reduce memory usage and relieve bandwidth bottlenecks that often dominate performance on modern hardware. In other words, matrix-free methods aim to make high-order and large-scale FEM more practical than before.
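To make the idea concrete, here is a minimal, library-agnostic sketch (using SciPy rather than Firedrake, and a toy 1D Poisson operator rather than Navier-Stokes): the operator's action is computed on the fly inside a Krylov solve, so no sparse matrix is ever assembled. CG is used here instead of GMRES only because this model operator happens to be symmetric positive definite; the matrix-free principle is identical.

```python
import numpy as np
from scipy.sparse.linalg import LinearOperator, cg

n = 32
h = 1.0 / (n + 1)

def apply_laplacian(v):
    """Matrix-free action of the 1D finite-difference Laplacian (Dirichlet BCs).
    The stencil (-v[i-1] + 2*v[i] - v[i+1]) / h^2 is applied on the fly,
    so the sparse matrix is never stored."""
    out = 2.0 * v.copy()
    out[1:] -= v[:-1]
    out[:-1] -= v[1:]
    return out / h**2

# Wrap the action as a LinearOperator: the Krylov solver only ever
# calls apply_laplacian, never sees matrix entries.
A = LinearOperator((n, n), matvec=apply_laplacian)
b = np.ones(n)
x, info = cg(A, b)
```

The same pattern scales up: in a matrix-free FEM solver the `matvec` becomes a sum of per-element quadrature evaluations instead of a stencil.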
As with any CFD solution, there is no free lunch. Matrix-free solvers introduce their own design constraints. Some preconditioning strategies become harder to use, implementation details matter much more, and not every problem benefits equally. But alongside these limitations come important bonuses: reduced memory pressure, better arithmetic intensity, and a route toward scalable high-performance finite element CFD that was much harder to access with traditional assembled approaches.
In this blog we will explore the matrix-free direction.
Since we are exploring a recent direction in matrix-free solver architecture, it is best to focus on a single simple but meaningful three-dimensional test case: laminar flow past a cylinder in 3D. The physics of the problem are clean: the fluid properties do not vary, the flow is isothermal, the fluid is Newtonian, and the regime is laminar. That simplicity removes many distractions and lets us concentrate on the numerical challenges: an unstructured tetrahedral mesh on a curved geometry, and solver quality sufficient to capture the sensitive lift coefficient while keeping 3D factorization costs to a minimum. That makes it an ideal starting point: simple enough to understand clearly, but rich enough to start exposing the difficulties of building fast and reliable FEM solvers for fluid flow.
In the sections that follow, we will use this problem as a vehicle to build a modern matrix-free FEM solver.
A practical question now appears: how do we implement this solver in a way that balances code efficiency with ease of high-level implementation?
Writing a new solver from scratch requires low-level bookkeeping, custom sparse matrix infrastructure, and large amounts of parallel code just to test an idea, so development is greatly accelerated by using an FEM library. There are many efficient and proven options, including (but not limited to) deal.II, FEniCS, and MFEM. We needed a high-level solver library where weak forms, function spaces, and solver blocks can be expressed clearly, while still allowing access to advanced linear algebra backends (such as PETSc or Trilinos) and preconditioning tools. In this work, we chose the Firedrake ecosystem for exactly that reason.
A natural first step is to begin with a robust and well-established block preconditioning strategy rather than jumping immediately to fully matrix-free methods. For incompressible flow, one of the best known approaches is the pressure convection-diffusion, or PCD, preconditioner. For a detailed description of the PCD preconditioner, readers can refer to the book by Elman, Silvester and Wathen (https://academic.oup.com/book/27915).
The primary difficulty comes from the algebraic structure of the discretized incompressible Navier-Stokes equations: the system takes the familiar saddle point form, with a velocity block, a pressure-velocity coupling block, a continuity block and a zero block in the pressure-pressure position. This structure is numerically challenging. In particular, the hard part is not only the velocity solve itself, but the Schur complement that describes how pressure is coupled through the momentum equations.
If treated exactly, that Schur complement is expensive to invert. The key idea of PCD is to avoid forming or solving that exact object directly. Instead, it replaces the difficult Schur complement inverse by an approximation built from operators that are much more manageable. In practice, this approximation is constructed using pressure-space operators such as the pressure Laplacian, the pressure mass matrix, and a pressure convection-diffusion operator. The important point is that these are much easier to invert than the true Schur complement. Laplacian-type inverses can be treated with efficient Poisson solvers, while the mass matrix is also straightforward to invert.
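To fix notation, the discrete system and one common form of the PCD approximation can be written as below, where F is the velocity (momentum) block, B the divergence block, M_p the pressure mass matrix, A_p the pressure Laplacian, and F_p the pressure convection-diffusion operator. Signs and the ordering of the factors vary between references; this is a sketch of one standard variant.

```latex
% Saddle-point system and the PCD Schur complement approximation
\begin{aligned}
&\begin{pmatrix} F & B^{T} \\ B & 0 \end{pmatrix}
 \begin{pmatrix} u \\ p \end{pmatrix}
 =
 \begin{pmatrix} f \\ g \end{pmatrix},
 \qquad S = -\,B\,F^{-1}B^{T},\\[4pt]
&S^{-1} \;\approx\; -\,M_p^{-1}\,F_p\,A_p^{-1}.
\end{aligned}
```

The point of the approximation is visible immediately: every factor on the right-hand side lives in the pressure space and is cheap to apply, whereas the true Schur complement involves the inverse of the full momentum block.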
This is what makes PCD attractive. It preserves the logic of the true saddle point factorization, but replaces the hardest pressure coupling piece with something much cheaper and more practical. On the velocity side, for the problem at hand the inverse of the momentum block does not need to be exact either. It can be approximated using methods such as additive Schwarz solvers or smoothers, and/or with a few iterations of a Krylov solver, depending on the target accuracy and computational budget.
For the type of problems considered here, PCD is a known and robust strategy. It has a strong reputation in incompressible flow solvers because it respects the block structure of the Navier-Stokes equations and usually gives much better behavior than treating the full coupled system with a generic preconditioner. It is an excellent baseline and a very sensible starting point for solver comparison.
At the same time, PCD also reveals the limitation that motivates the rest of this blog. Although the outer solve may still be iterative, the preconditioning machinery relies on assembling block parts. In other words, we still build matrices for important parts of the solver, and that means we still pay part of the traditional memory and bandwidth cost of finite element CFD. This is exactly the point where matrix-free ideas become interesting. They aim to reduce that dependence on assembled sparse matrices and move more of the computation into operator actions applied on the fly.
To compare against the FEATFLOW data, we refer to the paper of Bayraktar, Mierka & Turek (https://doi.org/10.1504/IJCSE.2012.04824). The quantity of interest is the sensitive lift coefficient of the unsteady 3D benchmark, since capturing the drag coefficient accurately is comparatively easy. We begin by defining the geometry in Gmsh/Netgen as described in the paper. Note that the readings are quite mesh sensitive on coarser meshes. Instead of a carefully designed mesh, for the purpose of this benchmark we simply place 24 tets around the cylinder cross-section and use a grading parameter based on a distance field to slowly coarsen the mesh away from the cylinder. This was followed by one refinement within Firedrake. The resulting mesh measured 22k global tet cells (a Taylor-Hood P2-P1 discretization gave about 1.5 M dofs, counting the x, y, z velocity and pressure unknowns).
Discretization: we used the inf-sup stable Taylor-Hood pair of second-order velocity and first-order pressure (P2-P1) elements. The viscous term is defined using the symmetric gradient, and stabilization/upwinding is provided in the form of SUPG, with the stabilization parameter tau computed in a format similar to the popular Tezduyar-Shakib formulation. Since the Reynolds number is low, the use of SUPG is optional here. Adding weak forms is made incredibly simple by Firedrake's UFL. The equations are advanced with BDF1 for the first step and BDF2 for all subsequent steps.
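For reference, one common Tezduyar/Shakib-style definition of tau combines transient, advective and diffusive limits. This is a sketch of one variant only; the exact constants and the way the element size h is measured differ between papers and from our actual UFL expression.

```python
def supg_tau(dt, u_norm, h, nu):
    """One Tezduyar/Shakib-style stabilization parameter:
    tau = ((2/dt)^2 + (2|u|/h)^2 + (4*nu/h^2)^2)^(-1/2)

    dt     : time step
    u_norm : local velocity magnitude
    h      : local element size
    nu     : kinematic viscosity
    """
    return ((2.0 / dt) ** 2
            + (2.0 * u_norm / h) ** 2
            + (4.0 * nu / h ** 2) ** 2) ** -0.5
```

In the limit of zero velocity and viscosity, tau reduces to dt/2, which is why a fine time step already provides a natural stabilization scale.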
The drag and lift on the cylinder are measured using both the integral and the reaction-forces approach, of which the latter is the more accurate. Implementing them is again a breeze with Firedrake's syntax.
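Once a force is extracted, converting it to the benchmark coefficient is a one-liner. A minimal sketch, assuming the standard normalization C = 2F/(rho * U_mean^2 * D * H) used for the 3D cylinder benchmark, with cylinder diameter D and channel height H; the numbers in the usage line are illustrative only, not our benchmark values.

```python
def force_to_coefficient(force, rho, u_mean, diameter, height):
    """Normalize a drag or lift force into the benchmark coefficient
    C = 2*F / (rho * U_mean^2 * D * H)."""
    return 2.0 * force / (rho * u_mean ** 2 * diameter * height)

# Illustrative values only (rho = 1, U_mean = 1, D = 0.1, H = 0.41):
c = force_to_coefficient(1.0, 1.0, 1.0, 0.1, 0.41)
```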
Firedrake's tutorial provides a custom PCD preconditioner. However, it works only for characteristic boundary conditions. We modified this preconditioner by adding the Robin conditions to the respective blocks, following Elman, Silvester and Wathen, to make it compatible with the problem at hand. The solve is carried out using fieldsplit in PETSc, whose setup is extremely straightforward in Firedrake. The Schur complement inverse is approximated using PCD. This involves assembling and solving the pressure Laplacian and mass matrix using conjugate gradient preconditioned with algebraic multigrid. Algebraic multigrid is robust, efficient and accurate, but requires assembled matrices. Another option is to carry out these two solves matrix-free with a geometric multigrid; however, we found the overall efficiency to be not much different from the GAMG setting. The velocity block can be inverted efficiently using additive Schwarz with an LU or ILU factorization on each subdomain. However, since the problem uses a fairly fine time step, we found an inexact inverse using 10 iterations of Jacobi-preconditioned GMRES worked just as well.
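For illustration, this setup maps onto a Firedrake/PETSc solver options dictionary along the following lines. The class name `__main__.PCDPC` and the `pcd_*` option prefixes are assumptions modelled on the Firedrake tutorial's custom preconditioner, not copied verbatim from our code.

```python
# Sketch of a fieldsplit/PCD configuration (PETSc option names; values
# illustrative). "__main__.PCDPC" stands in for the modified tutorial class.
solver_parameters = {
    "ksp_type": "fgmres",                      # flexible outer Krylov solver
    "pc_type": "fieldsplit",
    "pc_fieldsplit_type": "schur",             # Schur-complement factorization
    "pc_fieldsplit_schur_fact_type": "lower",
    # Velocity block: inexact inverse, 10 Jacobi-preconditioned GMRES sweeps.
    "fieldsplit_0_ksp_type": "gmres",
    "fieldsplit_0_ksp_max_it": 10,
    "fieldsplit_0_pc_type": "jacobi",
    # Pressure block: custom PCD preconditioner (hypothetical class name).
    "fieldsplit_1_ksp_type": "preonly",
    "fieldsplit_1_pc_type": "python",
    "fieldsplit_1_pc_python_type": "__main__.PCDPC",
    # Inner PCD solves: CG + AMG for pressure Laplacian and mass matrix.
    "fieldsplit_1_pcd_Kp_ksp_type": "cg",
    "fieldsplit_1_pcd_Kp_pc_type": "gamg",
    "fieldsplit_1_pcd_Mp_ksp_type": "cg",
    "fieldsplit_1_pcd_Mp_pc_type": "gamg",
}
```

Swapping `fieldsplit_0` to `"pc_type": "asm"` with `"sub_pc_type": "ilu"` recovers the additive Schwarz variant mentioned above.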
We used a minimalistic setup to test the problem: a 16-core Ryzen 9 9950X3D with 64 GB RAM. With the 1.5 M dof mesh described above and a constant time step of 2.5e-2 s, the 8 s benchmark takes 320 time steps; each time step converges within 2-3 Picard iterations, with about 20 inner solves per Picard iterate. The simulation used ~10 GB of RAM and took ~24 hours to complete. With a larger time step such as 5e-2 s, the total number of KSP iterations increases unless the velocity block is solved to a tighter tolerance. At 5e-2 s the preconditioner is still robust and converges, with little change in either the result or the total wall time (~24 hours). Above 5e-2 s, however, while the preconditioner remains robust, the lift coefficient suffers dramatically: the values not only change but no longer reproduce the expected shape of the profile. We therefore avoided time steps above 5e-2 s.
The comparison with the benchmark reveals that the drag coefficient is captured very accurately, with less than 1% error (not shown). The more sensitive lift coefficient is captured with an error of ~5%. The ~24-hour wall time is indicative of the speed of this approach.
Matrix-free solvers represent an ambitious direction in solver development. They are an area of very active current research, with a growing body of literature showing that finite element flow solvers need not remain tied to the traditional cost of assembled sparse matrices. In this work, we follow the approach presented by Saavedra, Munch and Blais (https://doi.org/10.1016/j.jcp.2025.114186).
The key idea is quite different from the PCD strategy discussed earlier. Instead of building a block preconditioner around approximate inverses of the momentum and Schur complement operators, this method attempts to solve the full monolithic Navier-Stokes system directly with multigrid. In other words, the entire coupled velocity-pressure system is treated as a single operator across multiple grid levels, avoiding field-split methods and block preconditioning altogether. This is attractive from a matrix-free point of view because it removes the need to assemble and manage separate preconditioning blocks. We must emphasize, however, that this construction is heavily dependent on the use of geometric multigrid. The much more flexible algebraic multigrid relies on RAM and storage far more than geometric multigrid does, so this approach inherits all the limitations of a true geometric multigrid method.
At first sight, the classical incompressible saddle-point system contains a zero block in the pressure-pressure position, and such a structure is not naturally suited to standard multigrid smoothing. A smoother such as Jacobi relies on having a meaningful diagonal or near-diagonal action, while the null pressure block gives no direct elliptic control over pressure. This is exactly where pressure stabilization becomes important.
The stabilization used here is PSPG, or pressure-stabilizing Petrov-Galerkin. In this approach, the gradient of the pressure test function is dotted with the strong form of the momentum residual, and the resulting term is added into the continuity equation. Since the exact solution drives the residual to zero, this added term is consistent: it modifies the discrete problem without changing the continuous one. But from the algebraic point of view, it has a crucial effect. It introduces a pressure Poisson-like contribution into what would otherwise be the null pressure block. That change makes the monolithic operator much more amenable to standard multigrid ideas.
The momentum equation is stabilized as well; for this, SUPG (streamline-upwind Petrov-Galerkin) stabilization is used. Here the advection operator is added to the test space through the strong momentum residual. The practical consequence is the appearance of streamline diffusion: extra diffusion introduced primarily along the flow direction rather than isotropically in all directions. This is helpful not only for the stability of advection-dominated transport, but also for preconditioning. In effect, the difficult convective part of the operator becomes better behaved for iterative solution.
An optional addition is LSIC, a least-squares incompressibility constraint term. Its role is to reinforce the divergence-free condition and to improve the algebraic character of the coupled system. Taken together, PSPG, SUPG, and LSIC reshape the structure of the monolithic operator in a way that is much more favorable for iterative solvers.
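Writing v and q for the velocity and pressure test functions and R_M(u, p) for the strong momentum residual, the three added terms described above can be sketched as follows. This is a schematic form only: signs, scalings, and the exact residual definition vary between references and from our UFL implementation.

```latex
% Stabilization terms, summed over elements K, with local parameters tau;
% R_M(u,p) = rho (u_t + u \cdot \nabla u) - \nabla \cdot (2\mu\,\varepsilon(u))
%            + \nabla p - f  is the strong momentum residual.
\begin{aligned}
\text{SUPG:}\quad & \sum_K \int_K \tau_{\mathrm{SUPG}}\,
  (\mathbf{u}\cdot\nabla \mathbf{v}) \cdot \mathbf{R}_M(\mathbf{u},p)\,\mathrm{d}x,\\
\text{PSPG:}\quad & \sum_K \int_K \tau_{\mathrm{PSPG}}\,
  \nabla q \cdot \mathbf{R}_M(\mathbf{u},p)\,\mathrm{d}x,\\
\text{LSIC:}\quad & \sum_K \int_K \tau_{\mathrm{LSIC}}\,
  (\nabla\cdot \mathbf{v})\,(\nabla\cdot \mathbf{u})\,\mathrm{d}x.
\end{aligned}
```

The PSPG term is the one that populates the pressure-pressure block: since R_M contains the pressure gradient, testing it against the gradient of q produces the Poisson-like contribution discussed above.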
The resulting matrix is not truly symmetric, of course, because the Navier-Stokes equations themselves are not. But the operator becomes more balanced and more multigrid-friendly. Even sign conventions matter here. For example, we can choose a negative sign in the continuity equation term in order to make the overall block structure behave in a more symmetric way from the solver point of view.
For such a stabilized monolithic operator, it becomes possible, in many cases, to use a relatively simple preconditioning approach that can be applied cheaply in a matrix-free manner, e.g. a Chebyshev iteration preconditioned with Jacobi applied over multiple grid levels. In the present problem, the time step is sufficiently small that the transient and diffusive contributions remain strong enough for this strategy to be effective. Under those conditions, a full multigrid treatment of the coupled velocity-pressure system becomes a realistic and attractive alternative to assembled block preconditioners. For more robust implementations, vertex-star or cell-wise ASM smoothers can be considered.
This is the central appeal of the matrix-free route considered here. Rather than approximating separate pieces of the Navier-Stokes operator and stitching them together through a block factorization, we modify the monolithic system so that it can be attacked directly by multigrid. The reward is a solver architecture that is conceptually cleaner, far lighter on assembled matrix infrastructure, and potentially much more attractive in terms of memory use and parallel performance.
Constructing the geometric multigrid requires a sequence of hierarchically refined grids. Firedrake provides MeshHierarchy, a great tool for preparing successively refined grids with the transfer operators built in. Building a hierarchy is as easy as shown in the code below.
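A minimal sketch of building a hierarchy (the actual cylinder case loads the Gmsh/Netgen mesh instead of the built-in unit cube used here as a stand-in; the `try`/`except` only guards against Firedrake being absent).

```python
# Sketch: two-level mesh hierarchy in Firedrake (assumes Firedrake is
# installed; UnitCubeMesh is a stand-in for the real cylinder base mesh).
try:
    from firedrake import UnitCubeMesh, MeshHierarchy

    base = UnitCubeMesh(2, 2, 2)        # coarse base mesh
    hierarchy = MeshHierarchy(base, 1)  # one refinement -> coarse + fine level
    mesh = hierarchy[-1]                # finest mesh, used for the solve
    ok = mesh is not None
except ImportError:
    ok = None                           # Firedrake not available; illustrative only
```

Passing a larger refinement count to `MeshHierarchy` would add further levels, but as discussed later, more than 2-3 levels is hard to sustain on complex geometries.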
After building the hierarchy, we build the new weak forms, including the SUPG, PSPG and LSIC terms, as before using Firedrake's handy UFL.
Building GMG is easy in Firedrake once the mesh hierarchy exists, and multiple options are available. On the MG levels (all levels except the coarse grid), one can apply a Chebyshev KSP preconditioned with Jacobi; typically this requires 20+ iterations of the eigenvalue estimator. A more robust smoothing method is star-based ASM, which smooths using cell-wise or vertex-based local approximate inverses, carrying out a cheap full LU factorization at the local vertex/cell level. In the problem at hand, 3-5 applications of Jacobi-preconditioned GMRES are quite enough. The coarse level can be solved with a full LU to provide a good approximation; a cheaper (but many times slower) alternative is additive Schwarz with a full LU on every subdomain. Further memory can be saved by using ILU or ILU(0) instead.
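As a sketch, the GMG setup just described maps onto a PETSc options dictionary along these lines (the option names follow standard PETSc conventions; the specific iteration counts and solver choices are illustrative, not copied from our code):

```python
# Sketch of a geometric multigrid configuration for the monolithic system.
gmg_parameters = {
    "ksp_type": "fgmres",            # outer flexible Krylov method
    "pc_type": "mg",                 # geometric multigrid over the hierarchy
    # Smoother on refined levels: a few Jacobi-preconditioned GMRES sweeps.
    "mg_levels_ksp_type": "gmres",
    "mg_levels_ksp_max_it": 5,
    "mg_levels_pc_type": "jacobi",
    # Coarse grid: full LU (robust); ASM + (I)LU trades speed for memory.
    "mg_coarse_ksp_type": "preonly",
    "mg_coarse_pc_type": "lu",
}
```

Replacing the levels block with `"mg_levels_ksp_type": "chebyshev"` gives the Chebyshev/Jacobi smoother variant, and `"mg_coarse_pc_type": "asm"` with `"mg_coarse_sub_pc_type": "ilu"` gives the memory-saving coarse option.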
For the solution we began with the same mesh as in the PCD case. This time, however, instead of refining once, we asked MeshHierarchy to create one level, making this a two-level method with one fine and one coarse grid. We end up with a similar number of cells, but since we use a P2-P2 discretization here, the dof count is 2 M instead of 1.5 M. Note that P2-P2 is not inf-sup stable and is only possible thanks to the PSPG stabilization.
The result of the matrix-free solution is not far from what we obtained with the PCD approach. In fact, due to the P2-P2 discretization, it is slightly more accurate.
At each time step the solution converges in 2 Picard iterations with a minimal number of KSP iterations, showing that the preconditioner is very well behaved for this problem.
Against the PCD approach's roughly 24 hours, the matrix-free simulation took less than 1.5 hours when a full LU was used to solve the coarse level. RAM utilization was noticeably higher, around 20-25 GB, due to the much finer coarse level. It is remarkable that despite the higher RAM usage and higher dof count, the solve completed an order of magnitude faster. The high coarse-level RAM usage can be avoided by creating a well-designed coarse level that can be refined successively without producing stretched or bad elements. This is one of the limitations of a true geometric h-multigrid with multiple levels: as the geometry gets complex, it becomes difficult to prepare more than 2 or 3 levels without degrading the preconditioning behavior.
However, for the 3D flow past a cylinder, since the problem is quite "smooth", we can easily refine polynomially instead. Firedrake allows this through its p-multigrid setup. Additionally, for Netgen-built meshes Firedrake allows the high-order tets to hug the cylinder, which let us model the cylinder with a very coarse grid.
Firedrake's tutorials provide a p-multigrid structure that we could easily emulate.
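Emulating that tutorial, a p-multigrid configuration can be sketched as below. `firedrake.PMGPC` is Firedrake's p-coarsening preconditioner class; the `pmg_` option prefix mirrors the standard `mg_` options, and the specific smoother/coarse choices here are illustrative assumptions rather than our exact settings.

```python
# Sketch of a p-multigrid configuration via Firedrake's PMGPC.
pmg_parameters = {
    "ksp_type": "fgmres",
    "pc_type": "python",
    "pc_python_type": "firedrake.PMGPC",    # coarsens in polynomial degree
    "pmg_mg_levels_ksp_type": "chebyshev",  # smoother on higher-degree levels
    "pmg_mg_levels_pc_type": "jacobi",
    "pmg_mg_coarse_ksp_type": "preonly",    # lowest-degree level solved directly
    "pmg_mg_coarse_pc_type": "lu",
}
```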
A coarse mesh with a P4-P4 discretization, containing fewer than 500 cells and fewer than 500,000 dofs, was used. The problem took about 1600 s to complete and gave even better accuracy.
The use case for high-order simulations is limited: they are not as robust as second-order discretizations. Indeed, commercial tools such as COMSOL commonly use P2-P2 or P2-P1 discretizations. With this in mind, the large speedup offered by the matrix-free solver on the P2-P2 discretization looks especially promising. In fact, on the same 16-core, 64 GB minimalist workstation, we could run 35 M dof P2-P2 simulations without exhausting the RAM when additive Schwarz + ILU(0) were used to perform the coarse-level inverses.
The broader message is encouraging. Finite element solvers for fluid flow no longer need to be seen only as elegant but impractical research tools. With modern high-level frameworks and the growing promise of matrix-free multigrid methods, the strengths of FEM can increasingly be leveraged to tackle flow problems with both flexibility and efficiency. With the 3D cylinder case explored here, we began using our Firedrake FEM code to study low-Re mixing through complex microchannel geometries. As a CFD analyst, I will probably never stop using OpenFOAM, but whenever I need to resolve complex curvature in a more viscous flow, I now have an option for high-order accuracy! While this marks only a starting point, it already shows an important trend: solver design is as central to CFD performance as discretization itself.
As these methods continue to mature, they may help bring FEM-based fluid solvers closer to the speed, robustness, and practicality needed for a wider range of real applications. An example matrix-free solver code in Firedrake can be found here: https://github.com/JD63021/firedrake_matfree_3dflowpastcylinder