
Efficient GPU Implementation for Finite Element Methods
Explore the development of an efficient GPU-based assembly routine for the EllipticFEM.jl package, comparing it against CPU algorithms across a range of mesh densities. Investigate where the GPU choke points are and how they can be improved in the future. Additionally, implement a GPU-based linear solver routine and speed test it against CPU algorithms.
Presentation Transcript
GPU Implementations for Finite Element Methods
Brian S. Cohen
12 December 2016
Last Time: Stiffness Matrix Assembly
[Plot: stiffness matrix assembly CPU time [s] versus number of DOFs (10^2 to 10^7), comparing Julia v0.4.7 and MATLAB R2016b.]
Goals
1. Implement an efficient GPU-based assembly routine to interface with the EllipticFEM.jl package
   - Speed test all implementations and compare against the CPU algorithm using varied mesh densities
   - Investigate where the GPU implementation choke points are and how these can be improved in the future
2. Implement a GPU-based linear solver routine
   - Speed test the solver and compare against the CPU algorithm
Finite Element Mesh
A finite element mesh is a set of nodes and elements that divides a geometric domain on which our PDE can be solved. Each element e connects three nodes p_i, p_j, and p_k. Other relevant information about the mesh may also be needed:
- Element centroids
- Element edge lengths
- Element quality
- Subdomain tags
EllipticFEM.jl stores this information in the object meshData. All meshes are generated using linear 2D triangle elements. Node coordinates are stored as a Float64 2D array, and element connectivity (the node indices of each triangle) is stored as an Int64 2D array.
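As a concrete illustration of that storage layout, here is a hypothetical two-triangle mesh of the unit square (the variable names and array orientation are illustrative, not the actual meshData fields):

    # Node coordinates: one column per node (x over y)
    nodes = Float64[0.0  1.0  1.0  0.0;
                    0.0  0.0  1.0  1.0]

    # Element connectivity: one column per triangle, rows are node indices p_i, p_j, p_k
    elems = Int64[1  1;
                  2  3;
                  3  4]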
Finite Element Matrix Assembly
Consider the simple linear system K u = f, where K is the N_dof x N_dof stiffness matrix. K is an assembly of all element contributions,

    K = sum over elements e of Ke,

where each 3 x 3 element matrix Ke is scattered into the global matrix at the indices of the element's nodes. In practice K is built in sparse triplet form with K = sparse(I, J, V). The element contributions are derived from the hat functions used to approximate the solution on each element,

    Ke[i,j] = ∫_e (∇φ_i)ᵀ A (∇φ_j) dΩ,

with φ_i, φ_j the element hat functions and A the coefficient of the elliptic operator.
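A minimal CPU sketch of this triplet assembly, written for current Julia and assuming the hypothetical nodes/elems layout shown above and a Laplace-type operator (A = identity) on linear triangles; the actual EllipticFEM.jl routines differ in detail:

    using SparseArrays

    # Assemble K = sum_e Ke via the sparse(I, J, V) triplet constructor.
    # For linear triangles and the Laplace operator, Ke[i,j] = (b_i*b_j + c_i*c_j) / (4|e|),
    # where b and c are the hat-function gradient terms and |e| is the element area.
    function assemble(nodes::Matrix{Float64}, elems::Matrix{Int64})
        nE = size(elems, 2)
        I = Vector{Int}(undef, 9nE); J = similar(I); V = Vector{Float64}(undef, 9nE)
        idx = 0
        for e in 1:nE
            n = elems[:, e]                                # global node indices of element e
            x = nodes[1, n]; y = nodes[2, n]
            b = [y[2]-y[3], y[3]-y[1], y[1]-y[2]]
            c = [x[3]-x[2], x[1]-x[3], x[2]-x[1]]
            area4 = 2 * abs(b[1]*c[2] - b[2]*c[1])         # 4 * element area
            for j in 1:3, i in 1:3
                idx += 1
                I[idx] = n[i]; J[idx] = n[j]
                V[idx] = (b[i]*b[j] + c[i]*c[j]) / area4   # local Ke entry
            end
        end
        return sparse(I, J, V)                             # duplicate (I, J) pairs are summed
    end

    K = assemble(nodes, elems)   # 4 x 4 sparse stiffness matrix for the two-triangle mesh above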
GPU Implementation A
Pipeline: Pre-Processing -> Assemble K Matrix -> Solve
- Pre-Processing (CPU): read equation data, generate geometric data, generate mesh data
- Assemble K: generate the (I, J) index vectors on the CPU; generate the Ke_Values array on the GPU (double for-loop implementation); call the sparse() constructor on the CPU
- Solve (CPU): u = K \ f
GPU Implementation B
Pipeline: Pre-Processing -> Assemble K Matrix -> Solve
- Pre-Processing (CPU): read equation data, generate geometric data, generate mesh data
- Assemble K: transfer only the node and element arrays to the GPU; generate both the (I, J) vectors and the Ke_Values array on the GPU; call the sparse() constructor on the CPU
- Solve (CPU): u = K \ f
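A sketch of what vectorized value generation in the spirit of Implementation B could look like (illustrative, not the package's code). Everything after the coordinate gather is purely elementwise arithmetic over length-nE vectors, which is the form that maps naturally onto GPU array types such as ArrayFire.jl's AFArray; the gather itself and the final sparse() call stay on the CPU:

    # Generate I, J, V for all elements at once, with no per-element for-loop.
    function assemble_vectorized(nodes, elems)
        # Gather the three vertex coordinates of every element (vectors of length nE)
        x1 = nodes[1, elems[1, :]]; y1 = nodes[2, elems[1, :]]
        x2 = nodes[1, elems[2, :]]; y2 = nodes[2, elems[2, :]]
        x3 = nodes[1, elems[3, :]]; y3 = nodes[2, elems[3, :]]

        # Hat-function gradient terms and 4*area, for all elements simultaneously
        b1 = y2 .- y3; b2 = y3 .- y1; b3 = y1 .- y2
        c1 = x3 .- x2; c2 = x1 .- x3; c3 = x2 .- x1
        area4 = 2 .* abs.(b1 .* c2 .- b2 .* c1)

        b = (b1, b2, b3); c = (c1, c2, c3)
        # One vector of values per local (i, j) entry, concatenated into 9*nE triplets
        V = vcat(( (b[i] .* b[j] .+ c[i] .* c[j]) ./ area4 for i in 1:3, j in 1:3 )...)
        I = vcat(( elems[i, :] for i in 1:3, j in 1:3 )...)
        J = vcat(( elems[j, :] for i in 1:3, j in 1:3 )...)
        return I, J, V          # sparse(I, J, V) is still constructed on the CPU
    end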
CPU vs. GPU Implementations
[Plot: CPU time for I, J, V assembly [s] versus number of DOFs (10^2 to 10^7), comparing the CPU implementation against GPU Implementations A and B. GPU: GeForce GTX 765M, 2048 MB.]
Runtime Diagnostics
[Plots: runtime [s] versus number of DOFs (10^2 to 10^6) for Implementations A and B, broken down into CPU -> GPU transfer, I, J, V array assembly, GPU -> CPU transfer, and sparse() assembly.]
- Overhead to transfer mesh data from the CPU to the GPU is low
- Overhead to transfer data from the GPU back to the CPU is high
- It would be nice to be able to perform the sparse matrix construction on the GPU
- It would be even nicer to solve the problem on the GPU
Solving the Model
Now we want to solve the linear system K u = f.
- ArrayFire.jl does not currently support sparse matrices, and dense matrix operations appear to be comparable in speed or slower on GPUs than on CPUs
- CUSPARSE.jl wraps NVIDIA cuSPARSE library functions: a high-performance sparse linear algebra library, but it does not wrap any solver routines
- CUSPARSE.jl is built on the CUDArt.jl package, which wraps the CUDA runtime API
- Both packages require the CUDA Toolkit (v8.0)
GPU Solver Implementation
Preconditioned Conjugate Gradient (PCG) method:
- K is a sparse symmetric positive definite matrix
- Preconditioning improves convergence if K is not well conditioned
- Uses the incomplete Cholesky factorization of K as the preconditioner M
- Rather than solve the original system K u = f, we solve the preconditioned system M^-1 K u = M^-1 f

    r = f - K*u0;   z = M^-1 * r;   p = z
    for k = 1, 2, ... until convergence do
        q     = K*p
        alpha = (r'*z) / (p'*q)
        u     = u + alpha*p
        r_new = r - alpha*q
        z_new = M^-1 * r_new
        beta  = (r_new'*z_new) / (r'*z)
        p     = z_new + beta*p
        r, z  = r_new, z_new
    end
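A minimal, CPU-only sketch of this PCG iteration in current Julia. For self-containedness it uses a simple Jacobi (diagonal) preconditioner instead of the incomplete Cholesky factorization used in the talk; swapping preconditioners only changes the lines where z is computed. This is not the author's GPU solver (there, the K*p product would presumably go through CUSPARSE.jl), just the algorithm itself:

    using LinearAlgebra, SparseArrays

    function pcg(K::SparseMatrixCSC, f::Vector; tol = 1e-8, maxiter = 1000)
        u = zeros(length(f))
        r = f - K * u                       # initial residual (u0 = 0)
        Minv = 1.0 ./ Vector(diag(K))       # Jacobi preconditioner M = diag(K)
        z = Minv .* r
        p = copy(z)
        rz = dot(r, z)
        for k in 1:maxiter
            q = K * p                       # sparse matrix-vector product
            alpha = rz / dot(p, q)
            u .+= alpha .* p
            r .-= alpha .* q
            norm(r) <= tol * norm(f) && break
            z = Minv .* r                   # apply the preconditioner
            rz_new = dot(r, z)
            beta = rz_new / rz
            p = z .+ beta .* p
            rz = rz_new
        end
        return u
    end

    # Small sanity check on a 3 x 3 SPD system
    K = sparse([2.0 -1.0 0.0; -1.0 2.0 -1.0; 0.0 -1.0 2.0])
    f = [1.0, 0.0, 1.0]
    u = pcg(K, f)          # agrees with K \ f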
Solver Results
[Plot: CPU time to solve K u = f [s] versus number of DOFs (10^2 to 10^7), comparing the CPU and GPU solver implementations.]
Conclusion
- GPU computing in Julia shows promise for speeding up FEM matrix assembly and solve routines
  - Potentially greater gains to be made with higher-order 2D/3D elements
- Minimize the data transfer to the GPU needed to assemble FEM matrices
  - Keeping code vectorized helps
  - Remove any temporary data copies on the GPU
- ArrayFire.jl should (and hopefully soon will) support sparse matrix assembly and arithmetic
  - Open issue on GitHub since late September 2016
- CUSPARSE.jl should (and hopefully soon will) wrap additional functions
  - A COO matrix constructor and iterative solvers would be especially useful
  - Large impact on optimization problems where matrix assembly routines and solvers are called many times