**Abstract:** High-fidelity scientific simulations of physical phenomena typically require solving large linear systems of equations that result from the discretization of a partial differential equation (PDE) by some numerical method. This step often dominates the computational time of a simulation and therefore presents a bottleneck. Solving these linear systems efficiently requires massively parallel hardware with high computational throughput, as well as algorithms that respect the memory hierarchy of these architectures in order to achieve high memory bandwidth.

This talk offers two main contributions toward the development of such a linear solver algorithm suited to massively parallel architectures, with the Jacobi iteration as our starting point. First, we develop relaxation schemes, termed the Scheduled Relaxation Jacobi (SRJ) method, which greatly improve upon the convergence of the traditional Jacobi iteration on symmetric and nonsymmetric linear systems. A data-informed heuristic is developed for selecting schemes in a practical implementation. Second, we develop an algorithm for a memory-efficient GPU implementation of the Scheduled Relaxation Jacobi method, where efficiency is achieved through the use of on-chip GPU shared memory for both structured and unstructured linear systems. Together, these contributions provide the basis for a GPU shared memory Jacobi solver with relaxation for the efficient solution of general (i.e., potentially unstructured and nonsymmetric) linear systems arising from the discretization of PDEs.
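To make the starting point concrete, the sketch below shows the core update underlying a scheduled-relaxation Jacobi cycle: each sweep is a standard Jacobi update scaled by a relaxation factor from a prescribed schedule. The matrix, the two-factor schedule, and all names here are illustrative placeholders, not the optimized schemes or the GPU implementation described in the talk.

```python
import numpy as np

def srj_cycle(A, b, x, omegas):
    """One cycle of relaxed Jacobi sweeps:
    x <- x + omega_k * D^{-1} (b - A x) for each omega_k in the schedule."""
    d = np.diag(A)  # Jacobi uses only the diagonal of A
    for omega in omegas:
        x = x + omega * (b - A @ x) / d
    return x

# Small symmetric positive-definite example (1D Poisson-like system).
A = np.array([[ 2., -1.,  0.],
              [-1.,  2., -1.],
              [ 0., -1.,  2.]])
b = np.array([1., 0., 1.])
x = np.zeros(3)

# Placeholder schedule; actual SRJ schedules are chosen to damp a wide
# range of error modes and yield much faster convergence than plain Jacobi.
omegas = [1.0, 0.66]
for _ in range(200):
    x = srj_cycle(A, b, x, omegas)
# x now approximates the solution of A x = b, here [1, 1, 1]
```

The key algorithmic freedom is the schedule `omegas`: plain Jacobi corresponds to a schedule of all ones, while a well-chosen mix of over- and under-relaxation factors targets different error frequencies in turn.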