LFortran supports OpenMP pragmas and Do Concurrent

In early June, we announced a significant achievement with LFortran: the Fastest Open-Source Compiler in Compile-Time Evaluation of an Array Benchmark. Continuing our commitment to performance enhancement, we are pleased to announce that LFortran now includes support for OpenMP pragmas and parallelizing do concurrent. Particularly for parallel do OpenMP construct, our implementation is fully operational and achieves performance comparable to GFortran.

Note that LFortran is still a alpha software, meaning that users must continue expecting that LFortran will fail compiling or running their codes. Please report all bugs that you find.

Compilation Benchmark

Matrix Multiplication example

Let us now take a look at matrix multiplication using OpenMP

subroutine matrix_multiplication(l, m, n)
use omp_lib
integer :: l, m, n, i, j, k
integer :: seed
double precision :: a(l, n), b(l, m), c(m, n)
double precision :: start_time, end_time

seed = 123456789

b = 121.124D0
c = 29124.012D0

start_time = omp_get_wtime()
!$omp parallel do shared(a, b, c, l, m, n)
do j = 1, n
    do i = 1, l
        a(i,j) = 0.0D+00
        do k = 1, m
            a(i,j) = a(i,j) + b(i,k) * c(k,j)
        end do
    end do
end do
!$omp end parallel do
end_time = omp_get_wtime()

print *, "Time: ", end_time - start_time
print *, "sum(a): ", sum(a)
end subroutine

program openmp_30
call matrix_multiplication(500, 500, 500)
end program

Let’s benchmark different matmul sizes for 8 threads:

GFortran (fast) : -O3 -march=native -ffast-math -funroll-loops

L=M=N GFortran GFortran (fast) LFortran (–fast) GFortran (fast) / LFortran (–fast)
500 0.078481 0.021822 0.024509 0.890369
1000 0.706202 0.191768 0.188653 1.016512
1500 2.418719 0.691059 0.695324 0.993866
2000 5.889707 1.845945 1.885804 0.978864
3000 24.358390 9.179583 8.565501 1.071692
4000 84.909953 37.292557 40.427580 0.922453

Figure_1

Now let’s benchmark strong scaling with a given matmul shape L=M=N=4000:

Num Threads GFortran (fast) LFortran (–fast) GFortran (fast) / LFortran (–fast)
1 144.339067 147.331973 0.979686
2 76.052912 77.598491 0.980082
4 45.320663 47.642641 0.951263
8 33.889175 39.491259 0.858144
16 35.262666 40.834357 0.863554
32 35.644130 40.365258 0.883040

Figure_1

From the benchmarks and plot we can see that LFortran is competitive with GFortran when it comes to runtime performance. Both compilers run in parallel with the performance as expected: almost linear speedup up to about 4 threads, and plateauing beyond that due to the IO bottleneck.

Works with do concurrent

Cherry on top, we get the same runtime performance if we use DoConcurrentLoop instead of OpenMP pragmas. We just have to provide sufficient semantic information and that is it.

Matrix Multiplication with do concurrent

! equivalent to above openmp example

subroutine matrix_multiplication(l, m, n)
use omp_lib
integer :: l, m, n, i, j, k
integer :: seed
double precision :: a(l, n), b(l, m), c(m, n)
double precision :: start_time, end_time

seed = 123456789

b = 121.124D0
c = 29124.012D0

do concurrent (j = 1:n) shared(a, b, c, l, m, n)
    do concurrent (i = 1:l)
        a(i,j) = 0.0D+00
        do concurrent (k = 1:m)
            a(i,j) = a(i,j) + b(i,k) * c(k,j)
        end do
    end do
end do

print *, "sum(a): ", sum(a)
if (abs(sum(a) - (440952103687207.56D0)) > 1D-12) error stop
end subroutine

program do_concurrent_12
call matrix_multiplication(500, 500, 500)
end program

This works both with and without openmp flags.

% lfortran do_concurrent_12.f90 
sum(a):  4.40952103687207562e+14
% lfortran do_concurrent_12.f90 --openmp --openmp-lib-dir=$CONDA_PREFIX/lib
sum(a):  4.40952103687207562e+14

Mandelbrot example

program mandelbrot
    use omp_lib
    integer, parameter :: Nx = 600, Ny = 450, n_max = 255, dp=kind(0.d0)
    real(dp), parameter :: xcenter = -0.5_dp, ycenter = 0.0_dp, &
        width = 4, height = 3, dx_di = width/Nx, dy_dj = -height/Ny, &
        x_offset = xcenter - (Nx+1)*dx_di/2, y_offset = ycenter - (Ny+1)*dy_dj/2
    real(dp) :: x, y, x_0, y_0, x_sqr, y_sqr, wtime
    integer :: i, j, n, image(Nx, Ny)
    call omp_set_num_threads(4)
    wtime = omp_get_wtime()
    !$omp parallel shared(image) private(i, x, y, x_0, y_0, x_sqr, y_sqr, n)
    !$omp do
    do j = 1, Ny
        y_0 = y_offset + dy_dj * j
        do i = 1, Nx
            x_0 = x_offset + dx_di * i
            x = 0; y = 0; n = 0
            do
                x_sqr = x ** 2; y_sqr = y ** 2
                if (x_sqr + y_sqr > 4 .or. n == n_max) then
                    image(i,j) = 255-n
                    exit
                end if
                y = y_0 + 2 * x * y
                x = x_0 + x_sqr - y_sqr
                n = n + 1
            end do
        end do
    end do
    !$omp end do
    !$omp end parallel
    wtime = omp_get_wtime() - wtime
    print *, 'Time = ', wtime, "(s)"
    print *, sum(image)
end program
% lfortran mandelbrot.f90 --openmp --openmp-lib-dir=$CONDA_PREFIX/lib
Time =  1.75323009490966797e-01 (s)
59157126

Mandelbrot example using do concurrent

program mandelbrot_do_concurrent
    integer, parameter :: Nx = 600, Ny = 450, n_max = 255, dp=kind(0.d0)
    real(dp), parameter :: xcenter = -0.5_dp, ycenter = 0.0_dp, &
        width = 4, height = 3, dx_di = width/Nx, dy_dj = -height/Ny, &
        x_offset = xcenter - (Nx+1)*dx_di/2, y_offset = ycenter - (Ny+1)*dy_dj/2
    real(dp) :: x, y, x_0, y_0, x_sqr, y_sqr, wtime
    integer :: i, j, n, image(Nx, Ny)
    do concurrent (j = 1:Ny) shared(image) local(i, x, y, x_0, y_0, x_sqr, y_sqr, n)
        y_0 = y_offset + dy_dj * j
        do i = 1, Nx
            x_0 = x_offset + dx_di * i
            x = 0; y = 0; n = 0
            do
                x_sqr = x ** 2; y_sqr = y ** 2
                if (x_sqr + y_sqr > 4 .or. n == n_max) then
                    image(i,j) = 255-n
                    exit
                end if
                y = y_0 + 2 * x * y
                x = x_0 + x_sqr - y_sqr
                n = n + 1
            end do
        end do
    end do
    print *, sum(image)
end program

As stated above, this works with and without --openmp

% lfortran mandelbrot_do_concurrent.f90 
59157126
% lfortran mandelbrot_do_concurrent.f90 --openmp --openmp-lib-dir=$CONDA_PREFIX/lib
59157126

Let’s now do a comparative analysis:

For n_max = 1255

GFortran (fast) : -O3 -march=native -ffast-math -funroll-loops

num threads GFortran (fast) LFortran (–fast) GFortran (fast) / LFortran (–fast)
1 0.124 0.125 0.990
2 0.065 0.066 0.990
4 0.065 0.066 0.985
8 0.045 0.046 0.991
16 0.028 0.028 0.994
32 0.022 0.022 1.004
64 0.022 0.022 1.010

One has to note that LFortran can get even faster if you first compile to LLVM, then use clang, and enable -ffast-math, which we are not doing at the moment: lfortran --fast should be roughly equivalent to Clang’s -O2, without -ffast-math. If you know how to enable equivalent LLVM optimizations in the LFortran’s driver, please let us know or send us a PR.

How to run benchmark yourselves?

You may install latest GFortran, LFortran or any compiler using conda / mamba and then test it locally.

The following commands will help you create a conda environment with latest versions of LFortran and GFortran installed.

conda create -n omp lfortran=0.37.0 gfortran llvm-openmp
conda activate omp

To test with LFortran one may use:

lfortran file_name.f90 --openmp --openmp-lib-dir=$CONDA_PREFIX/lib

and for GFortran

gfortran file_name.f90 -fopenmp && ./a.out

Development Overview

We added two flags that need to be enabled to fully run an example: --openmp and --openmp-lib-dir=<path-to-omp-library>.

Parser

To incorporate OpenMP pragmas in our Abstract Syntax Tree (AST), we expanded the existing pragma_type node by introducing OMPPragma. Building upon our previous proof of concept with LFortranPragma, we extended its capabilities to include OMPPragma. Subsequently, we modified the parser to recognize and interpret this new pragma type.

For the simple example:

!$omp parallel shared(b, n) private(i) reduction(+:res)
!$omp do
...
!$omp end do
!$omp end parallel

We create an AST node as shown below:

(Pragma
    0
    OMPPragma
    .false.
    "parallel"
    [(String
        "shared(b, n)"
    )
    (String
        "private(i)"
    )
    (String
        "reduction(+:res)"
    )]
    ()
)
(Pragma
    0
    OMPPragma
    .false.
    "do"
    []
    ()
)
...
(Pragma
    0
    OMPPragma
    .true.
    "do"
    []
    ()
)
(Pragma
    0
    OMPPragma
    .true.
    "parallel"
    []
    (TriviaNode
        []
        [(EndOfLine)
        (EndOfLine)]
    )
)

We have added support for several reduction operations.

reduce_op = ReduceAdd | ReduceSub | ReduceMul | ReduceMIN | ReduceMAX

AST -> ASR

We decided to extend the design of DoConcurrent to support OpenMP features. Therefore, when the --openmp flag is enabled, encountering an OMPPragma will result in the creation of a DoConcurrent loop instead of a DoLoop. This DoConcurrent loop will include sufficient semantic information to facilitate its lowering to the correct API calls in subsequent Abstract Semantic Representation (ASR) passes and the backend.

Thus the grammar of DoConcurrentLoop is as follows

DoConcurrentLoop(do_loop_head head, expr* shared, expr* local, reduction_expr* reduction, stmt* body)

reduction_expr = (reduction_op op, expr arg)

reduction_op = ReduceAdd | ReduceSub | ReduceMul | ReduceMIN | ReduceMAX

Hence, for example:

!$omp parallel shared(b, n) reduction(+:res)
!$omp do
do i = 1, n
    res = res + sum(b(i, :))
end do
!$omp end do
!$omp end parallel

The generated ASR looks like:

(DoConcurrentLoop
    ((Var 3 i)
    (IntegerConstant 1 (Integer 4))
    (Var 3 n)
    ())
    [(Var 3 b)
    (Var 3 n)]
    []
    [(ReduceAdd
    (Var 3 res))]
    [(Assignment
        (Var 3 res)
        (RealBinOp
            (Var 3 res)
            Add
            (IntrinsicArrayFunction
                Sum
                [(ArraySection
                    (Var 3 b)
                    [(()
                    (Var 3 i)
                    ())
                    ((ArrayBound
                        (Var 3 b)
                        (IntegerConstant 2 (Integer 4))
                        (Integer 4)
                        LBound
                        (IntegerConstant 1 (Integer 4))
                    )
                    (ArrayBound
                        (Var 3 b)
                        (IntegerConstant 2 (Integer 4))
                        (Integer 4)
                        UBound
                        ()
                    )
                    (IntegerConstant 1 (Integer 4)))]
                    (Array
                        (Real 8)
                        [(()
                        ())]
                        DescriptorArray
                    )
                    ()
                )]
                0
                (Real 8)
                ()
            )
            (Real 8)
            ()
        )
        ()
    )]
)

We support nested do loops, pragma inside a do loop at any level of nesting, etc.

ASR -> ASR pass

The real complexity occurs within this ASR-to-ASR pass, where we take the DoConcurrent loop and transform it with the appropriate calls to the corresponding OpenMP C APIs.

We implemented a comprehensive and specialized ASR pass for OpenMP, which efficiently lowers the enhanced DoConcurrentLoop to atomic operations and API calls within the OpenMP library. Our approach involved studying Clang and GFortran’s methodology, understanding their parallelization techniques using calls like GOMP_parallel, and researching atomic operations, critical regions, barriers, and other relevant concepts. Subsequently, we developed logic to ensure accurate translation.

This pass is only run if --openmp flag is enabled.

ASR -> LLVM

With the assistance of the ASR pass, we achieved significant code reduction, enabling our example to work without backend modifications. We observed favorable speedups across threads, indicating successful parallelization of the code.

Runtime library

To support use omp_lib we added bind(C) interfaces to runtime library of LFortran wrapped inside a module. This module gets created as it encounters use omp_lib and is stored in ASR.

Test suite

Up to now, we have successfully implemented and tested approximately 40 OpenMP-related tests, indicating robust functionality across various examples. Despite LFortran being in its alpha stage and the OpenMP pass being a recent addition, we anticipate occasional issues, albeit minor, which can typically be resolved with small adjustments. In addition to these, we replaced OpenMP pragmas with the corresponding do concurrent loops and conducted thorough testing of these implementations.

We even took a real life openmp example from https://fortran-lang.discourse.group/t/openmp-code-implementation-in-gfortran-and-intel/8256 and got it compiled with only a minor fix in the OpenMP ASR pass.

Progress Tracker

We have everything tracked at OpenMP support in AST and ASR and one may filter issues / PRs with label openmp or directly follow link.

Contributing

Help us get to beta faster by reporting bugs and submitting PRs to fix things. We will help you get started, no prior compiler experience needed.

Acknowledgements

We want to thank:

Discussions