In early June, we announced a significant achievement with LFortran: the Fastest Open-Source Compiler in Compile-Time Evaluation of an Array Benchmark. Continuing our commitment to performance enhancement, we are pleased to announce that LFortran now includes support for OpenMP pragmas and parallelizing do concurrent
. Particularly for parallel do
OpenMP construct, our implementation is fully operational and achieves performance comparable to GFortran.
Note that LFortran is still a alpha software, meaning that users must continue expecting that LFortran will fail compiling or running their codes. Please report all bugs that you find.
Compilation Benchmark
Matrix Multiplication example
Let us now take a look at matrix multiplication using OpenMP
subroutine matrix_multiplication(l, m, n)
use omp_lib
integer :: l, m, n, i, j, k
integer :: seed
double precision :: a(l, n), b(l, m), c(m, n)
double precision :: start_time, end_time
seed = 123456789
b = 121.124D0
c = 29124.012D0
start_time = omp_get_wtime()
!$omp parallel do shared(a, b, c, l, m, n)
do j = 1, n
do i = 1, l
a(i,j) = 0.0D+00
do k = 1, m
a(i,j) = a(i,j) + b(i,k) * c(k,j)
end do
end do
end do
!$omp end parallel do
end_time = omp_get_wtime()
print *, "Time: ", end_time - start_time
print *, "sum(a): ", sum(a)
end subroutine
program openmp_30
call matrix_multiplication(500, 500, 500)
end program
Let’s benchmark different matmul sizes for 8 threads:
GFortran (fast) : -O3 -march=native -ffast-math -funroll-loops
L=M=N | GFortran | GFortran (fast) | LFortran (–fast) | GFortran (fast) / LFortran (–fast) |
---|---|---|---|---|
500 | 0.078481 | 0.021822 | 0.024509 | 0.890369 |
1000 | 0.706202 | 0.191768 | 0.188653 | 1.016512 |
1500 | 2.418719 | 0.691059 | 0.695324 | 0.993866 |
2000 | 5.889707 | 1.845945 | 1.885804 | 0.978864 |
3000 | 24.358390 | 9.179583 | 8.565501 | 1.071692 |
4000 | 84.909953 | 37.292557 | 40.427580 | 0.922453 |
Now let’s benchmark strong scaling with a given matmul shape L=M=N=4000:
Num Threads | GFortran (fast) | LFortran (–fast) | GFortran (fast) / LFortran (–fast) |
---|---|---|---|
1 | 144.339067 | 147.331973 | 0.979686 |
2 | 76.052912 | 77.598491 | 0.980082 |
4 | 45.320663 | 47.642641 | 0.951263 |
8 | 33.889175 | 39.491259 | 0.858144 |
16 | 35.262666 | 40.834357 | 0.863554 |
32 | 35.644130 | 40.365258 | 0.883040 |
From the benchmarks and plot we can see that LFortran is competitive with GFortran when it comes to runtime performance. Both compilers run in parallel with the performance as expected: almost linear speedup up to about 4 threads, and plateauing beyond that due to the IO bottleneck.
Works with do concurrent
Cherry on top, we get the same runtime performance if we use DoConcurrentLoop
instead of OpenMP pragmas. We just have to provide sufficient semantic information and that is it.
Matrix Multiplication with do concurrent
! equivalent to above openmp example
subroutine matrix_multiplication(l, m, n)
use omp_lib
integer :: l, m, n, i, j, k
integer :: seed
double precision :: a(l, n), b(l, m), c(m, n)
double precision :: start_time, end_time
seed = 123456789
b = 121.124D0
c = 29124.012D0
do concurrent (j = 1:n) shared(a, b, c, l, m, n)
do concurrent (i = 1:l)
a(i,j) = 0.0D+00
do concurrent (k = 1:m)
a(i,j) = a(i,j) + b(i,k) * c(k,j)
end do
end do
end do
print *, "sum(a): ", sum(a)
if (abs(sum(a) - (440952103687207.56D0)) > 1D-12) error stop
end subroutine
program do_concurrent_12
call matrix_multiplication(500, 500, 500)
end program
This works both with and without openmp flags.
% lfortran do_concurrent_12.f90
sum(a): 4.40952103687207562e+14
% lfortran do_concurrent_12.f90 --openmp --openmp-lib-dir=$CONDA_PREFIX/lib
sum(a): 4.40952103687207562e+14
Mandelbrot example
program mandelbrot
use omp_lib
integer, parameter :: Nx = 600, Ny = 450, n_max = 255, dp=kind(0.d0)
real(dp), parameter :: xcenter = -0.5_dp, ycenter = 0.0_dp, &
width = 4, height = 3, dx_di = width/Nx, dy_dj = -height/Ny, &
x_offset = xcenter - (Nx+1)*dx_di/2, y_offset = ycenter - (Ny+1)*dy_dj/2
real(dp) :: x, y, x_0, y_0, x_sqr, y_sqr, wtime
integer :: i, j, n, image(Nx, Ny)
call omp_set_num_threads(4)
wtime = omp_get_wtime()
!$omp parallel shared(image) private(i, x, y, x_0, y_0, x_sqr, y_sqr, n)
!$omp do
do j = 1, Ny
y_0 = y_offset + dy_dj * j
do i = 1, Nx
x_0 = x_offset + dx_di * i
x = 0; y = 0; n = 0
do
x_sqr = x ** 2; y_sqr = y ** 2
if (x_sqr + y_sqr > 4 .or. n == n_max) then
image(i,j) = 255-n
exit
end if
y = y_0 + 2 * x * y
x = x_0 + x_sqr - y_sqr
n = n + 1
end do
end do
end do
!$omp end do
!$omp end parallel
wtime = omp_get_wtime() - wtime
print *, 'Time = ', wtime, "(s)"
print *, sum(image)
end program
% lfortran mandelbrot.f90 --openmp --openmp-lib-dir=$CONDA_PREFIX/lib
Time = 1.75323009490966797e-01 (s)
59157126
Mandelbrot example using do concurrent
program mandelbrot_do_concurrent
integer, parameter :: Nx = 600, Ny = 450, n_max = 255, dp=kind(0.d0)
real(dp), parameter :: xcenter = -0.5_dp, ycenter = 0.0_dp, &
width = 4, height = 3, dx_di = width/Nx, dy_dj = -height/Ny, &
x_offset = xcenter - (Nx+1)*dx_di/2, y_offset = ycenter - (Ny+1)*dy_dj/2
real(dp) :: x, y, x_0, y_0, x_sqr, y_sqr, wtime
integer :: i, j, n, image(Nx, Ny)
do concurrent (j = 1:Ny) shared(image) local(i, x, y, x_0, y_0, x_sqr, y_sqr, n)
y_0 = y_offset + dy_dj * j
do i = 1, Nx
x_0 = x_offset + dx_di * i
x = 0; y = 0; n = 0
do
x_sqr = x ** 2; y_sqr = y ** 2
if (x_sqr + y_sqr > 4 .or. n == n_max) then
image(i,j) = 255-n
exit
end if
y = y_0 + 2 * x * y
x = x_0 + x_sqr - y_sqr
n = n + 1
end do
end do
end do
print *, sum(image)
end program
As stated above, this works with and without --openmp
% lfortran mandelbrot_do_concurrent.f90
59157126
% lfortran mandelbrot_do_concurrent.f90 --openmp --openmp-lib-dir=$CONDA_PREFIX/lib
59157126
Let’s now do a comparative analysis:
For n_max = 1255
GFortran (fast) : -O3 -march=native -ffast-math -funroll-loops
num threads | GFortran (fast) | LFortran (–fast) | GFortran (fast) / LFortran (–fast) |
---|---|---|---|
1 | 0.124 | 0.125 | 0.990 |
2 | 0.065 | 0.066 | 0.990 |
4 | 0.065 | 0.066 | 0.985 |
8 | 0.045 | 0.046 | 0.991 |
16 | 0.028 | 0.028 | 0.994 |
32 | 0.022 | 0.022 | 1.004 |
64 | 0.022 | 0.022 | 1.010 |
One has to note that LFortran can get even faster if you first compile to LLVM, then use clang, and enable -ffast-math
, which we are not doing at the moment: lfortran --fast
should be roughly equivalent to Clang’s -O2
, without -ffast-math
. If you know how to enable equivalent LLVM optimizations in the LFortran’s driver, please let us know or send us a PR.
How to run benchmark yourselves?
You may install latest GFortran, LFortran or any compiler using conda / mamba and then test it locally.
The following commands will help you create a conda environment with latest versions of LFortran and GFortran installed.
conda create -n omp lfortran=0.37.0 gfortran llvm-openmp
conda activate omp
To test with LFortran one may use:
lfortran file_name.f90 --openmp --openmp-lib-dir=$CONDA_PREFIX/lib
and for GFortran
gfortran file_name.f90 -fopenmp && ./a.out
Development Overview
We added two flags that need to be enabled to fully run an example: --openmp
and --openmp-lib-dir=<path-to-omp-library>
.
Parser
To incorporate OpenMP pragmas in our Abstract Syntax Tree (AST), we expanded the existing pragma_type
node by introducing OMPPragma
. Building upon our previous proof of concept with LFortranPragma
, we extended its capabilities to include OMPPragma
. Subsequently, we modified the parser to recognize and interpret this new pragma type.
For the simple example:
!$omp parallel shared(b, n) private(i) reduction(+:res)
!$omp do
...
!$omp end do
!$omp end parallel
We create an AST node as shown below:
(Pragma
0
OMPPragma
.false.
"parallel"
[(String
"shared(b, n)"
)
(String
"private(i)"
)
(String
"reduction(+:res)"
)]
()
)
(Pragma
0
OMPPragma
.false.
"do"
[]
()
)
...
(Pragma
0
OMPPragma
.true.
"do"
[]
()
)
(Pragma
0
OMPPragma
.true.
"parallel"
[]
(TriviaNode
[]
[(EndOfLine)
(EndOfLine)]
)
)
We have added support for several reduction operations.
reduce_op = ReduceAdd | ReduceSub | ReduceMul | ReduceMIN | ReduceMAX
AST -> ASR
We decided to extend the design of DoConcurrent
to support OpenMP features. Therefore, when the --openmp
flag is enabled, encountering an OMPPragma
will result in the creation of a DoConcurrent
loop instead of a DoLoop
. This DoConcurrent
loop will include sufficient semantic information to facilitate its lowering to the correct API calls in subsequent Abstract Semantic Representation (ASR) passes and the backend.
Thus the grammar of DoConcurrentLoop
is as follows
DoConcurrentLoop(do_loop_head head, expr* shared, expr* local, reduction_expr* reduction, stmt* body)
reduction_expr = (reduction_op op, expr arg)
reduction_op = ReduceAdd | ReduceSub | ReduceMul | ReduceMIN | ReduceMAX
Hence, for example:
!$omp parallel shared(b, n) reduction(+:res)
!$omp do
do i = 1, n
res = res + sum(b(i, :))
end do
!$omp end do
!$omp end parallel
The generated ASR looks like:
(DoConcurrentLoop
((Var 3 i)
(IntegerConstant 1 (Integer 4))
(Var 3 n)
())
[(Var 3 b)
(Var 3 n)]
[]
[(ReduceAdd
(Var 3 res))]
[(Assignment
(Var 3 res)
(RealBinOp
(Var 3 res)
Add
(IntrinsicArrayFunction
Sum
[(ArraySection
(Var 3 b)
[(()
(Var 3 i)
())
((ArrayBound
(Var 3 b)
(IntegerConstant 2 (Integer 4))
(Integer 4)
LBound
(IntegerConstant 1 (Integer 4))
)
(ArrayBound
(Var 3 b)
(IntegerConstant 2 (Integer 4))
(Integer 4)
UBound
()
)
(IntegerConstant 1 (Integer 4)))]
(Array
(Real 8)
[(()
())]
DescriptorArray
)
()
)]
0
(Real 8)
()
)
(Real 8)
()
)
()
)]
)
We support nested do loops, pragma inside a do loop at any level of nesting, etc.
ASR -> ASR pass
The real complexity occurs within this ASR-to-ASR pass, where we take the DoConcurrent loop and transform it with the appropriate calls to the corresponding OpenMP C APIs.
We implemented a comprehensive and specialized ASR pass for OpenMP, which efficiently lowers the enhanced DoConcurrentLoop to atomic operations and API calls within the OpenMP library. Our approach involved studying Clang and GFortran’s methodology, understanding their parallelization techniques using calls like GOMP_parallel
, and researching atomic operations, critical regions, barriers, and other relevant concepts. Subsequently, we developed logic to ensure accurate translation.
This pass is only run if --openmp
flag is enabled.
ASR -> LLVM
With the assistance of the ASR pass, we achieved significant code reduction, enabling our example to work without backend modifications. We observed favorable speedups across threads, indicating successful parallelization of the code.
Runtime library
To support use omp_lib
we added bind(C)
interfaces to runtime library of LFortran wrapped inside a module. This module gets created as it encounters use omp_lib
and is stored in ASR.
Test suite
Up to now, we have successfully implemented and tested approximately 40 OpenMP-related tests, indicating robust functionality across various examples. Despite LFortran being in its alpha stage and the OpenMP pass being a recent addition, we anticipate occasional issues, albeit minor, which can typically be resolved with small adjustments. In addition to these, we replaced OpenMP pragmas with the corresponding do concurrent loops and conducted thorough testing of these implementations.
We even took a real life openmp example from https://fortran-lang.discourse.group/t/openmp-code-implementation-in-gfortran-and-intel/8256 and got it compiled with only a minor fix in the OpenMP ASR pass.
Progress Tracker
We have everything tracked at OpenMP support in AST and ASR and one may filter issues / PRs with label openmp
or directly follow link.
Contributing
Help us get to beta faster by reporting bugs and submitting PRs to fix things. We will help you get started, no prior compiler experience needed.
Acknowledgements
We want to thank:
- Sovereign Tech Fund (STF)
- NumFOCUS
- QuantStack
- Google Summer of Code
- GSI Technology
- LANL
- Our GitHub, OpenCollective and NumFOCUS sponsors
- All our contributors (83 so far!)