In our last blog post from May 2023, we announced that LFortran can compile legacy and modern Minpack. Today, we are happy to announce that LFortran can compile and run fastGPT.
This is the third code that LFortran can compile. The progress bar toward beta has advanced to 3/10.
LFortran is still alpha, meaning that users expect frequent bugs and breaking changes. Alpha users are enthusiastic partners in the effort to reach beta and they diligently report issues. In beta, users will expect LFortran to compile their codes, but they will still be partners in reporting remaining issues.
fastGPT Overview
On March 14, 2023, we introduced fastGPT, a fast GPT-2 inference engine written in Fortran, whose array-oriented operations are faster and easier to maintain than Python’s, and closely inspired by picoGPT, which is very small and readable. We demonstrated that fastGPT performs very well compared to picoGPT and PyTorch on an Apple silicon chip, and we highlighted the combination of speed and readability achieved through Fortran’s numerical array-oriented operations. For example, by using Fortran we are able to ensure that the dimensions of all the arrays are correct, because the compiler checks them during compilation. In that sense, the generated binary is always theoretically correct. In addition, since Fortran is a compiled programming language with a syntax similar to NumPy, performance gains are naturally expected.
See fastGPT and the blog post for more details.
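To make the point about array-oriented operations and dimension checking concrete, here is a minimal sketch in standard Fortran. It is not taken from fastGPT: the program name, the array names and the tiny sizes are assumptions chosen only for illustration. It applies a weight matrix and a bias to a batch of column vectors in a single whole-array expression, and the commented-out line at the end shows the kind of shape mismatch that a compiler can typically reject when the shapes are known at compile time.

```fortran
program shape_demo
    implicit none
    ! Hypothetical sizes chosen for illustration; not taken from fastGPT.
    integer, parameter :: sp = kind(1.0)
    integer, parameter :: n_embd = 4, n_seq = 3
    real(sp) :: w(n_embd, n_embd)   ! weight matrix
    real(sp) :: x(n_embd, n_seq)    ! one column per input token
    real(sp) :: b(n_embd)           ! bias vector
    real(sp) :: y(n_embd, n_seq)    ! result, same shape as x
    integer :: i

    w = 0.5_sp
    x = 1.0_sp
    b = 0.25_sp

    ! Whole-array expression: matrix product plus the bias copied to every column.
    y = matmul(w, x) + spread(b, dim=2, ncopies=n_seq)

    ! Each element is 4 * 0.5 * 1.0 + 0.25 = 2.25.
    if (any(abs(y - 2.25_sp) > 1.0e-6_sp)) error stop "unexpected result"

    do i = 1, n_seq
        print "(4f8.3)", y(:, i)
    end do

    ! The next line assigns a (4,4) array to a (4,3) array. Because both shapes
    ! are compile-time constants, a compiler can report the mismatch before
    ! any binary is produced. Uncomment it to see the diagnostic.
    ! y = w
end program shape_demo
```

The expression on the `y = ...` line reads much like the NumPy equivalent, but the declared shapes are part of the program text, so a mismatch of this kind does not have to wait until run time to surface.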
Today, LFortran can fully compile and run this array-oriented algorithm and get exactly the same results as GFortran. It requires one small workaround, because LFortran does not support namelists yet.
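For readers unfamiliar with namelists, the sketch below shows what namelist input looks like in standard Fortran and one possible way to read the same value without it. This is an illustration only and is not the workaround used in the fastGPT PR: the group name `input_params` is hypothetical, and the variable name is borrowed from the program output shown further down. Because the first half uses a namelist, it needs a compiler that supports namelists.

```fortran
program namelist_demo
    implicit none
    ! Hypothetical group name and value, for illustration only.
    integer :: n_tokens_to_generate
    integer :: u
    namelist / input_params / n_tokens_to_generate

    ! 1) Namelist input: values are matched by name inside a "&group ... /" block.
    open(newunit=u, status="scratch", action="readwrite")
    write(u, "(a)") "&input_params"
    write(u, "(a)") "  n_tokens_to_generate = 20"
    write(u, "(a)") "/"
    rewind(u)
    n_tokens_to_generate = 0
    read(u, nml=input_params)
    close(u)
    if (n_tokens_to_generate /= 20) error stop "namelist read failed"

    ! 2) One possible namelist-free alternative: a list-directed read,
    !    where values are matched by position instead of by name.
    open(newunit=u, status="scratch", action="readwrite")
    write(u, "(a)") "20"
    rewind(u)
    n_tokens_to_generate = 0
    read(u, *) n_tokens_to_generate
    close(u)
    if (n_tokens_to_generate /= 20) error stop "list-directed read failed"

    print "(a,i0)", "n_tokens_to_generate = ", n_tokens_to_generate
end program namelist_demo
```

The list-directed form gives up the self-describing `name = value` layout, which is the usual trade-off when avoiding namelists.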
Please build and run the fastGPT PR to see for yourself. The whole code compiles in both LFortran and GFortran, including the many array manipulations in gpt2.f90 and all the string manipulations in tokenizer.f90.
You can install the latest LFortran using conda-forge on Linux, macOS and Windows (conda install lfortran).
We now test both Debug and Release builds of fastGPT in our LFortran CI for every commit.
Here is the output of the gpt2 binary compiled using LFortran (v0.20.3):
$ OMP_NUM_THREADS=1 ./gpt2
Loading the model...
    done. Time:   0.111s, Model file version: 1
Model parameters:
n_vocab = 50257
n_ctx   =  1024
n_embd  =   768
n_layer =    12
n_head  =    12
Input text
Alan Turing theorized that computers would one day become very powerful, but even he could not imagine
Encoding: tokenizing input text into tokens (currently slow)...
    done. Time:   0.50s
Input parameters:
n_seq                =  19
n_tokens_to_generate =  20
Input tokens:
 36235 39141 18765  1143   326  9061   561   530  1110  1716   845  3665    11   475   772   339   714   407  5967
Decoded input as text:
Alan Turing theorized that computers would one day become very powerful, but even he could not imagine
Running model...
 how they would be able to do so.
"I think that the most important thing is
    done. Time:   0.924s (1.0x)
Output tokens:
   703   484   561   307  1498   284   466   523    13   198   198     1    40   892   326   262   749  1593  1517   318
Decoded output as text:
 how they would be able to do so.
"I think that the most important thing is
Benchmark
Here are some preliminary benchmarks, obtained by running make gpt2 and time OMP_NUM_THREADS=1 ./gpt2. All times are in seconds.
Apple MacBook Pro M1 Max
| Compiler | Compile time (s) | Run time (s) |
|---|---|---|
| GFortran 11.3.0 | 1.008 | 1.143 | 
| LFortran 0.20.3 | 0.622 | 1.207 | 
| GFortran 11.3.0 (Optimized) | 2.720 | 0.483 | 
| LFortran 0.20.3 (Optimized) | 0.821 | 1.115 | 
Apple MacBook Pro M2 Pro (16 GB Memory), Ventura 13.5.1 (22G90)
fastGPT Debug build
| Compiler | Compile time (s) | Run time (s) |
|---|---|---|
| GFortran 11.3.0 | 1.740 | 1.060 | 
| LFortran 0.20.3 | 0.623 | 1.400 | 
| GFortran 11.3.0 (Optimized) | 2.096 | 0.993 | 
| LFortran 0.20.3 (Optimized) | 0.844 | 1.087 | 
fastGPT Release build
| Compiler | Compile time (s) | Run time (s) |
|---|---|---|
| GFortran 11.3.0 | 2.283 | 0.983 | 
| LFortran 0.20.3 | 0.654 | 1.402 | 
| GFortran 11.3.0 (Optimized) | 2.282 | 0.982 | 
| LFortran 0.20.3 (Optimized) | 0.824 | 1.095 | 
Apple MacBook Pro M1 Pro
| Compiler | Compile time (s) | Run time (s) |
|---|---|---|
| GFortran 12.3.0 | 2.448 | 1.011 | 
| LFortran 0.20.3 | 0.680 | 1.246 | 
| GFortran 12.3.0 (Optimized) | 2.443 | 1.014 | 
| LFortran 0.20.3 (Optimized) | 0.861 | 1.128 | 
The optimization flags for GFortran are -O3 -march=native -ffast-math -funroll-loops. LFortran takes one optimization flag: --fast.
We measured the total execution time, which includes loading the model from disk, tokenization (encoding), GPT-2 inference and decoding.
As can be seen, LFortran compiles faster than GFortran. The run time is comparable, with GFortran generally faster. Once LFortran reaches beta, we will focus on optimizations, with the objective to match or beat GFortran’s run time in all cases.
Currently, our main focus is just to compile codes. So long as the run time is within a factor of 2 of GFortran’s, it is good enough for now. As the tables above show, LFortran is often within 20% of GFortran’s run time.
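The run-time comparison can be reproduced from the tables with a few lines of Fortran. The sketch below is not part of fastGPT or LFortran; the program and variable names are made up for illustration. It copies the run times from the four tables above, divides LFortran's time by GFortran's for each row, and counts how many rows fall within 20% and within a factor of 2.

```fortran
program runtime_ratio
    implicit none
    integer, parameter :: dp = kind(1.0d0)
    integer, parameter :: n = 8
    ! Run times in seconds, copied from the tables above, in table order:
    ! M1 Max, M2 Pro Debug, M2 Pro Release, M1 Pro.
    ! Within each machine: default build first, then optimized build.
    real(dp), parameter :: gfortran_t(n) = [1.143_dp, 0.483_dp, 1.060_dp, 0.993_dp, &
                                            0.983_dp, 0.982_dp, 1.011_dp, 1.014_dp]
    real(dp), parameter :: lfortran_t(n) = [1.207_dp, 1.115_dp, 1.400_dp, 1.087_dp, &
                                            1.402_dp, 1.095_dp, 1.246_dp, 1.128_dp]
    real(dp) :: ratio(n)
    integer :: i

    ! Element-wise division: one ratio per table row.
    ratio = lfortran_t / gfortran_t

    do i = 1, n
        print "(a,i0,a,f6.3)", "row ", i, ": LFortran / GFortran run time = ", ratio(i)
    end do

    ! count() tallies the rows that satisfy each threshold.
    print "(a,i0,a,i0)", "within 20%: ", count(ratio <= 1.2_dp), " of ", n
    print "(a,i0,a,i0)", "within 2x:  ", count(ratio <= 2.0_dp), " of ", n
end program runtime_ratio
```

A ratio of 1.0 means equal run time, and values above 1.0 mean LFortran is slower on that row.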
What’s Next?
As of this writing, LFortran compiles three codes. Our goal is to compile 10 third-party codes to bring LFortran from alpha to beta. This is our main focus. We have been working on compiling several more third-party codes, and we will announce them once they fully compile and run. Some of those codes are the Fortran Package Manager (fpm) and large parts of SciPy.
We are always looking for more contributors; if you are interested, please get in touch. Furthermore, if you’re enthusiastic about enhancing fastGPT’s capabilities, we invite you to collaborate on parallelizing the CPU execution and optimizing its performance on GPU hardware.
Acknowledgements
We want to thank:
- GSI Technology
- LANL
- NumFOCUS
- Sovereign Tech Fund (STF)
- QuantStack
- Google Summer of Code
- Our GitHub, OpenCollective and NumFOCUS sponsors
- All our contributors (59 so far!)