Andrea Cardillo

Mathematician... and Python Developer

NumPy Extreme: Compiling with Optimized BLAS for Maximum Performance

Abstract

Installing NumPy via standard package managers (like pip) ties you to prebuilt, generic builds of the underlying mathematical libraries (BLAS/LAPACK), with no control over how they are compiled or configured. For CPU-intensive calculations, this can limit performance. This article guides you through the advanced process of compiling NumPy directly from source, linking it to optimized BLAS libraries such as OpenBLAS or Intel MKL. Following this procedure is crucial for maximizing the speed of linear algebra operations on high-performance systems.

Keywords

NumPy, BLAS, LAPACK, OpenBLAS, Compilation, Performance, HPC, Python, Linear Algebra, Linux.

Introduction: Why Compile From Source?

NumPy is the foundation of scientific computing in Python. Much of its speed comes from delegating linear algebra operations (like matrix multiplication) to underlying libraries written in C or Fortran, known as BLAS (Basic Linear Algebra Subprograms) and LAPACK (Linear Algebra PACKage).

The Complementary Roles of BLAS and LAPACK

  1. BLAS (The Foundation): Provides the routines for fundamental operations (vector-vector, matrix-vector, and the critical Matrix-Matrix multiplication).
  2. LAPACK (The Builder): Builds on the BLAS kernels to implement higher-level, more complex operations, such as Singular Value Decomposition (SVD), LU decomposition, and solving systems of linear equations.

Modern NumPy installations, including those obtained via pip, are almost always linked to optimized BLAS implementations like OpenBLAS (as evidenced by the default scipy-openblas configuration). The use of an optimized BLAS library is crucial because it not only accelerates individual matrix multiplication operations but automatically extends this performance gain to all the complex LAPACK routines that rely on them. Compiling from source is therefore not strictly necessary to obtain a basic level of optimization, but it is invaluable for verification and granular control: building NumPy and OpenBLAS manually allows us to rigorously isolate, configure, and test the impact of specific parameters, such as the number of threads, on the performance of BLAS and LAPACK routines.
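
As a quick illustration of this division of labor, here is a minimal sketch in plain NumPy (no extra dependencies assumed): one call that is dispatched to a Class 3 BLAS routine, one that lands in LAPACK, and a final call that prints which BLAS/LAPACK implementation your NumPy was linked against.

import numpy as np

rng = np.random.default_rng(0)
a = rng.standard_normal((1000, 1000))
b = rng.standard_normal((1000, 1000))

# matrix-matrix product: dispatched to the BLAS GEMM routine
c = a @ b

# singular value decomposition: a LAPACK routine that itself calls BLAS kernels
s = np.linalg.svd(a, compute_uv=False)

# print the BLAS/LAPACK libraries NumPy was built against
np.show_config()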

BLAS Operation Classes: What Are We Optimizing?

BLAS groups its routines into three classes (commonly called levels) according to their computational dimension. Understanding this classification makes it clear where optimization effort is targeted:

BLAS operation classes

  Class 1 (Vector-Vector, e.g. the dot product): complexity O(n). Often limited by memory read/write speed (memory-bound).
  Class 2 (Matrix-Vector, e.g. matrix-vector multiplication): complexity O(n^2). Better compute/memory ratio, but still sensitive to memory latency.
  Class 3 (Matrix-Matrix, e.g. GEMM): complexity O(n^3). The maximum gain is here: the operation is limited by pure computational power (compute-bound).

Optimized libraries focus their effort on Class 3, GEMM (General Matrix Multiplication), because, for large matrices, the cubic computation time, O(n^3), far exceeds data access time, allowing the CPU to operate at its maximum capacity.
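
To make the memory-bound versus compute-bound distinction concrete, the small benchmark below (a sketch; the absolute numbers depend entirely on your hardware) reads the same amount of input data for a Class 1 dot product and a Class 3 GEMM, then estimates the achieved throughput in GFLOP/s. The GEMM figure is typically one to two orders of magnitude higher.

import time
import numpy as np

rng = np.random.default_rng(0)
n = 2000

# Class 1: dot product over 2*n^2 elements (memory-bound)
x = rng.standard_normal(n * n)
y = rng.standard_normal(n * n)
t0 = time.perf_counter()
x @ y
dt = time.perf_counter() - t0
print(f"dot : {2 * n * n / dt / 1e9:8.2f} GFLOP/s")

# Class 3: GEMM on two n x n matrices (compute-bound)
a = rng.standard_normal((n, n))
b = rng.standard_normal((n, n))
t0 = time.perf_counter()
a @ b
dt = time.perf_counter() - t0
print(f"gemm: {2 * n ** 3 / dt / 1e9:8.2f} GFLOP/s")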

Step 1: Installing Prerequisites

Before starting, ensure you have all the necessary development tools and Git to clone the NumPy repository.

A. Essential Tools

Install the Fortran compiler (necessary for BLAS/LAPACK) and development packages:

            
sudo apt update
sudo apt install build-essential gfortran python3-dev git
            
        
B. Installing OpenBLAS (Local User)
            
OPENBLAS_INSTALL_DIR="$HOME/local/openblas"
mkdir -p "$OPENBLAS_INSTALL_DIR"
cd /tmp
git clone --depth 1 https://github.com/OpenMathLib/OpenBLAS.git
cd OpenBLAS
# Build with OpenMP threading (USE_OPENMP=1) and runtime kernel selection
# (DYNAMIC_ARCH=1); the cache, thread and target values below should be
# tuned to your own hardware.
make \
    -j $(nproc) \
    DYNAMIC_ARCH=1 \
    USE_OPENMP=1 \
    NUMA_KEEP=1 \
    L2_SIZE=512 \
    L3_SIZE=3072 \
    ALIGNMENT=64 \
    MAX_THREADS=2 \
    TARGET=HASWELL \
    FC=gfortran
make install PREFIX="$OPENBLAS_INSTALL_DIR"
unset OPENBLAS_INSTALL_DIR
rm -rf /tmp/OpenBLAS
            
        

The OpenBLAS library is now installed in ~/local/openblas/.
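
Before involving NumPy at all, you can sanity-check the freshly built library from Python through ctypes. The snippet below is a small sketch that assumes the shared object ended up at ~/local/openblas/lib/libopenblas.so (the PREFIX used above); openblas_get_config() and openblas_get_num_threads() are part of OpenBLAS's public C API.

import ctypes
import os

# path follows the PREFIX used during "make install"; adjust if yours differs
lib_path = os.path.expanduser("~/local/openblas/lib/libopenblas.so")
openblas = ctypes.CDLL(lib_path)

# openblas_get_config() returns a string describing the build
# (target architecture, threading model, version, ...)
openblas.openblas_get_config.restype = ctypes.c_char_p
print(openblas.openblas_get_config().decode())

# maximum number of threads the library is currently set to use
print("threads:", openblas.openblas_get_num_threads())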

C. Configuring the User Environment

Edit your shell profile (e.g. ~/.bashrc) to add the following environment variables:

            
# make OpenBLAS discoverable at build time (pkg-config) and at runtime (dynamic linker)
export PKG_CONFIG_PATH="$HOME/local/openblas/lib/pkgconfig:$PKG_CONFIG_PATH"
export LD_LIBRARY_PATH="$HOME/local/openblas/lib:$LD_LIBRARY_PATH"
export BLAS_LIBS="-L$HOME/local/openblas/lib -lopenblas"
export LAPACK_LIBS="-L$HOME/local/openblas/lib -lopenblas"
export FC=gfortran # force the choice of the Fortran compiler

# set the number of threads - tune this to your hardware
export OPENBLAS_NUM_THREADS="4"
export GOTO_NUM_THREADS="4" # kept for compatibility with older versions, usually not needed
export OMP_NUM_THREADS="4"
            
        
OpenBLAS and OpenMP: Parallelization

OpenBLAS owes much of its performance to multithreading; in the build above we enabled it through the OpenMP (Open Multi-Processing) framework by setting USE_OPENMP=1.

OpenMP is an API that supports parallel programming in shared-memory environments (your CPU cores). When OpenBLAS executes intensive Class 3 (GEMM) operations:

  1. It uses OpenMP directives to split the matrix multiplication operation into smaller, independent blocks.
  2. These blocks are distributed as parallel tasks across all available logical cores.

In this way, OpenBLAS acts as the threading engine that turns a single-threaded NumPy calculation into a multi-core operation, ensuring that your linear algebra calculations are limited by your processing power, not the software.
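
A quick way to observe this parallelization is to time the same GEMM under different thread counts. The sketch below uses the optional threadpoolctl package (pip install threadpoolctl) to cap the OpenBLAS/OpenMP pool at runtime instead of re-exporting OPENBLAS_NUM_THREADS; the thread counts tried are just examples. On a compute-bound GEMM, the elapsed time should drop roughly in proportion to the number of physical cores used.

import time
import numpy as np
from threadpoolctl import threadpool_limits  # optional: pip install threadpoolctl

rng = np.random.default_rng(0)
a = rng.standard_normal((3000, 3000))
b = rng.standard_normal((3000, 3000))

for n_threads in (1, 2, 4):
    # limit every BLAS/OpenMP thread pool inside this block
    with threadpool_limits(limits=n_threads):
        t0 = time.perf_counter()
        a @ b
        elapsed = time.perf_counter() - t0
    print(f"{n_threads} thread(s): {elapsed:.2f} s")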

Step 2: Configuring NumPy's backend for BLAS

Legacy NumPy builds (those based on numpy.distutils) read a per-user ~/.numpy-site.cfg (or a site.cfg placed in the NumPy source tree) to locate the BLAS/LAPACK libraries. Recent releases built with Meson rely on pkg-config instead, which is why we exported PKG_CONFIG_PATH in the previous step; writing the file anyway is harmless and keeps older NumPy versions covered.

Locating the BLAS libraries
            
cat << EOF > ~/.numpy-site.cfg
[openblas]
libraries = openblas
library_dirs = $HOME/local/openblas/lib
include_dirs = $HOME/local/openblas/include

[lapack]
libraries = openblas
library_dirs = $HOME/local/openblas/lib
include_dirs = $HOME/local/openblas/include
EOF             
            
        

Step 3: Preparing the Compilation Environment

Let's create a clean environment and prepare the NumPy sources.

Install NumPy in a fresh environment
            
python3 -m venv myblas
source myblas/bin/activate
(myblas) python -m pip install -U pip
(myblas) python -m pip install --no-cache-dir numpy --no-binary numpy # --verbose
            
        
Verify Installation:

Show which libraries NumPy was built against with np.show_config():

            
(myblas) python -c "import numpy as np; np.show_config()" | grep "include dir"
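
Beyond inspecting the build configuration, a small benchmark makes the comparison with a stock pip wheel tangible. The sketch below times one BLAS-heavy and one LAPACK-heavy routine; run it once in this environment and once against a standard binary NumPy installation and compare the results on your machine (the matrix sizes are arbitrary examples).

import time
import numpy as np

def best_of(fn, repeats=3):
    # keep the best wall-clock time over a few runs to reduce noise
    times = []
    for _ in range(repeats):
        t0 = time.perf_counter()
        fn()
        times.append(time.perf_counter() - t0)
    return min(times)

rng = np.random.default_rng(0)
a = rng.standard_normal((2000, 2000))
b = rng.standard_normal((2000, 2000))

print(f"matmul (BLAS GEMM): {best_of(lambda: a @ b):.3f} s")
print(f"svd (LAPACK)      : {best_of(lambda: np.linalg.svd(a, compute_uv=False)):.3f} s")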
            
        

Conclusion: Optimization and Context

Compiling NumPy with an optimized BLAS library is not a miracle cure for every user. For developers performing standard numerical operations or working with small datasets, the performance gain compared to the version provided by pip can be minimal or negligible.

However, this process is essential and offers tangible value in specific contexts:

  1. HPC/Scientific Workloads: For those managing intensive linear algebra calculations, such as large matrix analysis, scientific simulations, or machine learning model training on dedicated hardware, optimizing the machine code (by exploiting AVX/FMA instructions) can lead to significant time savings that accumulate over time.
  2. Transparency and Control: This procedure ensures you have total control over the underlying dependencies, eliminating any doubt about potential bottlenecks caused by unoptimized BLAS/LAPACK.

Ultimately, if benchmarks show that BLAS operations are your primary bottleneck, compiling from source is the only way to extract maximum efficiency from your hardware. For everyone else, the standard installation via pip remains the fastest and most practical choice.

References