Commit 1357d2b

Update memory profiling and scripts
1 parent d147cd0 commit 1357d2b

7 files changed
Lines changed: 71 additions & 32 deletions

File tree

.gitignore
README.md
experiments/config.yaml
scripts/profiling_memory.py
shell/load_modules.sh
shell/submit.sbatch
shell/submit.sh

.gitignore

Lines changed: 3 additions & 0 deletions
@@ -203,3 +203,6 @@ skbuild/*
 
 # SLURM logs
 *dtsc*
+
+# Backup files
+*.bak

README.md

Lines changed: 26 additions & 13 deletions
@@ -2,11 +2,6 @@ Names: Gaspare Li Causi, Lorenzo Tomada
 
 
 
-# TODO:
-- documentation in the script folder (explain what we are doing)
-- run using ulysses
-- find a way to import functions in memory profiling
-
 # Introduction
 This repository contains the final project for the course in Development Tools in Scientific Computing.
 
@@ -36,7 +31,7 @@ In order to solve an eigenvalue problem, we considered multiple strategies.
 1. The most trivial one was to implement the power method in order to be able to compute (at least) the biggest eigenvalue. We then used `numba` to try and optimize it, but in this case just-in-time compilation was not extremely beneficial. The implementation of the power method is contained in `eigenvalues.py`.
 2. Lanczos + QR: this is an approach (tailored to the case of symmetric matrices) to compute *all* the eigenvalues and eigenvectors. Notice that, also in the case of the QR method, `numba` was not very beneficial in terms of speed-up, resulting in a pretty slow methodology. For this reason, we implemented the QR method in `C++` and used `pybind11` to expose it to `Python`. All the code written in `C++` can be found in `cxx_utils.cpp`.
 3. `CuPy` implementation of all of the above: we implemented all the above methodologies using `CuPy` to see whether using GPU could speed up computations. Since this was not the case, we commented all the lines of code involving `CuPy`, so that installation of the package is no longer required and we can use our code also on machines that do not have GPU.
-4. The core of the project is the implementation (as well as a generalization of the simplified case in which $\rho=1$ considered in our reference) of the _divide et impera_ method for the computation of eigenvalues of a symmetric matrix. Some helpers were originally written in `Python` and then translated to `C++` for efficiency reasons: their original implementation is in `zero_finder.py` and is still present in the project for testing purposes. The translated version can be found in `cxx_utils.cpp`. Instead, the implementation of the actual method to compute the eigenvalues starting from a tridiagonal matrix is contained in `parallel_tridiag_eigen.py` and makes use of `mpi4py`.
+4. The core of the project is the implementation (as well as a generalization of the simplified case in which $\rho=1$ considered in our reference) of the _divide et impera_ method for the computation of eigenvalues of a symmetric matrix. Some helpers were originally written in `Python` and then translated to `C++` for efficiency reasons: their original implementation is in `zero_finder.py` and is still present in the project for testing purposes. The translated version can be found in `cxx_utils.cpp`. Instead, the implementation of the actual method to compute the eigenvalues starting from a tridiagonal matrix is contained in `parallel_tridiag_eigen.py` and makes use of `mpi4py`. Notice that the implementation of deflation in `cxx_utils.cpp` is done using the `Eigen` library.
 
 # Results
 The results of the profiling (runtime vs matrix size, memory consumption, scalability, and so on) are discussed in detail in `Documentation.ipynb`.
@@ -54,16 +49,34 @@ It is also possible to provide paths to other configuration files by passing the
 Notice that the script is *not* called using `mpirun`, but internally it uses MPI.
 This is done by spawning a communicator inside the script.
 
-In addition, in the `shell` folder, we provide a `submit.sbatch` file to run using `SLURM`, as well as a `submit.sh` to run the same experiment locally.
-These two files perform memory profiling.
+In addition, in the `shell` folder, we provide a `submit.sbatch` file to run using `SLURM`, as well as a `submit.sh`.
+They are used to perform memory profiling.
+
+The `submit.sbatch` file is supposed to be used on Ulysses (or any other cluster using `SLURM`).
+It is supposed to show how to submit a job (in which our package is employed) using `SLURM`.
+Notice, however, that due to Ulysses's problems with `MPI`, the profiling could not actually be run there.
+As a result, we also provide `submit.sh`, which is supposed to be run on a workstation.
+It executes `mpirun -np [n_procs] python scripts/profiling_memory.py`, basically doing the same as the `submit.sbatch` script, but without using `SLURM`.
+Notice that it assumes that `shell/load_modules.sh` has already been executed (see the next section).
+
+We also remark that the memory-profiling script `scripts/profiling_memory.py` does not spawn an `MPI` communicator, but is supposed to be called using `mpirun`. The reason for that is to provide a more extensive list of examples of how our package can be used.
 
-# To install using Ulysses:
+Notice that it is possible that `scripts/mpi_running.py` will not run on systems using `SLURM`, due to the fact that we are using a specific way to spawn an `MPI` communicator.
+Nevertheless, the package still works: as done in `scripts/profiling_memory.py`, it suffices to run a file that can be used in combination with `mpirun` or `srun`.
+
+# How to install:
+If you are using Ulysses or a SISSA workstation, it is likely that you will need to load a couple of modules to be able to install the package.
+The exact modules change according to the device you are currently using, but it is sufficient that you have `CMake`, `gcc` and `OpenMPI`.
+
+To streamline the installation process, we provide the script `shell/load_modules.sh`.
+This script loads the modules that are required on Ulysses or on a SISSA workstation (according to the flag that is passed).
+To use it, run:
 ```bash
-source shell/load_modules.sh
+source shell/load_modules.sh Ulysses # or source shell/load_modules.sh workstation
 ```
-The previous line will load CMake and gcc. Both are needed to compile the project.
-In addition, it will enable the istallation of `mpi4py`.
-After that, you can just write
+The previous line will allow the installation of `mpi4py` and the automatic compilation of the `C++` source file used in the project.
+
+Once the needed modules are loaded, you can regularly install via `pip` using the following command:
 ```bash
 python -m pip install .
 ```
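For reference, the distinction the README draws above, between a script that spawns its own `MPI` communicator (as `scripts/mpi_running.py` does) and one that is meant to be launched externally with `mpirun` or `srun` (as `scripts/profiling_memory.py` is), can be sketched with `mpi4py` as follows; the worker file name and process count are placeholders, not taken from the repository.

```python
# Minimal sketch of the two launch styles described in the README (illustrative only).
import sys
from mpi4py import MPI


def spawn_workers(n_procs):
    # Style 1: a serially started script (python driver.py) spawns its own
    # communicator; "worker.py" is a hypothetical worker script.
    inter_comm = MPI.COMM_SELF.Spawn(sys.executable, args=["worker.py"], maxprocs=n_procs)
    inter_comm.Disconnect()


def launched_externally():
    # Style 2: the script is started by the launcher (mpirun -np 4 python script.py),
    # so every rank simply picks up MPI.COMM_WORLD.
    comm = MPI.COMM_WORLD
    print(f"rank {comm.Get_rank()} of {comm.Get_size()}")
```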

experiments/config.yaml

Lines changed: 3 additions & 3 deletions
@@ -1,4 +1,4 @@
-dim: 10
+dim: 100
 density: 0.2
-n_processes: 1
-plot: false
+n_processes: 2
+plot: false
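For context, these are the keys the profiling run reads from the configuration file. A minimal sketch of loading them with PyYAML follows; the actual loading code is not part of this diff, so this only illustrates the keys shown above.

```python
# Sketch of reading experiments/config.yaml with PyYAML; only the keys visible
# in the diff above are assumed to exist.
import yaml

with open("experiments/config.yaml") as fh:
    cfg = yaml.safe_load(fh)

dim = cfg["dim"]                  # matrix size, e.g. 100
density = cfg["density"]          # density of the generated sparse matrix, e.g. 0.2
n_processes = cfg["n_processes"]  # now overridden by the communicator size (see below)
plot = cfg["plot"]                # whether to produce the summary plot
```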

scripts/profiling_memory.py

Lines changed: 10 additions & 3 deletions
@@ -47,7 +47,7 @@
 kwargs = comm.bcast(kwargs, root=0)
 dim = kwargs["dim"]
 density = kwargs["density"]
-n_procs = kwargs["n_processes"]
+n_procs = size  # kwargs["n_processes"]
 plot = kwargs["plot"]
 
 # Now we build the matrix on rank 0
@@ -61,7 +61,7 @@
 A_np = comm.bcast(A_np, root=0)
 
 
-# On rank 0, we use the Lanczos method
+# On rank 0, we use the Lanczos method.
 # We actually call it twice: the first time to ensure that the function is JIT-compiled by Numba, the second one for memory profiling
 if rank == 0:
     print("Precompiling Lanczos...")
@@ -80,6 +80,7 @@
 else:
     diag = off_diag = None
 
+# Now we broadcast diag and off_diag to all other ranks so we can use parallel_tridiag_eigen
 diag = comm.bcast(diag, root=0)
 off_diag = comm.bcast(off_diag, root=0)
 
@@ -97,19 +98,23 @@
 
 total_mem_children = comm.reduce(delta_mem, op=MPI.SUM, root=0)
 
+# Collect the information across all ranks
 if rank == 0:
+    print(f"########################## SIZE = {size} #####################")
     total_mem_all = delta_mem_lanczos
     print("Eigenvalues computed.")
     process = psutil.Process()
 
     print(f"Total memory across all processes: {total_mem_all:.2f} MB")
 
+    # We also profile numpy and scipy memory consumption
     mem_np = profile_numpy_eigvals(A_np)
     print(f"NumPy eig memory usage: {mem_np:.2f} MB")
 
     mem_sp = profile_scipy_eigvals(A_np)
     print(f"SciPy eig memory usage: {mem_sp:.2f} MB")
 
+    # Save to the logs folder
     os.makedirs("logs", exist_ok=True)
     log_file = "logs/memory_profile.csv"
     fieldnames = [
@@ -140,6 +145,8 @@
     )
 
     if plot:
+        # We only plot if all the runs have been done already. In this way, we get a complete memory usage graph.
+
         import matplotlib.pyplot as plt
         import pandas as pd
 
@@ -194,5 +201,5 @@
         )
         plt.subplots_adjust(right=0.75)
 
-        plt.savefig("logs/mem_vs_size_all_methods.png", bbox_inches="tight")
+        plt.savefig("logs/memory_profiling.png", bbox_inches="tight")
         plt.show()
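A note on the pattern in the hunks above: each rank measures its own memory delta (after a warm-up call so that Numba's JIT compilation is not counted), and the per-rank deltas are summed onto rank 0 with an MPI reduction. A stripped-down sketch of that pattern, with the eigenvalue computation elided and variable names simplified, could look like this:

```python
# Stripped-down sketch of the per-rank memory accounting used above; run with
# e.g. `mpirun -np 4 python sketch.py`. The eigenvalue computation is elided.
from mpi4py import MPI
import psutil

comm = MPI.COMM_WORLD
rank = comm.Get_rank()

proc = psutil.Process()
before = proc.memory_info().rss / 1e6  # resident memory in MB before the work

# ... each rank does its share of the eigenvalue computation here ...

delta_mem = proc.memory_info().rss / 1e6 - before

# Sum the per-rank deltas onto rank 0, mirroring the comm.reduce call above
total_mem = comm.reduce(delta_mem, op=MPI.SUM, root=0)
if rank == 0:
    print(f"Total memory across all processes: {total_mem:.2f} MB")
```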

shell/load_modules.sh

Lines changed: 21 additions & 3 deletions
@@ -1,4 +1,22 @@
 #!/bin/bash
-module load cmake/3.29.1
-module load intel/2021.2
-module load openmpi3/3.1.4
+
+# Usage: source load_modules.sh [env]
+# where [env] can be: Ulysses, workstation
+
+env_arg="$1"
+
+if [[ "$env_arg" == "Ulysses" ]]; then
+    echo "Loading modules for Ulysses cluster..."
+    module load cmake/3.29.1
+    module load intel/2021.2
+    module load openmpi3/3.1.4
+
+elif [[ "$env_arg" == "2" || "$env_arg" == "workstation" ]]; then
+    echo "Loading modules for local workstation..."
+    module load intel/2022.2.1
+    module load openmpi4/4.1.4
+
+else
+    echo "Usage: $0 [Ulysses|workstation]"
+    exit 1
+fi

shell/submit.sbatch

Lines changed: 6 additions & 7 deletions
@@ -3,10 +3,10 @@
 #SBATCH --partition=regular1
 #SBATCH --job-name=dtsc
 #SBATCH --nodes=1
-#SBATCH --ntasks=4
+#SBATCH --ntasks=8
 #SBATCH --cpus-per-task=1
 #SBATCH --mem=10000
-#SBATCH --time=06:00:00
+#SBATCH --time=01:00:00
 
 #SBATCH --output=%x.o%j.%N
 #SBATCH --error=%x.e%j.%N
@@ -22,7 +22,7 @@ echo '------------------------------------------------------'
 # ==== End of Info part (say things) ===== #
 #
 
-cd $SLURM_SUBMIT_DIR
+cd $SLURM_SUBMIT_DIR
 
 module load cmake/3.29.1
 module load intel/2021.2
@@ -32,8 +32,8 @@ conda init
 conda activate devtools_scicomp
 
 # Ranges over which we iterate
-n_processes=(1 2 4)
-matrix_sizes=(10 15 20)
+n_processes=(1 2 4 8)
+matrix_sizes=(10 50 100 500 1000)
 
 last_dim="${matrix_sizes[-1]}"
 last_nproc="${n_processes[-1]}"
@@ -51,7 +51,6 @@ for dim in "${matrix_sizes[@]}"; do
 echo "------------------"
 
 sed -i "s/^dim: .*/dim: $dim/" $CONFIG_FILE
-sed -i "s/^n_processes: .*/n_processes: $n_p/" $CONFIG_FILE
 sed -i "s/^plot: .*/plot: false/" $CONFIG_FILE
 echo "Running with size=$dim and n_processes=$n_p"
 
@@ -62,7 +61,7 @@ for dim in "${matrix_sizes[@]}"; do
 sed -i "s/^plot: .*/plot: true/" $CONFIG_FILE
 fi
 
-python scripts/profiling_memory.py
+srun --mpi=openmpi -n ${n_p} python scripts/profiling_memory.py
 done
 done
 
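The sweep performed by `submit.sbatch` above (and by `submit.sh` below) iterates over matrix sizes and process counts, rewrites `experiments/config.yaml`, enables plotting only on the last run, and launches the profiler. The same loop could be driven from Python; the following is a hypothetical sketch of that logic, with `mpirun` standing in for `srun` when running on a workstation.

```python
# Hypothetical Python equivalent of the sweep in shell/submit.sbatch /
# shell/submit.sh: rewrite the config for each run, then launch the profiler.
import subprocess
import yaml

CONFIG_FILE = "experiments/config.yaml"
n_processes = [1, 2, 4, 8]
matrix_sizes = [10, 50, 100, 500, 1000]

for dim in matrix_sizes:
    for n_p in n_processes:
        with open(CONFIG_FILE) as fh:
            cfg = yaml.safe_load(fh)
        cfg["dim"] = dim
        # Plot only on the very last run so the graph covers every configuration
        cfg["plot"] = (dim == matrix_sizes[-1] and n_p == n_processes[-1])
        with open(CONFIG_FILE, "w") as fh:
            yaml.safe_dump(cfg, fh)

        print(f"Running with size={dim} and n_processes={n_p}")
        subprocess.run(
            ["mpirun", "-np", str(n_p), "python", "scripts/profiling_memory.py"],
            check=True,
        )
```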
shell/submit.sh

Lines changed: 2 additions & 3 deletions
@@ -1,8 +1,8 @@
 #!/bin/bash
 
 # Ranges over which we iterate
-n_processes=(1 2)
-matrix_sizes=(10 15 20)
+n_processes=(1 2 4 8)
+matrix_sizes=(10 50 100 500 1000)
 
 last_dim="${matrix_sizes[-1]}"
 last_nproc="${n_processes[-1]}"
@@ -20,7 +20,6 @@ for dim in "${matrix_sizes[@]}"; do
 echo "------------------"
 
 sed -i "s/^dim: .*/dim: $dim/" $CONFIG_FILE
-sed -i "s/^n_processes: .*/n_processes: $n_p/" $CONFIG_FILE
 sed -i "s/^plot: .*/plot: false/" $CONFIG_FILE
 echo "Running with size=$dim and n_processes=$n_p"
 
