Code for the exam in Development tools for Scientific Computing, SISSA, a.y. 202

---
## Parallel matrix-matrix multiplication
The goal here was to implement matrix-matrix multiplication in distributed memory. Details about the implementation are [a bit further down](#notes-on-the-implementation), but the general idea is to split the matrices we want to multiply between MPI processes and let each process compute a chunk of the result. The whole distributed multiplication requires several steps, such as computing the workload for each process, initialising the data, and communicating and computing individual chunks, so it is not straightforward to write a single `distributed_multiply` routine. In fact, this code has no such routine; instead, the whole distributed machinery is provided in `scripts/run.py`.
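The workload-computation step mentioned above can be sketched as follows. This is an illustrative sketch, not code from `scripts/run.py`; the function name and signature are assumptions.

```python
# Hypothetical sketch of the workload-computation step: split n_rows of the
# result as evenly as possible across `size` MPI ranks. The name `row_range`
# is illustrative, not taken from scripts/run.py.
def row_range(n_rows, size, rank):
    """Return the half-open [start, stop) row range owned by `rank`."""
    base, extra = divmod(n_rows, size)
    # The first `extra` ranks take one extra row each.
    start = rank * base + min(rank, extra)
    stop = start + base + (1 if rank < extra else 0)
    return start, stop

# With 10 rows over 4 ranks the chunks are 3, 3, 2, 2:
# [row_range(10, 4, r) for r in range(4)] -> [(0, 3), (3, 6), (6, 8), (8, 10)]
```

Each rank then only needs the rows of the left operand in its range (plus, depending on the algorithm, the full right operand or pieces of it obtained through communication).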
In `src/matmul/routines.py` are a number of matrix-matrix multiplication routines (serial, parallel, tiled, GPU-accelerated) that can be used in the distributed algorithm. The performance of the distributed algorithm depends on the performance of the base routine.
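To give an idea of what one of these base routines looks like, here is a plain-NumPy sketch of a tiled multiply. The real routines in `src/matmul/routines.py` are Numba-compiled; this version only illustrates the tiling idea, and its name and signature are assumptions.

```python
import numpy as np

# Illustrative tiled matrix-matrix multiply (plain NumPy; the actual routines
# in src/matmul/routines.py are Numba-compiled). Tiling improves cache reuse:
# each block of C is accumulated from small blocks of A and B.
def matmul_tiled(A, B, tile=64):
    n, k = A.shape
    k2, m = B.shape
    assert k == k2, "inner dimensions must match"
    C = np.zeros((n, m), dtype=np.result_type(A, B))
    for i in range(0, n, tile):
        for j in range(0, m, tile):
            for p in range(0, k, tile):
                # Accumulate one tile of C from matching tiles of A and B.
                C[i:i+tile, j:j+tile] += A[i:i+tile, p:p+tile] @ B[p:p+tile, j:j+tile]
    return C
```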
All the code is implemented in Python. NumPy is employed to manipulate the matrices, while Numba is used to JIT compile routines into serial, parallel, CPU and GPU code. MPI support is provided by mpi4py.
## Installation
**NOTE:** Installing mpi4py with `pip` requires a working installation of MPI on the machine. Also, for GPU computing, a recent version of the CUDA Toolkit is required (see [Numba](https://numba.readthedocs.io/en/stable/cuda/overview.html) for details).
Optionally install dependencies for testing (`test`), profiling (`profile`) or both (`dev`) with
```bash
python -m pip install .[<DEPENDENCY>]
```
### NVHPC

As it turns out, [NVHPC](https://developer.nvidia.com/hpc-sdk) ships with everything needed here. One issue is that mpi4py is not really meant to be compiled with nvc by default. If you have issues while installing you may want to try this
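One commonly suggested workaround (an assumption on my part, not taken from this repo) is to tell nvc to ignore the GCC-style flags it does not recognise while building mpi4py from source:

```bash
# nvc aborts on unknown GCC-style switches; -noswitcherror makes it warn
# instead, which usually lets the mpi4py source build go through.
CFLAGS=-noswitcherror python -m pip install --no-binary=mpi4py mpi4py
```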
### From DockerHub
You can get container images with this code from DockerHub as well. The images are built with CUDA 12.4 and still require NVIDIA drivers on the host machine to run.
If you plan on running a Docker container you can get the image with
```bash
docker pull gcodega/matmul:cuda12.4-docker
```
If you plan on running a Singularity container, you can get a different tag
```bash
docker pull gcodega/matmul:cuda12.4-singularity
```
Note that to run Docker with CUDA support you may need the [NVIDIA Container Toolkit](https://docs.nvidia.com/datacenter/cloud-native/container-toolkit/latest/install-guide.html), whereas Singularity natively supports CUDA.
The tags only differ in that the Docker image has a custom user (tony:matmul), whereas the Singularity image runs as root. Not setting a different user in the Singularity image is actually recommended: Singularity containers inherit the user from the host, and setting a user different from root may cause issues with environments inside the container. Also, if you want to run the code on an HPC facility you may want to use the Singularity image, as it can interact with the host MPI and run on multiple nodes. Note that in this case performance may not be optimal, as the OpenMPI inside the container is not optimized for any specific machine.
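For reference, the two images might be launched along these lines (the exact entrypoint and script arguments are assumptions; `/shared-folder/app` is where the source code lives inside the image):

```bash
# Docker needs the GPU exposed explicitly (requires the NVIDIA Container Toolkit).
docker run --rm --gpus all gcodega/matmul:cuda12.4-docker

# Singularity only needs --nv to map the host NVIDIA driver into the container.
singularity exec --nv docker://gcodega/matmul:cuda12.4-singularity \
    python /shared-folder/app/scripts/run.py
```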
## Run some tests
To check out how different routines perform you can run `scripts/run.py`. After installing the package, you can modify `examples/config.yaml` by specifying the following parameters:
Note that if you want to run this in serial you still need to use `mpirun -n 1 ...`. Moreover, should you run any of this code on an HPC facility and submit a SLURM job, note that SLURM's `srun` might not work with mpi4py, and you may need to use `mpirun` instead (in `shell/submit.sbatch` you can find the script I used to submit jobs on Ulysses at SISSA). Finally, when running through Singularity you may need to specify absolute paths for the scripts (all source code is in `/shared-folder/app`).
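A typical launch might look like the following (the exact command-line arguments of `scripts/run.py` are an assumption here):

```bash
# Parallel run over four MPI ranks; use mpirun rather than srun with mpi4py.
mpirun -n 4 python scripts/run.py

# A serial run still goes through MPI, just with a single rank.
mpirun -n 1 python scripts/run.py
```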
### Profiling
The script will print to screen the time spent multiplying the matrices (i.e. no communication time or other overhead). You can get more insight by profiling the code with kernprof. The script in `shell/submit.sh` lets you run one instance of kernprof for each MPI task and save the results to different files. You can select the number of threads for parallel routines by changing `NUMBA_NUM_THREADS` and customize the output path for kernprof. Run the script as
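A per-rank kernprof launch along the lines of `shell/submit.sh` might look like this (a sketch, not the script itself; `OMPI_COMM_WORLD_RANK` is Open MPI specific, other MPI implementations expose the rank under different variable names):

```bash
# Two MPI ranks, each wrapped in its own kernprof instance writing to a
# rank-specific .lprof file; NUMBA_NUM_THREADS controls parallel routines.
export NUMBA_NUM_THREADS=4
mpirun -n 2 bash -c \
  'kernprof -l -o "profile_rank${OMPI_COMM_WORLD_RANK}.lprof" scripts/run.py'
```

The results can then be inspected per rank with `python -m line_profiler <file>.lprof`.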