How to run Python experiments with Slurm and Conda

For data scientists and researchers, efficiently managing multiple machine learning experiments is fundamental to our work. Tools like Slurm and Conda offer powerful capabilities that streamline the execution of experiments on high-performance computing clusters while ensuring environment consistency and reproducibility. In this blog, we will take a look at how to schedule multiple experiments, or even multiple trials of the same experiment, using Slurm and a Conda environment.
Introduction to Slurm
Slurm (Simple Linux Utility for Resource Management) is an open-source, highly configurable workload manager used in high-performance computing (HPC) environments. It efficiently manages and schedules computing resources such as CPUs, GPUs, memory, and storage across a cluster of nodes.
With Slurm, users can submit and manage jobs, allocate resources based on job requirements, and monitor job status and resource utilization. Its robust capabilities make it a popular choice for orchestrating parallel computing tasks and handling complex job scheduling.
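For instance, a handful of everyday Slurm commands cover most of this workflow (the script name and job ID below are placeholders):
# Inspect the partitions and nodes available on the cluster
sinfo
# Submit a batch script to the scheduler
sbatch my_job.sh
# Check the status of your queued and running jobs
squeue -u $USER
# Cancel a job by its job ID
scancel <job_id>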
Setting Up Slurm
To utilize Slurm for running multiple experiments, we create an sbatch script. This script specifies various job parameters, such as the number of nodes, tasks per node, GPU allocation, partition, time limit, and output logs. For example:
#!/bin/bash
#SBATCH --partition=<your_partition_name>
#SBATCH --job-name=multi_exp_trials
#SBATCH --nodes=1
#SBATCH --gres=<gpu_type>:1 # Request one GPU per job
#SBATCH --mem=<number>GB # example, 32GB
#SBATCH --ntasks-per-node=1
#SBATCH --time=<days-hours> # example, 1-00 for 1 day
#SBATCH --cpus-per-gpu=1
#SBATCH --error=<log_path>/errors.err
#SBATCH --output=<log_path>/output.out
All of the arguments above are optional, and you can skip or add arguments as per your requirements. I generally do not set a time limit if I have no idea how long my experiment is going to take.
If you want a specific GPU type from the available ones, such as a P100, you can use #SBATCH --gres=p100:1
If you don't care about the GPU type, you can simply write #SBATCH --gres=gpu:1
Specifying the type through --gres is the older style. If you do not have a specific preference, you can simply request the required number of GPUs with #SBATCH --gpus=<number>
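Putting these options side by side, the GPU request line can take any of the following forms; pick whichever matches your cluster (note that some clusters expect the type as gpu:p100:1 rather than p100:1, so check your site's documentation):
#SBATCH --gres=p100:1 # one GPU of a specific type (here, a P100)
#SBATCH --gres=gpu:1 # one GPU of any type
#SBATCH --gpus=1 # newer syntax: one GPU, type unspecified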
Managing Environments with Conda
Conda, a versatile package and environment manager, plays a pivotal role in maintaining consistent environments for experiments. It enables users to create isolated environments with specific Python versions, packages, and dependencies.
In our sbatch script, we activate the desired Conda environment before running experiments:
# Conda environment name
conda_env="your_conda_environment_name"
# Full path to the directory containing your Python files
python_files_dir="/path/to/your/python_files_directory"
# Activate the conda environment
source activate $conda_env
This step ensures that the experiments execute within a controlled environment, reducing potential conflicts between dependencies and versions and thereby enhancing reproducibility. Notice that we have also set the path to the folder containing the Python files that we want to execute.
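Note that source activate is the older activation command and may not be available on newer Conda installations, particularly inside non-interactive batch scripts. A minimal alternative, assuming the conda command itself is on your PATH on the compute node, is to load the shell hook first:
# Make the conda shell function available in this non-interactive shell
eval "$(conda shell.bash hook)"
# Activate the environment by name
conda activate "$conda_env"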
Running Experiments
The following script defines different experiment types along with their respective command-line arguments:
# Define your experiment types and their respective command-line arguments
# Replace these with the actual relative paths to your Python files and arguments
experiment_commands=(
"python ${python_files_dir}/experiment1.py --arg1 value1 --arg2 value2"
"python ${python_files_dir}/experiment2.py --arg3 value3"
"python ${python_files_dir}/experiment3.py --arg4 value4 --arg5 value5"
)
# Loop through each experiment type
for exp_cmd in "${experiment_commands[@]}"; do
    # Run 10 trials of each experiment type
    for trial in {1..10}; do
        echo "Running trial $trial of $exp_cmd"
        $exp_cmd &
    done
    wait # Wait for all trials of the current experiment type to finish
    echo "All trials of $exp_cmd completed"
done
# Deactivate the Conda environment
conda deactivate
wait # Wait for all experiment types to finish
echo "All experiments completed"
The script allows customization by specifying the full path to the directory containing the Python files, the Conda environment to activate, and the command-line arguments for each experiment type. Such a setup ensures an organized and reproducible execution environment for experiments. This configuration runs ten trials of each experiment type while utilizing one GPU per job. Note that the ten trials of a given experiment type are launched in the background and run concurrently within the same job, so they share the single allocated GPU and must fit in its memory together. Adjust the paths, arguments, and experiment types according to your specific setup.
Complete script:
#!/bin/bash
#SBATCH --partition=<your_partition_name>
#SBATCH --job-name=multi_exp_trials
#SBATCH --nodes=1
#SBATCH --gres=gpu:1 # Request one GPU per job
#SBATCH --mem=32GB # CPU (host) memory
#SBATCH --ntasks-per-node=1
#SBATCH --time=1-00
#SBATCH --cpus-per-gpu=1
#SBATCH --error=../logs/errors.err
#SBATCH --output=../logs/output.out
# Full path to the directory containing your Python files
python_files_dir="/path/to/your/python_files_directory"
# Conda environment name
conda_env="your_conda_environment_name"
# Activate the Conda environment
source activate $conda_env
# Define your experiment types and their respective command-line arguments
# Replace these with the actual relative paths to your Python files and arguments
experiment_commands=(
"python ${python_files_dir}/experiment1.py --arg1 value1 --arg2 value2"
"python ${python_files_dir}/experiment2.py --arg3 value3"
"python ${python_files_dir}/experiment3.py --arg4 value4 --arg5 value5"
)
# Loop through each experiment type
for exp_cmd in "${experiment_commands[@]}"; do
    # Run 10 trials of each experiment type
    for trial in {1..10}; do
        echo "Running trial $trial of $exp_cmd"
        $exp_cmd &
    done
    wait # Wait for all trials of the current experiment type to finish
    echo "All trials of $exp_cmd completed"
done
# Deactivate the Conda environment
conda deactivate
wait # Wait for all experiment types to finish
echo "All experiments completed"
Efficiently managing experiments on HPC clusters demands a streamlined approach. Slurm provides efficient job scheduling, while Conda simplifies environment management, ensuring reproducibility across experiments.
By leveraging Slurm for job management and Conda for environment isolation, researchers and data scientists can seamlessly conduct experiments, ensuring reliability, reproducibility, and scalability in their work.
Feel free to customize this script and approach to suit your specific cluster setup and experiment requirements. Using Slurm and Conda in combination offers a robust framework for managing and executing experiments at scale, providing researchers with the computational power and environment consistency necessary for successful experimentation.
I would appreciate any feedback regarding this setup. Let's learn and grow together.
Cheers! ✌️