How to run Python experiments with Slurm and Conda

For data scientists and researchers, efficiently managing multiple machine learning experiments is fundamental to our work. Tools like Slurm and Conda offer powerful capabilities that streamline the execution of experiments on high-performance computing clusters while ensuring environment consistency and reproducibility. In this blog, we will take a look at how to schedule multiple experiments, or even multiple trials of the same experiment, using Slurm and a Conda environment.
Introduction to Slurm
Slurm (Simple Linux Utility for Resource Management) is an open-source, highly configurable workload manager used in high-performance computing (HPC) environments. It efficiently manages and schedules computing resources such as CPUs, GPUs, memory, and storage across a cluster of nodes.
With Slurm, users can submit and manage jobs, allocate resources based on job requirements, and monitor job status and resource utilization. Its robust capabilities make it a popular choice for orchestrating parallel computing tasks and handling complex job scheduling.
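For instance, a handful of everyday Slurm commands cover most of this workflow (the script name and job ID below are placeholders):
# Inspect the partitions and nodes available on the cluster
sinfo
# Submit a batch script to the scheduler
sbatch my_job.sh
# Check the status of your queued and running jobs
squeue -u $USER
# Cancel a job by its job ID
scancel <job_id>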
Setting Up Slurm
To utilize Slurm for running multiple experiments, we create an sbatch script. This script specifies various job parameters, such as the number of nodes, tasks per node, GPU allocation, partition, time limit, and output logs. For example:
#!/bin/bash
#SBATCH --partition=<your_partition_name>
#SBATCH --job-name=multi_exp_trials
#SBATCH --nodes=1
#SBATCH --gres=<gpu_type>:1 # Request one GPU per job
#SBATCH --mem=<number>GB # example, 32GB
#SBATCH --ntasks-per-node=1
#SBATCH --time=<days-hours> # example, 1-00 for 1 day
#SBATCH --cpus-per-gpu=1
#SBATCH --error=<log_path>/errors.err
#SBATCH --output=<log_path>/output.out
All of the arguments above are optional, and you can skip or add arguments as per your requirements. I generally do not set a time limit if I have no idea how long my experiment is going to take.
If you want a specific GPU type from the available ones, such as a P100, you can use #SBATCH --gres=p100:1
If you don't care about the GPU type, you can simply write #SBATCH --gres=gpu:1
Specifying the type through --gres is the older style. If you do not have a specific preference, you can simply request the required number of GPUs with #SBATCH --gpus=<number>
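Putting these options side by side, the GPU request line can take any of the following forms; pick whichever matches your cluster (note that some clusters expect the type as gpu:p100:1 rather than p100:1, so check your site's documentation):
#SBATCH --gres=p100:1 # one GPU of a specific type (here, a P100)
#SBATCH --gres=gpu:1 # one GPU of any type
#SBATCH --gpus=1 # newer syntax: one GPU, type unspecified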
Managing Environments with Conda
Conda, a versatile package and environment manager, plays a pivotal role in maintaining consistent environments for experiments. It enables users to create isolated environments with specific Python versions, packages, and dependencies.
In our sbatch script, we activate the desired Conda environment before running experiments:
# Conda environment name
conda_env="your_conda_environment_name"
# Full path to the directory containing your Python files
python_files_dir="/path/to/your/python_files_directory"
# Activate the conda environment
source activate $conda_env
This step ensures that the experiments execute within a controlled environment, reducing potential conflicts between dependencies and versions and thereby enhancing reproducibility. Notice that we have also set the path to the folder containing the Python files that we want to execute.
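Note that source activate is the older activation command and may not be available on newer Conda installations, particularly inside non-interactive batch scripts. A minimal alternative, assuming the conda command itself is on your PATH on the compute node, is to load the shell hook first:
# Make the conda shell function available in this non-interactive shell
eval "$(conda shell.bash hook)"
# Activate the environment by name
conda activate "$conda_env"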
Running Experiments
The following script defines different experiment types along with their respective command-line arguments:
# Define your experiment types and their respective command-line arguments
# Replace these with the actual relative paths to your Python files and arguments
experiment_commands=(
"python ${python_files_dir}/experiment1.py --arg1 value1 --arg2 value2"
"python ${python_files_dir}/experiment2.py --arg3 value3"
"python ${python_files_dir}/experiment3.py --arg4 value4 --arg5 value5"
)
# Loop through each experiment type
for exp_cmd in "${experiment_commands[@]}"; do
    # Run 10 trials of each experiment type
    for trial in {1..10}; do
        echo "Running trial $trial of $exp_cmd"
        $exp_cmd &
    done
    wait # Wait for all trials of the current experiment type to finish
    echo "All trials of $exp_cmd completed"
done
# Deactivate the Conda environment
conda deactivate
wait # Wait for all experiment types to finish
echo "All experiments completed"
The script allows customization by specifying the full path to the directory containing the Python files, the Conda environment to activate, and the command-line arguments for each experiment type. Such a setup ensures an organized and reproducible execution environment for experiments. This configuration runs ten trials of each experiment type while utilizing one GPU per job. Note that the ten trials of a given experiment type are launched in the background and run concurrently within the same job, so they share the single allocated GPU and must fit in its memory together. Adjust the paths, arguments, and experiment types according to your specific setup.
Complete script:
#!/bin/bash
#SBATCH --partition=<your_partition_name>
#SBATCH --job-name=multi_exp_trials
#SBATCH --nodes=1
#SBATCH --gres=gpu:1 # Request one GPU per job
#SBATCH --mem=32GB # CPU (host) memory
#SBATCH --ntasks-per-node=1
#SBATCH --time=1-00
#SBATCH --cpus-per-gpu=1
#SBATCH --error=../logs/errors.err
#SBATCH --output=../logs/output.out
# Full path to the directory containing your Python files
python_files_dir="/path/to/your/python_files_directory"
# Conda environment name
conda_env="your_conda_environment_name"
# Activate the Conda environment
source activate $conda_env
# Define your experiment types and their respective command-line arguments
# Replace these with the actual relative paths to your Python files and arguments
experiment_commands=(
"python ${python_files_dir}/experiment1.py --arg1 value1 --arg2 value2"
"python ${python_files_dir}/experiment2.py --arg3 value3"
"python ${python_files_dir}/experiment3.py --arg4 value4 --arg5 value5"
)
# Loop through each experiment type
for exp_cmd in "${experiment_commands[@]}"; do
    # Run 10 trials of each experiment type
    for trial in {1..10}; do
        echo "Running trial $trial of $exp_cmd"
        $exp_cmd &
    done
    wait # Wait for all trials of the current experiment type to finish
    echo "All trials of $exp_cmd completed"
done
# Deactivate the Conda environment
conda deactivate
wait # Wait for all experiment types to finish
echo "All experiments completed"
Efficiently managing experiments on HPC clusters demands a streamlined approach. Slurm provides efficient job scheduling, while Conda simplifies environment management, ensuring reproducibility across experiments.
By leveraging Slurm for job management and Conda for environment isolation, researchers and data scientists can seamlessly conduct experiments, ensuring reliability, reproducibility, and scalability in their work.
Feel free to customize this script and approach to suit your specific cluster setup and experiment requirements. Using Slurm and Conda in combination offers a robust framework for managing and executing experiments at scale, providing researchers with the computational power and environment consistency necessary for successful experimentation.
I would appreciate any feedback regarding this setup. Let's learn and grow together.
Cheers! ✌️