This page is currently under development. Review the Getting Started guide for information about obtaining an account, connecting to Ada and submitting basic jobs. This page assumes you are familiar with the Getting Started material.
- File system and copying data (i.e., where to store your data)
- Working with specific software environments (Python, R)
- Working with the SLURM queuing system
- Working with GPUs
File system and copying data
Warning: Ada storage is not backed up. Users are responsible for backing up their own data as needed.
Each user has home and storage directories on the shared high-speed storage (available on each node). The home directory (at /home/username or via the $HOME environment variable) is intended for storing scripts, code, executables, and small configuration files. Larger data should be placed in the storage directory at /storage/username (or via the $STORAGE environment variable).
Jobs that create temporary files needed only during execution can use the local SSD-based scratch storage on each node at /local/username (or via the $SCRATCH environment variable). Because it does not require network communication, local scratch storage can be more efficient for small or frequently accessed temporary files.
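For example, a job might stage its temporary files in scratch and copy only the final results back to shared storage when it finishes. A minimal sketch, where the job directory and file names are placeholders:
mkdir -p "${SCRATCH}/myjob"        # create a placeholder working directory in local scratch
cd "${SCRATCH}/myjob"
# ... run your program here, writing temporary files to the current directory ...
cp results.txt "${STORAGE}/"       # copy the results you want to keep back to shared storage
rm -rf "${SCRATCH}/myjob"          # clean up the scratch space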
You can download data from external resources directly to Ada via the wget or curl tools (see linked guide). To copy data between your personal computer and Ada you can use the scp command line utility or programs (like FileZilla or Cyberduck) that support scp or sftp (see linked guide). For example, to copy a file from your personal computer to Ada, execute the following in a local terminal (e.g., on your laptop or desktop computer), where username is your Middlebury username, local_file is the path to the local file on your personal computer, and remote_destination is the desired location on Ada, e.g., /storage/username.
scp local_file username@ada.middlebury.edu:remote_destination
You can copy files from Ada to your personal computer by reversing the order of the arguments to scp, e.g.,
scp username@ada.middlebury.edu:remote_file local_destination
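To copy an entire directory in either direction, add the -r (recursive) flag to scp, e.g.:
scp -r local_directory username@ada.middlebury.edu:remote_destination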
Working with specific software environments
Python
The conda package manager is the recommended approach for working with Python-based projects, especially those that rely on native libraries (e.g., machine learning). Conda enables you to create isolated environments for each project with specific versions of the necessary software and its dependencies, without needing administrative privileges.
To create and activate a new environment named my_py_env (the name is arbitrary) with Python 3.11 (you can leave out the Python version to install the current stable release or change the version as needed):
conda create --name my_py_env python=3.11
conda activate my_py_env
With the environment activated, you can install additional packages with conda, e.g.:
conda install numpy
or with pip (activating the environment updates your $PATH to point to the Python installed within the environment), e.g.:
python3 -m pip install numpy
Once you have created an environment, to use it you only need to activate it in your current shell. SLURM jobs inherit the current environment, so they will inherit your activated conda environment.
By default, conda installs packages from the official “defaults” channel. To access additional packages, add the conda-forge channel.
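For example, you can add conda-forge as a channel for all future installs, or specify it for a single install (package_name is a placeholder):
conda config --add channels conda-forge
conda install -c conda-forge package_name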
Using Torch
This section is currently under development. Following the Torch documentation, we create an environment that installs Torch and its dependencies. The Torch project no longer publishes conda packages directly; instead, we use conda to create the environment and then pip to install the relevant Torch packages (including GPU support):
module load cuda/12.6
conda create --name my_torch_env
conda activate my_torch_env
pip3 install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu126
You can verify successful installation by launching a short “interactive” job on one of the GPU nodes, as shown below.
(my_torch_env) [ada ~]$ srun --time=1:00:00 --partition=gpu-short --gres=gpu:1 --pty bash -i
[node020 ~]$ python3
Python 3.12.3 | packaged by conda-forge | (main, Apr 15 2024, 18:38:13) [GCC 12.3.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import torch
>>> torch.cuda.is_available()
True
R
You can also use the conda package manager to work with R. conda can be particularly useful when working with R libraries that depend on native code (e.g., scientific data storage tools). Note that you may need to use the conda-forge channel in addition to the default Anaconda channels.
Similar to Python, create a new conda environment for your project, but specify the r-base and r-essentials packages at creation. Here we create an environment named my_r_env with the latest R version (you can change the version by specifying a specific version of r-base).
conda create --name my_r_env r-base r-essentials
conda activate my_r_env
With the environment activated, you can install R packages provided through conda channels (typically prefixed with “r-”, e.g., r-ncdf4) or by starting R and using install.packages with CRAN. Note that with the environment activated, the current R executable is the one provided by the environment (separate from the system R or other environments you have created). Whenever you want to use the R packages associated with a specific environment, activate that environment in the shell as shown above. SLURM jobs inherit the current environment, so they will inherit your activated conda environment.
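For example, with the environment activated, you could install the ncdf4 package mentioned above from a conda channel:
conda install r-ncdf4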
Working with the SLURM queuing system
The SLURM job scheduling system fairly and efficiently schedules computational tasks across the cluster’s worker nodes according to the resources requested by each job. Users interact with SLURM via a set of control commands, most commonly sbatch for submitting jobs and squeue and sinfo for monitoring cluster status (check out this SLURM-provided cheatsheet for the commonly used options/resource specifiers). The most common use case is submitting “batch” jobs (i.e., a set of commands that run without user intervention), but you can also run “interactive” jobs.
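For example, a few commonly used commands for monitoring and managing jobs (username and job_id are placeholders):
squeue -u username      # list your pending and running jobs
sinfo                   # show the status of the partitions and nodes
scancel job_id          # cancel a job by its job ID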
Batch compute jobs
To submit a job, you typically write a Bash job script that specifies the command(s) to run and the resources required. An example job script, saved as the file slurm_serial.sh, is shown below. Note the #SBATCH comments at the top that specify the resources (memory, time) required for this job (using the same command line options as sbatch). As described on the Getting Started page, you would submit this job script with the command sbatch slurm_serial.sh. SLURM will respond with the job ID.
#!/usr/bin/env bash
# SLURM template for serial jobs
# Set SLURM options
#SBATCH --job-name=serial_test # Job name
#SBATCH --output=serial_test-%j.out # Output file incorporating job ID
#SBATCH --partition=standard # Partition (queue)
#SBATCH --time=00:05:00 # Time limit hrs:min:sec
#SBATCH --mem=100mb # Job memory request
# Print SLURM environment variables
echo "Job ID: ${SLURM_JOB_ID}"
echo "Node: ${SLURMD_NODENAME}"
# Start of job info
echo "Starting: "`date +"%D %T"`
# Your calculations here
printf "\nHello world from ${SLURMD_NODENAME}!\n\n"
# End of job info
echo "Ending: "`date +"%D %T"`
Correctly configuring your job requires determining the relevant partition (queue), based on execution time and any need for specialized resources (high memory, GPUs), as well as the execution time, number of nodes and cores, and memory per node. The requested partition and the time, memory, and other resource requirements need to match. For example, a job requesting 2 days submitted to the short partition, or requesting more memory than is available on nodes in the standard partition, will never execute (and you may or may not get a corresponding error message). Review the Resources page for the number of nodes, cores, and memory available for each class of machine.
| Workload type | Jobs ≤ 2 hours | Jobs ≤ 2 days | Jobs ≤ 7 days |
| --- | --- | --- | --- |
| CPU-based workloads needing < 256 GB of memory | short | standard | long |
| GPU-based workloads | gpu-short | gpu-standard | gpu-long |
| CPU-based workloads needing ≥ 256 GB of memory | himem-short | himem-standard | himem-long |
If your workload can take advantage of multiple CPU cores on a single node, you can request additional cores with the --cpus-per-task option. For example, adding the following to the SLURM options at the top of your job script would request 8 CPU cores for the job. You can request up to as many cores as the nodes have (see the Resources page). All cores will be on the same node and share main memory, as if the job were running on a single stand-alone workstation.
#SBATCH --cpus-per-task=8 # Number of CPU cores for this job
Note that your code must be able to take advantage of the multiple CPU cores. If you request multiple cores for a purely serial program (which can only use 1 CPU core), such as most Python or R programs that don’t use specialized libraries like Torch, the additional CPU cores will remain idle (wasting resources that could be used by others).
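If your program does support multiple threads or processes, a common pattern is to pass the number of allocated cores to it via the SLURM_CPUS_PER_TASK environment variable. A sketch, where my_program and its --threads option are placeholders for your own program and its parallelism flag:
export OMP_NUM_THREADS="${SLURM_CPUS_PER_TASK}"    # for OpenMP-based programs
./my_program --threads "${SLURM_CPUS_PER_TASK}"    # placeholder program and flag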
To maximize throughput, we suggest submitting your job to the shortest partition that allows enough execution time (with a buffer in case the job takes longer than expected) and only requesting the number of cores and the amount of memory (with a small buffer for overage) you will actually use. The “shorter” queues have higher priority. SLURM doesn’t over-allocate the worker nodes, even if jobs use less than the requested resources, so requesting just the resources you need will enable more of your (and others’) jobs to run at the same time.
If your workload can take advantage of GPUs, e.g., machine learning libraries, see the section below for details on submitting jobs that use GPUs.
Submitting many batch jobs at once with Array jobs
If your workload consists of many independent parts, e.g., running the same program on different input files, you can increase throughput by running each instance as a separate job (so that they can execute in parallel). You can do so with multiple sbatch invocations, or more efficiently as an “array” job that launches multiple instances of the same job script with unique identifiers.
- To submit an array job, use the SLURM option --array. For example, --array=0-4 will run 5 independent tasks, labeled 0-4 by the environment variable SLURM_ARRAY_TASK_ID.
- To allow each array task to perform a different calculation, you can use SLURM_ARRAY_TASK_ID as an input parameter to your calculation (see the sketch following the example script below).
- Each array task will appear as an independent job in the queue and execute independently.
- An entire array job can be canceled at once or each task can be canceled individually.
An example array job script is shown below. Assuming this script is saved as slurm_array.sh, you can submit the job with sbatch slurm_array.sh.
#!/usr/bin/env bash
# SLURM template for array jobs
# Set SLURM options
#SBATCH --job-name=array_test # Job name
#SBATCH --output=array_test-%A-%a.out # Output file incorporating job and array ID
#SBATCH --mem=100mb # Job memory request
#SBATCH --partition=standard # Partition (queue)
#SBATCH --time=00:05:00 # Time limit hrs:min:sec
#SBATCH --array=0-4 # Array range
# Print SLURM environment variables
echo "Job ID: ${SLURM_JOB_ID}"
echo "Array ID: ${SLURM_ARRAY_TASK_ID}"
echo "Node: ${SLURMD_NODENAME}"
# Start of job info
echo "Starting: "`date +"%D %T"`
# Your calculations here
printf "\nHello world from array task ${SLURM_ARRAY_TASK_ID}!\n\n"
# End of job info
echo "Ending: "`date +"%D %T"`
Interactive jobs
If you have a computationally intensive task that should run on the cluster’s worker nodes but is still interactive in some way (e.g., you are working with large datasets but are testing out commands or don’t yet know what you want to run), you can run an “interactive job”. For example, the following would run a 48 hour interactive job with 8G of memory. The srun command will wait until the job is successfully scheduled, at which point you will automatically be connected to a shell on the worker node. Use the exit command to end that shell and “return” to the head node.
srun --time=48:00:00 --partition=standard --mem=8G --pty bash -i
Working with GPUs
There are two steps to successfully launching jobs that utilize one or more GPUs (assuming your program can do so):
- Submit your job to one of the GPU-enabled partitions (queues), e.g., gpu-standard, described previously
- Request one or more GPUs as a SLURM “generic resource”, e.g., by specifying --gres=gpu:1 in your script or on the job submission command line. The 1 specifies the number of GPUs requested. You can further refine the “gres” specification with the specific model of GPU if needed for your application, e.g., --gres=gpu:rxa6000:1
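Putting the two together, the relevant SLURM options in a GPU job script might look like the following sketch (the time and memory values are placeholders to adjust for your job):
#SBATCH --partition=gpu-standard     # GPU-enabled partition (queue)
#SBATCH --gres=gpu:1                 # Request one GPU
#SBATCH --time=1-00:00:00            # Time limit (1 day, as an example)
#SBATCH --mem=8G                     # Job memory request (placeholder)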
Once the job is running, you can monitor GPU utilization with the following command, replacing <job id> with the ID of your GPU job (as reported by squeue):
srun --overlap --jobid=<job id> --pty nvidia-smi
This runs the NVIDIA system management tool to report GPU memory and other usage statistics within the allocation of your already running job. The --overlap argument is critical to sharing the resources of that existing job. An example output is:
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 565.57.01 Driver Version: 565.57.01 CUDA Version: 12.7 |
|-----------------------------------------+------------------------+----------------------+
| GPU Name Persistence-M | Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|=========================================+========================+======================|
| 0 NVIDIA RTX A6000 Off | 00000000:5E:00.0 Off | Off |
| 35% 63C P2 88W / 300W | 918MiB / 49140MiB | 0% Default |
| | | N/A |
+-----------------------------------------+------------------------+----------------------+
+-----------------------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=========================================================================================|
| 0 N/A N/A 40747 C python 908MiB |
+-----------------------------------------------------------------------------------------+
Older documentation is below
9. Checkpointing
Checkpointing your jobs running on Ada is recommended. Checkpointing stores the internal state of your calculation periodically so the job can be restarted from that state, e.g., if the node goes down or the wall clock limit is reached. Ideally, checkpointing is done internally in your application (it is built into many open source and commercial packages); if your application doesn’t support checkpointing internally, you can use an external checkpointing tool such as dmtcp. Here we’ll illustrate an example of using external checkpointing via dmtcp, found in the directory “ckpt-example” in the GitHub repository.
- We’ll illustrate checkpointing using a simple counter. First compile the executable “count” from the source code “counter.c” via:
gcc counter.c -o count
Now you should see the executable file “count”. Take a look at the slurm script slurm-ckpt-start.sh. The key line is:
timeout 15 dmtcp_launch --no-coordinator -p 0 -i 10 ./count
- “timeout” is a standard Linux utility that automatically stops whatever command follows it; the “15” is the length of time in seconds before the process is killed. You can also use units of hours and days, e.g., “timeout 47h”. Timeout is not necessary for checkpointing, but it lets you stop your job before the wall clock limit is reached and SLURM kills your job.
- “dmtcp_launch” is the command to start running your executable (in this case count) through the dmtcp checkpointing tool. We suggest you always use the “--no-coordinator -p 0” options to avoid interference with other jobs.
- The “-i” option sets how frequently dmtcp will store the state of your process to a checkpoint file. “-i 10” checkpoints every 10 seconds, much more frequently than you would ever want in practice (this is just so the example goes quickly). More reasonable for an actual job would be “-i 3600”, to checkpoint once an hour.
- In practice, the checkpointing syntax for “your_executable” might be something like:
timeout 47h dmtcp_launch --no-coordinator -p 0 -i 3600 your_executable
- Now submit the slurm script “slurm-ckpt-start.sh”:
sbatch slurm-ckpt-start.sh
- Once that job has completed, you should see a checkpoint file of the form “ckpt_count_*.dmtcp”. Your job can be restarted using the “dmtcp_restart” command, as found in “slurm-ckpt-restart.sh”:
sbatch slurm-ckpt-restart.sh
- You can restart and continue the job any number of times via the same restart script; e.g., try submitting the restart script a second time:
sbatch slurm-ckpt-restart.sh
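For reference, the restart command inside the restart script passes the checkpoint file to dmtcp_restart; a minimal sketch (check the script in the repository for the exact options used alongside timeout) is:
timeout 15 dmtcp_restart ckpt_count_*.dmtcp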
10. Sample jobs
10b. Serial Stata job
The primary difference between using Stata on the cluster and using Stata on your computer is learning how to run Stata in batch mode, that is, non-interactively. To use Stata on the cluster, you will need a shell script (*.sh) that inserts your Stata process into the Slurm queue and runs your Stata do file from the command line. You need basic Unix command skills, basic Slurm syntax and a Stata do file.
You can log in to Middlebury’s HPC repository on GitHub to see executable examples of both a serial Stata job and a parallel Stata job in the “Stata-examples” directory. A serial Stata job is the simplest, using a single processor on a single node to execute your calculations. Most Stata users who need the cluster to perform their calculations will want to use its parallel computing capabilities. Both the serial and parallel computing examples use “stata_auto.do” as the sample do file, so be sure to download it as well. Copy the shell script and do file to your home directory on Ada. The command to run the serial shell script is:
sbatch stata_serial.sh
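Inside the shell script, the key line runs your do file in Stata’s batch (non-interactive) mode. It might look something like the following (the exact Stata executable name on Ada may differ):
stata-mp -b do stata_auto.do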
10c. Parallel Stata job
Because we are using Stata MP (multiprocessor), the program already has built-in multiprocessor capabilities. Our license allows us to use up to 16 processors. Stata will automatically use as many processors as it can “see”, which is where the specifications in Slurm (the queuing software) are important. There is a single difference between the serial job syntax and the parallel job syntax for Stata, and that is to change “#SBATCH --cpus-per-task=1” to “#SBATCH --cpus-per-task=16” in the shell script, which makes 16 computing processors available to Stata (see the section above on requesting multiple CPU cores).
Copy the example script and do file to your home directory on Ada and type the following command:
sbatch stata_parallel.sh
12. Git repository
Sample slurm scripts and example jobs are available in the GitHub repository.
You can clone a copy of this repository to your home directory (or elsewhere) via the command:
git clone https://github.com/middlebury/HPC.git
