CAIS Compute Cluster Resources and Documentation

Welcome to the Center for AI Safety Cluster

Cluster Overview

The cluster is hosted on OCI and is based on 32 bare metal BM.GPU.A100-v2.8 nodes and a number of service nodes. Each GPU node is configured with 8 NVIDIA A100 80GB GPUs, 27.2 TB of local NVMe SSD storage and two 64-core AMD EPYC Milan CPUs, for a total of 256 GPUs, 4,096 CPU cores and 870 TB of file system storage.

The nodes are connected by a remote direct memory access (RDMA) network for data communication. Each node has eight 2 x 100 Gbps network interface cards (NICs), providing a total of 1,600 Gbit/sec inter-node network bandwidth with latency as low as single-digit microseconds.

The cluster is run on Ubuntu 22.04 and is managed using Ansible and Terraform. We are in the process of implementing containerization using Singularity. The scheduling system for running jobs on the cluster is SLURM. Storage is managed using the WekaFS Distributed Parallel Filesystem.

SSH fingerprints:

2048 SHA256:AQtXOzqLY1Bzxrzcg/7emNCZPM+oR2UBWB3QvviQq9M watch-tower-login (RSA)
256 SHA256:QtxZcVFXr0EsnJBfmx/5yhoCmAYSmP43x1nJHS9K78c watch-tower-login (ECDSA)
256 SHA256:DiFgOsIH1WEUEDn6lg0ImIg7ok5RE+i4R26uxUhXjQA watch-tower-login (ED25519)

Getting Started

Getting Cluster Access

There is a form linked on the compute cluster’s page on CAIS’ website to apply for access.

Getting Help

To request help, please log in to the cluster's Slack workspace and message us in the #help-desk channel. For questions before you have been granted access, please email compute@safe.ai.

Basic Cluster Usage Example

Once you’ve given us your ssh key and we’ve provided your login credentials, you can SSH onto the login node.

ssh  -i {path-to-private-key} {username}@compute.safe.ai

# Alternatively configure your .ssh/config file to make a nice short version
ssh cais
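
For example, a minimal ~/.ssh/config entry for the short alias above might look like the following (the alias name and key path are placeholders; adjust them to your setup):

Host cais
    HostName compute.safe.ai
    User {username}
    IdentityFile {path-to-private-key}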

Important note: Please do not do any work on the login node. This includes installing things (request a CPU node to do so).

Test out requesting a node.

# request 1 cpu on 1 node
srun --pty bash

# Exit from the compute node to request a new node
exit  # or hit Ctrl+d

# On the login node again
# request 2 gpus on 1 node
srun --gpus-per-node=2 --pty bash

# This is more convenient but can fail for multi-node jobs,
# so we suggest the command above.
srun --gpus=2 --pty bash

Note: some users may need to add --partition=single before --pty bash when requesting a node.
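
For example, the two-GPU interactive request above with the partition specified explicitly would be:

srun --gpus-per-node=2 --partition=single --pty bash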

Example sbatch file

This is a quick example of running jobs non-interactively by placing them in the queue. The suggested workflow is to debug and get things working with srun, then transition to submitting jobs to the queue.

This will request 1 GPU on each of 2 nodes, so 2 GPUs total. It also specifies the interactive partition; feel free to remove that line to run on the big/normal partition.

#!/bin/bash
#SBATCH --nodes=2
#SBATCH --gpus-per-node=1
#SBATCH --time=10:00
#SBATCH --partition=interactive  # OPTIONAL can be removed to run on big/normal partition
#SBATCH --job-name=Example 

# Recommended way if you want to enable gcc version 10 for the "sbatch" session 
source /opt/rh/devtoolset-10/enable

gcc --version  # if you print it out again here it'll be version 10 

python --version  # prints out the python version.  Replace this with a python call to whatever file.

sleep 5
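
To submit the script, save it to a file (the name below is arbitrary) and pass it to sbatch; you can then watch its progress in the queue:

sbatch example.sbatch     # put the job in the queue
squeue --user $USER       # see where it is in the queue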

Sharing files and folders with other users

As a security measure, if you make your directory readable or executable by other users, you will be locked out of SSHing into the cluster. This prevents other users from manipulating your SSH keys and the like. If you do wish to share files, we've made the directories /data/public_models, /data/private_models and /data/datasets usable by everyone, and you can share files and folders there.

How to request a shared folder

We can set up a folder in the /private_models directory for you to share files with your team. To make it easier to keep track of who is using shared folders in /private_models, and to be able to remove files that are no longer in use once projects are finished, we ask everyone who needs a new folder for sharing files with their team to first fill out this brief form: https://airtable.com/appeMGyDPWhtwa3Dw/shr0sUqxVLwokliXW. We will then set up the folder and grant access to the relevant team members.

This process does not apply to the /public_models folder, which should only be used for models that other teams are likely to want for their research projects.
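
Once your shared folder has been created, sharing files with your team is simply a matter of copying them into it (the folder and file names below are placeholders):

cp my_results.tar.gz /data/private_models/{team-folder}/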

Public models and datasets

Many commonly used models (Llama, Mistral, Pythia etc.) can be found in the /data/public_models folder, so please check this before downloading them again. Similarly, some popular datasets can be found in the /data/datasets folder.
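
For example, before downloading a model or dataset yourself, you can check whether it is already available (the grep pattern is just an illustration):

ls /data/public_models | grep -i llama
ls /data/datasets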

How to Use the Global Hugging Face Model Cache

The CAIS cluster provides a global Hugging Face model cache to facilitate efficient access to popular resources without affecting your file system quota. This cache is maintained by the CAIS cluster administrators and is regularly updated with frequently used models. Below is a guide on how to utilize this resource.

Accessing the Global Cache
The global Hugging Face cache is located at /data/huggingface/. You can use the cache by either setting the cache_dir argument or by setting the HF_HOME environment variable. For example:

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

cache_dir = "/data/huggingface/"
model_name = "meta-llama/Meta-Llama-3-8B-Instruct"

print("Loading model...")
model = AutoModelForCausalLM.from_pretrained(model_name, device_map="auto", cache_dir=cache_dir)
tokenizer = AutoTokenizer.from_pretrained(model_name, cache_dir=cache_dir)
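
Alternatively, you can point the Hugging Face libraries at the shared cache for a whole shell session (for example in your sbatch script or .bashrc) by setting the HF_HOME environment variable:

export HF_HOME=/data/huggingface/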

Requesting Models to be Added to the Cache
If you require a model that is not already cached, you can request it by posting in the #shared-models Slack channel. Please include the full name, such as meta-llama/Meta-Llama-3.1-8B. Requests are typically processed within 24 hours, subject to approval.

You can view the list of models currently available in the global cache on our Shared Models Tracker.

Cache Maintenance and Updates
The global cache is updated regularly, with popular models added automatically. Models that have not been used for over 60 days will be removed.

Troubleshooting
Here are some common issues and how to resolve them:

  • Model Not Found: Double-check the model name for typos. Refer to the Shared Models Tracker for the correct path.
  • Missing Dependencies: Ensure all required Python packages are installed.
  • Missing Token: Ensure you have an authentication token for models that require them. Consult the Hugging Face documentation for further information.

For additional help, reach out via the #shared-models-data Slack channel.

Using Custom Models
If you need to cache a custom model locally, feel free to make use of a local HF cache. However, be mindful of storage quotas and only use local caching when necessary.

How to request additional filesystem storage

By default, all users of the cluster are limited to 500 GB of file system storage on the cluster. If you need more storage for your project, you can submit an application indicating how much additional storage you need and for how long. We are usually able to provide a decision within 2-3 days.
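
To see how much of your quota you are currently using, one simple option is to summarize your home directory (this can take a while for large directories):

du -sh ~    # total size of your home directory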

Package Management

Install Miniconda or Anaconda

We suggest installing Anaconda or Miniconda to make it easier to install other software on the cluster. Here's how we installed Miniconda.

curl https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh -o Miniconda3-latest-Linux-x86_64.sh

chmod +x Miniconda3-latest-Linux-x86_64.sh

./Miniconda3-latest-Linux-x86_64.sh
# close and reopen shell or say yes to configure the current shell.

Example of how to install PyTorch, a popular deep learning library. Similar commands exist for TensorFlow etc.

pip3 install torch torchvision torchaudio
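
One way to keep project dependencies isolated is to create a dedicated conda environment first (the environment name and Python version below are just examples):

conda create -n myproject python=3.11
conda activate myproject
pip3 install torch torchvision torchaudio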

Suggested Installations

If you use tmux we recommend installing tmux-resurrect. This prevents the unfortunate case where, if tmux dies, so too do all your tabs. With this plugin you can resurrect your sessions (if you save them).
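
A minimal manual install sketch (the clone path is a placeholder; see the tmux-resurrect README for the canonical instructions):

# Clone the plugin somewhere in your home directory
git clone https://github.com/tmux-plugins/tmux-resurrect ~/tmux-resurrect

# Add this line to ~/.tmux.conf, then reload with: tmux source-file ~/.tmux.conf
run-shell ~/tmux-resurrect/resurrect.tmux

By default, prefix + Ctrl-s saves the session and prefix + Ctrl-r restores it.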

SLURM Notes

The priority of your jobs in the SLURM queue is determined based on the size of the job, the time you’ve been waiting and your group’s previous usage relative to other groups (“fair share”, in SLURM’s terminology).

You should aim to request the minimum number of GPUs needed to run your job efficiently. Requesting more GPUs than you need will cause your group's priority to decline faster than it otherwise would, which means you will wait longer in the queue to run jobs in the future. You can check the utilization of GPU cores and memory to see whether jobs are running efficiently using:

srun --jobid={jobid} nvidia-smi

SLURM is also configured to send email and Slack notifications for various job stages (begin, fail, requeue, complete, etc.) from the do-not-reply@safe.ai email address (check your spam folder). See the Configuring Notifications section below for how to add these to your sbatch file.

SLURM Example Commands

# From the login node
srun --gpus=2 --pty bash  # requests two GPUs to run interactively

# The rest of the commands will work from any node
sbatch launch.sbatch      # Start a job and place it in the queue

scancel {jobid}           # Pass in the job number of a specific job to cancel.

scancel --user $USER      # Cancel all jobs that I've started.

sinfo                     # See how many of the nodes are in use.

squeue                    # See where your job is in the queue.


Specific Topics

Jupyter Notebooks on the Cluster

See Nathaniel Li’s instructions as well.

# I have configured my .ssh/config so I can do ssh cais_cluster
# If you have not, replace cais_cluster with user_name@compute.safe.ai
ssh cais_cluster

# See earlier section about installing anaconda.
# install jupyter notebooks if you haven't already
conda install -c anaconda jupyter
# or use pip to install jupyter notebook

# Get on a compute node
srun  --pty bash

# you'll need to note which compute-permanent-node-## you are on for below.

# Run the following on the compute node
unset XDG_RUNTIME_DIR
export NODEPORT=$(( $RANDOM + 1024 ))
echo $NODEPORT

# You will need the node port so please copy it

jupyter notebook --no-browser --port=$NODEPORT

# You will also need the notebook link that will get printed
# The notebook link will look something like this: 
#  http://localhost:19303/?token=cb2b708e5468268ase8c46448fc28e78bd049a977cdcbd65d1

Start a new terminal window

# Paste the nodeport from above here replacing the #s
export NODEPORT=####

ssh -t -t cais_cluster -L ${NODEPORT}:localhost:${NODEPORT} ssh -N compute-permanent-node-### -L ${NODEPORT}:localhost:${NODEPORT}

Finally, open up your favorite browser and paste in the link (e.g. http://localhost:19303/?token=cb....).

Switching shell such as to ZSH

Add the following to the end of your .bashrc file to switch to the shell you'd like to run. The different shells are installed into /usr/bin/ by default.

# Run zsh
if [ "$SHELL" != "/usr/bin/zsh" ]
then
    export SHELL="/usr/bin/zsh"
    exec /usr/bin/zsh
fi

If your favorite shell is not installed on the system, just ask us to add it.

Installing cmake

cmake can be installed via pip, so we recommend that approach. The cmake available from the system package manager is old, and installing from source makes upgrades more difficult.

pip install cmake

Docker Support

This is on our roadmap for the coming months. Please poke us in the Slack if you want this support sooner and we can either reprioritize it or give you our status update on it.

Configuring Notifications

We have configured SLURM to send email and Slack notifications for various job stages (begin, fail, requeue, complete, etc.) from the do-not-reply@safe.ai email address (check your spam folder). It is also capable of interfacing with other notification platforms, so if you would like us to configure it for another platform, let us know and we will look into it. We are using goslmailer for the notifications.

You can add notifications to your job by adding the following lines to your SBATCH file:

#SBATCH --mail-user=mailto:{email},slack:{slack-member-id}
#SBATCH --mail-type=ALL

Feel free to replace {email} with one of your choice and {slack-member-id} with your personal one for the CAIS Compute Cluster workspace (how to find your member id). If you would only like to use Slack or email for notifications and not the other, you can do this by only including that service in your sbatch.

Example:

#SBATCH --mail-user=mailto:{email}
#SBATCH --mail-type=ALL

You can also configure which messages you receive using any of the standard mail-type options supported by SLURM.
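
For example, to be notified only when a job finishes or fails (a sketch using standard SLURM mail-type values):

#SBATCH --mail-user=mailto:{email}
#SBATCH --mail-type=END,FAIL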

VS Code on the Cluster

In our investigations, some extensions, such as AI autocomplete tools (TabNine and Copilot), have significant CPU utilization. At the moment these are not a problem, but we may have to disable such extensions if their usage increases. The main performance impact of VS Code on the cluster is RAM usage, which is almost entirely determined by the number of extensions installed, and is also affected by the size of the working directory you have open in VS Code. We therefore suggest being mindful of how many extensions you install; more than about 10 will begin to impact the responsiveness of your connection to the cluster (this may vary slightly depending on the extensions).

There are two primary extensions you can use to develop remotely using VS Code. The first, the Remote SSH extension, can easily be set up for use with the server using the example SSH setup. We do require you to change one setting for this extension:

  • Go to the extension settings and paste remote.SSH.remoteServerListenOnSocket into the search bar at the top. Then make sure the option has a checkmark (disabled by default).

The second option is the Remote Tunnel extension. This extension requires you to ssh into the server in a normal terminal window then run the service manually. It is, however, easier to use on an interactive session of a compute node than Remote SSH and allows you to develop in a browser (and/or the desktop application). We strongly encourage you to run the installation commands from your home directory on the cluster (/data/yourname) so the VS Code folders are created with the correct privacy permissions.
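
A sketch of the manual steps for the Remote Tunnel route, based on the VS Code CLI documentation (check the current upstream instructions, as the download URL may change):

# From your home directory on the cluster (/data/yourname)
curl -Lk 'https://code.visualstudio.com/sha/download?build=stable&os=cli-alpine-x64' --output vscode_cli.tar.gz
tar -xf vscode_cli.tar.gz
./code tunnel    # then follow the prompts to authenticate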

Both of these extensions deliver nearly identical development experiences and differ only in their connection method to the server.

Distributed Training Example

Basic Example

The following requires PyTorch 2.0. The example is meant to be simple enough to serve as a starting point.

The following file is the launch.sbatch. You run it with sbatch launch.sbatch.

#!/bin/bash
#SBATCH --nodes 2
#SBATCH --gpus-per-node=8
#SBATCH --job-name dist_gpu
#SBATCH --partition interactive

export MASTER_ADDR=$(scontrol show hostname ${SLURM_NODELIST} | head -n 1)

echo $MASTER_ADDR

srun torchrun --nnodes=2 --nproc_per_node=8 --rdzv_id=100 --rdzv_backend=c10d --rdzv_endpoint=$MASTER_ADDR:29400 dist_training.py

The above file calls the Python file below, which we named dist_training.py. You can see the name at the end of the srun command.

import torch
import torch.distributed as dist
import torch.nn as nn
import torch.optim as optim

from torch.nn.parallel import DistributedDataParallel as DDP

class ToyModel(nn.Module):
    def __init__(self):
        super(ToyModel, self).__init__()
        self.net1 = nn.Linear(10, 10)
        self.relu = nn.ReLU()
        self.net2 = nn.Linear(10, 5)

    def forward(self, x):
        return self.net2(self.relu(self.net1(x)))


def demo_basic():
    dist.init_process_group("nccl")
    rank = dist.get_rank()
    print(f"Start running basic DDP example on rank {rank}.")

    # create model and move it to GPU with id rank
    device_id = rank % torch.cuda.device_count()
    model = ToyModel().to(device_id)
    ddp_model = DDP(model, device_ids=[device_id])
    print(f"Moving model to GPUs.")


    loss_fn = nn.MSELoss()
    optimizer = optim.SGD(ddp_model.parameters(), lr=0.001)

    optimizer.zero_grad()
    outputs = ddp_model(torch.randn(20, 10))
    labels = torch.randn(20, 5).to(device_id)
    loss_fn(outputs, labels).backward()
    optimizer.step()
    print(f"Took a gradient step.")
    print(f"Done.")

if __name__ == "__main__":
    demo_basic()

Enabling debugging for distributed training

This allows tracebacks from worker processes to be recorded so that they show up in the torchrun error summary.

from torch.distributed.elastic.multiprocessing.errors import record

# Note: main should be the main entry point of the code, not just any function.
# It won't work if your code is not wrapped in a function, for instance.
@record
def main():
    ... # rest of code