Welcome to the Center for AI Safety Cluster
03 May 2023
Cluster Overview
The cluster is hosted on OCI and is based on 32 bare metal BM.GPU.A100-v2.8 nodes and a number of service nodes. Each GPU node is configured with 8 NVIDIA A100 80GB GPUs, 27.2 TB of local NVMe SSD storage and two 64-core AMD EPYC Milan CPUs, for a total of 256 GPUs, 4,096 CPU cores and 870 TB of file system storage.
The nodes are connected by a remote direct memory access (RDMA) network for data communication. Each node has eight 2 x 100 Gbps network interface cards (NICs), providing a total of 1,600 Gbit/sec inter-node network bandwidth with latency as low as single-digit microseconds.
The cluster runs Ubuntu 22.04 and is managed using Ansible and Terraform. We are in the process of implementing containerization using Singularity. The scheduling system for running jobs on the cluster is SLURM. Storage is managed using the WekaFS Distributed Parallel Filesystem.
SSH fingerprints:
2048 SHA256:AQtXOzqLY1Bzxrzcg/7emNCZPM+oR2UBWB3QvviQq9M watch-tower-login (RSA)
256 SHA256:QtxZcVFXr0EsnJBfmx/5yhoCmAYSmP43x1nJHS9K78c watch-tower-login (ECDSA)
256 SHA256:DiFgOsIH1WEUEDn6lg0ImIg7ok5RE+i4R26uxUhXjQA watch-tower-login (ED25519)
Getting Started
Getting Cluster Access
There is a form linked on the compute cluster’s page on CAIS’ website to apply for access.
Getting Help
To request help, please log in to the cluster’s Slack workspace and message us in the #help-desk channel. For questions before being granted access, please email compute@safe.ai.
Basic Cluster Usage Example
Once you’ve given us your SSH key and we’ve provided your login credentials, you can SSH into the login node.
ssh -i {path-to-private-key} {username}@compute.safe.ai
# Alternatively configure your .ssh/config file to make a nice short version
ssh cais
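As a sketch, an entry along the following lines in your ~/.ssh/config gives you the short ssh cais form (the alias, username and key path are placeholders; substitute your own):
Host cais
    HostName compute.safe.ai
    User {username}
    IdentityFile {path-to-private-key}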
Important note: Please do not do any work on the login node. This includes installing things (request a CPU node to do so).
Test out requesting a node.
# request 1 cpu on 1 node
srun --pty bash
# Exit from the compute node to request a new node
exit # or hit Ctrl+d
# On the login node again
# request 2 gpus on 1 node
srun --gpus-per-node=2 --pty bash
# this is more convenient but can fail if you're doing multinode
# so we suggest the above command
srun --gpus=2 --pty bash
Note: some users may need to add --partition=single before --pty bash when requesting a node.
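For example, the two-GPU interactive request above would then look like this (a sketch; only use this form if the plain request fails for you):
srun --gpus-per-node=2 --partition=single --pty bash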
Example sbatch file
This is a quick example of running jobs non-interactively, i.e. placing them in the queue. The suggested workflow is to debug and get things working with srun, then transition to putting jobs into the queue.
This will request 1 GPU on each of 2 nodes, so 2 GPUs total. It also specifies the interactive partition; feel free to remove that line to have it run on the big/normal partition.
#!/bin/bash
#SBATCH --nodes=2
#SBATCH --gpus-per-node=1
#SBATCH --time=10:00
#SBATCH --partition=interactive # OPTIONAL can be removed to run on big/normal partition
#SBATCH --job-name=Example
# Recommended way if you want to enable gcc version 10 for the "sbatch" session
source /opt/rh/devtoolset-10/enable
gcc --version # if you print it out again here it'll be version 10
python --version # prints out the python version. Replace this with a python call to whatever file.
sleep 5
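Assuming you saved the script above as example.sbatch (the filename is just an example), you can submit it and watch its place in the queue with:
sbatch example.sbatch
squeue -u $USER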
Sharing files and folders with other users
For security reasons, if you make your directory readable or executable by other users, you will be locked out from SSHing into the cluster. This prevents other users from manipulating your SSH keys and such. If you do wish to share, we’ve made the directories /data/public_models, /data/private_models and /data/datasets usable by everyone, and you can share files and folders there.
How to request a shared folder
We can set up a folder in the /private_models directory for you to share files with your team. To make it easier to keep track of who is using shared folders in /private_models, and to remove files that are no longer in use once projects are finished, we ask everyone who needs a new shared folder for their team to first fill out this brief form: https://airtable.com/appeMGyDPWhtwa3Dw/shr0sUqxVLwokliXW . We will then set up the folder and grant access to the relevant team members.
This process does not apply to the /public_models folder, which should only be used for models that other teams are likely to want for their research projects.
Public models and datasets
Many commonly used models (Llama, Mistral, Pythia etc.) can be found in the /data/public_models folder, so please check this before downloading them again. Similarly, some popular datasets can be found in the /data/datasets folder.
How to Use the Global Hugging Face Model Cache
The CAIS cluster provides a global Hugging Face model cache to facilitate efficient access to popular resources without affecting your file system quota. This cache is maintained by the CAIS cluster administrators and is regularly updated with frequently used models. Below is a guide on how to utilize this resource.
Accessing the Global Cache
The global Hugging Face cache is located at /data/huggingface/. You can use the cache by either setting the cache_dir argument or by setting the HF_HOME environment variable. For example:
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
cache_dir = "/data/huggingface/"
model_name = "meta-llama/Meta-Llama-3-8B-Instruct"
print("Loading model...")
model = AutoModelForCausalLM.from_pretrained(model_name, device_map="auto", cache_dir=cache_dir)
tokenizer = AutoTokenizer.from_pretrained(model_name, cache_dir=cache_dir)
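Alternatively, as noted above, you can point the HF_HOME environment variable at the shared cache (for example in your sbatch file or shell profile) so libraries pick it up without passing cache_dir explicitly; a minimal sketch:
export HF_HOME=/data/huggingface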
Requesting Models to be Added to the Cache
If you require a model that is not already cached, you can request it by posting in the #shared-models Slack channel. Please include the full name, such as meta-llama/Meta-Llama-3.1-8B. Requests are typically processed within 24 hours, subject to approval.
You can view the list of models currently available in the global cache on our Shared Models Tracker.
Cache Maintenance and Updates
The global cache is updated regularly, with popular models added automatically. Models that have not been used for over 60 days will be removed.
Troubleshooting
Here are some common issues and how to resolve them:
- Model Not Found: Double-check the model name for typos. Refer to the Shared Models Tracker for the correct path.
- Missing Dependencies: Ensure all required Python packages are installed.
- Missing Token: Ensure you have an authentication token for models that require them. Consult the Hugging Face documentation for further information.
For additional help, reach out via the #shared-models-data Slack channel.
Using Custom Models
If you need to cache a custom model locally, feel free to make use of a local HF cache. However, be mindful of storage quotas and only use local caching when necessary.
How to request additional filesystem storage
By default, all users of the cluster are limited to 500 GB of file system storage on the cluster. If you need more storage for your project, you can submit an application indicating how much additional storage you need and for how long. We are usually able to provide a decision within 2-3 days.
Package Management
Install Miniconda or Anaconda
We suggest installing Anaconda or Miniconda to facilitate installing many other packages on the server. Here’s how we installed Miniconda.
curl https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh -o Miniconda3-latest-Linux-x86_64.sh
chmod +x Miniconda3-latest-Linux-x86_64.sh
./Miniconda3-latest-Linux-x86_64.sh
# close and reopen shell or say yes to configure the current shell.
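Once conda is available, you will typically want to create and activate a project environment before installing packages; a minimal sketch (the environment name and Python version are just examples):
conda create -n myproject python=3.10
conda activate myproject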
Example of how to install PyTorch, a popular deep learning library. Similar commands exist for TensorFlow etc.
pip3 install torch torchvision torchaudio
Suggested Installations
If you use tmux, we recommend installing tmux-resurrect. This prevents the unfortunate case where, if tmux dies, so too do all your tabs. With this plugin you can resurrect your sessions (if you save them).
SLURM Notes
The priority of your jobs in the SLURM queue is determined based on the size of the job, the time you’ve been waiting and your group’s previous usage relative to other groups (“fair share”, in SLURM’s terminology).
You should aim to request the minimum number of GPUs needed to run your job efficiently. Requesting more GPUs than needed will cause your group’s priority to decline faster than otherwise, which means you will need to wait longer in the queue to run jobs in the future. You can check the utilisation of GPU cores and memory to see if jobs are running efficiently using:
srun --jobid={jobid}
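For example, one way to attach to a running job and inspect its GPUs is the sketch below ({jobid} is a placeholder; --overlap lets the new step share the job's resources):
srun --jobid={jobid} --overlap --pty nvidia-smi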
We have configured SLURM to send email and Slack notifications for various job stages (begin, fail, requeue, complete, etc.); see the Configuring Notifications section below for details and the required SBATCH lines.
Important note: Please do not do any work on the login node. This includes installing things (request a CPU node to do so).
SLURM Example Commands
# From the login node
srun --gpus=2 --pty bash # requests two GPUs to run interactively
# The rest of the commands will work from any node
sbatch launch.sbatch # Start a job and place it in the queue
scancel #JOB_NUM# # Pass in the Job number for a specific job to cancel.
scancel --user $USER # Cancel all jobs that I've started.
sinfo # See how many of the nodes are in use.
squeue # See where your job is in the queue.
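Two more commands that can be useful when inspecting jobs (sketches; {jobid} is a placeholder, and sacct is only informative where job accounting is enabled):
scontrol show job {jobid} # Detailed information about a pending or running job
sacct -j {jobid} # Accounting information for a completed job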
Specific Topics
Jupyter Notebooks on the Cluster
See Nathaniel Li’s instructions as well.
# I have configured my .ssh/config so I can do ssh cais_cluster
# If you have not replace cais_cluster with user_name@compute.safe.ai
ssh cais_cluster
# See earlier section about installing anaconda.
# install jupyter notebooks if you haven't already
conda install -c anaconda jupyter
# or use pip to install jupyter notebook
# Get on a compute node
srun --pty bash
# you'll need to note which compute-permanent-node-## you are on for below.
# Run the following on the compute node
unset XDG_RUNTIME_DIR
export NODEPORT=$(( $RANDOM + 1024 ))
echo $NODEPORT
# You will need the node port so please copy it
jupyter notebook --no-browser --port=$NODEPORT
# You will also need the notebook link that will get printed
# The notebook link will look something like this:
# http://localhost:19303/?token=cb2b708e5468268ase8c46448fc28e78bd049a977cdcbd65d1
Start a new terminal window
# Paste the nodeport from above here replacing the #s
export NODEPORT=####
ssh -t -t cais_cluster -L ${NODEPORT}:localhost:${NODEPORT} ssh -N compute-permanent-node-### -L ${NODEPORT}:localhost:${NODEPORT}
Finally, open up your favorite browser and paste in the link, e.g. http://localhost:19303/?token=cb....
Switching shell such as to ZSH
Add the following to the end of your .bashrc file to select which shell you’d like to run. The different shells are installed into /usr/bin/ by default.
# Run zsh
if [ "$SHELL" != "/usr/bin/zsh" ]
then
export SHELL="/usr/bin/zsh"
exec /usr/bin/zsh
fi
If your favorite shell is not installed on the system, just ask us to add it.
Installing cmake
cmake can be installed via pip, so we recommend that approach. The cmake installed via the system package manager is old, and installing from source makes upgrades more difficult.
pip install cmake
Docker Support
This is on our roadmap for the coming months. Please poke us in the Slack if you want this support sooner and we can either reprioritize it or give you our status update on it.
Configuring Notifications
We have configured SLURM for sending email and Slack notifications for various job stages (begin, fail, requeue, complete, etc.) from the do-not-reply@safe.ai email address (check spam). It is also capable of interfacing with other notification platforms, so if you would like us to configure it for another platform, let us know and we will look into it. We are using goslmailer for the notifications.
You can add notifications to your job by adding the following lines to your SBATCH file:
#SBATCH --mail-user=mailto:{email},slack:{slack-member-id}
#SBATCH --mail-type=ALL
Feel free to replace {email} with one of your choice and {slack-member-id} with your personal one for the CAIS Compute Cluster workspace (how to find your member id). If you would like to use only Slack or only email for notifications, you can do this by including just that service in your sbatch file.
Example:
#SBATCH --mail-user=mailto:{email}
#SBATCH --mail-type=ALL
You can also configure the messages using any of the mail-type options found in SLURM by default.
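For example, if you only want a message when a job ends or fails, something like the following should work:
#SBATCH --mail-user=mailto:{email}
#SBATCH --mail-type=END,FAIL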
VS Code on the Cluster
In our investigations, some extensions such as AI autocomplete tools (TabNine and Copilot) have significant CPU utilization. At the moment these are not a problem, but we may have to disable these extensions if their usage increases. The main performance impact of VS Code on the cluster is RAM usage, which is almost entirely determined by the number of extensions installed. This is also impacted by the size of the working directory that you have open in VS Code. We therefore suggest being mindful of how many extensions you install, as more than 10 will begin to impact the responsiveness of your connection to the cluster (this may vary slightly depending on the extensions).
There are two primary extensions you can use to develop remotely using VS Code. The first, the Remote SSH extension, can easily be set up for use with the server using the example ssh setup. We do require you to change one setting for this extension:
- Go to the extension settings and paste remote.SSH.remoteServerListenOnSocket into the search bar at the top. Then make sure the option has a checkmark (it is disabled by default).
The second option is the Remote Tunnel extension. This extension requires you to ssh into the server in a normal terminal window and then run the service manually. It is, however, easier to use on an interactive session of a compute node than Remote SSH, and it allows you to develop in a browser (and/or the desktop application). We strongly encourage you to run the installation commands from your home directory on the cluster (/data/yourname) so the VS Code folders are created with the correct privacy permissions.
Both of these extensions deliver nearly identical development experiences and differ only in their connection method to the server.
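As a rough sketch of the Remote Tunnel workflow, assuming you have already downloaded the standalone code CLI into your home directory (the path below is a placeholder): start the tunnel from an interactive session on a compute node and follow the authentication instructions it prints.
# from an interactive session on a compute node
cd /data/yourname
./code tunnel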
Distributed Training Example
Basic Example
The following requires PyTorch 2.0. The example is meant to be simple enough to serve as a starting point.
The following file is launch.sbatch. You run it with sbatch launch.sbatch.
#!/bin/bash
#SBATCH --nodes 2
#SBATCH --gpus-per-node=8
#SBATCH --job-name dist_gpu
#SBATCH --partition interactive
export MASTER_ADDR=$(scontrol show hostname ${SLURM_NODELIST} | head -n 1)
echo $MASTER_ADDR
srun torchrun --nnodes=2 --nproc_per_node=8 --rdzv_id=100 --rdzv_backend=c10d --rdzv_endpoint=$MASTER_ADDR:29400 dist_training.py
The above file will call the python file below, which we named dist_training.py. You can see the name at the end of the srun command.
import torch
import torch.distributed as dist
import torch.nn as nn
import torch.optim as optim
from torch.nn.parallel import DistributedDataParallel as DDP

class ToyModel(nn.Module):
    def __init__(self):
        super(ToyModel, self).__init__()
        self.net1 = nn.Linear(10, 10)
        self.relu = nn.ReLU()
        self.net2 = nn.Linear(10, 5)

    def forward(self, x):
        return self.net2(self.relu(self.net1(x)))

def demo_basic():
    dist.init_process_group("nccl")
    rank = dist.get_rank()
    print(f"Start running basic DDP example on rank {rank}.")

    # create model and move it to GPU with id rank
    device_id = rank % torch.cuda.device_count()
    model = ToyModel().to(device_id)
    ddp_model = DDP(model, device_ids=[device_id])
    print(f"Moving model to GPUs.")

    loss_fn = nn.MSELoss()
    optimizer = optim.SGD(ddp_model.parameters(), lr=0.001)

    optimizer.zero_grad()
    outputs = ddp_model(torch.randn(20, 10))
    labels = torch.randn(20, 5).to(device_id)
    loss_fn(outputs, labels).backward()
    optimizer.step()
    print(f"Took a gradient step.")
    print(f"Done.")

if __name__ == "__main__":
    demo_basic()
Enabling debugging for distributed training
This allows for tracebacks to be recorded.
from torch.distributed.elastic.multiprocessing.errors import record
# Note main should be the main entry point of the code not just a random function.
# It won't work if your whole code is not in a function for instance.
@record
def main():
    ...  # rest of code