Running Llama 405B on VLLM with Slurm and Multiple Nodes
Llama 405B is a large model that requires a lot of memory just to hold its weights:
| Quantization Method | Weight Memory | ~# of 80GB A100s (weights only) |
|---|---|---|
| FP16 | 810 GB | 10 |
| INT8/FP8 | 405 GB | 6 |
| INT4 | 202 GB | 3 |
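These numbers are just parameter count times bytes per parameter; KV cache, activations, and CUDA overhead come on top. A quick back-of-the-envelope sketch of that arithmetic (405B parameters assumed, numbers approximate):
awk 'BEGIN {
  params = 405e9
  # weight memory = parameters x bytes per parameter (overheads not included)
  printf "FP16     : %6.1f GB\n", params * 2   / 1e9
  printf "INT8/FP8 : %6.1f GB\n", params * 1   / 1e9
  printf "INT4     : %6.1f GB\n", params * 0.5 / 1e9
}'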
I have access to a 4-node SLURM cluster, each node with 4 A100 80GB GPUs.
So how do we get these to work together?
Step 1: Multi-Node SLURM Configuration
Running jobs through sbatch requires us to be careful about specifying the number of nodes/tasks/GPUs we will be using.
Here we have 2 scripts:
- vllm_run.sh - This script will be submitted to SLURM.
- vllm_node.sh - This script will be run on each individual node.
Here's the first:
vllm_run.sh
#!/bin/bash
#SBATCH --job-name=vllm-multinode # Job name
#SBATCH --nodes=4 # Number of nodes
#SBATCH --gres=gpu:4 # Number of GPUs per node (adjust as needed)
#SBATCH --cpus-per-task=128 # Number of CPUs per task (adjust as needed)
#SBATCH --ntasks-per-node=1 # Number of tasks per node
#SBATCH --time=02:00:00 # Max runtime (HH:MM:SS)
#SBATCH --partition=<YOUR PARTITION> # Partition (queue) name
srun --ntasks=4 --cpus-per-task=128 --gres=gpu:4 --exclusive ./vllm_node.sh
There are a few important things here:
- We launch the job on --nodes=4 nodes.
- We specify --ntasks-per-node=1 so that one task will run on each node.
- We specify --cpus-per-task=128. Some SLURM clusters will otherwise default to only a single CPU per task.
- We run srun, specifying the number of CPUs/GPUs again. This is important to ensure that the resources are allocated correctly. (If I omit the --cpus-per-task flag in the srun command, each instance crawls along with one CPU and 4 A100s.)
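Before launching the full job, it can be worth sanity-checking that an allocation with these flags gives each task what you expect. A minimal check, run as its own srun (the partition name is a placeholder, as above):
# Each of the 4 tasks should report 128 CPUs and list 4 GPUs
srun --nodes=4 --ntasks-per-node=1 --cpus-per-task=128 --gres=gpu:4 \
    --partition=<YOUR PARTITION> \
    bash -c 'echo "$(hostname): cpus=$SLURM_CPUS_PER_TASK"; nvidia-smi -L'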
Step 2: Downloading the model locally
I have my $HF_HOME environment variable set to a network filesystem. This is fine for single-node jobs, but for multi-node jobs it can make loading painfully slow: each node performs non-contiguous reads of different pieces of the model over the network. It's best to copy the model to a local directory on each node.
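If you're not sure where your cache actually lives, here is a quick check of the filesystem behind $HF_HOME (assuming the default cache layout, where models sit under $HF_HOME/hub):
# Is $HF_HOME on network or local storage?
df -hT "$HF_HOME"                    # Type column: nfs/lustre/gpfs = network, ext4/xfs = local
du -sh "$HF_HOME/hub" 2>/dev/null    # How much cached model data would need to be copied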
Let's first set up a virtualenv and install the huggingface-cli:
# install uv if you don't have it
curl -LsSf https://astral.sh/uv/install.sh | env UV_INSTALL_DIR="<somewhere on your path>" sh
# make the venv
uv venv
# install the Hugging Face CLI (it ships with the huggingface_hub package)
source .venv/bin/activate
uv pip install "huggingface_hub[cli]"
# Download it to $HF_HOME
export MODEL_PATH="neuralmagic/Meta-Llama-3.1-405B-Instruct-quantized.w8a16"
huggingface-cli download $MODEL_PATH
Beware that if you're downloading the versions directly from Meta, you should add --exclude="original/*" to the huggingface-cli download command. This will skip the duplicate PyTorch weights in the original/ directory.
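Once the download finishes, it's worth confirming that the snapshot actually landed in the cache before moving on. A quick check, again assuming the default cache layout under $HF_HOME/hub:
# List cached repos and their sizes
huggingface-cli scan-cache
# Or check the size of this model's cache directory directly
du -sh "$HF_HOME/hub/models--neuralmagic--Meta-Llama-3.1-405B-Instruct-quantized.w8a16"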
Step 3: Downloading VLLM
On my cluster, I can't use docker or slurm's docker-based system, so I use Singularity. Let's download VLLM to $SINGULARITY_IMAGE_DIR:
export SINGULARITY_IMAGE_DIR=<path to your singularity images>
mkdir -p $SINGULARITY_IMAGE_DIR
singularity pull $SINGULARITY_IMAGE_DIR/vllm-openai_v0.6.1.post2.sif docker://vllm/vllm-openai:v0.6.1.post2
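Before wiring the image into the per-node script, a quick smoke test doesn't hurt: run it somewhere with GPUs and the NVIDIA driver available (e.g. an interactive srun on a compute node) and check that the container sees the hardware and that vLLM imports cleanly.
# The container should list the node's GPUs...
singularity exec --nv $SINGULARITY_IMAGE_DIR/vllm-openai_v0.6.1.post2.sif nvidia-smi -L
# ...and the bundled vLLM should import and report its version
singularity exec $SINGULARITY_IMAGE_DIR/vllm-openai_v0.6.1.post2.sif \
    python3 -c "import vllm; print(vllm.__version__)"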
Now we're ready to write our vllm_node.sh script.
Step 4: Per-node script
This script is quite complicated. It does the following:
- Set up the environment variables:
source /etc/profile
export PATH="$HOME/.bin:$PATH"
# Set environment variables
export HF_HOME=<YOUR_HF_CACHE>
export HF_TOKEN=<YOUR_HF_TOKEN>
export SINGULARITY_IMAGE_DIR=<path to your singularity images>
export MODEL_PATH="neuralmagic/Meta-Llama-3.1-405B-Instruct-quantized.w8a16"
- Copy the model from a network filesystem to a local directory on the node:
# This should persist across reboots and be local to the node
TARGET_HF_HOME="/var/tmp/hf_home"
echo "Syncing HF"
rsync -a --links -r "$HF_HOME/" "$TARGET_HF_HOME/"
echo "HF Synced"
Here we use rsync -a --links -r to copy the contents of the HF cache into $TARGET_HF_HOME. Since the HF cache uses local symlinks, everything should be preserved nicely.
- Determine the correct IP address for the head node:
NODELIST=$(scontrol show hostnames $SLURM_NODELIST)
NODE0_IP=$(host -4 $(echo $NODELIST | awk '{print $1}') | awk '{print $4}')
NNODES=$(echo $NODELIST | wc -w)
NODE_RANK=$SLURM_NODEID
This gets the IP of the first node in the list, the number of nodes, and this node's rank.
- Set up the required NCCL Infiniband environment variables:
export NCCL_DEBUG=INFO
export NCCL_IB_HCA=mlx5
# You should run `ifconfig` to see what your infiniband interface is called
export NCCL_SOCKET_IFNAME=ib0
# Modify this based on what you need in your system
export NCCL_NET_GDR_LEVEL=SYS
# Same with this
export NCCL_P2P_LEVEL=SYS
To figure out the values to put here, you should run ibv_devinfo to find your HCA, nvidia-smi topo -m to see the connectivity between your GPUs/PCIe buses/Infiniband cards, and ifconfig to find the name of your Infiniband interface (see the diagnostic commands after this list).
- Set up the ray command to run:
VLLM_COMMAND="vllm serve --dtype auto --api-key token-abc123 \
    --tensor-parallel-size 4 \
    --pipeline-parallel-size $NNODES \
    --enable-prefix-caching \
    $MODEL_PATH"
RAY_START_CMD="ray start"
if [ "${NODE_RANK}" == "0" ]; then
    # We're in rank 0, so we start the head node, wait a bit, and print out the status
    RAY_START_CMD+=" --head --port=6379 && sleep 5 && ray status"
    # Then we start the model server
    RAY_START_CMD+=" && $VLLM_COMMAND"
else
    # Otherwise, we just run a blocking command on each worker
    RAY_START_CMD+=" --block --address=${NODE0_IP}:6379"
fi
echo "RAY_START_CMD: $RAY_START_CMD"
Here's where we define our VLLM topology: 4-way pipeline parallelism across the 4 nodes and 4-way tensor parallelism across the 4 GPUs within each node.
- Serve the model:
module load singularity/3.7.0 # This is the version of singularity I use on my cluster
mkdir -p /var/tmp/home
# --nv mounts the GPUs, --no-home skips mounting the real home directory,
# /var/tmp/home is bound as a local home, and the locally-synced HF cache is
# bound to /hf_home and pointed to via HF_HOME inside the container
singularity exec \
    --nv \
    --no-home \
    -B /var/tmp/home:$HOME \
    -B $TARGET_HF_HOME:/hf_home \
    --env HF_HOME=/hf_home \
    $SINGULARITY_IMAGE_DIR/vllm-openai_v0.6.1.post2.sif bash -c "$RAY_START_CMD"
Note that we bind the node-local $TARGET_HF_HOME copy of the cache (rather than the network $HF_HOME), so vLLM loads the weights from local disk, which was the point of the rsync above.
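For reference, these are the diagnostic commands mentioned in the NCCL step above; run them on a compute node to work out the right values for your cluster:
ibv_devinfo | grep hca_id    # HCA names, e.g. mlx5_0 / mlx5_1 -> NCCL_IB_HCA=mlx5
ifconfig                     # Look for the Infiniband interface name, e.g. ib0 -> NCCL_SOCKET_IFNAME
nvidia-smi topo -m           # GPU/NIC/PCIe connectivity, which informs NCCL_NET_GDR_LEVEL and NCCL_P2P_LEVEL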
Step 5: Putting it all together
Now we can put together the whole per-node script:
vllm_node.sh
#!/bin/bash
source /etc/profile
export PATH="$HOME/.bin:$PATH"
# Set environment variables
export HF_HOME=<YOUR_HF_CACHE>
export HF_TOKEN=<YOUR_HF_TOKEN>
export SINGULARITY_IMAGE_DIR=<path to your singularity images>
export MODEL_PATH="neuralmagic/Meta-Llama-3.1-405B-Instruct-quantized.w8a16"
TARGET_HF_HOME="/var/tmp/hf_home"
# update hf cache
echo "Syncing HF"
rsync -a --links -r "$HF_HOME/" "$TARGET_HF_HOME/"
echo "HF Synced"
# Get the hostnames
NODELIST=$(scontrol show hostnames $SLURM_NODELIST)
NODE0_IP=$(host -4 $(echo $NODELIST | awk '{print $1}') | awk '{print $4}')
NNODES=$(echo $NODELIST | wc -w)
NODE_RANK=$SLURM_NODEID
export NCCL_DEBUG=INFO
export NCCL_IB_HCA=mlx5
# You should run `ifconfig` to see what your infiniband interface is called
export NCCL_SOCKET_IFNAME=ib0
# Modify this based on what you need in your system
export NCCL_NET_GDR_LEVEL=SYS
# Same with this
export NCCL_P2P_LEVEL=SYS
VLLM_COMMAND="vllm serve --dtype auto --api-key token-abc123 --tensor-parallel-size 4 --pipeline-parallel-size $NNODES --enable-prefix-caching $MODEL_PATH"
RAY_START_CMD="ray start"
if [ "${NODE_RANK}" == "0" ]; then
# We're in rank 0, so we start the head node, wait a bit, print out the status
RAY_START_CMD+=" --head --port=6379 && sleep 5 && ray status"
# Then we start the model server
RAY_START_CMD+=" && $VLLM_COMMAND"
else
# Otherwise, we just run a blocking command on each worker
RAY_START_CMD+=" --block --address=${NODE0_IP}:6379"
fi
echo "RAY_START_CMD: $RAY_START_CMD"
module load singularity/3.7.0 # This is the version of singularity I use on my cluster
mkdir -p /var/tmp/home
# --nv mounts the GPUs, --no-home skips mounting the real home directory,
# /var/tmp/home is bound as a local home, and the locally-synced HF cache is
# bound to /hf_home and pointed to via HF_HOME inside the container
singularity exec \
    --nv \
    --no-home \
    -B /var/tmp/home:$HOME \
    -B $TARGET_HF_HOME:/hf_home \
    --env HF_HOME=/hf_home \
    $SINGULARITY_IMAGE_DIR/vllm-openai_v0.6.1.post2.sif bash -c "$RAY_START_CMD"
Final: Running the model
Now we can run sbatch vllm_run.sh and watch our model get served!
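Once the job is up, the head node exposes an OpenAI-compatible API (vLLM serves on port 8000 by default, and the script sets the API key to token-abc123). A rough sketch of checking on it, where <jobid> and <head-node> come from your own allocation:
# See which nodes the job landed on
squeue -u $USER
# Follow the logs (sbatch writes them to slurm-<jobid>.out by default)
tail -f slurm-<jobid>.out
# Send a test request to the head node
curl http://<head-node>:8000/v1/chat/completions \
    -H "Content-Type: application/json" \
    -H "Authorization: Bearer token-abc123" \
    -d '{"model": "neuralmagic/Meta-Llama-3.1-405B-Instruct-quantized.w8a16",
         "messages": [{"role": "user", "content": "Hello!"}]}'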
My nickname is will. Correspondence to 'my nickname' at swaglu dot com will reach me.