Running Llama 405B on vLLM with Slurm and Multiple Nodes
Llama 405B is a very large model: the weights alone take hundreds of gigabytes of GPU memory, depending on how the model is quantized.
| Quantization Method | Weight Memory | Min. # of 80GB A100 GPUs (weights only) |
|---|---|---|
| FP16 | 810GB | 11 |
| INT8/FP8 | 405GB | 6 |
| INT4 | 202GB | 3 |
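These numbers are just the parameter count times bytes per parameter, divided into 80GB cards. A minimal sanity-check sketch, assuming weights only (KV cache, activations, and framework overhead are ignored):

```python
import math

# Rough weight-memory estimate for a 405B-parameter model (weights only).
PARAMS = 405e9
GPU_MEM_GB = 80  # A100 80GB

for name, bytes_per_param in [("FP16", 2), ("INT8/FP8", 1), ("INT4", 0.5)]:
    weight_gb = PARAMS * bytes_per_param / 1e9
    min_gpus = math.ceil(weight_gb / GPU_MEM_GB)
    print(f"{name:9s} {weight_gb:7.1f} GB -> {min_gpus} x A100 80GB")
```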
I have access to a 4-node Slurm cluster with 4 A100 80GB GPUs per node, i.e. 320GB of GPU memory per node and 1,280GB across the whole cluster. A quick fit check (sketched below) shows that anything above INT4 has to span more than one node.
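Here is that fit check as a small sketch, under the same weights-only assumption, comparing each quantization against one node (320GB) and the full cluster (1,280GB):

```python
# One node = 4 x A100 80GB; the cluster = 4 such nodes.
NODE_MEM_GB = 4 * 80
CLUSTER_MEM_GB = 4 * NODE_MEM_GB

# Weights-only numbers from the table above.
for name, weight_gb in [("FP16", 810), ("INT8/FP8", 405), ("INT4", 202.5)]:
    if weight_gb <= NODE_MEM_GB:
        verdict = "fits on a single node"
    elif weight_gb <= CLUSTER_MEM_GB:
        verdict = "needs more than one node"
    else:
        verdict = "does not fit even cluster-wide"
    print(f"{name:9s} {weight_gb:6.1f} GB -> {verdict}")
```

FP16 and INT8/FP8 both overflow a single node, so the model has to be sharded across nodes.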
So how do we get these to work together?