The GPU you need is determined by the model's size (parameter count) and the precision it runs at. For FP16 inference, typical pairings are listed below; a rough way to estimate the numbers yourself is sketched after the list.

  •         LLaMA 7B (LLaMA 2/3): RTX 4090 or A5000 (24 GB VRAM)
  •         LLaMA 13B: RTX 5090 (32 GB), A6000 (48 GB), or A100 (40 GB)
  •         LLaMA 70B: 2× H100 80 GB or 2× A100 80 GB (multi-GPU)
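As a rough cross-check on the pairings above: FP16 stores each parameter in 2 bytes, so the weights alone take about 2 GB per billion parameters. The minimal Python sketch below adds a 20% overhead factor for the KV cache, activations, and runtime buffers; that factor is an illustrative assumption, not a fixed rule, and real usage varies with batch size and context length.

```python
# Minimal FP16 VRAM estimator. The 20% overhead for KV cache,
# activations, and runtime buffers is an illustrative assumption.

def fp16_vram_gb(params_billions: float, overhead: float = 0.20) -> float:
    """Estimate VRAM (GB) needed to serve a model in FP16."""
    weight_gb = params_billions * 2  # 2 bytes/param -> 2 GB per billion params
    return weight_gb * (1 + overhead)

for size in (7, 13, 70):
    print(f"LLaMA {size}B: ~{fp16_vram_gb(size):.0f} GB VRAM")
```

Under these assumptions this gives roughly 17 GB for 7B, 31 GB for 13B, and 168 GB for 70B, which lines up with the single 24 GB card, single high-end card, and dual 80 GB card tiers above.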
