Other considerations

Introduction to Compute Requirements for Model Deployment

When working with large language models (LLMs), understanding the compute requirements is as crucial as knowing the appropriate model size for your task's complexity. The type of hardware you use significantly impacts your ability to run and train these models effectively.

You may have worked with 70-million-parameter models that run on CPUs. While these models are functional, they are not the most performant. For tasks that demand more efficiency and power, it's advisable to use more robust hardware, such as GPUs, which can handle larger and more complex models.

As an example, a single V100 GPU, available on cloud platforms like AWS, comes with 16 gigabytes of memory. That is enough to run a 7 billion parameter model for inference. For training, however, the memory requirements are much higher because gradients and optimizer states must be stored alongside the weights. In that scenario, the same V100 GPU can only support the training of a roughly 1 billion parameter model.
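
As a rough rule of thumb (an assumption for illustration, not a measured figure), inference in 16-bit precision needs about 2 bytes per parameter, while full-precision training with an Adam-style optimizer needs on the order of 16 bytes per parameter (weights, gradients, and optimizer states), before counting activations. A minimal Python sketch of that arithmetic:

```python
def estimate_gpu_memory_gb(num_params: float, training: bool = False) -> float:
    """Rough GPU memory estimate, ignoring activations and framework overhead.

    Assumed rule of thumb (not an exact accounting):
      - inference in fp16: ~2 bytes per parameter
      - training in fp32 with Adam: ~16 bytes per parameter
        (4 for weights + 4 for gradients + 8 for optimizer states)
    """
    bytes_per_param = 16 if training else 2
    return num_params * bytes_per_param / 1e9


# A 7B-parameter model roughly fits in a 16GB V100 for inference...
print(estimate_gpu_memory_gb(7e9))                 # ~14 GB
# ...but only a ~1B-parameter model fits when training.
print(estimate_gpu_memory_gb(1e9, training=True))  # ~16 GB
```

Both estimates land close to the 16GB available on a single V100, which is why the inference and training limits in the table below differ so sharply.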

| AWS Instance   | GPUs   | GPU Memory   | Max inference size (# of params) | Max training size (# of params) |
|----------------|--------|--------------|----------------------------------|---------------------------------|
| p3.2xlarge     | 1 V100 | 16GB         | 7B                               | 1B                              |
| p3.8xlarge     | 4 V100 | 64GB         | 7B                               | 1B                              |
| p3.16xlarge    | 8 V100 | 128GB        | 7B                               | 1B                              |
| p3dn.24xlarge  | 8 V100 | 256GB        | 14B                              | 2B                              |
| p4d.24xlarge   | 8 A100 | 320GB HBM2   | 18B                              | 2.5B                            |
| p4de.24xlarge  | 8 A100 | 640GB HBM2e  | 32B                              | 5B                              |
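
If you want to apply these limits programmatically, here is a minimal sketch (the `INSTANCES` list and `instances_that_fit` helper are hypothetical names, with the figures taken from the table above) that reports which instance types are large enough for a given model:

```python
# (instance, GPUs, GPU memory in GB, max inference params, max training params)
# Figures copied from the table above.
INSTANCES = [
    ("p3.2xlarge",    "1 V100", 16,  7e9,  1e9),
    ("p3.8xlarge",    "4 V100", 64,  7e9,  1e9),
    ("p3.16xlarge",   "8 V100", 128, 7e9,  1e9),
    ("p3dn.24xlarge", "8 V100", 256, 14e9, 2e9),
    ("p4d.24xlarge",  "8 A100", 320, 18e9, 2.5e9),
    ("p4de.24xlarge", "8 A100", 640, 32e9, 5e9),
]


def instances_that_fit(num_params: float, training: bool = False) -> list[str]:
    """Return the instance names whose listed limit covers the model size."""
    col = 4 if training else 3  # pick the training or inference limit column
    return [row[0] for row in INSTANCES if row[col] >= num_params]


print(instances_that_fit(13e9))                # inference with a 13B model
print(instances_that_fit(3e9, training=True))  # training a 3B model
```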

If the memory and compute power of a single V100 GPU are insufficient for your needs, and you wish to work with even larger models, you would need to consider more advanced hardware configurations.