Organizations of every scale understand that leveraging the public cloud is a trade-off between convenience and cost. While cloud providers like Google, Amazon, and Microsoft have dramatically lowered the barrier to entry for machine learning, GPU compute still comes at a premium.
There is an increasing fear in the machine learning community that the true power of machine learning remains in the hands of the few. The flagship example of this is OpenAI’s massive GPT-3 model, with 175 billion parameters, a memory footprint of 350GB, and a reported training cost of at least $4.6 million. The trend also looks set to continue: rumours consistently circulate about the size of the next-generation GPT-4, with some estimates on the order of trillions of parameters. Even with more efficient training techniques, these models will still cost millions of dollars to train.
For the rest of the ML community, there is now an increasing reliance on their secret weapon: transfer learning. Just recently, the excellent HuggingFace library announced a simple method to fine-tune large models on a single cloud GPU. This gives ML practitioners hope that even if they cannot train models from scratch, harnessing the immense power of modern machine learning models is still within reach.
Whether training from scratch or fine-tuning, it is still clear that public cloud providers offer the most convenient path to provision and utilize compute resources for most ML practitioners out there. However, even for fine-tuning tasks or smaller models, these costs can quickly grow and become unmanageable. For example, here is a simple breakdown of how much it costs to train a machine learning model on a relatively mid-tier configuration suitable for many machine learning tasks:
| Provider | Configuration          | Cost ($/h) |
|----------|------------------------|------------|
| GCP      | n1-standard-16 with P4 | 1.36       |
| Azure    | NV6 with M60           | 1.14       |
| AWS      | g3.4xlarge with M60    | 1.14       |
Bear in mind that the above costs are for just one training run. Most machine learning projects go through many rounds of experimentation, and these numbers add up quickly. Therefore, ML teams that do not have vast budgets at their disposal usually resort to sampling their datasets and rationing out big training runs for when they are confident. This can be slow and tedious, not to mention hard to coordinate. It can also lead teams to converge on the wrong results: if the smaller, sampled datasets are not representative of the full dataset, models can develop in frustrating, divergent directions.
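To see how quickly this compounds, here is a back-of-the-envelope calculation using the GCP rate from the table above. The number and duration of runs are hypothetical, chosen purely for illustration:

```python
# Back-of-the-envelope cost of an experimentation phase (hypothetical numbers).
HOURLY_RATE = 1.36   # GCP n1-standard-16 with P4, $/h (from the table above)
RUN_HOURS = 6        # assumed duration of one training run
NUM_RUNS = 50        # assumed number of experiments (hyperparameters, ablations, ...)

total = HOURLY_RATE * RUN_HOURS * NUM_RUNS
print(f"Total cost of experimentation: ${total:,.2f}")  # -> $408.00
```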
Would it not be nice if ML practitioners had the luxury of launching experiments without having to fret so much about costs exploding over time? There might just be a solution, offered by all major cloud providers and severely underutilized by the machine learning community: preemptible/spot instances.
The term preemptible instance is largely Google Cloud Platform parlance, while AWS and Azure call them spot instances. Whatever you call it, the concept is the same: these instances cost a fraction of the price of normal instances, and the only catch is that there is no guarantee the instance stays up all the time. Usually, this means the provider shuts the instance down within 24 hours.
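Because the shutdown can happen at any moment, it helps to detect it programmatically. On GCP, for example, a running instance can query the metadata server to find out whether it has been preempted. Below is a minimal sketch of such a check; the watchdog loop and polling interval are illustrative assumptions, not part of any particular framework:

```python
import time
import urllib.request

# GCP's metadata server exposes a "preempted" flag for the current instance.
PREEMPTED_URL = (
    "http://metadata.google.internal/computeMetadata/v1/instance/preempted"
)

def is_preempted() -> bool:
    """Return True if GCP has marked this instance as preempted."""
    request = urllib.request.Request(
        PREEMPTED_URL, headers={"Metadata-Flavor": "Google"}
    )
    with urllib.request.urlopen(request, timeout=2) as response:
        return response.read().decode().strip() == "TRUE"

# Hypothetical watchdog: poll every 10 seconds and trigger a final
# checkpoint before the machine goes away.
while not is_preempted():
    time.sleep(10)
print("Preemption detected -- save a checkpoint now!")
```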
These sorts of instances are a mechanism for cloud providers to maximize the utilization of their resources at any given time, and they are intended for batched, non-critical workloads. Most training jobs out there take less than 24 hours to complete, and even if a job is interrupted before it finishes, it can almost always be restarted from a checkpoint.
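Making a training job robust to interruption boils down to saving checkpoints periodically and resuming from the latest one on restart. Here is a minimal sketch in PyTorch; the model, optimizer, and checkpoint path are placeholders, and a real job would write the checkpoint to persistent storage that survives the instance:

```python
import os
import torch

CHECKPOINT = "checkpoint.pt"  # hypothetical path, e.g. on a persistent disk

model = torch.nn.Linear(10, 2)  # placeholder model
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
start_epoch = 0

# Resume from the last checkpoint if a previous (preempted) run left one behind.
if os.path.exists(CHECKPOINT):
    state = torch.load(CHECKPOINT)
    model.load_state_dict(state["model"])
    optimizer.load_state_dict(state["optimizer"])
    start_epoch = state["epoch"] + 1

for epoch in range(start_epoch, 100):
    # ... one epoch of training goes here ...

    # Persist progress so a restarted run can pick up where it left off.
    torch.save(
        {"model": model.state_dict(),
         "optimizer": optimizer.state_dict(),
         "epoch": epoch},
        CHECKPOINT,
    )
```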
Therefore, machine learning training fits the intended use of spot instances perfectly. By using these instances, practitioners stand to gain massive cost reductions. We have conducted a rough analysis across the three major cloud providers to showcase the cost benefits. The raw data can be found here. Feel free to share the doc and leave a comment if you find something to add. Here is a snapshot, with the same configurations as before:
| Provider | Configuration          | Cost ($/h) | Spot Cost ($/h) | Savings |
|----------|------------------------|------------|-----------------|---------|
| GCP      | n1-standard-16 with P4 | 1.36       | 0.38            | 72%     |
| Azure    | NV6 with M60           | 1.14       | 0.20            | 82%     |
| AWS      | g3.4xlarge with M60    | 1.14       | 0.34            | 70%     |
Note: All costs in the US region, AWS instance pricing as of January 28, 14:00 CET.
As can be seen, depending on the configuration, spot instances offer up to an 82% cost reduction, with the average savings across clouds and configurations being roughly 74%. This can equate to hundreds of dollars worth of savings. Especially for hobbyists, smaller companies, or smaller departments experimenting with machine learning, this may mean the difference between getting a model deployed and crashing and burning before lift-off.
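To put that in terms of the earlier back-of-the-envelope calculation: the same hypothetical 50 runs of 6 hours each would cost roughly $114 at the GCP spot rate of $0.38/h, instead of $408 on demand, a saving of nearly $300.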
Using this technique is not new: Way back in 2018, the FastAI team trained ImageNet in 18 minutes with 16 AWS spot instances. This cost $40 at the time, and was the most public display of the insane cost benefits of spot instances in the community.
However, given the trend towards ever-bigger models and the growing adoption of AI worldwide, I can only see the need for spot instance training rising over time. Given the dramatic difference in costs, it is almost a no-brainer to use spot instances as the primary mechanism for training, at least in the experimentation phase.
If you’re looking for a head start with spot instance training, check out ZenML, an open-source MLOps framework for reproducible machine learning. Running a pipeline on spot instances in ZenML is as easy as:
```python
training_pipeline.run(
    backend=OrchestratorGCPBackend(
        preemptible=True,  # reduce costs by using preemptible (spot) instances
        machine_type='n1-standard-4',
        gpu='nvidia-tesla-k80',
        gpu_count=1,
    )
)
```
ZenML not only ships your code to the instance, it also makes sure the right CUDA drivers are enabled to take advantage of the accelerator of your choice. It provisions the instance and spins it down when the pipeline is done. Not to mention the other benefits of experiment tracking, versioning, and metadata management, which are provided by the framework anyway. Give it a spin yourself: a full code example can be found here.
AWS and Azure support is on the horizon, and we’d love your feedback on the current setup. If you like what you see, leave us a star on the GitHub repo!