When you have so much data that you can’t reasonably run the training on your local machine, or the dataset is larger than your hard drive, it’s time to look at other options.
Shift Machine Learning Training
One solid option is to shift the machine learning training to another computer with access to more storage. This frees up your hard drive space and lets you work on other things while the training is taking place. Let’s break down exactly which parts need to be moved into the cloud.
It’s useful to think about training as needing two primary resources: compute and storage. What’s interesting here is that we don’t have to tie them together quite as tightly as you might at first expect. We can decouple them, which lets each run on a specialized system, and that can lead to efficiencies of scale when you’re dealing with big data.
Now, compute load is moved around easily enough. Moving large datasets, however, can be a bit more involved. If your dataset is truly large, the results are worthwhile, because the data can then be accessed in parallel by the many machines working on your machine learning training job. Google Cloud Platform has a couple of easy ways to tie together these abstractions.
Google Cloud Platform
First, we’ll want to make sure that our data is in Google Cloud Storage, or GCS. We can do this using a variety of tools. For small to medium datasets, just use gsutil. It’s a command-line tool that was made specifically for interacting with Google Cloud Storage.
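As a minimal sketch of what such a copy could look like when driven from a script, the Python snippet below assembles a recursive `gsutil cp` command. The helper name `build_gsutil_copy`, the local path `./training-data`, and the bucket `gs://example-bucket` are hypothetical placeholders, not anything prescribed by gsutil. The snippet only builds and prints the command, so it runs without gsutil installed.

```python
import shlex


def build_gsutil_copy(source, destination, parallel=True):
    """Return the argument list for a recursive gsutil copy.

    `source` and `destination` are a local path or a gs:// URL.
    This helper is illustrative only; it does not run anything.
    """
    command = ["gsutil"]
    if parallel:
        # -m is a top-level gsutil flag that runs the operation in parallel.
        command.append("-m")
    # cp -r copies a directory tree recursively.
    command += ["cp", "-r", source, destination]
    return command


# Hypothetical paths, chosen only for the example.
example = build_gsutil_copy("./training-data", "gs://example-bucket/training-data")
print(shlex.join(example))
```

To actually start the transfer, the same argument list can be passed to `subprocess.run(example, check=True)` on a machine where gsutil is installed and authenticated.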
gsutil supports an -m option that sends multiple streams in parallel, thus speeding up your transfer job. But if your data is too big to send over the network, you can use the Google Transfer Appliance, which is literally a machine that will be shipped to your data center to securely capture and transfer a whole petabyte of data. With a typical network bandwidth of, say, 100 megabits per second, it would take roughly three years to upload a petabyte of data over the network.
Even if you had a gigabit connection, it would still take about four months. The transfer appliance, on the other hand, can capture a petabyte of data in just 25 hours. That is fast. Now that our data is in the cloud, we’re ready to run machine learning training at scale. But that’s a whole topic of its own.
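The network transfer times quoted above can be sanity-checked with a quick calculation. The sketch below assumes a decimal petabyte (10^15 bytes) and a link that sustains its full rated speed. The `efficiency` parameter is a hypothetical knob for modelling protocol overhead and link contention, not a figure from any official source.

```python
SECONDS_PER_DAY = 86_400
PETABYTE_BYTES = 10**15  # assumption: decimal petabyte


def transfer_days(num_bytes, link_bits_per_second, efficiency=1.0):
    """Days needed to move num_bytes over a link.

    efficiency is the fraction of the rated link speed that is
    actually sustained (1.0 means a perfect, uncontended link).
    """
    if not 0 < efficiency <= 1:
        raise ValueError("efficiency must be in the range (0, 1]")
    seconds = num_bytes * 8 / (link_bits_per_second * efficiency)
    return seconds / SECONDS_PER_DAY


days_100mbps = transfer_days(PETABYTE_BYTES, 100e6)  # 100 megabits per second
days_1gbps = transfer_days(PETABYTE_BYTES, 1e9)      # 1 gigabit per second

print(f"100 Mbps: {days_100mbps:.0f} days ({days_100mbps / 365:.1f} years)")
print(f"1 Gbps:   {days_1gbps:.0f} days")
```

On an ideal link this works out to about 926 days (roughly two and a half years) at 100 megabits per second and about 93 days (roughly three months) at 1 gigabit per second. The slightly larger figures of three years and four months are consistent with a link that sustains somewhat less than its rated speed. For example, `efficiency=0.8` gives about 1,157 days and 116 days, although that value is only a guess.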
Training Machine-learning Models
Training machine-learning models on large datasets can be challenging to accomplish with limited compute and storage resources. But it doesn’t have to be that way. By moving your data to the cloud with either gsutil or the transfer appliance, you can train on large datasets without any hiccups.