After labeling your dataset, you’ll need to choose a method for storing labeled data. This is easier said than done, however—as your datasets inevitably grow and you start identifying new solutions, you may quickly find yourself outpacing traditional storage methods.
This situation is becoming especially prevalent, with many datasets now containing millions of individual datapoints amounting to hundreds of gigabytes (if not more!). Plus, your considerations aren’t limited to sheer storage capacity; depending on your model, you’ll also have to consider factors such as processing requirements, data redundancy, and more.
To keep your training efforts running smoothly, we’ve compiled ten data storage best practices applicable to almost every dataset and implementation.
1. Define the problem you want to solve
Before adopting any specific storage solution, you should have a good idea – or at least a general idea – of your machine learning goals. By establishing a clear picture of your desired outcome, you’ll be better equipped to identify the best storage solution that fits your needs.
Again, storage capacity is only one part of the equation. We’ll touch more on that point later, but for now, start considering a more “holistic” storage solution which includes not only a manageable amount of storage, but also the necessary processing power for handling your project’s training requirements.
Defining your data needs should also include defining the logistics of your data use. Even after labeling, your data pipeline should still be present and flowing; as a result, you’ll typically want to choose a storage solution with high bandwidth and scalability.
2. Keep unlabeled copies of raw data
Keeping unlabeled copies of raw data has more practical uses than simply maintaining data redundancy; you might want to keep raw data in case you ever decide to re-label it. This situation is occasionally necessary, especially if your models come up empty and you identify better labeling methods at some point in the future.
Whether for redundancy or future retraining, keeping raw data is a good idea, though backups can become somewhat cumbersome over time. Even though labeled data is only slightly larger than its raw form (depending on the datatype), the storage requirements can quickly add up, especially as your datasets grow to hundreds of gigabytes.
Relational databases in conjunction with cloud storage are often ideal storage solutions for growing datasets and their backups. For especially large datasets, you may also want to consider big data platforms such as Hadoop and Spark.
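One way to picture the relational approach is to store raw items and their labels in separate tables, so that re-labeling never touches the raw records. The sketch below uses Python's built-in sqlite3 module purely for illustration; the table names, column names, and the idea of a `scheme_version` column are assumptions, not a prescribed schema.

```python
import sqlite3


def create_schema(conn):
    """Create one table for raw items and a separate one for labels."""
    conn.executescript("""
        CREATE TABLE IF NOT EXISTS raw_items (
            id  INTEGER PRIMARY KEY,
            uri TEXT NOT NULL UNIQUE      -- where the untouched raw file lives
        );
        CREATE TABLE IF NOT EXISTS labels (
            raw_id         INTEGER NOT NULL REFERENCES raw_items(id),
            label          TEXT NOT NULL,
            scheme_version INTEGER NOT NULL,  -- hypothetical labeling-scheme version
            PRIMARY KEY (raw_id, scheme_version)
        );
    """)


def add_raw(conn, uri):
    """Register a raw item and return its id."""
    cur = conn.execute("INSERT INTO raw_items (uri) VALUES (?)", (uri,))
    return cur.lastrowid


def add_label(conn, raw_id, label, scheme_version):
    """Attach a label to a raw item without modifying the raw record."""
    conn.execute(
        "INSERT INTO labels (raw_id, label, scheme_version) VALUES (?, ?, ?)",
        (raw_id, label, scheme_version),
    )


def latest_labels(conn):
    """Return (uri, label) pairs using the newest scheme version per item."""
    rows = conn.execute("""
        SELECT r.uri, l.label
        FROM raw_items r
        JOIN labels l ON l.raw_id = r.id
        WHERE l.scheme_version = (
            SELECT MAX(scheme_version) FROM labels WHERE raw_id = r.id
        )
        ORDER BY r.id
    """)
    return rows.fetchall()
```

With this layout, a second labeling pass simply adds rows with a higher `scheme_version`; the raw table and the earlier labels stay intact.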
3. Keep algorithms separate from data storage
In other words, move your algorithms before you move your data. This practice is especially important for large datasets, where moving data between servers can consume a great deal of time and resources. By comparison, algorithms are much leaner; think of them as “mobile agents” moving between large swaths of virtually “immobile” data.
Keeping your data in one spot (or distributed among multiple spots) may not be necessary for smaller models, but it’s still good practice.
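As a rough illustration of moving the algorithm rather than the data, the sketch below runs a small function against each shard where it lives and sends back only a compact summary. The shard layout (lists of `(item_id, label)` pairs) and the function names are hypothetical; a real deployment would depend on your storage platform.

```python
from collections import Counter


def count_labels_on_shard(shard_rows):
    """Runs next to the shard; returns a small summary instead of the rows."""
    return Counter(label for _item_id, label in shard_rows)


def merge_counts(partial_counts):
    """Combine the per-shard summaries on whichever machine needs the total."""
    total = Counter()
    for partial in partial_counts:
        total.update(partial)
    return total
```

Only the per-shard counters cross the network here, which is typically far cheaper than copying the shards themselves.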
4. Use compatible formats when storing labeled data
For anyone working with data, working across multiple formats can be tedious and time-consuming—especially if each format has its own compression type or is exclusive to a certain platform. To make the process easier, try to maintain compatibility and consistency from the get-go and standardize your formats whenever storing labeled data.
Common formats found in machine learning include comma-separated values (CSV), XLSX, plain text, and image formats such as JPEG. Your individual project will dictate which one to use, of course, but try your best to stick with one whenever possible. Also, whichever one you choose, make sure it’s compatible across platforms to ensure future scaling and/or migration.
5. Regularly back up databases and maintain redundancy
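As an example of standardizing on one portable format, the sketch below writes and reads labeled text as CSV using Python's standard csv module. The column names in `FIELDNAMES` are assumptions chosen for illustration.

```python
import csv

# Hypothetical column layout; fix it once and reuse it everywhere.
FIELDNAMES = ["item_id", "text", "label"]


def write_labeled_csv(path, rows):
    """Write a list of dicts to CSV with a header row, encoded as UTF-8."""
    with open(path, "w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=FIELDNAMES)
        writer.writeheader()
        writer.writerows(rows)


def read_labeled_csv(path):
    """Read the CSV back as a list of dicts (all values come back as strings)."""
    with open(path, newline="", encoding="utf-8") as f:
        return list(csv.DictReader(f))
```

Pinning the encoding and passing `newline=""` helps keep the files consistent when they move between operating systems.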
This should be a no-brainer for any data manager, but it bears repeating. Even with highly reliable cloud servers, data loss is still a very real threat and can quickly kill your machine learning efforts.
Since data is your most valuable commodity, do whatever it takes to maintain backups and redundancy. While backups should include copies of raw data, also be sure to include existing labels where necessary. Further, be sure to perform backups on a regular basis, especially if you frequently receive high volumes of data.
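A backup is only useful if the copy matches the original, so one simple habit is to verify a checksum after copying. The following is a minimal sketch using the standard library; the function names and the flat backup directory are assumptions, and a production setup would more likely rely on your storage provider's own tooling.

```python
import hashlib
import shutil
from pathlib import Path


def sha256_of(path, chunk_size=1 << 20):
    """Hash a file in chunks so large files do not need to fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def backup_file(src, backup_dir):
    """Copy src into backup_dir and confirm the copy is byte-identical."""
    src = Path(src)
    backup_dir = Path(backup_dir)
    backup_dir.mkdir(parents=True, exist_ok=True)
    dest = backup_dir / src.name
    shutil.copy2(src, dest)  # copy2 also preserves timestamps where possible
    if sha256_of(src) != sha256_of(dest):
        raise OSError(f"Backup of {src} does not match the original")
    return dest
```

Running a function like this on a schedule, for both raw data and label files, covers the "regular basis" part of the advice.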
6. Find “happy mediums” for processing power
In an ideal world, or maybe at some point in the future, every machine learning server will be entirely flash-based and use only the best GPUs for processing. This setup is currently possible, of course, but it’s very expensive—especially if you want to scale.
As we wait for Moore’s law to work its magic, consider some happy mediums in the interim. While GPU clusters are still expensive and relatively scarce, flash storage is becoming increasingly commonplace, and options such as NVMe drives serve as practical middle grounds. Further, maintaining good data management practices can also help speed along tasks and decrease processing requirements.
7. Establish a minimum data requirement
Establishing a minimum data requirement is not only essential for training a high-accuracy model, but it’s also a great way to establish a general understanding of your storage requirements. In other words, by knowing the bare minimum of what you need, you’ll be able to more realistically predict and allocate storage costs and resource consumption.
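To make that prediction concrete, a back-of-the-envelope calculation is often enough. The sketch below multiplies a minimum example count by average sizes; every number and parameter name is a placeholder assumption to be replaced with measurements from your own data.

```python
def estimate_storage_gb(min_examples, avg_raw_bytes, avg_label_bytes, copies=2):
    """Rough storage estimate in gigabytes (1 GB = 1e9 bytes).

    copies counts how many full copies you keep, e.g. primary plus one backup.
    """
    bytes_per_example = avg_raw_bytes + avg_label_bytes
    return min_examples * bytes_per_example * copies / 1e9


# Hypothetical inputs: 1,000,000 images of about 200 KB each,
# about 500 bytes of label data per image, primary plus one backup.
estimate = estimate_storage_gb(1_000_000, 200_000, 500, copies=2)
```

With these placeholder inputs the estimate comes to roughly 401 GB, which gives a starting figure for comparing storage options.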
8. Store variables that add context
Context is easy for humans. If we see an image that’s sideways, we can almost instantly classify it as being sideways. Computers, however, struggle with this task and need to be told repeatedly the “nature” of a particular context – be it “sideways-ness” or some other type of context.
As a result, it’s crucial to add variables and labels that give your data not just context, but the correct context. Doing this early on, despite requiring somewhat more oversight, is one of the best ways to ensure seamless and accurate classification in the future. Ease of classification can not only save on processing power but can also help avoid the possibility of having to retrain your models.
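Using the sideways-image example, one way to store context is as an explicit, validated field next to the label. The record layout below, including the `rotation_degrees` field and its allowed values, is an assumption made for illustration.

```python
import json
from dataclasses import asdict, dataclass

ALLOWED_ROTATIONS = (0, 90, 180, 270)


@dataclass
class LabeledImage:
    path: str
    label: str
    rotation_degrees: int = 0  # context variable: 0 = upright, 90/270 = sideways

    def __post_init__(self):
        # Reject values outside the agreed set so the stored context stays correct.
        if self.rotation_degrees not in ALLOWED_ROTATIONS:
            raise ValueError(f"Unsupported rotation: {self.rotation_degrees}")


def to_json_line(record):
    """Serialize one record as a single JSON line for storage."""
    return json.dumps(asdict(record), sort_keys=True)
```

Validating the field at write time is what separates adding context from adding the correct context.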
9. Use sentence-level classification
For text classification, sentence-level classification is typically a good starting point for assigning the right context to your data. Sub-sentence labeling can be a step more accurate, but mainly for especially complex or ambiguous sentences that could be interpreted in different ways. As a result, sentence-level classification often provides a good balance between speed and accuracy, especially for larger datasets.
If you plan to use sub-sentence classification, however, consider “splitting” ambiguous sentences into individual data points; in other words, if a sentence could be interpreted multiple ways, classify each interpretation as its own data point.
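The splitting idea can be sketched as a small helper that turns one ambiguous sentence into several stored data points, one per interpretation. The field names and the ID scheme below are hypothetical.

```python
def split_interpretations(sentence_id, text, interpretations):
    """Return one data point per (span, label) interpretation of a sentence.

    Each row keeps the full sentence and its original ID so the pieces
    can be traced back to where they came from.
    """
    return [
        {
            "item_id": f"{sentence_id}-{i}",
            "sentence_id": sentence_id,
            "text": text,
            "span": span,
            "label": label,
        }
        for i, (span, label) in enumerate(interpretations, start=1)
    ]


rows = split_interpretations(
    "s42",
    "The food was great but the service was slow.",
    [("The food was great", "positive"), ("the service was slow", "negative")],
)
```

Here a single sentence yields two rows, each with its own label, while sentences with only one reading can stay as a single sentence-level data point.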
10. Plan for growth and be ready to scale
As your models grow and you identify new solutions, your storage needs will likely change. This is to be expected; machine learning models should grow and adapt over time, especially as more data comes through the pipeline.
When you start to reach the limits of an existing storage solution, you’ll find yourself having to decide whether to expand, scale back, or potentially start from scratch depending on the results of previous training. By employing these practices when storing labeled data, however, adapting should be far more straightforward.