One way to help ensure that ML models generalize to unseen settings is to split the data. This can be done in many ways,
from a simple 3-way split (train, test, eval)
to k-fold cross-validation. The underlying reasoning is that by training an ML model
on a subset of the data and evaluating it on
held-out data, one can much better judge whether the model has underfit or overfit during training.
For me, splitting data is the most underrated task in all of data science. Understandably, for most jobs
a simple 3-way split suffices. However, I have stumbled across many problems that require
more complicated splits to ensure generalization. These splits are more complex because they are derived
from the actual data, rather than from the structure of the data, which the split methods mentioned so far are based on.
This post attempts to break down some of the more
unconventional ways to split data in ML development, and the
reasoning behind them.
To illustrate the split mechanisms, it helps to start with a sample dataset to apply the splits to.
To keep things simple, let's use a multivariate time-series dataset represented in tabular format. This data consists of
3 numerical features, 1 categorical feature, and 1 timestamp feature. Below it is visualized:
This type of dataset is common across many machine learning use cases and industries. A concrete example could be multiple timestreams transmitted from different machines with multiple sensors on a factory floor. The categorical variable would then be the ID of the machine, the numerical features would be the sensor readings (e.g. pressure, temperature, etc.), and the timestamp would be when the data was transmitted and recorded in the database.
Imagine you receive this dataset as a CSV file from your data engineering department and are tasked with writing a classification or regression model. The label in such a case could be any of the features or an additional column. Regardless, the first thing to do would be to split the data into sets that are meaningful.
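To make the discussion concrete, here is a minimal sketch of how such a dataset could be generated with pandas. The column names (pressure, temperature, vibration, machine_id, timestamp) are illustrative assumptions, not part of the original data:

```python
import numpy as np
import pandas as pd

def make_dataset(n_machines=5, points_per_machine=100, seed=0):
    """Build a toy multivariate time-series table: 3 numerical features,
    1 categorical feature (the machine ID), and 1 timestamp feature."""
    rng = np.random.default_rng(seed)
    frames = []
    for m in range(n_machines):
        frames.append(pd.DataFrame({
            "pressure": rng.normal(1.0, 0.1, points_per_machine),
            "temperature": rng.normal(70.0, 5.0, points_per_machine),
            "vibration": rng.normal(0.02, 0.005, points_per_machine),
            "machine_id": f"class_{m}",
            "timestamp": pd.date_range("2020-01-01",
                                       periods=points_per_machine, freq="h"),
        }))
    return pd.concat(frames, ignore_index=True)

df = make_dataset()  # 5 machines x 100 readings = 500 rows
```

All splits below operate on a frame of this shape: one row per sensor reading, indexed by machine and time.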
To make things easy, you decide to start with a simple
train/test/eval split. You know immediately that a naive random split with
shuffling won't fly here: the data contains multiple sensor streams that are indexed by time, after all. So how do you split the data so that temporal order
is maintained and the resulting models are sufficiently generalizable?
The most straightforward transformation we can do is to represent the data per categorical class (in our running example, visualize the data per machine). This yields the following result:
Grouping the data this way suddenly makes the splitting problem a bit simpler, and largely dependent on your hypothesis. If the machines are running under
similar conditions, one question you might ask is:
how would an ML model trained on one group generalize to other groups? That is, if trained on the
class_3 timestreams, how would the model fare on the
class_5 timestreams? Here is a visualization of that split:
I call this the
Horizontal split, due to the nature of the cut line in the above visualization. This split can be easily achieved in most ML libraries by
simply grouping by the categorical feature and partitioning along it. A successful training with this split would show evidence that the model has
picked up signals that generalize across previously unseen groups. However, it would not show that the model is able to predict the future behavior of any one group it has already seen.
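In pandas, such a group-wise partition could be sketched as follows. The function and column names are illustrative assumptions, and the toy frame stands in for the running example:

```python
import numpy as np
import pandas as pd

def horizontal_split(df, group_col, time_col, eval_fraction=0.2, seed=0):
    """Hold entire groups out: the train set never sees the eval groups."""
    rng = np.random.default_rng(seed)
    groups = rng.permutation(df[group_col].unique())
    n_eval = max(1, int(len(groups) * eval_fraction))
    eval_groups = set(groups[:n_eval])
    is_eval = df[group_col].isin(eval_groups)
    # Sort each side by group and time to preserve temporal order per stream.
    train = df[~is_eval].sort_values([group_col, time_col])
    eval_set = df[is_eval].sort_values([group_col, time_col])
    return train, eval_set

# Toy frame: 5 machines with 10 readings each.
toy = pd.DataFrame({
    "machine_id": np.repeat([f"class_{i}" for i in range(5)], 10),
    "timestamp": np.tile(pd.date_range("2020-01-01", periods=10, freq="h"), 5),
    "pressure": np.random.default_rng(1).normal(1.0, 0.1, 50),
})
train, eval_set = horizontal_split(toy, "machine_id", "timestamp")
# The train and eval group sets are disjoint by construction.
```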
It's important to note that this split decision did
NOT account for time as a basis of the
split itself. One can assume, however, that you would also sort each timestream by time to maintain that relationship in your
data. Which brings us to the next split...
But what if you want to split across time itself? For most time-series modelling, a common way to split the data is
across the past and the future. That is, to
put in the training set historical data relative to the data in the eval set. The hypothesis in this case would be:
how would an ML model trained on
historical data per group generalize to future data for each group? This question might be answered by the so-called Vertical split:
A successful training with this split would showcase that the model is able to pick up patterns across timestreams it has already seen, and make accurate predictions of behavior in the future. However, this itself would not show that this model will generalize well to other timestreams from different groups.
Of course, your multiple timestreams now have to be sorted individually, so we still need to group. However, this time, rather than cutting across
groups, we take a sample of the
past of each group and put it in train, and the
future of each group in eval. In this idealized example, all the
timestreams are of the same length, i.e., each timestream has exactly the same number of data points. In the real world, this may not be the
case, so you would need a system to build an index across each group to make this split.
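One way to sketch the Vertical split in pandas: a per-group cumulative count serves as exactly that index across each group, so streams of unequal length each get their own past/future cut. Column names are illustrative assumptions:

```python
import numpy as np
import pandas as pd

def vertical_split(df, group_col, time_col, train_fraction=0.8):
    """Per group: the earliest train_fraction of points go to train, the rest to eval."""
    df = df.sort_values([group_col, time_col])
    rank = df.groupby(group_col).cumcount()                   # position within each stream
    size = df.groupby(group_col)[time_col].transform("size")  # length of each stream
    in_train = rank < size * train_fraction
    return df[in_train], df[~in_train]

# Toy frame with unequal stream lengths (10 and 20 readings).
toy = pd.DataFrame({
    "machine_id": ["class_0"] * 10 + ["class_1"] * 20,
    "timestamp": list(pd.date_range("2020-01-01", periods=10, freq="h"))
               + list(pd.date_range("2020-01-01", periods=20, freq="h")),
    "pressure": np.random.default_rng(2).normal(1.0, 0.1, 30),
})
train, eval_set = vertical_split(toy, "machine_id", "timestamp")
# Within each group, every eval timestamp is later than every train timestamp.
```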
An inquisitive ML researcher might at this point wonder whether they could produce a model that generalizes under the constraints of both the
Horizontal and the
Vertical split. The hypothesis in that case would be:
how would a model trained on historical data for SOME groups generalize
to future data of those groups AND all data from other groups? A visualization of this
Hybrid split would look like this:
Naturally, if model training is successful, such a model would surely be more robust than the others in a real-world setting. It would have displayed evidence not only of learning patterns of groups it has already seen, but also of having picked up signals that generalize across groups. This might be useful if we were to add more similar machines to the factory in the future.
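A sketch of the Hybrid split under the same assumptions: held-out groups go entirely to eval, while the remaining groups are split across time as before. Names are illustrative:

```python
import numpy as np
import pandas as pd

def hybrid_split(df, group_col, time_col, eval_groups, train_fraction=0.8):
    """Held-out groups go entirely to eval; the rest are split past/future."""
    held_out = df[group_col].isin(eval_groups)
    seen = df[~held_out].sort_values([group_col, time_col])
    rank = seen.groupby(group_col).cumcount()
    size = seen.groupby(group_col)[time_col].transform("size")
    in_train = rank < size * train_fraction
    # Eval combines the seen groups' futures with all data from held-out groups.
    eval_set = pd.concat([seen[~in_train], df[held_out]])
    return seen[in_train], eval_set

# Toy frame: 3 machines, 10 readings each; hold class_2 out entirely.
toy = pd.DataFrame({
    "machine_id": np.repeat(["class_0", "class_1", "class_2"], 10),
    "timestamp": np.tile(pd.date_range("2020-01-01", periods=10, freq="h"), 3),
    "pressure": np.random.default_rng(3).normal(1.0, 0.1, 30),
})
train, eval_set = hybrid_split(toy, "machine_id", "timestamp",
                               eval_groups={"class_2"})
# train: the past of class_0 and class_1; eval: their futures plus all of class_2.
```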
The notion of the horizontal and vertical splits can be generalized to many dimensions. For example, one might group by two categorical features rather than one to isolate sub-groups in the data even further, and sort within each sub-group. There might also be complex logic in the middle to filter out groups with too few samples, along with other business-level logic pertaining to the domain.
This hypothetical example illustrates the endless variety of machine learning splits that can be created by an astute data scientist. Just as it is important to ensure ML fairness while evaluating your models, it is equally important to spend sufficient time thinking about how a dataset is split and the bias that choice can introduce in the model downstream.
One easy way to do the
Vertical and the
Hybrid split, by writing just a few lines of YAML, is via the Core Engine.
The Core Engine is an MLOps platform we developed at maiot while deploying models to production
for datasets with characteristics similar to the example above. If you are interested in the content above and would like to try the Core Engine,
please feel free to sign up directly for our beta, or reach out to me
at firstname.lastname@example.org if you are interested in giving the platform a go.
Head over to our docs to understand how it works in more detail.
Thank you and happy splitting!