In step 3 of this series, we designed a simple model and algorithm on which to build a simulator to better manage our wind turbine drivetrains.
Now we are finally at the last step: data. The AI Value Catalyst approach has saved us countless hours of sifting through data sources and types that, in the end, would not have enabled us to build what we actually needed. But now it is time to make sure we get the data right for our simulator and thus for our entire solution.
Garbage in, garbage out
Our AI solution can only be as good as the data we feed it. There are many types of data garbage that we must make sure don’t make it onto the menu for our simulator. “Bad data” can be:

- Erroneous
- Missing
- Cheating
- Not enough
- Asynchronous
- Non-representative
Let’s go through these in turn and see how they might rear their ugly heads in our predictive maintenance domain.
Erroneous data
This is the most obvious one. If the data is simply not right, the model won’t be able to learn anything sensible.
Data can be erroneous due to several factors. One potential factor is subjectivity.
How are the bearings labeled? If labeling is based on subjective manual assessment, our machine learning model will have a much harder job. Instead, make sure you specify clear, objective measures. For example, in a very simple case, the inspector could measure the diameter of the bearing and compare it with a standards table. If the diameter is below a certain value, the bearing is worn.
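To make this concrete, here is a minimal sketch of such an objective labeling rule in Python. The bearing types and diameter thresholds are made-up placeholders, not values from any real standard:

```python
# Hypothetical wear-labeling rule: a bearing is labeled "worn" when its
# measured diameter falls below a minimum taken from a standards table.
# The table values here are made up for illustration only.
MIN_DIAMETER_MM = {
    "bearing_type_a": 49.95,  # assumed minimum acceptable diameter
    "bearing_type_b": 79.90,
}

def label_bearing(bearing_type: str, measured_diameter_mm: float) -> str:
    """Return an objective wear label based on a measured diameter."""
    threshold = MIN_DIAMETER_MM[bearing_type]
    return "worn" if measured_diameter_mm < threshold else "ok"

print(label_bearing("bearing_type_a", 49.90))  # -> "worn"
```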
Another typical cause of erroneous data is sensor malfunction or drift. Malfunction can often be detected by applying simple one-dimensional filters. For example, we know the minimum and maximum realistic values for the power in the vibration frequencies.
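As an illustration, a one-dimensional range filter could flag implausible vibration-power readings like this; the bounds are assumed for the example, and in practice they would come from sensor specifications or historical data:

```python
import numpy as np

# Assumed plausible range for vibration band power; real bounds would
# come from sensor specifications or historical data.
POWER_MIN, POWER_MAX = 0.0, 150.0

def flag_out_of_range(power: np.ndarray) -> np.ndarray:
    """Return a boolean mask marking physically implausible samples."""
    return (power < POWER_MIN) | (power > POWER_MAX)

readings = np.array([12.3, 98.1, -5.0, 3200.0, 40.2])
print(flag_out_of_range(readings))  # [False False  True  True False]
```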
Drift is more insidious, as it can happen slowly and go undetected. But left unchecked, it can “blur” the data and the model won’t work well. The best defense here is to keep all sensors properly calibrated.
The general area of how to catch these types of problems is often called “outlier detection” or “anomaly detection.” It is outside the scope of this article, but the Wikipedia article on anomaly detection is a good starting point.
Missing data
There is not a whole lot you can do if all of the measurements for a particular time window are missing, other than working out why they were missing and rectifying that root cause.
You can interpolate if you have data that is “close” to the missing window. With time series data, for example, you could do cubic interpolation using adjacent points. You should then make sure that this does not degrade the model, by evaluating performance on an independent test set with and without interpolation.
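For instance, with pandas (which delegates to SciPy for cubic interpolation), filling a short interior gap could look like the sketch below; the series values are made up:

```python
import numpy as np
import pandas as pd

# A vibration-power series sampled at uniform intervals, with a short gap
# (values are made up for illustration).
power = pd.Series([1.0, 1.2, np.nan, np.nan, 2.0, 2.1, 2.3, 2.2])

# Cubic interpolation across the gap (pandas delegates to SciPy here).
filled = power.interpolate(method="cubic")
print(filled)
```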
If one or more data elements are present, you can estimate the most probable values using a probability density model. This Kaggle article is a good introduction.
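One simple instance of this idea, sketched below under the assumption that the features are roughly jointly Gaussian, is to fit a multivariate Gaussian to the complete rows and impute a missing element with its conditional mean given the observed elements:

```python
import numpy as np

def gaussian_impute(X_complete: np.ndarray, x_obs: np.ndarray,
                    obs_idx: list, mis_idx: list) -> np.ndarray:
    """Impute missing elements with the conditional mean of a fitted Gaussian.

    X_complete: fully observed rows used to fit the density model.
    x_obs: the observed values of the incomplete row.
    obs_idx / mis_idx: column indices of observed / missing elements.
    """
    mu = X_complete.mean(axis=0)
    sigma = np.cov(X_complete, rowvar=False)

    # Conditional mean: mu_m + S_mo @ inv(S_oo) @ (x_obs - mu_o)
    s_oo = sigma[np.ix_(obs_idx, obs_idx)]
    s_mo = sigma[np.ix_(mis_idx, obs_idx)]
    return mu[mis_idx] + s_mo @ np.linalg.solve(s_oo, x_obs - mu[obs_idx])

# Toy example with two correlated columns (made-up data).
rng = np.random.default_rng(0)
z = rng.normal(size=(500, 1))
X = np.hstack([z, 2 * z + rng.normal(scale=0.1, size=(500, 1))])
print(gaussian_impute(X, x_obs=np.array([1.0]), obs_idx=[0], mis_idx=[1]))
# ~ [2.0], as expected from the correlation structure
```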
Cheating data
You want to make sure that the inputs to the model don’t contain “shortcuts” to the labels or targets. In the worst case, imagine if the targets are included in the inputs. Then the model doesn’t need to learn anything other than the identity function, setting the output equal to that input.
This would give you a fantastically good training-time performance, but the model would obviously be useless in production.
So just make sure that all of the input data will actually be available at test time, in production.
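A crude but useful smoke test for this kind of leakage is to check whether any single input column almost perfectly correlates with the target. The feature names below are hypothetical:

```python
import numpy as np
import pandas as pd

def leakage_smoke_test(df: pd.DataFrame, target: str, threshold: float = 0.99):
    """Flag numeric features whose absolute correlation with the target
    is suspiciously close to 1, a hint of leaked or duplicated targets."""
    corr = df.corr(numeric_only=True)[target].drop(target).abs()
    return corr[corr > threshold]

# Toy frame where "wear_label_copy" leaks the target (made-up data).
rng = np.random.default_rng(1)
df = pd.DataFrame({
    "vibration_power": rng.normal(size=200),
    "wear": rng.integers(0, 2, size=200),
})
df["wear_label_copy"] = df["wear"]  # an accidental shortcut
print(leakage_smoke_test(df, target="wear"))  # flags wear_label_copy
```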
Not enough data
The more data you have, the better your model, and thus your simulator, can become. There is a related scaling effect: the more data you have, the more complex and powerful a model you can afford to fit.
By splitting your data into training and “validation” data, you can get an understanding of the accuracy you can obtain. This is also the basis for tuning the complexity of your model.
While that is another topic, always make sure that you supply the model with enough data to obtain good prediction accuracy on test data. For our predictive maintenance model, we would look at KPIs like accuracy to make sure the model is good enough.
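A minimal sketch of that train/validation workflow with scikit-learn might look like this; the stand-in data replaces what would really be vibration features and wear labels:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Stand-in data; in practice X would hold vibration features and y wear labels.
X, y = make_classification(n_samples=1000, n_features=10, random_state=0)

X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2,
                                                  random_state=0)
model = RandomForestClassifier(random_state=0).fit(X_train, y_train)

# Validation accuracy estimates what we could expect in production and
# guides how much model complexity the data can support.
print(accuracy_score(y_val, model.predict(X_val)))
```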
Asynchronous data
If data along each dimension arrives at different times, it is important to apply a consistent alignment procedure. For example, you can average the samples that fall within uniform time windows to obtain uniform data.
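With pandas, averaging samples into uniform time windows is essentially a one-liner; the one-minute window below is an arbitrary choice for illustration:

```python
import pandas as pd

# Irregularly timed sensor samples (timestamps and values are made up).
ts = pd.to_datetime(["2024-01-01 00:00:07", "2024-01-01 00:00:41",
                     "2024-01-01 00:01:12", "2024-01-01 00:02:55"])
samples = pd.Series([1.0, 1.4, 0.9, 1.1], index=ts)

# Average all samples falling within each uniform one-minute window.
uniform = samples.resample("1min").mean()
print(uniform)
```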
A sub-category that you must be especially aware of when working with time series is temporal misalignment. In our case, there could be sensor lags on the vibration measurements and we would need to correct these by collecting and using timestamps for each vibration data point.
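As a complementary trick, if a constant sensor lag is suspected, it can be estimated by cross-correlating the lagged signal with a reference signal and then shifting it back. The sketch below uses synthetic data and assumes the lag is a whole number of samples:

```python
import numpy as np

rng = np.random.default_rng(2)
reference = rng.normal(size=500)
lagged = np.roll(reference, 3)  # simulate a 3-sample sensor lag

# The offset of the cross-correlation peak gives the estimated lag.
xcorr = np.correlate(lagged, reference, mode="full")
lag = np.argmax(xcorr) - (len(reference) - 1)
print(lag)  # 3

aligned = np.roll(lagged, -lag)  # shift the signal back into alignment
```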
Non-representative data
If our training data is not representative of the data that the model is going to see in production, it will mislead us.
This can sneak up on you in various ways if you are not careful. For example, we might train the model on data from one power plant and apply it in another. This could very well lead to problems due to differences between the two plants, such as the load measurements being calibrated using different procedures.
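A simple way to catch such a mismatch before deployment, sketched here on synthetic data, is to compare the distribution of each input feature between the two plants, for example with a two-sample Kolmogorov-Smirnov test:

```python
import numpy as np
from scipy.stats import ks_2samp

rng = np.random.default_rng(3)

# Made-up load measurements from two plants with different calibrations.
plant_a_load = rng.normal(loc=100.0, scale=5.0, size=1000)
plant_b_load = rng.normal(loc=108.0, scale=5.0, size=1000)

# A small p-value indicates the two samples come from different
# distributions, a warning sign before reusing the model across plants.
stat, p_value = ks_2samp(plant_a_load, plant_b_load)
print(f"KS statistic={stat:.3f}, p-value={p_value:.2e}")
```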
Next steps
At this point, it is time to start implementing a prototype of your solution – unless you discovered some surprises along the way that invalidate the approach.
As we said in the first article: measure twice, cut once. Until now, we haven’t spent any time or money on implementation. But now we can, with a good ROI expectation on our AI investment.
This article series has been a whirlwind tour of designing an AI solution, skipping many details for brevity. But although we outlined only a skeleton of a real solution, the principles are the same for a detailed, real-world application. Hopefully, you will now be better equipped to make AI work for you rather than the other way around.
If you want to leverage AI in your business, get in touch with me. I would love to get you started and help you maximize your ROI.