The AI value catalyst step 3: Understanding

In step 2, we designed the simulator which will enable us to improve decision making on predictive maintenance for our wind turbine drivetrains. 

We decided that the simulator should calculate the probability that a given bearing will become worn within a certain timeframe, giving us time to take appropriate action. This allows us to make better decisions, namely to replace the bearing only if the probability exceeds a certain threshold. We will then be freed from having to replace bearings just because the manufacturer’s planned maintenance tables tell us to.

Understanding with machine learning

Machine learning (ML) can be viewed as a subset of AI, though the terms are often used interchangeably. Either way, it enables us to understand complex problems and patterns in an automated fashion. 

Fundamentally, ML proceeds in two steps. First, we design a predictor. A predictor outputs some quantity that tells us what we need to know – we can also think of it as a question-answering machine. In our case, the predictor outputs the probability that the bearing in question is worn. The output will thus be a number from 0 to 1, where 1 means the predictor believes the bearing is almost certainly worn.

The predictor

To produce the probability output that we need, the predictor needs three things:

  1. Inputs
  2. A model
  3. Parameters for the model

The model calculates the probability from the inputs, using the parameters that control the model. But where do the parameters come from?
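Setting that question aside for a moment, the three ingredients above can be sketched in code. This is a minimal illustration only – the model form (a weighted sum squashed to a probability) and the parameter values are made-up assumptions, not the actual bearing-wear model:

```python
import math

def predict(inputs, parameters):
    """Combine inputs via a simple linear model, then squash to [0, 1]."""
    score = parameters["offset"] + sum(
        w * x for w, x in zip(parameters["weights"], inputs)
    )
    # Map the raw score to a probability between 0 and 1.
    return 1.0 / (1.0 + math.exp(-score))

# Illustrative parameter values for three input features.
params = {"offset": -1.0, "weights": [0.5, 0.2, 0.8]}
p = predict([1.0, 2.0, 0.5], params)  # a probability in (0, 1)
```

Note how the same `predict` function answers the question for any inputs, once the parameters are fixed.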

Learning to predict: the trainer

The parameters are learned from the data by the “trainer”. The trainer takes the inputs, the labels – meaning the true probabilities in our case – and a model, and trains the model. “Training” means teaching the model what its parameters should be in order to correctly predict the labels. 

There are different ways of assigning such true probabilities. A simple way would be to set the label to 1 if the part was found on inspection to be worn, and 0 if it was not. We will talk more about this in the next article in this series. 

This is still all quite abstract. To illustrate, consider one of the simplest examples imaginable, namely one-dimensional linear regression:

There’s a lot going on in this diagram – but it illustrates all the most important concepts for the predictor and the trainer. 

The red line is created by the model and thus represents it. In this example we are doing simple regression, trying to predict y from x using our model and its parameters. Our inputs are all the x’s we have observed, and the y’s are our labels.

The model we have chosen here says that y is a linear function of the input, plus “noise”: y = ax + b + z. Here a is the gradient (slope) of the line and b is the offset, i.e. where the line crosses the y axis.

Why do we need the noise term? For two main reasons. First, our observations are generated by some process, such as a sensor measurement. Such measurements are never deterministic – they always come with some uncertainty, some “plus/minus.” Second, our model is wrong to some degree – we generally can never know the perfectly optimal model for any problem. That means our predictions will always be somewhat off from the truth. 

Thus, we have z – a random variable that represents the noise. To keep it simple, we could assume that z follows a normal distribution with zero mean and an unknown variance. Let’s call that variance “c” – that is another parameter. We assume the z’s for all the data points follow the same distribution, so there is only one c.

So a and b are the parameters, along with c – three parameters in total. 

Training takes all the blue points (x’s and y’s) and finds the parameters (a, b, c) that best explain them. This can be done in different ways, such as “gradient descent”, where you start with a random guess for (a, b, c) and take steps that make the error smaller until it is sufficiently small. For a simple model like this one, the optimal parameters can even be computed in a single closed-form calculation.

The predictor can then take a “new” input – which we could call a query – and predict what y would be for that x. 

The predictor can also in many cases provide us with a range of possible predictions – a “confidence interval” – which can be extremely valuable. We will get back to that point later. 
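The whole cycle – train on observed points, then predict for a new query with a rough uncertainty band – can be sketched with NumPy. The data points below are toy numbers lying near y = 2x + 1, not real measurements, and the “±2 standard deviations” interval is a simplification that only accounts for the noise variance c:

```python
import numpy as np

# Toy observations scattered around y = 2x + 1 (made-up numbers).
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = np.array([1.1, 2.9, 5.2, 6.8, 9.0])

# Closed-form least-squares fit for (a, b) in a single calculation.
A = np.vstack([x, np.ones_like(x)]).T
(a, b), *_ = np.linalg.lstsq(A, y, rcond=None)

# Estimate the noise variance c from the residuals.
residuals = y - (a * x + b)
c = residuals.var(ddof=2)  # ddof=2: two parameters already estimated

# Predict y for a new query x, with a rough ~95% interval
# (+/- 2 standard deviations of the estimated noise).
x_query = 2.5
y_hat = a * x_query + b
interval = (y_hat - 2 * np.sqrt(c), y_hat + 2 * np.sqrt(c))
```

A fuller confidence interval would also account for the uncertainty in a and b themselves, but this sketch captures the idea.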

A predictor for bearing wear 

For our predictor, we can use the following inputs:

  1. History of load on the drivetrain, measured in kilowatts
  2. Number of operating hours since the bearing was installed
  3. A record of a vibration signal

Since the history of loads etc. for each bearing is very large, we decide to split time into discrete windows and treat each window as a separate data point. For all the measurements that fall within each window, we will calculate a fixed number of statistics to keep everything manageable and uniform. 

If the window is very short, the statistics in each window will be very noisy, which pushes us towards longer windows. On the other hand, if the window is very wide, the statistics will change too much within each window for the model to learn anything. For example, a short burst of “bad vibrations” inside a long window could go undetected, lost among all the other data in that window. This pushes us towards shorter windows. 

Based on initial analysis of the statistics of the observations, we figure that a one-minute window is a good balance between these two opposing goals.

Our labels are extracted from digital inspection records. If the replaced bearing was found to be somewhat or severely worn, we set the label to 1, otherwise to 0. But which window should we assign this label to?

We don’t know when the bearing became worn exactly, only that it became worn sometime before the inspection. 

After discussion with our expert maintenance engineer we decide that the 48 hours before an inspection revealing a worn bearing will be given the label 1 – meaning “about to break down.”

The 48 hours before that, we will not use because we have no idea what the true state of the bearing was. All windows before that we will use and label as “not about to break down.”

For bearings that were not found to be worn, we will use all preceding windows with the label 0.
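The windowing and labeling rules above can be sketched as follows. The function names, the choice of per-window statistics (mean, spread, peak), and the toy numbers are illustrative assumptions; only the labeling scheme (48 hours of label 1, then 48 hours discarded, label 0 before that) comes from the text:

```python
import statistics

WORN_ZONE = 48 * 3600     # seconds: "about to break down" zone before inspection
UNKNOWN_ZONE = 96 * 3600  # seconds: next 48 h, true state unknown -> discard

def split_into_windows(samples, window_len=60):
    """Split a per-second signal into fixed-length windows (one minute here)."""
    return [samples[i:i + window_len]
            for i in range(0, len(samples) - window_len + 1, window_len)]

def window_stats(samples):
    """A fixed set of per-window statistics (an illustrative choice)."""
    return {
        "mean": statistics.fmean(samples),
        "stdev": statistics.pstdev(samples),
        "max": max(samples),
    }

def label_window(window_end, inspection_time, worn):
    """Return 1, 0, or None (discard) for a window ending at window_end."""
    age = inspection_time - window_end
    if not worn:
        return 0          # bearing was fine at inspection
    if age <= WORN_ZONE:
        return 1          # "about to break down"
    if age <= UNKNOWN_ZONE:
        return None       # true state unknown: discard this window
    return 0              # "not about to break down"
```

Each surviving window then contributes one (statistics, label) pair to the training set.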

A wear prediction model

Now, we can decide on some statistics to define the inputs of each window. This is often called “feature engineering” in ML. We could go for a “deep learning” model where the feature engineering is learned from the data, but to keep it simple as usual, let’s consider some simple yet probably very informative features computed from the window statistics of the three inputs above.

We can build a simple model by first calculating a “score” function,

score = a₀ + a₁x₁ + a₂x₂ + … + aₙxₙ

where each aᵢ is a parameter that determines the “weight” of the corresponding feature xᵢ.

This yields some number, but it is not between 0 and 1 as required. To get that, we apply the sigmoid function, which gives our final output as a probability:

p = 1 / (1 + e^(−score))
We can train this model by minimizing the so-called “cross-entropy” loss function over the entire training data set. Basically, we compare the predicted probabilities to the true (0 and 1) probabilities, i.e. the labels.
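Here is a minimal sketch of that training loop: gradient descent on the cross-entropy loss for the score-plus-sigmoid model. The toy data (a single feature, standardized to zero mean), the learning rate, and the step count are illustrative assumptions:

```python
import math

def sigmoid(s):
    return 1.0 / (1.0 + math.exp(-s))

def train(features, labels, steps=2000, lr=0.5):
    """Learn the offset a0 and weights a_1..a_n by gradient descent."""
    n = len(features[0])
    a0, a = 0.0, [0.0] * n
    for _ in range(steps):
        g0, g = 0.0, [0.0] * n
        for x, y in zip(features, labels):
            p = sigmoid(a0 + sum(w * xi for w, xi in zip(a, x)))
            err = p - y  # gradient of cross-entropy w.r.t. the score
            g0 += err
            for i in range(n):
                g[i] += err * x[i]
        a0 -= lr * g0 / len(features)
        for i in range(n):
            a[i] -= lr * g[i] / len(features)
    return a0, a

# Toy windows with one standardized feature; label 1 = worn.
X = [[-1.0], [-0.8], [-0.6], [0.6], [0.8], [1.0]]
Y = [0, 0, 0, 1, 1, 1]
a0, a = train(X, Y)
```

In practice one would use a library implementation (e.g. logistic regression in scikit-learn), but the mechanics are exactly these.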

After training, we can calculate the prediction accuracy and work out an appropriate threshold for the output. For example, we might decide that whenever the probability given by the predictor is more than 0.5 (50%), we should replace the bearing.

There are many technical details in building a good predictor and trainer, which are beyond the scope of this article. Hopefully, you now have some feel for what is involved and why it is important to know what you need before you start digging into the technicalities.

What if the predictor is wrong?

A key to understanding ML is to realize that it deals with stochastic, or random, variables. That’s not a bad thing – it’s what we humans do all the time. When we look out the window to see if it’s going to rain, our estimate – based on cloud cover, the darkness of said clouds, etc. – is not binary. Rather, we “think it’s probably going to rain”, or we “think it’s not going to rain”. 

Our decision making depends on more than just whether we think it’s going to rain – it also depends on how certain we are. Let’s say you are about to arrange a very expensive outdoors photo shoot. If it rains, you will have to stop the shoot and you will have wasted a significant amount of money. Therefore, even if you “think it’s not going to rain”, if you think that this prediction is uncertain, that there is some chance it might rain, you would be better off postponing the shoot rather than risking wasting a lot of cash. 

Luckily, machine learning can provide you with this uncertainty. We can start with a simple predictor that just outputs a number between 0 and 1, representing the probability of the bearing being worn. But if the output is 0.1, say, and we don’t replace the part, it might turn out that this number was quite uncertain – it had high variance; see the confidence interval in the illustration above. We might therefore be making a mistake, and the bearing might break down, costing us a fortune. 

The figure below illustrates two predictors that give us exactly the same single probability that the bearing is worn. The blue predictor is pretty certain – the probability that it is wrong by a lot is small. The red predictor, while giving us the same single probability, is much less certain. In fact, the true probability could be very high. 

I will leave it for another post to dig further into this subject of decision making under risk. Suffice it to say for now that without this additional layer of information about uncertainty, you should be very careful when applying your predictor to your decision making. One simple way of being careful is to set a lower threshold. For example, you could decide to replace – or at least inspect – the bearing if the output probability rises above 50% or even 25%. Of course, that would not give you the same cost saving. Which goes to show how important the uncertainty of the prediction is. 
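One simple way to make such caution explicit is to act on a pessimistic upper bound of the predicted probability rather than the point estimate. The function below is an illustrative sketch; the interval width (here, a standard deviation supplied by the predictor) and the caution factor are assumptions:

```python
def should_replace(p_hat, p_std, threshold=0.5, caution=2.0):
    """Replace if the point estimate, pushed up by `caution` standard
    deviations of its own uncertainty, crosses the threshold."""
    return p_hat + caution * p_std >= threshold

# Same point estimate (0.30), different uncertainty -> different decisions.
confident = should_replace(0.30, 0.02)  # upper bound 0.34: keep running
uncertain = should_replace(0.30, 0.15)  # upper bound 0.60: replace
```

The confident predictor lets the bearing run; the uncertain one, with the very same 30% point estimate, triggers a replacement.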

My advice would be to always put in the extra work to obtain not just an output – whether it is a probability or any other quantity – but also the uncertainty. 

The next step

Note how starting at the end – our decision making – has led us to focus our modeling on a very specific problem, making us much more effective and efficient than if we had started with all available data, creating all kinds of “interesting” models.

Now that we have designed our predictor and trainer we have a good handle on the “understand” step. Next we must dig deeper into the data, which presents many challenges. That is the subject of the next article in this series – stay tuned!

In the meantime – if you’d like my help with leveraging AI strategically for your business – and your career – contact me.
