What went wrong: Root cause analysis in complex manufacturing

Sometimes, things happen that we don’t understand. No chance of rain, they said, but still the rains came pouring down just after you finished painting the house. Or, the reject ratio for a particular product suddenly increased by 25% – goodbye, quarterly bonus. Why did this happen?

To seek such answers is to do root cause analysis, or RCA. RCA is closely related to forecasting, two uses of which are demand forecasting and predicting mechanical breakdowns (predictive maintenance). It is also closely related to the optimization of business processes that we build on top of those forecasts.

In forecasting, our aim is to predict the future. What will demand be? When will this component break down?

In optimization, we go one step further and ask: what should I do to obtain the best possible future? 

“The best way to predict the future is to invent it.”

Alan Kay, computer scientist

At the pointy end of the supply chain, this is called demand shaping. We can optimize promotions, for example, to increase demand and maximize long-term profits. In predictive maintenance, we optimize the timing of replacing parts, based on the predictive model that tells us the likelihood of breakdown at each future time point.

But sometimes, our predictions fail. Things either go surprisingly well – lucky us – or surprisingly badly – goodbye, bonus. And so we must figure out why, so that we can fix the issue and stop the damage. Instead of predicting the future from the past, we seek to find what, in the past, created the unfortunate situation that we now find ourselves in.

In manufacturing, RCA is a method of identifying the causes of errors or quality deviations in the manufactured product, drops in overall equipment effectiveness, and so on.

Two important RCA concepts are symptoms and failures. Root causes lead to internal failures, which typically can’t be directly observed. But we can observe symptoms, which are caused by failures. And we can observe the faults in the final product. RCA is then deducing or inferring the root cause of these faults from the symptoms and other observations of the manufacturing process. 

Traditional RCA won’t cut it

In complex manufacturing process chains, traditional approaches to RCA like FTA (Fault Tree Analysis) and FMEA (Failure Modes and Effects Analysis) fall short. They rely on human expertise and that is problematic in several ways. Human experts can be biased, they often don’t agree with each other, and thus the outcome of their analysis isn’t consistent and reproducible. As manufacturing processes become more complex and dynamic, these issues are exacerbated. 

Also, manual RCA is neither automatic nor real-time, which means more damage and lost revenue until the problem is fixed.

Such expert knowledge can, however, form a key ingredient in a successful machine learning approach to RCA. That’s one of the wonderful qualities of machine learning: it enables the optimal utilization of human domain expertise by combining it with learning from data. 

Improving RCA with Bayesian networks 

There are many different machine learning approaches to RCA; see [1] for a review. A promising and disciplined approach is using so-called Bayesian networks. A good paper on this is [2]. I’ll give you a high-level review, following this paper.

Here are a couple of minimalistic Bayesian Networks:

If the grass is wet – which we may consider a fault or a symptom – it could be because the sprinkler was on, or because it rained recently. If we are interested in the sprinkler as a possible root cause, then we need the correct structure: the sprinkler turns on only when it has not been raining, and both rain and the sprinkler make the grass wet. Hence the directions of the three arrows.

In an (unrealistically) simple manufacturing equivalent, poor product quality could be caused by bad input materials, or the wrong oven temperature. The wrong material could affect oven temperature, and we end up with the same conceptual structure as in the sprinkler example.

Even though these examples are extremely simple, we have two potential root causes and it is not obvious from our observations (wet grass/poor quality) which one is at fault. 

A Bayesian network (BN) describes the relation between the variables that we are interested in. Each node is a variable and the arrows are dependencies. For example, the state of the sprinkler depends on whether it’s raining or not, but not the other way around – hence the arrow pointing from rain to sprinkler. 
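
To make this concrete, here is a minimal Python sketch of how such a structure could be written down. The dictionary layout and the names sprinkler_net and topological_order are my own illustration, not something taken from [2]. Each variable lists its parents, and the standard library module graphlib (Python 3.9 or later) checks that the arrows never form a loop, which a Bayesian network does not allow.

```python
from graphlib import TopologicalSorter

# Each node maps to its parents, i.e. the nodes whose arrows point into it.
# (Illustrative layout; the variable names are assumptions.)
sprinkler_net = {
    "rain": [],
    "sprinkler": ["rain"],
    "wet_grass": ["rain", "sprinkler"],
}


def topological_order(parents):
    """Return the nodes so that every parent comes before its children.

    Raises graphlib.CycleError if the arrows form a cycle.
    """
    return list(TopologicalSorter(parents).static_order())


order = topological_order(sprinkler_net)
print(order)
```

The manufacturing example would look the same, with material, oven temperature and product quality in place of rain, sprinkler and wet grass.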

Some variables are observed – if the grass is wet, if it is raining – others are hidden. For example, it may be the case that we can’t observe the sprinkler and we therefore seek to infer if it was active or not, based on what we can observe. Or, we can observe the poor quality of the product and seek to find out if it was caused by poor materials or the wrong oven temperature, neither of which we can observe.
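
As a rough illustration of inferring a hidden variable, the sketch below computes the probability that the sprinkler was on, given that the grass is wet. The probabilities are made-up, textbook-style numbers, not measurements, and the brute-force enumeration in the hypothetical posterior function only works for tiny networks. Real tools use smarter algorithms.

```python
from itertools import product

# Assumed, illustrative probabilities (not from the article or from [2]).
P_RAIN = 0.2
P_SPRINKLER_GIVEN_RAIN = {True: 0.01, False: 0.4}
P_WET_GIVEN = {  # keyed by (sprinkler, rain)
    (True, True): 0.99,
    (True, False): 0.9,
    (False, True): 0.8,
    (False, False): 0.0,
}


def bernoulli(p_true, value):
    """Probability of a binary value when P(value is True) = p_true."""
    return p_true if value else 1.0 - p_true


def joint(rain, sprinkler, wet):
    """Joint probability, factored along the arrows of the network."""
    return (bernoulli(P_RAIN, rain)
            * bernoulli(P_SPRINKLER_GIVEN_RAIN[rain], sprinkler)
            * bernoulli(P_WET_GIVEN[(sprinkler, rain)], wet))


def posterior(query, evidence):
    """P(query is True | evidence), summing the joint over consistent worlds."""
    names = ("rain", "sprinkler", "wet")
    numerator = denominator = 0.0
    for values in product([True, False], repeat=len(names)):
        world = dict(zip(names, values))
        if any(world[name] != value for name, value in evidence.items()):
            continue  # this world contradicts what we observed
        p = joint(**world)
        denominator += p
        if world[query]:
            numerator += p
    return numerator / denominator


p_sprinkler = posterior("sprinkler", {"wet": True})
p_rain = posterior("rain", {"wet": True})
p_sprinkler_given_rain = posterior("sprinkler", {"wet": True, "rain": True})
print(round(p_sprinkler, 3), round(p_rain, 3), round(p_sprinkler_given_rain, 3))
```

With these assumed numbers, wet grass alone points to the sprinkler more than to the rain. Once we also observe that it rained, the sprinkler becomes a very unlikely explanation.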

There are three machine learning tasks that we will need to perform with our BNs:

  • Structure learning. Which nodes should we include and what are their relations?
  • Parameter learning. Given a structure, what should the parameters be? That is, what are the relationships and how strong are they?
  • Inference. Given a structure and parameters, what are the values of the variables? Was the sprinkler active?

Structure learning is difficult, but a good solution can be to manually design the structure based on human domain expertise. Of the three tasks, this one is relatively easy for humans and hard for computers. Parameter learning and inference are hard for computers too, but humans can’t cope with them at all for complex processes.

Bayesian networks for manufacturing

A somewhat more realistic sketch of what this could look like in a manufacturing process is shown below.

Adapted from [2]

Here, input materials go through four process steps – P1 through P4 – to produce an output. For each step, we have defined several key variables and their causal relations that together capture what we need for RCA. I’ve added some nodes in gray to represent variables that are neither root causes nor symptoms. Although ignored in [2], these are often useful and necessary to capture the dynamics correctly. Some of these may be observed, others hidden.

Note that the root causes are not observable, by definition. If they were, we could simply fix them as soon as they appear. 

At the end of the process, we observe two potential faults, F1 and F2. 

How to implement intelligent RCA

First of all, you need to build a foundation of good data. The more you can measure, and the more accurately you can do it, the better your chances of building a trustworthy RCA solution with machine learning. 

The most important data are labels of root causes, symptoms and failures. That is, you need a set of data with known failures and known root causes. The root cause nodes need to be defined as binary variables – they are either true or false. For example, a potential cause could be “Oven seal #2 is broken”. 

To get this data requires some work. When a fault occurs, you need to inspect the value of each possible root cause. Was that oven seal broken or not? This takes time and money, but if you can get a robust RCA tool out of it, it is very likely going to give you a great ROI on this investment. 

In addition to these labels, you also need a continuous set of measurements – both for periods with faults, and without. These observations can include:

  • Vibrations
  • Noise/sound
  • Temperatures
  • Settings, e.g. for temperature control

These probably contain valuable information that will allow the BN algorithms to determine which root cause caused the fault. Some might even be classed as symptoms, because they are directly and strongly influenced by root causes. 

As shown in [2], realistic manufacturing process networks can require hundreds of thousands of data points to infer the correct BN structure. Starting with a structure defined by human experts and then refining it makes this an order of magnitude easier – so that is definitely recommended!

The next step is learning the correct relationships – that is, the parameters of the model. These parameters control the conditional probabilities in the network. For example, given that oven seal #2 is broken, what is the probability that oven temperature will be too low?  
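
When every variable involved is observed, the simplest version of this step is just counting. The sketch below estimates the probability of a low oven temperature given the state of the seal from a handful of hypothetical inspection records. The records, the function name estimate_cpt and the smoothing constant alpha are assumptions for illustration. Adding alpha to the counts corresponds to a simple Bayesian estimate with a uniform prior when alpha is 1, and it keeps rare combinations from getting a probability of exactly zero.

```python
from collections import Counter

# Hypothetical labeled records: (seal_broken, temperature_too_low).
records = [
    (True, True), (True, True), (True, True), (True, False),
    (False, False), (False, False), (False, False), (False, False),
    (False, True), (False, False),
]


def estimate_cpt(records, alpha=1.0):
    """Estimate P(child is True | parent) with add-alpha smoothing."""
    counts = Counter(records)
    cpt = {}
    for parent in (True, False):
        n_true = counts[(parent, True)]
        n_total = n_true + counts[(parent, False)]
        # The child has two possible values, so alpha enters the total twice.
        cpt[parent] = (n_true + alpha) / (n_total + 2 * alpha)
    return cpt


cpt = estimate_cpt(records)
print(cpt)
```

With these ten made-up records, the estimate comes out at about 0.67 when the seal is broken and 0.25 when it is intact.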

This is a matter of applying the appropriate machine learning algorithm, such as Bayesian inference, to the combination of model structure and data. If the model contains variables that can’t be observed, there are algorithms for that, too, like EM (expectation maximization) – though they are more complex. 
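
To give a feel for EM, here is a minimal sketch for the smallest interesting case: one hidden binary root cause and three observed binary symptoms that are assumed independent given the cause. The data, the starting values and the function names are all hypothetical, and this is a toy model rather than the method used in [2]. The E-step computes how likely the hidden cause is for each row, and the M-step re-estimates the parameters from those soft counts.

```python
import math

# Hypothetical rows of three binary symptoms, e.g. (vibration high,
# noise high, temperature low). The root cause itself is never observed.
DATA = [
    (1, 1, 1), (1, 1, 1), (1, 1, 0), (1, 0, 1), (0, 1, 1), (1, 1, 1),
    (0, 0, 0), (0, 0, 0), (0, 0, 1), (0, 1, 0), (0, 0, 0), (1, 0, 0),
    (0, 0, 0), (0, 0, 0),
]
EPS = 1e-6  # keeps probabilities away from 0 and 1 so the logs stay finite


def clamp(p):
    return min(max(p, EPS), 1.0 - EPS)


def row_likelihood(row, thetas):
    """P(row | hidden state), with symptoms independent given that state."""
    p = 1.0
    for x, theta in zip(row, thetas):
        p *= theta if x else 1.0 - theta
    return p


def em(data, pi, theta_on, theta_off, iterations=50):
    """Fit P(cause) and P(symptom j | cause) for one hidden binary cause."""
    n_symptoms = len(data[0])
    history = []  # log-likelihood before each update
    for _ in range(iterations):
        # E-step: P(cause present | row) under the current parameters.
        resp, log_lik = [], 0.0
        for row in data:
            p_on = pi * row_likelihood(row, theta_on)
            p_off = (1.0 - pi) * row_likelihood(row, theta_off)
            resp.append(p_on / (p_on + p_off))
            log_lik += math.log(p_on + p_off)
        history.append(log_lik)
        # M-step: re-estimate the parameters from the soft counts.
        total_on = sum(resp)
        total_off = sum(1.0 - r for r in resp)
        pi = clamp(total_on / len(data))
        theta_on = [
            clamp(sum(r * row[j] for r, row in zip(resp, data)) / total_on)
            for j in range(n_symptoms)
        ]
        theta_off = [
            clamp(sum((1.0 - r) * row[j] for r, row in zip(resp, data)) / total_off)
            for j in range(n_symptoms)
        ]
    return pi, theta_on, theta_off, history


pi, theta_on, theta_off, history = em(
    DATA, pi=0.5, theta_on=[0.7, 0.7, 0.7], theta_off=[0.3, 0.3, 0.3]
)
print(round(pi, 3), [round(t, 3) for t in theta_on])
```

Note that with no labels at all, which of the two hidden states counts as "cause present" is decided only by the starting values. In practice, the labeled inspections described above are what tie the hidden variable to a real root cause.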

Finally, you do the actual inference on the root causes, using the structure and parameters that have been learned. This uses a different type of algorithm, such as belief propagation. The output of this is the probability that each root cause was at fault.

At this point, you have the foundation for a great RCA decision support tool. 

As always, make sure to continuously monitor the quality of these RCA models. Be especially careful if the manufacturing process is modified from what the model has observed during training. For example, if a different type of input material is now used, the model may not perform reliable RCA any longer. Keep measuring everything so that you can update the model regularly to maintain a robust decision support tool. 

A good RCA decision support tool can save your manufacturing company millions on the bottom line. In addition, you can probably leap ahead of your competition because they may not (yet) have adopted intelligent RCA into their processes.

If you would like to start tapping into this potential, I am here to help. Schedule a free consult today, or call me on +45 51 64 66 91 to get started. 


[1] Survey on Models and Techniques for Root-Cause Analysis, Marc Solé et al., 2017, https://arxiv.org/abs/1701.08546

[2] Root cause analysis of failures and quality deviations in manufacturing using machine learning, Lokrantz et al., 2018, https://www.sciencedirect.com/science/article/pii/S2212827118303895
