20  Process optimization on sensor data via EDA

In this section, we will have a look at a more realistic example of industrial data.

20.1 Setting the stage

Imagine a manufacturing chain which is dedicated to producing knives with wooden handles. Various machines are interconnected and the process flow is as follows:

  • Machine A prepares the steel blade and shaft.
  • Machine B applies epoxy to the grip area.
  • Machine C inserts the wooden handle material.
  • Machine D is a curing oven, hardening the adhesive.
  • Machine E coats the knife with protective finish and completes the manufacturing process.

Operators notice that the process is not running as smoothly as expected. The handle is not always properly bonded to the metal piece and sometimes the wood is cracked. They find that the root cause of the issue lies in the epoxy application step of Machine B. Common issues include excessive or insufficient epoxy application, uneven distribution, and occasional poor adhesion to the shaft. One clear economic impact is that these defective products cannot be sold. But there is more to it:

  • Waste of material is not sustainable.
  • The defect actually already happens in Machine B, but Machines C to E still process the faulty products, leading to further waste and costs, especially since the wooden handle is the most expensive part of the product and the curing oven is very energy-intensive.

Since there is a new department focused on data science, they decide to manually inspect the epoxy application right after Machine B and label the data accordingly. Additionally, the machine collects some sensor data in a CSV file. This dataset is given to the data science team, and in close collaboration with the operators the ambitious goals are to:

  • Optimize the epoxy application process to reduce defects and waste.
  • Reduce inspection effort: human inspection is time-consuming and expensive, so find a way to predict defects occurring in Machine B before the product is handed over to Machine C.

Of course, this data science department is us.

20.2 Dataset

The dataset contains sensor readings of a simulated industrial process and consists of two files:

  • simulated_machine_data.csv stores the sensor data
  • simulated_inspection_data.csv stores the human inspection results

The sensor data includes the following features at a 200 ms resolution:

  • air_temperature_C: Ambient air temperature in degrees Celsius.
  • machine_temperature_C: Temperature of the machine in degrees Celsius.
  • pressure_measured_bar: Measured pressure in bar.
  • pressure_setting_bar: Pressure setting in bar.
  • process_speed_mm_s: Set speed of the process in millimeters per second.
  • process_step_index: Index indicating the current step within the process. The machine performs three processing steps: moving the nozzle to the shaft, applying epoxy, and blowing out residual epoxy from the nozzle while moving away.
  • product_id: Unique identifier for each product.
  • timestamp: Timestamp of the sensor reading.

The inspection data includes the product_id and the error_code:

  • error_code: Code indicating the type of error detected during inspection.
  • product_id: Unique identifier for each product.

The inspectors used OK when there was no defect, and NOK_1 (excessive epoxy), NOK_2 (insufficient epoxy), NOK_3 (poor adhesion), and NOK_4 (uneven distribution) for the four defect types.
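For readability in later tables and plots, these codes can be kept in a small lookup. A minimal sketch, where the dictionary name and label wording are our own choice, not part of the dataset:

```python
# Hypothetical lookup from error codes to human-readable labels,
# following the descriptions given by the inspectors.
ERROR_CODE_LABELS = {
    "OK": "no defect",
    "NOK_1": "excessive epoxy",
    "NOK_2": "insufficient epoxy",
    "NOK_3": "poor adhesion",
    "NOK_4": "uneven distribution",
}

print(ERROR_CODE_LABELS["NOK_1"])  # -> excessive epoxy
```

Such a mapping could later be passed to, e.g., `Series.map` to produce readable legend entries.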

20.3 Additional information

The operators state that the epoxy application process is highly sensitive to variations in environmental conditions, such as temperature and humidity. They believe that incorporating additional sensor data, such as humidity levels, could further enhance understanding of the process. Also, the pressure settings are critical, since they directly influence the amount of epoxy applied to the product. The first processing step ensures the correct positioning of the nozzle and the final processing step is supposed to clean the nozzle, preparing it for the next application.

20.4 Exploratory Data Analysis

20.4.1 Loading packages

import pandas as pd  # used for data handling
import seaborn as sns  # used for statistical data visualization
import plotly.express as px  # used for performant plotting
import plotly.io as pio  # used to set the default plotly renderer

pio.renderers.default = (
    "notebook"  # set the default plotly renderer to "notebook" (necessary for quarto to render the plots)
)

20.4.2 Loading the dataset

The dataset is available online in the form of two CSV files.

It was released at Ehrensperger (2026), but you can also clone the source repository from GitHub.

df_processes = pd.read_csv(
    "https://github.com/noxthot/industrial_datascience_sim_machine_knifes/raw/refs/tags/v1.0.2/data/simulated_machine_data.csv"
)
df_inspections = pd.read_csv(
    "https://github.com/noxthot/industrial_datascience_sim_machine_knifes/raw/refs/tags/v1.0.2/data/simulated_inspection_data.csv"
)

20.4.3 Quick check data integrity

df_processes
nr_iteration process_step_index timestamp air_temperature_C process_speed_mm_s pressure_measured_bar pressure_setting_bar machine_temperature_C
0 0 0 2024-06-01 21:00:00.000 17.411810 10 14.221200 5 17.413006
1 0 0 2024-06-01 21:00:00.200 17.411810 10 10.564433 5 17.412700
2 0 0 2024-06-01 21:00:00.400 17.411810 10 10.054016 5 17.413947
3 0 0 2024-06-01 21:00:00.600 17.411810 10 10.103557 5 17.414270
4 0 0 2024-06-01 21:00:00.800 17.411810 10 11.594134 5 17.415649
... ... ... ... ... ... ... ... ...
199995 4999 2 2024-06-02 08:06:39.000 20.261769 20 24.946507 20 45.459340
199996 4999 2 2024-06-02 08:06:39.200 20.261769 20 26.045240 20 45.461210
199997 4999 2 2024-06-02 08:06:39.400 20.261769 20 30.247436 20 45.459048
199998 4999 2 2024-06-02 08:06:39.600 20.261769 20 30.317732 20 45.455973
199999 4999 2 2024-06-02 08:06:39.800 20.261769 20 27.688261 20 45.455442

200000 rows × 8 columns

df_inspections
timestamp error_code
0 2024-06-01 21:00:10.068501 OK
1 2024-06-01 21:00:20.934694 OK
2 2024-06-01 21:00:25.997543 OK
3 2024-06-01 21:00:36.090388 OK
4 2024-06-01 21:00:44.213738 OK
... ... ...
4995 2024-06-02 08:06:10.582952 OK
4996 2024-06-02 08:06:19.236113 NOK_4
4997 2024-06-02 08:06:25.214726 OK
4998 2024-06-02 08:06:36.041731 OK
4999 2024-06-02 08:06:44.301083 OK

5000 rows × 2 columns

if (
    df_processes.isna().sum().sum() > 0
    or df_inspections.isna().sum().sum() > 0
    or df_processes.duplicated().sum().sum() > 0
    or df_inspections.duplicated().sum().sum() > 0
):
    raise ValueError("Missing values or duplicates found in the dataframes.")
else:
    print("No missing values or duplicates found in the dataframes.")
No missing values or duplicates found in the dataframes.
df_processes.describe().T
count mean std min 25% 50% 75% max
nr_iteration 200000.0 2499.500000 1443.379253 0.000000 1249.750000 2499.500000 3749.250000 4999.000000
process_step_index 200000.0 0.625000 0.856959 0.000000 0.000000 0.000000 1.250000 2.000000
air_temperature_C 200000.0 13.240958 2.849175 10.000000 10.648648 12.539426 15.382514 20.261769
process_speed_mm_s 200000.0 11.375000 5.764941 1.000000 10.000000 10.000000 12.500000 20.000000
pressure_measured_bar 200000.0 27.621157 30.727644 2.112853 11.493736 13.658634 27.004350 114.615669
pressure_setting_bar 200000.0 20.625000 30.663318 5.000000 5.000000 5.000000 20.000000 100.000000
machine_temperature_C 200000.0 37.053597 4.185691 17.412700 36.009991 36.908348 38.663800 45.538659
df_processes.dtypes
nr_iteration               int64
process_step_index         int64
timestamp                 object
air_temperature_C        float64
process_speed_mm_s         int64
pressure_measured_bar    float64
pressure_setting_bar       int64
machine_temperature_C    float64
dtype: object
df_inspections.dtypes
timestamp     object
error_code    object
dtype: object
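Note that both timestamp columns are stored as plain strings (`object`). A minimal sketch of converting them to proper datetimes and checking the stated 200 ms sampling interval, shown here on a small toy frame with the same timestamp format as the CSV (the real `df_processes` would be converted the same way):

```python
import pandas as pd

# Toy stand-in for df_processes with the timestamp format seen above.
df = pd.DataFrame(
    {
        "timestamp": [
            "2024-06-01 21:00:00.000",
            "2024-06-01 21:00:00.200",
            "2024-06-01 21:00:00.400",
        ]
    }
)

df["timestamp"] = pd.to_datetime(df["timestamp"])  # object -> datetime64[ns]

# Consecutive readings should be 200 ms apart.
deltas = df["timestamp"].diff().dropna()
print(deltas.unique())  # expect a single 200 ms step
```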
df_processes.nunique()
nr_iteration               5000
process_step_index            3
timestamp                200000
air_temperature_C           501
process_speed_mm_s            3
pressure_measured_bar    200000
pressure_setting_bar          3
machine_temperature_C    200000
dtype: int64
df_inspections.nunique()
timestamp     5000
error_code       5
dtype: int64
df_inspections["error_code"].value_counts()
error_code
OK       4628
NOK_4     296
NOK_1      70
NOK_3       5
NOK_2       1
Name: count, dtype: int64

NOK_2 and NOK_3 hardly ever occur. Let us mentally drop these error codes for this investigation, since <= 10 samples provide a very weak statistical basis and tend to distort plots.
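If we wanted to drop them in code rather than mentally, a short sketch on a toy series shows the pattern (the real `df_inspections["error_code"]` would be filtered the same way; the variable names are our own):

```python
import pandas as pd

RARE_CODES = ["NOK_2", "NOK_3"]  # <= 10 samples each in this dataset

# Toy stand-in for df_inspections["error_code"].
labels = pd.Series(["OK", "NOK_4", "NOK_2", "OK", "NOK_3", "NOK_1"])

kept = labels[~labels.isin(RARE_CODES)]
print(kept.tolist())  # -> ['OK', 'NOK_4', 'OK', 'NOK_1']
```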

20.4.4 Exploring continuous variables by simple plots

Here, we visualize the continuous variables (one by one) against the timestamp to get an idea of their temporal behavior.

CONTINUOUS_VARS = [
    "air_temperature_C",
    "process_speed_mm_s",
    "pressure_measured_bar",
    "pressure_setting_bar",
    "machine_temperature_C",
]
df_filtered = df_processes.query("nr_iteration < 1000").sort_values(
    "timestamp"
)  # for performance reasons, we filter to the first 1000 cycles.

fig = px.line(
    df_filtered,
    x="timestamp",
    y=CONTINUOUS_VARS,
    facet_col="variable",
    facet_col_wrap=1,
    title="Continuous Variables Over Time",
)

fig.update_xaxes(matches="x")  # share x-axis zoom/pan
fig.update_yaxes(matches=None)  # do not share y-axis limits
fig.update_layout(height=1200)  # increase figure height (y axis size)

fig.show()

The plot is interactive. Zoom in to see more details. Observe that the set pressure is constant over time (within a processing step), while the measured pressure varies around the set pressure and also shows some systematic offset.

Let us investigate this difference between set and measured pressure in more detail.

df_processes["pressure_difference_bar"] = df_processes["pressure_measured_bar"] - df_processes["pressure_setting_bar"]
sns.histplot(data=df_processes, x="pressure_difference_bar", hue="process_step_index", bins=50)

It appears that the pressure difference is roughly normally distributed and that its mean is similar across the three processing steps.

Absolute counts are visually harder to compare, so let us use relative frequencies instead and normalize each of the processing steps separately.

sns.histplot(
    data=df_processes,
    x="pressure_difference_bar",
    hue="process_step_index",
    bins=50,
    common_norm=False,
    stat="percent",
)  # common_norm is set to False to normalize each processing step separately, stat is set to "percent" to show relative frequencies in percent.

Apparently the pressure difference has a similar mean and standard deviation across the three processing steps. We can also show this by calculating the mean and standard deviation of the pressure difference for each processing step.

df_processes.groupby("process_step_index")["pressure_difference_bar"].mean()
process_step_index
0    6.992650
1    6.990845
2    7.007579
Name: pressure_difference_bar, dtype: float64
df_processes.groupby("process_step_index")["pressure_difference_bar"].std()
process_step_index
0    1.990051
1    1.997792
2    2.015344
Name: pressure_difference_bar, dtype: float64

We conclude that the pressure is not perfectly controlled (or measured): it shows a systematic offset of roughly 7 bar and a standard deviation of roughly 2 bar.

fig = px.line(
    df_processes.sample(10_000).sort_values(
        "timestamp"
    ),  # for performance reasons, we sample 10,000 rows. Always remember to sort by timestamp after sampling.
    x="timestamp",
    y=["machine_temperature_C", "air_temperature_C"],
    labels={"value": "Temperature (°C)", "timestamp": "Timestamp", "variable": "Type"},
    title="Machine and Air Temperature Over Time",
)
fig.show()

We observe that the machine is in a cold state when the data collection starts and heats up to a more stable plateau, where it still seems to follow the shape of the air temperature.
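If this cold-start phase should be excluded from later analysis, one simple heuristic (our own assumption, not part of the original workflow) is to drop all rows before the machine temperature first reaches a chosen plateau threshold. A sketch on a toy trace, with the threshold value purely hypothetical:

```python
import pandas as pd

# Toy machine-temperature trace: cold start, then a warm plateau.
df = pd.DataFrame({"machine_temperature_C": [17.4, 22.0, 30.5, 36.1, 36.9, 37.2]})

WARM_THRESHOLD_C = 35.0  # hypothetical cutoff for "warmed up"

# Index of the first reading at or above the threshold
# (idxmax on a boolean Series returns the first True position).
warm_idx = (df["machine_temperature_C"] >= WARM_THRESHOLD_C).idxmax()
df_warm = df.loc[warm_idx:]
print(len(df_warm))  # number of rows on the warm plateau
```

On the real data one would of course inspect the plot above before fixing such a threshold.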

20.4.5 Aggregating plots

We are looking at iterations of the same, identical process. Let us look at a plot where the processes overlap each other, colored by error code.

First, we create a new index which starts at 0 for each piece that is manufactured and counts each timestamp within that process step. This index will be useful for plotting the different process iterations on top of each other.

df_processes["process_inner_index"] = df_processes.sort_values("timestamp").groupby("nr_iteration").cumcount()
# Merge error_code from df_inspections into df_processes: 'nr_iteration' matches the inspection row index. This assigns an error code to the entire process iteration.
df_plot = df_processes.merge(df_inspections, left_on="nr_iteration", right_index=True, how="left")

fig = px.line(
    df_plot,
    x="process_inner_index",
    y="pressure_measured_bar",
    line_group="nr_iteration",
    color="error_code",
    title="Pressure per Iteration (Overlapped, Colored by Error Code)",
    labels={
        "process_inner_index": "Inner-process timestamp",
        "pressure_measured_bar": "Pressure (bar)",
        "error_code": "Error Code",
    },
)

# Set opacity: 0.01 for 'OK', 0.5 for others.
# Setting opacity is always a bit tricky: too low makes the lines almost invisible, too high makes the plot too crowded.
# These settings were found by trial and error and work well for this case.
for trace in fig.data:
    if trace.name == "OK":
        trace.opacity = 0.01
    else:
        trace.opacity = 0.5

# Unselect NOK_2 and NOK_3 by default, since they are very rare and would clutter the plot.
for i, trace in enumerate(fig.data):
    if trace.name in ["NOK_2", "NOK_3"]:
        fig.data[i].visible = "legendonly"

fig.update_layout(showlegend=True)
fig.show()

Keep in mind that you can (de)select lines to plot when clicking on the according legend entry. Plotly also allows you to zoom in and pan around.

From the plot, we can see that the pressure during the second processing step is elevated for NOK_1, while NOK_4 shows no obvious difference from OK.

With this in mind, let us have a closer look at the mean pressure and its standard deviation per error code and processing step.

# Aggregate: mean and std for each process_inner_index and error_code
agg_df = (
    df_plot.groupby(["process_inner_index", "error_code"])["pressure_measured_bar"].agg(["mean", "std"]).reset_index()
)

fig = px.line(
    agg_df,
    x="process_inner_index",
    y="mean",
    color="error_code",
    error_y="std",
    labels={
        "process_inner_index": "Inner-process timestamp",
        "mean": "Pressure (bar)",
        "error_code": "Error Code",
    },
    title="Mean Pressure per Error Code with Deviation",
)

# Unselect NOK_2 and NOK_3 by default
for i, trace in enumerate(fig.data):
    if trace.name in ["NOK_2", "NOK_3"]:
        fig.data[i].visible = "legendonly"

fig.update_layout(legend_title_text="Error Code")
fig.show()

This plot confirms our previous observation: NOK_1 has a systematically higher mean pressure during the second processing step, while NOK_4 does not differ notably from OK.

While the first line plot shows every individual process iteration as a single line and therefore does not accidentally filter out relevant information, it is a bit crowded and hard to read. The second plot aggregates the individual lines into a mean and standard deviation, losing some information but making the figure easier to comprehend.

Now let us also check whether the machine temperature has an influence on failures.

df_agg = df_plot.groupby(["nr_iteration", "error_code"])["machine_temperature_C"].agg(["mean"]).reset_index()

fig = px.box(
    df_agg,
    x="error_code",
    y="mean",
    title="Machine Temperature by Error Code",
    labels={"mean": "Mean Machine Temperature (°C)", "error_code": "Error Code"},
)

fig.show()

Here too, NOK_1 and NOK_4 show no notable difference from OK, while NOK_2 and NOK_3 cannot be judged well due to their low sample size.

20.4.6 Summary of findings

Let us summarize our findings so far in this dataset:

  • The pressure during the second processing step is elevated for NOK_1 compared to OK.
  • The machine temperature does not show a notable difference between OK and NOK_x.
  • The source of NOK_4 failures was not identified in this analysis.
  • The error codes NOK_2 and NOK_3 are too rare to draw any conclusions.

Chapter 22 will dive deeper into the data and guide you through the process of optimizing the manufacturing process based on these findings.