15 Self study - Session 1
In this session, we will revisit Chapter 10 and Chapter 11.
In Chapter 10, we explored different methods for handling missing data in time series and saw that the choice of imputation method can significantly affect the results.
The data set for this session is a synthetic time series that simulates the weekly production of a manufacturing plant over 20 years. Unfortunately, some data points are missing because of frequent outages in the data collection system, and temporary errors in the counting process occasionally produce huge outliers.
Your task is to impute the missing values and handle the outliers appropriately, so that the data is ready for a 20-year jubilee management report.
First, download the data set from here.
Here are some useful helper functions:
import plotly.graph_objects as go
import plotly.subplots as sp  # provides make_subplots, used in seasonal_decompose_plotly


def plot_production_data(
    df,
    orig_data_column,
    gap_column,
    imp_column=None,
):
    """
    Plots production data with and without gaps using Plotly.

    Parameters:
        df (DataFrame): DataFrame containing the production data.
        orig_data_column (str): Column name for the original data.
        gap_column (str): Column name for the data with gaps.
        imp_column (str): Column name for the imputed data.
    """
    if imp_column is not None:
        dff = df[[gap_column, imp_column]].copy()
        dff.loc[~df[gap_column].isna(), imp_column] = None
        # Only create the upper subplot (single plot)
        fig = go.Figure()
        # Original data
        fig.add_trace(
            go.Scatter(
                x=df.index,
                y=df[orig_data_column],
                mode="lines",
                name="Original",
                line=dict(color="lightgrey"),
            )
        )
        # With gaps
        fig.add_trace(
            go.Scatter(
                x=df.index,
                y=df[gap_column],
                mode="lines",
                name="With Gaps",
            )
        )
        # Imputed
        fig.add_trace(
            go.Scatter(
                x=dff.index,
                y=dff[imp_column],
                mode="lines",
                name="Imputed",
                line=dict(color="blue"),
            )
        )
        fig.update_layout(
            height=400,
            title_text="#pieces with and without Gaps",
            xaxis_title="date",
            yaxis_title="pieces",
        )
        fig.update_yaxes(title_text="pieces")
    else:
        fig = go.Figure()
        fig.add_trace(
            go.Scatter(
                x=df.index,
                y=df[orig_data_column],
                mode="lines",
                name="Original",
                line=dict(color="lightgrey"),
            )
        )
        fig.add_trace(
            go.Scatter(
                x=df.index,
                y=df[gap_column],
                mode="lines",
                name="With Gaps",
            )
        )
        fig.update_layout(
            title="#pieces with and without gaps",
            xaxis_title="Date",
            yaxis_title="pieces",
        )
    fig.show()


def seasonal_decompose_plotly(decomposed_ts, title):
    """
    Plots a time series decomposition using Plotly.

    Parameters:
        decomposed_ts (DecomposeResult): Decomposed time series result.
        title (str): Title of the figure.

    Returns:
        plotly object
    """
    # Extract components
    observed = decomposed_ts.observed
    trend = decomposed_ts.trend
    seasonal = decomposed_ts.seasonal
    resid = decomposed_ts.resid
    # Create subplots (sp is plotly.subplots)
    fig_decomp = sp.make_subplots(
        rows=4,
        cols=1,
        shared_xaxes=True,
        subplot_titles=["Observed", "Trend", "Seasonal", "Residual"],
    )
    fig_decomp.add_trace(
        go.Scatter(x=observed.index, y=observed.values.flatten(), name="Observed"),
        row=1,
        col=1,
    )
    fig_decomp.add_trace(go.Scatter(x=trend.index, y=trend, name="Trend"), row=2, col=1)
    fig_decomp.add_trace(go.Scatter(x=seasonal.index, y=seasonal, name="Seasonal"), row=3, col=1)
    fig_decomp.add_trace(go.Scatter(x=resid.index, y=resid, name="Residual"), row=4, col=1)
    fig_decomp.update_layout(height=900, title_text=title)
    return fig_decomp

In a first step, we will focus on getting rid of the outliers.
In a second step, let us try to impute the missing values.
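As a starting point, a few simple imputation strategies can be compared side by side with pandas. The following is a minimal sketch on synthetic data; the series, the gap positions, and the selection of methods are assumptions for illustration and do not necessarily match the methods covered in Chapter 10.

```python
import numpy as np
import pandas as pd

# Synthetic weekly series with two artificial gaps (hypothetical values).
rng = np.random.default_rng(1)
idx = pd.date_range("2004-01-04", periods=520, freq="W")
full = pd.Series(
    1000 + 100 * np.sin(2 * np.pi * np.arange(520) / 52) + rng.normal(0, 10, 520),
    index=idx,
)
gappy = full.copy()
gappy.iloc[100:110] = np.nan
gappy.iloc[300:320] = np.nan
gap_mask = gappy.isna()

# Three simple candidates; each returns a series without missing values.
candidates = {
    "forward fill": gappy.ffill(),
    "time interpolation": gappy.interpolate(method="time"),
    "rolling mean": gappy.fillna(
        gappy.rolling(26, center=True, min_periods=1).mean()
    ),
}

# The true values are known here only because the data is synthetic.
for name, imputed in candidates.items():
    mae = (imputed[gap_mask] - full[gap_mask]).abs().mean()
    print(f"{name}: mean absolute error on the gaps = {mae:.1f}")
```

On real data the true values are not available, so a common workaround is to remove some known observations on purpose, impute them, and compare the result with what was removed.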
Of course, we have only seen the tip of the iceberg when it comes to imputation methods for time series. Especially for seasonal data, more advanced methods such as STL (Seasonal-Trend decomposition using LOESS) can be very effective. We looked at STL in Chapter 11, where we decomposed a time series into its components. Now we will use it for imputation, so that the seasonal structure of the data is preserved while the gaps are filled.
Great job! You provided the graphs in time for the management report, and the board liked the results. In the meantime, your colleagues have found the original data set without gaps on some old dusty server and would like to know the number of outliers in it.