15  Self study - Session 1

In this session, we will revisit Chapter 10 and Chapter 11.

In Chapter 10, we have explored different methods to handle missing data in time series and have seen that the choice of imputation method can significantly affect the results.

The data set we will rely on in this session is a synthetic time series that simulates the weekly production data of a manufacturing plant over 20 years. Unfortunately, some data points are missing because of frequent outages in the data collection system, and temporary errors in the counting process occasionally produce huge outliers.

Your task is to impute the missing values and handle the outliers appropriately to prepare the data for the management report marking the plant's 20-year anniversary.

First, download the data set from here.

Here are some useful helper functions:

import plotly.graph_objects as go
import plotly.subplots as sp  # needed for sp.make_subplots in seasonal_decompose_plotly

def plot_production_data(
    df,
    orig_data_column,
    gap_column,
    imp_column=None,
):
    """
    Plots production data with and without gaps using Plotly.

    Parameters:
    df (DataFrame): DataFrame containing the production data. Its index is used as the x-axis.
    orig_data_column (str): Column name for the original data.
    gap_column (str): Column name for the data with gaps.
    imp_column (str, optional): Column name for the imputed data. If given, imputed values are drawn only where gap_column is missing.
    """
    if imp_column is not None:
        dff = df[[gap_column, imp_column]].copy()
        dff.loc[~df[gap_column].isna(), imp_column] = None

        # Single figure that holds all three traces
        fig = go.Figure()

        # Original data
        fig.add_trace(
            go.Scatter(
                x=df.index,
                y=df[orig_data_column],
                mode="lines",
                name="Original",
                line=dict(color="lightgrey"),
            )
        )
        # With gaps
        fig.add_trace(
            go.Scatter(
                x=df.index,
                y=df[gap_column],
                mode="lines",
                name="With Gaps",
            )
        )
        # Imputed
        fig.add_trace(
            go.Scatter(
                x=dff.index,
                y=dff[imp_column],
                mode="lines",
                name="Imputed",
                line=dict(color="blue"),
            )
        )

        fig.update_layout(
            height=400,
            title_text="#pieces with and without gaps",
            xaxis_title="Date",
            yaxis_title="pieces",
        )
    else:
        fig = go.Figure()
        fig.add_trace(
            go.Scatter(
                x=df.index,
                y=df[orig_data_column],
                mode="lines",
                name="Original",
                line=dict(color="lightgrey"),
            )
        )
        fig.add_trace(
            go.Scatter(
                x=df.index,
                y=df[gap_column],
                mode="lines",
                name="With Gaps",
            )
        )

        fig.update_layout(
            title="#pieces with and without gaps",
            xaxis_title="Date",
            yaxis_title="pieces",
        )

    fig.show()


def seasonal_decompose_plotly(decomposed_ts, title):
    """
    Plots a time series decomposition using Plotly.

    Parameters:
    decomposed_ts (DecomposeResult): Decomposed time series result.
    title (str): Title of the figure.

    Returns:
    plotly.graph_objects.Figure: Figure with one subplot per component.
    """

    # Extract components
    observed = decomposed_ts.observed
    trend = decomposed_ts.trend
    seasonal = decomposed_ts.seasonal
    resid = decomposed_ts.resid

    # Create subplots
    fig_decomp = sp.make_subplots(
        rows=4,
        cols=1,
        shared_xaxes=True,
        subplot_titles=["Observed", "Trend", "Seasonal", "Residual"],
    )

    fig_decomp.add_trace(
        go.Scatter(x=observed.index, y=observed.values.flatten(), name="Observed"),
        row=1,
        col=1,
    )
    fig_decomp.add_trace(go.Scatter(x=trend.index, y=trend, name="Trend"), row=2, col=1)
    fig_decomp.add_trace(go.Scatter(x=seasonal.index, y=seasonal, name="Seasonal"), row=3, col=1)
    fig_decomp.add_trace(go.Scatter(x=resid.index, y=resid, name="Residual"), row=4, col=1)

    fig_decomp.update_layout(height=900, title_text=title)

    return fig_decomp

In a first step, we will focus on getting rid of the outliers.

Exercise 15.1 (Load the data with gaps and get rid of outliers)  

  • Download and load the dataset. It has three columns: Date, pieces and pieces_gaps. The column pieces contains the full data set without gaps, while pieces_gaps contains the same data set with several missing values (NaNs). The period covered by the data set is from 2005-01-01 to 2024-12-31 in weekly frequency.
  • Visualize both time series (pieces and pieces_gaps) using the provided plot_production_data function.
  • Smooth the pieces_gaps time series using a rolling window with an appropriate window size, and decide whether the mean or the median is the better aggregation function. Discuss your choice. Ideally, the resulting smoothed time series should have the same gaps as the original pieces_gaps time series.
  • Visualize the resulting smoothed time series.
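
As a starting point for the smoothing step, here is a minimal sketch on a tiny made-up series rather than on the real data set. The values, the window size of 5, and the variable names are assumptions chosen for illustration. It shows how a single outlier affects a rolling mean and a rolling median, and how the original gaps can be restored after smoothing.

```python
import numpy as np
import pandas as pd

# Tiny synthetic stand-in for the pieces_gaps column (values are made up)
idx = pd.date_range("2005-01-02", periods=12, freq="W")
s = pd.Series(
    [100, 102, 101, 900, 103, np.nan, np.nan, 104, 105, 103, 102, 101],
    index=idx,
    dtype=float,
)

window = 5  # assumed window size; tune it for the real data

# min_periods=1 lets the window skip NaNs instead of propagating them
rolling = s.rolling(window=window, center=True, min_periods=1)
smoothed_mean = rolling.mean()
smoothed_median = rolling.median()

# With min_periods=1 the smoother also fills the gaps, so mask them again
smoothed_mean = smoothed_mean.where(s.notna())
smoothed_median = smoothed_median.where(s.notna())

print(pd.DataFrame({"raw": s, "mean": smoothed_mean, "median": smoothed_median}))
```

On this toy series the outlier of 900 pulls the rolling mean far away from its neighbours, while the rolling median stays close to them. Whether the same holds for the production data is what the exercise asks you to check.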

Let us try to impute the missing values in a second step.

Exercise 15.2 (Impute the missing values with known methods)  

  • Use the smoothed time series with gaps from the previous exercise as input.
  • Experiment with different imputation methods discussed in Chapter 10.
  • Visualize the results using the provided plot_production_data function.
  • Discuss the results.
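
The comparison could be organised along the following lines. This is only a sketch on made-up numbers: the three methods shown (forward fill, linear interpolation, time-based interpolation) are assumed candidates and not necessarily the full set from Chapter 10, and the names truth, gaps and candidates are hypothetical. Because the data set also contains the gap-free pieces column, a simple error score on the gap positions can complement the visual comparison.

```python
import numpy as np
import pandas as pd

# Made-up stand-ins for the gap-free and the gappy series
idx = pd.date_range("2005-01-02", periods=8, freq="W")
truth = pd.Series([10.0, 12.0, 14.0, 16.0, 18.0, 20.0, 22.0, 24.0], index=idx)
gaps = truth.copy()
gaps.iloc[[2, 3, 6]] = np.nan

# One column per candidate imputation method (assumed selection)
candidates = pd.DataFrame(
    {
        "ffill": gaps.ffill(),
        "linear": gaps.interpolate(method="linear"),
        "time": gaps.interpolate(method="time"),
    }
)

# Mean absolute error, evaluated only where values were missing
gap_mask = gaps.isna()
mae = {
    name: (col - truth)[gap_mask].abs().mean()
    for name, col in candidates.items()
}
print(mae)
```

On the real data the pieces column still contains outliers, so such a score is only a rough guide and should be read together with the plots.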

Of course, we have only seen the tip of the iceberg when it comes to imputation methods for time series. Especially for seasonal data, more advanced methods such as STL (Seasonal-Trend decomposition using LOESS) can be very effective. We looked at STL in Chapter 11, where we decomposed a time series into its components. Now we will use it for imputation, so that the seasonal structure of the data is preserved while the gaps are filled.

Exercise 15.3 (Impute the missing values with STL)  

  • Use the smoothed time series with gaps from the first exercise as input.
  • STL requires complete data, i.e., data without missing values. Therefore, first impute the missing values using a simple method of your choice (e.g., linear interpolation).
  • Then, apply the STL decomposition to the imputed time series.
  • Visualize the decomposed components (trend, seasonal, residual) using the provided seasonal_decompose_plotly function.
  • The seasonal part of the decomposition can now be used to deseasonalize the smoothed time series with gaps.
  • Now that we have a deseasonalized time series with missing values, we can again impute the missing values using a simple method of your choice.
  • Finally, reseasonalize the imputed deseasonalized time series by adding back the seasonal component obtained from the STL decomposition.
  • Visualize the final imputed time series using the provided plot_production_data function.
  • Discuss the results.

Great job! You have provided the graphs in time for the management report, and the board liked the results. In the meantime, your colleagues have found the original data set without gaps on an old, dusty server and would like to know how many outliers it contains.

Exercise 15.4 (Count the number of outliers in the original data set)  

  • Use the original data set without gaps (pieces column) as input.
  • Use an appropriate method to identify the outliers in the data set.
  • Visualize the outliers and the original data set.
  • Count the number of detected outliers.
  • Discuss the results.
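
One possible approach, among several, is to compare each value with a rolling median and flag points whose deviation is large relative to a robust scale estimate (the median absolute deviation, MAD). The sketch below uses synthetic data with three injected outliers. The window size of 9, the threshold factor of 6, and all names are assumptions that would need tuning on the real pieces column.

```python
import numpy as np
import pandas as pd

# Synthetic weekly series with three injected outliers (made up)
rng = np.random.default_rng(1)
n = 200
idx = pd.date_range("2005-01-02", periods=n, freq="W")
values = 1000 + 50 * np.sin(2 * np.pi * np.arange(n) / 52) + rng.normal(0, 5, n)
s = pd.Series(values, index=idx)
outlier_pos = [20, 75, 150]
s.iloc[outlier_pos] += 400

window = 9  # assumed rolling window size
k = 6.0     # assumed threshold in units of the robust scale

# Deviation of each point from its local (rolling) median
local_median = s.rolling(window=window, center=True, min_periods=1).median()
resid = s - local_median

# Robust scale: MAD of the residuals, rescaled to match a normal std. dev.
scale = 1.4826 * resid.abs().median()

is_outlier = resid.abs() > k * scale
n_outliers = int(is_outlier.sum())
print(f"Detected {n_outliers} outliers at:")
print(s[is_outlier])
```

The flagged points in s[is_outlier] can be added to a Plotly figure as a go.Scatter trace with mode="markers" on top of the original series. The count depends on the window and the threshold, so both choices belong in the discussion.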