25 Resampling

In many real scenarios, data is acquired by manual input or when a certain event occurred, leading to irregular time steps.

Finding patterns in time series data is usually easier when time series have constant time steps. Some common time series methods, such as ARIMA or Recurrent Neural Networks, even require constant time steps.

When time series have irregular time steps, we can resample them to a regular frequency. To illustrate this, we generate a synthetic time series with irregular time steps and resample it to a regular frequency.

25.1 Example

25.1.1 Load packages

Code

import datetime

import numpy as np
import pandas as pd

import matplotlib.pyplot as plt
import plotly.express as px
import plotly.io as pio  # used to set the default plotly renderer

pio.renderers.default = (
    "notebook"  # set the default plotly renderer to "notebook" (necessary for quarto to render the plots)
)

25.1.2 Data Generation

Generate synthetic time series with irregular time steps.

Code

np.random.seed(42)
n_steps = 1000

# Generate irregular time steps
start_time = pd.Timestamp("2025-09-25 12:43:00")
time_steps = pd.date_range(start=start_time, periods=n_steps, freq="1ms")

# Use a sinus curve instead of a random walk
frequency = 16.667  # Hz
# Calculate elapsed time in seconds for each time step
elapsed_seconds = (time_steps - start_time).total_seconds()
values = np.sin(2 * np.pi * frequency * elapsed_seconds)

# Create a pandas Series
df_sinus = pd.Series(data=values, index=time_steps)

df_sinus_sampled = df_sinus.sample(frac=0.2).sort_index()

25.1.3 Data Visualization

fig = px.line(
    x=df_sinus.index,
    y=df_sinus.values,
    labels={"x": "Time", "y": "Value"},
    title="Original sinus curve with sampled points at irregular intervals",
)
fig.add_scatter(
    x=df_sinus_sampled.index,
    y=df_sinus_sampled.values,
    mode="markers",
    name="Sampled Points",
    marker=dict(color="red", size=6),
)
fig.show()

Using a histogram, visualizing the difference between consecutive time steps, we can see that time steps are not constant.

# Calculate time differences in seconds between consecutive time steps
time_diffs = df_sinus_sampled.index[1:] - df_sinus_sampled.index[:-1]
time_diffs_seconds = time_diffs.total_seconds() * 1000

plt.hist(time_diffs_seconds, bins=50, edgecolor="k")
plt.xlabel("Time difference between steps (milliseconds)")
plt.ylabel("Frequency")
plt.title("Distribution of Time Steps")
plt.show()

25.1.4 Choosing a sampling frequency

Resampling ensures constant time steps, but usually requires some sort of alignment.

This can be done by oversampling (increasing the frequency) or undersampling (decreasing the frequency). Using a fixed frequency, the question arises, which value should be used for values at the newly introduced time step.

For undersampling, we usually have multiple values between two consecutive time steps of the resampled time series. This can be handled by aggregation functions, such as mean, median, min, max, sum, last value, next value, etc.

For oversampling, we usually have no value at the newly introduced time steps, which requires some form of interpolation.

resample_freq = 5  # milliseconds

df_sinus_resampled = df_sinus_sampled.resample(f"{resample_freq}ms").median().interpolate(method="pchip")
df_sinus_resampled.index += +datetime.timedelta(
    milliseconds=(resample_freq / 2)
)  # pd.resample is not center-aligned; adjust for that

fig = px.line(
    x=df_sinus.index,
    y=df_sinus.values,
    labels={"x": "Time", "y": "Value"},
    title="Lineplot of original data",
)
fig.add_scatter(
    x=df_sinus_sampled.index,
    y=df_sinus_sampled.values,
    mode="markers",
    name="sampled Points",
    marker=dict(color="red", size=6),
)
fig.add_scatter(
    x=df_sinus_resampled.index,
    y=df_sinus_resampled.values,
    mode="markers",
    name="resampled Points",
    marker=dict(color="green", symbol="x", size=6),
)
fig.show()