The goal of clustering is to group similar time series together based on their characteristics or patterns.
But what does similar mean in the context of time series data? Defining similarity is not straightforward due to the inherent subjectivity involved.
Common difficulties include:
Context-Dependent: Similarity can vary based on the specific application or domain.
Human Bias: Metrics are often chosen based on human intuition, which can introduce bias.
Inconsistent Interpretations: Different experts may have different perspectives on what constitutes similarity.
Parameter Sensitivity: Many similarity measures come with tunable parameters that can significantly affect the results.
12.1 Clustering algorithms
Time series clustering algorithms can be broadly categorized into four main types:
Distance-Based: These algorithms rely on a distance measure to quantify the similarity between time series. Examples include \(k\)-means and \(k\)-median.
Distribution-Based: These methods model the distribution (in a very general sense) of the time series data and cluster based on statistical properties or descriptive features that represent the characteristics of the time series. Examples include Gaussian Mixture Models (GMM) and DBSCAN.
Subsequence-Based: These algorithms focus on clustering representative subsequences of time series data rather than the entire series. Examples include methods based on shapelets and sliding windows.
Representation-Learning-Based: These methods use techniques like autoencoders or recurrent neural networks to learn a lower-dimensional representation of the time series data, which is then clustered using traditional clustering algorithms.
Paparrizos, Yang, and Li (2024) give a good overview of different time series clustering methods.
In the following, we will briefly introduce distance-based clustering using two common similarity measures: Euclidean distance and Dynamic Time Warping (DTW).
12.2 Similarity Measures
12.2.1 Imports
Code
import warnings

import numpy as np
import plotly.graph_objs as go
import plotly.io as pio

pio.renderers.default = (
    "notebook"  # set the default plotly renderer to "notebook" (necessary for quarto to render the plots)
)
warnings.filterwarnings("ignore")
12.2.2 Helper functions for plotting
Code
def plot_time_series_plotly(t, series1, series2, title):
    fig = go.Figure()
    fig.add_trace(go.Scatter(x=t, y=series1, mode="lines+markers", name="Series 1"))
    fig.add_trace(go.Scatter(x=t, y=series2, mode="lines+markers", name="Series 2"))
    fig.update_layout(title=title, xaxis_title="t", yaxis_title="Value", legend_title="Series")
    return fig


def plot_euclidean_distance(series1, series2, t):
    fig = go.Figure()
    fig.add_trace(go.Scatter(x=t, y=series1, mode="lines+markers", name="Series 1"))
    fig.add_trace(go.Scatter(x=t, y=series2, mode="lines+markers", name="Series 2"))
    # Draw lines showing the Euclidean distance at each point in time
    for i, v in enumerate(t):
        fig.add_trace(
            go.Scatter(
                x=[v, v],
                y=[series1[i], series2[i]],
                mode="lines",
                line=dict(color="gray", dash="dash", width=1),
                showlegend=False,
            )
        )
    fig.update_layout(
        title="Euclidean Distance Visualization Between Series 1 and Series 2",
        xaxis_title="t",
        yaxis_title="Value",
    )
    return fig
plot_time_series_plotly(t, series1, series2, "Two Similar Time Series")
12.2.4 Measures
12.2.4.1 Euclidean Distance
The Euclidean distance is the most straightforward similarity measure. It calculates the straight-line distance between two points. For time series data, this means comparing the values at each point in time directly.
Given two time series \(X = (x_1, x_2, \ldots, x_n)\) and \(Y = (y_1, y_2, \ldots, y_n)\) of equal length \(n\), the Euclidean distance \(d\) between them is calculated as: \[d(X, Y) = \sqrt{\sum_{i=1}^{n} (x_i - y_i)^2}\]
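As a quick illustration (using two short made-up arrays, not the series from the plots), the formula can be evaluated directly in NumPy:

```python
import numpy as np

# Two short example series of equal length (illustrative values)
series_a = np.array([1.0, 2.0, 3.0, 4.0])
series_b = np.array([1.0, 2.5, 2.5, 4.0])

# Straight-line distance: compare the values at each point in time directly
d = np.sqrt(np.sum((series_a - series_b) ** 2))
print(d)  # sqrt(0.25 + 0.25) ≈ 0.707
```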
Dynamic Time Warping (DTW) is a more advanced similarity measure that accounts for shifts and distortions in the time axis. It finds the optimal alignment between two time series by warping the time dimension, allowing for comparisons even when the series are out of phase or have different lengths.
def dtw_distance(s1, s2):
    n, m = len(s1), len(s2)
    # initializing cost matrix
    cost = np.full((n + 1, m + 1), np.inf)
    cost[0, 0] = 0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            dist = abs(s1[i - 1] - s2[j - 1])
            cost[i, j] = dist + min(
                cost[i - 1, j],      # insertion
                cost[i, j - 1],      # deletion
                cost[i - 1, j - 1],  # match
            )
    # backtracking path (just for visualization)
    path = []
    i, j = n, m
    while i > 0 and j > 0:
        path.append((i - 1, j - 1))
        steps = [(i - 1, j), (i, j - 1), (i - 1, j - 1)]
        costs = [cost[s] if s[0] >= 0 and s[1] >= 0 else np.inf for s in steps]
        min_step = steps[np.argmin(costs)]
        i, j = min_step
    path = path[::-1]
    return cost[n, m], path


d_dtw, path = dtw_distance(series1, series2)
print(f"Dynamic Time Warping Distance: {d_dtw:.2f}")
Dynamic Time Warping Distance: 24.61
Note that this DTW implementation takes a very naive approach and computes the full cost matrix. Many variants exist that are more efficient and/or add constraints to the warping path; Lahreche and Boucheham (2021) provide a good overview.
def plot_dtw_alignment(s1, s2, t, path):
    fig = go.Figure()
    fig.add_trace(go.Scatter(x=t, y=s1, mode="lines+markers", name="Series 1"))
    fig.add_trace(go.Scatter(x=t, y=s2, mode="lines+markers", name="Series 2"))
    # Draw alignment lines
    for i, j in path:
        fig.add_trace(
            go.Scatter(
                x=[t[i], t[j]],
                y=[s1[i], s2[j]],
                mode="lines",
                line=dict(color="gray", width=1, dash="dot"),
                showlegend=False,
            )
        )
    fig.update_layout(
        title="Dynamic Time Warping Alignment Between Series 1 and Series 2",
        xaxis_title="t",
        yaxis_title="Value",
    )
    return fig


plot_dtw_alignment(series1, series2, t, path)
12.2.5 Importance of Preprocessing
In the previous plots we observed that noise, outliers, and the positioning of the time series can have a significant impact on similarity measures. To mitigate these effects, we can apply various preprocessing techniques such as:
Smoothing: Applying filters (e.g., moving average, Gaussian) to reduce noise.
Normalization: Scaling the time series to a common range to ensure that differences in amplitude do not dominate the similarity measure.
Detrending: Removing trends to focus on the fluctuations around a mean level.
Note that the applied preprocessing techniques should be chosen based on the specific characteristics of the data and the analysis goals.
Here, we first smooth the time series using a simple moving average filter. This can easily be realized by convolving the series with a suitable kernel (a vector of equal weights summing to 1, whose length equals the desired window size).
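A sketch of this idea follows; the window size of 5 and the synthetic noisy signal are assumptions for illustration, not the exact series used above:

```python
import numpy as np

window_size = 5  # assumed window size for illustration
kernel = np.ones(window_size) / window_size  # equal weights summing to 1

# A synthetic noisy signal standing in for the series above
rng = np.random.default_rng(0)
noisy = np.sin(np.linspace(0, 4 * np.pi, 100)) + rng.normal(0, 0.3, 100)

# Convolving with the kernel replaces each value by the average of its window;
# mode="same" keeps the output the same length as the input
smooth = np.convolve(noisy, kernel, mode="same")
```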
plot_time_series_plotly(t, series1_smooth, series2_smooth, "Two Similar Time Series Smoothed")
Then we normalize the smoothed time series to the range [0, 1] using min-max normalization: \[X_{norm} := \frac{X - X_{min}}{X_{max} - X_{min}}\] where \(X_{min}\) and \(X_{max}\) are the minimum and maximum values of the time series \(X\), respectively.
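In NumPy, this normalization is a one-liner (a sketch; it assumes the series is not constant, otherwise the denominator is zero):

```python
import numpy as np

def min_max_normalize(x):
    # Scale a series to [0, 1]; undefined for a constant series (division by zero)
    return (x - x.min()) / (x.max() - x.min())

x = np.array([2.0, 4.0, 6.0, 10.0])
print(min_max_normalize(x))  # [0.   0.25 0.5  1.  ]
```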
Just like the \(k\)-means algorithm for tabular data, the \(k\)-means algorithm for time series data aims to partition a set of time series into \(k\) clusters, where each time series belongs to the cluster with the nearest mean (centroid) time series.
The algorithm works as follows:
Initialization: Randomly select \(k\) initial centroids from the time series dataset.
Assignment Step: Assign each time series to the cluster whose centroid is closest, based on a chosen similarity measure (e.g., Euclidean distance, Dynamic Time Warping).
Update Step: Recalculate the centroids of the clusters by taking the mean of all time series assigned to each cluster.
Repeat: Repeat the assignment and update steps until convergence (i.e., when assignments no longer change or a maximum number of iterations is reached).
In practice, the way cluster centroids are computed needs to be compatible with the chosen similarity measure (or at least it is very beneficial if it is). Especially for non-Euclidean measures that handle time series of variable length, such as Dynamic Time Warping, specialized methods may be needed to compute the centroid.
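To make the four steps above concrete, here is a minimal NumPy sketch of \(k\)-means for equal-length series with Euclidean distance. It is a toy illustration (the function name and the synthetic two-group data are made up), not the library implementation used below:

```python
import numpy as np

def toy_kmeans(X, k, n_iter=20, seed=0):
    """Minimal k-means for equal-length series (X has shape n_series x length)."""
    rng = np.random.default_rng(seed)
    # 1. Initialization: randomly pick k series as initial centroids
    centroids = X[rng.choice(len(X), size=k, replace=False)].copy()
    labels = np.full(len(X), -1)
    for _ in range(n_iter):
        # 2. Assignment step: nearest centroid by Euclidean distance
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        new_labels = dists.argmin(axis=1)
        if np.array_equal(new_labels, labels):
            break  # 4. converged: assignments no longer change
        labels = new_labels
        # 3. Update step: centroid = pointwise mean of the assigned series
        for c in range(k):
            if np.any(labels == c):
                centroids[c] = X[labels == c].mean(axis=0)
    return labels, centroids

# Two well-separated toy groups of short "series"
X = np.vstack([np.zeros((5, 10)), np.full((5, 10), 10.0)])
labels, _ = toy_kmeans(X, k=2)
print(labels)  # each group ends up with its own label
```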
Here we will have a look at an example using the sktime library. The dataset we will use is a subset of the Trace dataset, a synthetic dataset introduced by Roverso (2002) for the purpose of plant diagnostics. The subset includes only 4 classes of univariate time series.
12.3.1 Imports
Code
import time
import warnings

import numpy as np
import plotly.graph_objs as go
import plotly.io as pio
from sklearn.preprocessing import StandardScaler
from sktime.clustering.k_means import TimeSeriesKMeansTslearn
from tslearn.datasets import CachedDatasets

pio.renderers.default = (
    "notebook"  # set the default plotly renderer to "notebook" (necessary for quarto to render the plots)
)
np.random.seed(42)
warnings.filterwarnings("ignore")
12.3.2 Loading the dataset
X_train, y_train, X_test, y_test = CachedDatasets().load_dataset("Trace")

# Combine train and test sets since clustering does not require a train-test split
X = np.concatenate((X_train, X_test))
y = np.concatenate((y_train, y_test))
X.shape
(200, 275, 1)
y.shape
(200,)
We have 200 time series in total, each of length 275.
Note that we also have class labels, which is not the case in real clustering problems. We will solely use them in the end to evaluate the clustering performance.
Let us plot all the time series to get an idea of what they look like.
fig = go.Figure()
for i in range(X.shape[0]):
    fig.add_trace(
        go.Scatter(
            y=X[i, :, 0],
            mode="lines",
            line=dict(width=1, color="grey"),
            opacity=0.2,
            showlegend=False,
        )
    )
fig.update_layout(title="All Time Series in X", xaxis_title="Time", yaxis_title="Value", height=400)
fig.show()
Without class labels it is hard to count the number of classes, but we can see that there are some recurring patterns in the data.
Since normalization and scaling is important for distance-based methods, we will use the StandardScaler from sklearn to standardize the data to have zero mean and unit variance.
X_scaled = StandardScaler().fit_transform(
    X[:, :, 0]
)  # In this case, the dataset was already scaled beforehand, but we do it here explicitly for demonstration purposes
Let us also just visualize a few time series from the dataset to get a better idea of the data.
COLORS = ["#A6CEE3", "#B2DF8A", "#FDBF6F", "#CAB2D6"]  # pastel, colorblind-friendly
sampled_ids = [0, 10, 25, 81]

fig = go.Figure()
for idx, i in enumerate(sampled_ids):
    fig.add_trace(
        go.Scatter(
            y=X[i, :, 0],
            mode="lines",
            line=dict(width=3, color=COLORS[idx]),
            name=f"Sample {i}",
        )
    )
fig.update_layout(
    title="Sampled Time Series from X",
    xaxis_title="Time",
    yaxis_title="Value",
    height=400,
)
fig.show()
12.3.3 Clustering based on Euclidean Distance
In this example we know that there are 4 classes in the data, so we will set \(k=4\).
Luckily, tslearn already implements a variety of clustering algorithms that we can use out of the box, including the \(k\)-means algorithm.
k = 4  # number of clusters
clusterer = TimeSeriesKMeansTslearn(n_clusters=k, metric="euclidean", random_state=42)

start_time = time.time()
y_predicted = clusterer.fit_predict(X)
duration = time.time() - start_time
print(f"Clustering took {duration:.2f} seconds.")
Clustering took 7.73 seconds.
Helper function for plotting clusters and cluster centers
Code
def plot_clusters(X, y, title, use_numeric_classname):
    fig = go.Figure()
    for cluster_idx, cluster in enumerate(sorted(set(y))):
        idx = np.where(y == cluster)[0]
        show_legend = True  # Only show legend for the first trace of each cluster
        for i in idx:
            fig.add_trace(
                go.Scatter(
                    y=X[i, :, 0],
                    mode="lines",
                    line=dict(width=1, color=COLORS[cluster_idx]),
                    opacity=0.5,
                    name=f"Cluster {cluster_idx + 1 if use_numeric_classname else chr(ord('A') + cluster_idx)}",
                    legendgroup=cluster_idx,
                    showlegend=show_legend,
                )
            )
            show_legend = False
    fig.update_layout(title=title, xaxis_title="Time", yaxis_title="Value", height=500)
    fig.show()


def plot_centroids(clusterer):
    fig = go.Figure()
    for cluster_idx, centroid in enumerate(clusterer.cluster_centers_):
        fig.add_trace(
            go.Scatter(
                y=centroid[:, 0],
                mode="lines",
                line=dict(width=3, color=COLORS[cluster_idx]),
                name=f"Centroid {chr(ord('A') + cluster_idx)}",
            )
        )
    fig.update_layout(title="Cluster Centroids", xaxis_title="Time", yaxis_title="Value", height=400)
    fig.show()
Next, we plot the \(k\)-means clusters. Remember that the legend is clickable.
plot_clusters(X, y_predicted, "Time Series clustered by k-means (euclidean distance)", False)
In comparison, here are the true classes according to the labels in the dataset.
plot_clusters(X, y, "Time Series with true labels", True)
The four classes from the original dataset can be described as follows:
Cluster 1: Time series that start high, rise to a huge peak around the middle, fall back to low, and then gradually rise to high again.
Cluster 2: Time series that start high, drop to a low point around the middle, and then rise back up.
Cluster 3: Time series that start low, quickly rise to high around the middle, and then show some oscillations on the high plateau.
Cluster 4: Similar to Cluster 3, but without the oscillations.
The clusters found by the \(k\)-means algorithm do not correspond well to these classes:
Clusters 1 and 2 are mixed up in Cluster B.
Clusters 3 and 4 are mixed up in Clusters A, C, and D.
This is because the \(k\)-means algorithm minimizes within-cluster distances under the Euclidean distance, which does not necessarily correspond to the true classes in the data.
This is also reflected when we visualize the cluster centers.
plot_centroids(clusterer)
Next, let us check how the clustering performs with Dynamic Time Warping as the distance measure.
12.3.4 Clustering based on Dynamic Time Warping
While the Euclidean distance variant uses the mean for computing the centroid, the Dynamic Time Warping variant uses a more complex method called soft-DTW barycenter averaging, which is compatible with the DTW distance measure.
Soft-DTW, a differentiable version of DTW, enables the use of gradient-based optimization techniques for computing the barycenter. Its differentiable nature also facilitates its integration as a loss function into machine learning algorithms, in particular neural networks.
More information on soft-DTW can be found in the related paper Cuturi and Blondel (2017).
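To illustrate the idea, here is a minimal NumPy sketch of the soft-DTW recursion: it is the same cost-matrix recursion as DTW, but with the hard min replaced by a smooth soft-min controlled by a smoothing parameter gamma (the squared point-wise cost follows the paper; as gamma approaches 0, the value approaches the ordinary DTW cost):

```python
import numpy as np

def soft_dtw(s1, s2, gamma=1.0):
    """Soft-DTW: the DTW recursion with min replaced by a differentiable soft-min."""
    n, m = len(s1), len(s2)
    R = np.full((n + 1, m + 1), np.inf)
    R[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            dist = (s1[i - 1] - s2[j - 1]) ** 2  # squared point-wise cost
            prev = np.array([R[i - 1, j], R[i, j - 1], R[i - 1, j - 1]])
            # soft-min_gamma(a) = -gamma * log(sum(exp(-a / gamma))),
            # computed in a numerically stable way by shifting by the minimum
            rmin = prev.min()
            softmin = rmin - gamma * np.log(np.sum(np.exp(-(prev - rmin) / gamma)))
            R[i, j] = dist + softmin
    return R[n, m]
```

Note that the soft-min is always less than or equal to the hard min, so soft-DTW can even be slightly negative for identical series.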
k = 4  # number of clusters
clusterer_dwt = TimeSeriesKMeansTslearn(
    n_clusters=k, metric="softdtw", n_jobs=-1, random_state=1337
)  # n_jobs=-1 uses all available CPU cores

start_time = time.time()
y_predicted_dwt = clusterer_dwt.fit_predict(X)
duration = time.time() - start_time
print(f"Clustering took {duration:.2f} seconds.")
Clustering took 1141.56 seconds.
Have a look at the execution times of both algorithms. Even though the DTW version runs in parallel (n_jobs=-1), it is still significantly slower than the Euclidean distance version. The reason is that the DTW algorithm has a time complexity of \(O(NM)\), where \(N\) and \(M\) are the lengths of the two time series being compared. More sophisticated algorithms can reduce this somewhat, but so far the runtime complexity remains quadratic.
However, the clustering results look better now.
plot_clusters(X, y_predicted_dwt, "Time Series clustered by k-means (dynamic time warping)", False)
plot_centroids(clusterer_dwt)
Cluster 2 matches class A very well, and cluster 4 matches class B. However, clusters 1 and 3 still contain a mix of classes C and D.
12.4 Elbow method for determining the number of clusters
One open question is how to determine the number of clusters \(k\). A common method is the elbow method, which plots the sum of distances to the nearest cluster center for different values of \(k\). The idea is to choose the value of \(k\) at which the rate of decrease sharply shifts, forming an elbow shape in the plot.
sum_of_distances = []
K = range(2, 10)
for k in K:
    km = TimeSeriesKMeansTslearn(
        n_clusters=k,
        metric="euclidean",
        random_state=42,
        n_jobs=-1,
    )
    km = km.fit(X)
    sum_of_distances.append(km.inertia_)
fig = go.Figure()
fig.add_trace(
    go.Scatter(
        x=list(K),
        y=sum_of_distances,
        mode="lines+markers",
        marker=dict(color="blue"),
        line=dict(color="blue"),
        name="Sum of distances",
    )
)
fig.update_layout(title="Elbow Method For Optimal k", xaxis_title="k", yaxis_title="Sum of distances")
fig.show()
The elbow method suggests that four or five clusters are a good choice for \(k\), which is close to the true number of classes in the data.
Often it makes sense to start with a larger number of clusters and then merge similar clusters later on. This makes it possible to capture more subtle patterns in the data.
\(k\)-means clustering in time series is still a very active research area. A recent preprint by Holder, Bagnall, and Lines (2024) gives a nice overview over the variants of \(k\)-means for time series and discusses their pros and cons.
Cuturi, Marco, and Mathieu Blondel. 2017. “Soft-DTW: A Differentiable Loss Function for Time-Series.” In International Conference on Machine Learning, 894–903. PMLR.
Holder, Christopher, Anthony Bagnall, and Jason Lines. 2024. “On Time Series Clustering with k-Means.” arXiv Preprint arXiv:2410.14269.
Lahreche, Abdelmadjid, and Bachir Boucheham. 2021. “A Comparison Study of Dynamic Time Warping’s Variants for Time Series Classification.” International Journal of Informatics and Applied Mathematics 4 (1): 56–71. https://dergipark.org.tr/en/pub/ijiam/issue/59559/819482.
Paparrizos, John, Fan Yang, and Haojun Li. 2024. “Bridging the Gap: A Decade Review of Time-Series Clustering Methods.” arXiv Preprint arXiv:2412.20582.
Roverso, Davide. 2002. “Plant Diagnostics by Transient Classification: The Aladdin Approach.” International Journal of Intelligent Systems 17 (8): 767–90.