The goal of clustering is to group similar time series together based on their characteristics or patterns.
But what does similar mean in the context of time series data? Defining similarity is not straightforward due to the inherent subjectivity involved.
Common difficulties include:
Context-Dependent: Similarity can vary based on the specific application or domain.
Human Bias: Metrics are often chosen based on human intuition, which can introduce bias.
Inconsistent Interpretations: Different experts may have different perspectives on what constitutes similarity.
Parameter Sensitivity: Many similarity measures come with tunable parameters that can significantly affect the results.
12.1 Clustering algorithms
Time series clustering algorithms can be broadly categorized into four main types:
Distance-Based: These algorithms rely on a distance measure to quantify the similarity between time series. Examples include \(k\)-means and \(k\)-median.
Distribution-Based: These methods model the distribution (in a very general sense) of the time series data and cluster based on statistical properties or descriptive features that represent the characteristics of the time series. Examples include Gaussian Mixture Models (GMM) and DBSCAN.
Subsequence-Based: These algorithms focus on clustering representative subsequences of time series data rather than the entire series. Examples include methods based on shapelets and sliding windows.
Representation-Learning-Based: These methods use techniques like autoencoders or recurrent neural networks to learn a lower-dimensional representation of the time series data, which is then clustered using traditional clustering algorithms.
Paparrizos, Yang, and Li (2024) give a good overview of different time series clustering methods.
In the following, we will briefly introduce distance-based clustering using two common similarity measures: Euclidean distance and Dynamic Time Warping (DTW).
12.2 Similarity Measures
12.2.1 Imports
Code
import warnings

import numpy as np
import plotly.graph_objs as go
import plotly.io as pio

pio.renderers.default = (
    "notebook"  # set the default plotly renderer to "notebook" (necessary for quarto to render the plots)
)
warnings.filterwarnings("ignore")
12.2.2 Helper functions for plotting
Code
def plot_time_series_plotly(t, series1, series2, title):
    fig = go.Figure()
    fig.add_trace(go.Scatter(x=t, y=series1, mode="lines+markers", name="Series 1"))
    fig.add_trace(go.Scatter(x=t, y=series2, mode="lines+markers", name="Series 2"))
    fig.update_layout(title=title, xaxis_title="t", yaxis_title="Value", legend_title="Series")
    return fig


def plot_euclidean_distance(series1, series2, t):
    fig = go.Figure()
    fig.add_trace(go.Scatter(x=t, y=series1, mode="lines+markers", name="Series 1"))
    fig.add_trace(go.Scatter(x=t, y=series2, mode="lines+markers", name="Series 2"))
    # Draw lines showing the Euclidean distance at each point in time
    for i, v in enumerate(t):
        fig.add_trace(
            go.Scatter(
                x=[v, v],
                y=[series1[i], series2[i]],
                mode="lines",
                line=dict(color="gray", dash="dash", width=1),
                showlegend=False,
            )
        )
    fig.update_layout(
        title="Euclidean Distance Visualization Between Series 1 and Series 2",
        xaxis_title="t",
        yaxis_title="Value",
    )
    return fig
plot_time_series_plotly(t, series1, series2, "Two Similar Time Series")
12.2.4 Measures
12.2.4.1 Euclidean Distance
The Euclidean distance is the most straightforward similarity measure. It calculates the straight-line distance between two points. For time series data, this means comparing the values at each point in time directly.
Given two time series \(X = (x_1, x_2, \ldots, x_n)\) and \(Y = (y_1, y_2, \ldots, y_n)\) of equal length \(n\), the Euclidean distance \(d\) between them is calculated as: \[d(X, Y) = \sqrt{\sum_{i=1}^{n} (x_i - y_i)^2}\]
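As a quick illustration (using two short made-up arrays, not the series from the plots), the formula can be evaluated directly in NumPy:

```python
import numpy as np

# Two short example series of equal length (illustrative values)
series_a = np.array([1.0, 2.0, 3.0, 4.0])
series_b = np.array([1.0, 2.5, 2.5, 4.0])

# Straight-line distance: compare the values at each point in time directly
d = np.sqrt(np.sum((series_a - series_b) ** 2))
print(d)  # sqrt(0.25 + 0.25) ≈ 0.707
```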
Dynamic Time Warping (DTW) is a more advanced similarity measure that accounts for shifts and distortions in the time axis. It finds the optimal alignment between two time series by warping the time dimension, allowing for comparisons even when the series are out of phase or have different lengths.
def dtw_distance(s1, s2):
    n, m = len(s1), len(s2)
    # initializing cost matrix
    cost = np.full((n + 1, m + 1), np.inf)
    cost[0, 0] = 0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            dist = abs(s1[i - 1] - s2[j - 1])
            cost[i, j] = dist + min(
                cost[i - 1, j],      # insertion
                cost[i, j - 1],      # deletion
                cost[i - 1, j - 1],  # match
            )
    # backtracking path (just for visualization)
    path = []
    i, j = n, m
    while i > 0 and j > 0:
        path.append((i - 1, j - 1))
        steps = [(i - 1, j), (i, j - 1), (i - 1, j - 1)]
        costs = [cost[s] if s[0] >= 0 and s[1] >= 0 else np.inf for s in steps]
        min_step = steps[np.argmin(costs)]
        i, j = min_step
    path = path[::-1]
    return cost[n, m], path


d_dtw, path = dtw_distance(series1, series2)
print(f"Dynamic Time Warping Distance: {d_dtw:.2f}")
Dynamic Time Warping Distance: 24.61
Note that this DTW implementation takes a very naive approach and computes the full cost matrix. Many variants exist that are more efficient and/or add constraints to the warping path; Lahreche and Boucheham (2021) provide a good overview.
def plot_dtw_alignment(s1, s2, t, path):
    fig = go.Figure()
    fig.add_trace(go.Scatter(x=t, y=s1, mode="lines+markers", name="Series 1"))
    fig.add_trace(go.Scatter(x=t, y=s2, mode="lines+markers", name="Series 2"))
    # Draw alignment lines
    for i, j in path:
        fig.add_trace(
            go.Scatter(
                x=[t[i], t[j]],
                y=[s1[i], s2[j]],
                mode="lines",
                line=dict(color="gray", width=1, dash="dot"),
                showlegend=False,
            )
        )
    fig.update_layout(
        title="Dynamic Time Warping Alignment Between Series 1 and Series 2",
        xaxis_title="t",
        yaxis_title="Value",
    )
    return fig


plot_dtw_alignment(series1, series2, t, path)
12.2.5 Importance of Preprocessing
In the previous plots we observed that noise, outliers, and the positioning of the time series can have a significant impact on similarity measures. To mitigate these effects, we can apply various preprocessing techniques such as:
Smoothing: Applying filters (e.g., moving average, Gaussian) to reduce noise.
Normalization: Scaling the time series to a common range to ensure that differences in amplitude do not dominate the similarity measure.
Detrending: Removing trends to focus on the fluctuations around a mean level.
Note that the applied preprocessing techniques should be chosen based on the specific characteristics of the data and the analysis goals.
Here, we first smooth the time series using a simple moving average filter. This can easily be realized by convolving the series with a suitable kernel (a vector of equal weights summing to 1, whose length equals the desired window size).
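A sketch of this idea follows; the window size of 5 and the synthetic noisy signal are assumptions for illustration, not the exact series used above:

```python
import numpy as np

window_size = 5  # assumed window size for illustration
kernel = np.ones(window_size) / window_size  # equal weights summing to 1

# A synthetic noisy signal standing in for the series above
rng = np.random.default_rng(0)
noisy = np.sin(np.linspace(0, 4 * np.pi, 100)) + rng.normal(0, 0.3, 100)

# Convolving with the kernel replaces each value by the average of its window;
# mode="same" keeps the output the same length as the input
smooth = np.convolve(noisy, kernel, mode="same")
```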
plot_time_series_plotly(t, series1_smooth, series2_smooth, "Two Similar Time Series Smoothed")
Then we normalize the smoothed time series to the range [0, 1] using min-max normalization: \[X_{norm} := \frac{X - X_{min}}{X_{max} - X_{min}}\] where \(X_{min}\) and \(X_{max}\) are the minimum and maximum values of the time series \(X\), respectively.
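In NumPy, this normalization is a one-liner (a sketch; it assumes the series is not constant, otherwise the denominator is zero):

```python
import numpy as np

def min_max_normalize(x):
    # Scale a series to [0, 1]; undefined for a constant series (division by zero)
    return (x - x.min()) / (x.max() - x.min())

x = np.array([2.0, 4.0, 6.0, 10.0])
print(min_max_normalize(x))  # [0.   0.25 0.5  1.  ]
```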
Just like the \(k\)-means algorithm for tabular data, the \(k\)-means algorithm for time series data aims to partition a set of time series into \(k\) clusters, where each time series belongs to the cluster with the nearest mean (centroid) time series.
The algorithm works as follows:
Initialization: Randomly select \(k\) initial centroids from the time series dataset.
Assignment Step: Assign each time series to the cluster whose centroid is closest, based on a chosen similarity measure (e.g., Euclidean distance, Dynamic Time Warping).
Update Step: Recalculate the centroids of the clusters by taking the mean of all time series assigned to each cluster.
Repeat: Repeat the assignment and update steps until convergence (i.e., when assignments no longer change or a maximum number of iterations is reached).
In practice, the way cluster centroids are computed needs to be compatible with the chosen similarity measure (or at least it is very beneficial if it is). Especially for non-Euclidean measures that handle time series of variable length, such as Dynamic Time Warping, specialized methods may be needed to compute the centroid.
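To make the four steps above concrete, here is a minimal NumPy sketch of \(k\)-means for equal-length series with Euclidean distance. It is a toy illustration (the function name and the synthetic two-group data are made up), not the library implementation used below:

```python
import numpy as np

def toy_kmeans(X, k, n_iter=20, seed=0):
    """Minimal k-means for equal-length series (X has shape n_series x length)."""
    rng = np.random.default_rng(seed)
    # 1. Initialization: randomly pick k series as initial centroids
    centroids = X[rng.choice(len(X), size=k, replace=False)].copy()
    labels = np.full(len(X), -1)
    for _ in range(n_iter):
        # 2. Assignment step: nearest centroid by Euclidean distance
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        new_labels = dists.argmin(axis=1)
        if np.array_equal(new_labels, labels):
            break  # 4. converged: assignments no longer change
        labels = new_labels
        # 3. Update step: centroid = pointwise mean of the assigned series
        for c in range(k):
            if np.any(labels == c):
                centroids[c] = X[labels == c].mean(axis=0)
    return labels, centroids

# Two well-separated toy groups of short "series"
X = np.vstack([np.zeros((5, 10)), np.full((5, 10), 10.0)])
labels, _ = toy_kmeans(X, k=2)
print(labels)  # each group ends up with its own label
```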
Here we will have a look at an example using the sktime library. The dataset we will use is a subset of the Trace dataset, a synthetic dataset introduced by Roverso (2002) for the purpose of plant diagnostics. The subset includes only 4 classes of univariate time series.
12.3.1 Imports
Code
import time
import warnings

import numpy as np
import plotly.graph_objs as go
import plotly.io as pio
from sklearn.preprocessing import StandardScaler
from sktime.clustering.k_means import TimeSeriesKMeansTslearn
from tslearn.datasets import CachedDatasets

pio.renderers.default = (
    "notebook"  # set the default plotly renderer to "notebook" (necessary for quarto to render the plots)
)
np.random.seed(42)
warnings.filterwarnings("ignore")
12.3.2 Loading the dataset
X_train, y_train, X_test, y_test = CachedDatasets().load_dataset("Trace")

# Combine train and test sets since clustering does not require a train-test split
X = np.concatenate((X_train, X_test))
y = np.concatenate((y_train, y_test))
X.shape
(200, 275, 1)
y.shape
(200,)
We have 200 time series in total, each of length 275.
Note that we also have class labels, which is not the case in real clustering problems. We will solely use them in the end to evaluate the clustering performance.
Let us plot all the time series to get an idea of what they look like.
fig = go.Figure()
for i in range(X.shape[0]):
    fig.add_trace(
        go.Scatter(
            y=X[i, :, 0],
            mode="lines",
            line=dict(width=1, color="grey"),
            opacity=0.2,
            showlegend=False,
        )
    )
fig.update_layout(title="All Time Series in X", xaxis_title="Time", yaxis_title="Value", height=400)
fig.show()
Without class labels it is hard to count the number of classes, but we can see that there are some recurring patterns in the data.
Since normalization and scaling is important for distance-based methods, we will use the StandardScaler from sklearn to standardize the data to have zero mean and unit variance.
X_scaled = StandardScaler().fit_transform(
    X[:, :, 0]
)  # In this case, the dataset was already scaled beforehand, but we do it here explicitly for demonstration purposes
Let us also just visualize a few time series from the dataset to get a better idea of the data.
COLORS = ["#A6CEE3", "#B2DF8A", "#FDBF6F", "#CAB2D6"]  # pastel, colorblind-friendly
sampled_ids = [0, 10, 25, 81]

fig = go.Figure()
for idx, i in enumerate(sampled_ids):
    fig.add_trace(
        go.Scatter(
            y=X[i, :, 0],
            mode="lines",
            line=dict(width=3, color=COLORS[idx]),
            name=f"Sample {i}",
        )
    )
fig.update_layout(
    title="Sampled Time Series from X",
    xaxis_title="Time",
    yaxis_title="Value",
    height=400,
)
fig.show()
12.3.3 Clustering based on Euclidean Distance
In this example we know that there are 4 classes in the data, so we will set \(k=4\).
Luckily, tslearn already implements a variety of clustering algorithms that we can use out of the box, including the \(k\)-means algorithm.
k = 4  # number of clusters
clusterer = TimeSeriesKMeansTslearn(n_clusters=k, metric="euclidean", random_state=42)

start_time = time.time()
y_predicted = clusterer.fit_predict(X)
duration = time.time() - start_time
print(f"Clustering took {duration:.2f} seconds.")
Clustering took 7.73 seconds.
Helper function for plotting clusters and cluster centers
Code
def plot_clusters(X, y, title, use_numeric_classname):
    fig = go.Figure()
    for cluster_idx, cluster in enumerate(sorted(set(y))):
        idx = np.where(y == cluster)[0]
        show_legend = True  # Only show legend for the first trace of each cluster
        for i in idx:
            fig.add_trace(
                go.Scatter(
                    y=X[i, :, 0],
                    mode="lines",
                    line=dict(width=1, color=COLORS[cluster_idx]),
                    opacity=0.5,
                    name=f"Cluster {cluster_idx + 1 if use_numeric_classname else chr(ord('A') + cluster_idx)}",
                    legendgroup=cluster_idx,
                    showlegend=show_legend,
                )
            )
            show_legend = False
    fig.update_layout(title=title, xaxis_title="Time", yaxis_title="Value", height=500)
    fig.show()


def plot_centroids(clusterer):
    fig = go.Figure()
    for cluster_idx, centroid in enumerate(clusterer.cluster_centers_):
        fig.add_trace(
            go.Scatter(
                y=centroid[:, 0],
                mode="lines",
                line=dict(width=3, color=COLORS[cluster_idx]),
                name=f"Centroid {chr(ord('A') + cluster_idx)}",
            )
        )
    fig.update_layout(title="Cluster Centroids", xaxis_title="Time", yaxis_title="Value", height=400)
    fig.show()
Next, we plot the \(k\)-means clusters. Remember that the legend is clickable.
plot_clusters(X, y_predicted, "Time Series clustered by k-means (euclidean distance)", False)
In comparison, here are the true classes according to the labels in the dataset.
plot_clusters(X, y, "Time Series with true labels", True)
The four classes from the original dataset can be described as follows:
Cluster 1: Time series that start high, rise to a huge peak around the middle, fall back to low, and then gradually rise to high again.
Cluster 2: Time series that start high, drop to a low point around the middle, and then rise back up.
Cluster 3: Time series that start low, quickly rise to high around the middle, and then show some oscillations on the high plateau.
Cluster 4: Similar to Cluster 3, but without the oscillations.
The clusters found by the \(k\)-means algorithm do not correspond well to these classes:
Clusters 1 and 2 are mixed up in Cluster B.
Clusters 3 and 4 are mixed up in Clusters A, C, and D.
This is because the \(k\)-means algorithm minimizes within-cluster distances under the Euclidean distance, which does not necessarily correspond to the true classes in the data.
This is also reflected when we visualize the cluster centers.
plot_centroids(clusterer)
Next, let us check how the clustering performs with Dynamic Time Warping as the distance measure.
12.3.4 Clustering based on Dynamic Time Warping
While the Euclidean distance variant uses the mean for computing the centroid, the Dynamic Time Warping variant uses a more complex method called soft-DTW barycenter averaging, which is compatible with the DTW distance measure.
Soft-DTW, a differentiable version of DTW, enables the use of gradient-based optimization techniques for computing the barycenter. Its differentiable nature also facilitates its integration as a loss function into machine learning algorithms, in particular neural networks.
More information on soft-DTW can be found in the related paper Cuturi and Blondel (2017).
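To illustrate the idea, here is a minimal NumPy sketch of the soft-DTW recursion: it is the same cost-matrix recursion as DTW, but with the hard min replaced by a smooth soft-min controlled by a smoothing parameter gamma (the squared point-wise cost follows the paper; as gamma approaches 0, the value approaches the ordinary DTW cost):

```python
import numpy as np

def soft_dtw(s1, s2, gamma=1.0):
    """Soft-DTW: the DTW recursion with min replaced by a differentiable soft-min."""
    n, m = len(s1), len(s2)
    R = np.full((n + 1, m + 1), np.inf)
    R[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            dist = (s1[i - 1] - s2[j - 1]) ** 2  # squared point-wise cost
            prev = np.array([R[i - 1, j], R[i, j - 1], R[i - 1, j - 1]])
            # soft-min_gamma(a) = -gamma * log(sum(exp(-a / gamma))),
            # computed in a numerically stable way by shifting by the minimum
            rmin = prev.min()
            softmin = rmin - gamma * np.log(np.sum(np.exp(-(prev - rmin) / gamma)))
            R[i, j] = dist + softmin
    return R[n, m]
```

Note that the soft-min is always less than or equal to the hard min, so soft-DTW can even be slightly negative for identical series.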
k = 4  # number of clusters
clusterer_dwt = TimeSeriesKMeansTslearn(
    n_clusters=k, metric="softdtw", n_jobs=-1, random_state=1337
)  # n_jobs=-1 uses all available CPU cores

start_time = time.time()
y_predicted_dwt = clusterer_dwt.fit_predict(X)
duration = time.time() - start_time
print(f"Clustering took {duration:.2f} seconds.")
Clustering took 1141.56 seconds.
Have a look at the execution times of both algorithms. Even though the DTW version runs in parallel (n_jobs=-1), it is still significantly slower than the Euclidean distance version. The reason is that the DTW algorithm has a time complexity of \(O(NM)\), where \(N\) and \(M\) are the lengths of the two time series being compared. More sophisticated algorithms can reduce this somewhat, but so far the runtime complexity remains quadratic.
However, the clustering results look better now.
plot_clusters(X, y_predicted_dwt, "Time Series clustered by k-means (dynamic time warping)", False)
plot_centroids(clusterer_dwt)
Cluster 2 matches class A very well, and cluster 4 matches class B. However, clusters 1 and 3 still contain a mix of classes C and D.
12.4 Elbow method for determining the number of clusters
One open question is how to determine the number of clusters \(k\). A common method is the elbow method, which plots the sum of distances to the nearest cluster center for different values of \(k\). The idea is to choose the value of \(k\) at which the rate of decrease sharply shifts, forming an elbow shape in the plot.
sum_of_distances = []
K = range(2, 10)
for k in K:
    km = TimeSeriesKMeansTslearn(
        n_clusters=k,
        metric="euclidean",
        random_state=42,
        n_jobs=-1,
    )
    km = km.fit(X)
    sum_of_distances.append(km.inertia_)
fig = go.Figure()
fig.add_trace(
    go.Scatter(
        x=list(K),
        y=sum_of_distances,
        mode="lines+markers",
        marker=dict(color="blue"),
        line=dict(color="blue"),
        name="Sum of distances",
    )
)
fig.update_layout(title="Elbow Method For Optimal k", xaxis_title="k", yaxis_title="Sum of distances")
fig.show()
The elbow method suggests that four or five clusters are a good choice for \(k\), which is close to the true number of classes in the data.
Often it makes sense to start with a larger number of clusters and then merge similar clusters later on. This makes it possible to capture more subtle patterns in the data.
\(k\)-means clustering in time series is still a very active research area. A recent preprint by Holder, Bagnall, and Lines (2024) gives a nice overview over the variants of \(k\)-means for time series and discusses their pros and cons.
Cuturi, Marco, and Mathieu Blondel. 2017. “Soft-DTW: A Differentiable Loss Function for Time-Series.” In International Conference on Machine Learning, 894–903. PMLR.
Holder, Christopher, Anthony Bagnall, and Jason Lines. 2024. “On Time Series Clustering with k-Means.” arXiv Preprint arXiv:2410.14269.
Lahreche, Abdelmadjid, and Bachir Boucheham. 2021. “A Comparison Study of Dynamic Time Warping’s Variants for Time Series Classification.” International Journal of Informatics and Applied Mathematics 4 (1): 56–71. https://dergipark.org.tr/en/pub/ijiam/issue/59559/819482.
Paparrizos, John, Fan Yang, and Haojun Li. 2024. “Bridging the Gap: A Decade Review of Time-Series Clustering Methods.” arXiv Preprint arXiv:2412.20582.
Roverso, Davide. 2002. “Plant Diagnostics by Transient Classification: The Aladdin Approach.” International Journal of Intelligent Systems 17 (8): 767–90.