CN113780451A

CN113780451A - Temporal data implication mode clustering analysis method of temporal-spatial big data

Info

Publication number: CN113780451A
Application number: CN202111088489.9A
Authority: CN
Inventors: 李海峰; 阮航; 庞垠; 张云生; 施庆章
Original assignee: Central South University; 63921 Troops of PLA
Current assignee: Central South University; 63921 Troops of PLA
Priority date: 2021-09-16
Filing date: 2021-09-16
Publication date: 2021-12-10

Abstract

The invention discloses a method for clustering the intrinsic patterns of time-series data in spatio-temporal big data, comprising the following steps: acquiring point cloud data and obtaining a persistence diagram using persistent homology; calculating the Sliced Wasserstein (SW) distance between persistence diagrams; calculating the topological-geometric mixed distance; optimizing the clusters using a clustering algorithm that minimizes the distance between the center of each cluster and the data points within the cluster; repeating the clustering operation and selecting a suitable number of clusters according to the silhouette coefficient; and analyzing the clustering result using evaluation indices, and visualizing the topological-geometric mixed distance metric space of the synthetic control data set using UMAP. The extracted structure reflects the regularity of each urban road network as a whole and reveals potential substructures; analyzing the latent structure of the data together with each city's total GDP shows that the structural division is related to the city's economic level and geographic location.

Description

Temporal data implication mode clustering analysis method of temporal-spatial big data

Technical Field

The invention belongs to the technical field of surveying and mapping, and particularly relates to a method for clustering the intrinsic patterns of time-series data in spatio-temporal big data.

Background

In recent years, with the development of information and communication technology and the spread of sensors and positioning technologies, large volumes of spatio-temporally tagged big data describing individual behavior have been generated, including mobile phone positioning data, taxi data, shared bicycle data, bus smart card data, social network data and video big data. These data offer great opportunities for analyzing the dynamics of urban structure, understanding the spatio-temporal regularities of human activity, and quantitatively understanding the socio-economic environment. The rise of big data with geographic attributes also places new demands on the ability to mine spatio-temporal big data. Within the time series analysis framework, time series clustering is an important method for understanding the intrinsic characteristics of temporal data. In essence, time series clustering is the process of grouping temporal data according to a given similarity measure. Clustering performance, in turn, depends critically on how similarity is quantified.

The classical similarity measures are geometric and focus mainly on local relations at given moments in the original time series. They include Dynamic Time Warping (DTW), Euclidean Distance (ED) and Longest Common Subsequence (LCSS). Here the information contained in the original time series is represented by its geometry. These methods have yielded satisfactory results on many types of data. Although they can detect similarities in time and shape and describe local geometric differences, they generally ignore the dynamics of the time series from a global perspective. Furthermore, since DTW and ED take all time points into account, they are usually very sensitive to outliers and noise.

More recently, another class of methods uses persistent homology to describe the topological properties of time series, which reflect their global dynamics. The basic idea is to construct a point cloud in a phase space by time-delay embedding of the original time series, and then to extract topological features of the point cloud, such as connected components, loops, three-dimensional cavities and their higher-dimensional analogues, using topological data analysis. Topological data analysis is a computational approach that uses topological theory to analyze high-dimensional complex data. Features extracted by persistent homology are robust: small perturbations of the data cause only small changes in the output of the topological analysis. However, these topology-based global methods extract only global information from the time series and cannot account for local quantitative differences.

Disclosure of Invention

The invention provides a time series clustering method based on the Topological-Geometric Mixed Distance (TGMD). Topological features are extracted, by means of a persistence diagram, from the point cloud obtained by delay embedding of the time series, while geometric features are the local correlations at given moments in the original time series. To characterize and quantify similarity using these two properties, the Sliced Wasserstein distance is used as the topological similarity measure, and ED and DTW as the geometric similarity measures. On this basis, a mixed distance measure based on an adjustment function is proposed, which adjusts the assessed proximity of the geometric features according to the proximity of the topological features. The proposed measure is then applied to k-medoids clustering of several sets of time series data, and the results demonstrate the effectiveness of the method.

In view of this, the schematic diagram of the temporal data implicit mode cluster analysis method for the spatio-temporal big data provided by the present invention is shown in fig. 1, and includes three parts: topological similarity measures, geometric similarity measures, and time series clustering based on a hybrid distance measure. The method specifically comprises the following steps:

acquiring point cloud data by delay embedding of a time series into a phase space, applying principal component analysis to the phase space to reduce topological noise, and obtaining a persistence diagram using persistent homology;

calculating the Sliced Wasserstein distance between persistence diagrams to obtain a global topological similarity measure in the phase space, and calculating the geometric similarity of the time series in the original space;

calculating the topological-geometric mixed distance, wherein an adjustment function is used to establish a new distance, the adjustment function being selected so that the topological similarity acts as an adjustment factor on the geometric distance;

optimizing the clusters with a clustering algorithm that minimizes the distance between the center of each cluster and the data points in the cluster, so as to generate spherical clusters of similar size;

repeating the clustering operation and selecting a suitable number of clusters according to the silhouette coefficient;

and analyzing the clustering result using evaluation indices, and visualizing the topological-geometric mixed distance metric space of the synthetic control data set using UMAP.

Further, for a time series $f = \{x_t \mid t = 1, 2, \ldots, T\}$, the time series data are embedded by delay embedding into the point cloud space $V = \{v_1, \ldots, v_i, \ldots, v_N\}$, whose points are represented as $v_i = (x_i, x_{i+\tau}, \ldots, x_{i+(d-1)\tau})$, where $d$ is the dimension of the embedded point cloud space and $\tau$ is the delay coefficient; $d$ and $\tau$ are treated as hyper-parameters and are selected according to the silhouette coefficient.

Further, the geometric similarity of the time series is calculated using the Euclidean distance (ED) and DTW, where ED is calculated as follows:

given two time series $T_1 = (u_1, \ldots, u_p)$ and $T_2 = (v_1, \ldots, v_p)$, the Euclidean distance $\delta_E$ between $T_1$ and $T_2$ is defined as:

$$\delta_E(T_1, T_2) = \sqrt{\sum_{i=1}^{p} (u_i - v_i)^2}$$

where $u_i$ is a point of time series $T_1$, $v_i$ is a point of time series $T_2$, and $p$ is the number of points in each time series;

the DTW is calculated as follows:

for two time series $T_1$ and $T_2$,

$$\delta_{DTW}(T_1, T_2) = \min_{r \in M} |r|$$

where the mapping $r$ belongs to $M$, the set of admissible mappings between the points of $T_1$ and $T_2$,

$$|r| = \sum_{i=1}^{m} |u_{a_i} - v_{b_i}|,$$

$u_{a_i}$ is the point of time series $T_1$ and $v_{b_i}$ is the point of time series $T_2$ matched by the $i$-th pair of the mapping.

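Further, given $\theta \in \mathbb{R}^2$ with $\|\theta\|_2 = 1$, let $L(\theta)$ denote the line $\{\lambda\theta \mid \lambda \in \mathbb{R}\}$ and let $\pi_\theta: \mathbb{R}^2 \to L(\theta)$ denote the orthogonal projection onto $L(\theta)$; the Sliced Wasserstein distance is defined as follows:

$$SW(Dg_1, Dg_2) = \frac{1}{2\pi} \int_{S^1} W\left(\mu_1^\theta + \mu_{2\Delta}^\theta,\; \mu_2^\theta + \mu_{1\Delta}^\theta\right) d\theta$$

where $Dg_1$ and $Dg_2$ are two persistence diagrams, a persistence diagram being the union of a finite multiset of points in $\mathbb{R}^2$ with the diagonal $\Delta$; $\mu_i^\theta = \sum_{p \in Dg_i} \delta_{\pi_\theta(p)}$ and $\mu_{i\Delta}^\theta = \sum_{p \in Dg_i} \delta_{\pi_\theta \circ \pi_\Delta(p)}$ for $i = 1, 2$, with $\delta_x$ the Dirac measure at $x$; $W$ is the 1-Wasserstein distance between measures on the line; and

$\pi_\Delta$ is the orthogonal projection onto the diagonal.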

Further, the calculation formula of the topological-geometric hybrid distance is as follows:

$$TGMD(T_1, T_2) = f(TS'(T_1, T_2)) \times Geo(T_1, T_2)$$

where $f(x)$ is a monotonically increasing adjustment function, $TS'$ is the normalized topological similarity obtained by rescaling the topological similarity $TS$, $T_1$ and $T_2$ are time series, and $Geo$ is the geometric distance between $T_1$ and $T_2$.

Further, the adjustment function is an exponential function, ensuring that the adjustment effect of the extremum and its nearest neighbors is almost equal, and the adjustment function f (x) is as follows:

wherein k is an adjustment coefficient, and k is more than or equal to 0.

Further, the clustering algorithm is K-medoids.

Furthermore, the evaluation indices are the silhouette coefficient and the adjusted Rand index (ARI),

the silhouette coefficient $s$ for a single sample is:

$$s = \frac{b - a}{\max(a, b)}$$

where $a$ is the average distance between the sample and all other points in the same cluster, and $b$ is the average distance between the sample and all points in the nearest neighboring cluster;

the adjusted Rand index ARI is as follows:

$$RI = \frac{TP + TN}{TP + FP + FN + TN}, \qquad ARI = \frac{RI - E[RI]}{\max(RI) - E[RI]}$$

where TP is the number of time-series pairs belonging to the same class and assigned to the same cluster, TN is the number of time-series pairs belonging to different classes and assigned to different clusters, FP is the number of time-series pairs belonging to different classes and assigned to the same cluster, FN is the number of time-series pairs belonging to the same class and assigned to different clusters, and RI is the fraction of pair decisions that are correct.

The invention has the following beneficial effects:

1) the invention can describe both the global qualitative characteristics and the local quantitative differences of a time series;

2) the mixed distance measure proposed by the invention is based on an adjustment function that adjusts the overall similarity according to the proximity of the topological properties;

3) experimental results show that the method outperforms purely geometric and purely topological methods in identifying clusters of oscillatory activity in noisy biological data, and is more robust to noise. Compared with other standard time series clustering methods, the method obtains competitive results on real data sets.

Drawings

FIG. 1 illustrates the Betti numbers corresponding to the topologies of different objects;

FIG. 2 shows time series S1-S4 in the original space;

FIG. 3 shows time series S1-S4 in the phase space;

FIG. 4 is a graph of the function of the adjustment function f (x);

FIG. 5 is a summation of all region time series;

FIG. 6 is an example of a non-periodic traffic time series in different regions;

FIG. 7 is a visualization of the time series data of FIG. 6 after delay embedding and projection by PCA;

FIG. 8 is a one-dimensional persistence diagram of FIG. 6;

FIG. 9 is an example of a periodic traffic time series in different regions;

FIG. 10 is a visualization of the time series data of FIG. 9 after delay embedding and projection by PCA;

FIG. 11 is a one-dimensional persistence diagram of FIG. 9;

FIG. 12 is a graph of silhouette coefficients for different numbers of clusters k;

FIG. 13 is a spatial distribution of clustering results;

FIG. 14 is a time series of the boarding and disembarking of 4 typical regions in category 1;

FIG. 15 is a spatial distribution of clustering results;

FIG. 16 is a time series of getting on and off for 2 exemplary regions in category 2;

FIG. 17 is a time series of getting on and off for another 2 typical regions in category 2;

FIG. 18 is an example of non-oscillating time series data acquired every 5 minutes;

FIG. 19 is a time series in the phase space of FIG. 18;

FIG. 20 is the persistence diagram of FIG. 18;

FIG. 21 is an example of acquiring oscillation time-series data every 5 minutes;

FIG. 22 is a time series in the phase space of FIG. 21;

FIG. 23 is the persistence diagram of FIG. 21;

FIG. 24 is the clustering results of synthetic single cell data with different intervals;

FIG. 25 is a graph of clustering results for data containing different noise;

FIG. 26 shows the cluster analysis results of the adjustment function for different values of k.

Detailed Description

The invention is further described with reference to the accompanying drawings, but the invention is not limited in any way, and any alterations or substitutions based on the teaching of the invention are within the scope of the invention.

The invention discloses a method for clustering the intrinsic patterns of time-series data in spatio-temporal big data, which comprises the following steps:

S1: acquiring point cloud data by delay embedding of a time series into a phase space, applying principal component analysis to the phase space to reduce topological noise, and obtaining a persistence diagram using persistent homology;

S2: calculating the Sliced Wasserstein distance between persistence diagrams to obtain a global topological similarity measure in the phase space, and calculating the geometric similarity of the time series in the original space;

S3: calculating the topological-geometric mixed distance, wherein an adjustment function is used to establish a new distance, the adjustment function being selected so that the topological similarity acts as an adjustment factor on the geometric distance;

S4: optimizing the clusters with a clustering algorithm that minimizes the distance between the center of each cluster and the data points in the cluster, so as to generate spherical clusters of similar size;

S5: repeating the clustering operation and selecting a suitable number of clusters according to the silhouette coefficient;

S6: analyzing the clustering result using evaluation indices, and visualizing the topological-geometric mixed distance metric space of the synthetic control data set using UMAP.

Extracting global topological properties in the latent phase space of a time series

Given a one-dimensional signal $f$ of length $T$, written $\{x_t \mid t = 1, 2, \ldots, T\}$, the series $F = \{f_1, \ldots, f_t, \ldots, f_T\}$ is a sequence of real numbers collected on a time scale, where $f_t$ is the value at time $t$.

Delay embedding of the time series makes its distinctive topological features easier to understand in the phase space. For a time series $f = \{x_t \mid t = 1, 2, \ldots, T\}$, delay embedding maps the time series data into the point cloud space $V = \{v_1, \ldots, v_i, \ldots, v_N\}$, whose points are $v_i = (x_i, x_{i+\tau}, \ldots, x_{i+(d-1)\tau})$, where $d$ is the dimension of the embedded point cloud space and $\tau$ is the delay coefficient. The number of points $N$ in the point cloud space depends on the choice of $d$ and $\tau$, which are two important parameters. In practice, since real time series are noisy and of finite length, no optimal method for determining $d$ and $\tau$ is currently known. Most methods for selecting $\tau$ are based on heuristics rather than on strict mathematical criteria. The invention treats $d$ and $\tau$ as hyper-parameters and selects them according to the silhouette coefficient; the silhouette coefficient is described in the non-patent literature Peter J. Rousseeuw, "Silhouettes: A graphical aid to the interpretation and validation of cluster analysis," Journal of Computational and Applied Mathematics, 1987, 20:53-65, and is not described further here.
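As an illustration of the delay embedding described above, the following minimal Python sketch builds the point cloud from a one-dimensional series. The function name `delay_embed` is hypothetical and not part of the invention; the sketch only assumes that a series of length T yields N = T - (d - 1)τ points.

```python
def delay_embed(series, d, tau):
    """Return the delay-embedded point cloud of a 1-D series.

    Each point is v_i = (x_i, x_{i+tau}, ..., x_{i+(d-1)tau}).
    """
    if d < 1 or tau < 1:
        raise ValueError("d and tau must be positive integers")
    n_points = len(series) - (d - 1) * tau
    if n_points <= 0:
        raise ValueError("series too short for this choice of d and tau")
    return [
        tuple(series[i + j * tau] for j in range(d))
        for i in range(n_points)
    ]


# Example: T = 10 samples, d = 3, tau = 2 gives N = 10 - (3 - 1) * 2 = 6 points.
cloud = delay_embed(list(range(10)), d=3, tau=2)
print(len(cloud), cloud[0], cloud[-1])
```

Larger values of d or τ shorten the point cloud, which is one reason the two parameters have to be chosen jointly with the length of the series in mind.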

S1(t)＝sin(2πt) (1)

S2(t)＝sin(4πt) (2)

In phase space, the relevant topological features of a time series are mainly 0-dimensional connected components and 1-dimensional loop structures. Consider the examples in FIG. 2 and FIG. 3: FIG. 2 shows the time series S1-S4 in the original space, and FIG. 3 shows them in the phase space. S1-S3 are periodic time series, see equations (1)-(3), while S4 is an aperiodic time series given by equation (4). S1-S3 have the same topology in the embedding space, namely a one-dimensional loop, and therefore cannot be distinguished from one another by their global topology. However, their pairwise DTW distances show that S1 and S3 are closer to each other than S1 and S2, owing to local differences. Because the frequency of S2 is higher than that of S1 and S3, comparing DTW distances even shows S1 to be closer to S4 than to S2. In terms of topological characteristics, however, S1 and S2 both differ from S4. Thus, considering both kinds of properties yields more information than concentrating on the local or the global properties of the time series alone.

After the time series has been embedded into the phase space, PCA (principal component analysis) is applied to the embedded point cloud to reduce topological noise. The topology is then described by persistence diagrams.

Persistent homology is an algebraic method for computing the topological features of a space at different spatial resolutions. Features that persist over a wide range of spatial scales are more likely to be detected; such long-lived features are classified as topological signal, while short-lived features are regarded as noise.

The space in which a persistence barcode can be represented and its geometric properties studied is called the persistence diagram; it summarizes the topological information obtained in the persistent homology computation. A persistence diagram is the union of a finite multiset of points in $\mathbb{R}^2$ with the diagonal $\Delta = \{(x, x) \mid x \in \mathbb{R}\}$, where each point of $\Delta$ has infinite multiplicity, i.e., it appears infinitely many times in the multiset. Persistence diagrams are compared by means of bijections between them, so the diagrams must have the same cardinality, which is why the diagonal is included. Including the diagonal can also simplify the comparison between diagrams: points near the diagonal correspond to short-lived topological features, which are likely caused by small perturbations in the data and can be regarded as noise. A persistence barcode is converted to a persistence diagram by taking the union of the multiset of (birth, death) pairs with the diagonal $\Delta$. Persistent homology and persistence diagrams are prior art and are not described further here.
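To make the (birth, death) pairs concrete, the following sketch computes only the 0-dimensional persistence pairs (connected components) of a Vietoris-Rips filtration using a union-find structure. It is an illustration under stated assumptions, not the procedure of the invention: the function name `h0_persistence` is hypothetical, an edge is assumed to enter the filtration at its Euclidean length, and the 1-dimensional loop features discussed in this description would in practice require a dedicated persistent homology library.

```python
import math
from itertools import combinations


def h0_persistence(points):
    """0-dimensional persistence pairs of a Vietoris-Rips filtration.

    Every point is born at scale 0; a component dies at the length of the
    edge that merges it into another component (Kruskal / union-find).
    """
    parent = list(range(len(points)))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    edges = sorted(
        (math.dist(points[i], points[j]), i, j)
        for i, j in combinations(range(len(points)), 2)
    )
    pairs = []
    for length, i, j in edges:
        ri, rj = find(i), find(j)
        if ri != rj:
            parent[ri] = rj
            pairs.append((0.0, length))  # (birth, death)
    pairs.append((0.0, math.inf))  # the component that never dies
    return pairs


# Two tight pairs of points far apart: two short bars, one long bar, one infinite bar.
diagram = h0_persistence([(0, 0), (0, 1), (10, 0), (10, 1)])
print(diagram)
```

In this toy example the two short bars lie close to the diagonal and would be read as noise, while the long bar reflects the genuine separation into two groups.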

The present embodiment uses the Sliced Wasserstein distance as the topological similarity measure. The Sliced Wasserstein distance is not only provably stable but also discriminative (with bounds that depend on the number of points in the persistence diagrams). The basic idea of this metric is to slice the plane with lines through the origin, project the measures onto these lines, where the one-dimensional Wasserstein distance $W$ is computed, and integrate these distances over all possible lines.

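Given $\theta \in \mathbb{R}^2$ with $\|\theta\|_2 = 1$, let $L(\theta)$ denote the line $\{\lambda\theta \mid \lambda \in \mathbb{R}\}$ and let $\pi_\theta: \mathbb{R}^2 \to L(\theta)$ denote the orthogonal projection onto $L(\theta)$. Let $Dg_1$ and $Dg_2$ be two persistence diagrams, and let

$$\mu_1^\theta = \sum_{p \in Dg_1} \delta_{\pi_\theta(p)}, \qquad \mu_{1\Delta}^\theta = \sum_{p \in Dg_1} \delta_{\pi_\theta \circ \pi_\Delta(p)}$$

where $\delta_x$ is the Dirac measure at $x$ and $\pi_\Delta$ is the orthogonal projection onto the diagonal; $\mu_2^\theta$ and $\mu_{2\Delta}^\theta$ are defined in the same way for $Dg_2$. The Sliced Wasserstein distance is then defined as follows:

$$SW(Dg_1, Dg_2) = \frac{1}{2\pi} \int_{S^1} W\left(\mu_1^\theta + \mu_{2\Delta}^\theta,\; \mu_2^\theta + \mu_{1\Delta}^\theta\right) d\theta$$

Let $\mu$ and $\nu$ be two non-negative measures on the real line such that $|\mu| = \mu(\mathbb{R})$ and $|\nu| = \nu(\mathbb{R})$ are equal to the same number $r$. The 1-Wasserstein distance $W$ is then defined as:

$$W(\mu, \nu) = \inf_{P \in \Pi(\mu, \nu)} \iint_{\mathbb{R} \times \mathbb{R}} |x - y| \, P(dx, dy)$$

where $\Pi(\mu, \nu)$ is the set of measures on $\mathbb{R}^2$ with marginals $\mu$ and $\nu$.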
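The Sliced Wasserstein distance can be approximated by sampling a finite number of directions instead of integrating over all lines. The sketch below is one such approximation under stated assumptions: the function name `sliced_wasserstein` and the parameter `n_directions` are hypothetical, diagrams are given as lists of finite (birth, death) pairs, and directions are sampled uniformly over a half circle, which suffices because opposite directions give the same projected distance.

```python
import math


def sliced_wasserstein(dg1, dg2, n_directions=50):
    """Approximate Sliced Wasserstein distance between two persistence diagrams.

    dg1, dg2: lists of finite (birth, death) pairs.
    """
    def to_diagonal(p):
        mid = (p[0] + p[1]) / 2.0
        return (mid, mid)

    # Augment each diagram with the diagonal projections of the other one,
    # so that both point sets have the same cardinality.
    set1 = list(dg1) + [to_diagonal(p) for p in dg2]
    set2 = list(dg2) + [to_diagonal(p) for p in dg1]

    total = 0.0
    for m in range(n_directions):
        theta = -math.pi / 2 + m * math.pi / n_directions
        c, s = math.cos(theta), math.sin(theta)
        proj1 = sorted(x * c + y * s for x, y in set1)
        proj2 = sorted(x * c + y * s for x, y in set2)
        # On a line, the 1-Wasserstein distance between two equal-size point
        # sets is obtained by matching the sorted projections.
        total += sum(abs(a - b) for a, b in zip(proj1, proj2))
    return total / n_directions


a = [(0.0, 1.0), (0.2, 0.3)]
b = [(0.0, 2.0)]
print(sliced_wasserstein(a, b))
```

Increasing `n_directions` trades computation time for a closer approximation of the integral.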

Extracting local geometric properties in the original space of a time series

In the original space, a time series is uniquely represented by its geometry, which also carries information. For example, a seismograph signal (amplitude over time, reflecting ground motion) contains information about an earthquake. The invention captures similarity in such shapes by calculating distances between local point pairs of the original time series.

The present embodiment uses ED and DTW as geometric similarity measures to quantify the geometric similarity of time series. The Euclidean distance and the DTW distance are the most widely used geometric measures for time series. Let $T_1 = (u_1, \ldots, u_p)$ and $T_2 = (v_1, \ldots, v_p)$ be two time series. The Euclidean distance $\delta_E$ between $T_1$ and $T_2$ is defined as:

$$\delta_E(T_1, T_2) = \sqrt{\sum_{i=1}^{p} (u_i - v_i)^2}$$

For two time series $T_1$ and $T_2$, define a mapping $r \in M$, where:

$$|r| = \sum_{i=1}^{m} |u_{a_i} - v_{b_i}|$$

On this basis, the DTW distance is defined as:

$$\delta_{DTW}(T_1, T_2) = \min_{r \in M} |r|$$
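The two geometric measures can be sketched in Python as follows. The function names are hypothetical, and the DTW sketch assumes the usual warping-path constraints (the path starts at the first pair of points, ends at the last pair, and advances by at most one index per series at each step), with the absolute difference as the local cost.

```python
import math


def euclidean_distance(t1, t2):
    """Euclidean distance between two equal-length series."""
    if len(t1) != len(t2):
        raise ValueError("series must have the same length")
    return math.sqrt(sum((u - v) ** 2 for u, v in zip(t1, t2)))


def dtw_distance(t1, t2):
    """DTW distance: minimum over warping paths of the summed |u - v| costs."""
    n, m = len(t1), len(t2)
    inf = math.inf
    # cost[i][j] = best cumulative cost of aligning t1[:i] with t2[:j]
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            step = abs(t1[i - 1] - t2[j - 1])
            cost[i][j] = step + min(
                cost[i - 1][j], cost[i][j - 1], cost[i - 1][j - 1]
            )
    return cost[n][m]


a = [0, 1, 2, 1, 0]
b = [0, 0, 1, 2, 1]  # the same bump, shifted by one step
print(euclidean_distance(a, b), dtw_distance(a, b))
```

For the shifted bump in the example, DTW yields a smaller value than ED because the warping path absorbs the time shift, which is the behavior that makes DTW sensitive to shape rather than to exact timing.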

Geometric similarity measures tend to identify local geometric relationships and numerical differences between the original time series samples, without taking topological changes in the time series data into account. The present embodiment therefore adopts a new similarity measure that combines topological similarity with a traditional geometric similarity measure, namely the topological-geometric mixed distance (TGMD).

The topological-geometric mixed distance establishes a new distance using an adjustment function, selected so that the topological similarity (TS) acts as an adjustment factor on the geometric distance (Geo). The adjustment function increases the TGMD distance if the topological properties of the two time series differ. An exponential function rather than a linear function is chosen as the adjustment function because its rate of increase or decrease is lower near the extrema (1 and 0), which ensures that the adjustment effect at an extremum and at its nearest neighbors is nearly equal.

The Topological Similarity (TS) is first rescaled to obtain normalized topological similarity (TS'), as follows:

then, the monotonically increasing adjustment function f (x) is as follows:

wherein k is an adjustment coefficient, and k is more than or equal to 0.

In conjunction with the above procedure, the following TGMD similarity measure is proposed:

$$TGMD(T_1, T_2) = f(TS'(T_1, T_2)) \times Geo(T_1, T_2)$$

The adjustment function adjusts the geometric distance according to the topological similarity. As the normalized topological similarity (TS') of two time series approaches 1, the adjustment function satisfies $f(x) \geq 1$ for any $k \geq 0$, which increases the TGMD measure. The scale factor contributed by the topological similarity grows as $k$ increases.
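The combination of the two distance matrices can be sketched as follows. The exact formulas for the rescaling of TS and for f(x) are not reproduced in this text, so the sketch relies on two loudly stated assumptions: min-max rescaling of the topological distance matrix to [0, 1], and the exponential-type function f(x) = 2 / (1 + exp(-kx)), which is monotonically increasing and satisfies f(x) ≥ 1 for x ≥ 0 and k ≥ 0 as required above. All function names are hypothetical.

```python
import math


def adjustment(x, k=1.0):
    """ASSUMED adjustment function 2 / (1 + exp(-k x)); f(0) = 1, increasing in x."""
    return 2.0 / (1.0 + math.exp(-k * x))


def min_max_rescale(matrix):
    """ASSUMED normalisation: rescale all entries of a matrix to [0, 1]."""
    flat = [v for row in matrix for v in row]
    lo, hi = min(flat), max(flat)
    if hi == lo:
        return [[0.0 for _ in row] for row in matrix]
    return [[(v - lo) / (hi - lo) for v in row] for row in matrix]


def tgmd_matrix(topo, geo, k=1.0):
    """TGMD(T1, T2) = f(TS'(T1, T2)) * Geo(T1, T2), entry by entry."""
    ts_norm = min_max_rescale(topo)
    n = len(geo)
    return [
        [adjustment(ts_norm[i][j], k) * geo[i][j] for j in range(n)]
        for i in range(n)
    ]


# Series 0 and 1 share the same topology; series 2 differs topologically.
topo = [[0, 0, 4], [0, 0, 4], [4, 4, 0]]
geo = [[0, 1, 1], [1, 0, 1], [1, 1, 0]]
print(tgmd_matrix(topo, geo, k=1.0))
```

In the example all three series are geometrically equidistant, yet the mixed distance pushes the topologically different series further away; with k = 0 the adjustment factor is 1 everywhere and TGMD reduces to the geometric distance.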

The choice of clustering algorithm depends on the strategy adopted, i.e., maximizing intra-group similarity and minimizing inter-group similarity. The K-medoids algorithm optimizes clusters by minimizing the distance between the center of each cluster and the data points within the cluster, ultimately generating spherical clusters of similar size. Unlike the centroid used by K-means, which need not be an actual data point, the center (medoid) of a K-medoids cluster is always an actual data point, so the algorithm only requires pairwise distances. In this example, the K-medoids algorithm is used for cluster analysis with the TGMD similarity measure.
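Because K-medoids needs only pairwise distances, it can be run directly on a precomputed distance matrix such as a TGMD matrix. The following is a minimal sketch of the alternating (assign, then update) variant; the function name `k_medoids` is hypothetical, and taking the first k samples as initial medoids is an arbitrary simplification made for this illustration rather than a recommended initialization.

```python
def k_medoids(dist, k, max_iter=100):
    """Alternating k-medoids on a precomputed distance matrix.

    Returns (medoids, labels); the initial medoids are simply the first k
    indices, an arbitrary choice made for this sketch.
    """
    n = len(dist)
    medoids = list(range(k))
    labels = [0] * n
    for _ in range(max_iter):
        # Assignment step: each point joins its nearest medoid.
        labels = [
            min(range(k), key=lambda c: dist[i][medoids[c]]) for i in range(n)
        ]
        # Update step: the new medoid of a cluster is the member that
        # minimises the total distance to the members of that cluster.
        new_medoids = []
        for c in range(k):
            members = [i for i in range(n) if labels[i] == c]
            if not members:
                new_medoids.append(medoids[c])  # keep the medoid of an empty cluster
                continue
            new_medoids.append(
                min(members, key=lambda m: sum(dist[m][j] for j in members))
            )
        if new_medoids == medoids:
            break
        medoids = new_medoids
    return medoids, labels


values = [0, 1, 2, 10, 11, 12]
dist = [[abs(a - b) for b in values] for a in values]
print(k_medoids(dist, 2))
```

The returned medoids are indices into the data set, so every cluster center is one of the original time series.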

The data sets used by the invention are synthetic single-cell time series data, the UCR time series archive and taxi time series.

Synthetic single-cell time series data: synthetic single-cell mRNA and protein time series were generated from a computational model of the Hes1 oscillator, a negative autoregulation system with delay. The model-generated data fall into two categories, oscillatory and non-oscillatory (periodic and non-periodic) time series. Data for 200 cells of the Hes1 model were simulated under both the oscillatory and the non-oscillatory parameter regimes, and protein levels were sampled every 5, 10, 15 and 20 minutes between 5000 and 6500 minutes. After the time series had been acquired at the different intervals, the data were normalized to zero mean and unit variance, and Gaussian noise was added.
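The normalization and noise step can be sketched as follows. The function names, the noise level `sigma` and the fixed random seed are hypothetical choices for illustration; the text does not state the noise level that was used.

```python
import random
import statistics


def standardise(series):
    """Rescale a series to zero mean and unit (population) variance."""
    mean = statistics.fmean(series)
    std = statistics.pstdev(series)
    if std == 0:
        raise ValueError("a constant series cannot be scaled to unit variance")
    return [(x - mean) / std for x in series]


def add_gaussian_noise(series, sigma, seed=0):
    """Add zero-mean Gaussian noise with standard deviation sigma (hypothetical level)."""
    rng = random.Random(seed)
    return [x + rng.gauss(0.0, sigma) for x in series]


z = standardise([1.0, 2.0, 3.0, 4.0, 5.0])
noisy = add_gaussian_noise(z, sigma=0.1)
print(statistics.fmean(z), statistics.pstdev(z), len(noisy))
```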

UCR time series archive: the UCR time series archive is an important database for the time series data mining community; 11 typical data sets were selected from it to evaluate the clustering performance of the algorithm. These data sets differ in size, length, number of classes and signal type.

Taxi time series: this example also uses a time series data set generated from the GPS trajectories of Qiangsheng taxis in Shanghai, China (https://sodachallenges.com/datasets/taxi-GPS/#). Shanghai is the most populous city in China, and its socio-spatial structure has changed greatly since the reform and opening-up. Some studies have analyzed the impact of urban form on residents' travel behavior, but they used small samples of survey data, which may not be representative of the general population, and the data lacked accurate travel variables. The present embodiment therefore analyzes a large volume of travel data collected from location-aware devices. Many taxi companies have installed GPS receivers in their fleets to monitor the real-time movement of each taxi. In the raw data set, each record consists of 13 fields, including vehicle ID, GPS time, latitude and longitude, speed, number of satellites, operating status, overhead status and braking status, where the latitude and longitude fields give the geographic location of the taxi.

The study area was divided into 1480 grid cells of 1 km × 1 km, similar to Traffic Analysis Zones (TAZ). This scale follows previous studies, which found cells of this size sufficient to describe urban structure. In addition, the cells can serve as substitutes for traffic analysis zones, representing relatively homogeneous socio-economic units. In this embodiment, the taxi trajectories are reduced to time series within each cell, separated into a pick-up time series and a drop-off time series. After pre-processing, the time series of each cell has 120 dimensions: $T_i = \{t_1, t_2, \ldots, t_{120}\}$, where $t_1$ to $t_{120}$ are the hourly numbers of taxis in the $i$-th cell over the weekdays; the result is plotted in FIG. 5. The geometry of the two time series is very similar, and five 24-hour periods can be clearly identified, representing a roughly repeating temporal distribution of pick-ups and drop-offs over 5 days. Overall, the traffic time series have distinct geometric and topological properties (periodicity).
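The reduction of trajectory events to per-cell time series can be sketched as a simple counting step. The input format is a hypothetical simplification: each pick-up or drop-off event is assumed to be already projected to planar kilometre coordinates and tagged with an hour index between 0 and 119 (5 weekdays × 24 hours); the function name `cell_time_series` is likewise hypothetical.

```python
from collections import defaultdict

HOURS = 120  # 5 weekdays x 24 hours, as in the text


def cell_time_series(events, cell_km=1.0):
    """Count events per grid cell and per hour.

    events: iterable of (x_km, y_km, hour_index) with 0 <= hour_index < 120;
    the planar kilometre coordinates are a hypothetical, already-projected input.
    """
    series = defaultdict(lambda: [0] * HOURS)
    for x_km, y_km, hour in events:
        cell = (int(x_km // cell_km), int(y_km // cell_km))
        series[cell][hour] += 1
    return dict(series)


pickups = [(0.2, 0.7, 0), (0.9, 0.1, 0), (1.5, 0.3, 119)]
result = cell_time_series(pickups)
print(sorted(result), result[(0, 0)][0])
```

Running the same function separately on pick-up events and on drop-off events gives the two 120-dimensional time series per cell.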

Comparison method

The TGMD method of the present invention is compared to other standard clustering methods for time series data. In particular, the commonly used distance metrics ED and DTW are considered as geometric methods and are clustered using K-medoids. The ED of the PCA subspace is then considered as a distance measure and time series clustering is performed using K-Means. TGMD clustering results were also compared to results from k-shape and TSKMeans. The k-shape clustering method uses a normalized cross-correlation metric to consider the shape of time series when comparing them. TSKMeans is a smooth subspace clustering algorithm of a k-means type, and can effectively utilize subspace information inherent in a time sequence data set to improve clustering performance.

Evaluation index

The present embodiment uses the following two indexes to measure the effectiveness of the method.

Silhouette coefficient: if the ground-truth labels are unknown, the model itself must be used for evaluation. A higher silhouette coefficient indicates a model with better-defined clusters. A silhouette coefficient is defined for each sample and is composed of two quantities.

The silhouette coefficient $s$ for a single sample is:

$$s = \frac{b - a}{\max(a, b)}$$

where $a$ is the average distance between the sample and all other points in the same cluster, and $b$ is the average distance between the sample and all points in the nearest neighboring cluster.
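The per-sample silhouette coefficient, s = (b - a) / max(a, b), can be computed directly from a precomputed distance matrix, as in the following sketch. The function name `silhouette_samples` is hypothetical, at least two clusters are assumed, and assigning 0 to samples in singleton clusters is a common convention adopted here as an assumption.

```python
def silhouette_samples(dist, labels):
    """Silhouette coefficient s = (b - a) / max(a, b) for every sample."""
    n = len(labels)
    clusters = sorted(set(labels))
    scores = []
    for i in range(n):
        own = [j for j in range(n) if labels[j] == labels[i] and j != i]
        if not own:
            scores.append(0.0)  # convention for singleton clusters
            continue
        # a: mean distance to the other members of the sample's own cluster
        a = sum(dist[i][j] for j in own) / len(own)
        # b: smallest mean distance to the members of any other cluster
        b = min(
            sum(dist[i][j] for j in range(n) if labels[j] == c) / labels.count(c)
            for c in clusters
            if c != labels[i]
        )
        scores.append((b - a) / max(a, b))
    return scores


values = [0, 1, 10, 11]
dist = [[abs(p - q) for q in values] for p in values]
scores = silhouette_samples(dist, [0, 0, 1, 1])
print(sum(scores) / len(scores))
```

Averaging the per-sample values gives the overall silhouette score that can be compared across different numbers of clusters.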

Adjusted Rand index (ARI): the adjusted Rand index corrects the Rand index for chance, since the Rand index does not describe well the similarity of randomly assigned cluster label vectors.

$$RI = \frac{TP + TN}{TP + FP + FN + TN}, \qquad ARI = \frac{RI - E[RI]}{\max(RI) - E[RI]}$$

where TP is the number of time-series pairs belonging to the same category and assigned to the same cluster, TN is the number of time-series pairs belonging to different categories and assigned to different clusters, FP is the number of time-series pairs belonging to different categories and assigned to the same cluster, and FN is the number of time-series pairs belonging to the same category and assigned to different clusters.
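The pair counts TP, TN, FP and FN can be obtained by enumerating all pairs of samples, as sketched below. The sketch computes the Rand index as the fraction of correct pair decisions and the adjusted Rand index through its pair-counting form, 2(TP·TN − FN·FP) / ((TP + FN)(FN + TN) + (TP + FP)(FP + TN)), which is algebraically equivalent to the chance-corrected definition. The function names are hypothetical, and returning 1.0 when the denominator vanishes is an assumed convention for degenerate partitions.

```python
from itertools import combinations


def pair_counts(truth, pred):
    """Count TP, TN, FP, FN over all unordered pairs of samples."""
    tp = tn = fp = fn = 0
    for i, j in combinations(range(len(truth)), 2):
        same_class = truth[i] == truth[j]
        same_cluster = pred[i] == pred[j]
        if same_class and same_cluster:
            tp += 1
        elif not same_class and not same_cluster:
            tn += 1
        elif same_cluster:
            fp += 1
        else:
            fn += 1
    return tp, tn, fp, fn


def rand_index(truth, pred):
    """Fraction of pairs on which the two labelings agree."""
    tp, tn, fp, fn = pair_counts(truth, pred)
    return (tp + tn) / (tp + tn + fp + fn)


def adjusted_rand_index(truth, pred):
    """Chance-corrected Rand index in its pair-counting form."""
    tp, tn, fp, fn = pair_counts(truth, pred)
    denom = (tp + fn) * (fn + tn) + (tp + fp) * (fp + tn)
    if denom == 0:
        return 1.0  # assumed convention: identical trivial partitions
    return 2.0 * (tp * tn - fn * fp) / denom


print(adjusted_rand_index([0, 0, 1, 1], [1, 1, 0, 0]))
```

The index depends only on which samples are grouped together, so permuting the cluster labels, as in the example, does not change the score.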

Traffic spatio-temporal sequence clustering analysis

A cluster analysis was carried out on the spatio-temporal taxi series in Shanghai (pick-up: passengers getting on; drop-off: passengers getting off). DTW and Sliced Wasserstein were used as the geometric distance and the topological distance, respectively. Since this data set has no ground-truth labels, the appropriate number of clusters was selected based on the silhouette coefficient.

FIG. 6 and FIG. 9 show examples of aperiodic and periodic traffic time series, respectively, in different regions. FIG. 7 and FIG. 10 visualize the time series data of FIG. 6 and FIG. 9 after delay embedding ($\tau = 2$, $d = 6$) and projection onto a two-dimensional space by PCA. FIG. 8 and FIG. 11 are the one-dimensional persistence diagrams of FIG. 6 and FIG. 9. To analyze the topological properties of the time series, the time series are embedded in the phase space. For the time series with pronounced periodicity, FIG. 10 shows a clear topological signal (a loop), in contrast to FIG. 7, where no topological feature appears. Topological features are then obtained using persistent homology: FIG. 8 and FIG. 11 show two persistence diagrams that summarize the topology of the point clouds in phase space. In FIG. 8, most points lie close to the diagonal and can be interpreted as topological noise. In FIG. 11, by contrast, a point far from the diagonal, i.e., a persistent topological signal, is clearly visible. This example shows that aperiodic and periodic taxi time series can be distinguished by their persistence diagrams.

Next, all taxi time series in Shanghai were clustered. First, an appropriate number of clusters was selected according to the silhouette coefficient by repeated clustering. FIG. 12 shows how the silhouette coefficient varies with the number of clusters $k$. In general, the larger the silhouette value, the better the clustering result. For both time series, the silhouette coefficient of the clustering results drops markedly between $k = 4$ and $k = 5$, and for $k > 5$ the silhouette coefficient is below 0.5. On this basis, the data were divided into 4 clusters for the cluster analysis.

Overall, the study area was divided into 4 clusters forming a concentric ring structure, as shown in FIG. 13. Cluster 1 (C1) is distributed over the central area of the city, including airports and train stations. The time series of C1 show a large taxi traffic volume and a stable period, see FIG. 14. The geometric features of the original time series capture quantitative characteristics, namely taxi traffic volume, while the stable periodicity can be described as a stable topology of the time series in the phase space.

Cluster 2 and cluster 3 are distributed over the transition area between the suburbs and the city center, while cluster 4 (C4) is distributed over the suburbs. FIG. 15 shows typical areas in the four clusters. Cell 4 in FIG. 16 belongs to cluster 1; its temporal pattern exhibits strong periodicity, the traffic volume to and from such areas is very large, and clear topological features (points far from the diagonal) can be seen in the persistence diagram (PD). In contrast, the suburban area (cell 7 of C4 in FIG. 17) has fewer pick-ups and drop-offs, and its temporal pattern is irregular. The persistence diagram of cell 7 shows no obvious topological features (its points lie close to the diagonal and can be regarded as noise). The temporal patterns of the residential areas (cell 5 and cell 6) also show a 24-hour period, but the fluctuation in cell 5 is larger than in cell 6. Correspondingly, the persistence diagram of cell 6 in FIG. 17 shows both a clear periodicity and more noisy points. Judging from the overall clustering result, the TGMD method effectively captures the geometric and topological characteristics of the taxi spatio-temporal series.

Clustering method performance analysis

First, the performance of three different distance metrics was tested on synthetic single-cell data: geometry only (DTW), topology only (Sliced Wasserstein), and TGMD, each clustered using the K-Medoids method. The adjusted Rand index (ARI) was then used to analyze the clustering results and to compare performance across embedding dimensions m ∈ {2, 3, 4, 5, 6, 7, 8, 9, 10}.
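As a non-limiting sketch, K-Medoids clustering on a precomputed distance matrix (for example a matrix of DTW, Sliced Wasserstein, or TGMD distances) may be written as follows. The initialisation rule and the toy data are assumptions of this sketch and are not taken from the embodiment.

```python
def k_medoids(dist, k, max_iter=100):
    """Alternating K-Medoids on a precomputed distance matrix."""
    n = len(dist)
    # Initialisation (an assumption of this sketch): start from the most
    # central point, then repeatedly add the point farthest from the
    # medoids chosen so far.
    medoids = [min(range(n), key=lambda i: sum(dist[i]))]
    while len(medoids) < k:
        medoids.append(
            max(range(n), key=lambda i: min(dist[i][m] for m in medoids))
        )
    labels = []
    for _ in range(max_iter):
        # Assignment step: each point joins its nearest medoid.
        labels = [
            min(range(k), key=lambda c: dist[i][medoids[c]]) for i in range(n)
        ]
        # Update step: the new medoid of a cluster is the member with the
        # smallest total distance to the other members.
        new_medoids = []
        for c in range(k):
            members = [i for i in range(n) if labels[i] == c]
            if not members:
                new_medoids.append(medoids[c])  # keep the medoid of an empty cluster
                continue
            new_medoids.append(
                min(members, key=lambda m: sum(dist[m][j] for j in members))
            )
        if new_medoids == medoids:
            break  # converged
        medoids = new_medoids
    return medoids, labels


# Toy data (hypothetical): six 1-D points forming three tight pairs.
points = [0, 1, 10, 11, 20, 21]
dist = [[abs(p - q) for q in points] for p in points]
medoids, labels = k_medoids(dist, k=3)
```

Because only the distance matrix is used, the same routine applies unchanged to any of the three metrics compared above.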

Then, the clustering performance of the algorithm of this embodiment was tested on real data sets and compared with the results of other methods. In the method proposed by this embodiment, the delay embedding parameters τ and d are selected manually for each data set according to the silhouette (contour) coefficient. For the adjustment function, k = 1 is chosen, and the Euclidean distance and the DTW distance are used to evaluate the geometric similarity.
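For illustration only, the two geometric similarity measures named above may be sketched as follows. The DTW variant shown uses the absolute difference as the local cost and no warping-window constraint, which is an assumption of this sketch; the example series are hypothetical.

```python
import math


def euclidean_distance(u, v):
    """Euclidean distance between two series of equal length."""
    if len(u) != len(v):
        raise ValueError("series must have the same length")
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def dtw_distance(u, v):
    """Dynamic time warping distance with absolute-difference local cost."""
    n, m = len(u), len(v)
    inf = float("inf")
    # cost[i][j]: cheapest alignment of the first i points of u
    # with the first j points of v
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            step = abs(u[i - 1] - v[j - 1])
            cost[i][j] = step + min(
                cost[i - 1][j],      # u advances, v repeats
                cost[i][j - 1],      # v advances, u repeats
                cost[i - 1][j - 1],  # both advance
            )
    return cost[n][m]


# Hypothetical series: `b` is `a` with its first value repeated,
# `c` is `a` shifted by one step.
a = [0, 1, 2, 1, 0]
b = [0, 0, 1, 2, 1, 0]
c = [1, 2, 1, 0, 0]
```

Unlike the Euclidean distance, DTW accepts series of different lengths and tolerates local shifts along the time axis.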

Finally, the TGMD metric space of the Synthetic Controls data set is visualized using UMAP (Uniform Manifold Approximation and Projection). UMAP is prior art and is not described in detail herein. Empirically, d = 15 and τ = 3 were set as the delay embedding parameters, and the effects of different adjustment coefficients k on the results were compared.
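As a minimal sketch, the delay embedding with dimension d and delay τ may be written as follows. The function name and the sine-wave test series are hypothetical, and the series length of 60 is chosen only for the example.

```python
import math


def delay_embedding(series, d, tau):
    """Map a time series to points v_i = (x_i, x_{i+tau}, ..., x_{i+(d-1)tau})."""
    n_points = len(series) - (d - 1) * tau
    if n_points <= 0:
        raise ValueError("series too short for the chosen d and tau")
    return [
        tuple(series[i + j * tau] for j in range(d)) for i in range(n_points)
    ]


# Hypothetical periodic series of length 60.
series = [math.sin(2 * math.pi * t / 12) for t in range(60)]
cloud = delay_embedding(series, d=15, tau=3)
```

A series of length T yields T - (d - 1)τ points, so large values of d and τ leave few points for short series.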

Cluster analysis of synthetic single-cell periodic gene expression data

In this section, the clustering results on the synthetic single-cell periodic gene expression data are analyzed. Figs. 18 and 21 show examples of non-oscillatory and oscillatory time series data, respectively, sampled every 5 minutes. Figs. 19 and 22 show the two time series in phase space. For the oscillatory time series, Fig. 22 reveals a clear topological signature (a ring structure), whereas no topological feature appears in Fig. 19. Topological features are then extracted using persistent homology: Figs. 20 and 23 show the persistence diagrams, which are topological summaries of the point clouds in phase space. In Fig. 20, most points in the persistence diagram lie close to the diagonal and can be interpreted as topological noise. In contrast, Fig. 23 clearly contains a point far from the diagonal, i.e., a persistent topological signal. This example shows how a persistence diagram distinguishes non-oscillatory data (non-periodic expression) from oscillatory data (periodic expression).
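By way of illustration, the Sliced Wasserstein distance between two persistence diagrams may be approximated by averaging one-dimensional transport costs over a finite set of projection directions, in the manner commonly described in the literature. The number of directions and the toy diagrams are assumptions of this sketch.

```python
import math


def sliced_wasserstein(dg1, dg2, n_directions=50):
    """Approximate Sliced Wasserstein distance between persistence diagrams.

    Each diagram is a list of (birth, death) pairs.
    """
    # Orthogonal projection of every point onto the diagonal.
    diag1 = [((b + d) / 2, (b + d) / 2) for b, d in dg1]
    diag2 = [((b + d) / 2, (b + d) / 2) for b, d in dg2]
    # Augment each diagram with the diagonal projections of the other one,
    # so that both point sets have the same size.
    aug1 = list(dg1) + diag2
    aug2 = list(dg2) + diag1
    total = 0.0
    for s in range(n_directions):
        theta = -math.pi / 2 + s * math.pi / n_directions
        c, si = math.cos(theta), math.sin(theta)
        # Project onto the line spanned by (cos theta, sin theta) and sort;
        # matching sorted values gives the 1-D optimal transport cost.
        proj1 = sorted(x * c + y * si for x, y in aug1)
        proj2 = sorted(x * c + y * si for x, y in aug2)
        total += sum(abs(p - q) for p, q in zip(proj1, proj2))
    return total / n_directions


# Hypothetical diagrams: points near the diagonal behave like noise,
# a point far from the diagonal is a persistent signal.
noise_a = [(0.0, 0.1)]
noise_b = [(0.0, 0.2)]
signal = [(0.0, 2.0)]
```

In this sketch two noise-like diagrams lie close to each other, whereas a diagram with a persistent point lies far from both, which is the behaviour the topology-only metric relies on.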

Next, cluster analysis of the synthetic single-cell data was performed using TGMD, and the effects of different embedding dimensions m were compared. The silhouette coefficient (SIL) and the adjusted Rand index (ARI) were used as evaluation indices. As shown in Fig. 24, the silhouette coefficient and the adjusted Rand index behave qualitatively similarly as functions of m, and their minima and maxima coincide. Note that as the measurement interval v increases, the time series become shorter and clustering becomes increasingly difficult. The experimental results in Fig. 24 show that, as v increases, better clustering performance is obtained with larger m, since higher-dimensional information is needed to describe the data correctly.
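For reference, the adjusted Rand index may be computed from the contingency table of two labelings as sketched below, using the standard pair-counting definition. The example labelings are hypothetical.

```python
from collections import Counter
from math import comb


def adjusted_rand_index(labels_true, labels_pred):
    """Adjusted Rand index between two labelings of the same samples."""
    n = len(labels_true)
    contingency = Counter(zip(labels_true, labels_pred))
    # Pairs of samples that share a cluster in both labelings.
    sum_cells = sum(comb(c, 2) for c in contingency.values())
    # Pairs that share a cluster in each labeling taken separately.
    sum_rows = sum(comb(c, 2) for c in Counter(labels_true).values())
    sum_cols = sum(comb(c, 2) for c in Counter(labels_pred).values())
    expected = sum_rows * sum_cols / comb(n, 2)
    max_index = (sum_rows + sum_cols) / 2
    if max_index == expected:
        return 1.0  # degenerate case, e.g. both labelings are one cluster
    return (sum_cells - expected) / (max_index - expected)


# Hypothetical labelings: `same` matches `truth` up to label renaming,
# `crossed` splits every true cluster in half.
truth = [0, 0, 1, 1]
same = [1, 1, 0, 0]
crossed = [0, 1, 0, 1]
```

The index is invariant to renaming of cluster labels and is close to 0 for random assignments, which is why it is used alongside the silhouette coefficient when ground-truth labels are available.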

To demonstrate that TGMD is more robust than the topology-only (TS) or geometry-only (DTW) methods when noise corrupts the time series, the performance of these methods on noisy time series data was compared. Gaussian noise was added to the data with a measurement interval of v = 10 minutes. Fig. 25 depicts the mean clustering results for data at different noise levels and shows that, over the whole tested noise range (from 0.1 to 0.7), TGMD is superior to the methods using only topology or only geometry.

Cluster analysis of real time series data

To further evaluate the ability of the TGMD method to classify different real data sets, 11 data sets were selected from the UCR time series database in this section. Different embedding dimensions m ∈ {2, 3, ..., 20} and different delays τ ∈ {1, 2, 3, 4} were considered, and the optimal values were sought. To this end, the adjustment coefficient of the adjustment function was set to k = 1, and the geometric properties of the original time series were captured using the Euclidean distance or the DTW distance. Further, the proposed clustering method was compared with other clustering methods for time series data. The test results are shown in Table 1, where the best and second-best methods (in terms of accuracy) are highlighted in red and orange, respectively, and TGMD is the method proposed in this study. The results in Table 1 show that the two purely geometric methods differ little from each other, and that TGMD gives better results than methods that consider only the geometry of the original time series. Of the 11 data sets considered, the proposed method achieves the best result on 8 and the second-best result on the remaining 3.

Table 1. Clustering results on the UCR time series data sets

Visual analysis of high dimensional metric space

The accuracy of the model can be improved to a certain extent by adjusting the hyper-parameters of the model. The embodiment analyzes the influence of the adjustment parameter k on the clustering performance of the TGMD on the Synthetic Controls data set, and then visualizes the TGMD metric space by using the UMAP.

The Synthetic Controls data set is used for monitoring the behavior of a system and contains 600 time series, each of length 60. There are six types of patterns in this data set: normal, cyclic, upward shift, downward shift, upward trend, and downward trend. All patterns except the normal one indicate that the monitored process is not operating properly and needs to be adjusted.

When k = 0, f(x) is a constant (see Fig. 26), and TGMD reflects only the geometric characteristics of the original time series. In this case it is difficult to distinguish upward trends from upward shifts, two patterns with different temporal dynamics. As k increases, TGMD begins to reflect the topological properties of the time series. For k = 1, TGMD can distinguish upward trends from upward shifts. As k increases further, only the cyclic pattern, which has unique topological properties, remains clearly distinguished. Overall, the visualization of the high-dimensional metric space and the evaluation of clustering accuracy confirm that the intrinsic characteristics of temporal data can be better understood by properly considering both the topological and the geometric characteristics of the time series.
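The role of the adjustment coefficient k can be illustrated with the product form TGMD(T1, T2) = f(TS′(T1, T2)) × Geo(T1, T2). The formula of f(x) is not reproduced in this text, so the sketch below substitutes a hypothetical placeholder, f(x) = exp(k·x), chosen only because it is exponential and constant for k = 0; it should not be read as the adjustment function of the invention. The numeric inputs are likewise hypothetical.

```python
import math


def make_adjustment(k):
    """HYPOTHETICAL placeholder f(x) = exp(k * x); not the patented formula."""
    if k < 0:
        raise ValueError("k must be >= 0")
    return lambda x: math.exp(k * x)


def tgmd(ts_norm, geo, adjust):
    """Mixed distance: adjustment of the topological term times the geometric term."""
    return adjust(ts_norm) * geo


# Hypothetical inputs: two pairs of series with the same geometric distance
# but different (normalised) topological distances.
geo = 2.0
similar_topology, different_topology = 0.1, 0.9

geometry_only = make_adjustment(0)  # k = 0: f is constant
mixed = make_adjustment(1)          # k = 1: topology modulates the distance
```

With k = 0 both pairs receive the same mixed distance, so only geometry matters; with k = 1 the pair whose topology differs more is pushed further apart, which mirrors the qualitative behaviour described above.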

The invention proposes and analyzes in detail a TGMD time series clustering framework. The original time series are first projected into the phase space using delay embedding, and the topological features of the resulting point cloud are then extracted using persistent homology. Subsequently, a hybrid distance metric based on an adjustment function is proposed, which both extracts global dynamic features and describes the local structure of the time series. Extensive experiments on time series from different fields show that the method is superior to topology-only or geometry-only methods and is robust to noise. Compared with other standard time series clustering methods on real data sets, the method obtains competitive results, and its effectiveness is further verified through visual analysis. The invention was also tested on spatio-temporal data, and the experimental results show that it captures the geometric and topological characteristics of the data simultaneously and that the clustering results reflect human activity and the internal structure of the city. The proposed clustering framework has been applied to time series data from different domains, including biological protein expression data, electrocardiogram (ECG) data, image contour data, and taxi time series data. Since the geometric and topological properties of time series reveal their intrinsic features, the present invention is applicable to cluster analysis of a wide variety of temporal data.

The invention has the following beneficial effects:

1. A topological-geometric mixed distance measure for time series cluster analysis is provided, which combines the local geometric features and the global topological features of time series. The topological approach qualitatively describes the global features of the time series, while the geometric approach describes the local quantitative differences of the original time series.

2. The hybrid distance metric proposed by the present invention is based on an adjustment function that adjusts the overall similarity according to the proximity of topological properties.

3. Experimental results show that the method is superior to a pure topological method or a pure geometric method in identifying oscillatory activity in noisy biological data by clustering, and is more robust to noise. In addition, compared with other standard time series clustering methods, the method obtains competitive results on real data sets. Visualization of the TGMD metric space also demonstrates its effectiveness. The method was further tested on spatio-temporal data. The experimental results show that the proposed algorithm captures the geometric and topological features of the data simultaneously, and that the clustering results reflect human activity behaviors and the inherent structure of cities.

The above embodiment is an embodiment of the present invention, but the embodiment of the present invention is not limited by the above embodiment, and any other changes, modifications, substitutions, combinations, and simplifications which do not depart from the spirit and principle of the present invention should be regarded as equivalent replacements within the protection scope of the present invention.

Claims

1. The method for clustering and analyzing the temporal data intrinsic mode of the spatiotemporal big data is characterized by comprising the following steps of:

2. The spatio-temporal big data temporal implication mode cluster analysis method according to claim 1, characterized in that, for a time series $f = \{x_t \mid t = 1, 2, \ldots, T\}$, where T is the length of the time series, the time series data are embedded into a point cloud space $V = \{v_1, \ldots, v_i, \ldots\}$ by delay embedding, each point in the point cloud space being represented as $v_i = (x_i, x_{i+\tau}, \ldots, x_{i+(d-1)\tau})$, wherein d is the dimension of the embedded point cloud space and τ is the delay coefficient; d and τ are regarded as hyper-parameters and are selected according to the silhouette (contour) coefficient.

3. The spatio-temporal big data temporal implication mode cluster analysis method according to claim 1, characterized in that the geometric similarity of the time series is computed using the Euclidean distance (ED) and DTW, where ED is computed as follows:

the DTW is calculated as follows:

for two time series $T_1$ and $T_2$,

wherein the mapping $r \in M$,

$|r| = \sum_{i=1,\ldots,m} |u_{a_i} - v_{b_i}|$,

4. The spatio-temporal big data temporal implication mode cluster analysis method according to claim 1, characterized in that, given $\theta \in \mathbb{R}^2$ with $\|\theta\|_2 = 1$, $L(\theta)$ denotes the straight line $\{\lambda\theta \mid \lambda \in \mathbb{R}\}$ and $\pi_\theta: \mathbb{R}^2 \to L(\theta)$ denotes the orthogonal projection onto $L(\theta)$, and the Sliced Wasserstein distance is defined as follows:

wherein $Dg_1$ and $Dg_2$ denote two persistence diagrams, each persistence diagram being a finite multiset of points in $\mathbb{R}^2$ taken in union with the diagonal $\Delta$,

and $\pi_\Delta$ is the orthogonal projection onto the diagonal.

5. The spatiotemporal big data temporal implication mode cluster analysis method according to claim 1, characterized in that the topological-geometric mixture distance is calculated as follows:

$\mathrm{TGMD}(T_1, T_2) = f(\mathrm{TS}'(T_1, T_2)) \times \mathrm{Geo}(T_1, T_2)$

6. The spatio-temporal big data temporal implication mode cluster analysis method according to claim 1, characterized in that the adjusting function is an exponential function, ensuring that the adjusting effect of extremum and its nearest neighbors is almost equal, and the adjusting function f (x) is as follows:

wherein k is the adjustment coefficient, and k ≥ 0.

7. The spatiotemporal big data temporal implication mode cluster analysis method according to claim 1, characterized in that the clustering algorithm is K-medoids.

8. The spatio-temporal big data temporal implication mode cluster analysis method according to claim 1, characterized in that the evaluation indices are the silhouette (contour) coefficient and the adjusted Rand index,

the silhouette coefficient s for a single sample is:

the adjusted Rand index ARI is as follows: