CN113780451A - Temporal data implication mode clustering analysis method of temporal-spatial big data - Google Patents
- Publication number
- CN113780451A (application CN202111088489.9A)
- Authority
- CN
- China
- Prior art keywords
- data
- distance
- topological
- temporal
- time
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/21—Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
- G06F18/213—Feature extraction, e.g. by transforming the feature space; Summarisation; Mappings, e.g. subspace methods
- G06F18/2135—Feature extraction, e.g. by transforming the feature space; Summarisation; Mappings, e.g. subspace methods based on approximation criteria, e.g. principal component analysis
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/23—Clustering techniques
- G06F18/232—Non-hierarchical techniques
- G06F18/2321—Non-hierarchical techniques using statistics or function optimisation, e.g. modelling of probability density functions
- G06F18/23213—Non-hierarchical techniques using statistics or function optimisation, e.g. modelling of probability density functions with fixed number of clusters, e.g. K-means clustering
Landscapes
- Engineering & Computer Science (AREA)
- Data Mining & Analysis (AREA)
- Theoretical Computer Science (AREA)
- Physics & Mathematics (AREA)
- Computer Vision & Pattern Recognition (AREA)
- Life Sciences & Earth Sciences (AREA)
- Bioinformatics & Computational Biology (AREA)
- Evolutionary Biology (AREA)
- Evolutionary Computation (AREA)
- Bioinformatics & Cheminformatics (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Artificial Intelligence (AREA)
- Probability & Statistics with Applications (AREA)
- Image Analysis (AREA)
Abstract
The invention discloses a method for cluster analysis of the intrinsic patterns of temporal data in spatio-temporal big data, comprising the following steps: acquiring point cloud data and obtaining a persistence diagram by persistent homology; calculating the Sliced Wasserstein (SW) distance between persistence diagrams; calculating the topological-geometric mixed distance; optimizing the clusters with a clustering algorithm that minimizes the distance between the center of each cluster and the data points within it; repeating the clustering operation, using the silhouette coefficient to select an appropriate number of clusters; and analyzing the clustering result with evaluation indexes and visualizing the topological-geometric mixed distance metric space of the comprehensive control data set with UMAP. The extracted structure reflects the overall regularity of each urban road network and reveals latent substructures; analyzing the latent structure of the data together with each city's total GDP shows that the structural division is related to the city's economic level and geographic location.
Description
Technical Field
The invention belongs to the technical field of mapping, and particularly relates to a method for cluster analysis of the intrinsic patterns of temporal data in spatio-temporal big data.
Background
In recent years, with the development of information and communication technology and the spread of sensors and positioning technologies, a large amount of spatio-temporal big data has been generated that carries space-time stamps and describes individual behavior, including mobile phone positioning data, taxi data, shared bicycle data, bus smart card data, social network data and video big data. These data offer great opportunities for analyzing the dynamics of urban structure and the spatio-temporal laws of human activity, and for understanding the social and economic environment quantitatively. The rise of big data with geographic attributes also places new demands on the ability to mine spatio-temporal big data. Within the framework of time series analysis, time series clustering is an important method for understanding the intrinsic characteristics of temporal data. In essence, time series clustering is the process of grouping time series according to a given similarity measure, and clustering performance depends critically on how that similarity is quantified.
The classical similarity measures are geometric: they focus mainly on local relations at given moments in the original time series. They include Dynamic Time Warping (DTW), the Euclidean Distance (ED) and the Longest Common Subsequence (LCS). Here the information contained in the original time series is represented by its geometry. These methods have given satisfactory results on many types of data. Although they can detect similarities in time and shape and describe local geometric differences, they generally ignore the dynamics of the time series from a global perspective. Furthermore, because DTW and ED take all time points into account, they are usually very sensitive to outliers and noise.
More recently, another class of methods has used persistent homology to describe the topological properties of time series, which reflect their global dynamics. The basic idea is to construct a point cloud in phase space by time-delay embedding of the original time series, and then to extract the topological features of the point cloud, such as connected components, loops, three-dimensional cavities and their higher-dimensional analogues, by topological data analysis. Topological data analysis is a computational approach that applies topological theory to high-dimensional, complex data. The features extracted by persistent homology are robust: perturbing the data causes only a small change in the output of the topological analysis. However, these topology-based global methods extract only global information from the time series and cannot account for local quantitative differences.
Disclosure of Invention
The invention provides a time series clustering method based on the Topological-Geometric Mixed Distance (TGMD). Topological features are extracted, using a persistence diagram, from the point cloud obtained by delay embedding of the time series; geometric features are the local relations at given moments in the original time series. To characterize and quantify similarity with these two properties, the Sliced Wasserstein distance is used as the topological similarity measure, and ED and DTW are used as geometric similarity measures. On this basis, a mixed distance measure based on an adjustment function is proposed, which adjusts the evaluation of geometric proximity according to topological proximity. The proposed measure is then applied to k-medoids clustering of several sets of time series data, and the results demonstrate the effectiveness of the method.
In view of this, the temporal data intrinsic pattern cluster analysis method for spatio-temporal big data provided by the invention, shown schematically in fig. 1, comprises three parts: topological similarity measurement, geometric similarity measurement, and time series clustering based on a mixed distance measure. The method specifically comprises the following steps:
acquiring point cloud data by delay embedding of the time series to obtain a point cloud in phase space, applying principal component analysis to the phase space to reduce topological noise, and obtaining a persistence diagram by persistent homology;
calculating the Sliced Wasserstein distance between persistence diagrams to obtain a global topological similarity measure in phase space, and calculating the geometric similarity of the time series in the original space;
calculating the topological-geometric mixed distance, wherein an adjustment function is used to establish a new distance, the adjustment function being chosen so that the topological similarity becomes an adjustment factor of the geometric distance;
optimizing the clusters with a clustering algorithm that minimizes the distance between the center of each cluster and the data points in the cluster, generating spherical clusters of similar size;
repeating the clustering operation, using the silhouette coefficient to select an appropriate number of clusters;
and analyzing the clustering result with evaluation indexes, and visualizing the topological-geometric mixed distance metric space of the comprehensive control data set with UMAP.
Further, for a time series $f = \{x_t \mid t = 1, 2, \dots, T\}$, delay embedding maps the time series data into a point cloud space $V = \{v_1, \dots, v_t, \dots, v_N\}$ whose points are $v_i = (x_i, x_{i+\tau}, \dots, x_{i+(d-1)\tau})$, where $d$ is the dimension of the embedded point cloud space, $\tau$ is the delay coefficient and $N$ is the number of points in the point cloud; $d$ and $\tau$ are treated as hyper-parameters and are selected according to the silhouette coefficient.
Further, the geometric similarity of the time series is calculated with the Euclidean distance (ED) and DTW, where ED is calculated as follows:
given two time series $T_1 = (u_1, \dots, u_p)$ and $T_2 = (v_1, \dots, v_p)$, the Euclidean distance $\delta_E$ between $T_1$ and $T_2$ is defined as:
$$\delta_E(T_1, T_2) = \sqrt{\sum_{i=1}^{p} (u_i - v_i)^2}$$
where $u_i$ is a point of the time series $T_1$, $v_i$ is a point of the time series $T_2$, and $p$ is the number of points in each time series;
DTW is calculated as follows:
for two time series $T_1$ and $T_2$,
$$\delta_{DTW}(T_1, T_2) = \min_{r \in M} |r|$$
where the mapping $r$ belongs to $M$,
$$|r| = \sum_{i=1}^{m} |u_{a_i} - v_{b_i}|,$$
and $u_{a_i}$ is the point of $T_1$ and $v_{b_i}$ the point of $T_2$ matched by the $i$-th pair of the mapping.
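Further, given $\theta \in \mathbb{R}^2$ with $\|\theta\|_2 = 1$, let $L(\theta)$ denote the line $\{\lambda\theta \mid \lambda \in \mathbb{R}\}$ and let $\pi_\theta : \mathbb{R}^2 \to L(\theta)$ denote the orthogonal projection onto $L(\theta)$. The Sliced Wasserstein distance is defined as follows:
$$SW(Dg_1, Dg_2) = \frac{1}{2\pi} \int_{S^1} W\left(\mu_1^\theta + \mu_{2\Delta}^\theta,\ \mu_2^\theta + \mu_{1\Delta}^\theta\right) d\theta$$
where $Dg_1$ and $Dg_2$ are two persistence diagrams, $\mu_i^\theta = \sum_{p \in Dg_i} \delta_{\pi_\theta(p)}$ and $\mu_{i\Delta}^\theta = \sum_{p \in Dg_i} \delta_{\pi_\theta \circ \pi_\Delta(p)}$ for $i = 1, 2$, $\pi_\Delta$ is the orthogonal projection onto the diagonal, and $W$ is the 1-Wasserstein distance between measures on the real line. A persistence diagram is the union of a finite multiset of points in $\mathbb{R}^2$ with the diagonal $\Delta = \{(x, x) \mid x \in \mathbb{R}\}$.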
Further, the topological-geometric mixed distance is calculated as:
$$\mathrm{TGMD}(T_1, T_2) = f(\mathrm{TS}'(T_1, T_2)) \times \mathrm{Geo}(T_1, T_2)$$
where $f(x)$ is a monotonically increasing adjustment function, $\mathrm{TS}'$ is the normalized topological similarity obtained by rescaling the topological similarity $\mathrm{TS}$, $T_1$ and $T_2$ are time series, and $\mathrm{Geo}$ is the geometric distance between $T_1$ and $T_2$.
Further, the adjustment function $f(x)$ is an exponential function, chosen so that the adjustment effect at the extrema and at their nearest neighbors is almost equal; it is parameterized by an adjustment coefficient $k$, with $k \ge 0$.
Further, the clustering algorithm is K-medoids.
Furthermore, the evaluation indexes are the silhouette coefficient and the adjusted Rand index.
The silhouette coefficient $s$ of a single sample is:
$$s = \frac{b - a}{\max(a, b)}$$
where $a$ is the average distance between the sample and all other points in the same cluster, and $b$ is the average distance between the sample and all points in the nearest other cluster;
the adjusted Rand index (ARI) is as follows:
$$RI = \frac{TP + TN}{TP + FP + FN + TN}, \qquad ARI = \frac{RI - E[RI]}{\max(RI) - E[RI]}$$
where TP is the number of time series pairs that belong to the same class and are assigned to the same cluster, TN is the number of pairs that belong to different classes and are assigned to different clusters, FP is the number of pairs that belong to different classes but are assigned to the same cluster, FN is the number of pairs that belong to the same class but are assigned to different clusters, and RI is the proportion of correct decisions.
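As an illustration of the two evaluation indexes, the following minimal Python sketch computes the silhouette coefficient of one sample from a precomputed distance matrix, and the Rand index and adjusted Rand index from the pair counts TP, TN, FP and FN. The function names are hypothetical, and the pair-count form of the ARI is the standard textbook formula rather than something quoted from this text.

```python
from itertools import combinations
import numpy as np

def silhouette_sample(dist, labels, i):
    """Silhouette s = (b - a) / max(a, b) for sample i (needs >= 2 clusters)."""
    dist = np.asarray(dist, dtype=float)
    labels = np.asarray(labels)
    same = labels == labels[i]
    same[i] = False  # 'a' excludes the sample itself
    if not same.any():
        return 0.0  # common convention for a singleton cluster
    a = dist[i, same].mean()
    # 'b' is the smallest mean distance to the members of any other cluster
    b = min(dist[i, labels == c].mean() for c in np.unique(labels) if c != labels[i])
    return float((b - a) / max(a, b)) if max(a, b) > 0 else 0.0

def pair_counts(true_labels, pred_labels):
    """Count TP, TN, FP, FN over all pairs of samples."""
    tp = tn = fp = fn = 0
    for i, j in combinations(range(len(true_labels)), 2):
        same_class = true_labels[i] == true_labels[j]
        same_cluster = pred_labels[i] == pred_labels[j]
        if same_class and same_cluster:
            tp += 1
        elif not same_class and not same_cluster:
            tn += 1
        elif same_cluster:
            fp += 1
        else:
            fn += 1
    return tp, tn, fp, fn

def rand_index(true_labels, pred_labels):
    tp, tn, fp, fn = pair_counts(true_labels, pred_labels)
    return (tp + tn) / (tp + tn + fp + fn)

def adjusted_rand_index(true_labels, pred_labels):
    tp, tn, fp, fn = pair_counts(true_labels, pred_labels)
    total = tp + tn + fp + fn
    expected = (tp + fn) * (tp + fp) / total   # chance-level agreement
    max_index = 0.5 * ((tp + fn) + (tp + fp))  # best achievable agreement
    if max_index == expected:
        return 1.0
    return (tp - expected) / (max_index - expected)
```

The silhouette function needs only pairwise distances, so it can be applied directly to a TGMD distance matrix.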
The invention has the following beneficial effects:
1) the invention describes both the global qualitative characteristics and the local quantitative differences of a time series;
2) the mixed distance measure proposed by the invention is based on an adjustment function that adjusts the overall similarity according to the proximity of topological properties;
3) experimental results show that the method outperforms purely geometric and purely topological methods in identifying oscillatory activity in noisy biological data by clustering, and is more robust to noise. Compared with other standard time series clustering methods, it achieves competitive results on real data sets.
Drawings
FIG. 1 illustrates the Betti numbers corresponding to different object topologies;
FIG. 2 shows time series S1-S4 in the original space;
FIG. 3 shows time series S1-S4 in the phase space;
FIG. 4 is a plot of the adjustment function f(x);
FIG. 5 is the sum of the time series of all regions;
FIG. 6 is an example of non-periodic traffic time series in different regions;
FIG. 7 is a visualization of the time series data of FIG. 6 after delay embedding and projection by PCA;
FIG. 8 is the one-dimensional persistence diagram of FIG. 6;
FIG. 9 is an example of periodic traffic time series in different regions;
FIG. 10 is a visualization of the time series data of FIG. 9 after delay embedding and projection by PCA;
FIG. 11 is the one-dimensional persistence diagram of FIG. 9;
FIG. 12 is a plot of silhouette coefficients for different numbers of clusters k;
FIG. 13 is a spatial distribution of clustering results;
FIG. 14 is the pick-up and drop-off time series of 4 typical regions in category 1;
FIG. 15 is a spatial distribution of clustering results;
FIG. 16 is the pick-up and drop-off time series of 2 typical regions in category 2;
FIG. 17 is the pick-up and drop-off time series of another 2 typical regions in category 2;
FIG. 18 is an example of non-oscillatory time series data sampled every 5 minutes;
FIG. 19 shows the time series of FIG. 18 in the phase space;
FIG. 20 is the persistence diagram of FIG. 18;
FIG. 21 is an example of oscillatory time series data sampled every 5 minutes;
FIG. 22 shows the time series of FIG. 21 in the phase space;
FIG. 23 is the persistence diagram of FIG. 21;
FIG. 24 shows the clustering results for synthetic single-cell data with different sampling intervals;
FIG. 25 shows the clustering results for data containing different levels of noise;
FIG. 26 shows the cluster analysis results of the adjustment function for different values of k.
Detailed Description
The invention is further described with reference to the accompanying drawings, but the invention is not limited in any way, and any alterations or substitutions based on the teaching of the invention are within the scope of the invention.
The invention discloses a method for cluster analysis of the intrinsic patterns of temporal data in spatio-temporal big data, comprising the following steps:
S1: acquiring point cloud data by delay embedding of the time series to obtain a point cloud in phase space, applying principal component analysis to the phase space to reduce topological noise, and obtaining a persistence diagram by persistent homology;
S2: calculating the Sliced Wasserstein distance between persistence diagrams to obtain a global topological similarity measure in phase space, and calculating the geometric similarity of the time series in the original space;
S3: calculating the topological-geometric mixed distance, wherein an adjustment function is used to establish a new distance, the adjustment function being chosen so that the topological similarity becomes an adjustment factor of the geometric distance;
S4: optimizing the clusters with a clustering algorithm that minimizes the distance between the center of each cluster and the data points in the cluster, generating spherical clusters of similar size;
S5: repeating the clustering operation, using the silhouette coefficient to select an appropriate number of clusters;
S6: analyzing the clustering result with evaluation indexes, and visualizing the topological-geometric mixed distance metric space of the comprehensive control data set with UMAP.
Extracting global topological properties in the latent phase space of a time series
Given a one-dimensional signal $f$ of length $T$, written $\{x_t\}$ with $t = 1, 2, \dots, T$, $F = \{f_1, \dots, f_t, \dots, f_T\}$ denotes the sequence of real numbers collected on a time scale, where $f_t$ is the value at time $t$.
Delay embedding of a time series makes its characteristic topological features easier to understand in phase space. For a time series $f = \{x_t \mid t = 1, 2, \dots, T\}$, delay embedding maps the time series data into a point cloud space $V = \{v_1, \dots, v_t, \dots, v_N\}$ whose points are $v_i = (x_i, x_{i+\tau}, \dots, x_{i+(d-1)\tau})$, where $d$ is the dimension of the embedded point cloud space and $\tau$ is the delay coefficient. The number of points $N$ in the point cloud space depends on the choice of $d$ and $\tau$, which are the two important parameters. In practice, because real time series are noisy and of finite length, there is no known optimal method for determining $d$ and $\tau$, and most methods for selecting $\tau$ rest on heuristics rather than strict mathematical criteria. The invention treats $d$ and $\tau$ as hyper-parameters and selects them according to the silhouette coefficient. The silhouette coefficient is described in the non-patent literature Peter J. Rousseeuw. Silhouettes: A graphical aid to the interpretation and validation of cluster analysis [J]. Journal of Computational and Applied Mathematics, 1987, 20: 53-65, and is not described further here.
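The delay embedding described above can be sketched in a few lines of Python. This is a minimal illustration under the stated definition of the embedded points; the function name `delay_embed` and the example values of d and τ are assumptions for demonstration only.

```python
import numpy as np

def delay_embed(x, d, tau):
    """Return the delay-embedded point cloud of a 1-D series.

    Row i is (x[i], x[i + tau], ..., x[i + (d - 1) * tau]).
    """
    x = np.asarray(x, dtype=float)
    n_points = len(x) - (d - 1) * tau  # number of embedded points
    if n_points <= 0:
        raise ValueError("series too short for this choice of d and tau")
    # Column j holds the series shifted by j * tau samples.
    return np.column_stack([x[j * tau : j * tau + n_points] for j in range(d)])

# Example: embed one period of a sine wave with d = 3 and tau = 5.
t = np.linspace(0.0, 1.0, 100, endpoint=False)
cloud = delay_embed(np.sin(2 * np.pi * t), d=3, tau=5)
print(cloud.shape)  # (90, 3)
```

For a series of length T this construction yields T - (d - 1)τ points, which is why the size of the point cloud depends on both hyper-parameters.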
S1(t)=sin(2πt) (1)
S2(t)=sin(4πt) (2)
In phase space, the relevant topological features of a time series are mainly 0-dimensional connected components and 1-dimensional loops. Consider the examples in fig. 2 and fig. 3: fig. 2 shows the time series S1-S4 in the original space, and fig. 3 shows them in the phase space. S1-S3 are periodic time series, see equations (1)-(3), while S4 is an aperiodic time series given by equation (4). S1-S3 all have the same topology in the embedding space, namely a one-dimensional loop, and therefore cannot be distinguished from one another by their global topology. Their mutual DTW distances, however, show that S1 and S3 are closer to each other than S1 and S2, because of local differences. Because the frequency of S2 is higher than that of S1 and S3, comparing DTW distances also places S1 closer to S4 than to S2, whereas the topological characteristics of the data show that S1 and S2 both differ from S4. Considering both kinds of property therefore yields more information than focusing only on the local or only on the global behavior of the time series.
After the time series is embedded into the phase space, PCA (principal component analysis) is applied to the embedded point cloud to reduce topological noise. The topology is then described by persistence diagrams.
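A minimal sketch of the PCA step, applied to an embedded point cloud and implemented with a singular value decomposition in NumPy. The function name `pca_project` and the default of two retained components are assumptions; the text does not state how many components are kept.

```python
import numpy as np

def pca_project(points, n_components=2):
    """Project a point cloud onto its leading principal components."""
    points = np.asarray(points, dtype=float)
    centered = points - points.mean(axis=0)
    # Rows of vt are the principal directions, ordered by decreasing variance.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return centered @ vt[:n_components].T
```

Discarding the low-variance directions removes small fluctuations in the point cloud before the persistence diagram is computed, which is the sense in which this step reduces topological noise.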
Persistent homology is an algebraic method for computing the topological features of a space at different spatial resolutions. Features that persist over a wide range of spatial scales are classified as topological signal, while transient features are regarded as noise.
The persistence diagram is a space onto which the persistence barcode can be projected so that its geometric properties can be studied; it summarizes the topological information produced by persistent homology. A persistence diagram is the union of a finite multiset of points in $\mathbb{R}^2$ with the diagonal $\Delta = \{(x, x) \mid x \in \mathbb{R}\}$, where each point of $\Delta$ has infinite multiplicity, i.e., it appears infinitely many times in the multiset. Differences between persistence diagrams are compared through bijections, so the diagrams must have the same cardinality, which is why the diagonal is included. In some cases the diagonal also simplifies the comparison between diagrams: points near the diagonal correspond to transient topological features, which are likely caused by small perturbations in the data and can be regarded as noise. A persistence barcode is converted to a persistence diagram by taking the union of the multiset of (birth, death) pairs with the diagonal $\Delta$. Persistent homology and persistence diagrams are prior art and are not described further here.
The present embodiment uses the Sliced Wasserstein distance as the topological similarity measure. The Sliced Wasserstein distance is not only provably stable but also discriminative (its bounds depend on the number of points in the persistence diagram). The basic idea of this metric is to slice the plane with lines through the origin, project the measures onto these lines, where the 1-Wasserstein distance $W$ is computed, and integrate these distances over all possible lines.
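Given $\theta \in \mathbb{R}^2$ with $\|\theta\|_2 = 1$, let $L(\theta)$ denote the line $\{\lambda\theta \mid \lambda \in \mathbb{R}\}$ and let $\pi_\theta : \mathbb{R}^2 \to L(\theta)$ denote the orthogonal projection onto $L(\theta)$. Let $Dg_1$ and $Dg_2$ be two persistence diagrams, and let $\mu_1^\theta = \sum_{p \in Dg_1} \delta_{\pi_\theta(p)}$ and $\mu_{1\Delta}^\theta = \sum_{p \in Dg_1} \delta_{\pi_\theta \circ \pi_\Delta(p)}$; $\mu_2^\theta$ and $\mu_{2\Delta}^\theta$ are defined in the same way for $Dg_2$, where $\pi_\Delta$ is the orthogonal projection onto the diagonal. The Sliced Wasserstein distance is then defined as follows:
$$SW(Dg_1, Dg_2) = \frac{1}{2\pi} \int_{S^1} W\left(\mu_1^\theta + \mu_{2\Delta}^\theta,\ \mu_2^\theta + \mu_{1\Delta}^\theta\right) d\theta$$
Let $\mu$ and $\nu$ be two non-negative measures on the real line such that $|\mu| = \mu(\mathbb{R})$ and $|\nu| = \nu(\mathbb{R})$ are equal to the same number $r$. The 1-Wasserstein distance $W$ is then defined as:
$$W(\mu, \nu) = \inf_{P \in \Pi(\mu, \nu)} \iint_{\mathbb{R} \times \mathbb{R}} |x - y| \, P(dx, dy)$$
where $\Pi(\mu, \nu)$ is the set of measures on $\mathbb{R} \times \mathbb{R}$ with marginals $\mu$ and $\nu$.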
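The slicing idea can be approximated numerically by sampling a finite number of directions instead of integrating over all lines. The sketch below is a minimal NumPy illustration, assuming each persistence diagram is given as an array of (birth, death) pairs; the function name and the number of directions are assumptions. Each diagram is augmented with the diagonal projections of the other diagram's points, so both point sets have the same size, and the one-dimensional 1-Wasserstein distance then reduces to the L1 distance between sorted projections.

```python
import numpy as np

def sliced_wasserstein(dg1, dg2, n_directions=50):
    """Approximate the Sliced Wasserstein distance between two diagrams."""
    dg1 = np.asarray(dg1, dtype=float).reshape(-1, 2)
    dg2 = np.asarray(dg2, dtype=float).reshape(-1, 2)
    # Orthogonal projection onto the diagonal: (b, d) -> ((b+d)/2, (b+d)/2).
    diag1 = np.repeat(dg1.mean(axis=1, keepdims=True), 2, axis=1)
    diag2 = np.repeat(dg2.mean(axis=1, keepdims=True), 2, axis=1)
    a = np.vstack([dg1, diag2])  # points of diagram 1 plus diagonal copies of 2
    b = np.vstack([dg2, diag1])  # points of diagram 2 plus diagonal copies of 1
    # Directions covering half the circle; the other half gives the same values.
    thetas = np.linspace(-np.pi / 2, np.pi / 2, n_directions, endpoint=False)
    dirs = np.column_stack([np.cos(thetas), np.sin(thetas)])
    proj_a = np.sort(a @ dirs.T, axis=0)
    proj_b = np.sort(b @ dirs.T, axis=0)
    # Per direction: 1-D Wasserstein distance; then average over directions.
    return float(np.abs(proj_a - proj_b).sum(axis=0).mean())
```

Averaging over uniformly spaced directions is a simple Riemann-sum approximation of the integral; more directions give a closer approximation at higher cost.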
Extracting local geometric properties in the original space of a time series
A time series is uniquely represented by its geometry in the original space, and this geometry also carries information. For example, a seismograph signal (amplitude over time) contains information about earthquakes. The invention captures similarity in shape by calculating the distances between local point pairs in the original time series.
The present embodiment uses ED and DTW as geometric similarity measures to quantify the geometric similarity of time series. The Euclidean distance and the DTW distance are the most widely used geometric measures for time series. Let $T_1 = (u_1, \dots, u_p)$ and $T_2 = (v_1, \dots, v_p)$ be two time series. The Euclidean distance $\delta_E$ between $T_1$ and $T_2$ is defined as:
$$\delta_E(T_1, T_2) = \sqrt{\sum_{i=1}^{p} (u_i - v_i)^2}$$
For two time series $T_1$ and $T_2$, define a mapping $r \in M$, where:
$$r = \left((u_{a_1}, v_{b_1}), \dots, (u_{a_m}, v_{b_m})\right), \qquad |r| = \sum_{i=1}^{m} |u_{a_i} - v_{b_i}|$$
with $a_1 = b_1 = 1$, $a_m = b_m = p$, $a_{i+1} \in \{a_i, a_i + 1\}$ and $b_{i+1} \in \{b_i, b_i + 1\}$, so that $M$ is the set of all order-preserving alignments of the two series.
On this basis, the DTW distance is defined as:
$$\delta_{DTW}(T_1, T_2) = \min_{r \in M} |r|$$
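Both geometric measures are straightforward to compute. The following minimal Python sketch implements the Euclidean distance and a basic dynamic-programming DTW with absolute difference as the local cost and no window constraint; the function names are illustrative, and practical implementations often add a warping window for speed.

```python
import numpy as np

def euclidean_distance(u, v):
    """Euclidean distance between two equal-length series."""
    u, v = np.asarray(u, dtype=float), np.asarray(v, dtype=float)
    return float(np.sqrt(np.sum((u - v) ** 2)))

def dtw_distance(u, v):
    """DTW distance: minimum total |u_a - v_b| over monotone alignments."""
    u, v = np.asarray(u, dtype=float), np.asarray(v, dtype=float)
    p, q = len(u), len(v)
    # cost[i, j] = best alignment cost of the first i points of u
    # with the first j points of v.
    cost = np.full((p + 1, q + 1), np.inf)
    cost[0, 0] = 0.0
    for i in range(1, p + 1):
        for j in range(1, q + 1):
            step = abs(u[i - 1] - v[j - 1])
            cost[i, j] = step + min(cost[i - 1, j],      # advance in u only
                                    cost[i, j - 1],      # advance in v only
                                    cost[i - 1, j - 1])  # advance in both
    return float(cost[p, q])

print(euclidean_distance([0, 0], [3, 4]))      # 5.0
print(dtw_distance([1, 2, 3], [1, 2, 2, 3]))   # 0.0
```

The second call shows the warping effect: repeating a value in one series costs nothing under DTW, whereas the Euclidean distance is not even defined for series of different lengths.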
Geometric similarity measures tend to identify local geometric relationships and numerical differences between the original time series samples without taking topological changes in the time series data into account. The present embodiment therefore adopts a new similarity measure that combines topological similarity with the traditional geometric similarity measures, namely the topological-geometric mixed distance (TGMD).
The topological-geometric mixed distance establishes a new distance by means of an adjustment function, chosen so that the topological similarity (TS) becomes an adjustment factor of the geometric distance (Geo). The adjustment function increases the TGMD value if the topological properties of the two time series differ. An exponential function is chosen rather than a linear one because its rate of increase or decrease is lower near the extrema (1 and 0), which ensures that the adjustment effect at the extrema and at their nearest neighbors is nearly equal.
The topological similarity (TS) is first rescaled to obtain the normalized topological similarity (TS'). The monotonically increasing adjustment function $f(x)$ is then applied to TS'; it is parameterized by an adjustment coefficient $k$, where $k \ge 0$.
Combining the above steps, the following TGMD similarity measure is proposed:
$$\mathrm{TGMD}(T_1, T_2) = f(\mathrm{TS}'(T_1, T_2)) \times \mathrm{Geo}(T_1, T_2)$$
The adjustment function adjusts the geometric distance metric according to the topological similarity. When the topological similarity (TS) of two time series approaches 1, $f(x) \ge 1$ for any $k \ge 0$, which increases the TGMD metric. The scaling effect of the topological similarity grows with increasing $k$.
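The combination step can be sketched as follows for precomputed matrices of pairwise topological and geometric distances. Two parts of this sketch are assumptions and are not taken from the text: the rescaling is taken to be min-max normalization to [0, 1], and the adjustment function is taken to be the logistic-type form f(x) = 2 / (1 + exp(-kx)), chosen only because it is monotonically increasing and satisfies f(x) ≥ 1 for x ≥ 0 and k ≥ 0. The function names are hypothetical.

```python
import numpy as np

def adjustment(x, k=2.0):
    # ASSUMED form of the adjustment function (not given in the text):
    # increasing in x, equal to 1 at x = 0, and bounded above by 2.
    return 2.0 / (1.0 + np.exp(-k * x))

def tgmd_matrix(topo, geo, k=2.0):
    """Combine pairwise topological and geometric distances into TGMD."""
    topo = np.asarray(topo, dtype=float)
    geo = np.asarray(geo, dtype=float)
    span = topo.max() - topo.min()
    # ASSUMED rescaling: min-max normalization of TS to [0, 1].
    ts_norm = (topo - topo.min()) / span if span > 0 else np.zeros_like(topo)
    return adjustment(ts_norm, k) * geo
```

With k = 0 the assumed adjustment function is identically 1 and the result reduces to the purely geometric distance; larger k lets topological differences inflate the geometric distance more strongly.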
The choice of clustering algorithm depends on the strategy employed, i.e., maximizing intra-group similarity and minimizing inter-group similarity. The K-medoids algorithm optimizes the clusters by minimizing the distance between the center of each cluster (its medoid) and the data points within the cluster, ultimately generating spherical clusters of similar size. Unlike a K-means centroid, which need not be an actual data point, a medoid is always one of the data points, so the algorithm requires only pairwise distances. In this example, the K-medoids algorithm is used for cluster analysis with the TGMD similarity measure.
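Because K-medoids needs only pairwise distances, it can be run directly on a precomputed TGMD matrix. The sketch below is a minimal alternating-update variant (assign each point to its nearest medoid, then re-pick each medoid as the member with the smallest total distance to its cluster). It is an illustration, not necessarily the exact variant used in the embodiment; the function name, iteration cap and seed are assumptions.

```python
import numpy as np

def k_medoids(dist, k, n_iter=100, seed=0):
    """Cluster items given a precomputed (n x n) distance matrix."""
    dist = np.asarray(dist, dtype=float)
    rng = np.random.default_rng(seed)
    medoids = rng.choice(len(dist), size=k, replace=False)
    for _ in range(n_iter):
        # Assignment step: each point joins its nearest medoid.
        labels = np.argmin(dist[:, medoids], axis=1)
        new_medoids = medoids.copy()
        for c in range(k):
            members = np.flatnonzero(labels == c)
            if members.size:
                # Update step: the member with the smallest total distance
                # to the other members becomes the new medoid.
                within = dist[np.ix_(members, members)].sum(axis=1)
                new_medoids[c] = members[np.argmin(within)]
        if np.array_equal(new_medoids, medoids):
            break  # converged
        medoids = new_medoids
    labels = np.argmin(dist[:, medoids], axis=1)
    return medoids, labels
```

The result depends on the random initialization, so in practice the procedure is usually repeated with several seeds and the run with the lowest total within-cluster distance is kept.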
The data sets used by the invention are synthetic single-cell time series data, the UCR time series archive and taxi time series.
Synthetic single-cell time series data: synthetic single-cell mRNA and protein time series were generated from a computational model of the Hes1 oscillator, a negative autoregulation system with delay. The model-generated data fall into two classes, oscillatory and non-oscillatory (periodic and non-periodic) time series. Data for 200 cells were simulated with the Hes1 model under both the oscillatory and the non-oscillatory parameter regimes, and protein levels were sampled every 5, 10, 15 and 20 minutes from 5000 minutes to 6500 minutes. After the time series were acquired at the different intervals, the data were normalized to zero mean and unit variance, and Gaussian noise was added.
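The normalization and noise step can be illustrated with a short Python sketch. The function name, the noise standard deviation and the seed are assumptions; the text does not state the noise level that was used.

```python
import numpy as np

def normalize_and_add_noise(series, noise_std=0.1, seed=0):
    """Z-normalize a series, then add zero-mean Gaussian noise."""
    series = np.asarray(series, dtype=float)
    z = (series - series.mean()) / series.std()  # zero mean, unit variance
    rng = np.random.default_rng(seed)
    # noise_std is an assumed value chosen for illustration only
    return z + rng.normal(0.0, noise_std, size=z.shape)
```

Setting `noise_std=0` returns the z-normalized series unchanged, which is a convenient way to check the normalization on its own.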
UCR time series profile: the UCR time series archive is an important database of the time series data mining community, and 11 typical data sets are selected from the UCR to evaluate the clustering performance of the algorithm. These data sets contain data of different sizes, lengths, number of classes and signal types.
Taxi time series: this embodiment also uses a time series data set generated from the GPS trajectories of Qiangsheng ("Strong") taxis in Shanghai, China (https://sodachallenges.com/datasets/taxi-GPS/#). Shanghai is the most populous city in China, and its socio-spatial structure has changed greatly since the reform and opening up. Some studies have analyzed the impact of urban form on residents' travel behavior, but they relied on small survey samples that may not be representative of the general population, and their travel variables lacked accuracy. The present embodiment therefore analyzes a large volume of travel data collected from location-aware devices. Many taxi companies have installed GPS receivers in their fleets to monitor the real-time movement of each taxi. In the raw data set, each record consists of 13 fields, including vehicle ID, GPS time, latitude and longitude, speed, number of satellites, operating status, overhead status, and braking status, where the latitude and longitude fields give the geographic location of the taxi.
The study area was divided into 1480 grid cells (pixels) of 1 km × 1 km, similar to Traffic Analysis Zones (TAZ). This scale follows previous studies, which found that a pixel of this size is sufficient to describe the urban structure. In addition, these pixels can serve as a substitute for traffic analysis zones, each representing a relatively homogeneous socio-economic unit. In this embodiment, the taxi trajectories are reduced to time series within each pixel, separated into a pick-up time series and a drop-off time series. After preprocessing, the time series of each pixel has 120 dimensions: $T_i = \{t_1, t_2, \ldots, t_{120}\}$, where $t_1$ to $t_{120}$ are the numbers of taxis in the i-th pixel over the weekdays; the result is plotted in Fig. 5. The geometry of the two time series is very similar, and five 24-hour periods can be clearly identified, which represent a roughly repeating daily distribution of pick-ups and drop-offs over 5 days. Overall, the traffic time series have distinct geometric and topological properties (periodicity).
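The reduction of trajectories to per-pixel time series can be sketched as follows. This is only an illustration under stated assumptions: each pick-up (or drop-off) event is assumed to be already projected to metric coordinates and tagged with an hour index from 0 to 119 (5 weekdays × 24 hours). The event layout `(x_m, y_m, hour_index)` and all names are hypothetical and do not reflect the 13-field raw record format.

```python
from collections import defaultdict

CELL_SIZE_M = 1000.0   # 1 km x 1 km pixels
HOURS = 5 * 24         # 120 hourly bins over 5 weekdays


def cell_of(x_m, y_m, cell_size=CELL_SIZE_M):
    """Map projected coordinates (in metres) to an integer grid-cell index."""
    return (int(x_m // cell_size), int(y_m // cell_size))


def build_cell_series(events, hours=HOURS):
    """Count events per grid cell and hourly bin.

    `events` is an iterable of (x_m, y_m, hour_index) tuples (hypothetical layout).
    Returns a dict mapping each cell to a list of `hours` counts.
    """
    series = defaultdict(lambda: [0] * hours)
    for x_m, y_m, hour_index in events:
        if 0 <= hour_index < hours:      # ignore events outside the 5-day window
            series[cell_of(x_m, y_m)][hour_index] += 1
    return dict(series)
```

Running the function once on pick-up events and once on drop-off events would yield the two families of 120-dimensional series described above.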
Comparison method
The TGMD method of the present invention is compared with other standard clustering methods for time series data. Specifically, the commonly used distance metrics ED (Euclidean distance) and DTW (dynamic time warping) are treated as geometric methods and are clustered using K-medoids. The ED in the PCA subspace is also used as a distance measure, with time series clustering performed by K-Means. TGMD clustering results were further compared with results from k-shape and TSKMeans. The k-shape clustering method uses a normalized cross-correlation measure so that the shapes of the time series are taken into account when comparing them. TSKMeans is a k-means-type smooth subspace clustering algorithm that exploits the subspace information inherent in a time series data set to improve clustering performance.
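For reference, the two geometric distances named above can be sketched in plain Python. This is the textbook dynamic-programming form of DTW with an absolute-difference local cost and no warping window; it is offered as an assumption-level illustration, not as the exact variant used in the experiments.

```python
import math


def euclidean_distance(a, b):
    """Euclidean distance between two equal-length series."""
    if len(a) != len(b):
        raise ValueError("series must have the same length")
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def dtw_distance(a, b):
    """Dynamic time warping distance with |x - y| as the local cost."""
    n, m = len(a), len(b)
    inf = float("inf")
    # cost[i][j] = minimal accumulated cost of aligning a[:i] with b[:j]
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            local = abs(a[i - 1] - b[j - 1])
            cost[i][j] = local + min(cost[i - 1][j],      # repeat b[j-1]
                                     cost[i][j - 1],      # repeat a[i-1]
                                     cost[i - 1][j - 1])  # advance both
    return cost[n][m]
```

Unlike the Euclidean distance, DTW accepts series of different lengths and tolerates local shifts along the time axis.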
Evaluation index
The present embodiment uses the following two indexes to measure the effectiveness of the method.
Silhouette coefficient: if the ground-truth labels are not known, the evaluation must rely on the model itself. A higher silhouette coefficient indicates a model with better-defined clusters. A silhouette coefficient is defined for each sample and is composed of two scores.
The silhouette coefficient s for a single sample is:
$$s = \frac{b - a}{\max(a, b)}$$
where a is the average distance between the sample and all other points in the same cluster, and b is the average distance between the sample and all points in the nearest other cluster.
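The per-sample silhouette coefficient, in its standard form s = (b − a) / max(a, b), can be computed directly from a precomputed distance matrix, which is convenient when the distance is a custom measure. The sketch below assumes a symmetric matrix `dist` and at least two clusters; the zero score for singleton clusters is a common convention, not something stated in the source.

```python
def silhouette_samples(dist, labels):
    """Per-sample silhouette coefficients from a precomputed distance matrix."""
    clusters = {}
    for idx, lab in enumerate(labels):
        clusters.setdefault(lab, []).append(idx)
    if len(clusters) < 2:
        raise ValueError("at least two clusters are required")

    scores = []
    for i, lab in enumerate(labels):
        own = clusters[lab]
        if len(own) == 1:
            scores.append(0.0)  # convention for singleton clusters
            continue
        # a: mean distance to the other members of the same cluster
        a = sum(dist[i][j] for j in own if j != i) / (len(own) - 1)
        # b: smallest mean distance to the members of any other cluster
        b = min(sum(dist[i][j] for j in members) / len(members)
                for other, members in clusters.items() if other != lab)
        denom = max(a, b)
        scores.append((b - a) / denom if denom > 0 else 0.0)
    return scores


def silhouette_score(dist, labels):
    """Mean silhouette coefficient over all samples."""
    scores = silhouette_samples(dist, labels)
    return sum(scores) / len(scores)
```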
Adjusted Rand index (ARI): the adjusted Rand index corrects the Rand index (RI) for chance, because the plain Rand index does not properly describe the similarity of randomly assigned cluster label vectors.
$$RI = \frac{TP + TN}{TP + FP + FN + TN}, \qquad ARI = \frac{RI - E[RI]}{\max(RI) - E[RI]}$$
where TP is the number of time-series pairs belonging to the same category and assigned to the same cluster, TN is the number of time-series pairs belonging to different categories and assigned to different clusters, FP is the number of time-series pairs belonging to different categories and assigned to the same cluster, FN is the number of time-series pairs belonging to the same category and assigned to different clusters, and RI is the fraction of correct decisions.
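The pair counts TP, TN, FP, and FN defined above are enough to compute both indices. The sketch below uses the standard closed-form expression of the ARI in terms of pair counts, which is algebraically equivalent to the chance-corrected definition; the function names are illustrative only.

```python
from itertools import combinations


def pair_counts(truth, pred):
    """Count sample pairs by agreement of true class and assigned cluster."""
    tp = tn = fp = fn = 0
    for i, j in combinations(range(len(truth)), 2):
        same_class = truth[i] == truth[j]
        same_cluster = pred[i] == pred[j]
        if same_class and same_cluster:
            tp += 1
        elif not same_class and not same_cluster:
            tn += 1
        elif not same_class and same_cluster:
            fp += 1
        else:
            fn += 1
    return tp, tn, fp, fn


def rand_index(truth, pred):
    """Fraction of pairs on which the two labelings agree."""
    tp, tn, fp, fn = pair_counts(truth, pred)
    total = tp + tn + fp + fn
    return (tp + tn) / total if total else 1.0


def adjusted_rand_index(truth, pred):
    """Chance-corrected Rand index expressed through the pair counts."""
    tp, tn, fp, fn = pair_counts(truth, pred)
    denom = (tp + fn) * (fn + tn) + (tp + fp) * (fp + tn)
    if denom == 0:
        return 1.0  # degenerate case: both labelings are trivially identical
    return 2.0 * (tp * tn - fn * fp) / denom
```

The ARI is invariant to a renaming of cluster labels, so a clustering that matches the true classes up to permutation scores 1.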
Traffic spatio-temporal sequence clustering analysis
Cluster analysis was carried out on the spatio-temporal sequences of Shanghai taxis (pick-up: boarding; drop-off: alighting). Specifically, DTW and the Sliced Wasserstein distance are used as the geometric distance and the topological distance, respectively. Since this data set has no ground-truth labels, the appropriate number of clusters is selected according to the silhouette coefficient.
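The Sliced Wasserstein distance between two persistence diagrams is usually approximated by averaging one-dimensional transport costs over a finite set of directions. The following sketch follows that common approximation scheme: each diagram is augmented with the diagonal projections of the other diagram's points, both sets are projected onto a direction, sorted, and compared. The number of directions `n_directions` and the representation of a diagram as a list of (birth, death) pairs are assumptions made for illustration.

```python
import math


def diagonal_projection(point):
    """Orthogonal projection of a (birth, death) point onto the diagonal."""
    birth, death = point
    mid = (birth + death) / 2.0
    return (mid, mid)


def sliced_wasserstein(dg1, dg2, n_directions=50):
    """Approximate Sliced Wasserstein distance between two persistence diagrams."""
    # Augment each diagram so that both contain the same number of points.
    aug1 = list(dg1) + [diagonal_projection(p) for p in dg2]
    aug2 = list(dg2) + [diagonal_projection(p) for p in dg1]

    total = 0.0
    for i in range(n_directions):
        # Directions sampled uniformly on the half circle [-pi/2, pi/2).
        theta = -math.pi / 2 + i * math.pi / n_directions
        c, s = math.cos(theta), math.sin(theta)
        proj1 = sorted(x * c + y * s for x, y in aug1)
        proj2 = sorted(x * c + y * s for x, y in aug2)
        # 1-D optimal transport between equal-size point sets: match sorted values.
        total += sum(abs(u - v) for u, v in zip(proj1, proj2))
    return total / n_directions
```

In practice the diagrams themselves would come from a persistent homology library; only the distance computation is shown here.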
Figs. 6 and 9 show examples of aperiodic and periodic traffic time series, respectively, from different regions. Figs. 7 and 10 visualize the time series of Figs. 6 and 9 after delay embedding (τ = 2, d = 6) and projection onto a two-dimensional space by PCA. Figs. 8 and 11 are the one-dimensional persistence diagrams of Figs. 6 and 9. To analyze the topological properties of the time series, each series is first embedded in phase space. For the time series with significant periodicity, Fig. 10 shows a clear topological signal (a loop), in contrast to Fig. 7, where no topological feature appears. Topological features are then extracted using persistent homology: Figs. 8 and 11 show the two persistence diagrams, which are topological summaries of the point clouds in phase space. In Fig. 8, most points lie close to the diagonal and can be interpreted as topological noise. In Fig. 11, by contrast, a point far from the diagonal is clearly visible, i.e. a persistent topological signal. This example shows that aperiodic and periodic taxi time series can be distinguished by their persistence diagrams.
Then, all taxi time series in Shanghai were clustered. First, an appropriate number of clusters was selected according to the silhouette coefficient by repeating the clustering for different values of k. Fig. 12 shows how the silhouette coefficient varies with the number of clusters k. Generally, the larger the silhouette value, the better the clustering result. For both time series (pick-up and drop-off), the silhouette coefficient dropped markedly between k = 4 and k = 5, and for k > 5 it was below 0.5. On this basis, the data were divided into 4 clusters.
The study area as a whole was divided into 4 clusters that form a concentric ring structure, as shown in Fig. 13. Cluster 1 (C1) is distributed in the central area of the city and includes airports and train stations. The time series of C1 have large taxi flows and a stable period, see Fig. 14. The geometric features of the original time series capture quantitative characteristics, namely taxi flow, while the stable periodicity can be described as a stable topology of the time series in phase space.
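Delay embedding, as used for these figures, maps a scalar series to points of the form (x_i, x_{i+τ}, ..., x_{i+(d-1)τ}). A minimal sketch in Python follows; the function name is illustrative, and the PCA projection and the persistent homology computation are deliberately left out.

```python
def delay_embed(series, d, tau):
    """Embed a scalar time series into d-dimensional phase space with delay tau.

    Returns a list of tuples (x_i, x_{i+tau}, ..., x_{i+(d-1)tau}).
    """
    n_points = len(series) - (d - 1) * tau
    if n_points <= 0:
        raise ValueError("series too short for the requested d and tau")
    return [tuple(series[i + j * tau] for j in range(d))
            for i in range(n_points)]
```

With the parameters quoted for the taxi data (τ = 2, d = 6), a 120-point series yields 120 − 5 × 2 = 110 points in a six-dimensional phase space.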
Clustering method performance analysis
First, the performance of three different distance measures was tested on the synthetic single-cell data: geometry only (DTW), topology only (Sliced Wasserstein), and TGMD, each clustered using the K-Medoids method. The ARI was then used to analyze the clustering results and to compare the performance for different embedding dimensions m, where m ∈ {2, 3, 4, 5, 6, 7, 8, 9, 10}.
Next, the clustering performance of the algorithm of this embodiment was tested on real data sets, and the results were compared with those of other methods. In the method of this embodiment, the delay embedding parameters τ and d are selected manually according to the silhouette coefficient on each data set. For the adjustment function, k = 1 is chosen, and the Euclidean distance and the DTW distance are used to evaluate the geometric similarity.
Finally, the TGMD metric space of the Synthetic Control data set is visualized using UMAP (Uniform Manifold Approximation and Projection). UMAP is prior art and is not described here in detail. Empirically, d = 15 and τ = 3 were set as the delay embedding parameters, and the effects of different adjustment coefficients k on the results were compared.
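K-Medoids only needs pairwise distances, which is why it can be combined with DTW, Sliced Wasserstein, or TGMD alike. The sketch below is a simple alternating variant (assign to nearest medoid, then re-pick each medoid) operating on a precomputed distance matrix. It is an illustrative implementation with a hypothetical `seed` parameter, not necessarily the variant used in the experiments.

```python
import random


def k_medoids(dist, k, max_iter=100, seed=0):
    """Cluster items given a precomputed distance matrix.

    Returns (labels, medoids), where labels[i] is the cluster index of item i
    and medoids is the list of item indices chosen as cluster centres.
    """
    n = len(dist)
    rng = random.Random(seed)
    medoids = rng.sample(range(n), k)

    def assign(current):
        # Each item joins the cluster of its nearest medoid.
        return [min(range(k), key=lambda c: dist[i][current[c]])
                for i in range(n)]

    for _ in range(max_iter):
        labels = assign(medoids)
        new_medoids = []
        for c in range(k):
            members = [i for i in range(n) if labels[i] == c]
            if not members:
                new_medoids.append(medoids[c])  # keep the medoid of an empty cluster
                continue
            # New medoid: the member with the smallest total distance to the others.
            new_medoids.append(
                min(members, key=lambda m: sum(dist[m][j] for j in members)))
        if new_medoids == medoids:
            break
        medoids = new_medoids
    return assign(medoids), medoids
```

Repeating this call for several values of k and scoring each result with the silhouette coefficient gives the model-selection loop described for the taxi data.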
Cluster analysis of synthetic single cell gene cycle expression data
This section analyzes the clustering results for the synthetic single-cell periodic gene expression data. Figs. 18 and 21 show examples of non-oscillatory and oscillatory time series, respectively, sampled every 5 minutes. Figs. 19 and 22 show the two time series in phase space. For the oscillatory time series, Fig. 22 reveals a clear topological signature (a ring structure), in contrast to Fig. 19, where no topological feature appears. Topological features are then extracted using persistent homology: Figs. 20 and 23 show the persistence diagrams, which are topological summaries of the point clouds in phase space. In Fig. 20, most points in the persistence diagram lie close to the diagonal and can be interpreted as topological noise. In Fig. 23, by contrast, a point far from the diagonal is clearly visible, i.e. a persistent topological signal. This example shows how non-oscillatory data (non-periodic expression) and oscillatory data (periodic expression) can be distinguished by their persistence diagrams.
Next, the synthetic single-cell data were clustered using TGMD, and the effect of different embedding dimensions m was compared. The silhouette coefficient (SIL) and the adjusted Rand index (ARI) were used as evaluation indices. As shown in Fig. 24, the silhouette coefficient and the adjusted Rand index behave qualitatively alike as functions of m, and their minima and maxima coincide. Note that as the sampling interval v increases, the time series become shorter and clustering becomes increasingly difficult. According to the experimental results in Fig. 24, as v increases, better clustering performance is obtained for larger m, since higher-dimensional information is needed to describe the data correctly.
To demonstrate that TGMD is more robust than the topology-only (TS) or geometry-only (DTW) methods when noise corrupts the time series, the performance of these methods on noisy time series data was compared. Gaussian noise of varying level was added to the data with measurement interval v = 10 minutes. Fig. 25 shows the mean clustering results for the different noise levels: over the whole range of noise levels (from 0.1 to 0.7), TGMD is superior to the methods that use only topology or only geometry.
Cluster analysis of real time series data
To further evaluate the performance of the TGMD method on different real data sets, 11 data sets were selected from the UCR time series archive. Different embedding dimensions m = 2, 3, ..., 20 and different delays τ = 1, 2, 3, 4 were considered, and the optimal values were sought. To this end, the adjustment parameter of the adjustment function was set to k = 1, and the geometric properties of the original time series were captured using the Euclidean distance or the DTW distance. The proposed clustering method was further compared with other clustering methods for time series data. The test results are shown in Table 1, where the best and second-best methods (in terms of accuracy) are highlighted in red and orange, respectively; TGMD is the method proposed in this study. The results in Table 1 show that the two purely geometric methods perform similarly, and that TGMD gives better results than methods that consider only the geometry of the original time series. Of the 11 data sets considered, the proposed method achieves the best result on 8 and the second-best result on the remaining 3.
Table 1. UCR time series data set clustering results
Visual analysis of high dimensional metric space
The accuracy of the model can be improved to a certain extent by adjusting the hyper-parameters of the model. The embodiment analyzes the influence of the adjustment parameter k on the clustering performance of the TGMD on the Synthetic Controls data set, and then visualizes the TGMD metric space by using the UMAP.
The Synthetic Control data set is used to monitor the behavior of a system and contains 600 time series of length 60. There are six types of patterns in this data set: normal, cyclic, upward shift, downward shift, upward trend, and downward trend. All patterns except the normal one indicate that the monitored process is not operating properly and needs to be adjusted.
When k = 0, f(x) is a constant (see Fig. 26), and TGMD reflects only the geometric characteristics of the original time series. It is then difficult to distinguish upward trends from upward shifts, which have different temporal dynamics. As k increases, TGMD begins to reflect the topological properties of the time series. For k = 1, TGMD can distinguish between upward trends and upward shifts. As k increases further, only the cyclic patterns, which have distinctive topological properties, remain clearly separated. Overall, the visualization of the high-dimensional metric space and the evaluation of clustering accuracy confirm that the intrinsic characteristics of temporal data can be better understood when the topological and geometric characteristics of the time series are both properly taken into account.
The invention proposes and analyzes in detail the TGMD time series clustering framework. The original time series are first projected into phase space using delay embedding, and the topological features of the resulting point cloud are then extracted using persistent homology. Subsequently, a hybrid distance measure based on an adjustment function is proposed, which not only extracts global dynamic features but also describes the local structure of the time series. Extensive experiments were carried out on time series from different fields, and the results show that the method is superior to topology-only and geometry-only methods and is robust to noise. Compared with other standard time series clustering methods on real data sets, the method obtains competitive results, and its effectiveness is confirmed by visual analysis. The invention was also tested on spatio-temporal data; the experimental results show that it captures the geometric and topological characteristics of the data simultaneously, and that the clustering results reflect human activity and the internal structure of the city. The proposed clustering framework has been applied to time series data from different domains, including biological protein expression data, electrocardiogram (ECG) data, image contour data, and taxi time series data. Since the geometric and topological properties of time series reveal their intrinsic features, the invention is applicable to the cluster analysis of a wide variety of temporal data.
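The role of the adjustment coefficient k can be illustrated with a small sketch of the hybrid distance TGMD = f(TS′) × Geo. The exact adjustment function of the source is not reproduced in this text, so the logistic-type form f(x) = 2 / (1 + exp(−k·x)) used below is an assumption, chosen only because it is monotonically increasing and reduces to the constant 1 when k = 0, as described above. The min-max rescaling used to obtain TS′ is likewise an assumption.

```python
import math


def adjustment(x, k=1.0):
    """ASSUMED adjustment function: increasing in x, constant (= 1) when k = 0."""
    return 2.0 / (1.0 + math.exp(-k * x))


def min_max_rescale(matrix):
    """Rescale all entries of a matrix to [0, 1] (assumed normalization for TS')."""
    values = [v for row in matrix for v in row]
    lo, hi = min(values), max(values)
    if hi == lo:
        return [[0.0 for _ in row] for row in matrix]
    return [[(v - lo) / (hi - lo) for v in row] for row in matrix]


def tgmd_matrix(topo, geo, k=1.0):
    """Combine topological and geometric distance matrices entry by entry."""
    topo_scaled = min_max_rescale(topo)
    return [[adjustment(t, k) * g for t, g in zip(t_row, g_row)]
            for t_row, g_row in zip(topo_scaled, geo)]
```

With k = 0 the result equals the geometric matrix; larger k lets topological dissimilarity inflate the geometric distance more strongly.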
The invention has the following beneficial effects:
1. A topological-geometric mixed distance measure for time series cluster analysis is provided by combining the local geometric and global topological features of time series. The topological method qualitatively describes the global features of the time series, while the geometric method describes the local quantitative differences between the original time series.
2. The hybrid distance metric proposed by the present invention is based on an adjustment function that adjusts the overall similarity according to the proximity of topological properties.
3. Experimental results show that the method is superior to purely topological and purely geometric methods in identifying oscillatory activity in noisy biological data by clustering, and is more robust to noise. In addition, compared with other standard time series clustering methods, the method obtains competitive results on real data sets. Visualization of the TGMD metric space also demonstrates its effectiveness. The invention was also tested on spatio-temporal data. Experimental results show that the proposed algorithm captures the geometric and topological features of the data simultaneously, and that the clustering results reflect human activity patterns and the inherent structure of the city.
The above embodiment is an embodiment of the present invention, but the embodiment of the present invention is not limited by the above embodiment, and any other changes, modifications, substitutions, combinations, and simplifications which do not depart from the spirit and principle of the present invention should be regarded as equivalent replacements within the protection scope of the present invention.
Claims (8)
1. The method for clustering and analyzing the temporal data intrinsic mode of the spatiotemporal big data is characterized by comprising the following steps of:
acquiring point cloud data by applying delay embedding to a time series so as to obtain a phase space, applying a principal component analysis method to the phase space to reduce topological noise, and obtaining a persistence diagram by using a persistent homology method;
calculating the Sliced Wasserstein distance between persistence diagrams to obtain a global topological similarity measure of the phase space, and calculating the geometric similarity of the time series in the original space;
calculating topological-geometric mixed distance, wherein the calculation of the topological-geometric mixed distance utilizes an adjusting function to establish a new distance, and the adjusting function is selected to enable topological similarity to become an adjusting factor of the geometric distance;
optimizing the clusters by minimizing the distance between the center of each cluster and the data points in the clusters by using a clustering algorithm to generate spherical clusters with similar sizes;
repeating the clustering operation and selecting an appropriate number of clusters according to the silhouette coefficient;
and analyzing the clustering result by using the evaluation index, and visualizing the topological-geometric mixed distance metric space of the Synthetic Control data set by using UMAP.
2. The spatio-temporal big data temporal implication mode cluster analysis method according to claim 1, characterized in that, for a time series $f = \{x_t \mid t = 1, 2, \ldots, T\}$, where $T$ is the length of the time series, the time series data is embedded by the delay embedding into a point cloud space $V = \{v_1, \ldots, v_t, \ldots\}$, each point in the point cloud space being represented as $v_i = (x_i, x_{i+\tau}, \ldots, x_{i+(d-1)\tau})$, wherein $d$ is the dimension of the embedded point cloud space and $\tau$ is the delay coefficient; $d$ and $\tau$ are regarded as hyper-parameters and are selected according to the silhouette coefficient.
3. The spatio-temporal big data temporal implication mode cluster analysis method according to claim 1, characterized in that the computation of geometrical similarity of the time series includes euclidean distance and DTW, where ED is computed as follows:
given two time series $T_1$ and $T_2$, the Euclidean distance $\delta_E$ between $T_1$ and $T_2$ is defined as:
$$\delta_E(T_1, T_2) = \sqrt{\sum_{i=1}^{n} (u_i - v_i)^2}$$
wherein $u_i$ is a point in time series $T_1$, $v_i$ is a point in time series $T_2$, and $n$ is the number of points in each time series;
the DTW is calculated as follows:
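for two time series $T_1$ and $T_2$,
$$\mathrm{DTW}(T_1, T_2) = \min_{r \in M} |r|$$
wherein the mapping $r \in M$, $M$ being the set of admissible mappings (warping paths) between $T_1$ and $T_2$,
$$|r| = \sum_{i=1,\ldots,m} |u_{a_i} - v_{b_i}|,$$
$u_{a_i}$ is the point of time series $T_1$ obtained after mapping point $u_i$, and $v_{b_i}$ is the point of time series $T_2$ obtained after mapping point $v_i$.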
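4. The spatio-temporal big data temporal implication mode cluster analysis method according to claim 1, characterized in that, given $\theta \in \mathbb{R}^2$ with $\|\theta\|_2 = 1$, $L(\theta)$ denotes the line $\{\lambda\theta \mid \lambda \in \mathbb{R}\}$ and $\pi_\theta : \mathbb{R}^2 \to L(\theta)$ denotes the orthogonal projection onto $L(\theta)$, and the Sliced Wasserstein distance is defined as follows:
$$SW(Dg_1, Dg_2) = \frac{1}{2\pi} \int_{\mathbb{S}^1} W\left(\mu_1^\theta + \mu_{2\Delta}^\theta,\; \mu_2^\theta + \mu_{1\Delta}^\theta\right) d\theta$$
wherein $Dg_1$, $Dg_2$ denote two persistence diagrams, each persistence diagram being a finite multiset of points in $\mathbb{R}^2$ taken together with the diagonal $\Delta$; $\mu_i^\theta = \sum_{p \in Dg_i} \delta_{\pi_\theta(p)}$ and $\mu_{i\Delta}^\theta = \sum_{p \in Dg_i} \delta_{\pi_\theta \circ \pi_\Delta(p)}$ for $i = 1, 2$, $\pi_\Delta$ is the orthogonal projection onto the diagonal, and $W$ is the 1-Wasserstein distance between measures on the line.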
5. The spatiotemporal big data temporal implication mode cluster analysis method according to claim 1, characterized in that the topological-geometric mixture distance is calculated as follows:
TGMD(T1,T2)=f(TS′(T1,T2))×Geo(T1,T2)
where f (x) is a monotonically increasing adjustment function, TS' is a normalized topological similarity obtained by rescaling the topological similarity TS, T1 and T2 are time series, Geo is a geometric distance between T1 and T2.
6. The spatio-temporal big data temporal implication mode cluster analysis method according to claim 1, characterized in that the adjusting function is an exponential function, ensuring that the adjusting effect of extremum and its nearest neighbors is almost equal, and the adjusting function f (x) is as follows:
wherein k is an adjustment coefficient, and k ≥ 0.
7. The spatiotemporal big data temporal implication mode cluster analysis method according to claim 1, characterized in that the clustering algorithm is K-medoids.
8. The spatio-temporal big data temporal implication mode cluster analysis method according to claim 1, characterized in that the evaluation index adopts the silhouette coefficient and the adjusted Rand index,
the silhouette coefficient s for a single sample is:
$$s = \frac{b - a}{\max(a, b)}$$
wherein a is the average distance between the sample and all other points in the same cluster, and b is the average distance between the sample and all points in the nearest other cluster;
the adjusted Rand index ARI is as follows:
$$RI = \frac{TP + TN}{TP + FP + FN + TN}, \qquad ARI = \frac{RI - E[RI]}{\max(RI) - E[RI]}$$
where TP is the number of time-series pairs belonging to the same class and assigned to the same cluster, TN is the number of time-series pairs belonging to different classes and assigned to different clusters, FP is the number of time-series pairs belonging to different classes and assigned to the same cluster, FN is the number of time-series pairs belonging to the same class and assigned to different clusters, and RI is the fraction of correct decisions.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202111088489.9A CN113780451A (en) | 2021-09-16 | 2021-09-16 | Temporal data implication mode clustering analysis method of temporal-spatial big data |
Publications (1)
Publication Number | Publication Date |
---|---|
CN113780451A (en) | 2021-12-10 |
Family
ID=78851593
Country Status (1)
Country | Link |
---|---|
CN (1) | CN113780451A (en) |
Citations (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN109800801A (en) * | 2019-01-10 | 2019-05-24 | 浙江工业大学 | K-Means clustering lane method of flow based on Gauss regression algorithm |
Non-Patent Citations (1)
Title |
---|
YUNSHENG ZHANG: "Time Series Clustering with Topological and Geometric Mixed Distance", MATHEMATICS, no. 9, 6 May 2021 (2021-05-06), pages 1 - 17 * |
Cited By (9)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN114526739A (en) * | 2022-01-25 | 2022-05-24 | 中南大学 | Mobile robot indoor repositioning method, computer device and product |
CN114526739B (en) * | 2022-01-25 | 2024-05-07 | 中南大学 | Mobile robot indoor repositioning method, computer device and product |
CN114582439A (en) * | 2022-03-10 | 2022-06-03 | 北京国垦节水科技有限公司 | Soil saline-alkali soil conditioner screening method and system based on application scene |
CN114582439B (en) * | 2022-03-10 | 2022-11-04 | 新疆楚强生物科技有限公司 | Application scene-based soil saline-alkali soil conditioner screening method and system |
CN116467610A (en) * | 2023-03-13 | 2023-07-21 | 深圳市壹通道科技有限公司 | Data topology analysis method, device, equipment and storage medium based on 5G message |
CN116467610B (en) * | 2023-03-13 | 2023-10-10 | 深圳市壹通道科技有限公司 | Data topology analysis method, device, equipment and storage medium based on 5G message |
CN117688410A (en) * | 2024-02-02 | 2024-03-12 | 山东同利新材料有限公司 | Intelligent management method for production data of diethyl maleate |
CN117688410B (en) * | 2024-02-02 | 2024-05-24 | 山东同利新材料有限公司 | Intelligent management method for production data of diethyl maleate |
CN118427731A (en) * | 2024-07-05 | 2024-08-02 | 深圳中恒检测技术有限公司 | Power detection method, device, equipment and storage medium |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN113780451A (en) | Temporal data implication mode clustering analysis method of temporal-spatial big data | |
CN111539454B (en) | Vehicle track clustering method and system based on meta-learning | |
Li et al. | Reconstruction of human movement trajectories from large-scale low-frequency mobile phone data | |
Morris et al. | Real-time video-based traffic measurement and visualization system for energy/emissions | |
Liu et al. | A participatory urban traffic monitoring system: The power of bus riders | |
CN109923595A (en) | A kind of urban highway traffic method for detecting abnormality based on floating car data | |
CN107679558A (en) | A kind of user trajectory method for measuring similarity based on metric learning | |
CN110232398A (en) | A kind of road network sub-area division and its appraisal procedure based on Canopy+Kmeans cluster | |
Li et al. | Knowledge-based trajectory completion from sparse GPS samples | |
CN107944628A (en) | A kind of accumulation mode under road network environment finds method and system | |
CN116628455A (en) | Urban traffic carbon emission monitoring and decision support method and system | |
CN111652198A (en) | Urban edge area identification method and system | |
CN112101132B (en) | Traffic condition prediction method based on graph embedding model and metric learning | |
Caceres et al. | Estimating traffic flow profiles according to a relative attractiveness factor | |
Qi et al. | Vehicle trajectory reconstruction on urban traffic network using automatic license plate recognition data | |
CN113327079A (en) | Path selection potential factor visual analysis method based on network car booking track | |
Duan et al. | A unified STARIMA based model for short-term traffic flow prediction | |
Li et al. | Driving performances assessment based on speed variation using dedicated route truck GPS data | |
CN108053646B (en) | Traffic characteristic obtaining method, traffic characteristic prediction method and traffic characteristic prediction system based on time sensitive characteristics | |
CN114120018B (en) | Spatial vitality quantification method based on crowd clustering trajectory entropy | |
Qin et al. | Spatiotemporal K-Nearest Neighbors Algorithm and Bayesian Approach for Estimating Urban Link Travel Time Distribution From Sparse GPS Trajectories | |
CN110909037A (en) | Frequent track mode mining method and device | |
CN112150045A (en) | Method for judging urban vehicle supply and demand relationship based on vehicle position statistics and monitoring system thereof | |
Jiang et al. | A fast-mining method for target behavior pattern based on trajectory data | |
CN110049447A (en) | A kind of partnership analysis method based on location information |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||