CN115357811A

CN115357811A - Time sequence multilayer geographical flow clustering identification method considering topological data analysis

Info

Publication number: CN115357811A
Application number: CN202211013111.7A
Authority: CN
Inventors: 李军利; 涂有军; 张韩; 王雅楠; 周成; 邢文文; 王伟印
Original assignee: Anhui Agricultural University AHAU
Current assignee: Anhui Agricultural University AHAU
Priority date: 2022-08-23
Filing date: 2022-08-23
Publication date: 2022-11-18

Abstract

The invention belongs to the technical field of data mining, and particularly relates to a time-series multilayer geographical flow clustering identification method considering topological data analysis, which comprises the following steps: S1, constructing the time-series geographical flows; S2, reducing the dimension of the geographical flows; and S3, identifying the geographical flow clusters. The method is a new geographical spatio-temporal analysis method that can identify multilayer geographical flow clusters. To cluster multilayer time-series geographical flows, the inventors introduce persistence diagrams, a multi-lens tool of topological data analysis, into the method and calculate the Wasserstein distance between each pair of persistence diagrams, so as to cluster the multilayer geographical flows of different time sequences. This depicts the dynamic interaction of the multilayer geographical flows and enriches research on the dynamic organization of urban space; the experimental results can provide decision support for sustainable city management.

Description

Time sequence multilayer geographical flow clustering identification method considering topological data analysis

Technical Field

The invention belongs to the technical field of data mining, and particularly relates to a time sequence multi-layer geographical flow clustering identification method considering topological data analysis.

Background

Most current multilayer network clustering methods embed a graph into a Euclidean space through spectral decomposition, and therefore do not explicitly consider the local geometry and topology of the underlying graph.

In view of this, the present application provides a time-series multilayer geographical flow cluster identification method considering topological data analysis, a new geographical spatio-temporal analysis method that can identify multilayer geographical flow clusters.

Disclosure of Invention

The invention aims to overcome the problems in the prior art and provide a time sequence multilayer geographical flow clustering identification method considering topological data analysis.

In order to achieve the technical purpose and achieve the technical effect, the invention is realized by the following technical scheme:

a time sequence multilayer geographical flow clustering identification method considering topological data analysis comprises the following steps:

S1, time-series geographical flow construction

Downloading trajectory data of a target city; converting the downloaded trajectory data into the origin-destination (OD) flow data required by the study by using the TransBigData library in Python, and establishing a traffic time-series multilayer network; then creating a grid with ArcGIS to partition the OD flows, taking the number of flows whose origin and destination points fall in the same pair of grid cells as the weight between those cells, and establishing a weighted geographical flow network with the grid cells as nodes;

S2, geographical flow dimension reduction

Applying the DeepWalk network-embedding algorithm, which is based on a random walk model, to the weighted geographical flow network to embed it into point cloud data; on this basis, computing the node sequences of each network layer and detecting dynamic urban mobility communities from the time-dependent multilayer network;

S3, geographical flow cluster identification

Generating the corresponding persistence diagrams from the obtained point cloud data by using the gudhi package in Python, calculating the Wasserstein distances between the network layers, and using these distances to perform cluster identification of the geographical flows.
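As an illustration of what a persistence diagram records, the following minimal sketch computes only the 0-dimensional part (connected components) of a Vietoris-Rips filtration on a small point cloud. It is a simplified stand-in written in plain Python, not the gudhi routine used in the method. The function name `h0_persistence`, the toy point cloud, and the convention that an edge enters the filtration at its Euclidean length are assumptions made for this example.

```python
import math
from itertools import combinations


def h0_persistence(points):
    """Finite (birth, death) pairs of 0-dimensional persistence.

    Every point is born at scale 0; a component dies at the length of the
    edge that merges it into another component (Kruskal with union-find).
    The single component that never dies is not reported.
    """
    n = len(points)
    parent = list(range(n))

    def find(i):
        # Find the component representative, with path halving.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    # All pairwise edges, sorted by Euclidean length.
    edges = sorted(
        (math.dist(points[i], points[j]), i, j)
        for i, j in combinations(range(n), 2)
    )
    diagram = []
    for length, i, j in edges:
        ri, rj = find(i), find(j)
        if ri != rj:  # this edge merges two components
            parent[ri] = rj
            diagram.append((0.0, length))
    return diagram


# Hypothetical toy point cloud: three nearby points and one far away.
cloud = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (5.0, 5.0)]
diagram = h0_persistence(cloud)
```

Each pair (0, d) states that a component present from scale 0 merged into another at scale d. Higher-dimensional features such as loops and voids require a full persistent homology implementation.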

Further, in the time-series multi-layer geographical flow clustering identification method considering the topological data analysis, in step S1, the time scale of the traffic time-series multi-layer network is 2 hours.

Further, in the time-series multi-layer geographical flow clustering identification method considering the topological data analysis, in step S2, the DeepWalk algorithm is mainly divided into two parts: random walk and generation of representation vectors; first, a number of vertex sequences are extracted from the graph by using a random walk algorithm; then, borrowing the idea of natural language processing, the generated vertex sequences are regarded as sentences composed of words, and all the sequences are regarded as a large corpus; finally, each vertex is represented as a vector of dimension d by using the natural language processing tool word2vec.

Further, in the time-series multi-layer geographical flow clustering identification method considering the topological data analysis, in step S2, the deepwalk algorithm specifically includes the following steps:

1) Numbering the grid cells created with ArcGIS and regarding each grid cell as a network node; generating random walk sequences, where the generating algorithm can be understood as taking a starting point and a path length as input and producing a random walk node sequence by listing the nodes adjacent to the current node and randomly selecting the next node from among them;

2) Generating a random walk sequence with each node as a starting point, training the word2vec model in the DeepWalk algorithm to embed each network layer into point cloud data, performing dimension-reduction visualization with principal component analysis, and storing the point cloud data generated by the embedding.
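The walk-generation part of the two steps above can be sketched as follows. This is a minimal illustration, assuming the layer is given as an adjacency dictionary and that the next node is chosen uniformly among the neighbours, as the text describes; ignoring the edge weights is an assumption of this sketch. The names `random_walk` and `build_corpus`, the parameter `walks_per_node`, and the toy graph are hypothetical, and the subsequent word2vec training is not shown.

```python
import random


def random_walk(adjacency, start, walk_length, rng):
    """Generate one walk: input a start node and a path length,
    then repeatedly pick the next node among the current neighbours."""
    walk = [start]
    while len(walk) < walk_length:
        neighbours = adjacency.get(walk[-1], [])
        if not neighbours:  # dead end: stop early
            break
        walk.append(rng.choice(neighbours))
    return walk


def build_corpus(adjacency, walks_per_node, walk_length, seed=0):
    """Start walks from every node; each walk plays the role of a sentence."""
    rng = random.Random(seed)
    corpus = []
    for _ in range(walks_per_node):
        nodes = list(adjacency)
        rng.shuffle(nodes)  # visit start nodes in random order
        for node in nodes:
            corpus.append(random_walk(adjacency, node, walk_length, rng))
    return corpus


# Hypothetical toy layer: grid cells 0..3 with undirected links.
adjacency = {0: [1, 2], 1: [0, 2], 2: [0, 1, 3], 3: [2]}
corpus = build_corpus(adjacency, walks_per_node=2, walk_length=5)
```

The resulting list of walks is the "corpus" that a word2vec implementation would take as input to learn one d-dimensional vector per grid cell.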

Further, in the time-series multi-layer geographical flow clustering identification method considering the topological data analysis, the specific algorithm of random walk in the step 1) is as follows:

let $f(x)$ be a multivariate function of $n$ variables, where $x = (x_1, x_2, \ldots, x_n)$ is an $n$-dimensional vector;

1. giving an initial iteration point $x$, an initial walking step length $\lambda$, and a control precision $\epsilon$;

2. giving the iteration control number $N$, with $k$ being the current iteration count;

3. when $k < N$, randomly generating an $n$-dimensional vector $u = (u_1, u_2, \ldots, u_n)$ with entries in $(-1, 1)$, that is, $-1 < u_i < 1$, $i = 1, 2, \ldots, n$, and normalizing it to obtain

$$u' = \frac{u}{\sqrt{\sum_{i=1}^{n} u_i^2}}$$

letting $x_1 = x + \lambda u'$, which completes the first step of the walk;

4. calculating the function value; if $f(x_1) < f(x)$, that is, a point better than the initial value has been found, resetting $k$ to 1, replacing $x$ with $x_1$, and returning to step 2; otherwise setting $k = k + 1$ and returning to step 3;

5. if no better value is found in $N$ consecutive attempts, the optimal solution is considered to lie within the $n$-dimensional ball centered on the current best solution with the current step length as its radius; at this point, if $\lambda < \epsilon$, the algorithm ends; otherwise, letting $\lambda = \lambda / 2$, returning to step 1 and starting a new round of walking.
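The iteration described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not a definitive implementation: the function name `random_walk_minimize`, the default parameter values, and the test function `sphere` are hypothetical, and `max_tries` plays the role of the iteration control number N.

```python
import math
import random


def random_walk_minimize(f, x0, step=1.0, eps=1e-6, max_tries=100, seed=0):
    """Random-walk search for a minimum of f, halving the step when stuck."""
    rng = random.Random(seed)
    x = list(x0)
    fx = f(x)
    while True:
        k = 1
        while k < max_tries:
            # Random direction with entries in (-1, 1), normalised to unit length.
            u = [rng.uniform(-1.0, 1.0) for _ in x]
            norm = math.sqrt(sum(ui * ui for ui in u))
            if norm == 0.0:
                continue
            candidate = [xi + step * ui / norm for xi, ui in zip(x, u)]
            fc = f(candidate)
            if fc < fx:
                x, fx = candidate, fc  # better point found: accept and reset k
                k = 1
            else:
                k += 1
        # No improvement in the last attempts at this step length.
        if step < eps:
            return x, fx
        step /= 2.0


def sphere(v):
    """Hypothetical test function with its minimum at the origin."""
    return sum(t * t for t in v)


best_x, best_f = random_walk_minimize(sphere, [3.0, -2.0])
```

The search accepts only improving moves, so the quality of the result depends on the number of attempts allowed per step length and on the final precision.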

Further, in the time-series multi-layer geographical flow cluster identification method considering the topological data analysis, $\epsilon$ is a very small positive number used to control the termination of the algorithm.

Further, in the time-series multi-layer geo-stream cluster identification method considering topological data analysis, in step 2), principal component analysis is applied to the point cloud data obtained by embedding each network layer for dimension-reduction visualization, and the different life cycles of emergence, expansion, stability, contraction and disappearance shown in different time periods are explored.

Further, in the time-series multilayer geo-flow clustering identification method considering the topological data analysis, the specific algorithm steps of the Skip-Gram model in step 2) are as follows:

first, selecting a point in the point cloud network as the input point;

given the input point, defining a parameter called skip_window, which represents the number of points selected from one side of the current input point; defining another parameter, num_skips, which represents how many different points are selected from the whole window as output points;

the neural network outputs a probability distribution based on the training data, each probability representing the likelihood that a point in the dictionary is the output point.
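How skip_window and num_skips turn a node sequence into (input point, output point) training pairs can be sketched as follows. This is a minimal illustration, assuming the input is one random walk given as a list of node identifiers; the function name `skipgram_pairs` and the toy walk are hypothetical, and the neural network itself is not shown.

```python
import random


def skipgram_pairs(walk, skip_window, num_skips, seed=0):
    """Build (input, output) pairs from one node sequence.

    skip_window: number of points taken on each side of the input point.
    num_skips:   number of distinct points drawn from the window as outputs.
    """
    if num_skips > 2 * skip_window:
        raise ValueError("num_skips cannot exceed 2 * skip_window")
    rng = random.Random(seed)
    pairs = []
    # Only positions with a full window on both sides are used as inputs.
    for centre in range(skip_window, len(walk) - skip_window):
        context = [
            centre + offset
            for offset in range(-skip_window, skip_window + 1)
            if offset != 0
        ]
        for pos in rng.sample(context, num_skips):
            pairs.append((walk[centre], walk[pos]))
    return pairs


# Hypothetical walk over five grid cells.
walk = [10, 11, 12, 13, 14]
pairs = skipgram_pairs(walk, skip_window=1, num_skips=2)
```

With skip_window = 1 and num_skips = 2, every interior node is paired with both of its immediate neighbours in the walk.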

Further, in the time-series multi-layer geo-stream cluster identification method considering the topology data analysis, in step S3, the basic principle of the cluster method is as follows: if the local neighborhoods of two points are similar in shape at all resolution scales, they are close enough to be grouped into a cluster.

Further, in the time-series multi-layer geo-stream cluster identification method considering topology data analysis, in order to compare the shapes of the clusters, the following steps are performed:

consider $X_n = \{x_1, \ldots, x_n\}$ in some metric space $(X, D)$;

setting resolution thresholds $V_1 < V_2 < \cdots < V_K$ and constructing a Vietoris-Rips (VR) filtration;

calculating the local topological summary of $x_i$ in the form of a persistence diagram $PD(i)$, $i = 1, \ldots, n$;

for all local neighborhoods $N(i)$ of $x_i$ and $N(j)$ of $x_j$, $i, j = 1, 2, \ldots, n$, calculating the pairwise dissimilarity of topology, or data shape, as the Wasserstein distance between their respective persistence diagrams $PD(i)$ and $PD(j)$:

$$W_2(N(i), N(j)) = \inf_{\gamma} \left( \sum_{x \in PD(i) \cup \Delta} \lVert x - \gamma(x) \rVert_\infty^2 \right)^{1/2} \qquad (1)$$

in formula (1), $\Delta = \{(x, x) \mid x \in \mathbb{R}\}$, $\gamma$ ranges over the bijections from $PD(i) \cup \Delta$ to $PD(j) \cup \Delta$, and the Wasserstein distance allows the similarity in shape of two node neighborhoods to be quantified systematically;

forming a distance graph $G$ from $W_2(N(i), N(j))$, $i, j = 1, 2, \ldots, n$, with adjacency matrix $A$;

defining the cutoff point $k$ by an elbow plot or cross-validation;

the connected components of $G$ are the resulting clusters.
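The last two steps can be sketched as follows. Because the defining formula of the adjacency matrix is not reproduced in the text, this sketch assumes that two points are linked when their Wasserstein distance does not exceed a cutoff; that rule, the function name `connected_components`, and the toy distance matrix are assumptions made for illustration only.

```python
def connected_components(dist, cutoff):
    """Label the connected components of the graph that links i and j
    whenever dist[i][j] <= cutoff (assumed adjacency rule)."""
    n = len(dist)
    parent = list(range(n))

    def find(i):
        # Find the component representative, with path halving.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i in range(n):
        for j in range(i + 1, n):
            if dist[i][j] <= cutoff:
                parent[find(i)] = find(j)

    # Relabel representatives as 0, 1, 2, ... in order of first appearance.
    labels = {}
    return [labels.setdefault(find(i), len(labels)) for i in range(n)]


# Hypothetical pairwise Wasserstein distances between four points.
dist = [
    [0.0, 0.2, 0.9, 1.0],
    [0.2, 0.0, 0.8, 1.1],
    [0.9, 0.8, 0.0, 0.3],
    [1.0, 1.1, 0.3, 0.0],
]
labels = connected_components(dist, cutoff=0.5)
```

Points 0 and 1 end up in one cluster and points 2 and 3 in another; raising the cutoff to 0.8 or more would merge everything into a single component, which is why the choice of cutoff matters.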

The beneficial effects of the invention are:

the invention provides a time-series multilayer geographical flow cluster identification method considering topological data analysis, which is a new geographical spatio-temporal analysis method and can identify multilayer geographical flow clusters. The aim is to cluster a multilayer network in an unsupervised setting from the perspective of the similarity of data shapes recorded at multiple resolutions; moreover, unsupervised learning of multilayer networks remains significantly less developed than supervised community detection and classification. To cluster multilayer time-series geographical flows, the inventors introduce persistence diagrams, a multi-lens tool of topological data analysis, into the method and calculate the Wasserstein distance between each pair of persistence diagrams, so as to cluster the multilayer geographical flows of different time sequences. This depicts the dynamic interaction of the multilayer geographical flows and enriches research on the dynamic organization of urban space; the experimental results can provide decision support for sustainable city management.

Of course, it is not necessary for any one product that embodies the invention to achieve all of the above advantages simultaneously.

Drawings

In order to more clearly illustrate the technical solutions of the embodiments of the present invention, the drawings used in the description of the embodiments are briefly introduced below, and it is obvious that the drawings in the description below are only some embodiments of the present invention, and it is obvious for those skilled in the art that other drawings can be obtained according to the drawings without creative efforts.

FIG. 1 is a flow chart of the experiment in the embodiment;

FIG. 2 is a schematic diagram of the OD flows in the different time periods in the embodiment;

FIG. 3 is a schematic diagram of the weighted network data with grid cells as nodes in the embodiment;

FIG. 4 is a schematic diagram of the Skip-Gram algorithm in the embodiment;

FIG. 5 is a schematic diagram of the dimension reduction of the geographical flows from 6:00 to 8:00 in the embodiment;

FIG. 6 is a schematic diagram of the persistence diagrams of the different time periods in the embodiment;

FIG. 7 is a schematic diagram of the Wasserstein distance matrix of the different time periods in the embodiment;

FIG. 8 is a diagram illustrating the result of cluster identification of the time-series multilayer geographical flows in the embodiment.

Detailed Description

The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.

In order to fully utilize the advantages of the model, the invention relates to a time sequence multilayer geographical flow clustering identification method considering topological data analysis, wherein multilayer geographical flows with different time sequences are clustered by using Wasserstein distance, and the overall experimental flow chart is shown in figure 1.

With this method, the topological similarity of the network layers can be explored to reveal the different life cycles of emergence, expansion, stability, contraction and disappearance that the layers show in different time periods, and to describe the dynamic interaction between the layers.

The technical scheme adopted by the invention is as follows: a time sequence multilayer geographical flow clustering identification method considering topological data analysis comprises the following steps:

(1) Time-sequential geo-stream construction

Trajectory data from 6:00 to 24:00 on August 3 to August 30, 2016 are downloaded and converted, using the TransBigData library in Python, into the OD flow data required by the study. A traffic time-series multilayer network with a time scale of 2 hours is established (9 time layers in total, numbered 0, 1, 2, 3, 4, 5, 6, 7 and 8 in sequence). A 500 m grid is created with ArcGIS to partition the OD flows, and the number of flows whose origin and destination points fall in the same pair of grid cells is taken as the weight between those cells, giving a weighted network with the grid cells as nodes.
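The layering and weighting in step (1) can be sketched in plain Python as follows. This is a minimal illustration and does not use TransBigData or ArcGIS: it assumes trips are already available as origin and destination coordinates in a projected system measured in metres, and it counts flows as directed (origin cell to destination cell). The function names `cell_id`, `layer_index` and `build_layers`, and the sample trips, are hypothetical.

```python
import math
from collections import Counter
from datetime import datetime

CELL_SIZE_M = 500.0  # grid size stated in the text
LAYER_HOURS = 2      # time scale stated in the text
START_HOUR = 6       # layer 0 covers 06:00 to 08:00


def cell_id(x_m, y_m, cell_size=CELL_SIZE_M):
    """Map projected coordinates (metres) to an integer grid cell (column, row)."""
    return (math.floor(x_m / cell_size), math.floor(y_m / cell_size))


def layer_index(ts, start_hour=START_HOUR, layer_hours=LAYER_HOURS):
    """0-based time layer of a departure time, or None before the start hour."""
    if ts.hour < start_hour:
        return None
    return (ts.hour - start_hour) // layer_hours


def build_layers(trips, n_layers=9):
    """trips: iterable of (departure_time, ox, oy, dx, dy).
    Returns one Counter per layer: (origin_cell, dest_cell) -> number of flows."""
    layers = [Counter() for _ in range(n_layers)]
    for ts, ox, oy, dx, dy in trips:
        k = layer_index(ts)
        if k is None or k >= n_layers:
            continue  # outside the 06:00 to 24:00 study window
        layers[k][(cell_id(ox, oy), cell_id(dx, dy))] += 1
    return layers


# Hypothetical trips; coordinates are made up for illustration.
trips = [
    (datetime(2016, 8, 3, 6, 15), 120.0, 80.0, 1300.0, 700.0),
    (datetime(2016, 8, 3, 7, 40), 450.0, 10.0, 1010.0, 990.0),
    (datetime(2016, 8, 3, 23, 5), 1300.0, 700.0, 120.0, 80.0),
    (datetime(2016, 8, 3, 3, 0), 0.0, 0.0, 600.0, 600.0),  # before 06:00, ignored
]
layers = build_layers(trips)
```

Each Counter is the edge list of one layer of the multilayer network, with the flow count as the edge weight between two grid cells.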

(2) Dimensionality reduction of geographical flow

The algorithm is mainly divided into two parts: random walk and generation of representation vectors. First, a number of vertex sequences are extracted from the graph by using a random walk algorithm; then, borrowing the idea of natural language processing, the generated vertex sequences are treated as sentences composed of words, and all the sequences are treated as a large corpus; finally, each vertex is represented as a vector of dimension d by using the natural language processing tool word2vec.

The section mainly applies a deepwalk algorithm in network embedding, and the specific algorithm steps are as follows:

1) The grid cells created with ArcGIS are numbered and each grid cell is regarded as a network node; random walk sequences are then generated. The generating algorithm can be understood as taking a starting point and a path length as input and producing a random walk node sequence by listing the nodes adjacent to the current node and randomly selecting the next node from among them. The specific random walk algorithm is as follows:

Let $f(x)$ be a multivariate function of $n$ variables, where $x = (x_1, x_2, \ldots, x_n)$ is an $n$-dimensional vector.

1. Give an initial iteration point $x$, an initial walking step length $\lambda$, and a control precision $\epsilon$ ($\epsilon$ is a very small positive number used to control the termination of the algorithm).

2. Give the iteration control number $N$, and let $k$ be the current iteration count.

3. When $k < N$, randomly generate an $n$-dimensional vector $u = (u_1, u_2, \ldots, u_n)$ with entries in $(-1, 1)$, that is, $-1 < u_i < 1$, $i = 1, 2, \ldots, n$, and normalize it to obtain

$$u' = \frac{u}{\sqrt{\sum_{i=1}^{n} u_i^2}}$$

Let $x_1 = x + \lambda u'$; this completes the first step of the walk.

4. Calculate the function value. If $f(x_1) < f(x)$, that is, a point better than the initial value has been found, reset $k$ to 1, replace $x$ with $x_1$, and return to step 2; otherwise set $k = k + 1$ and return to step 3.

5. If no better value is found in $N$ consecutive attempts, the optimal solution is considered to lie within the $n$-dimensional ball (an ordinary ball in space if $n = 3$) centered on the current best solution with the current step length as its radius. At this point, if $\lambda < \epsilon$, the algorithm ends; otherwise, let $\lambda = \lambda / 2$, return to step 1 and start a new round of walking.

2) A random walk sequence is generated with each node as a starting point. The word2vec model in the DeepWalk algorithm is trained to embed each network layer into point cloud data. The word2vec algorithm comes in two variants, Skip-Gram and CBOW; the Skip-Gram model is used in the invention. Principal component analysis (PCA) is then used for dimension-reduction visualization, and the point cloud data generated by the embedding are stored. The Skip-Gram algorithm in word2vec can be understood as predicting the context points from the central point; the details of the Skip-Gram model are as follows:

1. First, a point in the point cloud network is selected as the input point.

2. Given the input point, a parameter called skip_window is defined, which represents the number of points selected from one side (left or right) of the current input point; in the invention, skip_window is set to 4. Another parameter, num_skips, represents how many different points are selected from the whole window as output points. With skip_window = 4 and num_skips = 4, two sets of training data in the form (input point, output point) are obtained.

3. The neural network outputs a probability distribution based on the training data; each probability represents the likelihood that a point in the dictionary is the output point.

(3) Geo-stream cluster identification

To cluster multilayer time-series geographical flows, the inventors introduce persistence diagrams, a multi-lens tool of topological data analysis, into the method and calculate the Wasserstein distance between each pair of persistence diagrams, so as to cluster the multilayer geographical flows of different time sequences. This depicts the dynamic interaction of the multilayer geographical flows and enriches research on the dynamic organization of urban space. The experimental results can provide decision support for sustainable city management.

The rationale behind the clustering method is as follows: if the local neighborhoods of two points are similar in shape at all resolution scales, they are close enough to be grouped into a cluster. To compare shapes, the inventors performed the following steps:

1) Consider $X_n = \{x_1, \ldots, x_n\}$ in some metric space $(X, D)$.

2) Set resolution thresholds $V_1 < V_2 < \cdots < V_K$ and construct a Vietoris-Rips (VR) filtration.

3) Calculate the local topological summary of $x_i$ in the form of a persistence diagram $PD(i)$, $i = 1, \ldots, n$.

4) For all local neighborhoods $N(i)$ of $x_i$ and $N(j)$ of $x_j$, $i, j = 1, 2, \ldots, n$, calculate the pairwise dissimilarity of topology, or data shape, as the Wasserstein distance between their respective persistence diagrams $PD(i)$ and $PD(j)$:

$$W_2(N(i), N(j)) = \inf_{\gamma} \left( \sum_{x \in PD(i) \cup \Delta} \lVert x - \gamma(x) \rVert_\infty^2 \right)^{1/2}$$

Here $\Delta = \{(x, x) \mid x \in \mathbb{R}\}$, $\gamma$ ranges over the bijections from $PD(i) \cup \Delta$ to $PD(j) \cup \Delta$, and the Wasserstein distance allows the inventors to quantify systematically how similar the shapes of two node neighborhoods are. That is, the inventors calculate and compare all loops, holes and other topological features in each node neighborhood.

5) Form a distance graph $G$ from $W_2(N(i), N(j))$, $i, j = 1, 2, \ldots, n$, with adjacency matrix $A$.

The cutoff point $k$ is defined by an elbow plot or cross-validation.

6) The connected components of $G$ are the clusters obtained. Persistence-diagram clustering therefore uses both the distance function and the local geometric information around each point. The persistence diagrams of the multilayer geographical flows in the different time periods are clustered according to the Wasserstein distance, which takes the topological structure of the data fully into account, improves the clustering result, depicts the dynamic interaction of the data and enriches research on the dynamic organization of urban space. The experimental results can provide decision support for sustainable city management.
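For very small diagrams, the 2-Wasserstein distance described in step 4) can be computed by brute force, which makes the role of the bijection and of the diagonal explicit. The following is a minimal sketch under stated assumptions: diagrams are lists of finite (birth, death) pairs, the ground metric is the infinity norm, and all matchings are enumerated, which is only feasible for a handful of points. The function names and the sample diagrams are hypothetical; a library routine would be used in practice.

```python
from itertools import permutations


def linf(p, q):
    """Infinity-norm distance between two diagram points."""
    return max(abs(p[0] - q[0]), abs(p[1] - q[1]))


def diag_cost(p):
    """Infinity-norm distance from a point (birth, death) to the diagonal."""
    return (p[1] - p[0]) / 2.0


def wasserstein2(diag_a, diag_b):
    """2-Wasserstein distance by enumerating all matchings (tiny inputs only).

    Each diagram is padded with diagonal slots, so any point may be matched
    either to a point of the other diagram or to the diagonal.
    """
    n, m = len(diag_a), len(diag_b)
    size = n + m

    def cost(i, j):
        if i < n and j < m:
            return linf(diag_a[i], diag_b[j])  # point to point
        if i < n:
            return diag_cost(diag_a[i])        # point of A to diagonal
        if j < m:
            return diag_cost(diag_b[j])        # diagonal to point of B
        return 0.0                             # diagonal to diagonal

    best = min(
        sum(cost(i, j) ** 2 for i, j in enumerate(perm))
        for perm in permutations(range(size))
    )
    return best ** 0.5


# Hypothetical diagrams of two network layers.
pd_a = [(0.0, 1.0), (0.0, 2.0)]
pd_b = [(0.0, 1.0), (0.0, 2.5)]
w = wasserstein2(pd_a, pd_b)
```

Here the best matching pairs the two identical points at zero cost and moves (0, 2) to (0, 2.5), giving a distance of 0.5; matching either point to the diagonal would cost more.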

The specific embodiment of the invention is as follows:

Example 1

To cluster multilayer time-series geographical flows, the inventors introduce persistence diagrams, a multi-lens tool of topological data analysis, into the method and calculate the Wasserstein distance between each pair of persistence diagrams, so as to cluster the multilayer geographical flows of different time sequences. This depicts the dynamic interaction of the multilayer geographical flows and enriches research on the dynamic organization of urban space. The experimental results can provide decision support for sustainable city management. The specific implementation is as follows:

(1) The trajectory data are converted into OD data. Trajectory data from 6:00 to 24:00 on August 3 to August 30, 2016 are downloaded and converted into the OD flow data required by the study by using the TransBigData library in Python.

(2) A time-series multilayer geographical network is constructed. A traffic time-series multilayer network with a time scale of 2 hours is built starting from 6:00; the 9 network layers are shown in FIG. 2. ArcGIS is then used to create a 500 m × 500 m grid to partition the OD flows, and the number of flows whose origin and destination points fall in the same pair of grid cells is taken as the weight between those cells to build a weighted network with the grid cells as nodes; the weight data are shown in FIG. 3.

(3) The grid cells created with ArcGIS are numbered, each grid cell is regarded as a network node, and random walk sequences are generated.

(4) The word2vec model in the DeepWalk algorithm is trained to reduce each network layer to point cloud data. The word2vec algorithm comes in two variants, Skip-Gram and CBOW; the Skip-Gram model is used in the invention, and its algorithmic process is shown in FIG. 4. Dimension reduction is then carried out with principal component analysis (PCA), and finally t-SNE is used for visualization. The point cloud data generated by the embedding are stored, forming 9 groups of point cloud data, as shown in FIG. 5. The Skip-Gram algorithm in word2vec can be understood as predicting the context points from the central point.

(5) To cluster multilayer time-series geographical flows, the inventors introduce persistence diagrams, a multi-lens tool of topological data analysis, into the method. The gudhi library in Python is called to generate persistence diagrams from the point cloud data obtained by network embedding of the traffic network in the different time periods; the persistence diagram for each time period is shown in FIG. 6.

(6) The Wasserstein distances between all pairs of persistence diagrams are calculated, and the multilayer geographical flows of different time sequences are clustered according to these distances. The formula for the Wasserstein distance is as follows:

$$W_2(PD(i), PD(j)) = \inf_{\gamma} \left( \sum_{x \in PD(i) \cup \Delta} \lVert x - \gamma(x) \rVert_\infty^2 \right)^{1/2}$$

Here $\Delta = \{(x, x) \mid x \in \mathbb{R}\}$, $\gamma$ ranges over the bijections from $PD(i) \cup \Delta$ to $PD(j) \cup \Delta$, and the Wasserstein distance allows the inventors to quantify systematically how similar the shapes of two networks are. Through clustering, the topological similarity of the network layers can be explored to reveal the different life cycles of emergence, expansion, stability, contraction and disappearance that the layers show in different time periods, and to depict the dynamic interaction between the layers. On this basis, a Wasserstein distance matrix for the multilayer time-series geographical flows is obtained, as shown in FIG. 7.

(7) Multi-tier time-sequential geo-streaming cluster identification

Hierarchical clustering is performed on the Wasserstein distance matrix generated in step (6) to identify clusters of the time-series multilayer geographical flows, as shown in FIG. 8. The figure shows that the geographical flow patterns from 6:00 to 8:00, from 8:00 to 10:00 and from 16:00 to 18:00 are similar. FIG. 2 shows that the number of flows from residential areas to the various transport stations increases markedly in these periods, probably because residents are travelling between transport stations on their way to and from work, so the corresponding geographical flow patterns are similar. FIG. 8 also shows that the geographical flow patterns from 18:00 to 24:00 are similar, which may be due to similar travel behaviour of residents, since FIG. 2 shows that the number of flows to the shopping malls increases markedly between 18:00 and 24:00. Finally, FIG. 8 shows that the patterns of the three time periods between 10:00 and 16:00 are similar. FIG. 2 shows that the number of flows in these three periods is relatively small, probably because residents are at work during these periods and make few trips, so the geographical flow patterns of these periods are similar.
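The hierarchical clustering in step (7) can be sketched as follows. The linkage criterion is not stated in the text, so this minimal illustration assumes average linkage and a fixed target number of clusters; the function name `agglomerative` and the toy distance matrix are hypothetical and do not reproduce the values of FIG. 7.

```python
def agglomerative(dist, n_clusters):
    """Average-linkage agglomerative clustering on a symmetric distance matrix.

    Starts with one cluster per layer and repeatedly merges the two clusters
    with the smallest mean pairwise distance until n_clusters remain.
    """
    clusters = [[i] for i in range(len(dist))]

    def linkage(a, b):
        # Mean distance over all cross-cluster pairs.
        return sum(dist[i][j] for i in a for j in b) / (len(a) * len(b))

    while len(clusters) > n_clusters:
        candidates = [
            (linkage(clusters[p], clusters[q]), p, q)
            for p in range(len(clusters))
            for q in range(p + 1, len(clusters))
        ]
        _, p, q = min(candidates)
        clusters[p] = clusters[p] + clusters[q]
        del clusters[q]
    return [sorted(c) for c in clusters]


# Hypothetical Wasserstein distances between four time layers.
dist = [
    [0.0, 1.0, 4.0, 5.0],
    [1.0, 0.0, 3.0, 4.5],
    [4.0, 3.0, 0.0, 0.5],
    [5.0, 4.5, 0.5, 0.0],
]
groups = agglomerative(dist, n_clusters=2)
```

In this toy matrix, layers 2 and 3 merge first, followed by layers 0 and 1, leaving two groups of time periods with similar flow patterns.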

In this embodiment, cluster identification of time-series multilayer geographical flows is explored from the viewpoint of topological data analysis. It reveals the different life cycles of emergence, expansion, stability, contraction and disappearance in different time periods, describes the dynamic interaction among them, enriches the study of residents' activities and mobility, and can provide decision support for the management of geographical flow networks such as sustainable urban traffic.

The preferred embodiments of the invention disclosed above are intended to be illustrative only. The preferred embodiments are not intended to be exhaustive or to limit the invention to the precise embodiments disclosed. Obviously, many modifications and variations are possible in light of the above teaching. The embodiments were chosen and described in order to best explain the principles of the invention and its practical application, and thereby to enable others skilled in the art to best understand and utilize the invention. The invention is limited only by the claims and their full scope and equivalents.

Claims

1. A time sequence multilayer geographical flow clustering identification method considering topological data analysis is characterized by comprising the following steps:

S1, time-series geographical flow construction

S2, geographical flow dimension reduction

S3, geographical flow cluster identification

2. The method for time-series multilayer geo-flow cluster identification with consideration of topological data analysis according to claim 1, characterized in that in step S1, the time scale of the traffic time-series multilayer network is 2 hours.

3. The time-series multilayer geographical flow clustering identification method considering topological data analysis according to claim 1, wherein in step S2, the DeepWalk algorithm is mainly divided into two parts, namely random walk and generation of representation vectors; first, a number of vertex sequences are extracted from the graph by using a random walk algorithm; then, borrowing the idea of natural language processing, the generated vertex sequences are regarded as sentences composed of words, and all the sequences are regarded as a large corpus; finally, each vertex is represented as a vector of dimension d by using the natural language processing tool word2vec.

4. The time-series multi-layer geographical flow cluster identification method considering topological data analysis according to claim 3, wherein in step S2, the deepwalk algorithm specifically comprises the following steps:

1) Numbering the grid cells created with ArcGIS and regarding each grid cell as a network node; generating random walk sequences, where the generating algorithm can be understood as taking a starting point and a path length as input and producing a random walk node sequence by listing the nodes adjacent to the current node and randomly selecting the next node from among them;

2) Generating a random walk sequence with each node as a starting point, training the word2vec model in the DeepWalk algorithm to embed each network layer into point cloud data, performing dimension-reduction visualization with principal component analysis, and storing the point cloud data generated by the embedding.

5. The time-series multilayer geographical flow cluster identification method considering topological data analysis according to claim 4, wherein the specific algorithm of random walk in step 1) is as follows:

giving an initial iteration point $x$, an initial walking step length $\lambda$, and a control precision $\epsilon$;

giving the iteration control number $N$, with $k$ being the current iteration count;

when $k < N$, randomly generating an $n$-dimensional vector $u = (u_1, u_2, \ldots, u_n)$ with entries in $(-1, 1)$, that is, $-1 < u_i < 1$, $i = 1, 2, \ldots, n$, and normalizing it to obtain

$$u' = \frac{u}{\sqrt{\sum_{i=1}^{n} u_i^2}}$$

letting $x_1 = x + \lambda u'$, which completes the first step of the walk;

if no better value is found in $N$ consecutive attempts, the optimal solution is considered to lie within the $n$-dimensional ball centered on the current best solution with the current step length as its radius; at this point, if $\lambda < \epsilon$, the algorithm ends; otherwise, letting $\lambda = \lambda / 2$, returning to the first step and starting a new round of walking.

6. The time-series multi-layer geo-flow cluster identification method considering topological data analysis according to claim 5, wherein $\epsilon$ is a very small positive number used to control the termination of the algorithm.

7. The time-series multi-layer geo-stream cluster identification method considering topological data analysis according to claim 4, characterized in that in step 2), principal component analysis is applied to the point cloud data obtained by embedding each network layer for dimension-reduction visualization, and the different life cycles of emergence, expansion, stability, contraction and disappearance shown in different time periods are explored.

8. The topological data analysis-based time-series multi-layer geographical flow cluster identification method according to claim 4, wherein the specific algorithm steps of the Skip-Gram model in step 2) are as follows:

first, selecting a point in the point cloud network as the input point;

given the input point, defining a parameter called skip_window, which represents the number of points selected from one side of the current input point; defining another parameter, num_skips, which represents how many different points are selected from the whole window as output points;

9. The topological data analysis-aware time-series multi-layer geo-flow cluster identification method of claim 4, wherein: in step S3, the basic principle of the clustering method is: if the local neighborhoods of two points are similar in shape at all resolution scales, they are close enough to be grouped into a cluster.

10. The topological data analysis-based time-series multi-layer geo-stream cluster identification method of claim 9, wherein: to compare the shape of the clusters, the following steps are performed:

consider $X_n = \{x_1, \ldots, x_n\}$ in some metric space $(X, D)$;

calculating the local topological summary of $x_i$ in the form of a persistence diagram $PD(i)$, $i = 1, \ldots, n$;

for all local neighborhoods $N(i)$ of $x_i$ and $N(j)$ of $x_j$, $i, j = 1, 2, \ldots, n$, calculating the pairwise dissimilarity of topology, or data shape, as the Wasserstein distance between their respective persistence diagrams $PD(i)$ and $PD(j)$:

$$W_2(N(i), N(j)) = \inf_{\gamma} \left( \sum_{x \in PD(i) \cup \Delta} \lVert x - \gamma(x) \rVert_\infty^2 \right)^{1/2} \qquad (1)$$

in formula (1), $\Delta = \{(x, x) \mid x \in \mathbb{R}\}$, $\gamma$ ranges over the bijections from $PD(i) \cup \Delta$ to $PD(j) \cup \Delta$, and the Wasserstein distance allows the similarity in shape of two node neighborhoods to be quantified systematically;

forming a distance graph $G$ from $W_2(N(i), N(j))$, $i, j = 1, 2, \ldots, n$, with adjacency matrix $A$;

defining the cutoff point $k$ by an elbow plot or cross-validation;

the connected components of $G$ are the resulting clusters.