CN115357811A - Time sequence multilayer geographical flow clustering identification method considering topological data analysis - Google Patents

Info

Publication number
CN115357811A
Authority
CN
China
Prior art keywords
geographical
network
time
layer
data analysis
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202211013111.7A
Other languages
Chinese (zh)
Inventor
李军利
涂有军
张韩
王雅楠
周成
邢文文
王伟印
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Anhui Agricultural University AHAU
Original Assignee
Anhui Agricultural University AHAU
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Anhui Agricultural University AHAU
Priority to CN202211013111.7A
Publication of CN115357811A
Legal status: Pending

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/95 Retrieval from the web
    • G06F16/953 Querying, e.g. by the use of web search engines
    • G06F16/9537 Spatial or temporal dependent retrieval, e.g. spatiotemporal queries
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/24 Querying
    • G06F16/245 Query processing
    • G06F16/2458 Special types of queries, e.g. statistical queries, fuzzy queries or distributed queries
    • G06F16/2465 Query processing support for facilitating data mining operations in structured databases
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/29 Geographical information databases
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/906 Clustering; Classification
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F2216/00 Indexing scheme relating to additional aspects of information retrieval not explicitly covered by G06F16/00 and subgroups
    • G06F2216/03 Data mining

Landscapes

  • Engineering & Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Remote Sensing (AREA)
  • Fuzzy Systems (AREA)
  • Mathematical Physics (AREA)
  • Probability & Statistics with Applications (AREA)
  • Software Systems (AREA)
  • Computational Linguistics (AREA)
  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)

Abstract

The invention belongs to the technical field of data mining, and particularly relates to a time sequence multilayer geographical flow clustering identification method considering topological data analysis, which comprises the following steps: S1, constructing the time sequence geographical flows; S2, reducing the dimension of the geographical flows; and S3, identifying the geographical flow clusters. The method is a brand-new geographical spatio-temporal analysis method that can identify multilayer geographical flow clusters. In order to cluster multilayer time sequence geographical flows, the inventors introduce the persistence diagram, a multi-lens tool from topological data analysis, into the method and compute the Wasserstein distance between each pair of persistence diagrams, so as to cluster the multilayer geographical flows of different time periods; this vividly depicts the dynamic interaction of the multilayer geographical flows and enriches research on the dynamic organization of urban space. The experimental results can provide decision support for sustainable city management.

Description

Time sequence multilayer geographical flow clustering identification method considering topological data analysis
Technical Field
The invention belongs to the technical field of data mining, and particularly relates to a time sequence multi-layer geographical flow clustering identification method considering topological data analysis.
Background
Most current multilayer network clustering methods embed the graph into a Euclidean space through spectral decomposition and therefore do not explicitly consider the local geometry and topology of the underlying graph.
In view of this, the present application provides a time sequence multilayer geographical flow clustering identification method considering topological data analysis, a brand-new geographical spatio-temporal analysis method that can identify multilayer geographical flow clusters.
Disclosure of Invention
The invention aims to overcome the problems in the prior art and provide a time sequence multilayer geographical flow clustering identification method considering topological data analysis.
In order to achieve the technical purpose and achieve the technical effect, the invention is realized by the following technical scheme:
A time sequence multilayer geographical flow clustering identification method considering topological data analysis comprises the following steps:
S1, time sequence geographical flow construction
Download trajectory data of the target city and convert it into the origin-destination (OD) flow data required by the study using the TransBigData library in Python; establish a traffic time sequence multilayer network; then create a grid with ArcGIS to partition the OD flows, take the number of flows whose origin and destination points fall in the same pair of grids as the weight between those grids, and establish a geographical flow weight network with the grids as nodes;
S2, dimension reduction of the geographical flows
Apply the DeepWalk network embedding algorithm, which uses a random walk model, to the obtained geographical flow weight network to embed it into point cloud data; on this basis, compute the node sequences of each network layer and detect dynamic urban mobility communities from the time-dependent multilayer network;
S3, geographical flow cluster identification
Generate the corresponding persistence diagrams from the obtained point cloud data using the gudhi package in Python, compute the Wasserstein distances between the network layers, and use these distances to perform cluster identification of the geographical flows.
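As an illustration of what a persistence diagram in step S3 records, the following minimal sketch computes only the 0-dimensional part (connected components) of a Vietoris-Rips filtration for a tiny point cloud, using the standard-library only. It is not the gudhi implementation named in the text; the function name `h0_persistence` and the sample points are hypothetical, and the filtration value of an edge is assumed to be its Euclidean length.

```python
import math

def h0_persistence(points):
    """Return (birth, death) pairs of connected components in a Rips filtration."""
    n = len(points)
    # Edges enter the filtration in order of increasing length.
    edges = sorted(
        (math.dist(points[i], points[j]), i, j)
        for i in range(n) for j in range(i + 1, n)
    )
    parent = list(range(n))

    def find(a):
        # Union-find root lookup with path halving.
        while parent[a] != a:
            parent[a] = parent[parent[a]]
            a = parent[a]
        return a

    pairs = []
    for length, i, j in edges:
        ri, rj = find(i), find(j)
        if ri != rj:
            parent[ri] = rj
            # Every component is born at scale 0 and dies when it merges.
            pairs.append((0.0, length))
    pairs.append((0.0, math.inf))  # the last component never dies
    return pairs

cloud = [(0.0, 0.0), (1.0, 0.0), (3.0, 0.0)]  # hypothetical embedded nodes
diagram = h0_persistence(cloud)
```

Higher-dimensional features such as loops require a full simplicial complex, which is what a dedicated library such as gudhi provides.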
Further, in the time-series multi-layer geographical flow clustering identification method considering the topological data analysis, in step S1, the time scale of the traffic time-series multi-layer network is 2 hours.
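A minimal sketch of how trips could be assigned to 2-hour network layers is given below. The 2-hour time scale comes from the text; the 06:00 start is taken from the embodiment, and the function name `layer_index` and the sample timestamps are hypothetical.

```python
from datetime import datetime

LAYER_HOURS = 2  # time scale stated in the text
START_HOUR = 6   # assumption: layers start at 06:00, as in the embodiment

def layer_index(timestamp):
    """Return the 0-based layer of a trip start time, or None before 06:00."""
    if timestamp.hour < START_HOUR:
        return None
    return (timestamp.hour - START_HOUR) // LAYER_HOURS

# Hypothetical trip start times.
trips = [
    datetime(2016, 8, 3, 6, 15),
    datetime(2016, 8, 3, 17, 40),
    datetime(2016, 8, 3, 23, 59),
]
layers = [layer_index(t) for t in trips]
```

With this convention the period from 06:00 to 24:00 yields nine layers, numbered 0 to 8.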
Further, in the time-series multi-layer geographical flow clustering identification method considering the topological data analysis, in step S2 the DeepWalk algorithm is mainly divided into two parts: random walk and generation of representation vectors. First, a number of vertex sequences are extracted from the graph using a random walk algorithm; then, borrowing an idea from natural language processing, the generated vertex sequences are regarded as sentences composed of words, and all the sequences together as a large corpus; finally, each vertex is represented as a vector of dimension d using the natural language processing tool word2vec.
Further, in the time-series multi-layer geographical flow clustering identification method considering the topological data analysis, in step S2, the deepwalk algorithm specifically includes the following steps:
1) Number the grids created with ArcGIS and regard each grid as a network node; run the algorithm that generates random walk sequences, which takes a starting point and a path length as input and produces a random walk node sequence by collecting the nodes adjacent to the current node and randomly selecting the next node from them;
2) Generate a random walk sequence with each node as the starting point, train the word2vec model of the DeepWalk algorithm, embed each network layer into point cloud data, perform dimension reduction and visualization using principal component analysis, and store the point cloud data generated by the embedding.
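The walk generation described in steps 1) and 2) can be sketched as follows. This is a minimal illustration, assuming the next node is chosen uniformly among the adjacent nodes (the text does not say whether flow weights bias the choice); the function name `random_walk` and the 4-grid network are hypothetical.

```python
import random

def random_walk(adjacency, start, path_length, rng):
    """Generate one node sequence of at most path_length nodes from start."""
    walk = [start]
    while len(walk) < path_length:
        neighbours = sorted(adjacency.get(walk[-1], ()))
        if not neighbours:
            break  # dead end: no adjacent grid to move to
        walk.append(rng.choice(neighbours))
    return walk

# Hypothetical 4-grid network; keys and values are grid numbers.
adjacency = {0: {1, 2}, 1: {0, 2}, 2: {0, 1, 3}, 3: {2}}
rng = random.Random(42)
# One walk per node as the starting point, as in step 2).
walks = [random_walk(adjacency, node, 5, rng) for node in sorted(adjacency)]
```

The resulting list of walks plays the role of the corpus of sentences that is passed to word2vec.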
Further, in the time-series multi-layer geographical flow clustering identification method considering the topological data analysis, the specific algorithm of random walk in the step 1) is as follows:
let $f(x)$ be a multivariate function of $n$ variables, where $x = (x_1, x_2, \dots, x_n)$ is an n-dimensional vector;
1. give an initial iteration point $x$, an initial step length $\lambda$, and a control precision $\epsilon$;
2. give the iteration control number $N$, and let $k$ be the current iteration count, with $k = 1$;
3. while $k < N$, randomly generate an n-dimensional vector $u = (u_1, u_2, \dots, u_n)$ with $-1 < u_i < 1$, $i = 1, 2, \dots, n$, and normalize it to obtain
$$u' = \frac{u}{\sqrt{\sum_{i=1}^{n} u_i^2}}$$
let $x^{(1)} = x + \lambda u'$, which completes the first step of the walk;
4. compute the function value; if $f(x^{(1)}) < f(x)$, that is, a point better than the current one has been found, reset $k$ to 1, replace $x$ with $x^{(1)}$, and return to step 2; otherwise let $k = k + 1$ and return to step 3;
5. if no better value is found in $N$ consecutive attempts, the optimal solution is considered to lie within an n-dimensional sphere centered on the current best solution with the current step length as its radius; at this point, if $\lambda < \epsilon$, the algorithm ends; otherwise let $\lambda = \lambda / 2$, return to step 1 and start a new round of walking.
Further, in the time-series multi-layer geographical flow cluster identification method considering the topological data analysis, ε is a very small positive number used to terminate the algorithm.
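The random walk search described above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the inventors' implementation: the step length is halved after each unsuccessful round, the random direction is normalized to unit length, and the function names, the test function `sphere`, and all parameter values are hypothetical.

```python
import math
import random

def random_walk_minimize(f, x0, step, eps, max_tries, rng):
    """Minimize f by repeated random steps, halving the step when stuck."""
    x, fx, lam = list(x0), f(x0), step
    while True:
        k = 1
        while k < max_tries:
            # Random direction with components in (-1, 1), scaled to unit length.
            u = [rng.uniform(-1.0, 1.0) for _ in x]
            norm = math.sqrt(sum(ui * ui for ui in u))
            if norm == 0.0:
                continue  # degenerate draw, try again
            candidate = [xi + lam * ui / norm for xi, ui in zip(x, u)]
            f_candidate = f(candidate)
            if f_candidate < fx:
                # Better point found: accept it and reset the failure counter.
                x, fx, k = candidate, f_candidate, 1
            else:
                k += 1
        if lam < eps:
            return x, fx  # step length below the tolerance: stop
        lam /= 2.0  # otherwise start a new round with half the step

def sphere(x):
    """Hypothetical test function with its minimum at the origin."""
    return sum(xi * xi for xi in x)

best_x, best_f = random_walk_minimize(
    sphere, [2.0, -1.5], step=1.0, eps=1e-4, max_tries=100, rng=random.Random(0)
)
```

Note that this procedure searches a continuous space for the minimum of f, which is a different operation from generating a node sequence on a graph.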
Further, in the time-series multi-layer geo-stream cluster identification method considering topological data analysis, in this step, the point cloud data embedded in each layer of the network is subjected to principal component analysis, visualization and dimension reduction, and life histories of occurrence, expansion, stability, contraction and disappearance which are different in different time periods are explored.
Further, in the time-series multilayer geo-flow clustering identification method considering the topological data analysis, the specific algorithm steps of the Skip-Gram model in the step 2) are as follows:
first, select a point in the point cloud network as the input point;
once the input point is available, define a parameter called skip_window, which represents the number of points selected from one side of the current input point; define another parameter, num_skips, which represents how many different points are selected from the whole window as output points;
the neural network outputs a probability distribution based on the training data, where each probability represents the likelihood that the corresponding point in the dictionary is an output point.
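The role of the two window parameters can be sketched as follows. This minimal illustration only builds the (input point, output point) training pairs from one walk sequence and does not train a network; it assumes the parameter names `skip_window` and `num_skips`, truncates the window at the ends of the sequence, and uses a hypothetical sequence of grid numbers.

```python
import random

def skipgram_pairs(sequence, skip_window, num_skips, rng):
    """Build (input point, output point) pairs from one walk sequence."""
    pairs = []
    for centre in range(len(sequence)):
        lo = max(0, centre - skip_window)
        hi = min(len(sequence), centre + skip_window + 1)
        # Positions within skip_window on either side of the input point.
        context = [i for i in range(lo, hi) if i != centre]
        # Pick num_skips distinct positions (fewer if the window is short).
        for i in rng.sample(context, min(num_skips, len(context))):
            pairs.append((sequence[centre], sequence[i]))
    return pairs

walk = [12, 7, 3, 9, 5, 1, 8, 4, 6]  # hypothetical sequence of grid numbers
pairs = skipgram_pairs(walk, skip_window=4, num_skips=4, rng=random.Random(0))
```

Each input point here contributes four pairs, because every position in this walk has at least four neighbours inside its window.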
Further, in the time-series multi-layer geo-stream cluster identification method considering the topology data analysis, in step S3, the basic principle of the cluster method is as follows: if the local neighborhoods of two points are similar in shape at all resolution scales, they are close enough to be grouped into a cluster.
Further, in the time-series multi-layer geo-stream cluster identification method considering topology data analysis, in order to compare the shapes of the clusters, the following steps are performed:
consider $X_n = (x_1, \dots, x_n)$ in some metric space $(X, D)$;
set resolution thresholds $V_1 < V_2 < \dots < V_K$ and construct a Vietoris-Rips (VR) filtration
[equation images not reproduced in the source text: the VR filtration]
compute a local topological summary of each $x_i$ in the form of a persistence diagram $PD(i)$, $i = 1, \dots, n$;
for all local neighborhoods $N(i)$ of $x_i$ and $N(j)$ of $x_j$, $i, j = 1, 2, \dots, n$, compute the dissimilarity of the pairwise topology, or data shape, as the Wasserstein distance between their respective persistence diagrams $PD(i)$ and $PD(j)$:
$$W_2(N(i), N(j)) = \inf_{\gamma} \left( \sum_{x \in PD(i) \cup \Delta} \lVert x - \gamma(x) \rVert_\infty^2 \right)^{1/2} \qquad (1)$$
in formula (1), $\Delta = \{(x, x) \mid x \in \mathbb{R}\}$ and $\gamma$ ranges over all bijections from $PD(i) \cup \Delta$ to $PD(j) \cup \Delta$; the Wasserstein distance makes it possible to systematically quantify how similar the shapes of two node neighborhoods are;
form a distance graph $G$ on $W_2(N(i), N(j))$, $i, j = 1, 2, \dots, n$, with adjacency matrix $A$, where
[equation image not reproduced in the source text: definition of the entries of $A$]
the cutoff point $k$ is determined by an elbow plot or cross validation;
the connected components of $G$ are the resulting clusters.
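To make the Wasserstein distance between two persistence diagrams concrete, the following minimal sketch evaluates it by brute force for very small diagrams. It is an illustration only, assuming order 2 with the infinity norm as the ground distance and finite (birth, death) points; the function name and the sample diagrams are hypothetical, and a practical computation would rely on a dedicated library rather than on enumerating all matchings.

```python
import itertools
import math

def wasserstein_distance(pd_a, pd_b, order=2):
    """Brute-force Wasserstein distance between two small persistence diagrams."""
    def to_diagonal(p):
        # Infinity-norm distance from (birth, death) to the nearest diagonal point.
        return (p[1] - p[0]) / 2.0

    def cost(p, q):
        if p is None and q is None:
            return 0.0  # diagonal matched to diagonal costs nothing
        if p is None:
            return to_diagonal(q)
        if q is None:
            return to_diagonal(p)
        return max(abs(p[0] - q[0]), abs(p[1] - q[1]))

    # Pad each diagram with one diagonal slot (None) per point of the other,
    # so that every point may also be matched to the diagonal.
    left = list(pd_a) + [None] * len(pd_b)
    right = list(pd_b) + [None] * len(pd_a)
    best = math.inf
    for perm in itertools.permutations(range(len(right))):
        total = sum(cost(left[i], right[j]) ** order for i, j in enumerate(perm))
        best = min(best, total)
    return best ** (1.0 / order)

pd_i = [(0.0, 1.0), (0.0, 2.0)]  # hypothetical diagrams
pd_j = [(0.0, 1.5)]
w = wasserstein_distance(pd_i, pd_j)
```

In this example the cheapest matching pairs (0, 2) with (0, 1.5) and sends (0, 1) to the diagonal, each at cost 0.5, so the distance is the square root of 0.5.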
The beneficial effects of the invention are:
The invention provides a time sequence multilayer geographical flow clustering identification method considering topological data analysis, a brand-new geographical spatio-temporal analysis method that can identify multilayer geographical flow clusters. First, the aim is to cluster a multilayer network in an unsupervised setting from the perspective of the similarity of data shapes recorded at multiple resolutions; second, unsupervised learning on multilayer networks remains significantly less developed than supervised community detection and classification. In order to cluster multilayer time sequence geographical flows, the inventors introduce the persistence diagram, a multi-lens tool from topological data analysis, into the method and compute the Wasserstein distance between each pair of persistence diagrams, so as to cluster the multilayer geographical flows of different time periods; this vividly depicts the dynamic interaction of the multilayer geographical flows and enriches research on the dynamic organization of urban space. The experimental results can provide decision support for sustainable city management.
Of course, it is not necessary for any one product that embodies the invention to achieve all of the above advantages simultaneously.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present invention, the drawings used in the description of the embodiments are briefly introduced below, and it is obvious that the drawings in the description below are only some embodiments of the present invention, and it is obvious for those skilled in the art that other drawings can be obtained according to the drawings without creative efforts.
FIG. 1 is a flow chart of the experiment in the embodiment;
FIG. 2 is a schematic diagram of the OD flows in different time periods in the embodiment;
FIG. 3 is a schematic diagram of the weighted network data with grids as nodes in the embodiment;
FIG. 4 is a schematic diagram of the Skip-Gram algorithm in the embodiment;
FIG. 5 is a schematic diagram of the dimension reduction of the geographical flow from 6:00 to 8:00 in the embodiment;
FIG. 6 is a schematic diagram of the persistence diagrams for different time periods in the embodiment;
FIG. 7 is a schematic diagram of the Wasserstein matrix for different time periods in the embodiment;
FIG. 8 is a diagram of the cluster identification result for the time sequence multilayer geographical flows in the embodiment.
Detailed Description
The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
In order to fully utilize the advantages of the model, the invention relates to a time sequence multilayer geographical flow clustering identification method considering topological data analysis, wherein multilayer geographical flows with different time sequences are clustered by using Wasserstein distance, and the overall experimental flow chart is shown in figure 1.
By utilizing the method, the topological similarity of each layer network can be explored to reveal that the layers show different life histories of occurrence, expansion, stability, contraction and disappearance in different time periods, and the dynamic interaction between the layers is described.
The technical scheme adopted by the invention is as follows: a time sequence multilayer geographical flow clustering identification method considering topological data analysis comprises the following steps:
(1) Time-sequential geo-stream construction
Trajectory data from August 3 to August 30, 2016, between 6:00 and 24:00, is downloaded and converted into the OD flow data required by the study using the TransBigData library in Python. A traffic time sequence multilayer network with a time scale of 2 hours is established (9 time periods in total, numbered 0, 1, 2, 3, 4, 5, 6, 7 and 8 in sequence). A 500 m × 500 m grid is then created with ArcGIS to partition the OD flows, and the number of flows whose origin and destination points fall in the same pair of grids is taken as the weight between those grids, establishing a weight network with the grids as nodes.
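The weighting rule can be sketched as follows. This is a minimal illustration that does not use TransBigData or ArcGIS: it assumes OD coordinates are already projected to metres, treats each (origin grid, destination grid) pair as directed, and uses hypothetical names (`grid_id`, `flow_weights`) and made-up coordinates.

```python
from collections import Counter

CELL_SIZE_M = 500.0  # grid size used in the embodiment

def grid_id(x, y):
    """Map a projected coordinate in metres to its (column, row) grid."""
    return (int(x // CELL_SIZE_M), int(y // CELL_SIZE_M))

def flow_weights(od_records):
    """Count flows between each (origin grid, destination grid) pair."""
    weights = Counter()
    for ox, oy, dx, dy in od_records:
        weights[(grid_id(ox, oy), grid_id(dx, dy))] += 1
    return weights

# Hypothetical OD records: origin x, origin y, destination x, destination y.
od_records = [
    (120.0, 430.0, 1310.0, 90.0),
    (480.0, 10.0, 1020.0, 499.0),
    (700.0, 700.0, 120.0, 30.0),
]
weights = flow_weights(od_records)
```

Running the same count separately on the OD records of each 2-hour period gives one weighted network per layer.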
(2) Dimensionality reduction of geographical flow
This section mainly applies the DeepWalk algorithm from network embedding. The algorithm is mainly divided into two parts: random walk and generation of representation vectors. First, a number of vertex sequences are extracted from the graph using a random walk algorithm; then, borrowing an idea from natural language processing, the generated vertex sequences are regarded as sentences composed of words, and all the sequences together as a large corpus; finally, each vertex is represented as a vector of dimension d using the natural language processing tool word2vec.
The specific steps of the algorithm are as follows:
1) Number the created grids using ArcGIS and regard each grid as a network node; run the algorithm that generates random walk sequences, which takes a starting point and a path length as input and produces a random walk node sequence by collecting the nodes adjacent to the current node and randomly selecting the next node from them. The specific random walk algorithm is as follows:
Let $f(x)$ be a multivariate function of $n$ variables, where $x = (x_1, x_2, \dots, x_n)$ is an n-dimensional vector.
1. Give an initial iteration point $x$, an initial step length $\lambda$, and a control precision $\epsilon$ ($\epsilon$ is a very small positive number used to terminate the algorithm).
2. Give the iteration control number $N$, and let $k$ be the current iteration count, with $k = 1$.
3. While $k < N$, randomly generate an n-dimensional vector $u = (u_1, u_2, \dots, u_n)$ with $-1 < u_i < 1$, $i = 1, 2, \dots, n$, and normalize it to obtain
$$u' = \frac{u}{\sqrt{\sum_{i=1}^{n} u_i^2}}$$
Let $x^{(1)} = x + \lambda u'$, which completes the first step of the walk.
4. Compute the function value; if $f(x^{(1)}) < f(x)$, that is, a point better than the current one has been found, reset $k$ to 1, replace $x$ with $x^{(1)}$, and return to step 2; otherwise let $k = k + 1$ and return to step 3.
5. If no better value is found in $N$ consecutive attempts, the optimal solution is considered to lie within an n-dimensional sphere (an ordinary sphere in the three-dimensional case) centered on the current best solution with the current step length as its radius. At this point, if $\lambda < \epsilon$, the algorithm ends; otherwise let $\lambda = \lambda / 2$, return to step 1 and start a new round of walking.
2) Generate a random walk sequence with each node as the starting point, and train the word2vec model of the DeepWalk algorithm to embed each network layer into point cloud data. The word2vec algorithm comes in two variants, Skip-Gram and CBOW; the invention uses the Skip-Gram model. Principal component analysis (PCA) is then used for dimension reduction and visualization, and the point cloud data generated by the embedding is stored. The Skip-Gram algorithm in word2vec can be understood as predicting the context points from the center point. The details of the Skip-Gram model are as follows:
1. First, a point in the point cloud network is selected as the input point.
2. Given the input point, the inventors define a parameter called skip_window, which represents the number of points selected from one side (left or right) of the current input point; in the present invention skip_window is set to 4. Another parameter, num_skips, represents how many different points are selected from the whole window as output points; with skip_window = 4 and num_skips = 4, the inventors obtain training data of the form (input point, output point).
3. The neural network outputs a probability distribution based on the training data, where each probability represents the likelihood that the corresponding point in the dictionary is an output point.
(3) Geo-stream cluster identification
In order to cluster multilayer time sequence geographical flows, the inventors introduce the persistence diagram, a multi-lens tool from topological data analysis, into the method and compute the Wasserstein distance between each pair of persistence diagrams, so as to cluster the multilayer geographical flows of different time periods; this vividly depicts the dynamic interaction of the multilayer geographical flows and enriches research on the dynamic organization of urban space. The experimental results can provide decision support for sustainable city management.
The rationale behind the clustering method is as follows: if the local neighborhoods of two points are similar in shape at all resolution scales, the two points are close enough to be grouped into one cluster. To compare shapes, the inventors perform the following steps:
1) Consider $X_n = (x_1, \dots, x_n)$ in some metric space $(X, D)$.
2) Set resolution thresholds $V_1 < V_2 < \dots < V_K$ and construct a Vietoris-Rips (VR) filtration
[equation images not reproduced in the source text: the VR filtration]
3) Compute a local topological summary of each $x_i$ in the form of a persistence diagram $PD(i)$, $i = 1, \dots, n$.
4) For all local neighborhoods $N(i)$ of $x_i$ and $N(j)$ of $x_j$, $i, j = 1, 2, \dots, n$, compute the dissimilarity of the pairwise topology, or data shape, as the Wasserstein distance between their respective persistence diagrams $PD(i)$ and $PD(j)$:
$$W_2(N(i), N(j)) = \inf_{\gamma} \left( \sum_{x \in PD(i) \cup \Delta} \lVert x - \gamma(x) \rVert_\infty^2 \right)^{1/2}$$
Here $\Delta = \{(x, x) \mid x \in \mathbb{R}\}$ and $\gamma$ ranges over all bijections from $PD(i) \cup \Delta$ to $PD(j) \cup \Delta$; the Wasserstein distance allows the inventors to systematically quantify how similar the shapes of two node neighborhoods are. That is, the inventors compute and compare all loops, holes and other topological features in each node neighborhood.
5) Form a distance graph $G$ on $W_2(N(i), N(j))$, $i, j = 1, 2, \dots, n$, with adjacency matrix $A$, where
[equation image not reproduced in the source text: definition of the entries of $A$]
The cutoff point $k$ is determined by an elbow plot or cross validation.
6) The connected components of $G$ are the resulting clusters. Persistence diagram clustering therefore uses both the distance function and the local geometric information around each point. Clustering the persistence diagrams of the multilayer geographical flows of different time periods according to the Wasserstein distance takes the topological structure of the data fully into account, improves the clustering effect, vividly depicts the dynamic interaction of the data, and enriches research on the dynamic organization of urban space. The experimental results can provide decision support for sustainable city management.
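Steps 5) and 6) can be sketched as follows. Because the definition of the adjacency matrix is not reproduced in the text, this minimal illustration assumes that two items are connected when their distance does not exceed a cutoff; that rule, the function name `threshold_clusters`, and the distance values are all hypothetical.

```python
def threshold_clusters(distance, cutoff):
    """Label the connected components of a thresholded distance graph."""
    n = len(distance)
    # Assumed rule: connect i and j when their distance is at most the cutoff.
    adjacency = [
        [i != j and distance[i][j] <= cutoff for j in range(n)]
        for i in range(n)
    ]
    labels = [None] * n
    current = 0
    for seed in range(n):
        if labels[seed] is not None:
            continue
        # Depth-first search collects everything reachable from the seed.
        stack = [seed]
        labels[seed] = current
        while stack:
            node = stack.pop()
            for other in range(n):
                if adjacency[node][other] and labels[other] is None:
                    labels[other] = current
                    stack.append(other)
        current += 1
    return labels

# Made-up symmetric distance matrix for four items.
distance = [
    [0.0, 0.2, 0.9, 1.0],
    [0.2, 0.0, 0.8, 1.1],
    [0.9, 0.8, 0.0, 0.3],
    [1.0, 1.1, 0.3, 0.0],
]
labels = threshold_clusters(distance, cutoff=0.5)
```

Lowering the cutoff splits the graph into more components, which is why the text selects it with an elbow plot or cross validation.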
The specific embodiment of the invention is as follows:
Example 1
In order to cluster multilayer time sequence geographical flows, the inventors introduce the persistence diagram, a multi-lens tool from topological data analysis, into the method and compute the Wasserstein distance between each pair of persistence diagrams, so as to cluster the multilayer geographical flows of different time periods; this vividly depicts the dynamic interaction of the multilayer geographical flows and enriches research on the dynamic organization of urban space. The experimental results can provide decision support for sustainable city management. The specific implementation is as follows:
(1) Convert the trajectory data into OD data. Trajectory data from August 3 to August 30, 2016, between 6:00 and 24:00, is downloaded and converted into the OD flow data required by the study using the TransBigData library in Python.
(2) Construct the time sequence multilayer geographical network. Starting from 6:00, a traffic time sequence multilayer network with a time scale of 2 hours is built; the 9 network layers are shown in FIG. 2. ArcGIS is then used to create 500 m × 500 m grids to partition the OD flows, and the number of flows whose origin and destination points fall in the same pair of grids is taken as the weight between those grids, building a weight network with the grids as nodes; the weighted network data is shown in FIG. 3.
(3) The created grids are numbered using ArcGIS, each grid is regarded as a network node, and random walk sequences are generated.
(4) The word2vec model of the DeepWalk algorithm is trained to embed each network layer into point cloud data. The word2vec algorithm comes in two variants, Skip-Gram and CBOW; the invention uses the Skip-Gram model, whose procedure is shown in FIG. 4. Dimension reduction is then carried out using principal component analysis (PCA), t-SNE is used for visualization, and the point cloud data generated by the embedding is stored, giving 9 groups of point cloud data, as shown in FIG. 5. The Skip-Gram algorithm in word2vec can be understood as predicting the context points from the center point.
(5) To cluster the multilayer time sequence geographical flows, the inventors introduce the persistence diagram, a multi-lens tool from topological data analysis, into the method. The gudhi library in Python is called to generate a persistence diagram from the point cloud data obtained by network embedding of the traffic network in each time period; the persistence diagram corresponding to each time period is shown in FIG. 6.
(6) The Wasserstein distances between all the persistence diagrams are computed, and the multilayer geographical flows of different time periods are clustered according to these distances. The Wasserstein distance is computed as follows:
$$W_2(PD(i), PD(j)) = \inf_{\gamma} \left( \sum_{x \in PD(i) \cup \Delta} \lVert x - \gamma(x) \rVert_\infty^2 \right)^{1/2}$$
Here $PD(i)$ and $PD(j)$ are the persistence diagrams of two time periods, $\Delta = \{(x, x) \mid x \in \mathbb{R}\}$, and $\gamma$ ranges over all bijections from $PD(i) \cup \Delta$ to $PD(j) \cup \Delta$; the Wasserstein distance allows the inventors to systematically quantify how similar the shapes of two networks are. Through clustering, the topological similarity of the network layers can be explored to reveal the different life histories of occurrence, expansion, stability, contraction and disappearance that the layers show in different time periods, and to depict the dynamic interaction between the layers. On this basis, a Wasserstein matrix for the multilayer time sequence geographical flows is obtained, as shown in FIG. 7.
(7) Multi-tier time-sequential geo-streaming cluster identification
Hierarchical clustering is performed on the Wasserstein matrix generated in step (6) to identify clusters of the time sequence multilayer geographical flows, as shown in FIG. 8. The figure shows that the geographical flow clustering patterns from 6:00 to 8:00, from 8:00 to 10:00 and from 16:00 to 18:00 are similar. FIG. 2 shows that the number of flows from residential areas to the various traffic stations increases markedly in these periods, probably because residents travel between traffic stations on their way to and from work, so the corresponding geographical flow clustering patterns are similar. FIG. 8 also shows that the geographical flow clustering patterns from 18:00 to 24:00 are similar, which may be due to similar travel behavior among residents, since FIG. 2 shows that the number of flows to the shopping malls increases markedly between 18:00 and 24:00. Finally, FIG. 8 shows that the clustering patterns of the three time periods between 10:00 and 16:00 are similar, and FIG. 2 shows that the number of flows in these three periods is relatively small, probably because residents are at work during these periods and make few trips, so the geographical flow clustering patterns of these periods are similar.
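The hierarchical clustering step can be sketched as follows. The text does not state which linkage is used, so this minimal illustration assumes average linkage; the function name `agglomerative` is hypothetical, and the small distance matrix is made up for illustration and is not the Wasserstein matrix of FIG. 7.

```python
def agglomerative(distance, n_clusters):
    """Average-linkage agglomerative clustering on a distance matrix."""
    clusters = [[i] for i in range(len(distance))]

    def linkage(a, b):
        # Mean pairwise distance between the members of two clusters.
        return sum(distance[i][j] for i in a for j in b) / (len(a) * len(b))

    while len(clusters) > n_clusters:
        # Find the closest pair of clusters and merge them.
        candidates = [
            (linkage(clusters[p], clusters[q]), p, q)
            for p in range(len(clusters))
            for q in range(p + 1, len(clusters))
        ]
        _, p, q = min(candidates)
        clusters[p] = clusters[p] + clusters[q]
        del clusters[q]
    return [sorted(c) for c in clusters]

# Made-up distances between four time layers.
layer_distance = [
    [0.0, 1.0, 4.0, 5.0],
    [1.0, 0.0, 3.0, 6.0],
    [4.0, 3.0, 0.0, 2.0],
    [5.0, 6.0, 2.0, 0.0],
]
groups = agglomerative(layer_distance, n_clusters=2)
```

Here layers 0 and 1 merge first, then layers 2 and 3, leaving two groups of layers with similar shape.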
According to the embodiment, the cluster identification of the time sequence multilayer geographic flow is explored from the view point of topological data analysis, so that different life histories of occurrence, expansion, stability, contraction and disappearance can be revealed in different time periods, dynamic interaction among the life histories is described, the study on activities and mobility of residents is enriched, and decision support can be provided for the network management of the geographic flow such as sustainable urban traffic.
The preferred embodiments of the invention disclosed above are intended to be illustrative only. The preferred embodiments are not intended to be exhaustive or to limit the invention to the precise embodiments disclosed. Obviously, many modifications and variations are possible in light of the above teaching. The embodiments were chosen and described in order to best explain the principles of the invention and its practical application, thereby enabling others skilled in the art to best understand and utilize the invention. The invention is limited only by the claims and their full scope and equivalents.

Claims (10)

1. A time sequence multilayer geographical flow clustering identification method considering topological data analysis is characterized by comprising the following steps:
S1, time-series geographical flow construction
Downloading trajectory data of a target city, converting the downloaded trajectory data into the origin-destination (OD) flow data required for the study by using the TransBigData library in Python, establishing a traffic time-series multi-layer network, then creating a grid with ArcGIS to partition the OD flows, taking the number of flows whose O and D points fall in the same pair of grid cells as the weight between those cells, and establishing a geographical flow weight network with the grid cells as nodes;
S2, geographical flow dimensionality reduction
Embedding the obtained geographical flow weight network into point cloud data by applying a random walk model through the DeepWalk network embedding algorithm, calculating on this basis the node sequences of each network layer, and detecting dynamic urban mobility communities from the time-dependent multi-layer network;
S3, geographical flow cluster identification
Generating a corresponding persistence diagram from the obtained point cloud data by using the gudhi package in Python, calculating the Wasserstein distance between the network layers, and performing cluster identification of the geographical flows using that distance.
2. The method for time-series multilayer geo-flow cluster identification with consideration of topological data analysis according to claim 1, characterized in that in step S1, the time scale of the traffic time-series multilayer network is 2 hours.
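The construction in step S1 with the 2-hour time scale can be sketched as follows. This is a minimal illustration in plain Python that does not use TransBigData or ArcGIS; the function names `grid_id` and `build_layers`, the square grid measured in degrees, the directed edges and the toy records are all assumptions made for the example.

```python
from collections import Counter, defaultdict


def grid_id(lon, lat, lon0, lat0, cell):
    """Map a coordinate to a (column, row) cell of a square grid of side `cell`."""
    return (int((lon - lon0) // cell), int((lat - lat0) // cell))


def build_layers(od_flows, lon0, lat0, cell, slot_hours=2):
    """Build one weighted grid-to-grid network per time slot.

    Each record is (hour, (o_lon, o_lat), (d_lon, d_lat)). The weight of an
    edge is the number of flows whose O and D points fall in that pair of cells.
    Assumption: edges are directed from the origin cell to the destination cell.
    """
    layers = defaultdict(Counter)
    for hour, (o_lon, o_lat), (d_lon, d_lat) in od_flows:
        o = grid_id(o_lon, o_lat, lon0, lat0, cell)
        d = grid_id(d_lon, d_lat, lon0, lat0, cell)
        layers[hour // slot_hours][(o, d)] += 1
    return {slot: dict(edges) for slot, edges in layers.items()}


# Hypothetical OD records: (hour of day, origin, destination).
od = [
    (7, (0.05, 0.05), (0.25, 0.15)),
    (7, (0.07, 0.02), (0.21, 0.11)),
    (9, (0.15, 0.05), (0.05, 0.05)),
]
layers = build_layers(od, lon0=0.0, lat0=0.0, cell=0.1)
```

Here the two trips starting at hour 7 fall in the same pair of cells and therefore produce a single edge of weight 2 in the 6:00 to 8:00 layer.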
3. The time-series multilayer geographical flow clustering identification method considering topological data analysis according to claim 1, wherein in step S2, the DeepWalk algorithm consists of two main parts, namely random walk and generation of representation vectors; first, vertex sequences are extracted from the graph by a random walk algorithm; then, borrowing ideas from natural language processing, the generated vertex sequences are regarded as sentences composed of words, and all the sequences together are regarded as a large corpus; finally, each vertex is represented as a d-dimensional vector by using the natural language processing tool word2vec.
4. The time-series multi-layer geographical flow cluster identification method considering topological data analysis according to claim 3, wherein in step S2, the deepwalk algorithm specifically comprises the following steps:
1) Numbering the grid cells created by ArcGIS and regarding each cell as a network node; the algorithm for generating a random walk sequence takes a starting point and a path length as input and generates a random walk node sequence by repeatedly collecting the adjacent nodes of the current node and randomly selecting the next node from them;
2) Generating random walk sequences with each node as a starting point, training the word2vec model of the DeepWalk algorithm to embed each network layer into point cloud data, performing dimensionality reduction and visualization by principal component analysis, and storing the point cloud data generated by the embedding.
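The walk-generation part of steps 1) and 2) can be sketched as follows. The sketch only builds the corpus of node sequences; training word2vec on that corpus is left out. The names `random_walk` and `build_corpus`, the uniform choice among neighbors and the toy adjacency are assumptions for illustration.

```python
import random


def random_walk(adj, start, length, rng):
    """Return a node sequence of at most `length` nodes starting at `start`."""
    walk = [start]
    while len(walk) < length:
        neighbors = list(adj.get(walk[-1], ()))
        if not neighbors:
            break  # dead end: the current node has no outgoing edge
        # Assumption: the next node is drawn uniformly from the adjacent nodes.
        walk.append(rng.choice(neighbors))
    return walk


def build_corpus(adj, walks_per_node, length, seed=0):
    """Generate walks with every node as a starting point."""
    rng = random.Random(seed)
    corpus = []
    for _ in range(walks_per_node):
        for node in adj:
            corpus.append(random_walk(adj, node, length, rng))
    return corpus


# Hypothetical layer: grid cells g1..g3 with flow counts as edge weights.
adj = {
    "g1": {"g2": 3, "g3": 1},
    "g2": {"g1": 3},
    "g3": {"g1": 1},
}
corpus = build_corpus(adj, walks_per_node=2, length=5)
```

Each walk plays the role of a sentence and each grid cell the role of a word, so the resulting `corpus` has the shape a word2vec implementation expects as input. A flow-weighted choice of the next node would be a natural variant, but the claim only states that the next node is selected randomly.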
5. The time-series multilayer geographical flow cluster identification method considering topological data analysis according to claim 4, wherein the specific algorithm of random walk in step 1) is as follows:
let f(x) be a multivariate function of n variables, where x = (x1, x2, …, xn) is an n-dimensional vector;
giving an initial iteration point x, an initial step length λ and a control precision ε;
giving a maximum iteration count N, where k is the current iteration count;
when k < N, randomly generating an n-dimensional vector u = (u1, u2, …, un) with components in (-1, 1) (-1 < ui < 1, i = 1, 2, …, n), and standardizing it to obtain
Figure FDA0003811354050000021
letting x1 = x + λu', which completes the first step of the walk;
calculating the function value; if f(x1) < f(x), that is, a point better than the initial value has been found, resetting k to 1, replacing x with x1, and returning to step 2; otherwise letting k = k + 1 and returning to step 3;
if no better value is found in N consecutive attempts, the optimal solution is considered to lie within the n-dimensional sphere centered at the current optimal solution with the current step length as its radius; at this point, if λ < ε, the algorithm ends; otherwise, letting λ = λ/2, returning to step 1 and starting a new round of walking.
6. The time-series multilayer geographical flow clustering identification method considering topological data analysis according to claim 5, wherein ε is a very small positive number used to control termination of the algorithm.
7. The time-series multilayer geographical flow clustering identification method considering topological data analysis according to claim 4, characterized in that in step 2), the point cloud data obtained by embedding each network layer is reduced in dimension and visualized by principal component analysis, and the life cycle of occurrence, expansion, stability, contraction and disappearance exhibited by the point cloud data in different time periods is explored.
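The random walk search of claim 5 can be sketched as follows. Two points are assumptions, because the source is incomplete there: the standardization formula is only available as an image, so the sketch scales u to unit Euclidean length, and the step update is read as halving λ. The function name `random_walk_minimize` and the test function are hypothetical.

```python
import math
import random


def random_walk_minimize(f, x0, step, eps, n_iter, seed=0):
    """Minimize f by a random walk with a shrinking step length.

    Assumptions: u is scaled to unit Euclidean length, and the step length
    is halved after n_iter consecutive failures.
    """
    rng = random.Random(seed)
    x = list(x0)
    fx = f(x)
    while True:
        k = 1
        while k < n_iter:
            # Random direction with components in (-1, 1), then normalized.
            u = [rng.uniform(-1.0, 1.0) for _ in x]
            norm = math.sqrt(sum(ui * ui for ui in u))
            if norm == 0.0:
                continue
            x1 = [xi + step * ui / norm for xi, ui in zip(x, u)]
            f1 = f(x1)
            if f1 < fx:
                x, fx = x1, f1  # better point found: accept it and reset k
                k = 1
            else:
                k += 1
        if step < eps:
            break  # step length is below the control precision
        step /= 2.0
    return x, fx


def sphere(x):
    """Toy objective with its minimum at the origin."""
    return sum(xi * xi for xi in x)


best_x, best_f = random_walk_minimize(sphere, [3.0, -2.0], step=1.0, eps=1e-4, n_iter=100)
```

With a fixed seed the run is reproducible, and on this toy objective the search ends close to the origin once the step length has dropped below ε.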
8. The time-series multilayer geographical flow clustering identification method considering topological data analysis according to claim 4, wherein the specific algorithm steps of the Skip-Gram model in step 2) are as follows:
firstly, selecting a point in the point cloud network as the input point;
after the input point is selected, defining a parameter skip_window, which represents the number of points selected from each side of the current input point; defining another parameter num_skip, which represents how many different points are selected from the whole window as output points;
the neural network outputs a probability distribution based on the training data, each probability representing the likelihood that the corresponding point in the dictionary is an output point.
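The roles of skip_window and num_skip can be illustrated by generating the (input point, output point) training pairs from one node sequence. This sketch covers only pair generation, not the neural network; the function name `skipgram_pairs` and the toy sequence are hypothetical.

```python
import random


def skipgram_pairs(sequence, skip_window, num_skip, seed=0):
    """Build (input, output) pairs from a node sequence.

    skip_window: number of positions considered on each side of the input.
    num_skip: how many different positions are sampled from the window.
    """
    rng = random.Random(seed)
    pairs = []
    for i, center in enumerate(sequence):
        lo = max(0, i - skip_window)
        hi = min(len(sequence), i + skip_window + 1)
        context = [sequence[j] for j in range(lo, hi) if j != i]
        # Near the ends of the sequence the window holds fewer points.
        for target in rng.sample(context, min(num_skip, len(context))):
            pairs.append((center, target))
    return pairs


walk = ["g1", "g2", "g3", "g4", "g5"]
pairs = skipgram_pairs(walk, skip_window=1, num_skip=2)
```

With skip_window=1 and num_skip=2, every interior node is paired with both of its neighbors in the sequence, and the two end nodes contribute one pair each.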
9. The time-series multilayer geographical flow clustering identification method considering topological data analysis according to claim 4, wherein in step S3, the basic principle of the clustering method is: if the local neighborhoods of two points are similar in shape at all resolution scales, the two points are close enough to be grouped into one cluster.
10. The time-series multilayer geographical flow clustering identification method considering topological data analysis according to claim 9, wherein, to compare the shapes of the clusters, the following steps are performed:
considering Xn = (x1, …, xn) in some metric space (X, D);
setting resolution thresholds V1 < V2 < … < VK and constructing a Vietoris-Rips (VR) filtration
Figure FDA0003811354050000031
Figure FDA0003811354050000032
computing the local topological summary of xi in the form of a persistence diagram PD(i), i = 1, …, n;
for all local neighborhoods N(i) of xi and N(j) of xj, i, j = 1, 2, …, n, calculating the dissimilarity of the pair's topology or data shape as the Wasserstein distance between their respective persistence diagrams PD(i) and PD(j):
Figure FDA0003811354050000033
in formula (1), Δ = {(x, x) | x ∈ R} and γ ranges over the bijections from PD(i) ∪ Δ to PD(j) ∪ Δ; the Wasserstein distance allows the similarity in shape of two node neighborhoods to be quantified systematically;
forming the distance graph G on W2(N(i), N(j)), i, j = 1, 2, …, n, with adjacency matrix A, wherein
Figure FDA0003811354050000041
defining the cut-off point k by an elbow plot or cross-validation;
the connected components of G are the resulting clusters.
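The last steps of claim 10 can be sketched as follows. The adjacency formula is only available as an image in the source, so the rule used here, an edge between i and j whenever their distance is at most the cut-off k, is an assumption, and the distance matrix is a hypothetical example rather than computed Wasserstein distances.

```python
def threshold_clusters(dist, k):
    """Cluster by connected components of a thresholded distance graph.

    Assumption: nodes i and j are adjacent when dist[i][j] <= k.
    """
    n = len(dist)
    adj = [[j for j in range(n) if j != i and dist[i][j] <= k] for i in range(n)]
    seen = set()
    clusters = []
    for start in range(n):
        if start in seen:
            continue
        # Depth-first search collects one connected component.
        stack, component = [start], []
        seen.add(start)
        while stack:
            v = stack.pop()
            component.append(v)
            for w in adj[v]:
                if w not in seen:
                    seen.add(w)
                    stack.append(w)
        clusters.append(sorted(component))
    return clusters


# Hypothetical pairwise distances between four neighborhoods.
toy = [
    [0.0, 0.2, 0.9, 1.0],
    [0.2, 0.0, 0.8, 0.9],
    [0.9, 0.8, 0.0, 0.1],
    [1.0, 0.9, 0.1, 0.0],
]
tight = threshold_clusters(toy, k=0.3)
loose = threshold_clusters(toy, k=0.85)
```

A small cut-off keeps the two tight pairs apart, while a larger one links them through the 0.8 edge into a single component, which is why the cut-off is chosen by an elbow plot or cross-validation.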
CN202211013111.7A 2022-08-23 2022-08-23 Time sequence multilayer geographical flow clustering identification method considering topological data analysis Pending CN115357811A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211013111.7A CN115357811A (en) 2022-08-23 2022-08-23 Time sequence multilayer geographical flow clustering identification method considering topological data analysis

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202211013111.7A CN115357811A (en) 2022-08-23 2022-08-23 Time sequence multilayer geographical flow clustering identification method considering topological data analysis

Publications (1)

Publication Number Publication Date
CN115357811A true CN115357811A (en) 2022-11-18

Family

ID=84002801

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202211013111.7A Pending CN115357811A (en) 2022-08-23 2022-08-23 Time sequence multilayer geographical flow clustering identification method considering topological data analysis

Country Status (1)

Country Link
CN (1) CN115357811A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116975561A (en) * 2023-07-19 2023-10-31 深圳市快速直接工业科技有限公司 Lathe process identification method based on STEP format
CN116975561B (en) * 2023-07-19 2024-04-05 快速直接(深圳)精密制造有限公司 Lathe process identification method based on STEP format

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination