CN106055689A

CN106055689A - Spatial clustering method based on time sequence correlation

Info

Publication number: CN106055689A
Application number: CN201610404636.1A
Authority: CN
Inventors: 杜; 杜一; 崔文娟; 吕菲; 周园春; 黎建辉
Original assignee: Computer Network Information Center of CAS
Current assignee: Computer Network Information Center of CAS
Priority date: 2016-06-08
Filing date: 2016-06-08
Publication date: 2016-10-26

Abstract

The invention relates to a spatial clustering method based on time sequence correlation. The method comprises the steps of: 1, selecting a set of spatial points to be clustered; 2, according to geographical relationships of the spatial points, carrying out first-time clustering, and clustering the spatial points belonging to the same geographical relationship into one category; 3, determining a time interval T of time sequence data, which is used in the process of carrying out second-time clustering, obtaining a data value of each spatial point in the time interval T, and forming a time sequence; 4, according to clustering results obtained in the step 2 and the time sequence obtained in the step 3, calculating time sequence correlation between any two spatial points in the same category; and 5, for each clustering result in the step 2, combining the time sequence correlation obtained in the step 4 to carry out second-time clustering on each clustering result so as to form a final clustering result. According to the spatial clustering method disclosed by the invention, two-step clustering is used in the spatial object clustering process, and consideration on the characteristics of time sequence correlation between the objects is added, so that the clustering result is more accurate and has greater practical significance.

Description

A spatial clustering method based on time-series correlation

Technical field

The invention belongs to the application of big data and data mining to spatial analysis, and specifically relates to a spatial clustering method based on time-series correlation.

Background art

Clustering is an important component of data mining and an important analysis method. With the wide application of big data and data mining, cluster analysis, one of the most commonly used methods in data analysis, has been explored more extensively than ever, and it has been applied very effectively in many fields such as image processing, bioinformatics, spatial databases and artificial intelligence.

The main idea of clustering is to group data objects with high similarity into one cluster, while data objects in different clusters have little or no similarity: objects are similar within a cluster and dissimilar between clusters. For cluster analysis, measuring the similarity between data objects is therefore the key to the analysis, and the quality of a clustering result depends on the similarity measure the method uses and on whether the method uncovers more hidden patterns.

Usually, the method for common cluster generally uses method for measuring similarity based on distance.Containing of distance Justice is relatively wide, as long as being that the function of four conditions meeting distance definition all can be as calculating the range formula of similarity, and these four Condition is uniqueness, nonnegativity, symmetry and triangle inequality respectively.Conventional distance calculating method specifically include that European away from From, mahalanobis distance, manhatton distance and Chebyshev's distance.Euclidean distance is the distance of a usual employing, is mainly described in The natural length of two points and actual distance in space；Mahalanobis distance is intended to indicate that the covariance distance of data, mahalanobis distance Unlike Euclidean distance, it mainly considers the relation between the various characteristic of sample；Manhatton distance be then a kind of for The metric form of degree of geometrical quantity space, it designates the summation of two points absolute wheelbase in coordinate system；And Chebyshev away from From being a kind of metric form in vector space, its main thought be by two points between distance definition be its each coordinate values The maximum of difference.In clustering method based on distance, during more typically clustering algorithm specifically includes that k-means clustering algorithm, k- Heart point clustering algorithm, coagulation type hierarchical clustering algorithm and disintegrated type hierarchical clustering algorithm etc..

However, for objects that have different spatial positions and also carry time-series features, traditional clustering methods are limited and cannot produce good clustering results.

Summary of the invention

The object of the invention is to provide a spatial clustering method based on time-series correlation that takes into account certain real-world characteristics of the objects. When clustering spatial objects, the method uses two-step clustering and adds consideration of the time-series correlation between the objects, so that the clustering result is more accurate and has greater practical significance.

Specifically, the technical solution of the invention is as follows:

A spatial clustering method based on time-series correlation, comprising the following steps:

1) selecting the set of spatial points to be clustered;

2) performing a first clustering according to the geographical relationships of the spatial points, grouping the spatial points that belong to the same geographical relationship into one class;

3) for the analysis task, determining the time interval T of the time-series data used in the second clustering, and taking the data values of each spatial point within the time interval T to form a time series;

4) according to the clustering result obtained in step 2) and the time series obtained in step 3), calculating the time-series correlation between any two spatial points in the same class;

5) for each clustering result of step 2), using the time-series correlation between spatial points obtained in step 4) to perform a second clustering on each clustering result by a bottom-up method, forming the final clustering result.

Compared with the prior art, the beneficial effects of the invention are as follows:

When clustering real spatial objects, the invention considers not only their characteristics in terms of spatial distance but also the time-series correlation between the data objects, so that the result of clustering spatial objects is more realistic and has greater practical research significance.

Brief description of the drawings

Fig. 1 is a flowchart of the steps of the method of the invention.

Fig. 2 is a plot of the cluster ratio against distance, in which the horizontal axis represents distance and the vertical axis represents the cluster ratio.

Detailed description of the invention

To make the above objects, features and advantages of the invention clearer and easier to understand, the invention is further described below through a specific embodiment and the accompanying drawings.

The steps of the spatial clustering method based on time-series correlation of this embodiment are shown in Fig. 1 and are as follows:

Step one: select the set of spatial points to be clustered. The set comprises all points within a given spatial range, and each point carries time-series data over a period of time. For example, to cluster China's air quality monitoring stations, the set comprises all air quality monitoring stations, and each station carries hourly air quality measurements.

The set of spatial points may be all the spatial points within the spatial range, or the spatial points that remain after a filtering rule is applied. Such filtering rules include, but are not limited to: the distance lying within a particular value, or another indicator (such as precipitation) lying within a particular numerical range.
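One possible filtering rule of the "distance within a particular value" kind can be sketched as follows. This is an illustration under stated assumptions, not the patent's implementation: the haversine great-circle formula is swapped in as the distance, and the point records (with `id`, `lat` and `lon` keys), the approximate coordinates and the 200 km radius are all hypothetical.

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean Earth radius, used by the haversine formula

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between two (lat, lon) pairs in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlam = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))

def filter_by_distance(points, center, radius_km):
    """Keep the points that lie within radius_km of center (a (lat, lon) tuple)."""
    return [p for p in points
            if haversine_km(center[0], center[1], p["lat"], p["lon"]) <= radius_km]

# Hypothetical stations with approximate coordinates, for illustration only.
points = [
    {"id": "A", "lat": 39.90, "lon": 116.40},
    {"id": "B", "lat": 39.13, "lon": 117.20},
    {"id": "C", "lat": 31.23, "lon": 121.47},
]
selected = filter_by_distance(points, center=(39.90, 116.40), radius_km=200.0)
print([p["id"] for p in selected])
```

A rule on another indicator, such as precipitation within a numerical range, would follow the same pattern with a different predicate.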

Step two: perform the first clustering according to the geographical relationships of the spatial points, grouping the spatial points that belong to the same geographical relationship into one class. The geographical relationship may be an administrative division, such as country, province or city, whose level can be adjusted to the situation, for example the extent of the whole space, the temporal density of the data set, or the computing power of the host. It may also be a user-defined region, for example one delimited by mountain ranges or the course of rivers, or by spatial objects of urban construction such as railways and expressways.
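The first clustering amounts to grouping points by a shared geographical label. The sketch below assumes each point is a record with a hypothetical `id` and a `city` key naming its administrative division; the function name and the sample data are illustrative, not taken from the patent.

```python
from collections import defaultdict

def first_pass_clusters(points, key="city"):
    """Group point ids by the value of a geographical label such as the city."""
    groups = defaultdict(list)
    for point in points:
        groups[point[key]].append(point["id"])
    return dict(groups)

# Hypothetical stations labelled with the city they belong to.
points = [
    {"id": "s1", "city": "X"},
    {"id": "s2", "city": "Y"},
    {"id": "s3", "city": "X"},
]
classes = first_pass_clusters(points)
print(classes)
```

Switching to a coarser or finer division (province instead of city, or a user-defined region) only requires a different label on each record.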

Step three: for the analysis task, determine the time interval T of the time-series data used in the second clustering, and take the data values of each spatial point within the time interval T to form a time series.
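A minimal sketch of this step, assuming each spatial point stores its readings as `(timestamp, value)` pairs and that T is treated as a half-open interval `[start, end)`; both choices, the function name and the sample values are assumptions made for illustration.

```python
from datetime import datetime

def series_in_interval(readings, start, end):
    """Return the values whose timestamp lies in [start, end), ordered by time."""
    return [value for ts, value in sorted(readings) if start <= ts < end]

# Hypothetical hourly readings, deliberately out of order.
readings = [
    (datetime(2016, 1, 1, 2), 35.0),
    (datetime(2016, 1, 1, 0), 30.0),
    (datetime(2016, 1, 1, 1), 32.0),
    (datetime(2016, 1, 1, 3), 50.0),
]
T = (datetime(2016, 1, 1, 0), datetime(2016, 1, 1, 3))
series = series_in_interval(readings, *T)
print(series)
```

For the correlation computed in the next step, the series of two points should cover the same time stamps; handling missing readings is outside this sketch.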

Step four: according to the result of the first clustering and the time series obtained in step three, calculate the time-series correlation between any two spatial points in the same class.

For example, in this embodiment the first clustering uses administrative divisions: according to the administrative division in which each point lies, the points in the same administrative division are grouped into one cluster. For any two points in each cluster, the Pearson correlation coefficient between the two points is calculated, defined as follows:

r_{XY} = \frac{\sum_{i=1}^{N} (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i=1}^{N} (x_i - \bar{x})^2 \sum_{i=1}^{N} (y_i - \bar{y})^2}}

Here, r_{XY} ranges from -1 to 1; its sign indicates whether the correlation is positive or negative, and the larger its absolute value, the stronger the correlation. \bar{x} and \bar{y} denote the means of time series X and Y respectively, x_i and y_i denote the values of X and Y at the i-th time point, and N denotes the number of time points in each series. Besides the Pearson correlation coefficient, the invention may also use other indices to calculate the time-series correlation, such as the Spearman rank correlation coefficient (Spearman's rank correlation coefficient) or the Kendall rank correlation coefficient (Kendall rank correlation coefficient).
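The Pearson coefficient and its pairwise application within one class can be sketched as follows. The helper names `pearson` and `pairwise_correlations`, the point ids and the sample readings are hypothetical; the sketch assumes the series are already aligned in time and of equal length.

```python
import math
from itertools import combinations

def pearson(x, y):
    """Pearson correlation coefficient of two equally long numeric sequences."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("series must have the same length, at least 2")
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    num = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    den = math.sqrt(sum((a - mean_x) ** 2 for a in x) * sum((b - mean_y) ** 2 for b in y))
    if den == 0:
        raise ValueError("correlation is undefined for a constant series")
    return num / den

def pairwise_correlations(series_by_point):
    """Return {(a, b): r} for every unordered pair of points in one class."""
    return {(a, b): pearson(series_by_point[a], series_by_point[b])
            for a, b in combinations(sorted(series_by_point), 2)}

# Hypothetical readings for three points of one first-pass class.
series = {
    "s1": [10.0, 20.0, 30.0, 40.0],
    "s2": [12.0, 19.0, 33.0, 41.0],
    "s3": [40.0, 30.0, 20.0, 10.0],
}
corr = pairwise_correlations(series)
print(corr)
```

Because the correlation is only computed inside a first-pass class, the number of pairs grows with the size of each class rather than with the size of the whole data set.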

Step five: perform a second clustering on each clustering result by a bottom-up method, forming the final clustering result.

The pseudocode of this second clustering method is given as Algorithm 1 (recluster). The method uses a bottom-up clustering procedure, called the recluster algorithm, which is an iterative process. The input parameters of the algorithm are the already clustered result clustered, the not-yet-clustered result unclustered, and length, the number of points left unclustered after the previous execution of recluster. For each result of the first clustering, the initial parameter values of recluster are as follows: clustered is an empty set, which stores the clustering results produced during all executions of recluster; unclustered is the result of the first clustering; and length is the length of unclustered. When the algorithm is executed, its steps are as follows:

1. If the number of spatial points in the unclustered result equals the number after the previous execution of recluster, no spatial points in the result meet the clustering requirement; the algorithm terminates and returns, and clustered is the result of the second clustering.

2. If the number of spatial points in the unclustered result is 0, all spatial points have been clustered; the algorithm terminates and returns, and clustered is the result of the second clustering.

3. If neither condition 1 nor condition 2 is met, length is set to the length of unclustered, and a new variable remaining is created to hold the points that are not clustered in this run of recluster. Then, for all spatial points in unclustered, the correlation of every pair of points A and B is examined; this correlation is the time-series correlation calculated in step four. If the correlation is below a given threshold, or is not statistically significant, B is added to remaining and removed from unclustered. (Statistical significance here is a statistical assessment of the difference between data: only when there is a significant difference between the data, showing that the compared data do not come from the same population, is the resulting correlation value interpretable.)

4. After step 3 has finished, the points left in unclustered are grouped into one class, which is added to cluster (cluster denotes a "class").

5. The recluster algorithm is executed again, with the parameters cluster, remaining and length.
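The steps above can be sketched in Python as follows. This is one reading of the description, not the patent's own code. Several points are assumptions: the correlation is passed in as a hypothetical callable `corr(a, b)`; the significance test is omitted; the initial `length` is passed as `None` so that the check in step 1 does not stop the very first call; and the "every pair A and B" rule is read as a nested scan in which a later point B is moved to `remaining` as soon as it is weakly correlated with a kept point A. The correlation values in the example are invented.

```python
def recluster(clustered, unclustered, length, corr, threshold=0.6):
    """Second-pass, bottom-up clustering of one first-pass class (a sketch)."""
    # Step 1: nothing changed since the previous run, so stop.
    if length is not None and len(unclustered) == length:
        return clustered
    # Step 2: every point has been clustered.
    if not unclustered:
        return clustered
    # Step 3: record the current size and split off weakly correlated points.
    length = len(unclustered)
    kept, remaining = list(unclustered), []
    i = 0
    while i < len(kept):
        a = kept[i]
        survivors = kept[: i + 1]
        for b in kept[i + 1:]:
            if corr(a, b) < threshold:
                remaining.append(b)   # B waits for a later run
            else:
                survivors.append(b)   # B stays with A
        kept = survivors
        i += 1
    # Step 4: the points that stayed form one class.
    clustered.append(kept)
    # Step 5: repeat on the points that were set aside.
    return recluster(clustered, remaining, length, corr, threshold)

# Hypothetical pairwise correlations inside one first-pass class.
table = {
    frozenset("ab"): 0.9, frozenset("ac"): 0.2, frozenset("ad"): 0.7,
    frozenset("bc"): 0.1, frozenset("bd"): 0.3, frozenset("cd"): 0.8,
}
result = recluster([], ["a", "b", "c", "d"], None,
                   corr=lambda p, q: table[frozenset((p, q))])
print(result)
```

Under this reading every pair inside a final class has a correlation at or above the threshold, and each run places at least one point, so the recursion ends when no points are left. A point that correlates with nothing ends up in a class of its own.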

Taking China's air quality monitoring stations as an example, the method of the invention was carried out with PM2.5 as the analysis dimension, city divisions as the basis of the first clustering step, and the correlation threshold preset to 0.6. With each city as the center and the distance from the city as the radius r, the change of the cluster ratio with r was calculated; the result is shown in Fig. 2. In Fig. 2, the horizontal axis represents distance, that is, the set of all spatial points within that distance of a given center point; the vertical axis represents the cluster ratio, defined as the number of clusters after clustering divided by the number of all spatial points (stations). The different colors represent the maximum, minimum and mean of the cluster ratios obtained with different spatial points as the center. Fig. 2 shows that, as the distance changes, the overall cluster ratio stays at about 40%.
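The cluster ratio used in this evaluation can be computed as in the short sketch below. The function name and the sample clusters are illustrative assumptions and do not reproduce the patent's data.

```python
def cluster_ratio(clusters):
    """Number of clusters divided by the total number of clustered points."""
    total_points = sum(len(cluster) for cluster in clusters)
    if total_points == 0:
        raise ValueError("no points to compute a ratio for")
    return len(clusters) / total_points

# Hypothetical final result: three clusters covering five stations.
final_clusters = [["a", "b"], ["c", "d"], ["e"]]
ratio = cluster_ratio(final_clusters)
print(ratio)
```

A ratio of 1 means every point ended up in its own cluster, and a smaller ratio means more points were merged.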

In this method, spatial objects are no longer clustered purely by the traditional distance-based approach; instead, the proposed two-step clustering is used. The method considers not only the characteristics of spatial objects in terms of distance but also their time-series correlation. Because the corresponding time-series data are also taken into account during clustering, the result obtained by the two-step clustering method has greater practical significance. The method also extends the application of traditional clustering methods to spatio-temporal data.

The above embodiment is intended only to illustrate the technical solution of the invention, not to limit it. A person of ordinary skill in the art may modify the technical solution of the invention or replace it with equivalents without departing from the spirit and scope of the invention, and the scope of protection of the invention shall be as defined in the claims.

Claims

1. A spatial clustering method based on time-series correlation, characterized by comprising the following steps:

1) selecting the set of spatial points to be clustered;

2) performing a first clustering according to the geographical relationships of the spatial points, grouping the spatial points that belong to the same geographical relationship into one class;

3) for the analysis task, determining the time interval of the time-series data used in the second clustering, and taking the data values of each spatial point within this time interval to form a time series;

4) according to the clustering result obtained in step 2) and the time series obtained in step 3), calculating the time-series correlation between any two spatial points in the same class;

5) for each clustering result of step 2), using the time-series correlation obtained in step 4) to perform a second clustering on each clustering result, forming the final clustering result.

2. the method for claim 1, it is characterised in that step 1) in the set of described spatial point is certain spatial dimension Whole spatial point, or the spatial point filtered out after applying certain filtering rule, and each spatial point comprises Time series data in time period.

3. method as claimed in claim 2, it is characterised in that described filtering rule includes: distance within a particular value, Or other indexs are within the scope of certain special value.

4. the method for claim 1, it is characterised in that step 2) described geographical relationship be by administrative division divide ground Reason relation, or self-defining region.

5. method as claimed in claim 4, it is characterised in that described administrative division includes but not limited to country, province, city City, and can being adjusted according to different situations, including the scope according to whole spaces, data set sequential density, main frame Computing capability is adjusted.

6. method as claimed in claim 4, it is characterised in that described self-defining region is according to mountain range, river trend The region divided, or the region divided according to the spatial object of urban construction.

7. the method for claim 1, it is characterised in that step 4) index that calculates described timing dependence includes: skin Ademilson correlation metric, Spearman rank correlation coefficient, Kendall rank correlation coefficient.

8. the method for claim 1, it is characterised in that step 5) by bottom-up clustering method to each cluster Result carries out secondary cluster；Described bottom-up clustering method is referred to as recluster algorithm, and its input parameter is for have gathered Result clustered of class, result unclustered not clustered and last recluster algorithm do not cluster after performing Length length；The execution step of this recluster algorithm is as follows:

If in the result a) not clustered, after spatial point number performs with last recluster algorithm, number is identical, and knot is described Without meeting the spatial point clustered required in Guo, algorithm performs to terminate, and returns；Wherein clustered result is secondary Cluster result；

If b) in non-cluster result, spatial point number is 0, illustrating that all spatial point are complete cluster, algorithm performs to terminate, Return；Wherein clustered result is the result of secondary cluster；

C) length of length is entered as the length of unclustered, and creates a new variable save at this Recluster does not carries out the some remaining clustered；Condition as being unsatisfactory for a), b), then in unclustered All spatial point, it is judged that the timing dependence of its any two point A Yu B, if its dependency is less than a certain threshold value, or it is not There is significance, then B is added in remaining, and B is removed from unclustered；

D) after step c) is finished, residue unclustered being gathered is a class, and adds in cluster.

E) re-executing algorithm recluster, the parameter of use is cluster, remaining, length.