CN106055689A - Spatial clustering method based on time sequence correlation - Google Patents

Spatial clustering method based on time sequence correlation

Info

Publication number
CN106055689A
Authority
CN
China
Prior art keywords
cluster
result
spatial
clustering
spatial point
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201610404636.1A
Other languages
Chinese (zh)
Inventor
杜一
崔文娟
吕菲
周园春
黎建辉
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Computer Network Information Center of CAS
Original Assignee
Computer Network Information Center of CAS
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
2016-06-08
Filing date
2016-06-08
Publication date
2016-10-26
Application filed by Computer Network Information Center of CAS
Priority to CN201610404636.1A
Publication of CN106055689A

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/24 Querying
    • G06F16/245 Query processing
    • G06F16/2458 Special types of queries, e.g. statistical queries, fuzzy queries or distributed queries
    • G06F16/2465 Query processing support for facilitating data mining operations in structured databases
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/24 Querying
    • G06F16/245 Query processing
    • G06F16/2458 Special types of queries, e.g. statistical queries, fuzzy queries or distributed queries
    • G06F16/2474 Sequence data queries, e.g. querying versioned data
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/28 Databases characterised by their database models, e.g. relational or object models
    • G06F16/284 Relational databases
    • G06F16/285 Clustering or classification
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F2216/00 Indexing scheme relating to additional aspects of information retrieval not explicitly covered by G06F16/00 and subgroups
    • G06F2216/03 Data mining

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Mathematical Physics (AREA)
  • Computational Linguistics (AREA)
  • Software Systems (AREA)
  • Probability & Statistics with Applications (AREA)
  • Fuzzy Systems (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention relates to a spatial clustering method based on time sequence correlation. The method comprises the steps of: 1, selecting a set of spatial points to be clustered; 2, according to geographical relationships of the spatial points, carrying out first-time clustering, and clustering the spatial points belonging to the same geographical relationship into one category; 3, determining a time interval T of time sequence data, which is used in the process of carrying out second-time clustering, obtaining a data value of each spatial point in the time interval T, and forming a time sequence; 4, according to clustering results obtained in the step 2 and the time sequence obtained in the step 3, calculating time sequence correlation between any two spatial points in the same category; and 5, for each clustering result in the step 2, combining the time sequence correlation obtained in the step 4 to carry out second-time clustering on each clustering result so as to form a final clustering result. According to the spatial clustering method disclosed by the invention, two-step clustering is used in the spatial object clustering process, and consideration on the characteristics of time sequence correlation between the objects is added, so that the clustering result is more accurate and has greater practical significance.

Description

A spatial clustering method based on time-series correlation
Technical field
The invention belongs to the application of big data and data mining to spatial analysis, and specifically relates to a spatial clustering method based on time-series correlation.
Background art
Clustering is an important part of data mining and a widely used analysis method. With the extensive application of big data and data mining, cluster analysis, as a commonly used method in data analysis, has been explored ever more widely and has been applied effectively in many fields such as image processing, bioinformatics, spatial databases, and artificial intelligence.
The main idea of clustering is to group data objects with high similarity into one cluster, while data objects in different clusters have no or low similarity: objects are similar within a cluster and different between clusters. For cluster analysis, measuring the similarity between data objects is the key to the analysis, and the quality of the clustering result depends on the similarity measure the method uses and on whether the method uncovers additional hidden patterns.
Common clustering methods generally use distance-based similarity measures. Distance is a broad notion: any function that satisfies the four conditions of the distance definition can serve as the distance formula for computing similarity. The four conditions are uniqueness, non-negativity, symmetry, and the triangle inequality. Commonly used distances include the Euclidean distance, the Mahalanobis distance, the Manhattan distance, and the Chebyshev distance. The Euclidean distance is the one most often used; it describes the natural length, that is, the actual distance, between two points in space. The Mahalanobis distance represents the covariance distance of the data; unlike the Euclidean distance, it takes into account the relationships between the various characteristics of the samples. The Manhattan distance is a metric for geometric spaces that gives the sum of the absolute axis distances between two points in a coordinate system. The Chebyshev distance is a metric on vector spaces that defines the distance between two points as the maximum of the differences of their coordinate values. Typical distance-based clustering algorithms include the k-means clustering algorithm, the k-medoids clustering algorithm, agglomerative hierarchical clustering, and divisive hierarchical clustering.
However, for objects that have different spatial positions and also have time-series characteristics, traditional clustering methods are limited and cannot obtain a better clustering result.
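To make three of the distances named above concrete, the following is a minimal Python sketch. The function names are illustrative and do not come from the patent; the Mahalanobis distance is left out because it additionally needs a covariance matrix estimated from the data.

```python
import math
from typing import Sequence


def euclidean(p: Sequence[float], q: Sequence[float]) -> float:
    """Straight-line distance between two points."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))


def manhattan(p: Sequence[float], q: Sequence[float]) -> float:
    """Sum of the absolute per-axis distances."""
    return sum(abs(a - b) for a, b in zip(p, q))


def chebyshev(p: Sequence[float], q: Sequence[float]) -> float:
    """Largest absolute per-axis difference."""
    return max(abs(a - b) for a, b in zip(p, q))


if __name__ == "__main__":
    a, b = (0.0, 0.0), (3.0, 4.0)
    print(euclidean(a, b), manhattan(a, b), chebyshev(a, b))
```

For the two sample points the three measures differ, which illustrates why the choice of distance affects which objects end up in the same cluster.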
Summary of the invention
The object of the invention is to provide a spatial clustering method based on time-series correlation that takes into account some real-world characteristics of the objects. When clustering spatial objects, the method uses two-step clustering and adds consideration of the time-series correlation between objects, so that the clustering result is more accurate and has greater practical significance.
Specifically, the technical scheme of the invention is as follows:
A spatial clustering method based on time-series correlation, comprising the following steps:
1) selecting the set of spatial points to be clustered;
2) carrying out a first clustering according to the geographical relationship of the spatial points, grouping the spatial points that belong to the same geographical relationship into one class;
3) according to the analysis task, determining the time interval T of the time-series data used in the second clustering, and taking the data values of each spatial point within T to form a time series;
4) according to the clustering result obtained in step 2) and the time series obtained in step 3), calculating the time-series correlation between any two spatial points in the same class;
5) for each clustering result of step 2), combining the time-series correlation between spatial points obtained in step 4) and applying a bottom-up method to carry out a second clustering on each clustering result, forming the final clustering result.
Compared with the prior art, the beneficial effects of the invention are as follows:
When clustering real spatial objects, the invention considers not only their spatial distance but also the time-series correlation between the data objects, so that the result of clustering the spatial objects is more realistic and has greater practical research value.
Description of the drawings
Fig. 1 is a flow chart of the steps of the method of the invention.
Fig. 2 shows how the cluster ratio varies with distance; the horizontal axis represents distance and the vertical axis represents the cluster ratio.
Detailed description of the embodiments
To make the above objects, features, and advantages of the invention easier to understand, the invention is further described below through a specific embodiment and the accompanying drawings.
The steps of the spatial clustering method based on time-series correlation of this embodiment are shown in Fig. 1 and are as follows:
Step 1: select the set of spatial points to be clustered. The set includes all points within a certain spatial range, and each point carries time-series data for a time period. For example, when clustering China's air quality monitoring stations, the set of spatial points includes all air quality monitoring stations, and each station carries hourly air quality monitoring data.
The set of spatial points may be all the spatial points within the spatial range, or the spatial points that remain after applying a filtering rule. Such filtering rules include, but are not limited to: the distance being within a particular value, or another indicator (such as precipitation) being within a particular range of values.
Step 2: carry out the first clustering according to the geographical relationship of the spatial points, grouping the spatial points that belong to the same geographical relationship into one class. The geographical relationship may be an administrative division, such as country, province, or city, and can be adjusted according to the situation, for example the extent of the whole space, the temporal density of the data set, and the computing capacity of the host. It may also be a user-defined region, for example a segmentation along mountain ranges and river courses, or a division by spatial objects of urban construction such as railways and expressways.
Step 3: according to the analysis task, determine the time interval T of the time-series data used in the second clustering, and take the data values of each spatial point within T to form a time series.
Step 4: according to the result of the first clustering and the time series obtained in step 3, calculate the time-series correlation between any two spatial points in the same class.
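Steps 2 and 3 can be sketched in a few lines of Python. This is an illustration under assumptions, not the patent's implementation: the names `first_pass_clusters` and `window_series`, the dictionary layout, the half-open interval convention, and the station labels are all hypothetical.

```python
from collections import defaultdict
from typing import Dict, Hashable, List, Tuple


def first_pass_clusters(region_of: Dict[Hashable, Hashable]) -> Dict[Hashable, List[Hashable]]:
    """Step 2 (sketch): group point ids by the region label they belong to."""
    groups: Dict[Hashable, List[Hashable]] = defaultdict(list)
    for point_id, region in region_of.items():
        groups[region].append(point_id)
    return dict(groups)


def window_series(observations: List[Tuple[int, float]], start: int, end: int) -> List[float]:
    """Step 3 (sketch): values with start <= timestamp < end, ordered by time.

    Timestamps are plain integers here (e.g. an hour index); this is an
    assumption made to keep the example short.
    """
    ordered = sorted(observations, key=lambda tv: tv[0])
    return [value for t, value in ordered if start <= t < end]


if __name__ == "__main__":
    regions = {"s1": "Beijing", "s2": "Beijing", "s3": "Tianjin"}
    print(first_pass_clusters(regions))
    print(window_series([(3, 30.0), (1, 10.0), (2, 20.0)], start=1, end=3))
```

Using the same interval T for every point keeps the resulting series aligned and of equal length, which the pairwise correlation in step 4 relies on.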
For example, in this embodiment the first clustering uses administrative divisions: according to the administrative division in which each point lies, the points in the same administrative area are grouped into one cluster. For any two points in each cluster, the Pearson correlation index between the two points is calculated; it is defined as follows:
$$r_{XY} = \frac{\sum_{i=1}^{N} (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i=1}^{N} (x_i - \bar{x})^2}\,\sqrt{\sum_{i=1}^{N} (y_i - \bar{y})^2}}$$
Here $r_{XY}$ ranges from -1 to 1; its sign indicates whether the correlation is positive or negative, and the larger its absolute value, the higher the degree of correlation. $\bar{x}$ and $\bar{y}$ denote the mean values of time series X and Y respectively, $x_i$ and $y_i$ denote the values of time series X and Y at the i-th time point, and N is the number of time points. In addition to the Pearson correlation index, the invention can also use other indices to calculate the time-series correlation, such as the Spearman rank correlation coefficient (Spearman's rank correlation coefficient) and the Kendall rank correlation coefficient (Kendall rank correlation coefficient).
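The Pearson index can be computed directly from its standard definition. The sketch below uses only the Python standard library; `pearson` and `pairwise_correlations` are illustrative names, and rejecting constant series is a choice made here, since the index is undefined when a series has zero variance.

```python
import math
from itertools import combinations
from typing import Dict, Hashable, Sequence, Tuple


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation of two equally long series."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("series must have the same length, at least 2")
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    dev_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    dev_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    if dev_x == 0 or dev_y == 0:
        raise ValueError("correlation is undefined for a constant series")
    return cov / (dev_x * dev_y)


def pairwise_correlations(
    series_by_point: Dict[Hashable, Sequence[float]]
) -> Dict[Tuple[Hashable, Hashable], float]:
    """Correlation for every unordered pair of points in one first-pass cluster."""
    return {
        (p, q): pearson(series_by_point[p], series_by_point[q])
        for p, q in combinations(sorted(series_by_point), 2)
    }


if __name__ == "__main__":
    series = {"s1": [1.0, 2.0, 3.0], "s2": [2.0, 4.0, 6.0], "s3": [3.0, 2.0, 1.0]}
    for pair, r in pairwise_correlations(series).items():
        print(pair, round(r, 3))
```

Computing the correlations only within each first-pass cluster, rather than across all points, keeps the number of pairs small when the spatial range is large.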
Step 5: apply a bottom-up method to carry out the second clustering on each clustering result, forming the final clustering result.
The pseudocode of this second clustering method is given as Algorithm 1 (recluster). The method is a bottom-up clustering method, referred to as the recluster algorithm, and it is an iterative process. The input parameters of the algorithm are the already clustered result clustered, the not yet clustered result unclustered, and length, the number of points left unclustered after the previous execution of recluster. For each result of the first clustering, the initial parameter values of recluster are: clustered is an empty set, which stores the clustering results produced during all executions of recluster; unclustered is the result of the first clustering; and length is the length of unclustered. The algorithm executes the following steps:
1. If the number of spatial points in the unclustered result is the same as the number after the previous execution of recluster, no spatial point in the result meets the clustering requirement. The algorithm terminates and returns, and the clustered result is the result of the second clustering.
2. If the number of spatial points in the unclustered result is 0, all spatial points have been clustered. The algorithm terminates and returns, and the clustered result is the result of the second clustering.
3. Assign the length of unclustered to length, and create a new variable remaining to hold the points that are not clustered in this execution of recluster. If conditions 1 and 2 are not met, then for all spatial points in unclustered, evaluate the correlation of any two points A and B, where the correlation is the time-series correlation calculated in step 4. If the correlation is below a certain threshold, or it is not statistically significant, add B to remaining and remove B from unclustered. (Significance here refers to the statistical assessment of differences between data: only when the difference between the data is significant do the compared data come from different populations, and only then is the resulting correlation value interpretable.)
4. After step 3 has finished, group the points left in unclustered into one class and add it to cluster (cluster denotes "class").
5. Execute the recluster algorithm again with the parameters cluster, remaining, and length.
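The following Python sketch shows one possible reading of the recluster procedure, written as a loop instead of a recursive call. Several points are assumptions and not taken from the patent: point A is fixed as the first point of unclustered in each pass, the significance test is omitted, `corr` is a caller-supplied function, and the length-based stop is dropped because under this reading each pass removes at least the seed point, so the unclustered set always shrinks.

```python
from typing import Callable, Hashable, List, Sequence


def recluster(
    points: Sequence[Hashable],
    corr: Callable[[Hashable, Hashable], float],
    threshold: float = 0.6,
) -> List[List[Hashable]]:
    """Second-pass clustering of one first-pass cluster (illustrative reading)."""
    clustered: List[List[Hashable]] = []
    unclustered = list(points)
    while unclustered:  # stop when every point has been assigned
        seed = unclustered[0]  # assumption: point A is the first remaining point
        kept = [seed]
        remaining: List[Hashable] = []
        for b in unclustered[1:]:
            if corr(seed, b) < threshold:
                remaining.append(b)  # too weakly correlated: defer to a later pass
            else:
                kept.append(b)
        clustered.append(kept)  # points left in this pass form one class
        unclustered = remaining  # next pass works on the deferred points
    return clustered


if __name__ == "__main__":
    table = {
        frozenset(("a", "b")): 0.9,
        frozenset(("a", "c")): 0.1,
        frozenset(("a", "d")): 0.2,
        frozenset(("c", "d")): 0.8,
    }
    lookup = lambda p, q: table[frozenset((p, q))]
    print(recluster(["a", "b", "c", "d"], lookup))
```

Because only correlations with the seed are checked, two members of the same final class are not guaranteed to be strongly correlated with each other; a stricter variant would compare B against every point already kept.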
Taking China's air quality monitoring stations as an example, with PM2.5 as the analysis dimension, city divisions as the basis of the first clustering step, and the correlation threshold preset to 0.6, the method given by the invention is executed. Centered on each city, with the distance from the city as radius r, the change of the cluster ratio with r is calculated; the result is shown in Fig. 2. In Fig. 2, the horizontal axis represents the distance, that is, the set of all spatial points within that distance of a given center point. The vertical axis represents the cluster ratio, defined as the number of clusters after clustering divided by the number of all spatial points (stations). The different colors represent the maximum, minimum, and mean of all cluster ratios obtained with different spatial points as centers. The result in Fig. 2 shows that, as the distance changes, the overall cluster ratio stays at about 40%.
In this method, spatial objects are no longer clustered purely by distance as in traditional clustering, but by the proposed two-step clustering, which considers not only the distance characteristics of spatial objects but also their time-series correlation. Because the corresponding time-series data are taken into account during clustering, the result obtained by the two-step clustering method has greater practical significance. The method also extends the application of traditional clustering methods to spatio-temporal data.
The above embodiment is intended only to illustrate the technical scheme of the invention and not to limit it. A person of ordinary skill in the art may modify the technical scheme of the invention or replace it with equivalents without departing from the spirit and scope of the invention, and the scope of protection of the invention shall be as defined in the claims.
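The cluster ratio used in this example is a simple quotient. The sketch below assumes the second-pass result is a list of clusters, each a list of station ids; `cluster_ratio` is an illustrative name.

```python
from typing import Hashable, List, Sequence


def cluster_ratio(clusters: Sequence[Sequence[Hashable]]) -> float:
    """Number of clusters divided by the number of all clustered points."""
    total_points = sum(len(c) for c in clusters)
    if total_points == 0:
        raise ValueError("cluster ratio is undefined without points")
    return len(clusters) / total_points


if __name__ == "__main__":
    result: List[List[str]] = [["a", "b"], ["c", "d"], ["e"]]
    print(cluster_ratio(result))
```

A ratio near 1 means almost every station forms its own cluster, while a small ratio means stations merge into a few large clusters.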

Claims (8)

1. A spatial clustering method based on time-series correlation, characterized by comprising the following steps:
1) selecting the set of spatial points to be clustered;
2) carrying out a first clustering according to the geographical relationship of the spatial points, grouping the spatial points that belong to the same geographical relationship into one class;
3) according to the analysis task, determining the time interval of the time-series data used in the second clustering, and taking the data values of each spatial point within this time interval to form a time series;
4) according to the clustering result obtained in step 2) and the time series obtained in step 3), calculating the time-series correlation between any two spatial points in the same class;
5) for each clustering result of step 2), combining the time-series correlation obtained in step 4) to carry out a second clustering on each clustering result, forming the final clustering result.
2. The method of claim 1, characterized in that the set of spatial points in step 1) is all the spatial points within a certain spatial range, or the spatial points that remain after applying a filtering rule, and each spatial point carries time-series data for a time period.
3. The method of claim 2, characterized in that the filtering rule includes: the distance being within a particular value, or another indicator being within a particular range of values.
4. The method of claim 1, characterized in that the geographical relationship in step 2) is a geographical relationship divided by administrative division, or a user-defined region.
5. The method of claim 4, characterized in that the administrative division includes but is not limited to country, province, and city, and can be adjusted according to different situations, including adjustment according to the extent of the whole space, the temporal density of the data set, and the computing capacity of the host.
6. The method of claim 4, characterized in that the user-defined region is a region divided along mountain ranges and river courses, or a region divided by spatial objects of urban construction.
7. The method of claim 1, characterized in that the indices for calculating the time-series correlation in step 4) include: the Pearson correlation index, the Spearman rank correlation coefficient, and the Kendall rank correlation coefficient.
8. The method of claim 1, characterized in that step 5) carries out the second clustering on each clustering result by a bottom-up clustering method; the bottom-up clustering method is referred to as the recluster algorithm, and its input parameters are the already clustered result clustered, the not yet clustered result unclustered, and length, the number of points left unclustered after the previous execution of the recluster algorithm; the recluster algorithm executes the following steps:
a) if the number of spatial points in the unclustered result is the same as the number after the previous execution of the recluster algorithm, no spatial point in the result meets the clustering requirement, and the algorithm terminates and returns; the clustered result is the result of the second clustering;
b) if the number of spatial points in the unclustered result is 0, all spatial points have been clustered, and the algorithm terminates and returns; the clustered result is the result of the second clustering;
c) assigning the length of unclustered to length, and creating a new variable remaining to hold the points that are not clustered in this execution of recluster; if conditions a) and b) are not met, then for all spatial points in unclustered, evaluating the time-series correlation of any two points A and B, and if the correlation is below a certain threshold, or it is not statistically significant, adding B to remaining and removing B from unclustered;
d) after step c) has finished, grouping the points left in unclustered into one class and adding it to cluster;
e) executing the recluster algorithm again with the parameters cluster, remaining, and length.
Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201610404636.1A CN106055689A (en) 2016-06-08 2016-06-08 Spatial clustering method based on time sequence correlation

Publications (1)

Publication Number Publication Date
CN106055689A (en) 2016-10-26

Family

ID=57169893

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201610404636.1A Pending CN106055689A (en) 2016-06-08 2016-06-08 Spatial clustering method based on time sequence correlation

Country Status (1)

Country Link
CN (1) CN106055689A (en)

Patent Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103942325A (en) * 2014-04-29 2014-07-23 中南大学 Method for association rule mining of ocean-land climate events with combination of climate subdivision thought
CN105550244A (en) * 2015-12-07 2016-05-04 武汉大学 Adaptive clustering method

Cited By (14)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108564110B (en) * 2018-03-26 2021-07-20 上海电力学院 Air quality prediction method based on clustering algorithm
CN108564110A (en) * 2018-03-26 2018-09-21 上海电力学院 A kind of Air Quality Forecast method based on clustering algorithm
CN110134839A (en) * 2019-03-27 2019-08-16 平安科技(深圳)有限公司 Time series data characteristic processing method, apparatus and computer readable storage medium
CN110134839B (en) * 2019-03-27 2023-06-06 平安科技(深圳)有限公司 Time sequence data characteristic processing method and device and computer readable storage medium
CN110147843A (en) * 2019-05-22 2019-08-20 哈尔滨工程大学 Voice Time Series Similar measure based on metric learning
CN110263791A (en) * 2019-05-31 2019-09-20 京东城市(北京)数字科技有限公司 A kind of method and apparatus in identification function area
CN110263791B (en) * 2019-05-31 2021-11-09 北京京东智能城市大数据研究院 Method and device for identifying functional area
CN110288140B (en) * 2019-06-14 2023-04-07 西北大学 Opioid spatial propagation prediction method based on geographical correlation model
CN110288140A (en) * 2019-06-14 2019-09-27 西北大学 A kind of opioid drug spatial prediction technique based on geo-relevance model
CN110706004A (en) * 2019-06-27 2020-01-17 华南农业大学 Farmland heavy metal pollutant tracing method based on hierarchical clustering
CN110706004B (en) * 2019-06-27 2022-03-29 华南农业大学 Farmland heavy metal pollutant tracing method based on hierarchical clustering
CN113537311A (en) * 2021-06-30 2021-10-22 北京百度网讯科技有限公司 Spatial point clustering method and device and electronic equipment
US20230004751A1 (en) * 2021-06-30 2023-01-05 Beijing Baidu Netcom Science Technology Co., Ltd. Clustering Method and Apparatus for Spatial Points, and Electronic Device
CN113537311B (en) * 2021-06-30 2023-08-04 北京百度网讯科技有限公司 Spatial point clustering method and device and electronic equipment

Similar Documents

Publication Publication Date Title
CN106055689A (en) Spatial clustering method based on time sequence correlation
CN109448370B (en) Traffic control subarea division method based on vehicle track data
CN107610469B (en) Day-dimension area traffic index prediction method considering multi-factor influence
CN108596362B (en) Power load curve form clustering method based on adaptive piecewise aggregation approximation
Prat-Pérez et al. Shaping communities out of triangles
CN107529651A (en) A kind of urban transportation passenger flow forecasting and equipment based on deep learning
CN104462184B (en) A kind of large-scale data abnormality recognition method based on two-way sampling combination
CN105163326B (en) A kind of cell clustering method and system based on wireless network traffic feature
CN101178703B (en) Failure diagnosis chart clustering method based on network dividing
CN101276420A (en) Classification method for syncretizing optical spectrum information and multi-point simulation space information
CN108959958A (en) A kind of method for secret protection and system being associated with big data
Pietrucha-Urbanik Multidimensional comparative analysis of water infrastructures differentiation
CN109871638A (en) A kind of lake and marshland Evaluation of Eutrophication model building method
CN106228190A (en) Decision tree method of discrimination for resident's exception water
CN111125285A (en) Animal geographic zoning method based on species spatial distribution relation
CN111307164A (en) Low-sampling-rate track map matching method
CN110716998B (en) Fine scale population data spatialization method
CN114219370B (en) Social network-based multidimensional influence factor weight analysis method for river water quality
CN113641733B (en) Real-time intelligent estimation method for river cross section flow
CN105243503A (en) Coastal zone ecological safety assessment method based on space variables and logistic regression
Jarvis New measure of the topologic structure of dendritic drainage networks
CN112052405B (en) Passenger searching area recommendation method based on driver experience
CN109285219A (en) A kind of grid type hydrological model grid calculation order encoding method based on DEM
CN112819208A (en) Spatial similarity geological disaster prediction method based on feature subset coupling model
CN109255433B (en) Community detection method based on similarity

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
WD01 Invention patent application deemed withdrawn after publication

Application publication date: 20161026
