CN103473255A - Data clustering method and system, and data processing equipment - Google Patents

Data clustering method and system, and data processing equipment


Publication number
CN103473255A
Authority
CN
China
Prior art keywords
data
center
blocks
distance
calculating
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN2013102234517A
Other languages
Chinese (zh)
Inventor
曹付元
黄哲学
梁吉业
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shenzhen Institute of Advanced Technology of CAS
Original Assignee
Shenzhen Institute of Advanced Technology of CAS
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shenzhen Institute of Advanced Technology of CAS
Priority to CN2013102234517A
Publication of CN103473255A
Legal status: Pending

Landscapes

  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention relates to the field of data processing and provides a data clustering method, a data clustering system, and data processing equipment. The method comprises the following steps: inputting a data set of n objects with block data characteristics to be clustered, together with an expected number of classes k; selecting k block data objects from the data set as initial class centers; calculating the distance from each object to each initial class center; assigning each block data object to its nearest center according to the calculated distances, forming k disjoint classes; calculating the center of each class as the new class center; repeating the assignment step and the class center calculation step until the algorithm converges; and obtaining the partition of the data set. The method, system, and equipment process data with block characteristics directly, without compressing the block data, so that information loss is avoided and the clustering result is better than that obtained after compressing the block data.

Description

A data clustering method, system, and data processing equipment
Technical field
The invention belongs to the field of data processing and relates in particular to a data clustering method, a data clustering system, and data processing equipment.
Background
With the rapid development of automatic data generation and acquisition technologies, many fields have produced massive data recording the details of people's behavior, which makes behavior pattern mining possible. Data describing the behavior of collected objects share a common characteristic: the behavior of each object is characterized by a set of many records. We call the set of records describing one object's behavioral characteristics a block of data (block data). For example, a customer's purchasing behavior or calling behavior is reflected in that customer's purchase details or call details over a period of time. Mining block data in depth helps us analyze and predict customer behavior. However, current machine learning algorithms cannot process block data directly; the data must first be converted into a standard form, which causes latent behavioral characteristics in the data to be ignored.
Summary of the invention
The object of the present invention is to provide a data clustering method, a data clustering system, and data processing equipment, so as to solve the problem in the prior art that current machine learning algorithms cannot process block data directly: the data must first be converted into a standard form, so latent behavioral characteristics in the data may be ignored.
The present invention is achieved as follows. A data clustering method comprises the following steps:
Inputting a data set consisting of n objects with block data characteristics to be clustered, and an expected number of classes k;
Selecting k block data objects from the data set as initial class centers;
Calculating the distance from each object to each initial class center;
Assigning each block data object to its nearest center according to the calculated distances, forming k disjoint classes;
Calculating the center of each class as the new class center;
Repeating the step of assigning each block data object to its nearest center according to the calculated distances to form k disjoint classes, and the step of calculating the center of each class as the new class center, until the algorithm converges, thereby obtaining the partition of the data set.
Another object of the present invention is to provide a data clustering system, the system comprising:
An input module, configured to input a data set consisting of n objects with block data characteristics to be clustered, and an expected number of classes k;
A selection module, configured to select k block data objects from the data set as initial class centers;
A distance calculation module, configured to calculate the distance from each object to each initial class center;
An assignment module, configured to assign each block data object to its nearest center according to the calculated distances, forming k disjoint classes;
A class center calculation module, configured to calculate the center of each class as the new class center;
A loop control module, configured to control repetition of the object assignment step and the class center calculation step until the algorithm converges, thereby obtaining the partition of the data set.
Another object of the present invention is to provide data processing equipment comprising the data clustering system described above.
In the present invention, the data set is partitioned into different classes by an iterative process, so that the criterion function evaluating clustering performance reaches its optimum. First, k block data objects (k being the expected number of classes) are randomly selected from the data set as initial class centers. Then, using the distance defined between block data objects, the distance from each block object in the data set to each initial class center is calculated, and each block object is assigned to its nearest center, forming k classes. The center of each class is calculated by the inclusion-exclusion principle and taken as the new class center. The object assignment and class center calculation steps are repeated until the algorithm converges. The embodiment of the present invention can quickly cluster block data, which is widespread in the real world, and is an efficient and practical partitional clustering method. The embodiment can process data with block characteristics directly, without compressing the block data, which avoids information loss, and the clustering result obtained is better than that obtained after compressing the block data. In addition, the embodiment can also process large-scale data.
Brief description of the drawings
Fig. 1 is a flowchart of the data clustering method provided by an embodiment of the present invention.
Fig. 2 shows the clustering result for 34 cities provided by an embodiment of the present invention.
Fig. 3 is a structural diagram of the data clustering system provided by an embodiment of the present invention.
Detailed description of the embodiments
To make the purpose, technical solution, and beneficial effects of the present invention clearer, the present invention is described in further detail below with reference to the drawings and embodiments. It should be understood that the specific embodiments described here only explain the present invention and are not intended to limit it.
Fig. 1 shows the flow of the data clustering method provided by an embodiment of the present invention, which comprises the following steps:
In step S101, a data set consisting of n objects with block data characteristics to be clustered and an expected number of classes k are input;
In an embodiment of the present invention, suppose the data set to be clustered is $X = \{x_1, x_2, \ldots, x_n\}$, where

$$x_i = \begin{bmatrix} x_{i,1,1} & x_{i,1,2} & \cdots & x_{i,1,m} \\ x_{i,2,1} & x_{i,2,2} & \cdots & x_{i,2,m} \\ \vdots & \vdots & & \vdots \\ x_{i,r,1} & x_{i,r,2} & \cdots & x_{i,r,m} \end{bmatrix}$$

is the i-th object, described by m attributes and r detail records. We call $x_i$ a block data object, and k is the expected number of classes.
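The representation above can be illustrated with a small sketch. The toy values, the choice of a list of tuples per object, and the helper name `value_domain` are all assumptions made for illustration and do not appear in the patent text; the sketch also lets r differ from object to object, which matches the weather example (364 versus 365 records) but is an interpretation.

```python
# Hypothetical toy data: each block data object is a list of detail records,
# and each record is a tuple of m categorical attribute values.
m = 3  # number of attributes (illustrative)

x1 = [(0, 2, 1), (0, 3, 1), (1, 2, 1)]   # object with r = 3 detail records
x2 = [(4, 2, 0), (4, 2, 0)]              # object with r = 2 detail records

X = [x1, x2]   # data set X = {x_1, ..., x_n}; here n = 2
k = 2          # expected number of classes

def value_domain(block, i):
    """Set of distinct values a block data object takes under attribute i."""
    return {record[i] for record in block}

# Per-attribute value domains of x1.
domains_x1 = [value_domain(x1, i) for i in range(m)]
```

Keeping the full list of records, rather than a single summary row per object, is what allows the later steps to work on the block data without compressing it first.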
In step S102, k block data objects are selected from the data set as initial class centers;
In an embodiment of the present invention, the step of selecting k block data objects from the data set X as the initial class centers $c_1, c_2, \ldots, c_k$ is specifically: randomly selecting k block objects from the data set X as the initial class centers.
In step S103, the distance from each object to each initial class center is calculated;
In step S104, each block data object is assigned to its nearest center according to the calculated distances, forming k disjoint classes;
In an embodiment of the present invention, the distance between objects depends on the dissimilarity between their attribute values. The distance between block data objects is measured by the formula
[Figure BDA00003313805500041: formula image, not reproduced in the text]
where x and y denote two block data objects, $A_i$ and $B_i$ denote the value domains of the two objects under the i-th attribute, and m is the number of features (attributes) describing an object.
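Because the formula itself survives only as an image reference, the following sketch does not reproduce the patent's distance. It assumes, purely for illustration, a mean Jaccard distance between the per-attribute value domains $A_i$ and $B_i$; the function name `jaccard_block_distance` and the handling of empty domains are likewise assumptions.

```python
def jaccard_block_distance(domains_x, domains_y):
    """Assumed stand-in distance: mean Jaccard distance over the m attributes.

    domains_x[i] and domains_y[i] play the roles of A_i and B_i, the value
    domains of the two block data objects under attribute i.
    """
    m = len(domains_x)
    if m == 0 or m != len(domains_y):
        raise ValueError("both objects need the same, non-zero number of attributes")
    total = 0.0
    for a_i, b_i in zip(domains_x, domains_y):
        union = a_i | b_i
        # Two empty domains are treated as identical (assumed convention).
        total += 0.0 if not union else 1.0 - len(a_i & b_i) / len(union)
    return total / m

x = [{0, 1}, {2, 3}, {1}]
y = [{1, 4}, {2}, {0}]
d = jaccard_block_distance(x, y)   # (2/3 + 1/2 + 1) / 3 = 13/18
```

Any dissimilarity that compares $A_i$ with $B_i$ attribute by attribute could be substituted here without changing the rest of the procedure.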
In step S105, the center of each class is calculated as the new class center;
In an embodiment of the present invention, the number of records r that the class center is to contain is first obtained as the mean of the numbers of detail records of all objects in the class. Then, for each dimension (attribute), the frequency with which each element of the value domain occurs across the different objects in the class is counted. If the number of distinct values is greater than r, the r values with the highest frequency are selected as the representatives of that dimension; otherwise, values are taken repeatedly in descending order of frequency until r values have been obtained. Repeating these steps for all m columns yields the representatives of each column, which together form the class center.
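The center calculation described above can be sketched as follows. This is one reading of the text, not a definitive implementation: the function name `class_center`, the rounding of the mean record count, and the tie-breaking by value are assumptions that the patent does not specify.

```python
from collections import Counter
from itertools import cycle, islice

def class_center(blocks):
    """blocks: the block data objects of one class, each a list of m-tuples."""
    if not blocks:
        raise ValueError("a class must contain at least one object")
    m = len(blocks[0][0])
    # r: mean number of detail records per object (rounding rule assumed).
    r = max(1, round(sum(len(b) for b in blocks) / len(blocks)))
    columns = []
    for i in range(m):
        # Count in how many different objects each value of attribute i occurs.
        freq = Counter()
        for b in blocks:
            freq.update({record[i] for record in b})
        # Descending frequency; ties broken by value for determinism (assumed).
        ranked = sorted(freq, key=lambda v: (-freq[v], v))
        # More than r values: keep the top r. Otherwise cycle until r are taken.
        columns.append(list(islice(cycle(ranked), r)))
    # Transpose the m representative columns into r records.
    return [tuple(col[j] for col in columns) for j in range(r)]

a = [(0, 5), (0, 6), (1, 5)]
b = [(0, 5), (2, 5), (2, 5)]
c = [(1, 5), (1, 6)]
center = class_center([a, b, c])
```

In this toy class, r is round(8/3) = 3. The first attribute has exactly three distinct values, so all are kept, while the second has only two, so the most frequent value is repeated to fill the third slot.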
In step S106, steps S104 and S105 are repeated until the algorithm converges, and the partition of the data set is obtained.
In an embodiment of the present invention, the distance between the class centers before and after an iteration is calculated; if this distance is less than a given threshold, the algorithm terminates.
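Steps S101 to S106 can be put together in a compact sketch. It rests on several assumptions that are not in the patent: the distance is a mean Jaccard distance standing in for the unreproduced formula, convergence is tested on the largest center shift, an empty class keeps its previous center, and the names `cluster_blocks`, `initial`, `threshold`, and `max_iter` are hypothetical.

```python
import random
from collections import Counter
from itertools import cycle, islice

def domains(block):
    """Per-attribute value domains of one block data object."""
    return [{rec[i] for rec in block} for i in range(len(block[0]))]

def distance(x, y):
    # Assumed stand-in: mean Jaccard distance between per-attribute value domains.
    dx, dy = domains(x), domains(y)
    return sum(1.0 - len(a & b) / len(a | b) for a, b in zip(dx, dy)) / len(dx)

def center_of(blocks):
    # r: rounded mean number of records; columns filled by descending frequency.
    r = max(1, round(sum(len(b) for b in blocks) / len(blocks)))
    cols = []
    for i in range(len(blocks[0][0])):
        freq = Counter()
        for b in blocks:
            freq.update({rec[i] for rec in b})
        ranked = sorted(freq, key=lambda v: (-freq[v], v))
        cols.append(list(islice(cycle(ranked), r)))
    return [tuple(col[j] for col in cols) for j in range(r)]

def cluster_blocks(X, k, initial=None, threshold=1e-9, max_iter=100, seed=0):
    # S102: random initial centers unless explicit indices are supplied.
    if initial is None:
        initial = random.Random(seed).sample(range(len(X)), k)
    centers = [X[i] for i in initial]
    labels = []
    for _ in range(max_iter):
        # S103/S104: assign every object to its nearest center.
        labels = [min(range(k), key=lambda c: distance(x, centers[c])) for x in X]
        # S105: recompute each class center (empty class keeps its old center).
        new_centers = []
        for c in range(k):
            members = [x for x, lab in zip(X, labels) if lab == c]
            new_centers.append(center_of(members) if members else centers[c])
        # S106: stop once the centers move less than the threshold.
        shift = max(distance(old, new) for old, new in zip(centers, new_centers))
        centers = new_centers
        if shift < threshold:
            break
    return labels, centers

X = [
    [(0, 0), (0, 1)], [(0, 0), (1, 1)], [(0, 1), (1, 0)],
    [(8, 9), (9, 9)], [(8, 8), (9, 9)], [(9, 8), (8, 9)],
]
labels, centers = cluster_blocks(X, k=2, initial=[0, 3])
```

The `initial` argument is there only so that the toy run is reproducible; leaving it as `None` follows the random selection described in step S102.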
The concrete steps of an example implementation of the method provided by the embodiment of the present invention are described in detail below:
1) We downloaded the 2011 weather data of 34 provincial capital cities nationwide (including Hong Kong and Macao) from http://www.wunderground.com/. Except for Shanghai, which has 364 days of data, every city has 365 days of data, so each city's data for the year is a typical block data object. For convenience, we selected 16 attributes without missing values to describe the weather data. Because the attributes are numerical, we discretized the numerical data into 30 categorical values by uniform quantization.
2) Assume the expected number of classes is 2, and select the two cities Taiyuan and Wuhan as the initial class centers.
3) Use the defined distance formula to calculate the distance from each city to Taiyuan and to Wuhan, and assign each block data object to its nearest center.
4) Calculate the class center of each class.
5) Determine whether the distance between the new class centers and the previous class centers (the initial centers in the first iteration) is less than the given threshold.
6) If it is, stop; otherwise return to step 3), until the algorithm converges.
7) The clustering result is shown in Fig. 2, where circles and five-pointed stars denote the two classes obtained, and triangles indicate cities with no 2011 weather data.
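The discretization mentioned in step 1) can be sketched as equal-width binning. The patent only says that uniform quantization into 30 categorical values was used, so the function name `uniform_quantize`, the use of each attribute's own minimum and maximum as the range, and the sample temperatures are assumptions.

```python
def uniform_quantize(values, n_bins=30):
    """Map numerical values to integer codes 0..n_bins-1 using equal-width bins."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # A constant attribute carries a single category.
        return [0 for _ in values]
    width = (hi - lo) / n_bins
    # Clamp so that the maximum value lands in the last bin.
    return [min(int((v - lo) / width), n_bins - 1) for v in values]

# Hypothetical daily temperatures for one attribute.
temperatures = [0.0, 1.0, 15.0, 29.9, 30.0]
codes = uniform_quantize(temperatures)
```

Applied to each of the 16 attributes in turn, a step of this kind would turn every city's year of numerical records into categorical records that a set-based distance and a frequency-based center can work on.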
Fig. 3 shows the structure of the data clustering system provided by an embodiment of the present invention. For ease of explanation, only the parts relevant to the embodiment are shown. The data clustering system comprises: an input module 101, a selection module 102, a distance calculation module 103, an assignment module 104, a class center calculation module 105, and a loop control module 106. The data clustering system may be a software unit, a hardware unit, or a combined software and hardware unit built into data processing equipment.
The input module 101 is configured to input a data set consisting of n objects with block data characteristics to be clustered, and an expected number of classes k;
In an embodiment of the present invention, suppose the data set to be clustered is $X = \{x_1, x_2, \ldots, x_n\}$, where

$$x_i = \begin{bmatrix} x_{i,1,1} & x_{i,1,2} & \cdots & x_{i,1,m} \\ x_{i,2,1} & x_{i,2,2} & \cdots & x_{i,2,m} \\ \vdots & \vdots & & \vdots \\ x_{i,r,1} & x_{i,r,2} & \cdots & x_{i,r,m} \end{bmatrix}$$

is the i-th object, described by m attributes and r detail records. We call $x_i$ a block data object, and k is the expected number of classes.
The selection module 102 is configured to select k block data objects from the data set as initial class centers;
In an embodiment of the present invention, the selection module 102 is specifically configured to randomly select k block objects from the data set X as the initial class centers.
The distance calculation module 103 is configured to calculate the distance from each object to each initial class center;
The assignment module 104 is configured to assign each block data object to its nearest center according to the calculated distances, forming k disjoint classes;
In an embodiment of the present invention, the distance between objects depends on the dissimilarity between their attribute values. The distance between block data objects is measured by the formula
[Figure BDA00003313805500061: formula image, not reproduced in the text]
where x and y denote two block data objects, $A_i$ and $B_i$ denote the value domains of the two objects under the i-th attribute, and m is the number of features (attributes) describing an object.
The class center calculation module 105 is configured to calculate the center of each class as the new class center;
In an embodiment of the present invention, the number of records r that the class center is to contain is first obtained as the mean of the numbers of detail records of all objects in the class. Then, for each dimension (attribute), the frequency with which each element of the value domain occurs across the different objects in the class is counted. If the number of distinct values is greater than r, the r values with the highest frequency are selected as the representatives of that dimension; otherwise, values are taken repeatedly in descending order of frequency until r values have been obtained. Repeating these steps for all m columns yields the representatives of each column, which together form the class center.
The loop control module 106 is configured to control repetition of the object assignment step and the class center calculation step until the algorithm converges, thereby obtaining the partition of the data set.
In an embodiment of the present invention, the distance between the class centers before and after an iteration is calculated; if this distance is less than a given threshold, the algorithm terminates.
In summary, the embodiment of the present invention partitions the data set into different classes by an iterative process, so that the criterion function evaluating clustering performance reaches its optimum. First, k block data objects (k being the expected number of classes) are randomly selected from the data set as initial class centers. Then, using the distance defined between block data objects, the distance from each block object in the data set to each initial class center is calculated, and each block object is assigned to its nearest center, forming k classes. The center of each class is calculated by the inclusion-exclusion principle and taken as the new class center. The object assignment and class center calculation steps are repeated until the algorithm converges. The embodiment of the present invention can quickly cluster block data, which is widespread in the real world, and is an efficient and practical partitional clustering method. The embodiment can process data with block characteristics directly, without compressing the block data, which avoids information loss, and the clustering result obtained is better than that obtained after compressing the block data. In addition, the embodiment can also process large-scale data.
A person of ordinary skill in the art will understand that all or part of the steps of the methods in the above embodiments can be carried out by a program instructing the relevant hardware. The program can be stored in a computer-readable storage medium, such as ROM/RAM, a magnetic disk, or an optical disc.
The foregoing describes only preferred embodiments of the present invention and is not intended to limit it. Any modification, equivalent replacement, or improvement made within the spirit and principles of the present invention shall fall within its scope of protection.

Claims (8)

1. A data clustering method, characterized in that the method comprises the following steps:
Inputting a data set consisting of n objects with block data characteristics to be clustered, and an expected number of classes k;
Selecting k block data objects from the data set as initial class centers;
Calculating the distance from each object to each initial class center;
Assigning each block data object to its nearest center according to the calculated distances, forming k disjoint classes;
Calculating the center of each class as the new class center;
Repeating the step of assigning each block data object to its nearest center according to the calculated distances to form k disjoint classes, and the step of calculating the center of each class as the new class center, until the algorithm converges, thereby obtaining the partition of the data set.
2. The method of claim 1, characterized in that the data set to be clustered is $X = \{x_1, x_2, \ldots, x_n\}$, where $x_i = \begin{bmatrix} x_{i,1,1} & x_{i,1,2} & \cdots & x_{i,1,m} \\ x_{i,2,1} & x_{i,2,2} & \cdots & x_{i,2,m} \\ \vdots & \vdots & & \vdots \\ x_{i,r,1} & x_{i,r,2} & \cdots & x_{i,r,m} \end{bmatrix}$ is the i-th object, described by m attributes and r detail records; $x_i$ is called a block data object; and k is the expected number of classes.
3. The method of claim 1, characterized in that the step of selecting k block data objects from the data set as initial class centers is specifically: randomly selecting k block objects from the data set X as the initial class centers.
4. The method of claim 1, characterized in that the distance between objects depends on the dissimilarity between their attribute values, and the distance between block data objects is measured by the formula
[Figure FDA00003313805400012: formula image, not reproduced in the text]
where x and y denote two block data objects, $A_i$ and $B_i$ denote the value domains of the two objects under the i-th attribute, and m is the number of features (attributes) describing an object.
5. The method of claim 1, characterized in that the number of records r that the class center is to contain is first obtained as the mean of the numbers of detail records of all objects in the class; then, for each dimension, the frequency with which each element of the value domain occurs across the different objects in the class is counted; if the number of distinct values is greater than r, the r values with the highest frequency are selected as the representatives of that dimension; otherwise, values are taken repeatedly in descending order of frequency until r values have been obtained; and the above steps are repeated to obtain the representatives of the m columns, which form the class center.
6. A data clustering system, characterized in that the system comprises:
An input module, configured to input a data set consisting of n objects with block data characteristics to be clustered, and an expected number of classes k;
A selection module, configured to select k block data objects from the data set as initial class centers;
A distance calculation module, configured to calculate the distance from each object to each initial class center;
An assignment module, configured to assign each block data object to its nearest center according to the calculated distances, forming k disjoint classes;
A class center calculation module, configured to calculate the center of each class as the new class center;
A loop control module, configured to control repetition of the object assignment step and the class center calculation step until the algorithm converges, thereby obtaining the partition of the data set.
7. The system of claim 6, characterized in that the selection module is specifically configured to randomly select k block objects from the data set X as the initial class centers.
8. Data processing equipment comprising the system of claim 6 or claim 7.
CN2013102234517A 2013-06-06 2013-06-06 Data clustering method and system, and data processing equipment Pending CN103473255A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN2013102234517A CN103473255A (en) 2013-06-06 2013-06-06 Data clustering method and system, and data processing equipment

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN2013102234517A CN103473255A (en) 2013-06-06 2013-06-06 Data clustering method and system, and data processing equipment

Publications (1)

Publication Number Publication Date
CN103473255A true CN103473255A (en) 2013-12-25

Family

ID=49798105

Family Applications (1)

Application Number Title Priority Date Filing Date
CN2013102234517A Pending CN103473255A (en) 2013-06-06 2013-06-06 Data clustering method and system, and data processing equipment

Country Status (1)

Country Link
CN (1) CN103473255A (en)

Cited By (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103914518A (en) * 2014-03-14 2014-07-09 小米科技有限责任公司 Clustering method and clustering device
CN104281674A (en) * 2014-09-29 2015-01-14 同济大学 Adaptive clustering method and adaptive clustering system on basis of clustering coefficients
CN104391879A (en) * 2014-10-31 2015-03-04 小米科技有限责任公司 Method and device for hierarchical clustering
CN105183855A (en) * 2015-09-08 2015-12-23 浪潮(北京)电子信息产业有限公司 Information classification method and system
CN106776972A (en) * 2016-12-05 2017-05-31 深圳万智联合科技有限公司 A kind of virtual resources integration platform in system for cloud computing
CN106940803A (en) * 2017-02-17 2017-07-11 平安科技(深圳)有限公司 Correlated variables recognition methods and device
CN107392513A (en) * 2017-01-26 2017-11-24 北京小度信息科技有限公司 Order processing method and apparatus
CN107564290A (en) * 2017-10-13 2018-01-09 公安部交通管理科学研究所 A kind of urban road intersection saturation volume rate computational methods
US10037345B2 (en) 2014-03-14 2018-07-31 Xiaomi Inc. Clustering method and device
WO2019169619A1 (en) * 2018-03-09 2019-09-12 深圳大学 Method and apparatus for dividing randomly sampled data sub-blocks of big data

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
于永彦: "基于Jaccard距离与概念聚类的多模型估计", 《计算机工程》 *
冯玉: "数据仓库环境中近似查询处理技术研究", 《中国博士学位论文全文数据库 信息科技辑》 *

Cited By (16)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US10037345B2 (en) 2014-03-14 2018-07-31 Xiaomi Inc. Clustering method and device
WO2015135276A1 (en) * 2014-03-14 2015-09-17 小米科技有限责任公司 Clustering method and related device
CN103914518B (en) * 2014-03-14 2017-05-17 小米科技有限责任公司 Clustering method and clustering device
CN103914518A (en) * 2014-03-14 2014-07-09 小米科技有限责任公司 Clustering method and clustering device
CN104281674A (en) * 2014-09-29 2015-01-14 同济大学 Adaptive clustering method and adaptive clustering system on basis of clustering coefficients
CN104281674B (en) * 2014-09-29 2017-07-11 同济大学 It is a kind of based on the adaptive clustering scheme and system that gather coefficient
CN104391879A (en) * 2014-10-31 2015-03-04 小米科技有限责任公司 Method and device for hierarchical clustering
CN104391879B (en) * 2014-10-31 2017-10-10 小米科技有限责任公司 The method and device of hierarchical clustering
CN105183855A (en) * 2015-09-08 2015-12-23 浪潮(北京)电子信息产业有限公司 Information classification method and system
CN106776972A (en) * 2016-12-05 2017-05-31 深圳万智联合科技有限公司 A kind of virtual resources integration platform in system for cloud computing
CN107392513A (en) * 2017-01-26 2017-11-24 北京小度信息科技有限公司 Order processing method and apparatus
WO2018137330A1 (en) * 2017-01-26 2018-08-02 北京小度信息科技有限公司 Order processing method, device, server and computer storage medium
CN106940803A (en) * 2017-02-17 2017-07-11 平安科技(深圳)有限公司 Correlated variables recognition methods and device
CN106940803B (en) * 2017-02-17 2018-04-17 平安科技(深圳)有限公司 Correlated variables recognition methods and device
CN107564290A (en) * 2017-10-13 2018-01-09 公安部交通管理科学研究所 A kind of urban road intersection saturation volume rate computational methods
WO2019169619A1 (en) * 2018-03-09 2019-09-12 深圳大学 Method and apparatus for dividing randomly sampled data sub-blocks of big data

Similar Documents

Publication Publication Date Title
CN103473255A (en) Data clustering method and system, and data processing equipment
CN103020256B (en) A kind of association rule mining method of large-scale data
US9361343B2 (en) Method for parallel mining of temporal relations in large event file
CN104424235A (en) Method and device for clustering user information
CN103218617B (en) A kind of feature extracting method of polyteny Large space
CN103336771B (en) Data similarity detection method based on sliding window
CN104391879B (en) The method and device of hierarchical clustering
Panapakidis et al. Enhancing the clustering process in the category model load profiling
Rongfei et al. A new clustering method for collaborative filtering
CN107248031B (en) Rapid power consumer classification method aiming at load curve peak-valley difference
CN108764335A (en) A kind of integrated energy system multi-energy requirement typical scene generation method and device
CN106023212A (en) Super-pixel segmentation method based on pyramid layer-by-layer spreading clustering
CN116167581A (en) Battery demand estimation method and device, scheduling method and computer equipment
CN116993555A (en) Partition method, system and storage medium for identifying territory space planning key region
CN103995828A (en) Cloud storage log data analysis method
CN113030954A (en) Ground penetrating radar data SVD distributed algorithm based on Flink
CN116226468B (en) Service data storage management method based on gridding terminal
CN101340458B (en) Grid data copy generation method based on time and space limitation
CN115358354A (en) Rainfall space data restoration and reconstruction method
Huang et al. A grid and density based fast spatial clustering algorithm
CN109241201A (en) A kind of Laplce's centrality peak-data clustering method based on curvature
CN105426626B (en) Multiple-Point Geostatistics modeling method based on set of metadata of similar data pattern cluster
CN106652032B (en) A kind of parallel contour lines creation method of DEM based on Linux cluster platform
CN109856673B (en) High-resolution Radon transformation data separation technology based on dominant frequency iterative weighting
CN110910029A (en) Power load clustering method and system

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20131225

RJ01 Rejection of invention patent application after publication