CN103473255A

CN103473255A - Data clustering method and system, and data processing equipment

Info

Publication number: CN103473255A
Application number: CN2013102234517A
Authority: CN
Inventors: 曹付元; 黄哲学; 梁吉业
Original assignee: Shenzhen Institute of Advanced Technology of CAS
Current assignee: Shenzhen Institute of Advanced Technology of CAS
Priority date: 2013-06-06
Filing date: 2013-06-06
Publication date: 2013-12-25

Abstract

The invention applies to the field of data processing and provides a data clustering method, a data clustering system, and data processing equipment. The method comprises the following steps: inputting a data set to be clustered, consisting of n objects with block data features, together with an expected number of classes k; selecting k block data objects from the data set as initial class centers; calculating the distance from each object to the initial class centers; assigning each block data object to its nearest center according to the calculated distances, forming k disjoint classes; calculating the center of each class as the new class center; repeating the assignment step and the center calculation step until the algorithm converges; and obtaining the partition of the data set. The method, system, and equipment process data with block features directly, without compressing the block data, so information loss is avoided and the clustering result is better than the one obtained after the block data is compressed.

Description

Data clustering method, system, and data processing equipment

Technical field

The invention belongs to the field of data processing and relates in particular to a data clustering method, a data clustering system, and data processing equipment.

Background art

With the rapid development of automatic data generation and acquisition technologies, many fields have produced massive data that record the details of people's behavior, which makes behavior pattern mining possible. Data describing the behavior of the collected objects share a common feature: the behavior of each object is characterized by a set of many records. We call the set of records that captures the behavioral features of one object a block of data (block data). For example, a customer's purchasing behavior or calling behavior is reflected in that customer's purchase details or call details over a period of time. Mining block data in depth helps to analyze and predict customer behavior. However, current machine learning algorithms cannot process block data directly; the data must first be converted into a standard form, which causes the latent behavioral features in the data to be ignored.

Summary of the invention

The object of the present invention is to provide a data clustering method, system, and data processing equipment, intended to solve the problem in the prior art that current machine learning algorithms cannot process block data directly and must first convert it into standard data, so that latent behavioral features in the data may be ignored.

The present invention is realized as a data clustering method comprising the following steps:

Inputting a data set to be clustered, consisting of n objects with block data features, and an expected number of classes k;

Selecting k block data objects from the data set as the initial class centers;

Calculating the distance from each object to the initial class centers;

Assigning each block data object to its nearest center according to the calculated distances, forming k disjoint classes;

Calculating the center of each class as the new class center;

Repeating the step of assigning each block data object to its nearest center to form k disjoint classes, and the step of calculating the center of each class as the new class center, until the algorithm converges, thereby obtaining the partition of the data set.

Another object of the present invention is to provide a data clustering system, the system comprising:

An input module, configured to input a data set to be clustered, consisting of n objects with block data features, and an expected number of classes k;

A selection module, configured to select k block data objects from the data set as the initial class centers;

A distance calculation module, configured to calculate the distance from each object to the initial class centers;

An assignment module, configured to assign each block data object to its nearest center according to the calculated distances, forming k disjoint classes;

A class center calculation module, configured to calculate the center of each class as the new class center;

A loop control module, configured to control the repetition of the object assignment step and the class center calculation step until the algorithm converges, thereby obtaining the partition of the data set.

Another object of the present invention is to provide data processing equipment comprising the data clustering system described above.

In the present invention, the data set is partitioned into different classes by an iterative process so that the criterion function that evaluates clustering performance reaches its optimum. First, k block data objects (k being the expected number of classes) are selected at random from the data set as the initial class centers. Then, according to the definition of the distance between block data objects, the distance from each block object in the data set to the initial class centers is calculated, and each block object is assigned to its nearest center, forming k classes. The center of each class is then calculated by the inclusion-exclusion principle and taken as the new class center. The object assignment step and the class center calculation step are repeated until the algorithm converges. The embodiments of the present invention can quickly cluster the block data that is widespread in the real world, and constitute a partitional clustering method that is both efficient and practical. The embodiments process data with block characteristics directly, without compressing the block data, so information loss is avoided and the clustering result is better than the one obtained after the block data is compressed. In addition, the embodiments can also handle large-scale data.

Brief description of the drawings

Fig. 1 is a flow chart of the data clustering method provided by an embodiment of the present invention.

Fig. 2 shows the clustering result for 34 cities provided by an embodiment of the present invention.

Fig. 3 is a structural diagram of the data clustering system provided by an embodiment of the present invention.

Embodiments

To make the purpose, technical solution, and beneficial effects of the present invention clearer, the invention is described in further detail below with reference to the drawings and embodiments. It should be understood that the specific embodiments described here serve only to explain the invention and are not intended to limit it.

Fig. 1 shows the flow of the data clustering method provided by an embodiment of the present invention, which comprises the following steps:

In step S101, a data set to be clustered, consisting of n objects with block data features, and an expected number of classes k are input;

In this embodiment, suppose the data set to be clustered is X = {x_1, x_2, ..., x_n}, where

x_{i} = \begin{bmatrix} x_{i,1,1} & x_{i,1,2} & \cdots & x_{i,1,m} \\ x_{i,2,1} & x_{i,2,2} & \cdots & x_{i,2,m} \\ \vdots & \vdots & & \vdots \\ x_{i,r,1} & x_{i,r,2} & \cdots & x_{i,r,m} \end{bmatrix}

is the i-th object, described by m attributes and r detail records. We call x_i a block data object. k is the expected number of classes.
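A block data object can be pictured as a small table of r detail records over m attributes. The sketch below shows one possible in-memory representation, assuming categorical values stored as strings; the names `BlockObject`, `records`, and `value_set` are illustrative and do not come from the source.

```python
from dataclasses import dataclass
from typing import List, Set, Tuple


@dataclass
class BlockObject:
    """One block data object: r detail records, each with m attribute values."""
    name: str
    records: List[Tuple[str, ...]]  # r rows, each a tuple of m values

    @property
    def r(self) -> int:
        """Number of detail records."""
        return len(self.records)

    @property
    def m(self) -> int:
        """Number of attributes (0 for an empty block)."""
        return len(self.records[0]) if self.records else 0

    def value_set(self, j: int) -> Set[str]:
        """Distinct values this object takes under attribute j (0-based)."""
        return {row[j] for row in self.records}


# Hypothetical example: one customer's purchase details over a period.
customer = BlockObject(
    name="customer_1",
    records=[("milk", "morning"), ("bread", "morning"), ("milk", "evening")],
)
```

Here `customer.value_set(0)` collects the distinct items bought, which is the kind of per-attribute value set the later steps operate on.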

In step S102, k block data objects are selected from the data set as the initial class centers;

In this embodiment, the step of selecting k block data objects from data set X as the initial class centers c_1, c_2, ..., c_k is specifically: k block objects are selected at random from data set X as the initial class centers.
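Random selection of the initial centers can be sketched as sampling without replacement. The function name `select_initial_centers` and the optional `seed` parameter are assumptions added for reproducibility; the source only states that k block objects are chosen at random.

```python
import random
from typing import List, Optional, Sequence, TypeVar

T = TypeVar("T")


def select_initial_centers(dataset: Sequence[T], k: int,
                           seed: Optional[int] = None) -> List[T]:
    """Pick k distinct objects from the data set as initial class centers."""
    if not 1 <= k <= len(dataset):
        raise ValueError("k must be between 1 and the number of objects")
    rng = random.Random(seed)  # local generator, so global state is untouched
    return rng.sample(list(dataset), k)
```

Sampling without replacement guarantees that the k initial centers are distinct objects of the data set.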

In step S103, the distance from each object to the initial class centers is calculated;

In step S104, each block data object is assigned to its nearest center according to the calculated distances, forming k disjoint classes;

In this embodiment, the distance between objects depends on the differences between their attribute values. The distance between block data objects is measured by a formula (not reproduced in this text) in which x and y denote two block data objects, A_i and B_i denote the value domains of the two objects under the i-th attribute, and m is the number of features, or attributes, describing an object.
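The distance formula itself is not reproduced in this text, so the sketch below substitutes a Jaccard-style dissimilarity between the per-attribute value sets A_i and B_i, summed over the m attributes. This choice is an assumption made for illustration only and should not be read as the patent's formula; the function names are likewise hypothetical. The assignment step then places each object in the class of its nearest center.

```python
from typing import List, Sequence, Set, Tuple

Record = Tuple[str, ...]
Block = List[Record]  # a block data object: a non-empty list of records


def value_set(block: Block, j: int) -> Set[str]:
    """Distinct values of attribute j (0-based) within one block object."""
    return {row[j] for row in block}


def block_distance(x: Block, y: Block) -> float:
    """ASSUMED distance: sum over attributes of the Jaccard dissimilarity
    1 - |A_i & B_i| / |A_i | B_i| between the two objects' value sets."""
    m = len(x[0])
    total = 0.0
    for j in range(m):
        a, b = value_set(x, j), value_set(y, j)
        total += 1.0 - len(a & b) / len(a | b)
    return total


def assign_to_nearest(dataset: Sequence[Block],
                      centers: Sequence[Block]) -> List[int]:
    """Return, for each object, the index of its nearest class center."""
    labels = []
    for obj in dataset:
        distances = [block_distance(obj, c) for c in centers]
        labels.append(distances.index(min(distances)))
    return labels
```

Because every object receives exactly one label, the resulting k classes are disjoint, as step S104 requires.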

In step S105, the center of each class is calculated and taken as the new class center;

In this embodiment, the average number of detail records over all objects in the class is first calculated and taken as the number of records r that the class center will contain. Then, for each dimension, the frequency with which each element of that dimension's value domain occurs across the different objects in the class is counted. If the number of distinct values is greater than r, the r values with the highest frequency are selected as the representatives of that dimension; otherwise, values are taken repeatedly in descending order of frequency until r values have been obtained. Repeating these steps for every dimension yields the representatives of the m columns, which together form the class center.
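The class center update described above can be sketched as follows. Several details are assumptions not fixed by the source: the average record count is rounded with Python's `round`, a value's frequency is the number of objects in the class that contain it, and ties are broken by sorting the values. The function name `class_center` is illustrative.

```python
from collections import Counter
from typing import List, Tuple

Record = Tuple[str, ...]
Block = List[Record]  # a block data object: a non-empty list of records


def class_center(members: List[Block]) -> Block:
    """Build a class center from the block objects assigned to one class."""
    # Step 1: r = average number of detail records (rounding is an assumption).
    r = max(1, round(sum(len(b) for b in members) / len(members)))
    m = len(members[0][0])
    columns = []
    for j in range(m):
        # Step 2: count in how many objects of the class each value occurs.
        freq = Counter()
        for block in members:
            freq.update({row[j] for row in block})
        # Descending frequency; ties broken by value (assumption).
        ranked = sorted(freq, key=lambda v: (-freq[v], v))
        if len(ranked) > r:
            column = ranked[:r]  # keep the r most frequent values
        else:
            # Cycle through the values, most frequent first, until r are taken.
            column = [ranked[i % len(ranked)] for i in range(r)]
        columns.append(column)
    # Step 3: the m columns of representatives together form the center.
    return [tuple(col[i] for col in columns) for i in range(r)]
```

Note that the rows of the resulting center are assembled column by column, so a row need not coincide with any record that actually occurs in the data.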

In step S106, steps S104 and S105 are repeated until the algorithm converges, and the partition of the data set is obtained.

In this embodiment, the distance between the class centers before and after the update is calculated; if this distance is less than a given threshold, the algorithm terminates.
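Steps S102 to S106 can be combined into one loop, sketched below under stated assumptions. The distance function and the center function are passed in as parameters, the convergence test sums the shift of all k centers (the source does not say how the k distances are combined), an empty class keeps its previous center, and `max_iter` is a safety bound that the source does not mention. The toy usage clusters plain numbers with stand-in functions purely to exercise the loop.

```python
import random
from typing import Callable, List, Optional, Sequence, Tuple, TypeVar

T = TypeVar("T")


def cluster_blocks(dataset: Sequence[T], k: int,
                   distance: Callable[[T, T], float],
                   center_of: Callable[[List[T]], T],
                   threshold: float = 1e-9,
                   max_iter: int = 100,
                   seed: Optional[int] = None) -> Tuple[List[int], List[T]]:
    """Iterate assignment and center update until the centers stop moving."""
    rng = random.Random(seed)
    centers = rng.sample(list(dataset), k)  # S102: random initial centers
    labels: List[int] = []
    for _ in range(max_iter):
        # S103/S104: assign every object to its nearest center.
        labels = [min(range(k), key=lambda c: distance(obj, centers[c]))
                  for obj in dataset]
        # S105: recompute each class center (an empty class keeps its center).
        new_centers = []
        for c in range(k):
            members = [obj for obj, lab in zip(dataset, labels) if lab == c]
            new_centers.append(center_of(members) if members else centers[c])
        # S106: stop when the total center shift falls below the threshold.
        shift = sum(distance(old, new)
                    for old, new in zip(centers, new_centers))
        centers = new_centers
        if shift < threshold:
            break
    return labels, centers


# Toy usage with numeric stand-ins for block objects (illustration only).
data = [0.0, 0.1, 0.2, 10.0, 10.1, 10.2]
labels, centers = cluster_blocks(
    data, 2,
    distance=lambda a, b: abs(a - b),
    center_of=lambda ms: sum(ms) / len(ms),
    seed=0,
)
```

Passing the distance and center functions as arguments keeps the loop independent of how block objects are represented.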

The concrete steps of an example implementation of the method provided by the embodiment are described in detail below:

1) We downloaded from http://www.wunderground.com/ the 2011 weather data of 34 provincial capitals nationwide (including Hong Kong and Macao). Shanghai has data for 364 days and every other city has data for 365 days, so each city's data for the year is a typical block of data. For convenience, we selected 16 attributes without missing values to describe the weather data. Because the attributes are numeric features, we used uniform quantization to discretize the numeric data into 30 categorical values.

2) Suppose the expected number of classes is 2, and select the two cities Taiyuan and Wuhan as the initial class centers.

3) Use the defined distance formula to calculate the distance from each city to Taiyuan and to Wuhan, and assign each block data object to its nearest center.

4) Calculate the class center of each class.

5) Determine whether the distance between the new class centers and the previous class centers (the initial centers in the first iteration) is less than the given threshold.

6) If it is less, finish; otherwise return to step 3), until the algorithm converges.

7) The clustering result is shown in Fig. 2, where circles and pentagrams denote the two classes obtained, and triangles denote cities for which no 2011 weather data are available.
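The uniform quantization used in step 1) can be sketched as equal-width binning. The source states only that numeric attributes were discretized into 30 categorical values; taking the bin range from the observed minimum and maximum of each attribute is an assumption, and the function name `uniform_quantize` is illustrative.

```python
from typing import List, Sequence


def uniform_quantize(values: Sequence[float], levels: int = 30) -> List[int]:
    """Map numeric values to integer bins 0..levels-1 of equal width."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # Constant attribute: every value falls into the same bin.
        return [0 for _ in values]
    width = (hi - lo) / levels
    # The maximum value would land in bin `levels`, so clamp it to the last bin.
    return [min(int((v - lo) / width), levels - 1) for v in values]


# Hypothetical daily temperatures for one city (illustration only).
temperatures = [0.0, 15.0, 30.0]
bins = uniform_quantize(temperatures, levels=30)
```

Each attribute would be quantized separately, so that every column of a city's block ends up with at most 30 distinct categorical values.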

Fig. 3 shows the structure of the data clustering system provided by an embodiment of the present invention. For ease of explanation, only the parts relevant to the embodiment are shown. The data clustering system comprises: input module 101, selection module 102, distance calculation module 103, assignment module 104, class center calculation module 105, and loop control module 106. The data clustering system may be a software unit, a hardware unit, or a combined software and hardware unit built into data processing equipment.

Input module 101, configured to input a data set to be clustered, consisting of n objects with block data features, and an expected number of classes k, where each block data object x_i has the form

x_{i} = \begin{bmatrix} x_{i,1,1} & x_{i,1,2} & \cdots & x_{i,1,m} \\ x_{i,2,1} & x_{i,2,2} & \cdots & x_{i,2,m} \\ \vdots & \vdots & & \vdots \\ x_{i,r,1} & x_{i,r,2} & \cdots & x_{i,r,m} \end{bmatrix}

Selection module 102, configured to select k block data objects from the data set as the initial class centers;

In this embodiment, selection module 102 is specifically configured to select k block objects at random from data set X as the initial class centers.

Distance calculation module 103, configured to calculate the distance from each object to the initial class centers;

Assignment module 104, configured to assign each block data object to its nearest center according to the calculated distances, forming k disjoint classes;

Class center calculation module 105, configured to calculate the center of each class as the new class center;

Loop control module 106, configured to control the repetition of the object assignment step and the class center calculation step until the algorithm converges, thereby obtaining the partition of the data set.

In summary, the embodiments of the present invention partition the data set into different classes by an iterative process so that the criterion function that evaluates clustering performance reaches its optimum. First, k block data objects (k being the expected number of classes) are selected at random from the data set as the initial class centers. Then, according to the definition of the distance between block data objects, the distance from each block object in the data set to the initial class centers is calculated, and each block object is assigned to its nearest center, forming k classes. The center of each class is then calculated by the inclusion-exclusion principle and taken as the new class center. The object assignment step and the class center calculation step are repeated until the algorithm converges. The embodiments can quickly cluster the block data that is widespread in the real world, and constitute a partitional clustering method that is both efficient and practical. The embodiments process data with block characteristics directly, without compressing the block data, so information loss is avoided and the clustering result is better than the one obtained after the block data is compressed. In addition, the embodiments can also handle large-scale data.

A person of ordinary skill in the art will appreciate that all or part of the steps of the methods in the above embodiments can be carried out by a program instructing the relevant hardware. The program can be stored in a computer-readable storage medium such as ROM/RAM, a magnetic disk, or an optical disc.

The foregoing describes only preferred embodiments of the present invention and is not intended to limit it. Any modification, equivalent replacement, or improvement made within the spirit and principles of the present invention shall fall within its scope of protection.

Claims

1. A data clustering method, characterized in that the method comprises the following steps:

Calculating the distance from each object to the initial class centers;

Calculating the center of each class as the new class center;

2. The method of claim 1, characterized in that the data set to be clustered is X = {x_1, x_2, ..., x_n}, where

x_{i} = \begin{bmatrix} x_{i,1,1} & x_{i,1,2} & \cdots & x_{i,1,m} \\ x_{i,2,1} & x_{i,2,2} & \cdots & x_{i,2,m} \\ \vdots & \vdots & & \vdots \\ x_{i,r,1} & x_{i,r,2} & \cdots & x_{i,r,m} \end{bmatrix}

is the i-th object, described by m attributes and r detail records, and x_i is called a block data object; k is the expected number of classes.

3. The method of claim 1, characterized in that the step of selecting k block data objects from the data set as the initial class centers is specifically: selecting k block objects at random from data set X as the initial class centers.

4. The method of claim 1, characterized in that the distance between objects depends on the differences between their attribute values, and the distance between block data objects is measured by a formula (not reproduced in this text).

5. The method of claim 1, characterized in that the average number of detail records over all objects in the class is first calculated and taken as the number of records r that the class center will contain; then, for each dimension, the frequency with which each element of that dimension's value domain occurs across the different objects in the class is counted; if the number of distinct values is greater than r, the r values with the highest frequency are selected as the representatives of that dimension; otherwise, values are taken repeatedly in descending order of frequency until r values have been obtained; these steps are repeated to obtain the representatives of the m columns, which form the class center.

6. A data clustering system, characterized in that the system comprises:

7. The system of claim 6, characterized in that the selection module is specifically configured to select k block objects at random from data set X as the initial class centers.

8. Data processing equipment comprising the system of claim 6 or claim 7.