CN106897292A - Internet data clustering method and system - Google Patents

Internet data clustering method and system

Info

Publication number
CN106897292A
Authority
CN
China
Prior art keywords
cluster
sample
clustering
object function
cluster centre
Prior art date
Legal status
Pending
Application number
CN201510956891.2A
Other languages
Chinese (zh)
Inventor
赵鹤
李栋
李栋一
黄哲学
姜青山
陈会
高琴
朱敏
蔡业首
Current Assignee
Shenzhen Institute of Advanced Technology of CAS
Original Assignee
Shenzhen Institute of Advanced Technology of CAS
Priority date
Filing date
Publication date
Application filed by Shenzhen Institute of Advanced Technology of CAS filed Critical Shenzhen Institute of Advanced Technology of CAS
Priority to CN201510956891.2A priority Critical patent/CN106897292A/en
Publication of CN106897292A publication Critical patent/CN106897292A/en
Pending legal-status Critical Current


Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/35Clustering; Classification


Abstract

The invention belongs to the technical field of clustering algorithms, and more particularly relates to an internet data clustering method and system. The internet data clustering method comprises the following steps: step a: adding a penalty coefficient to the computation term of the objective function of the original FG-k-means algorithm to form a new objective function; step b: optimizing the parameters of the new objective function; step c: computing the distance between each sample and each cluster centre from the optimized parameters, and assigning samples to clusters according to those distances. By adding a penalty coefficient to the computation term of the objective function of the original FG-k-means algorithm, the internet data clustering method and system of the embodiment of the present invention can effectively control the unbounded growth of cluster sample sizes that occurs in the original FG-k-means algorithm, while keeping the sample sizes of the clusters balanced and controllable, thereby achieving higher clustering accuracy.

Description

Internet data clustering method and system
Technical field
The invention belongs to the technical field of clustering algorithms, and more particularly relates to an internet data clustering method and system.
Background technology
With the arrival of the big data era, the data faced in data mining have become increasingly complex. Internet text data in particular are not only enormous in volume, but text data built with the vector space model (Vector Space Model) also have extremely high dimensionality and sparsity. Existing data mining clustering algorithms, such as k-means, hierarchical clustering and other common clustering algorithms, generally show deficiencies and limitations when applied to text clustering.
For the subspace clustering problem of high-dimensional sparse data, academia has proposed many related subspace clustering algorithms (Subspace Clustering). Soft subspace clustering algorithms are one such class; according to the number of weighting levels, soft subspace clustering algorithms can be divided into single-level soft subspace clustering algorithms and two-level soft subspace clustering algorithms. FG-k-means is a two-level soft subspace clustering algorithm proposed by Chen Xiaojun et al. in 2012. It introduces the concept of feature groups and weights both groups and individual features; when clustering ultra-high-dimensional sparse data, its effect is clearly better than that of single-level soft subspace clustering algorithms. FG-k-means clusters data whose feature space contains grouping information, as shown in Fig. 1, which is a simulated data set for the FG-k-means algorithm. The feature space of FG-k-means is defined as follows:
1) Let the training data set be X = {x_1, x_2, ..., x_N}, where x_i ∈ R^d (1 ≤ i ≤ N) denotes the i-th sample in the data set;
2) The feature set on X is V = {v_1, v_2, ..., v_d}; the features in V are contained in a group set G = {G_1, G_2, ..., G_T}, and G satisfies that the groups together cover V and are pairwise disjoint.
The FG-k-means algorithm needs to find K clusters on a data set of the above form, and at the same time find the subspace of groups and features corresponding to each cluster. As shown in Fig. 1, an example data set containing group information is defined: on the data set X = {x_1, x_2, ..., x_N}, the feature set is V = {v_1, v_2, ..., v_12}, the number of features is d = 12, the group set is G = {G_1, G_2, G_3}, and the features of different groups are mutually disjoint.
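To make this grouped feature space concrete, the following is a minimal sketch of how such a data set can be represented; the group sizes and index assignments are illustrative assumptions, not the partition shown in Fig. 1.

```python
import numpy as np

# Illustrative only: N samples with d = 12 features split into three disjoint
# groups, mirroring the X, V, G structure defined above (group sizes assumed).
N, d = 1000, 12
X = np.random.rand(N, d)          # data set X = {x_1, ..., x_N}, each x_i in R^d
feature_groups = [                # group set G = {G_1, G_2, G_3}: feature indices per group
    [0, 1, 2, 3],                 # G_1
    [4, 5, 6, 7, 8],              # G_2
    [9, 10, 11],                  # G_3
]
# The groups must cover all d features and be pairwise disjoint.
assert sorted(j for idx in feature_groups for j in idx) == list(range(d))
```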
To solve the above clustering problem on data sets containing group information, FG-k-means assumes that each cluster l has one set of weights H_l on V and one set of weights S_l on G. When all samples in a cluster are highly consistent on a certain feature or group, that feature or group is assigned a larger weight. Its objective function is as follows:
subject to the constraints:
● U is an n × k matrix; u_{i,l} = 1 indicates that the i-th sample belongs to the l-th cluster;
● Z = {Z_1, Z_2, ..., Z_k} denotes the k cluster centres;
● H is a k × m weight matrix; h_{l,j} denotes the weight of the j-th keyword (feature) in the l-th cluster;
● S is a k × T weight matrix; s_{l,t} denotes the weight of the t-th group in the l-th cluster;
● λ > 0 and η > 0 control how evenly the group weights and the feature weights, respectively, are distributed.
The parameters of objective function (1) can be solved by an iterative optimization method.
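As a rough illustration, the following sketch evaluates an objective of this form for fixed U, Z, H and S, written after the objective published for FG-k-means; the per-feature squared Euclidean distance and the entropy-style regularization terms are assumptions here, since formula (1) itself is not reproduced in this text.

```python
import numpy as np

def fgkm_objective(X, U, Z, H, S, groups, lam, eta, eps=1e-12):
    """Value of an FG-k-means-style objective for fixed U (n x k, 0/1), Z (k x d),
    H (k x d feature weights), S (k x T group weights); `groups` lists the feature
    indices of each group G_t.  Assumed form, for illustration only."""
    k = Z.shape[0]
    total = 0.0
    for l in range(k):
        members = np.where(U[:, l] == 1)[0]                # samples assigned to cluster l
        for t, idx in enumerate(groups):
            diff2 = (X[np.ix_(members, idx)] - Z[l, idx]) ** 2
            total += S[l, t] * np.sum(H[l, idx] * diff2)   # doubly weighted dispersion
        total += lam * np.sum(S[l] * np.log(S[l] + eps))   # group-weight regularization
        total += eta * np.sum(H[l] * np.log(H[l] + eps))   # feature-weight regularization
    return total
```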
In summary, the existing FG-k-means algorithm has the following drawback: features are weighted during clustering, but the influence of useless features on the clusters is ignored. When a sample differs significantly from a cluster centre in those dimensions yet is extremely similar to it in the features that carry large weights in that cluster, the algorithm cannot effectively separate the sample from the cluster. The presence of a large number of unrelated samples causes the cluster sample size to grow, which leads to imbalance between clusters and reduces the robustness of the FG-k-means algorithm.
The content of the invention
The invention provides an internet data clustering method and system, intended to solve, at least to a certain extent, one of the above technical problems in the prior art.
The invention is implemented as follows: an internet data clustering method comprising the following steps:
Step a: adding a penalty coefficient to the computation term of the objective function of the original FG-k-means algorithm to form a new objective function;
Step b: optimizing the parameters of the new objective function;
Step c: computing the distance between each sample and each cluster centre from the optimized parameters, and assigning samples to clusters according to those distances.
The technical solution adopted by the embodiment of the present invention further includes: in step a, the new objective function is:
subject to the constraints:
In the above formulas:
U is an n × k matrix; u_{i,l} = 1 indicates that the i-th sample belongs to the l-th cluster;
Z = {Z_1, Z_2, ..., Z_k} denotes the k cluster centres;
H is a k × m weight matrix; h_{l,j} denotes the weight of the j-th keyword (feature) in the l-th cluster;
S is a k × T weight matrix; s_{l,t} denotes the weight of the t-th group in the l-th cluster;
p_l denotes the fraction of all samples that fall in cluster l;
λ > 0 and η > 0 control how evenly the group weights and the feature weights, respectively, are distributed;
σ is used to adjust the degree of balance of the sample sizes between clusters;
the remaining factor is the newly added penalty coefficient that penalizes the cluster sample size.
The technical solution adopted by the embodiment of the present invention further includes: in step b, optimizing the parameters of the new objective function specifically comprises the following steps:
Step b1: fixing the variables Z, H and S, and solving the optimization problem with respect to U;
Step b2: fixing the variables U, H and S, and solving the optimization problem with respect to Z;
Step b3: fixing the variables U, Z and S, and solving the optimization problem with respect to H;
Step b4: fixing the variables U, Z and H, and solving the optimization problem with respect to S.
The technical solution adopted by the embodiment of the present invention further includes: in step b1, with Z, H and S fixed, the optimization problem with respect to U is solved according to:
In step b2, with U, H and S fixed, the optimization problem with respect to Z is solved according to:
The technical solution adopted by the embodiment of the present invention further includes: in step b3, with U, Z and S fixed, the solution formula of the optimization problem with respect to H is:
The technical solution adopted by the embodiment of the present invention further includes: in step b4, with U, Z and H fixed, the solution formula of the optimization problem with respect to S is:
The technical solution adopted by the embodiment of the present invention further includes: in step c, computing the distance between a sample and each cluster centre from the optimized parameters and assigning the sample to a cluster according to those distances is specifically as follows: p_l denotes the fraction of all samples that fall in cluster l; when the sample fraction p_l of a cluster becomes too large, the penalty coefficient of that cluster approaches 1 and exceeds those of the other clusters. When the distances of a sample x are computed, the distance to each cluster centre is multiplied by the corresponding penalty coefficient and thereby rescaled; the distance to cluster l is shrunk by a smaller ratio, while the distances to the other clusters are shrunk more. If a sample x lies roughly equidistant from the centre of cluster l and the centre of another cluster r, then, because the penalty coefficient of r is smaller, its distance is reduced by a larger proportion after rescaling; the rescaled distance between x and r becomes smaller than its distance to l, and sample x is assigned to cluster r.
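A minimal sketch of this assignment rule is given below; the per-cluster penalty coefficients are taken as given inputs (their exact fraction form is defined by the formulas above, which are not reproduced in this text), and each sample is assigned to the cluster whose penalty-scaled distance is smallest.

```python
import numpy as np

def assign_with_penalty(distances, penalties):
    """distances: (n, k) weighted distances from each sample to each cluster centre.
    penalties:  (k,) penalty coefficients in (0, 1], close to 1 for overpopulated
    clusters.  Each distance is rescaled by the corresponding coefficient before
    the nearest cluster is chosen, so crowded clusters attract fewer samples."""
    scaled = distances * penalties[np.newaxis, :]
    return np.argmin(scaled, axis=1)

# Toy example: a sample equally close to clusters 0 and 1, with cluster 0 crowded.
d = np.array([[2.0, 2.0, 5.0]])
pen = np.array([0.95, 0.60, 0.70])
print(assign_with_penalty(d, pen))   # -> [1]: the sample goes to the smaller cluster
```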
Another technical solution adopted by the embodiment of the present invention is an internet data clustering system comprising an objective function update module and an objective function computation module. The objective function update module is configured to add a penalty coefficient to the computation term of the objective function of the original FG-k-means algorithm to form a new objective function; the objective function computation module is configured to optimize the parameters of the new objective function, compute the distance between each sample and each cluster centre from the optimized parameters, and assign samples to clusters according to those distances.
The technical solution adopted by the embodiment of the present invention further includes: the new objective function formed by the objective function update module is:
subject to the constraints:
In the above formulas:
U is an n × k matrix; u_{i,l} = 1 indicates that the i-th sample belongs to the l-th cluster;
Z = {Z_1, Z_2, ..., Z_k} denotes the k cluster centres;
H is a k × m weight matrix; h_{l,j} denotes the weight of the j-th keyword (feature) in the l-th cluster;
S is a k × T weight matrix; s_{l,t} denotes the weight of the t-th group in the l-th cluster;
p_l denotes the fraction of all samples that fall in cluster l;
λ > 0 and η > 0 control how evenly the group weights and the feature weights, respectively, are distributed;
σ is used to adjust the degree of balance of the sample sizes between clusters;
the remaining factor is the newly added penalty coefficient that penalizes the cluster sample size.
The technical solution adopted by the embodiment of the present invention further includes: the objective function computation module comprises a first solving unit, a second solving unit, a third solving unit and a fourth solving unit;
The first solving unit is configured to, with the variables Z, H and S fixed, solve the optimization problem with respect to U;
The second solving unit is configured to, with the variables U, H and S fixed, solve the optimization problem with respect to Z;
The third solving unit is configured to, with the variables U, Z and S fixed, solve the optimization problem with respect to H;
The fourth solving unit is configured to, with the variables U, Z and H fixed, solve the optimization problem with respect to S.
By adding a penalty coefficient to the computation term of the objective function of the original FG-k-means algorithm, the internet data clustering method and system of the embodiment of the present invention can effectively control the unbounded growth of cluster sample sizes that occurs in the original FG-k-means algorithm, while keeping the sample sizes of the clusters balanced and controllable, thereby achieving higher clustering accuracy.
Brief description of the drawings
Fig. 1 shows a simulated data set for the FG-k-means algorithm;
Fig. 2 is a flow chart of the internet data clustering method of the embodiment of the present invention;
Fig. 3 is a flow chart of the objective function optimization and solving method of the embodiment of the present invention;
Fig. 4 is a structural diagram of the internet data clustering system of the embodiment of the present invention.
Specific embodiment
In order to make the objectives, technical solutions and advantages of the present invention clearer, the present invention is further described below with reference to the drawings and embodiments. It should be understood that the specific embodiments described herein are merely illustrative of the present invention and are not intended to limit it.
Referring to Fig. 2, which is a flow chart of the internet data clustering method of the embodiment of the present invention, the internet data clustering method of the embodiment of the present invention comprises the following steps:
Step 100: adding a penalty coefficient to the computation term of the objective function of the original FG-k-means algorithm to form a new objective function;
In step 100, the new objective function is:
subject to the constraints:
In formula (3) and formula (4):
U is an n × k matrix; u_{i,l} = 1 indicates that the i-th sample belongs to the l-th cluster;
Z = {Z_1, Z_2, ..., Z_k} denotes the k cluster centres;
H is a k × m weight matrix; h_{l,j} denotes the weight of the j-th keyword (feature) in the l-th cluster;
S is a k × T weight matrix; s_{l,t} denotes the weight of the t-th group in the l-th cluster;
p_l denotes the fraction of all samples that fall in cluster l;
λ > 0 and η > 0 control how evenly the group weights and the feature weights, respectively, are distributed;
σ is used to adjust the degree of balance of the sample sizes between clusters;
the remaining factor is the newly added penalty coefficient that penalizes the cluster sample size. The penalty coefficient is expressed as a fraction controlled by the two parameters p_l and σ, and can be adjusted according to the practical application. σ regulates the balance of sample sizes between clusters; normalizing by the total number of samples in the data set and introducing the parameter p_l removes the influence of the data set size on the reasonable value range of σ. In the embodiment of the present invention, the reasonable value range of σ in the penalty coefficient is the interval [0, 0.7]; that is, the parameter of the penalty coefficient lies in a small, controllable interval, which facilitates regulation of the degree of balance of the cluster sample sizes.
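The exact fraction defining the penalty coefficient is given by formula (3), which is not reproduced in this text; as a purely illustrative stand-in that matches the behaviour described here (a fraction controlled by p_l and σ that approaches 1 as p_l grows, and that equals 1 for every cluster when σ = 0), the following sketch uses p_l / (p_l + σ).

```python
import numpy as np

def cluster_fractions(U):
    """p_l: fraction of all samples currently assigned to each cluster (U is n x k, 0/1)."""
    counts = U.sum(axis=0)
    return counts / counts.sum()

def penalty_coefficients(U, sigma):
    """Illustrative penalty of fraction form controlled by p_l and sigma.
    NOTE: p_l / (p_l + sigma) is an assumption, not the patent's formula (3); it
    only reproduces the described behaviour: it approaches 1 as p_l grows, and
    with sigma = 0 every coefficient equals 1, recovering plain FG-k-means."""
    p = cluster_fractions(U)
    return np.ones_like(p) if sigma == 0 else p / (p + sigma)

# sigma is recommended to lie in the small, controllable interval [0, 0.7].
U = np.zeros((10, 3)); U[:7, 0] = 1; U[7:9, 1] = 1; U[9, 2] = 1
print(penalty_coefficients(U, sigma=0.3))   # the crowded cluster 0 gets the largest coefficient
```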
Step 200: optimizing the parameters of the new objective function, computing the distance between each sample and each cluster centre from the optimized parameters, and assigning samples to clusters according to those distances.
In step 200, referring also to Fig. 3, which is a flow chart of the objective function optimization and solving method of the embodiment of the present invention, the objective function optimization and solving method of the embodiment of the present invention comprises the following steps:
Step 201: fixing the variables Z, H and S, and solving the optimization problem with respect to U;
In step 201, following the derivation of FG-k-means, U can be updated according to the following rule:
Step 202: fixing the variables U, H and S, and solving the optimization problem with respect to Z;
In step 202, the elements of Z can be updated according to the following rule:
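The centre update referred to here is given as a formula in the original; for a squared Euclidean distance it is typically the per-feature mean of the samples currently assigned to each cluster, as in the short sketch below (an assumption in line with standard k-means-type derivations, not a transcription of the patent's formula).

```python
import numpy as np

def update_centres(X, U):
    """Z[l]: mean of the samples currently assigned to cluster l (assumed Euclidean
    case); an empty cluster falls back to the overall mean of the data set."""
    k = U.shape[1]
    Z = np.empty((k, X.shape[1]))
    for l in range(k):
        members = U[:, l] == 1
        Z[l] = X[members].mean(axis=0) if members.any() else X.mean(axis=0)
    return Z
```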
Step 203: fixing the variables U, Z and S, and solving the optimization problem with respect to H;
In step 203, the solution is given by Theorem 1:
Theorem 1: assume that U, Z and S are fixed and η > 0; then the objective function attains its minimum value if and only if:
Proof of Theorem 1:
Given the three variables U, Z and S, the value h_{l,j} that minimizes objective function (3) must be solved; this value denotes the weight of the j-th feature in the l-th cluster. Since there are k × T constraints on the feature weights, introducing Lagrange multipliers for objective function (3) yields:
In formula (8), E_{l,j} denotes, with U, Z and S fixed, the constant associated with the j-th feature in the l-th cluster, which can be obtained from formula (8).
Taking the derivatives of formula (9) with respect to γ_{l,t} and h_{l,j} respectively and setting them to zero, we obtain:
In formula (11) and formula (12), t is the index of the group to which feature j belongs.
Simplifying formula (12) yields:
Substituting formula (13) into formula (11) yields:
Simplifying formula (14) yields:
Finally, substituting formula (15) back into formula (13) gives:
Step 204: fixing the variables U, Z and H, and solving the optimization problem with respect to S;
In step 204, the solution is given by Theorem 2:
Theorem 2: assume that U, Z and H are fixed and λ > 0; then the objective function attains its minimum value if and only if:
Proof of Theorem 2:
Given the three variables U, Z and H, the value s_{l,t} that minimizes objective function (3) must be solved; this value denotes the weight of the t-th group in the l-th cluster. Since there are k constraints on the group weights, introducing Lagrange multipliers for objective function (3) yields:
In formula (18), D_{l,t} denotes, with U, Z and H fixed, the constant associated with the t-th feature group in the l-th cluster, which can be obtained from formula (17).
Taking the derivatives of formula (18) with respect to γ and s_{l,t} respectively and setting them to zero, we obtain:
Simplifying formula (20) yields:
Substituting formula (21) into formula (19) yields:
Simplifying formula (22) yields:
Finally, substituting formula (23) back into formula (21) gives:
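Derivations of this kind typically terminate in closed-form exponential (softmax-like) updates for the feature weights h_{l,j} and the group weights s_{l,t}; the sketch below assumes that form, with E_{l,j} and D_{l,t} taken as the within-cluster dispersion constants described above. It is an illustration of the usual FG-k-means-style result, not a transcription of formulas (16) and (24).

```python
import numpy as np

def update_feature_weights(E, groups, eta):
    """h_{l,j} is proportional to exp(-E_{l,j} / eta), normalized within each feature
    group G_t so the weights of each group sum to 1 per cluster (assumed form)."""
    H = np.zeros_like(E)                 # E: (k, d) per-cluster, per-feature dispersion
    for idx in groups:                   # normalize separately inside every group
        e = E[:, idx] - E[:, idx].min(axis=1, keepdims=True)   # shift for numerical stability
        w = np.exp(-e / eta)             # the shift cancels out after normalization
        H[:, idx] = w / w.sum(axis=1, keepdims=True)
    return H

def update_group_weights(D, lam):
    """s_{l,t} is proportional to exp(-D_{l,t} / lam), normalized over the T groups."""
    d = D - D.min(axis=1, keepdims=True)
    w = np.exp(-d / lam)                 # D: (k, T) per-cluster, per-group dispersion
    return w / w.sum(axis=1, keepdims=True)
```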
In the above algorithm, it can be seen that the value of σ affects the overall performance of the algorithm, and when σ = 0 the algorithm is equivalent to the FG-k-means algorithm. From the above update formulas and the FG-k-means algorithm flow, the pseudo-code of the algorithm of the invention can be obtained as follows:
Pseudo-code of the algorithm
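The pseudo-code itself appears as a figure in the original publication and is not reproduced here; the following is a compact, illustrative re-sketch in Python of the loop described in steps 201 to 204, in which the concrete penalty fraction p_l / (p_l + σ), the squared Euclidean per-feature distance and the exponential weight updates are all assumptions rather than the patent's exact formulas.

```python
import numpy as np

def penalized_fgkm(X, k, groups, lam, eta, sigma, max_iter=50, seed=0):
    """Illustrative sketch of the penalized FG-k-means-style loop described above."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    T = len(groups)
    Z = X[rng.choice(n, size=k, replace=False)].copy()   # initial cluster centres
    H = np.zeros((k, d))
    for idx in groups:                                   # uniform feature weights per group
        H[:, idx] = 1.0 / len(idx)
    S = np.full((k, T), 1.0 / T)                         # uniform group weights
    p = np.full(k, 1.0 / k)                              # start from balanced clusters
    for _ in range(max_iter):
        # Step 201: update U -- penalty-scaled, weighted nearest-centre assignment.
        q = np.maximum(p, 1.0 / n)                       # floor p_l so empty clusters cannot collapse
        pen = np.ones(k) if sigma == 0 else q / (q + sigma)
        dist = np.zeros((n, k))
        for l in range(k):
            for t, idx in enumerate(groups):
                diff2 = (X[:, idx] - Z[l, idx]) ** 2
                dist[:, l] += S[l, t] * (diff2 * H[l, idx]).sum(axis=1)
        labels = np.argmin(dist * pen, axis=1)
        p = np.bincount(labels, minlength=k) / n         # refresh p_l for the next pass
        # Step 202: update Z -- within-cluster means (empty clusters keep their centre).
        for l in range(k):
            if (labels == l).any():
                Z[l] = X[labels == l].mean(axis=0)
        # Steps 203-204: update H and S -- assumed exponential closed forms.
        E = np.zeros((k, d))                             # per-cluster, per-feature dispersion
        D = np.zeros((k, T))                             # per-cluster, per-group dispersion
        for l in range(k):
            members = labels == l
            if not members.any():
                continue
            per_feature = ((X[members] - Z[l]) ** 2).sum(axis=0)
            for t, idx in enumerate(groups):
                E[l, idx] = S[l, t] * per_feature[idx]   # group-weighted, per the derivation
                D[l, t] = (H[l, idx] * per_feature[idx]).sum()
        for idx in groups:                               # h_{l,j}: softmax within each group
            e = E[:, idx] - E[:, idx].min(axis=1, keepdims=True)
            w = np.exp(-e / eta)
            H[:, idx] = w / w.sum(axis=1, keepdims=True)
        w = np.exp(-(D - D.min(axis=1, keepdims=True)) / lam)
        S = w / w.sum(axis=1, keepdims=True)             # s_{l,t}: softmax over the T groups
    return labels, Z, H, S
```

With σ = 0 the penalty vector is all ones and the loop behaves like plain FG-k-means; increasing σ within the recommended interval [0, 0.7] strengthens the pressure toward balanced cluster sizes.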
p_l denotes the fraction of all samples that fall in cluster l. When the sample fraction p_l of a cluster becomes too large, the penalty coefficient of that cluster approaches 1 and exceeds those of the other clusters. When the distances of a sample x are computed according to (5), the distance to each cluster centre is multiplied by the corresponding penalty coefficient and thereby rescaled; the distance to cluster l is shrunk by a smaller ratio, while the distances to the other clusters are shrunk more. Suppose a sample x lies roughly equidistant from the centres of cluster l and another cluster r; because the penalty coefficient of r is smaller, its distance is reduced by a larger proportion, the rescaled distance between x and r becomes smaller than the distance between x and l, and x is assigned to cluster r. In this way, the unbounded growth of the sample size of cluster l is limited: in a cluster that already contains many samples, only the samples that are genuinely close to the cluster centre remain in the cluster, while samples farther from the centre are assigned to clusters with fewer samples.
Referring to Fig. 4, which is a structural diagram of the internet data clustering system of the embodiment of the present invention, the internet data clustering system of the embodiment of the present invention comprises an objective function update module and an objective function computation module.
The objective function update module is configured to add a penalty coefficient to the computation term of the objective function of the original FG-k-means algorithm to form a new objective function, where the new objective function is:
subject to the constraints:
In formula (3) and formula (4):
U is an n × k matrix; u_{i,l} = 1 indicates that the i-th sample belongs to the l-th cluster;
Z = {Z_1, Z_2, ..., Z_k} denotes the k cluster centres;
H is a k × m weight matrix; h_{l,j} denotes the weight of the j-th keyword (feature) in the l-th cluster;
S is a k × T weight matrix; s_{l,t} denotes the weight of the t-th group in the l-th cluster;
p_l denotes the fraction of all samples that fall in cluster l;
λ > 0 and η > 0 control how evenly the group weights and the feature weights, respectively, are distributed;
σ is used to adjust the degree of balance of the sample sizes between clusters;
the remaining factor is the newly added penalty coefficient that penalizes the cluster sample size. The penalty coefficient is expressed as a fraction controlled by the two parameters p_l and σ, and can be adjusted according to the practical application. σ regulates the balance of sample sizes between clusters; normalizing by the total number of samples in the data set and introducing the parameter p_l removes the influence of the data set size on the reasonable value range of σ. In the embodiment of the present invention, the reasonable value range of σ in the penalty coefficient is the interval [0, 0.7]; that is, the parameter of the penalty coefficient lies in a small, controllable interval, which facilitates regulation of the degree of balance of the cluster sample sizes.
The objective function computation module is configured to optimize the parameters of the new objective function, compute the distance between each sample and each cluster centre from the optimized parameters, and assign samples to clusters according to those distances. Specifically, the objective function computation module comprises a first solving unit, a second solving unit, a third solving unit and a fourth solving unit.
The first solving unit is configured to, with the variables Z, H and S fixed, solve the optimization problem with respect to U; following the derivation of FG-k-means, U can be updated according to the following rule:
The second solving unit is configured to, with the variables U, H and S fixed, solve the optimization problem with respect to Z; the elements of Z can be updated according to the following rule:
The third solving unit is configured to, with the variables U, Z and S fixed, solve the optimization problem with respect to H; the solution is given by Theorem 1:
Theorem 1: assume that U, Z and S are fixed and η > 0; then the objective function attains its minimum value if and only if:
Proof of Theorem 1:
Given the three variables U, Z and S, the value h_{l,j} that minimizes objective function (3) must be solved; this value denotes the weight of the j-th feature in the l-th cluster. Since there are k × T constraints on the feature weights, introducing Lagrange multipliers for objective function (3) yields:
In formula (8), E_{l,j} denotes, with U, Z and S fixed, the constant associated with the j-th feature in the l-th cluster, which can be obtained from formula (8).
Taking the derivatives of formula (9) with respect to γ_{l,t} and h_{l,j} respectively and setting them to zero, we obtain:
In formula (11) and formula (12), t is the index of the group to which feature j belongs.
Simplifying formula (12) yields:
Substituting formula (13) into formula (11) yields:
Simplifying formula (14) yields:
Finally, substituting formula (15) back into formula (13) gives:
The fourth solving unit is configured to, with the variables U, Z and H fixed, solve the optimization problem with respect to S; the solution is given by Theorem 2:
Theorem 2: assume that U, Z and H are fixed and λ > 0; then the objective function attains its minimum value if and only if:
Proof of Theorem 2:
Given the three variables U, Z and H, the value s_{l,t} that minimizes objective function (3) must be solved; this value denotes the weight of the t-th group in the l-th cluster. Since there are k constraints on the group weights, introducing Lagrange multipliers for objective function (3) yields:
In formula (18), D_{l,t} denotes, with U, Z and H fixed, the constant associated with the t-th feature group in the l-th cluster, which can be obtained from formula (17).
Taking the derivatives of formula (18) with respect to γ and s_{l,t} respectively and setting them to zero, we obtain:
Simplifying formula (20) yields:
Substituting formula (21) into formula (19) yields:
Simplifying formula (22) yields:
Finally, substituting formula (23) back into formula (21) gives:
In the above algorithm, it can be seen that the value of σ affects the overall performance of the algorithm, and when σ = 0 the algorithm is equivalent to the FG-k-means algorithm.
p_l denotes the fraction of all samples that fall in cluster l. When the sample fraction p_l of a cluster becomes too large, the penalty coefficient of that cluster approaches 1 and exceeds those of the other clusters. When the distances of a sample x are computed according to (5), the distance to each cluster centre is multiplied by the corresponding penalty coefficient and thereby rescaled; the distance to cluster l is shrunk by a smaller ratio, while the distances to the other clusters are shrunk more. Suppose a sample x lies roughly equidistant from the centres of cluster l and another cluster r; because the penalty coefficient of r is smaller, its distance is reduced by a larger proportion, the rescaled distance between x and r becomes smaller than the distance between x and l, and x is assigned to cluster r. In this way, the unbounded growth of the sample size of cluster l is limited: in a cluster that already contains many samples, only the samples that are genuinely close to the cluster centre remain in the cluster, while samples farther from the centre are assigned to clusters with fewer samples.
By adding a penalty coefficient to the computation term of the objective function of the original FG-k-means algorithm, the internet data clustering method and system of the embodiment of the present invention can effectively control the unbounded growth of cluster sample sizes that occurs in the original FG-k-means algorithm, while keeping the sample sizes of the clusters balanced and controllable, thereby achieving higher clustering accuracy.
The foregoing is merely preferred embodiments of the present invention and is not intended to limit the invention; any modification, equivalent replacement or improvement made within the spirit and principles of the present invention shall be included within the scope of protection of the present invention.

Claims (10)

1. An internet data clustering method, comprising the following steps:
Step a: adding a penalty coefficient to the computation term of the objective function of the original FG-k-means algorithm to form a new objective function;
Step b: optimizing the parameters of the new objective function;
Step c: computing the distance between each sample and each cluster centre from the optimized parameters, and assigning samples to clusters according to those distances.
2. The internet data clustering method according to claim 1, characterized in that, in step a, the new objective function is:
subject to the constraints:
In the above formulas:
U is an n × k matrix; u_{i,l} = 1 indicates that the i-th sample belongs to the l-th cluster;
Z = {Z_1, Z_2, ..., Z_k} denotes the k cluster centres;
H is a k × m weight matrix; h_{l,j} denotes the weight of the j-th keyword (feature) in the l-th cluster;
S is a k × T weight matrix; s_{l,t} denotes the weight of the t-th group in the l-th cluster;
p_l denotes the fraction of all samples that fall in cluster l;
λ > 0 and η > 0 control how evenly the group weights and the feature weights, respectively, are distributed;
σ is used to adjust the degree of balance of the sample sizes between clusters;
the remaining factor is the newly added penalty coefficient that penalizes the cluster sample size.
3. The internet data clustering method according to claim 2, characterized in that, in step b, optimizing the parameters of the new objective function specifically comprises the following steps:
Step b1: fixing the variables Z, H and S, and solving the optimization problem with respect to U;
Step b2: fixing the variables U, H and S, and solving the optimization problem with respect to Z;
Step b3: fixing the variables U, Z and S, and solving the optimization problem with respect to H;
Step b4: fixing the variables U, Z and H, and solving the optimization problem with respect to S.
4. The internet data clustering method according to claim 3, characterized in that, in step b1, with Z, H and S fixed, the optimization problem with respect to U is solved according to:
In step b2, with U, H and S fixed, the optimization problem with respect to Z is solved according to:
5. The internet data clustering method according to claim 4, characterized in that, in step b3, with U, Z and S fixed, the solution formula of the optimization problem with respect to H is:
6. The internet data clustering method according to claim 5, characterized in that, in step b4, with U, Z and H fixed, the solution formula of the optimization problem with respect to S is:
7. The internet data clustering method according to claim 1, characterized in that, in step c, computing the distance between a sample and each cluster centre from the optimized parameters and assigning the sample to a cluster according to those distances is specifically as follows: p_l denotes the fraction of all samples that fall in cluster l; when the sample fraction p_l of a cluster becomes too large, the penalty coefficient of that cluster approaches 1 and exceeds those of the other clusters; when the distances of a sample x are computed, the distance to each cluster centre is multiplied by the corresponding penalty coefficient and thereby rescaled, the distance to cluster l being shrunk by a smaller ratio and the distances to the other clusters being shrunk more; if a sample x lies roughly equidistant from the centre of cluster l and the centre of another cluster r, then, because the penalty coefficient of r is smaller, its distance is reduced by a larger proportion after rescaling, the rescaled distance between x and r becomes smaller than the distance between x and l, and sample x is assigned to cluster r.
8. An internet data clustering system, characterized in that it comprises an objective function update module and an objective function computation module; the objective function update module is configured to add a penalty coefficient to the computation term of the objective function of the original FG-k-means algorithm to form a new objective function; the objective function computation module is configured to optimize the parameters of the new objective function, compute the distance between each sample and each cluster centre from the optimized parameters, and assign samples to clusters according to those distances.
9. The internet data clustering system according to claim 8, characterized in that the new objective function formed by the objective function update module is:
subject to the constraints:
In the above formulas:
U is an n × k matrix; u_{i,l} = 1 indicates that the i-th sample belongs to the l-th cluster;
Z = {Z_1, Z_2, ..., Z_k} denotes the k cluster centres;
H is a k × m weight matrix; h_{l,j} denotes the weight of the j-th keyword (feature) in the l-th cluster;
S is a k × T weight matrix; s_{l,t} denotes the weight of the t-th group in the l-th cluster;
p_l denotes the fraction of all samples that fall in cluster l;
λ > 0 and η > 0 control how evenly the group weights and the feature weights, respectively, are distributed;
σ is used to adjust the degree of balance of the sample sizes between clusters;
the remaining factor is the newly added penalty coefficient that penalizes the cluster sample size.
10. The internet data clustering system according to claim 9, characterized in that the objective function computation module comprises a first solving unit, a second solving unit, a third solving unit and a fourth solving unit;
the first solving unit is configured to, with the variables Z, H and S fixed, solve the optimization problem with respect to U;
the second solving unit is configured to, with the variables U, H and S fixed, solve the optimization problem with respect to Z;
the third solving unit is configured to, with the variables U, Z and S fixed, solve the optimization problem with respect to H;
the fourth solving unit is configured to, with the variables U, Z and H fixed, solve the optimization problem with respect to S.
CN201510956891.2A 2015-12-17 2015-12-17 A kind of internet data clustering method and system Pending CN106897292A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201510956891.2A CN106897292A (en) 2015-12-17 2015-12-17 A kind of internet data clustering method and system

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201510956891.2A CN106897292A (en) 2015-12-17 2015-12-17 A kind of internet data clustering method and system

Publications (1)

Publication Number Publication Date
CN106897292A true CN106897292A (en) 2017-06-27

Family

ID=59188750

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201510956891.2A Pending CN106897292A (en) 2015-12-17 2015-12-17 A kind of internet data clustering method and system

Country Status (1)

Country Link
CN (1) CN106897292A (en)

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111241360A (en) * 2020-01-09 2020-06-05 腾讯科技(深圳)有限公司 Information recommendation method, device, equipment and storage medium
CN111310843A (en) * 2020-02-25 2020-06-19 苏州浪潮智能科技有限公司 Mass streaming data clustering method and system based on K-means
CN114077860A (en) * 2020-08-18 2022-02-22 鸿富锦精密电子(天津)有限公司 Method and system for sorting parts before assembly, electronic device and storage medium

Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106897276A (en) * 2015-12-17 2017-06-27 中国科学院深圳先进技术研究院 A kind of internet data clustering method and system

Patent Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106897276A (en) * 2015-12-17 2017-06-27 中国科学院深圳先进技术研究院 A kind of internet data clustering method and system

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
GAN, Guojun et al., "Subspace clustering with automatic feature grouping", Pattern Recognition 48 (2015) *
SHI, Dongsheng, "Research on Clustering Algorithms for High-Dimensional Data", China Master's Theses Full-text Database, Information Science and Technology *

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111241360A (en) * 2020-01-09 2020-06-05 腾讯科技(深圳)有限公司 Information recommendation method, device, equipment and storage medium
CN111241360B (en) * 2020-01-09 2023-03-21 深圳市雅阅科技有限公司 Information recommendation method, device, equipment and storage medium
CN111310843A (en) * 2020-02-25 2020-06-19 苏州浪潮智能科技有限公司 Mass streaming data clustering method and system based on K-means
CN114077860A (en) * 2020-08-18 2022-02-22 鸿富锦精密电子(天津)有限公司 Method and system for sorting parts before assembly, electronic device and storage medium


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication (application publication date: 20170627)