CN106897292A — An internet data clustering method and system (Google Patents)
Classifications
- G06F16/35 (G — Physics; G06 — Computing; G06F — Electric digital data processing; G06F16/00 — Information retrieval; G06F16/30 — Information retrieval of unstructured textual data): Clustering; Classification
Abstract
The invention belongs to the technical field of clustering algorithms, and in particular relates to an internet data clustering method and system. The internet data clustering method comprises the following steps. Step a: add a penalty coefficient to the computation term of the objective function of the original FG-k-means algorithm to form a new objective function. Step b: optimize and solve the parameters of the new objective function. Step c: compute the distance between each sample and each cluster centre from the optimized parameters, and assign samples to clusters according to these distances. By adding a penalty coefficient to the computation term of the objective function of the original FG-k-means algorithm, the internet data clustering method and system of the embodiment of the present invention effectively control the unbounded growth of cluster sample sizes from which the original FG-k-means algorithm suffers, while keeping the sample sizes of the clusters balanced and controllable, thereby achieving higher clustering accuracy.
Description
Technical field
The invention belongs to the technical field of clustering algorithms, and in particular relates to an internet data clustering method and system.
Background technology
With the arrival of the big-data era, the data faced in the field of data mining have become increasingly complex. Internet text data in particular, besides being enormous in volume, is of ultra-high dimensionality and sparsity when represented with the vector space model (Vector Space Model). Existing data-mining clustering algorithms, such as k-means, hierarchical clustering and the like, generally show shortcomings and limitations when applied to text clustering.
For the subspace clustering problem of high-dimensional sparse data, academia has proposed many related subspace clustering algorithms (Subspace Clustering). Soft subspace clustering is one such class; depending on the number of weighting layers, soft subspace clustering algorithms can be divided into single-layer and two-layer variants. FG-k-means is a two-layer soft subspace clustering algorithm proposed by Xiaojun Chen et al. in 2012. It introduces the concept of feature groups and weights groups and features simultaneously; when clustering ultra-high-dimensional sparse data, its performance is clearly better than that of single-layer soft subspace clustering algorithms. FG-k-means clusters data whose feature space contains grouping information, as illustrated by the FG-k-means simulation data set of Fig. 1. The FG-k-means feature space is defined as follows:
1) Let the training data set be X = {x_1, x_2, ..., x_N}, where x_i ∈ R^d (1 ≤ i ≤ N) denotes the i-th sample in the data set;
2) The feature set on X is V = {v_1, v_2, ..., v_d}. The features in V are contained in a group set G = {G_1, G_2, ..., G_T}, and G satisfies G_s ∩ G_t = ∅ for s ≠ t and G_1 ∪ G_2 ∪ ... ∪ G_T = V.
FG-k-means needs to find K clustering clusters on a data set of the above form, and at the same time find, for each cluster, the corresponding subspace of groups and features. As shown in Fig. 1, a data set example containing group information is defined: on the data set X = {x_1, x_2, ..., x_N}, the feature set is V = {v_1, v_2, ..., v_12} with feature count d = 12, the group set is G = {G_1, G_2, G_3}, and the features of different groups are mutually disjoint.
To solve the above clustering problem on a data set containing group information, FG-k-means assumes that each cluster l has one group of weights H_l on V and another, S_l, on G. When all the samples in a cluster are highly consistent on some feature or group, that feature or group is assigned a larger weight. The objective function is as follows:

F(U, Z, S, H) = Σ_{l=1}^{k} [ Σ_{i=1}^{N} Σ_{t=1}^{T} Σ_{j∈G_t} u_{i,l} s_{l,t} h_{l,j} (x_{i,j} − z_{l,j})² + λ Σ_{t=1}^{T} s_{l,t} log s_{l,t} + η Σ_{j=1}^{d} h_{l,j} log h_{l,j} ]    (1)

subject to the conditions

Σ_{l=1}^{k} u_{i,l} = 1 with u_{i,l} ∈ {0, 1};  Σ_{t=1}^{T} s_{l,t} = 1 with 0 ≤ s_{l,t} ≤ 1;  Σ_{j∈G_t} h_{l,j} = 1 with 0 ≤ h_{l,j} ≤ 1.    (2)

In the above:
● U is an N × k matrix; u_{i,l} = 1 indicates that the i-th sample belongs to the l-th cluster;
● Z = {Z_1, Z_2, ..., Z_k} denotes the k cluster centres;
● H is a k × d weight matrix; h_{l,j} denotes the weight of the j-th feature (keyword) in the l-th cluster;
● S is a k × T weight matrix; s_{l,t} denotes the weight of the t-th group in the l-th cluster;
● λ > 0 and η > 0 are respectively used to control how evenly the group weights and the feature weights are distributed.
The parameters of objective function (1) can be solved by an iterative optimization method.
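The iterative solution alternates between fixing some parameters and solving for the rest in closed form. A minimal sketch of this alternating scheme, reduced for brevity to the plain k-means special case (uniform group and feature weights); the function and variable names are illustrative, not from the patent:

```python
import numpy as np

# Minimal sketch of alternating optimization for objectives of this family,
# reduced to the plain k-means special case (uniform group/feature weights).

def alternating_minimize(X, k, n_iter=10):
    n = len(X)
    # deterministic initial centres: k evenly spaced samples
    Z = X[np.linspace(0, n - 1, k, dtype=int)].astype(float)
    labels = np.zeros(n, dtype=int)
    for _ in range(n_iter):
        # step 1: fix Z, update the assignment (the U matrix, here a label vector)
        d2 = ((X[:, None, :] - Z[None, :, :]) ** 2).sum(axis=2)
        labels = d2.argmin(axis=1)
        # step 2: fix U, update each centre as the mean of its members
        for l in range(k):
            if (labels == l).any():
                Z[l] = X[labels == l].mean(axis=0)
    return labels, Z
```

In the full FG-k-means iteration the same fix-and-solve pattern is applied to U, Z, H and S in turn.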
In summary, the existing FG-k-means algorithm has the following disadvantage: although features are weighted during clustering, the influence of useless features on the clusters is ignored. When a sample differs significantly from a cluster centre in those dimensions, yet is extremely similar to it in the features carrying large weights in that cluster, the algorithm cannot effectively separate the sample from the cluster. The presence of a large number of unrelated samples causes the cluster's sample size to grow, leading to imbalance between clusters and reducing the robustness of the FG-k-means algorithm.
Content of the invention
The invention provides an internet data clustering method and system, intended to solve, at least to some extent, one of the above technical problems in the existing technology.
The invention is implemented as follows. An internet data clustering method comprises the following steps:
Step a: add a penalty coefficient to the computation term of the objective function of the original FG-k-means algorithm to form a new objective function;
Step b: optimize and solve the parameters of the new objective function;
Step c: compute the distance between each sample and each cluster centre from the optimized parameters, and assign samples to clusters according to these distances.
The technical scheme taken by the embodiment of the present invention also includes: in step a, the new objective function is

F(U, Z, S, H) = Σ_{l=1}^{k} [ β_l Σ_{i=1}^{N} Σ_{t=1}^{T} Σ_{j∈G_t} u_{i,l} s_{l,t} h_{l,j} (x_{i,j} − z_{l,j})² + λ Σ_{t=1}^{T} s_{l,t} log s_{l,t} + η Σ_{j=1}^{d} h_{l,j} log h_{l,j} ]    (3)

subject to the conditions

Σ_{l=1}^{k} u_{i,l} = 1 with u_{i,l} ∈ {0, 1};  Σ_{t=1}^{T} s_{l,t} = 1 with 0 ≤ s_{l,t} ≤ 1;  Σ_{j∈G_t} h_{l,j} = 1 with 0 ≤ h_{l,j} ≤ 1.    (4)

In the above formulae:
U is an N × k matrix; u_{i,l} = 1 indicates that the i-th sample belongs to the l-th cluster;
Z = {Z_1, Z_2, ..., Z_k} denotes the k cluster centres;
H is a k × d weight matrix; h_{l,j} denotes the weight of the j-th feature (keyword) in the l-th cluster;
S is a k × T weight matrix; s_{l,t} denotes the weight of the t-th group in the l-th cluster;
p_l denotes the fraction of all samples that lie in cluster l;
λ > 0 and η > 0 are respectively used to control how evenly the weights are distributed;
σ is used to adjust the degree of balance of the sample sizes between clusters;
β_l is the newly added penalty coefficient that penalizes the cluster sample size, expressed as a fraction controlled by p_l and σ.
The technical scheme taken by the embodiment of the present invention also includes: in step b, optimizing and solving the parameters of the new objective function specifically comprises the following steps:
Step b1: fix Z, S and H, and solve the optimization problem min_U F(U, Z, S, H);
Step b2: fix U, S and H, and solve the optimization problem min_Z F(U, Z, S, H);
Step b3: fix U, Z and S, and solve the optimization problem min_H F(U, Z, S, H);
Step b4: fix U, Z and H, and solve the optimization problem min_S F(U, Z, S, H).
The technical scheme taken by the embodiment of the present invention also includes: in step b1, U is solved by assigning each sample to the cluster with the smallest penalty-scaled weighted distance; in step b2, each cluster centre is solved as the weighted mean of the samples assigned to it.
The technical scheme taken by the embodiment of the present invention also includes: in step b3, the optimization problem is solved by the closed-form update formula for the feature weights h_{l,j}.
The technical scheme taken by the embodiment of the present invention also includes: in step b4, the optimization problem is solved by the closed-form update formula for the group weights s_{l,t}.
The technical scheme taken by the embodiment of the present invention also includes: in step c, computing the distance between each sample and each cluster centre from the optimized parameters and assigning samples to clusters according to these distances is specifically as follows. p_l denotes the fraction of all samples that lie in cluster l. When the sample fraction p_l of a cluster becomes too large, the penalty coefficient of that cluster moves closer to 1 and exceeds those of the other clusters. When the distances of a sample x are computed, the distance to each cluster centre is scaled by the corresponding penalty coefficient; the scaling shrinks the distance to cluster l only slightly, while shrinking the distances to the other clusters more strongly. If a sample x is located close to the centres of both cluster l and another cluster r, then because the penalty coefficient of r is smaller, its distance is reduced by a larger proportion; after scaling, the distance between x and r becomes smaller than the distance between x and l, and sample x is therefore assigned to cluster r.
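The scaling described above can be sketched as follows. The per-cluster penalty coefficients are passed in as a plain array `beta`, since the text specifies only that each coefficient is a fraction controlled by p_l and σ; the values used here are illustrative:

```python
import numpy as np

# Sketch of the penalty-scaled assignment of step c.  beta[l] is the penalty
# coefficient of cluster l (hypothetical values); a sample is assigned to the
# cluster minimising beta_l times its squared distance to centre l.

def assign_with_penalty(X, Z, beta):
    d2 = ((X[:, None, :] - Z[None, :, :]) ** 2).sum(axis=2)   # (n, k)
    return (beta[None, :] * d2).argmin(axis=1)

# A sample midway between two centres goes to the cluster whose
# coefficient (i.e. whose sample fraction) is smaller.
X = np.array([[5.0, 0.0]])
Z = np.array([[0.0, 0.0], [10.0, 0.0]])
print(assign_with_penalty(X, Z, np.array([1.0, 1.0])))   # equal coefficients
print(assign_with_penalty(X, Z, np.array([1.0, 0.5])))   # cluster 1 penalised less
```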
Another technical scheme taken by the embodiment of the present invention is: an internet data clustering system, including an objective function update module and an objective function computation module. The objective function update module is used to add a penalty coefficient to the computation term of the objective function of the original FG-k-means algorithm, forming a new objective function. The objective function computation module is used to optimize and solve the parameters of the new objective function, compute the distance between each sample and each cluster centre from the optimized parameters, and assign samples to clusters according to these distances.
The technical scheme taken by the embodiment of the present invention also includes: the new objective function formed by the objective function update module is the FG-k-means objective function in which the weighted distance term of each cluster is multiplied by a penalty coefficient, subject to the original constraints. In the formulae:
U is an N × k matrix; u_{i,l} = 1 indicates that the i-th sample belongs to the l-th cluster;
Z = {Z_1, Z_2, ..., Z_k} denotes the k cluster centres;
H is a k × d weight matrix; h_{l,j} denotes the weight of the j-th feature (keyword) in the l-th cluster;
S is a k × T weight matrix; s_{l,t} denotes the weight of the t-th group in the l-th cluster;
p_l denotes the fraction of all samples that lie in cluster l;
λ > 0 and η > 0 are respectively used to control how evenly the weights are distributed;
σ is used to adjust the degree of balance of the sample sizes between clusters;
β_l is the newly added penalty coefficient that penalizes the cluster sample size, expressed as a fraction controlled by p_l and σ.
The technical scheme taken by the embodiment of the present invention also includes: the objective function computation module includes a first solving unit, a second solving unit, a third solving unit and a fourth solving unit.
The first solving unit is used to fix Z, S and H and solve the optimization problem min_U F(U, Z, S, H);
The second solving unit is used to fix U, S and H and solve the optimization problem min_Z F(U, Z, S, H);
The third solving unit is used to fix U, Z and S and solve the optimization problem min_H F(U, Z, S, H);
The fourth solving unit is used to fix U, Z and H and solve the optimization problem min_S F(U, Z, S, H).
The internet data clustering method and system of the embodiment of the present invention add a penalty coefficient to the computation term of the objective function of the original FG-k-means algorithm. This effectively controls the unbounded growth of cluster sample sizes from which the original FG-k-means algorithm suffers, while keeping the sample sizes of the clusters balanced and controllable, thereby achieving higher clustering accuracy.
Brief description of the drawings
Fig. 1 is the FG-k-means algorithm simulation data set;
Fig. 2 is the flow chart of the internet data clustering method of the embodiment of the present invention;
Fig. 3 is the flow chart of the objective-function optimization and solving method of the embodiment of the present invention;
Fig. 4 is the structural schematic diagram of the internet data clustering system of the embodiment of the present invention.
Specific embodiments
In order to make the purpose, technical scheme and advantages of the present invention clearer, the present invention is further elaborated below in conjunction with the drawings and embodiments. It should be appreciated that the specific embodiments described herein are merely illustrative of the present invention and are not intended to limit it.
Referring to Fig. 2, the flow chart of the internet data clustering method of the embodiment of the present invention, the method comprises the following steps.
Step 100: add a penalty coefficient to the computation term of the objective function of the original FG-k-means algorithm, forming a new objective function.
In step 100, the new objective function is

F(U, Z, S, H) = Σ_{l=1}^{k} [ β_l Σ_{i=1}^{N} Σ_{t=1}^{T} Σ_{j∈G_t} u_{i,l} s_{l,t} h_{l,j} (x_{i,j} − z_{l,j})² + λ Σ_{t=1}^{T} s_{l,t} log s_{l,t} + η Σ_{j=1}^{d} h_{l,j} log h_{l,j} ]    (3)

subject to the conditions

Σ_{l=1}^{k} u_{i,l} = 1 with u_{i,l} ∈ {0, 1};  Σ_{t=1}^{T} s_{l,t} = 1 with 0 ≤ s_{l,t} ≤ 1;  Σ_{j∈G_t} h_{l,j} = 1 with 0 ≤ h_{l,j} ≤ 1.    (4)

In formula (3) and formula (4):
U is an N × k matrix; u_{i,l} = 1 indicates that the i-th sample belongs to the l-th cluster;
Z = {Z_1, Z_2, ..., Z_k} denotes the k cluster centres;
H is a k × d weight matrix; h_{l,j} denotes the weight of the j-th feature (keyword) in the l-th cluster;
S is a k × T weight matrix; s_{l,t} denotes the weight of the t-th group in the l-th cluster;
p_l denotes the fraction of all samples that lie in cluster l;
λ > 0 and η > 0 are respectively used to control how evenly the weights are distributed;
σ is used to adjust the degree of balance of the sample sizes between clusters;
β_l is the newly added penalty coefficient that penalizes the cluster sample size. The penalty coefficient is expressed in fractional form and is controlled by the two parameters p_l and σ, which can be adjusted according to the practical application. σ regulates the balance of sample sizes between clusters and normalizes over the total number of samples of the data set, while the parameter p_l is introduced to remove the influence of the data-set sample size on the reasonable value range of σ. In the embodiment of the present invention, the reasonable value range of σ in the penalty coefficient is the interval [0, 0.7]; that is, the parameter of the penalty coefficient lies in a small, controllable interval, which facilitates regulating the degree of balance of the cluster sample sizes.
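The exact fractional formula of the coefficient is not reproduced in this text. Purely for illustration, a stand-in of the form p_l / (p_l + σ) exhibits the qualitative behaviour described: it equals 1 for every cluster when σ = 0 (recovering plain FG-k-means), and it is closer to 1 for larger clusters:

```python
# Illustration only: beta(p, sigma) is a hypothetical stand-in for the
# patent's fractional penalty coefficient, chosen to show the two stated
# properties, not the actual formula.

def beta(p_l, sigma):
    return p_l / (p_l + sigma)

fractions = [0.6, 0.3, 0.1]                          # p_l for three clusters
assert all(beta(p, 0.0) == 1.0 for p in fractions)   # sigma = 0: plain FG-k-means
print([round(beta(p, 0.5), 3) for p in fractions])   # -> [0.545, 0.375, 0.167]
```

With σ > 0, the largest cluster keeps the coefficient closest to 1, so its distances shrink the least and boundary samples drift toward smaller clusters.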
Step 200: optimize and solve the parameters of the new objective function, compute the distance between each sample and each cluster centre from the optimized parameters, and assign samples to clusters according to these distances.
In step 200, referring also to Fig. 3, the flow chart of the objective-function optimization and solving method of the embodiment of the present invention, the method comprises the following steps.
Step 201: fix Z, S and H, and solve the optimization problem min_U F(U, Z, S, H).
In step 201, with reference to the derivation of FG-k-means, U can be updated according to the following strategy: assign sample i to the cluster l that minimizes the penalty-scaled weighted distance β_l Σ_{t=1}^{T} Σ_{j∈G_t} s_{l,t} h_{l,j} (x_{i,j} − z_{l,j})², where β_l is the penalty coefficient of cluster l; that is, set u_{i,l} = 1 for that l and u_{i,r} = 0 for all r ≠ l.    (5)
Step 202: fix U, S and H, and solve the optimization problem min_Z F(U, Z, S, H).
In step 202, the elements of Z can be updated according to the following strategy:

z_{l,j} = Σ_{i=1}^{N} u_{i,l} x_{i,j} / Σ_{i=1}^{N} u_{i,l}.    (6)
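The centre update of step 202 is the usual mean rule and is unaffected by the penalty coefficient; a minimal sketch (names illustrative):

```python
import numpy as np

# Sketch of the centre update of step 202: with the assignment fixed, each
# centre z_l is the mean of the samples currently assigned to cluster l.

def update_centres(X, labels, k):
    Z = np.zeros((k, X.shape[1]))
    for l in range(k):
        members = X[labels == l]
        if len(members):               # leave an empty cluster's centre at zero
            Z[l] = members.mean(axis=0)
    return Z
```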
Step 203: fix U, Z and S, and solve the optimization problem min_H F(U, Z, S, H).
In step 203, the solution procedure follows from Theorem 1.
Theorem 1: suppose U, Z and S are fixed and η > 0; then the objective function F(U, Z, S, H) attains its minimum value if and only if the feature weights are given by formula (16) below.
Proof of Theorem 1:
Given the three fixed variables U, Z and S, the value h_{l,j}, which represents the weight of the j-th feature in the l-th cluster, must be solved for under minimization of objective function (3). Let

E_{l,j} = β_l Σ_{i=1}^{N} u_{i,l} s_{l,t} (x_{i,j} − z_{l,j})²,    (8)

where β_l denotes the penalty coefficient of cluster l; with U, Z and S fixed, E_{l,j} is a constant for feature j of group t in cluster l and can be computed directly from formula (8). Because of the k × T constraints Σ_{j∈G_t} h_{l,j} = 1, introducing Lagrange multipliers γ_{l,t} into objective function (3) gives

L(H, γ) = Σ_{l=1}^{k} Σ_{t=1}^{T} [ Σ_{j∈G_t} ( E_{l,j} h_{l,j} + η h_{l,j} log h_{l,j} ) − γ_{l,t} ( Σ_{j∈G_t} h_{l,j} − 1 ) ].    (9)

Taking the derivatives of (9) with respect to γ_{l,t} and h_{l,j} and setting them to zero gives

Σ_{j∈G_t} h_{l,j} − 1 = 0,    (11)
E_{l,j} + η ( log h_{l,j} + 1 ) − γ_{l,t} = 0,    (12)

where t is the number of the group to which feature j belongs. Simplifying (12) gives

h_{l,j} = exp( (γ_{l,t} − η − E_{l,j}) / η ).    (13)

Substituting (13) into (11) gives

exp( (γ_{l,t} − η) / η ) Σ_{j∈G_t} exp( −E_{l,j} / η ) = 1.    (14)

Simplifying (14) gives

exp( (γ_{l,t} − η) / η ) = 1 / Σ_{j'∈G_t} exp( −E_{l,j'} / η ).    (15)

Finally, substituting (15) back into (13) yields the update rule

h_{l,j} = exp( −E_{l,j} / η ) / Σ_{j'∈G_t} exp( −E_{l,j'} / η ).    (16)
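The closed form established by Theorem 1 is a softmax of −E_{l,j}/η within each feature group; a sketch, with the dispersions E assumed precomputed and passed in. The analogous Theorem-2 update for the group weights has the same shape, with λ in place of η and one normalisation per cluster rather than per group:

```python
import numpy as np

# Sketch of the Theorem-1 update: within each feature group G_t, the feature
# weights of a cluster form a softmax of -E/eta, where E[j] is the (fixed)
# dispersion of feature j in that cluster.

def update_feature_weights(E, groups, eta):
    h = np.empty_like(E, dtype=float)
    for idx in groups:                 # one normalisation per feature group
        w = np.exp(-E[idx] / eta)
        h[idx] = w / w.sum()           # weights sum to 1 inside each group
    return h

E = np.array([0.0, 0.0, 0.0, 1.0])     # feature dispersions in one cluster
h = update_feature_weights(E, [[0, 1], [2, 3]], eta=1.0)
print(h.round(3))                      # low-dispersion features get more weight
```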
Step 204: fix U, Z and H, and solve the optimization problem min_S F(U, Z, S, H).
In step 204, the solution follows from Theorem 2.
Theorem 2: suppose U, Z and H are fixed and λ > 0; then the objective function F(U, Z, S, H) attains its minimum value if and only if the group weights are given by formula (24) below.
Proof of Theorem 2:
Given the three fixed variables U, Z and H, the value s_{l,t}, which represents the weight of the t-th group in the l-th cluster, must be solved for under minimization of objective function (3). Let

D_{l,t} = β_l Σ_{j∈G_t} Σ_{i=1}^{N} u_{i,l} h_{l,j} (x_{i,j} − z_{l,j})²,    (17)

where β_l denotes the penalty coefficient of cluster l; with U, Z and H fixed, D_{l,t} is a constant for the t-th feature group in the l-th cluster and can be computed directly from formula (17). Because of the k constraints Σ_{t=1}^{T} s_{l,t} = 1, introducing Lagrange multipliers γ_l into objective function (3) gives

L(S, γ) = Σ_{l=1}^{k} [ Σ_{t=1}^{T} ( D_{l,t} s_{l,t} + λ s_{l,t} log s_{l,t} ) − γ_l ( Σ_{t=1}^{T} s_{l,t} − 1 ) ].    (18)

Taking the derivatives of (18) with respect to γ_l and s_{l,t} and setting them to zero gives

Σ_{t=1}^{T} s_{l,t} − 1 = 0,    (19)
D_{l,t} + λ ( log s_{l,t} + 1 ) − γ_l = 0.    (20)

Simplifying (20) gives

s_{l,t} = exp( (γ_l − λ − D_{l,t}) / λ ).    (21)

Substituting (21) into (19) gives

exp( (γ_l − λ) / λ ) Σ_{t=1}^{T} exp( −D_{l,t} / λ ) = 1.    (22)

Simplifying (22) gives

exp( (γ_l − λ) / λ ) = 1 / Σ_{t'=1}^{T} exp( −D_{l,t'} / λ ).    (23)

Finally, substituting (23) back into (21) yields the update rule

s_{l,t} = exp( −D_{l,t} / λ ) / Σ_{t'=1}^{T} exp( −D_{l,t'} / λ ).    (24)
In the above algorithm, it can be found that the value of σ influences the overall performance of the algorithm; when σ = 0, the algorithm is equivalent to the FG-k-means algorithm. From the above update formulae and the FG-k-means algorithm flow, the pseudo-code of the algorithm of the invention is obtained as follows.
Pseudo-code of the algorithm
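A hedged sketch of the loop described by steps 201–204 with the penalty-scaled assignment. The penalty form `p / (p + sigma)` is an illustrative stand-in with the stated properties (it is identically 1 when σ = 0), not the patent's actual formula, and numerical safeguards such as exp-underflow handling are omitted:

```python
import numpy as np

# Sketch of the full loop of steps 201-204 with penalty-scaled assignment.
# beta's fractional form is a hypothetical stand-in; the weight updates
# follow the closed forms of Theorems 1 and 2.

def penalized_fg_kmeans(X, groups, k, lam=1.0, eta=1.0, sigma=0.3, n_iter=20):
    n, d = X.shape
    T = len(groups)
    Z = X[np.linspace(0, n - 1, k, dtype=int)].astype(float)
    S = np.full((k, T), 1.0 / T)                       # group weights
    H = np.zeros((k, d))                               # feature weights
    for idx in groups:
        H[:, idx] = 1.0 / len(idx)
    labels = ((X[:, None, :] - Z[None, :, :]) ** 2).sum(axis=2).argmin(axis=1)
    for _ in range(n_iter):
        p = np.bincount(labels, minlength=k) / n
        p = np.maximum(p, 1.0 / n)                     # guard for empty clusters
        beta = p / (p + sigma)                         # hypothetical penalty form
        # step 201: update U using penalty-scaled, weighted distances
        d2 = np.zeros((n, k))
        for l in range(k):
            w = np.zeros(d)
            for t, idx in enumerate(groups):
                w[idx] = S[l, t] * H[l, idx]
            d2[:, l] = beta[l] * (((X - Z[l]) ** 2) * w).sum(axis=1)
        labels = d2.argmin(axis=1)
        # step 202: update Z
        for l in range(k):
            if (labels == l).any():
                Z[l] = X[labels == l].mean(axis=0)
        # steps 203-204: update H (Theorem 1) and S (Theorem 2)
        for l in range(k):
            mask = labels == l
            E = ((X[mask] - Z[l]) ** 2).sum(axis=0) if mask.any() else np.zeros(d)
            D = np.empty(T)
            for t, idx in enumerate(groups):
                e = np.exp(-E[idx] / eta)
                H[l, idx] = e / e.sum()
                D[t] = E[idx].sum()
            e = np.exp(-D / lam)
            S[l] = e / e.sum()
    return labels
```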
p_l denotes the fraction of all samples that lie in cluster l. When the sample fraction p_l of a cluster becomes too large, the penalty coefficient of that cluster moves closer to 1 and exceeds those of the other clusters. When the distances of a sample x are computed according to (5), the distance to each cluster centre is scaled by the corresponding penalty coefficient; the scaling shrinks the distance to cluster l only slightly while shrinking the distances to the other clusters more strongly. Suppose a sample x is located close to the centres of both cluster l and another cluster r; because the penalty coefficient of r is smaller, its distance is reduced by a larger proportion, so after scaling the distance between x and r becomes smaller than the distance between x and l, and sample x is assigned to cluster r. In this way the unbounded growth of the sample size of cluster l is limited: in a cluster with many samples, only the samples genuinely near the cluster centre stay in the cluster, while samples far from the centre are assigned to clusters with fewer samples.
Referring to Fig. 4, the structural schematic diagram of the internet data clustering system of the embodiment of the present invention, the system includes an objective function update module and an objective function computation module.
The objective function update module is used to add a penalty coefficient to the computation term of the objective function of the original FG-k-means algorithm, forming the new objective function (3) subject to conditions (4), with U, Z, H, S, p_l, λ, η and σ defined as in step 100. The penalty coefficient is expressed in fractional form and is controlled by the two parameters p_l and σ, which can be adjusted according to the practical application: σ regulates the balance of sample sizes between clusters and normalizes over the total number of samples of the data set, while p_l is introduced to remove the influence of the data-set sample size on the reasonable value range of σ. In the embodiment of the present invention, the reasonable value range of σ in the penalty coefficient is the interval [0, 0.7]; that is, the parameter of the penalty coefficient lies in a small, controllable interval, which facilitates regulating the degree of balance of the cluster sample sizes.
The objective function computation module is used to optimize and solve the parameters of the new objective function, compute the distance between each sample and each cluster centre from the optimized parameters, and assign samples to clusters according to these distances. Specifically, the objective function computation module includes a first solving unit, a second solving unit, a third solving unit and a fourth solving unit.
The first solving unit is used to fix Z, S and H and solve the optimization problem min_U F(U, Z, S, H); with reference to the derivation of FG-k-means, U is updated by assigning each sample to the cluster with the smallest penalty-scaled weighted distance, as in step 201.
The second solving unit is used to fix U, S and H and solve the optimization problem min_Z F(U, Z, S, H); the elements of Z are updated as the means of the samples assigned to each cluster, as in step 202.
The third solving unit is used to fix U, Z and S and solve the optimization problem min_H F(U, Z, S, H); the solution procedure is given by Theorem 1 and its proof as described in step 203.
The fourth solving unit is used to fix U, Z and H and solve the optimization problem min_S F(U, Z, S, H); the solution is given by Theorem 2 and its proof as described in step 204.
In the above algorithm, it can be found that the value of σ influences the overall performance of the algorithm; when σ = 0, the algorithm is equivalent to the FG-k-means algorithm. The penalty-scaled distance computation and cluster assignment behave as described above: a cluster whose sample fraction p_l grows too large receives a penalty coefficient closer to 1 than those of the other clusters, so samples near the boundary of a large cluster are reassigned to smaller clusters. In this way the unbounded growth of any one cluster's sample size is limited: in a cluster with many samples, only the samples genuinely near the cluster centre stay in the cluster, while samples far from the centre are assigned to clusters with fewer samples.
The internet data clustering method and system of the embodiment of the present invention add a penalty coefficient to the computation term of the objective function of the original FG-k-means algorithm. This effectively controls the unbounded growth of cluster sample sizes from which the original FG-k-means algorithm suffers, while keeping the sample sizes of the clusters balanced and controllable, thereby achieving higher clustering accuracy.
The foregoing describes only the preferred embodiments of the present invention and is not intended to limit the invention; any modification, equivalent replacement and improvement made within the spirit and principles of the invention shall be included within the protection scope of the present invention.
Claims (10)
1. An internet data clustering method, comprising the following steps:
step a: adding a penalty coefficient to the computation term of the objective function of the original FG-k-means algorithm to form a new objective function;
step b: optimizing and solving the parameters of the new objective function;
step c: computing the distance between each sample and each cluster centre from the optimized parameters, and assigning samples to clusters according to the distances between the samples and the cluster centres.
2. The internet data clustering method according to claim 1, characterised in that, in step a, the new objective function is the FG-k-means objective function in which the weighted distance term of each cluster is multiplied by a penalty coefficient, subject to the original constraints, wherein:
U is an N × k matrix; u_{i,l} = 1 indicates that the i-th sample belongs to the l-th cluster;
Z = {Z_1, Z_2, ..., Z_k} denotes the k cluster centres;
H is a k × d weight matrix; h_{l,j} denotes the weight of the j-th feature (keyword) in the l-th cluster;
S is a k × T weight matrix; s_{l,t} denotes the weight of the t-th group in the l-th cluster;
p_l denotes the fraction of all samples that lie in cluster l;
λ > 0 and η > 0 are respectively used to control how evenly the weights are distributed;
σ is used to adjust the degree of balance of the sample sizes between clusters;
the penalty coefficient, newly added to penalize the cluster sample size, is a fraction controlled by p_l and σ.
3. The internet data clustering method according to claim 2, characterised in that, in step b, optimizing and solving the parameters of the new objective function specifically comprises the following steps:
step b1: fixing Z, S and H, and solving the optimization problem min_U F(U, Z, S, H);
step b2: fixing U, S and H, and solving the optimization problem min_Z F(U, Z, S, H);
step b3: fixing U, Z and S, and solving the optimization problem min_H F(U, Z, S, H);
step b4: fixing U, Z and H, and solving the optimization problem min_S F(U, Z, S, H).
4. The internet data clustering method according to claim 3, characterised in that, in step b1, U is solved by assigning each sample to the cluster with the smallest penalty-scaled weighted distance, and in step b2, each cluster centre is solved as the weighted mean of the samples assigned to it.
5. The internet data clustering method according to claim 4, characterised in that, in step b3, the optimization problem is solved by the closed-form update formula for the feature weights h_{l,j}.
6. The internet data clustering method according to claim 5, characterised in that, in step b4, the optimization problem is solved by the closed-form update formula for the group weights s_{l,t}.
7. The internet data clustering method according to claim 1, characterised in that, in step c, computing the distance between each sample and each cluster centre from the optimized parameters and assigning samples to clusters according to these distances is specifically as follows: p_l denotes the fraction of all samples that lie in cluster l; when the sample fraction p_l of a cluster becomes too large, the penalty coefficient of that cluster moves closer to 1 and exceeds those of the other clusters; when the distances of a sample x are computed, the distance to each cluster centre is scaled by the corresponding penalty coefficient, the scaling on cluster l being slighter and that on the other clusters being stronger; if a sample x is located close to the centres of both cluster l and another cluster r, then because the penalty coefficient of r is smaller, its distance is reduced by a larger proportion, so that after scaling the distance between x and r is smaller than that between x and l, and sample x is assigned to cluster r.
8. An internet data clustering system, characterized by comprising an objective function update module and an objective function computation module; the objective function update module is used to add a penalty coefficient to a computation term in the objective function of the original FG-k-means algorithm, forming a new objective function; the objective function computation module is used to optimize the parameters of the new objective function, calculate the distances between samples and cluster centres according to the optimized parameters, and assign samples to clustering clusters according to those distances.
9. The internet data clustering system according to claim 8, characterized in that the new objective function formed by the objective function update module is:
subject to the condition:
In the above formulas:
U is an n × k matrix, where u_{i,l} = 1 indicates that the i-th example belongs to the l-th cluster centre;
Z = {Z_1, Z_2, ..., Z_k} represents the k cluster centres;
H is a k × m weight matrix, where H_{l,j} represents the weight of the j-th keyword at the l-th cluster centre;
S is a k × T weight matrix, where S_{l,t} represents the weight of the t-th group at the l-th cluster centre;
p_l represents the percentage of all samples that fall in clustering cluster l;
λ > 0 and η > 0 are respectively used to control how evenly the weights are distributed;
σ is used to adjust the degree of sample-size balance between the clustering clusters;
and the newly added term is the penalty coefficient that penalizes the clustering-cluster sample size.
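The formula images in this claim did not survive extraction. Based on the variable definitions above and on the FG-k-means objective in the cited Gan paper (Pattern Recognition 48 (2015)), the objective plausibly has the following form; this is a reconstruction, not the patent's verbatim formula, and the exact shape of the added penalty coefficient is not recoverable from the text, so it is shown only as an unspecified factor c_l(σ, p_l):

```latex
P(U,Z,S,H) \;=\; \sum_{l=1}^{k}\sum_{i=1}^{n} u_{i,l}\, c_l(\sigma, p_l)
  \sum_{t=1}^{T}\sum_{j\in G_t} S_{l,t}\, H_{l,j}\,\bigl(x_{i,j}-z_{l,j}\bigr)^2
  \;+\; \lambda \sum_{l=1}^{k}\sum_{t=1}^{T} S_{l,t}\log S_{l,t}
  \;+\; \eta \sum_{l=1}^{k}\sum_{j=1}^{m} H_{l,j}\log H_{l,j}
```

```latex
\text{subject to}\quad
\sum_{l=1}^{k} u_{i,l} = 1,\; u_{i,l}\in\{0,1\};\qquad
\sum_{t=1}^{T} S_{l,t} = 1;\qquad
\sum_{j\in G_t} H_{l,j} = 1,
```

where G_t denotes the set of keyword indices belonging to the t-th group.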
10. The internet data clustering system according to claim 9, characterized in that the objective function computation module comprises a first solving unit, a second solving unit, a third solving unit and a fourth solving unit;
the first solving unit is used to solve the optimization problem for its variable block while the other variables are held fixed;
the second solving unit is used to solve the optimization problem for its variable block while the other variables are held fixed;
the third solving unit is used to solve the optimization problem for its variable block while the other variables are held fixed;
the fourth solving unit is used to solve the optimization problem for its variable block while the other variables are held fixed.
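The four solving units describe block-coordinate (alternating) optimization: each unit solves for one variable block while the other three are held fixed, and the units are applied in turn until the objective stops decreasing. A minimal sketch of that control flow follows, with hypothetical `update_*` callables standing in for the closed-form solutions, which appear only as images in the source:

```python
def alternating_optimize(U, Z, S, H, update_U, update_Z, update_S, update_H,
                         objective, tol=1e-6, max_iter=100):
    """Block-coordinate descent: each update_* solves one subproblem with
    the other three variable blocks held fixed (the patent's four
    'solving units'). The update_* callables are placeholders here."""
    prev = objective(U, Z, S, H)
    for _ in range(max_iter):
        U = update_U(U, Z, S, H)   # first solving unit: Z, S, H fixed
        Z = update_Z(U, Z, S, H)   # second solving unit: U, S, H fixed
        S = update_S(U, Z, S, H)   # third solving unit: U, Z, H fixed
        H = update_H(U, Z, S, H)   # fourth solving unit: U, Z, S fixed
        cur = objective(U, Z, S, H)
        if abs(prev - cur) < tol:  # objective has stopped decreasing
            break
        prev = cur
    return U, Z, S, H
```

Because each unit minimizes the same objective over one block, the objective value is non-increasing across iterations, which is the usual convergence argument for k-means-style algorithms.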
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201510956891.2A CN106897292A (en) | 2015-12-17 | 2015-12-17 | A kind of internet data clustering method and system |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201510956891.2A CN106897292A (en) | 2015-12-17 | 2015-12-17 | A kind of internet data clustering method and system |
Publications (1)
Publication Number | Publication Date |
---|---|
CN106897292A true CN106897292A (en) | 2017-06-27 |
Family
ID=59188750
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201510956891.2A Pending CN106897292A (en) | 2015-12-17 | 2015-12-17 | A kind of internet data clustering method and system |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN106897292A (en) |
Cited By (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN111241360A (en) * | 2020-01-09 | 2020-06-05 | 腾讯科技(深圳)有限公司 | Information recommendation method, device, equipment and storage medium |
CN111310843A (en) * | 2020-02-25 | 2020-06-19 | 苏州浪潮智能科技有限公司 | Mass streaming data clustering method and system based on K-means |
CN114077860A (en) * | 2020-08-18 | 2022-02-22 | 鸿富锦精密电子(天津)有限公司 | Method and system for sorting parts before assembly, electronic device and storage medium |
Citations (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN106897276A (en) * | 2015-12-17 | 2017-06-27 | 中国科学院深圳先进技术研究院 | A kind of internet data clustering method and system |
2015-12-17: CN application CN201510956891.2A filed; published as CN106897292A (en); status: active, Pending
Patent Citations (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN106897276A (en) * | 2015-12-17 | 2017-06-27 | 中国科学院深圳先进技术研究院 | A kind of internet data clustering method and system |
Non-Patent Citations (2)
Title |
---|
GUOJUN GAN: "Subspace clustering with automatic feature grouping", Pattern Recognition 48 (2015) * |
SHI DONGSHENG: "Research on Clustering Algorithms Based on High-Dimensional Data", China Masters' Theses Full-text Database, Information Science and Technology Series * |
Cited By (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN111241360A (en) * | 2020-01-09 | 2020-06-05 | 腾讯科技(深圳)有限公司 | Information recommendation method, device, equipment and storage medium |
CN111241360B (en) * | 2020-01-09 | 2023-03-21 | 深圳市雅阅科技有限公司 | Information recommendation method, device, equipment and storage medium |
CN111310843A (en) * | 2020-02-25 | 2020-06-19 | 苏州浪潮智能科技有限公司 | Mass streaming data clustering method and system based on K-means |
CN114077860A (en) * | 2020-08-18 | 2022-02-22 | 鸿富锦精密电子(天津)有限公司 | Method and system for sorting parts before assembly, electronic device and storage medium |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN104050242B (en) | Feature selecting, sorting technique and its device based on maximum information coefficient | |
CN105005589B (en) | A kind of method and apparatus of text classification | |
CN103488662A (en) | Clustering method and system of parallelized self-organizing mapping neural network based on graphic processing unit | |
CN109871860A (en) | A kind of daily load curve dimensionality reduction clustering method based on core principle component analysis | |
Liu et al. | Resource-constrained federated edge learning with heterogeneous data: Formulation and analysis | |
CN107203785A (en) | Multipath Gaussian kernel Fuzzy c-Means Clustering Algorithm | |
CN110751355A (en) | Scientific and technological achievement assessment method and device | |
Mai et al. | RETRACTED ARTICLE: Research on semi supervised K-means clustering algorithm in data mining | |
CN114385376B (en) | Client selection method for federal learning of lower edge side of heterogeneous data | |
CN110322075A (en) | A kind of scenic spot passenger flow forecast method and system based on hybrid optimization RBF neural | |
CN108256630A (en) | A kind of over-fitting solution based on low dimensional manifold regularization neural network | |
CN106897292A (en) | A kind of internet data clustering method and system | |
Liu et al. | EACP: An effective automatic channel pruning for neural networks | |
CN104834746B (en) | Heterogeneous characteristic time series data evolution clustering method based on graphics processing unit | |
CN106897276A (en) | A kind of internet data clustering method and system | |
Liu et al. | Quantum-inspired African vultures optimization algorithm with elite mutation strategy for production scheduling problems | |
Qiao et al. | A framework for multi-prototype based federated learning: Towards the edge intelligence | |
CN103605493A (en) | Parallel sorting learning method and system based on graphics processing unit | |
CN109978051A (en) | Supervised classification method based on hybrid neural networks | |
CN104778205B (en) | A kind of mobile application sequence and clustering method based on Heterogeneous Information network | |
CN117407921A (en) | Differential privacy histogram release method and system based on must-connect and don-connect constraints | |
Zhang et al. | Information guiding and sharing enhanced simultaneous heat transfer search and its application to k-means optimization | |
Lu et al. | Adaptive asynchronous federated learning | |
CN106354886A (en) | Method for screening nearest neighbor by using potential neighbor relation graph in recommendation system | |
CN109901929A (en) | Cloud computing task share fair allocat method under server level constraint |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
RJ01 | Rejection of invention patent application after publication | Application publication date: 20170627 ||