CN106897292A - Internet data clustering method and system - Google Patents

Internet data clustering method and system

Info

Publication number
CN106897292A
Authority
CN
China
Prior art keywords
cluster
sample
clustering
object function
cluster centre
Prior art date
Legal status
Pending
Application number
CN201510956891.2A
Other languages
Chinese (zh)
Inventor
赵鹤
李栋
李栋一
黄哲学
姜青山
陈会
高琴
朱敏
蔡业首
Current Assignee
Shenzhen Institute of Advanced Technology of CAS
Original Assignee
Shenzhen Institute of Advanced Technology of CAS
Priority date
Filing date
Publication date
Application filed by Shenzhen Institute of Advanced Technology of CAS filed Critical Shenzhen Institute of Advanced Technology of CAS
Priority to CN201510956891.2A priority Critical patent/CN106897292A/en
Publication of CN106897292A publication Critical patent/CN106897292A/en
Pending legal-status Critical Current


Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/35Clustering; Classification


Abstract

The invention belongs to the technical field of clustering algorithms, and more particularly relates to an internet data clustering method and system. The internet data clustering method comprises the following steps: step a: adding a penalty coefficient to the computation term of the objective function of the original FG-k-means algorithm to form a new objective function; step b: optimizing the parameters of the new objective function; step c: computing the distance between each sample and each cluster centre from the optimized parameters, and assigning samples to clusters according to those distances. By adding a penalty coefficient to the computation term of the objective function of the original FG-k-means algorithm, the internet data clustering method and system of the embodiment of the present invention can effectively control the unbounded growth of cluster sample sizes that occurs in the original FG-k-means algorithm, while keeping the sample sizes of the clusters balanced and controllable, thereby achieving higher clustering accuracy.

Description

Internet data clustering method and system
Technical field
The invention belongs to the technical field of clustering algorithms, and more particularly relates to an internet data clustering method and system.
Background technology
With the arrival of the big data era, the data faced in data mining have become increasingly complex. Internet text data in particular are not only enormous in volume, but text data built with the vector space model (Vector Space Model) also have extremely high dimensionality and sparsity. Existing data mining clustering algorithms, such as k-means, hierarchical clustering and other common clustering algorithms, generally show deficiencies and limitations when applied to text clustering.
For the subspace clustering problem of high-dimensional sparse data, academia has proposed many related subspace clustering algorithms (Subspace Clustering). Soft subspace clustering algorithms are one such class; according to the number of weighting levels, soft subspace clustering algorithms can be divided into single-level soft subspace clustering algorithms and two-level soft subspace clustering algorithms. FG-k-means is a two-level soft subspace clustering algorithm proposed by Chen Xiaojun et al. in 2012. It introduces the concept of feature groups and weights both groups and individual features; when clustering ultra-high-dimensional sparse data, its effect is clearly better than that of single-level soft subspace clustering algorithms. FG-k-means clusters data whose feature space contains grouping information, as shown in Fig. 1, which is a simulated data set for the FG-k-means algorithm. The feature space of FG-k-means is defined as follows:
1) Let the training data set be X = {x_1, x_2, ..., x_N}, where x_i ∈ R^d (1 ≤ i ≤ N) denotes the i-th sample in the data set;
2) The feature set on X is V = {v_1, v_2, ..., v_d}; the features in V are contained in a group set G = {G_1, G_2, ..., G_T}, and G satisfies that the groups together cover V and are pairwise disjoint.
The FG-k-means algorithm needs to find K clusters on a data set of the above form, and at the same time find the subspace of groups and features corresponding to each cluster. As shown in Fig. 1, an example data set containing group information is defined: on the data set X = {x_1, x_2, ..., x_N}, the feature set is V = {v_1, v_2, ..., v_12}, the number of features is d = 12, the group set is G = {G_1, G_2, G_3}, and the features of different groups are mutually disjoint.
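To make this grouped feature space concrete, the following is a minimal sketch of how such a data set can be represented; the group sizes and index assignments are illustrative assumptions, not the partition shown in Fig. 1.

```python
import numpy as np

# Illustrative only: N samples with d = 12 features split into three disjoint
# groups, mirroring the X, V, G structure defined above (group sizes assumed).
N, d = 1000, 12
X = np.random.rand(N, d)          # data set X = {x_1, ..., x_N}, each x_i in R^d
feature_groups = [                # group set G = {G_1, G_2, G_3}: feature indices per group
    [0, 1, 2, 3],                 # G_1
    [4, 5, 6, 7, 8],              # G_2
    [9, 10, 11],                  # G_3
]
# The groups must cover all d features and be pairwise disjoint.
assert sorted(j for idx in feature_groups for j in idx) == list(range(d))
```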
To solve the above clustering problem on data sets containing group information, FG-k-means assumes that each cluster l has one set of weights H_l on V and one set of weights S_l on G. When all samples in a cluster are highly consistent on a certain feature or group, that feature or group is assigned a larger weight. Its objective function is as follows:
subject to the constraints:
● U is an n × k matrix; u_{i,l} = 1 indicates that the i-th sample belongs to the l-th cluster;
● Z = {Z_1, Z_2, ..., Z_k} denotes the k cluster centres;
● H is a k × m weight matrix; h_{l,j} denotes the weight of the j-th keyword (feature) in the l-th cluster;
● S is a k × T weight matrix; s_{l,t} denotes the weight of the t-th group in the l-th cluster;
● λ > 0 and η > 0 control how evenly the group weights and the feature weights, respectively, are distributed.
The parameters of objective function (1) can be solved by an iterative optimization method.
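As a rough illustration, the following sketch evaluates an objective of this form for fixed U, Z, H and S, written after the objective published for FG-k-means; the per-feature squared Euclidean distance and the entropy-style regularization terms are assumptions here, since formula (1) itself is not reproduced in this text.

```python
import numpy as np

def fgkm_objective(X, U, Z, H, S, groups, lam, eta, eps=1e-12):
    """Value of an FG-k-means-style objective for fixed U (n x k, 0/1), Z (k x d),
    H (k x d feature weights), S (k x T group weights); `groups` lists the feature
    indices of each group G_t.  Assumed form, for illustration only."""
    k = Z.shape[0]
    total = 0.0
    for l in range(k):
        members = np.where(U[:, l] == 1)[0]                # samples assigned to cluster l
        for t, idx in enumerate(groups):
            diff2 = (X[np.ix_(members, idx)] - Z[l, idx]) ** 2
            total += S[l, t] * np.sum(H[l, idx] * diff2)   # doubly weighted dispersion
        total += lam * np.sum(S[l] * np.log(S[l] + eps))   # group-weight regularization
        total += eta * np.sum(H[l] * np.log(H[l] + eps))   # feature-weight regularization
    return total
```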
In summary, the existing FG-k-means algorithm has the following drawback: features are weighted during clustering, but the influence of useless features on the clusters is ignored. When a sample differs significantly from a cluster centre in those dimensions yet is extremely similar to it in the features that carry large weights in that cluster, the algorithm cannot effectively separate the sample from the cluster. The presence of a large number of unrelated samples causes the cluster sample size to grow, which leads to imbalance between clusters and reduces the robustness of the FG-k-means algorithm.
The content of the invention
The invention provides an internet data clustering method and system, intended to solve, at least to a certain extent, one of the above technical problems in the prior art.
The invention is implemented as follows: an internet data clustering method comprising the following steps:
Step a: adding a penalty coefficient to the computation term of the objective function of the original FG-k-means algorithm to form a new objective function;
Step b: optimizing the parameters of the new objective function;
Step c: computing the distance between each sample and each cluster centre from the optimized parameters, and assigning samples to clusters according to those distances.
The technical solution adopted by the embodiment of the present invention further includes: in step a, the new objective function is:
subject to the constraints:
In the above formulas:
U is an n × k matrix; u_{i,l} = 1 indicates that the i-th sample belongs to the l-th cluster;
Z = {Z_1, Z_2, ..., Z_k} denotes the k cluster centres;
H is a k × m weight matrix; h_{l,j} denotes the weight of the j-th keyword (feature) in the l-th cluster;
S is a k × T weight matrix; s_{l,t} denotes the weight of the t-th group in the l-th cluster;
p_l denotes the fraction of all samples that fall in cluster l;
λ > 0 and η > 0 control how evenly the group weights and the feature weights, respectively, are distributed;
σ is used to adjust the degree of balance of the sample sizes between clusters;
the remaining factor is the newly added penalty coefficient that penalizes the cluster sample size.
The technical solution adopted by the embodiment of the present invention further includes: in step b, optimizing the parameters of the new objective function specifically comprises the following steps:
Step b1: fixing the variables Z, H and S, and solving the optimization problem with respect to U;
Step b2: fixing the variables U, H and S, and solving the optimization problem with respect to Z;
Step b3: fixing the variables U, Z and S, and solving the optimization problem with respect to H;
Step b4: fixing the variables U, Z and H, and solving the optimization problem with respect to S.
The technical solution adopted by the embodiment of the present invention further includes: in step b1, with Z, H and S fixed, the optimization problem with respect to U is solved according to:
In step b2, with U, H and S fixed, the optimization problem with respect to Z is solved according to:
The technical solution adopted by the embodiment of the present invention further includes: in step b3, with U, Z and S fixed, the solution formula of the optimization problem with respect to H is:
The technical solution adopted by the embodiment of the present invention further includes: in step b4, with U, Z and H fixed, the solution formula of the optimization problem with respect to S is:
The technical solution adopted by the embodiment of the present invention further includes: in step c, computing the distance between a sample and each cluster centre from the optimized parameters and assigning the sample to a cluster according to those distances is specifically as follows: p_l denotes the fraction of all samples that fall in cluster l; when the sample fraction p_l of a cluster becomes too large, the penalty coefficient of that cluster approaches 1 and exceeds those of the other clusters. When the distances of a sample x are computed, the distance to each cluster centre is multiplied by the corresponding penalty coefficient and thereby rescaled; the distance to cluster l is shrunk by a smaller ratio, while the distances to the other clusters are shrunk more. If a sample x lies roughly equidistant from the centre of cluster l and the centre of another cluster r, then, because the penalty coefficient of r is smaller, its distance is reduced by a larger proportion after rescaling; the rescaled distance between x and r becomes smaller than its distance to l, and sample x is assigned to cluster r.
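A minimal sketch of this assignment rule is given below; the per-cluster penalty coefficients are taken as given inputs (their exact fraction form is defined by the formulas above, which are not reproduced in this text), and each sample is assigned to the cluster whose penalty-scaled distance is smallest.

```python
import numpy as np

def assign_with_penalty(distances, penalties):
    """distances: (n, k) weighted distances from each sample to each cluster centre.
    penalties:  (k,) penalty coefficients in (0, 1], close to 1 for overpopulated
    clusters.  Each distance is rescaled by the corresponding coefficient before
    the nearest cluster is chosen, so crowded clusters attract fewer samples."""
    scaled = distances * penalties[np.newaxis, :]
    return np.argmin(scaled, axis=1)

# Toy example: a sample equally close to clusters 0 and 1, with cluster 0 crowded.
d = np.array([[2.0, 2.0, 5.0]])
pen = np.array([0.95, 0.60, 0.70])
print(assign_with_penalty(d, pen))   # -> [1]: the sample goes to the smaller cluster
```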
Another technical solution adopted by the embodiment of the present invention is an internet data clustering system comprising an objective function update module and an objective function computation module. The objective function update module is configured to add a penalty coefficient to the computation term of the objective function of the original FG-k-means algorithm to form a new objective function; the objective function computation module is configured to optimize the parameters of the new objective function, compute the distance between each sample and each cluster centre from the optimized parameters, and assign samples to clusters according to those distances.
The technical solution adopted by the embodiment of the present invention further includes: the new objective function formed by the objective function update module is:
subject to the constraints:
In the above formulas:
U is an n × k matrix; u_{i,l} = 1 indicates that the i-th sample belongs to the l-th cluster;
Z = {Z_1, Z_2, ..., Z_k} denotes the k cluster centres;
H is a k × m weight matrix; h_{l,j} denotes the weight of the j-th keyword (feature) in the l-th cluster;
S is a k × T weight matrix; s_{l,t} denotes the weight of the t-th group in the l-th cluster;
p_l denotes the fraction of all samples that fall in cluster l;
λ > 0 and η > 0 control how evenly the group weights and the feature weights, respectively, are distributed;
σ is used to adjust the degree of balance of the sample sizes between clusters;
the remaining factor is the newly added penalty coefficient that penalizes the cluster sample size.
The technical solution adopted by the embodiment of the present invention further includes: the objective function computation module comprises a first solving unit, a second solving unit, a third solving unit and a fourth solving unit;
The first solving unit is configured to, with the variables Z, H and S fixed, solve the optimization problem with respect to U;
The second solving unit is configured to, with the variables U, H and S fixed, solve the optimization problem with respect to Z;
The third solving unit is configured to, with the variables U, Z and S fixed, solve the optimization problem with respect to H;
The fourth solving unit is configured to, with the variables U, Z and H fixed, solve the optimization problem with respect to S.
By adding a penalty coefficient to the computation term of the objective function of the original FG-k-means algorithm, the internet data clustering method and system of the embodiment of the present invention can effectively control the unbounded growth of cluster sample sizes that occurs in the original FG-k-means algorithm, while keeping the sample sizes of the clusters balanced and controllable, thereby achieving higher clustering accuracy.
Brief description of the drawings
Fig. 1 shows a simulated data set for the FG-k-means algorithm;
Fig. 2 is a flow chart of the internet data clustering method of the embodiment of the present invention;
Fig. 3 is a flow chart of the objective function optimization and solving method of the embodiment of the present invention;
Fig. 4 is a structural diagram of the internet data clustering system of the embodiment of the present invention.
Specific embodiment
In order to make the objectives, technical solutions and advantages of the present invention clearer, the present invention is further described below with reference to the drawings and embodiments. It should be understood that the specific embodiments described herein are merely illustrative of the present invention and are not intended to limit it.
Referring to Fig. 2, which is a flow chart of the internet data clustering method of the embodiment of the present invention, the internet data clustering method of the embodiment of the present invention comprises the following steps:
Step 100: adding a penalty coefficient to the computation term of the objective function of the original FG-k-means algorithm to form a new objective function;
In step 100, the new objective function is:
subject to the constraints:
In formula (3) and formula (4):
U is an n × k matrix; u_{i,l} = 1 indicates that the i-th sample belongs to the l-th cluster;
Z = {Z_1, Z_2, ..., Z_k} denotes the k cluster centres;
H is a k × m weight matrix; h_{l,j} denotes the weight of the j-th keyword (feature) in the l-th cluster;
S is a k × T weight matrix; s_{l,t} denotes the weight of the t-th group in the l-th cluster;
p_l denotes the fraction of all samples that fall in cluster l;
λ > 0 and η > 0 control how evenly the group weights and the feature weights, respectively, are distributed;
σ is used to adjust the degree of balance of the sample sizes between clusters;
the remaining factor is the newly added penalty coefficient that penalizes the cluster sample size. The penalty coefficient is expressed as a fraction controlled by the two parameters p_l and σ, and can be adjusted according to the practical application. σ regulates the balance of sample sizes between clusters; normalizing by the total number of samples in the data set and introducing the parameter p_l removes the influence of the data set size on the reasonable value range of σ. In the embodiment of the present invention, the reasonable value range of σ in the penalty coefficient is the interval [0, 0.7]; that is, the parameter of the penalty coefficient lies in a small, controllable interval, which facilitates regulation of the degree of balance of the cluster sample sizes.
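The exact fraction defining the penalty coefficient is given by formula (3), which is not reproduced in this text; as a purely illustrative stand-in that matches the behaviour described here (a fraction controlled by p_l and σ that approaches 1 as p_l grows, and that equals 1 for every cluster when σ = 0), the following sketch uses p_l / (p_l + σ).

```python
import numpy as np

def cluster_fractions(U):
    """p_l: fraction of all samples currently assigned to each cluster (U is n x k, 0/1)."""
    counts = U.sum(axis=0)
    return counts / counts.sum()

def penalty_coefficients(U, sigma):
    """Illustrative penalty of fraction form controlled by p_l and sigma.
    NOTE: p_l / (p_l + sigma) is an assumption, not the patent's formula (3); it
    only reproduces the described behaviour: it approaches 1 as p_l grows, and
    with sigma = 0 every coefficient equals 1, recovering plain FG-k-means."""
    p = cluster_fractions(U)
    return np.ones_like(p) if sigma == 0 else p / (p + sigma)

# sigma is recommended to lie in the small, controllable interval [0, 0.7].
U = np.zeros((10, 3)); U[:7, 0] = 1; U[7:9, 1] = 1; U[9, 2] = 1
print(penalty_coefficients(U, sigma=0.3))   # the crowded cluster 0 gets the largest coefficient
```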
Step 200: optimizing the parameters of the new objective function, computing the distance between each sample and each cluster centre from the optimized parameters, and assigning samples to clusters according to those distances.
In step 200, referring also to Fig. 3, which is a flow chart of the objective function optimization and solving method of the embodiment of the present invention, the objective function optimization and solving method of the embodiment of the present invention comprises the following steps:
Step 201: fixing the variables Z, H and S, and solving the optimization problem with respect to U;
In step 201, following the derivation of FG-k-means, U can be updated according to the following rule:
Step 202: fixing the variables U, H and S, and solving the optimization problem with respect to Z;
In step 202, the elements of Z can be updated according to the following rule:
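The centre update referred to here is given as a formula in the original; for a squared Euclidean distance it is typically the per-feature mean of the samples currently assigned to each cluster, as in the short sketch below (an assumption in line with standard k-means-type derivations, not a transcription of the patent's formula).

```python
import numpy as np

def update_centres(X, U):
    """Z[l]: mean of the samples currently assigned to cluster l (assumed Euclidean
    case); an empty cluster falls back to the overall mean of the data set."""
    k = U.shape[1]
    Z = np.empty((k, X.shape[1]))
    for l in range(k):
        members = U[:, l] == 1
        Z[l] = X[members].mean(axis=0) if members.any() else X.mean(axis=0)
    return Z
```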
Step 203: fixing the variables U, Z and S, and solving the optimization problem with respect to H;
In step 203, the solution is given by Theorem 1:
Theorem 1: assume that U, Z and S are fixed and η > 0; then the objective function attains its minimum value if and only if:
Proof of Theorem 1:
Given the three variables U, Z and S, the value h_{l,j} that minimizes objective function (3) must be solved; this value denotes the weight of the j-th feature in the l-th cluster. Since there are k × T constraints on the feature weights, introducing Lagrange multipliers for objective function (3) yields:
In formula (8), E_{l,j} denotes, with U, Z and S fixed, the constant associated with the j-th feature in the l-th cluster, which can be obtained from formula (8).
Taking the derivatives of formula (9) with respect to γ_{l,t} and h_{l,j} respectively and setting them to zero, we obtain:
In formula (11) and formula (12), t is the index of the group to which feature j belongs.
Simplifying formula (12) yields:
Substituting formula (13) into formula (11) yields:
Simplifying formula (14) yields:
Finally, substituting formula (15) back into formula (13) gives:
Step 204: fixing the variables U, Z and H, and solving the optimization problem with respect to S;
In step 204, the solution is given by Theorem 2:
Theorem 2: assume that U, Z and H are fixed and λ > 0; then the objective function attains its minimum value if and only if:
Proof of Theorem 2:
Given the three variables U, Z and H, the value s_{l,t} that minimizes objective function (3) must be solved; this value denotes the weight of the t-th group in the l-th cluster. Since there are k constraints on the group weights, introducing Lagrange multipliers for objective function (3) yields:
In formula (18), D_{l,t} denotes, with U, Z and H fixed, the constant associated with the t-th feature group in the l-th cluster, which can be obtained from formula (17).
Taking the derivatives of formula (18) with respect to γ and s_{l,t} respectively and setting them to zero, we obtain:
Simplifying formula (20) yields:
Substituting formula (21) into formula (19) yields:
Simplifying formula (22) yields:
Finally, substituting formula (23) back into formula (21) gives:
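Derivations of this kind typically terminate in closed-form exponential (softmax-like) updates for the feature weights h_{l,j} and the group weights s_{l,t}; the sketch below assumes that form, with E_{l,j} and D_{l,t} taken as the within-cluster dispersion constants described above. It is an illustration of the usual FG-k-means-style result, not a transcription of formulas (16) and (24).

```python
import numpy as np

def update_feature_weights(E, groups, eta):
    """h_{l,j} is proportional to exp(-E_{l,j} / eta), normalized within each feature
    group G_t so the weights of each group sum to 1 per cluster (assumed form)."""
    H = np.zeros_like(E)                 # E: (k, d) per-cluster, per-feature dispersion
    for idx in groups:                   # normalize separately inside every group
        e = E[:, idx] - E[:, idx].min(axis=1, keepdims=True)   # shift for numerical stability
        w = np.exp(-e / eta)             # the shift cancels out after normalization
        H[:, idx] = w / w.sum(axis=1, keepdims=True)
    return H

def update_group_weights(D, lam):
    """s_{l,t} is proportional to exp(-D_{l,t} / lam), normalized over the T groups."""
    d = D - D.min(axis=1, keepdims=True)
    w = np.exp(-d / lam)                 # D: (k, T) per-cluster, per-group dispersion
    return w / w.sum(axis=1, keepdims=True)
```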
In the above algorithm, it can be seen that the value of σ affects the overall performance of the algorithm, and when σ = 0 the algorithm is equivalent to the FG-k-means algorithm. From the above update formulas and the FG-k-means algorithm flow, the pseudo-code of the algorithm of the invention can be obtained as follows:
Pseudo-code of the algorithm
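The pseudo-code itself appears as a figure in the original publication and is not reproduced here; the following is a compact, illustrative re-sketch in Python of the loop described in steps 201 to 204, in which the concrete penalty fraction p_l / (p_l + σ), the squared Euclidean per-feature distance and the exponential weight updates are all assumptions rather than the patent's exact formulas.

```python
import numpy as np

def penalized_fgkm(X, k, groups, lam, eta, sigma, max_iter=50, seed=0):
    """Illustrative sketch of the penalized FG-k-means-style loop described above."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    T = len(groups)
    Z = X[rng.choice(n, size=k, replace=False)].copy()   # initial cluster centres
    H = np.zeros((k, d))
    for idx in groups:                                   # uniform feature weights per group
        H[:, idx] = 1.0 / len(idx)
    S = np.full((k, T), 1.0 / T)                         # uniform group weights
    p = np.full(k, 1.0 / k)                              # start from balanced clusters
    for _ in range(max_iter):
        # Step 201: update U -- penalty-scaled, weighted nearest-centre assignment.
        q = np.maximum(p, 1.0 / n)                       # floor p_l so empty clusters cannot collapse
        pen = np.ones(k) if sigma == 0 else q / (q + sigma)
        dist = np.zeros((n, k))
        for l in range(k):
            for t, idx in enumerate(groups):
                diff2 = (X[:, idx] - Z[l, idx]) ** 2
                dist[:, l] += S[l, t] * (diff2 * H[l, idx]).sum(axis=1)
        labels = np.argmin(dist * pen, axis=1)
        p = np.bincount(labels, minlength=k) / n         # refresh p_l for the next pass
        # Step 202: update Z -- within-cluster means (empty clusters keep their centre).
        for l in range(k):
            if (labels == l).any():
                Z[l] = X[labels == l].mean(axis=0)
        # Steps 203-204: update H and S -- assumed exponential closed forms.
        E = np.zeros((k, d))                             # per-cluster, per-feature dispersion
        D = np.zeros((k, T))                             # per-cluster, per-group dispersion
        for l in range(k):
            members = labels == l
            if not members.any():
                continue
            per_feature = ((X[members] - Z[l]) ** 2).sum(axis=0)
            for t, idx in enumerate(groups):
                E[l, idx] = S[l, t] * per_feature[idx]   # group-weighted, per the derivation
                D[l, t] = (H[l, idx] * per_feature[idx]).sum()
        for idx in groups:                               # h_{l,j}: softmax within each group
            e = E[:, idx] - E[:, idx].min(axis=1, keepdims=True)
            w = np.exp(-e / eta)
            H[:, idx] = w / w.sum(axis=1, keepdims=True)
        w = np.exp(-(D - D.min(axis=1, keepdims=True)) / lam)
        S = w / w.sum(axis=1, keepdims=True)             # s_{l,t}: softmax over the T groups
    return labels, Z, H, S
```

With σ = 0 the penalty vector is all ones and the loop behaves like plain FG-k-means; increasing σ within the recommended interval [0, 0.7] strengthens the pressure toward balanced cluster sizes.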
p_l denotes the fraction of all samples that fall in cluster l. When the sample fraction p_l of a cluster becomes too large, the penalty coefficient of that cluster approaches 1 and exceeds those of the other clusters. When the distances of a sample x are computed according to (5), the distance to each cluster centre is multiplied by the corresponding penalty coefficient and thereby rescaled; the distance to cluster l is shrunk by a smaller ratio, while the distances to the other clusters are shrunk more. Suppose a sample x lies roughly equidistant from the centres of cluster l and another cluster r; because the penalty coefficient of r is smaller, its distance is reduced by a larger proportion, the rescaled distance between x and r becomes smaller than the distance between x and l, and x is assigned to cluster r. In this way, the unbounded growth of the sample size of cluster l is limited: in a cluster that already contains many samples, only the samples that are genuinely close to the cluster centre remain in the cluster, while samples farther from the centre are assigned to clusters with fewer samples.
Referring to Fig. 4, which is a structural diagram of the internet data clustering system of the embodiment of the present invention, the internet data clustering system of the embodiment of the present invention comprises an objective function update module and an objective function computation module.
The objective function update module is configured to add a penalty coefficient to the computation term of the objective function of the original FG-k-means algorithm to form a new objective function, where the new objective function is:
subject to the constraints:
In formula (3) and formula (4):
U is an n × k matrix; u_{i,l} = 1 indicates that the i-th sample belongs to the l-th cluster;
Z = {Z_1, Z_2, ..., Z_k} denotes the k cluster centres;
H is a k × m weight matrix; h_{l,j} denotes the weight of the j-th keyword (feature) in the l-th cluster;
S is a k × T weight matrix; s_{l,t} denotes the weight of the t-th group in the l-th cluster;
p_l denotes the fraction of all samples that fall in cluster l;
λ > 0 and η > 0 control how evenly the group weights and the feature weights, respectively, are distributed;
σ is used to adjust the degree of balance of the sample sizes between clusters;
the remaining factor is the newly added penalty coefficient that penalizes the cluster sample size. The penalty coefficient is expressed as a fraction controlled by the two parameters p_l and σ, and can be adjusted according to the practical application. σ regulates the balance of sample sizes between clusters; normalizing by the total number of samples in the data set and introducing the parameter p_l removes the influence of the data set size on the reasonable value range of σ. In the embodiment of the present invention, the reasonable value range of σ in the penalty coefficient is the interval [0, 0.7]; that is, the parameter of the penalty coefficient lies in a small, controllable interval, which facilitates regulation of the degree of balance of the cluster sample sizes.
The objective function computation module is configured to optimize the parameters of the new objective function, compute the distance between each sample and each cluster centre from the optimized parameters, and assign samples to clusters according to those distances. Specifically, the objective function computation module comprises a first solving unit, a second solving unit, a third solving unit and a fourth solving unit.
The first solving unit is configured to, with the variables Z, H and S fixed, solve the optimization problem with respect to U; following the derivation of FG-k-means, U can be updated according to the following rule:
The second solving unit is configured to, with the variables U, H and S fixed, solve the optimization problem with respect to Z; the elements of Z can be updated according to the following rule:
The third solving unit is configured to, with the variables U, Z and S fixed, solve the optimization problem with respect to H; the solution is given by Theorem 1:
Theorem 1: assume that U, Z and S are fixed and η > 0; then the objective function attains its minimum value if and only if:
Proof of Theorem 1:
Given the three variables U, Z and S, the value h_{l,j} that minimizes objective function (3) must be solved; this value denotes the weight of the j-th feature in the l-th cluster. Since there are k × T constraints on the feature weights, introducing Lagrange multipliers for objective function (3) yields:
In formula (8), E_{l,j} denotes, with U, Z and S fixed, the constant associated with the j-th feature in the l-th cluster, which can be obtained from formula (8).
Taking the derivatives of formula (9) with respect to γ_{l,t} and h_{l,j} respectively and setting them to zero, we obtain:
In formula (11) and formula (12), t is the index of the group to which feature j belongs.
Simplifying formula (12) yields:
Substituting formula (13) into formula (11) yields:
Simplifying formula (14) yields:
Finally, substituting formula (15) back into formula (13) gives:
The fourth solving unit is configured to, with the variables U, Z and H fixed, solve the optimization problem with respect to S; the solution is given by Theorem 2:
Theorem 2: assume that U, Z and H are fixed and λ > 0; then the objective function attains its minimum value if and only if:
Proof of Theorem 2:
Given the three variables U, Z and H, the value s_{l,t} that minimizes objective function (3) must be solved; this value denotes the weight of the t-th group in the l-th cluster. Since there are k constraints on the group weights, introducing Lagrange multipliers for objective function (3) yields:
In formula (18), D_{l,t} denotes, with U, Z and H fixed, the constant associated with the t-th feature group in the l-th cluster, which can be obtained from formula (17).
Taking the derivatives of formula (18) with respect to γ and s_{l,t} respectively and setting them to zero, we obtain:
Simplifying formula (20) yields:
Substituting formula (21) into formula (19) yields:
Simplifying formula (22) yields:
Finally, substituting formula (23) back into formula (21) gives:
In the above algorithm, it can be seen that the value of σ affects the overall performance of the algorithm, and when σ = 0 the algorithm is equivalent to the FG-k-means algorithm.
p_l denotes the fraction of all samples that fall in cluster l. When the sample fraction p_l of a cluster becomes too large, the penalty coefficient of that cluster approaches 1 and exceeds those of the other clusters. When the distances of a sample x are computed according to (5), the distance to each cluster centre is multiplied by the corresponding penalty coefficient and thereby rescaled; the distance to cluster l is shrunk by a smaller ratio, while the distances to the other clusters are shrunk more. Suppose a sample x lies roughly equidistant from the centres of cluster l and another cluster r; because the penalty coefficient of r is smaller, its distance is reduced by a larger proportion, the rescaled distance between x and r becomes smaller than the distance between x and l, and x is assigned to cluster r. In this way, the unbounded growth of the sample size of cluster l is limited: in a cluster that already contains many samples, only the samples that are genuinely close to the cluster centre remain in the cluster, while samples farther from the centre are assigned to clusters with fewer samples.
By adding a penalty coefficient to the computation term of the objective function of the original FG-k-means algorithm, the internet data clustering method and system of the embodiment of the present invention can effectively control the unbounded growth of cluster sample sizes that occurs in the original FG-k-means algorithm, while keeping the sample sizes of the clusters balanced and controllable, thereby achieving higher clustering accuracy.
The foregoing is merely preferred embodiments of the present invention and is not intended to limit the invention; any modification, equivalent replacement or improvement made within the spirit and principles of the present invention shall be included within the scope of protection of the present invention.

Claims (10)

1. An internet data clustering method, comprising the following steps:
Step a: adding a penalty coefficient to the computation term of the objective function of the original FG-k-means algorithm to form a new objective function;
Step b: optimizing the parameters of the new objective function;
Step c: computing the distance between each sample and each cluster centre from the optimized parameters, and assigning samples to clusters according to those distances.
2. The internet data clustering method according to claim 1, characterized in that, in step a, the new objective function is:
subject to the constraints:
In the above formulas:
U is an n × k matrix; u_{i,l} = 1 indicates that the i-th sample belongs to the l-th cluster;
Z = {Z_1, Z_2, ..., Z_k} denotes the k cluster centres;
H is a k × m weight matrix; h_{l,j} denotes the weight of the j-th keyword (feature) in the l-th cluster;
S is a k × T weight matrix; s_{l,t} denotes the weight of the t-th group in the l-th cluster;
p_l denotes the fraction of all samples that fall in cluster l;
λ > 0 and η > 0 control how evenly the group weights and the feature weights, respectively, are distributed;
σ is used to adjust the degree of balance of the sample sizes between clusters;
the remaining factor is the newly added penalty coefficient that penalizes the cluster sample size.
3. The internet data clustering method according to claim 2, characterized in that, in step b, optimizing the parameters of the new objective function specifically comprises the following steps:
Step b1: fixing the variables Z, H and S, and solving the optimization problem with respect to U;
Step b2: fixing the variables U, H and S, and solving the optimization problem with respect to Z;
Step b3: fixing the variables U, Z and S, and solving the optimization problem with respect to H;
Step b4: fixing the variables U, Z and H, and solving the optimization problem with respect to S.
4. The internet data clustering method according to claim 3, characterized in that, in step b1, with Z, H and S fixed, the optimization problem with respect to U is solved according to:
In step b2, with U, H and S fixed, the optimization problem with respect to Z is solved according to:
5. The internet data clustering method according to claim 4, characterized in that, in step b3, with U, Z and S fixed, the solution formula of the optimization problem with respect to H is:
6. The internet data clustering method according to claim 5, characterized in that, in step b4, with U, Z and H fixed, the solution formula of the optimization problem with respect to S is:
7. The internet data clustering method according to claim 1, characterized in that, in step c, computing the distance between a sample and each cluster centre from the optimized parameters and assigning the sample to a cluster according to those distances is specifically as follows: p_l denotes the fraction of all samples that fall in cluster l; when the sample fraction p_l of a cluster becomes too large, the penalty coefficient of that cluster approaches 1 and exceeds those of the other clusters; when the distances of a sample x are computed, the distance to each cluster centre is multiplied by the corresponding penalty coefficient and thereby rescaled, the distance to cluster l being shrunk by a smaller ratio and the distances to the other clusters being shrunk more; if a sample x lies roughly equidistant from the centre of cluster l and the centre of another cluster r, then, because the penalty coefficient of r is smaller, its distance is reduced by a larger proportion after rescaling, the rescaled distance between x and r becomes smaller than the distance between x and l, and sample x is assigned to cluster r.
8. An internet data clustering system, characterized in that it comprises an objective function update module and an objective function computation module; the objective function update module is configured to add a penalty coefficient to the computation term of the objective function of the original FG-k-means algorithm to form a new objective function; the objective function computation module is configured to optimize the parameters of the new objective function, compute the distance between each sample and each cluster centre from the optimized parameters, and assign samples to clusters according to those distances.
9. The internet data clustering system according to claim 8, characterized in that the new objective function formed by the objective function update module is:
subject to the constraints:
In the above formulas:
U is an n × k matrix; u_{i,l} = 1 indicates that the i-th sample belongs to the l-th cluster;
Z = {Z_1, Z_2, ..., Z_k} denotes the k cluster centres;
H is a k × m weight matrix; h_{l,j} denotes the weight of the j-th keyword (feature) in the l-th cluster;
S is a k × T weight matrix; s_{l,t} denotes the weight of the t-th group in the l-th cluster;
p_l denotes the fraction of all samples that fall in cluster l;
λ > 0 and η > 0 control how evenly the group weights and the feature weights, respectively, are distributed;
σ is used to adjust the degree of balance of the sample sizes between clusters;
the remaining factor is the newly added penalty coefficient that penalizes the cluster sample size.
10. The internet data clustering system according to claim 9, characterized in that the objective function computation module comprises a first solving unit, a second solving unit, a third solving unit and a fourth solving unit;
the first solving unit is configured to, with the variables Z, H and S fixed, solve the optimization problem with respect to U;
the second solving unit is configured to, with the variables U, H and S fixed, solve the optimization problem with respect to Z;
the third solving unit is configured to, with the variables U, Z and S fixed, solve the optimization problem with respect to H;
the fourth solving unit is configured to, with the variables U, Z and H fixed, solve the optimization problem with respect to S.
CN201510956891.2A 2015-12-17 2015-12-17 A kind of internet data clustering method and system Pending CN106897292A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201510956891.2A CN106897292A (en) 2015-12-17 2015-12-17 A kind of internet data clustering method and system

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201510956891.2A CN106897292A (en) 2015-12-17 2015-12-17 A kind of internet data clustering method and system

Publications (1)

Publication Number Publication Date
CN106897292A true CN106897292A (en) 2017-06-27

Family

ID=59188750

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201510956891.2A Pending CN106897292A (en) 2015-12-17 2015-12-17 A kind of internet data clustering method and system

Country Status (1)

Country Link
CN (1) CN106897292A (en)

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111241360A (en) * 2020-01-09 2020-06-05 腾讯科技(深圳)有限公司 Information recommendation method, device, equipment and storage medium
CN111310843A (en) * 2020-02-25 2020-06-19 苏州浪潮智能科技有限公司 Mass streaming data clustering method and system based on K-means
CN114077860A (en) * 2020-08-18 2022-02-22 鸿富锦精密电子(天津)有限公司 Method and system for sorting parts before assembly, electronic device and storage medium

Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106897276A (en) * 2015-12-17 2017-06-27 中国科学院深圳先进技术研究院 A kind of internet data clustering method and system

Patent Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106897276A (en) * 2015-12-17 2017-06-27 中国科学院深圳先进技术研究院 A kind of internet data clustering method and system

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
GAN, Guojun et al., "Subspace clustering with automatic feature grouping", Pattern Recognition 48 (2015) *
SHI, Dongsheng, "Research on Clustering Algorithms for High-Dimensional Data", China Master's Theses Full-text Database, Information Science and Technology *

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111241360A (en) * 2020-01-09 2020-06-05 腾讯科技(深圳)有限公司 Information recommendation method, device, equipment and storage medium
CN111241360B (en) * 2020-01-09 2023-03-21 深圳市雅阅科技有限公司 Information recommendation method, device, equipment and storage medium
CN111310843A (en) * 2020-02-25 2020-06-19 苏州浪潮智能科技有限公司 Mass streaming data clustering method and system based on K-means
CN114077860A (en) * 2020-08-18 2022-02-22 鸿富锦精密电子(天津)有限公司 Method and system for sorting parts before assembly, electronic device and storage medium


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication (application publication date: 20170627)