CN106874367A

CN106874367A - A sampling-based distributed clustering method for a public opinion platform

Info

Publication number: CN106874367A
Application number: CN201611260883.5A
Authority: CN
Inventors: 汪伟亚; 许恺; 黄强松; 陈辉
Original assignee: Jiangsu One Hundred Information Service Co Ltd
Current assignee: Jiangsu One Hundred Information Service Co Ltd
Priority date: 2016-12-30
Filing date: 2016-12-30
Publication date: 2017-06-20

Abstract

The invention provides a sampling-based distributed clustering method for a public opinion platform, comprising the following steps: step one: obtaining data to be clustered and sharding it to obtain multiple shards; step two: sampling each shard using the Map function of MapReduce; step three: gathering the sampled data and clustering it in the Reduce phase of the MapReduce framework; step four: repeating step two and step three in sequence for a total of r rounds of sampling, recording the clustering result of each round's sampled data as a base clustering, and obtaining the vector Π = {π_1, π_2, ..., π_r}, where r is a positive integer greater than or equal to 2, π_i is the base clustering of the i-th round, and i is a positive integer with 1 ≤ i ≤ r; step five: using the MapReduce framework again to integrate the base clusterings into the final clustering result. The method reduces the scale of the data while increasing data diversity, and can thereby effectively improve the efficiency of clustering massive data.

Description

A sampling-based distributed clustering method for a public opinion platform

Technical field

The invention belongs to the field of data mining and machine learning, and in particular relates to a sampling-based distributed clustering method for a public opinion platform.

Background art

The data clustering problem is to group data sample points by their similarity, so that highly similar sample points fall in the same cluster while less similar ones are kept apart. Clustering has long been one of the important methods in data mining and machine learning. However, with the development of the Internet, and in particular the explosive growth of user-generated content brought by Web 2.0, data volume has become the bottleneck of traditional clustering methods. This is especially true of text data in application fields such as news recommendation, machine translation, literature retrieval, sentiment analysis, and public opinion monitoring, which is high-dimensional and sparse. How to improve the efficiency of clustering algorithms, particularly for high-dimensional sparse data, has become an urgent problem in Internet big-data mining.

Therefore, it is necessary to provide a sampling-based distributed clustering method for a public opinion platform that improves the efficiency of clustering high-dimensional sparse data.

Summary of the invention

The object of the invention is to provide a sampling-based distributed clustering method for a public opinion platform that improves the efficiency of clustering high-dimensional sparse data.

The technical solution of the invention is as follows: a sampling-based distributed clustering method for a public opinion platform comprises the following steps: step one: obtaining data to be clustered and sharding it to obtain multiple shards; step two: sampling each shard using the Map function of MapReduce; step three: gathering the sampled data and clustering it in the Reduce phase of the MapReduce framework; step four: repeating step two and step three in sequence for a total of r rounds of sampling, recording the clustering result of each round's sampled data as a base clustering, and obtaining the vector Π = {π_1, π_2, ..., π_r}, where r is a positive integer greater than or equal to 2, π_i is the base clustering of the i-th round, and i is a positive integer with 1 ≤ i ≤ r; step five: using the MapReduce framework again to integrate the base clusterings into the final clustering result.

Preferably, in step one, the data to be clustered is partitioned horizontally, the integrity of every record is preserved during partitioning, and the resulting shards are stored in a distributed file system.

Preferably, the sampling in step two at least satisfies the following requirements: the sampling technique itself is sufficiently simple, sampling is performed on local data, and the sampling result has a certain degree of randomness.

Preferably, in step three, the specific sampling round is used as the key and the sampled data obtained as the value; the pairs are gathered by the shuffle function into a single Reduce function of MapReduce, and the sampled data is clustered in that Reduce function.

Preferably, step five comprises the following steps: a number of the base clusterings are randomly selected as centroids; the Map function computes the distance between each of the other base clusterings and the centroids and assigns each base clustering to the cluster of its nearest centroid; the Reduce function updates the cluster centroids. This process is repeated until the cluster centroids no longer change.

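Preferably, let z_k denote the centroid of the k-th cluster over the base clustering vector Π, described as an rk-dimensional vector:

z_{k} = (z_{k}^{1}, ..., z_{k}^{i}, ..., z_{k}^{r}),

where z_{k}^{i} is the component of z_k for the i-th base clustering π_i.

Preferably, let the vector Π be described by an rk-dimensional vector x_l; the cosine distance between x_l and z_k is then:

CosDist(x_{l}, z_{k}) = Σ_{i = 1}^{r} w_{i} (1 - cos(x_{l}^{i}, z_{k}^{i})),

where w_i denotes the weight of the i-th base clustering and takes the value 1/r when no prior knowledge is available.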

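Preferably, the centroid z_k is updated using the following formula:

{z_{k}^{i}}^{'} = ||z_{k}^{i}||_{2} - ||T^{(i)}||_{2},

where T^{(i)} is a constant vector over Π, and [symbol missing in source] denotes the number of instances in the k-th cluster of the i-th base clustering.

For ||z_{k}^{i}||_{2} and ||T^{(i)}||_{2}: given a d-dimensional real vector y, ||y||_{p} denotes the Lp norm of y, i.e.,

||y||_{p} = (Σ_{j = 1}^{d} |y_{j}|^{p})^{1/p}.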
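The Lp norm referred to above can be illustrated with a minimal Python sketch. The function name `lp_norm` is hypothetical and is not taken from the invention; the sketch assumes p ≥ 1 and a vector given as a plain list of real numbers.

```python
def lp_norm(y, p):
    """Return the Lp norm of a real vector y, assuming p >= 1."""
    if p < 1:
        raise ValueError("p must be >= 1")
    # (sum over j of |y_j|^p) raised to the power 1/p
    return sum(abs(v) ** p for v in y) ** (1.0 / p)


# p = 2 gives the Euclidean length used for ||z_k^i||_2 and ||T^(i)||_2.
euclidean = lp_norm([3.0, 4.0], 2)
manhattan = lp_norm([1, -2, 3], 1)
```

With p = 2 this is the ordinary Euclidean length; with p = 1 it is the sum of absolute values.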

The technical solution provided by the invention has the following beneficial effects:

The sampling-based distributed clustering method for a public opinion platform reduces the scale of the data through sampling, increases the diversity of the base clustering results through multiple rounds of sampling, and then defines a cosine distance to integrate the base clustering results into the final clustering result; it can therefore effectively improve the efficiency of clustering massive data.

Moreover, introducing sampling increases data diversity while reducing the scale of the data, and a two-stage clustering process is then designed on a distributed computing framework, providing an effective way to improve the quality and efficiency of clustering for public opinion analysis on Internet big data.

Brief description of the drawings

Fig. 1 is a flow chart of the sampling-based distributed clustering method for a public opinion platform provided by an embodiment of the invention.

Detailed description of the embodiments

To make the purpose, technical solution, and advantages of the invention clearer, the invention is described in further detail below with reference to the drawings and embodiments. It should be understood that the specific embodiments described here merely illustrate the invention and do not limit it.

Unless the context clearly indicates otherwise, elements and components of the invention may be present in the singular or in the plural, and the invention is not limited in this respect. Although the steps of the invention are numbered, the numbering does not limit their order: unless the order is expressly stated or the execution of one step depends on another, the relative order of the steps may be adjusted. The term "and/or" as used herein covers any and all possible combinations of one or more of the associated listed items.

Referring to Fig. 1, the sampling-based distributed clustering method 100 for a public opinion platform provided by an embodiment of the invention comprises the following steps:

S1: obtain the data to be clustered and shard it to obtain multiple shards.

In step S1, the data to be clustered is partitioned horizontally into several shards; the integrity of every record (for example, a news text) must be preserved during partitioning. The resulting shards are stored in a distributed file system such as HDFS. The shard size is determined by the chosen distributed file system; in HDFS, for example, each shard is 64 MB. By accessing the distributed file system, compute nodes can share the shards, and keeping computation local to the data effectively reduces I/O cost.
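The horizontal partitioning of step S1 can be sketched as follows. This is only an illustration under stated assumptions: records are strings, the shard size is measured in UTF-8 bytes, and the function name `shard_records` and its parameter `max_bytes` are hypothetical. In a real deployment the distributed file system itself decides block boundaries.

```python
def shard_records(records, max_bytes):
    """Group whole records into shards of at most max_bytes each.

    A record is never split: one that is larger than max_bytes
    simply gets a shard of its own.
    """
    shards, current, size = [], [], 0
    for rec in records:
        n = len(rec.encode("utf-8"))
        # Close the current shard if this record would overflow it.
        if current and size + n > max_bytes:
            shards.append(current)
            current, size = [], 0
        current.append(rec)
        size += n
    if current:
        shards.append(current)
    return shards


shards = shard_records(["aaaa", "bb", "cccccc", "d"], max_bytes=6)
```

Keeping each record whole is what allows a later Map task to work on a shard without consulting any other shard.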

S2: sample each shard using the Map function of MapReduce.

Specifically, in step S2, sampling is carried out on each shard. For the sake of divide-and-conquer and efficiency, the sampling technique must at least satisfy the following requirements: 1. the sampling technique itself must be sufficiently simple, otherwise it becomes a new bottleneck; 2. sampling must be possible on local data alone, without relying on a global view; 3. the sampling result must have a certain degree of randomness.

Any sampling method that satisfies the above requirements can be used in the invention, and no specific limitation is imposed. In step S2, the sampling operation is implemented by the Map function of the MapReduce framework; this is denoted the first-stage Map process.
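One sampling method that appears to satisfy the three requirements is Bernoulli sampling, in which every record of a shard is kept independently with a fixed probability. The invention does not prescribe this method; it is chosen here as an assumption for illustration, and the names `sample_map`, `rate`, and `seed` are hypothetical.

```python
import random


def sample_map(shard_id, shard, round_id, rate, seed=0):
    """First-stage Map sketch: Bernoulli-sample one shard.

    Only the local shard is read (no global view), each record is
    kept with probability `rate`, and (round_id, record) pairs are
    emitted so that a later shuffle can group them by round.
    """
    # A per-round, per-shard generator makes the run reproducible.
    rng = random.Random(f"{seed}-{round_id}-{shard_id}")
    return [(round_id, rec) for rec in shard if rng.random() < rate]


pairs = sample_map(shard_id=0, shard=list(range(1000)), round_id=1, rate=0.1)
```

Because the sampling round is emitted as the key, all samples of one round end up in the same Reduce call.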

S3: gather the sampled data and cluster it in the Reduce phase of the MapReduce framework.

Specifically, in step S3, for the sampling result of each round, the specific sampling round is used as the key and the sampled data obtained as the value; the pairs are gathered by the shuffle function into a single Reduce function of MapReduce, and the sampled data is clustered in that Reduce function. This is denoted the first-stage Reduce process.

The specific clustering method includes, but is not limited to, K-means, spectral clustering, and hierarchical clustering; the invention imposes no limitation in this respect.
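The first-stage Reduce process can be sketched in plain Python as a grouping step followed by a clustering step. K-means is used here because it is one of the methods named above; the helper names `shuffle` and `kmeans_reduce`, the fixed iteration count, and the tuple representation of points are assumptions made for this illustration only.

```python
import random
from collections import defaultdict


def shuffle(pairs):
    """Group (key, value) pairs by key, as the shuffle phase would."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def kmeans_reduce(points, k, iters=20, seed=0):
    """Reduce sketch: plain K-means on one round's samples; returns labels."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)
    labels = [0] * len(points)
    for _ in range(iters):
        # Assign every point to its nearest center (squared Euclidean).
        labels = [
            min(range(k),
                key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centers[c])))
            for p in points
        ]
        # Move each non-empty center to the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                centers[c] = tuple(sum(col) / len(members) for col in zip(*members))
    return labels


pairs = [(1, (0.0, 0.0)), (1, (0.1, 0.0)), (1, (5.0, 5.0)), (1, (5.1, 5.0)),
         (2, (1.0, 1.0))]
groups = shuffle(pairs)
round_one_labels = kmeans_reduce(groups[1], k=2)
```

The label list returned for one round plays the role of one base clustering π_i.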

S4: repeat steps S2 and S3 in sequence for a total of r rounds of sampling, record the clustering result of each round's sampled data as a base clustering, and obtain the vector Π = {π_1, π_2, ..., π_r}, where r is a positive integer greater than or equal to 2, π_i is the base clustering of the i-th round, and i is a positive integer with 1 ≤ i ≤ r.

S5: use the MapReduce framework again to integrate the base clusterings into the final clustering result.

In step S5, cluster ensembling is performed on the vector Π; each base clustering is treated as a whole, so that the distance between base clusterings can be computed.

Specifically, step S5 comprises the following steps:

A number of the base clusterings are randomly selected as centroids; the Map function computes the distance between each of the other base clusterings and the centroids and assigns each base clustering to the cluster of its nearest centroid; the Reduce function then updates the cluster centroids. These are denoted the second-stage Map process and the second-stage Reduce process.

This process is repeated until the cluster centroids no longer change.

In this embodiment, the distance computation and the cluster assignment of the base clusterings take place in the second-stage Map process, and the centroid update takes place in the second-stage Reduce process.

In the second-stage Map process, the sampling-based distributed clustering method 100 for a public opinion platform defines a cosine distance for the computation:

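Let z_k denote the centroid of the k-th cluster over the base clustering vector Π, described as an rk-dimensional vector:

z_{k} = (z_{k}^{1}, ..., z_{k}^{i}, ..., z_{k}^{r}),

where z_{k}^{i} is the component of z_k for the i-th base clustering π_i.

Let the vector Π be described by an rk-dimensional vector x_l; the cosine distance between x_l and z_k is then:

CosDist(x_{l}, z_{k}) = Σ_{i = 1}^{r} w_{i} (1 - cos(x_{l}^{i}, z_{k}^{i})),

where w_i denotes the weight of the i-th base clustering and takes the value 1/r when no prior knowledge is available.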
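The weighted cosine distance can be sketched as below. The sketch assumes that x_l and z_k are each stored as a list of r blocks, one block per base clustering, and that a block with zero length contributes a cosine of 0; both are assumptions of this illustration, as are the names `cosine` and `cos_dist`.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors; 0.0 if either is zero."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0


def cos_dist(x_blocks, z_blocks, weights=None):
    """CosDist(x_l, z_k) = sum_i w_i * (1 - cos(x_l^i, z_k^i))."""
    r = len(x_blocks)
    if weights is None:
        weights = [1.0 / r] * r  # no prior knowledge: w_i = 1/r
    return sum(w * (1.0 - cosine(x, z))
               for w, x, z in zip(weights, x_blocks, z_blocks))


x = [[1, 0], [0, 1]]   # r = 2 blocks for one instance
z = [[1, 0], [1, 0]]   # r = 2 blocks for one centroid
d = cos_dist(x, z)
```

In this example the first blocks coincide and the second blocks are orthogonal, so with equal weights the distance is 0.5.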

In the second-stage Reduce process, once every base clustering has been assigned to a cluster in step S5, the centroids of all clusters are updated. The centroid z_k is updated using the following formula:

{z_{k}^{i}}^{'} = ||z_{k}^{i}||_{2} - ||T^{(i)}||_{2},

where T^{(i)} is a constant vector over Π, and [symbol missing in source] denotes the number of instances in the k-th cluster of the i-th base clustering.

After the second-stage Reduce process has updated the centroid of every cluster, the second-stage Map process is repeated: the distance between each base clustering and the new centroids is recomputed and the base clusterings are reassigned to clusters, until the cluster centroids no longer change.
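The alternation between the second-stage Map process (assignment) and the second-stage Reduce process (centroid update) can be sketched as a single-machine loop. Two points are assumptions of this sketch and not statements about the invention: the centroid update is replaced by the ordinary block-wise mean of the members, since the update formula above cannot be reliably reproduced here, and the loop stops when the assignments stop changing. All function names are hypothetical.

```python
import math
import random


def _cos(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0


def cos_dist(x, z):
    # Equal weights w_i = 1/r, i.e. no prior knowledge.
    return sum(1.0 - _cos(xi, zi) for xi, zi in zip(x, z)) / len(x)


def assign(instances, centroids):
    """Map sketch: index of the nearest centroid for every instance."""
    return [min(range(len(centroids)), key=lambda k: cos_dist(x, centroids[k]))
            for x in instances]


def recenter(instances, labels, centroids):
    """Reduce sketch: ASSUMED update, the block-wise mean of the members."""
    new = []
    for k, old in enumerate(centroids):
        members = [x for x, lab in zip(instances, labels) if lab == k]
        if not members:
            new.append(old)  # an empty cluster keeps its centroid
            continue
        new.append([[sum(col) / len(members) for col in zip(*blocks)]
                    for blocks in zip(*members)])
    return new


def consensus(instances, k, max_iter=50, seed=0):
    rng = random.Random(seed)
    centroids = [[list(b) for b in x] for x in rng.sample(instances, k)]
    labels = None
    for _ in range(max_iter):
        new_labels = assign(instances, centroids)
        if new_labels == labels:  # assignments are stable: stop
            break
        labels = new_labels
        centroids = recenter(instances, labels, centroids)
    return labels


a = [[1, 0], [1, 0]]
c = [[0, 1], [0, 1]]
final_labels = consensus([a, a, c, c], k=2)
```

In a distributed setting `assign` would run once per shard and `recenter` once per cluster key; the loop here only shows the control flow.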

Taking community discovery among microblog users as an example, a specific embodiment of the invention and its steps are described in detail. User data comprises the user's attributes, such as age, sex, hobbies, following, followers, and reposts, and can be defined as a vector. Community discovery clusters users by their vectors, grouping highly similar users into one community. Because the number of microblog users is extremely large, the distributed clustering method proposed by the invention is well suited to the task.

The mass of user data is first stored in HDFS shards of 64 MB each. Storage ensures that no user's data is split, that is, the data of a single user is stored in a single shard.

The Map function of the distributed in-memory computing framework Spark (an implementation of the MapReduce framework) is used to randomly select users on each shard.

The users selected on all shards are gathered on one node and clustered on the basis of their user vectors, using the Reduce function of Spark. The clustering method can be the commonly used K-means; MLlib, the machine learning library of Spark, provides an implementation of the K-means algorithm on Spark. The above process is repeated n times, each repetition yielding one base clustering.

After n base clusterings have been obtained, they are integrated. The integration iterates between "cluster assignment" and "centroid update" and is likewise implemented with the distributed in-memory computing framework Spark. Data in Spark is abstracted and described by Resilient Distributed Datasets (RDDs), one of Spark's most important core technologies; all operations are performed on RDDs.

Here, an instance is the x_l of formula (3). It is a high-dimensional sparse vector and, to save storage, is modelled in the RDD as two arrays holding the non-zero indices and the corresponding values. Because the number of instances is very large, the matrix formed by the vectors is first sharded horizontally and each shard is loaded as an RDD. Each RDD corresponds to one Map task, which computes the distance between the instances in the RDD and all centroids and assigns (classify) each instance.
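The two-array representation of a sparse vector can be illustrated independently of Spark. The function names `to_sparse` and `to_dense` are hypothetical; the sketch only shows that the index array and the value array together carry the same information as the dense vector.

```python
def to_sparse(dense):
    """Return (indices, values) holding only the non-zero entries."""
    indices = [i for i, v in enumerate(dense) if v != 0]
    values = [dense[i] for i in indices]
    return indices, values


def to_dense(indices, values, dim):
    """Rebuild the dense vector of length dim from the two arrays."""
    dense = [0] * dim
    for i, v in zip(indices, values):
        dense[i] = v
    return dense


# An instance with r = 2 base clusterings of 3 clusters each.
x_l = [0, 1, 0, 0, 0, 1]
idx, val = to_sparse(x_l)
```

For such an instance only r of the entries are non-zero, so the two arrays stay short however many clusters each base clustering has.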

Notably, computing the cosine distance does not require visiting every component of each vector: only the inner product over the indices shared by the two vectors needs to be computed, which further reduces the computational complexity. In the key-value pairs output by Map, the key is the cluster number and the value is the specific instance. The Reduce phase then updates (recenter) the centroid from all values with the same key, that is, from the instances belonging to the same cluster. The newly obtained centroids (including the K centroids randomly selected at initialization) are sent to all RDDs using Spark's broadcast mechanism and the next iteration begins, until the centroids or clusters no longer change.
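The saving described here can be sketched with a merge over two sorted index arrays, so that only positions present in both vectors contribute to the inner product. The sketch assumes the index arrays are sorted in increasing order; the names `sparse_dot` and `sparse_cosine` are hypothetical.

```python
import math


def sparse_dot(a_idx, a_val, b_idx, b_val):
    """Inner product of two sparse vectors with sorted index arrays."""
    i = j = 0
    total = 0.0
    while i < len(a_idx) and j < len(b_idx):
        if a_idx[i] == b_idx[j]:
            # Shared index: the only case that contributes.
            total += a_val[i] * b_val[j]
            i += 1
            j += 1
        elif a_idx[i] < b_idx[j]:
            i += 1
        else:
            j += 1
    return total


def sparse_cosine(a_idx, a_val, b_idx, b_val):
    """Cosine similarity computed from the stored non-zero entries only."""
    na = math.sqrt(sum(v * v for v in a_val))
    nb = math.sqrt(sum(v * v for v in b_val))
    if na == 0 or nb == 0:
        return 0.0
    return sparse_dot(a_idx, a_val, b_idx, b_val) / (na * nb)


shared = sparse_dot([1, 4], [1.0, 1.0], [1, 3], [2.0, 5.0])
```

The cost is proportional to the number of stored entries rather than to the full dimension of the vectors.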

Compared with the prior art, the embodiments of the invention have the following beneficial effects:

The sampling-based distributed clustering method 100 for a public opinion platform reduces the scale of the data through sampling, increases the diversity of the base clustering results through multiple rounds of sampling, and then defines a cosine distance to integrate the base clustering results into the final clustering result; it can therefore effectively improve the efficiency of clustering massive data.

It is obvious to those skilled in the art that the invention is not limited to the details of the above exemplary embodiments and can be realized in other specific forms without departing from its spirit or essential characteristics. The embodiments should therefore be regarded in every respect as exemplary and non-restrictive; the scope of the invention is defined by the appended claims rather than by the above description, and all changes falling within the meaning and range of equivalency of the claims are intended to be embraced in the invention. No reference sign in the claims should be construed as limiting the claim concerned.

Moreover, it should be understood that although this specification is described in terms of embodiments, not every embodiment contains only one independent technical solution. The specification is written this way only for clarity; those skilled in the art should take the specification as a whole, and the technical solutions of the embodiments may be suitably combined to form other embodiments that those skilled in the art can understand.

Claims

1. A sampling-based distributed clustering method for a public opinion platform, characterized by comprising the following steps:

step one: obtaining data to be clustered and sharding it to obtain multiple shards;

step two: sampling each shard using the Map function of MapReduce;

step three: gathering the sampled data and clustering it in the Reduce phase of the MapReduce framework;

step four: repeating step two and step three in sequence for a total of r rounds of sampling, recording the clustering result of each round's sampled data as a base clustering, and obtaining the vector Π = {π_1, π_2, ..., π_r}, where r is a positive integer greater than or equal to 2, π_i is the base clustering of the i-th round, and i is a positive integer with 1 ≤ i ≤ r;

step five: using the MapReduce framework again to integrate the base clusterings into the final clustering result.

2. The sampling-based distributed clustering method for a public opinion platform according to claim 1, characterized in that: in step one, the data to be clustered is partitioned horizontally, the integrity of every record is preserved during partitioning, and the resulting shards are stored in a distributed file system.

3. The sampling-based distributed clustering method for a public opinion platform according to claim 1, characterized in that: the sampling in step two at least satisfies the following requirements: the sampling technique itself is sufficiently simple, sampling is performed on local data, and the sampling result has a certain degree of randomness.

4. The sampling-based distributed clustering method for a public opinion platform according to claim 1, characterized in that: in step three, the specific sampling round is used as the key and the sampled data obtained as the value; the pairs are gathered by the shuffle function into a single Reduce function of MapReduce, and the sampled data is clustered in that Reduce function.

5. The sampling-based distributed clustering method for a public opinion platform according to claim 1, characterized in that: step five comprises the following steps:

a number of the base clusterings are randomly selected as centroids; the Map function computes the distance between each of the other base clusterings and the centroids and assigns each base clustering to the cluster of its nearest centroid; and the Reduce function updates the cluster centroids.

6. The sampling-based distributed clustering method for a public opinion platform according to claim 5, characterized in that: z_k denotes the centroid of the k-th cluster over the base clustering vector Π and is described as an rk-dimensional vector:

z_{k} = (z_{k}^{1}, ..., z_{k}^{i}, ..., z_{k}^{r}),

where z_{k}^{i} is the component of z_k for the i-th base clustering π_i.

7. The sampling-based distributed clustering method for a public opinion platform according to claim 6, characterized in that: the vector Π is described by an rk-dimensional vector x_l, and the cosine distance between x_l and z_k is:

CosDist(x_{l}, z_{k}) = Σ_{i = 1}^{r} w_{i} (1 - cos(x_{l}^{i}, z_{k}^{i}))

8. The sampling-based distributed clustering method for a public opinion platform according to claim 5, characterized in that: the centroid z_k is updated using the following formula:

{z_{k}^{i}}^{'} = ||z_{k}^{i}||_{2} - ||T^{(i)}||_{2},

where T^{(i)} is a constant vector over Π, and [symbol missing in source] denotes the number of instances in the k-th cluster of the i-th base clustering;

for ||z_{k}^{i}||_{2} and ||T^{(i)}||_{2}: given a d-dimensional real vector y, ||y||_{p} denotes the Lp norm of y, i.e.,

||y||_{p} = (Σ_{j = 1}^{d} |y_{j}|^{p})^{1/p}.