Summary of the invention
To solve the above technical problem, the present invention proposes an industry-based clustering method. The invention is implemented by the following technical solution:
An industry-based clustering method, comprising:
obtaining a data set, the data set including first-class data and second-class data;
grouping the data according to the publication time of the first-class data to obtain grouped data sets, each grouped data set including first-class data and the second-class data related to that first-class data;
preprocessing each grouped data set to obtain the plurality of data network sets corresponding to the grouped data set;
for each grouped data set, calculating its corresponding theme vector set;
obtaining the hot spot data network sets in the grouped data set based on the theme vector set;
obtaining a popular-industry clustering result according to the hot spot data network sets in each grouped data set;
presenting the clustering result by means of a bubble chart.
Further, the method also comprises:
the bubble chart takes the time periods corresponding to the grouped data sets as its horizontal axis and uses bubbles to present the clustering result of the hot spot themes in each grouped data set.
Further, obtaining the hot spot data network sets in the grouped data set based on the theme vector set comprises:
obtaining the heat attributes of each data network set;
extracting suspected hot spot data network sets according to the heat attributes;
obtaining the correlation matrix of each suspected hot spot data network set;
obtaining the elements of the correlation matrix whose values are greater than a preset relevance threshold;
if the total number of such elements is greater than a preset heat threshold, determining the suspected hot spot data network to be a hot spot data network.
Further, obtaining the popular-industry clustering result according to the hot spot data network sets in each grouped data set comprises:
obtaining the hot spot data network sets in each grouped data set;
obtaining the N hot spot themes corresponding to each hot spot data network set;
obtaining M hot spot themes of the grouped data set according to the hot spot themes corresponding to each hot spot data network set;
clustering according to the M hot spot themes to obtain the popular-industry clustering result.
Further, the hot spot themes corresponding to a hot spot data network set are obtained by:
calculating the total value of the elements in each row of the correlation matrix of the hot spot data network set;
selecting the N rows with the largest total values and taking the themes corresponding to those rows as the hot spot themes.
The present invention provides an industry-based clustering method. By analyzing mutual-comment data, the invention obtains information on current hot spot data, hot spot themes, and hot spot industries, thereby filling a gap in the related art of automated hot spot analysis.
Specific embodiment
To enable those skilled in the art to better understand the solution of the present invention, the technical solutions in the embodiments of the present invention are described clearly and completely below with reference to the accompanying drawings. Obviously, the described embodiments are only some, rather than all, of the embodiments of the present invention. All other embodiments obtained by those of ordinary skill in the art based on the embodiments of the present invention without creative effort shall fall within the protection scope of the present invention.
It should be noted that the terms "first", "second", and the like in the description, the claims, and the above drawings are used to distinguish similar objects and not to describe a particular order or sequence. It should be understood that data so used are interchangeable where appropriate, so that the embodiments described herein can be implemented in orders other than those illustrated or described herein. In addition, the terms "include" and "have" and any variations thereof are intended to cover non-exclusive inclusion; for example, a process, method, system, product, or device that contains a series of steps or units is not necessarily limited to the steps or units expressly listed, but may include other steps or units that are not expressly listed or that are inherent to such a process, method, product, or device.
An embodiment of the present invention provides an industry-based clustering method. As shown in Fig. 1, the method comprises:
S101. Obtain a data set, the data set including first-class data and second-class data.
First-class data are data that are published directly; second-class data are comment data on first-class data.
S102. Group the data according to the publication time of the first-class data to obtain grouped data sets, each grouped data set including first-class data and the second-class data related to that first-class data.
Specifically, the time dimension of the grouping can be configured as required, for example, the same day, the same week, or the same month.
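The grouping in S102 can be sketched as follows. This is a minimal illustration rather than the embodiment's implementation: the record fields (`id`, `published`, `post_id`), the function names, and the period labels are all assumptions introduced here. Directly published items are called posts, and each comment is placed in the group of the post it comments on, regardless of when the comment itself was written.

```python
from collections import defaultdict
from datetime import date


def period_key(published: date, granularity: str) -> str:
    """Map a publication date to a label for its day, ISO week, or month."""
    if granularity == "day":
        return published.isoformat()
    if granularity == "week":
        iso = published.isocalendar()  # (ISO year, ISO week, ISO weekday)
        return f"{iso[0]}-W{iso[1]:02d}"
    if granularity == "month":
        return f"{published.year}-{published.month:02d}"
    raise ValueError(f"unknown granularity: {granularity}")


def group_by_period(posts, comments, granularity="day"):
    """Group posts by publication period and attach each comment to its post's group."""
    groups = defaultdict(lambda: {"posts": [], "comments": []})
    period_of_post = {}
    for post in posts:
        key = period_key(post["published"], granularity)
        groups[key]["posts"].append(post)
        period_of_post[post["id"]] = key
    for comment in comments:
        key = period_of_post.get(comment["post_id"])
        if key is not None:  # comments on unknown posts are skipped
            groups[key]["comments"].append(comment)
    return dict(groups)
```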
S103. Preprocess each grouped data set to obtain the plurality of data network sets corresponding to the grouped data set.
A data network set is recorded in the form $d_i = \{V, E\}$, where V is the set of user identifiers and E represents the comment relationships in which second-class data published by one user identifier comment on first-class data published by another user identifier. Each vertex contains three parts of data: user identifier, title, and content.
For example, if user spark publishes an item of first-class data and users tony, samby, and dazzi comment on it, the resulting data network set has four vertices and three directed edges: from tony to spark, from samby to spark, and from dazzi to spark. Each directed edge points from the user who published the second-class data to the user who published the corresponding first-class data.
Specifically, a data network set may include multiple users who publish second-class data and multiple users who publish first-class data, and a user who publishes first-class data may at the same time be a user who publishes second-class data. The embodiment of the present invention does not limit the specific method of generating a data network set.
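Since the generation method is left open, the following is only one possible sketch of building a data network set of the form {V, E} from directly published items (posts) and their comments. The field names (`id`, `user`, `title`, `content`, `post_id`) and the choice to give comment-only users an empty title are assumptions made for illustration.

```python
def build_data_network(posts, comments):
    """Build {V, E}: one vertex per user, one directed edge per comment."""
    vertices = {}
    edges = []
    author_of = {}
    for post in posts:
        author_of[post["id"]] = post["user"]
        vertices.setdefault(
            post["user"],
            {"user": post["user"], "title": post["title"], "content": post["content"]},
        )
    for comment in comments:
        target = author_of.get(comment["post_id"])
        if target is None:
            continue  # the commented post is not part of this network
        vertices.setdefault(
            comment["user"],
            {"user": comment["user"], "title": "", "content": comment["content"]},
        )
        # The edge points from the commenter to the author of the commented post.
        edges.append((comment["user"], target))
    return {"V": vertices, "E": edges}
```

Applied to the spark example, this yields four vertices and the three edges tony to spark, samby to spark, and dazzi to spark.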
S104. For each grouped data set, calculate its corresponding theme vector set.
Specifically, the theme vector set can be denoted $\{topic_i\}$, where $topic_i = \{(t_{i1}, p_{i1}), \ldots, (t_{in}, p_{in})\}$, $t_{ij}$ is a keyword that may appear in theme $topic_i$, and $p_{ij}$ is the probability that the keyword appears in that theme. In fact, the title and content of each vertex in a data network set can be regarded as a probability distribution over a series of keywords. Therefore, by analyzing the title of each vertex in combination with prior knowledge, the themes related to the vertex can be obtained, which yields the theme vector set corresponding to the data network set. Taking the union of the theme vector sets of all data network sets in a grouped data set gives the theme vector set corresponding to that grouped data set. The embodiment of the present invention does not specifically limit the method of obtaining the theme vector set; reference may be made to the prior art.
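The union step at the end of S104 can be illustrated as below. The representation is an assumption: each theme vector set is a dictionary from a theme identifier to a keyword-probability mapping, and when the same theme identifier appears in several data network sets the first occurrence is kept. How the themes themselves are extracted is left to the prior art and is not shown.

```python
def union_theme_vectors(network_theme_sets):
    """Merge the theme vector sets of all data network sets in one group."""
    merged = {}
    for theme_set in network_theme_sets:
        for theme_id, keywords in theme_set.items():
            # Keep the first definition of a theme; copy it so inputs stay untouched.
            merged.setdefault(theme_id, dict(keywords))
    return merged
```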
S105. Obtain the hot spot data network sets in the grouped data set based on the theme vector set.
S106. Obtain a popular-industry clustering result according to the hot spot data network sets in each grouped data set.
Specifically, for each grouped data set, its hot spot data network sets can be clustered by industry category to obtain a clustering result, and the clustering result can indicate the heat of each industry.
In the embodiment of the present invention, the clustering result is preferably presented by means of a bubble chart.
Specifically, the bubble chart takes the time periods corresponding to the grouped data sets as its horizontal axis and uses bubbles to present the clustering result of the hot spot themes in each grouped data set.
Further, as shown in Fig. 2, obtaining the hot spot data network sets in the grouped data set based on the theme vector set comprises:
S1051. Obtain the heat attributes of each data network set.
Specifically, the heat attributes can be chosen according to the actual situation. For example, the heat attributes used in the embodiment of the present invention are the vertex-count importance, the participation importance, and the reading importance of a data network set.
Specifically, the vertex-count importance of a data network set is the ratio of the number of vertices in the data network to the total number of active users in the time period corresponding to the grouped data set to which the data network belongs. Active users can be defined according to the number of data items a user browses online.
The participation importance of a data network set is the ratio of the number of vertices in the data network set to the total number of times the data items in the data network set are browsed.
The reading importance of a data network set is the ratio of the total number of times the data items in the data network set are browsed to the total number of active users in the time period corresponding to the grouped data set to which the data network belongs.
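The three ratios defined for S1051 can be written out directly. The sketch below assumes the three counts (number of vertices, total browse count, number of active users in the period) have already been collected and are non-zero; the function and key names are illustrative only.

```python
def heat_attributes(num_vertices, total_views, active_users):
    """Compute the three heat attributes of one data network set."""
    return {
        # vertices relative to active users in the group's time period
        "vertex_importance": num_vertices / active_users,
        # vertices relative to how often the network's items were browsed
        "participation_importance": num_vertices / total_views,
        # browse count relative to active users in the group's time period
        "reading_importance": total_views / active_users,
    }
```

For instance, a network with 20 vertices whose items were browsed 100 times in a period with 200 active users has a vertex-count importance of 0.1, a participation importance of 0.2, and a reading importance of 0.5.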
S1052. Extract suspected hot spot data network sets according to the heat attributes.
Specifically, a data network set is a suspected hot spot data network set only when its vertex-count importance is greater than a preset first threshold, its participation importance is greater than a preset second threshold, and its reading importance is greater than a preset third threshold.
Specifically, in the embodiment of the present invention, the first threshold is 0.1, the second threshold is 0.15, and the third threshold is 0.3.
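The filter in S1052 amounts to three strict comparisons that must all hold. A minimal sketch using the embodiment's threshold values as defaults is given below; the function and parameter names are assumptions.

```python
def is_suspected_hot_spot(vertex_importance, participation_importance,
                          reading_importance,
                          first=0.1, second=0.15, third=0.3):
    """Return True only if every heat attribute strictly exceeds its threshold."""
    return (vertex_importance > first
            and participation_importance > second
            and reading_importance > third)
```

Because the comparisons are strict, a network whose vertex-count importance equals the first threshold exactly is not extracted.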
S1053. Obtain the correlation matrix of each suspected hot spot data network set.
Specifically, the degree of correlation between a vertex and a theme vector is obtained as follows:
The degree of correlation between the vertex and the theme vector is calculated based on a formula (the formula itself is missing from this text), where $V_i$ is the title of the vertex, key is a keyword that belongs to both the theme vector and the title, and P(key) is the probability of that keyword in the theme vector.
Further, once the degree of correlation between a vertex and a single theme vector can be obtained, the degree of correlation between the vertex and every theme in the theme vector set can be obtained, giving the vertex relevance vector, which expresses the degree of correlation between the vertex and each theme.
Taking the vertex relevance vector of each vertex as a column gives the correlation matrix corresponding to the suspected hot spot data network set; each row of the matrix therefore corresponds to a theme and each column to a vertex.
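The layout of the correlation matrix can be sketched as follows. The relevance formula is not available in this text, so the code assumes, purely for illustration, that the degree of correlation is the sum of P(key) over the keywords shared by the title and the theme vector; the actual formula may differ. Titles are assumed to be already split into keyword lists, and each theme is assumed to be a keyword-to-probability dictionary.

```python
def relevance(title_keywords, theme):
    """ASSUMED relevance: sum of P(key) over keywords found in both title and theme."""
    title = set(title_keywords)
    return sum(p for key, p in theme.items() if key in title)


def correlation_matrix(titles, themes):
    """Rows correspond to themes, columns to vertices (one column per title)."""
    return [[relevance(title, theme) for title in titles] for theme in themes]
```

With this layout, reading down a column gives one vertex's relevance vector, and summing along a row aggregates one theme over all vertices.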
S1054. Obtain the elements of the correlation matrix whose values are greater than a preset relevance threshold.
S1055. If the total number of such elements is greater than a preset heat threshold, determine the suspected hot spot data network to be a hot spot data network.
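Steps S1054 and S1055 can be sketched as a count followed by a comparison. The matrix is assumed to be a list of rows of numbers, and both thresholds are passed in by the caller because the text does not give their values.

```python
def is_hot_spot(matrix, relevance_threshold, heat_threshold):
    """Count the elements above the relevance threshold and compare the count."""
    count = sum(1 for row in matrix for value in row if value > relevance_threshold)
    return count > heat_threshold
```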
Specifically, as shown in Fig. 3, obtaining the popular-industry clustering result according to the hot spot data network sets in each grouped data set comprises:
S1061. Obtain the hot spot data network sets in each grouped data set.
S1062. Obtain the N hot spot themes corresponding to each hot spot data network set.
Specifically, as shown in Fig. 4, the hot spot themes corresponding to a hot spot data network set are obtained as follows:
S10621. Calculate the total value of the elements in each row of the correlation matrix of the hot spot data network set;
S10622. Select the N rows with the largest total values and take the themes corresponding to those rows as the hot spot themes.
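The row-sum selection in S10621 and S10622 can be sketched as below, assuming the correlation matrix is a list of rows with one row per theme and that `theme_ids` (a hypothetical parameter) lists the theme identifiers in row order. Ties are broken by row order, which is a choice made here and not stated in the text.

```python
def top_n_themes(matrix, theme_ids, n):
    """Return the themes of the n rows with the largest row totals."""
    totals = [sum(row) for row in matrix]
    # sorted() is stable, so rows with equal totals keep their original order.
    ranked = sorted(range(len(totals)), key=lambda i: totals[i], reverse=True)
    return [theme_ids[i] for i in ranked[:n]]
```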
S1063. Obtain M hot spot themes of the grouped data set according to the hot spot themes corresponding to each hot spot data network set.
Specifically, the M themes that occur most frequently among the hot spot themes of all hot spot data network sets are taken as the M hot spot themes of the grouped data set.
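The frequency count in S1063 maps naturally onto `collections.Counter`. In this sketch each hot spot data network set is assumed to contribute a list of theme identifiers; the function name is illustrative.

```python
from collections import Counter


def top_m_themes(themes_per_network, m):
    """Return the m themes that appear most often across all hot spot networks."""
    counts = Counter(theme for themes in themes_per_network for theme in themes)
    return [theme for theme, _ in counts.most_common(m)]
```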
S1064. Cluster according to the M hot spot themes to obtain the popular-industry clustering result.
Specifically, the clustering comprises:
S10. Calculate the dissimilarity between each pair of hot spot themes to obtain the dissimilarity matrix $R = \{r_{ij}\}_{n \times n}$.
Specifically, the dissimilarity between two hot spot themes can be obtained with prior-art methods for building a dissimilarity matrix, which are not described further in the embodiment of the present invention.
S20. Initialize the industry categories ω.
S30. Obtain the degree of membership of each hot spot theme in each industry category.
S40. Obtain the contribution degree of each hot spot theme to each industry category.
S50. Output the clustering result expressed by the degrees of membership and the contribution degrees.
Specifically, the degree of membership is denoted u, where $u_{ik}$ is the degree of membership of hot spot theme $x_i$ in industry category $\omega_k$; the contribution degree is denoted v, where $v_{kj}$ is the contribution weight of hot spot theme $x_j$ to industry category $\omega_k$.
The degree of membership is calculated according to formula (1) and the contribution degree according to formula (2) (both formulas are missing from this text). Taking the M hot spot themes as input, formula (1) and formula (2) are evaluated iteratively to obtain the degree of membership and the contribution degree of each hot spot theme for each industry category. φ and β in formula (1) and formula (2) are constants related to clustering precision.
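Step S10 leaves the dissimilarity measure to the prior art. As one possible choice, and not the measure prescribed by the embodiment, the sketch below uses the Jaccard distance between the keyword sets of two themes to fill the n-by-n matrix R. Each theme is assumed to be a keyword-to-probability dictionary; the probabilities are ignored by this particular measure.

```python
def jaccard_dissimilarity(theme_a, theme_b):
    """1 - |A ∩ B| / |A ∪ B| over the keyword sets of two themes."""
    a, b = set(theme_a), set(theme_b)
    union = a | b
    if not union:
        return 0.0  # two empty themes are treated as identical
    return 1.0 - len(a & b) / len(union)


def dissimilarity_matrix(themes):
    """Build R = {r_ij} with r_ij the dissimilarity of themes i and j."""
    n = len(themes)
    return [[jaccard_dissimilarity(themes[i], themes[j]) for j in range(n)]
            for i in range(n)]
```

The resulting matrix is symmetric with zeros on the diagonal, which is the usual shape expected by relational clustering methods that take a dissimilarity matrix as input.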
It should be understood that "multiple" as used herein means two or more. "And/or" describes an association between associated objects and indicates that three relationships are possible; for example, "A and/or B" can mean: A alone, both A and B, or B alone. The character "/" generally indicates an "or" relationship between the objects before and after it.
The serial numbers of the above embodiments are for description only and do not indicate that any embodiment is better or worse than another.
Those of ordinary skill in the art will understand that all or some of the steps of the above embodiments can be implemented by hardware, or by a program instructing the relevant hardware. The program can be stored in a computer-readable storage medium, which may be a read-only memory, a magnetic disk, an optical disc, or the like.
The foregoing describes only preferred embodiments of the present invention and is not intended to limit the invention. Any modification, equivalent replacement, or improvement made within the spirit and principles of the present invention shall fall within the protection scope of the present invention.