CN109684480A - Industry-based clustering method - Google Patents

Industry-based clustering method

Info

Publication number
CN109684480A
Authority
CN
China
Prior art keywords
data
hot spot
network
collection
theme
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201811644123.3A
Other languages
Chinese (zh)
Other versions
CN109684480B (en)
Inventor
徐承迪
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
BEIJING PEOPLE ONLINE NETWORK Co.,Ltd.
Original Assignee
Hangzhou Rabbit Network Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Hangzhou Rabbit Network Technology Co Ltd
Priority to CN201811644123.3A
Publication of CN109684480A
Application granted
Publication of CN109684480B
Legal status: Active
Anticipated expiration

Landscapes

  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The present invention provides an industry-based clustering method, comprising: obtaining a data set that includes class-one data and class-two data; grouping the data according to the publication time of the class-one data to obtain grouped data sets, each grouped data set including class-one data and the class-two data related to that class-one data; preprocessing each grouped data set to obtain the multiple data network sets corresponding to it; for each grouped data set, calculating its corresponding topic vector set; obtaining the hotspot data network sets in the grouped data set based on the topic vector set; obtaining a popular-industry clustering result from the hotspot data network sets of each grouped data set; and presenting the clustering result in a bubble chart. By analyzing mutual-comment data, the invention obtains information about current hotspot data, hotspot topics, and hotspot industries, filling a gap in techniques for automated hotspot analysis.

Description

Industry-based clustering method
Technical field
The present invention relates to the computer field, and in particular to an industry-based clustering method.
Background
Data analysis is a routine task in many fields. Common interactive websites such as Zhihu and Baidu Tieba hold large amounts of mutual-comment data, in which users comment on one another's posts. Such data reflects users' personal preferences and can also be used to study current-affairs hotspots and social phenomena. It carries rich social information and can be widely applied to fields such as advertising target-user research, hotspot-issue research, and public-opinion supervision. The prior art, however, lacks an effective method for analyzing this kind of data, so industry heat information cannot be obtained from it.
Summary of the invention
To solve the above technical problem, the invention proposes an industry-based clustering method. The invention is implemented by the following technical solution:
An industry-based clustering method, comprising:
Obtaining a data set, the data set including class-one data and class-two data;
Grouping the data according to the publication time of the class-one data to obtain grouped data sets, each grouped data set including class-one data and the class-two data related to that class-one data;
Preprocessing each grouped data set to obtain the multiple data network sets corresponding to the grouped data set;
For each grouped data set, calculating its corresponding topic vector set;
Obtaining the hotspot data network sets in the grouped data set based on the topic vector set;
Obtaining a popular-industry clustering result from the hotspot data network sets of each grouped data set;
Presenting the clustering result in a bubble chart.
Further, the method also comprises:
The bubble chart uses the time period corresponding to each grouped data set as the horizontal axis and uses bubbles to represent the clustering result of the popular topics in each grouped data set.
Further, obtaining the hotspot data network sets in the grouped data set based on the topic vector set comprises:
Obtaining the heat attributes of each data network set;
Extracting suspected hotspot data network sets according to the heat attributes;
Obtaining the correlation matrix of each suspected hotspot data network set;
Obtaining the elements of the correlation matrix whose values are greater than a preset relevance threshold;
If the number of such elements is greater than a preset heat threshold, judging the suspected hotspot data network to be a hotspot data network.
Further, obtaining the popular-industry clustering result from the hotspot data network sets of each grouped data set comprises:
Obtaining the hotspot data network sets in each grouped data set;
Obtaining the N hotspot topics corresponding to each hotspot data network set;
Obtaining the M hotspot topics of the grouped data set from the hotspot topics of its hotspot data network sets;
Clustering according to the M hotspot topics to obtain the popular-industry clustering result.
Further, the method for obtaining the hotspot topics corresponding to a hotspot data network set comprises:
Calculating the sum of the elements of each row in the correlation matrix of the hotspot data network set;
Selecting the N rows with the largest sums and taking their corresponding topics as the hotspot topics.
The present invention provides an industry-based clustering method. By analyzing mutual-comment data, the invention obtains information about current hotspot data, hotspot topics, and hotspot industries, filling a gap in techniques for automated hotspot analysis.
Brief description of the drawings
To explain the technical solutions of the embodiments of the invention or of the prior art more clearly, the drawings needed in the description of the embodiments or the prior art are briefly introduced below. The drawings described below are only some embodiments of the invention; a person of ordinary skill in the art can obtain other drawings from them without creative effort.
Fig. 1 is a flowchart of an industry-based clustering method provided by an embodiment of the invention;
Fig. 2 is a flowchart of the method, provided by an embodiment of the invention, for obtaining the hotspot data network sets in the grouped data set based on the topic vector set;
Fig. 3 is a flowchart of the method, provided by an embodiment of the invention, for obtaining the popular-industry clustering result from the hotspot data network sets of each grouped data set;
Fig. 4 is a flowchart of the method, provided by an embodiment of the invention, for obtaining the hotspot topics corresponding to a hotspot data network set.
Detailed description of the embodiments
To help those skilled in the art better understand the solution of the invention, the technical solutions in the embodiments of the invention are described clearly and completely below with reference to the drawings. The described embodiments are only some of the embodiments of the invention, not all of them. All other embodiments obtained by a person of ordinary skill in the art from these embodiments without creative effort fall within the scope of protection of the invention.
It should be noted that the terms "first", "second", and the like in the description, the claims, and the drawings are used to distinguish similar objects and do not describe a particular order or sequence. Data used in this way are interchangeable where appropriate, so that the embodiments described herein can be implemented in orders other than those illustrated or described. In addition, the terms "include" and "have" and any variants of them are intended to cover non-exclusive inclusion: a process, method, system, product, or device that contains a series of steps or units is not necessarily limited to the steps or units expressly listed, and may include other steps or units that are not expressly listed or that are inherent to the process, method, product, or device.
An embodiment of the invention provides an industry-based clustering method. As shown in Fig. 1, the method comprises:
S101. Obtain a data set, the data set including class-one data and class-two data.
Class-one data is data that is published directly; class-two data is comment data on class-one data.
S102. Group the data according to the publication time of the class-one data to obtain grouped data sets, each grouped data set including class-one data and the class-two data related to that class-one data.
Specifically, the time granularity of the grouping can be configured as required, for example the same day, the same week, or the same month.
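Step S102 can be sketched as follows. This is a minimal illustration, not the patent's implementation: the record fields (`id`, `post_id`, `published_at`), the `granularity` parameter, and the helper names are all assumptions made for the example. Comments (class-two data) are placed in the group of the post (class-one data) they refer to, regardless of when the comment itself was published.

```python
from collections import defaultdict
from datetime import datetime


def period_key(published_at, granularity="day"):
    """Map a timestamp to a grouping key (granularity is a hypothetical parameter)."""
    if granularity == "day":
        return published_at.strftime("%Y-%m-%d")
    if granularity == "week":
        iso = published_at.isocalendar()  # (ISO year, ISO week, weekday)
        return f"{iso[0]}-W{iso[1]:02d}"
    if granularity == "month":
        return published_at.strftime("%Y-%m")
    raise ValueError(f"unsupported granularity: {granularity}")


def group_by_period(posts, comments, granularity="day"):
    """Group class-one data by publication period and attach related class-two data."""
    groups = defaultdict(lambda: {"posts": [], "comments": []})
    post_period = {}
    for post in posts:
        key = period_key(post["published_at"], granularity)
        post_period[post["id"]] = key
        groups[key]["posts"].append(post)
    for comment in comments:
        key = post_period.get(comment["post_id"])
        if key is not None:  # ignore comments whose post is not in the data set
            groups[key]["comments"].append(comment)
    return dict(groups)


# Hypothetical sample records.
posts = [
    {"id": 1, "user": "spark", "published_at": datetime(2018, 12, 30, 9, 0)},
    {"id": 2, "user": "tony", "published_at": datetime(2018, 12, 31, 8, 0)},
]
comments = [
    {"post_id": 1, "user": "tony", "published_at": datetime(2018, 12, 31, 10, 0)},
    {"post_id": 2, "user": "samby", "published_at": datetime(2018, 12, 31, 11, 0)},
]
groups = group_by_period(posts, comments, "day")
```

Switching `granularity` to `"week"` or `"month"` changes only the grouping key, which matches the configurable time granularity described above.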
S103. Preprocess each grouped data set to obtain the multiple data network sets corresponding to the grouped data set.
A data network set is recorded in the form d_i = {V, E}, where V is the set of user identifiers and E represents the comment relationships in which class-two data published by one user identifier comments on class-one data published by another user identifier. Each vertex contains three parts: user identifier, title, and content.
For example, if user spark publishes a piece of class-one data and users tony, samby, and dazzi comment on it, the result is a data network set with four vertices and three directed edges: tony to spark, samby to spark, and dazzi to spark. Each directed edge points from the user who published the class-two data to the user who published the corresponding class-one data.
Specifically, a data network set may contain multiple users who publish class-one data and multiple users who publish class-two data, and a user who publishes class-one data may also be a user who publishes class-two data. The embodiment of the invention does not limit the specific method for generating data network sets.
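The spark/tony/samby/dazzi example can be reproduced with a small sketch. Since the embodiment leaves the generation method open, the function below is only one possible construction; the record layout and the name `build_data_network` are assumptions for illustration.

```python
def build_data_network(post, comments):
    """Build one data network set d_i = {V, E} from a post and its comments.

    Each vertex holds a user identifier, a title, and content; each directed
    edge points from the commenting user to the user who published the post.
    """
    vertices = {
        post["user"]: {
            "user": post["user"],
            "title": post["title"],
            "content": post["content"],
        }
    }
    edges = []
    for comment in comments:
        # A commenter may already be a vertex (for example the post author).
        vertices.setdefault(
            comment["user"],
            {"user": comment["user"], "title": "", "content": comment["content"]},
        )
        edges.append((comment["user"], post["user"]))
    return {"V": vertices, "E": edges}


# Hypothetical sample data matching the example in the text.
post = {"user": "spark", "title": "new car engine", "content": "..."}
comments = [
    {"user": "tony", "content": "nice"},
    {"user": "samby", "content": "agreed"},
    {"user": "dazzi", "content": "not sure"},
]
network = build_data_network(post, comments)
```

Here `network["V"]` has four vertices and `network["E"]` has three directed edges, all ending at `spark`.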
S104. For each grouped data set, calculate its corresponding topic vector set.
Specifically, the topic vector set can be written as {topic_i}, where topic_i = {(t_i1, p_i1), ..., (t_in, p_in)}, t_ij is a keyword that may appear in topic topic_i, and p_ij is the probability that the keyword appears in that topic. The title and content of each vertex in a data network set can be regarded as a probability distribution over keywords. Analyzing the title of each vertex in combination with prior knowledge therefore yields the topics related to the vertex, and thus the topic vector set of the data network set. Taking the union of the topic vector sets of all data network sets in a grouped data set gives the topic vector set of that grouped data set. The embodiment of the invention does not limit the specific method for obtaining the topic vector set; the prior art may be referred to.
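As an illustration of the data structure only, a topic vector set can be held as a mapping from a topic identifier to its keyword probabilities, and the union over data network sets taken by topic identifier. The topic names, keywords, and probabilities below are invented, and the sketch assumes that the same topic identifier always carries the same keyword distribution; how the topics are extracted is left to the prior art, as in the text.

```python
def union_topic_sets(per_network_topic_sets):
    """Union the topic vector sets of several data network sets.

    Each topic vector set maps topic_id -> {keyword: probability}. A topic
    that appears in several networks is kept once (first occurrence wins).
    """
    merged = {}
    for topic_set in per_network_topic_sets:
        for topic_id, keywords in topic_set.items():
            merged.setdefault(topic_id, dict(keywords))
    return merged


# Hypothetical topic vector sets for two data network sets in one group.
network_a_topics = {"auto": {"car": 0.5, "engine": 0.25}}
network_b_topics = {
    "auto": {"car": 0.5, "engine": 0.25},
    "food": {"noodle": 0.5, "shop": 0.25},
}
group_topics = union_topic_sets([network_a_topics, network_b_topics])
```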
S105. Obtain the hotspot data network sets in the grouped data set based on the topic vector set.
S106. Obtain a popular-industry clustering result from the hotspot data network sets of each grouped data set.
Specifically, for each grouped data set, its hotspot data network sets can be clustered by industry category to obtain the clustering result, and the clustering result can indicate the heat of each industry.
In the embodiment of the invention, the clustering result is preferably presented in a bubble chart.
Specifically, the bubble chart uses the time period corresponding to each grouped data set as the horizontal axis and uses bubbles to represent the clustering result of the popular topics in each grouped data set.
Further, as shown in Fig. 2, obtaining the hotspot data network sets in the grouped data set based on the topic vector set comprises:
S1051. Obtain the heat attributes of each data network set.
Specifically, the heat attributes can be chosen according to the actual situation. For example, the heat attributes used in the embodiment of the invention are the vertex-count importance, the participation importance, and the reading importance of the data network set.
Specifically, the vertex-count importance of a data network set is the ratio of the number of vertices in the data network to the total number of active users during the period corresponding to the grouped data set that contains the data network. Active users can be defined by the number of times a user browses data online.
The participation importance of a data network set is the ratio of the number of vertices in the data network set to the total number of times the data in the data network set were browsed.
The reading importance of a data network set is the ratio of the total number of times the data in the data network set were browsed to the total number of active users during the period corresponding to the grouped data set that contains the data network.
S1052. Extract suspected hotspot data network sets according to the heat attributes.
Specifically, a data network set is a suspected hotspot data network set only when its vertex-count importance is greater than a preset first threshold, its participation importance is greater than a preset second threshold, and its reading importance is greater than a preset third threshold.
Specifically, in the embodiment of the invention the first threshold is 0.1, the second threshold is 0.15, and the third threshold is 0.3.
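The three heat attributes and the threshold test of S1051 and S1052 can be sketched directly from the definitions above. The thresholds 0.1, 0.15, and 0.3 come from the embodiment; the function names, argument names, and sample counts are assumptions for illustration.

```python
# Thresholds stated in the embodiment.
FIRST_THRESHOLD = 0.1    # vertex-count importance
SECOND_THRESHOLD = 0.15  # participation importance
THIRD_THRESHOLD = 0.3    # reading importance


def heat_attributes(vertex_count, total_views, active_users):
    """Return (vertex-count, participation, reading) importance of a data network set."""
    vertex_importance = vertex_count / active_users
    participation_importance = vertex_count / total_views
    reading_importance = total_views / active_users
    return vertex_importance, participation_importance, reading_importance


def is_suspected_hotspot(vertex_count, total_views, active_users):
    """A data network set is suspected only if all three attributes exceed their thresholds."""
    vertex_imp, participation_imp, reading_imp = heat_attributes(
        vertex_count, total_views, active_users
    )
    return (
        vertex_imp > FIRST_THRESHOLD
        and participation_imp > SECOND_THRESHOLD
        and reading_imp > THIRD_THRESHOLD
    )


# Hypothetical counts: 40 vertices, 200 views, 300 active users in the period.
busy = is_suspected_hotspot(vertex_count=40, total_views=200, active_users=300)
# Hypothetical counts: only 5 vertices for the same views and active users.
quiet = is_suspected_hotspot(vertex_count=5, total_views=200, active_users=300)
```

With these sample counts the first network passes all three tests (about 0.13, 0.2, and 0.67), while the second fails on vertex-count importance (about 0.017).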
S1053. Obtain the correlation matrix of each suspected hotspot data network set.
Specifically, the method for obtaining the relevance between a vertex and a topic vector comprises:
Calculating the relevance between the vertex and the topic vector from a formula (the formula itself is missing from this text), where V_i is the title of the vertex, key is a keyword that belongs to both the topic vector and the title, and P(key) is the probability of the keyword in the topic vector.
Further, once the relevance between a vertex and one topic vector can be obtained, the relevance between that vertex and every topic in the topic vector set can be obtained, giving the vertex relevance vector, which represents the relevance of the vertex to each topic.
Taking the relevance vector of each vertex as a column yields the correlation matrix of the suspected hotspot data network set.
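Because the relevance formula is not available in this text, the sketch below substitutes an assumed one: the relevance of a vertex to a topic is taken to be the sum of P(key) over the keywords shared by the vertex title and the topic vector. This is consistent with the symbol definitions above but is not confirmed by the source. Titles are split on whitespace purely for illustration, and all names and sample values are hypothetical.

```python
def relevance(title_keywords, topic):
    """ASSUMED formula: sum of P(key) over keywords in both the title and the topic."""
    return sum(prob for keyword, prob in topic.items() if keyword in title_keywords)


def correlation_matrix(vertex_titles, topics):
    """Build the matrix with one row per topic and one column per vertex.

    Each column is the relevance vector of one vertex, as described in the text.
    """
    topic_ids = sorted(topics)
    matrix = [
        [relevance(set(title.split()), topics[topic_id]) for title in vertex_titles]
        for topic_id in topic_ids
    ]
    return topic_ids, matrix


# Hypothetical topics and vertex titles.
topics = {
    "auto": {"car": 0.5, "engine": 0.25},
    "food": {"noodle": 0.5},
}
vertex_titles = ["new car engine", "noodle shop", "car show"]
topic_ids, matrix = correlation_matrix(vertex_titles, topics)
```

With rows as topics and columns as vertices, the row sums used later for hotspot topic selection are per-topic totals.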
S1054. Obtain the elements of the correlation matrix whose values are greater than a preset relevance threshold.
S1055. If the number of such elements is greater than a preset heat threshold, judge the suspected hotspot data network to be a hotspot data network.
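Steps S1054 and S1055 amount to counting the matrix elements above the relevance threshold and comparing that count with the heat threshold. A minimal sketch, with hypothetical threshold values and a made-up matrix:

```python
def is_hotspot(matrix, relevance_threshold, heat_threshold):
    """Judge a suspected hotspot data network from its correlation matrix.

    Counts the elements greater than relevance_threshold and reports a hotspot
    when that count is greater than heat_threshold.
    """
    strong_elements = sum(
        1 for row in matrix for value in row if value > relevance_threshold
    )
    return strong_elements > heat_threshold


# Hypothetical correlation matrix (rows: topics, columns: vertices).
matrix = [
    [0.75, 0.0, 0.5],
    [0.0, 0.5, 0.0],
]
hot = is_hotspot(matrix, relevance_threshold=0.4, heat_threshold=2)
not_hot = is_hotspot(matrix, relevance_threshold=0.4, heat_threshold=3)
```

Three elements exceed 0.4 here, so the network counts as a hotspot for a heat threshold of 2 but not for a heat threshold of 3.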
Specifically, as shown in Fig. 3, obtaining the popular-industry clustering result from the hotspot data network sets of each grouped data set comprises:
S1061. Obtain the hotspot data network sets in each grouped data set.
S1062. Obtain the N hotspot topics corresponding to each hotspot data network set.
Specifically, as shown in Fig. 4, the method for obtaining the hotspot topics corresponding to a hotspot data network set comprises:
S10621. Calculate the sum of the elements of each row in the correlation matrix of the hotspot data network set;
S10622. Select the N rows with the largest sums and take their corresponding topics as the hotspot topics.
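Steps S10621 and S10622 can be sketched as a row-sum ranking. The topic names and matrix values are hypothetical; ties are resolved by the original row order, which is a choice made for this example rather than something the source specifies.

```python
def top_n_topics(topic_ids, matrix, n):
    """Return the topics of the n rows with the largest row sums."""
    totals = {topic_id: sum(row) for topic_id, row in zip(topic_ids, matrix)}
    # sorted() is stable, so equal totals keep their original row order.
    ranked = sorted(totals, key=lambda topic_id: totals[topic_id], reverse=True)
    return ranked[:n]


# Hypothetical correlation matrix: one row per topic, one column per vertex.
topic_ids = ["auto", "food", "travel"]
matrix = [
    [0.75, 0.0, 0.5],     # auto   -> 1.25
    [0.0, 0.5, 0.0],      # food   -> 0.5
    [0.25, 0.25, 0.25],   # travel -> 0.75
]
hot_topics = top_n_topics(topic_ids, matrix, n=2)
```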
S1063. Obtain the M hotspot topics of the grouped data set from the hotspot topics of its hotspot data network sets.
Specifically, the M topics that occur most frequently among the hotspot topics of the hotspot data network sets are the M hotspot topics of the grouped data set.
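Step S1063 is a frequency count over the hotspot topics of all hotspot data network sets in a group. A minimal sketch using `collections.Counter`, with hypothetical topic lists:

```python
from collections import Counter


def top_m_group_topics(per_network_hot_topics, m):
    """Return the m topics that occur most often across the hotspot data network sets."""
    counts = Counter(
        topic for hot_topics in per_network_hot_topics for topic in hot_topics
    )
    return [topic for topic, _ in counts.most_common(m)]


# Hypothetical hotspot topics of three hotspot data network sets in one group.
per_network_hot_topics = [
    ["auto", "travel"],
    ["auto", "food"],
    ["travel", "auto"],
]
group_hot_topics = top_m_group_topics(per_network_hot_topics, m=2)
```

In this sample `auto` occurs three times, `travel` twice, and `food` once, so the two group-level hotspot topics are `auto` and `travel`.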
S1064. Cluster according to the M hotspot topics to obtain the popular-industry clustering result.
Specifically, the clustering comprises:
S10. Calculate the dissimilarity between every two hotspot topics and obtain the dissimilarity matrix R = {r_ij}, an n × n matrix.
Specifically, the dissimilarity between two hotspot topics can be obtained with existing techniques for building a dissimilarity matrix, which the embodiment of the invention does not describe further.
S20. Initialize the industry categories ω.
S30. Obtain the membership degree of each hotspot topic with respect to the industry categories.
S40. Obtain the contribution degree of each hotspot topic with respect to the industry categories.
S50. Output the clustering result expressed by the membership degrees and contribution degrees.
Specifically, the membership degree is denoted u, where u_ik is the membership degree of hotspot topic x_i in industry category ω_k; the contribution degree is denoted v, where v_kj is the contribution weight of hotspot topic x_j to industry category ω_k.
The membership degree is calculated according to formula (1) and the contribution degree according to formula (2) (both formulas are missing from this text). Taking the M hotspot topics as input, formula (1) and formula (2) are evaluated iteratively to obtain the membership degree and contribution degree of each hotspot topic with respect to the industry categories. φ and β in formula (1) and formula (2) are constants related to clustering precision.
It should be understood that "multiple" as used herein means two or more. "And/or" describes an association between objects and indicates that three relationships are possible; for example, "A and/or B" can mean A alone, both A and B, or B alone. The character "/" generally indicates an "or" relationship between the objects before and after it.
The numbering of the above embodiments is for description only and does not indicate that one embodiment is better or worse than another.
A person of ordinary skill in the art will understand that all or part of the steps of the above embodiments can be implemented in hardware, or by a program that instructs the relevant hardware. The program can be stored in a computer-readable storage medium, which may be a read-only memory, a magnetic disk, an optical disc, or the like.
The above are only preferred embodiments of the invention and are not intended to limit it. Any modification, equivalent replacement, or improvement made within the spirit and principles of the invention shall fall within the scope of protection of the invention.

Claims (5)

1. An industry-based clustering method, characterized by comprising:
Obtaining a data set, the data set including class-one data and class-two data;
Grouping the data according to the publication time of the class-one data to obtain grouped data sets, each grouped data set including class-one data and the class-two data related to that class-one data;
Preprocessing each grouped data set to obtain the multiple data network sets corresponding to the grouped data set;
For each grouped data set, calculating its corresponding topic vector set;
Obtaining the hotspot data network sets in the grouped data set based on the topic vector set;
Obtaining a popular-industry clustering result from the hotspot data network sets of each grouped data set;
Presenting the clustering result in a bubble chart.
2. The method according to claim 1, characterized by further comprising:
The bubble chart uses the time period corresponding to each grouped data set as the horizontal axis and uses bubbles to represent the clustering result of the popular topics in each grouped data set.
3. The method according to claim 1, characterized in that:
Obtaining the hotspot data network sets in the grouped data set based on the topic vector set comprises:
Obtaining the heat attributes of each data network set;
Extracting suspected hotspot data network sets according to the heat attributes;
Obtaining the correlation matrix of each suspected hotspot data network set;
Obtaining the elements of the correlation matrix whose values are greater than a preset relevance threshold;
If the number of such elements is greater than a preset heat threshold, judging the suspected hotspot data network to be a hotspot data network.
4. The method according to claim 1, characterized in that:
Obtaining the popular-industry clustering result from the hotspot data network sets of each grouped data set comprises:
Obtaining the hotspot data network sets in each grouped data set;
Obtaining the N hotspot topics corresponding to each hotspot data network set;
Obtaining the M hotspot topics of the grouped data set from the hotspot topics of its hotspot data network sets;
Clustering according to the M hotspot topics to obtain the popular-industry clustering result.
5. The method according to claim 4, characterized in that:
The method for obtaining the hotspot topics corresponding to a hotspot data network set comprises:
Calculating the sum of the elements of each row in the correlation matrix of the hotspot data network set;
Selecting the N rows with the largest sums and taking their corresponding topics as the hotspot topics.
CN201811644123.3A 2018-12-30 2018-12-30 Industry-based clustering method Active CN109684480B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201811644123.3A CN109684480B (en) 2018-12-30 2018-12-30 Industry-based clustering method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201811644123.3A CN109684480B (en) 2018-12-30 2018-12-30 Industry-based clustering method

Publications (2)

Publication Number Publication Date
CN109684480A true CN109684480A (en) 2019-04-26
CN109684480B CN109684480B (en) 2021-01-05

Family

ID=66191428

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201811644123.3A Active CN109684480B (en) 2018-12-30 2018-12-30 Industry-based clustering method

Country Status (1)

Country Link
CN (1) CN109684480B (en)

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20070260586A1 (en) * 2006-05-03 2007-11-08 Antonio Savona Systems and methods for selecting and organizing information using temporal clustering
CN103914445A (en) * 2014-03-05 2014-07-09 中国人民解放军装甲兵工程学院 Data semantic processing method
CN104199974A (en) * 2013-09-22 2014-12-10 中科嘉速(北京)并行软件有限公司 Microblog-oriented dynamic topic detection and evolution tracking method
KR20160136014A (en) * 2015-05-19 2016-11-29 한국과학기술원 Method and system for topic clustering of big data
CN107784010A (en) * 2016-08-29 2018-03-09 上海掌门科技有限公司 A kind of method and apparatus for being used to determine the temperature information of theme of news
CN108334591A (en) * 2018-01-30 2018-07-27 天津中科智能识别产业技术研究院有限公司 Industry analysis method and system based on focused crawler technology

Also Published As

Publication number Publication date
CN109684480B (en) 2021-01-05

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
CB03 Change of inventor or designer information
  Inventor after: Feng Wei
  Inventor after: Xu Chengdi
  Inventor before: Xu Chengdi
TA01 Transfer of patent application right
  Effective date of registration: 20201218
  Address after: Room 324, building 10, 2 Jintai West Road, Chaoyang District, Beijing
  Applicant after: BEIJING PEOPLE ONLINE NETWORK Co.,Ltd.
  Address before: 310052 476, 4 floor, 3 story A building, No. 301, Binxing Road, Changhe street, Binjiang District, Hangzhou, Zhejiang.
  Applicant before: HANGZHOU YITU NETWORK TECHNOLOGY Co.,Ltd.
GR01 Patent grant