CN101556582A - System for analyzing and predicting netizen interest in forum - Google Patents

Info

Publication number
CN101556582A
Authority
CN
China
Prior art keywords
netizen
interest
analysis
analyzing
predicting
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CNA2008100357691A
Other languages
Chinese (zh)
Inventor
张世永
吴承荣
谢剑锋
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
FUDAN GUANGHUA INFORMATION SCIENCE AND TECHNOLOGY Co Ltd SHANGHAI
Original Assignee
FUDAN GUANGHUA INFORMATION SCIENCE AND TECHNOLOGY Co Ltd SHANGHAI
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by FUDAN GUANGHUA INFORMATION SCIENCE AND TECHNOLOGY Co Ltd SHANGHAI
Priority to CNA2008100357691A
Publication of CN101556582A

Landscapes

  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention relates to a system for analyzing and predicting netizen interest in a forum. The system comprises a data storage layer, an intelligent content analysis layer, an association analysis layer and an interest analysis layer. The data storage layer stores structured and unstructured data. The intelligent content analysis layer performs subject classification, hot topic extraction and tracking, and tendentiousness analysis on the data held in the data storage layer. The association analysis layer associates netizens with content and then netizens with netizens, according to the subject classifications and hot topics. The interest analysis layer analyzes and predicts netizen interest from the netizen-content associations, the netizen-netizen associations and the tendentiousness analysis. The system meets the need for in-depth analysis of netizen interest in forums and is suitable for implementing a network public opinion analysis system.

Description

System for analyzing and predicting netizen interest in forum
Technical field
The present invention is an analysis technique for network virtual environments, specifically a system for analyzing and predicting netizen interest in forums, and belongs to the field of data mining technology.
Background technology
With the development of network informatization, a large number of online virtual communities have emerged and formed a network virtual environment, of which the online forum is one of the principal forms. Traditional society has long had an effective system for managing individuals and groups, but the network virtual environment is new: it allows people to speak freely online and also lets netizens remain anonymous, which makes supervision more difficult. Network public opinion has now become a very important matter. Compared with other network applications, forums better reflect the way people gather online and therefore better reflect the state of network public opinion. Analyzing netizens, the main driving force of public opinion in forums, is thus of great significance. By analyzing netizen interest in forums, the main trend of network public opinion over a given period can be grasped accurately.
Although forum-based netizen interest analysis has good development and application prospects, and some related systems have appeared, systems in this field still have a series of problems, mainly the following:
1. They perform only a simple association between netizens and the articles they publish. They lack a systematic analysis, over a span of time, of the discussion threads, hot topics and content categories in which a netizen participates, so the analysis of an individual netizen lacks depth.
2. Netizen activity on the network often has a group character, which existing systems and methods tend to ignore. Network public opinion essentially forms under the impetus of network communities, and an individual netizen can hardly become a force alone. The online crowd therefore needs to be analyzed in depth.
3. Existing systems and methods analyze only current, local data. Netizen interests are not isolated, however; they are usually tied to the wider network environment and to how the network develops. Existing systems and methods lack a netizen model knowledge base with which to analyze and predict netizen interest as a whole.
In short, analyzing netizen interest in forums is very important, and doing so requires deep data mining. Existing systems are deficient in netizen-content association, netizen-netizen association and the netizen model knowledge base, and cannot meet the requirement for in-depth analysis of netizen interest.
Summary of the invention
The purpose of the present invention is to address the defects of existing forum-based systems for analyzing netizen interest in network virtual environments. It proposes a data-mining-based system for analyzing and predicting netizen interest in forums, built on three technical foundations: netizen-content association, netizen-netizen association, and a netizen model knowledge base. By relating netizens to hot topics, discussion threads and content categories, analyzing tendentiousness, analyzing relationships between netizens, and accumulating a long-term netizen model knowledge base, the system explores the origin and development of netizen interest in depth and makes predictions, achieving in-depth analysis of forum netizen interest.
The system consists of a data storage layer, an intelligent content analysis layer, an association analysis layer and an interest analysis layer.
The data storage layer stores structured and unstructured data in the local system; both the loading and the indexing of data are done at this layer. Structured data, such as netizen ID and time, is stored in a general-purpose commercial database; Oracle is used here. Unstructured data is mainly text content. If it were kept in a general-purpose commercial database, indexing performance would drop sharply as the data volume grew, so it is placed in a dedicated, independently developed unstructured data repository. Because the structured and unstructured data of each article are stored in different databases and are of different types, the data belonging to the same article must be associated. The unique ID of the structured data in the commercial database serves as the basis for this association.
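The ID-based link between the two stores can be illustrated with a minimal sketch. It assumes in-memory dictionaries standing in for the relational tables and for the dedicated text repository; the names `structured_db`, `text_store`, `store_article` and `load_article` are hypothetical and do not come from the patent.

```python
from dataclasses import dataclass
from datetime import datetime

@dataclass
class ArticleMeta:
    """Structured fields of one forum post (stand-in for a relational row)."""
    article_id: int
    netizen_id: str
    posted_at: datetime

# Hypothetical stand-ins: a relational table and a separate text repository.
structured_db = {}
text_store = {}

def store_article(article_id, netizen_id, posted_at, text):
    """Write both halves of an article under the same unique ID."""
    structured_db[article_id] = ArticleMeta(article_id, netizen_id, posted_at)
    text_store[article_id] = text

def load_article(article_id):
    """Re-join structured and unstructured data through the shared ID."""
    return structured_db[article_id], text_store[article_id]

store_article(1, "user_a", datetime(2008, 4, 9, 10, 0), "example post body")
meta, body = load_article(1)
```

Keeping the ID as the only shared field means either store can be replaced or re-indexed without touching the other.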
The intelligent content analysis layer applies data mining methods to the unstructured data, mainly text classification, text clustering and text summarization, to analyze text content intelligently. It implements subject classification, hot topic extraction and tracking, and tendentiousness analysis.
Text classification combines manual and automated work to assign articles to predefined subjects. Many classification methods exist; a support vector machine (SVM) is used here, a method built on word statistics. The workflow is as follows. First, a portion of the articles is selected manually as the training set. Second, the training set is segmented into Chinese words, stop words are filtered out, feature words are extracted, and each article in the training set is converted into a feature word vector. Third, the classifier trainer is called to train on the training set vectors, yielding a classifier. Fourth, the text to be classified is input, its features are extracted according to the training set's feature words to form a feature vector, and the classifier classifies it.
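The four steps can be sketched in a few lines of Python. This is only an illustration under stated assumptions: whitespace splitting stands in for Chinese word segmentation, raw term counts stand in for the feature weights, and the SVM is a two-class linear one trained by plain subgradient descent on the hinge loss with no bias term. The patent does not say which SVM trainer or kernel was used, and all names and data here are hypothetical.

```python
from collections import Counter

STOP_WORDS = {"the", "a", "of", "is"}  # illustrative stop-word list

def tokenize(text):
    """Whitespace split standing in for Chinese word segmentation."""
    return [w for w in text.lower().split() if w not in STOP_WORDS]

def build_vocab(docs):
    """Step 2: collect feature words from the training set."""
    words = sorted({w for d in docs for w in tokenize(d)})
    return {w: i for i, w in enumerate(words)}

def vectorize(text, vocab):
    """Represent one article as a term-count vector over the vocabulary."""
    vec = [0.0] * len(vocab)
    for word, count in Counter(tokenize(text)).items():
        if word in vocab:
            vec[vocab[word]] = float(count)
    return vec

def train_linear_svm(X, y, lam=0.01, eta=0.1, epochs=20):
    """Step 3: minimise the L2-regularised hinge loss by subgradient descent.
    Labels are +1 / -1; the bias term is omitted for brevity."""
    w = [0.0] * len(X[0])
    for _ in range(epochs):
        for xi, yi in zip(X, y):
            margin = yi * sum(wj * xj for wj, xj in zip(w, xi))
            w = [wj * (1.0 - eta * lam) for wj in w]   # regularisation shrink
            if margin < 1.0:                            # margin violated: correct
                w = [wj + eta * yi * xj for wj, xj in zip(w, xi)]
    return w

def classify(text, vocab, w):
    """Step 4: vectorise unseen text with the training vocabulary and score it."""
    score = sum(wj * xj for wj, xj in zip(w, vectorize(text, vocab)))
    return "finance" if score >= 0 else "sports"

# Step 1: a hand-labelled training set (toy data).
train_docs = ["stock market rises", "bank stock falls",
              "football match tonight", "team wins football match"]
train_labels = [1, 1, -1, -1]   # +1 = finance, -1 = sports

vocab = build_vocab(train_docs)
X = [vectorize(d, vocab) for d in train_docs]
weights = train_linear_svm(X, train_labels)
```

Words that never appeared in the training set are simply ignored at classification time, which mirrors the statement that features are extracted according to the training set's feature words.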
Hot topic extraction and tracking combines text clustering with text classification: hot topics are extracted by clustering and tracked by classification. The workflow is as follows. First, the text data from a specified time period is segmented into Chinese words and its features are extracted to form vectors. Second, the vectors are clustered automatically; among the many clustering algorithms available, a hierarchical clustering algorithm is used. Third, each cluster is taken as a new hot topic. If the topic needs to be tracked, the articles in the new hot topic are used as a text classification training set and a classifier is trained on them. Fourth, the resulting classifier classifies newly input articles and assigns them to a hot topic, which achieves hot topic tracking.
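A compact sketch of this extract-then-track loop follows. It assumes pre-segmented text, cosine similarity over term counts, average-linkage bottom-up clustering with a hypothetical stopping threshold, and a nearest-centroid rule as a simplified stand-in for the trained classifier used for tracking. None of these specific choices is stated in the patent.

```python
import math
from collections import Counter

def to_vector(text):
    """Term-count vector; whitespace split stands in for word segmentation."""
    return Counter(text.lower().split())

def cosine(a, b):
    dot = sum(a[k] * b[k] for k in a if k in b)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def agglomerative(vectors, threshold=0.3):
    """Bottom-up hierarchical clustering with average linkage.
    Merging stops once the most similar pair falls below the threshold."""
    clusters = [[i] for i in range(len(vectors))]

    def linkage(c1, c2):
        sims = [cosine(vectors[i], vectors[j]) for i in c1 for j in c2]
        return sum(sims) / len(sims)

    while len(clusters) > 1:
        best, pair = -1.0, None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                s = linkage(clusters[i], clusters[j])
                if s > best:
                    best, pair = s, (i, j)
        if best < threshold:
            break
        i, j = pair
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return clusters

def track(text, vectors, clusters):
    """Assign a new post to the hot topic whose centroid it resembles most."""
    centroids = [sum((vectors[i] for i in c), Counter()) for c in clusters]
    v = to_vector(text)
    sims = [cosine(v, c) for c in centroids]
    return max(range(len(sims)), key=sims.__getitem__)

posts = ["earthquake relief donation", "earthquake relief volunteers",
         "olympic torch relay", "olympic torch route"]
vectors = [to_vector(p) for p in posts]
topics = agglomerative(vectors)          # each cluster is one hot topic
```

With the toy posts above, the two earthquake posts and the two torch posts end up in separate clusters, and a later post such as "olympic torch arrives" is tracked to the second one.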
Tendentiousness analysis combines manual and automatic work. First, a semantic lexicon of general terms is built, in which each word is assigned a tendency weight. Second, the text content is input and its words are weighted semantically using the lexicon, giving the tendency of the text. Finally, the result of the tendentiousness analysis is adjusted manually.
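The lexicon-weighting idea can be sketched as follows. The lexicon entries, their weights, the summation rule and the `manual_override` parameter are all illustrative assumptions; the patent does not specify how weights are combined or how the manual adjustment is recorded.

```python
# Hypothetical semantic lexicon: word -> tendency weight in [-1, 1].
LEXICON = {"support": 1.0, "great": 0.8, "oppose": -1.0, "terrible": -0.9}

def tendency_score(text, lexicon=LEXICON):
    """Sum the weights of the lexicon words that occur in the text."""
    return sum(lexicon.get(w, 0.0) for w in text.lower().split())

def tendency_label(text, manual_override=None):
    """Automatic label, unless a human reviewer supplied a correction."""
    if manual_override is not None:
        return manual_override
    score = tendency_score(text)
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"

label = tendency_label("I support this great plan")
```

A reviewer's correction is passed as `manual_override`, so the automatic score is only the default.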
The association analysis layer performs netizen-content association and then netizen-netizen association, according to the subject classification and the hot topics. Netizen-content association does not mean linking a netizen to the articles he or she has published. Instead, it uses the output of the intelligent content analysis layer to associate the netizen with the current subject classifications, hot topics and speech tendency, showing which subject categories and hot topics the netizen is interested in during the period and what attitude the netizen holds. It mainly uses probabilistic statistics on how much attention the netizen pays to each direction, and from this determines the points of interest.
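As a rough illustration of the probabilistic statistics involved, the sketch below turns per-post analysis results into a per-netizen interest distribution. The record layout `(netizen_id, category, tendency)` and the function names are assumptions made for the example, not part of the patent.

```python
from collections import Counter, defaultdict

# Each record: (netizen_id, subject category, tendency), as produced upstream.
posts = [
    ("user_a", "economy", "negative"),
    ("user_a", "economy", "negative"),
    ("user_a", "sports", "positive"),
    ("user_b", "sports", "positive"),
]

def interest_distribution(records):
    """Share of each netizen's posts that falls into each category."""
    counts = defaultdict(Counter)
    for netizen, category, _ in records:
        counts[netizen][category] += 1
    return {n: {c: k / sum(cnt.values()) for c, k in cnt.items()}
            for n, cnt in counts.items()}

def focus(dist):
    """Category with the highest probability, taken as the point of interest."""
    return max(dist, key=dist.get)

def dominant_attitude(records, netizen, category):
    """Most frequent tendency the netizen shows within one category."""
    votes = Counter(t for n, c, t in records if n == netizen and c == category)
    return votes.most_common(1)[0][0]

dist = interest_distribution(posts)
```

For `user_a` the distribution puts two thirds of the weight on "economy", with a predominantly negative attitude there.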
Netizen-netizen association draws together the structured data, the results of the intelligent content analysis layer and the results of netizen-content association. Using data association methods, it derives a network social structure comprising network communities, network groups and network cliques. From the forum's structured data, including site, board, netizen and time, the analysis finds the netizens who are frequently active in a given category of a given board of a given site during a given period; this set is defined as a network community. Within a network community, the netizens who frequently take part in the same class of sensitive topic are defined as a network group. Within a network group, the netizens who frequently take part in the same discussion thread, that is, the same root post and its replies, are defined as a network clique.
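The three nested groupings defined above can be sketched as successive refinements of a grouping key. The record layout is hypothetical, and the sketch groups every participant, leaving out the frequency threshold implied by "frequently".

```python
from collections import defaultdict

# (site, board, category, hot_topic, thread_id, netizen_id): hypothetical layout
posts = [
    ("s1", "news", "economy", "housing", "t1", "a"),
    ("s1", "news", "economy", "housing", "t1", "b"),
    ("s1", "news", "economy", "housing", "t2", "c"),
    ("s1", "news", "economy", "stocks",  "t3", "d"),
]

def group_by(records, key_len):
    """Collect the netizens that share the first key_len fields of a record."""
    members = defaultdict(set)
    for rec in records:
        members[rec[:key_len]].add(rec[-1])
    return dict(members)

communities = group_by(posts, 3)   # same site, board and category
net_groups = group_by(posts, 4)    # ... and same hot topic
cliques = group_by(posts, 5)       # ... and same thread (root post and replies)
```

Because each key extends the previous one, every clique is contained in a group and every group in a community.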
The interest analysis layer analyzes and predicts netizen interest according to the netizen-content association, the netizen-netizen association and the tendentiousness analysis. It comprises three modules. The netizen model knowledge base module generalizes and summarizes past interest analyses of individual netizens and netizen groups into empirical models, which serve as machine learning knowledge for subsequent analysis. The netizen interest analysis module analyzes the interests of individual netizens and the points of interest of netizen groups with the help of the netizen model knowledge base module. The netizen interest development prediction module predicts the future interest development of individual netizens and netizen groups with the help of the netizen model knowledge base module.
The netizen model knowledge base module generalizes and summarizes past interest analyses of netizens and groups into empirical models that serve as machine learning knowledge for later analysis. The knowledge base records the probability distribution of the interests of netizens and groups, and how that distribution changes over time.
The netizen interest analysis module analyzes not only the interests of individual netizens but also the points of interest of netizen groups. It takes the results of the netizen-content association module and the netizen-netizen association module, combines them with the netizen model knowledge base, weighs the past interest experience of netizens and groups, and determines the current interest distribution of netizens.
The netizen interest development prediction module starts from the current discussion focus of netizens and groups, uses the netizen model knowledge base to derive past development patterns, and, after comparison, makes a suitable prediction of the future interest development of netizens and groups. A Markov model is used: it takes the probability distribution of points of interest at each time point and, from the probability distribution of the current points of interest, predicts to a certain extent how the points of interest will develop.
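One way to read the Markov step is as a first-order chain over interest categories, sketched below. The assumptions are that each time period is summarized by a single dominant category, that transition probabilities are estimated by counting consecutive pairs in the history, and that the next distribution is the current one pushed through those transitions. The patent does not give the model's order or its estimation procedure.

```python
from collections import Counter, defaultdict

def fit_transitions(history):
    """Estimate P(next category | current category) from consecutive periods."""
    counts = defaultdict(Counter)
    for prev, nxt in zip(history, history[1:]):
        counts[prev][nxt] += 1
    return {s: {t: c / sum(row.values()) for t, c in row.items()}
            for s, row in counts.items()}

def predict_next(current, transitions):
    """Push the current interest distribution one step through the chain."""
    nxt = defaultdict(float)
    for state, p in current.items():
        for target, q in transitions.get(state, {}).items():
            nxt[target] += p * q
    return dict(nxt)

# Dominant interest of one netizen in six consecutive periods (toy data).
history = ["economy", "economy", "sports", "economy", "sports", "sports"]
T = fit_transitions(history)
forecast = predict_next({"economy": 1.0}, T)
```

Here a netizen currently focused on "economy" is forecast to move to "sports" with probability 2/3, because two of the three observed transitions out of "economy" led there.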
The present invention has substantive features and represents notable progress: (1) it analyzes netizen interest through in-depth mining of netizen-content associations; (2) by analyzing the online crowd, it reveals the role a netizen plays and the effect the netizen has on the network, and thereby the netizen's motivation; (3) it uses a netizen model knowledge base to accumulate models from large amounts of netizen-related information and applies them to the analysis of current data, which helps locate netizen interest as a whole and make suitable predictions.
The system proposed by the present invention, a data-mining-based system for analyzing and predicting netizen interest in forums built on netizen-content association, netizen-netizen association and a netizen model knowledge base, makes full use of network content information, netizen information and historical data. It effectively meets the need for in-depth forum-based netizen interest analysis and is suitable for implementing a network public opinion analysis system.
Description of drawings
The accompanying drawing is a system architecture diagram of one embodiment of the system for analyzing and predicting netizen interest in forums.
Embodiment
Embodiments of the present invention are described in detail below with reference to the accompanying drawing.
The accompanying drawing shows the system architecture of one embodiment of the system. As shown, the architecture is divided into four layers. The first layer is the data storage layer, which manages the loading and indexing of structured and unstructured data. The second layer is the intelligent content analysis layer, which uses data mining methods to perform text classification, hot topic extraction and tracking, and tendentiousness analysis on article content. The third layer is the association analysis layer, which comprises the netizen-content association module and the netizen-netizen association module; the results of the former are the basis for the latter. The fourth and final layer is the interest analysis layer, which comprises the netizen model knowledge base module, the netizen interest analysis module and the netizen interest development prediction module. In the calling order, the netizen interest analysis module calls the netizen model knowledge base module, and these two modules in turn form the basis of the netizen interest development prediction module.
In the intelligent content analysis layer, text data is first fed into the content analysis module, which calls the Chinese word segmentation function to segment the Chinese text. Feature selection follows and involves two tasks: stop words are removed, then TF-IDF values are calculated and used to select features. Feature selection differs between text classification and text clustering: text classification performs feature selection directly on the training documents, whereas text clustering treats every test document as a separate category and then performs feature selection, so two feature selection results are obtained. After feature selection the processing splits into two branches, text classification and text clustering. In the text classification branch, the classifier training function is called first to obtain a trained classifier; text classification is then performed; finally, tendentiousness analysis is applied to the classification results to obtain the speech tendency of each category. In the text clustering branch, the text clustering function is called first to cluster the texts into categories automatically; the resulting clusters are then extracted to form new hot topics, which are tracked; finally, tendentiousness analysis is applied to the hot topics to obtain the speech tendency of each hot topic.
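The TF-IDF step of feature selection can be sketched as follows. The sketch assumes documents that are already segmented, stop-word filtered and non-empty, uses the common form tf × ln(N / df), and ranks each word by its highest score in any document. The patent does not state which TF-IDF variant or selection rule was used, so these are illustrative choices.

```python
import math
from collections import Counter

def tfidf(docs):
    """TF-IDF weight of every word in every (tokenized, non-empty) document."""
    n = len(docs)
    df = Counter(w for d in docs for w in set(d))   # document frequency
    weights = []
    for d in docs:
        tf = Counter(d)
        weights.append({w: (c / len(d)) * math.log(n / df[w])
                        for w, c in tf.items()})
    return weights

def select_features(docs, k):
    """Keep the k words with the highest TF-IDF score in any document."""
    best = {}
    for scores in tfidf(docs):
        for w, s in scores.items():
            best[w] = max(best.get(w, 0.0), s)
    return sorted(best, key=lambda w: (-best[w], w))[:k]

docs = [["house", "price", "rise"],
        ["house", "price", "fall"],
        ["match", "score"]]
features = select_features(docs, 2)
```

Words shared by several documents, such as "house" and "price", score lower than words confined to one document, so they are the first to be dropped.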
The association analysis layer contains both the netizen-content association module and the netizen-netizen association module. The netizen-content association module has three parts: the first associates text classification results with site, board and netizen; the second associates hot topic analysis results with site, board and netizen; the third associates same-title discussion threads with site, board and netizen. The netizen-netizen association module also has three parts, corresponding to the three above: the first defines the netizens sharing the same site, board and category as a network community; the second defines the netizens sharing the same site, board and topic as a network group; the third defines the netizens sharing the same site, board and same-title discussion thread as a network clique.
In the interest analysis layer, the network communities, network groups, network cliques, individual netizens and tendentiousness analysis results obtained above are combined and analyzed statistically to obtain the points of interest of netizens and netizen groups. On this basis, and together with the netizen model knowledge base, the interest development of netizens and netizen groups is predicted. This covers interest analysis and development prediction for network communities, for network groups, for network cliques and for individual netizens.
As the above implementation shows, the system of the present invention, built on netizen-content association, netizen-netizen association and a netizen model knowledge base, effectively achieves in-depth analysis of forum netizen interest and provides reliable information for analyzing netizens and groups in network public opinion analysis.

Claims (7)

1. A system for analyzing and predicting netizen interest in a forum, characterized by comprising:
a data storage layer for storing structured data and unstructured data;
an intelligent content analysis layer for performing subject classification, hot topic extraction and tracking, and tendentiousness analysis on the data in the data storage layer;
an association analysis layer for performing, in sequence, netizen-content association and netizen-netizen association according to the subject classification and the hot topics; and
an interest analysis layer for analyzing and predicting netizen interest according to the netizen-content association, the netizen-netizen association and the tendentiousness analysis.
2. The system for analyzing and predicting netizen interest in a forum according to claim 1, characterized in that the interest analysis layer comprises:
a netizen model knowledge base module for generalizing and summarizing past interest analyses of individual netizens and netizen groups into empirical models, which serve as machine learning knowledge for subsequent analysis;
a netizen interest analysis module for analyzing the interests of individual netizens and the points of interest of netizen groups according to the netizen model knowledge base module; and
a netizen interest development prediction module for predicting the future interest development of individual netizens and netizen groups according to the netizen model knowledge base module.
3. The system for analyzing and predicting netizen interest in a forum according to claim 1 or 2, characterized in that the netizen-content association comprises association analysis of text classification results with netizens, association analysis of hot topic analysis results with site, board and netizens, and association analysis of same-title discussion threads with netizens.
4. The system for analyzing and predicting netizen interest in a forum according to claim 1 or 2, characterized in that the netizen-netizen association comprises association between netizens of the same site, same board and same category, association between netizens of the same site, same board and same topic, and association between netizens of the same site, same board and same-title discussion thread.
5. The system for analyzing and predicting netizen interest in a forum according to claim 3, characterized in that the netizen-netizen association comprises association between netizens of the same site, same board and same category, association between netizens of the same site, same board and same topic, and association between netizens of the same site, same board and same-title discussion thread.
6. The system for analyzing and predicting netizen interest in a forum according to claim 1 or 2, characterized in that the data storage layer builds indexes for the structured data and the unstructured data.
7. The system for analyzing and predicting netizen interest in a forum according to claim 2, characterized in that the netizen interest analysis module uses a Markov model, taking the probability distribution of points of interest at each time point and, from the probability distribution of the current points of interest, predicting the development of future points of interest.
CNA2008100357691A 2008-04-09 2008-04-09 System for analyzing and predicting netizen interest in forum Pending CN101556582A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CNA2008100357691A CN101556582A (en) 2008-04-09 2008-04-09 System for analyzing and predicting netizen interest in forum

Publications (1)

Publication Number Publication Date
CN101556582A true CN101556582A (en) 2009-10-14

Family

ID=41174700

Country Status (1)

Country Link
CN (1) CN101556582A (en)

Cited By (12)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102567397A (en) * 2010-12-30 2012-07-11 高德软件有限公司 Method and device for relevance marking of interest points and chain store sub-branch interest points
CN102567397B (en) * 2010-12-30 2014-08-06 高德软件有限公司 Method and device for relevance marking of interest points and chain store sub-branch interest points
CN102999539A (en) * 2011-09-13 2013-03-27 富士通株式会社 Method and device for forecasting future development trend of given topic
CN102999539B (en) * 2011-09-13 2015-11-25 富士通株式会社 Predict the method and apparatus of the future developing trend of given topic
CN102388374A (en) * 2011-09-28 2012-03-21 华为技术有限公司 Method and device for data storage
CN102819576A (en) * 2012-07-23 2012-12-12 无锡雅座在线科技发展有限公司 Data mining method and system based on microblog
CN103455552A (en) * 2013-08-01 2013-12-18 百度在线网络技术(北京)有限公司 Point-of-interest mining method and device based on terms of interest
CN104035960A (en) * 2014-05-08 2014-09-10 东莞市巨细信息科技有限公司 Internet information hotspot predicting method
CN106528655A (en) * 2016-10-18 2017-03-22 百度在线网络技术(北京)有限公司 Text subject recognition method and device
CN109255026A (en) * 2018-08-23 2019-01-22 云南师范大学 A method of it is analyzed based on Co-word analysis and the learning demand of clustering
CN109255026B (en) * 2018-08-23 2021-06-25 云南师范大学 Learning demand analysis method based on common word analysis and cluster analysis
CN109902237A (en) * 2019-02-22 2019-06-18 苏州华必讯信息科技有限公司 System for analyzing and predicting netizen interest in forum

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
C02 Deemed withdrawal of patent application after publication (patent law 2001)
WD01 Invention patent application deemed withdrawn after publication

Open date: 20091014