CN101556582A

CN101556582A - System for analyzing and predicting netizen interest in forum

Info

Publication number: CN101556582A
Application number: CNA2008100357691A
Authority: CN
Inventors: 张世永; 吴承荣; 谢剑锋
Original assignee: FUDAN GUANGHUA INFORMATION SCIENCE AND TECHNOLOGY Co Ltd SHANGHAI
Current assignee: FUDAN GUANGHUA INFORMATION SCIENCE AND TECHNOLOGY Co Ltd SHANGHAI
Priority date: 2008-04-09
Filing date: 2008-04-09
Publication date: 2009-10-14

Abstract

The invention relates to a system for analyzing and predicting netizen interest in a forum, characterized by comprising a data storage layer, an intelligent content analysis layer, an association analysis layer and an interest analysis layer. The data storage layer stores structured and unstructured data. The intelligent content analysis layer performs subject classification, hot topic extraction and tracking, and tendentiousness analysis on the data stored in the data storage layer. The association analysis layer associates netizens with content and then netizens with netizens, according to the subject classifications and hot topics. The interest analysis layer analyzes and predicts netizen interest according to the netizen-content associations, the netizen-netizen associations and the tendentiousness analysis. The system effectively meets the need for in-depth mining in the analysis of netizen interest in forums and is applicable to the implementation of a network public opinion analysis system.

Description

System for analyzing and predicting netizen interest in forum

Technical field

The present invention relates to analysis technology for network virtual environments, specifically to a system for analyzing and predicting netizen interest in forums, and belongs to the field of data mining technology.

Background technology

With the development of network informatization, a large number of online virtual communities have appeared and formed a network virtual environment, of which the network forum is one of the principal forms. Traditional society has long had an effective system for managing individuals and groups, but the network virtual environment is something new: it is characterized not only by free online speech but also by netizen anonymity, which makes supervision more difficult. Network public opinion has become a very important concern, and forums, more than other network applications, embody the way networks gather crowds and so better reflect the state of network public opinion. Analysis of netizens, the main driving force of public opinion in forums, is therefore significant. By analyzing netizen interest in forums, the main trend in the development of network public opinion within a given period can be grasped accurately.

Although forum-based netizen interest analysis has good development prospects and application potential, and some related systems have appeared, systems in this field still have a series of problems, mainly the following:

1. Netizens are simply associated with the articles they publish. Systematic analysis of netizen participation in threads, hot topics and content categories over a time span is lacking, so the analysis of individual netizens lacks depth.

2. Netizen activity on the network often has a group character, which existing systems and methods often ignore. Network public opinion is basically formed under the drive of netizen groups, and an individual netizen can hardly become a force alone. Netizen groups therefore need in-depth analysis.

3. Existing systems and methods all analyze immediate, local data. Netizen interests, however, are not isolated: they are often tied to the wider network environment and to the course of network development. Existing systems and methods lack a netizen model knowledge base for analyzing and predicting netizen interest as a whole.

As can be seen, the analysis of netizen interest in forums is very important, and it calls for depth in data mining. Existing systems are deficient in netizen-content association, netizen-netizen association and the netizen model knowledge base, and cannot meet the requirement for deep analysis of netizen interest.

Summary of the invention

The main purpose of the present invention is to address the defects of existing forum-based systems for analyzing netizen interest in the network virtual environment by proposing a data-mining-based system for analyzing and predicting netizen interest in forums, built on three technical foundations: netizen-content association, netizen-netizen association, and a netizen model knowledge base. Through the association of netizens with hot topics, threads, content categories and tendentiousness analysis, the analysis of relationships among netizens, and the long-term accumulation of a netizen model knowledge base, the system mines the origin and development of netizen interest in depth and makes predictions, achieving deep analysis of forum netizen interest.

The data-mining-based system of the present invention, built on netizen-content association, netizen-netizen association and the netizen model knowledge base, consists of a data storage layer, an intelligent content analysis layer, an association analysis layer and an interest analysis layer.

The data storage layer stores structured and unstructured data in the local system; data loading and indexing are both done at this layer. Structured data, such as netizen ID and time, is stored in a general-purpose commercial database, here Oracle. Unstructured data is mainly text content; if it were kept in a general-purpose commercial database, indexing performance would drop sharply as the data volume grows, so we place it in a dedicated, independently developed unstructured data repository. Because the structured and unstructured data of each article are stored in different databases and differ in type, the data belonging to the same article must be linked; we use the unique ID of the structured data in the commercial database as the basis of this link.
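
The linking scheme described above can be sketched as follows. This is only an illustration under stated assumptions: `sqlite3` stands in for the commercial database, a plain dictionary stands in for the dedicated unstructured repository, and the class name `ForumStore`, the table `post` and its columns are hypothetical.

```python
import sqlite3


class ForumStore:
    """Toy two-store layout: structured rows plus text keyed by the same ID."""

    def __init__(self):
        # sqlite3 is an assumption; the source names Oracle for this role.
        self.db = sqlite3.connect(":memory:")
        self.db.execute(
            "CREATE TABLE post (id INTEGER PRIMARY KEY, netizen_id TEXT, posted_at TEXT)"
        )
        # Hypothetical stand-in for the dedicated unstructured repository.
        self.text_store = {}

    def add_post(self, netizen_id, posted_at, text):
        cur = self.db.execute(
            "INSERT INTO post (netizen_id, posted_at) VALUES (?, ?)",
            (netizen_id, posted_at),
        )
        post_id = cur.lastrowid          # unique ID issued by the structured side
        self.text_store[post_id] = text  # the same ID links the text content
        return post_id

    def get_post(self, post_id):
        row = self.db.execute(
            "SELECT netizen_id, posted_at FROM post WHERE id = ?", (post_id,)
        ).fetchone()
        if row is None:
            return None
        return {
            "netizen_id": row[0],
            "posted_at": row[1],
            "text": self.text_store[post_id],
        }
```

Calling `add_post` returns the ID that both stores share, and `get_post` reassembles the article from the two sides.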

The intelligent content analysis layer applies data mining methods, mainly text classification, text clustering and text summarization, to the unstructured data. It performs intelligent analysis of text content and provides subject classification, hot topic extraction and tracking, and tendentiousness analysis.

Text classification combines manual and automated work to identify the category of an article among established subjects. Of the many classification methods available, we use the support vector machine (SVM), which is based on word statistics. The workflow is as follows. First, a portion of the articles is extracted manually as a training set. Second, the training set is segmented into Chinese words, stop words are filtered out, feature words are extracted, and each article in the set is converted into a feature-word vector. Third, the classification trainer is called to train on these vectors, yielding a classifier. Fourth, the text to be classified is input, features are extracted according to the training-set feature words to form a feature vector, and the classifier assigns it a category.
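
The four-step workflow can be illustrated with a small, self-contained sketch. It is not the trainer used by the system: a linear SVM fitted by sub-gradient descent on the hinge loss is swapped in for brevity, whitespace splitting stands in for Chinese word segmentation, and the stop-word list, function names and hyperparameters are all assumptions.

```python
STOP_WORDS = {"the", "a", "is", "of"}  # hypothetical stop-word list


def tokenize(text):
    # Whitespace split stands in for Chinese word segmentation (assumption).
    return [w for w in text.lower().split() if w not in STOP_WORDS]


def build_vocab(docs):
    # Feature words are simply every non-stop word seen in the training set.
    vocab = {}
    for doc in docs:
        for w in tokenize(doc):
            vocab.setdefault(w, len(vocab))
    return vocab


def vectorize(text, vocab):
    # Term-count vector over the training-set feature words.
    vec = [0.0] * len(vocab)
    for w in tokenize(text):
        if w in vocab:
            vec[vocab[w]] += 1.0
    return vec


def train_linear_svm(X, y, epochs=200, lr=0.1, lam=0.01):
    # Sub-gradient descent on the L2-regularised hinge loss; labels are +1 / -1.
    w = [0.0] * len(X[0])
    b = 0.0
    for _ in range(epochs):
        for xi, yi in zip(X, y):
            margin = yi * (sum(wj * xj for wj, xj in zip(w, xi)) + b)
            if margin < 1:
                w = [wj - lr * (lam * wj - yi * xj) for wj, xj in zip(w, xi)]
                b += lr * yi
            else:
                w = [wj - lr * lam * wj for wj in w]
    return w, b


def predict(text, vocab, w, b):
    x = vectorize(text, vocab)
    score = sum(wj * xj for wj, xj in zip(w, x)) + b
    return 1 if score >= 0 else -1
```

A typical call sequence is `vocab = build_vocab(train_docs)`, then `w, b = train_linear_svm([vectorize(d, vocab) for d in train_docs], labels)`, then `predict(new_text, vocab, w, b)`. Handling more than two subjects would need one such classifier per subject or a multi-class scheme, which this sketch leaves out.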

Hot topic extraction and tracking combines text clustering and text classification: hot topics are extracted by text clustering and tracked by text classification. The workflow is as follows. First, the text data within a specified time period is segmented into Chinese words and features are extracted to form vectors. Second, the vectors are clustered automatically; of the many clustering algorithms available, we use a hierarchical one. Third, the clusters produced are taken as new hot topics; if a topic needs to be tracked, the articles in the new hot topic are used as the training set for text classification and a classifier is trained. Fourth, the classifier is used to classify newly input articles and assign them to a hot topic, thereby tracking the hot topics.
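
The extraction half of this workflow can be sketched with a minimal agglomerative (bottom-up hierarchical) clustering routine. The source does not specify the linkage rule, similarity measure or stopping criterion, so average-link cosine similarity and the `threshold` parameter are assumptions, as is the whitespace tokenization.

```python
import math
from collections import Counter


def to_vector(text):
    # Whitespace split stands in for Chinese word segmentation (assumption).
    return Counter(text.lower().split())


def cosine(a, b):
    dot = sum(a[t] * b[t] for t in a if t in b)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0


def cluster_similarity(c1, c2, vectors):
    # Average-link similarity between two clusters of document indices.
    pairs = [(i, j) for i in c1 for j in c2]
    return sum(cosine(vectors[i], vectors[j]) for i, j in pairs) / len(pairs)


def hierarchical_cluster(texts, threshold=0.3):
    vectors = [to_vector(t) for t in texts]
    clusters = [[i] for i in range(len(texts))]  # start with one cluster per text
    while len(clusters) > 1:
        best, best_pair = -1.0, None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                sim = cluster_similarity(clusters[a], clusters[b], vectors)
                if sim > best:
                    best, best_pair = sim, (a, b)
        if best < threshold:
            break  # remaining clusters are too dissimilar to merge
        a, b = best_pair
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]
    return clusters
```

Each returned cluster is a list of indices into the input texts and can be read as one candidate hot topic; the texts of a cluster could then serve as the training set for the tracking classifier.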

Tendentiousness analysis combines manual and automatic work. First, we build a semantic base of general terms in which each word is assigned a tendency weight. Second, for input text, the words are weighted semantically using the semantic base to obtain the tendency of the text. Third, manual intervention is used to adjust the result of the tendentiousness analysis.
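
A minimal sketch of this three-step scheme follows. The words and weights in `SEMANTIC_BASE`, the summing rule and the three output labels are illustrative assumptions; the source states only that each word carries a tendency weight and that a person can adjust the result.

```python
# Hypothetical semantic base: each word carries a tendency weight.
SEMANTIC_BASE = {"support": 1.0, "great": 0.8, "oppose": -1.0, "terrible": -0.8}


def tendency_score(text, semantic_base=SEMANTIC_BASE):
    # Whitespace split stands in for Chinese word segmentation (assumption).
    words = text.lower().split()
    return sum(semantic_base.get(w, 0.0) for w in words)


def tendency_label(text, manual_override=None, semantic_base=SEMANTIC_BASE):
    # Manual intervention takes precedence over the automatic result.
    if manual_override is not None:
        return manual_override
    score = tendency_score(text, semantic_base)
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"
```

Passing `manual_override` models the manual adjustment step: a reviewer's label replaces the computed one.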

The association analysis layer performs netizen-content association and then netizen-netizen association, according to the subject classification and hot topics. Netizen-content association does not mean associating a netizen with the articles he or she has published. Instead, it uses the output of the intelligent content analysis layer to associate the netizen with the current subject classifications, hot topics and speech tendencies, showing which subject classifications and hot topics the netizen is interested in during a period and what attitude he or she holds. The main method is probability statistics: the netizen's attention in each direction is analyzed statistically to determine the points of interest.
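
One simple reading of the probability-statistics step is a relative-frequency distribution of each netizen's posts over the labels produced by the content analysis layer. The sketch below assumes that reading; the function names and the `(netizen_id, label)` input shape are hypothetical.

```python
from collections import Counter, defaultdict


def interest_distribution(posts):
    # posts: iterable of (netizen_id, label), where label is a subject class
    # or hot topic assigned by the content analysis layer.
    counts = defaultdict(Counter)
    for netizen_id, label in posts:
        counts[netizen_id][label] += 1
    dist = {}
    for netizen_id, c in counts.items():
        total = sum(c.values())
        dist[netizen_id] = {label: n / total for label, n in c.items()}
    return dist


def main_interest(dist, netizen_id):
    # The label with the highest share of the netizen's posts.
    return max(dist[netizen_id], key=dist[netizen_id].get)
```

The same counting could be repeated per time window or per tendency label to show how attention shifts and which attitude dominates.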

Netizen-netizen association combines the structured data, the results of the intelligent content analysis layer and the results of netizen-content association. Using data association methods, it derives a network social structure made up of network communities, network groups and network cliques. From the forum's structured data, including site, board, netizen and time, we identify the netizens who are often active in a given category of a given board of a given site during a period, and define them as a network community. Within a network community, the netizens who often take part in the same class of sensitive topic at the same time are defined as a network group. Within a network group, the netizens who often take part in the same thread, that is, the same root post and its replies, are defined as a network clique.
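
The three nested levels can be sketched as the same grouping routine applied with progressively narrower keys. The field names, the key tuples and the `min_posts` threshold used to model "often" are assumptions made for illustration; the source does not define a frequency cutoff.

```python
from collections import defaultdict

# Hypothetical grouping keys, from widest to narrowest level.
COMMUNITY_KEY = ("site", "board", "category")
GROUP_KEY = ("site", "board", "category", "topic")
CLIQUE_KEY = ("site", "board", "category", "topic", "thread")


def group_netizens(posts, key_fields, min_posts=2):
    # posts: list of dicts holding a "netizen" field plus every field named
    # in key_fields. A netizen joins a group after at least min_posts posts
    # under the same key.
    counts = defaultdict(lambda: defaultdict(int))
    for p in posts:
        key = tuple(p[f] for f in key_fields)
        counts[key][p["netizen"]] += 1
    return {
        key: sorted(n for n, c in members.items() if c >= min_posts)
        for key, members in counts.items()
    }
```

Calling `group_netizens(posts, COMMUNITY_KEY)`, then with `GROUP_KEY` and `CLIQUE_KEY`, yields the community, group and clique memberships in turn. Restricting `posts` to a time window before the call would model "during a period".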

The interest analysis layer analyzes and predicts netizen interest according to the netizen-content association, the netizen-netizen association and the tendentiousness analysis. It comprises: a netizen model knowledge base module, which generalizes and summarizes past interest analyses of individual netizens and netizen groups into empirical models that serve as machine learning knowledge for later analysis; a netizen interest analysis module, which analyzes the interests of individual netizens and the points of interest of netizen groups on the basis of the netizen model knowledge base module; and a netizen interest development prediction module, which predicts the future interest development of individual netizens and netizen groups on the basis of the netizen model knowledge base module.

The netizen model knowledge base module generalizes and summarizes past interest analyses of netizens and groups into empirical models that serve as machine learning knowledge for later analysis. The knowledge base records the probability distribution of the interests of netizens and groups and how it develops and changes over time.

The netizen interest analysis module analyzes not only the interests of individual netizens but also the points of interest of netizen groups. It takes the results of the netizen-content association module and the netizen-netizen association module, combines them with the netizen model knowledge base, and weighs the past interest experience of netizens and groups to determine the netizen's current interest distribution.

The netizen interest development prediction module starts from the current focus of discussion of netizens and groups, uses the netizen model knowledge base to derive past development patterns and, after comparison, makes an appropriate prediction of the future interest development of netizens and groups. We use a Markov model over the probability distribution of points of interest at each time point; from the probability distribution of the current points of interest, the development of future points of interest can be predicted to a certain extent.
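
A first-order Markov chain is one way to realize this step: transition probabilities between points of interest are estimated from past observations, and the current distribution is multiplied by the transition matrix to obtain the next one. The sketch assumes that formulation; the function names and the fallback rule for states never observed as a starting point are illustrative choices, not details from the source.

```python
def estimate_transitions(history, states):
    # history: the dominant point of interest at successive time points,
    # for example as recorded in the knowledge base (assumption).
    counts = {s: {t: 0 for t in states} for s in states}
    for cur, nxt in zip(history, history[1:]):
        counts[cur][nxt] += 1
    matrix = {}
    for s in states:
        total = sum(counts[s].values())
        # A state with no observed transitions is assumed to stay put.
        matrix[s] = {
            t: (counts[s][t] / total if total else float(s == t)) for t in states
        }
    return matrix


def predict_next(distribution, matrix):
    # One Markov step: next[t] = sum over s of current[s] * P(s -> t).
    states = list(matrix)
    return {
        t: sum(distribution.get(s, 0.0) * matrix[s][t] for s in states)
        for t in states
    }
```

Applying `predict_next` repeatedly to its own output projects the distribution further ahead, with the usual caveat that uncertainty grows with each step.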

The present invention has substantive features and represents marked progress: (1) it analyzes netizen interest through in-depth mining of netizen-content association; (2) by analyzing and mining netizen groups, it identifies the role a netizen plays on the network and the effect he or she has, and thereby uncovers the netizen's motivation; (3) it uses a netizen model knowledge base to accumulate models from a large amount of netizen-related information and applies them to the analysis of current data, which helps to analyze netizen interest as a whole and to make appropriate predictions.

The data-mining-based system proposed by the present invention, built on netizen-content association, netizen-netizen association and the netizen model knowledge base, makes full use of network content information, netizen information and historical data. It effectively meets the need for in-depth mining in forum-based netizen interest analysis and is applicable to the implementation of a network public opinion analysis system.

Description of drawings

The accompanying drawing is a system architecture diagram of one embodiment of the system for analyzing and predicting netizen interest in forums.

Embodiment

Embodiments of the present invention are described in detail below with reference to the accompanying drawing.

The accompanying drawing shows the system architecture of one embodiment of the system for analyzing and predicting netizen interest in forums. As shown, the overall architecture is divided into four layers. The first is the data storage layer, which manages the loading and indexing of structured and unstructured data. The second is the intelligent content analysis layer, which applies data mining methods to article content for text classification, hot topic extraction and tracking, and tendentiousness analysis. The third is the association analysis layer, comprising the netizen-content association module and the netizen-netizen association module; the results of the former are the basis of analysis for the latter. The fourth and last layer is the interest analysis layer, comprising the netizen model knowledge base module, the netizen interest analysis module and the netizen interest development prediction module. In terms of call order, the netizen interest analysis module calls the netizen model knowledge base module, and these two modules are in turn the basis of the netizen interest development prediction module.

In the intelligent content analysis layer, text data is first input to the module. The content analysis module calls the Chinese word segmentation function to segment the Chinese text, and then performs feature selection, which involves two tasks: removing stop words, then computing TF-IDF values and selecting features. Feature selection differs between text classification and text clustering: for classification it is performed directly on the training documents, whereas for clustering every test document is treated as a separate category before features are selected. Two feature selection results are therefore obtained. After feature selection the processing splits into two parts, text classification and text clustering. In the classification part, the classification training function is called first and a classifier is obtained after training; text classification is then performed; finally, tendentiousness analysis is applied to the classification results to obtain the speech tendency of each category. In the clustering part, the text clustering function is called first and categories are produced automatically; the automatically produced categories are then extracted to form new hot topics, which are tracked; finally, tendentiousness analysis is applied to the hot topics to obtain the speech tendency of each one.
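
The stop-word removal and TF-IDF step can be sketched as below. The source does not give the exact TF-IDF variant or the selection rule, so the common form tf × ln(N / df) and a top-k cut on each word's best score are assumptions, as are the stop-word list and the whitespace tokenization.

```python
import math
from collections import Counter

STOP_WORDS = {"the", "a", "of", "is"}  # hypothetical stop-word list


def tokenize(text):
    # Whitespace split stands in for Chinese word segmentation (assumption).
    return [w for w in text.lower().split() if w not in STOP_WORDS]


def tfidf(docs):
    tokenized = [tokenize(d) for d in docs]
    n = len(tokenized)
    # Document frequency: in how many documents each word appears.
    df = Counter(w for toks in tokenized for w in set(toks))
    result = []
    for toks in tokenized:
        tf = Counter(toks)
        result.append(
            {w: (c / len(toks)) * math.log(n / df[w]) for w, c in tf.items()}
        )
    return result


def select_features(docs, top_k):
    # Keep the top_k words ranked by their best TF-IDF score in any document.
    scores = Counter()
    for vec in tfidf(docs):
        for w, s in vec.items():
            scores[w] = max(scores[w], s)
    ranked = sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
    return [w for w, _ in ranked[:top_k]]
```

Words that occur in many documents receive a low weight and tend to drop out of the selected feature set, which is the intended effect of the IDF term.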

The association analysis layer contains both the netizen-content association module and the netizen-netizen association module. The netizen-content association module has three parts: the first is association analysis of text classification results with site, board and netizen; the second is association analysis of hot topic analysis results with site, board and netizen; the third is association analysis of same-thread discussions with site, board and netizen. The netizen-netizen association module also has three parts, corresponding to the three above: the first assigns netizens of the same site, same board and same category to a network community; the second assigns netizens of the same site, same board and same topic to a network group; the third assigns netizens of the same site, same board and same thread to a network clique.

In the interest analysis layer, the network communities, network groups, network cliques, individual netizens and tendentiousness analysis results obtained above are combined, and statistical analysis yields the points of interest of netizens and netizen groups. On this basis, together with the netizen model knowledge base, the interest development of netizens and netizen groups is predicted. This covers interest analysis and development prediction for network communities, for network groups, for network cliques and for individual netizens.

As the implementation above shows, the data-mining-based system of the present invention, built on netizen-content association, netizen-netizen association and the netizen model knowledge base, effectively achieves in-depth mining in the analysis of forum netizen interest and provides reliable information for the analysis of netizens and groups in network public opinion analysis.

Claims

1. A system for analyzing and predicting netizen interest in a forum, characterized by comprising:

a data storage layer for storing structured data and unstructured data;

an intelligent content analysis layer for performing subject classification, hot topic extraction and tracking, and tendentiousness analysis on the data in the data storage layer;

an association analysis layer for performing netizen-content association and then netizen-netizen association according to the subject classification and the hot topics; and

an interest analysis layer for analyzing and predicting netizen interest according to the netizen-content association, the netizen-netizen association and the tendentiousness analysis.

2. The system for analyzing and predicting netizen interest in a forum according to claim 1, characterized in that the interest analysis layer comprises:

a netizen model knowledge base module for generalizing and summarizing past interest analyses of individual netizens and netizen groups into empirical models that serve as machine learning knowledge for later analysis;

a netizen interest analysis module for analyzing the interests of individual netizens and the points of interest of netizen groups on the basis of the netizen model knowledge base module; and

a netizen interest development prediction module for predicting the future interest development of individual netizens and netizen groups on the basis of the netizen model knowledge base module.

3. The system for analyzing and predicting netizen interest in a forum according to claim 1 or 2, characterized in that the netizen-content association comprises association analysis of text classification results with netizens, association analysis of hot topic analysis results with site, board and netizen, and association analysis of same-thread discussions with netizens.

4. The system for analyzing and predicting netizen interest in a forum according to claim 1 or 2, characterized in that the netizen-netizen association comprises association among netizens of the same site, same board and same category, association among netizens of the same site, same board and same topic, and association among netizens of the same site, same board and same thread.

5. The system for analyzing and predicting netizen interest in a forum according to claim 3, characterized in that the netizen-netizen association comprises association among netizens of the same site, same board and same category, association among netizens of the same site, same board and same topic, and association among netizens of the same site, same board and same thread.

6. The system for analyzing and predicting netizen interest in a forum according to claim 1 or 2, characterized in that the data storage layer builds indexes for the structured data and the unstructured data.

7. The system for analyzing and predicting netizen interest in a forum according to claim 2, characterized in that the netizen interest analysis module uses a Markov model over the probability distribution of points of interest at each time point and, from the probability distribution of the current points of interest, predicts the development of future points of interest.