CN109558587B

CN109558587B - Method for classifying public opinion tendency recognition aiming at category distribution imbalance

Info

Publication number: CN109558587B
Application number: CN201811325887.6A
Authority: CN
Inventors: 彭蓉; 王卓; 洪涛
Original assignee: Wuhan University WHU
Current assignee: Wuhan University WHU
Priority date: 2018-11-08
Filing date: 2018-11-08
Publication date: 2021-04-16
Anticipated expiration: 2038-11-08
Also published as: CN109558587A

Abstract

The invention discloses a public opinion tendency identification method for training samples with unbalanced class distribution. First, vocabulary related to the public opinion field of interest is collected as public opinion hot words to build a word library, and a comment data set is crawled from a public opinion information source and divided into a training set and a test set. Next, the public opinion tendency of the training set is labeled manually, and a bootstrap learning method is used to supplement under-represented categories. Features are then extracted from each class of training samples, a model is trained with an algorithm such as naive Bayes, a support vector machine, or a decision tree, the test set data is classified with the trained model, and the public opinion tendency is identified from the classification result. The bootstrap learning method, the feature vector construction method, and the classification model training method all apply time-sensitive weighting, so that the identified public opinion tendency better reflects current opinion. The method addresses the inaccurate classification caused by unbalanced training data and improves both the accuracy of public opinion tendency identification and the timeliness of public opinion analysis.

Description

Method for classifying public opinion tendency recognition aiming at category distribution imbalance

Technical Field

The invention belongs to the technical field of natural language processing and machine learning, relates to a method for performing public opinion tendency analysis by using a machine learning algorithm, and particularly relates to a public opinion tendency identification method aiming at unbalanced training sample class distribution.

Background

As internet use has grown rapidly, the volume of news updated online has become enormous, and so has its influence on public opinion. Public opinion tendency analysis emerged in this context. It aims to identify promptly the attitudes held by online commentators and the changes in those attitudes, helping supervisory departments detect shifts in public opinion in time and build a civil and harmonious public opinion environment.

When a general machine learning algorithm is used for public opinion tendency analysis, the recognized tendency can deviate greatly from the actual tendency because of unbalanced training data classes, the timeliness of text publication, the timeliness of public opinion itself, and similar problems. At present, no effective solution has been proposed.

Disclosure of Invention

To solve the above technical problems, the invention provides a public opinion tendency recognition method for unbalanced training sample class distribution. On the basis of a common machine learning algorithm, it introduces a semi-supervised training set expansion method and a feature weighting method that is sensitive to both time and public opinion high-frequency words, and it can improve public opinion tendency recognition accuracy when classes are unbalanced.

A public opinion tendency recognition method for unbalanced training sample class distribution, comprising the following steps:

Step 1: Collecting high-frequency words related to the public opinion field as public opinion hot words, creating a public opinion high-frequency word library, and updating it every day;

Step 2: Crawling a comment data set to be analyzed from a public opinion information source, and dividing the comment data set into a training set and a test set;

Step 3: Manually labeling the public opinion tendency in the training set, classifying training samples by tendency category, and counting the number of samples in each tendency category of the training set. If the category distribution is unbalanced, it is handled with a bootstrap learning method. Taking the sample size of the largest category as the standard, more comment data is crawled from the public opinion information source for every category with fewer samples; a semi-supervised similarity calculation method finds comment data similar to the feature text of that category and adds it to the category's training set, until all category training sets are the same size. The comment feature vectors used in the similarity calculation are extracted in the same way as in step 4.

Step 4: For all comments in the training set and the test set, taking the comment publisher as the unit, weighting the comment features with a time-sensitive weighting function and a public-opinion-hot-word-sensitive weighting function to form weighted feature vectors that reflect the timeliness of the comments;

Step 5: Training a model on the weighted feature vectors of each class of training samples with a machine learning algorithm such as naive Bayes, a support vector machine, a decision tree, or a multilayer perceptron classifier; then classifying the comment data in the test set with the trained model, and determining the public opinion tendency of the comment publishers from the classification.

Preferably, in step 1, the public opinion high-frequency word library records not only the high-frequency words but also the time at which each word appeared, its frequency, and how that frequency changes over time. The frequency of a public opinion high-frequency word is calculated from the number of relevant results returned for that word by the Baidu search engine at a specific point in time.

Preferably, in step 2, the training set and the test set are split by comment publisher: the comments published by one group of comment publishers form the training set, and the comments published by the remaining comment publishers form the test set. It is suggested to use the comments made by 90% of the commentators in the data set as the training set and those of the remaining 10% as the test set. This ratio can be adjusted dynamically as needed.
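The publisher-based split described above can be sketched as follows. This is a minimal illustration, not the patented implementation: the function name `split_by_publisher`, the `(publisher, text)` tuple layout, and the fixed random seed are assumptions introduced here.

```python
import random
from collections import defaultdict

def split_by_publisher(comments, train_ratio=0.9, seed=0):
    """Split (publisher, text) pairs so that every publisher's comments
    land entirely in the training set or entirely in the test set."""
    by_publisher = defaultdict(list)
    for publisher, text in comments:
        by_publisher[publisher].append(text)

    publishers = sorted(by_publisher)          # fixed order before shuffling
    random.Random(seed).shuffle(publishers)    # seeded for reproducibility

    n_train = round(len(publishers) * train_ratio)
    train_publishers = set(publishers[:n_train])

    train = [(p, t) for p, t in comments if p in train_publishers]
    test = [(p, t) for p, t in comments if p not in train_publishers]
    return train, test

# Ten hypothetical publishers with three comments each.
comments = [(f"user{i}", f"comment {j} by user{i}") for i in range(10) for j in range(3)]
train, test = split_by_publisher(comments)
```

Splitting on publishers rather than on individual comments keeps all comments of one account on the same side of the split, which matches the later steps that judge tendency per publisher.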

Preferably, in step 3, if the class distribution is unbalanced, it is handled with a bootstrap learning method. Unbalanced class distribution means that the sample counts of different categories differ by more than K%. The value of K depends on the true class proportions of the classification problem at hand. In general, the smaller K is, the better the trained classification model performs; the larger K is, the more the trained model tends to assign data to the category with the most samples. The sensitivity of the classification model to K can therefore be determined by analysis.

Taking the sample size of the largest category as the standard, more comment data is crawled from the public opinion information source for each category with fewer samples, and a semi-supervised similarity calculation method is used to find comments whose similarity to the category's samples exceeds a threshold T and add them to that category's samples. Taking the VSM-based similarity calculation method as an example, the similarity Sim(o1, o2) is shown in formula (1):
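As an illustration of the K% criterion, the sketch below flags an unbalanced training set and reports how many samples each category would need to reach the size of the largest one. The text does not say what the percentage difference is measured against; this sketch assumes it is the gap between the largest and smallest categories relative to the largest. The names `is_imbalanced` and `shortfall` are hypothetical.

```python
from collections import Counter

def is_imbalanced(labels, k_percent):
    """Assumed criterion: (largest - smallest) / largest * 100 > K."""
    counts = Counter(labels)
    largest = max(counts.values())
    smallest = min(counts.values())
    gap_percent = (largest - smallest) / largest * 100
    return gap_percent > k_percent

def shortfall(labels):
    """Samples each category still needs to match the largest category."""
    counts = Counter(labels)
    largest = max(counts.values())
    return {label: largest - n for label, n in counts.items()}

# Illustrative label counts only.
labels = ["positive"] * 80 + ["negative"] * 50 + ["neutral"] * 20
needs = shortfall(labels)
```

With these illustrative counts the gap is 75%, so the set is flagged as unbalanced for any K below 75.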

where o1 is the feature vector of a class sample of the training set, o2 is the feature vector of the text of a newly crawled comment data set from a public opinion source, o1_i is the i-th feature of the feature text of a training class, o2_i is the i-th feature of the text of the comment data set, and x is the total dimension of the vectorized feature vectors o1 and o2. The feature vector is constructed in the same way as in step 4.
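The image of formula (1) is not reproduced in this text. Assuming it is the standard cosine similarity of the vector space model (VSM), which is consistent with the symbols o1_i, o2_i, and x defined above, it would read as follows. This is a reconstruction under that assumption, not the patent's own rendering.

```latex
\mathrm{Sim}(o1, o2) =
  \frac{\sum_{i=1}^{x} o1_i \, o2_i}
       {\sqrt{\sum_{i=1}^{x} o1_i^{2}} \; \sqrt{\sum_{i=1}^{x} o2_i^{2}}}
  \tag{1}
```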

Preferably, in step 4, a time-sensitive weighting function is used to weight the feature vectors extracted from the comments, reflecting the idea that the more recent a comment is, the better it reflects the commentator's current public opinion tendency and the higher its feature weight should be. For example, the comment weight TimeWeight(c) can be calculated as shown in formula (2):

wherein c is a comment, Tn is the current date, Tc is the date on which the comment c was published, and the unit of Tn-Tc is day. The feature words appearing in the same comment are given the same feature weight according to equation (2).

Preferably, in step 4, a weighting function sensitive to public opinion hot words is used to weight the feature vectors extracted from the comments, so that comments more relevant to current public opinion hot spots, which better reflect the commentator's current public opinion tendency, receive a higher feature weight. For example, the public opinion high-frequency word weight HotWordWeight(c) can be calculated as shown in formula (3):

wherein D is the current date, Dc is the date of adding the hot word c into the public opinion high-frequency word bank, Wt (c) is the current search result number of the hot word c, and Wb (c) is the search result number of the hot word c when the hot word c is added into the public opinion high-frequency word bank. When c is not a high frequency word, HotWordWeight (c) is 0.

Preferably, in step 4, the comment features are weighted with both the time-sensitive weighting function and the public-opinion-hot-word-sensitive weighting function to form a weighted feature vector.

For example, the weighted TF-IDF value WeightTFIDF(c) of the comment feature word c may be calculated using the method shown in formula (4).

WeightTFIDF(c) = (HotWordWeight(c) + TimeWeight(S_c)) × TFIDF(c) (4)

where HotWordWeight(c) is the public opinion high-frequency word weight of word c, TimeWeight(S_c) is the time weight of the comment sentence S_c in which word c appears, and TFIDF(c) is the TF-IDF value of word c.
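Formula (4) can be worked through numerically as below. The sketch takes the hot-word weight, time weight, and TF-IDF value as already-computed inputs, since the exact forms of formulas (2) and (3) are not reproduced in this text; the function name `weight_tfidf` and all numbers are illustrative assumptions.

```python
def weight_tfidf(hot_word_weight, time_weight, tfidf):
    """Formula (4): (HotWordWeight(c) + TimeWeight(S_c)) * TFIDF(c)."""
    return (hot_word_weight + time_weight) * tfidf

# Illustrative values only; they are not taken from the patent.
hot = weight_tfidf(hot_word_weight=0.5, time_weight=0.25, tfidf=2.0)

# For a word that is not in the hot-word library, HotWordWeight(c) is 0,
# so the result reduces to TimeWeight(S_c) * TFIDF(c).
plain = weight_tfidf(hot_word_weight=0.0, time_weight=0.25, tfidf=2.0)
```

Because the two weights are added rather than multiplied, a word outside the hot-word library still keeps a non-zero weighted value through its time weight.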

The TF-IDF algorithm is formulated as follows:

In formula (5), tf(c) is the word frequency of the word c in the current text, N is the total number of texts in the corpus, and N(c) is the total number of texts in the corpus that contain the word c.
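The image of formula (5) is not reproduced in this text. A minimal sketch, assuming the common form TFIDF(c) = tf(c) × log(N / N(c)) with tf(c) taken as a raw count, is shown below; the patent's formula may differ in smoothing or normalization. The function name `tfidf` and the toy corpus are hypothetical.

```python
import math
from collections import Counter

def tfidf(word, text_tokens, corpus):
    """TF-IDF of `word` in one tokenized text, relative to a tokenized corpus.

    Assumed form: tf(c) * log(N / N(c)).
    """
    tf = Counter(text_tokens)[word]                          # tf(c)
    n_texts = len(corpus)                                    # N
    n_containing = sum(1 for doc in corpus if word in doc)   # N(c)
    if tf == 0 or n_containing == 0:
        return 0.0
    return tf * math.log(n_texts / n_containing)

# Toy corpus of four already-segmented texts.
corpus = [
    ["match", "win", "win"],
    ["match", "lose"],
    ["news", "today"],
    ["match", "draw"],
]
score_win = tfidf("win", corpus[0], corpus)      # frequent here, rare elsewhere
score_match = tfidf("match", corpus[0], corpus)  # appears in most texts
```

A word that is frequent in the current text but rare across the corpus ("win") scores higher than one that appears in most texts ("match").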

Then, all feature words of the class sample are sorted in descending order of weighted TF-IDF value, and the first L words with the highest degree of correlation with the class are selected as the feature text vector of the class sample. The value of L must be chosen according to the required classification accuracy and recall of the classification model and the acceptable time complexity. In general, the larger L is, the higher the time complexity of the classification model; as L increases, classification accuracy and recall rise gradually, reach a peak, and then decline. The optimal value of L can therefore be determined through repeated iterative analysis.
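Selecting the first L feature words can be sketched as a sort followed by a slice. The function name `top_l_features` and the scores are illustrative assumptions.

```python
def top_l_features(weighted_scores, top_l):
    """Return the top_l words with the highest weighted TF-IDF values."""
    ranked = sorted(weighted_scores.items(), key=lambda item: item[1], reverse=True)
    return [word for word, _ in ranked[:top_l]]

# Hypothetical weighted TF-IDF values for one category.
scores = {"earthquake": 3.2, "rescue": 2.7, "the": 0.1, "donation": 1.9}
features = top_l_features(scores, top_l=3)
```

Tuning L then amounts to repeating training and evaluation for several values of `top_l` and keeping the one with the best accuracy and recall at an acceptable cost.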

Compared with the prior art, the invention has the following beneficial technical effects: the invention supplements unbalanced training set data by introducing a semi-supervised training set extension method on the basis of the original machine learning classification algorithm so as to solve the problem of inaccurate classification caused by unbalanced training data. Meanwhile, a public opinion timeliness concept and a public opinion high-frequency word library are introduced to improve real-time hot public opinion classification precision. In addition, the method and the device adopt time-sensitive weighting for all comments of a single user, and can better identify the current tendency of the user.

Drawings

FIG. 1 is a block flow diagram of an embodiment of the present invention.

Detailed Description

To help those of ordinary skill in the art understand and implement the present invention, it is described in further detail below with reference to the accompanying drawings and examples. It is to be understood that the embodiments described herein merely illustrate and explain the present invention and do not limit it.

Referring to fig. 1, the present invention provides a method for identifying a public opinion tendency of unbalanced distribution of training sample classes, which includes the following steps:

Step 1: Tracking and labeling current public opinion hotspots by manual collection, selecting high-frequency words related to the public opinion field of interest as public opinion hot words, creating a public opinion high-frequency word library, and updating it every day;

In this embodiment, public opinion hot words may be taken from sources such as the Weibo hot search list or the front-page headlines of major portal websites, for example: "Jin Yong passes away" and "twelve great curtains of Chinese women".

In this embodiment, the public opinion high-frequency word library is stored as a text file, which makes it easy to add public opinion hot words manually.

In this embodiment, each entry in the public opinion high-frequency word library contains the word, the time at which the word was added to the library, and the number of results returned by the Baidu search engine for the word at the time it was added.
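One possible shape for such an entry is sketched below. The text only says the library is a text file holding the word, the date it was added, and the search result count at that time; the tab-separated layout, the ISO date format, and the names `HotWordEntry` and `parse_entry` are assumptions made for illustration.

```python
from dataclasses import dataclass
from datetime import date

@dataclass
class HotWordEntry:
    word: str               # the public opinion hot word
    added_on: date          # date the word was added to the library
    baseline_results: int   # search result count when the word was added

def parse_entry(line):
    """Parse one assumed 'word<TAB>YYYY-MM-DD<TAB>count' line."""
    word, added_on, results = line.rstrip("\n").split("\t")
    return HotWordEntry(word, date.fromisoformat(added_on), int(results))

# Illustrative line; the values are not taken from the patent.
entry = parse_entry("example hot word\t2018-11-01\t1200000\n")
```

Keeping one entry per line in a plain text file fits the stated goal of letting an operator add hot words by hand.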

The final tendency judgment result is influenced by public opinion timeliness. If the time periods of the comments contained in the training set with the labels and the comments contained in the test set are different, the model training effect may be poor due to the fact that the public opinion hot words are different, and the tendency judgment result is affected.

In this embodiment, posts or comments crawled from social network accounts, such as those on Weibo and Twitter, are used as the comment data set, and the crawled data is organized by posting account.

Step 3: The sample counts of the different categories in the training set are counted. If the category distribution is unbalanced, it is handled with a bootstrap learning method. Unbalanced class distribution means that the sample counts of different categories differ by more than K%. The value of K depends on the true class proportions of the classification problem at hand. In general, the smaller K is, the better the trained classification model performs; the larger K is, the more the trained model tends to assign data to the category with the most samples. The sensitivity of the classification model to K can therefore be determined by analysis.

For the category with a small sample size, more comment data need to be crawled from a public opinion information source, data similar to the characteristic text of the category is searched in a comment data set by utilizing a similarity calculation algorithm, and the comment data set is supplemented into the category training set until the problem of unbalanced category distribution is solved.

In this embodiment, for the category with a small sample size, a semi-supervised training set extension method based on VSM is adopted to calculate the similarity between the category feature sample and a newly crawled text from a public opinion source, and the similarity calculation formula is shown in formula (1).

o1 is obtained by selecting the feature text of a given category of the training set with the TF-IDF algorithm: after the category's text is segmented into words, the TF-IDF values of all words are calculated, and the first L words with the highest degree of correlation with the category are selected as the feature text vector of the category text. The value of L must be chosen according to the required classification accuracy and recall of the classification model and the acceptable time complexity. In general, the larger L is, the higher the time complexity of the classification model; as L increases, classification accuracy and recall rise gradually, reach a peak, and then decline. The optimal value of L can therefore be determined through repeated iterative analysis.

When the similarity Sim (o1, o2) >0.7, the present invention considers that the two texts o1 and o2 are similar.
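The threshold-based supplementation can be sketched as follows, assuming that formula (1) is the standard VSM cosine similarity and that newly crawled texts have already been converted to feature vectors of the same dimension as the category vector. The names `cosine_similarity` and `supplement_class` and the toy vectors are hypothetical.

```python
import math

def cosine_similarity(o1, o2):
    """Cosine similarity of two equal-length feature vectors (assumed formula (1))."""
    dot = sum(a * b for a, b in zip(o1, o2))
    norm1 = math.sqrt(sum(a * a for a in o1))
    norm2 = math.sqrt(sum(b * b for b in o2))
    if norm1 == 0 or norm2 == 0:
        return 0.0
    return dot / (norm1 * norm2)

def supplement_class(class_vector, candidates, needed, threshold=0.7):
    """Accept up to `needed` crawled texts whose similarity exceeds the threshold."""
    accepted = []
    for text, vector in candidates:
        if len(accepted) >= needed:
            break
        if cosine_similarity(class_vector, vector) > threshold:
            accepted.append(text)
    return accepted

class_vector = [1.0, 1.0, 0.0]
candidates = [
    ("close match", [1.0, 1.0, 0.0]),    # same direction, similarity about 1.0
    ("unrelated", [0.0, 0.0, 1.0]),      # orthogonal, similarity 0.0
    ("partial match", [1.0, 0.0, 0.0]),  # similarity about 0.707
]
added = supplement_class(class_vector, candidates, needed=2)
```

The `needed` argument stands for the gap between the category's current size and the size of the largest category, so supplementation stops once the category has caught up.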

In this embodiment, the feature vector is constructed with a method based on the TF-IDF algorithm. Specifically:

First, the TF-IDF value TFIDF(c) of each word in each category sample of the training set is calculated in turn after word segmentation.

The TF-IDF algorithm is formulated as follows:

Secondly, the weighted TF-IDF value WeightTFIDF (c) is calculated for each comment characteristic word c according to the formula (4) by adopting a time-sensitive weighting function and a public opinion hot word-sensitive weighting function.

Thirdly, after weighted TF-IDF values of all words in a certain category sample are calculated, the values are arranged in a descending order, and the first L words are selected as feature vectors of the category text.

Step 4: For all comments of a single user, the comments are weighted with a time-sensitive weighting function to reflect their timeliness;

When the tendency of a single user is calculated, the weights TimeWeight(c) are accumulated directly, so that the influence of earlier comments is reduced and that of recent comments is increased, and the user's current tendency is judged as time passes.

Step 5: The comment data is classified with a machine learning algorithm such as naive Bayes, a support vector machine, a decision tree, or a multilayer perceptron classifier.

In this embodiment, the tendency of each comment by the current user is classified using a machine learning algorithm such as a naive Bayes classifier. For a binary classification problem, the positive and negative tendency values can be 1 and -1; for a three-class problem, the neutral, positive, and negative tendency values can be 0, 1, and -1. The user tendency value Sum(A) is as follows:

In formula (6), N is the total number of comments made by the current user A, tend(c_i) is the tendency value of the i-th comment posted by the user, and TimeWeight(c_i) is the weight of the i-th comment, calculated as shown in formula (2).
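The image of formula (6) is not reproduced in this text. Based on the description that the time weights are accumulated directly, the sketch below assumes Sum(A) is the sum over the user's N comments of tend(c_i) × TimeWeight(c_i). The time weights are passed in as precomputed values, and the function name and numbers are illustrative.

```python
def user_tendency_sum(comments):
    """Assumed formula (6): sum of tend(c_i) * TimeWeight(c_i) over one user's comments.

    `comments` is a list of (tend, time_weight) pairs, where tend is the
    classifier's tendency value (for example 1, 0, or -1).
    """
    return sum(tend * weight for tend, weight in comments)

# Oldest comment first; recent comments carry larger illustrative weights.
user_comments = [(1, 0.25), (-1, 0.5), (1, 1.0)]
total = user_tendency_sum(user_comments)
```

With these values the most recent positive comment dominates, so the total stays positive even though one earlier comment was negative.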

The user's tendencies Tendency (A) are as follows:

In the present embodiment, t is 5.

In the method, values of various parameters such as a category distribution imbalance determination index K, a characteristic dimension L, a similarity threshold T, a tendency determination threshold Tt and the like need to be optimized through tests so as to obtain a better public opinion tendency identification effect.

The method introduces a semi-supervised training set extension method on the basis of the traditional machine learning algorithm, and solves the problem of inaccurate classification caused by unbalanced training data to a certain extent. Meanwhile, the concepts of public opinion hot words and public opinion high-frequency word libraries are added, and the public opinion tendency recognition efficiency aiming at specific public opinions or major events is improved by introducing the public opinion hot word sensitive weighting function, and the classification accuracy under the conditions is also improved; the introduction of a time sensitive weighting function can reflect changes in the user's tendency over time.

It should be understood that parts of the specification not set forth in detail are well within the prior art.

It should be understood that the above description of the preferred embodiments is given for clarity and not for any purpose of limitation, and that various changes, substitutions and alterations can be made herein without departing from the spirit and scope of the invention as defined by the appended claims.

Claims

1. A classification method for public opinion tendency identification aiming at category distribution imbalance is characterized by comprising the following steps:

the basis for segmenting the training set and the test set is a comment publisher, namely, comments published by one part of comment publishers are used as the training set, and comments published by the other part of comment publishers are used as the test set;

step 3: manually labeling the public opinion tendency in the training set, classifying training samples by tendency category, and counting the number of samples in each tendency category of the training set; if the category distribution is unbalanced, processing it with a similarity calculation method; taking the sample size of the largest category as the standard, crawling more comment data from the public opinion information source for every category with fewer samples, searching for comment data similar to the feature text of that category, and adding it to the category's training set until all category training sets are the same size;

wherein the unbalanced class distribution means that the difference of the number of samples of different classes exceeds K%; the determination of the value K is related to the true category proportion of the current classification problem; taking the sample size owned by the category with the largest sample size as a standard, crawling more comment data in a public opinion information source by using a crawler for the category with a small sample size again, searching comments with similarity exceeding a threshold value T with the category sample by using a semi-supervised similarity calculation method, and supplementing the comments into the category sample; the similarity Sim (o1, o2) is shown in formula (1):

wherein o1 is the feature vector of a class sample of the training set, o2 is the feature vector of the text of a newly crawled comment data set from a public opinion source, o1_i is the i-th feature of the feature text of a training class, o2_i is the i-th feature of the text of the comment data set, and x is the total dimension of the vectorized feature vectors o1 and o2;

step 5: training a model on the weighted feature vectors of each class of training samples with a machine learning algorithm; then classifying the comment data in the test set with the trained model, and determining the public opinion tendency of the comment publishers from the classification.

2. The method as claimed in claim 1, wherein the classification method comprises the following steps: in the step 1, a public opinion high-frequency word library not only records high-frequency words, but also records the occurrence time, frequency and the change condition of the frequency along with the time of the high-frequency words; the frequency of the public opinion high frequency words is calculated according to the number of relevant results searched in a search engine.

3. The method as claimed in claim 1, wherein the classification method comprises the following steps: in step 4, a time-sensitive weighting function is adopted to weight the feature vectors extracted from the comments, and the comment weight TimeWeight (Sc) calculation formula is as follows:

wherein Sc is a certain comment, Tn is the current date, Tc is the date of publication of the comment c, and the unit of Tn-Tc is day; the feature words appearing in the same comment are given the same feature weight according to equation (2).

4. The method as claimed in claim 1, wherein the classification method comprises the following steps: in step 4, weighting the feature vectors extracted from the comments by using a weighting function sensitive to the public opinion hot words, wherein the public opinion high-frequency word weight HotWordWeight (c) has the calculation formula:

wherein D is the current date, Dc is the date of adding the hot word c into the public opinion high-frequency word bank, Wt (c) is the current search result number of the hot word c, and Wb (c) is the search result number of the hot word c when the hot word c is added into the public opinion high-frequency word bank; when c is not a high frequency word, HotWordWeight (c) is 0.

5. The method as claimed in claim 1, wherein the classification method comprises the following steps: in step 4, calculating a weighted TF-IDF value WeightTFIDF(c) of the comment feature word c by adopting formula (4);

WeightTFIDF(c) = (HotWordWeight(c) + TimeWeight(S_c)) × TFIDF(c) (4)

wherein HotWordWeight(c) is the public opinion high-frequency word weight of word c, TimeWeight(S_c) is the time weight of the comment sentence S_c in which word c appears, and TFIDF(c) is the TF-IDF value of word c;

the TF-IDF algorithm is formulated as follows:

in formula (5), tf (c) refers to the word frequency of the word c in the current text; n represents the total number of texts in the corpus, and N (c) represents the total number of texts containing the word c in the corpus;

and then, arranging all the characteristic words of the class sample in a descending order according to the weighted TF-IDF value, and selecting the first L words with the highest correlation degree with the class as the characteristic text vector of the class sample.