CN111046650A

CN111046650A - Network public opinion automatic identification technology based on element co-occurrence

Info

Publication number: CN111046650A
Application number: CN201911248914.9A
Authority: CN
Inventors: 程南昌; 宋康; 邹煜; 滕永林; 杨柳
Original assignee: Communication University of China
Current assignee: Communication University of China
Priority date: 2019-12-09
Filing date: 2019-12-09
Publication date: 2020-04-21

Abstract

The invention discloses an automatic network public opinion identification technology based on element co-occurrence, comprising an implementation method and a weighting algorithm. The implementation method comprises the following steps: S101: collecting 9,436 corpus texts, denoted X, totaling 12.5 million characters, of which 1,836 texts related to public opinion are denoted Y (more than 2.5 million characters) and 7,600 texts not related to public opinion are denoted Z (about 10 million characters); S102: segmenting the corpus into words with the automatic word segmentation system CUCBst and compiling word frequency statistics; S103: dividing the words in X, Y and Z into five levels by frequency; S104: comparing, band by band, the words in Z with the words in the same frequency band in X. The weighting algorithm comprises the following steps: S201: first calculating the weights of the feature words; S202: then, on the basis of the co-occurrence of the three types of feature words, combining the positions at which the feature words occur and the length of the text, and performing a weighted calculation over the four factors to obtain a text score.

Description

Network public opinion automatic identification technology based on element co-occurrence

Technical Field

The invention relates to the technical field of public opinion monitoring, in particular to an automatic network public opinion identification technology based on element co-occurrence.

Background

Research related to public opinion detection has focused mainly on topic detection, for which a dedicated international evaluation series, Topic Detection and Tracking, was held. In Topic Detection and Tracking, a topic is a set of stories consisting of "a seminal event or activity together with the events or activities directly related to it". The task of topic detection is to detect and organize topics not known to the system in advance. The technique relies mainly on statistical clustering algorithms such as K-Means, centroid clustering and hierarchical clustering. Because clustering is computationally expensive, systems that detect public-opinion-related topics directly by clustering are rare when the input is a massive volume of network documents.

Although the Topic Detection and Tracking evaluations ended in 2004, related research has continued. In recent years, the literature has proposed new event detection methods based on topic segmentation and on lemma re-evaluation. New event detection can be used to detect the first report of an emergency such as 9/11 and is therefore related to public opinion detection. One study adds sub-topic information about the topic under test to the judgment of whether an event is new; for example, a topic with more sub-topics is less likely to be a new event than a topic with fewer sub-topics. Another finds that lemmas with different parts of speech differ in sensitivity across news categories, so lemma weights must be re-evaluated according to the specific news category during calculation. The corpora used in the Topic Detection and Tracking evaluations and in these studies are classified in detail by topic, whereas real network documents carry no such category or sub-topic information. Another study uses keyword-based search to find emergencies in Sina blogs and constrains the search results by limiting the time period and the domain name, which reduces redundancy; this is similar to the keyword ranking method mentioned earlier. A further study identifies hot sentences through hot words and then clusters the hot sentences to identify hot topics. Hot topics are highly likely to be public opinion and are relevant to the present work. Although this approach reduces clustering from the document level to the sentence level, recognizing hot words and hot sentences still consumes a large amount of computation.

In summary, the shortcomings of current public opinion detection come down to three points:

(1) domain specificity is weak; existing systems are basically oriented toward society and politics as a whole;

(2) methods based on batch keywords or public opinion dictionaries predominate, and their shortcomings were mentioned in the introduction;

(3) statistical clustering and other new methods mostly remain at the theoretical level and are uncommon in actual public opinion detection.

Disclosure of Invention

The invention aims to provide an automatic network public opinion identification technology based on element co-occurrence. Starting from the essence of public opinion, the three main elements that constitute it (subject, object and emotional tendency) are each represented by one type of feature word, and the three types of feature words are dynamically combined according to combination and aggregation relations, so that topics related to public opinion in a given field can be generated and public opinion information in that field can be effectively identified. The method has been applied in practice in a language and character public opinion monitoring system and a higher education public opinion monitoring system, reaching accuracies of 92% and 93% respectively, and thereby solves the problems raised in the background section.

In order to achieve the purpose, the invention provides the following technical scheme:

The network public opinion automatic identification technology based on element co-occurrence comprises an implementation method and a weighting algorithm, wherein the implementation method comprises the following steps:

S101: collecting 9,436 corpus texts, denoted X, totaling 12.5 million characters, of which 1,836 texts related to public opinion are denoted Y (more than 2.5 million characters) and 7,600 texts not related to public opinion are denoted Z (about 10 million characters);

S102: segmenting the corpus into words with the automatic word segmentation system CUCBst and compiling word frequency statistics;

S103: dividing the words in X, Y and Z into five levels by frequency;

S104: comparing, band by band, the words in Z with the words in the same frequency band in X, in order to extract the feature words of language and character public opinion;

the weighting algorithm comprises the following steps:

S201: first calculating the weights of the feature words;

S202: then, on the basis of the co-occurrence of the three types of feature words, combining the positions at which the feature words occur and the length of the text, performing a weighted calculation over the four factors to obtain a text score, and judging that the text belongs to language and character public opinion when the score reaches a certain threshold.

Further, the five levels in S103 are: level 1 (1000 or more), level 2 (500 to 999), level 3 (100 to 499), level 4 (5 to 99) and level 5 (1 to 4).

Further, the weighting algorithm further comprises calculating feature word weights.

Further, the factors considered by the weighting algorithm include feature word weights, co-occurrence conditions among the three types of feature words, feature word positions and text lengths.

Furthermore, the quality of the feature word set determines the accuracy and recall rate of public opinion information detection, and in order to ensure the quality of the feature word set, three types of feature words which are automatically extracted need to be manually confirmed item by item.

Compared with the prior art, the invention has the following beneficial effects: starting from the essence of public opinion, the three main elements that constitute it (subject, object and emotional tendency) are each represented by one type of feature word, and the three types of feature words are dynamically combined according to combination and aggregation relations, so that topics related to public opinion in a given field can be generated and public opinion information in that field can be effectively identified. The method has been applied in practice in a language and character public opinion monitoring system and a higher education public opinion monitoring system, with accuracies of 92% and 93% respectively.

Drawings

FIG. 1 is a diagram of the three types of feature words of language and character public opinion and their relationships according to the present invention;

FIG. 2 is a flow chart of the extraction of the three types of feature words according to the present invention.

Detailed Description

The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.

Public opinion, as defined in the literature, is the sum of the emotions, wills, attitudes and opinions that the public, made up of individuals and various social groups, holds toward public affairs that concern it or are closely tied to its interests, within a given historical period and social space. Public opinion is thus composed of three basic elements: the subject (the public), the object (public affairs of all kinds) and the emotional tendency (the interwoven sum of emotions, wills, attitudes and opinions). The element co-occurrence method starts from this essence: each of the three elements is represented by one type of feature word, and the three types of feature words can be dynamically combined and collocated to generate topics related to public opinion in a given field. In the field of language and characters, for example, public opinion events such as the "simplified versus traditional characters dispute", "defending dialects" and the "letter-word controversy" may arise. Once the three elements are represented by feature words, their relationship is as shown in FIG. 1.

As shown in FIG. 1, language and character public opinion can be expressed as follows: subjects such as experts and teachers hold opinions or attitudes of opposition, rejection or approval toward objects such as Mandarin, traditional Chinese characters, dialects and letter words. The feature words for the three elements "subject", "object" and "emotional tendency" are either extracted automatically from existing corpora or summarized from experience. Each of the three types plays its own role, and the types can be dynamically combined; the resulting expressive range can cover all public opinion information that may appear in the field of language and characters while excluding most non-public-opinion information. The theoretical basis is Saussure's theory of combination (syntagmatic) and aggregation (paradigmatic) relations.

The Swiss linguist Saussure held that, in a language state, everything is based on relations. At the core are the syntagmatic and associative relations, that is, the combination relation and the aggregation relation. The combination relation is the horizontal, linear relation among the language units that occur together in speech; the aggregation relation is the vertical relation among units that can occur in the same position and perform the same function within the language system. Following Saussure's theory, the three types of feature words can be dynamically combined and collocated to generate different topics. The combination relation yields, for example: teachers promote Mandarin, experts reject dialects, the media misuse letter words, the public praises traditional characters, and so on. The aggregation relation yields: teachers promote Mandarin, experts promote Mandarin, the media promote Mandarin, the public promotes Mandarin, and so on. The element co-occurrence method thus simulates the domain knowledge lexicon in the human brain (aggregation) together with the process of understanding objective things and generating expressions about them (combination); it has strong topic generation capacity and can effectively identify every topic it is able to generate. If the three feature word sets can be established for the public opinion features of a given field, public opinion in that field can be effectively identified. This topic generation capacity is latent: when feature words co-occur within a language segment, unrelated words are automatically ignored, and the dynamically generated topic consists of the matched feature words. For example, from the segment "some post-90s students like traditional Chinese characters", the topic "students like traditional Chinese characters" is successfully identified by ignoring the other words.
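The matching step in this example can be sketched as follows. This is a minimal illustration that assumes the segment is already tokenized; the word lists `SUBJECT_WORDS`, `EMOTION_WORDS` and `BACKGROUND_WORDS` and the function `match_topic` are hypothetical names with made-up entries, not part of the patent.

```python
# Hypothetical feature word lists; real sets would be extracted from corpora.
SUBJECT_WORDS = {"traditional Chinese characters", "Mandarin", "dialect"}  # object
EMOTION_WORDS = {"like", "reject", "promote"}                              # emotional tendency
BACKGROUND_WORDS = {"students", "teacher", "expert", "media"}              # person / background


def match_topic(tokens):
    """Keep only the tokens that are feature words, preserving their order."""
    feature_words = SUBJECT_WORDS | EMOTION_WORDS | BACKGROUND_WORDS
    return [t for t in tokens if t in feature_words]


segment = ["some", "post-90s", "students", "like", "traditional Chinese characters"]
topic = match_topic(segment)
print(" ".join(topic))  # students like traditional Chinese characters
```

Words outside the three lists simply drop out, which is the sense in which unrelated words are ignored.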

From the perspective of public opinion detection, the feature words corresponding to the object are the most important: only when words related to language and characters appear in a text is it meaningful to ask whether the text belongs to language and character public opinion. These words are therefore called "subject words", in the sense of subject matter. Second in importance are the feature words carrying emotional tendency, called "emotion words". Third are the feature words corresponding to the subject, which is generally a person, such as students, parents and teachers. In addition, public opinion requires a spatio-temporal background, and the corresponding feature words, such as classroom and school, play a role in detection; some can even stand in for the subject, as in "schools promote Mandarin", which brings them close to the feature words of the subject. The feature words related to the subject are therefore merged into a single "person and background" category, called "background words". Among the three types, no single type appearing alone can directly constitute public opinion; two or more types must co-occur before a topic can be public opinion. For this reason the method is called the "element co-occurrence" method.

The element co-occurrence method detects public opinion by constructing a discourse knowledge system for public opinion in a given field. It attends to the combination of the three basic elements rather than to a single point, which gives it a wide expressive range and makes it essentially different from conventional detection methods based on batch keywords or public opinion dictionaries. The batch keyword or public opinion dictionary approach is one-dimensional: it searches for a point such as "relocation incident", "relocation bloodshed case" or "violent terrorist incident". The element co-occurrence method is three-dimensional: the three types of feature words combine and co-occur to form different topics. Batch keyword and public opinion dictionary methods also consider co-occurrence, but there the co-occurrence is bound to specific words, whereas in the element co-occurrence method all elements can be dynamically combined and collocated, which gives the method strong topic generation capacity. This dynamic combination and collocation also gives the public opinion monitoring system an early warning function, namely the ability to discover previously unknown topics in real time.

The network public opinion automatic identification technology based on element co-occurrence comprises an implementation method and a weighting algorithm. The implementation method first extracts the three types of feature words: the prerequisite for applying the element co-occurrence method is to find the feature word sets corresponding to the three elements of public opinion. Feature words can be summarized manually or obtained by automatic search.

The implementation method comprises the following steps:

S101: 9,436 corpus texts are collected (this collection is hereinafter called X), totaling 12.5 million characters, of which 1,836 texts are related to public opinion (hereinafter called Y), with more than 2.5 million characters, and 7,600 texts are not related to public opinion (hereinafter called Z), with about 10 million characters;

S102: the corpus is segmented into words with the automatic word segmentation system CUCBst, and word frequency statistics are compiled;

S103: the words in X, Y and Z are divided into five levels by frequency: level 1 (1000 or more), level 2 (500 to 999), level 3 (100 to 499), level 4 (5 to 99) and level 5 (1 to 4);

S104: the words in Z are compared, band by band, with the words in the same frequency band in X, in order to extract the feature words of language and character public opinion. Take the word "language" as an example: it occurs 7,161 times in X, which makes it a level 1 word, but only 62 times in Z, which makes it a level 4 word. If the comparison were made without dividing by frequency band, words with language and character public opinion features could not be extracted.
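Steps S103 and S104 can be sketched as below. The level boundaries are those given in S103; the comparison rule is an assumption, since the text does not spell it out: here a word is a candidate if it sits in a given band of X but not in the same band of Z. The function names and the control word "the" with its counts are hypothetical.

```python
def frequency_level(count):
    """Map a raw word frequency to the five levels given in S103."""
    if count >= 1000:
        return 1
    if count >= 500:
        return 2
    if count >= 100:
        return 3
    if count >= 5:
        return 4
    if count >= 1:
        return 5
    raise ValueError("count must be a positive integer")


def candidates_by_band(freq_x, freq_z, level):
    """Words at `level` in X that are absent from the same level in Z.

    Assumption: band-wise comparison is read as a set difference
    within one band; the patent does not state the exact rule.
    """
    band_x = {w for w, c in freq_x.items() if frequency_level(c) == level}
    band_z = {w for w, c in freq_z.items() if frequency_level(c) == level}
    return band_x - band_z


# Counts for "language" come from the text; "the" is a made-up control word.
freq_x = {"language": 7161, "the": 90000}
freq_z = {"language": 62, "the": 70000}
level1_candidates = candidates_by_band(freq_x, freq_z, level=1)
print(level1_candidates)  # {'language'}
```

Under this reading, "language" surfaces as a candidate because it is a level 1 word in X and only a level 4 word in Z, while a word that is common everywhere stays in the same band and is filtered out.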

The extracted entries are then classified by comparison: subject words are identified with the dictionary "linguistic nouns" as the reference; emotion words are identified with the Yangjiang emotion dictionary as the reference; entries that fall into neither category are automatically classified as background words. Taking the level 1 words in X and Z as an example, the extraction process of the feature words is shown in FIG. 2.
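The classification rule just described amounts to two dictionary lookups with a default. A minimal sketch follows, in which the two small sets are hypothetical stand-ins for the reference dictionaries, and checking the linguistic dictionary first is an assumption about precedence.

```python
def classify_feature_word(word, linguistic_terms, emotion_lexicon):
    """Assign a feature word type using two reference word lists."""
    if word in linguistic_terms:
        return "subject"
    if word in emotion_lexicon:
        return "emotion"
    # Anything found in neither reference list defaults to background.
    return "background"


# Hypothetical stand-ins for the two reference dictionaries.
linguistic_terms = {"dialect", "Mandarin", "syllable"}
emotion_lexicon = {"reject", "praise", "oppose"}

labels = {
    w: classify_feature_word(w, linguistic_terms, emotion_lexicon)
    for w in ["dialect", "praise", "teacher"]
}
print(labels)
```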

The quality of the feature word set determines the accuracy rate and the recall rate of the public opinion information detection. In order to ensure the quality of the feature word set, the three types of feature words which are automatically extracted need to be manually confirmed item by item.

The weighting algorithm builds on the extracted feature word sets, whose successful extraction is the precondition for the element co-occurrence method. Element co-occurrence is the main factor in judging public opinion but not the only one; a final judgment requires a weighted calculation that includes other factors. First the feature word weights are calculated; then, on the basis of the co-occurrence of the three types of feature words, the positions of the feature words and the length of the text are taken into account, and a weighted calculation over the four factors yields the score of the text. When the score reaches a certain threshold, the text is judged to be language and character public opinion.

To calculate feature word weights, note that the importance of a word in a text collection is generally expressed by its word frequency-document frequency value (TF-IDF). This theory holds that the importance of a word increases in proportion to the number of times it appears in a document but decreases in inverse proportion to the number of texts in the corpus in which it appears; that is, more specific words appearing in only a few documents are weighted more heavily than words appearing in many documents. The shortcoming of this measure is also evident: it underestimates words that occur frequently within one class, although such words represent the textual characteristics of that class and should receive higher weight. The invention therefore takes the normalized usage rate as an important quantitative criterion, because the higher the usage rate of the feature words appearing in a text, the more likely the text belongs to public opinion. For example, a text in which feature words such as "Chinese" or "dialect" appear is more likely to belong to language and character public opinion than one in which "tone" or "syllable" appears.

Based on the above, the invention determines the weight of a feature word from its normalized usage rate: the higher the usage rate, the higher the weight. Analysis of the usage rates shows that high-usage feature words generally have a rate greater than or equal to 0.01, medium-usage feature words a rate between 0.001 and 0.01, and low-usage feature words a rate below 0.001. Accordingly, the invention sets three weight levels, each lower than the previous one by 1, with weights 3, 2 and 1; for example, the word "language" has weight 3 and the word "book" has weight 1. Table 1 lists the three types of feature words with their normalized usage rates, with 10 representative words selected for each type.
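The three-tier mapping from normalized usage rate to weight can be written directly from the thresholds above. In this sketch the function name is hypothetical, and treating exactly 0.001 as medium usage follows from the text defining low usage as strictly below 0.001.

```python
def feature_word_weight(usage_rate):
    """Map a normalized usage rate to the weight 3, 2 or 1."""
    if usage_rate >= 0.01:
        return 3  # high usage
    if usage_rate >= 0.001:
        return 2  # medium usage
    return 1      # low usage


for rate in (0.02, 0.005, 0.0005):
    print(rate, feature_word_weight(rate))
```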

TABLE 1 feature words and their normalized usage

The calculation formula of the normalized utilization rate is as follows:

where F is the frequency of the word, D is its distribution rate, the denominator is the normalization term, and V is the set of all items of the same kind (all word types).
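A form consistent with the symbol definitions above is given here as a reconstruction, not as the patent's exact notation. It assumes that the numerator is the product of frequency and distribution rate for word i, and it introduces U_i as a hypothetical symbol for the normalized usage rate:

```latex
U_i = \frac{F_i \times D_i}{\sum_{j \in V} F_j \times D_j} \tag{1}
```

With this form the usage rates of all word types in V sum to 1, which is what makes thresholds such as 0.01 and 0.001 comparable across words.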

TABLE 2 Co-occurrence of three classes of feature words in clauses

Table 2 shows that in example sentence 1 all three types of feature words co-occur, so the sentence can be judged to be language and character public opinion; in example sentence 2 a subject word and an emotion word co-occur, so the sentence can basically be judged to be language and character public opinion; in example sentence 3 a subject word and a background word co-occur, and the sentence may be public opinion information about the international spread of Chinese but may equally be just an introduction to a school's Chinese-as-a-foreign-language program, so it cannot be directly judged to be language and character public opinion information.
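The three cases discussed for Table 2 can be expressed as a small decision function over the set of feature word types found in a clause. This is only a sketch: the label strings and the function name are hypothetical, and combinations not discussed in the text fall through to a generic label.

```python
def cooccurrence_judgement(types_present):
    """Judge a clause from the set of feature word types it contains."""
    types_present = set(types_present)
    if {"subject", "emotion", "background"} <= types_present:
        return "public opinion"           # example sentence 1
    if {"subject", "emotion"} <= types_present:
        return "probably public opinion"  # example sentence 2
    if {"subject", "background"} <= types_present:
        return "undetermined"             # example sentence 3
    return "not covered by Table 2"


print(cooccurrence_judgement({"subject", "emotion", "background"}))
print(cooccurrence_judgement({"subject", "background"}))
```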

Generally, the closer the elements are to one another, the tighter the syntactic and semantic relations among them and the more likely they belong to a public-opinion-related topic. In the three segments listed in Table 2 the co-occurrence distances between feature words are small, all within a single clause; more often, however, the three types of feature words co-occur across a sentence or a paragraph. The question of which co-occurrence distance gives the best recognition therefore has to be settled. The invention divides the co-occurrence distance between the three types of words into four levels (chapter, paragraph, sentence and clause) and compares them, and the comparison requires a weighting algorithm.

Besides co-occurrence, the position of a feature word in the text and the length of the text are also factors the weighting algorithm must consider. For position, the invention considers only two cases, title and body; feature words appearing in the title and in the body carry different weights. For length, a longer text tends to obtain a higher score, so a constraint is needed; the invention constrains the score by the average length of the texts in Y.

To summarize, the weighting algorithm takes four factors into account: feature word weight, co-occurrence among the three types of feature words, feature word position and text length. The algorithm must segment the text according to the co-occurrence distance between feature words. As mentioned above, the invention divides the co-occurrence distance into four levels, "chapter, paragraph, sentence and clause"; the procedure is described here using the sentence level as an example. First, the text is split into sentences at the sentence-final punctuation marks "。", "？" and "！"; the sentence score is given by formula (2).
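A form of formula (2) consistent with the symbol definitions in the paragraph that follows is sketched here as a reconstruction. Two points are assumptions: that the three word-list contributions are summed, and that the co-occurrence score G_i is added to that sum rather than multiplied.

```latex
\mathrm{Sen}_i = \sum_{a} \left( F_a U_a + P_a \right)
               + \sum_{b} \left( F_b U_b + P_b \right)
               + \sum_{c} \left( F_c U_c + P_c \right)
               + G_i \tag{2}
```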

Sen_i denotes the score of sentence i; a, b and c each denote a word in one of the three feature word lists; F denotes the frequency of the word, U its weight and P its position score. The score contributed by a feature word from a given list is its frequency in the sentence multiplied by its weight, plus the position score. G_i denotes the co-occurrence score: co-occurrence of all three types of feature words scores highest, subject word plus emotion word scores next, and subject word plus background word scores lowest.
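The sentence score described above can be sketched as follows, under stated assumptions: the lexicon entries and the numeric co-occurrence scores in `G_SCORES` are made up (only their ordering comes from the text), the position score P is taken as one value applied to each distinct feature word, and G is added to the sum rather than multiplied.

```python
from collections import Counter

# Hypothetical lexicon: word -> (feature type, weight U in {1, 2, 3}).
LEXICON = {
    "dialect": ("subject", 3),
    "reject": ("emotion", 2),
    "expert": ("background", 2),
}

# Hypothetical co-occurrence scores G; only their ordering follows the text.
G_SCORES = {
    frozenset({"subject", "emotion", "background"}): 3.0,
    frozenset({"subject", "emotion"}): 2.0,
    frozenset({"subject", "background"}): 1.0,
}


def sentence_score(tokens, position_score=0.0):
    """Sum F * U + P over the feature words in a sentence, then add G."""
    counts = Counter(t for t in tokens if t in LEXICON)
    score = 0.0
    types = set()
    for word, freq in counts.items():
        word_type, weight = LEXICON[word]
        score += freq * weight + position_score
        types.add(word_type)
    # Type combinations not listed in G_SCORES contribute no co-occurrence score.
    return score + G_SCORES.get(frozenset(types), 0.0)


tokens = ["expert", "reject", "dialect", "in", "class"]
score = sentence_score(tokens)
print(score)  # 10.0
```

Here the three feature words contribute 2 + 2 + 3 = 7, and the co-occurrence of all three types adds the hypothetical 3.0.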

Finally, the score of the entire text is shown in equation (3).
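Written out from the verbal description that follows (sum of sentence scores, multiplied by the average length and divided by the text length), formula (3) would read as below. The index j running over the sentences of text i is notation introduced here for illustration.

```latex
\mathrm{Text}_i = \left( \sum_{j \in \mathrm{text}\ i} \mathrm{Sen}_j \right) \times \frac{AL}{L_i} \tag{3}
```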

Text_i denotes the score of text i, AL (Average Length) denotes the average length of all texts in Y, and L_i denotes the length of text i. That is, the score of a text equals the sum of the scores of all its sentences, multiplied by the average text length and divided by the length of the text.
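The text-level calculation can be sketched as below: split on the sentence-final marks, sum the sentence scores, normalize by length and compare with a threshold. The function names, the sample sentence scores, the lengths and the threshold value are all hypothetical; the patent only says that "a certain threshold" is used.

```python
import re


def split_sentences(text):
    """Split on the sentence-final marks 。？！ and drop empty pieces."""
    return [s for s in re.split(r"[。？！]", text) if s.strip()]


def text_score(sentence_scores, text_length, average_length):
    """Sum of sentence scores, scaled by average length over text length."""
    if text_length <= 0:
        raise ValueError("text_length must be positive")
    return sum(sentence_scores) * average_length / text_length


def is_public_opinion(score, threshold):
    """Judge a text as public opinion once its score reaches the threshold."""
    return score >= threshold


sentences = split_sentences("专家反对方言。学生喜欢繁体字！")
print(len(sentences))  # 2

# Made-up sentence scores and lengths, for illustration only.
score = text_score([10.0, 4.0], text_length=700, average_length=1400)
print(score)                                     # 28.0
print(is_public_opinion(score, threshold=20.0))  # True
```

The length ratio scales a text up when it is shorter than the average text in Y and down when it is longer, which is the constraint on long texts described earlier.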

Table 3 gives the statistics of a randomly sampled week of language and character public opinion monitoring; the data show that the recognizer reaches an average accuracy of 92% in actual monitoring. The system has been adopted by bodies including the Department of Language Information Management of the Ministry of Education and the National Language Resources Monitoring and Research Center, and has run around the clock for more than 6 years.

TABLE 3 One-week accuracy of the language and character public opinion monitoring system

To verify the generality of the element co-occurrence method, it was also used to identify education-related public opinion on the network in higher education public opinion monitoring, with success. Table 4 gives the monitoring results for a randomly sampled week of higher education public opinion.

TABLE 4 One-week accuracy of the higher education public opinion monitoring system

The data show that the recognizer reaches an average accuracy of 93% in detecting higher education public opinion. The system has been adopted by the higher education communication and public opinion monitoring research center of the national higher education quality monitoring and evaluation research base, which is subordinate to the higher education teaching evaluation center of the Ministry of Education, and has run around the clock for more than 4 years.

Starting from the essence of public opinion, the invention represents the three main elements that constitute it (subject, object and emotional tendency) by three types of feature words and dynamically combines these feature words according to combination and aggregation relations, so that topics related to public opinion in a given field can be generated and public opinion information in that field can be effectively identified. The method has been applied in practice in a language and character public opinion monitoring system and a higher education public opinion monitoring system, with accuracies of 92% and 93% respectively.

The above description covers only the preferred embodiment of the present invention, but the scope of protection of the present invention is not limited thereto; any equivalent substitution or change made by a person skilled in the art, within the technical scope disclosed by the present invention and according to its technical solutions and inventive concept, shall fall within the scope of protection of the present invention.

Claims

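1. The network public opinion automatic identification technology based on element co-occurrence is characterized by comprising an implementation method and a weighting algorithm, wherein the implementation method comprises the following steps:

S101: collecting 9,436 corpus texts, denoted X, totaling 12.5 million characters, of which 1,836 texts related to public opinion are denoted Y (more than 2.5 million characters) and 7,600 texts not related to public opinion are denoted Z (about 10 million characters);

S102: segmenting the corpus into words with the automatic word segmentation system CUCBst and compiling word frequency statistics;

S103: dividing the words in X, Y and Z into five levels by frequency;

S104: comparing, band by band, the words in Z with the words in the same frequency band in X, in order to extract the feature words of language and character public opinion;

the weighting algorithm comprises the following steps:

S201: first calculating the weights of the feature words;

S202: then, on the basis of the co-occurrence of the three types of feature words, combining the positions at which the feature words occur and the length of the text, performing a weighted calculation over the four factors to obtain a text score, and judging that the text belongs to language and character public opinion when the score reaches a certain threshold.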

2. The element co-occurrence-based internet public opinion monitoring method according to claim 1, wherein five levels in S103 include: level 1 (more than or equal to 1000), level 2 (between 500 and 999), level 3 (between 100 and 499), level 4 (between 5 and 99), and level 5 (between 1 and 4).

3. The element co-occurrence-based internet public opinion monitoring method according to claim 1, wherein the weighting algorithm further comprises calculating feature word weights.

4. The element co-occurrence-based internet public opinion monitoring method according to claim 1, wherein the factors considered by the weighting algorithm include feature word weight, co-occurrence between three types of feature words, feature word position, and text length.

5. The element co-occurrence-based network public opinion monitoring method according to claim 1, wherein the quality of the feature word set determines the accuracy and recall rate of public opinion information detection, and in order to ensure the quality of the feature word set, manual confirmation is required for automatically extracted three types of feature words item by item.