CN115017302A - Public opinion monitoring method and public opinion monitoring system

Info

Publication number: CN115017302A
Application number: CN202210047264.7A
Authority: CN
Inventors: 李响; 杨国武; 李蒍韦; 侯柏成
Original assignee: Yellow River Conservancy Technical Institute
Current assignee: Yellow River Conservancy Technical Institute
Priority date: 2022-01-17
Filing date: 2022-01-17
Publication date: 2022-09-06

Abstract

The invention discloses a public opinion monitoring method and a public opinion monitoring system. The public opinion monitoring method comprises the following steps: S1: acquiring a keyword; S2: performing a keyword expansion operation on the keyword to obtain a keyword library; S3: extracting sensitive words from the keyword library to obtain a sensitive word library; S4: collecting final public opinion data for the keyword library and the sensitive word library; S5: preprocessing the final public opinion data to obtain a preprocessing result; S6: performing public opinion analysis on the preprocessing result to obtain an analysis result; S7: obtaining a public opinion monitoring result according to the analysis result. The public opinion monitoring method and system provided by the invention can effectively improve the comprehensiveness and accuracy of public opinion data.

Description

Public opinion monitoring method and public opinion monitoring system

Technical Field

The invention relates to the technical field of public opinion monitoring, in particular to a public opinion monitoring method and a public opinion monitoring system.

Background

In the information age, the internet has become an important channel and carrier for information transmission. Social media based on internet technology is widely used in daily life, and people increasingly rely on online social platforms to obtain information. However, in the era of big data, the volume of data in online media keeps growing, data forms are becoming more diverse, and information spreads ever faster. These changes pose considerable challenges to public opinion monitoring, which demands accurate monitoring and quick response. Existing public opinion monitoring systems have the following problems:

1. The collected data comes from few sources and the information is incomplete. Most online public opinion monitoring systems collect and analyze data from a single website only. Today, however, social platforms and online media are many and varied; each platform has its own user characteristics, and its public opinion data has its own value. Collecting public opinion data through multiple channels reflects the state of online public opinion more comprehensively and accurately.

2. Public opinion retrieval is inaccurate. In the big data setting, online public opinion data is huge in scale and the information is complex. Some public opinion monitoring systems collect information directly with the keywords provided by the user. However, a word may have many near-synonyms, internet slang for the same thing changes over time, and words and sentences may be ambiguous. As a result, searching directly with the initial keywords mostly retrieves information in the ordinary sense of those words, and cannot obtain the most comprehensive public opinion data or the sensitive public opinion the user is concerned about.

3. Text-only emotion analysis handles problems such as irony and ambiguous sentences poorly. Most current emotion analysis methods analyze text alone and cannot identify ironic sentences well, because an ironic sentence stripped of its context differs little from a normal sentence. In today's online environment, pictures such as emoji and emoticons have become an important supplement for expressing emotion and deserve attention in the field of emotion analysis.

Disclosure of Invention

The invention aims to provide a public opinion monitoring method and a public opinion monitoring system, which can effectively improve the comprehensiveness and accuracy of public opinion data.

The technical scheme for solving the technical problems is as follows:

The invention provides a public opinion monitoring method, which comprises the following steps:

S1: acquiring a keyword input by a user;

S2: performing a keyword expansion operation on the keyword to obtain a keyword library;

S3: extracting sensitive words from the keyword library to obtain a sensitive word library;

S4: collecting final public opinion data for the keyword library and the sensitive word library;

S5: preprocessing the final public opinion data to obtain a preprocessing result;

S6: performing public opinion analysis on the preprocessing result to obtain an analysis result;

S7: obtaining a public opinion monitoring result according to the analysis result.

Optionally, the step S2 includes:

searching related data sources with the keyword to obtain a plurality of pieces of data information matching the keyword;

obtaining the keyword library according to all the data information.

Optionally, the step S3 includes:

performing a word segmentation operation on all data in the keyword library with a word segmentation toolkit to obtain a word segmentation database;

converting all the word segmentation data into word vector information;

extracting negative words from the word segmentation database with a BiLSTM model according to the word vector information;

taking the negative words as sensitive words to obtain the sensitive word library.

Optionally, the step S4 includes:

S41: configuring a data acquisition expression, and merging the keyword library and the sensitive word library into a combined word library;

S42: searching for a related public opinion news list with the combined word library;

S43: adding the webpage address of the current news page of the related public opinion news list to a list to be collected;

S44: taking the webpage address from the list to be collected and accessing the related information of the current news page to form initial public opinion data;

S45: if the initial public opinion data satisfies both integrity and uniqueness, proceeding to step S46; otherwise, proceeding to step S47;

S46: outputting the initial public opinion data as the final public opinion data;

S47: judging whether the current news page is the last page of the related public opinion news list; if so, going to step S46; otherwise, returning to step S43.

Optionally, the step S5 includes:

dividing the final public opinion data into batches to obtain multiple batches of public opinion data;

removing special characters and useless characters from each batch of public opinion data with a regular expression to obtain processed final public opinion data;

performing a data feature extraction operation on the processed final public opinion data to obtain a feature extraction result;

outputting the feature extraction result as the preprocessing result.

Optionally, the public opinion analysis operation includes: general statistical analysis, keyword extraction, heat calculation and multi-modal emotion analysis.

Optionally, the heat calculation includes a heat index calculation for a single data source and a heat index calculation for a plurality of data sources. The heat index of a plurality of data sources is calculated by the formula:

$$H = \sum_{i} W_i H_i$$

where $H$ is the heat value, $H_i$ is the combined heat index of all final public opinion data of the i-th related data source, and $W_i$ is the heat weight of the i-th related data source;

the heat index x of a single related data source is calculated by the formula:

where E is the user attention index of each related data source; $T_s$ reflects the freshness of the related public opinion news and is determined by the release time A and the acquisition time B; T is the total number of seconds in a heat cycle of 3 days, i.e. T = 259200.

Optionally, the multi-modal emotion analysis comprises:

acquiring picture features and text features from the preprocessing result;

training a picture text alignment network according to the picture features and the text features to obtain a trained picture text alignment network;

obtaining fusion features from the picture features and the text features by using the trained picture text alignment network;

taking the fusion features as the input of a classifier to obtain a multi-modal emotion analysis result;

the loss function of the multi-modal emotion analysis model is as follows:

$$L = L_{CA} - L_{DA}$$

where $L_{CA}$ is the cross reconstruction loss and

$$L_{CA} = \sum_{i}^{M} \sum_{j \neq i}^{M} \left| x^{j} - D_j\left(E_i\left(x^{i}\right)\right) \right|$$

$M$ is the number of samples, $x^{j}$ represents the original features of modality j, $D_j$ represents the decoder of modality j, $E_i$ represents the encoder of modality i, $x^{i}$ represents the original features of modality i, and $L_{DA}$ is the distribution alignment loss and

$$L_{DA} = \sum_{i}^{M} \sum_{j \neq i}^{M} W_{ij}$$

$W_{ij}$ is the 2-Wasserstein distance between modalities i and j and

$$W_{ij} = \left( \lVert \mu_i - \mu_j \rVert_2^2 + \lVert \Sigma_i^{1/2} - \Sigma_j^{1/2} \rVert_F^2 \right)^{1/2}$$

where $\mu$ and $\Sigma$ are both hidden layer feature vectors generated by the encoders.

Optionally, the picture text alignment network comprises: a picture feature encoder, a text feature encoder, a shared feature layer and a plurality of shared feature decoders, wherein the picture feature encoder and the text feature encoder are both connected to the input end of the shared feature layer, the shared feature decoders are connected to the output end of the shared feature layer, and the shared feature layer is also connected to a classifier;

the picture feature encoder is used for encoding the picture features;

the text feature encoder is used for encoding the text features;

the plurality of shared feature decoders are used for decoding the shared features to output reconstructed picture features and reconstructed text features;

the classifier is used for classifying the shared features so as to train the picture text alignment network.

The invention also provides a public opinion monitoring system based on the public opinion monitoring method, and the public opinion monitoring system comprises:

the keyword acquisition module is used for acquiring keywords;

the keyword expansion module is used for expanding the keywords;

the sensitive word extraction module is used for extracting sensitive words in the keyword library;

the public opinion data acquisition module is used for acquiring final public opinion data of the keyword library and the sensitive word library;

the data preprocessing module is used for preprocessing the final public opinion data;

the public opinion analysis module is used for analyzing the preprocessing result;

and the public opinion reporting module is used for displaying the public opinion monitoring result to the user.

The invention has the following beneficial effects:

1. the method can improve the accuracy of public opinion emotion analysis;

2. keyword expansion and sensitive word extraction are combined to form new search terms, so that the sensitive public opinion the user is concerned about can be searched effectively and comprehensively;

3. information is collected from a plurality of related data sources and an extensible data acquisition interface is provided, which solves the problem that public opinion monitoring systems collect data from few sources and obtain incomplete information.

Drawings

Fig. 1 is a flowchart of a public opinion monitoring method according to the present invention;

Fig. 2 is a flowchart of step S4 in Fig. 1;

FIG. 3 is a schematic structural diagram of a multi-modal emotion analysis model provided by the present invention.

Detailed Description

The principles and features of this invention are described below in conjunction with the following drawings, which are set forth by way of illustration only and are not intended to limit the scope of the invention.

Examples

The invention provides a public opinion monitoring method which, referring to Fig. 1, comprises the following steps:

S1: acquiring a keyword input by a user;

the keyword here is generally a keyword entered by the user.

the step S2 includes:

here, the related data sources include but are not limited to microblog (Weibo), Today's Headlines (Toutiao), internet news, Tencent News, and the like; the search includes finding the keywords of the retrieved articles by using the TF-IDF algorithm and the TextRank algorithm.

TF-IDF is a weighting technique commonly used in information retrieval and text mining; it is a statistical method for evaluating how important a word is to one document in a document set or corpus. TF is the frequency with which a term occurs in a text, given by the formula

$$TF_{i,j} = \frac{n_{i,j}}{\sum_{k} n_{k,j}}$$

where $TF_{i,j}$ is the term frequency of word i in document j, $n_{i,j}$ is the number of times word i appears in document j, and $\sum_{k} n_{k,j}$ is the total number of occurrences of all words in document j.

IDF is the inverse document frequency:

$$IDF_i = \log \frac{|D|}{|\{j : t_i \in d_j\}|}$$

where $|D|$ is the total number of documents in the corpus and the denominator $|\{j : t_i \in d_j\}|$ is the number of documents containing the word $t_i$. The TF-IDF value of a word i for a category description text j is calculated as $TF\text{-}IDF_{i,j} = TF_{i,j} \times IDF_i$.

A word that occurs with high frequency in a particular document but with low document frequency across the whole document set receives a high TF-IDF weight, so common words are filtered out, important words are retained, and keyword extraction is achieved.
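
The TF and IDF definitions above can be illustrated with a small sketch. It assumes documents are already tokenized and uses the natural logarithm for IDF (the text does not state the base); the function name `tf_idf` and the sample documents are hypothetical.

```python
import math
from collections import Counter


def tf_idf(docs):
    """docs: list of token lists. Returns one {word: TF-IDF} dict per document."""
    n_docs = len(docs)
    # Document frequency: number of documents that contain each word.
    df = Counter(word for doc in docs for word in set(doc))
    scores = []
    for doc in docs:
        counts = Counter(doc)
        total = sum(counts.values())  # occurrences of all words in this document
        scores.append({
            word: (count / total) * math.log(n_docs / df[word])
            for word, count in counts.items()
        })
    return scores


# Hypothetical, already-segmented documents.
docs = [
    ["flood", "warning", "river", "river"],
    ["river", "traffic", "news"],
    ["river", "weather", "news"],
]
scores = tf_idf(docs)
```

In this sample, "river" appears in every document, so its IDF is zero and it is filtered out, while "flood" appears in only one document and receives a positive weight.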

TextRank is an improvement based on the idea of PageRank. It is a graph-based ranking algorithm for keyword extraction and document summarization: it can extract the keywords and key phrases of a given text by using the co-occurrence information between words in the document, and can extract key sentences of the text with an extractive automatic summarization method. The basic idea of TextRank is to treat a document as a network of words, in which the links represent semantic relationships between words. The formula is as follows:

$$WS(V_i) = (1 - d) + d \sum_{V_j \in In(V_i)} \frac{w_{ji}}{\sum_{V_k \in Out(V_j)} w_{jk}} \, WS(V_j)$$

where $WS(V_i)$ is the weight of sentence i, and the summation on the right side represents the contribution of each adjacent sentence to this sentence. In a single document, all sentences can roughly be regarded as adjacent, so, unlike the multi-document case, there is no need to generate and extract multiple windows; a single document window suffices. $w_{ji}$ is the similarity of the two sentences, $WS(V_j)$ is the weight of sentence j from the previous iteration, and d is the damping coefficient, typically 0.85.
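
As an illustration of the iteration described above, here is a minimal sketch on a generic weighted graph. The adjacency-matrix representation, the function name `textrank`, the fixed iteration count and the toy graph are all assumptions made for the example; the text does not specify how the graph is stored or when iteration stops.

```python
def textrank(weights, d=0.85, iterations=100):
    """weights[j][i] is the edge weight from node j to node i (0 means no edge).

    Returns the list of node scores WS after a fixed number of iterations.
    """
    n = len(weights)
    ws = [1.0] * n
    # Total outgoing weight of each node, used to normalize its contribution.
    out_sum = [sum(row) for row in weights]
    for _ in range(iterations):
        updated = []
        for i in range(n):
            contribution = sum(
                weights[j][i] / out_sum[j] * ws[j]
                for j in range(n)
                if j != i and out_sum[j] > 0
            )
            updated.append((1 - d) + d * contribution)
        ws = updated
    return ws


# Toy graph: node 0 is linked to nodes 1 and 2, which are not linked to each other.
graph = [
    [0, 1, 1],
    [1, 0, 0],
    [1, 0, 0],
]
ranks = textrank(graph)
```

Node 0 receives contributions from both neighbours and therefore ends up with the highest score, which is the behaviour the ranking relies on.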

As can be seen from the above, TF-IDF is suited to extracting rare words from an article: it aims to find words that are frequent in the article but infrequent in the corpus, i.e. distinctive words. TextRank, by contrast, is a simple graph-based method for extracting an article's keywords that does not rely on a corpus, and is suited to finding conventional keywords. The two methods are used together to find both the conventional keywords and the special terms of the corresponding field, which are used to expand the keyword library.

And obtaining the keyword library according to all the data information.

optionally, the step S3 includes:

performing a word segmentation operation on all data in the keyword library with a word segmentation toolkit to obtain a word segmentation database; the word segmentation toolkit adopted in the invention is the jieba word segmentation toolkit.

because a machine cannot process the segmented words directly, the word segmentation data is converted into word vector information so that the machine can process it.

Extracting negative words from the word segmentation database with a BiLSTM model according to the word vector information; the emotion analysis model BiLSTM performs emotion analysis on the word segmentation results and computes, for each word, an emotion score in [-1, 1]: the closer the score is to -1, the more likely the word is negative, and the closer it is to 1, the more likely the word is positive. The words are then sorted in ascending order of emotion polarity score, and the first m negative words are taken and added to the sensitive word library.

and the final public opinion data comprises the subject of the article, the publishing time, the full text of the content, the number of reposts, the number of comments, the number of likes, the publishing user's verification information, level and region, and other information (where available).
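
The selection step (ascending sort by emotion score, then taking the first m negative words) can be sketched as follows. The scores here are hard-coded stand-ins for the output of a trained BiLSTM model, and both the function name `select_sensitive_words` and the choice to keep only words with a score below zero are assumptions for the example.

```python
def select_sensitive_words(scores, m):
    """scores: {word: emotion score in [-1, 1]}. Returns up to m most negative words."""
    # Ascending order puts the most negative words first.
    ranked = sorted(scores.items(), key=lambda item: item[1])
    # Assumption: a word only counts as negative if its score is below zero.
    return [word for word, score in ranked[:m] if score < 0]


# Hypothetical scores standing in for BiLSTM model output.
word_scores = {"fraud": -0.9, "festival": 0.8, "collapse": -0.7, "meeting": 0.1}
sensitive = select_sensitive_words(word_scores, m=2)
```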

Specifically, referring to fig. 2, the step S4 includes:

S41: configuring a data acquisition expression, and merging the keyword library and the sensitive word library into a combined word library; here, the data acquisition expression is mainly a CSS selector expression or an XPath expression.

here, the related information includes the title of the article, the release time, the text of the article, and the like.

S45: if the initial public opinion data satisfies both integrity and uniqueness, proceeding to step S46; otherwise, proceeding to step S47;

integrity and uniqueness mean the following: for example, when a news article is extracted, data that lacks the article title or the article content is incomplete and is discarded; and if the extracted data already exists in the database (duplicate data), it is not stored.
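
A minimal sketch of the integrity and uniqueness check in step S45 is given below. It assumes each collected record is a dictionary with `url`, `title` and `content` fields and that duplicates are detected by URL against an in-memory set; the text only says that data already in the database is not stored, so the field names and the URL-based key are illustrative assumptions.

```python
def is_complete(record):
    """Integrity: both the article title and the article content must be present."""
    return bool(record.get("title")) and bool(record.get("content"))


def filter_records(records, seen_urls):
    """Keep records that are complete and not already stored.

    seen_urls stands in for the database lookup and is updated in place.
    """
    kept = []
    for record in records:
        if not is_complete(record):
            continue  # incomplete data is discarded
        if record["url"] in seen_urls:
            continue  # duplicate data is not stored
        seen_urls.add(record["url"])
        kept.append(record)
    return kept


# Hypothetical crawl results.
crawled = [
    {"url": "u1", "title": "Flood warning", "content": "Full text"},
    {"url": "u2", "title": "", "content": "Missing title"},
    {"url": "u1", "title": "Flood warning", "content": "Full text"},
    {"url": "u3", "title": "River level", "content": "Full text"},
]
stored = set()
final_data = filter_records(crawled, stored)
```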

optionally, the step S5 includes:

dividing the final public opinion data into batches to obtain a plurality of batches of public opinion data;

performing a data feature extraction operation on the processed final public opinion data to obtain a feature extraction result; for text data, an open-source ALBERT Chinese pre-trained model is used to extract 768-dimensional semantic vectors; for picture data, an open-source ResNet101 pre-trained model is used to extract 2048-dimensional picture feature vectors.

And outputting the feature extraction result as the preprocessing result.
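
The batching and regular-expression cleaning part of step S5 can be sketched as below. The specific patterns (HTML tags, URLs, and any character that is not a word character, whitespace or common punctuation) are assumptions about what counts as "special and useless characters"; the text does not list them. Feature extraction with ALBERT and ResNet101 is outside the scope of this sketch.

```python
import re

# Assumed cleaning rules; the text does not specify the actual expressions.
_TAG = re.compile(r"<[^>]+>")                      # leftover HTML tags
_URL = re.compile(r"https?://\S+")                 # links
_SPECIAL = re.compile(r"[^\w\s，。！？、,.!?]")      # anything but words and basic punctuation
_SPACES = re.compile(r"\s+")


def clean_text(text):
    """Strip tags, links and special characters, then collapse whitespace."""
    text = _TAG.sub(" ", text)
    text = _URL.sub(" ", text)
    text = _SPECIAL.sub(" ", text)
    return _SPACES.sub(" ", text).strip()


def batches(items, size):
    """Yield consecutive slices of at most `size` items."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


cleaned = clean_text("<p>Flood  alert!! ★ see http://t.cn/abc</p>")
chunks = list(batches([1, 2, 3, 4, 5], 2))
```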

General statistical analysis covers the total amount of public opinion data related to the keywords and similar keywords, the share of public opinion data from each website, statistics of the amount of public opinion data by time period, and the regional distribution of where the public opinion data was published;

the keyword extraction comprises the following: a keyword search may return a plurality of related public opinion events, and one public opinion event comprises a plurality of pieces of public opinion data. Analyzing each individual event is also crucial to public opinion analysis. The invention uses the TF-IDF algorithm and the TextRank algorithm to extract public opinion keywords from related events; these are used in the public opinion reporting module to form the public opinion keyword cloud and to calculate per-keyword public opinion heat.

The heat calculation includes: calculating public opinion heat along three dimensions, namely by search term, by event belonging to a search term, and by keyword within an event belonging to a search term. The public opinion heat calculation of the invention is based on the ranking algorithm of the social media site Reddit. Different heat calculation methods are designed for social media public opinion data (microblog) and online media public opinion data (Today's Headlines, Tencent News and internet news) to calculate the public opinion heat index of each platform; each platform's heat index is then given a weight, and the heat indices of all platforms are multiplied by their weights and summed to obtain the multi-source public opinion heat index of the public opinion related to the keywords.

Optionally, the heat calculation includes a heat index calculation for a single data source and a heat index calculation for a plurality of data sources:

the heat index x of a single related data source is calculated by the formula:

where E is the user attention index of each related data source; $T_s$ reflects the freshness of the related public opinion news and is determined by the release time A and the acquisition time B; T is the total number of seconds in a heat cycle of 3 days, i.e. T = 259200.

When the single related data source is a microblog, E = user type weight × (6 × number of reposts + 3 × number of comments + 1 × number of likes), and the user types and their corresponding weights are: 1 for common users, 1 for microblog girls, 1.5 for microblog users, 2 for landers, 4 for blue V, 4 for yellow V, 4 for gold V and 10 for gold V; using log10 also gives earlier interactions more weight.

News and online media do not distinguish user types in detail as microblogs do, but the heat index is calculated on a similar principle, so the calculation steps are the same as for the microblog heat index except for the calculation of E.

According to the characteristics of the analyzed data, E in the online media heat formula is calculated as: E = 8 × number of reposts + 5 × number of comments + 2 × number of votes
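
The two expressions for E can be written out directly, as in the sketch below. The single-source heat index formula itself is not reproduced in this text, so `heat_index` is only an assumption modelled on Reddit's ranking (a base-10 logarithm of E plus a freshness term scaled by the heat cycle, with the freshness taken as release time minus acquisition time); it should not be read as the patented formula. All function and constant names are hypothetical.

```python
import math

# Coefficients for (reposts, comments, likes or votes), taken from the text.
MICROBLOG_COEFFS = (6, 3, 1)
ONLINE_MEDIA_COEFFS = (8, 5, 2)
HEAT_CYCLE_SECONDS = 259200  # 3 days


def attention_index(reposts, comments, likes, coeffs, user_weight=1.0):
    """User attention index E; user_weight is the microblog user type weight."""
    a, b, c = coeffs
    return user_weight * (a * reposts + b * comments + c * likes)


def heat_index(e, release_ts, acquire_ts, cycle=HEAT_CYCLE_SECONDS):
    """ASSUMED Reddit-style form: log10(E) + (release - acquisition) / cycle.

    The actual formula is missing from the source text; this is illustrative only.
    """
    return math.log10(max(e, 1)) + (release_ts - acquire_ts) / cycle


# A verified (weight 4) microblog post versus a news article with the same counts.
e_microblog = attention_index(10, 5, 20, MICROBLOG_COEFFS, user_weight=4)
e_news = attention_index(10, 5, 20, ONLINE_MEDIA_COEFFS)
x = heat_index(1000, release_ts=0, acquire_ts=HEAT_CYCLE_SECONDS)
```

Under this assumed form, an item collected one full heat cycle after release loses one unit of heat, which is equivalent to a tenfold drop in E.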

The heat index of a plurality of data sources is calculated by the formula:

$$H = \sum_{i} W_i H_i$$

where $H$ is the heat value, $H_i$ is the combined heat index of all final public opinion data of the i-th related data source, and $W_i$ is the heat weight of the i-th related data source.
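
The multi-source heat value is a weighted sum of the per-source heat indices, as described earlier ("multiplied by the weights and added"). A minimal sketch follows; the source names, heat values and weights are made up for illustration, since the text does not give concrete weights.

```python
def multi_source_heat(source_heats, weights):
    """source_heats: {source: H_i}; weights: {source: W_i}. Returns H = sum of W_i * H_i."""
    return sum(weights[source] * heat for source, heat in source_heats.items())


# Hypothetical per-source heat indices and weights.
heats = {"microblog": 12.0, "online_media": 8.0}
weights = {"microblog": 0.6, "online_media": 0.4}
total_heat = multi_source_heat(heats, weights)
```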

The multi-modal emotion analysis comprises the following: because data of different modalities (pictures and text) belong to different domains, fusing features across modalities is a difficult problem. The emotion analysis module of the software constructs an encoder and a decoder for picture data and for text data respectively, aligns the picture and text features following the VAE (variational autoencoder) idea, and fuses the features into a single hidden layer to form a shared feature representation layer. The fusion features produced by the shared feature representation layer are used to train an emotion classifier for public opinion data, which finally yields the emotion tendency score of the multi-modal public opinion data.

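The loss function of the multi-modal emotion analysis model is as follows:

$$L = L_{CA} - L_{DA}$$

where $L_{CA}$ is the cross reconstruction loss and

$$L_{CA} = \sum_{i}^{M} \sum_{j \neq i}^{M} \left| x^{j} - D_j\left(E_i\left(x^{i}\right)\right) \right|$$

$M$ is the number of samples, $x^{j}$ represents the original features of modality j, $D_j$ represents the decoder of modality j, $E_i$ represents the encoder of modality i, $x^{i}$ represents the original features of modality i, and $L_{DA}$ is the distribution alignment loss and

$$L_{DA} = \sum_{i}^{M} \sum_{j \neq i}^{M} W_{ij}$$

$W_{ij}$ is the 2-Wasserstein distance between modalities i and j and

$$W_{ij} = \left( \lVert \mu_i - \mu_j \rVert_2^2 + \lVert \Sigma_i^{1/2} - \Sigma_j^{1/2} \rVert_F^2 \right)^{1/2}$$

where $\mu$ and $\Sigma$ are the hidden layer feature vectors generated by the encoders.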
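
The two loss terms can be sketched in plain Python under stated assumptions: the cross reconstruction loss is taken as the sum of absolute differences between each modality's original features and their reconstruction from the other modality, and each encoder is assumed to output a mean vector and a diagonal variance vector, for which the 2-Wasserstein distance between Gaussians has a closed form. The identity encoders and decoders and all names are placeholders, not the trained networks.

```python
import math


def cross_reconstruction_loss(x, encode, decode):
    """x: {modality: feature list}; encode/decode: {modality: callable}.

    Sums |x_j - D_j(E_i(x_i))| over all ordered pairs of distinct modalities.
    """
    loss = 0.0
    for i in x:
        for j in x:
            if i == j:
                continue
            rebuilt = decode[j](encode[i](x[i]))
            loss += sum(abs(a - b) for a, b in zip(x[j], rebuilt))
    return loss


def wasserstein2_diag(mu_i, var_i, mu_j, var_j):
    """2-Wasserstein distance between two Gaussians with diagonal covariance."""
    mean_term = sum((a - b) ** 2 for a, b in zip(mu_i, mu_j))
    cov_term = sum((math.sqrt(a) - math.sqrt(b)) ** 2 for a, b in zip(var_i, var_j))
    return math.sqrt(mean_term + cov_term)


# Placeholder identity encoders and decoders over toy 2-dimensional features.
identity = lambda v: v
features = {"picture": [1.0, 2.0], "text": [1.5, 2.0]}
coders = {"picture": identity, "text": identity}

l_ca = cross_reconstruction_loss(features, coders, coders)
w_ij = wasserstein2_diag([0.0, 0.0], [1.0, 1.0], [3.0, 4.0], [1.0, 1.0])
```

With equal variances the distance reduces to the Euclidean distance between the means, which is a quick sanity check on the closed form.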

Optionally, the multi-modal emotion analysis comprises:

the trained picture text alignment network is a picture text alignment network that can better fuse the semantic information of corresponding pictures and text.

obtaining fusion features from the picture features and the text features by using the trained picture text alignment network;

taking the fusion features as the input of a classifier to obtain a multi-modal emotion analysis result.

Optionally, as shown in Fig. 3, the picture text alignment network comprises: a picture feature encoder, a text feature encoder, a shared feature layer and a plurality of shared feature decoders, wherein the picture feature encoder and the text feature encoder are both connected to the input end of the shared feature layer, the shared feature decoders are connected to the output end of the shared feature layer, and the shared feature layer is also connected to a classifier;

the picture feature encoder is used for encoding the picture features;

the text feature encoder is used for encoding the text features;

the plurality of shared feature decoders are used for decoding the shared features to output reconstructed picture features and reconstructed text features;

the keyword acquisition module is used for acquiring keywords;

the keyword expansion module is used for expanding the keywords;

the sensitive word extraction module is used for extracting sensitive words in a keyword library;

The invention has the following beneficial effects:

1. the method can improve the accuracy of public opinion emotion analysis;

3. information is collected from a plurality of related data sources and an extensible data acquisition interface is provided, which solves the problem that public opinion monitoring systems collect data from few sources and obtain incomplete information.

The above description is only for the purpose of illustrating the preferred embodiments of the present invention and is not to be construed as limiting the invention, and any modifications, equivalents, improvements and the like that fall within the spirit and principle of the present invention are intended to be included therein.

Claims

1. A public opinion monitoring method is characterized by comprising the following steps:

s1: acquiring a keyword input by a user;

2. The public opinion monitoring method according to claim 1, wherein the step S2 includes:

and obtaining the keyword library according to all the data information.

3. The public opinion monitoring method according to claim 1, wherein the step S3 includes:

4. The public opinion monitoring method according to claim 1, wherein the step S4 includes:

5. The public opinion monitoring method according to claim 1, wherein the step S5 includes:

and outputting the feature extraction result as the preprocessing result.

6. The public opinion monitoring method according to any one of claims 1-5, wherein the public opinion analysis process includes: general statistical analysis, keyword extraction, heat calculation and multi-modal emotion analysis.

7. The public opinion monitoring method according to claim 6, wherein the heat calculation includes a heat index calculation for a single data source and a heat index calculation for a plurality of data sources, and the heat index of the plurality of data sources is calculated by the formula:

$$H = \sum_{i} W_i H_i$$

where $H$ is the heat value, $H_i$ is the combined heat index of all final public opinion data of the i-th related data source, and $W_i$ is the heat weight of the i-th related data source;

the heat index x of a single related data source is calculated by the formula:

where E is the user attention index of each related data source; $T_s$ represents the freshness of the related public opinion news and is determined by the release time A and the acquisition time B; T is the total number of seconds in a heat cycle of 3 days.

8. The public opinion monitoring method according to claim 6, wherein the multi-modal emotion analysis comprises:

the loss function of the multi-modal emotion analysis model is as follows:

$$L = L_{CA} - L_{DA}$$

where $L_{CA}$ is the cross reconstruction loss and

$$L_{CA} = \sum_{i}^{M} \sum_{j \neq i}^{M} \left| x^{j} - D_j\left(E_i\left(x^{i}\right)\right) \right|$$

$M$ is the number of samples, $x^{j}$ represents the original features of modality j, $D_j$ represents the decoder of modality j, $E_i$ represents the encoder of modality i, $x^{i}$ represents the original features of modality i, and $L_{DA}$ is the distribution alignment loss and

$$L_{DA} = \sum_{i}^{M} \sum_{j \neq i}^{M} W_{ij}$$

$W_{ij}$ is the 2-Wasserstein distance between modalities i and j and

$$W_{ij} = \left( \lVert \mu_i - \mu_j \rVert_2^2 + \lVert \Sigma_i^{1/2} - \Sigma_j^{1/2} \rVert_F^2 \right)^{1/2}$$

where $\mu$ and $\Sigma$ are both hidden layer feature vectors generated by the encoders.

9. The public opinion monitoring method according to claim 8, wherein the picture text alignment network comprises: a picture feature encoder, a text feature encoder, a shared feature layer and a plurality of shared feature decoders, wherein the picture feature encoder and the text feature encoder are both connected to the input end of the shared feature layer, the shared feature decoders are connected to the output end of the shared feature layer, and the shared feature layer is also connected to a classifier;

the text feature encoder is used for encoding the text features;

10. A public opinion monitoring system based on the public opinion monitoring method according to any one of claims 1 to 9, wherein the public opinion monitoring system comprises:

the keyword acquisition module is used for acquiring keywords;

the keyword expansion module is used for expanding the keywords;