CN112434516B

CN112434516B - Self-adaptive comment emotion analysis system and method for merging text information

Info

Publication number: CN112434516B
Application number: CN202011506610.0A
Authority: CN
Inventors: 许建兵; 李军; 戴磊; 陶飞; 王磊; 李强
Original assignee: Anhui Suncn Pap Information Technology Co ltd
Current assignee: Anhui Suncn Pap Information Technology Co ltd
Priority date: 2020-12-18
Filing date: 2020-12-18
Publication date: 2024-04-26
Anticipated expiration: 2040-12-18
Also published as: CN112434516A

Abstract

The invention provides a self-adaptive comment emotion analysis system and method for merging text information, wherein the method comprises the following steps: step a, determining the source and scale of data; step b, preprocessing the data; step c, extracting feature vectors from the preprocessed data; step d, performing association analysis on the extracted feature vectors to obtain weighted text vectors; and step e, performing a convolution operation on the weighted text vector and the comment feature compression vector to complete the final comment classification. The invention introduces body text information while avoiding the manual supervision work required when using LDA, and has a certain discovery and feature extraction capability for previously unseen text types; the model is able to discover new topics to a certain extent, and can automatically match, for different comments on the same body text, the body text information most relevant to each comment, thereby addressing the inability of LDA to work at a fine granularity.

Description

Self-adaptive comment emotion analysis system and method for merging text information

Technical Field

The invention belongs to the field of data statistics analysis, and particularly relates to a self-adaptive comment emotion analysis system and method for merging text information.

Background

With the rapid development of social platforms such as microblogs and WeChat, people can learn about events and news from all over the world and leave comments online at any time and from any place. By analyzing and aggregating this comment data, the general attitude of the public toward a given event, such as support, opposition, or indifference, can be understood at a macro level.

How the comment data is processed is the key to accurately obtaining real information. Because the volume of online comments is huge, manual review is impractical, and emotion analysis algorithms become the only feasible solution.

Existing emotion analysis algorithms are well established, including but not limited to BiLSTM, fastText, CLSTM and the like. The general flow is as follows:

1. Data preprocessing, including word segmentation, stop word removal, and filtering of irrelevant characters.

2. Feature extraction from the segmentation results using a CNN or another model.

3. Feeding the extracted features into a classifier (a fully connected layer or any other classifier) to complete the emotion classification of the comments.

In addition to the conventional steps above, the following are common ways to improve discrimination:

1. Using lexicon-based resources, such as emotion lexicons and semantic libraries, to help judge whether the emotion in the corpus is positive or negative.

2. Adding syntactic analysis to the base model so that the model can better learn the semantic and grammatical information of comments.

3. Using LDA to obtain topic-word information of the body text being commented on, and introducing this information into the model to assist the judgment.

Extensive practice has shown that the basic scheme discriminates well on conventional comment data, and that its ability to discriminate confusable data can be further improved by adding information from an emotion lexicon or from LDA (a topic model). However, these methods still have significant limitations, mainly the following:

1. The conventional scheme performs well but ignores the topic information of the body text. Introducing a semantic library or an emotion lexicon on top of it still does not solve this problem.

2. Although the above problem can be solved by introducing topic information with LDA, obtaining body text topics with LDA requires that an LDA model be trained separately on the corresponding body text, and the number of topics for that batch of text must be set manually.

3. A trained LDA model can only extract the topic information determined during training, and lacks effective information extraction capability for new topics.

4. Because the topics generated during LDA training do not depend on any particular document, the topic information extracted by LDA deviates to some degree for some articles, and LDA cannot extract more accurate topic information for a specific article.

5. If the same body text contains several sub-topics and the comments address different sub-topics, then, because of the situation described in problem 4, only part of the topic information extracted by LDA matches a given comment and the rest is interference, which affects subsequent results to some extent.

In view of these defects, the invention aims to introduce body text information while avoiding the complex training process and the large amount of manual supervision that an LDA model requires. The model is able to discover new topics to a certain extent, and can automatically match, for different comments on the same body text, the body text information most relevant to each comment, thereby addressing the inability of LDA to work at a fine granularity. LDA stands for Latent Dirichlet Allocation, a widely used topic model for mining and discovering the distribution of topics in a large volume of text.

Disclosure of Invention

In view of the above problems, the invention provides a self-adaptive comment emotion analysis method for merging body text information, which comprises the following steps:

Step a, determining the source and scale of data;

Step b, preprocessing the data;

Step c, extracting feature vectors from the preprocessed data;

Step d, performing association analysis on the extracted feature vectors to obtain weighted text vectors;

Step e, performing a convolution operation on the weighted text vector and the comment feature compression vector to complete the final comment classification.

Further, the data includes body text and comment text.

Further, the preprocessing of the data in step b specifically comprises stop word filtering of the body text and the comment text, and length compression of the body text.

Further, extracting feature vectors from the preprocessed data in step c includes extracting feature vectors of the body text and of the comment text respectively.

Further, extracting the feature vectors of the comment text and the body text specifically includes the following steps:

Step c1, vectorizing the preprocessed comment text to obtain the sentence vector representation corresponding to the comment text;

Step c2, further encoding the sentence vector representation corresponding to the comment text and extracting features from it to obtain the feature vector of the comment text;

Step c3, vectorizing the compressed body text to obtain the sentence vector representations corresponding to the body text;

Step c4, further encoding the sentence vector representations corresponding to the body text and extracting features from them to obtain the feature vectors of the body text.

Further, obtaining the weighted text vector by association analysis in step d specifically includes the following steps:

Step d1, calculating the relevance $r_{ij}$ between each sentence feature vector of the compressed body text and each comment feature vector: $r_{ij} = c_i \cdot s_j$, wherein $c_i$ represents the i-th comment feature vector and $s_j$ represents the feature vector of the j-th body text sentence;

Step d2, calculating the normalized relevance $R_{ij}$ of the i-th comment feature vector to each sentence j in the body text by softmax normalization of $r_{ij}$ over the sentences: $R_{ij} = \exp(r_{ij}) / \sum_k \exp(r_{ik})$;

Step d3, calculating the weighted text vector $V_i$: $V_i = \sum_j R_{ij} \, s_j$.
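Steps d1 to d3 can also be written compactly in matrix form. The following is only a sketch under assumed notation that does not appear in the original: $C$ stacks the comment feature vectors $c_i$ as rows, $S$ stacks the body text sentence feature vectors $s_j$ as rows, $r$ denotes the matrix of raw relevances, and the softmax is taken over the sentence index (following the softmax normalization mentioned in the detailed description).

```latex
\begin{aligned}
r_{ij} &= c_i \cdot s_j, & r &= C S^{\top} \\
R_{ij} &= \frac{\exp(r_{ij})}{\sum_{k} \exp(r_{ik})}, & R &= \operatorname{softmax}_{\text{rows}}(r) \\
V_i &= \sum_{j} R_{ij}\, s_j, & V &= R S
\end{aligned}
```

Because each row of $R$ is non-negative and sums to 1, every $V_i$ is a convex combination of the body text sentence vectors, weighted toward the sentences most relevant to comment i.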

Further, in step e, the comment feature compression vector is obtained by applying max_pooling and average_pooling in sequence to the comment feature vector, where max_pooling and average_pooling each represent a convolution kernel.

The invention also provides a self-adaptive comment emotion analysis system for merging body text information, which comprises:

a data source and scale determining unit, used for determining the source and scale of data;

a data preprocessing unit, used for preprocessing the data;

a feature vector extraction unit, used for extracting feature vectors from the preprocessed data;

an association degree analysis unit, used for performing association analysis on the extracted feature vectors to obtain weighted text vectors;

a decision unit, used for performing a convolution operation on the weighted text vector and the comment feature compression vector to complete the final comment classification.
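One way to picture how these units fit together is the sketch below. It is an illustration only: the class name `CommentEmotionSystem`, the field names, and the callable signatures are hypothetical and are not defined by the invention. The data source and scale determining unit is left out on the assumption that it acts at data collection time, and the decision unit is simplified to take a single comment vector.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple

Vector = List[float]


@dataclass
class CommentEmotionSystem:
    """Hypothetical wiring of the units; every field is a placeholder callable."""

    # Data preprocessing unit: (body_text, comment_text) -> (body_sentences, comment)
    preprocess: Callable[[str, str], Tuple[List[str], str]]
    # Feature vector extraction unit: -> (sentence vectors s_j, comment vector c_i)
    extract: Callable[[List[str], str], Tuple[List[Vector], Vector]]
    # Association degree analysis unit: (c_i, [s_j]) -> weighted text vector V_i
    associate: Callable[[Vector, List[Vector]], Vector]
    # Decision unit (simplified): (V_i, c_i) -> predicted class label
    decide: Callable[[Vector, Vector], str]

    def classify(self, body_text: str, comment_text: str) -> str:
        """Run one (body text, comment) pair through the units in order."""
        sentences, comment = self.preprocess(body_text, comment_text)
        sentence_vecs, comment_vec = self.extract(sentences, comment)
        weighted = self.associate(comment_vec, sentence_vecs)
        return self.decide(weighted, comment_vec)
```

Keeping each unit behind a plain callable makes it possible to swap in a real encoder or classifier without changing the control flow.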

Further, the feature vector extraction unit, used for extracting feature vectors from the preprocessed data, performs the following:

vectorizing the preprocessed comment text to obtain the sentence vector representation corresponding to the comment text, and further encoding this representation and extracting features from it to obtain the feature vector of the comment text;

vectorizing the compressed body text to obtain the sentence vector representations corresponding to the body text, and further encoding these representations and extracting features from them to obtain the feature vectors of the body text.

Further, the association degree analysis unit, used for performing association analysis on the extracted feature vectors to obtain weighted text vectors, performs the following:

calculating the relevance $r_{ij}$ between each sentence feature vector of the compressed body text and each comment feature vector: $r_{ij} = c_i \cdot s_j$, wherein $c_i$ represents the i-th comment feature vector and $s_j$ represents the feature vector of the j-th body text sentence;

calculating the normalized relevance $R_{ij}$ of the i-th comment feature vector to each sentence j in the body text by softmax normalization of $r_{ij}$ over the sentences: $R_{ij} = \exp(r_{ij}) / \sum_k \exp(r_{ik})$;

calculating the weighted text vector $V_i$: $V_i = \sum_j R_{ij} \, s_j$.

The invention has the beneficial effects that:

1. The invention introduces body text information while avoiding the complex training process and the large amount of manual supervision that an LDA model requires. In the comment emotion analysis process, body text information is introduced by means of deep learning, the manual supervision work required when using LDA is avoided, and the model has a certain discovery and feature extraction capability for previously unseen text types. The model is able to discover new topics to a certain extent, and can automatically match, for different comments on the same body text, the body text information most relevant to each comment, thereby addressing the inability of LDA to work at a fine granularity;

2. When introducing the body text information, the invention obtains the relevance feature vector of the body text by calculating the relevance between each body text sentence and the comment, so that the model can adaptively extract, for different comments, the body text features most relevant to each. The method therefore has better feature extraction capability for body text containing multiple fine-grained topics.

Additional features and advantages of the invention will be set forth in the description which follows, and in part will be obvious from the description, or may be learned by practice of the invention. The objectives and other advantages of the invention may be realized and attained by the structure particularly pointed out in the written description and claims hereof as well as the appended drawings.

Drawings

In order to more clearly illustrate the embodiments of the present invention or the technical solutions of the prior art, the following description will briefly explain the drawings used in the embodiments or the description of the prior art, and it is obvious that the drawings in the following description are some embodiments of the present invention, and other drawings can be obtained according to these drawings without inventive effort for a person skilled in the art.

FIG. 1 shows a schematic flow chart of an adaptive comment emotion analysis method for merging text information in an embodiment of the invention;

Fig. 2 shows a specific flow diagram of an adaptive comment emotion analysis method for merging text information in an embodiment of the present invention;

Fig. 3 shows a specific flow chart of association analysis in the embodiment of the invention.

Detailed Description

For the purpose of making the objects, technical solutions and advantages of the embodiments of the present invention more apparent, the technical solutions of the embodiments of the present invention will be clearly and completely described below with reference to the accompanying drawings in the embodiments of the present invention, and it is apparent that the described embodiments are some embodiments of the present invention, but not all embodiments of the present invention. All other embodiments, which can be made by those skilled in the art based on the embodiments of the invention without making any inventive effort, are intended to be within the scope of the invention.

Fig. 1 shows a schematic flow chart of an adaptive comment emotion analysis method for merging text information in an embodiment of the present invention. As shown in Fig. 1, the method includes the following steps:

Step a, determining the source and scale of data;

Step b, preprocessing the data;

Step c, extracting feature vectors from the preprocessed data;

Specifically, the data of the invention are microblog comments and the corresponding body text information; about 1 million comments and the corresponding body text data were obtained with a web crawler. Part of the data (about 300,000 items) was annotated manually, and subsequent model training is completed using this part of the data. The final annotated data takes the form of the following triplet: (body text, comment text, comment classification), wherein the comment classification is obtained by analyzing the body text data together with the comment text data.
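The annotated triplet could be represented as in the sketch below. This is an illustration only: the type name `LabeledSample`, the field names, the label values ("support", "oppose", "neutral"), and the sample sentences are all made up for the example and are not taken from the original data.

```python
from collections import Counter
from typing import Iterable, NamedTuple


class LabeledSample(NamedTuple):
    """One annotated triplet: (body text, comment text, comment classification)."""

    body_text: str
    comment_text: str
    comment_class: str  # hypothetical label set: "support", "oppose", "neutral"


def class_distribution(samples: Iterable[LabeledSample]) -> Counter:
    """Count how many samples fall into each comment class."""
    return Counter(sample.comment_class for sample in samples)


# Invented examples, used only to show the shape of the data.
samples = [
    LabeledSample("The city opens a new subway line.", "Great news for commuters!", "support"),
    LabeledSample("The city opens a new subway line.", "Too expensive, a waste.", "oppose"),
    LabeledSample("The city opens a new subway line.", "Whatever.", "neutral"),
]
```

Note that the same body text appears in several triplets, once for each comment made on it, which is what later allows different comments to attend to different body text sentences.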

Fig. 2 shows a detailed flow chart of an adaptive comment emotion analysis method for merging text information in an embodiment of the present invention. As shown in Fig. 2, preprocessing the data in step b includes stop word filtering of the body text and the comment text, and length compression of the body text. The body text and the comment text are segmented into words using the jieba word segmenter, and stop words are filtered from the segmentation result using the Harbin Institute of Technology (HIT) stop word list. The TextRank algorithm module in TextRank4ZH is then called to select the key sentences of the body text (the top 30), which completes the length compression of the body text. This step mainly prevents the slow training that would result from overly long microblog body text. TextRank is a commonly used algorithm for extracting keywords and key sentences.
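The preprocessing step can be sketched with the standard library alone, as below. This is not the TextRank4ZH implementation: `textrank_scores` is a simplified stand-in that ranks sentences by word overlap, the tokenizer is passed in as a parameter (for Chinese text a segmenter such as `jieba.lcut` could be supplied, assuming that package is installed), and all function names are hypothetical.

```python
import math
import re
from typing import Callable, List, Set


def filter_stop_words(tokens: List[str], stop_words: Set[str]) -> List[str]:
    """Drop empty tokens and tokens found in the stop word list."""
    return [t for t in tokens if t.strip() and t not in stop_words]


def split_sentences(text: str) -> List[str]:
    """Split on Chinese or Western sentence-ending punctuation."""
    parts = re.findall(r"[^。！？!?.]+[。！？!?.]?", text)
    return [p.strip() for p in parts if p.strip()]


def _similarity(a: List[str], b: List[str]) -> float:
    """Word-overlap similarity normalized by log sentence lengths."""
    if len(a) < 2 or len(b) < 2:  # avoids a zero denominator
        return 0.0
    return len(set(a) & set(b)) / (math.log(len(a)) + math.log(len(b)))


def textrank_scores(token_lists: List[List[str]], damping: float = 0.85,
                    iterations: int = 50) -> List[float]:
    """Simplified TextRank: power iteration over the sentence similarity graph."""
    n = len(token_lists)
    sim = [[0.0 if i == j else _similarity(token_lists[i], token_lists[j])
            for j in range(n)] for i in range(n)]
    out_sum = [sum(row) for row in sim]
    scores = [1.0] * n
    for _ in range(iterations):
        scores = [(1 - damping) + damping * sum(
            sim[j][i] / out_sum[j] * scores[j] for j in range(n) if out_sum[j] > 0)
            for i in range(n)]
    return scores


def compress_body(text: str, tokenize: Callable[[str], List[str]],
                  stop_words: Set[str], top_k: int = 30) -> List[str]:
    """Keep the top_k highest-scoring sentences, in their original order."""
    sentences = split_sentences(text)
    token_lists = [filter_stop_words(tokenize(s), stop_words) for s in sentences]
    scores = textrank_scores(token_lists)
    best = sorted(range(len(sentences)), key=lambda i: -scores[i])[:top_k]
    return [sentences[i] for i in sorted(best)]
```

Returning the kept sentences in their original order preserves the reading order of the body text, which may matter to a downstream sequence encoder.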

In step c, extracting feature vectors from the preprocessed data includes extracting feature vectors of the body text and of the comment text respectively.

Specifically, when extracting the feature vector of a comment text: the preprocessed comment text data is vectorized (embedded) using Baidu's open-source ERNIE model to obtain the corresponding sentence vector representation; the sentence vector representation is further encoded and its features extracted using a BiLSTM, and the encoded output is recorded as the vector representation $V_i$ of the sentence (at this point the dimension of $V_i$ is seq_length × embedding_size); for the vector representation $V_i$ of each sentence, the output at the last time step is taken as the feature vector of the comment text, and the subsequent association analysis is performed on this vector (whose dimension is 1 × embedding_size).

BiLSTM and time steps: a bidirectional Long Short-Term Memory network (BiLSTM) is a commonly used recurrent neural network for processing data with sequential dependencies, such as text. Each word in the text is one time step. Typically, the output of a BiLSTM at the last time step contains information about the entire sequence.

When extracting the feature vectors of the body text: the compressed body text is vectorized to obtain the sentence vector representations corresponding to the body text, which are then further encoded and their features extracted to obtain the feature vectors of the body text. The procedure is the same as for the comment text, except that the body text is usually much longer than a comment (after TextRank preprocessing, the body text is limited to at most 30 sentences). The sentence vector features of these sentences are therefore extracted one by one to serve as the feature vectors of the body text, and the subsequent association analysis is performed on these vectors.

In step d, association analysis is performed on the extracted feature vectors to obtain the weighted text vector. Its purpose is to find the related information between a comment and the body text, and to extract the body text features related to the comment according to the relevance between each body text sentence and the comment, so as to assist the final classification. Fig. 3 shows a detailed flow chart of the association analysis in an embodiment of the invention.

In Fig. 3, $c_i$ is defined as the sentence vector of the i-th comment under an article, and $s_j$ as the sentence vector of the j-th sentence of the article body, where $c_i$ and $s_j$ come from the comment and body text vector representations extracted in the second module.

The relevance between the i-th comment and the j-th body text sentence is calculated as: $r_{ij} = c_i \cdot s_j$.

For the i-th comment, its normalized relevance $R_{ij}$ to each sentence j in the body text is obtained by calculating the relevance between each sentence and the comment and applying softmax probability normalization to it, which yields the relevance weight vector: $R_{ij} = \exp(r_{ij}) / \sum_k \exp(r_{ik})$.

Finally, the sentence vectors of the body text are weighted and summed according to the relevance $R_{ij}$, giving the body text vector for comment i: $V_i = \sum_j R_{ij} \, s_j$.
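The association analysis for a single comment can be sketched in plain Python as follows. This is a minimal illustration assuming the relevance is a dot product followed by softmax normalization over the body text sentences, as described above; the function names are hypothetical, and a real implementation would operate on batched tensors.

```python
import math
from typing import List, Tuple

Vector = List[float]


def dot(a: Vector, b: Vector) -> float:
    """Dot product r_ij = c_i . s_j."""
    return sum(x * y for x, y in zip(a, b))


def softmax(scores: List[float]) -> List[float]:
    """Numerically stable softmax: subtract the maximum before exponentiating."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def weighted_body_vector(comment_vec: Vector,
                         sentence_vecs: List[Vector]) -> Tuple[List[float], Vector]:
    """Return (R_i, V_i): relevance weights and the weighted body text vector."""
    raw = [dot(comment_vec, s) for s in sentence_vecs]  # r_ij for every sentence j
    weights = softmax(raw)                              # R_ij, sums to 1 over j
    dim = len(sentence_vecs[0])
    weighted = [sum(w * s[d] for w, s in zip(weights, sentence_vecs))
                for d in range(dim)]                    # V_i = sum_j R_ij * s_j
    return weights, weighted
```

For example, with comment vector `[1.0, 0.0]` and sentence vectors `[[1.0, 0.0], [0.0, 1.0]]`, the first sentence has the larger dot product and therefore receives the larger weight, so the weighted vector leans toward it.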

In step e, the comment feature compression vector is obtained by applying max_pooling and average_pooling in sequence to the comment text feature vector. Here max_pooling and average_pooling each represent a convolution kernel that compresses features, extracting either the most salient features (max_pooling) or the more general features (average_pooling).
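The two pooling operations can be illustrated as below, applied over the time steps of the encoded comment. The text does not spell out how the pooled results are arranged, so the composition in `compress_comment_features` (last time step, max pooling, average pooling, stacked as three rows) is an assumption made only to arrive at a three-row result; the function names are hypothetical.

```python
from typing import List

Matrix = List[List[float]]  # seq_length rows, one feature vector per time step


def max_pool_over_time(seq: Matrix) -> List[float]:
    """Per feature dimension, keep the largest value across time steps."""
    return [max(column) for column in zip(*seq)]


def avg_pool_over_time(seq: Matrix) -> List[float]:
    """Per feature dimension, average the values across time steps."""
    return [sum(column) / len(seq) for column in zip(*seq)]


def compress_comment_features(seq: Matrix) -> Matrix:
    """ASSUMED layout: [last time step, max pooled, average pooled] -> 3 rows."""
    return [list(seq[-1]), max_pool_over_time(seq), avg_pool_over_time(seq)]
```

Max pooling keeps the strongest activation of each feature, while average pooling summarizes its typical level, so the two rows carry complementary information about the comment.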

Specifically, in step e, performing the convolution operation on the weighted text vector and the comment feature compression vector to complete the final comment classification includes the following steps: concatenating the comment feature compression vector (3 × embedding_size) and the weighted text vector (1 × embedding_size) into a vector of dimension 4 × embedding_size; convolving the concatenated feature vector with different convolution kernels and concatenating the convolution results as the input of a fully connected layer; and completing the final classification with the fully connected layer.

CNN and convolution: a convolutional neural network is a commonly used feature extractor, which achieves specific feature extraction mainly through different convolution kernels.
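The decision step can be sketched as follows. The kernel shapes, the absence of a nonlinearity, and the weights used here are hypothetical choices for illustration; the text only states that different convolution kernels are applied to the concatenated 4-row feature matrix and that the concatenated results feed a fully connected layer.

```python
from typing import List

Matrix = List[List[float]]


def stack_features(comment_compressed: Matrix, weighted_body: List[float]) -> Matrix:
    """Concatenate the 3-row comment features and the 1-row body vector into 4 rows."""
    return [list(row) for row in comment_compressed] + [list(weighted_body)]


def conv_rows(x: Matrix, kernel: Matrix) -> List[float]:
    """Slide an h x d kernel down the rows of x (valid mode, no padding)."""
    height = len(kernel)
    out = []
    for start in range(len(x) - height + 1):
        window = x[start:start + height]
        out.append(sum(w * v
                       for k_row, x_row in zip(kernel, window)
                       for w, v in zip(k_row, x_row)))
    return out


def fully_connected(features: List[float], weights: Matrix,
                    bias: List[float]) -> List[float]:
    """One linear layer: logits = W @ features + b."""
    return [sum(w * f for w, f in zip(row, features)) + b
            for row, b in zip(weights, bias)]


def classify(x: Matrix, kernels: List[Matrix], fc_weights: Matrix,
             fc_bias: List[float]) -> int:
    """Convolve with every kernel, concatenate, and pick the class with the top logit."""
    conv_out = [value for kernel in kernels for value in conv_rows(x, kernel)]
    logits = fully_connected(conv_out, fc_weights, fc_bias)
    return max(range(len(logits)), key=lambda i: logits[i])
```

Using kernels of different heights lets each one mix a different number of adjacent rows, so some kernels relate the body text row only to its neighboring comment row while others see all four rows at once.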

The invention also provides a self-adaptive comment emotion analysis system for merging text information, which comprises:

the data preprocessing unit is used for preprocessing data;

Specifically, the feature vector extraction unit is configured to perform feature vector extraction according to the preprocessed data, and includes:

Specifically, the association degree analysis unit is configured to perform association degree analysis on the extracted feature vector and obtain a weighted text vector, and includes:

calculating the weighted text vector $V_i$: $V_i = \sum_j R_{ij} \, s_j$.

Specifically, the decision unit receives the text vector (1×embedding_size) weighted by the relevance analysis unit and the comment vector (3×embedding_size) in the second module, and splices the comment feature compression vector and the weighted text vector to form a vector with a dimension of 4×embedding_size.

In the comment emotion analysis process, the invention introduces body text information by means of deep learning, avoids the manual supervision work required when using LDA, and has a certain discovery and feature extraction capability for previously unseen text types.

According to the method, when the text information is introduced, the relevance feature vector of the text is obtained by calculating the relevance of each sentence of the text and the comment, so that the model can adaptively extract text features with higher relevance to different comments. The method has better feature extraction capability for body text with multiple fine-grained subjects.

Although the invention has been described in detail with reference to the foregoing embodiments, it will be understood by those of ordinary skill in the art that: the technical scheme described in the foregoing embodiments can be modified or some technical features thereof can be replaced by equivalents; such modifications and substitutions do not depart from the spirit and scope of the technical solutions of the embodiments of the present invention.

Claims

1. A self-adaptive comment emotion analysis method for merging body text information, characterized by comprising the following steps:

step a, determining the source and the scale of data, wherein the data comprises body text and comment text;

step b, preprocessing the data;

step c, extracting feature vectors according to the preprocessed data;

The association analysis in step d to obtain the weighted text vector specifically comprises the following steps:

Step d1, calculating the relevance $r_{ij}$ between each sentence feature vector of the compressed body text and each comment feature vector: $r_{ij} = c_i \cdot s_j$, wherein $c_i$ represents the i-th comment feature vector and $s_j$ represents the feature vector of the j-th body text sentence;

Step d2, calculating the normalized relevance $R_{ij}$ of the i-th comment feature vector to each sentence j in the body text: $R_{ij} = \exp(r_{ij}) / \sum_k \exp(r_{ik})$;

Step d3, calculating the weighted text vector $V_i$: $V_i = \sum_j R_{ij} \, s_j$;

2. The adaptive comment emotion analysis method for merging body text information according to claim 1, wherein the preprocessing of the data in the step b specifically includes the steps of body text and comment text stop word filtering and body text length compression.

3. The adaptive comment emotion analysis method for merging body text information according to claim 1, wherein the feature vector extraction in step c according to the preprocessed data includes a step of extracting feature vectors for body text and comment text, respectively.

4. The adaptive comment emotion analysis method for merging body text information according to claim 3, characterized in that extracting feature vectors of said comment text and said body text specifically comprises the steps of:

5. The adaptive comment emotion analysis method for merging body text information according to claim 1, wherein in step e, the comment feature compression vector is obtained by applying max_pooling and average_pooling in sequence to the comment text feature vector, where max_pooling and average_pooling each represent a convolution kernel.

6. An adaptive comment emotion analysis system for merging body text information, the system comprising:

The data source and scale determining unit is used for determining the source and scale of data, wherein the data comprises a body text and a comment text;

the data preprocessing unit is used for preprocessing data;

The association degree analysis unit is used for performing association analysis on the extracted feature vectors to obtain weighted text vectors, which comprises:

calculating the relevance $r_{ij}$ between each sentence feature vector of the compressed body text and each comment feature vector: $r_{ij} = c_i \cdot s_j$, wherein $c_i$ represents the i-th comment feature vector and $s_j$ represents the feature vector of the j-th body text sentence;

calculating the normalized relevance $R_{ij}$ of the i-th comment feature vector to each sentence j in the body text: $R_{ij} = \exp(r_{ij}) / \sum_k \exp(r_{ik})$;

calculating the weighted text vector $V_i$: $V_i = \sum_j R_{ij} \, s_j$;

7. The adaptive comment emotion analysis system of claim 6, wherein the feature vector extraction unit is configured to perform feature vector extraction based on the preprocessed data, and includes: