CN112434516A

CN112434516A - Self-adaptive comment emotion analysis system and method fusing text information

Info

Publication number: CN112434516A
Application number: CN202011506610.0A
Authority: CN
Inventors: 许建兵; 李军; 戴磊; 陶飞; 王磊; 李强
Original assignee: Anhui Suncn Pap Information Technology Co ltd
Current assignee: Anhui Suncn Pap Information Technology Co ltd
Priority date: 2020-12-18
Filing date: 2020-12-18
Publication date: 2021-03-02
Anticipated expiration: 2040-12-18

Abstract

The invention provides a self-adaptive comment emotion analysis system and method fusing text information, wherein the method comprises the following steps: step a, determining a data source and scale; step b, preprocessing the data; step c, extracting feature vectors from the preprocessed data; step d, performing relevance analysis on the extracted feature vectors to obtain a weighted text vector; and step e, performing a convolution operation on the weighted text vector and the comment feature compressed vector to complete the final comment classification. The invention introduces body text information while avoiding the manual supervision that LDA requires, and has some ability to discover and extract features from previously unseen text categories. The model can discover new topics to a certain degree and can automatically match the most relevant body text information to different comments on the same body text, which addresses LDA's inability to work at fine granularity.

Description

Self-adaptive comment emotion analysis system and method fusing text information

Technical Field

The invention belongs to the field of data statistical analysis, and particularly relates to a self-adaptive comment sentiment analysis system and method fusing text information.

Background

With the rapid development of social platforms such as microblogs and WeChat, people can learn about events and news around the world at any time and place through the network, and can leave messages and comments. By analyzing and aggregating this comment data, the general attitude of the public toward a given event can be understood, for example support, objection, or indifference.

How to process the comment data is the key to accurately obtaining the real information. Because the volume of online comments is huge, manual review is impractical, and sentiment analysis algorithms are the only feasible approach.

Existing sentiment analysis algorithms are mature and include, but are not limited to, BiLSTM, FastText, CLSTM, etc. The general flow is as follows:

1. Data preprocessing, including word segmentation, stop word removal and filtering of extraneous characters

2. Feature extraction from the segmented results using a CNN or another model

3. Feeding the extracted features into a classifier (a fully connected layer or any other classifier) to complete the sentiment classification of comments

Besides the conventional steps above, the following are also common ways to improve discrimination capability:

1. Using lexicons, such as a sentiment lexicon or a semantic lexicon, to assist in judging positive and negative sentiment in the corpus

2. Adding syntactic analysis to the basic model so that the model can better learn the semantic and grammatical information of the comments

3. Obtaining the topic word information of the commented body text with LDA and introducing it into the model to assist the judgment

Extensive practice has shown that the basic scheme discriminates conventional comment data well, and that its ability to discriminate easily confused data can be further improved by using a sentiment lexicon or LDA (a topic model). These methods nevertheless have significant limitations, mainly as follows:

1. Although the conventional scheme performs well, it ignores the topic information of the body text. Introducing a semantic lexicon or a sentiment lexicon on top of it still does not solve this problem.

2. Although introducing topic information through LDA can solve this problem, obtaining body text topics with LDA requires training an LDA model separately on the corresponding body texts, and the number of topics in the batch of texts must be set manually.

3. A trained LDA model can only extract the topics determined during training, and lacks effective information extraction capability for newly appearing topics.

4. Because the topics generated during LDA training do not depend on any particular document, the topic information LDA extracts deviates somewhat for some articles, and LDA cannot extract more accurate topic information for a specific article.

5. If the same body text contains several sub-topics and there are comments addressing different sub-topics, then, because of the situation described in problem 4, only part of the topic information extracted by LDA matches a given comment and the rest is interference, so the subsequent result is also affected to some extent.

In view of these shortcomings, the invention aims to introduce body text information while solving the problems that LDA model training is complex and requires a large amount of manual supervision. The model can also discover new topics to a certain degree and can automatically match the most relevant body text information to different comments on the same body text, which addresses LDA's inability to work at fine granularity. LDA stands for Latent Dirichlet Allocation, a widely used topic model for mining and discovering topic distributions in large volumes of text.

Disclosure of Invention

Aiming at the problems, the invention provides a self-adaptive comment emotion analysis method fusing text information, which comprises the following steps:

step a, determining a data source and scale;

step b, preprocessing the data;

step c, extracting feature vectors from the preprocessed data;

step d, performing relevance analysis on the extracted feature vectors to obtain a weighted text vector;

and step e, performing a convolution operation on the weighted text vector and the comment feature compressed vector to complete the final comment classification.

Further, the data includes body text and comment text.

Further, preprocessing the data in step b specifically comprises filtering stop words from the body text and the comment text, and compressing the length of the body text.

Further, extracting feature vectors from the preprocessed data in step c comprises extracting feature vectors of the body text and of the comment text respectively.

Further, extracting the feature vectors of the comment text and the body text specifically includes the following steps:

step c1, vectorizing the preprocessed comment text to obtain the sentence vector representations corresponding to the comment text;

step c2, further encoding the sentence vector representations corresponding to the comment text and extracting features from them to obtain the feature vector of the comment text;

step c3, vectorizing the compressed body text to obtain the sentence vector representations corresponding to the body text;

and step c4, further encoding the sentence vector representations corresponding to the body text and extracting features from them to obtain the feature vectors of the body text.

Further, performing relevance analysis in step d to obtain a weighted text vector specifically includes the following steps:

step d1, calculating the raw relevance $r_{ij}$ between each sentence feature vector of the compressed body text and the comment feature vector: $r_{ij} = c_i \cdot s_j$, where $c_i$ represents the i-th comment feature vector and $s_j$ represents the feature vector of the j-th body text sentence;

step d2, calculating the relevance $R_{ij}$ of the i-th comment feature vector to each sentence j in the body text by softmax normalization of the raw relevance: $R_{ij} = \exp(r_{ij}) / \sum_k \exp(r_{ik})$;

step d3, calculating the weighted text vector $V_i$: $V_i = \sum_j R_{ij} s_j$.

Further, in step e, the comment feature compressed vector is obtained by applying max_pooling and average_pooling in turn to the comment text feature vector, where max_pooling and average_pooling both represent convolution kernels.

The invention also provides a self-adaptive comment emotion analysis system fused with text information, which comprises:

the data source and scale determining unit is used for determining the data source and scale;

the data preprocessing unit is used for preprocessing the data;

the feature vector extraction unit is used for extracting feature vectors according to the preprocessed data;

the relevance analysis unit is used for carrying out relevance analysis on the extracted feature vectors to obtain weighted text vectors;

and the decision unit is used for performing convolution operation on the weighted text vector and the comment feature compressed vector to finish the final comment classification.

Further, the feature vector extraction unit is configured to perform feature vector extraction according to the preprocessed data, and includes:

vectorizing the preprocessed comment text to obtain the sentence vector representations corresponding to the comment text; further encoding these sentence vector representations and extracting features from them to obtain the feature vector of the comment text;

vectorizing the compressed body text to obtain the sentence vector representations corresponding to the body text; and further encoding these sentence vector representations and extracting features from them to obtain the feature vectors of the body text.

Further, the relevance analysis unit is configured to perform relevance analysis on the extracted feature vectors and obtain the weighted text vector, and includes:

calculating the raw relevance $r_{ij}$ between each sentence feature vector of the compressed body text and the comment feature vector: $r_{ij} = c_i \cdot s_j$, where $c_i$ represents the i-th comment feature vector and $s_j$ represents the feature vector of the j-th body text sentence;

calculating the relevance $R_{ij}$ of the i-th comment feature vector to each sentence j in the body text by softmax normalization of the raw relevance: $R_{ij} = \exp(r_{ij}) / \sum_k \exp(r_{ik})$;

calculating the weighted text vector $V_i$: $V_i = \sum_j R_{ij} s_j$.

The invention has the beneficial effects that:

1. The invention introduces body text information while overcoming the difficulties that LDA model training is complex and requires a large amount of manual supervision. During comment sentiment analysis, body text information is introduced by means of deep learning, which avoids the manual supervision that LDA requires and gives the model some ability to discover and extract features from previously unseen text categories. The model can discover new topics to a certain degree and can automatically match the most relevant body text information to different comments on the same body text, which addresses LDA's inability to work at fine granularity;

2. When introducing body text information, the invention obtains the relevance-weighted feature vector of the body text by calculating the relevance between each body text sentence and the comment, so that for different comments the model can adaptively extract the body text features most relevant to that comment. The method therefore has better feature extraction capability for body texts that contain several fine-grained topics.

Additional features and advantages of the invention will be set forth in the description which follows, and in part will be obvious from the description, or may be learned by practice of the invention. The objectives and other advantages of the invention will be realized and attained by the structure particularly pointed out in the written description and claims hereof as well as the appended drawings.

Drawings

In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings used in the description of the embodiments or the prior art will be briefly described below, and it is obvious that the drawings in the following description are some embodiments of the present invention, and those skilled in the art can also obtain other drawings according to the drawings without creative efforts.

FIG. 1 shows a flow diagram of an adaptive comment sentiment analysis method fusing text information in an embodiment of the present invention;

FIG. 2 shows a specific flowchart of an adaptive comment emotion analysis method for fusing text information in an embodiment of the present invention;

FIG. 3 shows a specific flow diagram of the relevance analysis in the embodiment of the present invention.

Detailed Description

In order to make the objects, technical solutions and advantages of the embodiments of the present invention clearer, the technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are some, but not all, embodiments of the present invention. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.

Fig. 1 shows a schematic flow chart of an adaptive comment emotion analysis method fusing text information in an embodiment of the present invention, where in fig. 1, the method includes the following steps:

step a, determining a data source and scale;

step b, preprocessing the data;

step c, extracting feature vectors from the preprocessed data;

Specifically, all data used by the invention are microblog comments and the corresponding body text; about 1 million comments and their corresponding body texts were obtained by web crawling. Part of the data (about 300,000 items) was labeled manually, and subsequent model training is completed using this labeled portion. The final labeled data format is a triple: (body text, comment, comment classification), where the comment classification is obtained by analyzing the body text data together with the comment text data.
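As an illustration only, one labeled item could be held in a small record type like the one below. This is a sketch under assumptions: the field names, the `LABELS` set (taken loosely from the attitudes mentioned in the Background) and the `group_by_body` helper are all hypothetical and do not appear in the source.

```python
from collections import defaultdict
from dataclasses import dataclass

# Hypothetical label set; the source does not enumerate the classes.
LABELS = ("support", "objection", "neutral")


@dataclass(frozen=True)
class LabeledSample:
    body_text: str  # the microblog post being commented on
    comment: str    # one comment left under that post
    label: str      # manually annotated comment classification

    def __post_init__(self):
        # Reject labels outside the assumed label set.
        if self.label not in LABELS:
            raise ValueError(f"unknown label: {self.label!r}")


def group_by_body(samples):
    """Group samples by body text, since many comments share one post."""
    groups = defaultdict(list)
    for sample in samples:
        groups[sample.body_text].append(sample)
    return dict(groups)
```

Grouping by body text is a convenience for later steps, where every comment under a post is compared against the same set of body text sentences.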

Fig. 2 shows a specific flow diagram of an adaptive comment emotion analysis method fusing text information in the embodiment of the present invention. In fig. 2, the preprocessing of data in step b specifically includes stop word filtering of the body text and the comment text, and length compression of the body text: the body text and the comment text are segmented with a Chinese word segmenter, and stop words are filtered from the segmented result using the Harbin Institute of Technology (HIT) stop word list; the TextRank algorithm module in TextRank4ZH is then called to select the key sentences of the body text (taking the top 30), which completes the length compression of the body text. The main purpose is to avoid excessively slow training caused by overly long microblog body texts. TextRank is a commonly used algorithm for extracting keywords and key sentences.
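The two preprocessing operations above can be sketched as follows. This is not the TextRank4ZH implementation: it is a simplified, pure-Python TextRank over already segmented sentences (lists of tokens), using the word-overlap similarity from the original TextRank formulation. The function names, the damping factor, the iteration count and the choice to keep the selected sentences in their original order are assumptions made for illustration.

```python
import math


def filter_stop_words(tokens, stop_words):
    """Drop tokens that appear in the stop word list."""
    return [t for t in tokens if t not in stop_words]


def sentence_similarity(a, b):
    """Word-overlap similarity between two tokenized sentences."""
    if len(a) < 2 or len(b) < 2:
        return 0.0  # avoids a zero denominator for one-word sentences
    overlap = len(set(a) & set(b))
    return overlap / (math.log(len(a)) + math.log(len(b)))


def textrank_scores(sentences, damping=0.85, iterations=50):
    """PageRank-style power iteration over the sentence similarity graph."""
    n = len(sentences)
    weights = [[sentence_similarity(sentences[i], sentences[j]) if i != j else 0.0
                for j in range(n)] for i in range(n)]
    out_sum = [sum(row) for row in weights]
    scores = [1.0] * n
    for _ in range(iterations):
        new_scores = []
        for i in range(n):
            rank = sum(weights[j][i] / out_sum[j] * scores[j]
                       for j in range(n) if out_sum[j] > 0)
            new_scores.append((1 - damping) + damping * rank)
        scores = new_scores
    return scores


def compress_body(sentences, top_k=30):
    """Keep the top_k highest-scoring sentences, in their original order."""
    if len(sentences) <= top_k:
        return list(sentences)
    scores = textrank_scores(sentences)
    ranked = sorted(range(len(sentences)), key=lambda i: -scores[i])
    keep = sorted(ranked[:top_k])
    return [sentences[i] for i in keep]
```

A body text that already has at most `top_k` sentences is returned unchanged, which matches the stated goal of only shortening overly long posts.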

In step c, extracting feature vectors from the preprocessed data comprises extracting feature vectors of the body text and of the comment text respectively.

Specifically, when extracting feature vectors from the comment text: the preprocessed comment text data is vectorized (embedded) using Baidu's open-source ERNIE model to obtain the corresponding sentence vector representations; a BiLSTM then further encodes these representations and extracts features from them, and the encoded output is denoted as the sentence's vector representation $V_i$ (at this point $V_i$ has dimension seq_length × embedding_size); for each sentence's vector representation $V_i$, the output at the last time step is taken as the feature vector of the comment text (with dimension 1 × embedding_size), and this vector is used in the subsequent relevance analysis.

BiLSTM and time step: a bidirectional Long Short-Term Memory network is a commonly used recurrent neural network for processing sequential, time-dependent data such as text. Each word in the text is one time step. In general, the output of the BiLSTM at the last time step contains information about the entire sequence.

When extracting feature vectors from the body text: the compressed body text is vectorized to obtain the corresponding sentence vector representations, which are then further encoded and have features extracted from them to obtain the feature vectors of the body text. Unlike a comment, the body text is usually much longer (it is limited to 30 sentences after TextRank preprocessing), so during sentence vector feature extraction the sentence vector representations of these 30 sentences are extracted one by one as the feature vectors of the body text, and these vectors are used in the subsequent relevance analysis.

In step d, performing relevance analysis on the extracted feature vectors to obtain the weighted text vector means finding the information that links the comment to the body text: body text features relevant to the comment are extracted according to the relevance between each body text sentence and the comment, and are used to assist the final classification. Fig. 3 shows a specific flow diagram of the relevance analysis in the embodiment of the present invention.

In FIG. 3, $c_i$ is defined as the sentence vector of the i-th comment under an article, and $s_j$ as the sentence vector of the j-th sentence of the article's body text, where $c_i$ and $s_j$ come respectively from the comment and body text vector representations extracted in module two.

The raw relevance between the i-th comment and the j-th body text sentence is calculated as: $r_{ij} = c_i \cdot s_j$;

For the i-th comment, its relevance $R_{ij}$ to each sentence j in the body text is defined as follows:

the raw relevance between each sentence and each comment is computed and then normalized into probabilities with softmax, which gives the relevance weight vector: $R_{ij} = \exp(r_{ij}) / \sum_k \exp(r_{ik})$;

finally, according to the relevance $R_{ij}$, the sentence vectors of the body text are weighted and summed, so that the body text vector for comment i is: $V_i = \sum_j R_{ij} s_j$.
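The relevance weighting described above (dot product, softmax normalization, weighted sum) can be sketched in a few lines. This is a minimal pure-Python illustration with vectors as plain lists; the function names are hypothetical, and a real model would perform the same arithmetic on batched tensors.

```python
import math


def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def softmax(xs):
    """Numerically stable softmax over a list of scores."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def weighted_body_vector(comment_vec, sentence_vecs):
    """Return (R_i, V_i) for one comment c_i and the body text sentences s_j."""
    raw = [dot(comment_vec, s) for s in sentence_vecs]  # r_ij = c_i . s_j
    weights = softmax(raw)                              # R_ij
    dim = len(comment_vec)
    weighted = [sum(w * s[k] for w, s in zip(weights, sentence_vecs))
                for k in range(dim)]                    # V_i = sum_j R_ij * s_j
    return weights, weighted
```

Because the weights are recomputed for every comment, two comments under the same post can draw on different body text sentences, which is the adaptive behavior the description relies on.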

In step e, the comment feature compressed vector is obtained by applying max_pooling and average_pooling in turn to the comment text feature vector, where max_pooling and average_pooling both represent convolution kernels; they compress the features and extract either the most salient features (max_pooling) or the more general features (average_pooling).
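To make the two pooling operations concrete, the sketch below pools a seq_length × embedding_size matrix down to a single vector of length embedding_size. The source does not state the pooling axis or window, nor how the three rows of the compressed comment vector are assembled, so pooling over the whole time-step axis is an assumption here, and the function names are illustrative.

```python
def max_pooling(matrix):
    """Column-wise maximum over time steps: keeps the most salient value per feature."""
    return [max(column) for column in zip(*matrix)]


def average_pooling(matrix):
    """Column-wise mean over time steps: keeps the typical value per feature."""
    return [sum(column) / len(column) for column in zip(*matrix)]
```

For example, pooling the two-step matrix `[[1, 4], [3, 2]]` gives `[3, 4]` with `max_pooling` and `[2.0, 3.0]` with `average_pooling`.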

Specifically, in step e, performing a convolution operation on the weighted text vector and the comment feature compressed vector to complete the final comment classification includes the following steps: the comment feature compressed vector (3 × embedding_size) and the weighted text vector (1 × embedding_size) are concatenated into a vector of dimension 4 × embedding_size; convolution operations are performed on the concatenated feature vector using different convolution kernels, and the convolution results are concatenated as the input of the fully connected layer; the fully connected layer receives this input and completes the final classification.
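The decision step can be sketched end to end as below. This is an illustration under assumptions, not the trained model: kernels are taken to span the full embedding width and slide over the four rows, no activation function is applied, the fully connected layer ends in a softmax, and the weights are random placeholders. All function and parameter names are hypothetical.

```python
import math
import random


def softmax(xs):
    """Numerically stable softmax over a list of scores."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def conv_over_rows(matrix, kernel):
    """Slide an (h x E) kernel down the rows of a matrix; one scalar per position."""
    h = len(kernel)
    outputs = []
    for start in range(len(matrix) - h + 1):
        window = matrix[start:start + h]
        outputs.append(sum(w * x
                           for k_row, x_row in zip(kernel, window)
                           for w, x in zip(k_row, x_row)))
    return outputs


def random_matrix(rows, cols, rng):
    """Placeholder weights; a trained model would supply learned values."""
    return [[rng.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]


def classify(comment_compressed, weighted_body, kernels, fc_weights, fc_bias):
    """Concatenate (3 x E) and (1 x E), convolve, then apply a fully connected layer."""
    stacked = [list(row) for row in comment_compressed] + [list(weighted_body)]
    features = []
    for kernel in kernels:  # different kernel heights give different views
        features.extend(conv_over_rows(stacked, kernel))
    logits = [sum(w * f for w, f in zip(row, features)) + b
              for row, b in zip(fc_weights, fc_bias)]
    return softmax(logits)
```

With kernel heights 2, 3 and 4 over the four stacked rows, the convolutions produce 3 + 2 + 1 = 6 features, so each row of `fc_weights` needs six entries; the number of output classes is set by the number of rows.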

CNN and convolution: convolutional neural networks, a commonly used feature extractor, accomplish specific feature extraction objectives mainly through different convolution kernels.

The invention also provides a self-adaptive comment emotion analysis system fused with text information, which comprises the following components:

the data preprocessing unit is used for preprocessing the data;

Specifically, the feature vector extraction unit is configured to perform feature vector extraction according to the preprocessed data, and includes:

Specifically, the relevance analysis unit is configured to perform relevance analysis on the extracted feature vectors and obtain the weighted text vector, and includes:

calculating the weighted text vector $V_i$: $V_i = \sum_j R_{ij} s_j$.

Specifically, the decision unit receives the weighted text vector (1 × embedding_size) from the relevance analysis unit and the comment vector (3 × embedding_size) from module two, and concatenates the comment feature compressed vector and the weighted text vector into a vector of dimension 4 × embedding_size.

During comment sentiment analysis, body text information is introduced by means of deep learning, which avoids the manual supervision that LDA requires and gives the method some ability to discover and extract features from previously unseen text categories.

When introducing body text information, the invention obtains the relevance-weighted feature vector of the body text by calculating the relevance between each body text sentence and the comment, so that for different comments the model can adaptively extract the body text features most relevant to that comment. The method therefore has better feature extraction capability for body texts that contain several fine-grained topics.

Although the present invention has been described in detail with reference to the foregoing embodiments, it will be understood by those of ordinary skill in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some technical features may be equivalently replaced; and such modifications or substitutions do not depart from the spirit and scope of the corresponding technical solutions of the embodiments of the present invention.

Claims

1. A self-adaptive comment emotion analysis method fused with text information is characterized by comprising the following steps:

step a, determining a data source and scale;

step b, preprocessing the data;

step c, extracting feature vectors from the preprocessed data;

2. The method of adaptive comment sentiment analysis fused with body text information of claim 1 wherein the data includes body text and comment text.

3. The method for analyzing the self-adaptive comment emotion fused with the text information as recited in claim 2, wherein the preprocessing of the data in step b specifically includes filtering stop words from the body text and the comment text, and compressing the length of the body text.

4. The method for analyzing self-adaptive comment emotion fused with text information according to claim 2, wherein extracting feature vectors from the preprocessed data in step c includes extracting feature vectors of the body text and of the comment text respectively.

5. The method for analyzing the self-adaptive comment emotion fused with the text information as recited in claim 4, wherein extracting the feature vectors of the comment text and the body text specifically includes the following steps:

6. The method for analyzing self-adaptive comment emotion fused with text information as claimed in claim 5, wherein said analyzing for relevancy to obtain weighted text vectors in step d specifically includes the steps of:

step d3, calculating the weighted text vector $V_i$: $V_i = \sum_j R_{ij} s_j$.

7. The method for analyzing adaptive comment emotion fused with text information according to claim 6, wherein in step e, the comment feature compressed vector is obtained by applying max_pooling and average_pooling in turn to the comment text feature vector, where max_pooling and average_pooling both represent convolution kernels.

8. An adaptive comment emotion analysis system fused with body text information, the system comprising:

the data preprocessing unit is used for preprocessing the data;

9. The system of claim 8, wherein the feature vector extraction unit is configured to perform feature vector extraction according to the preprocessed data, including:

10. The system of claim 8, wherein the relevance analysis unit is configured to perform relevance analysis on the extracted feature vectors and obtain the weighted text vector, including:

calculating the weighted text vector $V_i$: $V_i = \sum_j R_{ij} s_j$.