CN114911942A - Interpretability text emotion analysis method, system and equipment based on confidence coefficient - Google Patents
- Publication number
- CN114911942A (application CN202210607887.5A)
- Authority
- CN
- China
- Prior art keywords
- confidence
- data
- deep learning
- text
- network
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
- G06F16/353 — Information retrieval of unstructured textual data; clustering; classification into predefined classes
- G06F18/2411 — Classification techniques based on the proximity to a decision surface, e.g. support vector machines
- G06F18/2415 — Classification techniques based on parametric or probabilistic models, e.g. based on likelihood ratio
- G06F40/216 — Natural language analysis; parsing using statistical methods
- G06F40/289 — Phrasal analysis, e.g. finite state techniques or chunking
- G06F40/30 — Semantic analysis
- G06N3/045 — Neural networks; combinations of networks
- G06N3/08 — Neural networks; learning methods
- Y02D10/00 — Energy efficient computing, e.g. low power processors, power management or thermal management
Abstract
The invention discloses a confidence-based interpretable text emotion analysis method, system and device, comprising the following steps: first, data preprocessing is performed on the text data to be analyzed; the processed data are then fed into a deep learning network for classification; next, a confidence segmenter is constructed, a confidence function is defined and a confidence threshold is set, dividing the deep learning network's classification results into a strong-confidence part and a weak-confidence part; according to the strength of confidence, the strong-confidence data keep the deep learning network's classification, while the weak-confidence data are reclassified by an enhanced network; finally, the classification results of the two networks are merged and the final classification result is output. The invention constructs a new network model framework, RTS-CF, which uses RAKE to extract longer keywords quickly, simply and efficiently; the confidence function divides the test set into a strong-confidence part and a weak-confidence part, and the weak-confidence data are reclassified by the enhanced network. This integration method of optimizing a neural network with an enhanced network is strongly interpretable and improves overall classification performance.
Description
Technical Field
The invention belongs to the technical field of text data mining and relates to a text emotion analysis method, system and device, in particular to a strongly interpretable, confidence-based text emotion analysis method, system and device.
Background
With the development of Internet technology and the rise of deep learning, research on text sentiment analysis has become increasingly popular. Such research has practical significance not only for scientific researchers but also for daily life: a government department can guide the development of public opinion by analyzing the sentiment tendency of online discourse, and an e-commerce merchant can learn user preferences by analyzing the sentiment tendency of user reviews. Through deep mining and analysis of texts in various fields, users' interests, hobbies and emotional biases can be better understood.
Currently, common text emotion analysis methods include dictionary-based emotion classification, emotion analysis based on traditional machine learning, and emotion analysis based on deep learning. Deep neural network models achieve remarkable results in emotion classification. Although classifiers based on traditional machine learning are slightly inferior to deep learning methods in classification accuracy, they have advantages in interpretability and time complexity. Integrating a deep learning method with a traditional machine learning method can improve overall classification performance while remaining strongly interpretable, enabling the grasp and understanding of personal emotional tendency; such an analysis and modeling method is rarely used at present and is worth exploring. Moreover, RAKE can quickly extract longer keywords such as technical terms; it is simple, efficient, and effective for text classification.
Disclosure of Invention
The invention aims to provide a strongly interpretable, confidence-based text sentiment analysis method, system and device, which use an integration method with an enhanced model to optimize a deep neural network and improve overall text classification performance.
The method adopts the technical scheme that: a text emotion analysis method based on interpretability of confidence coefficient comprises the following steps:
Step 1: preprocessing the text data to be analyzed;
Step 2: inputting the preprocessed data into a deep learning network for classification;
Step 3: constructing a confidence segmenter, defining a confidence function, setting a confidence threshold, and dividing the deep learning network's classification results into a strong-confidence part and a weak-confidence part;
the confidence function is defined in terms of a preset value d, the mean function mean(·), and the softmax-layer outputs y_1, y_2 of the deep learning network, which are taken respectively as the scores of the strong-confidence and weak-confidence parts, where 0 < y_i < 1 and Σ y_i = 1; each y_i = exp(z_i) / Σ_{j=1}^{n} exp(z_j), where z_i is the output value of the i-th node and the input to softmax, n is the number of output nodes, i.e. the number of classes, and the denominator is the sum over all prediction results;
Step 4: according to the strength of confidence, keeping the deep learning network's classification for strong-confidence data, and reclassifying weak-confidence data with the enhanced network;
Step 5: merging the results of the deep learning network and the enhanced network, and outputting the final classification result.
The technical scheme adopted by the system of the invention is as follows: a text sentiment analysis system based on interpretability of confidence coefficient comprises the following modules:
the module 1 is used for preprocessing data aiming at pre-analysis text data;
the module 2 is used for inputting the preprocessed data into a deep learning network for classification;
the module 3 is used for constructing a confidence segmenter, defining a confidence function, setting a confidence threshold, and dividing the deep learning network's classification results into a strong-confidence part and a weak-confidence part;
the confidence function is defined in terms of a preset value d, the mean function mean(·), and the softmax-layer outputs y_1, y_2 of the deep learning network, which can be taken respectively as the scores of the strong-confidence and weak-confidence parts, where 0 < y_i < 1 and Σ y_i = 1; each y_i = exp(z_i) / Σ_{j=1}^{n} exp(z_j), where z_i is the output value of the i-th node and the input to softmax, n is the number of output nodes, i.e. the number of classes, and the denominator is the sum over all prediction results;
the module 4 is used for keeping the deep learning network's classification for strong-confidence data and reclassifying weak-confidence data with the enhanced network, according to the strength of confidence;
and the module 5 is used for merging the results of the deep learning network and the enhanced network and outputting the final classification result.
The technical scheme adopted by the equipment of the invention is as follows: a text sentiment analysis device based on interpretability of confidence, comprising:
one or more processors;
a storage device storing one or more programs which, when executed by the one or more processors, cause the one or more processors to implement the confidence-based interpretable text sentiment analysis method.
The invention comprises the following technical effects:
(1) The deep learning model R-TextCNN, trained on the whole training set, achieves a remarkable effect on emotion classification.
(2) Extracting keywords with RAKE can capture longer keywords such as technical terms and achieves good results.
(3) Through the confidence function, the test set can be divided into a strong-confidence part and a weak-confidence part, and the traditional machine learning model reclassifies the weak-confidence data.
(4) Parameters are tuned automatically with GridSearchCV to obtain optimized parameters.
(5) The integration method of optimizing the neural network with the enhanced network model is strongly interpretable and can improve overall classification performance.
Drawings
FIG. 1 is a flow chart of a method of an embodiment of the present invention;
FIG. 2 is a diagram of a deep learning network architecture in accordance with an embodiment of the present invention;
FIG. 3 is a diagram of the calculation of the softmax function of an example of the present invention;
FIG. 4 is a diagram of an enhanced network architecture of an embodiment of the present invention;
FIG. 5 is a hyperplane view of an enhanced network of an embodiment of the present invention;
fig. 6 is a diagram of an RTS-CF network architecture according to an example of the present invention.
Detailed Description
To facilitate understanding and implementation by those of ordinary skill in the art, the invention is further described in detail below with reference to the accompanying drawings and embodiments. It should be understood that the embodiments described herein are merely illustrative and explanatory of the invention and do not restrict it.
Educational text mining is a non-negligible area of text mining. Mining a learner's latent feelings and emotional tendencies from simple text can provide reference and basis for personalized teaching, and can help a teacher quickly grasp a learner's situation, including learning attitude and overall progress, facilitating timely answers to questions and timely feedback. As a research hotspot in educational text mining, computing and analyzing learners' emotional tendencies from text not only helps to understand and analyze their latent psychological changes, but also greatly helps diversify and enrich teaching resources and modes. Many online platforms serve as important teaching aids that allow learners to freely publish personal views and subjective feelings and to interact socially with others; text is among the simplest and most common ways of interacting. From the viewpoints a learner publishes, their emotional tendency can be analyzed and their overall learning state known in time, making teacher feedback and intervention possible.
Referring to fig. 1, the text sentiment analysis method based on interpretability of confidence level provided by the present invention includes the following steps:
Step 1: preprocessing the text data to be analyzed;
in this embodiment, the specific implementation of step 1 includes the following substeps:
Step 1.1: organizing the collected text data into the required data type and storing it in a txt file;
Step 1.2: reading the content of the text file and removing spaces and other useless symbols for subsequent use;
In this embodiment, for subsequent classification, the data are processed into a txt file; the text content is read, symbols other than Chinese characters and designated punctuation marks are removed, and the result is stored in a new txt file.
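The cleaning step above can be sketched as follows. This is a minimal sketch, not the patent's implementation: the exact set of "designated punctuation marks" is an assumption (here the full-width period, question mark, exclamation mark and comma), as are the function and file names.

```python
import re

# Keep CJK characters plus an assumed set of designated punctuation marks
# (full-width 。？！，); everything else, including spaces, is removed.
KEEP = re.compile(r"[^\u4e00-\u9fff\u3002\uff1f\uff01\uff0c]")

def clean_text(raw: str) -> str:
    """Remove spaces and symbols other than Chinese text and designated punctuation."""
    return KEEP.sub("", raw)

def preprocess_file(src_path: str, dst_path: str) -> None:
    """Read the collected text, clean each line, and store it in a new txt file."""
    with open(src_path, encoding="utf-8") as f:
        lines = [clean_text(line) for line in f]
    with open(dst_path, "w", encoding="utf-8") as f:
        f.write("\n".join(line for line in lines if line))
```

A line such as "你好, world! 世界。" becomes "你好世界。" — ASCII symbols, Latin letters and spaces are stripped while Chinese text and the kept punctuation survive.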
Step 2: inputting the preprocessed data into a deep learning network for classification;
referring to fig. 2, the deep learning network R-TextCNN of the present embodiment includes a RAKE extraction keyword layer, a keyword embedding layer, a convolution layer, a maximum pooling layer, and a fully connected softmax layer;
the RAKE keyword-extraction layer of the present embodiment is a method for quickly and automatically extracting keywords. The text is divided into sentences at designated punctuation marks such as periods, question marks, exclamation marks and commas; within each clause, stop words act as separators dividing the sentence into phrases, which are the candidate words to be ranked; each phrase consists of several words, each word w is assigned a score score(w) = deg(w) / freq(w), and the score of a phrase is obtained by accumulating the scores of its words, where deg is the degree of each word, meaning the number of co-occurrences with all words of the text appearing in the candidate keywords, and freq is the word frequency of each word; the extracted candidate keywords are sorted from largest to smallest score; finally, the top-ranked phrases are output as keywords;
the keyword embedding layer of the embodiment converts the extracted keywords into an embedding representation. The n words mapped to word vectors are concatenated into a sentence. A sentence of length n is represented as x_{1:n} = x_1 ⊕ x_2 ⊕ ... ⊕ x_n, where x_i ∈ R^k is the k-dimensional word vector corresponding to the i-th word in the sentence, ⊕ is the concatenation operation, and x_{i:i+j} denotes the concatenation of the words x_i, x_{i+1}, ..., x_{i+j};
the convolution layer of this embodiment applies a convolution kernel w of width d and height h to x_{i:i+h-1} (h words) and obtains the corresponding feature c_i through the activation function, so the convolution operation is written c_i = f(w · x_{i:i+h-1} + b), where w is the initialized weight, b is the bias term, and h is the filter window length; after the convolution operation, an (n−h+1)-dimensional vector c = [c_1, c_2, ..., c_i, ..., c_{n−h+1}] is obtained, where n is the number of words in each sentence;
in the max-pooling layer of this embodiment, the maximum of each one-dimensional vector obtained after convolution is taken, and the maxima are spliced together as the layer's output: z = {z_1, z_2, z_3, ..., z_i, ..., z_m}, where z_i = max{c_i};
The fully connected softmax layer of this embodiment feeds z into the fully connected softmax layer and obtains the label probability distribution of a sentence, y_i = softmax(w_i · z), where y_i is the predicted score corresponding to label_i, w_i is the weight of the fully connected layer, and label_i is the i-th category label.
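The forward pass through the convolution, max-pooling and softmax layers can be sketched numerically. This is a one-filter toy sketch with random weights, assuming ReLU as the activation f — the patent does not name the activation function, and the real R-TextCNN uses many filters of several widths.

```python
import numpy as np

rng = np.random.default_rng(0)

def textcnn_forward(x, w, b, wf):
    """One-filter sketch of the layers above: convolution c_i = f(w·x_{i:i+h-1} + b),
    max-over-time pooling, then a fully connected softmax layer."""
    n, k = x.shape                 # n words, each a k-dim word vector
    h = w.shape[0]                 # filter window length
    c = np.array([max(0.0, float(np.sum(w * x[i:i + h]) + b))   # ReLU activation (assumed)
                  for i in range(n - h + 1)])                   # c has n-h+1 entries
    z = np.array([c.max()])        # max-pooling: keep the largest feature
    logits = wf @ z                # fully connected layer
    e = np.exp(logits - logits.max())   # numerically stable softmax
    return e / e.sum()

x = rng.normal(size=(7, 5))    # a 7-word sentence with 5-dim word vectors
w = rng.normal(size=(3, 5))    # one convolution filter with h = 3
wf = rng.normal(size=(2, 1))   # fully connected weights for 2 emotion classes
y = textcnn_forward(x, w, 0.1, wf)
```

The output y is a valid two-class probability distribution, matching the constraints 0 < y_i < 1 and Σ y_i = 1 stated for the softmax layer.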
The deep learning network adopted by the embodiment is a trained deep learning network; the training process comprises the following substeps:
(1) collecting a training text data set, and dividing the texts and labels into a training set and a test set according to a sample proportion;
This embodiment divides the data set into a training set and a test set through the train_test_split() function, setting the sample proportion test_size. For example, with 100 samples and test_size = 0.2, the training set holds 80% (80 samples) and the test set 20% (20 samples).
(2) creating an embedding matrix, obtaining embedding vectors through the embedding index, assigning them to the embedding matrix, and loading the pre-trained word embeddings into the embedding layer;
(3) training a deep learning network by using a training set;
(4) after training, saving the deep learning network for predicting and classifying the test set.
Step 3: constructing a confidence segmenter, defining a confidence function, setting a confidence threshold, and dividing the deep learning network's classification results into a strong-confidence part and a weak-confidence part;
the confidence function used in the present embodiment is defined in terms of d, the number of trainings of the deep learning network: a minimum interval of iteration counts is preset, and starting from the iteration count at which the loss function of the network's first training stabilizes, one model is trained for testing each time an iteration interval is added; for example, if the minimum interval is 5, the reference iteration count is 50 and the training count d is 3, the deep learning network is trained and tested at 55, 60 and 65 iterations respectively. mean(·) is the mean function; y_1, y_2 are the output values of the deep learning network's softmax layer, which can be taken respectively as the scores of the strong-confidence and weak-confidence parts, where 0 < y_i < 1 and Σ y_i = 1; each y_i = exp(z_i) / Σ_{j=1}^{n} exp(z_j), where z_i is the output value of the i-th node and the input to softmax, n is the number of output nodes, i.e. the number of classes, and the denominator is the sum over all prediction results;
referring to fig. 3, the softmax function adopted in the present embodiment, also called the normalized exponential function, is the generalization of the binary sigmoid function to multiple classes, intended to express the multi-class result in probability form. The calculation comprises the following sub-steps:
(1) converting the prediction results into non-negative numbers: the model's prediction results z = {z_1, z_2, ..., z_i, ..., z_n} are passed through the exponential function f(x) = exp(x), guaranteeing the non-negativity of the probabilities.
(2) making the probabilities of the prediction results sum to 1: to ensure the probabilities sum to 1, the converted results are normalized, dividing each converted result exp(z_i) by the sum of all converted results, Σ_{j=1}^{n} exp(z_j), to obtain the approximate probability y_i = exp(z_i) / Σ_{j=1}^{n} exp(z_j).
In the embodiment, after the softmax layer yields the two class scores, a confidence function is defined that divides the data into two kinds according to the strength of confidence: strong-confidence data, where the two class scores differ greatly and classification works well; and weak-confidence data, where the two class scores differ little and classification is poor.
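The splitting step can be sketched as follows. The exact confidence function is not reproduced legibly in the source, so taking the confidence score as mean(|y_1 − y_2|) over the d trained models is an assumed reading consistent with the description above (a large score gap means strong confidence); the shapes and threshold are illustrative.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - np.max(z))   # subtract max for numerical stability
    return e / e.sum()          # probabilities: 0 < y_i < 1, sum = 1

def split_by_confidence(logit_runs, threshold):
    """Split samples into strong- and weak-confidence index lists.

    `logit_runs` has shape (d, num_samples, 2): softmax inputs from the d
    trained models. The score mean(|y1 - y2|) over the d runs is an ASSUMED
    reading of the patent's confidence function.
    """
    probs = np.apply_along_axis(softmax, 2, np.asarray(logit_runs, dtype=float))
    gap = np.abs(probs[..., 0] - probs[..., 1]).mean(axis=0)   # mean over d runs
    strong = [i for i, g in enumerate(gap) if g >= threshold]
    weak = [i for i, g in enumerate(gap) if g < threshold]
    return strong, weak
```

A sample with logits (5, −5) has near-certain softmax scores and lands in the strong-confidence list, while logits (0.1, 0.0) give nearly equal scores and land in the weak-confidence list.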
Step 4: according to the strength of confidence, keeping the deep learning network's classification for strong-confidence data, and reclassifying weak-confidence data with the enhanced network;
referring to fig. 4, classification by the enhanced network in this embodiment comprises setting a tuning starting point, GridSearchCV parameter search, SVM training, and classification output;
for the tuning starting point, the penalty parameter C and the kernel parameter gamma are set between 0.1 and 100 and multiplied by 0.1 or 10 at each step, according to the performance of the enhanced network model; after the approximate range is determined, the search interval is refined;
GridSearchCV in this embodiment tries every possibility in the refined search interval by loop traversal, and the best-performing parameters are the final result. Since the final performance depends heavily on how the initial data are divided, cross-validation is adopted to reduce chance;
after parameter tuning, SVC in sklearn.svm is called to train the enhanced network model, with the previously tuned parameters set during training, finally yielding a trained enhanced network model;
the classification result of this embodiment is obtained by loading the trained enhanced network model and using the trained SVM to predict the classes of the weak-confidence data.
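The tuning-and-training loop above can be sketched with scikit-learn, which the embodiment names. A minimal sketch: the synthetic data stand in for the weak-confidence features, and only the coarse ×10 grid pass is shown (the refined second pass is omitted).

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

# Coarse grid on C and gamma over 0.1..100 in x10 steps, per the tuning scheme.
param_grid = {"C": [0.1, 1, 10, 100], "gamma": [0.1, 1, 10, 100]}

X, y = make_classification(n_samples=120, n_features=8, random_state=0)
search = GridSearchCV(SVC(), param_grid, cv=5)   # cross-validation reduces chance
search.fit(X, y)
best_svm = search.best_estimator_                # SVC trained with the best C, gamma
```

best_svm can then be saved and loaded to predict the classes of the weak-confidence data.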
Please refer to fig. 5, a hyperplane view of the enhanced network of the present embodiment;
in this embodiment, a maximum-margin hyperplane is sought in the feature space such that the distance from the samples to the plane is maximized (the distance from a sample set to the plane is the distance from the nearest sample point to the hyperplane). The learning objective is to solve for the parameters α, determine the hyperplane, and maximize this distance. The parameters α are solved with the SMO algorithm: two α values are selected for optimization in each loop; once a pair of α outside the interval boundary and not yet processed, or not on the boundary, is found, one is increased while the other is decreased, until all α_i satisfy the KKT conditions and the constraints of the optimization problem.
The classification implementation process is further explained below.
Given a sample set D = {(x_1, y_1), (x_2, y_2), ..., (x_m, y_m)}, y_i ∈ {−1, +1}, where x_i are the attributes and y_i is the class label, the purpose is to find an optimal hyperplane (with the strongest generalization ability) separating the samples of different classes.
The object to be trained is the hyperplane w_s^T x + b_s = 0, where w_s is the normal vector and b_s is the displacement term.
The sample points for which the equality in the margin constraints holds are called "support vectors", and the sum of the distances from two heterogeneous support vectors to the hyperplane is γ = 2 / ||w_s||, which is called the "margin".
Finding the hyperplane with "maximum margin" means maximizing 2 ||w_s||^{-1} subject to y_i (w_s^T x_i + b_s) ≥ 1, i = 1, ..., m; since maximizing ||w_s||^{-1} is equivalent to minimizing ||w_s||^2, this is rewritten as min_{w_s, b_s} (1/2) ||w_s||^2 subject to y_i (w_s^T x_i + b_s) ≥ 1 — the "basic model" of the SVM.
Solving this equation yields the model f(x) = w_s^T x + b_s.
Adding a Lagrange multiplier α_i (α_i ≥ 0) to each constraint gives L(w_s, b_s, α) = (1/2) ||w_s||^2 + Σ_{i=1}^{m} α_i (1 − y_i (w_s^T x_i + b_s)).
Setting the partial derivatives of L with respect to w_s and b_s to 0 yields w_s = Σ_{i=1}^{m} α_i y_i x_i and Σ_{i=1}^{m} α_i y_i = 0.
Substituting these back gives the dual problem of the SVM basic model: max_α Σ_{i=1}^{m} α_i − (1/2) Σ_{i=1}^{m} Σ_{j=1}^{m} α_i α_j y_i y_j x_i^T x_j, subject to Σ_{i=1}^{m} α_i y_i = 0 and α_i ≥ 0.
Computing w_s (i.e. solving for α) and b_s yields the model f(x) = Σ_{i=1}^{m} α_i y_i x_i^T x + b_s.
The above process must satisfy the KKT conditions.
The SMO algorithm is used to obtain α, and b_s is obtained using the property of the support vectors.
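The identities from the derivation above can be checked numerically with scikit-learn's SVC, whose libsvm backend solves the same dual with SMO; the toy points are illustrative assumptions.

```python
import numpy as np
from sklearn.svm import SVC

# A small linearly separable two-class set (illustrative toy data).
X = np.array([[0.0, 0.0], [1.0, 1.0], [2.0, 0.0],
              [3.0, 3.0], [4.0, 2.0], [5.0, 5.0]])
y = np.array([-1, -1, -1, 1, 1, 1])

clf = SVC(kernel="linear", C=10.0).fit(X, y)

# dual_coef_ holds alpha_i * y_i for each support vector, so the normal
# vector recovered from the dual, w_s = sum_i alpha_i y_i x_i, must match
# the fitted coef_, and the entries must sum to 0 (the constraint
# sum_i alpha_i y_i = 0 from setting dL/db_s = 0).
w_from_dual = clf.dual_coef_ @ clf.support_vectors_
```

Checking w_from_dual against clf.coef_ and the zero-sum constraint confirms the two formulas obtained from the partial derivatives of L.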
In this embodiment, the weak-confidence data are reclassified using a traditional machine learning method as the enhanced model; traditional machine learning methods are characterized by strong interpretability.
Step 5: merging the results of the deep learning network and the enhanced network, and outputting the final classification result.
Please refer to fig. 6, which shows a structure diagram of the RTS-CF network;
in this embodiment, first, the data type and content of the text data are processed; second, RAKE extracts keywords, which pass in turn through the keyword embedding layer, the convolution layer, the max-pooling layer and the fully connected softmax layer for classification; then the data enter the confidence segmenter, where the confidence function divides the results into a strong-confidence part and a weak-confidence part, and the corresponding texts and labels are found through indices, yielding a strong-confidence data list and a weak-confidence data list; next, the strong-confidence data are classified by the deep learning network and the weak-confidence data by the enhanced network; finally, the classification results of the two networks are merged through the concatenate() function to obtain the final prediction result.
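The final merge of the two networks' outputs can be sketched as follows. A minimal sketch under the assumption that each sample's index from the confidence split is retained, so the merged predictions can be restored to the original order; the function name is hypothetical.

```python
import numpy as np

def merge_predictions(n_samples, strong_idx, strong_pred, weak_idx, weak_pred):
    """Recombine the deep network's strong-confidence predictions with the
    enhanced network's reclassified weak-confidence predictions, restoring
    the original sample order."""
    merged = np.empty(n_samples, dtype=int)
    merged[list(strong_idx)] = strong_pred   # labels kept from the deep network
    merged[list(weak_idx)] = weak_pred       # labels reclassified by the SVM
    return merged
```

For example, with samples 0 and 2 classified by the deep network and samples 1 and 3 reclassified by the enhanced network, the merged vector interleaves the two result lists back into positions 0..3.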
The method of the invention performs emotion classification on texts published by individuals. First, the data are loaded and preprocessed; the deep learning network model (existing models such as TextCNN or RNN may also be adopted) is trained with the whole training data and classifies the test data; a confidence segmenter is constructed and a confidence function defined, dividing the deep learning network model's classification results into a strong-confidence part and a weak-confidence part; according to the strength of confidence, the strong-confidence data keep the deep learning network model's classification, while the weak-confidence data are reclassified by the enhanced network model (existing models such as naive Bayes, or an SVM with naive Bayes features, may also be adopted), the enhanced model being a traditional machine learning model; finally, the results of the deep learning network model and the enhanced network model are merged and the final classification result is output. The invention can obtain the emotional tendency of texts published by an individual and learn the individual's topics of interest. It adopts an integration of deep learning and traditional machine learning methods, aiming to improve overall classification performance and to grasp and understand personal emotional tendency. Using RAKE to extract keywords is simple and efficient, can extract longer keywords such as technical terms, and, being an unsupervised method, requires no large amount of labeled data. Future work may try other effective confidence functions and apply the framework to other models to study its effectiveness and applicability.
It should be understood that the above description of the preferred embodiments is given for clarity and not for any purpose of limitation, and that various changes, substitutions and alterations can be made herein without departing from the spirit and scope of the invention as defined by the appended claims.
Claims (7)
1. A text emotion analysis method based on interpretability of confidence coefficient is characterized by comprising the following steps:
step 1: performing data preprocessing on the text data to be analyzed;
step 2: inputting the preprocessed data into a deep learning network for classification;
step 3: constructing a confidence splitter, defining a confidence function, setting a confidence threshold, and dividing the deep learning network classification results into a high-confidence part and a low-confidence part;
the confidence function compares the softmax outputs, where d is a preset value and mean(·) is a mean function; y_1 and y_2 represent output values of the softmax layer of the deep learning network and are taken as the strong-confidence and weak-confidence scores respectively, where y_i = e^(z_i) / (e^(z_1) + … + e^(z_n)); z_i is the output value of the i-th node and the input value of softmax; n is the number of output nodes, i.e. the number of classes to be distinguished; the denominator e^(z_1) + … + e^(z_n) represents the sum over all predicted results;
step 4: according to the strength of the confidence scores, classifying the strong-confidence data by the deep learning network, and reclassifying the weak-confidence data by the enhancement network;
step 5: combining the results of the deep learning network and the enhancement network, and outputting the final classification result.
2. The method for text sentiment analysis based on confidence interpretability according to claim 1, wherein: in the data preprocessing of step 1, the acquired text data are first arranged into the required data type and stored in a txt file; the content of the text file is then read, and spaces and other useless symbols are removed for subsequent use.
3. The method for text sentiment analysis based on confidence interpretability according to claim 1, wherein: the deep learning network R-TextCNN in the step 2 comprises a RAKE extraction keyword layer, a keyword embedding layer, a convolution layer, a maximum pooling layer and a fully-connected softmax layer;
the RAKE keyword extraction layer divides the text into several sentences using the specified punctuation marks; each sentence is further divided into several phrases using stop words as separators, and these phrases are the candidate words to be ranked; each phrase is composed of several words, each word is assigned a score wordScore = deg/freq, and the score of each phrase is obtained by summing the scores of its words, where deg refers to the co-occurrence degree of a word with all words of the candidate keywords in the text and freq refers to the word frequency of each word; the extracted candidate keywords are sorted from largest to smallest; finally, the top-ranked phrases are output as keywords;
the keyword embedding layer converts the extracted keywords into an embedding representation; the n words mapped to word vectors are connected into a sentence; a sentence of length n is represented as x_{1:n} = x_1 ⊕ x_2 ⊕ … ⊕ x_n, where x_i ∈ R^k is the k-dimensional word vector corresponding to the i-th word in the sentence, ⊕ is the connection operation, and x_{i:i+j} represents the connection of the words x_i, x_{i+1}, …, x_{i+j};
the convolution layer applies a convolution kernel w of width d and height h to x_{i:i+h-1}; after the convolution operation, an activation function is applied to obtain the corresponding feature c_i, so the convolution operation is written c_i = f(w · x_{i:i+h-1} + b), where f is the activation function, w is the initialized weight, b is a bias term, and h is the filter window length; after the convolution operation, an (n-h+1)-dimensional vector c is obtained: c = [c_1, c_2, …, c_i, …, c_{n-h+1}], where n is the number of words in each sentence;
the max-pooling layer takes the maximum value of each one-dimensional vector obtained after convolution and splices them together as the output value of the layer: z = {z_1, z_2, z_3, …, z_i, …, z_m}, where z_i = max{c_i};
4. The method for text sentiment analysis based on confidence interpretability according to claim 1, wherein: the reclassification by the enhancement network in step 4 comprises setting a parameter-tuning starting point, GridSearchCV, training the SVM, and producing the classification result;
to set the tuning starting point, the penalty parameter C and the kernel parameter gamma are first set between 0.1 and 100, multiplying by 0.1 or 10 each time as the step according to the performance of the enhancement network model; after the approximate range is determined, the search interval is refined;
within the refined search interval, GridSearchCV tries every possible parameter combination by exhaustive traversal, and the best-performing parameters are taken as the final result;
after parameter tuning, SVC in sklearn.svm is called to train the enhancement network model, with the parameters obtained from the preceding tuning set during training, finally yielding a trained enhancement network model;
for the classification result, the trained enhancement network model is loaded, and the weak-confidence data are predicted and classified using the trained SVM to obtain the classification result.
5. The method for text emotion analysis based on interpretability of confidence according to any one of claims 1 to 4, wherein: in step 5, the classification results of the two networks are merged through a concatenate() function to obtain the final prediction result.
6. A text sentiment analysis system based on interpretability of confidence coefficient is characterized by comprising the following modules:
the module 1 is used for performing data preprocessing on the text data to be analyzed;
the module 2 is used for inputting the preprocessed data into a deep learning network for classification;
the module 3 is used for constructing a confidence splitter, defining a confidence function, setting a confidence threshold, and dividing the deep learning network classification results into a high-confidence part and a low-confidence part;
the confidence function compares the softmax outputs, where d is a preset value and mean(·) is a mean function; y_1 and y_2 represent output values of the softmax layer of the deep learning network and are taken as the strong-confidence and weak-confidence scores respectively, where y_i = e^(z_i) / (e^(z_1) + … + e^(z_n)); z_i is the output value of the i-th node and the input value of softmax; n is the number of output nodes, i.e. the number of classes to be distinguished; the denominator e^(z_1) + … + e^(z_n) represents the sum over all predicted results;
the module 4 is used for classifying the data with strong confidence coefficient by the deep learning network according to the strong and weak confidence coefficient, and reclassifying the data with weak confidence coefficient by the enhancement network;
and the module 5 is used for combining the results of the deep learning network and the enhancement network and outputting the final classification result.
7. A text sentiment analysis device based on interpretability of confidence, comprising:
one or more processors;
storage means for storing one or more programs that, when executed by the one or more processors, cause the one or more processors to implement the text sentiment analysis method based on confidence interpretability of the text according to any one of claims 1 to 5.
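The coarse-to-fine grid search described in claim 4 can be sketched in plain Python. In practice, GridSearchCV from sklearn.model_selection would wrap sklearn.svm.SVC as the claim states; the toy score surface below is only a stand-in for cross-validated accuracy, and its peak location is invented for illustration:

```python
import itertools

def log_grid(lo=0.1, hi=100.0, factor=10.0):
    """Coarse candidates for C and gamma: lo, lo*factor, ... up to hi,
    matching the multiply-by-10 step of the initial tuning stage."""
    vals, v = [], lo
    while v <= hi * 1.0000001:  # small tolerance for float drift
        vals.append(round(v, 6))
        v *= factor
    return vals

def grid_search(score_fn, cs, gammas):
    """Exhaustively try every (C, gamma) pair and keep the best,
    mirroring what GridSearchCV does internally."""
    best, best_score = None, float("-inf")
    for c, g in itertools.product(cs, gammas):
        s = score_fn(c, g)
        if s > best_score:
            best, best_score = (c, g), s
    return best, best_score

# toy score surface peaking at C=10, gamma=0.1 (illustrative only)
score = lambda c, g: -((c - 10) ** 2) / 100 - ((g - 0.1) ** 2)
best_params, best = grid_search(score, log_grid(), log_grid())
```

After the coarse pass locates the best region, the refined stage would re-run the same search with a finer step around best_params, as the claim describes.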
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202210607887.5A CN114911942B (en) | 2022-05-31 | 2022-05-31 | Text emotion analysis method, system and equipment based on confidence level interpretability |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202210607887.5A CN114911942B (en) | 2022-05-31 | 2022-05-31 | Text emotion analysis method, system and equipment based on confidence level interpretability |
Publications (2)
Publication Number | Publication Date |
---|---|
CN114911942A true CN114911942A (en) | 2022-08-16 |
CN114911942B CN114911942B (en) | 2024-06-18 |
Family
ID=82770893
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202210607887.5A Active CN114911942B (en) | 2022-05-31 | 2022-05-31 | Text emotion analysis method, system and equipment based on confidence level interpretability |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN114911942B (en) |
Citations (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20080249764A1 (en) * | 2007-03-01 | 2008-10-09 | Microsoft Corporation | Smart Sentiment Classifier for Product Reviews |
CN109597891A (en) * | 2018-11-26 | 2019-04-09 | 重庆邮电大学 | Text emotion analysis method based on two-way length Memory Neural Networks in short-term |
US20210216880A1 (en) * | 2019-01-02 | 2021-07-15 | Ping An Technology (Shenzhen) Co., Ltd. | Method, equipment, computing device and computer-readable storage medium for knowledge extraction based on textcnn |
CN113656548A (en) * | 2021-08-18 | 2021-11-16 | 福州大学 | Text classification model interpretation method and system based on data envelope analysis |
Non-Patent Citations (1)
Title |
---|
WANG Qinglin; LI Han; PANG Liangjian; XU Xinsheng: "Research on Text Sentiment Enhancement Method Based on Global Semantic Learning", Science Technology and Engineering, no. 21, 28 July 2020 (2020-07-28) |
Also Published As
Publication number | Publication date |
---|---|
CN114911942B (en) | 2024-06-18 |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||