CN111339772B - Russian text emotion analysis method, electronic device and storage medium - Google Patents

Russian text emotion analysis method, electronic device and storage medium

Info

Publication number
CN111339772B
Authority
CN
China
Prior art keywords
word
features
attention
layer
emotion analysis
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202010179507.3A
Other languages
Chinese (zh)
Other versions
CN111339772A (en)
Inventor
刘鑫
徐琳宏
祁瑞华
邵林
陈恒
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Dalian University Of Foreign Languages
Original Assignee
Dalian University Of Foreign Languages
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Dalian University Of Foreign Languages filed Critical Dalian University Of Foreign Languages
Priority to CN202010179507.3A priority Critical patent/CN111339772B/en
Publication of CN111339772A publication Critical patent/CN111339772A/en
Application granted granted Critical
Publication of CN111339772B publication Critical patent/CN111339772B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/35Clustering; Classification
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/049Temporal neural networks, e.g. delay elements, oscillating neurons or pulsed inputs

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Biophysics (AREA)
  • Evolutionary Computation (AREA)
  • Biomedical Technology (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Computational Linguistics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Databases & Information Systems (AREA)
  • Machine Translation (AREA)

Abstract

An embodiment of the invention provides a Russian text emotion analysis method, an electronic device and a storage medium. The method comprises: obtaining a Russian text to be analyzed; and inputting the Russian text into an emotion analysis model to obtain the emotion analysis result output by the model. The emotion analysis model performs local feature extraction on the word-level features of each word in the Russian text to obtain the context features of each word, performs sequence feature extraction on the context features of each word based on a self-attention mechanism to obtain the attention sequence features of the Russian text, and performs emotion analysis based on the sentence-level features of the Russian text together with the attention sequence features. The word-level features and sentence-level features are extracted based on Russian text representation rules. The method, electronic device and storage medium provided by the embodiment of the invention combine the advantages of local features and sequence features, thereby improving the accuracy and reliability of Russian text emotion analysis.

Description

Russian text emotion analysis method, electronic device and storage medium
Technical Field
The invention relates to the technical field of natural language processing, in particular to a Russian text emotion analysis method, electronic equipment and a storage medium.
Background
With the rapid development of Internet technology worldwide, online social media has become a main source from which users obtain all kinds of information, and it provides a convenient platform for users to exchange views, discuss current events and express the various positive or negative emotions of daily life.
The resulting mass of short social-media texts allows researchers to use data mining techniques to analyze, commercially, users' satisfaction with a certain product or service, to predict market trends, or, politically, to gauge public sentiment; text emotion analysis techniques have emerged for this purpose.
At present, most text emotion analysis tools are designed and implemented specifically for the characteristics of English. To perform emotion analysis on Russian text, it is therefore often necessary to first translate the Russian into English with a translation engine and then run emotion analysis. This approach is unreliable, however, because emotional and semantic loss is unavoidable during translation, and the linguistic characteristics of Russian itself are ignored during the analysis.
Disclosure of Invention
The embodiment of the invention provides a Russian text emotion analysis method, electronic equipment and a storage medium, which are used for solving the problem of low reliability and accuracy of the conventional Russian text emotion analysis method.
In a first aspect, an embodiment of the present invention provides a russian text emotion analysis method, including:
obtaining a Russian text to be analyzed;
inputting the Russian text into an emotion analysis model to obtain an emotion analysis result output by the emotion analysis model;
the emotion analysis model is used for extracting local features of word level features of each word in the Russian text to obtain context features of each word, extracting sequence features of each word based on a self-attention mechanism to obtain attention sequence features of the Russian text, and performing emotion analysis based on sentence level features of the Russian text and the attention sequence features; the word-level features and the sentence-level features are extracted based on russian text representation rules.
In a second aspect, an embodiment of the present invention provides a russian text emotion analysis device, including:
the text acquisition unit is used for acquiring the Russian text to be analyzed;
the emotion analysis unit is used for inputting the Russian text into an emotion analysis model to obtain an emotion analysis result output by the emotion analysis model;
the emotion analysis model is used for extracting local features of word level features of each word in the Russian text to obtain context features of each word, extracting sequence features of each word based on a self-attention mechanism to obtain attention sequence features of the Russian text, and performing emotion analysis based on sentence level features of the Russian text and the attention sequence features; the word-level features and the sentence-level features are extracted based on russian text representation rules.
In a third aspect, an embodiment of the present invention provides an electronic device, including a processor, a communication interface, a memory, and a bus, where the processor, the communication interface, and the memory are in communication with each other via the bus, and the processor may invoke logic commands in the memory to perform the steps of the method as provided in the first aspect.
In a fourth aspect, embodiments of the present invention provide a non-transitory computer readable storage medium having stored thereon a computer program which, when executed by a processor, implements the steps of the method as provided by the first aspect.
According to the Russian text emotion analysis method, electronic device and storage medium provided by the embodiments of the invention, word-level and sentence-level features are extracted based on Russian text representation rules, which addresses the effect that the peculiarities of Russian text have on feature extraction accuracy; local feature extraction is performed on the word-level emotion features of the text with a convolutional neural network, and sequence feature extraction is performed with a recurrent neural network and a self-attention mechanism, so that the advantages of local features and sequence features are combined; emotion analysis is finally performed on the sentence-level emotion features together with the attention sequence features, improving the accuracy and reliability of Russian text emotion analysis.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions of the prior art, the following description will briefly explain the drawings used in the embodiments or the description of the prior art, and it is obvious that the drawings in the following description are some embodiments of the present invention, and other drawings can be obtained according to these drawings without inventive effort for a person skilled in the art.
FIG. 1 is a schematic flow chart of a Russian text emotion analysis method provided by an embodiment of the invention;
FIG. 2 is a schematic structural diagram of an emotion analysis model according to an embodiment of the present invention;
FIG. 3 is a schematic structural diagram of a local feature extraction layer according to an embodiment of the present invention;
FIG. 4 is a schematic diagram of a structure of an attention layer according to an embodiment of the present invention;
fig. 5 is a schematic structural diagram of an emotion classification layer according to an embodiment of the present invention;
fig. 6 is a schematic structural diagram of a russian text emotion analysis device according to an embodiment of the present invention;
fig. 7 is a schematic structural diagram of an electronic device according to an embodiment of the present invention.
Detailed Description
For the purpose of making the objects, technical solutions and advantages of the embodiments of the present invention more apparent, the technical solutions of the embodiments of the present invention will be clearly and completely described below with reference to the accompanying drawings in the embodiments of the present invention, and it is apparent that the described embodiments are some embodiments of the present invention, but not all embodiments of the present invention. All other embodiments, which can be made by those skilled in the art based on the embodiments of the invention without making any inventive effort, are intended to be within the scope of the invention.
Emotion analysis (Sentiment Analysis, SA), also known as emotion tendency analysis, is a traditional natural language processing task that can be regarded as classifying text with the attitude expressed by the author as the classification criterion, aiming to identify subjective views in unstructured text. Currently, the positive and negative tendencies of Russian text emotion are divided into the three categories "negative", "neutral" and "positive", and the common approaches fall into two groups: dictionary- and rule-based methods, and machine learning methods.
Dictionary-based methods primarily match tagged tokens (words, phrases, etc.) against a vocabulary of known emotion polarities and classify their emotion accordingly. Although classification based on an emotion dictionary reflects the unstructured character of the text, it depends heavily on background knowledge of the domain, era or language; the quality of the match between dictionary and text strongly affects classification accuracy; and the approach struggles with the endless stream of new words in network media and the great variety of word forms in Russian.
Rule-set-based methods rely on language experts who collect and formulate a rule set based on the relations among words, phrases or sentences for the texts of a particular language in a particular domain, and classification is carried out accordingly. Although such methods produce classification results quickly, they depend heavily on the ability and personal experience of language or domain experts. When the data volume is large, the cost of maintaining and extending the classifier's rule set is high, and for cross-domain problems it is difficult to formulate rules that apply to different domains at the same time.
Machine learning based methods can extract valuable emotion features from text by training models. However, traditional machine learning algorithms have difficulty selecting the most suitable emotion features when faced with Russian social-network short texts, whose features are sparse, whose content is brief and whose morphology is complex.
To remedy the shortcomings of the above methods, an embodiment of the invention provides a Russian text emotion analysis method. Fig. 1 is a schematic flow chart of the Russian text emotion analysis method provided by an embodiment of the invention; as shown in Fig. 1, the method includes:
step 110, obtaining the russian text to be analyzed.
Specifically, the Russian text to be analyzed is the Russian text on which emotion analysis is to be performed. The Russian text may be obtained from social media platforms such as Twitter, Weibo or Instagram, or from user comments on shopping websites such as AliExpress; the embodiment of the invention does not specifically limit this.
Step 120, inputting the Russian text into the emotion analysis model to obtain an emotion analysis result output by the emotion analysis model; the emotion analysis model is used for carrying out local feature extraction on word level features of each word in the Russian text to obtain context features of each word, carrying out sequence feature extraction on the context features of each word based on a self-attention mechanism to obtain attention sequence features of the Russian text, and carrying out emotion analysis on the basis of sentence level features of the Russian text and the attention sequence features; word-level features and sentence-level features are extracted based on russian text representation rules.
Specifically, the emotion analysis model is used for performing emotion analysis on the input Russian text and outputting emotion analysis results of the Russian text, wherein the emotion analysis results represent positive and negative tendencies of the Russian text emotion.
The Russian text representation rules are preset rules for extracting word-level and sentence-level features from the Russian text. By presetting the Russian text representation rules with reference to the special forms of expression that Russian social-network short texts adopt to convey strong emotion, the sentence-level features of the Russian text and the word-level features of each word, extracted at both the sentence and word levels, can reflect the emotional characteristics of the Russian text.
After the sentence-level features of the Russian text and the word-level features of each word are obtained based on the Russian text representation rules, the emotion analysis model performs local feature extraction on the word-level features of each word to obtain the context features of each word. Here, the context features of a word include not only the features of the word itself but also the features of its neighboring words. On this basis, the emotion analysis model performs sequence feature extraction on the context features of each word under a self-attention mechanism; the self-attention mechanism highlights the information in the Russian text with marked emotional features and weakens the information without emotional features, so that attention sequence features reflecting the whole Russian text are obtained. Emotion analysis is then performed by combining the attention sequence features with the sentence-level features, yielding the emotion analysis result.
Before step 120 is executed, the emotion analysis model may be trained in advance, specifically in the following manner: first, a large number of sample Russian texts are collected, and the emotion analysis result corresponding to each sample text is labeled by manual analysis; an initial model is then trained on the sample Russian texts and their labeled emotion analysis results, yielding the emotion analysis model.
According to the method provided by the embodiment of the invention, word-level and sentence-level features are extracted based on Russian text representation rules, which addresses the effect that the peculiarities of Russian text have on feature extraction accuracy; local feature extraction is performed on the word-level emotion features of the text with a convolutional neural network, and sequence feature extraction is performed with a recurrent neural network and a self-attention mechanism, so that the advantages of local features and sequence features are combined; finally, emotion analysis is performed on the sentence-level emotion features together with the attention sequence features, thereby improving the accuracy and reliability of Russian text emotion analysis.
Based on the above embodiment, fig. 2 is a schematic structural diagram of an emotion analysis model provided in the embodiment of the present invention, and as shown in fig. 2, the emotion analysis model includes a word-level feature encoding layer, a local feature extraction layer, an attention layer, a sequence feature extraction layer, a sentence-level feature encoding layer, and an emotion classification layer.
Specifically, the word-level feature encoding layer is used for carrying out feature encoding on each word in the inputted Russian text to obtain the word-level feature of each word.
The local feature extraction layer performs local feature extraction based on word-level features of each word to obtain context features of each word.
The attention layer performs attention conversion on the context characteristics of each word to obtain the attention weight of each word.
And the sequence feature extraction layer combines the context feature and the attention weight of each word to extract the sequence features and obtain the attention sequence features.
The sentence-level feature encoding layer is used for feature encoding the inputted Russian text to obtain sentence-level features of the Russian text.
And the emotion classification layer is used for classifying by combining sentence-level features and attention sequence features to obtain emotion analysis results.
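To make the data flow through these six layers concrete, the following is a minimal, hypothetical Python/PyTorch outline of the forward pass. All module choices, dimensions and hyperparameter values (word_dim, sent_dim, n_filters, hidden, n_classes) are assumptions made for illustration, not the patented implementation; the attention scoring is deliberately simplified here and is detailed later in formulas (3)-(4).

```python
import torch
import torch.nn as nn

class ACBMSketch(nn.Module):
    """Hypothetical outline of the emotion analysis model's forward pass (assumed sizes)."""
    def __init__(self, word_dim=330, sent_dim=20, n_filters=128, hidden=128, n_classes=3):
        super().__init__()
        self.local = nn.Conv1d(word_dim, n_filters, kernel_size=3, padding=1)              # local feature extraction layer
        self.att_lstm = nn.LSTM(n_filters, hidden, bidirectional=True, batch_first=True)   # first Bi-LSTM (attention layer)
        self.seq_lstm = nn.LSTM(n_filters, hidden, bidirectional=True, batch_first=True)   # second Bi-LSTM (sequence layer)
        self.att_score = nn.Linear(2 * hidden, 1)                                          # simplified attention scoring
        self.classifier = nn.Linear(2 * hidden + sent_dim, n_classes)                      # emotion classification layer

    def forward(self, word_feats, sent_feats):
        # word_feats: (batch, n_words, word_dim) from the word-level feature encoding layer
        # sent_feats: (batch, sent_dim) from the sentence-level feature encoding layer
        v = torch.relu(self.local(word_feats.transpose(1, 2))).transpose(1, 2)  # context features v_i
        h_att, _ = self.att_lstm(v)                                             # first hidden-layer features
        a = torch.softmax(self.att_score(torch.tanh(h_att)), dim=1)             # attention weights a_i
        h_seq, _ = self.seq_lstm(v)                                             # second hidden-layer features
        r = (a * h_seq).sum(dim=1)                                              # attention sequence feature
        return self.classifier(torch.cat([r, sent_feats], dim=-1))              # emotion logits
```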
Based on any of the above embodiments, step 120 specifically includes:
step 121, inputting the Russian text into the word-level feature encoding layer to obtain the word-level feature of each word output by the word-level feature encoding layer.
The word-level features of each word are obtained by encoding based on the Russian text representation rules. During encoding, whether the word contains letters repeated to convey strong emotion, or deliberately capitalized letters, may be encoded as part of the word-level features; the part of speech, Russian morphology and emotion intensity of the word may be encoded as part of the word-level features; and whether the word is a profanity may be encoded as part of the word-level features, so that the emotion conveyed by the word is reflected in its word-level features.
Step 122, the word-level feature of each word is input to the local feature extraction layer, so as to obtain the context feature of each word output by the local feature extraction layer.
Here, the local feature extraction layer is used to extract the contextual features of the word and take the contextual features as more comprehensive and accurate word features.
Here, the local feature extraction layer adopts a convolutional neural network (CNN). Fig. 3 is a schematic structural diagram of the local feature extraction layer provided by an embodiment of the invention. As shown in Fig. 3, suppose the Russian text contains n words and d is the length of a single word-level feature, so that the word-level features of the text form an n × d matrix. To keep the sequence lengths of the input and output of the CNN layer consistent, a pad row is added before and after the word-level feature matrix, giving a matrix X of size (n+2) × d. The convolution layer contains T convolution kernels of size 3 × d. When the j-th convolution kernel W_j is convolved with the input matrix X, the feature value g_ij (1 ≤ i ≤ n) is obtained as follows:

g_ij = f(W_j ⊙ X_{i:i+2} + b)    (1)

v_i = [g_i1, g_i2, g_i3, ..., g_iT]    (2)

In formula (1), X_{i:i+2} is the local region covering rows i to i+2 of X, ⊙ denotes the convolution product, b is a bias, and f is a nonlinear activation function; in formula (2), v_i is the 3-gram feature vector extracted by the CNN layer around the i-th word. Further, in the embodiment of the invention, the ReLU activation function is taken as f in order to increase the convergence speed.
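One possible PyTorch rendering of formulas (1)-(2) is sketched below. The feature length d and the number of kernels T are assumed placeholder values, and nn.Conv1d's zero padding plays the role of the pad rows added before and after X.

```python
import torch
import torch.nn as nn

d, T = 330, 128                                   # assumed word-level feature length and number of 3 x d kernels
conv = nn.Conv1d(in_channels=d, out_channels=T, kernel_size=3, padding=1)   # padding keeps the length at n

def local_features(word_feats: torch.Tensor) -> torch.Tensor:
    """word_feats: (batch, n, d) word-level features.
    Returns (batch, n, T), i.e. v_i = [g_i1, ..., g_iT] with g_ij = ReLU(W_j * X_{i:i+2} + b)."""
    x = word_feats.transpose(1, 2)                # (batch, d, n) as expected by Conv1d
    g = torch.relu(conv(x))                       # formula (1) with f = ReLU
    return g.transpose(1, 2)                      # formula (2): one T-dimensional 3-gram vector per word
```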
Step 123, the contextual characteristics of each word are input to the attention layer, resulting in the attention weight of each word output by the attention layer.
And 124, inputting the contextual characteristics and the attention weight of each word to the sequence characteristic extraction layer to obtain the attention sequence characteristics output by the sequence characteristic extraction layer.
Specifically, the attention layer determines the attention weight of each word based on a self-attention mechanism, so that words with key emotion information in Russian text are highlighted, words without emotion information or unimportant are weakened, and optimization of model effect is realized.
In the sequence feature extraction layer, sequence feature extraction under the self-attention mechanism is realized by combining the context features and attention weights of each word, yielding the attention sequence features. Here, the sequence feature extraction may be implemented with a long short-term memory network.
Step 125, the russian text is input to the sentence-level feature encoding layer, so as to obtain sentence-level features output by the sentence-level feature encoding layer.
The sentence-level features of the Russian text are obtained by encoding based on the Russian text representation rules. During encoding, whether the Russian text contains punctuation marks repeated to convey strong emotion may be encoded as part of the sentence-level features; the emotions corresponding to the emoticons contained in the Russian text may be encoded as part of the sentence-level features; or, after the Russian text is translated into English, the emotion features of the English text may be obtained and encoded as part of the sentence-level features.
It should be noted that the execution sequence between the step 125 and the steps 121 to 124 is not specifically limited in the embodiment of the present invention.
And step 126, inputting the attention sequence features and sentence-level features into the emotion classification layer to obtain emotion analysis results output by the emotion classification layer.
Specifically, after the attention sequence features output by the sequence feature extraction layer and the sentence-level features output by the sentence-level feature encoding layer are obtained, the emotion classification layer performs emotion classification on the Russian text by combining the attention sequence features and the sentence-level features, and outputs an emotion analysis result.
Based on any of the above embodiments, fig. 4 is a schematic structural diagram of an attention layer according to an embodiment of the present invention, and as shown in fig. 4, step 123 specifically includes: inputting the contextual characteristics of each word into a first bidirectional long-short-time memory network of an attention layer to obtain first hidden layer characteristics of each word output by the first bidirectional long-short-time memory network; the first hidden layer feature of each word is input to the attention calculating layer of the attention layer, and the attention weight of each word output by the attention calculating layer is obtained.
Specifically, the attention layer includes a first bidirectional long short-term memory network (Bi-LSTM) and an attention calculating layer. The first Bi-LSTM takes the context features of each word as input and outputs the corresponding first hidden-layer features. The attention calculating layer takes the first hidden-layer features as input and computes the attention weight of each word, according to the following formulas:
s_i = tanh(W^T h'_i + b)    (3)

a_i = softmax(s_i A)    (4)

In the formulas, h'_i and a_i are the first hidden-layer feature and the attention weight of the i-th word, respectively; s_i serves as a score and is learned automatically from the corpus by the model; A and W are weight matrices, and b is a bias.
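A hedged sketch of formulas (3)-(4) follows; the hidden size (twice the LSTM size) and the dimension of the score vector s_i are assumed values.

```python
import torch
import torch.nn as nn

class WordAttention(nn.Module):
    """Attention calculating layer: s_i = tanh(W^T h'_i + b), a_i = softmax(s_i A)."""
    def __init__(self, hidden=256, score_dim=64):       # both sizes are assumptions
        super().__init__()
        self.W = nn.Linear(hidden, score_dim)            # weight matrix W and bias b
        self.A = nn.Linear(score_dim, 1, bias=False)      # weight matrix A

    def forward(self, h_prime: torch.Tensor) -> torch.Tensor:
        # h_prime: (batch, n, hidden) first hidden-layer features from the first Bi-LSTM
        s = torch.tanh(self.W(h_prime))                   # s_i
        a = torch.softmax(self.A(s), dim=1)               # a_i, normalized over the n words
        return a                                          # (batch, n, 1) attention weights
```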
Based on any of the above embodiments, step 124 specifically includes: inputting the context features of each word into a second bidirectional long short-term memory network of the sequence feature extraction layer to obtain the second hidden-layer features of each word output by the second bidirectional long short-term memory network; and performing a weighted summation of the second hidden-layer features of the words based on the attention weight of each word to obtain the attention sequence features.
Specifically, the sequence feature extraction layer includes a second bidirectional long short-term memory network. It should be noted that the second bidirectional long short-term memory network and the first bidirectional long short-term memory network of the attention layer are two independent networks; they are distinguished here by "first" and "second".
The second bidirectional long short-term memory network further extracts wider-ranging and deeper information on the basis of the context features of each word, thereby obtaining the second hidden-layer feature of each word. Here, the second hidden-layer feature of each word can be represented by the following formula:

h_i = [→h_i ; ←h_i]

In the formula, h_i is the second hidden-layer feature of the i-th word, and →h_i and ←h_i are the hidden-layer outputs of the i-th word in the two directions, respectively. By concatenating the hidden-layer outputs in both directions, the second hidden-layer feature of the corresponding word is obtained.
It should be noted that separate bidirectional long short-term memory networks are provided in the attention layer and the sequence feature extraction layer, so that the first network carries the task of representing how important each word is for emotion analysis, while the second network carries the task of representing the emotion and semantic information of each word. This avoids the problem that, when the two tasks conflict, a single bidirectional long short-term memory network struggles to serve both at once and the features extracted by the model become inaccurate.
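A minimal sketch of the sequence feature extraction layer, instantiated separately from the attention Bi-LSTM as the note above requires; the input and hidden sizes are assumed values.

```python
import torch
import torch.nn as nn

class SequenceFeatures(nn.Module):
    """Second, independent Bi-LSTM; h_i concatenates the forward and backward outputs,
    and the attention sequence feature is the attention-weighted sum of the h_i."""
    def __init__(self, in_dim=128, hidden=128):           # assumed dimensions
        super().__init__()
        self.bilstm = nn.LSTM(in_dim, hidden, bidirectional=True, batch_first=True)

    def forward(self, context_feats: torch.Tensor, attn: torch.Tensor) -> torch.Tensor:
        # context_feats: (batch, n, in_dim) from the CNN layer; attn: (batch, n, 1) from the attention layer
        h, _ = self.bilstm(context_feats)                  # h_i = [forward h_i ; backward h_i]
        return (attn * h).sum(dim=1)                       # attention sequence feature, shape (batch, 2 * hidden)
```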
Based on any of the above embodiments, step 126 specifically includes: randomly discarding the attention sequence features; combining the attention sequence characteristics and sentence-level characteristics which are randomly discarded to obtain text characteristics; and determining emotion analysis results based on the text features.
Specifically, Dropout and L2 regularization strategies are adopted: by randomly discarding part of the parameters in the model and controlling the complexity of the model parameters, the interaction between hidden-layer nodes is reduced, thereby reducing the generalization error of the whole deep neural network. The attention sequence features after Dropout and L2 regularization are combined with the sentence-level features output by the sentence-level feature encoding layer to obtain the text features; emotion classification of the Russian text is then performed on the basis of the text features, and the emotion analysis result corresponding to the Russian text is determined.
Fig. 5 is a schematic structural diagram of the emotion classification layer provided by an embodiment of the invention. As shown in Fig. 5, after the attention sequence features after Dropout are merged with the sentence-level features, dimension reduction is performed successively through two fully connected (Linear) layers, with a normalization operation added between the two layers to avoid ignoring features whose values in certain dimensions are too small. After this, the emotion analysis result is determined by Softmax regression.
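The classification layer of Fig. 5 might look roughly like the sketch below. The choice of BatchNorm1d for the normalization between the two Linear layers, the dropout rate and all dimensions are assumptions; L2 regularization would typically be applied through the optimizer's weight_decay rather than inside the module.

```python
import torch
import torch.nn as nn

class EmotionClassifier(nn.Module):
    """Dropout on the attention sequence feature, concatenation with sentence-level features,
    two Linear layers with a normalization step in between, then Softmax."""
    def __init__(self, seq_dim=256, sent_dim=20, mid_dim=64, n_classes=3, p_drop=0.5):   # assumed sizes
        super().__init__()
        self.drop = nn.Dropout(p_drop)
        self.fc1 = nn.Linear(seq_dim + sent_dim, mid_dim)
        self.norm = nn.BatchNorm1d(mid_dim)      # normalization keeps small-valued dimensions from being ignored
        self.fc2 = nn.Linear(mid_dim, n_classes)

    def forward(self, seq_feat, sent_feat):
        x = torch.cat([self.drop(seq_feat), sent_feat], dim=-1)   # merged text features
        x = self.fc2(self.norm(self.fc1(x)))
        return torch.softmax(x, dim=-1)                            # emotion analysis result

# L2 regularization via the optimizer, e.g.:
# torch.optim.Adam(model.parameters(), lr=1e-3, weight_decay=1e-4)
```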
The current difficulties of emotion analysis for Russian text are:
1) In terms of expression, social-media short texts are colloquial, full of slang, short-sentenced and irregular; the evaluated object is frequently omitted or left implicit, and context information is lacking, so traditional text emotion analysis techniques struggle to achieve good results.
2) In terms of the linguistic characteristics of Russian, free word order, lexical ambiguity, morphological complexity and non-projective relations occur frequently in sentences, which complicates the tasks of semantic analysis and emotion extraction.
Aiming at the two problems, the feature extraction of the Russian text is performed based on the Russian text representation rule in the embodiment of the invention, so that the accuracy of feature extraction is improved.
Based on any of the above embodiments, the word-level features of a word include the word vector of the word, and at least one of the capital letter feature, repeated letter feature, part-of-speech feature, morphological feature, emotion score feature and profanity/slang feature of the word.
The word vectors can be obtained with fastText trained on Wikipedia and a large volume of public website text. Each word vector w_i in fastText has a dimension of 300, and the vocabulary is large, containing 1,888,423 Russian words, which minimizes the occurrence of out-of-vocabulary (OOV) word vectors. More importantly, because fastText generates word vectors from character n-gram features, it alleviates the OOV problem to a certain extent and is better suited to a morphologically rich language such as Russian.
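Loading pretrained Russian fastText vectors could look like the hedged snippet below; the model file name is a placeholder for whichever wiki-based Russian vectors are used, and the library's character n-gram handling supplies vectors even for unseen word forms.

```python
import fasttext

# Placeholder path: any pretrained 300-dimensional Russian fastText model (e.g. a wiki-based one).
ft = fasttext.load_model("wiki.ru.bin")

def word_vector(word: str):
    """300-dim vector; subword n-grams let fastText back off gracefully for out-of-vocabulary forms."""
    return ft.get_word_vector(word)
```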
To express stronger emotion, Russian text on social networks often deliberately violates language norms: the first or last letter of a word, a vowel or a sonorant consonant may be repeated many times. A writer may also deliberately capitalize the first or last letter of a word in the middle of a sentence, or use capital letters to mark the stress of an intentionally distorted word, expressing negative moods such as sarcasm, anger, contempt or disgust. In order to preserve the emotion features contained in the Russian text, the number of capital letters and the number of repeated letters in each word can be extracted as the capital letter feature and the repeated letter feature of the word, respectively.
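Counting these two surface signals is straightforward; the sketch below is one possible way to derive them, where the run-length threshold of three repeated characters is an assumption (ordinary Russian words can legitimately contain double letters).

```python
import re

def capital_letter_feature(word: str) -> int:
    """Number of upper-case letters in the word (deliberate capitalization signals strong emotion)."""
    return sum(1 for ch in word if ch.isupper())

def repeated_letter_feature(word: str) -> int:
    """Number of letters contributed by runs of three or more identical characters (assumed threshold)."""
    return sum(len(m.group(0)) for m in re.finditer(r"(.)\1{2,}", word))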
Because emotion information is usually carried by content words and interjections (content words including nouns, verbs, adjectives and adverbs), the part of speech of a word can be used as its part-of-speech feature, indicating how likely the word is to carry emotion information, so that the emotion analysis model can concentrate its attention on content words and interjections, which are richer in emotion information. Further, the English paraphrase of each word and its part of speech pos_1 can be obtained through Google Translate; the part of speech pos_2 of each word can be obtained through NLTK (Natural Language Toolkit); and the parts of speech pos_3 and pos_4 can be obtained through the two well-known Russian morphological analysis tools pymorphy2 and PyMystem. Finally, the results [pos_1, pos_2, pos_3, pos_4] of the four tools are fused by majority voting, and the part of speech of each word is classified into one of seven categories: adjective, adverb, verb, noun, interjection, emoji and other.
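A hedged sketch of the majority vote over the four part-of-speech sources follows. Only the pymorphy2 call is shown concretely; the Google-translation, NLTK and PyMystem taggers are represented by hypothetical pre-computed inputs, and the mapping to the seven coarse categories is illustrative.

```python
from collections import Counter
import pymorphy2

morph = pymorphy2.MorphAnalyzer()

# Assumed mapping from pymorphy2 tags to the seven coarse classes used in the text.
COARSE = {"ADJF": "adjective", "ADJS": "adjective", "ADVB": "adverb",
          "VERB": "verb", "INFN": "verb", "NOUN": "noun", "INTJ": "interjection"}

def pos_pymorphy(word: str) -> str:
    tag = morph.parse(word)[0].tag.POS
    return COARSE.get(str(tag), "other")

def pos_feature(word: str, pos1: str, pos2: str, pos4: str) -> str:
    """pos1 (Google translation), pos2 (NLTK) and pos4 (PyMystem) are assumed to be produced
    elsewhere and already mapped to the same seven categories."""
    votes = Counter([pos1, pos2, pos_pymorphy(word), pos4])
    return votes.most_common(1)[0][0]            # majority vote over [pos1, pos2, pos3, pos4]
```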
In addition, the various complex word forms of Russian also influence the emotional expression of Russian text, so the morphology of a word can be used as one of its features. Further, the morphology mrp_1 and mrp_2 of each word can be obtained through pymorphy2 and PyMystem, respectively. Table 1 summarizes the 28 Russian morphological forms, in 10 broad classes, that pymorphy2 and PyMystem can label. Compared with pymorphy2, PyMystem not only employs dictionary- and rule-based algorithms but also takes full account of the contextual information of the word; therefore, in the embodiment of the invention, the word morphology mrp_2 produced by PyMystem is taken as the basic morphology of the word, and if mrp_1 does not conflict with mrp_2, mrp_1 is used as a complement, thus giving the morphological feature of the word.
TABLE 1 Russian morphological types annotated by pymorphy2 and PyMystem
Each word has a corresponding emotion intensity, which is used as one of the word's emotion features so that the emotion analysis model can focus its attention on the emotional information of the text. Here, the emotion intensity of each word may be obtained from preset emotion dictionaries. The existing multilingual emotion dictionary SenticNet and the English emotion dictionary SentiWordNet are large in scale, wide in coverage, and fine-grained and accurate in emotion intensity. Therefore, in the embodiment of the invention, the emotion scores sv_1 and sv_2 can be obtained through SenticNet for each word and for its English paraphrase eng, respectively; in addition, the emotion score sv_3 of the English paraphrase of each word can be obtained through SentiWordNet. The emotion scores sv_1, sv_2 and sv_3 are continuous values from -1 to +1 and can describe the emotion intensity of each word more comprehensively and precisely; sv_1, sv_2 and sv_3 are concatenated as the emotion score feature of the word.
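One way to assemble the three emotion scores is sketched below under assumptions: the SentiWordNet lookup uses NLTK (which exposes that lexicon), while the SenticNet lookups are represented by a hypothetical helper because a specific client API is not specified here.

```python
from nltk.corpus import sentiwordnet as swn   # requires nltk.download("sentiwordnet") and nltk.download("wordnet")

def sentiwordnet_score(english_word: str) -> float:
    """sv3: averaged (positive - negative) score of the English paraphrase, in [-1, 1]."""
    synsets = list(swn.senti_synsets(english_word))
    if not synsets:
        return 0.0
    return sum(s.pos_score() - s.neg_score() for s in synsets) / len(synsets)

def senticnet_score(term: str) -> float:
    """Hypothetical placeholder for a SenticNet polarity lookup (used for sv1 and sv2)."""
    raise NotImplementedError

def emotion_score_feature(word: str, eng: str) -> list:
    sv1 = senticnet_score(word)    # Russian word via SenticNet (multilingual)
    sv2 = senticnet_score(eng)     # English paraphrase via SenticNet
    sv3 = sentiwordnet_score(eng)  # English paraphrase via SentiWordNet
    return [sv1, sv2, sv3]         # concatenated as the word's emotion score feature
```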
In daily life, speakers often vent dissatisfaction, express contempt or curse some disliked object through profanity. Because of their strong emotional expressiveness, dirty words, coarse words and slang of all kinds are used heavily in short social-network texts, together with a great deal of indirect expression, which increases the granularity of the emotional factors affecting a sentence; a traditional emotion dictionary alone cannot meet this need. Therefore, the embodiment of the invention constructs a profanity dictionary in advance, takes whether a word or its corresponding form exists in the profanity dictionary as the criterion, classifies each word in the Russian text into the two categories "profane" and "non-profane", and uses the classification result as the profanity/slang feature of the word. Further, the words in the profanity dictionary can be roughly divided into three broad categories: 1) words concerning sexual behavior or sexual organs; 2) words related to disgusting things such as excrement, urine, buttocks and rubbish; 3) various curse-related words.
According to the method provided by the embodiment of the invention, the word-level features of a word are constructed from multiple dimensions such as the capital letter feature, repeated letter feature, part-of-speech feature, morphological feature, emotion score feature and profanity/slang feature, so that the word-level features can fully reflect the emotional information the word contributes to the Russian text, providing a basis for accurate and reliable emotion analysis.
Based on any of the above embodiments, the sentence-level features of the Russian text include at least one of the punctuation features, emoticon features and English translation emotion features of the Russian text.
Specifically, when using online social media platforms, Russian speakers tend to express the intensity of their emotions through repeated punctuation marks such as exclamation marks and question marks, as in "Я не хочу!!!!!!!" and "Как же плохо????". In addition, Russian speakers tend to use runs of parentheses at the end of short social texts to indicate positive or negative emotion, as in "Очень добрый человек с открытой душой, что редкость в наше время)))" and "О Боже Алина беременна Пойду поплачу((((". Therefore, in the embodiment of the invention, the numbers of exclamation marks, periods and question marks, and of trailing parentheses in each direction, are used as the punctuation features of the Russian text to reflect the emotional information it carries.
With the popularity of all kinds of emoticons on social media platforms, Russian users also increasingly use combinations of punctuation marks to imitate facial expressions or emotion-related objects, such as a smiling face or a heart, thereby expressing positive or negative emotion. Therefore, in the embodiment of the invention, the numbers of emoticons of the different emotion polarities contained in the Russian text are used as its emoticon features, reflecting the strength of the various emotion polarities in the Russian text. Table 2 is a polarity classification of emoticons; the number of emoticons of each emotion polarity in the Russian text can be counted on the basis of Table 2.
TABLE 2 Emoticon and emotion polarity classification
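A hedged sketch of the sentence-level punctuation and emoticon counts described above; the emoticon lists stand in for the polarity classes of Table 2 and are illustrative only.

```python
import re

POSITIVE_EMOTICONS = [":)", ":-)", ")))", "<3"]      # illustrative stand-ins for Table 2's positive class
NEGATIVE_EMOTICONS = [":(", ":-(", "((("]            # illustrative stand-ins for Table 2's negative class

def punctuation_features(text: str) -> dict:
    tail = re.search(r"[)(]+\s*$", text)              # run of brackets at the very end of the text
    tail = tail.group(0).strip() if tail else ""
    return {
        "exclamations": text.count("!"),
        "questions": text.count("?"),
        "periods": text.count("."),
        "closing_tail": tail.count(")"),               # ")))" tends to mark positive emotion
        "opening_tail": tail.count("("),               # "(((" tends to mark negative emotion
    }

def emoticon_features(text: str) -> dict:
    return {
        "positive": sum(text.count(e) for e in POSITIVE_EMOTICONS),
        "negative": sum(text.count(e) for e in NEGATIVE_EMOTICONS),
    }
```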
Among all languages, English has by far the largest share of emotion analysis research, and a large number of mature, reliable and convenient tools have grown out of it. Although some emotion features are lost and some noise is introduced when Russian text is translated into English, powerful translation engines produce English translations that conform better to a language model and are more regular than the free-word-order, non-standard Russian text found on social media. If mature, professional and suitable English emotion analysis tools are chosen in a targeted way, the English translation emotion features they produce can provide a reliable reference for emotion analysis of the Russian text. Further, an English translation can be obtained with the translation engines provided by Google and Baidu; analysis tools such as VADER, Sentiment140 and TextBlob can then be chosen to perform emotion analysis on the English translation, and the resulting English translation emotion features are used as one of the sentence-level features in the emotion analysis of the Russian text. Here, VADER, Sentiment140 and TextBlob are well suited to this type of text and can produce one or several floating-point values expressing emotion polarity without any model training.
According to the method provided by the embodiment of the invention, the sentence-level features of the Russian text are constructed from multiple dimensions such as punctuation features, emoticon features and English translation emotion features, so that the sentence-level features can fully reflect the emotional information contained in the Russian text, providing a basis for accurate and reliable emotion analysis.
Based on any of the above embodiments, the method further includes, before step 120: preprocessing the Russian text.
Specifically, Russian social-network short texts often contain informal elements such as topic tags beginning with "#", user names beginning with "@", forwarding tags beginning with "RT", and URLs. These elements carry little information about emotional expression and can be filtered out before emotion analysis is performed. In addition, text preprocessing may include the following steps (a sketch of such a pipeline is given after the list):
1) Filtering all punctuation marks and special characters, and discarding text which is not written in the target language (Russian text possibly contains a small amount of Korean and Japanese);
2) Remove or replace HTML tags, for example deleting <div> or <br>, and replacing the entity "&gt;" with ">" (some emoticons contain ">");
3) Replacing all numbers appearing herein with uniform symbols;
4) All letters are converted to a lower case format.
5) Stemming: because Russian morphology is very complex, words sharing the same root are unified into one form (removing, for example, the interference of singular/plural number, person, negation, noun case, verb tense, voice and aspect).
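A minimal sketch of such a preprocessing pipeline, assuming regular expressions for the social-media artifacts and NLTK's Russian Snowball stemmer as one possible normalizer (the text above does not name a specific stemming tool).

```python
import re
from nltk.stem.snowball import SnowballStemmer

stemmer = SnowballStemmer("russian")

def preprocess(text: str) -> str:
    text = re.sub(r"#\w+|@\w+|\bRT\b|https?://\S+", " ", text)   # topic tags, user names, RT marks, URLs
    text = text.replace("&gt;", ">")                              # 2) restore ">" from its HTML entity
    text = re.sub(r"<[^>]+>", " ", text)                          # 2) drop remaining HTML tags such as <div>, <br>
    text = re.sub(r"\d+", "<num>", text)                          # 3) unify all numbers into one symbol
    text = text.lower()                                           # 4) lower-case all letters
    tokens = re.findall(r"[а-яё]+", text)                         # 1) keep Cyrillic words, dropping other symbols
    return " ".join(stemmer.stem(t) for t in tokens)              # 5) stemming merges inflected forms
```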
Based on any of the above embodiments, the samples used for training the emotion analysis model are those provided by Araujo. During model training, in order to speed up data processing while still seeking a good global solution, the emotion analysis model is trained with mini-batch gradient descent: the data take part in the computation in batches, and the weights are updated after each batch is processed. Table 3 shows the parameter settings of the emotion analysis model.
TABLE 3 parameters of emotion analysis model
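A condensed sketch of mini-batch gradient descent training as described above, assuming a model with the two-input signature sketched earlier and a dataset yielding (word features, sentence features, label) triples; the optimizer, batch size, learning rate and loss are placeholders, since Table 3's concrete parameter values are not reproduced here.

```python
import torch
from torch.utils.data import DataLoader

def train(model, dataset, epochs=10, batch_size=64, lr=1e-3):
    """Weights are updated once per mini-batch; all hyperparameter values are assumptions."""
    loader = DataLoader(dataset, batch_size=batch_size, shuffle=True)
    optimizer = torch.optim.Adam(model.parameters(), lr=lr, weight_decay=1e-4)  # weight_decay gives L2 regularization
    loss_fn = torch.nn.CrossEntropyLoss()
    for _ in range(epochs):
        for word_feats, sent_feats, labels in loader:
            optimizer.zero_grad()
            logits = model(word_feats, sent_feats)
            loss_fn(logits, labels).backward()
            optimizer.step()
```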
When verifying the trained emotion analysis model, 5-fold cross-validation can be applied, computing the averages over the five folds of the accuracy, the macro precision P_macro, the macro recall R_macro and F1_macro. In order to balance precision against recall and avoid an unbalanced classification effect, the embodiment of the invention uses F1_macro as the main evaluation criterion and accuracy as an auxiliary criterion, where F1_macro can be expressed in the following form:

F1_macro = 2 · P_macro · R_macro / (P_macro + R_macro)
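The 5-fold evaluation could be scripted along the lines below, using scikit-learn's macro-averaged precision and recall; X and y are assumed to be NumPy arrays, and train_and_predict is a hypothetical helper standing in for training the model and predicting on the held-out fold.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold
from sklearn.metrics import accuracy_score, precision_score, recall_score

def cross_validate(X, y, train_and_predict, n_splits=5):
    """train_and_predict(X_train, y_train, X_test) is a hypothetical helper returning predicted labels."""
    scores = []
    for train_idx, test_idx in StratifiedKFold(n_splits=n_splits, shuffle=True).split(X, y):
        y_pred = train_and_predict(X[train_idx], y[train_idx], X[test_idx])
        acc = accuracy_score(y[test_idx], y_pred)
        p = precision_score(y[test_idx], y_pred, average="macro")
        r = recall_score(y[test_idx], y_pred, average="macro")
        f1 = 2 * p * r / (p + r)                      # F1_macro from macro precision and recall
        scores.append((acc, p, r, f1))
    return np.mean(scores, axis=0)                    # averaged accuracy, P_macro, R_macro, F1_macro
```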
based on any of the above examples, table 4 compares F1 corresponding to the emotion analysis results of Russian text using different english translation emotion characteristics macro And Accuracy Accuracy, wherein ACBM is an emotion analysis model in the embodiment of the invention.
TABLE 4 SA comparison of English translation with Russian original
Comparing the classification results of the first six groups in Table 4 shows that, for the task of translating Russian social-media short texts, Baidu translation performs slightly better than Google translation, and the combination of Baidu translation with VADER works best. The experimental results indicate that the effect of an emotion classification scheme based on the English translation depends strongly on the quality of the translation, and that when analyzing social-media short texts with irregular format, VADER outperforms Sentiment140, which in turn outperforms TextBlob.
Comparing the two groups of BERT-based experimental results in Table 4 shows that the English pre-trained model provided by BERT is indeed powerful, performing far better than the previous six classification schemes, whereas the multilingual pre-trained model provided by BERT is less effective, probably because the data resources of multilingual text, in both quantity and quality, fall far short of the English resources.
Comparing the first seven experimental results in Table 4 with the last two shows that, because of the loss of semantics, emotion and language characteristics during translation, emotion classification schemes based on the English translation are far less effective than analyzing the Russian directly; until machine translation makes a further breakthrough, they can serve only as a stopgap, or as an auxiliary scheme in transfer learning.
Based on any of the above embodiments, in order to fully mine the information in Russian social-media short texts and make full use of the relevant background knowledge and language characteristics, the embodiment of the invention adds the word-level and sentence-level features of each dimension one by one to the LSTM and ACBM models, and at the same time reports three correlation indexes (Kendall, Pearson, Spearman) between each feature and the emotion analysis result.
TABLE 5-1 comparison of results after addition of different sentence-level features
TABLE 5-2 comparison of results after addition of different word-level characteristics
As Tables 5-1 and 5-2 show, the absolute value of the correlation index is clearly positively related to the experimental effect after the feature is added. Regardless of whether features are added or which type of feature is added, the F1_macro value of the ACBM model of the embodiment of the invention is 1.27%-2.51% higher than that of the LSTM model. Among the sentence-level features, adding the English-translation VADER emotion value, the number of positive emoticons and the direction of the trailing brackets works best: compared with the result without these features, the F1_macro of the LSTM model rises by 1.05%, 0.87% and 0.79% respectively, and the F1_macro of the ACBM model rises by 0.95%, 0.73% and 0.71% respectively; the improvement is clearly larger for the LSTM model, probably because the modules of ACBM in front of the fully connected layer already have stronger feature extraction capability, so the sentence-level features offer only limited supplementary help. Among the word-level features, adding the emotion score feature, the profanity/slang feature and the "emotion score + part of speech" features works best: compared with the result without these features, the F1_macro of the LSTM model rises by 0.83%, 1.31% and 0.77% respectively, and the F1_macro of the ACBM model rises by 0.97%, 1.56% and 1.57% respectively; contrary to the sentence-level features, the word-level features clearly improve the ACBM model more, probably because the CNN module and the self-attention mechanism adopted by ACBM can fully extract the key local emotion features. It is also worth noting that adding the part-of-speech feature alone is not very effective for the ACBM model, but if the "emotion score + part of speech" features are added together, the effect improves by 0.6% over the "emotion score" feature alone, probably because the self-attention mechanism of the ACBM model effectively extracts weight values for important information from the part-of-speech information, making the model pay more attention to those emotion scores that help the classification result.
Based on any one of the above embodiments, in order to verify the effectiveness of the ACBM model provided by the embodiment of the invention on the emotion analysis (SA) task for Russian social short texts, the method provided by the embodiment of the invention is compared with a traditional machine learning method (SVM), commonly used deep learning models (CNN, LSTM, BiLSTM) and several deep learning hybrid models, with the following notes:
1) All models fuse the following word-level features at the same time: the part-of-speech feature, emotion score feature and profanity/slang feature;
2) All models fuse the following sentence-level features at the same time: the punctuation features and English translation emotion features of the Russian text, and the 768-dimensional sentence vector of the English translation obtained from the BERT pre-trained model;
3) The models are explained as follows: BiLSTM-2layers is a two-layer stacked BiLSTM; BiLSTM-ATT adds, on top of BiLSTM, the self-attention mechanism shown in formula (3); BiLSTM-ATT2 is, on top of BiLSTM, a self-attention mechanism that determines the weight of h_i directly from the hidden-layer output h_i of the BiLSTM; BiLSTM-CNN is a serial combination of BiLSTM followed by CNN; CNN-BiLSTM is a serial combination of CNN followed by BiLSTM.
TABLE 6 comparison of classification results for different methods
Examining the experimental results in Table 6 shows that: all deep learning models outperform the machine learning model SVM; CNN and BiLSTM differ little in effect, but BiLSTM performs clearly better than LSTM, meaning that bidirectional LSTM can effectively exploit both past and future features of the sequence; increasing the number of BiLSTM layers or adding a self-attention mechanism brings no obvious improvement over the original BiLSTM; CNN-BiLSTM performs slightly better than BiLSTM-CNN, which indicates that a CNN layer is better suited to extracting local features at an early stage while BiLSTM is better suited to summarizing global sequence features at a later stage; and the ACBM model provided by the embodiment of the invention outperforms all the other models, showing that a reasonable combination of CNN and BiLSTM, assisted by an attention mechanism with a clear division of labor, can effectively capture local features, summarize global information and improve the emotion analysis effect.
Addressing the colloquial language, heavy slang, free word order and varied word forms of Russian social-media short texts, the embodiment of the invention proposes a CNN+LSTM hybrid model based on a self-attention mechanism (the ACBM model), which combines CNN and LSTM neural networks to obtain better emotion analysis performance while fusing various word-level and sentence-level emotion features. Experiments show that the effect of every model improves after the features are fused; sentence-level features improve simple models more noticeably, while word-level features improve complex models more noticeably. Compared with the standalone CNN and LSTM models, the F1_macro of the ACBM model improves by 2.21% and 2.79%; compared with the other machine learning methods and the CNN+LSTM hybrid models, the F1_macro of the ACBM model improves by 0.92% and 10.3%.
Based on any of the above embodiments, fig. 6 is a schematic structural diagram of a russian text emotion analysis device according to an embodiment of the present invention, where, as shown in fig. 6, the russian text emotion analysis device includes:
a text obtaining unit 610, configured to obtain a russian text to be analyzed;
emotion analysis unit 620, configured to input the russian text to an emotion analysis model, and obtain an emotion analysis result output by the emotion analysis model;
The emotion analysis model is used for extracting local features of word level features of each word in the Russian text to obtain context features of each word, extracting sequence features of each word based on a self-attention mechanism to obtain attention sequence features of the Russian text, and performing emotion analysis based on sentence level features of the Russian text and the attention sequence features; the word-level features and the sentence-level features are extracted based on russian text representation rules.
According to the device provided by the embodiment of the invention, the problem that the specificity of the Russian text itself affects the feature extraction accuracy is solved based on the word-level features and sentence-level features extracted by the Russian text representation rule, the local feature extraction is performed on the word-level emotion features of the text based on the convolutional neural network, the sequence feature extraction is performed based on the cyclic neural network and the self-attention mechanism, so that the advantages of the local features and the sequence features are combined, and finally the emotion analysis is performed on the sentence-level emotion features and the attention sequence features, thereby improving the accuracy and the reliability of the Russian text emotion analysis.
Fig. 7 is a schematic structural diagram of an electronic device according to an embodiment of the present invention, as shown in fig. 7, the electronic device may include: processor 710, communication interface (Communications Interface) 720, memory 730, and communication bus 740, wherein processor 710, communication interface 720, memory 730 communicate with each other via communication bus 740. Processor 710 may invoke logic commands in memory 730 to perform the following method: obtaining a Russian text to be analyzed; inputting the Russian text into an emotion analysis model to obtain an emotion analysis result output by the emotion analysis model; the emotion analysis model is used for extracting local features of word level features of each word in the Russian text to obtain context features of each word, extracting sequence features of each word based on a self-attention mechanism to obtain attention sequence features of the Russian text, and performing emotion analysis based on sentence level features of the Russian text and the attention sequence features; the word-level features and the sentence-level features are extracted based on russian text representation rules.
In addition, the logic commands in the memory 730 described above may be implemented in the form of software functional units and may be stored in a computer-readable storage medium when sold or used as a stand-alone product. Based on this understanding, the technical solution of the present invention, in essence, or the part contributing to the prior art, may be embodied in the form of a software product stored in a storage medium and comprising several commands for causing a computer device (which may be a personal computer, a server, a network device, or the like) to execute all or part of the steps of the method according to the embodiments of the present invention. The aforementioned storage medium includes various media capable of storing program code, such as a USB flash drive, a removable hard disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk, or an optical disk.
Embodiments of the present invention also provide a non-transitory computer-readable storage medium having stored thereon a computer program which, when executed by a processor, performs the method provided by the above embodiments, for example comprising: obtaining a Russian text to be analyzed; inputting the Russian text into an emotion analysis model to obtain an emotion analysis result output by the emotion analysis model; the emotion analysis model is configured to perform local feature extraction on the word-level features of each word in the Russian text to obtain the contextual features of each word, to perform sequence feature extraction on each word based on a self-attention mechanism to obtain the attention sequence features of the Russian text, and to perform emotion analysis based on the sentence-level features of the Russian text together with the attention sequence features; the word-level features and the sentence-level features are extracted based on Russian text representation rules.
The apparatus embodiments described above are merely illustrative: the units described as separate components may or may not be physically separate, and the components shown as units may or may not be physical units, i.e., they may be located in one place or distributed over a plurality of network units. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of this embodiment. Those of ordinary skill in the art can understand and implement the invention without undue burden.
From the above description of the embodiments, it will be apparent to those skilled in the art that the embodiments may be implemented by means of software plus a necessary general hardware platform, or, of course, by means of hardware. Based on this understanding, the foregoing technical solution, in essence, or the part contributing to the prior art, may be embodied in the form of a software product stored in a computer-readable storage medium, such as a ROM/RAM, a magnetic disk, or an optical disk, comprising several commands for causing a computer device (which may be a personal computer, a server, or a network device, etc.) to execute the method described in the respective embodiments or in some parts of the embodiments.
Finally, it should be noted that the above embodiments are only intended to illustrate the technical solutions of the present invention and are not limiting; although the invention has been described in detail with reference to the foregoing embodiments, those of ordinary skill in the art will understand that the technical solutions described in the foregoing embodiments may still be modified, or some of their technical features may be replaced by equivalents, and such modifications and substitutions do not depart from the spirit and scope of the technical solutions of the embodiments of the present invention.

Claims (8)

1. A Russian text emotion analysis method, comprising:
obtaining a Russian text to be analyzed;
inputting the Russian text into an emotion analysis model to obtain an emotion analysis result output by the emotion analysis model;
the emotion analysis model is configured to perform local feature extraction on the word-level features of each word in the Russian text to obtain the contextual features of each word, to perform sequence feature extraction on each word based on a self-attention mechanism to obtain the attention sequence features of the Russian text, and to perform emotion analysis based on the sentence-level features of the Russian text together with the attention sequence features; the word-level features and the sentence-level features are extracted based on Russian text representation rules;
the word-level features of any word include a word vector of that word and at least one of: uppercase-letter features, repeated-letter features, part-of-speech features, morphological features, emotion score features, and dirty-word/slang features of that word;
the sentence-level features of the Russian text include at least one of: punctuation features, emoji features, and English-interpretation emotion features of the Russian text.
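Purely by way of illustration, the sketch below computes a few of the rule-based word-level and sentence-level cues named in claim 1; the patent does not fix the exact rules, lexicons, or emoji sets, so the regular expression, the emoji inventory, and the lexicon lookups here are assumptions.

import re

POSITIVE_EMOJI = {":)", "❤", "😊"}   # assumed emoji inventory, illustrative only
NEGATIVE_EMOJI = {":(", "😢"}

def word_level_features(word, sentiment_lexicon, slang_lexicon):
    # A few of the word-level cues from claim 1 (illustrative subset).
    return {
        "all_caps": word.isupper() and len(word) > 1,                 # uppercase-letter feature
        "has_repeats": re.search(r"(.)\1{2,}", word) is not None,     # repeated-letter feature
        "sentiment_score": sentiment_lexicon.get(word.lower(), 0.0),  # emotion score feature
        "is_slang": word.lower() in slang_lexicon,                    # dirty-word/slang feature
    }

def sentence_level_features(text):
    # A few of the sentence-level cues from claim 1 (illustrative subset).
    return {
        "exclamations": text.count("!"),   # punctuation feature
        "questions": text.count("?"),
        "positive_emoji": sum(text.count(e) for e in POSITIVE_EMOJI),  # emoji feature
        "negative_emoji": sum(text.count(e) for e in NEGATIVE_EMOJI),
    }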
2. The Russian text emotion analysis method of claim 1, wherein the emotion analysis model comprises a word-level feature encoding layer, a local feature extraction layer, an attention layer, a sequence feature extraction layer, a sentence-level feature encoding layer, and an emotion classification layer.
3. The Russian text emotion analysis method of claim 2, wherein inputting the Russian text into an emotion analysis model to obtain the emotion analysis result output by the emotion analysis model specifically comprises:
inputting the Russian text into the word-level feature encoding layer to obtain the word-level features of each word output by the word-level feature encoding layer;
inputting the word-level features of each word into the local feature extraction layer to obtain the contextual features of each word output by the local feature extraction layer;
inputting the contextual features of each word into the attention layer to obtain the attention weight of each word output by the attention layer;
inputting the contextual features and the attention weight of each word into the sequence feature extraction layer to obtain the attention sequence features output by the sequence feature extraction layer;
inputting the Russian text into the sentence-level feature encoding layer to obtain the sentence-level features output by the sentence-level feature encoding layer;
and inputting the attention sequence features and the sentence-level features into the emotion classification layer to obtain the emotion analysis result output by the emotion classification layer.
4. The Russian text emotion analysis method of claim 3, wherein inputting the contextual features of each word into the attention layer to obtain the attention weight of each word output by the attention layer specifically comprises:
inputting the contextual features of each word into a first bidirectional long short-term memory network of the attention layer to obtain a first hidden-layer feature of each word output by the first bidirectional long short-term memory network;
and inputting the first hidden-layer feature of each word into the attention calculating layer of the attention layer to obtain the attention weight of each word output by the attention calculating layer.
5. The Russian text emotion analysis method of claim 3, wherein inputting the contextual features and the attention weight of each word into the sequence feature extraction layer to obtain the attention sequence features output by the sequence feature extraction layer specifically comprises:
inputting the contextual features of each word into a second bidirectional long short-term memory network of the sequence feature extraction layer to obtain a second hidden-layer feature of each word output by the second bidirectional long short-term memory network;
and performing weighted summation on the second hidden-layer features of the words based on the attention weight of each word to obtain the attention sequence features.
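Purely as an illustration of claims 4 and 5, the short sketch below derives per-word attention weights from the hidden-layer features of a first BiLSTM and uses them to weight-sum the hidden-layer features of a second BiLSTM; the scoring function (a single learned projection followed by a softmax) and the sizes are assumptions, since the claims do not fix them.

import torch
import torch.nn as nn

seq_len, dim = 20, 256               # assumed sizes for the example
score = nn.Linear(dim, 1)            # attention calculating layer (assumed form)

h1 = torch.randn(1, seq_len, dim)    # first hidden-layer features (claim 4)
h2 = torch.randn(1, seq_len, dim)    # second hidden-layer features (claim 5)

weights = torch.softmax(score(h1), dim=1)                # one normalised weight per word
attention_sequence_features = (weights * h2).sum(dim=1)  # weighted summation
print(attention_sequence_features.shape)                 # torch.Size([1, 256])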
6. The Russian text emotion analysis method of claim 3, wherein inputting the attention sequence features and the sentence-level features into the emotion classification layer to obtain the emotion analysis result output by the emotion classification layer specifically comprises:
randomly discarding (i.e., applying dropout to) the attention sequence features;
combining the attention sequence features after random discarding with the sentence-level features to obtain text features;
and determining the emotion analysis result based on the text features.
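As a sketch of the classification step of claim 6 only: dropout is applied to the attention sequence features, the result is combined with the sentence-level features, and a classifier maps the combined text features to an emotion distribution; the feature sizes, the use of concatenation for "combining", and the single linear layer are assumptions.

import torch
import torch.nn as nn

attn_seq = torch.randn(1, 256)       # attention sequence features (assumed size)
sent_feats = torch.randn(1, 10)      # sentence-level features (assumed size)

dropout = nn.Dropout(p=0.5)          # random discarding of the attention sequence features
classifier = nn.Linear(256 + 10, 3)  # three emotion classes assumed

text_features = torch.cat([dropout(attn_seq), sent_feats], dim=-1)
probs = torch.softmax(classifier(text_features), dim=-1)
print(probs)                         # emotion analysis result as a probability distribution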
7. An electronic device comprising a memory, a processor, and a computer program stored on the memory and executable on the processor, wherein the processor, when executing the program, implements the steps of the Russian text emotion analysis method of any one of claims 1 to 6.
8. A non-transitory computer-readable storage medium having stored thereon a computer program which, when executed by a processor, implements the steps of the Russian text emotion analysis method of any one of claims 1 to 6.
Priority Application (1)

CN202010179507.3A, filed 2020-03-16, priority date 2020-03-16: Russian text emotion analysis method, electronic device and storage medium (Active)

Publications (2)

CN111339772A, published 2020-06-26
CN111339772B, granted 2023-11-14

Family

ID=71182365

Country Status (1)

CN (1): CN111339772B (en)
Families Citing this family (3)

* Cited by examiner, † Cited by third party

CN112417161B * 2020-11-12 2022-06-24 福建亿榕信息技术有限公司 Method and storage device for recognizing upper and lower relationships of knowledge graph based on mode expansion and BERT classification
CN113111148A * 2021-03-29 2021-07-13 北京工业大学 Emotion analysis method for microblog tree-hole message text
CN113377901B * 2021-05-17 2022-08-19 内蒙古工业大学 Mongolian text emotion analysis method based on multi-size CNN and LSTM models

Patent Citations (3)

CN109635109A * 2018-11-28 2019-04-16 华南理工大学 Sentence classification method based on LSTM combining part of speech and multi-attention mechanism
CN109710761A * 2018-12-21 2019-05-03 中国标准化研究院 Sentiment analysis method of a bidirectional LSTM model based on attention enhancement
CN110781306A * 2019-10-31 2020-02-11 山东师范大学 English text aspect-level emotion classification method and system

Family Cites Families (1)

EP3474201A1 * 2017-10-17 2019-04-24 Tata Consultancy Services Limited System and method for quality evaluation of collaborative text inputs

Non-Patent Citations (1)

李丽华; 胡小龙. Text sentiment analysis based on deep learning (基于深度学习的文本情感分析). Journal of Hubei University (Natural Science Edition), 2020, (02), Sections 1-4 of the main text. *


Legal Events

PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant