CN113535899B - Automatic studying and judging method for emotion tendencies of internet information - Google Patents


Info

Publication number
CN113535899B
CN113535899B (application CN202110769546.3A)
Authority
CN
China
Prior art keywords
training
model
public opinion
emotion
data set
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202110769546.3A
Other languages
Chinese (zh)
Other versions
CN113535899A (en)
Inventor
郭齐 (Guo Qi)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Xi'an Kangnai Network Technology Co ltd
Original Assignee
Xi'an Kangnai Network Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Xi'an Kangnai Network Technology Co ltd
Priority to CN202110769546.3A
Publication of CN113535899A
Application granted
Publication of CN113535899B
Legal status: Active

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F 16/00: Information retrieval; Database structures therefor; File system structures therefor
    • G06F 16/30: Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F 16/33: Querying
    • G06F 16/3331: Query processing
    • G06F 16/334: Query execution
    • G06F 16/3344: Query execution using natural language analysis
    • G06F 16/36: Creation of semantic tools, e.g. ontology or thesauri
    • G06F 16/374: Thesaurus
    • G06F 16/90: Details of database functions independent of the retrieved data types
    • G06F 16/95: Retrieval from the web
    • G06F 16/953: Querying, e.g. by the use of web search engines
    • G06F 40/00: Handling natural language data
    • G06F 40/20: Natural language analysis
    • G06F 40/205: Parsing
    • G06F 40/211: Syntactic parsing, e.g. based on context-free grammar [CFG] or unification grammars
    • G06F 40/253: Grammatical analysis; Style critique
    • G06F 40/30: Semantic analysis

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Computational Linguistics (AREA)
  • Databases & Information Systems (AREA)
  • Artificial Intelligence (AREA)
  • Data Mining & Analysis (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • General Health & Medical Sciences (AREA)
  • Machine Translation (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention discloses an automatic studying and judging method for the emotion tendencies of internet information, relates to the technical field of language emotion analysis, and adopts a method of pre-training a RoBERTa model on a general corpus and then fine-tuning it on downstream tasks. A mixed-precision, multi-machine multi-GPU training mode is used in the deep learning training process. After suitable hyperparameters are found through training, the model is deployed and an interface is provided to complete the automatic research and judgment work. The method solves the problems of low accuracy, insufficient generalization of the research and judgment model, and poor performance in complex Chinese contexts such as innuendo and ambiguity that affect the traditional research and judgment of public opinion emotion.

Description

Automatic studying and judging method for emotion tendencies of internet information
Technical Field
The invention relates to the technical field of language emotion analysis, and in particular to an automatic studying and judging method for the emotion tendencies of internet information.
Background
According to the 47th Statistical Report on China's Internet Development issued by the China Internet Network Information Center (CNNIC), the number of Chinese internet users had reached 989 million as of December 2020. The internet therefore generates a huge amount of data and information, and the analysis of internet public opinion is an essential step in coping with it.
With the continued development of the internet age, internet public opinion emotion analysis has become an indispensable means of understanding community opinion, grasping public opinion trends, and responding quickly to emergencies. Automatic research and judgment of the emotion tendencies of internet public opinion is a vivid application of combining big data with artificial intelligence.
However, existing emotion analysis schemes generally adopt traditional machine learning techniques such as support vector machines, logistic regression, CNN neural networks and LSTM neural networks. In natural language processing, which is the core of automatic emotion tendency judgment, these techniques perform poorly when facing complex Chinese contexts such as innuendo and ambiguity, the generalization of the resulting models is weak, and the accuracy of emotion tendency judgment leaves much room for improvement.
To address the problems in the prior art, the application provides an automatic research and judgment method for the emotion tendencies of internet information, which solves the problems of low accuracy, insufficient generalization of research and judgment models, and poor performance when dealing with complex Chinese contexts such as innuendo and ambiguity in traditional public opinion emotion research and judgment work.
Disclosure of Invention
The invention aims to provide an automatic research and judgment method for the emotion tendencies of internet information, which solves the problems of low accuracy, insufficient generalization of research and judgment models, and poor performance when dealing with complex Chinese contexts such as innuendo and ambiguity in traditional public opinion emotion research and judgment work.
The invention provides an automatic studying and judging method aiming at the emotion tendencies of internet information, which comprises the following steps:
establishing a public opinion corpus data set;
establishing a RoBERTa model, importing the public opinion corpus data set for pre-training, and applying RoBERTa's improvements over BERT to obtain a pre-trained model;
fine-tuning parameters of the pre-training model based on the downstream task data set, and storing a final model after fine-tuning;
and outputting emotion tendency probabilities from the final model's predictions, thereby realizing automatic research and judgment.
Further, preprocessing is carried out on the public opinion corpus data set, and the preprocessing comprises the following steps:
collecting public opinion data marked with emotion tendencies in public opinion corpus data set, and cleaning the data;
formatting public opinion data;
converting public opinion data as required by using a Chinese character dictionary file;
and carrying out multi-process preprocessing on the public opinion data.
Further, the pre-training of the public opinion corpus data set is performed based on deep learning, and a mixed precision multi-machine multi-GPU training mode is used in the training process.
Further, the pre-training of the public opinion corpus data set comprises improvement of Bert, and specifically comprises the following steps:
removing the NSP task;
specifying a BERT mask type;
changing the static Mask to a dynamic Mask.
Further, the pre-training model is fine-tuned with a learning rate of 3e-4, a batch size of 64 and 12 epochs, and the mask type is set to full_visual.
Further, the final model configures an HTTP interface, the HTTP interface adopts a data submitting mode of POST, and a transmission format of JSON.
Further, the emotion tendencies output after model prediction include positive, negative, neutral and irrelevant emotion tendencies.
Compared with the prior art, the invention has the following remarkable advantages:
the automatic research and judgment method for the emotion tendencies of the internet information provided by the invention adopts a method of pre-training and fine-tuning a downstream task by using a RoBERTa model on a general corpus. And a mixed precision multi-machine multi-GPU training mode is used in the deep learning training process. After the super-parameter training is found, a model is deployed and an interface is provided for completing the automatic research and judgment work, and the method has the advantages of good robustness, strong model generalization capability and capability of providing high-accuracy research and judgment results facing special Chinese contexts.
Secondly, the automatic research and judgment method for the emotion tendencies of internet information improves on BERT in the pre-training process by changing the static Mask to a dynamic Mask, which indirectly increases the training data and helps improve model performance. Removing the NSP loss matches or slightly exceeds the original BERT's performance on downstream tasks.
In the automatic research and judgment method for the emotion tendencies of internet information, the RoBERTa model (160 GB) uses 10 times more data than the BERT model (16 GB). More training data increases the diversity of the vocabulary and syntactic structures in the data.
Drawings
FIG. 1 is a block diagram of a fine-tuning pre-training model provided by an embodiment of the present invention;
FIG. 2 is a diagram illustrating differences between model architectures before training according to an embodiment of the present invention;
FIG. 3 is a schematic diagram of the fine tuning principle of RoBERTa on different tasks according to the embodiment of the present invention;
FIG. 4 is a diagram of MNLI accuracy after fine-tuning according to an embodiment of the present invention;
FIG. 5 is a diagram showing the BERT input part according to an embodiment of the present invention.
Detailed Description
The following description of the embodiments of the present invention, taken in conjunction with the accompanying drawings, will clearly and completely describe the embodiments of the present invention, and it is evident that the described embodiments are some, but not all, embodiments of the present invention. All other embodiments, which can be made by those skilled in the art based on the embodiments of the present invention without making any inventive effort, shall fall within the scope of the present invention.
Referring to fig. 1-5, the invention provides an automatic studying and judging method for emotion tendencies of internet information, which comprises the following steps:
establishing a public opinion corpus data set, and preprocessing the public opinion corpus data set;
establishing a RoBERTa model, designating the target task of the model, importing the preprocessed public opinion corpus data set for pre-training, and applying RoBERTa's improvements over BERT to obtain a pre-trained model;
fine-tuning parameters of the pre-training model based on the downstream task data set, and storing a final model after fine-tuning;
and outputting emotion tendency probabilities from the final model's predictions, thereby realizing automatic research and judgment.
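A minimal sketch of this pipeline in Python, assuming the HuggingFace transformers library and the publicly available hfl/chinese-roberta-wwm-ext checkpoint as a stand-in for the patent's own pre-trained weights; the label set follows the four tendencies described below:

import torch
from transformers import BertTokenizer, BertForSequenceClassification

LABELS = ["positive", "negative", "neutral", "irrelevant"]

# stand-in checkpoint; in the patent, the custom pre-trained and fine-tuned weights would be loaded here
tokenizer = BertTokenizer.from_pretrained("hfl/chinese-roberta-wwm-ext")
model = BertForSequenceClassification.from_pretrained("hfl/chinese-roberta-wwm-ext", num_labels=len(LABELS))
model.eval()

def predict(text):
    # tokenize, run the classifier, and return the most probable emotion tendency
    inputs = tokenizer(text, truncation=True, max_length=512, return_tensors="pt")
    with torch.no_grad():
        probs = torch.softmax(model(**inputs).logits, dim=-1).squeeze(0)
    idx = int(torch.argmax(probs))
    return {"label": "__label__" + LABELS[idx], "probability": float(probs[idx])}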
The final model is configured with an HTTP interface that accepts data submitted via POST with JSON as the transmission format; the model's predicted emotion tendency probabilities are returned through this interface to realize automatic research and judgment.
The emotion tendencies output after model prediction include positive emotion tendencies, negative emotion tendencies, neutral emotion tendencies and irrelevant emotion tendencies.
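A minimal serving sketch, assuming Flask; the /sentiment route and the "text" field name are illustrative assumptions, while the POST method and JSON transmission format follow the description above, and predict() refers to the pipeline sketch shown earlier:

from flask import Flask, request, jsonify

app = Flask(__name__)

@app.route("/sentiment", methods=["POST"])  # hypothetical route name
def sentiment():
    payload = request.get_json(force=True)  # JSON body submitted via POST
    result = predict(payload["text"])       # predict() from the pipeline sketch above
    return jsonify(result)                  # e.g. {"label": "__label__neutral", "probability": 0.99...}

if __name__ == "__main__":
    app.run(host="0.0.0.0", port=8000)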
Example 1
Preprocessing the public opinion corpus data set, wherein the preprocessing comprises the following steps:
collecting public opinion data marked with emotion tendencies in public opinion corpus data set, and cleaning the data;
formatting public opinion data;
converting public opinion data as required by using a Chinese character dictionary file;
and carrying out multi-process preprocessing on the public opinion data.
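A minimal sketch of these preprocessing steps, assuming a vocab.txt-style Chinese character dictionary file; the cleaning rules and file name are illustrative assumptions:

import re
from multiprocessing import Pool

def load_vocab(path="vocab.txt"):
    # one Chinese character or token per line; line number becomes the id
    with open(path, encoding="utf-8") as f:
        return {token.strip(): idx for idx, token in enumerate(f)}

VOCAB = load_vocab()

def clean(text):
    # strip URLs, HTML tags and extra whitespace from raw public opinion text
    text = re.sub(r"https?://\S+", "", text)
    text = re.sub(r"<[^>]+>", "", text)
    return re.sub(r"\s+", " ", text).strip()

def to_ids(text):
    # map each character to its dictionary id, falling back to [UNK]
    return [VOCAB.get(ch, VOCAB.get("[UNK]", 0)) for ch in clean(text)]

def preprocess(records, workers=8):
    # multi-process preprocessing of the collected public opinion texts
    with Pool(workers) as pool:
        return pool.map(to_ids, records)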
Example 2
Referring to FIG. 2, the differences between the model architectures before training are as follows: BERT uses a bidirectional Transformer. OpenAI GPT uses a left-to-right Transformer. ELMo uses the concatenation of independently trained left-to-right and right-to-left LSTMs to generate features for downstream tasks. Among the three, only the BERT representations are jointly conditioned on both left and right context in all layers. In addition to the architectural differences, BERT and OpenAI GPT are fine-tuning approaches, while ELMo is a feature-based approach.
The training method pre-trains on the public opinion corpus data set based on deep learning, using a mixed-precision, multi-machine multi-GPU training mode during the training process. The application uses a 160 GB public opinion training corpus and trains the RoBERTa model for one week with a batch size of 64, pre-training on a total of 3 machines with 8 GPUs each (NVIDIA Tesla V100 16G).
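A minimal sketch of such a mixed-precision, multi-machine multi-GPU training loop, assuming PyTorch DistributedDataParallel with one process per GPU (e.g. launched with torchrun across 3 nodes of 8 GPUs); build_model() and train_loader are hypothetical placeholders, and batch tensors are assumed to already be on the local GPU:

import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

dist.init_process_group("nccl")                 # one process per GPU, NCCL backend
local_rank = int(os.environ["LOCAL_RANK"])
torch.cuda.set_device(local_rank)

model = build_model().cuda(local_rank)          # hypothetical model constructor
model = DDP(model, device_ids=[local_rank])
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
scaler = torch.cuda.amp.GradScaler()            # loss scaling for mixed precision

for batch in train_loader:                      # hypothetical distributed dataloader
    optimizer.zero_grad()
    with torch.cuda.amp.autocast():             # fp16 forward pass
        loss = model(**batch).loss
    scaler.scale(loss).backward()
    scaler.step(optimizer)
    scaler.update()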
The pre-training of the public opinion corpus data set comprises the improvement of Bert, and specifically comprises the following steps:
removing the NSP task;
specifying a BERT mask type;
changing the static Mask to a dynamic Mask.
RoBERTa improves on BERT's training method in four main respects: changing the masking scheme, discarding the NSP task, optimizing the training hyperparameters, and using larger-scale training data. The improvements are as follows: (1) changing the static Mask to a dynamic Mask: the dynamic mask is equivalent to indirectly adding training data, which helps improve model performance; (2) removing the NSP task: BERT pre-trains with the NSP task in order to capture the relationship between sentences, i.e. it inputs a pair of sentences A and B and judges whether the two sentences are consecutive, with the combined maximum length of the two sentences being 512; after RoBERTa removes NSP, it inputs multiple consecutive sentences each time until the maximum length of 512 is reached (and may span articles). Removing the NSP loss matches or slightly exceeds the original BERT's performance on downstream tasks. Because BERT takes single sentences as input units, the model cannot learn long-range dependencies between words; RoBERTa inputs multiple consecutive sentences, so the model can capture longer dependencies, which is friendly to downstream tasks involving long sequences; (3) a larger batch size: the batch size of RoBERTa is 8k. Drawing on training strategies from machine translation, where a larger batch size paired with a larger learning rate improves both the optimization speed and the final performance of the model, experiments prove that BERT can indeed also use a larger batch size; (4) more training data and longer training time: RoBERTa (160 GB) uses 10 times more data than BERT (16 GB). More training data increases the diversity of the vocabulary and syntactic structures in the data.
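A minimal sketch of the difference between static and dynamic masking, using a 15% masking ratio as in BERT; the mask token id is an illustrative assumption:

import random

MASK_RATIO = 0.15
MASK_TOKEN_ID = 103  # id of [MASK] in a BERT-style vocabulary (assumption)

def dynamic_mask(token_ids):
    # a fresh mask pattern is sampled every time the example is seen,
    # so each epoch effectively yields different training data
    masked = list(token_ids)
    for i in range(len(masked)):
        if random.random() < MASK_RATIO:
            masked[i] = MASK_TOKEN_ID
    return masked

# Static masking (original BERT) would call dynamic_mask() once during data
# preparation and reuse the same masked copy in every epoch; RoBERTa instead
# applies the masking inside the training loop, i.e. per batch.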
Example 3
As shown in FIG. 1 and FIG. 3, on the downstream task data set the pre-trained model is fine-tuned with a learning rate of 3e-4, a batch size of 64 and 12 epochs, and the mask type is set to full_visual.
The RoBERTa model builds on BERT's pre-training and fine-tuning procedure. The same architecture is used in both pre-training and fine-tuning except for the output layer, and the same pre-trained parameters are used to initialize the models for different downstream tasks. During fine-tuning, all parameters are updated. [CLS] is a special symbol added before every input example, and [SEP] is a special separator token, e.g. for separating questions and answers.
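A minimal fine-tuning sketch with the hyperparameters stated above (learning rate 3e-4, batch size 64, 12 epochs), assuming the HuggingFace Trainer API; train_dataset is a hypothetical tokenized downstream data set and model is the pre-trained classifier from the earlier sketch:

from transformers import TrainingArguments, Trainer

args = TrainingArguments(
    output_dir="finetuned-roberta-sentiment",   # where the final model is saved
    learning_rate=3e-4,
    per_device_train_batch_size=64,
    num_train_epochs=12,
    fp16=True,                                  # mixed-precision fine-tuning
)

trainer = Trainer(
    model=model,                 # pre-trained model from the pipeline sketch above
    args=args,
    train_dataset=train_dataset, # hypothetical labelled downstream data set
)
trainer.train()
trainer.save_model("finetuned-roberta-sentiment")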
The prediction task of the final model of the application is as follows:
Input=[CLS]the man went to[MASK]store[SEP]he bought a gallon[MASK]milk[SEP]
Label=IsNext
Input=[CLS]the man[MASK]to the store[SEP]penguin[MASK]are flight ##less birds[SEP]
Label=NotNext
As shown in FIG. 4, the MNLI accuracy after fine-tuning is plotted starting from model parameters that have been pre-trained for k steps; the x-axis is the value of k.
Example 4
Taking neutral news as an example, the BERT input part is shown in FIG. 5, which illustrates that the input embedding is the sum of the token embeddings, the segment embeddings and the position embeddings.
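A minimal sketch of this input representation; the hidden size and maximum length follow BERT-base, while the vocabulary size and example token ids are illustrative assumptions:

import torch
import torch.nn as nn

HIDDEN, VOCAB_SIZE, MAX_LEN = 768, 21128, 512    # 21128: Chinese BERT vocabulary size (assumption)

token_emb = nn.Embedding(VOCAB_SIZE, HIDDEN)
segment_emb = nn.Embedding(2, HIDDEN)            # sentence A / sentence B
position_emb = nn.Embedding(MAX_LEN, HIDDEN)

def input_embedding(token_ids, segment_ids):
    # element-wise sum of token, segment and position embeddings
    positions = torch.arange(token_ids.size(1)).unsqueeze(0)
    return token_emb(token_ids) + segment_emb(segment_ids) + position_emb(positions)

ids = torch.tensor([[101, 2769, 4263, 102]])     # [CLS] ... [SEP] (illustrative ids)
segs = torch.zeros_like(ids)
print(input_embedding(ids, segs).shape)          # torch.Size([1, 4, 768])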
The model prediction of the application shows that:
{"label":"__label__neutral","probability":0.99930811524391174}
where "__label__neutral" indicates that the model predicts neutral emotion with a probability of 99%, so the text is judged to be neutral news.
Taking negative news as an example, the result is:
{"label":"__label__negative","probability":0.9227299332618713}
wherein "__ label __ negative" indicates that the model predicts negative emotion, the probability is 92%, and the model judges negative news.
Taking positive news as an example, the result is:
{"label":"__label__positive","probability":0.9998956918716431}
where "__label__positive" indicates that the model predicts positive emotion with a probability of 99%, so the text is judged to be positive news.
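A hypothetical client call to the HTTP interface, assuming the requests library and the illustrative /sentiment route from the serving sketch above:

import requests

resp = requests.post(
    "http://localhost:8000/sentiment",   # hypothetical deployment URL
    json={"text": "..."},                # news text submitted as JSON via POST
)
print(resp.json())                       # e.g. {"label": "__label__positive", "probability": 0.9998...}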
In practical application scenarios, public opinion events are complex and changeable, their focus is rarely single, and their development is multi-layered; the invention can nonetheless accurately judge the emotion tendencies of public opinion text.
The foregoing disclosure is merely illustrative of some embodiments of the invention, but the embodiments are not limited thereto and variations within the scope of the invention will be apparent to those skilled in the art.

Claims (4)

1. An automatic studying and judging method for the emotion tendencies of internet information is characterized by comprising the following steps:
establishing a public opinion corpus data set;
establishing a RoBERTa model, importing the public opinion corpus data set for pre-training, and applying RoBERTa's improvements over BERT to obtain a pre-trained model;
the pre-training of the public opinion corpus data set comprises the improvement of Bert, and specifically comprises the following steps:
removing the NSP task;
specifying a BERT mask type;
changing the static Mask to a dynamic Mask;
pre-training of the public opinion corpus data set is performed based on deep learning, and a mixed precision multi-machine multi-GPU training mode is used in the training process;
fine-tuning parameters of the pre-training model based on the downstream task data set, and storing a final model after fine-tuning;
fine-tuning the pre-training model with a learning rate of 3e-4, a batch size of 64 and 12 epochs, and setting the mask type to full_visual;
and outputting emotion tendency probability after final model prediction, so as to realize automatic research and judgment.
2. The automatic research and judgment method for emotion tendencies of internet information according to claim 1, wherein the preprocessing of the public opinion corpus dataset comprises the steps of:
collecting public opinion data marked with emotion tendencies in public opinion corpus data set, and cleaning the data;
formatting public opinion data;
converting public opinion data as required by using a Chinese character dictionary file;
and carrying out multi-process preprocessing on the public opinion data.
3. The automatic studying and judging method for emotion tendencies of internet information according to claim 1, wherein the final model is configured with an HTTP interface, the HTTP interface adopts a data submitting mode of POST, and a transmission format is JSON.
4. The method of claim 1, wherein the emotional tendency output after model prediction comprises positive emotional tendency, negative emotional tendency, neutral emotional tendency and irrelevant emotional tendency.
CN202110769546.3A 2021-07-07 2021-07-07 Automatic studying and judging method for emotion tendencies of internet information Active CN113535899B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110769546.3A CN113535899B (en) 2021-07-07 2021-07-07 Automatic studying and judging method for emotion tendencies of internet information

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110769546.3A CN113535899B (en) 2021-07-07 2021-07-07 Automatic studying and judging method for emotion tendencies of internet information

Publications (2)

Publication Number Publication Date
CN113535899A CN113535899A (en) 2021-10-22
CN113535899B true CN113535899B (en) 2024-02-27

Family

ID=78127065

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110769546.3A Active CN113535899B (en) 2021-07-07 2021-07-07 Automatic studying and judging method for emotion tendencies of internet information

Country Status (1)

Country Link
CN (1) CN113535899B (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117076751B (en) * 2023-10-10 2024-01-16 西安康奈网络科技有限公司 Public opinion event development trend judging system based on multidimensional feature analysis


Patent Citations (11)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2017107957A1 (en) * 2015-12-22 2017-06-29 中兴通讯股份有限公司 Human face image retrieval method and apparatus
CN112529146A (en) * 2019-09-18 2021-03-19 华为技术有限公司 Method and device for training neural network model
CN111667069A (en) * 2020-06-10 2020-09-15 中国工商银行股份有限公司 Pre-training model compression method and device and electronic equipment
CN113011185A (en) * 2020-07-17 2021-06-22 上海浦东华宇信息技术有限公司 Legal field text analysis and identification method, system, storage medium and terminal
CN112183486A (en) * 2020-11-02 2021-01-05 中山大学 Method for rapidly identifying single-molecule nanopore sequencing base based on deep network
CN112270181A (en) * 2020-11-03 2021-01-26 北京明略软件系统有限公司 Sequence labeling method, system, computer readable storage medium and computer device
CN112434523A (en) * 2020-11-25 2021-03-02 上海极链网络科技有限公司 Text auditing device and method for reducing false alarm rate of harmonic matching of sensitive words
CN112597759A (en) * 2020-11-30 2021-04-02 深延科技(北京)有限公司 Text-based emotion detection method and device, computer equipment and medium
CN112508609A (en) * 2020-12-07 2021-03-16 深圳市欢太科技有限公司 Crowd expansion prediction method, device, equipment and storage medium
CN112784041A (en) * 2021-01-06 2021-05-11 河海大学 Chinese short text emotion orientation analysis method
CN112883720A (en) * 2021-01-25 2021-06-01 北京瑞友科技股份有限公司 Text emotion classification system and method based on double models

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
Research Progress on Key Technologies of Knowledge Graphs Based on Geometric Deep Learning; Du Bo, Wan Guojia, Ji Ying; Aero Weaponry; 2020-05-28 (03); full text *
Research on Text Word Vectors and Pre-trained Language Models; Xu Feifei, Feng Dongsheng; Journal of Shanghai University of Electric Power; 2020-08-15 (04); full text *

Also Published As

Publication number Publication date
CN113535899A (en) 2021-10-22

Similar Documents

Publication Publication Date Title
CN108415977B (en) Deep neural network and reinforcement learning-based generative machine reading understanding method
CN110929030B (en) Text abstract and emotion classification combined training method
CN111177376B (en) Chinese text classification method based on BERT and CNN hierarchical connection
CN111738251B (en) Optical character recognition method and device fused with language model and electronic equipment
CN110717334A (en) Text emotion analysis method based on BERT model and double-channel attention
CN108984683A (en) Extracting method, system, equipment and the storage medium of structural data
CN107392147A (en) A kind of image sentence conversion method based on improved production confrontation network
CN109063164A (en) A kind of intelligent answer method based on deep learning
CN110263165A (en) A kind of user comment sentiment analysis method based on semi-supervised learning
CN112395417A (en) Network public opinion evolution simulation method and system based on deep learning
Zhang et al. A BERT fine-tuning model for targeted sentiment analysis of Chinese online course reviews
CN115630156A (en) Mongolian emotion analysis method and system fusing Prompt and SRU
CN113535899B (en) Automatic studying and judging method for emotion tendencies of internet information
CN113901289A (en) Unsupervised learning-based recommendation method and system
CN111967267A (en) XLNET-based news text region extraction method and system
CN111914555B (en) Automatic relation extraction system based on Transformer structure
CN111444328A (en) Natural language automatic prediction inference method with interpretation generation
CN113204976B (en) Real-time question and answer method and system
CN114444515A (en) Relation extraction method based on entity semantic fusion
CN110991515A (en) Image description method fusing visual context
CN113326367A (en) Task type dialogue method and system based on end-to-end text generation
CN110334204B (en) Exercise similarity calculation recommendation method based on user records
CN109871537A (en) A kind of high-precision Thai subordinate sentence method
CN114662456A (en) Image ancient poem generation method based on Faster R-convolutional neural network detection model
CN114357166A (en) Text classification method based on deep learning

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant