CN113535899A - Automatic studying and judging method for internet information emotion tendentiousness - Google Patents

Automatic studying and judging method for internet information emotion tendentiousness

Info

Publication number
CN113535899A
CN113535899A (application CN202110769546.3A)
Authority
CN
China
Prior art keywords
training
model
data
public sentiment
data set
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202110769546.3A
Other languages
Chinese (zh)
Other versions
CN113535899B (en)
Inventor
郭齐
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Xi'an Kangnai Network Technology Co ltd
Original Assignee
Xi'an Kangnai Network Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Xi'an Kangnai Network Technology Co ltd filed Critical Xi'an Kangnai Network Technology Co ltd
Priority to CN202110769546.3A priority Critical patent/CN113535899B/en
Publication of CN113535899A publication Critical patent/CN113535899A/en
Application granted granted Critical
Publication of CN113535899B publication Critical patent/CN113535899B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G06F16/3344: Query execution using natural language analysis
    • G06F16/374: Thesaurus
    • G06F16/953: Querying, e.g. by the use of web search engines
    • G06F40/211: Syntactic parsing, e.g. based on context-free grammar [CFG] or unification grammars
    • G06F40/253: Grammatical analysis; Style critique
    • G06F40/30: Semantic analysis
    (all under G Physics; G06 Computing; G06F Electric digital data processing; G06F16/00 Information retrieval and G06F40/00 Handling natural language data)

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Computational Linguistics (AREA)
  • Databases & Information Systems (AREA)
  • Artificial Intelligence (AREA)
  • Data Mining & Analysis (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • General Health & Medical Sciences (AREA)
  • Machine Translation (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention discloses an automatic studying and judging method for internet information emotion tendentiousness, relating to the technical field of language emotion analysis. The method pre-trains a RoBERTa model on general corpora and fine-tunes it for the downstream task, and uses mixed-precision, multi-machine, multi-GPU training in the deep learning training process. After suitable hyperparameters are found through training, the model is deployed and an interface is provided to complete the automatic studying and judging work. The method solves the problems in traditional public opinion emotion studying and judging work of low accuracy, poor generalization of the studying and judging model, and poor performance when dealing with complex Chinese contexts such as veiled or ambiguous expressions.

Description

Automatic studying and judging method for internet information emotion tendentiousness
Technical Field
The invention relates to the technical field of language emotion analysis, in particular to an automatic studying and judging method for emotion tendencies of internet information.
Background
According to the 47th "Statistical Report on Internet Development in China" issued by the China Internet Network Information Center (CNNIC), the number of Chinese internet users had reached 989 million as of December 2020. The internet therefore supplies us with a vast amount of data, and analysis of netizens' sentiment is an essential step in network public opinion analysis.
With the continuing development of the internet era, internet public opinion sentiment analysis has become an indispensable means of understanding social opinion, grasping public opinion trends, and quickly responding to and handling emergencies. Automatic studying and judging of the emotional tendency of internet public opinion is a vivid application combining big data and artificial intelligence.
However, existing emotion analysis schemes generally adopt traditional machine learning techniques such as support vector machines, logistic regression, CNN neural networks and LSTM neural networks. These techniques are not good enough at natural language processing, which is the core of automatic studying and judging of emotion tendentiousness: when facing complex Chinese contexts such as veiled or ambiguous expressions, model generalization is poor, and there is considerable room to improve the accuracy of emotion tendentiousness judgment.
Aiming at the problems in the prior art, the present application provides an automatic studying and judging method for the emotion tendentiousness of internet information, which solves the problems in traditional public opinion emotion studying and judging work of low accuracy, insufficient generalization of the studying and judging model, and poor performance when dealing with complex Chinese contexts such as veiled or ambiguous expressions.
Disclosure of Invention
The invention aims to provide an automatic studying and judging method for internet information emotion tendentiousness, solving the problems in traditional public opinion emotion studying and judging work of low accuracy, poor generalization of the studying and judging model, and poor performance when dealing with complex Chinese contexts such as veiled or ambiguous expressions.
The invention provides an automatic studying and judging method for internet information emotion tendentiousness, which comprises the following steps:
establishing a public opinion corpus data set;
establishing a RoBERTa model, importing the public opinion corpus data set for pre-training with improvements over BERT, and obtaining a pre-trained model;
fine-tuning parameters of a pre-training model based on a downstream task data set, and storing a final model after fine tuning;
and after the final model is used for prediction, the emotional tendency probability is output, and automatic research and judgment are realized.
Further, the public sentiment corpus data set is preprocessed, and the preprocessing steps are as follows:
public sentiment data of emotion tendentiousness marked in a public sentiment corpus data set are collected, and data cleaning is carried out on the data;
formatting public opinion data;
converting public opinion data as required by using a Chinese character dictionary file;
and carrying out multi-process preprocessing on the public sentiment data.
Furthermore, pre-training of the public sentiment corpus data set is performed based on deep learning, and a mixed precision multi-machine multi-GPU training mode is used in the training process.
Further, the pre-training of the public sentiment corpus data set includes improving Bert, and specifically includes:
removing the NSP task;
specifying a BERT mask type;
changing the static Mask to a dynamic Mask.
Further, for fine-tuning the pre-trained model, the learning rate parameter is 3e-4, the batch size parameter is 64, the epochs parameter is 12, and the mask type is set to full_visible.
Further, the final model configures an HTTP interface, the data submission mode adopted by the HTTP interface is POST, and the transmission format is JSON.
Further, the emotional tendency output after model prediction comprises positive emotional tendency, negative emotional tendency, neutral emotional tendency and irrelevant emotional tendency.
Compared with the prior art, the invention has the following remarkable advantages:
the invention provides an automatic studying and judging method for internet information emotion tendentiousness, which adopts a method of pre-training by using a RoBERTA model on general linguistic data and finely adjusting downstream tasks. And a mixed precision and multi-machine multi-GPU training mode is used in the deep learning training process. After the super-participatory training is found, the model is deployed and an interface is provided to finish the automatic studying and judging work, and the method is good in robustness and strong in model generalization capability, and can provide the studying and judging result with high accuracy in the face of special Chinese context.
The invention provides an automatic studying and judging method for internet information emotion tendentiousness, which improves Bert in a pre-training process, changes static Mask into dynamic Mask, indirectly increases training data and is beneficial to improving model performance. Eliminating NSP loss can be on the same level or slightly improved in the performance of downstream tasks as the original BERT.
And thirdly, the RoBERTA model (160G) uses data which is 10 times more than that of the Bert model (16G) in the automatic judging method for the internet information emotion tendentiousness. More training data increases the diversity of the vocabulary, syntactic structure, and syntactic structure data.
Drawings
FIG. 1 is a block diagram of a fine-tuning pre-training model according to an embodiment of the present invention;
FIG. 2 is a diagram of model architecture differences before training provided by an embodiment of the present invention;
fig. 3 is a schematic diagram illustrating a fine adjustment principle of RoBERTa on different tasks according to an embodiment of the present invention;
fig. 4 is a fine-tuned MNLI accuracy diagram provided by an embodiment of the present invention;
fig. 5 is a diagram illustrating a BERT input portion according to an embodiment of the present invention.
Detailed Description
The technical solutions of the embodiments of the present invention are clearly and completely described below with reference to the drawings in the present invention, and it is obvious that the described embodiments are some embodiments of the present invention, but not all embodiments. All other embodiments, which can be obtained by a person skilled in the art without any inventive step based on the embodiments of the present invention, shall fall within the scope of protection of the present invention.
Referring to fig. 1-5, the invention provides an automatic studying and judging method for internet information emotion tendentiousness, which comprises the following steps:
establishing a public opinion corpus data set, and preprocessing the public opinion corpus data set;
establishing a RoBERTa model, designating the target task of the model, importing the preprocessed public opinion corpus data set for pre-training with improvements over BERT, and obtaining a pre-trained model;
fine-tuning parameters of a pre-training model based on a downstream task data set, and storing a final model after fine tuning;
and after the final model is used for prediction, the emotional tendency probability is output, and automatic research and judgment are realized.
And the final model is configured with an HTTP interface, the data submission mode adopted by the HTTP interface is POST, the transmission format is JSON, and the emotional tendency probability is output after model prediction, so that automatic study and judgment are realized.
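As a minimal sketch of how such an HTTP interface could be wired (the handler names and the stub prediction below are hypothetical illustrations, not taken from the patent), a POST endpoint can accept a JSON body and return the predicted label and probability as JSON:

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer

def judge_sentiment(text):
    """Hypothetical stand-in for the deployed model's prediction.
    A real deployment would call the fine-tuned RoBERTa model here;
    this stub only illustrates the response shape."""
    return {"label": "__label__neutral", "probability": 0.999}

def handle_request(body: bytes) -> bytes:
    """Parse a JSON POST body {"text": ...} and build a JSON response."""
    payload = json.loads(body.decode("utf-8"))
    result = judge_sentiment(payload["text"])
    return json.dumps(result).encode("utf-8")

class JudgeHandler(BaseHTTPRequestHandler):
    def do_POST(self):
        length = int(self.headers.get("Content-Length", 0))
        response = handle_request(self.rfile.read(length))
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.end_headers()
        self.wfile.write(response)

if __name__ == "__main__":
    # To serve for real: HTTPServer(("0.0.0.0", 8000), JudgeHandler).serve_forever()
    print(handle_request(b'{"text": "some news text"}').decode("utf-8"))
```

The pure `handle_request` function keeps the JSON parsing and response building separate from the server wiring, so the same logic could sit behind any HTTP framework.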
The emotional tendency output after model prediction comprises positive emotional tendency, negative emotional tendency, neutral emotional tendency and irrelevant emotional tendency.
Example 1
The method comprises the following steps of preprocessing a public opinion corpus data set, wherein the preprocessing comprises the following steps:
public sentiment data of emotion tendentiousness marked in a public sentiment corpus data set are collected, and data cleaning is carried out on the data;
formatting public opinion data;
converting public opinion data as required by using a Chinese character dictionary file;
and carrying out multi-process preprocessing on the public sentiment data.
Example 2
Referring to fig. 2, the differences between the model architectures before pre-training: BERT uses a bidirectional Transformer. OpenAI GPT uses a left-to-right Transformer. ELMo uses the concatenation of independently trained left-to-right and right-to-left LSTMs to generate features for downstream tasks. Of the three, only the BERT representation is jointly conditioned on both left and right context in all layers. In addition to the architecture differences, BERT and OpenAI GPT are fine-tuning approaches, while ELMo is a feature-based approach.
Pre-training on the public opinion corpus data set is carried out based on deep learning, and a mixed-precision, multi-machine, multi-GPU training mode is used in the training process. The present application uses 160G of public opinion training corpora; the RoBERTa model is trained for one week with a batch size of 64, and pre-training is performed on 3 machines with 8 GPUs each (NVIDIA Tesla V100 16G).
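A minimal mixed-precision training step might look like the following PyTorch sketch; the patent does not name a framework, so PyTorch AMP is an assumption here, the tiny linear model stands in for RoBERTa, and the multi-machine multi-GPU part (e.g. DistributedDataParallel) is only indicated in a comment:

```python
import torch
from torch import nn

use_cuda = torch.cuda.is_available()
device = torch.device("cuda" if use_cuda else "cpu")

model = nn.Linear(16, 4).to(device)   # placeholder standing in for RoBERTa
# For multi-machine multi-GPU training the model would additionally be
# wrapped in torch.nn.parallel.DistributedDataParallel after init_process_group.
optimizer = torch.optim.AdamW(model.parameters(), lr=3e-4)
scaler = torch.cuda.amp.GradScaler(enabled=use_cuda)
loss_fn = nn.CrossEntropyLoss()

def train_step(x, y):
    optimizer.zero_grad()
    # autocast runs the forward pass in float16 where safe (no-op on CPU)
    with torch.cuda.amp.autocast(enabled=use_cuda):
        loss = loss_fn(model(x), y)
    scaler.scale(loss).backward()   # scale the loss to avoid fp16 underflow
    scaler.step(optimizer)
    scaler.update()
    return loss.item()

x = torch.randn(8, 16, device=device)
y = torch.randint(0, 4, (8,), device=device)
print(train_step(x, y))
```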
Pre-training on the public opinion corpus data set includes the following improvements over BERT:
removing the NSP task;
specifying a BERT mask type;
changing the static Mask to a dynamic Mask.
RoBERTa improves on BERT in its training method, mainly in four aspects: changing the mask mode, discarding the NSP task, optimizing the training hyperparameters, and using larger-scale training data. The improvements are as follows: (1) changing the static Mask to a dynamic Mask: the dynamic mask indirectly increases the training data, which helps improve model performance; (2) removing the NSP task: to capture the relationship between sentences, BERT pre-trains with the NSP task, i.e., a pair of sentences A and B is input and the model judges whether the two sentences are continuous, with the combined maximum length of the two sentences being 512; after RoBERTa removes NSP, multiple continuous sentences are input each time until the maximum length of 512 is reached (and the input may cross article boundaries). Removing the NSP loss matches or slightly improves downstream task performance compared with the original BERT. Because BERT takes single sentences as the input unit, the model cannot learn long-range dependencies between words; RoBERTa's input consists of multiple continuous sentences, so the model can capture longer dependencies, which is friendly to downstream tasks on long sequences; (3) a larger batch size: RoBERTa uses a batch size of 8k. Borrowing a training strategy from machine translation, a larger batch size combined with a larger learning rate improves both the optimization speed and model performance, and experiments show that BERT can indeed also use a larger batch size; (4) more training data and longer training: RoBERTa (160G) uses 10 times more data than BERT (16G). More training data increases the diversity of the vocabulary and syntactic structures seen by the model.
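The effect of dynamic masking in point (1) can be illustrated with a small sketch; the 15% mask probability follows the usual BERT/RoBERTa setting, and the token sequence is a made-up example:

```python
import random

MASK_PROB = 0.15  # BERT/RoBERTa mask out roughly 15% of tokens

def dynamic_mask(tokens, rng):
    """Draw a fresh random mask each time a sequence is fed to training,
    so the same sentence is masked differently in different epochs."""
    return ["[MASK]" if rng.random() < MASK_PROB else t for t in tokens]

tokens = list("互联网舆情情感研判")
rng = random.Random(0)

# Static masking would compute the mask once during data preparation and
# reuse it every epoch; dynamic masking re-draws it per epoch, which
# indirectly enlarges the training data the model sees.
epoch1 = dynamic_mask(tokens, rng)
epoch2 = dynamic_mask(tokens, rng)
print(epoch1)
print(epoch2)
```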
Example 3
As shown in FIGS. 1 and 3, for fine-tuning the pre-trained model on the downstream task data set, the learning rate parameter is 3e-4, the batch size parameter is 64, the epochs parameter is 12, and the mask type is set to full_visible.
The RoBERTa model follows BERT's overall pre-training and fine-tuning procedure. Apart from the output layer, the same architecture is used in both pre-training and fine-tuning, and the same pre-trained model parameters are used to initialize the models for different downstream tasks. During fine-tuning, all parameters are fine-tuned. [CLS] is a special symbol added in front of every input example, and [SEP] is a special separator token, for example separating questions and answers.
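The stated fine-tuning hyperparameters can be collected in a small configuration sketch; the linear warmup-and-decay learning-rate schedule shown is a common BERT-style choice assumed here for illustration, not something the patent prescribes:

```python
# Fine-tuning hyperparameters as stated in the description.
FINETUNE_CONFIG = {
    "learning_rate": 3e-4,
    "batch_size": 64,
    "epochs": 12,
    "mask_type": "full_visible",
}

def lr_at_step(step, total_steps, warmup_steps=100,
               base_lr=FINETUNE_CONFIG["learning_rate"]):
    """Linear warmup to base_lr over warmup_steps, then linear decay to 0.
    This schedule is an assumed illustration, not taken from the patent."""
    if step < warmup_steps:
        return base_lr * step / warmup_steps
    remaining = total_steps - step
    return base_lr * max(remaining, 0) / (total_steps - warmup_steps)

total = 1000
print(lr_at_step(50, total))    # mid-warmup: half of 3e-4
print(lr_at_step(100, total))   # peak learning rate
print(lr_at_step(1000, total))  # end of training: decayed to 0
```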
The prediction task of the final model of the application is as follows:
Input = [CLS] the man went to [MASK] store [SEP] he bought a gallon [MASK] milk [SEP]
Label = IsNext
Input = [CLS] the man [MASK] to the store [SEP] penguin [MASK] are flight ##less birds [SEP]
Label = NotNext
As shown in fig. 4, the MNLI accuracy after fine-tuning is plotted, starting from model parameters pre-trained for k steps; the x-axis is the value of k.
Example 4
Taking neutral news as an example, its BERT input portion is shown in fig. 5, where the BERT input representation illustrates that the input embedding is the sum of the token embedding, the segment embedding, and the position embedding.
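This summation can be illustrated with a toy sketch; the embedding dimension, tables, and values are made up purely to show that each position's input embedding is the element-wise sum of the three embeddings:

```python
DIM = 4  # toy embedding dimension (real BERT-base uses 768)

# Made-up lookup tables for illustration only.
token_table = {"[CLS]": [0.1] * DIM, "好": [0.2] * DIM, "[SEP]": [0.3] * DIM}
segment_table = {0: [0.01] * DIM, 1: [0.02] * DIM}
position_table = [[0.001 * p] * DIM for p in range(512)]

def input_embedding(tokens, segments):
    """Sum token, segment, and position embeddings element-wise
    for each position in the input sequence."""
    out = []
    for pos, (tok, seg) in enumerate(zip(tokens, segments)):
        vec = [t + s + p for t, s, p in zip(
            token_table[tok], segment_table[seg], position_table[pos])]
        out.append(vec)
    return out

emb = input_embedding(["[CLS]", "好", "[SEP]"], [0, 0, 0])
print(emb[1])  # each element: token 0.2 + segment 0.01 + position 0.001
```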
The model of the application predicts that the result is as follows:
{"label":"__label__neutral","probability":0.99930811524391174}
wherein "__label__neutral" indicates that the model predicts neutral emotion with a probability of 99%, and the news is judged to be neutral news.
Taking negative news as an example, the result is as follows through model prediction:
{"label":"__label__negative","probability":0.9227299332618713}
wherein "__label__negative" indicates that the model predicts negative emotion with a probability of 92%, and the news is judged to be negative news.
Taking the positive news as an example, the result is predicted by a model as follows:
{"label":"__label__positive","probability":0.9998956918716431}
wherein "__label__positive" indicates that the model predicts positive emotion with a probability of 99%, and the news is judged to be positive news.
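A client receiving such responses might interpret them as follows; the helper function is a hypothetical illustration that parses the JSON outputs shown above:

```python
import json

def interpret(result_json):
    """Map the model's JSON output to a (tendency, percent) verdict.
    The __label__ prefix matches the example outputs above; truncating
    the probability to a whole percent mirrors the judgments quoted."""
    result = json.loads(result_json)
    tendency = result["label"].replace("__label__", "")
    return tendency, int(result["probability"] * 100)

# The three example outputs from the description:
print(interpret('{"label":"__label__neutral","probability":0.99930811524391174}'))
print(interpret('{"label":"__label__negative","probability":0.9227299332618713}'))
print(interpret('{"label":"__label__positive","probability":0.9998956918716431}'))
```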
In practical application scenarios, public opinion events are complex and changeable, the focus of public opinion is not single, and public opinion develops on multiple levels; the method can still accurately judge the emotional tendency of public opinion texts in such cases.
The above disclosure is only for a few specific embodiments of the present invention, however, the present invention is not limited to the above embodiments, and any variations that can be made by those skilled in the art are intended to fall within the scope of the present invention.

Claims (7)

1. An automatic studying and judging method for internet information emotion tendentiousness is characterized by comprising the following steps:
establishing a public opinion corpus data set;
establishing a RoBERTa model, importing the public opinion corpus data set for pre-training with improvements over BERT, and obtaining a pre-trained model;
fine-tuning parameters of a pre-training model based on a downstream task data set, and storing a final model after fine tuning;
and after the final model is used for prediction, the emotional tendency probability is output, and automatic research and judgment are realized.
2. The method as claimed in claim 1, wherein the public sentiment corpus data set is preprocessed, the preprocessing comprises:
public sentiment data of emotion tendentiousness marked in a public sentiment corpus data set are collected, and data cleaning is carried out on the data;
formatting public opinion data;
converting public opinion data as required by using a Chinese character dictionary file;
and carrying out multi-process preprocessing on the public sentiment data.
3. The method as claimed in claim 1, wherein the pre-training of the public sentiment corpus data set is based on deep learning, and a mixed-precision, multi-machine and multi-GPU training mode is used in the training process.
4. The method as claimed in claim 3, wherein the pre-training of the corpus data set includes improvements over BERT, specifically including:
removing the NSP task;
specifying a BERT mask type;
changing the static Mask to a dynamic Mask.
5. The method as claimed in claim 1, wherein for fine-tuning the pre-trained model the learning rate parameter is 3e-4, the batch size parameter is 64, the epochs parameter is 12, and the mask type is set to full_visible.
6. The method for automatically studying and judging the emotional tendency of the internet information as claimed in claim 1, wherein the final model is configured with an HTTP interface, the HTTP interface adopts a POST data submission mode and a JSON transmission format.
7. The method as claimed in claim 1, wherein the emotional tendencies output after model prediction include positive emotional tendencies, negative emotional tendencies, neutral emotional tendencies and irrelevant emotional tendencies.
CN202110769546.3A 2021-07-07 2021-07-07 Automatic studying and judging method for emotion tendencies of internet information Active CN113535899B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110769546.3A CN113535899B (en) 2021-07-07 2021-07-07 Automatic studying and judging method for emotion tendencies of internet information

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110769546.3A CN113535899B (en) 2021-07-07 2021-07-07 Automatic studying and judging method for emotion tendencies of internet information

Publications (2)

Publication Number Publication Date
CN113535899A true CN113535899A (en) 2021-10-22
CN113535899B CN113535899B (en) 2024-02-27

Family

ID=78127065

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110769546.3A Active CN113535899B (en) 2021-07-07 2021-07-07 Automatic studying and judging method for emotion tendencies of internet information

Country Status (1)

Country Link
CN (1) CN113535899B (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117076751A (en) * 2023-10-10 2023-11-17 西安康奈网络科技有限公司 Public opinion event development trend judging system based on multidimensional feature analysis


Patent Citations (11)

Publication number Priority date Publication date Assignee Title
WO2017107957A1 (en) * 2015-12-22 2017-06-29 中兴通讯股份有限公司 Human face image retrieval method and apparatus
CN112529146A (en) * 2019-09-18 2021-03-19 华为技术有限公司 Method and device for training neural network model
CN111667069A (en) * 2020-06-10 2020-09-15 中国工商银行股份有限公司 Pre-training model compression method and device and electronic equipment
CN113011185A (en) * 2020-07-17 2021-06-22 上海浦东华宇信息技术有限公司 Legal field text analysis and identification method, system, storage medium and terminal
CN112183486A (en) * 2020-11-02 2021-01-05 中山大学 Method for rapidly identifying single-molecule nanopore sequencing base based on deep network
CN112270181A (en) * 2020-11-03 2021-01-26 北京明略软件系统有限公司 Sequence labeling method, system, computer readable storage medium and computer device
CN112434523A (en) * 2020-11-25 2021-03-02 上海极链网络科技有限公司 Text auditing device and method for reducing false alarm rate of harmonic matching of sensitive words
CN112597759A (en) * 2020-11-30 2021-04-02 深延科技(北京)有限公司 Text-based emotion detection method and device, computer equipment and medium
CN112508609A (en) * 2020-12-07 2021-03-16 深圳市欢太科技有限公司 Crowd expansion prediction method, device, equipment and storage medium
CN112784041A (en) * 2021-01-06 2021-05-11 河海大学 Chinese short text emotion orientation analysis method
CN112883720A (en) * 2021-01-25 2021-06-01 北京瑞友科技股份有限公司 Text emotion classification system and method based on double models

Non-Patent Citations (2)

Title
徐菲菲; 冯东升: "Research on text word vectors and pre-trained language models", Journal of Shanghai University of Electric Power, no. 04, 15 August 2020 (2020-08-15) *
杜博; 万国佳; 纪颖: "Research progress on key technologies of knowledge graphs based on geometric deep learning", Aero Weaponry, no. 03, 28 May 2020 (2020-05-28) *

Cited By (2)

Publication number Priority date Publication date Assignee Title
CN117076751A (en) * 2023-10-10 2023-11-17 西安康奈网络科技有限公司 Public opinion event development trend judging system based on multidimensional feature analysis
CN117076751B (en) * 2023-10-10 2024-01-16 西安康奈网络科技有限公司 Public opinion event development trend judging system based on multidimensional feature analysis

Also Published As

Publication number Publication date
CN113535899B (en) 2024-02-27

Similar Documents

Publication Publication Date Title
JP7127106B2 (en) Question answering process, language model training method, apparatus, equipment and storage medium
KR102350543B1 (en) Semantic representation model processing method, device, electronic equipment and storage medium
CN110309839B (en) A kind of method and device of iamge description
CN111192692B (en) Entity relationship determination method and device, electronic equipment and storage medium
CN110134946B (en) Machine reading understanding method for complex data
CN110929030A (en) Text abstract and emotion classification combined training method
CN109117485B (en) Method and device for generating blessing language text and computer readable storage medium
CN110532554A (en) A kind of Chinese abstraction generating method, system and storage medium
CN107392147A (en) A kind of image sentence conversion method based on improved production confrontation network
CN112699216A (en) End-to-end language model pre-training method, system, device and storage medium
CN111126061A (en) Method and device for generating antithetical couplet information
CN113051368B (en) Double-tower model training method, retrieval device and electronic equipment
CN115392259A (en) Microblog text sentiment analysis method and system based on confrontation training fusion BERT
CN113032541A (en) Answer extraction method based on bert and fusion sentence cluster retrieval
CN115294427A (en) Stylized image description generation method based on transfer learning
CN108763355A (en) A kind of intelligent robot interaction data processing system and method based on user
CN111626041A (en) Music comment generation method based on deep learning
CN113535899A (en) Automatic studying and judging method for internet information emotion tendentiousness
CN113326367B (en) Task type dialogue method and system based on end-to-end text generation
CN113934835A (en) Retrieval type reply dialogue method and system combining keywords and semantic understanding representation
CN116522165B (en) Public opinion text matching system and method based on twin structure
CN110334204B (en) Exercise similarity calculation recommendation method based on user records
CN112349294A (en) Voice processing method and device, computer readable medium and electronic equipment
Guo et al. Ernie-bilstm based Chinese text sentiment classification method
Li et al. Multilingual toxic text classification model based on deep learning

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant