CN113535899B - Automatic studying and judging method for emotion tendencies of internet information - Google Patents


Info

Publication number
CN113535899B
CN113535899B (application CN202110769546.3A)
Authority
CN
China
Prior art keywords
training
model
public opinion
emotion
data set
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202110769546.3A
Other languages
Chinese (zh)
Other versions
CN113535899A (en)
Inventor
郭齐 (Guo Qi)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Xi'an Kangnai Network Technology Co ltd
Original Assignee
Xi'an Kangnai Network Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Xi'an Kangnai Network Technology Co ltd
Priority to CN202110769546.3A
Publication of CN113535899A
Application granted
Publication of CN113535899B
Legal status: Active

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F 16/00: Information retrieval; Database structures therefor; File system structures therefor
    • G06F 16/30: Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F 16/33: Querying
    • G06F 16/3331: Query processing
    • G06F 16/334: Query execution
    • G06F 16/3344: Query execution using natural language analysis
    • G06F 16/36: Creation of semantic tools, e.g. ontology or thesauri
    • G06F 16/374: Thesaurus
    • G06F 16/90: Details of database functions independent of the retrieved data types
    • G06F 16/95: Retrieval from the web
    • G06F 16/953: Querying, e.g. by the use of web search engines
    • G06F 40/00: Handling natural language data
    • G06F 40/20: Natural language analysis
    • G06F 40/205: Parsing
    • G06F 40/211: Syntactic parsing, e.g. based on context-free grammar [CFG] or unification grammars
    • G06F 40/253: Grammatical analysis; Style critique
    • G06F 40/30: Semantic analysis

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Computational Linguistics (AREA)
  • Databases & Information Systems (AREA)
  • Artificial Intelligence (AREA)
  • Data Mining & Analysis (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • General Health & Medical Sciences (AREA)
  • Machine Translation (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention discloses an automatic studying and judging method for the emotion tendencies of internet information, relates to the technical field of language emotion analysis, and adopts a method of pre-training a RoBERTa model on a general corpus and then fine-tuning it on downstream tasks. A mixed-precision, multi-machine multi-GPU training mode is used in the deep learning training process. After suitable hyperparameters are found through training, the model is deployed and an interface is provided to complete the automatic research and judgment work. The method solves the problems of low accuracy, insufficient generalization of the research and judgment model, and poor performance in complex Chinese contexts such as innuendo and ambiguity that affect the traditional research and judgment of public opinion emotion.

Description

Automatic studying and judging method for emotion tendencies of internet information
Technical Field
The invention relates to the technical field of language emotion analysis, and in particular to an automatic studying and judging method for the emotion tendencies of internet information.
Background
According to the 47th Statistical Report on China's Internet Development issued by the China Internet Network Information Center (CNNIC), the number of Chinese internet users had reached 989 million as of December 2020. The internet therefore generates a huge amount of data and information, and the analysis of internet public opinion is an essential step in coping with it.
With the continued development of the internet age, internet public opinion emotion analysis has become an indispensable means of understanding community opinion, grasping public opinion trends, and responding quickly to emergencies. Automatic research and judgment of the emotion tendencies of internet public opinion is a vivid application of combining big data with artificial intelligence.
However, existing emotion analysis schemes generally adopt traditional machine learning techniques such as support vector machines, logistic regression, CNN neural networks and LSTM neural networks. In natural language processing, which is the core of automatic emotion tendency judgment, these techniques perform poorly when facing complex Chinese contexts such as innuendo and ambiguity, the generalization of the resulting models is weak, and the accuracy of emotion tendency judgment leaves much room for improvement.
To address the problems in the prior art, the application provides an automatic research and judgment method for the emotion tendencies of internet information, which solves the problems of low accuracy, insufficient generalization of research and judgment models, and poor performance when dealing with complex Chinese contexts such as innuendo and ambiguity in traditional public opinion emotion research and judgment work.
Disclosure of Invention
The invention aims to provide an automatic research and judgment method for the emotion tendencies of internet information, which solves the problems of low accuracy, insufficient generalization of research and judgment models, and poor performance when dealing with complex Chinese contexts such as innuendo and ambiguity in traditional public opinion emotion research and judgment work.
The invention provides an automatic studying and judging method aiming at the emotion tendencies of internet information, which comprises the following steps:
establishing a public opinion corpus data set;
establishing a RoBERTa model, importing the public opinion corpus data set for pre-training, and applying RoBERTa's improvements over BERT to obtain a pre-trained model;
fine-tuning parameters of the pre-training model based on the downstream task data set, and storing a final model after fine-tuning;
and outputting emotion tendency probabilities from the final model's predictions, thereby realizing automatic research and judgment.
Further, preprocessing is carried out on the public opinion corpus data set, and the preprocessing comprises the following steps:
collecting public opinion data marked with emotion tendencies in public opinion corpus data set, and cleaning the data;
formatting public opinion data;
converting public opinion data as required by using a Chinese character dictionary file;
and carrying out multi-process preprocessing on the public opinion data.
Further, the pre-training of the public opinion corpus data set is performed based on deep learning, and a mixed precision multi-machine multi-GPU training mode is used in the training process.
Further, the pre-training of the public opinion corpus data set comprises improvement of Bert, and specifically comprises the following steps:
removing the NSP task;
specifying a BERT mask type;
changing the static Mask to a dynamic Mask.
Further, the pre-training model is fine-tuned with a learning rate of 3e-4, a batch size of 64 and 12 epochs, and the mask type is set to full_visual.
Further, the final model configures an HTTP interface, the HTTP interface adopts a data submitting mode of POST, and a transmission format of JSON.
Further, the emotion tendencies output after model prediction include positive, negative, neutral and irrelevant emotion tendencies.
Compared with the prior art, the invention has the following remarkable advantages:
the automatic research and judgment method for the emotion tendencies of the internet information provided by the invention adopts a method of pre-training and fine-tuning a downstream task by using a RoBERTa model on a general corpus. And a mixed precision multi-machine multi-GPU training mode is used in the deep learning training process. After the super-parameter training is found, a model is deployed and an interface is provided for completing the automatic research and judgment work, and the method has the advantages of good robustness, strong model generalization capability and capability of providing high-accuracy research and judgment results facing special Chinese contexts.
Secondly, the automatic research and judgment method for the emotion tendencies of internet information improves on BERT in the pre-training process by changing the static Mask to a dynamic Mask, which indirectly increases the training data and helps improve model performance. Removing the NSP loss matches or slightly exceeds the original BERT's performance on downstream tasks.
In the automatic research and judgment method for the emotion tendencies of internet information, the RoBERTa model (160 GB) uses 10 times more data than the BERT model (16 GB). More training data increases the diversity of the vocabulary and syntactic structures in the data.
Drawings
FIG. 1 is a block diagram of a fine-tuning pre-training model provided by an embodiment of the present invention;
FIG. 2 is a diagram illustrating differences between model architectures before training according to an embodiment of the present invention;
FIG. 3 is a schematic diagram of the fine tuning principle of RoBERTa on different tasks according to the embodiment of the present invention;
FIG. 4 is a diagram of MNLI accuracy after fine-tuning according to an embodiment of the present invention;
FIG. 5 is a diagram showing the BERT input part according to an embodiment of the present invention.
Detailed Description
The following description of the embodiments of the present invention, taken in conjunction with the accompanying drawings, will clearly and completely describe the embodiments of the present invention, and it is evident that the described embodiments are some, but not all, embodiments of the present invention. All other embodiments, which can be made by those skilled in the art based on the embodiments of the present invention without making any inventive effort, shall fall within the scope of the present invention.
Referring to fig. 1-5, the invention provides an automatic studying and judging method for emotion tendencies of internet information, which comprises the following steps:
establishing a public opinion corpus data set, and preprocessing the public opinion corpus data set;
establishing a RoBERTa model, designating the target task of the model, importing the preprocessed public opinion corpus data set for pre-training, and applying RoBERTa's improvements over BERT to obtain a pre-trained model;
fine-tuning parameters of the pre-training model based on the downstream task data set, and storing a final model after fine-tuning;
and outputting emotion tendency probabilities from the final model's predictions, thereby realizing automatic research and judgment.
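A minimal sketch of this pipeline in Python, assuming the HuggingFace transformers library and the publicly available hfl/chinese-roberta-wwm-ext checkpoint as a stand-in for the patent's own pre-trained weights; the label set follows the four tendencies described below:

import torch
from transformers import BertTokenizer, BertForSequenceClassification

LABELS = ["positive", "negative", "neutral", "irrelevant"]

# stand-in checkpoint; in the patent, the custom pre-trained and fine-tuned weights would be loaded here
tokenizer = BertTokenizer.from_pretrained("hfl/chinese-roberta-wwm-ext")
model = BertForSequenceClassification.from_pretrained("hfl/chinese-roberta-wwm-ext", num_labels=len(LABELS))
model.eval()

def predict(text):
    # tokenize, run the classifier, and return the most probable emotion tendency
    inputs = tokenizer(text, truncation=True, max_length=512, return_tensors="pt")
    with torch.no_grad():
        probs = torch.softmax(model(**inputs).logits, dim=-1).squeeze(0)
    idx = int(torch.argmax(probs))
    return {"label": "__label__" + LABELS[idx], "probability": float(probs[idx])}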
The final model is configured with an HTTP interface that accepts data submitted via POST with JSON as the transmission format; the model's predicted emotion tendency probabilities are returned through this interface to realize automatic research and judgment.
The emotion tendencies output after model prediction include positive emotion tendencies, negative emotion tendencies, neutral emotion tendencies and irrelevant emotion tendencies.
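A minimal serving sketch, assuming Flask; the /sentiment route and the "text" field name are illustrative assumptions, while the POST method and JSON transmission format follow the description above, and predict() refers to the pipeline sketch shown earlier:

from flask import Flask, request, jsonify

app = Flask(__name__)

@app.route("/sentiment", methods=["POST"])  # hypothetical route name
def sentiment():
    payload = request.get_json(force=True)  # JSON body submitted via POST
    result = predict(payload["text"])       # predict() from the pipeline sketch above
    return jsonify(result)                  # e.g. {"label": "__label__neutral", "probability": 0.99...}

if __name__ == "__main__":
    app.run(host="0.0.0.0", port=8000)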
Example 1
Preprocessing the public opinion corpus data set, wherein the preprocessing comprises the following steps:
collecting public opinion data marked with emotion tendencies in public opinion corpus data set, and cleaning the data;
formatting public opinion data;
converting public opinion data as required by using a Chinese character dictionary file;
and carrying out multi-process preprocessing on the public opinion data.
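A minimal sketch of these preprocessing steps, assuming a vocab.txt-style Chinese character dictionary file; the cleaning rules and file name are illustrative assumptions:

import re
from multiprocessing import Pool

def load_vocab(path="vocab.txt"):
    # one Chinese character or token per line; line number becomes the id
    with open(path, encoding="utf-8") as f:
        return {token.strip(): idx for idx, token in enumerate(f)}

VOCAB = load_vocab()

def clean(text):
    # strip URLs, HTML tags and extra whitespace from raw public opinion text
    text = re.sub(r"https?://\S+", "", text)
    text = re.sub(r"<[^>]+>", "", text)
    return re.sub(r"\s+", " ", text).strip()

def to_ids(text):
    # map each character to its dictionary id, falling back to [UNK]
    return [VOCAB.get(ch, VOCAB.get("[UNK]", 0)) for ch in clean(text)]

def preprocess(records, workers=8):
    # multi-process preprocessing of the collected public opinion texts
    with Pool(workers) as pool:
        return pool.map(to_ids, records)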
Example 2
Referring to FIG. 2, the differences between the model architectures before training are as follows: BERT uses a bidirectional Transformer. OpenAI GPT uses a left-to-right Transformer. ELMo uses the concatenation of independently trained left-to-right and right-to-left LSTMs to generate features for downstream tasks. Among the three, only the BERT representations are jointly conditioned on both left and right context in all layers. In addition to the architectural differences, BERT and OpenAI GPT are fine-tuning approaches, while ELMo is a feature-based approach.
The training method pre-trains on the public opinion corpus data set based on deep learning, using a mixed-precision, multi-machine multi-GPU training mode during the training process. The application uses a 160 GB public opinion training corpus and trains the RoBERTa model for one week with a batch size of 64, pre-training on a total of 3 machines with 8 GPUs each (NVIDIA Tesla V100 16G).
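A minimal sketch of such a mixed-precision, multi-machine multi-GPU training loop, assuming PyTorch DistributedDataParallel with one process per GPU (e.g. launched with torchrun across 3 nodes of 8 GPUs); build_model() and train_loader are hypothetical placeholders, and batch tensors are assumed to already be on the local GPU:

import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

dist.init_process_group("nccl")                 # one process per GPU, NCCL backend
local_rank = int(os.environ["LOCAL_RANK"])
torch.cuda.set_device(local_rank)

model = build_model().cuda(local_rank)          # hypothetical model constructor
model = DDP(model, device_ids=[local_rank])
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
scaler = torch.cuda.amp.GradScaler()            # loss scaling for mixed precision

for batch in train_loader:                      # hypothetical distributed dataloader
    optimizer.zero_grad()
    with torch.cuda.amp.autocast():             # fp16 forward pass
        loss = model(**batch).loss
    scaler.scale(loss).backward()
    scaler.step(optimizer)
    scaler.update()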
The pre-training of the public opinion corpus data set comprises the improvement of Bert, and specifically comprises the following steps:
removing the NSP task;
specifying a BERT mask type;
changing the static Mask to a dynamic Mask.
RoBERTa improves on BERT's training method in four main respects: changing the masking scheme, discarding the NSP task, optimizing the training hyperparameters, and using larger-scale training data. The improvements are as follows: (1) changing the static Mask to a dynamic Mask: the dynamic mask is equivalent to indirectly adding training data, which helps improve model performance; (2) removing the NSP task: BERT pre-trains with the NSP task in order to capture the relationship between sentences, i.e. it inputs a pair of sentences A and B and judges whether the two sentences are consecutive, with the combined maximum length of the two sentences being 512; after RoBERTa removes NSP, it inputs multiple consecutive sentences each time until the maximum length of 512 is reached (and may span articles). Removing the NSP loss matches or slightly exceeds the original BERT's performance on downstream tasks. Because BERT takes single sentences as input units, the model cannot learn long-range dependencies between words; RoBERTa inputs multiple consecutive sentences, so the model can capture longer dependencies, which is friendly to downstream tasks involving long sequences; (3) a larger batch size: the batch size of RoBERTa is 8k. Drawing on training strategies from machine translation, where a larger batch size paired with a larger learning rate improves both the optimization speed and the final performance of the model, experiments prove that BERT can indeed also use a larger batch size; (4) more training data and longer training time: RoBERTa (160 GB) uses 10 times more data than BERT (16 GB). More training data increases the diversity of the vocabulary and syntactic structures in the data.
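A minimal sketch of the difference between static and dynamic masking, using a 15% masking ratio as in BERT; the mask token id is an illustrative assumption:

import random

MASK_RATIO = 0.15
MASK_TOKEN_ID = 103  # id of [MASK] in a BERT-style vocabulary (assumption)

def dynamic_mask(token_ids):
    # a fresh mask pattern is sampled every time the example is seen,
    # so each epoch effectively yields different training data
    masked = list(token_ids)
    for i in range(len(masked)):
        if random.random() < MASK_RATIO:
            masked[i] = MASK_TOKEN_ID
    return masked

# Static masking (original BERT) would call dynamic_mask() once during data
# preparation and reuse the same masked copy in every epoch; RoBERTa instead
# applies the masking inside the training loop, i.e. per batch.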
Example 3
As shown in FIG. 1 and FIG. 3, on the downstream task data set the pre-trained model is fine-tuned with a learning rate of 3e-4, a batch size of 64 and 12 epochs, and the mask type is set to full_visual.
The RoBERTa model builds on BERT's pre-training and fine-tuning procedure. The same architecture is used in both pre-training and fine-tuning except for the output layer, and the same pre-trained parameters are used to initialize the models for different downstream tasks. During fine-tuning, all parameters are updated. [CLS] is a special symbol added before every input example, and [SEP] is a special separator token, e.g. for separating questions and answers.
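A minimal fine-tuning sketch with the hyperparameters stated above (learning rate 3e-4, batch size 64, 12 epochs), assuming the HuggingFace Trainer API; train_dataset is a hypothetical tokenized downstream data set and model is the pre-trained classifier from the earlier sketch:

from transformers import TrainingArguments, Trainer

args = TrainingArguments(
    output_dir="finetuned-roberta-sentiment",   # where the final model is saved
    learning_rate=3e-4,
    per_device_train_batch_size=64,
    num_train_epochs=12,
    fp16=True,                                  # mixed-precision fine-tuning
)

trainer = Trainer(
    model=model,                 # pre-trained model from the pipeline sketch above
    args=args,
    train_dataset=train_dataset, # hypothetical labelled downstream data set
)
trainer.train()
trainer.save_model("finetuned-roberta-sentiment")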
The prediction task of the final model of the application is as follows:
Input=[CLS]the man went to[MASK]store[SEP]he bought a gallon[MASK]milk[SEP]
Label=IsNext
Input=[CLS]the man[MASK]to the store[SEP]penguin[MASK]are flight ##less birds[SEP]
Label=NotNext
As shown in FIG. 4, the MNLI accuracy after fine-tuning is plotted starting from model parameters that have been pre-trained for k steps; the x-axis is the value of k.
Example 4
Taking neutral news as an example, the BERT input part is shown in FIG. 5, which illustrates that the input embedding is the sum of the token embeddings, the segment embeddings and the position embeddings.
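A minimal sketch of this input representation; the hidden size and maximum length follow BERT-base, while the vocabulary size and example token ids are illustrative assumptions:

import torch
import torch.nn as nn

HIDDEN, VOCAB_SIZE, MAX_LEN = 768, 21128, 512    # 21128: Chinese BERT vocabulary size (assumption)

token_emb = nn.Embedding(VOCAB_SIZE, HIDDEN)
segment_emb = nn.Embedding(2, HIDDEN)            # sentence A / sentence B
position_emb = nn.Embedding(MAX_LEN, HIDDEN)

def input_embedding(token_ids, segment_ids):
    # element-wise sum of token, segment and position embeddings
    positions = torch.arange(token_ids.size(1)).unsqueeze(0)
    return token_emb(token_ids) + segment_emb(segment_ids) + position_emb(positions)

ids = torch.tensor([[101, 2769, 4263, 102]])     # [CLS] ... [SEP] (illustrative ids)
segs = torch.zeros_like(ids)
print(input_embedding(ids, segs).shape)          # torch.Size([1, 4, 768])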
The model prediction of the application shows that:
{"label":"__label__neutral","probability":0.99930811524391174}
where "__label__neutral" indicates that the model predicts neutral emotion with a probability of 99%, so the text is judged to be neutral news.
Taking negative news as an example, the result is:
{"label":"__label__negative","probability":0.9227299332618713}
wherein "__ label __ negative" indicates that the model predicts negative emotion, the probability is 92%, and the model judges negative news.
Taking positive news as an example, the result is:
{"label":"__label__positive","probability":0.9998956918716431}
where "__label__positive" indicates that the model predicts positive emotion with a probability of 99%, so the text is judged to be positive news.
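A hypothetical client call to the HTTP interface, assuming the requests library and the illustrative /sentiment route from the serving sketch above:

import requests

resp = requests.post(
    "http://localhost:8000/sentiment",   # hypothetical deployment URL
    json={"text": "..."},                # news text submitted as JSON via POST
)
print(resp.json())                       # e.g. {"label": "__label__positive", "probability": 0.9998...}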
In practical application scenarios, public opinion events are complex and changeable, their focus is rarely single, and their development is multi-layered; the invention can nonetheless accurately judge the emotion tendencies of public opinion text.
The foregoing disclosure is merely illustrative of some embodiments of the invention, but the embodiments are not limited thereto and variations within the scope of the invention will be apparent to those skilled in the art.

Claims (4)

1. An automatic studying and judging method for the emotion tendencies of internet information is characterized by comprising the following steps:
establishing a public opinion corpus data set;
establishing a RoBERTa model, importing the public opinion corpus data set for pre-training, and applying RoBERTa's improvements over BERT to obtain a pre-trained model;
the pre-training of the public opinion corpus data set comprises the improvement of Bert, and specifically comprises the following steps:
removing the NSP task;
specifying a BERT mask type;
changing the static Mask to a dynamic Mask;
pre-training of the public opinion corpus data set is performed based on deep learning, and a mixed precision multi-machine multi-GPU training mode is used in the training process;
fine-tuning parameters of the pre-training model based on the downstream task data set, and storing a final model after fine-tuning;
fine-tuning the pre-training model with a learning rate of 3e-4, a batch size of 64 and 12 epochs, and setting the mask type to full_visual;
and outputting emotion tendency probability after final model prediction, so as to realize automatic research and judgment.
2. The automatic research and judgment method for emotion tendencies of internet information according to claim 1, wherein the preprocessing of the public opinion corpus dataset comprises the steps of:
collecting public opinion data marked with emotion tendencies in public opinion corpus data set, and cleaning the data;
formatting public opinion data;
converting public opinion data as required by using a Chinese character dictionary file;
and carrying out multi-process preprocessing on the public opinion data.
3. The automatic studying and judging method for emotion tendencies of internet information according to claim 1, wherein the final model is configured with an HTTP interface, the HTTP interface adopts a data submitting mode of POST, and a transmission format is JSON.
4. The method of claim 1, wherein the emotional tendency output after model prediction comprises positive emotional tendency, negative emotional tendency, neutral emotional tendency and irrelevant emotional tendency.
CN202110769546.3A 2021-07-07 2021-07-07 Automatic studying and judging method for emotion tendencies of internet information Active CN113535899B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110769546.3A CN113535899B (en) 2021-07-07 2021-07-07 Automatic studying and judging method for emotion tendencies of internet information

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110769546.3A CN113535899B (en) 2021-07-07 2021-07-07 Automatic studying and judging method for emotion tendencies of internet information

Publications (2)

Publication Number Publication Date
CN113535899A CN113535899A (en) 2021-10-22
CN113535899B true CN113535899B (en) 2024-02-27

Family

ID=78127065

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110769546.3A Active CN113535899B (en) 2021-07-07 2021-07-07 Automatic studying and judging method for emotion tendencies of internet information

Country Status (1)

Country Link
CN (1) CN113535899B (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117076751B (en) * 2023-10-10 2024-01-16 西安康奈网络科技有限公司 Public opinion event development trend judging system based on multidimensional feature analysis


Patent Citations (11)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2017107957A1 (en) * 2015-12-22 2017-06-29 中兴通讯股份有限公司 Human face image retrieval method and apparatus
CN112529146A (en) * 2019-09-18 2021-03-19 华为技术有限公司 Method and device for training neural network model
CN111667069A (en) * 2020-06-10 2020-09-15 中国工商银行股份有限公司 Pre-training model compression method and device and electronic equipment
CN113011185A (en) * 2020-07-17 2021-06-22 上海浦东华宇信息技术有限公司 Legal field text analysis and identification method, system, storage medium and terminal
CN112183486A (en) * 2020-11-02 2021-01-05 中山大学 Method for rapidly identifying single-molecule nanopore sequencing base based on deep network
CN112270181A (en) * 2020-11-03 2021-01-26 北京明略软件系统有限公司 Sequence labeling method, system, computer readable storage medium and computer device
CN112434523A (en) * 2020-11-25 2021-03-02 上海极链网络科技有限公司 Text auditing device and method for reducing false alarm rate of harmonic matching of sensitive words
CN112597759A (en) * 2020-11-30 2021-04-02 深延科技(北京)有限公司 Text-based emotion detection method and device, computer equipment and medium
CN112508609A (en) * 2020-12-07 2021-03-16 深圳市欢太科技有限公司 Crowd expansion prediction method, device, equipment and storage medium
CN112784041A (en) * 2021-01-06 2021-05-11 河海大学 Chinese short text emotion orientation analysis method
CN112883720A (en) * 2021-01-25 2021-06-01 北京瑞友科技股份有限公司 Text emotion classification system and method based on double models

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
Research Progress on Key Technologies of Knowledge Graphs Based on Geometric Deep Learning; Du Bo, Wan Guojia, Ji Ying; Aero Weaponry; 2020-05-28 (03); full text *
Research on Text Word Vectors and Pre-trained Language Models; Xu Feifei, Feng Dongsheng; Journal of Shanghai University of Electric Power; 2020-08-15 (04); full text *

Also Published As

Publication number Publication date
CN113535899A (en) 2021-10-22

Similar Documents

Publication Publication Date Title
CN108415977B (en) Deep neural network and reinforcement learning-based generative machine reading understanding method
CN110929030B (en) Text abstract and emotion classification combined training method
CN111177376B (en) Chinese text classification method based on BERT and CNN hierarchical connection
CN111738251B (en) Optical character recognition method and device fused with language model and electronic equipment
CN110717334A (en) Text emotion analysis method based on BERT model and double-channel attention
CN108984683A (en) Extracting method, system, equipment and the storage medium of structural data
CN107392147A (en) A kind of image sentence conversion method based on improved production confrontation network
CN109063164A (en) A kind of intelligent answer method based on deep learning
CN110263165A (en) A kind of user comment sentiment analysis method based on semi-supervised learning
CN112395417A (en) Network public opinion evolution simulation method and system based on deep learning
Zhang et al. A BERT fine-tuning model for targeted sentiment analysis of Chinese online course reviews
CN115630156A (en) Mongolian emotion analysis method and system fusing Prompt and SRU
CN113535899B (en) Automatic studying and judging method for emotion tendencies of internet information
CN113901289A (en) Unsupervised learning-based recommendation method and system
CN111967267A (en) XLNET-based news text region extraction method and system
CN111914555B (en) Automatic relation extraction system based on Transformer structure
CN111444328A (en) Natural language automatic prediction inference method with interpretation generation
CN113204976B (en) Real-time question and answer method and system
CN114444515A (en) Relation extraction method based on entity semantic fusion
CN110991515A (en) Image description method fusing visual context
CN113326367A (en) Task type dialogue method and system based on end-to-end text generation
CN110334204B (en) Exercise similarity calculation recommendation method based on user records
CN109871537A (en) A kind of high-precision Thai subordinate sentence method
CN114662456A (en) Image ancient poem generation method based on Faster R-convolutional neural network detection model
CN114357166A (en) Text classification method based on deep learning

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant