CN112784602B

CN112784602B - News emotion entity extraction method based on remote supervision

Info

Publication number: CN112784602B
Application number: CN202011395972.7A
Authority: CN
Inventors: 张琨; 孙琦; 李寻; 张李林清; 刘志敏
Original assignee: Nanjing University of Science and Technology
Current assignee: Nanjing University of Science and Technology
Priority date: 2020-12-03
Filing date: 2020-12-03
Publication date: 2024-06-14
Anticipated expiration: 2040-12-03
Also published as: CN112784602A

Abstract

The invention discloses a news emotion entity extraction method based on remote supervision, comprising the following steps: crawling news corpus from official news websites and caching it to a local warehouse; preprocessing the crawled news corpus to obtain news corpus segmented into sentences; constructing a key entity knowledge base and automatically labeling the sentence-segmented news corpus according to the knowledge base; training an emotion sentence extraction model with the labeled news corpus so that it can automatically judge whether an input sentence carries emotion; training an emotion entity extraction model with the extracted emotion sentences as its training set; and, at application time, crawling news corpus, segmenting it into sentences, inputting the sentences into the trained emotion sentence extraction model to extract emotion sentences, and inputting the extracted emotion sentences into the trained emotion entity extraction model to obtain emotion entities. By using remote supervision to generate a noisy labeled data set from a large number of unlabeled samples for model training, the invention improves the efficiency of model training.

Description

News emotion entity extraction method based on remote supervision

Technical Field

The invention belongs to the field of computer artificial intelligence, and particularly relates to a news emotion entity extraction method based on remote supervision.

Background

Because of its distinctive application contexts and textual expression, named entity recognition in the news field has been explored by a number of researchers. Feng Yuntian et al. proposed an entity classification principle with categories such as personnel, soldiers, military personnel, military institutions, and facilities, and constructed a corpus from standardized texts such as combat documents, duty documents, and military documents. A CRF model was trained on a small manually labeled training corpus and then used to recognize entities in an unlabeled test corpus, reaching an F value of 90.9% on the test corpus. Another study, on the recognition of weapon named entities, established a DNN-based weapon entity recognition model that takes fixed-dimension word vectors and part-of-speech vectors as input and learns context features through nonlinear transformations. That model was trained on a corpus built from 7,500 news items from the world wide web, the Chinese net and other sites, and its F value reached 91.02%. Wang Xuefeng et al. divided named entities into 8 categories (troops, place names, institutions, weapons, facilities, time, environment, and quantity) and proposed a word-level representation-based entity recognition model combining BiLSTM and CRF (character-BiLSTM-CRF); trained on a corpus built from more than 30 undisclosed joint combat exercise scenario documents and command exercise scenario documents, it reached an F value of 98%. In addition, researchers have explored generating word vectors with convolutional neural networks and combining them with BiLSTM and CRF for named entity recognition in the news domain. For named entity recognition in unpublished combat documents, the named entities were divided, following a nested classification principle, into 5 major classes (positions, troops, personnel, articles, and numbers) and 13 subclasses such as place names and establishments, and the CNN-BiLSTM-CRF model achieved higher recall and F values in experiments on a corpus built from 100 unpublished combat documents.

Traditional emotion entity identification methods based on rules, dictionaries, and statistical learning models rely on rule design and feature engineering. Although they achieve a relatively high recall rate, formulating rules and extracting features requires extensive domain knowledge and a great deal of labor, and it is difficult to formulate uniform templates and rules for all problems. In recent years, supported by growing computing power and distributed text representation technology, emotion entity identification methods based on deep neural networks (DNN) have made breakthrough progress both in general fields and in specific fields such as law, medicine, biochemistry, and finance. Compared with emotion entity identification studies in other fields, emotion entity identification in the news field faces the following problems and challenges:

Entity recognition tasks often suffer from entity boundaries that are difficult to define. For example, in the insurance field, "China Life Insurance" may be considered one entity, or two entities, "China" and "life insurance". The specialized nature of the news field makes the boundaries between entities even harder to determine. For example, "the UK Royal Navy" may be considered a single organization entity, or "UK" may be considered a place name entity and "Royal Navy" an organization entity; likewise, "Russian Tu-160 strategic bomber" may be considered a single weaponry entity, or "Russian army" may be considered an organization entity and "Tu-160 strategic bomber" a weaponry entity.

Entity recognition tasks also face abbreviated entity expressions. Compared with other fields, the uniqueness and specialization of the news field make emotion entities obscure once they are abbreviated, and the abbreviations follow no fixed pattern.

Named entity recognition techniques based on CRF and other statistical models rely on domain experts to complete a large amount of manual feature selection; domain named entity recognition methods based on long short-term memory neural networks and similar models need a huge corpus to construct word vectors during model training.

Electronic medical records in the medical field, and judgments and indictments in the legal field, follow strict formats and expression conventions, so rule-based recognition methods achieve excellent results on them. Social media data such as microblogs, by contrast, is non-standard in expression, contains many colloquial expressions, and follows no specific rules, so entity recognition is difficult.

At present, no corpus data set or entity classification standard exists for the news field, which hinders research on open-source information.

Disclosure of Invention

The invention aims to provide a news emotion entity extraction method based on remote supervision.

The technical scheme for realizing the purpose of the invention is as follows: a news emotion entity extraction method based on remote supervision comprises the following steps:

Step 1: using crawler technology to crawl news corpus from official news websites and caching the news corpus to a local warehouse;

Step 2: preprocessing the crawled news corpus to obtain the news corpus segmented into sentences;

Step 3: constructing a key entity knowledge base, and automatically labeling the news corpus segmented into sentences according to the knowledge base;

Step 4: training the emotion sentence extraction model with the labeled news corpus so that it is able to automatically judge the emotion of an input sentence;

Step 5: extracting emotion sentences with the model of step 4 and using them as the training set of the emotion entity extraction model, so that the model is able to extract the holder, the expression object, and the event of the emotion in a sentence;

Step 6: crawling news corpus and segmenting it into sentences by the methods of step 1 and step 2, inputting the sentence-segmented news corpus into the trained emotion sentence extraction model to extract emotion sentences, and inputting the extracted emotion sentences into the trained emotion entity extraction model to obtain emotion entities.

Preferably, the specific method for crawling relevant news from official news websites is as follows:

obtaining the URLs of news related to the event by parsing the keyword search results of the official websites;

parsing the news content at each news URL, obtaining the title, time, and specific content of the news, and caching the news to a local warehouse.

Preferably, preprocessing the crawled news corpus includes:

cleaning the crawled news corpus, and removing redundant data and dirty data irrelevant to the topic;

splitting the news corpus in the local warehouse into sentences, using punctuation marks as delimiters.

Preferably, the constructed key entity knowledge base is a knowledge base of person, organization, country, and event entities.

Preferably, the principle of automatically labeling the news corpus divided into sentences according to the knowledge base is as follows: when more than n knowledge base entities appear in the sentence, the sentence is marked as a sentence with emotion, and n is a set natural number.

Preferably, the emotion sentence extraction model includes a word vector expression layer and a SoftMax classification layer, which are respectively specified as follows:

the word vector expression layer adopts a BERT pre-training model and is used for extracting characteristics of each word in the news text data segmented into sentences to obtain word characteristics;

The SoftMax classification layer is used for predicting probability distribution on output categories and decoding labels, and judging whether an input sentence is an emotion sentence or not according to a prediction result.

Preferably, the emotion entity extraction model includes a word vector layer, an encoder, and a decoder, which are respectively specified as follows:

The word vector layer adopts a BERT pre-training model for obtaining the word features of emotion sentences;

The encoder adopts a bidirectional long-short-term memory neural network for extracting semantic features of an input text;

The decoder adopts a conditional random field for decoding semantic features into corresponding labels, and obtains the corresponding entity positions and entity categories according to the predicted label values.

Compared with the prior art, the invention has the remarkable advantages that:

according to the invention, under the condition that a large number of unmarked samples exist, a noisy data set is generated for the large number of samples by adopting a remote supervision mode for model training, so that the cost of manual marking is greatly reduced, and the efficiency of model training is improved;

To address the problems and challenges specific to the news field, the invention designs an emotion sentence extraction technique based on BERT word vectors, which concentrates entity extraction on a more meaningful range of text and thereby greatly improves the efficiency of entity extraction;

The invention uses an entity extraction network based on multi-model fusion, combined with an expert knowledge base, to extract emotion holders, emotion expression objects, and related event information from emotion sentences, laying the groundwork as an upstream task for emotion analysis and public opinion analysis in the news field.

Drawings

FIG. 1 is a flow chart of the present invention.

FIG. 2 is the training and testing flow of the emotion sentence extraction model.

FIG. 3 is the training and testing flow of the emotion entity extraction model.

FIG. 4 is the LSTM structure diagram.

FIG. 5 is the CRF structure diagram.

Detailed Description

A news emotion entity extraction method based on remote supervision, as shown in fig. 1, comprises the following steps:

Step 1, crawler technology is used to crawl, for trending news events, the relevant news corpus from official news websites such as world wide web, internet news, new-bloom daily news and the like. The specific method is as follows: the URLs of news related to the event are obtained by parsing the keyword search results of the official websites; the news content at each URL is then parsed to obtain the title, time, specific content, and other data of the news, and the data is cached in a local warehouse.

Step 2, the crawled news corpus is read from the local warehouse and cleaned: redundant data and dirty data irrelevant to the topic are removed, and useless repeated sentences in the news are deleted. The cleaned data is stored in a structured form for training the algorithm models.
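The cleaning step can be illustrated with a minimal sketch. The patent does not specify the cleaning rules, so the function name `clean_article` and the choices below (stripping leftover HTML tags, normalizing whitespace, dropping empty and repeated lines) are assumptions made for illustration only.

```python
import re

def clean_article(text: str) -> str:
    """Hypothetical cleaner: strip tag residue, normalize spaces, drop repeats."""
    text = re.sub(r"<[^>]+>", "", text)           # remove residual HTML tags
    seen, kept = set(), []
    for line in text.splitlines():
        line = re.sub(r"\s+", " ", line).strip()  # collapse runs of whitespace
        if not line or line in seen:              # skip blanks and repeated lines
            continue
        seen.add(line)
        kept.append(line)
    return "\n".join(kept)
```

Exact-match de-duplication is the simplest choice; a real crawler would likely also need site-specific rules for navigation text and advertisements.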

The data in the local warehouse is then split into sentences, using punctuation marks such as "。", "？", "！", and "……" as delimiters.
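Punctuation-based sentence splitting can be sketched as follows. The delimiter set is an assumption: Chinese and ASCII sentence-final marks plus the Chinese ellipsis, with a closing quotation mark kept attached to the sentence it ends. The ASCII period is deliberately left out because it would also split decimals and abbreviations.

```python
import re

# Assumed delimiters: sentence-final marks or the ellipsis, optionally
# followed by a closing quotation mark that belongs to the same sentence.
_SENT_END = re.compile(r'([。！？!?]|……)([”"」』]?)')

def split_sentences(text: str) -> list:
    """Split text into sentences, keeping each delimiter with its sentence."""
    marked = _SENT_END.sub(r"\1\2\n", text)   # newline after every sentence end
    return [s.strip() for s in marked.split("\n") if s.strip()]
```

For example, `split_sentences("今天下雨。你好吗？")` yields two sentences, each still carrying its final punctuation mark.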

Step 3, constructing a key entity knowledge base, and automatically labeling news corpus divided into sentences according to the knowledge base;

A key entity knowledge base of persons, organizations, countries, events, and the like is established from the data in the local warehouse, and the sentence-segmented news is automatically labeled according to this knowledge base. The labeling principle is as follows: a sentence is labeled as a sentence with emotion when more than n knowledge base entities appear in it. Here n is an adjustable parameter. This remote supervision approach yields a large amount of noisy training data.
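The labeling rule above can be sketched in a few lines. This is an illustration under assumptions: entities are matched by plain substring lookup, each knowledge base entity is counted at most once per sentence, and the knowledge base contents in the usage note are invented examples.

```python
def count_kb_entities(sentence: str, knowledge_base: set) -> int:
    """Count the distinct knowledge base entities occurring in the sentence."""
    return sum(1 for entity in knowledge_base if entity in sentence)

def label_sentences(sentences, knowledge_base, n=1):
    """Return (sentence, label) pairs; label 1 marks a (noisy) emotion sentence.

    A sentence gets label 1 when MORE THAN n knowledge base entities appear in it.
    """
    return [(s, int(count_kb_entities(s, knowledge_base) > n)) for s in sentences]
```

With a hypothetical knowledge base `{"France", "Germany", "summit"}` and `n=1`, the sentence "France and Germany attended the summit." receives label 1, while "The weather was mild." receives label 0. Raising n trades recall for less label noise.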

Step 4, as shown in fig. 2, the sentence-segmented news text data is divided into a training set and a test set according to the 80/20 principle; the emotion sentence extraction model is trained on the training set, and the accuracy and performance of the trained model are analyzed on the test set.
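The split and the accuracy check can be sketched as below, assuming that the larger share (80%) goes to training and that samples are shuffled before splitting; the fixed seed is an illustrative choice that makes the split reproducible.

```python
import random

def train_test_split(samples, train_ratio=0.8, seed=42):
    """Shuffle the samples and cut them into (train, test) at train_ratio."""
    shuffled = list(samples)
    random.Random(seed).shuffle(shuffled)
    cut = int(len(shuffled) * train_ratio)
    return shuffled[:cut], shuffled[cut:]

def accuracy(predictions, labels):
    """Fraction of predictions that equal the reference labels."""
    correct = sum(p == y for p, y in zip(predictions, labels))
    return correct / len(labels)
```

Because the labels come from remote supervision, accuracy measured on such a test set reflects agreement with the noisy labels, not with human judgment.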

In a further embodiment, the emotion sentence extraction model includes a word vector expression layer and a SoftMax classification layer.

Specifically, the word vector expression layer adopts a BERT pre-training model. BERT uses a Transformer encoder as its language model and adopts a masked language model together with a next sentence prediction mechanism, overcoming the unidirectionality of most current word vector generation models. The BERT pre-training model extracts features for each word in the sentence-segmented news text data $S_i = \{X_{i1}, X_{i2}, \ldots, X_{ik}\}$ to obtain the word features $X_{ij} = (e_1, e_2, \ldots, e_m)$, where $S_i$ denotes the i-th sentence in the data set, $X_{ik}$ denotes the k-th word in the sentence, $X_{ij}$ denotes the word vector of the j-th word of the i-th sentence, and $e_m$ denotes the m-th component of $X_{ij}$. In summary, after a sentence passes through the word vector expression layer, each of its words is represented by an m-dimensional word vector, so the sentence can be written as:

$$S_i = \begin{pmatrix} e_{11} & e_{12} & \cdots & e_{1m} \\ \vdots & \vdots & \ddots & \vdots \\ e_{k1} & e_{k2} & \cdots & e_{km} \end{pmatrix}$$

where $S_i$ denotes the i-th sentence in the data set and $e_{km}$ denotes the m-th component of the k-th word in the i-th sentence.

Specifically, the SoftMax classification layer serves as the classifier for emotion sentence classification. It normalizes the output of the network into a probability distribution over the predicted output categories, mapping each output to a value in the interval (0, 1):

$$z^{l} = W^{l} a^{l-1} + b^{l}$$

$$a_i^{l} = \frac{e^{z_i^{l}}}{\sum_j e^{z_j^{l}}}$$

where $W^{l}$ is the weight matrix, $b^{l}$ is the bias, $a^{l-1}$ is the output of the previous layer, and $z_i^{l}$ is the intermediate value computed for the i-th node of layer $l$. The SoftMax layer normalizes the result and decodes the label, and the result determines whether the input sentence is an emotion sentence or a non-emotion sentence.
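A small numerical sketch of the SoftMax classification layer follows. It assumes a two-class output (index 0 for non-emotion, index 1 for emotion) and takes a plain feature vector as input in place of the BERT sentence features; the function names and the class ordering are illustrative assumptions.

```python
import math

def softmax(logits):
    """Normalize raw scores into probabilities in (0, 1) that sum to 1."""
    m = max(logits)                              # subtract max for numerical stability
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def classify(features, weights, bias):
    """Linear layer followed by SoftMax; returns (predicted class, probabilities)."""
    logits = [sum(w * x for w, x in zip(row, features)) + b
              for row, b in zip(weights, bias)]
    probs = softmax(logits)
    return probs.index(max(probs)), probs
```

Decoding the label amounts to taking the class with the highest probability, as `classify` does in its last line.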

The emotion sentence extraction model thus extracts the sentences with emotional tendency from long news texts.

Step 5, the training and testing flow of the emotion entity extraction model is shown in fig. 3. Based on the extracted emotion sentences, the emotion holders, the expression objects, and the events related to the emotion in the sentences are extracted. The important entities in emotion sentences are identified with a sequence-to-sequence model based on a deep learning algorithm.

In a further embodiment, the emotion entity extraction model consists of three parts: a word vector layer, an encoder, a decoder;

Specifically, the word vector layer also employs a BERT pre-training model. Its input is the emotion sentences extracted by the emotion sentence extraction model, and its output is the word vector representation of those sentences.

Specifically, the encoder employs a bidirectional long short-term memory neural network (Bi-LSTM) to extract semantic features of the input text. LSTM is a special type of recurrent neural network (RNN) that can learn long-term dependency information. All RNNs have the form of a chain of repeating neural network modules. In a standard RNN, the repeating module has only a very simple structure, for example a single tanh layer, whereas the "memory cell" of LSTM is deliberately designed to avoid the long-term dependency problem. LSTM controls the cell state through carefully designed structures called gates, which remove information from or add information to the cell state. With Bi-LSTM, two feature extractors running in opposite directions capture global feature information over the whole text, improving the encoder's ability to extract features from the entire text. The LSTM model is calculated as follows:

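$$i_t = \sigma(W_{xi} x_t + W_{hi} h_{t-1} + W_{ci} c_{t-1} + b_i)$$

$$f_t = \sigma(W_{xf} x_t + W_{hf} h_{t-1} + W_{cf} c_{t-1} + b_f)$$

$$c_t = f_t c_{t-1} + i_t \tanh(W_{xc} x_t + W_{hc} h_{t-1} + b_c)$$

$$o_t = \sigma(W_{xo} x_t + W_{ho} h_{t-1} + W_{co} c_{t-1} + b_o)$$

$$h_t = o_t \tanh(c_t)$$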

where i, f, c, and o are the input gate, the forget gate, the cell state, and the output gate, respectively; W and b are the corresponding weight coefficient matrices and bias terms; σ and tanh are the sigmoid function and the hyperbolic tangent activation function, respectively.
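The five expressions can be traced with a pure-Python sketch of one LSTM step and of the bidirectional pass. It follows the equations as written, including the cell-state terms W_ci, W_cf, and W_co applied to c_{t-1}, and treats products between gate vectors as element-wise. The dictionary layout of the parameters and the helper names are assumptions for illustration; trained weights would come from the training procedure.

```python
import math

def matvec(W, v):
    """Multiply matrix W (list of rows) by vector v."""
    return [sum(w * x for w, x in zip(row, v)) for row in W]

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def lstm_step(x_t, h_prev, c_prev, p):
    """One LSTM step; p maps names such as 'W_xi' and 'b_i' to parameters."""
    def pre(wx, wh, wc, b):
        terms = [matvec(p[wx], x_t), matvec(p[wh], h_prev), p[b]]
        if wc is not None:                        # cell-state (peephole) term
            terms.append(matvec(p[wc], c_prev))
        return [sum(vals) for vals in zip(*terms)]

    i_t = [sigmoid(z) for z in pre("W_xi", "W_hi", "W_ci", "b_i")]   # input gate
    f_t = [sigmoid(z) for z in pre("W_xf", "W_hf", "W_cf", "b_f")]   # forget gate
    g_t = [math.tanh(z) for z in pre("W_xc", "W_hc", None, "b_c")]   # candidate
    c_t = [f * c + i * g for f, c, i, g in zip(f_t, c_prev, i_t, g_t)]
    o_t = [sigmoid(z) for z in pre("W_xo", "W_ho", "W_co", "b_o")]   # output gate
    h_t = [o * math.tanh(c) for o, c in zip(o_t, c_t)]
    return h_t, c_t

def zero_params(n_in, n_hid):
    """All-zero parameters, only useful for checking shapes and arithmetic."""
    p = {}
    for gate in "ifco":
        p[f"W_x{gate}"] = [[0.0] * n_in for _ in range(n_hid)]
        p[f"W_h{gate}"] = [[0.0] * n_hid for _ in range(n_hid)]
        p[f"b_{gate}"] = [0.0] * n_hid
    for gate in "ifo":
        p[f"W_c{gate}"] = [[0.0] * n_hid for _ in range(n_hid)]
    return p

def bilstm(xs, p_fwd, p_bwd, n_hid):
    """Run one LSTM left-to-right and one right-to-left, then concatenate."""
    def run(seq, p):
        h, c, outs = [0.0] * n_hid, [0.0] * n_hid, []
        for x in seq:
            h, c = lstm_step(x, h, c, p)
            outs.append(h)
        return outs
    fwd = run(xs, p_fwd)
    bwd = run(xs[::-1], p_bwd)[::-1]
    return [f + b for f, b in zip(fwd, bwd)]      # list concatenation per position
```

With all-zero parameters every gate evaluates to 0.5 and the candidate to 0, so a previous cell state of 2.0 becomes 1.0, which makes the arithmetic easy to check by hand.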

The LSTM model training process can be roughly divided into four steps: ① calculating the output value of each LSTM cell according to the fifth expression (forward computation); ② calculating the error term of each LSTM cell backward, in two back-propagation directions: through time and through the model layers; ③ calculating the gradient of each weight from the corresponding error term; ④ updating the weights with a gradient-based optimization algorithm. The LSTM structure is shown in FIG. 4.

Specifically, the decoder employs a conditional random field (CRF). The encoder extracts and encodes the features of the data, the decoder decodes the features into the corresponding labels, and the entity positions and entity categories are obtained from the predicted label values. A conditional random field is a Markov random field over the random variable Y given the random variable X. Labeling problems typically use only the linear-chain conditional random field, whose conditional probability is P(Y|X), where X is the given observation sequence and Y is the label sequence (state sequence) to be predicted. The conditional probability distribution P(Y|X) is called a conditional random field if the following holds for every node v:

$$P(Y_v \mid X, Y_w, w \neq v) = P(Y_v \mid X, Y_w, w \sim v)$$

where $w \neq v$ ranges over all nodes other than v and $w \sim v$ ranges over the nodes adjacent to v in the graph.
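The text does not say how the most probable label sequence is obtained from the linear-chain CRF. A common choice is Viterbi decoding, sketched here as an assumption and not as the patented procedure. The emission scores stand in for the encoder outputs and the transition scores for the learned tag-to-tag compatibilities; both are hypothetical inputs.

```python
def viterbi(emissions, transitions):
    """Best tag path for a linear-chain model.

    emissions[t][y]   : score of tag y at position t
    transitions[a][b] : score of moving from tag a to tag b
    """
    n_tags = len(transitions)
    score = list(emissions[0])          # best score of any path ending in each tag
    backptr = []
    for emit in emissions[1:]:
        new_score, ptr = [], []
        for y in range(n_tags):
            best_prev = max(range(n_tags), key=lambda a: score[a] + transitions[a][y])
            new_score.append(score[best_prev] + transitions[best_prev][y] + emit[y])
            ptr.append(best_prev)
        score = new_score
        backptr.append(ptr)
    best = max(range(n_tags), key=lambda y: score[y])
    path = [best]
    for ptr in reversed(backptr):       # follow back-pointers from the last position
        path.append(ptr[path[-1]])
    return path[::-1]
```

The transition scores are what distinguish a CRF decoder from tagging each position independently: a strong penalty on a transition can override a locally preferred tag.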

The decoder yields the label of each word, and the type and position of each entity are determined from the label categories, thereby identifying and extracting the emotion holders, expression objects, and events in emotion sentences. In testing, the model reaches 65% accuracy. The CRF structure is shown in fig. 5.
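Turning per-word labels into entity types and positions can be sketched as follows. The patent does not name a tagging scheme, so the BIO scheme and the type names `HOLDER`, `TARGET`, and `EVENT` used here are assumptions for illustration.

```python
def tags_to_entities(tokens, tags, sep=""):
    """Collect (type, start, end, text) spans from BIO tags; end is exclusive."""
    entities, start, etype = [], None, None
    for i, tag in enumerate(list(tags) + ["O"]):   # sentinel flushes the last span
        inside = tag.startswith("I-") and etype == tag[2:]
        if start is not None and not inside:       # the open span ends here
            entities.append((etype, start, i, sep.join(tokens[start:i])))
            start, etype = None, None
        if tag.startswith("B-"):                   # a new span begins
            start, etype = i, tag[2:]
    return entities
```

For character-level Chinese input the default `sep=""` rejoins the characters; for space-separated tokens pass `sep=" "`. An `I-` tag that does not continue an open span of the same type is ignored.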

Step 6, steps 1 to 5 train the emotion sentence extraction model and the emotion entity extraction model. In practical application, new news corpus is crawled as in step 1 and preprocessed as in step 2, so that the processed long text is segmented into sentences. The sentences are input into the emotion sentence extraction model, which judges whether each input sentence is an emotion sentence. Sentences judged to be emotion sentences are stored in an emotion sentence library. The emotion sentences in the library are then read as the input of the emotion entity extraction model, which yields the positions of all emotion entities in each input sentence. From these positions, the emotion holder, the emotion expression object, and the related event contained in the emotion sentence can be extracted.
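The application flow can be summarized in a short sketch that chains the two trained models. The three callables are placeholders supplied by the caller (a sentence splitter, the emotion sentence classifier, and the emotion entity extractor); their names and the toy stand-ins below are hypothetical and exist only to show how data moves between stages.

```python
def run_pipeline(article, split_fn, is_emotion_sentence, extract_entities):
    """Split an article, keep the emotion sentences, extract their entities."""
    emotion_sentence_library = [s for s in split_fn(article) if is_emotion_sentence(s)]
    return [(s, extract_entities(s)) for s in emotion_sentence_library]

# Toy stand-ins for the trained components, for demonstration only.
def toy_split(text):
    return [s.strip() for s in text.split(".") if s.strip()]

def toy_classifier(sentence):
    return "condemned" in sentence

def toy_extractor(sentence):
    words = sentence.split()
    return {"holder": words[0], "event": words[-1]}
```

Calling `run_pipeline("Alice condemned the attack. It rained.", toy_split, toy_classifier, toy_extractor)` keeps only the first sentence and returns it with its extracted entities. Filtering first means the costlier entity model runs only on sentences likely to contain emotion entities.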

The invention extracts emotion entities in news, including emotion holders, emotion expression objects, and events, using deep learning models trained with remote supervision. To meet the challenges of entity extraction in the news field, it designs a deep learning model based on BERT word vectors, and by combining an expert knowledge base with automatic labeling it greatly reduces the cost of manual labeling, which is of great significance.

Claims

1. The news emotion entity extraction method based on remote supervision is characterized by comprising the following steps of:

Step 3: constructing a key entity knowledge base, and automatically labeling news corpus divided into sentences according to the knowledge base; the constructed key entity knowledge base is a person, organization, country and event entity knowledge base; the principle of automatically labeling the news corpus divided into sentences according to the knowledge base is as follows: when more than n knowledge base entities appear in the sentence, marking the sentence as a sentence with emotion, wherein n is a set natural number;

2. The method for extracting news emotion entities based on remote supervision as defined in claim 1, wherein the specific method for crawling relevant news from official news websites is as follows:

3. The method for extracting news emotion entities based on remote supervision according to claim 1, wherein preprocessing the crawled news corpus comprises:

4. The news emotion entity extraction method based on remote supervision according to claim 1, wherein the emotion sentence extraction model includes a word vector expression layer and a SoftMax classification layer, which are respectively specified as follows:

5. The news emotion entity extraction method based on remote supervision according to claim 1, wherein the emotion entity extraction model includes a word vector layer, an encoder and a decoder, and specifically includes:

The decoder adopts a conditional random field for decoding semantic features into corresponding labels, and obtains corresponding entity positions and entity categories according to predicted label values.