CN111125367A

CN111125367A - Multi-character relation extraction method based on multi-level attention mechanism

Info

Publication number: CN111125367A
Application number: CN201911362557.9A
Authority: CN
Inventors: 蔡毅; 刘宸铄
Original assignee: South China University of Technology SCUT
Current assignee: South China University of Technology SCUT
Priority date: 2019-12-26
Filing date: 2019-12-26
Publication date: 2020-05-08
Anticipated expiration: 2039-12-26
Also published as: CN111125367B

Abstract

The invention discloses a multi-character relation extraction method based on a multi-level attention mechanism, which comprises the following steps of: preprocessing the collected text; adopting a remote supervision technology to align and label the named entity of the original character to obtain a text containing the entity and entity description information; performing Chinese word vector training on the obtained text containing the entity; constructing an attention mechanism two-way long-time and short-time memory network comprising two levels, and training the constructed model to obtain a multi-classification model for extracting various character relations; and inputting the preprocessed text to obtain a text relation extraction result. The invention overcomes the defects of relation extraction of the existing multiple character relation texts and improves the relation extraction experimental results of the multiple character relation texts.

Description

Multi-character relation extraction method based on multi-level attention mechanism

Technical Field

The invention relates to the field of natural language processing, in particular to a method for extracting relationships among multiple characters based on a multi-level attention mechanism.

Background

With the rapid development of internet technology, text information data in a network grows exponentially, but text information data is often unstructured information. Information extraction is a task of natural language processing, which aims to extract structured information from unstructured text. Information extraction includes two aspects: mingmei Shishi (named as Shisui)A body identification task and a relation extraction task, wherein the former is used for discovering the entities existing in the text, and the latter judges the relation between the discovered entities, namely obtaining the entity pair e for the specified text₁And e₂And a relation r between the two₁，r，e₂). The relation extraction task has been widely used in fields of knowledge graph, information retrieval, and the like.

Conventional non-deep learning methods for relationship extraction are typically supervised learning and can be classified into feature-based methods and kernel-based methods, both of which use existing NLP tools and can result in downstream error accumulation. In the deep learning era, a manual feature acquisition mode is avoided, but supervised deep learning requires a large amount of training data to learn features. The labeling of the training data takes a lot of time and effort, and is biased to a certain fixed domain. Mintz et al, 2009, proposed a remote surveillance method that strongly assumes that the entity relationships in the knowledge base are the entity relationships in the text, and generates a large amount of data by aligning the knowledge base with the text.

However, the strong assumption of remote supervision is not necessarily true, and the entity relationships existing in the text are not necessarily the same as those in the knowledge base. To alleviate this drawback, Riedel uses multi-instance learning. Lin first used a segmented convolutional neural network and a sentence-level attention mechanism in 2016, and the use of deep learning and the attention mechanism was introduced to enable the relationship extraction to achieve a better effect.

So far, most of the relation extraction tasks are for English texts, and for Chinese texts, especially for Chinese texts containing various character relations, a method for realizing a better Chinese text multiple character relation extraction method by using a deep learning fusion attention mechanism is urgently needed to be researched.

Disclosure of Invention

The invention aims to overcome the defects of the prior art and provide a method for extracting relationships among multiple characters based on a multi-level attention mechanism. The method comprises the steps of obtaining global feature representation of a text by adopting a bidirectional long-time memory network and a word-level attention mechanism, strengthening the weight of a vocabulary which is more important for relation extraction, then obtaining package representation consisting of a plurality of sentence representations by the sentence-level attention mechanism in a multi-instance learning mode, and adding description information of a named entity to strengthen the result of the package representation. The invention obtains better experimental results on the remote supervision relation extraction data set.

The invention can be realized by the following technical scheme:

a multi-character relationship extraction method based on a multi-level attention mechanism comprises the following steps:

preprocessing the collected text;

adopting a remote supervision technology to align and label the named entity of the original character to obtain a text containing the entity and entity description information;

performing Chinese word vector training on the obtained text containing the entity;

constructing an attention mechanism two-way long-time and short-time memory network comprising two levels, and training the constructed model to obtain a multi-classification model for extracting various character relations;

and inputting the preprocessed text to obtain a text relation extraction result.

Specifically, the pretreatment comprises:

removing English data in the text;

removing emoticons and hyperlinks in the text;

removing stop words in the text according to the Chinese stop word list;

and performing Chinese word segmentation on the processed text.

Specifically, in the step of aligning and labeling the original character named entity by adopting the remote monitoring technology, a character name entry is obtained by utilizing Chinese online encyclopedia, two related characters and the relation of the characters form a triple, and finally a character relation knowledge base is constructed. The pair of entities that exist with the knowledge base, i.e., the relationship of two entities, appear in the text as a relationship in a triple. The final annotated data set in the present invention has 35 relationship types.

Specifically, in the step of training the Chinese Word vector of the text, a distributed Word vector representation method Word2Vec is adopted, and the dimension of the output Word vector is set to 300.

Specifically, in the step of constructing the two-level attention mechanism bidirectional long-short time memory network, a birstm (bidirectional long-short time memory network) and two-level attention mechanism network structures are constructed by using a pytorch, wherein the first layer of the network is an embedded layer, the second layer is a bidirectional LSTM layer, the third layer is a word-level attention layer, the fourth layer is a sentence-level attention layer, and the fifth layer is a classifier soffmax layer.

Furthermore, the input of the embedding layer is a trained Word vector sequence, the length of a text sequence (the number of the Word vector sequences) is set to be m, the length of the text sequence is filled with 0 when the length of the text sequence is less than m, the length of the text sequence exceeds m, the length of the relative position of a Word relative to two entities is also m, the embedding layer adopts Baidu encyclopedia as corpus, and Word2Vec is obtained by using a Gensim tool. The dimension of the word vector is dw, the dimension of the randomly initialized position vector is dp, and a vector sequence w is obtained by the method₁，w₂，…，w_m}，w_i∈R^dWherein d ═ dw + dp × 2.

Further, in the bidirectional LSTM layer, its unidirectional LSTM is expressed as follows:

i_tto the input gate, f_tTo forget the door, c_tIn a cellular state, o_tTo the output gate, h_tFor the hidden vector, W_x，W_h，W_cIn order to be the weight, the weight is,

i_t＝σ(W_xix_t+W_hih_t-1+W_cic_t-1+b_i)

f_t＝σ(W_xfx_t+W_hfh_t-1+W_cfc_t-1+b_f)

c_t＝f_tc_t-1+i_ttanh(W_xcx_t+W_hch_t-1+b_c)

o_t＝σ(W_xox_t+W_hoh_t-1+W_coc_t+b_o)

h_t＝o_ttanh(c_t)

vector h of bidirectional LSTM_tOutput from the forward network

And reverse output

Are obtained together.

Further, the word-level attention tier is used to enforce weights for the more important words for relationship extraction.

In the attention level of the word level, u_iRepresenting the score of the correlation of each word in a sentence, r representing a random query vector, h_iIs h_tEach term of (2), namely each word hidden vector, has the following specific relationship:

u_i＝h_i·r

α_iis the weight obtained by the word-level attention mechanism, and the calculation formula is as follows:

s is a vector representation of a sentence, and the calculation formula is as follows:

furthermore, the sentence-level attention layer is used for adding entity description information.

In the sentence-level attention layer, e_iRepresenting an input sentence s_iRelation to prediction r_kThe calculation formula of the matching degree is as follows:

α_iis the weight obtained by the sentence level attention mechanism, and the calculation formula is as follows:

b is a vector representation of a packet, which is equal to the sum of the weights of all sentences, and the calculation formula is as follows:

the obtained packet represents description information of the entity on the splice, i.e. a category information vector of the entity, and is represented as:

and the fifth layer is a classifier softmax layer for generating relational extraction results through a softmax multi-classifier.

The method obtains the global feature representation of a text by using a BilSTM (bidirectional long and short term memory network) and a Word level attention (Word level attention mechanism), wherein the Word level attention mechanism is used for strengthening the weight of a vocabulary which is more important for relation extraction, and then a multi-instance learning mode is adopted, and a sequence level attention (Sentence level attention mechanism) obtains a packet representation consisting of a plurality of Sentence representations and adds description information of a named entity to strengthen the result of the packet representation.

Compared with the prior art, the invention has the following beneficial effects:

the invention adopts two-level attention mechanism for Chinese texts with various task relationships, better avoids noise caused by remote supervision, and simultaneously adds entity description information into the Chinese texts, so that the semantic characteristics of the texts are enhanced, and a better relationship extraction result is obtained.

Drawings

FIG. 1 is a flow chart of a method for extracting relationships between multiple people based on a multi-level attention mechanism according to the present invention.

FIG. 2 is a diagram of a multi-level attention mechanism-based multi-person relationship extraction network model according to the present invention.

Detailed Description

The present invention will be described in further detail with reference to examples and drawings, but the present invention is not limited thereto.

Examples

Fig. 1 is a flowchart illustrating a method for extracting relationships among multiple persons based on a multi-level attention mechanism, the method including the steps of:

(1) preprocessing the collected text;

in this embodiment, a network-exposed remote-supervised multi-person relationship extraction dataset (e.g., CCKS 2019IPER dataset) is used. The following operations are carried out: firstly, removing English data in a text;

removing special symbols in the text such as: emoji and hyperlinks, representing emoji as "expression", removing hyperlinks, etc.; and removing stop words in the text according to the Chinese stop word list.

(2) Adopting a remote supervision technology to align and label the named entity of the original character to obtain a text containing the entity and entity description information;

in the embodiment, in the entity labeling stage, the name entry of a Chinese character is acquired in an online encyclopedia, two related characters and the relationship of the two related characters form a triple, and finally a character relationship knowledge base is constructed. The pair of entities that exist with the knowledge base, i.e., the relationship of two entities, appear in the text as a relationship in a triple. There are 35 relationship types for the final labeled data set.

(3) Performing Chinese word vector training on the obtained text containing the entity;

in the text vectorization step of this embodiment, the word2vec method is used, which performs chinese word segmentation on the text subjected to the processing by using a word segmentation tool, and performs word2vec training by using a genim packet, so that the vector dimension of each word is 300.

(4) Constructing an attention mechanism two-way long-time and short-time memory network comprising two levels, and training the constructed model to obtain a multi-classification model for extracting various character relations;

as shown in fig. 2, the network model constructed in this embodiment has a structure that includes: an embedding layer, a bidirectional LSTM layer, a word level attention layer, a sentence level attention layer and a softmax classification layer.

The neural network model constructed by the embodiment is trained through a downloaded Chinese Baidu encyclopedia data set, Crossentpy is used as a loss function, Adam is used as an optimization method, training is completed when 15 epochs are trained or loss is not changed in 1000 lots through adjustment of other parameters of the model, a test set is tested, a result obtained through measuring relation extraction by using a P-R curve is obtained, the P-R curve utilizes precision and recall of the result to depict curves, and the upper curve in a two-dimensional coordinate system represents a better relation extraction effect.

(5) And preprocessing the text needing relationship extraction, and inputting the preprocessed text into the trained model to obtain a text relationship extraction result.

The method for extracting the relationships of various characters based on the multi-level attention mechanism, which is established in the method, converts an input text into a vector form through an embedded layer, obtains a hidden vector with more characteristics through a BilSTM layer, extracts more important words from the relationship in the text and obtains higher weight, and the sentence-level attention mechanism is better represented by a packet, so that noise caused by remote supervision can be eliminated, and a better experimental result is obtained.

The above embodiments are preferred embodiments of the present invention, but the present invention is not limited to the above embodiments, and any other changes, modifications, substitutions, combinations, and simplifications which do not depart from the spirit and principle of the present invention should be construed as equivalents thereof, and all such changes, modifications, substitutions, combinations, and simplifications are intended to be included in the scope of the present invention.

Claims

1. A multi-character relation extraction method based on a multi-level attention mechanism is characterized by comprising the following steps:

preprocessing the collected text;

2. The method of claim 1, wherein the pre-processing comprises:

removing English data in the text;

removing emoticons and hyperlinks in the text;

removing stop words in the text according to the Chinese stop word list;

and performing Chinese word segmentation on the processed text.

3. The method as claimed in claim 1, wherein in the step of performing aligned labeling on the named entities of the original characters by using a remote supervision technology, the obtained name entry of the online encyclopedia in chinese is used to combine two characters with relationships and the relationships of the characters into a triple, and finally a character relationship knowledge base is constructed. The pair of entities that exist with the knowledge base, i.e., the relationship of two entities, appear in the text as a relationship in a triple.

4. The method of claim 1, wherein in the step of training the Chinese Word vector on the text, a distributed Word vector representation method Word2Vec is adopted, and the dimension of the output Word vector is set to 300.

5. The method of claim 1, wherein in the step of constructing the bidirectional long-term memory network with two levels of attention mechanisms, the birstm and the two levels of attention mechanism network structures are constructed by using a pytorch, wherein the first layer of the network is an embedding layer, the second layer is a bidirectional LSTM layer, the third layer is a word level attention layer, the fourth layer is a sentence level attention layer, and the fifth layer is a classifier softmax layer.

6. The method of claim 5, wherein the input of the embedding layer is a trained word vector sequence, the length of the text sequence (number of word vector sequences) is set to m, less than m is filled with 0, more than m is truncated, the length of the relative position of a word with respect to two entities in each text is also m, the embedding layer adopts a pre-trained word vector dimension dw and a randomly initialized position vector dimension dp, thereby obtaining a vector sequence w ═ { w ═ w [ ("w₁，w₂，…，w_m}，w_i∈R^dWherein d ═ dw + dp × 2.

7. The method of claim 5, wherein the bi-directional LSTM layer has its unidirectional LSTM expressed as follows:

i_tto the input gate, f_tTo forget the door, c_tIn a cellular state, o_tTo the output gate, h_tFor the hidden vector, W_x，W_h，W_eIn order to be the weight, the weight is,

i_t＝σ(W_xix_t+W_hih_t-1+W_cic_t-1+b_i)

f_t＝σ(W_xfx_t+W_hfh_t-1+W_cfc_t-1+b_f)

c_t＝f_tc_t-1+i_ttanh(W_xcx_t+W_hch_t-1+b_c)

o_t＝σ(W_xox_t+W_hoh_t-1+W_coc_t+b_o)

h_t＝o_ttanh(c_t)

vector h of bidirectional LSTM_tOutput from the forward network

And reverse output

Are obtained together.

8. The method of claim 5, wherein u is the attention level of the word level_iRepresenting the score of the correlation of each word in a sentence, r representing a random query vector, h_iIs h_tEach term of (2), namely each word hidden vector, has the following specific relationship:

u_i＝h_i·r

9. the method of claim 5, wherein the sentence-level attention layer is configured to add entity description information;