CN111368542A - Text language association extraction method and system based on recurrent neural network - Google Patents

Text language association extraction method and system based on recurrent neural network Download PDF

Info

Publication number
CN111368542A
Authority
CN
China
Prior art keywords
entity
expression
sequence
vector
context
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201811600745.6A
Other languages
Chinese (zh)
Inventor
韩英
陈薇
王腾蛟
李强
刘迪
黄晓光
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Peking University
State Grid Corp of China SGCC
State Grid Information and Telecommunication Co Ltd
State Grid Zhejiang Electric Power Co Ltd
Original Assignee
Peking University
State Grid Corp of China SGCC
State Grid Information and Telecommunication Co Ltd
State Grid Zhejiang Electric Power Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Peking University, State Grid Corp of China SGCC, State Grid Information and Telecommunication Co Ltd, State Grid Zhejiang Electric Power Co Ltd filed Critical Peking University
Priority to CN201811600745.6A priority Critical patent/CN111368542A/en
Publication of CN111368542A publication Critical patent/CN111368542A/en
Pending legal-status Critical Current

Links

Images

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/048Activation functions
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods

Abstract

The invention discloses a text language association extraction method and system based on a recurrent neural network. The method automatically extracts complex context features with a recurrent neural network (a bidirectional long short-term memory network) and encodes the semantic information of the context; discovers definition patterns in a document through a rule-based entity expression pair extractor, identifies definitions of non-standard expressions in the document, and extracts the defined standard expression and non-standard expression that belong to the same entity concept; encodes the features of the extracted entity expression pairs, embedding information about entity normalization into a low-dimensional entity expression vector; concatenates the entity expression vector with the context feature encoding vector and applies a dimension transformation to obtain the final encoding; and, with a decoder based on a conditional random field, combines the features learned by the encoder with the transition probabilities between states to decode a globally optimal state sequence as the final output sequence. The invention can effectively improve the performance of entity recognition.

Description

Text language association extraction method and system based on recurrent neural network
Technical Field
The invention belongs to the field of artificial intelligence and relates to a method for extracting information from massive unstructured data using natural language processing technology, in particular to a method for identifying entities in text and extracting entity association relations, which is a key technology for information extraction.
Background
Text entity extraction identifies entities with specific meanings, such as person names, place names, and organization names, from text. It is a key technology for extracting information from massive unstructured data and the cornerstone of many complex natural language processing applications, such as intelligent question answering, knowledge graphs, automatic summarization, and machine translation.
Due to the rich expressive forms of natural language, the same entity may have many different expressions, such as its full name, abbreviations, and alternative names. The phenomenon of multiple expressions referring to the same entity is widespread in both Chinese and English, e.g. "Industrial and Commercial Bank of China" versus its short form "ICBC" in Chinese, and "United States" versus "U.S." in English. The variable expression forms of entities pose a huge challenge to entity recognition. The results of Khalid et al. [Khalid M A, Jijkoun V, De Rijke M. The impact of named entity normalization on information retrieval for question answering [C]// European Conference on Information Retrieval. Springer, Berlin, Heidelberg, 2008: 705-710] show that normalizing named entities noticeably improves retrieval performance for question answering.
In the field of natural language processing, entity recognition and entity association normalization have traditionally been treated as independent tasks and processed separately: entity recognition is performed first, and its result is then used as the input of entity normalization. In this pipeline mode the result of entity normalization cannot be fed back to entity recognition, so entity recognition cannot exploit the useful information produced by normalization. Existing research on jointly processing entity recognition and entity normalization is also very limited. Liu X et al. [Liu X, Zhou M, Wei F, et al. Joint inference of named entity recognition and normalization for tweets [C]// Proceedings of the 50th Annual Meeting of the Association for Computational Linguistics: Long Papers-Volume 1. Association for Computational Linguistics, 2012: 526-535] studied the joint processing of entity recognition and normalization for tweets and proposed a probabilistic-graph-based model. The model introduces a binary random variable to describe whether two entity expressions in tweets with similar content refer to the same entity concept. Similarly, Luo G et al. [Luo G, Huang X, Lin C Y, et al. Joint entity recognition and disambiguation [C]// Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing. 2015: 879-888] also proposed a probabilistic-graph-based model to combine entity recognition and entity normalization. These methods all focus on jointly normalizing entity expressions and recognizing entities in short texts (tweets), and their probabilistic graphical models, based on statistical machine learning, rely on a large number of manually constructed features. Such feature engineering is costly, hard to scale to large datasets, performs poorly on massive data, and is not data-driven; moreover, manually built features cannot cover the high-order interactions of many hidden contextual features. In addition, the entity normalization modules of these methods all depend on an existing dictionary, under the unreasonable assumption that the standard entity expression is present in the dictionary. Existing dictionaries have limited coverage, and many corpora lack dictionaries for the corresponding domains. Especially today, with information technology highly developed, new entities frequently appear in news media text, such as reports on newly established institutions, newly issued bonds, and newly occurring events; these do not exist in existing dictionaries and knowledge bases, and dictionary-dependent methods cannot normalize the names of such new entities.
A technical solution is therefore needed that can automatically learn the complex features of the text context without relying on manual feature engineering, effectively obtain entity normalization information from the definitions of non-standard entity expressions given in the document, and integrate the learned context features with the information of the entity expression pairs defined in the document to achieve better entity recognition.
Disclosure of Invention
In view of these problems, the invention aims to design and implement a model combining rules and deep learning for extracting text entities and entity association relations. It uses deep learning to extract context features automatically, avoiding complex feature engineering; it uses rules to incorporate human knowledge and experience by discovering definitions of entity expressions in the document; and it achieves better entity recognition by letting entity association normalization within the text assist entity recognition.
In order to achieve the purpose, the invention adopts the following technical scheme:
a text entity and entity incidence relation extraction method based on a recurrent neural network comprises the following steps:
(1) automatically extracting complex context characteristics through a time recursive neural network (a bidirectional long and short term memory network), and coding information of the context characteristics;
(2) discovering a definition mode in the document through a rule, identifying the definition related to the non-standard expression in the document, and extracting the defined standard expression and the defined non-standard expression which belong to the same entity concept as an entity expression characteristic;
(3) coding the extracted entity expression characteristics, and embedding information related to entity normalization into a low-dimensional entity expression vector;
(4) connecting the context characteristics and the codes of the entity expression characteristics in a vector space to obtain a final code fusing entity identification and entity expression normalization information;
(5) and sending the final code into a conditional random field model, calculating a global optimal state sequence by combining transition probabilities among states, decoding and outputting a final result sequence of the text entity and the entity incidence relation.
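For illustration only, the five steps can be wired together as in the following Python sketch; every component name here is a hypothetical stand-in for the modules detailed in the embodiments below.

```python
def extract_entities(embedded_text, raw_text, encoder, pair_extractor,
                     expr_encoder, fusion, crf_decoder):
    """High-level flow of steps (1)-(5). All components are hypothetical
    stand-ins for the modules described in the detailed embodiments."""
    h_r = encoder(embedded_text)          # (1) BiLSTM context features
    pairs = pair_extractor(raw_text)      # (2) rule-based <F, A> expression pairs
    h_n = expr_encoder(raw_text, pairs)   # (3) low-dimensional expression vectors
    h = fusion(h_r, h_n)                  # (4) concatenate and project
    return crf_decoder(h)                 # (5) globally optimal tag sequence
```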
A system for extracting text entities and entity association relations based on a recurrent neural network comprises:
a character/word embedding module, which maps each character/word of the original text sequence to a vector of a fixed dimension;
a context feature encoder, which represents the embedded text sequence in vector form, automatically extracts complex context features, and encodes the semantic information of the context;
a word segmentation module, which segments the original text sequence;
an entity expression pair extractor, which discovers definitions of non-standard expressions in the document based on the segmentation results of the word segmentation module, and extracts the defined standard and non-standard expressions belonging to the same entity concept as entity expression features;
an entity normalization information encoder, which encodes the entity expression features extracted by the entity expression pair extractor, embedding the information about entity normalization into a low-dimensional entity expression vector;
an entity recognition and normalization encoding combination module, which concatenates the context features produced by the context feature encoder with the entity expression feature encodings produced by the entity normalization information encoder in vector space, obtaining a final encoding that fuses entity recognition and entity expression normalization information;
and a conditional-random-field-based decoder, which combines the output of the entity recognition and normalization encoding combination module with the transition probabilities between states to compute a globally optimal state sequence as the final output sequence of text entities and entity association relations.
Compared with the prior art, the invention has the following positive effects:
the invention provides a text entity and entity incidence relation extraction method based on a recurrent neural network by adopting a mode of combining rules and deep learning, which utilizes a bidirectional long-short term memory network to automatically extract text context semantic features, simultaneously combines human experience and knowledge into rules for extracting entity nonstandard expressions defined in documents, and improves the performance of an entity recognition system through entity incidence normalization. The method utilizes the advantage of deep learning for automatically extracting the features, avoids manual feature engineering which has high time cost and high labor cost and is difficult to expand to a large data set, and realizes real data driving; meanwhile, the method gives full play to the knowledge and experience of people, quickly discovers the definition of the entity nonstandard expression in the document based on the rule, and extracts the entity expression pair by fully utilizing the information transmitted by the document content; the relevance of the entity identification and the entity normalization task is fully utilized, compared with the traditional separate processing mode, the simultaneous processing of the entity identification and the entity normalization can be supported, the information sharing of the entity identification and the entity normalization is realized, and the performance of the entity identification is improved by utilizing the information of the entity normalization. The invention has the advantages of low overhead, high expression and multiple applications.
Drawings
Fig. 1 is a schematic diagram illustrating a module composition of a recurrent neural network-based text entity and entity association extraction system according to an embodiment of the present invention.
Fig. 2 is a schematic diagram of the data flow and network structure of a recurrent neural network-based text entity and entity association extraction system according to an embodiment of the present invention, where B-ORG denotes the beginning of an organization entity, I-ORG its middle, E-ORG its end, and O a token that is not part of an organization entity.
Fig. 3 is a flowchart illustrating steps of a recurrent neural network-based text entity and entity association extraction system according to an embodiment of the present invention.
Detailed Description
The present invention will be described in detail below with reference to specific embodiments and accompanying drawings.
Fig. 1 is a schematic diagram of constituent modules of a recurrent neural network-based text entity and entity association relation extraction system according to an embodiment of the present invention, and fig. 2 is a schematic diagram of data flows and a network structure of a recurrent neural network-based text entity and entity association relation extraction system according to an embodiment of the present invention. With reference to fig. 1 and fig. 2, the functions and implementations of the modules shown in fig. 1 are described as follows:
(1) The context feature encoder, based on a temporal recurrent neural network (a bidirectional long short-term memory network), consists of a forward long short-term memory (LSTM) network and a backward LSTM network, and is responsible for automatically extracting complex context features and encoding the semantic information of the context.
At time t, when the LSTM receives the information from the previous time step, the cell (the LSTM neuron) first decides which part of that information to forget; the forget gate controls this. Its inputs are the input x_t at the current time and the output h_{t-1} of the previous time step. The formula of the forget gate is:

f_t = σ(W_f · [h_{t-1}, x_t] + b_f)

where f_t is the activation of the forget gate, σ is the activation function (the sigmoid function), W_f is the input weight of the forget gate, and b_f is its bias.
After discarding useless information, the cell decides which newly arrived information to absorb. The formula of the input gate is:

i_t = σ(W_i · [h_{t-1}, x_t] + b_i)

where i_t is the activation of the input gate, σ is the activation function (the sigmoid function), W_i is the input weight of the input gate, and b_i is its bias.
The cell candidate at the current time is:

C̃_t = tanh(W_C · [h_{t-1}, x_t] + b_C)

where C̃_t is the cell candidate, W_C is the input weight of the cell candidate, x_t is the input at the current time, h_{t-1} is the output of the previous time step, and b_C is the bias of the cell candidate.
The cell state is then updated: the new cell state is computed from the selectively forgotten old cell state and the candidate cell state:

C_t = f_t * C_{t-1} + i_t * C̃_t

where C_t is the new value of the cell state, f_t is the forget gate activation, C_{t-1} is the cell state at the previous time step, i_t is the input gate activation, and C̃_t is the cell candidate at the current time.
Finally, the output gate determines the output vector h_t of the hidden layer at the current time. The output gate is defined as:

o_t = σ(W_o · [h_{t-1}, x_t] + b_o)

where o_t is the activation of the output gate, σ is the activation function (the sigmoid function), W_o is the connection weight of the output gate, b_o is its bias, x_t is the input at the current time, and h_{t-1} is the output of the previous time step.
The output of the hidden layer at the current time is the activated cell state passed outward through the output gate:

h_t = o_t * tanh(C_t)

where o_t is the output gate activation, C_t is the updated cell state at the current time, and h_t is the output at the current time.
For a given text sequence of n characters/words (words for English, characters for Chinese), denote it S = [w_1, w_2, w_3, …, w_n], where w_i is the vector representing the i-th character/word of the sequence after embedding. At time n, the hidden-layer output of the forward LSTM network is denoted h_n^→, and the hidden-layer output of the backward LSTM network is denoted h_n^←. The two are joined by a merging layer to obtain h_n = [h_n^→, h_n^←]. The output of the context feature encoder is denoted H_R.
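For illustration, a minimal sketch of such a context feature encoder, assuming PyTorch; the 300-dimensional input follows the character embedding of step 3 below, while the hidden size of 256 and the class name are hypothetical choices.

```python
import torch.nn as nn

class ContextFeatureEncoder(nn.Module):
    """BiLSTM context encoder: at every position, the forward and backward
    hidden states are concatenated, yielding H_R as described above."""
    def __init__(self, embed_dim=300, hidden_dim=256):
        super().__init__()
        # bidirectional=True runs a forward and a backward LSTM and
        # concatenates their hidden outputs at each time step
        self.bilstm = nn.LSTM(embed_dim, hidden_dim,
                              batch_first=True, bidirectional=True)

    def forward(self, embedded):        # (batch, seq_len, embed_dim)
        h_r, _ = self.bilstm(embedded)  # (batch, seq_len, 2 * hidden_dim)
        return h_r                      # H_R in the notation above
```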
(2) The rule-based entity expression pair extractor makes full use of human knowledge and experience: through rules based on syntactic and lexical structure, it discovers the definitions of non-standard entity expressions in the document and extracts the expression pairs, given by those definitions, that refer to the same entity concept, such as the name pairs <full name, abbreviation> and <full name, alternative name>.
Table 1 gives the rules used by the extractor, where F denotes a standard expression and A a non-standard expression such as an abbreviation or alternative name. The string length of F is required to be greater than that of A. The expression pair extractor extracts entity expression pairs from data satisfying these syntactic and lexical conditions.
Table 1. Expression rules used by the extractor
(Table 1 is rendered as an image in the published document; it lists the syntactic and lexical definition patterns matched by the extractor, such as a standard expression F immediately followed by a parenthesized non-standard expression A, as in "F (A)".)
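As an illustration of the extractor's operation, the following sketch matches one hypothetical definition pattern of the kind Table 1 describes — a standard expression immediately followed by a parenthesized non-standard expression — and enforces the string-length condition stated above; the regular expression is an assumed stand-in for the full rule set.

```python
import re

# One illustrative definition pattern: a name immediately followed by a
# parenthesized short form, "F (A)" or "F（A）". The full rule set
# corresponds to Table 1; this single pattern is a hypothetical stand-in.
DEFINITION_PATTERN = re.compile(
    r'(?P<full>[^\s（(）)]{2,})\s*[（(]\s*(?P<abbr>[^（(）)]{1,10})\s*[）)]')

def extract_expression_pairs(sentence):
    """Return (F, A) pairs where F is the standard expression and A the
    non-standard one; F must be strictly longer than A (lexical rule)."""
    pairs = []
    for m in DEFINITION_PATTERN.finditer(sentence):
        full, abbr = m.group('full'), m.group('abbr')
        if len(full) > len(abbr):  # string-length condition from the text
            pairs.append((full, abbr))
    return pairs
```

For instance, a sentence containing "中国工商银行（工行）" would yield the pair ("中国工商银行", "工行"), matching the full-name/abbreviation example given in the background section.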
(3) The entity normalization information encoder is responsible for encoding the features of the extracted entity expression pairs and embedding the information about entity normalization into a low-dimensional entity expression vector. Each expression pair produced by the entity expression pair extractor is first converted into a vector of fixed length and then further learned through a linear layer.
From left to right, the elements of the expression vector correspond to the beginning, middle, end, and single-character form of a non-standard name, and to the beginning, middle, and end of a standard name. Since the standard name is the longest of an entity's multiple names, the single-character case does not arise for it. For a given text sequence of n characters/words (words for English, characters for Chinese), denoted S = [w_1, w_2, w_3, …, w_n], suppose the set of expression pairs produced by the expression pair extractor is {<F_1, A_1>, <F_2, A_2>, …, <F_k, A_k>}. For each character/word w_i, define an expression function

g(w_i) = m, if w_i belongs to the expression pair <F_m, A_m>; g(w_i) = 0 otherwise,

where g(w_i) denotes the value of the expression function for the i-th character/word.

For each w_i satisfying g(w_i) ≠ 0, the meaning M_i of the expression-vector element corresponding to its named-entity normalization is defined by the name w_i belongs to (the standard expression F or the non-standard expression A) together with its position Pos within that name, where Pos takes the values B (beginning), I (middle), E (end), or S (the name consists of a single character/word).

With the initial expression vector denoted V, for each character/word w_i (characters for Chinese, words for English):

V_{ij} = 1 if M_i is the label N_j, and V_{ij} = 0 otherwise,

where 1 ≤ i ≤ n, 1 ≤ j ≤ 7, and N_j denotes the label represented by the j-th element of the expression vector.
With the initialized expression vector V, after the linear-layer processing the output of the entity normalization information encoder is the final expression vector:

H_N = φ(V) = w_l · V + b_l

where H_N denotes the final expression vector, φ denotes the function acting on the initialized expression vector, w_l denotes the input weight of the linear layer, and b_l denotes its bias.
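For illustration, a sketch of initializing the seven-element expression vector V for a character-level (Chinese) sequence; the label ordering follows the description above, and the helper names are hypothetical.

```python
# Label order assumed from the description: indices 0-3 mark the beginning /
# middle / end / single-character non-standard name (A); indices 4-6 mark
# the beginning / middle / end of the standard name (F).
A_B, A_I, A_E, A_S, F_B, F_I, F_E = range(7)

def position_labels(length):
    """BIES position labels for a name of the given character length."""
    if length == 1:
        return ['S']
    return ['B'] + ['I'] * (length - 2) + ['E']

def expression_vector(chars, pairs):
    """Initialize V: one 7-dimensional indicator vector per character.
    `pairs` is the set of (F, A) pairs from the rule-based extractor.
    For simplicity, only the first occurrence of each name is marked."""
    V = [[0.0] * 7 for _ in chars]
    text = ''.join(chars)  # character-level joining (Chinese)
    index = {'A': {'B': A_B, 'I': A_I, 'E': A_E, 'S': A_S},
             'F': {'B': F_B, 'I': F_I, 'E': F_E}}
    for full, abbr in pairs:
        for name, kind in ((full, 'F'), (abbr, 'A')):
            start = text.find(name)
            if start < 0:
                continue
            for offset, pos in enumerate(position_labels(len(name))):
                if kind == 'F' and pos == 'S':
                    continue  # the standard name is never a single character
                V[start + offset][index[kind][pos]] = 1.0
    return V
```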
(4) The context feature and entity expression feature combination module is responsible for concatenating the encodings of the context features and the entity expression features in vector space, obtaining a final encoding that fuses entity recognition and entity expression normalization information.
The hidden-layer vector H_R obtained by the context feature encoder and the expression vector H_N obtained by the entity normalization information encoder are spliced into a vector H_A containing both high-order and low-order feature interactions:

H_A = [H_R, H_N]

H_A is then transformed through a fully connected layer to form the output vector H of the final encoder:

H = w_f · H_A + b_f

H is a tensor of shape (n, L), where n is the length of each sample sequence and L is the number of output label classes.
(5) The conditional-random-field-based decoder is responsible for combining the features learned by the encoder with the transition probabilities between states to decode a globally optimal state sequence as the final output sequence.

H_{i,y_i} denotes the score of the state feature when the predicted tag of the i-th word of the sequence is y_i, and A_{y_i,y_{i+1}} denotes the score of the state-transition feature for moving from tag y_i to y_{i+1}; y_0 denotes the start of the tag sequence and y_n its end. The total score of a tag sequence is the sum of the state-feature scores and the transition-feature scores, defined as:

S(X, y) = Σ_{i=0}^{n} A_{y_i, y_{i+1}} + Σ_{i=1}^{n} H_{i, y_i}

Applying Softmax to the scores S(X, y) of all possible tag sequences y yields the probability of a sequence y:

p(y|X) = exp(S(X, y)) / Σ_{ỹ ∈ Y_X} exp(S(X, ỹ))

where Y_X denotes all possible tag sequences for the input sequence X. In the prediction phase of the decoder, the output tag sequence is the one achieving the maximum score:

y* = argmax_{ỹ ∈ Y_X} S(X, ỹ)
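For illustration, a sketch of the prediction-phase decoding by dynamic programming (Viterbi) over the emission scores H and the transition matrix A defined above; the special start and end transitions (y_0, y_n) are omitted for brevity.

```python
import torch

def viterbi_decode(emissions, transitions):
    """emissions: (n, L) state-feature scores H; transitions: (L, L)
    matrix A with A[i, j] = score of moving from tag i to tag j.
    Returns the tag sequence y* maximizing S(X, y) (start/end omitted)."""
    n, L = emissions.shape
    score = emissions[0].clone()  # best score ending in each tag so far
    backptr = []
    for t in range(1, n):
        # score[i] + A[i, j] + H[t, j] for every previous/current tag pair
        total = score.unsqueeze(1) + transitions + emissions[t]
        score, best_prev = total.max(dim=0)  # best previous tag per current tag
        backptr.append(best_prev)
    # follow back-pointers from the best final tag
    best_tag = int(score.argmax())
    path = [best_tag]
    for best_prev in reversed(backptr):
        best_tag = int(best_prev[best_tag])
        path.append(best_tag)
    return list(reversed(path))
```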
Fig. 3 is a flowchart illustrating steps of a method for extracting a text entity and an entity association relationship based on a recurrent neural network according to an embodiment of the present invention. The steps are specifically described as follows:
step 1.1 prepare data and segment the dataset.
Preparing marked data, segmenting the marked data into a training data set, developing the data set and a testing data set, wherein the training data set and the developing data set are used in a training stage, and the testing data set is used in a testing stage. The data set is a text data set, and each sample is an article.
Step 1.2 Establish a character index table.
Build a character index over all the obtained corpora, numbering each character from 1 and adding a number for the unknown character, for use by the subsequent character/word embedding module (characters for Chinese, words for English).
Step 1.3 Batch sample input
Samples from the training dataset are fed into the system in batches, following the mini-batch principle with a set batch size.
Step 2.1 Word segmentation
Segment the text sentence by sentence.
Step 2.2 Definition pattern matching
Search each sentence for the syntactic conditions of a definition pattern, e.g. whether a definition of the form "full name (abbreviation)" is present. If it is, the sentence may define an entity expression pair; if not, the sentence contains no entity expression pair.
Step 2.3 Extract entity expression pairs by forward and backward search
Taking the definition marker, such as "(", as a separator, search the words before and after it and check whether word combinations satisfying the lexical conditions of an entity expression pair exist on both sides; if so, extract the entity expression pair.
Step 2.4 Expression information embedding
Embed the extracted entity expression information, converting it into a low-dimensional entity expression vector in which each dimension indicates whether a character belongs to the abbreviation or the full name and its position within the entity. If no entity expression pair exists, the vector is initialized to all zeros.
Step 3 Word embedding
Embed each character of each input sample at the character/word level (words for English, characters for Chinese), converting it into a 300-dimensional vector via the character index table and a linear layer.
Step 4 Bidirectional LSTM network
The input sample sequence, represented by the word vectors, is fed into the bidirectional LSTM network to extract context feature information.
Step 5 Connect and transform
The hidden-layer vectors output by the bidirectional LSTM network are spliced with the entity expression vectors, connecting them in vector space; the dimensionality of the tensor is then transformed through the fully connected layer. This yields the emission score of each character (the probability of the state sequence generating the observation sequence), i.e. the state features of the CRF model.
Step 6 CRF modeling of state transition probabilities
The CRF models the dependencies between states (tags) as well as the emission probabilities from the observation sequence to the state sequence.
Step 7 Decode the globally optimal sequence
Compute the score of each sequence and, combining the tag transition probabilities, use a dynamic programming algorithm to find the sequence with the highest global score as the final output sequence. In the prediction phase the procedure ends at step 7; in the training phase, steps 8 and 9 follow.
Step 8 Compute the cost function
During training, the objective function is to maximize the log-likelihood of the correct tag sequences of the training set:

L = Σ_{(X, y)} log p(y|X)

The cost function is the negative of the objective function.
Step 9 Adaptive gradient descent algorithm
Train the model with the Adam algorithm, adjusting the learning rate adaptively as training proceeds. If the model's performance on the held-out test set decreases, overfitting is indicated and training stops immediately; otherwise training continues.
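For illustration, a sketch of the training loop of steps 8 and 9, assuming a model object exposing a hypothetical neg_log_likelihood method that returns the negative of the objective above, and a hypothetical evaluate helper that scores the model on the held-out data.

```python
import torch

def train(model, train_loader, heldout_loader, evaluate,
          max_epochs=50, lr=1e-3):
    """Adam training with the early-stopping rule of step 9: stop as soon
    as performance on the held-out data decreases. `neg_log_likelihood`
    and `evaluate` are hypothetical interfaces, not part of the disclosure."""
    optimizer = torch.optim.Adam(model.parameters(), lr=lr)
    best_score = float('-inf')
    for epoch in range(max_epochs):
        model.train()
        for tokens, tags in train_loader:  # mini-batches, as in step 1.3
            optimizer.zero_grad()
            loss = model.neg_log_likelihood(tokens, tags)  # -log p(y|X), step 8
            loss.backward()
            optimizer.step()
        score = evaluate(model, heldout_loader)  # e.g. entity-level F1
        if score < best_score:
            break  # performance dropped: assume overfitting and stop
        best_score = score
    return model
```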
While the foregoing disclosure shows illustrative embodiments of the invention, it should be noted that various changes and modifications could be made herein without departing from the scope of the invention as defined by the appended claims. In accordance with the structures of the embodiments of the invention described herein, the constituent elements of the claims can be replaced with any functionally equivalent elements. Therefore, the scope of the present invention should be determined by the contents of the appended claims.

Claims (10)

1. A text language association extraction method based on a recurrent neural network, characterized by comprising the following steps:
(1) automatically extracting complex context features through a temporal recurrent neural network, and encoding the context feature information;
(2) discovering definition patterns in the document through rules, identifying definitions of non-standard expressions in the document, and extracting the defined standard expression and non-standard expression that belong to the same entity concept as entity expression features;
(3) encoding the extracted entity expression features, and embedding information about entity normalization into a low-dimensional entity expression vector;
(4) concatenating the encodings of the context features and the entity expression features in vector space to obtain a final encoding that fuses entity recognition and entity expression normalization information;
(5) feeding the final encoding into a conditional random field model, computing the globally optimal state sequence by combining the transition probabilities between states, and decoding and outputting the final result sequence of text entities and entity association relations.
2. The method of claim 1, wherein the temporal recurrent neural network of step (1) is a bidirectional long short-term memory network.
3. The method of claim 1, wherein step (2) extracts the expression pairs referring to the same entity concept through rules based on syntactic and lexical structure, wherein the non-standard expressions include abbreviations and alternative names, and the string length of the standard expression is required to be greater than that of the non-standard expression.
4. The method of claim 1, wherein step (3) converts the expression pairs extracted by the entity expression pair extractor into vectors of a fixed length, which are then further learned through a linear layer to obtain the final entity expression vector.
5. The method of claim 4, wherein the elements of the entity expression vector correspond, from left to right, to the beginning, middle, end, and single-character form of the non-standard name, and to the beginning, middle, and end of the standard name.
6. The method of claim 1, wherein step (4) splices the hidden-layer vector H_R obtained by the context feature encoder and the expression vector H_N obtained by the entity normalization information encoder into a vector H_A containing high-order and low-order feature interactions, H_A = [H_R, H_N], and transforms H_A through a fully connected layer into the output vector H of the final encoder, where H is a tensor of shape (n, L), n is the length of each sample sequence, and L is the number of output label classes.
7. The method of claim 1, wherein step (5) comprises:
(5.1) computing the total score of a tag sequence as the sum of the state-feature scores and the transition-feature scores, defined as:

S(X, y) = Σ_{i=0}^{n} A_{y_i, y_{i+1}} + Σ_{i=1}^{n} H_{i, y_i}

where H_{i,y_i} denotes the score of the state feature when the predicted tag of the i-th word of the sequence is y_i, A_{y_i,y_{i+1}} denotes the score of the state-transition feature for moving from tag y_i to y_{i+1}, y_0 denotes the start of the tag sequence, and y_n denotes its end;
(5.2) applying Softmax to the scores S(X, y) of all possible tag sequences y to obtain the probability of a sequence y:

p(y|X) = exp(S(X, y)) / Σ_{ỹ ∈ Y_X} exp(S(X, ỹ))

where Y_X denotes all possible tag sequences for the input sequence X;
(5.3) in the prediction phase of the decoder, outputting the tag sequence that achieves the maximum score:

y* = argmax_{ỹ ∈ Y_X} S(X, ỹ)
8. The method of claim 1 or 7, wherein, for the conditional-random-field-based decoder, the objective function during training is to maximize the log-likelihood of the correct tag sequences of the training set, and the cost function is the negative of the objective function.
9. The method of claim 8, wherein the conditional-random-field-based decoder is trained using an adaptive gradient descent algorithm, the learning rate being adjusted adaptively according to the training progress; if the performance of the model on the test set decreases, training is stopped, and otherwise training continues.
10. A system for extracting text language associations based on a recurrent neural network, comprising:
a character/word embedding module, which maps each character/word of the original text sequence to a vector of a fixed dimension;
a context feature encoder, which represents the embedded text sequence in vector form, automatically extracts complex context features, and encodes the semantic information of the context;
a word segmentation module, which segments the original text sequence;
an entity expression pair extractor, which discovers definitions of non-standard expressions in the document based on the segmentation results of the word segmentation module, and extracts the defined standard and non-standard expressions belonging to the same entity concept as entity expression features;
an entity normalization information encoder, which encodes the entity expression features extracted by the entity expression pair extractor, embedding the information about entity normalization into a low-dimensional entity expression vector;
an entity recognition and normalization encoding combination module, which concatenates the context features produced by the context feature encoder with the entity expression feature encodings produced by the entity normalization information encoder in vector space, obtaining a final encoding that fuses entity recognition and entity expression normalization information;
and a conditional-random-field-based decoder, which combines the output of the entity recognition and normalization encoding combination module with the transition probabilities between states to compute a globally optimal state sequence as the final output sequence of text entities and entity association relations.
CN201811600745.6A 2018-12-26 2018-12-26 Text language association extraction method and system based on recurrent neural network Pending CN111368542A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201811600745.6A CN111368542A (en) 2018-12-26 2018-12-26 Text language association extraction method and system based on recurrent neural network

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201811600745.6A CN111368542A (en) 2018-12-26 2018-12-26 Text language association extraction method and system based on recurrent neural network

Publications (1)

Publication Number Publication Date
CN111368542A true CN111368542A (en) 2020-07-03

Family

ID=71206031

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201811600745.6A Pending CN111368542A (en) 2018-12-26 2018-12-26 Text language association extraction method and system based on recurrent neural network

Country Status (1)

Country Link
CN (1) CN111368542A (en)



Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101853292A (en) * 2010-05-18 2010-10-06 深圳市北科瑞讯信息技术有限公司 Method and system for constructing business social network
CN106569998A (en) * 2016-10-27 2017-04-19 浙江大学 Text named entity recognition method based on Bi-LSTM, CNN and CRF
CN107122416A (en) * 2017-03-31 2017-09-01 北京大学 A kind of Chinese event abstracting method
CN107526798A (en) * 2017-08-18 2017-12-29 武汉红茶数据技术有限公司 A kind of Entity recognition based on neutral net and standardization integrated processes and model
CN108460013A (en) * 2018-01-30 2018-08-28 大连理工大学 A kind of sequence labelling model based on fine granularity vocabulary representation model
CN108446355A (en) * 2018-03-12 2018-08-24 深圳证券信息有限公司 Investment and financing event argument abstracting method, device and equipment

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
朱佳晖 (Zhu Jiahui): "Military Named Entity Recognition and Linking Based on Bidirectional LSTM and CRF," in Proceedings of the 6th China Command and Control Conference (Volume I) *

Cited By (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112184178A (en) * 2020-10-14 2021-01-05 深圳壹账通智能科技有限公司 Mail content extraction method and device, electronic equipment and storage medium
CN113065346A (en) * 2021-04-02 2021-07-02 国网浙江省电力有限公司信息通信分公司 Text entity identification method and related device
CN113268595A (en) * 2021-05-24 2021-08-17 中国电子科技集团公司第二十八研究所 Structured airport alarm processing method based on entity relationship extraction
CN113268595B (en) * 2021-05-24 2022-09-06 中国电子科技集团公司第二十八研究所 Structured airport alarm processing method based on entity relationship extraction
CN114625340A (en) * 2022-05-11 2022-06-14 深圳市商用管理软件有限公司 Commercial software research and development method, device, equipment and medium based on demand analysis
CN114663896A (en) * 2022-05-17 2022-06-24 深圳前海环融联易信息科技服务有限公司 Document information extraction method, device, equipment and medium based on image processing
CN116090458A (en) * 2022-12-20 2023-05-09 北京邮电大学 Medical information extraction method, device, electronic equipment and storage medium

Similar Documents

Publication Publication Date Title
CN111783462B (en) Chinese named entity recognition model and method based on double neural network fusion
CN110298037B (en) Convolutional neural network matching text recognition method based on enhanced attention mechanism
CN109657239B (en) Chinese named entity recognition method based on attention mechanism and language model learning
CN110162749B (en) Information extraction method, information extraction device, computer equipment and computer readable storage medium
CN108416058B (en) Bi-LSTM input information enhancement-based relation extraction method
CN111368542A (en) Text language association extraction method and system based on recurrent neural network
CN111737496A (en) Power equipment fault knowledge map construction method
CN111738003B (en) Named entity recognition model training method, named entity recognition method and medium
CN110851604B (en) Text classification method and device, electronic equipment and storage medium
CN111666758B (en) Chinese word segmentation method, training device and computer readable storage medium
CN110263325B (en) Chinese word segmentation system
CN112380863A (en) Sequence labeling method based on multi-head self-attention mechanism
CN113704416B (en) Word sense disambiguation method and device, electronic equipment and computer-readable storage medium
CN112163089B (en) High-technology text classification method and system integrating named entity recognition
CN111309918A (en) Multi-label text classification method based on label relevance
Ren et al. Detecting the scope of negation and speculation in biomedical texts by using recursive neural network
CN111222318A (en) Trigger word recognition method based on two-channel bidirectional LSTM-CRF network
CN113190656A (en) Chinese named entity extraction method based on multi-label framework and fusion features
CN111737497B (en) Weak supervision relation extraction method based on multi-source semantic representation fusion
CN114428850A (en) Text retrieval matching method and system
CN111428518B (en) Low-frequency word translation method and device
CN116522165B (en) Public opinion text matching system and method based on twin structure
CN116680575B (en) Model processing method, device, equipment and storage medium
CN115062123A (en) Knowledge base question-answer pair generation method of conversation generation system
Liu et al. Exploring segment representations for neural semi-Markov conditional random fields

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
WD01 Invention patent application deemed withdrawn after publication

Application publication date: 20200703