CN112800756B - Entity identification method based on PRADO - Google Patents

Entity identification method based on PRADO

Info

Publication number
CN112800756B
CN112800756B (application CN202011334119.4A)
Authority
CN
China
Prior art keywords
gate
projection
output
word
vector
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202011334119.4A
Other languages
Chinese (zh)
Other versions
CN112800756A (en)
Inventor
尚凤军
冉淳夫
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Chongqing University of Post and Telecommunications
Original Assignee
Chongqing University of Post and Telecommunications
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Chongqing University of Post and Telecommunications filed Critical Chongqing University of Post and Telecommunications
Priority to CN202011334119.4A priority Critical patent/CN112800756B/en
Publication of CN112800756A publication Critical patent/CN112800756A/en
Application granted granted Critical
Publication of CN112800756B publication Critical patent/CN112800756B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/279 Recognition of textual entities
    • G06F40/284 Lexical analysis, e.g. tokenisation or collocates
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods
    • G06N3/084 Backpropagation, e.g. using gradient descent

Abstract

The invention relates to the technical field of computer networks, in particular to a PRADO-based entity identification method, which comprises: acquiring original data and performing word segmentation and labeling processing on it; on the PRADO layer, based on a projection Embedding model, constructing a projection network with locality-sensitive hashing and converting each word in a sentence into a low-dimensional Embedding word vector; extracting the Embedding vector features by using the context-association property of the BiLSTM neural network; assigning different attention weights to the feature vectors acquired by the BiLSTM layer through an attention mechanism; and completing the sequence-labeling task with the CRF. The invention constructs the projection network with the LSH algorithm in order to reduce the word-embedding vector parameters, and at the same time uses an attention mechanism to preserve the relation between the feature vectors and the whole text, eliminating the hidden danger that the LSH algorithm cannot relate well to the context.

Description

Entity identification method based on PRADO
Technical Field
The invention relates to the technical field of computer networks, in particular to an entity identification method based on PRADO.
Background
In recent years, with the continuous development of Internet technology, large amounts of data from all walks of life have appeared on the network. This data has high value, and how to efficiently acquire, store, analyze and apply it is a problem to be studied in the big-data era. The data contains not only structured data that has already been organized, but also large amounts of unstructured and semi-structured data that has not, and natural language processing technology can be used to process and classify it. With the rapid growth of the total amount of Internet information, the traditional semantic network is no longer suitable, and the appearance of the knowledge graph provides a new idea for solving this problem.
Extracting entity relations is an indispensable link in constructing a knowledge graph: the quality of the extracted entities and relations determines the quality of the graph. The technology is used not only in search engines but also in other industries, including medical care, education, securities investment and finance. In general, every field involves relations, and the existence of relations provides a foundation for constructing the knowledge graph, from which the value of the knowledge graph can then be extracted.
An existing entity-relation extraction model such as the Skip-Gram model predicts context word vectors from a selected target word vector: a word in the sequence is first selected as a reference point, and another word near the reference point is then found with a sliding window and used as a label, so that many reference-point/label pairs are obtained and used as the input of the model. However, the vector dimensions trained by these conventional word-vector techniques are large, so the input parameters of the network become extremely numerous and training the model becomes extremely difficult.
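For illustration, a minimal Python sketch of that sliding-window sampling follows; the window size, example tokens and function name are assumptions chosen for illustration and are not taken from the patent:

```python
# Minimal sketch of Skip-Gram training-pair generation with a sliding window.
# Window size, tokens and function name are illustrative assumptions.
def skip_gram_pairs(tokens, window=2):
    pairs = []
    for i, center in enumerate(tokens):
        # every word inside the window around the reference point becomes a label
        for j in range(max(0, i - window), min(len(tokens), i + window + 1)):
            if j != i:
                pairs.append((center, tokens[j]))
    return pairs

print(skip_gram_pairs(["the", "knowledge", "graph", "needs", "entities"]))
```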
Disclosure of Invention
In order to reduce the size of the parameters in the Embedding stage, and to reduce the number of parameters while ensuring that the word vectors describe the information comprehensively so that model training becomes simpler and lighter, the invention provides a PRADO-based entity identification method which, as shown in FIG. 1, specifically comprises the following steps:
acquiring original data, and performing word segmentation and labeling processing on the original data;
on the PRADO layer, based on a projection Embedding model, constructing a projection network by using locality-sensitive hashing and converting each word in a sentence into a low-dimensional Embedding word vector;
extracting the Embedding vector features by using the context-association property of the BiLSTM neural network;
assigning different attention weights to the feature vectors acquired by the BiLSTM layer through an attention mechanism;
and completing the sequence-labeling task by using the CRF.
Further, the process of converting each word in a sentence into a low-dimensional Embedding word vector includes:
repeatedly carrying out binary hashing on the i-th word to obtain a 2B-bit vector ŵ_i;
using a projection matrix P generated from initial random numbers to project ŵ_i, obtaining a d-dimensional vector ê_i;
applying an activation function to ê_i to obtain the low-dimensional Embedding word vector e_i of the word.
Further, the process of optimizing the projection matrix P generated from initial random numbers includes:
comparing the final output result of the model with the actual value, performing the back-propagation algorithm, and adaptively updating the projection matrix P through gradient checking.
Further, projecting ŵ_i with the projection matrix includes:
ê_i,k = P_k(ŵ_i) = ||ŵ_i|| · cos θ_k,  k = 1, 2, ..., d;
wherein P_k is the projection function, θ_k represents the angle between the vector ŵ_i and the vector P_k, and ê_i is the projection of ŵ_i:
ê_i = (ê_i,1, ê_i,2, ..., ê_i,d).
further, the ith word is a low-dimensional Embedding word list eiExpressed as:
Figure GDA0003544475550000031
wherein, WpA weight parameter for the projection network; b ispIs the bias parameter of the projection network.
Further, assigning different attention weights, through an attention mechanism, to the feature vectors obtained by the projection layer includes:
α_i,t' ≥ 0;
Σ_{t'=1..T_x} α_i,t' = 1;
α_i,t' = exp(e_i,t') / Σ_{τ=1..T_x} exp(e_i,τ);
wherein α_i,t' indicates how much attention the generated result y_i should pay to e_t', i.e. the attention weight factor; e_i,t' is an auxiliary parameter ensuring that the weights sum to 1; y_i is the output result; and T_x is the length of the input sequence.
Further, the Embedding vector features are extracted by using the context-association property of the BiLSTM neural network, i.e. at each time step the data to be deleted is removed, new content is added, the memory cell is updated, and the data of the current time step is output. The BiLSTM neural network comprises a forget gate, an input gate and an output gate, wherein the forget gate selects the information to be discarded or kept in the memory cell, the input gate updates the control factor and the content, and the output gate determines the final output content. The forget gate is expressed as:
Γ_f = σ(W_f[a<t-1>, x<t>, c<t-1>] + b_f);
the input gate is expressed as:
Γ_u = σ(W_u[a<t-1>, x<t>, c<t-1>] + b_u);
c̃<t> = tanh(W_c[a<t-1>, x<t>] + b_c);
c<t> = Γ_u * c̃<t> + Γ_f * c<t-1>;
the output gate is expressed as:
Γ_o = σ(W_o[a<t-1>, x<t>, c<t-1>] + b_o);
a<t> = Γ_o * c<t>;
wherein Γ_f is the factor of the forget gate, W_f is the weight of the forget gate, and b_f is the bias value of the forget gate; a<t-1> is the activation value at the previous time step; c<t-1> is the memory-cell value at the previous time step; Γ_u is the factor of the input gate, W_u is the weight of the input gate, and b_u is the bias value of the input gate; c̃<t> is the content to be newly added; c<t> is the updated memory-cell value; x<t> is the t-th input parameter; Γ_o is the factor of the output gate, W_o is the weight of the output gate, and b_o is the bias value of the output gate; b_c is the bias value corresponding to c̃<t>.
Further, completing the sequence-labeling task by using the CRF includes:
s(X, y) = Σ_i A_{y_{i-1}, y_i} + Σ_i P_{i, y_i};
P(y|x) = (1/Z(x)) · exp( Σ_{i,k} λ_k·t_k(y_{i-1}, y_i, x, i) + Σ_{i,l} μ_l·s_l(y_i, x, i) );
Z(x) = Σ_y exp( Σ_{i,k} λ_k·t_k(y_{i-1}, y_i, x, i) + Σ_{i,l} μ_l·s_l(y_i, x, i) );
wherein A_{y_{i-1}, y_i} is the transfer matrix, representing the transition probability from label y_{i-1} to label y_i; P_{i, y_i} is the score that the prediction is the y_i-th label; Z(x) is a normalization factor; t_k and s_l are feature functions; and μ_l and λ_k are weight parameters.
In the entity recognition model provided by the invention, the idea of the PRADO algorithm is borrowed at the word-embedding layer, and a projection network is constructed with the LSH algorithm, so that the word-embedding vector parameters are reduced; at the same time, an attention mechanism is used to preserve the relation between the feature vectors and the whole text, eliminating the hidden danger that the LSH algorithm cannot relate well to the context. Then, the strong local association of the network is used in the BiLSTM layer, so that the trained result is better associated both with the whole text and with its local parts. Finally, the sequence-labeling task is completed at the CRF layer, and throughout the model the weight parameters of each layer are continuously adjusted through the back-propagation mechanism.
Drawings
FIG. 1 is a flow chart of a PRADO-based entity identification method of the present invention;
FIG. 2 is a schematic diagram of the PRADO-BiLSTM-CRF model employed in the present invention;
FIG. 3 is a schematic structural diagram of an attention model employed in the present invention;
FIG. 4 is a schematic structural diagram of a BiLSTM model employed in the present invention;
FIG. 5 is a schematic diagram of an LSTM cell unit according to the present invention;
FIG. 6 is a schematic view of a CRF structure according to the present invention.
Detailed Description
The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be obtained by a person skilled in the art without making any creative effort based on the embodiments in the present invention, belong to the protection scope of the present invention.
The invention provides a PRADO-based entity identification method which, as shown in FIG. 1, specifically comprises the following steps:
acquiring original data, and performing word segmentation and labeling processing on the original data;
on the PRADO layer, based on a projection Embedding model, constructing a projection network by using locality-sensitive hashing and converting each word in a sentence into a low-dimensional Embedding word vector;
extracting the Embedding vector features by using the context-association property of the BiLSTM neural network;
assigning different attention weights to the feature vectors acquired by the BiLSTM layer through an attention mechanism;
and completing the sequence-labeling task by using the CRF.
As shown in FIG. 2, operations such as word segmentation and labeling are first performed on the original data, which is then fed into the PRADO layer. This layer uses the idea of the projection Embedding model: locality-sensitive hashing (LSH) is used to construct a projection network, each word in a sentence is converted into a low-dimensional Embedding word vector, and the acquired feature vectors are then assigned different attention weights through an attention mechanism, eliminating the defect that the LSH algorithm cannot relate to the whole text. The second layer is the BiLSTM layer, which extracts the Embedding vector features by using the context-association property of the BiLSTM neural network, remedying the defect that the LSH in the first layer cannot fully consider the preceding and following relations. The third layer is the CRF layer, which completes the sequence-labeling task. Next, this embodiment describes in detail the model used in each layer.
(I) PRADO
In the traditional embedding concept, assume the input text has T tokens or words, and W_i represents the i-th word, where i ∈ {0, 1, ..., T-1}. If V is the number of words in the vocabulary, including the out-of-vocabulary token representing all missing words, then each word W_i is mapped to a one-hot vector δ_i over V. In most linguistic neural networks, words are typically mapped to fixed-length d-dimensional vectors using an embedding layer with trainable parameters W ∈ R^{d×V}: e_i = W·δ_i, where e_i ∈ R^d is the word vector. Since most parameters in the network come from the word vectors trained through W, and a word-vector matrix that describes W in detail is desired, the vocabulary V must be very complete, i.e. its dimension must be particularly large; only then do the trained word vectors perform relatively well. However, when the dimension of V is large, the dimension of W is also large, so the number of parameters of the whole neural network becomes extremely large and training the network becomes particularly difficult. Therefore, in the Embedding stage, a projection Embedding mode is proposed for training the word vectors, so as to reduce the network parameters and make network training faster.
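As a rough illustration of why the parameter count explodes, the following comparison contrasts a trainable embedding table W ∈ R^{d×V} with a fixed-width projection matrix; the vocabulary size, embedding dimension and fingerprint width are assumed example values, not figures from the patent:

```python
# Illustrative parameter-count comparison; V, d and B are assumed example values.
V = 200_000   # vocabulary size, including the out-of-vocabulary token
d = 300       # word-vector dimension
B = 64        # half the fingerprint width: each word is hashed to a 2B-bit vector

embedding_table = V * d        # trainable W in R^{d x V}, one column per word
projection_net  = (2 * B) * d  # projection matrix P mapping a 2B-bit hash to d dims

print(f"embedding table: {embedding_table:,} parameters")   # 60,000,000
print(f"projection net : {projection_net:,} parameters")    # 38,400
```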
In the Embedding stage, if the dimension of the trained W is too large, the representation of the word vectors is complete but the parameters trained by the network explode; if the dimension is too small, the description of the word vectors is inaccurate and the network cannot be trained correctly. The mode adopted by PRADO is therefore a compromise: with a projection network, a word does not need to be represented with particular accuracy, as long as the trained word vector can describe the attributes of the word to a certain extent. For example, in entity classification, the specific differences between Chongqing University and Chongqing University of Posts and Telecommunications do not need to be known; it is only necessary to understand that both refer to universities. That is, in some specific fields the exact referent of an entity does not need to be fully known, only the class to which the entity belongs.
In this embodiment, a basic projection model is constructed using locality-sensitive hashing (LSH). The size and precision of word vectors trained by the traditional word2vec method depend mainly on the dimension of the vocabulary, while LSH, as a dimension-reduction technique from clustering algorithms, can control the dimension and sparsity of the word vectors more independently, so that vocabularies that would otherwise require high-dimensional representations can have their dimension controlled within a certain range, reducing the parameters and producing compact embeddings, thereby optimizing the training effect of the whole model. The main steps are as follows:
1. For each word W_i in the input text, iteratively perform binary hashing to obtain the 2B-bit vector ŵ_i; here, assume max(i) = N;
2. Using the projection matrix P generated from initial random numbers (P can be optimally adjusted by the back-propagation mechanism), convert ŵ_i into ê_i as shown in equation (1), obtaining the d-dimensional vector ê_i:
ê_i,k = P_k(ŵ_i) = ||ŵ_i|| · cos θ_k,  k = 1, 2, ..., d   (1)
ê_i = (ê_i,1, ê_i,2, ..., ê_i,d)   (2)
This yields a d-dimensional vector representation, in which each dimension corresponds to the projection of the vector onto P_k, k = 1, 2, ..., d.
3. Use an activation function to obtain e_i, as in equation (3):
e_i = f(W_p ê_i + B_p)   (3)
where W_p and B_p represent the weights and biases of the projection network, respectively. From the above formulas it can be seen that there are N × d parameters in total, which can be mapped into N d-dimensional word-embedding vectors e_i, giving the feature-vector matrix E = (e_1, e_2, ..., e_{n-1}, e_n).
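A minimal Python/NumPy sketch of this projection-Embedding step follows. The overall shape (token → 2B-bit fingerprint → projection by P → activation) follows the steps above, but the concrete hash function (MD5), the ReLU activation and all sizes are assumptions chosen for illustration:

```python
import hashlib
import numpy as np

# Sketch of the projection Embedding: hash each token to a 2B-bit fingerprint,
# project it with a randomly initialised matrix P (which back-propagation would
# later tune), then pass it through an activation. Hash and activation choices
# are assumptions; only the overall structure follows the steps above.
def token_fingerprint(token, B=64):
    digest = hashlib.md5(token.encode("utf-8")).digest()           # 128 bits
    bits = np.unpackbits(np.frombuffer(digest, dtype=np.uint8))[: 2 * B]
    return bits.astype(np.float32) * 2.0 - 1.0                     # {0,1} -> {-1,+1}

def projection_embedding(tokens, P, W_p, B_p):
    # P: (d, 2B) projection matrix; W_p, B_p: weight and bias of the projection net
    B = P.shape[1] // 2
    fingerprints = np.stack([token_fingerprint(t, B) for t in tokens])  # (N, 2B)
    projected = fingerprints @ P.T                                      # (N, d)
    return np.maximum(0.0, projected @ W_p.T + B_p)                     # ReLU assumed

rng = np.random.default_rng(0)
d, B = 32, 64
P   = rng.standard_normal((d, 2 * B)) / np.sqrt(2 * B)
W_p = rng.standard_normal((d, d)) / np.sqrt(d)
B_p = np.zeros(d)
E = projection_embedding(["entity", "recognition", "method"], P, W_p, B_p)
print(E.shape)   # (3, 32): one d-dimensional Embedding vector per token
```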
In this way a feature-vector representation is obtained that is compressed compared with the traditional word-embedding method, and a token no longer needs to be described in great detail with a one-hot vector; at the same time, the coarse granularity of word segmentation keeps the dimensions N and d within a relatively small range, so fewer parameters are input into the neural network and the network runs faster. However, because the LSH algorithm is used, the feature vector obtained by this training can only describe the word within a certain range and cannot relate well to the preceding and following context. Therefore, before the feature vectors obtained at this stage are fed into the BiLSTM model, they need to be processed by an attention mechanism to reduce the disadvantage of the LSH algorithm, as shown in FIG. 3. The specific steps are as follows:
1. Use α_i,t' to indicate how much attention the generated result y_i should pay to e_t', satisfying the following equations:
α_i,t' ≥ 0   (4)
Σ_{t'=1..T_x} α_i,t' = 1   (5)
Here α_i,t' is also expressed as the attention weight of the output y_i; to ensure that the weights sum to 1, softmax is used and an auxiliary parameter e_i,t' is introduced, such that:
α_i,t' = exp(e_i,t') / Σ_{τ=1..T_x} exp(e_i,τ)   (6)
2. The above equation requires computing e_i,t', so a simple neural network model is built and e_i,t' is then computed with a gradient-descent algorithm.
3. The result Y = {y_1, y_2, ..., y_{n-1}, y_n} output in the above steps replaces the feature-vector matrix E = (e_1, e_2, ..., e_{n-1}, e_n), i.e. E := Y, and is taken as the input of the BiLSTM, through which the network better captures the features relating the preceding and following parts of the text and improves the accuracy of the final output.
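A small sketch of that re-weighting step is given below; here the auxiliary scores e_i,t' are produced by a scaled dot product between the projected vectors, which is an assumption made for illustration (the patent describes a small trainable network and gradient descent for this part), while the softmax normalisation guarantees the weights are non-negative and sum to 1:

```python
import numpy as np

# Sketch of the attention re-weighting applied to the projected vectors before
# the BiLSTM. The scoring function (scaled dot product) is an assumption; the
# softmax normalisation makes each row of alpha non-negative and sum to 1.
def attention_reweight(E):
    d = E.shape[1]
    scores = E @ E.T / np.sqrt(d)                  # auxiliary scores e_{i,t'}
    scores -= scores.max(axis=1, keepdims=True)    # numerical stability
    alpha = np.exp(scores)
    alpha /= alpha.sum(axis=1, keepdims=True)      # rows of alpha sum to 1
    return alpha @ E                               # Y: context-mixed vectors

E = np.random.default_rng(1).standard_normal((5, 8))   # 5 tokens, d = 8
Y = attention_reweight(E)
print(Y.shape)   # (5, 8); each row of Y now mixes information from the sentence
```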
(II) BiLSTM layer
In the field of natural language processing, the named entity recognition problem is usually expressed as a sequence model, and two problems arise if a standard neural network model is used. First, because sequences differ, each text sequence obtains a matching output after being input into the model, but when another sequence is used, the model's input and output differ from those of the previous sequence and are not equal. Second, due to the particularity of text sequences, the context information of a sequence is related, but an ordinary model cannot relate the context of the sequence. The general neural network therefore has a natural disadvantage in solving sequence problems, and the recurrent neural network (RNN) was proposed to address the sequence-model problem.
In the task of named entity recognition, compared with early rule-matching and machine-learning methods, using a recurrent neural network model can obviously improve the accuracy of entity recognition, but because the model is large, two defects generally exist: (1) in the back-propagation process of the model, because of the sequential nature of the RNN, the many hidden layers and the many weights in each layer, gradient explosion or vanishing occurs particularly easily; (2) for longer sequences, the model is not good at capturing long-term dependencies across the sequence.
To solve the above problems, gate-control units are added to the hidden units of the RNN, i.e. long short-term memory (LSTM), which changes the hidden layer of the RNN so that it can better capture deep connections and alleviate the gradient-vanishing problem. The main role of these gate structures is to control how much information flows through during transmission: in the training phase of the model, the intermediate data grow with the recurrent nature of the RNN, and the gate structures can learn which of these intermediate data are important and need to be retained, and which are relatively unimportant and can be discarded. The LSTM structure has three gates to control and regulate the transmission of information: a forget gate, an input gate and an output gate. However, because the LSTM only remembers the text sequence in a single direction, the BiLSTM model is finally chosen to solve this problem.
The previous layer of the network obtains the word vectors after data preprocessing, using the locality-sensitive hashing (LSH) algorithm in the projection network; because that algorithm cannot sufficiently relate two words that are relatively far apart, the BiLSTM network is needed to better relate the preceding and following words. Next, the word vectors are input into the constructed sequence-processing model.
The forward propagation formula of the LSTM model is as follows:
(1) Forget gate: the forget gate determines which information in the memory cell is discarded and which is kept. Its output value Γ_f lies between 0 and 1: the closer Γ_f is to 0, the more should be discarded; the closer Γ_f is to 1, the more should be kept. The forward-propagation formula of Γ_f is:
Γ_f = σ(W_f[a<t-1>, x<t>, c<t-1>] + b_f)   (7)
(2) Input gate: the input gate determines the newly updated content and consists of two parts, the update control factor and the content to be updated. First is the update factor, i.e. the update gate Γ_u, whose value lies in the range [0, 1]; different update values mean different information is retained, i.e. from 0 to 1 the importance of the information goes from low to high. The formula is:
Γ_u = σ(W_u[a<t-1>, x<t>, c<t-1>] + b_u)   (8)
Second is the content to be newly added, expressed as c̃<t>, with the formula:
c̃<t> = tanh(W_c[a<t-1>, x<t>] + b_c)   (9)
Finally, the memory cell c<t> at time t is obtained by combining the update gate Γ_u and the newly added value c̃<t>; the formula is:
c<t> = Γ_u * c̃<t> + Γ_f * c<t-1>   (10)
(3) Output gate: the output gate determines the final output content; its value lies in the range [0, 1]. The final output is given by the following formulas:
Γ_o = σ(W_o[a<t-1>, x<t>, c<t-1>] + b_o)   (11)
a<t> = Γ_o * c<t>   (12)
The main principle of the network is that at each time step the data to be deleted is removed, new content is then added and the memory cells are updated, and finally the data of the current time step is output. In this layer, the main steps are as follows:
1. Build the BiLSTM model, and input the word-vector matrix E = (e_1, e_2, ..., e_{n-1}, e_n) obtained in the first step into the BiLSTM model;
2. Train the network weights through the back-propagation algorithm;
3. Mitigate overfitting as required, using techniques such as Dropout and L2 regularization;
4. Output the sentence-level feature-vector matrix (y_1, y_2, ..., y_{n-1}, y_n).
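To make the gate equations (7)-(12) concrete, the sketch below implements a single cell step in Python/NumPy (a BiLSTM runs two such recurrences, one over the sequence and one over its reverse, and concatenates the outputs); all parameter shapes and the random initialisation are assumptions for illustration:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# One forward step of the gated cell described by equations (7)-(12).
# The concatenation order [a_prev, x_t, c_prev] follows the notation above;
# parameter shapes are illustrative assumptions.
def lstm_step(a_prev, c_prev, x_t, params):
    z = np.concatenate([a_prev, x_t, c_prev])
    gamma_f = sigmoid(params["W_f"] @ z + params["b_f"])        # forget gate, eq. (7)
    gamma_u = sigmoid(params["W_u"] @ z + params["b_u"])        # input gate, eq. (8)
    c_tilde = np.tanh(params["W_c"] @ np.concatenate([a_prev, x_t]) + params["b_c"])
    c_t = gamma_u * c_tilde + gamma_f * c_prev                  # memory-cell update
    gamma_o = sigmoid(params["W_o"] @ z + params["b_o"])        # output gate, eq. (11)
    a_t = gamma_o * c_t                                         # output, as in eq. (12)
    return a_t, c_t

h, d_in = 4, 8
rng = np.random.default_rng(2)
params = {
    "W_f": rng.standard_normal((h, 2 * h + d_in)), "b_f": np.zeros(h),
    "W_u": rng.standard_normal((h, 2 * h + d_in)), "b_u": np.zeros(h),
    "W_c": rng.standard_normal((h, h + d_in)),     "b_c": np.zeros(h),
    "W_o": rng.standard_normal((h, 2 * h + d_in)), "b_o": np.zeros(h),
}
a, c = np.zeros(h), np.zeros(h)
a, c = lstm_step(a, c, rng.standard_normal(d_in), params)
print(a.shape, c.shape)   # (4,) (4,)
```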
(III) CRF layer
Generally speaking, the softmax model could be selected directly to obtain the desired result; however, because the sentence-level feature vectors obtained from the BiLSTM model may suffer from label bias, and the traditional softmax model has shortcomings in handling this problem, the CRF model is selected to solve it, so as to obtain the optimal output result over the global sequence; the effect is better than using the BiLSTM model alone or the softmax model directly.
The output vectors Y = {y_1, y_2, ..., y_{n-1}, y_n} of the LSTM in the previous layer are input into this model, and the constraints of the conditional probability distribution are combined with the input-output sequence to obtain the final result and reduce the error of the data. The specific principle is as follows:
First, the output sequence of the BiLSTM layer, Y = {y_1, y_2, ..., y_{n-1}, y_n}, is set as the input sequence of the CRF, X = {x_1, x_2, ..., x_{n-1}, x_n}; then, letting the correct labeling sequence be y = {y_1, y_2, ..., y_{n-1}, y_n}, the conditional probability P(y|x) is constructed. The main formulas are as follows:
s(X, y) = Σ_i A_{y_{i-1}, y_i} + Σ_i P_{i, y_i}   (13)
P(y|x) = (1/Z(x)) · exp( Σ_{i,k} λ_k·t_k(y_{i-1}, y_i, x, i) + Σ_{i,l} μ_l·s_l(y_i, x, i) )   (14)
Z(x) = Σ_y exp( Σ_{i,k} λ_k·t_k(y_{i-1}, y_i, x, i) + Σ_{i,l} μ_l·s_l(y_i, x, i) )   (15)
where A_{y_{i-1}, y_i} is the transfer matrix, representing the transition probability from label y_{i-1} to label y_i; P_{i, y_i} is the score that the prediction is label y_i; Z(x) is the normalization factor; t_k and s_l are feature functions; and μ_l and λ_k are weight parameters.
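As an illustration of how such a layer is used at prediction time, the sketch below scores label sequences with an emission matrix playing the role of P_{i,y_i} and a transition matrix playing the role of A_{y_{i-1},y_i}, and decodes the best global sequence with the standard Viterbi algorithm; the label set and the random scores are assumptions for illustration, and the training of the feature-function weights λ and μ is not shown:

```python
import numpy as np

# Sketch of CRF decoding on top of the BiLSTM outputs: emissions[i, y] plays the
# role of P_{i, y_i} and trans[y_prev, y] the role of A_{y_{i-1}, y_i}. Both are
# assumed to be already trained; Viterbi finds the globally best label sequence.
def viterbi_decode(emissions, trans):
    n, k = emissions.shape
    score = emissions[0].copy()
    backptr = np.zeros((n, k), dtype=int)
    for i in range(1, n):
        total = score[:, None] + trans + emissions[i][None, :]   # (k, k) candidates
        backptr[i] = total.argmax(axis=0)                        # best previous label
        score = total.max(axis=0)
    best = [int(score.argmax())]
    for i in range(n - 1, 0, -1):                                # trace the path back
        best.append(int(backptr[i][best[-1]]))
    return best[::-1]

rng = np.random.default_rng(3)
labels = ["B-ORG", "I-ORG", "O"]                 # assumed label set for illustration
emissions = rng.standard_normal((4, len(labels)))   # one row per token from BiLSTM
trans = rng.standard_normal((len(labels), len(labels)))
print([labels[i] for i in viterbi_decode(emissions, trans)])
```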
Although embodiments of the present invention have been shown and described, it will be appreciated by those skilled in the art that changes, modifications, substitutions and alterations can be made in these embodiments without departing from the principles and spirit of the invention, the scope of which is defined in the appended claims and their equivalents.

Claims (4)

1. A PRADO-based entity identification method is characterized by comprising the following steps:
acquiring original data, and performing word segmentation and labeling processing on the original data;
on the PRADO layer, based on a projection Embedding model, constructing a projection network by using locality-sensitive hashing and converting each word in a sentence into a low-dimensional Embedding word vector, which comprises the following steps:
repeatedly carrying out binary hashing on the i-th word to obtain a 2B-bit vector ŵ_i;
using a projection matrix P generated from initial random numbers, wherein optimizing the projection matrix P comprises comparing the final output result of the model with the actual value, performing the back-propagation algorithm, and adaptively updating the projection matrix P through gradient checking;
projecting ŵ_i with the projection matrix to obtain a d-dimensional vector ê_i, which comprises:
ê_i,k = P_k(ŵ_i) = ||ŵ_i|| · cos θ_k,  k = 1, 2, ..., d;
ê_i = (ê_i,1, ê_i,2, ..., ê_i,d);
wherein P_k is the projection function, θ_k represents the angle between the vector ŵ_i and the vector P_k, and ê_i is the projection of ŵ_i;
applying an activation function to ê_i to obtain the low-dimensional Embedding word vector e_i of the word, expressed as:
e_i = f(W_p ê_i + B_p);
wherein f is the activation function, W_p is the weight parameter of the projection network, and B_p is the bias parameter of the projection network;
extracting the Embedding vector features by using the context-association property of the BiLSTM neural network;
assigning different attention weights to the feature vectors acquired by the BiLSTM layer through an attention mechanism;
and completing the sequence-labeling task by using the CRF.
2. The PRADO-based entity recognition method of claim 1, wherein the assigning the feature vectors obtained from the projection layer with different attention weights by an attention mechanism method comprises:
α_i,t' ≥ 0;
Σ_{t'=1..T_x} α_i,t' = 1;
α_i,t' = exp(e_i,t') / Σ_{τ=1..T_x} exp(e_i,τ);
wherein α_i,t' indicates how much attention the generated result y_i should pay to e_t', i.e. the attention weight factor; e_i,t' is an auxiliary parameter ensuring that the sum of the weights is 1; y_i is the output result; and T_x is the length of the input sequence.
3. The PRADO-based entity recognition method according to claim 1, wherein the Embedding vector features are extracted by using the context-association property of the BiLSTM neural network, i.e. at each time step the data to be deleted is removed, new content is added, the memory cell is updated, and the data of the current time step is output; the BiLSTM neural network comprises a forget gate, an input gate and an output gate, wherein the forget gate selects the information to be discarded or kept in the memory cell, the input gate updates the control factor and the content, and the output gate determines the final output content; the forget gate is expressed as:
Γ_f = σ(W_f[a<t-1>, x<t>, c<t-1>] + b_f);
the input gate is expressed as:
Γ_u = σ(W_u[a<t-1>, x<t>, c<t-1>] + b_u);
c̃<t> = tanh(W_c[a<t-1>, x<t>] + b_c);
c<t> = Γ_u * c̃<t> + Γ_f * c<t-1>;
the output gate is expressed as:
Γ_o = σ(W_o[a<t-1>, x<t>, c<t-1>] + b_o);
a<t> = Γ_o * c<t>;
wherein Γ_f is the factor of the forget gate, W_f is the weight of the forget gate, and b_f is the bias value of the forget gate; a<t-1> is the activation value at the previous time step; c<t-1> is the memory-cell value at the previous time step; Γ_u is the factor of the input gate, W_u is the weight of the input gate, and b_u is the bias value of the input gate; c̃<t> is the content to be newly added; c<t> is the updated memory-cell value; x<t> is the t-th input parameter; Γ_o is the factor of the output gate, W_o is the weight of the output gate, and b_o is the bias value of the output gate; b_c is the bias value corresponding to c̃<t>.
4. The PRADO-based entity identification method of claim 1, wherein the task of performing sequence labeling by using CRF comprises:
the output sequence of the BiLSTM layer, Y = {y_1, y_2, ..., y_{n-1}, y_n}, is taken as the input sequence of the CRF, X = {x_1, x_2, ..., x_{n-1}, x_n};
letting the correct labeling sequence of the training network be y = {y_1, y_2, ..., y_{n-1}, y_n}, the conditional probability P(y|x) is constructed, which specifically comprises:
s(X, y) = Σ_i A_{y_{i-1}, y_i} + Σ_i P_{i, y_i};
P(y|x) = (1/Z(x)) · exp( Σ_{i,k} λ_k·t_k(y_{i-1}, y_i, x, i) + Σ_{i,l} μ_l·s_l(y_i, x, i) );
Z(x) = Σ_y exp( Σ_{i,k} λ_k·t_k(y_{i-1}, y_i, x, i) + Σ_{i,l} μ_l·s_l(y_i, x, i) );
wherein A_{y_{i-1}, y_i} is the transfer matrix, representing the transition probability from label y_{i-1} to label y_i; P_{i, y_i} is the score that the prediction is the y_i-th label; Z(x) is a normalization factor; t_k and s_l are feature functions; and μ_l and λ_k are weight parameters.
CN202011334119.4A 2020-11-25 2020-11-25 Entity identification method based on PRADO Active CN112800756B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011334119.4A CN112800756B (en) 2020-11-25 2020-11-25 Entity identification method based on PRADO

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202011334119.4A CN112800756B (en) 2020-11-25 2020-11-25 Entity identification method based on PRADO

Publications (2)

Publication Number Publication Date
CN112800756A CN112800756A (en) 2021-05-14
CN112800756B true CN112800756B (en) 2022-05-10

Family

ID=75806276

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011334119.4A Active CN112800756B (en) 2020-11-25 2020-11-25 Entity identification method based on PRADO

Country Status (1)

Country Link
CN (1) CN112800756B (en)


Patent Citations (12)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
EP3398115A1 (en) * 2016-03-01 2018-11-07 Google LLC Compressed recurrent neural network models
CN107194414A (en) * 2017-04-25 2017-09-22 浙江工业大学 A kind of SVM fast Incremental Learning Algorithms based on local sensitivity Hash
CN110832596A (en) * 2017-10-16 2020-02-21 因美纳有限公司 Deep convolutional neural network training method based on deep learning
CN108628823A (en) * 2018-03-14 2018-10-09 中山大学 In conjunction with the name entity recognition method of attention mechanism and multitask coordinated training
WO2020093761A1 (en) * 2018-11-05 2020-05-14 扬州大学 Entity and relationship joint extraction method oriented to software bug knowledge
CN109902145A (en) * 2019-01-18 2019-06-18 中国科学院信息工程研究所 A kind of entity relationship joint abstracting method and system based on attention mechanism
CN110263332A (en) * 2019-05-28 2019-09-20 华东师范大学 A kind of natural language Relation extraction method neural network based
CN110825845A (en) * 2019-10-23 2020-02-21 中南大学 Hierarchical text classification method based on character and self-attention mechanism and Chinese text classification method
CN111291556A (en) * 2019-12-17 2020-06-16 东华大学 Chinese entity relation extraction method based on character and word feature fusion of entity meaning item
CN111522965A (en) * 2020-04-22 2020-08-11 重庆邮电大学 Question-answering method and system for entity relationship extraction based on transfer learning
CN111611775A (en) * 2020-05-14 2020-09-01 沈阳东软熙康医疗系统有限公司 Entity identification model generation method, entity identification method, device and equipment
CN111914097A (en) * 2020-07-13 2020-11-10 吉林大学 Entity extraction method and device based on attention mechanism and multi-level feature fusion

Non-Patent Citations (5)

* Cited by examiner, † Cited by third party
Title
PRADO: Projection Attention Networks for Document Classification On-Device;Kaliamoorthi Prabhu 等;《Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP)》;20191130;5012-5021 *
Quantization and training of neural networks for efficient integer-arithmetic-only inference;Jacob Benoit 等;《Proceedings of the IEEE conference on computer vision and pattern recognition》;20181231;2704-2713 *
The Prediction Model of Saccade Target Based on LSTM-CRF for Chinese Reading;Wan Xiaoming 等;《International Conference on Brain Inspired Cognitive Systems》;20180731;44-53 *
Distributed data mining based on an improved random decision tree algorithm;石红姣;《计算机与数字工程》;20170920;Vol. 45, No. 9;1802-1808 *
Text recognition algorithm based on deep-learning feature fusion with preprocessing techniques;冯玮;《中国优秀博硕士学位论文全文数据库(硕士)信息科技辑》;20190115 (No. 01);I138-5329 *

Also Published As

Publication number Publication date
CN112800756A (en) 2021-05-14


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant