CN114492441A - BiLSTM-BiDAF named entity recognition method based on machine reading comprehension - Google Patents

BiLSTM-BiDAF named entity recognition method based on machine reading comprehension

Info

Publication number
CN114492441A
CN114492441A CN202210052780.9A CN202210052780A
Authority
CN
China
Prior art keywords
text
entity
model
answer
bilstm
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202210052780.9A
Other languages
Chinese (zh)
Inventor
Xia Xiaoming
Wang Jie
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing University of Technology
Original Assignee
Beijing University of Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing University of Technology
Priority to CN202210052780.9A
Publication of CN114492441A
Legal status: Pending

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/279 Recognition of textual entities
    • G06F40/289 Phrasal analysis, e.g. finite state techniques or chunking
    • G06F40/295 Named entity recognition
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/30 Semantic analysis
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/044 Recurrent networks, e.g. Hopfield networks

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Computational Linguistics (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Biomedical Technology (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Evolutionary Computation (AREA)
  • Data Mining & Analysis (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Biophysics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Machine Translation (AREA)

Abstract

The invention discloses a BiLSTM-BiDAF named entity recognition method based on machine reading comprehension. First, full-text context information is acquired with the NEZHA pre-trained language model, and local dependency information is further captured by a BiLSTM. Second, a bidirectional attention mechanism is introduced to learn semantic associations between the text and entity categories. Finally, a boundary detector based on a gating mechanism is designed to strengthen the correlation between entity boundaries and predict the positions of entities in the text, and an answer-quantity detector is established to identify unanswerable questions. Experiments on the CCKS2020 Chinese electronic medical record and CMeEE datasets show that the proposed model can effectively identify named entities in text.

Description

BiLSTM-BiDAF named entity recognition method based on machine reading comprehension
Technical Field
The invention belongs to the fields of natural language processing and named entity recognition, and provides a BiLSTM-BiDAF named entity recognition model based on machine reading comprehension. The model can be applied to named entity recognition tasks on texts such as Chinese medical records, and can provide basic services for information extraction, question answering, machine translation, and the like.
Background
Natural language processing is an important research direction in the fields of computer science and artificial intelligence. Named Entity Recognition (NER) is a key task of natural language processing, and has wide application in information retrieval, question-answering systems and machine translation.
With the rapid development of neural networks, deep-learning-based named entity recognition methods have gradually replaced statistical machine learning methods with their complicated feature engineering, and have become the mainstream approach to the named entity recognition task. These methods cast entity recognition as a sequence labeling task and use the strong computing power of neural networks to automatically extract sentence-level context features. For example, Huang et al. first proposed combining a Bidirectional Long Short-Term Memory network (BiLSTM) with a conditional random field for named entity recognition; by passing information forward and backward through the sequence, the BiLSTM learns the context that each character depends on and alleviates the long-distance dependency problem of text to a certain extent. Rei et al., building on recurrent neural networks, used an attention mechanism to dynamically combine character-vector and word-vector information, addressing the problem of unknown words in text. Devlin et al. generated word embeddings with the pre-trained language model BERT and fine-tuned it on specific entity recognition tasks, effectively addressing the insufficient context representation of named entity recognition. Such sequence labeling methods can effectively capture context information within a sentence, but because they make little use of information about the entity categories during training, they depend heavily on the model's feature extraction capability, are hard to learn, and recognize entities poorly.
In recent years, researchers have proposed converting the named entity recognition task into a Machine Reading Comprehension (MRC) task, with good results. In 2020, Li et al. converted entity recognition into span extraction by designing entity-category query statements that introduce prior semantic knowledge of the entity categories, and used a machine reading comprehension framework to learn the interaction between the text and the query statements, achieving high accuracy on named entity recognition. Xue et al., following this reading-comprehension-based framework, used various kinds of weakly supervised data to assist model training and incorporate entity knowledge into a pre-trained language model. Compared with traditional sequence labeling, MRC-based methods encode important prior information about the entity categories in the question, making similar classification labels easier to distinguish; however, they still model only at the sentence level, ignore semantic information across sentences, and easily produce inconsistent entity labels in different sentences.
Disclosure of Invention
To address these problems, the invention extends sentence-level named entity recognition to text-level named entity recognition and proposes MRC-NER, a BiLSTM-BiDAF named entity recognition model based on machine reading comprehension. First, full-text context information is acquired using NEZHA, and local dependency information is further captured by a BiLSTM. Second, a Bidirectional Attention Flow (BiDAF) mechanism is introduced to learn semantic associations between the text and entity categories. Finally, when predicting entity positions, a boundary detector based on a gating mechanism is designed to fully exploit the constraint of the entity start position on the end position, and an answer-quantity detector is established to identify unanswerable questions.
The method comprises the following steps:
step 1: and constructing an entity category query statement containing semantic prior information according to the entity category to be identified in the text, constructing a data set into a form conforming to machine reading understanding frame input through data preprocessing, wherein each piece of data contains the text, the entity category query statement and the starting and ending positions of the type of entity in the text.
Step 2: and extracting the semantic information of the text and the prior information contained in the problem. And using the NEZHA pre-training language model as an embedding layer to extract global features of the text and the problem to obtain a character embedding vector of the text and the problem. And local features are further extracted from the text and the character embedding vectors of the problems through the BilSTM, so that the problem that the pre-training language model is insufficient in local dependence information capturing capacity is solved.
Step 3: use a bidirectional attention mechanism to learn semantic associations between the text and entity categories. A similarity matrix between the text and the question is computed; based on it, text-to-question and question-to-text attention representation vectors are calculated, the text is fully fused with the semantic prior information of the entity categories, and the result is fed to the answer prediction layer.
Step 4: predict the positions of entities in the text. Using a half-pointer, half-tagging strategy, the probability that each position of the fused interaction vector from Step 3 is an entity start position is computed position by position. To exploit the constraint of the entity start position on the end position, a boundary detector based on a gating mechanism dynamically fuses the start-position probability distribution with the interaction vector, so that the model predicts end positions more accurately. The outputs of the two classifiers for start and end positions are matched by nearest index order, marking the exact spans of the entities.
Step 5: to reduce extraction errors when the text contains no entities of some type, an answer-quantity detector is established to identify unanswerable questions. A weighted text representation is computed from the predicted entity start/end position probabilities and the interaction vector, concatenated with the [CLS] special token representation of the pre-trained language model, and a classifier is trained to effectively identify questions without answers.
Compared with the prior art, the invention has the following beneficial effects:
1. The invention uses NEZHA as the embedding layer to extract global features of the text and the question, obtaining character embedding vectors for both. This not only effectively resolves polysemy but also endows the character embedding vectors with rich context information and entity boundary information, enhancing the model's extraction of the text's global semantics. The invention further encodes the text and question character embedding vectors with a BiLSTM, which compensates for the pre-trained language model's limited capture of local dependency information; by assigning larger weights to adjacent words, it improves the model's capture of local text features, gives each word richer context information, and strengthens the model's semantic representation of the text.
2. A bidirectional attention mechanism is introduced to learn semantic associations between the text and entity categories: through text-to-question attention, the text is fully fused with the prior information contained in the question, and through question-to-text attention, the question attends to the category-related key information in the text.
3. The boundary detector based on a gating mechanism fully considers the temporal and logical relations between entity boundaries, so that the model predicts entity positions more accurately and the accuracy of named entity recognition improves. Meanwhile, an answer-quantity detector identifies unanswerable questions, effectively handling texts that contain no entities of certain types.
Drawings
FIG. 1 is a schematic diagram of the overall structure of the BiLSTM-BiDAF named entity recognition model based on machine reading comprehension.
FIG. 2 is a network architecture diagram of the embedding layer using the NEZHA pre-trained language model.
FIG. 3 is a network structure diagram of the BiLSTM.
FIG. 4 is a schematic diagram of the bidirectional attention mechanism.
FIG. 5 is a network architecture diagram of the gating-based answer boundary detector.
FIG. 6 is a diagram of the overall network structure of the answer prediction layer.
FIG. 7 is a network structure diagram of the answer-quantity detector.
FIG. 8 is a data processing flow chart.
Detailed Description
The invention is further described with reference to the following figures and detailed description.
Text contains rich context information. To learn the semantic associations between sentences and characters in the text and to establish a deep interaction between the text and entity categories, the invention models and solves the named entity recognition task within a machine reading comprehension framework. Concretely, the framework comprises three steps: given the context, ask a relevant question, and label the answer position. Based on this framework, the invention first designs entity-category query statements containing semantic prior information according to the entity categories to be identified, then builds a model that encodes the semantic information of the text and the prior information in the question and learns the complex interaction between the text and the entity-category query statements, and finally locates the entities in the text.
To realize the above, the invention constructs the machine-reading-comprehension-based BiLSTM-BiDAF named entity recognition model MRC-NER shown in FIG. 1. The model has four layers: 1) an embedding layer, which uses the pre-trained language model NEZHA to obtain global semantic information and produce character embedding representations of the text and the question; 2) an encoding layer, which uses a BiLSTM to encode local features and compensate for the pre-trained language model's limited capture of local dependency information; 3) an interaction layer, which introduces a bidirectional attention mechanism to fuse the interaction information between the text and entity categories; 4) an answer prediction layer, which predicts the start and end positions of entities in the text.
I. Embedding layer
Step 1: the invention converts the named entity recognition task into a machine reading comprehension problem. Query statements containing semantic prior information are constructed for the entity types to be identified in the text, and data preprocessing converts the data into the input form of the machine reading comprehension framework. To embed more semantic prior knowledge in each entity-category query statement, the invention incorporates a description of the entity type and some simple examples when constructing it. Taking the CCKS2020 dataset as an example, Table 1 shows the questions the invention constructs for the different entity types of this dataset; a construction sketch follows the table.
Table 1. Question construction table
(table available only as an image in the original publication)
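For illustration, the following minimal Python sketch (not the patent's published code; the category names, query wordings, and offsets are hypothetical placeholders) shows how Step 1 can turn one annotated text into one MRC-style sample per entity category, with unanswerable samples for categories absent from the text:

```python
# Hypothetical entity-category queries with type descriptions and examples,
# in the spirit of Table 1 (the actual wordings are not reproduced here).
QUERIES = {
    "disease": "Find all disease entities, i.e. names of illnesses or "
               "diagnoses, for example 'diabetes' or 'hypertension'.",
    "drug":    "Find all drug entities, i.e. names of medicines, "
               "for example 'aspirin' or 'metformin'.",
}

def build_mrc_samples(text, annotations):
    """Turn one annotated text into one MRC sample per entity category.

    annotations: {category: [(start, end), ...]} character offsets.
    Categories absent from the text yield unanswerable samples.
    """
    samples = []
    for category, query in QUERIES.items():
        spans = annotations.get(category, [])
        samples.append({
            "context": text,
            "query": query,
            "start_positions": [s for s, _ in spans],
            "end_positions": [e for _, e in spans],
            "has_answer": len(spans) > 0,   # for the answer-quantity detector
        })
    return samples
```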
Step 2: the NEZHA pre-trained language model is used as the embedding layer to extract global features from the semantic information of the text and the prior information contained in the question.
To learn the intrinsic correlations between different sentences in the text and to strengthen the model's representation of contextual semantic information, the invention uses the NEZHA pre-trained language model as the embedding layer to extract global features of the text and the question and obtain their character embedding vectors. NEZHA is an open-source pre-trained language model from Huawei's Noah's Ark Lab and an improvement on BERT: because it uses functional relative position encoding and whole-word masking in the pre-training stage, using NEZHA as the embedding layer gives the character embedding vectors rich context information and entity boundary information and enhances the model's extraction of the text's global semantics.
The invention uses the NEZHA pre-trained language model shown in FIG. 2 as the embedding layer to perform character-embedding encoding of the text and the question. Let X = {c_1, c_2, …, c_n} be a given piece of text, where n is the length of the text, and let Q_y = {q_1, q_2, …, q_m} be an entity-category query statement constructed for the text, where Y is the set of entity categories, Q_y denotes the predefined query for category y ∈ Y, q_i denotes the i-th character, and m is the length of the question. Taking the query statement Q_y = {q_1, q_2, …, q_m} and the text X = {c_1, c_2, …, c_n} as input, the question and the text are concatenated in order, using the special classification token [CLS] and the separator token [SEP], into the input sequence:

I = {[CLS]; Q; [SEP]; X; [SEP]}

Encoding this sequence with the NEZHA pre-trained language model yields the text vector representation X = {c_1, c_2, …, c_n} and the question vector representation Q = {q_1, q_2, …, q_m}. When the text length exceeds the maximum input length of the pre-trained language model, the invention splits the text with the dynamic-programming long-text segmentation algorithm proposed by Cai et al., so as to satisfy the model's length limit.
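As a hedged sketch of assembling the input sequence I = {[CLS]; Q; [SEP]; X; [SEP]}, the snippet below uses the Hugging Face tokenizer API; bert-base-chinese stands in for a NEZHA checkpoint (an assumption, since the patent does not name one), as NEZHA shares BERT's [CLS]/[SEP] conventions:

```python
from transformers import BertTokenizerFast

# Stand-in checkpoint: any NEZHA/BERT-compatible Chinese tokenizer with the
# same special tokens would serve the same role.
tokenizer = BertTokenizerFast.from_pretrained("bert-base-chinese")

query = "Find all disease entities in the text."        # Q, length m
context = "Patient diagnosed with type 2 diabetes."      # X, length n

# Passing two sequences inserts [CLS] query [SEP] context [SEP] automatically.
encoding = tokenizer(
    query, context,
    truncation="only_second",       # trim the text, never the query
    max_length=128,                 # matches the reported max length
    return_offsets_mapping=True,    # map tokens back to character spans
)
print(tokenizer.convert_ids_to_tokens(encoding["input_ids"])[:8])
```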
II. Encoding layer
Step 3: a BiLSTM extracts local features from the character embedding vectors of the text and the question, compensating for the pre-trained language model's limited capture of local dependency information.
Through the embedding layer, the model obtains the global semantic information of the text; for the named entity recognition task, however, the influence of the preceding and following characters on the current character is also important. To effectively capture the context that characters depend on within sentences, the invention further uses a bidirectional long short-term memory network (BiLSTM) to encode the text and question character embedding vectors separately; by assigning larger weights to adjacent words, it improves the model's capture of local text features and gives each character richer context information.
The basic structure of the BiLSTM network used in the invention is shown in FIG. 3. The text vector X and the question vector Q obtained in Step 2 are fed separately into the BiLSTM network for feature extraction, yielding the encoded text vector H ∈ R^{2d×n} and question vector U ∈ R^{2d×m}. The network output for the t-th word of the text is:

h_t^fw = LSTM_fw(x_t, h_{t-1}^fw)
h_t^bw = LSTM_bw(x_t, h_{t+1}^bw)
h_t = [h_t^fw ; h_t^bw]

where h_t^fw denotes the hidden-layer output of the forward LSTM unit for x_t, h_t^bw denotes the hidden-layer output of the backward LSTM unit for x_t, and h_t, the output of the bidirectional LSTM network for x_t, is the concatenation of the vectors h_t^fw and h_t^bw.
Using the BiLSTM alleviates the pre-trained language model's limited capture of local dependency information. Through the embedding and encoding layers, the model extracts the global semantic information of the text, fully captures its local dependency information, and improves its semantic representation of the text.
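A minimal PyTorch sketch of the encoding layer follows: a BiLSTM re-encodes the NEZHA character embeddings to strengthen local dependencies. Hidden size and dropout follow the experimental settings reported later (hidden 256, dropout 0.3); the module itself is an assumption for illustration, not the patent's published code, and separate encoder instances for text and question are assumed.

```python
import torch
import torch.nn as nn

class BiLSTMEncoder(nn.Module):
    def __init__(self, embed_dim=768, hidden_dim=256, dropout=0.3):
        super().__init__()
        self.lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True,
                            bidirectional=True)
        self.dropout = nn.Dropout(dropout)

    def forward(self, x):                 # x: (batch, seq_len, 768)
        out, _ = self.lstm(x)             # out: (batch, seq_len, 2*256)
        return self.dropout(out)          # h_t = [h_t^fw ; h_t^bw]

H = BiLSTMEncoder()(torch.randn(2, 50, 768))   # text encoding H, 2d = 512
U = BiLSTMEncoder()(torch.randn(2, 20, 768))   # question encoding U
```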
III. Interaction layer
Step 4: a bidirectional attention mechanism is used to learn semantic associations between the text and entity categories. A similarity matrix between the text and the question is computed; based on it, text-to-question and question-to-text attention representation vectors are calculated, the text is fully fused with the semantic prior information of the entity categories, and the result is fed to the answer prediction layer.
The entity-category query statement contains rich entity prior information. To fully learn the semantic association between the text and the entity categories, the invention introduces a bidirectional attention mechanism: through text-to-question attention, the text is fully fused with the prior information contained in the question, and through question-to-text attention, the question attends to the category-related key information in the text.
First, from the text and question output vectors of the encoding layer, H ∈ R^{2d×n} and U ∈ R^{2d×m}, the similarity matrix S between the text and the question is computed with a trilinear function:

α(h, u) = W_s^T [h; u; h∘u]
S = α(H_{:n}, U_{:m}) ∈ R^{n×m}

where α is a trainable scalar function, "∘" denotes the element-wise product, ";" denotes vector row concatenation, and W_s is a trainable weight vector. H_{:n} is the n-th column vector of the text H, U_{:m} is the m-th column vector of the question encoding matrix U, and S_{nm} is the similarity between the n-th word of the text H and the m-th word of the question U.
Then the text-to-question attention representation vector is computed from the similarity matrix: each word vector in the text is characterized by a weighted sum of all the words in the question. Each row of the similarity matrix S is normalized and used for attention weighting over the question:

a_n = softmax(S_{n:})
Ũ_{:n} = Σ_m a_{nm} · U_{:m}

where S_{n:} denotes the n-th row vector of the similarity matrix.
Meanwhile, the question-to-text attention representation vector is computed from the similarity matrix. To answer the question well, the key entity information must be extracted from the text; it is unnecessary to compute the question's attention to every word of the text, only to find the text words most relevant to the question. Therefore the maximum question-relevance weight of each text word is selected to form a new vector, the weighted fusion vector h̃ for answering the question is computed, and it is copied n times into H̃ ∈ R^{2d×n} to keep the dimensions consistent:

b = softmax(max_row(S))
h̃ = Σ_n b_n · H_{:n}

where max_row denotes taking the maximum over each row of the matrix.
Finally, the text representation H from the encoding layer, the text-to-question attention representation Ũ, and the question-to-text attention representation H̃ are fused and fed together into the answer prediction layer. The fusion is:

G_{:n} = β(H_{:n}, Ũ_{:n}, H̃_{:n}) ∈ R^{8d×n}
β(h, ũ, h̃) = [h; ũ; h∘ũ; h∘h̃]
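The interaction layer is standard BiDAF attention flow; the sketch below implements the formulas above in PyTorch (a sketch under the stated shape assumptions, not the patent's code), with H: (batch, n, 2d) and U: (batch, m, 2d):

```python
import torch
import torch.nn.functional as F

def bidaf_attention(H, U, w_s):
    """w_s: trainable weight of shape (6d,), i.e. 3*(2d), for the trilinear
    score alpha(h, u) = w_s^T [h; u; h*u]."""
    n, m, d2 = H.size(1), U.size(1), H.size(2)
    Hx = H.unsqueeze(2).expand(-1, -1, m, -1)        # (b, n, m, 2d)
    Ux = U.unsqueeze(1).expand(-1, n, -1, -1)        # (b, n, m, 2d)
    S = torch.cat([Hx, Ux, Hx * Ux], dim=-1) @ w_s   # (b, n, m)

    # text-to-question: every text position attends over the question
    a = F.softmax(S, dim=2)                          # (b, n, m)
    U_tilde = a @ U                                  # (b, n, 2d)

    # question-to-text: keep only the most question-relevant text words
    b = F.softmax(S.max(dim=2).values, dim=1)        # (b, n)
    h_tilde = b.unsqueeze(1) @ H                     # (b, 1, 2d)
    H_tilde = h_tilde.expand(-1, n, -1)              # copied n times

    # fusion G = [H; U~; H*U~; H*H~] -> (b, n, 8d)
    return torch.cat([H, U_tilde, H * U_tilde, H * H_tilde], dim=-1)

H = torch.randn(2, 50, 512)            # text, n = 50, 2d = 512
U = torch.randn(2, 20, 512)            # question, m = 20
w_s = torch.randn(3 * 512)             # trainable in a real model
G = bidaf_attention(H, U, w_s)         # (2, 50, 2048), i.e. (b, n, 8d)
```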
fourthly, the method comprises the following steps: answer prediction layer
Step 5: predict the positions of entities in the text. By dynamically fusing the model's predicted start-position probability distribution with the fusion vector from the interaction layer, the temporal and logical relations between an entity's start and end positions are fully considered, so that the model predicts end positions more accurately.
First, using a half-pointer, half-tagging strategy, the probability distribution of each position of the interaction vector G from Step 4 being an answer start position is computed position by position:

P_start = softmax(W_start · G + b_start) ∈ R^{2×n}

Then a boundary detector based on a gating mechanism is designed to combine the start-position probability distribution organically with the interaction fusion vector G ∈ R^{8d×n}, after which the probability of each character being an end position is computed position by position:

g = σ(W_1 · P_start + b_1)
X_gated = g ∘ (W_3 · P_start) + (1 − g) ∘ (W_2 · G)
P_end = softmax(W_end · X_gated + b_end) ∈ R^{2×n}

where W_1, W_3 ∈ R^{d×2}, W_2 ∈ R^{d×8d}, and W_end ∈ R^{2×d}. The control gate g, computed with a sigmoid activation from the start-position prediction of the answer prediction layer, controls the degree of fusion between the start position and the interaction vector G, and X_gated denotes the vector fused with answer start-position information. When g = 0, X_gated depends only on the output vector G of the interaction layer.
Finally, the outputs of the two classifiers for the answer start and end positions are matched by nearest index order, thereby marking the exact positions of the entities.
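A hedged PyTorch sketch of the answer prediction layer with the gating-based boundary detector follows; the gated fusion equation mirrors the reconstruction above (itself inferred from the stated weight shapes), so the exact fusion may differ from the patent's implementation in detail:

```python
import torch
import torch.nn as nn

class BoundaryDetector(nn.Module):
    def __init__(self, d=256, fused_dim=8 * 256):
        super().__init__()
        self.start_clf = nn.Linear(fused_dim, 2)   # plays the role of W_start
        self.w1 = nn.Linear(2, d)                  # gate from P_start (W_1)
        self.w2 = nn.Linear(fused_dim, d)          # projects G (W_2)
        self.w3 = nn.Linear(2, d)                  # projects P_start (W_3)
        self.end_clf = nn.Linear(d, 2)             # plays the role of W_end

    def forward(self, G):                          # G: (batch, n, 8d)
        p_start = torch.softmax(self.start_clf(G), dim=-1)   # (b, n, 2)
        g = torch.sigmoid(self.w1(p_start))                  # control gate
        # gated fusion of start-position information with the interaction
        # vector; when g = 0 this reduces to a function of G alone
        x_gated = g * self.w3(p_start) + (1 - g) * self.w2(G)
        p_end = torch.softmax(self.end_clf(x_gated), dim=-1) # (b, n, 2)
        return p_start, p_end

p_start, p_end = BoundaryDetector()(torch.randn(2, 50, 2048))
```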
Step 6: to reduce extraction errors when the text does not contain entities of some type, an answer-quantity detector is established to identify unanswerable questions.
Existing machine-reading-comprehension-based named entity recognition models usually assume that, for every input entity-category query statement, entities of the corresponding type can be extracted from the text. Hence, for some entity types, the model may still extract a wrong answer even though no corresponding entity exists in the text. To address this, the invention establishes the answer-quantity detector shown in FIG. 7: weighted text representations are computed from the answer start/end position probability distributions and concatenated with the [CLS] contextualized representation vector obtained in Step 2 to identify unanswerable questions:

R_1 = G · P_{start=1}
R_2 = G · P_{end=1}
z = W_z · [R_1; R_2] + b_z
P_answer = sigmoid(W_a · [H_[CLS]; z] + b_a)
because the [ CLS ] mark encodes the global information of the sequence, the whole semantic information of the text and the question is fused, the predicted starting and stopping position contains the relevant information about the number of answers, and the problem without the answers can be effectively identified by effectively fusing the three parts of contents and training the classifier.
Step 7: to make the model fit the true results better and reduce the gap between them, the invention designs the following loss function to strengthen entity classification accuracy:
L_start = CE(P_start, Y_start)
L_end = CE(P_end, Y_end)
L_answer = CE(P_answer, Y_answer)
L = α·L_answer + β·L_start + γ·L_end,  α, β, γ ∈ [0, 1]
the model training overall loss function is formed by combining a start-stop position loss function and an answer quantity detector loss function. The three parts all use cross entropy loss function and are adjusted through three hyper-parameters of alpha, beta and gamma. By reducing the loss function of the model, the model can extract the entity when the entity of the corresponding type exists, and detect that the question of the entity type has no matching answer when the entity of the corresponding type does not exist.
Experimental equipment and environment
The experiments are based on the Python language and its third-party libraries; the deep learning environment uses the open-source PyTorch computing framework with GPU acceleration. The specific environment configuration is shown in Table 2.
Table 2. Experimental environment configuration
(table available only as an image in the original publication)
Preprocessing and parameter setting
The experimental dataset comes from the CCKS2020 evaluation task; the officially annotated training data comprise 1050 texts in total. Because this dataset is relatively small, the invention merges related data from the CCKS2019 dataset into the CCKS2020 data. After converting the original dataset into the MRC input form, 9522 question-answer pairs were obtained in total. To better understand the dataset used for modeling, Tables 3 and 4 give specific statistics of the CCKS2020 dataset used by the invention.
Table 3. CCKS2020 dataset statistics
(table available only as an image in the original publication)
Table 4. Answer statistics for the MRC form of the CCKS2020 dataset
(table available only as an image in the original publication)
The invention converts the named entity recognition task into a machine reading comprehension problem. Entity query statements containing semantic prior information are constructed for the entity types to be identified in the text, and data preprocessing converts the dataset into the input form of the machine reading comprehension framework. The specific data processing flow is shown in FIG. 8.
The experiments use the Chinese NEZHA pre-trained language model for character embedding and encoding. Compared with BERT, NEZHA uses relative position encoding and introduces whole-word masking, so it learns more semantic and boundary information. The word-vector dimension is set to 768, the maximum input length of the pre-trained language model to 128, the hidden dimension of the bidirectional long short-term memory network to 256, dropout to 0.3, warm-up to 0.1, and batch size to 128. These parameters are set for the CCKS2020 corpus and need fine-tuning for data from other corpora.
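For reproducibility, the reported hyperparameters can be collected into a single configuration (a convenience sketch; the key names are not part of the patent):

```python
CONFIG = {
    "word_vector_dim": 768,      # NEZHA embedding size
    "max_seq_length": 128,       # pre-trained model input limit
    "bilstm_hidden_dim": 256,    # per-direction hidden size
    "dropout": 0.3,
    "warmup_ratio": 0.1,
    "batch_size": 128,
}
```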
Experimental setup and evaluation index
The invention adopts precision, recall, and the F1 value (F1-measure) as evaluation criteria for the model and comprehensively evaluates the entity recognition results on medical text. The criteria are computed as follows:
1) precision:
precision = TP / (TP + FP)
2) recall:
recall = TP / (TP + FN)
3) F1 value (F1-measure):
F1 = (2 × precision × recall) / (precision + recall)
where TP is the number of positive-sample entities predicted as positive, TN the number of negative-sample entities predicted as negative, FN the number of positive-sample entities predicted as negative, and FP the number of negative-sample entities predicted as positive.
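The metrics are straightforward to compute from the confusion counts defined above; a small worked sketch with illustrative counts:

```python
def precision(tp, fp):
    return tp / (tp + fp) if tp + fp else 0.0

def recall(tp, fn):
    return tp / (tp + fn) if tp + fn else 0.0

def f1(p, r):
    return 2 * p * r / (p + r) if p + r else 0.0

# illustrative counts, not experimental results
p, r = precision(tp=90, fp=10), recall(tp=90, fn=30)
print(f"P={p:.3f} R={r:.3f} F1={f1(p, r):.3f}")  # P=0.900 R=0.750 F1=0.818
```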
Analysis of Experimental results
To verify the effect of the proposed model on the text-level named entity recognition task, NEZHA-CRF, NEZHA-IDCNN-CRF, BERT-BiLSTM-CRF, NEZHA-BiLSTM-CRF, and NEZHA-MRC-baseline are compared with the proposed MRC-NER model on the CCKS2020 Chinese electronic medical record dataset. The basic information of each model is as follows:
(1) NEZHA-CRF: a model that extracts features with the NEZHA pre-trained language model and performs sequence labeling of the character vectors with a CRF.
(2) NEZHA-IDCNN-CRF: a model that extracts features with the NEZHA pre-trained language model and an Iterated Dilated Convolutional Neural Network (IDCNN), and performs sequence labeling of the input character vectors with a CRF.
(3) BERT-BiLSTM-CRF: a model that extracts context features with the BERT pre-trained language model and a BiLSTM, and performs sequence labeling of the input character vectors with a CRF.
(4) NEZHA-BiLSTM-CRF: a model that extracts context features with the NEZHA pre-trained language model and a BiLSTM, and performs sequence labeling of the input character vectors with a CRF.
(5) NEZHA-MRC-baseline: a model that extracts features with the NEZHA pre-trained language model and, following Li et al., converts the named entity recognition task into a machine reading comprehension problem.
As can be seen from Table 5, the proposed MRC-NER (ours) method outperforms the other algorithms on all three indexes of precision, recall, and F1 value. From the results we can conclude:
(1) NEZHA-BiLSTM-CRF, which uses the NEZHA pre-trained language model to extract global features, outperforms the BERT-BiLSTM-CRF algorithm, showing that NEZHA's relative position encoding and whole-word masking give it an advantage in word-vector encoding and let it better learn the intrinsic correlations between different sentences in a text.
(2) NEZHA-BiLSTM-CRF, which uses a BiLSTM for local feature extraction, outperforms the NEZHA-CRF and NEZHA-IDCNN-CRF algorithms, because compared with IDCNN a BiLSTM can capture longer-distance semantic dependencies in both directions and better extract the internal features of the text.
(3) Comparing the MRC-NER model with the NEZHA-MRC-baseline model of Li et al. shows that the proposed model effectively captures the context information of the whole text and has a clear advantage on the text-level named entity recognition task.
(4) Comparing the MRC-NER model with the other four sequence labeling models shows that encoding the prior knowledge of the entity categories in the question with the machine reading comprehension framework enhances entity recognition.
Table 5. CCKS2020 dataset evaluation results
(table available only as an image in the original publication)
In addition, to verify more clearly the effect of each component of the method on the results, the following ablation experiments were performed; the results are shown in Table 6:
(1) NEZHA-MRC-BiLSTM adds only the BiLSTM to the baseline model; by assigning larger weights to adjacent words, it compensates for the pre-trained language model's limited capture of local dependency information and better captures the local dependencies of the text, improving the F1 value by 2.25%.
(2) NEZHA-MRC-BiDAF introduces the bidirectional attention mechanism on top of the baseline model, so that the model learns the interaction between the text and the question containing entity-category prior semantic information and extracts the information most critical to entity recognition, improving the F1 value by 0.07%.
(3) NEZHA-MRC-BiLSTM-BiDAF adds both the BiLSTM and the bidirectional attention mechanism to the baseline model; compared with the baseline, the F1 value improves by 2.77%.
(4) NEZHA-answer-border uses the gating-based boundary detector and the answer-quantity detector on top of the baseline model, improving the F1 value by 2.5%; strengthening the logical relation between boundaries when predicting entities and identifying unanswerable questions effectively improves named entity recognition.
Table 6. CCKS2020 dataset ablation experiment results
(table available only as an image in the original publication)
Meanwhile, the MRC-NER model and the baseline model NEZHA-MRC-baseline are compared on the CMeEE Chinese medical named entity recognition dataset. As shown in Table 7, the proposed model improves on all three evaluation indexes compared with the baseline model, verifying its effectiveness on different datasets.
Table 7. CMeEE dataset evaluation results
(table available only as an image in the original publication)

Claims (8)

1. A BiLSTM-BiDAF named entity recognition method based on machine reading comprehension, characterized by comprising the following steps:
step 1: constructing an entity-category query statement containing semantic prior information for each entity category to be identified in the text, and preprocessing the dataset into the input form of the machine reading comprehension framework, each data item containing the text, an entity-category query statement, and the start and end positions of entities of that category in the text;
step 2: extracting the semantic information of the text and the prior information contained in the question; using the NEZHA pre-trained language model as the embedding layer to extract global features of the text and the question, obtaining character embedding vectors for both; further extracting local features from the text and question character embedding vectors separately with a BiLSTM;
step 3: learning semantic associations between the text and entity categories with a bidirectional attention mechanism; computing a similarity matrix between the text and the question, computing text-to-question and question-to-text attention representation vectors from it, fully fusing the text with the semantic prior information of the entity categories, and feeding the result to the answer prediction layer;
step 4: predicting the positions of entities in the text; computing, with a half-pointer, half-tagging strategy, the probability that each position of the fused interaction vector from step 3 is an entity start position; considering the constraint of the entity start position on the end position, designing a boundary detector based on a gating mechanism that dynamically fuses the start-position probability distribution with the interaction vector so that the model predicts end positions more accurately; matching the outputs of the start-position and end-position classifiers by nearest index order, thereby marking the exact positions of the entities;
step 5: to reduce extraction errors when the text contains no entities of some type, establishing an answer-quantity detector to identify unanswerable questions; computing a weighted text representation from the predicted entity start/end position probabilities and the interaction vector, concatenating it with the [CLS] special token representation of the pre-trained language model, and training a classifier to effectively identify questions without answers.
2. The machine-reading-comprehension-based BiLSTM-BiDAF named entity recognition method of claim 1, characterized in that: in step 1, the named entity recognition task is converted into a machine reading comprehension problem; query statements containing semantic prior information are constructed for the entity types to be identified in the text, and data preprocessing produces the input form of the machine reading comprehension framework; when building an entity-category query statement, a description of the entity type and simple examples are incorporated.
3. The machine-reading-comprehension-based BiLSTM-BiDAF named entity recognition method of claim 1, characterized in that: in step 2, the NEZHA pre-trained language model is used as the embedding layer to extract global features from the semantic information of the text and the prior information contained in the question; to learn the intrinsic correlations between different sentences in the text and enhance the model's representation of contextual semantic information, the NEZHA pre-trained language model is used as the embedding layer to extract global features of the text and the question and obtain their character embedding vectors;
the NEZHA pre-trained language model is used as the embedding layer to perform character-embedding encoding of the text and the question; let X = {c_1, c_2, …, c_n} be a given piece of text, where n is the length of the text, and Q_y = {q_1, q_2, …, q_m} be an entity-category query statement constructed for the text, where Y is the set of entity categories, Q_y denotes the predefined query for category y ∈ Y, q_i denotes the i-th character, and m is the length of the question; taking the query statement Q_y = {q_1, q_2, …, q_m} and the text X = {c_1, c_2, …, c_n} as input, the question and the text are concatenated in order, using the special classification token [CLS] and the separator token [SEP], into the input sequence:

I = {[CLS]; Q; [SEP]; X; [SEP]}

encoding this sequence with the NEZHA pre-trained language model yields the text vector representation X = {c_1, c_2, …, c_n} and the question vector representation Q = {q_1, q_2, …, q_m}.
4. The machine-reading-comprehension-based BiLSTM-BiDAF named entity recognition method of claim 1, characterized in that: in step 3, a BiLSTM extracts local features from the text and question character embedding vectors separately, compensating for the pre-trained language model's limited capture of local dependency information;
through the embedding layer the model acquires the global semantic information of the text, while for the named entity recognition task the influence of the preceding and following characters on the current character is very important; to effectively acquire the context that characters depend on within sentences, a bidirectional long short-term memory network (BiLSTM) encodes the text and question character embedding vectors separately, assigning larger weights to adjacent words, improving the model's capture of local text features, and giving each word richer context information;
using the BiLSTM alleviates the pre-trained language model's limited capture of local dependency information; through the embedding and encoding layers, the model extracts the global semantic information of the text, fully captures its local dependency information, and improves its semantic representation of the text.
5. The machine-reading-comprehension-based BiLSTM-BiDAF named entity recognition method of claim 1, characterized in that: in step 4, semantic associations between the text and entity categories are learned with a bidirectional attention mechanism; a similarity matrix between the text and the question is computed, text-to-question and question-to-text attention representation vectors are computed from it, and the text, fully fused with the semantic prior information of the entity categories, is fed to the answer prediction layer;
the entity-category query statement contains rich entity prior information, and a bidirectional attention mechanism is introduced to fully learn the semantic association between the text and the entity categories; through text-to-question attention the text is fully fused with the prior information contained in the question, and through question-to-text attention the question attends to the category-related key information in the text.
6. The machine-reading-comprehension-based BiLSTM-BiDAF named entity recognition method of claim 1, characterized in that: in step 5, the positions of entities in the text are predicted; by dynamically fusing the model's predicted start-position probability distribution with the fusion vector from the interaction layer, the temporal and logical relations between the entity start and end positions are fully considered, so that the model predicts end positions more accurately;
first, the probability distribution of each position of the interaction vector G from step 4 being an answer start position is computed with a half-pointer, half-tagging strategy; then a boundary detector based on a gating mechanism is designed that organically combines the start-position probability distribution with the interaction fusion vector and computes, position by position, the probability of each character being an end position, using a sigmoid activation on the answer prediction layer's start-position prediction to control the degree of fusion between the start position and the interaction vector G; the outputs of the start-position and end-position classifiers are matched by nearest index order, thereby marking the exact positions of the entities.
7. The machine-reading-comprehension-based BiLSTM-BiDAF named entity recognition method of claim 1, characterized in that: to reduce extraction errors when the text does not contain entities of some type, an answer-quantity detector is established to identify unanswerable questions; weighted text representations are computed from the answer start/end position probability distributions and concatenated with the [CLS] contextualized representation vector obtained in step 2 to identify unanswerable questions.
8. The machine-reading-comprehension-based BiLSTM-BiDAF named entity recognition method of claim 1, characterized in that: a loss function is designed to enhance entity classification accuracy:
L_start = CE(P_start, Y_start)
L_end = CE(P_end, Y_end)
L_answer = CE(P_answer, Y_answer)
L = α·L_answer + β·L_start + γ·L_end,  α, β, γ ∈ [0, 1]
the overall training loss combines the start/end position loss functions with the answer-quantity detector loss; all three parts use the cross-entropy loss and are balanced by the three hyperparameters α, β, and γ; by minimizing the loss, the model extracts entities when entities of the corresponding type exist and detects that a question has no matching answer when they do not.
CN202210052780.9A 2022-01-18 2022-01-18 BiLSTM-BiDAF named entity recognition method based on machine reading comprehension Pending CN114492441A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210052780.9A CN114492441A (en) 2022-01-18 2022-01-18 BiLSTM-BiDAF named entity recognition method based on machine reading comprehension

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210052780.9A CN114492441A (en) 2022-01-18 2022-01-18 BiLSTM-BiDAF named entity recognition method based on machine reading comprehension

Publications (1)

Publication Number Publication Date
CN114492441A true CN114492441A (en) 2022-05-13

Family

ID=81512824

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210052780.9A Pending CN114492441A (en) 2022-01-18 2022-01-18 BiLSTM-BiDAF named entity recognition method based on machine reading comprehension

Country Status (1)

Country Link
CN (1) CN114492441A (en)


Cited By (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116150698A (en) * 2022-09-08 2023-05-23 天津大学 Automatic DRG grouping method and system based on semantic information fusion
CN116150698B (en) * 2022-09-08 2023-08-22 天津大学 Automatic DRG grouping method and system based on semantic information fusion
CN115422939A (en) * 2022-10-14 2022-12-02 重庆邮电大学 Fine-grained commodity named entity identification method based on big data
CN115422939B (en) * 2022-10-14 2024-04-02 Yami Technology (Guangzhou) Co., Ltd. Fine-grained commodity named entity identification method based on big data
CN115879474A (en) * 2023-02-14 2023-03-31 华东交通大学 Fault nested named entity identification method based on machine reading understanding
CN117077682A (en) * 2023-05-06 2023-11-17 西安公路研究院南京院 Document analysis method and system based on semantic recognition
CN116756624A (en) * 2023-08-17 2023-09-15 中国民用航空飞行学院 Text classification method for civil aviation supervision item inspection record processing
CN116756624B (en) * 2023-08-17 2023-12-12 中国民用航空飞行学院 Text classification method for civil aviation supervision item inspection record processing


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination