CN115630647A

CN115630647A - Entity and relation tandem type extraction method for text data

Info

Publication number: CN115630647A
Application number: CN202211328790.7A
Authority: CN
Inventors: 乔海洋; 白洁; 张雨晨; 刘红英; 马国梁; 尚凡华
Original assignee: Tenth Research Institute Of Telecommunications Technology Co ltd
Current assignee: Tenth Research Institute Of Telecommunications Technology Co ltd
Priority date: 2022-10-27
Filing date: 2022-10-27
Publication date: 2023-01-20

Abstract

The invention provides an entity and relation tandem extraction method for text data. The text is first preprocessed and its word vectors are extracted. The word vectors are then input into a BiLSTM-Softmax text entity extraction model to obtain text with entity pairs, and ambiguous text undergoes disambiguation. Finally, the text with entity pairs is input into a relation discrimination model, and the relation category of the entities is obtained through a relation classification model. The method performs tandem entity and relation extraction on text data, supports long-text processing, and enriches semantics through entity disambiguation and coreference resolution, obtaining better entity extraction and relation extraction results.

Description

Entity and relation tandem type extraction method for text data

Technical Field

The invention belongs to the technical field of information extraction, and in particular relates to an entity and relation tandem extraction method for text data.

Background

In information extraction, entity extraction and entity relation extraction play important roles. The entity extraction task extracts the atomic information elements in a text, such as person names, organization names, geographic locations, event names and times. The entity relation extraction task extracts the relation categories among the entities in a text, such as parent-child, romantic partner and superior-subordinate relations among people.

For the entity extraction task, rule-based methods and model-based methods are common. (1) Rule-based methods target entities with a distinctive context, or text in which the entities themselves have many characteristic features; for such cases they are simple and effective. As the corpus grows, however, the cases to be handled become more complex, rules may conflict with one another, and the whole system may become unmaintainable. Rule-based methods are therefore better suited to extraction from semi-structured or fairly standardized text, where they can achieve a certain effect when combined with business requirements. (2) Model-based methods include traditional models such as the hidden Markov model and the conditional random field (CRF), and deep learning models such as RNN (Recurrent Neural Network), LSTM (Long Short-Term Memory) and BiLSTM (Bidirectional Long Short-Term Memory), which perform well on NLP (Natural Language Processing) tasks. Compared with traditional models, an RNN can take long-range context into account and avoids the feature selection problem of the CRF; the key points are network design and parameter tuning. However, an RNN generally needs a larger training dataset, whereas a CRF performs well on small datasets. Combining BiLSTM with CRF gives better results, because it exploits the respective advantages of the two models.

For the text entity relation extraction task, the common methods are template-based methods, supervised learning methods, and semi-supervised/unsupervised learning methods. (1) Template-based methods include methods based on trigger words or character strings and methods based on dependency syntax. They offer high accuracy and can be customized for a specific domain, but they are difficult to maintain and port poorly. (2) Supervised learning methods include machine learning methods and deep learning methods. Machine learning methods such as MaxEnt (maximum entropy model), naive Bayes and SVM (Support Vector Machine) usually train two classifiers: the first judges whether two entities have a relation, and the second judges the relation type. Deep learning methods are further divided into Pipeline methods (entity recognition and relation classification are two independent processes, and relation recognition depends on the quality of entity recognition) and Joint Model methods (entity recognition and relation classification are optimized jointly). Pipeline methods include CR-CNN (a convolutional neural network that classifies relations by ranking), attention-based convolutional neural networks, Att-BLSTM (attention-based bidirectional long short-term memory network) and the like; these methods propagate errors from one stage to the next, whereas end-to-end methods are easier to optimize. Supervised learning methods place high demands on the dataset, their models are fragile, their generalization ability is limited, and they are difficult to extend to new relations; in addition, a large training set is costly to obtain. (3) Semi-supervised/unsupervised learning methods learn mainly from a small amount of labeled information, and are represented by Bootstrap-based methods and distant supervision methods. Bootstrap-based methods take a small number of instances as an initial seed set, learn patterns from them, and then iterate: instances are extracted from unstructured data, new patterns are learned from the newly extracted instances to expand the pattern set, and new potential relation triples are searched for and discovered. Distant supervision methods automatically construct a large amount of training data by aligning a knowledge base with unstructured text, which reduces the model's dependence on manually annotated data and improves its cross-domain adaptability. These methods are cheap to build and suit large-scale construction, but they are sensitive to the initial seeds, their results have low accuracy, and no confidence is computed for each result. Moreover, because unsupervised learning methods extract new relations, precision and recall cannot be computed exactly; they can only be estimated by randomly sampling relations from the result set and checking them manually.

Existing information extraction tasks cover many kinds of data, such as online medical entity extraction, named entity extraction from electronic medical records, named entity recognition for Chinese scenic spots, and entity relation extraction in the military field. Text data is an indispensable part of the public's daily exchange of information and contains a large number of valuable entities and relations among them. Extracting entities and relations from text data enriches personal information databases and is an important component in building a complete and clear relation network.

The BiLSTM-CRF entity extraction model is a typical existing text entity extraction model. It treats extraction as a sequence labeling task: the input is the text sequence to be processed, and the output is the probability that each character in the sequence belongs to each label type. The model consists of an embedding layer, several BiLSTM (Bidirectional Long Short-Term Memory) layers and a CRF (Conditional Random Field) layer. Processing proceeds as follows: each character or word in the input sequence is represented as a vector (character or word embedding) and fed into the BiLSTM layers, which output the probability that each character or word belongs to each label type; these probabilities are fed into the CRF layer, which corrects them according to the learned label transition relations and outputs the optimal label sequence; finally, the numeric output symbols are converted into human-readable label types.
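The correction performed by the CRF layer can be illustrated with Viterbi decoding, the standard way to find the best label sequence given per-position scores and label transition scores. The sketch below is a minimal illustration, not the patent's implementation: the function name `viterbi_decode` and the arguments `emissions` and `transitions` are hypothetical, and scores are assumed to be additive (for example log-probabilities).

```python
from typing import List


def viterbi_decode(emissions: List[List[float]],
                   transitions: List[List[float]]) -> List[int]:
    """Return the highest-scoring label sequence.

    emissions[t][j]: score of label j at position t (e.g. from the BiLSTM).
    transitions[i][j]: learned score of moving from label i to label j.
    Both are illustrative inputs; scores are assumed to be additive.
    """
    if not emissions:
        return []
    num_labels = len(emissions[0])
    score = list(emissions[0])
    backpointers = []
    for t in range(1, len(emissions)):
        new_score, pointers = [], []
        for j in range(num_labels):
            # Best previous label for ending at label j in position t.
            best_prev = max(range(num_labels),
                            key=lambda i: score[i] + transitions[i][j])
            new_score.append(score[best_prev] + transitions[best_prev][j]
                             + emissions[t][j])
            pointers.append(best_prev)
        score = new_score
        backpointers.append(pointers)
    # Trace the best path backwards from the best final label.
    path = [max(range(num_labels), key=lambda j: score[j])]
    for pointers in reversed(backpointers):
        path.append(pointers[path[-1]])
    path.reverse()
    return path
```

With all transition scores at zero the result equals the per-position best label; a strongly negative transition score makes the decoder avoid that label change even when the per-position score favours it. The inner loop compares every pair of adjacent labels at every position, which is the extra work a CRF layer adds at inference time.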

The BiLSTM-Attention relation extraction model is a typical existing text relation extraction model. Its input is a text sequence in which an entity pair is marked with position information, and its output is the relation category of the two entities. The model consists of a character/word embedding layer, several BiLSTM layers (usually one) and an Attention layer. Processing proceeds as follows: a sequence with entity position information is input (for example, "<e1>Xiaoming</e1> likes going to the market to buy <e2>pencils</e2>", where <e1>, </e1>, <e2> and </e2> are four position indicators that mark the start and end of each entity); each character or word in the sequence is represented as a vector (character or word embedding) and fed into the BiLSTM layer, which outputs high-level feature information for each character or word; the Attention layer generates a weight vector, which is multiplied with the word-level features of each time step to merge them into a sentence-level feature vector; finally, the sentence-level feature vector is used for relation classification.
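The position indicators described above can be produced with a small helper. This is a sketch under stated assumptions: the function name `mark_entity_pair` and the use of (start, end) character offsets with an exclusive end are illustrative choices, not something the patent specifies.

```python
from typing import Tuple

Span = Tuple[int, int]  # (start, end) character offsets, end exclusive


def mark_entity_pair(text: str, e1: Span, e2: Span) -> str:
    """Wrap two non-overlapping spans in <e1>...</e1> and <e2>...</e2>."""
    first, second = sorted([e1, e2])
    if first[1] > second[0]:
        raise ValueError("entity spans must not overlap")
    # Insert markers from right to left so earlier offsets stay valid.
    for (start, end), tag in sorted([(e1, "e1"), (e2, "e2")], reverse=True):
        text = f"{text[:start]}<{tag}>{text[start:end]}</{tag}>{text[end:]}"
    return text


sentence = "Xiaoming likes going to the market to buy pencils"
start = sentence.index("pencils")
marked = mark_entity_pair(sentence, (0, 8), (start, start + len("pencils")))
```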

Both models have problems when applied directly to text data: (1) Text data is produced at a scale of hundreds of millions of items per day. When the BiLSTM-CRF entity extraction model processes such massive text data, the CRF layer spends a large amount of inference time computing the optimal label sequence, so inference efficiency is extremely low and the model is unsuitable for inference over very large volumes of data. (2) The BiLSTM-Attention model cannot itself generate the entity pairs to be classified. The BiLSTM-CRF entity model can generate candidate entities, but one text may contain several entities, and only a small proportion of the possible entity pairs actually have a relation. For example, 4 entities give 12 ordered pairs, of which perhaps only one has a real relation. If the extracted entities are combined directly into candidate pairs and sent to the relation extraction model, the unbalanced data distribution greatly degrades the model. (3) The input sequence carries only the position information of the entities, not their category information, so BiLSTM-Attention cannot learn entity categories and may produce relations that violate common sense, such as classifying "Xiaoming" and "car" as a family relation. If the category "person name" of the entity "Xiaoming" and the category "vehicle" of the entity "car" were input into the relation extraction model together with the text, the result should improve.
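The count of 12 combinations for 4 entities corresponds to ordered pairs (4 × 3), since the direction of a relation matters. A short sketch makes the resulting class imbalance explicit; the entity names are placeholders, and the single positive pair is the hypothetical case from the text.

```python
from itertools import permutations

# Placeholder entities extracted from one text.
entities = ["A", "B", "C", "D"]

# Every ordered (head, tail) pair of distinct entities is a candidate.
pairs = list(permutations(entities, 2))

# If only one candidate pair has a real relation, positives are rare.
positive_ratio = 1 / len(pairs)

print(len(pairs), f"{positive_ratio:.1%}")
```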

Disclosure of Invention

To overcome the shortcomings of the prior art, the invention provides an entity and relation tandem extraction method for text data. The text is first preprocessed and its word vectors are extracted. The word vectors are then input into a BiLSTM-Softmax text entity extraction model to obtain text with entity pairs, and ambiguous text undergoes disambiguation. Finally, the text with entity pairs is input into a relation discrimination model, and the relation category of the entities is obtained through a relation classification model. The invention performs tandem entity and relation extraction on text data, supports long-text processing, and enriches semantics through entity disambiguation and coreference resolution, obtaining better entity extraction and relation extraction results.

An entity and relation tandem extraction method for text data, characterized by comprising the following steps:

Step 1: inputting a text, and if the text contains more than 500 characters, segmenting it into pieces of 500 characters to obtain a plurality of texts;

Step 2: inputting each segmented text into a word vector model to obtain the word vectors corresponding to the text; the word vector model is trained with the word2vec method on collected Chinese and English corpora, yielding a Chinese word vector model and an English word vector model;

Step 3: inputting the text word vectors into a text entity extraction model for entity extraction, and outputting text with entity pairs; the text entity extraction model is a BiLSTM-Softmax entity extraction model trained on a sequence dataset with position information, wherein the BiLSTM-Softmax entity extraction model comprises a bidirectional LSTM layer and a Softmax layer, and is an improved model obtained by replacing the CRF layer of the original BiLSTM-CRF model with a Softmax layer;

Step 4: judging whether the text is ambiguous, and if not, going to step 5; otherwise, inputting the text into an entity disambiguation model for disambiguation, then performing coreference resolution with a trained coreference resolution model to obtain the disambiguated text, and going to step 5; the entity disambiguation model is trained on a Chinese Short Text dataset and adopts a BERT-CNN network architecture comprising a BERT layer, a convolutional layer and a Softmax layer; the coreference resolution model is a SpanBERT coreference resolution model trained on the open OntoNotes 5.0 dataset annotated by the LDC, wherein the SpanBERT coreference resolution model comprises a transformer layer, a fully connected layer and a Softmax layer;

Step 5: inputting the unambiguous or disambiguated text obtained in step 4 into an entity relation discrimination model, and obtaining the relation category of the entities through an entity relation classification model; the entity relation discrimination model is a BiLSTM-Attention model trained on a sequence dataset with entity position information and entity category information, wherein the BiLSTM-Attention model comprises a BiLSTM layer and an Attention layer; the entity relation classification model is a BiLSTM-Attention model trained on a sequence dataset containing only entity position information.
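Step 1 above can be sketched as a fixed-length split. The function name `segment_text` is hypothetical, and returning an order index with each piece is a choice made here for illustration so that later results can be put back in sequence.

```python
from typing import List, Tuple


def segment_text(text: str, max_len: int = 500) -> List[Tuple[int, str]]:
    """Split text into consecutive pieces of at most max_len characters.

    Each piece is returned together with its order index.
    """
    if max_len <= 0:
        raise ValueError("max_len must be positive")
    if len(text) <= max_len:
        return [(0, text)]
    return [(index, text[start:start + max_len])
            for index, start in enumerate(range(0, len(text), max_len))]
```

A fixed-length cut can split an entity mention across two pieces; the text does not say how such boundaries are handled, so this sketch leaves them as they fall.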

The beneficial effects of the invention are as follows. Because the CRF layer of the BiLSTM-CRF model is replaced by a Softmax layer, processing speed and computational efficiency are greatly improved while high text entity extraction precision is retained. Because an entity disambiguation step is added, the ambiguity of Chinese entities is effectively handled, and the accuracy with which text entities express their subject is improved in practice. Because entity category information is added to the input of the BiLSTM-Attention model in the entity relation discrimination model, the model takes entity categories into account when learning relations between entities, which safeguards the accuracy of relation discrimination. Finally, the tandem entity and relation extraction scheme standardizes the data hand-off between the different models and strengthens the capacity to process massive text data; disambiguated entities express semantic information better, and the accuracy and plausibility of the entity extraction and relation extraction results are improved to a great extent.
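The speed benefit of the Softmax layer can be illustrated as follows: each position is labeled independently by taking the most probable label, which costs on the order of T × K operations for T positions and K labels, whereas CRF decoding with the Viterbi algorithm compares every pair of adjacent labels and costs on the order of T × K². The sketch below is a minimal illustration with hypothetical names (`softmax`, `softmax_decode`, `emissions`); it is not the patent's implementation.

```python
import math
from typing import List


def softmax(scores: List[float]) -> List[float]:
    """Numerically stable softmax over one position's label scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def softmax_decode(emissions: List[List[float]]) -> List[int]:
    """Pick the most probable label at each position independently."""
    labels = []
    for scores in emissions:
        probs = softmax(scores)
        labels.append(max(range(len(probs)), key=probs.__getitem__))
    return labels
```

Because no transition scores are consulted, this decoding can emit label sequences that a CRF would rule out, which is one plausible reason for a small loss in F1 in exchange for speed.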

Drawings

FIG. 1 is a flow chart of the entity and relation tandem extraction method for text data according to the present invention.

Detailed Description

The present invention is further described below with reference to the drawing and embodiments; the invention includes, but is not limited to, the following embodiments.

As shown in FIG. 1, the present invention provides an entity and relation tandem extraction method for text data, which is implemented as follows:

Step 1: inputting a text and checking the length of the text sequence; if the text contains more than 500 characters, segmenting it into pieces of 500 characters to obtain a plurality of texts, and recording the order of the texts;

Step 2: inputting each segmented text into a word vector model to obtain the word vectors corresponding to the text; the word vector model is trained on Chinese and English corpora with the word2vec method, and comprises a Chinese word vector model and an English word vector model;

Step 3: inputting the text word vectors into a text entity extraction model for entity extraction, and outputting text with entity pairs; the text entity extraction model is a BiLSTM-Softmax entity extraction model trained on a sequence dataset with position information, wherein the BiLSTM-Softmax entity extraction model comprises a bidirectional LSTM layer and a Softmax layer and is an improved model obtained by replacing the CRF layer of the original BiLSTM-CRF model with a Softmax layer;

Step 4: judging whether the text is ambiguous, and if not, going to step 5; otherwise, inputting the text into an entity disambiguation model for disambiguation, then performing coreference resolution with a trained coreference resolution model to obtain the disambiguated text, and going to step 5; the entity disambiguation model is trained on the Chinese Short Text dataset provided by CCKS 2019 (China Conference on Knowledge Graph and Semantic Computing) and adopts a BERT-CNN network architecture comprising a BERT layer, a convolutional layer and a Softmax layer; the coreference resolution model is a SpanBERT coreference resolution model trained on the open OntoNotes 5.0 dataset annotated by the LDC, wherein the SpanBERT coreference resolution model comprises a transformer layer, a fully connected layer and a Softmax layer;

Step 5: inputting the unambiguous or disambiguated text obtained in step 4 into an entity relation discrimination model, and obtaining the relation category of the entities through an entity relation classification model; the entity relation discrimination model is a BiLSTM-Attention model trained on a sequence dataset with entity position information and entity category information, wherein the BiLSTM-Attention model is a BiLSTM network with an added Attention mechanism and comprises a BiLSTM layer and an Attention layer; the entity relation classification model has the same structure as the entity relation discrimination model, but its training dataset contains only sequences with entity position information, that is, it is a BiLSTM-Attention model trained on a sequence dataset containing only entity position information.
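The control flow of the five steps can be sketched with each trained model passed in as a callable. This is one possible reading of the steps, not the patent's implementation: every function and parameter name below is hypothetical, the word vector lookup of step 2 is assumed to happen inside `extract_pairs`, and the discrimination model is assumed to act as a filter on candidate pairs before classification.

```python
from typing import Callable, List, Tuple

Pair = Tuple[str, str]


def tandem_extract(
    text: str,
    extract_pairs: Callable[[str], List[Pair]],   # steps 2-3 (hypothetical)
    is_ambiguous: Callable[[str], bool],          # step 4 check
    disambiguate: Callable[[str], str],           # step 4 disambiguation
    resolve_coreference: Callable[[str], str],    # step 4 coreference
    has_relation: Callable[[str, Pair], bool],    # step 5 discrimination
    classify: Callable[[str, Pair], str],         # step 5 classification
    max_len: int = 500,
) -> List[Tuple[str, str, str]]:
    """Run the steps in order and return (head, relation, tail) triples."""
    # Step 1: cut long input into pieces of at most max_len characters.
    pieces = [text[i:i + max_len] for i in range(0, len(text), max_len)] or [text]
    triples = []
    for piece in pieces:
        # Steps 2-3: embed the piece and extract candidate entity pairs.
        pairs = extract_pairs(piece)
        # Step 4: only ambiguous text is disambiguated and resolved.
        if is_ambiguous(piece):
            piece = resolve_coreference(disambiguate(piece))
        # Step 5: keep pairs judged to be related, then assign a category.
        for pair in pairs:
            if has_relation(piece, pair):
                triples.append((pair[0], classify(piece, pair), pair[1]))
    return triples
```

Passing the models in as callables keeps the hand-off between stages explicit, which mirrors the standardized data linkage between models that the tandem design aims for.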

To verify the effectiveness of the invention, the BiLSTM-Softmax entity extraction model of the invention is compared with the original BiLSTM-CRF model. The parameters of the BiLSTM-CRF model are set as follows: learning_rate = 0.001, batch_size = 32, embedding_size = 128, hidden_size = 200, num_layers = 2, dropout = 0.2, where embedding_size is the dimension of the character/word vectors, hidden_size is the hidden layer dimension, and num_layers is the number of LSTM layers. The parameters of the BiLSTM-Softmax model are set as follows: learning_rate = 0.001, batch_size = 32, embedding_size = 128, hidden_size = 200, num_layers = 1, dropout = 0.2. The two models are trained and evaluated on a real dataset; their inference speed and F1 score (a metric that balances precision and recall; the higher, the better the model) are shown in Table 1. The inference speed of the BiLSTM-Softmax model of the invention is greatly improved, from 25 to 700 items per second, while F1 falls only slightly, from 0.8013 to 0.7952.

TABLE 1

Model	F1	Inference speed (items/s)
BiLSTM-CRF model	0.8013	25
BiLSTM-Softmax model	0.7952	700
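The trade-off in Table 1 can be quantified with a few lines of arithmetic. The figures are copied from the table; the variable names are illustrative only.

```python
# Figures copied from Table 1.
crf_model = {"f1": 0.8013, "items_per_s": 25}
softmax_model = {"f1": 0.7952, "items_per_s": 700}

# How many times faster the Softmax variant runs.
speedup = softmax_model["items_per_s"] / crf_model["items_per_s"]

# How much F1 is given up, in absolute and relative terms.
f1_drop = crf_model["f1"] - softmax_model["f1"]
relative_f1_drop = f1_drop / crf_model["f1"]

print(f"speedup: {speedup:.0f}x")
print(f"absolute F1 drop: {f1_drop:.4f}")
print(f"relative F1 drop: {relative_f1_drop:.2%}")
```

The Softmax variant is 28 times faster, while its F1 is lower by less than one percent of the CRF model's score.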

The BiLSTM-Attention model with added entity category information, used as the entity relation discrimination model in step 5, is also compared with the original BiLSTM-Attention model. The F1 score and accuracy of the two models during inference on the test set are shown in Table 2. The improved model of the invention not only raises the F1 score but also greatly improves the plausibility of the inference results, because the model screens out pairs of entity categories that are unlikely to have a relation.

TABLE 2

Model	F1	Accuracy
Original BiLSTM-Attention model	0.8528	87%
Model of the invention	0.8934	92%
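The gains in Table 2 can be read as absolute improvements and, for accuracy, as a reduction in the error rate. The figures are copied from the table; the variable names are illustrative only.

```python
# Figures copied from Table 2.
original = {"f1": 0.8528, "accuracy": 0.87}
improved = {"f1": 0.8934, "accuracy": 0.92}

# Absolute gains.
f1_gain = improved["f1"] - original["f1"]
accuracy_gain_points = (improved["accuracy"] - original["accuracy"]) * 100

# Share of the original model's errors that the improved model removes.
error_original = 1 - original["accuracy"]
error_improved = 1 - improved["accuracy"]
relative_error_reduction = (error_original - error_improved) / error_original

print(f"F1 gain: {f1_gain:.4f}")
print(f"accuracy gain: {accuracy_gain_points:.1f} points")
print(f"relative error reduction: {relative_error_reduction:.1%}")
```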

Claims

1. An entity and relation tandem type extraction method for text data is characterized by comprising the following steps:

Step 4: judging whether the text is ambiguous, and if not, going to step 5; otherwise, inputting the text into an entity disambiguation model for disambiguation, then performing coreference resolution with a trained coreference resolution model to obtain the disambiguated text, and going to step 5; the entity disambiguation model is trained on a Chinese Short Text dataset, adopts a BERT-CNN network architecture, and comprises a BERT layer, a convolutional layer and a Softmax layer; the coreference resolution model is a SpanBERT coreference resolution model trained on the open OntoNotes 5.0 dataset annotated by the LDC, wherein the SpanBERT coreference resolution model comprises a transformer layer, a fully connected layer and a Softmax layer;

Step 5: inputting the unambiguous or disambiguated text obtained in step 4 into an entity relation discrimination model, and obtaining the relation category of the entities through an entity relation classification model; the entity relation discrimination model is a BiLSTM-Attention model trained on a sequence dataset with entity position information and entity category information, wherein the BiLSTM-Attention model comprises a BiLSTM layer and an Attention layer; the entity relation classification model is a BiLSTM-Attention model trained on a sequence dataset containing only entity position information.