CN115630647A - Entity and relation tandem type extraction method for text data - Google Patents

Entity and relation tandem type extraction method for text data

Info

Publication number
CN115630647A
Authority
CN
China
Prior art keywords
entity
model
text
layer
bilstm
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202211328790.7A
Other languages
Chinese (zh)
Inventor
乔海洋
白洁
张雨晨
刘红英
马国梁
尚凡华
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Tenth Research Institute Of Telecommunications Technology Co ltd
Original Assignee
Tenth Research Institute Of Telecommunications Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Tenth Research Institute Of Telecommunications Technology Co ltd
Priority to CN202211328790.7A
Publication of CN115630647A

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/279 Recognition of textual entities
    • G06F40/289 Phrasal analysis, e.g. finite state techniques or chunking
    • G06F40/295 Named entity recognition
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/35 Clustering; Classification

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • Machine Translation (AREA)

Abstract

The invention provides a tandem entity and relation extraction method for text data. The text is first preprocessed and its word vectors are extracted; the word vectors are then input into the BiLSTM-Softmax text entity extraction model to obtain text with entity pairs; ambiguous text is then disambiguated; finally, the text with entity pairs is input into a relation discrimination model, and the relation category of the entities is obtained through a relation classification model. The method performs tandem entity and relation extraction on text data, supports long-text processing, and enriches semantics through entity disambiguation and coreference resolution, yielding better entity extraction and relation extraction results.

Description

Entity and relation tandem type extraction method for text data
Technical Field
The invention belongs to the technical field of information extraction, and particularly relates to a tandem entity and relation extraction method for text data.
Background
Entity extraction and entity relation extraction play important roles in information extraction. The entity extraction task extracts atomic information elements from text, such as person names, organization names, geographic locations, event names, and times. The entity relation extraction task extracts the relation categories among the entities in the text, such as parent-child, romantic, and superior-subordinate relations among interpersonal relations.
Entity extraction commonly uses rule-based methods and model-based methods. (1) Rule-based methods are simple and effective for entities that appear in distinctive contexts or that have many characteristic features. As the corpus grows, however, the rules become more complex, conflicts between rules may arise, and the whole system may become unmaintainable. Rule-based methods are therefore better suited to extraction from semi-structured or standardized text, where they can achieve reasonable results when combined with business requirements. (2) Model-based methods include traditional models such as the hidden Markov model (HMM) and the conditional random field (CRF), and deep learning models such as RNN (Recurrent Neural Network), LSTM (Long Short-Term Memory), and BiLSTM (Bidirectional Long Short-Term Memory), which perform well on NLP (Natural Language Processing) tasks. Compared with traditional models, an RNN can take long-range context into account and avoids the feature selection problem of the CRF; the key work lies in network design and parameter tuning. However, an RNN generally needs a larger training set, whereas a CRF performs well on small data sets. Combining BiLSTM with CRF works better and exploits the respective strengths of both models.
For text entity relation extraction, the common approaches are template-based methods, supervised learning methods, and semi-supervised/unsupervised learning methods. (1) Template-based methods include methods based on trigger words or character strings and methods based on dependency syntax. They are highly accurate and can be customized for a specific domain, but they are difficult to maintain and port poorly. (2) Supervised learning methods include machine learning methods and deep learning methods. Machine learning methods such as MaxEnt (maximum entropy model), Naive Bayes, and SVM (Support Vector Machine) usually train two classifiers: the first judges whether two entities are related, and the second judges the relation type. Deep learning methods are further divided into Pipeline methods (entity recognition and relation classification are two independent processes, and relation recognition depends on the quality of entity recognition) and Joint Model methods (entity recognition and relation classification are optimized jointly). Pipeline methods include CR-CNN (a convolutional neural network that classifies relations by ranking), Att-CNN (attention-based convolutional neural network), and Att-BLSTM (attention-based bidirectional long short-term memory network); they can propagate errors from one stage to the next, whereas end-to-end methods are easier to optimize. Supervised learning methods place high demands on the data set, produce fragile models with limited generalization, and are hard to extend to new relations; in addition, a large training set is expensive to obtain. (3) Semi-supervised/unsupervised learning methods learn mainly from a small amount of labeled information and are represented by Bootstrap-based methods and distant supervision methods.
Bootstrap-based methods start from a small number of instances as an initial seed set, learn patterns from them, and iterate: they extract instances from unstructured data, learn new patterns from the newly extracted instances, expand the pattern set, and thereby discover new potential relation triples. Distant supervision methods automatically construct large amounts of training data by aligning a knowledge base with unstructured text, which reduces the model's dependence on manually annotated data and improves its cross-domain adaptability. These methods are cheap to build and suited to large-scale construction, but they are sensitive to the initial seeds, their results have low accuracy, and they do not compute a confidence for each result. Moreover, because unsupervised methods extract new relations, precision and recall cannot be computed exactly; they can only be estimated by randomly sampling relations from the result set and checking them manually.
Existing information extraction work covers many kinds of data, such as online medical entity extraction, named entity extraction from electronic medical records, named entity recognition for Chinese scenic spots, and entity relation extraction in the military domain. Text data is an indispensable part of the public's everyday information exchange and contains a large number of valuable entities and the relations among them. Extracting entities and relations from text data enriches personal information databases and is an important component of building a complete and clear relation network.
The BiLSTM-CRF entity extraction model is a typical existing text entity extraction model. It treats entity extraction as a sequence labeling task: the input is the text sequence to be processed, and the output is the probability that each character in the sequence belongs to each label type. The model consists of an embedding layer, several BiLSTM (Bidirectional Long Short-Term Memory) layers, and a CRF (Conditional Random Field) layer. Processing proceeds as follows: each character or word in the input sequence is represented as a vector (character/word embedding) and fed into the BiLSTM layer, which outputs the probability that each character or word belongs to each label type; these outputs are fed into the CRF layer, which corrects them according to the learned label transition relations and outputs the optimal label sequence; finally, the numeric output symbols are converted into human-readable label types.
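The correction performed by the CRF layer at inference time is commonly implemented as Viterbi decoding over per-position label scores and label-to-label transition scores. The following is a minimal pure-Python sketch of that idea, not the patent's implementation; the function name `viterbi_decode`, the three-label scheme (0 = O, 1 = B, 2 = I), and all scores are assumptions chosen for illustration.

```python
from typing import List, Sequence


def viterbi_decode(emissions: Sequence[Sequence[float]],
                   transitions: Sequence[Sequence[float]]) -> List[int]:
    """Return the highest-scoring label sequence.

    emissions[t][j]   : score of label j at position t (e.g. from a BiLSTM layer)
    transitions[i][j] : learned score of moving from label i to label j
    """
    num_labels = len(transitions)
    # best[j] is the best score of any path that ends in label j at the current step
    best = list(emissions[0])
    backpointers: List[List[int]] = []
    for t in range(1, len(emissions)):
        new_best, pointers = [], []
        for j in range(num_labels):
            # choose the previous label that maximises the path score into label j
            prev = max(range(num_labels), key=lambda i: best[i] + transitions[i][j])
            pointers.append(prev)
            new_best.append(best[prev] + transitions[prev][j] + emissions[t][j])
        best = new_best
        backpointers.append(pointers)
    # trace back from the best final label
    path = [max(range(num_labels), key=lambda j: best[j])]
    for pointers in reversed(backpointers):
        path.append(pointers[path[-1]])
    path.reverse()
    return path


# Hypothetical scores: label 2 (I) looks best at position 1 when viewed alone,
# but the transition O -> I is heavily penalised.
emissions = [[2.0, 0.5, 0.1], [1.0, 0.2, 1.5]]
transitions = [[0.0, 0.0, -1e4], [0.0, 0.0, 0.0], [0.0, 0.0, 0.0]]
print(viterbi_decode(emissions, transitions))
```

For a sequence of n positions and k labels, this decoding costs on the order of n·k² operations, which is one reason a CRF layer adds inference time.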
The BiLSTM-Attention relation extraction model is a typical existing text relation extraction model. Its input is a text sequence in which an entity pair is marked with position information, and its output is the relation category of the two entities in the sequence. The model consists of a character/word embedding layer, several BiLSTM layers (usually one), and an Attention layer. Processing proceeds as follows: a sequence with entity position information is input (for example, "<e1> Xiaoming </e1> likes going to the market to buy <e2> pencils </e2>", where <e1>, </e1>, <e2>, and </e2> are four position indicators that mark the start and end of each entity); each character or word in the sequence is represented as a vector (character/word embedding) and fed into the BiLSTM layer, which outputs high-level features for each character or word; the Attention layer generates a weight vector, which is multiplied with the word-level features of each time step and merged into a sentence-level feature vector; finally, the sentence-level feature vector is used for relation classification.
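The position indicators in the example can be produced mechanically from token spans. The sketch below shows one way to do so; the function name `mark_entities` and the convention that spans are (start, end) token indices with an exclusive end are assumptions for illustration, not details given in the patent.

```python
from typing import List, Sequence, Tuple


def mark_entities(tokens: Sequence[str],
                  e1_span: Tuple[int, int],
                  e2_span: Tuple[int, int]) -> List[str]:
    """Wrap two non-overlapping token spans in <e1>...</e1> and <e2>...</e2>.

    Each span is (start, end) in token indices, with end exclusive.
    """
    marked: List[str] = []
    for i, token in enumerate(tokens):
        # opening indicators go before the first token of each entity
        if i == e1_span[0]:
            marked.append("<e1>")
        if i == e2_span[0]:
            marked.append("<e2>")
        marked.append(token)
        # closing indicators go after the last token of each entity
        if i == e1_span[1] - 1:
            marked.append("</e1>")
        if i == e2_span[1] - 1:
            marked.append("</e2>")
    return marked


tokens = "Xiaoming likes going to the market to buy pencils".split()
print(" ".join(mark_entities(tokens, (0, 1), (8, 9))))
```

Treating the four indicators as ordinary tokens means they receive their own embeddings, so the downstream model can learn where each entity begins and ends.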
Applying these two models directly to text data raises several problems. (1) Text data is produced at a scale of hundreds of millions of items per day. When the BiLSTM-CRF entity extraction model processes such massive data, computing the optimal label sequence in the CRF layer takes up a large share of inference time, so inference is extremely inefficient and the model is unsuited to inference over very large data volumes. (2) The BiLSTM-Attention model cannot classify the output of the BiLSTM-CRF entity model directly, because that model produces candidate entities rather than entity pairs. A text may contain several entities, and only a small proportion of the possible entity pairs actually have a relation. For example, 4 entities yield 12 ordered pairs, of which only one may have a real relation. If the extracted entities are simply combined into candidate pairs and sent to the relation extraction model, the unbalanced data distribution severely degrades model performance. (3) The model's input sequence carries only the position information of the entities, not their category information, so BiLSTM-Attention cannot learn entity categories and may output entity pairs that violate common sense, such as classifying "Xiaoming" and "car" as having a family relation. If the category "person name" of the entity "Xiaoming" and the category "vehicle" of the entity "car" were input to the relation extraction model together with the sequence, the results should improve.
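The pair count in problem (2) follows from ordered pairing: 4 entities give 4 × 3 = 12 ordered (head, tail) pairs. The sketch below enumerates them and shows how category information could rule out implausible pairs. The entity list, the category names, and the rule that a family relation holds only between two persons are hypothetical, and the explicit filter is only an illustration; the method described here instead supplies the category information to the model.

```python
from itertools import permutations
from typing import List, Optional, Set, Tuple

Entity = Tuple[str, str]  # (surface text, category)


def candidate_pairs(entities: List[Entity],
                    allowed: Optional[Set[Tuple[str, str]]] = None
                    ) -> List[Tuple[Entity, Entity]]:
    """Return ordered (head, tail) pairs, optionally filtered by category pair."""
    pairs = list(permutations(entities, 2))
    if allowed is None:
        return pairs
    return [(h, t) for h, t in pairs if (h[1], t[1]) in allowed]


# Hypothetical entities extracted from one text
entities = [("Xiaoming", "person"), ("Xiaohong", "person"),
            ("car", "vehicle"), ("Beijing", "location")]

all_pairs = candidate_pairs(entities)
family_pairs = candidate_pairs(entities, allowed={("person", "person")})
print(len(all_pairs), len(family_pairs))
```

With these inputs, only the two person-to-person pairs survive the filter, which illustrates how few of the raw candidate pairs are plausible for a given relation type.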
Disclosure of Invention
To overcome the shortcomings of the prior art, the invention provides a tandem entity and relation extraction method for text data. The text is first preprocessed and its word vectors are extracted; the word vectors are then input into the BiLSTM-Softmax text entity extraction model to obtain text with entity pairs; ambiguous text is then disambiguated; finally, the text with entity pairs is input into a relation discrimination model, and the relation category of the entities is obtained through a relation classification model. The invention performs tandem entity and relation extraction on text data, supports long-text processing, and enriches semantics through entity disambiguation and coreference resolution, yielding better entity extraction and relation extraction results.
An entity and relation tandem type extraction method for text data is characterized by comprising the following steps:
Step 1: inputting a text, and if the text contains more than 500 characters, splitting it into segments of 500 characters to obtain a plurality of texts;
Step 2: inputting each segmented text into a word vector model to obtain the word vectors corresponding to the text; the word vector model is trained with the word2vec method on collected Chinese and English corpora, yielding a Chinese word vector model and an English word vector model;
Step 3: inputting the text word vectors into a text entity extraction model for entity extraction, and outputting the text with entity pairs; the text entity extraction model is a BiLSTM-Softmax entity extraction model trained on a sequence data set with position information, wherein the BiLSTM-Softmax entity extraction model comprises a bidirectional LSTM layer and a Softmax layer, and is an improved model obtained by replacing the CRF layer of the original BiLSTM-CRF model with a Softmax layer;
Step 4: judging whether the text is ambiguous, and if not, going to step 5; otherwise, inputting the text into an entity disambiguation model for disambiguation, then performing coreference resolution with a trained coreference resolution model to obtain the disambiguated text, and going to step 5; the entity disambiguation model is trained on a Chinese Short Text data set and adopts a Bert-CNN network architecture comprising a Bert layer, a convolutional layer, and a Softmax layer; the coreference resolution model is a SpanBert coreference resolution model trained on the open OntoNotes 5.0 data set published by the LDC (Linguistic Data Consortium), wherein the SpanBert coreference resolution model comprises a transformer layer, a fully connected layer, and a Softmax layer;
Step 5: inputting the unambiguous or disambiguated text obtained in step 4 into an entity relation discrimination model, and obtaining the relation category of the entities through an entity relation classification model; the entity relation discrimination model is a BiLSTM-Attention model trained on a sequence data set with entity position information and entity category information, wherein the BiLSTM-Attention model comprises a BiLSTM layer and an Attention layer; the entity relation classification model is a BiLSTM-Attention model trained on a sequence data set containing only entity position information.
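The segmentation in step 1 can be sketched as fixed-length slicing. In the sketch below, the function name `split_text` and the choice to return (index, segment) pairs are assumptions; the index is kept so that the original order of the segments can be restored later.

```python
from typing import List, Tuple


def split_text(text: str, max_len: int = 500) -> List[Tuple[int, str]]:
    """Split text into consecutive segments of at most max_len characters.

    Returns (index, segment) pairs; a text within the limit is returned whole.
    """
    if len(text) <= max_len:
        return [(0, text)]
    return [(start // max_len, text[start:start + max_len])
            for start in range(0, len(text), max_len)]


segments = split_text("x" * 1200)
print([(index, len(segment)) for index, segment in segments])
```

Fixed-length slicing can cut through the middle of an entity at a segment boundary; splitting at sentence boundaries instead would avoid that, but it is not what the step as written specifies.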
The beneficial effects of the invention are as follows. Because the CRF layer in the BiLSTM-CRF model is replaced with a Softmax layer, processing speed and computational efficiency improve greatly while text entity extraction accuracy remains relatively high. Because a text entity disambiguation step is added, the ambiguity of Chinese entities is handled effectively, and the accuracy with which text entities express their intended meaning is improved in practice. Because entity category information is added to the input of the BiLSTM-Attention model in the entity relation discrimination model, the model takes entity categories into account when learning relations between entities, which safeguards the accuracy of relation discrimination. Finally, the tandem extraction mode standardizes the data hand-off between the models, strengthens the capacity to process massive text data, lets disambiguated entities express semantic information better, and substantially improves the accuracy and plausibility of the entity extraction and relation extraction results.
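The speed benefit of the Softmax layer comes from labeling each position independently. A minimal pure-Python sketch of such decoding follows; the function names and the scores are assumptions for illustration, and a real model would apply this to the BiLSTM outputs in batched tensor form.

```python
import math
from typing import List, Sequence


def softmax(scores: Sequence[float]) -> List[float]:
    """Numerically stable softmax over one position's label scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def softmax_decode(emissions: Sequence[Sequence[float]]) -> List[int]:
    """Pick the most probable label independently at every position."""
    labels = []
    for scores in emissions:
        probs = softmax(scores)
        labels.append(max(range(len(probs)), key=probs.__getitem__))
    return labels


# Hypothetical per-position scores for three labels
print(softmax_decode([[2.0, 0.5, 0.1], [1.0, 0.2, 1.5]]))
```

For n positions and k labels this costs on the order of n·k operations, compared with n·k² for Viterbi decoding in a CRF layer, and every position can be processed in parallel. The trade-off is that no label transition constraints are enforced, so an invalid label sequence is possible.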
Drawings
FIG. 1 is a flow chart of an entity and relationship concatenation extraction method for text data according to the present invention.
Detailed Description
The present invention will be further described with reference to the following drawings and examples, which include, but are not limited to, the following examples.
As shown in fig. 1, the present invention provides a method for extracting entities and relationships in tandem for text data, which is implemented as follows:
Step 1: inputting a text and checking the length of the text sequence; if the text contains more than 500 characters, splitting it into segments of 500 characters to obtain a plurality of texts, and recording the order of the resulting texts;
Step 2: inputting each segmented text into a word vector model to obtain the word vectors corresponding to the text; the word vector model is trained with the word2vec method on Chinese and English corpora, and comprises a Chinese word vector model and an English word vector model;
Step 3: inputting the text word vectors into a text entity extraction model for entity extraction, and outputting the text with entity pairs; the text entity extraction model is a BiLSTM-Softmax entity extraction model trained on a sequence data set with position information, wherein the BiLSTM-Softmax entity extraction model comprises a bidirectional LSTM layer and a Softmax layer, and is an improved model obtained by replacing the CRF layer of the original BiLSTM-CRF model with a Softmax layer;
Step 4: judging whether the text is ambiguous, and if not, going to step 5; otherwise, inputting the text into an entity disambiguation model for disambiguation, then performing coreference resolution with a trained coreference resolution model to obtain the disambiguated text, and going to step 5; the entity disambiguation model is trained on the Chinese Short Text data set provided by CCKS2019 (China Conference on Knowledge Graph and Semantic Computing) and adopts a Bert-CNN network architecture comprising a Bert layer, a convolutional layer, and a Softmax layer; the coreference resolution model is a SpanBert coreference resolution model trained on the open OntoNotes 5.0 data set published by the LDC (Linguistic Data Consortium), wherein the SpanBert coreference resolution model comprises a transformer layer, a fully connected layer, and a Softmax layer;
Step 5: inputting the unambiguous or disambiguated text obtained in step 4 into an entity relation discrimination model, and obtaining the relation category of the entities through an entity relation classification model; the entity relation discrimination model is a BiLSTM-Attention model trained on a sequence data set with entity position information and entity category information, wherein the BiLSTM-Attention model uses a BiLSTM network with an added Attention mechanism and comprises a BiLSTM layer and an Attention layer; the entity relation classification model has the same structure as the entity relation discrimination model, but its training data set contains only sequences with entity position information, that is, it is a BiLSTM-Attention model trained on a sequence data set containing only entity position information.
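The five steps chain together as a simple pipeline. The sketch below shows only the control flow, with every model passed in as a callable; all parameter names are hypothetical. It also assumes one reading of step 5, namely that the discrimination model first screens the marked text and the classification model then assigns the relation category, which the text does not state explicitly.

```python
from typing import Any, Callable, List


def run_pipeline(
    text: str,
    embed: Callable[[str], Any],
    extract_entities: Callable[[str, Any], Any],
    is_ambiguous: Callable[[Any], bool],
    disambiguate: Callable[[Any], Any],
    resolve_coreference: Callable[[Any], Any],
    has_relation: Callable[[Any], bool],
    classify_relation: Callable[[Any], Any],
    max_len: int = 500,
) -> List[Any]:
    """Run steps 1 to 5 on one text and return the predicted relation categories."""
    # Step 1: split long text into segments of at most max_len characters.
    segments = [text[i:i + max_len] for i in range(0, len(text), max_len)] or [text]
    relations: List[Any] = []
    for segment in segments:
        vectors = embed(segment)                      # Step 2: word vectors
        marked = extract_entities(segment, vectors)   # Step 3: text with entity pairs
        if is_ambiguous(marked):                      # Step 4: only for ambiguous text
            marked = resolve_coreference(disambiguate(marked))
        # Step 5 (assumed reading): screen first, then classify.
        if has_relation(marked):
            relations.append(classify_relation(marked))
    return relations
```

Passing the models in as callables keeps the hand-off between stages explicit, so each stage can be replaced or tested with a stub independently of the others.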
To verify the effectiveness of the invention, the BiLSTM-Softmax entity extraction model of the invention is compared with the original BiLSTM-CRF model. The parameters of the BiLSTM-CRF model are: learning_rate = 0.001, batch_size = 32, embedding_size = 128, hidden_size = 200, num_layers = 2, dropout = 0.2, where embedding_size is the dimension of the character/word vectors, hidden_size is the hidden layer dimension, and num_layers is the number of LSTM layers. The parameters of the BiLSTM-Softmax model are: learning_rate = 0.001, batch_size = 32, embedding_size = 128, hidden_size = 200, num_layers = 1, dropout = 0.2. Table 1 shows the inference speed and F1 score (a metric that balances precision and recall; higher is better) of the two models when trained and evaluated on a real data set. The inference speed of the BiLSTM-Softmax model of the invention is greatly improved.
TABLE 1
Model                   F1       Inference speed (records/s)
BiLSTM-CRF model        0.8013   25
BiLSTM-Softmax model    0.7952   700
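The trade-off in Table 1 can be quantified directly from the reported figures. The short calculation below uses only the numbers in the table; the variable names are illustrative.

```python
# Figures copied from Table 1
table1 = {
    "BiLSTM-CRF": {"f1": 0.8013, "speed": 25},
    "BiLSTM-Softmax": {"f1": 0.7952, "speed": 700},
}

# Ratio of inference speeds and absolute difference in F1
speedup = table1["BiLSTM-Softmax"]["speed"] / table1["BiLSTM-CRF"]["speed"]
f1_drop = table1["BiLSTM-CRF"]["f1"] - table1["BiLSTM-Softmax"]["f1"]

print(f"speedup: {speedup:.0f}x, F1 drop: {f1_drop:.4f}")
```

On these figures the Softmax variant is 28 times faster at a cost of about 0.006 in F1. Note that the two configurations also differ in num_layers (2 versus 1), so the speed difference may not be attributable to the output layer alone.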
The BiLSTM-Attention model with added entity category information, used in the entity relation discrimination model of step 5, is compared with the original BiLSTM-Attention model. Table 2 shows the F1 score and accuracy of the two models when run on the test set. The improved model of the invention not only raises the F1 score but also makes the inference results much more plausible (that is, the model screens out pairs of entity categories that are unlikely to have a relation).
TABLE 2
Model                              F1       Accuracy
Original BiLSTM-Attention model    0.8528   87%
Model of the invention             0.8934   92%

Claims (1)

1. An entity and relation tandem type extraction method for text data is characterized by comprising the following steps:
Step 1: inputting a text, and if the text contains more than 500 characters, splitting it into segments of 500 characters to obtain a plurality of texts;
Step 2: inputting each segmented text into a word vector model to obtain the word vectors corresponding to the text; the word vector model is trained with the word2vec method on collected Chinese and English corpora, yielding a Chinese word vector model and an English word vector model;
Step 3: inputting the text word vectors into a text entity extraction model for entity extraction, and outputting the text with entity pairs; the text entity extraction model is a BiLSTM-Softmax entity extraction model trained on a sequence data set with position information, wherein the BiLSTM-Softmax entity extraction model comprises a bidirectional LSTM layer and a Softmax layer, and is an improved model obtained by replacing the CRF layer of the original BiLSTM-CRF model with a Softmax layer;
Step 4: judging whether the text is ambiguous, and if not, going to step 5; otherwise, inputting the text into an entity disambiguation model for disambiguation, then performing coreference resolution with a trained coreference resolution model to obtain the disambiguated text, and going to step 5; the entity disambiguation model is trained on a Chinese Short Text data set, wherein the entity disambiguation model adopts a Bert-CNN network architecture and comprises a Bert layer, a convolutional layer, and a Softmax layer; the coreference resolution model is a SpanBert coreference resolution model trained on the open OntoNotes 5.0 data set published by the LDC (Linguistic Data Consortium), wherein the SpanBert coreference resolution model comprises a transformer layer, a fully connected layer, and a Softmax layer;
Step 5: inputting the unambiguous or disambiguated text obtained in step 4 into an entity relation discrimination model, and obtaining the relation category of the entities through an entity relation classification model; the entity relation discrimination model is a BiLSTM-Attention model trained on a sequence data set with entity position information and entity category information, wherein the BiLSTM-Attention model comprises a BiLSTM layer and an Attention layer; the entity relation classification model is a BiLSTM-Attention model trained on a sequence data set containing only entity position information.
CN202211328790.7A 2022-10-27 2022-10-27 Entity and relation tandem type extraction method for text data Pending CN115630647A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211328790.7A CN115630647A (en) 2022-10-27 2022-10-27 Entity and relation tandem type extraction method for text data

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202211328790.7A CN115630647A (en) 2022-10-27 2022-10-27 Entity and relation tandem type extraction method for text data

Publications (1)

Publication Number Publication Date
CN115630647A true CN115630647A (en) 2023-01-20

Family

ID=84906779

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202211328790.7A Pending CN115630647A (en) 2022-10-27 2022-10-27 Entity and relation tandem type extraction method for text data

Country Status (1)

Country Link
CN (1) CN115630647A (en)

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination