CN112487206A - Entity relationship extraction method for automatically constructing data set - Google Patents

Entity relationship extraction method for automatically constructing data set

Info

Publication number
CN112487206A
CN112487206A (application CN202011428961.4A)
Authority
CN
China
Prior art keywords
data set
entity
relationship
model
relation
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202011428961.4A
Other languages
Chinese (zh)
Other versions
CN112487206B (en)
Inventor
房冬丽
魏超
李俊
衡宇峰
黄元稳
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
CETC 30 Research Institute
Original Assignee
CETC 30 Research Institute
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by CETC 30 Research Institute filed Critical CETC 30 Research Institute
Priority to CN202011428961.4A
Publication of CN112487206A
Application granted
Publication of CN112487206B
Legal status: Active
Anticipated expiration

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/36 Creation of semantic tools, e.g. ontology or thesauri
    • G06F16/367 Ontology
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/237 Lexical tools
    • G06F40/242 Dictionaries
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/237 Lexical tools
    • G06F40/247 Thesauruses; Synonyms
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/279 Recognition of textual entities
    • G06F40/289 Phrasal analysis, e.g. finite state techniques or chunking
    • G06F40/295 Named entity recognition
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks

Abstract

The invention provides an entity relationship extraction method that automatically constructs its own data set, comprising the following steps: step 1, collecting and preprocessing a corpus; step 2, defining a triple dictionary table and constructing a synonym table; step 3, generating training and test data sets with the LTP tool; step 4, training a network model on the training data set; step 5, predicting entities and relations on the test data set with the trained network model; and step 6, optimizing the prediction results to obtain a triple data set. The scheme analyzes text content automatically and thereby effectively solves the difficulty of generating training and test data sets. By optimizing and adjusting the BERT model, it also removes the dependence on large computational resources: entity relations in text can be extracted efficiently with only fine-tuning of BERT, so that the essential connections among multi-source heterogeneous data are presented intuitively.

Description

Entity relationship extraction method for automatically constructing data set
Technical Field
The invention relates to the field of natural language processing, in particular to an entity relationship extraction method for automatically constructing a data set.
Background
Entities and relations summarize the main content of a text, can display the connections among data intuitively, and provide basic data for downstream tasks such as intelligent question answering and retrieval systems. At present, apart from formally structured documents such as academic papers that supply a few keywords, most documents do not provide an intuitive data structure reflecting their content. The traditional approach of extracting document entities and relations by reading texts manually cannot meet the requirements of practical applications now that document data is multi-source and massive. How to extract entities and relations efficiently and accurately is therefore an urgent problem. Existing extraction methods are, in summary, mainly divided into two categories: rule-based methods and machine-learning methods.
1) Rule-based methods require domain linguists to construct, model, and assign matching patterns to a certain number of rule template sets. Building on linguistic knowledge such as character and word features, lexical analysis, and syntactic dependency analysis, a high-quality grammar-pattern matching template is obtained and the relation patterns present in the text are mined. Manually written rules were applied effectively in professional fields early on, but because language rules are complex and diverse, writing the rules consumes a great deal of manpower.
2) Among machine-learning methods, most entity-relation extraction approaches start from language features, such as part-of-speech tags and syntactic parse trees: a certain amount of sample data must be constructed in advance and then fed to model training. The constructed features depend on the maturity of the natural language processing tools, and errors produced by the tools propagate and accumulate through the extraction process, strongly affecting subsequent steps. Meanwhile, generating the sample data also requires much manpower.
Disclosure of Invention
Data is a vital resource in every field, and scarce data limits the progress of research work, while the traditional manual annotation approach consumes large amounts of human resources. Addressing these problems, the invention provides a technical solution for extracting entity relations from text based on the HIT LTP tool (the Language Technology Platform of the Harbin Institute of Technology) and the BERT model. The scheme mainly solves two problems: on the one hand, it analyzes text content automatically, effectively solving the difficulty of generating training and test data sets; on the other hand, by optimizing and adjusting the BERT model it removes the dependence on large computational resources, so that entity relations in text can be extracted efficiently with only fine-tuning of BERT, presenting the essential connections among multi-source heterogeneous data intuitively.
The technical scheme adopted by the invention is as follows. An entity relationship extraction method for automatically constructing a data set comprises the following steps:
step 1, collecting and preprocessing corpora;
step 2, defining a triple dictionary table and constructing a synonym table;
step 3, generating a training data set and a testing data set by utilizing an LTP tool;
step 4, training a network model according to the training data set;
step 5, carrying out entity and relation prediction on the test data set through the trained network model;
and step 6, optimizing the prediction result to obtain a triple data set.
Further, the preprocessing comprises: cleaning the collected document data and splitting the text content into sentences, where different types of text are handled with different segmentation modes.
Further, the specific process of step 2 is: forming a triple dictionary table from the preprocessed corpus, the sorted-out entity categories, and the relationships between categories.
Further, the LTP tool is used to annotate parts of speech; nouns are taken as entities and verbs as relations or attributes. Data whose entity-relation or attribute type is uncertain is corrected, and unreasonable data is cleaned, forming the training data set and the test data set.
Further, step 4 comprises: building a relation classification model based on the BERT model, and building an entity extraction model that works from the relations predicted by the relation classification model; the input of both the relation classification model and the entity extraction model is a vector of dimension (1, n, 768), the BERT output is taken at sentence level, and the feature information output by BERT is passed through a fully connected layer and a sigmoid activation function; the training data is fed into the two models separately to complete training.
Further, the activation function is:
φ(l) = 1 / (1 + e^(-l))
Further, the loss function of the model is:
L = -Σ_k [ y_k log φ(l_k) + (1 - y_k) log(1 - φ(l_k)) ]
where y_k denotes the one-hot encoded label, the output layer contains k neurons for the k classes, and φ(l_k) denotes the sigmoid activation of the corresponding output-layer neuron.
Further, the specific process of step 5 is: based on the network model trained in step 4, entity prediction and relation prediction are performed on the test data set; the BERT-based entity and relation network models split the input text into two parameters, the text text-a and the relation text-b, so as to predict multiple relations and label entity positions.
Further, the prediction-result optimization method of step 6 is: judge whether the predicted entities and relations contain synonyms or near-synonyms that refer to the same entity or relation. Cosine similarity is used for the judgment: if the cosine similarity is above a threshold, the words are unified into a single term; in this way entity relations with identical semantics are removed by computing cosine similarity, forming the final triple data set. The cosine similarity is computed as:
cos(x, y) = (Σ_i x_i · y_i) / ( √(Σ_i x_i²) · √(Σ_i y_i²) )
where x and y denote the vectors of the two words, and x_i and y_i denote the i-th elements of x and y respectively.
Compared with the prior art, the beneficial effects of this technical scheme are as follows: the scheme analyzes text content automatically, effectively solving the difficulty of generating training and test data sets; through optimization and adjustment of the BERT model it removes the dependence on large computational resources, so that entity relations in text can be extracted efficiently with only fine-tuning of BERT, presenting the essential connections among multi-source heterogeneous data intuitively.
Drawings
FIG. 1 is a flow chart of the method for extracting entity relationships for automatically building data sets according to the present invention.
FIG. 2 is a schematic diagram of the network model constructed in an embodiment of the invention.
FIG. 3 is a flow chart of constructing a training and testing data set in an embodiment of the present invention.
FIG. 4 is a diagram of network model input vectors in an embodiment of the invention.
FIG. 5 is a diagram of entities and relationships in an embodiment of the invention.
Detailed Description
The invention is further described below with reference to the accompanying drawings.
As shown in fig. 1, the present invention provides an entity relationship extraction method for automatically constructing a data set, which includes the following processes:
step 1, collecting and preprocessing corpora;
step 2, defining a triple dictionary table and constructing a synonym table;
step 3, generating a training data set and a testing data set by utilizing an LTP tool;
step 4, training a network model according to the training data set;
step 5, carrying out entity and relation prediction on the test data set through the trained network model;
and step 6, optimizing the prediction result to obtain a triple data set.
In particular:
in the step 1, crawler data and non-public document data are mainly collected, the collected multi-source heterogeneous data are cleaned, finally, text content is accurately divided by sentences, and different segmentation modes can be adopted for different types of texts, such as segmentation by using a regular expression, segmentation by using punctuation marks and the like.
The specific process of step 2 is: from the preprocessed corpus and the sorted-out entity classes and class relations, a triple dictionary table is formed, mainly containing the defined entities, relation classes, data types, and similar information.
In step 3, LTP integrates segmentation algorithms based on dictionary matching and on statistical machine learning, and can conveniently annotate the data with part-of-speech tags, category labels, boundaries, and similar information. The LTP tool is therefore used to annotate parts of speech; nouns are taken as entities and verbs as relations or attributes, preliminarily forming the training data set and the test data set.
The data set constructed by the LTP tool contains irregular and unreasonable data, which must be cleaned at this stage: data with uncertain entity-relation or attribute types is corrected, unreasonable data is removed, and the entity-category and relation-category dictionary tables are continuously refined, finally yielding a high-quality data set. A rough sketch of the candidate-generation step is given below.
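For instance, a loose Python sketch of turning LTP part-of-speech tags into candidate triples (assuming the classic pyltp interface and local model paths; the invention's actual pairing and cleaning rules are richer than this):

    from pyltp import Segmentor, Postagger  # HIT LTP Python bindings

    # Model paths are assumptions; point them at a local LTP data release.
    segmentor = Segmentor()
    segmentor.load("ltp_data/cws.model")
    postagger = Postagger()
    postagger.load("ltp_data/pos.model")

    def candidate_triples(sentence: str):
        """Nouns become candidate entities and verbs candidate relations,
        following the labeling rule described above; implausible triples
        are cleaned up in the later manual pass."""
        words = list(segmentor.segment(sentence))
        tags = list(postagger.postag(words))
        entities = [w for w, t in zip(words, tags) if t.startswith("n")]
        relations = [w for w, t in zip(words, tags) if t == "v"]
        return [(h, r, t) for r in relations
                for h in entities for t in entities if h != t]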
In step 4, a sentence may contain several relations or entities, so entity-relation extraction is a multi-label classification task. To extract entity-relation information from text more effectively, the invention separates the entity model from the relation network model: the relation network model is built first, then the entity network model. The main goal is for the relation network model to extract as many relation classes as possible, while the entity network model extracts the related entities more accurately. In recent years the BERT model has been widely applied in the NLP field on its own merits, so the relation and entity network models built here are both based on BERT. Constructing them is very convenient; only the following parameters need to be defined: the number of transformer layers, self-attention heads, hidden units, and training epochs. Here 12 transformer layers, 12 self-attention heads, and 768 hidden units are used, with 3 training epochs; the concrete model is shown in FIG. 2. The input of both the relation classification model and the entity extraction model is a vector of dimension (1, n, 768); the BERT output is taken at sentence level, and the feature information output by BERT is passed through a fully connected layer and a sigmoid activation function.
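A minimal PyTorch sketch of the relation head described above, using the Hugging Face transformers BERT as encoder (the layer counts and the 768 dimension follow the text; the model name and everything else are assumptions, and the entity extraction model would add a per-token tagging head in the same style):

    import torch
    import torch.nn as nn
    from transformers import BertModel

    class RelationClassifier(nn.Module):
        """Sentence-level multi-label relation head: BERT pooled output
        -> fully connected layer -> sigmoid, as described above."""
        def __init__(self, num_relations: int,
                     bert_name: str = "bert-base-chinese"):
            super().__init__()
            # bert-base: 12 transformer layers, 12 heads, 768 hidden units
            self.bert = BertModel.from_pretrained(bert_name)
            self.fc = nn.Linear(768, num_relations)

        def forward(self, input_ids, attention_mask=None, token_type_ids=None):
            out = self.bert(input_ids=input_ids,
                            attention_mask=attention_mask,
                            token_type_ids=token_type_ids)
            # Sentence-level output: one (768,) pooled vector per sentence.
            return torch.sigmoid(self.fc(out.pooler_output))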
The training stage mainly processes the training data set to produce the network models' weights, biases, and other parameters, which are saved for later verification on the test data set. The training procedure for the two network models is: 1) insert [CLS] and [SEP] at the beginning and end of each sentence respectively; 2) convert each token into a 768-dimensional vector; 3) generate segment ids; 4) generate position vectors. All three vectors have shape (1, n, 768) and can be added element-wise to obtain a composite (1, n, 768) representation, i.e. the encoded input of BERT. The encoded information is fed into the defined network models, and training of the entity and relation models, saving of the network models together with their weight parameters and bias information, and so on are executed in parallel.
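With the Hugging Face tokenizer, for example, the four steps above collapse into one call (a sketch; the tokenizer inserts [CLS]/[SEP] and produces segment and position ids internally, and BERT sums the three embeddings into the (1, n, 768) input):

    from transformers import BertTokenizer

    tokenizer = BertTokenizer.from_pretrained("bert-base-chinese")

    # text-a is the sentence; text-b, one candidate relation (see step 5).
    enc = tokenizer("兵役分为现役和预备役", "分为", return_tensors="pt")
    # enc.input_ids      -> [CLS] text-a [SEP] text-b [SEP] token ids
    # enc.token_type_ids -> segment ids: 0 for text-a, 1 for text-b
    # Position embeddings are added inside BertModel, matching step 4.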
Wherein the activation function is:
φ(l) = 1 / (1 + e^(-l))
The model alone cannot extract useful feature information effectively: although the BERT model is pre-trained on open-source corpora, the texts of each field are complex and diverse, so complete domain data can hardly be covered in BERT's pre-training stage. The invention therefore keeps optimizing the proposed model through training and fine-tuning, and adopts a loss function matched to the multi-label classification task:
L = -Σ_k [ y_k log φ(l_k) + (1 - y_k) log(1 - φ(l_k)) ]
where y_k denotes the one-hot encoded label, the output layer contains k neurons for the k classes, and φ(l_k) denotes the sigmoid activation of the corresponding output-layer neuron.
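In PyTorch terms, this per-class sigmoid cross-entropy is available, for instance, as BCEWithLogitsLoss (a sketch; the class fuses the sigmoid above with the loss for numerical stability, so it is fed the raw logits l_k):

    import torch
    import torch.nn as nn

    criterion = nn.BCEWithLogitsLoss()  # sigmoid + binary cross-entropy per class

    logits = torch.randn(1, 5)                     # k = 5 output neurons l_k
    labels = torch.tensor([[1., 0., 1., 0., 0.]])  # multi-label targets y_k
    loss = criterion(logits, labels)               # the L above, averaged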
The specific process of step 5 is: perform entity prediction and relation prediction on the test data set with the network models trained in step 4. The BERT-based entity and relation network models automatically split the input text into two parameters, text-a and text-b, where text-a is the sentence content and text-b is one specific relation; this enables the prediction of multiple relations and the labeling of entity positions.
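Sketched in Python, the two-stage prediction could look as follows (an assumed harness: relation_model and tokenizer are as in the training sketches, and the entity model is assumed to return one subject span and one object span per (text-a, text-b) pair):

    def predict_triples(sentence, relation_model, entity_model,
                        tokenizer, relation_names, threshold=0.5):
        """Stage 1: multi-label relation scores for the sentence alone.
        Stage 2: for each predicted relation, re-encode the sentence as
        text-a with the relation as text-b and tag the entity spans."""
        enc = tokenizer(sentence, return_tensors="pt")
        scores = relation_model(**enc)[0]
        predicted = [r for r, s in zip(relation_names, scores)
                     if s.item() >= threshold]
        triples = []
        for rel in predicted:
            pair = tokenizer(sentence, rel, return_tensors="pt")
            subj, obj = entity_model(**pair)  # assumed span outputs
            triples.append((subj, rel, obj))
        return triples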
The prediction-result optimization method of step 6 is as follows: judge whether the predicted entities and relations contain synonyms or near-synonyms that refer to the same entity or relation. Cosine similarity is used for the judgment: if the cosine similarity is above a threshold, the words are unified into a single term; in this way entity relations with identical semantics are removed by computing cosine similarity, forming the final triple data set. The cosine similarity is computed as:
cos(x, y) = (Σ_i x_i · y_i) / ( √(Σ_i x_i²) · √(Σ_i y_i²) )
where x and y denote the vectors of the two words, and x_i and y_i denote the i-th elements of x and y respectively.
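A small NumPy sketch of this de-duplication step (the 0.85 threshold and the embed function are illustrative assumptions, not values stated in the text):

    import numpy as np

    def cosine_similarity(x: np.ndarray, y: np.ndarray) -> float:
        """cos(x, y) = sum_i x_i*y_i / (||x|| * ||y||), as defined above."""
        return float(np.dot(x, y) / (np.linalg.norm(x) * np.linalg.norm(y)))

    def merge_synonyms(words, embed, threshold=0.85):
        """Keep one representative of each group of semantically
        identical entity/relation strings."""
        kept = []
        for w in words:
            if all(cosine_similarity(embed(w), embed(k)) < threshold
                   for k in kept):
                kept.append(w)
        return kept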
The scheme also includes optimizations around the LTP and BERT models. For the situation where text data sets are scarce, the LTP segmentation tool completes the construction of the data set automatically, effectively removing the past reliance on large-scale manual labeling.
BERT's self-supervised objectives include the MLM task: during network training, part of the tokens in the input sequence are randomly masked, and the masked tokens are predicted from the context fed into BERT. However, MLM was originally an NLP training method for English; in the Chinese setting, token-level MLM splits Chinese words apart and fragments contextual semantic information. Google released an updated BERT in 2019 that raises the masking granularity from single characters to whole words, so that sentence semantics are split as little as possible; unfortunately, that release did not cover Chinese. As pointed out in "Pre-Training with Whole Word Masking for Chinese BERT", an LTP segmentation tool can be used to segment sentences into words first, and masking can then be applied at word granularity for self-supervised training. Therefore, to better improve Chinese semantic understanding, this work combines LTP word segmentation with the word-granularity masking capability and open-source code of the 2019 BERT release and re-runs the BERT pre-training process, producing a BERT model that supports Chinese word granularity; the entity and relation network models used here are based on this model.
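The word-granularity masking step might be sketched as follows (an assumed illustration: ltp_segment is any word-segmentation callable, and the 15% mask ratio mirrors the usual MLM setting rather than a figure stated in the text):

    import random

    def whole_word_mask(sentence, ltp_segment, mask_ratio=0.15,
                        mask_token="[MASK]"):
        """Segment the sentence into words with LTP, choose whole words
        to mask, then mask every character of each chosen word so that
        word semantics are never split, as described above."""
        words = ltp_segment(sentence)       # e.g. ["兵役", "分为", "现役", ...]
        n_mask = max(1, int(len(words) * mask_ratio))
        masked = set(random.sample(range(len(words)), n_mask))
        chars, targets = [], []
        for i, w in enumerate(words):
            for ch in w:
                targets.append(ch if i in masked else None)  # MLM label
                chars.append(mask_token if i in masked else ch)
        return chars, targets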
The invention also provides an embodiment. The raw text data is mainly text from the field of policies and regulations; the collected web-crawler data and non-public text data amount to 15,982 documents in total. The details are as follows:
1) Training and test data sets are automatically constructed, as shown in FIG. 3.
First, all documents are accurately segmented with a regular expression to form individual legal articles.
The regular expression is as follows:
第([一二三四五六七八九十百千万零1234567890]+)[章条]([\s\S]*?)(?=第([一二三四五六七八九十百千万零1234567890]+)[章条]|$)
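In Python this splitting step might look like the sketch below (assuming the pattern above is a faithful reconstruction; the lookahead is made non-capturing here so that findall returns clean pairs):

    import re

    NUM = "一二三四五六七八九十百千万零1234567890"
    ARTICLE = re.compile(
        rf"第([{NUM}]+)[章条]([\s\S]*?)(?=第[{NUM}]+[章条]|$)"
    )

    def split_articles(document: str):
        """Return (article number, article body) pairs, one per 章/条."""
        return [(num, body.strip()) for num, body in ARTICLE.findall(document)]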
Next, entity classes (Table 1) and class relationships (Table 2) are defined.
And finally, marking the corpus information according to the LTP tool, and cleaning the data to form a data set.
Example: active serviceman/n and/c reservist/n ,/wp must/d abide by/v constitution/n and/c law/n ,/wp fulfill/v citizen/n 's/u obligations/n ,/wp and at the same time/d enjoy/v citizen/n 's/u rights/n ;/wp the rights/n and/c obligations/n arising/v from/c military service/v are specified/v by/c the military service law/n and/c other/r relevant/r laws and regulations/n. The resulting data set entries are: {active serviceman, abide by, constitution}, {active serviceman, abide by, law}, … .
TABLE 1 entity classes
TABLE 2 relationship classes
2) The network model is trained on the training set; the legal article and a single relation are used as the BERT text-a and text-b respectively, so that all relations present in the article are predicted as completely as possible.
The details are as follows. Example: "Military service is divided into active service and reserve service. {military service, divided into, active service}, {military service, divided into, reserve service}" is converted into the following two sentences: "Military service is divided into active service and reserve service. {military service, divided into, active service}" and "Military service is divided into active service and reserve service. {military service, divided into, reserve service}". Sequence labeling is then performed within each article according to the article and its corresponding triple. Example: "Military service is divided into active service and reserve service. {military service, divided into, active service}" is marked as "CLS B-SUB I-SUB 0 0 B-OBJ I-OBJ 0 0 0 0 SEP active service". Each article in the training data set is tokenized (split between a BasicTokenizer and a WordpieceTokenizer) and then vectorized, covering the mask vectors, the segment sequences (segment ids), and the position encoding, as shown in FIG. 4. The main steps are: 1) insert [CLS] and [SEP] at the beginning and end of each sentence respectively; 2) convert each token into a 768-dimensional vector; 3) generate segment ids; 4) generate position vectors. All three vectors have shape (1, n, 768) and can be added element-wise to obtain a composite (1, n, 768) representation, i.e. the encoded input of BERT. The encoded information is fed into the defined network models; entity and relation network model training, saving of the network models together with their weight parameters and bias information, and so on are executed in parallel. A sketch of the sequence-labeling step follows.
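For illustration, a minimal sketch of converting one article and one triple into the B-SUB/I-SUB/B-OBJ/I-OBJ character tags shown above (an assumption: spans are located by simple string search, and [CLS]/[SEP] are added later by the tokenizer):

    def bio_tags(sentence: str, subject: str, obj: str):
        """Tag each character: B-SUB/I-SUB over the subject span,
        B-OBJ/I-OBJ over the object span, '0' elsewhere."""
        tags = ["0"] * len(sentence)
        for span, prefix in ((subject, "SUB"), (obj, "OBJ")):
            start = sentence.find(span)
            if start >= 0:
                tags[start] = f"B-{prefix}"
                for i in range(start + 1, start + len(span)):
                    tags[i] = f"I-{prefix}"
        return tags

    # bio_tags("兵役分为现役和预备役", "兵役", "现役")
    # -> ['B-SUB', 'I-SUB', '0', '0', 'B-OBJ', 'I-OBJ', '0', '0', '0', '0']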
3) Cosine similarity is computed to form the triple data set of entity relations, presenting the connections among the data intuitively; the results are shown in Table 3 and FIG. 5:
table 3 entity relation case
In this embodiment, the performance of the model is evaluated with precision and recall. The samples in the test set are divided into: correctly predicted positive samples TP, incorrectly predicted positive samples FP, correctly predicted negative samples TN, and incorrectly predicted negative samples FN. Precision P and recall R are:
P = TP / (TP + FP)
R = TP / (TP + FN)
In the field of entity-relation extraction, models are generally compared by their F1 score, computed as:
F1 = 2PR / (P + R)
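These three metrics translate directly into code, for example:

    def precision_recall_f1(tp: int, fp: int, fn: int):
        """P = TP/(TP+FP), R = TP/(TP+FN), F1 = 2PR/(P+R)."""
        p = tp / (tp + fp) if tp + fp else 0.0
        r = tp / (tp + fn) if tp + fn else 0.0
        f1 = 2 * p * r / (p + r) if p + r else 0.0
        return p, r, f1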
the invention compares the values of accuracy P, recall R and F1 of different network models, as shown in Table 4, and it can be seen from Table 4 that the improved bert model achieves better effect.
TABLE 4 Comparison of different models
Network model    P       R       F1
RNN              0.57    0.52    0.54
LSTM             0.63    0.58    0.60
CNN              0.67    0.62    0.64
BERT             0.72    0.68    0.70
The invention is not limited to the foregoing embodiments. The invention extends to any novel feature or any novel combination of features disclosed in this specification and any novel method or process steps or any novel combination of features disclosed. Those skilled in the art to which the invention pertains will appreciate that insubstantial changes or modifications can be made without departing from the spirit of the invention as defined by the appended claims.
All of the features disclosed in this specification, or all of the steps in any method or process so disclosed, may be combined in any combination, except combinations of features and/or steps that are mutually exclusive.
Any feature disclosed in this specification may be replaced by alternative features serving equivalent or similar purposes, unless expressly stated otherwise. That is, unless expressly stated otherwise, each feature is only an example of a generic series of equivalent or similar features.

Claims (9)

1. An entity relationship extraction method for automatically constructing a data set is characterized by comprising the following processes:
step 1, collecting and preprocessing corpora;
step 2, defining a triple dictionary table and constructing a synonym table;
step 3, generating a training data set and a testing data set by utilizing an LTP tool;
step 4, training a network model according to the training data set;
step 5, carrying out entity and relation prediction on the test data set through the trained network model;
and step 6, optimizing the prediction result to obtain a triple data set.
2. The entity relationship extraction method for automatically constructing a data set according to claim 1, wherein the preprocessing comprises: cleaning the collected document data and splitting the text content into sentences, where different types of text are handled with different segmentation modes.
3. The entity relationship extraction method for automatically constructing a data set according to claim 2, wherein the specific process of step 2 is: forming a triple dictionary table from the preprocessed corpus, the sorted-out entity categories, and the relationships between categories.
4. The entity relationship extraction method for automatically constructing a data set according to claim 3, wherein in step 3 the LTP tool is used to annotate parts of speech; nouns are taken as entities and verbs as relations or attributes; data whose entity-relation or attribute type is uncertain is corrected, and unreasonable data is cleaned, forming the training data set and the test data set.
5. The entity relationship extraction method for automatically constructing a data set according to claim 4, wherein step 4 comprises: building a relation classification model based on the BERT model, and building an entity extraction model that works from the relations predicted by the relation classification model; the input of both the relation classification model and the entity extraction model is a vector of dimension (1, n, 768), the BERT output is taken at sentence level, and the feature information output by BERT is passed through a fully connected layer and a sigmoid activation function; the training data is fed into the two models separately to complete training.
6. The method of entity relationship extraction for automatically building a data set according to claim 5, wherein the activation function is:
φ(l) = 1 / (1 + e^(-l))
7. The entity relationship extraction method for automatically constructing a data set according to claim 5, wherein the loss function of the model is:
L = -Σ_k [ y_k log φ(l_k) + (1 - y_k) log(1 - φ(l_k)) ]
where y_k denotes the one-hot encoded label, the output layer contains k neurons for the k classes, and φ(l_k) denotes the sigmoid activation of the corresponding output-layer neuron.
8. The entity relationship extraction method for automatically constructing a data set according to claim 1, wherein the specific process of step 5 is: based on the network model trained in step 4, entity prediction and relation prediction are performed on the test data set; the BERT-based entity and relation network models split the input text into two parameters, the text text-a and the relation text-b, so as to predict multiple relations and label entity positions.
9. The entity relationship extraction method for automatically constructing a data set according to claim 1, wherein the prediction-result optimization method of step 6 is: judge whether the predicted entities and relations contain synonyms or near-synonyms that refer to the same entity or relation; cosine similarity is used for the judgment: if the cosine similarity is above a threshold, the words are unified into a single term; in this way entity relations with identical semantics are removed by computing cosine similarity, forming the final triple data set; the cosine similarity is computed as:
cos(x, y) = (Σ_i x_i · y_i) / ( √(Σ_i x_i²) · √(Σ_i y_i²) )
where x and y denote the vectors of the two words, and x_i and y_i denote the i-th elements of x and y respectively.
CN202011428961.4A 2020-12-09 2020-12-09 Entity relationship extraction method for automatically constructing data set Active CN112487206B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011428961.4A CN112487206B (en) 2020-12-09 2020-12-09 Entity relationship extraction method for automatically constructing data set

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202011428961.4A CN112487206B (en) 2020-12-09 2020-12-09 Entity relationship extraction method for automatically constructing data set

Publications (2)

Publication Number Publication Date
CN112487206A true CN112487206A (en) 2021-03-12
CN112487206B CN112487206B (en) 2022-09-20

Family

ID=74940894

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011428961.4A Active CN112487206B (en) 2020-12-09 2020-12-09 Entity relationship extraction method for automatically constructing data set

Country Status (1)

Country Link
CN (1) CN112487206B (en)


Patent Citations (11)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20200334416A1 (en) * 2019-04-16 2020-10-22 Covera Health Computer-implemented natural language understanding of medical reports
CN110134772A (en) * 2019-04-18 2019-08-16 五邑大学 Medical text Relation extraction method based on pre-training model and fine tuning technology
CN110297913A (en) * 2019-06-12 2019-10-01 中电科大数据研究院有限公司 A kind of electronic government documents entity abstracting method
CN110597998A (en) * 2019-07-19 2019-12-20 中国人民解放军国防科技大学 Military scenario entity relationship extraction method and device combined with syntactic analysis
CN111061882A (en) * 2019-08-19 2020-04-24 广州利科科技有限公司 Knowledge graph construction method
CN110597760A (en) * 2019-09-18 2019-12-20 苏州派维斯信息科技有限公司 Intelligent method for judging compliance of electronic document
CN110619053A (en) * 2019-09-18 2019-12-27 北京百度网讯科技有限公司 Training method of entity relation extraction model and method for extracting entity relation
CN110765774A (en) * 2019-10-08 2020-02-07 北京三快在线科技有限公司 Training method and device of information extraction model and information extraction method and device
CN111143536A (en) * 2019-12-30 2020-05-12 腾讯科技(深圳)有限公司 Information extraction method based on artificial intelligence, storage medium and related device
CN111581395A (en) * 2020-05-06 2020-08-25 西安交通大学 Model fusion triple representation learning system and method based on deep learning
CN111931506A (en) * 2020-05-22 2020-11-13 北京理工大学 Entity relationship extraction method based on graph information enhancement

Non-Patent Citations (9)

* Cited by examiner, † Cited by third party
Title
HOU R et al.: "An entity relation extraction algorithm based on BERT (wwm-ext)-BiGRU-Attention", Proceedings of the 2020 International Conference on Cyberspace Innovation of Advanced Technologies
HUI B et al.: "Few-shot relation classification by context attention-based prototypical networks with BERT", EURASIP Journal on Wireless Communications and Networking
LIU C et al.: "Chinese named entity recognition based on BERT with whole word masking", Proceedings of the 2020 6th International Conference on Computing and Artificial Intelligence
刘运璇: "Research and implementation of joint entity-relation extraction based on deep learning", China Master's Theses Full-text Database, Information Science and Technology series
姜猛: "Research on Chinese information extraction based on deep learning", China Master's Theses Full-text Database, Information Science and Technology series
房冬丽 et al.: "An entity relation extraction method with automatic data set construction", Communications Technology
李冬梅 et al.: "A survey of entity relation extraction methods", Journal of Computer Research and Development
李颖: "Chinese open-domain multi-ary entity relation extraction", China Master's Theses Full-text Database, Information Science and Technology series
王海宁: "Research on the construction and application of an ethnic festival knowledge graph", China Master's Theses Full-text Database, Philosophy and Humanities series

Cited By (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113111180A (en) * 2021-03-22 2021-07-13 杭州祺鲸科技有限公司 Chinese medical synonym clustering method based on deep pre-training neural network
CN113111180B (en) * 2021-03-22 2022-01-25 杭州祺鲸科技有限公司 Chinese medical synonym clustering method based on deep pre-training neural network
CN112966774A (en) * 2021-03-24 2021-06-15 黑龙江机智通智能科技有限公司 Histopathology image classification method based on image Bert
WO2022198868A1 (en) * 2021-03-26 2022-09-29 深圳壹账通智能科技有限公司 Open entity relationship extraction method, apparatus and device, and storage medium
CN113160917A (en) * 2021-05-18 2021-07-23 山东健康医疗大数据有限公司 Electronic medical record entity relation extraction method
CN113160917B (en) * 2021-05-18 2022-11-01 山东浪潮智慧医疗科技有限公司 Electronic medical record entity relation extraction method
CN113051897A (en) * 2021-05-25 2021-06-29 中国电子科技集团公司第三十研究所 GPT2 text automatic generation method based on Performer structure
CN113051897B (en) * 2021-05-25 2021-09-10 中国电子科技集团公司第三十研究所 GPT2 text automatic generation method based on Performer structure
CN113268577A (en) * 2021-06-04 2021-08-17 厦门快商通科技股份有限公司 Training data processing method and device based on dialogue relation and readable medium
CN113836281A (en) * 2021-09-13 2021-12-24 中国人民解放军国防科技大学 Entity relation joint extraction method based on automatic question answering

Also Published As

Publication number Publication date
CN112487206B (en) 2022-09-20

Similar Documents

Publication Publication Date Title
CN112487206B (en) Entity relationship extraction method for automatically constructing data set
CN113011533B (en) Text classification method, apparatus, computer device and storage medium
CN112214610B (en) Entity relationship joint extraction method based on span and knowledge enhancement
CN106599032B (en) Text event extraction method combining sparse coding and structure sensing machine
CN112541337B (en) Document template automatic generation method and system based on recurrent neural network language model
CN109614620B (en) HowNet-based graph model word sense disambiguation method and system
CN117151220B (en) Entity link and relationship based extraction industry knowledge base system and method
CN115357719B (en) Power audit text classification method and device based on improved BERT model
Sarwadnya et al. Marathi extractive text summarizer using graph based model
US20200311345A1 (en) System and method for language-independent contextual embedding
CN112417823B (en) Chinese text word order adjustment and word completion method and system
CN112183059A (en) Chinese structured event extraction method
CN114881043A (en) Deep learning model-based legal document semantic similarity evaluation method and system
CN112818698B (en) Fine-grained user comment sentiment analysis method based on dual-channel model
CN116342167B (en) Intelligent cost measurement method and device based on sequence labeling named entity recognition
CN110377753B (en) Relation extraction method and device based on relation trigger word and GRU model
Seo et al. Plain Template Insertion: Korean-Prompt-Based Engineering for Few-Shot Learners
CN110807096A (en) Information pair matching method and system on small sample set
Ramesh et al. Interpretable natural language segmentation based on link grammar
Lu et al. Attributed rhetorical structure grammar for domain text summarization
Maheswari et al. Rule based morphological variation removable stemming algorithm
CN114528459A (en) Semantic-based webpage information extraction method and system
Novák et al. Resolving noun phrase coreference in czech
Tolegen et al. Voted-perceptron approach for Kazakh morphological disambiguation
Wei Research on Internet Text Sentiment Classification Based on BERT and CNN-BiGRU

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant