CN113807518B

CN113807518B - Relation extraction system based on remote supervision

Info

Publication number: CN113807518B
Application number: CN202110935696.7A
Authority: CN
Inventors: 陆远航; 王悦; 白璐; 崔丽欣
Original assignee: Central University of Finance and Economics; iFlytek Co Ltd
Current assignee: Central University of Finance and Economics; iFlytek Co Ltd
Priority date: 2021-08-16
Filing date: 2021-08-16
Publication date: 2024-04-05
Anticipated expiration: 2041-08-16
Also published as: CN113807518A

Abstract

The invention relates to the field of information extraction and provides a relation extraction system based on remote supervision. The system comprises a knowledge base module and a relation extraction module connected with the knowledge base module. The knowledge base module provides the necessary knowledge information to the relation extraction module and helps to screen the needed information out of a large amount of external text. Once an entity extraction function has been developed, the relation extraction module can combine it with relation extraction to extract triples from external text, and the extraction results are used to enrich the knowledge base. The system thereby addresses the high cost of large-scale annotated data in supervised learning, the low accuracy of unsupervised learning, and the noise introduced by the overly strong assumption of the original remote supervision.

Description

Relation extraction system based on remote supervision

Technical Field

The invention relates to the field of information extraction, in particular to a relationship extraction system based on remote supervision.

Background

A knowledge base is made up of structured entries that are convenient for people to use in the relevant domain, and these entries are stored as triples. The specific format is (entity A, relationship name, entity B), where the relationship name is the relationship between entity A and entity B. For example, the triple (AIDS, symptom, immunity decline) indicates that immunity decline is a symptom of AIDS.

Introduction to the remote supervision technique [Mintz M, Bills S, Snow R, et al. Distant supervision for relation extraction without labeled data [C]// ACL 2009, Proceedings of the 47th Annual Meeting of the Association for Computational Linguistics and the 4th International Joint Conference on Natural Language Processing of the AFNLP, 2-7 August 2009, Singapore. Association for Computational Linguistics, 2009]:

Using the idea of remote supervision, a large-scale trainable corpus can be generated automatically from unstructured text on the basis of an existing knowledge base. The original remote supervision method works as follows: if two entities in a sentence both appear in the same knowledge base entry, the sentence is considered to express the relationship recorded in that entry. Applying this operation to a large number of sentences yields large-scale labeled sentences at relatively small cost.
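The original labeling rule can be illustrated with a minimal sketch. The function name, the substring-based entity matching, and the example sentences are assumptions made for illustration; the source does not specify how entities are matched.

```python
from typing import List, Tuple

Triple = Tuple[str, str, str]  # (entity A, relationship name, entity B)


def label_by_remote_supervision(
    sentences: List[str], knowledge_base: List[Triple]
) -> List[Tuple[str, str]]:
    """Label a sentence with a relation whenever both entities of a triple occur in it."""
    labeled = []
    for sentence in sentences:
        for entity_a, relation, entity_b in knowledge_base:
            # Assumption: plain substring matching stands in for entity recognition.
            if entity_a in sentence and entity_b in sentence:
                labeled.append((sentence, relation))
    return labeled


kb = [("AIDS", "symptom", "immunity decline")]
sentences = [
    "AIDS often leads to immunity decline in patients.",
    "Researchers studied AIDS and immunity decline in separate cohorts.",
    "Regular exercise improves sleep.",
]
labeled = label_by_remote_supervision(sentences, kb)
```

Both of the first two sentences receive the label "symptom", although the second does not actually state that relationship. This is the kind of noise that the unfiltered assumption produces.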

There are three approaches to implementing a relation extraction system: supervised learning, semi-supervised learning, and unsupervised learning. Relation extraction based on supervised learning achieves high accuracy when data are plentiful, but it usually requires large-scale annotated data: various features are extracted from all sentences to construct feature vectors, and these are used to train the relation extraction model. For example, Kambhatla [Kambhatla N. Combining lexical, syntactic, and semantic features with maximum entropy models for extracting relations [C]// Proc of Meeting of the Association for Computational Linguistics. 2004] proposes building the training model by combining different lexical, syntactic and semantic features with a maximum entropy model. Unsupervised learning requires no manually annotated data: a clustering method groups similar sentences into a class, and the most frequent words in the class are then selected as the semantic label of those sentences. Common clustering methods include K-means and self-organizing map clustering.

For relation extraction based on supervised learning, the model can generally achieve considerable accuracy if large-scale manually annotated data exist, but collecting and annotating the data beforehand incurs a huge cost in time and labor. Moreover, once annotation is finished, a model trained on the data generally performs well only on input data with a similar distribution; when the distribution or domain of the data differs from the training data, it is difficult to rebuild a relation model that meets requirements in a short time.

For relation extraction based on unsupervised learning, the cost of data preparation is very low because no annotated data are required. However, relations are assigned to sentences in the unstructured text on the basis of clustering, so the problems of the clustering method, such as relations that are difficult to describe and a low recall rate, are inevitably carried into the finally trained model.

The present invention therefore builds the system on a remote supervision algorithm. Remote supervision was pioneered by Mintz et al. [Mintz M, Bills S, Snow R, et al. Distant supervision for relation extraction without labeled data [C]// ACL 2009, Proceedings of the 47th Annual Meeting of the Association for Computational Linguistics and the 4th International Joint Conference on Natural Language Processing of the AFNLP, 2-7 August 2009, Singapore. Association for Computational Linguistics, 2009]. They proposed the following assumption: if an entity pair in a sentence appears in an entry of the knowledge base, the sentence must to some extent express the relationship name of that triple entry. Under this assumption, a domain-specific knowledge base can be used to label external unstructured text automatically, which saves the large cost of manual annotation in supervised learning. However, the assumption is too absolute. A system built directly on it therefore suffers from low accuracy, much like unsupervised learning.

Disclosure of Invention

The invention provides a relation extraction system based on remote supervision, which aims to solve the problems that large-scale annotation data in supervised learning has high cost, accuracy in unsupervised learning is low, noise is easily caused by too strong original remote supervision assumption, and the like.

In order to solve the above problems in the prior art, the present invention provides a relationship extraction system based on remote supervision, including:

the system comprises a knowledge base module and a relation extraction module connected with the knowledge base module, wherein the knowledge base module provides knowledge information for the relation extraction module and helps to screen the needed information out of a large amount of external text; once an entity extraction function has been developed, the relation extraction module can combine it with relation extraction to extract triples from the external text, and the extraction results are used to enrich the knowledge base;

the relation extraction module comprises an extraction sub-module with a relation extraction function, wherein the extraction sub-module is used for updating a classifier used in remote supervision;

the knowledge base module comprises a visualization sub-module which is connected with the knowledge base module and draws the knowledge base as a visual knowledge graph, a file update knowledge base sub-module which updates the knowledge base in batches from knowledge base files, and a triplet update knowledge base sub-module which updates the knowledge base with a single input entry;

the classifier is used within the remote supervision algorithm: a classifier trained in advance is introduced, and the trained classifier and the knowledge base are combined to screen out the needed corpus.

Preferably, the classifier used in remote supervision is updated by reselecting a corpus file and a label file.

Preferably, the visualization sub-module comprises a query sub-module and a drawing sub-module; the query sub-module searches the knowledge base for the triples corresponding to the given triple and passes the resulting list of triples to the drawing sub-module.

Preferably, the extraction sub-module trains a relation extraction model using all the corpora screened by the remote-supervision corpus screening sub-module, combined with the labels automatically assigned by the knowledge base.

Preferably, the relation extraction module further comprises a sentence update knowledge corpus sub-module, which obtains the sentence to be added from user input; the system judges whether the sentence meets the adding requirements and, if so, adds it to the corpus. The module further comprises a file update knowledge corpus sub-module, which obtains the update file selected by the user; the system splits the file into sentences, judges which sentences meet the conditions, and adds those sentences to the corpus.

Compared with the prior art, the relationship extraction system based on remote supervision has the following beneficial effects:

Firstly, to address the high cost of supervised learning, the system adopts a remote-supervision relation extraction method and labels external text automatically using a knowledge base, which avoids the long time and high cost of manual annotation and thereby reduces cost.

Secondly, to address the noise introduced by the overly strong assumption of the original remote supervision, a multi-class classifier for the applicable domain is introduced and used to improve remote supervision as follows. For each external sentence, the system checks whether the sentence contains the entity pair (E1, E2) of a triple entry (E1, R, E2) in the knowledge base. If it does, the sentence is not labeled directly; instead it is passed to the classifier. If the classification result is R, the sentence will not introduce noise when it is automatically labeled later, and it is added to the system corpus. When the relation extraction model is trained, this denoised corpus is automatically labeled using the knowledge base. This overcomes the noise problem of remote supervision and improves accuracy.

Thirdly, the problem of coarse relationship categories in semi-supervised learning is solved as follows. The multi-class classifier used when screening external text contains all the relationship categories used in subsequent work, that is, all the relationship categories in the knowledge base. The categories are therefore preset by the user, and the problem of coarse categories does not arise. The relationship categories are thereby optimized.

Finally, as mentioned earlier, a supervised relation extraction model is typically derived from a specific corpus, so its classifier tends to be specific to one text domain. To address this, the system adds a function for updating the classifier used to filter the corpus. The user builds a small but complete annotated corpus, and the system automatically generates a new classifier model after reading it. The filtered categories and the text domain change accordingly: if the user then inputs texts from the previous domain or from other domains, they do not pass the filter, and only texts from the domain of the new classifier are screened by the system and added to the system corpus. This solves the disadvantage that supervised learning is tied to a specific text domain, enables the system to work in more domains, and improves the efficiency of switching between tasks.

Drawings

Various other advantages and benefits will become apparent to those of ordinary skill in the art upon reading the following detailed description of the preferred embodiments. The drawings are only for purposes of illustrating the preferred embodiments and are not to be construed as limiting the invention.

Also, like reference numerals are used to designate like parts throughout the figures. In the drawings:

Fig. 1 is a schematic structural diagram of a relationship extraction system based on remote supervision according to an embodiment of the present invention.

Fig. 2 is a flowchart of classifier training in remote supervision provided in an embodiment of the present invention.

Fig. 3 is a flowchart of a remote supervision algorithm screening external corpus according to an embodiment of the present invention.

Fig. 4 is a flowchart of updating the knowledge base in batches from a file, provided in an embodiment of the present invention.

Fig. 5 is a flowchart of updating a single knowledge base content provided in an embodiment of the present invention.

Fig. 6 is a flow chart of knowledge base visualization provided in an embodiment of the present invention.

Fig. 7 is a functional flowchart of an update pre-training classifier provided in an embodiment of the present invention.

Fig. 8 is a flowchart of relation extraction according to an embodiment of the present invention.

Detailed Description

Exemplary embodiments of the present disclosure will be described in more detail below with reference to the accompanying drawings. While exemplary embodiments of the present disclosure are shown in the drawings, it should be understood that the present disclosure may be embodied in various forms and should not be limited to the embodiments set forth herein. Rather, these embodiments are provided so that this disclosure will be thorough and complete, and will fully convey the scope of the disclosure to those skilled in the art.

Referring to Fig. 1, an embodiment of the present invention provides a relationship extraction system based on remote supervision, which can be applied to the training of deep learning models. The system includes a knowledge base module and a relationship extraction module connected with the knowledge base module. The knowledge base module provides the necessary knowledge information to the relationship extraction module and helps to screen the needed information out of a large amount of external text. Once an entity extraction function has been developed, the relationship extraction module can combine it with relation extraction to extract triples from the external text, and the extraction results are used to enrich the knowledge base.

Referring to Fig. 3, a pre-trained multi-class classifier is introduced into the remote supervision algorithm to screen out the portions of the external corpus that meet the requirements. To solve the problem that supervised learning is applicable only to a corpus from one fixed domain, the system also provides a function for updating the classifier used in remote supervision by reselecting the corpus file and the label file. In addition, because both corpus updating and training of the relation extractor rely on the knowledge base, the system provides a function for updating the knowledge base: the updated knowledge base screens out more suitable corpora, and the larger screened corpus in turn improves the accuracy of the relation extraction model. Finally, because a knowledge base generally reflects the knowledge of its domain, a knowledge base visualization function is added so that a user can obtain a concise and clear visual result by entering the items to be queried.

The function of screening the corpus that meets the requirements is based on the idea of remote supervision, and its underlying principle is the remote supervision algorithm. A classifier trained in advance is also introduced, and the classifier and the knowledge base are combined to screen out the needed corpus.

Referring to Fig. 2, the techniques used in the classifier training process include word vector training models (Word2Vec, BERT, etc.), word segmentation toolkits, and PCA (principal component analysis). The specific training process is as follows:

(1) Segment the user's small annotated corpus into words using a word segmentation tool, and store the segmentation result in a file.

(2) Referring to Fig. 2, train the word vector model on the segmentation result file obtained in (1). Because the corpus is small, no words are discarded during training, and the sliding window of the model is set to 3, that is, a word is related to the words within a distance of 3 in its context. Later tests showed that classification works best when the word vector dimension is 36.

(3) Represent each line of the file obtained in (1) as a sentence vector, computed as the average of the vectors of all words in the sentence.

(4) Extract principal components from the sentence vector set obtained in (3) using PCA (principal component analysis), compressing the original 36-dimensional vectors into 28-dimensional vectors that retain the more important components. Finally, save the PCA model fitted to the original data as a pca.pickle file using the pickle module.

(5) Read the label file and load all the labels. Then use the vector set obtained in (4) together with the label set to train the SVM (support vector machine) based classifier. Finally, save the classifier model as an svm.pickle file using the pickle module.
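The sentence vector computation in step (3) can be sketched as follows. The function name, the handling of out-of-vocabulary words, and the 3-dimensional toy vectors are assumptions for illustration only; the described system uses 36-dimensional vectors from a trained word vector model.

```python
from typing import Dict, List


def sentence_vector(
    tokens: List[str], word_vectors: Dict[str, List[float]], dim: int
) -> List[float]:
    """Average the vectors of all known tokens in a segmented sentence."""
    known = [word_vectors[token] for token in tokens if token in word_vectors]
    if not known:
        # Assumption: a sentence with no known words maps to the zero vector.
        return [0.0] * dim
    return [sum(vec[i] for vec in known) / len(known) for i in range(dim)]


# Toy 3-dimensional word vectors, purely illustrative.
toy_vectors = {
    "fever": [1.0, 0.0, 2.0],
    "causes": [0.0, 1.0, 0.0],
    "fatigue": [2.0, 2.0, 1.0],
}
vec = sentence_vector(["fever", "causes", "fatigue"], toy_vectors, dim=3)
empty = sentence_vector(["unknown"], toy_vectors, dim=3)
```

In the flow above, vectors of this kind would then be reduced by PCA and passed to the SVM classifier.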

To overcome the influence of noise in external data on the final model trained by the system, a pre-trained classifier function is added. A relation classifier can be trained from only a small amount of annotated text; sentences that do not meet the requirements are filtered out by the classifier, and on this basis the external text is automatically labeled using the idea of remote supervision. The specific flow is as follows: (1) the user selects the most representative sentences from the user's own corpus and annotates them (the annotated relations should cover all the relations that the user needs to extract), constructing a small but complete annotated data set; (2) the user inputs the annotated file into the system, and the system trains a classifier on this small data set using techniques such as a support vector machine (SVM) and principal component analysis (PCA); (3) the external text is automatically labeled according to the user's knowledge base and the pre-trained classifier. The original remote supervision trains the relation extraction model solely on the fact that entity pairs appear in sentences. This system differs in that it adds a screening step, which overcomes the low accuracy of the original approach, effectively suppresses noise, and improves accuracy.

A function for updating this pre-trained classifier is also added, which overcomes the disadvantage that supervised relation extraction is applicable only to input data with the same distribution. If the user needs to change the domain to which the relation extraction model applies, only a small amount of time is needed to prepare a new set of representative annotated data for the system, and the system automatically trains a pre-trained classifier for the new domain. After the classifier is updated and the knowledge base of the new domain is loaded into the system, the system can screen the corpus of texts in the new domain and train the relation extraction model on the newly screened corpus.

Once the classifier has been obtained, it can be combined with the knowledge base to screen external large-scale unstructured text, and the sentences that meet the requirements are added to the corpus used by the system to train the final relation extraction model. Referring to Fig. 3, the screening process is as follows:

(1) The user selects the path of an external text file; the system reads the file and splits its content into sentences.

(2) For each sentence, the system judges whether the sentence contains the entity pair of an entry in the knowledge base. If so, go to (3); if not, processing of that sentence ends.

(3) The sentence is classified with the pre-trained classifier. If the classification result is consistent with the relation name of the corresponding entry found in (2), the sentence is added to the corpus used by the system to train the final relation extraction model. If the result is inconsistent, return to (2) and judge the next sentence.

(4) After all sentences have been processed, the process ends.
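The screening steps above can be sketched as follows. The function names, the substring-based entity matching, and the keyword-based toy classifier are assumptions; in the described system the classifier is the pre-trained SVM model.

```python
from typing import Callable, List, Tuple

Triple = Tuple[str, str, str]  # (entity A, relationship name, entity B)


def screen_corpus(
    sentences: List[str],
    knowledge_base: List[Triple],
    classify: Callable[[str], str],
) -> List[str]:
    """Keep a sentence only if the classifier agrees with a matching knowledge base entry."""
    kept = []
    for sentence in sentences:
        # Step (2): relation names of entries whose entity pair occurs in the sentence.
        relations = {
            rel for a, rel, b in knowledge_base if a in sentence and b in sentence
        }
        if not relations:
            continue
        # Step (3): accept only when the predicted relation matches an entry.
        if classify(sentence) in relations:
            kept.append(sentence)
    return kept


def toy_classifier(sentence: str) -> str:
    # Hypothetical keyword rule standing in for the pre-trained classifier.
    return "symptom" if "leads to" in sentence else "other"


kb = [("AIDS", "symptom", "immunity decline")]
sentences = [
    "AIDS often leads to immunity decline in patients.",
    "Researchers studied AIDS and immunity decline in separate cohorts.",
    "Regular exercise improves sleep.",
]
kept = screen_corpus(sentences, kb, toy_classifier)
```

Only the first sentence is kept: the second contains the entity pair but is rejected by the classifier, and the third contains no entity pair from the knowledge base.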

Because both corpus screening and the automatic labeling of the corpus during subsequent relation extraction model training rely on the system's knowledge base, the system provides a function that lets the user update the knowledge base. Each time entries are added to the knowledge base, the range of corpus that the system can screen grows, which helps the system keep improving its accuracy.

The knowledge base can be updated in two ways: in batches from a knowledge base file, or by inputting a single entry. The two schemes are as follows.

First, referring to Fig. 4, the flow for updating the knowledge base in batches from a file is as follows:

the user selects the new knowledge base file to be used for the update;

the system reads the new knowledge base file and judges whether each entry already exists in the original knowledge base;

if the entry exists, it is skipped and the next entry is judged. If not, the entry is added to the system knowledge base.

Second, referring to Fig. 5, the flow for updating the knowledge base with a single triple entry is as follows:

the user inputs the entity pair (entity A, entity B) and selects the corresponding relationship name from a list that the system generates from all the categories of the remote supervision classifier;

the system combines the user input into the format (entity A, relationship name, entity B) and passes it to the judging sub-module;

if the judging sub-module finds that no identical entry exists in the original knowledge base, the entry is added. Otherwise, the user is informed that the entry already exists.
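Both update paths reduce to the same duplicate check, as the following minimal sketch shows. The tab-separated file format, the function names, and the returned messages are assumptions; the source does not specify them.

```python
from typing import Iterable, Set, Tuple

Triple = Tuple[str, str, str]  # (entity A, relationship name, entity B)


def add_entry(knowledge_base: Set[Triple], entry: Triple) -> bool:
    """Add the entry unless it already exists; return True if it was added."""
    if entry in knowledge_base:
        return False
    knowledge_base.add(entry)
    return True


def update_from_file_lines(knowledge_base: Set[Triple], lines: Iterable[str]) -> int:
    """Batch update, assuming one tab-separated triple per line."""
    added = 0
    for line in lines:
        parts = [part.strip() for part in line.split("\t")]
        if len(parts) != 3 or not all(parts):
            continue  # skip malformed lines
        if add_entry(knowledge_base, (parts[0], parts[1], parts[2])):
            added += 1
    return added


def update_single(
    knowledge_base: Set[Triple],
    entity_a: str,
    relation: str,
    entity_b: str,
    allowed_relations: Set[str],
) -> str:
    """Single-entry update; the relation must be one of the classifier's categories."""
    if relation not in allowed_relations:
        return "unknown relationship name"
    if add_entry(knowledge_base, (entity_a, relation, entity_b)):
        return "entry added"
    return "entry already exists"


kb = {("AIDS", "symptom", "immunity decline")}
added = update_from_file_lines(
    kb, ["AIDS\tsymptom\timmunity decline\n", "diabetes\tsymptom\tthirst\n"]
)
message = update_single(kb, "diabetes", "symptom", "thirst", {"symptom"})
```

Storing the triples in a set makes the existence check a constant-time lookup, which is one possible design; the source does not state how the knowledge base is stored.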

As mentioned above, a knowledge base generally represents the knowledge of a certain domain, for example the relationships between people, or the relationships between diseases and medicines, foods, and so on. A knowledge base visualization function is therefore provided so that people who need knowledge of the domain can query the content conveniently, and the query result is presented visually, which makes it more concise, efficient and understandable. The query method is also simple: the user only needs to input entity A and select the corresponding relationship. After the query is submitted, the system quickly finds all entities B that have that relationship with entity A and presents them in a clear view. Referring to Fig. 6, the specific flow is as follows:

the user inputs an entity A to be queried on a system interface and selects a relationship name to be queried;

the system transmits the user input to a query sub-module, and the query sub-module transmits the query result to a drawing sub-module;

the drawing sub-module draws a directed graph, taking the queried entity A as the source node, all the entities B in the query result as target nodes, and the queried relationship as the name of the edges.
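The query and drawing steps can be sketched as follows. Emitting Graphviz DOT text is an assumption made to keep the example self-contained; the source does not name the drawing tool, and the function names and sample entries are hypothetical.

```python
from typing import List, Set, Tuple

Triple = Tuple[str, str, str]  # (entity A, relationship name, entity B)


def query_targets(knowledge_base: Set[Triple], entity_a: str, relation: str) -> List[str]:
    """Query sub-module: all entities B linked to entity A by the given relation."""
    return sorted(
        b for a, rel, b in knowledge_base if a == entity_a and rel == relation
    )


def to_dot(entity_a: str, relation: str, targets: List[str]) -> str:
    """Drawing sub-module: directed edges from entity A to each entity B, as DOT text."""
    lines = ["digraph knowledge {"]
    for target in targets:
        lines.append(f'  "{entity_a}" -> "{target}" [label="{relation}"];')
    lines.append("}")
    return "\n".join(lines)


kb = {
    ("diabetes", "symptom", "thirst"),
    ("diabetes", "symptom", "fatigue"),
    ("AIDS", "symptom", "immunity decline"),
}
targets = query_targets(kb, "diabetes", "symptom")
dot = to_dot("diabetes", "symptom", targets)
```

The resulting text can be rendered by any DOT-compatible viewer; a graph library could equally be used in its place.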

In addition, to make maximum use of domain-specific knowledge bases, the system provides knowledge base querying and visualization of the query results. By inputting entity A and selecting a relationship name, the user can query all entities B that correspond to entity A under that relationship. For example, querying (diabetes mellitus, symptom) in the system returns all the symptoms of diabetes mellitus recorded in the knowledge base, and a visual diagram is drawn and presented. The user can also update the knowledge base directly.

To solve the problem that a relation extraction model based on supervised learning is applicable only to a specific domain, the invention also provides a function for replacing the classifier used to screen the corpus. The user only needs to input a small amount of representative annotated data, and the system automatically generates a new classifier for filtering the corpus. After part of a knowledge base for the new domain has been added through the knowledge base update function, the system can screen external text in the new domain, and once the corpus reaches a certain scale, a relation extraction model for the new domain can be trained.

Referring to Fig. 7, the specific flow is as follows:

(1) The user selects a text file and a label file for updating the classifier;

(2) After reading the text file, the system segments it into words using a word segmentation toolkit, trains a word vector model (Word2Vec, BERT, etc.) on the segmentation result to obtain a new word vector model, and stores the model as a file;

(3) Using the word vector model obtained in step (2), each sentence in the text is represented as a vector;

(4) A PCA (principal component analysis) model is fitted to the vector list obtained in step (3) and stored as a file;

(5) Principal components are extracted from the vector list using the PCA model, reducing the dimension of the vectors and retaining the more important information;

(6) The vectors obtained after principal component extraction are combined with the labels read from the label file to train a new classifier model, and the new classifier model is stored.

The relation extraction function is the most important function of the present invention. In this part, the system trains a relation extraction model using all the corpora screened by the remote-supervision corpus screening sub-module, combined with the labels automatically assigned by the knowledge base. Initially, the relation extraction model is trained with a word vector model and a traditional machine learning model; once the data volume reaches a certain scale, Doc2Vec (sentence vectors) and a deep learning model can be used for training.

The specific flow is as follows:

the system reads the content of the screened corpus file and splits it into sentences;

for each sentence, the system looks up the knowledge base entry whose entity pair (entity A, entity B) appears in the sentence and uses the relationship name of that entry as the label of the sentence, which amounts to automatic labeling;

all sentences in the corpus are segmented into words and written to a word segmentation file, a word vector model is trained on this file, and the model is stored after training;

all sentences are represented as sentence vectors according to the word vector model, and a machine learning model is trained on these vectors and their labels to obtain the relation extraction model;

the user inputs a sentence, the system reads and segments it, and the sentence is represented as a sentence vector according to the word vector model; the system feeds the vector into the trained relation extraction model, and the model outputs the relationship that the sentence most likely expresses.
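The final training and prediction steps can be sketched as follows. A nearest-centroid classifier is used here purely as a simple stand-in for the unspecified "machine learning model"; it is not the model named in the source, and the 2-dimensional sentence vectors and label names are toy assumptions.

```python
import math
from collections import defaultdict
from typing import Dict, List


def train_centroids(
    vectors: List[List[float]], labels: List[str]
) -> Dict[str, List[float]]:
    """Compute one mean vector per relation label from the labeled sentence vectors."""
    sums: Dict[str, List[float]] = {}
    counts: Dict[str, int] = defaultdict(int)
    for vec, label in zip(vectors, labels):
        if label not in sums:
            sums[label] = [0.0] * len(vec)
        sums[label] = [s + v for s, v in zip(sums[label], vec)]
        counts[label] += 1
    return {
        label: [s / counts[label] for s in total] for label, total in sums.items()
    }


def predict(centroids: Dict[str, List[float]], vec: List[float]) -> str:
    """Return the relation whose centroid is closest to the sentence vector."""
    return min(centroids, key=lambda label: math.dist(centroids[label], vec))


# Toy sentence vectors with labels assigned automatically from the knowledge base.
train_vectors = [[1.0, 0.0], [0.8, 0.2], [0.0, 1.0], [0.2, 0.8]]
train_labels = ["symptom", "symptom", "treatment", "treatment"]
model = train_centroids(train_vectors, train_labels)
predicted = predict(model, [0.7, 0.3])
```

Any classifier that accepts fixed-length vectors could replace the stand-in without changing the surrounding flow.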

The pre-trained classifier is obtained by training a support vector machine on a small corpus. After training, the classifier reaches 82% accuracy on the test set, which is high enough for it to be used for screening the corpus.

In addition, without the pre-trained classifier, the relation extractor identifies the relation of each sentence with an accuracy of only 40%-50%; after the pre-trained classifier is added to further screen the candidate sentences, the accuracy reaches 80%.

For the function of updating the classifier used to screen the corpus, a small annotated corpus was constructed within a limited time, and after the classifier was updated with it, the classifier also performed well in the new domain.

Experiments show that the pre-trained classifier performs well and that using it successfully overcomes much of the noise introduced by the original remote supervision idea, thereby improving the performance of the relation extraction model. The domain to which the system applies can also be changed at extremely low cost.

It should be noted that the above-mentioned embodiments illustrate rather than limit the invention, and that those skilled in the art will be able to design alternative embodiments without departing from the scope of the appended claims. In the claims, any reference signs placed between parentheses shall not be construed as limiting the claim. The word "comprising" does not exclude the presence of elements or steps not listed in a claim. The word "a" or "an" preceding an element does not exclude the presence of a plurality of such elements. The invention may be implemented by means of hardware comprising several distinct elements, and by means of a suitably programmed computer. In the unit claims enumerating several means, several of these means may be embodied by one and the same item of hardware. The use of the words first, second, third, etc. does not denote any order. These words may be interpreted as names.

It will be apparent to those skilled in the art that various modifications and variations can be made to the present invention without departing from the spirit or scope of the invention. Thus, it is intended that the present invention also include such modifications and alterations insofar as they come within the scope of the appended claims or the equivalents thereof.

Claims

1. A relationship extraction system based on remote supervision, characterized in that: the system comprises a knowledge base module and a relation extraction module connected with the knowledge base module, wherein the knowledge base module provides knowledge information for the relation extraction module and helps to screen the needed information out of a large amount of external text; once an entity extraction function has been developed, the relation extraction module can combine it with relation extraction to extract triples from the external text, and the extraction results are used to enrich the knowledge base;

the knowledge base module comprises a visualization sub-module which draws the knowledge base as a visual knowledge graph, a file update knowledge base sub-module which updates the knowledge base in batches from a knowledge base file, and a triplet update knowledge base sub-module which updates the knowledge base with a single input entry;

2. The remote supervision based relationship extraction system of claim 1, wherein the classifier used in the remote supervision is updated by reselecting a corpus file and a label file.

3. The remote supervision based relationship extraction system of claim 1, wherein the visualization submodule comprises a query submodule and a drawing submodule, the query submodule searches corresponding triples in the knowledge corpus according to given triples, and the query submodule transmits the obtained triples list to the drawing submodule.

4. The relationship extraction system based on remote supervision according to claim 1, wherein the extraction sub-module trains a relationship extraction model using all the corpora screened by the remote-supervision corpus screening sub-module, combined with the labels automatically assigned by the knowledge base.

5. The remote supervision based relation extraction system according to claim 1, wherein the relation extraction module comprises a sentence update knowledge corpus sub-module for obtaining sentences required for updating by user input, judging whether the sentence meets the joining requirement, and joining the sentence into the corpus if the sentence meets the joining requirement.