CN111367986A

CN111367986A - Joint information extraction method based on weak supervised learning

Info

Publication number: CN111367986A
Application number: CN202010170467.6A
Authority: CN
Inventors: 王岚熙; 姜同强
Original assignee: Beijing Technology and Business University
Current assignee: Beijing Technology and Business University
Priority date: 2020-03-12
Filing date: 2020-03-12
Publication date: 2020-07-03

Abstract

The invention relates to the technical field of information extraction and discloses a joint information extraction method based on weak supervised learning, comprising the following steps. S1: collecting information to form a training corpus, matching the training corpus against entity pairs in a knowledge base to obtain a training set, classifying the information in the training set, labeling it according to the features it contains, and inputting the multi-label-annotated information into a joint extraction model. S2: extracting the feature labels in the training set according to the information to be extracted and, once the target is obtained, attaching all feature labels to the target. S3: feeding the label information obtained in step S2 into the joint extraction model for extraction to obtain an extraction result. The method addresses the problem that data set labeling is time-consuming and labor-intensive under the current supervised and semi-supervised learning approaches.

Description

Joint information extraction method based on weak supervised learning

Technical Field

The invention relates to the technical field of information extraction, in particular to a joint information extraction method based on weak supervised learning.

Background

Information technology has developed rapidly in the Web 2.0 era, and the spread of the Internet has driven explosive growth in data volume. As a main carrier of information dissemination, these data contain much information of interest. How to process large-scale unstructured data quickly and efficiently and obtain structured information from it has become a research hot spot, and this is the main task of information extraction. Entity relationship extraction is an important branch of the information extraction field; it advances theoretical research and has wide application value in practical engineering. At present, entity relationship extraction relies mainly on supervised or semi-supervised learning, and the data set labeling these approaches require is time-consuming and labor-intensive.

Disclosure of Invention

(I) Technical problem to be solved

In view of the shortcomings of the prior art, the invention provides a joint information extraction method based on weak supervised learning, which has the advantage of improving learning efficiency and solves the problem that data set labeling is time-consuming and labor-intensive under the current supervised and semi-supervised learning approaches.

(II) Technical scheme

In order to achieve the above purpose, the invention provides the following technical scheme: a joint information extraction method based on weak supervised learning, comprising the following steps:

S1: collecting information to form a training corpus, matching the training corpus against entity pairs in a knowledge base to obtain a training set, classifying the information in the training set, labeling it according to the features it contains, and inputting the multi-label-annotated information into a joint extraction model;

S2: extracting the feature labels in the training set according to the information to be extracted and, once the target is obtained, attaching all feature labels to the target;

S3: feeding the label information obtained in step S2 into the joint extraction model for extraction to obtain an extraction result.
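The matching part of step S1 can be illustrated with a minimal sketch. It assumes a toy in-memory knowledge base and plain substring matching; the names `KNOWLEDGE_BASE` and `build_training_set` and all sample data are hypothetical and do not come from the patent.

```python
from typing import Dict, List, Tuple

# Hypothetical toy knowledge base: (entity A, entity B) -> relation name.
KNOWLEDGE_BASE: Dict[Tuple[str, str], str] = {
    ("Alice", "Acme Co., Ltd."): "works_for",
    ("Acme Co., Ltd.", "Beijing"): "located_in",
}


def build_training_set(corpus: List[str]) -> List[Tuple[str, str, str, str]]:
    """Weakly label every sentence that mentions both entities of a known pair."""
    training_set = []
    for sentence in corpus:
        for (a, b), relation in KNOWLEDGE_BASE.items():
            # Assumption: an exact substring match counts as an entity mention.
            if a in sentence and b in sentence:
                training_set.append((sentence, a, b, relation))
    return training_set


corpus = [
    "Alice joined Acme Co., Ltd. in 2019.",
    "Acme Co., Ltd. opened an office in Beijing.",
    "Bob visited Shanghai last week.",
]
training_set = build_training_set(corpus)
for row in training_set:
    print(row)
```

Sentences that mention no known entity pair are simply left out of the training set; a real system would need proper entity recognition rather than substring matching.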

Preferably, in step S1, relationship classification in entity relationship extraction is generally performed based on the verb phrase between two entities. For an entity pair (A, B) and the trigger word ρ between them, the relationship prediction process is defined as f(A, B, ρ) → (A, B, R); that is, the extraction system maps the trigger word associated with the entity pair to a certain relationship R.
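The mapping f(A, B, ρ) → (A, B, R) can be sketched as a lookup from the trigger phrase to a relation. The trigger lexicon, the function names, and the rule that the trigger is the text between the two mentions are illustrative assumptions, not details given in the patent.

```python
from typing import Dict, Optional, Tuple

# Hypothetical trigger lexicon: verb phrase (trigger) -> relation label R.
TRIGGER_TO_RELATION: Dict[str, str] = {
    "was founded by": "founder",
    "is headquartered in": "headquarters",
}


def extract_trigger(sentence: str, a: str, b: str) -> Optional[str]:
    """Return the text between the mention of A and the following mention of B."""
    start = sentence.find(a)
    if start < 0:
        return None
    end = sentence.find(b, start + len(a))
    if end < 0:
        return None
    return sentence[start + len(a):end].strip()


def predict_relation(a: str, b: str, trigger: Optional[str]) -> Optional[Tuple[str, str, str]]:
    """f(A, B, trigger) -> (A, B, R), or None when the trigger is unknown."""
    relation = TRIGGER_TO_RELATION.get(trigger) if trigger else None
    return (a, b, relation) if relation is not None else None


sentence = "Acme was founded by Alice in 1999."
trigger = extract_trigger(sentence, "Acme", "Alice")
result = predict_relation("Acme", "Alice", trigger)
print(trigger, result)
```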

Preferably, in the step S1, according to the prior knowledge of the entity information to be extracted, some tag classes, tag element sets, and generalization operations of set elements are predefined:

For example, to identify and extract a "company name", a key token class Key is defined, together with a set of key token elements {Key_T}, where T ranges over publicly recognizable tokens such as "company", "limited", and "factory", and the corresponding generalization operations. These include both quantity generalization and type generalization: Generalize_NUM1(Key_T) = Key_T* (generalizing one Key_T to zero or more occurrences) and Generalize_CLASS(Key_T) = Key_* (generalizing Key_T to the wildcard Key_* over the set of key token elements).
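One way to picture the two generalization operations is as regular-expression rewrites. This is only an interpretive sketch: the patent does not say the operations are implemented with regular expressions, and the token list, function names, and sample sentence are hypothetical.

```python
import re
from typing import List

# Hypothetical key-token set Key_T for the "company name" class.
KEY_TOKENS: List[str] = ["Company", "Co.", "Ltd.", "Factory"]


def generalize_num(key_token: str) -> str:
    """Quantity generalization: one Key_T becomes zero or more occurrences."""
    return f"(?:{re.escape(key_token)})*"


def generalize_class(key_tokens: List[str]) -> str:
    """Type generalization: a wildcard matching any element of the key-token set."""
    return "(?:" + "|".join(re.escape(t) for t in key_tokens) + ")"


# Candidate company name: one or more capitalized words followed by any key token.
pattern = re.compile(r"((?:[A-Z][a-z]+ )+" + generalize_class(KEY_TOKENS) + ")")
text = "The contract between Golden Harvest Factory and Blue River Company was signed."
matches = pattern.findall(text)
print(matches)

# Quantity generalization accepts zero or several repetitions of the same token.
zero_or_more = re.compile(generalize_num("Ltd."))
print(bool(zero_or_more.fullmatch("")), bool(zero_or_more.fullmatch("Ltd.Ltd.")))
```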

Preferably, when the information is labeled, a filtering mechanism is applied to the labeling results to reduce the number of wrong labels and improve the performance of the extraction system, specifically comprising the following steps:

A1: given the labeled information, predicting whether an instance expresses a relationship;

A2: for each set of entity pairs, predicting whether the entity pair is labeled;

A3: filtering the instances labeled in step S2 using the set of negative examples.

Preferably, in step A1, the labeled information is used to improve the learning of the parameters of a hierarchical generative model: given an instance, W_rs indicates whether the s-th word sequence expresses the r-th relation. W_rs is a binary variable: W_rs = 1 indicates that word sequence s expresses relation r, and W_rs = 0 otherwise. In step A2, whether the i-th entity pair in the set is labeled according to the knowledge base is predicted from the word sequence variable W_rs. In step A3, the training corpus is analyzed to obtain a set of entity pairs that cannot express the relationship in the knowledge base but are often wrongly labeled; this set is then used to screen the relationships predicted in step A2. Through these two processes, wrongly labeled relationship instances in weak supervised learning are effectively reduced.
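The flow of steps A1 to A3 can be sketched as follows. The patent learns W_rs with a hierarchical generative model; this sketch swaps in a simple cue-word heuristic purely for illustration, and the cue list, negative set, and sample instances are hypothetical.

```python
from typing import Dict, List, Set, Tuple

Instance = Tuple[str, str, str, str]  # (word sequence, entity A, entity B, relation)

# Hypothetical cue words standing in for the learned model of step A1.
RELATION_CUES: Dict[str, List[str]] = {"works_for": ["joined", "works at"]}


def expresses_relation(sentence: str, relation: str) -> int:
    """A1 stand-in: W_rs = 1 if word sequence s expresses relation r, else 0."""
    return int(any(cue in sentence for cue in RELATION_CUES.get(relation, [])))


def filter_instances(instances: List[Instance],
                     negative_pairs: Set[Tuple[str, str]]) -> List[Instance]:
    """Keep instances that pass the A1/A2 prediction and the A3 negative-set screen."""
    kept = []
    for sentence, a, b, relation in instances:
        w_rs = expresses_relation(sentence, relation)  # A1
        is_labeled = w_rs == 1                         # A2 (simplified: label iff W_rs = 1)
        if is_labeled and (a, b) not in negative_pairs:  # A3
            kept.append((sentence, a, b, relation))
    return kept


instances = [
    ("Alice joined Acme in 2019.", "Alice", "Acme", "works_for"),
    ("Alice sued Acme in 2020.", "Alice", "Acme", "works_for"),
    ("Bob joined Acme Street Fair.", "Bob", "Acme", "works_for"),
]
negative_pairs = {("Bob", "Acme")}  # pairs known to be frequently mislabeled
kept = filter_instances(instances, negative_pairs)
print(kept)
```

The second instance is dropped because the word sequence does not express the relation, and the third because its entity pair is in the negative set.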

Preferably, in step S1, the joint extraction model comprises: a vector embedding layer that maps words from a high-dimensional discrete space to a low-dimensional continuous space; a bidirectional long short-term memory network (Bi-LSTM) encoding layer that captures the semantic information of each word; a conditional random field (CRF) decoding layer that labels linear data sequences; and a composite attention layer that jointly considers three characteristics: the computation region, the information used, and the structural hierarchy. During the experiments, system performance is evaluated in two ways, retention evaluation and manual evaluation, and accuracy and recall are computed; the accuracy of the system is evaluated on the N entity pairs that occur most frequently.
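The role of the CRF decoding layer can be illustrated with a small Viterbi decoder in plain Python. The patent does not name a decoding algorithm; Viterbi is the usual choice for a linear-chain CRF and is used here as an assumption. The emission scores, which in the described model would come from the Bi-LSTM encoding layer, and the transition scores are hand-written hypothetical values.

```python
from typing import Dict, List


def viterbi_decode(emissions: List[Dict[str, float]],
                   transitions: Dict[str, Dict[str, float]]) -> List[str]:
    """Return the highest-scoring tag sequence for a linear-chain CRF."""
    tags = list(emissions[0].keys())
    # best_score[t]: best score of any path ending in tag t at the current position.
    best_score = dict(emissions[0])
    backpointers: List[Dict[str, str]] = []
    for emission in emissions[1:]:
        new_score: Dict[str, float] = {}
        pointers: Dict[str, str] = {}
        for tag in tags:
            prev = max(tags, key=lambda p: best_score[p] + transitions[p][tag])
            new_score[tag] = best_score[prev] + transitions[prev][tag] + emission[tag]
            pointers[tag] = prev
        best_score = new_score
        backpointers.append(pointers)
    # Trace the best path backwards from the best final tag.
    path = [max(tags, key=lambda t: best_score[t])]
    for pointers in reversed(backpointers):
        path.append(pointers[path[-1]])
    return path[::-1]


# Hypothetical per-token scores for the tokens "Acme", "Ltd.", "hired".
emissions = [
    {"B": 2.0, "I": 0.5, "O": 1.0},
    {"B": 0.2, "I": 1.5, "O": 1.0},
    {"B": 0.1, "I": 0.3, "O": 2.0},
]
# Hypothetical transition scores; O -> I is heavily penalized as an invalid BIO move.
transitions = {
    "B": {"B": -1.0, "I": 1.0, "O": 0.0},
    "I": {"B": -1.0, "I": 0.5, "O": 0.0},
    "O": {"B": 0.0, "I": -10.0, "O": 0.0},
}
best_path = viterbi_decode(emissions, transitions)
print(best_path)
```

The transition scores are what let the decoder label the sequence as a whole rather than token by token, which is the reason for placing a CRF after the encoder.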

Preferably, the retention evaluation is specifically as follows: the training corpus is randomly divided, and the extraction system automatically identifies all relational entities and compares them with the entities in the knowledge base. Only about 56.7% of the entities in the corpus are present in the knowledge base, so the remaining entities act as noise and degrade extraction performance; evaluating only the N most common entity pairs in the corpus therefore reduces, to some extent, the influence of noisy data on the final result. The significance of the retention evaluation is that it is a coarse evaluation that allows many experiments to be run to obtain the value ranges of some key parameters of the extraction system. Because the retention evaluation only roughly screens the entity pairs in the corpus, entity pairs that occur rarely are equivalent to noise, so accuracy drops sharply and system performance worsens as the number of entity pairs increases. However, because retention evaluation experiments require no manual selection, the ranges of important parameters can be determined quickly.
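The idea of scoring only the N most frequent entity pairs against the knowledge base can be sketched as below. The function name, the toy predictions, and the toy knowledge base are hypothetical, and the random split of the corpus is omitted for brevity.

```python
from collections import Counter
from typing import Dict, List, Tuple

Pair = Tuple[str, str]


def precision_at_top_n(predictions: List[Tuple[Pair, str]],
                       knowledge_base: Dict[Pair, str],
                       n: int) -> float:
    """Precision over predictions whose entity pair is among the n most frequent pairs."""
    counts = Counter(pair for pair, _ in predictions)
    top_pairs = {pair for pair, _ in counts.most_common(n)}
    scored = [(pair, rel) for pair, rel in predictions if pair in top_pairs]
    # A prediction counts as correct only if the knowledge base holds the same relation.
    correct = sum(1 for pair, rel in scored if knowledge_base.get(pair) == rel)
    return correct / len(scored) if scored else 0.0


knowledge_base = {
    ("Alice", "Acme"): "works_for",
    ("Acme", "Beijing"): "located_in",
}
predictions = (
    [(("Alice", "Acme"), "works_for")] * 3
    + [(("Acme", "Beijing"), "located_in")] * 2
    + [(("Bob", "Zeta"), "works_for")]  # rare pair absent from the knowledge base
)
p_top2 = precision_at_top_n(predictions, knowledge_base, 2)
p_top3 = precision_at_top_n(predictions, knowledge_base, 3)
print(p_top2, p_top3)
```

In this toy data, widening N from 2 to 3 admits a rare pair that the knowledge base does not contain, so the measured precision falls, which mirrors the noise effect described above.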

Preferably, the manual evaluation is specifically as follows: several relationships with the highest occurrence frequency are manually selected for testing, which avoids the noise problem of the retention evaluation. The accuracy under manual evaluation is clearly better than under retention evaluation. Owing to the introduction of word vectors, the overall performance of the extraction system improves when extracting relationships for medium-frequency and high-frequency entity pairs, and with trigger word class analysis, good performance is maintained even when the number of entity pairs is small.

(III) Advantageous effects

Compared with the prior art, the invention provides a joint information extraction method based on weak supervised learning, which has the following beneficial effects:

In the joint information extraction method based on weak supervised learning, the information is feature-labeled through a strategy that combines weak supervised learning with joint information extraction, and is then extracted by the joint extraction model. This improves the accuracy and recall of information extraction while reducing the time and effort required.

Detailed Description

The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.

Example 1

A joint information extraction method based on weak supervised learning comprises the following steps:

In step S1, relationship classification in entity relationship extraction is generally performed based on the verb phrase between two entities. For an entity pair (A, B) and the trigger word ρ between them, the relationship prediction process is defined as f(A, B, ρ) → (A, B, R); that is, the extraction system maps the trigger word associated with the entity pair to a certain relationship R.

In step S1, some tag classes, tag element sets, and generalization operations on set elements are predefined according to prior knowledge of the entity information to be extracted.

When the information is labeled, a filtering mechanism is applied to the labeling results to reduce the number of wrong labels and improve the performance of the extraction system, specifically comprising the following steps:

A2: for each set of entity pairs, predicting whether the entity pair is labeled;

In step A1, the labeled information is used to improve the learning of the hierarchical model parameters: given an instance, W_rs indicates whether the s-th word sequence expresses the r-th relation. W_rs is a binary variable: W_rs = 1 indicates that word sequence s expresses relation r, and W_rs = 0 otherwise. In step A2, whether the i-th entity pair in the set is labeled according to the knowledge base is predicted from the word sequence variable W_rs. In step A3, the training corpus is analyzed to obtain a set of entity pairs that cannot express the relationship in the knowledge base but are often wrongly labeled; this set is then used to screen the relationships predicted in step A2. Through these two processes, wrongly labeled relationship instances in weak supervised learning are effectively reduced.

In step S1, the joint extraction model comprises: a vector embedding layer that maps words from a high-dimensional discrete space to a low-dimensional continuous space; a bidirectional long short-term memory network (Bi-LSTM) encoding layer that captures the semantic information of each word; a conditional random field (CRF) decoding layer that labels linear data sequences; and a composite attention layer that jointly considers three characteristics: the computation region, the information used, and the structural hierarchy. During the experiments, accuracy and recall are computed through retention evaluation, and the accuracy of the system is evaluated on the N entity pairs that occur most frequently.

The retention evaluation is specifically as follows: the training corpus is randomly divided, and the extraction system automatically identifies all relational entities and compares them with the entities in the knowledge base. Only about 56.7% of the entities in the corpus are present in the knowledge base, so the remaining entities act as noise and degrade extraction performance; evaluating only the N most common entity pairs in the corpus therefore reduces, to some extent, the influence of noisy data on the final result. The significance of the retention evaluation is that it is a coarse evaluation that allows many experiments to be run to obtain the value ranges of some key parameters of the extraction system. Because the retention evaluation only roughly screens the entity pairs in the corpus, entity pairs that occur rarely are equivalent to noise, so accuracy drops sharply and system performance worsens as the number of entity pairs increases. However, because retention evaluation experiments require no manual selection, the ranges of important parameters can be determined quickly.

Example 2

A2: for each set of entity pairs, predicting whether the entity pair is labeled;

In step S1, the joint extraction model comprises: a vector embedding layer that maps words from a high-dimensional discrete space to a low-dimensional continuous space; a bidirectional long short-term memory network (Bi-LSTM) encoding layer that captures the semantic information of each word; a conditional random field (CRF) decoding layer that labels linear data sequences; and a composite attention layer that jointly considers three characteristics: the computation region, the information used, and the structural hierarchy. During the experiments, accuracy and recall are computed through manual evaluation, and the accuracy of the system is evaluated on the N entity pairs that occur most frequently.

The manual evaluation is specifically as follows: several relationships with the highest occurrence frequency are manually selected for testing, which avoids the noise problem of the retention evaluation. The accuracy under manual evaluation is clearly better than under retention evaluation. Owing to the introduction of word vectors, the overall performance of the extraction system improves when extracting relationships for medium-frequency and high-frequency entity pairs, and with trigger word class analysis, good performance is maintained even when the number of entity pairs is small.

In conclusion, the joint information extraction method based on weak supervised learning feature-labels the information through a strategy that combines weak supervised learning with joint information extraction and then extracts it with the joint extraction model, which improves the accuracy and recall of information extraction and reduces the time and effort required.

It is to be noted that the term "comprises," "comprising," or any other variation thereof is intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising an … …" does not exclude the presence of other identical elements in a process, method, article, or apparatus that comprises the element.

Although embodiments of the present invention have been shown and described, it will be appreciated by those skilled in the art that changes, modifications, substitutions and alterations can be made in these embodiments without departing from the principles and spirit of the invention, the scope of which is defined in the appended claims and their equivalents.

Claims

1. A joint information extraction method based on weak supervised learning is characterized by comprising the following steps:

2. The joint information extraction method based on weak supervised learning as claimed in claim 1, wherein: in step S1, relationship classification in entity relationship extraction is generally performed based on the verb phrase between two entities, and for an entity pair (A, B) and the trigger word ρ between them, the relationship prediction process is defined as f(A, B, ρ) → (A, B, R), that is, the extraction system maps the trigger word associated with the entity pair to a certain relationship R.

3. The joint information extraction method based on weak supervised learning as claimed in claim 1, wherein: in the step S1, some mark classes, mark element sets, and generalization operations of set elements are predefined according to the prior knowledge of the entity information to be extracted.

4. The joint information extraction method based on weak supervised learning as claimed in claim 1, wherein: when the information is labeled, a filtering mechanism is applied to the labeling results to reduce the number of wrong labels and improve the performance of the extraction system, specifically comprising the following steps:

A2: for each set of entity pairs, predicting whether the entity pair is labeled;

5. The joint information extraction method based on weak supervised learning as claimed in claim 4, wherein: in step A1, the labeled information is used to improve the learning of the hierarchical model parameters, namely, given an instance, W_rs indicates whether the s-th word sequence expresses the r-th relation, W_rs being a binary variable such that W_rs = 1 indicates that word sequence s expresses relation r and W_rs = 0 otherwise; in step A2, whether the i-th entity pair in the set is labeled according to the knowledge base is predicted from the word sequence variable W_rs; in step A3, the training corpus is analyzed to obtain a set of entity pairs that cannot express the relationship in the knowledge base but are often wrongly labeled, and this set is then used to screen the relationships predicted in step A2, so that wrongly labeled relationship instances in weak supervised learning are effectively reduced through these two processes.

6. The joint information extraction method based on weak supervised learning as claimed in claim 1, wherein: in step S1, the joint extraction model comprises a vector embedding layer for mapping words from a high-dimensional discrete space to a low-dimensional continuous space, a bidirectional long short-term memory network (Bi-LSTM) encoding layer for capturing the semantic information of each word, a conditional random field (CRF) decoding layer for labeling linear data sequences, and a composite attention layer for jointly considering three characteristics: the computation region, the information used, and the structural hierarchy.

7. The joint information extraction method based on weak supervised learning as claimed in claim 1, wherein: during the experiments, system performance is evaluated in two ways, retention evaluation and manual evaluation, and accuracy and recall are computed.

8. The joint information extraction method based on weak supervised learning as claimed in claim 7, wherein: the retention evaluation is specifically as follows: the training corpus is randomly divided, and the extraction system automatically identifies all relational entities and compares them with the entities in the knowledge base.

9. The joint information extraction method based on weak supervised learning as claimed in claim 7, wherein: the manual evaluation is specifically as follows: several relationships with the highest occurrence frequency are manually selected for testing, which avoids the noise problem of the retention evaluation.