CN110188359B - Text entity extraction method

Info

Publication number: CN110188359B
Application number: CN201910472799.7A
Authority: CN
Inventors: 金霞
Original assignee: Chengdu Firestone Creation Technology Co ltd
Current assignee: Chengdu Firestone Creation Technology Co ltd
Priority date: 2019-05-31
Filing date: 2019-05-31
Publication date: 2023-01-03
Anticipated expiration: 2039-05-31
Also published as: CN110188359A

Abstract

The invention discloses a text entity extraction method that exploits the redundancy and repetition of information in a large corpus. Noisy entities are first obtained through phrase segmentation and distant (remote) supervision; the context sequence patterns (rules) of these entities are then mined and used automatically as the input rules of Snorkel, whose tolerance of noisy labels yields results of better quality than distant supervision alone. The model and the results are corrected iteratively, gradually removing noise and producing more reliable sequence patterns. The invention uses no labeled samples, which saves labor; the input rules of Snorkel are obtained automatically; and distant supervision, rule mining, Snorkel and the iterative process are combined so that the results improve progressively, noise is removed, and extraction quality increases.

Description

Text entity extraction method

Technical Field

The invention relates to the technical field of natural language processing, in particular to a few-sample text entity extraction method.

Background

In text information extraction applications, scenarios are numerous and fine-grained, labeled samples are scarce, and labeled samples are expensive to acquire. This is the situation faced in industrial practice. Within the model-training approach, popular research directions are methods for building labeled samples quickly and deep models that need fewer samples or tolerate noisier samples. Within the extraction-rule approach, the popular research direction is the rapid mining and construction of extraction rule sets.

Among existing text information extraction methods, model training requires a large number of labeled samples. Although some deep models are becoming more accurate while needing fewer labeled samples, a certain number of labeled samples is still required to train a usable model, and work cannot begin until those samples are available. The development cost is therefore shifted to sample labeling, and overall development efficiency remains low.

Methods based on extraction rules do not require samples to be labeled manually, but the rules usually have to be tuned using domain knowledge, and a purely rule-based system may need tens of thousands of rules. To reduce the effort of developing rule sets, the mining and automatic generation of rule sets has become a hot research direction.

Snorkel offers a path from rules to models; however, it depends strongly on the accuracy of the rule set, and the rules are not generated automatically.

Disclosure of Invention

By combining extraction rules with the idea of model training, the invention provides an information extraction solution for the case where few labeled samples are available, and an extraction model with higher accuracy can be obtained without manual intervention.

The purpose of the invention is realized by the following technical scheme: a method of text entity extraction, the method comprising the steps of:

(1) Automatic mining of rule sets, comprising the sub-steps of:

(1.1) performing phrase segmentation on a large corpus to obtain noun phrases;

(1.2) performing entity and entity type recognition on the noun phrases by distant supervision;

(1.3) mining frequently occurring sequence patterns from the results of entity and entity type recognition; in a sequence pattern, if a noun phrase in the original corpus is recognized as an entity, the noun phrase is replaced in the sequence pattern with its entity type;

(1.4) aggregating synonymous sequence patterns according to the entity types they contain, to obtain a sequence pattern subset A corresponding to each semantic meaning;

(1.5) adjusting the hierarchy level of the entity types in the sequence pattern subset A corresponding to each semantic meaning: on the result of sequence pattern aggregation, counting the entity type hierarchy levels in each subset A and taking the most frequent level as the entity type hierarchy level of that subset A;

(1.6) for each entity type, finding the sequence patterns containing that type in each subset A, to obtain a sequence pattern subset B corresponding to the entity type;

(2) Generating labeled data: taking the sequence pattern subset B corresponding to each entity type as the input of Snorkel and predicting the label of each sample, namely its entity type, where each label carries a confidence;

(3) Training an entity extraction regression model: training the entity extraction regression model with the confidence-bearing labels, and predicting on the corpus with the trained regression model to obtain an entity recognition result;

(4) Returning to step (1): using the trained entity extraction regression model to predict on the corpus again, using the result to correct the phrase segmentation and distant supervision entity recognition results obtained in step (1), and continuing with the remaining steps to obtain the entity extraction regression model and the entity recognition result again; this process is repeated until the entity recognition result of step (3) is consistent with that of the previous iteration.

Further, in the step (1.1), phrase segmentation is performed by using an AutoPhrase method to obtain noun phrases.

Further, in the step (1.3), frequently occurring sequence patterns are mined from the results of entity and entity type recognition by the PrefixSpan method.

Further, in the step (1.4), the aggregation is carried out as follows: a graph is built over the set of sequence patterns, in which each vertex is a sequence pattern and the edge between two patterns is described by three features, namely the number of entity types the two patterns share, the number of context words they share, and the number of identical entity extraction results; each edge is assigned a weight by a regression model trained on these three features, and subgraphs, namely the sequence pattern subsets, are obtained with a clustering algorithm.

The beneficial effects of the invention are as follows. The invention exploits the redundancy and repetition of information in a large corpus: it first obtains noisy entities through phrase segmentation and distant supervision, then mines the context sequence patterns (rules) of these entities and uses them automatically as the input rules of Snorkel, whose tolerance of noisy labels yields results of better quality than distant supervision alone. The model and the results are corrected iteratively, gradually removing noise and producing more reliable sequence patterns. The invention uses no labeled samples, which saves labor; the input rules of Snorkel are obtained automatically; and the method combines distant supervision, rule mining, Snorkel and an iterative process to improve the results and remove noise progressively, so that the final result is better than that of distant supervision.

Drawings

FIG. 1 is a flow chart of the method of the present invention.

Detailed Description

In order to make the objects, technical solutions and advantages of the embodiments of the present invention clearer, the technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention. It is to be understood that the embodiments described are only some of the embodiments of the present invention, and not all of them. Other embodiments, which can be derived by one of ordinary skill in the art from the embodiments of the present invention without creative efforts, are also within the scope of the present invention.

In a few-sample setting, the method automatically mines a rule set from a large number of unlabeled samples, manages the rule set with Snorkel to generate a large amount of noisy labeled data with confidences, and finally uses these data to train an entity extraction regression model.

As shown in fig. 1, the method for extracting text entities provided by the present invention specifically includes the following steps:

First, automatically mine rule sets

On a large corpus, phrase segmentation is first performed with the AutoPhrase method [AutoPhrase: Automated Phrase Mining from Massive Text Corpora] to obtain noun phrases;

Entity and entity type recognition is performed on the noun phrases by distant supervision (for English medical texts, good results can be obtained with the MetaMap tool);
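The distant supervision step can be pictured with a minimal sketch. It assumes a small in-memory dictionary (`KNOWLEDGE_BASE`, a hypothetical name with illustrative entries) standing in for an external resource such as MetaMap; it is not the tool's API and not necessarily how the invention looks entities up.

```python
# Hypothetical knowledge base mapping known entity names to entity types.
# In practice this role would be played by an external resource.
KNOWLEDGE_BASE = {
    "aspirin": "$MEDICINE",
    "ibuprofen": "$MEDICINE",
    "headache": "$DISEASE",
    "fever": "$DISEASE",
}


def distant_label(noun_phrases):
    """Assign an entity type to each noun phrase found in the knowledge base.

    Phrases without a match get None. The labels are noisy by construction,
    because nobody checks them by hand.
    """
    return {p: KNOWLEDGE_BASE.get(p.lower()) for p in noun_phrases}


labels = distant_label(["Aspirin", "headache", "clinical trial"])
print(labels)
```

Phrases that the dictionary does not cover remain untyped here; later steps are what recover or correct such cases.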

On the results of entity and entity type recognition, the PrefixSpan method [PrefixSpan: Mining Sequential Patterns Efficiently by Prefix-Projected Pattern Growth] is used to mine frequently occurring sequence patterns. A sequence pattern is an ordinary regular template to which entity types have been added, for example: ($MEDICINE) may be helpful for ($DISEASE), where ($MEDICINE) and ($DISEASE) denote the medicine and disease entity types respectively, and the corresponding position in the sequence pattern can be filled by any medicine or disease. In a sequence pattern, if a noun phrase in the original corpus is recognized as an entity, the noun phrase is replaced in the sequence pattern with its entity type, which makes the sequence pattern more general.
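As an illustration of this step, the following is a simplified PrefixSpan sketch over token sequences in which recognized entities have already been replaced by their types. The function names, the whitespace tokenization and the toy corpus are assumptions made for the example, and a production system would more likely rely on an existing PrefixSpan implementation.

```python
from collections import defaultdict


def to_typed_tokens(tokens, entity_types):
    """Replace tokens recognized as entities with their entity type."""
    return [entity_types.get(t, t) for t in tokens]


def prefixspan(sequences, min_support):
    """Return (pattern, support) pairs for all frequent subsequences.

    Simplified PrefixSpan over sequences of single items; gaps between
    matched items are allowed.
    """
    results = []

    def mine(prefix, projected):
        # projected: list of (sequence index, start position) pairs.
        # Record the first occurrence of each item per projected sequence.
        first_pos = defaultdict(dict)
        for seq_id, start in projected:
            seq = sequences[seq_id]
            for pos in range(start, len(seq)):
                first_pos[seq[pos]].setdefault(seq_id, pos)
        for item, occ in sorted(first_pos.items()):
            if len(occ) >= min_support:
                pattern = prefix + [item]
                results.append((pattern, len(occ)))
                # Project each sequence onto the suffix after this item.
                mine(pattern, [(sid, pos + 1) for sid, pos in occ.items()])

    mine([], [(i, 0) for i in range(len(sequences))])
    return results


# Toy entity types and corpus (illustrative only).
types = {"aspirin": "$MEDICINE", "ibuprofen": "$MEDICINE",
         "headache": "$DISEASE", "fever": "$DISEASE"}
corpus = [
    "aspirin may be helpful for headache",
    "ibuprofen may be helpful for fever",
    "aspirin was studied",
]
sequences = [to_typed_tokens(s.split(), types) for s in corpus]
patterns = prefixspan(sequences, min_support=2)
longest = max(patterns, key=lambda p: len(p[0]))
print(longest)
```

Because the two medicine sentences differ only in their entities, replacing the entities with types lets both support the same six-token pattern, which would not be frequent on the raw tokens.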

According to the entity types contained in the sequence patterns, synonymous sequence patterns are aggregated to obtain a sequence pattern subset A for each semantic meaning; the patterns in each subset A express the same meaning. Synonymous sequence patterns are sequence patterns that express the same semantics, such as "$Person's age is $Digit" and "$Person, $Digit", which both express "the person's age is a number". The aggregation is carried out as follows: a graph is built over the set of sequence patterns, in which each vertex is a sequence pattern and the edge between two patterns is described by three features: the number of entity types the two patterns share, the number of context words they share, and the number of identical entity extraction results. Each edge is assigned a weight by a regression model trained on these three features, and subgraphs, namely the sequence pattern subsets A, are obtained with a clustering algorithm [A procedure for clique detection using the group matrix]. For example, for the sequence patterns "$Country president $Politician" and "president $Politician of $Country", the shared entity types are $Country and $Politician, so the number of shared entity types is 2; the shared context word is "president", so the number of shared context words is 1; and the number of identical entity extraction results is the number of entities that both sequence patterns extract from the corpus, for example, when extracting $Politician entities, the number of extracted $Politician entities is counted.
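The three edge features and the grouping can be sketched as follows. Two parts are stand-ins and should be read as assumptions: `weight_fn` is a hypothetical scoring function used in place of the trained regression model, and connected components over thresholded edges replace the cited clique detection procedure. The extraction sets are invented for the example.

```python
def parse_pattern(pattern):
    """Split a pattern into its entity types and its context words."""
    tokens = pattern.split()
    types = {t for t in tokens if t.startswith("$")}
    words = {t.lower() for t in tokens if not t.startswith("$")}
    return types, words


def edge_features(p1, p2, extractions):
    """Shared entity types, shared context words, shared extraction results."""
    types1, words1 = parse_pattern(p1)
    types2, words2 = parse_pattern(p2)
    return (
        len(types1 & types2),
        len(words1 & words2),
        len(extractions[p1] & extractions[p2]),
    )


def cluster(patterns, extractions, weight_fn, threshold):
    """Group patterns linked by edges whose weight reaches the threshold."""
    parent = {p: p for p in patterns}

    def find(p):
        while parent[p] != p:
            p = parent[p]
        return p

    for i, p1 in enumerate(patterns):
        for p2 in patterns[i + 1:]:
            if weight_fn(edge_features(p1, p2, extractions)) >= threshold:
                parent[find(p1)] = find(p2)
    groups = {}
    for p in patterns:
        groups.setdefault(find(p), []).append(p)
    return list(groups.values())


p1 = "$Country president $Politician"
p2 = "president $Politician of $Country"
p3 = "$Person , $Digit"
# Invented extraction results per pattern (illustrative only).
extractions = {
    p1: {("USA", "Obama"), ("France", "Macron")},
    p2: {("USA", "Obama"), ("Germany", "Merkel")},
    p3: {("Alice", "30")},
}
features = edge_features(p1, p2, extractions)
# Hypothetical weight: the plain sum of the three features.
subsets_a = cluster([p1, p2, p3], extractions, weight_fn=sum, threshold=3)
print(features, subsets_a)
```

With these toy inputs the two president patterns share two types, one context word and one extraction, so they land in the same subset, while the age pattern stays on its own.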

The hierarchy level of the entity types in the sequence pattern subset A corresponding to each semantic meaning is then adjusted. Entity types form a hierarchy, for example $Country, $State and $City all fall under the $Location type, and entity type recognition yields entity types at different hierarchy levels for each noun phrase. On the result of sequence pattern aggregation, the entity type hierarchy levels in each subset A are counted, and the most frequent level is taken as the entity type hierarchy level of that subset A.
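One possible reading of this adjustment is sketched below. The `PARENT` map is a hypothetical type hierarchy, and lifting more specific types up to the dominant level is an assumption about how the chosen level is applied; the text itself only states that the most frequent level is taken.

```python
from collections import Counter

# Hypothetical type hierarchy: child type -> parent type (roots have no entry).
PARENT = {"$Country": "$Location", "$State": "$Location", "$City": "$Location"}


def depth(entity_type):
    """Depth of a type in the hierarchy; root types have depth 0."""
    d = 0
    while entity_type in PARENT:
        entity_type = PARENT[entity_type]
        d += 1
    return d


def dominant_level(subset_a):
    """Most frequent hierarchy depth among the entity types of a subset."""
    depths = [depth(t) for p in subset_a for t in p.split() if t.startswith("$")]
    return Counter(depths).most_common(1)[0][0]


def lift(entity_type, level):
    """Generalize a type upward until it is no deeper than the given level."""
    while depth(entity_type) > level:
        entity_type = PARENT[entity_type]
    return entity_type


def adjust_subset(subset_a):
    """Rewrite every pattern so its types sit at the dominant level."""
    level = dominant_level(subset_a)
    return [
        " ".join(lift(t, level) if t.startswith("$") else t for t in p.split())
        for p in subset_a
    ]


subset = [
    "$Person was born in $Location",
    "$Person born in $Location",
    "$Person , native of $City",
]
adjusted = adjust_subset(subset)
print(dominant_level(subset), adjusted)
```

Types that are already more general than the dominant level are left unchanged in this sketch, since narrowing them would need information the patterns do not carry.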

Through the above process, the sequence pattern subset A corresponding to each semantic meaning is obtained. For each entity type, the sequence patterns containing that type are collected from every subset A to obtain the sequence pattern subset B corresponding to that entity type.
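Regrouping the subsets A by entity type amounts to building an inverted index. A minimal sketch, with `build_subsets_b` as a hypothetical helper name and toy patterns as input:

```python
from collections import defaultdict


def build_subsets_b(subsets_a):
    """Map each entity type to the patterns, from any subset A, that contain it."""
    subsets_b = defaultdict(list)
    for subset in subsets_a:
        for pattern in subset:
            pattern_types = {t for t in pattern.split() if t.startswith("$")}
            for entity_type in sorted(pattern_types):
                subsets_b[entity_type].append(pattern)
    return dict(subsets_b)


subsets_a = [
    ["$Country president $Politician", "president $Politician of $Country"],
    ["$Person , $Digit"],
]
subsets_b = build_subsets_b(subsets_a)
print(subsets_b["$Politician"])
```

A pattern with two typed slots appears in two subsets B, once for each type it can help extract.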

Second, generate labeled data

The sequence pattern subset B corresponding to each entity type is taken as the input of Snorkel, and the label of each sample, namely its entity type, is predicted; each label carries a confidence.
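To show how sequence patterns can act as labeling rules, the sketch below turns each pattern into a function that either votes for an entity type or abstains, then combines the votes. It deliberately does not call the Snorkel library: the majority vote with a vote-share confidence is a simplified stand-in, whereas Snorkel's label model estimates the accuracy of each labeling function rather than counting votes equally. All names and patterns are illustrative assumptions.

```python
import re
from collections import Counter

ABSTAIN = None


def make_labeling_function(pattern, target_type):
    """Turn a sequence pattern into a function that votes for target_type.

    The slot of target_type becomes a capture group; other typed slots
    match any single token. The pattern must contain target_type.
    """
    parts = []
    for tok in pattern.split():
        if tok == target_type:
            parts.append(r"(\S+)")
        elif tok.startswith("$"):
            parts.append(r"\S+")
        else:
            parts.append(re.escape(tok))
    regex = re.compile(r"\s+".join(parts))

    def lf(sentence, candidate):
        m = regex.search(sentence)
        return target_type if m and m.group(1) == candidate else ABSTAIN

    return lf


def vote(lfs, sentence, candidate):
    """Majority vote; the winning vote share serves as a crude confidence."""
    votes = [v for v in (lf(sentence, candidate) for lf in lfs) if v is not ABSTAIN]
    if not votes:
        return ABSTAIN, 0.0
    label, count = Counter(votes).most_common(1)[0]
    return label, count / len(votes)


lfs = [
    make_labeling_function("$MEDICINE may be helpful for $DISEASE", "$DISEASE"),
    make_labeling_function("treatment of $DISEASE with $MEDICINE", "$DISEASE"),
    make_labeling_function("$MEDICINE may be helpful for $DISEASE", "$MEDICINE"),
]
sentence = "aspirin may be helpful for migraine"
print(vote(lfs, sentence, "migraine"))
print(vote(lfs, sentence, "aspirin"))
```

Note that "migraine" receives a label even though no dictionary listed it; this is how mined patterns can extend the coverage of the initial noisy entities.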

Third, train the entity extraction regression model

The entity extraction regression model is trained with the confidence-bearing labels, and the trained regression model is used to predict on the corpus to obtain the entity recognition result.
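The text does not specify the form of the regression model. As one possible illustration, the sketch below fits a binary logistic regression by gradient descent and scales each sample's gradient by the confidence of its label, so that low-confidence labels influence the model less. The features, the data and the hyperparameters are invented for the example.

```python
import math


def train_weighted_logreg(samples, epochs=200, lr=0.5):
    """Fit logistic regression, weighting each sample by its label confidence.

    samples: list of (features, label in {0, 1}, confidence in [0, 1]).
    """
    n_features = len(samples[0][0])
    w = [0.0] * n_features
    b = 0.0
    for _ in range(epochs):
        for x, y, conf in samples:
            z = b + sum(wi * xi for wi, xi in zip(w, x))
            p = 1.0 / (1.0 + math.exp(-z))
            grad = conf * (p - y)  # confidence scales the update
            w = [wi - lr * grad * xi for wi, xi in zip(w, x)]
            b -= lr * grad
    return w, b


def predict(model, x):
    """Probability that the candidate with features x is an entity."""
    w, b = model
    z = b + sum(wi * xi for wi, xi in zip(w, x))
    return 1.0 / (1.0 + math.exp(-z))


# Hypothetical features: [appears in a typed slot, follows the word "for"].
samples = [
    ([1.0, 0.0], 1, 0.9),
    ([1.0, 1.0], 1, 0.8),
    ([0.0, 0.0], 0, 0.9),
    ([0.0, 1.0], 0, 0.7),
    ([0.0, 0.0], 1, 0.1),  # a noisy label with low confidence
]
model = train_weighted_logreg(samples)
print(predict(model, [1.0, 0.0]), predict(model, [0.0, 0.0]))
```

The last sample contradicts a high-confidence one with the same features; because its weight is small, the fitted model still scores that feature vector as a non-entity.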

Fourth, return to the first step: the trained entity extraction regression model is used to predict on the corpus again, and the result is used to correct the phrase segmentation and distant supervision entity recognition results obtained in the first step; the remaining steps are then continued to obtain the entity extraction regression model and the entity recognition result again. This process is repeated until the entity recognition result of the third step is consistent with that of the previous iteration.
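The stopping rule of this loop can be sketched as a fixed-point iteration. `run_pipeline` is a hypothetical callable standing for steps one to three, and the `max_rounds` cap is an added safeguard that the text does not mention.

```python
def iterate_until_stable(corpus, run_pipeline, max_rounds=10):
    """Repeat the pipeline until the extracted entities stop changing.

    run_pipeline(corpus, previous_entities) returns a set of
    (phrase, entity type) pairs; previous_entities is None in round one.
    Returns the final entities and the number of rounds executed.
    """
    previous = None
    for round_no in range(1, max_rounds + 1):
        current = run_pipeline(corpus, previous)
        if current == previous:
            return current, round_no
        previous = current
    return previous, max_rounds


def toy_pipeline(corpus, previous):
    """Hypothetical stand-in: round one is noisy, later rounds drop the noise."""
    if previous is None:
        return {("aspirin", "$MEDICINE"), ("trial", "$DISEASE"), ("fever", "$DISEASE")}
    return {e for e in previous if e[0] != "trial"}


entities, rounds = iterate_until_stable([], toy_pipeline)
print(entities, rounds)
```

In the toy run the noisy entity disappears in the second round, and the third round confirms that nothing else changes, which is the consistency condition described above.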

One skilled in the art can, using the teachings of the present invention, readily make various changes and modifications to the invention without departing from the spirit and scope of the invention as defined by the appended claims. Any modifications and equivalent variations of the above-described embodiments, which are made in accordance with the technical spirit and substance of the present invention, fall within the scope of protection of the present invention as defined in the claims.

Claims

1. A text entity extraction method, characterized by comprising the following steps:

(1) Automatic mining of rule sets, comprising the sub-steps of:

(1.2) performing entity and entity type recognition on noun phrases by distant supervision;

(2) Generating labeled data: taking the sequence pattern subset B corresponding to each entity type as the input of Snorkel and predicting the label of each sample, namely its entity type, where each label carries a confidence;

2. The method as claimed in claim 1, wherein in step (1.1), the noun phrase is obtained by phrase segmentation using AutoPhrase method.

3. The method for extracting text entities according to claim 1, wherein in said step (1.3), the PrefixSpan method is used to mine frequently occurring sequence patterns from the results of entity and entity type recognition.

4. The method for extracting text entities as claimed in claim 1, wherein in the step (1.4), the aggregation is carried out as follows: a graph is built over the set of sequence patterns, in which each vertex is a sequence pattern and the edge between two patterns is described by three features, namely the number of entity types the two patterns share, the number of context words they share, and the number of identical entity extraction results; each edge is assigned a weight by a regression model trained on these three features, and subgraphs, namely the sequence pattern subsets, are obtained with a clustering algorithm.