CN110188359A - A text entity extraction method

Info

Publication number: CN110188359A
Application number: CN201910472799.7A
Authority: CN
Inventors: 金霞
Original assignee: Chengdu Firestone Creation Technology Co Ltd
Current assignee: Chengdu Firestone Creation Technology Co Ltd
Priority date: 2019-05-31
Filing date: 2019-05-31
Publication date: 2019-08-30
Anticipated expiration: 2039-05-31
Also published as: CN110188359B

Abstract

The invention discloses a text entity extraction method. The method exploits the redundancy and repetition of information in a large corpus. It first obtains noisy entities by phrase segmentation and distant supervision, then mines the contextual sequence patterns (rules) of those entities, so that the input rules for Snorkel are derived automatically. Relying on Snorkel's tolerance of noisy labels, it obtains results of better quality than distant supervision alone. The model and the results are corrected iteratively, which gradually removes noise and yields more reliable sequence patterns. The invention uses no labeled samples, which saves manual labor; the input rules for Snorkel are derived automatically; and the combination of distant supervision, rule mining, Snorkel, and the iterative process progressively improves the results, removes noise, and raises extraction quality.

Description

A text entity extraction method

Technical field

The present invention relates to the technical field of natural language processing, and in particular to a few-sample text entity extraction method.

Background art

In text information extraction applications, the scenarios are diverse and fine-grained, labeled samples are scarce, and labeled samples are costly to obtain; this is the situation faced in industrial applications. Given this situation, under the model-training approach, two popular research directions are building labeled samples quickly and developing deep models that need fewer samples or tolerate noisier samples. Under the extraction-rule approach, the popular research direction is the rapid mining and construction of extraction rule sets.

Among current text information extraction methods, those based on model training require a large number of labeled samples. Although some deep models show increasingly high accuracy and a trend toward needing fewer labeled samples, a certain number of labeled samples is still required to train a usable model, and no work can proceed until the samples are available. Such a process effectively shifts the development cost onto sample labeling, and overall development efficiency remains low.

Methods based on extraction rules do not require samples to be labeled manually, but the extraction rules generally need extensive tuning based on domain knowledge, and a fully rule-based system may need a rule set of tens of thousands of rules. To reduce the effort of developing rule sets, the mining and automatic generation of rule sets has become a popular research direction.

Snorkel is an approach that goes from rules to a model; however, it depends heavily on the accuracy of the rule set, and its rules are not generated automatically.

Summary of the invention

The present invention combines the ideas of extraction rules and model training and proposes an information extraction solution for conditions with few labeled samples, which obtains an extraction model of relatively high accuracy without manual intervention.

The object of the present invention is achieved through the following technical solution: a text entity extraction method, comprising the following steps:

(1) Automatic mining of the rule set, comprising the following sub-steps:

(1.1) performing phrase segmentation on a large corpus to obtain noun phrases;

(1.2) performing entity and entity type recognition on the noun phrases by distant supervision;

(1.3) mining frequently occurring sequence patterns from the results of entity and entity type recognition; in a sequence pattern, if a noun phrase in the original corpus is recognized as an entity, the noun phrase in the sequence pattern is replaced with the entity type of that noun phrase;

(1.4) aggregating synonymous sequence patterns according to the entity types contained in the sequence patterns, to obtain a sequence pattern subset A for each semantic meaning;

(1.5) adjusting the hierarchy level of the entity types in the sequence pattern subset A corresponding to each semantic meaning: in the result of sequence pattern aggregation, the entity type hierarchy levels in each subset A are counted, and the most frequent level is taken as the entity type level of that subset A;

(1.6) for each entity type, finding the sequence patterns containing that type in each subset A, to obtain the sequence pattern subset B corresponding to that entity type;

(2) Generating labeled data: the sequence pattern subset B corresponding to each entity type is used as the input of Snorkel to predict the labels of the samples, i.e. their entity types; each label carries a confidence;

(3) Training the entity extraction regression model: the entity extraction regression model is trained with the confidence-bearing labels, and the trained regression model is used to predict on the corpus to obtain entity recognition results;

(4) Returning to step (1): the trained entity extraction regression model is used to predict on the corpus again, and the results are used to correct the phrase segmentation and distant-supervision entity recognition results obtained in step (1); the remaining steps are then carried out again to obtain a new entity extraction regression model and new entity recognition results; this process is repeated until the entity results obtained in step (3) are consistent with those of the previous iteration.

Further, in step (1.1), phrase segmentation is performed using the AutoPhrase method to obtain noun phrases.

Further, in step (1.3), frequently occurring sequence patterns are mined from the results of entity and entity type recognition using the PrefixSpan method.

Further, in step (1.4), the aggregation is performed as follows: a graph structure is built over the set of sequence patterns, in which each vertex is a sequence pattern; the edge between two patterns is defined by three features: the number of entity types the two patterns share, the number of context words they share, and the number of identical entity extraction results; a regression model trained on these three features assigns a weight to each edge, and a clique detection algorithm is used to obtain subgraphs, i.e. the sequence pattern subsets.

The beneficial effects of the present invention are as follows. The invention exploits the redundancy and repetition of information in a large corpus. It first obtains noisy entities by phrase segmentation and distant supervision, then mines the contextual sequence patterns (rules) of those entities, so that the input rules for Snorkel are derived automatically; relying on Snorkel's tolerance of noisy labels, it obtains results of better quality than distant supervision. The model and the results are corrected iteratively, which gradually removes noise and yields more reliable sequence patterns. The invention uses no labeled samples, which saves manual labor; the input rules for Snorkel are derived automatically; and by combining distant supervision, rule mining, Snorkel, and the iterative process, it progressively improves the results and removes noise, giving results better than distant supervision.

Brief description of the drawings

Fig. 1 is a flowchart of the method of the present invention.

Detailed description of the embodiments

To make the objects, technical solutions, and advantages of the embodiments of the present invention clearer, the technical solutions in the embodiments are described clearly and completely below with reference to the drawings. Obviously, the described embodiments are some, rather than all, of the embodiments of the present invention. All other embodiments obtained by those of ordinary skill in the art based on the embodiments of the present invention without creative effort fall within the scope of protection of the present invention.

In few-sample scenarios, the present invention automatically mines a rule set from a large number of unlabeled samples, uses Snorkel to manage the rule set and generate a large amount of labeled data that contains noise and carries confidence values, and finally uses these data to train the entity extraction regression model.

As shown in Fig. 1, the text entity extraction method proposed by the present invention specifically comprises the following steps:

1. Automatic mining of the rule set

On a large corpus, phrase segmentation is first performed using the AutoPhrase method [AutoPhrase: Automated Phrase Mining from Massive Text Corpora] to obtain noun phrases.

Entity and entity type recognition is then performed on the noun phrases by distant supervision (for English medical text, the MetaMap tool can give better results).
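As a rough illustration of distant supervision on noun phrases, the sketch below looks each phrase up in a small dictionary. The knowledge base contents, the type names, and the function `distant_label` are hypothetical; in practice a resource such as MetaMap would take the place of the dictionary, and the labels it produces are expected to be noisy.

```python
# Hypothetical knowledge base: lower-cased surface form -> entity type.
KNOWLEDGE_BASE = {
    "aspirin": "$MEDICINE",
    "headache": "$DISEASE",
}


def distant_label(noun_phrases, kb=KNOWLEDGE_BASE):
    """Return {phrase: entity_type} for the phrases found in the knowledge base."""
    labels = {}
    for phrase in noun_phrases:
        entity_type = kb.get(phrase.lower())
        if entity_type is not None:
            labels[phrase] = entity_type
    return labels


# Phrases that are absent from the knowledge base stay unlabeled.
labels = distant_label(["Aspirin", "headache", "clinical trial"])
```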

From the results of entity and entity type recognition, frequently occurring sequence patterns are mined using the PrefixSpan method [PrefixSpan: Mining Sequential Patterns Efficiently by Prefix-Projected Pattern Growth]. A sequence pattern is an ordinary regular-expression template extended with entity types, for example: ($MEDICINE) may be helpful for ($DISEASE), where ($MEDICINE) and ($DISEASE) denote the drug and disease entity types respectively, and the corresponding positions in the sequence pattern can be filled by any drug or any disease. In a sequence pattern, if a noun phrase in the original corpus is recognized as an entity, the noun phrase is replaced with its entity type, which improves the generality of the sequence pattern.
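The replacement of entities by their types, followed by frequent pattern mining, can be sketched as below. This is a minimal PrefixSpan over sequences of single tokens, written for illustration only; the function names, the toy corpus, and the entity dictionary are assumptions, and the patterns it returns are subsequences, so tokens of a pattern need not be adjacent in the text.

```python
from collections import defaultdict


def generalize(tokens, entity_types):
    """Replace each token recognized as an entity with its entity type."""
    return [entity_types.get(tok, tok) for tok in tokens]


def prefixspan(sequences, min_support):
    """Return {pattern_tuple: support} for all frequent subsequences."""
    results = {}

    def mine(prefix, projected):
        # projected holds (sequence index, start offset) pairs.
        support = defaultdict(set)
        for seq_id, start in projected:
            for item in set(sequences[seq_id][start:]):
                support[item].add(seq_id)
        for item, seq_ids in support.items():
            if len(seq_ids) < min_support:
                continue
            pattern = prefix + (item,)
            results[pattern] = len(seq_ids)
            # Project each sequence past the first occurrence of the item.
            next_projected = []
            for seq_id, start in projected:
                seq = sequences[seq_id]
                if item in seq[start:]:
                    next_projected.append((seq_id, seq.index(item, start) + 1))
            mine(pattern, next_projected)

    mine((), [(i, 0) for i in range(len(sequences))])
    return results


# Hypothetical distant-supervision output and toy corpus.
entity_types = {"aspirin": "$MEDICINE", "ibuprofen": "$MEDICINE",
                "headache": "$DISEASE", "fever": "$DISEASE"}
corpus = ["aspirin may be helpful for headache".split(),
          "ibuprofen may be helpful for fever".split()]
sequences = [generalize(sentence, entity_types) for sentence in corpus]
patterns = prefixspan(sequences, min_support=2)
full_pattern = ("$MEDICINE", "may", "be", "helpful", "for", "$DISEASE")
```

Without the type replacement, the two sentences would share only "may be helpful for"; after it, the whole template reaches the support threshold.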

According to the entity types contained in the sequence patterns, synonymous sequence patterns are aggregated to obtain a sequence pattern subset A for each semantic meaning; the patterns in each subset A express the same meaning. Synonymous sequence patterns are sequence patterns that express the same meaning; for example, the two sequence patterns "$Person's age is $Digit" and "$Person, $Digit" both express the meaning "a person's age is a number". The aggregation is performed as follows: a graph structure is built over the set of sequence patterns, in which each vertex is a sequence pattern; the edge between two patterns is defined by three features: the number of entity types the two patterns share, the number of context words they share, and the number of identical entity extraction results. A regression model trained on these three features assigns a weight to each edge, and a clique detection algorithm [A procedure for clique detection using the group matrix] is used to obtain subgraphs, i.e. the sequence pattern subsets A. For the sequence patterns "$Country president $Politician" and "president $Politician of $Country", the entity types shared by the two patterns are $Country and $Politician, so the number of shared entity types is 2; the shared context word is "president", so the number of shared context words is 1; the number of identical entity extraction results is the number of entities extracted from the corpus by these two sequence patterns, for example, when extracting entities of type $Politician, the number of $Politician entities extracted is counted.
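The three edge features and the clique step can be sketched as follows, under several assumptions that are not from the source: the third feature is read as the number of entities extracted by both patterns, `edge_weight` is a hypothetical unweighted sum standing in for the trained regression model, the threshold of 2 is arbitrary, and Bron-Kerbosch enumeration is swapped in for the cited group-matrix procedure.

```python
from itertools import combinations


def edge_features(p1, p2, extractions):
    """Shared entity types, shared context words, shared extracted entities."""
    t1, t2 = p1.split(), p2.split()
    types1 = {t for t in t1 if t.startswith("$")}
    types2 = {t for t in t2 if t.startswith("$")}
    words1, words2 = set(t1) - types1, set(t2) - types2
    shared = extractions.get(p1, set()) & extractions.get(p2, set())
    return (len(types1 & types2), len(words1 & words2), len(shared))


def edge_weight(features):
    # Hypothetical stand-in for the trained regression model.
    return sum(features)


def maximal_cliques(adjacency):
    """Bron-Kerbosch enumeration of maximal cliques (no pivoting)."""
    cliques = []

    def expand(r, p, x):
        if not p and not x:
            cliques.append(r)
            return
        for v in list(p):
            expand(r | {v}, p & adjacency[v], x & adjacency[v])
            p = p - {v}
            x = x | {v}

    expand(set(), set(adjacency), set())
    return cliques


p1 = "$Country president $Politician"
p2 = "president $Politician of $Country"
p3 = "$Person , $Digit"
# Hypothetical extraction results per pattern.
extractions = {p1: {"Macron", "Biden"}, p2: {"Macron"}, p3: {"Alice"}}

adjacency = {p: set() for p in (p1, p2, p3)}
for a, b in combinations(adjacency, 2):
    if edge_weight(edge_features(a, b, extractions)) >= 2:
        adjacency[a].add(b)
        adjacency[b].add(a)
subsets_a = maximal_cliques(adjacency)
```

Here the two "president" patterns end up in one subset A and the age pattern forms its own, matching the feature counts given in the text (2 shared types, 1 shared context word).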

The hierarchy level of the entity types in the sequence pattern subset A corresponding to each semantic meaning is then adjusted. For example, the $Location type has subtypes such as $Country, $State, and $City, so entity type recognition may assign entity types at different hierarchy levels to different noun phrases. In the result of sequence pattern aggregation, the entity type levels in each subset A are counted, and the most frequent level is taken as the entity type level of that subset A.
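One possible reading of the level adjustment is sketched below: count the hierarchy level of every entity type in a subset A, pick the most frequent level, and lift more specific types up to it. The `PARENT` table and all function names are hypothetical, and the sketch only generalizes types upward; it does not attempt to specialize a type that sits above the majority level.

```python
from collections import Counter

# Hypothetical type hierarchy: child type -> parent type.
PARENT = {"$Country": "$Location", "$State": "$Location", "$City": "$Location"}


def level(entity_type):
    """Depth of a type in the hierarchy (root types have level 0)."""
    depth = 0
    while entity_type in PARENT:
        entity_type = PARENT[entity_type]
        depth += 1
    return depth


def lift(entity_type, target_level):
    """Walk up the hierarchy until the type is at or above target_level."""
    while level(entity_type) > target_level:
        entity_type = PARENT[entity_type]
    return entity_type


def unify_levels(subset_a):
    """Rewrite every pattern in subset A to the most frequent type level."""
    types = [tok for pattern in subset_a
             for tok in pattern.split() if tok.startswith("$")]
    majority_level = Counter(level(t) for t in types).most_common(1)[0][0]
    return [" ".join(lift(tok, majority_level) if tok.startswith("$") else tok
                     for tok in pattern.split())
            for pattern in subset_a]


unified = unify_levels(["lives in $Location", "lives in $City",
                        "resides in $Location"])
```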

The above process yields the sequence pattern subset A corresponding to each semantic meaning. For each entity type, the sequence patterns containing that type are found in each subset A, giving the sequence pattern subset B corresponding to that entity type.
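Regrouping the subsets A by entity type amounts to building an inverted index from type to patterns. A minimal sketch, with a hypothetical function name and toy input:

```python
from collections import defaultdict


def build_subsets_b(subsets_a):
    """Map each entity type to the patterns, across all subsets A, that contain it."""
    subsets_b = defaultdict(list)
    for subset in subsets_a:
        for pattern in subset:
            types = {tok for tok in pattern.split() if tok.startswith("$")}
            for entity_type in sorted(types):
                subsets_b[entity_type].append(pattern)
    return dict(subsets_b)


subsets_a = [
    ["$Country president $Politician", "president $Politician of $Country"],
    ["$Person , $Digit"],
]
subsets_b = build_subsets_b(subsets_a)
```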

2. Generating labeled data

The sequence pattern subset B corresponding to each entity type is used as the input of Snorkel to predict the labels of the samples, i.e. their entity types; each label carries a confidence.
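To show how sequence patterns can act as labeling rules that produce a label with a confidence, the sketch below turns each pattern into a function that either votes for an entity type or abstains, and then combines the votes. It deliberately does not call Snorkel: the plain majority vote used here is a simplified stand-in for Snorkel's learned label model, and every name in the code is hypothetical.

```python
import re
from collections import Counter

ABSTAIN = None


def make_labeling_function(pattern, entity_type):
    """Turn one sequence pattern into a function that votes entity_type or abstains."""
    def labeling_function(sentence, phrase):
        parts = []
        for tok in pattern.split():
            if tok == entity_type:
                parts.append(re.escape(phrase))   # slot for the candidate phrase
            elif tok.startswith("$"):
                parts.append(r"\S+")              # any other entity slot
            else:
                parts.append(re.escape(tok))      # literal context word
        return entity_type if re.search(r"\s+".join(parts), sentence) else ABSTAIN
    return labeling_function


def vote(labeling_functions, sentence, phrase):
    """Majority vote over non-abstaining functions; confidence is the vote share."""
    votes = [lf(sentence, phrase) for lf in labeling_functions]
    votes = [v for v in votes if v is not ABSTAIN]
    if not votes:
        return ABSTAIN, 0.0
    label, count = Counter(votes).most_common(1)[0]
    return label, count / len(votes)


# Hypothetical subsets B for two entity types.
shared_patterns = ["$Country president $Politician",
                   "president $Politician of $Country"]
subsets_b = {"$Politician": shared_patterns, "$Country": shared_patterns}
lfs = [make_labeling_function(p, t) for t, ps in subsets_b.items() for p in ps]

sentence = "France president Macron spoke today"
macron_label = vote(lfs, sentence, "Macron")
france_label = vote(lfs, sentence, "France")
```

A phrase that no pattern matches receives no label, which is how abstention keeps unsupported candidates out of the training data.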

3. Training the entity extraction regression model

The entity extraction regression model is trained with the confidence-bearing labels, and the trained regression model is used to predict on the corpus to obtain entity recognition results.
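The source does not say which model is trained or how the confidence enters training. As one illustrative possibility, the sketch below fits a logistic regression by stochastic gradient descent and scales each sample's gradient by its label confidence, so that low-confidence labels influence the model less. The features, data, and hyperparameters are all made up for the example.

```python
import math


def train_weighted_logistic(features, labels, confidences, epochs=500, lr=0.5):
    """Logistic regression trained by SGD, each sample weighted by its confidence."""
    weights = [0.0] * len(features[0])
    bias = 0.0
    for _ in range(epochs):
        for x, y, c in zip(features, labels, confidences):
            z = bias + sum(w * xi for w, xi in zip(weights, x))
            p = 1.0 / (1.0 + math.exp(-z))
            grad = c * (p - y)  # confidence scales the gradient of the log loss
            weights = [w - lr * grad * xi for w, xi in zip(weights, x)]
            bias -= lr * grad
    return weights, bias


def predict(weights, bias, x):
    """Return 1 if the candidate is predicted to be an entity, else 0."""
    z = bias + sum(w * xi for w, xi in zip(weights, x))
    return 1 if z > 0 else 0


# Hypothetical binary features per candidate: [is_capitalized, follows_title_word].
features = [[1, 1], [1, 0], [0, 0], [0, 1]]
labels = [1, 1, 0, 0]                 # labels produced by the labeling step
confidences = [0.9, 0.6, 0.8, 0.7]    # confidence attached to each label
weights, bias = train_weighted_logistic(features, labels, confidences)
predictions = [predict(weights, bias, x) for x in features]
```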

4. Return to step 1: the trained entity extraction regression model is used to predict on the corpus again, and the results are used to correct the phrase segmentation and distant-supervision entity recognition results obtained in step 1; the remaining steps are then carried out again to obtain a new entity extraction regression model and new entity recognition results. This process is repeated until the entity results obtained in step 3 are consistent with those of the previous iteration.
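The stopping rule of the loop, repeat until the entity results no longer change, can be sketched as a fixed-point iteration. The callable `run_pipeline` stands for one full pass of steps 1 to 3 and is hypothetical, as is the toy pipeline used to exercise it; the `max_rounds` cap is an added safeguard that the source does not mention.

```python
def iterate_until_stable(corpus, initial_entities, run_pipeline, max_rounds=20):
    """Repeat the pipeline until the recognized entities stop changing."""
    entities = initial_entities
    for round_number in range(1, max_rounds + 1):
        new_entities = run_pipeline(corpus, entities)
        if new_entities == entities:
            return entities, round_number
        entities = new_entities
    return entities, max_rounds


def toy_pipeline(corpus, entities):
    # Hypothetical stand-in for steps 1 to 3: keep entities that occur in the corpus.
    return {e for e in entities if e in corpus}


final_entities, rounds = iterate_until_stable(
    "France president Macron spoke today",
    {"Macron", "France", "Berlin"},
    toy_pipeline,
)
```

In this toy run the noisy entity "Berlin" is dropped in the first round, and the second round confirms that the result is stable.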

Based on the description, drawings, and claims provided by the present invention, those skilled in the art can readily make various changes and modifications without departing from the spirit and scope of the invention as defined by the claims. Any modification or equivalent variation made to the above embodiments in accordance with the technical idea and substance of the present invention falls within the scope of protection of the claims of the present invention.

Claims

1. A text entity extraction method, characterized in that the method comprises the following steps:

(1) Automatic mining of the rule set, comprising the following sub-steps:

(1.1) performing phrase segmentation on a large corpus to obtain noun phrases;

(1.2) performing entity and entity type recognition on the noun phrases by distant supervision;

(1.3) mining frequently occurring sequence patterns from the results of entity and entity type recognition; in a sequence pattern, if a noun phrase in the original corpus is recognized as an entity, the noun phrase in the sequence pattern is replaced with the entity type of that noun phrase;

(1.4) aggregating synonymous sequence patterns according to the entity types contained in the sequence patterns, to obtain a sequence pattern subset A for each semantic meaning;

(1.5) adjusting the hierarchy level of the entity types in the sequence pattern subset A corresponding to each semantic meaning: in the result of sequence pattern aggregation, the entity type hierarchy levels in each subset A are counted, and the most frequent level is taken as the entity type level of that subset A;

(1.6) for each entity type, finding the sequence patterns containing that type in each subset A, to obtain the sequence pattern subset B corresponding to that entity type;

(2) Generating labeled data: the sequence pattern subset B corresponding to each entity type is used as the input of Snorkel to predict the labels of the samples, i.e. their entity types; each label carries a confidence;

(3) Training the entity extraction regression model: the entity extraction regression model is trained with the confidence-bearing labels, and the trained regression model is used to predict on the corpus to obtain entity recognition results;

(4) Returning to step (1): the trained entity extraction regression model is used to predict on the corpus again, and the results are used to correct the phrase segmentation and distant-supervision entity recognition results obtained in step (1); the remaining steps are then carried out again to obtain a new entity extraction regression model and new entity recognition results; this process is repeated until the entity results obtained in step (3) are consistent with those of the previous iteration.

2. The text entity extraction method according to claim 1, characterized in that in step (1.1), phrase segmentation is performed using the AutoPhrase method to obtain noun phrases.

3. The text entity extraction method according to claim 1, characterized in that in step (1.3), frequently occurring sequence patterns are mined from the results of entity and entity type recognition using the PrefixSpan method.

4. The text entity extraction method according to claim 1, characterized in that in step (1.4), the aggregation is performed as follows: a graph structure is built over the set of sequence patterns, in which each vertex is a sequence pattern; the edge between two patterns is defined by three features: the number of entity types the two patterns share, the number of context words they share, and the number of identical entity extraction results; a regression model trained on these three features assigns a weight to each edge, and a clique detection algorithm is used to obtain subgraphs, i.e. the sequence pattern subsets.