CN110188359A - Text entity extraction method - Google Patents
- Publication number
- CN110188359A (application CN201910472799.7A)
- Authority
- CN
- China
- Prior art keywords
- entity
- sequence pattern
- result
- subset
- entity type
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/20—Natural language analysis
- G06F40/279—Recognition of textual entities
- G06F40/289—Phrasal analysis, e.g. finite state techniques or chunking
- G06F40/295—Named entity recognition
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/30—Semantic analysis
Abstract
The invention discloses a text entity extraction method that exploits the redundancy and repetition of information in large corpora. Noisy entities are first obtained by phrase segmentation and distant supervision; the context sequence patterns (rules) of these entities are then mined, so that the input rules for Snorkel are derived automatically. Because Snorkel tolerates noisy labels, the results are of better quality than those of distant supervision alone. The model and the results are corrected cyclically, which gradually removes noise and yields more reliable sequence patterns. The invention uses no labeled samples, which saves manual effort; the input rules for Snorkel are derived automatically; and the combination of distant supervision, rule mining, Snorkel, and the iterative process progressively improves the results, removes noise, and raises extraction quality.
Description
Technical field
The present invention relates to the technical field of natural language processing, and in particular to a few-sample text entity extraction method.
Background
In application scenarios of text information extraction, the status quo faced in industrial applications is that scenarios are diverse and fine-grained, labeled samples are scarce, and labeled samples are costly to obtain. Given this situation, under the model-training approach, two popular research directions are the rapid creation of labeled samples and deep models that need fewer samples or tolerate noisier samples; under the extraction-rule approach, the popular research direction is the rapid mining and construction of extraction rule sets.
Among current text information extraction methods, those based on model training need a large number of labeled samples. Although some deep models show a trend of ever higher accuracy with ever fewer labeled samples, a certain amount of labeled samples is still required to train a usable model, and no work can begin before the samples are available. Such a process effectively shifts the development cost onto sample labeling, so overall development efficiency remains low.
Methods based on extraction rules do not require samples to be labeled manually, but the rules usually need extensive tuning on the basis of domain knowledge, and a purely rule-based system may need a rule set of tens of thousands of rules. To lighten the burden of developing rule sets, the mining and automatic generation of rule sets has become a hot research direction.
Snorkel offers a route from rules to a model; however, it depends heavily on the accuracy of the rule set, and the rules are not generated automatically.
Summary of the invention
Combining the ideas of extraction rules and model training, the present invention proposes an information extraction solution for conditions with few labeled samples, in which an extraction model of relatively high accuracy can be obtained without manual intervention.
The object of the present invention is achieved through the following technical solution: a text entity extraction method, comprising the following steps:
(1) Automatic mining of the rule set, comprising the following sub-steps:
(1.1) Perform phrase segmentation on a large corpus to obtain noun phrases;
(1.2) Perform entity and entity type recognition on the noun phrases by distant supervision;
(1.3) Mine frequently occurring sequence patterns from the results of entity and entity type recognition; in a sequence pattern, if a noun phrase in the original corpus has been recognized as an entity, replace that noun phrase in the sequence pattern with its entity type;
(1.4) Aggregate synonymous sequence patterns according to the entity types they contain, obtaining a sequence pattern subset A for each semantic meaning;
(1.5) Adjust the level of the entity types in the sequence pattern subset A of each semantic meaning: in the aggregation result, count the entity type levels within each subset A and take the level with the highest count as the entity type level of that subset;
(1.6) For each entity type, find the sequence patterns containing that type in every subset A, obtaining the sequence pattern subset B corresponding to that entity type;
(2) Generate labeled data: use the sequence pattern subset B of each entity type as the input to Snorkel and predict the label of each sample, i.e. its entity type, together with a confidence;
(3) Train the entity extraction regression model: train the entity extraction regression model on the labels with confidences, and use the trained regression model to predict on the corpus, obtaining entity recognition results;
(4) Return to step (1): predict on the corpus again with the trained entity extraction regression model, use the results to correct the phrase segmentation and distant-supervision entity recognition results obtained in step (1), and continue with the remaining steps to obtain a new entity extraction regression model and new entity recognition results; repeat this process until the entity results obtained in step (3) agree with those of the previous iteration.
Further, in step (1.1), phrase segmentation is performed with the AutoPhrase method to obtain noun phrases.
Further, in step (1.3), the PrefixSpan method is used to mine frequently occurring sequence patterns from the results of entity and entity type recognition.
Further, in step (1.4), the aggregation is performed as follows: a graph is built over the set of sequence patterns, in which each vertex is a sequence pattern and the edge between two patterns is defined by three features: the number of entity types the two patterns share, the number of context words they share, and the number of identical entity extraction results. A regression model trained on these three features assigns a weight to each edge, and a clique detection algorithm then yields subgraphs, i.e. sequence pattern subsets.
The beneficial effects of the present invention are as follows. The invention exploits the redundancy and repetition of information in large corpora: noisy entities are first obtained by phrase segmentation and distant supervision, and the context sequence patterns (rules) of these entities are then mined, so that the input rules for Snorkel are derived automatically. Because Snorkel tolerates noisy labels, the results are of better quality than those of distant supervision alone. The model and the results are corrected cyclically, which gradually removes noise and yields more reliable sequence patterns. The invention uses no labeled samples, which saves manual effort; the input rules for Snorkel are derived automatically; and the combination of distant supervision, rule mining, Snorkel, and the iterative process progressively improves the results and removes noise, so that the final results are better than those of distant supervision.
Brief description of the drawings
Fig. 1 is a flow chart of the method of the present invention.
Detailed description of the embodiments
To make the objects, technical solutions, and advantages of the embodiments of the present invention clearer, the technical solutions in the embodiments are described clearly and completely below with reference to the accompanying drawings. Obviously, the described embodiments are some, but not all, of the embodiments of the present invention. All other embodiments obtained by a person of ordinary skill in the art on the basis of these embodiments without creative effort fall within the scope of protection of the present invention.
In a few-sample scenario, the present invention automatically mines a rule set from a large number of unlabeled samples, manages the rule set with Snorkel to generate a large amount of noisy labeled data with confidences, and finally uses these data to train the entity extraction regression model.
As shown in Fig. 1, the text entity extraction method proposed by the present invention comprises the following steps:
Step one: automatic mining of the rule set
On a large corpus, phrase segmentation is first performed with the AutoPhrase method [AutoPhrase: Automated Phrase Mining from Massive Text Corpora] to obtain noun phrases;
Entity and entity type recognition is then performed on the noun phrases by distant supervision (for English medical text, the MetaMap tool can give better results);
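The distant supervision step can be illustrated with a minimal sketch. It assumes that distant supervision is realized as a lookup in an existing dictionary or knowledge base; the `KNOWLEDGE_BASE` contents and the `distant_label` function are hypothetical and are not taken from the patent or from MetaMap.

```python
from typing import Dict, List, Optional, Tuple

# Hypothetical knowledge base: lower-cased surface form -> entity type.
KNOWLEDGE_BASE: Dict[str, str] = {
    "aspirin": "$MEDICINE",
    "headache": "$DISEASE",
}

def distant_label(noun_phrases: List[str]) -> List[Tuple[str, Optional[str]]]:
    """Pair each noun phrase with its entity type, or None if it is unknown."""
    return [(phrase, KNOWLEDGE_BASE.get(phrase.lower())) for phrase in noun_phrases]

labels = distant_label(["Aspirin", "headache", "the weather"])
```

Phrases that the dictionary does not cover stay unlabeled, and ambiguous dictionary entries produce wrong labels, which is why the labels from this step are treated as noisy.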
On the results of entity and entity type recognition, the PrefixSpan method [PrefixSpan: Mining Sequential Patterns Efficiently by Prefix-Projected Pattern Growth] is used to mine frequently occurring sequence patterns. A sequence pattern is an ordinary regular-expression template extended with entity types, for example: ($MEDICINE) may be helpful for ($DISEASE), where ($MEDICINE) and ($DISEASE) denote the drug and disease entity types respectively, and the corresponding position in the pattern may be filled by any drug or disease. In a sequence pattern, if a noun phrase in the original corpus has been recognized as an entity, that noun phrase is replaced in the pattern by its entity type, which improves the generalization of the sequence pattern.
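The type substitution and the pattern mining can be sketched as follows. This is a simplified illustration, assuming whitespace tokenization and single-token entities; `generalize` and `prefixspan` are hypothetical helper names, and the miner is a compact prefix-projection routine for sequences of single items rather than a full PrefixSpan implementation. Note that it allows gaps between items, whereas the templates in the text read as contiguous.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

def generalize(tokens: List[str], entity_types: Dict[str, str]) -> List[str]:
    """Replace every token recognized as an entity with its entity type."""
    return [entity_types.get(tok, tok) for tok in tokens]

def prefixspan(sequences: List[List[str]], min_support: int) -> Dict[Tuple[str, ...], int]:
    """Return each sequential pattern found in at least min_support sequences."""
    found: Dict[Tuple[str, ...], int] = {}

    def grow(prefix: Tuple[str, ...], projected: List[Tuple[int, int]]) -> None:
        # projected holds one (sequence index, start offset) pair per supporting sequence
        occurrences = defaultdict(list)
        for seq_id, start in projected:
            seen = set()
            for pos in range(start, len(sequences[seq_id])):
                item = sequences[seq_id][pos]
                if item not in seen:  # only the first occurrence per sequence counts
                    seen.add(item)
                    occurrences[item].append((seq_id, pos + 1))
        for item, next_projected in occurrences.items():
            if len(next_projected) >= min_support:
                pattern = prefix + (item,)
                found[pattern] = len(next_projected)
                grow(pattern, next_projected)

    grow((), [(i, 0) for i in range(len(sequences))])
    return found

# Hypothetical recognition results and a two-sentence toy corpus.
types = {"aspirin": "$MEDICINE", "ibuprofen": "$MEDICINE",
         "headache": "$DISEASE", "fever": "$DISEASE"}
corpus = ["aspirin may be helpful for headache".split(),
          "ibuprofen may be helpful for fever".split()]
patterns = prefixspan([generalize(s, types) for s in corpus], min_support=2)
```

Without the substitution the two sentences would share only the words "may be helpful for"; after it, the full template including both type slots is supported by both sentences.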
According to the entity types contained in the sequence patterns, synonymous sequence patterns are aggregated to obtain a sequence pattern subset A for each semantic meaning; the patterns within a subset A express the same meaning. Synonymous sequence patterns are sequence patterns that express the same meaning: for example, the two sequence patterns "$Person's age is $Digit" and "$Person, $Digit" both express the meaning "the age of a person is a number". The aggregation is performed as follows: a graph is built over the set of sequence patterns, in which each vertex is a sequence pattern and the edge between two patterns is defined by three features: the number of entity types the two patterns share, the number of context words they share, and the number of identical entity extraction results. A regression model trained on these three features assigns a weight to each edge, and a clique detection algorithm [A procedure for clique detection using the group matrix] then yields subgraphs, i.e. the sequence pattern subsets A. For the sequence patterns "$Country president $Politician" and "president $Politician of $Country", the entity types shared by the two patterns are $Country and $Politician, so the number of shared entity types is 2; the shared context word is "president", so the number of shared context words is 1; and the number of identical entity extraction results is the number of entities extracted from the corpus with these two sequence patterns. For example, when extracting $Politician entities, the number of $Politician entities extracted is counted.
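The three edge features can be computed as in the sketch below, which reproduces the counts of the worked example. It assumes whitespace-tokenized patterns in which type slots start with `$`, and it reads "identical entity extraction results" as the entities extracted by both patterns, which is one possible interpretation. The function names and the `extracted` sets are hypothetical; the regression weighting and the clique detection are not shown.

```python
from typing import Dict, Set, Tuple

def split_pattern(pattern: str) -> Tuple[Set[str], Set[str]]:
    """Split a pattern into its entity-type slots and its plain context words."""
    tokens = pattern.split()
    slots = {t for t in tokens if t.startswith("$")}
    words = {t.lower() for t in tokens if not t.startswith("$")}
    return slots, words

def edge_features(p1: str, p2: str, extracted: Dict[str, Set[str]]) -> Dict[str, int]:
    """Compute the three features that describe the edge between two patterns."""
    slots1, words1 = split_pattern(p1)
    slots2, words2 = split_pattern(p2)
    return {
        "shared_types": len(slots1 & slots2),
        "shared_context_words": len(words1 & words2),
        # Assumed reading: entities that both patterns extract from the corpus.
        "shared_extractions": len(extracted.get(p1, set()) & extracted.get(p2, set())),
    }

a = "$Country president $Politician"
b = "president $Politician of $Country"
# Hypothetical $Politician entities extracted by each pattern.
extracted = {a: {"A. Smith", "B. Jones"}, b: {"A. Smith"}}
features = edge_features(a, b, extracted)
```

Each feature dictionary would then be one input row for the regression model that assigns the edge weight.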
The level of the entity types in the sequence pattern subset A of each semantic meaning is adjusted. For example, the $Location type has subtypes such as $Country, $State, and $City, so entity type recognition may assign entity types of different levels to different noun phrases. In the aggregation result, the entity type levels within the subset A of each semantic meaning are counted, and the level with the highest count is taken as the entity type level of that subset.
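A minimal sketch of the level adjustment follows. The `LEVEL` and `PARENT` tables are a hypothetical two-level hierarchy, and lifting deeper types to their ancestor at the majority level is an assumption; the text only states that the most frequent level is taken.

```python
from collections import Counter
from typing import List

# Hypothetical type hierarchy: depth of each type (0 is the most general) and its parent.
LEVEL = {"$Location": 0, "$Country": 1, "$State": 1, "$City": 1}
PARENT = {"$Country": "$Location", "$State": "$Location", "$City": "$Location"}

def majority_level(types_in_subset: List[str]) -> int:
    """Return the hierarchy level that occurs most often in a subset A."""
    return Counter(LEVEL[t] for t in types_in_subset).most_common(1)[0][0]

def lift(entity_type: str, level: int) -> str:
    """Move a type up the hierarchy until it is no deeper than the target level."""
    while LEVEL[entity_type] > level:
        entity_type = PARENT[entity_type]
    return entity_type

subset_types = ["$Location", "$Location", "$Country"]
level = majority_level(subset_types)
adjusted = [lift(t, level) for t in subset_types]
```

Types that are shallower than the majority level are left unchanged here, because a general type cannot be specialized without further information.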
The above process yields the sequence pattern subset A for each semantic meaning. For each entity type, the sequence patterns containing that type are found in every subset A, giving the sequence pattern subset B corresponding to that entity type.
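Building the subsets B amounts to regrouping the patterns by the entity types they contain. The sketch below assumes that each subset A is a list of whitespace-tokenized pattern strings; `build_subsets_b` is a hypothetical name.

```python
from collections import defaultdict
from typing import Dict, List

def build_subsets_b(subsets_a: List[List[str]]) -> Dict[str, List[str]]:
    """Group patterns by entity type: type -> patterns that contain that type."""
    subsets_b: Dict[str, List[str]] = defaultdict(list)
    for subset in subsets_a:
        for pattern in subset:
            for slot in {t for t in pattern.split() if t.startswith("$")}:
                if pattern not in subsets_b[slot]:  # avoid duplicate entries
                    subsets_b[slot].append(pattern)
    return dict(subsets_b)

subsets_a = [
    ["$Person 's age is $Digit", "$Person , $Digit"],
    ["$Country president $Politician"],
]
subsets_b = build_subsets_b(subsets_a)
```

A pattern with several type slots ends up in several subsets B, one per type it mentions.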
Step two: generating labeled data
The sequence pattern subset B of each entity type is used as the input to Snorkel to predict the label of each sample, i.e. its entity type; each label carries a confidence.
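The idea of turning patterns into voting rules with a confidence can be sketched in plain Python. This is a stand-in and not Snorkel's API: Snorkel's label model estimates the accuracy of each rule, whereas the sketch uses an unweighted vote. The `pattern_rule` and `vote` helpers and the three example rules are hypothetical.

```python
import re
from collections import Counter
from typing import Callable, List, Optional, Tuple

ABSTAIN = None  # a rule returns this when its pattern does not match

Rule = Callable[[str, str], Optional[str]]

def pattern_rule(template: str, label: str) -> Rule:
    """Build a rule that votes `label` when `template`, with {m} replaced by the mention, matches."""
    def rule(sentence: str, mention: str) -> Optional[str]:
        regex = template.format(m=re.escape(mention))
        return label if re.search(regex, sentence) else ABSTAIN
    return rule

def vote(rules: List[Rule], sentence: str, mention: str) -> Tuple[Optional[str], float]:
    """Return the majority label and the share of non-abstaining rules that agree with it."""
    votes = [v for v in (r(sentence, mention) for r in rules) if v is not ABSTAIN]
    if not votes:
        return ABSTAIN, 0.0
    label, count = Counter(votes).most_common(1)[0]
    return label, count / len(votes)

rules = [
    pattern_rule(r"{m} may be helpful for", "$MEDICINE"),
    pattern_rule(r"helpful for {m}", "$DISEASE"),
    pattern_rule(r"take {m}", "$MEDICINE"),
]
sentence = "aspirin may be helpful for headache"
drug = vote(rules, sentence, "aspirin")
disease = vote(rules, sentence, "headache")
```

In the method itself, the rules would come from the subset B of each entity type, and the confidence would come from Snorkel rather than from a raw vote share.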
Step three: training the entity extraction regression model
The entity extraction regression model is trained on the labels with confidences, and the trained regression model is used to predict on the corpus, giving the entity recognition results.
Step four: return to step one. The trained entity extraction regression model predicts on the corpus again, and the results are used to correct the phrase segmentation and distant-supervision entity recognition results obtained in step one; the remaining steps are then carried out to obtain a new entity extraction regression model and new entity recognition results. This process is repeated until the entity results obtained in step three agree with those of the previous iteration.
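The outer loop is a fixed-point iteration over the entity results. The sketch below assumes that one full pass through steps one to three is wrapped in a single callable; `iterate_until_stable`, `run_round`, and the `max_rounds` safety cap are hypothetical and not part of the described method, and `toy_round` merely simulates a round that removes one noisy entity.

```python
from typing import Callable, Set, Tuple

Entities = Set[Tuple[str, str]]  # (mention, entity type) pairs

def iterate_until_stable(
    initial: Entities,
    run_round: Callable[[Entities], Entities],
    max_rounds: int = 20,  # assumed safety cap; the text gives no upper bound
) -> Tuple[Entities, int]:
    """Repeat the mine/label/train round until the entity results stop changing."""
    previous = initial
    for round_no in range(1, max_rounds + 1):
        current = run_round(previous)
        if current == previous:  # same result as the previous iteration: converged
            return current, round_no
        previous = current
    return previous, max_rounds

def toy_round(entities: Entities) -> Entities:
    """Stand-in for steps one to three: drops one known-noisy entity."""
    return {e for e in entities if e[0] != "the weather"}

noisy = {("aspirin", "$MEDICINE"), ("the weather", "$DISEASE")}
final, rounds = iterate_until_stable(noisy, toy_round)
```

Comparing whole result sets is the strictest stopping test; a tolerance on the number of changed entities would be a possible relaxation, though the text does not describe one.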
On the basis of the description, drawings, and claims provided by the present invention, a person skilled in the art can readily make various changes and modifications without departing from the spirit and scope of the invention as defined by the claims. Any modification or equivalent variation made to the above embodiments in accordance with the technical ideas and substance of the present invention falls within the scope of protection of the claims of the present invention.
Claims (4)
1. A text entity extraction method, characterized in that the method comprises the following steps:
(1) Automatic mining of the rule set, comprising the following sub-steps:
(1.1) Perform phrase segmentation on a large corpus to obtain noun phrases;
(1.2) Perform entity and entity type recognition on the noun phrases by distant supervision;
(1.3) Mine frequently occurring sequence patterns from the results of entity and entity type recognition; in a sequence pattern, if a noun phrase in the original corpus has been recognized as an entity, replace that noun phrase in the sequence pattern with its entity type;
(1.4) Aggregate synonymous sequence patterns according to the entity types they contain, obtaining a sequence pattern subset A for each semantic meaning;
(1.5) Adjust the level of the entity types in the sequence pattern subset A of each semantic meaning: in the aggregation result, count the entity type levels within each subset A and take the level with the highest count as the entity type level of that subset;
(1.6) For each entity type, find the sequence patterns containing that type in every subset A, obtaining the sequence pattern subset B corresponding to that entity type;
(2) Generate labeled data: use the sequence pattern subset B of each entity type as the input to Snorkel and predict the label of each sample, i.e. its entity type, together with a confidence;
(3) Train the entity extraction regression model: train the entity extraction regression model on the labels with confidences, and use the trained regression model to predict on the corpus, obtaining entity recognition results;
(4) Return to step (1): predict on the corpus again with the trained entity extraction regression model, use the results to correct the phrase segmentation and distant-supervision entity recognition results obtained in step (1), and continue with the remaining steps to obtain a new entity extraction regression model and new entity recognition results; repeat this process until the entity results obtained in step (3) agree with those of the previous iteration.
2. The text entity extraction method according to claim 1, characterized in that in step (1.1), phrase segmentation is performed with the AutoPhrase method to obtain noun phrases.
3. The text entity extraction method according to claim 1, characterized in that in step (1.3), the PrefixSpan method is used to mine frequently occurring sequence patterns from the results of entity and entity type recognition.
4. The text entity extraction method according to claim 1, characterized in that in step (1.4), the aggregation is performed as follows: a graph is built over the set of sequence patterns, in which each vertex is a sequence pattern and the edge between two patterns is defined by three features: the number of entity types the two patterns share, the number of context words they share, and the number of identical entity extraction results; a regression model trained on these three features assigns a weight to each edge, and a clique detection algorithm then yields subgraphs, i.e. sequence pattern subsets.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201910472799.7A CN110188359B (en) | 2019-05-31 | 2019-05-31 | Text entity extraction method |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201910472799.7A CN110188359B (en) | 2019-05-31 | 2019-05-31 | Text entity extraction method |
Publications (2)
Publication Number | Publication Date |
---|---|
CN110188359A true CN110188359A (en) | 2019-08-30 |
CN110188359B CN110188359B (en) | 2023-01-03 |
Family
ID=67719618
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201910472799.7A Active CN110188359B (en) | 2019-05-31 | 2019-05-31 | Text entity extraction method |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN110188359B (en) |
Cited By (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN111325350A (en) * | 2020-02-19 | 2020-06-23 | 第四范式(北京)技术有限公司 | Suspicious tissue discovery system and method |
CN113204643A (en) * | 2021-06-23 | 2021-08-03 | 北京明略软件系统有限公司 | Entity alignment method, device, equipment and medium |
CN113255356A (en) * | 2021-06-10 | 2021-08-13 | 杭州费尔斯通科技有限公司 | Entity recognition method and device based on entity word list |
CN113299375A (en) * | 2021-07-27 | 2021-08-24 | 北京好欣晴移动医疗科技有限公司 | Method, device and system for marking and identifying digital file information entity |
Citations (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20170060835A1 (en) * | 2015-08-27 | 2017-03-02 | Xerox Corporation | Document-specific gazetteers for named entity recognition |
CN106776711A (en) * | 2016-11-14 | 2017-05-31 | 浙江大学 | A kind of Chinese medical knowledge mapping construction method based on deep learning |
CN107291687A (en) * | 2017-04-27 | 2017-10-24 | 同济大学 | It is a kind of based on interdependent semantic Chinese unsupervised open entity relation extraction method |
US20190147297A1 (en) * | 2017-11-16 | 2019-05-16 | Accenture Global Solutions Limited | System for time-efficient assignment of data to ontological classes |
Non-Patent Citations (1)
Title |
---|
Liu Shaoyu et al.: "A Survey of Entity Relation Extraction Research", Journal of Information Engineering University * |
Cited By (7)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN111325350A (en) * | 2020-02-19 | 2020-06-23 | 第四范式(北京)技术有限公司 | Suspicious tissue discovery system and method |
CN111325350B (en) * | 2020-02-19 | 2023-09-29 | 第四范式(北京)技术有限公司 | Suspicious tissue discovery system and method |
CN113255356A (en) * | 2021-06-10 | 2021-08-13 | 杭州费尔斯通科技有限公司 | Entity recognition method and device based on entity word list |
CN113255356B (en) * | 2021-06-10 | 2021-09-28 | 杭州费尔斯通科技有限公司 | Entity recognition method and device based on entity word list |
CN113204643A (en) * | 2021-06-23 | 2021-08-03 | 北京明略软件系统有限公司 | Entity alignment method, device, equipment and medium |
CN113204643B (en) * | 2021-06-23 | 2021-11-02 | 北京明略软件系统有限公司 | Entity alignment method, device, equipment and medium |
CN113299375A (en) * | 2021-07-27 | 2021-08-24 | 北京好欣晴移动医疗科技有限公司 | Method, device and system for marking and identifying digital file information entity |
Also Published As
Publication number | Publication date |
---|---|
CN110188359B (en) | 2023-01-03 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN110188359A (en) | A kind of text entities abstracting method | |
CN104462053B (en) | A kind of personal pronoun reference resolution method based on semantic feature in text | |
CN106484664B (en) | Similarity calculating method between a kind of short text | |
CN106897559B (en) | A kind of symptom and sign class entity recognition method and device towards multi-data source | |
CN104268160B (en) | A kind of OpinionTargetsExtraction Identification method based on domain lexicon and semantic role | |
CN106844346A (en) | Short text Semantic Similarity method of discrimination and system based on deep learning model Word2Vec | |
CN108874878A (en) | A kind of building system and method for knowledge mapping | |
CN108595696A (en) | A kind of human-computer interaction intelligent answering method and system based on cloud platform | |
CN106372064B (en) | A kind of term weight function calculation method of text mining | |
CN105138507A (en) | Pattern self-learning based Chinese open relationship extraction method | |
CN109344250A (en) | Single diseases diagnostic message rapid structure method based on medical insurance data | |
CN105631468A (en) | RNN-based automatic picture description generation method | |
CN107632979A (en) | The problem of one kind is used for interactive question and answer analytic method and system | |
CN106649783A (en) | Synonym mining method and apparatus | |
CN107463553A (en) | For the text semantic extraction, expression and modeling method and system of elementary mathematics topic | |
CN110910283A (en) | Method, device, equipment and storage medium for generating legal document | |
CN105261358A (en) | N-gram grammar model constructing method for voice identification and voice identification system | |
CN110502644A (en) | A kind of field level dictionary excavates the Active Learning Method of building | |
CN108038205A (en) | For the viewpoint analysis prototype system of Chinese microblogging | |
CN108710611A (en) | A kind of short text topic model generation method of word-based network and term vector | |
CN106934005A (en) | A kind of Text Clustering Method based on density | |
CN106980652A (en) | Intelligent answer method and system | |
CN103678499A (en) | Data mining method based on multi-source heterogeneous patent data semantic integration | |
CN110399433A (en) | A kind of data entity Relation extraction method based on deep learning | |
CN107943786A (en) | A kind of Chinese name entity recognition method and system |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||