CN110188359A - Text entity extraction method - Google Patents

Text entity extraction method

Info

Publication number
CN110188359A
Authority
CN
China
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201910472799.7A
Other languages
Chinese (zh)
Other versions
CN110188359B (en)
Inventor
金霞
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Chengdu Firestone Creation Technology Co Ltd
Original Assignee
Chengdu Firestone Creation Technology Co Ltd

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/279 Recognition of textual entities
    • G06F40/289 Phrasal analysis, e.g. finite state techniques or chunking
    • G06F40/295 Named entity recognition
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/30 Semantic analysis

Abstract

The invention discloses a text entity extraction method. The method exploits the redundancy and repetition of information in a large corpus: it first obtains noisy entities through phrase segmentation and distant supervision, then mines the context sequence patterns (rules) of those entities, which automatically yields the input rules for Snorkel. By relying on Snorkel's tolerance of noisy labels, it obtains results of better quality than distant supervision alone. The model and the results are corrected iteratively, so that noise is removed step by step and more reliable sequence patterns are obtained. The invention uses no labeled samples, which saves manual labor; the input rules for Snorkel are derived automatically; and the combination of distant supervision, rule mining, Snorkel, and the iterative process progressively improves the results, removes noise, and raises extraction quality.

Description

Text entity extraction method
Technical field
The present invention relates to the technical field of natural language processing, and in particular to a few-sample text entity extraction method.
Background art
In text information extraction applications, the scenarios are diverse and fine-grained, labeled samples are scarce, and labeled samples are costly to acquire; this is the situation faced in industrial practice. Given this situation, under the model-training approach, rapidly building labeled samples and designing deep models that need fewer samples or tolerate noisier samples are two popular research directions. Under the extraction-rule approach, the rapid mining and construction of extraction rule sets is a popular research direction.
Among current text information extraction methods, those based on model training need a large number of labeled samples. Although some deep models show ever higher accuracy and need ever fewer labeled samples, a certain quantity of labeled samples is still required to train a usable model, and work cannot begin until those samples are available. Such a process effectively shifts the development cost onto sample labeling, and overall development efficiency remains low.
Methods based on extraction rules do not require samples to be labeled manually, but the rules usually need extensive debugging grounded in domain knowledge, and a fully rule-based system may need a rule set of tens of thousands of rules. To reduce the effort of developing rule sets, the mining and automatic generation of rule sets has become an active research direction.
Snorkel is an approach that goes from rules to a model; however, it depends heavily on the accuracy of the rule set, and it does not generate the rules automatically.
Summary of the invention
The present invention combines the extraction-rule and model-training approaches and proposes an information extraction solution for conditions with few labeled samples, in which an extraction model of relatively high accuracy can be obtained without manual intervention.
The purpose of the present invention is achieved through the following technical solution: a text entity extraction method comprising the following steps:
(1) Automatic mining of the rule set, comprising the following sub-steps:
(1.1) Perform phrase segmentation on a large corpus to obtain noun phrases;
(1.2) Perform entity and entity type recognition on the noun phrases by means of distant supervision;
(1.3) Mine frequently occurring sequence patterns from the entity and entity type recognition results; in a sequence pattern, if a noun phrase in the original corpus is recognized as an entity, replace that noun phrase in the sequence pattern with its entity type;
(1.4) According to the entity types contained in the sequence patterns, aggregate synonymous sequence patterns to obtain the sequence pattern subset A corresponding to each semantic meaning;
(1.5) Adjust the hierarchy level of the entity types in the sequence pattern subset A corresponding to each semantic meaning: in the aggregation result, count the hierarchy levels of the entity types in each subset A, and take the most frequent level as the entity type level of that subset A;
(1.6) For each entity type, find the sequence patterns containing that type in every subset A to obtain the sequence pattern subset B corresponding to that entity type;
(2) Generate labeled data: use the sequence pattern subset B corresponding to each entity type as the input of Snorkel to predict the label of each sample, i.e. its entity type, each label carrying a confidence level;
(3) Train the entity extraction regression model: train the entity extraction regression model on the labels with confidence levels, then use the trained regression model to predict on the corpus to obtain the entity recognition result;
(4) Return to step (1): predict on the corpus again with the trained entity extraction regression model, use the result to correct the phrase segmentation and distant-supervision entity recognition results obtained in step (1), and carry out the remaining steps to obtain a new entity extraction regression model and entity recognition result; repeat this process until the entity result obtained in step (3) is consistent with the result of the previous round.
Further, in step (1.1), phrase segmentation is performed with the AutoPhrase method to obtain noun phrases.
Further, in step (1.3), frequently occurring sequence patterns are mined from the entity and entity type recognition results with the PrefixSpan method.
Further, in step (1.4), the aggregation is performed as follows: a graph structure is built over the set of sequence patterns, in which each vertex is a sequence pattern and the edge between two patterns is defined by three features: the number of entity types the two patterns share, the number of context words they share, and the number of identical entity extraction results. A regression model trained on these three features assigns a weight to each edge, and a clique detection algorithm is used to obtain subgraphs, i.e. the sequence pattern subsets.
The beneficial effects of the present invention are as follows. The invention exploits the redundancy and repetition of information in a large corpus: it first obtains noisy entities through phrase segmentation and distant supervision, then mines the context sequence patterns (rules) of those entities, which automatically yields the input rules for Snorkel. By relying on Snorkel's tolerance of noisy labels, it obtains results of better quality than distant supervision. The model and the results are corrected iteratively, so that noise is removed step by step and more reliable sequence patterns are obtained. The invention uses no labeled samples, which saves manual labor; the input rules for Snorkel are derived automatically; and the combination of distant supervision, rule mining, Snorkel, and the iterative process progressively improves the results and removes noise, so that the final results are better than those of distant supervision.
Brief description of the drawings
Fig. 1 is a flow chart of the method of the present invention.
Detailed description of the embodiments
To make the objectives, technical solutions, and advantages of the embodiments of the present invention clearer, the technical solutions in the embodiments are described clearly and completely below with reference to the accompanying drawings. Obviously, the described embodiments are some, not all, of the embodiments of the present invention. All other embodiments obtained by those of ordinary skill in the art on the basis of these embodiments without creative effort fall within the protection scope of the present invention.
For the few-sample scenario, the present invention automatically mines a rule set from a large number of unlabeled samples, uses Snorkel to manage the rule set and generate a large amount of noisy labeled data with confidence levels, and finally uses these data to train an entity extraction regression model.
As shown in Fig. 1, the text entity extraction method proposed by the present invention comprises the following steps:
Step one: automatic mining of the rule set
On a large corpus, first perform phrase segmentation with the AutoPhrase method [AutoPhrase: Automated Phrase Mining from Massive Text Corpora] to obtain noun phrases;
Perform entity and entity type recognition on the noun phrases by means of distant supervision (for English medical text, the MetaMap tool gives better results);
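The distant-supervision step above can be illustrated with a minimal sketch. It assumes that the knowledge source is a plain dictionary from surface forms to entity types; the dictionary contents, the function name `distant_label`, and the lowercase matching rule are all hypothetical and are not taken from the patent, which leaves the knowledge source open (MetaMap is only mentioned as one option for English medical text).

```python
# Hypothetical knowledge base: surface form -> entity type.
KNOWLEDGE_BASE = {
    "aspirin": "$MEDCINE",
    "ibuprofen": "$MEDCINE",
    "headache": "$DISEASE",
    "fever": "$DISEASE",
}


def distant_label(noun_phrases, knowledge_base=KNOWLEDGE_BASE):
    """Return (phrase, entity_type) pairs; entity_type is None when nothing matches."""
    labeled = []
    for phrase in noun_phrases:
        # Assumed matching rule: exact match after lowercasing and trimming.
        entity_type = knowledge_base.get(phrase.lower().strip())
        labeled.append((phrase, entity_type))
    return labeled


phrases = ["Aspirin", "headache", "clinical trial"]
labels = distant_label(phrases)
```

Labels produced this way are expected to be noisy and incomplete, which is why the later steps do not treat them as ground truth.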
In the entity and entity type recognition results, mine frequently occurring sequence patterns with the PrefixSpan method [PrefixSpan: Mining Sequential Patterns Efficiently by Prefix-Projected Pattern Growth]. A sequence pattern is an ordinary regular-expression template augmented with entity types, for example: ($MEDCINE) may be helpful for ($DISEASE), where ($MEDCINE) and ($DISEASE) denote the drug and disease entity types respectively, and the corresponding positions in the sequence pattern can be filled by any drug or disease. In a sequence pattern, if a noun phrase in the original corpus is recognized as an entity, that noun phrase is replaced in the pattern by its entity type, which improves the generalization of the sequence pattern.
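The two operations described above, replacing recognized entities with their types and then mining frequent sequences, can be sketched as follows. This is a simplified prefix-projection miner over single-token items written for illustration; the function names, the toy corpus, the `min_support` and `max_len` parameters, and the one-token-per-entity simplification are assumptions, not details from the patent.

```python
from collections import defaultdict


def generalize(tokens, entity_types):
    """Replace each token recognized as an entity with its entity type."""
    return [entity_types.get(tok, tok) for tok in tokens]


def prefixspan(sequences, min_support, max_len=5):
    """Mine frequent sequential patterns by recursive prefix projection."""
    results = []

    def mine(prefix, projected):
        # projected holds (sequence index, start offset) pointers to suffixes.
        counts = defaultdict(set)
        for seq_id, start in projected:
            for item in set(sequences[seq_id][start:]):
                counts[item].add(seq_id)
        for item, seq_ids in sorted(counts.items()):
            if len(seq_ids) < min_support:
                continue
            pattern = prefix + [item]
            results.append((pattern, len(seq_ids)))
            if len(pattern) < max_len:
                # Project each suffix past the first occurrence of the item.
                new_projected = []
                for seq_id, start in projected:
                    seq = sequences[seq_id]
                    for pos in range(start, len(seq)):
                        if seq[pos] == item:
                            new_projected.append((seq_id, pos + 1))
                            break
                mine(pattern, new_projected)

    mine([], [(i, 0) for i in range(len(sequences))])
    return results


entity_types = {
    "aspirin": "$MEDCINE",
    "ibuprofen": "$MEDCINE",
    "headache": "$DISEASE",
    "fever": "$DISEASE",
}
corpus = [
    "aspirin may be helpful for headache".split(),
    "ibuprofen may be helpful for fever".split(),
    "aspirin is cheap".split(),
]
sequences = [generalize(sentence, entity_types) for sentence in corpus]
patterns = prefixspan(sequences, min_support=2, max_len=6)
```

Because the two medical sentences become identical after type replacement, the full template is found with support 2 even though no literal sentence repeats, which is the generalization effect the paragraph describes.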
According to the entity types contained in the sequence patterns, synonymous sequence patterns are aggregated to obtain the sequence pattern subset A corresponding to each semantic meaning; the patterns in each subset A express the same meaning. Synonymous sequence patterns are sequence patterns that express the same semantics; for example, the two sequence patterns "$Person's age is $Digit" and "$Person, $Digit" both express the meaning "the age of a person is a number". The aggregation is performed as follows: a graph structure is built over the set of sequence patterns, in which each vertex is a sequence pattern and the edge between two patterns is defined by three features: the number of entity types the two patterns share, the number of context words they share, and the number of identical entity extraction results. A regression model trained on these three features assigns a weight to each edge, and a clique detection algorithm [A procedure for clique detection using the group matrix] is used to obtain subgraphs, i.e. the sequence pattern subsets A. For the sequence patterns "$Country president $Politician" and "president $Politician of $Country", the shared entity types are $Country and $Politician, so the number of shared entity types is 2; the shared context word is "president", so the number of shared context words is 1; the identical entity extraction result is the number of entities extracted from the corpus by both sequence patterns, so that, for example, when extracting entities of type $Politician, the number of extracted $Politician entities is counted.
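A minimal sketch of the aggregation described above follows. Several parts are stand-ins chosen for illustration: the edge weight is a fixed linear combination with made-up coefficients rather than a trained regression model, edges are kept by a hypothetical threshold, and cliques are enumerated with the Bron-Kerbosch procedure rather than the group-matrix method cited in the text. The extraction results per pattern are invented toy data.

```python
from itertools import combinations


def entity_types(pattern):
    """Entity type placeholders are the tokens that start with '$'."""
    return {tok for tok in pattern if tok.startswith("$")}


def edge_features(p, q, extractions):
    """Shared entity types, shared context words, shared extraction results."""
    types_p, types_q = entity_types(p), entity_types(q)
    words_p, words_q = set(p) - types_p, set(q) - types_q
    return (
        len(types_p & types_q),
        len(words_p & words_q),
        len(extractions[p] & extractions[q]),
    )


def edge_weight(features, coefficients=(0.3, 0.2, 0.1)):
    # Assumed stand-in for the trained regression model.
    return sum(c * f for c, f in zip(coefficients, features))


def build_graph(patterns, extractions, threshold=0.5):
    """Connect two patterns when their edge weight reaches the threshold."""
    adjacency = {p: set() for p in patterns}
    for p, q in combinations(patterns, 2):
        if edge_weight(edge_features(p, q, extractions)) >= threshold:
            adjacency[p].add(q)
            adjacency[q].add(p)
    return adjacency


def maximal_cliques(adjacency):
    """Enumerate maximal cliques with the basic Bron-Kerbosch recursion."""
    cliques = []

    def expand(current, candidates, excluded):
        if not candidates and not excluded:
            cliques.append(current)
            return
        for v in list(candidates):
            expand(current | {v}, candidates & adjacency[v], excluded & adjacency[v])
            candidates = candidates - {v}
            excluded = excluded | {v}

    expand(set(), set(adjacency), set())
    return cliques


a = ("$Country", "president", "$Politician")
b = ("president", "$Politician", "of", "$Country")
c = ("$Person", "age", "is", "$Digit")
extractions = {a: {"Obama", "Macron"}, b: {"Obama"}, c: {"42"}}  # toy data

features_ab = edge_features(a, b, extractions)
subsets_a = maximal_cliques(build_graph([a, b, c], extractions))
```

For the two "president" patterns this reproduces the counts given in the text (2 shared entity types, 1 shared context word), and the two patterns end up in one subset while the age pattern forms its own.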
Adjust the hierarchy level of the entity types in the sequence pattern subset A corresponding to each semantic meaning. For example, the type $Location has subtypes such as $Country, $State, and $City, so entity type recognition may assign entity types of different levels to different noun phrases. In the aggregation result, the hierarchy levels of the entity types in each subset A are counted, and the most frequent level is taken as the entity type level of that subset A.
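The majority-level rule above can be sketched with a small type hierarchy. The `PARENT` table, the depth-based definition of a level, and the choice to lift deeper types to their ancestor at the majority level are assumptions made for this example; the patent only states that the most frequent level is adopted.

```python
from collections import Counter

# Hypothetical type hierarchy: child type -> parent type (None marks a root).
PARENT = {
    "$Location": None,
    "$Country": "$Location",
    "$State": "$Location",
    "$City": "$Location",
}


def level(entity_type):
    """Depth of a type in the hierarchy; roots and unknown types are level 0."""
    depth = 0
    while PARENT.get(entity_type) is not None:
        entity_type = PARENT[entity_type]
        depth += 1
    return depth


def majority_level(types_in_subset):
    """Most frequent hierarchy level among the entity types of one subset A."""
    counts = Counter(level(t) for t in types_in_subset)
    return counts.most_common(1)[0][0]


def lift_to_level(entity_type, target_level):
    """Replace a deeper type by its ancestor at the target level."""
    while level(entity_type) > target_level:
        entity_type = PARENT[entity_type]
    return entity_type


observed = ["$Location", "$Location", "$Country"]
target = majority_level(observed)
adjusted = [lift_to_level(t, target) for t in observed]
```

Types that are shallower than the majority level are left unchanged in this sketch, since specializing them would need information the hierarchy alone does not provide.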
The above process yields the sequence pattern subset A corresponding to each semantic meaning. For each entity type, the sequence patterns containing that type are found in every subset A, giving the sequence pattern subset B corresponding to that entity type.
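Regrouping the patterns by entity type amounts to building an inverted index. The sketch below assumes patterns are whitespace-separated strings in which entity types start with `$`; the semantic keys and pattern strings are hypothetical.

```python
from collections import defaultdict


def build_subsets_b(subsets_a):
    """Map each entity type to every pattern, from any subset A, that contains it."""
    subsets_b = defaultdict(list)
    for patterns in subsets_a.values():
        for pattern in patterns:
            types = {tok for tok in pattern.split() if tok.startswith("$")}
            for entity_type in sorted(types):
                subsets_b[entity_type].append(pattern)
    return dict(subsets_b)


# Hypothetical subsets A keyed by an informal name for the shared meaning.
subsets_a = {
    "treatment": [
        "$MEDCINE may be helpful for $DISEASE",
        "$MEDCINE treats $DISEASE",
    ],
    "age": ["$Person , $Digit"],
}
subsets_b = build_subsets_b(subsets_a)
```

A pattern with two entity types appears in two subsets B, once for each type it can help recognize.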
Step two: generating labeled data
The sequence pattern subset B corresponding to each entity type is used as the input of Snorkel to predict the label of each sample, i.e. its entity type; each label carries a confidence level.
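To show how sequence patterns can act as labeling rules, the sketch below turns each pattern into a function that either votes for an entity type or abstains, and then combines the votes. It deliberately does not call the Snorkel library: the vote share used as the confidence is a simple stand-in for Snorkel's label model, and the names `make_labeling_function` and `vote`, the single-word entity assumption, and the example sentence are all hypothetical.

```python
import re
from collections import Counter

ABSTAIN = None


def make_labeling_function(pattern, target_type):
    """Build a rule that votes target_type when the pattern captures the candidate."""
    parts, captured = [], False
    for tok in pattern.split():
        if tok == target_type and not captured:
            parts.append(r"(?P<entity>\w+)")  # slot whose filler gets labeled
            captured = True
        elif tok.startswith("$"):
            parts.append(r"\w+")  # any other typed slot matches one word
        else:
            parts.append(re.escape(tok))
    if not captured:
        raise ValueError("pattern does not contain the target type")
    regex = re.compile(r"\s+".join(parts), re.IGNORECASE)

    def labeling_function(sentence, candidate):
        for match in regex.finditer(sentence):
            if match.group("entity").lower() == candidate.lower():
                return target_type
        return ABSTAIN

    return labeling_function


def vote(labeling_functions, sentence, candidate):
    """Return (label, confidence); confidence is the winner's share of the votes."""
    votes = [lf(sentence, candidate) for lf in labeling_functions]
    counts = Counter(v for v in votes if v is not ABSTAIN)
    if not counts:
        return ABSTAIN, 0.0
    label, n = counts.most_common(1)[0]
    return label, n / sum(counts.values())


lfs = [
    make_labeling_function("$MEDCINE may be helpful for $DISEASE", "$MEDCINE"),
    make_labeling_function("$MEDCINE may be helpful for $DISEASE", "$DISEASE"),
    make_labeling_function("$MEDCINE treats $DISEASE", "$DISEASE"),
]
sentence = "Aspirin may be helpful for headache"
disease_label = vote(lfs, sentence, "headache")
drug_label = vote(lfs, sentence, "Aspirin")
```

In the method itself the combination of rule outputs is left to Snorkel, which models rule accuracies instead of counting votes equally.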
Step three: training the entity extraction regression model
The entity extraction regression model is trained on the labels with confidence levels, and the trained regression model is then used to predict on the corpus to obtain the entity recognition result.
Step four: return to step one. Predict on the corpus again with the trained entity extraction regression model, use the result to correct the phrase segmentation and distant-supervision entity recognition results obtained in step one, and carry out the remaining steps to obtain a new entity extraction regression model and entity recognition result. Repeat this process until the entity result obtained in step three is consistent with the result of the previous round.
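The stopping rule above is a fixed-point iteration, which can be sketched as a short loop. The `run_round` callback stands for one pass through steps one to three; the `max_rounds` safety limit and the toy round used in the example are assumptions added for illustration and do not appear in the patent.

```python
def run_until_stable(corpus, initial_entities, run_round, max_rounds=20):
    """Repeat rounds until the entity result equals that of the previous round."""
    previous = initial_entities
    for round_number in range(1, max_rounds + 1):
        current = run_round(corpus, previous)
        if current == previous:
            return current, round_number
        previous = current
    return previous, max_rounds


def toy_round(corpus, entities):
    # Toy stand-in for steps one to three: keep entities seen in two or more sentences.
    return {e for e in entities if sum(e in sentence for sentence in corpus) >= 2}


corpus = ["aspirin helps headache", "aspirin is cheap", "trial of headache"]
noisy_entities = {"aspirin", "headache", "trial"}
final_entities, rounds = run_until_stable(corpus, noisy_entities, toy_round)
```

In this toy run the first round drops the noisy entity and the second round reproduces the same set, so the loop stops after two rounds.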
Based on the written description, drawings, and claims provided by the present invention, those skilled in the art can readily make various changes and modifications without departing from the spirit and scope of the invention as defined by the claims. Any modification or equivalent variation made to the above embodiments in accordance with the technical ideas and substance of the present invention falls within the protection scope of the claims of the present invention.

Claims (4)

1. A text entity extraction method, characterized in that the method comprises the following steps:
(1) automatic mining of the rule set, comprising the following sub-steps:
(1.1) performing phrase segmentation on a large corpus to obtain noun phrases;
(1.2) performing entity and entity type recognition on the noun phrases by means of distant supervision;
(1.3) mining frequently occurring sequence patterns from the entity and entity type recognition results; in a sequence pattern, if a noun phrase in the original corpus is recognized as an entity, replacing that noun phrase in the sequence pattern with its entity type;
(1.4) according to the entity types contained in the sequence patterns, aggregating synonymous sequence patterns to obtain the sequence pattern subset A corresponding to each semantic meaning;
(1.5) adjusting the hierarchy level of the entity types in the sequence pattern subset A corresponding to each semantic meaning: in the aggregation result, counting the hierarchy levels of the entity types in each subset A, and taking the most frequent level as the entity type level of that subset A;
(1.6) for each entity type, finding the sequence patterns containing that type in every subset A to obtain the sequence pattern subset B corresponding to that entity type;
(2) generating labeled data: using the sequence pattern subset B corresponding to each entity type as the input of Snorkel to predict the label of each sample, i.e. its entity type, each label carrying a confidence level;
(3) training the entity extraction regression model: training the entity extraction regression model on the labels with confidence levels, and using the trained regression model to predict on the corpus to obtain the entity recognition result;
(4) returning to step (1): predicting on the corpus again with the trained entity extraction regression model, using the result to correct the phrase segmentation and distant-supervision entity recognition results obtained in step (1), and carrying out the remaining steps to obtain a new entity extraction regression model and entity recognition result; repeating this process until the entity result obtained in step (3) is consistent with the result of the previous round.
2. The text entity extraction method according to claim 1, characterized in that in step (1.1), phrase segmentation is performed with the AutoPhrase method to obtain noun phrases.
3. The text entity extraction method according to claim 1, characterized in that in step (1.3), frequently occurring sequence patterns are mined from the entity and entity type recognition results with the PrefixSpan method.
4. The text entity extraction method according to claim 1, characterized in that in step (1.4), the aggregation is performed as follows: a graph structure is built over the set of sequence patterns, in which each vertex is a sequence pattern and the edge between two patterns is defined by three features: the number of entity types the two patterns share, the number of context words they share, and the number of identical entity extraction results; a regression model trained on these three features assigns a weight to each edge, and a clique detection algorithm is used to obtain subgraphs, i.e. the sequence pattern subsets.
CN201910472799.7A 2019-05-31 2019-05-31 Text entity extraction method Active CN110188359B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910472799.7A CN110188359B (en) 2019-05-31 2019-05-31 Text entity extraction method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201910472799.7A CN110188359B (en) 2019-05-31 2019-05-31 Text entity extraction method

Publications (2)

Publication Number Publication Date
CN110188359A (en) 2019-08-30
CN110188359B (en) 2023-01-03

Family

ID=67719618

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910472799.7A Active CN110188359B (en) 2019-05-31 2019-05-31 Text entity extraction method

Country Status (1)

Country Link
CN (1) CN110188359B (en)

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20170060835A1 (en) * 2015-08-27 2017-03-02 Xerox Corporation Document-specific gazetteers for named entity recognition
CN106776711A (en) * 2016-11-14 2017-05-31 浙江大学 A kind of Chinese medical knowledge mapping construction method based on deep learning
CN107291687A (en) * 2017-04-27 2017-10-24 同济大学 It is a kind of based on interdependent semantic Chinese unsupervised open entity relation extraction method
US20190147297A1 (en) * 2017-11-16 2019-05-16 Accenture Global Solutions Limited System for time-efficient assignment of data to ontological classes

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
刘绍毓 等: "实体关系抽取研究综述", 《信息工程大学学报》 *

Cited By (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111325350A (en) * 2020-02-19 2020-06-23 第四范式(北京)技术有限公司 Suspicious tissue discovery system and method
CN111325350B (en) * 2020-02-19 2023-09-29 第四范式(北京)技术有限公司 Suspicious tissue discovery system and method
CN113255356A (en) * 2021-06-10 2021-08-13 杭州费尔斯通科技有限公司 Entity recognition method and device based on entity word list
CN113255356B (en) * 2021-06-10 2021-09-28 杭州费尔斯通科技有限公司 Entity recognition method and device based on entity word list
CN113204643A (en) * 2021-06-23 2021-08-03 北京明略软件系统有限公司 Entity alignment method, device, equipment and medium
CN113204643B (en) * 2021-06-23 2021-11-02 北京明略软件系统有限公司 Entity alignment method, device, equipment and medium
CN113299375A (en) * 2021-07-27 2021-08-24 北京好欣晴移动医疗科技有限公司 Method, device and system for marking and identifying digital file information entity

Also Published As

Publication number Publication date
CN110188359B (en) 2023-01-03

Similar Documents

Publication Publication Date Title
CN110188359A (en) A kind of text entities abstracting method
CN104462053B (en) A kind of personal pronoun reference resolution method based on semantic feature in text
CN106484664B (en) Similarity calculating method between a kind of short text
CN106897559B (en) A kind of symptom and sign class entity recognition method and device towards multi-data source
CN104268160B (en) A kind of OpinionTargetsExtraction Identification method based on domain lexicon and semantic role
CN106844346A (en) Short text Semantic Similarity method of discrimination and system based on deep learning model Word2Vec
CN108874878A (en) A kind of building system and method for knowledge mapping
CN108595696A (en) A kind of human-computer interaction intelligent answering method and system based on cloud platform
CN106372064B (en) A kind of term weight function calculation method of text mining
CN105138507A (en) Pattern self-learning based Chinese open relationship extraction method
CN109344250A (en) Single diseases diagnostic message rapid structure method based on medical insurance data
CN105631468A (en) RNN-based automatic picture description generation method
CN107632979A (en) The problem of one kind is used for interactive question and answer analytic method and system
CN106649783A (en) Synonym mining method and apparatus
CN107463553A (en) For the text semantic extraction, expression and modeling method and system of elementary mathematics topic
CN110910283A (en) Method, device, equipment and storage medium for generating legal document
CN105261358A (en) N-gram grammar model constructing method for voice identification and voice identification system
CN110502644A (en) A kind of field level dictionary excavates the Active Learning Method of building
CN108038205A (en) For the viewpoint analysis prototype system of Chinese microblogging
CN108710611A (en) A kind of short text topic model generation method of word-based network and term vector
CN106934005A (en) A kind of Text Clustering Method based on density
CN106980652A (en) Intelligent answer method and system
CN103678499A (en) Data mining method based on multi-source heterogeneous patent data semantic integration
CN110399433A (en) A kind of data entity Relation extraction method based on deep learning
CN107943786A (en) A kind of Chinese name entity recognition method and system

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant