CN110188359B - Text entity extraction method - Google Patents

Info

Publication number
CN110188359B
Authority
CN
China
Prior art keywords
entity
sequence
subset
extraction
regression model
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201910472799.7A
Other languages
Chinese (zh)
Other versions
CN110188359A (en)
Inventor
金霞
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Chengdu Firestone Creation Technology Co ltd
Original Assignee
Chengdu Firestone Creation Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Chengdu Firestone Creation Technology Co ltd
Priority to CN201910472799.7A
Publication of CN110188359A
Application granted
Publication of CN110188359B

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/279 Recognition of textual entities
    • G06F40/289 Phrasal analysis, e.g. finite state techniques or chunking
    • G06F40/295 Named entity recognition
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/30 Semantic analysis

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Machine Translation (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention discloses a text entity extraction method that exploits the redundancy and repetition of information in a large corpus. It first obtains noisy entities by phrase segmentation and distant supervision (remote supervision), then mines the context sequence patterns (rules) of those entities and uses them automatically as the input rules of Snorkel. Because Snorkel tolerates noisy labels, the result is of better quality than that of distant supervision alone. The model and the results are corrected cyclically, which gradually removes noise and yields more reliable sequence patterns. The invention uses no labeled samples, which saves labor; the input rules of Snorkel are obtained automatically; and distant supervision, rule mining, Snorkel and the cyclic process are combined so that the result improves progressively, noise is removed, and extraction quality rises.

Description

Text entity extraction method
Technical Field
The invention relates to the technical field of natural language processing, in particular to a few-sample text entity extraction method.
Background
In text information extraction applications, the scenarios are numerous and fine-grained, labeled samples are scarce, and labeled samples are expensive to acquire. This is the situation that industrial applications face. Under the model-training approach, popular research directions are the rapid creation of labeled samples and deep models that need fewer samples or tolerate noisier samples. Under the extraction-rule approach, the popular research direction is the rapid mining and construction of extraction rule sets.
Among existing text information extraction methods, model training requires a large number of labeled samples. Although some deep models show a trend toward higher accuracy with fewer labeled samples, a certain number of labeled samples is still needed to train a usable model, and work cannot begin until those samples are available. The development cost is therefore shifted to sample labeling, and overall development efficiency remains low.
In methods based on extraction rules, samples need not be labeled manually, but the extraction rules usually have to be tuned using domain knowledge, and a fully rule-based system may need tens of thousands of rules. To reduce the burden of developing rule sets, the mining and automatic generation of rule sets has become an active research direction.
Snorkel provides a path from rules to models; however, it depends strongly on the accuracy of the rule set, and the rules are not generated automatically.
Disclosure of Invention
By combining the ideas of extraction rules and model training, the invention provides an information extraction solution for the case where only a small number of labeled samples is available, and an extraction model with higher accuracy can be obtained without manual intervention.
The purpose of the invention is achieved by the following technical solution: a text entity extraction method comprising the following steps:
(1) Automatic mining of rule sets, comprising the following sub-steps:
(1.1) performing phrase segmentation on a large corpus to obtain noun phrases;
(1.2) performing entity and entity type recognition on the noun phrases by distant supervision;
(1.3) mining frequently occurring sequence patterns from the results of entity and entity type recognition; in a sequence pattern, if a noun phrase in the original corpus is recognized as an entity, the noun phrase is replaced with its entity type;
(1.4) aggregating synonymous sequence patterns according to the entity types they contain, to obtain a sequence pattern subset A for each semantic;
(1.5) adjusting the hierarchy level of the entity types in the sequence pattern subset A of each semantic: on the result of sequence pattern aggregation, counting the entity type hierarchy levels in the subset A of each semantic, and taking the most frequent level as the entity type level of that subset A;
(1.6) for each entity type, finding the sequence patterns containing that type in each subset A, to obtain a sequence pattern subset B for the entity type;
(2) Generating labeled data: taking the sequence pattern subset B of each entity type as the input of Snorkel and predicting the label of each sample, namely its entity type, where each label carries a confidence;
(3) Training an entity extraction regression model: training the entity extraction regression model with the confidence-bearing labels, and predicting on the corpus with the trained regression model to obtain entity recognition results;
(4) Returning to step (1): predicting on the corpus again with the trained entity extraction regression model, using the result to correct the phrase segmentation and distant-supervision entity recognition results obtained in step (1), and continuing with the remaining steps to obtain the entity extraction regression model and the entity recognition results again; this process is repeated until the entity results of step (3) are consistent with those of the previous round.
Further, in step (1.1), phrase segmentation is performed with the AutoPhrase method to obtain noun phrases.
Further, in step (1.3), frequently occurring sequence patterns are mined from the results of entity and entity type recognition with the PrefixSpan method.
Further, in step (1.4), the aggregation is performed as follows: a graph is built over the set of sequence patterns, in which each vertex is a sequence pattern and the edge between two patterns is characterized by three features: the number of entity types the two patterns share, the number of context words they share, and the number of identical entity extraction results. Each edge is assigned a weight by a regression model trained on these three features, and a clustering algorithm is applied to obtain subgraphs, each of which is a sequence pattern subset.
The invention has the following beneficial effects: it exploits the redundancy and repetition of information in a large corpus, first obtaining noisy entities by phrase segmentation and distant supervision, then mining the context sequence patterns (rules) of those entities to obtain the input rules of Snorkel automatically, and using Snorkel's tolerance of noisy labels to obtain a result of better quality than that of distant supervision alone. The model and the results are corrected cyclically, which gradually removes noise and yields more reliable sequence patterns. The invention uses no labeled samples, which saves labor; the input rules of Snorkel are obtained automatically; and the method combines distant supervision, rule mining, Snorkel and a cyclic process to improve the result and remove noise progressively, so the final result is better than that of distant supervision.
Drawings
FIG. 1 is a flow chart of the method of the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the embodiments of the present invention clearer, the technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention. It is to be understood that the embodiments described are only some of the embodiments of the present invention, and not all of them. Other embodiments, which can be derived by one of ordinary skill in the art from the embodiments of the present invention without creative efforts, are also within the scope of the present invention.
In a few-sample setting, the method automatically mines a rule set from a large number of unlabeled samples, manages the rule set with Snorkel to generate a large amount of noisy labeled data with confidences, and finally uses these data to train an entity extraction regression model.
As shown in FIG. 1, the text entity extraction method provided by the present invention comprises the following steps:
First, automatic mining of rule sets
On a large corpus, phrase segmentation is first performed with the AutoPhrase method (AutoPhrase: Automated Phrase Mining from Massive Text Corpora) to obtain noun phrases;
entity and entity type recognition is then performed on the noun phrases by distant supervision (for English medical texts, the MetaMap tool gives better results);
From the results of entity and entity type recognition, the PrefixSpan method [PrefixSpan: Mining Sequential Patterns Efficiently by Prefix-Projected Pattern Growth] is used to mine frequently occurring sequence patterns. A sequence pattern is an ordinary regular template with entity types added, for example: ($MEDICINE) may be helpful for ($DISEASE), where ($MEDICINE) and ($DISEASE) denote the drug and disease entity types respectively, and the corresponding position in the sequence pattern can be any drug or disease. In a sequence pattern, if a noun phrase in the original corpus is recognized as an entity, the noun phrase is replaced with its entity type, which improves the generalization of the sequence pattern.
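The type substitution and frequent-pattern mining described above can be sketched as follows. This is a minimal illustration, not the patented implementation: the `entity_types` dictionary stands in for the output of the supervision step, the three-sentence corpus is invented, and `prefixspan` is a simplified single-item variant of PrefixSpan written here for the example.

```python
from collections import defaultdict
from typing import Dict, List, Tuple


def generalize(tokens: List[str], entity_types: Dict[str, str]) -> List[str]:
    """Replace each recognized entity token with its entity type placeholder."""
    return [entity_types.get(tok, tok) for tok in tokens]


def prefixspan(sequences: List[List[str]], min_support: int) -> List[Tuple[List[str], int]]:
    """Return every subsequence (gaps allowed) found in >= min_support sequences."""
    results: List[Tuple[List[str], int]] = []

    def mine(prefix: List[str], projected: List[Tuple[int, int]]) -> None:
        # projected holds (sequence index, start of the remaining suffix).
        support = defaultdict(set)
        first_pos = {}
        for seq_id, start in projected:
            seq = sequences[seq_id]
            for pos in range(start, len(seq)):
                item = seq[pos]
                first_pos.setdefault((item, seq_id), pos)
                support[item].add(seq_id)
        for item in sorted(support):
            ids = support[item]
            if len(ids) < min_support:
                continue
            pattern = prefix + [item]
            results.append((pattern, len(ids)))
            # Project each supporting sequence past the first match of item.
            mine(pattern, [(sid, first_pos[(item, sid)] + 1) for sid in sorted(ids)])

    mine([], [(i, 0) for i in range(len(sequences))])
    return results


# Hypothetical recognition output and toy corpus.
entity_types = {"aspirin": "$MEDICINE", "ibuprofen": "$MEDICINE",
                "headache": "$DISEASE", "fever": "$DISEASE"}
corpus = [
    "aspirin may be helpful for headache".split(),
    "ibuprofen may be helpful for fever".split(),
    "aspirin is cheap".split(),
]
sequences = [generalize(s, entity_types) for s in corpus]
patterns = {" ".join(p): n for p, n in prefixspan(sequences, min_support=2)}
```

Because the entity mentions are replaced before mining, the two medical sentences collapse into the same typed sequence, which is what lets the full template reach the support threshold.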
According to the entity types contained in the sequence patterns, synonymous sequence patterns are aggregated to obtain a sequence pattern subset A for each semantic, so that the patterns in each subset A express the same semantic. Synonymous sequence patterns are sequence patterns that express the same semantic; for example, "$Person's age is $Digit" and "$Person, $Digit" both express the semantic "a person's age is a number". The aggregation is performed as follows: a graph is built over the set of sequence patterns, in which each vertex is a sequence pattern and the edge between two patterns is characterized by three features: the number of common entity types, the number of common context words, and the number of identical entity extraction results. Each edge is assigned a weight by a regression model trained on these three features, and a clustering algorithm [ A procedure for closed detection using the group matrix ] is applied to obtain subgraphs, each of which is a sequence pattern subset A. For the sequence patterns "$Country president $Politician" and "president $Politician of $Country", the common entity types are $Country and $Politician, so the number of common entity types is 2; the only common context word is "president", so the number of common context words is 1. The number of identical entity extraction results is the number of entities that both sequence patterns extract from the corpus; for example, when extracting $Politician entities, the number of extracted $Politician entities is counted.
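The three edge features can be computed directly from two patterns and their extraction results, as the sketch below shows for the president example. Several parts are assumptions made for illustration: the extraction sets are invented, the third feature is read as the number of entities extracted by both patterns, the fixed coefficients in `edge_weight` stand in for the trained regression model, and connected components over thresholded edges stand in for the clustering algorithm, which the text does not specify in detail.

```python
from itertools import combinations
from typing import Dict, List, Sequence, Set, Tuple


def edge_features(p1: str, p2: str, extracted: Dict[str, Set[str]]) -> Tuple[int, int, int]:
    """(common entity types, common context words, identical extraction results)."""
    t1, t2 = p1.split(), p2.split()
    types1 = {t for t in t1 if t.startswith("$")}
    types2 = {t for t in t2 if t.startswith("$")}
    words1, words2 = set(t1) - types1, set(t2) - types2
    return (len(types1 & types2), len(words1 & words2),
            len(extracted[p1] & extracted[p2]))


def edge_weight(features: Sequence[int], coef: Sequence[float] = (0.4, 0.3, 0.3)) -> float:
    """Stand-in for the trained regression model: a fixed linear score."""
    return sum(c * f for c, f in zip(coef, features))


def cluster(patterns: List[str], extracted: Dict[str, Set[str]],
            threshold: float = 1.0) -> List[Set[str]]:
    """Group patterns joined by edges whose weight reaches the threshold."""
    parent = {p: p for p in patterns}

    def find(p: str) -> str:
        while parent[p] != p:
            parent[p] = parent[parent[p]]  # path halving
            p = parent[p]
        return p

    for a, b in combinations(patterns, 2):
        if edge_weight(edge_features(a, b, extracted)) >= threshold:
            parent[find(a)] = find(b)
    groups: Dict[str, Set[str]] = {}
    for p in patterns:
        groups.setdefault(find(p), set()).add(p)
    return list(groups.values())


p1 = "$Country president $Politician"
p2 = "president $Politician of $Country"
p3 = "$Person , $Digit"
# Hypothetical entities each pattern extracted from a corpus.
extracted = {p1: {"Obama", "Macron"}, p2: {"Obama", "Merkel"}, p3: {"Alice"}}
features = edge_features(p1, p2, extracted)
subsets_a = cluster([p1, p2, p3], extracted)
```

With these inputs the two president patterns share two entity types, one context word and one extracted entity, so they fall into the same subset, while the age pattern stays on its own.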
The hierarchy level of the entity types in the sequence pattern subset A of each semantic is then adjusted. For example, types such as $Country, $State and $City sit below the $Location type, and entity type recognition yields entity types at different hierarchy levels for each noun phrase. On the result of sequence pattern aggregation, the entity type hierarchy levels in the subset A of each semantic are counted, and the most frequent level is taken as the entity type level of that subset A.
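One possible reading of this level adjustment is a majority vote over hierarchy depths, sketched below. The `PARENT` table, the helper names and the example subset are hypothetical, and the sketch only lifts types that sit deeper than the majority level, since a coarse type cannot be refined without more information.

```python
from collections import Counter
from typing import Dict, List

# Hypothetical type hierarchy: child type -> parent type.
PARENT: Dict[str, str] = {
    "$Country": "$Location",
    "$State": "$Location",
    "$City": "$Location",
}


def level_of(entity_type: str) -> int:
    """Depth of a type in the hierarchy (root types have depth 0)."""
    depth = 0
    while entity_type in PARENT:
        entity_type = PARENT[entity_type]
        depth += 1
    return depth


def lift(entity_type: str, target_level: int) -> str:
    """Walk up the hierarchy until the type is no deeper than target_level."""
    while level_of(entity_type) > target_level:
        entity_type = PARENT[entity_type]
    return entity_type


def normalize_subset(subset_a: List[str]) -> List[str]:
    """Rewrite every pattern so its entity types sit at the majority level."""
    types = [tok for pat in subset_a for tok in pat.split() if tok.startswith("$")]
    if not types:
        return list(subset_a)
    majority_level = Counter(level_of(t) for t in types).most_common(1)[0][0]
    return [
        " ".join(lift(tok, majority_level) if tok.startswith("$") else tok
                 for tok in pat.split())
        for pat in subset_a
    ]


subset_a = ["lives in $Location", "moved to $Location", "born in $City"]
normalized = normalize_subset(subset_a)
```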
Through the above process, the sequence pattern subset A of each semantic is obtained. For each entity type, the sequence patterns containing that type are found in each subset A, giving the sequence pattern subset B for that entity type.
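Building the per-type subsets B amounts to inverting the subsets A by entity type. A minimal sketch, with a hypothetical function name and invented subsets:

```python
from collections import defaultdict
from typing import Dict, List


def build_subsets_b(subsets_a: List[List[str]]) -> Dict[str, List[str]]:
    """Map each entity type to every pattern, from any subset A, that contains it."""
    subsets_b: Dict[str, List[str]] = defaultdict(list)
    for subset in subsets_a:
        for pattern in subset:
            for etype in sorted({t for t in pattern.split() if t.startswith("$")}):
                if pattern not in subsets_b[etype]:
                    subsets_b[etype].append(pattern)
    return dict(subsets_b)


subsets_a = [
    ["$Country president $Politician", "president $Politician of $Country"],
    ["$Person , $Digit"],
]
subsets_b = build_subsets_b(subsets_a)
```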
Second, generate labeled data
The sequence pattern subset B of each entity type is taken as the input of Snorkel to predict the label of each sample, namely its entity type; each label carries a confidence.
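To make the role of the patterns as labeling rules concrete, the sketch below turns each pattern into a function that votes for an entity type or abstains. It deliberately does not call Snorkel: the `vote` function is a plain majority vote used as a stand-in for Snorkel's label model, and the regex translation, function names and label indices are assumptions for this example.

```python
import re
from collections import Counter
from typing import Callable, List, Tuple

ABSTAIN = -1
LabelFn = Callable[[str, str], int]


def pattern_to_regex(pattern: str, target: str):
    """Compile a typed pattern; the first occurrence of target becomes a capture group."""
    parts, captured = [], False
    for tok in pattern.split():
        if tok == target and not captured:
            parts.append(r"(?P<target>\w+)")
            captured = True
        elif tok.startswith("$"):
            parts.append(r"\w+")  # any single-word entity of another type
        else:
            parts.append(re.escape(tok))
    return re.compile(r"\s+".join(parts), re.IGNORECASE)


def make_lf(pattern: str, target: str, label: int) -> LabelFn:
    """Build a rule that votes label when the candidate fills the target slot."""
    regex = pattern_to_regex(pattern, target)

    def lf(sentence: str, candidate: str) -> int:
        for match in regex.finditer(sentence):
            if match.group("target").lower() == candidate.lower():
                return label
        return ABSTAIN

    return lf


def vote(lfs: List[LabelFn], sentence: str, candidate: str) -> Tuple[int, float]:
    """Majority vote over non-abstaining rules; confidence is the winning share."""
    votes = [v for v in (lf(sentence, candidate) for lf in lfs) if v != ABSTAIN]
    if not votes:
        return ABSTAIN, 0.0
    label, count = Counter(votes).most_common(1)[0]
    return label, count / len(votes)


MEDICINE, DISEASE = 0, 1
template = "$MEDICINE may be helpful for $DISEASE"
lfs = [
    make_lf(template, "$MEDICINE", MEDICINE),
    make_lf(template, "$DISEASE", DISEASE),
    make_lf("treated with $MEDICINE", "$MEDICINE", MEDICINE),
]
sentence = "Aspirin may be helpful for headache"
aspirin_label = vote(lfs, sentence, "aspirin")
headache_label = vote(lfs, sentence, "headache")
unknown_label = vote(lfs, sentence, "morning")
```

In a real pipeline the per-rule votes would form the label matrix handed to Snorkel, which estimates rule accuracies instead of weighting every rule equally as this stand-in does.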
Third, training the entity extraction regression model
The entity extraction regression model is trained with the confidence-bearing labels, and the trained regression model is used to predict on the corpus to obtain entity recognition results.
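One way to use labels that carry a confidence is to weight each sample's loss by that confidence. The sketch below does this with a small logistic regression trained by gradient descent; the choice of logistic regression, the two binary features and the toy data are assumptions for illustration, since the text does not name a specific model.

```python
import math
from typing import List, Sequence, Tuple


def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))


def predict(w: Sequence[float], b: float, x: Sequence[float]) -> float:
    """Probability that candidate x is an entity of the target type."""
    return sigmoid(sum(wj * xj for wj, xj in zip(w, x)) + b)


def train_weighted_logreg(X: List[List[float]], y: List[int], conf: List[float],
                          lr: float = 0.5, epochs: int = 500) -> Tuple[List[float], float]:
    """Batch gradient descent on a confidence-weighted log loss."""
    w, b = [0.0] * len(X[0]), 0.0
    n = len(X)
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * len(w), 0.0
        for xi, yi, ci in zip(X, y, conf):
            err = ci * (predict(w, b, xi) - yi)  # confidence scales the error
            for j, xj in enumerate(xi):
                grad_w[j] += err * xj
            grad_b += err
        w = [wj - lr * g / n for wj, g in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w, b


# Hypothetical features per candidate: [matches a mined pattern, is capitalized].
X = [[1, 1], [1, 0], [0, 1], [0, 0]]
y = [1, 1, 0, 0]               # labels produced by the labeling step
conf = [0.9, 0.8, 0.6, 0.9]    # label confidences
w, b = train_weighted_logreg(X, y, conf)
p_entity = predict(w, b, [1, 0])
p_other = predict(w, b, [0, 1])
```

Low-confidence labels contribute smaller gradients, so a noisy label pulls the model less than a label on which the rules agree.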
Fourth, return to the first step: the trained entity extraction regression model is used to predict on the corpus again, the result is used to correct the phrase segmentation and distant-supervision entity recognition results obtained in the first step, and the remaining steps are carried out to obtain the entity extraction regression model and the entity recognition results again. This process is repeated until the entity results of the third step are consistent with those of the previous round.
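The outer loop can be sketched as a fixed-point iteration over the set of recognized entities. Here `run_pipeline` is a hypothetical callable standing for steps one to three, `toy_pipeline` is an invented stand-in that discovers one more entity per round, and `max_rounds` is a safety cap added for the example; the text itself only states the consistency condition.

```python
from typing import Callable, List, Set, Tuple

Entity = Tuple[str, str]  # (surface form, entity type)


def iterate_until_stable(
    corpus: List[str],
    run_pipeline: Callable[[List[str], Set[Entity]], Set[Entity]],
    max_rounds: int = 10,
) -> Tuple[Set[Entity], int]:
    """Rerun the pipeline, feeding back its last result, until nothing changes."""
    previous: Set[Entity] = set()
    for round_no in range(1, max_rounds + 1):
        current = run_pipeline(corpus, previous)
        if current == previous:
            return current, round_no
        previous = current
    return previous, max_rounds


# Toy stand-in for steps one to three: each round recovers one more entity.
POOL: List[Entity] = [("aspirin", "$MEDICINE"), ("headache", "$DISEASE")]


def toy_pipeline(corpus: List[str], prior: Set[Entity]) -> Set[Entity]:
    result = set(prior)
    for entity in POOL:
        if entity not in result:
            result.add(entity)
            break
    return result


entities, rounds = iterate_until_stable(["aspirin may be helpful for headache"], toy_pipeline)
```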
One skilled in the art can, using the teachings of the present invention, readily make various changes and modifications to the invention without departing from the spirit and scope of the invention as defined by the appended claims. Any modifications and equivalent variations of the above-described embodiments, which are made in accordance with the technical spirit and substance of the present invention, fall within the scope of protection of the present invention as defined in the claims.

Claims (4)

1. A text entity extraction method, characterized by comprising the following steps:
(1) Automatic mining of rule sets, comprising the following sub-steps:
(1.1) performing phrase segmentation on a large corpus to obtain noun phrases;
(1.2) performing entity and entity type recognition on the noun phrases by distant supervision;
(1.3) mining frequently occurring sequence patterns from the results of entity and entity type recognition; in a sequence pattern, if a noun phrase in the original corpus is recognized as an entity, the noun phrase is replaced with its entity type;
(1.4) aggregating synonymous sequence patterns according to the entity types they contain, to obtain a sequence pattern subset A for each semantic;
(1.5) adjusting the hierarchy level of the entity types in the sequence pattern subset A of each semantic: on the result of sequence pattern aggregation, counting the entity type hierarchy levels in the subset A of each semantic, and taking the most frequent level as the entity type level of that subset A;
(1.6) for each entity type, finding the sequence patterns containing that type in each subset A, to obtain a sequence pattern subset B for the entity type;
(2) Generating labeled data: taking the sequence pattern subset B of each entity type as the input of Snorkel and predicting the label of each sample, namely its entity type, where each label carries a confidence;
(3) Training an entity extraction regression model: training the entity extraction regression model with the confidence-bearing labels, and predicting on the corpus with the trained regression model to obtain entity recognition results;
(4) Returning to step (1): predicting on the corpus again with the trained entity extraction regression model, using the result to correct the phrase segmentation and distant-supervision entity recognition results obtained in step (1), and continuing with the remaining steps to obtain the entity extraction regression model and the entity recognition results again; this process is repeated until the entity results of step (3) are consistent with those of the previous round.
2. The text entity extraction method according to claim 1, wherein in step (1.1), phrase segmentation is performed with the AutoPhrase method to obtain noun phrases.
3. The text entity extraction method according to claim 1, wherein in step (1.3), frequently occurring sequence patterns are mined from the results of entity and entity type recognition with the PrefixSpan method.
4. The text entity extraction method according to claim 1, wherein in step (1.4), the aggregation is performed as follows: a graph is built over the set of sequence patterns, in which each vertex is a sequence pattern and the edge between two patterns is characterized by three features: the number of entity types the two patterns share, the number of context words they share, and the number of identical entity extraction results; each edge is assigned a weight by a regression model trained on these three features, and a clustering algorithm is applied to obtain subgraphs, each of which is a sequence pattern subset.
CN201910472799.7A 2019-05-31 2019-05-31 Text entity extraction method Active CN110188359B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910472799.7A CN110188359B (en) 2019-05-31 2019-05-31 Text entity extraction method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201910472799.7A CN110188359B (en) 2019-05-31 2019-05-31 Text entity extraction method

Publications (2)

Publication Number Publication Date
CN110188359A CN110188359A (en) 2019-08-30
CN110188359B true CN110188359B (en) 2023-01-03

Family

ID=67719618

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910472799.7A Active CN110188359B (en) 2019-05-31 2019-05-31 Text entity extraction method

Country Status (1)

Country Link
CN (1) CN110188359B (en)

Families Citing this family (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111325350B (en) * 2020-02-19 2023-09-29 第四范式(北京)技术有限公司 Suspicious tissue discovery system and method
CN113255356B (en) * 2021-06-10 2021-09-28 杭州费尔斯通科技有限公司 Entity recognition method and device based on entity word list
CN113204643B (en) * 2021-06-23 2021-11-02 北京明略软件系统有限公司 Entity alignment method, device, equipment and medium
CN114093470A (en) * 2021-07-27 2022-02-25 北京好欣晴移动医疗科技有限公司 Internet insurance product recommendation method, device and system

Citations (2)

Publication number Priority date Publication date Assignee Title
CN106776711A (en) * 2016-11-14 2017-05-31 浙江大学 A kind of Chinese medical knowledge mapping construction method based on deep learning
CN107291687A (en) * 2017-04-27 2017-10-24 同济大学 It is a kind of based on interdependent semantic Chinese unsupervised open entity relation extraction method

Family Cites Families (2)

Publication number Priority date Publication date Assignee Title
US9836453B2 (en) * 2015-08-27 2017-12-05 Conduent Business Services, Llc Document-specific gazetteers for named entity recognition
US10691976B2 (en) * 2017-11-16 2020-06-23 Accenture Global Solutions Limited System for time-efficient assignment of data to ontological classes

Patent Citations (2)

Publication number Priority date Publication date Assignee Title
CN106776711A (en) * 2016-11-14 2017-05-31 浙江大学 A kind of Chinese medical knowledge mapping construction method based on deep learning
CN107291687A (en) * 2017-04-27 2017-10-24 同济大学 It is a kind of based on interdependent semantic Chinese unsupervised open entity relation extraction method

Non-Patent Citations (1)

Title
A survey of entity relation extraction research; 刘绍毓 et al.; Journal of Information Engineering University; 20161031; Vol. 17 (No. 5); 542-547 *

Also Published As

Publication number Publication date
CN110188359A (en) 2019-08-30

Similar Documents

Publication Publication Date Title
CN110188359B (en) Text entity extraction method
CN109388795B (en) Named entity recognition method, language recognition method and system
CN107330011B (en) The recognition methods of the name entity of more strategy fusions and device
CN107480125B (en) Relation linking method based on knowledge graph
CN104199972B (en) A kind of name entity relation extraction and construction method based on deep learning
CN109635297B (en) Entity disambiguation method and device, computer device and computer storage medium
CN108334495A (en) Short text similarity calculating method and system
CN104881458B (en) A kind of mask method and device of Web page subject
CN110413787B (en) Text clustering method, device, terminal and storage medium
CN107463553A (en) For the text semantic extraction, expression and modeling method and system of elementary mathematics topic
CN110110327A (en) A kind of text marking method and apparatus based on confrontation study
US11113470B2 (en) Preserving and processing ambiguity in natural language
WO2017177809A1 (en) Word segmentation method and system for language text
CN111143571B (en) Entity labeling model training method, entity labeling method and device
CN111291177A (en) Information processing method and device and computer storage medium
CN110795932B (en) Geological report text information extraction method based on geological ontology
CN111274804A (en) Case information extraction method based on named entity recognition
CN108763192B (en) Entity relation extraction method and device for text processing
Ren et al. Detecting the scope of negation and speculation in biomedical texts by using recursive neural network
CN107943786A (en) A kind of Chinese name entity recognition method and system
CN107526721A (en) A kind of disambiguation method and device to electric business product review vocabulary
CN107357895A (en) A kind of processing method of the text representation based on bag of words
CN105159917A (en) Generalization method for converting unstructured information of electronic medical record to structured information
Sagcan et al. Toponym recognition in social media for estimating the location of events
CN114036907A (en) Text data amplification method based on domain features

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant