CN110188359B - Text entity extraction method - Google Patents
- Publication number
- CN110188359B (application CN201910472799.7A)
- Authority
- CN
- China
- Prior art keywords
- entity
- sequence
- subset
- extraction
- regression model
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/20—Natural language analysis
- G06F40/279—Recognition of textual entities
- G06F40/289—Phrasal analysis, e.g. finite state techniques or chunking
- G06F40/295—Named entity recognition
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/30—Semantic analysis
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Health & Medical Sciences (AREA)
- Artificial Intelligence (AREA)
- Audiology, Speech & Language Pathology (AREA)
- Computational Linguistics (AREA)
- General Health & Medical Sciences (AREA)
- Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Machine Translation (AREA)
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
Abstract
The invention discloses a text entity extraction method that exploits the redundancy and repetition of information in a large corpus. It first obtains noisy entities through phrase segmentation and distant supervision, then mines the context sequence patterns (rules) of those entities, so that the input rules for Snorkel are obtained automatically. Using Snorkel's tolerance of noisy labels, it obtains results of better quality than distant supervision alone. The model and the results are corrected iteratively, gradually removing noise and yielding more reliable sequence patterns. The invention uses no labeled samples, which saves manual labor; the input rules for Snorkel are obtained automatically; and distant supervision, rule mining, Snorkel, and the iterative process are combined so that the results improve progressively, noise is removed, and extraction quality improves.
Description
Technical Field
The invention relates to the technical field of natural language processing, and in particular to a few-sample text entity extraction method.
Background
In text information extraction applications, the scenarios are numerous and fine-grained, labeled samples are scarce, and obtaining them is costly. This is the situation faced in industrial applications. Under the model-training approach, popular research directions are the rapid creation of labeled samples and deep models that need fewer samples or tolerate noisier samples. Under the extraction-rule approach, a popular research direction is the rapid mining and construction of extraction rule sets.
Among existing text information extraction methods, model training requires a large number of labeled samples. Although some deep models are trending toward higher accuracy with fewer labeled samples, a certain number of labeled samples is still needed to train a usable model, and work cannot begin until those samples are available. The development cost is thus shifted to sample labeling, and overall development efficiency remains low.
Methods based on extraction rules do not require manually labeling samples directly, but the rules usually have to be tuned using domain knowledge, and a fully rule-based system may need tens of thousands of rules. To reduce the burden of developing rule sets, the mining and automatic generation of rule sets has become a hot research direction.
Snorkel provides a path from rules to models. However, it depends strongly on the accuracy of the rule set, and it does not generate the rules automatically.
Disclosure of Invention
By combining extraction rules with model training, the invention provides an information extraction solution for the case where few labeled samples are available; an extraction model with relatively high accuracy can be obtained without manual intervention.
The purpose of the invention is realized by the following technical scheme: a method of text entity extraction, the method comprising the steps of:
(1) Automatic mining of rule sets, comprising the sub-steps of:
(1.1) performing phrase segmentation on a large corpus to obtain noun phrases;
(1.2) performing entity and entity type recognition on the noun phrases by distant supervision;
(1.3) mining frequently occurring sequence patterns from the results of entity and entity type recognition; in a sequence pattern, if a noun phrase in the original corpus is recognized as an entity, the noun phrase is replaced with its entity type;
(1.4) aggregating synonymous sequence patterns according to the entity types they contain, to obtain a sequence pattern subset A corresponding to each semantic meaning;
(1.5) adjusting the hierarchy of entity types in the sequence pattern subset A corresponding to each semantic meaning: on the aggregation result, counting the entity type hierarchy levels that occur in each subset A, and taking the most frequent level as the entity type hierarchy level of that subset A;
(1.6) for each entity type, finding the sequence patterns containing that type in every subset A, to obtain a sequence pattern subset B corresponding to the entity type;
(2) Generating labeled data: taking the sequence pattern subset B corresponding to each entity type as the input of Snorkel and predicting the label of each sample, namely its entity type, where each label carries a confidence;
(3) Training an entity extraction regression model: training the entity extraction regression model with the confidence-bearing labels, and predicting on the corpus with the trained model to obtain entity recognition results;
(4) Returning to step (1): predicting on the corpus again with the trained entity extraction regression model, using the results to correct the phrase segmentation and distant-supervision entity recognition results obtained in step (1), continuing with the remaining steps, and obtaining the entity extraction regression model and entity recognition results again; this process is repeated until the entity results from step (3) are consistent with those of the previous iteration.
Further, in step (1.1), phrase segmentation is performed with the AutoPhrase method to obtain noun phrases.
Further, in step (1.3), frequently occurring sequence patterns are mined from the results of entity and entity type recognition with the PrefixSpan method.
Further, in step (1.4), the aggregation is performed as follows: a graph is built over the set of sequence patterns, in which each vertex is a sequence pattern; the edge between two patterns is characterized by three features, namely the number of entity types the two patterns share, the number of context words they share, and the number of identical entity extraction results; each edge is assigned a weight by a regression model trained on these three features; and a clustering algorithm is applied to obtain subgraphs, each of which is a sequence pattern subset.
The beneficial effects of the invention are as follows. The invention exploits the redundancy and repetition of information in a large corpus. It first obtains noisy entities through phrase segmentation and distant supervision, then mines the context sequence patterns (rules) of those entities, so that the input rules for Snorkel are obtained automatically. Using Snorkel's tolerance of noisy labels, it obtains results of better quality than distant supervision alone. The model and the results are corrected iteratively, gradually removing noise and yielding more reliable sequence patterns. The invention uses no labeled samples, which saves manual labor; the input rules for Snorkel are obtained automatically; and the method combines distant supervision, rule mining, Snorkel, and an iterative process to improve the results and remove noise progressively, so that the final results are better than those of distant supervision.
Drawings
FIG. 1 is a flow chart of the method of the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the embodiments of the present invention clearer, the technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention. It is to be understood that the embodiments described are only some of the embodiments of the present invention, and not all of them. Other embodiments, which can be derived by one of ordinary skill in the art from the embodiments of the present invention without creative efforts, are also within the scope of the present invention.
In a few-sample scenario, the method automatically mines a rule set from a large number of unlabeled samples, manages the rule set with Snorkel to generate a large amount of noisy labeled data with confidences, and finally uses these data to train an entity extraction regression model.
As shown in fig. 1, the method for extracting text entities provided by the present invention specifically includes the following steps:
First, automatic mining of rule sets
On a large corpus, phrase segmentation is first performed with the AutoPhrase method [AutoPhrase: Automated Phrase Mining from Massive Text Corpora] to obtain noun phrases;
Entity and entity type recognition is then performed on the noun phrases by distant supervision (for English medical texts, the MetaMap tool gives good results);
On the results of entity and entity type recognition, the PrefixSpan method [PrefixSpan: Mining Sequential Patterns Efficiently by Prefix-Projected Pattern Growth] is used to mine frequently occurring sequence patterns. A sequence pattern is an ordinary regular-expression-style template extended with entity types, for example: ($MEDICINE) may be helpful for ($DISEASE), where ($MEDICINE) and ($DISEASE) denote the drug and disease entity types, and the corresponding position in the pattern can be filled by any drug or disease. In a sequence pattern, if a noun phrase in the original corpus is recognized as an entity, the noun phrase is replaced with its entity type, which makes the pattern more general.
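The substitution and mining step can be illustrated with a minimal sketch. It assumes single-token entities and a hand-written `entity_types` dictionary in place of real distant supervision, and it uses a compact PrefixSpan over sequences of single items; the function names, the toy corpus, and the `min_support` value are illustrative and do not come from the patent.

```python
from collections import defaultdict

def generalize(tokens, entity_types):
    """Replace every token recognized as an entity with its entity type."""
    return [entity_types.get(tok, tok) for tok in tokens]

def prefixspan(sequences, min_support):
    """Return {pattern: support} for every subsequence (gaps allowed)
    that occurs in at least min_support sequences."""
    results = {}

    def mine(prefix, projected):
        # projected holds (sequence index, start position) pairs
        occurrences = defaultdict(list)
        for seq_id, start in projected:
            seen = set()
            seq = sequences[seq_id]
            for pos in range(start, len(seq)):
                item = seq[pos]
                if item not in seen:  # keep only the first occurrence per sequence
                    seen.add(item)
                    occurrences[item].append((seq_id, pos + 1))
        for item, next_projected in occurrences.items():
            if len(next_projected) >= min_support:
                pattern = prefix + (item,)
                results[pattern] = len(next_projected)
                mine(pattern, next_projected)

    mine((), [(i, 0) for i in range(len(sequences))])
    return results

# Hypothetical distant-supervision output: token -> entity type
entity_types = {
    "aspirin": "$MEDICINE", "ibuprofen": "$MEDICINE",
    "headache": "$DISEASE", "fever": "$DISEASE",
}
corpus = [
    "aspirin may be helpful for headache",
    "ibuprofen may be helpful for fever",
    "aspirin is cheap",
]
sequences = [generalize(s.split(), entity_types) for s in corpus]
patterns = prefixspan(sequences, min_support=2)
frequent = patterns[("$MEDICINE", "may", "be", "helpful", "for", "$DISEASE")]
```

Because both drug sentences collapse to the same typed sequence, the full template reaches the support threshold even though neither concrete sentence repeats.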
According to the entity types contained in the sequence patterns, synonymous sequence patterns are aggregated to obtain a sequence pattern subset A corresponding to each semantic meaning; the patterns in each subset A express the same semantics. Synonymous sequence patterns are sequence patterns that express the same semantics; for example, "$Person's age is $Digit" and "$Person, $Digit" both express "the person's age is a number". The aggregation is performed as follows: a graph is built over the set of sequence patterns, in which each vertex is a sequence pattern; the edge between two patterns is characterized by three features, namely the number of shared entity types, the number of shared context words, and the number of identical entity extraction results; each edge is assigned a weight by a regression model trained on these three features; and a clustering algorithm [ A procedure for closed detection using the group matrix ] is applied to obtain subgraphs, each of which is a sequence pattern subset A. For the sequence patterns "$Country president $Politician" and "president $Politician of $Country", the shared entity types are $Country and $Politician, so the number of shared entity types is 2; the shared context word is "president", so the number of shared context words is 1. The number of identical entity extraction results is the number of entities that both sequence patterns extract from the corpus; for example, for the $Politician type, the $Politician entities extracted by both patterns are counted.
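A rough sketch of the three edge features and of the grouping step follows. Two simplifications are assumptions made here, not part of the patent: the edge weight is a fixed linear combination with made-up coefficients rather than a trained regression model, and the clustering algorithm is replaced by connected components over edges whose weight reaches a threshold. The extracted entity names are placeholders.

```python
def edge_features(p1, p2, extractions):
    """Shared entity types, shared context words, shared extracted entities."""
    tokens1, tokens2 = p1.split(), p2.split()
    types1 = {t for t in tokens1 if t.startswith("$")}
    types2 = {t for t in tokens2 if t.startswith("$")}
    words1 = {t for t in tokens1 if not t.startswith("$")}
    words2 = {t for t in tokens2 if not t.startswith("$")}
    shared_entities = extractions[p1] & extractions[p2]
    return len(types1 & types2), len(words1 & words2), len(shared_entities)

def cluster(patterns, extractions, weights, threshold):
    """Group patterns linked by edges with weight >= threshold (union-find)."""
    parent = {p: p for p in patterns}

    def find(p):
        while parent[p] != p:
            parent[p] = parent[parent[p]]  # path halving
            p = parent[p]
        return p

    for i, a in enumerate(patterns):
        for b in patterns[i + 1:]:
            feats = edge_features(a, b, extractions)
            score = sum(w * f for w, f in zip(weights, feats))
            if score >= threshold:
                parent[find(a)] = find(b)

    groups = {}
    for p in patterns:
        groups.setdefault(find(p), []).append(p)
    return list(groups.values())

p1 = "$Country president $Politician"
p2 = "president $Politician of $Country"
p3 = "$Person 's age is $Digit"
# Hypothetical $Politician entities each pattern extracted from the corpus
extractions = {p1: {"Alice", "Bob"}, p2: {"Bob", "Carol"}, p3: set()}

features = edge_features(p1, p2, extractions)
clusters = cluster([p1, p2, p3], extractions,
                   weights=(1.0, 0.5, 0.5), threshold=2.0)
```

With these toy inputs the two "president" patterns share two entity types, one context word, and one extracted entity, so they fall into the same subset A, while the age pattern stays on its own.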
The hierarchy of entity types in the sequence pattern subset A corresponding to each semantic meaning is then adjusted. For example, $Country, $State, and $City are all subtypes of $Location, and entity type recognition yields entity types at different hierarchy levels for each noun phrase. On the aggregation result, the entity type hierarchy levels occurring in each subset A are counted, and the most frequent level is taken as the entity type hierarchy level of that subset A.
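One possible reading of this majority rule is sketched below. It assumes a hypothetical `root_of` map that assigns each type to the top-level type of its family, and it rewrites every type of a family to the variant that occurs most often in the subset; the patent does not spell out the exact counting scheme, so this is an interpretation.

```python
from collections import Counter

def unify_hierarchy(patterns, root_of):
    """Rewrite related entity types in one subset A to the most frequent one."""
    counts = Counter(
        tok for p in patterns for tok in p.split() if tok in root_of
    )
    best = {}  # family root -> most frequent type of that family
    for typ, _ in counts.most_common():
        best.setdefault(root_of[typ], typ)
    return [
        " ".join(best[root_of[t]] if t in root_of else t for t in p.split())
        for p in patterns
    ]

# Hypothetical type hierarchy: every location-like type maps to $Location
root_of = {
    "$Location": "$Location", "$Country": "$Location",
    "$State": "$Location", "$City": "$Location",
}
subset_a = [
    "$Politician was born in $Country",
    "$Politician , born in $Country",
    "$Politician is from $Location",
]
unified = unify_hierarchy(subset_a, root_of)
```

Here $Country occurs twice and $Location once, so the third pattern is rewritten to use $Country.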
Through the above process, the sequence pattern subset A corresponding to each semantic meaning is obtained. For each entity type, the sequence patterns containing that type are collected from every subset A to obtain the sequence pattern subset B corresponding to the entity type.
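Building subset B amounts to inverting the subsets A into an index keyed by entity type. A minimal sketch, with illustrative pattern strings and a hypothetical function name:

```python
from collections import defaultdict

def build_subset_b(subsets_a):
    """Map each entity type to every pattern, from any subset A, that contains it."""
    subset_b = defaultdict(set)
    for patterns in subsets_a:
        for pattern in patterns:
            for tok in pattern.split():
                if tok.startswith("$"):
                    subset_b[tok].add(pattern)
    return dict(subset_b)

subsets_a = [
    ["$Country president $Politician", "president $Politician of $Country"],
    ["$Politician was born in $City"],
]
subset_b = build_subset_b(subsets_a)
```

A pattern that mentions several entity types appears in the subset B of each of them.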
Second, generate labeled data
The sequence pattern subset B corresponding to each entity type is taken as the input of Snorkel, which predicts the label of each sample, namely its entity type; each label carries a confidence.
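The idea of turning each pattern in subset B into a labeling function can be sketched without the Snorkel library. In the sketch below, each pattern becomes a regular expression, and the votes are combined by a plain vote fraction. That vote fraction is only a stand-in for Snorkel's label model, which estimates the accuracy of each labeling function instead of counting votes. The helper names, the patterns, and the assumption that the target type occurs once per pattern are illustrative.

```python
import re

ABSTAIN = None

def make_lf(pattern, target_type):
    """Build a labeling function from a sequence pattern.
    Assumes target_type occurs exactly once in the pattern."""
    parts = []
    for tok in pattern.split():
        if tok == target_type:
            parts.append(r"(?P<target>\w+)")  # slot being labeled
        elif tok.startswith("$"):
            parts.append(r"\w+")              # any other entity slot
        else:
            parts.append(re.escape(tok))      # literal context word
    regex = re.compile(r"\b" + r"\s+".join(parts) + r"\b")

    def lf(sentence, candidate):
        for match in regex.finditer(sentence):
            if match.group("target") == candidate:
                return target_type
        return ABSTAIN

    return lf

def vote(lfs, sentence, candidate):
    """Return (label, confidence); confidence is the share of LFs that agree."""
    votes = [v for v in (lf(sentence, candidate) for lf in lfs) if v is not ABSTAIN]
    if not votes:
        return ABSTAIN, 0.0
    label = max(set(votes), key=votes.count)
    return label, votes.count(label) / len(lfs)

disease_lfs = [
    make_lf("$MEDICINE may be helpful for $DISEASE", "$DISEASE"),
    make_lf("$MEDICINE is used to treat $DISEASE", "$DISEASE"),
]
sentence = "aspirin may be helpful for headache"
labeled = vote(disease_lfs, sentence, "headache")
unlabeled = vote(disease_lfs, sentence, "aspirin")
```

Only the first labeling function fires on this sentence, so "headache" is labeled $DISEASE with a confidence of 0.5, and "aspirin" receives no $DISEASE label.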
Third, training the entity extraction regression model
The entity extraction regression model is trained with the confidence-bearing labels, and the trained model is then used to predict on the corpus to obtain entity recognition results.
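The patent does not specify the form of the regression model. As one hedged illustration, the sketch below trains a logistic regression by stochastic gradient descent and scales each sample's gradient by its label confidence, so that low-confidence labels influence the model less. The binary features, the data, and the hyperparameters are invented for the example.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def predict(w, b, x):
    """Probability that the candidate described by features x is an entity."""
    return sigmoid(sum(wj * xj for wj, xj in zip(w, x)) + b)

def train_weighted_logreg(X, y, conf, epochs=200, lr=0.5):
    """Logistic regression where each sample's update is scaled by its confidence."""
    w = [0.0] * len(X[0])
    b = 0.0
    for _ in range(epochs):
        for xi, yi, ci in zip(X, y, conf):
            grad = ci * (predict(w, b, xi) - yi)  # confidence-weighted error
            w = [wj - lr * grad * xj for wj, xj in zip(w, xi)]
            b -= lr * grad
    return w, b

# Hypothetical binary context features for four candidate phrases
X = [[1, 0], [1, 0], [0, 1], [0, 1]]
y = [1, 1, 0, 0]             # 1 = labeled as the entity type, 0 = not
conf = [0.9, 0.8, 0.9, 0.3]  # label confidences from the labeling step
w, b = train_weighted_logreg(X, y, conf)
p_entity = predict(w, b, [1, 0])
p_other = predict(w, b, [0, 1])
```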
Fourth, the method returns to the first step: the trained entity extraction regression model is used to predict on the corpus again, the results are used to correct the phrase segmentation and distant-supervision entity recognition results obtained in the first step, the remaining steps are carried out, and the entity extraction regression model and entity recognition results are obtained again. This process is repeated until the entity results from the third step are consistent with those of the previous iteration.
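The stopping rule can be sketched as a fixed-point loop. The `run_pipeline` callable stands for one full pass through the first three steps and is hypothetical, as is the `max_rounds` safety cap, which the patent does not mention.

```python
def iterate_until_stable(corpus, initial_entities, run_pipeline, max_rounds=10):
    """Repeat the pipeline until the entity results stop changing."""
    previous = initial_entities
    for round_no in range(1, max_rounds + 1):
        current = run_pipeline(corpus, previous)
        if current == previous:      # same results as the previous iteration
            return current, round_no
        previous = current
    return previous, max_rounds

def toy_pipeline(corpus, entities):
    """Stand-in for steps one to three: drops one known noisy entity."""
    return entities - {"the"}

entities, rounds = iterate_until_stable(
    corpus=["aspirin may be helpful for headache"],
    initial_entities={"aspirin", "the"},
    run_pipeline=toy_pipeline,
)
```

In this toy run the first pass removes the noisy entity and the second pass reproduces the same set, so the loop stops after two rounds.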
One skilled in the art can, using the teachings of the present invention, readily make various changes and modifications to the invention without departing from the spirit and scope of the invention as defined by the appended claims. Any modifications and equivalent variations of the above-described embodiments, which are made in accordance with the technical spirit and substance of the present invention, fall within the scope of protection of the present invention as defined in the claims.
Claims (4)
1. A text entity extraction method is characterized by comprising the following steps:
(1) Automatic mining of rule sets, comprising the sub-steps of:
(1.1) performing phrase segmentation on a large corpus to obtain noun phrases;
(1.2) performing entity and entity type recognition on the noun phrases by distant supervision;
(1.3) mining frequently occurring sequence patterns from the results of entity and entity type recognition; in a sequence pattern, if a noun phrase in the original corpus is recognized as an entity, the noun phrase is replaced with its entity type;
(1.4) aggregating synonymous sequence patterns according to the entity types they contain, to obtain a sequence pattern subset A corresponding to each semantic meaning;
(1.5) adjusting the hierarchy of entity types in the sequence pattern subset A corresponding to each semantic meaning: on the aggregation result, counting the entity type hierarchy levels that occur in each subset A, and taking the most frequent level as the entity type hierarchy level of that subset A;
(1.6) for each entity type, finding the sequence patterns containing that type in every subset A, to obtain a sequence pattern subset B corresponding to the entity type;
(2) Generating labeled data: taking the sequence pattern subset B corresponding to each entity type as the input of Snorkel and predicting the label of each sample, namely its entity type, where each label carries a confidence;
(3) Training an entity extraction regression model: training the entity extraction regression model with the confidence-bearing labels, and predicting on the corpus with the trained model to obtain entity recognition results;
(4) Returning to step (1): predicting on the corpus again with the trained entity extraction regression model, using the results to correct the phrase segmentation and distant-supervision entity recognition results obtained in step (1), continuing with the remaining steps, and obtaining the entity extraction regression model and entity recognition results again; this process is repeated until the entity results from step (3) are consistent with those of the previous iteration.
2. The text entity extraction method according to claim 1, wherein in step (1.1), phrase segmentation is performed with the AutoPhrase method to obtain noun phrases.
3. The text entity extraction method according to claim 1, wherein in step (1.3), frequently occurring sequence patterns are mined from the results of entity and entity type recognition with the PrefixSpan method.
4. The text entity extraction method according to claim 1, wherein in step (1.4), the aggregation is performed as follows: a graph is built over the set of sequence patterns, in which each vertex is a sequence pattern; the edge between two patterns is characterized by three features, namely the number of entity types the two patterns share, the number of context words they share, and the number of identical entity extraction results; each edge is assigned a weight by a regression model trained on these three features; and a clustering algorithm is applied to obtain subgraphs, each of which is a sequence pattern subset.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201910472799.7A CN110188359B (en) | 2019-05-31 | 2019-05-31 | Text entity extraction method |
Publications (2)
Publication Number | Publication Date |
---|---|
CN110188359A | 2019-08-30 |
CN110188359B | 2023-01-03 |
Family
ID=67719618
Family Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201910472799.7A Active CN110188359B (en) | 2019-05-31 | 2019-05-31 | Text entity extraction method |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN110188359B (en) |
Families Citing this family (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN111325350B (en) * | 2020-02-19 | 2023-09-29 | 第四范式(北京)技术有限公司 | Suspicious tissue discovery system and method |
CN113255356B (en) * | 2021-06-10 | 2021-09-28 | 杭州费尔斯通科技有限公司 | Entity recognition method and device based on entity word list |
CN113204643B (en) * | 2021-06-23 | 2021-11-02 | 北京明略软件系统有限公司 | Entity alignment method, device, equipment and medium |
CN114093469A (en) * | 2021-07-27 | 2022-02-25 | 北京好欣晴移动医疗科技有限公司 | Internet medical scheme recommendation method, device and system |
Citations (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN106776711A (en) * | 2016-11-14 | 2017-05-31 | 浙江大学 | A kind of Chinese medical knowledge mapping construction method based on deep learning |
CN107291687A (en) * | 2017-04-27 | 2017-10-24 | 同济大学 | It is a kind of based on interdependent semantic Chinese unsupervised open entity relation extraction method |
Family Cites Families (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US9836453B2 (en) * | 2015-08-27 | 2017-12-05 | Conduent Business Services, Llc | Document-specific gazetteers for named entity recognition |
US10691976B2 (en) * | 2017-11-16 | 2020-06-23 | Accenture Global Solutions Limited | System for time-efficient assignment of data to ontological classes |
Non-Patent Citations (1)
Title |
---|
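A Survey of Entity Relation Extraction Research (实体关系抽取研究综述); Liu Shaoyu (刘绍毓) et al.; Journal of Information Engineering University (《信息工程大学学报》); 2016-10-31; Vol. 17, No. 5; pp. 542-547 * |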
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN109388795B (en) | Named entity recognition method, language recognition method and system | |
CN110188359B (en) | Text entity extraction method | |
CN107480125B (en) | Relation linking method based on knowledge graph | |
CN104199972B (en) | A kind of name entity relation extraction and construction method based on deep learning | |
CN110413787B (en) | Text clustering method, device, terminal and storage medium | |
CN108334495A (en) | Short text similarity calculating method and system | |
CN109635297B (en) | Entity disambiguation method and device, computer device and computer storage medium | |
CN107315737A (en) | A kind of semantic logic processing method and system | |
CN104881458B (en) | A kind of mask method and device of Web page subject | |
US11113470B2 (en) | Preserving and processing ambiguity in natural language | |
CN110110327A (en) | A kind of text marking method and apparatus based on confrontation study | |
CN111143571B (en) | Entity labeling model training method, entity labeling method and device | |
CN107463553A (en) | For the text semantic extraction, expression and modeling method and system of elementary mathematics topic | |
WO2017177809A1 (en) | Word segmentation method and system for language text | |
CN104462053A (en) | Inner-text personal pronoun anaphora resolution method based on semantic features | |
CN110795932B (en) | Geological report text information extraction method based on geological ontology | |
CN111291177A (en) | Information processing method and device and computer storage medium | |
CN111274804A (en) | Case information extraction method based on named entity recognition | |
CN112347761B (en) | BERT-based drug relation extraction method | |
CN107943786A (en) | A kind of Chinese name entity recognition method and system | |
CN108763192B (en) | Entity relation extraction method and device for text processing | |
Ren et al. | Detecting the scope of negation and speculation in biomedical texts by using recursive neural network | |
CN107526721A (en) | A kind of disambiguation method and device to electric business product review vocabulary | |
CN107357895A (en) | A kind of processing method of the text representation based on bag of words | |
CN105159917A (en) | Generalization method for converting unstructured information of electronic medical record to structured information |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||
GR01 | Patent grant |