CN110032649A

CN110032649A - Method and device for extracting relationships between entities in traditional Chinese medicine literature

Info

Publication number: CN110032649A
Application number: CN201910293263.9A
Authority: CN
Inventors: 张德政; 付雅慧; 谢永红; 阿孜古丽; 刘宏岚; 栗辉; 田款阳
Original assignee: University of Science and Technology Beijing USTB
Current assignee: University of Science and Technology Beijing USTB
Priority date: 2019-04-12
Filing date: 2019-04-12
Publication date: 2019-07-19
Anticipated expiration: 2039-04-12
Also published as: CN110032649B

Abstract

The present invention provides a method and a device for extracting relationships between entities in traditional Chinese medicine (TCM) literature, which can improve the accuracy of extracting relationship types between entities. The method includes: for a TCM document to be processed, acquiring the entity types and inter-entity relationship types that have been labeled on part of its content; training a named entity recognition model according to the labeled entity types; performing named entity recognition on the TCM document to be processed by using the trained named entity recognition model and, according to the named entity recognition result, obtaining candidate entity pairs having a relationship, together with a feature table; according to the obtained candidate entity pairs and feature table, performing statistical inference on graph probabilities with a factor graph model and globally learning entity relationship features, to obtain the probability that a relationship exists between entities; and, according to the obtained probability, determining the type of the relationship between the entities by combining a method of extracting fact triples through dependency analysis with the labeled inter-entity relationship types. The present invention relates to the field of knowledge engineering.

Description

Method and device for extracting relationships between entities in traditional Chinese medicine literature

Technical Field

The invention relates to the field of knowledge engineering, in particular to a method and a device for extracting relationships between entities in traditional Chinese medicine documents.

Background

A large body of ancient books and documents on traditional Chinese medicine has been handed down in China, and these works are the foundation for studying traditional Chinese medicine. However, most of them are written in an archaic style and consist of unstructured text, which makes them very time-consuming to use. If the entities and the relationships between them can be extracted from traditional Chinese medicine literature, the extracted relationships can be used to perform information retrieval, knowledge mining, and similar tasks efficiently.

Entity relationship extraction methods in the prior art have difficulty accurately extracting the relationships between entities from unstructured text.

Disclosure of Invention

The invention aims to provide a method and a device for extracting relationships between entities in traditional Chinese medicine documents, so as to solve the problem that the relationships between the entities are difficult to accurately extract from unstructured texts in the prior art.

In order to solve the above technical problems, an embodiment of the present invention provides a method for extracting relationships between entities in a traditional Chinese medicine document, including:

for a traditional Chinese medicine document to be processed, acquiring the entity types and inter-entity relationship types that have been labeled on part of its content;

training a named entity recognition model according to the labeled entity types;

performing named entity recognition on the traditional Chinese medicine document to be processed by using the trained named entity recognition model, and obtaining, according to the named entity recognition result, candidate entity pairs having a relationship, together with a feature table;

according to the obtained candidate entity pairs and feature table, performing statistical inference on graph probabilities with a factor graph model and globally learning entity relationship features, to obtain the probability that a relationship exists between entities;

and, according to the obtained probability that a relationship exists between entities, determining the type of the relationship between the entities by combining a method of extracting fact triples through dependency analysis with the labeled inter-entity relationship types.

Further, the training of the named entity recognition model according to the labeled entity types includes:

according to the marked entity type, using a natural language processing tool to train a named entity recognition model to obtain the named entity recognition model suitable for the traditional Chinese medicine literature;

and integrating the obtained named entity recognition model suitable for the traditional Chinese medicine literature into a natural language processing tool, replacing the original named entity recognition model, packaging and compiling.

Further, performing named entity recognition on the traditional Chinese medicine document to be processed by using the trained named entity recognition model, and obtaining candidate entity pairs having a relationship, together with a feature table, according to the named entity recognition result includes:

performing named entity recognition on the traditional Chinese medicine document to be processed by using the trained named entity recognition model;

performing a Cartesian product operation on the recognized entities to obtain candidate entity pairs;

extracting text features of the entities in the candidate entity pairs, obtaining the named entity recognition results of the context of the candidate entities, and forming a feature table;

and determining, for part of the candidate entity pairs, whether a relationship exists between the two entities.
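The candidate-pair and feature-table steps above can be sketched roughly as follows. This is only an illustration under stated assumptions, not the implementation described in the patent: the token list, the tag names (`ETIOLOGY`, `LOCATION`, `O`), the one-token context window, and the helper names `candidate_pairs` and `context_features` are all hypothetical.

```python
from itertools import product

# Hypothetical NER output for one sentence: (token, tag) pairs; "O" means "not an entity".
tagged = [("wind", "ETIOLOGY"), ("invades", "O"), ("lung", "LOCATION"), ("then", "O")]

def candidate_pairs(tagged):
    """Cartesian product of the recognized entity positions, excluding self-pairs."""
    idx = [i for i, (_, tag) in enumerate(tagged) if tag != "O"]
    return [(i, j) for i, j in product(idx, repeat=2) if i != j]

def context_features(tagged, i, window=1):
    """NER tags of the tokens within `window` positions left and right of token i."""
    left = [tagged[k][1] for k in range(max(0, i - window), i)]
    right = [tagged[k][1] for k in range(i + 1, min(len(tagged), i + 1 + window))]
    return {"left_tags": left, "right_tags": right}

pairs = candidate_pairs(tagged)

# One feature-table row per ordered candidate pair (assumed row layout).
feature_table = [
    {
        "e1": tagged[i][0],
        "e2": tagged[j][0],
        "e1_ctx": context_features(tagged, i),
        "e2_ctx": context_features(tagged, j),
    }
    for i, j in pairs
]
```

Because the product is taken over ordered positions, both (wind, lung) and (lung, wind) appear as candidates, which matches the idea that the direction of a pair is decided later rather than at candidate generation time.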

Further, determining the type of the relationship between the entities according to the obtained probability that a relationship exists between entities, by combining the method of extracting fact triples through dependency analysis with the labeled inter-entity relationship types, includes:

acquiring the entity pairs whose probability of having a relationship is greater than a preset threshold, analyzing the sentences containing those entity pairs by using a dependency analysis method, and extracting fact triples with verbs as cores;

constructing fact triples with a predicate verb as the core by analyzing the grammatical relations of the sentence;

and determining the type of the relationship between the entities according to the predicate verb between the entities of a pair and the labeled inter-entity relationship types.

The embodiment of the present invention further provides a device for extracting relationships between entities in a traditional Chinese medicine document, including:

the acquisition module is used for acquiring, for a traditional Chinese medicine document to be processed, the entity types and inter-entity relationship types that have been labeled on part of its content;

the training module is used for training a named entity recognition model according to the labeled entity types;

the recognition module is used for performing named entity recognition on the traditional Chinese medicine document to be processed by using the trained named entity recognition model, and for obtaining, according to the named entity recognition result, candidate entity pairs having a relationship, together with a feature table;

the determining module is used for performing statistical inference on graph probabilities with a factor graph model according to the obtained candidate entity pairs and feature table, and for globally learning entity relationship features, to obtain the probability that a relationship exists between entities;

and the extraction module is used for determining the type of the relationship between the entities, according to the obtained probability that a relationship exists between entities, by combining the method of extracting fact triples through dependency analysis with the labeled inter-entity relationship types.

Further, the training module comprises:

the training unit is used for carrying out named entity recognition model training by using a natural language processing tool according to the marked entity types to obtain a named entity recognition model suitable for the traditional Chinese medicine literature;

and the replacing unit is used for integrating the obtained named entity recognition model suitable for the traditional Chinese medicine literature into a natural language processing tool, replacing the original named entity recognition model, packaging and compiling.

Further, the identification module includes:

the recognition unit is used for carrying out named entity recognition on the traditional Chinese medicine document to be processed by utilizing the trained named entity recognition model;

the operation unit is used for carrying out Cartesian product operation on the identified entities to obtain candidate entity pairs;

the forming unit is used for extracting text characteristics of the entities in the candidate entity pair to obtain a named entity recognition result of the contexts of the candidate entities and form a characteristic table;

and the first determining unit is used for determining, for part of the candidate entity pairs, whether a relationship exists between the two entities.

Further, the extraction module comprises:

the analysis unit is used for acquiring the entity pairs whose probability of having a relationship is greater than a preset threshold, analyzing the sentences containing those entity pairs by using a dependency analysis method, and extracting fact triples with verbs as cores;

the construction unit is used for constructing fact triples with a predicate verb as the core by analyzing the grammatical relations of the sentence;

and the second determining unit is used for determining the type of the relationship between the entities according to the predicate verb between the entities of a pair and the labeled inter-entity relationship types.

The technical scheme of the invention has the following beneficial effects:

in the scheme, for a traditional Chinese medicine document to be processed, the entity types and inter-entity relationship types labeled on part of its content are acquired; a named entity recognition model is trained according to the labeled entity types; named entity recognition is performed on the document by using the trained model, and candidate entity pairs having a relationship, together with a feature table, are obtained according to the recognition result; according to the obtained candidate entity pairs and feature table, statistical inference on graph probabilities is performed with a factor graph model and entity relationship features are learned globally, yielding the probability that a relationship exists between entities; and, according to that probability, the type of the relationship between the entities is determined by combining the method of extracting fact triples through dependency analysis with the labeled inter-entity relationship types. The probability that a relationship exists between entities is thus combined with the dependency analysis method of natural language processing, and the relationship type is determined from the extracted fact triples and the labeled relationship types, which improves the accuracy of relationship type extraction and allows the content of traditional Chinese medicine literature to be expressed clearly and in structured form.

Drawings

FIG. 1 is a schematic flow chart of a method for extracting relationships between entities in traditional Chinese medicine literature according to an embodiment of the present invention;

FIG. 2 is a schematic diagram of an entity recognition result according to an embodiment of the present invention;

FIG. 3 is a schematic diagram of candidate entity pairs according to an embodiment of the present invention;

FIG. 4 is a schematic diagram of a feature table according to an embodiment of the present invention;

FIG. 5 is a schematic diagram of labels indicating whether a relationship exists between candidate entity pairs according to an embodiment of the present invention;

FIG. 6 is a schematic diagram of the probability results for relationships between entities according to an embodiment of the present invention;

FIG. 7 is a schematic diagram of the finally formed relationships between entities according to an embodiment of the present invention;

FIG. 8 is a schematic structural diagram of a device for extracting relationships between entities in traditional Chinese medicine literature according to an embodiment of the present invention.

Detailed Description

In order to make the technical problems, technical solutions and advantages of the present invention more apparent, the following detailed description is given with reference to the accompanying drawings and specific embodiments.

The invention provides a method and a device for extracting relationships between entities in traditional Chinese medicine documents, aiming at the problem that the relationships between the entities are difficult to extract from unstructured texts.

Example one

As shown in FIG. 1, the method for extracting relationships between entities in traditional Chinese medicine literature according to an embodiment of the present invention includes:

S101, for a traditional Chinese medicine document to be processed, acquiring the entity types and inter-entity relationship types that have been labeled on part of its content;

S102, training a named entity recognition model according to the labeled entity types;

S103, performing named entity recognition on the traditional Chinese medicine document to be processed by using the trained named entity recognition model, and obtaining, according to the named entity recognition result, candidate entity pairs having a relationship, together with a feature table;

S104, according to the obtained candidate entity pairs and feature table, performing statistical inference on graph probabilities with a factor graph model and globally learning entity relationship features, to obtain the probability that a relationship exists between entities;

and S105, according to the obtained probability that a relationship exists between entities, determining the type of the relationship between the entities by combining the method of extracting fact triples through dependency analysis with the labeled inter-entity relationship types.

In the method for extracting relationships between entities in traditional Chinese medicine literature according to this embodiment of the invention, for a traditional Chinese medicine document to be processed, the entity types and inter-entity relationship types labeled on part of its content are acquired; a named entity recognition model is trained according to the labeled entity types; named entity recognition is performed on the document by using the trained model, and candidate entity pairs having a relationship, together with a feature table, are obtained according to the recognition result; according to the obtained candidate entity pairs and feature table, statistical inference on graph probabilities is performed with a factor graph model and entity relationship features are learned globally, yielding the probability that a relationship exists between entities; and, according to that probability, the type of the relationship between the entities is determined by combining the method of extracting fact triples through dependency analysis with the labeled inter-entity relationship types. The probability that a relationship exists between entities is thus combined with the dependency analysis method of natural language processing, and the relationship type is determined from the extracted fact triples and the labeled relationship types, which improves the accuracy of relationship type extraction and allows the content of traditional Chinese medicine literature to be expressed clearly and in structured form.

In this embodiment, extracting the relationships between entities also lays a foundation for constructing knowledge graphs and intelligent auxiliary diagnosis and treatment systems in the field of traditional Chinese medicine, and is an indispensable step.

In this embodiment, before S101, the main traditional Chinese medicine entity types and inter-entity relationship types of the document to be processed may be determined according to its specific content, and 20% of the content may be labeled with entity types and inter-entity relationship types.
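Choosing which 20% of the content to label can be sketched as below. The source only states the proportion; random sampling at sentence granularity, the fixed seed, and the function name `select_for_annotation` are assumptions made for this illustration.

```python
import random

def select_for_annotation(sentences, fraction=0.2, seed=0):
    """Pick a reproducible random subset of sentences to label by hand.

    Returns (to_label, unlabeled), both in original document order.
    """
    if not sentences:
        return [], []
    k = max(1, round(len(sentences) * fraction))
    rng = random.Random(seed)  # fixed seed so the split can be reproduced
    chosen = set(rng.sample(range(len(sentences)), k))
    to_label = [s for i, s in enumerate(sentences) if i in chosen]
    unlabeled = [s for i, s in enumerate(sentences) if i not in chosen]
    return to_label, unlabeled

sentences = [f"sentence {n}" for n in range(10)]
to_label, unlabeled = select_for_annotation(sentences)
```

A contiguous 20% (for example, the first chapters) would satisfy the description equally well; sampling is used here only because it tends to cover more of the vocabulary.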

In a specific implementation of the foregoing method for extracting relationships between entities in traditional Chinese medicine literature, further, the training of a named entity recognition model according to the labeled entity types includes:

In this embodiment, according to the labeled entity types, the Stanford natural language processing tool (DeepDive) may be used to train a named entity recognition model, so as to obtain a named entity recognition model suitable for traditional Chinese medicine literature; this model is integrated into DeepDive, replaces the original named entity recognition model in DeepDive, and is then packaged and compiled.

In this embodiment, DeepDive is a Stanford information extraction framework for natural language processing; it is mainly used to extract information from modern texts and to extract relationships among people, organizations, and places.

In a specific implementation of the foregoing method for extracting relationships between entities in traditional Chinese medicine literature, the step of performing named entity recognition on the document to be processed by using the trained named entity recognition model and obtaining candidate entity pairs having a relationship, together with a feature table, according to the named entity recognition result includes:

In this embodiment, S103 mainly performs data preparation and prepares three kinds of data: the candidate entity pairs, the feature table, and labels indicating whether a relationship exists between the two entities in part of the candidate entity pairs. Specifically:

S1031, performing named entity recognition on the traditional Chinese medicine document to be processed by using DeepDive with the new named entity recognition model integrated, and performing a Cartesian product operation on the recognized entities to obtain candidate entity pairs;

in this embodiment, an entity pair is a pair of two entities; for example, entity A and entity B form the entity pair (A, B).

S1032, extracting text features of the entities in the candidate entity pairs, obtaining the named entity recognition results of the context of the candidate entities, and forming a feature table;

S1033, labeling part (for example, 20%) of the candidate entity pairs: a candidate entity pair with a relationship is labeled true, and a candidate entity pair without a relationship is labeled false. Rules can also be specified to assist annotation; for example, if A and B have a relationship, then B and A also have a relationship. Such rules reduce the workload of manual annotation. The labeled data serves as prior knowledge for probabilistic model learning. At this point the required data is prepared, providing the basis for the subsequent construction of the probabilistic model.
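The annotation-assisting rule in S1033 (if A and B have a relationship, then B and A also do) can be illustrated with a small helper. The dictionary layout and the name `propagate_symmetric` are hypothetical; only the rule itself comes from the text above.

```python
def propagate_symmetric(labels):
    """Apply the rule 'if (A, B) has a relationship, then (B, A) also has one'.

    `labels` maps ordered entity pairs to True/False. Only true labels are
    mirrored, and an existing manual label for (B, A) is never overwritten.
    """
    expanded = dict(labels)
    for (a, b), has_relation in labels.items():
        if has_relation:
            expanded.setdefault((b, a), True)
    return expanded

# Hypothetical manual labels for two candidate pairs.
manual = {("wind", "lung"): True, ("cold", "stomach"): False}
all_labels = propagate_symmetric(manual)
```

Here two manual labels yield three labeled pairs; the false label is deliberately not mirrored because the stated rule only covers pairs that do have a relationship.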

In this embodiment, a factor graph model is used to learn the probability that a relationship exists between entities and to construct a probabilistic model. Specifically, according to the obtained candidate entity pairs having a relationship and the feature table, statistical inference on graph probabilities is performed with the factor graph model, and entity relationship features are learned globally to form a probabilistic model of the relationships between entities; this model is used to determine the probability that a relationship exists between entities.

In this embodiment, a factor graph is the bipartite graph obtained by factorizing a global function of multiple variables into a product of several local functions.
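To make the factorization concrete, the sketch below computes exact marginal probabilities for a tiny binary factor graph by enumerating every assignment and multiplying the local factors. It is a toy under stated assumptions: the source does not say which inference algorithm or factor shapes are used, and the variable names, the weights, and the function `marginals` are hypothetical. In practice the weights would be learned from the labeled pairs and the feature table rather than set by hand.

```python
from itertools import product
from math import exp

def marginals(unary, pairwise):
    """Exact marginals P(x_v = 1) of a small binary factor graph by enumeration.

    unary:    {var: weight}, local factor value exp(weight * x_var)
    pairwise: {(var_i, var_j): weight}, local factor value exp(weight)
              when x_i == x_j, and 1 otherwise
    """
    names = sorted(unary)
    z = 0.0                          # normalizing constant
    mass = {v: 0.0 for v in names}   # unnormalized mass where x_v == 1
    for values in product([0, 1], repeat=len(names)):
        x = dict(zip(names, values))
        score = sum(unary[v] * x[v] for v in names)
        score += sum(w for (i, j), w in pairwise.items() if x[i] == x[j])
        p = exp(score)               # product of all local factors
        z += p
        for v in names:
            mass[v] += p * x[v]
    return {v: mass[v] / z for v in names}

# Hypothetical weights: evidence for rel(wind,lung), none for rel(lung,wind),
# plus a symmetry factor that couples the two variables.
unary = {"rel(wind,lung)": 2.0, "rel(lung,wind)": 0.0}
probs = marginals(unary, {("rel(wind,lung)", "rel(lung,wind)"): 1.5})
independent = marginals(unary, {})
```

Without the coupling factor, the pair with no evidence stays at probability 0.5; with it, the evidence for the reverse pair pulls that probability upward, which is one way to read "learning entity relationship features globally". Enumeration is exponential in the number of variables, so a real system would need approximate inference.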

In a specific implementation of the foregoing method for extracting relationships between entities in traditional Chinese medicine literature, further, determining the type of the relationship between the entities according to the obtained probability that a relationship exists between entities, by combining the method of extracting fact triples through dependency analysis with the labeled inter-entity relationship types, includes:

In this embodiment, after the probability that a relationship exists between entities is obtained, determining the type of the relationship between the entities by combining the method of extracting fact triples through dependency analysis with the labeled inter-entity relationship types may specifically include the following steps: for entity pairs whose probability of having a relationship is higher than a preset threshold (for example, 0.8), analyzing the sentences in which the entity pairs occur by using a dependency analysis method, and extracting fact triples with verbs as cores; constructing fact triples with a predicate verb as the core by analyzing grammatical relations of the sentence, such as the subject-verb-object relation or a subject-verb-complement relation containing a preposition-object relation; and determining the type of the relationship between the entities according to the predicate verb between the entities of a pair and the inter-entity relationship types labeled in step S101, this type being taken as the final result for the relationship between the entities.
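The thresholding and triple-construction steps can be sketched on a hand-written dependency parse. Everything here is illustrative: the arc format `(dependent, relation, head)`, the labels `SBV` (subject-verb) and `VOB` (verb-object), the example sentences, and both function names are assumptions, and a real system would obtain the arcs from a dependency parser rather than from literals.

```python
def extract_triples(arcs):
    """Collect (subject, predicate verb, object) triples from SBV/VOB arcs
    that share the same head verb."""
    subjects, objects = {}, {}
    for dependent, relation, head in arcs:
        if relation == "SBV":
            subjects.setdefault(head, []).append(dependent)
        elif relation == "VOB":
            objects.setdefault(head, []).append(dependent)
    return [
        (s, verb, o)
        for verb in subjects if verb in objects
        for s in subjects[verb]
        for o in objects[verb]
    ]

def triples_for_confident_pairs(pair_probs, parses, threshold=0.8):
    """Parse only the sentences of pairs whose probability exceeds the threshold,
    keeping the triples whose subject and object are exactly that pair."""
    result = {}
    for pair, p in pair_probs.items():
        if p > threshold:
            result[pair] = [t for t in extract_triples(parses[pair])
                            if (t[0], t[2]) == pair]
    return result

# Hypothetical parses of the sentences in which each candidate pair occurs.
parses = {
    ("wind", "lung"): [("wind", "SBV", "invades"), ("lung", "VOB", "invades")],
    ("cold", "lung"): [("cold", "SBV", "hurts"), ("lung", "VOB", "hurts")],
}
pair_probs = {("wind", "lung"): 0.93, ("cold", "lung"): 0.41}
found = triples_for_confident_pairs(pair_probs, parses)
```

Only the pair above the 0.8 threshold is parsed, so the low-probability pair never produces a triple. Handling the subject-verb-complement pattern with a preposition-object relation would need additional arc labels that are not modeled in this sketch.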

In this embodiment, the dependency analysis method decomposes sentences into triples; that is, a sentence is expressed by entities and the relationship between them, so that its meaning can be represented in structured form, laying a foundation for the future construction of a knowledge graph.

In summary, this embodiment adapts the Stanford natural language processing tool into an information extraction method suitable for traditional Chinese medicine literature and combines it with dependency analysis, thereby providing a method for extracting relationships between entities in traditional Chinese medicine literature that can analyze unstructured traditional Chinese medicine literature, give it structure, and improve the accuracy of extracting relationship types between entities.

For a better understanding of the method for extracting relationships between entities in traditional Chinese medicine literature according to the embodiment of the present invention, the method is described in detail below, taking the book "Dialectics of Pathogenesis of Traditional Chinese Medicine" as an example. It specifically includes the following steps:

First, labeling the entity types and inter-entity relationship types for part of the content (for example, 20%) of "Dialectics of Pathogenesis of Traditional Chinese Medicine", and acquiring the labeled entity types and inter-entity relationship types.

In this embodiment, the entity types include etiology (by), disease location (bw), and manifestation (bx). Etiology entities include wind, cold, fire, heat, and yin; disease location entities include lung, collaterals, stomach, spleen, intestinal tract, and small intestine; manifestation entities include loss of lung qi, unclear lung qi, loss of lung clearing and moistening, and phlegm-heat in the interior.

In this embodiment, the relationships between entities in disease evolution can be classified into six categories: a combination relationship (between etiologies), an invasion relationship (from etiology to disease location), an invaded relationship, a change relationship (of disease locations and of etiologies), an occurrence relationship, and a cause-effect relationship; wherein,

the combination relationship (between etiologies) is mainly indicated by verbs such as combination, hold, clip, meet, and beat;

the invasion relationship (etiology to disease location) is mainly indicated by verbs such as invasion, consumption, diffusion, burning, decoction, invasion, injury, middle-jiao, disturbance, impact, obstruction, flow and injury;

the invaded relationship is mainly indicated by passive markers (words meaning "to be subjected to" or "by");

the change relationship for disease locations is mainly indicated by verbs such as depression, loss, stagnation, coagulation, clear, adverse, blockage, stasis, disorder, movement and closure; the change relationship for etiologies is mainly indicated by verbs such as paranoid, exuberance, congestion, coagulation, exuberance, depression, and tenuation;

the occurrence relationship is mainly indicated by verbs such as transformation, generation, transformation, expression, formation, seeing, transfer, implication, and brewing;

the cause-effect relationship is mainly indicated by verbs such as cause, then, become, be, have, cause, even, and appear.
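One way to use these cue verbs is a lookup table from predicate verb to relationship category, optionally checked against the entity types of the pair. The sketch below is hypothetical throughout: the English verb glosses stand in for the original Chinese verbs, only a few verbs per category are listed, the invaded (passive) category is omitted, and treating the stated entity-type pairings as hard constraints is an assumption rather than something the text prescribes.

```python
# Hypothetical English glosses of a few cue verbs per relationship category.
CUE_VERBS = {
    "combination": {"combine", "hold", "clip"},
    "invasion": {"invade", "consume", "burn"},
    "change": {"stagnate", "block"},
    "occurrence": {"transform", "generate"},
    "cause-effect": {"cause", "become"},
}

# Inverse index: predicate verb -> relationship category.
VERB_TO_RELATION = {verb: rel for rel, verbs in CUE_VERBS.items() for verb in verbs}

# Entity-type pairings stated for two of the categories
# (by = etiology, bw = disease location).
TYPE_CONSTRAINTS = {
    "combination": ("by", "by"),
    "invasion": ("by", "bw"),
}

def relation_type(subject_type, verb, object_type):
    """Return the relationship category for a fact triple, or None if the verb is
    unknown or the entity types contradict the category's stated pairing."""
    rel = VERB_TO_RELATION.get(verb)
    required = TYPE_CONSTRAINTS.get(rel)
    if required is not None and (subject_type, object_type) != required:
        return None
    return rel
```

For instance, `relation_type("by", "invade", "bw")` yields `"invasion"`, while the same verb with the types reversed yields `None`, which is where the separate invaded category would come into play.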

Second, training the named entity recognition model according to the labeled entity types.

Third, using the newly trained named entity recognition model to perform recognition on "Dialectics of Pathogenesis of Traditional Chinese Medicine". For example, entities such as heart, lung, and stomach are recognized as disease locations, entities such as wind and cold are recognized as etiologies, and entities such as phlegm reducing are recognized as manifestations; partial recognition results are shown in FIG. 2. A Cartesian product operation is performed on the recognized entities to obtain candidate entity pairs; for example, jin and phlegm form a candidate entity pair, and partial results are shown in FIG. 3. Text features are then extracted for the candidate entity pairs. For example, if the original sentence states that the lung is not clear due to wind-cold, wind-cold is recognized as an etiology, the words immediately to its left and right in the original text are taken, and the named entity recognition results of those words (here o and o) form the feature table, as shown in FIG. 4, where o indicates that the entity type is other. Finally, whether a relationship exists between the two entities is determined for part of the candidate entity pairs; for example, for 20% of the candidate entity pairs this is determined according to a preset rule, with true indicating that a relationship exists and false indicating that none exists. The preset rule may be, for example, that if A and B have a relationship, then B and A also have a relationship; partial labeling results are shown in FIG. 5.

Fourth, according to the obtained candidate entity pairs having a relationship and the feature table, performing statistical inference on graph probabilities with the factor graph model and globally learning entity relationship features to form a probabilistic model of the relationships between entities; this model is used to determine the probability that a relationship exists between entities, and the result is shown in FIG. 6.

Fifth, acquiring the entity pairs with a high probability of having a relationship, applying the method of extracting fact triples through dependency analysis, and determining the specific relationship between the entities according to the inter-entity relationship types labeled in the first step; for example, the expression "wind attacking lung site" is determined to be an invasion relationship from etiology to disease location, and partial results are shown in FIG. 7.

Example two

The present invention further provides a specific embodiment of a device for extracting relationships between entities in traditional Chinese medicine literature. This device embodiment corresponds to the method embodiment described above, and the device achieves the object of the present invention by executing the process steps of the method embodiment. The explanations given for the method embodiment therefore also apply to the device embodiment and are not repeated below.

As shown in FIG. 8, an embodiment of the present invention further provides a device for extracting relationships between entities in traditional Chinese medicine literature, including:

an acquisition module 11, configured to acquire, for a traditional Chinese medicine document to be processed, the entity types and inter-entity relationship types that have been labeled on part of its content;

a training module 12, configured to train a named entity recognition model according to the labeled entity types;

a recognition module 13, configured to perform named entity recognition on the traditional Chinese medicine document to be processed by using the trained named entity recognition model, and to obtain, according to the named entity recognition result, candidate entity pairs having a relationship, together with a feature table;

a determining module 14, configured to perform statistical inference on graph probabilities with a factor graph model according to the obtained candidate entity pairs and feature table, and to learn entity relationship features globally, to obtain the probability that a relationship exists between entities;

and an extraction module 15, configured to determine the type of the relationship between the entities, according to the obtained probability that a relationship exists between entities, by combining the method of extracting fact triples through dependency analysis with the labeled inter-entity relationship types.

The apparatus for extracting relationships between entities in traditional Chinese medicine literature according to this embodiment obtains, for a traditional Chinese medicine document to be processed, the entity types and inter-entity relationship types labeled for part of its content; trains a named entity recognition model according to the labeled entity types; performs named entity recognition on the document using the trained model and obtains, from the recognition result, candidate entity pairs having a relationship and a feature table; performs statistical inference on graph probabilities with a factor graph model according to those candidate entity pairs and the feature table, learning the entity relationship features globally to obtain the probability that a relationship exists between entities; and determines the type of the relationship between the entities according to that probability, in combination with the method of extracting fact triples through dependency analysis and the labeled inter-entity relationship types. The probability that a relationship exists between entities is thus combined with dependency analysis from natural language processing, and the relationship type is determined from the extracted fact triples and the labeled relationship types. This improves the accuracy of relationship type extraction and allows the content of traditional Chinese medicine literature to be expressed clearly and in structured form.

In an embodiment of the foregoing apparatus for extracting relationships between entities in traditional Chinese medicine literature, the training module further includes:

In an embodiment of the foregoing apparatus for extracting relationships between entities in traditional Chinese medicine literature, the recognition module further includes:

In an embodiment of the foregoing apparatus for extracting relationships between entities in traditional Chinese medicine literature, the extraction module further includes:

It is noted that, herein, relational terms such as first and second, and the like may be used solely to distinguish one entity or action from another entity or action without necessarily requiring or implying any actual such relationship or order between such entities or actions.

While the foregoing is directed to the preferred embodiment of the present invention, it will be understood by those skilled in the art that various changes and modifications may be made without departing from the spirit and scope of the invention as defined in the appended claims.

Claims

1. A method for extracting relationships between entities in traditional Chinese medicine documents, characterized by comprising the following steps:

training a named entity recognition model according to the marked entity type;

2. The method of extracting relationships between entities of the TCM literature according to claim 1, wherein said training of the named entity recognition model based on the labeled entity types comprises:

3. The method of claim 1, wherein performing named entity recognition on the traditional Chinese medicine document to be processed by using the trained named entity recognition model, and obtaining candidate entity pairs having a relationship and a feature table according to the named entity recognition result, comprises:

4. The method of claim 1, wherein determining the type of the inter-entity relationship according to the obtained probability that a relationship exists between the entities, in combination with the method of extracting fact triples through dependency analysis and the labeled inter-entity relationship types, comprises:

5. An apparatus for extracting relationships between entities in a traditional Chinese medicine document, comprising:

6. The apparatus for extracting relationships between entities of TCM literature according to claim 5, wherein said training module comprises:

7. The apparatus for extracting relationships between entities of TCM literature according to claim 5, wherein said recognition module comprises:

8. The apparatus for extracting relationships between entities of TCM literature according to claim 5, wherein said extraction module comprises: