CN115618005A

CN115618005A - Traditional Tibetan medicine knowledge graph construction and completion method

Info

Publication number: CN115618005A
Application number: CN202110798028.4A
Authority: CN
Inventors: 苗方; 金立标; 庞龙
Original assignee: Communication University of China
Current assignee: Communication University of China
Priority date: 2021-07-16
Filing date: 2021-07-16
Publication date: 2023-01-17

Abstract

The invention relates to the field of knowledge graphs, and discloses a method for constructing and completing a traditional Tibetan medicine knowledge graph. The method comprises designing a semantic framework for the Tibetan medicine knowledge graph, constructing a Chinese-Tibetan bilingual entity dictionary, preprocessing the corpus of the original documents, and then using an entity-relationship joint extraction model and a knowledge graph completion model to construct and complete the set of entity-relationship triples, finally forming a Tibetan medicine knowledge graph stored in a graph database. The joint knowledge extraction model used by the invention performs the entity recognition and relationship extraction tasks in parallel through an end-to-end model, and the knowledge graph is completed by a deep convolutional network. The invention makes Tibetan medicine concepts semantically explicit and organizes the related knowledge into a system, which facilitates the development of new knowledge service applications.

Description

Traditional Tibetan medicine knowledge graph construction and completion method

Technical Field

The invention relates to the field of knowledge graphs, in particular to a method for constructing and completing a traditional Tibetan medicine knowledge graph.

Background

Tibetan medicine is a cultural treasure of the Tibetan people and embodies the valuable experience they have accumulated in their long struggle against disease. The related knowledge, however, is vast and intricate, and is difficult for ordinary people to master systematically. At present, the medical resources available on the internet consist mainly of books and web pages; they are only loosely associated and lack systematic organization, and the semantic definition of concepts, the standardized construction of a knowledge system and knowledge services all lag behind. Constructing a Tibetan medicine knowledge graph can effectively address this problem: a knowledge network is built on the basis of semantic understanding of the document text, and the relationships among concepts such as Tibetan medicinal materials, prescriptions and diseases are mined. In constructing such a knowledge graph, however, the text of traditional medical documents is strongly domain-specific, its language is obscure, its syntax and word order differ greatly from those of general documents, and sentence components are often omitted, so dependency parsing and knowledge extraction are difficult to perform with general natural language processing tools. In addition, a large number of Tibetan medicine documents have no corresponding Chinese translation, and cross-language processing adds further difficulty. Conventional solutions rely heavily on manual work. The invention therefore provides a knowledge graph construction and completion method based on deep learning, which can raise the degree of automation of this work, save human resources and improve efficiency.

Disclosure of Invention

The invention aims to solve the problems that the prior art relies excessively on general natural language processing tools, that model inference accuracy is low, and that information extraction is incomplete, and to this end provides a traditional Tibetan medicine knowledge graph construction and completion method.

In order to achieve the above purpose, the invention provides the following technical solution:

A traditional Tibetan medicine knowledge graph construction and completion method comprises the following steps:

S1: designing a semantic framework for the Tibetan medicine knowledge graph, and defining the relationships between entities;

S2: constructing a Chinese-Tibetan bilingual entity dictionary;

S3: obtaining a certain amount of structured data by database import and web crawling to form an initial triple data set;

S4: inputting Tibetan medicine text and preprocessing it to obtain an annotated text corpus;

S5: performing knowledge extraction on the text corpus obtained in step S4 by using an entity-relationship joint extraction model to obtain entity-relationship triples;

S6: scoring the possible combinations of dictionary entities and relationships by using a knowledge graph completion model, and finding triples that were not extracted in step S5 so as to complete the graph;

S7: manually checking the triples generated in steps S5 and S6, and importing them into a Neo4j graph database to form the knowledge graph.

As a preferred embodiment of the present invention, step S4 comprises the following steps:

S41: after preliminary screening, splitting the ancient-text documents into sentences, so that downstream semantic annotation is performed sentence by sentence;

S42: performing BIO (B-begin, I-inside, O-other) labeling on the sentences generated in S41 by using a BERT pre-trained model, with the character as the minimum segmentation unit.
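The BIO scheme of S42 can be illustrated with a small sketch. The method itself labels with a BERT pre-trained model; the example below instead projects a hypothetical entity dictionary onto a sentence by longest-match lookup, purely to show what character-level BIO tags look like. The function name `bio_tag`, the entity type names and the sample sentence are assumptions, not part of the disclosure.

```python
from typing import Dict, List, Tuple


def bio_tag(sentence: str, entity_dict: Dict[str, str]) -> List[Tuple[str, str]]:
    """Assign a BIO tag to every character using greedy longest-match lookup.

    entity_dict maps a surface form to a hypothetical entity type, e.g. "MED".
    """
    tags = ["O"] * len(sentence)
    # Try longer dictionary entries first so that nested names are not split.
    entries = sorted(entity_dict.items(), key=lambda kv: len(kv[0]), reverse=True)
    i = 0
    while i < len(sentence):
        for surface, etype in entries:
            if surface and sentence.startswith(surface, i):
                tags[i] = f"B-{etype}"
                for j in range(i + 1, i + len(surface)):
                    tags[j] = f"I-{etype}"
                i += len(surface)
                break
        else:
            i += 1  # no entity starts here; the character keeps the tag "O"
    return list(zip(sentence, tags))


# Hypothetical dictionary and sentence ("ginseng and pneumonia"), for illustration only.
demo_dict = {"人参": "MED", "肺炎": "DIS"}
tagged = bio_tag("人参与肺炎", demo_dict)
```

Each character receives exactly one tag, so the output can be fed to a sequence labeling model one character per position.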

As a preferred embodiment of the present invention, step S5 comprises the following steps:

S51: embedding the input sentence as vectors at the character/word level;

S52: passing the vectors through a bidirectional long short-term memory (BiLSTM) network and a multi-head self-attention encoding layer to extract each character/word together with its contextual semantic features;

S53: outputting the results of the two tasks, entity recognition and relationship extraction, by using a linear-chain CRF layer and a softmax layer.

The entity-relationship joint extraction model in step S5 is trained adversarially: a small perturbation is added to the vector representation of each original sample to obtain an adversarial sample, and the model is then trained on a mixture of original and adversarial samples.
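The text does not state how the small perturbation is computed. A minimal sketch, assuming a gradient-direction perturbation of fixed L2 norm (one common choice, not confirmed by the disclosure), might look as follows. The names `adversarial_example`, `mixed_batch` and `epsilon` are hypothetical, and the gradients are supplied as plain lists rather than obtained from a real model.

```python
import math
from typing import List


def adversarial_example(embedding: List[float], grad: List[float],
                        epsilon: float = 0.1) -> List[float]:
    """Return embedding + epsilon * grad / ||grad||_2 (assumed perturbation rule)."""
    norm = math.sqrt(sum(g * g for g in grad))
    if norm == 0.0:
        return list(embedding)  # no gradient signal: leave the sample unchanged
    return [x + epsilon * g / norm for x, g in zip(embedding, grad)]


def mixed_batch(embeddings: List[List[float]], grads: List[List[float]],
                epsilon: float = 0.1) -> List[List[float]]:
    """Interleave each original sample with its adversarial counterpart."""
    batch = []
    for emb, grad in zip(embeddings, grads):
        batch.append(list(emb))
        batch.append(adversarial_example(emb, grad, epsilon))
    return batch


# Toy values; in practice the gradients would come from backpropagating the loss.
originals = [[1.0, 2.0], [0.5, -0.5]]
grads = [[3.0, 4.0], [0.0, 0.0]]
batch = mixed_batch(originals, grads, epsilon=0.5)
```

The resulting batch is twice the size of the input and contains every original sample unchanged, which matches the "mix original and adversarial samples" description.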

As a preferred embodiment of the present invention, step S6 comprises the following steps:

S61: training each entity and each relationship into a single vector by using a TransE model, so that each triple (head entity, relationship, tail entity) satisfies the vector addition relationship, the length of the output vectors being configurable;

S62: forming candidate triples from arbitrary combinations of entities and relationships;

S63: filtering the candidate triples by deleting those already known to be valid in the knowledge base;

S64: scoring each remaining triple with the deep pyramid convolution model; a triple whose score is greater than a threshold is regarded as valid, and a triple whose score is less than the threshold is regarded as invalid;

S65: adding the valid triples to the knowledge graph to complete it.
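Steps S62 to S65 can be sketched as a small filter pipeline. In the sketch below the scorer is pluggable; as a stand-in for the deep pyramid convolution model it uses the negative TransE distance between h + r and t, which is an assumption made only to keep the example self-contained. All entity names, embeddings and the threshold value are hypothetical.

```python
import math
from itertools import product
from typing import Callable, Dict, Iterable, List, Set, Tuple

Triple = Tuple[str, str, str]
Vector = List[float]


def transe_score(emb: Dict[str, Vector], triple: Triple) -> float:
    """Stand-in scorer: negative Euclidean distance between h + r and t."""
    h, r, t = (emb[name] for name in triple)
    return -math.sqrt(sum((a + b - c) ** 2 for a, b, c in zip(h, r, t)))


def complete_graph(entities: Iterable[str], relations: Iterable[str],
                   known: Set[Triple], score: Callable[[Triple], float],
                   threshold: float) -> List[Triple]:
    """Enumerate candidates (S62), drop known triples (S63), keep high scorers (S64)."""
    entities = list(entities)
    candidates = (
        (h, r, t)
        for h, r, t in product(entities, relations, entities)
        if h != t  # assumption: self-loops are not meaningful here
    )
    unknown = (c for c in candidates if c not in known)
    return [c for c in unknown if score(c) > threshold]


# Toy embeddings, for illustration only.
emb = {
    "herb_x": [0.0, 0.0], "herb_w": [0.0, 0.1], "formula_y": [1.0, 0.0],
    "composition": [1.0, 0.0],
}
known = {("herb_x", "composition", "formula_y")}
new_triples = complete_graph(
    ["herb_x", "herb_w", "formula_y"], ["composition"], known,
    lambda c: transe_score(emb, c), threshold=-0.5,
)
updated = known | set(new_triples)  # S65: add the valid triples to the graph
```

Enumerating every entity pair grows quadratically with the dictionary size, so a real system would probably restrict candidates by the entity types allowed for each relationship.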

The deep pyramid convolution model involved in S6 is trained as follows: a valid triple and an invalid triple are input simultaneously during training, the invalid triple being obtained by sampling from a Bernoulli distribution. The output score is taken as the initial threshold; after sufficiently long training the score function gradually stabilizes, and the threshold is then fixed.
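The text only says that invalid triples are "obtained by Bernoulli distribution". One widely used reading, assumed here, is Bernoulli negative sampling: for each relationship, the head is replaced with probability tph / (tph + hpt), where tph is the average number of tails per head and hpt the average number of heads per tail; otherwise the tail is replaced. The function names and toy data below are hypothetical.

```python
import random
from collections import defaultdict
from typing import Dict, List, Set, Tuple

Triple = Tuple[str, str, str]


def head_replace_prob(triples: List[Triple]) -> Dict[str, float]:
    """Per relation, compute tph / (tph + hpt), the probability of corrupting the head."""
    tails_of = defaultdict(set)  # (relation, head) -> set of tails
    heads_of = defaultdict(set)  # (relation, tail) -> set of heads
    for h, r, t in triples:
        tails_of[(r, h)].add(t)
        heads_of[(r, t)].add(h)
    probs = {}
    for r in {rel for _, rel, _ in triples}:
        t_counts = [len(v) for (rel, _), v in tails_of.items() if rel == r]
        h_counts = [len(v) for (rel, _), v in heads_of.items() if rel == r]
        tph = sum(t_counts) / len(t_counts)  # average tails per head
        hpt = sum(h_counts) / len(h_counts)  # average heads per tail
        probs[r] = tph / (tph + hpt)
    return probs


def corrupt(triple: Triple, entities: List[str], known: Set[Triple],
            probs: Dict[str, float], rng: random.Random) -> Triple:
    """Replace the head or the tail until the result is not a known valid triple.

    Assumes at least one corruption outside `known` exists; otherwise this loops.
    """
    h, r, t = triple
    while True:
        e = rng.choice(entities)
        cand = (e, r, t) if rng.random() < probs[r] else (h, r, e)
        if cand not in known:
            return cand


# Toy data: three materials composing one prescription (a many-to-one relation).
triples = [("m1", "composition", "f1"), ("m2", "composition", "f1"),
           ("m3", "composition", "f1")]
probs = head_replace_prob(triples)
negative = corrupt(triples[0], ["m1", "m2", "m3", "f1"], set(triples),
                   probs, random.Random(0))
```

For a many-to-one relation such as this toy one, the tail is corrupted more often than the head, which reduces the chance of accidentally generating a triple that is actually valid.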

Compared with the prior art, the invention has the following beneficial effects:

1. The method takes the text corpus of Tibetan medicine documents as input, converts it into character/word vectors, performs automatic entity annotation on them, and feeds the annotated vectors into the entity-relationship joint extraction module to extract the triples required for constructing the Tibetan medicine knowledge graph. The traditional approach is a pipeline model that treats entity recognition and relationship extraction as two separate tasks and extracts relationships only after entities have been recognized; it inevitably suffers from error propagation and ignores the interaction between the two subtasks. The joint model used by the invention performs both tasks through a single end-to-end model and can effectively overcome these problems.

2. The invention adopts a knowledge graph completion model based on deep convolution. The model consists of a reference network and a deep convolutional network: the reference network generates a feature map by one-dimensional convolution, which serves as the subsequent input, and the deep convolutional network performs further convolution and pooling operations on these features, with the convolution depth controlled by the number of repeating units.
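The structure described in item 2 can be sketched in a few lines: one 1-D convolution over the concatenated triple vectors, followed by a configurable number of convolution-plus-pooling units. This is a toy under strong assumptions: a single shared kernel instead of learned multi-channel filters, ReLU activation, stride-2 max pooling, and mean pooling to obtain the final score. None of these details are given in the text, and the function names are hypothetical.

```python
from typing import List


def conv1d(x: List[float], kernel: List[float]) -> List[float]:
    """Same-length 1-D convolution (cross-correlation) with zero padding and ReLU."""
    k = len(kernel)
    pad = k // 2
    padded = [0.0] * pad + list(x) + [0.0] * (k - 1 - pad)
    return [max(0.0, sum(w * padded[i + j] for j, w in enumerate(kernel)))
            for i in range(len(x))]


def max_pool(x: List[float], size: int = 3, stride: int = 2) -> List[float]:
    """Max pooling with stride 2, which roughly halves the feature length."""
    return [max(x[i:i + size]) for i in range(0, len(x), stride)]


def pyramid_score(h: List[float], r: List[float], t: List[float],
                  kernel: List[float], num_units: int) -> float:
    """Reference network (one convolution), then num_units conv + pool units."""
    features = conv1d(h + r + t, kernel)  # feature map used as subsequent input
    for _ in range(num_units):
        if len(features) <= 1:
            break  # nothing left to pool
        features = max_pool(conv1d(features, kernel))
    return sum(features) / len(features)


# Toy call with hand-picked values.
score = pyramid_score([1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [1.0, 1.0, 1.0], num_units=1)
```

Because each unit halves the feature length, `num_units` directly controls how deep the pyramid goes before the features collapse to a single value.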

3. The knowledge extraction and knowledge completion methods provided by the invention take vectorized characters/words as input, and vectorization reduces the dimensionality of the original data. Because the knowledge representation model is vector-based, semantic-level association and computation can be carried out across the Chinese and Tibetan corpora.

Drawings

FIG. 1 is a schematic flow chart of the method for constructing and completing a Tibetan medicine knowledge graph according to embodiment 1 of the present invention;

FIG. 2 is a schematic diagram of the entity-relationship joint extraction model according to embodiment 1 of the present invention;

FIG. 3 is a schematic diagram of the operation of the knowledge graph completion model according to embodiment 1 of the present invention.

Detailed Description

The present invention will be described in further detail with reference to test examples and specific embodiments. It should be understood that the scope of the above-described subject matter is not limited to the following examples, and any techniques implemented based on the disclosure of the present invention are within the scope of the present invention.

Example 1

As shown in FIG. 1, a method for constructing and completing a Tibetan medicine knowledge graph comprises the following steps:

S1: designing a semantic framework for the Tibetan medicine knowledge graph, and defining the relationships between entities.

The main entity types of the Tibetan medicine knowledge graph model are: prescriptions (such as JIAWEIBAIYAO powder), medicinal materials (such as Ginseng radix) and diseases (such as pneumonia). The main relationships among the entities are: (prescription) -[main treatment]-> (disease), (medicinal material) -[composition]-> (prescription), (disease) -[use]-> (medicinal material), and the like. The attributes of a prescription include: name, Tibetan name, Latin name, composition, alias, toxicity, preparation method, pinyin, nature and taste, functions and indications, usage and dosage, specification, precautions, storage, source, identification and the like. The attributes of a medicinal material include: name, Tibetan name, Latin name, alias, pinyin, English name, collection, form, identification, flavor, nature and taste, application, functions and indications, habitat, collection time, ratio, toxicity, source, supplementary notes and the like. The attributes of a disease mainly include symptom description, etiology and the like.
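The semantic framework above amounts to a typed schema: each relationship fixes the entity type of its head and of its tail. A minimal sketch of such a schema and a type check is given below; the identifiers (`SCHEMA`, `Entity`, `is_valid`) and the snake_case relation names are assumptions chosen for the example.

```python
from dataclasses import dataclass
from typing import Dict, Tuple

# Relation name -> (head entity type, tail entity type), following the text above.
SCHEMA: Dict[str, Tuple[str, str]] = {
    "main_treatment": ("prescription", "disease"),
    "composition": ("medicinal_material", "prescription"),
    "use": ("disease", "medicinal_material"),
}


@dataclass(frozen=True)
class Entity:
    name: str
    etype: str  # "prescription", "medicinal_material" or "disease"


def is_valid(head: Entity, relation: str, tail: Entity) -> bool:
    """Check that a triple respects the head and tail types of its relation."""
    expected = SCHEMA.get(relation)
    return expected is not None and expected == (head.etype, tail.etype)


# Placeholder entities, for illustration only.
material = Entity("material_a", "medicinal_material")
prescription = Entity("prescription_b", "prescription")
ok = is_valid(material, "composition", prescription)
reversed_ok = is_valid(prescription, "composition", material)
```

A check of this kind can be applied to extracted or completed triples before manual review, so that triples with impossible type combinations are discarded early.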

S2: constructing a Chinese-Tibetan bilingual entity dictionary. The dictionary is built from structured data, which may be crawled from professional databases or websites, or obtained by recognizing and importing dictionary-type reference books and relevant national/industry standard documents held in libraries.

S3: obtaining a certain amount of structured data by database import and web crawling to form an initial triple data set. Professional websites and existing databases contain partially structured data that can be used directly to construct knowledge graph triples. This data is also the main source for training the subsequent deep learning models and for labeling.
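As an illustration of S3, structured records can be flattened into triples directly. The sketch below assumes a hypothetical record layout with the keys `name`, `materials` and `diseases`; real databases and crawled pages will have their own fields.

```python
from typing import Any, Dict, Iterable, List, Tuple

Triple = Tuple[str, str, str]


def records_to_triples(records: Iterable[Dict[str, Any]]) -> List[Triple]:
    """Turn prescription records into (head, relation, tail) triples without duplicates."""
    seen = set()
    triples: List[Triple] = []
    for rec in records:
        name = str(rec["name"])
        candidates = [(m, "composition", name) for m in rec.get("materials", [])]
        candidates += [(name, "main_treatment", d) for d in rec.get("diseases", [])]
        for triple in candidates:
            if triple not in seen:  # the same fact may appear in several sources
                seen.add(triple)
                triples.append(triple)
    return triples


# Placeholder records, for illustration only.
records = [
    {"name": "P1", "materials": ["M1", "M2"], "diseases": ["D1"]},
    {"name": "P1", "materials": ["M1"]},
]
initial_triples = records_to_triples(records)
```

Deduplicating while preserving order keeps the initial data set stable across repeated imports from overlapping sources.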

S42: performing BIO (B-begin, I-inside, O-other) labeling on the sentences generated in S41 by using the BERT pre-trained model, with the character as the minimum segmentation unit.

S5: performing knowledge extraction on the text corpus obtained in step S4 by using an entity-relationship joint extraction model to obtain entity-relationship triples;

S51: embedding the input sentences as vectors at the character/word level; the vectorization model may be word2vec or BERT, and the vector dimension can be adjusted according to the actual data volume and the computing power of the model training platform;

S52: passing the vectors through a bidirectional long short-term memory (BiLSTM) network and a multi-head self-attention encoding layer to extract each character/word together with its complex contextual semantic features;

S53: outputting the classification results of the two tasks, entity recognition and relationship extraction, through two layers, a linear-chain conditional random field (CRF) and softmax.
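The two output layers of S53 can be illustrated independently of the encoder. The sketch below shows Viterbi decoding, the standard way to read the best tag sequence out of a linear-chain CRF, and a numerically stable softmax for the relationship classifier. The scores are hand-written toy values rather than encoder outputs, and the pairing of CRF with entity tags and softmax with relationship classes is an assumption about how the two layers are used.

```python
import math
from typing import Dict, List


def softmax(logits: List[float]) -> List[float]:
    """Numerically stable softmax over relationship-class logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def viterbi(emissions: List[Dict[str, float]],
            transitions: Dict[str, Dict[str, float]]) -> List[str]:
    """Return the highest-scoring tag sequence under a linear-chain model.

    emissions[i][tag] is the score of `tag` at position i;
    transitions[a][b] is the score of moving from tag a to tag b.
    """
    tags = list(emissions[0])
    best = {t: emissions[0][t] for t in tags}
    back: List[Dict[str, str]] = []
    for scores in emissions[1:]:
        pointers, new_best = {}, {}
        for cur in tags:
            prev = max(tags, key=lambda p: best[p] + transitions[p][cur])
            pointers[cur] = prev
            new_best[cur] = best[prev] + transitions[prev][cur] + scores[cur]
        back.append(pointers)
        best = new_best
    path = [max(tags, key=lambda t: best[t])]
    for pointers in reversed(back):
        path.append(pointers[path[-1]])
    return path[::-1]


# Toy scores: the transition O -> I is penalized, as BIO tagging requires.
tags = ["B", "I", "O"]
transitions = {a: {b: 0.0 for b in tags} for a in tags}
transitions["O"]["I"] = -10.0
emissions = [{"B": 0.5, "I": 0.0, "O": 1.0}, {"B": 0.0, "I": 2.0, "O": 0.5}]
decoded = viterbi(emissions, transitions)
relation_probs = softmax([1.0, 2.0, 3.0])
```

Without the transition penalty the per-position maxima would give the invalid sequence O, I; the CRF layer's transition scores are what rule it out.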

The entity-relationship joint extraction model in step S5 is trained adversarially: a small perturbation is added to the vector representation of each original sample to obtain an adversarial sample, and the model is then trained on a mixture of original and adversarial samples. The model uses a cross-entropy loss function.

S6: scoring the possible combinations of dictionary entities and relationships by using a knowledge graph completion model, and finding triples that were not extracted in step S5 so as to complete the graph.

S61: training each entity and each relationship into a single vector by using a TransE model, so that each triple (head entity, relationship, tail entity) satisfies the vector addition relationship, the length of the output vectors being configurable;

S62: forming a candidate triple set from arbitrary combinations of entities and relationships;

S63: filtering the candidate triple set by deleting the triples already known to be valid in the knowledge base;

S64: scoring each remaining triple with the deep pyramid convolution model shown in FIG. 3; a triple whose score is greater than a threshold is regarded as valid, and a triple whose score is less than the threshold is regarded as invalid;

S65: adding the valid triples to the existing knowledge graph.

The deep pyramid convolution model involved in S6 is trained as follows: a valid triple and an invalid triple are input simultaneously during training, the invalid triple being obtained by sampling from a Bernoulli distribution. The output score is taken as the initial threshold; after sufficiently long training the score function gradually stabilizes, and the threshold is then fixed.

S7: manually checking the triples generated in steps S5 and S6, and importing them into a Neo4j graph database. On the basis of this database, service applications such as visual query, retrieval and question answering over the Tibetan medicine knowledge graph can be provided.
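The import in S7 can be sketched as the generation of parameterized Cypher `MERGE` statements, one per checked triple. The node label `Entity`, the property `name` and the helper `to_cypher` are assumptions; the text does not describe the actual import procedure. Since Cypher does not accept a parameter in the relationship-type position, the type is validated and interpolated into the query text.

```python
from typing import Dict, Tuple

Triple = Tuple[str, str, str]

# Doubled braces are literal braces after str.format; $head and $tail are query parameters.
MERGE_TEMPLATE = (
    "MERGE (h:Entity {{name: $head}}) "
    "MERGE (t:Entity {{name: $tail}}) "
    "MERGE (h)-[:{rel}]->(t)"
)


def to_cypher(triple: Triple) -> Tuple[str, Dict[str, str]]:
    """Build a parameterized Cypher MERGE statement for one checked triple."""
    head, relation, tail = triple
    rel_type = relation.upper().replace(" ", "_")
    # Only letters, digits and underscores are allowed, and no leading digit.
    if not rel_type.replace("_", "").isalnum() or rel_type[0].isdigit():
        raise ValueError(f"unsafe relationship type: {relation!r}")
    return MERGE_TEMPLATE.format(rel=rel_type), {"head": head, "tail": tail}


# Placeholder triple, for illustration only.
query, params = to_cypher(("material_a", "main treatment", "disease_c"))
```

Using `MERGE` rather than `CREATE` makes the import idempotent, so re-running it after a further round of manual checking does not duplicate nodes or relationships. The resulting query and parameter dictionary can be handed to whichever Neo4j client is in use.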

Although embodiments of the present invention have been shown and described, it will be appreciated by those skilled in the art that various changes, modifications, substitutions and alterations can be made in these embodiments without departing from the principles and spirit of the invention, and these embodiments are intended to be encompassed in the scope of the present invention.

Claims

1. A traditional Tibetan medicine knowledge graph construction and completion method, characterized by comprising the following steps:

S2: constructing a Chinese-Tibetan bilingual entity dictionary;

S3: obtaining a portion of structured data by database import and web crawling to form an initial triple data set;

S5: performing knowledge extraction on the text corpus by using an entity-relationship joint extraction model to obtain entity-relationship triples;

S6: scoring the possible combinations of dictionary entities and relationships by using a knowledge graph completion model, and finding triples that were not extracted in step S5 so as to complete the graph;

S7: manually checking the triples generated in steps S5 and S6, and importing them into a graph database to form the knowledge graph.

2. The traditional Tibetan medicine knowledge graph construction and completion method according to claim 1, wherein step S4 comprises the following steps:

3. The traditional Tibetan medicine knowledge graph construction and completion method according to claim 1, wherein step S5 comprises the following steps:

S51: embedding the input sentences as vectors at the character/word level;

S52: passing the vectors through a bidirectional long short-term memory network and a multi-head self-attention encoding layer to extract the characters/words and their contextual semantic features;

4. The traditional Tibetan medicine knowledge graph construction and completion method according to claim 1, wherein step S6 comprises the following steps:

S62: forming candidate triples from arbitrary combinations of entities and relationships;

S63: filtering the candidate triples by deleting those already known to be valid in the knowledge base;

S64: scoring each remaining triple with a deep pyramid convolution model, wherein a triple whose score is greater than a threshold is regarded as valid, and a triple whose score is less than the threshold is regarded as invalid;

S65: adding the triples judged to be valid in step S64 to the knowledge graph to complete it.

5. The traditional Tibetan medicine knowledge graph construction and completion method according to claim 1, wherein the entity-relationship joint extraction model involved in step S5 performs the two tasks of entity recognition and relationship extraction in parallel.

6. The traditional Tibetan medicine knowledge graph construction and completion method according to claim 1, wherein the entity-relationship joint extraction model in step S5 is trained adversarially: an adversarial sample is obtained by adding a small perturbation to the vector representation of an original sample, and the model is then trained on a mixture of original and adversarial samples.

7. The traditional Tibetan medicine knowledge graph construction and completion method according to claim 1, wherein a deep pyramid convolution model is adopted in step S6 to judge the validity of triples, the model consisting of a reference network and a deep convolutional network, the reference network generating a feature map by one-dimensional convolution as the subsequent input, and the deep convolutional network performing further convolution and pooling operations on the features, with the convolution depth controlled by the number of repeating units.