CN115618005A - Traditional Tibetan medicine knowledge graph construction and completion method - Google Patents

Info

Publication number
CN115618005A
Authority
CN
China
Prior art keywords
entity
knowledge
model
tibetan medicine
triple
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202110798028.4A
Other languages
Chinese (zh)
Inventor
苗方
金立标
庞龙
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Communication University of China
Original Assignee
Communication University of China
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
2021-07-16
Filing date
2021-07-16
Publication date
2023-01-17
Application filed by Communication University of China
Priority to CN202110798028.4A
Publication of CN115618005A

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/36 Creation of semantic tools, e.g. ontology or thesauri
    • G06F16/367 Ontology
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/35 Clustering; Classification
    • G06F16/355 Class or cluster creation or modification
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/95 Retrieval from the web
    • G06F16/951 Indexing; Web crawling techniques
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/205 Parsing
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/237 Lexical tools
    • G06F40/242 Dictionaries
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/279 Recognition of textual entities
    • G06F40/289 Phrasal analysis, e.g. finite state techniques or chunking
    • G06F40/295 Named entity recognition
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/30 Semantic analysis

Abstract

The invention relates to the field of knowledge graphs and discloses a method for constructing and completing a traditional Tibetan medicine knowledge graph. The method comprises designing a semantic framework for the Tibetan medicine knowledge graph, constructing a Chinese-Tibetan bilingual entity dictionary, preprocessing the corpus of original documents, and then using an entity-relation joint extraction model and a knowledge graph completion model to construct and complete the set of entity-relation triples, finally forming a Tibetan medicine knowledge graph stored in a graph database. The joint knowledge extraction model performs the entity recognition and relation extraction tasks in parallel through an end-to-end model, and the knowledge graph is completed by a deep convolutional network. The invention enables the semantic representation and systematic organization of Tibetan medicine concepts and supports the development of new knowledge service applications.

Description

Traditional Tibetan medicine knowledge graph construction and completion method
Technical Field
The invention relates to the field of knowledge graphs, and in particular to a method for constructing and completing a traditional Tibetan medicine knowledge graph.
Background
Tibetan medicine is a cultural treasure of the Tibetan people, built on experience accumulated over a long struggle against disease. Its body of knowledge, however, is large and intricate, and difficult for non-specialists to master systematically. Medical resources currently available on the internet consist mainly of books and web pages that are only loosely associated and lack systematic organization; concept semantics, the standardized construction of a knowledge system, and knowledge services all lag behind. Constructing a Tibetan medicine knowledge graph can effectively address this problem: a knowledge network is built from semantic understanding of the text in the literature, and the relationships among concepts such as Tibetan medicinal materials, prescriptions, and diseases are mined. In constructing the knowledge graph, however, the text of traditional medical literature proves highly domain-specific: the language is obscure, syntactic structure and word order differ greatly from general-purpose text, and sentence components are often omitted, so dependency parsing and knowledge extraction are difficult to perform with general natural language processing tools. In addition, a large body of Tibetan medicine literature has no corresponding Chinese translation, and cross-language processing adds further difficulty. Conventional solutions rely heavily on manual work. The invention therefore provides a knowledge graph construction and completion method based on deep learning, which increases the degree of automation of this work, saves human resources, and improves efficiency.
Disclosure of Invention
The invention aims to solve problems in the prior art, such as excessive dependence on natural language processing tools, low model inference accuracy, and incomplete information extraction, and provides a method for constructing and completing a traditional Tibetan medicine knowledge graph.
To achieve the above purpose, the invention provides the following technical solution:
A traditional Tibetan medicine knowledge graph construction and completion method comprises the following steps:
S1: Design a semantic framework for the Tibetan medicine knowledge graph and define the relations between entities.
S2: Construct a Chinese-Tibetan bilingual entity dictionary.
S3: Obtain a certain amount of structured data by database import and web crawling to form an initial triple dataset.
S4: Input Tibetan medicine text and preprocess it to obtain an annotated text corpus.
S5: Use an entity-relation joint extraction model to extract knowledge from the text corpus obtained in step S4, obtaining entity-relation triples.
S6: Use a knowledge graph completion model to score the possible combinations of dictionary entities and relations, and find the triples not extracted in step S5 in order to complete the graph.
S7: Manually check the triples generated in steps S5 and S6 and import them into a Neo4j graph database to form the knowledge graph.
In a preferred embodiment of the invention, step S4 comprises the following steps:
S41: After preliminary screening, split the ancient-book text into sentences; downstream semantic annotation is performed sentence by sentence.
S42: Taking the character as the minimum unit, apply BIO labeling (B-begin, I-inside, O-other) to the sentences produced in S41 using a BERT pre-trained model.
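The BIO scheme in S42 can be illustrated with a minimal sketch. It assumes character-level units and that entity spans are already known (for example from the entity dictionary); the function name, the type labels MED and DIS, and the example sentence are hypothetical and only serve the illustration. It shows the label format, not the BERT-based labeling itself.

```python
from typing import List, Tuple


def bio_tags(sentence: str, spans: List[Tuple[int, int, str]]) -> List[str]:
    """Return one BIO tag per character.

    Each span is (start, end, entity_type) with an exclusive end index;
    spans are assumed not to overlap.
    """
    tags = ["O"] * len(sentence)
    for start, end, entity_type in spans:
        tags[start] = f"B-{entity_type}"          # first character of the entity
        for i in range(start + 1, end):
            tags[i] = f"I-{entity_type}"          # remaining characters
    return tags


# Made-up example sentence; MED = medicinal material, DIS = disease (hypothetical labels).
sentence = "人参主治肺炎"
spans = [(0, 2, "MED"), (4, 6, "DIS")]
tags = bio_tags(sentence, spans)
for char, tag in zip(sentence, tags):
    print(char, tag)
```

Every character outside an entity span keeps the tag O, so the tag sequence always has the same length as the sentence.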
In a preferred embodiment of the invention, step S5 comprises the following steps:
S51: Embed the input sentence as vectors at the character/word level.
S52: Pass the vectors through a bidirectional long short-term memory (BiLSTM) network and a multi-head self-attention encoding layer to extract the semantic features of each word and of its context.
S53: Use a linear-chain conditional random field (CRF) layer and a softmax layer to output the results of the two tasks, entity recognition and relation extraction.
The entity-relation joint extraction model in S5 is trained adversarially: a small perturbation is added to the vector representation of an original sample to obtain an adversarial sample, and the model is then trained on a mix of original and adversarial samples.
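The text does not say how the small perturbation is chosen. One common choice, assumed here purely for illustration, is the fast gradient method: move the embedding a fixed distance epsilon along the normalized gradient of the loss. The function name, the epsilon value, and the gradient values below are hypothetical.

```python
import math
from typing import List


def adversarial_embedding(
    embedding: List[float], grad: List[float], epsilon: float = 0.1
) -> List[float]:
    """Return embedding + epsilon * grad / ||grad||_2.

    Assumption: the perturbation follows the loss gradient (fast gradient
    method). If the gradient is zero, the embedding is returned unchanged.
    """
    norm = math.sqrt(sum(g * g for g in grad))
    if norm == 0.0:
        return list(embedding)
    return [e + epsilon * g / norm for e, g in zip(embedding, grad)]


original = [0.5, -0.2, 0.1]
loss_grad = [3.0, 0.0, -4.0]  # hypothetical gradient of the loss w.r.t. the embedding
adversarial = adversarial_embedding(original, loss_grad, epsilon=0.1)

# Original and adversarial samples are mixed in the same training batch.
batch = [original, adversarial]
```

In a real training loop the gradient would come from backpropagation through the extraction model, and the adversarial sample would keep the label of the original sample.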
In a preferred embodiment of the invention, step S6 comprises the following steps:
S61: Use a TransE model to train a vector for each entity and each relation, so that every triple (head entity, relation, tail entity) satisfies the vector addition relation (head + relation ≈ tail); the length of the output vectors is configurable.
S62: Form candidate triples from arbitrary combinations of entities and relations.
S63: Filter the candidate triples, deleting the valid triples already known in the knowledge base.
S64: Score each remaining triple with the deep pyramid convolution model; a triple whose score is greater than a threshold is regarded as valid, and one whose score is smaller than the threshold as invalid.
S65: Add the valid triples to the knowledge graph, thereby completing it.
The deep pyramid convolution model used in S6 is trained as follows: a valid triple and an invalid triple are input together during training, the invalid triple being obtained by Bernoulli-distribution sampling. The output score serves as the initial threshold; after prolonged training the score function gradually stabilizes, and the threshold is then fixed.
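Steps S62 to S64 can be sketched as enumerate, filter, and threshold. In the sketch below the scoring function is a stand-in: it uses the negative TransE distance rather than the deep pyramid convolution model described in the text, and all entity names, vectors, and the threshold value are hypothetical toy data.

```python
from itertools import product
from typing import Callable, Dict, Iterable, List, Set, Tuple

Triple = Tuple[str, str, str]


def transe_score(h: List[float], r: List[float], t: List[float]) -> float:
    """Negative L1 distance of h + r from t; values near 0 are more plausible."""
    return -sum(abs(hi + ri - ti) for hi, ri, ti in zip(h, r, t))


def complete(
    entities: Iterable[str],
    relations: Iterable[str],
    known: Set[Triple],
    score: Callable[[Triple], float],
    threshold: float,
) -> List[Triple]:
    entities, relations = list(entities), list(relations)
    candidates = product(entities, relations, entities)   # S62: all combinations
    unseen = [c for c in candidates if c not in known]    # S63: drop known triples
    return [c for c in unseen if score(c) > threshold]    # S64: keep high scores


# Toy 2-dimensional embeddings (hypothetical values).
ent: Dict[str, List[float]] = {
    "ginseng": [0.0, 0.0],
    "formula_A": [1.0, 0.0],
    "pneumonia": [1.0, 1.0],
}
rel: Dict[str, List[float]] = {"composition": [1.0, 0.0], "main_treatment": [0.0, 1.0]}
known = {("ginseng", "composition", "formula_A")}

new_triples = complete(
    ent, rel, known,
    score=lambda c: transe_score(ent[c[0]], rel[c[1]], ent[c[2]]),
    threshold=-0.5,
)
print(new_triples)
```

Because the scorer is passed in as a function, the stand-in can be swapped for any trained model without changing the enumeration and filtering logic. Enumerating every combination grows with the square of the number of entities, so a real dictionary would likely need type constraints or batching.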
Compared with the prior art, the invention has the following beneficial effects:
1. The method takes the text corpus of Tibetan medicine literature as input, converts it into character/word vectors, annotates the entities automatically, and feeds the annotated vectors into the entity-relation joint extraction module to extract the triples needed to construct the Tibetan medicine knowledge graph. The traditional approach is a pipeline model that treats entity recognition and relation extraction as two separate tasks, extracting relations only after entities have been recognized; it inevitably suffers from error propagation and ignores the dependency between the two subtasks. The joint model used by the invention performs both tasks in a single end-to-end model and effectively overcomes these problems.
2. The invention adopts a knowledge graph completion model based on deep convolution. The model consists of a base network and a deep convolutional network: the base network generates a feature map by one-dimensional convolution, which serves as the subsequent input; the deep convolutional network applies further convolution and pooling operations to these features, and the convolution depth is controlled by the number of repeated units.
3. The knowledge extraction and knowledge completion methods of the invention take vectorized words as input, and vectorization reduces the dimensionality of the original data. With a vectorized knowledge representation model, semantic-level association and computation can be carried out across Chinese and Tibetan corpora.
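The structure described in item 2, a one-dimensional convolution followed by a configurable number of convolution-and-pooling units, can be sketched in plain Python. This is only a shape-level illustration under assumptions: a single channel, one shared hypothetical kernel, max pooling of width 2, and no activation or final scoring layer, none of which the text specifies.

```python
from typing import List


def conv1d(x: List[float], kernel: List[float]) -> List[float]:
    """Zero-padded 1-D convolution (cross-correlation form) with an odd-length
    kernel; the output has the same length as the input."""
    pad = len(kernel) // 2
    padded = [0.0] * pad + list(x) + [0.0] * pad
    return [
        sum(k * padded[i + j] for j, k in enumerate(kernel))
        for i in range(len(x))
    ]


def max_pool(x: List[float], size: int = 2) -> List[float]:
    """Non-overlapping max pooling; with size=2 the length is roughly halved."""
    return [max(x[i:i + size]) for i in range(0, len(x), size)]


def pyramid_features(x: List[float], kernel: List[float], num_units: int) -> List[float]:
    features = conv1d(x, kernel)            # base (reference) network
    for _ in range(num_units):              # num_units controls the depth
        features = max_pool(conv1d(features, kernel))
    return features


# With an identity kernel the effect of pooling is easy to follow by hand.
print(pyramid_features([1, 2, 3, 4, 5, 6, 7, 8], [0.0, 1.0, 0.0], num_units=2))
```

Each unit halves the sequence length, which gives the pyramid shape; in a triple-scoring setting the input would presumably be built from the head, relation, and tail vectors, and a final layer would map the remaining features to a score.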
Drawings
FIG. 1 is a schematic flow chart of the method for constructing and completing a Tibetan medicine knowledge graph according to embodiment 1 of the invention;
FIG. 2 is a schematic diagram of the entity-relation joint extraction model according to embodiment 1 of the invention;
FIG. 3 is a schematic diagram of the operation of the knowledge graph completion model according to embodiment 1 of the invention.
Detailed Description
The present invention is described in further detail below with reference to test examples and specific embodiments. The scope of the subject matter described above should not be understood as limited to the following examples; any technique implemented on the basis of this disclosure falls within the scope of the invention.
Example 1
As shown in FIG. 1, a method for constructing and completing a Tibetan medicine knowledge graph comprises the following steps:
S1: Design a semantic framework for the Tibetan medicine knowledge graph and define the relations between entities.
The main entity types of the Tibetan medicine knowledge graph model are prescriptions (e.g., JIAWEIBAIYAO powder), medicinal materials (e.g., Ginseng radix), and diseases (e.g., pneumonia). The main relations between entities are (prescription)-[main treatment]->(disease), (medicinal material)-[composition]->(prescription), (disease)-[use]->(medicinal material), and so on. Prescription attributes include: name, Tibetan name, Latin name, composition, aliases, toxicity, preparation method, pinyin, nature and taste, functions and indications, usage and dosage, specification, precautions, storage, source, identification, and so on. Medicinal material attributes include: name, Tibetan name, Latin name, aliases, pinyin, English name, collection, form, identification, flavor, nature and taste, application, functions and indications, habitat, collection time, ratio, toxicity, source, supplementary notes, and so on. Disease attributes mainly include symptom description, etiology, and so on.
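The semantic framework of S1 can be written down as a small schema against which triples are validated. The sketch below encodes only the three entity types and three relations named above; the identifier spellings and the helper function are hypothetical.

```python
from typing import Dict, Set, Tuple

# Entity types named in S1 (identifier spellings are hypothetical).
ENTITY_TYPES: Set[str] = {"Prescription", "MedicinalMaterial", "Disease"}

# relation -> (head type, tail type), following the patterns listed in S1.
RELATION_SCHEMA: Dict[str, Tuple[str, str]] = {
    "main_treatment": ("Prescription", "Disease"),
    "composition": ("MedicinalMaterial", "Prescription"),
    "use": ("Disease", "MedicinalMaterial"),
}


def is_valid(head_type: str, relation: str, tail_type: str) -> bool:
    """True if the relation is defined and connects these two entity types."""
    return RELATION_SCHEMA.get(relation) == (head_type, tail_type)


print(is_valid("Prescription", "main_treatment", "Disease"))   # True
print(is_valid("Disease", "main_treatment", "Prescription"))   # False
```

A check like this can be applied to extracted or completed triples before manual review, so that type-inconsistent candidates are discarded early.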
S2: Construct a Chinese-Tibetan bilingual entity dictionary. The dictionary is built from structured data, which can be crawled from professional databases or websites, or obtained by recognizing and importing dictionary-style reference books and relevant national/industry standard documents held in libraries.
S3: Obtain a certain amount of structured data by database import and web crawling to form an initial triple dataset. Professional websites and existing databases contain partially structured data that can be used directly to construct knowledge graph triples. This data is also the main source for training and labeling the subsequent deep learning models.
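As a minimal sketch of how a structured record might be turned into initial triples in S3, assume a crawled prescription record arrives as a dictionary. The field names (name, composition, indications) and the record contents are hypothetical; real sources will have their own layouts.

```python
from typing import Any, Dict, List, Tuple

Triple = Tuple[str, str, str]


def record_to_triples(record: Dict[str, Any]) -> List[Triple]:
    """Convert one prescription record into (head, relation, tail) triples."""
    name = record["name"]
    triples: List[Triple] = []
    for material in record.get("composition", []):
        triples.append((material, "composition", name))       # material -> prescription
    for disease in record.get("indications", []):
        triples.append((name, "main_treatment", disease))     # prescription -> disease
    return triples


# Hypothetical record for illustration only.
record = {"name": "Formula A", "composition": ["ginseng"], "indications": ["pneumonia"]}
initial_triples = record_to_triples(record)
print(initial_triples)
```

Records with missing fields simply yield fewer triples, which keeps the import tolerant of partially structured sources.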
S4: Input Tibetan medicine text and preprocess it to obtain an annotated text corpus.
S41: After preliminary screening, split the ancient-book text into sentences; downstream semantic annotation is performed sentence by sentence.
S42: Taking the character as the minimum unit, apply BIO labeling (B-begin, I-inside, O-other) to the sentences produced in S41 using the BERT pre-trained model.
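The sentence splitting in S41 can be sketched with a regular expression. The delimiter set is an assumption: common Chinese sentence-ending punctuation plus the Tibetan shad (U+0F0D) and double shad (U+0F0E). Since the shad can also mark clause boundaries, real Tibetan text would likely need additional rules. The example text is made up.

```python
import re
from typing import List

# Assumed delimiters: Chinese terminators plus Tibetan shad and double shad.
DELIMITERS = "。！？；\u0f0d\u0f0e"
SENTENCE_RE = re.compile(f"[^{DELIMITERS}]+[{DELIMITERS}]*")


def split_sentences(text: str) -> List[str]:
    """Split text into sentences, keeping the closing punctuation."""
    parts = (match.group().strip() for match in SENTENCE_RE.finditer(text))
    return [part for part in parts if part]


sentences = split_sentences("人参味甘。主治肺炎！")
print(sentences)
```

Keeping the terminator attached to each sentence preserves character offsets within the sentence, which matters for the character-level labeling that follows.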
S5: Use an entity-relation joint extraction model to extract knowledge from the text corpus obtained in step S4, obtaining entity-relation triples.
S51: Embed the input sentences as vectors at the word level. The vectorization model can be word2vec or BERT, and the vector dimension can be adjusted according to the actual data volume and the computing power of the model training platform.
S52: Pass the vectors through a bidirectional long short-term memory (BiLSTM) network and a multi-head self-attention encoding layer to extract the complex semantic features of each word and of its context.
S53: Use a linear-chain conditional random field (CRF) layer and a softmax layer to output the classification results of the two tasks, entity recognition and relation extraction.
The entity-relation joint extraction model in S5 is trained adversarially: a small perturbation is added to the vector representation of an original sample to obtain an adversarial sample, and the model is then trained on a mix of original and adversarial samples. The model uses a cross-entropy loss function.
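For the linear-chain CRF layer in S53, the best tag sequence is conventionally found with Viterbi decoding. The sketch below shows that decoding step only, assuming the network has already produced per-position emission scores; the tag set, all scores, and the large negative penalty used to forbid the O to I transition are hypothetical.

```python
from typing import List


def viterbi(
    emissions: List[List[float]], transitions: List[List[float]], tags: List[str]
) -> List[str]:
    """Return the highest-scoring tag sequence.

    emissions[t][j]: score of tag j at position t.
    transitions[i][j]: score of moving from tag i to tag j.
    """
    n = len(tags)
    score = list(emissions[0])
    backpointers: List[List[int]] = []
    for emission in emissions[1:]:
        prev, score, step = score, [], []
        for j in range(n):
            best_i = max(range(n), key=lambda i: prev[i] + transitions[i][j])
            step.append(best_i)
            score.append(prev[best_i] + transitions[best_i][j] + emission[j])
        backpointers.append(step)
    best = max(range(n), key=lambda i: score[i])
    path = [best]
    for step in reversed(backpointers):     # follow backpointers to the start
        best = step[best]
        path.append(best)
    return [tags[i] for i in reversed(path)]


tags = ["O", "B", "I"]
transitions = [[0.0, 0.0, -1e4],   # O -> I is effectively forbidden
               [0.0, 0.0, 0.0],
               [0.0, 0.0, 0.0]]
emissions = [[1.0, 0.8, 0.0],      # position 0 slightly prefers O
             [0.0, 0.0, 1.0]]      # position 1 prefers I
decoded = viterbi(emissions, transitions, tags)
print(decoded)
```

Taking the best tag at each position independently would give O followed by I, which is not a valid BIO sequence; the transition scores make the decoder choose B followed by I instead.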
S6: Use a knowledge graph completion model to score the possible combinations of dictionary entities and relations, and find the triples not extracted in step S5 in order to complete the graph.
S61: Use a TransE model to train a vector for each entity and each relation, so that every triple (head entity, relation, tail entity) satisfies the vector addition relation (head + relation ≈ tail); the length of the output vectors is configurable.
S62: Form a candidate triple set from arbitrary combinations of entities and relations.
S63: Filter the candidate triple set, deleting the valid triples already known in the knowledge base.
S64: Score each remaining triple with the deep pyramid convolution model shown in FIG. 3; a triple whose score is greater than a threshold is regarded as valid, and one whose score is smaller than the threshold as invalid.
S65: Add the valid triples to the existing knowledge graph.
The deep pyramid convolution model used in S6 is trained as follows: a valid triple and an invalid triple are input together during training, the invalid triple being obtained by Bernoulli-distribution sampling. The output score serves as the initial threshold; after prolonged training the score function gradually stabilizes, and the threshold is then fixed.
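The text does not spell out the Bernoulli sampling. One common reading, assumed here, is the Bernoulli negative sampling trick known from knowledge graph embedding work: for each relation, replace the head with probability tph / (tph + hpt), where tph is the average number of tails per head and hpt the average number of heads per tail, and replace the tail otherwise. All names and triples below are hypothetical.

```python
import random
from collections import defaultdict
from typing import Dict, List, Set, Tuple

Triple = Tuple[str, str, str]


def head_replace_prob(triples: List[Triple]) -> Dict[str, float]:
    """Per relation, the probability tph / (tph + hpt) of corrupting the head."""
    tails_of, heads_of = defaultdict(set), defaultdict(set)
    for h, r, t in triples:
        tails_of[(r, h)].add(t)
        heads_of[(r, t)].add(h)
    probs = {}
    for rel in {r for _, r, _ in triples}:
        tph_counts = [len(v) for (r, _), v in tails_of.items() if r == rel]
        hpt_counts = [len(v) for (r, _), v in heads_of.items() if r == rel]
        tph = sum(tph_counts) / len(tph_counts)
        hpt = sum(hpt_counts) / len(hpt_counts)
        probs[rel] = tph / (tph + hpt)
    return probs


def corrupt(triple: Triple, entities: List[str], probs: Dict[str, float],
            known: Set[Triple], rng: random.Random, max_tries: int = 100) -> Triple:
    """Draw an invalid triple by replacing the head or the tail."""
    h, r, t = triple
    for _ in range(max_tries):
        e = rng.choice(entities)
        candidate = (e, r, t) if rng.random() < probs[r] else (h, r, e)
        if candidate not in known:           # never return a known valid triple
            return candidate
    raise RuntimeError("no negative triple found")


triples = [("f1", "treats", "d1"), ("f1", "treats", "d2"),
           ("f1", "treats", "d3"), ("f2", "treats", "d1")]
probs = head_replace_prob(triples)
negative = corrupt(triples[0], ["f1", "f2", "d1", "d2", "d3"], probs,
                   set(triples), random.Random(0))
```

In this toy data each head has on average 2 tails and each tail 4/3 heads, so the head is replaced with probability 0.6. Biasing the choice this way reduces the chance of accidentally producing a true triple for one-to-many relations.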
S7: Manually check the triples generated in steps S5 and S6 and import them into a Neo4j graph database. On the basis of this database, service applications such as visual query, retrieval, and question answering over the Tibetan medicine knowledge graph can be built.
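The import in S7 can be sketched by generating parameterized Cypher MERGE statements for the checked triples. The sketch only builds the query strings and parameters; the node label Entity, the name property, and the relation type names are hypothetical. Relationship types cannot be passed as Cypher parameters, so they are checked against an allow-list before being interpolated.

```python
from typing import Dict, List, Tuple

Triple = Tuple[str, str, str]

# Hypothetical relation type names; only these may be interpolated into a query.
ALLOWED_RELATIONS = {"MAIN_TREATMENT", "COMPOSITION", "USE"}


def merge_statement(relation: str) -> str:
    """Build a MERGE query that creates both nodes and the relation if absent."""
    if relation not in ALLOWED_RELATIONS:
        raise ValueError(f"unknown relation type: {relation}")
    return (
        "MERGE (h:Entity {name: $head}) "
        "MERGE (t:Entity {name: $tail}) "
        f"MERGE (h)-[:{relation}]->(t)"
    )


def to_cypher(triples: List[Triple]) -> List[Tuple[str, Dict[str, str]]]:
    """Return (query, parameters) pairs, one per triple."""
    return [(merge_statement(r), {"head": h, "tail": t}) for h, r, t in triples]


statements = to_cypher([("Formula A", "MAIN_TREATMENT", "pneumonia")])
for query, params in statements:
    print(query, params)
```

Using MERGE rather than CREATE makes the import idempotent, so re-running it after a further round of manual checking does not duplicate nodes or relations. Each pair can then be executed with a Neo4j client, for example by passing the query and its parameters to a driver session.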
Although embodiments of the present invention have been shown and described, it will be appreciated by those skilled in the art that various changes, modifications, substitutions and alterations can be made in these embodiments without departing from the principles and spirit of the invention, and these embodiments are intended to be encompassed in the scope of the present invention.

Claims (7)

1. A traditional Tibetan medicine knowledge graph construction and completion method, characterized by comprising the following steps:
S1: designing a semantic framework for the Tibetan medicine knowledge graph and defining the relations between entities;
S2: constructing a Chinese-Tibetan bilingual entity dictionary;
S3: obtaining partially structured data by database import and web crawling to form an initial triple dataset;
S4: inputting Tibetan medicine text and preprocessing it to obtain an annotated text corpus;
S5: using an entity-relation joint extraction model to extract knowledge from the text corpus, obtaining entity-relation triples;
S6: using a knowledge graph completion model to score the possible combinations of dictionary entities and relations, and finding the triples not extracted in step S5 in order to complete the graph;
S7: manually checking the triples generated in steps S5 and S6 and importing them into a graph database to form the knowledge graph.
2. The traditional Tibetan medicine knowledge graph construction and completion method according to claim 1, wherein step S4 comprises the following steps:
S41: after preliminary screening, splitting the ancient-book text into sentences, downstream semantic annotation being performed sentence by sentence;
S42: taking the character as the minimum unit, applying BIO labeling (B-begin, I-inside, O-other) to the sentences produced in S41 using a BERT pre-trained model.
3. The traditional Tibetan medicine knowledge graph construction and completion method according to claim 1, wherein step S5 comprises the following steps:
S51: embedding the input sentences as vectors at the character/word level;
S52: passing the vectors through a bidirectional long short-term memory network and a multi-head self-attention encoding layer to extract the semantic features of the characters/words and of their context;
S53: using a linear-chain CRF layer and a softmax layer to output the results of the two tasks, entity recognition and relation extraction.
4. The traditional Tibetan medicine knowledge graph construction and completion method according to claim 1, wherein step S6 comprises the following steps:
S61: using a TransE model to train a vector for each entity and each relation, so that every triple (head entity, relation, tail entity) satisfies the vector addition relation, the length of the output vectors being configurable;
S62: forming candidate triples from arbitrary combinations of entities and relations;
S63: filtering the candidate triples, deleting the valid triples already known in the knowledge base;
S64: scoring each remaining triple with a deep pyramid convolution model, a triple whose score is greater than a threshold being regarded as valid and one whose score is smaller than the threshold as invalid;
S65: adding the triples judged valid in step S64 to the knowledge graph to complete it.
5. The traditional Tibetan medicine knowledge graph construction and completion method according to claim 1, wherein the entity-relation joint extraction model in step S5 performs the two tasks of entity recognition and relation extraction in parallel.
6. The method according to claim 1, wherein the entity-relation joint extraction model in step S5 is trained adversarially: an adversarial sample is obtained by adding a small perturbation to the vector representation of an original sample, and the model is then trained on a mix of original and adversarial samples.
7. The traditional Tibetan medicine knowledge graph construction and completion method according to claim 1, wherein a deep pyramid convolution model is used in step S6 to judge triple validity; the model consists of a base network and a deep convolutional network, the base network generating a feature map by one-dimensional convolution to serve as the subsequent input, the deep convolutional network applying further convolution and pooling operations to the features, and the number of repeated units controlling the convolution depth.
CN202110798028.4A 2021-07-16 2021-07-16 Traditional Tibetan medicine knowledge graph construction and completion method Pending CN115618005A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110798028.4A CN115618005A (en) 2021-07-16 2021-07-16 Traditional Tibetan medicine knowledge graph construction and completion method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110798028.4A CN115618005A (en) 2021-07-16 2021-07-16 Traditional Tibetan medicine knowledge graph construction and completion method

Publications (1)

Publication Number Publication Date
CN115618005A true CN115618005A (en) 2023-01-17

Family

ID=84854768

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110798028.4A Pending CN115618005A (en) 2021-07-16 2021-07-16 Traditional Tibetan medicine knowledge graph construction and completion method

Country Status (1)

Country Link
CN (1) CN115618005A (en)

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115878818A (en) * 2023-02-21 2023-03-31 创意信息技术股份有限公司 Geographic knowledge graph construction method and device, terminal and storage medium
CN116701665A (en) * 2023-08-08 2023-09-05 滨州医学院 Deep learning-based traditional Chinese medicine ancient book knowledge graph construction method
CN117408338A (en) * 2023-12-14 2024-01-16 神州医疗科技股份有限公司 Method and system for constructing knowledge graph of traditional Chinese medicine decoction pieces based on Chinese pharmacopoeia
CN117408338B (en) * 2023-12-14 2024-03-12 神州医疗科技股份有限公司 Method and system for constructing knowledge graph of traditional Chinese medicine decoction pieces based on Chinese pharmacopoeia

Legal Events

Date Code Title Description
PB01 Publication
DD01 Delivery of document by public notice

Addressee: Miao Fang

Document name: Notice of Publication of Invention Patent Application