CN116701665A - Deep learning-based traditional Chinese medicine ancient book knowledge graph construction method - Google Patents
Legal status: Withdrawn (the legal status is an assumption by Google Patents and is not a legal conclusion)
Classifications
- G06F16/367 — Information retrieval; creation of semantic tools: ontology
- G06F18/22 — Pattern recognition; matching criteria, e.g. proximity measures
- G06F40/242 — Natural language analysis; lexical tools: dictionaries
- G06F40/253 — Natural language analysis; grammatical analysis, style critique
- G06N3/045 — Neural network architectures; combinations of networks
- G06N3/047 — Neural network architectures; probabilistic or stochastic networks
- G06N3/048 — Neural network architectures; activation functions
- G06N3/08 — Neural networks; learning methods
- G06N5/022 — Knowledge engineering; knowledge acquisition (G06N5/025 — extracting rules from data)
- G06N5/04 — Inference or reasoning models
- G16H50/70 — ICT specially adapted for mining of medical data
- G16H70/00 — ICT specially adapted for the handling or processing of medical references
- Y02D10/00 — Energy-efficient computing
Abstract
The invention discloses a deep-learning-based method for constructing a knowledge graph from ancient books of traditional Chinese medicine, and relates to the technical field of knowledge graph construction. It solves the technical problem that the text of such books contains a large number of rare characters and uses grammar different from modern Chinese, so that mainstream knowledge graph construction methods cannot properly establish the attributes, entities and relations in the text. The method confirms the characteristic parameters of the corresponding words step by step, optimizes the rare characters according to those specific characteristic parameters, and thereby better fuses the attributes, entities and relations into a mainstream knowledge graph and applies it.
Description
Technical Field
The invention belongs to the technical field of knowledge graph construction, and particularly relates to a method for constructing a traditional Chinese medicine ancient book knowledge graph based on deep learning.
Background
The concept of the knowledge graph was proposed in 2012 to improve search engines. A knowledge graph is a typical multi-relational graph consisting of nodes (entities) and edges (relations between entities); it is essentially a semantic network used to reveal the relationships between things. As shown in fig. 1, a knowledge graph extracts concepts, entities and relations from many types of complex data and is a computable model of the relations among things. According to the coverage of the knowledge and the difference of fields, knowledge graphs can be divided into general-purpose knowledge graphs and domain knowledge graphs. With the continuous development of science and technology, knowledge graphs are widely applied in the NLP field, for example in semantic search, intelligent question answering and decision support, and have become an important driving force of artificial intelligence;
the knowledge graph architecture is divided into three parts: the first part is the acquisition of source data, i.e., obtaining useful resource information from each type of data; the second part is knowledge fusion, which links knowledge from multiple data sources and expands the range of knowledge; the third part is knowledge computation and knowledge application: knowledge computation is the main way in which a knowledge graph delivers its capabilities, and knowledge application combines the knowledge graph with a specific field or service to improve the efficiency of services in each field;
for ancient books of traditional Chinese medicine, the text contains a large number of rare characters and its grammar differs from modern Chinese grammar, so mainstream knowledge graph construction methods cannot properly establish the attributes, entities, relations and the like in these books; a knowledge graph construction method for ancient books of traditional Chinese medicine is therefore provided.
Disclosure of Invention
The present invention aims to solve at least one of the technical problems existing in the prior art; therefore, the invention provides a deep-learning-based traditional Chinese medicine ancient book knowledge graph construction method, which solves the technical problem that mainstream knowledge graph construction methods cannot properly establish attributes, entities, relations and the like because the text contains a large number of rare characters and its grammar differs from modern Chinese grammar.
To achieve the above object, an embodiment according to a first aspect of the present invention provides a method for constructing a ancient book knowledge graph of traditional Chinese medicine based on deep learning, comprising the steps of:
s1, processing unstructured multimodal traditional Chinese medicine domain data: extracting the text data in the ancient books of traditional Chinese medicine with a multimodal information extraction technique that combines OCR and NLP processing, converting the text data into semi-structured and structured data, and annotating the semi-structured data to obtain an entity-relation-entity data set;
s2, knowledge extraction: the grammar and the corresponding text content structure are trained jointly, and the entity and relation triples in the ancient books are extracted through the training process, thereby handling the rare characters and archaic grammar of the ancient books of traditional Chinese medicine; the specific steps are as follows:
s21, optimizing and embedding rare characters: the rare characters in the ancient book content are optimized with a traditional Chinese medicine rare-character optimization embedding model that simplifies them into corresponding common Chinese characters; word vectors are then embedded using an ancient-Chinese Bert, with the dictionary-optimized ancient sentences serving as the global feature description;
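The dictionary-based simplification in step S21 can be sketched as a character-level lookup; the mapping table and function name below are illustrative, with a few traditional-to-simplified pairs standing in for the actual rare-character dictionary:

```python
# Illustrative sketch of the rare-character optimization of step S21.
# RARE_CHAR_MAP is a stand-in for the real TCM rare-character dictionary.
RARE_CHAR_MAP = {
    "藥": "药",  # medicine
    "臟": "脏",  # viscera
    "氣": "气",  # qi
}

def normalize(text):
    # replace each rare/variant character with its common equivalent,
    # leaving characters outside the dictionary untouched
    return "".join(RARE_CHAR_MAP.get(ch, ch) for ch in text)
```

The normalized sentences would then be fed to the BERT embedding step as described above.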
s22, partition filtering with a partition filter encoder: at each time step, the word features are split into three partitions: an entity partition, a relation partition and a shared partition, where the entity partition is related only to entity extraction, the relation partition only to relation extraction, and the shared partition to both tasks; features irrelevant to a specific task are then filtered out by combining partitions; the specific method is as follows:
s221, first calculate the candidate partition information: c̃_t = tanh(W_c [x_t ; h_{t-1}] + b_c), where x_t is the input feature and h_{t-1} is the hidden state at time t-1;
s222, calculate the relation gate and the entity gate: ẽ_t = cummax(W_e [x_t ; h_{t-1}] + b_e), r̃_t = cummax(W_r [x_t ; h_{t-1}] + b_r), where cummax(·) = cumsum(softmax(·)) and cumsum(x1, x2, x3) = (x1, x1+x2, x1+x2+x3);
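The cummax gate defined above can be sketched in plain Python; cumsum(softmax(·)) yields a monotonically non-decreasing vector in [0, 1], which is what lets it act as a soft binary gate:

```python
import math

def softmax(xs):
    # numerically stable softmax over a plain list of floats
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def cumsum(xs):
    out, total = [], 0.0
    for x in xs:
        total += x
        out.append(total)
    return out

def cummax_gate(xs):
    # cummax(x) = cumsum(softmax(x)): non-decreasing, bounded by 1
    return cumsum(softmax(xs))
```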
s223, then generate three partitions from the two gates just calculated at each level; over the two levels (the history information c_{t-1} and the candidate information c̃_t) this gives 6 partitions: ρ_s = ẽ_t ∘ r̃_t (shared), ρ_e = ẽ_t ∘ (1 − r̃_t) (entity), ρ_r = (1 − ẽ_t) ∘ r̃_t (relation), where ∘ denotes the element-wise AND operation, (1 − ·) the NOT operation, c the history information and t-1 the corresponding time t-1;
s224, finally generate the information of the three partitions of time step t from the history gates and history information of time step t-1 and the candidate gates and candidate partition information of time step t: μ_x = ρ_x ∘ c_{t-1} + ρ̃_x ∘ c̃_t for x ∈ {e, r, s};
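The partition step can be sketched as element-wise gate arithmetic; the helper names are illustrative, and soft multiplication stands in for the AND/NOT operations on the two gates:

```python
def partitions(e_gate, r_gate):
    # soft AND is element-wise product; soft NOT of a gate g is (1 - g)
    rho_s = [e * r for e, r in zip(e_gate, r_gate)]        # entity AND relation (shared)
    rho_e = [e * (1 - r) for e, r in zip(e_gate, r_gate)]  # entity AND NOT relation
    rho_r = [(1 - e) * r for e, r in zip(e_gate, r_gate)]  # NOT entity AND relation
    return rho_e, rho_r, rho_s

def partition_info(rho_hist, c_prev, rho_cand, c_cand):
    # mu_x = rho_x ∘ c_{t-1} + rhõ_x ∘ c̃_t: combine the partitioned
    # history cell and candidate cell into one partition information vector
    return [ph * cp + pc * cc
            for ph, cp, pc, cc in zip(rho_hist, c_prev, rho_cand, c_cand)]
```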
s23, performing the filtering operation: from the information of the three partitions generated in step S22, three memory features are generated through interaction, achieving the filtering effect: entity-related / relation-related / shared: e_t = μ_e + μ_s, r_t = μ_r + μ_s, s_t = μ_s;
then the three memory features are each passed through the tanh() hyperbolic tangent function to obtain the three corresponding hidden states, which are output directly at the current time step and serve as the entity-related / relation-related / shared features for the next stage: h_{e,t} = tanh(e_t), h_{r,t} = tanh(r_t), h_{s,t} = tanh(s_t);
finally, the history information and hidden state are updated: the three memory features are concatenated and linearly mapped to obtain the history information of time step t, which is passed through the tanh() hyperbolic tangent function to obtain the hidden state of time step t: c_t = W_h [e_t ; r_t ; s_t] + b_h, h_t = tanh(c_t);
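Step S23 above can be sketched as follows; treating the shared memory as μ_s alone is an assumption about the interaction, as is the use of plain lists in place of tensors:

```python
import math

def pfn_filter_step(mu_e, mu_r, mu_s):
    # interaction: each task memory keeps its own partition plus the shared
    # one, so relation-only information is filtered out of the entity
    # feature and entity-only information out of the relation feature
    e_t = [a + b for a, b in zip(mu_e, mu_s)]
    r_t = [a + b for a, b in zip(mu_r, mu_s)]
    s_t = list(mu_s)
    # hidden states via tanh, output directly for the next stage
    h_e = [math.tanh(x) for x in e_t]
    h_r = [math.tanh(x) for x in r_t]
    h_s = [math.tanh(x) for x in s_t]
    return h_e, h_r, h_s
```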
s24, global representation: a global characterization is obtained for the two specific tasks: at each time step the entity feature and the relation feature are each concatenated with the shared feature, passed through a linear mapping and the tanh() hyperbolic tangent function, and a max-pooling operation is applied globally over all time steps to obtain the two task-specific features: g_e = maxpool(tanh(W_{ge} [h_{e,t} ; h_{s,t}])), g_r = maxpool(tanh(W_{gr} [h_{r,t} ; h_{s,t}])), where maxpool denotes the max-pooling operation;
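The global representation of step S24 can be sketched as below; the learned linear mapping is omitted (identity assumed) so the sketch only shows the concatenate–tanh–maxpool pattern:

```python
import math

def global_task_feature(task_feats, shared_feats):
    # task_feats / shared_feats: one feature vector per time step;
    # concatenate per step, squash with tanh, then max-pool over all steps
    projected = [[math.tanh(v) for v in tf + sf]
                 for tf, sf in zip(task_feats, shared_feats)]
    dim = len(projected[0])
    return [max(step[d] for step in projected) for d in range(dim)]
```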
s25, executing characteristic tasks: the method comprises the steps of respectively executing entity characteristic tasks and relation characteristic tasks, wherein the specific mode of executing the entity characteristic tasks is as follows:
s251, given an input sentence of length L, a table of size L × L is built; position (i, j) in the table represents the entity feature of the span starting at position i and ending at position j, expressed as the concatenation of the i-th entity feature, the j-th entity feature and the global entity representation, which is then processed by a linear transformation and the ELU activation function: h^{span}_{i,j} = ELU(W_s [h_{e,i} ; h_{e,j} ; g_e] + b_s);
the result then enters the output layer, which maps it linearly to the dimension of the number of entity categories; a sigmoid is applied to each dimension to judge whether the span belongs to that entity category: P(i, j, k) = sigmoid(W_o h^{span}_{i,j} + b_o)_k, where k indexes the entity types and the element e represents the probability that the word pair (w_i, w_j) marks the start and end positions of an entity of type k; for each word pair, h_i and h_j denote its word-level entity features;
s252, executing the relation feature task: given a sentence of length L, a table of size L × L is built; position (i, j) in the table relates the span whose first word is at position i to the span whose first word is at position j; the representation is analogous to the entity case: the i-th and j-th relation features and the global relation representation are concatenated, then passed through a linear transformation and the ELU activation function for multi-label classification: P(i, j, ℓ) = sigmoid(W'_o ELU(W'_r [h_{r,i} ; h_{r,j} ; g_r])) over the set R of relation labels, where for each relation ℓ the element r represents the probability of the words w_i and w_j acting as subject and object entities, T denotes the set of elements r, and h_{gr} denotes the global relation feature;
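The span-table scoring shared by steps S251 and S252 can be sketched as follows; score_fn is a placeholder for the linear–ELU–linear stack, and an independent sigmoid per type yields the multi-label decision:

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def span_scores(h, g, num_types, score_fn):
    # entry (i, j) scores the span from position i to position j; each of
    # the num_types output dimensions gets an independent sigmoid
    L = len(h)
    return [[[sigmoid(z) for z in score_fn(h[i], h[j], g)[:num_types]]
             for j in range(L)] for i in range(L)]
```

A toy score_fn such as `lambda hi, hj, g: [hi + hj + g]` suffices to exercise the table shape.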
S26, loss parameter analysis: two BCE losses are used; BCE is the classification loss for multi-label classification, so both tasks are treated as classification tasks — one loss for the NER task and another for the RE task: L_ner = BCE(P_ner, Y_ner), L_re = BCE(P_re, Y_re), L = L_ner + L_re;
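The BCE loss of step S26 can be sketched directly from its definition; eps is a numerical-stability constant added for the sketch:

```python
import math

def bce(probs, labels, eps=1e-12):
    # binary cross-entropy over independent labels, as used for both the
    # NER table and the RE table; the total loss is their sum
    n = len(probs)
    return -sum(y * math.log(p + eps) + (1 - y) * math.log(1 - p + eps)
                for p, y in zip(probs, labels)) / n
```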
s3, knowledge fusion: vocabularies are fused according to the calculated specific parameters; a global entity alignment method is chosen that combines multiple strategies to judge entity similarity and improve the knowledge fusion effect, and an entity matching algorithm based on Chinese similarity judgment assists the strategy decision;
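The Chinese-similarity entity matching that assists the alignment strategy can be sketched with a character-level sequence ratio; the threshold value is illustrative:

```python
from difflib import SequenceMatcher

def char_similarity(a, b):
    # character-level similarity; works on Chinese strings because
    # difflib compares arbitrary sequence elements
    return SequenceMatcher(None, a, b).ratio()

def align(entity, candidates, threshold=0.8):
    # pick the most similar candidate, or None if nothing is close enough
    best = max(candidates, key=lambda c: char_similarity(entity, c))
    return best if char_similarity(entity, best) >= threshold else None
```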
s4, knowledge application: the input data are mapped from the original space to another feature space through a nonlinear mapping transformation and the feature representation is learned, completing the knowledge application.
Compared with the prior art, the invention has the following beneficial effects: knowledge reasoning is realized with the ConvKE method; ConvKE adopts a dimension transformation strategy to increase the number of sliding steps of the convolution window over the triple matrix and the information interaction between entities and relations across more dimensions, and captures the global information of the triple in more dimensions by enlarging the receptive field of the 2-D convolutional sliding window;
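The dimension transformation strategy described above can be sketched as a reshape of the concatenated triple embeddings into a 2-D grid; the row count is a free parameter of the sketch, and the actual ConvKE layout may differ:

```python
def triple_to_2d(h, r, t, rows):
    # stack head / relation / tail embeddings and reshape into a 2-D grid
    # so that a 2-D convolution window can slide across both the embedding
    # dimension and the entity/relation dimension of the triple
    flat = h + r + t
    assert len(flat) % rows == 0, "embedding length must divide evenly"
    cols = len(flat) // rows
    return [flat[i * cols:(i + 1) * cols] for i in range(rows)]
```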
the characteristic parameters of the corresponding words are confirmed in turn through step-by-step analysis, the rare characters are optimized according to these specific parameters, and the attributes, entities and relations are thus better fused into and applied in the mainstream knowledge graph.
Drawings
FIG. 1 is a schematic diagram of a prior knowledge graph architecture;
FIG. 2 is a schematic flow chart of the method of the present invention.
Detailed Description
The technical solutions of the present invention will be clearly and completely described in connection with the embodiments, and it is obvious that the described embodiments are only some embodiments of the present invention, not all embodiments. All other embodiments, which can be made by those skilled in the art based on the embodiments of the invention without making any inventive effort, are intended to be within the scope of the invention.
Embodiment one: referring to fig. 2, the invention provides a method for constructing a traditional Chinese medicine ancient book knowledge graph based on deep learning, which comprises the following steps:
s1, processing unstructured multimodal traditional Chinese medicine domain data: extracting the text data in the ancient books of traditional Chinese medicine with a multimodal information extraction technique that combines OCR and NLP processing, converting the text data into semi-structured and structured data, and annotating the semi-structured data to obtain an entity-relation-entity data set;
s2, knowledge extraction: the grammar and the corresponding text content structure are trained jointly, and the entity and relation triples in the ancient books are extracted through the training process, thereby handling the rare characters and archaic grammar of the ancient books; the method specifically comprises the following steps:
s21, optimizing and embedding rare characters: the rare characters in the ancient book content are optimized with a traditional Chinese medicine rare-character optimization embedding model that simplifies them into corresponding common Chinese characters; word vectors are then embedded using an ancient-Chinese Bert, with the dictionary-optimized ancient sentences serving as the global feature description;
s22, partition filtering with a partition filter encoder: at each time step (i.e., each time state), the word features are split into three partitions: an entity partition, a relation partition and a shared partition, where the entity partition is related only to entity extraction, the relation partition only to relation extraction, and the shared partition to both tasks; features irrelevant to a specific task are then filtered out by combining partitions;
for example, by combining the entity partition and the shared partition, features related only to the relation partition can be filtered out;
the whole flow is thus divided into two parts: partitioning, which splits the features into the three partitions, and filtering, which merges partitions;
the method comprises the following steps:
s221, first calculate the candidate partition information: c̃_t = tanh(W_c [x_t ; h_{t-1}] + b_c), where x_t is the input feature and h_{t-1} is the hidden state at time t-1;
s222, calculate the relation gate and the entity gate: ẽ_t = cummax(W_e [x_t ; h_{t-1}] + b_e), r̃_t = cummax(W_r [x_t ; h_{t-1}] + b_r), where cummax(·) = cumsum(softmax(·));
s223, then generate three partitions from the two gates just calculated at each level; over the two levels (the history information c_{t-1} and the candidate information c̃_t) this gives 6 partitions: ρ_s = ẽ_t ∘ r̃_t (shared), ρ_e = ẽ_t ∘ (1 − r̃_t) (entity), ρ_r = (1 − ẽ_t) ∘ r̃_t (relation), where ∘ denotes the element-wise AND operation, (1 − ·) the NOT operation, c the history information and t-1 the corresponding time t-1;
s224, finally generate the information of the three partitions of time step t from the historical gates and historical information of time step t-1 and the candidate gates and candidate partition information of time step t: μ_x = ρ_x ∘ c_{t-1} + ρ̃_x ∘ c̃_t for x ∈ {e, r, s};
s23, performing the filtering operation: from the information of the three partitions generated in step S22, three memory features are generated through interaction, achieving the filtering effect: entity-related / relation-related / shared: e_t = μ_e + μ_s, r_t = μ_r + μ_s, s_t = μ_s;
then the three memory features are each passed through the tanh() hyperbolic tangent function to obtain the three corresponding hidden states, which are output directly at the current time step and serve as the entity-related / relation-related / shared features (NER feature / relation feature / shared feature) for the next stage: h_{e,t} = tanh(e_t), h_{r,t} = tanh(r_t), h_{s,t} = tanh(s_t);
finally, the history information and hidden state are updated: the three memory features are concatenated and linearly mapped to obtain the history information of time step t, which is passed through the tanh() hyperbolic tangent function to obtain the hidden state of time step t: c_t = W_h [e_t ; r_t ; s_t] + b_h, h_t = tanh(c_t);
s24, global representation: a global characterization is obtained for the two specific tasks: at each time step the entity feature and the relation feature are each concatenated with the shared feature, passed through a linear mapping and the tanh() hyperbolic tangent function, and a max-pooling operation is applied globally over all time steps to obtain the two task-specific features: g_e = maxpool(tanh(W_{ge} [h_{e,t} ; h_{s,t}])), g_r = maxpool(tanh(W_{gr} [h_{r,t} ; h_{s,t}])), where maxpool denotes the max-pooling operation;
s25, executing characteristic tasks: the method comprises the steps of respectively executing entity characteristic tasks and relation characteristic tasks, wherein the specific mode of executing the entity characteristic tasks is as follows:
s251, given an input sentence of length L, a table of size L × L is built; position (i, j) in the table represents the entity feature of the span starting at position i and ending at position j, expressed as the concatenation of the i-th entity feature, the j-th entity feature and the global entity representation, which is then processed by a linear transformation and the ELU activation function: h^{span}_{i,j} = ELU(W_s [h_{e,i} ; h_{e,j} ; g_e] + b_s);
the result then enters the output layer, which maps it linearly to the dimension of the number of entity categories; a sigmoid is applied to each dimension to judge whether the span belongs to that entity category (this multi-label classification scheme handles overlapping entities): P(i, j, k) = sigmoid(W_o h^{span}_{i,j} + b_o)_k, where k indexes the entity types and the element e represents the probability that the word pair (w_i, w_j) marks the start and end positions of an entity of type k; for each word pair, h_i and h_j denote its word-level entity features;
s252, executing the relation feature task: given a sentence of length L, a table of size L × L is built; position (i, j) in the table relates the span whose first word is at position i to the span whose first word is at position j; the representation is analogous to the entity case: the i-th and j-th relation features and the global relation representation are concatenated, then passed through a linear transformation and the ELU activation function for multi-label classification: P(i, j, ℓ) = sigmoid(W'_o ELU(W'_r [h_{r,i} ; h_{r,j} ; g_r])) over the set R of relation labels, where for each relation ℓ the element r represents the probability of the words w_i and w_j acting as subject and object entities, T denotes the set of elements r, and h_{gr} denotes the global relation feature;
S26, loss parameter analysis: two BCE losses are used, one for the NER task and another for the RE task: L_ner = BCE(P_ner, Y_ner), L_re = BCE(P_re, Y_re), L = L_ner + L_re;
s3, knowledge fusion is performed: according to the calculated l_ner and l_re, vocabularies with the same parameters are fused; a global entity alignment method is chosen that combines multiple strategies to judge entity similarity and improve the knowledge fusion effect, and an entity matching algorithm based on Chinese similarity judgment assists the strategy decision;
s4, knowledge application: deep learning is widely applied in the NLP field and achieves remarkable results; a deep neural network captures features by mapping the input data from the original space to another feature space through a nonlinear mapping transformation and learning the feature representation, which suits knowledge reasoning tasks; the invention realizes knowledge reasoning with the ConvKE method: ConvKE adopts a dimension transformation strategy to increase the number of sliding steps of the convolution window over the triple matrix and the information interaction between entities and relations across more dimensions, and captures the global information of the triple in more dimensions by enlarging the receptive field of the 2-D convolutional sliding window.
Embodiment two: based on the overall implementation of the first embodiment, step S2 of the present embodiment further includes step S27: setting knowledge extraction model parameters, wherein the specific model parameters are as follows:
Epoch: 120
Hidden size: 300
Batch size: 32
Embed mode: bert-base-chinese
Lr: 0.00002
Weight decay: 0
Seed: 0
Dropout: 0.1
Dropconnect: 0.1
Step: 50
Clip: 0.25
Max seq len: 150.
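The parameters of step S27 can be collected into a single configuration mapping; the key names are illustrative renderings of the labels above:

```python
# Training configuration of step S27 (key names are illustrative)
CONFIG = {
    "epoch": 120,
    "hidden_size": 300,
    "batch_size": 32,
    "embed_mode": "bert-base-chinese",
    "lr": 2e-5,            # 0.00002
    "weight_decay": 0,
    "seed": 0,
    "dropout": 0.1,
    "dropconnect": 0.1,
    "step": 50,
    "clip": 0.25,
    "max_seq_len": 150,
}
```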
Embodiment three: on the basis of the above two embodiments, this embodiment further includes processing Chinese natural language during the specific implementation; the specific processing manner is:
1) Acquiring corpus;
2) Preprocessing the corpus, including corpus cleaning, word segmentation, part-of-speech tagging, stop-word removal and other steps;
3) Characterization, i.e., vectorization, which represents the segmented characters and words as types (vectors) that a computer can compute with, helping to better express the similarity relationships among different words;
4) Model training, including traditional supervised, semi-supervised and unsupervised learning models, which can be selected according to the application requirements. Over-fitting and under-fitting may occur during model training: over-fitting is mainly addressed by adding regularization terms or increasing the amount of training data, while under-fitting is addressed by reducing regularization terms or adding further feature terms;
5) Performance evaluation.
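The characterization step 3) can be sketched with a bag-of-words vectorization and cosine similarity; the toy vocabulary is illustrative:

```python
import math
from collections import Counter

def bow_vector(tokens, vocab):
    # bag-of-words: count each vocabulary word's occurrences in the tokens
    counts = Counter(tokens)
    return [counts.get(w, 0) for w in vocab]

def cosine(u, v):
    # cosine similarity expresses the similarity relationship between words
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0
```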
Chinese information processing mainly deals with characters, words, paragraphs or chapters. The main methods are rule-based and statistics-based: the former processes text manually according to language-related rules, while the latter analyzes data through large-scale databases to realize natural language processing.
Natural language processing is strongly affected by data, and the growth of data accounts for the improved performance of most NLP applications (e.g., machine translation); with strong data support, text can be better understood and analyzed, which is why many NLP applications today adopt data-driven analysis methods.
Embodiment IV: this embodiment combines all of the implementations of the three embodiments described above.
The partial data in the formulas are all obtained by removing dimensions and taking the numerical values for calculation; each formula is the one closest to the real situation, obtained by simulating a large amount of collected data in software. The preset parameters and preset thresholds in the formulas are set by those skilled in the art according to the actual situation or are obtained through large-scale data simulation.
The above embodiments are only for illustrating the technical method of the present invention and not for limiting it; it should be understood by those skilled in the art that the technical method of the present invention may be modified or substituted without departing from its spirit and scope.
Claims (6)
1. A method for constructing a traditional Chinese medicine ancient book knowledge graph based on deep learning, characterized by comprising the following steps:
S1, processing unstructured multi-modal traditional Chinese medicine field data: extracting text data from traditional Chinese medicine ancient books by adopting multi-modal information extraction technology combining OCR and NLP technologies, converting it into semi-structured and structured data, and labeling the semi-structured data to obtain an entity-relation-entity dataset;
S2, performing knowledge extraction: jointly training on grammar and the corresponding text content structure, extracting entity and relation triples from the traditional Chinese medicine ancient books through the training process, and handling the rare characters and archaic grammar of the ancient texts;
S3, carrying out knowledge fusion: according to the calculated specific parameters, fusing the vocabularies and judging entity similarity with a global entity alignment method that comprehensively applies multiple strategies, the strategy judgment being assisted by an entity matching algorithm based on Chinese similarity, so as to improve the knowledge fusion effect;
S4, carrying out knowledge application: the input data is mapped from the original space to another feature space through a nonlinear mapping transformation and feature representations are learned, thereby completing knowledge application.
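The entity-relation-entity dataset produced in step S1 can be represented, in a minimal sketch, as triples. The field names and the example herb/symptom values below are illustrative assumptions, not the patent's actual schema:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Triple:
    head: str      # head entity, e.g. a herb
    relation: str  # relation type, e.g. "treats"
    tail: str      # tail entity, e.g. a symptom or formula

# A tiny labeled set of the kind S1 produces from semi-structured text.
dataset = [
    Triple("ginseng", "treats", "qi deficiency"),
    Triple("ginseng", "part_of", "Shengmai San"),
]

# The graph's node set is the union of head and tail entities.
entities = {t.head for t in dataset} | {t.tail for t in dataset}
```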
2. The method for constructing a traditional Chinese medicine ancient book knowledge graph based on deep learning according to claim 1, wherein in step S2 the specific steps of knowledge extraction are as follows:
S21, optimized embedding of rare characters: rare characters in the ancient book content are optimized with a traditional Chinese medicine rare-character optimized embedding model and simplified into their corresponding modern Chinese characters; word vectors are then embedded using ancient-text BERT, with the dictionary-optimized ancient-text sentences serving as the global feature description;
S22, carrying out partition filtering with a partition filter encoder: at each time step, the word features are split into three partitions: an entity partition, related only to entity extraction; a relation partition, related only to relation extraction; and a shared partition, related to both tasks; features unrelated to a specific task are then filtered out by merging the partitions.
3. The method for constructing a traditional Chinese medicine ancient book knowledge graph based on deep learning according to claim 2, wherein in step S22 the filtering is performed in the following specific manner:
S221, first calculating the candidate partition information from x_t, the input feature, and h_(t-1), the hidden state value corresponding to time t-1;
S222, calculating a relation threshold and an entity threshold, where cummax(·) = cumsum(softmax(·)) and cumsum(x1, x2, x3) = (x1, x1+x2, x1+x2+x3);
S223, then generating three partitions at each layer using the two thresholds just calculated, giving 6 partitions over the two layers, where ∘ denotes the AND operation and - denotes the NOT operation, c denotes history information, and the subscript t-1 denotes the corresponding time t-1;
S224, finally generating the information of the three partitions at time step t from the history threshold and history information of time step t-1, together with the candidate threshold and candidate partition information of time step t.
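Under the definitions given in S222 and S223 (cummax as the cumulative sum of a softmax, AND as an elementwise product, NOT of a gate g as 1 - g), the gate and partition computation can be sketched as follows. The gate input logits are illustrative values; in the real encoder they come from learned linear layers over x_t and h_(t-1):

```python
import math

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def cummax(xs):
    """cummax(x) = cumsum(softmax(x)): a monotone gate rising toward 1."""
    out, run = [], 0.0
    for p in softmax(xs):
        run += p
        out.append(run)
    return out

def partitions(entity_gate, relation_gate):
    """Split features into entity-only, relation-only, and shared partitions.
    Soft AND is the elementwise product; soft NOT of g is 1 - g."""
    entity_only = [e * (1 - r) for e, r in zip(entity_gate, relation_gate)]
    relation_only = [r * (1 - e) for e, r in zip(entity_gate, relation_gate)]
    shared = [e * r for e, r in zip(entity_gate, relation_gate)]
    return entity_only, relation_only, shared

e_gate = cummax([2.0, 0.5, -1.0])   # illustrative logits, not learned values
r_gate = cummax([-1.0, 0.5, 2.0])
ent, rel, sh = partitions(e_gate, r_gate)
```

Because cummax is monotone in (0, 1], each gate marks a soft boundary in the feature dimensions, and the three products carve those dimensions into the task-only and shared regions the claim describes.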
4. The deep learning-based traditional Chinese medicine ancient book knowledge graph construction method according to claim 2, further comprising:
S23, performing the filtering operation: from the information of the three partitions generated in step S22, three memory features are generated through interaction, thereby achieving the filtering effect: entity-related, relation-related, and shared-related;
then the three memory features are each passed through the tanh() hyperbolic tangent function to obtain three corresponding hidden states, which are output directly from the history information of the current time step and used as the entity-related/relation-related/shared-related features for the next stage;
finally, the history information and the hidden state are updated: the three memory features are spliced together and linearly mapped to obtain the history information of time step t, and this history information is passed through the tanh() hyperbolic tangent function to obtain the hidden state of time step t;
S24, carrying out global representation: the global characterizations of the two specific tasks, namely the entity-focused feature and the relation-focused feature, are obtained by splicing each time step's task feature with the shared focused feature, then applying a linear mapping, the tanh() hyperbolic tangent function, and a global max-pooling operation, where maxpool denotes the max-pooling operation.
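The global characterization in S24, splicing each step's task-specific feature with the shared feature, passing through tanh, and max-pooling over time steps, can be sketched as follows. For brevity the learned linear mapping is omitted; this is an assumption made purely for illustration:

```python
import math

def global_task_feature(task_feats, shared_feats):
    """For each time step, splice the task feature with the shared feature,
    apply tanh (the learned linear map is omitted here), then take an
    elementwise max over all time steps (max-pooling)."""
    spliced = [
        [math.tanh(v) for v in t + s]          # concatenate, then tanh
        for t, s in zip(task_feats, shared_feats)
    ]
    # max-pool across time steps, dimension by dimension
    return [max(col) for col in zip(*spliced)]

# three time steps, 2-d task features spliced with 2-d shared features
task = [[0.1, -0.2], [0.5, 0.0], [-0.3, 0.9]]
shared = [[0.0, 0.0], [0.2, -0.1], [0.4, 0.3]]
g = global_task_feature(task, shared)
```

Max-pooling makes the result length-independent: however many time steps the sentence has, each dimension of the global feature keeps only its strongest activation.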
5. The deep learning-based traditional Chinese medicine ancient book knowledge graph construction method according to claim 4, further comprising:
S25, executing the feature tasks: the entity feature task and the relation feature task are executed respectively; the entity feature task is executed in the following specific manner:
S251, for a group of input sentences of length L, a table of size L x L is set up; position (i, j) in the table represents the entity feature of the span starting at position i and ending at position j, formed by splicing the entity-focused features of positions i and j with the global representation of the entity-focused feature, and then processed by a linear transformation and an ELU activation function;
the result then enters the output layer, which linearly maps it to the dimension of the number of entity categories; a sigmoid function is applied to each dimension to judge whether the span represents an entity of that category, where k denotes each entity type and the element e denotes the probability that the word pair (w_i, w_j) marks the start and end positions of an entity of type k;
for each word pair (w_i, w_j), h_i and h_j represent their word-level entity features;
S252, executing the relation feature task: for a group of input sentences of length L, a table of size L x L is set up; position (i, j) in the table marks the pair of spans whose first words are at positions i and j. The representation is similar to that of the entity unit: the relation-focused features of positions i and j are spliced with the global representation of the relation-focused feature, then passed through a linear transformation and an ELU activation function to perform multi-label classification, where R denotes the set of relation labels; for each relation l, the element r denotes the probability that words w_i and w_j act as the subject and object entities, T denotes the set of elements r, and h_gr denotes the global feature used to obtain the span representation of entity i.
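The L x L table filling of S251 can be sketched as follows. The scoring function here is a deliberately simplified stand-in: a dot product followed by ELU and sigmoid replaces the learned linear layers over spliced h_i, h_j and global features described above:

```python
import math

def elu(x, alpha=1.0):
    """ELU activation: identity for x > 0, exponential decay below."""
    return x if x > 0 else alpha * (math.exp(x) - 1.0)

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def span_table(word_feats):
    """Fill an L x L table where entry (i, j) scores the span from word i
    to word j. A real model splices h_i, h_j and a global feature through
    learned layers; a dot product stands in for that transformation here."""
    L = len(word_feats)
    table = [[0.0] * L for _ in range(L)]
    for i in range(L):
        for j in range(i, L):  # spans end at or after their start
            score = sum(a * b for a, b in zip(word_feats[i], word_feats[j]))
            table[i][j] = sigmoid(elu(score))
    return table

feats = [[0.5, 1.0], [1.0, -0.5], [0.2, 0.3]]
table = span_table(feats)
```

Only the upper triangle is filled, matching the constraint that a span's end position j cannot precede its start position i; each cell holds a per-category probability after the sigmoid.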
6. The deep learning-based traditional Chinese medicine ancient book knowledge graph construction method according to claim 5, further comprising:
S26, carrying out loss parameter analysis: two BCE losses are used; BCE loss is a classification loss for multi-label classification and can complete the classification tasks. Both tasks are treated as classification tasks: one loss for the NER task, the other for the RE task.
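The two BCE losses of S26, one for the NER task and one for the RE task, can be sketched as below. The predicted-probability and label values are illustrative; in training they come from the sigmoid outputs of the two span tables:

```python
import math

def bce(p, y, eps=1e-12):
    """Binary cross-entropy for one label: y in {0, 1}, p in (0, 1)."""
    return -(y * math.log(p + eps) + (1 - y) * math.log(1 - p + eps))

def multilabel_bce(probs, labels):
    """Mean BCE over all labels of a multi-label classification task."""
    return sum(bce(p, y) for p, y in zip(probs, labels)) / len(probs)

# one loss per task; the joint training objective is their sum
loss_ner = multilabel_bce([0.9, 0.2], [1, 0])  # entity-table predictions
loss_re = multilabel_bce([0.1, 0.8], [0, 1])   # relation-table predictions
loss = loss_ner + loss_re
```

Summing the two losses is what makes the training joint: gradients from both the NER and RE heads flow back through the shared partition filter encoder.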
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202310988524.5A CN116701665A (en) | 2023-08-08 | 2023-08-08 | Deep learning-based traditional Chinese medicine ancient book knowledge graph construction method |
Publications (1)
Publication Number | Publication Date |
---|---|
CN116701665A true CN116701665A (en) | 2023-09-05 |
Family
ID=87843750
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202310988524.5A Withdrawn CN116701665A (en) | 2023-08-08 | 2023-08-08 | Deep learning-based traditional Chinese medicine ancient book knowledge graph construction method |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN116701665A (en) |
Citations (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
WO2021196520A1 (en) * | 2020-03-30 | 2021-10-07 | 西安交通大学 | Tax field-oriented knowledge map construction method and system |
US20220180065A1 (en) * | 2020-12-09 | 2022-06-09 | Beijing Wodong Tianjun Information Technology Co., Ltd. | System and method for knowledge graph construction using capsule neural network |
WO2022116417A1 (en) * | 2020-12-03 | 2022-06-09 | 平安科技(深圳)有限公司 | Triple information extraction method, apparatus, and device, and computer-readable storage medium |
CN115238040A (en) * | 2022-08-02 | 2022-10-25 | 北京科技大学 | Steel material science knowledge graph construction method and system |
CN115618005A (en) * | 2021-07-16 | 2023-01-17 | 中国传媒大学 | Traditional Tibetan medicine knowledge graph construction and completion method |
CN116127090A (en) * | 2022-12-28 | 2023-05-16 | 中国航空综合技术研究所 | Aviation system knowledge graph construction method based on fusion and semi-supervision information extraction |
Non-Patent Citations (2)
Title |
---|
宋伟; 张游杰: "Knowledge graph construction method based on environment information fusion", 计算机系统应用 (Computer Systems & Applications), no. 06 *
陈荟; 邓晖; 吴道婷: "Research on automatic construction of knowledge graphs for instructional design based on natural language processing", 中国教育信息化 (China Education Informatization), no. 07 *
Cited By (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN117236342A (en) * | 2023-09-28 | 2023-12-15 | 南京大经中医药信息技术有限公司 | Chinese medicine classics semantic analysis method and system combined with knowledge graph |
CN117236342B (en) * | 2023-09-28 | 2024-05-28 | 南京大经中医药信息技术有限公司 | Chinese medicine classics semantic analysis method and system combined with knowledge graph |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN112347268B (en) | Text-enhanced knowledge-graph combined representation learning method and device | |
Wang et al. | Feature extraction and analysis of natural language processing for deep learning English language | |
CN108182295B (en) | Enterprise knowledge graph attribute extraction method and system | |
CN110287481B (en) | Named entity corpus labeling training system | |
Chen et al. | Research on text sentiment analysis based on CNNs and SVM | |
Li et al. | Improving convolutional neural network for text classification by recursive data pruning | |
CN112560432A (en) | Text emotion analysis method based on graph attention network | |
CN111782769B (en) | Intelligent knowledge graph question-answering method based on relation prediction | |
CN113743119B (en) | Chinese named entity recognition module, method and device and electronic equipment | |
CN112417884A (en) | Sentence semantic relevance judging method based on knowledge enhancement and knowledge migration | |
CN115510245B (en) | Unstructured data-oriented domain knowledge extraction method | |
CN114077673A (en) | Knowledge graph construction method based on BTBC model | |
CN110852089A (en) | Operation and maintenance project management method based on intelligent word segmentation and deep learning | |
CN113221571A (en) | Entity relation joint extraction method based on entity correlation attention mechanism | |
CN111428481A (en) | Entity relation extraction method based on deep learning | |
CN111651973A (en) | Text matching method based on syntax perception | |
CN116701665A (en) | Deep learning-based traditional Chinese medicine ancient book knowledge graph construction method | |
CN114491036A (en) | Semi-supervised text classification method and system based on self-supervision and supervised joint training | |
CN116383352A (en) | Knowledge graph-based method for constructing field intelligent question-answering system by using zero samples | |
CN114048314A (en) | Natural language steganalysis method | |
CN114021584A (en) | Knowledge representation learning method based on graph convolution network and translation model | |
CN116522165B (en) | Public opinion text matching system and method based on twin structure | |
CN116680407A (en) | Knowledge graph construction method and device | |
CN114239575B (en) | Statement analysis model construction method, statement analysis method, device, medium and computing equipment | |
CN111708896B (en) | Entity relationship extraction method applied to biomedical literature |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
WW01 | Invention patent application withdrawn after publication | Application publication date: 20230905 |