CN116701665A - Deep learning-based traditional Chinese medicine ancient book knowledge graph construction method - Google Patents
Legal status: Withdrawn (the legal status is an assumption by Google Patents and is not a legal conclusion)
Classifications
- G06F16/367 — Information retrieval; creation of semantic tools: ontology
- G06F18/22 — Pattern recognition; matching criteria, e.g. proximity measures
- G06F40/242 — Natural language analysis; lexical tools: dictionaries
- G06F40/253 — Natural language analysis; grammatical analysis, style critique
- G06N3/045 — Neural network architectures; combinations of networks
- G06N3/047 — Neural network architectures; probabilistic or stochastic networks
- G06N3/048 — Neural network architectures; activation functions
- G06N3/08 — Neural networks; learning methods
- G06N5/022 — Knowledge engineering; knowledge acquisition (G06N5/025 — extracting rules from data)
- G06N5/04 — Inference or reasoning models
- G16H50/70 — ICT specially adapted for mining of medical data
- G16H70/00 — ICT specially adapted for the handling or processing of medical references
- Y02D10/00 — Energy-efficient computing
Abstract
The invention discloses a deep-learning-based method for constructing a knowledge graph from ancient books of traditional Chinese medicine, and relates to the technical field of knowledge graph construction. It solves the technical problem that the text of such books contains a large number of rare characters and uses grammar different from modern Chinese, so that mainstream knowledge graph construction methods cannot properly establish the attributes, entities and relations in the text. The method confirms the characteristic parameters of the corresponding words step by step, optimizes the rare characters according to those specific characteristic parameters, and thereby better fuses the attributes, entities and relations into a mainstream knowledge graph and applies it.
Description
Technical Field
The invention belongs to the technical field of knowledge graph construction, and particularly relates to a method for constructing a traditional Chinese medicine ancient book knowledge graph based on deep learning.
Background
The concept of the knowledge graph was proposed in 2012 to improve search engines. A knowledge graph is a typical multi-relational graph consisting of nodes (entities) and edges (relations between entities); it is essentially a semantic network used to reveal the relationships between things. As shown in fig. 1, a knowledge graph extracts concepts, entities and relations from many types of complex data and is a computable model of the relations among things. According to the coverage of the knowledge and the difference of fields, knowledge graphs can be divided into general-purpose knowledge graphs and domain knowledge graphs. With the continuous development of science and technology, knowledge graphs are widely applied in the NLP field, for example in semantic search, intelligent question answering and decision support, and have become an important driving force of artificial intelligence;
the knowledge graph architecture is divided into three parts: the first part is the acquisition of source data, i.e., obtaining useful resource information from each type of data; the second part is knowledge fusion, which links knowledge from multiple data sources and expands the range of knowledge; the third part is knowledge computation and knowledge application: knowledge computation is the main way in which a knowledge graph delivers its capabilities, and knowledge application combines the knowledge graph with a specific field or service to improve the efficiency of services in each field;
for ancient books of traditional Chinese medicine, the text contains a large number of rare characters and its grammar differs from modern Chinese grammar, so mainstream knowledge graph construction methods cannot properly establish the attributes, entities, relations and the like in these books; a knowledge graph construction method for ancient books of traditional Chinese medicine is therefore provided.
Disclosure of Invention
The present invention aims to solve at least one of the technical problems existing in the prior art; therefore, the invention provides a deep-learning-based traditional Chinese medicine ancient book knowledge graph construction method, which solves the technical problem that mainstream knowledge graph construction methods cannot properly establish attributes, entities, relations and the like because the text contains a large number of rare characters and its grammar differs from modern Chinese grammar.
To achieve the above object, an embodiment according to a first aspect of the present invention provides a method for constructing a ancient book knowledge graph of traditional Chinese medicine based on deep learning, comprising the steps of:
s1, processing unstructured multimodal traditional Chinese medicine domain data: extracting the text data in the ancient books of traditional Chinese medicine with a multimodal information extraction technique that combines OCR and NLP processing, converting the text data into semi-structured and structured data, and annotating the semi-structured data to obtain an entity-relation-entity data set;
s2, knowledge extraction: the grammar and the corresponding text content structure are trained jointly, and the entity and relation triples in the ancient books are extracted through the training process, thereby handling the rare characters and archaic grammar of the ancient books of traditional Chinese medicine; the specific steps are as follows:
s21, optimizing and embedding rare characters: the rare characters in the ancient book content are optimized with a traditional Chinese medicine rare-character optimization embedding model that simplifies them into corresponding common Chinese characters; word vectors are then embedded using an ancient-Chinese Bert, with the dictionary-optimized ancient sentences serving as the global feature description;
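The dictionary-based simplification in step S21 can be sketched as a character-level lookup; the mapping table and function name below are illustrative, with a few traditional-to-simplified pairs standing in for the actual rare-character dictionary:

```python
# Illustrative sketch of the rare-character optimization of step S21.
# RARE_CHAR_MAP is a stand-in for the real TCM rare-character dictionary.
RARE_CHAR_MAP = {
    "藥": "药",  # medicine
    "臟": "脏",  # viscera
    "氣": "气",  # qi
}

def normalize(text):
    # replace each rare/variant character with its common equivalent,
    # leaving characters outside the dictionary untouched
    return "".join(RARE_CHAR_MAP.get(ch, ch) for ch in text)
```

The normalized sentences would then be fed to the BERT embedding step as described above.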
s22, partition filtering with a partition filter encoder: at each time step, the word features are split into three partitions: an entity partition, a relation partition and a shared partition, where the entity partition is related only to entity extraction, the relation partition only to relation extraction, and the shared partition to both tasks; features irrelevant to a specific task are then filtered out by combining partitions; the specific method is as follows:
s221, first calculate the candidate partition information: c̃_t = tanh(W_c [x_t ; h_{t-1}] + b_c), where x_t is the input feature and h_{t-1} is the hidden state at time t-1;
s222, calculate the relation gate and the entity gate: ẽ_t = cummax(W_e [x_t ; h_{t-1}] + b_e), r̃_t = cummax(W_r [x_t ; h_{t-1}] + b_r), where cummax(·) = cumsum(softmax(·)) and cumsum(x1, x2, x3) = (x1, x1+x2, x1+x2+x3);
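The cummax gate defined above can be sketched in plain Python; cumsum(softmax(·)) yields a monotonically non-decreasing vector in [0, 1], which is what lets it act as a soft binary gate:

```python
import math

def softmax(xs):
    # numerically stable softmax over a plain list of floats
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def cumsum(xs):
    out, total = [], 0.0
    for x in xs:
        total += x
        out.append(total)
    return out

def cummax_gate(xs):
    # cummax(x) = cumsum(softmax(x)): non-decreasing, bounded by 1
    return cumsum(softmax(xs))
```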
s223, then generate three partitions from the two gates just calculated at each level; over the two levels (the history information c_{t-1} and the candidate information c̃_t) this gives 6 partitions: ρ_s = ẽ_t ∘ r̃_t (shared), ρ_e = ẽ_t ∘ (1 − r̃_t) (entity), ρ_r = (1 − ẽ_t) ∘ r̃_t (relation), where ∘ denotes the element-wise AND operation, (1 − ·) the NOT operation, c the history information and t-1 the corresponding time t-1;
s224, finally generate the information of the three partitions of time step t from the history gates and history information of time step t-1 and the candidate gates and candidate partition information of time step t: μ_x = ρ_x ∘ c_{t-1} + ρ̃_x ∘ c̃_t for x ∈ {e, r, s};
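The partition step can be sketched as element-wise gate arithmetic; the helper names are illustrative, and soft multiplication stands in for the AND/NOT operations on the two gates:

```python
def partitions(e_gate, r_gate):
    # soft AND is element-wise product; soft NOT of a gate g is (1 - g)
    rho_s = [e * r for e, r in zip(e_gate, r_gate)]        # entity AND relation (shared)
    rho_e = [e * (1 - r) for e, r in zip(e_gate, r_gate)]  # entity AND NOT relation
    rho_r = [(1 - e) * r for e, r in zip(e_gate, r_gate)]  # NOT entity AND relation
    return rho_e, rho_r, rho_s

def partition_info(rho_hist, c_prev, rho_cand, c_cand):
    # mu_x = rho_x ∘ c_{t-1} + rhõ_x ∘ c̃_t: combine the partitioned
    # history cell and candidate cell into one partition information vector
    return [ph * cp + pc * cc
            for ph, cp, pc, cc in zip(rho_hist, c_prev, rho_cand, c_cand)]
```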
s23, performing the filtering operation: from the information of the three partitions generated in step S22, three memory features are generated through interaction, achieving the filtering effect: entity-related / relation-related / shared: e_t = μ_e + μ_s, r_t = μ_r + μ_s, s_t = μ_s;
then the three memory features are each passed through the tanh() hyperbolic tangent function to obtain the three corresponding hidden states, which are output directly at the current time step and serve as the entity-related / relation-related / shared features for the next stage: h_{e,t} = tanh(e_t), h_{r,t} = tanh(r_t), h_{s,t} = tanh(s_t);
finally, the history information and hidden state are updated: the three memory features are concatenated and linearly mapped to obtain the history information of time step t, which is passed through the tanh() hyperbolic tangent function to obtain the hidden state of time step t: c_t = W_h [e_t ; r_t ; s_t] + b_h, h_t = tanh(c_t);
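Step S23 above can be sketched as follows; treating the shared memory as μ_s alone is an assumption about the interaction, as is the use of plain lists in place of tensors:

```python
import math

def pfn_filter_step(mu_e, mu_r, mu_s):
    # interaction: each task memory keeps its own partition plus the shared
    # one, so relation-only information is filtered out of the entity
    # feature and entity-only information out of the relation feature
    e_t = [a + b for a, b in zip(mu_e, mu_s)]
    r_t = [a + b for a, b in zip(mu_r, mu_s)]
    s_t = list(mu_s)
    # hidden states via tanh, output directly for the next stage
    h_e = [math.tanh(x) for x in e_t]
    h_r = [math.tanh(x) for x in r_t]
    h_s = [math.tanh(x) for x in s_t]
    return h_e, h_r, h_s
```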
s24, global representation: a global characterization is obtained for the two specific tasks: at each time step the entity feature and the relation feature are each concatenated with the shared feature, passed through a linear mapping and the tanh() hyperbolic tangent function, and a max-pooling operation is applied globally over all time steps to obtain the two task-specific features: g_e = maxpool(tanh(W_{ge} [h_{e,t} ; h_{s,t}])), g_r = maxpool(tanh(W_{gr} [h_{r,t} ; h_{s,t}])), where maxpool denotes the max-pooling operation;
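The global representation of step S24 can be sketched as below; the learned linear mapping is omitted (identity assumed) so the sketch only shows the concatenate–tanh–maxpool pattern:

```python
import math

def global_task_feature(task_feats, shared_feats):
    # task_feats / shared_feats: one feature vector per time step;
    # concatenate per step, squash with tanh, then max-pool over all steps
    projected = [[math.tanh(v) for v in tf + sf]
                 for tf, sf in zip(task_feats, shared_feats)]
    dim = len(projected[0])
    return [max(step[d] for step in projected) for d in range(dim)]
```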
s25, executing characteristic tasks: the method comprises the steps of respectively executing entity characteristic tasks and relation characteristic tasks, wherein the specific mode of executing the entity characteristic tasks is as follows:
s251, given an input sentence of length L, a table of size L × L is built; position (i, j) in the table represents the entity feature of the span starting at position i and ending at position j, expressed as the concatenation of the i-th entity feature, the j-th entity feature and the global entity representation, which is then processed by a linear transformation and the ELU activation function: h^{span}_{i,j} = ELU(W_s [h_{e,i} ; h_{e,j} ; g_e] + b_s);
the result then enters the output layer, which maps it linearly to the dimension of the number of entity categories; a sigmoid is applied to each dimension to judge whether the span belongs to that entity category: P(i, j, k) = sigmoid(W_o h^{span}_{i,j} + b_o)_k, where k indexes the entity types and the element e represents the probability that the word pair (w_i, w_j) marks the start and end positions of an entity of type k; for each word pair, h_i and h_j denote its word-level entity features;
s252, executing the relation feature task: given a sentence of length L, a table of size L × L is built; position (i, j) in the table relates the span whose first word is at position i to the span whose first word is at position j; the representation is analogous to the entity case: the i-th and j-th relation features and the global relation representation are concatenated, then passed through a linear transformation and the ELU activation function for multi-label classification: P(i, j, ℓ) = sigmoid(W'_o ELU(W'_r [h_{r,i} ; h_{r,j} ; g_r])) over the set R of relation labels, where for each relation ℓ the element r represents the probability of the words w_i and w_j acting as subject and object entities, T denotes the set of elements r, and h_{gr} denotes the global relation feature;
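The span-table scoring shared by steps S251 and S252 can be sketched as follows; score_fn is a placeholder for the linear–ELU–linear stack, and an independent sigmoid per type yields the multi-label decision:

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def span_scores(h, g, num_types, score_fn):
    # entry (i, j) scores the span from position i to position j; each of
    # the num_types output dimensions gets an independent sigmoid
    L = len(h)
    return [[[sigmoid(z) for z in score_fn(h[i], h[j], g)[:num_types]]
             for j in range(L)] for i in range(L)]
```

A toy score_fn such as `lambda hi, hj, g: [hi + hj + g]` suffices to exercise the table shape.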
S26, loss parameter analysis: two BCE losses are used; BCE is the classification loss for multi-label classification, so both tasks are treated as classification tasks — one loss for the NER task and another for the RE task: L_ner = BCE(P_ner, Y_ner), L_re = BCE(P_re, Y_re), L = L_ner + L_re;
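The BCE loss of step S26 can be sketched directly from its definition; eps is a numerical-stability constant added for the sketch:

```python
import math

def bce(probs, labels, eps=1e-12):
    # binary cross-entropy over independent labels, as used for both the
    # NER table and the RE table; the total loss is their sum
    n = len(probs)
    return -sum(y * math.log(p + eps) + (1 - y) * math.log(1 - p + eps)
                for p, y in zip(probs, labels)) / n
```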
s3, knowledge fusion: vocabularies are fused according to the calculated specific parameters; a global entity alignment method is chosen that combines multiple strategies to judge entity similarity and improve the knowledge fusion effect, and an entity matching algorithm based on Chinese similarity judgment assists the strategy decision;
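The Chinese-similarity entity matching that assists the alignment strategy can be sketched with a character-level sequence ratio; the threshold value is illustrative:

```python
from difflib import SequenceMatcher

def char_similarity(a, b):
    # character-level similarity; works on Chinese strings because
    # difflib compares arbitrary sequence elements
    return SequenceMatcher(None, a, b).ratio()

def align(entity, candidates, threshold=0.8):
    # pick the most similar candidate, or None if nothing is close enough
    best = max(candidates, key=lambda c: char_similarity(entity, c))
    return best if char_similarity(entity, best) >= threshold else None
```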
s4, knowledge application: the input data are mapped from the original space to another feature space through a nonlinear mapping transformation and the feature representation is learned, completing the knowledge application.
Compared with the prior art, the invention has the following beneficial effects: knowledge reasoning is realized with the ConvKE method; ConvKE adopts a dimension transformation strategy to increase the number of sliding steps of the convolution window over the triple matrix and the information interaction between entities and relations across more dimensions, and captures the global information of the triple in more dimensions by enlarging the receptive field of the 2-D convolutional sliding window;
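The dimension transformation strategy described above can be sketched as a reshape of the concatenated triple embeddings into a 2-D grid; the row count is a free parameter of the sketch, and the actual ConvKE layout may differ:

```python
def triple_to_2d(h, r, t, rows):
    # stack head / relation / tail embeddings and reshape into a 2-D grid
    # so that a 2-D convolution window can slide across both the embedding
    # dimension and the entity/relation dimension of the triple
    flat = h + r + t
    assert len(flat) % rows == 0, "embedding length must divide evenly"
    cols = len(flat) // rows
    return [flat[i * cols:(i + 1) * cols] for i in range(rows)]
```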
the characteristic parameters of the corresponding words are confirmed in turn through step-by-step analysis, the rare characters are optimized according to these specific parameters, and the attributes, entities and relations are thus better fused into and applied in the mainstream knowledge graph.
Drawings
FIG. 1 is a schematic diagram of a prior knowledge graph architecture;
FIG. 2 is a schematic flow chart of the method of the present invention.
Detailed Description
The technical solutions of the present invention will be clearly and completely described in connection with the embodiments, and it is obvious that the described embodiments are only some embodiments of the present invention, not all embodiments. All other embodiments, which can be made by those skilled in the art based on the embodiments of the invention without making any inventive effort, are intended to be within the scope of the invention.
Embodiment one: referring to fig. 2, the invention provides a method for constructing a traditional Chinese medicine ancient book knowledge graph based on deep learning, which comprises the following steps:
s1, processing unstructured multimodal traditional Chinese medicine domain data: extracting the text data in the ancient books of traditional Chinese medicine with a multimodal information extraction technique that combines OCR and NLP processing, converting the text data into semi-structured and structured data, and annotating the semi-structured data to obtain an entity-relation-entity data set;
s2, knowledge extraction: the grammar and the corresponding text content structure are trained jointly, and the entity and relation triples in the ancient books are extracted through the training process, thereby handling the rare characters and archaic grammar of the ancient books; the method specifically comprises the following steps:
s21, optimizing and embedding rare characters: the rare characters in the ancient book content are optimized with a traditional Chinese medicine rare-character optimization embedding model that simplifies them into corresponding common Chinese characters; word vectors are then embedded using an ancient-Chinese Bert, with the dictionary-optimized ancient sentences serving as the global feature description;
s22, partition filtering with a partition filter encoder: at each time step (i.e., each time state), the word features are split into three partitions: an entity partition, a relation partition and a shared partition, where the entity partition is related only to entity extraction, the relation partition only to relation extraction, and the shared partition to both tasks; features irrelevant to a specific task are then filtered out by combining partitions;
for example, by combining the entity partition and the shared partition, features related only to the relation partition can be filtered out;
the whole flow is thus divided into two parts: partitioning, which splits the features into the three partitions, and filtering, which merges partitions;
the method comprises the following steps:
s221, first calculate the candidate partition information: c̃_t = tanh(W_c [x_t ; h_{t-1}] + b_c), where x_t is the input feature and h_{t-1} is the hidden state at time t-1;
s222, calculate the relation gate and the entity gate: ẽ_t = cummax(W_e [x_t ; h_{t-1}] + b_e), r̃_t = cummax(W_r [x_t ; h_{t-1}] + b_r), where cummax(·) = cumsum(softmax(·));
s223, then generate three partitions from the two gates just calculated at each level; over the two levels (the history information c_{t-1} and the candidate information c̃_t) this gives 6 partitions: ρ_s = ẽ_t ∘ r̃_t (shared), ρ_e = ẽ_t ∘ (1 − r̃_t) (entity), ρ_r = (1 − ẽ_t) ∘ r̃_t (relation), where ∘ denotes the element-wise AND operation, (1 − ·) the NOT operation, c the history information and t-1 the corresponding time t-1;
s224, finally generate the information of the three partitions of time step t from the historical gates and historical information of time step t-1 and the candidate gates and candidate partition information of time step t: μ_x = ρ_x ∘ c_{t-1} + ρ̃_x ∘ c̃_t for x ∈ {e, r, s};
s23, performing the filtering operation: from the information of the three partitions generated in step S22, three memory features are generated through interaction, achieving the filtering effect: entity-related / relation-related / shared: e_t = μ_e + μ_s, r_t = μ_r + μ_s, s_t = μ_s;
then the three memory features are each passed through the tanh() hyperbolic tangent function to obtain the three corresponding hidden states, which are output directly at the current time step and serve as the entity-related / relation-related / shared features (NER feature / relation feature / shared feature) for the next stage: h_{e,t} = tanh(e_t), h_{r,t} = tanh(r_t), h_{s,t} = tanh(s_t);
finally, the history information and hidden state are updated: the three memory features are concatenated and linearly mapped to obtain the history information of time step t, which is passed through the tanh() hyperbolic tangent function to obtain the hidden state of time step t: c_t = W_h [e_t ; r_t ; s_t] + b_h, h_t = tanh(c_t);
s24, global representation: a global characterization is obtained for the two specific tasks: at each time step the entity feature and the relation feature are each concatenated with the shared feature, passed through a linear mapping and the tanh() hyperbolic tangent function, and a max-pooling operation is applied globally over all time steps to obtain the two task-specific features: g_e = maxpool(tanh(W_{ge} [h_{e,t} ; h_{s,t}])), g_r = maxpool(tanh(W_{gr} [h_{r,t} ; h_{s,t}])), where maxpool denotes the max-pooling operation;
s25, executing characteristic tasks: the method comprises the steps of respectively executing entity characteristic tasks and relation characteristic tasks, wherein the specific mode of executing the entity characteristic tasks is as follows:
s251, given an input sentence of length L, a table of size L × L is built; position (i, j) in the table represents the entity feature of the span starting at position i and ending at position j, expressed as the concatenation of the i-th entity feature, the j-th entity feature and the global entity representation, which is then processed by a linear transformation and the ELU activation function: h^{span}_{i,j} = ELU(W_s [h_{e,i} ; h_{e,j} ; g_e] + b_s);
the result then enters the output layer, which maps it linearly to the dimension of the number of entity categories; a sigmoid is applied to each dimension to judge whether the span belongs to that entity category (this multi-label classification scheme handles overlapping entities): P(i, j, k) = sigmoid(W_o h^{span}_{i,j} + b_o)_k, where k indexes the entity types and the element e represents the probability that the word pair (w_i, w_j) marks the start and end positions of an entity of type k; for each word pair, h_i and h_j denote its word-level entity features;
s252, executing the relation feature task: given a sentence of length L, a table of size L × L is built; position (i, j) in the table relates the span whose first word is at position i to the span whose first word is at position j; the representation is analogous to the entity case: the i-th and j-th relation features and the global relation representation are concatenated, then passed through a linear transformation and the ELU activation function for multi-label classification: P(i, j, ℓ) = sigmoid(W'_o ELU(W'_r [h_{r,i} ; h_{r,j} ; g_r])) over the set R of relation labels, where for each relation ℓ the element r represents the probability of the words w_i and w_j acting as subject and object entities, T denotes the set of elements r, and h_{gr} denotes the global relation feature;
S26, loss parameter analysis: two BCE losses are used, one for the NER task and another for the RE task: L_ner = BCE(P_ner, Y_ner), L_re = BCE(P_re, Y_re), L = L_ner + L_re;
s3, knowledge fusion is performed: according to the calculated l_ner and l_re, vocabularies with the same parameters are fused; a global entity alignment method is chosen that combines multiple strategies to judge entity similarity and improve the knowledge fusion effect, and an entity matching algorithm based on Chinese similarity judgment assists the strategy decision;
s4, knowledge application: deep learning is widely applied in the NLP field and achieves remarkable results; a deep neural network captures features by mapping the input data from the original space to another feature space through a nonlinear mapping transformation and learning the feature representation, which suits knowledge reasoning tasks; the invention realizes knowledge reasoning with the ConvKE method: ConvKE adopts a dimension transformation strategy to increase the number of sliding steps of the convolution window over the triple matrix and the information interaction between entities and relations across more dimensions, and captures the global information of the triple in more dimensions by enlarging the receptive field of the 2-D convolutional sliding window.
Embodiment two: based on the overall implementation of the first embodiment, step S2 of the present embodiment further includes step S27: setting knowledge extraction model parameters, wherein the specific model parameters are as follows:
Epoch: 120
Hidden size: 300
Batch size: 32
Embed mode: bert-base-chinese
Lr: 0.00002
Weight decay: 0
Seed: 0
Dropout: 0.1
Dropconnect: 0.1
Step: 50
Clip: 0.25
Max seq len: 150.
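The parameters of step S27 can be collected into a single configuration mapping; the key names are illustrative renderings of the labels above:

```python
# Training configuration of step S27 (key names are illustrative)
CONFIG = {
    "epoch": 120,
    "hidden_size": 300,
    "batch_size": 32,
    "embed_mode": "bert-base-chinese",
    "lr": 2e-5,            # 0.00002
    "weight_decay": 0,
    "seed": 0,
    "dropout": 0.1,
    "dropconnect": 0.1,
    "step": 50,
    "clip": 0.25,
    "max_seq_len": 150,
}
```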
Embodiment three: on the basis of the above two embodiments, this embodiment further includes processing Chinese natural language during the specific implementation; the specific processing manner is:
1) Acquiring corpus;
2) Preprocessing the corpus, including corpus cleaning, word segmentation, part-of-speech tagging, stop-word removal and other steps;
3) Characterization, i.e., vectorization, which represents the segmented characters and words as types (vectors) that a computer can compute with, helping to better express the similarity relationships among different words;
4) Model training, including traditional supervised, semi-supervised and unsupervised learning models, which can be selected according to the application requirements. Over-fitting and under-fitting may occur during model training: over-fitting is mainly addressed by adding regularization terms or increasing the amount of training data, while under-fitting is addressed by reducing regularization terms or adding further feature terms;
5) Performance evaluation.
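The characterization step 3) can be sketched with a bag-of-words vectorization and cosine similarity; the toy vocabulary is illustrative:

```python
import math
from collections import Counter

def bow_vector(tokens, vocab):
    # bag-of-words: count each vocabulary word's occurrences in the tokens
    counts = Counter(tokens)
    return [counts.get(w, 0) for w in vocab]

def cosine(u, v):
    # cosine similarity expresses the similarity relationship between words
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0
```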
Chinese information processing mainly deals with characters, words, paragraphs or chapters. The main methods are rule-based and statistics-based: the former processes text manually according to language-related rules, while the latter analyzes data through large-scale databases to realize natural language processing.
Natural language processing is strongly affected by data, and the growth of data accounts for the improved performance of most NLP applications (e.g., machine translation); with strong data support, text can be better understood and analyzed, which is why many NLP applications today adopt data-driven analysis methods.
Embodiment IV: this embodiment combines all of the implementations of the three embodiments described above.
The partial data in the formulas are all obtained by removing dimensions and taking the numerical values for calculation; each formula is the one closest to the real situation, obtained by simulating a large amount of collected data in software. The preset parameters and preset thresholds in the formulas are set by those skilled in the art according to the actual situation or are obtained through large-scale data simulation.
The above embodiments are only for illustrating the technical method of the present invention and not for limiting it; it should be understood by those skilled in the art that the technical method of the present invention may be modified or substituted without departing from its spirit and scope.
Claims (6)
1. A method for constructing a traditional Chinese medicine ancient book knowledge graph based on deep learning, characterized by comprising the following steps:
S1, processing unstructured multi-modal traditional Chinese medicine field data: extracting text data from traditional Chinese medicine ancient books by adopting multi-modal information extraction technology combining OCR and NLP technologies, converting it into semi-structured and structured data, and labeling the semi-structured data to obtain an entity-relation-entity dataset;
S2, performing knowledge extraction: jointly training on grammar and the corresponding text content structure, extracting entity and relation triples from the traditional Chinese medicine ancient books through the training process, and handling the rare characters and archaic grammar of the ancient texts;
S3, carrying out knowledge fusion: according to the calculated specific parameters, fusing the vocabularies and judging entity similarity with a global entity alignment method that comprehensively applies multiple strategies, the strategy judgment being assisted by an entity matching algorithm based on Chinese similarity, so as to improve the knowledge fusion effect;
S4, carrying out knowledge application: the input data is mapped from the original space to another feature space through a nonlinear mapping transformation and feature representations are learned, thereby completing knowledge application.
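The entity-relation-entity dataset produced in step S1 can be represented, in a minimal sketch, as triples. The field names and the example herb/symptom values below are illustrative assumptions, not the patent's actual schema:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Triple:
    head: str      # head entity, e.g. a herb
    relation: str  # relation type, e.g. "treats"
    tail: str      # tail entity, e.g. a symptom or formula

# A tiny labeled set of the kind S1 produces from semi-structured text.
dataset = [
    Triple("ginseng", "treats", "qi deficiency"),
    Triple("ginseng", "part_of", "Shengmai San"),
]

# The graph's node set is the union of head and tail entities.
entities = {t.head for t in dataset} | {t.tail for t in dataset}
```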
2. The method for constructing a traditional Chinese medicine ancient book knowledge graph based on deep learning according to claim 1, wherein in step S2 the specific steps of knowledge extraction are as follows:
S21, optimized embedding of rare characters: rare characters in the ancient book content are optimized with a traditional Chinese medicine rare-character optimized embedding model and simplified into their corresponding modern Chinese characters; word vectors are then embedded using ancient-text BERT, with the dictionary-optimized ancient-text sentences serving as the global feature description;
S22, carrying out partition filtering with a partition filter encoder: at each time step, the word features are split into three partitions: an entity partition, related only to entity extraction; a relation partition, related only to relation extraction; and a shared partition, related to both tasks; features unrelated to a specific task are then filtered out by merging the partitions.
3. The method for constructing a traditional Chinese medicine ancient book knowledge graph based on deep learning according to claim 2, wherein in step S22 the filtering is performed in the following specific manner:
S221, first calculating the candidate partition information from x_t, the input feature, and h_(t-1), the hidden state value corresponding to time t-1;
S222, calculating a relation threshold and an entity threshold, where cummax(·) = cumsum(softmax(·)) and cumsum(x1, x2, x3) = (x1, x1+x2, x1+x2+x3);
S223, then generating three partitions at each layer using the two thresholds just calculated, giving 6 partitions over the two layers, where ∘ denotes the AND operation and - denotes the NOT operation, c denotes history information, and the subscript t-1 denotes the corresponding time t-1;
S224, finally generating the information of the three partitions at time step t from the history threshold and history information of time step t-1, together with the candidate threshold and candidate partition information of time step t.
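Under the definitions given in S222 and S223 (cummax as the cumulative sum of a softmax, AND as an elementwise product, NOT of a gate g as 1 - g), the gate and partition computation can be sketched as follows. The gate input logits are illustrative values; in the real encoder they come from learned linear layers over x_t and h_(t-1):

```python
import math

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def cummax(xs):
    """cummax(x) = cumsum(softmax(x)): a monotone gate rising toward 1."""
    out, run = [], 0.0
    for p in softmax(xs):
        run += p
        out.append(run)
    return out

def partitions(entity_gate, relation_gate):
    """Split features into entity-only, relation-only, and shared partitions.
    Soft AND is the elementwise product; soft NOT of g is 1 - g."""
    entity_only = [e * (1 - r) for e, r in zip(entity_gate, relation_gate)]
    relation_only = [r * (1 - e) for e, r in zip(entity_gate, relation_gate)]
    shared = [e * r for e, r in zip(entity_gate, relation_gate)]
    return entity_only, relation_only, shared

e_gate = cummax([2.0, 0.5, -1.0])   # illustrative logits, not learned values
r_gate = cummax([-1.0, 0.5, 2.0])
ent, rel, sh = partitions(e_gate, r_gate)
```

Because cummax is monotone in (0, 1], each gate marks a soft boundary in the feature dimensions, and the three products carve those dimensions into the task-only and shared regions the claim describes.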
4. The deep learning-based traditional Chinese medicine ancient book knowledge graph construction method according to claim 2, further comprising:
S23, performing the filtering operation: from the information of the three partitions generated in step S22, three memory features are generated through interaction, thereby achieving the filtering effect: entity-related, relation-related, and shared-related;
then the three memory features are each passed through the tanh() hyperbolic tangent function to obtain three corresponding hidden states, which are output directly from the history information of the current time step and used as the entity-related/relation-related/shared-related features for the next stage;
finally, the history information and the hidden state are updated: the three memory features are spliced together and linearly mapped to obtain the history information of time step t, and this history information is passed through the tanh() hyperbolic tangent function to obtain the hidden state of time step t;
S24, carrying out global representation: the global characterizations of the two specific tasks, namely the entity-focused feature and the relation-focused feature, are obtained by splicing each time step's task feature with the shared focused feature, then applying a linear mapping, the tanh() hyperbolic tangent function, and a global max-pooling operation, where maxpool denotes the max-pooling operation.
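The global characterization in S24, splicing each step's task-specific feature with the shared feature, passing through tanh, and max-pooling over time steps, can be sketched as follows. For brevity the learned linear mapping is omitted; this is an assumption made purely for illustration:

```python
import math

def global_task_feature(task_feats, shared_feats):
    """For each time step, splice the task feature with the shared feature,
    apply tanh (the learned linear map is omitted here), then take an
    elementwise max over all time steps (max-pooling)."""
    spliced = [
        [math.tanh(v) for v in t + s]          # concatenate, then tanh
        for t, s in zip(task_feats, shared_feats)
    ]
    # max-pool across time steps, dimension by dimension
    return [max(col) for col in zip(*spliced)]

# three time steps, 2-d task features spliced with 2-d shared features
task = [[0.1, -0.2], [0.5, 0.0], [-0.3, 0.9]]
shared = [[0.0, 0.0], [0.2, -0.1], [0.4, 0.3]]
g = global_task_feature(task, shared)
```

Max-pooling makes the result length-independent: however many time steps the sentence has, each dimension of the global feature keeps only its strongest activation.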
5. The deep learning-based traditional Chinese medicine ancient book knowledge graph construction method according to claim 4, further comprising:
S25, executing the feature tasks: the entity feature task and the relation feature task are executed respectively; the entity feature task is executed in the following specific manner:
S251, for a group of input sentences of length L, a table of size L x L is set up; position (i, j) in the table represents the entity feature of the span starting at position i and ending at position j, formed by splicing the entity-focused features of positions i and j with the global representation of the entity-focused feature, and then processed by a linear transformation and an ELU activation function;
the result then enters the output layer, which linearly maps it to the dimension of the number of entity categories; a sigmoid function is applied to each dimension to judge whether the span represents an entity of that category, where k denotes each entity type and the element e denotes the probability that the word pair (w_i, w_j) marks the start and end positions of an entity of type k;
for each word pair (w_i, w_j), h_i and h_j represent their word-level entity features;
S252, executing the relation feature task: for a group of input sentences of length L, a table of size L x L is set up; position (i, j) in the table marks the pair of spans whose first words are at positions i and j. The representation is similar to that of the entity unit: the relation-focused features of positions i and j are spliced with the global representation of the relation-focused feature, then passed through a linear transformation and an ELU activation function to perform multi-label classification, where R denotes the set of relation labels; for each relation l, the element r denotes the probability that words w_i and w_j act as the subject and object entities, T denotes the set of elements r, and h_gr denotes the global feature used to obtain the span representation of entity i.
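The L x L table filling of S251 can be sketched as follows. The scoring function here is a deliberately simplified stand-in: a dot product followed by ELU and sigmoid replaces the learned linear layers over spliced h_i, h_j and global features described above:

```python
import math

def elu(x, alpha=1.0):
    """ELU activation: identity for x > 0, exponential decay below."""
    return x if x > 0 else alpha * (math.exp(x) - 1.0)

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def span_table(word_feats):
    """Fill an L x L table where entry (i, j) scores the span from word i
    to word j. A real model splices h_i, h_j and a global feature through
    learned layers; a dot product stands in for that transformation here."""
    L = len(word_feats)
    table = [[0.0] * L for _ in range(L)]
    for i in range(L):
        for j in range(i, L):  # spans end at or after their start
            score = sum(a * b for a, b in zip(word_feats[i], word_feats[j]))
            table[i][j] = sigmoid(elu(score))
    return table

feats = [[0.5, 1.0], [1.0, -0.5], [0.2, 0.3]]
table = span_table(feats)
```

Only the upper triangle is filled, matching the constraint that a span's end position j cannot precede its start position i; each cell holds a per-category probability after the sigmoid.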
6. The deep learning-based traditional Chinese medicine ancient book knowledge graph construction method according to claim 5, further comprising:
S26, carrying out loss parameter analysis: two BCE losses are used; BCE loss is a classification loss for multi-label classification and can complete the classification tasks. Both tasks are treated as classification tasks: one loss for the NER task, the other for the RE task.
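The two BCE losses of S26, one for the NER task and one for the RE task, can be sketched as below. The predicted-probability and label values are illustrative; in training they come from the sigmoid outputs of the two span tables:

```python
import math

def bce(p, y, eps=1e-12):
    """Binary cross-entropy for one label: y in {0, 1}, p in (0, 1)."""
    return -(y * math.log(p + eps) + (1 - y) * math.log(1 - p + eps))

def multilabel_bce(probs, labels):
    """Mean BCE over all labels of a multi-label classification task."""
    return sum(bce(p, y) for p, y in zip(probs, labels)) / len(probs)

# one loss per task; the joint training objective is their sum
loss_ner = multilabel_bce([0.9, 0.2], [1, 0])  # entity-table predictions
loss_re = multilabel_bce([0.1, 0.8], [0, 1])   # relation-table predictions
loss = loss_ner + loss_re
```

Summing the two losses is what makes the training joint: gradients from both the NER and RE heads flow back through the shared partition filter encoder.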
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202310988524.5A CN116701665A (en) | 2023-08-08 | 2023-08-08 | Deep learning-based traditional Chinese medicine ancient book knowledge graph construction method |
Publications (1)
Publication Number | Publication Date |
---|---|
CN116701665A true CN116701665A (en) | 2023-09-05 |
Family
ID=87843750
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202310988524.5A Withdrawn CN116701665A (en) | 2023-08-08 | 2023-08-08 | Deep learning-based traditional Chinese medicine ancient book knowledge graph construction method |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN116701665A (en) |
Citations (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
WO2021196520A1 (en) * | 2020-03-30 | 2021-10-07 | 西安交通大学 | Tax field-oriented knowledge map construction method and system |
US20220180065A1 (en) * | 2020-12-09 | 2022-06-09 | Beijing Wodong Tianjun Information Technology Co., Ltd. | System and method for knowledge graph construction using capsule neural network |
WO2022116417A1 (en) * | 2020-12-03 | 2022-06-09 | 平安科技(深圳)有限公司 | Triple information extraction method, apparatus, and device, and computer-readable storage medium |
CN115238040A (en) * | 2022-08-02 | 2022-10-25 | 北京科技大学 | Steel material science knowledge graph construction method and system |
CN115618005A (en) * | 2021-07-16 | 2023-01-17 | 中国传媒大学 | Traditional Tibetan medicine knowledge graph construction and completion method |
CN116127090A (en) * | 2022-12-28 | 2023-05-16 | 中国航空综合技术研究所 | Aviation system knowledge graph construction method based on fusion and semi-supervision information extraction |
Non-Patent Citations (2)
Title |
---|
宋伟; 张游杰: "Knowledge graph construction method based on environment information fusion", 计算机系统应用 (Computer Systems & Applications), no. 06 *
陈荟; 邓晖; 吴道婷: "Research on automatic construction of knowledge graphs for instructional design based on natural language processing", 中国教育信息化 (China Education Informatization), no. 07 *
Cited By (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN117236342A (en) * | 2023-09-28 | 2023-12-15 | 南京大经中医药信息技术有限公司 | Chinese medicine classics semantic analysis method and system combined with knowledge graph |
CN117236342B (en) * | 2023-09-28 | 2024-05-28 | 南京大经中医药信息技术有限公司 | Chinese medicine classics semantic analysis method and system combined with knowledge graph |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN112347268B (en) | Text-enhanced knowledge-graph combined representation learning method and device | |
Wang et al. | Feature extraction and analysis of natural language processing for deep learning English language | |
CN108182295B (en) | Enterprise knowledge graph attribute extraction method and system | |
CN110287481B (en) | Named entity corpus labeling training system | |
Chen et al. | Research on text sentiment analysis based on CNNs and SVM | |
Li et al. | Improving convolutional neural network for text classification by recursive data pruning | |
CN112560432A (en) | Text emotion analysis method based on graph attention network | |
CN111782769B (en) | Intelligent knowledge graph question-answering method based on relation prediction | |
CN113743119B (en) | Chinese named entity recognition module, method and device and electronic equipment | |
CN112417884A (en) | Sentence semantic relevance judging method based on knowledge enhancement and knowledge migration | |
CN115510245B (en) | Unstructured data-oriented domain knowledge extraction method | |
CN114077673A (en) | Knowledge graph construction method based on BTBC model | |
CN110852089A (en) | Operation and maintenance project management method based on intelligent word segmentation and deep learning | |
CN113221571A (en) | Entity relation joint extraction method based on entity correlation attention mechanism | |
CN111428481A (en) | Entity relation extraction method based on deep learning | |
CN111651973A (en) | Text matching method based on syntax perception | |
CN116701665A (en) | Deep learning-based traditional Chinese medicine ancient book knowledge graph construction method | |
CN114491036A (en) | Semi-supervised text classification method and system based on self-supervision and supervised joint training | |
CN116383352A (en) | Knowledge graph-based method for constructing field intelligent question-answering system by using zero samples | |
CN114048314A (en) | Natural language steganalysis method | |
CN114021584A (en) | Knowledge representation learning method based on graph convolution network and translation model | |
CN116522165B (en) | Public opinion text matching system and method based on twin structure | |
CN116680407A (en) | Knowledge graph construction method and device | |
CN114239575B (en) | Statement analysis model construction method, statement analysis method, device, medium and computing equipment | |
CN111708896B (en) | Entity relationship extraction method applied to biomedical literature |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
WW01 | Invention patent application withdrawn after publication | Application publication date: 20230905 |