CN116701665A - Deep learning-based traditional Chinese medicine ancient book knowledge graph construction method - Google Patents


Info

Publication number
CN116701665A
Authority
CN
China
Prior art keywords
entity
chinese medicine
traditional chinese
tasks
word
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Withdrawn
Application number
CN202310988524.5A
Other languages
Chinese (zh)
Inventor
许雯
王海洋
隋明爽
王海涛
李真真
王慎强
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Yantai Branch Institute Of Computing Technology Chinese Academy Of Science
Binzhou Medical College
Original Assignee
Yantai Branch Institute Of Computing Technology Chinese Academy Of Science
Binzhou Medical College
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Yantai Branch Institute Of Computing Technology Chinese Academy Of Science, Binzhou Medical College filed Critical Yantai Branch Institute Of Computing Technology Chinese Academy Of Science
Priority to CN202310988524.5A priority Critical patent/CN116701665A/en
Publication of CN116701665A publication Critical patent/CN116701665A/en
Withdrawn legal-status Critical Current

Classifications

    • G06F16/367 Ontology (creation of semantic tools for information retrieval of unstructured textual data)
    • G06F18/22 Matching criteria, e.g. proximity measures (pattern recognition)
    • G06F40/242 Dictionaries (lexical tools for natural language analysis)
    • G06F40/253 Grammatical analysis; style critique (natural language analysis)
    • G06N3/045 Combinations of networks (neural network architectures)
    • G06N3/047 Probabilistic or stochastic networks (neural network architectures)
    • G06N3/048 Activation functions (neural network architectures)
    • G06N3/08 Learning methods (neural networks)
    • G06N5/025 Extracting rules from data (knowledge engineering; knowledge acquisition)
    • G06N5/04 Inference or reasoning models (knowledge-based models)
    • G16H50/70 ICT specially adapted for mining of medical data, e.g. analysing previous cases of other patients
    • G16H70/00 ICT specially adapted for the handling or processing of medical references
    • Y02D10/00 Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Artificial Intelligence (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • General Health & Medical Sciences (AREA)
  • Evolutionary Computation (AREA)
  • Software Systems (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Molecular Biology (AREA)
  • Public Health (AREA)
  • Medical Informatics (AREA)
  • Databases & Information Systems (AREA)
  • Primary Health Care (AREA)
  • Epidemiology (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Evolutionary Biology (AREA)
  • Probability & Statistics with Applications (AREA)
  • Animal Behavior & Ethology (AREA)
  • Pathology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Machine Translation (AREA)

Abstract

The invention discloses a deep learning-based method for constructing a knowledge graph from ancient books of traditional Chinese medicine, relating to the technical field of knowledge graph construction. It addresses the technical problem that ancient-book text contains a large number of rare characters and uses grammar different from modern Chinese, so that mainstream knowledge graph construction methods cannot properly establish the attributes, entities and relations in the text. The method optimizes the rare characters according to specific feature parameters, so that the attributes, entities and relations can be better fused into a mainstream knowledge graph and applied.

Description

Deep learning-based traditional Chinese medicine ancient book knowledge graph construction method
Technical Field
The invention belongs to the technical field of knowledge graph construction, and particularly relates to a method for constructing a traditional Chinese medicine ancient book knowledge graph based on deep learning.
Background
The concept of the knowledge graph was proposed in 2012 to improve search engines. A knowledge graph is a typical multilateral relation graph consisting of nodes (entities) and edges (relations between entities); it is essentially a semantic network used to reveal the relationships between things. As shown in fig. 1, a knowledge graph extracts concepts, entities and relationships from multiple types of complex data, and is a computable model of the relationships between things. According to the coverage and the field of the knowledge, knowledge graphs can be divided into general-purpose knowledge graphs and domain knowledge graphs. With the continuous development of science and technology, knowledge graphs have been widely applied in the NLP field, for example in semantic search, intelligent question answering and decision support, and have become an important driving force for the development of artificial intelligence;
the knowledge graph architecture is divided into three parts: the first part is the acquisition of source data, namely, acquiring useful resource information in each type of data; the second part is knowledge fusion, which is used for associating knowledge of multiple data sources and expanding the knowledge range; the third part is knowledge calculation and knowledge application, the knowledge calculation is a main mode of the output of the knowledge graph capability, and the knowledge application combines the knowledge graph with a specific field or service, so that the service efficiency of each field is improved;
aiming at the ancient books of traditional Chinese medicine, the text content of the ancient books of traditional Chinese medicine has a large number of rare words, and the grammar is different from the modern Chinese grammar, so that the mainstream knowledge graph construction method cannot well establish the properties, entities, relations and the like in the ancient books, and the knowledge graph construction method of the ancient books of traditional Chinese medicine is provided.
Disclosure of Invention
The present invention aims to solve at least one of the technical problems existing in the prior art; therefore, the invention provides a deep learning-based traditional Chinese medicine ancient book knowledge graph construction method, which solves the technical problem that mainstream knowledge graph construction methods cannot properly establish attributes, entities and relations because the text contains a large number of rare characters and its grammar differs from modern Chinese grammar.
To achieve the above object, an embodiment according to a first aspect of the present invention provides a method for constructing a ancient book knowledge graph of traditional Chinese medicine based on deep learning, comprising the steps of:
s1, processing unstructured multi-mode traditional Chinese medicine field data: extracting text data in the ancient books of the traditional Chinese medicine by adopting a multi-mode information extraction technology and combining an OCR technology and an NLP processing technology, converting the text data into semi-structured data and structured data, and marking the semi-structured data to obtain an entity-relation-entity data set;
s2, knowledge extraction: the grammar and the corresponding text content structure are trained jointly, and the entity and relation triples in the ancient books are extracted through the training process, handling the rare characters and distinctive grammar of the ancient books of traditional Chinese medicine; the specific steps are as follows:
s21, optimizing and embedding rare characters: the rare characters in the ancient-book content are optimized with a traditional Chinese medicine rare-word optimized embedding model and simplified into their corresponding standard Chinese characters; word-vector embedding is then performed using an ancient-Chinese Bert, with the dictionary-optimized ancient sentences serving as global feature descriptions;
s22, partition filtering with a partition filter encoder: at each time step, the word features are split into three partitions: an entity partition, a relation partition and a shared partition, where the entity partition relates only to entity extraction, the relation partition relates only to relation extraction, and the shared partition relates to both tasks; features irrelevant to a specific task are then filtered out by merging the partitions; the specific method is as follows:
s221, first calculate the candidate partition information: p_t = tanh(W_p·[x_t ; h_(t−1)]), where x_t is the input feature at time t and h_(t−1) is the hidden state value corresponding to time t−1;
s222, calculate the relation gate and the entity gate: e_t = cummax(W_e·[x_t ; h_(t−1)]), r_t = cummax(W_r·[x_t ; h_(t−1)]), where cummax(·) = cumsum(softmax(·)) and cumsum(x1, x2, x3) = (x1, x1 + x2, x1 + x2 + x3);
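The cummax activation described in S222 can be checked numerically; this is a minimal numpy sketch (the encoder's weight matrices and input features are omitted) showing that the resulting gate values are non-decreasing and bounded by 1:

```python
import numpy as np

def cummax(x):
    """cummax(x) = cumsum(softmax(x)): a soft, monotonically
    non-decreasing gate with values in [0, 1]."""
    e = np.exp(x - np.max(x))   # numerically stable softmax
    p = e / e.sum()
    return np.cumsum(p)

g = cummax(np.array([0.5, 2.0, 1.0]))
```

Because the gate is non-decreasing, low indices tend toward 0 and high indices toward 1, which is what lets it carve a feature vector into contiguous partitions.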
s223, then use the two gates just calculated to generate three partitions at each layer, i.e. six partitions over the two layers: ρ_s = e ∘ r (shared), ρ_e = e ∘ (1 − r) (entity only), ρ_r = r ∘ (1 − e) (relation only), where ∘ denotes the elementwise AND operation and (1 − ·) the NOT operation; the gates are applied both to the candidate information p_t and to the history information c_(t−1) of time t−1;
s224, finally generate the information of the three partitions of time step t from the history gates and history information of time step t−1 together with the candidate gates and candidate partition information of time step t: ρ_τ(c_(t−1)) ∘ c_(t−1) and ρ_τ(p_t) ∘ p_t for each task τ ∈ {e, r, s};
s23, perform the filtering operation: from the information of the three partitions generated in step S22, three memory features are generated through interaction, achieving the filtering effect: μ_τ,t = ρ_τ(c_(t−1)) ∘ c_(t−1) + ρ_τ(p_t) ∘ p_t for τ ∈ {e, r, s}, i.e. entity-related, relation-related and shared memories;
then, the three memory features are each passed through a tanh() hyperbolic tangent function to obtain three corresponding hidden states, h_τ,t = tanh(μ_τ,t) for τ ∈ {e, r, s}, which are output directly from the history information of the current time step and serve as the entity-related, relation-related and shared features for the next stage;
finally, the history information and hidden state are updated: the three memory features are concatenated and linearly mapped to obtain the history information of time step t, c_t = W·[μ_e,t ; μ_r,t ; μ_s,t], and the hidden state of time step t is h_t = tanh(c_t);
s24, carry out the global representation: global characterizations of the two specific tasks are obtained by concatenating the entity (or relation) feature of each time step with the shared feature, applying a linear mapping and a tanh() hyperbolic tangent function, and using a max-pooling operation globally: g_e = maxpool_t(tanh(W_ge·[h_e,t ; h_s,t])), g_r = maxpool_t(tanh(W_gr·[h_r,t ; h_s,t])), where maxpool denotes the max-pooling operation;
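The global characterization of S24 can be sketched in numpy: per-step entity and shared hidden states are concatenated, linearly mapped, passed through tanh, and max-pooled over time. All shapes and weights below are illustrative toy values, not the patent's actual parameters:

```python
import numpy as np

rng = np.random.default_rng(0)
T, d = 5, 4                              # toy sequence length and feature size
h_ent = rng.standard_normal((T, d))      # per-step entity-partition hidden states
h_shr = rng.standard_normal((T, d))      # per-step shared-partition hidden states
W_g = rng.standard_normal((d, 2 * d))    # illustrative projection weights

# per step: tanh(W_g [h_ent,t ; h_shr,t]), then max-pool over the T steps
per_step = np.tanh(np.concatenate([h_ent, h_shr], axis=1) @ W_g.T)  # (T, d)
g_ent = per_step.max(axis=0)             # global entity-task feature
```

The relation-task global feature would be formed the same way from the relation and shared hidden states.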
s25, executing characteristic tasks: the method comprises the steps of respectively executing entity characteristic tasks and relation characteristic tasks, wherein the specific mode of executing the entity characteristic tasks is as follows:
s251, for a group of sentences of length L, build a table of size L × L; the (i, j) position in the table represents the entity feature of the span starting at the i-th position and ending at the j-th position, formed as the concatenation of the i-th entity feature, the j-th entity feature and the global entity representation, which is then processed by a linear transformation and an ELU activation function: s_(i,j) = ELU(W_s·[h_e,i ; h_e,j ; g_e]);
then the output layer is entered: a linear mapping projects s_(i,j) to the dimension of the number of entity categories, and a sigmoid function is applied to each dimension to judge whether the span is an entity of that category: P(i, j, k) = sigmoid(W_o·s_(i,j))_k, where k indexes each type and the element represents the probability of the word pair (w_i, w_j) being the start and end positions of an entity of type k;
for each word pair (w_i, w_j), h_i and h_j denote its word-level entity features;
s252, execute the relation feature task: for a group of sentences of length L, build a table of size L × L; the (i, j) position in the table marks the pair of spans whose first words are at positions i and j; the representation is analogous to the entity case: the i-th and j-th relation features are concatenated with the global relation feature h_gr, then passed through a linear transformation and an ELU activation function for multi-label classification: P(i, j, l) = sigmoid(W_o'·ELU(W_r'·[h_r,i ; h_r,j ; h_gr]))_l, where R denotes the set of relation labels and, for each relation l, the element represents the probability of the words w_i and w_j being the first words of the subject and object entities respectively;
S26, carry out loss parameter analysis: both tasks are treated as multi-label classification tasks and each is trained with a BCE (binary cross-entropy) classification loss, one (l_ner) for the NER task and the other (l_re) for the RE task; the overall objective is their sum, l = l_ner + l_re;
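The two BCE losses of S26 reduce to standard binary cross-entropy over independent labels; a minimal numpy sketch with made-up probabilities (the actual model outputs are the span tables of S25):

```python
import numpy as np

def bce(p, y, eps=1e-9):
    """Mean binary cross-entropy for multi-label classification."""
    p = np.clip(p, eps, 1.0 - eps)
    return float(-(y * np.log(p) + (1 - y) * np.log(1 - p)).mean())

# illustrative predictions/targets for the NER and RE heads
l_ner = bce(np.array([0.9, 0.2, 0.7]), np.array([1.0, 0.0, 1.0]))
l_re = bce(np.array([0.1, 0.8]), np.array([0.0, 1.0]))
loss = l_ner + l_re   # joint training objective
```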
s3, carry out knowledge fusion: according to the calculated specific parameters, the vocabularies are fused; a global entity alignment method is selected that comprehensively uses multiple strategies to judge entity similarity, improving the knowledge fusion effect, and an entity matching algorithm based on Chinese-text similarity judgment assists the strategy judgment;
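One simple way to realize the Chinese-similarity entity matching that assists the alignment strategies is a character-level similarity ratio; this sketch uses Python's difflib and a hypothetical 0.6 threshold (the patent does not specify the matching algorithm or threshold):

```python
from difflib import SequenceMatcher

def char_similarity(a: str, b: str) -> float:
    """Character-level similarity between two Chinese entity mentions."""
    return SequenceMatcher(None, a, b).ratio()

def align(mention: str, candidates, threshold: float = 0.6):
    """Pick the most similar candidate entity, or None below the threshold."""
    best = max(candidates, key=lambda c: char_similarity(mention, c))
    return best if char_similarity(mention, best) >= threshold else None
```

For example, align("甘草", ["人参", "甘草"]) returns "甘草", while a mention with no close candidate returns None.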
s4, knowledge application is carried out: the input data is mapped from the original space to another feature space through nonlinear mapping transformation and the feature representation is learned, so that knowledge application is completed.
Compared with the prior art, the invention has the beneficial effects that: knowledge reasoning is achieved with the ConvKE method; ConvKE adopts a dimension transformation strategy to increase the number of sliding steps of the convolution window over the triple matrix and the information interaction between the entities and relations of the triple across more dimensions, and captures the overall information of the triple in more dimensions by enlarging the receptive field of the 2-D convolution sliding window;
and the feature parameters of the corresponding words are confirmed in turn through step-by-step analysis, the rare characters are optimized according to these specific feature parameters, and the attributes, entities and relations are better fused into a mainstream knowledge graph and applied.
Drawings
FIG. 1 is a schematic diagram of a prior knowledge graph architecture;
FIG. 2 is a schematic flow chart of the method of the present invention.
Detailed Description
The technical solutions of the present invention will be clearly and completely described in connection with the embodiments, and it is obvious that the described embodiments are only some embodiments of the present invention, not all embodiments. All other embodiments, which can be made by those skilled in the art based on the embodiments of the invention without making any inventive effort, are intended to be within the scope of the invention.
Embodiment one: referring to fig. 2, the invention provides a method for constructing a traditional Chinese medicine ancient book knowledge graph based on deep learning, which comprises the following steps:
s1, processing unstructured multi-mode traditional Chinese medicine field data: extracting text data in the ancient books of the traditional Chinese medicine by adopting a multi-mode information extraction technology and combining an OCR technology and an NLP processing technology, converting the text data into semi-structured data and structured data, and marking the semi-structured data to obtain an entity-relation-entity data set;
s2, knowledge extraction: the grammar and the corresponding text content structure are trained jointly, and the entity and relation triples in the ancient books are extracted through the training process, handling the rare characters and distinctive grammar of the ancient books of traditional Chinese medicine; the method specifically comprises the following steps:
s21, optimizing and embedding rare characters: the rare characters in the ancient-book content are optimized with a traditional Chinese medicine rare-word optimized embedding model and simplified into their corresponding standard Chinese characters; word-vector embedding is then performed using an ancient-Chinese Bert, with the dictionary-optimized ancient sentences serving as global feature descriptions;
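The dictionary step of S21 amounts to mapping rare or variant characters to standard forms before embedding; a toy sketch with a three-entry hypothetical mapping (a real system would use a curated TCM rare-word dictionary, as the method describes):

```python
# Hypothetical variant-to-standard character map (illustrative entries only)
VARIANT_MAP = {"藥": "药", "氣": "气", "癥": "症"}

def normalize(text: str) -> str:
    """Replace mapped rare/variant characters, leave everything else unchanged."""
    return "".join(VARIANT_MAP.get(ch, ch) for ch in text)
```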
s22, partition filtering with a partition filter encoder: at each time step (i.e., each time state), the word features are split into three partitions: an entity partition, a relation partition and a shared partition, where the entity partitions relate only to entity extraction, the relation partitions relate only to relation extraction, and the shared partitions relate to both tasks; features irrelevant to a specific task are then filtered out by merging the partitions;
examples: combining the entity partition and the shared partition, the features only related to the relation partition can be filtered out;
the whole flow is divided into two parts: partitioning, which splits the features into three partitions, and filtering, which merges the partitions;
the method comprises the following steps:
s221, first calculate the candidate partition information: p_t = tanh(W_p·[x_t ; h_(t−1)]), where x_t is the input feature at time t and h_(t−1) is the hidden state value corresponding to time t−1;
s222, calculate the relation gate and the entity gate: e_t = cummax(W_e·[x_t ; h_(t−1)]), r_t = cummax(W_r·[x_t ; h_(t−1)]), where cummax(·) = cumsum(softmax(·));
s223, then use the two gates just calculated to generate three partitions at each layer, i.e. six partitions over the two layers: ρ_s = e ∘ r (shared), ρ_e = e ∘ (1 − r) (entity only), ρ_r = r ∘ (1 − e) (relation only), where ∘ denotes the elementwise AND operation and (1 − ·) the NOT operation; the gates are applied both to the candidate information p_t and to the history information c_(t−1) of time t−1;
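The partition step can be sketched in numpy: with soft gates e and r in [0, 1], the elementwise product plays the role of AND and (1 − gate) the role of NOT. The gate values below are made up; in the model they come from cummax over linear projections:

```python
import numpy as np

def partitions(gate_e, gate_r):
    """Split a feature vector's mass into entity-only, relation-only
    and shared parts from two soft gates in [0, 1]."""
    shared = gate_e * gate_r        # relevant to both tasks (AND)
    ent_only = gate_e - shared      # entity gate AND NOT relation gate
    rel_only = gate_r - shared      # relation gate AND NOT entity gate
    return ent_only, rel_only, shared

e = np.array([0.9, 0.1, 0.5])
r = np.array([0.2, 0.8, 0.5])
rho_e, rho_r, rho_s = partitions(e, r)
# applied to both the candidate cell and the history cell -> six partitions
```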
s224, finally generate the information of the three partitions of time step t from the history gates and history information of time step t−1 together with the candidate gates and candidate partition information of time step t: ρ_τ(c_(t−1)) ∘ c_(t−1) and ρ_τ(p_t) ∘ p_t for each task τ ∈ {e, r, s};
s23, perform the filtering operation: from the information of the three partitions generated in step S22, three memory features are generated through interaction, achieving the filtering effect: μ_τ,t = ρ_τ(c_(t−1)) ∘ c_(t−1) + ρ_τ(p_t) ∘ p_t for τ ∈ {e, r, s}, i.e. entity-related, relation-related and shared memories;
then, the three memory features are each passed through a tanh() hyperbolic tangent function to obtain three corresponding hidden states, h_τ,t = tanh(μ_τ,t) for τ ∈ {e, r, s}, which are output directly from the history information of the current time step and serve as the entity-related, relation-related and shared features (NER concentration feature / relation concentration feature / shared concentration feature) for the next stage;
finally, the history information and hidden state are updated: the three memory features are concatenated and linearly mapped to obtain the history information of time step t, c_t = W·[μ_e,t ; μ_r,t ; μ_s,t], and the hidden state of time step t is h_t = tanh(c_t);
s24, carry out the global representation: global characterizations of the two specific tasks are obtained by concatenating the entity (or relation) feature of each time step with the shared feature, applying a linear mapping and a tanh() hyperbolic tangent function, and using a max-pooling operation globally: g_e = maxpool_t(tanh(W_ge·[h_e,t ; h_s,t])), g_r = maxpool_t(tanh(W_gr·[h_r,t ; h_s,t])), where maxpool denotes the max-pooling operation;
s25, executing characteristic tasks: the method comprises the steps of respectively executing entity characteristic tasks and relation characteristic tasks, wherein the specific mode of executing the entity characteristic tasks is as follows:
s251, for an input group of sentences of length L, build a table of size L × L; the (i, j) position in the table represents the entity feature of the span starting at the i-th position and ending at the j-th position, formed as the concatenation of the i-th entity feature, the j-th entity feature and the global entity representation, which is then processed by a linear transformation and an ELU activation function: s_(i,j) = ELU(W_s·[h_e,i ; h_e,j ; g_e]);
then the output layer is entered: a linear mapping projects s_(i,j) to the dimension of the number of entity categories, and a sigmoid function is applied to each dimension to judge whether the span is an entity of that category (multi-label classification is adopted to solve the overlap problem): P(i, j, k) = sigmoid(W_o·s_(i,j))_k, where k indexes each type and the element represents the probability of the word pair (w_i, w_j) being the start and end positions of an entity of type k; for each word pair (w_i, w_j), h_i and h_j denote its word-level entity features;
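A toy numpy sketch of the L × L entity span table: each upper-triangular cell concatenates the i-th and j-th entity features with the global entity feature and maps to per-type probabilities with a sigmoid. The intermediate ELU layer is folded into a single linear map here for brevity, and all shapes and weights are illustrative:

```python
import numpy as np

rng = np.random.default_rng(1)
L, d, K = 4, 3, 2                      # toy: sentence length, feature dim, entity types
h_e = rng.standard_normal((L, d))      # per-word entity features
g_e = rng.standard_normal(d)           # global entity feature
W_o = rng.standard_normal((K, 3 * d))  # illustrative output weights

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# score each span (i, j) with i <= j as K independent type probabilities
table = np.zeros((L, L, K))
for i in range(L):
    for j in range(i, L):
        feat = np.concatenate([h_e[i], h_e[j], g_e])
        table[i, j] = sigmoid(W_o @ feat)
```

Per-dimension sigmoids make the prediction multi-label, so overlapping entity types for the same span remain possible.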
s252, execute the relation feature task: for a group of sentences of length L, build a table of size L × L; the (i, j) position in the table marks the pair of spans whose first words are at positions i and j; the representation is analogous to the entity case: the i-th and j-th relation features are concatenated with the global relation feature h_gr, then passed through a linear transformation and an ELU activation function for multi-label classification: P(i, j, l) = sigmoid(W_o'·ELU(W_r'·[h_r,i ; h_r,j ; h_gr]))_l, where R denotes the set of relation labels and, for each relation l, the element represents the probability of the words w_i and w_j being the first words of the subject and object entities respectively;
S26, carry out loss parameter analysis: two BCE losses are used, one (l_ner) for the NER task and the other (l_re) for the RE task; the overall objective is their sum, l = l_ner + l_re;
s3, carry out knowledge fusion: according to the calculated l_ner and l_re, vocabularies with the same parameters are fused; a global entity alignment method is selected that comprehensively uses multiple strategies to judge entity similarity, improving the knowledge fusion effect, and an entity matching algorithm based on Chinese-text similarity judgment assists the strategy judgment;
s4, knowledge application: deep learning is widely applied in the NLP field and achieves remarkable results; a deep neural network captures features by mapping the input data from the original space to another feature space through a nonlinear mapping transformation and learning a feature representation, which suits knowledge reasoning tasks; the invention uses the ConvKE method to realize knowledge reasoning: ConvKE adopts a dimension transformation strategy to increase the number of sliding steps of the convolution window over the triple matrix and the information interaction between the entities and relations of the triple across more dimensions, and captures the overall information of the triple in more dimensions by enlarging the receptive field of the 2-D convolution sliding window.
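The idea of sliding a 2-D convolution window over a reshaped triple matrix can be hinted at with a minimal numpy sketch. This is an illustrative toy (hand-rolled valid convolution, random embeddings and kernel), not the patent's actual ConvKE architecture, whose dimension-transformation details are not given here:

```python
import numpy as np

def conv2d_valid(x, k):
    """Minimal 2-D 'valid' convolution (cross-correlation), no libraries."""
    H, W = x.shape
    kh, kw = k.shape
    out = np.zeros((H - kh + 1, W - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(x[i:i + kh, j:j + kw] * k)
    return out

rng = np.random.default_rng(2)
d = 6
h, r, t = rng.standard_normal((3, d))   # toy head/relation/tail embeddings
# dimension transformation: reshape the 3 x d triple matrix so that one
# window mixes head, relation and tail values across more dimensions
m = np.stack([h, r, t]).reshape(2, 9)
score = float(conv2d_valid(m, rng.standard_normal((2, 2))).sum())
```

Reshaping before convolving is what gives the sliding window more positions and lets it combine components of all three embeddings, which is the intuition behind the dimension transformation strategy.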
Embodiment two: based on the overall implementation of the first embodiment, step S2 of the present embodiment further includes step S27: setting knowledge extraction model parameters, wherein the specific model parameters are as follows:
Epoch: 120
Hidden size: 300
Batch size: 32
Embed mode: Bert-base-Chinese
Lr: 0.00002
Weight decay: 0
Seed: 0
Dropout: 0.1
Dropconnect: 0.1
Step: 50
Clip: 0.25
Max seq len: 150.
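Collected as a Python config dict for readability (key names normalized to snake_case; "bert-base-chinese" is the assumed reading of "BertbasedChinese"):

```python
# Knowledge-extraction model hyperparameters from this embodiment
CONFIG = {
    "epoch": 120,
    "hidden_size": 300,
    "batch_size": 32,
    "embed_mode": "bert-base-chinese",  # assumed normalization of "BertbasedChinese"
    "lr": 2e-5,
    "weight_decay": 0,
    "seed": 0,
    "dropout": 0.1,
    "dropconnect": 0.1,
    "step": 50,
    "clip": 0.25,
    "max_seq_len": 150,
}
```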
embodiment III: based on the two groups of embodiments, the embodiment further includes processing the chinese natural language in a specific implementation process, where a specific processing manner is:
1) Acquiring corpus;
2) Preprocessing corpus, wherein the preprocessing comprises corpus cleaning, word segmentation, part-of-speech tagging, word deactivation removing and other steps;
3) Characterization, i.e. vectorization: the words obtained after segmentation are represented in a form a computer can calculate with (vectors), which helps express the similarity relationships among different words;
4) Model training, including traditional supervised, semi-supervised and unsupervised learning models, which can be selected according to the application requirements; over-fitting and under-fitting may occur during model training: over-fitting is mainly addressed by adding regularization terms and increasing the amount of training data, while under-fitting is addressed by reducing regularization terms and adding further feature terms;
5) Performance evaluation.
The Chinese information processing is mainly to process characters, words, paragraphs or chapters. The main methods are respectively a rule-based method and a statistics-based method, wherein the former method is to manually process the text according to language-related rules; the latter is to analyze the data through a large-scale database, thereby realizing the processing of natural language.
Natural language processing is strongly affected by data: the growth of data is responsible for the improved performance of most NLP applications (e.g., machine translation), since with strong data support text can be better understood and analyzed; this is why many NLP applications today adopt data-driven analysis methods.
Embodiment four: this embodiment is embodied in all of the implementations including the three embodiments described above.
The partial data in the formulas are obtained by removing dimensions and taking their numerical values for calculation; the formulas are those closest to the real situation, obtained by software simulation of a large amount of collected data; the preset parameters and preset thresholds in the formulas are set by those skilled in the art according to actual conditions or obtained through large-scale data simulation.
The above embodiments are only for illustrating the technical method of the present invention and not for limiting the same, and it should be understood by those skilled in the art that the technical method of the present invention may be modified or substituted without departing from the spirit and scope of the technical method of the present invention.

Claims (6)

1. A deep-learning-based method for constructing a knowledge graph of traditional Chinese medicine ancient books, characterized by comprising the following steps:
S1, processing unstructured multi-modal traditional Chinese medicine domain data: extracting the text data in the traditional Chinese medicine ancient books with a multi-modal information extraction technique that combines OCR and NLP processing, converting it into semi-structured and structured data, and annotating the semi-structured data to obtain an entity-relation-entity data set;
S2, knowledge extraction: jointly training on the grammar and the corresponding text content structure, extracting entity and relation triples from the traditional Chinese medicine ancient books through the training process, and handling the rare characters and classical grammar of the ancient books;
S3, knowledge fusion: fusing the vocabulary according to the calculated parameters, and adopting a global entity alignment method that combines multiple strategies to judge entity similarity, thereby improving the knowledge fusion effect, wherein the strategy judgment is assisted by an entity matching algorithm based on Chinese similarity;
S4, knowledge application: mapping the input data from the original space to another feature space through a nonlinear mapping transformation and learning the feature representation, thereby completing the knowledge application.
2. The deep-learning-based method for constructing a knowledge graph of traditional Chinese medicine ancient books according to claim 1, wherein in step S2 the knowledge extraction comprises the following steps:
S21, optimizing the embedding of rare characters: optimizing the rare characters in the ancient book contents with a traditional Chinese medicine rare-character optimized embedding model, simplifying them into corresponding common Chinese characters, and performing word-vector embedding with classical-Chinese BERT, taking the dictionary-optimized classical sentences as global feature descriptions;
S22, partition filtering with a partition filter encoder: at each time step, the word features are split into three partitions: an entity partition, related only to entity extraction; a relation partition, related only to relation extraction; and a shared partition, related to both tasks; features irrelevant to a specific task are then filtered out by merging the partitions.
3. The deep-learning-based method for constructing a knowledge graph of traditional Chinese medicine ancient books according to claim 2, wherein in step S22 the filtering is performed as follows:
S221, first calculating the candidate partition information c̃_t = tanh(W_c·x_t + U_c·h_(t-1) + b_c), wherein x_t is the input feature and h_(t-1) is the hidden state corresponding to time t-1;
S222, calculating the relation threshold and the entity threshold: ẽ_t = cummax(W_e·[x_t; h_(t-1)]), r̃_t = 1 - cummax(W_r·[x_t; h_(t-1)]), wherein cummax(·) = cumsum(softmax(·)) and cumsum(x1, x2, x3) = (x1, x1+x2, x1+x2+x3);
S223, then, generating three partitions from the two thresholds just calculated at each of the two layers, giving 6 partitions in total: the shared partition ρ_s = ẽ ∘ r̃, the entity partition ρ_e = ẽ - ρ_s and the relation partition ρ_r = r̃ - ρ_s, wherein ∘ denotes the element-wise AND operation, - denotes the NOT operation, c denotes the history information and t-1 denotes the corresponding time t-1;
S224, finally generating the information of the three partitions at time step t from the history threshold and history information of time step t-1 together with the candidate threshold and candidate partition information of time step t: μ_e = ρ_(e,t-1) ∘ c_(t-1) + ρ_(e,t) ∘ c̃_t, and likewise for μ_r and μ_s.
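The cummax operation referenced in S222 is defined in the claim as the cumulative sum of a softmax; a minimal NumPy sketch of this gate (the function and toy inputs are illustrative assumptions, not the patent's code):

```python
import numpy as np

def cummax(logits):
    """cummax as defined in the claim: cumsum(softmax(logits)).

    The output rises monotonically from near 0 to 1, so it acts as a
    soft, ordered gate for carving features into partitions.
    """
    p = np.exp(logits - logits.max())  # numerically stable softmax
    p /= p.sum()
    return np.cumsum(p)

e_gate = cummax(np.array([2.0, 0.5, -1.0]))        # entity threshold
r_gate = 1.0 - cummax(np.array([-1.0, 0.5, 2.0]))  # relation threshold
shared = e_gate * r_gate  # overlap of the two gates: the shared partition

assert np.all(np.diff(e_gate) >= 0)  # monotonically non-decreasing
assert np.isclose(e_gate[-1], 1.0)   # the cumulative sum ends at 1
assert np.all((shared >= 0) & (shared <= 1))
```

Because one gate rises while the other falls, their element-wise product is large only where both are active, which is exactly the region assigned to the shared partition.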
4. The deep-learning-based method for constructing a knowledge graph of traditional Chinese medicine ancient books according to claim 2, further comprising:
S23, performing the filtering operation: from the three partitions generated in step S22, three memory features are generated through interaction, which achieves the filtering effect: the entity memory μ_e, the relation memory μ_r and the shared memory μ_s;
then the three memory features are each passed through the tanh() hyperbolic tangent function to obtain three corresponding hidden states, which are output directly from the history information of the current time step and serve as the entity-related, relation-related and shared features for the next stage: h_(e,t) = tanh(μ_e), h_(r,t) = tanh(μ_r), h_(s,t) = tanh(μ_s);
finally, the history information and the hidden state are updated: the three memory features are concatenated and linearly mapped to obtain the history information c_t of time step t, and the hidden state h_t of time step t is obtained by applying the tanh() hyperbolic tangent function to c_t;
S24, global representation: the global representations of the two specific tasks are obtained by concatenating, at each time step, the entity-focused feature and the relation-focused feature with the shared feature, then applying a linear mapping, the tanh() hyperbolic tangent function and a global max-pooling operation to obtain the two task-specific features, wherein maxpool denotes the max-pooling operation.
5. The deep-learning-based method for constructing a knowledge graph of traditional Chinese medicine ancient books according to claim 4, further comprising:
S25, executing the feature tasks: the entity feature task and the relation feature task are executed separately, the entity feature task being executed as follows:
S251, for an input sentence of length L, a table of size L × L is set up, in which position (i, j) represents the entity feature of the span starting at the i-th position and ending at the j-th position; this representation concatenates the entity-focused features of the i-th and j-th positions with the global entity representation, and is then processed by a linear transformation and an ELU activation function;
the output layer is then entered: a linear mapping projects the representation to the dimension of the number of entity categories, and a sigmoid function is applied to each dimension to judge the entity category; for each entity type k, the element e represents the probability that the word pair (w_i, w_j) marks the start and end positions of an entity of type k, wherein h_i and h_j denote the word-level entity features of the pair;
S252, executing the relation feature task: for a sentence of length L, a table of size L × L is set up, in which position (i, j) marks the pair of spans whose first words are the i-th and j-th positions; the representation is built analogously to the entity case by concatenating the relation-focused features of the i-th and j-th positions with the global relation feature h_gr, and then passed through a linear transformation and an ELU activation function for multi-label classification, wherein R denotes the set of relation labels; for each relation l, the element r represents the probability that the words w_i and w_j act as the subject and object entities, and T denotes the set of elements r.
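The table-filling scheme of S251 and S252 can be sketched as follows; the toy dimensions, random weights and the linear/ELU/linear/sigmoid head are assumptions for illustration only, not the patent's exact parameterization:

```python
import numpy as np

rng = np.random.default_rng(1)
L, d, n_types = 4, 8, 3  # sentence length, feature width, entity types (toy)

h = rng.normal(size=(L, d))               # per-position entity-focused features
W1 = rng.normal(size=(2 * d, d)) * 0.1    # first linear layer (before ELU)
W2 = rng.normal(size=(d, n_types)) * 0.1  # output layer to type dimensions

def elu(x):
    return np.where(x > 0, x, np.exp(np.minimum(x, 0.0)) - 1.0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# table[i, j, k]: probability that the span starting at position i and
# ending at position j is an entity of type k
table = np.zeros((L, L, n_types))
for i in range(L):
    for j in range(L):
        pair = np.concatenate([h[i], h[j]])  # concatenate h_i and h_j
        table[i, j] = sigmoid(elu(pair @ W1) @ W2)

assert table.shape == (L, L, n_types)
assert np.all((table > 0) & (table < 1))  # sigmoid keeps each score in (0, 1)
```

The relation table of S252 would be filled the same way, only with relation-focused features and relation labels in place of entity types.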
6. The deep-learning-based method for constructing a knowledge graph of traditional Chinese medicine ancient books according to claim 5, further comprising:
S26, performing loss analysis: the two BCE losses are multi-label classification losses that complete the classification tasks; both tasks are regarded as classification tasks, with one BCE loss for the NER task and the other for the RE task.
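The joint objective of S26 (one BCE term per task, summed) can be sketched as follows; the probabilities and labels are toy values assumed for illustration:

```python
import numpy as np

def bce(p, y, eps=1e-9):
    """Binary cross-entropy averaged over all label dimensions."""
    p = np.clip(p, eps, 1.0 - eps)
    return float(-np.mean(y * np.log(p) + (1.0 - y) * np.log(1.0 - p)))

# toy predicted probabilities and gold multi-labels for the two tables
p_ner = np.array([0.9, 0.1, 0.8])
y_ner = np.array([1.0, 0.0, 1.0])
p_re = np.array([0.2, 0.7])
y_re = np.array([0.0, 1.0])

# joint objective: one BCE term for the NER task, one for the RE task
loss = bce(p_ner, y_ner) + bce(p_re, y_re)
assert loss > 0.0
```

Clipping the probabilities away from 0 and 1 keeps the logarithms finite, which is the standard guard when computing BCE on sigmoid outputs.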
CN202310988524.5A 2023-08-08 2023-08-08 Deep learning-based traditional Chinese medicine ancient book knowledge graph construction method Withdrawn CN116701665A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310988524.5A CN116701665A (en) 2023-08-08 2023-08-08 Deep learning-based traditional Chinese medicine ancient book knowledge graph construction method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202310988524.5A CN116701665A (en) 2023-08-08 2023-08-08 Deep learning-based traditional Chinese medicine ancient book knowledge graph construction method

Publications (1)

Publication Number Publication Date
CN116701665A true CN116701665A (en) 2023-09-05

Family

ID=87843750

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310988524.5A Withdrawn CN116701665A (en) 2023-08-08 2023-08-08 Deep learning-based traditional Chinese medicine ancient book knowledge graph construction method

Country Status (1)

Country Link
CN (1) CN116701665A (en)


Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2021196520A1 (en) * 2020-03-30 2021-10-07 西安交通大学 Tax field-oriented knowledge map construction method and system
WO2022116417A1 (en) * 2020-12-03 2022-06-09 平安科技(深圳)有限公司 Triple information extraction method, apparatus, and device, and computer-readable storage medium
US20220180065A1 (en) * 2020-12-09 2022-06-09 Beijing Wodong Tianjun Information Technology Co., Ltd. System and method for knowledge graph construction using capsule neural network
CN115618005A (en) * 2021-07-16 2023-01-17 中国传媒大学 Traditional Tibetan medicine knowledge graph construction and completion method
CN115238040A (en) * 2022-08-02 2022-10-25 北京科技大学 Steel material science knowledge graph construction method and system
CN116127090A (en) * 2022-12-28 2023-05-16 中国航空综合技术研究所 Aviation system knowledge graph construction method based on fusion and semi-supervision information extraction

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
宋伟 (Song Wei); 张游杰 (Zhang Youjie): "Knowledge graph construction method based on environmental information fusion", 计算机系统应用 (Computer Systems & Applications), no. 06 *
陈荟 (Chen Hui); 邓晖 (Deng Hui); 吴道婷 (Wu Daoting): "Research on the automatic construction of a subject knowledge graph for instructional design based on natural language processing", 中国教育信息化 (Chinese Education Informatization), no. 07 *

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117236342A (en) * 2023-09-28 2023-12-15 南京大经中医药信息技术有限公司 Chinese medicine classics semantic analysis method and system combined with knowledge graph
CN117236342B (en) * 2023-09-28 2024-05-28 南京大经中医药信息技术有限公司 Chinese medicine classics semantic analysis method and system combined with knowledge graph

Similar Documents

Publication Publication Date Title
CN112347268B (en) Text-enhanced knowledge-graph combined representation learning method and device
Wang et al. Feature extraction and analysis of natural language processing for deep learning English language
CN108182295B (en) Enterprise knowledge graph attribute extraction method and system
CN110287481B (en) Named entity corpus labeling training system
Chen et al. Research on text sentiment analysis based on CNNs and SVM
Li et al. Improving convolutional neural network for text classification by recursive data pruning
CN112560432A (en) Text emotion analysis method based on graph attention network
CN111782769B (en) Intelligent knowledge graph question-answering method based on relation prediction
CN113743119B (en) Chinese named entity recognition module, method and device and electronic equipment
CN112417884A (en) Sentence semantic relevance judging method based on knowledge enhancement and knowledge migration
CN115510245B (en) Unstructured data-oriented domain knowledge extraction method
CN114077673A (en) Knowledge graph construction method based on BTBC model
CN110852089A (en) Operation and maintenance project management method based on intelligent word segmentation and deep learning
CN113221571A (en) Entity relation joint extraction method based on entity correlation attention mechanism
CN111428481A (en) Entity relation extraction method based on deep learning
CN111651973A (en) Text matching method based on syntax perception
CN116701665A (en) Deep learning-based traditional Chinese medicine ancient book knowledge graph construction method
CN114491036A (en) Semi-supervised text classification method and system based on self-supervision and supervised joint training
CN116383352A (en) Knowledge graph-based method for constructing field intelligent question-answering system by using zero samples
CN114048314A (en) Natural language steganalysis method
CN114021584A (en) Knowledge representation learning method based on graph convolution network and translation model
CN116522165B (en) Public opinion text matching system and method based on twin structure
CN116680407A (en) Knowledge graph construction method and device
CN114239575B (en) Statement analysis model construction method, statement analysis method, device, medium and computing equipment
CN111708896B (en) Entity relationship extraction method applied to biomedical literature

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
WW01 Invention patent application withdrawn after publication

Application publication date: 20230905

WW01 Invention patent application withdrawn after publication