CN115269865A - Knowledge graph construction method for auxiliary diagnosis - Google Patents

Knowledge graph construction method for auxiliary diagnosis

Info

Publication number
CN115269865A
CN115269865A
Authority
CN
China
Prior art keywords
knowledge
data
extraction
knowledge graph
clustering
Prior art date
Legal status
Pending
Application number
CN202210765651.4A
Other languages
Chinese (zh)
Inventor
杨鹏
王超余
冷俊成
胡皓楠
解然
Current Assignee
Zhejiang Huaxun Technology Co ltd
Original Assignee
Zhejiang Huaxun Technology Co ltd
Priority date
Filing date
Publication date
Application filed by Zhejiang Huaxun Technology Co ltd filed Critical Zhejiang Huaxun Technology Co ltd
Priority to CN202210765651.4A priority Critical patent/CN115269865A/en
Publication of CN115269865A publication Critical patent/CN115269865A/en
Pending legal-status Critical Current

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F 16/00 - Information retrieval; Database structures therefor; File system structures therefor
    • G06F 16/30 - Information retrieval; Database structures therefor; File system structures therefor, of unstructured textual data
    • G06F 16/36 - Creation of semantic tools, e.g. ontology or thesauri
    • G06F 16/367 - Ontology
    • G - PHYSICS
    • G16 - INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16H - HEALTHCARE INFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR THE HANDLING OR PROCESSING OF MEDICAL OR HEALTHCARE DATA
    • G16H 50/00 - ICT specially adapted for medical diagnosis, medical simulation or medical data mining; ICT specially adapted for detecting, monitoring or modelling epidemics or pandemics
    • G16H 50/20 - ICT specially adapted for medical diagnosis, medical simulation or medical data mining, for computer-aided diagnosis, e.g. based on medical expert systems

Landscapes

  • Engineering & Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Medical Informatics (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • Public Health (AREA)
  • Biomedical Technology (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Animal Behavior & Ethology (AREA)
  • Computational Linguistics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Pathology (AREA)
  • Physics & Mathematics (AREA)
  • Epidemiology (AREA)
  • General Health & Medical Sciences (AREA)
  • Primary Health Care (AREA)
  • Medical Treatment And Welfare Office Work (AREA)

Abstract

The invention discloses a knowledge graph construction method for auxiliary diagnosis. It first proposes a knowledge extraction algorithm based on a Multi-Attention structure and a wrapper-based knowledge extraction method, which efficiently and accurately extract knowledge from the public data of medical websites and from electronic medical record data. It then proposes a multi-scheme knowledge fusion strategy for the characteristics of multi-source heterogeneous data, solving the problems of knowledge redundancy and ambiguity. Finally, it proposes a knowledge representation and optimization scheme for the same characteristics, completing the construction of a knowledge graph for the auxiliary diagnosis of cardiovascular diseases. The method focuses on the multi-source heterogeneous characteristics of disease data and provides multi-scheme knowledge extraction, knowledge fusion and optimization strategies tailored to them, so that deeper disease data can be mined and a knowledge graph better suited to disease auxiliary diagnosis can be accurately constructed.

Description

Knowledge graph construction method for auxiliary diagnosis
Technical Field
The invention relates to a knowledge graph construction method for auxiliary diagnosis, and belongs to the technical field of Internet and artificial intelligence.
Background
At present, people in all countries suffer to different degrees from various diseases, and prevention and treatment in the disease field depend heavily on the experience and knowledge of medical staff and experts. Owing to complex pathologies, limited medical resources and similar problems, society still faces no small challenge in providing comprehensive and effective prevention and treatment measures for the public. Computer-aided high-risk disease prediction is therefore a promising and significant research topic: it can effectively relieve the pressure of scarce medical resources and advance disease prevention and treatment work.
With the arrival of the Internet and artificial intelligence era, medical informatization and intelligent healthcare have become a new direction for promoting the steady development of traditional medicine. In actual treatment, electronic medical records are gradually replacing handwritten ones, and large amounts of structured and unstructured data, represented by electronic medical records and health archives, accumulate during the treatment of cardiovascular disease patients. These data are important medical information resources in the disease field and provide a solid data basis for data mining and data analysis tasks. In addition, auxiliary diagnosis systems in intelligent healthcare have attracted wide attention from medical service providers, and various products are applied in different disease scenarios. The development of diagnosis assistance systems based on knowledge graphs is one of the research hotspots in the field: the knowledge relationships and storage characteristics of a knowledge graph can effectively extract useful information from large amounts of disease data. As an auxiliary tool for doctors, a knowledge graph is of great significance for improving doctors' working efficiency, freeing up productivity, relieving the shortage of medical resources, and automating the study and prevention of cardiovascular diseases. However, the multi-source heterogeneity of disease data makes it difficult to accurately construct a knowledge graph for disease-assisted diagnosis, so designing an effective method to process multi-dimensional heterogeneous disease data and accurately construct such a knowledge graph has become an extremely important problem.
Disclosure of Invention
The invention provides a knowledge graph construction method for auxiliary diagnosis, aimed at the problem of how to effectively organize multi-source heterogeneous disease data.
In order to achieve the purpose, the invention is realized by the following technical scheme:
a knowledge graph construction method for auxiliary diagnosis comprises the following steps:
step 1, constructing a cardiovascular disease corpus, extracting public data of a professional medical website by using a knowledge extraction method based on a wrapper, and constructing an original corpus;
step 2, extracting unstructured medical record data by using a knowledge extraction algorithm based on a Multi-Attention structure to supplement a disease corpus;
step 3, using a multi-scheme knowledge fusion strategy aiming at the characteristics of multi-source heterogeneous data to perform entity disambiguation, entity linkage and knowledge combination on the extracted data information; solving the problems of knowledge redundancy and ambiguity through entity disambiguation based on a clustering algorithm, knowledge merging based on Pandans and entity linking based on Fusion similarity;
and 4, further optimizing data, and completing construction work of the knowledge graph facing the cardiovascular disease auxiliary diagnosis through knowledge representation and graph database storage.
Further, the step 1 specifically includes the following steps:
the knowledge extraction of the semi-structured data facing the professional medical website is completed by using the knowledge extraction based on the wrapper; the wrapper is a rule-based text information extraction model, which comprises the following steps: the system comprises a rule base, a rule execution module and an information conversion module; constructing a user-agent set during crawling work, and randomly selecting one user-agent during each request; and pausing for several seconds after each grabbing, then performing crawler again, and finally saving the extracted information as a csv file for subsequent processing.
Further, the step 2 specifically includes the following steps:
the method comprises the following steps of utilizing a BERT-Bi-LSTM-CRF model based on a multi-head attention structure to complete knowledge extraction of unstructured data of medical records such as electronic medical records; the model is divided into three layers: the method comprises the following steps of (1) carrying out BERT pre-training on a model, a Bi-LSTM semantic fusion layer and a CRF optimal output layer; after the marked data is input into the model, firstly, the first layer of BERT pre-training model is used, text vectorization is realized by combining a multi-head attention model, different positions are simultaneously focused in the extraction process to input information representing different subspaces, and a plurality of attention layers are used for parallel calculation; then, inputting the vector expression sequence of the text into a second Bi-LSTM semantic fusion layer, and carrying out further semantic coding to obtain global sequence characteristics; and finally, the data enters a third CRF optimized output layer, so that the label sequence which has the highest probability and is most consistent with the semantics is output.
Further, the step 3 specifically includes the following steps:
an improved K-Means algorithm is adopted to automatically complete the determination work of the number of the clustering categories and perform clustering disambiguation; merging the overlapped structured data into the existing knowledge base through Pandas; and linking the entity object extracted from the unstructured data or the semi-structured data with the corresponding correct entity object in the knowledge base by adopting a Fusion similarity calculation method.
Further, the improved K-Means algorithm flow is as follows:
the file to be processed n initializes the number of clusters, k is the number of clusters with different diseases D1、D2Is collected as a file of
Figure BDA0003725427340000023
The integer part of (a); selecting an initial aggregation point according to
Figure BDA0003725427340000021
Figure BDA0003725427340000022
Storing the aggregation point S in a set, and storing the index and the minimum distance in a set S'; calculating the difference value of the minimum distance between the two clustering points, and storing the difference value into a set S'; starting from the S' point with the largest distance difference, storing the previous aggregation point into a set S; starting from the clustering center K, obtaining a clustering result by applying a K-means clustering algorithm; k clustering centers can be automatically obtained, a final document set is obtained, and the disambiguation task is completed.
Further, the step 4 specifically includes the following steps:
further optimization of the knowledge graph is completed by removing nodes irrelevant to the domain and the relation contained by the nodes through a vector variance algorithm, knowledge representation is completed by supplementing and correcting domain experts, and the knowledge graph is visually stored by using a Neo4j graph database.
Further, the vector variance algorithm includes the following steps:
treating the set of relationships as a directed graph, where SiIs contained in fjN is SiNumber of links in, ekDenotes from SiTo fjThe edge of (1) has a weight of w (e)k),E{e1,e2,…enDenotes the slave node SiTo fjSet of paths of P { P }1,p2,…pmDenotes the slave node SiTo fjThe entire path of (a); node S is calculated using the following formulaiTo fjDegree of membership of (c):
Figure BDA0003725427340000031
Figure BDA0003725427340000032
Figure BDA0003725427340000033
and remove nodes that are not related to the domain and their contained relationships by setting a threshold.
Compared with the prior art, the invention has the following beneficial effects:
the invention can process the relevant data of the multi-dimensional heterogeneous diseases and accurately construct the knowledge map facing the auxiliary diagnosis of the diseases. Compared with other methods, the method focuses on the multi-source heterogeneous characteristics of the disease data, and provides multi-scheme knowledge extraction, knowledge fusion and optimization strategies aiming at the disease data, so that deeper disease data can be mined, and a knowledge map more suitable for disease auxiliary diagnosis is constructed.
Drawings
FIG. 1 is an overall framework of a knowledge graph construction method for auxiliary diagnosis provided by the invention.
FIG. 2 is a workflow for implementing the wrapper-based knowledge extraction provided by the present invention.
FIG. 3 is a working example of a knowledge extraction model based on the Multi-Attention structure in the present invention.
FIG. 4 is a block diagram of the entity linking module process of the present invention.
Detailed Description
The technical solutions provided by the present invention will be described in detail below with reference to specific examples, and it should be understood that the following specific embodiments are only illustrative of the present invention and are not intended to limit the scope of the present invention.
Taking cardiovascular diseases as an example, the method for constructing the knowledge graph facing auxiliary diagnosis, which is provided by the method, has the overall framework as shown in figure 1, and comprises the following specific implementation steps:
step 1, constructing a cardiovascular disease corpus, and performing efficient and accurate extraction work on public data of a professional medical website by using a knowledge extraction method based on a wrapper to construct an original corpus.
Knowledge extraction for the semi-structured data is completed with wrapper-based knowledge extraction. The wrapper is a rule-based text information extraction model; because its rule set is easy to build and its extraction precision is high, it is well suited to knowledge extraction from semi-structured data. A wrapper generally consists of three parts: a rule base, a rule execution module and an information conversion module; the wrapper workflow is shown in FIG. 2. The rule base stores the crawling rules, the rule execution module retrieves the corresponding rules from the rule base and executes them, and the information conversion module stores the crawled information in the database. The attribute knowledge of a medical entity is structured as attribute-value pairs, and since the website data consists of regular attribute-value pair information and semi-structured data, a wrapper can obtain medical website data well.
Because the required knowledge is highly specialized, the expert-edited "39 Health Net" is used as one of the data sources to ensure the completeness of the knowledge graph. The experiment adopts two measures to complete the crawling work: first, a user-agent set is constructed and one user-agent is selected at random for each request; second, after each grab, time.sleep() pauses the crawler for several seconds before it continues. In addition, the invention uses the threading module to implement a multithreaded crawler that handles the data grabbing and processing tasks in parallel, improving crawler efficiency. When grabbing data, a request is sent to the server through requests.get to obtain the html text of the web page; the html text is then parsed with the Beautiful Soup (bs4) module into the easy-to-read "lxml" format; finally the extracted information is saved as csv files.
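The wrapper pipeline above (rule base, rule execution, information conversion) can be sketched as a small rule-based extractor. This hedged sketch uses only the standard library: the dt/dd markup, the user-agent strings and the attribute names are assumptions rather than the real site's layout, and a hardcoded snippet stands in for the page that requests.get would fetch.

```python
import csv
import io
import random
import re

# Hypothetical user-agent pool: the crawler described above picks one entry
# at random per request and pauses with time.sleep() between grabs.
USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64)",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7)",
]

def pick_user_agent():
    return random.choice(USER_AGENTS)

# Rule base: one pattern per attribute-value layout assumed on the page.
RULES = {"pair": re.compile(r"<dt>(?P<attr>[^<]+)</dt>\s*<dd>(?P<val>[^<]+)</dd>")}

def extract_pairs(html):
    """Rule execution module: apply the rule base to the fetched HTML."""
    return [(m.group("attr"), m.group("val")) for m in RULES["pair"].finditer(html)]

def to_csv(pairs):
    """Information conversion module: serialize the pairs as CSV text."""
    buf = io.StringIO()
    writer = csv.writer(buf)
    writer.writerow(["attribute", "value"])
    writer.writerows(pairs)
    return buf.getvalue()

# A local snippet stands in for the page a real crawler would fetch with
# requests.get(url, headers={"User-Agent": pick_user_agent()}).
sample = "<dl><dt>symptom</dt><dd>palpitation</dd><dt>department</dt><dd>cardiology</dd></dl>"
pairs = extract_pairs(sample)
print(to_csv(pairs))
```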
Step 2, performing efficient and accurate extraction work on medical record data such as electronic medical records by using a knowledge extraction algorithm based on a Multi-Attention structure, to supplement the disease corpus.
The method utilizes a BERT-Bi-LSTM-CRF model based on a multi-head attention structure to complete knowledge extraction from unstructured medical record data such as electronic medical records. The model is divided into three layers overall: the BERT pre-training model, the Bi-LSTM semantic fusion layer and the CRF optimized output layer; a working example is shown in FIG. 3. The encoding-layer BERT model adopts a bidirectional Transformer as its encoder; since the Transformer models text with an attention mechanism, it has better parallel computing capability and a better ability to capture long-distance text features.
After the labeled data is input into the model, the first-layer BERT pre-training model, combined with the multi-head attention model, vectorizes the text; different positions are attended to simultaneously during extraction so that the input is represented in different subspaces, and multiple attention layers compute in parallel. The invention introduces a multi-head attention structure to optimize the classical BERT model. Attention maps a query and a set of key-value pairs to an output, i.e. it computes a weighted sum of the values, with the weight assigned to each value computed from the query and the corresponding key. The output of the scaled dot-product attention layer is calculated by equation (1):
Attention(Q, K, V) = softmax(QK^T / √d) V    (1)

where the three inputs Q, K, V respectively denote the query matrix, the key matrix and the value matrix, and d is the size of the hidden unit of the BiLSTM layer, equal to d_h. In the experiment Q = K = V = H, where H = {h_1, h_2, h_3, …, h_n} is the output of the BiLSTM layer. The multi-head attention layer first linearly projects the queries, keys and values h times through different projection layers; each projection is computed as in equation (2). The h projection layers then perform the scaled dot-product attention of equation (1) in parallel, and finally the results of the h attention heads are concatenated and projected once more to obtain the final output, as in equation (3):

head_i = Attention(Q W_i^Q, K W_i^K, V W_i^V)    (2)

H' = (head_1, head_2, head_3, …, head_h) W^O    (3)

where W_i^Q, W_i^K, W_i^V and W^O are all projection-layer parameter matrices, with d_k = 2d_h / h, and are trainable parameters. The BERT model is jointly trained mainly on two tasks: the masked language model (MLM) and next sentence prediction (NSP). The MLM task, inspired by cloze tests, randomly selects 15% of the words during training; of these, 80% are covered by a "mask" symbol, 10% are replaced by other random words, and 10% keep the original word. The uncovered words are used to predict the covered ones, so the model can use the context on both the left and right of a word to predict the current word. The input of the NSP task is two sentences A and B; during training the two sentences form a continuous context with 50% probability and a discontinuous one with 50% probability, and the model predicts whether sentence B is the next sentence after sentence A, thereby judging the sentence relationship. In the fine-tuning stage the BERT model is adjusted for different downstream tasks by adding a classifier or output layer; it can be applied to sentence-pair classification, sentence classification, question answering and tagging tasks, and is characterized by generality and optimized performance.
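Equations (1)-(3) can be illustrated with a small NumPy sketch; the random projection matrices stand in for the trainable parameters W_i^Q, W_i^K, W_i^V and W^O, and the per-head dimension d // h is a simplification of the d_k = 2d_h/h given in the text.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def scaled_dot_product_attention(Q, K, V):
    # Equation (1): Attention(Q, K, V) = softmax(Q K^T / sqrt(d)) V
    d = Q.shape[-1]
    return softmax(Q @ K.T / np.sqrt(d)) @ V

def multi_head_attention(H, h, rng):
    # Q = K = V = H as in the experiment; random matrices stand in for
    # the trainable projections of equations (2)-(3).
    n, d = H.shape
    dk = d // h
    heads = []
    for _ in range(h):
        WQ, WK, WV = (rng.standard_normal((d, dk)) for _ in range(3))
        heads.append(scaled_dot_product_attention(H @ WQ, H @ WK, H @ WV))  # eq. (2)
    WO = rng.standard_normal((h * dk, d))
    return np.concatenate(heads, axis=-1) @ WO  # eq. (3)

rng = np.random.default_rng(0)
H = rng.standard_normal((5, 8))  # n = 5 tokens, hidden size d = 8
out = multi_head_attention(H, h=4, rng=rng)
print(out.shape)
```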
The vector representation sequence of the text obtained by the BERT pre-training model is input into the second-layer Bi-LSTM semantic fusion layer for further semantic encoding to obtain global sequence features. The semantic fusion layer consists of a BiLSTM neural network layer and an attention mechanism. BiLSTM modeling operates on the sentence in both the forward and backward directions, both connected to the output layer, so the resulting output contains forward and backward context information simultaneously, and the attention mechanism is used to attend to candidate knowledge and context. LSTM solves the gradient explosion and gradient vanishing problems of traditional RNNs, and BiLSTM can take in not only the information of the preceding sequence but also that of the following sequence. In essence, BiLSTM consists of two LSTM units, one running forward to produce h→_t and one running backward to produce h←_t; the final state of each cell at time t is then represented as the combination h_t = [h→_t ; h←_t].
In the knowledge extraction model, the BiLSTM is extended with contextual knowledge, and the entity data is brought into the hidden layer for model training. The weight a_j^t of the j-th candidate knowledge k_j^t reflects its relevance, or importance, to the current context at position t; a_j^t is computed bilinearly using equation (4):

a_j^t = softmax((k_j^t)^T W_k h_t)    (4)

The matrix parameter W_k is learned during training and then combined with the candidate knowledge set K; the knowledge integration vector V_t is expressed by equation (5) as:

V_t = Σ_j a_j^t k_j^t    (5)

where Σ_j a_j^t = 1. The hidden state of the BiLSTM is combined with the knowledge integration vector V_t to obtain the mixed vector h'_t using equation (6):

h'_t = h_t + V_t    (6)
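A minimal NumPy sketch of equations (4)-(6), under the bilinear-attention reading above; the shapes and the softmax normalization (so that the weights sum to 1) are assumptions consistent with the description.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def knowledge_integrate(h_t, K_t, W_k):
    # Equation (4): bilinear scores k_j^T W_k h_t, normalized so the
    # weights a_j sum to 1 (the softmax is an assumption of this sketch).
    a = softmax(K_t @ W_k @ h_t)
    V_t = a @ K_t          # equation (5): weighted sum of candidate knowledge
    return h_t + V_t       # equation (6): mixed vector h'_t

rng = np.random.default_rng(1)
d = 6
h_t = rng.standard_normal(d)
K_t = rng.standard_normal((3, d))  # three candidate knowledge vectors
W_k = rng.standard_normal((d, d))
h_mix = knowledge_integrate(h_t, K_t, W_k)
print(h_mix.shape)
```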
If the current word has no candidate knowledge, i.e. the candidate knowledge set is empty, then for the semantics of a sequence the importance of each context element to the candidate knowledge should still be distinguished. Attending to knowledge with intermediate gates inside the Bi-LSTM unit would complicate the Bi-LSTM structure and add extra learning parameters, and a drawback of BiLSTM is that the amount of retained information decreases as the sequence length grows. The invention therefore places an attention mechanism after the Bi-LSTM; it shortens the effective sequence distance and preserves the context information and candidate knowledge in the sequence, further strengthening the attention paid to them. The calculation is given by equations (7)-(10):
M = tanh(H + K)    (7)

β = softmax(W^T M)    (8)

γ' = H β^T    (9)

γ = tanh(γ')    (10)
where H = {h_1, h_2, h_3, …, h_n} is the output of the BiLSTM hidden layer, β is a weight matrix, W is a parameter matrix, softmax is the normalized exponential function and tanh is the activation function; the final output is the depth feature γ, which combines the weighted transformation of the knowledge features. The output is then fed into a classifier that selects the maximum probability value as the label output for the sequence. Because the outputs of the softmax classifier are independent, the output sequence can be disordered and ignores the local features of the sentence, lowering the accuracy of the trained model; the CRF model is therefore adopted to take the hidden sequential rules of the sentence into account.
The data processed by the semantic fusion layer enters the third-layer CRF optimized output layer, which outputs the label sequence with the highest probability that best fits the semantics. The optimized output layer is a CRF layer. For the cardiovascular-disease knowledge extraction task, the dependency between adjacent labels must be considered; the CRF is a graphical model of the joint probability distribution represented by an undirected graph, which normalizes local features into global features and, by computing the probability distribution of the whole sequence, overcomes the label bias problem. With Z = {z_1, z_2, z_3, …, z_n} as input, it predicts the most likely tag sequence Y = {y_1, y_2, y_3, …, y_n} using past and future tags. Let μ denote the parameter set of the CRF layer; the parameters can then be estimated by maximizing the log-likelihood of equation (11):

L(μ) = Σ_{(S,Y)∈Dataset} log p(Y | Z, μ)    (11)

where Y is the tag sequence of sentence S and the probability p is the conditional probability of Y given S and μ. S_μ(Z, Y) is the score of tag sequence Y for a given sentence, obtained by adding the transition score matrix A and the output Z of the Tanh layer according to equation (12); the conditional probability p can be computed by normalizing S_μ(Z, Y):

S_μ(Z, Y) = Σ_{t=1}^{n} (P_{t, y_t} + A_{y_{t-1}, y_t})    (12)

where P_{t, y_t} is the probability that the current character w_t has label y_t, and A_{y_{t-1}, y_t} is the probability that w_t has label y_t when the previous character w_{t-1} has label y_{t-1}. Through dynamic programming the invention maximizes the log-likelihood of equation (11) over the whole training set, and for any input sentence finds the best tag sequence by maximizing the score of equation (12) with the Viterbi algorithm. With the optimized output layer, the model can effectively use past and future labels to predict the current label while also learning the hidden constraint rules among labels, so the global optimum is obtained effectively and the accuracy of entity recognition is greatly enhanced.
Step 3, in order to solve the problems of knowledge redundancy and ambiguity, a multi-scheme knowledge fusion strategy aiming at the characteristics of multi-source heterogeneous data is used for carrying out entity disambiguation, entity linking and knowledge merging on the extracted data information;
in the operation process, the main objects of knowledge fusion are entities, attributes and relationships in triples in the map, and entity disambiguation aims at solving the ambiguous phenomenon. In early disambiguation methods, an external dictionary was introduced to word sense disambiguation, mostly by comparing the context language environment of the terms. And the number of repeated terms between the interpretation of a term and the dictionary determines the correct meaning of the term. Nevertheless, this unsupervised approach still can explain the articles of the text and vocabulary, but its matching degree is too high to be suitable for disambiguation of complex knowledge. In view of the fact that the traditional K-Means algorithm needs to determine the clustering category number in advance when a disambiguation task is carried out, the determination of the category number has a lot of uncertainty on multi-source heterogeneous data, the traditional algorithm is prone to causing the problem of local convergence, and meanwhile, in order to better guarantee the rigidness of an auxiliary diagnosis system, the improved K-Means algorithm is adopted for clustering entity categories, the determination work of the clustering category number is automatically completed, and clustering disambiguation is carried out. The clustering algorithm is used for the data to be added in the future, so that the difficulty of artificial disambiguation can be obviously reduced.
The optimization principle of the clustering algorithm is to select the initial cluster points by the Max-Min principle. First the two objects x_i1 and x_i2 with the minimum Fusion similarity are selected as the first two cluster points; then for every other data point x_k the distances to x_i1 and x_i2 are computed, and the remaining cluster points satisfy the recursion of equation (13), e.g. the (m+1)-th cluster point satisfies:

x_{m+1} = arg max_{x_k} min_{1≤j≤m} d(x_k, x_j)    (13)
the specific flow of the improved K-Means clustering algorithm is as follows: the file to be processed n initializes the number of clusters, k is the number of clusters with different diseases D1、D2Is collected as a file of
Figure BDA0003725427340000081
The integer part of (1); selecting an initial aggregation point according to equation (14)
Figure BDA0003725427340000082
The aggregation point S is stored in a set, and the index and the minimum distance are stored in a set S(ii) a Calculating the difference value of the minimum distance between the two clustering points, and storing the difference value into a set S'; starting from the S' point with the largest distance difference, storing the previous aggregation points into a set S; starting from the clustering center K, a K-means clustering algorithm is applied to obtain a clustering result. Then k clustering centers can be automatically obtained, and a final document set is obtained, so that the disambiguation task can be completed more conveniently.
Figure BDA0003725427340000083
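A hedged sketch of the initialization described above: the heuristic k = ⌊√(n/2)⌋ is an assumed reading of the cluster-number formula, and Euclidean distance stands in for a Fusion-similarity-based distance; the Max-Min rule of equation (13) picks each next center as the point whose minimum distance to the chosen centers is largest.

```python
import math

def init_centers(points, dist):
    """Improved K-Means initialization: start from the two most dissimilar
    points, then repeatedly add the point farthest (by its minimum distance)
    from the centers chosen so far. k = floor(sqrt(n / 2)) is an assumption."""
    n = len(points)
    k = max(2, int(math.sqrt(n / 2)))
    first = max(
        ((i, j) for i in range(n) for j in range(i + 1, n)),
        key=lambda p: dist(points[p[0]], points[p[1]]),
    )
    centers = list(first)
    while len(centers) < k:
        nxt = max(
            (i for i in range(n) if i not in centers),
            key=lambda i: min(dist(points[i], points[c]) for c in centers),
        )
        centers.append(nxt)
    return k, [points[i] for i in centers]

pts = [(0, 0), (0, 1), (1, 0), (1, 1), (0.5, 0.5), (10, 1), (9, 0), (9, 1),
       (9.5, 0.5), (5, 8), (5, 9), (4, 8), (6, 8), (5, 8.5), (0, 0.2),
       (10, 0.2), (5, 8.2), (0.2, 0)]
k, centers = init_centers(pts, math.dist)
print(k, centers)  # one center per point group
```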
Given two sets A and B, the Fusion coefficient is defined as the ratio of the size of their intersection to the size of their union. Meanwhile, the problem of overlapping knowledge still needs attention: it lengthens knowledge query time, increases the system's operating load and reduces working efficiency. Therefore, before the knowledge is stored, the overlapping cardiovascular-disease knowledge generated during construction is merged and the duplicate triples are deleted, improving system efficiency. Pandas is a mature data analysis technology with two powerful data structures, Series and DataFrame; it provides convenient and efficient data operations and is effective for merge processing and data reduction. To complete the knowledge merging task more conveniently and quickly, the invention uses Pandas: after the structured data is obtained, the overlapping structured data is merged into the existing knowledge base, with the task focusing on merging at two levels, the schema layer and the data layer. Taking the 39 health website as the base knowledge, Pandas reads the csv files obtained in the earlier workflow into DataFrame structures; the corresponding attribute names in the DataFrames of the other data sources are selected and added to the 39 health website DataFrame, completing knowledge merging at the schema layer; then the non-null values in the DataFrames of the other sources are selected to fill the null values of the corresponding entities in the base DataFrame, completing knowledge merging at the data layer.
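The two-level merging can be sketched with Pandas as follows; the attribute names and values are made up, with "base" standing for the 39 health website DataFrame and "other" for another source read from its csv file.

```python
import pandas as pd

base = pd.DataFrame(
    {"entity": ["angina", "arrhythmia"], "symptom": ["chest pain", None]}
)
other = pd.DataFrame(
    {"entity": ["angina", "arrhythmia"],
     "symptom": [None, "palpitation"],
     "department": ["cardiology", "cardiology"]}
)

# Schema-layer merging: add attribute columns that only the other source has.
for col in other.columns:
    if col not in base.columns:
        base[col] = None

# Data-layer merging: fill nulls in the base with non-null values from the
# other source, matching rows on the entity name.
merged = base.set_index("entity").combine_first(other.set_index("entity"))
print(merged)
```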
Entity linking refers to the operation of linking an entity object extracted from unstructured or semi-structured data to the corresponding correct entity object in the knowledge base. The basic idea is to select a set of candidate entity objects from the knowledge base for a given entity object and then link it to the correct one through similarity calculation. For the similarity calculation, the Fusion similarity method is adopted: it is applicable to wide, sparse data and can compare the similarity and difference between finite sample sets, where a larger coefficient means higher sample similarity, as shown in formulas (14) and (15). Entity linking is designed for the case where two forms of an entity in the knowledge graph are inconsistent, which easily causes retrieval failure. For example, during system operation the entity "sick sinus syndrome" may be retrieved while the corresponding entity in the knowledge graph is stored under a different form of the same name, so retrieval failure in the knowledge graph would inevitably occur. Therefore, the invention links the entities obtained by the NER module with the entities existing in the knowledge graph of the invention through entity linking. Specifically, the entity "sick sinus syndrome" acquired by named entity recognition is split into its constituent characters; all related entities in the corresponding entity category are retrieved to form an entity list; after de-duplication, a candidate entity set is constructed; the Fusion similarity between the NER entity and each candidate entity is calculated; and the candidate entity with the maximum similarity is obtained. The work flow diagram of this module is shown in fig. 4.
In the sixth step, similarity is calculated with Fusion between the phrase vector of the named-entity-recognition entity and that of each entity in the candidate set, and the candidate entity with the maximum similarity is output. Assuming the phrase vector of the NER entity is A = [a1, a2, …, an] and the phrase vector of a candidate entity is B = [b1, b2, …, bn], the cosine similarity of the two is shown in formula (15):
cos(A, B) = ( Σ_{i=1}^{n} a_i · b_i ) / ( √(Σ_{i=1}^{n} a_i²) · √(Σ_{i=1}^{n} b_i²) )    (15)
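The candidate-ranking step can be sketched with the Fusion coefficient of formula (14). Computing it over the character sets of the two names is an assumption for illustration, since the patent does not spell out the exact feature construction.

```python
def fusion_similarity(a: str, b: str) -> float:
    """Fusion (Jaccard-style) coefficient of formula (14):
    |A ∩ B| / |A ∪ B| over the character sets of the two names."""
    sa, sb = set(a), set(b)
    return len(sa & sb) / len(sa | sb) if sa | sb else 0.0

def link_entity(mention: str, candidates: list[str]) -> str:
    """Return the candidate knowledge-base entity most similar to the
    NER mention, i.e. the one with the maximum Fusion coefficient."""
    return max(candidates, key=lambda c: fusion_similarity(mention, c))
```

Because the coefficient compares character sets rather than exact strings, a mention that differs from the stored entity by a few characters still links to the right node instead of failing retrieval.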
In step 4, the data are further optimized, and construction of the knowledge graph oriented to cardiovascular-disease auxiliary diagnosis is completed through knowledge representation and graph-database storage.
The knowledge-graph optimization is based on the vector variance algorithm, which optimizes the knowledge graph mainly by removing nodes unrelated to the domain together with the relations they contain. The invention treats the set of relations as a directed graph, where S_i is contained in f_j, n is the number of links in S_i, e_k denotes an edge from S_i to f_j with weight w(e_k), E = {e_1, e_2, …, e_n} denotes the set of edges from node S_i to f_j, and P = {p_1, p_2, …, p_m} denotes the set of paths from node S_i to f_j. The invention calculates the degree of membership of node S_i to f_j using formulas (16)-(18), and removes domain-irrelevant nodes and the relations they contain by setting a threshold.
[Formulas (16)-(18), defining the membership degree of node S_i to f_j, are present only as formula images in the original.]
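Since formulas (16)-(18) survive only as images, the following is a schematic stand-in rather than the patent's exact membership computation: it scores each node S_i by the mean weight of its outgoing edges and drops nodes, with their relations, whose score falls below the threshold.

```python
from collections import defaultdict

def prune_graph(edges, threshold):
    """edges: (source, target, weight) triples of the relation graph.

    Scores each source node by the mean weight of its outgoing edges
    (a stand-in for the membership degree of formulas (16)-(18)) and
    keeps only the triples whose source node reaches the threshold."""
    weight_sum, links = defaultdict(float), defaultdict(int)
    for src, _dst, w in edges:
        weight_sum[src] += w
        links[src] += 1
    score = {n: weight_sum[n] / links[n] for n in weight_sum}
    kept = {n for n, s in score.items() if s >= threshold}
    # a relation survives only if its source node survives
    return [(s, d, w) for s, d, w in edges if s in kept]
```

The threshold plays the same role as in the patent: nodes whose connection to the domain subgraph is too weak are removed wholesale, shrinking the graph before storage.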
The invention takes cardiovascular disease as an example: after supplementation and correction by domain experts, 5 entities, 5 relations and 12 attributes are designed for the knowledge graph oriented to the field of cardiovascular-disease auxiliary diagnosis. Finally, after entity fusion, the knowledge is imported in triple form and the Neo4j graph database is used to visualize the knowledge graph, providing a knowledge graph with high accuracy and high coverage for subsequent auxiliary-diagnosis work. Neo4j uses a graph to represent data and its relationships; its basic units are entities, relationships and attributes, so the relationships between entities in the knowledge graph can be seen intuitively. For data query, thanks to the high retrieval efficiency of the Cypher language and the use of adjacency indexes, targets can be accessed quickly and efficiently, significantly improving query speed and providing convenience for subsequent retrieval.
There are many methods for importing data into a graph database; the invention chooses to convert the data into CSV format and then complete the reading of the data through the graph-database language. In addition, the import tool bundled with Neo4j enables rapid local data import, and a successfully stored knowledge graph can be displayed through the system and applied to auxiliary-diagnosis tasks.
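One way the CSV conversion step could look, together with an illustrative Cypher LOAD CSV statement for reading the file back in Neo4j. The file name, node labels and relationship type are placeholders, not values from the patent.

```python
import csv
import io

# Illustrative Cypher for reading such a file back into Neo4j;
# 'triples.csv', the Entity label and the REL type are assumptions.
LOAD_CSV = """LOAD CSV WITH HEADERS FROM 'file:///triples.csv' AS row
MERGE (h:Entity {name: row.head})
MERGE (t:Entity {name: row.tail})
MERGE (h)-[:REL {type: row.relation}]->(t)"""

def triples_to_csv(triples):
    """Serialise (head, relation, tail) triples to CSV text with a
    header row, the format the LOAD CSV statement above expects."""
    buf = io.StringIO()
    writer = csv.writer(buf)
    writer.writerow(["head", "relation", "tail"])
    writer.writerows(triples)
    return buf.getvalue()
```

Using MERGE rather than CREATE keeps the import idempotent: re-running it on an overlapping CSV does not duplicate nodes or relations, which matches the de-duplication goal of the knowledge-merging step.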
The technical means disclosed in the scheme of the invention are not limited to those disclosed in the above embodiments, and also include technical schemes formed by any combination of the above technical features. It should be noted that those skilled in the art can make various improvements and modifications without departing from the principle of the invention, and such improvements and modifications are also considered to be within the scope of protection of the invention.

Claims (7)

1. A knowledge graph construction method for auxiliary diagnosis is characterized by comprising the following steps:
step 1, constructing a cardiovascular disease corpus, extracting public data of a professional medical website by using a knowledge extraction method based on a wrapper, and constructing an original corpus;
step 2, extracting unstructured medical record data by using a knowledge extraction algorithm based on a Multi-Attention structure to supplement a disease corpus;
step 3, aiming at the characteristics of multi-source heterogeneous data, using a multi-scheme knowledge fusion strategy to perform entity disambiguation, entity linking and knowledge merging on the extracted data information; solving the problems of knowledge redundancy and ambiguity through entity disambiguation based on a clustering algorithm, knowledge merging based on Pandas, and entity linking based on Fusion similarity;
and step 4, further optimizing the data, and completing construction of the knowledge graph oriented to cardiovascular-disease auxiliary diagnosis through knowledge representation and graph-database storage.
2. The knowledge graph construction method for auxiliary diagnosis according to claim 1, wherein the step 1 specifically comprises the following steps:
the knowledge extraction of the semi-structured data facing the professional medical website is completed by using the knowledge extraction based on the wrapper; the wrapper is a rule-based text information extraction model, which comprises the following steps: the system comprises a rule base, a rule execution module and an information conversion module; constructing a user-agent set during crawling work, and randomly selecting one user-agent during each request; and pausing for several seconds after each grabbing, then performing crawler again, and finally saving the extracted information as a csv file for subsequent processing.
3. The diagnosis-assisting-oriented knowledge graph building method according to claim 1, wherein the step 2 specifically comprises the following steps:
the method comprises the following steps of utilizing a BERT-Bi-LSTM-CRF model based on a multi-head attention structure to complete knowledge extraction of unstructured data of medical records such as electronic medical records; the model is divided into three layers: the method comprises the following steps of (1) carrying out BERT pre-training on a model, a Bi-LSTM semantic fusion layer and a CRF optimal output layer; after the marked data input model, firstly, a first layer BERT pre-training model is used, text vectorization is realized by combining a multi-head attention model, different positions are simultaneously concerned in the extraction process to input information representing different subspaces, and a plurality of attention layers are used for parallel calculation; then, inputting the vector expression sequence of the text into a second Bi-LSTM semantic fusion layer, and carrying out further semantic coding to obtain global sequence characteristics; and finally, the data enters a third CRF optimized output layer, so that the label sequence which has the highest probability and is most consistent with the semantics is output.
4. The knowledge graph construction method for auxiliary diagnosis according to claim 1, wherein the step 3 specifically comprises the following steps:
an improved K-Means algorithm is adopted to automatically determine the number of cluster categories and perform clustering disambiguation; the overlapping structured data are merged into the existing knowledge base through Pandas; and the entity objects extracted from unstructured or semi-structured data are linked with the corresponding correct entity objects in the knowledge base using the Fusion similarity calculation method.
5. The diagnosis-assisted knowledge graph construction method according to claim 3, wherein the improved K-Means algorithm flow is as follows:
the file to be processed n initializes the number of clusters, k is the number of clusters with different diseases D1、D2Is collected as a file of
Figure FDA0003725427330000021
The integer part of (1); selecting an initial aggregation point according to
Figure FDA0003725427330000022
Figure FDA0003725427330000023
storing the aggregation points in a set S, and storing the index and the minimum distance in a set S'; calculating the differences of the minimum distances between successive clustering points and storing them in the set S'; starting from the point in S' with the largest distance difference, storing the preceding aggregation points in the set S; taking these as the initial cluster centres, applying the K-means clustering algorithm to obtain the clustering result; k cluster centres are thus obtained automatically, the final document set is obtained, and the disambiguation task is completed.
6. The knowledge graph construction method for auxiliary diagnosis according to claim 1, wherein the step 4 specifically comprises the following steps:
further optimization of the knowledge graph is completed by removing nodes irrelevant to the domain, together with the relations they contain, through a vector variance algorithm; knowledge representation is completed through supplementation and correction by domain experts; and the knowledge graph is stored and visualized using a Neo4j graph database.
7. The diagnosis-assisted knowledge graph construction method according to claim 6, wherein the vector variance algorithm comprises the following procedures:
treating the set of relationships as a directed graph, where S_i is contained in f_j, n is the number of links in S_i, e_k denotes an edge from S_i to f_j with weight w(e_k), E = {e_1, e_2, …, e_n} denotes the set of edges from node S_i to f_j, and P = {p_1, p_2, …, p_m} denotes the set of paths from node S_i to f_j; the degree of membership of node S_i to f_j is calculated using the following formulas:
[Formulas (16)-(18) are present only as formula images in the original.]
and removing nodes unrelated to the domain, together with the relations they contain, by setting a threshold.
Publications (1)

Publication Number Publication Date
CN115269865A true CN115269865A (en) 2022-11-01



