CN115269865A - Knowledge graph construction method for auxiliary diagnosis - Google Patents
- Publication number
- CN115269865A (application number CN202210765651.4A)
- Authority
- CN
- China
- Prior art keywords
- knowledge
- data
- extraction
- knowledge graph
- clustering
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/30—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
- G06F16/36—Creation of semantic tools, e.g. ontology or thesauri
- G06F16/367—Ontology
-
- G—PHYSICS
- G16—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
- G16H—HEALTHCARE INFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR THE HANDLING OR PROCESSING OF MEDICAL OR HEALTHCARE DATA
- G16H50/00—ICT specially adapted for medical diagnosis, medical simulation or medical data mining; ICT specially adapted for detecting, monitoring or modelling epidemics or pandemics
- G16H50/20—ICT specially adapted for medical diagnosis, medical simulation or medical data mining; ICT specially adapted for detecting, monitoring or modelling epidemics or pandemics for computer-aided diagnosis, e.g. based on medical expert systems
Abstract
The invention discloses a knowledge graph construction method for auxiliary diagnosis. It first proposes a knowledge extraction algorithm based on a Multi-Attention structure and a knowledge extraction method based on a wrapper, which efficiently and accurately extract the public data of medical websites and electronic medical record data. It then proposes a multi-scheme knowledge fusion strategy for the characteristics of multi-source heterogeneous data, which solves the problems of knowledge redundancy and ambiguity. Finally, it proposes a knowledge representation and optimization scheme for the characteristics of the multi-source heterogeneous data, completing the construction of a knowledge graph for the auxiliary diagnosis of cardiovascular diseases. The method focuses on the multi-source heterogeneous characteristics of disease data and provides multi-scheme knowledge extraction, knowledge fusion and optimization strategies for them, so that deeper disease data can be mined and a knowledge graph better suited to disease auxiliary diagnosis can be accurately constructed.
Description
Technical Field
The invention relates to a knowledge graph construction method for auxiliary diagnosis, and belongs to the technical field of the Internet and artificial intelligence.
Background
At present, people in all countries suffer to different degrees from various diseases, and prevention and treatment work in the disease field depends highly on the experience and knowledge of medical staff or experts. Owing to complex pathology, limited medical resources and similar problems, society still faces no small challenge in providing comprehensive and effective prevention and treatment measures for the public. Computer-aided high-risk prediction of disease is therefore a promising and significant research topic: it can effectively relieve the pressure of scarce medical resources and advance disease prevention and treatment work.
With the arrival of the Internet and artificial-intelligence era, medical informatization and intelligent medical treatment have become a new direction for promoting the steady development of traditional medicine. In actual treatment, the electronic medical record is gradually replacing the handwritten record, and a large amount of structured and unstructured data, represented by electronic medical records and health records, accumulates during the treatment of cardiovascular-disease patients. These data are important medical information resources in the disease field and provide a powerful data basis for data mining and data analysis tasks in the field. In addition, auxiliary diagnosis systems in intelligent medicine have received wide attention from medical service providers, and various products are applied in different disease scenarios. Developing diagnosis auxiliary systems based on knowledge graphs is one of the research hotspots in this field: the knowledge relations and storage characteristics of the knowledge graph can effectively extract useful information from the many kinds of disease medical data, and as an auxiliary tool for doctors the knowledge graph is of great significance for improving doctors' working efficiency, liberating productivity, alleviating the shortage of medical resources, and automating research on the prevention of cardiovascular diseases. However, the multi-source heterogeneity of disease data makes it difficult to accurately construct a knowledge graph for disease-assisted diagnosis, so how to design an effective method to process multi-dimensional heterogeneous disease-related data and accurately construct such a knowledge graph has become an extremely important problem.
Disclosure of Invention
The invention provides a knowledge graph construction method for auxiliary diagnosis, aimed at the problem of how to effectively organize multi-source heterogeneous disease data.
In order to achieve the purpose, the invention is realized by the following technical scheme:
a knowledge graph construction method for auxiliary diagnosis comprises the following steps:
step 1, constructing a cardiovascular disease corpus, extracting public data of a professional medical website by using a knowledge extraction method based on a wrapper, and constructing an original corpus;
step 2, extracting unstructured medical record data by using a knowledge extraction algorithm based on a Multi-Attention structure to supplement a disease corpus;
step 3, using a multi-scheme knowledge fusion strategy for the characteristics of multi-source heterogeneous data to perform entity disambiguation, entity linking and knowledge merging on the extracted data information, solving the problems of knowledge redundancy and ambiguity through entity disambiguation based on a clustering algorithm, knowledge merging based on Pandas, and entity linking based on Fusion similarity;
and 4, further optimizing data, and completing construction work of the knowledge graph facing the cardiovascular disease auxiliary diagnosis through knowledge representation and graph database storage.
Further, the step 1 specifically includes the following steps:
the knowledge extraction of the semi-structured data of professional medical websites is completed using wrapper-based knowledge extraction; the wrapper is a rule-based text-information extraction model comprising a rule base, a rule execution module and an information conversion module; during crawling, a user-agent set is constructed and one user-agent is randomly selected for each request; after each grab the crawler pauses for several seconds before continuing, and finally the extracted information is saved as a csv file for subsequent processing.
Further, the step 2 specifically includes the following steps:
knowledge extraction from unstructured medical record data such as electronic medical records is completed using a BERT-Bi-LSTM-CRF model based on a multi-head attention structure; the model is divided into three layers: a BERT pre-training layer, a Bi-LSTM semantic fusion layer and a CRF optimal output layer; after the labelled data are input into the model, the first-layer BERT pre-training model, combined with the multi-head attention model, realizes text vectorization, attending to different positions simultaneously during extraction so that the input information of different subspaces is represented, with several attention layers computing in parallel; the vector representation sequence of the text is then input into the second, Bi-LSTM semantic fusion layer for further semantic coding to obtain global sequence features; finally the data enter the third, CRF optimized output layer, so that the label sequence with the highest probability that best fits the semantics is output.
Further, the step 3 specifically includes the following steps:
an improved K-Means algorithm is adopted to automatically complete the determination work of the number of the clustering categories and perform clustering disambiguation; merging the overlapped structured data into the existing knowledge base through Pandas; and linking the entity object extracted from the unstructured data or the semi-structured data with the corresponding correct entity object in the knowledge base by adopting a Fusion similarity calculation method.
Further, the improved K-Means algorithm flow is as follows:
for the n files to be processed, the number of clusters k is initialized as the integer part of a quantity computed from the document sets of the different diseases D1, D2, …; an initial aggregation point is selected according to equation (14); the aggregation point is stored in the set S, and its index and minimum distance are stored in a set S'; the difference of the minimum distances between two aggregation points is calculated and stored in a set S''; starting from the point in S'' with the largest distance difference, the preceding aggregation points are stored into the set S; starting from the k cluster centres, the K-means clustering algorithm is applied to obtain the clustering result; in this way k cluster centres are obtained automatically, the final document set is produced, and the disambiguation task is completed.
Further, the step 4 specifically includes the following steps:
further optimization of the knowledge graph is completed by removing the nodes unrelated to the domain, together with the relations they contain, through a vector variance algorithm; knowledge representation is completed with supplements and corrections from domain experts; and the knowledge graph is stored and visualized using a Neo4j graph database.
Further, the vector variance algorithm includes the following steps:
the set of relationships is treated as a directed graph, in which the node S_i is contained in f_j, n is the number of links of S_i, e_k denotes an edge from S_i to f_j with weight w(e_k), E = {e_1, e_2, … e_n} denotes the set of edges from S_i to f_j, and P = {p_1, p_2, … p_m} denotes the set of paths from S_i to f_j; the membership degree of node S_i with respect to f_j is then calculated, and nodes unrelated to the domain, together with the relations they contain, are removed by setting a threshold.
Compared with the prior art, the invention has the following beneficial effects:
the invention can process the relevant data of the multi-dimensional heterogeneous diseases and accurately construct the knowledge map facing the auxiliary diagnosis of the diseases. Compared with other methods, the method focuses on the multi-source heterogeneous characteristics of the disease data, and provides multi-scheme knowledge extraction, knowledge fusion and optimization strategies aiming at the disease data, so that deeper disease data can be mined, and a knowledge map more suitable for disease auxiliary diagnosis is constructed.
Drawings
FIG. 1 is an overall framework of a knowledge graph construction method for auxiliary diagnosis provided by the invention.
FIG. 2 is a workflow for implementing the wrapper-based knowledge extraction provided by the present invention.
FIG. 3 is a working example of a knowledge extraction model based on the Multi-Attention structure in the present invention.
FIG. 4 is a block diagram of the entity linking module process of the present invention.
Detailed Description
The technical solutions provided by the present invention will be described in detail below with reference to specific examples, and it should be understood that the following specific embodiments are only illustrative of the present invention and are not intended to limit the scope of the present invention.
Taking cardiovascular disease as an example, the overall framework of the proposed knowledge graph construction method for auxiliary diagnosis is shown in FIG. 1, and the specific implementation steps are as follows:
step 1, constructing a cardiovascular disease corpus, and performing efficient and accurate extraction work on public data of a professional medical website by using a knowledge extraction method based on a wrapper to construct an original corpus.
Knowledge extraction for the semi-structured data is completed using wrapper-based knowledge extraction. The wrapper is a rule-based text-information extraction model; because its rule set is easy to establish and its extraction precision is high, it is suitable for knowledge extraction from semi-structured data. A wrapper generally consists of three parts: a rule base, a rule execution module and an information conversion module; the wrapper's workflow is shown in FIG. 2. The rule base stores the crawling rules, the rule execution module extracts the corresponding rules from the rule base and executes them, and the information conversion module stores the crawled information in the database. The attribute knowledge structure of a medical entity consists of attribute-value pairs, and since the website data comprise regular attribute-value-pair information and semi-structured data, the medical website can be harvested well with a wrapper.
Because the required knowledge is highly specialized, the manually expert-edited "39 Health Network" is used as one of the data sources to ensure the completeness of the knowledge graph. The experiment adopts two measures to complete the crawling work: first, a user-agent set is constructed and one user-agent is randomly selected for each request; second, after each grab the crawler pauses for several seconds using the time module before continuing. In addition, the invention uses the threading module to implement a multithreaded crawler that handles the data grabbing and processing tasks in parallel, improving crawler efficiency. When data are captured, a request is sent to the server through requests.get to obtain the html text of the web page; the html text is then parsed into the easy-to-read "lxml" format with the bs4 module of the Beautiful Soup library; finally, the extracted information is saved as csv files.
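A minimal sketch of the wrapper's three modules might look as follows; the rules, attribute names and page layout are illustrative assumptions, not the patent's actual rule base, and the live crawler's user-agent rotation and pauses are only noted in comments so the sketch runs offline:

```python
import csv
import io
import re

# Hypothetical rule base: each rule maps an attribute name to a regex that
# captures its value from semi-structured HTML. (A real crawler would first
# fetch the page with requests.get, rotating user-agents and sleeping with
# time.sleep between grabs.)
RULE_BASE = {
    "disease": r'<h1 class="title">(.*?)</h1>',
    "symptom": r'<td>symptom</td>\s*<td>(.*?)</td>',
    "site":    r'<td>site</td>\s*<td>(.*?)</td>',
}

def execute_rules(html: str, rules: dict) -> dict:
    """Rule execution module: apply each rule, collecting attribute-value pairs."""
    record = {}
    for attr, pattern in rules.items():
        m = re.search(pattern, html, re.S)
        record[attr] = m.group(1).strip() if m else ""
    return record

def to_csv(records: list) -> str:
    """Information conversion module: render the extracted records as CSV text."""
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=list(RULE_BASE))
    writer.writeheader()
    writer.writerows(records)
    return buf.getvalue()

page = """
<h1 class="title">coronary heart disease</h1>
<table>
  <tr><td>symptom</td><td>chest pain</td></tr>
  <tr><td>site</td><td>heart</td></tr>
</table>
"""
record = execute_rules(page, RULE_BASE)
print(record["disease"])  # coronary heart disease
```
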
And 2, performing efficient and accurate extraction work on medical record data such as electronic medical records by using a knowledge extraction algorithm based on a Multi-Attention structure to supplement a disease corpus.
The invention uses a BERT-Bi-LSTM-CRF model based on a multi-head attention structure to complete knowledge extraction from unstructured medical record data such as electronic medical records. The model is divided into three layers: the BERT pre-training model, the Bi-LSTM semantic fusion layer and the CRF optimized output layer; a working example is shown in FIG. 3. The BERT model of the coding layer adopts a bidirectional Transformer as its encoder; because the Transformer models text with an attention mechanism, it has better parallel computing capability and a better ability to capture long-distance text features.
After the labelled data are input into the model, the first-layer BERT pre-training model, combined with the multi-head attention model, realizes text vectorization; during extraction, different positions are attended to simultaneously so that the input information of different subspaces is represented, and several attention layers compute in parallel. The invention introduces a multi-head attention structure to optimize the classical BERT model: a query and a set of key-value pairs are mapped to the output, i.e. a weighted sum of the values is computed, with the weight assigned to each value obtained from the query and the corresponding key. The output of the scaled dot-product attention layer is calculated by equation (1):

Attention(Q, K, V) = softmax(QK^T / √d_k) V    (1)

where the three inputs Q, K and V respectively denote the query matrix, the key matrix and the value matrix, and d is the size of the hidden unit of the BiLSTM layer, equal to d_h; in the experiment Q = K = V = H, where H = {h_1, h_2, h_3, … h_n} is the output of the BiLSTM layer. The multi-head attention layer first linearly projects the queries, keys and values h times through different projection layers; each projection layer is calculated as in equation (2):

head_i = Attention(QW_i^Q, KW_i^K, VW_i^V)    (2)

The h projection layers then perform the scaled dot-product attention of equation (1) in parallel, and finally the results of the h attention heads are concatenated and projected once more to obtain the final output, as in equation (3).
H' = Concat(head_1, head_2, head_3, …, head_h) W^O    (3)
Here W_i^Q, W_i^K and W_i^V are all projection-layer parameter matrices, with d_k = 2d_h/h, and W^O is also a trainable parameter. The BERT model is jointly trained mainly on two tasks: the masked language model (MLM) and next-sentence prediction (NSP). Inspired by cloze filling, the MLM task randomly selects 15% of the words during training; of these, 80% are covered with a "mask" symbol, 10% are replaced by other random words, and 10% keep the original word. The uncovered words are used to predict the covered words, so the model can feed the context on both the left and the right of a word into the model to predict the current word. The input of the NSP task is two sentences A and B; in training, with 50% probability the two sentences are a continuous context and with 50% probability they are not, and the model predicts whether sentence B is the sentence following sentence A, so as to judge the sentence relationship. In the fine-tuning stage the BERT model is adapted to different downstream tasks by adding a classifier or output layer; it can be applied to sentence-pair classification, sentence classification, question-answering and labelling tasks, and is characterized by generality and optimized performance.
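A compact NumPy sketch of the scaled dot-product and multi-head attention of equations (1)-(3), with Q = K = V = H as in the text; the dimensions, the per-head split d_k = d/h and the random projection matrices are simplifying assumptions (a real model learns the projections):

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def scaled_dot_product_attention(Q, K, V):
    # equation (1): Attention(Q, K, V) = softmax(QK^T / sqrt(d_k)) V
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)
    return softmax(scores) @ V

def multi_head_attention(H, h=4, rng=np.random.default_rng(0)):
    # H: (n, d) output of the BiLSTM layer; Q = K = V = H as in the text.
    n, d = H.shape
    d_k = d // h                      # per-head projection size (assumption)
    heads = []
    for _ in range(h):
        # equation (2): project Q, K, V, then apply equation (1) per head
        Wq, Wk, Wv = (rng.standard_normal((d, d_k)) * 0.1 for _ in range(3))
        heads.append(scaled_dot_product_attention(H @ Wq, H @ Wk, H @ Wv))
    Wo = rng.standard_normal((h * d_k, d)) * 0.1
    # equation (3): concatenate the h heads and project once more
    return np.concatenate(heads, axis=-1) @ Wo

H = np.random.default_rng(1).standard_normal((5, 8))  # 5 tokens, d = 8
out = multi_head_attention(H)
print(out.shape)  # (5, 8)
```
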
The vector representation sequence of the text obtained by the BERT pre-training model is input into the second-layer Bi-LSTM semantic fusion layer for further semantic coding to obtain global sequence features. The semantic fusion layer consists of a BiLSTM neural network layer and an attention mechanism. BiLSTM modelling means operating on the sentence in both the forward and backward directions, both connected to the output layer, so that the resulting output layer contains forward and backward context information simultaneously, while the attention mechanism attends to the candidate knowledge and the context. LSTM solves the gradient-explosion and gradient-vanishing problems of traditional RNNs, and BiLSTM can take in not only the information of the preceding sequence but also that of the following sequence. The essence of BiLSTM is two LSTM units, one running in the forward direction and one in the backward direction; the final state of each unit at time t is then represented as the concatenation of the forward and backward hidden states at time t.
In the knowledge extraction model, the BiLSTM is extended with context knowledge, and entity data are brought into the hidden layer for model training. The weight a_j of candidate knowledge reflects the relevance, or importance, of the j-th candidate knowledge x_t in the current context, and is obtained by the bilinear calculation of equation (4).
The matrix parameter W_k is learned in training and then combined with the candidate knowledge set K; the knowledge integration vector V_t is expressed by equation (5) as

V_t = Σ_j a_j k_j    (5)

where Σ_j a_j = 1. The hidden state of the BiLSTM and the knowledge integration vector V_t are combined, and the mixed vector h'_t is obtained using equation (6):
h'_t = h_t + V_t    (6)
If the current word has no candidate knowledge, i.e. the candidate knowledge set is the empty set, then for the semantics of a sequence the importance of each context semantic to the candidate knowledge should still be distinguished. Focusing on knowledge with intermediate gates inside the Bi-LSTM unit would complicate the structure of the Bi-LSTM and add extra learning parameters, and a drawback of the BiLSTM is that the amount of information decreases as the sequence length grows. Therefore the invention places an attention mechanism after the Bi-LSTM; it reduces the sequence distance and retains the context information and candidate knowledge in the sequence, further strengthening the attention paid to candidate knowledge and context. The calculation is given by equations (7-10):
M = tanh(H + K)    (7)
β = softmax(W^T M)    (8)
γ' = Hβ^T    (9)
γ = tanh(γ')    (10)
where H = {h_1, h_2, h_3, … h_n} is the output of the hidden layer of the BiLSTM neural network, β is a weight matrix, W is a parameter matrix, softmax is the normalized exponential function, and tanh is the activation function; the final output is the deep feature γ after the weighted change of the knowledge features is combined. The output is then fed into a classifier, which selects the maximum probability value as the label output of the sequence. Because the outputs of the softmax classifier are independent, the output sequence is disordered and ignores the local features of the sentence, lowering the accuracy of the trained model, so the CRF model is adopted to take the hidden sequence rules of the sentence into comprehensive consideration.
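Equations (7)-(10) can be sketched in NumPy as follows; the shapes of H, K and W (hidden dim × sequence length, and a parameter vector) are assumptions, since the source does not state them:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def knowledge_attention(H, K, W):
    # H: (d, n) BiLSTM hidden outputs; K: (d, n) candidate-knowledge features;
    # W: (d,) trainable parameter vector. Shapes are assumptions.
    M = np.tanh(H + K)       # equation (7)
    beta = softmax(W @ M)    # equation (8): attention weights over positions
    gamma_p = H @ beta       # equation (9): weighted combination of hidden states
    return np.tanh(gamma_p)  # equation (10): final deep feature

d, n = 6, 4
rng = np.random.default_rng(0)
gamma = knowledge_attention(rng.standard_normal((d, n)),
                            rng.standard_normal((d, n)),
                            rng.standard_normal(d))
print(gamma.shape)  # (6,)
```
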
The data processed by the semantic fusion layer enter the third-layer CRF optimized output layer, so that the tag sequence with the highest probability that best fits the semantics is output. The optimized output layer is a CRF layer. For the cardiovascular-disease knowledge extraction task, the dependency between adjacent labels must be considered; the CRF is a graphical model of the joint probability distribution represented by an undirected graph, which normalizes local features into global features and solves the label-bias problem by computing the probability distribution of the whole sequence. With Z = {z_1, z_2, z_3, … z_n} as input, it predicts the most likely tag sequence Y = {y_1, y_2, y_3, … y_n} using past and future tags. Let μ denote the parameter set of the CRF layer; the parameters can then be estimated by maximizing the log-likelihood of equation (11).
L(μ) = Σ_{(S,Y)∈Dataset} log p(Y|Z, μ)    (11)
where Y is the tag sequence corresponding to sentence S, and the probability p is the conditional probability of Y given S and μ. S_μ(Z, Y) is the score of the tag sequence Y for the given sentence, obtained by adding the transition score matrix A and the output Z of the Tanh layer according to equation (12); the conditional probability p can then be calculated by normalizing S_μ(Z, Y).
In equation (12), one term is the probability of the current character w_t carrying label y_t, and the other is the probability that w_t carries label y_t when the previous character w_{t-1} carries label y_{t-1}. Through dynamic programming, the invention can maximize the log-likelihood of equation (11) over the whole training set and find the best tag sequence for any input sentence by maximizing the score of equation (12) with the Viterbi algorithm. With this optimized output layer, the model can effectively use past and future labels to predict the current label, learn the hidden constraint rules of the labels, obtain the global optimum efficiently, and greatly improve the accuracy of entity recognition.
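The Viterbi decoding used at the CRF layer can be sketched as follows; the emission and transition scores below are toy values, not the model's learned parameters:

```python
import numpy as np

def viterbi_decode(emissions, transitions):
    """Find the highest-scoring label sequence for one sentence.

    emissions:   (n, L) score of label l for token t (the Tanh-layer output Z)
    transitions: (L, L) score of moving from label i to label j (matrix A)
    """
    n, L = emissions.shape
    score = emissions[0].copy()          # best score ending in each label
    back = np.zeros((n, L), dtype=int)   # backpointers
    for t in range(1, n):
        # total[i, j] = best score ending at t-1 in i, then i -> j, emit j
        total = score[:, None] + transitions + emissions[t][None, :]
        back[t] = total.argmax(axis=0)
        score = total.max(axis=0)
    # follow the backpointers from the best final label
    best = [int(score.argmax())]
    for t in range(n - 1, 0, -1):
        best.append(int(back[t][best[-1]]))
    return best[::-1]

# toy example: 3 tokens, 2 labels; emissions favour label 1 everywhere, but a
# strongly negative 1 -> 1 transition forces the decoder to alternate
em = np.array([[0.0, 1.0], [0.0, 1.0], [0.0, 1.0]])
tr = np.array([[0.0, 0.0], [0.0, -10.0]])
print(viterbi_decode(em, tr))  # [1, 0, 1]
```

This illustrates how the transition matrix imposes the hidden label-constraint rules that the independent softmax outputs cannot express.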
Step 3, in order to solve the problems of knowledge redundancy and ambiguity, a multi-scheme knowledge fusion strategy aiming at the characteristics of multi-source heterogeneous data is used for carrying out entity disambiguation, entity linking and knowledge merging on the extracted data information;
In this process the main objects of knowledge fusion are the entities, attributes and relations of the triples in the graph, and entity disambiguation aims to resolve the phenomenon of ambiguity. Early disambiguation methods introduced an external dictionary for word-sense disambiguation, mostly by comparing the context language environment of the terms: the number of terms shared between a term's interpretation and the dictionary determines its correct meaning. Although this unsupervised approach can interpret text and vocabulary, its matching demands are too strict to suit the disambiguation of complex knowledge. Considering that the traditional K-Means algorithm must determine the number of cluster categories in advance when performing a disambiguation task, that this determination carries much uncertainty on multi-source heterogeneous data, and that the traditional algorithm easily converges locally, and in order to better guarantee the rigor of the auxiliary diagnosis system, an improved K-Means algorithm is adopted to cluster the entity categories, automatically determining the number of cluster categories and performing clustering disambiguation. Applying the clustering algorithm to data added in the future can significantly reduce the difficulty of manual disambiguation.
The optimization principle of the clustering algorithm is to select the initial cluster points with the Max-Min principle: first the two objects x_{i1} and x_{i2} with the minimum Fusion similarity are selected as the first two cluster centres; then, for every other data point x_k, the distances to x_{i1} and x_{i2} are calculated, and the remaining aggregation points satisfy the recursion of equation (13), the (m+1)-th aggregation point being one example.
the specific flow of the improved K-Means clustering algorithm is as follows: the file to be processed n initializes the number of clusters, k is the number of clusters with different diseases D1、D2Is collected as a file ofThe integer part of (1); selecting an initial aggregation point according to equation (14)The aggregation point S is stored in a set, and the index and the minimum distance are stored in a set S(ii) a Calculating the difference value of the minimum distance between the two clustering points, and storing the difference value into a set S'; starting from the S' point with the largest distance difference, storing the previous aggregation points into a set S; starting from the clustering center K, a K-means clustering algorithm is applied to obtain a clustering result. Then k clustering centers can be automatically obtained, and a final document set is obtained, so that the disambiguation task can be completed more conveniently.
Given two sets A and B, the Fusion coefficient is defined as the ratio of the size of the intersection of A and B to the size of their union. At the same time, the problem of knowledge overlap still requires attention: it lengthens knowledge query time, increases the system's operating load and reduces working efficiency. Therefore, before the knowledge is stored, the overlapping cardiovascular-disease knowledge generated during construction is merged and the duplicate triples are deleted, improving system efficiency. Pandas is a mature data-analysis technology with two powerful data structures, Series and DataFrame; it provides convenient and efficient data operations and is effective for merge processing and data reduction. To complete the knowledge merging task more conveniently and quickly, the invention uses Pandas: after the structured data are obtained, the overlapping structured data are merged into the existing knowledge base through Pandas. The task focuses on knowledge merging at two levels, the schema layer and the data layer. Taking the "39" medical website as the base knowledge, the csv files obtained in the earlier workflow are read with Pandas into DataFrame structures; the corresponding attribute names in the DataFrames of the other data sources are selected and added to the DataFrame of the "39" medical website, completing knowledge merging at the schema layer; then the non-null values in the DataFrames of the other sources are selected to fill the null values of the corresponding entities in the medical-website DataFrame, completing knowledge merging at the data layer.
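The two-level merge described above can be sketched with Pandas as follows; the column and entity names are illustrative, not taken from the actual crawl:

```python
import pandas as pd

# Base knowledge from the primary source (names are illustrative).
base = pd.DataFrame({
    "disease": ["angina", "arrhythmia"],
    "symptom": ["chest pain", None],
})
other = pd.DataFrame({
    "disease": ["arrhythmia"],
    "symptom": ["palpitation"],
    "site":    ["heart"],          # attribute absent from the base schema
})

# Schema-layer merge: adopt attribute columns the base frame lacks.
for col in other.columns.difference(base.columns):
    base[col] = None

# Data-layer merge: fill null values of matching entities with non-null
# values from the other source, then drop duplicate rows.
merged = base.set_index("disease")
merged.update(other.set_index("disease"))  # update() skips NaN in `other`
merged = merged.reset_index().drop_duplicates()
print(merged.loc[merged["disease"] == "arrhythmia", "symptom"].item())  # palpitation
```
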
Entity linking is the operation of linking an entity object extracted from unstructured or semi-structured data to the corresponding correct entity object in the knowledge base. The basic idea is to select a set of candidate entity objects from the knowledge base for a given entity object and then link it to the correct one through similarity calculation. For the similarity calculation the Fusion similarity method is adopted: it is applicable to widely sparse data and can compare the similarity and difference between finite sample sets, a larger coefficient indicating a higher sample similarity, as shown in formulas (14) and (15). Entity linking is designed for the case in which two entities in the knowledge graph are written inconsistently, which easily causes retrieval failure; for example, the entity "sick sinus syndrome" may be retrieved during system operation while the entity actually stored in the knowledge graph carries a slightly different name, so retrieval failures in the knowledge graph are bound to occur. Therefore the invention links the entities obtained by the NER module with those already in its knowledge graph through entity linking. Concretely, the entity "sick sinus syndrome" acquired by named entity recognition is split into [ "sick", "sinus", "house", "knot", "complex", "symptomatology" ], a character-by-character split of the Chinese name; all related entities in the corresponding entity category are retrieved to form an entity list; after de-duplication this list becomes the candidate entity set; the Fusion similarity between the NER entity and each candidate entity is calculated; and the candidate entity with the maximum similarity is obtained. The workflow of this module is shown in FIG. 4.
In the sixth step, similarity calculation is carried out with Fusion between the phrase vector of the named-entity-recognition entity and that of each entity in the candidate set, and the candidate entity with the maximum similarity is output. Assuming that the phrase vector of the NER entity is A = [a1, a2, …, an] and the phrase vector of the candidate entity is B = [b1, b2, …, bn], the cosine similarity of the two, shown in formula (15), is cos(A, B) = (a1·b1 + a2·b2 + … + an·bn) / (sqrt(a1² + … + an²) · sqrt(b1² + … + bn²)).
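The cosine computation of formula (15) is standard and can be sketched directly; the sample vectors below are placeholders, not phrase vectors from the actual system.

```python
import math

def cosine_similarity(A, B):
    # cos(A, B) = (sum of a_i * b_i) / (||A|| * ||B||), per formula (15).
    dot = sum(a * b for a, b in zip(A, B))
    norm_a = math.sqrt(sum(a * a for a in A))
    norm_b = math.sqrt(sum(b * b for b in B))
    return dot / (norm_a * norm_b)
```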
and 4, further optimizing data, and completing construction work of the knowledge graph facing the cardiovascular disease auxiliary diagnosis through knowledge representation and graph database storage.
Knowledge graph optimization based on the vector variance algorithm is then carried out. The vector variance algorithm optimizes the knowledge graph mainly by removing nodes unrelated to the domain together with the relationships they contain. The invention treats the set of relationships as a directed graph, where S_i is contained in f_j, n is the number of links in S_i, e_k denotes an edge from S_i to f_j with weight w(e_k), E = {e_1, e_2, …, e_n} denotes the set of edges from node S_i to f_j, and P = {p_1, p_2, …, p_m} denotes the set of paths from node S_i to f_j. The invention calculates the degree of membership of node S_i to f_j using formulas (16)-(18), and removes domain-unrelated nodes and the relationships they contain by setting a threshold.
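A hedged sketch of the threshold-based pruning step. Formulas (16)-(18) are not reproduced in the text, so the membership degree here is *assumed* to be the weight of the edges from a node to the domain node divided by the node's total outgoing weight; the graph below and its weights are invented for illustration.

```python
# Threshold pruning sketch. graph[s] maps neighbor -> edge weight w(e_k).
# Membership degree is an assumption standing in for formulas (16)-(18).

def membership(graph: dict, s: str, f: str) -> float:
    # Weighted fraction of s's outgoing edges that point at the domain node f.
    edges = graph.get(s, {})
    total = sum(edges.values())
    return edges.get(f, 0.0) / total if total else 0.0

def prune(graph: dict, f: str, threshold: float) -> dict:
    # Keep the domain node plus every node whose membership meets the threshold.
    keep = {s for s in graph if s == f or membership(graph, s, f) >= threshold}
    # Drop pruned nodes and any relationships pointing at dropped nodes.
    return {s: {t: w for t, w in nbrs.items() if t in keep}
            for s, nbrs in graph.items() if s in keep}

g = {"myocarditis": {"cardiovascular": 3.0, "misc": 1.0},
     "weather": {"cardiovascular": 0.1, "misc": 2.0},
     "cardiovascular": {}}
pruned = prune(g, "cardiovascular", 0.5)
```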
Taking cardiovascular disease as an example, and with the knowledge graph supplemented and corrected by domain experts, 5 entities, 5 relationships, and 12 attributes are designed for the knowledge graph oriented to the field of cardiovascular-disease auxiliary diagnosis. Finally, after entity fusion, the knowledge is imported in the form of triples and the Neo4j graph database is used to visualize the knowledge graph, providing a knowledge graph with high accuracy and high coverage for subsequent auxiliary-diagnosis work. Neo4j uses a graph to represent data and their relationships; its basic units are entities, relationships, and attributes, so the relationships between entities in the knowledge graph can be inspected visually. For data query, owing to the high retrieval efficiency of the Cypher language and the use of adjacency indexes, fast and efficient access to targets can be achieved, the query speed is significantly improved, and convenience is provided for subsequent retrieval.
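One way to turn fused triples into Neo4j statements is to generate Cypher `MERGE` clauses, which create a node or relationship only if it does not already exist (so re-imports stay idempotent). The `Entity` label and the relation-naming convention below are assumptions, not the patent's actual schema.

```python
# Sketch: convert a (head, relation, tail) triple into a Cypher MERGE statement.

def triple_to_cypher(head: str, relation: str, tail: str) -> str:
    rel = relation.upper().replace(" ", "_")   # e.g. "has symptom" -> HAS_SYMPTOM
    return (f"MERGE (h:Entity {{name: '{head}'}}) "
            f"MERGE (t:Entity {{name: '{tail}'}}) "
            f"MERGE (h)-[:{rel}]->(t)")

stmt = triple_to_cypher("myocarditis", "has symptom", "chest tightness")
```

In practice the generated statement would be sent through a Neo4j session; string interpolation is shown here only for readability, and parameterized queries are preferable for real entity names.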
There are many methods to import data into a graph database; the invention chooses to convert the data into CSV format and then completes the data reading through the graph database language. In addition, the import tool bundled with Neo4j enables rapid local data import, and the successfully stored knowledge graph can be displayed through the system and applied to the auxiliary-diagnosis task.
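A sketch of preparing such CSV files in the header convention used by Neo4j's bulk import tool (`:ID`, `:LABEL`, `:START_ID`, `:END_ID`, `:TYPE` columns); the node and relationship rows are invented samples, and in-memory buffers stand in for the files.

```python
import csv
import io

# Sample rows (assumptions): one node file and one relationship file.
nodes = [("myocarditis", "Disease"), ("chest tightness", "Symptom")]
rels = [("myocarditis", "chest tightness", "HAS_SYMPTOM")]

node_buf, rel_buf = io.StringIO(), io.StringIO()

w = csv.writer(node_buf)
w.writerow(["name:ID", ":LABEL"])            # the :ID column uniquely keys each node
w.writerows(nodes)

w = csv.writer(rel_buf)
w.writerow([":START_ID", ":END_ID", ":TYPE"])  # endpoints refer back to node :ID values
w.writerows(rels)

node_csv, rel_csv = node_buf.getvalue(), rel_buf.getvalue()
```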
The technical means disclosed in the scheme of the present invention are not limited to those disclosed in the above embodiments, but also include technical schemes formed by any combination of the above technical features. It should be noted that those skilled in the art can make various improvements and modifications without departing from the principle of the present invention, and such improvements and modifications are also considered to be within the scope of the present invention.
Claims (7)
1. A knowledge graph construction method for auxiliary diagnosis is characterized by comprising the following steps:
step 1, constructing a cardiovascular disease corpus, extracting public data of a professional medical website by using a knowledge extraction method based on a wrapper, and constructing an original corpus;
step 2, extracting unstructured medical record data by using a knowledge extraction algorithm based on a Multi-Attention structure to supplement a disease corpus;
step 3, using a multi-scheme knowledge fusion strategy tailored to the characteristics of multi-source heterogeneous data to perform entity disambiguation, entity linking, and knowledge merging on the extracted data information; solving the problems of knowledge redundancy and ambiguity through entity disambiguation based on a clustering algorithm, knowledge merging based on Pandas, and entity linking based on Fusion similarity;
and 4, further optimizing data, and completing construction work of the knowledge graph facing the cardiovascular disease auxiliary diagnosis through knowledge representation and graph database storage.
2. The knowledge graph construction method for auxiliary diagnosis according to claim 1, wherein the step 1 specifically comprises the following steps:
the knowledge extraction of the semi-structured data facing the professional medical website is completed by using the knowledge extraction based on the wrapper; the wrapper is a rule-based text information extraction model, which comprises the following steps: the system comprises a rule base, a rule execution module and an information conversion module; constructing a user-agent set during crawling work, and randomly selecting one user-agent during each request; and pausing for several seconds after each grabbing, then performing crawler again, and finally saving the extracted information as a csv file for subsequent processing.
3. The diagnosis-assisting-oriented knowledge graph building method according to claim 1, wherein the step 2 specifically comprises the following steps:
the method comprises the following steps of utilizing a BERT-Bi-LSTM-CRF model based on a multi-head attention structure to complete knowledge extraction of unstructured data of medical records such as electronic medical records; the model is divided into three layers: the method comprises the following steps of (1) carrying out BERT pre-training on a model, a Bi-LSTM semantic fusion layer and a CRF optimal output layer; after the marked data input model, firstly, a first layer BERT pre-training model is used, text vectorization is realized by combining a multi-head attention model, different positions are simultaneously concerned in the extraction process to input information representing different subspaces, and a plurality of attention layers are used for parallel calculation; then, inputting the vector expression sequence of the text into a second Bi-LSTM semantic fusion layer, and carrying out further semantic coding to obtain global sequence characteristics; and finally, the data enters a third CRF optimized output layer, so that the label sequence which has the highest probability and is most consistent with the semantics is output.
4. The knowledge graph construction method for auxiliary diagnosis according to claim 1, wherein the step 3 specifically comprises the following steps:
an improved K-Means algorithm is adopted to automatically determine the number of cluster categories and perform clustering disambiguation; the overlapping structured data are merged into the existing knowledge base through Pandas; and the entity objects extracted from the unstructured or semi-structured data are linked to the corresponding correct entity objects in the knowledge base using the Fusion similarity calculation method.
5. The diagnosis-assisted knowledge graph construction method according to claim 3, wherein the improved K-Means algorithm flow is as follows:
the file to be processed n initializes the number of clusters, k is the number of clusters with different diseases D1、D2Is collected as a file ofThe integer part of (1); selecting an initial aggregation point according to
the aggregation point S is stored in a set, and the index and the minimum distance are stored in a set S'; the difference of the minimum distances between two aggregation points is calculated and stored in the set S'; starting from the point in S' with the largest distance difference, the preceding aggregation points are stored in the set S; starting from the k cluster centers, the K-means clustering algorithm is applied to obtain the clustering result; the k cluster centers are thus obtained automatically, the final document set is obtained, and the disambiguation task is completed.
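An illustrative sketch of the clustering core: initial aggregation points are chosen by a max-min distance heuristic (an approximation of the flow in this claim), then standard K-means assignment/update runs from those centers. The automatic determination of k from the distance-difference sets is assumed to have happened already, so k is passed in; the 2-D sample points are invented.

```python
import math

def choose_centers(points, k):
    # Start from the first file's point, then repeatedly pick the point
    # farthest from all already-chosen centers (max-min heuristic).
    centers = [points[0]]
    while len(centers) < k:
        centers.append(max(points, key=lambda p: min(math.dist(p, c) for c in centers)))
    return centers

def kmeans(points, k, iters=20):
    centers = choose_centers(points, k)
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:  # assign each point to its nearest center
            clusters[min(range(k), key=lambda i: math.dist(p, centers[i]))].append(p)
        # move each center to the mean of its cluster (keep empty clusters fixed)
        centers = [tuple(sum(xs) / len(cl) for xs in zip(*cl)) if cl else centers[i]
                   for i, cl in enumerate(clusters)]
    return centers, clusters

centers, clusters = kmeans([(0, 0), (0, 1), (10, 10), (10, 11)], k=2)
```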
6. The knowledge graph construction method for auxiliary diagnosis according to claim 1, wherein the step 4 specifically comprises the following steps:
further optimization of the knowledge graph is completed by removing nodes irrelevant to the domain and the relation contained by the nodes through a vector variance algorithm, knowledge representation is completed by supplementing and correcting domain experts, and the knowledge graph is visually stored by using a Neo4j graph database.
7. The diagnosis-assisted knowledge graph construction method according to claim 6, wherein the vector variance algorithm comprises the following procedures:
treating the set of relationships as a directed graph, where S_i is contained in f_j, n is the number of links in S_i, e_k denotes an edge from S_i to f_j with weight w(e_k), E = {e_1, e_2, …, e_n} denotes the set of edges from node S_i to f_j, and P = {p_1, p_2, …, p_m} denotes the set of paths from node S_i to f_j; the degree of membership of node S_i to f_j is calculated using the following formula:
and remove nodes that are not related to the domain and their contained relationships by setting a threshold.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202210765651.4A CN115269865A (en) | 2022-07-01 | 2022-07-01 | Knowledge graph construction method for auxiliary diagnosis |
Publications (1)
Publication Number | Publication Date |
---|---|
CN115269865A true CN115269865A (en) | 2022-11-01 |
Family
ID=83763833
Cited By (10)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN117059261A (en) * | 2023-08-21 | 2023-11-14 | 安徽农业大学 | Livestock and poultry disease diagnosis method and system based on multi-mode knowledge graph |
CN117313849A (en) * | 2023-10-12 | 2023-12-29 | 湖北华中电力科技开发有限责任公司 | Knowledge graph construction method and device for energy industry based on multi-source heterogeneous data fusion technology |
CN117577340A (en) * | 2023-10-26 | 2024-02-20 | 杭州乐九医疗科技有限公司 | Scientific research data acquisition configuration method and system based on data fusion |
CN117577340B (en) * | 2023-10-26 | 2024-04-16 | 杭州乐九医疗科技有限公司 | Scientific research data acquisition configuration method and system based on data fusion |
CN117423470A (en) * | 2023-10-30 | 2024-01-19 | 盐城市第三人民医院 | Chronic disease clinical decision support system and construction method |
CN117423470B (en) * | 2023-10-30 | 2024-04-23 | 盐城市第三人民医院 | Chronic disease clinical decision support system and construction method |
CN117271804A (en) * | 2023-11-21 | 2023-12-22 | 之江实验室 | Method, device, equipment and medium for generating common disease feature knowledge base |
CN117271804B (en) * | 2023-11-21 | 2024-03-01 | 之江实验室 | Method, device, equipment and medium for generating common disease feature knowledge base |
CN117558393A (en) * | 2024-01-12 | 2024-02-13 | 成都市龙泉驿区中医医院 | Anorectal patient information arrangement method and system based on artificial intelligence |
CN117558393B (en) * | 2024-01-12 | 2024-03-19 | 成都市龙泉驿区中医医院 | Anorectal patient information arrangement method and system based on artificial intelligence |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
Zhang et al. | A review on entity relation extraction | |
CN113239181B (en) | Scientific and technological literature citation recommendation method based on deep learning | |
CN115269865A (en) | Knowledge graph construction method for auxiliary diagnosis | |
CN111339313A (en) | Knowledge base construction method based on multi-mode fusion | |
CN111611361A (en) | Intelligent reading, understanding, question answering system of extraction type machine | |
CN110825721A (en) | Hypertension knowledge base construction and system integration method under big data environment | |
CN110598005A (en) | Public safety event-oriented multi-source heterogeneous data knowledge graph construction method | |
CN113254659A (en) | File studying and judging method and system based on knowledge graph technology | |
CN113707339B (en) | Method and system for concept alignment and content inter-translation among multi-source heterogeneous databases | |
CN114077673A (en) | Knowledge graph construction method based on BTBC model | |
CN112632250A (en) | Question and answer method and system under multi-document scene | |
Serdyukov et al. | Investigation of the genetic algorithm possibilities for retrieving relevant cases from big data in the decision support systems | |
Peng et al. | Path-based reasoning with K-nearest neighbor and position embedding for knowledge graph completion | |
Lang et al. | AFS graph: multidimensional axiomatic fuzzy set knowledge graph for open-domain question answering | |
CN114265936A (en) | Method for realizing text mining of science and technology project | |
Zhou et al. | Knowledge fusion and spatiotemporal data cleaning: A review | |
CN116629361A (en) | Knowledge reasoning method based on ontology learning and attention mechanism | |
CN114637846A (en) | Video data processing method, video data processing device, computer equipment and storage medium | |
CN113111136A (en) | Entity disambiguation method and device based on UCL knowledge space | |
Zhang et al. | Clinical short text classification method based on ALBERT and GAT | |
Zhen et al. | Frequent words and syntactic context integrated biomedical discontinuous named entity recognition method | |
Zhou et al. | Spatiotemporal data cleaning and knowledge fusion | |
Liu et al. | A resource retrieval method of multimedia recommendation system based on deep learning | |
LUO et al. | Project Articles | |
Wang et al. | SMAAMA: A named entity alignment method based on Siamese network character feature and multi-attribute importance feature for Chinese civil aviation |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||