CN112687388A

CN112687388A - Interpretable intelligent medical auxiliary diagnosis system based on text retrieval

Info

Publication number: CN112687388A
Application number: CN202110021525.3A
Authority: CN
Inventors: 董守斌; 刘晓峰; 胡金龙; 袁华
Original assignee: Zhongshan Yishu Technology Co ltd
Current assignee: Zhongshan Yishu Technology Co ltd
Priority date: 2021-01-08
Filing date: 2021-01-08
Publication date: 2021-04-20
Anticipated expiration: 2041-01-08
Also published as: CN112687388B

Abstract

The invention discloses an interpretable intelligent medical auxiliary diagnosis system based on text retrieval, which comprises: a query preprocessing module for preprocessing a patient's medical record to obtain query words related to the patient's condition; a knowledge graph construction module for fusing the query words with an existing knowledge graph to form a new knowledge graph for the specific disease; a text retrieval module for retrieving, from a medical database, past cases and treatment schemes related to the query words by means of the query words and the new knowledge graph; and an interpretable auxiliary diagnosis module for explaining why the retrieved cases and treatment schemes are related to the patient's medical record. The invention can effectively solve the semantic coding problem of long-distance dependence in long documents, integrates the knowledge graph into the query words, and obtains the entity information or document fragments related to the query words from the weight values of the graph attention network on the knowledge graph and of the pre-trained self-attention model, thereby providing more accurate interpretable results for intelligent diagnosis.

Description

Interpretable intelligent medical auxiliary diagnosis system based on text retrieval

Technical Field

The invention relates to the technical field of text retrieval, in particular to an interpretable intelligent medical auxiliary diagnosis system based on text retrieval.

Background

With the rapid development of computer technology, computers are being integrated ever more deeply into various industries, where they are used to mine and analyze industry data. In the medical field in particular, deep learning methods have in recent years been used to extract features from massive data and discover the underlying patterns, thereby solving complex problems. As technology develops and computing power grows, deep learning is applied more and more widely in medicine, and intelligent medical treatment has achieved preliminary results in case analysis and disease prediction. However, a deep learning model is a black box: its interpretability is poor, and it cannot indicate which data play a decisive role in a medical diagnosis. Doctors therefore cannot fully trust the diagnosis result, which hinders the application of computer technology in the biomedical field. An interpretable model for intelligent medical auxiliary diagnosis makes the diagnosis result explainable: when a disease is diagnosed, the model gives not only the diagnosis result but also the relevant explanations, such as the basis of the diagnosis result and related cases.

The present invention uses text retrieval technology to retrieve relevant case information from case diagnosis results or biomedical documents, finds previous similar cases in that information, and provides them to doctors. Traditional retrieval methods, however, rely on exact word matching; even if words with the same or related meanings are added from a biomedical knowledge base, the semantic gap caused by exact matching cannot be closed well. Deep learning has recently been applied to text retrieval with good results. For example, a neural network maps the query words or expansion words and the words in the documents into a low-dimensional vector space, so that their semantics can be compared more effectively and words that are semantically similar but do not match exactly can be found. A deep network can also model the query sentence as a whole and match it against the overall semantics of the document. However, most neural retrieval models do not explain their results well: they return documents relevant to the query but do not indicate which words or segments in those documents are relevant, so doctors still have difficulty understanding why such results were retrieved. Furthermore, when a document is very long, existing neural networks do not capture its semantic information well, so the semantic relationship between the document and the query is poorly matched.

Text retrieval can make case diagnosis results interpretable, but only to a limited degree, and an external knowledge base needs to be introduced to further explain the retrieval results. In addition, the recently popular self-attention models can also contribute to the interpretability of results, and the same structure can be used to address the long-distance semantic dependence caused by overly long documents.

Disclosure of Invention

The invention aims to overcome the defects and shortcomings of the prior art by providing an interpretable intelligent medical auxiliary diagnosis system based on text retrieval. The system can effectively solve the semantic coding problem of long-distance dependence in long documents, integrates a knowledge graph into the query words, and obtains the entity information or document fragments related to the query words from the weight values of the graph attention network on the knowledge graph and of the pre-trained self-attention model, thereby providing a more accurate interpretable result for intelligent diagnosis.

In order to achieve the purpose, the technical scheme provided by the invention is as follows: an interpretable intelligent medical-assisted diagnosis system based on text retrieval, comprising:

a query preprocessing module for preprocessing the patient's medical record to obtain query words related to the patient's condition;

a knowledge graph construction module for fusing the query words with an existing knowledge graph to form a new knowledge graph for the specific disease;

a text retrieval module for retrieving, from a medical database, past cases and treatment schemes related to the query words by means of the query words and the new knowledge graph;

an interpretable auxiliary diagnosis module for explaining why the retrieved cases and treatment schemes are related to the patient's medical record.

Further, the preprocessing of the patient medical record by the query preprocessing module comprises word segmentation, punctuation removal, stop word removal and spelling correction, and then noun phrases and verb phrases are kept through grammar analysis and are used as entities for subsequent knowledge graph fusion.
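The preprocessing steps listed above can be illustrated with a minimal Python sketch. It is not the patented implementation: the stop-word list and the spelling table are hypothetical placeholders, and the grammar analysis that keeps noun and verb phrases is omitted because it requires a trained parser.

```python
import re

# Hypothetical, deliberately tiny stop-word list; a real system would use a
# complete list from an NLP toolkit.
STOP_WORDS = {"a", "an", "the", "of", "with", "and", "is", "was", "has", "had"}

# Hypothetical correction table standing in for a real spell checker.
SPELLING_FIXES = {"leukaemia": "leukemia", "feaver": "fever"}


def preprocess(record):
    """Segment words, drop punctuation and stop words, fix known misspellings."""
    # Word segmentation and punctuation removal: keep alphanumeric runs only.
    tokens = re.findall(r"[a-z0-9]+", record.lower())
    # Stop-word removal.
    tokens = [t for t in tokens if t not in STOP_WORDS]
    # Spelling correction by dictionary lookup.
    return [SPELLING_FIXES.get(t, t) for t in tokens]


query_terms = preprocess(
    "The patient has acute myeloid leukaemia, with high feaver and cough."
)
print(query_terms)
```

In a fuller pipeline the surviving tokens would then be grouped into noun and verb phrases before being matched against the knowledge graph.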

Further, the knowledge graph construction module obtains entity concepts, semantic interpretations and semantic types from the existing knowledge graph in the biomedical field through noun phrases and verb phrases obtained from the query preprocessing module, extracts subgraphs related to query words from the existing knowledge graph through knowledge fusion, and fuses to form a new knowledge graph G of the specific disease, wherein the knowledge graph G is defined as:

G＝{(h,r,t)|h,t∈ε,r∈R} (1)

in the formula, ε and R are the entity set and the relation set of the fused graph, respectively; and a triplet (h, r, t) represents one piece of factual knowledge: there is a relation r between the head entity h and the tail entity t.
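As an illustration of definition (1), the sketch below stores a knowledge graph as a set of (h, r, t) triples and extracts the subgraph that touches the query entities. The toy graph, the relation name `has_symptom` and the function `extract_subgraph` are assumptions made for this example; the source does not specify how the subgraph is selected.

```python
from typing import NamedTuple


class Triple(NamedTuple):
    head: str
    relation: str
    tail: str


# Toy stand-in for an existing biomedical knowledge graph (illustrative only).
EXISTING_GRAPH = {
    Triple("acute myeloid leukemia", "has_symptom", "fever"),
    Triple("acute myeloid leukemia", "has_symptom", "fatigue"),
    Triple("influenza", "has_symptom", "fever"),
    Triple("asthma", "has_symptom", "wheezing"),
}


def extract_subgraph(graph, query_entities):
    """Keep every triple whose head or tail is one of the query entities."""
    return {t for t in graph if t.head in query_entities or t.tail in query_entities}


# New graph G for the specific disease, built from the query entities.
G = extract_subgraph(EXISTING_GRAPH, {"acute myeloid leukemia", "cough"})
entities = {t.head for t in G} | {t.tail for t in G}  # entity set of G
relations = {t.relation for t in G}                   # relation set of G
print(sorted(entities), sorted(relations))
```

A one-hop filter is the simplest possible choice; a real fusion step would also have to map query phrases to canonical concept identifiers before matching.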

Further, the text retrieval module comprises a word embedding representation module, a graph embedding representation module and a text matching module;

the word embedding representation module uses a pre-trained self-attention model to obtain word embedding representations of the medical record to be retrieved and of the query words, respectively; the pre-trained self-attention model is a stack of 12 Transformer layers, where the i-th Transformer layer is given by formulas (2), (3) and (4):

$$O^{i}=\mathrm{softmax}\left(\frac{(QW_Q^{i})(KW_K^{i})^{T}}{\sqrt{d_k}}\right)VW_V^{i} \tag{2}$$

$$M^{i}=\mathrm{LN}(S^{i-1}+O^{i}) \tag{3}$$

$$S^{i}=\mathrm{ReLU}(M^{i}W_1^{i}+b_1^{i})W_2^{i}+b_2^{i} \tag{4}$$

in formula (2), Q, K and V are each the query words or the medical record to be retrieved, expressed as a two-dimensional matrix; $W_Q^{i}$, $W_K^{i}$ and $W_V^{i}$ are the two-dimensional weight matrices corresponding to Q, K and V, respectively; $d_k$ is the size of one dimension of these weight matrices and serves as the scaling value; softmax is the normalization operation; and $O^{i}$ is the sequence vector obtained by accumulating the similarities between words; in formula (3), LN denotes the residual network, $S^{i-1}$ is the output of layer i-1 and thus the input to layer i (at the first layer, the query words or the medical record to be retrieved), and $M^{i}$ is the output of the residual network; formula (4) is two fully connected layers, where $W_1^{i}$ and $W_2^{i}$ are weight matrices, $b_1^{i}$ and $b_2^{i}$ are offsets, ReLU is the activation function, and $S^{i}$ is the output of one Transformer layer, namely the new word vector embedding representation;

the graph embedding representation module uses a graph attention network to convert the word embedding representation of each query word into a graph embedding representation on the knowledge graph of the specific disease; the graph attention network learns the feature representation of the query words on the new knowledge graph, in essence by assigning different weights to a node and its neighbors on the knowledge graph and learning a new feature representation of the node; first, the word vector embedding representations of all nodes on the knowledge graph are $H=\{h_1,h_2,\ldots,h_o\}$, where $h_o$ is the word vector embedding of the o-th node; then, by means of a masked self-attention structure, the attention of the network is restricted to the neighbor node set $N_o$ of the o-th node, where the neighbor node set includes the node itself and a node represents a query word; the specific formula is shown in (5):

$$a_{o,j}=\frac{\exp\left(\mathrm{LeakyReLU}\left(W_a^{T}[Wh_o\,\|\,Wh_j]\right)\right)}{\sum_{v\in N_o}\exp\left(\mathrm{LeakyReLU}\left(W_a^{T}[Wh_o\,\|\,Wh_v]\right)\right)} \tag{5}$$

in the formula, W and $W_a$ are weight matrices; $h_o$, $h_j$ and $h_v$ are the o-th, j-th and v-th nodes, respectively; $\|$ denotes concatenation; LeakyReLU assigns a non-zero slope to all negative values; T denotes matrix transposition; and $a_{o,j}$ is the similarity of node o to node j; finally, the graph embedding representation of the node on the knowledge graph is obtained by weighting its neighbors, as shown in formula (6):

$$q_o=\mathrm{ReLU}\left(\sum_{j\in N_o}a_{o,j}W_hh_j\right) \tag{6}$$

in the formula, $W_h$ is a weight matrix and ReLU is the activation function; in this way, all query word embeddings are converted into graph embeddings by the graph attention network and expressed as the sequence $Q=[q_1,q_2,\ldots,q_o]$, where $q_o$ is the graph embedding representation of the o-th node, i.e., the o-th query word;

the text matching module applies average pooling to the graph embedding representation of the query words and convolution followed by maximum pooling to the word embedding representation of the medical record to be retrieved, and then obtains the matching score between the query words and the medical record to be retrieved by a cosine calculation; the pooled graph embedding representation of the query words is shown in (7):

$$\bar{q}=\text{mean-pooling}(Q) \tag{7}$$

where mean-pooling denotes averaging over Q and $\bar{q}$ is the average-pooled query vector; the convolution and maximum pooling of the medical record to be retrieved are shown in (8), (9) and (10):

$$p_{m,l}=\mathrm{ReLU}(W_l\odot D_{m-u:m+u}+b_l) \tag{8}$$

$$P=[p_1,p_2,\ldots,p_m] \tag{9}$$

$$\bar{d}=\text{max-pooling}(P) \tag{10}$$

where $D=\{d_1,d_2,\ldots,d_m\}$ is the word embedding representation sequence of the medical record to be retrieved obtained by the word embedding representation module, $d_m$ is the word embedding of the m-th word in the medical record to be retrieved, $D_{m-u:m+u}$ denotes the words from position m-u to position m+u in the medical record to be retrieved, and u is half the size of a convolution kernel; formula (8) is the convolution operation, where $b_l$ is an offset, $W_l$ is the l-th convolution kernel, ReLU is the activation function, $\odot$ is the dot product operation, and $p_{m,l}$ is the m-th scalar produced by the l-th convolution kernel; formula (9) is the feature map P obtained from the convolution kernels, where $p_m$ is the m-th vector obtained after applying all l convolution kernels; max-pooling in formula (10) is the maximum pooling operation, which takes the maximum over each column of the feature map to obtain the medical record vector $\bar{d}$;

the text matching module obtains the similarity score between the query vector and the medical record vector to be retrieved by calculating the cosine of the two vectors, as shown in formula (11):

$$\mathrm{score}(\bar{q},\bar{d})=\cos(\bar{q},\bar{d})=\frac{\bar{q}\cdot\bar{d}}{\|\bar{q}\|\,\|\bar{d}\|} \tag{11}$$

Further, the interpretable auxiliary diagnosis module sorts the matching scores between the medical records to be retrieved and the query words and selects the top S documents, where S is the number of documents selected; by visualizing the graph attention network and the pre-trained self-attention model, it obtains the weight value of each path on the knowledge graph and the weight values between the query words and the words in the documents, and marks the knowledge-aware propagation paths on the knowledge graph and the most relevant segments in the documents as the interpretable results of case diagnosis.

Compared with the prior art, the invention has the following advantages and beneficial effects:

1. In conjunction with the graph attention network, the representation of the query words is modeled on the knowledge graph. Unlike the prior art, in which expansion words are simply concatenated, the semantic information of the expansion words is encoded into the representation of the query words in proportions that reflect their importance, so that the query and the document are matched more accurately.

2. The document is coded by utilizing a pre-trained self-attention mechanism and a Transformer structure, so that the problem of long-distance dependent semantic coding caused by long documents can be effectively solved, and meanwhile, the structure can better match the semantic similarity between the query and the document.

3. The knowledge link on the knowledge graph and the queried document result can be respectively labeled by utilizing the graph attention network and the Transformer structure, the most relevant information fragment is obtained, and a more accurate interpretable result is provided.

Drawings

Fig. 1 is an architecture diagram of an interpretable intelligent medical-assisted diagnostic system.

FIG. 2 is an integrated graph of query terms and an existing knowledge graph.

FIG. 3 is a schematic diagram of the structure of a pre-trained self-attention model.

FIG. 4 is a schematic diagram of the structure of the graph attention network.

Detailed Description

The present invention will be described in further detail with reference to examples and drawings, but the present invention is not limited thereto.

As shown in fig. 1, the interpretative intelligent medical auxiliary diagnosis system based on text retrieval provided in the embodiment includes the following functional modules:

The query preprocessing module uses the spaCy tool to preprocess the query: word segmentation, punctuation removal, stop word removal and spelling correction are performed, and keyword information related to case diagnosis, such as noun phrases and verb phrases, is then retained through grammar analysis.

The query words are information related to the patient, including the disease name, gene name and some symptom information. After preprocessing, they are as follows:

Disease: acute myelogenous leukemia

Mutant gene: ARCH

Symptoms: debilitation with high fever, a maximum body temperature of 40 degrees, cough.

The knowledge graph construction module performs knowledge fusion on the preprocessed query keywords and the existing knowledge graph, such as UMLS, MeSH, NCBI and the like, and then forms a new knowledge graph related to a specific disease. Extracting subgraphs related to the query word from the existing knowledge graph, and fusing to form a new knowledge graph G of the specific disease, wherein the knowledge graph G is defined as:

G＝{(h,r,t)|h,t∈ε,r∈R} (1)

The specific process uses MetaMap to extract the UMLS concepts for each query word. Disease-related query expansion can be knowledge-integrated with the Lexigram tool, and gene-related query expansion can use entities extracted from NCBI; these extracted concepts and their name variants are fused with the query words, as shown in FIG. 2. FIG. 2 depicts some symptoms, mutant genes and other diseases associated with acute myeloid leukemia.

The text retrieval module comprises a word embedding representation module, a graph embedding representation module and a text matching module.

The word embedding representation module uses a pre-trained self-attention model to obtain word embedding representations of the medical record to be retrieved and of the query words, respectively; the pre-trained self-attention model is a stack of 12 Transformer layers. As shown in FIG. 3, formula (2) corresponds to the multi-head attention mechanism in the figure, formula (3) corresponds to the residual network in the figure, and formula (4) corresponds to the two fully connected layers in the figure. The query and the document are input into the Transformer in the formats "[CLS] Q [SEP]" and "[CLS] D [SEP]", respectively:

$$O^{i}=\mathrm{softmax}\left(\frac{(QW_Q^{i})(KW_K^{i})^{T}}{\sqrt{d_k}}\right)VW_V^{i} \tag{2}$$

$$M^{i}=\mathrm{LN}(S^{i-1}+O^{i}) \tag{3}$$

$$S^{i}=\mathrm{ReLU}(M^{i}W_1^{i}+b_1^{i})W_2^{i}+b_2^{i} \tag{4}$$

In formula (2), Q, K and V are each the query words or the medical record to be retrieved, expressed as a two-dimensional matrix; $W_Q^{i}$, $W_K^{i}$ and $W_V^{i}$ are the two-dimensional weight matrices corresponding to Q, K and V, respectively; $d_k$ is the size of one dimension of these weight matrices and serves as the scaling value; softmax is the normalization operation; and $O^{i}$ is the sequence vector obtained by accumulating the similarities between words. In formula (3), LN denotes the residual network, $S^{i-1}$ is the output of layer i-1 and thus the input to layer i (at the first layer, the query words or the medical record to be retrieved), and $M^{i}$ is the output of the residual network. Formula (4) is two fully connected layers, where $W_1^{i}$ and $W_2^{i}$ are weight matrices, $b_1^{i}$ and $b_2^{i}$ are offsets, ReLU is the activation function, and $S^{i}$ is the output of one Transformer layer, i.e. the new word vector embedding representation.
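The layer described by formulas (2) to (4) can be sketched in plain Python as follows. This is a single-head illustration under stated assumptions: the attention is taken to be standard scaled dot-product attention, LN is read as layer normalization applied to the residual sum, and all weights are toy values rather than pre-trained parameters.

```python
import math


def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def softmax_rows(a):
    """Normalize each row so that it sums to 1."""
    out = []
    for row in a:
        peak = max(row)
        exps = [math.exp(x - peak) for x in row]
        total = sum(exps)
        out.append([e / total for e in exps])
    return out


def layer_norm(a, eps=1e-5):
    """Normalize each row to zero mean and unit variance (no learned scale)."""
    out = []
    for row in a:
        mean = sum(row) / len(row)
        var = sum((x - mean) ** 2 for x in row) / len(row)
        out.append([(x - mean) / math.sqrt(var + eps) for x in row])
    return out


def attention(s, w_q, w_k, w_v):
    """Formula (2): scaled dot-product self-attention over the sequence s."""
    q, k, v = matmul(s, w_q), matmul(s, w_k), matmul(s, w_v)
    d_k = len(w_k[0])  # assumption: scale by the projected key dimension
    k_t = [list(col) for col in zip(*k)]
    scores = [[x / math.sqrt(d_k) for x in row] for row in matmul(q, k_t)]
    return matmul(softmax_rows(scores), v)


def transformer_layer(s_prev, w_q, w_k, w_v, w1, b1, w2, b2):
    """One layer: attention (2), residual plus normalization (3), two dense layers (4)."""
    o = attention(s_prev, w_q, w_k, w_v)
    residual = [[x + y for x, y in zip(r1, r2)] for r1, r2 in zip(s_prev, o)]
    m = layer_norm(residual)
    hidden = [[max(0.0, x + b) for x, b in zip(row, b1)] for row in matmul(m, w1)]
    return [[x + b for x, b in zip(row, b2)] for row in matmul(hidden, w2)]


identity = [[1.0, 0.0], [0.0, 1.0]]
zeros = [[0.0, 0.0], [0.0, 0.0]]
s0 = [[1.0, 0.0], [0.0, 1.0]]  # two toy word embeddings
s1 = transformer_layer(s0, identity, identity, identity,
                       identity, [0.0, 0.0], identity, [0.0, 0.0])
# With an all-zero query projection every score is equal, so attention averages V.
uniform = attention(s0, zeros, identity, identity)
```

Stacking twelve such layers, each with its own weights, gives the shape of the encoder described in the text; a multi-head version would run several `attention` calls in parallel and concatenate their outputs.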

After the query words and the medical record to be retrieved are input, $H=\{h_1,h_2,\ldots,h_n\}$ and $D=\{d_1,d_2,\ldots,d_m\}$ are obtained, where $h_n$ is the word embedding of the n-th query word and $d_m$ is the word embedding of the m-th word in the medical record to be retrieved.

The graph embedding representation module uses a graph attention network to convert the word embedding representation of each query word into a graph embedding representation on the knowledge graph of the specific disease. The graph attention network learns the feature representation of the query words on the new knowledge graph, in essence by assigning different weights to a node and its neighbors on the knowledge graph and learning a new feature representation of the node. First, the word vector embedding representations of the nodes on the knowledge graph are $H=\{h_1,h_2,\ldots,h_o\}$, where $h_o$ is the word vector embedding of the o-th node. Then, by means of a masked self-attention mechanism, the attention of the network is restricted to the neighbor node set $N_o$ of node o (the neighbor node set includes the node itself), where a node refers to a query word. The structure of the graph attention network is shown in FIG. 4. The similarity between node o and node j is calculated as shown in formula (5):

$$a_{o,j}=\frac{\exp\left(\mathrm{LeakyReLU}\left(W_a^{T}[Wh_o\,\|\,Wh_j]\right)\right)}{\sum_{v\in N_o}\exp\left(\mathrm{LeakyReLU}\left(W_a^{T}[Wh_o\,\|\,Wh_v]\right)\right)} \tag{5}$$

In the formula, W and $W_a$ are weight matrices; $h_o$, $h_j$ and $h_v$ are the o-th, j-th and v-th nodes, respectively; $\|$ denotes concatenation; LeakyReLU assigns a non-zero slope to all negative values; T denotes matrix transposition; and $a_{o,j}$ is the similarity of node o to node j. Finally, the graph embedding representation of the node on the knowledge graph is obtained by weighting its neighbors, as shown in formula (6):

$$q_o=\mathrm{ReLU}\left(\sum_{j\in N_o}a_{o,j}W_hh_j\right) \tag{6}$$

In the formula, $W_h$ is a weight matrix and ReLU is the activation function. In this way, all query word embeddings are converted into graph embeddings by the graph attention network and expressed as the sequence $Q=[q_1,q_2,\ldots,q_o]$, where $q_o$ is the graph embedding representation of the o-th node, i.e., the o-th query word.
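The neighbor weighting of formulas (5) and (6) can be sketched as below. The sketch assumes the standard graph attention formulation, in which the transformed features of two nodes are concatenated and scored by a weight vector `w_a`; the source calls $W_a$ a weight matrix, so this reading, the toy embeddings and the function names are all assumptions.

```python
import math


def leaky_relu(x, slope=0.2):
    """Keep positive values; scale negative values by a small non-zero slope."""
    return x if x > 0 else slope * x


def matvec(w, h):
    """Multiply matrix w (list of rows) by vector h."""
    return [sum(wi * hi for wi, hi in zip(row, h)) for row in w]


def attention_weights(o, neighbors, h, w, w_a):
    """Formula (5): softmax over the neighbor set of node o (which includes o)."""
    wh_o = matvec(w, h[o])
    scores = {}
    for j in neighbors:
        concat = wh_o + matvec(w, h[j])  # list concatenation
        scores[j] = leaky_relu(sum(a * c for a, c in zip(w_a, concat)))
    peak = max(scores.values())
    exps = {j: math.exp(s - peak) for j, s in scores.items()}
    total = sum(exps.values())
    return {j: e / total for j, e in exps.items()}


def graph_embedding(o, neighbors, h, w, w_a, w_h):
    """Formula (6): ReLU of the attention-weighted sum of transformed neighbors."""
    alpha = attention_weights(o, neighbors, h, w, w_a)
    agg = [0.0] * len(w_h)
    for j, a in alpha.items():
        for idx, val in enumerate(matvec(w_h, h[j])):
            agg[idx] += a * val
    return [max(0.0, x) for x in agg], alpha


identity = [[1.0, 0.0], [0.0, 1.0]]
h = {  # toy word embeddings of three knowledge graph nodes
    "acute myeloid leukemia": [1.0, 0.0],
    "fever": [0.0, 1.0],
    "cough": [1.0, 1.0],
}
q, alpha = graph_embedding(
    "acute myeloid leukemia", list(h), h,
    w=identity, w_a=[0.0, 0.0, 1.0, 0.0], w_h=identity,
)
print(alpha)
```

The returned `alpha` dictionary is what an explanation step would later visualize: one weight per edge leaving the query node.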

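The text matching module applies average pooling to the graph embedding representation of the query words and convolution followed by maximum pooling to the word embedding representation of the medical record to be retrieved, and then obtains the matching score between the query words and the medical record to be retrieved by a cosine calculation. The pooled graph embedding representation of the query words is shown in (7):

$$\bar{q}=\text{mean-pooling}(Q) \tag{7}$$

where mean-pooling denotes averaging over Q and $\bar{q}$ is the average-pooled query vector. The convolution and maximum pooling of the medical record to be retrieved are shown in (8), (9) and (10):

$$p_{m,l}=\mathrm{ReLU}(W_l\odot D_{m-u:m+u}+b_l) \tag{8}$$

$$P=[p_1,p_2,\ldots,p_m] \tag{9}$$

$$\bar{d}=\text{max-pooling}(P) \tag{10}$$

where $D=\{d_1,d_2,\ldots,d_m\}$ is the word embedding representation sequence of the medical record to be retrieved obtained by the word embedding representation module, $d_m$ is the word embedding of the m-th word in the medical record to be retrieved, $D_{m-u:m+u}$ denotes the words from position m-u to position m+u in the medical record to be retrieved, and u is half the size of a convolution kernel. Formula (8) is the convolution operation, where $b_l$ is an offset, $W_l$ is the l-th convolution kernel, ReLU is the activation function, $\odot$ is the dot product operation, and $p_{m,l}$ is the m-th scalar produced by the l-th convolution kernel. Formula (9) is the feature map P obtained from the convolution kernels, where $p_m$ is the m-th vector obtained after applying all l convolution kernels. Max-pooling in formula (10) is the maximum pooling operation, which takes the maximum over each column of the feature map to obtain the medical record vector $\bar{d}$.

The text matching module obtains the similarity score between the query vector and the medical record vector to be retrieved by calculating the cosine of the two vectors, as shown in formula (11):

$$\mathrm{score}(\bar{q},\bar{d})=\cos(\bar{q},\bar{d})=\frac{\bar{q}\cdot\bar{d}}{\|\bar{q}\|\,\|\bar{d}\|} \tag{11}$$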
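The matching step described here, average pooling of the query, convolution with maximum pooling of the medical record, and a cosine score, can be sketched as follows. The kernels, the embeddings and the choice to skip padding are assumptions for illustration; the number of kernels is set equal to the query embedding size so that the two pooled vectors are comparable.

```python
import math


def mean_pool(vectors):
    """Average the query embeddings dimension by dimension."""
    return [sum(col) / len(vectors) for col in zip(*vectors)]


def conv_max_pool(doc, kernels, biases, u):
    """Convolve windows of 2u+1 words, apply ReLU, then max-pool per kernel."""
    feature_map = []
    for m in range(u, len(doc) - u):  # no padding: only full windows are used
        window = doc[m - u : m + u + 1]
        p_m = []
        for kernel, bias in zip(kernels, biases):
            dot = sum(
                k * x
                for k_row, w_row in zip(kernel, window)
                for k, x in zip(k_row, w_row)
            )
            p_m.append(max(0.0, dot + bias))
        feature_map.append(p_m)
    # Column-wise maximum over the feature map gives one value per kernel.
    return [max(col) for col in zip(*feature_map)]


def cosine(a, b):
    """Cosine similarity, defined as 0.0 when either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


Q = [[1.0, 0.0], [0.0, 1.0]]                 # toy graph embeddings of two query words
D = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]     # toy word embeddings of a 3-word record
kernels = [
    [[1.0, 0.0], [1.0, 0.0], [1.0, 0.0]],    # sums the first embedding dimension
    [[0.0, 1.0], [0.0, 1.0], [0.0, 1.0]],    # sums the second embedding dimension
]
query_vec = mean_pool(Q)
doc_vec = conv_max_pool(D, kernels, biases=[0.0, 0.0], u=1)
score = cosine(query_vec, doc_vec)
print(query_vec, doc_vec, round(score, 6))
```

Because cosine similarity ignores vector length, the score depends only on the direction of the pooled vectors, which is why the differently scaled query and record vectors in this toy case still match perfectly.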

The interpretable auxiliary diagnosis module sorts the matching scores between the medical records to be retrieved and the query words and selects the top S documents, where S is the number of documents selected. By visualizing the graph attention network, it obtains the weight value of each path on the knowledge graph and marks the knowledge-aware propagation paths on the knowledge graph. As shown in FIG. 2, the graph attention network adaptively assigns different attention weight scores $a_1,a_2,\ldots,a_{11}$ to the connection paths in the knowledge graph. The greater the weight value of a neighbor of a query word, the more important that neighbor is, i.e., the more relevant it is to the retrieved result. Likewise, the pre-trained self-attention model provides the weight values between the query words and the words in the document, from which the most relevant segments in the document are marked as interpretable results of the intelligent case diagnosis, as shown by the bold fields in the interpretable results in FIG. 1.
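A minimal sketch of this ranking and highlighting step is given below. The record identifiers, scores and attention weights are made-up values, and `top_s` and `most_relevant_tokens` are hypothetical helper names; how the attention weights are actually read out of the trained models is not specified in the source.

```python
def top_s(scores, s):
    """Return the identifiers of the s records with the highest matching score."""
    return sorted(scores, key=scores.get, reverse=True)[:s]


def most_relevant_tokens(tokens, weights, k):
    """Return the k tokens that received the largest attention weight."""
    ranked = sorted(zip(tokens, weights), key=lambda pair: pair[1], reverse=True)
    return [token for token, _ in ranked[:k]]


# Made-up matching scores for three candidate medical records.
matching_scores = {"record_a": 0.91, "record_b": 0.42, "record_c": 0.77}
selected = top_s(matching_scores, s=2)

# Made-up attention weights between the query and the words of one record.
tokens = ["patient", "reported", "high", "fever", "yesterday"]
weights = [0.05, 0.05, 0.30, 0.50, 0.10]
highlights = most_relevant_tokens(tokens, weights, k=2)

print(selected, highlights)
```

The same selection can be applied to edge weights on the knowledge graph to pick the paths that are marked as the explanation.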

The above embodiments are preferred embodiments of the present invention, but the present invention is not limited to the above embodiments, and any other changes, modifications, substitutions, combinations, and simplifications which do not depart from the spirit and principle of the present invention should be construed as equivalents thereof, and all such changes, modifications, substitutions, combinations, and simplifications are intended to be included in the scope of the present invention.

Claims

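1. An interpretable intelligent medical-assisted diagnosis system based on text retrieval, comprising:

a query preprocessing module for preprocessing the patient's medical record to obtain query words related to the patient's condition;

a knowledge graph construction module for fusing the query words with an existing knowledge graph to form a new knowledge graph for the specific disease;

a text retrieval module for retrieving, from a medical database, past cases and treatment schemes related to the query words by means of the query words and the new knowledge graph;

an interpretable auxiliary diagnosis module for explaining why the retrieved cases and treatment schemes are related to the patient's medical record.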

2. The system of claim 1, wherein: the query preprocessing module preprocesses the patient medical record by word segmentation, punctuation removal, stop word removal and spelling correction, and then retains noun phrases and verb phrases through grammar analysis as entities for the subsequent knowledge graph fusion.

3. The system of claim 1, wherein: the knowledge graph construction module uses the noun phrases and verb phrases obtained from the query preprocessing module to obtain entity concepts, semantic interpretations and semantic types from an existing knowledge graph in the biomedical field, extracts the subgraphs related to the query words from the existing knowledge graph through knowledge fusion, and fuses them to form a new knowledge graph G of the specific disease, wherein the knowledge graph G is defined as:

G＝{(h,r,t)|h,t∈ε,r∈R} (1)

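4. The system of claim 1, wherein: the text retrieval module comprises a word embedding representation module, a graph embedding representation module and a text matching module;

the word embedding representation module uses a pre-trained self-attention model to obtain word embedding representations of the medical record to be retrieved and of the query words, respectively; the pre-trained self-attention model is a stack of 12 Transformer layers, where the i-th Transformer layer is given by formulas (2), (3) and (4):

$$O^{i}=\mathrm{softmax}\left(\frac{(QW_Q^{i})(KW_K^{i})^{T}}{\sqrt{d_k}}\right)VW_V^{i} \tag{2}$$

$$M^{i}=\mathrm{LN}(S^{i-1}+O^{i}) \tag{3}$$

$$S^{i}=\mathrm{ReLU}(M^{i}W_1^{i}+b_1^{i})W_2^{i}+b_2^{i} \tag{4}$$

in formula (2), Q, K and V are each the query words or the medical record to be retrieved, expressed as a two-dimensional matrix; $W_Q^{i}$, $W_K^{i}$ and $W_V^{i}$ are the two-dimensional weight matrices corresponding to Q, K and V, respectively; $d_k$ is the size of one dimension of these weight matrices and serves as the scaling value; softmax is the normalization operation; and $O^{i}$ is the sequence vector obtained by accumulating the similarities between words; in formula (3), LN denotes the residual network, $S^{i-1}$ is the output of layer i-1 and thus the input to layer i (at the first layer, the query words or the medical record to be retrieved), and $M^{i}$ is the output of the residual network; formula (4) is two fully connected layers, where $W_1^{i}$ and $W_2^{i}$ are weight matrices, $b_1^{i}$ and $b_2^{i}$ are offsets, ReLU is the activation function, and $S^{i}$ is the output of one Transformer layer, namely the new word vector embedding representation;

the graph embedding representation module uses a graph attention network to convert the word embedding representation of each query word into a graph embedding representation on the knowledge graph of the specific disease; the graph attention network learns the feature representation of the query words on the new knowledge graph, in essence by assigning different weights to a node and its neighbors on the knowledge graph and learning a new feature representation of the node; first, the word vector embedding representations of all nodes on the knowledge graph are $H=\{h_1,h_2,\ldots,h_o\}$, where $h_o$ is the word vector embedding of the o-th node; then, by means of a masked self-attention structure, the attention of the network is restricted to the neighbor node set $N_o$ of the o-th node, where the neighbor node set includes the node itself and a node represents a query word; the specific formula is shown in (5):

$$a_{o,j}=\frac{\exp\left(\mathrm{LeakyReLU}\left(W_a^{T}[Wh_o\,\|\,Wh_j]\right)\right)}{\sum_{v\in N_o}\exp\left(\mathrm{LeakyReLU}\left(W_a^{T}[Wh_o\,\|\,Wh_v]\right)\right)} \tag{5}$$

in the formula, W and $W_a$ are weight matrices; $h_o$, $h_j$ and $h_v$ are the o-th, j-th and v-th nodes, respectively; $\|$ denotes concatenation; LeakyReLU assigns a non-zero slope to all negative values; T denotes matrix transposition; and $a_{o,j}$ is the similarity of node o to node j; finally, the graph embedding representation of the node on the knowledge graph is obtained by weighting its neighbors, as shown in formula (6):

$$q_o=\mathrm{ReLU}\left(\sum_{j\in N_o}a_{o,j}W_hh_j\right) \tag{6}$$

in the formula, $W_h$ is a weight matrix and ReLU is the activation function; in this way, all query word embeddings are converted into graph embeddings by the graph attention network and expressed as the sequence $Q=[q_1,q_2,\ldots,q_o]$, where $q_o$ is the graph embedding representation of the o-th node, i.e., the o-th query word;

the text matching module applies average pooling to the graph embedding representation of the query words and convolution followed by maximum pooling to the word embedding representation of the medical record to be retrieved, and then obtains the matching score between the query words and the medical record to be retrieved by a cosine calculation; the pooled graph embedding representation of the query words is shown in (7):

$$\bar{q}=\text{mean-pooling}(Q) \tag{7}$$

where mean-pooling denotes averaging over Q and $\bar{q}$ is the average-pooled query vector; the convolution and maximum pooling of the medical record to be retrieved are shown in (8), (9) and (10):

$$p_{m,l}=\mathrm{ReLU}(W_l\odot D_{m-u:m+u}+b_l) \tag{8}$$

$$P=[p_1,p_2,\ldots,p_m] \tag{9}$$

$$\bar{d}=\text{max-pooling}(P) \tag{10}$$

where $D=\{d_1,d_2,\ldots,d_m\}$ is the word embedding representation sequence of the medical record to be retrieved, $d_m$ is the word embedding of the m-th word, $D_{m-u:m+u}$ denotes the words from position m-u to position m+u, and u is half the size of a convolution kernel; in formula (8), $b_l$ is an offset, $W_l$ is the l-th convolution kernel, ReLU is the activation function, $\odot$ is the dot product operation, and $p_{m,l}$ is the m-th scalar produced by the l-th convolution kernel; in formula (9), P is the feature map and $p_m$ is the m-th vector obtained after applying all l convolution kernels; max-pooling in formula (10) takes the maximum over each column of the feature map to obtain the medical record vector $\bar{d}$;

the text matching module obtains the similarity score between the query vector and the medical record vector to be retrieved by calculating the cosine of the two vectors, as shown in formula (11):

$$\mathrm{score}(\bar{q},\bar{d})=\cos(\bar{q},\bar{d})=\frac{\bar{q}\cdot\bar{d}}{\|\bar{q}\|\,\|\bar{d}\|} \tag{11}$$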

5. The system of claim 1, wherein: the interpretable auxiliary diagnosis module sorts the matching scores between the medical records to be retrieved and the query words and selects the top S documents, where S is the number of documents selected; by visualizing the graph attention network and the pre-trained self-attention model, it obtains the weight value of each path on the knowledge graph and the weight values between the query words and the words in the documents, and marks the knowledge-aware propagation paths on the knowledge graph and the most relevant segments in the documents as the interpretable results of case diagnosis.