CN115171871A

CN115171871A - Cardiovascular disease prediction method based on knowledge graph and attention mechanism

Info

Publication number: CN115171871A
Application number: CN202210485938.1A
Authority: CN
Inventors: 杨鹏; 王超余; 谢亮亮; 马卫东
Original assignee: Southeast University
Current assignee: Southeast University
Priority date: 2022-05-06
Filing date: 2022-05-06
Publication date: 2022-10-11

Abstract

The invention discloses a cardiovascular disease prediction method based on a knowledge graph and an attention mechanism. The method first constructs a cardiovascular disease corpus. It then builds a knowledge graph for the cardiovascular disease field by extracting attribute information of cardiovascular diseases from the original articles in the corpus and constructing a knowledge graph relation network. Next, it extracts feature vectors from cardiovascular disease description text: symptom entities in the text are obtained according to the relations between cardiovascular diseases and symptoms in the knowledge graph, the symptoms are represented as vectors by a TransR knowledge representation model, and the description-text feature vector is extracted by an attention-based LSTM (A-LSTM). Finally, cardiovascular diseases are identified by a softmax classifier. Compared with other methods, the method combines the cardiovascular disease knowledge graph with the attention mechanism to mine deeper disease features and thereby achieve more accurate prediction.

Description

Cardiovascular disease prediction method based on knowledge graph and attention mechanism

Technical Field

The invention relates to a cardiovascular disease prediction method based on a knowledge graph and an attention mechanism, and belongs to the technical field of internet and artificial intelligence.

Background

Cardiovascular diseases (CVDs) are a leading cause of death worldwide. Of the 57.7 million deaths reported worldwide in 2015, 17.9 million were caused by cardiovascular disease. In addition, cardiovascular disease places a non-negligible economic burden on patients and can lead to severe lifelong disability. However, it is estimated that 90% of CVDs can be prevented by appropriate measures, so predicting the onset of CVDs in individuals is of great importance in the medical field. Several well-established clinical procedures exist for detecting markers of CVDs, such as the electrocardiogram (ECG) and angiography, which are the definitive diagnostic methods for cardiovascular disease and often achieve high accuracy. Angiography, however, is generally expensive and invasive, while the accuracy of the electrocardiogram, another common method for diagnosis and prognosis of cardiovascular diseases, depends heavily on the experience and knowledge of medical personnel or experts. Computer-aided high-risk prediction of CVDs is therefore a promising and significant research topic. Traditional machine-learning-based high-risk prediction aims to build an automated computer system that extracts latent and critical features from a patient's historical Electronic Health Record (EHR). Compared with traditional pathological measures, such a system is practical, non-invasive and low-cost.

A key challenge for EHR-based high-risk prediction tasks is how to obtain an accurate representation of the patient, also known as patient representation learning or feature engineering. An EHR is composed of various information about a patient and can be represented as a time-ordered sequence of hospital visits, each of which contains a number of medical variables such as demographics, diagnoses, medications, procedures, laboratory test results, and vital signs. The number of unique medical variables in EHR systems is typically very large, so many existing predictive models handle the resulting sparse feature representation through various dimensionality reduction techniques. Conventional manual feature engineering often scales and generalizes poorly because it depends heavily on the individual experience of the researcher and on the particular EHR system. In recent years, some simple and extensible methods inspired by automatic feature representation have been proposed, such as One-Hot and Bag-of-Words (BoW). However, these approaches typically treat each feature as a discrete, independent word, so they cannot accurately capture the semantic information and dynamic associations hidden between features in EHR data. How to design an efficient method for the feature representation of sequential, high-dimensional, heterogeneous EHR data has therefore become an extremely important issue.

Disclosure of Invention

In view of the problems and deficiencies of the prior art, the invention provides a cardiovascular disease prediction method that fuses a knowledge graph and an attention mechanism, using a prediction model that fuses the two to predict the onset of cardiovascular diseases. The method combines the cardiovascular disease knowledge graph with deep learning: according to the cardiovascular-disease-related entity information in the knowledge graph, it extracts the relevant entities from the text provided by the user to enrich the cardiovascular disease features, further analyzes the cardiovascular disease information and cardiovascular disease images through a deep neural network model, and finally predicts cardiovascular diseases.

In order to achieve the purpose of the invention, the invention is realized by the following technical scheme:

A cardiovascular disease prediction method based on a knowledge graph and an attention mechanism comprises the following steps:

Step 1, constructing a cardiovascular disease corpus: knowledge articles on cardiovascular diseases are collected periodically by a distributed web crawler and preliminarily filtered by a wrapper to construct an original corpus;

Step 2, constructing a knowledge graph in the cardiovascular disease field: attribute information of cardiovascular diseases is extracted from the original articles in the cardiovascular disease corpus by using a rule set, named entity recognition and keyword extraction, respectively, to construct a knowledge graph relation network;

Step 3, extracting cardiovascular disease description text feature vectors: symptom entities in the text are obtained according to the relations between cardiovascular diseases and symptoms in the knowledge graph, the symptoms are represented as vectors by a TransR knowledge representation model, and the description-text feature vector is extracted by an attention-based LSTM;

Step 4, identifying the cardiovascular diseases by a softmax classifier.

Further, the step 1 specifically includes the following steps:

periodically acquiring original data from relevant cardiovascular disease websites by using a web crawler, counting the total number of knowledge data records in the basic knowledge base by using data mining technology, and calculating the minimum support count; judging in turn whether the count of each knowledge data record meets the minimum support, and outputting the records that meet it to obtain a plurality of frequent 1-itemsets; reading the frequent (k-1)-itemsets, generating frequent k-itemsets according to a pruning algorithm, and calculating the counts of the frequent k-itemsets, where k ≥ 2; judging whether the count of the frequent k-itemsets meets the minimum support, and if so, incrementing k by 1 and returning to the previous step, and if not, outputting the frequent k-itemsets; traversing all frequent 1-itemsets to obtain a plurality of frequent k-itemsets, and filtering part of the noise data with a dictionary-based black-and-white list mechanism; collecting data related to cardiovascular diseases provided by users; and preliminarily filtering the acquired data with a rule set and storing the data in the form of a file library.
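
The frequent-itemset procedure described above follows the general shape of Apriori-style mining. The following is a minimal sketch under stated assumptions: each knowledge data record is treated as a set of terms, the minimum support count is derived from a ratio of the total record count, and the function name `apriori` and the parameter `min_support_ratio` are illustrative and do not come from the source.

```python
from itertools import combinations

def apriori(transactions, min_support_ratio):
    """Return {itemset (frozenset): count} for every frequent itemset."""
    transactions = [frozenset(t) for t in transactions]
    # Minimum support count derived from the total number of records (assumed rule).
    min_count = min_support_ratio * len(transactions)

    def count_frequent(candidates):
        counts = {c: 0 for c in candidates}
        for t in transactions:
            for c in candidates:
                if c <= t:
                    counts[c] += 1
        return {c: n for c, n in counts.items() if n >= min_count}

    # Frequent 1-itemsets.
    frequent = count_frequent({frozenset([i]) for t in transactions for i in t})
    result = dict(frequent)
    k = 2
    while frequent:
        prev = set(frequent)
        # Join step: unions of frequent (k-1)-itemsets that have exactly k items.
        candidates = {a | b for a in prev for b in prev if len(a | b) == k}
        # Prune step: every (k-1)-subset of a candidate must itself be frequent.
        candidates = {c for c in candidates
                      if all(frozenset(s) in prev for s in combinations(c, k - 1))}
        frequent = count_frequent(candidates)
        result.update(frequent)
        k += 1
    return result
```

The loop stops as soon as no k-itemset meets the minimum support, which mirrors the "increment k and return to the previous step" rule in the text.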

Further, the step 2 specifically includes the following steps:

extracting attributes from the original corpus by using page attribute information; for complex articles, performing named entity recognition with a BiLSTM-CRF model; for descriptions of cardiovascular disease onset characteristics, extracting cardiovascular disease feature entities by a TF-IDF-based keyword extraction method, where the feature weight is calculated by the following formula:

where $tf_{ik}$ is the number of occurrences of feature item $t_k$ in document $d_i$, $n_k$ is the number of documents containing feature item $t_k$, and $N$ is the total number of documents; expressing the extracted attributes, the attribute names, and the relations between them as triples; and storing and managing the knowledge graph with Neo4j.
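
The weight formula itself is not reproduced in this text. The sketch below assumes the common TF-IDF weighting $w_{ik} = tf_{ik} \cdot \log(N / n_k)$, which is consistent with the symbol definitions above but may differ from the patented formula (for example, by a normalization term). The function names `tfidf_weights` and `top_keywords` are illustrative.

```python
import math
from collections import Counter

def tfidf_weights(documents):
    """documents: list of token lists. Returns one {term: weight} dict per document."""
    n_docs = len(documents)              # N: total number of documents
    doc_freq = Counter()                 # n_k: number of documents containing term k
    for tokens in documents:
        doc_freq.update(set(tokens))
    weights = []
    for tokens in documents:
        tf = Counter(tokens)             # tf_ik: occurrences of term k in document i
        weights.append({t: c * math.log(n_docs / doc_freq[t]) for t, c in tf.items()})
    return weights

def top_keywords(doc_weights, n=3):
    """Return the n highest-weighted terms of one document (ties broken alphabetically)."""
    ranked = sorted(doc_weights.items(), key=lambda kv: (-kv[1], kv[0]))
    return [term for term, _ in ranked[:n]]
```

Under this weighting, a term that occurs in every document receives weight zero, so only terms that discriminate between documents are kept as candidate feature entities.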

Further, the step 3 specifically includes the following steps:

training on the knowledge graph data with a TransR knowledge representation model, extracting the cardiovascular disease entities of the description text according to the knowledge graph, and obtaining an entity matrix $E^{m \times k}$ through the TransR knowledge representation model, where $k$ is the dimension of the entity vectors and $m$ is the number of entities in the description text; taking the entity matrix $E^{m \times k}$ of the description text as the input of the BiLSTM network, extracting text features with the attention-based LSTM, and selecting the output vector of the last LSTM unit

as the description-text feature vector, where

The feature vector of the hidden layer of the LSTM is expressed by the following formula:

Further, when the TransR knowledge representation model is trained, the optimizer adopts a whale optimization algorithm.
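
The source does not state how the whale optimization algorithm is coupled to the TransR loss or how the parameters are encoded. As a hedged illustration of the optimizer alone, the sketch below applies a common form of the whale optimization algorithm (encircling, random search, and spiral updates) to a generic objective function. The function name, the population size, the bounds, and the per-dimension choice between the best whale and a random whale are all assumptions of this sketch.

```python
import numpy as np

def whale_optimize(objective, dim, n_whales=20, n_iter=200, bounds=(-5.0, 5.0), seed=0):
    """Minimize `objective` over a box; returns (best_position, best_value)."""
    rng = np.random.default_rng(seed)
    lo, hi = bounds
    pos = rng.uniform(lo, hi, size=(n_whales, dim))
    fitness = np.array([objective(p) for p in pos])
    best = pos[fitness.argmin()].copy()
    best_fit = float(fitness.min())
    for t in range(n_iter):
        a = 2.0 * (1 - t / n_iter)                 # decreases linearly from 2 towards 0
        for i in range(n_whales):
            A = 2 * a * rng.random(dim) - a        # coefficient vector in [-a, a]
            C = 2 * rng.random(dim)                # coefficient vector in [0, 2]
            if rng.random() < 0.5:
                rand = pos[rng.integers(n_whales)]
                # |A| < 1: encircle the best whale; otherwise explore around a random whale.
                ref = np.where(np.abs(A) < 1, best, rand)
                pos[i] = ref - A * np.abs(C * ref - pos[i])
            else:
                l = rng.uniform(-1, 1)
                dist = np.abs(best - pos[i])
                # Logarithmic spiral around the best whale (shape constant b = 1).
                pos[i] = dist * np.exp(l) * np.cos(2 * np.pi * l) + best
            pos[i] = np.clip(pos[i], lo, hi)
            f = float(objective(pos[i]))
            if f < best_fit:
                best, best_fit = pos[i].copy(), f
    return best, best_fit
```

To use such an optimizer for embedding training, `objective` would have to map a flat parameter vector to the training loss of the knowledge representation model; that mapping is not described in the source and is left open here.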

Further, the step 4 specifically includes the following steps:

connecting the final patient representation vector to the softmax layer, and obtaining the cardiovascular disease prediction with the softmax classifier as follows:

where $y_i$ is the high-risk indicator of cardiovascular disease for the i-th patient,

is the risk score of the i-th patient calculated by the model.

Further, if $y_i$ equals 1, the case is a high-risk case; if $y_i$ equals 0, it is a normal case.
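
The prediction formula is not reproduced in this text. A minimal sketch of the classification step, assuming a two-class softmax over a linear layer: the weight matrix `W`, the bias `b`, the 0.5 threshold, and the function names are illustrative assumptions rather than details taken from the source.

```python
import numpy as np

def softmax(z):
    """Row-wise softmax."""
    z = z - z.max(axis=-1, keepdims=True)    # subtract the row maximum for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def predict_risk(patient_vecs, W, b):
    """patient_vecs: (n, d); W: (d, 2); b: (2,). Returns (labels, risk_scores)."""
    probs = softmax(patient_vecs @ W + b)
    risk_score = probs[:, 1]                 # probability assigned to the high-risk class
    y = (risk_score >= 0.5).astype(int)      # 1 = high-risk case, 0 = normal case
    return y, risk_score
```

With two classes, thresholding the high-risk probability at 0.5 is equivalent to picking the class with the larger softmax output.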

Beneficial effects:

1. When extracting cardiovascular disease entity information from a description text of cardiovascular disease characteristics, the prediction method provided by the invention uses knowledge graph technology and a TransR knowledge representation model, so that the extracted feature entities are more representative. A whale optimization algorithm is introduced into the training of the TransR knowledge representation model to improve the convergence speed of the model, and an attention-based neural network is used to better process the various kinds of high-dimensional information. By combining the knowledge graph, the knowledge representation model and deep learning, the method can mine deeper cardiovascular disease features and thereby achieve a more accurate recognition effect.

2. When high-dimensional, heterogeneous and temporal EHR patient data are input, the model of the invention can automatically mine the latent information in the data. It uses representation learning to combine the knowledge graph with deep learning, obtains an accurate, lower-dimensional feature representation of the patient from the disease entities in the knowledge graph, the symptom entities described by the user, and the relations between these entities, and adopts an attention mechanism and a long short-term memory (LSTM) neural network in the prediction model to better complete the high-risk prediction task.

Drawings

Fig. 1 is a flow chart of the cardiovascular disease prediction method based on a knowledge graph and an attention mechanism provided by the invention.

Fig. 2 is an architecture diagram for implementing the cardiovascular disease prediction method based on a knowledge graph and an attention mechanism provided by the invention.

Fig. 3 is a diagram of the knowledge representation model in the invention.

Detailed Description

The technical solutions provided by the invention are described in detail below with reference to specific examples. It should be understood that the following specific embodiments are only illustrative and do not limit the scope of the invention.

The flow of the cardiovascular disease prediction method based on a knowledge graph and an attention mechanism is shown in Fig. 1, and the model architecture is shown in Fig. 2. The specific implementation steps are as follows:

Step 1, constructing a cardiovascular disease corpus: knowledge articles on cardiovascular diseases are collected periodically by a distributed web crawler and preliminarily filtered by a wrapper to construct an original corpus;

Original data are periodically acquired from relevant cardiovascular disease websites by using a web crawler, and data related to cardiovascular diseases are also acquired through other channels; the acquired data are preliminarily filtered with a rule set to obtain a basic knowledge base; the total number of knowledge data records in the basic knowledge base is counted and the minimum support count is calculated; whether the count of each knowledge data record meets the minimum support is judged in turn, and the records that meet it are output to obtain a plurality of frequent 1-itemsets; the frequent (k-1)-itemsets are read, frequent k-itemsets are generated according to a pruning algorithm, and their counts are calculated, where k ≥ 2; whether the count of the frequent k-itemsets meets the minimum support is judged, and if so, k is incremented by 1 and the previous step is repeated, and if not, the frequent k-itemsets are output; all frequent 1-itemsets are traversed to obtain a plurality of frequent k-itemsets, and the remaining noise data are filtered in combination with a dictionary-based black-and-white list mechanism; the initialization data are thereby obtained and stored in the form of a file library.
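
The source does not specify how the dictionary-based black-and-white list decides what counts as noise. One possible reading, sketched below as an assumption, keeps an itemset only if it contains at least one whitelisted domain term and no blacklisted term. The function name and the example dictionaries are hypothetical.

```python
def filter_itemsets(itemsets, whitelist, blacklist):
    """Keep itemsets that contain a whitelisted term and no blacklisted term."""
    kept = []
    for itemset in itemsets:
        terms = set(itemset)
        if terms & blacklist:
            continue                 # a blacklisted term is present: treat as noise
        if terms & whitelist:
            kept.append(itemset)     # at least one domain term is present: keep
    return kept
```

For example, with a whitelist of {"angina", "chest pain"} and a blacklist of {"discount"}, the itemset ("chest pain", "angina") is kept, while ("discount", "angina") and ("weather", "news") are dropped.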

Step 2, constructing a knowledge graph in the cardiovascular disease field: attribute information of cardiovascular diseases is extracted from the original articles in the cardiovascular disease corpus by using a rule set, named entity recognition and keyword extraction, respectively, to construct a knowledge graph relation network;

Attributes are extracted from the original corpus by using page attribute information; for complex articles, named entity recognition is performed with a BiLSTM-CRF model; for descriptions of cardiovascular disease onset characteristics, cardiovascular disease feature entities are extracted by a TF-IDF-based keyword extraction method, where the feature weight is calculated by the following formula:

where $tf_{ik}$ is the number of occurrences of feature item $t_k$ in document $d_i$, $n_k$ is the number of documents containing feature item $t_k$, and $N$ is the total number of documents. The extracted attributes, the attribute names, and the relations between them are expressed as triples, and Neo4j is used to store and manage the knowledge graph.
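
As an illustration of the triple representation, the sketch below turns a (head, relation, tail) triple into a parameterized Cypher `MERGE` statement that could be sent to Neo4j. The node label `Entity`, the property `name`, the relation type `HAS_SYMPTOM`, and the function name `to_cypher` are hypothetical choices; the source does not describe its graph schema.

```python
import re
from typing import NamedTuple

class Triple(NamedTuple):
    head: str
    relation: str
    tail: str

def to_cypher(triple):
    """Return (statement, parameters) for an idempotent MERGE of one triple."""
    # Cypher relationship types cannot be passed as parameters,
    # so the type is validated before being interpolated into the statement.
    if not re.fullmatch(r"[A-Z_]+", triple.relation):
        raise ValueError(f"unsafe relation type: {triple.relation!r}")
    statement = (
        "MERGE (h:Entity {name: $head}) "
        "MERGE (t:Entity {name: $tail}) "
        f"MERGE (h)-[:{triple.relation}]->(t)"
    )
    return statement, {"head": triple.head, "tail": triple.tail}
```

Using `MERGE` rather than `CREATE` means that re-importing the same triple does not duplicate nodes or relationships. The returned pair could then be passed to a Neo4j client for execution.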

Step 3, extracting cardiovascular disease description text feature vectors: symptom entities in the text are obtained according to the relations between cardiovascular diseases and symptoms in the knowledge graph, the symptoms are represented as vectors by a TransR knowledge representation model, and the description-text feature vector is extracted by an attention-based LSTM (A-LSTM);

The knowledge graph data are trained with the TransR knowledge representation model, with the constructed knowledge graph data serving as the input of the representation model. The representation model takes

as its basic idea and maps entities and relations into a low-dimensional space, where $h$ denotes the head entity, $t$ the tail entity and $r$ the relation. Considering the complexity of the knowledge graph, the TransR model not only distinguishes entities from relations but also projects entities into the vector space of each relation, so that different semantic spaces are handled separately and many-to-many relations obtain more accurate vector representations, as shown in Fig. 3. For example, for the relation "symptom"

a mapping matrix $M_r \in \mathbb{R}^{k \times d}$ is assigned. The coronary heart disease vector

and the vascular sclerosis vector

are projected by

to obtain their projection vectors

The vector representation of the coronary heart disease entity is then calculated according to the following formula

where

The optimizer adopts a whale optimization algorithm to improve the convergence speed of the model. Then, from the trained entity representation vectors $e^h$, the entity matrix $E^{m \times k}$ of the description text can be formed, where $k$ is the dimension of the entity vectors and $m$ is the number of entities in the description text, specifically: $E^{m \times k} = [e^1, e^2, \ldots, e^m]$.
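
The projection and scoring formulas are not reproduced in this text. The sketch below follows the standard TransR formulation, in which the translation idea $h + r \approx t$ is applied after projection: $h_r = h M_r$, $t_r = t M_r$, and the score is $\lVert h_r + r - t_r \rVert_2^2$. That the patent uses exactly these formulas is an assumption. The margin-based loss and the function names are likewise illustrative.

```python
import numpy as np

def transr_score(h, t, r, M_r):
    """Squared L2 distance ||h M_r + r - t M_r||^2; lower means more plausible."""
    h_r = h @ M_r            # project the head entity (dim k) into the relation space (dim d)
    t_r = t @ M_r            # project the tail entity the same way
    diff = h_r + r - t_r
    return float(diff @ diff)

def margin_loss(pos_score, neg_score, margin=1.0):
    """Hinge loss that pushes true triples to score lower than corrupted ones."""
    return max(0.0, margin + pos_score - neg_score)

def entity_matrix(entity_vectors, names):
    """Stack the k-dimensional vectors of the m entities found in a text into E (m x k)."""
    return np.stack([entity_vectors[n] for n in names])
```

Here `entity_vectors` stands for a dictionary from entity name to trained embedding, so `entity_matrix` corresponds to forming $E^{m \times k} = [e^1, e^2, \ldots, e^m]$ for one description text.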

The entity matrix $E^{m \times k}$ of the description text is taken as the input of the BiLSTM network, text features are extracted with the attention-based LSTM (A-LSTM), and the output vector of the last LSTM unit is selected

as the feature vector of the description text, where
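
The attention formula is not reproduced in this text. As a hedged illustration of how an attention mechanism can weight LSTM hidden states, the sketch below uses a common additive formulation: scores $\tanh(H W) v$ are normalized by softmax and used to form a weighted sum of the hidden states. The parameters `W` and `v` and the function name are assumptions, and the hidden states `H` are taken as given rather than computed by an LSTM.

```python
import numpy as np

def attention_pool(H, W, v):
    """H: (m, d) hidden states; W: (d, a); v: (a,). Returns (context (d,), weights (m,))."""
    scores = np.tanh(H @ W) @ v                  # one unnormalized score per time step
    scores = scores - scores.max()               # numerical stability
    alpha = np.exp(scores) / np.exp(scores).sum()
    context = alpha @ H                          # attention-weighted sum of hidden states
    return context, alpha
```

When all scores are equal, the weights are uniform and the context vector reduces to the mean of the hidden states; trained parameters let the model emphasize the entities that matter most for the prediction.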

Step 4, identifying the cardiovascular diseases by a softmax classifier.

The final patient representation vector is connected to the softmax layer. The prediction of cardiovascular disease using the softmax classifier is as follows:

Here $y_i$ is the high-risk indicator of cardiovascular disease for the i-th patient. If $y_i$ equals 1, the case is a high-risk case; if $y_i$ equals 0, it is a normal case.

is the risk score of the i-th patient calculated by the model.

The technical means disclosed in the scheme of the invention are not limited to those disclosed in the above embodiments, but also include technical schemes formed by any combination of the above technical features. It should be noted that those skilled in the art can make modifications and adaptations without departing from the principles of the invention, and such modifications and adaptations are intended to fall within the scope of the invention.

Claims

1. A cardiovascular disease prediction method based on a knowledge graph and attention mechanism is characterized by comprising the following steps:

Step 3, extracting cardiovascular disease description text feature vectors: symptom entities in the text are obtained according to the relations between cardiovascular diseases and symptoms in the knowledge graph, the symptoms are represented as vectors by a TransR knowledge representation model, and the description-text feature vector is extracted by an attention-based LSTM;

Step 4, identifying the cardiovascular diseases by a softmax classifier.

2. The method for cardiovascular disease prediction based on knowledge-graph and attention mechanism as claimed in claim 1, wherein the step 1 comprises the following steps:

3. The method for cardiovascular disease prediction based on knowledge-graph and attention mechanism as claimed in claim 1, wherein the step 2 comprises the following steps:

4. The method for cardiovascular disease prediction based on knowledge-graph and attention mechanism as claimed in claim 1, wherein the step 3 comprises the following steps:

training on the knowledge graph data with a TransR knowledge representation model, extracting the cardiovascular disease entities of the description text according to the knowledge graph, and obtaining an entity matrix $E^{m \times k}$ through the TransR knowledge representation model, where $k$ is the dimension of the entity vectors and $m$ is the number of entities in the description text; taking the entity matrix $E^{m \times k}$ of the description text as the input of the BiLSTM network, extracting text features with the attention-based LSTM, and selecting the output vector of the last LSTM unit

as the feature vector of the description text, where

5. The method of claim 4, wherein the optimizer employs a whale optimization algorithm when training the TransR knowledge representation model.

6. The method of claim 1, wherein step 4 comprises the steps of:

wherein $y_i$ is the high-risk indicator of cardiovascular disease for the i-th patient,

is the risk score of the i-th patient calculated by the model.

7. The method of claim 6, wherein if $y_i$ equals 1, the case is a high-risk case, and if $y_i$ equals 0, it is a normal case.