CN112201340B - Electrocardiogram disease determination method based on Bayesian network filtering

Info

Publication number: CN112201340B
Application number: CN202010678145.2A
Authority: CN
Inventors: 韩京宇; 孙广鹏
Original assignee: Nanjing University of Posts and Telecommunications
Current assignee: Nanjing University of Posts and Telecommunications
Priority date: 2020-07-15
Filing date: 2020-07-15
Publication date: 2022-08-26
Anticipated expiration: 2040-07-15
Also published as: CN112201340A

Abstract

The invention discloses an electrocardiogram disease determination method based on Bayesian network filtering and belongs to the field of electrocardiogram disease diagnosis. Starting from trained base classifiers, the method uses a two-layer structure to determine the final disease labels: the first layer constructs a voter that screens the results of the base classifiers and generates an anchor disease set and a candidate disease set; the second layer constructs a Bayesian network by a hill climbing method based on BDe scores, and this network filters the anchor disease set and the candidate disease set to determine the final predicted disease set. The method has three characteristics: (1) the dependencies among disease labels are fully used, which improves the generalization ability of the model; (2) the two layers of filtering can correct the predictions of the base classifiers, which improves prediction accuracy; (3) because the causal relationships used to construct the Bayesian network are strong correlations, the model is stable and does not vary greatly across different data distributions.

Description

Electrocardiogram disease determination method based on Bayesian network filtering

Technical Field

The invention belongs to the technical field of intelligent diagnosis of electrocardiogram disorders based on machine learning, and relates to electrocardiogram disorder determination, in particular to a multi-label disorder determination method based on machine learning.

Background

Multi-label classification assigns multiple labels to a given sample, which may correspond to one or more labels in a label set. Define the feature space $X = \mathbb{R}^d$, where $d$ is the feature dimension, and let $L = \{L_1, L_2, \ldots, L_n\}$ denote the label space with $n$ labels. A training set $D = \{(x_i, L_j) \mid 1 \le i \le q,\ 1 \le j \le n\}$ is constructed, where $q$ is the size of the training set, $i$ is the sample index, $x_i \in X$ is a $d$-dimensional feature vector, and $L_j \in L$ is a label in $L$. The task of multi-label learning is to learn a multi-label classifier $h(\cdot)$ from the training set $D$ and to predict a new sample $x$ with it; the prediction result $h(x) \subseteq L$ is the set of class labels for sample $x$.

Current solutions to multi-label classification fall into two main types: strategies based on problem transformation and strategies based on algorithm adaptation. Problem transformation converts the multi-label problem into several single-label binary classification sub-models and then combines the results of the sub-models to obtain the final result. Algorithm adaptation adjusts popular learning algorithms so that they can handle multi-label learning.

Problem transformation strategies include Binary Relevance, Classifier Chains, Label Powerset and others. Binary relevance is the simplest method: its core idea is to decompose the multi-label classification problem into several binary classification problems. It is simple to implement and easy to understand, and the trained model performs well when the labels do not depend on one another; if the labels do depend on one another, the resulting model generalizes poorly and does not achieve the expected effect. The core idea of classifier chains is to convert the multi-label classification problem into a chain of binary classifiers, in which each later classifier is built on the prediction results of the classifiers before it. During model construction, the label order is first shuffled, and a model is then built for each label in order from the head of the chain to the tail. Classifier chains are relatively simple to implement and take the relations among labels into account, which strengthens generalization to some extent, but their effect depends on the label order, and a suitable label dependency is difficult to find. Label powerset converts multi-label classification into a multi-class problem: the label set of each sample is treated as a single class, and a multi-class classifier is built. This method considers combinations of labels but not the dependencies among them, and the number of classes may grow as the number of labels grows, which makes the model more complex and reduces its generalization ability to some extent.
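The difference between binary relevance and label powerset can be illustrated with a minimal sketch. The toy label sets and the helper names `binary_relevance_targets` and `label_powerset_targets` are hypothetical and serve only as an illustration; they are not part of the invention.

```python
from typing import Dict, FrozenSet, List

# Hypothetical toy data: the label set of each of four samples.
samples: List[FrozenSet[str]] = [
    frozenset({"L1", "L2"}),
    frozenset({"L1"}),
    frozenset({"L2", "L3"}),
    frozenset({"L1", "L2"}),
]
labels = ["L1", "L2", "L3"]


def binary_relevance_targets(samples: List[FrozenSet[str]],
                             labels: List[str]) -> Dict[str, List[int]]:
    """Binary relevance: one independent 0/1 target vector per label."""
    return {lab: [int(lab in s) for s in samples] for lab in labels}


def label_powerset_targets(samples: List[FrozenSet[str]]) -> List[int]:
    """Label powerset: every distinct label set becomes one class id."""
    class_ids: Dict[FrozenSet[str], int] = {}
    return [class_ids.setdefault(s, len(class_ids)) for s in samples]


br = binary_relevance_targets(samples, labels)   # three binary problems
lp = label_powerset_targets(samples)             # one multi-class problem
```

With n labels, label powerset can need up to 2^n classes, whereas binary relevance always needs n binary models but treats each label in isolation.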

Methods that adopt the algorithm adaptation strategy mainly include ML-kNN and ML-DT. ML-kNN is an adaptation of the kNN algorithm: for each sample instance, it obtains the k nearest instances and uses the information of these instances to determine the predicted label set of the instance. ML-kNN can identify the neighborhood of each sample and predict from the neighborhood information, so its accuracy is high, but it is not sensitive to abnormal points. The basic idea of ML-DT is to process multi-label data with decision tree techniques, recursively constructing a decision tree with an information gain criterion based on multi-label entropy. A decision tree model can be derived efficiently from multi-label data, but the labels are assumed to be independent when the entropy is calculated.

Both the algorithm adaptation strategy and the problem transformation strategy largely ignore the dependencies among labels and do not use the relations among labels to build the model. Electrocardiogram disorders, however, are related to one another, so these methods cannot make good use of the electrocardiogram to determine disorders, and their prediction accuracy is poor.

Causal relationships are important patterns of data mining and can reveal dependencies between tags. Causal relationships explain the cause of an event occurrence and what the event occurrence will cause, emphasizing the strong correlation between variables. Causal relationships among data can be mined through corresponding algorithms, common constraint-based mining algorithms include SGS algorithms, PC algorithms and variants of various PC algorithms, and score-based mining algorithms are mainly search algorithms based on Bayesian Dirichlet likelihood equivalence scores. The mining of causal relationships, while implemented by a number of algorithms, has been rarely used for electrocardiographic disorder determination.

The invention combines these two lines of work and uses the causal relationships among disorders to provide an electrocardiogram disease determination method based on Bayesian network filtering.

Disclosure of Invention

To address the above problems, the invention provides an electrocardiogram disease determination method that uses a Bayesian network for filtering and achieves intelligent diagnosis of electrocardiogram disorders.

The technical scheme of the invention is as follows: an electrocardiogram disease determination method based on Bayesian network filtering comprises the following specific operation steps:

Step (1.1): predicting the possible disorder labels of the instance ob by using a plurality of base classifiers;

Step (1.2): constructing a voter;

Step (1.3): passing the prediction results obtained in step (1.1) into the voter constructed in step (1.2) for screening, the voter yielding an anchor disorder set AS(ob) and a candidate disorder set CS(ob);

Step (1.4): merging the anchor disorder set AS(ob) with every subset of the candidate disorder set CS(ob) to obtain the anchor disorder support set ASP(ob), each element of which is the union of AS(ob) and one subset of CS(ob) and is denoted as an anchor disorder extension $SL_i(ob)$;

Step (1.5): constructing a Bayesian network by using a hill-climbing search algorithm based on Bayesian Dirichlet likelihood equivalence scores;

Step (1.6): using the Bayesian network to calculate the joint probabilities of the anchor disorder set AS(ob) and of each anchor disorder extension $SL_i(ob)$, denoted $P(AS(ob))$ and $P(SL_i(ob))$ respectively.
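The construction of the anchor disorder support set in step (1.4) can be sketched as follows. This is a minimal illustration, assuming disorders are represented by string names; the function name `anchor_support_set` is hypothetical.

```python
from itertools import chain, combinations
from typing import FrozenSet, Iterable, List


def anchor_support_set(anchor: Iterable[str],
                       candidate: Iterable[str]) -> List[FrozenSet[str]]:
    """Return the union of the anchor set with each subset of the candidate set."""
    anchor_set = frozenset(anchor)
    cand = sorted(candidate)
    # All subsets of the candidate set, from the empty set to the full set.
    subsets = chain.from_iterable(
        combinations(cand, r) for r in range(len(cand) + 1)
    )
    return [anchor_set | frozenset(sub) for sub in subsets]


# Example with two anchor disorders and three candidates: 2**3 = 8 extensions.
asp = anchor_support_set({"L1", "L2"}, {"L3", "L4", "L5"})
```

Because every subset of CS(ob) is enumerated, the support set has 2^|CS(ob)| elements, so the voter's screening also bounds the cost of the later filtering step.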

Further, in step (1.2), the operation steps of constructing the voter are as follows:

(1.2.1) setting the probability thresholds that allow a model to take part in voting for the anchor disorder set AS(ob) and for the candidate disorder set CS(ob); presetting the vote-count thresholds required for adding a disorder to the anchor disorder set AS(ob) and to the candidate disorder set CS(ob);

(1.2.2) traversing the prediction results of all base classifier models corresponding to a disorder, where each prediction result is the probability that the model predicts 1; when a prediction result is not smaller than the preset probability threshold, the model is qualified to vote, and the vote count of the corresponding anchor disorder set AS(ob) or candidate disorder set CS(ob) is increased by 1;

(1.2.3) if the votes obtained for the anchor disorder set AS(ob) satisfy the voting threshold, adding the disorder to AS(ob); otherwise, checking the votes obtained for the candidate disorder set CS(ob) and, if the voting threshold is satisfied, adding the disorder to CS(ob);

(1.2.4) repeating steps (1.2.1) to (1.2.3) to determine the attribution of all disorders.

Further, in step (1.3), the anchor disorder set AS(ob) stores the disorders that have been determined, and the candidate disorder set CS(ob) stores the disorders that still need to be confirmed.

Further, in step (1.5), the Bayesian network is constructed by causal relationship mining, namely a hill climbing method based on the Bayesian Dirichlet likelihood equivalence score.

Further, in step (1.6), the disorder set $SL_i(ob)$ satisfying formula ①, that is, the extension whose calculated value is the maximum, is the prediction result of instance ob and is denoted tls(ob).

The beneficial effects of the invention are: (1) the invention uses a two-layer structure to determine the final result from several trained base classifiers: the first layer constructs a voter that screens the results of the base classifiers, and the second layer constructs a Bayesian network that filters the results of the voter, which strengthens the classification effect of the base classifiers and improves the accuracy of electrocardiogram disorder determination; (2) the Bayesian network is constructed by a hill climbing method based on Bayesian Dirichlet likelihood equivalence scores, so the dependencies among disorders are fully used and the generalization performance of the model is improved; (3) the Bayesian network is constructed by mining causal relationships, which reveal strongly correlated dependencies, so the model is stable and does not vary greatly across different data distributions.

Drawings

FIG. 1 is a flow chart of the structure of the present invention;

FIG. 2 is a schematic diagram of the structure of the voting machine according to the present invention;

FIG. 3 is a diagram of an exemplary voting architecture of the voter of the present invention;

FIG. 4 is a flow chart of the Bayesian network construction of the present invention;

FIG. 5 is a diagram of an exemplary Bayesian network architecture in accordance with the present invention;

FIG. 6 is a schematic partial structure diagram of a Bayesian network in an embodiment of the present invention.

Detailed Description

In order to more clearly illustrate the technical solution of the present invention, the following detailed description is made with reference to the accompanying drawings:

An electrocardiogram disease determination method based on Bayesian network filtering comprises the following specific operation steps:

Step (1.1): predicting the possible disorder labels of the instance ob by using a plurality of base classifiers;

Step (1.2): constructing a voter;

(1.2.3) if the votes obtained for the anchor disorder set AS(ob) satisfy the voting threshold, adding the disorder to AS(ob); otherwise, checking the votes obtained for the candidate disorder set CS(ob) and, if the voting threshold is satisfied, adding the disorder to CS(ob);

Further, in step (1.3), the anchor disorder set AS(ob) stores the disorders that have been determined, and the candidate disorder set CS(ob) stores the disorders that still need to be confirmed.

Specifically, as shown in FIG. 1, the method for determining electrocardiogram disorders based on Bayesian network filtering uses a two-layer structure to determine the final result from several trained base classifiers. The first layer constructs a voter V, which screens the results of the base classifiers to generate an anchor disorder set AS and a candidate disorder set CS. The second layer constructs a Bayesian network with a causal mining algorithm, namely a hill climbing method based on BDe scores, and the Bayesian network filters the anchor disorder set AS and the candidate disorder set CS a second time to determine the final predicted disorder set tls(ob).

FIG. 2 is a schematic design diagram of the voter. The implementation steps are as follows:

Step 1: for disorder $L_i$, denote its base classifiers as $BC_1, BC_2, \ldots, BC_m$, and denote the prediction result of the $j$-th base classifier as $P(BC_j)$, the probability that its prediction is 1;

Step 2: set six parameters AS_count, CS_count, AS_proba, CS_proba, t1 and t2, where AS_count and CS_count record the votes for AS and CS respectively and are initially 0, AS_proba and CS_proba are the thresholds that allow a model to vote for AS and CS respectively, and t1 and t2 are the vote-count thresholds required for adding disorder $L_i$ to AS and CS respectively;

Step 3: traverse the prediction results of all $m$ base classifiers corresponding to disorder $L_i$. For base classifier $BC_j$, whose prediction result $P(BC_j)$ is the probability that the $j$-th model predicts 1: if $P(BC_j) >$ AS_proba, $BC_j$ votes for AS and AS_count is increased by 1; otherwise check whether $P(BC_j) >$ CS_proba holds, and if so, $BC_j$ votes for CS and CS_count is increased by 1; if not, move on to the voting process of the next base classifier $BC_{j+1}$;

Step 4: if AS_count $\ge$ t1, add disorder $L_i$ to the set AS; otherwise check whether CS_count $\ge$ t2 holds; if it holds, add $L_i$ to the set CS, and if not, discard it.

As shown in FIG. 3, there are 6 disorders L1, …, L6, each with 5 base classifier models BC1, …, BC5. The data in the table are the probabilities that the base classifier models predict 1. AS_proba and CS_proba, the thresholds that allow a model to vote for AS and CS, are set to 0.8 and 0.5 respectively; t1 and t2, the vote-count thresholds required for adding disorder Li to AS and CS, are set to 4 and 2 respectively. Among the 5 base classifiers of disorder L1, 4 prediction results are not less than AS_proba, so AS_count = 4, which is not less than the threshold t1, and L1 is added to AS. Among the 5 base classifiers of disorder L2, none of the prediction results reaches AS_proba, so the condition AS_count ≥ 4 is not satisfied and the condition CS_count ≥ 2 must be checked; 2 of the prediction results exceed CS_proba, so the condition holds and L2 is added to CS. The attribution of L3, …, L6 is determined in the same way, giving AS = {L1, L4} and CS = {L2, L5}; L3 and L6 are discarded.
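The voting rule can be sketched in a few lines. This is a minimal illustration and not the patented implementation: the function name `vote` is hypothetical, and the probability values below are invented for illustration because the table of FIG. 3 is not reproduced in the text; they are chosen only so that the outcome matches the one stated above. The comparison with the probability thresholds is taken as "not smaller than", following step (1.2.2), whereas the step list for FIG. 2 writes a strict ">".

```python
from typing import Dict, List, Set, Tuple


def vote(probs: Dict[str, List[float]],
         as_proba: float = 0.8, cs_proba: float = 0.5,
         t1: int = 4, t2: int = 2) -> Tuple[Set[str], Set[str]]:
    """Screen base-classifier outputs into an anchor set and a candidate set."""
    anchor: Set[str] = set()
    candidate: Set[str] = set()
    for disorder, predictions in probs.items():
        as_count = cs_count = 0
        for p in predictions:
            if p >= as_proba:        # confident enough to vote for AS
                as_count += 1
            elif p >= cs_proba:      # otherwise it may still vote for CS
                cs_count += 1
        if as_count >= t1:
            anchor.add(disorder)
        elif cs_count >= t2:
            candidate.add(disorder)
        # disorders meeting neither threshold are discarded
    return anchor, candidate


# Hypothetical probabilities (NOT the values of FIG. 3).
example = {
    "L1": [0.90, 0.85, 0.82, 0.95, 0.40],
    "L2": [0.60, 0.70, 0.30, 0.20, 0.10],
    "L3": [0.10, 0.20, 0.30, 0.40, 0.60],
    "L4": [0.90, 0.90, 0.90, 0.90, 0.90],
    "L5": [0.85, 0.60, 0.55, 0.30, 0.20],
    "L6": [0.20, 0.10, 0.30, 0.05, 0.45],
}
AS, CS = vote(example)
```

Each base classifier casts at most one vote per disorder, either for AS or for CS, which mirrors the "otherwise check" wording of the step list.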

FIG. 4 is a flow chart of the Bayesian network construction. The network is constructed mainly by a causal relationship mining method, namely hill climbing (HC) based on BDe scores, with the following main steps:

Step 1: randomly generate an initial network G and define three search operators: an edge-adding operator A, an edge-removing operator M and an edge-reversing operator R; that is, three operations on the network G are defined: adding an edge, removing an edge and reversing the direction of an edge;

Step 2: apply the search operators to the current network G to update the network and obtain a series of candidate networks G1, G2, …, Gm;

Step 3: score each of the candidate networks G1, G2, …, Gm with the BDe scoring function, where S(Gi) denotes the score of candidate network Gi, and select the network with the highest score as the best candidate network structure, denoted G';

Step 4: if the score of G' is greater than that of G, i.e. S(G') > S(G), update the current network G to G' (i.e. let G = G') and return to step 2 to start the next round of search; otherwise, do not update the current network, end the search and store the current structure G;

Step 5: obtain the conditional probability table CPT among the disorders by statistics;

Step 6: establish the Bayesian network from the network G stored in step 4 (a directed acyclic graph) and the conditional probability table CPT from step 5.
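Steps 1 to 4 can be sketched as a small greedy search. The sketch below rests on several assumptions that are not stated in the text: all variables are binary, the score is BDeu (the BDe score with a uniform prior and equivalent sample size `ess`), and the search starts from an empty graph unless a starting graph is passed in, whereas step 1 describes a randomly generated initial network. All function names and the toy data are hypothetical.

```python
import math
from collections import Counter
from itertools import permutations
from typing import Dict, FrozenSet, Iterator, List, Optional

Graph = Dict[str, FrozenSet[str]]  # node -> set of its parents
Row = Dict[str, int]               # one record: variable -> 0/1


def local_score(data: List[Row], child: str, parents: FrozenSet[str],
                ess: float = 1.0) -> float:
    """Log BDeu score of one node given its parents (binary variables assumed)."""
    pa = sorted(parents)
    q, r = 2 ** len(pa), 2  # number of parent configurations, child states
    n_jk = Counter((tuple(row[p] for p in pa), row[child]) for row in data)
    n_j: Counter = Counter()
    for (cfg, _), n in n_jk.items():
        n_j[cfg] += n
    score = sum(math.lgamma(ess / q) - math.lgamma(ess / q + n) for n in n_j.values())
    score += sum(math.lgamma(ess / (q * r) + n) - math.lgamma(ess / (q * r))
                 for n in n_jk.values())
    return score


def total_score(data: List[Row], graph: Graph, ess: float = 1.0) -> float:
    return sum(local_score(data, v, pa, ess) for v, pa in graph.items())


def is_acyclic(graph: Graph) -> bool:
    """Repeatedly strip nodes with no remaining parents; a leftover means a cycle."""
    remaining = set(graph)
    while remaining:
        roots = {v for v in remaining if not any(p in remaining for p in graph[v])}
        if not roots:
            return False
        remaining -= roots
    return True


def neighbors(graph: Graph) -> Iterator[Graph]:
    """Graphs reachable by adding, removing, or reversing one edge."""
    for a, b in permutations(graph, 2):
        if a in graph[b]:                          # edge a -> b exists
            removed = {**graph, b: graph[b] - {a}}
            yield removed                          # operator M: remove a -> b
            yield {**removed, a: graph[a] | {b}}   # operator R: reverse to b -> a
        elif b not in graph[a]:                    # no edge between a and b
            yield {**graph, b: graph[b] | {a}}     # operator A: add a -> b


def hill_climb(data: List[Row], variables: List[str],
               start: Optional[Graph] = None, ess: float = 1.0) -> Graph:
    graph = start or {v: frozenset() for v in variables}
    best = total_score(data, graph, ess)
    while True:
        scored = [(total_score(data, g, ess), g)
                  for g in neighbors(graph) if is_acyclic(g)]
        top_score, top = max(scored, key=lambda t: t[0], default=(best, graph))
        if top_score <= best:
            return graph  # no candidate improves the score: stop (step 4)
        graph, best = top, top_score


# Hypothetical toy data: B always equals A, C is unrelated to both.
data = [{"A": a, "B": a, "C": c} for a in (0, 1) for c in (0, 1)] * 10
learned = hill_climb(data, ["A", "B", "C"])
edges = {(p, child) for child, pa in learned.items() for p in pa}
```

The search stops at a local optimum of the score, so a different starting graph may yield a different structure; rescoring every candidate from scratch keeps the sketch short but would be too slow for large networks.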

FIG. 5 shows an example of a constructed Bayesian network with 5 disorders L1, …, L5. The solid lines with arrows represent the dependencies between disorders, and the tables connected by dotted lines are the conditional probability tables of the nodes. An example of filtering with this Bayesian network is as follows:

If AS = {L1, L2} and CS = {L3, L4, L5}, CS has 8 subsets, so 8 calculations are required, and the $SL_i(ob)$ corresponding to the maximum value is taken. Let $BCS_i = \{L3, L4\}$ be one subset of CS, so that $SL_i(ob) = \{L1, L2, L3, L4\}$. For this extension, where a subscript 1 or 0 marks a disorder as present or absent,

$P(L1_1, L2_1) = 0.5 \times 0.4 = 0.2$,

$P(L1_1, L2_1, L3_1, L4_1, L5_0) = 0.5 \times 0.4 \times 0.1 \times 0.6 \times 0.2 = 0.0024$.

The results for all $SL_i(ob)$ can be calculated in the same way, and the $SL_i(ob)$ corresponding to the maximum value is taken as the final predicted disorder set tls(ob); that is, the required $SL_i(ob)$ satisfies the formula given for step (1.6).
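The filtering step can be sketched as follows, under clearly stated assumptions. The network structure and most conditional probabilities below are hypothetical, because the tables of FIG. 5 are not reproduced in the text; only the five factors 0.5, 0.4, 0.1, 0.6 and 0.2 come from the worked calculation. The sketch selects the extension with the largest joint probability, with every disorder outside the extension set to absent; dividing each value by $P(AS(ob))$, which is the same for every extension, would not change which extension is selected.

```python
from itertools import chain, combinations
from typing import Dict, FrozenSet, Iterable, Tuple

# Hypothetical structure: parents of each node.
PARENTS: Dict[str, Tuple[str, ...]] = {
    "L1": (), "L2": (), "L3": ("L1", "L2"), "L4": ("L3",), "L5": ("L4",),
}
# Hypothetical CPTs: P(node = 1 | parent values).
CPT: Dict[str, Dict[Tuple[int, ...], float]] = {
    "L1": {(): 0.5},
    "L2": {(): 0.4},
    "L3": {(1, 1): 0.1, (1, 0): 0.3, (0, 1): 0.2, (0, 0): 0.05},
    "L4": {(1,): 0.6, (0,): 0.3},
    "L5": {(1,): 0.8, (0,): 0.1},
}


def joint(present: Iterable[str]) -> float:
    """Joint probability with disorders in `present` set to 1 and all others to 0."""
    present = set(present)
    state = {v: int(v in present) for v in PARENTS}
    prob = 1.0
    for node, parents in PARENTS.items():
        p1 = CPT[node][tuple(state[p] for p in parents)]
        prob *= p1 if state[node] else 1.0 - p1
    return prob


def bayes_filter(anchor: Iterable[str], candidate: Iterable[str]) -> FrozenSet[str]:
    """Return the anchor extension with the largest joint probability."""
    cand = sorted(candidate)
    subsets = chain.from_iterable(combinations(cand, r) for r in range(len(cand) + 1))
    supports = [frozenset(anchor) | frozenset(s) for s in subsets]
    return max(supports, key=joint)


p_example = joint({"L1", "L2", "L3", "L4"})   # the extension from the worked example
tls = bayes_filter({"L1", "L2"}, {"L3", "L4", "L5"})
```

With these invented tables the extension {L1, L2, L3, L4} reproduces the value 0.0024, but the selected set depends entirely on the assumed tables and is not a result reported in the source.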

Specific embodiment of the invention: the processing procedure of the method is described below, taking first-degree atrioventricular block, incomplete right bundle branch block, incomplete left bundle branch block, left ventricular hypertrophy, sinus bradycardia and other disorders as examples:

(1) In this example, the practical feasibility of the invention is verified by comparing the results with filtering (Filter) and without filtering (NotFilter).

(2) Constructing a voter to screen the results of the base classifiers preliminarily:

First, the parameters required by the voter are set. AS_count and CS_count, which record the votes obtained for AS and CS, are initialized to 0; AS_proba and CS_proba, the thresholds that allow a model to vote for AS and CS, are set to AS_proba = 0.8 and CS_proba = 0.6; t1 and t2, the vote-count thresholds required for adding disorder Li to AS and CS, are set to t1 = 4 and t2 = 2.

Then all test instances are voted on with the voting rules shown in FIG. 2 and FIG. 3. After screening by the voter, each test instance ob obtains its corresponding anchor label set AS(ob) and candidate label set CS(ob).

Finally, the anchor label set AS(ob) is merged with every subset of the candidate label set CS(ob) to obtain the anchor disorder support set, which consists of the unions of AS(ob) with the subsets of CS(ob).

(3) Constructing a Bayesian network:

In this embodiment, the Bayesian network is constructed with the hill climbing method (HC) based on the Bayesian Dirichlet likelihood equivalence score described for FIG. 4. FIG. 6 is a partial structure diagram of the Bayesian network constructed in this embodiment. Where two disorder nodes in the graph are connected, the end with the arrow is the child node and the end without the arrow is the parent node; the child node depends strongly on the parent node. The table connected to a node by a dotted line is the conditional probability table of that node, which gives the probability that the child node is present given the state of its parent node.

The Bayesian network is then used to determine the final predicted disorder set with the calculation method illustrated for FIG. 5.
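A conditional probability table of the kind described above can be estimated by counting. The sketch below is one simple reading of "by statistics" (relative frequencies without smoothing), which is an assumption; the function name `estimate_cpt` and the records are hypothetical.

```python
from collections import Counter
from typing import Dict, List, Sequence, Tuple


def estimate_cpt(data: List[Dict[str, int]], child: str,
                 parents: Sequence[str]) -> Dict[Tuple[int, ...], float]:
    """Relative frequency of child == 1 for each observed parent configuration."""
    totals: Counter = Counter()
    positives: Counter = Counter()
    for row in data:
        cfg = tuple(row[p] for p in parents)
        totals[cfg] += 1
        positives[cfg] += row[child]
    return {cfg: positives[cfg] / totals[cfg] for cfg in totals}


# Hypothetical records: 1 = disorder present, 0 = absent.
records = [
    {"L1": 1, "L3": 1}, {"L1": 1, "L3": 0}, {"L1": 1, "L3": 1}, {"L1": 1, "L3": 1},
    {"L1": 0, "L3": 0}, {"L1": 0, "L3": 0}, {"L1": 0, "L3": 1}, {"L1": 0, "L3": 0},
]
cpt_l3 = estimate_cpt(records, "L3", ["L1"])
```

Parent configurations that never occur in the data get no entry here; a practical implementation would need a rule for them, for example a smoothing prior.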

(4) Analysis of example results:

The table above shows partial results of this embodiment. The indexes used to measure model quality in this embodiment are precision, recall and fscore. The table shows that, after filtering, the recall of each disorder is greatly improved and the fscore also improves noticeably. This indicates that the two rounds of filtering strengthen the classification effect of the base classifiers, improve the accuracy of electrocardiogram disorder determination, make full use of the dependencies among disorders and improve the generalization performance of the model.
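For reference, the three indexes can be computed per disorder as in the sketch below. It assumes that fscore means the F1 score (the harmonic mean of precision and recall), which the text does not state explicitly; the function name `label_metrics` and the label sets are hypothetical and are not the results of the embodiment.

```python
from typing import List, Set, Tuple


def label_metrics(true_sets: List[Set[str]], pred_sets: List[Set[str]],
                  label: str) -> Tuple[float, float, float]:
    """Precision, recall and F1 of one label over paired true/predicted label sets."""
    pairs = list(zip(true_sets, pred_sets))
    tp = sum(label in t and label in p for t, p in pairs)
    fp = sum(label not in t and label in p for t, p in pairs)
    fn = sum(label in t and label not in p for t, p in pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


# Hypothetical true and predicted label sets for four test instances.
truth = [{"L1"}, {"L1", "L2"}, {"L2"}, {"L1"}]
predicted = [{"L1"}, {"L2"}, {"L1", "L2"}, {"L1"}]
precision, recall, f1 = label_metrics(truth, predicted, "L1")
```

Computing the same metrics once on the base classifier output and once on the filtered output gives the Filter versus NotFilter comparison described in item (1).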

Finally, it should be understood that the embodiments described herein are merely illustrative of the principles of embodiments of the present invention; other variations are possible within the scope of the invention; thus, by way of example, and not limitation, alternative configurations of embodiments of the invention may be considered consistent with the teachings of the present invention; accordingly, the embodiments of the invention are not limited to the embodiments explicitly described and depicted.

Claims

1. A method for determining an electrocardiogram disease based on Bayesian network filtering is characterized by comprising the following specific operation steps:

step (1.2): constructing a voting machine;

the Bayesian network construction flow chart is mainly constructed by a causal relationship mining method based on a BDe scoring hill climbing method, and the method mainly comprises the following steps:

step 1: randomly generating an initial network G, and defining three search operators: an edge adding operator A, an edge subtracting operator M and an edge turning operator R, namely three operations of adding edges, cutting edges and converting the directions of the edges are defined for the network G;

step 4: if the score of G' is greater than that of G, namely S(G') > S(G), updating the current network G to G', returning to step 2 and starting the next round of search; otherwise, the current network is not updated, the search is finished, and the current structure G is stored;

step 6: establishing a Bayesian network by using the storage network G in the step 4 and the conditional probability table CPT in the step 5;

step (1.6): using the Bayesian network to calculate the joint probabilities of the anchor disorder set AS(ob) and of each anchor disorder extension $SL_i(ob)$, denoted $P(AS(ob))$ and $P(SL_i(ob))$ respectively.

2. The method for determining electrocardiographic disorder based on bayesian network filtering according to claim 1, wherein in step (1.2), the operation steps for constructing the voter are as follows:

(1.2.4) repeating steps (1.2.1) to (1.2.3) to determine the attribution of all disorders.

3. The method according to claim 1, wherein in step (1.3), the anchor disorder set AS(ob) stores the disorders that have been determined, and the candidate disorder set CS(ob) stores the disorders requiring confirmation.

4. The method for determining electrocardiographic disorders based on Bayesian network filtering as set forth in claim 1, wherein in step (1.6), the disorder set $SL_i(ob)$ satisfying the following formula ① is the prediction result of instance ob and is denoted tls(ob):