CN106980757A

CN106980757A - Risk factor management system and mining method for coronary artery lesions complicating Kawasaki disease

Info

Publication number: CN106980757A
Application number: CN201710154709.0A
Authority: CN
Inventors: 贺向前; 张胜; 田杰; 樊楚; 谭续海
Original assignee: Chongqing Medical University
Current assignee: Chongqing Medical University
Priority date: 2017-03-15
Filing date: 2017-03-15
Publication date: 2017-07-25

Abstract

The invention discloses a risk factor management system and mining method for coronary artery lesions complicating Kawasaki disease. The system includes a management control module; an entry module is provided at the input of the management control module, the output of the management control module is connected to a Kawasaki disease database, and the output of the Kawasaki disease database is connected to a data processor. The entry module is used to enter Kawasaki disease data. The management control module preprocesses the entered Kawasaki disease data and then saves them, by class, to the Kawasaki disease database. The data processor performs data cleaning, data integration, and data transformation on all data in the Kawasaki disease database. Beneficial effects: the management system improves the quality and efficiency of Kawasaki disease data analysis; strong association rules are used to find the risk factors related to the disease, and a random forest model is used to improve prediction accuracy; the approach is highly usable and reliable, draws on a wide range of data sources, is easy to implement, and requires little manual work.

Description

Risk factor management system and mining method for coronary artery lesions complicating Kawasaki disease

Technical field

The present invention relates to the technical field of life science, and specifically to a risk factor management system and mining method for coronary artery lesions complicating Kawasaki disease.

Background art

Kawasaki disease is an acute febrile, eruptive pediatric disease whose main lesion is systemic vasculitis. Coronary artery damage is the major complication of Kawasaki disease: about 15%-25% of untreated children develop coronary artery lesions, which include coronary artery dilation, coronary aneurysm, coronary artery stenosis, occlusion and atherosclerosis, and thrombus formation within the aneurysm. Severe cases may develop myocardial infarction, ischemic heart disease, or even sudden death. Preventing and treating coronary artery lesions is therefore the pediatrician's primary goal in treating children with Kawasaki disease.

Researchers at home and abroad have carried out extensive, in-depth studies of the factors associated with coronary artery damage complicating Kawasaki disease. However, there is still no globally accepted conclusion, nor a system that can be widely used in clinical practice to evaluate the degree of risk of coronary artery damage complicating Kawasaki disease.

Many researchers have identified risk factors for coronary artery damage complicating Kawasaki disease through statistical analysis of the clinical data of Kawasaki disease patients. However, with the rapid development of electronic medical record systems in recent years, hospitals have gradually accumulated clinical data resources of considerable scale. For electronic data this large in volume and broad in scope, the query and retrieval mechanisms of conventional database management systems and conventional statistical analysis methods cannot analyze the data effectively.

Existing statistical and retrieval techniques have many drawbacks, such as a single analysis method, high consumption of manpower and time, and poorly founded prediction methods, and they cannot meet the demands of an increasingly intelligent era.

In summary, a technique that meets these needs is required, one that analyzes the risk factors for coronary artery lesions complicating Kawasaki disease in a more detailed and intelligent way.

Summary of the invention

In view of the above problems, the invention provides a risk factor management system and mining method for coronary artery lesions complicating Kawasaki disease. It establishes a risk factor management system for Kawasaki disease with coronary artery involvement, compiles statistics on Kawasaki disease data, and mines the risk factors for coronary artery lesions complicating Kawasaki disease from the large volume of statistical data.

To achieve the above object, the specific technical scheme adopted by the present invention is as follows:

A risk factor management system for coronary artery lesions complicating Kawasaki disease, the key to which is that it includes a management control module; an entry module is provided at the input of the management control module, the output of the management control module is connected to a Kawasaki disease database, and the output of the Kawasaki disease database is connected to a data processor. The entry module is used to enter Kawasaki disease data. The management control module preprocesses the entered Kawasaki disease data and then saves them, by class, to the Kawasaki disease database. The data processor performs data cleaning, data integration, and data transformation on all data in the Kawasaki disease database.

With this design, the management system compiles statistics on Kawasaki disease data and makes the data easy for users to retrieve. The management control module preprocesses the entered Kawasaki disease data and saves them by class, which makes the Kawasaki disease data clearer. The data processor performs data cleaning, data integration, and data transformation on all data in the Kawasaki disease database to obtain the Kawasaki disease data set.

Further, in order to obtain all data on Kawasaki disease, the Kawasaki disease database includes a personal information database of Kawasaki disease patients, a clinical examination database, an echocardiography database, a diagnostic result database, and an electronic medical record database.

A method for mining risk factors for coronary artery lesions complicating Kawasaki disease, the key to which is that it comprises the following steps:

S1: Obtain the personal data, clinical examination data, echocardiography data, diagnostic result data, and electronic medical record data of all Kawasaki disease patients from the Kawasaki disease database;

S2: The data processor performs data cleaning, data integration, and data transformation on all data obtained in step S1 to obtain the Kawasaki disease data set;

S3: Perform data mining on the Kawasaki disease data set using the association rule method to obtain the risk factors related to coronary artery lesions;

S4: Use the random forest algorithm to build a random forest risk prediction model for coronary artery lesions complicating Kawasaki disease, and calculate the AUC (area under the ROC curve) of the random forest risk prediction model.

The data obtained in step S1 are data that the management control module has already preprocessed. The preprocessing steps are specifically:

S11: Obtain the personal data, clinical examination data, echocardiography data, diagnostic result data, and electronic medical record data of all Kawasaki disease patients;

S12: From all data obtained in step S11, extract all predictor variables and the mean of each predictor variable;

S13: Determine the classification variable indicating whether a Kawasaki disease patient has developed coronary artery lesions, the classification grades, and the classification variable value corresponding to each grade;

S14: Classify all Kawasaki disease patients and save them to the Kawasaki disease database.

The predictor variables include the sex and age of the Kawasaki disease patient and 52 laboratory test indicators. The 52 laboratory test indicators are: C-reactive protein, white blood cells, absolute monocyte count, absolute lymphocyte count, absolute neutrophil count, red blood cells, hemoglobin, hematocrit, MCVU, mean corpuscular hemoglobin concentration (MCHC), red cell distribution width, absolute red cell distribution width, platelet count, mean platelet volume, platelet large cell ratio, platelet distribution width, plateletcrit, absolute eosinophil count, conjugated bilirubin, total bile acid, albumin, serum complement C4, bilirubin, urine protein, gamma-glutamyl transpeptidase, alanine aminotransferase, aspartate aminotransferase, aspartate/alanine aminotransferase ratio, red cell morphology, creatinine, creatine kinase, creatine kinase isoenzyme, indirect bilirubin, alkaline phosphatase, phosphorus, chloride, magnesium, sodium, urea nitrogen, uric acid, urine glucose, prealbumin, globulin, lactate dehydrogenase, ketone bodies, urine vitamin C, erythrocyte sedimentation rate, nitrite, total bilirubin, total protein, total calcium;

The classification variable is the z-score value in the echocardiography data;

The classification grades are: no coronary artery lesion (NCAL), small coronary aneurysm (SCAL), medium coronary aneurysm (MCAL), and giant coronary aneurysm (GCAL);

The classification variable value corresponding to no coronary artery lesion is: z-score < 2.5;

The classification variable value corresponding to a small coronary aneurysm is: 2.5 ≤ z-score < 5.0;

The classification variable value corresponding to a medium coronary aneurysm is: 5.0 ≤ z-score < 10.0;

The classification variable value corresponding to a giant coronary aneurysm is: z-score ≥ 10.0.
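The grading thresholds above can be sketched as a small lookup function. This is only an illustration in Python; the function name `classify_z_score` and the sample values are hypothetical and do not come from the patent.

```python
def classify_z_score(z: float) -> str:
    """Map an echocardiographic z-score to the grade named in the text."""
    if z < 2.5:
        return "NCAL"  # no coronary artery lesion
    if z < 5.0:
        return "SCAL"  # small coronary aneurysm: 2.5 <= z < 5.0
    if z < 10.0:
        return "MCAL"  # medium coronary aneurysm: 5.0 <= z < 10.0
    return "GCAL"      # giant coronary aneurysm: z >= 10.0


# Hypothetical sample values, chosen to hit each grade and two boundaries.
grades = [classify_z_score(z) for z in (1.8, 2.5, 7.2, 10.0)]
print(grades)
```

The lower bound of each interval is inclusive, matching the inequalities stated for SCAL, MCAL, and GCAL.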

The data cleaning in step S2 is specifically:

For an indicator with many missing values, the missing values are filled in by multiple imputation, with the predictor variable mean used for the imputation;

Where missing values are few and occur at random, the missing data are deleted;

The data integration is specifically: the data in all data tables of the electronic medical record data described in step S1 are merged into a combined table;

The data transformation is specifically: the value of each attribute in the combined table is converted into a form suitable for data mining, and each attribute is normalized and encoded according to its characteristics.
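The two cleaning branches can be sketched roughly as follows. This is a simplified illustration, not the patent's procedure: it uses a single mean fill rather than a full multiple imputation, it interprets "deleting the missing data" as dropping the affected record, and the cut-off `many_missing` is a hypothetical parameter, since the text gives no number for "many" versus "few".

```python
from statistics import mean


def clean(records, fields, many_missing=0.2):
    """records: list of dicts; None marks a missing value.

    many_missing is an assumed threshold on the missing fraction per field.
    """
    n = len(records)
    impute, drop = {}, []
    for f in fields:
        present = [r[f] for r in records if r[f] is not None]
        missing = n - len(present)
        if missing == 0:
            continue
        if missing / n > many_missing:
            impute[f] = mean(present)  # many gaps: fill with the variable mean
        else:
            drop.append(f)             # few gaps: drop the affected records
    cleaned = []
    for r in records:
        if any(r[f] is None for f in drop):
            continue
        cleaned.append(
            {f: impute[f] if r[f] is None else r[f] for f in fields}
        )
    return cleaned


# Hypothetical toy data: "crp" has many gaps, "esr" has a single gap.
rows = [
    {"crp": 12.0, "esr": 40.0},
    {"crp": None, "esr": 55.0},
    {"crp": None, "esr": 30.0},
    {"crp": 20.0, "esr": None},
    {"crp": 16.0, "esr": 35.0},
]
cleaned = clean(rows, ["crp", "esr"])
print(cleaned)
```

In this toy run the two missing `crp` values are filled with the mean of the observed ones, and the one record lacking `esr` is removed.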

Step S3 performs data mining on the Kawasaki disease data set using the association rule method, specifically as follows:

S31: Obtain the data of all Kawasaki disease patients who have coronary artery lesions;

S32: Perform association rule analysis on the data obtained in step S31, and use rule constraints and interestingness constraints to obtain the strong association rules related to coronary artery lesions complicating Kawasaki disease;

S33: Take the predictor variables that appear in the strong association rules as the risk factors for coronary artery lesions complicating Kawasaki disease.

The strong association rules in step S32 are obtained as follows:

Establish association rules X → Y, where X is the condition and includes at least one predictor variable, and Y is the result and includes one classification grade of coronary artery lesions;

Set the minimum confidence and the minimum support;

When the support and the confidence of an association rule exceed the minimum support and the minimum confidence respectively, the rule is a strong association rule.
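A minimal sketch of the strong-rule test follows, using the usual definitions support(X → Y) = P(X ∪ Y) and confidence(X → Y) = P(X ∪ Y) / P(X). The function names and the toy transactions are hypothetical; the thresholds 0.01 and 0.9 are the ones stated in the embodiment.

```python
def support(transactions, items):
    """Fraction of transactions that contain every item in `items`."""
    items = frozenset(items)
    return sum(items <= t for t in transactions) / len(transactions)


def rule_metrics(transactions, x, y):
    """Return (support, confidence) of the rule X -> Y."""
    sup_xy = support(transactions, set(x) | set(y))
    sup_x = support(transactions, x)
    conf = sup_xy / sup_x if sup_x else 0.0
    return sup_xy, conf


def is_strong(transactions, x, y, min_support, min_confidence):
    # The text says "exceed"; many references use >= instead.
    sup, conf = rule_metrics(transactions, x, y)
    return sup > min_support and conf > min_confidence


# Hypothetical encoded patients (items are "attribute=code" strings).
transactions = [
    {"sex=M", "CRP=H", "grade=SCAL"},
    {"sex=M", "CRP=H", "grade=SCAL"},
    {"sex=M", "CRP=N", "grade=MCAL"},
    {"sex=F", "CRP=H", "grade=SCAL"},
]
strong = is_strong(transactions, {"sex=M", "CRP=H"}, {"grade=SCAL"}, 0.01, 0.9)
weak = is_strong(transactions, {"sex=M"}, {"grade=SCAL"}, 0.01, 0.9)
print(strong, weak)
```

Here the first rule holds in every transaction that matches its condition, so it passes the confidence threshold, while the second rule holds in only two of the three matching transactions and is rejected.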

The random forest risk prediction model in step S4 is built and evaluated as follows:

S41: Divide the Kawasaki disease data set obtained in step S2 into training samples and test samples in a ratio of N:1;

S42: Use the risk factors found in step S3 as the prediction indicators of the prediction model;

S43: Build the random forest risk prediction model for coronary artery lesions complicating Kawasaki disease;

Adjust the parameter mtry (the number of candidate split attributes) and the parameter ntree (the number of decision trees generated), observe how the model's prediction error changes with ntree, determine the optimal number of trees in the random forest from this, and build the random forest risk prediction model;

S44: Using the test samples from step S41, calculate the AUC of the random forest risk prediction model.
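For the evaluation in S44, the AUC can be computed directly from test-set labels and predicted scores as the probability that a randomly chosen positive case is scored above a randomly chosen negative one. The sketch below assumes a binary outcome (lesion versus no lesion); the patent does not say how the AUC was computed for the four grades, and the labels and scores shown are hypothetical.

```python
def auc(labels, scores):
    """Pairwise (Mann-Whitney) estimate of the area under the ROC curve.

    labels: 1 for a positive case, 0 for a negative case.
    scores: predicted probability or score for the positive class.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("both classes must be present")
    wins = 0.0
    for p in pos:
        for q in neg:
            if p > q:
                wins += 1.0
            elif p == q:
                wins += 0.5  # ties count as half
    return wins / (len(pos) * len(neg))


# Hypothetical test-set labels and model scores.
labels = [1, 1, 0, 0, 1, 0]
scores = [0.9, 0.8, 0.7, 0.3, 0.6, 0.2]
value = auc(labels, scores)
print(round(value, 3))
```

The pairwise form is quadratic in the sample size, which is acceptable for a test set of the size used in the embodiment; a rank-based formula gives the same value more efficiently.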

Beneficial effects of the present invention: a risk factor management system for Kawasaki disease with coronary artery involvement is established, statistics are compiled on Kawasaki disease data, and the risk factors for coronary artery lesions complicating Kawasaki disease are mined from the large volume of statistical data. Strong association rules are used to find the risk factors related to the disease, and the random forest model built with the random forest algorithm achieves prediction accuracy far exceeding that of the traditional multivariate logistic regression model, improving the quality and efficiency of the analysis. The approach is highly usable and reliable, draws on a wide range of data sources, is easy to implement, and requires little manual work.

Brief description of the drawings

Fig. 1 is a block diagram of the management system of the invention;

Fig. 2 is a flow chart of the data mining of the invention;

Fig. 3 shows the analysis results of the association rule method;

Fig. 4 shows how the prediction error of the random forest risk prediction model changes with the number of decision trees generated;

Fig. 5 shows the prediction results of the random forest risk prediction model;

Fig. 6 compares the ROC curves of the random forest risk prediction model and the logistic regression model.

Embodiment

The embodiments and working principle of the present invention are described in further detail below with reference to the accompanying drawings.

As shown in Fig. 1, a risk factor management system for coronary artery lesions complicating Kawasaki disease includes a management control module; an entry module is provided at the input of the management control module, the output of the management control module is connected to a Kawasaki disease database, and the output of the Kawasaki disease database is connected to a data processor. The entry module is used to enter Kawasaki disease data. The management control module preprocesses the entered Kawasaki disease data and then saves them, by class, to the Kawasaki disease database. The data processor performs data cleaning, data integration, and data transformation on all data in the Kawasaki disease database.

In the present embodiment, the Kawasaki disease database includes a personal information database of Kawasaki disease patients, a clinical examination database, an echocardiography database, a diagnostic result database, and an electronic medical record database.

As can be seen from Fig. 2, a method for mining risk factors for coronary artery lesions complicating Kawasaki disease comprises the following steps:

The preprocessing steps are specifically:

In the present embodiment, the electronic medical record database holds 8501 patients in total, of whom 5020 were diagnosed with Kawasaki disease; among these, 343 developed coronary artery lesions and 4677 did not.

In the present embodiment, the predictor variables include the sex and age of the Kawasaki disease patient and 52 laboratory test indicators. The 52 laboratory test indicators are: C-reactive protein, white blood cells, absolute monocyte count, absolute lymphocyte count, absolute neutrophil count, red blood cells, hemoglobin, hematocrit, MCVU, mean corpuscular hemoglobin concentration, red cell distribution width, absolute red cell distribution width, platelet count, mean platelet volume, platelet large cell ratio, platelet distribution width, plateletcrit, absolute eosinophil count, conjugated bilirubin, total bile acid, albumin, serum complement C4, bilirubin, urine protein, gamma-glutamyl transpeptidase, alanine aminotransferase, aspartate aminotransferase, aspartate/alanine aminotransferase ratio, red cell morphology, creatinine, creatine kinase, creatine kinase isoenzyme, indirect bilirubin, alkaline phosphatase, phosphorus, chloride, magnesium, sodium, urea nitrogen, uric acid, urine glucose, prealbumin, globulin, lactate dehydrogenase, ketone bodies, urine vitamin C, erythrocyte sedimentation rate, nitrite, total bilirubin, total protein, total calcium;

The data cleaning is specifically:

The data integration is specifically:

The data in all data tables of the electronic medical record data described in step S1 are merged into a combined table;

The data transformation is specifically:

The value of each attribute in the combined table is converted into a form suitable for data mining, and each attribute is normalized and encoded according to its characteristics.

In the present embodiment, the normalization and encoding are as follows:

Age is divided into four intervals: under 2 years, 2 to 5 years, 5 to 7 years, and over 7 years, denoted a, b, c, and d in turn.

Laboratory test indicators are divided according to their normal ranges. For example, the normal range of C-reactive protein is < 8 mg/L, so it is divided into two intervals, < 8 mg/L and ≥ 8 mg/L, denoted N and H in turn. The encoding is carried out with SQL statements in a MySQL database.
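The encoding rules above can be illustrated with two small functions. The embodiment performs this step with SQL statements in MySQL; the sketch below uses Python only for illustration. The function names are hypothetical, and the handling of the boundary ages (exactly 2, 5, or 7 years) is an assumption, since the text does not say which interval they fall into.

```python
def encode_age(age_years: float) -> str:
    """Code age into the four intervals a-d.

    Assumption: each boundary value belongs to the higher interval.
    """
    if age_years < 2:
        return "a"  # under 2 years
    if age_years < 5:
        return "b"  # 2 to 5 years
    if age_years < 7:
        return "c"  # 5 to 7 years
    return "d"      # 7 years and over


def encode_crp(crp_mg_per_l: float) -> str:
    """Code C-reactive protein as N (normal, < 8 mg/L) or H (high)."""
    return "N" if crp_mg_per_l < 8 else "H"


# Hypothetical sample values.
age_codes = [encode_age(a) for a in (1.5, 2, 6, 9)]
crp_codes = [encode_crp(v) for v in (7.9, 8.0)]
print(age_codes, crp_codes)
```

Other laboratory indicators would follow the same pattern as `encode_crp`, each with its own normal range.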

S3: Perform data mining on the Kawasaki disease data set using the association rule method to obtain the risk factors related to coronary artery lesions. The mining uses the data of the 343 Kawasaki disease patients with coronary artery lesions in the total sample, and the specific steps are:

The specific method is:

Set the minimum confidence and the minimum support;

In the present embodiment, the minimum confidence is 0.9 and the minimum support is 0.01.

In the present embodiment, the 30 predictor variables related to coronary artery lesions complicating Kawasaki disease that appear in the strong association rules are used as the risk factors for prediction. These indicators are: sex, age, hematocrit, platelet large cell ratio, C-reactive protein, platelet count, aspartate aminotransferase, alanine aminotransferase, aspartate/alanine aminotransferase ratio, erythrocyte sedimentation rate, mean platelet volume, absolute monocyte count, albumin, ketone bodies, serum phosphorus, serum chloride, alkaline phosphatase, red blood cells, MCHC, absolute eosinophil count, urea nitrogen, absolute neutrophil count, mean corpuscular volume (MCV), red cell distribution width, red cell morphology, absolute red cell distribution width, urine protein, total protein, prealbumin, mean corpuscular hemoglobin.

As shown in Fig. 3, a tally of the top 1000 association rules shows that male sex, elevated platelet large cell ratio, elevated platelet distribution width, elevated urea nitrogen, and elevated serum phosphorus are strongly associated with coronary artery lesions complicating Kawasaki disease.

The specific steps are:

In the present embodiment, the data set is randomly divided into training samples (3765) and test samples (1255) in a ratio of 3:1.

The training samples are used for modeling and the test samples for model evaluation.
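A random N:1 split of this kind can be sketched as follows. The function name `split_n_to_1` and the fixed seed are illustrative choices, not part of the patent; patient records are represented here simply by their indices.

```python
import random


def split_n_to_1(records, n=3, seed=0):
    """Shuffle the records and split them n:1 into (training, test)."""
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)  # seeded for reproducibility
    cut = len(shuffled) * n // (n + 1)
    return shuffled[:cut], shuffled[cut:]


# 5020 Kawasaki disease patients, as in the embodiment.
train, test = split_n_to_1(range(5020), n=3)
print(len(train), len(test))
```

With 5020 records and a 3:1 ratio, the cut falls at 5020 × 3 / 4 = 3765, which reproduces the sample sizes reported in the embodiment.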

The 30 indicators that appear in the above association rules are used as the model's prediction indicators;

Because the default value of mtry is the square root of the number of attributes, and the present invention selects 54 predictor variables, the adjustment starts from mtry = 8. The number of generated decision trees, ntree, is varied from 100 to 400, and the change in the model's prediction error with ntree is observed at each setting; the optimal number of decision trees determined in this way is used to build the random forest risk prediction model.

As shown in Fig. 4, as the number of generated decision trees becomes smaller, the overall prediction error of the random forest risk prediction model decreases accordingly. It can also be seen from Fig. 3 that the optimal number of generated decision trees is about 80, at which the prediction errors for no coronary artery lesion (NCAL), small coronary aneurysm (SCAL), medium coronary aneurysm (MCAL), and giant coronary aneurysm (GCAL) all reach a steady state and are all kept below 0.1.

Fig. 5 shows the prediction results of the random forest risk prediction model. It can be observed that C-reactive protein, erythrocyte sedimentation rate, sex, age, mean corpuscular hemoglobin concentration (MCHC), albumin, prealbumin, and absolute eosinophil count are the predictor variables of higher importance in the model's predictions. In addition, as the severity of the coronary artery lesion increases, the importance of alanine aminotransferase, platelets, hematocrit, aspartate aminotransferase, ketone bodies, aspartate/alanine aminotransferase ratio, mean corpuscular volume (MCV), urine protein, urea nitrogen, total protein, and absolute red cell distribution width in the prediction also increases.

Fig. 6 shows the receiver operating characteristic (ROC) curves of the random forest risk prediction model (Random Forest) and the multivariate logistic regression model (Logistic Regression). Calculating the respective AUC values gives 98.2% for the random forest risk prediction model and 59.2% for the regression model; the random forest model clearly predicts better than the regression model.

It should be noted that the above description does not limit the present invention, and the present invention is not limited to the above examples. Changes, modifications, additions, or substitutions made by those skilled in the art within the essential scope of the present invention also fall within the protection scope of the present invention.

Claims

1. A risk factor management system for coronary artery lesions complicating Kawasaki disease, characterized in that: it includes a management control module; an entry module is provided at the input of the management control module, the output of the management control module is connected to a Kawasaki disease database, and the output of the Kawasaki disease database is connected to a data processor;

The entry module is used to enter Kawasaki disease data;

The management control module preprocesses the entered Kawasaki disease data and then saves them, by class, to the Kawasaki disease database;

The data processor is used to perform data cleaning, data integration, and data transformation on all data in the Kawasaki disease database.

2. The risk factor management system for coronary artery lesions complicating Kawasaki disease according to claim 1, characterized in that: the Kawasaki disease database includes a personal information database of Kawasaki disease patients, a clinical examination database, an echocardiography database, a diagnostic result database, and an electronic medical record database.

3. A method for mining risk factors for coronary artery lesions complicating Kawasaki disease, characterized in that it comprises the following steps:

S1: Obtain the personal data, clinical examination data, echocardiography data, diagnostic result data, and electronic medical record data of all Kawasaki disease patients from the Kawasaki disease database;

S2: The data processor performs data cleaning, data integration, and data transformation on all data obtained in step S1 to obtain the Kawasaki disease data set;

S3: Perform data mining on the Kawasaki disease data set using the association rule method to obtain the risk factors related to coronary artery lesions;

S4: Use the random forest algorithm to build a random forest risk prediction model for coronary artery lesions complicating Kawasaki disease, and calculate the AUC of the random forest risk prediction model.

4. The method for mining risk factors for coronary artery lesions complicating Kawasaki disease according to claim 3, characterized in that the data obtained in step S1 are data that the management control module has already preprocessed, and the preprocessing steps are specifically:

S11: Obtain the personal data, clinical examination data, echocardiography data, diagnostic result data, and electronic medical record data of all Kawasaki disease patients;

S13: Determine the classification variable indicating whether a Kawasaki disease patient has developed coronary artery lesions, the classification grades, and the classification variable value corresponding to each grade;

5. The method for mining risk factors for coronary artery lesions complicating Kawasaki disease according to claim 4, characterized in that: the predictor variables include the sex and age of the Kawasaki disease patient and 52 laboratory test indicators, the 52 laboratory test indicators being: C-reactive protein, white blood cells, absolute monocyte count, absolute lymphocyte count, absolute neutrophil count, red blood cells, hemoglobin, hematocrit, MCVU, mean corpuscular hemoglobin concentration (MCHC), red cell distribution width, absolute red cell distribution width, platelet count, mean platelet volume, platelet large cell ratio, platelet distribution width, plateletcrit, absolute eosinophil count, conjugated bilirubin, total bile acid, albumin, serum complement C4, bilirubin, urine protein, gamma-glutamyl transpeptidase, alanine aminotransferase, aspartate aminotransferase, aspartate/alanine aminotransferase ratio, red cell morphology, creatinine, creatine kinase, creatine kinase isoenzyme, indirect bilirubin, alkaline phosphatase, phosphorus, chloride, magnesium, sodium, urea nitrogen, uric acid, urine glucose, prealbumin, globulin, lactate dehydrogenase, ketone bodies, urine vitamin C, erythrocyte sedimentation rate, nitrite, total bilirubin, total protein, total calcium;

The classification grades are: no coronary artery lesion, small coronary aneurysm, medium coronary aneurysm, and giant coronary aneurysm;

6. The method for mining risk factors for coronary artery lesions complicating Kawasaki disease according to claim 3, characterized in that:

The data cleaning in step S2 is specifically:

For an indicator with many missing values, the missing values are filled in by multiple imputation, with the predictor variable mean used for the imputation;

The data integration is specifically: the data in all data tables of the electronic medical record data described in step S1 are merged into a combined table;

The data transformation is specifically: the value of each attribute in the combined table is converted into a form suitable for data mining, and each attribute is normalized and encoded according to its characteristics.

7. The method for predicting risk factors for coronary artery lesions complicating Kawasaki disease based on data mining technology according to claim 6, characterized in that step S3 performs data mining on the Kawasaki disease data set using the association rule method, specifically as follows:

S32: Perform association rule analysis on the data obtained in step S31, and use rule constraints and interestingness constraints to obtain the strong association rules related to coronary artery lesions complicating Kawasaki disease;

8. The method for predicting risk factors for coronary artery lesions complicating Kawasaki disease based on data mining technology according to claim 7, characterized in that the strong association rules in step S32 are obtained as follows:

Establish association rules X → Y, where X is the condition and includes at least one predictor variable, and Y is the result and includes one classification grade of coronary artery lesions;

Set the minimum confidence and the minimum support;

When the support and the confidence of an association rule exceed the minimum support and the minimum confidence respectively, the rule is a strong association rule.

9. The method for predicting risk factors for coronary artery lesions complicating Kawasaki disease based on data mining technology according to any one of claims 3-8, characterized in that the random forest risk prediction model in step S4 is built and evaluated as follows:

Adjust the parameter mtry (the number of candidate split attributes) and the parameter ntree (the number of decision trees generated), observe how the model's prediction error changes with ntree, determine the optimal number of trees in the random forest from this, and build the random forest risk prediction model;