CN112102945B

CN112102945B - Device for predicting severe condition of COVID-19 patient

Info

Publication number: CN112102945B
Application number: CN202011235506.2A
Authority: CN
Inventors: 罗嘉庆; 周凌云; 冯韵宇; 陈子蝶; 郭姝瑾
Original assignee: University of Electronic Science and Technology of China
Current assignee: University of Electronic Science and Technology of China
Priority date: 2020-11-09
Filing date: 2020-11-09
Publication date: 2021-02-05
Anticipated expiration: 2040-11-09
Also published as: CN112102945A

Abstract

The invention discloses a device for predicting the severity of a COVID-19 patient, belonging to the technical field of intelligent processing of medical data. The device comprises: an input module for inputting patient information; a data preprocessing module for preprocessing the data output by the input module, which sends the processing result to the feature selection module if the data is training data and to the prediction processing module if the data is to be predicted; a feature selection module, which selects a certain number of features from the input features as the input feature selection result; and a prediction processing module, which inputs the feature information of the patient into a preset prediction model and sends the prediction result to the prediction result output module for visual output. The invention selects key features from the patient's blood test results to ensure accurate prediction of the severity of a COVID-19 patient, provides medical assistance for the rapid triage of COVID-19 patients, and helps to optimize medical resources and to enable timely medical intervention.

Description

Device for predicting severe condition of COVID-19 patient

Technical Field

The invention belongs to the technical field of intelligent processing of medical data, and particularly relates to a device for predicting the severity of a COVID-19 patient.

Background

Currently, more than 20 million people worldwide have been infected with the novel coronavirus SARS-CoV-2, and 6 million are receiving treatment. This poses a great threat to the health and lives of people worldwide and puts unprecedented pressure on medical systems.

Most COVID-19 patients are mild/moderate cases and recover on their own. However, about 14% of patients are severe cases and 5% are critical cases. Severe/critical cases often develop Acute Respiratory Distress Syndrome (ARDS) or Multiple Organ Dysfunction Syndrome (MODS) within 2 weeks after infection, which consumes substantial medical resources and leads to a higher fatality rate (up to 49%). Early prediction of the severity of COVID-19 allows rapid triage of patients (i.e., home isolation, hospitalization, ICU allocation, etc.), which helps to optimize the use of medical resources and to enable timely medical intervention.

Most patients with suspicious symptoms first visit the fever clinic of a community hospital. They generally undergo 4 initial tests: SARS-CoV-2 RNA test, blood biochemical test and chest computed tomography (CT) scan. The first test determines whether a patient is infected with SARS-CoV-2. The latter 3 tests are used to predict the severity of COVID-19. However, since the resources of community hospitals are limited, completing all four examinations in a short time is constrained in many ways (for example, by the capacity of waiting rooms, the waiting time for examination results, and the sterilization of examination instruments). Therefore, how to make an accurate prediction using the simplest and fastest test is an urgent and challenging problem.

Of all initial tests, blood tests are the most common and will typically yield results within 2 hours. The inventors of the present invention, in carrying out the present invention, have discovered that an attempt can be made to select key features from blood test results to quickly and accurately predict the severity of COVID-19 patients, thereby helping to optimize the use of medical resources and to perform medical interventions in a timely manner.

Disclosure of Invention

The object of the invention is as follows: in view of the existing problems, a device for predicting the severity of a COVID-19 patient is provided, which assists in the rapid triage of COVID-19 patients, optimizes the use of medical resources, and enables timely medical intervention.

The invention discloses a device for predicting the severity of a COVID-19 patient, which comprises an input module, a data preprocessing module, a feature selection module, a prediction processing module and a prediction result output module, wherein:

the input module is used for inputting patient information, and if the current data is training data, the input patient information comprises patient personal information, blood detection information and severity; if the current data is the data to be predicted, the input patient information comprises patient personal information and blood detection information;

the data preprocessing module is used for preprocessing the data output by the input module, performing different processing on the training data and the data to be predicted, sending the processing result of the training data to the feature selection module and sending the processing result of the data to be predicted to the prediction processing module;

the characteristic selection module selects T characteristics from the input characteristics as an input characteristic selection result, wherein T is more than or equal to 1;

the prediction processing module inputs the characteristic information of the patient into a preset prediction model and sends a prediction result to the prediction result output module;

the prediction result output module is used for visually outputting the prediction result;

the data preprocessing module is used for specifically processing the training data and the data to be predicted:

if the current data is training data, executing the following preprocessing steps:

respectively taking the specified items in the personal information of the patient as input characteristic items, respectively taking each item in the blood detection information as an input characteristic item, and taking the severity as an output characteristic item; obtaining a feature table based on all input feature items and output feature items;

defining x to represent an input feature index, X to represent the input feature index set, y to represent an output feature index, and Y to represent the output feature index set;

calculating a correlation value between any two characteristics in the characteristic table to obtain a correlation matrix R;

calculating a P value between any two characteristics in the characteristic table to obtain a P value matrix P;

preprocessing a correlation matrix R:

for x ∈ X and y ∈ Y, if the elements of matrix P satisfy P[x, y] = P[y, x] > α, let R[x, y] = R[y, x] = 0;

for i, x ∈ X, if P[x, i] = P[i, x] > α, let R[x, i] = R[i, x] = 1; wherein the threshold α is a preset value;

sending the feature tables of a plurality of patients, an input feature index set X, an output feature index set Y and the preprocessed correlation matrix R to a feature selection module;

if the current data is the data to be predicted, executing the following preprocessing steps:

and based on the input feature selection result sent by the feature selection module, reading the matched information from the data to be predicted to generate the feature information of the current patient, and sending the feature information of the patient to the prediction processing module.

Further, when determining the input feature selection result, the feature selection module defines the feature selection as a multi-standard decision problem of the correlation between the input features and the correlation between the input and output features, and obtains the input feature selection result based on the solution of the multi-standard decision problem.

Further, the feature selection module determines that the input feature selection result specifically is:

step 1: acquiring a marking feature set L:

step T1: initializing a marking characteristic set L as an empty set;

step T2: judging whether the input feature index set X is empty or not; if not, go to step T3; if yes, executing the step 2 based on the current marking characteristic set L;

step T3: updating the marking feature set L:

step T301: judging whether |X| > min{m-1, ⌈β × m⌉}; if so, sorting the elements of the union of the marking feature set L and the output feature index set Y in ascending order to obtain a sequence $(r_i)_{i=1}^{k}$ with $k = |L| + n$, and executing step T302; wherein m represents the number of input feature items, n represents the number of output feature items, and the value range of the parameter β is [0.6, 0.8];

otherwise, directly sorting the elements of the set L in ascending order to form a sequence $(r_i)_{i=1}^{k}$ with $k = |L|$, and executing step T302;

step T302: sorting the elements of the input feature index set X in ascending order to form a sequence $(c_j)_{j=1}^{q}$ with $q = |X|$;

step T303: extracting a sub-matrix E from the correlation matrix R, wherein the elements of the sub-matrix E are: $E[i, j] = R[r_i, c_j]$;

and the worst condition $w_i$ and the optimal condition $b_i$ of the elements $E[i, j]$ are respectively: [formula not reproduced in the source text];

calculating the similarity $s_j$ of each column of the matrix E, denoting the column index corresponding to the maximum similarity as $j^* = \arg\max_j s_j$, adding the element $c_{j^*}$ to the marking feature set L while deleting the element $c_{j^*}$ from the input feature index set X, and then returning to step T2;

the similarity $s_j$ is specifically calculated as:

$$s_j = \frac{d_{wj}}{d_{wj} + d_{bj}}$$

wherein the first Euclidean distance is

$$d_{wj} = \sqrt{\sum_{i=1}^{k} \left(E[i, j] - w_i\right)^2}$$

and the second Euclidean distance is

$$d_{bj} = \sqrt{\sum_{i=1}^{k} \left(E[i, j] - b_i\right)^2}$$

the parameters k and q respectively represent the number of rows and the number of columns of the matrix E, and j = 1, …, q;

step 2: and (3) selecting the features in the marking feature set L:

starting from the first feature of the marking feature set L, and combining in a mode of adding one feature each time in sequence to obtain a plurality of combined features; and then, carrying out classification performance test on the features of each combination according to a preset classifier model, and selecting the combination with the best classification performance test as an input feature screening result.
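As an illustration of step 2, the following minimal Python sketch evaluates the prefixes of an already ranked feature list and keeps the best one. The function name `select_best_prefix` and the `score` callback (standing in for the classification performance test of a preset classifier) are hypothetical, and keeping the shorter prefix on a tie is an assumption that the text does not state.

```python
def select_best_prefix(ranked, score):
    """Evaluate ranked[:1], ranked[:2], ... and return (best_prefix, best_score).

    `score` is a hypothetical callback that trains and tests a classifier on
    the given feature list and returns a number where larger is better.
    """
    best_prefix, best_score = None, float("-inf")
    for size in range(1, len(ranked) + 1):
        prefix = ranked[:size]
        current = score(prefix)
        # Strict '>' keeps the shorter prefix when two prefixes tie (assumption).
        if current > best_score:
            best_prefix, best_score = prefix, current
    return best_prefix, best_score
```

Because the combinations are nested prefixes, only as many classifier runs as there are marked features are needed, rather than one per subset.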

Further, the feature selection module performs classification performance testing on the features of each combination by using naive Bayes classification.

Further, the feature selection module sets the input feature selection result as: age, white blood cell count, and lymphocyte count, or set to: age, neutrophil count, and lymphocyte count.

Further, the feature selection module performs classification performance testing on the features of each combination based on the classification accuracy, and selects the combination with the highest classification accuracy as an input feature screening result.

In summary, due to the adoption of the technical scheme, the invention has the beneficial effects that:

the invention selects key characteristics from the blood detection result of the patient to achieve the aim of quickly and accurately predicting the severity of the patient with COVID-19; the device for predicting the serious condition of the COVID-19 patient is based on the processing of small sample data, can obtain a relatively stable result, can find out a combination with higher accuracy, and simultaneously has the visualization and interpretability of the characteristic selection process, thereby meeting the requirements of medical clinic.

Drawings

FIG. 1 is an exemplary diagram of a correlation matrix R for a COVID-19 data set, according to an embodiment;

FIG. 2 is an exemplary diagram of the P-value matrix P of the data set according to an embodiment (the P-value is a parameter used to judge the outcome of a hypothesis test, i.e., the probability of observing a result at least as extreme as the sample observation when the null hypothesis is true);

FIG. 3 is a diagram illustrating an example of a correlation matrix after preprocessing, in accordance with an embodiment;

FIG. 4 is a diagram illustrating a characteristic ordering process of a COVID-19 data set according to an embodiment;

FIG. 5 is a schematic diagram of feature ordering in an embodiment;

FIG. 6 is a graphical illustration of a predicted performance evaluation of the present invention in an exemplary embodiment;

FIG. 7 is a diagram illustrating an average feature number according to an embodiment;

FIG. 8 is a graph illustrating average performance comparison in an embodiment.

Detailed Description

In order to make the objects, technical solutions and advantages of the present invention more apparent, the present invention will be described in further detail with reference to the following embodiments and accompanying drawings.

The invention aims to enable rapid triage of patients through early prediction of the severity of COVID-19, thereby improving the utilization of medical resources and providing timely medical intervention. The device for predicting the severity of a COVID-19 patient (the severity prediction device for short) selects key features from blood test results so as to rapidly and accurately predict the severity of the COVID-19 patient. The present invention first defines feature selection as a multi-criteria decision-making (MCDM) problem that considers both the correlation between input features and the correlation between input and output features, and then combines the Technique for Order of Preference by Similarity to Ideal Solution (TOPSIS) with a naive Bayes (NB) classifier to achieve the highest prediction accuracy with the fewest features. Preliminary results indicate that the present invention requires only 3 features (i.e., age, white blood cell count (WBC)/neutrophil count (NEUT), and lymphocyte count (LYMC)), even when the impact of dataset uncertainty on machine learning model predictions is considered.

In the specific embodiment, COVID-19 cases diagnosed at Wuhan Red Cross Hospital from 1 February 2020 to 15 March 2020 according to WHO (World Health Organization) guidelines were collected. As shown in Table 1, the data set contains 9 features, including 8 input features (age, sex, white blood cell count (WBC), lymphocyte count (LYMC), lymphocyte ratio (LYMPH), neutrophil count (NEUT), neutrophil ratio (NEU) and neutrophil-to-lymphocyte ratio (NLR)) and 1 output feature (severity).

Table 1: clinical features of COVID-19 cases

According to the National Health Commission of China's "Diagnosis and Treatment Protocol for COVID-19 (Trial Version 5)", the cases are divided into 4 types:

(1) mild cases: mild clinical symptoms but no imaging manifestations of pneumonia;

(2) moderate cases: with fever, respiratory symptoms and pneumonic image manifestations;

(3) severe cases: any one of the following: respiratory distress with respiratory rate (RR) > 30/min, oxygen saturation at rest < 93%, or PaO2/FiO2 < 300 mmHg (1 mmHg = 0.133 kPa);

(4) critical cases: any one of the following: respiratory failure requiring mechanical ventilation, shock, or other organ failure requiring intensive care in the ICU.

In order to reduce the time overhead of the prediction process and improve the accuracy of prediction of the severity of COVID-19 (mild/moderate or severe/critically severe cases), the severity prediction device of the present invention reduces the severity types to two types in the present embodiment: the first type is: mild and/or moderate; the second type is: severe and/or critical illness; that is, the severity prediction device according to the present invention can predict whether or not a COVID-19 patient is in a severe state (including a critically ill state) quickly.
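The reduction from four case types to two can be sketched as a simple label mapping. The string labels and the 0/1 encoding below are hypothetical choices for illustration; the text only fixes which types are grouped together.

```python
# Hypothetical label names; the text only specifies the grouping into two types.
SEVERE_TYPES = {"severe", "critical"}
NON_SEVERE_TYPES = {"mild", "moderate"}


def binarize_severity(case_type: str) -> int:
    """Return 1 for severe/critical cases and 0 for mild/moderate cases."""
    label = case_type.strip().lower()
    if label in SEVERE_TYPES:
        return 1
    if label in NON_SEVERE_TYPES:
        return 0
    raise ValueError(f"unknown case type: {case_type!r}")
```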

The severity prediction device comprises an input module, a data preprocessing module, a feature selection module, a prediction processing module and a prediction result output module.

The input module is used for inputting patient information. If the current data is training data, the input patient information comprises patient personal information (name, age, sex, etc.), blood test information and severity (that is, patients are classified by disease severity, and a corresponding severity metric value is set for each type); if the current data is data to be predicted, the input patient information comprises patient personal information and blood test information.

The data preprocessing module is used for preprocessing the data output by the input module, and processes training data and data to be predicted differently. Training data is refined, mainly to remove noise from the collected raw data. For data to be predicted, partial information (the name, and the items of the blood test information that match the input feature selection result) is extracted to generate the feature information of the current patient, which is sent to the prediction processing module.

The feature selection module ranks and screens the features. Feature ranking orders the features by the value of a scoring function, which usually measures feature relevance; feature selection aims to select a small subset of relevant features from the original features by removing irrelevant, redundant or noisy features.

The prediction processing module inputs the feature information of the patient into a preset prediction model (already trained) and sends the prediction result to the prediction result output module for visual output; that is, the currently input information to be predicted is classified for severity based on the selected features and the preset prediction model, and the prediction result is output visually. Meanwhile, in order to verify the prediction performance of the severity prediction device, its binary classification performance is also measured by statistical measures: accuracy (ACC), sensitivity (TPR), false positive rate (FPR), and F1 score (the harmonic mean of precision and recall, with a maximum value of 1 and a minimum value of 0; a larger value indicates a better model).

The concrete implementation processes of the prediction processing and the prediction performance evaluation of the invention are as follows:

(1) Preprocessing.

In this embodiment, the data set is randomly divided into 2 subsets: training set (50%) and test set (50%). In the four stages of this embodiment, only the test set is used for performance evaluation.

Assuming that there are m input features and n output features, let X = {x | 1 ≤ x ≤ m} be the input feature set and Y = {y | m+1 ≤ y ≤ m+n} be the output feature set, where the elements x and y are feature indices. The full feature set is F = X ∪ Y = {i | 1 ≤ i ≤ m+n}. An (m+n) × (m+n) correlation matrix R and an (m+n) × (m+n) P-value matrix P are calculated and visualized to show the correlation between all feature pairs.
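A minimal sketch of how a correlation matrix over all features could be computed is given below. The text does not say which correlation coefficient is used; Pearson correlation is assumed here purely for illustration, the function names are hypothetical, and the P-value matrix (which requires a statistical test) is not computed.

```python
import math


def pearson(a, b):
    """Pearson correlation coefficient of two equal-length numeric sequences.

    Raises ZeroDivisionError if either sequence is constant.
    """
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((u - mean_a) * (v - mean_b) for u, v in zip(a, b))
    var_a = sum((u - mean_a) ** 2 for u in a)
    var_b = sum((v - mean_b) ** 2 for v in b)
    return cov / math.sqrt(var_a * var_b)


def correlation_matrix(columns):
    """columns[i] holds the values of feature i across patients.

    Returns a square, symmetric matrix as a list of lists.
    """
    size = len(columns)
    return [[pearson(columns[i], columns[j]) for j in range(size)]
            for i in range(size)]
```

With m input columns followed by n output columns, the result has the (m+n) × (m+n) shape described above.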

To simplify the data throughput, the correlation matrix R is preprocessed in two steps.

Step 1: ignoring the sign of R[i, j], let R[i, j] = |R[i, j]|, so that the range of R[i, j] changes from [-1, 1] to [0, 1], where i, j ∈ F.

Step 2: R is filtered by P.

For x ∈ X and y ∈ Y, if P[x, y] = P[y, x] > α, then R[x, y] and R[y, x] can be ignored, i.e., let R[x, y] = R[y, x] = 0. For i, x ∈ X, if P[x, i] = P[i, x] > α, let R[x, i] = R[i, x] = 1. In general, the threshold α may be set to 0.01 or 0.05, preferably 0.05.

Based on the personal information (sex and age), blood test information, and severity (severe or not) of the patients given in Table 1, the number of input features is m = 8 and the number of output features is n = 1, giving the 9 × 9 correlation matrix R shown in Fig. 1 and the 9 × 9 P-value matrix P shown in Fig. 2.

After the correlation matrix R is preprocessed, the specific value of each element R[i, j] of the preprocessed correlation matrix R is as shown in Fig. 3, where i, j ∈ F and R[i, j] lies in [0, 1].

Since P[1, 9] = P[9, 1] = 0.3865 > 0.05 and P[3, 9] = P[9, 3] = 0.1055 > 0.05, R[1, 9], R[9, 1], R[3, 9] and R[9, 3] are negligible, i.e., are set to 0. As can also be seen from Fig. 3, R[1, 9] = R[9, 1] = R[3, 9] = R[9, 3] = 0, R[1, 1:8] = R[3, 1:8] = the 1 × 8 all-ones vector, and R[1:8, 1] = R[1:8, 3] = the 8 × 1 all-ones vector.
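The two preprocessing steps can be sketched in Python as follows. This is an illustrative reading of the rules as written, not the patented implementation: the function name is hypothetical, matrices are plain lists of lists, and indices are 0-based (inputs 0..m-1, outputs m..m+n-1) rather than the 1-based indices used in the text.

```python
def preprocess_correlation(R, P, m, n, alpha=0.05):
    """Return a filtered copy of R; R itself is left unchanged.

    Indices are 0-based: inputs are 0..m-1 and outputs are m..m+n-1.
    """
    size = m + n
    # Step 1: drop the sign so that every entry lies in [0, 1].
    out = [[abs(R[i][j]) for j in range(size)] for i in range(size)]
    inputs = range(m)
    outputs = range(m, size)
    # Step 2, input-output pairs: a pair with P > alpha is set to 0.
    for x in inputs:
        for y in outputs:
            if P[x][y] > alpha:
                out[x][y] = out[y][x] = 0.0
    # Step 2, input-input pairs: a pair with P > alpha is set to 1.
    for x in inputs:
        for i in inputs:
            if P[x][i] > alpha:
                out[x][i] = out[i][x] = 1.0
    return out
```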

(2) Feature ranking.

A set of labeling features L is defined and initialized to L = ∅.

The input features x ∈ X are ranked iteratively, and in each iteration the top-ranked feature is moved from X to L. The ranking criterion includes 2 evaluation items:

Evaluation item 1 (EVAL 1): the correlation between an input feature x ∈ X and an output feature y ∈ Y, R[x, y] or R[y, x].

Evaluation item 2 (EVAL 2): the correlation between an input feature x ∈ X and a labeled feature v ∈ L, R[x, v] or R[v, x]. In this way, multiple conflicting criteria are evaluated in the decision.

The present invention solves this multi-criteria decision-making (MCDM) problem with the Technique for Order of Preference by Similarity to Ideal Solution (TOPSIS), a compensatory aggregation method that first creates an evaluation matrix E containing k conditions and q alternatives in order to rank the input features. According to the Pareto principle, x is classified into the following 2 types:

Type 1:

If |X| > min{m-1, ⌈β × m⌉}, then the input feature x to be labeled is a core feature, which should have the lowest R[v, x] from evaluation item 2 and the highest R[y, x] from evaluation item 1. The elements of the sets L ∪ Y and X are sorted in ascending order to obtain the sequences $(r_i)_{i=1}^{k}$ and $(c_j)_{j=1}^{q}$, where k = |L| + n and q = |X|. The value range of the parameter β is [0.6, 0.8], preferably 0.8, i.e., the first 20% of the input features are core features.

Referring to Fig. 4, when |X| = 8 > min{8-1, ⌈0.8 × 8⌉} = 7, L ∪ Y = ∅ ∪ {9} = {9}, so $(r_i)_{i=1}^{1} = (9)$ and $(c_j)_{j=1}^{8} = (1, \ldots, 8)$. Since k = |L| + n = 1 and q = |X| = 8, E is a 1 × 8 sub-matrix of R.

Type 2:

If |X| ≤ min{m-1, ⌈β × m⌉}, the x to be labeled is an auxiliary feature (the remaining 80%), and only the lowest R[v, x] from evaluation item 2 is required.

The elements of the sets L and X are sorted in ascending order to obtain the sequences $(r_i)_{i=1}^{k}$ and $(c_j)_{j=1}^{q}$, where k = |L| and q = |X|.

As can be seen from Fig. 4, when |X| = 5 < 7, L = {2, 6, 4} and X = {1, 3, 5, 7, 8}, so $(r_i)_{i=1}^{3} = (2, 6, 4)$ and $(c_j)_{j=1}^{5} = (1, 3, 5, 7, 8)$. Since k = |L| = 3 and q = |X| = 5, E is a 3 × 5 sub-matrix of R.
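How the evaluation sub-matrix E is assembled in the two cases can be sketched as follows. The function name and argument layout are hypothetical; feature indices are 1-based as in the text, while the matrix R is stored as a 0-based list of lists.

```python
import math


def evaluation_matrix(R, labeled, remaining, outputs, m, beta=0.8):
    """Build E[i][j] = R[r_i][c_j] for the current iteration.

    labeled, remaining and outputs hold 1-based feature indices (L, X, Y).
    Returns (rows, cols, E), where rows and cols are sorted ascending.
    """
    threshold = min(m - 1, math.ceil(beta * m))
    if len(remaining) > threshold:
        rows = sorted(set(labeled) | set(outputs))  # type 1: rows from L and Y
    else:
        rows = sorted(labeled)                      # type 2: rows from L only
    cols = sorted(remaining)
    E = [[R[r - 1][c - 1] for c in cols] for r in rows]
    return rows, cols, E
```

With m = 8, an empty labeled set, eight remaining features and output index 9, this yields one row and eight columns, matching the 1 × 8 case above. Sorting only fixes the row and column order; it does not change the per-column Euclidean distances computed from E afterwards.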

The L2 distance (Euclidean distance) between alternative j and the worst condition is calculated according to formula (1):

$$d_{wj} = \sqrt{\sum_{i=1}^{k} \left(E[i, j] - w_i\right)^2} \qquad (1)$$

The L2 distance between alternative j and the optimal condition is then calculated according to formula (2):

$$d_{bj} = \sqrt{\sum_{i=1}^{k} \left(E[i, j] - b_i\right)^2} \qquad (2)$$

The similarity to the worst condition is then calculated according to formula (3):

$$s_j = \frac{d_{wj}}{d_{wj} + d_{bj}} \qquad (3)$$

$s_j = 1$ only when alternative j is in the optimal condition, and $s_j = 0$ only when alternative j is in the worst condition. Let $j^* = \arg\max_j \{s_j\}$; then $X = X \setminus \{c_{j^*}\}$ and $L = L \cup \{c_{j^*}\}$.
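The three formulas translate directly into a small Python function. This is a sketch: the name `topsis_similarity` is hypothetical, and the per-row worst and optimal values are passed in by the caller instead of being derived inside the function.

```python
import math


def topsis_similarity(E, worst, best):
    """Return the similarity s_j for every column j of E.

    E is a k x q list of lists; worst[i] and best[i] are the worst and optimal
    values for row i. They must differ, otherwise d_w + d_b can be zero.
    """
    k, q = len(E), len(E[0])
    scores = []
    for j in range(q):
        d_w = math.sqrt(sum((E[i][j] - worst[i]) ** 2 for i in range(k)))  # (1)
        d_b = math.sqrt(sum((E[i][j] - best[i]) ** 2 for i in range(k)))   # (2)
        scores.append(d_w / (d_w + d_b))                                   # (3)
    return scores
```

The column to label next is then the one with the largest score, for example `scores.index(max(scores))`.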

For example, as shown in Fig. 4, when |X| = 8 > 7, $w_i = 1$ and $b_i = 0$. From formulas (1) and (2), $d_{w2} = 0.5251$ and $d_{b2} = 0.4749$. From formula (3), $s_2 = 0.5251$. When |X| = 5 < 7, $w_i = 1$ and $b_i = 0$. From formulas (1) and (2), $d_{w8} = 0.9685$ and $d_{b8} = 0.8615$. From formula (3), $s_8 = 0.5293$.
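As a check on the arithmetic, substituting the two reported pairs of distances into formula (3) gives the following (first case on the left, second case on the right):

```latex
\frac{0.5251}{0.5251 + 0.4749} = 0.5251,
\qquad
\frac{0.9685}{0.9685 + 0.8615} = \frac{0.9685}{1.8300} \approx 0.5292
```

The second value agrees with the reported 0.5293 to within the rounding of the four-digit distances.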

Namely, the invention marks a plurality of characteristics of patients based on MCDM, and obtains the specific realization process of the marked characteristic set as follows:

step S1: acquiring patient characteristics as input characteristics, acquiring prediction types as output characteristics, and acquiring a characteristic set based on all the input characteristics and the output characteristics;

obtaining a correlation matrix R for any two characteristics in the characteristic set based on a correlation value between the characteristics, wherein the dimensionality of the correlation matrix R is (m + n) x (m + n), m represents the number of input characteristics, and n represents the number of output characteristics;

for any two features in the feature set, obtaining a matrix P with dimensions of (m + n) × (m + n) based on a P value between the features;

setting an input feature index set X = {x | 1 ≤ x ≤ m} and an output feature index set Y = {y | m+1 ≤ y ≤ m+n};

initializing a marking characteristic set L as an empty set;

step S2: preprocessing a correlation matrix R:

the elements of the correlation matrix R are set to: R[i, j] = |R[i, j]|, where i and j denote the row and column of the correlation matrix R, respectively;

filtering the correlation matrix R based on the matrix P: for x ∈ X and y ∈ Y, if P[x, y] = P[y, x] > 0.05, let R[x, y] = R[y, x] = 0; for u ∈ X and x ∈ X, if P[x, u] = P[u, x] > 0.05, let R[x, u] = R[u, x] = 1;

step S3: judging whether the set X is empty; if yes, go to step S5; otherwise, executing step S4;

step S4: updating the marking feature set L:

step S401: judging whether |X| > min{m-1, ⌈β × m⌉}; if so, that is, when the number of elements in the set X is greater than min{m-1, ⌈β × m⌉}, sorting the elements of L ∪ Y in ascending order to form a sequence $(r_i)_{i=1}^{k}$, where k = |L| + n; otherwise, sorting the elements of L in ascending order to form a sequence $(r_i)_{i=1}^{k}$, where k = |L|; then performing step S402;

step S402: sorting the elements of the set X in ascending order to form a sequence $(c_j)_{j=1}^{q}$, where q = |X|;

step S403: extracting a sub-matrix E from the correlation matrix R, wherein the elements of the sub-matrix E are: $E[i, j] = R[r_i, c_j]$;

calculating the similarity $s_j$ of each column of the matrix E, denoting the column index corresponding to the maximum similarity as $j^*$, adding the element $c_{j^*}$ to the marking feature set L while deleting the element $c_{j^*}$ from the input feature index set X, and then returning to step S3;

step S5: and obtaining and outputting a marking feature set L.
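Steps S3 to S5 can be sketched end to end as below, taking an already preprocessed R as input. Two points are assumptions rather than statements of the source: rows taken from L use worst condition 1 and optimal condition 0 (the values quoted in the worked example), while rows taken from Y use worst condition 0 and optimal condition 1 (inferred from the requirement that a core feature have the highest R[y, x]). The function name is hypothetical.

```python
import math


def rank_features(R, m, n, beta=0.8):
    """Return the labeling order of the input features (1-based indices).

    R is the preprocessed (m+n) x (m+n) matrix as a 0-based list of lists.
    """
    remaining = list(range(1, m + 1))          # X, kept in ascending order
    outputs = list(range(m + 1, m + n + 1))    # Y
    labeled = []                               # L, in labeling order
    threshold = min(m - 1, math.ceil(beta * m))
    while remaining:                           # step S3
        rows = sorted(labeled)                 # step S401, type 2
        if len(remaining) > threshold:
            rows = sorted(labeled + outputs)   # step S401, type 1
        # Assumed worst/optimal values: 1/0 for rows in L, 0/1 for rows in Y.
        worst = [0.0 if r in outputs else 1.0 for r in rows]
        best = [1.0 - w for w in worst]
        scores = []
        for c in remaining:                    # steps S402 and S403
            col = [R[r - 1][c - 1] for r in rows]
            d_w = math.sqrt(sum((e - w) ** 2 for e, w in zip(col, worst)))
            d_b = math.sqrt(sum((e - b) ** 2 for e, b in zip(col, best)))
            scores.append(d_w / (d_w + d_b))
        chosen = remaining[scores.index(max(scores))]
        remaining.remove(chosen)
        labeled.append(chosen)
    return labeled                             # step S5
```

Since the threshold is at most m-1, the first iteration always includes the output rows, so the row set is never empty as long as n ≥ 1.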

Referring to Fig. 4, the labeling order of the input features is (2, 6, 4, 7, 8, 5, 1, 3). If only evaluation item 1 is considered, i.e., x ∈ X is ordered by the statistically significant R[x, y], another sequence (2, 5, 4, 7, 8, 6) results, as shown in Fig. 5. As can be seen from Fig. 3, although R[5, 9] = 0.3526 > R[6, 9] = 0.2179, R[5, 2] = 0.2471 > R[6, 2] = 0.06803 and R[5, 4] = 0.7023 > R[6, 4] = 0.2827. This indicates that {2, 5, 4} may include redundant features that do not contribute independently to the prediction.

(3) Feature selection.

The goal of feature subset selection is to find the best input feature subset. The number of labeled features is gradually increased and the model is trained using a naive bayes classifier in turn. To find the best subset, the accuracy of the training model is tested sequentially on the training set. Fig. 5 shows that when 4 features {2,5,4,7} are selected, the accuracy of evaluation item 1 reaches a peak of 0.765. And when fewer features 2,6,4 are used, the accuracy of the evaluation term 1 plus the evaluation term 2 can reach a higher 0.816.
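For illustration, a naive Bayes classifier over numeric features can be written in a few lines. The text does not say which naive Bayes variant is used; a Gaussian likelihood per feature is assumed here, and the function names and the small variance floor are hypothetical choices.

```python
import math
from collections import defaultdict


def fit_gaussian_nb(rows, labels):
    """Estimate the class prior and per-feature mean/variance for each class."""
    grouped = defaultdict(list)
    for row, label in zip(rows, labels):
        grouped[label].append(row)
    model = {}
    for label, members in grouped.items():
        stats = []
        for values in zip(*members):
            mean = sum(values) / len(values)
            var = sum((v - mean) ** 2 for v in values) / len(values)
            stats.append((mean, var + 1e-9))  # small floor avoids division by zero
        model[label] = (len(members) / len(rows), stats)
    return model


def predict_gaussian_nb(model, row):
    """Return the class with the highest log-posterior for one feature vector."""
    def log_posterior(label):
        prior, stats = model[label]
        total = math.log(prior)
        for value, (mean, var) in zip(row, stats):
            total += -0.5 * math.log(2 * math.pi * var)
            total -= (value - mean) ** 2 / (2 * var)
        return total
    return max(model, key=log_posterior)
```

Training on the columns of the first one, two, three, ... labeled features and recording the accuracy of each model gives the curve from which the best subset is read off.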

(4) Prediction processing and output.

The prediction processing module and the prediction result output module are realized based on the invention. The prediction processing module of the invention is preset with a trained prediction model (such as a classifier model adopted in feature selection), and only the feature information of the patient is required to be input into the classifier model, so that the current prediction result of the serious condition of the patient is output and obtained based on the classification result; the prediction model in the prediction processing module is not specifically limited, any conventional classifier model can be adopted, and the adopted classifier model is subjected to learning training to obtain the prediction model meeting the training requirement. The prediction result output module can output the corresponding prediction result in a mode of graphics, characters, light or the like.

(5) Performance evaluation.

In the present embodiment, based on the test set, the accuracy (ACC), sensitivity (TPR), false positive rate (FPR), and F1 score are used as measures of the predictive performance of the features. Fig. 6 shows the prediction performance obtained with different feature subsets. As shown in Fig. 6, {2, 6, 4} has the fewest features but scores highest on several performance indicators. Based on Fig. 6, the accuracies of {2, 5, 4, 7, 8, 6}, {2, 5, 4, 7} and {2, 6, 4} are 0.7959, 0.8469 and 0.8673, respectively, and their F1 scores are 0.7561, 0.7761 and 0.806, respectively.
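The four measures follow from the confusion-matrix counts in the usual way. The sketch below uses their standard definitions, with the severe/critical class taken as the positive class (an assumption about labeling); the function name is hypothetical, and no guard against empty classes is included.

```python
def binary_metrics(tp, fp, tn, fn):
    """Return ACC, TPR, FPR and F1 from confusion-matrix counts.

    tp/fn count positive (severe) cases, tn/fp count negative cases.
    """
    acc = (tp + tn) / (tp + fp + tn + fn)
    tpr = tp / (tp + fn)            # sensitivity, also called recall
    fpr = fp / (fp + tn)
    precision = tp / (tp + fp)
    f1 = 2 * precision * tpr / (precision + tpr)
    return {"ACC": acc, "TPR": tpr, "FPR": fpr, "F1": f1}
```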

In this embodiment, 306 collected cases of COVID-19 are divided into two groups: 141 moderate cases and 165 severe/critical cases. The blood test results of the two groups are shown in Table 1.

To test the prediction stability of the severity prediction device and to observe the effect of dataset uncertainty on feature selection, the random division of the dataset (50% training set and 50% test set) was repeated for 100 runs. Fig. 7 shows the average number of features selected under 3 different criteria: EVAL1, EVAL2 (subset) and EVAL1 + EVAL2 (subset) select 6.29 (95% CI (confidence interval): 6.13-6.45), 3.11 (95% CI: 2.79-3.43) and 2.98 (95% CI: 2.81-3.15) features, respectively. As can be seen from Fig. 8, the criterion EVAL1 + EVAL2 (subset) used by the severity prediction device improves most performance indicators. The indicators (ACC, TPR, FPR and F1 score) of EVAL1 + EVAL2 (subset) are 0.803 (95% CI: 0.794-0.812), 0.685 (95% CI: 0.673-0.697), 0.117 (95% CI: 0.104-0.131) and 0.724 (95% CI: 0.71-0.739), respectively, while those of EVAL1 are 0.75 (95% CI: 0.741-0.76), 0.599 (95% CI: 0.583-0.616), 0.093 (95% CI: 0.083-0.103) and 0.698 (95% CI: 0.688-0.708), respectively. Referring to Fig. 8, although feature selection is affected by dataset uncertainty, it is dominated by the 2 subsets {Age, NEUT, LYMC} and {Age, WBC, LYMC}, which are selected in up to 31% of runs. These two subsets achieve high accuracy with a small number of features.
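A mean with a 95% confidence interval over repeated runs could be summarized as in the sketch below. The text does not state how its intervals were computed; the normal approximation (mean ± 1.96 standard errors) used here is an assumption, and the function name is hypothetical.

```python
import math
import statistics


def mean_with_ci95(values):
    """Return (mean, lower, upper) using the normal approximation.

    The interval is mean +/- 1.96 * sample standard deviation / sqrt(n).
    """
    mean = statistics.fmean(values)
    se = statistics.stdev(values) / math.sqrt(len(values))
    return mean, mean - 1.96 * se, mean + 1.96 * se
```

Applied to the per-run accuracies of 100 random splits, this yields one summary triple per criterion.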

Furthermore, based on current treatment experience, appropriate intervention in the first and second weeks of the disease is important to prevent disease progression and reduce mortality. Previous studies have shown that the severity of COVID-19 is closely related to the age, underlying diseases and general immune status of the patient. The severity prediction device needs only the patient's age and blood test results as input; based on the preset feature selection, it selects the corresponding features (WBC/NEUT, LYMC) from the blood test results for prediction processing and outputs the COVID-19 patient type of the current patient (mild/moderate or severe/critical), with a prediction accuracy of more than 80%. During the COVID-19 pandemic, this better meets clinical needs and is easier to deploy in regions with different levels of medical care. That is, the severity prediction device selects effective features from blood test results, and preliminary experimental results show that a prediction accuracy of 0.803 (95% CI: 0.794-0.812) can be achieved by selecting only 3 key features (i.e., age, white blood cell count (WBC)/neutrophil count (NEUT) and lymphocyte count (LYMC)); this high prediction accuracy (80.3% on average) is very favorable for the rapid diagnosis of COVID-19 patients. Using only the most common blood tests, medical facilities can better decide on home isolation, hospitalization, or ICU allocation for COVID-19 patients.

While the invention has been described with reference to specific embodiments, any feature disclosed in this specification may be replaced by alternative features serving the same, equivalent or similar purpose, unless expressly stated otherwise; all of the disclosed features, or all of the method or process steps, may be combined in any combination, except mutually exclusive features and/or steps.

Claims

1. A device for predicting the severity of a COVID-19 patient, characterized by comprising an input module, a data preprocessing module, a feature selection module, a prediction processing module and a prediction result output module;

preprocessing a correlation matrix R:

setting the value of an element R[x, y] of the correlation matrix R to 0 if the corresponding element of the matrix P satisfies P[x, y] = P[y, x] > α;

for i ∈ X and x ∈ X, if P[x, i] = P[i, x] > α, let R[x, i] = R[i, x] = 1; wherein the threshold α is a preset value;
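The preprocessing of the correlation matrix R recited above can be illustrated by the following sketch. It is one literal reading of the two rules and not a definitive implementation: the function name `preprocess_correlation` and the argument `input_idx` are hypothetical, both indices of the second rule are assumed to range over the input feature index set X, and the second rule is assumed to be applied after the first, so that it takes precedence for pairs of input features.

```python
def preprocess_correlation(R, P, input_idx, alpha=0.05):
    """Return an adjusted copy of R; R and P are square lists of lists.

    Rule 1: any entry whose value in P exceeds alpha is set to 0.
    Rule 2: for pairs of input features, a value in P above alpha sets
    the entry to 1 (applied after rule 1, an assumption about ordering).
    """
    n = len(R)
    out = [row[:] for row in R]  # leave the caller's matrix untouched
    for x in range(n):
        for y in range(n):
            if P[x][y] > alpha:
                out[x][y] = 0.0
    for x in input_idx:
        for i in input_idx:
            if P[x][i] > alpha:
                out[x][i] = 1.0
    return out
```

Because P is symmetric, visiting every ordered pair keeps the adjusted matrix symmetric as well.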

2. The apparatus of claim 1, wherein the feature selection module, when determining the input feature selection result, defines the feature selection as a multi-criteria decision problem over the correlation among the input features and the correlation between the input features and the output features, and obtains the input feature selection result based on a solution of the multi-criteria decision problem.

3. The apparatus of claim 1, wherein the feature selection module determines the input feature selection result as follows:

step 1: acquiring a marked feature set L:

step T1: initializing the marked feature set L as an empty set;

step T3: updating the marked feature set L:

and executing step T302; wherein m represents the number of input features, n represents the number of output features, and the value range of the parameter β is [0.6, 0.8];

And executing the step T302;

step T302: sorting the elements of the input feature index set X in ascending order to form a sequence

；

step T303: extracting a sub-matrix E from the correlation matrix R, wherein the elements of the sub-matrix E are: E[i, j] = R[r_i, c_j];

；

calculating the similarity s_j of each column of the matrix E, denoting the column index corresponding to the maximum similarity s_j as j*, adding the element c_{j*} to the marked feature set L while deleting the element c_{j*} from the input feature index set X, and then returning to step T2;

the similarity s_j is calculated as follows:

；

wherein the first Euclidean distance

，

Second Euclidean distance

，

step 2: performing feature selection on the marked feature set L:

4. The apparatus of claim 1, wherein the feature selection module sets the input feature selection result to: age, white blood cell count, and lymphocyte count, or set to: age, neutrophil count, and lymphocyte count.

5. The apparatus of claim 3, wherein the feature selection module performs a classification performance test on each combination of features using a naive Bayes classifier.
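A naive Bayes classification of the kind referred to above may be sketched as follows. The Gaussian variant is an assumption, since the text does not specify the likelihood model, and the names `fit_gaussian_nb` and `predict_gaussian_nb` are hypothetical; each row is a tuple of numeric feature values for one sample.

```python
import math
from collections import defaultdict

def fit_gaussian_nb(rows, labels):
    """Estimate a prior and per-feature (mean, variance) for each class."""
    by_class = defaultdict(list)
    for row, label in zip(rows, labels):
        by_class[label].append(row)
    model = {}
    for label, members in by_class.items():
        stats = []
        for column in zip(*members):
            mean = sum(column) / len(column)
            var = sum((v - mean) ** 2 for v in column) / len(column)
            stats.append((mean, var + 1e-9))  # small floor avoids division by zero
        model[label] = (len(members) / len(rows), stats)
    return model

def predict_gaussian_nb(model, row):
    """Return the class with the highest log posterior for one sample."""
    best_label, best_score = None, -math.inf
    for label, (prior, stats) in model.items():
        score = math.log(prior)
        for value, (mean, var) in zip(row, stats):
            score += -0.5 * math.log(2 * math.pi * var)
            score -= (value - mean) ** 2 / (2 * var)
        if score > best_score:
            best_label, best_score = label, score
    return best_label
```

The classification performance of a given combination of features could then be estimated by fitting the model on the training rows restricted to that combination and counting correct predictions on the testing rows.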

6. The apparatus of claim 3, wherein the feature selection module performs a classification performance test on the features of each combination based on the classification accuracy, and selects the combination with the highest classification accuracy as the input feature selection result.
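The selection of the combination with the highest classification accuracy may be sketched as follows. The exhaustive enumeration of all non-empty combinations is an assumption, as are the names `best_feature_combination` and `accuracy_of`; the latter stands for any caller-supplied routine that trains and tests a classifier on the given combination and returns its accuracy.

```python
from itertools import combinations

def best_feature_combination(features, accuracy_of, max_size=None):
    """Score every non-empty combination and return the most accurate one.

    `accuracy_of` maps a tuple of feature names to an accuracy in [0, 1].
    """
    max_size = max_size or len(features)
    best_combo, best_acc = None, -1.0
    for size in range(1, max_size + 1):
        for combo in combinations(features, size):
            acc = accuracy_of(combo)
            if acc > best_acc:  # strict '>' keeps the earlier, smaller combination on ties
                best_combo, best_acc = combo, acc
    return best_combo, best_acc
```

Iterating from small to large combinations with a strict comparison means that, among equally accurate combinations, the one with fewer features is retained, which is a design choice of this sketch.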

7. The apparatus of claim 1, wherein the threshold α is set to a value of 0.01 or 0.05.

8. The apparatus of claim 3, wherein the parameter β is set to 0.8.