CN114037018A

CN114037018A - Medical data classification method and device, storage medium and electronic equipment

Info

Publication number: CN114037018A
Application number: CN202111415820.3A
Authority: CN
Inventors: 何涛; 王晨; 徐赛; 李志�; 李盼; 闻英友
Original assignee: Neusoft Corp
Current assignee: Neusoft Corp
Priority date: 2021-11-25
Filing date: 2021-11-25
Publication date: 2022-02-11

Abstract

The disclosure relates to a medical data classification method and device, a storage medium and electronic equipment, which are used for intelligently examining, verifying and classifying medical data, reducing the workload of manual examination and verification and improving the efficiency of examining, verifying and classifying the medical data. The method comprises the following steps: acquiring inspection data to be classified, wherein the inspection data represents an inspection result of a medical inspection item; determining a target classification result corresponding to the inspection data based on a pre-trained inspection data classification model, wherein the target classification result is used for characterizing whether to perform a review on a medical inspection item corresponding to the inspection data, the inspection data classification model comprises a plurality of classifiers, and the inspection data classification model is used for performing weighted summation on classification results output by the plurality of classifiers aiming at the inspection data according to a classification weight corresponding to each classifier to obtain a target classification result corresponding to the inspection result data, and the classification weight is inversely related to a result error rate of the classifier.

Description

Medical data classification method and device, storage medium and electronic equipment

Technical Field

The present disclosure relates to the field of data processing technologies, and in particular, to a medical data classification method and apparatus, a storage medium, and an electronic device.

Background

Clinical test data provides results for a series of test items. However, because of factors such as the instruments and test reagents, not every test result is reliable. The test data therefore needs to be reviewed to determine whether the data from a given test is abnormal, and abnormal data needs to be rechecked to determine whether it is valid.

In the related art, test data is usually reviewed manually, which takes considerable manpower and time and makes review and classification inefficient. Moreover, different reviewers judge abnormal values subjectively and may reach inconsistent conclusions, which affects the accuracy of review and classification.

Disclosure of Invention

The purpose of the present disclosure is to provide a medical data classification method, device, storage medium, and electronic device, so as to improve the efficiency of auditing and classifying medical data.

In order to achieve the above object, in a first aspect, the present disclosure provides a medical data classification method, the method including:

acquiring inspection data to be classified, wherein the inspection data represents an inspection result of a medical inspection item;

determining a target classification result corresponding to the inspection data based on a pre-trained inspection data classification model, wherein the target classification result is used for characterizing whether to perform a review on a medical inspection item corresponding to the inspection data, the inspection data classification model comprises a plurality of classifiers, and the inspection data classification model is used for performing weighted summation on classification results output by the plurality of classifiers aiming at the inspection data according to a classification weight corresponding to each classifier to obtain a target classification result corresponding to the inspection result data, and the classification weight is inversely related to a result error rate of the classifier.

Optionally, the test data classification model is trained by:

acquiring a first sample set and a second sample set, wherein the first sample set is a set of test data corresponding to a medical test item which is not subjected to the review, and the second sample set is a set of test data corresponding to a medical test item which is required to be subjected to the review;

clustering the first sample set to obtain a plurality of cluster sets, selecting a target cluster set with the data volume exceeding the preset data volume from the plurality of cluster sets, and clustering the target cluster set to obtain a plurality of sub cluster sets;

performing data sampling on the basis of the clustering centers of the plurality of sub-clustering sets and the clustering centers of other clustering sets except the target clustering set in the plurality of clustering sets to obtain a sampled data set;

training the test data classification model based on the sampled data set and the second sample set.

Optionally, the clustering the first sample set to obtain a plurality of cluster sets, selecting a target cluster set with a data volume exceeding a preset data volume from the plurality of cluster sets, and clustering the target cluster set to obtain a plurality of sub-cluster sets, includes:

determining an expected first cluster number based on the sample data size of the second sample set;

clustering the first sample set based on the first clustering quantity to obtain a plurality of clustering sets with the quantity equal to the first clustering quantity;

selecting a target cluster set with the data volume exceeding a preset data volume from the plurality of cluster sets, and determining an expected second cluster number based on the ratio of the sample data volume of the target cluster set to the sample data volume of the first sample set;

clustering the target cluster set based on the second cluster number to obtain a plurality of sub cluster sets with the number equal to the second cluster number.

Optionally, the performing data sampling based on the cluster centers of the plurality of sub-cluster sets and the cluster centers of other cluster sets in the plurality of cluster sets except the target cluster set to obtain a sampled data set includes:

in each sub-cluster set, determining first target data characterized by a data point closest to a cluster center, and determining second target data characterized by a data point closest to the cluster center in other cluster sets except the target cluster set in the plurality of cluster sets;

combining the first target data and the second target data into a sampled data set.

Optionally, the training the test data classification model based on the sampled data set and the second sample set comprises:

iteratively training the test data classification model based on the sampling data set and the second sample set, taking a first-layer classifier of the test data classification model as an initial target-layer classifier in each iterative training, and circularly executing the following operations:

determining a predicted classification result of the target layer classifier on each sample data in the sampling data set and the second sample set, wherein each sample data has a sample weight and a sample classification label, and the sample weight is inversely related to a proportion of the target layer classifier correctly classifying the sample data;

determining a result error rate of the target layer classifier based on the predicted classification result, the sample classification label and the sample weight of the sample data, and determining a classification weight of each classifier in the target layer classifier based on the result error rate and a negative correlation relationship between the classification weight and the result error rate;

and updating the sample weight of the sample data based on the classification weight of each classifier in the target layer classifier and the prediction classification result, and taking the next layer classifier of the target layer classifier as a new target layer classifier until a preset iteration stop condition is reached.

Optionally, the updating the sample weight of the sample data based on the classification weight of each classifier in the target layer classifier and the prediction classification result includes:

updating the sample weight of the sample data according to a formula defined over the following quantities:

the sample weight of the i-th sample data corresponding to the target layer classifier;

the sample weight of the i-th sample data corresponding to the next-layer classifier of the target layer classifier;

y_i, the sample classification label of the i-th sample data;

the classification weight of the j-th classifier in the target layer classifier;

c, the number of classifiers included in the target layer classifier;

the predicted classification result of the j-th classifier in the target layer classifier for the i-th sample data; and

n, the number of sample data.

Optionally, the determining, by the pre-trained inspection data classification model, a target classification result corresponding to the inspection data includes:

acquiring historical inspection data corresponding to the inspection data and associated data related to classifying the inspection data from a medical information system;

when the time interval between the generation time of the historical inspection data and the generation time of the current inspection data is less than or equal to a preset time interval, taking the difference ratio between the index value of the historical inspection data and the index value of the inspection data as a difference check value;

and inputting the difference check value, the associated data and the current test data into a pre-trained test data classification model, and determining a target classification result corresponding to the test data based on an output result of the test data classification model.

In a second aspect, the present disclosure provides a medical data classification apparatus, the apparatus comprising:

the acquisition module is used for acquiring inspection data to be classified, and the inspection data represents an inspection result of a medical inspection item;

the classification module is used for determining a target classification result corresponding to the inspection data based on a pre-trained inspection data classification model, wherein the target classification result is used for representing whether medical inspection items corresponding to the inspection data are reviewed, the inspection data classification model comprises a plurality of classifiers, the inspection data classification model is used for performing weighted summation on classification results output by the plurality of classifiers aiming at the inspection data according to a classification weight corresponding to each classifier to obtain a target classification result corresponding to the inspection result data, and the classification weight is inversely related to a result error rate of the classifier.

In a third aspect, the present disclosure provides a non-transitory computer readable storage medium having stored thereon a computer program which, when executed by a processor, performs the steps of the method of any one of the first aspects.

In a fourth aspect, the present disclosure provides an electronic device comprising:

a memory having a computer program stored thereon;

a processor for executing the computer program in the memory to implement the steps of the method of any one of the first aspect.

With the above technical solution, the test data can be reviewed and classified automatically by the pre-trained test data classification model, which reduces the workload of manual review and thereby improves the efficiency of reviewing and classifying the test data. In addition, the test data classification model comprises a plurality of classifiers, each with a classification weight that is inversely related to its result error rate: the higher a classifier's result error rate, the lower its classification weight. Weighting and summing the outputs of the classifiers by these classification weights therefore yields a more accurate classification result. Both the efficiency and the accuracy of review and classification can thus be ensured.

Additional features and advantages of the disclosure will be set forth in the detailed description which follows.

Drawings

The accompanying drawings, which are included to provide a further understanding of the disclosure and are incorporated in and constitute a part of this specification, illustrate embodiments of the disclosure and together with the description serve to explain the disclosure without limiting the disclosure. In the drawings:

FIG. 1 is a flow chart illustrating a method of medical data classification according to an exemplary embodiment of the present disclosure;

FIG. 2 is a schematic diagram illustrating a clustering process in a medical data classification method according to an exemplary embodiment of the present disclosure;

FIG. 3 is a schematic diagram illustrating a hierarchical structure of classifiers in a medical data classification method according to an exemplary embodiment of the present disclosure;

FIG. 4 is a process diagram illustrating a method of medical data classification according to an exemplary embodiment of the present disclosure;

FIG. 5 is a block diagram illustrating a medical data classification apparatus according to an exemplary embodiment of the present disclosure;

FIG. 6 is a block diagram illustrating an electronic device according to an exemplary embodiment of the present disclosure.

Detailed Description

The following detailed description of specific embodiments of the present disclosure is provided in connection with the accompanying drawings. It should be understood that the detailed description and specific examples, while indicating the present disclosure, are given by way of illustration and explanation only, not limitation.

As mentioned in the background, the related art usually reviews test data manually, which takes considerable manpower and time and makes review and classification inefficient. Moreover, different reviewers judge abnormal values subjectively and may reach inconsistent conclusions, which affects the accuracy of review and classification.

In addition, the related art also classifies the medical inspection data through an automatic inspection system, so that the automatic inspection of the medical inspection data is realized.

However, the inventors found through research that the automatic review system in the related art can only handle basic review conditions through preset rules. For example, a data threshold is set, and when the index value of a medical test result exceeds the threshold, the result is flagged as abnormal. However, the configured threshold may be inaccurate and must be adjusted manually according to actual conditions. Raising the automatic review rate therefore requires a great deal of manpower and time to keep modifying and refining the automatic review rules. Moreover, because such a system can only handle basic review conditions, it can only automatically pass 50 to 70 percent of test samples, and the large number of remaining samples still require manual review to decide whether a recheck is needed.

In view of this, the present disclosure provides a medical data classification method, device, storage medium, and electronic device, so as to perform intelligent review and classification on medical data, reduce the workload of manual review, and improve the efficiency of review and classification of medical data.

Fig. 1 is a flow chart illustrating a method of medical data classification according to an exemplary embodiment of the present disclosure. Referring to fig. 1, the method includes the steps of:

Step 101: acquire the test data to be classified.

Illustratively, the test data characterizes test results of the medical test item. For example, the medical test item is a potassium blood test, and the test data may include a potassium blood test value.

Step 102: determine a target classification result corresponding to the test data based on the pre-trained test data classification model, wherein the target classification result characterizes whether to recheck the medical test item corresponding to the test data. The test data classification model comprises a plurality of classifiers and performs a weighted summation of the classification results that the classifiers output for the test data, according to the classification weight corresponding to each classifier, to obtain the target classification result corresponding to the test data; the classification weight is inversely related to the result error rate of the classifier.

In this way, the test data can be reviewed and classified automatically by the pre-trained test data classification model, which reduces the workload of manual review and thereby improves the efficiency of reviewing and classifying the test data. In addition, the test data classification model comprises a plurality of classifiers, each with a classification weight that is inversely related to its result error rate: the higher a classifier's result error rate, the lower its classification weight. Weighting and summing the outputs of the classifiers by these classification weights therefore yields a more accurate classification result. Both the efficiency and the accuracy of review and classification can thus be ensured.
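The weighted summation in step 102 can be illustrated with a minimal sketch. It assumes each classifier outputs +1 (recheck) or -1 (no recheck) and that the sign of the weighted sum gives the target classification result; the function name `weighted_vote`, the example weights, and the tie-breaking rule are hypothetical and not taken from the disclosure.

```python
def weighted_vote(predictions, weights):
    """Combine per-classifier labels (+1 recheck, -1 no recheck) by weighted sum."""
    if len(predictions) != len(weights):
        raise ValueError("one classification weight per classifier is required")
    score = sum(w * p for p, w in zip(predictions, weights))
    # Assumption: a tie is resolved toward recheck (+1) as the cautious choice.
    return 1 if score >= 0 else -1


# Hypothetical example: one low-error (high-weight) classifier votes "recheck",
# two high-error (low-weight) classifiers vote "no recheck".
label = weighted_vote([1, -1, -1], [0.9, 0.3, 0.2])
print(label)  # 1, because 0.9 - 0.3 - 0.2 = 0.4 >= 0
```

Because the weights fall as the error rate rises, a single reliable classifier can outvote several unreliable ones, which is the behavior the paragraph above describes.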

In order to make the medical data classification method provided by the present disclosure more understandable to those skilled in the art, the above steps are exemplified in detail below.

The training process of the test data classification model in the embodiments of the present disclosure is described first.

In a possible manner, a first sample set and a second sample set may be obtained, where the first sample set is a set of test data corresponding to a medical test item that is not to be reviewed, and the second sample set is a set of test data corresponding to a medical test item that is to be reviewed. And then clustering the first sample set to obtain a plurality of cluster sets, selecting a target cluster set with the data volume exceeding the preset data volume from the plurality of cluster sets, and clustering the target cluster set to obtain a plurality of sub cluster sets. And then, performing data sampling based on the clustering centers of the plurality of sub-clustering sets and the clustering centers of other clustering sets except the target clustering set in the plurality of clustering sets to obtain a sampled data set. Finally, a test data classification model is trained based on the sample data set and the second sample set.

Illustratively, the historical inspection data may be extracted from a HIS (Hospital Information System) or LIS (Laboratory Information Management System) database, and each historical inspection data corresponds to a manual review classification result of whether to review. Historical inspection data may then be labeled based on the manual review classification, such as a review labeled "1" and no review labeled "-1". Thus, a first sample set corresponding to the non-review data and a second sample set corresponding to the review data can be obtained.

It should be further understood that the review classification result of a medical test item may need to be determined in combination with other information, such as the patient information corresponding to the medical test item and/or the test data of other medical test items. Therefore, in the embodiments of the present disclosure, to improve the accuracy of the test data classification model, the associated data of the medical test item may be acquired for model training in the training stage; accordingly, in the application stage, the test data and the associated data of the medical test item may be input into the model to obtain the corresponding review classification result.

For example, data associated with a potassium blood audit classification includes gender, age, sodium test value, chlorine test value, creatinine test value, urea test value, and delta check value (DeltaCheck). The difference check value is used for reflecting the difference degree of two continuous test results, and can be obtained by calculating the current test value and the historical test value, wherein the historical test value is the last test result of the same patient closest to the current test result.

In addition, when the interval between two tests is long, the historical result may not accurately reflect the patient's recent physical condition, so the time interval between the current test and the historical test can be taken into account when determining the difference check value. For example, for the blood potassium item, the corresponding difference check value DeltaCheck can be calculated from the current blood potassium test value K_cur and the historical blood potassium test value K_his according to the following formula:

It should be understood that the 7-day time interval threshold between the current test and the historical test used in the above formula is only one possible example, and the threshold may be set according to the actual situation in a particular application. Likewise, the difference check value may be calculated in other ways in a particular application, for example by taking |K_his - K_cur| as the difference check value; the embodiments of the present disclosure do not limit this, as long as the difference check value represents the degree of difference between two consecutive test results. Accordingly, if the difference check value of a test item is too large, the index values of the same test item for the same patient differ greatly within a short time interval, so the data can be determined to be abnormal data that needs to be rechecked and can be labeled with the review label "1".
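The following sketch shows one way such a difference check value could be computed. The formula itself is not reproduced in this text, so the sketch makes two assumptions: the value is the relative difference between the current and historical values (the "difference ratio" mentioned in the summary), and it is set to 0 when no usable historical result exists within the threshold. The function name `delta_check` and its parameters are hypothetical.

```python
from datetime import datetime, timedelta


def delta_check(k_cur, k_his, t_cur, t_his, max_gap=timedelta(days=7)):
    """Relative difference between two consecutive results of the same patient."""
    if k_his is None or t_his is None or k_his == 0:
        return 0.0  # assumption: without usable history there is no signal
    if abs(t_cur - t_his) > max_gap:
        return 0.0  # assumption: a historical result older than the threshold is ignored
    return abs(k_cur - k_his) / abs(k_his)


# Hypothetical blood potassium values, three days apart.
value = delta_check(5.0, 4.0, datetime(2021, 11, 25), datetime(2021, 11, 22))
print(value)  # 0.25
```

The `max_gap` default mirrors the 7-day example threshold and can be changed per test item.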

Other acquired data used for training the test data classification model, such as non-numeric features like gender, can be converted with One-Hot encoding. The One-Hot code can be expressed with binary digits, for example with male represented by the binary digit "1" and female by the binary digit "0". In addition, if the value of an acquired item is missing, it can be filled with the middle value of the item's normal range, which reduces artificially introduced bias.
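A minimal sketch of this preprocessing follows. It splits gender into two One-Hot features and fills a missing item with the midpoint of its normal range. The names `encode_gender`, `fill_missing` and `NORMAL_RANGES` are hypothetical, and the ranges in `NORMAL_RANGES` are illustrative placeholders only, since the normal ranges of Table 1 are not reproduced in this text.

```python
# Illustrative placeholder ranges (low, high); not the values of Table 1.
NORMAL_RANGES = {"sodium": (135.0, 145.0), "chlorine": (96.0, 106.0)}


def encode_gender(gender):
    """Split gender into two One-Hot features: (is_male, is_female)."""
    codes = {"male": (1, 0), "female": (0, 1)}
    return codes.get(gender, (0, 0))  # assumption: unknown values map to (0, 0)


def fill_missing(record, normal_ranges):
    """Return a copy of record with missing items set to the normal-range midpoint."""
    filled = dict(record)
    for item, (low, high) in normal_ranges.items():
        if filled.get(item) is None:
            filled[item] = (low + high) / 2
    return filled


record = {"sodium": None, "chlorine": 101.0}
print(encode_gender("male"))                # (1, 0)
print(fill_missing(record, NORMAL_RANGES))  # {'sodium': 140.0, 'chlorine': 101.0}
```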

For example, in the data annotation process of the blood potassium item, the associated data features are specified by doctors based on professional knowledge and may include gender, age, sodium test value, chlorine test value, creatinine test value, urea test value and DeltaCheck. The corresponding data are then extracted from the HIS and LIS historical databases to obtain a blood potassium test item data set with review classification results. This data set comprises a non-review data set D_maj and a review data set D_min. The number of samples in D_maj is denoted ||D_maj||, the number of samples in D_min is denoted ||D_min||, and ||D_maj|| >> ||D_min||.

After the data extraction is finished, the male and female values of the gender attribute are replaced by One-Hot codes and split into two features. Meanwhile, DeltaCheck can be calculated and filled in according to the above formula from the blood potassium test values of the same patient at different times. If other item values are missing, they can be filled with the middle value of the normal range shown in Table 1, so as to reduce artificially introduced bias:

TABLE 1

Then, the sample data for training the inspection data classification model shown in table 2 can be obtained:

TABLE 2

Thus, the non-review test data in the first sample set includes the current test value and associated data (such as age, gender, other test data, and DeltaCheck shown in Table 2) of a medical test item that does not require review, and the review data in the second sample set includes the current test value and associated data (such as age, gender, other test data, and DeltaCheck shown in Table 2) of a medical test item that requires review.

After the first sample set and the second sample set are obtained, because data which does not need to be reviewed is far larger than data which needs to be reviewed in practical application, that is, the data volume of the first sample set is far larger than that of the second sample set, in order to ensure the data volume balance among training samples and further ensure the accuracy of a training model, cluster down-sampling can be performed on the first sample set.

For example, the first sample set may be clustered first to obtain a plurality of cluster sets. Further, to address sample imbalance within the clusters and thereby ensure the accuracy of the trained model, a target cluster set whose data volume exceeds a preset data volume can be selected from the plurality of cluster sets and clustered again to obtain a plurality of sub-cluster sets; that is, the cluster sets with a large data volume are selected for re-clustering. The preset data volume may be set according to the actual situation, which is not limited in the embodiments of the present disclosure. For example, referring to fig. 2, a first sample set is clustered twice, where the results of the first clustering are distinguished by solid lines and the results of the second clustering by dashed lines. After the first clustering, some cluster sets contain more data and some contain less. In practical applications, the desired down-sampling result is to extract more samples from a large class (a cluster set with a large amount of data) and fewer samples from a small class (a cluster set with a small amount of data). If down-sampling is performed directly on the cluster sets from the first clustering, the same number of samples may be extracted from the large and small classes, which is not the desired result and affects the accuracy of model training. Therefore, the embodiments of the present disclosure may re-cluster the large classes based on the result of the first clustering. For example, referring to fig. 2, the two cluster sets containing a larger amount of data are clustered a second time. After the two rounds of clustering, the amounts of data in the resulting cluster sets differ only slightly.

It should be understood that the above example is illustrated by twice clustering, and if the amount of data included in the sub-cluster set is still large after the second clustering, the third clustering may also be performed, and so on, and the number of clustering is not limited in the embodiment of the present disclosure.

In a possible manner, an expected first cluster number may be determined based on the sample data size of the second sample set, and then the first sample set is clustered based on the first cluster number to obtain a plurality of cluster sets equal to the first cluster number. Then, a target cluster set with the data volume exceeding the preset data volume is selected from the plurality of cluster sets, and the expected second cluster quantity is determined based on the proportion between the sample data volume of the target cluster set and the sample data volume of the first sample set. And finally, clustering the target cluster set based on the second cluster number to obtain a plurality of sub cluster sets with the number equal to the second cluster number.

For example, the number of first-layer clusters is set to α, where the value of α is related to the number of samples in the second sample set and is computed with a parameter ρ (ρ > 3, a rational number) and a rounding function g(), which may round up or round down as the actual situation requires. The first sample set is then clustered by the K-means++ clustering method to obtain α cluster sets, where the i-th cluster set is denoted K_i.

After the first clustering is completed, a category with a large number of samples is called a large class and a category with a small number of samples is called a small class. To make the distribution of the sampled data similar to that of the original samples and to avoid unbalanced sampling within the categories, more samples should be extracted from the large classes and fewer from the small classes. Nested K-means++ clustering can therefore be applied to the cluster sets obtained from the first clustering, that is, a second clustering is performed on the result of the first clustering, with the following steps:

Step 1: determine the expected second cluster number.

The second cluster number may be determined based on the ratio of the sample data size of the target cluster set K_i to the sample data size of the first sample set. For example, the second cluster number corresponding to the target cluster set K_i may be determined by a formula in which g() is a rounding function and θ is an adjustment parameter for the number of samples. θ may be set according to the actual situation, for example to 1 or to 2, which is not limited in this disclosure.

Step 2: based on the K-means++ clustering method, set the number of initial cluster centers of the target cluster set K_i to the second cluster number, and find that number of center points of the target cluster set K_i, each center point corresponding to one sub-cluster set. A number of sub-cluster sets equal to the second cluster number is thereby obtained.
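The nested clustering described in the two steps above can be sketched in plain Python as follows. This is an illustration under stated assumptions rather than the disclosed implementation: the formulas for the first and second cluster numbers are not reproduced in this text, so the caller supplies them through the hypothetical parameters `first_k` and `second_k_for`; `max_size` stands in for the preset data volume, and all function names are made up for the example.

```python
import math
import random


def assign(points, centers):
    """Group each point with its nearest center."""
    groups = [[] for _ in centers]
    for p in points:
        j = min(range(len(centers)), key=lambda idx: math.dist(p, centers[idx]))
        groups[j].append(p)
    return groups


def kmeans_pp(points, k, seed=0, iters=20):
    """K-means with k-means++ seeding; returns (centers, groups)."""
    rng = random.Random(seed)
    centers = [rng.choice(points)]
    while len(centers) < k:
        # Pick the next seed with probability proportional to squared distance.
        d2 = [min(math.dist(p, c) ** 2 for c in centers) for p in points]
        if sum(d2) == 0:  # every point coincides with an existing center
            centers.append(rng.choice(points))
        else:
            centers.append(rng.choices(points, weights=d2, k=1)[0])
    for _ in range(iters):
        groups = assign(points, centers)
        # Move each center to the mean of its group; keep it if the group is empty.
        centers = [
            tuple(sum(x) / len(g) for x in zip(*g)) if g else c
            for g, c in zip(groups, centers)
        ]
    return centers, assign(points, centers)


def nested_clusters(points, first_k, max_size, second_k_for):
    """Cluster once, then re-cluster every cluster holding more than max_size points."""
    result = []
    _, groups = kmeans_pp(points, first_k)
    for g in groups:
        if len(g) > max_size:
            _, subgroups = kmeans_pp(g, second_k_for(g))
            result.extend(s for s in subgroups if s)
        elif g:
            result.append(g)
    return result


# Hypothetical data: one large class and one small class of 2-D points.
big = [(x * 0.1, 0.0) for x in range(20)]
small = [(100.0, 0.0), (100.1, 0.0)]
clusters = nested_clusters(big + small, first_k=2, max_size=10,
                           second_k_for=lambda g: 2)
print([len(c) for c in clusters])
```

Splitting only the oversized clusters means a large class ends up contributing several sub-clusters, and therefore several sampled points, while a small class contributes one.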

After the plurality of sub-cluster sets are obtained, data sampling can be performed based on the cluster centers of the plurality of sub-cluster sets and the cluster centers of other cluster sets except the target cluster set in the plurality of cluster sets, so that a sampling data set is obtained. The difference between the data volume of the sampling data set and the data volume in the second sample set is small, so that the problem of unbalanced samples can be solved, and the accuracy of model training is ensured.

In a possible manner, in each sub-cluster set, first target data characterized by a data point closest to the cluster center may be determined, and in other cluster sets except the target cluster set, second target data characterized by a data point closest to the cluster center may be determined, and then the first target data and the second target data are combined into a sample data set.

For example, an initially empty sampled data set is created. Then, in each sub-cluster set of the target cluster set K_i, the first target data closest to the sub-cluster center is determined and put into the sampled data set. Likewise, in each of the other cluster sets except the target cluster set, the second target data closest to the cluster center is determined and put into the sampled data set. By down-sampling the first sample set with nested clustering in this way, the data volume of the resulting sampled data set is balanced against that of the second sample set, which addresses the sample imbalance problem and helps ensure the accuracy of model training.

In addition, in practical application, the cluster center of each sub-cluster set and the cluster centers of other cluster sets except the target cluster set may also be selected to be added into the sample data set. However, considering that the cluster center may be an unreal data point formed in the clustering process, that is, all sample data points are gathered around a certain data point, and the data point does not correspond to actual sample data, it may be considered to select the sample data point closest to the cluster center to be added into the sample data set.
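The selection of real samples nearest to each cluster center can be sketched as follows. This is a minimal illustration, not the disclosed implementation: it assumes clustering has already been done, that each cluster is given as a list of numeric tuples, and that the cluster center is the arithmetic mean of its members. The function names `centroid`, `nearest_real_sample` and `build_sampling_set` are hypothetical.

```python
import math

def centroid(points):
    """Arithmetic mean of a non-empty list of equal-length vectors (assumed cluster center)."""
    n = len(points)
    return tuple(sum(col) / n for col in zip(*points))

def nearest_real_sample(points, center):
    """Return the actual sample in `points` that lies closest to `center`."""
    return min(points, key=lambda p: math.dist(p, center))

def build_sampling_set(sub_clusters, other_clusters):
    """Pick one real sample per cluster: the one nearest to that cluster's center."""
    sampling_set = []
    for cluster in list(sub_clusters) + list(other_clusters):
        sampling_set.append(nearest_real_sample(cluster, centroid(cluster)))
    return sampling_set

# Sub-cluster sets of the target cluster set, and the remaining (small) cluster sets.
sub_clusters = [
    [(0.0, 0.0), (1.0, 0.0), (0.4, 0.1)],
    [(5.0, 5.0), (6.0, 5.0), (7.0, 5.0)],
]
other_clusters = [[(10.0, 0.0), (10.0, 2.0), (10.0, 0.9)]]

sampled = build_sampling_set(sub_clusters, other_clusters)
```

Choosing the nearest real sample rather than the center itself keeps every entry of the sampling data set tied to an actual test record.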

After the sampling data set is obtained, it contains non-review data while the second sample set contains review data, and the two sets differ little in data volume. The sampling data set and the second sample set can therefore be shuffled together to obtain training data for the test data classification model, and the test data classification model can then be trained on this training data.

In a possible manner, the test data classification model may be iteratively trained based on the sampling data set and the second sample set. In each training iteration, the first-layer classifier of the test data classification model is taken as the initial target-layer classifier, and the following operations are performed in a loop. First, the predicted classification result of the target-layer classifier is determined for each sample data in the sampling data set and the second sample set, where each sample data has a sample weight and a sample classification label, and the sample weight is negatively correlated with the proportion of classifiers in the target layer that classify the sample data correctly. Then, the result error rate of the target-layer classifier is determined based on the predicted classification result, the sample classification label and the sample weight of the sample data, and the classification weight of each classifier in the target-layer classifier is determined based on the result error rate and the negative correlation between the classification weight and the result error rate. Finally, the sample weight of the sample data is updated based on the classification weight and the predicted classification result of each classifier in the target-layer classifier, and the next-layer classifier is taken as the new target-layer classifier, until a preset iteration stop condition is reached.

It should be understood at the outset that the inspection data classification model in the disclosed embodiments includes multiple classifiers, which may be, but are not limited to, different types of classifiers such as logistic regression, decision trees, support vector machines, and the like. Taking the decision tree as an example, referring to fig. 3, the plurality of classifiers may be divided into a plurality of levels, so that the classifiers are iteratively trained layer by layer, and finally the plurality of classifiers are linearly combined into a classification model with higher prediction accuracy according to the classification weight. The dashed line box shown in fig. 3 represents a classifier, parallel execution indicates that each classifier can output a corresponding classification result based on input test data at the same time, so that the execution efficiency is improved, and serial execution indicates that the result of a certain layer of classifier can be transmitted to the next layer of classifier in the training stage, so that the training of the next layer of classifier is realized.

Illustratively, the preset iteration stop condition includes that the result error rate of each classifier is less than a preset error rate threshold, or the number of iterations reaches a preset number of iterations, which is not limited by the embodiment of the disclosure. The preset error rate threshold and the preset iteration number may also be set according to actual conditions, which is not limited in the embodiments of the present disclosure.

For example, the sampling data set and the second sample set obtained in any of the above manners are taken together as the total sample set, defined as $S = \{(y_1; X_1), (y_2; X_2), \ldots, (y_n; X_n)\}$, where $y_i \in \{-1, 1\}$ denotes the classification label of the i-th sample data and takes one of two values ("-1" means no review, "1" means review), $X_i$ denotes the index feature vector of the i-th sample data, which may be obtained by vector conversion of the acquired sample data, and n is the number of samples in the total sample set. In this case, the test data classification model may be trained as follows:

Step 1): determine the sample weight of each sample data. The initial weight of each sample data is $\omega = 1/n$, so the initial sample weight distribution assigns the weight $1/n$ to each of the n samples.

Step 2): iteratively train the inspection data classification model layer by layer. Take the t-th layer classifier ($t = 1, \ldots, T$) as an example, where T is a positive integer greater than 1 that represents the number of classifier layers included in the inspection data classification model and may be set according to actual conditions. Suppose the t-th layer classifier includes 3 shallow decision trees, denoted $f_1^{(t)}$, $f_2^{(t)}$ and $f_3^{(t)}$.

First, based on the prediction results of each classifier for all sample data and on the sample weight and sample classification label corresponding to the sample data, the result error rate $e_j^{(t)}$ of each classifier $f_j^{(t)}$ is determined.

It should be appreciated that the result error rate of a classifier may be the weighted sum over the samples that the classifier misclassifies; that is, the result error rate of the first classifier may be calculated as follows:

$$e_1^{(t)} = \sum_{i=1}^{n} \omega_i^{(t)} \, I\big(f_1^{(t)}(X_i) \neq y_i\big)$$

where the function $I(\cdot)$ is:

$$I\big(f_1^{(t)}(X_i) \neq y_i\big) = \begin{cases} 1, & f_1^{(t)}(X_i) \neq y_i \\ 0, & f_1^{(t)}(X_i) = y_i \end{cases}$$

Here, $\omega_i^{(t)}$ denotes the sample weight of the i-th sample data at the t-th layer, and $f_1^{(t)}(X_i)$ denotes the predicted classification result of the first classifier for the i-th sample data. It should be understood that if $f_1^{(t)}(X_i) \neq y_i$, the predicted classification result of the classifier for the i-th sample data is inconsistent with the classification label of that sample, that is, the classifier misclassifies the sample.

Similarly, the result error rate $e_2^{(t)}$ of the second classifier and the result error rate $e_3^{(t)}$ of the third classifier can be calculated according to the above formula.
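The weighted error rate described above can be sketched in a few lines. The function name `weighted_error_rate` and the toy labels and predictions are illustrative assumptions; the only logic taken from the text is that the result error rate is the sum of the sample weights of the misclassified samples.

```python
def weighted_error_rate(predictions, labels, sample_weights):
    """Sum the sample weights of every sample whose prediction differs from its label."""
    return sum(w for p, y, w in zip(predictions, labels, sample_weights) if p != y)

# Four samples with labels in {-1, 1} and the uniform initial weight 1/n.
labels = [1, -1, 1, -1]
weights = [1 / len(labels)] * len(labels)

# Hypothetical predictions of the first classifier: only the third sample is wrong.
preds_first = [1, -1, -1, -1]
error_first = weighted_error_rate(preds_first, labels, weights)
```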

Classification weights for each classifier may then be determined based on the resulting error rate and a negative correlation between the classification weights and the resulting error rate. For example, the classification weight of each classifier can be calculated according to the following formula:

In this formula, taking the first classifier as an example, the lower the result error rate of the classifier (that is, the higher its accuracy), the larger its classification weight, and the greater its share in the final inspection data classification model.

Similarly, the classification weight of the second classifier and the classification weight of the third classifier can be obtained as described above.
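One formula that satisfies the stated negative correlation is the classic AdaBoost weight, $\alpha = \tfrac{1}{2}\ln\frac{1 - e}{e}$. The sketch below uses it purely as an assumption: the disclosure's own expression is not available in this text, and the function name `classification_weight` and the clipping constant `eps` are hypothetical.

```python
import math

def classification_weight(error_rate, eps=1e-10):
    """AdaBoost-style weight (assumed): decreases as the result error rate increases."""
    # Clip the error rate away from 0 and 1 so the logarithm stays finite.
    e = min(max(error_rate, eps), 1.0 - eps)
    return 0.5 * math.log((1.0 - e) / e)

alpha_low_error = classification_weight(0.10)
alpha_high_error = classification_weight(0.40)
alpha_chance = classification_weight(0.50)  # a coin-flip classifier gets weight 0
```

Under this choice, a classifier that is no better than chance contributes nothing to the weighted sum, and a classifier worse than chance receives a negative weight.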

For the training of the next-layer classifier, the sample weights are updated first, that is, the weights of the misclassified samples are increased. It should be understood that if a layer of classifiers misclassifies a sample, the sample is difficult to classify accurately, and emphasizing that sample in subsequent training can improve the classification accuracy of the whole model. Therefore, in the embodiment of the present disclosure, after the classification weights of a layer of classifiers are obtained (i.e., that layer has been trained), the sample weights of the sample data may be updated to increase the sample weight of the misclassified samples.

In a possible approach, the sample weights of the sample data may be updated according to the following formula:

Thus, the sample weights used to train the next-layer classifier can be updated based on the predicted classification results of the current layer. Moreover, the more classifiers in a layer misclassify a given sample, the more that sample's weight increases; conversely, the more classifiers classify it correctly, the more its weight decreases. The sample weight of misclassified samples is therefore raised during training, so that the test data classification model concentrates more on the misclassified samples, which improves its classification accuracy.
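The behaviour described above can be illustrated with an exponential update in the style of AdaBoost, extended to several classifiers per layer. This is an assumed form, not the formula of the disclosure: each weight is multiplied by $\exp(-y_i \sum_j \alpha_j f_j(X_i))$ and the weights are then renormalized to sum to 1. The name `update_sample_weights` and the example values are hypothetical.

```python
import math

def update_sample_weights(sample_weights, labels, layer_predictions, classification_weights):
    """Assumed exponential update; layer_predictions[j][i] is classifier j's output for sample i."""
    updated = []
    for i, (w, y) in enumerate(zip(sample_weights, labels)):
        # Weighted vote of all classifiers in the layer for sample i.
        margin = sum(a * preds[i] for a, preds in zip(classification_weights, layer_predictions))
        # Agreement with the label shrinks the weight; disagreement grows it.
        updated.append(w * math.exp(-y * margin))
    total = sum(updated)
    return [w / total for w in updated]

labels = [1, -1, 1, -1]
weights = [0.25, 0.25, 0.25, 0.25]
layer_predictions = [
    [1, -1, -1, -1],  # classifier 1: wrong on sample 2
    [1, -1, -1, 1],   # classifier 2: wrong on samples 2 and 3
    [1, -1, 1, -1],   # classifier 3: all correct
]
alphas = [0.5, 0.3, 0.8]  # hypothetical classification weights

new_weights = update_sample_weights(weights, labels, layer_predictions, alphas)
```

In this example sample 2 is misclassified by two classifiers and sample 3 by one, so sample 2 ends up with the largest weight and the fully correct samples 0 and 1 with the smallest.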

It should be understood that a typical model training process is to perform a computation of a loss function based on the labels and model predicted values, and to adjust parameters of the model by performing back propagation based on the computation result of the loss function. In the embodiment of the present disclosure, the parameters of the model include the classification weights of the classifiers, so that determining the classification weight of a certain layer of classifiers in the above manner is a training process for the layer of classifiers.

Then, for the next layer of classifiers, in order to improve the classification accuracy of the model, the sample weights of the sample data are updated first, and the classification weights of the next-layer classifiers are then determined in the above manner based on the updated sample weights, and so on, so that the classification weights of the classifiers in the inspection data classification model are determined layer by layer. Once the classification weights of the last layer of classifiers have been determined, one training iteration is complete. New predicted classifications can then be made on the sample data based on the classification weight of each classifier, and the above process is repeated on the new prediction results: the classification weights of the classifiers are determined and the sample weights of the sample data are updated, and so on, until the result error rates of all classifiers are less than the preset error rate threshold or the number of iterations reaches the preset number of iterations.

Thus, the inspection data classification model in the embodiments of the present disclosure is an adaptive multi-path boosting model. "Multi-path" means that multiple learners are built at the same level: they can learn in parallel in the model training stage, which improves training efficiency, and can predict in parallel on the same input test data in the model application stage, which improves the efficiency of audit classification. "Adaptive" means that the classification weights and the sample weights of the sample data are adjusted adaptively during model training.
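Parallel prediction by the classifiers of one layer can be sketched with a thread pool. The three threshold rules below are hypothetical stand-ins for trained classifiers, and `predict_in_parallel` is an illustrative name; the sketch only shows several learners receiving the same input at the same time.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical stand-ins for trained classifiers; each returns 1 (review) or -1 (no review).
def high_potassium(features):
    return 1 if features[0] > 5.5 else -1

def low_sodium(features):
    return 1 if features[1] < 135.0 else -1

def raised_potassium(features):
    return 1 if features[0] > 5.0 else -1

def predict_in_parallel(classifiers, features):
    """Run every classifier on the same input concurrently, keeping the input order."""
    with ThreadPoolExecutor(max_workers=len(classifiers)) as pool:
        return list(pool.map(lambda clf: clf(features), classifiers))

outputs = predict_in_parallel([high_potassium, low_sodium, raised_potassium], (6.0, 140.0))
```

In CPython, threads mainly help when the classifiers release the interpreter lock or wait on I/O; a process pool would be the alternative for CPU-bound pure-Python models.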

After the test data classification model is obtained through training, each classifier included in the test data classification model may have a corresponding classification weight, so that a target classification result of the test data classification model on the input test data may be determined according to the following formula:

where sign(x) is the sign function, which outputs 1 when x > 0 and -1 when x < 0.
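The weighted summation followed by the sign function can be sketched as below. The names `sign` and `target_classification` and the example weights are illustrative. The text does not say what happens when the weighted sum is exactly 0; the sketch assumes that such a tie is sent to review.

```python
def sign(x):
    """Sign function: 1 for x > 0, -1 for x < 0."""
    if x > 0:
        return 1
    if x < 0:
        return -1
    return 1  # assumption: an exact tie is treated as "review"

def target_classification(classifier_outputs, classification_weights):
    """Weighted sum of the classifier outputs, mapped to 1 (review) or -1 (no review)."""
    score = sum(a * f for a, f in zip(classification_weights, classifier_outputs))
    return sign(score)

# The single "review" vote wins when its classifier carries enough weight.
result_review = target_classification([1, -1, -1], [0.9, 0.3, 0.4])
result_pass = target_classification([1, -1, -1], [0.2, 0.3, 0.4])
```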

In a possible mode, the trained inspection data classification model can be packaged into a service to be deployed in a business system of a medical institution, or the service is deployed in a cloud end in an independent service mode, so that loading and calling of the inspection data classification model are realized. Therefore, after the inspection data classification model is loaded or called, batch intelligent audit can be performed on the real-time inspection data, the workload of manual audit is reduced, and the audit classification efficiency of the inspection data is improved. That is, after the inspection data to be classified is acquired, the pre-trained inspection data classification model may be loaded or called to determine the target classification result corresponding to the inspection data.

In a possible manner, after the test data to be classified is obtained, historical test data corresponding to the test data and associated data related to the classification of the test data can be obtained from the medical information system. Then, when the time interval between the generation time of the historical test data and the generation time of the current test data is less than or equal to a preset time interval, the ratio of the difference between the index value of the historical test data and the index value of the current test data is used as the difference check value. Finally, the difference check value, the associated data and the current test data are input into the pre-trained test data classification model, and the target classification result corresponding to the test data is determined based on the output of the test data classification model.
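A difference check value of this kind might be computed as follows. The exact ratio is not spelled out in the text, so the sketch assumes the relative change with respect to the historical value; the function name `delta_check`, the 7-day window and the return value `None` for "no usable history" are likewise assumptions.

```python
from datetime import datetime, timedelta

def delta_check(current_value, current_time, history_value, history_time, max_interval):
    """Relative change versus the historical value, or None if no usable history exists."""
    # Only compare results generated within the preset time interval.
    if abs(current_time - history_time) > max_interval:
        return None
    # Avoid division by zero for a historical value of 0 (assumed handling).
    if history_value == 0:
        return None
    return (current_value - history_value) / history_value

now = datetime(2021, 11, 25, 9, 0)
window = timedelta(days=7)  # hypothetical preset time interval

recent = delta_check(5.0, now, 4.0, now - timedelta(days=2), window)
stale = delta_check(5.0, now, 4.0, now - timedelta(days=30), window)
```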

Illustratively, the medical information system may be a HIS and/or a LIS, and the disclosed embodiments are not limited.

It should be understood that, as already explained above, in the process of training the test data classification model, the sample data may be actual test data including a medical test item and the associated data of the medical test item. Correspondingly, in the model application stage, the test data and the associated data of the medical test item can be acquired for model prediction to obtain the corresponding audit classification result. The associated data input in the model application stage may be the same kind of associated data as used in the model training stage above. For example, for the blood potassium test item, in addition to the potassium test value, the age, sex, sodium test value, chlorine test value, creatinine test value, urea test value, and difference check value (Delta Check) of the patient may be obtained.

For example, referring to fig. 4, sample data is first collected from the HIS and the LIS, and data processing consisting of feature selection (i.e., selecting the associated data corresponding to the sample data), data filling, and numerical value conversion is performed to obtain a first sample set and a second sample set. The feature selection may be manual selection based on professional knowledge, the data filling may use the middle value of the normal range, and the numerical value conversion may use One-Hot encoding; see the description of model training above for details. The first sample set is a set of test data corresponding to medical test items that are not to be reviewed, and the second sample set is a set of test data corresponding to medical test items that are to be reviewed.
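The data filling and numerical value conversion steps can be sketched as follows. The reference ranges in `NORMAL_RANGES` are placeholder numbers chosen for illustration only, and the field names and helper functions `fill_missing` and `one_hot` are hypothetical.

```python
def fill_missing(value, normal_range):
    """Replace a missing value with the midpoint of the item's normal range."""
    low, high = normal_range
    return (low + high) / 2 if value is None else value

def one_hot(value, categories):
    """One-Hot encode a categorical value against a fixed category order."""
    return [1 if value == c else 0 for c in categories]

# Placeholder ranges for illustration; real ranges come from the laboratory.
NORMAL_RANGES = {"potassium": (3.5, 5.5), "sodium": (135.0, 145.0)}

record = {"potassium": 4.1, "sodium": None, "sex": "F"}

features = [fill_missing(record[k], NORMAL_RANGES[k]) for k in ("potassium", "sodium")]
features += one_hot(record["sex"], ["F", "M"])
```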

In order to solve the problem of unbalanced samples and ensure the accuracy of the results of the test data classification model, the first sample set can be subjected to cluster-nested downsampling to obtain a sampling data set. The specific process of the cluster-nested downsampling is described above and is not repeated here. A test data classification model may then be trained based on the sampling data set and the second sample set. Finally, the trained test data classification model can be deployed in a business system of a medical institution or in the cloud as an intelligent auditing service. The intelligent auditing service can then automatically audit and classify the test data generated in real time, that is, determine whether the test data is abnormal data that needs to be reviewed or normal data that does not need to be reviewed.

In any of the above manners, the inspection data can be automatically audited and classified by the pre-trained inspection data classification model, which reduces the workload of manual auditing and improves the efficiency of auditing and classifying the inspection data. In addition, the inspection data classification model includes a plurality of classifiers, each with a classification weight that is negatively correlated with its result error rate, that is, the higher the result error rate of a classifier, the lower its classification weight. Weighting and summing the outputs of the plurality of classifiers by these classification weights yields a more accurate classification result. Both the efficiency and the accuracy of audit classification can therefore be ensured.

Based on the same inventive concept, the embodiment of the present disclosure further provides a medical data classification apparatus, which may be implemented as part or all of an electronic device by software, hardware, or a combination of the two. Referring to fig. 5, the medical data classification apparatus 500 includes:

an obtaining module 501, configured to obtain test data to be classified, where the test data represents a test result of a medical test item;

a classification module 502, configured to determine a target classification result corresponding to the inspection data based on a pre-trained inspection data classification model, where the target classification result is used to characterize whether to perform a review on a medical inspection item corresponding to the inspection data, and the inspection data classification model includes a plurality of classifiers, and is configured to perform a weighted summation on classification results output by the plurality of classifiers for the inspection data according to a classification weight corresponding to each classifier, so as to obtain a target classification result corresponding to the inspection result data, where the classification weight is inversely related to a result error rate of the classifier.

Optionally, the test data classification model is obtained by training through the following modules:

a sample acquisition module, configured to acquire a first sample set and a second sample set, where the first sample set is a set of test data corresponding to medical test items that are not to be reviewed, and the second sample set is a set of test data corresponding to medical test items that are to be reviewed;

a clustering module, configured to cluster the first sample set to obtain a plurality of cluster sets, select from the plurality of cluster sets a target cluster set whose data volume exceeds a preset data volume, and cluster the target cluster set to obtain a plurality of sub-cluster sets;

a sampling module, configured to perform data sampling based on the cluster centers of the plurality of sub-cluster sets and the cluster centers of the cluster sets other than the target cluster set in the plurality of cluster sets, to obtain a sampling data set;

a training module to train the test data classification model based on the sample data set and the second sample set.

Optionally, the clustering module is configured to:

Optionally, the sampling module is configured to:

Optionally, the training module is configured to:

Optionally, the classification module 502 is configured to:

With regard to the apparatus in the above-described embodiment, the specific manner in which each module performs the operation has been described in detail in the embodiment related to the method, and will not be elaborated here.

Based on the same inventive concept, an embodiment of the present disclosure further provides an electronic device, including:

a memory having a computer program stored thereon;

a processor for executing the computer program in the memory to implement the steps of any of the medical data classification methods described above.

In a possible manner, a block diagram of the electronic device is shown in fig. 6. Referring to fig. 6, the electronic device 600 may include: a processor 601 and a memory 602. The electronic device 600 may also include one or more of a multimedia component 603, an input/output (I/O) interface 604, and a communications component 605.

The processor 601 is configured to control the overall operation of the electronic device 600 to complete all or part of the steps of the medical data classification method. The memory 602 is used to store various types of data to support operation of the electronic device 600, such as instructions for any application or method operating on the electronic device 600 and application-related data, such as contact data, transmitted and received messages, pictures, audio, video, and so forth. The memory 602 may be implemented by any type of volatile or non-volatile memory device or a combination thereof, such as static random access memory (SRAM), electrically erasable programmable read-only memory (EEPROM), erasable programmable read-only memory (EPROM), programmable read-only memory (PROM), read-only memory (ROM), magnetic memory, flash memory, magnetic disk or optical disk. The multimedia component 603 may include a screen and an audio component. The screen may be, for example, a touch screen, and the audio component is used for outputting and/or inputting audio signals. For example, the audio component may include a microphone for receiving external audio signals. The received audio signal may further be stored in the memory 602 or transmitted through the communication component 605. The audio component also includes at least one speaker for outputting audio signals. The I/O interface 604 provides an interface between the processor 601 and other interface modules, such as a keyboard, mouse, buttons, etc. These buttons may be virtual buttons or physical buttons. The communication component 605 is used for wired or wireless communication between the electronic device 600 and other devices. The wireless communication may be, for example, Wi-Fi, Bluetooth, Near Field Communication (NFC), 2G, 3G, 4G, NB-IoT, eMTC, 5G, or a combination of one or more of them, which is not limited herein. The corresponding communication component 605 may therefore include a Wi-Fi module, a Bluetooth module, an NFC module, and the like.

In an exemplary embodiment, the electronic device 600 may be implemented by one or more Application Specific Integrated Circuits (ASICs), Digital Signal Processors (DSPs), Digital Signal Processing Devices (DSPDs), Programmable Logic Devices (PLDs), Field Programmable Gate Arrays (FPGAs), controllers, microcontrollers, microprocessors, or other electronic components for performing the above-described medical data classification method.

In another exemplary embodiment, a computer readable storage medium comprising program instructions which, when executed by a processor, implement the steps of the medical data classification method described above is also provided. For example, the computer readable storage medium may be the memory 602 described above including program instructions executable by the processor 601 of the electronic device 600 to perform the medical data classification method described above.

In another exemplary embodiment, a computer program product is also provided, which comprises a computer program executable by a programmable apparatus, the computer program having code portions for performing the above-mentioned medical data classification method when executed by the programmable apparatus.

The preferred embodiments of the present disclosure are described in detail with reference to the accompanying drawings, however, the present disclosure is not limited to the specific details of the above embodiments, and various simple modifications may be made to the technical solution of the present disclosure within the technical idea of the present disclosure, and these simple modifications all belong to the protection scope of the present disclosure.

It should be noted that, in the foregoing embodiments, various features described in the above embodiments may be combined in any suitable manner, and in order to avoid unnecessary repetition, various combinations that are possible in the present disclosure are not described again.

In addition, any combination of various embodiments of the present disclosure may be made, and the same should be considered as the disclosure of the present disclosure, as long as it does not depart from the spirit of the present disclosure.

Claims

1. A method of medical data classification, the method comprising:

2. The method of claim 1, wherein the test data classification model is trained by:

3. The method according to claim 2, wherein the clustering the first sample set to obtain a plurality of cluster sets, selecting a target cluster set with a data amount exceeding a preset data amount from the plurality of cluster sets, and clustering the target cluster set to obtain a plurality of sub-cluster sets comprises:

4. The method of claim 2, wherein the sampling data sets based on the cluster centers of the plurality of sub-cluster sets and the cluster centers of the cluster sets other than the target cluster set in the plurality of cluster sets to obtain sampled data sets comprises:

5. The method of any of claims 2-4, wherein training the test data classification model based on the sample data set and the second sample set comprises:

6. The method of claim 5, wherein updating the sample weights for the sample data based on the classification weight for each of the target layer classifiers and the predicted classification result comprises:

wherein the formula includes the classification weight of the jth classifier in the target layer classifier; c represents the number of classifiers included in the target layer classifier; $f_j^{(t)}$ represents the predicted classification result of the jth classifier in the target layer classifier for the ith sample data; and n represents the number of the sample data.

7. The method according to any one of claims 1-4, wherein the determining the target classification result corresponding to the test data based on the pre-trained test data classification model comprises:

8. A medical data classification apparatus, characterized in that the apparatus comprises:

9. A non-transitory computer readable storage medium, on which a computer program is stored, which, when being executed by a processor, carries out the steps of the method according to any one of claims 1 to 7.

10. An electronic device, comprising:

a memory having a computer program stored thereon;

a processor for executing the computer program in the memory to carry out the steps of the method of any one of claims 1 to 7.