CN111435613A - Medical data preprocessing method and device and electronic equipment - Google Patents


Info

Publication number
CN111435613A
Authority
CN
China
Prior art keywords
medical data
data
normalized
training
medical
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201910034052.3A
Other languages
Chinese (zh)
Inventor
郭晓方
金敏
刘颖丰
徐长水
雷锦誌
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Individual
Original Assignee
Individual
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date

Classifications

    • G - PHYSICS
    • G16 - INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16H - HEALTHCARE INFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR THE HANDLING OR PROCESSING OF MEDICAL OR HEALTHCARE DATA
    • G16H50/00 - ICT specially adapted for medical diagnosis, medical simulation or medical data mining; ICT specially adapted for detecting, monitoring or modelling epidemics or pandemics
    • G16H50/30 - ICT specially adapted for medical diagnosis, medical simulation or medical data mining; ICT specially adapted for detecting, monitoring or modelling epidemics or pandemics for calculating health indices; for individual health risk assessment
    • G - PHYSICS
    • G16 - INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16H - HEALTHCARE INFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR THE HANDLING OR PROCESSING OF MEDICAL OR HEALTHCARE DATA
    • G16H10/00 - ICT specially adapted for the handling or processing of patient-related medical or healthcare data
    • G16H10/60 - ICT specially adapted for the handling or processing of patient-related medical or healthcare data for patient-specific data, e.g. for electronic patient records

Landscapes

  • Health & Medical Sciences (AREA)
  • Engineering & Computer Science (AREA)
  • Medical Informatics (AREA)
  • Public Health (AREA)
  • Epidemiology (AREA)
  • General Health & Medical Sciences (AREA)
  • Primary Health Care (AREA)
  • Biomedical Technology (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • Pathology (AREA)
  • Medical Treatment And Welfare Office Work (AREA)

Abstract

A medical data preprocessing method, a medical data preprocessing device and electronic equipment are disclosed. The medical data preprocessing method comprises the following steps: performing data cleaning processing on the acquired original medical data related to the client to obtain cleaned medical data after data cleaning; normalizing the cleaned medical data to obtain normalized medical data after normalization; and carrying out abnormal value processing on the normalized medical data to obtain training medical data after the abnormal value processing, wherein the training medical data is used for being input into a model training table, and the model training table forms a training set and is used for training a cancer prediction model. In this way, a great deal of time and space can be saved in the data analysis process, and the accuracy and applicability of the data can be ensured, so that the cancer prediction result obtained based on the cancer prediction model has better decision and prediction effects.

Description

Medical data preprocessing method and device and electronic equipment
Technical Field
The present application relates to the field of data processing, and in particular, to a medical data preprocessing method, a medical data preprocessing apparatus, and an electronic device.
Background
Cancer is one of the leading causes of death in the Chinese population. Data show that an estimated 4.292 million new cancer cases and 2.814 million cancer deaths were expected in 2015. That is, about 8 people are diagnosed with cancer and about 5 people die of cancer every minute, and the lifetime risk of cancer for a Chinese person is as high as 22%. Cancer is difficult to cure: according to the latest data of the National Cancer Center, the 5-year survival rate of malignant tumors is 40.5%, compared with 36.9% in the 2015 data.
Data show that if cancer is found early, the cure rate is very high. Moreover, research at home and abroad has demonstrated significant differences in clinical medical data (e.g., blood and urine test data) between cancer patients and healthy persons. Blood and urine testing will gradually become an important tool for cancer screening.
Although it is not difficult to obtain blood and urine test data from routine health examinations or from cancer patients, not all of the obtained data are valid: most collected health examination data are uneven in quality and suffer from defects such as missing items, unit errors, and differing orders of magnitude. This greatly hinders subsequent predictive analysis of cancer based on clinical medical data.
Therefore, an effective technical solution for preprocessing the acquired medical data is needed to facilitate the subsequent data analysis and data mining.
Disclosure of Invention
The present application is proposed to solve the above-mentioned technical problems. Embodiments of the present application provide a medical data preprocessing method, a medical data preprocessing apparatus, and an electronic device, which perform data cleansing processing, normalization processing, and outlier processing on acquired raw medical data related to a client to obtain a data set for training a cancer prediction model. In this way, a great deal of time and space can be saved in the data analysis process, and the accuracy and applicability of the data can be ensured, so that the cancer prediction result obtained based on the cancer prediction model has better decision and prediction effects.
According to an aspect of the present application, there is provided a medical data preprocessing method including: performing data cleaning processing on the acquired original medical data related to the client to obtain cleaned medical data after data cleaning; normalizing the cleaned medical data to obtain normalized medical data after normalization; and carrying out abnormal value processing on the normalized medical data to obtain training medical data after the abnormal value processing, wherein the training medical data is used for being input into a model training table, and the model training table forms a training set and is used for training a cancer prediction model.
In the above medical data preprocessing method, the raw medical data related to the client includes client name, gender, birth date, age, identification number, passport, Hodgkin's disease, Taiwan's disease, hospital examination number or case number, height, body weight, systolic blood pressure, diastolic blood pressure, examination time, red blood cells, red blood cell specific volume, neutrophil%, neutrophil count, monocyte count, basophil count, eosinophil count, mean hemoglobin amount concentration, platelet mean volume, white blood cells, red blood cell mean volume, platelets, platelet volume distribution width, lymphocytes, lymphocyte count, RBC width-cv, RBC width-sd, platelet specific volume, hemoglobin, albumin, glutamic pyruvic transaminase, glutamic-oxaloacetic transaminase, gamma-glutamyl transpeptidase, creatinine, urea, nitrite, fasting blood glucose, total cholesterol, glycerol, triglyceride, lipoprotein density, lipoprotein, and the like.
In the medical data preprocessing method, the data cleaning processing is performed on the acquired original medical data related to the client to obtain the cleaned medical data after the data cleaning, and the method comprises at least one of the following steps: filling values of missing items in the original medical data; identifying and eliminating outliers in the raw medical data; and correcting inconsistent data items in the raw medical data.
In the above medical data preprocessing method, filling the values of the missing items in the original medical data includes: and filling the missing value of the missing item based on the numerical distribution characteristics of the corresponding data items of the same gender in the training set formed by the model training table.
In the above medical data preprocessing method, normalizing the cleaned medical data to obtain the normalized medical data includes: normalizing the data items in the cleaned medical data one by one based on the maximum value and the minimum value of each data item in the training set formed by the model training table, wherein the normalization can be expressed by the formula $(X_{\mathrm{old}} - \min_X)/(\max_X - \min_X)$, where $X_{\mathrm{old}}$ represents a data item of the cleaned medical data, $\min_X$ represents the minimum value of the corresponding data item in the training set formed by the model training table, and $\max_X$ represents the maximum value of the corresponding data item in that training set.
In the above medical data preprocessing method, performing outlier processing on the normalized medical data to obtain training medical data after the outlier processing, includes: setting a corresponding data item as a normal data item in response to the condition that the value of the data item in the normalized medical data is within the range of 0-1; and setting the normalized medical data as the training medical data in response to all data items in the normalized medical data being normal data items.
In the above medical data preprocessing method, performing outlier processing on the normalized medical data to obtain training medical data after the outlier processing further includes: outputting the normalized medical data in response to an abnormal data item being present in the normalized medical data and the normalized medical data belonging to a cancer patient; and setting, as the training medical data, the normalized medical data whose abnormal data items are manually judged to be valid data items.
According to another aspect of the present application, there is also provided a medical data preprocessing apparatus, including: the data cleaning unit is used for performing data cleaning processing on the acquired original medical data related to the client to obtain cleaned medical data after data cleaning; the normalization processing unit is used for performing normalization processing on the cleaned medical data to obtain normalized medical data after normalization processing; and an abnormal value processing unit, which is used for carrying out abnormal value processing on the normalized medical data to obtain training medical data after the abnormal value processing, wherein the training medical data is used for being input into a model training table, and the model training table forms a training set and is used for training a cancer prediction model.
In the medical data preprocessing device, the raw medical data related to the client includes client name, gender, birth date, age, identification number, passport, Hodgkin's disease, Taiwan's disease, hospital examination number or case number, height, body weight, systolic blood pressure, diastolic blood pressure, examination time, red blood cells, red blood cell specific volume, neutrophil%, neutrophil count, monocyte count, basophil count, eosinophil count, granulocyte count, mean hemoglobin amount concentration, platelet mean volume, white blood cells, red blood cell mean volume, platelets, platelet volume distribution width, lymphocyte count, RBC width-cv, RBC width-sd, platelet specific volume, hemoglobin, albumin, glutamic pyruvic transaminase, glutamic-oxaloacetic transaminase, gamma-glutamyl transpeptidase, creatinine, urea, nitrite, fasting blood glucose, total cholesterol, glycerol, triglyceride, lipoprotein density, lipoprotein, and the like.
In the medical data preprocessing device, the data cleaning unit is further configured to perform at least one of the following steps: filling values of missing items in the original medical data; identifying and eliminating outliers in the raw medical data; and correcting inconsistent data items in the raw medical data.
In the medical data preprocessing device, the data cleaning unit is further configured to: and filling the missing value of the missing item based on the numerical distribution characteristics of the corresponding data items of the same gender in the training set formed by the model training table.
In the medical data preprocessing device, the normalization processing unit is further configured to: normalize the data items in the cleaned medical data one by one based on the maximum value and the minimum value of each data item in the training set formed by the model training table, wherein the normalization can be expressed by the formula $(X_{\mathrm{old}} - \min_X)/(\max_X - \min_X)$, where $X_{\mathrm{old}}$ represents a data item of the cleaned medical data, $\min_X$ represents the minimum value of the corresponding data item in the training set formed by the model training table, and $\max_X$ represents the maximum value of the corresponding data item in that training set.
In the medical data preprocessing device, the abnormal value processing unit is further configured to: setting a corresponding data item as a normal data item in response to the condition that the value of the data item in the normalized medical data is within the range of 0-1; and setting the normalized medical data as the training medical data in response to all data items in the normalized medical data being normal data items.
In the medical data preprocessing device, the abnormal value processing unit is further configured to: output the normalized medical data in response to an abnormal data item being present in the normalized medical data and the normalized medical data belonging to a cancer patient; and set, as the training medical data, the normalized medical data whose abnormal data items are manually judged to be valid data items.
According to yet another aspect of the present application, there is also provided an electronic device including: a processor and a memory having stored therein computer program instructions which, when executed by the processor, cause the processor to perform a medical data pre-processing method as described above.
According to yet another aspect of the present application, there is also provided a computer readable storage medium having stored thereon computer program instructions operable, when executed by a computing device, to perform a medical data pre-processing method as described above.
The medical data preprocessing method, the medical data preprocessing device and the electronic equipment provided by the present application can effectively perform data cleaning processing, normalization processing and abnormal value processing on the acquired original medical data related to the client to obtain a data set for training a cancer prediction model. In this way, a great deal of time and space can be saved in the data analysis process, and the accuracy and applicability of the data can be ensured, so that the cancer prediction result obtained based on the cancer prediction model has better decision and prediction effects.
Drawings
These and/or other aspects and advantages of the present invention will become more apparent and more readily appreciated from the following detailed description of the embodiments of the invention, taken in conjunction with the accompanying drawings, in which:
Fig. 1 illustrates a flow chart of a medical data preprocessing method according to an embodiment of the application.
Fig. 2 illustrates a schematic diagram of the distribution of red blood cell attribute values of lung cancer patients in the specific example shown in Table 1.
Fig. 3 illustrates a schematic diagram of the distribution of red blood cell attribute values of a healthy population in the specific example shown in Table 1.
Fig. 4 illustrates another flow diagram of the medical data preprocessing method according to an embodiment of the present application.
Fig. 5 illustrates a block diagram of a medical data preprocessing apparatus according to an embodiment of the present application.
Fig. 6 illustrates a block diagram of an electronic device according to an embodiment of the present application.
Detailed Description
Hereinafter, example embodiments according to the present application will be described in detail with reference to the accompanying drawings. It should be apparent that the described embodiments are merely some embodiments of the present application and not all embodiments of the present application, with the understanding that the present application is not limited to the example embodiments described herein.
Summary of the application
As mentioned above, it is not difficult to obtain blood and urine test data from routine health examinations or from cancer patients, but not all of the obtained data are valid: most collected health examination data are uneven in quality and suffer from defects such as missing items, unit errors, and differing orders of magnitude. This greatly hinders subsequent predictive analysis of cancer based on clinical medical data.
Research has found that data preprocessing takes up 60% to 80% of the time and space of the whole data analysis process. That is, if effective data preprocessing is performed before data analysis and exploration, a great deal of time and space can be saved, and the cancer prediction result obtained based on the cancer prediction model is more likely to have good decision and prediction effects.
In view of the above requirements, the basic idea of the present application is to design a technical solution specifically for medical data preprocessing, wherein the data preprocessing solution mainly includes four parts: data cleaning, data transformation, data normalization, and data loading.
Data cleaning means to eliminate noise data existing in medical data and correct data inconsistency in medical data.
Data transformation refers to converting data from one format into another, unified format that is easy to analyze and process, for example through unit conversion or scaling.
The purpose of data normalization is as follows: when facing massive data, redundant feature data irrelevant to the analysis task need to be deleted, and redundant data need to be eliminated through clustering and other means, in order to improve analysis efficiency.
Data loading refers to reconstructing a new data set with a uniform structure from different data sources, for example by horizontal merging or vertical appending.
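The two loading operations named above can be sketched as follows. This is only an illustrative sketch, not the applicant's implementation: the function names, the list-of-dictionaries record layout, and the `id` key are assumptions introduced here.

```python
def append_vertically(*tables):
    """Stack non-empty row lists from several sources; all rows must share the same columns."""
    columns = set(tables[0][0])
    merged = []
    for table in tables:
        for row in table:
            if set(row) != columns:
                raise ValueError("sources must share the same columns")
            merged.append(dict(row))
    return merged


def merge_horizontally(left, right, key):
    """Join two row lists on a shared key, keeping only keys present in both."""
    right_by_key = {row[key]: row for row in right}
    return [{**row, **right_by_key[row[key]]} for row in left if row[key] in right_by_key]


# Hypothetical records from two sources.
blood = [{"id": 1, "rbc": 4.5}]
more_blood = [{"id": 2, "rbc": 4.1}]
liver = [{"id": 1, "albumin": 42.0}, {"id": 3, "albumin": 40.0}]

stacked = append_vertically(blood, more_blood)
joined = merge_horizontally(blood, liver, "id")
```

Vertical appending adds more clients with the same data items, while horizontal merging adds more data items for the same client.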
Based on the above, the medical data preprocessing method, device and electronic equipment provided by the application firstly perform data cleaning processing on the acquired original medical data related to the client to obtain cleaned medical data after data cleaning; then, normalizing the cleaned medical data to obtain normalized medical data after normalization; then, abnormal value processing is carried out on the normalized medical data to obtain training medical data after the abnormal value processing, wherein the training medical data is used for being recorded into a model training table, and the model training table forms a training set and is used for training a cancer prediction model.
In this way, a data set for training a cancer prediction model is obtained by performing a data cleansing process, a normalization process, and an outlier process on the acquired raw medical data relating to the client. Therefore, a great deal of time and space can be saved in the data analysis process, and the accuracy and applicability of the data can be ensured, so that the cancer prediction result obtained based on the cancer prediction model has better decision and prediction effects.
Having described the general principles of the present application, various non-limiting embodiments of the present application will now be described with reference to the accompanying drawings.
Exemplary medical data preprocessing method
Fig. 1 illustrates a flow chart of a medical data preprocessing method according to an embodiment of the application. As shown in Fig. 1, the medical data preprocessing method according to an embodiment of the present application includes: S110, performing data cleaning processing on the acquired original medical data related to the client to obtain cleaned medical data after data cleaning; S120, normalizing the cleaned medical data to obtain normalized medical data after normalization; and S130, performing abnormal value processing on the normalized medical data to obtain training medical data after abnormal value processing, wherein the training medical data is used for being recorded into a model training table, and the model training table forms a training set and is used for training a cancer prediction model.
In step S110, the acquired raw medical data related to the client is subjected to data cleaning processing to obtain cleaned medical data after data cleaning. In the embodiment of the application, the raw medical data related to the client comprises client basic information, client physical examination information, client clinical diagnosis information, client case information and the like. The sources of the raw medical data comprise the LIS (Laboratory Information System) or HIS (Hospital Information System) of a state-approved medical institution (including a hospital, a physical examination center, a health care institution, or a community clinic), examination reports provided by individuals, and the like. The client referred to in the application is a user relative to the hospital or the physical examination center, namely, a person participating in a physical examination or a patient going to the hospital for examination.
It should be appreciated that there is some variance in the raw medical data collected from the sources. These differences are particularly manifested in: first, the total number of data items may not be consistent for routine examination reports of the same item by different medical institutions, i.e., the data item standards lack uniformity; secondly, for routine examination reports of the same item, the arrangement sequence of each detection item is inconsistent, namely the ordering of the data items lacks consistency; third, in the routine examination report of the same item in different medical institutions, the measurement units of the detection items are different, namely, the measurement of the data items lacks consistency. In addition, not all data items in the original medical data collected from various sources are valid data, and most of the generally collected health examination data has the problems of missing items, unit errors, order of magnitude differences and the like. These phenomena bring great trouble to the subsequent data analysis and data mining (i.e., predictive analysis of cancer based on medical data).
Accordingly, in step S110, the acquired raw medical data related to the client is first subjected to data cleansing processing to obtain cleansed medical data after the data cleansing. In the embodiment of the present application, the process of performing data cleaning processing on the acquired original medical data includes at least one of the following technical means: filling values of missing items in the original medical data; identifying and eliminating outliers in the raw medical data; and correcting inconsistent data items in the raw medical data.
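As one hedged illustration of correcting inconsistent data items, the sketch below harmonizes measurement units before missing values are filled. The conversion table, the item name, and the choice to treat an unrecognized unit as a missing value are assumptions made here for illustration; the application does not specify them. Only the factor of 10 between g/dL and g/L is a fixed fact.

```python
# Hypothetical conversion table: (data item, reported unit) -> factor to the target unit.
TO_TARGET_UNIT = {
    ("hemoglobin", "g/L"): 1.0,
    ("hemoglobin", "g/dL"): 10.0,  # 1 g/dL equals 10 g/L
}


def harmonize(record):
    """Convert each (value, unit) pair to the target unit.

    Values with an unrecognized unit become None (an assumption of this sketch),
    so that a later missing-value step can handle them.
    """
    cleaned = {}
    for item, (value, unit) in record.items():
        factor = TO_TARGET_UNIT.get((item, unit))
        cleaned[item] = None if value is None or factor is None else value * factor
    return cleaned


report_a = harmonize({"hemoglobin": (135.0, "g/L")})
report_b = harmonize({"hemoglobin": (13.5, "g/dL")})
report_c = harmonize({"hemoglobin": (13.5, "unknown")})
```

After harmonization, reports from institutions that use different units yield comparable values for the same data item.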
Specifically, in the embodiment of the present application, the process of filling in the value of the missing item in the original medical data includes: filling the missing value of the missing item based on the numerical distribution characteristics of the corresponding data item for the same gender in the training set formed by the model training table. For convenience of explanation and understanding, the process of filling missing values according to the embodiment of the present application will be described by taking the blood and urine test data of lung cancer patients and healthy persons as an example.
Table 1 below shows one specific example of blood and urine test data of lung cancer patients and healthy people according to an embodiment of the present application. It should be understood that the data illustrated in Table 1 are obtained from statistics of the training set formed by the model training table, wherein the distribution of the red blood cell attribute values of the lung cancer patients and that of the healthy population are shown in Fig. 2 and Fig. 3, respectively.
[Table 1: blood and urine test data of lung cancer patients and healthy people (image in the original publication)]
Take a lung cancer sample with a missing red blood cell attribute as an example. Suppose there are N lung cancer patients in this specific example, of whom N1 are male, with mean $u_1$ and variance $\sigma_1$, obeying the normal distribution $N_1(u_1, \sigma_1)$, and N2 are female, with mean $u_2$ and variance $\sigma_2$, obeying the normal distribution $N_2(u_2, \sigma_2)$. Then, if the sample is male, a value is randomly generated from the $N_1$ distribution as the fill value; if the sample is female, a value is randomly generated from the $N_2$ distribution as the fill value. Similarly, take a healthy-population sample with a missing red blood cell attribute as an example. Suppose there are M healthy persons in this specific example, of whom M1 are male, with mean $u_3$ and variance $\sigma_3$, obeying the normal distribution $M_1(u_3, \sigma_3)$, and M2 are female, with mean $u_4$ and variance $\sigma_4$, obeying the normal distribution $M_2(u_4, \sigma_4)$. Then, if the sample is male, a value is randomly generated from the $M_1$ distribution as the fill value; if the sample is female, a value is randomly generated from the $M_2$ distribution as the fill value.
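The gender-conditional filling described above can be sketched as follows. This is a minimal sketch under stated assumptions: the function names, the `gender` and `rbc` field names, and the sample values are hypothetical, and the text's $\sigma$ is treated as a variance (as the text labels it), so the code takes its square root before sampling. In practice the training set would first be restricted to the sample's own group (lung cancer patients or healthy population).

```python
import math
import random
import statistics


def fit_by_gender(training_rows, item):
    """Return {gender: (mean, variance)} of one data item over the training set."""
    groups = {}
    for row in training_rows:
        value = row.get(item)
        if value is not None:
            groups.setdefault(row["gender"], []).append(value)
    return {g: (statistics.mean(v), statistics.pvariance(v)) for g, v in groups.items()}


def fill_missing(sample, item, params, rng=random):
    """Fill a missing item with a draw from the same-gender normal distribution."""
    if sample.get(item) is not None:
        return sample
    mean, variance = params[sample["gender"]]
    # random.gauss expects a standard deviation, hence the square root.
    return {**sample, item: rng.gauss(mean, math.sqrt(variance))}


# Hypothetical training rows for one group.
training = [
    {"gender": "M", "rbc": 4.0},
    {"gender": "M", "rbc": 5.0},
    {"gender": "F", "rbc": 4.0},
    {"gender": "F", "rbc": 4.0},
]
params = fit_by_gender(training, "rbc")
filled = fill_missing({"gender": "F", "rbc": None}, "rbc", params)
```

Drawing from the fitted distribution, rather than always inserting the mean, keeps the spread of the filled item close to that of the training set.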
By cleaning the original medical data, the noise data in the original medical data can be removed, and the condition that the data in the original medical data are inconsistent can be corrected.
Optionally, the acquired raw medical data may also be standardized before data cleaning. The standardization process may include the following step: the raw medical data is first encoded with specific key values as association rules. For example, the raw medical data may be labeled and encoded with its acquisition source (the specific hospital or institution) and acquisition date, wherein the encoding rule may be implemented as an encoding system based on the location of the medical institution, with two additional identification digits appended to distinguish multiple medical institutions located in the same area. Of course, those skilled in the art will appreciate that in other embodiments of the present application, other association rules may be used to encode the raw medical data, as long as the association rule can serve as a unique identifier for the corresponding medical data. This is not intended to limit the scope of the present application.
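One possible shape of such a key is sketched below. The layout (region code, two-digit institution number, then the acquisition date), the function name, and the example region code are all assumptions for illustration; the application only requires that the key uniquely identify the corresponding medical data.

```python
from datetime import date


def source_key(region_code, institution_seq, collected_on):
    """Build a key: region code + two-digit institution number + acquisition date."""
    if not 0 <= institution_seq <= 99:
        raise ValueError("institution_seq must fit in two digits")
    return f"{region_code}{institution_seq:02d}-{collected_on:%Y%m%d}"


# Hypothetical region code and institution number.
key = source_key("110101", 3, date(2019, 1, 14))
```

Here two institutions in the same region differ only in the two appended digits, which matches the purpose described above.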
In step S120, the cleaned medical data is normalized to obtain normalized medical data after being normalized. That is, after the cleaning process is performed on the raw medical data to eliminate noise data in the raw medical data and remove and fill up missing values, a data normalization process is performed.
Specifically, in the embodiment of the present application, the process of normalizing the cleaned medical data includes: normalizing the data items in the cleaned medical data one by one based on the maximum value and the minimum value of each data item in the training set formed by the model training table, wherein the normalization can be expressed by the formula $(X_{\mathrm{old}} - \min_X)/(\max_X - \min_X)$, where $X_{\mathrm{old}}$ represents a data item of the cleaned medical data, $\min_X$ represents the minimum value of the corresponding data item in the training set formed by the model training table, and $\max_X$ represents the maximum value of the corresponding data item in that training set.
For ease of understanding and explanation, the normalization process of the embodiments of the present application will be described by taking the blood and urine test data of lung cancer patients and healthy persons as an example. Let the maximum value and the minimum value of the red blood cell attribute in the training set of the lung cancer prediction model be $\max_{RBC}$ and $\min_{RBC}$. The red blood cell attribute value in the cleaned medical data is then normalized by the formula $(RBC_{\mathrm{old}} - \min_{RBC})/(\max_{RBC} - \min_{RBC})$. The other data items in the cleaned medical data are normalized in the same way.
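The item-by-item min-max normalization can be sketched as follows. The function names, the record layout, and the sample values are assumptions for illustration. The bounds come from the training set rather than from the record being normalized, so a new record can fall outside the interval from 0 to 1.

```python
def fit_min_max(training_rows, items):
    """Record the (min, max) of each data item over the training set."""
    return {
        item: (min(r[item] for r in training_rows), max(r[item] for r in training_rows))
        for item in items
    }


def normalize(record, bounds):
    """Apply (x_old - min) / (max - min) item by item using training-set bounds."""
    out = {}
    for item, (lo, hi) in bounds.items():
        if hi == lo:
            raise ValueError(f"{item}: max equals min in the training set")
        out[item] = (record[item] - lo) / (hi - lo)
    return out


# Hypothetical training set and new records.
bounds = fit_min_max([{"rbc": 3.0}, {"rbc": 5.0}], ["rbc"])
inside = normalize({"rbc": 4.0}, bounds)
outside = normalize({"rbc": 6.0}, bounds)
```

In this sketch `inside["rbc"]` is 0.5, while `outside["rbc"]` is 1.5 because the raw value exceeds the training-set maximum.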
The purpose of data normalization is to delete redundant feature data irrelevant to the analysis task and to eliminate redundant data through clustering and other means, thereby improving the efficiency of data analysis and exploration.
In step S130, performing outlier processing on the normalized medical data to obtain training medical data after the outlier processing, wherein the training medical data is used for entering into a model training table, and the model training table constitutes a training set for training a cancer prediction model. In other words, after the normalization processing is performed on the medical data, the data abnormal value processing work is performed.
Specifically, in the embodiment of the present application, the process of processing the abnormal value of the normalized medical data includes: setting a corresponding data item as a normal data item in response to the condition that the value of the data item in the normalized medical data is within the range of 0-1; and setting the normalized medical data as the training medical data in response to all data items in the normalized medical data being normal data items. In other words, it is first determined whether each item of the normalized medical data is within a preset interval (in the embodiment of the present application, the interval is set to be 0-1); further, the normalized medical data in which no abnormal value exists is set as the training medical data.
The abnormal value determination process according to the embodiment of the present application is described by taking the blood and urine test data of lung cancer patients and healthy persons as an example. Let the maximum value of the red blood cell attribute in the training set of the lung cancer prediction model be $\max_{RBC}$ and the minimum value be $\min_{RBC}$, and let $RBC_{new} = (RBC_{old} - \min_{RBC}) / (\max_{RBC} - \min_{RBC})$. If $0 \le RBC_{new} \le 1$, the red blood cell attribute value is judged to be a normal value; otherwise, it is judged to be an abnormal value. Similarly, outlier determinations may be made for the other data items in the normalized medical data.
Furthermore, for normalized medical data in which an abnormal value exists, in the embodiment of the present application, if the piece of normalized medical data belongs to the healthy population, it is not set as the training medical data. If the normalized medical data belongs to a cancer patient, it is output for manual processing, in which it is manually judged whether the abnormal values in the normalized medical data are valid data; if so, the normalized medical data is set as the training medical data; if not, it is not set as the training medical data.
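The 0-1 interval check described above can be sketched as follows. The function names and the sample record are assumptions made for illustration; the application only specifies the interval test itself.

```python
def abnormal_items(normalized_record):
    """Return the names of data items whose normalized value is outside [0, 1]."""
    return [name for name, value in normalized_record.items()
            if not 0.0 <= value <= 1.0]


def is_training_record(normalized_record):
    """A record qualifies directly only if every data item is a normal item."""
    return not abnormal_items(normalized_record)


# Hypothetical normalized record; only the white blood cell value is abnormal.
record = {"red_blood_cells": 0.42, "white_blood_cells": 1.3, "platelets": 0.0}
print(abnormal_items(record))  # ['white_blood_cells']
```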
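The handling of records that contain abnormal values can be sketched as a small routing function. The `Route` enumeration and the function names are hypothetical and serve only to make the three outcomes explicit; the manual judgment itself is represented by a boolean supplied by a reviewer.

```python
from enum import Enum


class Route(Enum):
    TRAINING = "set as training medical data"
    MANUAL_REVIEW = "output for manual processing"
    DISCARD = "not set as training medical data"


def route_record(normalized_record, is_cancer_patient):
    """Decide what happens to a normalized record."""
    has_abnormal = any(not 0.0 <= v <= 1.0 for v in normalized_record.values())
    if not has_abnormal:
        return Route.TRAINING
    # Abnormal values: healthy-population records are dropped,
    # cancer-patient records go to a human reviewer.
    return Route.MANUAL_REVIEW if is_cancer_patient else Route.DISCARD


def after_manual_review(abnormal_values_are_valid):
    """Apply the reviewer's judgment to a record sent for manual processing."""
    return Route.TRAINING if abnormal_values_are_valid else Route.DISCARD
```

For example, `route_record({"red_blood_cells": 1.2}, is_cancer_patient=True)` yields `Route.MANUAL_REVIEW`, while the same record from a healthy person yields `Route.DISCARD`.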
Further, the training medical data is to be entered into a model training table, and the model training table constitutes a training set for training a cancer prediction model. It should be understood that each type of cancer prediction model requires a training data set, for example, when lung cancer needs to be analyzed for prediction, a training data set for lung cancer is constructed.
In particular, in the embodiments of the present application, the raw medical data associated with the client includes client name, gender, date of birth, age, identification number, passport, Hong Kong and Macao home-return permit, Taiwan compatriot permit, hospital examination number or case number, height, body weight, systolic blood pressure, diastolic blood pressure, examination time, red blood cells, red blood cell specific volume, neutrophil%, neutrophil count, monocyte%, monocyte count, basophil count, eosinophil count, mean hemoglobin concentration, platelet mean volume, white blood cells, mean volume of red blood cells, platelets, platelet volume distribution width, lymphocyte count, RBC distribution width-CV, RBC distribution width-SD, platelet specific volume, hemoglobin, albumin, glutamic transaminase, glutamic-oxalacetic transaminase, gamma-glutamyl transpeptidase, creatinine, urea, uric acid, blood glucose, total cholesterol, triglyceride, high density lipoprotein, lipoprotein density, hemoglobin, albumin, lipoprotein-cholesterol-protein (HA), cholesterol-protein (alpha-cholesterol-protein (alpha-cholesterol-protein, cholesterol-protein (alpha-cholesterol-.
Moreover, as will be appreciated by those skilled in the art, the model training table includes data items selected from one or more of the group consisting of the above data items for the raw medical data associated with the client.
In summary, a medical data preprocessing method based on an embodiment of the present application is illustrated, which obtains a data set for training a cancer prediction model by performing a data cleansing process, a normalization process, and an outlier process on acquired raw medical data related to a client. In this way, a great deal of time and space can be saved in the data analysis process, and the accuracy and applicability of the data can be ensured, so that the cancer prediction result obtained based on the cancer prediction model has better decision and prediction effects.
Fig. 4 illustrates another flow diagram of the medical data preprocessing method according to an embodiment of the present application. As shown in fig. 4, the medical data preprocessing process first obtains raw medical data related to a client and constructs a raw data set. It is then judged whether the data is training data; if not, the process returns to the step of acquiring raw medical data related to the client; if so, a new record is read and it is determined whether the record contains missing values (i.e., null values). If the new record contains no missing value, the process jumps directly to the data normalization stage. In the data normalization stage, the data items are normalized according to the data normalization method described above; specifically, as shown in fig. 4, the 43 data items item1 to item43 are normalized, wherein the normalization formula of each data item is $item = (item - \min_{item}) / (\max_{item} - \min_{item})$. If the new record contains a missing value, it is judged whether the record belongs to a cancer patient; if so, the missing value is filled according to Ethernet distribution or Gamma distribution, and the process jumps to the data normalization stage; if not, the process returns to the step of reading a new record. After the data is normalized, it is judged whether the data contains an abnormal value; if not, the data is entered directly into the model training table. If the data contains an abnormal value, it is first judged whether the data belongs to a cancer patient; if not, the process jumps to the step of reading a new record; if so, the data is output for manual processing, and if the abnormal value is manually judged to be valid data, the data is entered into the model training table.
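The per-record part of the flow in fig. 4 can be sketched as a single function. This is a simplified sketch under stated assumptions: the name `preprocess`, the `bounds` mapping of training-set minima and maxima, and the two callbacks `fill_missing` and `manual_ok` (standing in for distribution-based filling and for the manual judgment) are all hypothetical, and the initial check of whether the data is training data is omitted.

```python
def preprocess(record, bounds, is_cancer_patient, fill_missing, manual_ok):
    """Return the normalized record if it may enter the training table, else None."""
    # Missing values: fill them for cancer patients, skip the record otherwise.
    if any(value is None for value in record.values()):
        if not is_cancer_patient:
            return None
        record = fill_missing(record)

    # Min-max normalization with the training-set minimum and maximum per item.
    normalized = {
        name: (value - bounds[name][0]) / (bounds[name][1] - bounds[name][0])
        for name, value in record.items()
    }

    # No abnormal value: enter the record directly.
    if all(0.0 <= value <= 1.0 for value in normalized.values()):
        return normalized

    # Abnormal value: only cancer-patient records go to manual judgment.
    if is_cancer_patient and manual_ok(normalized):
        return normalized
    return None
```

A caller would loop over the raw data set and append every non-`None` result to the model training table.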
Exemplary medical data preprocessing device
Fig. 5 illustrates a block diagram schematic of a medical data preprocessing apparatus according to an embodiment of the present application.
As shown in fig. 5, a medical data preprocessing apparatus 600 according to an embodiment of the present application includes: a data cleaning unit 610, configured to perform data cleaning processing on the acquired original medical data related to the client to obtain cleaned medical data after data cleaning; a normalization processing unit 620, configured to perform normalization processing on the cleaned medical data to obtain normalized medical data after the normalization processing; and an abnormal value processing unit 630, configured to perform abnormal value processing on the normalized medical data to obtain training medical data after the abnormal value processing, where the training medical data is used to be entered into a model training table, and the model training table constitutes a training set used to train a cancer prediction model.
In one example, in the medical data preprocessing device 600, the raw medical data related to the client includes client name, gender, birth date, age, identification number, passport, Hong Kong and Macao home-return permit, Taiwan compatriot permit, hospital examination number or case number, height, weight, systolic blood pressure, diastolic blood pressure, examination time, red blood cells, red blood cell specific volume, neutrophils, monocytes, basophils, eosinophils, platelet volume, mean hemoglobin concentration, platelet mean volume, white blood cells, red blood cell mean volume, platelets, platelet volume distribution width, lymphocytes, lymphocyte count, RBC width-cv, RBC width-sd, platelet specific volume, hemoglobin, albumin, glutamic pyruvic transaminase, glutamic-oxalacetic transaminase, γ -glutamyl transpeptidase, creatinine, urea, glucose, glycerol, triglyceride, lipoprotein density-sd, cholesterol specific lipoprotein, cholesterol level, cholesterol-protein (HA), cholesterol-protein (cholesterol-protein (alpha-cholesterol-protein (alpha-cholesterol-protein (alpha-cholesterol-protein (alpha-cholesterol-protein), cholesterol-protein (alpha-cholesterol-protein), cholesterol-cholesterol.
In one example, in the medical data preprocessing device 600, the data cleaning unit 610 is further configured to perform at least one of the following steps: filling values of missing items in the original medical data; identifying and eliminating outliers in the raw medical data; and correcting inconsistent data items in the raw medical data.
In an example, in the medical data preprocessing device 600, the data cleaning unit 610 is further configured to: fill the missing value of the missing item based on the numerical distribution characteristics of the corresponding data items of the same gender in the training set formed by the model training table.
In an example, in the medical data preprocessing device 600, the normalization processing unit 620 is further configured to: normalize the data items in the cleaned medical data one by one based on the maximum value and the minimum value of each data item in the training set formed by the model training table, wherein the normalization process can be expressed by the formula $(X_{old} - \min_X) / (\max_X - \min_X)$, where $X_{old}$ represents a data item of the cleaned medical data, $\min_X$ represents the minimum value of the corresponding data item in the training set formed by the model training table, and $\max_X$ represents the maximum value of the corresponding data item in that training set.
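One possible reading of gender-based filling is sketched below. The application does not state which distribution characteristic is used for this step, so taking the mean of the same-gender values is purely an assumption of this sketch, as are the function name and the field names `gender` and `hemoglobin`.

```python
from statistics import mean


def fill_missing_by_gender(record, training_rows, item):
    """Fill a missing item with the mean of same-gender training-set values.

    Using the mean is an assumption; any other characteristic of the
    same-gender distribution could be substituted here.
    """
    if record.get(item) is not None:
        return record
    same_gender = [row[item] for row in training_rows
                   if row["gender"] == record["gender"] and row[item] is not None]
    if not same_gender:
        raise ValueError("no same-gender values available for " + item)
    filled = dict(record)  # leave the caller's record untouched
    filled[item] = mean(same_gender)
    return filled


# Hypothetical training rows and a record with a missing hemoglobin value.
rows = [
    {"gender": "F", "hemoglobin": 4.0},
    {"gender": "F", "hemoglobin": 5.0},
    {"gender": "M", "hemoglobin": 6.0},
]
print(fill_missing_by_gender({"gender": "F", "hemoglobin": None}, rows, "hemoglobin"))
```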
In one example, in the medical data preprocessing apparatus 600, the abnormal value processing unit 630 is further configured to: setting a corresponding data item as a normal data item in response to the condition that the value of the data item in the normalized medical data is within the range of 0-1; and setting the normalized medical data as the training medical data in response to all data items in the normalized medical data being normal data items.
In one example, in the medical data preprocessing apparatus 600, the abnormal value processing unit 630 is further configured to: output the normalized medical data in response to the presence of an abnormal data item in the normalized medical data; and set the normalized medical data whose abnormal data items are manually judged to be valid data items as the training medical data.
Here, it will be understood by those skilled in the art that the specific functions and operations of the respective units and modules in the above-described medical data preprocessing device 600 have been described in detail in the medical data preprocessing method described above with reference to fig. 1 to 4, and thus, a repetitive description thereof will be omitted.
As described above, the medical data preprocessing apparatus according to the embodiment of the present application can be implemented in various terminal devices, for example, a server for cancer prediction analysis. In one example, the medical data preprocessing apparatus according to the embodiment of the present application may be integrated into the terminal device as a software module and/or a hardware module. For example, the medical data preprocessing means may be a software module in the operating system of the terminal device, or may be an application developed for the terminal device; of course, the medical data preprocessing device can also be one of a plurality of hardware modules of the terminal device.
Alternatively, in another example, the medical data preprocessing device and the terminal device may be separate devices, and the medical data preprocessing device may be connected to the terminal device through a wired and/or wireless network and transmit the interaction information according to an agreed data format.
Illustrative electronic device
Next, an electronic apparatus according to an embodiment of the present application is described with reference to fig. 6.
FIG. 6 illustrates a block diagram of an electronic device in accordance with an embodiment of the present application.
As shown in fig. 6, the electronic device 10 includes one or more processors 11 and memory 12.
The processor 11 may be a Central Processing Unit (CPU) or other form of processing unit having data processing capabilities and/or instruction execution capabilities, and may control other components in the electronic device 10 to perform desired functions.
Memory 12 may include one or more computer program products that may include various forms of computer-readable storage media, such as volatile memory and/or non-volatile memory. The volatile memory may include, for example, Random Access Memory (RAM), cache memory (cache), and/or the like. The non-volatile memory may include, for example, Read Only Memory (ROM), hard disk, flash memory, etc. One or more computer program instructions may be stored on the computer readable storage medium and executed by the processor 11 to implement the medical data preprocessing methods of the various embodiments of the present application described above and/or other desired functions. Various contents such as a model training table may also be stored in the computer-readable storage medium.
In one example, the electronic device 10 may further include: an input device 13 and an output device 14, which are interconnected by a bus system and/or other form of connection mechanism (not shown).
The input device 13 may be, for example, a keyboard, a mouse, or the like.
The output device 14 can output various information to the outside, including medical data containing abnormal values and the like. The output device 14 may include, for example, a display, a speaker, a printer, and a communication network and remote output devices connected thereto.
Of course, for simplicity, only some of the components of the electronic device 10 relevant to the present application are shown in fig. 6, and components such as buses, input/output interfaces, and the like are omitted. In addition, the electronic device 10 may include any other suitable components depending on the particular application.
Illustrative computer program product
In addition to the above-described methods and apparatus, embodiments of the present application may also be a computer program product comprising computer program instructions that, when executed by a processor, cause the processor to perform the steps in a method of pre-processing medical data according to various embodiments of the present application described in the "exemplary methods" section of this specification, supra.
The program code of the computer program product for carrying out operations of embodiments of the present application may be written in any combination of one or more programming languages, including object-oriented programming languages such as Java and C++, as well as conventional procedural programming languages or similar programming languages. The program code may execute entirely on the user's computing device, partly on the user's device, as a stand-alone software package, partly on the user's computing device and partly on a remote computing device, or entirely on the remote computing device or server.
Furthermore, embodiments of the present application may also be a computer-readable storage medium having stored thereon computer program instructions which, when executed by a processor, cause the processor to perform the steps in a method of pre-processing medical data according to various embodiments of the present application described in the "exemplary methods" section above in this specification.
The computer-readable storage medium may take any combination of one or more readable media. The readable medium may be a readable signal medium or a readable storage medium. A readable storage medium may include, for example, but not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or a combination of any of the foregoing. More specific examples (a non-exhaustive list) of the readable storage medium include: an electrical connection having one or more wires, a portable disk, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.
The foregoing describes the general principles of the present application in conjunction with specific embodiments, however, it is noted that the advantages, effects, etc. mentioned in the present application are merely examples and are not limiting, and they should not be considered essential to the various embodiments of the present application. Furthermore, the foregoing disclosure of specific details is for the purpose of illustration and description and is not intended to be limiting, since the foregoing disclosure is not intended to be exhaustive or to limit the disclosure to the precise details disclosed.
The block diagrams of devices, apparatuses, and systems referred to in this application are given only as illustrative examples and are not intended to require or imply that the connections, arrangements, and configurations must be made in the manner shown in the block diagrams. These devices, apparatuses, and systems may be connected, arranged, and configured in any manner, as will be appreciated by those skilled in the art. Words such as "including," "comprising," "having," and the like are open-ended words that mean "including, but not limited to," and are used interchangeably therewith. The words "or" and "and" as used herein mean, and are used interchangeably with, the phrase "and/or," unless the context clearly dictates otherwise. The phrase "such as" is used herein to mean, and is used interchangeably with, the phrase "such as but not limited to".
It should also be noted that in the devices, apparatuses, and methods of the present application, the components or steps may be decomposed and/or recombined. These decompositions and/or recombinations are to be considered as equivalents of the present application.
The previous description of the disclosed aspects is provided to enable any person skilled in the art to make or use the present application. Various modifications to these aspects will be readily apparent to those skilled in the art, and the generic principles defined herein may be applied to other aspects without departing from the scope of the application. Thus, the present application is not intended to be limited to the aspects shown herein but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.
The foregoing description has been presented for purposes of illustration and description. Furthermore, the description is not intended to limit embodiments of the application to the form disclosed herein. While a number of example aspects and embodiments have been discussed above, those of skill in the art will recognize certain variations, modifications, alterations, additions and sub-combinations thereof.

Claims (15)

1. A method of medical data preprocessing, comprising:
performing data cleaning processing on the acquired original medical data related to the client to obtain cleaned medical data after data cleaning;
normalizing the cleaned medical data to obtain normalized medical data after normalization; and
carrying out abnormal value processing on the normalized medical data to obtain training medical data after the abnormal value processing, wherein the training medical data is used for being entered into a model training table, and the model training table forms a training set for training a cancer prediction model.
2. The method for preprocessing medical data according to claim 1, wherein the raw medical data related to the client includes client name, gender, birth date, age, identification number, passport, Hong Kong and Macao home-return permit, Taiwan compatriot permit, hospital examination number or case number, height, body weight, systolic blood pressure, diastolic blood pressure, examination time, red blood cells, red blood cell specific volume, neutrophils, monocytes, basophils, eosinophils, mean hemoglobin concentration, platelet mean volume, white blood cells, red blood cell mean volume, platelets, platelet volume distribution width, lymphocytes, RBC width-cv, RBC width-sd, platelet specific volume, hemoglobin, albumin, glutamic pyruvic transaminase, glutamic-pyruvic transaminase, γ -glutamyl transpeptidase, creatinine, urea, glucose, total cholesterol density, triglyceride density-sd, platelet specific volume, hemoglobin, albumin-cholesterol-protein (HA), cholesterol-protein (HA), cholesterol-protein (alpha-cholesterol-protein (alpha-cholesterol-protein (alpha-cholesterol-protein (alpha-cholesterol-protein (alpha), cholesterol-protein (alpha-cholesterol-protein (alpha-protein), cholesterol.
3. The medical data preprocessing method according to claim 1 or 2, wherein the data cleaning process is performed on the acquired raw medical data related to the client to obtain cleaned medical data after the data cleaning, comprising at least one of the following steps:
filling values of missing items in the original medical data;
identifying and eliminating outliers in the raw medical data; and
correcting inconsistent data items in the raw medical data.
4. The medical data preprocessing method of claim 3, wherein filling in values of missing items in the original medical data comprises:
filling the missing value of the missing item based on the numerical distribution characteristics of the corresponding data items of the same gender in the training set formed by the model training table.
5. The medical data preprocessing method as set forth in claim 4, wherein normalizing the cleaned medical data to obtain normalized medical data after normalization comprises:
normalizing the data items in the cleaned medical data one by one based on the maximum value and the minimum value of each data item in the training set formed by the model training table, wherein the normalization process can be expressed by the formula $(X_{old} - \min_X) / (\max_X - \min_X)$, where $X_{old}$ represents a data item of the cleaned medical data, $\min_X$ represents the minimum value of the corresponding data item in the training set formed by the model training table, and $\max_X$ represents the maximum value of the corresponding data item in that training set.
6. The medical data preprocessing method as claimed in claim 5, wherein the performing of outlier processing on the normalized medical data to obtain training medical data after the outlier processing comprises:
setting a corresponding data item as a normal data item in response to the condition that the value of the data item in the normalized medical data is within the range of 0-1; and
setting the normalized medical data as the training medical data in response to all data items in the normalized medical data being normal data items.
7. The medical data preprocessing method as set forth in claim 6, wherein performing outlier processing on the normalized medical data to obtain training medical data after the outlier processing further includes:
outputting the normalized medical data in response to the presence of an anomalous data item in the normalized medical data and the normalized data belonging to a cancer patient; and
setting the normalized medical data, in which the abnormal data item is manually determined to be a valid data item, as the training medical data.
8. A medical data preprocessing apparatus, characterized by comprising:
the data cleaning unit is used for performing data cleaning processing on the acquired original medical data related to the client to obtain cleaned medical data after data cleaning;
the normalization processing unit is used for performing normalization processing on the cleaned medical data to obtain normalized medical data after normalization processing; and
the abnormal value processing unit is used for performing abnormal value processing on the normalized medical data to obtain training medical data after the abnormal value processing, wherein the training medical data is used for being entered into a model training table, and the model training table forms a training set for training a cancer prediction model.
9. The medical data preprocessing apparatus as defined in claim 8, wherein the data cleaning unit is further configured to perform at least one of the following steps:
filling values of missing items in the original medical data;
identifying and eliminating outliers in the raw medical data; and
correcting inconsistent data items in the raw medical data.
10. The medical data preprocessing apparatus as claimed in claim 9, wherein the data cleaning unit is further configured to:
filling the missing value of the missing item based on the numerical distribution characteristics of the corresponding data items of the same gender in the training set formed by the model training table.
11. The medical data preprocessing apparatus as defined in claim 10, wherein the normalization processing unit is further configured to:
normalizing the data items in the cleaned medical data one by one based on the maximum value and the minimum value of each data item in the training set formed by the model training table, wherein the normalization process can be expressed by the formula $(X_{old} - \min_X) / (\max_X - \min_X)$, where $X_{old}$ represents a data item of the cleaned medical data, $\min_X$ represents the minimum value of the corresponding data item in the training set formed by the model training table, and $\max_X$ represents the maximum value of the corresponding data item in that training set.
12. The medical data preprocessing apparatus as defined in claim 11, wherein the abnormal value processing unit is further configured to:
setting a corresponding data item as a normal data item in response to the condition that the value of the data item in the normalized medical data is within the range of 0-1; and
setting the normalized medical data as the training medical data in response to all data items in the normalized medical data being normal data items.
13. The medical data preprocessing apparatus as defined in claim 12, wherein the abnormal value processing unit is further configured to:
outputting the normalized medical data in response to the presence of an anomalous data item in the normalized medical data and the normalized data belonging to a cancer patient; and
setting the normalized medical data, in which the abnormal data item is manually determined to be a valid data item, as the training medical data.
14. An electronic device, comprising:
a processor; and
a memory having stored therein computer program instructions which, when executed by the processor, cause the processor to perform the medical data pre-processing method of any one of claims 1-7.
15. A computer readable storage medium having stored thereon computer program instructions operable, when executed by a computing device, to perform the medical data pre-processing method of any of claims 1-7.
CN201910034052.3A 2019-01-15 2019-01-15 Medical data preprocessing method and device and electronic equipment Pending CN111435613A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910034052.3A CN111435613A (en) 2019-01-15 2019-01-15 Medical data preprocessing method and device and electronic equipment

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201910034052.3A CN111435613A (en) 2019-01-15 2019-01-15 Medical data preprocessing method and device and electronic equipment

Publications (1)

Publication Number Publication Date
CN111435613A true CN111435613A (en) 2020-07-21

Family

ID=71580655

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910034052.3A Pending CN111435613A (en) 2019-01-15 2019-01-15 Medical data preprocessing method and device and electronic equipment

Country Status (1)

Country Link
CN (1) CN111435613A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112599250A (en) * 2020-12-24 2021-04-02 中国人民解放军总医院第三医学中心 Postoperative data analysis method and device based on deep neural network

Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106529168A (en) * 2016-11-08 2017-03-22 无锡市妇幼保健院 Gynecological disease intelligent diagnosis oriented data preprocessing technology

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination