CN113485990A

CN113485990A - Multi-dimensional intelligent data cleaning method and system based on big transfusion data

Info

Publication number: CN113485990A
Application number: CN202110757837.0A
Authority: CN
Inventors: 乐爱平; 刘威; 吴承高; 刘强; 曹磊; 熊伟
Original assignee: Nanchang Shengyuan Software Co ltd; First Affiliated Hospital of Nanchang University
Current assignee: Nanchang Shengyuan Software Co ltd; First Affiliated Hospital of Nanchang University
Priority date: 2021-07-05
Filing date: 2021-07-05
Publication date: 2021-10-08

Abstract

The invention discloses a multi-dimensional intelligent data cleaning method and system based on blood transfusion big data, which organize raw blood transfusion data through data extraction, cleaning, repair, and clustering. The technical scheme is as follows: acquiring a unique identifier and treatment data of a patient, wherein the treatment data comprises transfusion data of the patient; cleaning the treatment data according to the unique identifier to obtain target data; outputting the target data, and acquiring a correction rule and a clustering rule; correcting the target data according to the correction rule; and clustering the text data in the corrected target data according to the clustering rule to obtain final data. The method is mainly used for processing blood transfusion data.

Description

Multi-dimensional intelligent data cleaning method and system based on big transfusion data

Technical Field

The invention relates to the field of medical data processing, in particular to a multi-dimensional intelligent data cleaning method and system based on transfusion big data.

Background

With the rapid development of medical systems and the Internet of Things, medical big data analysis plays an increasingly important guiding role in patient care. In particular, research on and use of transfusion big data enables efficient management of the blood supply and improves the safety of patient transfusion.

Because a patient's medical information is dispersed across many systems, blood transfusion data comes from scattered sources, is highly repetitive, and is disordered and inaccurate. The blood transfusion data therefore needs to be preprocessed: erroneous data is removed and disordered data is organized to form complete data with unified rules. This provides an analysis basis for decisions in the treatment process and lays a foundation for further mining and use of transfusion big data.

Disclosure of Invention

In view of this, embodiments of the present invention provide a multi-dimensional intelligent data cleaning method and system based on transfusion big data, which organize raw transfusion data through data extraction, cleaning, repair, and clustering, and provide complete, regular preprocessed data for further analysis and use of transfusion data.

In order to achieve the purpose, the invention mainly provides the following technical scheme:

on the one hand, the embodiment of the invention provides a multi-dimensional intelligent data cleaning method based on transfusion big data, which comprises the following steps:

acquiring a unique identification and treatment data of a patient, wherein the treatment data comprises transfusion data of the patient;

cleaning the treatment data according to the unique identifier to obtain target data;

outputting target data, and acquiring a correction rule and a clustering rule;

correcting the target data according to the correction rule;

and clustering the text data in the corrected target data according to the clustering rule to obtain final data.

Preferably, the specific method for acquiring the treatment data of the patient is as follows:

determining a database;

defining a data extraction rule and forming a data table;

and extracting the source data from the database according to the data extraction rule and mapping the source data to the data table to obtain the treatment data.

Preferably, after the source data is extracted from the database and mapped into the data table according to the data extraction rule, the method further includes:

and performing characteristic derivation on the fields in the data table.

Preferably, the cleaning the treatment data according to the unique identifier specifically includes:

extracting irregular data in the treatment data to carry out regularized replacement to generate structural data;

and merging the same unique identification data in the structural data according to the unique identification, and deleting the repeated data under the unique identification.

Preferably, the outputting the target data specifically includes:

outputting the data missing condition in each field in the target data;

and outputting the data distribution condition in each field in the target data.

Preferably, the data missing condition in each field in the output target data specifically includes:

extracting the data number of each field in the target data;

and drawing a bar chart of the number of the field-data, and performing visual output.

Preferably, the data distribution in each field in the output target data specifically includes:

carrying out equal grouping on the corresponding values of the fields, and calculating the numerical frequency of each group;

drawing a numerical value-frequency histogram;

and connecting points corresponding to the median values of the numerical value histograms of all groups to form a smooth curve, and performing visual output.

Preferably, the correction rule specifically includes:

rules for judging and processing abnormal values in the target data, and rules for filling missing values in the target data.

Preferably, the filling rule of the missing value in the target data is specifically as follows:

calculating the percentage of the missing value of a certain field in the target data in the total number of the field, and deleting the field if the percentage is greater than the maximum threshold value;

if the percentage is less than the maximum threshold and greater than the minimum threshold, filling in missing values using 0, the mean or mode of the field values;

if the percentage is less than the minimum threshold, filling the missing value using random forest regression.

On the other hand, the embodiment of the invention provides a multi-dimensional intelligent data cleaning system based on transfusion big data, which comprises the following components:

the data acquisition module is used for acquiring the unique identifier of the patient and treatment data, wherein the treatment data comprises transfusion data of the patient;

the data cleaning module is used for cleaning the treatment data according to the unique identifier to obtain target data;

the interaction module is used for outputting target data and acquiring a correction rule and a clustering rule;

the correction module is used for correcting the target data according to the correction rule;

and the clustering module is used for clustering the text data in the corrected target data according to the clustering rule to obtain final data.

The multi-dimensional intelligent data cleaning method based on transfusion big data provided by the embodiment of the invention is mainly used for preprocessing blood transfusion data: removing erroneous data, organizing disordered data, and forming complete data with unified rules. In the prior art, the basic information and blood transfusion data of patients are often retrieved manually from each database, the information from the different databases is formatted item by item and then integrated, and erroneous data is picked out and missing data is filled in by hand, which involves a heavy workload and is error-prone. Compared with the prior art, the invention organizes the raw blood transfusion data through data extraction, cleaning, repair, and clustering, and provides complete, regular preprocessed data for further analysis and use of the blood transfusion data.

Drawings

FIG. 1 is a flowchart of a multi-dimensional intelligent data cleaning method based on transfusion big data according to an embodiment of the present invention;

FIG. 2 is a diagram illustrating the relationship between data tables according to an embodiment of the present invention;

FIG. 3 is a flowchart of another multi-dimensional intelligent data cleaning method based on transfusion big data according to an embodiment of the present invention;

FIG. 4 is a field-data count bar chart provided by an embodiment of the present invention;

FIG. 5 is a value-frequency histogram provided by an embodiment of the present invention;

FIG. 6 is a schematic structural diagram of a multi-dimensional intelligent data cleaning system based on transfusion big data provided by an embodiment of the present invention.

Detailed Description

To further explain the technical means adopted by the present invention to achieve its intended purpose and their effects, the specific implementation, structure, features, and effects of the multi-dimensional intelligent data cleaning method and system based on transfusion big data are described in detail below with reference to the accompanying drawings and preferred embodiments.

As shown in FIG. 1, an embodiment of the present invention provides a multi-dimensional intelligent data cleaning method based on transfusion big data, including:

S1, acquiring the unique identifier and treatment data of a patient, wherein the treatment data comprises transfusion data of the patient;

S2, cleaning the treatment data according to the unique identifier to obtain target data;

S3, outputting the target data, and acquiring a correction rule and a clustering rule;

S4, correcting the target data according to the correction rule;

S5, clustering the text data in the corrected target data according to the clustering rule to obtain final data.

The treatment data in S1 is patient treatment process data obtained from multiple platforms. The full volume of multi-source heterogeneous data can be obtained through a data API provided on the blood transfusion data operation platform, and the source databases include relational databases such as SQL Server and Oracle. Treatment data may include patient baseline data, such as gender, age, and weight; laboratory indices, such as Hb, PLT, and FIB; clinical symptom indicators, such as blood loss and bleeding volume; clinical transfusion data, such as whether a transfusion was given, the components transfused, and the volume transfused; and data for efficacy and prognosis assessment. The unique identifier of the patient may be the patient's identity card number or medical visit card number.

Specifically, the treatment data of the patient is acquired by the following method:

S11, determining a database;

S12, defining data extraction rules and forming a data table;

S13, extracting the source data from the database according to the data extraction rules and mapping the source data to the data table to obtain the treatment data.

Firstly, a database is determined according to the analysis purpose, a source data table is selected according to the database type, and the extracted treatment data is written into the source data table. When defining data extraction rules, taking surgical transfusion records as an example, forming the table data requires linking several tables, such as a patient information table, a test result table, a surgical condition table, and a transfusion condition table. FIG. 2 shows, for example, the correspondence between the tables used from the start of a patient's visit through blood application, transfusion, and operation.

In the data extraction process, an extraction range needs to be defined. Taking the matching of the test result table with the surgical condition table as an example, a patient may have several test results, so the pre-operative test result must be identified through the relationship between fields. The solution is as follows: based on the test result time and the operation start time, and taking the hospital's actual testing practice into account, test results from within 15 hours before the operation start time are selected and used to measure the blood indices of the surgical patient at the time of the operation.
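The 15-hour matching window described above can be sketched in Python as follows. This is a minimal illustration only: the record layout and the field names (patient_id, start, time) are assumptions made for the example and do not come from the source tables.

```python
from datetime import datetime, timedelta

# Hypothetical records; the field names are illustrative assumptions.
operation = {"patient_id": "P001", "start": datetime(2021, 7, 5, 14, 0)}
tests = [
    {"patient_id": "P001", "item": "Hb", "value": 118.0, "time": datetime(2021, 7, 4, 20, 0)},
    {"patient_id": "P001", "item": "Hb", "value": 109.0, "time": datetime(2021, 7, 5, 9, 30)},
    {"patient_id": "P001", "item": "Hb", "value": 96.0, "time": datetime(2021, 7, 5, 16, 0)},
]


def preoperative_tests(operation, tests, window_hours=15):
    """Return the patient's test results reported within window_hours before the operation start."""
    earliest = operation["start"] - timedelta(hours=window_hours)
    return [
        t for t in tests
        if t["patient_id"] == operation["patient_id"]
        and earliest <= t["time"] < operation["start"]
    ]


matched = preoperative_tests(operation, tests)
print([t["value"] for t in matched])
```

Only the result reported inside the window is kept; the earlier result and the post-operative result are excluded.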

Secondly, because the source data tables in different databases are not uniform in form, a new data table is established and the source data tables are mapped into it to obtain table data in a uniform format. Specifically:

In the process of extracting data, first, according to the source data table determined for the analysis target, the source data fields, including those with conditional formats, are extracted using the table input control of an ETL tool combined with SQL statements, and are filled into the source data table. The source data fields are then matched to the target fields in the new data table through a conversion control.

Second, the field values of the source data are mapped into the target fields in the new data table.

Specifically, if source data table A has the following fields and field values:

Field a1: field value M1, field value M2, field value M3

Field a2: field value N1, field value N2, field value N3

the columns are converted into rows using the column-to-row control of the ETL tool, and the converted form is as follows:

Field M1: field value N1

Field M2: field value N2

Field M3: field value N3.
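The column-to-row conversion above can be illustrated with a few lines of Python. This is only a sketch of the transformation itself, not of the ETL tool's control; the function name pivot_key_value is hypothetical.

```python
# Source table A: the values of field a1 become the new field names,
# and the values of field a2 become the corresponding field values.
source_table = {
    "a1": ["M1", "M2", "M3"],
    "a2": ["N1", "N2", "N3"],
}


def pivot_key_value(table, key_field, value_field):
    """Turn a (key column, value column) pair into one record keyed by the key column's values."""
    keys, values = table[key_field], table[value_field]
    if len(keys) != len(values):
        raise ValueError("key and value columns must have the same length")
    return dict(zip(keys, values))


converted = pivot_key_value(source_table, "a1", "a2")
print(converted)
```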

After the source data is mapped into the data table in S13, the method further includes S14: performing feature derivation on the fields in the data table according to the meaning of the fields.

Taking the operation duration as an example, the source data table has no such field, but it does have the operation start time and the operation end time. A new field, "operation duration", can be obtained by calculation from these two field values, using the following formula:

operation duration = operation end time - operation start time
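As a minimal sketch of this feature derivation, assuming the two times are stored as text in a "YYYY-MM-DD HH:MM:SS" format and using hypothetical field names:

```python
from datetime import datetime

# Hypothetical source record; field names and time format are assumptions.
record = {
    "operation_start": "2021-07-05 08:30:00",
    "operation_end": "2021-07-05 11:15:00",
}


def add_operation_duration(record, fmt="%Y-%m-%d %H:%M:%S"):
    """Return a copy of the record with a derived duration field, in minutes."""
    start = datetime.strptime(record["operation_start"], fmt)
    end = datetime.strptime(record["operation_end"], fmt)
    derived = dict(record)
    derived["operation_duration_min"] = (end - start).total_seconds() / 60
    return derived


derived = add_operation_duration(record)
print(derived["operation_duration_min"])
```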

Further, as shown in FIG. 3, S2, cleaning the treatment data according to the unique identifier, specifically includes:

S21, extracting irregular data in the treatment data and performing regularized replacement to generate structured data;

S22, merging the data with the same unique identifier in the structured data according to the unique identifier, and deleting the repeated data under that unique identifier.

S21 is a process of converting mixed data into structured data. The data was originally structured, but irregular values arise during data extraction. Taking ANTI-HBC among the laboratory index items as an example, the field value is of character type and should be "negative" or "positive", but irregular field values such as "6.460 (positive (+))" exist. The regular expression ">? ? \ (positive)? \ " is used to extract field values of this type and replace them with the field value "positive".
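Steps S21 and S22 might be sketched as follows. The regular expression here is an assumption made for illustration and is not the expression used in the original text; the record layout and the helper names normalize_result and merge_by_identifier are likewise hypothetical.

```python
import re

# Assumed pattern: pick out the qualitative result embedded in a mixed value.
QUALITATIVE = re.compile(r"(positive|negative)", re.IGNORECASE)


def normalize_result(raw):
    """S21: replace an irregular value such as '6.460 (positive (+))' with 'positive'."""
    match = QUALITATIVE.search(raw)
    return match.group(1).lower() if match else raw


def merge_by_identifier(rows, id_field="patient_id"):
    """S22: group rows by unique identifier and drop exact duplicates within each group."""
    merged = {}
    for row in rows:
        records = merged.setdefault(row[id_field], [])
        if row not in records:
            records.append(row)
    return merged


rows = [
    {"patient_id": "P001", "ANTI-HBC": "6.460 (positive (+))"},
    {"patient_id": "P001", "ANTI-HBC": "positive"},
    {"patient_id": "P002", "ANTI-HBC": "negative"},
]
cleaned = [{**r, "ANTI-HBC": normalize_result(r["ANTI-HBC"])} for r in rows]
merged = merge_by_identifier(cleaned)
print(merged)
```

After normalization the two rows for P001 become identical, so the merge step keeps only one of them.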

Further, S3, outputting the target data and acquiring the correction rule and the clustering rule, specifically includes:

S31, outputting the data missing condition in each field in the target data:

S311, extracting the data count of each field in the target data;

S312, drawing a field-data count bar chart and performing visual output.

Specifically, the missingno library in Python is used to draw the field-data count bar chart, which visually reflects the missing condition of each field, as shown in FIG. 4. The abscissa is the field name; the left ordinate is the proportion of non-empty data to the total data volume of the field, in the range [0, 1]; the right ordinate is the actual data count corresponding to each proportion; and the number above each bar is the amount of data contained in that field. The figure shows directly which fields have more missing data and what proportion of the total data is missing.
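The quantities plotted in FIG. 4 (the non-empty count per field and its proportion of the total) can be computed as in the sketch below. It uses only the standard library and prints a rough text bar in place of the chart; the sample records and the function name non_missing_summary are illustrative assumptions.

```python
def non_missing_summary(records, fields):
    """Return {field: (non-empty count, proportion of non-empty values)}."""
    total = len(records)
    summary = {}
    for field in fields:
        count = sum(1 for r in records if r.get(field) is not None)
        summary[field] = (count, count / total if total else 0.0)
    return summary


# Hypothetical target data; None marks a missing value.
records = [
    {"Hb": 118.0, "PLT": 210.0, "FIB": None},
    {"Hb": 109.0, "PLT": None, "FIB": None},
    {"Hb": 96.0, "PLT": 185.0, "FIB": 2.4},
    {"Hb": None, "PLT": 160.0, "FIB": None},
]
summary = non_missing_summary(records, ["Hb", "PLT", "FIB"])
for field, (count, ratio) in summary.items():
    # Text stand-in for one bar: length is proportional to the non-empty share.
    print(f"{field:<4} {'#' * round(ratio * 20):<20} {count}/{len(records)}")
```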

S32, outputting the data distribution condition in each field in the target data:

S321, carrying out equal grouping on the corresponding values of the field, and calculating the numerical frequency of each group;

S322, drawing a numerical value-frequency histogram, as shown in FIG. 5, wherein the abscissa of the histogram is the data value appearing in the field, and the height of a single rectangle represents the numerical value frequency of the corresponding group;

S323, connecting the points corresponding to the median values of the numerical value histograms of the groups to form a smooth curve, such as the distribution curve in FIG. 5, and performing visual output.
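The grouping and frequency calculation in S321 to S323 can be sketched as follows. Two assumptions are made: "equal grouping" is read as equal-width grouping, and the points joined by the curve are taken to be the midpoints of the groups. The plotting itself is omitted.

```python
def equal_width_histogram(values, n_groups):
    """Split values into n_groups equal-width groups.

    Returns (midpoints, frequencies), where each frequency is the share of
    values falling into the corresponding group.
    """
    low, high = min(values), max(values)
    width = (high - low) / n_groups or 1.0  # avoid division by zero when all values are equal
    counts = [0] * n_groups
    for v in values:
        # The maximum value is placed in the last group.
        idx = min(int((v - low) / width), n_groups - 1)
        counts[idx] += 1
    midpoints = [low + (k + 0.5) * width for k in range(n_groups)]
    frequencies = [c / len(values) for c in counts]
    return midpoints, frequencies


# Hypothetical field values.
values = [90, 95, 100, 105, 110, 115, 120, 130]
midpoints, frequencies = equal_width_histogram(values, n_groups=4)
print(list(zip(midpoints, frequencies)))
```

The (midpoint, frequency) pairs are the points that a plotting library would join to draw the distribution curve over the histogram.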

The visual output of the target data gives operators an intuitive view of the overall data condition and provides a basis for further data analysis and for manually customizing the correction rule.

And S33, acquiring the correction rule and the clustering rule.

Wherein, the correction rule specifically includes: abnormal value judgment rules and processing rules in the target data and filling rules of missing values in the target data.

The clustering rule clusters the fields by adopting a K-means algorithm.

Further, as shown in FIG. 3, S4, correcting the target data according to the correction rule, includes:

S41, judging abnormal values in the target data and processing the abnormal values;

Abnormal values can be identified in the following ways:

(1) Pauta criterion (3σ rule): among a group of field values, a field value whose deviation from the mean exceeds three times the standard deviation is an abnormal value.

(2) Distribution method: the field-data count chart in FIG. 4 and the value-frequency histogram in FIG. 5 show the distribution of the field values of each field; if the distribution curve of a field is severely skewed, the field contains abnormal values.

(3) Box plot discrimination: the box plot criterion for abnormal values is based on the quartiles and the interquartile range, where the quartiles are the values that divide a group of data, sorted from small to large, into four equal parts. Quartiles are resistant: up to 25% of the data can become arbitrarily distant without greatly disturbing them, so abnormal values do not distort the shape of the box plot, and the box plot's identification of abnormal values is objective.

After abnormal values are identified by the above methods, they are processed as follows:

(1) For a non-negative field, a field value less than or equal to 0 is determined to be an abnormal value; the field value and its corresponding record may be deleted, or the value may be converted to a null value.

(2) According to the Pauta criterion, each field value is compared with the mean of its field; a field value whose deviation exceeds three times the standard deviation is an abnormal value and is replaced with the mean of the field, so that the distribution of the index is corrected as far as possible without greatly changing the original data.
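The three-standard-deviation criterion, the box plot criterion, and the non-positive-value rule can be sketched as below. The 1.5 × interquartile-range fences are the conventional box plot choice and are an assumption here, as are the sample values and function names; the text does not specify them.

```python
import statistics


def three_sigma_outliers(values):
    """Values whose deviation from the mean exceeds three sample standard deviations."""
    mean = statistics.mean(values)
    sd = statistics.stdev(values)
    return [v for v in values if abs(v - mean) > 3 * sd]


def boxplot_outliers(values, k=1.5):
    """Values outside [Q1 - k*IQR, Q3 + k*IQR]; k=1.5 is the usual convention (assumed)."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < low or v > high]


def nonpositive_to_null(values):
    """For a field that should be positive, convert values <= 0 to a null value."""
    return [v if v is not None and v > 0 else None for v in values]


# Hypothetical laboratory values with one implausible entry.
values = [110, 112, 115, 118, 120, 121, 123, 125, 126, 128,
          130, 131, 133, 135, 136, 138, 140, 142, 145, 600]
print(three_sigma_outliers(values), boxplot_outliers(values))
```

Note that with very small samples a single extreme value inflates the standard deviation so much that the three-standard-deviation test may not flag it, which is one reason the box plot criterion is offered alongside it.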

S42, filling missing values in the target data;

S421, calculating the percentage of missing values of a field in the target data relative to the total number of values of the field, and deleting the field if the percentage is greater than the maximum threshold;

S422, if the percentage is less than the maximum threshold and greater than the minimum threshold, filling the missing values with 0, the mean, or the mode of the field values;

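The relevant formulas for the mean and for the mode of grouped data are as follows:

$$\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i$$

$$M_0 = L + \frac{f_b}{f_a + f_b} \times i$$

where L denotes the exact lower limit of the group in which the mode is located, f_a is the frequency of the group adjacent to the lower limit of the modal group, f_b is the frequency of the group adjacent to the upper limit of the modal group, i (in the mode formula) is the group width, x_i is each index value, and n is the number of values.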

S423, if the percentage is smaller than the minimum threshold, filling the missing values by using random forest regression.
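The threshold logic of S421 to S423 can be sketched as follows. The threshold values 0.8 and 0.2 are placeholders chosen for illustration, since the text does not state them, and the handling of a percentage exactly equal to a threshold is likewise an assumption.

```python
import statistics


def missing_value_strategy(values, max_threshold=0.8, min_threshold=0.2):
    """Choose how to handle a field from its share of missing (None) values.

    Thresholds are illustrative assumptions, not values from the text.
    """
    ratio = sum(v is None for v in values) / len(values)
    if ratio > max_threshold:
        return "drop_field"
    if ratio > min_threshold:
        return "simple_fill"  # 0, mean, or mode
    return "random_forest_regression"


def simple_fill(values, method="mean"):
    """Fill None entries with 0, the mean, or the mode of the observed values."""
    present = [v for v in values if v is not None]
    if method == "zero":
        fill = 0
    elif method == "mean":
        fill = statistics.mean(present)
    elif method == "mode":
        fill = statistics.mode(present)
    else:
        raise ValueError(f"unknown method: {method}")
    return [fill if v is None else v for v in values]


print(missing_value_strategy([1, None, 3, None]), simple_fill([1, None, 3], "mean"))
```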

Random forest, as a machine learning method, is an ensemble model comprising a plurality of decision trees. When used as a classifier, its output category is the mode of the categories output by the individual trees; when used for regression, as in this filling step, its output is the average of the individual trees' predictions. It improves prediction accuracy without significantly increasing the amount of computation, and its results remain stable on missing data and unbalanced data.

Filling missing values with random forest comprises the following basic steps:

a) extracting all columns with missing values in the data to build the model, and sorting the columns from small to large by their number of missing values;

b) taking the column with the fewest missing values, and filling the missing values of the other columns with 0;

c) filling the missing values of that column by using a random forest regression model;

d) repeating steps b and c until all missing values are filled, obtaining complete data.

Example Python code for the above process is as follows:

    import numpy as np
    from sklearn.ensemble import RandomForestRegressor
    from sklearn.impute import SimpleImputer

    # pre_data_1 is the pandas DataFrame holding the numeric target data.
    # Find the fields with missing values
    X_missing = pre_data_1[list(pre_data_1.isna().mean()[pre_data_1.isna().mean() != 0].index)]
    # Work on a copy with a 0..n-1 index so index labels can be used as row positions
    X_missing_reg = X_missing.copy().reset_index(drop=True)
    # Column positions sorted by number of missing values, fewest first
    sortindex = np.argsort(X_missing_reg.isnull().sum(axis=0).values)
    for i in sortindex:
        # Construct a new feature matrix and a new label
        df = X_missing_reg
        fillc = df.iloc[:, i]
        # Use all other columns as features
        df = df.iloc[:, df.columns != df.columns[i]]
        # Columns with missing values are filled with 0 in the new feature matrix
        df_0 = SimpleImputer(missing_values=np.nan,
                             strategy="constant", fill_value=0).fit_transform(df)
        # Find the training set and the test set
        Ytrain = fillc[fillc.notnull()]
        Ytest = fillc[fillc.isnull()]
        Xtrain = df_0[Ytrain.index, :]
        Xtest = df_0[Ytest.index, :]
        # Fill in missing values with random forest regression
        rfc = RandomForestRegressor(n_estimators=100)
        rfc = rfc.fit(Xtrain, Ytrain)
        Ypredict = rfc.predict(Xtest)
        # Return the filled feature to the original feature matrix
        X_missing_reg.loc[X_missing_reg.iloc[:, i].isnull(), X_missing_reg.columns[i]] = Ypredict

Further, S5, clustering the text data in the corrected target data according to the clustering rule to obtain final data, specifically includes:

(1) Segmenting each text into words and removing stop words;

(2) Converting the words obtained from word segmentation into word vectors by the TF-IDF method to obtain the weights of the text vectors.

TF-IDF is a statistical method for evaluating how important a word is to a document in a corpus. The importance of a word increases in proportion to the number of times it appears in the document, but decreases as the frequency with which it appears across the corpus increases.
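A minimal sketch of the TF-IDF weighting is given below. It assumes the basic variant, in which the weight is the term frequency multiplied by ln(N / document frequency); libraries commonly use smoothed variants, and the text does not say which one is used. The tokenized texts are hypothetical.

```python
import math
from collections import Counter


def tf_idf(documents):
    """Return one {term: weight} dict per tokenized document (basic TF * ln(N/df) variant)."""
    n_docs = len(documents)
    # Document frequency: number of documents that contain each term.
    doc_freq = Counter(term for doc in documents for term in set(doc))
    weights = []
    for doc in documents:
        counts = Counter(doc)
        weights.append({
            term: (count / len(doc)) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return weights


# Hypothetical texts after word segmentation and stop-word removal.
documents = [
    ["department", "cardiac", "surgery"],
    ["department", "thoracic", "surgery"],
    ["department", "hematology"],
]
weights = tf_idf(documents)
print(weights[0])
```

A word that appears in every text ("department") receives a weight of zero, while a word found in only one text ("cardiac") is weighted more heavily than one shared by two texts ("surgery").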

(3) Clustering the texts with the K-means algorithm:

Step 1: select the number k of categories to cluster (for example, k = 20 categories for department clustering), and select k center points;

Step 2: for each sample point, calculate the distance from the sample to each cluster center and find the center point closest to it (that is, find its group); the points closest to the same center point form one class;

Step 3: after all samples have been assigned, recalculate the centers of the k clusters;

Step 4: judge whether the category of each sample point is the same before and after clustering; if so, terminate the algorithm; otherwise, go to Step 5;

Step 5: for the sample points in each class, compute their center point, take it as the new center point of the class, and continue from Step 2.
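The steps above can be sketched in plain Python as follows. Two choices are assumptions made for the example: the first k samples are used as the initial center points, and distance is Euclidean. In practice the input vectors would be the TF-IDF vectors of the texts rather than the two-dimensional points used here.

```python
import math


def kmeans(points, k, max_iter=100):
    """Cluster points into k classes; returns (labels, centers)."""
    # Step 1: choose k initial centers (assumption: the first k samples).
    centers = [list(p) for p in points[:k]]
    labels = None
    for _ in range(max_iter):
        # Step 2: assign each sample to its nearest center.
        new_labels = [
            min(range(k), key=lambda c: math.dist(p, centers[c]))
            for p in points
        ]
        # Step 4: stop when no sample changed class.
        if new_labels == labels:
            break
        labels = new_labels
        # Steps 3 and 5: recompute each center as the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                centers[c] = [sum(dim) / len(members) for dim in zip(*members)]
    return labels, centers


# Hypothetical two-dimensional samples forming two well-separated groups.
points = [(1.0, 1.0), (8.0, 8.0), (1.2, 0.8), (8.2, 7.9), (0.9, 1.1), (7.8, 8.1)]
labels, centers = kmeans(points, k=2)
print(labels)
```

The resulting labels list plays the role of the new field described in the text: one class number per sample.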

Furthermore, a new field is obtained after the clustering is complete, and the data in this field is the classification category of each sample. The field and its corresponding field values are written into the database.

The method is simple in principle, easy to implement, and low in computational complexity.

On the other hand, as shown in fig. 6, an embodiment of the present invention provides a transfusion big data-based multidimensional intelligent data cleaning system, including:

The above description is only for the specific embodiments of the present invention, but the scope of the present invention is not limited thereto, and any person skilled in the art can easily conceive of the changes or substitutions within the technical scope of the present invention, and all the changes or substitutions should be covered within the scope of the present invention. Therefore, the protection scope of the present invention shall be subject to the protection scope of the appended claims.

Claims

1. Multi-dimensional intelligent data cleaning method based on transfusion big data is characterized by comprising the following steps:

outputting the target data, and acquiring a correction rule and a clustering rule;

correcting the target data according to the correction rule;

2. The method for multidimensional intelligent data cleaning based on transfusion big data as claimed in claim 1, wherein the specific method for acquiring the treatment data of the patient is as follows:

determining a database;

defining a data extraction rule and forming a data table;

and extracting source data from the database according to the data extraction rule and mapping the source data to the data table to obtain the treatment data.

3. The method for multi-dimensional intelligent data cleansing based on transfusion big data as claimed in claim 2, wherein after extracting source data from said database and mapping into said data table according to said data extraction rule, further comprising:

and performing characteristic derivation on the fields in the data table.

4. The method for multidimensional intelligent data cleansing based on transfusion big data as claimed in claim 1, wherein said cleansing said treatment data according to said unique identifier specifically comprises:

5. The method for multi-dimensional intelligent data cleansing based on transfusion big data as claimed in claim 1, wherein said outputting said target data specifically comprises:

outputting the data missing condition in each field in the target data;

6. The method according to claim 5, wherein the outputting of the missing data in each field of the target data specifically includes:

extracting the data number of each field in the target data;

7. The method according to claim 5, wherein the outputting of the data distribution in each field of the target data specifically includes:

drawing a numerical value-frequency histogram;

8. The method for multi-dimensional intelligent data cleaning based on transfusion big data as claimed in claim 1, wherein the obtaining of the correction rule specifically comprises:

and abnormal value judgment rules and processing rules in the target data and filling rules of missing values in the target data.

9. The method for multi-dimensional intelligent data cleaning based on transfusion big data as claimed in claim 8, wherein the filling rule of missing values in the target data is specifically:

if the percentage is less than the maximum threshold and greater than the minimum threshold, filling the missing value with 0, the average or mode of the field values;

10. Multi-dimensional intelligent data cleaning system based on blood transfusion big data, its characterized in that includes:

the interaction module is used for outputting the target data and acquiring a correction rule and a clustering rule;

and the clustering module is used for clustering the corrected text data in the target data according to the clustering rule to obtain final data.