CN113485990A - Multi-dimensional intelligent data cleaning method and system based on big transfusion data - Google Patents

Info

Publication number
CN113485990A
Authority
CN
China
Prior art keywords
data
field
rule
target
target data
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202110757837.0A
Other languages
Chinese (zh)
Inventor
乐爱平
刘威
吴承高
刘强
曹磊
熊伟
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Nanchang Shengyuan Software Co ltd
First Affiliated Hospital of Nanchang University
Original Assignee
Nanchang Shengyuan Software Co ltd
First Affiliated Hospital of Nanchang University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Nanchang Shengyuan Software Co ltd, First Affiliated Hospital of Nanchang University
Priority to CN202110757837.0A
Publication of CN113485990A
Legal status: Pending

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/21Design, administration or maintenance of databases
    • G06F16/215Improving data quality; Data cleansing, e.g. de-duplication, removing invalid entries or correcting typographical errors
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/25Integrating or interfacing systems involving database management systems
    • G06F16/254Extract, transform and load [ETL] procedures, e.g. ETL data flows in data warehouses
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/35Clustering; Classification
    • G06F16/353Clustering; Classification into predefined classes
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/23Clustering techniques
    • G06F18/232Non-hierarchical techniques
    • G06F18/2321Non-hierarchical techniques using statistics or function optimisation, e.g. modelling of probability density functions
    • G06F18/23213Non-hierarchical techniques using statistics or function optimisation, e.g. modelling of probability density functions with fixed number of clusters, e.g. K-means clustering
    • GPHYSICS
    • G16INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16HHEALTHCARE INFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR THE HANDLING OR PROCESSING OF MEDICAL OR HEALTHCARE DATA
    • G16H10/00ICT specially adapted for the handling or processing of patient-related medical or healthcare data
    • G16H10/60ICT specially adapted for the handling or processing of patient-related medical or healthcare data for patient-specific data, e.g. for electronic patient records

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • Databases & Information Systems (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Evolutionary Computation (AREA)
  • Health & Medical Sciences (AREA)
  • Evolutionary Biology (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Artificial Intelligence (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Epidemiology (AREA)
  • General Health & Medical Sciences (AREA)
  • Medical Informatics (AREA)
  • Primary Health Care (AREA)
  • Public Health (AREA)
  • Probability & Statistics with Applications (AREA)
  • Quality & Reliability (AREA)
  • Medical Treatment And Welfare Office Work (AREA)

Abstract

The invention discloses a multi-dimensional intelligent data cleaning method and system based on transfusion big data, which organizes raw blood transfusion data through data extraction, cleaning, repair and clustering. The main technical scheme is as follows: acquiring a unique identifier and treatment data of a patient, wherein the treatment data comprises transfusion data of the patient; cleaning the treatment data according to the unique identifier to obtain target data; outputting the target data, and acquiring a correction rule and a clustering rule; correcting the target data according to the correction rule; and clustering the text data in the corrected target data according to the clustering rule to obtain final data. The method is mainly used for processing blood transfusion data.

Description

Multi-dimensional intelligent data cleaning method and system based on big transfusion data
Technical Field
The invention relates to the field of medical data processing, in particular to a multi-dimensional intelligent data cleaning method and system based on transfusion big data.
Background
With the rapid development of medical systems and the Internet of Things, medical big data analysis is of growing significance in guiding the treatment of patients. Research on and use of transfusion big data in particular enables efficient management of the blood supply and improves the safety of patient transfusion.
Because a patient's treatment information is dispersed across multiple systems, blood transfusion data comes from scattered sources, is highly repetitive, and is disordered and inaccurate. The blood transfusion data therefore needs to be preprocessed: erroneous data is removed and disordered data is organized to form complete data with unified rules, which provides an analysis basis for decisions in the treatment process and lays a foundation for further mining and use of transfusion big data.
Disclosure of Invention
In view of this, the embodiments of the present invention provide a multi-dimensional intelligent data cleaning method and system based on transfusion big data, which organize raw transfusion data through data extraction, cleaning, repair and clustering, and provide complete, regular preprocessed data for further analysis and use of transfusion data.
To achieve this purpose, the invention mainly provides the following technical scheme:
In one aspect, an embodiment of the invention provides a multi-dimensional intelligent data cleaning method based on transfusion big data, which comprises the following steps:
acquiring a unique identification and treatment data of a patient, wherein the treatment data comprises transfusion data of the patient;
cleaning the treatment data according to the unique identifier to obtain target data;
outputting target data, and acquiring a correction rule and a clustering rule;
correcting the target data according to the correction rule;
and clustering the text data in the corrected target data according to the clustering rule to obtain final data.
Preferably, the specific method for acquiring the treatment data of the patient is as follows:
determining a database;
defining a data extraction rule and forming a data table;
and extracting the source data from the database according to the data extraction rule and mapping the source data to the data table to obtain the treatment data.
Preferably, after the source data is extracted from the database and mapped into the data table according to the data extraction rule, the method further includes:
and performing feature derivation on the fields in the data table.
Preferably, the cleaning the treatment data according to the unique identifier specifically includes:
extracting irregular data in the treatment data to carry out regularized replacement to generate structural data;
and merging the same unique identification data in the structural data according to the unique identification, and deleting the repeated data under the unique identification.
Preferably, the outputting the target data specifically includes:
outputting the data missing condition in each field in the target data;
and outputting the data distribution condition in each field in the target data.
Preferably, the data missing condition in each field in the output target data specifically includes:
extracting the data number of each field in the target data;
and drawing a bar chart of the number of the field-data, and performing visual output.
Preferably, the data distribution in each field in the output target data specifically includes:
carrying out equal grouping on the corresponding values of the fields, and calculating the numerical frequency of each group;
drawing a numerical value-frequency histogram;
and connecting points corresponding to the median values of the numerical value histograms of all groups to form a smooth curve, and performing visual output.
Preferably, the correction rule specifically includes:
abnormal value judgment rules and processing rules for the target data, and filling rules for missing values in the target data.
Preferably, the filling rule of the missing value in the target data is specifically as follows:
calculating the percentage of the missing value of a certain field in the target data in the total number of the field, and deleting the field if the percentage is greater than the maximum threshold value;
if the percentage is less than the maximum threshold and greater than the minimum threshold, filling in missing values using 0, the mean or mode of the field values;
if the percentage is less than the minimum threshold, filling the missing value using random forest regression.
In another aspect, an embodiment of the invention provides a multi-dimensional intelligent data cleaning system based on transfusion big data, which comprises the following components:
the data acquisition module is used for acquiring the unique identifier of the patient and treatment data, wherein the treatment data comprises transfusion data of the patient;
the data cleaning module is used for cleaning the treatment data according to the unique identifier to obtain target data;
the interaction module is used for outputting target data and acquiring a correction rule and a clustering rule;
the correction module is used for correcting the target data according to the correction rule;
and the clustering module is used for clustering the text data in the corrected target data according to the clustering rule to obtain final data.
The multi-dimensional intelligent data cleaning method based on transfusion big data provided by the embodiment of the invention is mainly used for preprocessing blood transfusion data: removing erroneous data, organizing disordered data, and forming complete data with unified rules. In the prior art, the basic information and blood transfusion data of patients are often retrieved manually from each database, the information from different databases is formatted item by item and then integrated, and erroneous data is picked out and missing data is filled in by hand, which is labor-intensive and error-prone. Compared with the prior art, the method organizes the raw blood transfusion data through data extraction, cleaning, repair and clustering, and provides complete, regular preprocessed data for further analysis and use of the blood transfusion data.
Drawings
FIG. 1 is a flowchart of a multi-dimensional intelligent data cleaning method based on transfusion big data according to an embodiment of the present invention;
FIG. 2 is a diagram illustrating a relationship between data tables according to an embodiment of the present invention;
FIG. 3 is a flowchart of another multi-dimensional intelligent data cleaning method based on transfusion big data according to an embodiment of the present invention;
FIG. 4 is a field-data number bar graph provided by an embodiment of the present invention;
FIG. 5 is a value-frequency histogram provided by an embodiment of the present invention;
FIG. 6 is a schematic structural diagram of a multi-dimensional intelligent data cleaning system based on transfusion big data provided by an embodiment of the present invention.
Detailed Description
To further illustrate the technical means and effects of the present invention for achieving the predetermined objects, the following detailed description will be given to the multi-dimensional intelligent data cleaning method and system based on big transfusion data according to the present invention, and the specific implementation, structure, features and effects thereof with reference to the accompanying drawings and preferred embodiments.
As shown in fig. 1, an embodiment of the present invention provides a method for multidimensional intelligent data cleansing based on transfusion big data, including:
S1, acquiring the unique identifier and treatment data of the patient, wherein the treatment data comprises transfusion data of the patient;
S2, cleaning the treatment data according to the unique identifier to obtain target data;
S3, outputting the target data, and acquiring a correction rule and a clustering rule;
S4, correcting the target data according to the correction rule;
S5, clustering the text data in the corrected target data according to the clustering rule to obtain final data.
The treatment data in S1 is patient treatment process data obtained from multiple platforms. The full set of multi-source heterogeneous data can be obtained through a data API arranged on the blood transfusion data operation platform; the source databases include relational databases such as SQL Server and Oracle. Treatment data may include patient baseline data, such as gender, age and weight; laboratory indices, such as Hb, PLT and FIB; clinical symptom indicators, such as blood loss and bleeding volume; clinical transfusion data, such as whether a transfusion was given, the components transfused and the volume transfused; and data for efficacy and prognosis assessment. The unique identifier of the patient may be the patient's identification number or visit card number.
Specifically, the following method is adopted for acquiring the treatment data of the patient:
S11, determining a database;
S12, defining data extraction rules and forming a data table;
S13, extracting the source data from the database according to the data extraction rules and mapping the source data to the data table to obtain the treatment data.
Firstly, a database is determined according to the analysis purpose, a source data table is selected according to the type of the database, and the extracted treatment data is written into the source data table. When defining the data extraction rules, taking surgical transfusion records as an example, forming the table data requires linking a plurality of tables, such as a patient information table, a test result table, a surgical condition table and a transfusion condition table. FIG. 2 shows the correspondence between the tables used from the patient's visit through the blood application, the transfusion and the operation.
In the data extraction process, an extraction range needs to be defined. Taking the matching of the test result table and the surgical condition table as an example, a patient may have several test results, so the test results obtained before the operation must be matched through the relationship between fields. The solution is as follows: based on the test result time and the operation start time, and in combination with the actual testing practice of the hospital, the test results within 15 hours before the operation start time are selected and used to measure the blood indices of the surgical patient at the time of the operation.
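The 15-hour matching rule can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the record layout, the field names `result_time`, `item` and `value`, and the function name `preoperative_results` are hypothetical, and treating both ends of the window as inclusive is a choice the text does not specify.

```python
from datetime import datetime, timedelta

# Window taken from the text: test results within 15 hours before surgery start.
WINDOW = timedelta(hours=15)


def preoperative_results(test_results, operation_start, window=WINDOW):
    """Return the test results reported within `window` before `operation_start`.

    Both ends of the window are treated as inclusive (an assumption).
    """
    earliest = operation_start - window
    return [r for r in test_results
            if earliest <= r["result_time"] <= operation_start]


operation_start = datetime(2021, 7, 5, 14, 0)
test_results = [
    # 29 hours before the operation: outside the window
    {"item": "Hb", "value": 92, "result_time": datetime(2021, 7, 4, 9, 0)},
    # 7.5 hours before the operation: inside the window
    {"item": "Hb", "value": 88, "result_time": datetime(2021, 7, 5, 6, 30)},
    # reported after the operation started: outside the window
    {"item": "PLT", "value": 150, "result_time": datetime(2021, 7, 5, 15, 0)},
]
selected = preoperative_results(test_results, operation_start)
```

In a real extraction job the same filter would more likely be expressed as a join condition in SQL; the Python form is used here only to make the rule concrete.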
Secondly, because the source data tables in different databases are not uniform in form, a new data table is established, and the source data tables are mapped into the new data table to obtain table data with a uniform format. Specifically:
First, in the process of extracting data, based on the source data table determined for the analysis target, the source data fields that meet the specified conditions are extracted by combining the table input control of the ETL tool with an SQL statement, and are filled into the source data table. The conversion control then matches each source data field to a target field in the new data table.
Second, the field values of the source data are mapped into the target fields in the new data table.
Specifically, suppose the source data table A has the following fields and field values:
Field A1: field value M1, field value M2, field value M3
Field A2: field value N1, field value N2, field value N3
The column-to-row control of the ETL tool converts the field values into a row, in the following form:
Field M1: field value N1
Field M2: field value N2
Field M3: field value N3
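The column-to-row conversion above can be illustrated without an ETL tool. The following is a minimal sketch, assuming the source table is held as a dictionary of columns; the function name `columns_to_row` is hypothetical and is not part of any ETL product.

```python
def columns_to_row(field_names, field_values):
    """Turn the values of one source field into field names of a new row,
    paired position by position with the values of a second source field."""
    if len(field_names) != len(field_values):
        raise ValueError("both source fields must hold the same number of values")
    return dict(zip(field_names, field_values))


# Source data table A from the example: two fields with three values each.
source_table_a = {
    "A1": ["M1", "M2", "M3"],
    "A2": ["N1", "N2", "N3"],
}
row = columns_to_row(source_table_a["A1"], source_table_a["A2"])
```

Here `row` maps each value of field A1 to the value of field A2 in the same position, which is the target-field layout described in the text.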
After the source data is mapped into the data table in S13, the method further includes S14: performing feature derivation on the fields in the data table according to the meaning of the fields.
Taking the operation duration as an example, the source data table does not contain this field but does contain the operation start time and the operation end time, so a new field "operation duration" can be calculated from the two field values using the following formula:
operation duration = operation end time - operation start time
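As a small worked example of this feature derivation, the sketch below computes the duration from two timestamps. The field names and the choice of minutes as the unit are assumptions made for illustration.

```python
from datetime import datetime


def add_operation_duration(record):
    """Return a copy of `record` with a derived duration field, in minutes."""
    derived = dict(record)
    delta = record["operation_end"] - record["operation_start"]
    derived["operation_duration_min"] = delta.total_seconds() / 60
    return derived


record = {
    "operation_start": datetime(2021, 7, 5, 14, 0),
    "operation_end": datetime(2021, 7, 5, 16, 45),
}
derived = add_operation_duration(record)
```

For the sample record the derived duration is 165 minutes (2 hours 45 minutes).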
Further, as shown in FIG. 3, in S2, cleaning the treatment data according to the unique identifier specifically includes:
S21, extracting irregular data in the treatment data to carry out regularized replacement to generate structural data;
S22, merging the same unique identification data in the structural data according to the unique identification, and deleting the repeated data under the unique identification.
S21 converts mixed data into structural data. The mixed data was originally structural data, but irregular values arise during data extraction. Taking ANTI-HBC among the laboratory index data as an example, the field is of character type and its value should be "negative" or "positive", but irregular field values such as "6.460 (positive (+)" exist. The regular expression ">? ? \ (positive)? \ " is used to extract field values of this type and replace them with the field value "positive".
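Steps S21 and S22 can be sketched as follows. The pattern used here is a deliberately simple stand-in, not the regular expression from the original text, and the record layout, the key name `patient_id` and both function names are hypothetical.

```python
import re

# Stand-in pattern (an assumption): find "positive" or "negative" anywhere
# in the raw value, ignoring numeric readings and extra symbols around it.
RESULT_PATTERN = re.compile(r"(positive|negative)", re.IGNORECASE)


def normalize_qualitative(value):
    """Replace an irregular qualitative result with 'positive' or 'negative'."""
    match = RESULT_PATTERN.search(value)
    return match.group(1).lower() if match else value


def merge_by_identifier(records, key="patient_id"):
    """Group records by the unique identifier and drop exact repeats."""
    merged = {}
    for rec in records:
        rows = merged.setdefault(rec[key], [])
        if rec not in rows:
            rows.append(rec)
    return merged


raw = [
    {"patient_id": "P001", "ANTI-HBC": "6.460 (positive (+))"},
    {"patient_id": "P001", "ANTI-HBC": "positive"},
    {"patient_id": "P002", "ANTI-HBC": "negative"},
]
structured = [{**r, "ANTI-HBC": normalize_qualitative(r["ANTI-HBC"])} for r in raw]
merged = merge_by_identifier(structured)
```

After normalization the two P001 records become identical, so the merge keeps a single row for that patient.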
Further, in S3, outputting the target data and acquiring the correction rule and the clustering rule specifically includes:
S31, outputting the data missing condition in each field in the target data:
S311, extracting the data number of each field in the target data;
S312, drawing a field-data number bar graph, and performing visual output.
Specifically, the missingno library for Python is used to draw the field-data number bar graph, which visually reflects the missing condition of each field, as shown in FIG. 4. The abscissa is the field name; the left ordinate is the proportion of non-empty data to the total data volume of the field, in the range [0, 1]; the right ordinate is the actual data count corresponding to that proportion; and the number above each bar is the amount of data contained in that field. The figure shows at a glance which fields have more missing data and what proportion of the total data is missing.
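The numbers behind such a bar graph are simple to compute. The sketch below derives, for each field, the non-empty count and the non-empty proportion, which correspond to the right and left ordinates described above; the record layout and the function name `missing_summary` are assumptions, and missing values are assumed to be stored as `None`.

```python
def missing_summary(records, fields):
    """For each field, return the non-empty count and its share of all records."""
    total = len(records)
    summary = {}
    for field in fields:
        non_empty = sum(1 for r in records if r.get(field) is not None)
        summary[field] = {
            "count": non_empty,
            "ratio": non_empty / total if total else 0.0,
        }
    return summary


records = [
    {"age": 54, "Hb": 92, "FIB": None},
    {"age": 61, "Hb": None, "FIB": None},
    {"age": 47, "Hb": 110, "FIB": 2.1},
    {"age": 39, "Hb": 101, "FIB": None},
]
summary = missing_summary(records, ["age", "Hb", "FIB"])
```

With the data loaded into a pandas DataFrame, the missingno library's `bar` function draws the corresponding chart directly.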
S32, outputting the data distribution condition in each field in the target data:
S321, dividing the values of the field into equal-width groups, and calculating the numerical frequency of each group;
S322, drawing a value-frequency histogram, as shown in FIG. 5, wherein the abscissa of the histogram is the data value appearing in the field, and the height of a single rectangle represents the numerical frequency of the corresponding group;
S323, connecting the points corresponding to the middle value of each group in the value-frequency histogram to form a smooth curve, such as the distribution curve in FIG. 5, and performing visual output.
The visual output of the target data gives operators an overall picture of the data, and provides a basis for further data analysis and for manually customizing the correction rule.
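Steps S321 to S323 can be sketched numerically as follows: the values are split into equal-width groups, the frequency of each group is counted, and the group midpoints give the points through which the distribution curve is drawn. The function name and the sample values are hypothetical, and plotting is left out.

```python
def equal_width_histogram(values, n_groups):
    """Split values into n_groups equal-width groups.

    Returns (midpoints, counts); the points (midpoint, count) are the ones
    connected to form the distribution curve.
    """
    lo, hi = min(values), max(values)
    if hi == lo:
        raise ValueError("all values are identical; cannot form groups")
    width = (hi - lo) / n_groups
    counts = [0] * n_groups
    for v in values:
        # The maximum value falls into the last group rather than a new one.
        idx = min(int((v - lo) / width), n_groups - 1)
        counts[idx] += 1
    midpoints = [lo + (k + 0.5) * width for k in range(n_groups)]
    return midpoints, counts


values = [60, 62, 65, 71, 74, 75, 78, 83, 88, 100]
midpoints, counts = equal_width_histogram(values, 4)
```

For the sample values the four groups are 10 units wide, with midpoints 65, 75, 85 and 95 and frequencies 3, 4, 2 and 1.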
And S33, acquiring the correction rule and the clustering rule.
Wherein, the correction rule specifically includes: abnormal value judgment rules and processing rules in the target data and filling rules of missing values in the target data.
The clustering rule uses the K-means algorithm to cluster the field values.
Further, as shown in FIG. 3, in S4, correcting the target data according to the correction rule includes:
S41, identifying abnormal values in the target data and processing the abnormal values;
Abnormal values can be identified in the following ways:
(1) Pauta criterion (3σ rule): within a group of field values, a field value whose deviation from the mean exceeds three times the preset standard deviation is an abnormal value.
(2) Distribution method: the field-data number graph in FIG. 4 and the value-frequency histogram in FIG. 5 show the distribution of the values of each field; if the distribution curve of a field is severely skewed, the field contains abnormal values.
(3) Box plot discrimination: the box plot criterion for abnormal values is based on the quartiles and the interquartile range, where the quartiles are the values that divide a group of data into four equal parts after the data is sorted from smallest to largest. The quartiles are robust: up to 25 percent of the data can be moved arbitrarily far without greatly disturbing them, so abnormal values do not distort the shape of the box plot, and the abnormal values identified by the box plot are objective.
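Criteria (1) and (3) can be sketched with the Python standard library. Two assumptions are made here that the text does not fix: the standard deviation is estimated from the field values themselves (population form) rather than preset, and the box plot fences are placed at 1.5 times the interquartile range, which is the common convention. The function names and sample values are hypothetical.

```python
import statistics


def three_sigma_outliers(values):
    """Values deviating from the mean by more than three standard deviations."""
    mean = statistics.fmean(values)
    sd = statistics.pstdev(values)  # assumption: estimated from the data
    return [v for v in values if abs(v - mean) > 3 * sd]


def boxplot_outliers(values, k=1.5):
    """Values outside [Q1 - k*IQR, Q3 + k*IQR]; k = 1.5 is the usual convention."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    return [v for v in values if v < q1 - k * iqr or v > q3 + k * iqr]


values = [98, 100, 101, 102, 99, 100, 103, 97,
          100, 101, 99, 100, 102, 98, 100, 250]
by_sigma = three_sigma_outliers(values)
by_boxplot = boxplot_outliers(values)
```

On this sample both criteria flag only the value 250. On small samples the 3σ rule can miss outliers because the outlier itself inflates the standard deviation, which is one reason to combine it with the box plot criterion.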
After the abnormal values are identified by the above methods, they are processed as follows:
(1) For a non-negative field, if a value less than or equal to 0 appears, that field value is determined to be an abnormal value, and the field value and its corresponding field may be deleted, or the value may be converted into a null value.
(2) According to the Pauta criterion, each field value is compared with the mean of the corresponding field; a field value whose deviation exceeds three times the preset standard deviation is an abnormal value and is replaced with the mean of the field, so that the distribution of the index is corrected as far as possible without greatly changing the original data.
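The two processing modes can be sketched as below. As assumptions for illustration, null values are represented by `None`, the standard deviation is estimated from the field values rather than preset, and the replacement mean is computed over all non-null values of the field, including the abnormal one.

```python
import statistics


def null_non_positive(values):
    """Processing mode (1): convert values <= 0 into null values (None)."""
    return [v if v is None or v > 0 else None for v in values]


def replace_three_sigma_with_mean(values):
    """Processing mode (2): replace 3-sigma abnormal values with the field mean."""
    present = [v for v in values if v is not None]
    mean = statistics.fmean(present)
    sd = statistics.pstdev(present)  # assumption: estimated from the data
    return [mean if v is not None and abs(v - mean) > 3 * sd else v
            for v in values]


weights = [62.0, 0, 71.5, -3, 58.0]
cleaned_weights = null_non_positive(weights)

readings = [98, 100, 101, 102, 99, 100, 103, 97,
            100, 101, 99, 100, 102, 98, 100, 250]
corrected = replace_three_sigma_with_mean(readings)
```

In the sample, the weights 0 and -3 become null values, and the reading 250 is replaced by the field mean 109.375 while every other reading is left unchanged.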
S42, filling missing values in the target data;
S421, calculating the percentage of missing values of a field in the target data relative to the total number of values of the field, and deleting the field if the percentage is greater than the maximum threshold;
S422, if the percentage is smaller than the maximum threshold and larger than the minimum threshold, filling the missing values with 0, or with the mean or mode of the field values;
The relevant formulas are as follows:
$$M_0 = L + \frac{f_b}{f_a + f_b} \times i$$
$$\bar{x} = \frac{x_1 + x_2 + \cdots + x_n}{n}$$
where $M_0$ is the mode, $\bar{x}$ is the mean, $L$ is the exact lower limit of the group in which the mode is located, $f_a$ is the frequency of the group adjacent to the lower limit of the modal group, $f_b$ is the frequency of the group adjacent to the upper limit of the modal group, $i$ is the group width, $x_i$ is each index value, and $n$ is the number of values.
S423, if the percentage is smaller than the minimum threshold, filling the missing values using random forest regression.
Random forest is a machine learning method that combines a plurality of decision trees. When used as a classifier, its output category is the mode of the categories output by the individual trees. It improves prediction accuracy without significantly increasing the amount of computation, and its results remain stable on data with missing values and on unbalanced data.
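The threshold logic of S421 to S423 can be sketched as a small dispatch function. The threshold values 0.8 and 0.2 are hypothetical placeholders, since the text does not state them, and the behavior when the percentage exactly equals a threshold is also an assumption. A simple filler for the 0, mean and mode options is included; the random forest branch is only named here.

```python
import statistics


def missing_value_strategy(values, max_threshold=0.8, min_threshold=0.2):
    """Choose a filling strategy from the share of missing (None) values.

    The thresholds are placeholders; the source does not give their values.
    """
    missing_ratio = sum(v is None for v in values) / len(values)
    if missing_ratio == 0:
        return "nothing_to_fill"
    if missing_ratio > max_threshold:
        return "delete_field"
    if missing_ratio > min_threshold:
        return "fill_with_zero_mean_or_mode"
    return "random_forest_regression"


def fill_simple(values, method="mean"):
    """Fill None entries with 0, the mean, or the mode of the present values."""
    present = [v for v in values if v is not None]
    if method == "zero":
        fill = 0
    elif method == "mean":
        fill = statistics.fmean(present)
    elif method == "mode":
        fill = statistics.mode(present)
    else:
        raise ValueError(f"unknown method: {method}")
    return [fill if v is None else v for v in values]


hb = [90, None, 110, None, 100, 100]
strategy = missing_value_strategy(hb)
filled_mean = fill_simple(hb, "mean")
filled_mode = fill_simple(hb, "mode")
```

One third of the sample values are missing, so with the placeholder thresholds the field falls into the middle band and is filled with the mean (100.0) or the mode (100).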
The basic steps of filling missing values with a random forest are as follows:
a) extracting all columns with missing values in the data to build the model, and sorting the columns by their number of missing values from smallest to largest;
b) taking the column with the fewest missing values, and filling the missing values of the other columns with 0;
c) filling the missing values of that column using a random forest regression model;
d) repeating steps b) and c) until all missing values are filled, to obtain complete data.
Example Python code for the above process is as follows:
    import numpy as np
    from sklearn.ensemble import RandomForestRegressor
    from sklearn.impute import SimpleImputer

    # Find the fields with missing values.
    # pre_data_1 is the cleaned target data (a pandas DataFrame of numeric fields).
    X_missing = pre_data_1[list(pre_data_1.isna().mean()[pre_data_1.isna().mean() != 0].index)]
    X_missing_reg = X_missing.copy()
    # Column positions sorted by number of missing values, fewest first.
    sortindex = np.argsort(X_missing_reg.isnull().sum(axis=0)).values
    for i in sortindex:
        # Construct a new feature matrix and a new label (column i is the label).
        df = X_missing_reg
        fillc = df.iloc[:, i]
        df = df.iloc[:, df.columns != df.columns[i]]
        # Columns with missing values are padded with 0 in the new feature matrix.
        df_0 = SimpleImputer(missing_values=np.nan, strategy="constant",
                             fill_value=0).fit_transform(df)
        # Find the training set and the test set.
        Ytrain = fillc[fillc.notnull()]
        Ytest = fillc[fillc.isnull()]
        Xtrain = df_0[fillc.notnull().values, :]
        Xtest = df_0[fillc.isnull().values, :]
        # Fill in the missing values with random forest regression.
        rfc = RandomForestRegressor(n_estimators=100)
        rfc = rfc.fit(Xtrain, Ytrain)
        Ypredict = rfc.predict(Xtest)
        # Return the padded features to the original feature matrix.
        X_missing_reg.loc[X_missing_reg.iloc[:, i].isnull(), X_missing_reg.columns[i]] = Ypredict
Further, in S5, clustering the text data in the corrected target data according to the clustering rule to obtain final data specifically includes:
(1) Segmenting each text into words and removing stop words;
(2) Converting the words obtained from text segmentation into word vectors by the TF-IDF method to obtain the weights of the text vectors.
TF-IDF is a statistical method for evaluating the importance of a word to a document in a document set or corpus. The importance of a word increases in proportion to the number of times it appears in the document, but decreases in inverse proportion to the frequency with which it appears in the corpus.
(3) Clustering the texts with the K-means algorithm:
Step 1: selecting the number k of categories to cluster into (for example, k = 20 categories for department clustering), and selecting k center points;
Step 2: for each sample point, calculating the distance from the sample to each cluster center and finding the center point closest to it; the points closest to the same center point form one class;
Step 3: after all samples are assigned, recalculating the centers of the k clusters;
Step 4: judging whether the category of each sample point is the same before and after clustering; if so, terminating the algorithm, otherwise proceeding to Step 5;
Step 5: for the sample points in each class, calculating their center point, taking it as the new center point of the class, and returning to Step 2.
Furthermore, a new field is obtained after the clustering is finished, and the data of this field is the category assigned to each sample. The field and the corresponding field values are written into the database.
The method is simple in principle, easy to implement and low in computational complexity.
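The text clustering steps can be sketched end to end in plain Python. This is an illustrative sketch, not the patented implementation: the TF-IDF weighting shown (term frequency times the natural log of the inverse document frequency) is one common variant, the initial centers are simply the first k samples, the documents are assumed to be already segmented with stop words removed, and all names and sample texts are hypothetical.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """docs: list of token lists. Returns (vocabulary, TF-IDF vectors)."""
    vocab = sorted({t for d in docs for t in d})
    n = len(docs)
    doc_freq = Counter(t for d in docs for t in set(d))
    vectors = []
    for d in docs:
        tf = Counter(d)
        vectors.append([tf[t] / len(d) * math.log(n / doc_freq[t]) for t in vocab])
    return vocab, vectors


def sq_dist(a, b):
    """Squared Euclidean distance between two vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(points, k, max_iter=100):
    """K-means; stops when no sample changes category between iterations."""
    centers = [list(p) for p in points[:k]]  # assumption: first k samples
    labels = None
    for _ in range(max_iter):
        new_labels = [min(range(k), key=lambda c: sq_dist(p, centers[c]))
                      for p in points]
        if new_labels == labels:
            break
        labels = new_labels
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                centers[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels


docs = [
    ["cardiac", "surgery"],
    ["hematology", "clinic"],
    ["cardiac", "surgery", "ward"],
    ["hematology", "clinic", "outpatient"],
]
vocab, vectors = tfidf_vectors(docs)
labels = kmeans(vectors, k=2)
```

The resulting `labels` list is the new field described above: one category per sample, here grouping the two cardiac surgery texts together and the two hematology clinic texts together. In practice a library implementation with better center initialization would normally be preferred.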
In another aspect, as shown in FIG. 6, an embodiment of the present invention provides a multi-dimensional intelligent data cleaning system based on transfusion big data, including:
the data acquisition module is used for acquiring the unique identifier of the patient and treatment data, wherein the treatment data comprises transfusion data of the patient;
the data cleaning module is used for cleaning the treatment data according to the unique identifier to obtain target data;
the interaction module is used for outputting target data and acquiring a correction rule and a clustering rule;
the correction module is used for correcting the target data according to the correction rule;
and the clustering module is used for clustering the text data in the corrected target data according to the clustering rule to obtain final data.
The above description is only for the specific embodiments of the present invention, but the scope of the present invention is not limited thereto, and any person skilled in the art can easily conceive of the changes or substitutions within the technical scope of the present invention, and all the changes or substitutions should be covered within the scope of the present invention. Therefore, the protection scope of the present invention shall be subject to the protection scope of the appended claims.

Claims (10)

1. A multi-dimensional intelligent data cleaning method based on transfusion big data, characterized by comprising the following steps:
acquiring a unique identification and treatment data of a patient, wherein the treatment data comprises transfusion data of the patient;
cleaning the treatment data according to the unique identifier to obtain target data;
outputting the target data, and acquiring a correction rule and a clustering rule;
correcting the target data according to the correction rule;
and clustering the text data in the corrected target data according to the clustering rule to obtain final data.
2. The method for multidimensional intelligent data cleaning based on transfusion big data as claimed in claim 1, wherein the specific method for acquiring the treatment data of the patient is as follows:
determining a database;
defining a data extraction rule and forming a data table;
and extracting source data from the database according to the data extraction rule and mapping the source data to the data table to obtain the treatment data.
3. The method for multi-dimensional intelligent data cleansing based on transfusion big data as claimed in claim 2, wherein after extracting source data from said database and mapping into said data table according to said data extraction rule, further comprising:
and performing characteristic derivation on the fields in the data table.
4. The method for multidimensional intelligent data cleansing based on transfusion big data as claimed in claim 1, wherein said cleansing said treatment data according to said unique identifier specifically comprises:
extracting irregular data from the treatment data and performing regularized replacement on it to generate structured data;
and merging the records in the structured data that share the same unique identifier, and deleting duplicate data under each unique identifier.
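A minimal sketch of the cleaning step in claim 4, assuming that "regularized replacement" can be implemented with regular expressions. The field names `patient_id` and `units`, and the unit-suffix pattern, are hypothetical examples and do not come from the claims.

```python
import re

# Hypothetical rule: values such as "2 U", "2units" or " 2 " all become "2".
UNIT_PATTERN = re.compile(r"^\s*(\d+(?:\.\d+)?)\s*(?:u|units?)?\s*$", re.IGNORECASE)


def normalize_units(raw):
    """Replace an irregular unit string with its bare numeric text."""
    match = UNIT_PATTERN.match(raw)
    return match.group(1) if match else raw.strip()


def clean_records(records, id_field="patient_id"):
    """Normalize each record, group by unique identifier, drop duplicates."""
    merged = {}
    for record in records:
        structured = dict(record)
        structured["units"] = normalize_units(structured["units"])
        rows = merged.setdefault(structured[id_field], [])
        # A record identical to one already stored under this identifier
        # is treated as repeated data and skipped.
        if structured not in rows:
            rows.append(structured)
    return merged


raw_records = [
    {"patient_id": "P1", "units": "2 U"},
    {"patient_id": "P1", "units": "2units"},
    {"patient_id": "P2", "units": "1.5"},
    {"patient_id": "P1", "units": "3u"},
]
target_data = clean_records(raw_records)
```

Normalizing before deduplicating matters here: "2 U" and "2units" only become recognizable as the same record once both have been rewritten to "2".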
5. The method for multi-dimensional intelligent data cleansing based on transfusion big data as claimed in claim 1, wherein said outputting said target data specifically comprises:
outputting the missing-data status of each field in the target data;
and outputting the data distribution of each field in the target data.
6. The method according to claim 5, wherein the outputting of the missing data in each field of the target data specifically includes:
extracting the number of data entries in each field of the target data;
and drawing a bar chart of field versus data count, and outputting it visually.
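The counting and charting of claim 6 can be illustrated with a text-only sketch. It assumes that `None` and the empty string mark a missing value, and it renders the bars as `#` characters rather than with a plotting library; the field names are hypothetical.

```python
def count_present(rows, fields, missing=(None, "")):
    """Count, for each field, how many rows hold a non-missing value."""
    return {
        field: sum(1 for row in rows if row.get(field) not in missing)
        for field in fields
    }


def render_bar_chart(counts, width=20):
    """Render field -> count as horizontal text bars scaled to the largest count."""
    peak = max(counts.values(), default=0) or 1
    lines = []
    for field, count in counts.items():
        bar = "#" * round(width * count / peak)
        lines.append(f"{field:<12} {bar} {count}")
    return "\n".join(lines)


rows = [
    {"hemoglobin": 9.1, "platelets": None},
    {"hemoglobin": None, "platelets": ""},
    {"hemoglobin": 10.2, "platelets": 150},
]
counts = count_present(rows, ["hemoglobin", "platelets"])
chart = render_bar_chart(counts)
```

A short bar relative to the others signals a field with many missing values, which is the information the later filling rule depends on.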
7. The method according to claim 5, wherein the outputting of the data distribution in each field of the target data specifically includes:
dividing the values of the field into equal groups, and calculating the frequency of the values in each group;
drawing a value-frequency histogram;
and connecting the points corresponding to the middle values of the histogram bars of all groups to form a smooth curve, and outputting it visually.
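The grouping in claim 7 can be sketched as below, under two assumptions that the claim does not state explicitly: "equal grouping" is read as equal-width intervals, and "frequency" is read as relative frequency. The function returns the (midpoint, frequency) pairs through which the smooth curve would be drawn; the drawing itself is left out.

```python
def equal_width_histogram(values, n_groups):
    """Split values into equal-width groups and return (midpoint, frequency) pairs."""
    lowest, highest = min(values), max(values)
    # Fall back to a width of 1.0 when all values are identical.
    width = (highest - lowest) / n_groups or 1.0
    counts = [0] * n_groups
    for value in values:
        # The maximum value would land one past the end, so clamp it
        # into the last group.
        index = min(int((value - lowest) / width), n_groups - 1)
        counts[index] += 1
    total = len(values)
    return [
        (lowest + (i + 0.5) * width, count / total)
        for i, count in enumerate(counts)
    ]


curve_points = equal_width_histogram([1, 2, 3, 4, 5, 6, 7, 8, 9, 10], n_groups=3)
```

The bar heights are the frequencies, and the curve connects the midpoint of each bar at that height.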
8. The method for multi-dimensional intelligent data cleaning based on transfusion big data as claimed in claim 1, wherein the correction rule specifically comprises:
judgment rules and processing rules for abnormal values in the target data, and filling rules for missing values in the target data.
9. The method for multi-dimensional intelligent data cleaning based on transfusion big data as claimed in claim 8, wherein the filling rule of missing values in the target data is specifically:
calculating the percentage of missing values of a given field in the target data relative to the total number of entries in that field, and deleting the field if the percentage is greater than the maximum threshold;
if the percentage is less than the maximum threshold and greater than the minimum threshold, filling the missing values with 0, or with the mean or mode of the field values;
if the percentage is less than the minimum threshold, filling the missing values using random forest regression.
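The threshold logic of claim 9 can be sketched as follows. The threshold values, the use of `None` as the missing marker, and the handling of the exact boundary cases (which the claim leaves unspecified) are assumptions. The random forest branch only returns a label here; in practice a regressor such as scikit-learn's `RandomForestRegressor` could be trained on the rows where the field is present, but that step is not shown.

```python
from statistics import mean, mode


def choose_fill_strategy(values, min_threshold, max_threshold):
    """Pick a strategy from the share of missing (None) entries in one field."""
    missing_share = sum(value is None for value in values) / len(values)
    if missing_share > max_threshold:
        return "drop_field"
    # Assumption: a share exactly equal to the maximum threshold is simple-filled,
    # and a share exactly equal to the minimum threshold goes to random forest.
    if missing_share > min_threshold:
        return "simple_fill"
    return "random_forest"


def simple_fill(values, how="mean"):
    """Fill missing entries with 0, the mean, or the mode of the present values."""
    present = [value for value in values if value is not None]
    if how == "zero":
        fill = 0
    elif how == "mean":
        fill = mean(present)
    elif how == "mode":
        fill = mode(present)
    else:
        raise ValueError(f"unknown fill method: {how}")
    return [fill if value is None else value for value in values]


strategy = choose_fill_strategy([1, 2, None, 4, 5], min_threshold=0.1, max_threshold=0.5)
filled = simple_fill([1, None, 3], how="mean")
```

With these example thresholds, the field `[1, 2, None, 4, 5]` has 20% missing values and falls in the middle band, so it is filled by one of the simple methods.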
10. A multi-dimensional intelligent data cleaning system based on blood transfusion big data, characterized by comprising:
the data acquisition module is used for acquiring the unique identifier of the patient and treatment data, wherein the treatment data comprises transfusion data of the patient;
the data cleaning module is used for cleaning the treatment data according to the unique identifier to obtain target data;
the interaction module is used for outputting the target data and acquiring a correction rule and a clustering rule;
the correction module is used for correcting the target data according to the correction rule;
and the clustering module is used for clustering the text data in the corrected target data according to the clustering rule to obtain final data.
CN202110757837.0A 2021-07-05 2021-07-05 Multi-dimensional intelligent data cleaning method and system based on big transfusion data Pending CN113485990A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110757837.0A CN113485990A (en) 2021-07-05 2021-07-05 Multi-dimensional intelligent data cleaning method and system based on big transfusion data

Publications (1)

Publication Number Publication Date
CN113485990A true CN113485990A (en) 2021-10-08

Family

ID=77940249

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110757837.0A Pending CN113485990A (en) 2021-07-05 2021-07-05 Multi-dimensional intelligent data cleaning method and system based on big transfusion data

Country Status (1)

Country Link
CN (1) CN113485990A (en)

Citations (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104123465A (en) * 2014-07-24 2014-10-29 中国软件与技术服务股份有限公司 Big data cross-over analysis early warning method and system based on clusters
CN104142986A (en) * 2014-07-24 2014-11-12 中国软件与技术服务股份有限公司 Big data situation analysis early warning method and system based on clustering
CN107066791A (en) * 2016-12-19 2017-08-18 银江股份有限公司 A kind of aided disease diagnosis method based on patient's assay
CN108182963A (en) * 2017-12-14 2018-06-19 山东浪潮云服务信息科技有限公司 A kind of medical data processing method and processing device
CN109669935A (en) * 2018-12-13 2019-04-23 平安医疗健康管理股份有限公司 Check data screening method, apparatus, equipment and storage medium
CN110070929A (en) * 2019-04-30 2019-07-30 上海复繁信息科技有限公司 A kind of acquisition and cleaning method for atrial fibrillation Single diseases data
CN110957043A (en) * 2018-09-26 2020-04-03 金敏 Disease prediction system
CN111427974A (en) * 2020-06-11 2020-07-17 杭州城市大数据运营有限公司 Data quality evaluation management method and device
CN111986754A (en) * 2020-08-21 2020-11-24 南通大学 Electronic medical record management model construction method based on diabetes

Similar Documents

Publication Publication Date Title
CN111986770B (en) Prescription medication auditing method, device, equipment and storage medium
CN108831559B (en) Chinese electronic medical record text analysis method and system
CN107705839B (en) Disease automatic coding method and system
CN107731269B (en) Disease coding method and system based on original diagnosis data and medical record file data
WO2020220635A1 (en) Pharmaceutical drug classification method and apparatus, computer device and storage medium
CN111414393A (en) Semantic similar case retrieval method and equipment based on medical knowledge graph
CN113345577B (en) Diagnosis and treatment auxiliary information generation method, model training method, device, equipment and storage medium
CN111540468A (en) ICD automatic coding method and system for visualization of diagnosis reason
CN109378066A (en) A kind of control method and control device for realizing disease forecasting based on feature vector
CN108511056A (en) Therapeutic scheme based on patients with cerebral apoplexy similarity analysis recommends method and system
CN111191415A (en) Operation classification coding method based on original operation data
EP3443486A1 (en) Query optimizer for combined structured and unstructured data records
JP2018198045A (en) Apparatus and method for generation of natural language processing event
CN111048190A (en) DRG grouping method based on artificial intelligence
WO2021169203A1 (en) Monogenic disease name recommendation method and system based on multi-level structural similarity
JP6177609B2 (en) Medical chart system and medical chart search method
WO2021120587A1 (en) Method and apparatus for retina classification based on oct, computer device, and storage medium
CN114358001A (en) Method for standardizing diagnosis result, and related device, equipment and storage medium thereof
CN110321556A (en) A kind of method and its system of doctor's diagnosis and treatment medical insurance control expense intelligent recommendation scheme
CN112071431B (en) Clinical path automatic generation method and system based on deep learning and knowledge graph
CN113485990A (en) Multi-dimensional intelligent data cleaning method and system based on big transfusion data
CN110010231A (en) A kind of data processing system and computer readable storage medium
CN115274091A (en) Medical information analysis method and system
WO2022079593A1 (en) A system and a way to automatically monitor clinical trials - virtual monitor (vm) and a way to record medical history
CN113972009A (en) Medical examination consultation system based on clinical examination medical big data

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20211008
