CN115391332A - Data governance method, device and computer storage medium - Google Patents

Data governance method, device and computer storage medium

Info

Publication number
CN115391332A
Authority
CN
China
Prior art keywords
data
modeling
desensitization
cleaning
medical
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202210836255.6A
Other languages
Chinese (zh)
Inventor
安西平
徐辉
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Singularity Of Life Beijing Technology Co ltd
Original Assignee
Singularity Of Life Beijing Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Singularity Of Life Beijing Technology Co ltd filed Critical Singularity Of Life Beijing Technology Co ltd
Priority to CN202210836255.6A priority Critical patent/CN115391332A/en
Publication of CN115391332A publication Critical patent/CN115391332A/en
Pending legal-status Critical Current

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/22 Indexing; Data structures therefor; Storage structures
    • G06F16/2228 Indexing structures
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/21 Design, administration or maintenance of databases
    • G06F16/215 Improving data quality; Data cleansing, e.g. de-duplication, removing invalid entries or correcting typographical errors
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/22 Indexing; Data structures therefor; Storage structures
    • G06F16/2282 Tablespace storage structures; Management thereof
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F21/00 Security arrangements for protecting computers, components thereof, programs or data against unauthorised activity
    • G06F21/60 Protecting data
    • G06F21/62 Protecting access to data via a platform, e.g. using keys or access control rules
    • G06F21/6218 Protecting access to data via a platform, e.g. using keys or access control rules to a system of files or objects, e.g. local or distributed file system or database
    • G06F21/6245 Protecting personal data, e.g. for financial or medical purposes
    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16H HEALTHCARE INFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR THE HANDLING OR PROCESSING OF MEDICAL OR HEALTHCARE DATA
    • G16H50/00 ICT specially adapted for medical diagnosis, medical simulation or medical data mining; ICT specially adapted for detecting, monitoring or modelling epidemics or pandemics
    • G16H50/70 ICT specially adapted for medical diagnosis, medical simulation or medical data mining; ICT specially adapted for detecting, monitoring or modelling epidemics or pandemics for mining of medical data, e.g. analysing previous cases of other patients

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Data Mining & Analysis (AREA)
  • Health & Medical Sciences (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • General Health & Medical Sciences (AREA)
  • Software Systems (AREA)
  • Medical Informatics (AREA)
  • Bioethics (AREA)
  • Public Health (AREA)
  • Computer Hardware Design (AREA)
  • Computer Security & Cryptography (AREA)
  • Biomedical Technology (AREA)
  • Pathology (AREA)
  • Epidemiology (AREA)
  • Primary Health Care (AREA)
  • Quality & Reliability (AREA)
  • Medical Treatment And Welfare Office Work (AREA)

Abstract

A data governance method and device, the method including: acquiring initial medical data; performing data modeling and model conversion based on the initial medical data; standardizing the modeled and converted data; constructing a patient master index from the standardized data; cleaning the data after the index is constructed; desensitizing the cleaned data; and performing quality control on the desensitized data to obtain a governance result. The method and device provided by the embodiments of the invention solve the problems of heavy manual dependence, low efficiency and single-module governance in existing data governance; they enable orderly and effective governance of multi-source, heterogeneous and massive medical data without manual intervention, improve the efficiency and accuracy of medical work, reduce labor cost, and provide data support for subsequent utilization and mining of medical data.

Description

Data governance method, device and computer storage medium
Technical Field
The invention relates to the technical field of medical big data processing, and in particular to a data governance method, a data governance device and a computer storage medium.
Background
Medical informatization has become widespread in domestic medical institutions. It refers to the use of information technologies such as computers, databases and networks to collect, store, process, extract and exchange patient and management information between hospitals and between departments within a hospital, thereby improving the operating efficiency of the medical system.
Medical informatization has driven an unprecedented growth in the types and scale of medical data, and hospitals' demands on data have become increasingly diverse and frequent, for example data reporting, interconnection, in-hospital scientific research and clinical trials. However, the information systems still only support the hospital's original business operations; the data are not deeply interconnected and remain isolated islands at the semantic level, so they cannot be used directly as input to various analysis applications. This creates the need for "medical digitization".
The core of digitization is to turn data into a production factor, to exert the value of the data, and to find optimization points for operations and production and optimize the original business model on the basis of data analysis and mining; the first step is to make data usable as input for analysis.
Realizing medical digitization means, technically, improving the interoperability of medical systems and achieving data integration and exchange between different departments and even between different areas of a hospital, that is, forming interconnected and standardized medical big data, which is the foundation of medical digitization.
In current medical data governance practice, most data governance implementation methods and systems in the medical industry are characterized by manual governance, single-module governance and the like. It is therefore difficult to form a data governance model or implementation in which governance is internalized into the data processing itself, which leads to problems such as high labor input, governance effects that cannot be sustained, and difficulty in achieving automated data governance during data governance projects.
Disclosure of Invention
In view of this, the invention provides a data governance method, a data governance device and a computer storage medium, aiming to solve the problems that existing data governance relies heavily on manual work, is inefficient, and uses a single governance module that makes complex data difficult to process.
In a first aspect, an embodiment of the present invention provides a data governance method, where the method includes: acquiring initial medical data; performing data modeling and model conversion based on the initial medical data; standardizing the modeled and converted data; constructing a patient master index from the standardized data; cleaning the data after the index is constructed; desensitizing the cleaned data; and performing quality control on the desensitized data to obtain a governance result.
Further, the performing data modeling and model conversion based on the initial medical data includes: generating DDL statements based on the initial medical data and establishing database table fields to obtain a target data model; and converting the original data model into the target data model.
Further, the standardizing the modeled and converted data includes: performing standardized mapping on the modeled and converted data according to a preset standard.
Further, the constructing a patient master index from the standardized data includes: extracting patient basic information from the standardized data, associating the multiple business IDs of the same patient based on the patient basic information, numbering them uniformly, and generating a master index number.
Further, the cleaning the data after the index is constructed includes: cleaning the data after the index is constructed according to preset cleaning rules, and correcting data that do not conform to the rules during the cleaning process.
Further, the desensitizing the cleaned data includes: identifying, according to preset sensitive data characteristics, the sensitive information contained in the cleaned data by using a sensitive data information base and a word segmentation system, and desensitizing the sensitive information with a desensitization algorithm.
Further, the performing quality control on the desensitized data to obtain a governance result includes: checking and correcting the desensitized data according to preset quality control rules.
In a second aspect, an embodiment of the present invention further provides a data governance apparatus, where the apparatus includes: a data acquisition unit for acquiring initial medical data; a data modeling and model conversion unit for performing data modeling and model conversion based on the initial medical data; a data standardization unit for standardizing the modeled and converted data; an index construction unit for constructing a patient master index from the standardized data; a data cleaning unit for cleaning the data after the index is constructed; a data desensitization unit for desensitizing the cleaned data; and a data quality control unit for performing quality control on the desensitized data to obtain a governance result.
Further, the data modeling and model conversion unit is further configured to: generate DDL statements based on the initial medical data and establish database table fields to obtain a target data model; and convert the original data model into the target data model.
Further, the data standardization unit is further configured to: perform standardized mapping on the modeled and converted data according to a preset standard.
Further, the index construction unit is further configured to: extract patient basic information from the standardized data, associate the multiple business IDs of the same patient based on the patient basic information, number them uniformly, and generate a master index number.
Further, the data cleaning unit is further configured to: clean the data after the index is constructed according to preset cleaning rules, and correct data that do not conform to the rules during the cleaning process.
Further, the data desensitization unit is further configured to: identify, according to preset sensitive data characteristics, the sensitive information contained in the cleaned data by using a sensitive data information base and a word segmentation system, and desensitize the sensitive information with a desensitization algorithm.
Further, the data quality control unit is further configured to: check and correct the desensitized data according to preset quality control rules.
In a third aspect, an embodiment of the present invention further provides a computer-readable storage medium on which a computer program is stored, where the computer program, when executed by a processor, implements the methods provided in the foregoing embodiments.
According to the data governance method, device and computer storage medium provided by the embodiments of the invention, the initial medical data undergo data modeling, model conversion, standardization, patient master index construction, cleaning, desensitization and quality control. This solves the problems of heavy manual dependence, low efficiency and single-module governance in existing data governance, enables orderly and effective governance of multi-source, heterogeneous and massive medical data without manual intervention, improves the efficiency and accuracy of medical work, reduces labor cost, and provides data support for subsequent utilization and mining of medical data.
Drawings
FIG. 1 illustrates an exemplary flow diagram of a data governance method according to an embodiment of the present invention;
FIG. 2 illustrates a data modeling and model conversion presentation diagram according to one embodiment of the invention;
FIG. 3 is a schematic diagram of data modeling and model conversion according to another embodiment of the invention;
FIG. 4 illustrates a data normalization presentation diagram according to an embodiment of the invention;
FIG. 5 is a schematic diagram of a data cleansing functionality architecture according to an embodiment of the present invention;
FIG. 6 is a diagram illustrating preset data quality control rules according to an embodiment of the present invention;
FIG. 7 is a schematic diagram illustrating data quality control result presentation according to an embodiment of the present invention;
FIG. 8 is a data quality control report presentation diagram according to an embodiment of the present invention;
FIG. 9 shows a schematic structural diagram of a data governance device according to an embodiment of the present invention.
Detailed Description
Example embodiments of the present invention will now be described with reference to the accompanying drawings; however, the invention may be embodied in many different forms and should not be construed as limited to the embodiments set forth herein, which are provided so that this disclosure will be thorough and complete and will fully convey the scope of the invention to those skilled in the art. The terminology used in the exemplary embodiments illustrated in the accompanying drawings is not intended to be limiting of the invention. In the drawings, the same units/elements are denoted by the same reference numerals.
Unless otherwise defined, terms (including technical and scientific terms) used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this invention belongs. Further, it will be understood that terms, such as those defined in commonly used dictionaries, should be interpreted as having a meaning that is consistent with their meaning in the context of the relevant art and will not be interpreted in an idealized or overly formal sense.
FIG. 1 illustrates an exemplary flow diagram of a data governance method according to an embodiment of the present invention.
As shown in fig. 1, the method includes:
step S101: initial medical data is acquired.
The initial medical data may include patient basic information, medical history, electronic medical records, medical orders and medication information, laboratory tests, imaging and functional examinations, health questionnaire information, and the like.
Step S102: performing data modeling and model conversion based on the initial medical data.
Further, step S102 includes:
generating DDL statements based on the initial medical data and establishing database table fields to obtain a target data model;
converting the original data model into the target data model.
Data modeling may consist of defining the initial medical database, data tables and data fields; DDL statements are generated automatically, database table fields are created automatically, and a target data model is obtained. Model conversion then converts the data model of the original hospital information system into the target data model.
Fig. 2 and Fig. 3 each show a data modeling and model conversion presentation diagram according to an embodiment of the invention. As shown in Figs. 2 and 3, data modeling and model conversion are completed by writing SQL scripts according to the structures of the source model and the target model. A visual interface reduces the manual workload as far as possible: during modeling, a form can be filled in and the SQL statements are then generated automatically, as in the sketch below.
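As a non-limiting illustration, the following Python sketch shows how CREATE TABLE DDL could be generated from field definitions captured in a modeling form; the table and field names are hypothetical examples, not part of the disclosed embodiment.

```python
# Minimal sketch: generate a CREATE TABLE DDL statement from field
# definitions captured in a modeling form. Field names/types are examples.
from dataclasses import dataclass

@dataclass
class FieldDef:
    name: str          # target column name
    dtype: str         # SQL type, e.g. VARCHAR(64), DATE
    comment: str = ""  # business description shown in the modeling UI

def build_ddl(table: str, fields: list[FieldDef]) -> str:
    cols = ",\n".join(
        f"    {f.name} {f.dtype} COMMENT '{f.comment}'" for f in fields
    )
    return f"CREATE TABLE {table} (\n{cols}\n);"

# Example target model for a visit table (illustrative fields only)
ddl = build_ddl("ods_patient_visit", [
    FieldDef("empi_no", "VARCHAR(32)", "patient master index number"),
    FieldDef("visit_id", "VARCHAR(32)", "visit identifier"),
    FieldDef("visit_date", "DATE", "date of visit"),
    FieldDef("dept_name", "VARCHAR(64)", "department"),
])
print(ddl)
```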
Step S103: standardizing the modeled and converted data.
Further, step S103 includes:
and carrying out standardized mapping on the data after modeling and die conversion according to a preset standard.
The purpose of data standardization is to unify data from different sources into one reference system. The data standard definition refers to the ERUS and international standards, establishes the classification standard of codes and data elements, can establish detailed code standards and data element classification standards as preset standards according to specific service and data specification requirements, and provides consistency guarantee for storage, access and integration of data.
FIG. 4 shows a data normalization presentation diagram according to an embodiment of the invention. As shown in fig. 4, some important fields (such as basic information of patients, diagnoses, symptoms, medication, etc.) in the medical records are mapped in terms and standardization, and the details of the corresponding medical terms are given, and the information of the corresponding term base can be displayed. Synonyms or non-standard expressions in free text are accurately identified and normalized, and mapping is termed. And maintaining and managing the standardized terms as unified metadata.
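A minimal sketch of such term mapping is given below; the dictionary entries are illustrative placeholders rather than the actual platform term base.

```python
# Minimal sketch: map raw field values (synonyms, non-standard spellings)
# to standard terms via a term dictionary. Entries are illustrative only.
TERM_DICT = {
    "diagnosis": {
        "2型糖尿病": "Type 2 diabetes mellitus",
        "II型糖尿病": "Type 2 diabetes mellitus",
        "高血压病": "Essential hypertension",
    },
    "gender": {"男": "Male", "女": "Female", "M": "Male", "F": "Female"},
}

def standardize(field: str, raw_value: str) -> str:
    """Return the standard term; fall back to the raw value if unmapped."""
    mapping = TERM_DICT.get(field, {})
    return mapping.get(raw_value.strip(), raw_value)

assert standardize("gender", "男") == "Male"
assert standardize("diagnosis", "II型糖尿病") == "Type 2 diabetes mellitus"
```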
Through data standardization, term-based data exchange between the systems on the platform can be unified at the semantic level; important hospital terms and the like are maintained and managed as unified metadata, with an open access interface and an update notification interface. Data interaction is simplified, the coupling between systems is reduced, and a platform-level term (dictionary) standard is established to meet system integration requirements and to facilitate data mining, scientific research and historical data management.
Meanwhile, the data element standard also provides, for data quality control, maintenance of regular expressions for general rule dictionaries such as e-mail addresses, contact phone numbers, identity cards and passports. A unified standard specification is established and called directly by data quality control for data normativity verification, improving the quality of hospital data use and analysis.
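The following sketch illustrates, under assumed patterns, how such a regular-expression rule dictionary could be called for normativity verification; the patterns are simplified examples, not the rule base of the embodiment.

```python
# Minimal sketch: a regular-expression rule dictionary for general-purpose
# fields (e-mail, mobile phone, identity card), called by data quality
# control to verify data normativity. Patterns are illustrative only.
import re

RULE_DICT = {
    "email":   re.compile(r"[\w.+-]+@[\w-]+(\.[\w-]+)+"),
    "mobile":  re.compile(r"1\d{10}"),        # 11-digit mobile number
    "id_card": re.compile(r"\d{17}[\dXx]"),   # 18-digit resident ID
}

def check_format(rule: str, value: str) -> bool:
    pattern = RULE_DICT.get(rule)
    return bool(pattern and pattern.fullmatch(value))

assert check_format("mobile", "13800138000")
assert not check_format("id_card", "12345")
```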
Step S104: constructing a patient master index from the standardized data.
Further, step S104 includes:
and extracting the basic information of the patient from the standardized data, associating the multi-service IDs of the same patient based on the basic information of the patient, uniformly numbering the IDs and generating a main index number.
The Patient Master Index (EMPI) is a medical informatization terminology, and simply, it is a Patient basic information retrieval directory. Many patients in a medical institution do not form a uniform patient ID, so that data integration of multi-system and multi-diagnosis and treatment information of the patients cannot be realized. The main purpose of EMPI is to effectively link multiple medical information systems together through unique patient identification within a complex medical system. The system can realize interconnection and intercommunication among all systems of the hospital, and ensure the integrity and accuracy of personal information acquisition of the same patient distributed in different systems of the hospital. The establishment of the patient main index is a necessary condition for realizing the integration of internal systems of large hospitals and resource sharing.
The EMPI module collects basic information data of patients according to a hospital business system, and mainly judges that the patients with different patient IDs or hospitalized numbers are the same person through all or part of information such as original patient ID numbers, identity card numbers, passport numbers, driver license numbers, hospitalized numbers, clinic numbers, names, sexes, birth months, home addresses, mobile phone numbers, outpatient transfer records and the like, and carries out uniform numbering to generate main index numbers.
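A minimal sketch of such master-index generation is shown below; the matching key (identity card number, otherwise name, birth date and phone) is an assumption for illustration, whereas a production EMPI typically uses weighted matching over more attributes.

```python
# Minimal sketch: link records that carry different business IDs but belong
# to the same patient, then assign a unified master index (EMPI) number.
# The matching key used here is a simplifying assumption.
import uuid

def match_key(rec: dict) -> tuple:
    if rec.get("id_card"):
        return ("id_card", rec["id_card"])
    return ("demo", rec.get("name"), rec.get("birth_date"), rec.get("phone"))

def build_empi(records: list[dict]) -> dict[str, str]:
    """Return a mapping from each business ID to its master index number."""
    key_to_empi: dict[tuple, str] = {}
    business_to_empi: dict[str, str] = {}
    for rec in records:
        key = match_key(rec)
        empi = key_to_empi.setdefault(key, uuid.uuid4().hex[:16])
        business_to_empi[rec["business_id"]] = empi
    return business_to_empi

index = build_empi([
    {"business_id": "OUTP-001", "id_card": "110101199001011234", "name": "张三"},
    {"business_id": "INP-9087", "id_card": "110101199001011234", "name": "张三"},
])
assert index["OUTP-001"] == index["INP-9087"]
```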
Step S105: cleaning the data after the index is constructed.
Further, step S105 includes:
and according to a preset cleaning rule, cleaning the data after the index is constructed, and correcting the data which does not accord with the rule in the cleaning process.
Data cleaning is an indispensable link in the whole hospital data intelligent gateway construction process, and the result quality directly relates to the model effect and the final conclusion of all subsequent related researches. Data cleansing includes cleansing of data for integrity, consistency, legitimacy, correctness, etc., and converting into a unified standard according to certain rules. Such as: when the data contains multiple variables in different dimensions, the difference between the values may be large. Normalization scales the data in proportion to make the data fall into a small specific interval; unit limitation of the data is removed, and the data is converted into a dimensionless pure numerical value, so that indexes of different units or orders of magnitude can be compared and weighted conveniently. The cleaned data can be used for subsequent statistical analysis.
And the data cleaning is used for cleaning and processing the collected and gathered data and performing standardized arrangement. FIG. 5 is a diagram of a data cleansing architecture. As shown in fig. 5, the data cleaning mainly includes data cleaning flow planning, cleaning flow control, cleaning quality control, cleaning process management, and the like. Through the standard process and the rule base, the unified and configurable data conversion, cleaning, comparison, association, fusion and other processing processes are constructed based on the process engine, heterogeneous mass discrete data resources are processed and produced, and sharable data which are easy to analyze and utilize are generated.
The cleaning rule configuration mainly comprises a field cleaning rule, a regular expression cleaning rule and a complex logic cleaning rule.
The task of data cleaning is to filter data which do not accord with preset rules, confirm whether to filter or to be extracted after being corrected by a business unit.
Data cleansing requires consistency checks. The consistency check is to check whether the data is in accordance with the reasonable value range and the mutual relation of each variable, and find out the data which is out of the normal range, is logically unreasonable or is mutually contradictory.
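The sketch below illustrates consistency checks of this kind; the field names and thresholds are assumptions for illustration only.

```python
# Minimal sketch: consistency checks on a cleaned record, covering value
# ranges and cross-field logical checks. Fields/thresholds are illustrative.
from datetime import date

def consistency_errors(rec: dict) -> list[str]:
    errors = []
    age = rec.get("age")
    if age is not None and not (0 <= age <= 150):
        errors.append(f"age out of range: {age}")
    if rec.get("discharge_date") and rec.get("admission_date"):
        if rec["discharge_date"] < rec["admission_date"]:
            errors.append("discharge earlier than admission")
    if rec.get("gender") == "Male" and rec.get("diagnosis") == "Pregnancy":
        errors.append("diagnosis contradicts gender")
    return errors

rec = {"age": 208, "admission_date": date(2022, 3, 1),
       "discharge_date": date(2022, 2, 20)}
print(consistency_errors(rec))   # two violations reported
```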
The data that do not conform to the rules mainly include invalid values, missing values, incomplete data, erroneous data, duplicate data and the like. The specific correction methods for these data are as follows:
Invalid and missing values
Due to survey, coding and entry errors, there may be some invalid and missing values in the data that need appropriate treatment. Commonly used treatments are: estimation (imputation), casewise deletion, variable deletion and pairwise deletion.
Incomplete data
These data are mainly records missing information that should be present. They are filtered out and submitted to the client according to the missing content, with a request to supplement them, or to decide to delete them, within a specified time. Only after completion are they written into the data warehouse.
Erroneous data
Such errors arise because the business system is not robust enough: input is written directly into the back-end database without validation, for example numerical data entered as full-width numeric characters, character strings followed by a carriage return, incorrect date formats, out-of-range dates, and so on. Data cleaning unifies these into a standard format (a sketch of such normalization appears after this list).
Duplicate data
For this type of data, which appears particularly often in dimension tables, all fields of the duplicate records are exported for the client to confirm and collate.
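The following sketch illustrates the corrections listed above for erroneous and duplicate data; the formats handled are examples, not an exhaustive rule set.

```python
# Minimal sketch of the corrections listed above: convert full-width digits
# to half-width, strip stray carriage returns, normalize date strings, and
# drop exact duplicate records. Formats handled are examples only.
from datetime import datetime

FULLWIDTH = str.maketrans("０１２３４５６７８９", "0123456789")

def clean_value(value: str) -> str:
    return value.translate(FULLWIDTH).replace("\r", "").strip()

def normalize_date(value: str) -> str:
    for fmt in ("%Y-%m-%d", "%Y/%m/%d", "%Y%m%d"):
        try:
            return datetime.strptime(clean_value(value), fmt).strftime("%Y-%m-%d")
        except ValueError:
            continue
    raise ValueError(f"unrecognized date format: {value!r}")

def drop_duplicates(rows: list[tuple]) -> list[tuple]:
    seen, result = set(), []
    for row in rows:
        if row not in seen:
            seen.add(row)
            result.append(row)
    return result

assert clean_value("１２３\r") == "123"
assert normalize_date("2022/07/15") == "2022-07-15"
```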
Data cleaning is an iterative process that cannot be completed in a short time; problems can only be continuously discovered and resolved. Whether to filter and whether to correct generally require user confirmation. Filtered-out data are written to an Excel file or to a data table so that errors can be corrected as soon as possible, and they can also serve as a basis for future data verification.
Step S106: desensitizing the cleaned data.
Further, step S106 includes:
according to the preset sensitive data characteristics, the sensitive information contained in the cleaned data is identified by using a sensitive data information base and a word segmentation system, and desensitization is carried out on the sensitive information by adopting a desensitization algorithm.
The data desensitization processing refers to deleting or fuzzifying the privacy and sensitive information of the patient, so as to avoid exposing the privacy information.
The procedure for data desensitization is generally divided into: the method comprises four steps of sensitive data discovery, sensitive data combing, desensitization scheme formulation and desensitization task execution, and the optimal data desensitization effect is achieved by combining a data desensitization algorithm, a data desensitization rule and a desensitization environment.
Desensitization data is generally mainly discovered automatically, and discovery and definition of sensitive data are completed by combining manual discovery and auditing, so that a perfect sensitive data dictionary is formed finally. For relatively fixed business data of a hospital, manual discrimination can be adopted, data of data columns of data base tables are definitely specified to be desensitized, the general data structure and the data length of the data do not change, and most of the data are numerical characters and fixed-length characters. Such as: identification columns such as patient names, identity card numbers, contact ways, home addresses and the like can be used for manually appointing desensitization rules and different data access strategies aiming at the data, so that sensitive information is prevented from being leaked. The automatic identification can automatically identify the sensitive information contained in the database by means of the sensitive data information base and the word segmentation system according to the characteristics of the sensitive data specified or predefined manually, and compared with the manual identification, the automatic identification can reduce the workload and prevent omission.
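A minimal identification sketch is given below; the keyword list and patterns are illustrative assumptions, and the word-segmentation component is omitted.

```python
# Minimal sketch: identify sensitive information in free text using a small
# sensitive-term base plus regular expressions for structured identifiers
# (identity-card and phone numbers). Terms and patterns are illustrative.
import re

SENSITIVE_TERMS = {"身份证", "家庭住址", "联系电话"}   # keyword hints
PATTERNS = {
    "id_card": re.compile(r"\b\d{17}[\dXx]\b"),
    "phone":   re.compile(r"\b1\d{10}\b"),
}

def find_sensitive(text: str) -> list[tuple[str, str]]:
    hits = [("keyword", t) for t in SENSITIVE_TERMS if t in text]
    for label, pat in PATTERNS.items():
        hits += [(label, m.group()) for m in pat.finditer(text)]
    return hits

sample = "患者联系电话 13800138000，身份证号 110101199001011234"
print(find_sensitive(sample))
```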
Desensitization rules are generally classified as recoverable and non-recoverable. Recoverable means that the desensitized data can be restored to the original sensitive data in some way; such rules mainly refer to various encryption and decryption algorithm rules. Non-recoverable means that the desensitized portion of the data cannot be recovered by any means.
For different desensitization requirements, dedicated desensitization policies can be configured on the basis of the basic desensitization algorithms. The desensitization scheme is formulated mainly by reusing desensitization policies and desensitization algorithms, and an optimal scheme is formulated by configuring and extending the encryption and decryption algorithms. According to the formulated scheme, after the desensitization scope and the specific desensitization operations are confirmed, the data desensitization process is executed.
Different desensitization algorithms are selected according to the data characteristics to desensitize common sensitive data such as names, certificate numbers, addresses, telephone numbers and home addresses. The desensitization algorithms generally include masking, deformation, substitution, randomization, format-preserving encryption (FPE) and strong encryption algorithms (such as AES).
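The sketch below shows two of the non-recoverable rules named above, masking and replacement via a one-way hash pseudonym; the parameters and salt are illustrative, and recoverable FPE/AES rules are not shown.

```python
# Minimal sketch of two non-recoverable desensitization rules: masking
# (keep head/tail, hide the middle) and replacement by a hash pseudonym.
import hashlib

def mask(value: str, keep_head: int = 3, keep_tail: int = 4) -> str:
    if len(value) <= keep_head + keep_tail:
        return "*" * len(value)
    hidden = len(value) - keep_head - keep_tail
    return value[:keep_head] + "*" * hidden + value[-keep_tail:]

def pseudonym(value: str, salt: str = "demo-salt") -> str:
    """Replace a value with a stable, non-reversible pseudonym."""
    return hashlib.sha256((salt + value).encode("utf-8")).hexdigest()[:12]

assert mask("13800138000") == "138****8000"
assert mask("张三") == "**"
print(pseudonym("110101199001011234"))
```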
According to the data processing mode, data desensitization can be divided into two main categories: static data desensitization and dynamic data desensitization. Static data desensitization desensitizes and de-identifies data files while preserving the association relationships among the data. Dynamic data desensitization desensitizes the sensitive data in the back-end database at the moment the front-end application calls them, before feeding them back to the front end for presentation.
In particular, the present invention adopts a desensitization scheme based on HIPAA, which is divided into five Titles, each addressing one particular problem of medical insurance reform. Title II addresses administrative simplification and focuses on the security of medical-related information: HIPAA establishes a set of security standards for receiving, transmitting and maintaining medical information and for ensuring privacy and personal identity information, with Protected Health Information (PHI) at the core of its privacy and security protection. To ensure that the patient's private medical data are not disclosed, desensitization of the patient-related private data is completed first (in static desensitization mode). Table 1 shows a data desensitization scheme according to an embodiment of the invention. As shown in Table 1, field correspondence analysis and the corresponding desensitization recommendations are collated item by item, combining the existing field definitions and actual usage and referring to the 18 items of the HIPAA definition of PHI.
TABLE 1 HIPAA desensitization protocol
Step S107: performing quality control on the desensitized data to obtain a governance result.
Further, step S107 includes:
and checking and correcting the desensitized data according to a preset quality control rule.
Presetting a data quality control rule according to requirements, and automatically identifying, checking and correcting data according to the preset quality control rule. Table 2 shows preset quality control rules according to an embodiment of the present invention. Fig. 6, fig. 7 and fig. 8 are schematic diagrams respectively showing a data preset quality control rule, a quality control result and a quality control report according to an embodiment of the present invention.
TABLE 2 Preset quality control rules
The system uses big data technology combined with a medical information rule engine to establish a multi-dimensional medical data quality control system and performs highly automated data processing and extraction. It can effectively help users solve problems such as out-of-control data acquisition progress, uneven quality, insufficient and disordered data, and non-uniform data standards, ensures the integrity, consistency and accuracy of the data processing pipeline, and improves the accuracy and practicability of data analysis and utilization.
The system builds an automated data quality control framework, opens up the data quality inspection capability, allows the medical institution to customize quality control rules, and supports automatic identification as well as manual checking and correction of data. It provides strong error-correction capability: various clinical quality control rules can be configured to automatically calibrate medical data, supplemented by manual sampling inspection of paper medical records, so that the governance process and its results are correct. This provides the broadest possible data basis for disease diagnosis and treatment and forms high-quality big data. Data quality inspection is a closed-loop, continuously optimized process. Combined with the hospital's data management and monitoring requirements, a closed loop for improving the native data and a closed loop for improving the data cleaning capability are formed, ensuring continuous improvement of data quality. The distribution and dynamic changes of the data are tracked, data quality is improved, and the data are then comprehensively aligned with business requirements and standardized to form a unified business view, keeping the data secure and controllable in every usage link. Data processing rules are set for the whole process, the processing status is continuously monitored, data quality problems are found in time, and the relevant personnel are notified by alarms to handle them promptly.
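As an illustration, the sketch below applies a few preset quality control rules to a governed record and collects the violations for a report; the rules are simplified stand-ins for the configurable rule base.

```python
# Minimal sketch: apply preset quality-control rules (completeness, format,
# range) to a governed record and collect problems for a QC report.
import re

QC_RULES = [
    ("empi_no required", lambda r: bool(r.get("empi_no"))),
    ("visit_date format", lambda r: bool(
        re.fullmatch(r"\d{4}-\d{2}-\d{2}", r.get("visit_date", "")))),
    ("age range 0-150", lambda r: 0 <= r.get("age", -1) <= 150),
]

def run_quality_control(record: dict) -> list[str]:
    """Return the names of all rules the record violates."""
    return [name for name, check in QC_RULES if not check(record)]

issues = run_quality_control(
    {"empi_no": "E0001", "visit_date": "2022/07/15", "age": 37})
print(issues)   # ['visit_date format'] is reported for correction
```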
According to this embodiment, by performing data modeling, model conversion, standardization, patient master index construction, cleaning, desensitization and quality control on the initial medical data, the problems of heavy manual dependence, low efficiency and single-module governance in existing data governance are solved; orderly and effective governance of multi-source, heterogeneous and massive medical data can be achieved without manual intervention, the efficiency and accuracy of medical work are improved, labor cost is reduced, and data support is provided for subsequent utilization and mining of medical data.
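As a purely illustrative summary, the sketch below chains the seven steps into one pipeline; the step functions are trivial placeholders whose names and signatures are assumptions, with the real logic residing in the corresponding modules described above.

```python
# Minimal sketch: chaining the governance steps into one pipeline. Each step
# is a placeholder standing in for the corresponding module of the embodiment.
from typing import Callable

Step = Callable[[list[dict]], list[dict]]

def placeholder_step(name: str) -> Step:
    def step(records: list[dict]) -> list[dict]:
        print(f"running step: {name} on {len(records)} records")
        return records          # real logic lives in the corresponding unit
    return step

PIPELINE: list[Step] = [
    placeholder_step("data modeling and model conversion"),
    placeholder_step("standardization"),
    placeholder_step("patient master index construction"),
    placeholder_step("data cleaning"),
    placeholder_step("data desensitization"),
    placeholder_step("data quality control"),
]

def govern(initial_medical_data: list[dict]) -> list[dict]:
    data = initial_medical_data
    for step in PIPELINE:
        data = step(data)
    return data

governed = govern([{"patient_id": "P001"}, {"patient_id": "P002"}])
```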
FIG. 9 shows a schematic structural diagram of a data governance device according to an embodiment of the present invention.
As shown in fig. 9, the apparatus includes:
a data acquisition unit 901 for acquiring initial medical data.
The initial medical data may include patient basic information, medical history, electronic medical records, medical orders and medication information, laboratory tests, imaging and functional examinations, health questionnaire information, and the like.
A data modeling and model conversion unit 902, configured to perform data modeling and model conversion based on the initial medical data.
Further, the data modeling and model conversion unit 902 is further configured to:
generate DDL statements based on the initial medical data and establish database table fields to obtain a target data model;
convert the original data model into the target data model.
Data modeling may consist of defining the initial medical database, data tables and data fields; DDL statements are generated automatically, database table fields are created automatically, and a target data model is obtained. Model conversion then converts the data model of the original hospital information system into the target data model.
Fig. 2 and Fig. 3 each show a data modeling and model conversion presentation diagram according to an embodiment of the invention. As shown in Figs. 2 and 3, data modeling and model conversion are completed by writing SQL scripts according to the structures of the source model and the target model. A visual interface reduces the manual workload as far as possible: during modeling, a form can be filled in and the SQL statements are then generated automatically.
A data standardization unit 903, configured to standardize the modeled and converted data.
Further, the data standardization unit 903 is further configured to:
perform standardized mapping on the modeled and converted data according to a preset standard.
The purpose of data standardization is to unify data from different sources into one reference system. The data standard definitions refer to EROE and international standards and establish classification standards for codes and data elements; detailed code standards and data element classification standards can be formulated as the preset standards according to specific business and data specification requirements, providing a consistency guarantee for the storage, access and integration of the data.
FIG. 4 shows a data standardization presentation diagram according to an embodiment of the invention. As shown in Fig. 4, important fields in the medical records (such as patient basic information, diagnoses, symptoms and medication) are term-mapped and standardized, the details of the corresponding medical terms are given, and the information of the corresponding term base can be displayed. Synonyms and non-standard expressions in free text are accurately identified, normalized and mapped to terms. The standardized terms are maintained and managed as unified metadata.
Through data standardization, term-based data exchange between the systems on the platform can be unified at the semantic level; important hospital terms and the like are maintained and managed as unified metadata, with an open access interface and an update notification interface. Data interaction is simplified, the coupling between systems is reduced, and a platform-level term (dictionary) standard is established to meet system integration requirements and to facilitate the platform's future data mining, scientific research and historical data management.
Meanwhile, the data element standard also provides, for data quality control, maintenance of regular expressions for general rule dictionaries such as e-mail addresses, contact phone numbers, identity cards and passports. A unified standard specification is established and called directly by data quality control for data normativity verification, improving the quality of hospital data use and analysis.
An index construction unit 904, configured to construct a patient master index from the standardized data.
Further, the index construction unit 904 is further configured to:
extract patient basic information from the standardized data, associate the multiple business IDs of the same patient based on the patient basic information, number them uniformly, and generate a master index number.
The Patient Master Index (EMPI) is a medical informatization term; simply put, it is a retrieval directory of patient basic information. In a medical institution, many patients do not have a unified patient ID, so the multi-system, multi-encounter information of a patient cannot be integrated. The main purpose of the EMPI is to effectively link multiple medical information systems together through a unique patient identifier within a complex medical system. It enables interconnection between the hospital's systems and ensures the integrity and accuracy of the personal information of the same patient distributed across different hospital systems. Establishing a patient master index is a necessary condition for system integration and resource sharing inside a large hospital.
The EMPI module collects patient basic information from the hospital business systems. It judges that patients with different patient IDs or inpatient numbers are the same person, mainly through all or part of information such as the original patient ID number, identity card number, passport number, driver's license number, inpatient number, outpatient number, name, sex, date of birth, home address, mobile phone number and outpatient-to-inpatient transfer records, numbers them uniformly, and generates a master index number.
A data cleaning unit 905, configured to clean the data after the index is constructed.
Further, the data cleaning unit 905 is further configured to:
clean the data after the index is constructed according to preset cleaning rules, and correct data that do not conform to the rules during the cleaning process.
Data cleaning is an indispensable link in the construction of the hospital's intelligent data gateway, and the quality of its results directly affects the model performance and final conclusions of all subsequent related research. Data cleaning covers integrity, consistency, legitimacy and correctness, converting the data into a unified standard according to certain rules. For example, when the data contain multiple variables in different dimensions, the differences between their values may be large; normalization scales the data proportionally so that they fall into a small specific interval, and removes the unit restriction of the data, converting them into dimensionless pure numerical values so that indicators of different units or orders of magnitude can be compared and weighted conveniently. The cleaned data can then be used for subsequent statistical analysis.
Data cleaning processes the collected and aggregated data and arranges it in a standardized manner. FIG. 5 is a schematic diagram of the data cleaning functional architecture according to an embodiment of the present invention. As shown in Fig. 5, data cleaning mainly includes cleaning flow planning, cleaning flow control, cleaning quality control and cleaning process management. Through the standard process and the rule base, unified and configurable processing flows for data conversion, cleaning, comparison, association, fusion and the like are built on a process engine; heterogeneous, massive and discrete data resources are processed to produce sharable data that are easy to analyze and use.
The cleaning rule configuration mainly comprises field cleaning rules, regular-expression cleaning rules and complex logic cleaning rules.
The task of data cleaning is to filter out data that do not conform to the preset rules and to confirm with the business unit whether such data should be discarded or corrected and then re-extracted.
Data cleaning requires consistency checks. A consistency check verifies whether the data conform to the reasonable value range and interrelations of each variable, and finds data that are out of the normal range, logically unreasonable or mutually contradictory.
The data that do not conform to the rules mainly include invalid values, missing values, incomplete data, erroneous data, duplicate data and the like. The specific correction methods for these data are as follows:
Invalid and missing values
Due to survey, coding and entry errors, there may be some invalid and missing values in the data that need appropriate treatment. Commonly used treatments are: estimation (imputation), casewise deletion, variable deletion and pairwise deletion.
Incomplete data
These data are mainly records missing information that should be present. They are filtered out and submitted to the client according to the missing content, with a request to supplement them, or to decide to delete them, within a specified time. Only after completion are they written into the data warehouse.
Erroneous data
Such errors arise because the business system is not robust enough: input is written directly into the back-end database without validation, for example numerical data entered as full-width numeric characters, character strings followed by a carriage return, incorrect date formats, out-of-range dates, and so on. Data cleaning unifies these into a standard format.
Duplicate data
For this type of data, which appears particularly often in dimension tables, all fields of the duplicate records are exported for the client to confirm and collate.
It should be understood that data cleaning is an iterative process that cannot be completed in a short time; problems can only be continuously discovered and resolved. Whether to filter and whether to correct generally require user confirmation. Filtered-out data are written to an Excel file or to a data table so that errors can be corrected as soon as possible, and they can also serve as a basis for future data verification.
A data desensitization unit 906, configured to desensitize the cleaned data.
Further, the data desensitization unit 906 is further configured to:
identify, according to preset sensitive data characteristics, the sensitive information contained in the cleaned data by using a sensitive data information base and a word segmentation system, and desensitize the sensitive information with a desensitization algorithm.
Data desensitization refers to deleting or obfuscating the patient's private and sensitive information so as to avoid exposing it.
The data desensitization process is generally divided into four steps: sensitive data discovery, sensitive data sorting, desensitization scheme formulation and desensitization task execution. The best desensitization effect is achieved by combining the desensitization algorithm, the desensitization rules and the desensitization environment.
Sensitive data are generally discovered mainly automatically, with manual discovery and auditing combined to complete the discovery and definition of sensitive data, finally forming a complete sensitive data dictionary. For relatively fixed hospital business data, manual screening can be adopted: the data columns of specific database tables are explicitly designated for desensitization; such data generally have an unchanged structure and length and are mostly numeric or fixed-length characters. For example, identifying columns such as patient name, identity card number, contact information and home address can be given manually specified desensitization rules and different data access policies to prevent sensitive information from being leaked. Automatic identification recognizes the sensitive information contained in the database by means of the sensitive data information base and the word segmentation system, according to manually specified or predefined sensitive data characteristics; compared with manual identification, it reduces the workload and prevents omission. On the basis of sensitive data discovery, the sensitive data columns and their relationships are adjusted to preserve the association relationships of the data. Data masking and scrambling are applied to different data types through desensitization algorithms such as masking, deformation, substitution, randomization, format-preserving encryption and strong encryption.
For different desensitization requirements, dedicated desensitization policies can be configured on the basis of the basic desensitization algorithms. The desensitization scheme is formulated mainly by reusing desensitization policies and desensitization algorithms, and an optimal scheme is formulated by configuring and extending the encryption and decryption algorithms.
According to the formulated desensitization scheme, after the desensitization scope and the specific desensitization operations are confirmed, the data desensitization process is executed.
Different desensitization algorithms are selected according to the data characteristics to desensitize common sensitive data such as names, certificate numbers, addresses, telephone numbers and home addresses. The desensitization algorithms generally include masking, deformation, substitution, randomization, format-preserving encryption (FPE) and strong encryption algorithms (such as AES).
Desensitization rules are generally classified as recoverable and non-recoverable. Recoverable means that the desensitized data can be restored to the original sensitive data in some way; such rules mainly refer to various encryption and decryption algorithm rules. Non-recoverable means that the desensitized portion of the data cannot be recovered by any means.
According to the data processing mode, data desensitization can be divided into two main categories: static data desensitization and dynamic data desensitization. Static data desensitization desensitizes and de-identifies data files while preserving the association relationships among the data. Dynamic data desensitization desensitizes the sensitive data in the back-end database at the moment the front-end application calls them, before feeding them back to the front end for presentation.
Specifically, the desensitization scheme of HIPAA is adopted. HIPAA is divided into five Titles, each addressing one particular problem of medical insurance reform. Title II addresses administrative simplification and focuses on the security of medical-related information: HIPAA establishes a set of security standards for receiving, transmitting and maintaining medical information and for ensuring privacy and personal identity information, with Protected Health Information (PHI) at the core of its privacy and security protection. To ensure that the patient's private medical data are not disclosed, desensitization of the patient-related private data is completed first (in static desensitization mode). Table 3 shows a data desensitization scheme according to an embodiment of the present invention. As shown in Table 3, field correspondence analysis and the corresponding desensitization recommendations are collated item by item, combining the existing field definitions and actual usage and referring to the 18 items of the HIPAA definition of PHI.
TABLE 3 HIPAA desensitization protocol
A data quality control unit 907, configured to perform quality control on the desensitized data to obtain a governance result.
Further, the data quality control unit 907 is further configured to:
check and correct the desensitized data according to preset quality control rules.
Data quality control rules are preset according to requirements, and the data are automatically identified, checked and corrected according to the preset rules. Table 4 shows preset quality control rules according to an embodiment of the present invention. Figs. 6, 7 and 8 are schematic diagrams respectively showing preset data quality control rules, a quality control result and a quality control report according to an embodiment of the present invention.
TABLE 4 Preset quality control rules
The system uses big data technology combined with a medical information rule engine to establish a multi-dimensional medical data quality control system and performs highly automated data processing and extraction. It can effectively help users solve problems such as out-of-control data acquisition progress, uneven quality, insufficient and disordered data, and non-uniform data standards, ensures the integrity, consistency and accuracy of the data processing pipeline, and improves the accuracy and practicability of data analysis and utilization.
The system builds an automated data quality control framework, opens up the data quality inspection capability, allows the medical institution to customize quality control rules, and supports automatic identification as well as manual checking and correction of data. It provides strong error-correction capability: various clinical quality control rules can be configured to automatically calibrate medical data, supplemented by manual sampling inspection of paper medical records, so that the governance process and its results are correct. This provides the broadest possible data basis for disease diagnosis and treatment and forms high-quality big data. Data quality inspection is a closed-loop, continuously optimized process. Combined with the hospital's data management and monitoring requirements, a closed loop for improving the native data and a closed loop for improving the data cleaning capability are formed, ensuring continuous improvement of data quality. The distribution and dynamic changes of the data are tracked, data quality is improved, and the data are then comprehensively aligned with business requirements and standardized to form a unified business view, keeping the data secure and controllable in every usage link. Data processing rules are set for the whole process, the processing status is continuously monitored, data quality problems are found in time, and the relevant personnel are notified by alarms to handle them promptly.
According to this embodiment, by performing data modeling, model conversion, standardization, patient master index construction, cleaning, desensitization and quality control on the initial medical data, the problems of heavy manual dependence, low efficiency and single-module governance in existing data governance are solved; orderly and effective governance of multi-source, heterogeneous and massive medical data can be achieved without manual intervention, the efficiency and accuracy of medical work are improved, labor cost is reduced, and data support is provided for subsequent utilization and mining of medical data.
An embodiment of the present invention also provides a computer-readable storage medium on which a computer program is stored; when the computer program is executed by a processor, the data governance method provided by the above embodiments is implemented.
The invention has been described with reference to a few embodiments. However, as will be recognized by those skilled in the art, embodiments other than the ones disclosed above are equally possible within the scope of the appended claims.
Generally, all terms used in the claims are to be interpreted according to their ordinary meaning in the technical field, unless explicitly defined otherwise herein. All references to "a/an/the [means, component, etc.]" are to be interpreted openly as referring to at least one instance of said means, component, etc., unless explicitly stated otherwise. The steps of any method disclosed herein do not have to be performed in the exact order disclosed, unless explicitly stated.
As will be appreciated by one skilled in the art, embodiments of the present invention may be provided as a method, system, or computer program product. Accordingly, the present invention may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, the present invention may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein.
The present invention is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems), and computer program products according to embodiments of the invention. It will be understood that each flow and/or block of the flowchart illustrations and/or block diagrams, and combinations of flows and/or blocks in the flowchart illustrations and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
Finally, it should be noted that the above embodiments are only intended to illustrate the technical solutions of the present invention, not to limit them. Although the present invention has been described in detail with reference to the above embodiments, those of ordinary skill in the art should understand that modifications and equivalents may be made to the embodiments without departing from the spirit and scope of the invention, which are intended to be covered by the claims.

Claims (10)

1. A data governance method, comprising:
acquiring initial medical data;
performing data modeling and template conversion based on the initial medical data;
standardizing the data after modeling and template conversion;
constructing a patient main index according to the standardized data;
cleaning the data after the index is constructed;
performing desensitization processing on the cleaned data;
and performing quality control on the desensitized data to obtain a governance result.
2. The method of claim 1, wherein the performing data modeling and template conversion based on the initial medical data comprises:
generating a DDL statement based on the initial medical data, and establishing base table fields to obtain a target data model;
and converting the original data model into the target data model.
3. The method of claim 1, wherein the standardizing the data after modeling and template conversion comprises:
performing standardized mapping on the data after modeling and template conversion according to a preset standard.
4. The method of claim 1, wherein the constructing a patient main index according to the standardized data comprises:
extracting patient basic information from the standardized data, associating the multiple service IDs of the same patient based on the patient basic information, numbering the IDs uniformly, and generating a main index number.
5. The method of claim 1, wherein the cleaning the data after the index is constructed comprises:
cleaning the data after the index is constructed according to preset cleaning rules, and correcting data that do not conform to the rules during the cleaning process.
6. The method of claim 1, wherein the performing desensitization processing on the cleaned data comprises:
identifying sensitive information contained in the cleaned data by using a sensitive data information base and a word segmentation system according to preset sensitive data characteristics, and desensitizing the sensitive information by using a desensitization algorithm.
7. The method of claim 6, wherein the performing quality control on the desensitized data to obtain a governance result comprises:
checking and correcting the desensitized data according to preset quality control rules.
8. A data governance device, the device comprising:
a data acquisition unit for acquiring initial medical data;
a data modeling and template conversion unit for performing data modeling and template conversion based on the initial medical data;
a data standardization unit for standardizing the data after modeling and template conversion;
an index construction unit for constructing a patient main index according to the standardized data;
a data cleaning unit for cleaning the data after the index is constructed;
a data desensitization unit for performing desensitization processing on the cleaned data;
and a data quality control unit for performing quality control on the desensitized data to obtain a governance result.
9. The device of claim 8, wherein the data modeling and template conversion unit is further configured to:
generate a DDL statement based on the initial medical data, and establish base table fields to obtain a target data model;
and convert the original data model into the target data model.
10. A computer-readable storage medium on which a computer program is stored, wherein the computer program, when executed by a processor, implements the method of any one of claims 1 to 7.
CN202210836255.6A 2022-07-15 2022-07-15 Data governance method, device and computer storage medium Pending CN115391332A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210836255.6A CN115391332A (en) 2022-07-15 2022-07-15 Data governance method, device and computer storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210836255.6A CN115391332A (en) 2022-07-15 2022-07-15 Data governance method, device and computer storage medium

Publications (1)

Publication Number Publication Date
CN115391332A true CN115391332A (en) 2022-11-25

Family

ID=84117698

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210836255.6A Pending CN115391332A (en) 2022-07-15 2022-07-15 Data governance method, device and computer storage medium

Country Status (1)

Country Link
CN (1) CN115391332A (en)

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116705270A (en) * 2023-08-07 2023-09-05 北方健康医疗大数据科技有限公司 Medical data management system, method and storage medium
CN116932515A (en) * 2023-08-01 2023-10-24 北京健康在线技术开发有限公司 Data management method, device, equipment and medium for realizing data decoupling of production system
CN118012860A (en) * 2024-04-08 2024-05-10 北方健康医疗大数据科技有限公司 Dolphinscheduler-based automated data management system, dolphinscheduler-based automated data management method and medium

Similar Documents

Publication Publication Date Title
CN115391332A (en) Data governance method, device and computer storage medium
EP1994484B1 (en) Platform for interoperable healthcare data exchange
JP5401037B2 (en) A method of linking unidentified patient records using encrypted and unencrypted demographic information and healthcare information from multiple data sources.
US7996245B2 (en) Patient-centric healthcare information maintenance
US12080404B2 (en) Methods, systems and computer program products for retrospective data mining
US8554577B2 (en) Electronic medical records information system
US8364651B2 (en) Apparatus, system, and method for identifying redundancy and consolidation opportunities in databases and application systems
US20090150451A1 (en) Method and system for selective merging of patient data
CN106933859B (en) Medical data migration method and device
EP2909803A1 (en) Systems and methods for medical information analysis with deidentification and reidentification
CN108986873A (en) A kind of retrospective diagnosis and treatment data processing method and system
CN115396260A (en) Intelligent medical data gateway system
Allam Research on intelligent medical big data system based on Hadoop and blockchain
US20090150438A1 (en) Export file format with manifest for enhanced data transfer
Heinis et al. Data infrastructure for medical research
CN115391315A (en) Data cleaning method and device
WO2021202491A1 (en) Methods, systems and computer program products for retrospective data mining
Vito et al. An interoperable common storage system for shared dialysis clinical data
Yee et al. Big data: Its implications on healthcare and future steps
WO2020135951A2 (en) Secure recruitment systems and methods
Vardaki et al. A statistical metadata model for clinical trials’ data management
JPWO2014128759A1 (en) Information system and updating method thereof
Kanyongo et al. Data Wrangling and Generation for Machine Learning Models in Medication Adherence Analytics: A practical Standpoint using Patient-Level and Medical Claims Data
Madlock-Brown et al. Knowledge management for health data analytics
CN111859448A (en) Data export auditing method, system and terminal based on role authority setting

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination