CN115132366A - Multi-source data processing method and system based on health and medical big data standard library

Info

Publication number: CN115132366A
Application number: CN202210758887.5A
Authority: CN
Inventors: 殷晋; 洪磊
Original assignee: University of Electronic Science and Technology of China
Current assignee: University of Electronic Science and Technology of China
Priority date: 2022-06-29
Filing date: 2022-06-29
Publication date: 2022-09-30

Abstract

The invention relates to a multi-source data processing method and system based on a health and medical big data standard library. The method comprises the following steps: acquiring medical health data from each service platform and establishing a medical health database; extracting metadata from the medical health database and converting the storage format of the metadata; classifying and storing the metadata according to its application category, and establishing a metadata standard library based on the Common Warehouse Metamodel (CWM); parsing and fusing the metadata, managing the metadata life cycle, managing metadata changes and standardizing the metadata; and performing data quality management, metadata standard management and metadata knowledge graph construction on the metadata. The method can automatically apply unified conversion and data standardization to data sources from different service platforms, establish a metadata standard library based on the CWM meta-model, standardize the processing of metadata objects, and facilitate the query and maintenance of data; fusing and managing the data also makes it easier to obtain valuable data.

Description

Multi-source data processing method and system based on health and medical big data standard library

Technical Field

The invention belongs to the technical field of big data processing, and particularly relates to a multisource data processing method based on a health and medical big data standard library.

Background

Big data analysis techniques continue to develop and provide better conditions for discovering knowledge in large volumes of medical health data, in particular for processing large amounts of patient data in medical information and identifying clusters and correlations within the data.

At present, however, medical institutions in different regions usually use many different systems to process various kinds of medical health data, so large amounts of data are handled by separate information systems. Because the data are heterogeneous, once data from these information systems are mixed, problems arise such as difficult data integration and blurred boundaries of data responsibility, and valuable data are hard to obtain.

Disclosure of Invention

To solve the above technical problems, the invention provides a multi-source data processing method and system based on a health and medical big data standard library.

The technical scheme for solving the technical problems is as follows:

In a first aspect, the invention provides a multi-source data processing method based on a health and medical big data standard library, which comprises the following steps:

acquiring medical health data of each service platform, and establishing a medical health database;

extracting metadata in the medical health database, and converting the storage format of the metadata;

classifying and storing the metadata according to the application types of the metadata, and establishing a metadata standard library based on the CWM;

establishing a metadata function component, and analyzing and fusing metadata, managing metadata life cycle, managing metadata change and standardizing the metadata by using the metadata function component;

and performing data quality management, metadata standard management and metadata knowledge graph construction on the metadata.

In a second aspect, the present invention provides a multisource data processing system based on a health and medical big data standard library, comprising: the system comprises a metadata acquisition layer, a metadata storage layer, a metadata analysis layer and a metadata application layer;

the metadata acquisition layer comprises a database acquisition unit, a data extraction unit and a database analysis and verification unit;

the database acquisition unit is used for acquiring the medical health data of each service platform and establishing a medical health database;

the data extraction unit is used for establishing a plurality of parallel data extraction processes to perform ETL extraction on the medical health data to obtain metadata, and for cleaning the metadata and converting its storage format;

the database analysis and verification unit is used for acquiring the health database logs, analyzing and verifying the health database logs and sending the verified metadata to the metadata storage layer;

the metadata storage layer includes: the system comprises a standard library construction unit, a warehousing conversion unit, a classification unit and a data warehouse unit;

the standard library building unit is used for building a metadata standard library based on the CWM;

the warehousing conversion unit is used for processing the metadata by using the Common Warehouse Metamodel (CWM) and converting the storage format of the metadata;

a classification unit for classifying the metadata according to an application category of the metadata;

the data warehouse unit is used for describing the classified metadata in an XML form and importing the metadata into the data warehouse through a metadata access interface of the development data warehouse;

the metadata analysis layer comprises an analysis and fusion component, a metadata life cycle management component, a metadata change management component and a metadata standardization processing component, and is used for analyzing and fusing metadata in the data warehouse, managing the metadata life cycle, managing the metadata change and standardizing the metadata;

the metadata application layer comprises a metadata quality management unit, a metadata standard management unit and a metadata knowledge graph construction unit and is used for performing data quality management and metadata standard management on metadata and constructing a metadata knowledge graph.

The invention has the following beneficial effects: the method can automatically collect data sources from different service platforms, uniformly convert and standardize the heterogeneous data, establish a metadata standard library based on the Common Warehouse Metamodel (CWM), standardize the processing of metadata objects, and facilitate the query and maintenance of the data; fusing and managing the data also makes it easier to obtain valuable data.

On the basis of the technical scheme, the invention can be improved as follows.

Further, the metadata in the medical health database is extracted either by establishing a plurality of parallel data extraction processes to perform ETL extraction on the medical health data, or by parsing the health database logs.

The beneficial effect of adopting the further scheme is that the data in the distributed and heterogeneous data sources can be extracted to the temporary middle layer for cleaning, conversion and integration by carrying out ETL data extraction on the medical health data and analyzing the health database logs, and the metadata can be written into the data warehouse.
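The parallel extraction into a temporary middle layer can be sketched as follows. This is a minimal illustration only: the in-memory `SOURCES` dictionary, the field names and the functions `extract`, `clean` and `run_parallel_etl` are all hypothetical, and a thread pool stands in for the "parallel data extraction processes" that the text leaves unspecified.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical in-memory stand-ins for the source systems of two service platforms.
SOURCES = {
    "his": [{"patient_id": " 001 ", "dept": "Cardiology"}],
    "lis": [{"patient_id": "002", "dept": None}],
}

def extract(source_name):
    """Extract step: read raw records from one source and tag their origin."""
    return [dict(rec, source=source_name) for rec in SOURCES[source_name]]

def clean(record):
    """Clean/convert step: trim strings and drop fields that have no value."""
    return {
        key: value.strip() if isinstance(value, str) else value
        for key, value in record.items()
        if value is not None
    }

def run_parallel_etl(source_names, max_workers=4):
    """Run one extraction task per source in parallel, then clean and integrate."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        batches = list(pool.map(extract, source_names))  # results keep input order
    # The returned list plays the role of the temporary middle layer.
    return [clean(rec) for batch in batches for rec in batch]

staging = run_parallel_etl(["his", "lis"])
```

In a real deployment each `extract` call would read from a database or log file, and the cleaned staging rows would then be loaded into the data warehouse.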

Further, the classified storage of the metadata according to the application categories of the metadata comprises:

establishing a typical word list related to the application category semantics according to the metadata application category;

performing text processing on the metadata, and dividing the metadata into text feature word combinations;

and matching the text characteristic word combinations according to the typical word list, taking the matching result as a metadata set, and storing the metadata set.

The method has the advantages that the metadata are matched with the text feature word combination through the typical word list, and the metadata are classified and stored according to the application categories of the metadata, so that the metadata can be quickly retrieved and extracted.
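One way to picture the word-list matching is the sketch below. The word lists, the tokenizer and the overlap-count scoring rule are assumptions made for illustration; the text does not specify how matches are scored.

```python
import re

# Hypothetical typical word lists, one per application category.
TYPICAL_WORDS = {
    "business": {"diagnosis", "admission", "billing", "rule"},
    "technical": {"table", "column", "varchar", "index", "port"},
    "management": {"owner", "role", "steward", "approval"},
}

def feature_words(text):
    """Split a metadata description into a set of lowercase feature words."""
    return set(re.findall(r"[a-z]+", text.lower()))

def classify(metadata_text):
    """Return the category whose typical words overlap most, or None if none match."""
    words = feature_words(metadata_text)
    scores = {cat: len(words & vocab) for cat, vocab in TYPICAL_WORDS.items()}
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else None

def classify_and_store(items):
    """Group metadata descriptions into one metadata set per matched category."""
    store = {}
    for item in items:
        store.setdefault(classify(item), []).append(item)
    return store
```

Items that match no word list are stored under the key `None`, so they can be reviewed manually rather than silently dropped.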

Further, parsing and fusing the metadata includes:

establishing a basic metadata corpus, and translating all metadata based on the metadata corpus;

carrying out duplicate removal, normalization and disambiguation processing on the metadata by using a natural language processing algorithm;

establishing a business data model, and classifying the metadata according to business types;

and establishing a knowledge base, and adding the classified metadata into a corresponding directory tree of the knowledge base.

The beneficial effect of adopting the further scheme is that the metadata from different systems are translated into the uniform language, which is beneficial to uniformly classifying the metadata, and the classified metadata is added into the corresponding directory tree of the knowledge base to be beneficial to searching the metadata index.
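The four steps above can be sketched in a few lines. This is an assumption-laden illustration: a plain dictionary lookup stands in for the natural language processing algorithm, and the corpus entries, the business data model and the function names are hypothetical.

```python
# Hypothetical base corpus: source-specific field names mapped to unified terms.
CORPUS = {
    "pat_nm": "patient_name",
    "xm": "patient_name",
    "dob": "birth_date",
    "csrq": "birth_date",
}

# Hypothetical business data model: unified term mapped to a business type.
BUSINESS_MODEL = {"patient_name": "demographics", "birth_date": "demographics"}

def normalize(name):
    """Normalize a raw field name: trim, lowercase, underscores for spaces."""
    return name.strip().lower().replace(" ", "_")

def fuse(raw_names):
    """Translate, de-duplicate and classify names into a directory tree."""
    tree = {}
    seen = set()
    for raw in raw_names:
        key = normalize(raw)
        term = CORPUS.get(key, key)  # translate; unknown names pass through
        if term in seen:             # de-duplicate after translation
            continue
        seen.add(term)
        branch = BUSINESS_MODEL.get(term, "unclassified")
        tree.setdefault(branch, []).append(term)
    return tree
```

For example, `fuse(["Pat_Nm", "xm", " DOB ", "visit no"])` places `patient_name` and `birth_date` under `demographics` and leaves `visit_no` under `unclassified`.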

Further, metadata lifecycle management includes:

monitoring the metadata according to the progress of each data processing node in the life cycle of the metadata, generating XML data streams for all processing flows of the metadata, and converting the XML data streams into XML files in a standard metadata format;

and analyzing the association between each metadata in the same XML file based on the XML data stream.

The beneficial effect of adopting the further scheme is that the XML data stream is utilized to monitor the metadata, which is beneficial to ensuring the consistency and correctness of the metadata.
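A rough sketch of this monitoring step, using Python's standard `xml.etree.ElementTree` module, is given below. The element and attribute names (`metadataLifecycle`, `event`, `node`, `metadata`, `status`) are invented for illustration and are not the standard metadata format referred to in the text; treating "passes through the same processing node" as an association is likewise an assumption.

```python
import xml.etree.ElementTree as ET

def events_to_xml(events):
    """Serialize processing events as one XML document (names are assumptions)."""
    root = ET.Element("metadataLifecycle")
    for ev in events:
        ET.SubElement(
            root, "event",
            node=ev["node"], metadata=ev["metadata"], status=ev["status"],
        )
    return ET.tostring(root, encoding="unicode")

def associations(xml_text):
    """Pair up metadata items that pass through the same processing node."""
    root = ET.fromstring(xml_text)
    by_node = {}
    for ev in root.iter("event"):
        by_node.setdefault(ev.get("node"), []).append(ev.get("metadata"))
    pairs = set()
    for items in by_node.values():
        for i, first in enumerate(items):
            for second in items[i + 1:]:
                pairs.add(tuple(sorted((first, second))))
    return pairs
```

Because every processing flow is written to the same document, a consistency check can be run over one file instead of over each source system separately.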

Further, the metadata change management includes: comparing the structural changes of the metadata across data processing flows, and applying those structural changes to the metadata standard library.

The beneficial effect of adopting the further scheme is that the standardized upgrading of the metadata standard library is realized.
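Structural comparison can be illustrated with a simple dictionary diff. The representation of a metadata structure as a mapping from field name to type, and the functions `diff_structure` and `apply_changes`, are assumptions for this sketch.

```python
def diff_structure(old, new):
    """Compare two field-name -> type mappings and report structural changes."""
    added = {name: new[name] for name in new.keys() - old.keys()}
    removed = sorted(old.keys() - new.keys())
    changed = {
        name: (old[name], new[name])
        for name in old.keys() & new.keys()
        if old[name] != new[name]
    }
    return {"added": added, "removed": removed, "changed": changed}

def apply_changes(standard, diff):
    """Return a copy of the standard library entry with the diff applied."""
    updated = dict(standard)
    updated.update(diff["added"])
    for name, (_old_type, new_type) in diff["changed"].items():
        updated[name] = new_type
    for name in diff["removed"]:
        updated.pop(name, None)
    return updated

old = {"patient_id": "varchar(32)", "sex": "char(1)"}
new = {"patient_id": "varchar(64)", "birth_date": "date"}
changes = diff_structure(old, new)
```

Keeping the diff as an explicit object means each change can be reviewed or logged before it is written into the standard library.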

Further, the metadata normalization process includes: metadata structure standardization, metadata value range standardization and interface service standardization.

The beneficial effect of adopting the further scheme is that the structure, value range and interfaces of the metadata are standardized, which improves the efficiency of metadata processing.
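Structure and value range standardization might look like the sketch below. The alias table, the value mapping and the permitted code set are hypothetical examples, not values taken from any particular health data standard.

```python
# Hypothetical structure standard: source field names mapped to canonical names.
FIELD_ALIASES = {"gender": "sex", "xb": "sex"}

# Hypothetical value range standard for coded fields.
VALUE_MAP = {"sex": {"m": "1", "male": "1", "f": "2", "female": "2"}}
ALLOWED = {"sex": {"0", "1", "2", "9"}}

def standardize(record):
    """Rename fields to canonical names and map values into the permitted range."""
    out = {}
    for name, value in record.items():
        field = FIELD_ALIASES.get(name.lower(), name.lower())
        text = str(value).strip()
        if field in VALUE_MAP:
            text = VALUE_MAP[field].get(text.lower(), text)
        if field in ALLOWED and text not in ALLOWED[field]:
            raise ValueError(f"{field}={text!r} is outside the standard value range")
        out[field] = text
    return out
```

With these tables, `standardize({"Gender": " Male "})` yields `{"sex": "1"}`, while a value that cannot be mapped into the permitted set raises `ValueError` instead of entering the warehouse.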

Drawings

Fig. 1 is a flowchart of the multi-source data processing method based on a health and medical big data standard library provided in Embodiment 1 of the present invention;

Fig. 2 is a system diagram of the multi-source data processing system based on a health and medical big data standard library.

Detailed Description

The principles and features of this invention are described below in conjunction with examples which are set forth to illustrate, but are not to be construed to limit the scope of the invention.

Example 1

The embodiment provides a multi-source data processing method based on a health and medical big data standard library, as shown in fig. 1, the method includes:

establishing a metadata function component, and analyzing and fusing metadata, managing the life cycle of the metadata, managing metadata change and standardizing the metadata by using the metadata function component;

Optionally, the metadata in the medical health database is extracted either by establishing multiple parallel data extraction processes to perform ETL extraction on the medical health data, or by parsing the health database logs.

In practice, unstructured data is removed before ETL extraction is performed on the medical health database. By performing ETL extraction on the medical health data and parsing the health database logs, data in distributed, heterogeneous data sources can be extracted to a temporary middle layer for cleaning, conversion and integration, and the metadata can then be written into the data warehouse. The health database logs record the state of the database, the execution status of statements and the consumption of resources; parsing these logs likewise allows the metadata to be written into the data warehouse.

Optionally, the classifying and storing the metadata according to the application category of the metadata includes:

In practice, the text feature word combinations of the metadata are matched against the typical word list, and the metadata is classified and stored according to its application category, so that the metadata is stored in a standardized, classified form. The application categories of the metadata include business metadata, technical metadata and management metadata. Business metadata describes the business logic of the metadata processing; technical metadata describes the storage structure type of the metadata, attribute values (such as the address of the metadata), data acquisition ports and the like; management metadata describes management processes, personnel organization, role responsibilities and the like. Metadata is quickly retrieved and extracted by using metadata field names and field attribute values. A knowledge graph engine decouples the target association relationships among the technical metadata, business metadata and management metadata in order to construct the metadata knowledge graph.
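The association relationships among the three metadata categories can be pictured as subject-relation-object triples. The sketch below is an assumption about representation only; the node names, relation names and helper functions are hypothetical, and no particular knowledge graph engine is implied.

```python
from collections import defaultdict

def build_graph(triples):
    """Build an adjacency map from (subject, relation, object) triples."""
    graph = defaultdict(list)
    for subject, relation, obj in triples:
        graph[subject].append((relation, obj))
    return graph

def neighbors(graph, node, relation=None):
    """Return objects linked from a node, optionally filtered by relation name."""
    return [obj for rel, obj in graph.get(node, []) if relation is None or rel == relation]

# Hypothetical links between technical, business and management metadata.
triples = [
    ("table:visit", "implements", "business:outpatient_registration"),
    ("table:visit", "owned_by", "role:registration_steward"),
    ("business:outpatient_registration", "governed_by", "role:registration_steward"),
]
graph = build_graph(triples)
```

Querying `neighbors(graph, "table:visit", "owned_by")` then returns the responsible role for a given technical object.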

Optionally, parsing and fusing the metadata includes:

establishing a business data model, and classifying the metadata according to business types;

In the actual application process, the metadata from different systems are translated into the unified language, so that the metadata can be classified uniformly, and the classified metadata is added into the corresponding directory tree of the knowledge base to facilitate the metadata index search.

Optionally, the metadata lifecycle management includes:

monitoring the metadata according to the progress of each data processing node in the life cycle of the metadata, generating an XML data stream for all processing flows of the metadata, and converting the XML data stream into an XML file in a standard metadata format;

In the actual application process, the XML data stream is used for monitoring the metadata, so that the consistency and the correctness of the metadata can be ensured.

Optionally, the metadata change management includes: comparing the structural changes of the metadata across data processing flows, and applying those structural changes to the metadata standard library.

In practice, this realizes standardized upgrading of the metadata standard library.

Optionally, the metadata normalization process includes: metadata structure standardization, metadata value range standardization and interface service standardization.

In practice, the structure, value range and interfaces of the metadata are standardized, which improves the efficiency of metadata processing.

Example 2

As shown in FIG. 2, the present invention provides a multi-source data processing system based on a health and medical big data standard library, comprising: the system comprises a metadata acquisition layer, a metadata storage layer, a metadata analysis layer and a metadata application layer;

the standard library construction unit is used for establishing a metadata standard library based on the CWM meta-model;

the metadata analysis layer comprises an analysis and fusion component, a metadata life cycle management component, a metadata change management component and a metadata standardization processing component, and is used for analyzing and fusing metadata in the data warehouse, managing the metadata life cycle, managing the metadata change and carrying out metadata standardization processing;

Optionally, the parsing and fusing component includes:

the translation unit is used for establishing a basic metadata corpus and translating all metadata based on the metadata corpus;

the processing unit is used for carrying out duplicate removal, normalization and disambiguation processing on the metadata by utilizing a natural language processing algorithm;

the business classification unit is used for establishing a business data model and classifying the metadata according to business types;

and the knowledge base management unit is used for establishing a knowledge base and adding the classified metadata into a corresponding directory tree of the knowledge base.

Optionally, the metadata life cycle management component includes:

the file conversion unit is used for monitoring the metadata according to the progress of each data processing node in the life cycle of the metadata, generating XML data streams for all processing flows of the metadata and converting the XML data streams into XML files in a standard metadata format;

and the analysis unit is used for analyzing the association among the metadata in the same XML file based on the XML data stream.

Optionally, the metadata change management component includes: a structure comparison component used for comparing the structural changes of the metadata across data processing flows and applying those structural changes to the metadata standard library.

Optionally, the metadata standardization processing component is used for standardization processing of a metadata structure, standardization processing of a metadata value range, and standardization processing of an interface service.

Although embodiments of the present invention have been shown and described above, it is understood that the above embodiments are exemplary and should not be construed as limiting the present invention, and that variations, modifications, substitutions and alterations can be made to the above embodiments by those of ordinary skill in the art within the scope of the present invention.

Claims

1. A multi-source data processing method based on a health and medical big data standard library, characterized by comprising the following steps:

classifying and storing the metadata according to the application category of the metadata, and establishing a metadata standard library based on the Common Warehouse Metamodel (CWM);

establishing a metadata functional component, and using the metadata functional component to parse and fuse the metadata, manage the metadata life cycle, manage metadata changes and standardize the metadata;

2. The multi-source data processing method based on the health and medical big data standard library according to claim 1, wherein the metadata in the medical health database is extracted either by establishing a plurality of parallel data extraction processes to perform ETL extraction on the medical health data, or by parsing the health database logs.

3. The multi-source data processing method based on the health and medical big data standard library according to claim 1, wherein the classifying and storing the metadata according to the application category of the metadata comprises:

establishing a typical word list semantically associated with the application category according to the metadata application category;

4. The multi-source data processing method based on the health and medical big data standard library according to claim 1, wherein the parsing and fusing the metadata comprises:

carrying out duplicate removal, normalization and disambiguation on the metadata by using a natural language processing algorithm;

5. The multi-source data processing method based on the health and medical big data standard library according to claim 4, wherein the metadata life cycle management comprises:

and analyzing the association between the metadata in the same XML file based on the XML data stream.

6. The multi-source data processing method based on the health and medical big data standard library according to claim 1, wherein the metadata change management comprises: comparing the structural changes of the metadata across data processing flows, and applying those structural changes to the metadata standard library.

7. The multi-source data processing method based on the health and medical big data standard library according to claim 1, wherein the metadata standardization comprises: metadata structure standardization, metadata value range standardization and interface service standardization.

8. A multi-source data processing system based on a health and medical big data standard library, characterized by comprising: a metadata acquisition layer, a metadata storage layer, a metadata analysis layer and a metadata application layer;

the database acquisition unit is used for acquiring medical health data of each service platform and establishing a medical health database;

the database analysis and verification unit is used for acquiring a health database log, analyzing and verifying the health database log and sending the verified metadata to the metadata storage layer;

the standard library construction unit is used for establishing a metadata standard library based on the CWM;

the warehousing conversion unit is used for processing the metadata by using the Common Warehouse Metamodel (CWM) and converting the storage format of the metadata;

the classification unit is used for classifying the metadata according to the application categories of the metadata;

the metadata analysis layer comprises an analysis and fusion component, a metadata life cycle management component, a metadata change management component and a metadata standardization processing component and is used for analyzing and fusing the metadata in the data warehouse, managing the metadata life cycle, managing the metadata change and standardizing the metadata;

the metadata application layer comprises a metadata quality management unit, a metadata standard management unit and a metadata knowledge graph construction unit, and is used for performing data quality management, metadata standard management and metadata knowledge graph construction on the metadata.