CN111767332B - Data integration method, system and terminal for heterogeneous data sources

Info

Publication number: CN111767332B
Application number: CN202010566643.8A
Authority: CN
Inventors: 王福; 陈良
Original assignee: Shanghai Synyi Medical Technology Co ltd
Current assignee: Shanghai Synyi Medical Technology Co ltd
Priority date: 2020-06-12
Filing date: 2020-06-19
Publication date: 2021-07-30
Anticipated expiration: 2040-06-19
Also published as: CN111767332A

Abstract

A data integration method, system and terminal for heterogeneous data sources, which address the following problems in the prior art when integrating large volumes of heterogeneous data, especially structured together with unstructured data: integration is incomplete and inefficient, the integration is difficult to extend, the data lacks governance, the application range is limited, and extending the integration to a new application requires repeated development at high cost. The invention converts the heterogeneous databases of the subsystems into a unified data format supported by a data lake and thoroughly governs inconsistencies in data content standards across the heterogeneous data. This enables data integration and sharing, establishes a data standard, facilitates subsequent data applications, and provides good extensibility.

Description

Data integration method, system and terminal for heterogeneous data sources

Technical Field

The present invention relates to the technical field of data information processing, and in particular, to a data integration method, system and terminal for heterogeneous data sources.

Background

Data is information that a computer can transmit and store. A "database" is a set of related data that is stored, organized, and manipulated in a particular logical structure. To ensure transaction rate, reliability, maintainability, scalability and acceptable cost, existing large applications typically access databases through a database management system (DBMS) to obtain the required data or perform data maintenance. Database management software such as IBM DB2, Oracle, MySQL, and SQL Server dominates large data processing applications.

With the advance of informatization, an enterprise that wishes to support its internal operation management through data analysis and Business Intelligence (BI) must establish a unified data warehouse that stores the data of every sub-application system centrally. This ensures data consistency, enables data interconnection, allows data sources to be exchanged and shared efficiently, and reduces the repeated labor and cost of data collection. However, because different application systems use different database software, their data storage structures and data maintenance methods differ, which gives rise to the problem of exchanging heterogeneous data. Heterogeneous data covers not only data held in different types of database software but also data that differs in structure, such as structured versus unstructured data.

Particularly in medical scenarios, unstructured data is very common, such as patient medical records, examination reports, images, text, recordings, and the like.

To address this problem, the prior art generally develops an independent data interface between the subsystems to be integrated and integrates data according to a specified data content and format. This approach has many limitations and cannot meet the need for massive data exchange among all systems in an enterprise. In addition, the data lacks governance, the application range is limited, and extending the integration to a new application requires repeated development at high cost.

Disclosure of Invention

In view of the above drawbacks of the prior art, an object of the present invention is to provide a data integration method, system and terminal for heterogeneous data sources that solve the following problems in the prior art when integrating large volumes of heterogeneous data, especially structured together with unstructured data: integration is incomplete and inefficient, the integration is difficult to extend, the data lacks governance, the application range is limited, and extending the integration to a new application requires repeated development at high cost.

To achieve the above and other related objects, the present invention provides a data integration method for heterogeneous data sources, including: performing abstract mapping on each data source in a plurality of heterogeneous databases to obtain metadata of each meta-model under the mapping relation, wherein each meta-model corresponds to one data source; copying each heterogeneous database to a replication database, and establishing change capture on the replication database to obtain a change table recording the change data in each heterogeneous database; converting the change data read for each heterogeneous database into a data format unified with the metadata; and performing data governance on the change data and the metadata converted to the unified data format, and storing them in an integrated data lake.

In an embodiment of the present invention, performing abstract mapping on each data source in the plurality of heterogeneous databases to obtain the metadata of each meta-model under the mapping relationship includes: performing abstract mapping on the physical model of each data source in the plurality of heterogeneous databases according to the mapping relationship, thereby generating a meta-model with logical relationships for each data source; and obtaining, based on each meta-model, the metadata of that data source's meta-model under the mapping relationship.

In an embodiment of the present invention, the heterogeneous database includes structured data and/or unstructured data.

In an embodiment of the present invention, the unstructured data includes: patient medical record data, examination report data, image data, text data, and recording data.

In an embodiment of the present invention, copying each heterogeneous database to the replication database and establishing change capture on the replication database to obtain a change table recording the change data in each heterogeneous database includes: synchronously copying the data in each heterogeneous database to the replication database; and capturing new change data in the replication database into the change table each time a time threshold elapses.

In an embodiment of the present invention, the database types supported by the replication database include one or more of DB2, Oracle, SQL Server, and MySQL.

In an embodiment of the present invention, the data governance includes one or more of: removing invalid data, unifying data definitions, handling missing data, and extracting valid variables from unstructured data.

To achieve the above and other related objects, the present invention provides a data integration system for heterogeneous data sources, the system comprising: a metadata management module for performing abstract mapping on each data source in a plurality of heterogeneous databases to obtain metadata of each meta-model under the mapping relation, wherein each meta-model corresponds to one data source; a replication database module for copying each heterogeneous database to a replication database and establishing change capture on the replication database to obtain a change table recording the change data in each heterogeneous database; a data integration module, connected to the metadata management module and the replication database module, for converting the change data read for each heterogeneous database into a data format unified with the metadata; and a data governance module, connected to the data integration module, for performing data governance on the change data and the metadata converted to the unified data format and storing them in an integrated data lake.

To achieve the above and other related objects, the present invention provides a data integration terminal for heterogeneous data sources, comprising: a memory for storing a computer program; and the processor is used for executing the data integration method of the heterogeneous data source.

As described above, the data integration method, system and terminal for heterogeneous data sources of the present invention have the following beneficial effects: the heterogeneous databases of the subsystems are converted into a unified data format supported by the data lake, and inconsistencies in data content standards across the heterogeneous data are thoroughly governed. This enables data integration and sharing, establishes a data standard, facilitates subsequent data applications, and provides good extensibility.

Drawings

Fig. 1 is a flowchart illustrating a data integration method for heterogeneous data sources according to an embodiment of the present invention.

Fig. 2 is a flowchart illustrating a data integration method for heterogeneous data sources according to an embodiment of the present invention.

Fig. 3 is a flowchart illustrating a data integration method for heterogeneous data sources according to an embodiment of the present invention.

Fig. 4 is a flowchart illustrating a data integration method for heterogeneous data sources according to an embodiment of the present invention.

Fig. 5 is a schematic structural diagram of a data integration system of heterogeneous data sources according to an embodiment of the present invention.

Fig. 6 is a schematic structural diagram of a data integration terminal of heterogeneous data sources according to an embodiment of the present invention.

Detailed Description

The embodiments of the present invention are described below with reference to specific embodiments, and other advantages and effects of the present invention will be easily understood by those skilled in the art from the disclosure of the present specification. The invention is capable of other and different embodiments and of being practiced or of being carried out in various ways, and its several details are capable of modification in various respects, all without departing from the spirit and scope of the present invention. It is to be noted that the features in the following embodiments and examples may be combined with each other without conflict.

It is noted that in the following description, reference is made to the accompanying drawings which illustrate several embodiments of the present invention. It is to be understood that other embodiments may be utilized and that mechanical, structural, electrical, and operational changes may be made without departing from the spirit and scope of the present invention. The following detailed description is not to be taken in a limiting sense, and the scope of embodiments of the present invention is defined only by the claims of the issued patent. The terminology used herein is for the purpose of describing particular embodiments only and is not intended to be limiting of the invention. Spatially relative terms, such as "upper," "lower," "left," "right," "below," "over," and the like, may be used herein to facilitate describing one element or feature's relationship to another element or feature as illustrated in the figures.

Throughout the specification, when a part is referred to as being "connected" to another part, this includes not only a case of being "directly connected" but also a case of being "indirectly connected" with another element interposed therebetween. In addition, when a certain part is referred to as "including" a certain component, unless otherwise stated, other components are not excluded, but it means that other components may be included.

The terms first, second, third, etc. are used herein to describe various elements, components, regions, layers and/or sections, but are not limited thereto. These terms are only used to distinguish one element, component, region, layer or section from another element, component, region, layer or section. Thus, a first element, component, region, layer or section discussed below could be termed a second element, component, region, layer or section without departing from the scope of the present invention.

Also, as used herein, the singular forms "a", "an" and "the" are intended to include the plural forms as well, unless the context indicates otherwise. It will be further understood that the terms "comprises" and/or "comprising," when used in this specification, specify the presence of stated features, operations, elements, components, items, species, and/or groups, but do not preclude the presence or addition of one or more other features, operations, elements, components, items, species, and/or groups thereof. The terms "or" and "and/or" as used herein are to be construed as inclusive, meaning any one or any combination. Thus, "A, B or C" or "A, B and/or C" means "any of the following: A; B; C; A and B; A and C; B and C; A, B and C". An exception to this definition will occur only when a combination of elements, functions or operations is inherently mutually exclusive in some way.

Therefore, an embodiment of the present invention provides a data integration method for heterogeneous data sources that solves the following problems in the prior art when integrating large volumes of heterogeneous data, especially structured together with unstructured data: integration is incomplete and inefficient, the integration is difficult to extend, the data lacks governance, the application range is limited, and extending the integration to a new application requires repeated development at high cost. The heterogeneous databases of the subsystems are converted into a unified data format supported by the data lake, and inconsistencies in data content standards across the heterogeneous data are thoroughly governed. This enables data integration and sharing, establishes a data standard, facilitates subsequent data applications, and provides good extensibility.

Embodiments of the present invention will be described in detail below with reference to the accompanying drawings so that those skilled in the art can easily implement the embodiments of the present invention. The present invention may be embodied in many different forms and is not limited to the embodiments described herein.

Fig. 1 is a schematic flow chart illustrating a data integration method of heterogeneous data sources according to an embodiment of the present invention.

The method comprises the following steps:

Step S11: Perform abstract mapping on each data source in the plurality of heterogeneous databases to obtain metadata corresponding to each meta-model under the mapping relationship.

Optionally, the heterogeneous databases include databases that are heterogeneous in computer architecture, operating system, data format, data storage location, or in other respects; this application places no limit on the type of heterogeneity.

Optionally, abstract mapping is performed on the physical model of each data source in the plurality of heterogeneous databases according to the mapping relationship, generating a meta-model with logical relationships for each data source; the metadata of each data source's meta-model under the mapping relationship is then obtained based on each meta-model.

Specifically, the physical models of the data sources in the plurality of heterogeneous databases are each abstractly mapped according to the mapping relationship, yielding a plurality of meta-models with logical relationships; the metadata of each data source's meta-model under the mapping relationship is obtained based on the meta-model corresponding to that data source.

That is, for each heterogeneous database, the physical model of the data source in that database is abstractly mapped according to the mapping relationship to obtain a meta-model (a logical model) with logical relationships, and the metadata of the data source's meta-model under the mapping relationship is obtained, as shown in Fig. 2.

Preferably, the mapping relationship converts data from data sources whose physical models are managed under different semantic types and/or business logic into data with a uniform logical relationship. Using this logical relationship or mapping relationship, metadata managed under the same specified semantic type standard and/or the same business logic standard is obtained.
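The abstract mapping of step S11 can be illustrated with a minimal Python sketch. It is not the patented implementation: the source names (`his_oracle`, `lis_mysql`), the column names, the `FieldMapping` type and the `build_meta_model` helper are all hypothetical, chosen only to show two physical models being mapped onto one shared logical model.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FieldMapping:
    physical_name: str  # column name in the source's physical model
    logical_name: str   # attribute name in the shared logical model
    logical_type: str   # unified semantic type


# Hypothetical mapping relationships: two sources store the same concepts
# under different physical names.
MAPPINGS = {
    "his_oracle": [
        FieldMapping("PAT_ID", "patient_id", "string"),
        FieldMapping("BIRTH_DT", "birth_date", "date"),
    ],
    "lis_mysql": [
        FieldMapping("patientNo", "patient_id", "string"),
        FieldMapping("dob", "birth_date", "date"),
    ],
}


def build_meta_model(source: str) -> dict:
    """Return the metadata of one source's meta-model (one meta-model per source)."""
    fields = MAPPINGS[source]
    return {
        "source": source,
        "attributes": {f.logical_name: f.logical_type for f in fields},
        "physical_to_logical": {f.physical_name: f.logical_name for f in fields},
    }


metadata = {source: build_meta_model(source) for source in MAPPINGS}
```

In this sketch both sources end up with the same `attributes` dictionary, which is the sense in which the resulting metadata is managed under one semantic standard even though the physical column names differ.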

Optionally, the heterogeneous database comprises structured data and/or unstructured data.

Optionally, in a medical scenario, the unstructured data includes: patient medical record data, examination report data, image data, text data, and recording data. Each of these databases obtains, according to the configured mapping relationship, metadata managed under the same specified semantic type standard and/or the same business logic standard.

Optionally, the metadata management tool performs abstract mapping on each data source in the multiple heterogeneous databases to obtain metadata of each meta-model under the mapping relationship, where each meta-model corresponds to one data source.

Optionally, the metadata management tool may also delete, add, extract, store, query, and otherwise manage the obtained meta-models and metadata.

Step S12: Copy each heterogeneous database to a replication database, and establish change capture on the replication database to obtain a change table recording the change data in each heterogeneous database.

Optionally, the data in each heterogeneous database is synchronously copied to the replication database; each time a time threshold elapses, new change data in the replication database is captured into the change table, as shown in Fig. 3.

Specifically, the data in each heterogeneous database is synchronously copied to generate a plurality of replication databases, each corresponding to one heterogeneous database. Note that when the data in a heterogeneous database changes, its replication database changes in step.

Each time the set time threshold elapses, change capture is performed on the current replication database, generating a change table that contains the captured change data. The time threshold is chosen according to specific requirements; the shorter the threshold, the more promptly changes are captured.
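The periodic capture can be sketched as follows. The patent does not specify how changes are detected, so this example substitutes a simple snapshot diff and uses an in-memory SQLite database as a stand-in for a replication database; the `patient` table and the helper names `snapshot` and `capture_changes` are assumptions made for illustration. In a real deployment `capture_changes` would be invoked once per time threshold by a scheduler.

```python
import sqlite3


def snapshot(conn, table, key):
    """Read the replica table into a dict keyed by its primary key."""
    cur = conn.execute(f"SELECT * FROM {table}")
    cols = [c[0] for c in cur.description]
    key_idx = cols.index(key)
    return {row[key_idx]: dict(zip(cols, row)) for row in cur}


def capture_changes(previous, current):
    """Diff two snapshots into change-table rows of the form (op, key, row)."""
    changes = []
    for k, row in current.items():
        if k not in previous:
            changes.append(("insert", k, row))
        elif previous[k] != row:
            changes.append(("update", k, row))
    for k, row in previous.items():
        if k not in current:
            changes.append(("delete", k, row))
    return changes


# Stand-in replication database (hypothetical schema).
replica = sqlite3.connect(":memory:")
replica.execute("CREATE TABLE patient (id INTEGER PRIMARY KEY, name TEXT)")
replica.execute("INSERT INTO patient VALUES (1, 'A'), (2, 'B')")
before = snapshot(replica, "patient", "id")

# Changes that reach the replica during one capture interval.
replica.execute("UPDATE patient SET name = 'A2' WHERE id = 1")
replica.execute("DELETE FROM patient WHERE id = 2")
replica.execute("INSERT INTO patient VALUES (3, 'C')")
after = snapshot(replica, "patient", "id")

change_table = capture_changes(before, after)
```

A snapshot diff only sees the net effect of an interval, so several updates to the same row collapse into one change-table entry; this is one concrete reason a shorter threshold captures changes at finer granularity.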

Optionally, the database types supported by the replication database include one or more of DB2, Oracle, SQL Server, and MySQL.

Step S13: Convert the change data read for each heterogeneous database into a data format unified with the metadata.

Optionally, the change data of each heterogeneous database, read from the change table captured on each replication database, is unified in format with the metadata obtained under the unified mapping relationship, so that the continuously changing data in each heterogeneous database is kept up to date in a unified format.

Optionally, the unified format is the same as the format of the metadata.
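As a hedged illustration of step S13, the sketch below renames and retypes change rows according to per-source metadata. The `METADATA` and `DATE_FORMATS` tables, the source names and the `to_unified` function are hypothetical; the patent only requires that the converted change data share the format of the metadata.

```python
from datetime import datetime

# Hypothetical metadata from the mapping step:
# physical column -> (logical name, logical type).
METADATA = {
    "his_oracle": {"PAT_ID": ("patient_id", "string"), "BIRTH_DT": ("birth_date", "date")},
    "lis_mysql": {"patientNo": ("patient_id", "string"), "dob": ("birth_date", "date")},
}

# Assumed source-specific date encodings.
DATE_FORMATS = {"his_oracle": "%Y%m%d", "lis_mysql": "%d/%m/%Y"}


def to_unified(source, change_row):
    """Convert one change row into the unified logical format."""
    unified = {"source": source}
    for column, value in change_row.items():
        logical_name, logical_type = METADATA[source][column]
        if logical_type == "date":
            # Normalize every date to ISO 8601 (YYYY-MM-DD).
            value = datetime.strptime(value, DATE_FORMATS[source]).date().isoformat()
        else:
            value = str(value)
        unified[logical_name] = value
    return unified


a = to_unified("his_oracle", {"PAT_ID": 1001, "BIRTH_DT": "19800305"})
b = to_unified("lis_mysql", {"patientNo": "1001", "dob": "05/03/1980"})
```

After conversion the two rows carry the same keys and the same value encodings, so downstream governance code does not need to know which source a record came from.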

Step S14: Perform data governance on the change data and the metadata converted to the unified data format, and store them in an integrated data lake.

Optionally, data governance is performed on the change data and the metadata converted to the unified data format, and the governed change data and metadata are output to the integrated data lake for storage. The governed change data and metadata are in a unified data format supported by the data lake, as shown in Fig. 4.

Optionally, the data governance includes one or more of: removing invalid data, unifying data definitions, handling missing data, and extracting valid variables from unstructured data, so as to generate standardized and normalized data for storage in the integrated data lake.
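The four governance operations listed above can be sketched in a few lines of Python. The rules here are assumptions chosen for illustration only: a record without a `patient_id` is treated as invalid, sex codes are unified through a hypothetical `SEX_CODES` table, missing values default to `"unknown"` or `None`, and a body temperature is extracted from free-text notes with a regular expression.

```python
import re

# Hypothetical code table used to unify data definitions across sources.
SEX_CODES = {"m": "male", "male": "male", "1": "male",
             "f": "female", "female": "female", "2": "female"}

# Assumed pattern for pulling one variable out of unstructured note text.
TEMP_PATTERN = re.compile(r"temperature\s*[:=]?\s*(\d+(?:\.\d+)?)", re.IGNORECASE)


def govern(records):
    """Apply the four governance operations to unified records."""
    governed = []
    for rec in records:
        if not rec.get("patient_id"):
            continue  # remove invalid data
        rec = dict(rec)
        raw_sex = str(rec.get("sex", "")).strip().lower()
        rec["sex"] = SEX_CODES.get(raw_sex, "unknown")  # unify definitions, fill missing
        match = TEMP_PATTERN.search(rec.get("note") or "")
        rec["temperature_c"] = float(match.group(1)) if match else None  # extract variable
        governed.append(rec)
    return governed


raw = [
    {"patient_id": "1001", "sex": "M", "note": "Temperature: 38.5, cough for 3 days"},
    {"patient_id": "", "sex": "F", "note": "orphan row"},
    {"patient_id": "1002", "note": None},
]
governed = govern(raw)
```

Keeping the rules in data tables such as `SEX_CODES` rather than in branching code is one way to let a new source be onboarded by extending configuration instead of redeveloping the pipeline.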

Similar to the principle of the above embodiment, the present invention provides a data integration system of heterogeneous data sources.

Specific embodiments are provided below in conjunction with the attached figures:

Fig. 5 is a schematic structural diagram illustrating a data integration system of heterogeneous data sources according to an embodiment of the present invention.

The system comprises:

the metadata management module 51 is configured to perform abstract mapping on each data source in a plurality of heterogeneous databases to obtain metadata of each meta-model in the mapping relationship, where each meta-model corresponds to one data source;

the replication database module 52 is configured to replicate the heterogeneous databases to the replication database, and establish change capture on the replication database to obtain a change table for recording change data in the heterogeneous databases;

a data integration module 53, connected to the metadata management module 51 and the replication database module 52, for converting the read change data in the various heterogeneous databases into a data format unified with the metadata;

and the data governance module 54 is connected with the data integration module 53, and is used for performing data governance on the change data and the metadata which are subjected to the unified data format conversion, and storing the change data and the metadata into an integrated data lake.
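The connections among modules 51 to 54 can be pictured with the following Python sketch. The class and method names are hypothetical, the data lake is modeled as a plain list, and the governance rule (drop records containing a missing value) is a placeholder; only the wiring between the modules follows the description above.

```python
class MetadataManagementModule:
    """Module 51: holds each source's physical-to-logical field mapping."""

    def __init__(self, mappings):
        self.mappings = mappings

    def metadata_for(self, source):
        return self.mappings[source]


class ReplicationDatabaseModule:
    """Module 52: accumulates captured changes in a change table."""

    def __init__(self):
        self._change_table = []

    def record_change(self, source, row):
        self._change_table.append((source, row))

    def read_change_table(self):
        changes, self._change_table = self._change_table, []
        return changes


class DataIntegrationModule:
    """Module 53: connected to modules 51 and 52."""

    def __init__(self, metadata_module, replication_module):
        self.metadata_module = metadata_module
        self.replication_module = replication_module

    def unified_changes(self):
        unified = []
        for source, row in self.replication_module.read_change_table():
            mapping = self.metadata_module.metadata_for(source)
            unified.append({mapping[col]: value for col, value in row.items()})
        return unified


class DataGovernanceModule:
    """Module 54: connected to module 53; writes governed records to the lake."""

    def __init__(self, integration_module, data_lake):
        self.integration_module = integration_module
        self.data_lake = data_lake

    def run(self):
        for record in self.integration_module.unified_changes():
            # Placeholder governance rule: skip records with missing values.
            if all(value is not None for value in record.values()):
                self.data_lake.append(record)


metadata = MetadataManagementModule(
    {"his": {"PAT_ID": "patient_id"}, "lis": {"patientNo": "patient_id"}}
)
replication = ReplicationDatabaseModule()
lake = []
governance = DataGovernanceModule(DataIntegrationModule(metadata, replication), lake)

replication.record_change("his", {"PAT_ID": "1001"})
replication.record_change("lis", {"patientNo": None})  # dropped by governance
governance.run()
```

Because module 53 depends only on the interfaces of modules 51 and 52, a new source can be added in this sketch by registering one more mapping, without changing the integration or governance code.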

Optionally, the metadata management module 51 performs abstract mapping on physical models in each data source in a plurality of heterogeneous databases according to a mapping relationship, and generates a meta model with a logical relationship respectively; and obtaining the metadata of the meta-model of each data source under the mapping relation based on each meta-model.

Specifically, the metadata management module 51 abstractly maps the physical models of the data sources in the plurality of heterogeneous databases according to the mapping relationship, yielding a plurality of meta-models with logical relationships; the metadata of each data source's meta-model under the mapping relationship is obtained based on the meta-model corresponding to that data source.

For each heterogeneous database, the metadata management module 51 performs abstract mapping on a physical model of a data source in the heterogeneous database according to a mapping relationship to obtain a meta model (logical model) with a logical relationship; and obtaining metadata of the meta-model of the data source under the mapping relation.

Optionally, the unstructured data includes: patient medical record data, examination report data, image data, text data, and recording data. Each of these databases obtains, according to the configured mapping relationship, metadata managed under the same specified semantic type standard and/or the same business logic standard.

Optionally, the metadata management module 51 performs abstract mapping on each data source in the multiple heterogeneous databases through a metadata management tool to obtain metadata of each meta model under the mapping relationship, where each meta model corresponds to one data source.

Optionally, the metadata management tool includes adapters such as ODBC, a file adapter, and an XML adapter, as well as a storage device.

Optionally, the replication database module 52 synchronously replicates the data in each heterogeneous database to the replication database; capturing new change data in the replicated database into a change table each time a time threshold has elapsed.

Specifically, the replication database module 52 synchronously copies the data in each heterogeneous database to generate a plurality of replication databases, each corresponding to one heterogeneous database. Note that when the data in a heterogeneous database changes, its replication database changes in step.

Each time the set time threshold elapses, the replication database module 52 performs change capture on the current replication database and generates a change table that contains the captured change data. The time threshold is chosen according to specific requirements; the shorter the threshold, the more promptly changes are captured.

Optionally, the data integration module 53 unifies the format of the change data of each heterogeneous database, read from the change table captured on each replication database, with the metadata obtained under the unified mapping relationship, so that the continuously changing data in each heterogeneous database is kept up to date in a unified format.

Optionally, the unified format is the same as the format of the metadata.

Optionally, the data governance module 54 performs data governance on the change data and the metadata converted to the unified data format, and outputs the governed change data and metadata to the integrated data lake for storage. The governed change data and metadata are in a unified data format supported by the data lake.

As shown in fig. 6, a schematic structural diagram of a data integration terminal 60 of heterogeneous data sources in the embodiment of the present invention is shown.

The data integration terminal 60 of the heterogeneous data source includes: a memory 61 and a processor 62, the memory 61 being for storing computer programs; the processor 62 runs a computer program to implement the data integration method of heterogeneous data sources as described in fig. 1.

Optionally, the number of the memories 61 may be one or more, the number of the processors 62 may be one or more, and fig. 6 illustrates one example.

Optionally, the processor 62 in the data integration terminal 60 of the heterogeneous data source may, following the steps shown in Fig. 1, load one or more instructions corresponding to the processes of an application program into the memory 61 and run the application program stored in the memory 61, thereby implementing the functions of the data integration method of the heterogeneous data source shown in Fig. 1.

Optionally, the memory 61 may include, but is not limited to, high-speed random access memory and non-volatile memory, such as one or more magnetic disk storage devices, flash memory devices, or other non-volatile solid-state storage devices.

Optionally, the processor 62 may be a general-purpose processor, including but not limited to a Central Processing Unit (CPU) or a Network Processor (NP); it may also be a Digital Signal Processor (DSP), an Application Specific Integrated Circuit (ASIC), a Field Programmable Gate Array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, or a discrete hardware component.

The present invention also provides a computer-readable storage medium storing a computer program which, when executed, implements the data integration method of heterogeneous data sources shown in Fig. 1. The computer-readable storage medium may include, but is not limited to, floppy diskettes, optical disks, CD-ROMs (compact disc read-only memories), magneto-optical disks, ROMs (read-only memories), RAMs (random access memories), EPROMs (erasable programmable read-only memories), EEPROMs (electrically erasable programmable read-only memories), magnetic or optical cards, flash memory, or other types of media or machine-readable media suitable for storing machine-executable instructions. The computer-readable storage medium may be a product that has not been connected to a computer device, or a component used by a computer device to which it is connected.

In summary, the data integration method, system, terminal and medium for heterogeneous data sources of the present invention solve the following problems in the prior art when integrating large volumes of heterogeneous data, especially structured together with unstructured data: integration is incomplete and inefficient, the integration is difficult to extend, the data lacks governance, the application range is limited, and extending the integration to a new application requires repeated development at high cost. The heterogeneous databases of the subsystems are converted into a unified data format supported by the data lake, and inconsistencies in data content standards across the heterogeneous data are thoroughly governed. This enables data integration and sharing, establishes a data standard, facilitates subsequent data applications, and provides good extensibility. Therefore, the invention effectively overcomes various defects in the prior art and has high industrial utilization value.

The foregoing embodiments are merely illustrative of the principles of the present invention and its efficacy, and are not to be construed as limiting the invention. Any person skilled in the art can modify or change the above-mentioned embodiments without departing from the spirit and scope of the present invention. Accordingly, it is intended that all equivalent modifications or changes which can be made by those skilled in the art without departing from the spirit and technical spirit of the present invention be covered by the claims of the present invention.

Claims

1. A data integration method for heterogeneous data sources is characterized by comprising the following steps:

performing abstract mapping on each data source in a plurality of heterogeneous databases to obtain metadata of each meta-model under the mapping relation, wherein each meta-model corresponds to one data source;

copying each heterogeneous database to a copy database, and establishing change capture on the copy database to obtain a change table for recording change data in each heterogeneous database;

converting the read change data in each heterogeneous database into a data format unified with the metadata;

performing data governance on the change data and the metadata converted to the unified data format, and storing the change data and the metadata into an integrated data lake;

wherein performing abstract mapping on each data source in the plurality of heterogeneous databases to obtain the metadata of each meta-model under the mapping relation includes: performing abstract mapping, according to the mapping relation, on the physical models managed under different semantic types and/or business logic in each data source in the plurality of heterogeneous databases, and respectively generating meta-models with uniform logical relations; and obtaining, based on each meta-model, the metadata of each data source's meta-model under the mapping relation, wherein the metadata are managed under the same semantic type and/or business logic;

wherein copying each heterogeneous database to the replication database and establishing change capture on the replication database to obtain a change table recording the change data in each heterogeneous database includes: synchronously copying the data in each heterogeneous database to the replication database; and capturing new change data in the replication database into the change table each time a time threshold elapses.

2. The data integration method of heterogeneous data sources of claim 1, wherein the heterogeneous database comprises structured data and/or unstructured data.

3. The data integration method of heterogeneous data sources of claim 2, wherein the unstructured data comprises: patient medical record data, examination report data, image data, text data, and recording data.

4. The data integration method of heterogeneous data sources of claim 1, wherein the database types supported by the replication database comprise one or more of DB2, Oracle, SQL Server, and MySQL.

5. The data integration method of heterogeneous data sources of claim 1, wherein the data governance comprises one or more of: removing invalid data, unifying data definitions, handling missing data, and extracting valid variables from unstructured data.

6. A data integration system for heterogeneous data sources, the system comprising:

the metadata management module is used for carrying out abstract mapping on each data source in the heterogeneous databases to obtain metadata of each meta-model under the mapping relation, wherein each meta-model corresponds to one data source;

the replication database module is used for replicating each heterogeneous database to a replication database and establishing change capture on the replication database so as to obtain a change table for recording change data in each heterogeneous database;

the data integration module is connected with the metadata management module and the replication database module and is used for converting the change data read for each heterogeneous database into a data format unified with the metadata;

the data governance module is connected with the data integration module and is used for performing data governance on the change data and the metadata converted to the unified data format and storing the change data and the metadata into an integrated data lake.

7. A data integration terminal of heterogeneous data sources, comprising:

a memory for storing a computer program;

a processor for performing the data integration method of heterogeneous data sources of any of claims 1 to 5.