WO2020139079A1 - System and method for analyzing heterogeneous data by utilizing data virtualization components

Info

Publication number: WO2020139079A1
Application number: PCT/MY2019/050135
Authority: WO
Inventors: Mohamad Zakaria Bin ALLI; Wan Zawawi Bin MD ZIN; Hooi Hwa LIM
Original assignee: Mimos Berhad
Priority date: 2018-12-28
Filing date: 2019-12-27
Publication date: 2020-07-02

Abstract

The present invention relates to a system (100) and method (200) for analyzing heterogeneous data from disparate data sources (102) by utilizing data virtualization components. In particular, the present invention utilizes a data virtualization server (108) for correlating heterogeneous data pertaining to health information from the disparate data sources (102). The data virtualization server (108) comprises a plurality of components (108A, 108B and 108C), namely at least one data connector (108A), at least one data composer (108B) and at least one data consumption (108C). The heterogeneous data, which comprises structured data (102A) and unstructured data (102B), is pseudonymized and stored in at least one data warehouse (106A) and at least one data lake (106B) respectively. The structured data (102A) stored in the data warehouse (106A) and the unstructured data (102B) stored in the data lake (106B) are correlated so that they can be analyzed and visualized through the data analytics and visualization module (110).

Description

SYSTEM AND METHOD FOR ANALYZING HETEROGENEOUS DATA BY UTILIZING DATA VIRTUALIZATION COMPONENTS

FIELD OF INVENTION

The present invention relates to a system and method for analyzing heterogeneous data from disparate data sources by utilizing data virtualization components. In particular, the present invention utilizes a data virtualization server for correlating heterogeneous data pertaining to health information by utilizing a plurality of components within the data virtualization server.

BACKGROUND OF INVENTION

Data can be obtained from various sources such as websites, interdisciplinary research, documents, e-mails, and media content. The data is generally generated in structured and unstructured forms, whereby structured data is data stored in a relational database in predefined fixed fields. The structured data is searchable through queries, search operations or algorithms that utilize the field names.

In contrast, unstructured data is not stored or organized in a predefined manner, although it may have an internal structure. Generally, unstructured data is more complicated to search and analyze than structured data. Furthermore, a large volume of structured and unstructured data, also known as big data, leads to difficulty in data analytics. The structured and unstructured data generally have to be integrated to support better decision-making, whereby the available data has to be utilized and analyzed thoroughly.

Currently in the medical field, the amount of data being generated is very large, as the data is generated through hospital information systems as well as other external sources such as medical histories from other hospitals and personal medical wearables. This leads to difficulties and challenges in compiling and analyzing the data.

United States Patent No. US 8250026 B2 (hereinafter referred to as the US 026 B2 Patent) entitled “Combining Medical Information Captured in Structured and Unstructured Data Formats for Use or Display in a User Application, Interface, or View” having a filing date of 6 March 2009 (Patentee: PeopleChart Corp) discloses that structured and unstructured data obtained from disparate sources are transformed and correlated, whereby the unstructured data is visualized by indices into an individual image report table. The US 026 B2 Patent also discloses that the data are converted into a common schema.

United States Patent No. US 7849048 B2 (hereinafter referred to as the US 048 B2 Patent) entitled “System and Method of Making Unstructured Data Available to Structured Data Analysis Tools” having a filing date of 5 July 2005 (Patentee: Clarabridge Inc.) utilizes a natural-language processing transformation tool to extract sentences from a copy of the unstructured data. The US 048 B2 Patent also discloses that every document is assigned a unique key which can be used to identify the document and the data derived from it throughout the entire system, and to reference back to the original document in the original source.

United States Patent No. US 7668849 B1 (hereinafter referred to as the US 849 B1 Patent) entitled “Method and System for Processing Structured Data and Unstructured Data” having a filing date of 9 December 2005 (Patentee: Clarabridge Inc.) discloses a system and method for processing structured and unstructured data. The US 849 B1 Patent utilizes a Component Integration Services (CIS) gateway, which is a set of connectivity tools for accessing data within a heterogeneous environment. The US 849 B1 Patent also discloses that the structured and unstructured data are correlated and integrated through links.

As outlined above, various systems and methods have been developed to provide analysis of structured and unstructured data. However, it is desirable to provide correlation or integration of both structured and unstructured data and to further analyze the correlated data through a data analytics and visualization module.

SUMMARY OF INVENTION

The present invention relates to a system and method for analyzing heterogeneous data from disparate data sources by utilizing data virtualization components. In particular, the present invention utilizes a data virtualization server for correlating heterogeneous data pertaining to health information from disparate data sources by utilizing a plurality of components within the data virtualization server.

One aspect of the invention provides a system (100) comprising at least one file server at a client side for providing access to files; at least one data source module (102) for providing heterogeneous data to be aggregated; a Privacy Assurance Services component (104) for conducting pseudonymization on the heterogeneous data to mask personal identification information; at least one data store (106) having a plurality of databases for storing heterogeneous data; at least one data virtualization server (108) having a plurality of components for collecting, harnessing and storing heterogeneous data into a unified view; and at least one data analytics and visualization module (110) for composing and exposing heterogeneous data from a virtual database within the data virtualization server (108) for analyzing and visualizing heterogeneous data.

The data virtualization server (108) further comprises at least one data connector (108A) for connecting to the data store (106), which supports a plurality of database connections; at least one data composer (108B) for composing a virtual database schema that integrates multiple data sources; and at least one data consumption (108C) for exposing the virtual database schema.

A further aspect of the invention provides that heterogeneous data further comprises structured data (102A) and unstructured data (102B).

Yet another aspect of the invention provides that the unstructured data (102B) further comprises semi-structured data.

Another aspect of the invention provides that the data store (106) further comprises at least one data warehouse (106A) for storing the structured data (102A); and at least one data lake (106B) for storing the unstructured data (102B) and semi-structured data.

A further aspect of the invention provides that the unified view enables views of heterogeneous data in a physical table.

Yet another aspect of the invention provides that the plurality of database connections includes a relational database management system.

Still another aspect of the invention provides that the virtual database schema further comprises a set of metadata representing the data source (102).

Another aspect of the invention provides that the data analytics and visualization module (110) is further configured to retrieve, analyze, transform and report heterogeneous data.

A further aspect of the invention provides a method (200) for analyzing heterogeneous data utilizing data virtualization components comprising the steps of collecting heterogeneous data from at least one data source (202); sending the heterogeneous data to at least one file server in a readable format for structuring data (204); pseudonymizing the heterogeneous data through a Privacy Assurance Services component (206); sending the pseudonymized heterogeneous data to at least one data store (208); correlating the pseudonymized heterogeneous data through at least one data virtualization server (210); and sending the correlated data to at least one data analytics and visualization module (212).

Yet another aspect of the invention provides that correlating pseudonymized heterogeneous data through at least one data virtualization server (210) further comprises the steps (300) of configuring connection to the data source (302); introspecting metadata of structured and unstructured data (102A and 102B) and exposing the metadata as a physical table (304); combining, integrating, transforming and cleansing source views as canonical model views of data for publishing the data in a data service format (306); and analyzing and visualizing the data (308).

Still another aspect of the invention provides that configuring connection to the data source (302) further comprises the steps (400) of selecting the required connector from a list of data connector adapters (402); configuring connection to a relational database management system (404); specifying the data source destination information including server address and database properties (406); and configuring and publishing connection information (408).

Another aspect of the invention provides that introspecting metadata of structured and unstructured data (102A and 102B) and exposing the metadata as a physical table (304) further comprises the steps (500) of configuring virtual database schema properties including schema name, connection source and version prior to selecting the defined connection source (502); importing and defining metadata from data sources by utilizing machine readable instructions for creating and deploying a virtual schema database (504); and creating and deploying the virtual schema database (506).

A further aspect of the invention provides that combining, integrating, transforming and cleansing source views as canonical model views of data for publishing the data in a data service format (306) further comprises the steps (600) of creating an extract, transform and load transformation for combining and integrating the metadata from the virtual database schema to accommodate data analytics (602); generating the output of the extract, transform and load transformation as a materialized view to be consumed by the data analytics and visualization module (604); exposing the materialized view as a service (606); and deploying and publishing the materialized view service (608).

Yet another aspect of the invention provides that analyzing and visualizing the data (308) further comprises the steps (700) of selecting the service required for the analytic requirement (702); and exposing views as a physical table for the data analytics and visualization module to design a data mart (704).

The present invention consists of features and a combination of parts hereinafter fully described and illustrated in the accompanying drawings, it being understood that various changes in the details may be made without departing from the scope of the invention or sacrificing any of the advantages of the present invention.

BRIEF DESCRIPTION OF ACCOMPANYING DRAWINGS

To further clarify various aspects of some embodiments of the present invention, a more particular description of the invention will be rendered by references to specific embodiments thereof, which are illustrated in the appended drawings. It is appreciated that these drawings depict only typical embodiments of the invention and are therefore not to be considered limiting of its scope. The invention will be described and explained with additional specificity and detail through the accompanying drawings.

Figure 1.0 illustrates a general system architecture of the present invention for analyzing heterogeneous data of health information.

Figure 1.0a illustrates general flow of the system architecture of the present invention for analyzing the heterogeneous data of health information.

Figure 2.0 is a flowchart illustrating a general methodology of the present invention for analyzing heterogeneous data of health information.

Figure 3.0 is a flowchart illustrating steps involved in correlating pseudonymized heterogeneous data through at least one data virtualization server.

Figure 4.0 is a flowchart illustrating further steps involved in configuring connection to at least one data source.

Figure 5.0 is a flowchart illustrating further steps involved in introspecting metadata and exposing the metadata as physical table.

Figure 6.0 is a flowchart illustrating further steps involved in combining, integrating, transforming and cleansing source view as canonical model views of data for publishing the data as structured query language, SQL views or other data service format.

Figure 7.0 is a flowchart illustrating further steps involved in analyzing and visualizing the data consumed using Java Database Connectivity, JDBC, as connector.

DETAILED DESCRIPTION OF THE PREFERRED EMBODIMENTS

The present invention relates to a system and method for analyzing heterogeneous data from disparate data sources (102) by utilizing data virtualization components. In particular, the present invention utilizes a data virtualization server (108) for correlating heterogeneous data pertaining to health information from the disparate data sources (102) by utilizing a plurality of components (108A, 108B and 108C) within the data virtualization server (108). Hereinafter, it is to be understood that limiting the description to the preferred embodiments of the invention is merely to facilitate discussion of the present invention, and it is envisioned that modifications may be made without departing from the scope of the appended claims.

Reference is first made to Figure 1.0 and Figure 1.0a. Figure 1.0 is a general system architecture for analyzing heterogeneous data of health information, while Figure 1.0a illustrates the general flow of the system architecture. As illustrated in Figure 1.0, the system (100) of the present invention comprises at least one file server at a client side for providing access to files within the network server at the client side. Further, the system (100) also provides at least one data source (102) for obtaining heterogeneous data, whereby the heterogeneous data comprises structured data (102A) and unstructured data (102B). The structured data (102A) and the unstructured data (102B) are obtained from disparate sources.

Further, the system (100) of the present invention also comprises a Privacy Assurance Services, PAS, component (104) for conducting pseudonymization on the structured data (102A) and the unstructured data (102B) to mask personal identification information within the structured and unstructured data (102A, 102B). The present invention also provides at least one data store (106) having a plurality of databases (106A, 106B) for storing the structured and the unstructured data (102A, 102B) respectively. The structured data (102A) is stored in at least one data warehouse (106A), which leverages a relational database management system, RDBMS, platform, while the unstructured data (102B) is stored in at least one data lake (106B) by utilizing third party components such as Hadoop and Filesystem. The data lake (106B) stores the unstructured data (102B) with multiple hierarchies in a semi-structured data format.

The system (100) of the present invention also comprises at least one data virtualization server (108) having a plurality of components (108A, 108B, 108C) for collecting, harnessing and storing the structured and unstructured data (102A, 102B) into a unified view such as a physical table. The plurality of components within the data virtualization server (108) are at least one data connector (108A), at least one data composer (108B) and at least one data consumption (108C). The data connector (108A) supports a plurality of database connection types, which may comprise a relational database management system, RDBMS. The RDBMS may include PostgreSQL and MySQL, whereas NoSQL, an alternative to the traditional relational database, may include MongoDB and Hadoop. The data connector (108A) is utilized by the present invention as a connector for connecting to the data store (106).

Further, the present invention provides that the data composer (108B) is utilized for composing a virtual database schema, VDS, that integrates a plurality of data sources of the structured data (102A) and the unstructured data (102B). The VDS comprises a set of metadata which represents the data sources (102). The present invention also provides that the data consumption (108C) is utilized for exposing the VDS through a Representational State Transfer application program interface, REST API; Java Database Connectivity, JDBC; or Open Database Connectivity, ODBC.

Reference is now made to Figure 2.0. Figure 2.0 is a flowchart illustrating a general methodology (200) for analyzing heterogeneous data of health information. As illustrated in Figure 2.0, analyzing heterogeneous data is initiated by collecting heterogeneous data from disparate data sources (102), whereby the heterogeneous data is the structured data (102A) and the unstructured data (102B) (202). The structured and the unstructured data (102A, 102B) obtained from the disparate data sources (102) are subsequently sent to the file server in a readable format for structuring data, including JavaScript Object Notation, JSON, format, whereby JSON is a lightweight data-interchange format (204). Thereafter, the structured and unstructured data (102A, 102B) are pseudonymized to mask personal identification information by utilizing PAS (104) (206).
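The specification does not disclose how the PAS component (104) masks identifiers. The following is a minimal Python sketch of one common approach, keyed hashing with HMAC-SHA256, offered only as an illustration of step 206. The field names in `PII_FIELDS`, the sample record and the key are assumptions and do not come from the specification.

```python
import hashlib
import hmac

# Hypothetical list of fields treated as personal identification information.
PII_FIELDS = ("patient_name", "national_id")


def pseudonymize(record: dict, secret_key: bytes) -> dict:
    """Return a copy of the record with PII fields replaced by keyed hashes."""
    masked = dict(record)
    for field in PII_FIELDS:
        if field in masked:
            digest = hmac.new(
                secret_key, str(masked[field]).encode("utf-8"), hashlib.sha256
            )
            masked[field] = digest.hexdigest()
    return masked


# Illustrative record in the JSON-like shape sent to the file server (step 204).
record = {
    "patient_name": "Jane Doe",
    "national_id": "A1234567",
    "diagnosis": "hypertension",
}
masked = pseudonymize(record, secret_key=b"demo-key-not-for-production")
```

Because the same key and input always yield the same pseudonym, records about one patient from different sources can still be matched after masking, which the later correlation step relies on.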

The pseudonymized structured and unstructured data (102A, 102B) are sent to the data store (106), whereby the pseudonymized structured data (102A) are stored directly in the data warehouse (106A) and the pseudonymized unstructured data (102B) with multiple hierarchies are stored using a document-based database in the data lake (106B) in a semi-structured format such as JSON or comma-separated values, CSV (208). Subsequently, the pseudonymized structured and unstructured data (102A, 102B) from the data warehouse (106A) and the data lake (106B) are correlated through the data virtualization server (108) (210). Finally, the data correlated through the data virtualization server (108) are sent to the data analytics and visualization module (110) for data analysis and visualization (212).

Reference is now made to Figure 3.0. Figure 3.0 is a flowchart illustrating further steps (300) involved in correlating pseudonymized structured and unstructured data (102A, 102B) through the data virtualization server (108) (210). As illustrated in Figure 3.0, correlating pseudonymized structured and unstructured data (102A, 102B) through the data virtualization server (108) is initiated by configuring connection to the data source (102) (302). The step is followed by introspecting metadata of the structured data (102A) and the unstructured data (102B) and exposing the metadata as a physical table (304). Thereafter, upon exposing the metadata as a physical table, the source views of the metadata are combined, integrated, transformed and cleansed as canonical model views of data for publishing as SQL views or another service format (306). Finally, the JDBC connector is utilized to consume the metadata for analysis and visualization of the structured data (102A) and the unstructured data (102B) through the data analytics and visualization module (110) (308). Each step is clarified in detail in further embodiments of the present invention.
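The correlation of step 210 can be illustrated with a small Python sketch. It assumes that warehouse rows and lake documents share a pseudonym key, which the specification does not state explicitly. An in-memory SQLite table stands in for the data warehouse (106A) and a list of parsed JSON documents stands in for the data lake (106B). All table, column and field names are hypothetical.

```python
import json
import sqlite3

# Stand-in for the data warehouse (106A): an in-memory relational table.
warehouse = sqlite3.connect(":memory:")
warehouse.execute("CREATE TABLE admissions (patient_pseudonym TEXT, ward TEXT)")
warehouse.executemany(
    "INSERT INTO admissions VALUES (?, ?)",
    [("p-001", "cardiology"), ("p-002", "oncology")],
)

# Stand-in for the data lake (106B): semi-structured JSON documents.
lake_documents = [
    json.loads('{"patient_pseudonym": "p-001", "wearable": {"heart_rate": [72, 75, 71]}}'),
    json.loads('{"patient_pseudonym": "p-003", "wearable": {"heart_rate": [64]}}'),
]


def correlate(conn, documents):
    """Join warehouse rows to lake documents on the shared pseudonym."""
    docs_by_key = {doc["patient_pseudonym"]: doc for doc in documents}
    rows = conn.execute(
        "SELECT patient_pseudonym, ward FROM admissions ORDER BY patient_pseudonym"
    )
    unified = []
    for pseudonym, ward in rows:
        doc = docs_by_key.get(pseudonym)
        if doc is None:
            continue  # no lake document for this patient
        readings = doc["wearable"]["heart_rate"]
        unified.append(
            {
                "patient_pseudonym": pseudonym,
                "ward": ward,
                "mean_heart_rate": sum(readings) / len(readings),
            }
        )
    return unified


unified_view = correlate(warehouse, lake_documents)
```

Only the patient present in both stores appears in `unified_view`. A real data virtualization server would perform this join at query time through its connectors rather than in application code.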

Reference is now made to Figure 4.0. Figure 4.0 is a flowchart illustrating the further steps (400) involved in configuring connection to the data source (102) (302). As illustrated in Figure 4.0, configuring connection to the data source (102) is initiated by selecting the required data connector from a list of data connector adapters (402), whereby the list comprises connectors for RDBMS and non-RDBMS database types including MySQL, PostgreSQL and JDBC, among others. The connector is selected by a system administrator. Upon selecting the required connector, the system administrator may configure the connection (404) to the RDBMS by utilizing the conventional JDBC or ODBC protocol. The system administrator may also utilize the Open Data Protocol, OData, for connecting to a non-RDBMS. Subsequently, the data source (102) destination information is specified (406). The data source (102) information may include server address and database properties, whereby the database properties may include type, time zone and schema, among others. Upon specifying the data source (102), the system administrator configures and publishes the connection information (408).
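Steps 402 to 408 can be sketched in Python as a small configuration object. This is an assumed data model for illustration, not the interface of any particular data virtualization product. The `CONNECTOR_ADAPTERS` mapping, the `ConnectionConfig` class, the host name and the property values are all hypothetical. The URL templates follow the usual JDBC URL shape for PostgreSQL and MySQL.

```python
from dataclasses import dataclass, field

# Hypothetical adapter list (step 402), keyed by database type.
CONNECTOR_ADAPTERS = {
    "postgresql": "jdbc:postgresql://{host}:{port}/{database}",
    "mysql": "jdbc:mysql://{host}:{port}/{database}",
}


@dataclass
class ConnectionConfig:
    adapter: str                                      # selected connector (402)
    host: str                                         # server address (406)
    port: int
    database: str
    properties: dict = field(default_factory=dict)    # e.g. time zone, schema (406)
    published: bool = False

    def url(self) -> str:
        """Build the connection URL for the selected adapter (404)."""
        if self.adapter not in CONNECTOR_ADAPTERS:
            raise ValueError(f"unknown adapter: {self.adapter}")
        return CONNECTOR_ADAPTERS[self.adapter].format(
            host=self.host, port=self.port, database=self.database
        )

    def publish(self) -> "ConnectionConfig":
        """Validate the configuration and mark it as published (408)."""
        self.url()  # raises ValueError if the adapter is not in the list
        self.published = True
        return self


config = ConnectionConfig(
    adapter="postgresql",
    host="warehouse.example.org",
    port=5432,
    database="health",
    properties={"time_zone": "Asia/Kuala_Lumpur", "schema": "public"},
).publish()
```

Validating the adapter before publishing means a misconfigured source fails at configuration time rather than when the first query runs.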

Reference is now made to Figure 5.0. Figure 5.0 is a flowchart illustrating further steps (500) involved in introspecting the metadata and exposing the metadata as a physical table (304), whereby the metadata is a set of data that provides information about the data source (102). As illustrated in Figure 5.0, the step is initiated by configuring the virtual database schema, VDS, properties including schema name, connection source and version, and subsequently selecting the connection source as defined in the VDS properties (502). Upon selecting the defined connection source, the metadata is imported from the data sources (102) and defined by utilizing computer readable instructions which include Data Definition Language (504). After the metadata is imported and defined using Data Definition Language, the VDS is created and deployed (506).
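As a rough illustration of steps 504 and 506, the Python sketch below introspects the column metadata of a source table and generates a Data Definition Language statement that exposes it under a schema-prefixed view. SQLite stands in for the data source, and the helper names `introspect` and `build_view_ddl`, the table and the `vds` prefix are assumptions made for the example.

```python
import sqlite3

# Stand-in data source with one table whose metadata will be imported.
source = sqlite3.connect(":memory:")
source.execute(
    "CREATE TABLE admissions (patient_pseudonym TEXT, ward TEXT, admitted_on TEXT)"
)


def introspect(conn, table):
    """Return (column name, declared type) pairs for a source table."""
    # PRAGMA table_info yields rows of (cid, name, type, notnull, dflt_value, pk).
    # The table name is interpolated directly, so it must come from trusted input.
    return [(row[1], row[2]) for row in conn.execute(f"PRAGMA table_info({table})")]


def build_view_ddl(schema_name, table, columns):
    """Generate DDL that exposes the introspected columns as a view."""
    column_list = ", ".join(name for name, _ in columns)
    return f"CREATE VIEW {schema_name}_{table} AS SELECT {column_list} FROM {table}"


columns = introspect(source, "admissions")        # import metadata (504)
ddl = build_view_ddl("vds", "admissions", columns)
source.execute(ddl)                               # create and deploy (506)
```

The view holds no data of its own. It is defined entirely by metadata, which matches the idea of a virtual schema that represents the source without copying it.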

Reference is now made to Figure 6.0. Figure 6.0 is a flowchart illustrating further steps (600) involved in combining, integrating, transforming and cleansing source views as canonical model views of the structured and unstructured data (102A, 102B) for publishing the data in a data service format such as structured query language, SQL, views (306). As illustrated in Figure 6.0, the step is initiated by creating an extract, transform and load transformation for combining and integrating the metadata from the VDS to accommodate data analytics (602). Thereafter, the output of the extract, transform and load transformation is generated as a materialized view to be consumed by the data analytics and visualization module (110) (604). Subsequently, the materialized view is exposed as a service (606), which is then deployed and published by the system administrator (608).
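Steps 602 and 604 can be sketched in Python with SQLite. SQLite has no native materialized views, so the sketch emulates one by writing the query result into a table that is rebuilt on refresh. This is a substitution made for illustration. The tables, columns and sample values are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE admissions (patient_pseudonym TEXT, ward TEXT);
    CREATE TABLE lab_results (patient_pseudonym TEXT, test TEXT, value REAL);
    INSERT INTO admissions VALUES ('p-001', ' Cardiology '), ('p-002', 'oncology');
    INSERT INTO lab_results VALUES
        ('p-001', 'hba1c', 6.1), ('p-001', 'hba1c', 6.5), ('p-002', 'hba1c', NULL);
    """
)


def refresh_materialized_view(conn):
    """Rebuild the emulated materialized view from the source tables."""
    conn.execute("DROP TABLE IF EXISTS mv_ward_hba1c")
    conn.execute(
        """
        CREATE TABLE mv_ward_hba1c AS
        SELECT lower(trim(a.ward)) AS ward,          -- cleanse: normalize ward names
               avg(l.value)        AS mean_hba1c,    -- transform: aggregate per ward
               count(l.value)      AS n_results
        FROM admissions AS a
        JOIN lab_results AS l                         -- combine: join the two sources
          ON l.patient_pseudonym = a.patient_pseudonym
        WHERE l.value IS NOT NULL                     -- cleanse: drop missing values
        GROUP BY lower(trim(a.ward))
        """
    )
    conn.commit()


refresh_materialized_view(conn)
rows = conn.execute(
    "SELECT ward, mean_hba1c, n_results FROM mv_ward_hba1c ORDER BY ward"
).fetchall()
```

Storing the result means the analytics module reads precomputed rows instead of re-running the join, at the cost of having to refresh the table when the sources change.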

Reference is now made to Figure 7.0. Figure 7.0 is a flowchart illustrating further steps (700) involved in analyzing and visualizing the structured and unstructured data (102A, 102B) consumed using Java Database Connectivity, JDBC, as connector (308). As illustrated in Figure 7.0, the system administrator first selects the materialized view service required for the analytic requirement (702), utilizing the JDBC connector from the data analytics and visualization module (110) to establish the connection. Finally, views of the data are exposed as a physical table for consumption by the data analytics and visualization module (110) to design a data mart (704), whereby the data mart is a subset of the data warehouse (106A).
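A minimal Python sketch of steps 702 and 704 follows. It assumes a simple catalogue of published services and uses Python's `sqlite3` module in place of a JDBC connection. The `PUBLISHED_SERVICES` mapping, the `build_data_mart` helper and the sample data are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE mv_ward_admissions (ward TEXT, year INTEGER, admissions INTEGER);
    INSERT INTO mv_ward_admissions VALUES
        ('cardiology', 2018, 120), ('cardiology', 2019, 135), ('oncology', 2019, 80);
    """
)

# Hypothetical catalogue mapping published service names to the views behind them.
PUBLISHED_SERVICES = {"ward_admissions": "mv_ward_admissions"}


def build_data_mart(conn, service, ward):
    """Select a published service (702) and copy one ward's rows into a mart (704)."""
    view = PUBLISHED_SERVICES[service]  # raises KeyError if the service is not published
    mart = f"mart_{ward}"               # identifiers are interpolated: trusted input only
    conn.execute(f"CREATE TABLE {mart} (year INTEGER, admissions INTEGER)")
    conn.execute(
        f"INSERT INTO {mart} SELECT year, admissions FROM {view} WHERE ward = ?",
        (ward,),
    )
    return mart


mart_name = build_data_mart(conn, "ward_admissions", "cardiology")
mart_rows = conn.execute(
    f"SELECT year, admissions FROM {mart_name} ORDER BY year"
).fetchall()
```

The resulting table holds only the rows relevant to one analytic requirement, which is the sense in which a data mart is a subset of the wider store.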

In summary, the present invention relates to a system and method for analyzing heterogeneous data from disparate data sources (102) by utilizing data virtualization components. In particular, the present invention utilizes a data virtualization server (108) for correlating heterogeneous data pertaining to health information from the disparate data sources (102) by utilizing a plurality of components (108A, 108B and 108C) within the data virtualization server (108). The heterogeneous data comprises structured data (102A) and unstructured data (102B). The plurality of components (108A, 108B and 108C) of the data virtualization server (108) are at least one data connector (108A), at least one data composer (108B) and at least one data consumption (108C). The structured and unstructured data (102A, 102B) obtained from the disparate data sources (102) are pseudonymized to mask personal identification information and subsequently stored in at least one data warehouse (106A) and at least one data lake (106B) respectively. The structured data (102A) stored in the data warehouse (106A) and the unstructured data (102B) stored in the data lake (106B) are correlated so that they can be analyzed and visualized through the data analytics and visualization module (110).

Unless the context requires otherwise or specifically stated to the contrary, integers, steps or elements of the invention recited herein as singular integers, steps or elements clearly encompass both singular and plural forms of the recited integers, steps or elements. Throughout this specification, unless the context requires otherwise, the word “comprise”, or variations such as “comprises” or “comprising”, will be understood to imply the inclusion of a stated step or element or integer or group of steps or elements or integers, but not the exclusion of any other step or element or integer or group of steps, elements or integers. Thus, in the context of this specification, the term “comprising” is used in an inclusive sense and thus should be understood as meaning “including principally, but not necessarily solely”.

Claims

1. A system (100) for analyzing heterogeneous data utilizing data virtualization components comprising:

at least one file server at a client side for providing access of files;

at least one data source module (102) for providing heterogeneous data to be aggregated;

a Privacy Assurance Services component (104) for conducting pseudonymization on the heterogeneous data to mask personal identification information;

at least one data store (106) having a plurality of database for storing the heterogeneous data;

at least one data virtualization server (108) having a plurality of components for collecting, harnessing and storing the heterogeneous data into a unified view; and

at least one data analytics and visualization module (110) for composing and exposing heterogeneous data from a virtual database within the data virtualization server (108) for analyzing and visualizing the heterogeneous data,

characterized in that

the data virtualization server (108) further comprising:

at least one data connector (108A) for connecting to the data store (106) and supports a plurality of database connection;

at least one data composer (108B) for composing a virtual database schema that integrates multiple data sources; and

at least one data consumption (108C) for exposing a virtual database schema.

2. The system (100) according to claim 1, wherein the heterogeneous data further comprises structured data (102A) and unstructured data (102B).

3. The system (100) according to claim 2, wherein the unstructured data (102B) further comprises semi-structured data.

4. The system (100) according to claim 1, wherein the data store (106) further comprises:

at least one data warehouse (106A) for storing the structured data (102A); and

at least one data lake (106B) for storing the unstructured data (102B) and semi-structured data.

5. The system (100) according to claim 1, wherein the unified view enables views of the heterogeneous data in physical table.

6. The system (100) according to claim 1, wherein the plurality of database connection includes a relational database management system.

7. The system (100) according to claim 1, wherein the virtual database schema further comprising a set of metadata representing the data source (102).

8. The system (100) according to claim 1, wherein the data analytic and visualization module (110) further configured to retrieving, analyzing, transforming and reporting of the heterogeneous data.

9. A method (200) for analyzing heterogeneous data utilizing data virtualization components comprises steps of:

collecting heterogeneous data from at least one data source (202);

sending heterogeneous data to at least one file server as readable format for structuring data (204);

pseudonymizing heterogeneous data through Privacy Assurance Services component (206);

sending pseudonymized heterogeneous data to at least one data store (208);

correlating pseudonymized heterogeneous data through at least one data virtualization server (210); and

sending the correlated data to at least one data analytics and visualization module (212).

10. The method (200) according to claim 9, wherein correlating pseudonymized heterogeneous data through at least one data virtualization server (210) further comprising steps of (300):

configuring connection to the data source (302);

introspecting metadata of structured and unstructured data (102A and 102B) and exposing the metadata as physical table (304);

combining, integrating, transforming and cleansing source view as canonical model views of data for publishing the data as data service format (306); and

analyzing and visualizing the data (308).

11. The method (200) according to claim 10, wherein configuring connection to the data source (302) further comprising steps of (400):

selecting required connector from a list of data connector adapter (402);

configuring connection to relational database management (404);

specifying the data source destination information including server address and database properties (406); and

configuring and publishing connection information (408).

12. The method (200) according to claim 10, wherein introspecting metadata of structured and unstructured data (102A and 102B) and exposing the metadata as physical table (304) further comprising steps of (500):

configuring virtual database schema properties including schema name, connection source and version prior for selecting defined connection source (502);

importing and defining metadata from data sources by utilizing machine readable instruction for creating and deploying a virtual schema database (504); and

creating and deploying the virtual schema database (506).

13. The method (200) according to claim 10, wherein combining, integrating, transforming and cleansing source view as canonical model views of data for publishing the data as data service format (306) further comprising steps of (600):

creating an extract, transform and load transformation for combining and integrating the metadata from the virtual database schema to accommodate data analytic (602);

generating output of extract, transform and load transformation as a materialized view to be consumed by data analytics and visualization module (604);

exposing the materialized view as service (606); and

deploying and publishing the materialized view service (608).

14. The method according to claim 10, wherein analyzing and visualizing the data (308) further comprising steps of (700):

selecting the service required for analytic requirement (702); and

exposing views as physical table for data analytics and visualization module to design a data mart (704).