WO2020139079A1 - System and method for analyzing heterogeneous data by utilizing data virtualization components - Google Patents

System and method for analyzing heterogeneous data by utilizing data virtualization components

Info

Publication number
WO2020139079A1
Authority
WO
WIPO (PCT)
Prior art keywords
data
heterogeneous
structured
source
database
Application number
PCT/MY2019/050135
Other languages
French (fr)
Inventor
Mohamad Zakaria Bin ALLI
Wan Zawawi Bin MD ZIN
Hooi Hwa LIM
Original Assignee
Mimos Berhad
Application filed by Mimos Berhad
Publication of WO2020139079A1

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00: Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20: Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/25: Integrating or interfacing systems involving database management systems
    • G06F16/256: Integrating or interfacing systems involving database management systems in federated or virtual databases

Definitions

  • Figure 1.0 illustrates a general system architecture of the present invention for analyzing heterogeneous data of health information.
  • Figure 1.0a illustrates general flow of the system architecture of the present invention for analyzing the heterogeneous data of health information.
  • Figure 2.0 is a flowchart illustrating a general methodology of the present invention for analyzing heterogeneous data of health information.
  • Figure 3.0 is a flowchart illustrating steps involved in correlating pseudonymized heterogeneous data through at least one data virtualization server.
  • Figure 4.0 is a flowchart illustrating further steps involved in configuring connection to at least one data source.
  • Figure 5.0 is a flowchart illustrating further steps involved in introspecting metadata and exposing the metadata as physical table.
  • Figure 6.0 is a flowchart illustrating further steps involved in combining, integrating, transforming and cleansing source view as canonical model views of data for publishing the data as structured query language (SQL) views or other data service format.
  • Figure 7.0 is a flowchart illustrating further steps involved in analyzing and visualizing the data consumed using Java Database Connectivity (JDBC) as the connector.
  • DETAILED DESCRIPTION OF THE PREFERRED EMBODIMENTS
  • the present invention relates to a system and method for analyzing heterogeneous data from disparate data sources (102) by utilizing data virtualization components.
  • the present invention utilizes a data virtualization server (108) for correlating heterogeneous data from disparate data sources (102) pertaining to health information by utilizing a plurality of components (108A, 108B and 108C) within the data virtualization server (108).
  • Figure 1.0 is a general system architecture for analyzing heterogeneous data of health information while Figure 1.0a illustrates general flows of the system architecture.
  • the system of the present invention comprises at least one file server at a client side for providing access to files within the network server.
  • the system of the present invention also provides at least one data source (102) for obtaining heterogeneous data, whereby the heterogeneous data comprises structured data (102A) and unstructured data (102B).
  • the present invention provides that the structured data (102A) and the unstructured data (102B) are obtained from the disparate sources.
  • the system (100) of the present invention also comprises a Privacy Assurance Services (PAS) component (104) for conducting pseudonymization on the structured data (102A) and the unstructured data (102B) to mask personal identification information within the structured and unstructured data (102A, 102B).
  • the present invention also provides at least one data store (106) having a plurality of databases (106A, 106B) for storing the structured and the unstructured data (102A, 102B) accordingly.
  • the structured data (102A) is stored in at least one data warehouse (106A), which leverages a relational database management system (RDBMS) platform, while the unstructured data (102B) is stored in at least one data lake (106B) by utilizing third-party components such as Hadoop and Filesystem.
  • the data lake (106B) stores the unstructured data (102B) with multiple hierarchies in semi-structured data format.
  • the system (100) of the present invention also comprises at least one data virtualization server (108) having a plurality of components (108A, 108B, 108C) for collecting, harnessing and storing the structured and unstructured data (102A, 102B) into a unified view such as a physical table.
  • the present invention provides that the plurality of components within the data virtualization server (108) are at least one data connector (108A), at least one data composer (108B) and at least one data consumption (108C).
  • the present invention provides that the data connector (108A) supports a plurality of database connection types, which may comprise a relational database management system (RDBMS).
  • the RDBMS may include PostgreSQL and MySQL, whereby NoSQL is an alternative to traditional relational databases and may include MongoDB and Hadoop.
  • the data connector (108A) is utilized by the present invention as a connector for connecting to the data store (106).
  • the present invention provides that the data composer (108B) is utilized for composing a virtual database schema (VDS) that integrates a plurality of data sources of the structured data (102A) and the unstructured data (102B).
  • the VDS comprises a set of metadata which represents the data sources (102).
  • the present invention also provides that the data consumption (108C) is utilized for exposing the VDS as a Representational State Transfer application program interface (REST API), Java Database Connectivity (JDBC), or Open Database Connectivity (ODBC).
  • Figure 2.0 is a flowchart illustrating a general methodology (200) for analyzing heterogeneous data of health information.
  • analyzing heterogeneous data is initiated by collecting heterogeneous data from disparate data sources (102), whereby the heterogeneous data comprises the structured data (102A) and the unstructured data (102B) (202).
  • the structured and the unstructured data (102A, 102B) obtained from the disparate data sources (102) are subsequently sent to the file server in a readable format for structuring data, including JavaScript Object Notation (JSON) format, whereby JSON is a lightweight data-interchange format (204).
  • the structured and unstructured data (102A, 102B) are pseudonymized for masking personal identification information by utilizing PAS (104) (206).
  • the pseudonymized structured and unstructured data (102A, 102B) are sent to the data store (106), whereby the pseudonymized structured data (102A) are stored directly in the data warehouse (106A) and the pseudonymized unstructured data (102B) with multiple hierarchies are stored using a document-based database in the data lake (106B) in a semi-structured format such as JSON or comma-separated values (CSV) (208). Subsequently, the pseudonymized structured and unstructured data (102A, 102B) from the data warehouse (106A) and the data lake (106B) are correlated through the data virtualization server (108) (210).
  • Figure 3.0 is a flowchart illustrating further steps (300) involved in correlating pseudonymized structured and unstructured data (102A, 102B) through the data virtualization server (108) (210).
  • correlating pseudonymized structured and unstructured data (102A, 102B) through the data virtualization server (108) is first initiated by configuring connection to the data source (102) (302).
  • the step is followed by introspecting metadata of the structured data (102A) and the unstructured data (102B) and exposing the metadata as physical table (304).
  • Figure 4.0 is a flowchart illustrating the further steps (400) involved in configuring connection to the data source (102) (302).
  • the step to configure connection to the data source (102) is initiated by selecting the required data connector from a list of data connector adapters (402), whereby the list comprises connectors for RDBMS and non-RDBMS database types including MySQL, PostgreSQL, JDBC, etc.
  • the connectors are selected by a system administrator.
  • the system administrator may configure connection (404) to RDBMS by utilizing conventional JDBC or ODBC protocol.
  • the system administrator may utilize Open Data Protocol, OData for connecting to non-RDBMS.
  • the step is followed by specifying the data source (102) destination information (406).
  • the data source (102) information may include server address and database properties, whereby the database properties may include type, time zone, schema, etc.
  • the system administrator subsequently configures and publishes the connection information (408).
  • Figure 5.0 is a flowchart illustrating further steps (500) involved in introspecting the metadata and exposing the metadata as physical table (304) whereby the metadata is a set of data that provides information of the data source (102).
  • the steps for introspecting the metadata and exposing the metadata as physical table are initiated by first configuring virtual database schema (VDS) properties including schema name, connection source and version, and subsequently selecting the connection source as defined in the VDS properties (502).
  • the metadata is imported from the data sources (102) and defined by utilizing computer readable instructions which include Data Definition Language (504).
  • VDS is created and deployed (506).
  • Figure 6.0 is a flowchart illustrating further steps (600) involved in combining, integrating, transforming and cleansing source view as canonical model views of the structured and unstructured data (102A, 102B) for publishing the data as data service format such as structured query language (SQL) views (306).
  • the step is initiated by first creating an extract, transform and load transformation for combining and integrating the metadata from the VDS to accommodate data analytics (602). Thereafter, the output from the extract, transform and load transformation is generated as a materialized view to be consumed by the data analytics and visualization module (110) (604). Subsequently, the materialized view is exposed as a service (606), whereby the service is further deployed and published by the system administrator (608).
  • Figure 7.0 is a flowchart illustrating further steps (700) involved in analyzing and visualizing the structured and unstructured data (102A, 102B) consumed using Java Database Connectivity (JDBC) as the connector (308).
  • the system administrator first selects the materialized view service which is required for the analytic requirement (702) by utilizing a JDBC connector from the data analytics and visualization module (110) to establish a connection.
  • views of the data are exposed as physical tables for consumption by the data analytics and visualization module (110) to design a data mart (704), whereby the data mart is a subset of the data warehouse (106A).
  • the heterogeneous data comprises structured data (102A) and unstructured data (102B).
  • the present invention provides that the plurality of components (108A, 108B and 108C) of the data virtualization server (108) are at least one data connector (108A), at least one data composer (108B) and at least one data consumption (108C).
  • the structured and unstructured data (102A, 102B) obtained from disparate data sources (102) are pseudonymized for masking personal identification information of the data and subsequently stored in at least one data warehouse (106A) and at least one data lake (106B) accordingly.
  • the structured data (102A) stored in the data warehouse (106A) and the unstructured data (102B) stored in the data lake (106B) are correlated to be analyzed and visualized through the data analytics and visualization module (110).

Landscapes

  • Engineering & Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The present invention relates to a system (100) and method (200) for analyzing heterogeneous data from disparate data sources (102) by utilizing data virtualization components. In particular, the present invention utilizes a data virtualization server (108) for correlating heterogeneous data from disparate data sources (102) pertaining to health information. The data virtualization server (108) utilizes a plurality of components (108A, 108B and 108C) within the data virtualization server (108), namely at least one data connector (108A), at least one data composer (108B) and at least one data consumption (108C). The heterogeneous data, comprising structured data (102A) and unstructured data (102B), are pseudonymized and stored in at least one data warehouse (106A) and at least one data lake (106B) accordingly. The structured data (102A) stored in the data warehouse (106A) and the unstructured data (102B) stored in the data lake (106B) are correlated to be analyzed and visualized through the data analytics and visualization module (110).

Description

SYSTEM AND METHOD FOR ANALYZING HETEROGENEOUS DATA BY UTILIZING DATA VIRTUALIZATION COMPONENTS
FIELD OF INVENTION
The present invention relates to a system and method for analyzing heterogeneous data from disparate data sources by utilizing data virtualization components. In particular, the present invention utilizes a data virtualization server for correlating heterogeneous data pertaining to health information by utilizing a plurality of components within the data virtualization server.
BACKGROUND OF INVENTION
Data can be obtained through various sources such as websites, interdisciplinary research, documents, e-mails, and media content. The data is generally generated in structured and unstructured forms, whereby structured data is data stored in a relational database in predefined fixed fields. The structured data is searchable through queries and search operations or algorithms by utilizing the field names.
In contrast, the unstructured data is not stored or structured in a predefined manner. However, the unstructured data may have an internal structure. Generally, the unstructured data is more complicated to search and analyze than the structured data. Furthermore, large volumes of structured and unstructured data, also known as big data, lead to difficulty in data analytics. The structured and unstructured data generally have to be integrated in order to support better decision-making, whereby the available data has to be utilized and analyzed thoroughly.
Currently in the medical field, the amount of data being generated is enormous, as the data is generated through hospital information systems as well as other external sources such as medical histories from other hospitals and personal medical wearables. This leads to difficulties and challenges in compiling and analyzing the data.
United States of America Patent No. US 8250026 B2 (hereinafter referred to as the US 026 B2 Patent) entitled “Combining Medical Information Captured in Structured and Unstructured Data Formats for Use or Display in a User Application, Interface, or View” having a filing date of 6 March 2009 (Patentee: PeopleChart Corp) discloses that structured and unstructured data obtained from disparate sources are transformed and correlated, whereby the unstructured data is visualized by indices into an individual image report table. The US 026 B2 Patent also discloses that the data are converted into a common schema.
United States Patent No. US 7849048 B2 (hereinafter referred to as the US 048 B2 Patent) entitled “System and Method of Making Unstructured Data Available to Structured Data Analysis Tools” having a filing date of 5 July 2005 (Patentee: Clarabridge Inc.) utilizes a natural-language processing transformation tool to extract sentences from the copy of unstructured data. The US 048 B2 Patent also discloses that all documents are assigned a unique key which can be used to identify the document and data derived from the document throughout the entire system and can be used to reference back to the original document in the original source.
United States Patent No. US 7668849 B1 (hereinafter referred to as the US 849 B1 Patent) entitled “Method and System for Processing Structured Data and Unstructured Data” having a filing date of 9 December 2005 (Patentee: Clarabridge Inc.) discloses a system and method for processing structured and unstructured data. The US 849 B1 Patent utilizes a Component Integration Services (CIS) gateway, which is a set of connectivity tools for accessing data within a heterogeneous environment. The US 849 B1 Patent also discloses that the structured and unstructured data are correlated and integrated through links.
As outlined above, various systems and methods have been developed to provide analysis of the structured and unstructured data. However, it is desirable to provide correlation or integration of both structured and unstructured data and to further analyze the same through a data analytics and visualization module.
SUMMARY OF INVENTION
The present invention relates to a system and method for analyzing heterogeneous data from disparate data sources by utilizing data virtualization components. In particular, the present invention utilizes a data virtualization server for correlating heterogeneous data from disparate data sources pertaining to health information by utilizing a plurality of components within the data virtualization server.
One aspect of the invention provides a system (100) comprising at least one file server at a client side for providing access to files; at least one data source module (102) for providing heterogeneous data to be aggregated; a Privacy Assurance Services component (104) for conducting pseudonymization on the heterogeneous data to mask personal identification information; at least one data store (106) having a plurality of databases for storing heterogeneous data; at least one data virtualization server (108) having a plurality of components for collecting, harnessing and storing heterogeneous data into a unified view; and at least one data analytics and visualization module (110) for composing and exposing heterogeneous data from a virtual database within the data virtualization server (108) for analyzing and visualizing heterogeneous data.
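The pseudonymization performed by the Privacy Assurance Services component (104) can be illustrated with a minimal sketch. The specification does not state which masking technique is used; the keyed hash (HMAC-SHA256), the `SECRET_KEY` value, the `PII_FIELDS` set and the record layout below are all assumptions made for illustration only.

```python
import hashlib
import hmac

# Hypothetical secret; a real deployment would load this from a key store.
SECRET_KEY = b"example-only-key"

# Hypothetical set of fields treated as personal identification information.
PII_FIELDS = {"name", "national_id"}


def pseudonymize(record, key=SECRET_KEY):
    """Return a copy of the record with PII fields replaced by keyed-hash tokens."""
    masked = {}
    for field, value in record.items():
        if field in PII_FIELDS:
            digest = hmac.new(key, str(value).encode("utf-8"), hashlib.sha256)
            masked[field] = digest.hexdigest()[:16]  # shortened token for readability
        else:
            masked[field] = value
    return masked


patient = {"name": "Jane Doe", "national_id": "A1234567", "diagnosis": "asthma"}
masked_patient = pseudonymize(patient)
```

Because the same input and key always yield the same token, records belonging to one person can still be correlated after masking, which is the property the later correlation step relies on.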
The data virtualization server (108) further comprises at least one data connector (108A) for connecting to the data store (106) and supporting a plurality of database connections; at least one data composer (108B) for composing a virtual database schema that integrates multiple data sources; and at least one data consumption (108C) for exposing the virtual database schema.
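The division of labour among the three components can be sketched as follows. This is a rough illustration, not the claimed implementation: the class names, the in-memory SQLite database standing in for the data store, and the JSON string standing in for an exposed service are all assumptions.

```python
import json
import sqlite3


class DataConnector:
    """Stand-in for the data connector (108A): opens and tracks connections."""

    def __init__(self):
        self.connections = {}

    def connect(self, name, dsn):
        self.connections[name] = sqlite3.connect(dsn)
        return self.connections[name]


class DataComposer:
    """Stand-in for the data composer (108B): builds schema metadata."""

    def compose(self, connector):
        schema = {}
        for name, conn in connector.connections.items():
            rows = conn.execute(
                "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name"
            ).fetchall()
            schema[name] = [row[0] for row in rows]
        return schema


class DataConsumption:
    """Stand-in for data consumption (108C): serializes the schema for clients."""

    def expose(self, schema):
        return json.dumps(schema, sort_keys=True)


connector = DataConnector()
warehouse = connector.connect("warehouse", ":memory:")
warehouse.execute("CREATE TABLE patients (id INTEGER PRIMARY KEY, token TEXT)")
schema = DataComposer().compose(connector)
payload = DataConsumption().expose(schema)
```

Keeping connection handling, schema composition and exposure in separate classes mirrors the separation described above, so a new source type would only require changes to the connector.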
A further aspect of the invention provides that heterogeneous data further comprises structured data (102A) and unstructured data (102B).
Yet another aspect of the invention provides that the unstructured data (102B) further comprises semi-structured data.
Another aspect of the invention provides that the data store (106) further comprises at least one data warehouse (106A) for storing the structured data (102A); and at least one data lake (106B) for storing the unstructured data (102B) and semi-structured data.
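A minimal sketch of this routing is given below. The `kind` field, the `visits` table and the use of an in-memory SQLite database and a Python list as stand-ins for the data warehouse (106A) and data lake (106B) are assumptions for illustration.

```python
import json
import sqlite3

# Stand-in for the data warehouse (106A).
warehouse = sqlite3.connect(":memory:")
warehouse.execute("CREATE TABLE visits (patient_token TEXT, visit_date TEXT)")

# Stand-in for the data lake (106B): a list of JSON documents.
data_lake = []


def store(item):
    """Route one pseudonymized item to the warehouse or the lake."""
    if item["kind"] == "structured":
        warehouse.execute(
            "INSERT INTO visits VALUES (?, ?)",
            (item["patient_token"], item["visit_date"]),
        )
        return "warehouse"
    # Unstructured and semi-structured items are kept as JSON documents.
    data_lake.append(json.dumps(item, sort_keys=True))
    return "lake"


first = store({"kind": "structured", "patient_token": "t1", "visit_date": "2019-01-05"})
second = store({"kind": "unstructured", "patient_token": "t1", "note": "wheezing"})
```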
A further aspect of the invention provides that the unified view enables views of heterogeneous data in a physical table.
Yet another aspect of the invention provides that the plurality of database connections includes a relational database management system.
Still another aspect of the invention provides that the virtual database schema further comprises a set of metadata representing the data source (102).
Another aspect of the invention provides that the data analytics and visualization module (110) is further configured for retrieving, analyzing, transforming and reporting heterogeneous data.
A further aspect of the invention provides a method (200) for analyzing heterogeneous data utilizing data virtualization components comprising steps of collecting heterogeneous data from at least one data source (202); sending heterogeneous data to at least one file server in a readable format for structuring data (204); pseudonymizing heterogeneous data through a Privacy Assurance Services component (206); sending pseudonymized heterogeneous data to at least one data store (208); correlating pseudonymized heterogeneous data through at least one data virtualization server (210); and sending the correlated data to at least one data analytics and visualization module (212).
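The order of steps 202 to 212 can be sketched as a simple chain of functions. Everything below is hypothetical: the function names, the record fields (`patient_token`, `structured`, `note`) and the placeholder masking are illustrative assumptions, and each function is a toy stand-in for the corresponding component.

```python
import json


def collect(sources):                       # step 202
    return [record for source in sources for record in source]


def to_json(records):                       # step 204
    return [json.dumps(record, sort_keys=True) for record in records]


def pseudonymize(documents):                # step 206 (placeholder masking only)
    masked = []
    for document in documents:
        record = json.loads(document)
        if "name" in record:
            record["name"] = "***"
        masked.append(record)
    return masked


def store(records):                         # step 208
    return {
        "warehouse": [r for r in records if r.get("structured")],
        "lake": [r for r in records if not r.get("structured")],
    }


def correlate(data_store):                  # step 210
    lake_by_token = {r["patient_token"]: r for r in data_store["lake"]}
    return [
        {**row, "note": lake_by_token[row["patient_token"]]["note"]}
        for row in data_store["warehouse"]
        if row["patient_token"] in lake_by_token
    ]


PIPELINE = [collect, to_json, pseudonymize, store, correlate]


def run(sources):
    data = sources
    for step in PIPELINE:
        data = step(data)
    return data  # step 212 would hand this result to the analytics module


sources = [
    [{"patient_token": "t1", "name": "Jane", "structured": True, "age": 40}],
    [{"patient_token": "t1", "note": "wheezing", "structured": False}],
]
correlated = run(sources)
```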
Yet another aspect of the invention provides that correlating pseudonymized heterogeneous data through at least one data virtualization server (210) further comprises steps (300) of configuring connection to the data source (302); introspecting metadata of structured and unstructured data (102A and 102B) and exposing the metadata as physical table (304); combining, integrating, transforming and cleansing source view as canonical model views of data for publishing the data as data service format (306); and analyzing and visualizing the data (308).
Still another aspect of the invention provides that configuring connection to the data source (302) further comprises steps (400) of selecting the required connector from a list of data connector adapters (402); configuring connection to a relational database management system (404); specifying the data source destination information including server address and database properties (406); and configuring and publishing connection information (408).
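Steps 402 to 408 can be sketched as a small configuration routine. The adapter table, the `ConnectionInfo` fields, the server address and the in-memory `PUBLISHED` registry are hypothetical; they only illustrate the sequence of selecting an adapter, recording destination details and publishing the result.

```python
from dataclasses import dataclass, field

# Hypothetical adapter list (step 402), mapping an adapter name to a protocol.
ADAPTERS = {"postgresql": "jdbc", "mysql": "jdbc", "mongodb": "odata"}

# Hypothetical registry of published connections (step 408).
PUBLISHED = {}


@dataclass
class ConnectionInfo:
    name: str
    adapter: str
    protocol: str
    server_address: str
    properties: dict = field(default_factory=dict)


def configure_connection(name, adapter, server_address, **properties):
    if adapter not in ADAPTERS:                      # step 402: pick an adapter
        raise ValueError(f"unknown adapter: {adapter}")
    info = ConnectionInfo(                           # steps 404 and 406
        name=name,
        adapter=adapter,
        protocol=ADAPTERS[adapter],
        server_address=server_address,
        properties=properties,
    )
    PUBLISHED[name] = info                           # step 408: publish
    return info


info = configure_connection(
    "warehouse", "postgresql", "db.example.org:5432",
    schema="public", time_zone="UTC",
)
```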
Another aspect of the invention provides that introspecting metadata of structured and unstructured data (102A and 102B) and exposing the metadata as physical table (304) further comprises steps of (500) configuring virtual database schema properties including schema name, connection source and version prior to selecting a defined connection source (502); importing and defining metadata from data sources by utilizing machine readable instruction for creating and deploying a virtual schema database (504); and creating and deploying the virtual schema database (506).
A further aspect of the invention provides that combining, integrating, transforming and cleansing source view as canonical model views of data for publishing the data as data service format (306) further comprises steps of (600) creating an extract, transform and load transformation for combining and integrating the metadata from the virtual database schema to accommodate data analytic (602); generating output of the extract, transform and load transformation as a materialized view to be consumed by data analytics and visualization module (604); exposing the materialized view as service (606); and deploying and publishing the materialized view service (608).
Yet another aspect of the invention provides that analyzing and visualizing the data (308) further comprising steps of (700) selecting the service required for analytic requirement (702); and exposing views as physical table for data analytics and visualization module to design a data mart (704).
The present invention consists of features and a combination of parts hereinafter fully described and illustrated in the accompanying drawings, it being understood that various changes in the details may be made without departing from the scope of the invention or sacrificing any of the advantages of the present invention.
BRIEF DESCRIPTION OF ACCOMPANYING DRAWINGS
To further clarify various aspects of some embodiments of the present invention, a more particular description of the invention will be rendered by references to specific embodiments thereof, which are illustrated in the appended drawings. It is appreciated that these drawings depict only typical embodiments of the invention and are therefore not to be considered limiting of its scope. The invention will be described and explained with additional specificity and detail through the accompanying drawings.
Figure 1.0 illustrates a general system architecture of the present invention for analyzing heterogeneous data of health information.
Figure 1.0a illustrates general flow of the system architecture of the present invention for analyzing the heterogeneous data of health information.
Figure 2.0 is a flowchart illustrating a general methodology of the present invention for analyzing heterogeneous data of health information.
Figure 3.0 is a flowchart illustrating steps involved in correlating pseudonymized heterogeneous data through at least one data virtualization server.
Figure 4.0 is a flowchart illustrating further steps involved in configuring connection to at least one data source.
Figure 5.0 is a flowchart illustrating further steps involved in introspecting metadata and exposing the metadata as physical table.
Figure 6.0 is a flowchart illustrating further steps involved in combining, integrating, transforming and cleansing source view as canonical model views of data for publishing the data as structured query language, SQL views or other data service format.
Figure 7.0 is a flowchart illustrating further steps involved in analyzing and visualizing the data consumed by Java Database Connectivity, JDBC as connector.
DETAILED DESCRIPTION OF THE PREFERRED EMBODIMENTS
The present invention relates to a system and method for analyzing heterogeneous data from disparate data sources (102) by utilizing data virtualization components. In particular, the present invention utilizes a data virtualization server (108) for correlating heterogeneous data from disparate data sources (102) pertaining to health information by utilizing a plurality of components (108A, 108B and 108C) within the data virtualization server (108). Hereinafter, it is to be understood that limiting the description to the preferred embodiments of the invention is merely to facilitate discussion of the present invention, and it is envisioned that various modifications may be made without departing from the scope of the appended claims.
Reference is first made to Figure 1.0 and Figure 1.0a. Figure 1.0 is a general system architecture for analyzing heterogeneous data of health information while Figure 1.0a illustrates general flows of the system architecture. As illustrated in Figure 1.0, the system of the present invention comprises at least one file server at a client side for providing access to files within the network server. Further, the system of the present invention also provides at least one data source (102) for obtaining heterogeneous data whereby the heterogeneous data comprises structured data (102A) and unstructured data (102B). The present invention provides that the structured data (102A) and the unstructured data (102B) are obtained from the disparate sources.
Further, the system (100) of the present invention also comprises a Privacy Assurance Services, PAS, component (104) for conducting pseudonymization on the structured data (102A) and the unstructured data (102B) to mask personal identification information within the structured and unstructured data (102A, 102B). The present invention also provides at least one data store (106) having a plurality of databases (106A, 106B) for storing the structured and the unstructured data (102A, 102B) accordingly. The structured data (102A) is stored in at least one data warehouse (106A) which leverages a relational database management system, RDBMS, platform while the unstructured data (102B) is stored in at least one data lake (106B) by utilizing third party components such as Hadoop and Filesystem. The data lake (106B) stores the unstructured data (102B) with multiple hierarchies in semi-structured data format.
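The masking performed by the PAS component (104) can be illustrated with a minimal sketch. The specification does not disclose the pseudonymization algorithm, so the keyed hash (HMAC-SHA256) used here, the field names in `PII_FIELDS` and the secret key are all assumptions chosen only for illustration.

```python
import hashlib
import hmac

# Hypothetical list of fields treated as personal identification information.
PII_FIELDS = ("name", "national_id")


def pseudonymize(record: dict, secret: bytes) -> dict:
    """Return a copy of the record with PII fields replaced by keyed hashes.

    The same input value and secret always yield the same pseudonym, so
    records about one person can still be correlated after masking.
    """
    masked = dict(record)
    for field in PII_FIELDS:
        if masked.get(field) is not None:
            digest = hmac.new(secret, str(masked[field]).encode("utf-8"), hashlib.sha256)
            masked[field] = digest.hexdigest()
    return masked
```

Because the pseudonym is deterministic for a given key, a masked identifier may still serve as a join key between the data warehouse (106A) and the data lake (106B).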
The system (100) of the present invention also comprises at least one data virtualization server (108) having a plurality of components (108A, 108B, 108C) for collecting, harnessing and storing the structured and unstructured data (102A, 102B) into a unified view such as a physical table. The present invention provides that the plurality of components within the data virtualization server (108) are at least one data connector (108A), at least one data composer (108B) and at least one data consumption (108C). The present invention provides that the data connector (108A) supports a plurality of database connection types which may comprise a relational database management system, RDBMS. The RDBMS may include PostgreSQL and MySQL, whereas NoSQL, an alternative to the traditional relational database, may include MongoDB and Hadoop. The data connector (108A) is utilized by the present invention as a connector for connecting to the data store (106).
Further, the present invention provides that the data composer (108B) is utilized for composing a virtual database schema, VDS, that integrates a plurality of data sources of the structured data (102A) and the unstructured data (102B). The VDS comprises a set of metadata which represents the data sources (102). The present invention also provides that the data consumption (108C) is utilized for exposing the VDS as Representational State Transfer application program interface, REST API, Java Database Connectivity, JDBC, or Open Database Connectivity, ODBC.
Reference is now made to Figure 2.0. Figure 2.0 is a flowchart illustrating a general methodology (200) for analyzing heterogeneous data of health information. As illustrated in Figure 2.0, analyzing heterogeneous data is first initiated by collecting heterogeneous data from disparate data sources (102), whereby the heterogeneous data is the structured data (102A) and the unstructured data (102B) (202). The structured and the unstructured data (102A, 102B) obtained from the disparate data sources (102) are subsequently sent to the file server in a readable format for structuring data including JavaScript Object Notation, JSON format, whereby JSON is a lightweight data-interchange format (204). Thereafter, the structured and unstructured data (102A, 102B) are pseudonymized for masking personal identification information by utilizing PAS (104) (206).
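As an illustration of step (204), the following sketch serialises both a flat structured record and a nested unstructured record into JSON, one document per line. The record layout and the one-document-per-line convention are assumptions and are not prescribed by the specification.

```python
import json


def to_json_lines(records):
    """Serialise each record as one JSON document per line."""
    return "\n".join(json.dumps(r, sort_keys=True, ensure_ascii=False) for r in records)


def from_json_lines(text):
    """Parse the output of to_json_lines back into a list of records."""
    return [json.loads(line) for line in text.splitlines() if line.strip()]
```

Nested dictionaries and lists survive the round trip unchanged, which is what allows records with multiple hierarchies to be carried in the same format as flat rows.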
The pseudonymized structured and unstructured data (102A, 102B) are sent to the data store (106), whereby the pseudonymized structured data (102A) are stored directly in the data warehouse (106A) and the pseudonymized unstructured data (102B) with multiple hierarchies are stored using a document-based database in the data lake (106B) in a semi-structured format such as JSON or comma-separated values, CSV (208). Subsequently, the pseudonymized structured and unstructured data (102A, 102B) from the data warehouse (106A) and the data lake (106B) are correlated through the data virtualization server (108) (210). Finally, the data correlated through the data virtualization server (108) are sent to the data analytics and visualization module (110) for data analysis and visualization (212).
Reference is now made to Figure 3.0. Figure 3.0 is a flowchart illustrating further steps (300) involved in correlating pseudonymized structured and unstructured data (102A, 102B) through the data virtualization server (108) (210). As illustrated in Figure 3.0, correlating pseudonymized structured and unstructured data (102A, 102B) through the data virtualization server (108) is first initiated by configuring connection to the data source (102) (302). The step is followed by introspecting metadata of the structured data (102A) and the unstructured data (102B) and exposing the metadata as physical table (304). Thereafter, upon exposing the metadata as physical table, source views of metadata are combined, integrated, transformed and cleansed as canonical model views of data for publishing the same as SQL views or other service format (306). Finally, the JDBC connector is utilized to consume the metadata for analysis and visualization of the structured data (102A) and the unstructured data (102B) through the data analytics and visualization module (110) (308). Each step will be clarified in detail in further embodiments of the present invention.
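The correlation step (210) can be pictured as a join between structured rows and semi-structured documents on a shared pseudonymized key. The sketch below uses an in-memory SQLite database as a stand-in for the data virtualization server (108); the table names, column names and JSON layout are hypothetical.

```python
import json
import sqlite3


def correlate(warehouse_rows, lake_documents):
    """Join structured rows with JSON documents on a shared pseudonym.

    warehouse_rows: iterable of (pid, age) tuples (structured data).
    lake_documents: iterable of JSON strings shaped like
                    {"pid": ..., "note": {"text": ...}} (semi-structured data).
    """
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE patient (pid TEXT PRIMARY KEY, age INTEGER)")
    conn.execute("CREATE TABLE note (pid TEXT, text TEXT)")
    conn.executemany("INSERT INTO patient VALUES (?, ?)", warehouse_rows)
    for raw in lake_documents:
        doc = json.loads(raw)
        # Flatten the nested document into a relational row.
        conn.execute("INSERT INTO note VALUES (?, ?)", (doc["pid"], doc["note"]["text"]))
    rows = conn.execute(
        "SELECT p.pid, p.age, n.text FROM patient p "
        "JOIN note n ON n.pid = p.pid ORDER BY p.pid, n.text"
    ).fetchall()
    conn.close()
    return rows
```

A real data virtualization server would federate the query across live sources rather than copy the data; the copy here only keeps the example self-contained.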
Reference is now made to Figure 4.0. Figure 4.0 is a flowchart illustrating the further steps (400) involved in configuring connection to the data source (102) (302). As illustrated in Figure 4.0, the step to configure connection to the data source (102) is first initiated by selecting the required data connector from a list of data connector adapters (402), whereby the data connector adapter comprises a list of connectors for RDBMS and non-RDBMS database types including MySQL, PostgreSQL, JDBC, etc. The connector is selected by a system administrator. Upon selecting the required connector, the system administrator may configure connection (404) to the RDBMS by utilizing conventional JDBC or ODBC protocol. Also, the system administrator may utilize Open Data Protocol, OData, for connecting to a non-RDBMS. Subsequently, the step is followed by specifying the data source (102) destination information (406). The data source (102) information may include server address and database properties whereby the database properties may include type, time zone, schema, etc. Upon specifying the data source (102), the system administrator subsequently configures and publishes the connection information (408).
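Steps (402) to (408) amount to choosing an adapter, filling in the destination details and registering the result. A minimal sketch follows; the `SUPPORTED_ADAPTERS` table, the `ConnectionConfig` class, the `publish` helper and the OData URL shape are hypothetical, while the `jdbc:postgresql://` and `jdbc:mysql://` prefixes follow the usual JDBC URL convention.

```python
from dataclasses import dataclass, field

# Hypothetical adapter list: adapter name -> protocol used to reach it.
SUPPORTED_ADAPTERS = {"postgresql": "jdbc", "mysql": "jdbc", "odata": "odata"}


@dataclass(frozen=True)
class ConnectionConfig:
    adapter: str                 # step 402: selected connector
    server_address: str          # step 406: host[:port]
    database: str                # step 406: database or service name
    properties: dict = field(default_factory=dict)  # e.g. time zone, schema

    def __post_init__(self):
        if self.adapter not in SUPPORTED_ADAPTERS:
            raise ValueError(f"unsupported adapter: {self.adapter}")
        if not self.server_address:
            raise ValueError("server address is required")

    def url(self) -> str:
        """Build a connection URL for the selected adapter (step 404)."""
        if SUPPORTED_ADAPTERS[self.adapter] == "jdbc":
            return f"jdbc:{self.adapter}://{self.server_address}/{self.database}"
        return f"https://{self.server_address}/{self.database}"


def publish(registry: dict, name: str, config: ConnectionConfig) -> None:
    """Step 408: make the validated connection available under a name."""
    registry[name] = config
```

Validating in `__post_init__` means an unsupported adapter is rejected before anything is published.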
Reference is now made to Figure 5.0. Figure 5.0 is a flowchart illustrating further steps (500) involved in introspecting the metadata and exposing the metadata as physical table (304) whereby the metadata is a set of data that provides information of the data source (102). As illustrated in Figure 5.0, the steps for introspecting the metadata and exposing the metadata as physical table are initiated by first configuring virtual database schema, VDS, properties including schema name, connection source and version, and subsequently selecting the connection source as defined in the VDS properties (502). Upon selecting the defined connection source, the metadata is imported from the data sources (102) and defined by utilizing computer readable instructions which include Data Definition Language (504).
After the metadata is imported and defined using Data Definition Language, the VDS is created and deployed (506).
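Steps (504) and (506) can be sketched as reading table and column metadata from a source and emitting Data Definition Language for the virtual schema. SQLite is used here only as a convenient source to introspect, and the `CREATE FOREIGN TABLE` wording of the generated DDL is illustrative rather than the dialect of any particular data virtualization product.

```python
import sqlite3


def introspect(conn):
    """Return {table: [(column, declared_type), ...]} for every user table."""
    tables = [
        row[0]
        for row in conn.execute(
            "SELECT name FROM sqlite_master "
            "WHERE type = 'table' AND name NOT LIKE 'sqlite_%' ORDER BY name"
        )
    ]
    metadata = {}
    for table in tables:
        # PRAGMA table_info yields (cid, name, type, notnull, dflt_value, pk).
        info = conn.execute(f'PRAGMA table_info("{table}")')
        metadata[table] = [(row[1], row[2]) for row in info]
    return metadata


def to_ddl(schema_name, metadata):
    """Render the imported metadata as one DDL statement per table."""
    statements = []
    for table, columns in metadata.items():
        cols = ", ".join(f"{name} {ctype}" for name, ctype in columns)
        statements.append(f"CREATE FOREIGN TABLE {schema_name}.{table} ({cols});")
    return statements
```

Only metadata is read; no rows are copied, which matches the idea that the VDS is a set of metadata representing the data sources (102).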
Reference is now made to Figure 6.0. Figure 6.0 is a flowchart illustrating further steps (600) involved in combining, integrating, transforming and cleansing source view as canonical model views of the structured and unstructured data (102A, 102B) for publishing the data as data service format such as structured query language, SQL views (306). As illustrated in Figure 6.0, the step is initiated by first creating an extract, transform and load transformation for combining and integrating the metadata from the VDS to accommodate data analytic (602). Thereafter, the output from the extract, transform and load transformation is generated as a materialized view to be consumed by the data analytics and visualization module (110) (604). Subsequently, the materialized view is exposed as a service (606) whereby the service is further deployed and published by the system administrator (608).
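The transformation and materialization of steps (602) and (604) can be sketched as follows. SQLite has no native materialized views, so the sketch emulates one with `CREATE TABLE ... AS SELECT`; the `patient` and `visit` tables, the cleansing rule (trimming and upper-casing a region code) and the function names are assumptions.

```python
import sqlite3


def build_materialized_view(conn):
    """Combine, cleanse and persist source views as a precomputed table."""
    conn.execute("DROP TABLE IF EXISTS mv_visits_per_patient")
    conn.execute(
        """
        CREATE TABLE mv_visits_per_patient AS
        SELECT p.pid AS pid,
               UPPER(TRIM(p.region)) AS region,   -- cleansing
               COUNT(v.pid) AS visits             -- integration of two sources
        FROM patient p
        LEFT JOIN visit v ON v.pid = p.pid
        GROUP BY p.pid, UPPER(TRIM(p.region))
        """
    )
    conn.commit()


def query_service(conn):
    """Stand-in for the published service: read the materialized result."""
    return conn.execute(
        "SELECT pid, region, visits FROM mv_visits_per_patient ORDER BY pid"
    ).fetchall()
```

Consumers read the precomputed table rather than re-running the join, at the cost of having to rebuild it when the sources change.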
Reference is now made to Figure 7.0. Figure 7.0 is a flowchart illustrating further steps (700) involved in analyzing and visualizing the structured and unstructured data (102A, 102B) consumed by Java Database Connectivity, JDBC as connector (308). As illustrated in Figure 7.0, the system administrator first selects the materialized view service which is required for the analytic requirement (702) by utilizing the JDBC connector from the data analytics and visualization module (110) to establish connection. Finally, views of the data are exposed as physical table for consumption by the data analytics and visualization module (110) to design a data mart (704) whereby the data mart is a subset of the data warehouse (106A).
The present invention relates to a system and method for analyzing heterogeneous data from disparate data sources (102) by utilizing data virtualization components. In particular, the present invention utilizes a data virtualization server (108) for correlating heterogeneous data from disparate data sources (102) pertaining to health information by utilizing a plurality of components (108A, 108B and 108C) within the data virtualization server (108). The heterogeneous data comprises structured data (102A) and unstructured data (102B). The present invention provides that the plurality of components (108A, 108B and 108C) of the data virtualization server (108) are at least one data connector (108A), at least one data composer (108B) and at least one data consumption (108C). The structured and unstructured data (102A, 102B) obtained from disparate data sources (102) are pseudonymized for masking personal identification information of the data and subsequently stored in at least one data warehouse (106A) and at least one data lake (106B) accordingly. The structured data (102A) stored in the data warehouse (106A) and the unstructured data (102B) stored in the data lake (106B) are correlated to be analyzed and visualized through the data analytics and visualization module (110).
Unless the context requires otherwise or specifically stated to the contrary, integers, steps or elements of the invention recited herein as singular integers, steps or elements clearly encompass both singular and plural forms of the recited integers, steps or elements. Throughout this specification, unless the context requires otherwise, the word “comprise”, or variations such as “comprises” or “comprising”, will be understood to imply the inclusion of a stated step or element or integer or group of steps or elements or integers, but not the exclusion of any other step or element or integer or group of steps, elements or integers. Thus, in the context of this specification, the term “comprising” is used in an inclusive sense and thus should be understood as meaning “including principally, but not necessarily solely”.

Claims

1. A system (100) for analyzing heterogeneous data utilizing data virtualization components comprising:
at least one file server at a client side for providing access of files;
at least one data source module (102) for providing heterogeneous data to be aggregated;
a Privacy Assurance Services component (104) for conducting pseudonymization on the heterogeneous data to mask personal identification information;
at least one data store (106) having a plurality of database for storing the heterogeneous data;
at least one data virtualization server (108) having a plurality of components for collecting, harnessing and storing the heterogeneous data into a unified view; and
at least one data analytics and visualization module (110) for composing and exposing heterogeneous data from a virtual database within the data virtualization server (108) for analyzing and visualizing the heterogeneous data,
characterized in that
the data virtualization server (108) further comprising:
at least one data connector (108A) for connecting to the data store (106) and supports a plurality of database connection;
at least one data composer (108B) for composing a virtual database schema that integrates multiple data sources; and
at least one data consumption (108C) for exposing a virtual database schema.
2. The system (100) according to claim 1, wherein the heterogeneous data further comprises structured data (102A) and unstructured data (102B).
3. The system (100) according to claim 2, wherein the unstructured data (102B) further comprises semi-structured data.
4. The system (100) according to claim 1, wherein the data store (106) further comprises:
at least one data warehouse (106A) for storing the structured data (102A); and
at least one data lake (106B) for storing the unstructured data (102B) and semi-structured data.
5. The system (100) according to claim 1, wherein the unified view enables views of the heterogeneous data in physical table.
6. The system (100) according to claim 1, wherein the plurality of database connection includes a relational database management system.
7. The system (100) according to claim 1, wherein the virtual database schema further comprising a set of metadata representing the data source (102).
8. The system (100) according to claim 1, wherein the data analytic and visualization module (110) further configured to retrieving, analyzing, transforming and reporting of the heterogeneous data.
9. A method (200) for analyzing heterogeneous data utilizing data virtualization components comprises steps of:
collecting heterogeneous data from at least one data source (202);
sending heterogeneous data to at least one file server as readable format for structuring data (204);
pseudonymizing heterogeneous data through Privacy Assurance Services component (206);
sending pseudonymized heterogeneous data to at least one data store (208);
correlating pseudonymized heterogeneous data through at least one data virtualization server (210); and
sending the correlated data to at least one data analytics and visualization module (212).
10. The method (200) according to claim 9, wherein correlating pseudonymized heterogeneous data through at least one data virtualization server (210) further comprising steps of (300):
configuring connection to the data source (302);
introspecting metadata of structured and unstructured data (102A and 102B) and exposing the metadata as physical table (304);
combining, integrating, transforming and cleansing source view as canonical model views of data for publishing the data as data service format (306); and
analyzing and visualizing the data (308).
11. The method (200) according to claim 10, wherein configuring connection to the data source (302) further comprising steps of (400):
selecting required connector from a list of data connector adapter (402);
configuring connection to relational database management system (404);
specifying the data source destination information including server address and database properties (406); and
configuring and publishing connection information (408).
12. The method (200) according to claim 10, wherein introspecting metadata of structured and unstructured data (102A and 102B) and exposing the metadata as physical table (304) further comprising steps of (500):
configuring virtual database schema properties including schema name, connection source and version prior for selecting defined connection source (502);
importing and defining metadata from data sources by utilizing machine readable instruction for creating and deploying a virtual schema database (504); and
creating and deploying the virtual schema database (506).
13. The method (200) according to claim 10, wherein combining, integrating, transforming and cleansing source view as canonical model views of data for publishing the data as data service format (306) further comprising steps of (600):
creating an extract, transform and load transformation for combining and integrating the metadata from the virtual database schema to accommodate data analytic (602);
generating output of the extract, transform and load transformation as a materialized view to be consumed by data analytics and visualization module (604);
exposing the materialized view as service (606); and
deploying and publishing the materialized view service (608).
14. The method according to claim 10, wherein analyzing and visualizing the data (308) further comprising steps of (700):
selecting the service required for analytic requirement (702); and
exposing views as physical table for data analytics and visualization module to design a data mart (704).
PCT/MY2019/050135 2018-12-28 2019-12-27 System and method for analyzing heterogeneous data by utilizing data virtualization components WO2020139079A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
MYPI2018003033 2018-12-28
MYPI2018003033 2018-12-28

Publications (1)

Publication Number Publication Date
WO2020139079A1 true WO2020139079A1 (en) 2020-07-02

Family

ID=71127414

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/MY2019/050135 WO2020139079A1 (en) 2018-12-28 2019-12-27 System and method for analyzing heterogeneous data by utilizing data virtualization components

Country Status (1)

Country Link
WO (1) WO2020139079A1 (en)

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20070016596A1 (en) * 2005-07-01 2007-01-18 Business Objects, S.A. Apparatus and method for producing a virtual database from data sources exhibiting heterogeneous schemas
US20090171927A1 (en) * 2003-05-27 2009-07-02 International Business Machines Corporation Method for providing a real time view of heterogeneous enterprise data
KR101706252B1 (en) * 2016-02-29 2017-02-13 주식회사 티맥스데이터 Method, server and computer program stored in computer readable medium for synchronizing query result
KR20170053013A (en) * 2015-11-05 2017-05-15 주식회사 나눔기술 Data Virtualization System for Bigdata Analysis
US20170243028A1 (en) * 2013-11-01 2017-08-24 Anonos Inc. Systems and Methods for Enhancing Data Protection by Anonosizing Structured and Unstructured Data and Incorporating Machine Learning and Artificial Intelligence in Classical and Quantum Computing Environments

Cited By (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US11544286B2 (en) * 2019-11-29 2023-01-03 Amazon Technologies, Inc. Replicating materialized views across heterogeneous target systems
US11874828B2 (en) 2019-11-29 2024-01-16 Amazon Technologies, Inc. Managed materialized views created from heterogenous data sources
US11934389B2 (en) 2019-11-29 2024-03-19 Amazon Technologies, Inc. Maintaining data stream history for generating materialized views
JP7403431B2 (en) 2020-11-13 2023-12-22 株式会社日立製作所 Data integration methods and data integration systems
CN112883091A (en) * 2021-01-12 2021-06-01 平安资产管理有限责任公司 Factor data acquisition method and device, computer equipment and storage medium
US11797518B2 (en) 2021-06-29 2023-10-24 Amazon Technologies, Inc. Registering additional type systems using a hub data model for data processing
CN113420099A (en) * 2021-07-06 2021-09-21 广州方硅信息技术有限公司 Buried point data access control method and device, computer equipment and storage medium
CN113420099B (en) * 2021-07-06 2022-11-04 广州方硅信息技术有限公司 Buried point data access control method and device, computer equipment and storage medium
WO2024065061A1 (en) * 2022-09-29 2024-04-04 Verto Inc. Healthcare record virtualization

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 19903944

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 19903944

Country of ref document: EP

Kind code of ref document: A1