WO2021133164A1

WO2021133164A1 - Unstructured data in enterprise data warehouse

Info

Publication number: WO2021133164A1
Application number: PCT/MY2020/050170
Authority: WO
Inventors: Mohamad Zakaria ALLI; Nur Syafiqah MUNIR; Wan Zawawi MD ZIN; Shahirina MOHD TAHIR
Original assignee: Mimos Berhad
Priority date: 2019-12-24
Filing date: 2020-11-25
Publication date: 2021-07-01

Abstract

The present invention relates to a system and method for analyzing unstructured data. The system includes at least one data source module for providing at least one unstructured data, a Privacy Assurance Services component for conducting pseudonymization on the unstructured data to mask personal identification information, a data harmonization tool configured for codification of the pseudonymized data based on at least one reference dataset, and at least one data analytics and visualization module configured for visualizing and analyzing transformed codified data loaded into a data warehouse.

Description

UNSTRUCTURED DATA IN ENTERPRISE DATA WAREHOUSE

FIELD OF INVENTION

The present invention generally relates to data analysis. More particularly, the invention relates to a system and method for analysis of unstructured data in an enterprise data warehouse.

BACKGROUND OF THE INVENTION

Clinical information in electronic health records (EHRs), for example procedure and diagnosis data, is mostly in the form of unstructured data. This data is usually written by a medical officer as a required part of the discharge process. It provides important information for decision making when it is analyzed collectively.

Unstructured data types typically do not fit well in traditional data warehouses that are based on relational databases, as such warehouses are inherently limited in analyzing this data. As a result, this information cannot be used during analysis for decision making. However, because unstructured data is also important for producing an exhaustive analysis, a solution needs to be introduced so that both structured and unstructured data can be analyzed.

One prior art document US7849048 B2 discloses a system and method of making unstructured data available to structured data analysis tools. The system includes middleware software that can be used in combination with structured data tools to perform analysis on both structured and unstructured data. Data can be read from a wide variety of unstructured sources. The data may then be transformed with commercial data transformation products that may, for example, extract individual pieces of data and determine relationships between the extracted data. The transformed data and relationships may then be passed through an extraction/transform/load (ETL) layer and placed in a structured schema. The structured schema may then be made available to commercial or proprietary structured data analysis tools.

Another prior art document, US 7668849 B1, discloses a method, system, and software for relating structured data to unstructured data, which includes displaying unstructured data in a first display area and displaying structured data related to the unstructured data in a second display area. In response to a change in the display of either the unstructured data in the first display area or the structured data in the second display area, the display in the other display area is automatically and dynamically changed to show data based on its relation to the changed data.

However, none of the prior art addresses the issues arising in analyzing structured and unstructured data. In view of the foregoing, there is a need for an improved method and system for overcoming the shortcomings associated with the prior art.

SUMMARY OF THE INVENTION

Accordingly, the present invention provides a system for analyzing unstructured data. The system includes at least one data source module for providing at least one unstructured data, a Privacy Assurance Services component for conducting pseudonymization on the unstructured data to mask personal identification information, a data harmonization tool configured for codification of the pseudonymized data based on at least one reference dataset, and at least one data analytics and visualization module configured for visualizing and analyzing transformed codified data loaded into a data warehouse.

In an embodiment, the present invention provides a method of analyzing unstructured data. The method includes extracting at least one unstructured data from a data source, pseudonymizing the data through the Privacy Assurance Services component, sending the pseudonymized data to a data harmonization module for codification, transforming and loading the codified data into a data warehouse, and sending the transformed data to at least one data analytics and visualization module for visualization and analytics.

BRIEF DESCRIPTION OF THE DRAWINGS

The other objects, features and advantages will occur to those skilled in the art from the following description of the preferred embodiment and the accompanying drawings in which:

Fig. 1 shows an architecture diagram of a system for extracting, transforming & loading (ETL) unstructured data inside an enterprise data warehouse using a data harmonization tool in accordance with an embodiment of the present invention.

Fig. 2 shows a flow diagram of a method for analyzing the unstructured data utilizing the data harmonization tool in accordance with an embodiment of the present invention.

DETAILED DESCRIPTION OF THE PREFERRED EMBODIMENTS

Various embodiments of the present invention provide a system and method for analyzing unstructured data in an enterprise data warehouse. The following description provides specific details of certain embodiments of the invention illustrated in the drawings to provide a thorough understanding of those embodiments. It should be recognized, however, that the present invention can be reflected in additional embodiments and the invention may be practiced without some of the details in the following description.

The various embodiments including the example embodiments will now be described more fully with reference to the accompanying drawings, in which the various embodiments of the invention are shown. The invention may, however, be embodied in different forms and should not be construed as limited to the embodiments set forth herein. Rather, these embodiments are provided so that this disclosure will be thorough and complete, and will fully convey the scope of the invention to those skilled in the art. In the drawings, the sizes of components may be exaggerated for clarity.

It will be understood that as used herein, the term “and/or” includes any and all combinations of one or more of the associated listed items.

Spatially relative terms, such as “data,” “unstructured data,” and the like, may be used herein for ease of description to describe one element or feature's relationship to another element(s) or feature(s) as illustrated in the figures. It will be understood that the spatially relative terms are intended to encompass different orientations of the structure in use or operation in addition to the orientation depicted in the figures.

Embodiments described herein will refer to plan views and/or cross-sectional views by way of ideal schematic views. Accordingly, the views may be modified depending on simplistic assembling or manufacturing technologies and/or tolerances. Therefore, example embodiments are not limited to those shown in the views but include modifications in configurations formed on basis of assembling process. Therefore, regions exemplified in the figures have schematic properties and shapes of regions shown in the figures exemplify specific shapes or regions of elements, and do not limit the various embodiments including the example embodiments.

The subject matter of example embodiments, as disclosed herein, is described with specificity to meet statutory requirements. However, the description itself is not intended to limit the scope of this patent. Rather, the inventors have contemplated that the claimed subject matter might also be embodied in other ways, to include different features or combinations of features similar to the ones described in this document, in conjunction with other technologies. Generally, the various embodiments including the example embodiments relate to a system and method for extracting, transforming and loading unstructured data into a data warehouse for analytics.

Referring to Fig. 1, an architecture diagram of a system for extracting, transforming & loading (ETL) unstructured data inside an enterprise data warehouse using a data harmonization tool is shown in accordance with an embodiment of the present invention. The system 100 includes a data analytics and visualization module 110, an ETL support tool 120, a privacy assurance service (PAS) component 130, a data source 140, a data harmonization tool 150, a reference dataset 160 and an enterprise data warehouse 170.

The data analytics and visualization module 110 is configured for composing and exposing data from the data warehouse 170 for analyzing and visualizing heterogeneous data.

The ETL support tool 120 is configured for extracting the data from the source and generating a JSON file. The JSON file is then transformed based on the business rules, and the transformed data is loaded into the data warehouse 170.

The privacy assurance service component 130 performs a pseudonymization process in order to mask personal information.
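The pseudonymization performed by the privacy assurance service component 130 can be illustrated with a minimal sketch. The description does not state which technique the component uses; the keyed hash (HMAC-SHA256) below is an assumption chosen for illustration, and the field names `patient_name` and `national_id`, the `PSN-` prefix, and both function names are hypothetical. The sketch masks only dedicated identifier fields; identifiers embedded in free text would need additional handling that is not shown here.

```python
import hashlib
import hmac

# Hypothetical field names; the description does not say which fields hold personal data.
PII_FIELDS = ("patient_name", "national_id")


def pseudonymize_value(value: str, secret_key: bytes) -> str:
    """Return a stable keyed token for one identifier (HMAC-SHA256, truncated)."""
    digest = hmac.new(secret_key, value.encode("utf-8"), hashlib.sha256).hexdigest()
    return "PSN-" + digest[:16]


def pseudonymize_record(record: dict, secret_key: bytes) -> dict:
    """Copy a record, replacing every listed PII field with its pseudonym."""
    masked = dict(record)  # leave the caller's record untouched
    for field in PII_FIELDS:
        if field in masked:
            masked[field] = pseudonymize_value(str(masked[field]), secret_key)
    return masked


if __name__ == "__main__":
    key = b"example-key-not-for-production"
    raw = {
        "patient_name": "Jane Doe",
        "national_id": "A1234567",
        "discharge_note": "Cardiovascular implant procedure completed.",
    }
    print(pseudonymize_record(raw, key))
```

Because the token depends only on the value and the key, the same person maps to the same pseudonym across records, which keeps records linkable for analysis without exposing the original identifier.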

The data source 140 is configured to capture and collect the unstructured data.

The data harmonization tool 150 is configured for codifying unstructured data into integrated, consistent and unambiguous data based on the terminology inside the reference dataset 160.

The reference dataset 160 is configured to store all relevant terminology with unique codes.

In an embodiment, the at least one data warehouse 170 includes a plurality of warehouse databases for storing the data. In an embodiment, the at least one reference dataset 160 includes a plurality of reference databases for storing a list of a plurality of terminologies with unique codes, where each unique code is required for codification of the unstructured data. The terminology herein refers to a hierarchy of terms which include unique codes and definitions. The unique code functions as an identifier for the term. For example, the term ‘Cardiovascular Implant’ is tagged with a unique code of ‘309513005’, which will then be used by the data harmonization tool 150 to tag each identified term that is found inside the unstructured data. A similar term such as ‘Cardiovascular Therapy’ is also tagged with the same unique code of ‘309513005’ to ensure wider coverage of the area.
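The relationship between terms, unique codes and codification can be sketched as a simple lookup. This is an illustrative sketch only: the two entries are the examples given above, while the dictionary layout, the `codify` function and the phrase-matching strategy are assumptions, not the tool's actual matching method.

```python
import re

# Term -> unique code. The two entries are the examples from the description;
# a real reference dataset would hold a full terminology hierarchy.
REFERENCE_DATASET = {
    "Cardiovascular Implant": "309513005",
    "Cardiovascular Therapy": "309513005",
}


def codify(text: str, reference: dict) -> list:
    """Return (term, code) pairs for every reference term found in the text."""
    matches = []
    for term, code in reference.items():
        # Case-insensitive whole-phrase match; \b avoids hits inside longer words.
        pattern = r"\b" + re.escape(term) + r"\b"
        if re.search(pattern, text, flags=re.IGNORECASE):
            matches.append((term, code))
    return matches


if __name__ == "__main__":
    note = "Patient admitted for cardiovascular implant; follow-up planned."
    print(codify(note, REFERENCE_DATASET))
```

Mapping several similar terms to one code, as the example above does, means downstream analysis can group records by code without having to know every wording used in the source text.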

In an embodiment, the data harmonization module 150 includes a plurality of components for codifying unstructured data.

In an embodiment, the visualization module 110, which is based on a Business Intelligence (BI) tool, is used to visualize the data from the data warehouse 170. A visualization medium such as a fixed-format report or dashboard is created using the BI tool, which then extracts the data from the data warehouse 170 and populates the data according to the designed report.

In an embodiment, the system 100 includes at least, but is not limited to, four processors and sixteen gigabits of Random Access Memory (RAM) configured for processing the extraction, transformation and loading of data in the data warehouse 170. The number of processors and the size of the RAM may vary depending on data velocity and volume.

Referring to Fig. 2, a flow diagram 200 of a method for analyzing the unstructured data utilizing the data harmonization tool is shown in accordance with an embodiment of the present invention. The method includes step S210 of collecting and extracting data from a source system, where the data extraction output is in JSON format. It should be noted that the JSON format is an output format for codified data that has undergone data harmonization, in which the unstructured data is structured before it can be processed by the ETL tool in one of the subsequent steps in accordance with the embodiment of the present invention.
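As an illustration of codified data expressed in JSON format, the sketch below serializes one record. The description does not define the JSON layout, so the keys `record_id`, `text` and `codes`, as well as the function name, are hypothetical.

```python
import json


def build_codified_document(record_id: str, text: str, codes: list) -> str:
    """Serialize one codified record as a JSON string (hypothetical layout)."""
    document = {
        "record_id": record_id,
        "text": text,
        # One entry per identified term, carrying its unique code.
        "codes": [{"term": term, "code": code} for term, code in codes],
    }
    return json.dumps(document, ensure_ascii=False, indent=2)


if __name__ == "__main__":
    payload = build_codified_document(
        "PSN-0001",
        "Cardiovascular implant procedure completed.",
        [("Cardiovascular Implant", "309513005")],
    )
    print(payload)
```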

In step S220, pseudonymization of the unstructured data is done via the Privacy Assurance Service component. In step S230, the data harmonization module fetches the pseudonymized data from a file server for codification and generates a JSON file. In step S240, the ETL support tool fetches the JSON file with codified data and performs the transformation and loading into the data warehouse. In step S250, data analytics and visualization are performed; during this step, data from the warehouse is composed and exposed to be used for analyzing and visualizing heterogeneous data.
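The ordering of steps S210 to S250 can be sketched as a chain of small functions. Every stage below is a deliberately simplified stand-in under stated assumptions: the masking is a placeholder rather than a real privacy assurance service, the warehouse is a plain Python list, and the record keys `patient` and `note` are hypothetical.

```python
from collections import Counter


def s210_extract(source) -> list:
    """S210: collect raw records from the source system."""
    return [dict(record) for record in source]


def s220_pseudonymize(records: list) -> list:
    """S220: mask the identifying field (placeholder masking only)."""
    return [{**r, "patient": "MASKED-%d" % i} for i, r in enumerate(records)]


def s230_harmonize(records: list, reference: dict) -> list:
    """S230: attach the codes of reference terms found in each note."""
    return [
        {**r, "codes": [code for term, code in reference.items()
                        if term.lower() in r["note"].lower()]}
        for r in records
    ]


def s240_load(records: list, warehouse: list) -> None:
    """S240: load codified records into the warehouse (here, a plain list)."""
    warehouse.extend(records)


def s250_analyze(warehouse: list) -> Counter:
    """S250: count how often each code occurs across the loaded records."""
    return Counter(code for r in warehouse for code in r["codes"])


def run_pipeline(source, reference: dict, warehouse: list) -> Counter:
    """Run the five steps in the order given by the flow diagram."""
    records = s210_extract(source)
    records = s220_pseudonymize(records)
    records = s230_harmonize(records, reference)
    s240_load(records, warehouse)
    return s250_analyze(warehouse)


if __name__ == "__main__":
    reference = {"Cardiovascular Implant": "309513005"}
    source = [
        {"patient": "Jane Doe", "note": "Cardiovascular implant done."},
        {"patient": "John Roe", "note": "Routine check."},
    ]
    print(run_pipeline(source, reference, warehouse=[]))
```

Running pseudonymization before harmonization, as the flow prescribes, means no later stage ever sees the original identifier.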

The present invention utilizes a data harmonization tool which is based on semantic technology that utilizes different terminologies to combine textual data into integrated, consistent and unambiguous data.

Data captured or collected from a source system in the form of unstructured data needs to go through a pseudonymization process to mask personal identification information. Once the data is pseudonymized, the ETL support tool processes it and sends the output to the data harmonization component in order to codify the unstructured data based on the provided reference dataset. Codified data is generated and processed by the ETL support tool for data warehouse consumption. Processed data inside the data warehouse can be used by intelligent tools for analysis.
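The final hand-off from codified JSON to the data warehouse and on to analysis can be sketched as follows. This is a sketch under stated assumptions: an in-memory SQLite database stands in for the enterprise data warehouse, the table name `codified_fact` and the JSON keys are hypothetical, and the single business rule shown (documents without codes produce no rows) is invented for illustration.

```python
import json
import sqlite3


def transform(document: dict) -> list:
    """Flatten one codified document into (record_id, code, term) rows.

    Assumed business rule: documents without any code produce no rows.
    """
    return [
        (document["record_id"], entry["code"], entry["term"])
        for entry in document.get("codes", [])
    ]


def load(conn: sqlite3.Connection, json_documents: list) -> int:
    """Parse JSON documents, transform them, insert the rows; return the row count."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS codified_fact (record_id TEXT, code TEXT, term TEXT)"
    )
    rows = [row for raw in json_documents for row in transform(json.loads(raw))]
    conn.executemany("INSERT INTO codified_fact VALUES (?, ?, ?)", rows)
    conn.commit()
    return len(rows)


def count_by_code(conn: sqlite3.Connection) -> list:
    """Example analytic query: number of distinct records per code."""
    cursor = conn.execute(
        "SELECT code, COUNT(DISTINCT record_id) FROM codified_fact "
        "GROUP BY code ORDER BY code"
    )
    return cursor.fetchall()


if __name__ == "__main__":
    connection = sqlite3.connect(":memory:")
    documents = [
        json.dumps({"record_id": "PSN-0001",
                    "codes": [{"term": "Cardiovascular Implant", "code": "309513005"}]}),
        json.dumps({"record_id": "PSN-0002", "codes": []}),
    ]
    load(connection, documents)
    print(count_by_code(connection))
```

Once the codes are stored as ordinary relational rows, a reporting tool can aggregate them with standard SQL, which is the point of codifying the text before loading it.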

The data source, ETL support tool, data harmonization module, and/or processor included in or associated with the system 100 described herein may comprise one or more microprocessors, digital signal processors, application specific integrated circuits, field programmable gate arrays, and/or other types of digital processing circuits, configured according to computer program instructions implemented in software (or firmware).

As would be apparent to a person having ordinary skill in the art, the afore-described methods and systems may be provided in many variations, modifications or alternatives to existing methods and systems. The principles and concepts disclosed herein may also be implemented in various manners which may not have been specifically described herein but which are to be understood as encompassed within the scope of the appended claims.

Claims

1. A system (100) for analyzing unstructured data, characterized in that, the system (100) comprising: at least one data source module (140) for providing at least one unstructured data; a privacy assurance services component (130) for conducting pseudonymization on the unstructured data to mask personal identification information; a data harmonization tool (150) configured for codification of the pseudonymized data based on at least one reference dataset (160); and at least one data analytics and visualization module (110) configured for visualizing and analyzing transformed codified data loaded into a data warehouse (170).

2. The system (100) of claim 1 further comprises at least one file server at a client side for providing access of files.

3. The system (100) of claim 1 wherein the data warehouse (170) includes a plurality of warehouse databases for storing the data.

4. The system (100) of claim 1 wherein the at least one reference dataset (160) includes a plurality of reference databases for storing a list of a plurality of terminologies with a unique code, each unique code being required for codification of the unstructured data.

5. The system (100) of claim 1 wherein the data harmonization tool (150) includes a plurality of components for codifying unstructured data.

6. The system (100) of claim 1 wherein the data analytics and visualization module (110) is further configured for composing and exposing the data from the data warehouse (170).

7. The system (100) of claim 1 further comprises at least one processor configured for processing the extraction, transformation and loading of data in the data warehouse (170).

8. A method of analyzing unstructured data, characterized by the steps of: extracting at least one unstructured data from a data source module; pseudonymizing the data through privacy assurance services component; sending pseudonymized data to a data harmonization module for codification; transforming and loading codified data into a data warehouse; and sending the transformed codified data to at least one data analytics and visualization module for visualization and analytics.

9. The method of claim 8 further comprising the step of outputting the extracted unstructured data in JavaScript Object Notation (JSON) format.

10. The method of claim 8 wherein the data is pseudonymized for masking personally identifiable information.

11. The method of claim 8 wherein the codification of the data is based on a reference dataset.