WO2020170217A1

WO2020170217A1 - A method and system for managing analytical schema

Info

Publication number: WO2020170217A1
Application number: PCT/IB2020/051497
Authority: WO
Inventors: Mukund Mohan
Original assignee: Mukund Mohan
Priority date: 2019-02-23
Filing date: 2020-02-22
Publication date: 2020-08-27

Abstract

Disclosed herein is a method and system for managing analytical schema for plurality of datasets of an enterprise. The system receives a plurality of datasets and respective metadata sets and creates a unified schema to provide more efficient platform to perform data analytics. The system generates a data profile of the plurality of input datasets by intersecting each input dataset with the respective metadata set. In one embodiment, user performs data quality assessment of the input dataset by analysis of generated data profile. The system generates a second plurality of dataset by modifying the plurality of input dataset based on the data profile and predefined rules. The system determines at least one relationship profile between the second plurality of datasets by pairwise analysis of datasets. The system further creates the unified schema based upon the at least one relationship profile between the second plurality of input datasets.

Description

Title: A METHOD AND SYSTEM FOR MANAGING ANALYTICAL SCHEMA

CROSS REFERENCE TO RELATED APPLICATION

This application claims the benefit of priority to Indian Provisional Patent Application Number 201941002784, filed on February 23, 2019, the entire contents of which are hereby incorporated by reference.

TECHNICAL FIELD

The present subject matter is related, to data processing, in general and more particularly, but not exclusively related to method and system for managing analytical schema for data processing.

BACKGROUND

Data analytics is the process of transforming raw data into useful information that helps organizations to generate analysis reports, review reports and so on. Conventional types of data analytics that exists today are Descriptive Analytics, Predictive Analytics and Prescriptive Analytics. Though the purpose of all types of data analytics is reduction of large amount of data into smaller number of datapoints for comprehension, each of them not only differs in the subject, criteria and outcome of analysis but also the method and system for processing data. Each of the data analytics type requires data to be structured in a specific data structure or schema accordingly in order to make the transformation process efficient. However, there is no universal data structure or schema that can be applied to run all of them.

Typically, most of the organizations have been using these analytics separately and independently or in combination of each other. The ability to switch from one type of analytics to another type is a critical challenge in the current world of analytics. Switching from one analytics to either of the other type of analytics requires a physical change in the schema and the structure of data. For example, descriptive analytics requires data to be structured for fast aggregation of rows based on explicit structural relationships. In another example predictive or prescriptive analytics requires data to be structured for fast aggregation of columns based on functional relationships. Interchanging or switching from row-efficient structure to column- based structure is an expensive, cumbersome and time-consuming process fraught with number of inefficiencies, discontinuities and redundancies. Orthogonalizing data structure to perform or execute various analytics in the same session is also a challenging task. Conventional methods prepare different data schema for various analytics; however, the conventional methods are not efficient in terms of time taken to process the same. Therefore, it is desirous to have a method and system for designing and managing a unified schema that is required for distinct analytics of data without the need of schema modifications.

The information disclosed in this background of the disclosure section is only for enhancement of understanding of the general background of the disclosure and should not be taken as an acknowledgement or any form of suggestion that this information forms prior art already known to a person skilled in the art.

SUMMARY

One or more shortcomings of the prior art are overcome, and additional advantages are provided through the present disclosure. Additional features and advantages are realized through the techniques of the present disclosure. Other embodiments and aspects of the disclosure are described in detail herein and are considered a part of the claimed disclosure.

Accordingly, the present disclosure relates to a method of designing analytical schema for a plurality of datasets. The method includes receiving a first plurality of datasets and respective metadata sets as input from at least one source directory system. The method comprises creating a data profile for each of the first plurality of datasets based on the respective metadata set, generating a second plurality of datasets by modifying the first plurality of datasets based on the data profile, determining at least one relationship profile between the second plurality of datasets by pairwise comparison. The method further comprises creating a unified schema based upon the at least one relationship profile determined between the second plurality of input datasets.

Further, the disclosure relates to a system for designing analytical schema for a plurality of datasets. The system comprises a processor and one or more analytics interfaces coupled with the processor. The system further comprises a memory communicatively coupled with the processor. The memory stores processor-executable instructions, which, on execution, cause the processor to receive a first plurality of datasets and respective metadata sets as input from at least one source directory systems coupled to the processor. The processor is configured to create a data profile for each of the first plurality of datasets based on the respective metadata sets and generate a second plurality of datasets by modifying the first plurality of datasets based on the data profile. The processor is further configured to determine at least one relationship profile between the second plurality of datasets by pairwise comparison and create a unified schema based upon the at least one relationship profile determined between the second plurality of input datasets.

The foregoing summary is illustrative only and is not intended to be in anyway limiting. In addition to the illustrative aspects, embodiments, and features described above, further aspects, embodiments, and features will become apparent by reference to the drawings and the following detailed description.

BRIEF DESCRIPTION OF DRAWINGS

The accompanying drawings, which are incorporated in and constitute a part of this disclosure, illustrate exemplary embodiments and, together with the description, serve to explain the disclosed principles. In the figures, the left- most digit(s) of a reference number identifies the figure in which the reference number first appears. The same numbers are used throughout the figures to reference like features and components. Some embodiments of device or system and/or methods in accordance with embodiments of the present subject matter are now described, by way of example only, and with reference to the accompanying figures, in which:

Figure 1 illustrates an exemplary architecture of a proposed system for managing an analytical schema for distinct analysis of data in accordance with some embodiments of the present disclosure;

Figure 2 illustrates a block diagram of analytical schema designing system of Figure 1 in accordance with some embodiments of the present disclosure;

Figures 3A-3B depicts an exemplary representation of a dataset pair in accordance with some embodiments of the present disclosure;

Figure 3C depicts an exemplary representation of a data profile plane generated from the intersection of the dataset pair of Figures 3A-3B in accordance with some embodiments of the present disclosure;

Figure 3D illustrates an exemplary three-dimensional representation of interaction of the dataset pair when the datasets are intersected entirely in accordance with an embodiment of the present disclosure;

Figure 4A illustrates an exemplary schema view representation of intersected plurality of datasets in accordance with some embodiments of the present disclosure;

Figure 4B illustrates an exemplary three-dimensional view of unified schema along with instances of schema in accordance with some embodiments of the present disclosure;

Figure 5 shows a flowchart illustrating a method for designing analytical schema for distinct data analysis in accordance with some embodiments of the present disclosure;

Figure 6 illustrates an exemplary representation of data flow of managing the unified schema for a plurality of datasets of an enterprise in accordance with an embodiment of the present disclosure;

Figures 7A - 7B illustrates an exemplary representation of composite data view and result of a row aggregation process on the composite data in accordance with some embodiments of the present disclosure;

Figures 8A - 8C illustrates an exemplary representation of output of a column aggregation process on the composite data in accordance with some embodiments of the present disclosure; and

Figure 9 illustrates a block diagram of an exemplary computer system for implementing embodiments consistent with the present disclosure.

It should be appreciated by those skilled in the art that any block diagrams herein represent conceptual views of illustrative systems embodying the principles of the present subject matter. Similarly, it will be appreciated that any flow charts, flow diagrams, state transition diagrams, pseudo code, and the like represent various processes which may be substantially represented in computer readable medium and executed by a computer or processor, whether or not such computer or processor is explicitly shown.

DETAILED DESCRIPTION In the present document, the word "exemplary" is used herein to mean "serving as an example, instance, or illustration." Any embodiment or implementation of the present subject matter described herein as "exemplary" is not necessarily to be construed as preferred or advantageous over other embodiments.

While the disclosure is susceptible to various modifications and alternative forms, specific embodiment thereof has been shown by way of example in the drawings and will be described in detail below. It should be understood, however that it is not intended to limit the disclosure to the particular forms disclosed, but on the contrary, the disclosure is to cover all modifications, equivalents, and alternative falling within the spirit and the scope of the disclosure.

The terms “comprises”, “comprising”, “include(s)”, or any other variations thereof, are intended to cover a non-exclusive inclusion, such that a setup, device or method that comprises a list of components or steps does not include only those components or steps but may include other components or steps not expressly listed or inherent to such setup or device or method. In other words, one or more elements in a device or system or apparatus proceeded by “comprises... a” does not, without more constraints, preclude the existence of other elements or additional elements in the device or system or apparatus.

Embodiments of the present disclosure relates to a method and system for managing unified analytical schema for distinct analysis of data. The analytical schema is a general-purpose data integration schema for various analytics of data that combines related datasets or files through common set of attributes, columns or fields of the datasets. The system receives a first plurality of datasets and respective metadata sets as input from at least one source directory system. The system creates a data profile for each of the first plurality of datasets based on the respective metadata set by intersecting data value of each cell corresponding to each attribute with the corresponding attribute information. The data profile of a dataset provides information for the data quality assessment of the dataset. The system generates a second plurality of datasets by modifying the first plurality of datasets based on the data profile. In one embodiment, the system detects one or more anomalies in the first plurality of datasets and removes the detected anomalies to generate the second plurality of datasets. The system further determines at least one relationship profile between the second plurality of datasets by pairwise comparison based on at least one common attribute identified between the second plurality of datasets. Upon determining at least one relationship profile, the system creates a unified schema by intersecting each dataset of the second plurality of datasets with another dataset of the second plurality of datasets across the at least one common attribute indicated by the at least one relationship profile. The unified schema enables the execution of distinct data analytics by switching from one analytics type to the other without modifying the schema.

In the following detailed description of the embodiments of the disclosure, reference is made to the accompanying drawings that form a part hereof, and in which are shown by way of illustration specific embodiments in which the disclosure may be practiced. These embodiments are described in sufficient detail to enable those skilled in the art to practice the disclosure, and it is to be understood that other embodiments may be utilized and that changes may be made without departing from the scope of the present disclosure. The following description is, therefore, not to be taken in a limiting sense.

Figure 1 illustrates an exemplary architecture of a proposed system (100) for managing unified analytical schema for distinct analysis of data in accordance with some embodiments of the present disclosure.

As shown in Figure 1, the exemplary system (100) comprises one or more components configured for managing unified analytical schema for distinct analysis of data. In one embodiment, the system (100) comprises a unified schema managing system (interchangeably referred to as an analytical schema designing system) (ASDS) (102), source data stores (106) comprising plurality of input datasets (107) and respective metadata sets (108) of the plurality of input datasets (107). The system (100) further comprises a target dataset (109) and one or more user devices comprising one or more analytics interfaces such as analytics interface- 1

(104-1), analytics interface-2 (104-2), . and analytics interface-N (104-N) (hereinafter collectively referred to as analytics interface (104)) communicatively coupled via a communication network (hereinafter referred to as network 110).

The network (110) may include, without limitation, a direct interconnection, LAN (local area network), WAN (wide area network), wireless network, point-to-point network, or another configuration. One of the most common types of network in current use is a TCP/IP (Transfer Control Protocol and Internet Protocol) network for communication between database client and database server. Other common Internet protocols used for such communication include HTTPS, FTP, AFS, and WAP and other secure communication protocols etc. The ASDS (102) is configured to manage a unified schema for a plurality of datasets of an enterprise. In one example, the ASDS (102) may be configured as a standalone system. In another example, the ASDS (102) may be configured in cloud environment. In yet another example, the ASDS (102) may include a desktop personal computer, workstation, laptop, PDA, cell phone, or any WAP-enabled device or any other computing device capable of interfacing directly or indirectly to the Internet or other network connection. The ASDS (102) typically includes one or more user interface devices, such as a keyboard, a mouse, touch screen, pen or the like, for interacting with the GUI provided on a display. The ASDS (102) also includes a graphical user interface (GUI) provided therein for interacting with the source data stores (106) to access the plurality of input datasets (107) and the metadata sets (108). In one embodiment, the ASDS (102) comprises a processor (112), a memory (114), a data profiling module (116), a data processing module (118), an aggregation module (120) and a schema generation module (122). The data profiling module (116) is configured to determine the data profile for each input dataset by using the respective metadata set of the input dataset. The data processing module (118) is configured to process the input data based on the generated data profile for preparing a clean set of data for the target dataset (109). The aggregation module (120) is configured to determine the relationship profile of the plurality of input datasets (107). The schema generation module (122) creates the unified schema for analysis of data based upon the relationship profile of the input datasets (107).

The processor (112) is configured to receive user input from the analytics interface (104) via the network (110) in order to use the unified schema for analysis of data.

The analytics interface (104) also enables the user with options to customize search criteria/process, search results of predictive analytics, descriptive analytics, and prescriptive analytics by using the unified analytical schema created by the ASDS (102). In an embodiment, the analytics interface (104) may be a mobile device or a computing device including the functionality for communicating over the network (110). For example, the analytics interface (104) can be a conventional web-enabled personal computer in the home, mobile computer (laptop, notebook or subnotebook), Smart Phone (iOS, Android & Windows), personal digital assistant, wireless electronic mail device, tablet computer or other device capable of communicating both ways over the Internet or other appropriate communications network.

The source data stores (interchangeably referred to as source directory systems) (106) store the plurality of input datasets (107) and the plurality of metadata sets (108) where each metadata set corresponds to one input dataset. In one example, the plurality of input datasets is a collection of organizational data and may include data tables, files or any document comprising one or more data elements for example attributes, and data values corresponding to the attributes collected from various data sources of an organization. In one example, the plurality of input datasets may be any data collected from various publicly available databases. The plurality of metadata sets (108) are the collection of metadata of the respective plurality of input datasets (107). The plurality of metadata sets (108), in one example, may contain one or more attribute information of the plurality of respective input datasets (107) such as attribute name, attribute data type, constraints, keys, indexes, sequences etc., where the plurality of metadata sets (108) facilitate determination of the data profile of the respective input datasets (107). In one embodiment, the source data stores (106) may be integrated within the ASDS (102). In another embodiment, the source datastores (106) may be a standalone repository communicatively coupled with the ASDS (102).

The target dataset (109) stores the processed and aggregated data, over which the user can execute plurality of analytical queries for receiving desired search result. In one embodiment, the target dataset (109) may be integrated within the ASDS (102). The target dataset (109) may be configured, for example, as a standalone datastore. In another example, target dataset (109) may be configured in cloud environment. In yet another example, the target dataset (109) may be one of data table, flat files, spreadsheet or any document comprising one or more data elements.

In another embodiment, the ASDS (102) may be a typical analytical schema designing system as illustrated in Figure 2. The ASDS (102) comprises the processor (112) and the memory (114). The analytics interface (104) is coupled with the processor (112) and the memory (114) via the network (110). The analytics interface (104) is configured to display one or more aggregated attributes to the user, receive user selection of the one or more attributes, display a materialized view, receive user input for customization of materialized view, display requested search result in an appropriate format etc. The analytics interface (104) may be configured to receive full text search query from the user for providing the user with more efficient search options.

The ASDS (102) further includes data (204) and modules (206). In one embodiment, the data (204) can be stored within the memory (114). In one example, the data (204) may include the input datasets (107), the metadata sets (108), the target dataset (109), a plurality of aggregation rules (208), staging datasets (210), configuration files (212), and other data (214). In one embodiment, the data (204) can be stored in the memory (114) in the form of various data structures. Additionally, the aforementioned data can be organized using data models, such as relational or hierarchical data models. The other data (212) may be also referred to as reference repository for storing recommended implementation approaches as reference data. The other data (214) may also store data, including temporary data, temporary files, listing of orphan rows, database connection credentials etc. generated by the modules (206) for performing the various functions of the ASDS (102).

The modules (206) may include, for example, the data profiling module (116), the data processing module (118), the aggregation module (120), and the schema generation module (122). The modules (206) may also comprise other modules (216) to perform various miscellaneous functionalities of the ASDS (102). It will be appreciated that such aforementioned modules may be represented as a single module or a combination of different modules. The modules (206) may be implemented in the form of software performed by the processor, hardware and/or firmware.

In operation, the ASDS (102) receives the input datasets (107) as a first plurality of datasets (107) and the respective metadata sets (108) as input from the data stores (106) for creating a unified schema useful for data analysis. The data profiling module (116) retrieves data value of each cell corresponding to each attribute of the first plurality of inputs datasets (107). Each of the first plurality of datasets (107) comprise one or more attributes and corresponding data values in one or more cells. The data profiling module (116) retrieves attribute information of each attribute from the respective metadata sets (108). For example, the attribute information includes attribute name, data type, data expression, unique value, and missing value information etc. Upon retrieving the attribute information, the data profiling module (116) intersects the data value with the attribute information corresponding to the data value and determines the data profile for data value in each cell of the first plurality of input datasets (107) based on the intersection. The data profiling module (116) further generates an ontological structure of the determined data profile of the first plurality of datasets (107) and stores the ontological structure in a first configuration file for retrieving the data quality information in response to the user request. In one embodiment, the data profiling module (116) enables the user to perform data quality analysis of the first plurality of input datasets (107) by storing the data profile information in the ontological structure in the configuration files (212) for example, a first configuration file so that the user can execute a full text search query on the first configuration file instead of preparing complex queries for data quality analysis. Data quality is a perception or assessment of fitness of input data to serve the purpose in a given context. Data quality assessment can ensure the validity, correctness, accuracy, consistency, completeness, and uniformity of the input data.

As illustrated in Figures 3A and 3B, the ASDS (102) receives an input dataset (ABCD) and the respective metadata set (ABEF) where the metadata set enlists plurality of properties of the attributes of input dataset (ABCD) such as attribute/column name, data type, data expression, data constraints etc. The attribute names of the input dataset are in AB plane, wherein the attributes names are enlisted in a transposed format of the input dataset across the AB plane of the metadata set (ABEF). The data profiling module (116) determines the attribute information from the metadata set (ABEF) and intersects the data value of each cell corresponding to the attribute with the attribute information. Upon intersection, the data profiling module (116) determines data profile for each cell of the attribute and further determines the data profile for all the attributes of the input dataset (ABCD) in the analogous approach. Figure 3C illustrates an exemplary data profile created for one of the attributes of the input where the data profile of each cell of the said attribute are described. The data profile information typically includes but not limited to data type of the cell value, length of data, uniqueness of the cell value, occurrence of null values, typical string patterns etc.

As an example, the input dataset (ABCD) described in Figure 3A has five different attributes such as position_num, job_code, job_family, job name, hire_date. The respective metadata set (ABEF) has five attributes described in five different rows along with the data definition of such attributes as illustrated in Figure 3B. Upon intersection of the data value of each cell of the input dataset (ABCD) with the attribute information of the respective attribute, a data profile is determined as described in Figure 3C, wherein the data profile (BDFG) describes the properties of value residing in each cell of the attribute. Figure 3D illustrates the aforementioned example in a 3D representation structure. The“hire date” column of the input dataset (ABCD) contains different formats in individual cell and value of one cell is not available. Upon intersecting the data value of each cell of the“hire date” attribute of the input dataset with the“hire date” attribute information retrieved from metadata set (ABEF), the data profile plane (BDFG) is created where data profile of each cell of the“hire date” attribute is described in terms of data type, data expression, unique value, and missing value. In one embodiment, the data profiling module (114) determines data profile for each attribute of first plurality of datasets in the analogous approach and generates an ontological structure of the determined data profile of the first plurality of datasets. The data profiling module (116) further stores the ontological structure in the first configuration file to enable the user to perform the data quality assessment by using full-text search capability of the analytics tool. In one embodiment, the data profile module (102) is further configured to generate entity profile and attribute profile of the input dataset separately, wherein entity profile information includes but not limited to entity name, attribute count, row count, attribute sequence etc. and attribute profile information includes but not limited to attribute name, data type, nullability, uniqueness etc.

The ASDS (102) further generates a second plurality of datasets by modifying the first plurality of datasets based on the data profile. In one embodiment, the data processing module (118) is configured to analyse the generated data profile and detect one or more anomalies in the first plurality of input datasets (107). In one example, the one or more anomalies include one or more of invalid data value, orphan rows, invalid attributes, incorrect attribute format etc. The data processing module (118) generates the second plurality of datasets by removing one or more detected anomalies from the first plurality of input datasets (107). In one embodiment, the ASDS (102) generates the second plurality of datasets by cleansing the detected anomalies in the first plurality of input datasets and stores a copy of cleaned first plurality of datasets as the second plurality of datasets in the staging datasets (210), wherein the data profiling includes entity profiling and attribute profiling. The ASDS (102) stores plurality of predefined rules for data cleansing that are executed upon detection of data anomalies in the first plurality of input datasets to generate the second plurality of datasets.

In another embodiment, the data processing module (118) cleanses the first plurality of input datasets by executing the predefined rules and generates a clean data based upon the entity profiling. In the process of data cleansing, the data processing module (118) detects, corrects or removes the corrupt or inaccurate records from an input dataset. The data cleansing steps can further include handling missing data, fixing structural errors, removing duplicates, removing unwanted characters, resolving syntax errors, fixing typos etc. The data processing module (118) stores the second plurality of datasets in the staging datasets (210) as one copy of the clean input datasets after entity profiling.

Further, the ASDS (102) determines one or more relationship profiles between the second plurality of input datasets. The aggregation module (120) performs pairwise comparison between the second plurality of datasets and determines the one or more relationship profile based on identification of one or more common attributes between the second plurality of datasets in response to the pairwise comparison. The aggregation module (120) identifies the one or more common attributes of the second plurality of datasets that are same in at least one dataset pair by using pairwise comparison. In one example, when each dataset pair of the second plurality of datasets comprises more than one common attributes, the aggregation module (120) identifies at least one common attribute that uniquely identifies data records of the dataset pair and the identified at least one common attribute is used for generation of the unified schema. Upon identifying the common attributes, the aggregation module (120) determines a relationship or association among the second plurality of datasets based on the common attributes and generates one or more relationship profiles of the second plurality of datasets. The relationship profile, in one example, may comprise one or more referential keys such as primary key and foreign key establishing or defining a relation among the second plurality of datasets. The referential keys or referential association is used for the datasets having same domain of attributes or datasets belonging to same database or data source. In another example, the relationship data for datasets having distinct domain of attributes may be an association formed by a virtue of characterizing relationship among the second plurality of datasets. The aggregation module (120) further stores the determined relationship profile in the staging dataset (210).

The aggregation module (120) further aggregates the second plurality of datasets to determine orphan rows by using the determined relationship profile. In the process of aggregation, the aggregation module (120) executes the predefined aggregation rules (208) to identify the orphan rows. In one example, the aggregation rules (208) can determine the inner difference, outer difference, and full difference between a dataset pair. The data processing module (118) modifies the second plurality of datasets by removing the orphan rows and generates a clean plurality of datasets. Upon modifying or cleansing the second plurality of datasets, the data processing module (118) further validates the modified second plurality of datasets and stores the modified second plurality of datasets in the target dataset (109) In one embodiment, the data processing module (118) validates or verifies the modified second plurality of datasets based on one or more parameters including such as no null in mandatory attribute, uniform date format, uniform numeric format, no unwanted row, no unwanted attribute, no unwanted characters, validate join conditions, validate hierarchies etc. Further, the schema generation module (122) retrieves the second plurality of datasets from the staging dataset (210) having the at least one relationship profile and intersects each dataset of the second plurality of datasets with another dataset of the second plurality of datasets across the at least one common attribute indicated by the at least one relationship profile. The schema generation module (122) aggregates the second plurality of datasets to create the unified schema based on the intersection of the second plurality of datasets. Upon creation of the unified schema, the schema generation module (122) generates an ontological structure of the unified schema and stores the ontological structure in a second configuration file of the configuration files (212) so that the user can select the desired attributes for data analysis by performing full-text search instead of designing complex queries. In one embodiment, the schema generation module (122) generates a materialized view of the unified schema based upon the user provided search criteria and provides the user with options to customize the materialized view. Materialized view is a type of database object which contains result of a query and is used in data warehousing scenarios, where frequent queries of the actual dataset is expensive.

The ASDS (102) further dynamically updates the data profile upon addition of new records in the first plurality of input datasets (107). In one embodiment, if new records are added to the first plurality of input datasets (107), the data profiling module (116) automatically creates a new data profile by intersecting the data value of each cell of the new records with the attribute information of data value and updates the first configuration file (214) with the new data profile information.

In one example, the unified schema for the plurality of datasets XI, X2, Y1 and Y2 is represented as shown in Figure 4A. The input dataset XI may be a Job Application table comprising four data fields such as Job Application ID, Applicant ID, Requisition Identity and Start Date (DT). The input dataset X2 may be Job position table comprising five data fields such as Position Number (NUM), Job Code (CD), Location Code (CD), Salary Grade, and Open Date (DT). The input dataset Y1 may be a job requisition table with six data fields comprising Recruiter Identity, Department code, location code, requisition identity, Position Number, Open Date. The input dataset Y2 may be a Job requirement table comprising five data fields such as Start Date (DT), Job code (CD), Job Role ID, Competency ID, Active Date (DT).

The unified schema may be a two-dimensional representation of the plurality of data fields with rows and columns related to each other and is viewed as a three-dimensional schema view as illustrated in Figure 4B, with one or more records of the unified schema. In one embodiment, each record of the unified schema is a projection of unified schema, and any two records of the plurality of datasets are related as per the relationship data among the one or more data fields of the unified schema. Each record of the unified schema may be retrieved from target dataset (109) and is viewed as the three-dimensional view with XY plane consisting of unified schema and XZ plane consisting of data records of the first plurality of input datasets (107) as aggregated into the target dataset (109). The unified schema, in one embodiment, is extendable by adding rows and columns. The unified schema may be automatically extended by addition of rows and columns without performing alteration or changes to the existing structure of unified schema. The data records can be directly accessed and processed for performing various analytics without changing the unified schema. Thus, the system eliminates the need of schema modification while switching to various data analytics and thereby provides a unified schema for distinct analytics types.

In some embodiments, the ASDS (102) enables one or more analytics interfaces (104) to perform one or more types of data analytics such as descriptive analytics, predictive analytics and prescriptive analytics on the unified schema without modifying or changing original schema definition. In one example, the analytics interface- 1 (104-1) performs descriptive analytics where the OLAP (Online Analytical Processing) style data analysis is performed on one or more rows of the unified schema. The outcome of such analysis includes count, sum, averages, ratios or other aggregates of the filtered attributes of the plurality of input datasets. In another example, the analytics interface-2 (104-2) performs predictive analytics such as computing various measures of a future outcome by processing one or more attributes of the unified schema. In yet another example, the analytics interface-N (104-N) performs prescriptive analytics for recommending one or more possible course of action by analysing one or more attributes collated in same data plane where the collated data is generated by the unified schema.

Figure 5 illustrates a flowchart showing a method for managing a unified schema for a plurality of datasets of an enterprise in accordance with some embodiments of the present disclosure.

As illustrated in Figure 5, the method (500) comprises one or more blocks implemented by the processor (112) for generating unified schema for distinct analytics and enabling execution of distinct analytics without modifying the unified schema. The method (500) may be described in the general context of computer executable instructions. Generally, computer executable instructions can include routines, programs, objects, components, data structures, procedures, modules, and functions, which perform particular functions or implement particular abstract data types.

The order in which the method (500) is described is not intended to be construed as a limitation, and any number of the described method blocks can be combined in any order to implement the method (500). Additionally, individual blocks may be deleted from the method (500) without departing from the spirit and scope of the subject matter described herein. Furthermore, the method (500) can be implemented in any suitable hardware, software, firmware, or combination thereof.

At block (502), a first plurality of input datasets and respective metadata sets are received. In one embodiment, the data profiling module (116) receives the first plurality of input datasets (107) of an organization from the source data stores (106) and receives respective metadata sets (108) of the first plurality of input datasets (107) from the source data stores (106). The metadata set (108) of each of the plurality of input datasets (107) comprise one or more data attributes.

At block (504), data profile for each of the first plurality of datasets (107) is created based on the respective metadata set. The data profiling module (116) retrieves data value of each cell corresponding to each attribute of the first plurality of inputs datasets (107), wherein each of the first plurality of datasets (107) comprise one or more attributes and corresponding data values in one or more cells. The data profiling module (116) further retrieving attribute information of each attribute from the respective metadata sets and intersects the data value with the attribute information corresponding to the data value. Upon intersection, the data profiling module (116) determines the data profile for data value in each cell of the first plurality of input datasets based on the intersection. The data profiling module (116) generates an ontological structure of the attribute wise data profile of the first plurality of datasets as determined. Therefore, the number of generated data profile is same as the number of attributes of the first plurality of dataset. The data profiling module (116) further stores the ontological structure in a first configuration file for retrieving the data quality information in response to the user request.

At block (506), a second plurality of datasets is generated by modifying the first plurality of datasets based on the data profile. The data processing module (118) detects one or more anomalies in the first plurality of input datasets (107) and modifies the first plurality of input datasets (107) by removing the one or more detected anomalies from the first plurality of input datasets (107) so as to clean the first plurality of input datasets (107) thereby generating the second plurality of datasets. Data cleansing is performed based on the entity profile, attribute profile of the first plurality of input datasets (107).

At block (508), at least one relationship profile between the second plurality of datasets is determined by pairwise comparison. In one embodiment, the aggregation module (120) performs pairwise comparison between the second plurality of datasets and determines the at least one relationship profile based on identification of at least one common attribute between the second plurality of datasets upon the pairwise comparison. The aggregation module (120) further aggregates the second plurality of datasets by using plurality of aggregation rules (208) to determine orphan rows. The data processing module (118) modifies the second plurality of datasets by removing the orphan rows and generates a cleaned second plurality of datasets. Upon cleansing the second plurality of datasets, the data processing module (118) further verifies the cleaned second plurality of datasets and stores the clean aggregated data in the target dataset as one cleansed copy of the first plurality of input datasets after entity profiling and relationship profiling. The aggregation module (120) further stores the determined relationship profile in the staging dataset (210).

At block (510), a unified schema is created based upon the at least one relationship profile between the second plurality of input datasets. The schema generation module (122) retrieves the second plurality of datasets from the staging dataset (210) having the at least one relationship profile and intersects each dataset of the second plurality of datasets with another dataset of the second plurality of datasets across the at least one common attribute indicated by the at least one relationship profile. The schema generation module (122) further aggregates the second plurality of datasets to create the unified schema based on the intersection of the second plurality of datasets and stores the created unified schema in a second configuration file supported by ontological structure.

Further, the ASDS (102) creates at least one materialized view of the unified schema based on a user request. The schema generation module (122) retrieves the unified schema from the second configuration file and receives the user selection of one or more attributes. The schema generation module (122) further validates relationship profile automatically based on the user selection and updates the materialized view. Figure 6 illustrates an exemplary illustration (600) of different data flow phases of ASDS (102) in accordance with an embodiment of the present disclosure. At phase (602), three input datasets and three respective metadata sets (dotted notation) are received. Each input dataset is intersected with respective metadata set at phase (604), thereby performing three intersections of datasets. At phase (606), data profile sets are displayed where data profile for each attribute of each dataset is created. The created data profile includes entity profile and attribute profile of the respective dataset. The ASDS (102) enables the user to perform data quality assessment wherein the user requested query is executed upon the data profile sets. Upon creation of data profile, a second set of three input datasets are generated by modifying the three input datasets based on the data profile at phase (608) where the data modification includes data cleansing. At phase (610), relationship profile between three input datasets are determined by pairwise comparison and mapping of similar attributes between the input datasets is determined. The three input datasets are further compared at phase (612) for determining data difference by identifying orphan rows from all the three input datasets. At phase (614), the second set of three input datasets are further modified to create a clean data based on the relationship profile wherein the data modification includes removing redundant attribute, removing orphan rows etc. The ASDS (102) creates the unified schema at phase (616) based upon the relationship profile between the second set of three input datasets. At phase (618), the ASDS (102) receives plurality of search requests from user and creates materialized view for such requests by analysing the unified schema. The plurality of search request may include search for data quality, search for join query, search for OLAP (Online Analytical Processing) analysis, search for statistical, predictive, and prescriptive analytics etc.

In an example, four datasets XI, X2, Yl, and Y2 which are related to one another as described in Figure 4A. If the ASDS (102) receives a search request such as“ search the applicants based on location CL007 and their competency” from the user, the ASDS (102) creates an instance of a unified schema by determining the join conditions to serve the search request i.e. (XI + X2 + Yl. Location + Y2. Competency) and prepares a composite data view as illustrated in Figure 7A. The ASDS (102) performs a row aggregation on the composite data view wherein all common data objects across the participating datasets may act as selection criteria. In the cited example, the selection criteria for row aggregation is the location cd i.e. CL007, thereby producing the result set as illustrated in Figure 7B, where two applicants along with the respective position number and competency CD or ID are displayed. In another example, the unified schema enables the user to receive desired out by fast column aggregation process as illustrated in Figures 8A - 8C. Figure 8A describes a unified schema of five related datasets i.e. XI, X2, Yl, Y2, Y3 intersected across plurality of common attributes. Figure 8B illustrates a composite data view of Y3 having plurality of attributes. If the ASDS (102) receives a search request such as“ search the personality type of the applicant” from the user, the ASDS (102) executes the column regression operation on the composite data to generate the desires result. In the cited example, Y3 dataset has 76 attributes which are responsible for determining the personality type of an applicant. Therefore, for each applicant data value of the 76 attributes are analysed and aggregated to a single result i.e. personality type based upon the composite data as illustrated in Figure 8C, wherein the requirement of complex join queries are eliminated as the composite data itself is a fully qualified document to serve the user requested query.

In an embodiment, the computer system (900) may be the ASDS (102), which is used for creating and managing a unified schema for data analysis of distinct data. The computer system (900) may include a central processing unit (“CPU” or“processor”) (902). The processor (902) may comprise at least one data processor for executing program components for executing user or system-generated business processes. The processor (902) may include specialized processing units such as integrated system (bus) controllers, memory management control units, floating point units, graphics processing units, digital signal processing units, etc.

The processor (902) may be disposed in communication with one or more input/output (I/O) devices (904 and 906) via I/O interface (908). The I/O interface (908) may employ communication protocols/methods such as, without limitation, audio, analog, digital, stereo, IEEE-1394, serial bus, Universal Serial Bus (USB), infrared, PS/2, BNC, coaxial, component, composite, Digital Visual Interface (DVI), high-definition multimedia interface (HDMI), Radio Frequency (RF) antennas, S-Video, Video Graphics Array (VGA), IEEE 802.n /b/g/n/x, Bluetooth, cellular (e.g., Code-Division Multiple Access (CDMA), High-Speed Packet Access (HSPA+), Global System For Mobile Communications (GSM), Long-Term Evolution (LTE) or the like), etc.

Using the I/O interface (908), the computer system (900) may communicate with one or more I/O devices (904 and 906). In some implementations, the processor (902) may be disposed in communication with a communication network (110) via a network interface (910). The network interface (910) may employ connection protocols including, without limitation, direct connect, Ethernet (e.g., twisted pair 10/100/1000 Base T), Transmission Control Protocol/Internet Protocol (TCP/IP), token ring, IEEE 802.11a/b/g/n/x, etc. Using the network interface (910) and the network (110), the computer system (900) may be connected to the source data stores (106), the target dataset (109), the ASDS (102) and the analytics interface (104).

The network (110) can be implemented as one of the several types of networks, such as intranet or any such wireless network interfaces. The network (110) may either be a dedicated network or a shared network, which represents an association of several types of networks that use a variety of protocols, for example, Hypertext Transfer Protocol (HTTP), Transmission Control Protocol/Internet Protocol (TCP/IP), Wireless Application Protocol (WAP), etc., to communicate with each other. Further, the communication network (110) may include a variety of network devices, including routers, bridges, servers, computing devices, storage devices, etc.

In some embodiments, the processor (902) may be disposed in communication with a memory (912) e.g., RAM (914), and ROM (916), etc. as shown in Figure 9, via a storage interface (918). The storage interface (918) may connect to the memory (912) including, without limitation, memory drives, removable disc drives, etc., employing connection protocols such as Serial Advanced Technology Attachment (SATA), Integrated Drive Electronics (IDE), IEEE-1394, Universal Serial Bus (USB), fibre channel, Small Computer Systems Interface (SCSI), etc. The memory drives may further include a drum, magnetic disc drive, magneto optical drive, optical drive, Redundant Array of Independent Discs (RAID), solid-state memory devices, solid-state drives, etc.

The memory (912) may store a collection of program or database components, including, without limitation, user/application data (920), an operating system (922), a web browser (924), a mail client (926), a mail server (928), a user interface (930), and the like. In some embodiments, computer system (900) may store user/application data (920), such as the data, variables, records, etc. as described in this invention. Such databases may be implemented as fault-tolerant, relational, scalable, secure databases such as Oracle or Sybase. The operating system (922) may facilitate resource management and operation of the computer system (900). Examples of operating systems include, without limitation, Apple Macintosh™ OS X™, UNIX™, Unix-like system distributions (e.g., Berkeley Software Distribution (BSD), FreeBSD™, Net BSD™, Open BSD™, etc.), Linux distributions (e.g., Red Hat™, Ubuntu™, K-Ubuntu™, etc.), International Business Machines (IBM™) OS/2™, Microsoft Windows™ (XP™, Vista/7/8, etc.), Apple iOS™, Google Android™, Blackberry™ Operating System (OS), or the like. A user interface may facilitate display, execution, interaction, manipulation, or operation of program components through textual or graphical facilities. For example, user interfaces may provide computer interaction interface elements on a display system operatively connected to the computer system (900), such as cursors, icons, check boxes, menus, windows, widgets, etc. Graphical User Interfaces (GUIs) may be employed, including, without limitation, Apple™ Macintosh™ operating systems’ Aqua™, IBM™ OS/2™, Microsoft™ Windows™ (e.g., Aero, Metro, etc.), Unix X-Windows™, web interface libraries (e.g., ActiveX, Java, JavaScript, AJAX, HTML, Adobe Flash, etc.), or the like.

The instant invention provides the technical solution to the technical problem of the existing data analytics system by integrating the input datasets that may be related to each other by means of plurality of common data points. The system eliminates the physical change in schema and structure of data while toggling from descriptive analytics to either predictive or prescriptive analytics by proving a unified schema and cleansed aggregated dataset to operate on. The system further eliminates the expensive, cumbersome and time-consuming process fraught with number of inefficiencies, discontinuities, and redundancies while interchanging row-efficient structure to column-efficient structure required for different data analytics approach by providing an aggregated cleansed input data collated in the same plane by using the unified schema. Further, the system can reduce substantial time and efforts that are required to structure and prepare differently by automatically structuring the new input data records in different datasets into an aggregated dataset. Therefore, the system of the instant invention provides an automated, robust, highly scalable, time-efficient platform to the one or more enterprises for performing different types of data analysis in a hassle-free manner.

The illustrated steps are set out to explain the exemplary embodiments shown, and it should be anticipated that ongoing technological development will change the manner in which particular functions are performed. These examples are presented herein for purposes of illustration, and not limitation. Further, the boundaries of the functional building blocks have been arbitrarily defined herein for the convenience of the description. Alternative boundaries can be defined so long as the specified functions and relationships thereof are appropriately performed. Alternatives (including equivalents, extensions, variations, deviations, etc., of those described herein) will be apparent to persons skilled in the relevant art(s) based on the teachings contained herein. Such alternatives fall within the scope and spirit of the disclosed embodiments. Also, the words "comprising," "having," "containing," and "including," and other similar forms are intended to be equivalent in meaning and be open ended in that an item or items following any one of these words is not meant to be an exhaustive listing of such item or items or meant to be limited to only the listed item or items. It must also be noted that as used herein and in the appended claims, the singular forms“a,”“an,” and“the” include plural references unless the context clearly dictates otherwise.

Furthermore, one or more computer-readable storage media may be utilized in implementing embodiments consistent with the present disclosure. A computer-readable storage medium refers to any type of physical memory on which information or data readable by a processor may be stored. Thus, a computer-readable storage medium may store instructions for execution by one or more processors, including instructions for causing the processor(s) to perform steps or stages consistent with the embodiments described herein. The term“computer-readable medium” should be understood to include tangible items and exclude carrier waves and transient signals, i.e., are non-transitory. Examples include random access memory (RAM), read-only memory (ROM), volatile memory, non-volatile memory, hard drives, CD ROMs, DVDs, flash drives, disks, and any other known physical storage media.

Finally, the language used in the specification has been principally selected for readability and instructional purposes, and it may not have been selected to delineate or circumscribe the inventive subject matter. Accordingly, the disclosure of the embodiments of the disclosure is intended to be illustrative, but not limiting, of the scope of the disclosure.

With respect to the use of substantially any plural and/or singular terms herein, those having skill in the art can translate from the plural to the singular and/or from the singular to the plural as is appropriate to the context and/or application. The various singular/plural permutations may be expressly set forth herein for sake of clarity.

Claims

The Claim:

1. A method for managing a unified schema for a plurality of datasets of an enterprise, comprising:

receiving a first plurality of datasets and respective metadata sets as input from at least one source directory system;

creating a data profile for each of the first plurality of dataset based on the respective metadata set;

generating a second plurality of datasets by modifying the first plurality of datasets based on the data profile;

determining at least one relationship profile between the second plurality of datasets by pairwise comparison; and

creating the unified schema based upon the at least one relationship profile determined between the second plurality of input datasets.

2. The method as claimed in claim 1, wherein creating the data profile comprises steps of:

retrieving data value of each cell corresponding to each attribute of the first plurality of inputs datasets, wherein each of the first plurality of datasets comprise one or more attributes and corresponding data values in one or more cells;

retrieving attribute information of each attribute from the respective metadata sets, wherein the attribute information includes one or more of attribute name, data type, data expression, unique value, and missing value information;

intersecting the data value with the attribute information corresponding to the data value to determine the data profile for the data value, wherein data profile is indicative of data quality of each cell;

generating an ontological structure of the determined data profile of the first plurality of datasets; and

storing the ontological structure in a first configuration file for retrieving the data quality information in response to a user request.

3. The method as claimed in claim 1, wherein modifying the first plurality of input datasets comprising:

detecting one or more anomalies in the first plurality of input datasets, wherein the one or more anomalies include one or more of invalid data value, orphan rows, invalid attributes, and incorrect attribute format; and

modifying the first plurality of input datasets by removing the one or more detected anomalies from the first plurality of input datasets to generate the second plurality of datasets.

4. The method as claimed in claim 1, wherein determining at least one relationship profile comprising steps of:

performing pairwise comparison between the second plurality of datasets; determining the at least one relationship profile based on identification of at least one common attribute between the second plurality of datasets in response to the pairwise comparison;

aggregating the second plurality of datasets to determine orphan rows;

modifying the second plurality of datasets by removing the orphan rows; and storing the modified second plurality of datasets with the determined relationship profile in a staging dataset.

5. The method as claimed in claim 1, wherein creating the unified schema comprises steps of:

retrieving the second plurality of datasets from the staging dataset having the at least one relationship profile;

intersecting each dataset of the second plurality of datasets with another dataset of the second plurality of datasets across the at least one common attribute indicated by the at least one relationship profile;

aggregating the second plurality of datasets to create the unified schema based on the intersection of the second plurality of datasets; and

storing the created unified schema in a second configuration file.

6. The method as claimed in claim 1, further comprising managing the unified schema by steps comprising:

creating at least one materialized view of the unified schema based on the user request;

retrieving the unified schema from the second configuration file; receiving selection of one or more attributes from the user as input; automatically validating the at least one relationship profile based on the selection of the one or more attributes; and

updating the materialized view based on the validation.

7. The method as claimed in claim 1, further comprising dynamically updating the data profile based upon addition of new records in the first plurality of input datasets.

8. A system for managing a unified schema for a plurality of datasets of an enterprise, the system comprising:

a processor (112);

one or more analytics interfaces (104) coupled to the processor (112);

a memory (114) communicatively coupled with the processor (112), wherein the memory (114) stores processor-executable instructions, which on execution, cause the processor (112) to:

receive a first plurality of datasets (107) and respective metadata sets (108) as input from at least one source directory system coupled to the processor (112);

create a data profile for each of the first plurality of dataset based on the respective metadata set;

generate a second plurality of datasets by modifying the first plurality of datasets based on the data profile;

determine at least one relationship profile between the second plurality of datasets by pairwise comparison; and

create the unified schema based upon the at least one relationship profile determined between the second plurality of input datasets.

9. The system as claimed in claim 8, wherein the processor is configured to create the data profile by steps comprising:

retrieving data value of each cell corresponding to each attribute of the first plurality of inputs datasets (107), wherein each of the first plurality of datasets (107) comprise one or more attributes and corresponding data values in one or more cells; retrieving attribute information of each attribute from the respective metadata sets, wherein the attribute information includes one or more of attribute name, data type, data expression, unique value, and missing value information;

generating an ontological structure of the determined data profile of the first plurality of datasets (107); and

storing the ontological structure in a first configuration file (214) for retrieving the data quality information in response to a user request.

10. The system as claimed in claim 8, wherein the processor is configured to modify the first plurality of input datasets (107), by:

detecting one or more anomalies in the first plurality of input datasets (107), wherein the one or more anomalies include one or more of invalid data value, orphan rows, invalid attributes, and incorrect attribute format; and

modifying the first plurality of input datasets (107) by removing the one or more detected anomalies from the first plurality of input datasets to generate the second plurality of datasets.

11. The system as claimed in claim 8, wherein the processor is configured to determine the at least one relationship profile, by:

aggregating the second plurality of datasets to determine orphan rows;

modifying the second plurality of datasets by removing the orphan rows; and storing the modified second plurality of datasets with the determined relationship profile in a staging dataset (210).

12. The system as claimed in claim 8, wherein the processor is configured to create the unified schema, by:

retrieving the second plurality of datasets from the staging dataset (210) having the at least one relationship profile;

aggregating the second plurality of datasets to create the unified schema based on the intersection of the second plurality of datasets; and storing the created unified schema in a second configuration file (214).

13. The system as claimed in claim 8, wherein the processor is further configured to manage the unified schema, by:

creating at least one materialized view of the unified schema based on a user request;

retrieving the unified schema from the second configuration file (214);

receiving selection of one or more attributes from the user as input; automatically validating the at least one relationship profile based on the selection of the one or more attributes; and

updating the materialized view based on the validation.

14. The system as claimed in claim 8, wherein the processor is further configured to dynamically update the data profile based upon addition of new records in the first plurality of input datasets.