WO2022269573A1 - System and method facilitating monitoring of data quality

Info

Publication number
WO2022269573A1
Authority
WO
WIPO (PCT)
Prior art keywords
data
quality module
data packets
corrupt
metrics
Application number
PCT/IB2022/055911
Other languages
French (fr)
Inventor
Shikhar Srivastava
Shivam Runthala
Sunny Jain
Ujjal Kumar Mandal
Raghuram Velega
Original Assignee
Jio Platforms Limited
Application filed by Jio Platforms Limited
Publication of WO2022269573A1


Classifications

    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L43/00 Arrangements for monitoring or testing data switching networks
    • H04L43/02 Capturing of monitoring data
    • H04L43/028 Capturing of monitoring data by filtering
    • H04L43/04 Processing captured monitoring data, e.g. for logfile generation
    • H04L43/045 Processing captured monitoring data for graphical visualisation of monitoring data
    • H04L43/08 Monitoring or testing based on specific metrics, e.g. QoS, energy consumption or environmental parameters
    • H04L43/0823 Errors, e.g. transmission errors
    • H04L43/12 Network monitoring probes

Definitions

  • The embodiments of the present disclosure generally relate to systems and methods that facilitate enhancing a big data ecosystem. More particularly, the present disclosure relates to a system and method for facilitating end-to-end data completeness.
  • a data ecosystem is a collection of infrastructure, analytics, and applications used to capture and analyze data.
  • Data ecosystems provide entities with the data they rely on to understand their users and to make better operational decisions.
  • the term ecosystem is used rather than ‘environment’ because, like real ecosystems, data ecosystems are intended to evolve over time.
  • Data ecosystems are for capturing data to produce useful insights. As users use products, especially digital ones, they leave data trails. Entities can create a data ecosystem to capture and analyze these data trails so product teams can determine what their users like, don't like, and respond well to. Product teams can use the insights to tweak features and improve the product.
  • Ecosystems were originally referred to as information technology environments. They were designed to be relatively centralized and static. The birth of the web and cloud services has changed that.
  • Data ecosystems are data environments that are designed to evolve.
  • Every entity creates its own ecosystem, sometimes referred to as a technology stack, and fills it with a patchwork of hardware and software to collect, store, analyze, and act upon the data.
  • the best data ecosystems are built around a product analytics platform that ties the ecosystem together.
  • Analytics platforms help teams integrate multiple data sources, provide machine learning tools to automate the process of conducting analysis, and track user cohorts so teams can calculate performance metrics.
  • With the digital revolution, the resultant data footprint, such as feeds and logs pertaining to different streams in an entity, needs to be ingested into the big data ecosystem.
  • An object of the present disclosure is to provide for a system and method to facilitate ease in ingestion and analytical processing of data in big data eco-system.
  • An object of the present disclosure is to provide for a system and method that enables end-to-end data completeness checks on the data pipelines.
  • An object of the present disclosure is to provide for a system and method that provides insights into how data flows from one system to another, alerts the operations team to anomalous behaviour, and supports root-cause analysis (RCA) of such behaviour.
  • An object of the present disclosure is to provide for a system and method that provides an automated pipeline for logging errors and exceptions thrown either by the processor or by any application workflows deployed on it.
  • The present disclosure relates to systems and methods that facilitate enhancing a big data ecosystem. More particularly, the present disclosure relates to a system and method for facilitating end-to-end data completeness.
  • A system and method for monitoring the quality of a set of data packets of a big data ecosystem by one or more first computing devices may include a data quality module comprising one or more processors coupled with a memory that may store instructions which, when executed by the one or more processors, cause the system to receive the set of data packets from the one or more first computing devices.
  • the set of data packets may pertain to data from filesystem-based sources.
  • The system may then extract, by the data quality module, a set of attributes from the received set of data packets, the set of attributes pertaining to the quality of the data received, and then identify, by the data quality module, one or more corrupt data packets based on the extracted set of attributes.
  • The data quality module may log the one or more corrupt data packets into a first queue and then auto-analyse the logged corrupt data packets to determine one or more errors or exceptions leading to the corruption.
  • the method for monitoring quality of a set of data packets of a big data eco-system by one or more first computing devices may include the steps of receiving, at a data quality module coupled to one or more processors, the set of data packets pertaining to data from filesystem based sources and extracting, by the data quality module, a set of attributes from the set of data packets received, the set of attributes pertaining to the quality of data received.
  • The method may further include the step of identifying, by the data quality module, one or more corrupt data packets based on the extracted set of attributes and the step of logging, by the data quality module, the one or more corrupt data packets into a first queue.
  • The method may further include the step of auto-analysing, by the data quality module, the logged corrupt data packets to determine one or more errors or exceptions leading to the corruption.
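  • As an illustration of the receive / extract / identify / log / auto-analyse flow summarised above, a minimal Python sketch is given below. The record fields, validation rules and in-memory queue are assumptions made for illustration only, not the specific implementation of the disclosure.

```python
import json
import queue
from datetime import datetime

# Stand-in for the "first queue" into which corrupt packets are logged;
# in practice this could be a topic, a table or any durable queue.
corrupt_queue = queue.Queue()

def extract_attributes(raw_record: bytes) -> dict:
    """Extract quality-related attributes: size, parseability, missing mandatory fields."""
    attrs = {"size_bytes": len(raw_record), "parseable": True, "missing_fields": []}
    try:
        payload = json.loads(raw_record)
    except (ValueError, UnicodeDecodeError):
        attrs["parseable"] = False
        return attrs
    for field in ("event_time", "source_id", "payload"):  # assumed mandatory fields
        if field not in payload:
            attrs["missing_fields"].append(field)
    return attrs

def is_corrupt(attrs: dict) -> bool:
    return (not attrs["parseable"]) or bool(attrs["missing_fields"]) or attrs["size_bytes"] == 0

def monitor(records) -> None:
    """Receive records, extract attributes and log corrupt ones for later auto-analysis."""
    for raw in records:
        attrs = extract_attributes(raw)
        if is_corrupt(attrs):
            corrupt_queue.put({"received_at": datetime.utcnow().isoformat(),
                               "attributes": attrs,
                               "raw": raw})

monitor([b'{"event_time": "2022-06-24T10:00:00", "source_id": "s1", "payload": {}}',
         b'not-json'])
print(f"corrupt records queued: {corrupt_queue.qsize()}")
```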
  • FIG. 1 illustrates an exemplary network architecture in which or with which the data quality module of the present disclosure can be implemented, in accordance with an embodiment of the present disclosure.
  • FIG. 2A illustrates an exemplary representation of the data quality module/centralized server for accessing content stored in a network, in accordance with an embodiment of the present disclosure.
  • FIG. 2B illustrates an exemplary representation of a proposed method associated with the data quality module, in accordance with an embodiment of the present disclosure.
  • FIG. 3A-3B illustrate exemplary representations of data quality platform for facilitating monitoring of data flows, sampling and measuring the correctness and completeness of datasets, in accordance with an embodiment of the present disclosure.
  • FIG. 4 illustrates an exemplary representation of a flow diagram of a Data logistics platform Processor and connection metric collection method (400), in accordance with an embodiment of the present disclosure.
  • FIG. 5 illustrates an exemplary block diagram representation of a Data logistics platform Processor and flow of data, in accordance with an embodiment of the present disclosure.
  • FIG. 6 illustrates an exemplary block diagram representation (600) of Data logistics platform error logging, in accordance with an embodiment of the present disclosure.
  • FIG. 7 illustrates an exemplary block diagram representation (700) of spark stats collection, in accordance with an embodiment of the present disclosure.
  • FIG. 8 illustrates an exemplary block diagram representation (800) of distributed event streaming platform stats collection, in accordance with an embodiment of the present disclosure.
  • FIG. 9 illustrates an exemplary block diagram representation (900) of a consolidated report generation, in accordance with an embodiment of the present disclosure.
  • FIGs. 10A-10I illustrate exemplary representations of implementation results associated with data quality module, in accordance with an embodiment of the present disclosure.
  • FIG. 11 illustrates an exemplary block diagram representation (1100) of the system architecture, in accordance with an embodiment of the present disclosure.
  • FIG. 12 illustrates an exemplary computer system in which or with which embodiments of the present invention can be utilized in accordance with embodiments of the present disclosure.
  • The present invention provides a robust and effective solution to an entity or an organization by allowing it to visualize the flow of data from a source system to a sink system with different metric counters at each step, to set up automatic alerts if something goes wrong and to perform RCA for the same, and to create profiling reports for the datasets to check the quality of data being ingested.
  • Metrics are collected across different components of the data pipeline, which helps in building an end-to-end completeness check for the ingestion.
  • the metrics may be captured without any serious overhead on the performance of the existing jobs and anomalies may be found as soon as possible.
  • The proposed method reduces the number of counter queries required on the production environment, as these counts are captured as part of the metrics themselves, in turn reducing unnecessary load.
  • FIG. 1 illustrates an exemplary network architecture (100) in which or with which data ingestion module (110) of the present disclosure can be implemented, in accordance with an embodiment of the present disclosure.
  • The exemplary architecture (100) includes a data quality module (110) equipped with a machine learning engine (214) (also referred to as data quality module 110 hereinafter) for facilitating ingestion and processing of a set of data packets from one or more first computing devices (102) to be stored in one or more second computing devices (104).
  • The set of data packets may be big data, but is not limited thereto.
  • the one or more first computing devices (102) may include a plurality of Distributed Source Systems such as Distributed event streaming platform, the Hadoop Distributed File System (HDFS) and the like.
  • the one or more second computing devices (104) may include a plurality of Distributed Storage Systems such as Elasticsearch, Hive, HDFS but not limited to the like with pluggable transformation and customized throughput, rate control, throttle and embedded Fault Tolerance.
  • the set of data packets may be ingested by the first computing device (102) in any format.
  • the data quality module (110) may be coupled to a centralized server (112).
  • the data quality module (110) may also be operatively coupled to one or more first computing devices (102) and one or more second computing devices (104) through a network (106).
  • the data quality module (110) may receive the set of data packets from the first computing devices (102).
  • The data quality module (110) may be further configured to process the received set of data packets based on a set of predefined configuration parameters.
  • The data quality module (110) may further be configured to process the set of data packets received through a plurality of processing logic modules. In an embodiment, the processing of the set of data packets may start as soon as any new log file is generated at the centralized server (112).
  • the data quality module (110) may further identify and log corrupt data present in the set of data packets received into a first queue.
  • the set of data packets after being analysed and processed may be written into one or more second computing devices (104) (also referred to as sinks (104)).
  • the data quality module (110) may be configured to automate warning signals based on the corrupt and erroneous data identified in the pipeline.
  • The data quality module (110) may be configured to update the database each time, and an end-to-end data completeness check for each data pipeline can be enabled and visualized by the data quality module (110), which may send the processed set of data packets to the second computing device (104).
  • the data quality module (110) may be configured to create an automated pipeline for data sampling and data profiling.
  • The one or more first computing devices (102) and the one or more second computing devices (104) may communicate with the data ingestion module (110) via a set of executable instructions residing on any operating system, including but not limited to Android™, iOS™, KaiOS™, and the like.
  • The one or more second computing devices (104) may include, but are not limited to, any electrical, electronic, electro-mechanical equipment or a combination of one or more of the above devices, such as a mobile phone, smartphone, virtual reality (VR) device, augmented reality (AR) device, laptop, general-purpose computer, desktop, personal digital assistant, tablet computer, mainframe computer, or any other computing device, wherein the computing device may include one or more in-built or externally coupled accessories including, but not limited to, a visual aid device such as a camera, an audio aid, a microphone, a keyboard, input devices for receiving input from a user such as a touch pad, touch-enabled screen or electronic pen, receiving devices for receiving any audio or visual signal in any range of frequencies, and transmitting devices that can transmit any audio or visual signal in any range of frequencies.
  • The one or more first computing devices (102) and the one or more second computing devices (104) are not restricted to the mentioned devices, and various other devices may be used.
  • a smart computing device may be one of the appropriate systems for storing data and other private/sensitive information.
  • The data quality module (110) or the centralized server (112) may be one of the appropriate systems for storing data and other private/sensitive information.
  • The data quality module (110) or the centralized server (112) may include one or more processors coupled with a memory, wherein the memory may store instructions which, when executed by the one or more processors, may cause the system to access content stored in a network.
  • FIG. 2A with reference to FIG. 1, illustrates an exemplary representation of data quality module (110) /centralized server (112) for facilitating real time event data feeds, in accordance with an embodiment of the present disclosure.
  • the data quality module (110) /centralized server (112) may comprise one or more processor(s) (202).
  • the one or more processor(s) (202) may be implemented as one or more microprocessors, microcomputers, microcontrollers, digital signal processors, central processing units, logic circuitries, and/or any devices that process data based on operational instructions.
  • the one or more processor(s) (202) may be configured to fetch and execute computer-readable instructions stored in a memory (206) of the data receiver module (110).
  • the memory (206) may be configured to store one or more computer-readable instructions or routines in a non-transitory computer readable storage medium, which may be fetched and executed to create or share data packets over a network service.
  • the memory (206) may comprise any non-transitory storage device including, for example, volatile memory such as RAM, or non-volatile memory such as EPROM, flash memory, and the like.
  • the data quality module (110)/centralized server (112) may include an interface(s) 204.
  • the interface(s) 204 may comprise a variety of interfaces, for example, interfaces for data input and output devices, referred to as I/O devices, storage devices, and the like.
  • the interface(s) 204 may facilitate communication of the data receiver module (110).
  • the interface(s) 204 may also provide a communication pathway for one or more components of the data receiver module (110) or the centralized server (112). Examples of such components include, but are not limited to, processing engine(s) 208 and a database 210.
  • the processing engine(s) (208) may be implemented as a combination of hardware and programming (for example, programmable instructions) to implement one or more functionalities of the processing engine(s) (208).
  • programming for the processing engine(s) (208) may be processor executable instructions stored on a non-transitory machine-readable storage medium and the hardware for the processing engine(s) (208) may comprise a processing resource (for example, one or more processors), to execute such instructions.
  • the machine-readable storage medium may store instructions that, when executed by the processing resource, implement the processing engine(s) (208).
  • the data quality module (110) /centralized server (112) may comprise the machine-readable storage medium storing the instructions and the processing resource to execute the instructions, or the machine-readable storage medium may be separate but accessible to the data quality module (110) /centralized server (112) and the processing resource.
  • the processing engine(s) (208) may be implemented by electronic circuitry.
  • the processing engine (208) may include one or more engines selected from any of a data acquisition engine (212), a machine learning (ML) engine (214), and other engines (216).
  • The processing engine (208) may further include a Data logistics platform (DLP) (500) (Ref. FIG. 5), but not limited to it, an open-source platform (OSP) (700) (Ref. FIG. 7) and a distributed event streaming platform (DESP) (800) (Ref. FIG. 8).
  • FIG. 2B illustrates an exemplary representation of a proposed method associated with the data quality module, in accordance with an embodiment of the present disclosure.
  • The method (250) may include the step 252 of receiving a set of data packets from a first computing device; the step 254 of extracting a first set of attributes, the first set of attributes pertaining to metrics associated with a Data logistics platform processor operatively coupled to the first computing device; the step 256 of extracting a second set of attributes, the second set of attributes pertaining to metrics associated with an open-source platform of the first computing device; and the step 258 of extracting a third set of attributes, the third set of attributes pertaining to metrics associated with a distributed event streaming platform module.
  • The method may also include the step 260 of storing the data packets in a predefined format, the step 262 of extracting a fourth set of attributes from the stored data packets, the fourth set of attributes pertaining to discrepancies associated with the data quality, and the step 264 of generating alerts on determining the discrepancies associated with the data quality.
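  • To make the completeness check of steps 252-264 concrete, the sketch below compares the record counts reported by the three metric sources for one feed and raises an alert when the stages disagree beyond a tolerance. The metric names, tolerance value and alerting hook are illustrative assumptions, not the disclosed implementation.

```python
def check_completeness(dlp_count: int, osp_count: int, desp_count: int,
                       tolerance: float = 0.01) -> list:
    """Compare stage-level counts and return alert messages for stages lagging the pipeline maximum."""
    alerts = []
    stages = {"DLP processor": dlp_count,
              "open-source platform": osp_count,
              "event streaming platform": desp_count}
    baseline = max(stages.values()) or 1
    for name, count in stages.items():
        drop = (baseline - count) / baseline
        if drop > tolerance:
            alerts.append(f"{name} reports {count} records, "
                          f"{drop:.1%} below the pipeline maximum of {baseline}")
    return alerts

# Example run with assumed hourly counts for a single feed.
for alert in check_completeness(dlp_count=100_000, osp_count=99_950, desp_count=97_500):
    print("ALERT:", alert)
```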
  • FIGs. 3A-3B, with reference to FIG. 2A, illustrate an exemplary representation of the data quality platform. The data quality platform may facilitate monitoring of data flows, measuring the correctness and completeness of datasets, and generating insights/trends about the flow of data.
  • The data quality platform may include a source system (302) comprising sources such as a distributed event streaming platform, SFTP, RDBMS, NAS, Solace, HiveMQ, Syslog, and the like.
  • the source system (302) may be coupled to an ingestion and processing module (304) for metric correction and corrupt record logging.
  • the ingestion and processing module (304) may be coupled to source connectors (306), transformation and processing module (308), sink connectors (310) and data quality module (312, 110).
  • The collection of data-flow statistical information during ingestion and processing may happen in streaming and batch pipelines, but is not limited thereto.
  • the un-intrusive and thin process may help in avoiding any increased load on the data pipelines or on the underlying services stack.
  • the data processed from the ingestion and processing module (304) may be stored in a data store (314).
  • The data store (314) may include Hive, HDFS, Elastic, MySQL, Solr, HBase, Druid, Cassandra, and the like.
  • the stored data may be analysed for establishing volumetric trends on data feeds and identifying anomalies in data flows and data corruption issues (316).
  • The key capabilities of the data quality platform may include establishing volumetric trends on data feeds and identifying anomalies in data flows and data corruption issues using minimal resources, reducing manual intervention in troubleshooting and investigations pertaining to issues in data ingestion and processing pipelines, and reducing the overall load on the cluster attributed to count queries on domain tables, as these counts are captured and stored as part of this service.
  • FIG. 3B illustrates a data sampling and profiling module in which the data obtained from the data store (314) is sent for data sampling and profiling, to profile data samples and understand data quality and correctness, through a sampling and profiling module (318) coupled to source connectors (306) and sink connectors (310), and is then sent to a serving layer (320).
  • The serving layer (320) may include Elastic, Ignite, MySQL, Solr, Redis, and the like, but is not limited thereto.
  • the sampled and profiled data may be then studied as interactive data profile reports (322).
  • FIG. 4 illustrates an exemplary representation of a flow diagram of a Data logistics platform Processor and connection metric collection method (400), in accordance with an embodiment of the present disclosure.
  • A job is started and metrics are enabled, and at block 404 the metrics are captured at every nth interval.
  • The captured metrics are stored in a queryable table, and an hourly consolidated report is generated at block 408 to develop an automated pipeline which processes and stores the Data logistics platform processor and connection stats logs, which may be generated at least every 5 minutes but not limited thereto, in Hive through the Data logistics platform flow itself, so that valuable insights can later be drawn from those stats and end-to-end data reconciliation and data completeness checks can be improved to understand the magnitude of data missing due to exceptional scenarios.
  • The steps involved in developing the Data logistics platform processor and connection metrics collection job may include: creating a separate log file for the application by adding some configuration in logback.xml, so as not to touch data logistics platform-app.log, which is used by many processes in the Data logistics platform; configuring the Data logistics platform to roll up the processor and connection log files on an hourly basis (but not limited thereto), each containing a plurality of n-minute (at least 5-minute) stats; creating the Data logistics platform flow which takes those files from one of the Data logistics platform servers and puts them into at least two HDFS locations, for processor and connection logs respectively, on a date-partition basis; creating the external Hive table on top of those log files from the HDFS directory; creating the logic to parse those raw files from the HDFS directory and load them into the main table; integrating that parsing and storing logic with the Data logistics platform flow, so that as soon as any new log file is generated at the Data logistics platform server, processing starts just after it is stored into HDFS; and taking
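  • A hedged PySpark sketch of the parse-and-load step in the list above is shown below: it reads the raw processor/connection stats files that the flow has landed in HDFS and appends them to a date-partitioned managed Hive table. The HDFS path, the pipe-delimited log layout and the table name are assumptions for illustration, not the actual layout used by the platform.

```python
from pyspark.sql import SparkSession, functions as F

spark = (SparkSession.builder
         .appName("dlp-processor-stats-loader")
         .enableHiveSupport()
         .getOrCreate())

# Raw stats files landed by the flow, partitioned by date (path is an assumption).
raw = spark.read.text("hdfs:///data/dlp/processor_stats/dt=2022-06-24/*")

# Assumed log layout: "<timestamp>|<processor_name>|<records_in>|<records_out>|<bytes_out>"
parsed = (raw
          .withColumn("parts", F.split(F.col("value"), r"\|"))
          .select(F.to_timestamp(F.col("parts")[0]).alias("stat_time"),
                  F.col("parts")[1].alias("processor_name"),
                  F.col("parts")[2].cast("long").alias("records_in"),
                  F.col("parts")[3].cast("long").alias("records_out"),
                  F.col("parts")[4].cast("long").alias("bytes_out"))
          .withColumn("dt", F.to_date("stat_time")))

# Append into the main queryable table used for the hourly consolidated report.
(parsed.write
 .mode("append")
 .partitionBy("dt")
 .format("orc")
 .saveAsTable("data_quality.dlp_processor_stats"))
```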
  • FIG. 5 illustrates an exemplary block diagram representation (500) of a Data logistics platform Processor and flow of data, in accordance with an embodiment of the present disclosure.
  • Data logistics platform flows (502-1, 502-2, 502-N) send Data logistics platform data to a Data logistics platform flow (504), which stores the hourly log files in HDFS and later parses the raw data into managed tables by running a Hive query containing the parsing logic through the Data logistics platform flow itself; the data is finally stored in Hive (506).
  • FIG. 6 illustrates an exemplary block diagram representation (600) of Data logistics platform error logging, in accordance with an embodiment of the present disclosure.
  • Site2Site bulletin reporting tasks may be used; such reporting tasks may be continually running to collect and send data to an input port on the canvas.
  • A plurality of Data logistics platform servers (502-1, 502-2, 502-N) send data to a Data logistics platform flow for error detection and warning (602), which may be configured to periodically receive errors and warnings from the Site2Site bulletin reporting task in JSON format and later parse and store them into an HDFS location in ORC format for the Hive table (506).
  • An automated pipeline may be put in place for logging errors and exceptions that are thrown either from a Data logistics platform component itself or from any application workflows that are deployed on it. Alternately, a job may be developed which captures all the errors and warnings thrown by the processors or other Data logistics platform components to help with monitoring and debugging.
  • The steps involved in creating the Data logistics platform error logging flow include: creating a StandardRestrictedSSLContextService at the Reporting Task Controller Services by providing the keystore and truststore filenames; creating the Site2SiteBulletinReportingTask at the Reporting Task panel of the Controller Settings with properties such as Destination URL, Input Port Name and Instance URL, but not limited thereto; creating an input port at the root canvas, which will receive the bulletin messages from the Bulletin Board, can later be used in the Data logistics platform workflow and can also be scheduled to run for some specified time; after starting the reporting task and input port, receiving the bulletin messages in terms of flow files, with the data arriving as an array in JSON format but not limited thereto; integrating the parsing and storing logic with the Data logistics platform flow, so that as any new error or warning messages are generated at the Bulletin Board, processing starts just before storing them into HDFS in ORC format; and creating an external Hive table on top of those messages.
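  • The sink side of this error-logging flow can be sketched as follows: the bulletin messages, which arrive as JSON arrays, are flattened and appended to an ORC-backed Hive table so errors and warnings can be queried in one place. The bulletin field names, paths and table name below are assumptions based on typical bulletin attributes, not a definitive schema.

```python
from pyspark.sql import SparkSession, functions as F

spark = (SparkSession.builder
         .appName("dlp-error-logging-sink")
         .enableHiveSupport()
         .getOrCreate())

# Bulletins land as JSON arrays, hence multiLine parsing (path is an assumption).
bulletins = (spark.read
             .option("multiLine", "true")
             .json("hdfs:///data/dlp/bulletins/raw/2022-06-24/*.json"))

errors = (bulletins
          .select(F.col("bulletinTimestamp").alias("event_time"),
                  F.col("bulletinSourceName").alias("component"),
                  F.col("bulletinLevel").alias("level"),
                  F.col("bulletinMessage").alias("message"))
          .where(F.col("level").isin("ERROR", "WARNING")))

# One queryable place for all errors and warnings thrown on the platform.
(errors.write
 .mode("append")
 .format("orc")
 .saveAsTable("data_quality.dlp_error_log"))
```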
  • FIG. 7 illustrates an exemplary block diagram representation (700) of spark stats collection, in accordance with an embodiment of the present disclosure.
  • An open-source platform may expose many useful real-time metrics through a Spark user interface (interchangeably referred to as the Spark UI).
  • The open-source platform may collect real-time metrics from a plurality of ingestion jobs (702-1, 702-2, ... 702-N), consolidate them in a distributed event streaming platform (DESP) (704) and push them to a data completeness OSP ingestion job (706) to check Spark metric data completeness.
  • the metrics are stored in a hive table (708), to utilize them for reporting.
  • The metrics may be captured at at least three levels: stage, error and app. Since metrics that are already being captured are reused, the open-source platform does not incur any significant overhead on the process. Moreover, the open-source platform may be configurable and can be enabled for any ingestion flow.
  • The steps involved in Spark metric logging may include: creating a custom accumulator in Spark which keeps track of the metrics; creating a custom Spark listener object which updates the accumulator object at the end of each step; creating a custom source which emits these accumulated metrics as logs every n seconds; pushing the logs to a distributed event streaming platform topic using a distributed event streaming platform appender; and reading the logs from the distributed event streaming platform topic with a Spark job, processing them and writing them to a Hive table.
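  • The last step in this list can be sketched with Spark Structured Streaming: a small job reads the emitted metric log lines from the streaming topic, parses them and appends them to a Hive table. The broker, topic, checkpoint location and metric schema are illustrative assumptions, and the Kafka-compatible source requires the matching spark-sql-kafka package on the classpath.

```python
from pyspark.sql import SparkSession, functions as F
from pyspark.sql.types import StructType, StructField, StringType, LongType

spark = (SparkSession.builder
         .appName("spark-metric-sink")
         .enableHiveSupport()
         .getOrCreate())

# Assumed shape of the metric log lines emitted by the custom source.
schema = StructType([
    StructField("app_id", StringType()),
    StructField("stage_id", LongType()),
    StructField("records_read", LongType()),
    StructField("records_written", LongType()),
    StructField("emitted_at", StringType()),
])

stream = (spark.readStream
          .format("kafka")
          .option("kafka.bootstrap.servers", "broker1:9092")
          .option("subscribe", "spark-metrics")
          .load())

metrics = (stream
           .select(F.from_json(F.col("value").cast("string"), schema).alias("m"))
           .select("m.*"))

def write_batch(batch_df, batch_id):
    # Append each micro-batch of parsed metrics to the reporting table.
    batch_df.write.mode("append").format("orc").saveAsTable("data_quality.spark_ingestion_metrics")

query = (metrics.writeStream
         .trigger(processingTime="60 seconds")
         .option("checkpointLocation", "hdfs:///checkpoints/spark-metric-sink")
         .foreachBatch(write_batch)
         .start())
query.awaitTermination()
```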
  • FIG. 8 illustrates an exemplary block diagram representation (800) of distributed event streaming platform stats collection, in accordance with an embodiment of the present disclosure.
  • The distributed event streaming platform stats collection job may include ingestion jobs (702-1, 702-2, ... 702-N), where the distributed event streaming platform (804) may be used as a source to capture metrics such as "how many messages have been consumed from the distributed event streaming platform" and "how many messages have been processed by the job after consuming from the distributed event streaming platform", but not limited to the like.
  • The metrics are pushed to a data completeness DESP ingestion job (802) in real time and are later processed and stored in a table (708) for reporting purposes.
  • The steps involved in distributed event streaming platform metrics logging may include: creating a custom logger so that every time a message is consumed and committed from the distributed event streaming platform it is logged with a 'consumed' tag, and once the application has finished processing these messages they are logged with a 'processed' tag; pushing the logs to a distributed event streaming platform topic using a distributed event streaming platform appender; and reading the logs from the distributed event streaming platform topic with a Spark job, processing them and writing them to a Hive table.
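  • A minimal sketch of the consumed/processed tagging on the consumer side is given below, using the kafka-python client; the library choice, broker address, topic names and message layout are assumptions rather than the disclosed implementation.

```python
import json
import time
from kafka import KafkaConsumer, KafkaProducer

consumer = KafkaConsumer("source-topic",
                         bootstrap_servers="broker1:9092",
                         group_id="ingestion-job-1",
                         enable_auto_commit=False)
metric_producer = KafkaProducer(bootstrap_servers="broker1:9092",
                                value_serializer=lambda v: json.dumps(v).encode())

def emit_metric(tag: str, count: int) -> None:
    """Emit a 'consumed' or 'processed' metric record to the metrics topic."""
    metric_producer.send("ingestion-metrics",
                         {"job": "ingestion-job-1", "tag": tag,
                          "count": count, "ts": time.time()})

while True:
    batches = consumer.poll(timeout_ms=1000)
    count = sum(len(msgs) for msgs in batches.values())
    if not count:
        continue
    emit_metric("consumed", count)
    # ... application-specific processing of the consumed messages happens here ...
    consumer.commit()
    emit_metric("processed", count)
```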
  • FIG. 9 illustrates an exemplary block diagram representation (900) of a consolidated report generation, in accordance with an embodiment of the present disclosure.
  • the metrics may be combined to create a consolidated view (908) that may help in identifying anomalies at any stage of the pipeline, and also avoid manual debugging on production systems.
  • Hourly metrics are produced which can be visualized and on top of which alerts can be set.
  • Day-over-day ratios for at least 4 buckets (6 hours each) may be captured (910), which helps in comparing the data flow rate of today with that of yesterday.
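  • The day-over-day bucket comparison can be expressed, for example, with pandas as below: hourly size metrics are grouped into four 6-hour buckets and today's totals are turned into a percentage change against yesterday's. The column names are assumed for illustration, and at least two days of observations are expected.

```python
import pandas as pd

def bucket_day_over_day(hourly: pd.DataFrame) -> pd.DataFrame:
    """hourly columns: observation_date (date), hour (0-23), stage, size_gb."""
    hourly = hourly.copy()
    hourly["bucket"] = hourly["hour"] // 6          # buckets 0..3, 6 hours each
    daily = (hourly.groupby(["observation_date", "stage", "bucket"])["size_gb"]
                   .sum()
                   .unstack("observation_date"))
    days = sorted(daily.columns)
    today, yesterday = days[-1], days[-2]
    # Percentage change of today's bucket totals relative to yesterday's.
    daily["pct_change"] = (daily[today] - daily[yesterday]) / daily[yesterday] * 100
    return daily.reset_index()
```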
  • FIGs. 10A-10I illustrate exemplary representations of implementation results associated with data quality module, in accordance with an embodiment of the present disclosure.
  • FIGs. 10A-10B illustrate a daily and hourly comparison of source data, intermediate data and sink data in order to establish volumetric trends across datasets and to identify anomalies in datasets.
  • FIG. 10C illustrates data flow statistics across different datasets for a current day.
  • FIG. 10D illustrates Source, Intermediate and Sink Size Distribution by Bucket for the current day.
  • The charts show the current day's bucket-wise distribution in terms of data size for all 3 stages (here 24 hrs. divided into 4 buckets (0, 1, 2, 3) of 6 hrs. each), which can be helpful for detecting any increase or dip in the data at any stage of the flow.
  • FIG. 10E illustrates source wise ingestion trend.
  • The chart shows the trend of data size at every hour for all 3 stages in the system (i.e. source, intermediate and sink). Anomalies related to the data can be detected if any of the stages shows an inconsistency in the chart.
  • The X-axis may indicate current-day hours and the Y-axis may indicate size in GB for all 3 stages at that particular hour.
  • FIG. 10F illustrates Counters for the Current day data set and Day over day comparison.
  • FIG. 10G illustrates hour-wise ingestion statistics; as illustrated, hour-wise statistics related to all 3 stages in the pipeline are shown.
  • FIG. 10H illustrates anomaly identifier charts for a day-over-day comparison of the source, intermediate and sink stages in terms of data size ratio.
  • The chart shows the ratio of the current day vs the previous day in terms of data size for all 3 stages (i.e. source, intermediate and sink); the X-axis is the observation date and the Y-axis is the percentage change in data size with respect to the previous day for all 3 stages, rolling over 7 days.
  • FIG. 10I illustrates the source, intermediate and sink data size comparison measured over a 6-hour window for each day.
  • The charts show the bucket-wise comparison in terms of data size for all 3 stages over a period of 3 days (here 24 hrs. divided into 4 buckets of 6 hrs. each).
  • A consolidated dashboard (comprising the above charts) created in Superset may showcase the capabilities of the data quality platform.
  • FIG. 11 illustrates an exemplary block diagram representation (1100) of the system architecture, in accordance with an embodiment of the present disclosure.
  • As illustrated, data sampling of the d-1 day's data for a particular source at the Hive layer is performed, and the sample is stored in a temp table (1102).
  • Data profiling is done on the sampled data and an HTML report is written into an HDFS directory (1106) based on the source table information.
  • JDSP Data Curio (1108) provides a GUI where end users can view the generated report based on their access privileges as per Ranger policies. The above-mentioned steps may be automated using Python scripts (1104).
  • A data-sampler module may perform data sampling of the source data and store the sampled data into a temporary Hive table.
  • A data-profiler module may use an open-source tool, such as pandas profiling but not limited to it, to generate a profile report from the sampled data and store the profile report into the HDFS directory, catalogued based on the data source name.
  • A JDSP Data Curio, but not limited to it, may provide a GUI where end users can view the generated report based on their access privileges as per Ranger policies.
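  • A hedged end-to-end sketch of this sampling and profiling pipeline is shown below: the previous day's partition of a source Hive table is sampled into a temporary table, an HTML profile report is generated with the pandas-profiling package, and the report is pushed to an HDFS directory catalogued by source name. The table names, paths and sampling fraction are assumptions for illustration.

```python
import subprocess
from pyspark.sql import SparkSession
from pandas_profiling import ProfileReport

spark = (SparkSession.builder
         .appName("data-sampler-profiler")
         .enableHiveSupport()
         .getOrCreate())

source_table = "raw_zone.subscriber_events"            # assumed source table name

# Sample the d-1 day's data and keep it in a temporary Hive table.
sampled = (spark.table(source_table)
           .where("dt = date_sub(current_date(), 1)")
           .sample(fraction=0.01, seed=42))
sampled.write.mode("overwrite").saveAsTable("dq_temp.subscriber_events_sample")

# Profile the sample and write the HTML report locally first.
report = ProfileReport(sampled.toPandas(), title=f"Profile: {source_table}")
report.to_file("/tmp/subscriber_events_profile.html")

# Catalogue the report in HDFS by data-source name so the GUI can pick it up.
subprocess.run(["hdfs", "dfs", "-put", "-f",
                "/tmp/subscriber_events_profile.html",
                "/data_quality/profiles/subscriber_events/"], check=True)
```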
  • FIG. 12 illustrates an exemplary computer system in which or with which embodiments of the present invention can be utilized in accordance with embodiments of the present disclosure.
  • computer system 1200 can include an external storage device 1210, a bus 1220, a main memory 1230, a read only memory 1240, a mass storage device 1250, communication port 1260, and a processor 1270.
  • Examples of processor 1270 include, but are not limited to, Intel® Itanium® or Itanium 2 processor(s), AMD® Opteron® or Athlon MP® processor(s), Motorola® lines of processors, FortiSOC™ system-on-chip processors or other future processors.
  • Communication port 1260 can be any of an RS-232 port for use with a modem based dialup connection, a 10/100 Ethernet port, a Gigabit or 10 Gigabit port using copper or fiber, a serial port, a parallel port, or other existing or future ports.
  • Communication port 1260 may be chosen depending on the network, such as a Local Area Network (LAN), a Wide Area Network (WAN), or any network to which the computer system connects.
  • Memory 1230 can be Random Access Memory (RAM), or any other dynamic storage device commonly known in the art.
  • Read-only memory 1240 can be any static storage device(s) e.g., but not limited to, a Programmable Read Only Memory (PROM) chips for storing static information e.g., start-up or BIOS instructions for processor 1270.
  • Mass storage 1250 may be any current or future mass storage solution, which can be used to store information and/or instructions. Exemplary mass storage solutions include, but are not limited to, Parallel Advanced Technology Attachment (PATA) or Serial Advanced Technology Attachment (SATA) hard disk drives or solid-state drives (internal or external, e.g., having Universal Serial Bus (USB) and/or Firewire interfaces), e.g. those available from Seagate (e.g., the Seagate Barracuda 782 family) or Hitachi (e.g., the Hitachi Deskstar 12K800), one or more optical discs, and Redundant Array of Independent Disks (RAID) storage, e.g. an array of disks (e.g., SATA arrays), available from various vendors including Dot Hill Systems Corp., LaCie, Nexsan Technologies, Inc. and Enhance Technology, Inc.
  • Bus 1220 communicatively couples processor(s) 1270 with the other memory, storage and communication blocks.
  • Bus 1220 can be, e.g., a Peripheral Component Interconnect (PCI) / PCI Extended (PCI-X) bus, Small Computer System Interface (SCSI), USB or the like, for connecting expansion cards, drives and other subsystems, as well as other buses, such as a front side bus (FSB), which connects processor 1270 to the software system.
  • Operator and administrative interfaces, e.g. a display, keyboard, and a cursor control device, may also be coupled to bus 1220 to support direct operator interaction with the computer system.
  • Other operator and administrative interfaces can be provided through network connections connected through communication port 1260.
  • The external storage device 1210 can be any kind of external hard drive, floppy drive, IOMEGA® Zip Drive, Compact Disc - Read Only Memory (CD-ROM), Compact Disc - Re-Writable (CD-RW), or Digital Video Disk - Read Only Memory (DVD-ROM).
  • The present disclosure provides for a system and method that imposes no serious overhead on the existing jobs, as the way the metrics are captured is very lightweight. The present disclosure also provides for a system and method that gives a consolidated view across the different systems involved in the pipeline.
  • The present disclosure provides for a system and method that allows end users to measure the completeness of available data and quantify its correctness and quality.
  • The present disclosure provides for a system and method through which a sudden increase or decrease in data flow due to high usage or a system issue can be easily identified and the appropriate teams can be alerted as soon as the anomaly occurs.
  • The present disclosure provides for a system and method that facilitates identification of misbehaving components in the pipeline, which narrows the scope for detailed analysis through an initial RCA by just looking at the final report. The present disclosure also provides for a system and method that helps avoid count queries for datasets, as it already provides count and size stats at the hourly level, reducing the load on the underlying services.
  • The present disclosure provides for a system and method that also captures errors which occur while the application is running, which helps in debugging the application.
  • the present disclosure provides for a system and method that allows

Abstract

The present invention provides a robust and effective solution to an entity or an organization by allowing it to visualize the flow of data from a source system to a sink system with different metric counters at each step, to set up automatic alerts if something goes wrong and to perform RCA for the same, and to create profiling reports for the datasets to check the quality of data being ingested. Metrics are collected across different components of the data pipeline, which helps in building an end-to-end completeness check for the ingestion. The metrics may be captured without any serious overhead on the performance of the existing jobs, and anomalies may be found as soon as possible. The proposed method reduces the number of counter queries required on the production environment, as these counts are captured as part of the metrics themselves, in turn reducing unnecessary load.

Description

SYSTEM AND METHOD FACILITATING MONITORING OF DATA QUALITY
FIELD OF INVENTION
[0001] The embodiments of the present disclosure generally relate to systems and methods that facilitate enhancing a big data ecosystem. More particularly, the present disclosure relates to a system and method for facilitating end-to-end data completeness.
BACKGROUND OF THE INVENTION
[0002] The following description of related art is intended to provide background information pertaining to the field of the disclosure. This section may include certain aspects of the art that may be related to various features of the present disclosure. However, it should be appreciated that this section should be used only to enhance the understanding of the reader with respect to the present disclosure, and not as an admission of prior art.
[0003] A data ecosystem is a collection of infrastructure, analytics, and applications used to capture and analyze data. Data ecosystems provide entities with the data they rely on to understand their users and to make better operational decisions. The term ecosystem is used rather than 'environment' because, like real ecosystems, data ecosystems are intended to evolve over time. Data ecosystems are for capturing data to produce useful insights. As users use products, especially digital ones, they leave data trails. Entities can create a data ecosystem to capture and analyze these data trails so product teams can determine what their users like, don't like, and respond well to. Product teams can use the insights to tweak features and improve the product. Ecosystems were originally referred to as information technology environments. They were designed to be relatively centralized and static. The birth of the web and cloud services has changed that. Now, data is captured and used throughout organizations and IT professionals have less central control. The infrastructure they use to collect data must now constantly adapt and change. Hence the term data ecosystem: they are data environments that are designed to evolve. There is no one 'data ecosystem' solution. Every entity creates its own ecosystem, sometimes referred to as a technology stack, and fills it with a patchwork of hardware and software to collect, store, analyze, and act upon the data. The best data ecosystems are built around a product analytics platform that ties the ecosystem together. Analytics platforms help teams integrate multiple data sources, provide machine learning tools to automate the process of conducting analysis, and track user cohorts so teams can calculate performance metrics. With the digital revolution, the resultant data footprint, such as feeds and logs pertaining to different streams in an entity, needs to be ingested into the big data ecosystem.
[0004] During ingestion, many issues can occur, such as data corruption at the source end, a system going down in the data pipeline, or data loss. In existing systems, there is no provision to capture processor errors and warnings in one place. So, in order to debug issues, the operations and developer teams have no option but to log into each of the different nodes of the cluster to identify the root cause, which is entirely manual and error prone. Moreover, directly querying records is an expensive operation. Unlike their RDBMS counterparts, which usually have a considerable retention policy, NAS or SFTP sources usually host heavy volumetric datasets and have relatively short retention policies. There is no open-source solution that enables an end-to-end data completeness check on data pipelines in a big data ecosystem without significantly impacting the performance of the pipeline. This framework will allow end users to measure the completeness of available data and quantify its correctness and quality. It will also enable identifying anomalies at any stage of the dataflow and avoid manual debugging on production systems.
[0005] There is therefore a need in the art to provide a system and a method that can overcome the shortcomings of the existing prior art.
OBJECTS OF THE PRESENT DISCLOSURE
[0006] Some of the objects of the present disclosure, which at least one embodiment herein satisfies are as listed herein below.
[0007] An object of the present disclosure is to provide for a system and method to facilitate ease in ingestion and analytical processing of data in big data eco-system.
[0008] An object of the present disclosure is to provide for a system and method that enables end-to-end data completeness checks on the data pipelines.
[0009] An object of the present disclosure is to provide for a system and method that provides insights into how data flows from one system to another, alerts the operations team to anomalous behaviour, and supports root-cause analysis (RCA) of such behaviour.
[0010] An object of the present disclosure is to provide for a system and method that helps in profiling the data to check the data quality and give feedback to the source regarding the same. [0011] An object of the present disclosure is to provide for a system and method that allows visualization of the flow of data from the source system to the sink system, with different metric counters at each step.
[0012] An object of the present disclosure is to provide for a system and method that provides an automated pipeline for logging errors and exceptions thrown either by the processor or by any application workflows deployed on it.
SUMMARY
[0013] The present disclosure relates to systems and methods that facilitate enhancing a big data ecosystem. More particularly, the present disclosure relates to a system and method for facilitating end-to-end data completeness.
[0014] According to an aspect of the present disclosure, a system and method for monitoring the quality of a set of data packets of a big data ecosystem by one or more first computing devices is disclosed. The system may include a data quality module comprising one or more processors coupled with a memory that may store instructions which, when executed by the one or more processors, cause the system to receive the set of data packets from the one or more first computing devices. The set of data packets may pertain to data from filesystem-based sources. The system may then extract, by the data quality module, a set of attributes from the received set of data packets, the set of attributes pertaining to the quality of the data received, and then identify, by the data quality module, one or more corrupt data packets based on the extracted set of attributes. The data quality module may log the one or more corrupt data packets into a first queue and then auto-analyse the logged corrupt data packets to determine one or more errors or exceptions leading to the corruption.
[0015] According to an aspect of the present disclosure, the method for monitoring the quality of a set of data packets of a big data ecosystem by one or more first computing devices may include the steps of receiving, at a data quality module coupled to one or more processors, the set of data packets pertaining to data from filesystem-based sources, and extracting, by the data quality module, a set of attributes from the received set of data packets, the set of attributes pertaining to the quality of the data received. The method may further include the step of identifying, by the data quality module, one or more corrupt data packets based on the extracted set of attributes and the step of logging, by the data quality module, the one or more corrupt data packets into a first queue. The method may further include the step of auto-analysing, by the data quality module, the logged corrupt data packets to determine one or more errors or exceptions leading to the corruption.
BRIEF DESCRIPTION OF DRAWINGS
[0016] The accompanying drawings, which are incorporated herein and constitute a part of this invention, illustrate exemplary embodiments of the disclosed methods and systems in which like reference numerals refer to the same parts throughout the different drawings. Components in the drawings are not necessarily to scale, emphasis instead being placed upon clearly illustrating the principles of the present invention. Some drawings may indicate the components using block diagrams and may not represent the internal circuitry of each component. It will be appreciated by those skilled in the art that such drawings include the electrical components, electronic components or circuitry commonly used to implement such components.
[0017] FIG. 1 illustrates an exemplary network architecture in which or with which the data quality module of the present disclosure can be implemented, in accordance with an embodiment of the present disclosure.
[0018] FIG. 2A illustrates an exemplary representation of data quality module/centralized server for accessing content stored in a network, in accordance with an embodiment of the present disclosure.
[0019] FIG. 2B illustrates an exemplary representation of a proposed method associated with the data quality module, in accordance with an embodiment of the present disclosure.
[0020] FIG. 3A-3B illustrate exemplary representations of data quality platform for facilitating monitoring of data flows, sampling and measuring the correctness and completeness of datasets, in accordance with an embodiment of the present disclosure. [0021] FIG. 4 illustrates an exemplary representation of a flow diagram of a Data logistics platform Processor and connection metric collection method (400), in accordance with an embodiment of the present disclosure.
[0022] FIG. 5 illustrates an exemplary block diagram representation of a Data logistics platform Processor and flow of data, in accordance with an embodiment of the present disclosure.
[0023] FIG. 6 illustrates an exemplary block diagram representation (600) of Data logistics platform error logging, in accordance with an embodiment of the present disclosure. [0024] FIG. 7 illustrates an exemplary block diagram representation (700) of spark stats collection, in accordance with an embodiment of the present disclosure.
[0025] FIG. 8 illustrates an exemplary block diagram representation (800) of distributed event streaming platform stats collection, in accordance with an embodiment of the present disclosure.
[0026] FIG. 9 illustrates an exemplary block diagram representation (900) of a consolidated report generation, in accordance with an embodiment of the present disclosure. [0027] FIGs. 10A-10I illustrate exemplary representations of implementation results associated with data quality module, in accordance with an embodiment of the present disclosure.
[0028] FIG. 11 illustrates an exemplary block diagram representation (1100) of the system architecture, in accordance with an embodiment of the present disclosure.
[0029] FIG. 12 illustrates an exemplary computer system in which or with which embodiments of the present invention can be utilized in accordance with embodiments of the present disclosure.
[0030] The foregoing shall be more apparent from the following more detailed description of the invention.
DETAILED DESCRIPTION OF INVENTION
[0031] In the following description, for the purposes of explanation, various specific details are set forth in order to provide a thorough understanding of embodiments of the present disclosure. It will be apparent, however, that embodiments of the present disclosure may be practiced without these specific details. Several features described hereafter can each be used independently of one another or with any combination of other features. An individual feature may not address all of the problems discussed above or might address only some of the problems discussed above. Some of the problems discussed above might not be fully addressed by any of the features described herein.
[0032] The ensuing description provides exemplary embodiments only, and is not intended to limit the scope, applicability, or configuration of the disclosure. Rather, the ensuing description of the exemplary embodiments will provide those skilled in the art with an enabling description for implementing an exemplary embodiment. It should be understood that various changes may be made in the function and arrangement of elements without departing from the spirit and scope of the invention as set forth. [0033] The present invention provides a robust and effective solution to an entity or an organization by allowing it to visualize the flow of data from a source system to a sink system with different metric counters at each step, to set up automatic alerts if something goes wrong and to perform RCA for the same, and to create profiling reports for the datasets to check the quality of data being ingested. Metrics are collected across different components of the data pipeline, which helps in building an end-to-end completeness check for the ingestion. The metrics may be captured without any serious overhead on the performance of the existing jobs, and anomalies may be found as soon as possible. The proposed method reduces the number of counter queries required on the production environment, as these counts are captured as part of the metrics themselves, in turn reducing unnecessary load.
[0034] Referring to FIG. 1, which illustrates an exemplary network architecture (100) in which or with which the data ingestion module (110) of the present disclosure can be implemented, in accordance with an embodiment of the present disclosure: as illustrated, the exemplary architecture (100) includes a data quality module (110) equipped with a machine learning engine (214) (also referred to as data quality module 110 hereinafter) for facilitating ingestion and processing of a set of data packets from one or more first computing devices (102) to be stored in one or more second computing devices (104). In an embodiment, the set of data packets may be big data, but is not limited thereto.
[0035] In an exemplary embodiment, the one or more first computing devices (102) may include a plurality of Distributed Source Systems such as Distributed event streaming platform, the Hadoop Distributed File System (HDFS) and the like. And the one or more second computing devices (104) may include a plurality of Distributed Storage Systems such as Elasticsearch, Hive, HDFS but not limited to the like with pluggable transformation and customized throughput, rate control, throttle and embedded Fault Tolerance.
[0036] In an embodiment, the set of data packets may be ingested by the first computing device (102) in any format.
[0037] The data quality module (110) may be coupled to a centralized server (112).
The data quality module (110) may also be operatively coupled to one or more first computing devices (102) and one or more second computing devices (104) through a network (106).
[0038] In an embodiment, the data quality module (110) may receive the set of data packets from the first computing devices (102). The data quality module (110) may be further configured to process the received set of data packets based on a set of predefined configuration parameters. The data quality module (110) may further be configured to process the set of data packets received through a plurality of processing logic modules. In an embodiment, the processing of the set of data packets may start as soon as any new log file is generated at the centralized server (112).
[0039] In an embodiment, the data quality module (110) may further identify corrupt data present in the set of data packets received and log it into a first queue. The set of data packets, after being analysed and processed, may be written into the one or more second computing devices (104) (also referred to as sinks (104)). In an embodiment, the data quality module (110) may be configured to automate warning signals based on the corrupt and erroneous data identified in the pipeline.
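By way of illustration only, the following minimal Python sketch shows one way such corrupt-record detection, queueing, and automated warning could be realised; the required fields, record format and in-memory queue are assumptions made for the example and are not part of the disclosed implementation.

    import json
    import logging
    from queue import Queue

    corrupt_queue = Queue()                              # the "first queue" holding corrupt records
    REQUIRED_FIELDS = {"id", "timestamp", "payload"}     # illustrative schema, not from the disclosure

    def ingest(records):
        """Validate each raw record; route corrupt ones to the corrupt queue."""
        clean = []
        for raw in records:
            try:
                rec = json.loads(raw)
                if not REQUIRED_FIELDS.issubset(rec):
                    raise ValueError("missing required fields")
                clean.append(rec)
            except (ValueError, json.JSONDecodeError) as err:
                corrupt_queue.put({"record": raw, "error": str(err)})
                logging.warning("corrupt record detected: %s", err)   # automated warning signal
        return clean

In practice the queue would typically be a durable store or streaming topic rather than an in-memory structure, so that the logged corrupt records can later be auto-analysed for the errors and exceptions that caused them.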
[0040] In an embodiment, the data quality module (110) may be configured to update the database every time, so that an end-to-end data completeness check for each data pipeline can be enabled and visualized by the data quality module (110), which may send the processed set of data packets to the second computing device (104).
[0041] In an embodiment, the data quality module (110) may be configured to create an automated pipeline for data sampling and data profiling.
[0042] In an embodiment, the one or more first computing devices (102) and the one or more second computing devices (104) may communicate with the data quality module (110) via a set of executable instructions residing on any operating system, including but not limited to, Android TM, iOS TM, Kai OS TM and the like. In an embodiment, the one or more first computing devices (102) and the one or more second computing devices (104) may include, but are not limited to, any electrical, electronic, electro-mechanical equipment or a combination of one or more of the above devices such as a mobile phone, smartphone, virtual reality (VR) device, augmented reality (AR) device, laptop, general-purpose computer, desktop, personal digital assistant, tablet computer, mainframe computer, or any other computing device, wherein the computing device may include one or more in-built or externally coupled accessories including, but not limited to, a visual aid device such as a camera, audio aid, a microphone, a keyboard, input devices for receiving input from a user such as a touch pad, touch enabled screen, electronic pen, receiving devices for receiving any audio or visual signal in any range of frequencies and transmitting devices that can transmit any audio or visual signal in any range of frequencies. It may be appreciated that the one or more first computing devices (102) and the one or more second computing devices (104) may not be restricted to the mentioned devices and various other devices may be used. A smart computing device may be one of the appropriate systems for storing data and other private/sensitive information. [0043] In an embodiment, the data quality module (110) or the centralized server
(112) may include one or more processors coupled with a memory, wherein the memory may store instructions which when executed by the one or more processors may cause the system to access content stored in a network.
[0044] FIG. 2A, with reference to FIG. 1, illustrates an exemplary representation of the data quality module (110) / centralized server (112) for facilitating real time event data feeds, in accordance with an embodiment of the present disclosure. In an aspect, the data quality module (110) / centralized server (112) may comprise one or more processor(s) (202). The one or more processor(s) (202) may be implemented as one or more microprocessors, microcomputers, microcontrollers, digital signal processors, central processing units, logic circuitries, and/or any devices that process data based on operational instructions. Among other capabilities, the one or more processor(s) (202) may be configured to fetch and execute computer-readable instructions stored in a memory (206) of the data quality module (110). The memory (206) may be configured to store one or more computer-readable instructions or routines in a non-transitory computer readable storage medium, which may be fetched and executed to create or share data packets over a network service. The memory (206) may comprise any non-transitory storage device including, for example, volatile memory such as RAM, or non-volatile memory such as EPROM, flash memory, and the like.
[0045] In an embodiment, the data quality module (110)/centralized server (112) may include an interface(s) 204. The interface(s) 204 may comprise a variety of interfaces, for example, interfaces for data input and output devices, referred to as I/O devices, storage devices, and the like. The interface(s) 204 may facilitate communication of the data quality module (110). The interface(s) 204 may also provide a communication pathway for one or more components of the data quality module (110) or the centralized server (112). Examples of such components include, but are not limited to, processing engine(s) 208 and a database 210.
[0046] The processing engine(s) (208) may be implemented as a combination of hardware and programming (for example, programmable instructions) to implement one or more functionalities of the processing engine(s) (208). In examples described herein, such combinations of hardware and programming may be implemented in several different ways. For example, the programming for the processing engine(s) (208) may be processor executable instructions stored on a non-transitory machine-readable storage medium and the hardware for the processing engine(s) (208) may comprise a processing resource (for example, one or more processors), to execute such instructions. In the present examples, the machine-readable storage medium may store instructions that, when executed by the processing resource, implement the processing engine(s) (208). In such examples, the data quality module (110) /centralized server (112) may comprise the machine-readable storage medium storing the instructions and the processing resource to execute the instructions, or the machine-readable storage medium may be separate but accessible to the data quality module (110) /centralized server (112) and the processing resource. In other examples, the processing engine(s) (208) may be implemented by electronic circuitry.
[0047] The processing engine (208) may include one or more engines selected from any of a data acquisition engine (212), a machine learning (ML) engine (214), and other engines (216). The processing engine (208) may further include a Data logistics platform (DLP) (500) (Ref. FIG. 5), an open-source platform (OSP) (700) (Ref. FIG. 7), and a distributed event streaming platform (DESP) (800) (Ref. FIG. 8), but is not limited thereto.
[0048] FIG. 2B illustrates an exemplary representation of a proposed method associated with the data quality module, in accordance with an embodiment of the present disclosure. According to an aspect, the method (250) may include the step 252 of receiving a set of data packets from a first computing device; the step 254 of extracting a first set of attributes, the first set of attributes pertaining to metrics associated with a Data logistics platform processor operatively coupled to the first computing device; the step 256 of extracting a second set of attributes, the second set of attributes pertaining to metrics associated with an open-source platform of the first computing device; and the step 258 of extracting a third set of attributes, the third set of attributes pertaining to metrics associated with a Distributed event streaming platform module. The method may also include the step 260 of storing the data packets in a predefined format, the step 262 of extracting a fourth set of attributes from the stored data packets, the fourth set of attributes pertaining to discrepancies associated with the data quality, and the step 264 of generating alerts on determining the discrepancies associated with the data quality.
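Purely as an illustrative sketch of steps 252-264, the Python fragment below compares record counts extracted from the three metric sources and raises a warning when a discrepancy exceeds a tolerance; the metric dictionaries, the count key, the choice of source and sink stages, and the tolerance are assumptions made only for the example.

    import logging

    def extract_count(metrics, key="records"):
        """Pull the record-count attribute out of a stage's metric dictionaries."""
        return sum(m.get(key, 0) for m in metrics)

    def run_quality_check(dlp_metrics, osp_metrics, desp_metrics, tolerance=0.01):
        """Mirror of steps 252-264: extract per-stage attributes and alert on gaps."""
        snapshot = {
            "dlp": extract_count(dlp_metrics),     # step 254: Data logistics platform metrics
            "osp": extract_count(osp_metrics),     # step 256: open-source platform metrics
            "desp": extract_count(desp_metrics),   # step 258: event streaming platform metrics
        }
        # steps 260-262: the snapshot would be persisted and discrepancy attributes derived
        source, sink = snapshot["desp"], snapshot["osp"]
        if source and abs(source - sink) / source > tolerance:
            logging.warning("data completeness discrepancy: %s", snapshot)   # step 264
        return snapshot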
[0049] FIGs. 3A-3B, with reference to FIG. 2A, illustrate an exemplary representation (300) of a data quality platform for facilitating monitoring of data flows and measuring the correctness and completeness of datasets, in accordance with an embodiment of the present disclosure.
[0050] As illustrated in FIG. 3A, in an embodiment, the Data Quality Platform may facilitate monitoring of data flows, measure the correctness and completeness of datasets, and generate insights/trends about the flow of data. The Data Quality Platform may include a source system (302) comprising sources such as a distributed event streaming platform, SFTP, RDBMS, NAS, Solace, HiveMQ, Syslog, and the like. The source system (302) may be coupled to an ingestion and processing module (304) for metric collection and corrupt record logging. The ingestion and processing module (304) may be coupled to source connectors (306), a transformation and processing module (308), sink connectors (310), and the data quality module (312, 110). The collection of data-flow statistical information during ingestion and processing may happen in streaming and batch pipelines, but is not limited thereto. The un-intrusive and thin process helps in avoiding any increased load on the data pipelines or on the underlying services stack. The data processed from the ingestion and processing module (304) may be stored in a data store (314). The data store (314) may include Hive, HDFS, Elastic, MySQL, Solr, HBase, Druid, Cassandra, and the like. The stored data may be analysed for establishing volumetric trends on data feeds and identifying anomalies in data flows and data corruption issues (316).
[0051] The key capabilities of the data quality platform may include: establishing volumetric trends on data feeds and identifying anomalies in data flows and data corruption issues using minimal resources; reducing manual intervention in troubleshooting and investigations pertaining to issues in data ingestion and processing pipelines; and reducing the overall load on the cluster attributed to count queries on domain tables, as these counts are captured and stored as part of this service.
[0052] FIG. 3B illustrates a data sampling and profiling module, in which data obtained from the data store (314) is sent for data sampling and profiling, to profile data samples and understand data quality and correctness, through a sampling and profiling module (318) coupled to the source connectors (306) and sink connectors (310), and is then sent to a serving layer (320). The serving layer (320) may include Elastic, Ignite, MySQL, Solr, Redis, and the like, but is not limited thereto. The sampled and profiled data may then be studied as interactive data profile reports (322).
[0053] FIG. 4 illustrates an exemplary representation of a flow diagram of a Data logistics platform Processor and connection metric collection method (400), in accordance with an embodiment of the present disclosure.
[0054] As illustrated, in an embodiment of the method (400) of Data logistics platform processor and connection metric collection, at block 402 a job is started and metrics are enabled, and at block 404 the metrics are captured at every n interval. At block 406, the captured metrics are stored in a queryable table, and at block 408 an hourly consolidated report is generated so as to develop an automated pipeline which processes and stores, in Hive through the Data logistics platform flow itself, the Data logistics platform processor and connection stats logs that may be generated at least every 5 minutes, but not limited thereto, so that valuable insights can later be derived from those stats to improve end-to-end data reconciliation and data completeness checks and to understand the magnitude of data missing due to exceptional scenarios.
[0055] In an exemplary embodiment, the steps involved in developing the Data logistics platform processor and connection metrics collection job may include: creating a separate log file for the application by adding some configuration in a logback.xml, so as not to touch the data logistics platform-app.log, as this is used by many processes in the Data logistics platform; configuring the Data logistics platform to roll up the processor and connection log files on an hourly basis, but not limited thereto, which contain a plurality of n-minute (at least 5-minute) stats; creating the Data logistics platform flow which takes those files from one of the Data logistics platform servers and puts them into at least two HDFS locations, for processor and connection logs respectively, on a date-partition basis; creating an external Hive table on top of those log files from the HDFS directory; creating the logic to parse those raw files from the HDFS directory and load them into a main table; integrating that parsing and storing logic with the Data logistics platform flow, so that as any new log file gets generated at the Data logistics platform server, processing starts just after it is stored into HDFS; and taking out valuable insights from the stats to improve end-to-end data reconciliation and data completeness checks and to understand the magnitude of data missing due to exceptional scenarios.
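As a minimal, non-limiting sketch of the parse-and-load step, the PySpark job below reads raw processor stats logs from an HDFS landing path and appends the parsed rows to a managed Hive table; the log layout, paths and table name are assumptions made only for illustration.

    from pyspark.sql import SparkSession, functions as F

    spark = (SparkSession.builder
             .appName("dlp-processor-stats-loader")
             .enableHiveSupport()
             .getOrCreate())

    # Hourly rolled-up processor stats landed in HDFS on a date-partition basis (illustrative path)
    raw = spark.read.text("/landing/dlp/processor_stats/dt=2022-06-25/*.log")

    # Parse the raw log lines into columns; the regular expressions assume a simple key=value layout
    parsed = raw.select(
        F.regexp_extract("value", r"^(\S+ \S+)", 1).alias("event_time"),
        F.regexp_extract("value", r"processor=(\S+)", 1).alias("processor"),
        F.regexp_extract("value", r"bytesIn=(\d+)", 1).cast("long").alias("bytes_in"),
        F.regexp_extract("value", r"bytesOut=(\d+)", 1).cast("long").alias("bytes_out"),
    )

    # Load into the main managed Hive table used for end-to-end completeness reporting
    parsed.write.mode("append").saveAsTable("dq.dlp_processor_stats")

The same pattern would apply to the connection stats logs, with a second landing path and target table.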
[0056] FIG. 5 illustrates an exemplary block diagram representation (500) of a Data logistics platform Processor and flow of data, in accordance with an embodiment of the present disclosure.
[0057] As illustrated, in an embodiment, a plurality of Data logistics platform servers (502-1, 502-2, ... 502-N) send Data logistics platform data to a Data logistics platform flow (504), which stores the hourly log files in HDFS and later parses the raw data into managed tables by running a Hive query containing the parsing logic through the Data logistics platform flow itself; the parsed data is finally stored in Hive (506).
[0058] FIG. 6 illustrates an exemplary block diagram representation (600) of Data logistics platform error logging, in accordance with an embodiment of the present disclosure. In an exemplary embodiment, a Site2Site bulletin reporting task, which may be continually running, may be used to collect and send data to an input port on the canvas. As illustrated, in an embodiment, a plurality of Data logistics platform servers (502-1, 502-2, 502-N) send data to a Data logistics platform flow for error detection and warning (602) that may be configured to periodically receive errors and warnings from the Site2Site bulletin reporting task in JSON format and later parse and store them into an HDFS location in ORC format of the Hive table (506). Thus, in an embodiment, an automated pipeline may be created for logging errors and exceptions that are thrown either from a Data logistics platform component itself or from any application workflows deployed on it. Alternately, a job may be developed which captures all the errors and warnings thrown by the processors or other Data logistics platform components to help in monitoring and debugging. [0059] In an exemplary embodiment, the steps involved in creating the Data logistics platform error logging flow include: creating a StandardRestrictedSSLContextService at the Reporting Task Controller Services by providing the keystore and truststore filenames; creating the Site2SiteBulletinReportingTask at the Reporting Task panel of the Controller Setting with properties such as Destination URL, Input Port Name and Instance URL, but not limited thereto; creating an input port at the root canvas, which will receive the bulletin messages from the Bulletin Board, can later be used in the Data logistics platform workflow, and can also be scheduled to run for some specified time; after starting the reporting task and input port, receiving the bulletin messages as flow files, with the data arriving as an array in JSON format, but not limited thereto; integrating the parsing and storing logic with the Data logistics platform flow, so that as any new error or warning message gets generated at the Bulletin Board, processing starts just before storing it into HDFS in ORC format; and creating an external Hive table on top of those messages, which captures all the errors and warnings thrown by the processors or other Data logistics platform components to help operations and developer teams for monitoring and debugging purposes.
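A minimal, non-limiting PySpark sketch of the downstream parse-and-store step is shown below; it assumes the bulletin messages have already landed as JSON under an HDFS path, and the path, field names and output location are illustrative only.

    from pyspark.sql import SparkSession

    spark = (SparkSession.builder
             .appName("dlp-error-logging")
             .enableHiveSupport()
             .getOrCreate())

    # Bulletin messages delivered by the reporting task and landed as JSON (illustrative path)
    bulletins = spark.read.json("/landing/dlp/bulletins/dt=2022-06-25/")

    # Keep the fields useful for monitoring and debugging (assumed field names)
    errors = bulletins.select("bulletinTimestamp", "bulletinSourceName",
                              "bulletinLevel", "bulletinMessage")

    # Persist in ORC so an external Hive table defined over this path can serve monitoring queries
    (errors.write
           .mode("append")
           .format("orc")
           .save("/warehouse/dq/dlp_bulletins/dt=2022-06-25/"))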
[0060] FIG. 7 illustrates an exemplary block diagram representation (700) of spark stats collection, in accordance with an embodiment of the present disclosure.
[0061] In an embodiment, an open-source platform may expose many useful real-time metrics through a Spark user interface (interchangeably referred to as Spark UI). As illustrated, the open-source platform may collect real-time metrics from a plurality of ingestion jobs (702-1, 702-2, ... 702-N), consolidate them in a distributed event streaming platform (DESP) (704), and push them to a data completeness OSP ingestion job (706) to check Spark metric data completeness. After being pushed to the intermediate storage, the metrics are stored in a Hive table (708) so that they can be utilized for reporting.
[0062] In an exemplary embodiment, the metrics may be captured at at least three levels: Stage, Error and App. Since metrics that are already being captured are reused, the open-source platform does not incur any significant overhead on the process. Moreover, the open-source platform may be configurable and can be enabled for any ingestion flow.
[0063] In an exemplary embodiment, the steps involved in Spark metric logging may include: creating a custom accumulator in Spark which keeps track of the metrics; creating a custom Spark listener object which updates the accumulator object at the end of each step; creating a custom source which emits these accumulated metrics as logs after every n seconds; pushing the logs to a distributed event streaming platform topic using a distributed event streaming platform appender; and reading the logs from the distributed event streaming platform topic, processing them, and writing them to a Hive table by a Spark job.
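Although the custom accumulators, listeners and sources described above would typically be implemented on the Spark JVM side, the following minimal Python sketch conveys the same idea for an ingestion job: per-record counters are accumulated and periodically emitted to a metrics topic. The broker address, topic name, source path and emission interval are assumptions made only for illustration, and the kafka-python client stands in for whichever producer is actually used.

    import json
    import time
    import threading
    from pyspark.sql import SparkSession
    from kafka import KafkaProducer   # assumed client; any producer API would do

    spark = SparkSession.builder.appName("ingest-with-metrics").getOrCreate()
    sc = spark.sparkContext

    # Accumulators tracking per-job record counts (App/Stage level metrics)
    records_read = sc.accumulator(0)
    records_written = sc.accumulator(0)

    def emit_metrics(interval_s=30):
        """Custom metric source: emits the accumulated counters every n seconds."""
        producer = KafkaProducer(bootstrap_servers="broker:9092")   # illustrative broker
        while True:
            payload = {"app": sc.applicationId,
                       "read": records_read.value,
                       "written": records_written.value,
                       "ts": int(time.time())}
            producer.send("spark-metrics", json.dumps(payload).encode("utf-8"))
            time.sleep(interval_s)

    threading.Thread(target=emit_metrics, daemon=True).start()

    def process_partition(rows):
        for row in rows:
            records_read.add(1)
            yield row                    # transformation logic would go here
            records_written.add(1)

    df = spark.read.parquet("/landing/events/")          # illustrative source path
    df.rdd.mapPartitions(process_partition).count()      # drives the job and the counters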
[0064] FIG. 8 illustrates an exemplary block diagram representation (800) of distributed event streaming platform stats collection, in accordance with an embodiment of the present disclosure.
[0065] As illustrated, in an embodiment, the distributed event streaming platform stats collection job may include ingestion jobs (702-1, 702-2, ... 702-N), where the distributed event streaming platform (804) may be used as a source to capture metrics such as "How many messages have been consumed from the distributed event streaming platform" and "How many messages have been processed by the job after consuming from the distributed event streaming platform", but not limited thereto. The metrics are pushed to a data completeness DESP ingestion job (802) in real time and are later processed and stored in a table (708) for reporting purposes.
[0066] In an exemplary embodiment, the steps involved in distributed event streaming platform metrics logging may include: creating a custom logger such that every time a message is consumed and committed from the distributed event streaming platform it is logged with a 'consumed' tag, and once the application is done processing these messages it is logged with a 'processed' tag; pushing the logs to a distributed event streaming platform topic using a distributed event streaming platform appender; and reading the logs from the distributed event streaming platform topic by a Spark job, processing them, and writing them to a Hive table. [0067] FIG. 9 illustrates an exemplary block diagram representation (900) of a consolidated report generation, in accordance with an embodiment of the present disclosure. As illustrated, in an embodiment, after collecting all the metrics related to the Data logistics platform (902), the distributed event streaming platform (904) and the open-source platform (906) independently, the metrics may be combined to create a consolidated view (908) that may help in identifying anomalies at any stage of the pipeline and also avoid manual debugging on production systems. After capturing metrics for different components, they are consolidated to generate hourly metrics which can be visualized and on top of which alerts can be set. In an exemplary embodiment, apart from hourly reports, day-over-day ratios for at least 4 buckets (6 hours each) may be captured (910), which helps in comparing the data flow rate of today with that of yesterday.
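A minimal Python sketch of the 'consumed'/'processed' tagging described in the metrics logging steps above is shown below; the topic name, broker address, consumer group and processing function are assumptions, and the kafka-python client is used only as an illustrative stand-in for the actual consumer API.

    import logging
    from kafka import KafkaConsumer   # assumed client; any consumer API would do

    logger = logging.getLogger("desp-metrics")   # these logs would be shipped to a metrics topic via an appender

    def handle(payload: bytes) -> None:
        pass                                     # application-specific processing would go here

    consumer = KafkaConsumer("events",                       # illustrative topic
                             bootstrap_servers="broker:9092",
                             enable_auto_commit=False,
                             group_id="ingestion-job")

    for message in consumer:
        consumer.commit()
        logger.info("consumed partition=%d offset=%d", message.partition, message.offset)   # 'consumed' tag
        handle(message.value)
        logger.info("processed partition=%d offset=%d", message.partition, message.offset)  # 'processed' tag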
[0068] FIGs. 10A-10I illustrate exemplary representations of implementation results associated with the data quality module, in accordance with an embodiment of the present disclosure. FIGs. 10A-10B illustrate a daily and hourly comparison of source data, intermediate data and sink data in order to establish volumetric trends across datasets and to identify anomalies in datasets. FIG. 10C illustrates data flow statistics across different datasets for a current day. FIG. 10D illustrates the source, intermediate and sink size distribution by bucket for the current day. The charts show the current day bucket-wise distribution in terms of data size for all 3 stages (here 24 hrs. are divided into 4 buckets (0, 1, 2, 3), each of 6 hrs.), which can be helpful for detecting any increase or dip in the data at any stage in the flow; the values represent the size in GB's for each bucket. FIG. 10E illustrates a source-wise ingestion trend. The chart shows the trend of data size at every hour for all 3 stages in the system (i.e. source, intermediate and sink), by which anomalies related to the data can be detected if an inconsistency is found in the chart at any of the stages; the X-axis indicates current day hours, and the Y-axis indicates the size in GB's for all 3 stages at that particular hour. FIG. 10F illustrates counters for the current day dataset and a day-over-day comparison. FIG. 10G illustrates hour-wise ingestion statistics, showing hour-wise statistics related to all 3 stages in the pipeline. FIG. 10H illustrates anomaly identifier charts for a day-over-day comparison of the source, intermediate and sink stages in terms of data size ratio. The chart shows the ratio of the current day vs the previous day in terms of data size for all 3 stages (source, intermediate and sink); X-axis: observation date, Y-axis: percentage change in data size with respect to the previous day for all 3 stages rolling over 7 days. FIG. 10I illustrates the source, intermediate and sink data size comparison measured over a 6-hour window for each day. The charts show the bucket-wise comparison in terms of data size for all 3 stages over a period of 3 days (here 24 hrs. are divided into 4 buckets (0, 1, 2, 3), each of 6 hrs.); X-axis: observation date, Y-axis: ratio between the respective data size for the 6-hour window of the current day and the previous day's bucket size for all 3 stages over the past 3 days rolling. A consolidated dashboard comprising the above charts, created in a superset, may showcase the capabilities of the data quality platform.
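To make the bucket-wise, day-over-day ratio concrete, the short pandas sketch below derives the 6-hour buckets from hourly size stats and divides each bucket's size by the same bucket on the previous day; the input file name and column names are assumptions made for the example.

    import pandas as pd

    # Hourly size stats per stage as captured by the metric jobs (columns: hour, stage, size_gb)
    df = pd.read_csv("hourly_stats.csv", parse_dates=["hour"])

    df["date"] = df["hour"].dt.date
    df["bucket"] = df["hour"].dt.hour // 6          # 24 hours split into 4 buckets of 6 hours each

    daily = (df.groupby(["date", "bucket", "stage"])["size_gb"]
               .sum()
               .unstack("stage"))                   # one column per stage: source, intermediate, sink

    # Day-over-day ratio per bucket: today's size divided by yesterday's size for the same bucket
    ratios = daily / daily.groupby(level="bucket").shift(1)
    print(ratios.tail(4))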
[0069] FIG. 11 illustrates an exemplary block diagram representation (1100) of the system architecture, in accordance with an embodiment of the present disclosure.
[0070] In an exemplary embodiment, data sampling of the d-1 day's data for a particular source at the Hive layer, and storing it in a temp table (1102), is done as illustrated. Data profiling is done on the sampled data and an HTML report is written into an HDFS directory (1106) based on the source table information. JDSP Data Curio (1108) provides a GUI where end users can view the generated report based on their access privileges as per ranger policies. The above-mentioned steps may be automated using python scripts (1104).
[0071] In an embodiment, a Data-sampler module may perform data-sampling of source data and store the sampled data into a temporary Hive table. In another embodiment, a Data-profiler module may use an open source tool, such as pandas profiling but not limited to it, to generate a profile report from the sampled data and store the profile report into the HDFS directory, catalogued based on a data source name. A JDSP Data Curio, but not limited to it, may provide a GUI where end users can view the generated report based on their access privileges as per ranger policies.
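As an illustrative sketch only, the Python fragment below combines the Data-sampler and Data-profiler steps: it samples the previous day's data from a Hive source table, stages the sample, and generates an HTML profile report with the pandas profiling tool named above. The table names, sampling fraction, row cap and output path are assumptions made for the example.

    from pyspark.sql import SparkSession
    from pandas_profiling import ProfileReport   # open-source profiler mentioned in the disclosure

    spark = SparkSession.builder.appName("data-sampler-profiler").enableHiveSupport().getOrCreate()

    # Data-sampler: sample the d-1 day's data for a source and stage it in a temporary Hive table
    sampled = (spark.table("raw.events")                          # illustrative source table
                    .where("dt = date_sub(current_date(), 1)")
                    .sample(fraction=0.01))
    sampled.write.mode("overwrite").saveAsTable("tmp.events_sample")

    # Data-profiler: build an HTML report from the sample, to be landed in the HDFS catalogue path
    pdf = sampled.limit(100000).toPandas()
    report = ProfileReport(pdf, title="events d-1 profile")
    report.to_file("/tmp/events_profile.html")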
[0072] FIG. 12 illustrates an exemplary computer system in which or with which embodiments of the present invention can be utilized in accordance with embodiments of the present disclosure. As shown in FIG. 12, computer system 1200 can include an external storage device 1210, a bus 1220, a main memory 1230, a read only memory 1240, a mass storage device 1250, communication port 1260, and a processor 1270. A person skilled in the art will appreciate that the computer system may include more than one processor and communication ports. Examples of processor 1270 include, but are not limited to, an Intel® Itanium® or Itanium 2 processor(s), or AMD® Opteron® or Athlon MP® processor(s), Motorola® lines of processors, FortiSOC™ system on chip processors or other future processors. Processor 1270 may include various modules associated with embodiments of the present invention. Communication port 1260 can be any of an RS-232 port for use with a modem based dialup connection, a 10/100 Ethernet port, a Gigabit or 10 Gigabit port using copper or fiber, a serial port, a parallel port, or other existing or future ports. Communication port 1260 may be chosen depending on a network, such as a Local Area Network (LAN), Wide Area Network (WAN), or any network to which the computer system connects. Memory 1230 can be Random Access Memory (RAM), or any other dynamic storage device commonly known in the art. Read-only memory 1240 can be any static storage device(s) e.g., but not limited to, a Programmable Read Only Memory (PROM) chip for storing static information e.g., start-up or BIOS instructions for processor 1270. Mass storage 1250 may be any current or future mass storage solution, which can be used to store information and/or instructions. Exemplary mass storage solutions include, but are not limited to, Parallel Advanced Technology Attachment (PATA) or Serial Advanced Technology Attachment (SATA) hard disk drives or solid-state drives (internal or external, e.g., having Universal Serial Bus (USB) and/or Firewire interfaces), e.g. those available from Seagate (e.g., the Seagate Barracuda 782 family) or Hitachi (e.g., the Hitachi Deskstar 12K800), one or more optical discs, Redundant Array of Independent Disks (RAID) storage, e.g. an array of disks (e.g., SATA arrays), available from various vendors including Dot Hill Systems Corp., LaCie, Nexsan Technologies, Inc. and Enhance Technology, Inc.
[0073] Bus 1220 communicatively couples processor(s) 1270 with the other memory, storage and communication blocks. Bus 1220 can be, e.g. a Peripheral Component Interconnect (PCI) / PCI Extended (PCI-X) bus, Small Computer System Interface (SCSI), USB or the like, for connecting expansion cards, drives and other subsystems as well as other buses, such as a front side bus (FSB), which connects processor 1270 to the software system.
[0074] Optionally, operator and administrative interfaces, e.g. a display, keyboard, and a cursor control device, may also be coupled to bus 1220 to support direct operator interaction with a computer system. Other operator and administrative interfaces can be provided through network connections connected through communication port 1260. The external storage device 1210 can be any kind of external hard-drives, floppy drives, IOMEGA® Zip Drives, Compact Disc-Read Only Memory (CD-ROM), Compact Disc-Rewritable (CD-RW), Digital Video Disk-Read Only Memory (DVD-ROM). Components described above are meant only to exemplify various possibilities. In no way should the aforementioned exemplary computer system limit the scope of the present disclosure.
[0075] While considerable emphasis has been placed herein on the preferred embodiments, it will be appreciated that many embodiments can be made and that many changes can be made in the preferred embodiments without departing from the principles of the invention. These and other changes in the preferred embodiments of the invention will be apparent to those skilled in the art from the disclosure herein, whereby it is to be distinctly understood that the foregoing descriptive matter is to be interpreted merely as illustrative of the invention and not as a limitation. ADVANTAGES OF THE PRESENT DISCLOSURE
[0076] The present disclosure provides for a system and method that imposes no serious overhead on the existing jobs, as the way the metrics are captured is very lightweight. [0077] The present disclosure provides for a system and method that gives a consolidated view across the different systems involved in the pipeline.
[0078] The present disclosure provides for a system and method that allows end-users to measure the completeness of available data and quantify its correctness and quality.
[0079] The present disclosure provides for a system and method that facilitates identification of a sudden increase/decrease in data flow due to high usage or a system issue, so that appropriate teams can be alerted as soon as the anomaly occurs.
[0080] The present disclosure provides for a system and method that facilitates identification of misbehaving components in the pipeline, which reduces the scope for detailed analysis by providing an initial RCA from just looking at the final report. [0081] The present disclosure provides for a system and method that helps in avoiding count queries for datasets, as it already provides count and size stats at an hourly level, reducing the load on underlying services.
[0082] The present disclosure provides for a system and method that facilitates capturing the errors that occur while the application is running, which helps in debugging the application.
[0083] The present disclosure provides for a system and method that allows continuous data-sampling and data profiling and a one-click solution to view the generated profile report, which helps determine the accuracy, completeness, and validity of the data and lets organizations make better data decisions.

Claims

We Claim:
1. A system for monitoring quality of a set of data packets of a big data eco- system by one or more first computing devices, the system comprising: a data quality module comprising one or more processors (202) coupled with a memory (204), wherein said memory (204) stores instructions which when executed by the one or more processors (202) causes said system to: receive, the set of data packets from the one or more first computing devices, said set of data packets pertaining to data from file system based sources; extract, by said data quality module (110), a set of attributes from the set of data packets received, said set of attributes pertaining to the quality of data received; identify, by said data quality module (110), one or more corrupt data packets based on the set of attributes extracted; log, by said data quality module (110), the one or more corrupt set of data packets into a first queue; and, auto-analyse, by said data quality module, the logged corrupt set of data packets to determine one or more errors or exceptions leading to the corrupt set of data packets.
2. The system as claimed in claim 1, wherein the data quality module (110) further automates a warning signal whenever the one or more corrupt set of data packets are identified by the system.
3. The system as claimed in claim 1, wherein the data quality module updates a database every time and enables an end-to-end data completeness check for each data packet received.
4. The system as claimed in claim 1, wherein the data quality module captures one or more metrics pertaining to the one or more corrupt set of data packets at at least three levels, wherein the one or more metrics are consolidated and pushed to an intermediate storage in real time.
5. The system as claimed in claim 1, wherein the data quality module processes the one or more metrics from the intermediate storage and stores the processed one or more metrics in a predefined format in the database.
6. The system as claimed in claim 1, wherein the data quality module performs sampling of data of the set of data packets received and stores the sampled data in a second predefined format in the database.
7. The system as claimed in claim 1, wherein the data quality module performs data profiling on the sampled data, and the profiled data is stored in a third pre-defined format in the database.
8. The system as claimed in claim 1, wherein the data quality module auto generates a report based on the profiled data.
9. The system as claimed in claim 1, wherein the report generated is displayed in a display unit associated with the one or more first computing device.
10. A method for monitoring quality of a set of data packets of a big data eco-system by one or more first computing devices, the method comprising: receiving, at a data quality module coupled to a processor, the set of data packets, wherein the set of data packets pertain to data from file system based sources; extracting, by said data quality module (110), a set of attributes from the set of data packets received, said set of attributes pertaining to the quality of data received; identifying, by said data quality module (110), one or more corrupt data packets based on the set of attributes extracted; logging, by said data quality module (110), the one or more corrupt set of data packets into a first queue; and, auto-analysing, by said data quality module (110), the logged corrupt set of data packets to determine one or more errors or exceptions leading to the corrupt set of data packets.
11. The method as claimed in claim 10, wherein the method further comprises: automating, by the data quality module (110), a warning signal whenever the one or more corrupt set of data packets are identified by the method.
12. The method as claimed in claim 10, wherein the method further comprises: updating, by the data quality module, a database every time and enabling an end-to-end data completeness check for each data packet received.
13. The method as claimed in claim 10, wherein the method further comprises: capturing, by the data quality module, one or more metrics pertaining to the one or more corrupt set of data packets at at least three levels, wherein the one or more metrics are consolidated and pushed to an intermediate storage in real time.
14. The method as claimed in claim 10, wherein the method further comprises: processing, by the data quality module, the one or more metrics from the intermediate storage and storing the processed one or more metrics in a predefined format in the database.
15. The method as claimed in claim 10, wherein the method further comprises: sampling of data of the set of data packets received by the data quality module and storing in a second predefined format in the database.
16. The method as claimed in claim 10, wherein the method further comprises: data profiling on the sampled data by the data quality module and storing the profiled data in a third predefined format in the database.
17. The method as claimed in claim 10, wherein the method further comprises: auto generating a report based on the profiled data.
18. The method as claimed in claim 10, wherein the method further comprises: displaying, by the data quality module, the report generated in a display unit associated with the one or more first computing device.
PCT/IB2022/055911 2021-06-26 2022-06-25 System and method facilitating monitoring of data quality WO2022269573A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
IN202121028762 2021-06-26
IN202121028762 2021-06-26

Publications (1)

Publication Number Publication Date
WO2022269573A1 true WO2022269573A1 (en) 2022-12-29

Family

ID=84544174

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/IB2022/055911 WO2022269573A1 (en) 2021-06-26 2022-06-25 System and method facilitating monitoring of data quality

Country Status (1)

Country Link
WO (1) WO2022269573A1 (en)

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US10318501B2 (en) * 2016-10-25 2019-06-11 Mastercard International Incorporated Systems and methods for assessing data quality
US20200117757A1 (en) * 2018-10-16 2020-04-16 Open Text Sa Ulc Real-time monitoring and reporting systems and methods for information access platform

Patent Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US10318501B2 (en) * 2016-10-25 2019-06-11 Mastercard International Incorporated Systems and methods for assessing data quality
US20200117757A1 (en) * 2018-10-16 2020-04-16 Open Text Sa Ulc Real-time monitoring and reporting systems and methods for information access platform

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
TALEB IKBAL, SERHANI MOHAMED ADEL, BOUHADDIOUI CHAFIK, DSSOULI RACHIDA: "Big data quality framework: a holistic approach to continuous quality management", JOURNAL OF BIG DATA, vol. 8, no. 1, 1 December 2021 (2021-12-01), XP093019123, DOI: 10.1186/s40537-021-00468-0 *

Similar Documents

Publication Publication Date Title
US10733079B2 (en) Systems and methods for end-to-end testing of applications using dynamically simulated data
US10797958B2 (en) Enabling real-time operational environment conformity within an enterprise architecture model dashboard
US9590880B2 (en) Dynamic collection analysis and reporting of telemetry data
US10303533B1 (en) Real-time log analysis service for integrating external event data with log data for use in root cause analysis
US9189355B1 (en) Method and system for processing a service request
US20160092516A1 (en) Metric time series correlation by outlier removal based on maximum concentration interval
JP5886712B2 (en) Efficient collection of transaction-specific metrics in a distributed environment
KR20190075972A (en) Systems and methods for identifying process flows from log files and for visualizing flows
US8990621B2 (en) Fast detection and diagnosis of system outages
US10116534B2 (en) Systems and methods for WebSphere MQ performance metrics analysis
AU2019201142A1 (en) Testing and improving performance of mobile application portfolios
US20240020215A1 (en) Analyzing large-scale data processing jobs
US20130047169A1 (en) Efficient Data Structure To Gather And Distribute Transaction Events
EP2957073B1 (en) Queue monitoring and visualization
KR101989330B1 (en) Auditing of data processing applications
US10929259B2 (en) Testing framework for host computing devices
US9276826B1 (en) Combining multiple signals to determine global system state
US20220179764A1 (en) Multi-source data correlation extraction for anomaly detection
WO2022269573A1 (en) System and method facilitating monitoring of data quality
Chen et al. System-Level Data Management for Endpoint Advanced Persistent Threat Detection: Issues, Challenges and Trends
Lin et al. Trusted behavior identification model for distributed node
Papazachos et al. Preliminary offline trace analysis: project deliverable D4. 2
Machiraju et al. Service Fundamentals: Instrumentation, Telemetry, and Monitoring
Singh Monitoring Hadoop

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 22827827

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE