CN107810500B - Data quality analysis - Google Patents

Data quality analysis

Info

Publication number
CN107810500B
Authority
CN
China
Prior art keywords
data
data set
upstream
dataset
subset
Prior art date
Legal status
Active
Application number
CN201680034382.7A
Other languages
Chinese (zh)
Other versions
CN107810500A (en)
Inventor
C. Spitz
Joel Gould
Current Assignee
Ab Initio Technology LLC
Original Assignee
Ab Initio Technology LLC
Priority date
Filing date
Publication date
Application filed by Ab Initio Technology LLC filed Critical Ab Initio Technology LLC
Priority to CN202311630856.2A (CN117807065A)
Publication of CN107810500A
Application granted
Publication of CN107810500B
Legal status: Active (current)
Anticipated expiration


Classifications

    • G06F16/2365 Ensuring data consistency and integrity
    • G06F16/215 Improving data quality; data cleansing, e.g. de-duplication, removing invalid entries or correcting typographical errors
    • G06F16/24568 Data stream processing; continuous queries
    • G06F16/248 Presentation of query results
    • G06F40/197 Version control (text processing)
    • G06F8/65 Updates (software deployment)

Abstract

A method, comprising: receiving information indicative of an output data set generated by a data processing system; identifying one or more upstream data sets on which the output data set depends based on data lineage information related to the output data set; and performing an analysis of one or more of the identified upstream data sets on which the output data set depends. The analysis includes: for each particular upstream data set of the one or more upstream data sets, applying one or more of the following rules: (i) a first rule indicating an allowable deviation between a profile of the particular upstream data set and a reference profile of the particular upstream data set, and (ii) a second rule indicating one or more allowed or forbidden values for each of one or more data elements in the particular upstream data set; and selecting one or more of the upstream data sets based on a result of applying the one or more rules. The method further includes outputting information associated with the selected one or more upstream data sets.

Description

Data quality analysis
Background
The present description relates to data quality analysis. The data quality of a data set indicates whether the data records in the data set have errors. In general, if errors occur during the processing of a data set, the data quality of the data set is poor.
Disclosure of Invention
In a general aspect, a method includes: receiving information indicative of an output data set generated by a data processing system; identifying one or more upstream data sets on which the output data set depends based on data lineage information related to the output data set; analyzing one or more of the identified upstream data sets on which the output data set depends, the analyzing including: for each particular upstream data set of the one or more upstream data sets, applying one or more of the following rules: (i) a first rule indicating an allowable deviation between a profile of the particular upstream data set and a reference profile of the particular upstream data set, and (ii) a second rule indicating one or more allowed or forbidden values for each of one or more data elements in the particular upstream data set; selecting one or more of the upstream data sets based on a result of applying the one or more rules; and outputting information associated with the selected one or more upstream data sets.
Embodiments may include one or more of the following features.
One or more of the first rule and the second rule are automatically generated. The first rule is automatically generated based on an automated analysis of a historical profile of the particular upstream dataset. The reference profile is based on a historical average profile of the particular upstream dataset. The second rule is automatically generated based on an automated analysis of historical values of one or more data elements in the particular upstream dataset. The allowed or forbidden values are determined based on the automated analysis.
One or more of the first rule and the second rule are specified by a user.
The method further includes receiving, through a user interface, a designation of one or more of the first rule and the second rule.
The data lineage information indicates one or more data sets on which the output data set depends, one or more data sets that depend on the output data set, or both.
Analyzing each of the one or more data sets to identify a subset of the one or more data sets includes determining which of the one or more data sets have an error or are likely to have an error; the method further includes selecting the data sets that have an error or are likely to have an error as the subset.
Analyzing each of the one or more data sets to identify a subset of the one or more data sets includes identifying a particular data set for which a deviation between a profile of the particular data set and a reference profile of the particular data set exceeds the allowable deviation indicated by the corresponding first rule; the method further includes selecting the particular data set as part of the subset.
Analyzing each of the one or more data sets to identify a subset of the one or more data sets includes identifying a particular data set having data elements with values that do not satisfy the allowed or forbidden values indicated by the corresponding second rule; the method further includes selecting the particular data set as part of the subset.
The identifying further includes identifying a data element in the output data set, and identifying one or more data sets on which the output data set depends includes identifying a data set that affects the identified data element in the output data set. Identifying a data element in the output data set includes identifying a data element that has an error or is likely to have an error.
The method further includes generating a profile of one or more of the upstream data sets. Generating a profile for a particular data set includes generating a new profile for the particular data set upon receipt of a new version of the particular data set.
A reference profile for a particular data set is derived from one or more previous profiles of that particular data set.
Outputting information associated with the subset of data sets includes outputting an identifier of each data set in the subset.
Outputting information associated with the subset of data sets includes outputting an indicator of an error or possible error associated with each data set in the subset.
The method further includes displaying a representation of the data processing system on a user interface, and outputting information associated with the subset of data sets includes displaying information associated with a particular data set in the subset in proximity to a representation of that particular data set. The displayed information associated with the particular data set may include a value indicating a deviation between the profile of the particular data set and the reference profile of the particular data set. The displayed information associated with the particular data set may include a value representing the number of data elements in the particular data set that do not satisfy the allowed or forbidden values indicated by the corresponding second rule. The method further includes displaying an information bubble or pop-up window showing information about the subset of data sets.
The method further includes providing a user interface to enable a user to add rules, modify rules, or remove rules.
The data sets include: one or more source data sets including data elements to be processed by the data processing system, and one or more reference data sets including reference values referenced by the data processing system in the processing of the data elements in the source data sets. The reference data set includes data associated with a business entity associated with the data processing system, and the source data set includes data associated with customers of the business entity.
The data processing system includes conversion elements, and the method includes identifying one or more conversion elements affecting the output dataset based on the data lineage information. The method further includes determining one or more of the conversion elements that have an error or are likely to have an error.
In a general aspect, a non-transitory computer-readable medium stores instructions for causing a computing system to: receive information indicative of an output data set generated by a data processing system; identify one or more upstream data sets on which the output data set depends based on data lineage information related to the output data set; analyze one or more of the identified upstream data sets on which the output data set depends, the analyzing including: for each particular upstream data set of the one or more upstream data sets, applying one or more of the following rules: (i) a first rule indicating an allowable deviation between a profile of the particular upstream data set and a reference profile of the particular upstream data set, and (ii) a second rule indicating one or more allowed or forbidden values for each of one or more data elements in the particular upstream data set; select one or more of the upstream data sets based on a result of applying the one or more rules; and output information associated with the selected one or more upstream data sets.
In a general aspect, a computing system includes a processor connected to a memory, the processor and the memory configured to: receive information indicative of an output data set generated by a data processing system; identify one or more upstream data sets on which the output data set depends based on data lineage information related to the output data set; analyze one or more of the identified upstream data sets on which the output data set depends, the analyzing including: for each particular upstream data set of the one or more upstream data sets, applying one or more of the following rules: (i) a first rule indicating an allowable deviation between a profile of the particular upstream data set and a reference profile of the particular upstream data set, and (ii) a second rule indicating one or more allowed or forbidden values for each of one or more data elements in the particular upstream data set; select one or more of the upstream data sets based on a result of applying the one or more rules; and output information associated with the selected one or more upstream data sets.
In a general aspect, a computing system includes: means for receiving information indicative of an output data set generated by a data processing system; means for identifying one or more upstream data sets on which the output data set depends based on data lineage information relating to the output data set; means for analyzing one or more of the identified upstream data sets on which the output data set depends, the analysis including: for each particular upstream data set of the one or more upstream data sets, applying one or more of the following rules: (i) a first rule indicating an allowable deviation between a profile of the particular upstream data set and a reference profile of the particular upstream data set, and (ii) a second rule indicating one or more allowed or forbidden values for each of one or more data elements in the particular upstream data set; means for selecting one or more of the upstream data sets based on a result of applying the one or more rules; and means for outputting information associated with the selected one or more upstream data sets.
In a general aspect, a method includes: upon identifying an error or possible error in a data element of a downstream data set of a data processing system, automatically identifying one or more upstream data sets affecting the data element based on data lineage information related to the downstream data set; determining which of the upstream data sets have an error or are likely to have an error, including analyzing a current profile and a reference profile of each of the identified upstream data sets; and outputting information associated with each of the upstream data sets determined to have an error or to be likely to have an error.
Aspects can include one or more of the following advantages.
The methods described herein may help users, such as data analysts or application developers, quickly identify the root cause of a data quality problem. For example, reference data in a data processing system is updated frequently, but is not necessarily thoroughly checked prior to deployment. Errors in the reference data may cause data quality problems in downstream data processed using the reference data. Analysis of the root cause of a data quality problem in a downstream data set may help identify reference data or other upstream data having data quality problems that may have affected the data quality of the downstream data set. Notifying users of potential data quality problems can help them actively manage data processing.
Other features and advantages of the invention will become apparent from the following description, and from the claims.
Drawings
Fig. 1 and 2 are data lineage diagrams.
Fig. 3A and 3B are data lineage diagrams.
Fig. 4 is a diagram of a user interface.
Fig. 5 is a system diagram.
Fig. 6 is a diagram of a user interface.
Fig. 7, 8A, and 8B are diagrams of data processing systems.
Fig. 8C is an example of a record.
Fig. 9A and 9B are diagrams of data processing systems.
FIG. 10A is a diagram of a data processing system.
Fig. 10B is an example of a record.
Fig. 11 to 15 are flowcharts.
Fig. 16 is a system diagram.
Detailed Description
A method of identifying the root cause of a data quality problem based on data lineage analysis is described herein. If a data quality problem is identified in a downstream dataset, the upstream datasets and upstream conversion elements (sometimes referred to as upstream data lineage elements) from which the downstream dataset was derived are identified. The quality of each upstream data lineage element is evaluated to identify one or more upstream data lineage elements that may themselves have data quality problems that lead to the data quality problem in the downstream dataset. In some examples, the profile characterizing each upstream dataset is compared to a reference profile, such as a historical average profile, for that dataset to determine whether the dataset has data quality problems. In some examples, values in a field of an upstream dataset are compared to one or more allowed or forbidden values for the field to determine whether the dataset has a data quality issue.
A data lineage is information describing the lifecycle of the data records processed by a data processing system. The data lineage information for a given dataset includes one or more upstream datasets on which the given dataset depends, one or more downstream datasets that depend on the given dataset, and an identifier of one or more transformations that process data to generate the given dataset. A downstream dataset depends on an upstream dataset when processing of the upstream dataset by the data processing system directly or indirectly results in the generation of the downstream dataset. The generated downstream dataset may be a dataset output from the data processing system (sometimes referred to as an output dataset) or a dataset to be further processed by the data processing system (sometimes referred to as an intermediate dataset). The upstream dataset may be a dataset that is input into the data processing system (sometimes referred to as an input dataset or a reference dataset), or a dataset that has undergone processing by the data processing system (sometimes referred to as an intermediate dataset). A conversion is a data processing operation applied to an upstream dataset to produce a downstream dataset that is provided to a data sink. A data lineage diagram is a graphical depiction of the data lineage elements in a data processing system.
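As a rough illustration only (not part of the patent; the element names and graph structure below are assumptions loosely mirroring FIG. 1), a data lineage can be modeled as a directed graph, and the upstream elements of a dataset can be collected by a reverse traversal:

```python
# Hypothetical sketch: model a data lineage as "upstream -> downstream" edges and
# collect every element a given dataset depends on, directly or indirectly.
from collections import defaultdict

def upstream_elements(edges, target):
    """Return all data lineage elements that the target dataset depends on."""
    parents = defaultdict(set)  # downstream element -> its direct upstream elements
    for upstream, downstream in edges:
        parents[downstream].add(upstream)
    seen, stack = set(), [target]
    while stack:
        node = stack.pop()
        for up in parents[node]:
            if up not in seen:
                seen.add(up)
                stack.append(up)
    return seen

# Edges loosely corresponding to the lineage of FIG. 1.
edges = [
    ("source_102", "transform_106"), ("reference_120", "transform_106"),
    ("transform_106", "intermediate_112"),
    ("source_104", "transform_108"), ("reference_122", "transform_108"),
    ("transform_108", "intermediate_114"),
    ("intermediate_112", "transform_116"), ("intermediate_114", "transform_116"),
    ("reference_118", "transform_116"), ("transform_116", "output_110"),
]
print(sorted(upstream_elements(edges, "output_110")))
```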
FIG. 1 is an exemplary data lineage diagram 100 of output data 110 generated by a data processing system. In the example of FIG. 1, the data processing system receives two source data sets 102, 104. The source data may be, for example, data records stored in a file such as an unstructured file, in a database such as a relational database or an object database, in a queue, or in another repository for storing data in or received from a computing system. For example, the source data 102 may be data records of American credit card transactions stored in the file "US_feed". Each data record may include a respective value for one or more fields, such as attributes defined within the record structure or columns in a database table. The source data 102, 104 may be received from a file or database and processed in batches, for example hourly, daily, weekly, monthly, quarterly, yearly, or at other intervals. The source data 102, 104 may also be received as a stream and processed continuously, e.g., buffered in queues and processed as data becomes available and system resources allow.
The source data 102 is processed by a conversion element 106, which operates on the source data 102, for example to alter the source data 102 in some way. The conversion element may be an executable program that manipulates data, such as a Java program executing within a virtual machine, an executable file, a dataflow graph, or another type of executable program. For example, the conversion element 106 may be an executable file named "transforma". In a particular example, the conversion element 106 can be a filtering component that filters unwanted data records, such as data records having an incorrect format, from the source data 102. The conversion element 106 processes the source data 102 in view of reference data 120 to produce intermediate data 112. Reference data is data used by a conversion element so that the conversion element can perform its data processing. For example, reference data that enables a mapping operation includes one or more fields whose values correspond to values in one or more fields of the data being processed. The intermediate data 112 may be stored in a file, database, queue, or other repository for storing data in a computing system.
The conversion element 108 processes the set of source data 104 in view of the reference data 122 to produce intermediate data 114. Intermediate data 114 may be stored in a file, database, queue, or other repository for storing data in a computing system.
The intermediate data 112, 114 are processed together by a conversion element 116 that utilizes reference data 118. In one example, the conversion element 116 is a mapping operation and the reference data 118 includes data records representing state values and corresponding region values. When the intermediate data 112, 114 are processed by the conversion element 116, the value of the state field in each data record of the intermediate data 112, 114 is mapped to the corresponding region as represented in the reference data 118. In another example, the reference data 118 includes business data representing company business units and corresponding partial identifiers, manager names, and locations. When the intermediate data 112, 114 are processed by the conversion element 116, each data record is assigned to a corporate business unit based on the mapping enabled by the reference data set. The reference data 118 may be used to process multiple data sets and is unchanged during processing. The reference data 118 may be updated by a user periodically or as needed.
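Purely as an illustration of such a mapping conversion (the field names and values are assumptions, not taken from the patent), the lookup can be sketched as follows:

```python
# Hypothetical sketch of a mapping conversion: enrich each record with a region
# looked up from reference data that maps state values to region values.
reference_118 = [
    {"state": "MA", "region": "Northeast"},
    {"state": "CA", "region": "West"},
    {"state": "TX", "region": "South"},
]
state_to_region = {r["state"]: r["region"] for r in reference_118}

def map_region(records):
    for rec in records:
        # Unknown states are flagged rather than silently dropped.
        rec["region"] = state_to_region.get(rec["state"], "UNKNOWN")
    return records

intermediate = [{"txn_id": 1, "state": "MA"}, {"txn_id": 2, "state": "ZZ"}]
print(map_region(intermediate))
```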
The conversion element 116 outputs output data 110 that is stored in a file, database, queue, or other repository for storing data in a computing system. The output data 110 may be further processed by other conversion elements, for example, in the same data processing system or a different data processing system, or may be stored for future analysis.
In the example of FIG. 1, the data lineage of the output data 110 is shown for the data lineage elements in one data processing system. In some examples, the data lineage of a dataset can be tracked across multiple data processing systems. For example, source data may initially be processed by a first data processing system to produce output data X. A second data processing system reads the output data X from the first data processing system and processes it to generate output data Y. The output data Y is processed by a third data processing system to generate output data Z. The data lineage of output data Z includes the initial source data, the transformations in each of the three data processing systems, and any reference data used during processing by any of the three data processing systems.
In some examples, the output data may be generated by a more complex data processing system, such as that shown in the exemplary end-to-end data lineage diagram 200A for the target element 206A in FIG. 2. In the data lineage diagram 200A, connections between data elements 202A and conversion elements 204A are shown. A data element 202A may represent a data set, a table within a data set, a column in a table, a field in a file, or other data. A conversion element 204A is, for example, an element of an executable that describes how one output data element is generated. The root cause of a potential data quality problem in the target element 206A (or another data element 202A) may be traced in the data processing system of FIG. 2. Further description of FIG. 2 can be found in U.S. Patent Publication 2010/0138431, the contents of which are incorporated herein by reference in their entirety.
The information shown in a data lineage diagram, such as the data lineage diagram of fig. 1 or 2, shows which upstream data sources, sinks, or conversions affect downstream data. For example, the data lineage diagram 100 of FIG. 1 reveals that the output data 110 is affected by the source data 102, 104, the reference data 118, 120, 122, and the conversion elements 106, 108, 116.
Understanding the lineage of a downstream data set (such as the output data 110) can help identify the root cause of a data quality problem that occurs in the downstream data. The root cause of a data quality problem refers to the upstream system, operation, or data set that is at least partially responsible for the data quality problem in the downstream data. A data quality problem in a downstream dataset, such as the output data 110, may be due to poor quality source data, poor quality reference data, errors in conversion elements, or a combination of any two or more of these, in the upstream lineage of the output data 110. Tracking the quality or status of data lineage elements can provide information that can be used to evaluate possible root causes of poor quality output data.
The data quality of a data set generally indicates whether the data set has the desired characteristics. Poor data quality may indicate that the data set does not behave as expected, e.g., that it falls outside of statistical specifications, returns a failure in response to a standard query, or exhibits other unexpected behavior. As discussed below, the quality of a data set may be characterized based on a profile of some or all of the data records in the data set, based on the respective values of one or more fields in particular data records, or both.
Poor data quality in a downstream data set (e.g., the output data 110) may be traced back to any of a variety of factors in the upstream data lineage of the output data. One potential cause of poor quality output data is poor quality source data, poor quality reference data, or both. For example, a source data set may be corrupted or truncated during transmission, may be the wrong data set, may be missing data, or may have other problems. A reference data set may have an error introduced in its most recent update, may be corrupted, may be the wrong data set, or may have other problems. Another possible cause of poor quality output data is a problem with a conversion element in the upstream data lineage of the output data. For example, where the software implementing the conversion element has recently been updated to a new version, the conversion element may no longer perform the desired processing, e.g., if the updated software has an error or has been corrupted. The source data, reference data, and conversion elements in the data lineage of the output data 110 can be monitored to facilitate advance identification of potential data quality problems that may occur in the output data set, subsequent tracking of the root causes of data quality problems that do occur in the output data set, or both.
Monitoring and analysis of the source and reference data may help a user diagnose one or more possible causes of poor quality output data. For example, if a poor quality output data set is generated, analysis of the source or reference data in the data lineage of that output data set may indicate that a given source or reference data set itself has poor quality data, and thus may have contributed to the poor quality of the output. Monitoring of the source data and the reference data may also identify in advance poor quality source data or reference data that, if processed, could lead to data quality problems in the downstream output data.
Fig. 3A and 3B depict a method for tracking root causes of known or potential data quality problems in the output data 110 having the data lineage depicted in FIG. 1. Referring to FIG. 3A, before input data (e.g., the source data 102, 104 of FIG. 1) is processed, the quality of the reference data 118, 120, 122 is characterized by quality elements 154, 156, 158, respectively. In some examples, the quality of the reference data may be characterized when the reference data set is updated, on a schedule (e.g., periodically or whenever a reference data update is scheduled), before each input data set is processed, or at other times.
To characterize the quality of a dataset, a quality element calculates a profile (sometimes also referred to as a census) of the fields in the dataset. A profile of a data set is a summary of the data values in its data records, e.g., on a field-by-field basis. The profile includes statistics characterizing the data values, such as a histogram of values, maximum, minimum, and average (e.g., mean or median) values, a standard deviation from the average, samples of the least common and most common values in one or more fields (e.g., for the key data elements of each dataset), or other statistics for at least some of the data records in the collection. In some examples, the profile may include processed information characterizing the data values in each of one or more fields of the data records. The profile may include a classification of the values in a field (e.g., classifying data in a revenue field into high, medium, or low categories), an indication of a relationship between data fields in a data record (e.g., an indication that the state field and the ZIP field are not independent), a relationship between data records (e.g., an indication that data records have a common value in a customer identifier field), or other information characterizing the data in the set of data records.
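A minimal sketch of such a profile, assuming simple in-memory records and hypothetical field names, might compute field-by-field statistics along these lines:

```python
# Hypothetical sketch: compute a simple per-field profile (count, distinct values,
# most common values, and numeric statistics where applicable) for a list of records.
from collections import Counter
from statistics import mean, stdev

def profile_field(records, field):
    values = [r.get(field) for r in records if r.get(field) is not None]
    prof = {"count": len(values), "distinct": len(set(values)),
            "most_common": Counter(values).most_common(3)}
    if values and all(isinstance(v, (int, float)) for v in values):
        prof.update(minimum=min(values), maximum=max(values), average=mean(values),
                    std_dev=stdev(values) if len(values) > 1 else 0.0)
    return prof

def profile_dataset(records, key_fields):
    return {f: profile_field(records, f) for f in key_fields}

records = [{"state": "MA", "amount": 12.5}, {"state": "CA", "amount": 7.0},
           {"state": "MA", "amount": 30.0}]
print(profile_dataset(records, ["state", "amount"]))
```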
The quality element then applies one or more rules to identify any actual or potential data quality problems in the data set. As discussed further below, these rules may be specified by a user and may indicate allowed or forbidden features of the profile. In a specific example, where the reference dataset includes a field listing state abbreviations, an exemplary rule may identify a data quality problem if the number of distinct values in the field is greater than 50. In some examples, the rules may be based on a historical profile of the dataset, e.g., on a historical average. If no data quality problem is identified in the dataset, the profile of the dataset may be used to update the rules, e.g., to update the historical average. If the reference data set is identified as having an actual or potential data quality problem, processing may be suspended until the data quality problem is resolved.
Referring to FIG. 3B, the quality of the source data 102, 104 is characterized by quality elements 150, 152, respectively. The quality elements 150, 152 may characterize the data quality of the source data 102, 104 as the data is received into the data processing system, prior to scheduled processing of the corresponding source data, or at other times. If a source data set is identified as having a known or potential data quality problem, information regarding the data quality problem may be output, for example to alert a user, or the data set may be stored in a data store for future reference. For example, as each quality element 150, 152 reads data from its corresponding dataset, the quality element calculates a profile for that dataset.
In a specific example, to calculate the profile of the source data 102, the quality element 150 may calculate the sum of all values in the transaction_count field of the source data 102. A rule for the source data 102 may compare this sum to the mean and standard deviation of the corresponding sums from the past 30 runs, and may indicate that a data quality problem is identified if the sum of all values in the transaction_count field of the source data 102 falls more than one standard deviation from the mean of those sums.
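Under the assumption that the sums from past runs are available as a simple list (all names below are illustrative), a rule of this kind might be sketched as:

```python
# Hypothetical sketch of the profile rule above: compare the sum of the
# transaction_count field for the current run against the mean and standard
# deviation of the corresponding sums from the past 30 runs.
from statistics import mean, stdev

def check_transaction_count(current_records, past_sums):
    current_sum = sum(r["transaction_count"] for r in current_records)
    mu, sigma = mean(past_sums), stdev(past_sums)
    deviation = abs(current_sum - mu) / sigma if sigma else 0.0
    # Flag a possible data quality problem beyond one standard deviation.
    return {"sum": current_sum, "deviation_in_std": deviation,
            "quality_problem": deviation > 1.0}

past_sums = [1000, 1020, 990, 1010, 1005] * 6             # stand-in for 30 past runs
current = [{"transaction_count": 40} for _ in range(30)]  # sums to 1200
print(check_transaction_count(current, past_sums))
```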
In some examples, a rule used to characterize the quality of a data set may indicate allowed or forbidden features of the profile of the data records in the data set. A feature may be expressed as a value or a range of values. A rule indicating an allowed feature of the profile is satisfied if the profile includes the allowed feature. An example of an allowed feature for a field is an allowed maximum and minimum value for that field; the rule is satisfied if the average value of the field falls between the allowed maximum and minimum values. A rule indicating a forbidden feature of the profile is satisfied as long as the profile does not include the forbidden feature. An example of a forbidden feature for a field is a list of values forbidden for that field; if the field includes any of the forbidden values, the rule is not satisfied.
Rules indicating characteristics of a profile may indicate an allowable deviation between a profile of a field of a particular dataset and a reference profile of a field of the dataset. Deviations between a profile of a data set and a reference profile of the data set that are greater than the allowable deviations indicated by the respective rules may indicate that a data quality problem exists in the data set, and thus indicate that the data set is a possible root cause of an existing or potential data quality problem in a downstream data set. In some examples, the allowed deviation may be specified as a range of values, such as a maximum allowed value and a minimum allowed value, and so on. In some examples, the allowed deviation may be specified as a standard deviation from a value that may be an average (e.g., a mean or median of values in a past dataset).
In some examples, a rule used to characterize the quality of a data set may indicate allowed or forbidden features of the values in one or more fields of a data record, e.g., based on the validity of the value in that field. A rule indicating an allowed feature of a field is satisfied where the value in the field satisfies the allowed feature. A rule indicating a forbidden feature of a field is satisfied as long as the value in the field does not satisfy the forbidden feature. Values that satisfy a rule are sometimes referred to as valid values; values that do not satisfy a rule are sometimes referred to as invalid values. Various features of the values in a field may be indicated as allowed or forbidden features by the rules. An exemplary rule may indicate allowed or forbidden characteristics of the content of a field, such as a range of allowed or forbidden values, a maximum allowed value, a minimum allowed value, or a list of one or more specific values that are allowed or forbidden. For example, a birth_year field having a value less than 1900 or greater than 2016 may be considered invalid. An exemplary rule may indicate allowed or forbidden features of the data type of a field. An exemplary rule may indicate whether the absence of a value (or the presence of NULL) in a certain field is allowed or forbidden. For example, a last_name field that includes a string value (e.g., "Smith") may be considered valid, while a last_name field that is blank or that includes a numerical value may be considered invalid. An exemplary rule may indicate an allowed or forbidden relationship between two or more fields in the same data record. For example, a rule may specify a list of values of the ZIP field corresponding to each possible value of the state field, and may specify that any combination of ZIP and state values not reflected in the list is invalid.
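A sketch of such validation rules, with a deliberately tiny, hypothetical state/ZIP lookup table, could look like this:

```python
# Hypothetical sketch of validation rules on individual field values: an allowed
# range for birth_year, a non-blank, non-numeric last_name, and a state/ZIP
# consistency check driven by a small lookup table.
ZIPS_BY_STATE = {"MA": {"02139", "02140"}, "CA": {"94105"}}

def validate_record(rec):
    errors = []
    if not (1900 <= rec.get("birth_year", 0) <= 2016):
        errors.append("birth_year out of allowed range")
    last = rec.get("last_name")
    if not isinstance(last, str) or last.strip() == "" or last.isdigit():
        errors.append("last_name missing or not a valid string")
    if rec.get("zip") not in ZIPS_BY_STATE.get(rec.get("state"), set()):
        errors.append("zip value not allowed for state value")
    return errors

print(validate_record({"birth_year": 1985, "last_name": "Smith",
                       "state": "MA", "zip": "02139"}))   # []
print(validate_record({"birth_year": 1850, "last_name": "",
                       "state": "CA", "zip": "02139"}))   # three violations
```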
In some examples, a rule may be generated based on automated analysis of historical data. This type of rule is referred to as an automatically generated rule. An automatically generated rule may indicate an allowed or forbidden feature of the profile of the data records in a dataset. For example, an automatically generated profile rule may indicate an allowable deviation between a profile of a field of a particular dataset and an automatically determined historical reference profile of that field. The historical reference profile of the dataset may be based on historical data; for example, the historical reference profile may be a profile of the same dataset from the previous day, an average profile of the same dataset over the previous days (e.g., the past week or month), or a lifetime average profile of the same dataset. More generally, the reference profile may retain a wide variety of reference information to support various statistical analyses. For example, the reference profile may include information about the standard deviation or other indications of the distribution of values. For the purposes of the following examples, and without limiting the generality of the approach, it is assumed that the reference profile includes a numerical average over previous datasets, and possibly also a standard deviation.
An automatically generated rule may indicate an automatically determined allowed or forbidden feature of the values in a field of the data records. In one example, an automatically generated rule for a field may indicate an allowable maximum or minimum for the field based on an analysis of historical maximum or minimum values for the field. In another example, an automatically generated rule for a field may indicate a list of allowed values for the field based on an analysis of the values that have previously occurred in the field. In some examples, an automatically generated rule is specified for each field in the dataset. In some examples, rules are specified for only a subset of the fields. The fields for which rules are specified may be identified automatically, for example based on an analysis of the data records. For example, any field in a set of data records that typically has a small number of distinct values (sometimes referred to as a low-cardinality field) may be identified as a field for which a rule can be automatically generated.
In some examples, machine learning techniques are employed to generate the automatically generated rules. For example, the data may be analyzed during a learning period to identify historical averages or expected values before the rules are generated. The learning period may be a specified period of time, or an amount of time until the average or expected values stabilize.
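One way such automatic rule generation could be sketched (the statistics and thresholds below are assumptions, not the patent's method) is to derive a reference value and an allowable band from a learning period of historical summary statistics:

```python
# Hypothetical sketch of automatic rule generation: after a learning period of
# historical profiles, derive a reference value and an allowable deviation band.
from statistics import mean, stdev

def learn_rule(historical_values, n_std=2.0):
    """historical_values: one summary statistic (e.g., record count) per past run."""
    mu, sigma = mean(historical_values), stdev(historical_values)
    return {"reference": mu, "allowed_low": mu - n_std * sigma,
            "allowed_high": mu + n_std * sigma}

def apply_rule(rule, current_value):
    return rule["allowed_low"] <= current_value <= rule["allowed_high"]

history = [10000, 10250, 9900, 10100, 9800, 10050, 10200]   # learning period
rule = learn_rule(history)
print(rule, apply_rule(rule, 10120), apply_rule(rule, 14000))
```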
In some examples, the rules may be specified by a user. This type of rule is referred to as a user-specified rule. The user-specified rule may specify an enabled or disabled feature of a profile of a field of a particular dataset, an enabled or disabled feature of a value in each of one or more fields of a data record in the dataset, or both. The user may specify rules based on, for example, their understanding of the expected characteristics of the data records to be processed by the system. In some examples, the user-specified rule may be assigned a default value that may be modified by the user.
In a specific example, the source data is credit card transaction records for transactions occurring in the United States. The source data is stream data processed in one-hour increments. Based on knowledge of the source data and of the operations to be performed when processing the credit card transaction records, the transaction identifier field, the card identifier field, the state field, the date field, and the amount field may be identified as key data elements to be profiled.
In the specific example where the source data is credit card transaction records, the user may know that there are only 50 allowed values for the state field. The user may create a rule that raises a warning flag if the profile of the source data set identifies more than 50 values in the state field, regardless of the standard deviation of the profile of the source data set from the reference. The user may also know that the source dataset should contain only credit card transaction records for transactions completed on the day of processing. The user may create a rule that a warning message is to be sent if any of the source data records has a date that is inconsistent with the date of processing.
Referring to FIG. 4, in some examples, a user may specify one or more rules through a user interface 400. The exemplary user interface 400 includes a plurality of rows 402 and a plurality of columns 404. Each row 402 is associated with a field 406 of a data record in the dataset, and each column 404 is associated with a rule 408. Through the user interface 400, a user may specify rules for one or more fields 406, or may approve pre-filled default rules for the fields. Additional description of the user interface 400 may be found in U.S. Application Serial No. 13/653,995, filed October 17, 2012, the contents of which are incorporated herein by reference in their entirety. Other implementations of the user interface 400 are also possible.
In some examples, if a possible data quality problem is detected in a dataset (such as in a new version of a reference dataset or in a source dataset), an identifier of the dataset having the possible data quality problem is placed on a list of root cause datasets stored in a database. If a data quality issue is later detected in the output data 110, the database may be queried to identify the upstream data lineage elements of the output data 110 and to determine which of these (if any) are included in the list of root cause datasets.
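A companion sketch to the lineage traversal shown earlier (again hypothetical) illustrates how the stored list of root cause datasets might be intersected with the upstream lineage of a flagged output dataset:

```python
# Hypothetical sketch: intersect the upstream lineage of a flagged output dataset
# with the stored list of root cause datasets to surface likely culprits.
from collections import defaultdict

def upstream_of(edges, target):
    parents = defaultdict(set)
    for up, down in edges:
        parents[down].add(up)
    seen, stack = set(), [target]
    while stack:
        for up in parents[stack.pop()] - seen:
            seen.add(up)
            stack.append(up)
    return seen

def possible_root_causes(edges, output_dataset, root_cause_list):
    return sorted(upstream_of(edges, output_dataset) & set(root_cause_list))

edges = [("reference_118", "transform_116"), ("transform_116", "output_110"),
         ("source_102", "transform_106"), ("transform_106", "output_110")]
print(possible_root_causes(edges, "output_110", ["reference_118", "source_999"]))
```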
In some examples, a user can be notified if a possible data quality problem is detected in a data set (such as in a new version of a reference data set or in a source data set). In some examples, a warning flag may be stored to indicate the data quality problem. For example, if a possible data quality problem is detected in a new version of the reference data set, a warning flag may be stored in conjunction with the profile data of the new version of the reference data. If a possible data quality problem is detected in a source data set, a warning flag may be stored in conjunction with the profile data of the source data set. In some examples, a warning message may be transmitted to the user to indicate the existence of a possible data quality problem. The warning message may be transmitted, for example, as a message, icon, or pop-up window on a user interface; as an email or Short Message Service (SMS) message; or in another form.
In some examples, the rules may specify one or more threshold deviations from the reference profile at which a warning flag is stored or a warning message is transmitted. For example, if the deviation between the profile of the current dataset and the reference profile of the dataset is small (such as between one and two standard deviations), a warning flag may be stored; if the deviation is greater than two standard deviations, a warning message may be transmitted. Threshold deviations may be specified for each of the source data sets and reference data sets.
In some examples, if the deviation is severe (e.g., more than three standard deviations from the reference profile), further processing by the data processing system may be stopped until a user intervenes. For example, any further processing of source data or reference data having a severe deviation is aborted. The conversions to be aborted may be identified from the data lineage as elements downstream of the affected source data or reference data.
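The tiered response described above might be sketched as a simple mapping from deviation (in standard deviations) to an action; the specific thresholds are the examples given in the text, not fixed requirements:

```python
# Hypothetical sketch of threshold deviations: small deviations store a warning
# flag, larger ones send a warning message, and severe ones abort processing.
def alert_level(deviation_in_std):
    if deviation_in_std > 3.0:
        return "abort"            # stop downstream processing until a user intervenes
    if deviation_in_std > 2.0:
        return "warning_message"  # actively notify the user
    if deviation_in_std > 1.0:
        return "warning_flag"     # store a flag alongside the profile data
    return "ok"

for d in (0.4, 1.5, 2.4, 3.7):
    print(d, alert_level(d))
```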
In some examples, the reference profile data is determined automatically. For example, the reference profile data for a given data set may be automatically updated to a running historical average of the past profile data for that data set by recalculating the reference profile data each time new profile data for that data set is determined. In some examples, the user may provide initial reference profile data, for example by profiling a dataset having the desired characteristics.
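An incremental running-average update of the reference profile could be sketched as follows; the single-statistic reference shown here is a simplifying assumption:

```python
# Hypothetical sketch: update a reference profile as a running average each time
# new profile data for the dataset is computed.
def update_reference(reference, new_value):
    """reference: {'mean': float, 'n': int}; new_value: the latest observed statistic."""
    n = reference["n"] + 1
    new_mean = reference["mean"] + (new_value - reference["mean"]) / n
    return {"mean": new_mean, "n": n}

ref = {"mean": 10000.0, "n": 30}        # e.g., average record count over 30 past runs
ref = update_reference(ref, 10300.0)
print(ref)
```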
The update status of the conversion elements 106, 108, 116 in the data lineage of the output data, such as the time or date of the last update of each conversion element 106, 108, 116, may be tracked. By accessing the timing of the most recent update of the conversion elements, the user can evaluate whether one or more conversion elements (e.g., incorrect or corrupted conversion elements) are a possible root cause of an existing or potential data quality problem in the output data 110. For example, if the conversion element 116 is updated shortly before the output data 110 is output from the conversion element 116, the conversion element 116 may be identified as a possible root cause of an existing or potential data quality problem in the output data 110.
Referring to FIG. 5, a trace engine 500 monitors profiles of data lineage elements, such as source data and reference data, and updates to data lineage elements, such as reference data and transformations, in upstream data lineage of a given data set, such as output data generated by a data processing system.
The trace engine 500 includes a data lineage repository 502, where the data lineage repository 502 stores data 504 referencing data lineage elements upstream of a given data set, such as output data generated by a data processing system. For example, data lineage repository 502 can store identifiers for data lineage elements and data indicating relationships between data lineage elements. The data lineage repository 502 can be a file, database, or other data storage mechanism.
The trace engine 500 also includes an update monitor 506. The update monitor 506 monitors when the conversion elements and the reference data sets in the data processing system are updated. For each conversion element referenced by the data lineage repository 502, the update monitor 506 monitors when the software implementing the conversion element is updated. When an update occurs, the update monitor 506 stores an entry 510 in an update store 508, such as a file, database, or other data storage mechanism. The entry 510 indicates the timing of the update, such as the date or time the software was updated, or both. In some examples, the entry 510 may also include an indication of the nature of the update, such as a manually entered description of the update, the text of the code lines altered by the update, or another indication of the nature of the update. The update store 508 can be indexed by an identifier of the conversion element, by the timing of the update, or both.
For each reference data set referenced by the data lineage repository 502, the update monitor 506 monitors when the reference data set is updated. When an update occurs, the update monitor 506 stores an entry 514 in a profile store 516, such as a file, database, or other data storage mechanism. The entry 514 indicates the timing of the update, such as the date or time the reference dataset was updated, or both. The profile store 516 may be indexed by an identifier of the reference dataset, by the timing of the update, or both.
When a reference data set is updated, quality elements of the reference data set generate a profile of the updated reference data (sometimes referred to as a new version of the reference data). The quality element may generate a profile from a list 520 of key data elements stored in a rules store 522, such as a file, database, or other storage mechanism. Key data elements are fields in a data record that are known to be important to a user or system, such as fields specified by a user or automatically identified. A profile is generated for each key data element of the new version of the reference data. For example, a profile generated for a given key data element may be census data indicating how many different values exist for the key data element in the reference data set and how many times each different value occurs. Reference profile data 524 indicating the generated profile for each key data element is stored in profile store 516, for example, in association with entry 514 indicating an update to the reference data.
Where source data is provided to a data processing application, a profile for each source data set referenced by the data lineage repository 502 is generated using the corresponding quality element. A profile is generated for each key data element in the source data, where the key data element is specified in a list 520 of key data elements stored in a rules store 522. Source profile data 526, which indicates the generated profiles of the various profiled source data sets, is stored in a profile store 516, such as a file, database, or other data storage mechanism.
In some examples, the reference profile data 524 and the source profile data 526 are accessed only if a data quality problem occurs in the downstream output data. In some examples, the reference profile data 524, the source profile data 526, or both are analyzed with the profile module to determine whether the data indicates potential data quality issues for the new version of the reference data or the received source data, respectively. The profile data 524, 526 may be analyzed shortly after profile generation or may be analyzed at a later point in time, e.g., at any time the trace engine has free computing resources for analysis.
To analyze the reference profile data 524 or the source profile data 526, the analysis module 530 applies rules 536 stored in the rules store 522, such as automatically generated rules or user-specified rules. The rules may, for example, indicate one or more key data elements for each data set, threshold deviations that indicate possible data quality problems, or other types of rules.
In some examples, if a potential data quality problem is detected in a new version of the reference data or in the source data set, an identifier of the data set with the potential data quality problem is placed on the list 550 of root cause data sets stored in the data lineage repository 502. If the user later detects a data quality issue for a downstream dataset, the user may query the data lineage repository 502 to identify data lineage elements upstream of the output dataset and identify which of these upstream data lineage elements (if any) are included on the list 550 of root cause datasets.
In some examples, the output data 110 is automatically analyzed to determine whether there are possible data quality problems. For example, each batch or time interval of the output data 110 may be profiled, and profiling rules and validation rules may be applied to the output data 110, e.g., to compare the profile of the current output data 110 to a reference profile derived from previous versions of the output data 110. If the profile of the current output data 110 deviates from the reference profile by more than a threshold amount specified in the output data profiling rules, the current output data 110 may be identified as having a potential data quality problem. If the value of a particular data element in the current output data 110 deviates from its expected range of values by more than a threshold amount specified in the output data validation rules, the current output data 110 may be identified as having a potential data quality problem. A warning flag may be stored in a data repository with the output data 110, or the user may be notified, for example through a user interface or with a message.
In some examples, a user identifies the output data 110 as having a potential data quality issue. For example, a business analyst preparing a report summarizing multiple sets of output data 110 may recognize that a particular set of output data 110 does not make sense relative to the other output data sets being analyzed. The analyst may mark that particular set of output data 110 as having a potential data quality problem.
In the event that the output data has a data quality problem, the information stored by the trace engine 500 may be accessed in an attempt to identify the root cause of the data quality problem. For example, an identifier of the output data, such as a file name or timestamp, may be provided to the query module 548, e.g., automatically or by a user. The query module 548 queries the associated stores for information that may be related to the identified output data. In particular, the query module 548 queries the data lineage repository 502 to identify the transformations, source data, and reference data on which the identified output data depends. The query module 548 may then query the update store 508 for any entries 510 indicating updates to any identified conversion elements that occurred shortly before processing of the output data. The query module 548 can query the profile store 516 for any entries 514 indicating updates to the identified reference data, along with the associated reference profile data 524 and any associated warning flags. The query module 548 can query the profile store 516 for the source profile data 526 of any identified source data sets.
Results returned in response to the query of the query module 548 are displayed on the user interface. The display enables a user to view and manipulate the data to learn the potential root cause of the data quality problem in the output data. For example, if a software update is made to the conversion element shortly before the output data is processed, the user may view information associated with the update, such as a description of the update or a changed code line, etc. If there is a warning flag associated with the reference profile data or the source profile data, the user may view the profile data.
In some examples, the results returned by the query module 548 may indicate an update to a conversion element that occurred immediately before that conversion element processed the output data having the potential data quality problem. Such a conversion element is sometimes referred to as a recently updated conversion element. "Immediately before" means within a set amount of time of the processing, such as within ten minutes, within an hour, within a day, or within another amount of time. The update monitor 506 may obtain additional information about the recently updated conversion elements, where the additional information may indicate whether one or more of the recently updated conversion elements is a potential root cause of the data quality problem in the output data. For example, the update monitor 506 may identify any processing artifacts associated with a recently updated conversion element; the presence of processing artifacts may indicate a potential problem with that conversion element. The update monitor 506 may examine the update log associated with a recently updated conversion element to ensure that the update log reflects the update to that conversion element; an inconsistency between the update log and the data 510 indicating the update may indicate a potential problem with the conversion element. The update monitor 506 may examine checksums or other system data to identify potential errors that may have been introduced during the update of a recently updated conversion element.
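A sketch of identifying recently updated conversion elements, assuming update timestamps are available per element, might look like this:

```python
# Hypothetical sketch: identify conversion elements updated "immediately before"
# the output data was processed, i.e., within a configurable time window.
from datetime import datetime, timedelta

def recently_updated(update_entries, processed_at, window=timedelta(days=1)):
    """update_entries: {conversion_element: last_update_time}."""
    return [elem for elem, updated_at in update_entries.items()
            if timedelta(0) <= processed_at - updated_at <= window]

updates = {"transform_106": datetime(2016, 4, 25, 9, 0),   # updated weeks earlier
           "transform_116": datetime(2016, 5, 2, 7, 30)}   # updated just before the run
print(recently_updated(updates, processed_at=datetime(2016, 5, 2, 8, 0)))
```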
In some examples, a user can be notified if a potential problem with a recently updated conversion element is detected. In some examples, a warning flag may be stored in conjunction with the data 510 indicating the update, for example in the update store 508, to indicate the potential problem. In some examples, a warning message may be transmitted to the user through the communication module 546 to indicate the presence of a potential problem with the recently updated conversion element. The warning message may be, for example, a message, an icon, or a pop-up window on a user interface; an email or SMS message; or another form of notification.

In some examples, data lineage and data quality analysis can be performed at the level of a dataset, sometimes referred to as coarse-grained data lineage. Coarse-grained data lineage views the data lineage of a downstream dataset as a whole: the upstream datasets and upstream conversion elements used to generate the downstream dataset are considered to be located in the data lineage of the downstream dataset. In some examples, data lineage and data quality analysis can be performed at the level of a single field, sometimes referred to as fine-grained data lineage. Fine-grained data lineage views the data lineage of a particular field in a downstream dataset: the upstream conversion elements and the fields in upstream datasets used to generate that particular field are considered to be located in the data lineage of the downstream dataset. The methods described herein for performing data quality analysis may be applied in the context of both coarse-grained and fine-grained data lineage.
Additional information related to data profiling can be found in U.S. Patent 8,868,580, entitled "Data Profiling," the contents of which are incorporated herein by reference in their entirety. Typically, a data record is associated with a set of data fields, where each field has a particular value (possibly including a null value) for each record. In some examples, the data records in the dataset have a fixed record structure, in which each data record includes the same fields. In some examples, the data records in the dataset have a variable record structure, including, for example, variable-length vectors or conditional fields. In some examples, the profile module 218 may provide the profile elements 150, 152, 154 with initial format information related to the data records in the dataset. The initial format information may, for example, include the number of bits representing distinct values (e.g., 16 bits), the order of values (including values associated with record fields and values associated with tags or delimiters), the type of value represented by the bits (e.g., a string, a signed or unsigned integer, or another type), or other format information. The format information may be specified in a Data Manipulation Language (DML) file stored in the rules store 522. The profile elements 150, 152, 154 may use predefined DML files to automatically interpret data in various common data system formats, such as SQL tables, XML files, or CSV files, or may use DML files obtained from the rules repository 222 that describe custom data system formats.
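As a rough illustration of what a profile of a single field might contain, the following sketch computes a few summary statistics over a collection of records. The statistics chosen (record count, distinct values, null count, value distribution) are assumptions made for the example and do not reproduce the DML-driven profiling described above.

```python
from collections import Counter

def profile_field(records, field):
    """Compute a simple profile of one field across a collection of record dicts."""
    values = [record.get(field) for record in records]
    counts = Counter(values)
    total = len(values) or 1  # avoid division by zero for an empty dataset
    return {
        "total_records": len(values),
        "distinct_values": len(counts),
        "null_count": counts.get(None, 0),
        "value_distribution": {value: n / total for value, n in counts.items()},
    }

records = [{"state": "MA"}, {"state": "TX"}, {"state": "MA"}, {"state": None}]
print(profile_field(records, "state"))
# {'total_records': 4, 'distinct_values': 3, 'null_count': 1,
#  'value_distribution': {'MA': 0.5, 'TX': 0.25, None: 0.25}}
```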
FIG. 6 illustrates an example of a user interface 300 that enables a user to investigate the root cause of a potential data quality problem in an output dataset. Through the user interface 300, a user may input an identifier 302 of the output dataset or an identifier 304 of a particular data element in the output data. For example, the identifier 302 or 304 may identify an output dataset or a particular data element that has a potential data quality problem. In the example of FIG. 6, the user has entered the output dataset "rolling_records". An interactive data lineage diagram 310 is displayed on the user interface 300 and graphically depicts the upstream data lineage elements of the identified output dataset 328 or of the identified data element. In the exemplary data lineage diagram 310, the data lineage elements located upstream of the identified output dataset include two source datasets 312, 314, two conversion elements 316, 318, and one reference dataset 320.
Upstream data lineage elements with possible data quality problems (such as the source dataset 312, the conversion element 318, and the reference dataset 320 in this example) are marked with warning flags 324a, 324b, 324c, respectively. The user may select a warning flag (such as by clicking or tapping the flag, hovering a mouse pointer over it, or otherwise selecting it) to access information related to the associated possible data quality problem. Information regarding possible data quality problems associated with a dataset may include: profile data, reference profile data for one or more data elements, results of a statistical analysis of the profile data (such as deviations of the profile data from the reference profile data), values that do not satisfy the allowable values specified by the validation rules, or other information. Information about possible data quality problems associated with a conversion element may include: the date of the last update of the conversion element, a description of the update, a snippet of code from the update, or other information. In some examples, an information bubble may be overlaid on the data lineage diagram in response to a user selection of one of the warning flags. In some examples, a new screen may be displayed in response to a user selection of one of the warning flags. In some examples, the information displayed in the bubble or new screen may be interactive, such that the user may access further detailed information by selecting an item of information.
Through the user interface 300, the user may also access a rule editor 328 with which the user may add, delete, or modify profiling rules, validation rules, or both. For example, a user may add, delete, or modify the key data elements for each dataset; update the threshold deviation that identifies potential data quality problems; specify whether the profiling rules or validation rules are applied automatically upon receipt of a new dataset or only upon detection of a downstream data quality problem; or make other changes to the profiling rules or validation rules.
In a specific example, the data processing system processes phone records to generate billing records. Each source data record represents a telephone call and includes fields storing data such as the date, the time of the call, the duration of the call, the telephone number dialed, and the telephone number received. The source data records are processed in a monthly batch to generate bills. In this example, in May 2015, 95% of the customer accounts were not billed. The user requests information about the profiles and updates of the data lineage elements in the upstream data lineage used to generate the output data for the May 2015 bills. The source profile data reveals that the dialed phone number field in the source data records used to generate the May 2015 bills has only 10 unique values, while the reference source profile data shows that the number of unique values of the dialed phone number field is expected to range between 1.5 million and 2.4 million. Based on this examination of the source profile data, the user determines that the source data records have been corrupted. The source data records are retrieved from compressed storage and reprocessed to properly generate the May 2015 bills.
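The check made in this example, comparing the number of unique dialed phone numbers against the range expected by the reference profile, can be expressed directly. The sketch below is a hypothetical illustration using the figures from the example; the profile layout and function name are assumptions.

```python
def distinct_count_in_range(profile, field, expected_min, expected_max):
    """Return True when the distinct-value count of a field falls inside the range
    taken from the reference profile."""
    return expected_min <= profile[field]["distinct_values"] <= expected_max

source_profile = {"dialed_number": {"distinct_values": 10}}
# The reference profile expects between 1.5 and 2.4 million unique dialed numbers.
if not distinct_count_in_range(source_profile, "dialed_number", 1_500_000, 2_400_000):
    print("warning: distinct count of dialed_number is far outside the reference range; "
          "the source data records may be corrupted")
```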
In another specific example, the data processing system processes internal company financial records and assigns each financial record to a company division. The assignment of each financial record to a division is performed by mapping the department identifier in each record to one of the six company divisions provided by a company reference dataset. The reference profile data of the company reference data indicates that the number of company divisions has been six for the past decade. The reference data is updated once a quarter. After the latest update, the reference data is profiled, and the profile shows that the number of company divisions in the reference data has increased to 60. The profile of the updated reference data deviates sufficiently from the reference profile of six divisions that a warning message is sent to the system administrator. In addition, if appropriate, further processing by the data processing system is suspended until the reference data can be checked and corrected.
Referring to FIG. 7, in a specific example, the data processing system 50 includes a plurality of conversion elements 52, 54, 56 that process input data 58 comprising records of online purchases made on bostonhop.com in April 2016. Each record of the input data 58 has a plurality of fields, including a state field. In this example, the component 56 is a split component that sends each data record to one of eight files 60a-60h based on the value in the state field of the input data. For example, records with the value MA in the state field are sent to file 60a; records with the value TX are sent to file 60b; records with the value CA to file 60c; records with the value DE to file 60d; records with the value NY to file 60e; records with the value IL to file 60f; records with the value RI to file 60g; and records with any other value are sent to file 60h. The number of records sent to each file is shown in FIG. 7. In the example of FIG. 7, the number of records sent to each file is within the expected range, because the input data 58 falls within the expected range, and therefore no data quality warning is generated.
The quality of the input data 58 is characterized by a quality element 62. The quality element 62 generates a profile of the state field of the input data 58 and applies an automatically generated rule indicating the allowable deviation between the profile of the state field of the input data and a reference profile of the state field. The reference profile represents the average profile of data processed by the data processing system 50 over the past year and indicates the allowable deviation beyond which a potential data quality problem is identified. In this example, the automatically generated rule identifies the input data 58 as having a potential data quality problem if the distribution of values in the state field of the profile of the input data 58 differs by more than 10% from the distribution of values in the reference profile. The reference profile of the state field indicates the following distribution of values, with a 10% allowed deviation:
MA:6%
TX:25%
CA:33%
DE:3%
NY:17%
IL:11%
RI:4%
any other value: 1%.
As can be seen from FIG. 7, the actual profile of the state field falls within the 10% allowable deviation from the reference profile, so the input data has no data quality problem.
Referring to FIG. 8A, in an example of abnormal operation of the data processing system 50, input data 55 includes records of online purchases made on bostonhop.com on April 2, 2016. In this example, no records are sent to file 60g. An operator of the data processing system 50 may notice that file 60g is empty, or the empty file may cause an error in further processing by a downstream data processing system. The operator may track the root cause of no records being sent to file 60g by investigating the quality of upstream data elements within the data lineage of the files 60a-60h. In particular, the input data 55 belongs to the upstream data lineage of the files 60a-60h.
Referring also to FIG. 8B, the quality element 62 generates the following actual profile of the state field of the input data 55:
MA:6%
TX:25.1%
CA:32.7%
DE:2.9%
NY:17.1%
IL:11.1%
RI:0%
any other value: 5.1%
Because of the discrepancy between the profile of the state field of the input data 55 and the reference profile of that field, the input data 55 is identified as having a potential data quality problem, and a warning flag is stored to indicate the potential data quality problem. When the operator tracks the root cause of the empty file 60g, the operator can readily see that there is a potential data quality problem in the input data 55. The operator may then use this knowledge to investigate the cause of the discrepancy (e.g., to determine whether the input data 55 is corrupted, whether earlier processing of the input data 55 in an upstream data processing system caused the discrepancy, or another cause). For example, referring also to FIG. 8C, in this example, by looking at a portion of the actual input data 55, the operator may recognize that the letters in the value "RI" have been transposed to "IR", which causes these records to be sorted into file 60h rather than file 60g.
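A compact way to apply the automatically generated rule from this example is to compare the actual profile of the state field against the reference profile value by value and flag any share that deviates by more than the allowed amount. The sketch below assumes, for illustration, that the 10% allowed deviation is measured relative to each reference share; the text does not spell out the arithmetic, so this interpretation is an assumption.

```python
REFERENCE = {"MA": 6, "TX": 25, "CA": 33, "DE": 3, "NY": 17, "IL": 11, "RI": 4, "OTHER": 1}
ALLOWED_DEVIATION = 10  # percent, relative to each reference share

def deviations(actual, reference, allowed=ALLOWED_DEVIATION):
    """Return the values whose actual share deviates from the reference share by more
    than the allowed percentage of that reference share."""
    flagged = {}
    for key, ref_share in reference.items():
        act_share = actual.get(key, 0.0)
        if abs(act_share - ref_share) > (allowed / 100.0) * ref_share:
            flagged[key] = (ref_share, act_share)
    return flagged

# Profile of the input data 55 (FIG. 8B): RI drops to 0% and "any other value" rises to 5.1%.
actual_55 = {"MA": 6, "TX": 25.1, "CA": 32.7, "DE": 2.9, "NY": 17.1,
             "IL": 11.1, "RI": 0, "OTHER": 5.1}
print(deviations(actual_55, REFERENCE))  # flags RI and OTHER -> potential data quality problem
```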
Referring to FIG. 9A, in another example of abnormal operation of the data processing system 50, input data 64 includes records of online purchases made on bostonhop.com on April 3, 2016. In this example, records are sent only to file 60a and not to any of the other files 60b-60h. An operator of the data processing system 50 may notice that the files 60b-60h are empty, or these empty files may cause errors in further processing by a downstream data processing system.
Referring also to FIG. 9B, the operator of the data processing system may track the root cause of all records being sent to file 60a by investigating the quality of upstream data elements within the data lineage of the files 60a-60h. In this example, the quality element 62 generates the following profile of the state field of the input data 64:
MA:6.1%
TX:25.2%
CA:32.6%
DE:2.9%
NY:17.0%
IL:11.1%
RI:4.1%
any other value: 1%
The profile of the state field of the input data 64 matches the reference profile of the state field, so no potential data quality problem is identified. The operator may then investigate the update status of the conversion elements 52, 54, 56 in the data lineage of the files 60a-60h. For example, the operator may determine that the conversion element 56 was updated immediately before the processing of the input data 64, and thus the conversion element 56 may be the root cause of the empty files 60b-60h.
Referring to FIG. 10A, in a specific example, a data processing system 80 includes a plurality of conversion elements 82, 84 that process a stream of input data 86 comprising phone records of mobile phone calls handled by a particular tower. Each record of the input data 86 has a plurality of fields, including a phone number field. The input data 86 is formatted by the conversion element 82 and then sorted by the conversion element 84 on the value in the phone number field and output into a queue 88, from which the data is fed to a second data processing system for additional processing. In this example, 25% of the records fed from the queue 88 into the second data processing system cause processing errors. An operator of the data processing system 80 may track the root cause of these processing errors by investigating the quality of upstream data elements within the data lineage of the queue 88.
The quality of the input data 86 is characterized by a quality element 90, and the quality of the data 94 output from the format conversion element 82 is characterized by a quality element 92. Both quality elements 90, 92 apply a user-generated rule specifying that the value in the phone number field should be a 10-digit integer and that a potential data quality problem is identified if more than 3% of the records do not satisfy the rule. In this example, the quality element 90 determines that 0.1% of the records in the data 86 have an 11-digit integer in the phone number field. Since this percentage is below the 3% threshold, the quality element 90 does not identify any potential data quality problem for the input data 86. The quality element 92 determines that 25% of the records in the data 94 have an alphanumeric value in the phone number field. An example of a portion of the data 94 is shown in FIG. 10B. A warning flag is stored to indicate that there is a potential data quality problem with the data 94. When the operator tracks the root cause of the processing errors, the operator can readily see that no data quality problem was identified in the input data 86, but that there is a potential data quality problem in the data 94.
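The user-generated rule in this example, that the value in the phone number field must be a 10-digit integer and that more than 3% of violating records indicates a potential data quality problem, can be sketched as follows. The record representation and function name are hypothetical simplifications.

```python
import re

TEN_DIGITS = re.compile(r"^\d{10}$")

def phone_number_rule(records, threshold=0.03):
    """Return (violation_rate, has_potential_problem) for the 10-digit phone number rule."""
    if not records:
        return 0.0, False
    violations = sum(1 for record in records
                     if not TEN_DIGITS.match(str(record.get("phone_number", ""))))
    rate = violations / len(records)
    return rate, rate > threshold

# Data 86: 0.1% of records have an 11-digit number -> below the 3% threshold, no warning.
# Data 94: 25% of records have alphanumeric values  -> above the threshold, warning flag stored.
print(phone_number_rule([{"phone_number": "6175550100"}, {"phone_number": "ABC5550100"}]))
```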
Referring to FIG. 11, in an exemplary process for determining the quality of a source dataset, the source dataset is received into a data processing application (400). A profile of the source dataset is generated and stored (402). One or more rules for the source dataset are retrieved (404). The source data, or the profile of the source data, is analyzed according to the one or more rules (406). If the source dataset does not satisfy the one or more rules (408), an alert indicating a potential data quality problem is stored with the profile data, transmitted to the user, or both (410), and the source data is added to a list of datasets that may have data quality problems. If the source data satisfies the one or more rules (408), the source data is processed using the data processing application (412). In some cases, such as for a significant deviation from a threshold or allowed value specified by a rule, processing is suspended until user intervention enables it to restart. During or after processing, the stored profile data can be accessed by a user, for example to investigate the potential root cause of a downstream data quality problem.
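The flow of FIG. 11 can be summarized as a small driver routine. The sketch below is a hypothetical outline in which the profiler, rules, and alerting are placeholders for the machinery described above; the reference numerals in the comments refer to the steps of FIG. 11.

```python
def check_source_dataset(dataset, profiler, rules, alert, suspect_list, severe=lambda rule: False):
    """Profile a source dataset, apply its rules, and decide whether processing may proceed.

    Returns True if the data processing application may process the data, or False if
    processing should be suspended pending user intervention (e.g., a severe deviation).
    """
    profile = profiler(dataset)                                      # (402) generate and store a profile
    failed = [rule for rule in rules if not rule(dataset, profile)]  # (404)-(408) retrieve and apply rules
    if failed:
        alert(dataset, profile, failed)                              # (410) store/transmit a warning
        suspect_list.append(dataset)                                 # add to the list of suspect datasets
        if any(severe(rule) for rule in failed):
            return False                                             # suspend until a user intervenes
    return True                                                      # (412) process the source data
```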
Referring to FIG. 12, in an exemplary process for monitoring the quality of reference data in a data processing system, a reference dataset is monitored (500). When the reference dataset is updated, a profile of the new version of the reference data is generated and stored (502). For example, profiling may be performed after each scheduled update of the reference data. One or more rules for the reference dataset are retrieved (504). The new version of the reference data, or the profile of the new version, is analyzed according to the one or more rules (506). If the new version of the reference data does not satisfy the one or more rules (508), an alert indicating a possible data quality problem is stored with the profile data, communicated to the user, or both (510). If the new version of the reference data satisfies the one or more rules (508), subsequent processing by the data processing system is allowed to begin or continue (512). In some cases, such as for a significant deviation from a threshold or allowed value specified by a rule, processing is suspended until user intervention allows it to begin or continue. During or after processing, the stored profile data can be accessed by a user, for example to investigate the potential root cause of a downstream data quality problem.
In some examples, a rule is analyzed before it is applied, for example to determine the date on which the rule was last updated. If the rule is older than a threshold age, the rule may not be applied, or the user may be alerted that the rule may be due for an update.
Referring to FIG. 13, in an exemplary process for analyzing updates to a conversion element, the time of the most recent update of the conversion element is identified (600). For example, a timestamp of the most recent update may be stored in a data store. If the conversion element has not been updated recently (602), the update of the conversion element is not analyzed further (604). A recent update may be an update within a threshold amount of time, such as within ten minutes, within an hour, or within a day. If the conversion element was recently updated (602), any processing artifacts are identified (606). The update log associated with the conversion element is checked (608) to identify any inconsistency between the update log and the timestamp of the most recent update stored in the data store. Checksums or other system data associated with the conversion element are checked (610) for an indication of any potential errors that may have been introduced during the update of the conversion element. If no potential problems are identified (612), processing by the system is allowed to begin or continue (614). If one or more potential problems are identified (612), an alert indicating the potential problem with the conversion element is stored in the data store, communicated to the user, or both (616). Processing by the data processing system may be allowed to begin or continue, or may be suspended until user intervention allows processing to begin or continue.
FIG. 14 is a flow chart of an exemplary process. Information indicative of an output dataset generated by a data processing system is received (700). One or more upstream datasets on which the output dataset depends are identified based on data lineage information related to the output dataset (702). The data lineage information indicates one or more datasets on which the output dataset depends, one or more datasets that depend on the output dataset, or both. Each of the identified upstream datasets on which the output dataset depends is analyzed to identify a subset of the datasets, including determining one or more datasets that have errors or that are likely to have errors (704). For each particular upstream dataset, a first rule indicating an allowable deviation between a profile of the particular upstream dataset and a reference profile of the particular upstream dataset is applied (706), and a second rule indicating allowed or forbidden values for one or more data elements in the particular upstream dataset is applied (708). In some examples, only the first rule or only the second rule is applied. The first rule, the second rule, or both may be automatically generated or specified by a user. One or more of the upstream datasets are selected as the subset based on the results of applying the first rule, the second rule, or both (710). Information associated with the subset of upstream datasets is output (712).
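Viewed as code, the process of FIG. 14 walks the lineage back from the output dataset, applies the two kinds of rules, and returns the suspect subset. The sketch below is a hypothetical outline; the lineage lookup, profiles, and rule collections are assumed to be supplied by the caller, and the reference numerals in the comments refer to the steps of FIG. 14.

```python
def find_suspect_upstream_datasets(output_dataset, lineage, profiles, reference_profiles,
                                   deviation_rules, value_rules):
    """Identify the subset of upstream datasets that have, or are likely to have, errors."""
    upstream = lineage(output_dataset)                     # (702) upstream datasets from lineage
    subset = []
    for ds in upstream:                                    # (704) analyze each upstream dataset
        suspect = False
        for rule in deviation_rules.get(ds, []):           # (706) first rule: profile deviation
            if not rule(profiles[ds], reference_profiles[ds]):
                suspect = True
        for rule in value_rules.get(ds, []):               # (708) second rule: allowed/forbidden values
            if not rule(profiles[ds]):
                suspect = True
        if suspect:
            subset.append(ds)                              # (710) select as part of the subset
    return subset                                          # (712) output information about the subset
```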
FIG. 15 is a flow chart of an exemplary process. An error or possible error in a data element of a downstream dataset of a data processing system is identified, e.g., automatically or based on user input (900). One or more upstream datasets affecting the data element are automatically identified based on data lineage information related to the downstream dataset (902). Determining which upstream datasets have or are likely to have errors includes analyzing the current and reference profiles of each of the identified upstream datasets (904). For example, each upstream dataset may be analyzed by applying one or more rules to its current profile. A rule may indicate an allowable deviation between the current profile of a particular upstream dataset and the corresponding reference profile of that dataset. A rule may indicate allowed values for data elements in a particular upstream dataset. Information associated with each upstream dataset that has or is likely to have errors is output (906).
The techniques described herein for monitoring and tracking data quality are rooted in computer technology and may be used to solve problems that arise during the execution of computer-implemented processes. For example, the monitoring and tracking techniques described herein may be used to monitor the processing of datasets by a computer-implemented data processing system and to make that processing more efficient, effective, or accurate. In addition, the techniques described herein may be applied to assist users, such as system administrators, in managing the operation of a data processing system.
FIG. 16 illustrates an example of a data processing system 1000 in which the monitoring and tracking techniques can be used. The system 1000 includes a data source 1002, which may include one or more data sources such as storage devices or connections to online data streams, each of which may store or provide data in any of a variety of formats (e.g., database tables, spreadsheet files, flat text files, or a native format used by a mainframe computer). The data may be logic data, analysis data, or machine data. The execution environment 1004 includes a preprocessing module 1006 and an execution module 1012. The execution environment 1004 may be installed on one or more general-purpose computers under the control of a suitable operating system, such as a version of the UNIX operating system. For example, the execution environment 1004 may include a multi-node parallel computing environment including a configuration of computer systems using multiple central processing units (CPUs) or processor cores, which may be local (e.g., a multiprocessor system such as a symmetric multiprocessing (SMP) computer), locally distributed (e.g., multiple processors connected as a cluster or massively parallel processing (MPP) system), remote, or remotely distributed (e.g., multiple processors connected via a local area network (LAN) and/or wide area network (WAN)), or any combination thereof.
The storage providing the data source 1002 may be local to the execution environment 1004, for example, stored on a storage medium (e.g., hard disk drive 1008) connected to the computer hosting the execution environment 1004, or may be remote to the execution environment 1004, for example, hosted on a remote system (e.g., mainframe 1010) that communicates with the computer hosting the execution environment 1004 over a remote connection (e.g., a connection provided by a cloud computing infrastructure).
The preprocessing module 1006 reads data from the data source 1002 and prepares a data processing application for execution. For example, the preprocessing module 1006 may compile the data processing application, store and/or load the compiled data processing application to and/or from a data storage system 1016 accessible to the execution environment 1004, and perform other tasks to prepare the data processing application for execution.
The execution module 1012 executes the data processing application prepared by the preprocessing module 1006 to process the dataset and generate output data 1014 resulting from the processing. The output data 1014 may be stored back in the data source 1002, stored in a data storage system 1016 accessible to the execution environment 1004, or otherwise used. The data storage system 1016 is also accessible to a development environment 1018 in which a developer 1020 is able to design and edit the data processing applications to be executed by the execution module 1012. In some implementations, the development environment 1018 is a system for developing applications as dataflow graphs that include vertices (representing data processing components or datasets) connected by directed links (representing flows of work elements, i.e., data) between the vertices. Such an environment is described in more detail, for example, in U.S. Patent Publication No. 2007/0011668, entitled "Managing Parameters for Graph-Based Applications," incorporated herein by reference. A system for performing such graph-based computations is described in U.S. Patent 5,966,072, entitled "EXECUTING COMPUTATIONS EXPRESSED AS GRAPHS," the contents of which are incorporated herein by reference in their entirety. A dataflow graph made in accordance with that system provides methods for getting information into and out of the individual processes represented by the graph components, for moving information between the processes, and for defining an order of execution of the processes. The system includes algorithms that choose inter-process communication methods from any available methods (for example, communication paths according to the links of the graph can use TCP/IP or UNIX domain sockets, or use shared memory, to pass data between processes).
The preprocessing module 1006 can receive data from various types of systems (including different forms of database systems) that can embody the data source 1002. The data may be organized as records having values (including possibly empty values) for various fields (also referred to as "attributes" or "columns"). When data is first read from a data source, the preprocessing module 1006 typically begins with some initial format information related to records in the data source. In some cases, the record structure of the data source may be initially unknown, and may instead be determined after analysis of the data source or data. The initial information related to the record may, for example, include the number of bits representing the different values, the order of the fields within the record, and the type of value represented by the bits (e.g., string, signed/unsigned integer).
The monitoring and tracking methods described above can be implemented using a computing system executing suitable software. For example, the software may include procedures in one or more computer programs that execute on one or more programmed or programmable computing systems (which may be of various architectures, such as distributed, client/server, or grid architectures), each including at least one processor, at least one data storage system (including volatile and/or non-volatile memory and/or storage elements), and at least one user interface (for receiving input using at least one input device or port, and for providing output using at least one output device or port). The software may form one or more modules of a larger program that provides services related to the design, configuration, and execution of dataflow graphs. The modules of the program (e.g., elements of a dataflow graph) may be implemented as data structures or other organized data conforming to a data model stored in a data repository.
The software may be provided on a tangible, non-transitory medium, such as a CD-ROM or other computer-readable medium (e.g., readable by a general-purpose or special-purpose programmable computer), or delivered (e.g., encoded in a propagated signal) over a communication medium of a network to a tangible, non-transitory medium of the computing system where it is executed. Some or all of the processing may be performed on a special-purpose computer, or using special-purpose hardware such as coprocessors or Field Programmable Gate Arrays (FPGAs) or Application Specific Integrated Circuits (ASICs). The processing may be implemented in a distributed manner in which different parts of the computation specified by the software are performed by different computing elements. Each such computer program is preferably stored on or downloaded to a computer-readable storage medium (e.g., solid-state memory or media, or magnetic or optical media) of a storage device accessible by a general-purpose or special-purpose programmable computer, for configuring and operating the computer when the storage medium is read by the computer to perform the processing described herein. The inventive system may also be considered to be implemented as a tangible, non-transitory medium configured with a computer program, where the medium so configured causes a computer to operate in a specific and predefined manner to perform one or more of the processing steps described herein.
Many embodiments of the invention have been described. It is to be understood, however, that the foregoing description is intended to illustrate and not limit the scope of the invention, which is defined by the scope of the following claims. Accordingly, other embodiments are within the scope of the following claims. For example, various modifications may be made without departing from the scope of the invention. In addition, some of the steps described above may be order independent and thus may be performed in an order different from that described.

Claims (59)

1. A computer-implemented method, comprising:
receiving information indicative of an output data set generated by a data processing system;
identifying data elements in the output dataset;
identifying one or more upstream data sets on which the output data set depends based on data lineage information related to the output data set;
analyzing one or more upstream data sets of the one or more upstream data sets on which the identified output data set depends, the analyzing comprising:
for each particular upstream dataset of the one or more upstream datasets, applying one or more of the following rules:
(i) A first rule for indicating an allowable deviation between the profile of the particular upstream data set and a reference profile of the particular upstream data set, and
(ii) A second rule indicating one or more allowed or forbidden values for each of the one or more data elements in the particular upstream data set;
selecting one or more of the upstream data sets based on a result of applying the one or more rules; and
outputting information associated with the selected one or more upstream data sets,
wherein the outputted information comprises a warning indicating a quality problem in the particular upstream dataset, and
wherein identifying one or more upstream data sets on which the output data set depends comprises: a dataset is identified that affects the identified data elements in the output dataset.
2. The method of claim 1, wherein one or more of the first rule and the second rule are automatically generated.
3. The method of claim 2, wherein the first rule is automatically generated based on an automated analysis of a historical profile of the particular upstream dataset.
4. A method according to claim 3, wherein the reference profile is based on a historical average profile of the particular upstream dataset.
5. The method of claim 2, wherein the second rule is automatically generated based on an automated analysis of historical values of one or more data elements in the particular upstream dataset.
6. The method of claim 5, wherein the allowed or forbidden value is determined based on the automated analysis.
7. The method of claim 1, wherein one or more of the first rule and the second rule are specified by a user.
8. The method of claim 1, further comprising: a designation of one or more of the first rule and the second rule is received through a user interface.
9. The method of claim 1, wherein the data lineage information indicates one or more data sets on which the output data set depends, one or more data sets that depend on the output data set, or both.
10. The method of claim 1, wherein analyzing each of the one or more data sets to identify a subset of the one or more data sets comprises: determining a dataset of the one or more datasets that has an error or that is likely to have an error; and
The method further comprises the steps of: a data set with errors or possibly with errors is selected as the subset.
11. The method of claim 1, wherein analyzing each of the one or more data sets to identify a subset of the one or more data sets comprises: identifying a particular data set for which a deviation between a profile of the particular data set and a reference profile of the particular data set exceeds an allowable deviation indicated by a corresponding first rule; and
the method further comprises the steps of: the particular data set is selected as the subset.
12. The method of claim 1, wherein analyzing each of the one or more data sets to identify a subset of the one or more data sets comprises: identifying a particular data set having data elements with values that do not satisfy the allowed or forbidden values indicated by the respective second rule; and
the method further comprises the steps of: the particular data set is selected as the subset.
13. The method of claim 1, wherein identifying data elements in the output dataset comprises: data elements that have errors or are likely to have errors are identified.
14. The method of claim 1, further comprising: a profile of one or more of the upstream data sets is generated.
15. The method of claim 14, wherein generating a profile for a particular data set comprises: a new profile for the particular data set is generated upon receipt of the new version of the particular data set.
16. The method of claim 1, wherein the reference profile for a particular data set is derived from one or more previous profiles for that particular data set.
17. The method of claim 1, wherein outputting information associated with a subset of the dataset comprises: an identifier of each data set of the subset is output.
18. The method of claim 1, wherein outputting information associated with a subset of the dataset comprises: an indicator of errors or possible errors associated with each data set of the subset is output.
19. The method of claim 1, wherein,
further comprising displaying a representation of the data processing system on a user interface, and
outputting information associated with a subset of the dataset includes: information associated with a particular data set in a subset of the data sets is displayed in proximity to a representation of the particular data set in the subset.
20. The method of claim 19, wherein the displayed information associated with the particular data set in the subset comprises: a value indicating a deviation between the profile of the particular data set and the reference profile of the particular data set.
21. The method of claim 19, wherein the displayed information associated with the particular data set in the subset comprises: a value representing the number of data elements in the particular data set that do not satisfy the allowed or forbidden value indicated by the corresponding second rule.
22. The method of claim 19, further comprising: an information bubble or pop-up window showing information about a subset of the dataset is displayed.
23. The method of claim 1, further comprising providing a user interface to enable a user to add rules, modify rules, or remove rules.
24. The method of claim 1, wherein the dataset comprises: one or more source data sets comprising data elements to be processed by the data processing system and one or more reference data sets comprising reference values referenced by the data processing system in the processing of data elements in the source data sets.
25. The method of claim 24, wherein the reference data set comprises data associated with a business entity related to the data processing system and the source data set comprises data associated with a customer of the business entity.
26. The method of claim 1, wherein,
the data processing system includes a conversion element, and
the method includes identifying one or more conversion elements affecting the output dataset based on the data lineage information.
27. The method of claim 26, further comprising: one or more of the conversion elements that have an error or that are likely to have an error are determined.
28. The method of claim 27, further comprising: determining whether a particular conversion element has an error or is likely to have an error based on an implementation date associated with the particular conversion element.
29. The method of claim 1, wherein the quality issue comprises one or more of:
the number of different values in the fields of the data record of the particular upstream data set exceeds the allowable number of different values;
the average of the values in the fields of the data records of the particular upstream data set exceeds the allowed maximum value;
the average of the values in the fields of the data records of the particular upstream data set is less than the minimum allowed value;
the value in the field of the data record of the particular upstream dataset is one of the prohibited values;
The profile of a field of the particular upstream dataset deviates from the reference profile for that field by an amount greater than the allowed deviation; and
the values in the fields of the data records of the particular upstream dataset have a prohibited characteristic.
30. A non-transitory computer-readable medium storing instructions for causing a computing system to:
receiving information indicative of an output data set generated by a data processing system;
identifying data elements in the output dataset;
identifying one or more upstream data sets on which the output data set depends based on data lineage information related to the output data set;
analyzing one or more upstream data sets of the one or more upstream data sets on which the identified output data set depends, the analyzing comprising:
for each particular upstream dataset of the one or more upstream datasets, applying one or more of the following rules:
(i) A first rule for indicating an allowable deviation between the profile of the particular upstream data set and a reference profile of the particular upstream data set, and
(ii) A second rule indicating one or more allowed or forbidden values for each of the one or more data elements in the particular upstream data set;
selecting one or more of the upstream data sets based on a result of applying the one or more rules; and
outputting information associated with the selected one or more upstream data sets,
wherein the outputted information comprises a warning indicating a quality problem in the particular upstream dataset, and
wherein identifying one or more upstream data sets on which the output data set depends comprises: a dataset is identified that affects the identified data elements in the output dataset.
31. The non-transitory computer-readable medium of claim 30, wherein the first rule is automatically generated based on an automated analysis of a historical profile of the particular upstream dataset.
32. The non-transitory computer-readable medium of claim 30, wherein the second rule is automatically generated based on an automated analysis of historical values of one or more data elements in the particular upstream dataset.
33. The non-transitory computer-readable medium of claim 30, wherein analyzing each of the one or more data sets to identify a subset of the one or more data sets comprises: determining a dataset of the one or more datasets that has an error or that is likely to have an error; and
The instructions cause the computing system to select a data set having an error or likely having an error as the subset.
34. The non-transitory computer-readable medium of claim 30, wherein analyzing each of the one or more data sets to identify a subset of the one or more data sets comprises: identifying a particular data set for which a deviation between a profile of the particular data set and a reference profile of the particular data set exceeds an allowable deviation indicated by a corresponding first rule; and
the instructions cause the computing system to select the particular data set as the subset.
35. The non-transitory computer-readable medium of claim 30, wherein analyzing each of the one or more data sets to identify a subset of the one or more data sets comprises: identifying a particular data set having data elements with values that do not satisfy the allowed or forbidden values indicated by the respective second rule; and
the instructions cause the computing system to select the particular data set as the subset.
36. The non-transitory computer-readable medium of claim 30, wherein the instructions cause the computing system to generate a profile of one or more of the upstream data sets.
37. The non-transitory computer-readable medium of claim 30, wherein outputting information associated with a subset of the dataset comprises: an identifier of each data set of the subset is output.
38. The non-transitory computer-readable medium of claim 30, wherein outputting information associated with a subset of the dataset comprises: an indicator of errors or possible errors associated with each data set of the subset is output.
39. The non-transitory computer-readable medium of claim 30, wherein,
the instructions cause the computing system to display a representation of the data processing system on a user interface, an
Outputting information associated with a subset of the dataset includes: information associated with a particular data set in a subset of the data sets is displayed in proximity to a representation of the particular data set in the subset.
40. The non-transitory computer-readable medium of claim 30, wherein the instructions cause the computing system to provide a user interface to enable a user to add rules, modify rules, or remove rules.
41. The non-transitory computer-readable medium of claim 30, wherein the data set comprises: one or more source data sets comprising data elements to be processed by the data processing system and one or more reference data sets comprising reference values referenced by the data processing system in the processing of data elements in the source data sets.
42. The non-transitory computer-readable medium of claim 30, wherein,
the data processing system includes a conversion element, and
the instructions cause the computing system to identify, based on the data lineage information, one or more conversion elements that affect the output dataset.
43. The non-transitory computer-readable medium of claim 42, wherein the instructions cause the computing system to determine one or more of the conversion elements that have an error or that are likely to have an error.
44. A computing system, comprising:
one or more processors coupled to a memory, the one or more processors and the memory configured to:
receiving information indicative of an output data set generated by a data processing system;
identifying data elements in the output dataset;
identifying one or more upstream data sets on which the output data set depends based on data lineage information related to the output data set;
analyzing one or more upstream data sets of the one or more upstream data sets on which the identified output data set depends, the analyzing comprising:
for each particular upstream dataset of the one or more upstream datasets, applying one or more of the following rules:
(i) A first rule for indicating an allowable deviation between the profile of the particular upstream data set and a reference profile of the particular upstream data set, and
(ii) A second rule indicating one or more allowed or forbidden values for each of the one or more data elements in the particular upstream data set;
selecting one or more of the upstream data sets based on a result of applying the one or more rules; and
outputting information associated with the selected one or more upstream data sets,
wherein the outputted information comprises a warning indicating a quality problem in the particular upstream dataset, and
wherein identifying one or more upstream data sets on which the output data set depends comprises: a dataset is identified that affects the identified data elements in the output dataset.
45. The computing system of claim 44, wherein the first rule is automatically generated based on an automated analysis of a historical profile of the particular upstream dataset.
46. The computing system of claim 44, wherein the second rule is automatically generated based on an automated analysis of historical values of one or more data elements in the particular upstream dataset.
47. The computing system of claim 44, wherein analyzing each of the one or more data sets to identify a subset of the one or more data sets comprises: determining a dataset of the one or more datasets that has an error or that is likely to have an error; and
the one or more processors and the memory are configured to select a data set having an error or likely having an error as the subset.
48. The computing system of claim 44, wherein analyzing each of the one or more data sets to identify a subset of the one or more data sets comprises: identifying a particular data set for which a deviation between a profile of the particular data set and a reference profile of the particular data set exceeds an allowable deviation indicated by a corresponding first rule; and
the one or more processors and the memory are configured to select the particular data set as the subset.
49. The computing system of claim 44, wherein analyzing each of the one or more data sets to identify a subset of the one or more data sets comprises: identifying a particular data set having data elements with values that do not satisfy the allowed or forbidden values indicated by the respective second rule; and
The one or more processors and the memory are configured to select the particular data set as the subset.
50. The computing system of claim 44, wherein the one or more processors and the memory are configured to generate a profile of one or more of the upstream data sets.
51. The computing system of claim 44, wherein outputting information associated with a subset of the dataset comprises: an identifier of each data set of the subset is output.
52. The computing system of claim 44, wherein outputting information associated with a subset of the dataset comprises: an indicator of errors or possible errors associated with each data set of the subset is output.
53. The computing system of claim 44, wherein,
the one or more processors and the memory are configured to display a representation of the data processing system on a user interface, and
outputting information associated with a subset of the dataset includes: information associated with a particular data set in a subset of the data sets is displayed in proximity to a representation of the particular data set in the subset.
54. The computing system of claim 44, wherein the one or more processors and the memory are configured to provide a user interface to enable a user to add rules, modify rules, or remove rules.
55. The computing system of claim 44 wherein the data set comprises: one or more source data sets comprising data elements to be processed by the data processing system and one or more reference data sets comprising reference values referenced by the data processing system in the processing of data elements in the source data sets.
56. The computing system of claim 44, wherein,
the data processing system includes a conversion element, and
the one or more processors and the memory are configured to identify, based on the data lineage information, one or more conversion elements that affect the output dataset.
57. The computing system of claim 56, wherein the one or more processors and the memory are configured to determine one or more of the conversion elements that have an error or that are likely to have an error.
58. A computing system, comprising:
means for receiving information indicative of an output data set generated by a data processing system;
means for identifying data elements in the output dataset;
means for identifying one or more upstream data sets on which the output data set depends based on data lineage information relating to the output data set;
Means for analyzing one or more upstream data sets of the one or more upstream data sets on which the identified output data set depends, the analysis comprising:
for each particular upstream dataset of the one or more upstream datasets, applying one or more of the following rules:
(i) A first rule for indicating an allowable deviation between the profile of the particular upstream data set and a reference profile of the particular upstream data set, and
(ii) A second rule indicating one or more allowed or forbidden values for each of the one or more data elements in the particular upstream data set;
selecting one or more of the upstream data sets based on a result of applying the one or more rules; and
means for outputting information associated with the selected one or more upstream data sets,
wherein the outputted information comprises a warning indicating a quality problem in the particular upstream dataset, and
wherein identifying one or more upstream data sets on which the output data set depends comprises: a dataset is identified that affects the identified data elements in the output dataset.
59. A computer-implemented method, comprising:
upon identifying an error or possible error in a data element of a downstream data set of a data processing system, automatically identifying one or more upstream data sets affecting the data element based on data lineage information related to the downstream data set;
determining an upstream dataset having an error or likely to have an error in the upstream datasets, comprising: analyzing the current profile and the reference profile of each of the identified upstream data sets; and
information associated with each of the upstream data sets determined to have errors or likely to have errors is output.
CN201680034382.7A 2015-06-12 2016-06-10 Data quality analysis Active CN107810500B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202311630856.2A CN117807065A (en) 2015-06-12 2016-06-10 Method, computing system and computer readable medium for determining data quality rules

Applications Claiming Priority (5)

Application Number Priority Date Filing Date Title
US201562174997P 2015-06-12 2015-06-12
US62/174,997 2015-06-12
US15/175,793 US10409802B2 (en) 2015-06-12 2016-06-07 Data quality analysis
US15/175,793 2016-06-07
PCT/US2016/036813 WO2016201176A1 (en) 2015-06-12 2016-06-10 Data quality analysis

Related Child Applications (1)

Application Number Title Priority Date Filing Date
CN202311630856.2A Division CN117807065A (en) 2015-06-12 2016-06-10 Method, computing system and computer readable medium for determining data quality rules

Publications (2)

Publication Number Publication Date
CN107810500A CN107810500A (en) 2018-03-16
CN107810500B true CN107810500B (en) 2023-12-08

Family

ID=56178502

Family Applications (2)

Application Number Title Priority Date Filing Date
CN202311630856.2A Pending CN117807065A (en) 2015-06-12 2016-06-10 Method, computing system and computer readable medium for determining data quality rules
CN201680034382.7A Active CN107810500B (en) 2015-06-12 2016-06-10 Data quality analysis

Family Applications Before (1)

Application Number Title Priority Date Filing Date
CN202311630856.2A Pending CN117807065A (en) 2015-06-12 2016-06-10 Method, computing system and computer readable medium for determining data quality rules

Country Status (10)

Country Link
US (2) US10409802B2 (en)
EP (2) EP3839758B1 (en)
JP (3) JP6707564B2 (en)
KR (1) KR102033971B1 (en)
CN (2) CN117807065A (en)
AU (2) AU2016274791B2 (en)
CA (2) CA3185178C (en)
HK (1) HK1250066A1 (en)
SG (1) SG10201909389VA (en)
WO (1) WO2016201176A1 (en)

Families Citing this family (37)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US10409802B2 (en) 2015-06-12 2019-09-10 Ab Initio Technology Llc Data quality analysis
US9734188B1 (en) * 2016-01-29 2017-08-15 International Business Machines Corporation Systematic approach to determine source of data quality issue in data flow in an enterprise
US10776740B2 (en) 2016-06-07 2020-09-15 International Business Machines Corporation Detecting potential root causes of data quality issues using data lineage graphs
US10452625B2 (en) * 2016-06-30 2019-10-22 Global Ids, Inc. Data lineage analysis
US10915545B2 (en) 2016-09-29 2021-02-09 Microsoft Technology Licensing, Llc Systems and methods for dynamically rendering data lineage
US10657120B2 (en) * 2016-10-03 2020-05-19 Bank Of America Corporation Cross-platform digital data movement control utility and method of use thereof
US10885057B2 (en) * 2016-11-07 2021-01-05 Tableau Software, Inc. Correlated incremental loading of multiple data sets for an interactive data prep application
US11853529B2 (en) 2016-11-07 2023-12-26 Tableau Software, Inc. User interface to prepare and curate data for subsequent analysis
US10242079B2 (en) 2016-11-07 2019-03-26 Tableau Software, Inc. Optimizing execution of data transformation flows
CA2989617A1 (en) * 2016-12-19 2018-06-19 Capital One Services, Llc Systems and methods for providing data quality management
US10147040B2 (en) 2017-01-20 2018-12-04 Alchemy IoT Device data quality evaluator
US10855783B2 (en) * 2017-01-23 2020-12-01 Adobe Inc. Communication notification trigger modeling preview
US10298465B2 (en) 2017-08-01 2019-05-21 Juniper Networks, Inc. Using machine learning to monitor link quality and predict link faults
US10394691B1 (en) 2017-10-05 2019-08-27 Tableau Software, Inc. Resolution of data flow errors using the lineage of detected error conditions
US10783138B2 (en) * 2017-10-23 2020-09-22 Google Llc Verifying structured data
US10331660B1 (en) * 2017-12-22 2019-06-25 Capital One Services, Llc Generating a data lineage record to facilitate source system and destination system mapping
CN110413632B (en) * 2018-04-26 2023-05-30 腾讯科技(深圳)有限公司 Method, device, computer readable medium and electronic equipment for managing state
AU2019284379A1 (en) 2018-06-12 2021-01-28 Intergraph Corporation Artificial intelligence applications for computer-aided dispatch systems
US10678660B2 (en) * 2018-06-26 2020-06-09 StreamSets, Inc. Transformation drift detection and remediation
JP7153500B2 (en) * 2018-08-09 2022-10-14 富士通株式会社 Data management device and data recommendation program
CA3115220C (en) * 2018-10-09 2023-07-18 Tableau Software, Inc. Correlated incremental loading of multiple data sets for an interactive data prep application
US11250032B1 (en) 2018-10-22 2022-02-15 Tableau Software, Inc. Data preparation user interface with conditional remapping of data values
US10691304B1 (en) 2018-10-22 2020-06-23 Tableau Software, Inc. Data preparation user interface with conglomerate heterogeneous process flow elements
US11157470B2 (en) * 2019-06-03 2021-10-26 International Business Machines Corporation Method and system for data quality delta analysis on a dataset
US11100097B1 (en) 2019-11-12 2021-08-24 Tableau Software, Inc. Visually defining multi-row table calculations in a data preparation application
US11886399B2 (en) 2020-02-26 2024-01-30 Ab Initio Technology Llc Generating rules for data processing values of data fields from semantic labels of the data fields
KR102240496B1 (en) * 2020-04-17 2021-04-15 Korea Information Technology Co., Ltd. Data quality management system and method
US20220059238A1 (en) * 2020-08-24 2022-02-24 GE Precision Healthcare LLC Systems and methods for generating data quality indices for patients
CN112131303A (en) * 2020-09-18 2020-12-25 Tianjin University Large-scale data lineage method based on neural network model
US11277473B1 (en) * 2020-12-01 2022-03-15 Adp, Llc Coordinating breaking changes in automatic data exchange
US20220179835A1 (en) * 2020-12-09 2022-06-09 Kyndryl, Inc. Remediation of data quality issues in computer databases
KR102608736B1 (en) * 2020-12-15 2023-12-01 42Maru Inc. Search method and device for query in document
US11921698B2 (en) 2021-04-12 2024-03-05 Torana Inc. System and method for data quality assessment
US20230185786A1 (en) * 2021-12-13 2023-06-15 International Business Machines Corporation Detect data standardization gaps
KR20230138074A (en) 2022-03-23 2023-10-05 Pai Chai University Industry-Academic Cooperation Foundation Method and apparatus for managing data quality of academic information system using data profiling
KR102437098B1 (en) * 2022-04-15 2022-08-25 Lee Chan-young Method and apparatus for determining error data based on artificial intelligence
US11822375B1 (en) * 2023-04-28 2023-11-21 Infosum Limited Systems and methods for partially securing data

Family Cites Families (21)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US5966072A (en) 1996-07-02 1999-10-12 Ab Initio Software Corporation Executing computations expressed as graphs
KR100922141B1 (en) * 2003-09-15 2009-10-19 Ab Initio Software Llc Data profiling method and system
US7743420B2 (en) * 2003-12-02 2010-06-22 Imperva, Inc. Dynamic learning method and adaptive normal behavior profile (NBP) architecture for providing fast protection of enterprise applications
US7716630B2 (en) 2005-06-27 2010-05-11 Ab Initio Technology Llc Managing parameters for graph-based computations
US20070174234A1 (en) 2006-01-24 2007-07-26 International Business Machines Corporation Data quality and validation within a relational database management system
JP2008265618A (en) 2007-04-23 2008-11-06 Toyota Motor Corp On-vehicle electronic control device
WO2010065623A1 (en) 2008-12-02 2010-06-10 Ab Initio Software Llc Visualizing relationships between data elements and graphical representations of data element attributes
JP6121163B2 (en) 2009-09-16 2017-04-26 Ab Initio Technology Llc Mapping dataset elements
JP2011253491A (en) 2010-06-04 2011-12-15 Toshiba Corp Plant abnormality detector, method for the plant abnormality detector, and program
US8819010B2 (en) 2010-06-28 2014-08-26 International Business Machines Corporation Efficient representation of data lineage information
JP5331774B2 (en) 2010-10-22 2013-10-30 Hitachi Power Solutions Co., Ltd. Equipment state monitoring method and apparatus, and equipment state monitoring program
US10013439B2 (en) 2011-06-27 2018-07-03 International Business Machines Corporation Automatic generation of instantiation rules to determine quality of data migration
US9330148B2 (en) 2011-06-30 2016-05-03 International Business Machines Corporation Adapting data quality rules based upon user application requirements
US9202174B2 (en) 2013-01-28 2015-12-01 Daniel A Dooley Automated tracker and analyzer
US10489360B2 (en) 2012-10-17 2019-11-26 Ab Initio Technology Llc Specifying and applying rules to data
US9063998B2 (en) 2012-10-18 2015-06-23 Oracle International Corporation Associated information propagation system
US9569342B2 (en) * 2012-12-20 2017-02-14 Microsoft Technology Licensing, Llc Test strategy for profile-guided code execution optimizers
US9558230B2 (en) 2013-02-12 2017-01-31 International Business Machines Corporation Data quality assessment
US9576036B2 (en) * 2013-03-15 2017-02-21 International Business Machines Corporation Self-analyzing data processing job to determine data quality issues
US9256656B2 (en) 2013-08-20 2016-02-09 International Business Machines Corporation Determining reliability of data reports
US10409802B2 (en) 2015-06-12 2019-09-10 Ab Initio Technology Llc Data quality analysis

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20050066263A1 (en) * 2003-09-23 2005-03-24 Baugher Ernest S. System and method for generating data validation rules
CN101971165A (en) * 2008-02-26 2011-02-09 Ab Initio Technology Llc Graphic representations of data relationships
CN102460076A (en) * 2009-06-10 2012-05-16 Ab Initio Technology Llc Generating test data

Also Published As

Publication number Publication date
SG10201909389VA (en) 2019-11-28
HK1250066A1 (en) 2018-11-23
US20160364434A1 (en) 2016-12-15
AU2019253860B2 (en) 2021-12-09
CA3185178A1 (en) 2016-12-15
JP2023062126A (en) 2023-05-02
KR102033971B1 (en) 2019-10-18
JP2018523195A (en) 2018-08-16
CN107810500A (en) 2018-03-16
JP6707564B2 (en) 2020-06-10
EP3308297A1 (en) 2018-04-18
KR20180030521A (en) 2018-03-23
EP3308297B1 (en) 2021-03-24
CA2988256A1 (en) 2016-12-15
AU2016274791A1 (en) 2017-11-30
CN117807065A (en) 2024-04-02
JP2020161147A (en) 2020-10-01
WO2016201176A1 (en) 2016-12-15
US20200057757A1 (en) 2020-02-20
US11249981B2 (en) 2022-02-15
EP3839758B1 (en) 2022-08-10
US10409802B2 (en) 2019-09-10
AU2019253860A1 (en) 2019-11-14
EP3839758A1 (en) 2021-06-23
CA3185178C (en) 2023-09-26
AU2016274791B2 (en) 2019-07-25

Similar Documents

Publication Publication Date Title
CN107810500B (en) Data quality analysis
US11182394B2 (en) Performing database file management using statistics maintenance and column similarity
US10121114B2 (en) Metadata-driven audit reporting system with hierarchical relationships
CN105917315B (en) Method and computing system for generating content of data record
US20200334267A1 (en) System and method for automatic generation of extract, transform, load (etl) asserts
CN113396395A (en) Method for effectively evaluating log mode
US20220276920A1 (en) Generation and execution of processing workflows for correcting data quality issues in data sets
US20130167114A1 (en) Code scoring
US8688499B1 (en) System and method for generating business process models from mapped time sequenced operational and transaction data
US9959329B2 (en) Unified master report generator
CN114416891B (en) Method, system, apparatus and medium for data processing in a knowledge graph
US8589444B2 (en) Presenting information from heterogeneous and distributed data sources with real time updates
CN113918662A (en) Data processing method and device, storage medium and electronic equipment

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant