HK1237462B - Managing parameter sets

Info

Publication number: HK1237462B
Application number: HK17111226.4A
Authority: HK
Inventors: E. Bach; R. Oberdorf; B. Larson
Original assignee: Ab Initio Technology LLC
Priority date: 2014-07-18
Filing date: 2015-07-20
Publication date: 2021-02-05

Description

Managing parameter sets

CROSS-REFERENCE TO RELATED APPLICATIONS

This application claims priority to U.S. Application Serial No. 62/026,228, filed on July 18, 2014.

Technical Field

This specification relates to managing parameter sets.

Background Art

In data processing systems, certain types of users often want access to reports on the lineage of data as it passes through the system. Such "data lineage" reports can be used, among many other purposes, to reduce risk, verify compliance with requirements, streamline business processes, and protect data. It is important that data lineage reports be correct and complete.

Summary of the Invention

In one aspect, in general, managing parameter value sets, together with lineage information reflecting relationships among instances of a general-purpose computer program instantiated using those parameter value sets, enables generation of more accurate and complete data lineage reports.

In another aspect, in general, a method for managing parameter value sets includes: receiving a plurality of parameter value sets for a general-purpose computer program; and processing log entries associated with execution of instances of the general-purpose computer program, each instance of the general-purpose computer program being associated with one or more parameter values. The processing includes: analyzing the general-purpose computer program to classify each of one or more parameters associated with the general-purpose computer program as a member of a first class of parameters or a second class of parameters; processing the log entries associated with the execution of a first instance of the general-purpose computer program to form a particular parameter value set, wherein the processing includes including in the particular set any values of parameters appearing in the log entries that are classified as members of the first class, and excluding from the particular set any values of parameters appearing in the log entries that are classified as members of the second class; and determining whether to add the particular parameter value set to the plurality of parameter value sets based on a comparison of a first identifier of the particular parameter value set with identifiers of at least some of the plurality of parameter value sets.

Aspects can include one or more of the following features.

Processing the log entries includes classifying each parameter based on whether the parameter affects data lineage associated with the general-purpose computer program.

The comparison of the first identifier of the particular parameter value set with the identifiers of at least some of the plurality of parameter value sets includes: determining the first identifier based on the particular parameter value set and an identifier of the general-purpose computer program; determining a plurality of second identifiers, each second identifier corresponding to one parameter value set of the at least some of the plurality of parameter value sets; and comparing the first identifier with each of the plurality of second identifiers to determine whether the first identifier matches any of the second identifiers.

Determining whether to add the particular parameter value set to the plurality of parameter value sets includes determining to add the particular parameter value set to the plurality of parameter value sets if no second identifier matches the first identifier.

Determining the first identifier includes computing an identification string from the contents of the particular parameter value set, and determining the plurality of second identifiers includes computing identification strings from the contents of the at least some of the plurality of parameter value sets.

Determining the first identifier includes forming a concatenation of one or more of: the identifier of the general-purpose computer program, name-value pairs of the particular parameter value set, a function prototype of the general-purpose computer program, and a project scope of the first instance of the general-purpose computer program.

Determining the first identifier includes applying a data mapping function to one or more of: the identifier of the general-purpose computer program, name-value pairs of the particular parameter value set, a function prototype of the general-purpose computer program, and a project scope of the first instance of the general-purpose computer program.

The data mapping function includes a hash function.
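The identifier features above (concatenation, a data mapping function, and a hash function) can be sketched as follows. This is a minimal illustration under stated assumptions, not the claimed implementation: the function names `parameter_set_identifier` and `should_add`, the separator characters, the sorting of name-value pairs, and the choice of SHA-256 as the hash function are all hypothetical.

```python
import hashlib


def parameter_set_identifier(program_id, name_value_pairs, prototype="", project_scope=""):
    """Concatenate the identifying fields and map them to a hash string.

    All names are hypothetical; SHA-256 is only one possible hash function.
    """
    # Sort the pairs so the same set of values always yields the same string.
    pairs = ";".join(f"{name}={value}" for name, value in sorted(name_value_pairs.items()))
    concatenation = "|".join([program_id, pairs, prototype, project_scope])
    return hashlib.sha256(concatenation.encode("utf-8")).hexdigest()


def should_add(candidate_id, existing_ids):
    """Add the candidate set only when no existing identifier matches it."""
    return candidate_id not in set(existing_ids)


existing = [parameter_set_identifier("graph_a", {"P1": "x", "P2": "y"})]
same = parameter_set_identifier("graph_a", {"P2": "y", "P1": "x"})
new = parameter_set_identifier("graph_a", {"P1": "x", "P2": "z"})
print(should_add(same, existing))  # False: the identifiers match
print(should_add(new, existing))   # True: no match, so the set would be added
```

Comparing fixed-length hash strings rather than full parameter value sets keeps the matching step a simple equality test.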

The first class of parameters includes parameters that affect the logical operation of the general-purpose computer program, and the second class of parameters includes parameters that do not affect the logical operation of the general-purpose computer program.

The general-purpose computer program is specified as a data flow graph that includes nodes representing data processing operations and links between the nodes representing flows of data elements between the data processing operations.
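As a rough illustration of the structure just described, a data flow graph can be held as a set of named nodes plus a list of directed links. The class `DataflowGraph` and its method names are hypothetical and are not taken from the system described in this specification.

```python
from dataclasses import dataclass, field


@dataclass
class DataflowGraph:
    """Hypothetical container: nodes are operations, links are directed flows."""
    nodes: dict = field(default_factory=dict)  # node name -> operation (a callable)
    links: list = field(default_factory=list)  # (upstream name, downstream name)

    def add_node(self, name, operation):
        self.nodes[name] = operation

    def add_link(self, upstream, downstream):
        # A link records that data elements flow from upstream to downstream.
        self.links.append((upstream, downstream))

    def downstream_of(self, name):
        return [dst for src, dst in self.links if src == name]


graph = DataflowGraph()
graph.add_node("read", lambda: [3, 1, 2])
graph.add_node("sort", sorted)
graph.add_node("write", print)
graph.add_link("read", "sort")
graph.add_link("sort", "write")
print(graph.downstream_of("read"))  # ['sort']
```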

For each of the one or more parameters, the analyzing includes either automatically classifying the parameter or accepting a user-defined classification for the parameter.

Automatically classifying the parameter includes: initially classifying the parameter as belonging to the first class of parameters; determining the number of unique values of the parameter over multiple executions of instances of the general-purpose computer program; and reclassifying the parameter as belonging to the second class of parameters if the number of unique values of the parameter exceeds a predetermined threshold.
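The unique-value rule above can be sketched in a few lines. The function name `classify_by_unique_values`, the class labels, the parameter names, and the threshold value are assumptions chosen for the example; the specification does not prescribe them.

```python
from collections import defaultdict

FIRST_CLASS = "first"    # values kept in discovered parameter value sets
SECOND_CLASS = "second"  # values excluded from discovered parameter value sets


def classify_by_unique_values(executions, threshold):
    """Classify parameters from a list of {name: value} dicts, one per execution."""
    unique_values = defaultdict(set)
    for params in executions:
        for name, value in params.items():
            unique_values[name].add(value)
    # Every parameter starts in the first class and is reclassified only
    # when its number of unique values exceeds the threshold.
    return {
        name: SECOND_CLASS if len(values) > threshold else FIRST_CLASS
        for name, values in unique_values.items()
    }


runs = [
    {"SOURCE": "sales.dat", "RUN_DATE": "2015-07-01"},
    {"SOURCE": "sales.dat", "RUN_DATE": "2015-07-02"},
    {"SOURCE": "returns.dat", "RUN_DATE": "2015-07-03"},
]
classes = classify_by_unique_values(runs, threshold=2)
print(classes)  # {'SOURCE': 'first', 'RUN_DATE': 'second'}
```

In this hypothetical data, a run date that changes on every execution exceeds the threshold and is moved to the second class, while the source file name stays in the first class.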

Automatically classifying the parameter includes: initially classifying the parameter as belonging to the first class of parameters; determining whether changes in the value of the parameter over multiple executions of instances of the general-purpose computer program affect data lineage associated with the general-purpose computer program; and reclassifying the parameter as belonging to the second class of parameters if the changes in the value of the parameter do not affect the data lineage.

The method also includes forming an association between the log entry associated with the execution of the first instance of the general-purpose computer program and the particular parameter value set.

The log entry associated with the execution of the first instance of the general-purpose computer program includes a log entry of an execution command for instantiating the general-purpose computer program, the log entry including one or more parameter values provided as arguments to the execution command.

The log entry associated with the execution of the first instance of the general-purpose computer program also includes one or more of: an indication of the project in which the first instance executed, an indication of internal parameters of the first instance, and an indication of environment settings, global variables, and configuration variables used by the first instance.
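To make the idea of recovering parameter values from a logged execution command concrete, the sketch below parses one log line. The log format (`run_graph <program> --NAME=value ...`), the command name `run_graph`, and the function `parse_execution_log_entry` are purely hypothetical; the specification does not define a log syntax.

```python
import shlex


def parse_execution_log_entry(entry):
    """Split a hypothetical logged command into (program identifier, {name: value})."""
    tokens = shlex.split(entry)  # honors shell-style quoting in argument values
    program_id = tokens[1]
    params = {}
    for token in tokens[2:]:
        if token.startswith("--") and "=" in token:
            name, value = token[2:].split("=", 1)
            params[name] = value
    return program_id, params


entry = 'run_graph my_graph --SOURCE=sales.dat --LABEL="daily load"'
print(parse_execution_log_entry(entry))
```

A real log entry could also carry the project, internal parameters, and environment settings listed above; those would need additional fields in whatever log format is actually used.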

The method also includes processing an overall plurality of parameter value sets for a plurality of general-purpose computer programs, and an overall plurality of log entries associated with execution of instances of at least some of the plurality of general-purpose computer programs, to form a data lineage report, wherein the overall plurality of parameter value sets includes the augmented plurality of parameter value sets for the general-purpose computer program, and the overall plurality of log entries includes the log entry of the execution of the first instance of the general-purpose computer program, including its association with the particular parameter value set.

Forming the data lineage report includes, for each parameter value set of the overall plurality of parameter value sets for the plurality of general-purpose computer programs: processing the overall plurality of log entries to identify all log entries associated with execution of instances of the general-purpose computer program corresponding to the parameter value set; identifying the most recent instantiation time of the general-purpose computer program from the identified log entries; and determining whether to include the parameter value set in the data lineage report based on the most recent instantiation time of the general-purpose computer program.

Determining whether to include the parameter value set in the data lineage report based on the most recent instantiation time of the general-purpose computer program includes comparing the most recent instantiation time with a predetermined time interval, and including the parameter value set in the data lineage report if the most recent instantiation time of the general-purpose computer program is within the predetermined time interval.

Forming the data lineage report includes, for each parameter value set of the overall plurality of parameter value sets for the plurality of general-purpose computer programs: processing the overall plurality of log entries to determine the number of log entries associated with execution of instances of the general-purpose computer program corresponding to the parameter value set; and determining whether to include the parameter value set in the data lineage report based on that number of log entries.
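The two inclusion tests described above (most recent instantiation time, and number of log entries) can be sketched as simple predicates. The function names, the sample data, the 90-day interval, and the reading of the count test as a minimum number of entries are assumptions; the text does not fix the direction or value of either threshold.

```python
from datetime import datetime, timedelta


def include_by_recency(instantiation_times, now, interval):
    """True if the most recent instantiation lies within `interval` before `now`."""
    if not instantiation_times:
        return False
    return now - max(instantiation_times) <= interval


def include_by_count(instantiation_times, min_entries):
    """True if at least `min_entries` log entries refer to the parameter value set."""
    return len(instantiation_times) >= min_entries


# Hypothetical data: parameter-set identifier -> instantiation times of matching log entries.
log_times = {
    "pset_a": [datetime(2015, 7, 1), datetime(2015, 7, 15)],
    "pset_b": [datetime(2014, 1, 10)],
}
now = datetime(2015, 7, 20)
report = [
    pset for pset, times in log_times.items()
    if include_by_recency(times, now, timedelta(days=90))
]
print(report)  # ['pset_a']
```

Either test keeps parameter value sets that are stale or rarely used out of the data lineage report.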

In another aspect, in general, software stored in non-transitory form on a computer-readable medium, for managing parameter value sets, includes instructions for causing a computing system to: receive a plurality of parameter value sets for a general-purpose computer program; process log entries associated with execution of instances of the general-purpose computer program, each instance of the general-purpose computer program being associated with one or more parameter values; and augment the plurality of parameter value sets based on the processing. The processing includes: analyzing the general-purpose computer program to classify each of one or more parameters associated with the general-purpose computer program as a member of a first class of parameters or a second class of parameters; processing the log entries associated with the execution of a first instance of the general-purpose computer program to form a particular parameter value set, wherein the processing includes including in the particular set any values of parameters appearing in the log entries that are classified as members of the first class, and excluding from the particular set any values of parameters appearing in the log entries that are classified as members of the second class; and determining whether to add the particular parameter value set to the plurality of parameter value sets based on a comparison of a first identifier of the particular parameter value set with identifiers of at least some of the plurality of parameter value sets.

In another aspect, in general, a computing system for managing parameter value sets includes: an input device or port for receiving a plurality of parameter value sets for a general-purpose computer program; and at least one processor for processing log entries associated with execution of instances of the general-purpose computer program, each instance of the general-purpose computer program being associated with one or more parameter values. The processing includes: analyzing the general-purpose computer program to classify each of one or more parameters associated with the general-purpose computer program as a member of a first class of parameters or a second class of parameters; processing the log entries associated with the execution of a first instance of the general-purpose computer program to form a particular parameter value set, wherein the processing includes including in the particular set any values of parameters appearing in the log entries that are classified as members of the first class, and excluding from the particular set any values of parameters appearing in the log entries that are classified as members of the second class; and determining whether to add the particular parameter value set to the plurality of parameter value sets based on a comparison of a first identifier of the particular parameter value set with identifiers of at least some of the plurality of parameter value sets.

In another aspect, in general, a computing system for managing parameter value sets includes: means for receiving a plurality of parameter value sets for a general-purpose computer program; and means for processing log entries associated with execution of instances of the general-purpose computer program, each instance of the general-purpose computer program being associated with one or more parameter values. The processing includes: analyzing the general-purpose computer program to classify each of one or more parameters associated with the general-purpose computer program as a member of a first class of parameters or a second class of parameters; processing the log entries associated with the execution of a first instance of the general-purpose computer program to form a particular parameter value set, wherein the processing includes including in the particular set any values of parameters appearing in the log entries that are classified as members of the first class, and excluding from the particular set any values of parameters appearing in the log entries that are classified as members of the second class; and determining whether to add the particular parameter value set to the plurality of parameter value sets based on a comparison of a first identifier of the particular parameter value set with identifiers of at least some of the plurality of parameter value sets.

In another aspect, in general, a method for managing parameter value sets includes: receiving a general-purpose computer program; receiving a first parameter value set; generating an executable instance of the general-purpose computer program by instantiating the general-purpose computer program according to the first parameter value set; receiving data from one or more data sets; executing the executable instance of the general-purpose computer program to process at least some of the received data; generating a log entry for the executable instance of the general-purpose computer program, the log entry including at least some parameter values of the first parameter value set; storing the log entry; receiving the log entry; processing the log entry to form a particular parameter value set, wherein the processing includes extracting the at least some parameter values of the first parameter value set from the log entry and forming the particular parameter value set from the extracted parameter values; and determining whether to add the particular parameter value set to a plurality of pre-existing parameter value sets based on a comparison of a first identifier of the particular parameter value set with identifiers of at least some of the plurality of pre-existing parameter value sets.

The comparison of the first identifier of the particular parameter value set with the identifiers of at least some of the plurality of pre-existing parameter value sets includes: determining the first identifier based on the particular parameter value set and an identifier of the general-purpose computer program; determining a plurality of second identifiers, each second identifier corresponding to one pre-existing parameter value set of the at least some pre-existing parameter value sets; and comparing the first identifier with each of the plurality of second identifiers to determine whether the first identifier matches any of the second identifiers.

Determining whether to add the particular parameter value set to the plurality of pre-existing parameter value sets includes determining to add the particular parameter value set to the plurality of pre-existing parameter value sets if no second identifier matches the first identifier.

Determining the first identifier includes computing an identification string from the contents of the particular parameter value set, and determining the plurality of second identifiers includes computing identification strings from the contents of the at least some of the plurality of pre-existing parameter value sets.

Determining the first identifier includes forming a concatenation of one or more of: the identifier of the general-purpose computer program, name-value pairs of the particular parameter value set, a function prototype of the general-purpose computer program, and a project scope of the executable instance of the general-purpose computer program.

Determining the first identifier includes applying a data mapping function to one or more of: the identifier of the general-purpose computer program, name-value pairs of the particular parameter value set, a function prototype of the general-purpose computer program, and a project scope of the executable instance of the general-purpose computer program.

The method also includes analyzing the general-purpose computer program to classify each of one or more parameters associated with the general-purpose computer program as a member of a first class of parameters or a second class of parameters.

Processing the log entry to form the particular parameter value set also includes: including in the particular set any extracted parameter values appearing in the log entry that are classified as members of the first class, and excluding from the particular set any extracted parameter values appearing in the log entry that are classified as members of the second class.
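The include/exclude step above amounts to filtering the extracted values by class. In this minimal sketch the function name `form_parameter_set`, the class labels, and the parameter names are hypothetical. Treating an unclassified parameter as first class is also an assumption, made by analogy with the initial classification used in automatic classification.

```python
def form_parameter_set(logged_values, classification):
    """Keep logged values of first-class parameters and drop second-class ones."""
    return {
        name: value
        for name, value in logged_values.items()
        # Assumption: a parameter with no recorded class defaults to the first class.
        if classification.get(name, "first") == "first"
    }


logged = {"SOURCE": "sales.dat", "RUN_DATE": "2015-07-01", "TARGET": "out.dat"}
classification = {"SOURCE": "first", "RUN_DATE": "second"}
print(form_parameter_set(logged, classification))
# {'SOURCE': 'sales.dat', 'TARGET': 'out.dat'}
```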

In another aspect, in general, software stored in non-transitory form on a computer-readable medium, for managing parameter value sets, includes instructions for causing a computing system to: receive a general-purpose computer program; receive a first parameter value set; generate an executable instance of the general-purpose computer program by instantiating the general-purpose computer program according to the first parameter value set; receive data from one or more data sets; execute the executable instance of the general-purpose computer program to process at least some of the received data; generate a log entry for the executable instance of the general-purpose computer program, the log entry including at least some parameter values of the first parameter value set; store the log entry; receive the log entry; process the log entry to form a particular parameter value set, wherein the processing includes extracting the at least some parameter values of the first parameter value set from the log entry and forming the particular parameter value set from the extracted parameter values; and determine whether to add the particular parameter value set to a plurality of pre-existing parameter value sets based on a comparison of a first identifier of the particular parameter value set with identifiers of at least some of the plurality of pre-existing parameter value sets.

In another aspect, in general, a system for managing parameter value sets includes: a first computing system including: a first input device or port for receiving a general-purpose computer program, a first parameter value set, and data from one or more data sets; a first set of one or more processors configured to: generate an executable instance of the general-purpose computer program by instantiating the general-purpose computer program according to the first parameter value set; execute the executable instance of the general-purpose computer program to process at least some of the received data; and generate a log entry for the executable instance of the general-purpose computer program, the log entry including at least some parameter values of the first parameter value set; and a first output device or port for storing the log entry in a storage device; and a second computing system including: a second input device or port for receiving the log entry; and a second set of one or more processors configured to: process the log entry to form a particular parameter value set, wherein the processing includes extracting the at least some parameter values of the first parameter value set from the log entry and forming the particular parameter value set from the extracted parameter values; and determine whether to add the particular parameter value set to a plurality of pre-existing parameter value sets based on a comparison of a first identifier of the particular parameter value set with identifiers of at least some of the plurality of pre-existing parameter value sets.

Aspects can include one or more of the following advantages.

By discovering parameter sets using the methods described herein and augmenting the existing parameter sets with the discovered parameter sets, a data lineage report generated using the augmented collection of parameter sets more accurately represents the true data lineage of the data processing system. In particular, portions of the data lineage of the data processing system that would previously have been overlooked are included in the data lineage report.

In some examples, the results of the parameter set discovery method can also be used to augment the log entries of executions of instances of the computer program (i.e., to augment the log entries with information about the discovered parameter sets). The augmented log entries can advantageously be used to verify that logical connections between computer programs and/or data sets correspond to physical connections. This verification ensures that the data lineage presented to a user shows the correct lineage relationships between computer programs and their inputs and outputs.

Other features and advantages of the invention will become apparent from the following description and from the claims.

BRIEF DESCRIPTION OF THE DRAWINGS

FIG. 1 is a block diagram of a system for discovering parameter sets.

FIG. 2 is a data flow graph including subgraphs and their associated parameter sets.

FIG. 3 is a runtime configuration of the data flow graph of FIG. 2.

FIG. 4 is a static analysis configuration of the data flow graph of FIG. 2.

FIG. 5 is a flowchart of a method for discovering parameter sets.

FIG. 6 is the first half of an exemplary operation of the method for discovering parameter sets.

FIG. 7 is the second half of an exemplary operation of the method for discovering parameter sets.

FIG. 8 is an example data flow graph including a first subgraph and a second subgraph.

FIG. 9 shows the data flow graph of FIG. 8 with its logical data sets resolved to physical data sets.

FIG. 10 shows a data lineage report for the data flow graph of FIG. 8.

FIG. 11 is an example data flow graph including a first subgraph and a second subgraph and having duplicate logical data sets.

FIG. 12 shows the example data flow graph of FIG. 11 with its logical data sets resolved to physical data sets.

FIG. 13 shows a data lineage report for the data flow graph of FIG. 11, including a data lineage break.

FIG. 14 shows a first technique for mitigating the effects of data lineage breaks in a data lineage report.

FIG. 15 shows a second technique for mitigating the effects of data lineage breaks in a data lineage report.

FIG. 16 shows a third technique for mitigating the effects of data lineage breaks in a data lineage report.

FIG. 17 shows a fourth technique for mitigating the effects of data lineage breaks in a data lineage report.

FIG. 18 shows a fifth technique for mitigating the effects of data lineage breaks in a data lineage report.

FIG. 19 shows a sixth technique for mitigating the effects of data lineage breaks in a data lineage report.

DETAILED DESCRIPTION

FIG. 1 shows an example of a data processing system 100 in which the parameter set discovery techniques described herein can be used. The system includes a development environment 118, which in some implementations is a system for developing applications as data flow graphs 117 that include vertices (representing data processing components or data sets) connected by directed links (representing flows of work elements, i.e., data) between the vertices. For example, such an environment is described in more detail in U.S. Publication No. 2007/0011668, entitled "Managing Parameters for Graph-Based Applications," which is incorporated herein by reference. A system for executing such graph-based computations is described in U.S. Patent No. 5,966,072, entitled "Executing Computations Expressed as Graphs," which is incorporated herein by reference. Data flow graphs 117 made in accordance with this system provide methods for getting information into and out of the individual processes represented by graph components, for moving information between the processes, and for defining a running order for the processes. The system includes algorithms that choose inter-process communication methods from any available methods (for example, communication paths according to the links of the graph can use TCP/IP or UNIX domain sockets, or use shared memory to pass data between the processes). The data flow graphs 117 created by a developer 120 using the development environment 118 can be stored in a data storage system 116 accessible to the development environment 118 for subsequent use by other modules of the system 100.

An execution environment 104 includes a parameter resolution module 106 and an execution module 112. The execution environment 104 may be hosted, for example, on one or more general-purpose computers under the control of a suitable operating system, such as a version of the UNIX operating system. For example, the execution environment 104 can include a multiple-node parallel computing environment including a configuration of computer systems using multiple central processing units (CPUs) or processor cores, either local (e.g., multiprocessor systems such as symmetric multi-processing (SMP) computers), or locally distributed (e.g., multiple processors coupled as clusters or massively parallel processing (MPP) systems), or remote, or remotely distributed (e.g., multiple processors coupled via a local area network (LAN) and/or wide-area network (WAN)), or any combination thereof.

The parameter resolution module 106 receives a specification of the dataflow graphs 117 from the data storage system 116 and resolves the parameters of the dataflow graphs 117 (as described in more detail below) to prepare the dataflow graphs 117 for execution by the execution module 112. The execution module 112 receives the prepared dataflow graphs 117 from the parameter resolution module 106 and uses them to process data from a data source 102 and generate output data 114. The output data 114 may be stored back in the data source 102 or in the data storage system 116 accessible to the execution environment 104, or otherwise used. In general, the data source 102 may include one or more sources of data, such as storage devices or connections to online data streams, each of which may store or provide data in any of a variety of formats (e.g., database tables, spreadsheet files, flat text files, or a native format used by a mainframe).

Storage devices providing the data source 102 may be local to the execution environment 104, for example, being stored on a storage medium (e.g., hard drive 108) connected to a computer hosting the execution environment 104, or may be remote to the execution environment 104, for example, being hosted on a remote system (e.g., host computer 110) in communication with a computer hosting the execution environment 104 over a remote connection (e.g., provided by a cloud computing infrastructure).

The system 100 also includes a metadata environment module 119, which is accessible to enterprise users 121 (e.g., data architects or business users). The metadata environment module 119 includes a data lineage module 115, which processes the dataflow graphs 117 (or metadata that characterizes them and the input and output datasets they reference) to generate a data lineage for the dataflow graphs 117. The enterprise user 121 can view the data lineage for reasons such as verification of the dataflow graphs 117 and compliance checking. Data lineage information about a particular data item (e.g., a dataset, or a field within a dataset) is based on dependency relationships that arise from processing performed by a data processing system, and the term "data lineage" as used herein generally refers to the set that includes other related data items and the processing entities that consume or generate those data items. A data lineage report (also called a data lineage diagram) may include a graphical representation of the data lineage in the form of a graph with nodes representing the data items and processing entities, and links representing the dependency relationships between them. Some systems capable of generating and displaying data lineage reports are able to automatically present an end-to-end data lineage from ultimate sources of data at an upstream end to the final data produced at a downstream end. Nodes on a path upstream from a particular data item are sometimes called "dependencies" of that data item, and nodes on a path downstream from a particular data item are sometimes called "impacts" of that data item. While "data lineage" is sometimes used to refer only to the upstream dependencies, as used herein "data lineage" may refer to upstream dependencies and/or downstream impacts, as appropriate to the specific context.
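The notions of upstream "dependencies" and downstream "impacts" can be illustrated with a small graph traversal. The following is a minimal sketch, not the data lineage module's actual implementation; the node names and the `build_graph` and `reachable` helpers are hypothetical and chosen only for illustration.

```python
from collections import defaultdict


def build_graph(edges):
    """Build downstream and upstream adjacency maps from (source, target) links."""
    downstream, upstream = defaultdict(set), defaultdict(set)
    for src, dst in edges:
        downstream[src].add(dst)
        upstream[dst].add(src)
    return downstream, upstream


def reachable(start, adjacency):
    """Collect every node reachable from start by following the given links."""
    seen, stack = set(), [start]
    while stack:
        node = stack.pop()
        for nxt in adjacency.get(node, ()):
            if nxt not in seen:
                seen.add(nxt)
                stack.append(nxt)
    return seen


# Hypothetical lineage: two datasets feed one processing entity, whose output
# dataset is consumed by a second processing entity that writes a table.
EDGES = [
    ("DS1", "gather"), ("DS2", "gather"), ("gather", "DS3"),
    ("DS3", "process"), ("process", "table"),
]
downstream, upstream = build_graph(EDGES)
dependencies = reachable("DS3", upstream)   # nodes upstream of DS3
impacts = reachable("DS3", downstream)      # nodes downstream of DS3
```

Here the dependencies of DS3 are the two source datasets and the entity that produced it, and its impacts are the consuming entity and the table it writes.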

1 Dataflow Graph Overview

Referring to FIG. 2, an example of a dataflow graph 217 generated using the development environment 118 of FIG. 1 is shown. The dataflow graph 217 includes a first subgraph 202 named gather.mp and a second subgraph 204 named process.mp.

The first subgraph 202 receives a first logical dataset DS1 206 and a second logical dataset DS2 208 as input, processes the data from the first and second logical datasets 206, 208, and writes the result of the processing to a third logical dataset DS3 210. The second subgraph 204 receives as input a fourth logical dataset DS4 212 (which happens to point to the same physical file as the third logical dataset 210), processes the data from the fourth logical dataset 212, and writes the result of the processing to a table 214.

Each of the four logical datasets 206, 208, 210, 212 is associated with a parameterized path that resolves, at run time, to a path to a physical file. Specifically, the first logical dataset 206 is identified using the parameterized path /${FEED}/inv_${DATE}.dat, the second logical dataset 208 is identified using the parameterized path /${FEED}/cust_${DATE}.dat, the third logical dataset 210 is identified using the parameterized path /trans_${DATE}.dat, and the fourth logical dataset 212 is identified using the parameterized path /trans_${DATE}.dat.

The first subgraph 202 receives two parameters, P1 = FEED and P2 = DATE, as arguments and, as described in more detail below, uses the parameters to resolve the paths to the respective physical locations of the first logical dataset 206, the second logical dataset 208, and the third logical dataset 210 by replacing the FEED and DATE placeholders in the parameterized paths with the values of the received FEED and DATE parameters. In addition, the first subgraph 202 includes a "static analysis" value for the DATE parameter. As described in more detail below, the static analysis value for the DATE parameter is a placeholder value that is used as the parameter value during static analysis of the dataflow graph 217 (i.e., when the data lineage of the dataflow graph 217 is determined).

Similarly, the second subgraph 204 receives a single parameter, P1 = DATE, and uses it to resolve the path to the physical location of the fourth logical dataset 212 by replacing the DATE placeholder in the parameterized path of the fourth logical dataset 212 with the value of the received DATE parameter. In addition, the second subgraph 204 includes a "static analysis" value for the DATE parameter. As described in more detail below, the static analysis value for the DATE parameter is a placeholder value that is used as the parameter value during static analysis of the dataflow graph 217 (i.e., when the data lineage of the dataflow graph 217 is determined).

Because the operation of the dataflow graph 217 and its subgraphs depends on the parameters that they receive, the dataflow graph and its subgraphs are sometimes referred to as "generic" dataflow graphs or "generic" computer programs.

1.1 Parameters

In general, the parameters described above can be designated as either "design-time" parameters or "run-time" parameters. In addition to being used for path resolution as described above, design-time parameters affect the logical operation of their associated dataflow graph. In contrast, run-time parameters are supplied to the graph on a job-by-job basis and do not affect the logical operation of the graph. In some examples, the logical operation of a dataflow graph refers to both the functionality of the graph and the logical datasets used by the graph.

In FIG. 2, the FEED parameter is a design-time parameter that affects the logical operation of the gather.mp subgraph. For example, for one value of the FEED parameter, a sort component 216 in the first subgraph 202 may sort the data that it receives in ascending order, while a different value of the FEED parameter may cause the sort component 216 to sort the data in descending order. In some examples, a dataflow graph that includes design-time parameters is referred to as a "generic graph" because its logical operation changes based on the supplied values of the design-time parameters.

The DATE parameter is a run-time parameter that has no effect on the logical operation of the subgraph 202 and is supplied on a job-by-job basis.

1.2 Parameter Sets

In some examples, commonly used sets of parameters for dataflow graphs are stored as "parameter sets" (sometimes referred to as "psets"), which can be saved to disk and easily reused. For example, in FIG. 2, the first subgraph 202 has three psets associated with it: PSET_mexico 218, PSET_canada 220, and PSET_usa 222. PSET_mexico 218 includes a commonly used FEED parameter value, "mexico," and a commonly used DATE parameter value, "today()," which is a function that returns today's date. PSET_canada 220 includes a commonly used FEED parameter value, "canada," and the commonly used DATE parameter value "today()." PSET_usa 222 includes a commonly used FEED parameter value, "usa," and the commonly used DATE parameter value "today()."

Similarly, the second subgraph 204 has a single pset associated with it, PSET 223. PSET 223 includes the commonly used DATE parameter value "today()," which is a function that returns today's date.

2 Parameter Resolution Module

In some examples, before the dataflow graph 117 is executed by the execution module 112, the parameter resolution module 106 of FIG. 1 identifies one or more psets associated with the dataflow graph 117 (and its associated subgraphs 202, 204) and determines the number of unique design-time parameters in the one or more psets. For each unique design-time parameter for a given dataflow graph, the parameter resolution module 106 instantiates a separate executable instance of the dataflow graph. For example, referring to FIG. 3, for the dataflow graph 217 of FIG. 2, three instances of the first subgraph 202, gather.mp, are instantiated (PSET_mexico->gather.mp 202a, PSET_canada->gather.mp 202b, and PSET_usa->gather.mp 202c), each instance configured according to a different one of the three unique FEED parameter values in the psets of FIG. 2: mexico, canada, and usa. Because the second subgraph 204 is associated with only a single pset 223, which does not include any design-time parameters, only a single instance of the second subgraph 204 (process.mp 204a) is instantiated at execution time.
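The instantiation rule described above can be sketched as grouping psets by their design-time values and creating one instance per distinct group. This is an illustrative sketch only; the `instantiate` function, the dictionary layout of the psets, and the instance labels are assumptions made for the example and are not taken from the described system.

```python
def instantiate(graph_name, psets, design_time_params):
    """Return one instance label per distinct combination of design-time values.

    psets maps a pset name to its parameter name/value pairs (hypothetical layout).
    """
    instances = {}
    for pset_name, params in psets.items():
        # Only design-time values distinguish one instance from another.
        key = tuple(sorted((k, v) for k, v in params.items() if k in design_time_params))
        label = f"{pset_name}->{graph_name}" if key else graph_name
        instances.setdefault(key, label)
    return list(instances.values())


GATHER_PSETS = {
    "PSET_mexico": {"FEED": "mexico", "DATE": "today()"},
    "PSET_canada": {"FEED": "canada", "DATE": "today()"},
    "PSET_usa": {"FEED": "usa", "DATE": "today()"},
}
PROCESS_PSETS = {"PSET": {"DATE": "today()"}}

gather_instances = instantiate("gather.mp", GATHER_PSETS, {"FEED"})
process_instances = instantiate("process.mp", PROCESS_PSETS, set())
```

With FEED as the only design-time parameter, gather.mp yields three instances, while process.mp, whose pset carries only the run-time DATE parameter, yields one.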

Once the appropriate instances of the subgraphs 202, 204 have been instantiated by the parameter resolution module 106, the parameter resolution module 106 replaces the parameter value placeholders in the parameterized paths of the datasets with the actual parameter values from the psets, resolving the paths to the physical locations of the datasets. For example, for the PSET_mexico->gather.mp instance 202a of the first subgraph 202, the path for the first dataset 206 is resolved to /mexico/inv_031014.dat because the FEED parameter value is 'mexico' and the DATE parameter value is '031014'.
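Placeholder substitution of this kind can be sketched with Python's standard `string.Template`, which happens to use the same `${NAME}` placeholder syntax as the parameterized paths in this description. This is only an analogy for illustration; the `resolve_path` helper is hypothetical and does not represent the parameter resolution module's actual mechanism.

```python
from string import Template


def resolve_path(parameterized_path, values):
    """Replace ${NAME} placeholders with parameter values.

    substitute() raises KeyError if a placeholder has no value, which surfaces
    an unresolved parameter instead of silently leaving it in the path.
    """
    return Template(parameterized_path).substitute(values)


# Parameterized path of the first logical dataset, resolved with the values
# FEED='mexico' and DATE='031014' used in the example above.
resolved = resolve_path("/${FEED}/inv_${DATE}.dat", {"FEED": "mexico", "DATE": "031014"})
```

Under these substitution rules, the parameterized path /${FEED}/inv_${DATE}.dat resolves to /mexico/inv_031014.dat.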

Once the parameter resolution module 106 has instantiated the dataflow graph 217, including its subgraphs 202, 204, and has resolved the physical paths to the datasets of the dataflow graph 217, the dataflow graph 217 is ready for execution by the execution module 112. During execution, the three instances 202a, 202b, 202c of the first subgraph 202 read data from their respective input datasets, process the data, and store the processed data in the /trans_031014.dat physical file. Because the input dataset (e.g., DS4 212) for the instance 204a of the second subgraph 204 resolves to the same physical file as the output dataset of the first subgraph, the /trans_031014.dat physical file is read by the instance of process.mp, and the data is then processed and stored in the table 214.

3 Data Lineage Module

Referring to FIG. 4, in some examples, rather than executing the dataflow graph 217, the data architect or business user 121 of FIG. 1 may need to inspect the lineage of data as it passes through the dataflow graph 217. To this end, the data lineage module 115 of FIG. 1 is configured to analyze the dataflow graph 217 to generate a data lineage report for presentation to the data architect or business user 121.

In some examples, as a first step in determining the data lineage for the dataflow graph 217, the data lineage module 115 identifies the individual subgraphs 202, 204 of the dataflow graph 217. For each identified subgraph 202, 204, the data lineage module 115 identifies the one or more psets 218, 220, 222, 223 associated with the subgraph and then determines the number of unique design-time parameters in those psets. For each unique design-time parameter, the parameter resolution module instantiates a separate instance of the subgraph 202, 204.

In some examples, the data lineage module 115 operates under the assumption that the actual physical files and the data they store are irrelevant to data lineage analysis. For this reason, any run-time parameter values that are used to resolve the physical locations of the datasets are unnecessary and can be replaced with placeholder values. As noted above, for each run-time parameter associated with a subgraph, a corresponding placeholder static analysis parameter value is included in the subgraph. For example, in FIG. 2, because the subgraphs 202, 204 both include the DATE run-time parameter, they also both include the placeholder static analysis parameter value 'MMDDYY'.

When the data lineage module 115 analyzes the dataflow graph 217 to determine the data lineage, all instances of the DATE parameter in the dataflow graph are replaced with the 'MMDDYY' placeholder value, creating temporary dataset objects 452 as shown in FIG. 4. The interconnections between the various subgraph instances and the temporary dataset objects are then identified and presented to the data architect or business user as the data lineage. For example, analysis of the instances 202a, 202b, 202c of the first subgraph 202 indicates that all of the instances of the first subgraph 202 write data to a dataset represented by the /trans_MMDDYY.dat dataset object. The analysis then indicates that the instance 204a of the second subgraph 204 reads from the dataset represented by the /trans_MMDDYY.dat dataset object. Based on this information, the data lineage for the dataflow graph 217 indicates that the outputs of the instances 202a, 202b, 202c of the first subgraph 202 are fed into the input of the instance 204a of the second subgraph 204.
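The way placeholder values let the analysis connect writers to readers can be sketched as follows. This is a simplified illustration under stated assumptions: the instance list, the module-level path constants, and the `static_path` and `lineage_links` helpers are hypothetical, and `string.Template` stands in for whatever substitution mechanism the system actually uses.

```python
from string import Template

# Static analysis value used in place of the run-time DATE parameter.
STATIC_VALUES = {"DATE": "MMDDYY"}

# Hypothetical description of the three gather.mp instances and their design-time values.
INSTANCES = [
    ("PSET_mexico->gather.mp", {"FEED": "mexico"}),
    ("PSET_canada->gather.mp", {"FEED": "canada"}),
    ("PSET_usa->gather.mp", {"FEED": "usa"}),
]
GATHER_OUTPUT = "/trans_${DATE}.dat"   # parameterized path of DS3
PROCESS_INPUT = "/trans_${DATE}.dat"   # parameterized path of DS4


def static_path(path, design_values):
    """Resolve a parameterized path using design-time values plus placeholders."""
    return Template(path).substitute({**design_values, **STATIC_VALUES})


def lineage_links():
    """Link each writer to process.mp when both resolve to the same dataset object."""
    consumer_input = static_path(PROCESS_INPUT, {})
    links = []
    for name, design_values in INSTANCES:
        output = static_path(GATHER_OUTPUT, design_values)
        if output == consumer_input:
            links.append((name, output, "process.mp"))
    return links


links = lineage_links()
```

Because every instance resolves its output to the same /trans_MMDDYY.dat dataset object that process.mp reads, the sketch yields one link from each gather.mp instance to process.mp, regardless of the date on which any job actually ran.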

4 Logical pset Discovery and Creation Method

In some examples, a given dataflow graph is executed using an execution command that receives parameter values as arguments supplied to the command, rather than from a previously stored pset. Because the method described above determines data lineage using only stored psets, psets associated with parameter values that originate from arguments supplied to an execution command are not represented in the data lineage. This can result in an incomplete or incorrect data lineage being provided to an enterprise architect or an auditor.

FIG. 5 is a flowchart illustrating a method for augmenting a repository of existing logical parameter sets (psets) for a dataflow graph with logical psets created based on sets of parameters identified in a log associated with executions of instances of the dataflow graph. In some examples, the method described in FIG. 5 is implemented by the data lineage module 115 of FIG. 1.

4.1 Graph Parameters

Initially, one example of a dataflow graph (e.g., the first subgraph 202 of FIG. 2) includes two parameters (P1 and P2), each of which can be designated as either a "design-time" parameter or a "run-time" parameter. As noted above, a design-time parameter is a parameter that affects the logical operation of the graph (e.g., it can alter a transformation performed by the graph), while a run-time parameter is a parameter that changes on a job-by-job basis (e.g., a date) and does not affect the logical operation of the graph.

4.2 Parameter Classification

The graph 202 is provided to a parameter classification step 424, which analyzes the parameters of the graph 202 to generate parameter classification results 426. In the parameter classification results 426, each parameter is classified as either a design-time parameter or a run-time parameter. In the exemplary case illustrated in the flowchart, P1 is classified as a design-time parameter and P2 is classified as a run-time parameter.

In some examples, the parameters for a dataflow graph are pre-classified (e.g., by a user) as either design-time or run-time parameters. In other examples (e.g., for legacy dataflow graphs), the parameters for the dataflow graph are not pre-classified. In such cases, the parameter classification step 424 may assume that all parameters are design-time parameters. In a later reclassification step, if a given parameter is determined to have a large number of unique values (e.g., a number above a given threshold) in a collection of log entries (e.g., the job log data store described below), then the given parameter may be reclassified as a run-time parameter. Alternatively, reclassification can be based on data lineage sensitivity analysis. In particular, if a parameter can take on a variety of different values without altering the data lineage internal to the dataflow graph (i.e., the impacts or dependencies of datasets or components within the dataflow graph), then the parameter can be classified as a run-time parameter. For example, if the associated record formats or other characteristics of the datasets in a graph (e.g., DS1, DS2, DS3 in FIG. 3) are not affected by the various values of a parameter, then that parameter is reclassified as a run-time parameter. Variations of this data lineage sensitivity analysis can be used, such as a more comprehensive analysis that resolves all internal impacts and dependencies, or a more limited analysis that resolves only the impacts and dependencies associated with dataset record formats.
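The threshold-based reclassification described above can be sketched as counting distinct values per parameter across log entries. This is a minimal sketch under stated assumptions: the `reclassify` function, the dictionary representation of log entries, the sample values, and the threshold of 3 are all hypothetical.

```python
def reclassify(log_entries, threshold):
    """Treat every parameter as design-time unless it has many distinct values.

    log_entries is a list of {parameter name: value} dictionaries (hypothetical layout).
    A parameter with more than `threshold` distinct values becomes run-time.
    """
    distinct = {}
    for entry in log_entries:
        for name, value in entry.items():
            distinct.setdefault(name, set()).add(value)
    return {
        name: ("run-time" if len(values) > threshold else "design-time")
        for name, values in distinct.items()
    }


# Illustrative log: FEED takes three distinct values, DATE takes five.
LOG = [
    {"FEED": "mexico", "DATE": "031014"},
    {"FEED": "usa", "DATE": "031114"},
    {"FEED": "mexico", "DATE": "031214"},
    {"FEED": "usa", "DATE": "031314"},
    {"FEED": "canada", "DATE": "031414"},
]
classes = reclassify(LOG, threshold=3)
```

With this sample data, DATE exceeds the threshold and is reclassified as run-time, while FEED stays design-time. The choice of threshold is a tuning decision that the description leaves open.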

In some examples (e.g., for legacy dataflow graphs), a parameter may include both a design-time portion and a run-time portion. For example, a filename parameter "/mexico/inv_031014.dat" may be a hybrid parameter in that it includes a design-time portion (i.e., "mexico") and a run-time portion (i.e., "031014"). In such examples, a user can supply a regular expression or some other type of string parsing rule, which the parameter classification step 424 uses to extract and classify the respective design-time and run-time parameters from the hybrid parameter.
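A user-supplied regular expression for splitting a hybrid parameter might look like the following. The pattern, the group names FEED and DATE, and the `split_hybrid` helper are assumptions chosen to fit the filename in the example; an actual rule would depend on the naming conventions of the graphs being analyzed.

```python
import re

# Hypothetical parsing rule: the directory name is the design-time portion and
# the six digits in the file name are the run-time portion.
HYBRID_RULE = re.compile(r"^/(?P<FEED>[^/]+)/inv_(?P<DATE>\d{6})\.dat$")


def split_hybrid(value):
    """Split a hybrid parameter value into design-time and run-time parts.

    Returns None when the value does not match the rule.
    """
    match = HYBRID_RULE.match(value)
    if match is None:
        return None
    return {
        "design_time": {"FEED": match.group("FEED")},
        "run_time": {"DATE": match.group("DATE")},
    }


parts = split_hybrid("/mexico/inv_031014.dat")
```

Returning None for non-matching values, rather than raising, lets a caller fall back to treating the whole value as a single parameter.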

4.3 Job Log Data Store

The method utilizes a job log data store 428 that includes a number of job log entries 429, each including information associated with an execution of an instance of the dataflow graph 202. Among other information, at least some of the job log entries include a record of the execution command that was used to instantiate the dataflow graph 202. The execution command for a given job log entry includes a graph name and the parameter values that were supplied as arguments to the execution command. In general, at least some of the job log entries in the job log data store 428 instantiate the dataflow graph without accessing any parameter sets, instead receiving parameter values as arguments supplied to the execution command.

4.4 Processing Loop

The job log data store 428 and the parameter classification results 426 are provided to a processing loop 430 which, for each job log entry 429 in the job log data store 428, generates a new logical pset for the graph execution command, determines whether the new logical pset already exists in a repository of existing logical psets 448, and adds the new logical pset to the repository 448 if it does not already exist.

4.4.1 Initial Command-Line Logical pset Construction

Within the processing loop 430, the parameter classification results 426 and a job log entry J_n 432 from the job log data store 428 are provided to a logical pset construction step 434, which analyzes the job log entry 432 according to the parameter classification results 426 to generate a logical pset 436. In doing so, the logical pset construction step 434 analyzes the graph execution command included in the job log entry 432 to extract the parameter values that were included as arguments to the graph execution command. The logical pset construction step 434 also extracts a project scope included in the job log entry 432. In some examples, the project scope includes an indication of the project in which the dataflow graph is executing, an indication of internal parameters for the dataflow graph, and an indication of environmental settings, global variables, and configuration variables used by the dataflow graph.

The logical pset construction step 434 automatically includes the extracted project scope in the logical pset 436. The logical pset construction step 434 then matches each extracted parameter value with the corresponding parameter in the parameter classification results 426. If the extracted parameter value corresponds to a design-time parameter in the parameter classification results 426, the logical pset construction step 434 includes the value of the extracted design-time parameter in the logical pset 436. If the extracted parameter value corresponds to a run-time parameter in the parameter classification results 426, the extracted parameter value is not included in the logical pset 436.
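The construction step can be sketched as parsing the logged command, keeping design-time values, and dropping run-time values. The description does not specify the format of an execution command or a job log entry, so the `run <graph> -NAME value` syntax, the dictionary layout of the entry, and the `build_logical_pset` function below are all assumptions made for illustration.

```python
import shlex

# Classification results for the example graph: FEED is design-time, DATE is run-time.
CLASSIFICATION = {"FEED": "design-time", "DATE": "run-time"}


def build_logical_pset(entry, classification):
    """Build a logical pset from one job log entry (hypothetical entry layout).

    The command is assumed to look like: run <graph> -NAME value -NAME value ...
    """
    tokens = shlex.split(entry["command"])
    graph, args = tokens[1], tokens[2:]
    values = {args[i].lstrip("-"): args[i + 1] for i in range(0, len(args) - 1, 2)}
    # Parameters with no classification default to design-time, mirroring the
    # fallback assumption for graphs whose parameters are not pre-classified.
    kept = {
        name: value
        for name, value in values.items()
        if classification.get(name, "design-time") == "design-time"
    }
    return {"graph": graph, "project_scope": entry["project_scope"], "params": kept}


entry = {
    "command": "run gather.mp -FEED 'hong kong' -DATE 031014",
    "project_scope": "demo_project",
}
pset = build_logical_pset(entry, CLASSIFICATION)
```

The resulting logical pset carries the project scope and the FEED value but not the DATE value, so two jobs that differ only in their run date produce the same logical pset.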

4.4.2 pset Signature String Computation

The logical pset 436 is provided to a pset signature string computation step 442, which computes a logical pset signature string 444 based on the project scope and the parameter values in the logical pset 436. In some examples, the pset signature string 444 is computed by serializing the project scope of the logical pset 436, the name/value pairs of the parameters in the logical pset 436, and a prototype of the dataflow graph associated with the logical pset 436. In other examples, the pset signature string 444 is computed by applying a hash function or some other data mapping algorithm to the logical pset 436.
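Both variants, serialization and hashing, can be sketched briefly. The description does not name a serialization format or hash function, so canonical JSON and SHA-256 are assumptions here, as are the function names and the dictionary layout of the logical pset. The graph name stands in for the graph prototype.

```python
import hashlib
import json


def signature_string(pset):
    """Serialize a logical pset deterministically (canonical JSON, sorted keys)."""
    return json.dumps(pset, sort_keys=True, separators=(",", ":"))


def signature_hash(pset):
    """Alternative signature: a SHA-256 digest of the serialized pset."""
    return hashlib.sha256(signature_string(pset).encode("utf-8")).hexdigest()


# Two psets with the same content but different key order.
a = {"graph": "gather.mp", "project_scope": "demo_project", "params": {"FEED": "mexico"}}
b = {"params": {"FEED": "mexico"}, "project_scope": "demo_project", "graph": "gather.mp"}
# A pset that differs in its design-time value.
c = {"graph": "gather.mp", "project_scope": "demo_project", "params": {"FEED": "usa"}}
```

Sorting keys before serializing matters: without it, two logically identical psets could produce different signature strings and be stored twice.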

4.4.3 pset Signature String Search

The pset signature string 444 is provided, along with the pset signature strings of all existing logical psets in the repository of existing logical psets 448, to a pset signature search step 446. For each of the existing logical psets, the pset signature string of the existing logical pset is compared to the pset signature string 444. If the pset signature string 444 matches at least one of the pset signature strings of the existing logical psets, then nothing needs to be done, because a logical pset for the graph instantiation recorded in job log entry 432 already exists in the repository of existing logical psets 448.

In some examples, the pset signature strings of all existing logical psets in the repository 448 are stored alongside the existing logical psets in the repository 448. In other examples, the signature strings for the existing logical psets are computed on the fly, as needed.

4.4.4 Adding a New Logical pset

Otherwise, if none of the signature strings of the existing logical psets matches the pset signature string 444, the logical pset 436 and its signature string 444 are added as a new logical pset to the repository of existing logical psets 448 by a new logical pset addition step 450.
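The search and addition steps together amount to a signature-keyed lookup followed by a conditional insert. The following sketch assumes the repository is a dictionary keyed by signature string, which corresponds to the variant in which signatures are stored alongside the psets; the `signature` and `add_if_new` functions and the sample data are hypothetical.

```python
import json


def signature(pset):
    """Deterministic signature string for a logical pset (assumed JSON format)."""
    return json.dumps(pset, sort_keys=True)


def add_if_new(repository, pset):
    """Add pset to the repository unless its signature is already present.

    repository maps signature string -> logical pset. Returns True if added.
    """
    sig = signature(pset)
    if sig in repository:
        return False
    repository[sig] = pset
    return True


def make_pset(feed):
    """Illustrative logical pset for gather.mp with a single design-time value."""
    return {"graph": "gather.mp", "project_scope": "demo_project", "params": {"FEED": feed}}


# Repository seeded with three existing logical psets.
repository = {signature(make_pset(f)): make_pset(f) for f in ("mexico", "usa", "canada")}

# Logical psets discovered from four job log entries.
added = [f for f in ("mexico", "usa", "canada", "hong kong") if add_if_new(repository, make_pset(f))]
```

In this illustrative run only the "hong kong" pset is new, so the repository grows from three entries to four. Keying the repository by signature makes each lookup a constant-time dictionary access rather than a scan over all existing psets.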

4.5 Example

Referring to FIGS. 6 and 7, an exemplary operation of the logical pset discovery and creation method of FIG. 5 is applied to the first subgraph 202 of FIG. 2. The first subgraph 202 of FIG. 2 includes two parameters, P1 = FEED and P2 = DATE. The first subgraph 202 is provided to the parameter classification step 424, where the parameters are classified as either "design-time" or "run-time" parameters, generating the parameter classification results 426. The parameter classification results 426 indicate that the P1 (FEED) parameter is a design-time parameter and the P2 (DATE) parameter is a run-time parameter.

The parameter classification result 426 and the job log data store 428 are provided to the logical pset construction step 434. In the example of FIG. 6, the job log data store 428 includes four job log entries, each including information associated with the execution of an instance of the first subgraph 202 (i.e., gather.mp). Each job log entry includes an execute command that received values for the DATE and FEED parameters as arguments.

The logical pset construction step 434 creates a different logical pset 436 for each job log entry in the job log data store 428. Because the P1 (FEED) parameter is a design-time parameter, the value of the P1 parameter that was provided as an argument to the execute command (e.g., mexico, usa, canada, or hong kong) is included in each of the logical psets 436. Because the P2 (DATE) parameter is a run-time parameter, the value of the P2 parameter that was provided as an argument to the execute command is not included in the logical psets 436. Each logical pset 436 includes the project scope of its corresponding instance of the first subgraph 202.
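The construction step described above, which keeps design-time values and drops run-time values, can be sketched as follows. The data layout, the function name, the DATE values, and the project scope string "proj" are assumptions made for illustration; only the FEED values and the classification of FEED and DATE come from the example.

```python
def build_logical_psets(job_log_entries, classification, project_scope):
    """Create one logical pset per job log entry.

    Values of design-time parameters are kept; values of run-time
    parameters are excluded. The dictionary layout is hypothetical.
    """
    psets = []
    for entry in job_log_entries:
        params = {
            name: value
            for name, value in entry.items()
            if classification.get(name) == "design-time"
        }
        psets.append({"scope": project_scope, "params": params})
    return psets


# Classification result from the example: FEED is design-time, DATE is run-time.
classification = {"FEED": "design-time", "DATE": "run-time"}

# Four job log entries; the DATE values are invented for the sketch.
entries = [
    {"FEED": "mexico", "DATE": "2014-07-01"},
    {"FEED": "usa", "DATE": "2014-07-02"},
    {"FEED": "canada", "DATE": "2014-07-03"},
    {"FEED": "hong kong", "DATE": "2014-07-04"},
]

logical_psets = build_logical_psets(entries, classification, "proj")
```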

Referring to FIG. 7, the logical psets 436 are provided to the pset signature string calculation step 442, which calculates a different logical pset signature string 444 for each logical pset 436.

The logical pset signature strings 444 and a set of logical pset signature strings for the existing psets 447 in the repository 448 of existing psets are provided to the search step 446. As in the situation shown in FIG. 2, there are three existing psets associated with the first subgraph 202: one for the mexico value of the FEED parameter, one for the usa value, and one for the canada value. The set of logical pset signature strings for the existing psets 447 therefore includes a string for each existing pset associated with the first subgraph 202.

The search step 446 searches the set of logical pset signature strings for the existing psets 447 for each of the logical pset signature strings 444. In this example, the result generated by the search step 446 is that the only logical pset signature string not included in the set of logical pset signature strings for the existing psets 447 is the one associated with the logical pset having the FEED parameter value 'hong kong'.

The result of the search step 446 and the logical pset 436 that includes the FEED parameter value 'hong kong' are provided to the logical pset addition step 450, which adds that logical pset and its corresponding logical pset signature string 444 to the repository 448 of existing logical psets.

By adding the new logical pset to the repository, the 'hong kong' instance of the first subgraph 202, which would have been omitted from earlier data lineage results, is represented in the data lineage results.

Note that while the static analysis values of the run-time parameters are described in the above examples as being stored in the dataflow graph itself, in some examples the static analysis values of the run-time parameters may be maintained in one or more psets associated with the dataflow graph.

In some examples, certain design-time parameter values are derived from sources that are not necessarily available at static analysis time (e.g., from a database). However, in some examples, the job log entries stored in the job log data store include the values of all parameters that were resolved for the particular job. At static analysis time, the stored parameter values can be used in place of the parameter values derived from sources that are not available at static analysis time.

In some examples, the job log entries in the job log data store include all resolved parameters of the dataflow graph, a log of all files read and written by the dataflow graph, and performance tracking information. In some examples, the job log entries in the job log data store are augmented with any logical parameter sets discovered by the method of FIG. 4. In some examples, augmenting the job log entries in the job log data store with the discovered logical parameter sets includes forming an association between the job log entries and the discovered logical parameter sets. The augmented job log entries in the job log data store can be used to provide various forms of information to data architects or business users. In some examples, the augmented job log entries can be analyzed to ensure that dataflow graphs that are logically connected are also physically connected. In some examples, the augmented job log entries can be analyzed to determine which logical dataset instances a physical dataset corresponds to. In some examples, the augmented job log entries can be analyzed to identify datasets that have the same physical file name but are associated with different static analysis parameters. In such examples, the inconsistency can be presented to a user for manual repair or can be repaired automatically. In some examples, the data lineage report can include an indication of the inconsistency and of whether it has been repaired automatically.

In some examples, the augmented job log entries can be used by the data lineage module to filter the data lineage report by frequency and/or recency. For example, the metadata environment module may maintain a number of dataflow graphs and psets that are no longer executed by the execution module. Such dataflow graphs and psets may be left in place in case they are needed later. However, unexecuted dataflow graphs and psets can cause unnecessary clutter in the data lineage report. To reduce the clutter, the augmented job log entries can be analyzed to determine which dataflow graphs and/or psets are used infrequently and/or have not been used recently. Based on this frequency and recency information, infrequently executed and not recently executed dataflow graphs and psets (e.g., a dataflow graph that has not run in the past year) can be filtered out of the data lineage report before it is presented to a business user.
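A frequency and recency filter of this kind can be sketched as follows. The representation of job log entries as (graph name, run time) pairs, the function name, and the default thresholds are assumptions chosen for the example; the description only gives "has not run in the past year" as one possible criterion.

```python
from datetime import datetime, timedelta


def graphs_to_report(job_log_entries, now, max_age=timedelta(days=365), min_runs=1):
    """Return the names of graphs run at least `min_runs` times within `max_age`.

    `job_log_entries` is an iterable of (graph_name, run_time) pairs,
    which is an assumed layout for the augmented job log.
    """
    recent_runs = {}
    for graph_name, run_time in job_log_entries:
        if now - run_time <= max_age:
            recent_runs[graph_name] = recent_runs.get(graph_name, 0) + 1
    return {name for name, count in recent_runs.items() if count >= min_runs}


now = datetime(2015, 7, 20)
log = [
    ("gather.mp", datetime(2015, 7, 1)),
    ("gather.mp", datetime(2015, 6, 1)),
    ("process.mp", datetime(2013, 1, 15)),  # not run in the past year
]
kept = graphs_to_report(log, now)
```

Graphs absent from the returned set would be left in place in the metadata environment and only omitted from the report.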

In some examples, a logical pset may exist for a given dataflow graph (e.g., a pset including FEED=USA), but one or more jobs that invoke the dataflow graph do so by providing parameter values directly to the dataflow graph rather than by using the existing pset. In such cases, the association maintained between jobs and the logical psets accessed by the jobs (e.g., via a signature associated with the job) can be used to group job log entries according to their associated logical psets. Based on the grouping, any job that is instantiated by invoking the graph directly rather than by using an existing pset can be identified as being related to the logical pset and its parameters.

In some examples, each job log entry for a dataflow graph includes, among other information, a list of all resolved parameter values used in the execution of the dataflow graph associated with the job log entry. Once a number of job log entries have accumulated, the resolved parameter values included in the job log entries can be compared to identify the various "design-time instances" of the dataflow graph. For example, some resolved parameters may be represented by only a few values across all of the job log entries, while other resolved parameters may be represented by many different values across all of the job log entries. The resolved parameters that are represented by only a few values in the job log entries are likely to be "design-time" parameters, and the resolved parameters that are represented by many different values in the job log entries are likely to be "run-time" parameters. Any instances of the dataflow graph that share a unique combination of design-time parameter values are grouped together and are all considered to be the same "design-time instance" of the dataflow graph. The data lineage module includes the different design-time instances of the dataflow graph in the data lineage report.
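The heuristic described above, in which parameters with few distinct values are treated as likely design-time parameters, can be sketched as follows. The threshold `max_design_values`, the function names, and the sample DATE values are assumptions; the description does not specify how "few" and "many" are distinguished.

```python
def classify_by_cardinality(job_log_entries, max_design_values=3):
    """Guess parameter classes from the number of distinct resolved values."""
    distinct = {}
    for entry in job_log_entries:
        for name, value in entry.items():
            distinct.setdefault(name, set()).add(value)
    return {
        name: "design-time" if len(values) <= max_design_values else "run-time"
        for name, values in distinct.items()
    }


def design_time_instances(job_log_entries, classification):
    """Group entries by their combination of design-time parameter values."""
    groups = {}
    for entry in job_log_entries:
        key = tuple(sorted(
            (name, value) for name, value in entry.items()
            if classification.get(name) == "design-time"
        ))
        groups.setdefault(key, []).append(entry)
    return groups


runs = [
    {"FEED": "usa", "DATE": "2014-07-01"},
    {"FEED": "usa", "DATE": "2014-07-02"},
    {"FEED": "mexico", "DATE": "2014-07-03"},
    {"FEED": "mexico", "DATE": "2014-07-04"},
]
guessed = classify_by_cardinality(runs)
instances = design_time_instances(runs, guessed)
```

With these sample entries, FEED takes two distinct values and DATE takes four, so the two FEED values yield two design-time instances.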

5 Duplicate Logical Dataset Discovery and Mitigation Method

5.1 Overview

Typically, the input and output datasets of a given dataflow graph (e.g., databases or data tables) are specified as logical datasets in the dataflow graph. In some examples, each logical dataset is associated with an identifier such as a logical file name.

Before a dataflow graph is executed, it is prepared for execution, which includes resolving each logical dataset to a corresponding physical dataset (e.g., a file on disk). In some examples, each physical dataset is associated with an identifier such as a physical file name (e.g., "summary.dat"). The parameter resolution process can successfully resolve a logical dataset to its corresponding physical dataset even if the logical file name of the logical dataset differs from the physical file name of the corresponding physical dataset.

When a data lineage report is determined for a dataflow graph that includes two or more subgraphs, the lineage relationships between the subgraphs are determined at least in part from the logical file names of the input and output logical datasets of the two or more subgraphs. For this reason, the correctness of the lineage relationships requires that any input and output logical datasets of the two or more subgraphs that refer to a given physical dataset share the same logical file name. Indeed, if a first subgraph writes to a given physical dataset and a second subgraph subsequently reads from that physical dataset, but the logical file name of the output logical dataset of the first subgraph does not match the logical file name of the input logical dataset of the second subgraph, no lineage relationship will be identified between the two subgraphs. In some examples, two logical datasets that resolve to the same physical dataset but have non-matching logical file names are referred to as "duplicate logical datasets."

As described in detail below, duplicate logical datasets in a dataflow graph can be identified and presented to a user. The user can then choose to address the duplicate logical datasets in any of a number of ways.

5.2 Example without duplicate logical datasets

Referring to FIG. 8, an example of a dataflow graph 817 generated using the development environment 118 of FIG. 1 includes a first subgraph 802 named gather.mp and a second subgraph 804 named process.mp.

The first subgraph 802 receives as input a first logical dataset D_L1 806 having the logical file name "Acct_1.dat" and a second logical dataset D_L2 808 having the logical file name "Acct_2.dat". The first subgraph 802 processes the data from the first logical dataset 806 and the second logical dataset 808 and writes the result of the processing to a third logical dataset D_L3 810 having the logical file name "Acct_summ.dat". The second subgraph 804 receives as input the third logical dataset D_L3 810 having the logical file name "Acct_summ.dat", processes the data from the third logical dataset 810, and writes the result of the processing to a table 814. Note that the third logical dataset 810, which is used by both the first subgraph 802 and the second subgraph 804, has the same logical file name in both subgraphs 802, 804.

Referring to FIG. 9, when the dataflow graph 817 is resolved before execution, the logical datasets are resolved to their corresponding physical datasets. For example, the first logical dataset 806 is resolved to a first physical dataset D_P1 814 having the physical file name "Acct_1.dat", the second logical dataset 808 is resolved to a second physical dataset D_P2 816 having the physical file name "Acct_2.dat", and the third logical dataset 810 is resolved to a third physical dataset D_P3 818 having the physical file name "summary.dat".

Referring to FIG. 10, a data lineage report 1017 for the dataflow graph includes the first subgraph 1002, the second subgraph 1004, the first logical dataset 1006, the second logical dataset 1008, and the third logical dataset 1010. The data lineage report 1017 also includes a first lineage relationship 1018 between the first logical dataset 1006 and an input of the first subgraph 1002, a second lineage relationship 1020 between the second logical dataset 1008 and an input of the first subgraph 1002, a third lineage relationship 1022 between an output of the first subgraph 1002 and the third logical dataset 1010, and a fourth lineage relationship 1024 between the third logical dataset 1010 and the second subgraph 1004. Note that in this case the data lineage report 1017 is correct, because the same logical dataset (i.e., the third logical dataset D_L3 810) with the same logical file name (i.e., "Acct_summ.dat") is present at the output of the first subgraph 802 and at the input of the second subgraph 804.

5.3 Example with duplicate logical datasets

Referring to FIG. 11, another example of a dataflow graph 1117 generated using the development environment 118 of FIG. 1 includes a first subgraph 1102 named gather.mp and a second subgraph 1104 named process.mp.

The first subgraph 1102 receives as input a first logical dataset D_L1 1106 having the logical file name "Acct_1.dat" and a second logical dataset D_L2 1108 having the logical file name "Acct_2.dat". The first subgraph 1102 processes the data from the first logical dataset 1106 and the second logical dataset 1108 and writes the result of the processing to a third logical dataset D_L3 1110 having the logical file name "Acct_summ.dat". The second subgraph 1104 receives as input a fourth logical dataset D_L4 1111 having the logical file name "Acct-summ.dat", processes the data from the fourth logical dataset 1111, and writes the result of the processing to the table 814. Note that the logical file name of the third logical dataset 1110 (i.e., "Acct_summ.dat") differs from the logical file name of the fourth logical dataset 1111 (i.e., "Acct-summ.dat").

Referring to FIG. 12, when the dataflow graph 1117 is resolved before execution, the logical datasets are resolved to their corresponding physical datasets. For example, the first logical dataset 1106 is resolved to a first physical dataset D_P1 1114 having the physical file name "Acct_1.dat", the second logical dataset 1108 is resolved to a second physical dataset D_P2 1116 having the physical file name "Acct_2.dat", and the third logical dataset 1110 and the fourth logical dataset 1111 are both resolved to a third physical dataset D_P3 1218 having the physical file name "summary.dat". Note that the third logical dataset 1110 and the fourth logical dataset 1111 are duplicate logical datasets because they each point to the same physical dataset (i.e., the third physical dataset 1218).

Referring to FIG. 13, a data lineage report 1317 for the dataflow graph includes the first subgraph 1102, the second subgraph 1104, the first logical dataset 1106, the second logical dataset 1108, the third logical dataset 1110, and the fourth logical dataset 1111. The data lineage report 1317 also includes a first lineage relationship 1318 between the first logical dataset 1106 and an input of the first subgraph 1102, a second lineage relationship 1320 between the second logical dataset 1108 and an input of the first subgraph 1102, a third lineage relationship 1322 between an output of the first subgraph 1102 and the third logical dataset 1110, and a fourth lineage relationship 1324 between the fourth logical dataset 1111 and the second subgraph 1104.

Note that in this case the data lineage report 1317 is incorrect, because two different logical datasets with different logical file names (i.e., the third logical dataset 1110 and the fourth logical dataset 1111) refer to the same physical dataset (i.e., the third physical dataset 1218). In particular, the third logical dataset D_L3 1110, with the logical file name "Acct_summ.dat", is present at the output of the first subgraph 1102, while the fourth logical dataset 1111, with the logical file name "Acct-summ.dat", is present at the input of the second subgraph 1104. The data lineage report 1317 represents the third logical dataset 1110 and the fourth logical dataset 1111 as separate datasets that have no lineage relationship to each other. The data lineage report 1317 therefore incorrectly includes a break in the data lineage between the third logical dataset 1110 and the fourth logical dataset 1111.
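The contrast between the two examples can be made concrete with a small sketch that derives lineage relationships purely from matching logical file names. The data layout and the function name are assumptions for illustration; the subgraph and file names are those of FIGS. 8 and 11.

```python
def lineage_edges(subgraphs):
    """Link a producer to a consumer when an output logical file name of one
    equals an input logical file name of the other.

    `subgraphs` maps subgraph name -> {"inputs": [...], "outputs": [...]},
    a hypothetical layout.
    """
    edges = []
    for producer, p_io in subgraphs.items():
        for consumer, c_io in subgraphs.items():
            if producer == consumer:
                continue
            for logical_name in p_io["outputs"]:
                if logical_name in c_io["inputs"]:
                    edges.append((producer, logical_name, consumer))
    return edges


# FIG. 8: both subgraphs use the logical file name "Acct_summ.dat".
without_duplicates = {
    "gather.mp": {"inputs": ["Acct_1.dat", "Acct_2.dat"], "outputs": ["Acct_summ.dat"]},
    "process.mp": {"inputs": ["Acct_summ.dat"], "outputs": []},
}

# FIG. 11: process.mp reads "Acct-summ.dat", so the names do not match.
with_duplicates = {
    "gather.mp": {"inputs": ["Acct_1.dat", "Acct_2.dat"], "outputs": ["Acct_summ.dat"]},
    "process.mp": {"inputs": ["Acct-summ.dat"], "outputs": []},
}
```

For the second layout no relationship is found, even though both logical datasets resolve to the same physical file, which is the break in lineage described above.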

5.4 Duplicate logical dataset discovery

In some examples, duplicate logical datasets in a dataflow graph can be discovered by analyzing run-time artifacts generated by executions of the dataflow graph (e.g., the job log 429 of FIG. 5). In particular, a job log is generated each time the dataflow graph is executed.

The job log includes information associated with the execution of the dataflow graph, including the graph instance name and, for each dataset component in the graph, the physical dataset that it accessed and the type of access (read or write). The graph instance can be examined to determine the logical dataset name for each dataset component. By matching on the graph instance and the dataset component name, the system is able to map logical dataset names to physical dataset names.

To identify duplicate logical datasets, the job logs are analyzed to identify any logical-to-physical dataset mapping in which a first logical dataset of the mapping differs from a second logical dataset of the mapping. The logical datasets of any logical-to-physical dataset mapping in which the first logical dataset and the second logical dataset differ are classified as duplicate logical datasets.

The identified duplicate logical datasets are presented to a user, who decides whether to correct the duplicate logical datasets or to have them mitigated automatically.

5.4.1 Example of duplicate logical dataset discovery

Referring again to FIG. 12, when the resolved dataflow graph 1117 is executed, a job log is generated for the execution of the dataflow graph. The job log includes a single logical-to-physical dataset mapping corresponding to the flow between the first subgraph 1102 and the second subgraph 1104. The logical-to-physical dataset mapping includes the identifier of the third logical dataset D_L3 1110 at the output of the first subgraph 1102, the identifier of the fourth logical dataset D_L4 1111 at the input of the second subgraph 1104, and the identifier of the third physical dataset 1218.

Because the third logical dataset 1110 and the fourth logical dataset 1111 are different logical datasets (e.g., logical datasets with different logical file names) that point to the same physical dataset (i.e., the third physical dataset 1218), the third logical dataset 1110 and the fourth logical dataset 1111 are classified as duplicate logical datasets.
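The discovery step can be sketched as a grouping of logical-to-physical mappings by physical dataset. The representation of the mappings as (logical name, physical name) pairs and the function name are assumptions; in the described system the pairs would be derived from the job log and the graph instance.

```python
from collections import defaultdict


def find_duplicate_logical_datasets(mappings):
    """Return physical datasets referenced under more than one logical name.

    `mappings` is an iterable of (logical_name, physical_name) pairs
    (a hypothetical layout for the job log information).
    """
    by_physical = defaultdict(set)
    for logical_name, physical_name in mappings:
        by_physical[physical_name].add(logical_name)
    return {
        physical: sorted(logicals)
        for physical, logicals in by_physical.items()
        if len(logicals) > 1
    }


# Mappings corresponding to FIG. 12.
fig12_mappings = [
    ("Acct_1.dat", "Acct_1.dat"),
    ("Acct_2.dat", "Acct_2.dat"),
    ("Acct_summ.dat", "summary.dat"),  # written by gather.mp
    ("Acct-summ.dat", "summary.dat"),  # read by process.mp
]
duplicates = find_duplicate_logical_datasets(fig12_mappings)
```

Because the grouping is keyed only on the physical dataset, the same function would also cover mappings accumulated from several job logs.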

Note that although the simple example above involves identifying a single pair of duplicate logical datasets from a single job log, in an actual implementation of a data processing system that includes the above duplicate logical dataset discovery method, a number of pairs of duplicate logical datasets may be identified using multiple job logs.

5.5 Duplicate logical dataset mitigation

As noted above, duplicate logical datasets can cause breaks in data lineage reports. Once duplicate logical datasets have been identified, a number of different approaches can be taken to eliminate them or to mitigate their effect on the data lineage report. In some examples, the identified duplicate logical datasets are presented to a user, for example in spreadsheet form. The user can then edit the dataflow graphs that include the duplicate logical datasets to eliminate them (e.g., by ensuring that, in a given dataflow graph, a given physical dataset is referenced by only a single logical dataset). In other examples, the user can mark a pair of duplicate logical datasets as equivalent. In this way, the user does not need to make any changes to the dataflow graphs. In still other examples, pairs of duplicate logical datasets can be marked as equivalent automatically.

When a pair of duplicate logical datasets is marked as equivalent, there are a number of ways to show the equivalence in the data lineage report. In one approach, the physical dataset referenced by the pair of duplicate logical datasets is shown in the data lineage report as connected to the duplicate logical datasets. For example, as shown in FIG. 14, the third physical dataset D_P3 1218 is included in the data lineage report 1317. The third logical dataset D_L3 1110 and the fourth logical dataset D_L4 1111 are both shown as connected to the third physical dataset 1218 by lineage relationships 1450 and 1452.

In another approach, the logical datasets of the pair of duplicate logical datasets are shown in the data lineage report as connected to each other by a lineage relationship. For example, in FIG. 15, the third logical dataset D_L3 1110 is shown as connected to the fourth logical dataset D_L4 1111 by a lineage relationship 1550 in the data lineage report 1317.

In another approach, the pair of duplicate logical datasets is represented by a combined logical dataset in the data lineage report. For example, as shown in FIG. 16, the pair of duplicate logical datasets is represented by a combined logical dataset D_LR 1654 in the data lineage report 1317.

In another approach, one logical dataset of the pair of duplicate logical datasets is selected to represent the pair in the data lineage report. For example, as shown in FIG. 17, the fourth logical dataset D_L4 1111 represents the pair of duplicate logical datasets in the data lineage report 1317.

In another approach, both the pair of duplicate logical datasets and a combined logical dataset representation of the pair are included in the data lineage report. A unique configuration of lineage relationships between the pair of duplicate logical datasets and the combined logical dataset representation is shown in the data lineage report. For example, as shown in FIG. 18, the data lineage report 1317 includes a combined logical dataset representation D_LR 1854 of the pair of duplicate logical datasets, the third logical dataset D_L3 1110, and the fourth logical dataset D_L4 1111. The combined logical dataset 1854 is shown as having direct lineage relationships with the first subgraph 1102 and the second subgraph 1104. The combined logical dataset 1854 is also shown as having an indirect lineage relationship with the first subgraph 1102 via the third logical dataset 1110 and an indirect lineage relationship with the second subgraph 1104 via the fourth logical dataset 1111.

In another approach, the logical datasets of the pair of duplicate logical datasets are included in the data lineage report. A unique configuration of lineage relationships between the logical datasets of the pair is shown in the data lineage report. For example, as shown in FIG. 19, the data lineage report 1317 includes the third logical dataset D_L3 1110 and the fourth logical dataset D_L4 1111. The fourth logical dataset 1111 is shown as having direct lineage relationships with the first subgraph 1102 and the second subgraph 1104. The third logical dataset D_L3 1110 is shown as having a direct lineage relationship with the first subgraph 1102 and an indirect lineage relationship with the second subgraph 1104 via the fourth logical dataset 1111.

Note that in some examples, the results of the mitigation approaches described above are shown in the data lineage report in dashed lines, bold lines, or in another distinguishing manner, so that it is clear to a user of the data lineage report that a mitigation approach has been applied to the data lineage report.
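One of the approaches above, in which a single logical dataset is chosen to represent a pair marked as equivalent (as in FIG. 17), can be sketched as a rewrite of the lineage edges. The edge representation, the function name, and the (representative, replaced) pair format are assumptions for illustration, and the equivalence pairs are assumed not to form a cycle.

```python
def apply_equivalences(edges, equivalent_pairs):
    """Rewrite lineage edges so each duplicate is replaced by its representative.

    `edges` is a list of (source, destination) node names;
    `equivalent_pairs` is a list of (representative, replaced) names.
    Both layouts are hypothetical.
    """
    alias = {replaced: representative for representative, replaced in equivalent_pairs}

    def canonical(node):
        # Follow aliases until a node with no alias is reached.
        while node in alias:
            node = alias[node]
        return node

    rewritten = []
    for source, destination in edges:
        edge = (canonical(source), canonical(destination))
        if edge not in rewritten:  # drop edges that become identical
            rewritten.append(edge)
    return rewritten


# Edges of FIG. 13 around the break, using dataset labels as node names.
broken = [("gather.mp", "D_L3"), ("D_L4", "process.mp")]
# D_L4 is chosen to represent the pair (D_L3, D_L4).
repaired = apply_equivalences(broken, [("D_L4", "D_L3")])
```

After the rewrite, gather.mp and process.mp are connected through the single node D_L4, so the break in the lineage no longer appears.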

Note that while the duplicate logical dataset discovery and mitigation approaches above are described using a scenario in which one component writes to a physical dataset and another component reads from that physical dataset, other scenarios can also result in duplicate logical datasets. For example, a pair of duplicate logical datasets can result from two different logical datasets that read from the same physical dataset. Similarly, a pair of duplicate logical datasets can result from two different logical datasets that write to the same physical dataset.

The approaches described above can incorporate features from a variety of other approaches for managing and presenting data lineage information and for managing dataset objects, as described in more detail in U.S. application serial number 12/393,765, filed on February 26, 2009, U.S. application serial number 13/281,039, filed on October 25, 2011, and U.S. provisional application serial number 62/028,485, filed on July 24, 2014, all of which are incorporated herein by reference.

The approaches described above can be implemented, for example, using a programmable computing system executing suitable software instructions, or in suitable hardware such as a field-programmable gate array (FPGA), or in some hybrid form. For example, in a programmed approach, the software may include procedures in one or more computer programs that execute on one or more programmed or programmable computing systems (which may be of various architectures such as distributed, client/server, or grid), each including at least one processor, at least one data storage system (including volatile and/or non-volatile memory and/or storage elements), and at least one user interface (for receiving input using at least one input device or port, and for providing output using at least one output device or port). The software may include one or more modules of a larger program that, for example, provides services related to the design, configuration, and execution of data flow graphs. The modules of the program (e.g., elements of a data flow graph) can be implemented as data structures or other organized data conforming to a data model stored in a data repository.

The software may be provided on a tangible, non-transitory medium, such as a CD-ROM or other computer-readable medium (e.g., readable by a general-purpose or special-purpose computing system or device), or may be delivered (e.g., encoded in a propagated signal) over a communication medium of a network to a tangible, non-transitory medium of a computing system where it is executed. Some or all of the processing may be performed on a special-purpose computer, or using special-purpose hardware such as coprocessors, field-programmable gate arrays (FPGAs), or dedicated application-specific integrated circuits (ASICs). The processing may be implemented in a distributed manner in which different parts of the computation specified by the software are performed by different computing elements. Each such computer program is preferably stored on or downloaded to a computer-readable storage medium (e.g., solid-state memory or media, or magnetic or optical media) of a storage device accessible by a general-purpose or special-purpose programmable computer, for configuring and operating the computer when the storage device medium is read by the computer to perform the processing described herein. The inventive system may also be considered to be implemented as a tangible, non-transitory medium configured with a computer program, where the medium so configured causes a computer to operate in a specific and predefined manner to perform one or more of the processing steps described herein.

A number of embodiments of the invention have been described. Nevertheless, it is to be understood that the foregoing description is intended to illustrate and not to limit the scope of the invention, which is defined by the scope of the appended claims. Accordingly, other embodiments are also within the scope of the appended claims. For example, various modifications may be made without departing from the scope of the invention. Additionally, some of the steps described above may be order-independent, and thus can be performed in an order different from that described.

Claims

1. A method for managing a set of parameter values, the method comprising:

Receiving a plurality of parameter value sets for a general-purpose computer program, and

Processing log entries associated with the execution of instances of the general-purpose computer program, each instance of the general-purpose computer program being associated with one or more parameter values, the processing includes:

Analyzing the general-purpose computer program to classify each of one or more parameters associated with the general-purpose computer program as a member of a first type of parameter or a second type of parameter;

Processing log entries associated with the execution of the first instance of the general-purpose computer program to form a specific set of parameter values, wherein the processing includes:

Including in the specific set any values of parameters that appear in the log entries and are classified as members of the first type of parameter, and

Excluding from the specific set any values of parameters that appear in the log entries and are classified as members of the second type of parameter; and

Determining whether to add the specific set of parameter values to the plurality of parameter value sets based on a comparison of a first identifier of the specific set of parameter values with identifiers of at least some of the plurality of parameter value sets,

Wherein the comparison of the first identifier of the specific set of parameter values with the identifiers of at least some of the plurality of parameter value sets includes:

Determining the first identifier based on the specific set of parameter values and an identifier of the general-purpose computer program;

Determining a plurality of second identifiers, each second identifier corresponding to one parameter value set of the at least some of the plurality of parameter value sets; and

Comparing the first identifier with each of the plurality of second identifiers to determine whether the first identifier matches any of the plurality of second identifiers, wherein determining whether to add the specific set of parameter values to the plurality of parameter value sets includes: if no second identifier matches the first identifier, determining to add the specific set of parameter values to the plurality of parameter value sets.

2. The method of claim 1, wherein processing the log entries includes classifying the parameters based on whether they affect data lineage associated with the general-purpose computer program.

3. The method of claim 1, wherein determining the first identifier comprises: calculating an identifier string from the contents of the specific set of parameter values, and determining the plurality of second identifiers comprises: calculating a plurality of identifier strings from the contents of at least some of the sets of parameter values in the plurality of sets of parameter values.

4. The method of claim 1, wherein determining the first identifier comprises forming a concatenation of one or more of the following: the identifier of the general-purpose computer program, name-value pairs of the specific set of parameter values, function prototypes of the general-purpose computer program, and the project scope of the first instance of the general-purpose computer program.

5. The method of claim 1, wherein determining the first identifier comprises applying a data mapping function to one or more of: the identifier of the general-purpose computer program, name-value pairs of the specific set of parameter values, a function prototype of the general-purpose computer program, and the project scope of the first instance of the general-purpose computer program.

6. The method of claim 5, wherein the data mapping function comprises a hash function.

7. The method of claim 1, wherein the first type of parameters includes parameters that affect the logical operation of the general-purpose computer program, and the second type of parameters includes parameters that do not affect the logical operation of the general-purpose computer program.

8. The method of claim 1, wherein the general-purpose computer program is specified as a data flow graph, the data flow graph comprising nodes representing data processing operations and links between the nodes representing flows of data elements between the data processing operations.

9. The method of claim 1, wherein for each of the one or more parameters, the analyzing comprises: automatically classifying the parameter or accepting a user-defined classification of the parameter.

10. The method of claim 9, wherein automatically classifying the parameter comprises: initially classifying the parameter as belonging to the first type of parameter; determining the number of unique values of the parameter across multiple executions of instances of the general-purpose computer program; and if the number of unique values of the parameter exceeds a predetermined threshold, reclassifying the parameter as belonging to the second type of parameter.

11. The method of claim 9, wherein automatically classifying the parameter comprises: initially classifying the parameter as belonging to the first type of parameter; determining whether changes in the value of the parameter across multiple executions of instances of the general-purpose computer program affect data lineage associated with the general-purpose computer program; and if changes in the value of the parameter do not affect the data lineage, reclassifying the parameter as belonging to the second type of parameter.

12. The method of claim 1, further comprising: forming an association between log entries associated with the execution of the first instance of the general-purpose computer program and the specific set of parameter values.

13. The method of claim 1, wherein the log entry associated with the execution of the first instance of the general-purpose computer program comprises: a log entry for an execution command for instantiating the general-purpose computer program, the log entry including one or more parameter values provided as arguments to the execution command.

14. The method of claim 13, wherein the log entries associated with the execution of the first instance of the general-purpose computer program further include one or more of the following: an indication of the project in which the first instance executes, an indication of internal parameters of the first instance, and an indication of environment settings, global variables, and configuration variables used by the first instance.

15. The method of claim 9, further comprising: processing a plurality of parameter value sets for a plurality of general-purpose computer programs and a plurality of log entries associated with the execution of instances of at least some of the plurality of general-purpose computer programs to form a data lineage report, wherein the plurality of parameter value sets for the plurality of general-purpose computer programs includes the extended plurality of parameter value sets of the general-purpose computer program, and the plurality of log entries associated with the execution of instances of at least some of the plurality of general-purpose computer programs includes the log entries for the execution of the first instance of the general-purpose computer program, including their association with the specific set of parameter values.

16. The method of claim 15, wherein forming the data lineage report comprises: for each set of parameter values of the plurality of parameter value sets for the plurality of general-purpose computer programs,

Processing the plurality of log entries associated with the execution of instances of at least some of the plurality of general-purpose computer programs to identify all log entries associated with the execution of instances of the general-purpose computer program corresponding to the set of parameter values, and identifying, from the identified log entries, the most recent instantiation time of the general-purpose computer program; and

Determining whether to include the set of parameter values in the data lineage report based on the most recent instantiation time of the general-purpose computer program.

17. The method of claim 16, wherein determining whether to include the set of parameter values in the data lineage report based on the most recent instantiation time of the general-purpose computer program comprises: comparing the most recent instantiation time with a predetermined time interval, and if the most recent instantiation time of the general-purpose computer program is within the predetermined time interval, including the set of parameter values in the data lineage report.

18. The method of claim 15, wherein forming the data lineage report comprises: for each set of parameter values of the plurality of parameter value sets for the plurality of general-purpose computer programs,

Processing the plurality of log entries associated with the execution of instances of at least some of the plurality of general-purpose computer programs to determine the number of log entries associated with the execution of instances of the general-purpose computer program corresponding to the set of parameter values, and

Determining whether to include the set of parameter values in the data lineage report based on the number of log entries associated with the execution of instances of the general-purpose computer program.

19. A computer-readable medium comprising instructions for managing a set of parameter values, the instructions being configured to cause a computing system to:

Processing log entries associated with the execution of instances of the general-purpose computer program, each instance of the general-purpose computer program being associated with one or more parameter values, and expanding the plurality of parameter value sets based on the processing, the processing including:

The first identifier is compared with each of the plurality of second identifiers to determine whether the first identifier matches any one of the plurality of second identifiers.

Determining whether to add the specific set of parameter values to the plurality of parameter value sets includes: if no second identifier matches the first identifier, then determining to add the specific set of parameter values to the plurality of parameter value sets.

20. A computing system for managing a set of parameter values, the computing system comprising:

An input device or port for receiving a set of multiple parameter values for a general-purpose computer program, and

At least one processor is configured to process log entries associated with the execution of instances of the general-purpose computer program, each instance of the general-purpose computer program being associated with one or more parameter values, the processing comprising:

21. A computing system for managing a set of parameter values, the computing system comprising:

A means for receiving a set of multiple parameter values for a general-purpose computer program, and

A means for processing log entries associated with the execution of instances of the general-purpose computer program, each instance of the general-purpose computer program being associated with one or more parameter values, the processing comprising:

22. A method for managing a set of parameter values, the method comprising:

Receiving a general-purpose computer program;

Receiving a first set of parameter values;

Generating an executable instance of the general-purpose computer program by instantiating the general-purpose computer program according to the first set of parameter values;

Receiving data from one or more datasets;

Executing the executable instance of the general-purpose computer program to process at least some of the received data;

Generating log entries for the generation of the executable instance of the general-purpose computer program, the log entries including at least some parameter values from the first set of parameter values;

Storing the log entries;

Receiving the log entries;

Processing the log entries to form a specific set of parameter values, wherein the processing includes: extracting from the log entries at least some of the parameter values of the first set of parameter values, and forming the specific set of parameter values from the extracted parameter values; and

Whether to add the specific set of parameter values to the plurality of pre-existing parameter value sets is determined by comparing a first identifier of the specific set of parameter values with the identifiers of at least some of the pre-existing parameter value sets.

The comparison between the identifier of the specific set of parameter values and the identifiers of at least some of the multiple pre-existing sets of parameter values includes:

Determine a plurality of second identifiers, each second identifier corresponding to one of the at least some pre-existing sets of parameter values; and

Determining whether to add the specific set of parameter values to the plurality of pre-existing sets of parameter values includes: if no second identifier matches the first identifier, then determining to add the specific set of parameter values to the plurality of pre-existing sets of parameter values.

23. The method of claim 22, wherein determining the first identifier comprises: calculating an identifier string from the contents of the particular set of parameter values, and determining the plurality of second identifiers comprises: calculating a plurality of identifier strings from the contents of at least some of the plurality of pre-existing sets of parameter values.

24. The method of claim 22, wherein determining the first identifier comprises a concatenation of one or more of the following: the identifier of the general-purpose computer program, name-value pairs of the particular set of parameter values, a function prototype of the general-purpose computer program, and the project scope of an executable instance of the general-purpose computer program.

25. The method of claim 22, wherein determining the first identifier comprises applying a data mapping function to one or more of: the identifier of the general-purpose computer program, name-value pairs of the particular set of parameter values, a function prototype of the general-purpose computer program, and the project scope of an executable instance of the general-purpose computer program.

26. The method of claim 25, wherein the data mapping function comprises a hash function.

27. The method of claim 22, further comprising: analyzing the general-purpose computer program to classify each of one or more parameters associated with the general-purpose computer program as a member of a first class of parameters or a second class of parameters.

28. The method of claim 27, wherein processing the log entries to form a specific set of parameter values further comprises:

Including in the specific set any extracted parameter values that appear in the log entries and are classified as members of the first type of parameter, and

Excluding from the specific set any extracted parameter values that appear in the log entries and are classified as members of the second type of parameter.

29. The method of claim 27, wherein the first type of parameters includes parameters that affect the logical operation of the general-purpose computer program, and the second type of parameters includes parameters that do not affect the logical operation of the general-purpose computer program.

30. A computer-readable medium comprising instructions for managing a set of parameter values, the instructions being configured to cause a computing system to:

Receive a general-purpose computer program;

Receive a first set of parameter values;

Receive data from one or more datasets;

Store the log entries;

Receive the log entry;

31. A system for managing a set of parameter values, the system comprising:

The first computing system includes:

A first input device or port is used to receive a general-purpose computer program, a first set of parameter values, and data from one or more datasets;

The first group of one or more processors is configured as follows:

Log entries for generating executable instances of the general-purpose computer program, the log entries including at least some parameter values from the first set of parameter values; a first output device or port for storing the log entries in a storage device;

The second computing system includes:

A second input device or port is used to receive the log entries; a second group of one or more processors are configured to: