WO2023215028A1 - Dataflow graph datasets - Google Patents

Dataflow graph datasets

Info

Publication number
WO2023215028A1
WO2023215028A1 PCT/US2023/013841 US2023013841W WO2023215028A1 WO 2023215028 A1 WO2023215028 A1 WO 2023215028A1 US 2023013841 W US2023013841 W US 2023013841W WO 2023215028 A1 WO2023215028 A1 WO 2023215028A1
Authority
WO
WIPO (PCT)
Prior art keywords
dataflow graph
catalogued
subgraph
data
dataset
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Ceased
Application number
PCT/US2023/013841
Other languages
English (en)
French (fr)
Inventor
Ian Robert SCHECHTER
Garth Allen DICKIE
Jonah EGENOLF
Marshall Isman
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Ab Initio Technology LLC
Original Assignee
Ab Initio Technology LLC
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Ab Initio Technology LLC

Classifications

    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/901 Indexing; Data structures therefor; Storage structures
    • G06F16/9024 Graphs; Linked lists
    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F8/00 Arrangements for software engineering
    • G06F8/30 Creation or generation of source code
    • G06F8/36 Software reuse
    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F8/00 Arrangements for software engineering
    • G06F8/30 Creation or generation of source code
    • G06F8/34 Graphical or visual programming

Definitions

  • providing the user interface comprises generating a graphical user interface having a searchable menu of the one or more entries in the dataset catalog; and receiving the identification of the first entry associated with the first catalogued dataflow graph comprises receiving, via the user interface, a user input indicating a selection of the first entry in the searchable menu.
  • the method comprises: executing the configured dataflow graph of the software application program.
  • executing the configured dataflow graph of the software application program comprises: executing the first catalogued dataflow graph to generate the output data; and providing the generated output data as input to the dataflow graph of the software application for performance of at least one of the one or more data processing operations using the output data.
  • executing the configured dataflow graph causes executing of the first catalogued dataflow graph.
  • the output data is generated by the first catalogued dataflow graph during execution of the configured dataflow graph.
  • the first entry stores a reference to a file storing information indicating nodes of the first catalogued dataflow graph and/or configuration parameters of the first catalogued dataflow graph.
  • identifying the subgraph comprises: displaying, in a user interface, a graphical representation of a dataflow graph; and receiving, via the user interface, first user input indicating the subgraph within the dataflow graph.
  • the instructions further cause the at least one computer hardware processor to perform: receiving, via the user interface, second user input commanding creation of the new entry associated with the indicated subgraph; wherein the creating of the new entry associated with the identified subgraph is performed in response to receiving the second user input.
  • the data processing system comprises data storage storing a previously created dataflow graph and identifying the subgraph comprises: receiving, via a user interface, a user input identifying the previously created dataflow graph as the subgraph.
  • the new entry includes the information indicating the nodes, links, and configuration parameters of the identified subgraph.
  • configuring the dataset catalog to enable access to the new entry associated with the identified subgraph comprises providing a user interface through which a user can identify, in the dataset catalog, the new entry associated with the identified subgraph.
  • Some embodiments provide at least one non-transitory computer-readable storage medium storing instructions that, when executed by at least one computer hardware processor of a data processing system, cause the at least one computer hardware processor to perform a method for providing a software application program, developed as a dataflow graph having nodes representing data processing operations and links representing flows of data between the nodes, with access to output data dynamically generated by one or more other dataflow graphs.
  • the data processing system 200 may be configured to: (1) provide a list of schemas in a GUI from which the user can select a schema for the catalogued dataflow graph; and (2) determine the schema based on user input indicating a selection of one of the schemas.
  • the data processing system 200 may be configured to determine one or more keys (e.g., a primary key and/or a foreign key) of the data. In some embodiments, the data processing system 200 may be configured to determine a key inherited from upstream data. For example, the data processing system 200 may use a key from one or more input datasets used by the subgraph 208. In some embodiments, the data processing system 200 may be configured to obtain a user specified key when the data processing system 200 does not identify an inherited key. For example, the data processing system 200 may provide an interface through which a user may specify a key.
  • FIG. 3F shows the software application program development UI 220 with another dataflow graph 326 on a display of the device 210 interacting with the data processing system 200, according to some embodiments of the technology described herein.
  • the software application program development UI 220 includes a dataset catalog UI 324 displaying graphical elements representing entries 204A, 204B, 204C, 204D, 204E of the updated dataset catalog 204 of FIG. 3E.
  • a user of the device 300 may use the dataset catalog UI 324 to incorporate a dataset of the data processing system 200 into a dataflow graph 326 shown in the software application program development UI 220.
  • the dataflow graph is configured to receive, as input, output data generated when dataflow graph 202E is executed using entry 204E.
  • FIG. 4A shows software application programs 206D, 206E
  • the data processing system 200 may include any number of software application programs.
  • the data processing system 200 may include hundreds or thousands of such software application programs.
  • a software application program developed as a dataflow graph may perform various data processing operations to change a format of data from a physical dataset into a logical dataset for use by other software application programs.
  • Examples of applications for accessing a physical dataset using an entry of a dataset catalog are described in U.S. Patent Application Publication No. 2022/0245125, titled “Dataset Multiplexer for Data Processing System”, which is incorporated by reference herein in its entirety.
  • FIG. 4B shows a block diagram of the system modules 400 of the data processing system 200, according to some embodiments of the technology described herein.
  • the system modules 400 include a dataflow graph generator 402, a dataset catalog module 404, a dataflow graph storage module 406, a software application development UI module 408, a dataset catalog UI module 410, a transformation engine 412, a compiler 414, and a dataflow graph execution engine 416.
  • the dataflow graph generator 402 may be configured to identify a subgraph within a dataflow graph.
  • the dataflow graph generator 402 may be configured to identify the subgraph by identifying a portion of the dataflow graph as the subgraph.
  • the dataflow graph generator 402 may be configured to identify a portion of a dataflow graph based on user input specifying a portion of the dataflow graph (e.g., as described herein with reference to FIG. 3B).
  • the dataflow graph generator 402 may be configured to store the identified subgraph as a catalogued dataflow graph as described herein with reference to FIG. 3D.
  • the dataset catalog module 404 may be configured to provide access to datasets (e.g., physical datasets and/or dataflow graph datasets).
  • the dataset catalog module 404 may provide a software application program with access to a dataset through an entry associated with the dataset.
  • the dataset catalog module 404 may generate a dataset catalog UI menu allowing users to select entries for incorporating associated datasets into a dataflow graph (e.g., as described herein with reference to FIG. 3F).
  • the dataset catalog module 404 may be configured to provide access to datasets by allowing software application programs to reference entries of the dataset catalog 204.
  • executable instructions of a software application program may reference entries of the dataset catalog 204 in order to incorporate datasets.
  • the dataset catalog module 404 may comprise one or more software application programs that provide information from entries of a dataset catalog to other software application programs.
  • the transformation engine 412 may change the order of the first node with at least one of the one or more other nodes such that the first node representing the first sort operation is placed adjacent to a second node representing a second sort operation.
  • the first and second sort operations become redundant, and the transformation engine 412 may transform the dataflow graph by removing one of the first sort operation or the second sort operation (e.g., by removing the corresponding node from the dataflow graph when producing the transformed dataflow graph).
  • the transformation engine 412 may identify a first node representing a first operation of a first type (e.g., a first sort operation on a major key, a first rollup operation on a major key, etc.) followed by a second node representing a second operation of a second, weaker type (e.g., a second operation on a minor key, a sort-within-groups operation, a second rollup operation on a minor key, a grouped rollup operation, etc.). Because processing data by the first operation may require more computing resources than processing data by the second, weaker operation, the transformation engine 412 may perform a strength reduction transformation that replaces the first operation with the second operation.
  • the transformation engine 412 may be configured to identify a node configured to perform several operations that may be more efficient when executed separately.
  • the transformation engine 412 may be configured to perform a serial to parallel transformation of the dataflow graph which breaks one or more of the several operations into separate nodes for parallel processing (e.g., an automatic parallelism operation). The operations may then execute in parallel using different processes running on one or multiple computing devices.
  • the transformation engine 412 may then add a merge operation to merge the result of the parallel operations.
  • the transformation engine 412 can continue updating the dataflow graph by: (1) selecting a second optimization rule different from the first optimization rule; (2) identifying a second portion of the dataflow graph to which to apply the second optimization rule; and (3) applying the second optimization rule to the second portion of the dataflow graph.
  • the transformation engine 412 may be configured to output the transformed dataflow graph to store the transformed dataflow graph and/or to output the transformed dataflow graph to the compiler module 414 that compiles the transformed dataflow graph into an executable software application program (e.g., that may be executable by the execution engine 416).
  • FIG. 11 is a screenshot of a UI 2100 with a menu 2102 for storing the subgraph 1002 of FIG. 10 as a dataset accessible through a dataset catalog, according to some embodiments of the technology described herein.
  • the UI 2100 generates a menu 2102 (e.g., in response to user input) that provides an option to store the subgraph as a dataset.
  • the option is labelled “Create Datasource from subgraph”.
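The catalogue mechanism described in the excerpts above (an entry associated with a catalogued dataflow graph, a searchable menu of entries, and output data generated when the consuming graph runs) can be illustrated with a minimal sketch. All class and function names here (`CataloguedGraph`, `CatalogEntry`, `DatasetCatalog`, `application_graph`) are hypothetical and chosen for illustration; they are not taken from the application, and the in-memory graph object stands in for the "reference to a file" that an entry may store.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, Iterable, List

Record = Dict[str, int]
Operation = Callable[[List[Record]], List[Record]]


@dataclass
class CataloguedGraph:
    """Hypothetical stand-in for a stored subgraph: a source plus ordered operations."""
    name: str
    source: Callable[[], Iterable[Record]]
    operations: List[Operation] = field(default_factory=list)

    def execute(self) -> List[Record]:
        # Run each node's operation in order, passing records along the links.
        records = list(self.source())
        for operation in self.operations:
            records = operation(records)
        return records


@dataclass
class CatalogEntry:
    entry_id: str
    graph: CataloguedGraph  # assumption: held in memory instead of a file reference


class DatasetCatalog:
    def __init__(self) -> None:
        self._entries: Dict[str, CatalogEntry] = {}

    def create_entry(self, entry_id: str, graph: CataloguedGraph) -> CatalogEntry:
        entry = CatalogEntry(entry_id, graph)
        self._entries[entry_id] = entry
        return entry

    def search(self, text: str) -> List[str]:
        # Backs a searchable menu: return ids of entries matching the query.
        return sorted(e for e in self._entries if text.lower() in e.lower())

    def read(self, entry_id: str) -> List[Record]:
        # Output data is produced on demand by executing the catalogued graph.
        return self._entries[entry_id].graph.execute()


def raw_orders() -> List[Record]:
    return [{"id": 2, "amount": 50}, {"id": 1, "amount": 250}, {"id": 3, "amount": 900}]


large_orders = CataloguedGraph(
    name="large_orders",
    source=raw_orders,
    operations=[
        lambda rs: [r for r in rs if r["amount"] >= 100],  # filter node
        lambda rs: sorted(rs, key=lambda r: r["id"]),      # sort node
    ],
)

catalog = DatasetCatalog()
catalog.create_entry("204E", large_orders)


def application_graph(catalog: DatasetCatalog, entry_id: str) -> int:
    # Executing the consuming graph causes the catalogued graph to execute.
    rows = catalog.read(entry_id)
    return sum(r["amount"] for r in rows)


total = application_graph(catalog, "204E")
```

In this sketch the consumer never touches the upstream records directly: it names an entry, and the catalogue runs the associated graph to produce the data at read time.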
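The redundant-sort transformation attributed to the transformation engine 412 (reordering nodes so that two sort nodes become adjacent, then removing one of them) can be sketched on a linear pipeline. This is an illustrative simplification, not the engine's actual algorithm: the `Node` type and function names are hypothetical, the graph is reduced to a list, and a filter is assumed as the operation that a sort may be moved past, since a stable sort and an order-preserving filter commute.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional

Record = Dict[str, int]


@dataclass(frozen=True)
class Node:
    kind: str                                            # "sort" or "filter"
    key: Optional[str] = None                            # used by sort nodes
    predicate: Optional[Callable[[Record], bool]] = None  # used by filter nodes


def run(pipeline: List[Node], records: List[Record]) -> List[Record]:
    """Execute the nodes in order over the records."""
    for node in pipeline:
        if node.kind == "sort":
            records = sorted(records, key=lambda r: r[node.key])
        elif node.kind == "filter":
            records = [r for r in records if node.predicate(r)]
        else:
            raise ValueError(f"unknown node kind: {node.kind}")
    return records


def remove_redundant_sorts(pipeline: List[Node]) -> List[Node]:
    nodes = list(pipeline)
    # Step 1: move each sort past a following filter until no swap applies,
    # which brings sorts separated only by filters next to each other.
    changed = True
    while changed:
        changed = False
        for i in range(len(nodes) - 1):
            if nodes[i].kind == "sort" and nodes[i + 1].kind == "filter":
                nodes[i], nodes[i + 1] = nodes[i + 1], nodes[i]
                changed = True
    # Step 2: drop a sort that directly follows a sort on the same key.
    result: List[Node] = []
    for node in nodes:
        if (result and node.kind == "sort"
                and result[-1].kind == "sort" and result[-1].key == node.key):
            continue
        result.append(node)
    return result


pipeline = [
    Node("sort", key="id"),
    Node("filter", predicate=lambda r: r["amount"] >= 100),
    Node("sort", key="id"),
]
data = [{"id": 3, "amount": 900}, {"id": 1, "amount": 250}, {"id": 2, "amount": 50}]

optimized = remove_redundant_sorts(pipeline)
```

The optimized pipeline has one fewer node and yields the same records as the original. Only sorts on the same key are merged here; with a stable sort, dropping a sort on a different key could change the order of ties.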

Landscapes

  • Engineering & Computer Science (AREA)
  • Software Systems (AREA)
  • Theoretical Computer Science (AREA)
  • General Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Databases & Information Systems (AREA)
  • Data Mining & Analysis (AREA)
  • Stored Programmes (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
PCT/US2023/013841 2022-05-05 2023-02-24 Dataflow graph datasets Ceased WO2023215028A1 (en)

Priority Applications (6)

Application Number Priority Date Filing Date Title
AU2023265391A AU2023265391A1 (en) 2022-05-05 2023-02-24 Dataflow graph datasets
EP23713778.1A EP4519760A1 (en) 2022-05-05 2023-02-24 Dataflow graph datasets
JP2024563593A JP2025514974A (ja) 2022-05-05 2023-02-24 Dataflow graph datasets
CA3256554A CA3256554A1 (en) 2022-05-05 2023-02-24 DATA FLOW GRAPH DATA SETS
CN202380036746.5A CN119256293A (zh) 2022-05-05 2023-02-24 Dataflow graph datasets
MX2024013637A MX2024013637A (es) 2022-05-05 2024-11-04 Dataflow graph datasets

Applications Claiming Priority (4)

Application Number Priority Date Filing Date Title
US202263338855P 2022-05-05 2022-05-05
US63/338,855 2022-05-05
US202263432615P 2022-12-14 2022-12-14
US63/432,615 2022-12-14

Publications (1)

Publication Number Publication Date
WO2023215028A1 (en) 2023-11-09

Family

ID=85775931

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/US2023/013841 Ceased WO2023215028A1 (en) 2022-05-05 2023-02-24 Dataflow graph datasets

Country Status (8)

Country Link
US (1) US20230359668A1
EP (1) EP4519760A1
JP (1) JP2025514974A
CN (1) CN119256293A
AU (1) AU2023265391A1
CA (1) CA3256554A1
MX (1) MX2024013637A
WO (1) WO2023215028A1

Families Citing this family (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
EP4333356A1 (en) * 2022-08-29 2024-03-06 Zama SAS Optimizing a computer program for a table lookup operation
US20250181319A1 (en) * 2023-12-01 2025-06-05 Ab Initio Technology Llc Techniques for resolving data fields available at points in a software application
WO2025137522A1 (en) * 2023-12-21 2025-06-26 Ab Initio Technology Llc A development environment for automatically generating code using a multi-tiered metadata model
CN119597973B (zh) * 2024-10-15 2025-11-04 广东电网有限责任公司 (Guangdong Power Grid Co., Ltd.) Intelligent data catalog orchestration system based on data asset management

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US5966072A (en) 1996-07-02 1999-10-12 Ab Initio Software Corporation Executing computations expressed as graphs
US7716630B2 (en) 2005-06-27 2010-05-11 Ab Initio Technology Llc Managing parameters for graph-based computations
US20210232579A1 (en) 2020-01-28 2021-07-29 Ab Initio Technology Llc Editor for generating computational graphs
US20220043635A1 (en) * 2017-06-07 2022-02-10 Ab Initio Technology Llc Dataflow graph configuration
US20220245125A1 (en) 2021-01-31 2022-08-04 Ab Initio Technology Llc Dataset multiplexer for data processing system

Also Published As

Publication number Publication date
MX2024013637A (es) 2025-02-10
CN119256293A (zh) 2025-01-03
EP4519760A1 (en) 2025-03-12
CA3256554A1 (en) 2023-11-09
AU2023265391A1 (en) 2024-10-03
US20230359668A1 (en) 2023-11-09
JP2025514974A (ja) 2025-05-13

Similar Documents

Publication Publication Date Title
AU2023270294B2 (en) Systems and methods for dataflow graph optimization
US20230359668A1 (en) Dataflow graph datasets
Kougka et al. The many faces of data-centric workflow optimization: a survey
KR102549994B1 (ko) Systems and methods for performing data processing operations using variable level parallelism
JP7720912B2 (ja) Systems and methods for accessing data entities managed by a data processing system
Sun An improved apriori algorithm based on support weight matrix for data mining in transaction database
Ali et al. Towards a cost model to optimize user-defined functions in an ETL workflow based on user-defined performance metrics
Zou et al. Lachesis: automatic partitioning for UDF-centric analytics
Dyer et al. Boa: An enabling language and infrastructure for ultra-large-scale msr studies
HK40122891A (zh) Dataflow graph datasets
US20250244978A1 (en) Techniques for converting sql dialect application programs to dataflow graphs
US20260119255A1 (en) Data processing system for automatic processing of continuous flows or batch data
US20240320224A1 (en) Logical Access for Previewing Expanded View Datasets
Salah et al. Optimizing the data-process relationship for fast mining of frequent itemsets in mapreduce
WO2026096598A1 (en) Data processing system for automatic processing of continuous flows or batch data
Macke Leveraging distributional context for safe and interactive data science at scale
WO2024197264A1 (en) Logical access for previewing expanded view datasets
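CN120409651A (zh) Metadata management method, device, and medium based on a data fabric architecture
CN117216091A (zh) Optimization method, apparatus, device, and storage medium for HiveSQL multi-join queries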
Das et al. Lachesis: Automatic Partitioning for UDF-Centric Analytics
Prathima et al. Analyzing The Hive Performance on Large Data

Legal Events

Date Code Title Description
121 EP: the EPO has been informed by WIPO that EP was designated in this application. Ref document number: 23713778; Country of ref document: EP; Kind code of ref document: A1
WWE WIPO information: entry into national phase. Ref document number: AU2023265391; Country of ref document: AU
ENP Entry into the national phase. Ref document number: 2023265391; Country of ref document: AU; Date of ref document: 20230224; Kind code of ref document: A
WWE WIPO information: entry into national phase. Ref document number: 2024563593; Country of ref document: JP. Ref document number: 202380036746.5; Country of ref document: CN
WWE WIPO information: entry into national phase. Ref document number: MX/A/2024/013637; Country of ref document: MX
WWE WIPO information: entry into national phase. Ref document number: 202417087711; Country of ref document: IN
WWE WIPO information: entry into national phase. Ref document number: 2023713778; Country of ref document: EP
NENP Non-entry into the national phase. Ref country code: DE
ENP Entry into the national phase. Ref document number: 2023713778; Country of ref document: EP; Effective date: 20241205
WWP WIPO information: published in national office. Ref document number: 202380036746.5; Country of ref document: CN
WWP WIPO information: published in national office. Ref document number: MX/A/2024/013637; Country of ref document: MX