WO2023215028A1 - Dataflow graph datasets - Google Patents
- Publication number
- WO2023215028A1 (PCT/US2023/013841)
- Authority
- WO
- WIPO (PCT)
- Prior art keywords
- dataflow graph
- catalogued
- subgraph
- data
- dataset
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Ceased
Classifications
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/90—Details of database functions independent of the retrieved data types
- G06F16/901—Indexing; Data structures therefor; Storage structures
- G06F16/9024—Graphs; Linked lists
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F8/00—Arrangements for software engineering
- G06F8/30—Creation or generation of source code
- G06F8/36—Software reuse
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F8/00—Arrangements for software engineering
- G06F8/30—Creation or generation of source code
- G06F8/34—Graphical or visual programming
Definitions
- providing the user interface comprises generating a graphical user interface having a searchable menu of the one or more entries in the dataset catalog; and receiving the identification of the first entry associated with the first catalogued dataflow graph comprises receiving, via the user interface, a user input indicating a selection of the first entry in the searchable menu.
- the method comprises: executing the configured dataflow graph of the software application program.
- executing the configured dataflow graph of the software application program comprises: executing the first catalogued dataflow graph to generate the output data; and providing the generated output data as input to the dataflow graph of the software application for performance of at least one of the one or more data processing operations using the output data.
- executing the configured dataflow graph causes executing of the first catalogued dataflow graph.
- the output data is generated by the first catalogued dataflow graph during execution of the configured dataflow graph.
- the first entry stores a reference to a file storing information indicating nodes of the first catalogued dataflow graph and/or configuration parameters of the first catalogued dataflow graph.
- identifying the subgraph comprises: displaying, in a user interface, a graphical representation of a dataflow graph; and receiving, via the user interface, first user input indicating the subgraph within the dataflow graph.
- the instructions further cause the at least one computer hardware processor to perform: receiving, via the user interface, second user input commanding creation of the new entry associated with the indicated subgraph; wherein the creating of the new entry associated with the identified subgraph is performed in response to receiving the second user input.
- the data processing system comprises data storage storing a previously created dataflow graph and identifying the subgraph comprises: receiving, via a user interface, a user input identifying the previously created dataflow graph as the subgraph.
- the new entry includes the information indicating the nodes, links, and configuration parameters of the identified subgraph.
- configuring the dataset catalog to enable access to the new entry associated with the identified subgraph comprises providing a user interface through which a user can identify, in the dataset catalog, the new entry associated with the identified subgraph.
- Some embodiments provide at least one non-transitory computer-readable storage medium storing instructions that, when executed by at least one computer hardware processor of a data processing system, cause the at least one computer hardware processor to perform a method for providing a software application program, developed as a dataflow graph having nodes representing data processing operations and links representing flows of data between the nodes, with access to output data dynamically generated by one or more other dataflow graphs.
- the data processing system 200 may be configured to: (1) provide a list of schemas in a GUI from which the user can select a schema for the catalogued dataflow graph; and (2) determine the schema based on user input indicating a selection of one of the schemas.
- the data processing system 200 may be configured to determine one or more keys (e.g., a primary key and/or a foreign key) of the data. In some embodiments, the data processing system 200 may be configured to determine a key inherited from upstream data. For example, the data processing system 200 may use a key from one or more input datasets used by the subgraph 208. In some embodiments, the data processing system 200 may be configured to obtain a user specified key when the data processing system 200 does not identify an inherited key. For example, the data processing system 200 may provide an interface through which a user may specify a key.
- FIG. 3F shows the software application program development UI 220 with another dataflow graph 326 on a display of the device 210 interacting with the data processing system 200, according to some embodiments of the technology described herein.
- the software application program development UI 220 includes a dataset catalog UI 324 displaying graphical elements representing entries 204A, 204B, 204C, 204D, 204E of the updated dataset catalog 204 of FIG. 3E.
- a user of the device 300 may use the dataset catalog UI 324 to incorporate a dataset of the data processing system 200 into a dataflow graph 326 shown in the software application program development UI 220.
- the dataflow graph is configured to receive, as input, output data generated when dataflow graph 202E is executed using entry 204E.
- FIG. 4A shows software application programs 206D, 206E
- the data processing system 200 may include any number of software application programs.
- the data processing system 200 may include hundreds or thousands of such software application programs.
- a software application program developed as a dataflow graph may perform various data processing operations to change a format of data from a physical dataset into a logical dataset for use by other software application programs.
- Examples of applications for accessing a physical dataset using an entry of a dataset catalog are described in U.S. Patent Application Publication No. 2022/0245125, titled “Dataset Multiplexer for Data Processing System”, which is incorporated by reference herein in its entirety.
- FIG. 4B shows a block diagram of the system modules 400 of the data processing system 200, according to some embodiments of the technology described herein.
- the system modules 400 include a dataflow graph generator 402, a dataset catalog module 404, a dataflow graph storage module 406, a software application development UI module 408, a dataset catalog UI module 410, a transformation engine 412, a compiler 414, and a dataflow graph execution engine 416.
- the dataflow graph generator 402 may be configured to identify a subgraph within a dataflow graph.
- the dataflow graph generator 402 may be configured to identify the subgraph by identifying a portion of the dataflow graph as the subgraph.
- the dataflow graph generator 402 may be configured to identify a portion of a dataflow graph based on user input specifying a portion of the dataflow graph (e.g., as described herein with reference to FIG. 3B).
- the dataflow graph generator 402 may be configured to store the identified subgraph as a catalogued dataflow graph as described herein with reference to FIG. 3D.
- the dataset catalog module 404 may be configured to provide access to datasets (e.g., physical datasets and/or dataflow graph datasets).
- the dataset catalog module 404 may provide a software application program with access to a dataset through an entry associated with the dataset.
- the dataset catalog module 404 may generate a dataset catalog UI menu allowing users to select entries for incorporating associated datasets into a dataflow graph (e.g., as described herein with reference to FIG. 3F).
- the dataset catalog module 404 may be configured to provide access to datasets by allowing software application programs to reference entries of the dataset catalog 204.
- executable instructions of a software application program may reference entries of the dataset catalog 204 in order to incorporate datasets.
- the dataset catalog module 404 may comprise one or more software application programs that provide information from entries of a dataset catalog to other software application programs.
- the transformation engine 412 may change the order of the first node with at least one of the one or more other nodes such that the first node representing the first sort operation is placed adjacent to a second node representing a second sort operation.
- the first and second sort operations become redundant, and the transformation engine 412 may transform the dataflow graph by removing one of the first sort operation or the second sort operation (e.g., by removing the corresponding node from the dataflow graph when producing the transformed dataflow graph).
- the transformation engine 412 may identify a first node representing a first operation of a first type (e.g., a first sort operation on a major key, a first rollup operation on a major key, etc.) followed by a second node representing a second operation of a second, weaker type (e.g., a second operation on a minor key, a sort-within-groups operation, a second rollup operation on a minor key, a grouped rollup operation, etc.). Because processing data by the first operation may require more computing resources than processing data by the second, weaker operation, the transformation engine 412 may perform a strength reduction transformation that replaces the first operation with the second operation.
- the transformation engine 412 may be configured to identify a node configured to perform several operations that may be more efficient when executed separately.
- the transformation engine 412 may be configured to perform a serial to parallel transformation of the dataflow graph which breaks one or more of the several operations into separate nodes for parallel processing (e.g., an automatic parallelism operation). The operations may then execute in parallel using different processes running on one or multiple computing devices.
- the transformation engine 412 may then add a merge operation to merge the result of the parallel operations.
- the transformation engine 412 can continue updating the dataflow graph by: (1) selecting a second optimization rule different from the first optimization rule; (2) identifying a second portion of the dataflow graph to which to apply the second optimization rule; and (3) applying the second optimization rule to the second portion of the dataflow graph.
- the transformation engine 412 may be configured to output the transformed dataflow graph to store the transformed dataflow graph and/or to output the transformed dataflow graph to the compiler module 414 that compiles the transformed dataflow graph into an executable software application program (e.g., that may be executable by the execution engine 416).
- FIG. 11 is a screenshot of a UI 2100 with a menu 2102 for storing the subgraph 1002 of FIG. 10 as a dataset accessible through a dataset catalog, according to some embodiments of the technology described herein.
- the UI 2100 generates a menu 2102 (e.g., in response to user input) that provides an option to store the subgraph as a dataset.
- the option is labelled “Create Datasource from subgraph”.
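The dataset catalog behaviour described in the definitions above (entries that reference catalogued dataflow graphs, a searchable menu of entries, and execution of the catalogued graph when the configured graph runs) can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the implementation described in the publication: `CatalogEntry`, `DatasetCatalog`, `customers_subgraph`, and `application_graph` are hypothetical names, and a plain Python callable stands in for the reference to a file describing the catalogued graph.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List

Record = dict


@dataclass
class CatalogEntry:
    """Hypothetical catalog entry: a name plus a reference to a catalogued graph."""
    name: str
    # Assumption: a callable stands in for the stored graph definition.
    graph: Callable[[], List[Record]]


@dataclass
class DatasetCatalog:
    """Hypothetical catalog mapping entry names to entries."""
    entries: Dict[str, CatalogEntry] = field(default_factory=dict)

    def add(self, entry: CatalogEntry) -> None:
        self.entries[entry.name] = entry

    def search(self, text: str) -> List[str]:
        # Stand-in for the searchable menu: case-insensitive substring match.
        return sorted(n for n in self.entries if text.lower() in n.lower())

    def read(self, name: str) -> List[Record]:
        # Output data is generated on demand by executing the catalogued graph.
        return self.entries[name].graph()


def customers_subgraph() -> List[Record]:
    """A subgraph stored as a dataset: reads raw records and normalizes names."""
    raw = [{"id": 2, "name": "bo"}, {"id": 1, "name": "al"}]
    return [{"id": r["id"], "name": r["name"].upper()} for r in raw]


def application_graph(catalog: DatasetCatalog, entry_name: str) -> List[Record]:
    """Application graph configured to take the entry's output as its input."""
    rows = catalog.read(entry_name)  # executes the catalogued graph
    return sorted(rows, key=lambda r: r["id"])


catalog = DatasetCatalog()
catalog.add(CatalogEntry("customers_clean", customers_subgraph))
selected = catalog.search("customers")[0]
result = application_graph(catalog, selected)
```

In this sketch the application never touches the raw records directly; it only names a catalog entry, and the data is produced when the application graph runs.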
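The redundant-sort transformation attributed to the transformation engine 412 above can be illustrated on a simplified linear pipeline. The sketch below is an assumption-laden simplification rather than the engine's actual algorithm: it treats a graph as a list of nodes, assumes a hypothetical set `ORDER_PRESERVING` of operations that do not disturb sort order, and removes the downstream sort when the data is already known to be sorted on the same key (instead of physically reordering nodes to make the two sorts adjacent).

```python
from typing import List, NamedTuple, Optional


class Node(NamedTuple):
    """Hypothetical pipeline node: an operation name and an optional key."""
    op: str
    key: Optional[str] = None


# Assumption: these operations keep records in their incoming order.
ORDER_PRESERVING = {"filter"}


def remove_redundant_sorts(pipeline: List[Node]) -> List[Node]:
    """Drop a sort when the data is already sorted on the same key."""
    result: List[Node] = []
    active_key: Optional[str] = None  # key the data is currently sorted on
    for node in pipeline:
        if node.op == "sort":
            if active_key is not None and node.key == active_key:
                continue  # redundant: skip this node in the transformed graph
            active_key = node.key
        elif node.op not in ORDER_PRESERVING:
            active_key = None  # ordering can no longer be relied on
        result.append(node)
    return result


before = [
    Node("sort", "id"),
    Node("filter"),
    Node("sort", "id"),   # redundant: only a filter separates it from the first sort
    Node("rollup", "id"),
    Node("sort", "id"),   # kept: this sketch treats rollup as possibly reordering
]
after = remove_redundant_sorts(before)
```

Treating every operation outside `ORDER_PRESERVING` as order-destroying is deliberately conservative, so the sketch never removes a sort that might still be needed.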
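The serial-to-parallel transformation described above, in which a node performing several operations is split into separate nodes whose results are then merged, can be sketched as follows. This is a minimal illustration, not the engine's implementation: `run_serial` and `run_parallel` are hypothetical names, and threads from Python's standard `concurrent.futures` module are used for portability, whereas the text describes separate processes that may run on one or multiple computing devices.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Any, Callable, Dict, List

Operation = Callable[[List[int]], Any]


def run_serial(ops: Dict[str, Operation], data: List[int]) -> Dict[str, Any]:
    """One node performing every operation in sequence."""
    return {name: op(data) for name, op in ops.items()}


def run_parallel(ops: Dict[str, Operation], data: List[int]) -> Dict[str, Any]:
    """Each operation as its own task, followed by a merge of the results."""
    with ThreadPoolExecutor(max_workers=max(1, len(ops))) as pool:
        futures = {name: pool.submit(op, data) for name, op in ops.items()}
        # Merge step: collect each task's result into a single output record.
        return {name: fut.result() for name, fut in futures.items()}


ops: Dict[str, Operation] = {"total": sum, "largest": max, "count": len}
data = [3, 1, 4, 1, 5]
serial_out = run_serial(ops, data)
parallel_out = run_parallel(ops, data)
```

The transformation is only valid when the split operations are independent of one another, which is why both functions are expected to produce the same merged result here.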
Landscapes
- Engineering & Computer Science (AREA)
- Software Systems (AREA)
- Theoretical Computer Science (AREA)
- General Engineering & Computer Science (AREA)
- Physics & Mathematics (AREA)
- General Physics & Mathematics (AREA)
- Databases & Information Systems (AREA)
- Data Mining & Analysis (AREA)
- Stored Programmes (AREA)
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
Priority Applications (6)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| AU2023265391A AU2023265391A1 (en) | 2022-05-05 | 2023-02-24 | Dataflow graph datasets |
| EP23713778.1A EP4519760A1 (en) | 2022-05-05 | 2023-02-24 | Dataflow graph datasets |
| JP2024563593A JP2025514974A (ja) | 2022-05-05 | 2023-02-24 | データフローグラフデータセット |
| CA3256554A CA3256554A1 (en) | 2022-05-05 | 2023-02-24 | DATA FLOW GRAPH DATA SETS |
| CN202380036746.5A CN119256293A (zh) | 2022-05-05 | 2023-02-24 | 数据流图数据集 |
| MX2024013637A MX2024013637A (es) | 2022-05-05 | 2024-11-04 | Conjuntos de datos de grafos de flujos de datos |
Applications Claiming Priority (4)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| US202263338855P | 2022-05-05 | 2022-05-05 | |
| US63/338,855 | 2022-05-05 | | |
| US202263432615P | 2022-12-14 | 2022-12-14 | |
| US63/432,615 | 2022-12-14 | | |
Publications (1)
| Publication Number | Publication Date |
|---|---|
| WO2023215028A1 (en) | 2023-11-09 |
Family
ID=85775931
Family Applications (1)
| Application Number | Title | Priority Date | Filing Date |
|---|---|---|---|
| PCT/US2023/013841 Ceased WO2023215028A1 (en) | 2022-05-05 | 2023-02-24 | Dataflow graph datasets |
Country Status (8)
| Country | Link |
|---|---|
| US (1) | US20230359668A1 |
| EP (1) | EP4519760A1 |
| JP (1) | JP2025514974A |
| CN (1) | CN119256293A |
| AU (1) | AU2023265391A1 |
| CA (1) | CA3256554A1 |
| MX (1) | MX2024013637A |
| WO (1) | WO2023215028A1 |
Families Citing this family (4)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| EP4333356A1 (en) * | 2022-08-29 | 2024-03-06 | Zama SAS | Optimizing a computer program for a table lookup operation |
| US20250181319A1 (en) * | 2023-12-01 | 2025-06-05 | Ab Initio Technology Llc | Techniques for resolving data fields available at points in a software application |
| WO2025137522A1 (en) * | 2023-12-21 | 2025-06-26 | Ab Initio Technology Llc | A development environment for automatically generating code using a multi-tiered metadata model |
| CN119597973B (zh) * | 2024-10-15 | 2025-11-04 | 广东电网有限责任公司 | 一种基于数据资产管理的数据目录智能化编排系统 |
Citations (5)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| US5966072A (en) | 1996-07-02 | 1999-10-12 | Ab Initio Software Corporation | Executing computations expressed as graphs |
| US7716630B2 (en) | 2005-06-27 | 2010-05-11 | Ab Initio Technology Llc | Managing parameters for graph-based computations |
| US20210232579A1 (en) | 2020-01-28 | 2021-07-29 | Ab Initio Technology Llc | Editor for generating computational graphs |
| US20220043635A1 (en) * | 2017-06-07 | 2022-02-10 | Ab Initio Technology Llc | Dataflow graph configuration |
| US20220245125A1 (en) | 2021-01-31 | 2022-08-04 | Ab Initio Technology Llc | Dataset multiplexer for data processing system |
-
2023
- 2023-02-24 EP EP23713778.1A patent/EP4519760A1/en active Pending
- 2023-02-24 AU AU2023265391A patent/AU2023265391A1/en active Pending
- 2023-02-24 CA CA3256554A patent/CA3256554A1/en active Pending
- 2023-02-24 JP JP2024563593A patent/JP2025514974A/ja active Pending
- 2023-02-24 US US18/114,212 patent/US20230359668A1/en active Pending
- 2023-02-24 WO PCT/US2023/013841 patent/WO2023215028A1/en not_active Ceased
- 2023-02-24 CN CN202380036746.5A patent/CN119256293A/zh active Pending
-
2024
- 2024-11-04 MX MX2024013637A patent/MX2024013637A/es unknown
Also Published As
| Publication number | Publication date |
|---|---|
| MX2024013637A (es) | 2025-02-10 |
| CN119256293A (zh) | 2025-01-03 |
| EP4519760A1 (en) | 2025-03-12 |
| CA3256554A1 (en) | 2023-11-09 |
| AU2023265391A1 (en) | 2024-10-03 |
| US20230359668A1 (en) | 2023-11-09 |
| JP2025514974A (ja) | 2025-05-13 |
Similar Documents
| Publication | Title | Publication Date |
|---|---|---|
| AU2023270294B2 (en) | Systems and methods for dataflow graph optimization | |
| US20230359668A1 (en) | Dataflow graph datasets | |
| Kougka et al. | The many faces of data-centric workflow optimization: a survey | |
| KR102549994B1 (ko) | 가변 레벨 병렬화를 사용하여 데이터 처리 동작을 수행하기 위한 시스템 및 방법 | |
| JP7720912B2 (ja) | データ処理システムによって管理されるデータエンティティにアクセスするためのシステム及び方法 | |
| Sun | An improved apriori algorithm based on support weight matrix for data mining in transaction database | |
| Ali et al. | Towards a cost model to optimize user-defined functions in an ETL workflow based on user-defined performance metrics | |
| Zou et al. | Lachesis: automatic partitioning for UDF-centric analytics | |
| Dyer et al. | Boa: An enabling language and infrastructure for ultra-large-scale msr studies | |
| HK40122891A (zh) | 数据流图数据集 | |
| US20250244978A1 (en) | Techniques for converting sql dialect application programs to dataflow graphs | |
| US20260119255A1 (en) | Data processing system for automatic processing of continuous flows or batch data | |
| US20240320224A1 (en) | Logical Access for Previewing Expanded View Datasets | |
| Salah et al. | Optimizing the data-process relationship for fast mining of frequent itemsets in mapreduce | |
| WO2026096598A1 (en) | Data processing system for automatic processing of continuous flows or batch data | |
| Macke | Leveraging distributional context for safe and interactive data science at scale | |
| WO2024197264A1 (en) | Logical access for previewing expanded view datasets | |
| CN120409651A (zh) | 一种基于数据编织架构的元数据管理方法、设备及介质 | |
| CN117216091A (zh) | 一种HiveSQL多重连接查询的优化方法、装置、设备及存储介质 | |
| Das et al. | Lachesis: Automatic Partitioning for UDF-Centric Analytics | |
| Prathima et al. | Analyzing The Hive Performance on Large Data |
Legal Events
| Date | Code | Title | Description |
|---|---|---|---|
| | 121 | Ep: the epo has been informed by wipo that ep was designated in this application | Ref document number: 23713778; Country of ref document: EP; Kind code of ref document: A1 |
| | WWE | Wipo information: entry into national phase | Ref document number: AU2023265391; Country of ref document: AU |
| | ENP | Entry into the national phase | Ref document number: 2023265391; Country of ref document: AU; Date of ref document: 20230224; Kind code of ref document: A |
| | WWE | Wipo information: entry into national phase | Ref document number: 2024563593; Country of ref document: JP. Ref document number: 202380036746.5; Country of ref document: CN |
| | WWE | Wipo information: entry into national phase | Ref document number: MX/A/2024/013637; Country of ref document: MX |
| | WWE | Wipo information: entry into national phase | Ref document number: 202417087711; Country of ref document: IN |
| | WWE | Wipo information: entry into national phase | Ref document number: 2023713778; Country of ref document: EP |
| | NENP | Non-entry into the national phase | Ref country code: DE |
| | ENP | Entry into the national phase | Ref document number: 2023713778; Country of ref document: EP; Effective date: 20241205 |
| | WWP | Wipo information: published in national office | Ref document number: 202380036746.5; Country of ref document: CN |
| | WWP | Wipo information: published in national office | Ref document number: MX/A/2024/013637; Country of ref document: MX |