WO2022240571A1 - Synthesis of multi-operator data transformation pipelines - Google Patents

Synthesis of multi-operator data transformation pipelines

Info

Publication number
WO2022240571A1
WO2022240571A1, PCT/US2022/025858, US2022025858W
Authority
WO
WIPO (PCT)
Prior art keywords
operator
pipelines
pipeline
target
data
Prior art date
Application number
PCT/US2022/025858
Other languages
English (en)
Inventor
Yeye He
Surajit Chaudhuri
Junwen YANG
Original Assignee
Microsoft Technology Licensing, Llc
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Microsoft Technology Licensing, Llc filed Critical Microsoft Technology Licensing, Llc
Priority to EP22722632.1A priority Critical patent/EP4338064A1/fr
Publication of WO2022240571A1 publication Critical patent/WO2022240571A1/fr


Classifications

    • G PHYSICS; G06 COMPUTING; CALCULATING OR COUNTING; G06F ELECTRIC DIGITAL DATA PROCESSING; G06F16/00 Information retrieval; database structures therefor; file system structures therefor
    • G06F16/211 Information retrieval of structured data; design, administration or maintenance of databases; schema design and management
    • G06F16/254 Integrating or interfacing systems involving database management systems; extract, transform and load [ETL] procedures, e.g. ETL data flows in data warehouses
    • G06F16/258 Integrating or interfacing systems involving database management systems; data format conversion from or to a database
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS; G06N20/00 Machine learning

Definitions

  • Data preparation, sometimes also known as data wrangling, refers to the process of building sequences of table-manipulation steps (e.g., Transform, Join, Pivot, etc.) in order to prepare raw data into a form that is ready for downstream applications (e.g., business intelligence or machine learning).
  • the end result from data preparation is often a workflow or data pipeline with a sequence of table-manipulation steps, which can be operationalized as recurring jobs in production.
  • the multi-operator pipelines include multiple operators, such as table reshaping operators (e.g., a join operator, a union operator, a groupby operator, etc.) and string transformation operators (e.g., a split operator, a substring operator, a concatenate operator, etc.).
  • the technology may access or receive raw data that is to be transformed into a format matching, or substantially matching, a target table or target visualization.
  • the target table or target visualization is a table or visualization that was previously generated on data other than the raw data to be transformed.
  • a selection of the target visualization or target table may be received from a user.
  • Table properties and/or constraints may be extracted from the target table or visualization, and one or more multi-operator data transformation pipelines may be synthesized for transforming the raw data to a generated table or generated visualization that substantially match the target table or target visualization.
  • Figure 1 schematically shows an example system for synthesizing multi-operator data transformation pipelines.
  • Figure 2 illustrates an example of a pipeline-by-target system.
  • Figure 3 illustrates a simplified example of a data pipeline process.
  • Figure 4A illustrates an example of a search graph for synthesis.
  • Figure 4B illustrates an example synthesis algorithm.
  • Figure 5 illustrates an example representation encoding a state of a partial pipeline.
  • Figures 6A and 6B illustrate example experimental results.
  • Figure 7A illustrates an example method for synthesizing one or more pipelines.
  • Figure 7B illustrates another example method for synthesizing one or more pipelines.
  • Figure 8 is a block diagram illustrating example physical components of a computing device with which aspects of the disclosure may be practiced.
  • Figures 9A and 9B are simplified block diagrams of a mobile computing device with which aspects of the present disclosure may be practiced.
  • Data transformation can be extremely difficult, and transforming data from one form to another continues to present significant challenges.
  • having data in the proper format is often required for proper analysis and operations to be performed on the data.
  • the rise of machine learning has provided additional demand for improved data transformation systems.
  • machine learning systems generally have strict requirements for the data to be in a uniform format in order to accurately process the data.
  • data pipelines can be formed through the use of multiple operators that iteratively change the data from one format to another. Determining and building such pipelines is no small task as there are millions of combinations of operators that could be selected for data transformations.
  • the present technology alleviates some of the problems discussed above and provides for automated methods and systems that create multi-operator data transformation pipelines without a user having to first manually create the desired result.
  • This new paradigm for multi-step pipeline-synthesis is referred to as “by-target.”
  • a user is able to use a “target” format for the data that is not based on the raw data the user is looking to transform.
  • the target may be some other table or visualization that the user has identified and into whose format the user would like to transform the raw data. For example, based on a selected target data table or data visualization, the present technology is able to extract table properties and constraints for the target.
  • Based on the extracted table properties and target constraints, the present technology then synthesizes a multi-operator data transformation pipeline for transforming the raw data to a generated table or generated visualization that matches the target data. Accordingly, the “target” is easy for users to provide but still provides a sufficient specification for pipeline synthesis. Multiple techniques may be used to generate the data pipelines from the selected target. For instance, constraint searching and/or machine learning processes may be deployed to generate these multi-operator pipelines automatically, as sketched below. The search- and learning-based synthesis algorithms have been shown to be effective on real data pipelines, and they enhance the server device’s computing functions, streamline processing, reduce efforts, and eliminate redundant activities.
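  • as a concrete illustration, the following is a minimal, runnable sketch of the constraint-extraction half of this flow, assuming pandas; the function name extract_constraints is illustrative only and not an identifier from the disclosure, and real systems use dedicated FD/key-discovery algorithms rather than this naive enumeration.

```python
# A minimal sketch of extracting table properties and constraints from a
# user-selected target table, assuming pandas. extract_constraints is an
# illustrative (hypothetical) name; the enumeration of FDs is naive.
import pandas as pd
from itertools import permutations

def extract_constraints(target: pd.DataFrame) -> dict:
    # key columns: columns whose values uniquely identify every row
    keys = [c for c in target.columns if target[c].is_unique]
    # functional dependencies a -> b: each value of a maps to one value of b
    fds = [(a, b) for a, b in permutations(target.columns, 2)
           if (target.groupby(a)[b].nunique() <= 1).all()]
    return {"schema": list(target.columns), "keys": keys, "fds": fds}
```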
  • a common usage pattern in pipeline-building is to onboard new data files, such as data from a new store/region/time-period, etc., that are often formatted differently.
  • users typically have a precise “target” in mind, such as an existing data-warehouse table, where the goal is to bring the new data into a form that “looks like” the existing target table (so that the new data can be integrated).
  • in dashboards for data analytics (e.g., in Tableau or Power BI), users can be inspired by an existing visualization and want to turn their raw data into a visualization that “looks like” the given target visualization (in which case the underlying table of the visualization can be targeted).
  • the user is able to provide the existing visualization and raw data, and the technology automatically synthesizes a multi-operator data transformation pipeline to transform the raw data into the format of the existing visualization.
  • Figure 1 illustrates an overview of an example system 100 for synthesizing pipelines using a by-target analysis of data elements.
  • the system 100 may include a plurality of client devices 102-106 and a server device 110, connected by a network 108, which may be the Internet, a wide-area network (WAN), a local-area network (LAN), or another type of suitable network.
  • One or more of the client devices 102-106 may include a data transformation application 114.
  • the data transformation application 114 may perform some or all of the operations described herein. For example, the operations may be executed locally on a client device.
  • the server device 110 may also include a pipeline synthesis application 114 or pipeline synthesizer 112.
  • the pipeline synthesizer 112 synthesizes the multi-operator pipelines discussed herein. In some examples, the pipeline synthesizer 112 may perform some or all of the data transformation operations as well.
  • the operations discussed herein may be performed by the data transformation application 114 and/or the pipeline synthesizer 112 such that processing operations may be shared between a client device 102 and a server device 110.
  • the pipeline synthesis application 114 and/or pipeline synthesizer 112 may be components or portions of programming/code incorporated into larger applications, such as various productivity or data analysis programs (e.g., the POWERBI application or Microsoft EXCEL).
  • the client devices 102-106 may be a source of raw data that is to be transformed.
  • the client devices 102-106 may provide the raw data to the server device 110 if the pipeline synthesizer 112 is also to perform data transformation functions.
  • in other examples, the raw data remains at the client devices 102-106, and the pipeline synthesizer 112 generates the multi-operator pipelines and transmits them to the client devices 102-106 to allow for local transformation of the raw data based on the received multi-operator data transformation pipelines.
  • Figure 2 illustrates an example of a pipeline-by-target system 200.
  • the example system 200 includes a plurality of different raw data in the form of source files or source data, which are identified by a first source data 202, second source data 204, and third source data 206.
  • the source data may include multiple different files that may have different formats and/or schemas.
  • the source data may be in the form of .csv or .json files.
  • the source data 202-206 is transformed by one of the corresponding pipelines 210-214 into target table 216 or a target visualization 218.
  • the first source data 202 (T_in) is prior data for which a prior pipeline 210 (L) was previously generated (potentially manually or in some other manner) to create the target table 216 and/or target visualization 218 (T_tgt).
  • the second source data 204 and the third source data 206 represent different data for which no pipelines have been previously created.
  • the present technology automatically synthesizes a set of synthesized pipelines 208 (L), which includes a first synthesized pipeline 212 and a second synthesized pipeline 214.
  • the set of synthesized pipelines 208 are generated based on a selection of a target table or target visualization.
  • the selected target(s) include the target table 216 and/or the target visualization 218. In other examples, however, the selected target may be any existing table or visualization.
  • the synthesized pipeline(s) 212, 214 are generated based on characteristics extracted from the target table 216 and/or target visualization 218, such as table properties and constraints.
  • Each of the synthesized pipelines 212, 214 includes multiple operators, which may include table-reshaping operators and/or string transformation operators.
  • the table reshaping operators may include operators such as a join operator, a union operator, a groupby operator, an agg operator, a pivot operator, an unpivot operator, or an explode operator.
  • the string transformation operators may include operators such as a split operator, a substring operator, a concatenate operator, casing operator, or an index operator.
  • a new table and/or visualization is generated that substantially matches the target table 216 and/or target visualization 218 (depending on whether the target table 216 and/or the target visualization 218 was selected as the target).
  • a large client may have data coming from numerous sources.
  • Some version of the desired pipeline has been built previously, such as a legacy script/pipeline 210 from the client’s internal IT that already produces a database table 216 or dashboard 218.
  • new chunks of data for subsequent time periods or new stores need to be brought on-board, which may have different formats/schema (e.g., JSON vs. CSV, pivot-table vs. relational, missing/extra columns, etc.), because the new data comes from different systems or channels.
  • multi-operator data transformation pipelines may be synthesized automatically in such a setting. For instance, users may point the system to a “target” that schematically demonstrates what the output should “look like.” The user may provide an indication of the target in a variety of manners. For example, the user may provide the target by providing the system a location and/or a copy of the target table and/or visualization.
  • a user may provide a particular type of a selection (e.g., right-click) on an existing database table and select the option to “append data to the table,” or right-click an existing visualization and select “create a dashboard like this,” to trigger the pipeline synthesis operations described herein. From the perspective of the system, a selection of an existing table or visualization may be detected.
  • a menu of options is presented, including an option to use the table and/or visualization as a target for pipeline synthesis.
  • a deep reinforcement-learning (“DRL”) based synthesis algorithm (which may be referred to herein as Auto-Pipeline-RL) may be utilized.
  • the DRL model or algorithm “learns” to synthesize pipelines using large collections of real pipelines.
  • the agent is rewarded when it successfully synthesizes a real pipeline by target, which is analogous to “self-play” in training game-playing agents such as AlphaGo and Atari.
  • the RL-based synthesis is able to learn to synthesize fairly quickly and, in some examples, may outperform Auto-Pipeline-Search.
  • New pipelines 212, 214 can be automatically synthesized if users can point the system to the new input files T_in and the target T_tgt (a database table or visualization dashboard) to schematically demonstrate what output from the desired pipeline should “look like.”
  • Figure 3 shows a simplified example of a data pipeline process 300 based on data authored by data scientists on GitHub for the Titanic challenge on Kaggle.
  • Titanic is a popular data science task, where the goal is to predict which passengers survived, using a combination of features (Gender, Fare-class, etc.) as shown in table 302.
  • a common data-pipeline in this case may perform a GroupBy operator on the “Gender” column to compute “Avg-Survived” by “Gender,” as shown in table 304.
  • a Join operation may then be performed on “Gender” to bring “Avg-Survived” back into the original input as an additional feature, as shown in target table 308.
  • the enhanced table 308 may then be fed into downstream ML to train models.
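  • the two-step pipeline just described can be sketched in pandas as follows; the rows are made up for illustration, and column names follow the figure:

```python
# A pandas sketch of the Titanic pipeline described above: GroupBy on
# "Gender" to compute "Avg-Survived" (table 304), then Join on "Gender"
# to add it back to the input (target table 308). Rows are illustrative.
import pandas as pd

titanic = pd.DataFrame({
    "Passenger": ["A", "B", "C", "D"],
    "Gender": ["F", "M", "F", "M"],
    "Survived": [1, 0, 1, 1],
})
avg = (titanic.groupby("Gender", as_index=False)["Survived"].mean()
       .rename(columns={"Survived": "Avg-Survived"}))  # GroupBy step
enhanced = titanic.merge(avg, on="Gender")             # Join step
print(enhanced)  # input columns plus the new "Avg-Survived" feature
```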
  • new data may then arrive for a different set of passengers, as shown in table 312. Without having access to the original pipeline that was used to create the target table 308, the user points to table 308 as the target table to demonstrate his/her desired output for by-target synthesis to synthesize the desired pipeline.
  • the desired synthesized pipeline can be uniquely determined by leveraging implicit constraints discovered from the target table 308.
  • implicit constraints may include functional dependencies (FDs) and/or key properties.
  • Two example constraints for the target table 308 are shown in data 310: a first constraint of Key-column: {“Passenger”} and a second constraint of FD: {“Gender” → “Avg-Survived”}.
  • When table 312 is used as the new input and table 308 is used as the target, the goal is to generate a synthesized pipeline 314 that follows the same set of transformations as the pipeline that produced target table 308; the new output table 316, produced using table 312 as input, should then naturally satisfy the same set of constraints generated from target table 308. For instance, if a column mapping is performed between the target table 308 and the new output table 316, it can be seen that the constraints discovered from these two tables, as shown in data 310 and data 318, have a direct one-to-one correspondence. The synthesis task is thus to recreate the implicit constraints of table 308 in a synthesized pipeline.
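  • a minimal sketch of this verification, assuming pandas, is shown below; the constraints are the ones shown in data 310, and the rows of the output table are illustrative:

```python
# Verify that the Key/FD constraints discovered from target table 308
# (data 310) also hold on a new output table; pandas assumed.
import pandas as pd

def satisfies(output: pd.DataFrame, keys, fds) -> bool:
    key_ok = all(output[k].is_unique for k in keys)          # key columns
    fd_ok = all((output.groupby(a)[b].nunique() <= 1).all()  # FDs a -> b
                for a, b in fds)
    return key_ok and fd_ok

out = pd.DataFrame({"Passenger": ["E", "F", "G"],
                    "Gender": ["F", "M", "F"],
                    "Avg-Survived": [1.0, 0.5, 1.0]})
print(satisfies(out, keys=["Passenger"], fds=[("Gender", "Avg-Survived")]))  # True
```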
  • the synthesized multi-operator data transformation pipelines may be generated by using a search-based algorithm (e.g., Auto-Pipeline- Search) and/or a deep reinforcement-learning (“DRL”) based synthesis algorithm (e.g., Auto-Pipeline-RL). Both algorithms are discussed below.
  • Figure 4A illustrates an example synthesis process 400 using a search graph.
  • the search graph for synthesis includes a start node 402 (i.e., an empty pipeline) and an end node 410 (i.e., a synthesized pipeline).
  • Each intermediate node (e.g., a node between the start node 402 and the end node 410) represents an intermediate state in the synthesis process, which corresponds to a “partial pipeline.”
  • Multiple intermediate nodes are provided at different depths (D) of the graph.
  • Each edge represents the act of adding one operator, which leads to the new pipeline with one more operator.
  • the partial pipeline can be extended by one additional “step” using some operator O from the set of operators, which may include table-reshaping operators and/or string transformation operators.
  • the table reshaping operators may include operators such as a join operator, a union operator, a groupby operator, an agg operator, a pivot operator, an unpivot operator, or an explode operator.
  • the string transformation operators may include operators such as a split operator, a substring operator, a concatenate operator, casing operator, or an index operator.
  • an intermediate node has a number of operators corresponding to its respective depth.
  • a partial pipeline for a node in first set of intermediate nodes 404 has one operator and may be referred to as a single-operator partial pipeline.
  • a partial pipeline for a node in the second set of intermediate nodes 406 has two operators and may be referred to as a double-operator partial pipeline.
  • Auto-Suggest learns from real data pipelines to predict the likelihood of using parameters p for each operator O given input tables T, denoted here as P_T(O(p)).
  • Auto-Suggest may be leveraged, treating these P_T(O(p)) as given, so that the system can better focus on the end-to-end synthesis problem, which may be formulated as probability-maximizing pipeline synthesis (PMPS).
  • Equation (1) states the goal is to find the most likely pipeline L, or the one whose joint probability of all single-step operator invocations is maximized.
  • Equations (2) and (3) state that when running the synthesized pipeline L on the given input T_in to get L(T_in), the FD/Key constraints discovered from T_tgt should also be satisfied on L(T_in).
  • Equation (4) states that columns should “map” from L(T_in) to T_tgt, with standard schema mapping.
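  • assembling the prose descriptions of Equations (1)-(4) above, a hedged reconstruction of the PMPS formulation (the notation is approximate, since the original equations are not reproduced in this text) is:

$$
\begin{aligned}
L^{*} = \arg\max_{L=(O_1(p_1),\ldots,O_k(p_k))} \; & \prod_{i=1}^{k} P\big(O_i(p_i)\big) && (1)\\
\text{s.t.}\quad & \mathrm{FD}\big(L(T_{\mathrm{in}})\big) \supseteq \mathrm{FD}\big(T_{\mathrm{tgt}}\big) && (2)\\
& \mathrm{Key}\big(L(T_{\mathrm{in}})\big) \supseteq \mathrm{Key}\big(T_{\mathrm{tgt}}\big) && (3)\\
& \text{columns of } L(T_{\mathrm{in}}) \text{ map onto columns of } T_{\mathrm{tgt}} && (4)
\end{aligned}
$$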
  • since each intermediate node (e.g., a node in node set 404, 406, or 408) corresponds to a partial pipeline, and each edge corresponds to the act of adding one operator, each node that is depth steps away from the start-node 402 naturally corresponds to a partial pipeline with depth operators/steps.
  • Algorithm 1 is only one example of performing such a search to generate pipelines, and other examples and algorithms may be utilized.
  • the variable candidates stores all “valid” pipelines satisfying the constraints in PMPS (Equations (2)-(4)) and is initialized as an empty set.
  • the variable S_depth corresponds to all pipelines with depth steps that are actively explored in one loop iteration; at line 3, a single placeholder empty-pipeline is initialized because it is the only 0-step pipeline, and it is still at the start-node 402 of the search graph.
  • all active pipelines are taken from the previous iteration with (depth−1) steps, denoted by S_depth−1, and each partial pipeline L ∈ S_depth−1 is “extended” using one additional operator O ∈ O, by invoking AddOneStep(L, O), which is shown at line 7.
  • AddOneStep() adds one additional step into partial pipelines by leveraging Auto-Suggest to find the most likely parameters for each operator, while VerifyCands() checks the PMPS constraints using standard FD/Key-discovery and column-mapping.
  • the last two sub-routines, GetPromisingTopK() and GetFinalTopK(), help ensure that promising parts of the graph are searched efficiently and that pipelines can be synthesized successfully.
  • within AddOneStep(), Auto-Suggest is leveraged, which considers the characteristics of intermediate tables in the partial pipeline L, to predict the best parameters for each operator.
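  • the control flow of Algorithm 1 as described above can be sketched as follows; this is a hedged sketch in which the four sub-routines are assumed available as Python callables matching the descriptions above (names transliterated to Python style):

```python
# A sketch of the search-based synthesis loop (Algorithm 1). add_one_step,
# verify_cands, get_promising_top_k, and get_final_top_k stand in for the
# sub-routines AddOneStep(), VerifyCands(), GetPromisingTopK(), and
# GetFinalTopK() described above.
def auto_pipeline_search(input_tables, target, operators, max_depth, k):
    candidates = []    # "valid" pipelines satisfying the PMPS constraints
    s_prev = [[]]      # S_0: the single placeholder empty pipeline
    for depth in range(1, max_depth + 1):
        s_depth = []
        for partial in s_prev:          # extend each (depth-1)-step pipeline
            for op in operators:
                # add one parameterized step, with parameters predicted from
                # the intermediate tables of the partial pipeline
                s_depth.extend(add_one_step(partial, op, input_tables))
        # keep pipelines whose output satisfies the Key/FD/column-mapping
        # constraints discovered from the target (Equations (2)-(4))
        candidates += verify_cands(s_depth, input_tables, target)
        # prune to the top-K most promising partial pipelines
        s_prev = get_promising_top_k(s_depth, target, k)
    return get_final_top_k(candidates, k)
```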
  • the following example illustrates the process by revisiting the Titanic example in Figure 3.
  • All possible operators O ∈ O to extend L are enumerated (e.g., all possible operators for intermediate nodes 404 in Figure 4A).
  • suppose the first pick of O is the GroupBy operator.
  • the Gender and Fare-Class columns can be seen as the most likely to be used for GroupBy (because these two columns have categorical values with low cardinality).
  • the single-operator predictors from Auto-Suggest are leveraged, in this case the GroupBy predictor, which may predict probabilities such as P(GroupBy(Gender)) and P(GroupBy(Fare-Class)) for the candidate grouping columns.
  • the VerifyCands() sub-routine takes as input a collection of pipelines S_depth (the set of synthesized pipelines with depth steps), and checks whether each L ∈ S_depth satisfies all constraints listed in Equations (2)-(4) for Key/FD/column-mapping, in relation to the target table T_tgt.
  • column-to-column mapping as shown in Figure 3 may be generated using a combination of signals from column-values, column-patterns, and column-names. This synthesized pipeline L thus satisfies the column-mapping constraint in Equation (4).
  • for FD/Key constraints, constraint discovery techniques are again applied to discover FD/Key constraints from both the target table T_tgt and the output table L(T_in) of a synthesized pipeline L, in order to see whether all FD/Key constraints from T_tgt can be satisfied by L. Examples of constraint discovery techniques are discussed in papers such as: (1) Matt Buranosky, Elmar Stellnberger, Emily Pfaff, David Diaz-Sanchez, and Cavin Ward-Caviness. 2018. FDTool: a Python application to mine for functional dependencies and candidate keys in tabular data.
  • GetPromisingTopK() evaluates all depth-step pipelines currently explored. It is used to find top-K promising candidates to prune down the search space. GetPromisingTopK() may not use the same strategy as GetFinalTopK() of simply maximizing P(L), because this may lead to pipelines that cannot satisfy the PMPS constraints (Equations (2)-(4)), resulting in infeasible solutions.
  • as a result, the search is performed significantly more efficiently, and computing resources are conserved in performing the search to generate the synthesized pipeline.
  • a learning-based synthesis may also be utilized, which follows substantially the same steps as Algorithm 1, except that the learning-based synthesis replaces the search-based heuristics in GetPromisingTopK() and GetFinalTopK() with deep reinforcement learning (DRL) models.
  • the pipeline synthesis problem may bear a resemblance to game-playing machine learning systems such as AlphaGo (David Silver, Julian Schrittwieser, Karen Simonyan, Ioannis Antonoglou, Aja Huang, Arthur Guez, Thomas Hubert, Lucas Baker, Matthew Lai, Adrian Bolton, et al. 2017. Mastering the game of Go without human knowledge. Nature 550, 7676 (2017), 354-359) and Atari (Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Alex Graves, Ioannis Antonoglou, Daan Wierstra, and Martin Riedmiller. 2013. Playing Atari with deep reinforcement learning. arXiv preprint arXiv:1312.5602 (2013)).
  • agents need to take into account game “states” they are in (e.g., visual representations of game screens in Atari games or board states in the Go game), in order to produce suitable “actions” (e.g., pressing up/down/left/right/fire buttons in Atari, or placing a stone on the board in Go) that are estimated to have the highest “value” (producing the highest likelihood of winning).
  • the problem may have a similar structure. Specifically, as illustrated in Figure 4A, at a given “state” in the search graph 400 (representing a partial pipeline L), a decision of suitable next “actions” to take needs to be made, e.g., among all possible ways to extend L using different operators/parameters, which ones have the highest estimated “value” (providing the best chance to synthesize successfully).
  • an optimized synthesis “policy” may also be learned via “self-synthesis.” For instance, the pipeline synthesis may feed a reinforcement learning (RL) agent with large numbers of real data pipelines, asking the agent to synthesize pipelines by itself and rewarding it when successful.
  • state transition in the problem is deterministic because adding O_a(p_a) to pipeline L(s) uniquely determines a new pipeline.
  • a Deep Q-Network (DQN) may be used to directly learn the value-function of each state s, denoted as Q(s), which estimates the “value” of a state s, or how “promising” s is in terms of successful synthesis.
  • a representation is needed that abstracts away the specifics of each pipeline (e.g., which table column is used) and instead encodes generic information important to by-target synthesis that is applicable to all pipelines.
  • Figure 5 shows a representation 500 used to encode the state of L_T, including FD/Key/operators/column-mapping, etc. The description starts with the FD representation, encoded using the matrix 502 at the lower-left corner (other types of information are encoded similarly and are described later).
  • FDs are discovered from the target table T_tgt, and the goal is to synthesize pipelines matching all these FDs.
  • the FDs discovered from T_tgt are arranged as rows in the matrix 502, and the FDs satisfied by L_T are encoded using the right-most column (marked with T), where a “0” entry indicates that the corresponding FD has not been satisfied by L_T yet, while a “1” indicates that the FD has been satisfied.
  • Columns to the left correspond to FDs of pipelines from previous time-steps, with T−1, T−2 steps/operators, etc., up to a fixed number of historical frames.
  • the matrix 502 explicitly models historical information (i.e., which FDs were satisfied by previous partial pipelines), so that with convolution filters 508, the model can directly “learn” whether adding an operator at the T-th step makes “progress” in satisfying more FDs.
  • Figure 5 shows a convolutional filter that may be learned, which has a “-1” and a “+1”. This filter allows a check into whether an increased number of FDs are satisfied from time (T−1) to T. In this example, FD-2 is not satisfied at the (T−1) step but is now satisfied at the T step, as marked by the “0” and the “1” in the matrix 502.
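  • the effect of such a [-1, +1] filter can be seen in a small worked example:

```python
# A worked example of the [-1, +1] progress filter described above, applied
# to rows of the FD matrix: the output is +1 exactly when an FD flips from
# unsatisfied at step T-1 to satisfied at step T.
import numpy as np

fd_matrix = np.array([[1, 1],   # FD-1: satisfied at both T-1 and T
                      [0, 1]])  # FD-2: newly satisfied at T (as in matrix 502)
progress = fd_matrix @ np.array([-1, 1])  # convolve each row with [-1, +1]
print(progress)  # [0 1]: one additional FD became satisfied at step T
```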
  • in computer vision, convolutional filters are stacked together to learn visual features (e.g., circles versus lines). Similar conv-filters are applied in the synthesis problem, which interestingly learn local features like FD-progress instead.
  • Another benefit of this representation and a convolutional architecture is the flexibility in representing pipeline tasks with varying numbers of constraints (e.g., varying numbers of FDs/Keys, etc.) because the user can set the number of rows in the matrix as the maximum number of FDs across all pipelines and conveniently “pad” rows not used for a specific pipeline as “0”s (which leads to 0 regardless of what convolutional filters are applied).
  • In addition to FDs, other types of information (e.g., Key constraints, operator probabilities (Op), column-mapping) can be modeled similarly using the same matrix representation and convolutional filter architecture, as shown in the top part of Figure 5.
  • a second matrix 504 for operator probabilities and a corresponding convolutional filter 510 may also (or alternatively) be used.
  • a third matrix 506 for key constraints and a corresponding convolutional filter 512 may also (or alternatively) be used.
  • These representations may then be fed into pooling layers 514, 516, 518 and an MLP layer 520 before being concatenated and passed into additional layers to produce a final function-approximation of Q(s) 522.
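  • a minimal sketch of such a function approximator is shown below, assuming PyTorch; the matrix sizes, channel counts, and layer widths are illustrative assumptions, not values from the disclosure:

```python
# A hedged PyTorch sketch of the Q(s) architecture described above: one small
# convolutional branch per state matrix (FD, operator-probability, Key), each
# pooled and concatenated before an MLP produces the scalar Q(s). All
# dimensions below are assumptions for illustration only.
import torch
import torch.nn as nn

MAX_FDS, NUM_OPS, MAX_KEYS, HISTORY = 32, 12, 8, 4  # assumed sizes

class QNetwork(nn.Module):
    def __init__(self):
        super().__init__()
        def branch():
            return nn.Sequential(
                nn.Conv2d(1, 8, kernel_size=(1, 2)),  # can learn [-1, +1]-style progress filters
                nn.ReLU(),
                nn.AdaptiveMaxPool2d((1, 1)),         # pooling layer
                nn.Flatten(),
            )
        self.fd_branch, self.op_branch, self.key_branch = branch(), branch(), branch()
        self.mlp = nn.Sequential(nn.Linear(24, 32), nn.ReLU(), nn.Linear(32, 1))

    def forward(self, fd, op, key):
        # each input: (batch, 1, rows, HISTORY + 1) matrices, padded with 0s
        z = torch.cat([self.fd_branch(fd), self.op_branch(op),
                       self.key_branch(key)], dim=1)
        return self.mlp(z)  # scalar Q(s): how "promising" the state is

q = QNetwork()(torch.zeros(1, 1, MAX_FDS, HISTORY + 1),
               torch.zeros(1, 1, NUM_OPS, HISTORY + 1),
               torch.zeros(1, 1, MAX_KEYS, HISTORY + 1))
```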
  • the Q(s) model was initialized with random weights.
  • Experience replay may be used (for example, as described in Long-Ji Lin. 1993. Reinforcement learning for robots using neural networks. Technical Report, Carnegie Mellon University, Pittsburgh, PA, School of Computer Science), which records all (s, Q(s)) pairs in an internal memory M and “replays” events sampled from M to update the model. This is advantageous because of its data efficiency and the fact that events sampled over many episodes have weak temporal correlations.
  • Q(s) is trained iteratively in experience replay. In each iteration, the Q(s) from the previous iteration is used to play self-synthesis and collect a fixed number n of (s, Q(s)) events into the memory M. Events are randomly sampled from M to update the weights of Q(s), and the new Q'(s) may then be used to play the next round of self-synthesis.
  • using n = 500, the model was found to converge quickly within 20 iterations. A clear benefit of using RL over standard supervised-learning (“SL”) is also observed: RL learns from positive/negative examples tailored to the current policy, which tends to be more informative than SL, which learns from fixed distributions.
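  • the iterative training loop can be sketched as follows; this is a hedged sketch in which self_synthesize and update_weights are hypothetical stand-ins for playing by-target synthesis with the current model and for the gradient update, respectively:

```python
# A sketch of iterative training with experience replay as described above:
# collect n (s, Q(s)) events per iteration by self-synthesis, then update the
# model on events sampled from the replay memory M.
import random

def train_with_replay(q_model, training_pipelines, iterations=20,
                      n_events=500, batch_size=64):
    memory = []  # replay memory M of (s, Q(s)) events
    for _ in range(iterations):
        # self-synthesis: use the current Q(s) to synthesize pipelines
        # by-target, collecting a fixed number of (s, Q(s)) events into M
        events = []
        while len(events) < n_events:
            events += self_synthesize(q_model, random.choice(training_pipelines))
        memory += events
        # replay: sample weakly-correlated events from M to update weights,
        # yielding the Q'(s) used for the next round of self-synthesis
        q_model = update_weights(q_model, random.sample(memory, batch_size))
    return q_model
```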
  • Experiments were performed using the present technology to determine success rates and efficiencies of different pipeline synthesis algorithms. All experiments were performed on a Linux VM from a commercial cloud, with 16 virtual CPUs and 64 GB of memory. Variants of the present technology (referred to as Auto-Pipeline) have been implemented in Python 3.6.9.
  • the first benchmark, referred to as the GitHub benchmark, consists of real data pipelines authored by developers and data scientists harvested at scale from GitHub repositories. Specifically, Jupyter notebooks were crawled from GitHub repositories and replayed programmatically on corresponding data sets (from GitHub, Kaggle, and other sources) to reconstruct the pipelines authored by experts, in a manner similar to Auto-Suggest. Pipelines that are likely duplicates (e.g., copied/forked from other pipelines) and ones that are trivially small (e.g., input tables with fewer than ten rows) were filtered out. These human-authored pipelines serve as the ground truth for by-target synthesis.
  • the pipelines are grouped based on pipeline lengths, defined as the number of steps in a pipeline.
  • Results for the present technology are reported from three variants: the search-based Auto-Pipeline-Search, the supervised-learning-based Auto-Pipeline-SL, and the reinforcement-learning-based Auto-Pipeline-RL.
  • the evaluation metrics used included accuracy, mean reciprocal rank (MRR), and latency.
  • accuracy is a measure of the fraction of pipelines that can be successfully synthesized (e.g., the number of successfully synthesized pipelines divided by the total number of test pipelines).
  • MRR is a standard metric that measures the quality of ranking.
  • a synthesis algorithm returns a ranked list of K candidate pipelines for each test case, ideally with the correct pipeline ranked high (at top-1).
  • the reciprocal-rank in this case is defined as 1/rank, where rank is the rank-position of the first correct pipeline in the candidates (if no correct pipeline is found, then the reciprocal-rank is 0).
  • the Mean Reciprocal Rank is the mean reciprocal rank over all pipelines. It is noted that MRR is in the range of [0, 1], with one being perfect (all desired pipelines ranked at top-1).
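  • a small worked example of the MRR computation just described:

```python
# MRR as defined above: reciprocal-rank is 1/rank of the first correct
# pipeline per test case (0 if none is found), averaged over all cases.
def mean_reciprocal_rank(ranks):
    # ranks: rank-position of the first correct pipeline per test case,
    # or None when no correct pipeline appears among the candidates
    return sum(0.0 if r is None else 1.0 / r for r in ranks) / len(ranks)

print(mean_reciprocal_rank([1, 2, None, 1]))  # (1 + 0.5 + 0 + 1) / 4 = 0.625
```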
  • Figures 6A and 6B show an overall comparison on the GitHub benchmark and the Commercial benchmark, respectively, measured using accuracy, MRR, and latency. Average latency is reported on successfully synthesized cases only because some baselines would fail to synthesize after searching for hours.
  • the present technology can consistently synthesize 60-70% of pipelines within 10-20 seconds across the two benchmarks, which is substantially more efficient and effective than other methods. While the search-based Auto-Pipeline-Search is already effective, Auto-Pipeline-RL is slightly better in terms of accuracy. The advantage of Auto-Pipeline-RL over Auto-Pipeline-Search is more pronounced in terms of MRR, which is expected, as learning-based methods are better at understanding the nuance in fine-grained ranking decisions than the coarse-grained optimization objective in the search-based variant (Equation (1)).
  • Figure 7A illustrates an example method 700 for synthesizing at least one multi-operator data transformation pipeline.
  • raw data for transformation is accessed and/or received.
  • a client device may send the data to a server device for processing.
  • the client device and/or the server device may access data from a database or other data store on the device or accessible by the device.
  • the raw data may include multiple input tables that may have different formats.
  • a selection of a target table or a target visualization may be received.
  • the target table or target visualization is for data other than the raw data.
  • the selection may be received via a variety of methods. For example, a location and/or a copy of the target table and/or visualization may be received.
  • a particular type of selection (e.g., right-click) on an existing database table or visualization may be detected, and a plurality of menu options may be displayed in response, including options to use the table or visualization as a target.
  • An example option may be “append data to the table” or “create a dashboard like this.”
  • a selection of the option may be received as the selection of the target table or target visualization.
  • table properties and target constraints are extracted or determined for the selected target table or visualization.
  • the table properties may include properties such as a schema, and the target constraints may include the types of constraints discussed above, such as a key-column constraint and/or a functional-dependency constraint.
  • one or more multi-operator data transformation pipelines are generated for transforming the raw data to a generated table or generated visualization.
  • the pipelines include at least two data transformation operators.
  • the operators may include, for example, table-reshaping operators and/or string transformation operators.
  • the table-reshaping operators may include at least one of a join operator, a union operator, a groupby operator, an agg operator, a pivot operator, an unpivot operator, or an explode operator.
  • the string transformation operators include at least one of a split operator, a substring operator, a concatenate operator, casing operator, or an index operator.
  • the multi-operator data transformation pipelines may be generated using any of the method and processes discussed above, such as the search-based and/or learning based processes.
  • the top-ranked pipelines generated in operation 708 may be displayed. While the operation of “displayed” is used herein, it should be understood that a server sending data for display at a client device may also be considered “displaying.” For example, in operation 708 multiple pipelines may be generated, and the top-ranked pipelines, such as the top two, may be displayed such that a user may select or inspect the generated pipelines.
  • the generated pipelines may include a first pipeline and a second pipeline. The operators of the first pipeline and the operators of the second pipeline may be displayed and, in some examples, may be displayed concurrently. The operators may be displayed as selectable visual indicators.
  • a Join operator may be displayed as a selectable indicator or icon and a GroupBy operator may be displayed as another selectable indicator or icon.
  • the data transformation step of the operator may be displayed.
  • a user may be able to see how the particular operator transforms the raw data accessed in operation 702.
  • the user can step through the pipeline to understand how each pipeline transforms the data in a step-by-step manner. Based on that review, a user may select (and the system thus receives a selection of) a pipeline for transforming the raw data.
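  • a minimal sketch of such step-by-step inspection is shown below; each operator is assumed to be a callable taking and returning a table (e.g., a pandas DataFrame):

```python
# Step through a synthesized pipeline one operator at a time, keeping every
# intermediate table so a user can inspect how each step transforms the data.
def step_through(pipeline, raw_table):
    tables = [raw_table]
    for op in pipeline:
        tables.append(op(tables[-1]))  # apply one operator to the latest table
    return tables  # [raw, after step 1, after step 2, ...]
```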
  • the raw data is transformed with a pipeline generated in operation 708.
  • the pipeline used to transform the data may be the pipeline selected by the user.
  • the pipeline may be the top-ranked pipeline (e.g., the pipeline that produces an output table that most closely matches the target table or visualization).
  • the transformation of data may be performed by the same device that synthesized the pipeline or a different device.
  • a server device may synthesize the pipeline, deliver the pipeline to a client device, and the client device may then use the pipeline to transform the raw data to generate an output table or output visualization that matches, or substantially matches, the target table or target visualization.
  • Figure 7B illustrates another example method 720 for synthesizing at least one multi-operator data transformation pipeline.
  • the method 720 may be performed as part of operation 708 of method 700 depicted in Figure 7A.
  • single-operator partial pipelines are generated.
  • the single-operator partial pipelines may correspond to the first depth of nodes in the search graph depicted in Figure 4A.
  • Each of the single-operator partial pipelines includes one operator.
  • for each single-operator partial pipeline, a likelihood probability and constraint-matching criteria are determined.
  • the likelihood probability may be the operator probability described above and the constraint-matching criteria is based on the target constraints.
  • the constraint-matching criteria may include metrics or data indicating whether an instantiated output table, as produced from the single-operator partial pipeline transforming the raw data, satisfies one or more of the target constraints.
  • a subset of the single-operator partial pipelines may be selected.
  • the subset selected may be the top ranking single-operator partial pipelines.
  • a sub-routine such as GetPromisingTopK() may be performed to identify a subset of K single-operator partial pipelines, as discussed above.
  • machine-learning techniques or models, such as a deep reinforcement learning (DRL) model, may also be utilized to identify the subset of single-operator partial pipelines.
  • double-operator partial pipelines are generated from the subset of the single-operator partial pipelines selected in operation 726.
  • the double-operator partial pipelines may correspond to the second depth of intermediate nodes in the search graph depicted in Figure 4A.
  • Each of the double-operator partial pipelines includes two operators.
  • for each double-operator partial pipeline, a likelihood probability and constraint-matching criteria are determined.
  • the likelihood probability may be the operator probability described above and the constraint-matching criteria is based on the target constraints.
  • the constraint-matching criteria may include metrics or data indicating whether an instantiated output table, as produced from the double-operator partial pipeline transforming the raw data, satisfies one or more of the target constraints.
  • a subset of the double-operator partial pipelines may be selected.
  • the subset selected may be the top-ranking double-operator partial pipelines.
  • a sub-routine such as GetPromisingTopK() may be performed to identify a subset of K double-operator partial pipelines, as discussed above.
  • machine-learning techniques or models, such as a deep reinforcement learning (DRL) model, may also be utilized to identify the subset of double-operator partial pipelines.
  • the one or more multi-operator data transformation pipelines ultimately synthesized in operation 708 of method 700 in Figure 7A are then based on the selected subset of the double-operator pipelines.
  • one or more of the selected double-operator pipelines may be used as the final pipelines.
  • in other examples, additional depths of multi-operator partial pipelines may be generated and evaluated in a similar manner before the final pipelines are selected.
  • the final multi-operator data transformation pipelines synthesized in operation 708 may be generated by identifying the top-ranking pipelines. For example, a GetFinalTopK() sub-routine or similar process may be performed, as discussed above. As also discussed above, machine-learning techniques or models, such as a deep reinforcement learning (DRL) model, may also be utilized to identify the top set of pipelines.
  • the operations of the methods described above may be performed by components of the systems described above.
  • the operations may be performed by a client device and/or a server device.
  • Figures 8, 9A, 9B and the associated descriptions provide a discussion of a variety of operating environments in which aspects of the disclosure may be practiced.
  • the devices and systems illustrated and discussed with respect to Figures 8, 9A, and 9B are for purposes of example and illustration and are not limiting of the vast number of computing device configurations that may be utilized for practicing aspects of the disclosure described herein.
  • Figure 8 is a block diagram illustrating physical components (e.g., hardware) of a computing device 800 with which aspects of the disclosure may be practiced.
  • the computing device 800 may illustrate components of a server device and/or a client device.
  • the computing device components described below may be suitable for the computing devices and systems described above.
  • the computing device 800 may include at least one processing unit 802 and a system memory 804.
  • the system memory 804 may comprise, but is not limited to, volatile storage (e.g., random access memory), non-volatile storage (e.g., read-only memory), flash memory, or any combination of such memories.
  • the system memory 804 may include an operating system 805 and one or more program modules 806 suitable for running software application 820, such as one or more virtual machines and/or one or more components supported by the systems described herein.
  • the operating system 805, for example, may be suitable for controlling the operation of the computing device 800.
  • embodiments of the disclosure may be practiced in conjunction with a graphics library, other operating systems, or any other application program and are not limited to any particular application or system.
  • This basic configuration is illustrated in Figure 8 by those components within a dashed line 808.
  • the computing device 800 may have additional features or functionality.
  • the computing device 800 may also include additional data storage devices (removable and/or non-removable) such as, for example, solid-state drives, magnetic disks, optical disks, or tape. Such additional storage is illustrated in Figure 8 by a removable storage device 809.
  • program modules 806 may perform processes including, but not limited to, the aspects described herein.
  • Other program modules that may be used in accordance with aspects of the present disclosure may include virtual machines, hypervisors, and different types of applications such as electronic mail and contacts applications, word processing applications, spreadsheet applications, database applications, slide presentation applications, drawing or computer-aided application programs, etc.
  • embodiments, or portions of embodiments, of the disclosure may be practiced in an electrical circuit comprising discrete electronic elements, packaged or integrated electronic chips containing logic gates, a circuit utilizing a microprocessor, or on a single chip containing electronic elements or microprocessors.
  • embodiments of the disclosure may be practiced via a system-on-a-chip (SOC) where each or many of the components illustrated in Figure 8 may be integrated onto a single integrated circuit.
  • Such an SOC device may include one or more processing units, graphics units, communications units, system virtualization units, and various application functionality, all of which are integrated (or “burned”) onto the chip substrate as a single integrated circuit.
  • the functionality described herein, with respect to the client's capability to switch protocols, may be operated via application-specific logic integrated with other components of the computing device 800 on the single integrated circuit (chip).
  • Embodiments of the disclosure may also be practiced using other technologies capable of performing logical operations such as, for example, AND, OR, and NOT, including but not limited to mechanical, optical, fluidic, and quantum technologies.
  • embodiments of the disclosure may be practiced within a general-purpose computer or in any other circuits or systems.
  • the computing device 800 may also have one or more input device(s) 812 such as a keyboard, a mouse, a pen, a sound or voice input device, a touch or swipe input device, etc.
  • output device(s) 814, such as a display, speakers, a printer, etc., may also be included.
  • the aforementioned devices are examples and others may be used.
  • the computing device 800 may include one or more communication connections 816 allowing communications with other computing devices 850. Examples of suitable communication connections 816 include, but are not limited to, radio frequency (RF) transmitter, receiver, and/or transceiver circuitry; universal serial bus (USB), parallel, and/or serial ports.
  • Computer readable media may include computer storage media.
  • Computer storage media may include volatile and nonvolatile, removable and non-removable media implemented in any method or technology for storage of information, such as computer readable instructions, data structures, or program modules.
  • Computer storage media may include RAM, ROM, electrically erasable read-only memory (EEPROM), flash memory or other memory technology, CD-ROM, digital versatile disks (DVD) or other optical storage, magnetic cassettes, magnetic tape, magnetic disk storage or other magnetic storage devices, or any other article of manufacture which can be used to store information and which can be accessed by the computing device 800. Any such computer storage media may be part of the computing device 800.
  • Computer storage media does not include a carrier wave or other propagated or modulated data signal.
  • Communication media may be embodied by computer readable instructions, data structures, program modules, or other data in a modulated data signal, such as a carrier wave or other transport mechanism, and includes any information delivery media.
  • the term “modulated data signal” may describe a signal that has one or more characteristics set or changed in such a manner as to encode information in the signal.
  • communication media may include wired media such as a wired network or direct-wired connection, and wireless media such as acoustic, radio frequency (RF), infrared, and other wireless media.
  • the aspects and functionalities described herein may operate over distributed systems (e.g., such as the system 100 described in Figure 1), where application functionality, memory, data storage and retrieval and various processing functions may be operated remotely from each other over a distributed computing network, such as the Internet or an intranet.
  • User interfaces and information of various types may be displayed via on-board computing device displays or via remote display units associated with such computing devices.
  • user interfaces and information of various types may be displayed and interacted with on a wall surface onto which user interfaces and information of various types are projected.
  • Figures 9A and 9B below include an example computing device associated with a client (e.g., a computing device associated with a tenant administrator or other user), for example, that may be utilized to execute a locally installed application associated with the system 106 or run a web browser through which a web application associated with the system 106 is accessible to send requests to the service and/or request status updates, among other functionalities.
  • Figures 9A and 9B illustrate a mobile computing device 900, for example, a mobile telephone, a smart phone, a wearable computer (such as a smart watch), a tablet computer, a laptop computer, and the like, with which aspects of the disclosure may be practiced.
  • the client device may be a mobile computing device.
  • the mobile computing device 900 is a handheld computer having both input elements and output elements.
  • the mobile computing device 900 typically includes a display 905 and one or more input buttons 910 that allow the user to enter information into the mobile computing device 900.
  • the display 905 of the mobile computing device 900 may also function as an input device (e.g., a touch screen display).
  • a side input element 915 allows additional user input.
  • the side input element 915 may be a rotary switch, a button, or any other type of manual input element.
  • an on-board camera 930 allows further user input in the form of image data captured using the camera 930.
  • the mobile computing device 900 may incorporate more or fewer input elements.
  • the display 905 may not be a touch screen in some embodiments.
  • the mobile computing device 900 is a portable phone system, such as a cellular phone.
  • the mobile computing device 900 may also include a keypad 935.
  • the keypad 935 may be a physical keypad or a “soft” keypad generated on the touch screen display.
  • the output elements include the display 905 for showing a graphical user interface (GUI), a visual indicator 920 (e.g., a light emitting diode), and/or an audio transducer 925 (e.g., a speaker).
  • the mobile computing device 900 incorporates a vibration transducer for providing the user with tactile feedback.
  • the mobile computing device 900 incorporates input and/or output ports, such as an audio input (e.g., a microphone jack), an audio output (e.g., a headphone jack), and a video output (e.g., an HDMI port) for sending signals to or receiving signals from an external device (e.g., a peripheral device).
  • Figure 9B is a block diagram illustrating the architecture of one aspect of a mobile computing device. That is, the mobile computing device 900 can incorporate a system (e.g., an architecture) 902 to implement some aspects.
  • the system 902 is implemented as a “smart phone” capable of running one or more applications (e.g., browser, e-mail, calendaring, contact managers, messaging clients, games, and media clients/players).
  • the system 902 is integrated as a computing device, such as an integrated personal digital assistant (PDA) and wireless phone.
  • One or more application programs 966 may be loaded into the memory 962 and run on or in association with the operating system 964.
  • the application programs 966 may also include an application associated with the system 106.
  • the system 902 also includes a non-volatile storage area 968 within the memory 962.
  • the non-volatile storage area 968 may be used to store persistent information that should not be lost if the system 902 is powered down.
  • the application programs 966 may use and store information in the non-volatile storage area 968, such as e-mail or other messages used by an e-mail application, and the like.
  • a synchronization application (not shown) also resides on the system 902 and is programmed to interact with a corresponding synchronization application resident on a host computer to keep the information stored in the non-volatile storage area 968 synchronized with corresponding information stored at the host computer.
  • other applications may be loaded into the memory 962 and run on the mobile computing device 900 described herein.
  • the system 902 has a power supply 970, which may be implemented as one or more batteries.
  • the power supply 970 might further include an external power source, such as an AC adapter or a powered docking cradle that supplements or recharges the batteries.
  • the system 902 may also include a radio interface layer 972 that performs the function of transmitting and receiving radio frequency communications.
  • the radio interface layer 972 facilitates wireless connectivity between the system 902 and the “outside world,” via a communications carrier or service provider. Transmissions to and from the radio interface layer 972 are conducted under control of the operating system 964. In other words, communications received by the radio interface layer 972 may be disseminated to the application programs 966 via the operating system 964, and vice versa.
  • the visual indicator 920 described with reference to Figure 9A may be used to provide visual notifications, and/or an audio interface 974 may be used for producing audible notifications via the audio transducer 925 described with reference to Figure 9A.
  • These devices may be directly coupled to the power supply 970 so that when activated, they remain on for a duration dictated by the notification mechanism even though the processor(s) (e.g., processor 960 and/or special-purpose processor 961) and other components might shut down for conserving battery power.
  • the visual indicator 920 may be programmed to remain on indefinitely until the user takes action to indicate the powered-on status of the device.
  • the audio interface 974 is used to provide audible signals to and receive audible signals from the user.
  • the audio interface 974 may also be coupled to a microphone to receive audible input, such as to facilitate a telephone conversation.
• the microphone may also serve as an audio sensor to facilitate control of notifications.
• the system 902 may further include a video interface 976 that enables an operation of an on-board camera 930 to record still images, video streams, and the like.
  • a mobile computing device 900 implementing the system 902 may have additional features or functionality.
• the mobile computing device 900 may also include additional data storage devices (removable and/or non-removable), such as magnetic disks, optical disks, or tape.
  • Data/information generated or captured by the mobile computing device 900 and stored via the system 902 may be stored locally on the mobile computing device 900, as described above, or the data may be stored on any number of storage media that may be accessed by the device via the radio interface layer 972 or via a wired connection between the mobile computing device 900 and a separate computing device associated with the mobile computing device 900, for example, a computing device in a distributed computing network, such as the Internet.
• data/information may be accessed by the mobile computing device 900 via the radio interface layer 972 or via a distributed computing network.
• data/information may be readily transferred between computing devices for storage and use according to well-known data/information transfer and storage means, including electronic mail and collaborative data/information sharing systems.
  • the technology relates to a system for generating a multi-operator data transformation pipeline.
  • the system includes at least one processing unit; and system memory encoding instructions that, when executed by the at least one processing unit, cause the system to perform operations.
  • the operations comprise access raw data for transformation; receive a selection of a target table or target visualization, wherein the target table or target visualization is for data other than the raw data; extract table properties and target constraints; and based on the extracted table properties and target constraints, synthesize one or more multi-operator data transformation pipelines for transforming the raw data to a generated table or generated visualization.
  • the raw data includes multiple input tables.
• the operators in the one or more multi-operator data transformation pipelines include at least two or more table-reshaping operators or string transformation operators.
• the table-reshaping operators include at least one of a join operator, a union operator, a groupby operator, an agg operator, a pivot operator, an unpivot operator, or an explode operator; and the string transformation operators include at least one of a split operator, a substring operator, a concatenate operator, a casing operator, or an index operator. A small pipeline composing operators of both kinds is sketched below.
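As an informal illustration only (not code from the application), the following pandas sketch composes a string transformation operator (split) with table-reshaping operators (groupby and agg); the table, column names, and data are hypothetical:

```python
import pandas as pd

# Hypothetical raw input: city and state are packed into one string column.
raw = pd.DataFrame({
    "location": ["Seattle, WA", "Portland, OR", "Spokane, WA"],
    "sales": [100, 80, 60],
})

# Split operator (string transformation): derive city and state columns.
raw[["city", "state"]] = raw["location"].str.split(", ", expand=True)

# Groupby + agg operators (table reshaping): total sales per state.
by_state = raw.groupby("state", as_index=False).agg(total_sales=("sales", "sum"))
print(by_state)
```

A synthesized multi-operator pipeline of the kind described above would chain several such steps, for example split, then pivot, then join.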
• the target constraints include at least one of a key-column constraint or a functional-dependency constraint; both kinds are illustrated in the sketch below.
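For intuition (the helper names below are hypothetical, not interfaces from the application), a key-column constraint can be checked as column uniqueness, and a functional dependency lhs → rhs as each lhs value mapping to at most one rhs value:

```python
import pandas as pd

def satisfies_key_constraint(df: pd.DataFrame, column: str) -> bool:
    # A key column must uniquely identify every row of the table.
    return df[column].is_unique

def satisfies_functional_dependency(df: pd.DataFrame, lhs: str, rhs: str) -> bool:
    # lhs -> rhs holds when each lhs value maps to at most one rhs value.
    return bool((df.groupby(lhs)[rhs].nunique(dropna=False) <= 1).all())
```

Checks of this shape can be evaluated against the output of a candidate pipeline to measure how well it matches the extracted target constraints.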
• the one or more multi-operator data transformation pipelines includes a first multi-operator data transformation pipeline and a second multi-operator data transformation pipeline.
  • the operations further include concurrently display: the operators of the first multi-operator data transformation pipeline as selectable visual indicators; and the operators of the second multi-operator data transformation pipeline as selectable visual indicators.
• the operations further include generate single-operator partial pipelines; for each single-operator partial pipeline, determine a likelihood probability and a constraint-matching criteria, wherein the constraint-matching criteria is based on the target constraints; based on the determined likelihood probabilities and constraint-matching criteria for the single-operator partial pipelines, select a subset of the single-operator partial pipelines; generate, from the subset of the single-operator partial pipelines, double-operator partial pipelines; for each double-operator partial pipeline, determine a likelihood probability and a constraint-matching criteria, wherein the constraint-matching criteria is based on the target constraints; and based on the determined likelihood probabilities and constraint-matching criteria for the double-operator partial pipelines, select a subset of the double-operator partial pipelines, wherein the synthesized one or more multi-operator data transformation pipelines are based on the subset of the double-operator partial pipelines. A sketch of this staged search appears below.
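The generate-score-prune loop just described has the shape of a beam search. The sketch below is a reading of the summary, not the application's implementation; score_likelihood, score_constraints, and extend are hypothetical stand-ins for the likelihood model, the target-constraint matcher, and operator enumeration:

```python
from typing import Callable, List, Tuple

# A partial pipeline is modeled here as a tuple of operator names,
# e.g., ("split",) or ("split", "groupby").
Pipeline = Tuple[str, ...]

def synthesize(single_op_pipelines: List[Pipeline],
               score_likelihood: Callable[[Pipeline], float],
               score_constraints: Callable[[Pipeline], float],
               extend: Callable[[Pipeline], List[Pipeline]],
               beam_width: int = 4) -> List[Pipeline]:
    """Two-round beam search over partial pipelines."""

    def prune(candidates: List[Pipeline]) -> List[Pipeline]:
        # Rank by likelihood combined with constraint matching; keep the best.
        ranked = sorted(candidates,
                        key=lambda p: score_likelihood(p) * score_constraints(p),
                        reverse=True)
        return ranked[:beam_width]

    # Round 1: score and select a subset of single-operator partial pipelines.
    beam = prune(single_op_pipelines)

    # Round 2: extend survivors into double-operator partial pipelines,
    # then score and select again; final pipelines come from this subset.
    double_op = [longer for p in beam for longer in extend(p)]
    return prune(double_op)
```

Keeping only a fixed-width subset at each round bounds the search space while partial pipelines grow one operator at a time.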
• In another aspect, the technology relates to a method for generating a multi-operator data transformation pipeline.
  • the method includes accessing raw data for transformation; receiving a selection of a target table or target visualization, wherein the target table or target visualization is for data other than the raw data; extracting table properties and target constraints; and based on the extracted table properties and target constraints, synthesizing one or more multi-operator data transformation pipelines for transforming the raw data to a generated table or generated visualization.
  • the raw data includes multiple input tables.
• the operators in the one or more multi-operator data transformation pipelines include at least two or more table-reshaping operators or string transformation operators.
• the table-reshaping operators include at least one of a join operator, a union operator, a groupby operator, an agg operator, a pivot operator, an unpivot operator, or an explode operator; and the string transformation operators include at least one of a split operator, a substring operator, a concatenate operator, a casing operator, or an index operator.
  • the target constraints include at least one of a key-column constraint or a functional-dependency constraint.
• the one or more multi-operator data transformation pipelines includes a first multi-operator data transformation pipeline and a second multi-operator data transformation pipeline, and the method further includes concurrently displaying: the operators of the first multi-operator data transformation pipeline as selectable visual indicators; and the operators of the second multi-operator data transformation pipeline as selectable visual indicators.
• the method further includes generating single-operator partial pipelines; for each single-operator partial pipeline, determining a likelihood probability and a constraint-matching criteria, wherein the constraint-matching criteria is based on the target constraints; based on the determined likelihood probabilities and constraint-matching criteria for the single-operator partial pipelines, selecting a subset of the single-operator partial pipelines; generating, from the subset of the single-operator partial pipelines, double-operator partial pipelines; for each double-operator partial pipeline, determining a likelihood probability and a constraint-matching criteria, wherein the constraint-matching criteria is based on the target constraints; and based on the determined likelihood probabilities and constraint-matching criteria for the double-operator partial pipelines, selecting a subset of the double-operator partial pipelines; wherein the synthesized one or more multi-operator data transformation pipelines are based on the subset of the double-operator partial pipelines.
• In another aspect, the technology relates to computer storage media storing instructions that, when executed by a processor, cause the processor to perform operations.
• the operations include accessing raw data for transformation; receiving a selection of a target table or target visualization, wherein the target table or target visualization is for data other than the raw data; extracting table properties and target constraints; and based on the extracted table properties and target constraints, synthesizing one or more multi-operator data transformation pipelines for transforming the raw data to a generated table or generated visualization.
• the operators in the one or more multi-operator data transformation pipelines include at least two or more table-reshaping operators or string transformation operators;
  • the table-reshaping operators include at least one of a join operator, a union operator, a groupby operator, an agg operator, a pivot operator, an unpivot operator, or an explode operator;
• the string transformation operators include at least one of a split operator, a substring operator, a concatenate operator, a casing operator, or an index operator.
  • the target constraints include at least one of a key-column constraint or a functional-dependency constraint.
• selecting the subset of single-operator partial pipelines and the subset of double-operator partial pipelines includes using at least one reinforcement learning model; one possible shape of such a model is sketched below.
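The summary does not specify the reinforcement learning formulation. As one hedged possibility, the hand-written scoring in the beam-search sketch above could be replaced by a learned value network; the featurization of a partial pipeline and the training loop are omitted, and all names here are hypothetical:

```python
import torch
import torch.nn as nn

class PipelineScorer(nn.Module):
    """Hypothetical value network that scores a featurized partial pipeline."""

    def __init__(self, feature_dim: int = 32):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(feature_dim, 64),
            nn.ReLU(),
            nn.Linear(64, 1),
        )

    def forward(self, features: torch.Tensor) -> torch.Tensor:
        # Higher scores mean a partial pipeline is more promising to extend.
        return self.net(features).squeeze(-1)
```

Consistent with the deep reinforcement learning literature cited in the non-patent citations below (e.g., Mnih et al.; Silver et al.), such a scorer could be trained from a reward that reflects how well a completed pipeline satisfies the target constraints.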

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Software Systems (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Computation (AREA)
  • Medical Informatics (AREA)
  • Artificial Intelligence (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)

Abstract

Methods and systems for generating multi-operator data transformation pipelines. An example method includes: accessing raw data for transformation; receiving a selection of a target table or target visualization, wherein the target table or target visualization is for data other than the raw data; extracting table properties and target constraints; and, based on the extracted table properties and target constraints, synthesizing one or more multi-operator data transformation pipelines for transforming the raw data to a generated table or generated visualization.
PCT/US2022/025858 2021-05-14 2022-04-22 Synthesizing multi-operator data transformation pipelines WO2022240571A1 (fr)

Priority Applications (1)

Application Number Priority Date Filing Date Title
EP22722632.1A EP4338064A1 (fr) 2021-05-14 2022-04-22 Synthesizing multi-operator data transformation pipelines

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
US17/321,138 US11880344B2 (en) 2021-05-14 2021-05-14 Synthesizing multi-operator data transformation pipelines
US17/321,138 2021-05-14

Publications (1)

Publication Number Publication Date
WO2022240571A1 true WO2022240571A1 (fr) 2022-11-17

Family

ID=81598032

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/US2022/025858 WO2022240571A1 (fr) 2021-05-14 2022-04-22 Synthesizing multi-operator data transformation pipelines

Country Status (3)

Country Link
US (1) US11880344B2 (fr)
EP (1) EP4338064A1 (fr)
WO (1) WO2022240571A1 (fr)

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US11934801B2 (en) * 2021-12-07 2024-03-19 Microsoft Technology Licensing, Llc Multi-modal program inference
US11971900B2 (en) * 2022-02-04 2024-04-30 Bank Of America Corporation Rule-based data transformation using edge computing architecture

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20150081618A1 (en) * 2013-09-19 2015-03-19 Platfora, Inc. Systems and Methods for Interest-Driven Business Intelligence Systems Including Event-Oriented Data
US20150100542A1 (en) * 2013-10-03 2015-04-09 International Business Machines Corporation Automatic generation of an extract, transform, load (etl) job
US20180150528A1 (en) * 2016-11-27 2018-05-31 Amazon Technologies, Inc. Generating data transformation workflows
US20190130007A1 (en) * 2017-10-31 2019-05-02 International Business Machines Corporation Facilitating automatic extract, transform, load (etl) processing
EP3722968A1 (fr) * 2019-04-12 2020-10-14 Basf Se Système d'extraction de données

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US10459939B1 (en) * 2016-07-31 2019-10-29 Splunk Inc. Parallel coordinates chart visualization for machine data search and analysis system

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20150081618A1 (en) * 2013-09-19 2015-03-19 Platfora, Inc. Systems and Methods for Interest-Driven Business Intelligence Systems Including Event-Oriented Data
US20150100542A1 (en) * 2013-10-03 2015-04-09 International Business Machines Corporation Automatic generation of an extract, transform, load (etl) job
US20180150528A1 (en) * 2016-11-27 2018-05-31 Amazon Technologies, Inc. Generating data transformation workflows
US20190130007A1 (en) * 2017-10-31 2019-05-02 International Business Machines Corporation Facilitating automatic extract, transform, load (etl) processing
EP3722968A1 (fr) * 2019-04-12 2020-10-14 Basf Se Système d'extraction de données

Non-Patent Citations (5)

* Cited by examiner, † Cited by third party
Title
David Silver, Julian Schrittwieser, Karen Simonyan, Ioannis Antonoglou, Aja Huang, Arthur Guez, Thomas Hubert, Lucas Baker, Matthew Lai, Adrian Bolton et al.: "Mastering the game of Go without human knowledge", Nature, vol. 550, no. 7676, 2017, pages 354-359, XP055500016, DOI: 10.1038/nature24270
Erhard Rahm, Philip A. Bernstein: "A survey of approaches to automatic schema matching", VLDB Journal, vol. 10, no. 4, 2001, pages 334-350, XP058152794, DOI: 10.1007/s007780100057
Matt Buranosky, Elmar Stellnberger, Emily Pfaff, David Diaz-Sanchez, Cavin Ward-Caviness: "FDTool: a Python application to mine for functional dependencies and candidate keys in tabular data", F1000Research, vol. 7, 2018
Thorsten Papenbrock, Jens Ehrlich, Jannik Marten, Tommy Neubert, Jan-Peer Rudolph, Martin Schönberg, Jakob Zwiener, Felix Naumann: "Functional dependency discovery: an experimental evaluation of seven algorithms", Proceedings of the VLDB Endowment, vol. 8, no. 10, 2015, pages 1082-1093
Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Alex Graves, Ioannis Antonoglou, Daan Wierstra, Martin Riedmiller: "Playing Atari with deep reinforcement learning", arXiv preprint arXiv:1312.5602, 2013

Also Published As

Publication number Publication date
EP4338064A1 (fr) 2024-03-20
US20220365910A1 (en) 2022-11-17
US11880344B2 (en) 2024-01-23

Similar Documents

Publication Publication Date Title
US20230186094A1 (en) Probabilistic neural network architecture generation
US10007708B2 (en) System and method of providing visualization suggestions
CN110178151B (zh) Main task view
US11886457B2 (en) Automatic transformation of data by patterns
US7873356B2 (en) Search interface for mobile devices
CN102301358B (zh) Text disambiguation using social connections
WO2022240571A1 (fr) Synthesizing multi-operator data transformation pipelines
US9697016B2 (en) Search augmented menu and configuration for computer applications
US10997468B2 (en) Ensemble model for image recognition processing
EP3776375A1 (fr) Learning optimizer for shared cloud
CN110546619B (zh) Method, system, and medium for automatically determining whether a detected problem is a bug
WO2014182585A1 (fr) Recommending context based actions for data visualizations
WO2019129520A1 (fr) Systems and methods for combining data analyses
US20180081683A1 (en) Task assignment using machine learning and information retrieval
US10915522B2 (en) Learning user interests for recommendations in business intelligence interactions
US7840549B2 (en) Updating retrievability aids of information sets with search terms and folksonomy tags
US10229212B2 (en) Identifying Abandonment Using Gesture Movement
WO2023229737A1 (fr) Method and system for discovering templates for documents
CN1758251A (zh) Interaction of static and dynamic data sets
Djenouri et al. GPU-based swarm intelligence for Association Rule Mining in big databases
WO2021002981A1 (fr) Task modification and optimization
Welborn et al. Learning index selection with structured action spaces
WO2021050144A1 (fr) Site and service signals for driving automated custom system configuration
US10031965B2 (en) Data object classification using feature generation through crowdsourcing
CN112948357B (zh) Tuning system for the multi-model database OrientDB and construction method therefor

Legal Events

Date Code Title Description
121 EP: the EPO has been informed by WIPO that EP was designated in this application

Ref document number: 22722632

Country of ref document: EP

Kind code of ref document: A1

WWE Wipo information: entry into national phase

Ref document number: 2022722632

Country of ref document: EP

NENP Non-entry into the national phase

Ref country code: DE

ENP Entry into the national phase

Ref document number: 2022722632

Country of ref document: EP

Effective date: 20231214