CN116745783A - Handling of system characteristic drift in machine learning applications

Info

Publication number: CN116745783A
Authority: CN (China)
Prior art keywords: model, error, version, output, data
Legal status: Pending
Application number: CN202280011257.XA
Other languages: Chinese (zh)
Inventors: 奥瑞斯蒂斯·科斯塔基斯, 蒋启明, 姜博馨
Current assignee: Snowflake Inc
Original assignee: Snowflake Computing Inc
Application filed by Snowflake Computing Inc

Classifications

    • G06N 20/00 Machine learning
    • G06F 16/285 Clustering or classification of structured data
    • G06F 16/284 Relational databases
    • G06F 16/24 Querying of structured data
    • G06F 16/14 Details of searching files based on file metadata
    • G06F 16/164 File meta data generation

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • General Engineering & Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • General Physics & Mathematics (AREA)
  • Physics & Mathematics (AREA)
  • Databases & Information Systems (AREA)
  • Software Systems (AREA)
  • Evolutionary Computation (AREA)
  • Computing Systems (AREA)
  • Medical Informatics (AREA)
  • Mathematical Physics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Artificial Intelligence (AREA)
  • Computational Linguistics (AREA)
  • Human Computer Interaction (AREA)
  • Library & Information Science (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
  • Test And Diagnosis Of Digital Computers (AREA)

Abstract

Systems and methods for managing input and output errors of a machine learning (ML) model in a database system are described. A set of test queries is executed on a first version of the database system to generate first test data, where the first version of the system includes an ML model that generates an output corresponding to a function of the database system. An error model is trained on the first test data and on second test data generated from a previous version of the system; the error model determines an error associated with the ML model between the first version and the previous version of the system. The first version of the system is deployed together with the error model, which corrects the output or the input of the ML model until the deployed system produces enough data to retrain the ML model.

Description

Handling of system characteristic drift in machine learning applications
RELATED APPLICATIONS
This application claims the benefit under 35 U.S.C. § 119(e) of U.S. Patent Application Serial No. 17/154,928, filed January 21, 2021, the entire contents of which are incorporated herein by reference.
Technical Field
Aspects of the present disclosure relate to database systems, and more particularly, to the use of Machine Learning (ML) in database systems.
Background
Databases are widely used for data storage and access in computing applications. The database may include one or more tables that include or reference data that may be read, modified, or deleted using queries. The database may store small or very large data sets in a table. Database systems that include databases (e.g., storage resources) may also include computing resources that allow stored data to be queried by various users in an organization, or even used to serve public users, such as via a website or Application Programming Interface (API). Both computing resources and storage resources and their underlying architecture can play a significant role in achieving desirable database performance.
Database systems increasingly integrate ML models to perform functions such as query optimization, where the database systems find the best physical execution path for a query. There are various methods of using ML in database systems, including reinforcement learning, deep learning, dimension reduction, and topic modeling, among others.
Brief Description of Drawings
The described embodiments and their advantages may be best understood by referring to the following description taken in conjunction with the accompanying drawings. The drawings in no way limit any changes in form and detail that may be made to the described embodiments by one skilled in the art without departing from the spirit and scope of the described embodiments.
FIG. 1A is a block diagram illustrating an example database system according to some embodiments of the present disclosure.
FIG. 1B is a block diagram illustrating an example database system, according to some embodiments of the present disclosure.
Fig. 2A is a block diagram illustrating a logical implementation of a system for managing output drift of an ML model in an enterprise system, according to some embodiments of the disclosure.
Fig. 2B is a block diagram illustrating a logical implementation of a system in which output drift of an ML model has been managed, according to some embodiments of the present disclosure.
Fig. 3A is a block diagram illustrating a logical implementation of a system for managing input feature drift of an ML model in an enterprise system, in accordance with some embodiments of the present disclosure.
Fig. 3B is a block diagram illustrating a logical implementation of a system in which input feature drift of an ML model has been managed, according to some embodiments of the present disclosure.
Fig. 4 is a flow chart of a method for managing output drift of an ML model in an enterprise system according to some embodiments of the disclosure.
Fig. 5 is a flow chart of a method for managing input feature drift of an ML model in an enterprise system, in accordance with some embodiments of the present disclosure.
Fig. 6 is a block diagram of an example computing device that may perform one or more operations described herein, according to some embodiments of the disclosure.
Detailed Description
Database systems may implement many query processing subsystems (also referred to herein as "query processing components" or "components"), such as a query execution engine, a resource predictor (e.g., for determining the optimal amount of resources to run a given query), a query dispatcher (e.g., for assigning the tasks/micro-partitions of queries among computing clusters), a query optimizer, and the like. The database system may replace the functionality/heuristics of one or more of these components with an ML model. A typical workflow includes collecting a set of data, training an ML model to automate a query processing component, and deploying it into a production version of the database system, as sketched below. The ML model may be retrained and redeployed periodically, or the process of collecting data and training a new ML model to replace the previous one may be repeated whenever the ML model falls below a certain performance threshold. Thus, a new ML model is trained and/or evaluated using a mix of old and new data.
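A minimal sketch of this threshold-triggered workflow, under stated assumptions: the threshold value and the helper names (collect_fn, train_fn) are illustrative, not taken from the patent.

```python
# Illustrative sketch of the retrain-on-threshold workflow described above.
# PERFORMANCE_THRESHOLD, collect_fn, and train_fn are assumed names.
PERFORMANCE_THRESHOLD = 0.9

def maybe_retrain(ml_model, performance_score, collect_fn, train_fn):
    """When the deployed model falls below the threshold, collect fresh
    system data and train a replacement model; otherwise keep the old one."""
    if performance_score < PERFORMANCE_THRESHOLD:
        training_data = collect_fn()      # mix of old and new system data
        return train_fn(training_data)    # new model replaces the previous one
    return ml_model
```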
However, the assumption that past system data remains valid over time is often wrong. In fact, it rarely holds in scenarios where the system data corresponds to the operation of a query processing component (e.g., a query execution engine). Such query processing components are frequently updated, augmented, and/or repaired, and thus may "drift". In other words, the actual output or input of the component deviates from what the ML model that models the query processing component expects. This is because the ML model's expectations are based on a previous version of the system (i.e., the ML model was trained on data collected from executing a previous version of the system, in which the query processing component had not yet been modified).
If any such query processing component is modified (e.g., the operation of a query execution engine or query optimizer is changed), the assumptions of any ML model that models one of these components may no longer be valid. For example, modifications to the query optimizer may invalidate the assumptions of the ML model that simulates the query optimizer (or of the ML model that simulates a resource predictor component that depends on the query optimizer's output). In these cases, the ML model needs to be retrained in order to continue to accurately simulate the component it replaces. However, retraining the ML model may take a significant amount of time (e.g., days or weeks), during which the ML model may produce unreliable/inaccurate outputs.
The present disclosure addresses the above and other deficiencies by using a processing device to execute a set of test queries on a first version of a database system to generate first test data, wherein the first version of the system includes a Machine Learning (ML) model to generate an output corresponding to a function of the database system. The processing device may train an error model based on the first test data and the second test data generated by executing a set of test queries on a previous version of the system, the error model to determine an output error of the ML model between the first version and the previous version of the database system. The processing device may deploy a first version of a database system having an error model, and in response to the ML model generating a first output based on the received input, may adjust the first output of the ML model by the error model based on the input of the ML model and an output error of the ML model.
In other embodiments, the present disclosure may address the above and other deficiencies by using a processing device to execute a set of test queries on a first version of a database system to generate first test data, wherein the first version of the system includes a Machine Learning (ML) model to generate an output corresponding to a function of the database system. The processing device may train an error model based on the first test data and the second test data generated by executing a set of test queries on a previous version of the system, the error model to determine an input error of the ML model between the first version and the previous version of the database system. The processing device may deploy a first version of the database system with the error model, and may adjust an input to the ML model based on an input error of the ML model through the error model. The processing device may output the adjusted inputs to the ML model.
FIG. 1A is a block diagram illustrating a database system 100 according to one embodiment. Database system 100 includes a resource manager 102 that is accessible by a plurality of users 104, 106, and 108. The resource manager 102 may also be referred to herein as a database service manager. In some implementations, the resource manager 102 may support any number of users desiring to access data or services of the database system 100. The users 104 may include, for example, end users providing data storage and retrieval queries and requests, system administrators managing the systems and methods described herein, software applications interacting with databases, and other components/devices interacting with the resource manager 102.
The same reference numbers may be used in fig. 1A and other figures to identify the same elements. Letters following a reference numeral, such as "110A", indicate that the text specifically refers to an element having the particular reference numeral. Reference numerals without a subsequent letter, such as "110", in this text refer to any or all of the elements in the drawings that carry that reference numeral.
The resource manager 102 may provide various services and functions that support the operation of systems and components within the database system 100. The resource manager 102 may access stored metadata 110 associated with data stored throughout the database system 100. The resource manager 102 may use the metadata 110 to optimize user queries. In some embodiments, metadata 110 includes a summary of data stored in remote data storage 116 on storage platform 114 as well as data available from local caches (e.g., caches within one or more computing clusters 122 of execution platform 112). In addition, metadata 110 may include information about how the data is organized in the remote data storage devices and the local caches. Metadata 110 allows systems and services to determine whether a piece of data needs to be processed without loading or accessing the actual data from a remote data storage device.
Metadata 110 may be collected when changes are made to data stored in database system 100 using a data manipulation language (DML); such changes may be made by any DML statement. Examples of DML operations include, but are not limited to, selecting, updating, changing, merging, and inserting data into a table. As part of database system 100, files may be created and metadata 110 collected on a per-file and per-column basis, after which metadata 110 may be saved in a metadata store. Collection of metadata 110 may be performed during data ingestion, or it may be performed as a separate process after the data is ingested or loaded. In an embodiment, metadata 110 may include, for each file, a count of distinct values, a count of null values, and minimum and maximum values. In an embodiment, the metadata may further include string length information and ranges of characters in strings, as sketched below.
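A minimal sketch, under stated assumptions, of computing such per-column metadata for one ingested file; the ColumnMetadata structure and function name are illustrative, not from the patent.

```python
# Hedged sketch: per-column pruning metadata for one ingested file.
from dataclasses import dataclass
from typing import Any, List, Optional

@dataclass
class ColumnMetadata:
    distinct_count: int          # count of distinct (non-null) values
    null_count: int              # count of null values
    min_value: Optional[Any]     # minimum value in the file
    max_value: Optional[Any]     # maximum value in the file

def collect_column_metadata(values: List[Any]) -> ColumnMetadata:
    """Compute metadata for one column of one file during ingestion."""
    non_null = [v for v in values if v is not None]
    return ColumnMetadata(
        distinct_count=len(set(non_null)),
        null_count=len(values) - len(non_null),
        min_value=min(non_null) if non_null else None,
        max_value=max(non_null) if non_null else None,
    )
```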
The resource manager 102 is also in communication with an execution platform 112, which execution platform 112 provides a plurality of computing resources that perform various data storage and data retrieval operations, as discussed in more detail below. The execution platform 112 may include one or more computing clusters that may be logically organized into one or more virtual warehouses (referred to herein as "warehouses"). Each computing cluster may be dynamically assigned or suspended for a particular warehouse based on the query workload provided by the user 104 to the particular warehouse. The execution platform 112 communicates with one or more data storage devices 116 that are part of the storage platform 114. Although three data storage devices 116 are shown in FIG. 1A, execution platform 112 is capable of communicating with any number of data storage devices. In some embodiments, data storage device 116 is a cloud-based storage device located at one or more geographic locations. For example, the data storage device 116 may be part of a public cloud infrastructure or a private cloud infrastructure or any other manner of distributed storage system. The data storage device 116 may include a Hard Disk Drive (HDD), a Solid State Drive (SSD), a storage cluster, or any other data storage technology. In addition, the storage platform 114 may include a distributed file system (such as a Hadoop Distributed File System (HDFS)), an object storage system, and the like.
In some embodiments, the communication links between the resource manager 102 and the user 104, metadata 110, and execution platform 112 are implemented via one or more data communication networks, and may be assigned various tasks so that user requests may be optimized. Similarly, the communication links between execution platform 112 and data storage devices 116 in storage platform 114 are implemented via one or more data communication networks. These data communication networks may utilize any communication protocol and any type of communication medium. In some embodiments, the data communication network is a combination of two or more data communication networks (or subnetworks) coupled to each other. In alternative embodiments, these communication links are implemented using any type of communication medium and any communication protocol.
As shown in fig. 1A, data storage devices 116 are decoupled from the computing resources associated with execution platform 112. This architecture supports dynamic changes to database system 100 based on changing data storage/retrieval requirements, computing requirements, and the changing needs of the users and systems accessing database system 100. Support for dynamic changes allows database system 100 to scale quickly in response to changing demands on its systems and components. Decoupling the computing resources from the data storage devices supports storage of large amounts of data without requiring a correspondingly large amount of computing resources. Similarly, this decoupling of resources supports a significant increase in the computing resources used at a particular time without a corresponding increase in available data storage resources.
The resource manager 102, metadata 110, execution platform 112, and storage platform 114 are shown in FIG. 1A as separate components. However, each of the resource manager 102, metadata 110, execution platform 112, and storage platform 114 may be implemented as a distributed system (e.g., distributed across multiple systems/platforms at multiple geographic locations), or may be combined into one or more systems. In addition, each of the resource manager 102, the storage device for metadata 110, execution platform 112, and storage platform 114 may scale up or down (independently of each other) according to the changing requests received from users 104 and the changing needs of the database system 100. Thus, in the described embodiment, database system 100 is dynamic and supports periodic changes to meet current data processing requirements.
Each of the resource manager 102, execution platform 112, and storage platform 114 may comprise any suitable type of computing device or machine having one or more programmable processors, including, for example, server computers, storage servers, desktop computers, laptop computers, tablet computers, smartphones, and the like. Each of resource manager 102, execution platform 112, and storage platform 114 may comprise a single machine or may comprise multiple interconnected machines (e.g., multiple servers configured as a cluster). In addition, each of the resource manager 102, execution platform 112, and storage platform 114 may include hardware such as a processing device (e.g., a processor or central processing unit (CPU)), memory (e.g., random access memory (RAM)), storage devices (e.g., hard disk drives (HDDs), solid state drives (SSDs), etc.), and other hardware devices (e.g., sound cards, video cards, etc.). The storage devices may include persistent storage capable of storing data.
The execution platform 112 includes a plurality of computing clusters 122 that may share the computing or processing load of the database system 100. In one embodiment, when creating a warehouse or changing its configuration (whether the warehouse is running or suspended), a customer may control the number of active (i.e., running) clusters by specifying a range (e.g., specifying a minimum cluster count and a maximum cluster count). The customer may specify an exact number of active clusters, for example by setting the minimum cluster count equal to the maximum cluster count, so that the warehouse runs exactly that number of clusters. If the customer specifies a maximum cluster count that is greater than the minimum cluster count, the resource manager 102 may automatically manage the number of currently active clusters based on the workload, in order to meet throughput criteria while remaining cost effective. Thus, whenever the warehouse is running, at least the minimum cluster count and at most the maximum cluster count of clusters are active. The resource manager 102 may decide how many clusters are needed to handle the current workload given the specified performance criteria (in terms of memory load and concurrency level), as in the sketch below.
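A minimal sketch of the clamping rule just described; workload_demand stands in for the resource manager's workload-driven estimate and is an assumption, not a value named in the patent.

```python
def clusters_to_run(workload_demand: int, min_cluster_count: int,
                    max_cluster_count: int) -> int:
    """Keep the workload-driven cluster demand within the customer's range:
    at least min_cluster_count and at most max_cluster_count clusters run."""
    return max(min_cluster_count, min(workload_demand, max_cluster_count))

# Example: demand for 7 clusters with a configured range of [2, 5] runs 5.
assert clusters_to_run(7, 2, 5) == 5
```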
Fig. 1B illustrates the system 100 according to some embodiments of the present disclosure. As can be seen in fig. 1B, the system 100 may implement a master ML model 215 that may perform various tasks in place of components of the system 100. In the example of FIG. 1B, the master ML model 215 may be implemented within the resource manager 102, where it may be used to automate the functionality of any of various components, such as: a query execution engine, a resource predictor (e.g., for determining the optimal amount of resources to run a given query), a query dispatcher (e.g., for assigning the tasks/micro-partitions of queries among computing clusters), a cardinality predictor, or a query optimizer (which may, for example, determine a join order, or determine which table to place on the left/right side of a join operation). Although a single master ML model 215 is illustrated for ease of illustration and description, the functionality/heuristics of all of the above components and other components may be automated using one or more ML models. Although described as being implemented within the resource manager 102 for ease of illustration and description, it should be noted that the embodiments described herein may be implemented for any number of ML models that replace query processing components anywhere in the system 100 (e.g., on the execution platform 112).
The resource manager 102 may train the master ML model 215 using a set of training data generated by executing a set of training queries on the system 100 (including each of its components mentioned above). In some embodiments, the master ML model 215 may be trained outside of the resource manager 102. For example, the master ML model 215 may be trained on a separate computing device and uploaded to the resource manager 102. In another example, an automated training pipeline may be used to train the master ML model 215. The set of training queries for the master ML model 215 may be queries that are typically executed by customers, or queries built specifically for a particular kind of task. For example, if the master ML model 215 models a resource predictor component, the set of training queries may include queries relevant to that component. The set of training data may contain the features (i.e., feature vectors) of all training queries executed by the resource manager 102. The system 100 may also run a series of pre-release tests, such as regression tests, stress tests, performance tests, smoke tests, and the like, prior to deployment. These tests are typically performed on a new version of the system 100 before it is released to customers. Furthermore, the queries comprising these pre-release tests may be executed repeatedly across different versions of the system 100 and may cover a wide range of situations/scenarios. The resource manager 102 can tag the pre-release test queries that are most relevant to the master ML model 215 (in the examples of FIGS. 2A-3B, the master ML model 215 automates the resource predictor component, so the resource manager 102 can tag queries that affect predictions of the optimal amount of resources to run a query). During the pre-release tests, the resource manager 102 may collect all relevant result data from the execution of the tagged queries and store it in a memory (not shown) of the system 100, as in the sketch below. After the pre-release tests are complete, the resource manager may deploy the system 100 with the master ML model 215. The master ML model 215 may be a binary classification model, a regression model, or any other suitable ML model, depending on the component it models (e.g., a resource predictor or query dispatcher).
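A minimal sketch of the tagging-and-collection step under stated assumptions: execute_fn, is_relevant_to_model, and the dictionary keyed by query (assumed hashable, e.g., SQL text) are illustrative names, not from the patent.

```python
def run_prerelease_tests(test_queries, execute_fn, is_relevant_to_model):
    """Execute the pre-release test queries and keep the result data of the
    queries tagged as relevant to the master ML model."""
    tagged_results = {}
    for query in test_queries:
        result = execute_fn(query)          # regression/stress/perf/smoke runs
        if is_relevant_to_model(query):
            tagged_results[query] = result  # retained for later comparison
    return tagged_results
```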
However, if the system 100 (e.g., any of its components) changes, the assumptions of the master ML model 215 may no longer be valid. For example, if the query optimizer component of the resource manager 102 has been modified (e.g., a new type of join is introduced, or a new feature such as search optimization is introduced), then the master ML model 215 may need to be retrained, as such modifications may invalidate assumptions made by the master ML model 215 when predicting the required resources (the model having been trained on data derived from a previous version of the system 100, in which the query optimizer was not modified). In other examples, if the resource manager 102 changes the type of cloud virtual machine (VM) used to implement any query processing component, or modifies the functionality of the resource predictor component itself, the ability of the master ML model 215 to accurately predict resource requirements may be negatively impacted. The resource manager 102 may retrain the master ML model 215; however, the process may take a significant amount of time (e.g., days or weeks), during which the master ML model 215 may continue to produce unreliable/inaccurate outputs.
Embodiments of the present disclosure utilize pre-release test results to train an error model that identifies errors in the output of the master ML model 215 or errors in the input of the master ML model 215 caused by a new version of the system 100, and that may correct (adjust) the output or input of the master ML model until the master ML model 215 has been retrained. In some embodiments where the error model adjusts the output of the master model, the error model is trained to identify the magnitude of the error (drift) between the output of the master ML model 215 in the new version of the system 100 and the output of the master ML model 215 in the previous version. As discussed herein, the error model may be trained on pre-release test data. During operation of a new version of the system 100, the error model may adjust the output of the main ML model 215 based on the magnitude of the error to ensure that the output of the main ML model 215 is accurate. The error model may adjust the output of the main ML model 215 in this manner until the main ML model 215 has been retrained.
In other embodiments, in which the error model adjusts the input of the master ML model 215, the error model may be trained to identify the magnitude of the error between the input of the master ML model 215 in the new version of the system 100 and its input in the previous version. It should be noted that in some scenarios, the input features of the master ML model 215 will remain the same even if the system 100 changes, while the output of the master ML model 215 will differ. Thus, an error model that accounts for input feature drift may not capture cases where drift only changes the output for the same (fixed) input features (e.g., when drift results from employing a different type of hardware for the virtual warehouse, aspects such as execution time or bottlenecks will differ from before).
FIG. 2A shows a block diagram of a logical implementation 200 of a system for managing output drift of an ML model in the system 100. In response to a change in the system 100 (i.e., a new version of the system 100 implemented as a result of changes to one or more of its components), the resource manager 102 may re-execute the pre-release tests discussed herein on the new version of the system 100 and store in memory the result data of the pre-release test queries that have been tagged as relevant to the master ML model 215. The resource manager 102 now has the result data from executing the same pre-release tests on two consecutive versions of the system 100 and may compare the result data from the two executions to determine the differences (if any) between executing the pre-release tests on the old version and on the new version of the system 100. The differences in the result data (e.g., the result data of the tagged queries) between two consecutive executions of the pre-release tests correspond to the drift (error), and its magnitude, in the output of the master ML model 215 (relative to the expected value). The resource manager 102 may use this error data to train the error model 220 to identify the magnitude of the drift in the output of the master ML model 215 on the new version of the system 100 and to correct such drift, as in the sketch below. It should be noted that the error model 220 need not address the original problem, such as predicting the amount of resources required for a query. Instead, the error model 220 only needs to learn how much the result data differs between two consecutive versions of the system 100.
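A hedged sketch, under stated assumptions, of building the error model's training set from the tagged queries' results on two consecutive versions. The dictionary shapes are illustrative, and a gradient-boosted regressor is just one plausible choice, since the patent allows any suitable ML model.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

def build_drift_dataset(old_results, new_results):
    """Pair each tagged query's input features (plus the old output) with the
    drift, i.e., the difference between the new and old versions' outputs."""
    X, y = [], []
    for query, old in old_results.items():
        new = new_results[query]
        X.append(np.append(old["features"], old["output"]))
        y.append(new["output"] - old["output"])     # magnitude of the drift
    return np.array(X), np.array(y)

def train_error_model(old_results, new_results):
    X, y = build_drift_dataset(old_results, new_results)
    return GradientBoostingRegressor().fit(X, y)    # the error model 220
```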
The error model 220 may be any suitable ML model and need not be similar to the master ML model 215. Because the error model 220 identifies the errors (e.g., output differences) of the master ML model 215 between consecutive versions of the system 100, given a new problem instance (i.e., the input features) and the output of the master ML model 215, the error model 220 can adjust that output based on the magnitude of the output drift (error) to produce a final output. In this way, the resource manager 102 can minimize the time during which the master ML model 215 outputs inaccurate/drifted results after a new version of the system 100 is deployed. The error model 220 can provide adjusted results as soon as a new version of the system 100 is released to users, without having to re-execute all training queries and without having to retrain the master ML model 215.
The error model 220 is deployed together with the master ML model 215 (as shown in fig. 2A). When the data sources 205 generate input data, the characterizer 210 (implemented as part of the resource manager 102) may obtain the input data from all of the data sources 205 and create input features (feature vectors) that describe the input data. The characterizer 210 may synchronize with the master ML model 215 to ensure that the correct values and data types are fed to the master ML model 215. The characterizer 210 may output the input features to the master ML model 215. When the master ML model 215 receives the input features (e.g., the raw features of a query whose resource consumption is to be predicted), it may generate an inaccurate output (e.g., a resource prediction) because the master ML model 215 has not been trained on the new version of the system 100 (which includes, for example, an updated/modified version of the virtual machines used to execute the servers in the execution platform 112 and/or an updated/modified query optimizer component). Thus, the input features and the output of the master ML model 215 become inputs to the error model 220, and the error model 220 may adjust the output of the master ML model 215 based on the magnitude of its output drift (error) to produce the final output. In other words, the error model 220 may calculate and output the final (adjusted) output as:

y_Error(i) = y_Main(i) + error(i, y_Main(i))

where, for problem instance i, y_Main(i) is the output of the master ML model 215 and error(i, y_Main(i)) is the output drift of the master ML model 215 (as determined by the error model 220).
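A minimal sketch of this correction step, reusing the error model trained in the earlier sketch; model objects with a scikit-learn-style predict method are an assumption.

```python
import numpy as np

def predict_with_output_correction(main_model, error_model, features):
    """Implements y_Error(i) = y_Main(i) + error(i, y_Main(i))."""
    y_main = main_model.predict([features])[0]                     # y_Main(i)
    drift = error_model.predict([np.append(features, y_main)])[0]  # error(i, y_Main(i))
    return y_main + drift                                          # final output
```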
In some embodiments, training queries from the training set of the main ML model 215 are re-executed on a new version of the system 100 (as discussed herein, these may be queries that are typically executed by a customer or queries that are specifically built for such tasks). The resource manager 102 may determine differences between the result data from the re-execution and the training data set (resulting from executing the training query set on the previous version of the system 100) to re-train the error model 220 to further improve its accuracy. The new instance of the error model 220 may replace the previous instance of the error model 220. Furthermore, the resulting data from this re-execution may also be part of the updated (second) training data set for re-training the main ML model 215. The re-execution of the training query set of the master ML model 215 and the subsequent re-training of the error model 220 may be repeated at a desired cadence.
As the error model 220 continues to run (e.g., adjusting the output of the master ML model 215), the resource manager 102 can retain all the results (e.g., execution data) that it has processed. Over time, a sufficient amount of result data may be retained which, combined with the re-execution of the training queries from the master ML model 215's training set, may form an updated training data set. The resource manager 102 can retrain the master ML model 215 with the updated training data set and thereby generate a retrained master ML model 215. After generating the new retrained instance of the master ML model 215, the resource manager 102 may replace the previous instance with the retrained instance and remove the error model 220, as in the sketch below. As shown in fig. 2B, the system 100 may then continue to run using only the retrained instance of the master ML model 215.
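A hedged sketch of this accumulate-retrain-swap lifecycle; the class, retrain_fn, and the reuse of predict_with_output_correction from the earlier sketch are all illustrative assumptions, not the patent's interface.

```python
class DriftManagedPredictor:
    """Serves corrected predictions while retaining the processed results;
    once retrained, the error model is removed and the master model serves alone."""
    def __init__(self, main_model, error_model):
        self.main_model = main_model
        self.error_model = error_model       # set to None after retraining
        self.collected = []                  # retained (features, output) pairs

    def predict(self, features):
        if self.error_model is not None:
            y = predict_with_output_correction(
                self.main_model, self.error_model, features)
        else:
            y = self.main_model.predict([features])[0]
        self.collected.append((features, y))
        return y

    def retrain_and_swap(self, retrain_fn, reexecuted_training_data):
        """Retrain on the updated training data set, then drop the error model."""
        self.main_model = retrain_fn(self.collected + reexecuted_training_data)
        self.error_model = None
```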
FIG. 3A shows a block diagram of a logical implementation of the input correction technique, which uses an error model with the master ML model 215. The master ML model 215, the characterizer 210, and the data source(s) 205 may be similar to the corresponding components in fig. 2A. In response to changes in the system 100, many of the master ML model 215's assumptions about the input data may no longer be valid. For example, if the query optimizer component of the resource manager 102 has been modified (e.g., a new type of join is introduced, or a new feature such as search optimization is introduced), then the master ML model 215 needs to be retrained, as these modifications may affect the data input to the master ML model 215 (which models the functionality of the resource prediction component).
Thus, in response to a change in the system 100, the resource manager 102 can re-execute the pre-release tests discussed herein on the new version of the system 100 and store in memory the result data of the pre-release test queries that have been tagged as relevant to the master ML model 215. In the example of fig. 3A, the result data may include the data input to the master ML model 215 and the data output by it. Having executed the same pre-release tests on two consecutive versions of the system 100, the resource manager 102 now has the data input to the master ML model 215 and may compare the result data from the two executions (e.g., the result data of the tagged queries) to determine the differences (if any) in the input data of the master ML model 215 between executing the pre-release tests on the old version and on the new version of the system 100. The differences in the input data of the master ML model 215 between two consecutive executions of the pre-release tests correspond to the drift (error), and its magnitude, in the input data of the master ML model 215 executing on the new version of the system 100. The resource manager 102 may use this error data to train the error model 305 to identify drift in the input of the master ML model 215 and to correct such drift. The error model 305 does not need to solve the original problem of determining the inputs of the ML model; it only needs to learn how much the inputs of the master ML model 215 differ between two consecutive versions of the system 100.
The error model 305 may be any suitable ML model and need not be similar to the master ML model 215. Because the error model 305 identifies the errors in the input data of the master ML model 215 between consecutive versions of the system 100, given a new problem instance (i.e., the input features), the error model 305 can adjust the input features of the master ML model 215 based on the magnitude of the input feature drift (error) to produce adjusted input features. In this way, the resource manager 102 can minimize the time during which the master ML model 215 receives inaccurate input data (and thus outputs inaccurate/drifted results). The error model 305 can provide the adjusted input data to the master ML model 215 as soon as a new version of the system 100 is released to users, without having to re-execute all training queries and without having to retrain the master ML model 215.
The error model 305 is deployed together with the master ML model 215 (as shown in fig. 3A). When the data source(s) 205 generate input data, the characterizer 210 can take the input from all the data source(s) 205 and generate input features comprising raw feature vectors that describe the input data. The error model 305 receives the input features (e.g., the raw features of a query whose resource consumption is to be predicted) and may adjust them to generate adjusted input features. More specifically, the error model 305 calculates new values of the input features (rather than adjusting the output), and these new values are then passed to the master ML model 215. The master ML model 215 may produce the final output as given below:

y_Main(y_Error(i))

where, for problem instance i, y_Error(i) is the adjusted input feature vector produced by the error model 305 and y_Main(·) is the output of the master ML model 215.
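A minimal sketch of the composition y_Main(y_Error(i)); an error model that maps a raw feature vector to an adjusted feature vector is an assumption about its interface, not something the patent specifies.

```python
def predict_with_input_correction(main_model, input_error_model, features):
    """Implements y_Main(y_Error(i)): the error model adjusts the raw input
    features, and the master model predicts from the adjusted features."""
    adjusted = input_error_model.predict([features])[0]  # y_Error(i)
    return main_model.predict([adjusted])[0]             # y_Main(y_Error(i))
```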
In some embodiments, training queries from the training set of the main ML model 215 are re-executed on a new version of the system 100 (as discussed herein, these may be queries that are typically executed by a customer or queries that are specifically built for such tasks). The resource manager 102 may determine differences between the result data (input features) from the re-execution and the training data set (resulting from execution of the training query set on the previous version of the system 100) to re-train the error model 305 to further improve its accuracy. A new instance of the error model 305 may replace the previous error model 305. Furthermore, the resulting data from this re-execution may also be part of the updated (second) training data set for re-training the main ML model 215. The re-execution of the training query set of the master ML model 215 and the subsequent re-training of the error model 305 may be repeated at the desired cadence.
As the error model 305 continues to run (e.g., adjust the input data of the master ML model 215), the resource manager 102 can retain all results (e.g., adjusted inputs) that it has processed. Over time, a sufficient amount of result data may be retained, which, in combination with the re-execution of the training query from the training set of the master ML model 215, may form an updated training data set. The resource manager 102 can re-train the master ML model 215 with the new training data set and thereby generate a re-trained master ML model 215. After generating the retrained master ML model 215, the resource manager 102 may replace the previous master ML model 215 with the retrained master ML model 215 and remove the error model 305. As shown in fig. 3B, the system 100 may continue to run using only the retrained main ML model 215.
Fig. 4 is a flow diagram of a method 400 of managing output drift of an ML model in an enterprise system, in accordance with some embodiments. The method 400 may be performed by processing logic that may comprise hardware (e.g., circuitry, dedicated logic, programmable logic, a processor, a processing device, a Central Processing Unit (CPU), a system-on-a-chip (SoC), etc.), software (e.g., instructions run/executed on a processing device), firmware (e.g., microcode), or a combination thereof. In some embodiments, the method 400 may be performed by a computing device (e.g., the resource manager 102 shown in fig. 1A and 1B).
Referring also to fig. 2A and 2B, the method 400 begins at block 405, where, in response to a change in the system 100 (i.e., a new version of the system 100 being implemented as discussed herein), the resource manager 102 may re-execute the pre-release tests discussed herein on the new version of the system 100 (also referred to herein as the "first version") and store in memory the result data (also referred to herein as "first test data") of the pre-release test queries that have been tagged as relevant to the master ML model 215. The resource manager 102 now has the result data from executing the same pre-release tests on two consecutive versions of the system 100 and may compare the result data from the two executions to determine the differences in result data between executing the pre-release tests on the previous version and on the new version of the system 100. The resource manager 102 may utilize the result data from the two consecutive executions of the pre-release tests (e.g., the result data of the tagged queries) to determine whether there is drift (error) in the output of the master ML model 215 (relative to the expected value) and the magnitude of that drift. At block 410, the resource manager 102 may use the error data to train the error model 220 to identify the amount of drift in the output of the master ML model 215 between the previous version and the new version of the system 100 and to correct such drift. It should be noted that the error model 220 need not solve the original (possibly difficult) problem of predicting the amount of resources required for a query. Instead, the error model 220 only needs to learn how much the results differ between two consecutive versions of the system 100.
At block 415, a new version of the system 100 is deployed, wherein the error model 220 is included with the master ML model 215 (as shown in FIG. 2A). When the data sources 205 generate input data, the characterizer 210 (implemented as part of the resource manager 102) may obtain the input data from all of the data sources 205 and create input features (feature vectors) that describe the input data. The characterizer 210 may synchronize with the master ML model 215 to ensure that the correct values and data types are fed to the master ML model 215. The characterizer 210 may output the input features to the main ML model 215. When the master ML model 215 receives an input feature (e.g., an original feature of a query whose resource consumption is to be predicted), it may generate an inaccurate output (e.g., a resource prediction) because the master ML model 215 has not been trained based on a new version of the system 100 (which includes, for example, an updated/modified version and/or an updated/modified query optimizer component of a virtual machine for executing a server in the platform 112). Thus, at block 420, the input features and output of the master ML model 215 may become inputs to the error model 220, and the error model 220 may adjust the output of the master ML model 215 based on the magnitude of the output drift (error) of the master ML model 215 to produce a final output. In other words, the error model 220 may calculate and output a final (adjusted) output as:
y_Error(i) = y_Main(i) + error(i, y_Main(i))
where, for problem instance i, y_Main(i) is the output of the master ML model 215 and error(i, y_Main(i)) is the output drift of the master ML model 215 (as determined by the error model 220).
In some embodiments, training queries from the training set of the main ML model 215 are re-executed on a new version of the system 100 (as discussed herein, these may be queries that are typically executed by a customer or queries that are specifically built for such tasks). The resource manager 102 may determine differences between the result data from the re-execution and the training data set (resulting from executing the training query set on the previous version of the system 100) to re-train the error model 220 to further improve its accuracy. The new instance of the error model 220 may replace the previous instance of the error model 220. Furthermore, the resulting data from this re-execution may also be part of the updated (second) training data set for re-training the main ML model 215. The re-execution of the training query set of the master ML model 215 and the subsequent re-training of the error model 220 may be repeated at a desired cadence.
As the error model 220 continues to run (e.g., adjust the output of the master ML model 215), the resource manager 102 can retain all results (e.g., adjusted output) that it has processed. Over time, a sufficient amount of result data may be retained, which, in combination with the re-execution of the training query from the training set of the master ML model 215, may form an updated training data set. The resource manager 102 can re-train the master ML model 215 with the updated training data set and thereby generate a re-trained master ML model 215. After generating the new retrained instance of the master ML model 215, the resource manager 102 may replace the previous instance of the master ML model 215 with the retrained instance of the master ML model 215 and remove the error model 220. As shown in fig. 2B, the system 100 may continue to run using only the retrained instance of the master ML model 215.
Fig. 5 is a flow diagram of a method 500 of managing input feature drift of an ML model in an enterprise system, in accordance with some embodiments. The method 500 may be performed by processing logic that may comprise hardware (e.g., circuitry, dedicated logic, programmable logic, a processor, a processing device, a Central Processing Unit (CPU), a system-on-a-chip (SoC), etc.), software (e.g., instructions run/executed on a processing device), firmware (e.g., microcode), or a combination thereof. In some embodiments, the method 500 may be performed by a computing device (e.g., the resource manager 102 shown in figs. 1A and 1B).
Referring also to figs. 3A and 3B, at block 505, in response to a change in the system 100, the resource manager 102 may re-execute the pre-release tests discussed herein on the new version of the system 100 and store in memory the result data (also referred to herein as "first test data") of the pre-release test queries that have been tagged as relevant to the master ML model 215. In the example of fig. 3A, the result data may include the data input to the master ML model 215 and the data output by it. Having executed the same pre-release tests on two consecutive versions of the system 100, the resource manager 102 now has the data input to the master ML model 215 and may compare the result data from the two executions to determine the differences in the input data of the master ML model 215 between executing the pre-release tests on the previous version and on the new version of the system 100. The resource manager 102 may utilize the result data from the two consecutive executions of the pre-release tests (e.g., the result data of the tagged queries) to determine whether there is drift (error) in the input data of the master ML model 215 and the magnitude of that drift. At block 510, the resource manager 102 may use the error data to train the error model 305 to identify drift in the input of the master ML model 215 between the previous version and the new version of the system 100 and to correct such drift. The error model 305 does not need to solve the original problem of determining the inputs of the ML model; it only needs to learn how much the inputs of the master ML model 215 differ between two consecutive versions of the system 100.
At block 515, the new system version is deployed with the error model 305 alongside the master ML model 215 (as shown in fig. 3A). When the data source(s) 205 generate input data, the characterizer 210 can take the input from all the data source(s) 205 and generate input features comprising raw feature vectors that describe the input data. The error model 305 receives the input features (e.g., the raw features of a query whose resource consumption is to be predicted) and, at block 520, may adjust them to generate adjusted input features. More specifically, the error model 305 calculates new values of the input features (rather than adjusting the output), and these new values are then passed to the master ML model 215. The master ML model 215 may produce the final output as given below:

y_Main(y_Error(i))

where, for problem instance i, y_Error(i) is the adjusted input feature vector produced by the error model 305 and y_Main(·) is the output of the master ML model 215.
In some embodiments, training queries from the training set of the main ML model 215 are re-executed on a new version of the system 100 (as discussed herein, these may be queries that are typically executed by a customer or queries that are specifically built for such tasks). The resource manager 102 may determine differences between the result data (input features) from the re-execution and the training data set (resulting from execution of the training query set on the previous version of the system 100) to re-train the error model 305 to further improve its accuracy. A new instance of the error model 305 may replace the previous error model 305. Furthermore, the resulting data from this re-execution may also be part of the updated (second) training data set for re-training the main ML model 215. The re-execution of the training query set of the master ML model 215 and the subsequent re-training of the error model 305 may be repeated at the desired cadence.
As the error model 305 continues to run (e.g., adjust input features of the master ML model 215), the resource manager 102 can retain all results (e.g., adjusted inputs) that it has processed. Over time, a sufficient amount of result data may be retained, which, in combination with the re-execution of the training query from the training set of the master ML model 215, may form an updated training data set. The resource manager 102 can re-train the master ML model 215 with the new training data set and thereby generate a re-trained master ML model 215. After generating the retrained master ML model 215, the resource manager 102 may replace the previous master ML model 215 with the retrained master ML model 215 and remove the error model 305. As shown in fig. 3B, the system 100 may continue to run using only the retrained main ML model 215.
Fig. 6 is a block diagram of an example computing device 600 that may perform one or more of the operations described herein, in accordance with some embodiments. For example, computing device 600 may execute a set of test queries on a first version of a database system to generate first test data, where the first version of the system includes a Machine Learning (ML) model to generate output corresponding to functions of the database system. The computing device 600 may train an error model, for determining an output error of the ML model between the first version and a previous version of the database system, based on the first test data and on second test data generated by executing the set of test queries on the previous version of the system. The computing device 600 may deploy the first version of the database system with the error model and, in response to the ML model generating a first output based on a received input, the error model may adjust the first output of the ML model based on the input of the ML model and the output error of the ML model.
Computing device 600 may be connected to other computing devices in a LAN, intranet, extranet, and/or the internet. The computing device may operate in the capacity of a server machine in a client-server network environment or in the capacity of a client in a peer-to-peer network environment. The computing device may be provided by a Personal Computer (PC), a set-top box (STB), a server, a network router, switch or bridge, or any machine capable of executing a set of instructions (sequential or otherwise) that specify actions to be taken by that machine. Furthermore, while only a single computing device is illustrated, the term "computing device" shall also be taken to include any collection of computing devices that individually or jointly execute a set (or multiple sets) of instructions to perform the methodologies discussed herein.
The example computing device 600 may include a processing device (e.g., a general purpose processor, a PLD, etc.) 602, a main memory 604 (e.g., synchronous dynamic random access memory (SDRAM), read-only memory (ROM)), a static memory 606 (e.g., flash memory), and a data storage device 618, which may communicate with each other via a bus 630.
The processing device 602 may be provided by one or more general purpose processing devices, such as a microprocessor, central processing unit, or the like. In an illustrative example, the processing device 602 may comprise a complex instruction set computing (CISC) microprocessor, a reduced instruction set computing (RISC) microprocessor, a very long instruction word (VLIW) microprocessor, or a processor implementing other instruction sets or a combination of instruction sets. The processing device 602 may also comprise one or more special purpose processing devices, such as an application specific integrated circuit (ASIC), a field programmable gate array (FPGA), a digital signal processor (DSP), a network processor, or the like. In accordance with one or more aspects of the present disclosure, the processing device 602 may be configured to perform the operations and steps discussed herein.
Computing device 600 may also include a network interface device 608 that may communicate with a network 620. Computing device 600 may also include a video display unit 610 (e.g., a Liquid Crystal Display (LCD) or Cathode Ray Tube (CRT)), an alphanumeric input device 612 (e.g., a keyboard), a cursor control device 614 (e.g., a mouse), and a sound signal generation device 616 (e.g., a speaker). In one embodiment, the video display unit 610, the alphanumeric input device 612, and the cursor control device 614 may be combined into a single component or device (e.g., an LCD touch screen).
In accordance with one or more aspects of the present disclosure, the data storage 618 may include a computer-readable storage medium 628 on which one or more sets of ML model drift management instructions 625, e.g., instructions for performing the operations described herein, may be stored. The ML model drift management instructions 625 may also reside, completely or at least partially, within the main memory 604 and/or within the processing device 602 during execution thereof by the computing device 600, main memory 604 and processing device 602 also constituting computer readable media. The ML model drift management instructions 625 may also be transmitted or received over the network 620 via the network interface device 608.
While the computer-readable storage medium 628 is shown in an illustrative example to be a single medium, the term "computer-readable storage medium" should be taken to include a single medium or multiple media (e.g., a centralized or distributed database, and/or associated caches and servers) that store the one or more sets of instructions. The term "computer-readable storage medium" shall also be taken to include any medium that is capable of storing, encoding or carrying a set of instructions for execution by the machine and that cause the machine to perform the methods described herein. Accordingly, the term "computer-readable storage medium" shall be taken to include, but not be limited to, solid-state memories, optical media, and magnetic media.
Unless specifically stated otherwise, terms such as "receiving," "routing," "updating," "providing," or the like, refer to actions and processes performed or implemented by a computing device that manipulate and transform data represented as physical (electronic) quantities within the computing device's registers and memories into other data similarly represented as physical quantities within the computing device memories or registers or other such information storage, transmission or display devices. Furthermore, the terms "first," "second," "third," "fourth," and the like, as used herein, refer to labels used to distinguish between different elements and do not necessarily have an ordinal meaning according to their numerical designation.
Examples described herein also relate to an apparatus for performing the operations described herein. The apparatus may be specially constructed for the required purposes, or it may comprise a general purpose computing device selectively programmed by a computer program stored in the computing device. Such a computer program may be stored in a computer readable non-transitory storage medium.
The methods and illustrative examples described herein are not inherently related to any particular computer or other apparatus. Various general-purpose systems may be used with the teachings described herein, or it may prove convenient to construct more specialized apparatus to perform the required method steps. The required structure for a variety of these systems will be as described in the description above.
The above description is intended to be illustrative, and not restrictive. Although the present disclosure has been described with reference to specific illustrative examples, it will be appreciated that the present disclosure is not limited to the described examples. The scope of the disclosure should be determined with reference to the appended claims, along with the full scope of equivalents to which such claims are entitled.
As used herein, the singular forms "a", "an" and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise. It will be further understood that the terms "comprises," "comprising," "includes," and/or "including," when used herein, specify the presence of stated features, integers, steps, operations, elements, and/or components, but do not preclude the presence or addition of one or more other features, integers, steps, operations, elements, components, and/or groups thereof. Thus, the terminology used herein is for the purpose of describing particular embodiments only and is not intended to be limiting.
It should also be noted that in some alternative implementations, the functions/acts noted may occur out of the order noted in the figures. For example, two features shown in succession may in fact be executed substantially concurrently or the features may sometimes be executed in the reverse order, depending upon the functionality/acts involved.
Although the method operations are described in a particular order, it should be understood that other operations may be performed between the described operations, the described operations may be adapted so that they occur at slightly different times, or the described operations may be distributed in a system that allows processing operations to occur at various intervals associated with the processing.
Various units, circuits, or other components may be described or claimed as "configured" or "configurable" to perform a task or tasks. In such contexts, the phrase "configured to" or "configurable to" is used to connote structure by indicating that the units/circuits/components include structure (e.g., circuitry) that performs the task or tasks during operation. As such, the unit/circuit/component can be said to be configured to perform the task, or configurable to perform the task, even when the specified unit/circuit/component is not currently operational (e.g., is not on). The units/circuits/components used with the "configured to" or "configurable to" language include hardware, e.g., circuits, memory storing program instructions executable to implement the operation, etc. Reciting that a unit/circuit/component is "configured to" perform one or more tasks, or is "configurable to" perform one or more tasks, is expressly intended not to invoke 35 U.S.C. § 112, sixth paragraph, for that unit/circuit/component. Additionally, "configured to" or "configurable to" can include generic structure (e.g., generic circuitry) that is manipulated by software and/or firmware (e.g., an FPGA or a general-purpose processor executing software) to operate in a manner that is capable of performing the task(s) at issue. "Configured to" may also include adapting a manufacturing process (e.g., a semiconductor fabrication facility) to fabricate devices (e.g., integrated circuits) that are adapted to implement or perform one or more tasks. "Configurable to" is expressly intended not to apply to blank media, an unprogrammed processor or unprogrammed generic computer, or an unprogrammed programmable logic device, programmable gate array, or other unprogrammed device, unless accompanied by programmed media that confers the ability to the unprogrammed device to be configured to perform the disclosed function(s).
The foregoing description, for purposes of explanation, has been presented with reference to specific embodiments. However, the illustrative discussions above are not intended to be exhaustive or to limit the invention to the precise forms disclosed. Many modifications and variations are possible in view of the above teachings. The embodiments were chosen and described in order to best explain the principles of the embodiments and their practical applications, to thereby enable others skilled in the art to best utilize the embodiments and various modifications as may be suited to the particular use contemplated. Accordingly, the present embodiments are to be considered as illustrative and not restrictive, and the invention is not to be limited to the details given herein, but may be modified within the scope and equivalents of the appended claims.
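As a further non-authoritative illustration of the embodiments described above, the sketch below shows, in plain Python, the retraining loop (accumulating adjusted outputs as training data, retraining the ML model, and replacing it so the error model can then be removed) together with the input-error variant also recited in the claims that follow, in which the error model corrects the ML model's input rather than its output. Every identifier here is a hypothetical stand-in; the patent does not prescribe these APIs.

```python
# Hedged sketch of the follow-on steps described above; every helper,
# class, and signature here is invented for illustration only.
import numpy as np

def retrain_from_adjusted_outputs(ml_model_cls, error_model, served):
    """Retrain the ML model on adjusted outputs accumulated over time.

    served: list of (features, first_output) pairs recorded while the
            error model was deployed alongside the ML model.
    """
    X = np.asarray([x for x, _ in served])
    # Training labels are the adjusted outputs: first output plus predicted error.
    y = np.asarray([out for _, out in served]) + error_model.predict(X)
    retrained = ml_model_cls()   # fresh instance of the same model family
    retrained.fit(X, y)
    return retrained             # the caller replaces the ML model with this,
                                 # after which the error model may be removed

def adjusted_input(input_error_model, x):
    """Input-error variant: correct an input before the ML model sees it.

    Assumes an error model trained to emit a per-feature correction
    (e.g., a multi-output regressor).
    """
    x = np.asarray(x, dtype=float)
    return x + input_error_model.predict(x.reshape(1, -1))[0]
```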

Claims (23)

1. A method, comprising:
executing a set of test queries on a first version of a database system to generate first test data, wherein the first version of the system includes a Machine Learning (ML) model to generate an output corresponding to a function of the database system;
training an error model based on the first test data and second test data generated by executing the set of test queries on a previous version of the system, the error model determining an error associated with the ML model between the first version and the previous version of the database system; and
deploying the first version of the database system with the error model.
2. The method of claim 1, wherein the error associated with the ML model is an output error of the ML model, the method further comprising:
in response to the ML model generating a first output based on a received input, adjusting, by the error model, the first output of the ML model based on the input of the ML model and the output error of the ML model.
3. The method of claim 2, further comprising:
generating training data based at least in part on one or more adjusted outputs of the error model accumulated over time;
retraining the ML model based on the training data to generate a retrained ML model; and
replacing the ML model with the retrained ML model.
4. The method of claim 3, further comprising:
removing the error model from the first version of the system.
5. The method of claim 3, further comprising:
executing, on the first version of the system, a set of training queries on the ML model to generate third test data;
retraining the error model based on the third test data to generate an updated error model; and
replacing the error model with the updated error model.
6. The method of claim 5, wherein generating the training data comprises adding the third test data to the one or more adjusted outputs of the error model accumulated over time.
7. The method of claim 1, wherein the error associated with the ML model is an input error of the ML model, the method further comprising:
adjusting, by the error model, an input directed to the ML model based on the input error of the ML model; and
outputting the adjusted input to the ML model.
8. The method of claim 7, further comprising:
generating training data based at least in part on one or more adjusted inputs of the error model accumulated over time;
retraining the ML model based on the training data to generate a retrained ML model; and
replacing the ML model with the retrained ML model.
9. The method of claim 1, wherein the set of test queries comprises test queries marked by the database system as relevant to the ML model.
10. The method of claim 1, wherein the function comprises one of a query execution engine, a query optimizer, or a resource predictor.
11. A system, comprising:
a memory; and
a processing device operably coupled to the memory, the processing device to:
execute a set of test queries on a first version of a database system to generate first test data, wherein the first version of the system includes a Machine Learning (ML) model to generate an output corresponding to a function of the database system;
train an error model based on the first test data and second test data generated by executing the set of test queries on a previous version of the system, the error model determining an error associated with the ML model between the first version and the previous version of the database system; and
deploy the first version of the database system with the error model.
12. The system of claim 11, wherein the error associated with the ML model is an output error of the ML model, and the processing device is further to:
in response to the ML model generating a first output based on a received input, adjust, by the error model, the first output of the ML model based on the input of the ML model and the output error of the ML model.
13. The system of claim 12, wherein the processing device is further to:
generate training data based at least in part on one or more adjusted outputs of the error model accumulated over time;
retrain the ML model based on the training data to generate a retrained ML model; and
replace the ML model with the retrained ML model.
14. The system of claim 13, wherein the processing device is further to:
remove the error model from the first version of the system.
15. The system of claim 13, wherein the processing device is further to:
execute, on the first version of the system, a set of training queries on the ML model to generate third test data;
retrain the error model based on the third test data to generate an updated error model; and
replace the error model with the updated error model.
16. The system of claim 15, wherein to generate the training data, the processing device is to add the third test data to the one or more adjusted outputs of the error model accumulated over time.
17. The system of claim 11, wherein the error associated with the ML model is an input error of the ML model, and the processing device is further to:
adjust, by the error model, an input directed to the ML model based on the input error of the ML model; and
output the adjusted input to the ML model.
18. The system of claim 17, wherein the processing device is further to:
generate training data based at least in part on one or more adjusted inputs of the error model accumulated over time;
retrain the ML model based on the training data to generate a retrained ML model; and
replace the ML model with the retrained ML model.
19. The system of claim 11, wherein the set of test queries comprises test queries marked by the database system as relevant to the ML model.
20. The system of claim 11, wherein the function comprises one of a query execution engine, a query optimizer, or a resource predictor.
21. A non-transitory computer-readable medium having instructions stored thereon that, when executed by a processing device, cause the processing device to:
execute a set of test queries on a first version of a database system to generate first test data, wherein the first version of the system includes a Machine Learning (ML) model to generate an output corresponding to a function of the database system;
train an error model based on the first test data and second test data generated by executing the set of test queries on a previous version of the system, the error model determining an error associated with the ML model between the first version and the previous version of the database system; and
deploy the first version of the database system with the error model.
22. The non-transitory computer-readable medium of claim 21, wherein the error associated with the ML model is an output error of the ML model, and the processing device is further to:
in response to the ML model generating a first output based on a received input, adjust, by the error model, the first output of the ML model based on the input of the ML model and the output error of the ML model.
23. The non-transitory computer-readable medium of claim 22, wherein the processing device is further to:
generate training data based at least in part on one or more adjusted outputs of the error model accumulated over time;
retrain the ML model based on the training data to generate a retrained ML model; and
replace the ML model with the retrained ML model.
CN202280011257.XA 2021-01-21 2022-01-18 Handling of system characteristic drift in machine learning applications Pending CN116745783A (en)

Applications Claiming Priority (3)

Application Number Priority Date Filing Date Title
US17/154,928 2021-01-21
US17/154,928 US11568320B2 (en) 2021-01-21 2021-01-21 Handling system-characteristics drift in machine learning applications
PCT/US2022/012789 WO2022159391A1 (en) 2021-01-21 2022-01-18 Handling system-characteristics drift in machine learning applications

Publications (1)

Publication Number Publication Date
CN116745783A true CN116745783A (en) 2023-09-12

Family

ID=82406347

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202280011257.XA Pending CN116745783A (en) 2021-01-21 2022-01-18 Handling of system characteristic drift in machine learning applications

Country Status (5)

Country Link
US (2) US11568320B2 (en)
EP (1) EP4281912A1 (en)
CN (1) CN116745783A (en)
DE (1) DE202022002890U1 (en)
WO (1) WO2022159391A1 (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US11615102B2 (en) * 2019-10-18 2023-03-28 Splunk Inc. Swappable online machine learning algorithms implemented in a data intake and query system

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103390066A (en) * 2013-08-08 2013-11-13 上海新炬网络技术有限公司 Database overall automation optimizing early warning device and processing method thereof
CN105512264A (en) * 2015-12-04 2016-04-20 贵州大学 Performance prediction method of concurrency working loads in distributed database
CN110019419A (en) * 2017-09-29 2019-07-16 微软技术许可有限责任公司 Automatic testing and management are abnormal in statistical model
US20200082296A1 (en) * 2018-09-06 2020-03-12 Quickpath Analytics, Inc. Real-time drift detection in machine learning systems and applications
US20200183936A1 (en) * 2018-12-10 2020-06-11 Teradata Us, Inc. Predictive query parsing time and optimization

Family Cites Families (15)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US10754764B2 (en) * 2018-04-22 2020-08-25 Sas Institute Inc. Validation sets for machine learning algorithms
US20110302193A1 (en) * 2010-06-07 2011-12-08 Microsoft Corporation Approximation framework for direct optimization of information retrieval measures
WO2017201107A1 (en) * 2016-05-16 2017-11-23 Purepredictive, Inc. Predictive drift detection and correction
US10375098B2 (en) * 2017-01-31 2019-08-06 Splunk Inc. Anomaly detection based on relationships between multiple time series
US11537615B2 (en) * 2017-05-01 2022-12-27 Futurewei Technologies, Inc. Using machine learning to estimate query resource consumption in MPPDB
US10460235B1 (en) * 2018-07-06 2019-10-29 Capital One Services, Llc Data model generation using generative adversarial networks
US20200081916A1 (en) * 2018-09-12 2020-03-12 Business Objects Software Ltd. Predictive modeling with machine learning in data management platforms
MX2021004475A (en) * 2018-10-19 2021-06-04 Climate Llc Machine learning techniques for identifying clouds and cloud shadows in satellite imagery.
US11481665B2 (en) * 2018-11-09 2022-10-25 Hewlett Packard Enterprise Development Lp Systems and methods for determining machine learning training approaches based on identified impacts of one or more types of concept drift
US20200341920A1 (en) * 2019-04-29 2020-10-29 Instant Labs, Inc. Data access optimized across access nodes
US20200349161A1 (en) * 2019-04-30 2020-11-05 Microsoft Technology Licensing, Llc Learned resource consumption model for optimizing big data queries
US11714905B2 (en) * 2019-05-10 2023-08-01 Sophos Limited Attribute relevance tagging in malware recognition
US10776721B1 (en) * 2019-07-25 2020-09-15 Sas Institute Inc. Accelerating configuration of machine-learning models
US10832087B1 (en) * 2020-02-05 2020-11-10 Sas Institute Inc. Advanced training of machine-learning models usable in control systems and other systems
US11308363B2 (en) * 2020-03-26 2022-04-19 Intel Corporation Device and method for training an object detection model

Also Published As

Publication number Publication date
US11934927B2 (en) 2024-03-19
US20230132117A1 (en) 2023-04-27
EP4281912A1 (en) 2023-11-29
US11568320B2 (en) 2023-01-31
DE202022002890U1 (en) 2023-12-22
WO2022159391A1 (en) 2022-07-28
US20220230093A1 (en) 2022-07-21

Similar Documents

Publication Publication Date Title
García-Gil et al. A comparison on scalability for batch big data processing on Apache Spark and Apache Flink
US10915528B2 (en) Pluggable storage system for parallel query engines
US9836701B2 (en) Distributed stage-wise parallel machine learning
US9558045B2 (en) Realizing graph processing based on the MapReduce architecture
McNabb et al. Parallel PSO using MapReduce
Mahgoub et al. SOPHIA: Online reconfiguration of clustered NoSQL databases for time-varying workloads
Schelter et al. Distributed matrix factorization with mapreduce using a series of broadcast-joins
US20150286668A1 (en) Optimizing update operations in in-memory database systems
Singhal et al. Performance assurance model for applications on SPARK platform
US11429572B2 (en) Rules-based dataset cleaning
US11521076B2 (en) Architecture-independent approximation discovery
US10831709B2 (en) Pluggable storage system for parallel query engines across non-native file systems
Xu et al. Efficient fault-tolerance for iterative graph processing on distributed dataflow systems
US11934927B2 (en) Handling system-characteristics drift in machine learning applications
Kumar et al. Scalable performance tuning of hadoop MapReduce: A noisy gradient approach
Thomas et al. Survey on MapReduce scheduling algorithms
CN110851515A (en) Big data ETL model execution method and medium based on Spark distributed environment
Shahverdi et al. Comparative evaluation for the performance of big stream processing systems
Zhang et al. HotML: A DSM-based machine learning system for social networks
Dreuning et al. mCAP: Memory-Centric Partitioning for Large-Scale Pipeline-Parallel DNN Training
US20200410394A1 (en) Predicting future actions during visual data cleaning
US11620271B2 (en) Relationship analysis using vector representations of database tables
US20240184764A1 (en) Relationship analysis using vector representations of database tables
CN116755893B (en) Job scheduling method and device of deep learning-oriented distributed computing system
US20210406246A1 (en) Management of diverse data analytics frameworks in computing systems

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
CB02 Change of applicant information

Country or region after: U.S.A.

Address after: Montana

Applicant after: Snowflake Co.

Address before: Montana

Applicant before: SNOWFLAKE COMPUTING Inc.

Country or region before: U.S.A.