WO2023097270A1

WO2023097270A1 - Detecting idle periods at network endpoints for management actions at processing clusters for managed databases

Info

Publication number: WO2023097270A1
Application number: PCT/US2022/080421
Authority: WO
Inventors: Ippokratis Pandis; Eric Ray HOTINGER; Bruce William MCGAUGHY; Naresh Chainani; Neeraja RENTACHINTALA; Zhixing MA; Pulkit Jagdishchandra BHAVSAR; Chao Duan; William Michael MCCREEDY; Pavel Sokolov; Sanjay Wangoo; Gaurav Saxena; Balakrishnan NARAYANASWAMY; Mohammad Rezaur Rahman; Davide PAGANO; Fabian Oliver NAGEL; Sebastian Hillig; Christos Stavrakakis; Yan Leshinsky; Krishna Chaitanya Gudipati
Original assignee: Amazon Technologies, Inc.
Priority date: 2021-11-24
Filing date: 2022-11-23
Publication date: 2023-06-01

Abstract

Idle periods may be for management actions at processing clusters for managed databases. A leader node of a processing cluster for a managed database may monitor a network endpoint at a proxy service associated with a database managed by the database service. An idle period for the database may be detected. A management action for the processing cluster may be determined to be performed during the detected idle period. The leader node may cause the determined management action to be performed.

Description

DETECTING IDLE PERIODS AT NETWORK ENDPOINTS FOR MANAGEMENT

ACTIONS AT PROCESSING CLUSTERS FOR MANAGED DATABASES

BACKGROUND

[0001] As the technological capacity for organizations to create, track, and retain information continues to grow, a variety of different technologies for managing and storing the rising tide of information have been developed. Database systems, for example, provide clients with many different specialized or customized configurations of hardware and software to manage stored information. However, the increasing amounts of data that organizations must store and manage often correspondingly increases both the size and complexity of data storage and management technologies, like database systems, which in turn escalate the cost of maintaining the information. [0002] New technologies more and more seek to reduce both the complexity and storage requirements of maintaining data while simultaneously improving the efficiency of data processing. For example, data processing resources may be efficiently configured to perform different workloads. However, given that many workloads are unknown when data processing resources are configured, or change over time. Challenges in obtaining the right configuration of data processing resources occur frequently.

BRIEF DESCRIPTION OF THE DRAWINGS

[0003] FIG. 1 illustrates a logical block diagram of detecting idle periods for management actions at processing clusters for managed databases, according to some embodiments.

[0004] FIG. 2 is a logical block diagram illustrating a provider network offering a database service, according to some embodiments.

[0005] FIG. 3 is a logical block diagram of monitoring for idle periods at network endpoints at a leader node of a processing cluster for performing management actions, according to some embodiments.

[0006] FIG. 4 is a logical block diagram illustrating performing a management action at a processing cluster, according to some embodiments.

[0007] FIG. 5 is a logical block diagram illustrating an example of query handling with a paused processing cluster for a managed database, according to some embodiments.

[0008] FIGS. 6A-6B are logical block diagram illustrating examples of resuming from pause at a processing cluster, according to some embodiments.

[0009] FIG. 7 is a logical block diagram illustrating a restore into to a managed database feature offered by a database service, according to some embodiments. [0010] FIG. 8 is a logical block diagram illustrating interactions to perform online restore to a different cluster topology, according to some embodiments.

[0011] FIG. 9 is a logical block diagram illustrating example scenarios of different topologies with different transfer plans, according to some embodiments.

[0012] FIG. 10 is a logical block diagram illustrating interactions to make an encryption state change, according to some embodiments.

[0013] FIG. 11 is a logical block diagram of a proxy service for a database service that routes queries to a selected processing cluster attached to a database managed by the database service, according to some embodiments.

[0014] FIG. 12 is a logical block diagram illustrating attaching a new processing cluster selected for a database managed by a database service, according to some embodiments.

[0015] FIG. 13 is a logical block diagram illustrating an example of a processing cluster performing queries to database data, according to some embodiments.

[0016] FIG. 14 is a logical block diagram illustrating a separately hosted additional processing cluster, according to some embodiments.

[0017] FIG. 15 is a logical block diagram illustrating a collocated additional processing cluster on a same host as a compute node, according to some embodiments.

[0018] FIG. 16 is a high-level flowchart illustrating methods and techniques to implement detecting idle periods for management actions at processing clusters for managed databases, according to some embodiments.

[0019] FIG. 17 is a high-level flowchart illustrating methods and techniques to implement resuming a paused processing cluster for a managed database, according to some embodiments.

[0020] FIG. 18 illustrates a logical block diagram of online restore to different topologies with custom data distribution, according to some embodiments.

[0021] FIG. 19 is a high-level flowchart illustrating methods and techniques to implement online restore to different topologies with custom data distribution, according to some embodiments.

[0022] FIG. 20 is a high-level flowchart illustrating methods and techniques to implement phase 1 of online restore, according to some embodiments.

[0023] FIG. 21 is a high-level flowchart illustrating methods and techniques to implement phase 2 of online restore, according to some embodiments.

[0024] FIG. 22 illustrates a logical block diagram of scaling database query processing using additional processing clusters, according to some embodiments. [0025] FIG. 23 is a high-level flowchart illustrating methods and techniques to implement scaling database query processing using additional processing clusters, according to some embodiments.

[0026] FIG. 24 is a high-level flowchart illustrating methods and techniques to implement generating a plan a leader node of processing cluster to perform a database query using additional processing clusters, according to some embodiments.

[0027] FIG. 25 is a high-level flowchart illustrating methods and techniques to implement executing instructions at a compute node of a processing cluster to perform a portion of a database query using additional processing clusters, according to some embodiments.

[0028] FIG. 26 illustrates a logical block diagram of scaling query processing resources for efficient utilization and performance, according to some embodiments.

[0029] FIG. 27 is a high-level flowchart illustrating methods and techniques to implement scaling query processing resources for efficient utilization and performance, according to some embodiments.

[0030] FIG. 28 is a high-level flowchart illustrating methods and techniques to implement predicting response time for queries on different query processing configurations, according to some embodiments.

[0031] FIG. 29 illustrates an example system that implements the various methods, techniques, and systems described herein, according to some embodiments.

[0032] While embodiments are described herein by way of example for several embodiments and illustrative drawings, those skilled in the art will recognize that embodiments are not limited to the embodiments or drawings described. It should be understood, that the drawings and detailed description thereto are not intended to limit embodiments to the particular form disclosed, but on the contrary, the intention is to cover all modifications, equivalents and alternatives falling within the spirit and scope as defined by the appended claims. The headings used herein are for organizational purposes only and are not meant to be used to limit the scope of the description or the claims. As used throughout this application, the word “may” is used in a permissive sense (i. e. , meaning having the potential to), rather than the mandatory sense (i.e. , meaning must). Similarly, the words “include,” “including,” and “includes” mean including, but not limited to.

[0033] It will also be understood that, although the terms first, second, etc. may be used herein to describe various elements, these elements should not be limited by these terms. These terms are only used to distinguish one element from another. For example, a first contact could be termed a second contact, and, similarly, a second contact could be termed a first contact, without departing from the scope of the present invention. The first contact and the second contact are both contacts, but they are not the same contact.

DETAILED DESCRIPTION OF EMBODIMENTS

[0034] Various techniques of detecting idle periods for management actions at processing clusters for managed databases are described herein. While database systems that are operated and managed directly by the entities using the database systems may be able to make adjustments to database system configurations for query workloads, cloud service providers and other provider networks that offer database services that operate and manage database systems on behalf of other entities that utilize the database systems may have less insight into the changes in query workload. For instance, many entities that utilize database systems find it desirable to shift operational and management responsibilities to the provider of a database service in order to focus efforts on other tasks. This shift of responsibility to a provider of a database service may cause database services to make management decisions for a database based on unknown information, such as expected query workloads or changes to query workloads.

[0035] For example, entities that use database systems may be able to shut-down or limit resource waste by shutting down, hibernating, or otherwise not operating database systems that are not being used. These entities could also use these known idle periods in order to perform various actions to manage the database system (e.g., actions that upgrade, patch, reorganize, or modify the operation of the database system to achieve various improvements or correct various problems). Providers of managed databases, however, may not have foreknowledge of these idle periods. Techniques for detecting idle periods for management actions at processing clusters for managed databases may be implemented to allow providers that managed database systems to proactively detect and utilize idle periods to improve the performance of managed database systems without causing downtime or other service interruptions in the event the database system is to be used.

[0036] FIG. 1 illustrates a logical block diagram of detecting idle periods for management actions at processing clusters for managed databases, according to some embodiments. Database service 110 may be a stand-alone database service, in various embodiments. For example, database service 110 may be implemented for private use (e.g., on private networks and resources for entityspecific utilization). In some embodiments, database service 110 may be implemented as part of multiple different services provided by a cloud service provider, such as provider network 200 discussed in detail below with regard to FIG. 2. [0037] Database service 110 may manage databases on behalf of clients of database service 110, in various embodiments. For example, database service 110 may implement an interface that allows users to create a database to be hosted in database service 110. The interface may also allow users to specify whether the database is to be managed by the database service, automatically, in a “serverless” fashion (e.g., by allowing database service 110 to automatically determine and configure an appropriate number of computing resources to host and provide access to (e.g., query) the database). In some embodiments, database service 110 may also allow for hosted databases to be manually managed (e.g., via interface requests to configure a specified number of computing resources to host and provide access to (e.g., query) the database).

[0038] For database service managed databases, database service 110 may implement proxy service 120. Proxy service 120, which may be similar to proxy service 240 discussed in detail below with regard to FIG. 2, may host or implement a network endpoint 122, which may be used to provide database access 102 to a database managed by database service 110. Instead of direct access, a client application utilizing a managed database may send requests to a common network endpoint 122 associated with the database. Proxy service 120 may implement various techniques, including load balancing, scaling, and other techniques for managing and providing efficient query performance and route queries and other access requests (e.g., requests to write to the database) to a processing cluster 130 for performance. Note that in some embodiments, a primary processing cluster for a database may act as a proxy for other processing clusters attached or otherwise assigned to handling database access 102. thus, the various features discussed above with regard to proxy service 120 may be implemented instead on a primary processing cluster (e.g., at a leader node for the primary processing cluster).

[0039] Processing clusters, such as processing cluster 130, 140, and 150, may implement distributed query and other access request processing frameworks to access data in a database hosted by database service 110, as discussed in detail below with regard to FIG. 13. Processing clusters may include a leader node to plan and direct execution of requests, such as leader nodes 132, 142, and 152, and compute node(s) to perform various data operations to execute the requests, such as compute node(s) 134, 144, and 154.

[0040] Over time, processing clusters may encounter various scenarios where different management actions may be desirable to perform in order to improve performance of the processing cluster, which may then improve the performance of client applications that utilize processing clusters in order to access the managed database. Instead of waiting for client-specified actions (e.g., instructions, commands or other requests sent by a client) to perform various management actions, processing clusters may detect an idle period in order to perform a management action during the idle period. For example, leader node 140 may implement techniques to monitor for idle periods at network endpoint 122, as discussed in detail below with regard to FIGS. 3 and 16. In this way, during the idle period, leader node 140 may perform one (or more) management actions 143.

[0041] Management actions may take various forms and provide many different performance enhancements, improvements, or resource conservation measures, as discussed in detail with regard to FIGS. 4 and 5. For example, one management action may be an action to pause, and stop work of the processing cluster. Allowing resources, such as compute nodes 142 to perform other work for database service 110. Processing cluster may 140 can later resume, as discussed in detail below with regard to FIGS. 5 and 19. Other management actions can include various operations to improve the processing cluster operations (e.g., through software patches or other modifications) or improve the performance of the managed database, such as by performing data operations to clean-up, re-distribute, pre-generate views, or other actions that may make data more efficiently accessible for processing future queries.

[0042] Please note that the previous description of a database service is a logical description and thus is not to be construed as limiting as to the implementation of a database service, proxy service, processing clusters, or portions thereof.

[0043] This specification continues with a general description of a provider network that implements multiple different services, including a database service and storage service, which may implement the various embodiments of the techniques discussed above and below with regard to FIGS. 2 - 28. Then various examples of the database service and storage service, including different components/modules, or arrangements of components/module that may be employed as part of implementing the services are discussed. A number of different methods and techniques to implement various embodiments are then discussed, some of which are illustrated in accompanying flowcharts. Finally, a description of an example computing system upon which the various components, modules, systems, devices, and/or nodes may be implemented is provided. Various examples are provided throughout the specification.

[0044] FIG. 2 is a logical block diagram illustrating a provider network offering a database service that implements various embodiments discussed above with regard to FIG. 1 and below with regard to FIGS. 3 - 28, according to some embodiments. Provider network 200 may be a private or closed system or may be set up by an entity such as a company or a public sector organization to provide one or more services (such as various types of cloud-based storage) accessible via the Internet and/or other networks to clients 250. [0045] Provider network 200 may be implemented in a single location or may include numerous data centers hosting various resource pools, such as collections of physical and/or virtualized computer servers, storage devices, networking equipment and the like (e.g., computing system 3000 described below with regard to FIG. 29), needed to implement and distribute the infrastructure and storage services offered by the provider network 200. The provider network 200 can be formed as a number of regions, where a region is a separate geographical area in which the cloud provider clusters data centers. Each region can include two or more availability zones connected to one another via a private high speed network, for example a fiber communication connection.

[0046] An availability zone (also known as an availability domain, or simply a “zone”) refers to an isolated failure domain including one or more data center facilities with separate power, separate networking, and separate cooling from those in another availability zone. Preferably, availability zones within a region are positioned far enough away from one other that the same natural disaster should not take more than one availability zone offline at the same time. Customers can connect to availability zones of the provider network 200 via a publicly accessible network (e.g., the Internet, a cellular communication network). Regions are connected to a global network which includes private networking infrastructure (e.g., fiber connections controlled by the cloud provider) connecting each region to at least one other region. The provider network 200 may deliver content from points of presence outside of, but networked with, these regions by way of edge locations and regional edge cache servers. An edge location can be an extension of the cloud provider network outside of the traditional region/ AZ context. For example an edge location can be a data center positioned to provide capacity to a set of customers within a certain latency requirement, a set of servers provided to a customer’s premises, or a set of servers provided within (or forming part ol) a cellular communications network, each of which can be controlled at least in part by the control plane of a nearby AZ or region. This compartmentalization and geographic distribution of computing hardware enables the provider network 200 to provide low-latency resource access to customers on a global scale with a high degree of fault tolerance and stability.

[0047] The traffic and operations of the provider network may broadly be subdivided into two categories in various embodiments: control plane operations carried over a logical control plane and data plane operations carried over a logical data plane. While the data plane represents the movement of user data through the distributed computing system, the control plane represents the movement of control signals through the distributed computing system. The control plane generally includes one or more control plane components distributed across and implemented by one or more control servers. Control plane traffic generally includes administrative operations, such as system configuration and management (e.g., resource placement, hardware capacity management, diagnostic monitoring, system state information). The data plane includes customer resources that are implemented on the cloud provider network (e.g., computing instances, containers, block storage volumes, databases, file storage). Data plane traffic generally includes non-administrative operations such as transferring customer data to and from the customer resources. Certain control plane components (e.g., tier one control plane components such as the control plane for a virtualized computing service) are typically implemented on a separate set of servers from the data plane servers, while other control plane components (e.g., tier two control plane components such as analytics services) may share the virtualized servers with the data plane, and control plane traffic and data plane traffic may be sent over separate/distinct networks.

[0048] In some embodiments, provider network 200 may implement various computing resources or services, such as database service(s) 210, (e.g., relational database services, nonrelational database services, a map reduce service, a data warehouse service, and/or other large scale data processing services or various other types database services), data storage service 270 (e.g., object storage services or block-based storage services that may implement a centralized data store for various types of data), and/or any other type of network based services (which may include a virtual compute service and various other types of storage, processing, analysis, communication, event handling, visualization, and security services not illustrated).

[0049] In various embodiments, the components illustrated in FIG. 2 may be implemented directly within computer hardware, as instructions directly or indirectly executable by computer hardware (e.g., a microprocessor or computer system), or using a combination of these techniques. For example, the components of FIG. 2 may be implemented by a system that includes a number of computing nodes (or simply, nodes), each of which may be similar to the computer system embodiment illustrated in FIG. 29 and described below. In various embodiments, the functionality of a given system or service component (e.g., a component of database service 210 or data storage service 270) may be implemented by a particular node or may be distributed across several nodes. In some embodiments, a given node may implement the functionality of more than one service system component (e.g., more than one data store component).

[0050] Database services 210 may be various types of data processing services that perform general or specialized data processing functions (e.g., anomaly detection, machine learning, data mining, big data querying, or any other type of data processing operation). For example, in at least some embodiments, database services 210 may include a map reduce service that creates clusters of processing nodes that implement map reduce functionality over data stored in the map reduce cluster as well as data stored in data storage service 270. In another example, database service 210 may include various types of database services (both relational and non-relational) for storing, querying, and updating data. Such services may be enterprise-class database systems that are highly scalable and extensible. Queries may be directed to a database in data database service 210 that is distributed across multiple physical resources, and the database system may be scaled up or down on an as needed basis.

[0051] Database service 210 may work effectively with database schemas of various types and/or organizations, in different embodiments. In some embodiments, clients/subscribers may submit queries in a number of ways, e.g., interactively via an SQL interface to the database system. In other embodiments, external applications and programs may submit queries using Open Database Connectivity (ODBC) and/or Java Database Connectivity (JDBC) driver interfaces to the database system. For instance, database service 210 may implement, in some embodiments, a data warehouse service, that utilizes another data processing service, to execute portions of queries or other access requests with respect to data that is stored in a remote data store, such as data storage service(s) 270 (or a data store external to provider network 200) to implement distributed data processing for distributed data sets.

[0052] In at least some embodiments, database service 210 may be a data warehouse service. Thus in the description that follows database service 210 may be discussed according to the various features or components that may be implemented as part of a data ware house service, including control plane 220, proxy service 240, and processing clusters 232. Note that such features or components may also be implemented in a similar fashion for other types of database services and thus the following examples may be applicable to other types of database service 210. Database service 210 may implement one (or more) processing clusters that are attached to a database (e.g., a data warehouse). In some embodiments, these processing clusters may be designated as a primary and secondary (or concurrent, additional, or burst processing clusters) that perform queries to an attached database warehouse.

[0053] In embodiments where database service 210 is a data warehouse service, the data warehouse service may offer clients a variety of different data management services, according to their various needs. In some cases, clients may wish to store and maintain large of amounts data, such as sales records marketing, management reporting, business process management, budget forecasting, financial reporting, website analytics, or many other types or kinds of data. A client’s use for the data may also affect the configuration of the data management system used to store the data. For instance, for certain types of data analysis and other operations, such as those that aggregate large sets of data from small numbers of columns within each row, a columnar database table may provide more efficient performance. In other words, column information from database tables may be stored into data blocks on disk, rather than storing entire rows of columns in each data block (as in traditional database schemes). The following discussion describes various embodiments of a relational columnar database system implemented as a data warehouse. However, various versions of the components discussed below as may be equally adapted to implement embodiments for various other types of relational database systems, such as row- oriented database systems. Therefore, the following examples are not intended to be limiting as to various other types or formats of database systems.

[0054] In some embodiments, storing table data in such a columnar fashion may reduce the overall disk I/O requirements for various queries and may improve analytic query performance. For example, storing database table information in a columnar fashion may reduce the number of disk I/O requests performed when retrieving data into memory to perform database operations as part of processing a query (e.g., when retrieving all of the column field values for all of the rows in a table) and may reduce the amount of data that needs to be loaded from disk when processing a query. Conversely, for a given number of disk requests, more column field values for rows may be retrieved than is necessary when processing a query if each data block stored entire table rows. In some embodiments, the disk requirements may be further reduced using compression methods that are matched to the columnar storage data type. For example, since each block contains uniform data (i.e., column field values that are all of the same data type), disk storage and retrieval requirements may be further reduced by applying a compression method that is best suited to the particular column data type. In some embodiments, the savings in space for storing data blocks containing only field values of a single column on disk may translate into savings in space when retrieving and then storing that data in system memory (e.g., when analyzing or otherwise processing the retrieved data).

[0055] Database service 210 may be implemented by a large collection of computing devices, such as customized or off-the-shelf computing systems, servers, or any other combination of computing systems or devices, such as the various types of systems 3000 described below with regard to FIG. 29. Different subsets of these computing devices may be controlled by control plane 220. Control plane 220, for example, may provide a cluster control interface to clients or users who wish to interact with the processing clusters, such as processing cluster(s) 232a, 232b, and 232c managed by control plane 220. For example, control plane 220 may generate one or more graphical user interfaces (GUIs) for clients, which may then be utilized to select various control functions offered by the control interface for the processing clusters 232 hosted in the database service 210. Control plane 220 may provide or implement access to various metrics collected for the performance of different features of database service 210, including processing cluster performance, in some embodiments.

[0056] As discussed above, various clients (or customers, organizations, entities, or users) may wish to store and manage data using a database service 210. Processing clusters 232 may respond to various requests, including write/update/store requests (e.g., to write data into storage) or queries for data (e.g., such as a Server Query Language request (SQL) for particular data), as discussed below with regard to FIGS. 3 and 5. For example, multiple users or clients may access a processing cluster to obtain data warehouse services.

[0057] For databases manually managed by users, database service 210 may provide network endpoints directly to the clusters which allow the users manage in order to implement client applications that send requests and other messages directly to a particular cluster. Network endpoints, for example may be a particular network address, such as a URL, which points to a particular cluster. For instance, a client may be given the network endpoint “http://mycluster.com” to send various request messages to. Multiple clients (or users of a particular client) may be given a network endpoint for a particular cluster. Various security features may be implemented to prevent unauthorized users from accessing the clusters.

[0058] In at least some embodiments, database service 210 may implement proxy service 240 to provide access to databases (e.g., data warehouses) hosted in database service 210. For databases managed by database service 210, database service 210 may provide database endpoints 242 (e.g., network endpoints) for a hosted database. Database endpoints 242 may not provide direct access to a particular processing cluster 232, as the processing cluster used to respond to such requests (e.g., queries) may change according to various scaling techniques. Instead, client applications may utilize the database endpoint 242 for a database to be included in various client applications or other communications for database access so that proxy service 240 can direct the requests to the appropriate processing cluster without the client application having to be altered every time a change in processing cluster (e.g., scaling operations) are performed by database service 210. In this way, database service 210 can perform scaling and other management operations without interfering with client applications.

[0059] Processing clusters, such as processing clusters 232a, 232b, and 232c, hosted by database service 210 may provide an enterprise-class database query and management system that allows users to send data processing requests to be executed by the clusters 232, such as by sending a query. Processing clusters 232 may perform data processing operations with respect to data stored locally in a processing cluster, as well as remotely stored data. For example, data storage service 270 implemented by provider network 200 that stores remote data, such as backups or other data of a database stored in a cluster. In some embodiments, database data 272 may not be stored locally in a processing cluster 232 but instead may be stored in data storage service 270 (e.g., with data being partially or temporarily stored in processing cluster 232 to perform queries). Queries sent to a processing cluster 23 (or routed/redirect/assigned/allocated to processing cluster(s)) may be directed to local data stored in the processing cluster and/or remote data. Therefore, processing clusters may implement local data processing, such as local data processing, (discussed below with regard to FIG. 5) to plan and execute the performance of queries with respect to local data in the processing cluster, as well as a remote data processing client.

[0060] Database service 210 may implement different types or configurations of processing clusters. For example, different configurations A 230a, B 230b, and C 230c, may utilize various different configurations of computing resources, including, but not limited to, different numbers of computational nodes, different processing capabilities (e.g., processor size, power, custom or task-specific hardware, such as hardware accelerators to perform different operations, such as regular expression searching or other data processing operations), different amounts of memory, different networking capabilities, and so on. Thus, for some queries, different configurations 230 of processing cluster 232 may offer different execution times. Different configurations 230 of processing clusters 232 may be maintained in different pools of available processing clusters to be attached to a database. Attached processing clusters may then be made exclusively assigned or allocated for the use of performing queries to the attached database, in some embodiments. The number of processing clusters 232 attached to a database may change over time according to the selection techniques discussed below.

[0061] In some embodiments, database service 210 may have at least one processing cluster attached to a database, which may be the “primary cluster.” Primary clusters may be reserved, allocated, permanent, or otherwise dedicated processing resources that store and/or provide access to a database for a client, in some embodiments. Primary clusters, however, may be changed. For example, a different processing cluster may be attached to a database and then designated as the primary database (e.g., allowing an old primary cluster to still be used as a “secondary” processing cluster or released to a pool of processing clusters made available to be a attached to a different database). Techniques to resize or change to a different configuration of a primary cluster may be performed, in some embodiments, such as the pause and resume techniques discussed below with regard to FIG. 6B. The available processing clusters that may also be attached, as determined, to a database may be maintained (as noted earlier) in different configuration type pools, which may be a set of warmed, pre-configured, initialized, or otherwise prepared clusters which may be on standby to provide additional query performance capacity in addition to that provided by a primary cluster. Control plane 220 may manage cluster pools by managing the size of cluster pools (e.g., by adding or removing processing clusters based on demand to use the different processing clusters).

[0062] As databases are created, updated, and/or otherwise modified, snapshots, copies, or other replicas of the database at different states may be stored separate from database service 210 in data storage service 250, in some embodiments. For example, a leader node, or other processing cluster component, may implement a backup agent or system that creates and store database backups for a database to be stored as database data 272 in data storage service 270. Database data 272 may include user data (e.g., tables, rows, column values, etc.) and database metadata (e.g., information describing the tables which may be used to perform queries to a database, such as schema information, data distribution, range values or other content descriptors for filtering out portions of a table from a query, a superblock, etc.). A timestamp or other sequence value indicating the version of database data 272 may be maintained in some embodiments, so that the latest database data 272 may, for instance, be obtained by a processing cluster in order to perform queries. In at least some embodiments, database data 272 may be treated as the authoritative version of data, and data stored in processing clusters 232 for local processing as a cached version of data.

[0063] Data storage service 270 may implement different types of data stores for storing, accessing, and managing data on behalf of clients 250 as a network-based service that enables clients 250 to operate a data storage system in a cloud or network computing environment. Data storage service(s) 270 may also include various kinds of object or file data stores for putting, updating, and getting data objects or files. For example, one data storage service 270 may be an object-based data store that allows for different data objects of different formats or types of data, such as structured data (e.g., database data stored in different database schemas), unstructured data (e.g., different types of documents or media content), or semi-structured data (e.g., different log files, human-readable data in different formats like JavaScript Object Notation (JSON) or Extensible Markup Language (XML)) to be stored and managed according to a key value or other unique identifier that identifies the object. In at least some embodiments, data storage service(s) 270 may be treated as a data lake. For example, an organization may generate many different kinds of data, stored in one or multiple collections of data objects in a data storage service 270. The data objects in the collection may include related or homogenous data objects, such as database partitions of sales data, as well as unrelated or heterogeneous data objects, such as audio files and web site log files. Data storage service(s) 270 may be accessed via programmatic interfaces (e.g., APIs) or graphical user interfaces. For example, format independent data processing service 220 may access data objects stored in data storage services via the programmatic interfaces.

[0064] Generally speaking, clients 250 may encompass any type of client that can submit network-based requests to provider network 200 via network 260, including requests for storage services (e.g., a request to query a database service 210, or a request to create, read, write, obtain, or modify data in data storage service(s) 270, etc.). For example, a given client 250 may include a suitable version of a web browser, or may include a plug-in module or other type of code module that can execute as an extension to or within an execution environment provided by a web browser. Alternatively, a client 250 may encompass an application such as a database application (or user interface thereof), a media application, an office application or any other application that may make use of database service(s) 210 or storage resources in data storage service(s) 270 to store and/or access the data to implement various applications. In some embodiments, such an application may include sufficient protocol support (e.g., for a suitable version of Hypertext Transfer Protocol (HTTP)) for generating and processing network-based services requests without necessarily implementing full browser support for all types of network-based data. That is, client 250 may be an application that can interact directly with provider network 200. In some embodiments, client 250 may generate network-based services requests according to a Representational State Transfer (REST)-style network-based services architecture, a document- or message-based network-based services architecture, or another suitable network-based services architecture.

[0065] In some embodiments, a client 250 may provide access to provider network 200 to other applications in a manner that is transparent to those applications. For example, client 250 may integrate with an operating system or file system to provide storage on one of data storage service(s) 270 (e.g., a block-based storage service). However, the operating system or file system may present a different storage interface to applications, such as a conventional file system hierarchy of files, directories and/or folders. In such an embodiment, applications may not need to be modified to make use of the storage system service model. Instead, the details of interfacing to the data storage service(s) 270 may be coordinated by client 250 and the operating system or file system on behalf of applications executing within the operating system environment. Similarly, a client 250 may be an analytics application that relies upon data processing service(s) 210 to execute various queries for data already ingested or stored in the data processing service (e.g., such as data maintained in a data warehouse service).

[0066] Clients 250 may convey network-based services requests (e.g., access requests to read or write data may be directed to data in data storage service(s) 270, or operations, tasks, or jobs, such as queries, being performed as part of data processing service(s) 210) to and receive responses from provider network 200 via network 260. In various embodiments, network 260 may encompass any suitable combination of networking hardware and protocols necessary to establish network-based-based communications between clients 250 and provider network 200. For example, network 260 may generally encompass the various telecommunications networks and service providers that collectively implement the Internet. Network 260 may also include private networks such as local area networks (LANs) or wide area networks (WANs) as well as public or private wireless networks. For example, both a given client 250 and provider network 200 may be respectively provisioned within enterprises having their own internal networks. In such an embodiment, network 260 may include the hardware (e.g., modems, routers, switches, load balancers, proxy servers, etc.) and software (e.g., protocol stacks, accounting software, firewall/security software, etc.) necessary to establish a networking link between given client 250 and the Internet as well as between the Internet and provider network 200. It is noted that in some embodiments, clients 250 may communicate with provider network 200 using a private network rather than the public Internet. In some embodiments, clients of data processing services 210 and/or data storage service(s) 270 may be implemented within provider network 200 (e.g., an application hosted on a virtual computing resource that utilizes a data processing service 210 to perform database queries) to implement various application features or functions and thus various features of client(s) 250 discussed above may be applicable to such internal clients as well.

[0067] FIG. 3 is a logical block diagram of monitoring for idle periods at network endpoints at a leader node of a processing cluster for performing management actions, according to some embodiments. As indicated at 302, a query may be received at proxy service 240 via database endpoint 310. For example, query 302 may be sent to a network address or other location specific to a database managed by database service 210. Proxy service 240 may implement one or more components to listen for queries, like query 302, at database endpoint 310.

[0068] In various embodiments, proxy service 240 may implement a database query queue (not illustrated). The database query queue may be a queue of queries directed to a same database, in various embodiments. For example, queries directed to different databases may be received via different database endpoints and put into different respective database query queues (as opposed to being comingled in a common queue). The database query queue may be a database-wide query queue, separate from any workload or other queues implemented on attached processing clusters for the database. Queries, like query 302, may then be pulled from the database query queue and routed 333 by query routing (e.g., according to a First In First Out (FIFO) order) to an attached processing cluster, such as processing cluster 340. [0069] Proxy service 240 may then implement various techniques in order to make routing decisions that scale the processing clusters attached to the database in accordance with optimizing both query performance and processing cluster utilization. For example, response time predictions may be determined for any attached processing clusters as well as another other processing cluster configurations that may be attached (but are not currently attached), based on the predicted execution time of a cluster (e.g., which may be predicted by machine learning models or statistical analysis of previous queries), bootstrap time (e.g., an amount of time to prepare and attach a new processing cluster to serve a query), and queue time (e.g., time a query may be queued at a processing cluster while other queries are performed before being performed by that processing cluster). Proxy service 240 may route the query to an attached processing cluster 340 to the database, as indicated at 333, which may perform the query and return a result 304.

[0070] Processing cluster 340 may implement leader node 350 and compute nodes 360 to handle queries and other routed requests, as discussed in detail below with regard to FIG. 5. Leader node 350 may also implement idle period monitor 352 in order to detect an idle period for the database at database endpoint 310. For example, idle period monitor 352 may evaluate session state information, cursor movement, or various other indications of activity for a client of the database. Different types of idleness may be detected for different management actions, in some embodiments (e.g., different lengths or indicators of idleness). For example, different sets of criteria may be considered for different management actions (or categories of management actions). Thus, idle periods detected may have different idle period types corresponding to the set of criteria used to detect an idle period. Idle periods may be reported to management action execution 354.

[0071] Management action execution 354 may determine what management actions to perform during an idle period based on the actions stored or indicated in management actions 358 (e.g., a queue or other data structure indicating the management actions to be performed. Local action detection 356, for example, may detect actions that can be detected locally. For example, one such local action may be an action to vacuum or clean up storage (e.g., performing one more data block movements to consolidate storage space left as a result of deletions performed at the database) based on a time since a last vacuum, fragmentation measures, or various other indicators of need for storage clean up (e.g., a number of writes to the database). Another example of a location action may be an action to re-distribute data to improve query performance (e.g., by identifying and utilizing a different data distribution scheme for the processing cluster, such as utilizing a different one (or more) columns as anew distribution key for the database), which may be detected based on an amount of data shuffled between compute nodes of a processing cluster or other indicators of data that is to be joined that is not located together. Another example of a local action may be an action to generate views or other pre-computed results that can improve query performance. For example, previously received queries can be examined to detect commonly queried results. Such queries can be used to generate an internal materialized view which can then be used to perform a subsequent query. Another local action may be to pause processing cluster 340 in order to reduce processing costs for database service 210 and clients that utilize the processing cluster, as discussed in detail below with regard to FIGS. 5-6B. Pauses may also allow for resizing and other cluster reconfiguration, in some embodiments.

[0072] Control plane 220 may also indicate service management actions 335 to management actions 358, such as software patch installations, network configuration changes (e.g., update security rules) or other configuration changes for security or other operations. For example, control plane 220 can coordinate the performance of patch installations across a fleet of host systems implementing processing clusters, by staggering installation instructions as different service management actions 335 that can be applied by a processing cluster 340 when an idle period is detected (and by an update deadline specified by control plane 220).

[0073] Different management actions may be taken in different ways. FIG. 4 is a logical block diagram illustrating performing a management action at a processing cluster, according to some embodiments. For example, in processing cluster 400, leader node 410 may utilize management action execution 414 to provide various instructions 404 to compute nodes 420a, 420b, and 420c. Such instructions may include operations to perform storage vacuum (e.g., defragment or packing data records together to remove gaps or spaces after record deletions), perform view updates or create new views (e.g., to proactively create or update materialized views that are likely to be accessed by having compute nodes 420 query and update a view in storage), data re-distribution (e.g., changing a distribution key or other schema for distributing data amongst compute nodes 420 by shuffling or moving portions of database data around), among other management actions. In some embodiments, management action operation instructions 404 may include instructions to perform control plane initiated management actions (e.g., to execute scripts, install patches, or perform other changes).

[0074] Another type of management action may include pausing processing clusters. Paused processing clusters may allow for computing resources, such as compute nodes to be released to do other tasks for database service 210. In this way, customer of database service 210 can achieve cost savings because the resources do not have to be retained and not used, as well as database service 210, which can more efficiently utilize computing resources. [0075] Paused processing clusters, however, do not interfere with the ability of client applications to access managed databases. FIG. 5 is a logical block diagram illustrating an example of query handling with a paused processing cluster for a managed database, according to some embodiments. Proxy service 240 may implement database endpoint 530 which may provide access to paused processing cluster 510 and processing cluster 520, each of which may have respective leader nodes, 512 and 522, and compute nodes 514, 516, 518, 522, 524, and 526.

[0076] Database endpoint 530 may still provide access to the managed database even though processing cluster 510 is paused. For example, queries, such as query 532 can still be accepted and routed either to paused processing cluster (as leader node 512 is still available and can answer metadata queries 540 about the managed database), or either data or metadata queries 550 can be sent to processing cluster 520 and answered by leader node 522. A result 534 can then be returned in response. Write requests, such as write 536, which may insert, modify, or delete data, however, may be rejected, as indicated at 538 in order to prevent one processing cluster from creating a version of the database inconsistent with another version in paused processing cluster 510.

[0077] FIG. 6A is a logical block diagram illustrating an example of resuming from pause at a processing cluster, according to some embodiments. For example, in scene 602, processing cluster 620 may implement leader node 630, which may implement resume event monitoring (implementing techniques discussed below with regard to FIG. 17). As indicated at 634a, 634b, 634c, and 634d, a number of released compute nodes may have been previously implemented as part of processing cluster 620. Resume event monitoring 632 may send a request to obtain compute nodes, as indicated at 633. Control plane 220, as indicated in scene 604 may provide a same number of compute nodes to be attached, as indicated at 635 (attached nodes 636a, 636b, 636c, and 636d).

[0078] Pause events may be an opportunity to resize the processing cluster up or down. In FIG. 6B, an example of sizing up the cluster is given. Similar to FIG. 6A, in scene 642 processing cluster 620 may implement leader node 630, which may implement resume event monitoring (implementing techniques discussed below with regard to FIG. 17). As indicated at 634a, 634b, 634c, and 634d, a number of released compute nodes may have been previously implemented as part of processing cluster 620. Resume event monitoring 632 may send a request to obtain compute nodes, as indicated at 633. The requested compute nodes may be increased (e.g., 6 compute nodes). In scene 644, as indicated at 645, control plane 220 may attach the 6 compute nodes, 646a, 646b, 646c, 646d, 646e, and 646f, increasing the size of processing cluster 620. Similar techniques may be implemented to use a pause management action to down-size the cluster. Other reconfigurations, related to swapping nodes or modifying allocations, capabilities, or other processing cluster configurations could also be implemented, in some embodiments.

[0079] FIG. 7 is a logical block diagram illustrating a restore into to a managed database feature offered by a database service, according to some embodiments. Note some (or all) of these features may be implemented on processing clusters directly, or in combination between control plane 220 and processing clusters 232. Restore into managed database 222 may allow for clients to request (e.g., via control plane interface request or directly to a processing cluster) a change from a database being manually managed at a user configured processing cluster into a managed database, which the database service configures and otherwise manages to dynamically adapt computing resources allocated to the database, as discussed above. As discussed in detail below with regard to FIGS. 18-21 , a transfer plan may be performed to determine how to transfer data in a current topology (e.g., a number of data slices on one or more different compute nodes with a number of node slices) to a new topology selected for the database by database service 210.

[0080] As discussed below with regard to FIG. 18, online restore may include two phases. Phase 1 restore 710 may include features, such as data collection 712 to collect a database manifest (e.g., database files or other data objects storing database data) and other information describing a database, such as metadata that describes the database schema, superblocks that describe the contents or arrangement of data, etc., as discussed below with regard to FIGS. 18-21. As indicated at 718, a general distribution scheme conversion may be applied as part of the phase 1 restore 710 when storing data in a new topology (e.g., a new processing cluster).

[0081] Phase 2 of online restore 720 may include background data distribution scheme conversion 722, as discussed in detail below with regard to FIGS. 18-21.

[0082] In some embodiments, pause, retry, and resume techniques may be implemented for background data distribution scheme conversion 722 so as not interfere with database access requests.

[0083] For example, the following pseudo-code describes one such technique using checkpoints: struct TaskStats {

// Statistics on task interruption and score //calculation // To be introduced

} def UPDATE_PRIORITY_SCORE ( task) :

// TO be introduced later def SimulateWorkerBehavior (task) :

// In reality a worker wi ll randomly chosen between 1. Task INTERRUPTED

// TaskStats get updated accordingly

UPDATE_PRIORITY_SCORE (task, IsCKPDestroyed)

2. Task FINISHED

3. Task CHECKPOINTED task.CKP += 1 def Exhuast (candidate_list , retry_list) : """

Exhuast the ongoing lists until all tasks are either completed or moved to retry list retry_list is an ordered set which maintains the sortness of the task based on the backoff score. while candidate_list : task = candidate_list .back ( ) / / Get task from the back of the list res = SimulateWorkerBehavior (task) switch res: case INTERRUPTED: retry_list . add (task) case FINISHED: candidate_list . remove (task) case CHECKPOINTED: do nothing def Ref ill (candidate_list , retry_list) :

Dump the retry list to the ongoing list and preserving the sortness ongoing list = [ task with the biggest priority score . . . task with the smallest def main() candidate_list = [all tasks] retry_list = ordered_set ( ) while candidate_list : Exhaust (candidate_list , retry_list) Refill (candidate_list , retry_list)

[0084] In some embodiments, a priority score may bused to determine how to pick tasks to work on in the background (e.g., by picking the task with the lowest score for the next round of conversion). For example, given a restore operation, some metrics may be observed, such as INT: num of time get interrupted, CKP: when interrupted, it is in its #CKP-th checkpoint (1 -based), and IsCKPDestroyed: whether the interruption leads to the loss of checkpoint. This may allow us to determine a priority score by: def PRIORITY_SCORE (CKP, INT, IsCKPDestroyed) : INT += ( CKP+ 1 ) if I sCKPDestroyed else 1

CKP = 0 if IsCKPDestroyed else CKP return INT

[0085] In some embodiments, techniques for online restore may include the following features. 1 Import data from clusters with dilferent topology than the managed (sometimes referred to as “serverless”) cluster. 2. Import data from unencrypted clusters or from clusters with dissimilar encryption keys, into encrypted Serverless endpoint. 3. Import data operation has to be fast. Database service 210 may support restoring snapshot across instance types, e.g. user can restore a snapshot taken on a first topology of a cluster with a particular configuration (e.g., varying numbers and/or hardware or software capabilities of compute nodes in a cluster) to a cluster with a second, different topology and vice versa.

[0086] One of the features for migrating existing clusters to a serverless management mode is that the cluster may need to be ready immediately for read-write access, which could rule out using offline restore as it can take significantly long time to migrate. To provide the feature of faster migration into a serverless management mode across different topologies, in some embodiments, online restore techniques may be used for migrating snapshots into a serverless management mode. [0087] As discussed above, a manifest, such as BAR (Business Archive) manifest, may be used to store information about the layout and location of database data (e.g., in a snapshot or other storage location. FIG. 8 is a logical block diagram illustrating interactions to perform online restore to a different cluster topology, according to some embodiments.

[0088] Leader node 510 may be a leader node of a processing cluster, as discussed above with regard to FIG. 2 and below with regard to FIG 13. In some embodiments, leader node 810 may implement a separate restore management process, function or feature, such as a restore controller. Compute nodes 820 may be may be implemented similar to compute nodes that are part of a processing cluster, as discussed above with regard to FIG. 2 and below with regard to FIG. 13. Like leader node 820, compute nodes 820 may implement a separate process, function, or feature, such as a restore worker to perform various operations related to online restore.

[0089] As part of phase one of online restore, leader node 810 may request 831 restore data, such as a manifest (e.g., BAR manifest), superblock, and/or other database metadata) from compute nodes 820. Compute nodes 820 may be able to send one (or more) access requests to obtain the restore data, as indicated at 832, from data storage service 870. For example a backup file, storage location, or other set of one or more data objects may include the restore data and may be read or otherwise obtained. Compute nodes 820 may separate or extract the restore used by leader node 820, such as various database metadata or other information used to coordinate performance of queries, and return it to leader node 810, as indicated at 833.

[0090] Consider one example embodiment where a leader node may communication with a designated compute node, Compute Node 0 of a processing cluster, that will download a superblock (SB) and fetch the LeaderSnapshot and write ahead log (WAL). Then, the leader node may retrieve the LeaderSnapshot from Compute Node 0 and reconstructs the entire database catalog. The imported catalog may contain all entities (databases, schemas, tables, views and global metadata) that existed on the previously provisioned cluster (with the different topology). The imported entities retain the object identifiers (OID’s) that existed on the provisioned cluster. At the end of the initial restore, the last committed Transaction ID recorded in the previously provisioned cluster’s snapshot will be the base TransactionlD that the new processing cluster will start from.

[0091] As discussed above with regard to FIG. 7, a transfer plan using a general distribution scheme may be implemented to generate instructions to transfer database data into a target cluster, which may have a different topology. FIG. 9 is a logical block diagram illustrating example scenarios of different topologies with different transfer plans, according to some embodiments.

[0092] For example, the restore plan consists of steps describing how the slices will be migrated from the storage service 270 backup into the restored cluster. Each step in the restore plan contains producer and consumer node numbers for a given global node slice, and the local node slice # on target compute node where the incoming source global slice is being transferred. An example restore plan:

{

"backup_id" : " abcde" ,

"old_cluster_s ize" : 3 ,

"new_cluster_size" : 5 ,

"num_steps " : 14 ,

" steps" : [

III For CN0 on restored cluster

{

" src_node" : 0 ,

" src_slice_num" : 0 ,

"dst_node" : 0 ,

"dst_slice_nume" : 0 b

III For CN3 on restored cluster {

" src_node" : 1 , " src_slice_num" : 3 , "dst_node" : 3 , "dst_slice_num" : 1 b ] }

[0093] The generated restore plan is serialized and broadcasted to each of the compute nodes. Compute nodes use the restore plan to download the necessary incoming source superblocks from storage service 270 and perform the slice transfer operation.

[0094] To migrate across dissimilar topologies and in specific perform data import onto the target cluster topology, the restore slice transfer plan may be modified to perform generation to handle the following additional scenarios. Scenario 1: Number of slices in the snapshot is less than the number of slices on target cluster. To handle this scenario, distribute the global slices in a round-robin fashion from source topology 902 to destination topology 906 as indicated in mapping 904 onto the available target compute nodes taking into account the data skew across nodes. The empty / unoccupied slices are going to be added to the node-slice mapping during topology generation form the restore slice transfer plan.

[0095] Scenario 2 illustrates that a number of node slices in the source topology 912 is greater than the number of node slices on the destination topology 916. To handle this scenario the global data slices [0, N] may be initially placed that are valid on the target cluster in a round-robin fashion, taking into account the data skew across nodes, as indicated in mapping 914. Then, the extra global data slices [N+l, K] may be placed from the source cluster on target in a round-robin fashion. These extra global slices are going to be merged to the global node slices [0, N] on the target cluster.

[0096] In some embodiments, during the slice transfer phase based on the restore slice transfer plan, each target compute node downloads the relevant source metadata (such as a superblock) from storage service 270. Table blockchains of relevant data slices are extracted from the downloaded source superblock and then grafted into the live superblock, after making necessary adjustments (e.g. a node slice that is owning the blockchain etc.,).

[0097] Once the table blockchains are grafted into the live superblock, the metadata transfer phase may start. During the metadata transfer phase, transfer of the following metadata types from the snapshot to the restored cluster may be performed: ActiveXid: Storage engine has a list of pairs transaction ID, table ID> for transactions that are uncommitted and the tables associated with those transactions. During restart, run UNDO on these transactions, and X-Restore will do the same operation. Compressor: Contains the metadata relevant for interleaved sortkeys on a given table. Identity metadata: Contains the identity watermark for each of the identity column blockchains. This is the maximum identity value for each identity column, table, slice. Tiered storage cache lists (Hot and Cold): FDisk has the list of blocks and their states in the cache, namely Cold Untouched, Cold Touched, Hot Untouched, Hot Touched. After the restore, this list would be populated in FDisk for each target node. Table partition metadata: This metadata is used for auto vacuum sort and contains partition information (like row count and flags) for all partitions of each table, slice pair.

[0098] In order to support the scenario where the number of data slices in the snapshot is greater than the number of node slices on target cluster, an extra step may be performed during slice transfer phase to support 1. Merging permanent table blockchains. 2. Merging or recreating the relevant slice wise metadata.

[0099] Merging Permanent Table Blockchains: Slice transfer in cross instance restore involves, serializing data slice block headers from downloaded source superblock followed by deserialization process where the block headers are grafted into live superblock and update table registry. During serialization phase, fetch the starting oflset of the perm table blockchain and copy the source blockchain, from the downloaded superblock, into a destination bufler (which is a segmented superblock). During the deserialization phase, extract the table blockchains from the serialized superblock. The extracted blockchain is then added into the live superblock on the target compute node. Then transformation may be performed on the grafted blockchain to update a) new node slice owning the block, b) disk address and preferred disk no, c) guid rename if necessary etc., Once the blockchain transformation is complete, hookup the blockchain to the relevant tabledata slice- column.

[00100] The above is the current behavior during the slice transfer of table blockchains, in order to support the slice merging operation, use the following extra steps - 1. If the table-slice-column blockchain is not empty during grafting (e.g. already attached a blockchain to this slice) - a. Append the blockchain to the existing blockchain, b. Mark all the blocks in the appended blockchain as unsorted, c. Remove empty block after sorted region during append operation. 2. If the column is a row id column, in addition to step 1 - a. Update the last rowid in table perm to be the max of all seen rowids. b. Persist the max rowid value per slice in the blockchain metadata, c. Reestablish the starting row id value on each data slice during database engine restart, from the persisted metadata. 3. If the column is a identity column, in addition to step 1 - a. Persist the max identity value per slice in the blockchain metadata, b. Reestablish the starting identity value on each data slice during database management system (DBMS) restart, from the persisted metadata. Table blockchain merge operation would be only necessary for non dist-all tables. DistALL tables are only grafted once on a given compute node local slice 0.

[00101] In various embodiments, ,etadata transfer in cross instance restore involves, serializing the internal metadata from downloaded source superblock followed by a deserialization step where hookup of the metadata is performed. In order to support the scenario where the number of slices in the snapshot is greater than the number of slices on target cluster, merging or recreating may be supported for the relevant slice wise metadata. Below is a description of how to handle this operation per metadata type - ActiveXid: Is cluster wide metadata and no additional changes are required to support slice merge operation. Compressor: Rebuild the z-compressors as the sortedness of the table is modified after slice merge operation. Identity metadata: Set the identity high water mark for the target node slice to be max of merged source data slices. Leader to reestablish the identity values across data slices (e.g. identity value a slice to generate) after the metadata merge operation. Table partition metadata: Retain the table partition info of the initially grafted slice, discard partition metadata from merged data slices. Tiered storage cache lists (Hot and Cold): populate tiered storage cache by iterating hot list across incoming nodes followed by cold list.

[00102] In some embodiments, encryption may be handled as part of performing an online restore. For example, encryption handling may be implemented in phase one of online restore, according to some embodiments. When users use a key management service (KMS) of provider network 210 for key management with Database service 210, there is a four-tier hierarchy of encryption keys. These keys, in hierarchical order, are the master key, a cluster encryption key (CEK), a database encryption key (DEK), and data encryption keys (block level keys). Database Encryption Key (DEK): The AES256 key, control plane will pass to DBMS on startup. Trent Wrapper Key (TWK): Encrypts the DEK and stored in AdminDB. The TWK is generated using the KMS, KMS Key (KMS): The key that lives in KMS. The KMS key does not leave KMS, so encrypt/decrypt using the KMS key by calling the KMS APIs Encrypt and Decrypt. Additional key (O): An additional key may be used to encrypt all sensitive data in AdminDB. This isn't really necessary from a security standpoint, but in order to keep all of the records in the ENCRYPTION KEY column in the DB INSTANCES table consistent, also use the additional key in our hierarchy.

[00103] Below is one example of a sequence of actions for handling encryption during the restore operation -

1. Start DBMS in Restore mode using - a. cqi restore <backup_id> <Keyl: DEK from the source snapshot> <key2: NotAValidKey>

2. Start DBMS in Restore Bootstrap mode using - a. cqi xstart -restore-bootstrap <Keyl: DEK from the source snapshot> <Key2: Newly Created Cluster DEK>

3. Perform Key Rotation using - a. cqi rotate_key <Key2: Newly Created Cluster DEK> b. During Key Rotation, decrypt per data block key using the original DEK and reencrypt using the new DEK.

[00104] Key rotation does not require re-encrypting the data block and its only metadata (Block Header update) operation. The only place re-encrypt of the data block is currently performed is when cross-account restore is performed or when modification of the GUID of the data block is performed. Since GUID is used as the encryption key while encrypting the data block, modifying this would require re-encrypting the data blocks.

[00105] To support restoring an unencrypted snapshot into an encrypted cluster, the following features may be implemented. 1. Control plane will pass a new flag “-snapshot-not-encrypted” to indicate snapshot is not encrypted to data plane. 2. Control plane will pass the newly created cluster DEK in the restore phase- 1 itself, this is required by data plane to start the cluster with a valid encryption key and generate the block keys.

[00106] Here is an example of how the interface would look -

Restore phase - 1 : cqi restore <backupld>

NotAValidKey Key2 : Newly Created Cluster DEK

-use - target -topology

- snapshot -not - encrypted

Restore bootstrap mode : cqi xstart - f -restore -bootstrap

-keyl <Keyl : Plain string "NotAVal idKey" >

-key2 <Key2 : Newly Created Cluster DEK>

-use - target -topology -restart - reason <xyz>

Rotate Key (no-op from DP s ide for f irst time ) : cqi rotate_key <Key2 : Newly Created Cluster DEK> [00107] To support restoring an unencrypted snapshot into an encrypted cluster, the following features may be implemented in various embodiments. Start the restored cluster using encryption key 2 (e.g. the newly created cluster’s DEK) and use it as root key. Rename GUID and assign new block key for data blocks restored from unencrypted snapshot. Track blocks restored from unencrypted snapshot, until perform encryption state changes.

[00108] Use Encryption Key 2 as root key on the restored cluster. In the default encrypted snapshot restore case, start the cluster with Keyl (e.g. the DEK of the source cluster) and use this Keyl to download the superblock, decrypt the data blocks etc.,. Then switch to Key 2 when the key rotation happens. Keyl is not a valid key (as it does not even exist) when restoring an unencrypted snapshot, so use the valid Key2 to start the cluster. During the initial cluster creation (prior to x-restore slice transfer), compute nodes are started with encryption Key 2 and the “m enc key hash” in superblock mid will store the hash of encryption key 2.

[00109] Leader and compute node’s both use valid encryption key 2 as the clusters root key (Xen^m root key), from the restore phase- 1 itself. Rename GUID and assign new block key for data blocks restored from unencrypted snapshot

[00110] During restore with encryption state change, rename the GUID’s of the blocks and also reset the storage service flag. This is required to re-upload the block with new GUID after encryption. To support this, API’s ShouldRestoreCl earinstorage service 270Bit() and ShouldRestoreRenameBlockGuidQ may return true, when performing a x-restore with encryption state change.

[00111] Then a new block key may be assigned for all the data blocks restored from an unencrypted snapshot. This step may be performed to make sure, when the block is eventually written out either by backup or re-replication, that the same block key is used to encrypt the data block.

[00112] Renaming the block GUID and assigning the new block key happen during the slice transfer post processing during x- restore phase- 1.

[00113] Restored blocks may be tracked that require encryption state change by setting BlockNeedsEncryptionStateChange flag inside the block header flags field. This would be the bit 26. BlockNeedsEncryptionStateChange indicates that while reading this block do not need to decrypt, but while writing this block out (either by backup or re-replication) may need to write it as encrypted. BlockNeedsEncryptionStateChange flag is cleared in the following paths -

• Re-replication write IO completion, as the block is encrypted with valid key. • Backup completion and while marking in storage service flag and block is still underrepped. o Clear in this path as subsequent accesses will fetch new GUID block from storage service 270 which is encrypted.

• Classic/X-Restore/Elastic-resize/Burst where a backup is taken and encryption state change performed.

[00114] This flag may be retained for workflows which have not yet performed the state change

• Disaster Recovery from last storage service 270 Commit

• Data sharing from storage service 270 Commit.

• Burst if based of storage service 270 Commit.

[00115] Tagging blocks that require encryption state change with BlockNeedsEncryptionStateChange, would enable us to operate the cluster in read-write mode, where newly ingested blocks are already encrypted and restored blocks are being encrypted in the background. FIG. 10 is a logical block diagram illustrating interactions to make an encryption state change, according to some embodiments.

[00116] BlockNeedsEncryptionStateChange bit may be introduced in the block hdr flag to track blocks yet to be encrypted. For blocks with this flag set, do not decrypt the block and read it as is and while writing the block encrypt using the valid key.

[00117] In order to support restoring an unencrypted snapshot into serverless, there are few modifications implemented from both control plane and data plane side - 1. Control plane has to pass a new flag to notify data plane that the snapshot is unencrypted. 2. Data plane has to treat this restore as a special case (similar to cross account), where re-guid all the blocks. Current if a block is marked as requiring re-guid, decrypt the block using the original GUID and then re-encrypt it using the new GUID. In this special case, may not need to perform the decrypt operation as the original cluster is unencrypted.

[00118] Once phase 1 is completed, as indicated when a transfer plan complete indication 836 is received at leader node 810 from compute nodes, phase 2 restore may begin as a background process in order to avoid interfering with database access. For example, leader node 810 may initiate work to implement a custom distribution (e.g., according to a distribution key), as indicated at 841, at compute nodes 820. To allow client access with minimal downtime, compute nodes 820 can use an automatic table optimization infrastructure to convert distribution to the custom distribution, such as using techniques to create shadow tables, as indicated at 842 and discussed in detail below with regard to FIG. 10. [00119] For example, leader node 810 may trigger an automatic table optimization process to start converting a dist-key table of new source clusters to a new cluster’s topology. Automatic table optimization may launch several processes or threads to work on these new shadow tables in parallel in a special workload management queue (which may have a budgeted or capped resource allocation).

[00120] FIG. 10 is a logical block diagram illustrating the use of shadow tables to perform online restore, according to some embodiments. Original table 1010 may be used at various compute nodes 1020 to copy and create shadow table 1020. Original table 1010 may still be available for read and write access, as indicated 1002. Then, shadow table 1020 may be made consistent, by locking original table, to only allow read access 1004, applying (or undoing) various transactions that are not committed, or otherwise performing various operations to bring shadow table into a consistent state with original table 1010 (although with a custom distribution style for the new topology). Then, as indicated at 1036, the original table may be swapped with the shadow table 1020, where shadow table 1020 is now used to provide read and write access 1006 on the cluster.

[00121] For example, leader node 810 should get the hotness information of all newly imported dist-key tables to determine the priority of these new tables. These information can be collected from the tiered-storage cache policy and the hotness can be simply determined from the percentage of blocks are hot or cold of each table. The converting process should cover both dist- key assignment and topology change. This means automatic table optimization should assign the original dist-key back to target table and transform target table to Serverless cluster’s topology at the same time.

[00122] Once a table is completed, automatic table optimization may update catalog properly so that automatic table optimization wouldn’t revisit this table again, and planner and query execution can recognize them properly.

[00123] Automatic table optimization may change a table’s distribution style, sortkey and encode types to improve user’s workload performance without alfecting user’s workload. While automatic table optimization is running on a table, a user is still able to execute read and write queries on it. Automatic table optimization may catch up with user’s concurrent update through a shadow table, as illustrated in FIG. 10. Below is example technique for working on a shadow table:

1. Copying data from source table to its shadow table, which has dilferent distribution style.

2. Checkpointing the current amount of data copies at the end of each iteration.

3. Blocking background vacuum/ ALTERS from running on table under ATO’s operation. 4. Acquiring write-lock when most data of source table is copied to shadow table and performing synchronization between source and shadow table: a. Redo deleted rows, and b. Undo aborted transactions

5. Acquiring access-exclusive-lock during commit to avoid race condition of the concurrently session that is accessing the source table.

6. Swapping blockchain between source and shadow table.

[00124] In some embodiments, further features may be implemented as part of automatic table optimization in phase 2 of online restore. For example, in some embodiments during restore, record pair map from catalog with <table_id, a_dist_key> if table’s diststyle is DistStyleHash. Automatic table optimization may prioritize topology conversion based on table hotness for in pair map. The hotness information can be derived from tiered storage cache policy. Add this priority based benefit to each automatic table optimization that got launched. Concurrent Conversions and Hotness-based Scheduling: automatic table optimization only supports altering one table’s distribution style at a time. To maximize the throughput of table conversion, DBMS may make a change to enable automatic table optimization to keep track of multiple tables status. Modify Auto Dispatcher to spawn multiple automatic table optimizations for importing in separate of automatic table optimization for recommendation per database, record map<table_id, <pointer to automatic table optimization worker, alter controller» where automatic table optimization pointer is used to control the worker itself, and alter controller is to control alter progress, store them in map. Pause and resume the automatic table optimizations according to the workload. Tiered Storage Cache Pollution with automatic table optimization. Do not record automatic table optimization pinned blocks into TS cache calculation, only record the automatic table optimization blocks are on local disk. In case of concurrent conversion, set quota to how many memory all automatic table optimizations can share, if used up, automatic table optimization either wait, or move on to the next eligible(smaller table). In some embodiments, database service 210 may setup a special workload management queue that allows automatic table optimization to consume sufficient amount of resources (for example, up to 50% of cluster resources) to finish it job.

[00125] Other features may include, blocking all other auto workers on this table during conversion,, sorting Range during conversion, monitoring metrics, providing automatic table optimization progress and showing in console, implementing a system table log for user to query the automatic table optimization progress, providing an upper limit (few million) of redo deleted rows per compute node. If user deleted too many rows on the target table of automatic table optimization, automatic table optimization has to give up and retry. [00126] In various embodiments, automatic table optimization uses shadow table to contain the data with new distribution. It means there will be two copies of same table at the last stage of conversion. To resolve this scenario, either the in-place distribution transform or build the redistribution on a cluster may be performed. In various embodiments, in-place redistribution may be performed where the database management system (e.g., of a leader node) copies + redistributes data, splits original blockchain, and replaces the copied blockchain in original table. Thus, there may not be a need to create full copy of target table. In various embodiments write redistribution may be performed where the database management system uses burst cluster to finish whole data re-distribution task to shadow table and replace primary table on main cluster. Below is an example algorithm:

1. Copying data from source table to its shadow table, which has difierent distribution style.

3. Blocking background vacuum/ALTERs from running on table under automatic table optimization’s operation.

4. Acquiring write-lock when most data of source table is copied to shadow table and performing synchronization between source and shadow table: a. Redo deleted rows, and b. Undo aborted transactions

6. Swapping blockchain between source and shadow table.

[00127] Phase 2 of online restore may alter the distribution style and sortkey of distribution auto tables without a user’s notice. This technique may be fully managed by database service 210 (e.g., by restore into managed database feature 222). The distribution key and sortkey advisor systems may figure out the best distribution style and sortkey of each user defined distribution auto table based on user workload and provide recommendations to the database management system. Database management system launches background worker to alter these tables’ distribution style and sortkey accordingly. Thus, the performance of queries ran on distribution auto table will be adaptively and automatically improved.

[00128] Database service 210 may allow a user to change the sortkey and distribution key style (alter table to diststyle even is under developing) for user defined tables through SQL commands. The SKA and DKA are available on console and provide recommendations for each cluster. Although these components are ready, in some embodiments a user still needs to check the console to get the alter table commands provided by advisors and executes these commands manually by themselves.

[00129] In order to improve users’ experience on database service 210, Database service 210 may automatically execute the recommendations provide by sort key advisor and distribution key advisor, so that, the query performance of each cluster be improved automatically without any users’ action when performing, for example, phase 2 of an online restore. Thus, database management system may implement a background worker (Auto Alter Table Worker) to accept the requests come from advisor and handle these requests at appropriate time.

[00130] Since sort key advisor and distribution key advisor can generate recommendations that conflicts with a user’s decisions, bad user experience will happen when an alter table worker and user override the distribution style and/or sortkey on same tables. In some embodiments, where dist-auto tables’ to be fully managed by database service 210, Database service 210 is allowed to update their distribution style and sortkey at any proper time. Then, alter table worker can execute recommendations on dist-auto tables’ without worrying about racing with user.

[00131] In some embodiments, a user may be allowed to register a table under the management of database service 210, a new alter table command: alter table alter diststyle auto (since auto worker should also cover sortkey and may cover column, alter table may be implemented, where <tablename> set property auto can be an alternative command). This command makes source table to be a dist- auto table. After executing this command, the source table can be dist-auto(even), dist-auto(key), or dist-auto(all). The effective distribution style of source table is determined base on source table’s properties. No matter what effective distribution style the table has, since it is dist-auto, it is under database service 210’s management.

[00132] If user wants to avoid the distribution style and/or sortkey of some dist-auto tables to be changed, they can alter these table to have their desired distribution style and/or sortkey. Database service 210 is allowed to change the distribution style, sortkey and column encoding type for dist-auto (or auto) table. User can register a table to be managed by database service 210 through:

1. alter table <tablename> set property auto, or

2. create table <tablename> (<column defmition>) auto

3. database level command: alter database set property auto It can de-register a table through: alter table <tablename> alter diststyle even/key/all (since auto worker should also cover sortkey and may cover column encoding type, alter table <tablename> unset property auto can be an alternative command).

[00133] There are some large size tables existing. Since altering the distribution style on these table can take few hours and altering diststyle consumes large amount of resource: CPU, network I/O and disk I/O, users’ workload can be alfected in this period. To avoid a user experiencing performance degradation caused by long running background alter command, database management system may have the ability to create checkpoints of alter command, so that altering giant table can be completed in multiple iterations. This techniques can be implemented in different ways.

[00134] In one example, shadow table may be committed, so that it can survive across multiple transactions. In each iteration, part of the content of source table are copied, redistributed into shadow table and committed. Once all content are copied, database management system replaces source table’s content with shadow table’s blockchain and update source table’s catalog. Note: If there are too many deleted rows, it can collect the histogram of deletexid and separate repopulation deletexid step in to several iteration.

[00135] In another example, part of the content of a source table can be copied and redistributed and replace the redistributed blocks of source table with shadow table in each iterations. Once all content are replaced, database management system updates source table’s catalog with new distribution style.

[00136] In some embodiments, an auto alter table worker may be implemented to perform redistributions of data (e.g., as part of phase 2 online restore) This component may be a background worker that executes alter diststyle and/or alter sortkey commands provided by distribution key advisor and sort key advisor. In some embodiments, distribution key advisor and sort key advisory may produce a JSON file, which contains a list of alter diststyle/sortkey commands, to storage service 270. Database management system fetches this file from storage service 270 periodically and assigns its tasks to alter table worker. Alter table workers can execute their tasks when cluster is under low workload or idle based on the cluster workload status provided by Workload monitor. [00137] Alter table worker may alter diststyle auto as a background process. This component may change a source table’s distribution style automatically and registering a table to be managed by Database service 210. For change source table’s distribution style, this command has following features: 1. If the source table is small table, database management system converts this table to be dist-auto(all). 2. If source table is not small table and is dist-all or dist-even, database management system checks whether there is a recommendation entry for this table, if so database management system converts source table with distribution style specified in recommendations; otherwise, it converts it to dist-auto(even). 3. If source table is not small table and is dist-key, database management system checks whether there is a recommendation entry for this table, if so database management system converts source table with distribution style specified in recommendations; otherwise, database management system converts it to be dist-auto(key). 4. If source table is not small table and is dist-auto, database management system has no-op.

[00138] In various embodiments, a database management system supports alter a table’s diststyle to be dist-key or dist-all. Database management system may use the following features:

1. Alter a table with any original diststyle to be dist-auto(all).

2. Alter a table with any original diststyle to be dist-even.

3. Alter a table with any original diststyle to be dist-auto(even).

4. Alter a dist-key table to be dist-auto(key).

5. Make all resize to support dist-auto(key) table.

6. Make backup & restore, table-level-restore to support dist-auto(key) table.

7. Make dist-auto(key) table survive from upgrade and downgrade.

[00139] The sort key advisor and distribution key advisor components may collect user’s queries’ log and tables’ definition. They analyze whether a new distkey or sortkey can improve user's query performance and provide a recommendation entry if there is. They create a set of new recommendations every few days in a JSON file. The ATW consumes this JSON file and executes its recommendation entries accordingly. Since each recommendation entry contains benefit value, alter table worker may execute recommendations in following order:

1. table with only new sortkey has highest priority

2. non-giant table with both new sortkey and diststyle has second priority

3. non-giant table with only new distkey has third priority

4. giant table with both new sortkey and diststyle has second priority

5. giant table with only new distkey has third priority

[00140] In some embodiments, a persistent shadow table may be used. The alter diststyle command uses a shadow table to keep the data with new distribution, and achieve snapshot isolation inside transaction. By using shadow table, this technique can pause and resume alter diststyle command naturally. However, the current life-cycle of shadow table ends when its transaction is committed or aborted, database management system may perform the following steps to save shadow table:

1. Create new bootstrap only alter command: “alter diststyle checkpoint”, which has only single copy-redistribute- insert iteration and only processes limited number of rows. 2. Once the copy-redistribute-insert iteration is completed, the transaction need to issue a special commit that saves current progress into catalog, persists blockchain of shadow table into disk and save blockchain metadata of shadow table.

3. When next “alter diststyle checkpoint” is executed, database management system continues copy-redistribute-insert job from previous saved checkpoint.

4. Once all data are copied into shadow table, alter diststyle checkpoint will perform “undo aborted txns” , “repopulated deleted rows”, “rebuild rowids” on shadow table, updates catalog and swaps blockchains between shadow table and source table in commit.

[00141] In some embodiments, there will be two system tables that provides information to a user about the execution states of alter table worker. Table “stv_auto_dist_sort” shows the unexecuted alter commands provided by advisor. Table “stl auto dist sort” contains the executed alter commands of alter worker. For example, the execution flow of alter table worker on may be:

1. Alter table worker checks whether cluster is busy by consulting workload queues or other workload information. If it is, alter table worker process sleep for 1 minutes and re-check

2. If cluster is not busy, then alter table worker checks whether there is next recommendation task. If there is no recommendation task, alter table worker process may sleep (e.g., for Iday) and try to reload the sort key and distribution key recommendation file if there is a new one.3. If alter table worker finds a recommendation task, alter table worker fetches the next task.

[00142] FIG. 11 is a logical block diagram of a proxy service for a database service that routes queries to a selected processing cluster attached to a database managed by the database service, according to some embodiments. As indicated at 1102, a query may be received at proxy service 240 via database endpoint 1110. For example, query 1102 may be sent to a network address or other location specific to a database managed by database service 210. Proxy service 240 may implement one or more components to listen for queries, like query 1102, at database endpoint 1110.

[00143] In various embodiments, proxy service 240 may implement database query queue 1120. Database query queue 1120 may be a queue of queries directed to a same database, in various embodiments. For example, queries directed to different databases may be received via different database endpoints and put into different respective database query queues (as opposed to being comingled in a common queue). Database query queue 1120 may be a database-wide query queue, separate from any workload or other queues implemented on attached processing clusters for the database that are specific to queries routed to the processing cluster to be performed by that processing cluster. Queries, like query 1102, may then be pulled from database query queue 1120 and routed by query routing 1130 (e.g., according to a First In First Out (FIFO) order).

[00144] Query routing 1130 may implement techniques similar to those discussed below with regard to FIGS. 26-28, in order to make routing decisions that scale the processing clusters attached to the database in accordance with optimizing both query performance and processing cluster utilization. For example, as discussed in detail below with regard to FIG. 28, response predictions 1132 may be determined for any attached processing clusters as well as another other processing cluster configurations that may be attached (but are not currently attached), based on the execution time, bootstrap time, and queue time. Likewise response variability 1134 may be determined enforced using a variability threshold, as discussed below.

[00145] Cluster selection 1136 may use response predictions and response variability to make a cluster selection. The following is one example of an algorithm that may be applied by cluster selection 1136 in some embodiments. Consider a set of clusters indexed by

first n are attached and can be of any size N , while n -1- 1 to fl + m are not attached and are of m possible sizes queries may attach additional 4 compute node or 8 compute node clusters; for a 16 compute node primary cluster, queries may sent to 16 compute node or 32 compute node secondary clusters). For this design consider m — 2 considering that at first database service 210 may create only a next power of 2 bigger sized cluster. Now, say response time of a query on a cluster C[ with number of nodes = A'j is: where i is index of a cluster

Ri is the response time of a query on a cluster i . is the bootstrap time of cluster i and includes burst cluster acquisition and cold start time.

£ i is the execution time of a query on cluster i

Qi is the queue time of a query on cluster i i is the number of nodes of a cluster '

The objective may be to find a duster C_x on which a query may execute for best latency, price performance wise, while honoring constraints that a maximum processing units 1129 (e.g., specified by via an interface in units such as RPUs) may not be exceeded by the attached processing clusters and variability should not be exceeded.

[00146] Thus, the following algorithm may be applied by cluster selection 1136: 1. For attachable (but not yet attached) clusters C_L where i E [n + 1, n + m] , find execution time

for a query q. Recall clusters indexed from n + 1 to n + m are different sized clusters which may be attached, if required. For example if m = 2 , then Cn + 1 has 16 compute node cluster and Cn + 2 has 32 compute node cluster.

2. Find execution time

for a query on these clusters E[ . Find minimum execution time

Emin among all El and the corresponding cluster Cmln with nodes Nmin.

3. Collect all smaller sized clusters Ei than Emin in a set C for which the query q scales linearly or better as compared to cluster Cmln on which minimum execution time for query q was noticed, e.g., Ei /Emin < Nmin /Eli and Ni < Nmin . For example, if a query takes minimum 5 seconds on a 32 compute node cluster and 10 seconds or 9 seconds on 16 compute node cluster, then keep the 16 compute node cluster for consideration. If a query takes 11 seconds on a 16 compute node cluster, then do not.

4. From C , find smallest sized cluster C smallest. It would be best to execute the query q on this cluster price performance wise. Remove all other clusters from C

5. Collect in C cluster Cf where i E [1, n + m] on which query can execute such that an increase in response time is no greater than allowed by variability. Therefore El /Esmallest < (1 + V ).

6. Notice that C contains both attached and possible clusters to be attached

7. Choose a cluster Chest in C with the least response time Ebest . Note that it is possible

Ebest /Esmallest > (1 + V ) if V is not correctly chosen.

8. If Chest is attached, then query executes there.

9. If C best is not attached, then attach a new cluster if allowed by MAX

10. If new cluster cannot be attached by MAX, remove all unattached clusters G' , such that i E [n + 1, n + m ) in C find Ebe st again. Run query on Chest.

[00147] In some embodiments, query routing 1130 may also implement other features to monitor performance of cluster selection and, if necessary, disable or modify performance of query routing 1130. For example, one feature monitors a query’s predicted and real execution time and turn prediction based optimizations off if prediction accuracy is low (e.g., where a query prediction accuracy for a period of time does not satisfy an accuracy criteria). Consider an example where a query’s predicted execution time is E_x on a cluster C_x . If query runs on this cluster, then the ratio r of query’s predicted and real execution time may be considered. An exponential moving average of r may be maintained for each execution time bucket (e.g., range of time). If r breaches a threshold (e.g., > 1) for an execution tie bucket, then prediction based rightsizing of clusters may be disabled for the queries which fall in that bucket and instead a same configuration cluster as used as the primary cluster may be used to for the query (if a new cluster is to be attached).

[00148] Another feature monitors a query’s variability prorated by real execution time. Say a query’s predicted minimum execution time is E_min on a cluster C_min . If query executes on another cluster C_x. then predicted execution time of query on the r cluster is E_x. Say the real execution time of query on C_x is E_x. Then, proration factor f = E /E_x may be determined . Using the proration factor, prorated minimum execution time E^_in = E_min x f may be determined. Next, prorated variability V^p = E_x /E^_in may then be determined. An exponential moving average of prorated variability for each execution time bucket may be maintained. If prorated variability exceeds variability target set for each execution time bucket, then prediction based rightsizing of clusters for queries falling in this bucket may be disabled and instead a same configuration cluster as used as the primary cluster may be used to for the query (if a new cluster is to be attached).

[00149] Query routing 1130 may route the query to an attached cluster to the database 1140, as indicated at 1133. In some embodiments, a selected cluster is not currently attached to the database. As indicated at 1131, if a cluster is selected that is not already attached, then a request cluster to be attached to the database may be sent to control plane 220 in order to have the selected cluster attached.

[00150] Although database query queue 1120 and query routing 1130 are illustrated as implemented as part of proxy service 240, in other embodiments, one or both of these features may be implemented elsewhere. For example, the leader node of the primary cluster attached to a database may implement database query queue 1120 and query routing 1130 to route queries to an attached cluster (e.g., itself or an different attached cluster).

[00151] FIG. 12 is a logical block diagram illustrating attaching a new processing cluster selected for a database managed by a database service, according to some embodiments. As discussed above, proxy service 240 may detect or determine when to attach a new cluster for performing queries in various scenarios, as discussed above with regard to FIG. 11 and below with regard to FIG. 27. Proxy service 240 may then request an attachment of cluster 1212 from control plane 220. The request may, in some embodiments, specify a type of cluster. In some embodiments, control plane 220 may evaluate a manifest, index, or other data that describes available processing cluster(s) 1220 in burst cluster pool 1210 in order to satisfy the request. For example, control plane 220 may identify a processing cluster that matches the specified configuration of the attach cluster request, in some embodiments. In some embodiments, control plane 220 may identify a burst cluster that was previously used for performing queries to the database. [00152] Control plane 220 may provision 1214 the cluster, in some embodiments, from cluster pool, such as attached processing cluster 1230. Attaching the cluster may include various operations to configure network connections between processing cluster and proxy service 240 and other services (e.g., data storage service 250). In some embodiments, access credentials, security tokens, and/or encryption keys may be provided so that the newly attached processing cluster 1230 can access and database data 1240 to perform queries for the database. In some embodiments, initialization procedures, workflows or other operations may be started by control plane 220 at attached processing cluster 1230. For example, processing cluster 1230 may access 1224 metadata from data storage service 250 from database data 1240 in order to perform queries to the database. In some embodiments, attached processing cluster 1230 may get metadata updates 1250 directly from a leader node of an already attached cluster (or other nodes in a primary processing cluster) in order to catch up the metadata to account for changes that occurred after the database data 1240 was stored.

[00153] Once provisioning is complete, attached processing cluster 1230 may be made available for performing queries. Control plane 220 may identify the attached cluster 1216 to proxy service 240 (e.g., by providing a network endpoint for processing cluster 1230), in some embodiments. Proxy service 240 may then begin directing selected queries 1222 to processing cluster 1230, which may perform the queries and send back query results 1226 to proxy service 240, which may provide the results to a client in turn.

[00154] Although not illustrated, when an event that triggers release of the attached cluster occurs, proxy service 230 may send a request to control plane 220 to release the attached cluster 1230 (e.g., by including the identifier of the attached processing cluster 1230). Control plane 220 may then delete the attached processing cluster 1230 (e.g., by removing/del eting data and/or decommissioning/ shutting down the host resources for the attached processing cluster 1230).

[00155] FIG. 13 is a logical block diagram illustrating an example of a processing cluster performing queries to database data, according to some embodiments. As illustrated in this example, a processing cluster 1300 may include a leader node 1310 and compute nodes 1320a, 1320b, and 1320n, which may communicate with each other over an interconnect (not illustrated). Leader node 1310 may implement query planning 1312 to generate query plan(s), such as generating query plans that utilize additional processing cluster scaling 1313 as discussed in detail below with regard to FIGS. 22 - 25, query execution 1314 for executing queries on processing cluster 1300 that perform data processing that can utilize remote query processing resources for remotely stored data (e.g., by utilizing one or more query execution slot(s)/queue(s) 1317). As described herein, each node in a primary processing cluster 1300 may include attached storage, such as atached storage 1322a, 1322b, and 1322n, on which a database (or portions thereol) may be stored on behalf of clients (e.g., users, client applications, and/or storage service subscribers). [00156] Note that in at least some embodiments, query processing capability may be separated from compute nodes, and thus in some embodiments, additional components may be implemented for processing queries. Additionally, it may be that in some embodiments, no one node in processing cluster 1300 is a leader node as illustrated in FIG. 13, but rather different nodes of the nodes in processing cluster 1300 may act as a leader node or otherwise direct processing of queries to data stored in processing cluster 1300. While nodes of processing cluster may be implemented on separate systems or devices, in at least some embodiments, some or all of processing cluster may be implemented as separate virtual nodes or instance on the same underlying hardware system (e.g., on a same server).

[00157] Leader node 1310 may manage communications with clients, such as clients 250 discussed above with regard to FIG. 2. As discussed above with regard to FIG. 3, leader node 1310 may communicate with proxy service 240 and may receive query 1301 and return query results 1303 to proxy service 240 (instead of communicating directly with a client application).

[00158] Leader node 1310 may be a node that receives a query 1301 from various client programs (e.g., applications) and/or subscribers (users) (either directly or routed to leader node 1310 from proxy service 240), then parses them and develops an execution plan (e.g., query plan(s)) to carry out the associated database operation(s)). More specifically, leader node 1310 may develop the series of steps necessary to obtain results for the query. Query 1301 may be directed to data that is stored both locally within processing cluster 1300 (e.g., at one or more of compute nodes 1320) and data stored remotely. Leader node 1310 may also manage the communications among compute nodes 1320 instructed to carry out database operations for data stored in the processing cluster 1300. For example, node-specific query instructions 1304 may be generated or compiled code by query execution 1314 that is distributed by leader node 1310 to various ones of the compute nodes 1320 to carry out the steps needed to perform query 1301, including executing the code to generate intermediate results of query 1301 at individual compute nodes may be sent back to the leader node 1310. Leader node 1310 may receive data and query responses or results from compute nodes 1320 in order to determine a final result 1303 for query 1301.

[00159] A database schema, data format and/or other metadata information for the data stored among the compute nodes, such as the data tables stored in the cluster, may be managed and stored by leader node 1310. Query planning 1312 may account for remotely stored data by generating node-specific query instructions that include remote operations to be directed by individual compute node(s). Although not illustrated, in some embodiments, a leader node may implement burst manager to send 1306 a query plan generated by query planning 1312 to be performed at another attached processing cluster and return results 1308 received from the burst processing cluster to a client as part of results 1303.

[00160] In at least some embodiments, aresult cache 1319 may be implemented as part of leader node 1310. For example, as query results are generated, the results may also be stored in result cache 1319 (or pointers to storage locations that store the results either in primary processing cluster 1300 or in external storage locations), in some embodiments. Result cache 1319 may be used instead of other processing cluster capacity, in some embodiments, by recognizing queries which would otherwise be sent to another attached processing cluster to be performed that have results stored in result cache 1319. Various caching strategies (e.g., LRU, FIFO, etc.) for result cache 1319 may be implemented, in some embodiments. Although not illustrated in FIG. 13, result cache 1319 could be stored in other storage systems (e.g., other storage services, such as a NoSQL database) and/or could store sub-query results.

[00161] Processing cluster 1300 may also include compute nodes, such as compute nodes 1320a, 1320b, and 1320n. Compute nodes 1320, may for example, be implemented on servers or other computing devices, such as those described below with regard to computer system 3000 in FIG. 29, and each may include individual query processing “slices” defined, for example, for each core of a server’s multi-core processor, one or more query processing engine(s), such as query engine(s) 1324a, 1324b, and 1324n, to execute the instructions 1304 or otherwise perform the portions of the query plan assigned to the compute node. Query engine(s) 1324 may access a certain memory and disk space in order to process a portion of the workload for a query (or other database operation) that is sent to one or more of the compute nodes 1320. Query engine 1324 may access attached storage, such as 1322a, 1322b, and 1322n, to perform local operation(s), such as local operations 1318a, 1318b, and 1318n. For example, query engine 1324 may scan data in attached storage 1322, access indexes, perform joins, semi joins, aggregations, or any other processing operation assigned to the compute node 1320.

[00162] Query engine 1324a may also direct the execution of remote data processing operations, by providing remote operation(s), such as remote operations 1316a, 1316b, and 1316n, to remote data processing clients, such as remote data processing client 1326a, 1326b, and 1326n. Remote data processing clients 1326 may be implemented by a client library, plugin, driver or other component that sends request sub-queries to be performed by data storage service 220 or requests to for data, 1332a, 1332b, and 1332n. As noted above, in some embodiments, Remote data processing clients 1326 may read, process, or otherwise obtain data 1334a, 1334b, and 1334c, in response from database data 1340 in data storage service 270, which may further process, combine, and or include them with results of location operations 1318.

[00163] Compute nodes 1320 may send intermediate results from queries back to leader node 1310 for final result generation (e.g., combining, aggregating, modifying, joining, etc.). Remote data processing clients 1326 may retry data requests 1332 that do not return within a retry threshold.

[00164] Attached storage 1322 may be implemented as one or more of any type of storage devices and/or storage system suitable for storing data accessible to the compute nodes, including, but not limited to: redundant array of inexpensive disks (RAID) devices, disk drives (e.g., hard disk drives or solid state drives) or arrays of disk drives such as Just a Bunch Of Disks (JBOD), (used to refer to disks that are not implemented according to RAID), optical storage devices, tape drives, RAM disks, Storage Area Network (SAN), Network Access Storage (NAS), or combinations thereof. In various embodiments, disks may be formatted to store database tables (e.g., in column oriented data formats or other data formats).

[00165] Additional processing clusters may be implemented in different ways for use of processing clusters that originally receive and begin work on queries. In some embodiments, a separate group, pool, or fleet of additional processing clusters may be made ready to perform subqueries on behalf of processing clusters in those scenarios where query planning includes those operations to use additional processing clusters. FIG. 14 is a logical block diagram illustrating a separately hosted additional processing cluster, according to some embodiments.

[00166] For example, leader node 1412 may be implemented on a host system 1410 (e.g., a sever or other computing system 3000 in FIG. 29 below). Compute nodes of the processing cluster, such as compute node 1422 may be implemented on different host systems, such as host 1420. When leader node 1412 determines to use additional processing clusters as part of planning to execute a query, leader node 1412 may request additional clusters 1452 from additional cluster provisioning 1450 in control plane 220. For example, additional cluster provisioning 1450 may maintain a fleet or pool processing clusters (e.g., using one or more resource processing configurations as discussed above with regard to FIG. 2) that are used as additional processing resources for queries already routed to a primary processing cluster. These additional processing clusters may be different from processing clusters that are attached to a database managed by the database service 210 and which perform a query routed to the attached processing cluster (to burst or scale to multiple processing clusters handling different queries to the same database, as discussed above with regard to proxy service 240). Additional cluster provisioning 1450 may update various cluster mappings or other tracking/state information for available processing clusters to assign respective additional clusters 1454 to the compute nodes in a response to leader node 1412.

[00167] As indicated at 1414, leader node 1412 may then send plan execution instructions, including over-network subqueries to provisioned processing clusters to compute node 1422. As part of the instructions 1414, various access credentials, metadata, and other information used to perform over-network subqueries 1424 may be included, including the identity of the additional processing cluster (e.g., as a network address, endpoint or other identifier). Compute node 1422 may then send over-network subqueries to the additional processing cluster.

[00168] In at least some embodiments, the additional processing cluster may be implemented similar to (or the same as) the primary processing cluster, with a leader node 1432 that performs operations similar to leader node 310 discussed above with regard to FIG. 13, and compute nodes 1434a, 1434b, and 1434c, similar to compute nodes 1320 discussed above with regard to FIG. 13. Leader node 1432 may, for example, receive subqueries 1424, develop a query plan, generate execution instructions and send them to compute nodes 1434 which may perform various operations, such as scan data requests 1436 to access database data 1440 in data storage service 270. In at least some embodiments, a single host system, such as host 1430, may implement the entire processing cluster (unlike the primary processing cluster that uses separate hosts for leader and compute nodes as illustrated in FIG. 14). For example, leader node 1432 and compute nodes 1434 may be implemented as separate processes, containers (e.g., using operating system virtualization), or virtual machines on host 1430. In some embodiments, multiple host systems could be used (not illustrated). In some embodiments, a same version of the query engine may be implemented without leader node 1432 for the additional processing cluster at host 1430. Instead, instructions may be sent directly to different compute nodes 1434 from compute node 1422.

[00169] In some embodiments, co-location of additional processing clusters with compute nodes may be implemented on a same host. This offers further performance improvements for query processing, such as eliminating network communications between compute nodes and additional processing clusters, as well as taking advantage of dynamic resource allocation techniques offered by virtualization technologies like container virtualization, which may allow for different containers executing on a same host system to dynamically obtain the computing resources needed to perform operations (instead of only being able to operate within a statically define amount of computing resources, which may make it difficult to predict how much resources an additional processing cluster may need as opposed to a compute node). FIG. 15 is a logical block diagram illustrating a collocated additional processing cluster on a same host as a compute node, according to some embodiments. [00170] For example, leader node 1512 may be implemented on a host system 1510 (e.g., a sever or other computing system 3000 in FIG. 29 below). Compute nodes of the processing cluster, such as compute node 1522 may be implemented on different host systems, such as host 1550. When leader node 1512 determines to use additional processing clusters as part of executing a query, leader node may send plan execution instructions including subqueries to use an on-host provisioned processing cluster by the compute node.

[00171] Host 1550 may, in some embodiments, implement container virtualization (e.g., operating system virtualization that allows different containers, such as container 1520 and container 1530 to implement different applications without having to implement separate guest operating systems for each container). One container, 1520, may implement compute node 1522 (other host systems with other containers may implement other compute nodes of the processing cluster that includes leader node 1512). Another container, container 1530, may implement the additional processing cluster, which may receive and perform inter-container subqueries (e.g., avoiding network communications entirely). In other embodiments, various other virtualization techniques may be utilized to implement an additional processing cluster on a same host as the compute node (e.g., micro virtual machines). For example, container 1530 may include a leader node 1532 that performs operations similar to leader node 1310 discussed above with regard to FIG. 13, and compute nodes 1534a, 534b, and 534c, similar to compute nodes 1320 discussed above with regard to FIG. 13. Leader node 1532 may, for example, receive subqueries 1524, develop a query plan, generate execution instructions and send them to compute nodes 1534 which may perform various operations, such as scan data requests 1536 to access database data 1540 in data storage service 270. In some embodiments, a same version of the query engine may be implemented without leader node 1532 for the additional processing cluster at container 1530. Instead, instructions may be sent directly to different compute nodes 1534 from compute node 1522.

[00172] Although FIGS. 2 - 15 have been described and illustrated in the context of a provider network implementing a database service, like a data warehousing service, the various components illustrated and described in FIGS. 2 - 15 may be easily applied to other database services that can utilize detecting idle periods for management actions at processing clusters for managed databases. As such, FIGS. 2 - 15 are not intended to be limiting as to other embodiments of detecting idle periods for management actions at processing clusters for managed databases.

[00173] FIG. 16 is a high-level flowchart illustrating methods and techniques to implement detecting idle periods for management actions at processing clusters for managed databases, according to some embodiments. Various different systems and devices may implement the various methods and techniques described below, either singly or working together. Different combinations of services implemented in different provider networks operated by different entities may implement some or all of the methods (e.g., a data warehouse cluster in a service of a first provider network and a data set stored in a service of a second provider network). Different types of query engines or non-distributed query performance platforms may implement these techniques. Alternatively, various other combinations of different systems and devices located within or without provider networks may implement the below techniques. Therefore, the above examples and or any other systems or devices referenced as performing the illustrated method, are not intended to be limiting as to other different components, modules, systems, or devices.

[00174] As indicated at 1610, a leader node of a processing cluster may monitor a network endpoint at a proxy service associated with a database managed by a database service, according to some embodiments. For example, the leader node may track the activity of connections with the proxy service (e.g., by requesting active connections or other information). In some embodiments, monitoring may include tracking operations like cursors, pagination, or other indicators of progress on various operations (e.g., returning results) from the processing cluster.

[00175] A network endpoint may be a network address or other location for which database service may implement networking resources to listen and obtain the query. By sending the query to the network endpoint, the target of the query, the database, may be identified. In some embodiments, the network endpoint may be provided (e.g., to a user for inclusion in client applications) when the database is created. In at least some embodiments, the database may be created, configured, or modified to be managed by the database service, such that automatic management techniques, including providing a “serverless” management experience for a user of the database is performed. For example, the database may be created without any specified number or other configuration of computing resources used to process queries (or provide other access to) the database. Instead, an initial query processing resource configuration may be automatically selected by the database service for the database, according to various techniques (e.g., using a pre-defined initial query processing resource configuration or using prediction techniques to determine the initial query processing resource configuration based on other information, such as an amount or type of data stored in the database).

[00176] As indicated at 1620, monitoring of the network endpoint may be performed to detect an idle period for the database. If not, then monitoring may continue. Different types of idle periods may be monitored for different types of management actions. For example, long idle periods or indications of no active connections may trigger detection of an idle period for a longer running management action, such as software patch installation, data redistribution, or pausing the processing cluster. Shorter or other indications of idle periods may trigger detection of idle periods for actions like updating or creating a materialized view, and so on.

[00177] As indicated at 1630, a management action may be determined for the processing cluster to perform during the detected idle period, in some embodiments. For example, specific idle periods may map to specific management actions. In some embodiments, multiple different management actions could be performed, therefore a priority scheme may be applied (e.g., security, then service interruptions, then performance optimizations, or some other hierarchy of categories of actions). As indicated at 1640, the management action may be caused for the processing cluster to be performed, in some embodiments. For example, as depicted in FIGS. 4 - 5, operations may be instructed to compute nodes in the processing cluster, or actions to release the compute nodes and pause the processing cluster may be performed. As indicated by the arrow looping back to 1610, this technique may be continually performed such that the same idle period (or new one) may allow for another management action to be determined and performed.

[00178] FIG. 17 is a high-level flowchart illustrating methods and techniques to implement predicting response time for queries on different query processing configurations, according to some embodiments. As indicated at 1710, a leader node of a paused processing cluster, may monitor a network endpoint at a proxy service associated with at database managed by the database service. As indicated at 1720, a resume event may be detected for the processing cluster. For example, a query may be received and the proxy may send the query to the leader node of the paused processing cluster. In some embodiments, a resume event may be a resume instruction or command sent by a control plane or proxy service (e.g., in anticipation of impending queries).

[00179] As indicated at 1730, a number of compute nodes to add to the processing cluster may be determined, in some embodiments. For example, a resize event that was detected for the processing cluster may be found and evaluated to determine what size the resumed processing cluster should be. In some embodiments, an analysis of performance prior to the pause may be performed to determine whether a different number of compute nodes (e.g., more or less) should be used. As indicated at 1740, the determined number of compute nodes may be added to the leader node to ready the processing cluster, in some embodiments.

[00180] Various methods and techniques of online restore to different topologies with custom data distribution are described herein. Database workloads vary over time. One configuration of resources may initially be right-sized to achieve a desired performance for accessing database and then due to a change in workload no longer be sized correctly, providing inefficient performance or inefficient use of resources. [00181] Restore techniques may be performed in order to move a database from one group of resources (e.g., processing cluster or other grouping of compute nodes that store or provide access to a database) to another group of processing nodes. However, a different group of resources may have a different topology (e.g., a different physical layout, such as a different number of node slices across a cluster of nodes, where a node slice is a portion of node resources, such as memory, disk space, and a processor core of a multi-core processor). While some restore techniques can move a database between groups of resources, the downtime when the database is unavailable for access can be a deterrent to moving to a more efficient or more performant group of resources. Therefore, techniques that perform an online restore (e.g., where a database is available for a large majority of the time when moving), may be highly desirable.

[00182] For instance, some database services may offer users the ability to move from being manually managed by the user (e.g., a manual management mode) to being managed by the database service (e.g., determining the size of a processing cluster and/or other features of the computing resources to provide access and manage the database) in order to have the database service scale or otherwise adapt the computing resources allocated to the database in order to maintain performance or efficiency goals. Techniques that perform an online restore to a different topology may be implemented may make increase the ability of users to take advantage of changing to a database being managed by the database service.

[00183] Because databases may be distributed at a topology of nodes using schemes to increase the performance and efficiency of perform access requests (e.g., queries) to the database, custom distribution schemes may be implemented. For example, a user could select one or a combination of columns or keys from tables in the database to use as distribution scheme inputs (e.g., to a hash function) in order to determine the layout of database data across a topology. A restore operation into a different topology may cause a different distribution of database data to be implemented. Therefore, techniques perform an online restore to a different topology that can recover the custom distribution of data without blocking access to the database when being moved may be highly desirable.

[00184] FIG. 18 illustrates a logical block diagram of online restore to different topologies with custom data distribution, according to some embodiments. A database system that hosts a database may utilize a processing cluster, such as processing cluster or other current topology of nodes 1810 to host a database. As indicated at 1812, the database data may be distributed according to a custom distribution scheme. Such a database system may be a stand-alone system, in various embodiments. For example, the database system may be implemented for private use (e.g., on private networks and resources for entity-specific utilization). In some embodiments, the database system may be a database service, which may be implemented as part of multiple different services provided by a cloud service provider, such as provider network 200 discussed in detail below with regard to FIG. 2.

[00185] A move of the database into a new node topology may be desirable, such as new node topology 1820 (e.g., to resize a processing cluster to have more, or less, compute nodes or to switch from being a manually managed database to a service managed database). A restore operation may be started that sets up a new node topology, according to various techniques, such as transfer plan discussed in detail below with regard to FIGS. 8-10. For example, nodes in current topology 1810 may send various metadata to nodes in new node topology 1820, or nodes in new topology 1820 may access a backup version, such as snapshot of the database, to use to obtain metadata. Data for the database may be stored separately from the node topologies (e.g., in network attached storage in or a separate storage service, like storage service 270 discussed below with regard to FIG. 2). In this way, once the setup of new node topology is complete (e.g., the various metadata operations to prepare new node topology to access data as part of handling various database requests, the database can be made available for access, as indicated at the beginning of 1860.

[00186] When the restore is initially performed, a general distribution of database data may be used, as indicated at 1822. A general distribution scheme may, for example, apply a default distribution scheme for to determining where to physically store different portions of database data (e.g., sometimes referred to as partitions or “data slices”)., that is different from the custom distribution 1812. During the initial period of the restore operation, the database may be unavailable for access, as indicated at 1850. However, once the initial period of the restore operation is complete, when the database data is assigned to the new node topology (e.g., node slice assignments reflect a general distribution scheme) according to a general distribution of database data 1822 at new node topology 1820, then a background process, as discussed in detail below with regard to FIGS. 19-21, may be performed that redistributes database data, as indicated 1824, so that a custom distribution of database data 1830 is achieved at new node topology 1820 which matches the custom distribution scheme (but not layout as the topology is different) of custom distribution of database data 1812. For example, the same one or more columns or keys may be used to determine hash values for distributing data amongst new node topology as was used in custom distribution of database data 1812 (even though the number of node slices may be different in new node topology).

[00187] While redistribution 1824 of database data is performed, the database may be available for access, as indicated at 1860. This may allow for a restore operation with an online characteristic, as the initial period of unavailability (1850) may be very short, especially when compared with the second period of availability (1860). For example, the database may be unavailable for less than one minute, and then have a background process to redistribute data that lasts for one (or multiple) days. In this way, an offline restore technique that might take a long period of time (one or multiple) days where the database is unavailable, can be avoided (even if the time for performing the offline restore is slightly faster than the online technique (e.g., minutes faster)). Thus, online restore may be implemented in two different phases.

[00188] FIG. 19 is a high-level flowchart illustrating methods and techniques to implement online restore to different topologies with custom data distribution, according to some embodiments. Various different systems and devices may implement the various methods and techniques described below, either singly or working together. Different combinations of services implemented in different provider networks operated by different entities may implement some or all of the methods (e.g., a data warehouse cluster in a service of a first provider network and a data set stored in a service of a second provider network). Different types of query engines or other database systems may implement these techniques. Alternatively, various other combinations of different systems and devices located within or without provider networks may implement the below techniques. Therefore, the above examples and or any other systems or devices referenced as performing the illustrated method, are not intended to be limiting as to other different components, modules, systems, or devices.

[00189] As indicated at 1910, a request may be received to restore a database currently hosted across a first topology of nodes into a second topology of nodes different than the first topology of nodes according to a custom distribution scheme, in some embodiments. For example, the request may be a request to restore one (or more) manually managed databases into a single, service managed database. In some embodiments, the request may be a request to resize the database into the second topology.

[00190] As indicated at 1920, a plan to map different portions of the database in the first topology into the second topology using a general distribution scheme may be generated, in some embodiments. For example, as discussed in detail above with regard to FIGS. 8 - 10, the transfer plan may determine whether or not different portions of the database will be merged and to which location they should be stored in the new topology.

[00191] As indicated at 1930, a plan to map different portions of the database in the first topology to respective target locations in the second topology of nodes using a general distribution scheme may be performed, in some embodiments. For example, compute nodes of a processing cluster may perform instructions to obtain slices of database data from a snapshot (or from other compute nodes). In some embodiments, as discussed above, encryption may be performed as part of moving the data. For example, the database may be currently unencrypted and stored encrypted in the second topology.

[00192] Elements 1920 and 1930 may be performed as part of a phase 1 of online restore, as discussed above.

[00193] As indicated at 1940, the database may then be made available for access using the second topology of nodes, in some embodiments. For example, queries and writes may be performed to the database.

[00194] For phase 2 of the online restore, the distribution of the database across the second topology of nodes may be modified to match the custom distributions scheme, as indicated at 1950. As discussed in detail above, various different techniques for implementing a background process using, for example, a shadow table (as discussed with regard to FIG. 21) may be implemented.

[00195] FIG. 20 is a high-level flowchart illustrating methods and techniques to implement phase 1 of online restore, according to some embodiments. As indicated at 2010, a database manifest may be obtained, such as the BAR manifest discussed above with regard to FIG. 19. As indicated at 2020, a slice transfer plan may be generated, as discussed above with regard to FIG. 9. As indicated at 2030, database metadata may be obtained (e.g., catalog data obtained from a leader node and/or from a backup store, such as storage service 270).

[00196] As indicated at 2040, the slice transfer plan may be performed. As indicated at 2012, compute nodes (of the new topology) may execute the transfer plan. As indicated at 2022, slices may be extracted (e.g., from a snapshot of the database) and table chains of blocks connected (e.g., connecting data blocks storing records (e.g., in columnar fashion or row-oriented fashion)). As indicated at 2032, if the source slices are greater, then slices may be merged and block change made, as indicated at 2034. As indicated at 2042, if the database data is encrypted, then the data obtained may be encrypted, as indicated at 2044. Metadata transfer may also be executed on compute nodes, as indicated at 2052.

[00197] In some embodiments, after executing the slice transfer plan, the changes to the database at the target system may be committed. As indicated at 2050, the data may be distributed using an even distribution conversion technique. However, the custom distribution scheme may be recorded (e.g., in the catalog for phase 2). In some embodiments, after the conversion to even distribution of data, the database may be committed again. As indicated at 2060, the database may be restarted into a restore mode to make the database available.

[00198] FIG. 21 is a high-level flowchart illustrating methods and techniques to implement phase 2 of online restore, according to some embodiments. As indicated at 2110, a validation of the move database manifest may be performed. As indicated at 2120, conversion from the even distribution scheme to distribution key scheme (as recorded earlier) may be performed (e.g., using alter table workers as discussed above). Then, similar to the discussion above with regard to 535 in FIG. 8, actual table data as opposed to table metadata may be retrieved (e.g., as used to answer queries), as indicated at 2130.

[00199] Various embodiments of scaling database query processing using additional processing clusters, according to some embodiments., according to some embodiments are described herein. Database queries for data that satisfy various conditions or criteria, insertions, deletions, modifications, or any other request triggering processing based on a request to access a data store may utilize varying amounts of processing resources, including various central processing units (CPUs), graphical processing units (GPUs), or other processing components that may execute various tasks to process database queries, in some embodiments. Because database queries may vary in terms of the workload placed upon the processing resources to execute the database query, the amount of processing resources that any one query engine may provide could be inadequate (or underutilized) to meet the demands of some query workloads.

[00200] One approach to scaling resources to meet query workloads could be to utilize a different set of computing resources (e.g., an entirely different cluster of computing resources). However, such techniques may optimize the performance of queries to a database overall (e.g., multiple queries from one or multiple users), and such a scaling technique may not improve the performance of an individual database query. Techniques for scaling database query processing using additional processing clusters may allow database systems, including providers of a database services, to automatically adapt to query workloads, so that query performance for individual queries is achieved (which allows for good client application performance of client applications that utilize the database service) as the resources of the processing cluster that is handling a query can be expanded when optimal to utilize additional query processing resources.

[00201] FIG. 22 illustrates a logical block diagram of scaling database query processing using additional processing clusters, according to some embodiments. A database system that hosts a database may utilize a processing cluster, such as processing cluster 2210, to receive database queries 2202, perform the database queries and return a result 2204. Such a database system may be a stand-alone system, in various embodiments. For example, the database system may be implemented for private use (e.g., on private networks and resources for entity-specific utilization). In some embodiments, the database system may be a database service, which may be implemented as part of multiple different services provided by a cloud service provider, such as provider network 200 discussed in detail below with regard to FIG. 2. [00202] Processing cluster 2210 may utilize parallelization of work query performance work by distributing the workload of performing a query across multiple compute nodes, such as compute nodes 2212a, 2212b, 2212c, and 2212n, using a common query processing engine 2240 (e.g., a database management system and storage engine that supports distributed query execution in cluster 2210). Although not depicted in FIG. 22, processing cluster 2210 may implement a leader node, as described in detail above, although leader-node less architectures may also be implemented in other embodiments of processing cluster 2210).

[00203] In order to scale the capacities of processing cluster 2210 to optimally perform a database query, query planning as part of query processing engine 2240 may include the use of remote operations to utilize additional processing clusters through compute nodes, such as compute node 2212c using additional processing cluster 2220, as discussed in detail below with regard to FIGS. 23-25. For example, a leader node or other workload management layer that distributes work amongst compute nodes may implement various query planning techniques, including cost-based query planning techniques, which may determine whether a compute node should utilize an additional processing cluster. Note that each compute node 2212 could also use its own additional processing cluster (not illustrated). As part of executing a query, compute node 2212c may send additional query plan operations 2214 to the additional processing cluster 2220, which may implement the same query processing engine 2240, and set of compute nodes, such as compute nodes 2222a, 2222b, 2222c, and 2222n. These compute nodes 2222 may perform respective data access requests, 2224a, 2224b, 2224c, and 2224n, in order to access separately stored database data 2230 (e.g., in a separate storage system such as storage service 270 discussed above, in order to return data, as indicated 2225a, 2225b, 2225c, and 2225n for processing by the compute nodes (e.g., performing various operations to aggregate, filter, group, or other operations to return operation results 2215 on the database data. Additional processing cluster 2220 may be co-located on a same host as a compute node 2212, or on a separate host system.

[00204] In some scenarios, query processing may be more performant without using additional processing cluster 2220. In those scenarios, compute nodes may still perform data access requests directly to database data 2230, as indicated at 2216 and 2217, in order to complete performance of a database query. Such scenarios may be instructed by a query planning feature implemented as part of query processing engine 2240, as discussed above.

[00205] FIG. 23 is a high-level flowchart illustrating methods and techniques to implement scaling database query processing using additional processing clusters, according to some embodiments. Various different systems and devices may implement the various methods and techniques described below, either singly or working together. Different combinations of services implemented in different provider networks operated by different entities may implement some or all of the methods (e.g., a data warehouse cluster in a service of a first provider network and a data set stored in a service of a second provider network). Different types of query engines or other database systems may implement these techniques. Alternatively, various other combinations of different systems and devices located within or without provider networks may implement the below techniques. Therefore, the above examples and or any other systems or devices referenced as performing the illustrated method, are not intended to be limiting as to other different components, modules, systems, or devices.

[00206] As indicated at 2310, a database query may be received at a processing cluster that includes compute nodes that execute different portions of queries to a database, in various embodiments. In some embodiments, the database query may be received at a network endpoint of the processing cluster (e.g., of a leader node of the processing cluster). In some embodiments, the database query may be received via a network endpoint associated with a database managed by a database service so that the database service may route the database query to the processing cluster. The processing cluster may include multiple different compute nodes to provide distributed query processing, taking advantage of parallel processing capabilities of multiple compute nodes. In some embodiments, the processing cluster may operate without a dedicated leader node, but may share leader node responsibilities (e.g., rotate or distribute different portions of leader node responsibilities). In some embodiments, as illustrated above in FIG. 13, the processing cluster may include a dedicated leader node.

[00207] The database query may be received according to various interfaces, formats, and/or protocols. For example, the database query may be formatted according to a query language such as Structured Query Language (SQL), in some embodiments, or may be specified according to an Application Programming Interface (API) for receiving queries. In at least some embodiments, the database query may be one database query of many database queries that can be submitted by one or many different users to a same database. For example, the database query may compete for computing resources along with other queries received from other users to be executed with respect to a database in some embodiments.

[00208] For a managed database, a network endpoint may be a network address or other location for which a database service may implement networking resources to listen and obtain the query. By sending the database query to the network endpoint, the target of the database query, the database, may be identified. In some embodiments, the network endpoint may be provided (e.g., to a user for inclusion in client applications) when the database is created. In at least some embodiments, the database may be created, configured, or modified to be managed by the database service, such that automatic management techniques, including providing a “serverless” management experience for a user of the database is performed. For example, the database may be created without any specified number or other configuration of computing resources used to process queries (or provide other access to) the database. Instead, an initial query processing cluster may be automatically selected by the database service for the database, according to various techniques (e.g., using a pre-defined initial query processing cluster size or using prediction techniques to determine the initial query processing cluster size based on other information, such as an amount or type of data stored in the database).

[00209] As indicated at 2320, a plan may be generated to perform the database query at the processing cluster, in some embodiments. FIG. 24, discussed below, provides various examples of techniques that may be use to generate a plan. Some techniques may include obtaining various metadata or other information describing the database targeted by the database query (e.g., table statistics, table definitions, and partitions of the table) as part of generating the query plan (e.g., in order to perform cost optimization and other planning techniques).

[00210] As part of generating the query plan, a determination may be made as to whether to scale out processing of the database query using additional processing clusters. For example, although both the processing cluster and the potential additional processing clusters can access the same data types and formats and perform the same operations as they may use the same query engine), greater parallelization may have performance benefits for some database queries, and not for other database queries. Query planning may consider the features of the database query in order to determine whether scaling to additional processing clusters will improve database query performance. Consider, for example, when a subquery to be executed by an additional query processing cluster does not reduce data read from storage (e.g., through aggregation, filter, or other operation), then there may be no benefit to utilizing an additional processing cluster to further parallelize performance of the database query. For instance, a database query that is a “COPY” or “SELECT *” which do not filter out data may not obtain any performance benefits from scaling to an additional processing cluster. Thus, in various embodiments, the determination as to whether to include operations to use an additional processing cluster can be made based on a selected number of columns and the expected selectivity of those columns, for example.

[00211] As indicated at 2330, performance of the database query may different according to whether the generated plan includes operations to use additional processing clusters to return results to the compute nodes of the processing cluster. In scenarios wherein the plan does include operations to utilize additional processing clusters, then as indicated at 2340, the plan may be expected to perform the database query including having the compute nodes send requests to respective additional processing clusters to perform the operations, and then returning a result of the database query, as indicated at 2360, based on the performance of the additional processing clusters. As discussed above with regard to FIGS. 14 and 15, how the requests are sent may differ according to whether the additional processing clusters is implemented on the same or a separate host as the compute node instructing the additional processing cluster. In some embodiments, the additional processing clusters may operate with or without a dedicated leader node (sharing leader node responsibilities).

[00212] For those plans that do not include operation(s) to use the additional processing clusters to return results to the compute nodes, the plan to perform the database query may be performed using the processing cluster alone, as indicated at 2350. For example, local attached storage may be accessed, remote requests to access data in separate storage system or service may be performed, or some combination of both actions may be performed by the processing cluster. As indicated at 2360, a result of the database query based on the performance of the database query plan may then be returned.

[00213] As discussed above, some processing clusters may utilize a leader node for multiple compute nodes architecture for processing database queries. FIG. 24 is a high-level flowchart illustrating methods and techniques to implement generating a plan a leader node of processing cluster to perform a database query using additional processing clusters, according to some embodiments. As indicated at 2410, a database query received at a leader node of a processing cluster may be parsed to generate a query tree for the database query, in some embodiments. For example, the database query may be specified in a query language, such as SQL, and various parsing techniques to separate the various features of the database query into different nodes corresponding to keywords, identifiers, constants, operators, or tokens, and syntactic categories, such as the expressions or conditions features of the database query.

[00214] As indicated at 2420, an initial plan to perform the database query may be generated from the query tree, in some embodiments. For example, various query planning rules or techniques that identify operations that correspond to the different nodes in the query tree (e.g., scan operations to obtain portions of data, filter operations to satisfy conditions, etc.) may be implemented. As indicated at 2430, in various embodiments, the initial plan may be evaluated to determine whether to use additional processing clusters, in some embodiments. For example, as discussed above with regard to FIG. 23, a determination of whether scaling to additional processing clusters will improve database query performance may be made using the operations included in the initial query plan. If, for instance, scan operations are included in the initial plan with no filter or aggregation features, then additional processing clusters may not be determined. Similarly, an evaluation to determine a number of columns of a table selected to perform the database query and the expected selectivity of those columns, for example, could be ascertained from the initial plan operations (and metadata describing the table, such as table statistics).

[00215] As indicated at 2440, if a determination is made to use additional processing clusters, then remote operations may be included in a rewrite of the initial plan (as indicated at 2450) that use the additional processing clusters to perform plan operations (e.g., scan operations, aggregation operations (e.g., COUNT, SUM, AVG, MIN, MAX, etc.), group by operations, filter operations (e.g., including comparison conditions and pattern-matching conditions, such as “LIKE”), and string functions. If the determination is to use additional processing clusters, then any operation that can be performed remotely by the additional processing clusters may be included or otherwise assigned for additional processing cluster performance, in some embodiments.

[00216] As indicated at 2450, the initial plan to generate an optimized plan to perform the database query may be rewritten, in some embodiments. For instance, in addition to rewrites to include remote operations (in the event additional processing clusters are used), then other optimizations such as join operation reordering (e.g., to reduce the number of results to use for performing a join) and/or various other optimizations for performing the database query may be implemented by rewriting the initial plan (e.g., modifying the operators and structure, ordering, or other features of the plan to modify performance of the database query).

[00217] As indicated at 2460, if a determination is made to use additional processing clusters in the optimized plan, then further information may be included in or along with instructions generated to send to compute nodes of the processing cluster to perform the optimized query plan, as indicated at 2470. For example, templates for subqueries to send to the additional processing clusters (e.g., in SQL or other supported query format) and additional metadata describing the table(s) to be accessed), may be created and included with generated instructions, as indicated at 2462. Instructions generated at 2470 may be executable instructions generated as a result of compiling the optimized query plan to produce executable artifacts specific to each compute node in the processing cluster, in some embodiments.

[00218] In some embodiments, compute nodes may be able to make dynamic determinations as to whether utilizing an additional processing cluster will provide an optimal performance for a database query. FIG. 25 is a high-level flowchart illustrating methods and techniques to implement executing instructions at a compute node of a processing cluster to perform a portion of a database query using additional processing clusters, according to some embodiments. As indicated at 2510, instructions to perform a portion of a database query may be received at a compute node of a processing cluster from a leader node of the processing cluster, in some embodiments. For example, these instructions may be pre-compiled, executable artifacts.

[00219] In some embodiments, these instructions may indicate whether an additional processing cluster is to be used, as indicated at 2520. If not, as indicated at 2540, the instructions may be performed by the compute node to perform the portion of the database query, as indicated at 2540, and results of the instructions returned to the leader node as indicated at 2570. For instance, the compute node may access local storage and/or remote storage directly in order to perform operations included in the instructions and return the results. In some circumstances, results may be shuffled to other compute nodes in the processing cluster, which may use the results to perform an operation (e.g., to check to see if results match a join criteria being evaluated at another compute node).

[00220] If an additional processing cluster has been indicated as being used, then as indicated at 2530, another evaluation about whether to use the additional processing cluster can be performed by the compute node, in some embodiments. For example, a compute node may be able to determine some information about the database query (which may not be known until runtime at the compute node). A size of the database query, for instance, may be determined according to the number of scan ranges or segments after pruning or otherwise eliminating partitions that do not satisfy the database query (e.g., are partitions for periods of time outside of the database query requested time range). If the size is less than some threshold (e.g., 2 or less scan ranges), then size may indicate local execution (instead of using an additional processing cluster). Note this decision can be done independently at each compute node thus in some scenarios it may be possible that one compute node uses an additional processing cluster and another compute node in the same original/primary processing cluster does not use an additional processing cluster but performs the portion of the database query directly. In this way, additional processing cluster scaling can be closely fit to the performance benefits of individual compute nodes (as opposed to the cluster as a whole).

[00221] If the size indicates non-local execution (using the additional processing cluster), then as indicated at 2550, the additional processing cluster may be identified and subquer(ies) sent to the additional processing cluster, in some embodiments. For example, different implementations of the additional processing cluster may be identified separately. For example, as discussed above with regard to FIG. 14, if the additional processing cluster is separately hosted, then a control plane component for a service (e.g., a data warehouse service) may be used to provision the additional processing clusters for compute nodes. For additional processing clusters on the same host using a different container, a cross-container process to contact another container may be used (e.g., an API or other feature may be invoked and used to send the subquery(ies)). As indicated at 2560, results of the subquery(ies) may be received from the additional processing cluster and then used to return results of the instructions to the leader node, as indicated at 2570.

[00222] V arious embodiments of scaling query processing resources for efficient utilization and performance, according to some embodiments are described herein. Queries for data that satisfy various conditions or criteria, insertions, deletions, modifications, or any other request triggering processing based on a request to access a data store may utilize varying amounts of processing resources, including various central processing units (CPUs), graphical processing units (GPUs), or other processing components that may execute various tasks to process database queries, in some embodiments. Because database queries may vary in terms of the workload placed upon the processing resources to execute the database query, the amount of processing resources that any one query engine may provide could be inadequate (or underutilized) to meet the demands of some query workloads.

[00223] While database systems that are operated and managed directly by the entities using the database systems may be able to make adjustments to database system configurations for query workloads, cloud service providers and other provider networks that offer database services that operate and manage database systems on behalf of other entities that utilize the database systems may have less insight into the changes in query workload. For instance, many entities that utilize database systems find it desirable to shift operational and management responsibilities to the provider of a database service in order to focus efforts on other tasks. This shift of responsibility to a provider of a database service may cause database services to make management decisions for a database based on unknown information, such as expected query workloads or changes to query workloads.

[00224] For example, queries with different scale factors (1 Gigabyte/100 Gigabyte/10 Terabyte) may perform differently for differently sized clusters of computing resources. Queries on 1 Gigabyte (shortest) are faster/same on small sized clusters (e.g., a query processing configuration of 8 computational nodes) as compared to larger sized clusters (e.g., a query processing configuration of 16 computational nodes). While queries on 10 Terabyte (longest) may be fastest on largest (e.g. a query processing configuration of 32 computational nodes) sized clusters as compared to smaller sized clusters (e.g., a query processing configuration of 16 computational nodes). In addition, many queries do not linearly scale with size of clusters.

[00225] Selecting query processing resource configurations based on query processing performance alone (which may vary for different reasons as discussed above), may not address concerns for efficient utilization of query processing resources. For example, utilization of query processing resources may be tracked, in some embodiments, in terms of units (e.g., Resource Processing Units (RPUs)) which may, for example, be proportional to a size of a cluster of query processing resources. To achieve the best query performance (e.g., lowest latency), each query could be performed using a selected a best sized cluster for its execution. If these clusters, however, are underutilized, then costs in RPUs not utilized may accrue (e.g., which may be proportional to number and size of clusters allocated).

[00226] To address inefficient utilization of query processing resources, techniques to pack (e.g., group) queries together on a same query processing resource configuration (e.g., a same cluster). While packing queries may indeed prevent the waste of computing resources, such techniques may increase query performance variability. This may be occur because of other queries which are running a same resource configuration. For instance, a query may be “packed” on a query processing configuration that uses a 16 node computation cluster or “packed” on a query processing configuration that uses a 32 node computation cluster at other times. If in such scenarios, a query executes in 100s on the cluster with 16 computational nodes, but was run on the cluster with 32 computational nodes because it was already attached to a database and underutilized and therefore ran in 150s, then the query’s variability in performance becomes noticeable (e.g., a difference of 50 additional seconds). Therefore, another consideration for selecting query processing resources is that packing should not increase variability beyond some limit.

[00227] Techniques for scaling query processing resources for efficient utilization and performance may allow providers of a database services to automatically adapt to query workloads, so that both optimal query performance is achieved (which allows for good client application performance of client applications that utilize the database service) and optimal resource utilization is achieved (which saves computing resources for both clients, indirectly, and providers of database services, directly, to perform other computational tasks).

[00228] FIG. 26 illustrates a logical block diagram of scaling query processing resources for efficient utilization and performance, according to some embodiments. Database service 2610 may be a stand-alone database service, in various embodiments. For example, database service 2610 may be implemented for private use (e.g., on private networks and resources for entity-specific utilization). In some embodiments, database service 2610 may be implemented as part of multiple different services provided by a cloud service provider, such as provider network 200 discussed in detail below with regard to FIG. 2.

[00229] Database service 2610 may manage databases on behalf of clients of database service 2610, in various embodiments. For example, database service 2610 may implement an interface that allows users to create a database to be hosted in database service 2610. The interface may also allow users to specify whether the database is to be managed by the database service, automatically, in a “serverless” fashion (e.g., by allowing database service 2610 to automatically determine and configure an appropriate number of computing resources to host and provide access to (e.g., query) the database). In some embodiments, the interface may support management parameters or other information to guide the management of the database, such as a parameter indicating that query performance should be prioritized over resource efficiency (e.g., lower cost), or parameter to indicate that resource efficiency should be prioritized over query performance. In some embodiments, database service 2610 may also allow for hosted databases to be manually managed (e.g., via interface requests to configure a specified number of computing resources to host and provide access to (e.g., query) the database).

[00230] For database service managed databases, database service 2610 may select the query processing configuration to perform a received query (unlike a manually managed database which may use the specified and configured query processing configuration). For example, as illustrated in FIG. 26, a query 2602 may be received at database service 2610 for a database managed by database service 2610. To select the appropriate query processing configuration, database service 2610 may implement query response prediction 2620 to determine configuration response predictions 2622 for many different available processing configurations 2660. For example, as discussed in detail below with regard to FIGS. 27, and 28, query response prediction 2620 may determine different components of a query response time, such as execution time to perform the query, bootstrap time to prepare query processing resources to execute the query, if any, and queue time for the query to wait before execution at query processing resources, if any. Such components may be determined and combined for each of the different available processing configurations 2660.

[00231] Available query processing configurations 2660 may be any number of different query processing resources (e.g., different sized clusters of computational nodes with different processing and other computing capabilities (e.g., different memory, networking, Input/Output (I/O), etc.)). Available processing configurations 2660 may include one or more processing configurations (e.g., one or more clusters) that are already allocated and “attached” to a database. For example, in some embodiments, database service 2610 may initially create a “main” or “primary” processing cluster for a database. This cluster may be used to process queries in addition to other processing configurations (e.g., other clusters of different sizes) which may be later attached when selected according to the techniques discussed with regard to FIG. 26. Available processing configurations 2660 may include those processing configurations that are not currently attached, but could be attached to the database if selected (e.g., a differently sized compute cluster from a main cluster).

[00232] Configuration response predictions 2622 determined by query response prediction 2620 for each available processing configuration 2660 may indicate the query processing configuration that would provide the lowest latency (e.g., fastest) response to query 2602, efficient utilization of resources may also be considered as part selecting the processing configuration for the query. Database service 2610 may implement response variability tolerance 2630 to determine whether or not “packing” using an available processing configuration 2660 may exceed a variability threshold for query 2602. As discussed in detail below with regard to FIGS. 27 and 28, a variability threshold may be determined specific to query 2602. In some embodiments, variability thresholds may be adjusted based on management parameters provided via an interface (e.g., prioritizing query performance or resource efficiency).

[00233] As indicated at 2632, filtered configuration response predictions associated with filtered available processing configurations 2662 (which do not exceed the variability threshold) may be provided to processing configuration selection 2640. Processing configuration selection 2640, may choose from amongst the filtered available processing configurations 2662, in some embodiments, to select one query processing configuration to perform query 2602. For example, a lowest remaining predicted response time may indicate which of the filtered available processing configurations 2662 to select. In some embodiments, other considerations may affect the selection of the query processing configuration. For example, as discussed below with regard to FIGS. 27, a processing resource limit for the database, such as a maximum number of RPUs, may not be exceeded, which may determine whether, for example, a new query processing configuration may be attached to the database (e.g., in addition to other query processing configuration(s) already attached to the database).

[00234] As indicated at 2642, query 2602 may be routed to the selected processing configuration 2650, which may access database data 2652 to perform 2651 query 2602. Various different storage and query processing arrangements may be implemented, such as clusters that utilize on-cluster storage (e.g., storing database data 2652 on locally attached disk storage), remote data storage (e.g., database data 2652 being stored in a separate storage service, and/or a combination of both local and remote storage. Selected processing configuration 2650 may then return a query response 2604.

[00235] FIG. 27 is a high-level flowchart illustrating methods and techniques to implement scaling query processing resources for efficient utilization and performance, according to some embodiments. Various different systems and devices may implement the various methods and techniques described below, either singly or working together. Different combinations of services implemented in different provider networks operated by different entities may implement some or all of the methods (e.g., a data warehouse cluster in a service of a first provider network and a data set stored in a service of a second provider network). Different types of query engines or nondistributed query performance platforms may implement these techniques. Alternatively, various other combinations of different systems and devices located within or without provider networks may implement the below techniques. Therefore, the above examples and or any other systems or devices referenced as performing the illustrated method, are not intended to be limiting as to other different components, modules, systems, or devices.

[00236] As indicated at 2710, a database query may be received, in various embodiments, via a network endpoint associated with a database managed by a database service. The query may be received according to various interfaces, formats, and/or protocols. For example, the database query may be formatted according to a query language such as Structured Query Language (SQL), in some embodiments, or may be specified according to an Application Programming Interface (API) for receiving queries. In at least some embodiments, the query may be one query of many queries that can be submitted by one or many different users to a same database. For example, the query may compete for computing resources along with other queries received from other users to be executed with respect to a database in some embodiments.

[00237] A network endpoint may be a network address or other location for which database service may implement networking resources to listen and obtain the query. By sending the query to the network endpoint, the target of the query, the database, may be identified. In some embodiments, the network endpoint may be provided (e.g., to a user for inclusion in client applications) when the database is created. In at least some embodiments, the database may be created, configured, or modified to be managed by the database service, such that automatic management techniques, including providing a “serverless” management experience for a user of the database is performed. For example, the database may be created without any specified number or other configuration of computing resources used to process queries (or provide other access to) the database. Instead, an initial query processing resource configuration may be automatically selected by the database service for the database, according to various techniques (e.g., using a pre-defined initial query processing resource configuration or using prediction techniques to determine the initial query processing resource configuration based on other information, such as an amount or type of data stored in the database). As noted above with regard to FIG. 26, in some embodiments, management parameters may be included in a request to create the database (or separately as part of other requests) which may be used to guide the selection of query processing resources initially or when later scaling the query processing resources.

[00238] As indicated at 2720, respective response times may be predicted for the query using different query processing configurations available to perform the query at the database service. For example, both currently attached (e.g., a main or primary processing cluster and any previously attached additional secondary clusters) or able to be attached query processing resource configurations may be considered. Different query processing configurations may include different numbers of computing resources (e.g., different numbers of nodes) with different processing and other computing capabilities (e.g., different memory, networking, Input/Output (I/O), etc.)). As discussed in detail below with regard to 7, different techniques that base response times on execution, queuing, and bootstrapping may be used. In some embodiments, query predictions may be based on query plan cost analysis (e.g., using statistics collected for a database data, such as data distributions in histograms, data cardinality, cost allocations to query plan operations, etc.).

[00239] As indicated at 2730, query processing configuration(s) of the different query processing configurations may be excluded to determine remaining query processing configurations in response to determinations that the respective response times for the query processing configurations exceed a variability threshold determined for the query, in some embodiments. For example, variability cannot be considered constant for all queries. For short queries, e.g., 100ms, it may not be possible to achieve even V = 100% while for long queries, e.g., 1 hour, V = 100% may be too high. Consider the case when attaching a cluster is not constrained by a maximum resource utilization threshold. Also recall that response time may be given by R_L = Bt + Qt + Ei. If a decision is made about scheduling a short query on an existing cluster C_n vis- a-vis creating a new cluster C_m then this choice may only depend on queue time Q_n on cluster C_n and bootstrap time B_m of cluster C_M as E_n and E_m may be both too small to make a difference. Since Q_n and Q_m may be dependent on other factors than queries, bound short query variability may be bound by B_M/E_m. This number can be quite large depending on E_m. Therefore, variability may be controlled to a smaller number when execution time is greater than both bootstrap time. In some embodiments, the variability threshold may also be adjusted according to whether a performance prioritization parameter or resource utilization prioritization parameter is specified for managing the database.

[00240] Variability may be determined in different ways in different scenarios. For example:

1. When C_best is not attached, a new cluster may be attached. Then best = smallest and then Rfoest ^smallest d” ^smallest- Substituting this in R_b@st / smallest —

d” V ) produces, V > E_smaUest /E_smaaest. Since bootstrap time is a property of a cluster and not query, it may be considered constant, in some embodiments. Therefore, V > Esmallest /Esmallest

2. When _estis attached, and has resources to run a query then variability threshold is trivially met

3. When _estis attached, but is full, then should the query create C_smaaest or queue on C_best. In some embodiments, waiting on C_best can be chosen only when Q_best + for a query to wait on Therefore, for variability bound, the case when a query waits may be ignored.

The variability function can be bound by V > B_smaUest /E_smau_est if MAX constraint is not violated. Consider then the variability function to be V = B_smaaest /E_smaaest + K where K is positive constant for all clusters.

[00241] As indicated at 2740, one of the remaining query processing configurations may be selected to perform the query, in some embodiments. For example, a best performing query processing configuration may be selected according to smallest response time. The selected query processing configuration may be already attached, in some scenarios. In other scenarios, the query processing configuration may not be attached to the database, and thus may have to be attached.

[00242] In some embodiments, a maximum and/or other query processing resource limits may be specified for the database and enforced by the database service. For example, as indicated at 2750, a determination may be made as to whether if a query processing resource limit is exceeded by the selection. In some embodiments, a query processing resource limit cannot be exceed if the selected query processing configuration is one already attached to the database. If not attached, then a resource utilization value (e.g., RPUs) for the to be attached query processing configuration (e.g., which may be determined according to a number of nodes and/or capabilities of the nodes in the query processing configuration). If the addition of that resource utilization value causes the total resource utilization value for the database inclusive of already attached query processing resources, then the selection may exceed the query processing resource limit for the database.

[00243] If exceeding one (or more) of the query processing resource limit(s), then a query processing configuration for the database may be selected within the query processing resource limit, as indicated at 2770 (e.g., selecting add another query processing configuration to attach with a smaller amount of resource utilization or to use an already attached query processing resource). As indicated at 2760, if not exceeding the query processing resource limit, then the query may be performed using the selected query processing configuration, in some embodiments. [00244] FIG. 28 is a high-level flowchart illustrating methods and techniques to implement predicting response time for queries on different query processing configurations, according to some embodiments. As indicated at 2810, a trained machine learning model may be applied to predict respective execution times for a query using different query processing configurations, in some embodiments. For example, a machine learning model may be trained to accept query plan and a number of nodes (as the query processing configuration) as input features and output a predicted execution time (along with a confidence score). Using this prediction, the best estimated execution time T of a query for a cluster of size n can be determined for various clusters.

[00245] As indicated at 2820, respective bootstrap times to prepare the different query processing configurations to perform the query may be determined, in some embodiments. For example, in some embodiments, bootstrap time may be determined as: 1) Time to acquire a cluster or 2) Time to prepare the additional cluster to execute queries (e.g., cold start). Cold start may be when a cluster is first attach and downloads data (e.g., block headers) corresponding to a backup version of the database. Thus, in some embodiments, bootstrap time may be B_n = B ^p + B^pp where B^pp is time taken to acquire a burst cluster with index n. In some embodiments, B^pp may be the average time for a cold start. This time may vary from a second to up to several minutes depending on the cluster, but says relatively constant for a cluster. B^pp may be started as an average and then updated B^pp on the cluster on every restart as an exponential moving average. In some embodiments, the average may be determined as an exponential moving average. A bootstrap time of a query at an attached query processing resource may be zero, in some embodiments.

[00246] As indicated at 2830, respective queue times for the query at the different query processing configurations may be determined, in some embodiments. Queue time for a query q on a cluster can be estimated by figuring out how long the queries in front q will take to get to execution. To do so, exponential moving average of query queue time ( Q*Q) of queries may be tracked normalized by their queue position ( Q''Q = queue time / queue position). An example of moving average is Q^Q = k x Q^Q + (1 — k) x last_querys_queue_time where k G (0, 1). Therefore Q_n can be determined by the queue position of q times Q*Q.

[00247] As indicated at 2840, the respective execution times of a query may be added to the respective bootstrap times and queue times in order to determine respective response times for the query at each of the different query processing configurations. [00248] The methods described herein may in various embodiments be implemented by any combination of hardware and software. For example, in one embodiment, the methods may be implemented by a computer system (e.g., a computer system as in FIG. 29) that includes one or more processors executing program instructions stored on a computer-readable storage medium coupled to the processors. The program instructions may implement the functionality described herein (e.g., the functionality of various servers and other components that implement the networkbased virtual computing resource provider described herein). The various methods as illustrated in the figures and described herein represent example embodiments of methods. The order of any method may be changed, and various elements may be added, reordered, combined, omitted, modified, etc.

[00249] Various embodiments as described herein with regard to FIGS. 1 - 28 may be executed on one or more computer systems, which may interact with various other devices. One such computer system is illustrated by FIG. 29. In different embodiments, computer system 3000 may be any of various types of devices, including, but not limited to, a personal computer system, desktop computer, laptop, notebook, or netbook computer, mainframe computer system, handheld computer, workstation, network computer, a camera, a set top box, a mobile device, a consumer device, video game console, handheld video game device, application server, storage device, a peripheral device such as a switch, modem, router, or in general any type of computing node, compute node, computing device, compute device, or electronic device.

[00250] In the illustrated embodiment, computer system 3000 includes one or more processors 3010 coupled to a system memory 3020 via an input/output (I/O) interface 3030. Computer system 3000 further includes a network interface 3040 coupled to I/O interface 3030, and one or more input/output devices 3050, such as cursor control device 3060, keyboard 3070, and display(s) 3080. Display(s) 3080 may include standard computer monitor(s) and/or other display systems, technologies or devices. In at least some implementations, the input/output devices 3050 may also include a touch- or multi-touch enabled device such as a pad or tablet via which a user enters input via a stylus-type device and/or one or more digits. In some embodiments, it is contemplated that embodiments may be implemented using a single instance of computer system 3000, while in other embodiments multiple such systems, or multiple nodes making up computer system 3000, may host different portions or instances of embodiments. For example, in one embodiment some elements may be implemented via one or more nodes of computer system 3000 that are distinct from those nodes implementing other elements.

[00251] In various embodiments, computer system 3000 may be a uniprocessor system including one processor 3010, or a multiprocessor system including several processors 3010 (e.g., two, four, eight, or another suitable number). Processors 3010 may be any suitable processor capable of executing instructions. For example, in various embodiments, processors 3010 may be general-purpose or embedded processors implementing any of a variety of instruction set architectures (IS As), such as the x86, PowerPC, SPARC, or MIPS ISAs, or any other suitable ISA. In multiprocessor systems, each of processors 3010 may commonly, but not necessarily, implement the same ISA.

[00252] In some embodiments, at least one processor 3010 may be a graphics processing unit. A graphics processing unit or GPU may be considered a dedicated graphics-rendering device for a personal computer, workstation, game console or other computing or electronic device. Modem GPUs may be very efficient at manipulating and displaying computer graphics, and their highly parallel structure may make them more effective than typical CPUs for a range of complex graphical algorithms. For example, a graphics processor may implement a number of graphics primitive operations in a way that makes executing them much faster than drawing directly to the screen with a host central processing unit (CPU). In various embodiments, graphics rendering may, at least in part, be implemented by program instructions that execute on one of, or parallel execution on two or more of, such GPUs. The GPU(s) may implement one or more application programmer interfaces (APIs) that permit programmers to invoke the functionality of the GPU(s). Suitable GPUs may be commercially available from vendors such as NVIDIA Corporation, ATI Technologies (AMD), and others.

[00253] System memory 3020 may store program instructions and/or data accessible by processor 3010. In various embodiments, system memory 3020 may be implemented using any suitable memory technology, such as static random access memory (SRAM), synchronous dynamic RAM (SDRAM), nonvolatile/Flash-type memory, or any other type of memory. In the illustrated embodiment, program instructions and data implementing desired functions, such as those described above are shown stored within system memory 3020 as program instructions 3025 and data storage 3035, respectively. In other embodiments, program instructions and/or data may be received, sent or stored upon different types of computer-accessible media or on similar media separate from system memory 3020 or computer system 3000. Generally speaking, a non- transitory, computer-readable storage medium may include storage media or memory media such as magnetic or optical media, e.g., disk or CD/DVD-ROM coupled to computer system 3000 via I/O interface 3030. Program instructions and data stored via a computer-readable medium may be transmitted by transmission media or signals such as electrical, electromagnetic, or digital signals, which may be conveyed via a communication medium such as a network and/or a wireless link, such as may be implemented via network interface 3040. [00254] In one embodiment, I/O interface 3030 may coordinate I/O traffic between processor 3010, system memory 3020, and any peripheral devices in the device, including network interface 3040 or other peripheral interfaces, such as input/output devices 3050. In some embodiments, I/O interface 3030 may perform any necessary protocol, timing or other data transformations to convert data signals from one component (e.g., system memory 3020) into a format suitable for use by another component (e.g., processor 3010). In some embodiments, I/O interface 3030 may include support for devices attached through various types of peripheral buses, such as a variant of the Peripheral Component Interconnect (PCI) bus standard or the Universal Serial Bus (USB) standard, for example. In some embodiments, the function of I/O interface 3030 may be split into two or more separate components, such as a north bridge and a south bridge, for example. In addition, in some embodiments some or all of the functionality of I/O interface 3030, such as an interface to system memory 3020, may be incorporated directly into processor 3010.

[00255] Network interface 3040 may allow data to be exchanged between computer system 3000 and other devices attached to a network, such as other computer systems, or between nodes of computer system 3000. In various embodiments, network interface 3040 may support communication via wired or wireless general data networks, such as any suitable type of Ethernet network, for example; via telecommunications/tel ephony networks such as analog voice networks or digital fiber communications networks; via storage area networks such as Fibre Channel SANs, or via any other suitable type of network and/or protocol.

[00256] Input/output devices 3050 may, in some embodiments, include one or more display terminals, keyboards, keypads, touchpads, scanning devices, voice or optical recognition devices, or any other devices suitable for entering or retrieving data by one or more computer system 3000. Multiple input/output devices 3050 may be present in computer system 3000 or may be distributed on various nodes of computer system 3000. In some embodiments, similar input/output devices may be separate from computer system 3000 and may interact with one or more nodes of computer system 3000 through a wired or wireless connection, such as over network interface 3040.

[00257] As shown in FIG. 29, memory 3020 may include program instructions 3025, that implement the various methods and techniques as described herein, and data storage 3035, comprising various data accessible by program instructions 3025. In one embodiment, program instructions 3025 may include software elements of embodiments as described herein and as illustrated in the Figures. Data storage 3035 may include data that may be used in embodiments. In other embodiments, other or different software elements and data may be included.

[00258] Those skilled in the art will appreciate that computer system 3000 is merely illustrative and is not intended to limit the scope of the techniques as described herein. In particular, the computer system and devices may include any combination of hardware or software that can perform the indicated functions, including a computer, personal computer system, desktop computer, laptop, notebook, or netbook computer, mainframe computer system, handheld computer, workstation, network computer, a camera, a set top box, a mobile device, network device, internet appliance, PDA, wireless phones, pagers, a consumer device, video game console, handheld video game device, application server, storage device, a peripheral device such as a switch, modem, router, or in general any type of computing or electronic device. Computer system 3000 may also be connected to other devices that are not illustrated, or instead may operate as a stand-alone system. In addition, the functionality provided by the illustrated components may in some embodiments be combined in fewer components or distributed in additional components. Similarly, in some embodiments, the functionality of some of the illustrated components may not be provided and/or other additional functionality may be available.

[00259] Those skilled in the art will also appreciate that, while various items are illustrated as being stored in memory or on storage while being used, these items or portions of them may be transferred between memory and other storage devices for purposes of memory management and data integrity. Alternatively, in other embodiments some or all of the software components may execute in memory on another device and communicate with the illustrated computer system via inter-computer communication. Some or all of the system components or data structures may also be stored (e.g., as instructions or structured data) on a computer-accessible medium or a portable article to be read by an appropriate drive, various examples of which are described above. In some embodiments, instructions stored on a non-transitory, computer-accessible medium separate from computer system 3000 may be transmitted to computer system 3000 via transmission media or signals such as electrical, electromagnetic, or digital signals, conveyed via a communication medium such as a network and/or a wireless link. Various embodiments may further include receiving, sending or storing instructions and/or data implemented in accordance with the foregoing description upon a computer-accessible medium. Accordingly, the present invention may be practiced with other computer system configurations.

[00260] It is noted that any of the distributed system embodiments described herein, or any of their components, may be implemented as one or more web services. In some embodiments, a network-based service may be implemented by a software and/or hardware system designed to support interoperable machine-to-machine interaction over a network. A network-based service may have an interface described in a machine-processable format, such as the Web Services Description Language (WSDL). Other systems may interact with the web service in a manner prescribed by the description of the network-based service’s interface. For example, the network- based service may define various operations that other systems may invoke, and may define a particular application programming interface (API) to which other systems may be expected to conform when requesting the various operations.

[00261] In various embodiments, a network-based service may be requested or invoked through the use of a message that includes parameters and/or data associated with the network-based services request. Such a message may be formatted according to a particular markup language such as Extensible Markup Language (XML), and/or may be encapsulated using a protocol such as Simple Object Access Protocol (SOAP). To perform a web services request, a network-based services client may assemble a message including the request and convey the message to an addressable endpoint (e.g., a Uniform Resource Locator (URL)) corresponding to the web service, using an Internet-based application layer transfer protocol such as Hypertext Transfer Protocol (HTTP).

[00262] In some embodiments, web services may be implemented using Representational State Transfer (“RESTful”) techniques rather than message-based techniques. For example, a web service implemented according to a RESTful technique may be invoked through parameters included within an HTTP method such as PUT, GET, or DELETE, rather than encapsulated within a SOAP message.

[00263] The various methods as illustrated in the FIGS, and described herein represent example embodiments of methods. The methods may be implemented in software, hardware, or a combination thereof. The order of method may be changed, and various elements may be added, reordered, combined, omitted, modified, etc.

[00264] Various modifications and changes may be made as would be obvious to a person skilled in the art having the benefit of this disclosure. It is intended that the invention embrace all such modifications and changes and, accordingly, the above description to be regarded in an illustrative rather than a restrictive sense.

[00265] Embodiments of the present disclosure can be described in view of the following clauses:

Clause 1. A system, comprising: one or more computing devices, respectively comprising a processor and a memory that implement a processing cluster for a database service, the processing cluster comprising a leader node and one or more compute nodes, and wherein the leader node in the processing cluster is configured to: monitor a network endpoint at a proxy service associated with a database managed by the database service to detect an idle period for the database; select from a plurality of different management actions a management action for the processing cluster to perform during the detected idle period for the database according to the detected idle period; and send one or more requests to cause the management action for the processing cluster to be performed, wherein the database is available for read requests received via the network endpoint during performance of the management action by another processing cluster of the database service.

Clause 2. The system of clause 1, wherein to send the one or more requests to cause the management action for the processing cluster to be performed, the leader node is configured to send one or more requests to cause a software patch to be installed at the processing cluster.

Clause 3. The system of any one of clauses 1-2, wherein to send the one or more requests to cause the management action for the processing cluster to be performed, the leader node is configured to send one or more requests to release the one or more compute nodes for other tasks of the database service and pause the processing cluster.

Clause 4. The system of clause 3, wherein the leader node is further configured to: detect a resume event for the processing cluster; determine a number of compute nodes to add to the processing cluster; and cause the determined number of compute nodes to be added to the processing cluster.

Clause 5. A method, comprising: monitoring, by a leader node of a processing cluster for accessing a database, a network endpoint at a proxy service associated with a database managed by a database service to detect an idle period for the database; determining, by the leader node, a management action for the processing cluster to perform during the detected idle period for the database; and causing, by the leader node, the management action for the processing cluster to be performed, wherein the database is available for read requests received via the network endpoint during performance of the management action by another processing cluster of the database service.

Clause 6. The method of clause 5, wherein causing the management action for the processing cluster to be performed, comprises causing a software patch to be installed at the processing cluster.

Clause 7. The method of any one of clauses 5-6, wherein causing the management action for the processing cluster to be performed, comprises causing a vacuum operation to consolidate storage space for one or more records of the database left as a result of one or more deletions performed on the database.

Clause 8. The method of any one of clauses 5-7, further comprising performing a metadata query at the leader node when the processing cluster is paused as a result of the management action. Clause 9. The method of any one of clauses 5-8, wherein causing the management action for the processing cluster to be performed, comprises sending one or more requests to release one or more compute nodes of the processing cluster for other tasks of the database service and pause the processing cluster.

Clause 10. The method of clause 9, further comprising: detecting a resume event for the processing cluster; determining a number of compute nodes to add to the processing cluster; and causing the determined number of compute nodes to be added to the processing cluster.

Clause 11. The method of clause 10, wherein the determined number of compute nodes is different than a number of compute nodes released from the processing cluster when the processing cluster was paused.

Clause 12. The method of any one of clauses 5-11, wherein the determined management action is received at the leader node from a control plane of the database service.

Clause 13. The method of any one of clauses 5-12, wherein causing the management action for the processing cluster to be performed, comprises causing a materialized view for the database to be created or updated.

Clause 14. One or more non-transitory, computer-readable storage media, storing program instructions that when executed on or across one or more computing devices cause the one or more computing devices to implement: monitoring, by a leader node of a processing cluster for accessing a database, a network endpoint at a proxy service associated with a database managed by a database service to detect an idle period for the database, wherein the database was created in response to a request that specified that the database was to be managed by the database service; determining, by the leader node, a management action for the processing cluster to perform during the detected idle period for the database; and causing, by the leader node, the management action for the processing cluster to be performed, wherein the database is available for read requests received via the network endpoint during performance of the management action. Clause 15. The one or more non-transitory, computer-readable storage media of clause 14, wherein, in causing the management action for the processing cluster to be performed, the program instructions cause the one or more computing devices to implement causing a software patch to be installed at the processing cluster.

Clause 16. The one or more non-transitory, computer-readable storage media of any one of clauses 14-15, wherein, in causing the management action for the processing cluster to be performed, the program instructions cause the one or more computing devices to implement causing a redistribution database data amongst one or more compute nodes of the processing cluster.

Clause 17. The one or more non-transitory, computer-readable storage media of any one of clauses 14-16, wherein another processing cluster performs a query to the database when the processing cluster is paused as a result of the management action.

Clause 18. The one or more non-transitory, computer-readable storage media of any one of clauses 14-17, wherein, in causing the management action for the processing cluster to be performed, the program instructions cause the one or more computing devices to implement sending one or more requests to release one or more compute nodes of the processing cluster for other tasks of the database service and pause the processing cluster.

Clause 19. The one or more non-transitory, computer-readable storage media of clause 18, storing further instructions that when executed on or across the one or more computing devices, cause the one or more computing devices to implement: detecting a resume event for the processing cluster; determining a number of compute nodes to add to the processing cluster; and causing the determined number of compute nodes to be added to the processing cluster.

Clause 20. The one or more non-transitory, computer-readable storage media of any one of clauses 14-19, storing further instructions that when executed on or across the one or more computing devices, cause the one or more computing devices to implement causing a second determined management action to be performed during the idle period.

Clause 21. A database service, comprising: one or more computing devices, respectively comprising a processor and a memory, configured to implement a proxy service for the database service, wherein the proxy service is configured to: receive a query via a network endpoint associated with a database that is managed by the database service, wherein the database was created in response to a request that specified that the database was to be managed by the database service; determine respective response times predicted for the query using a plurality of query processing configurations available to perform the query at the database service; determine that the respective response times for one or more query processing configurations exceed a variability threshold determined for the query; exclude the one or more query processing configurations of the plurality of query processing configurations to determine a remaining one or more query processing configurations; select one of the remaining one or more query processing configurations to perform the query; route the query to the selected query processing configuration to be performed; and return a response to the query received from the selected query processing configuration.

Clause 22. The system of clause 21, wherein to determine the respective response times predicted for the query using a plurality of query processing configurations available to perform the query at the database service, the proxy service is configured to: predict respective execution times of the query using the plurality of query processing configurations; determine respective bootstrap times to prepare the plurality of query processing configurations; determine respective queue times for the query at the plurality of query processing configurations to perform the query; and add the respective execution times, bootstrap times, and queue times to determine the respective response times predicted for the query.

Clause 23. The system of any one of clauses 21-22, wherein the selected one of the remaining one or more query processing configurations to perform the query is not attached to the database, and wherein the proxy service is further configured to request a control plane of the database service to attach the selected one of the remaining one or more query processing configurations to the database in order to be routed the query for performance.

Clause 24. The system of any one of clauses 21-23, wherein the proxy service is further configured to determine that a query processing resource limit for the database is not exceeded by attaching the selected one of the remaining one or more query processing configurations. Clause 25. A method, comprising: receiving a query via a network endpoint associated with a database managed by a database service; determining, by the database service, respective response times predicted for the query using a plurality of query processing configurations available to perform the query at the database service; excluding, by the database service, one or more query processing configurations of the plurality of query processing configurations to determine a remaining one or more query processing configurations responsive to determining that the respective response times for the one or more query processing configurations exceed a variability threshold determined for the query; and selecting, by the database service, one of the remaining one or more query processing configurations to perform the query.

Clause 26. The method of clause 25, wherein determining the respective response times predicted for the query using a plurality of query processing configurations available to perform the query at the database service, comprises: predicting respective execution times of the query using the plurality of query processing configurations; determining respective bootstrap times to prepare the plurality of query processing configurations; determining respective queue times for the query at the plurality of query processing configurations to perform the query; and adding the respective execution times, bootstrap times, and queue times to determine the respective response times predicted for the query.

Clause 27. The method of any one of clauses 25-26, wherein predicting respective execution times of the query using the plurality of query processing configurations comprises applying a trained machine learning model that accepts as input a plan to perform the query and a query processing configuration to make the prediction of an execution time for the query processing configuration.

Clause 28. The method of any one of clauses 25-27, further comprising: disabling prediction of response times for subsequent queries responsive to determining that a prediction accuracy for queries fails to satisfy an accuracy criteria.

Clause 29. The method of any one of clauses 25-28, further comprising: disabling prediction of response times for subsequent queries responsive to determining that a prorated variability exceeds a threshold variability target for an execution range of time.

Clause 30. The method of any one of clauses 25-29, wherein the selected one of the remaining one or more query processing configurations to perform the query is already attached to the database.

Clause 31. The method of any one of clauses 25-30, wherein the selected one of the remaining one or more query processing configurations to perform the query is not attached to the database, and wherein the method further comprises causing the selected one of the remaining one or more query processing configurations to be attached to the database in order to be routed the query for performance.

Clause 32. The method of clause 31, further comprising determining that a query processing resource limit for the database is not exceeded by attaching the selected one of the remaining one or more query processing configurations.

Clause 33. The method of any one of clauses 25-32, further comprising: determining that a query processing resource limit for the database is exceeded by attaching the selected one of the remaining one or more query processing configurations; and selecting a different one of the one or more remaining query processing configurations to perform the query that does not cause the query processing resource limit for the database to be exceeded.

Clause 34. One or more non-transitory, computer-readable storage media, storing program instructions that when executed on or across one or more computing devices cause the one or more computing devices to implement a database service that implements: receiving a query via a network endpoint associated with a database managed by the database service; determining respective response times predicted for the query using a plurality of query processing configurations available to perform the query at the database service; determining that the respective response times for one or more query processing configurations exceed a variability threshold determined for the query; excluding the one or more query processing configurations of the plurality of query processing configurations to determine a remaining one or more query processing configurations; selecting one of the remaining one or more query processing configurations to perform the query; and causing the query to be performed at the selected query processing configuration.

Clause 35. The one or more non-transitory, computer-readable storage media of clause 34, wherein, in determining the respective response times predicted for the query using a plurality of query processing configurations available to perform the query at the database service, the program instructions cause the one or more computing devices to implement: predicting respective execution times of the query using the plurality of query processing configurations; determining respective bootstrap times to prepare the plurality of query processing configurations; determining respective queue times for the query at the plurality of query processing configurations to perform the query; and adding the respective execution times, bootstrap times, and queue times to determine the respective response times predicted for the query.

Clause 36. The one or more non-transitory, computer-readable storage media of any one of clauses 34-35, storing further instructions that when executed on or across the one or more computing devices, cause the one or more computing devices to implement: disabling prediction of response times for subsequent queries responsive to determining that a prediction accuracy for queries fails to satisfy an accuracy criteria.

Clause 37. The one or more non-transitory, computer-readable storage media of any one of clauses 34-36, storing further instructions that when executed on or across the one or more computing devices, cause the one or more computing devices to implement: disabling prediction of response times for subsequent queries responsive to determining that a prorated variability exceeds a threshold variability target for an execution range of time.

Clause 38. The one or more non-transitory, computer-readable storage media of any one of clauses 34-37, wherein the selected one of the remaining one or more query processing configurations to perform the query is not attached to the database, and wherein the one or more non-transitory, computer-readable storage media store further instructions that when executed on or across the one or more computing devices, cause the one or more computing devices to implement causing the selected one of the remaining one or more query processing configurations to be attached to the database in order to be routed the query for performance.

Clause 39. The one or more non-transitory, computer-readable storage media of any one of clauses 34-38, storing further instructions that when executed on or across the one or more computing devices, cause the one or more computing devices to implement determining that a query processing resource limit for the database is not exceeded by attaching the selected one of the remaining one or more query processing configurations.

Clause 40. The one or more non-transitory, computer-readable storage media of any one of clauses 34-39, wherein the database was created in response to a request that specified that the database was to be managed by the database service.

Clause 41. A system, comprising: a plurality of computing devices implementing different respective hosts of a database service offered by a provider network, wherein the database service comprises a primary processing cluster to perform database queries to a database hosted by the database service, wherein the primary processing cluster comprises a leader node and a plurality of compute nodes hosted at different ones of the respective hosts; wherein the leader node is configured to: receive a database query directed to the database; determine to use respective additional processing clusters for individual ones of the compute nodes; generate a plan to perform the database query at the primary processing cluster, wherein the plan includes one or more operations to instruct individual ones of the compute nodes to use the respective additional processing clusters to perform the one or more operations and return results of the one or more operations to the individual ones of the compute nodes, wherein the respective additional processing clusters implement a same query processing engine as is implemented by the primary processing cluster; respectively instruct the compute nodes to execute the plan to perform the database query, wherein the respective instructions cause individual ones of the compute nodes to send requests to the respective additional processing clusters to perform the one or more operations of the plan; and return a result of the database query generated based on the performance of the one or more operations of the plan at the respective additional processing clusters.

Clause 42. The system of clause 41, wherein the leader node is further configured to: receive a second database query directed to the database; determine not to use the respective additional processing clusters for individual ones of the compute nodes; generate a plan to perform the second database query at the primary processing cluster alone; respectively instruct the compute nodes to execute the plan to perform the second database query; and return a result of the second database query generated based on the performance of the plan to perform the second database query.

Clause 43. The system of any one of clauses 41-42, wherein the respective additional processing clusters are hosted on a same one of the hosts that hosts the compute nodes that instructed the additional processing clusters.

Clause 44. The system of any one of clauses 41-43, wherein the requests sent to the respective additional processing clusters to perform the one or more operations of the plan are sent over a network to other ones of the hosts that host the additional processing clusters.

Clause 45. A method, comprising: receiving a database query at a processing cluster comprising a plurality of compute nodes that execute different portions of queries to a database; generating, by the processing cluster, a plan to perform the database query at the processing cluster, wherein the plan includes one or more operations to instruct individual ones of the compute nodes to use respective additional processing clusters to perform the one or more operations and return results of the one or more operations to the individual ones of the compute nodes, wherein the additional processing clusters implement a same query processing engine as is implemented by the processing cluster; executing, by the processing cluster, the plan to perform the database query, comprising sending requests, by the individual ones of the compute nodes, to the respective additional processing clusters to perform the one or more operations of the plan; and returning, by the processing cluster, a result of the database query generated based on the execution of the plan to perform the database query.

Clause 46. The method of clause 45, further comprising: receiving a second database query at the processing cluster; generating, by the processing cluster, a plan to perform the second database query at the processing cluster alone; executing, by the processing cluster, the plan to perform the second database query; and returning, by the processing cluster, a result of the second database query generated based on the execution of the plan to perform the second database query.

Clause 47. The method of clause 46, wherein executing the plan to perform the second database query comprises accessing one or more locally attached storage devices at the compute nodes of the processing cluster.

Clause 48. The method of clause 46, wherein executing the plan to perform the second database query comprises sending one or more requests to access one or more data objects storing data for the database in separate storage service.

Clause 49. The method of any one of clauses 45-48, wherein the compute nodes are implement on different respective hosts, and wherein individual ones of the respective additional processing clusters are hosted on a same one of the hosts that hosts the compute node that instructed the additional processing cluster.

Clause 50. The method of any one of clauses 45-49, wherein the requests sent to the respective additional processing clusters to perform the one or more operations of the plan are sent over a network from respective hosts for the individual ones of the compute nodes to different hosts that of the additional processing clusters.

Clause 51. The method of any one of clauses 45-50, wherein generating the plan to perform the database query at the processing cluster comprises: generating, by a leader node of the processing cluster, an initial plan to perform the database query from a query tree generated as a result of parsing the database query; evaluating, by the leader node of the processing cluster, the initial plan to determine that the respective additional processing clusters are to be used; and including, by the leader node of the processing cluster, the one or more operations to instruct the individual ones of the compute nodes to use the respective additional processing clusters as part of rewriting the initial query plan to generate an optimized query plan, wherein the optimized query plan executed by the processing cluster.

Clause 52. The method of clause 51 , wherein executing the plan to perform the database query comprises determining at one of the compute nodes that a size of the database query does not indicate local execution of the database query before sending the request to one of the respective additional processing clusters to perform the one or more operations of the plan.

Clause 53. The method of any one of clauses 45-52, wherein at least one operation result corresponding to the one or more operations performed by the additional processing clusters is sent by one of the additional processing clusters to one of the compute nodes that did not instruct the one additional processing cluster.

Clause 54. One or more non-transitory, computer-readable storage media, storing program instructions that when executed on or across one or more computing devices cause the one or more computing devices to implement: receiving a database query at a processing cluster comprising a plurality of compute nodes that execute different portions of queries to a database; determining, by the processing cluster, to use respective additional processing clusters for individual ones of the compute nodes; generating, by the processing cluster, a plan to perform the database query at the processing cluster, wherein the plan includes one or more operations to instruct individual ones of the compute nodes to use the respective additional processing clusters to perform the one or more operations and return results of the one or more operations to the individual ones of the compute nodes, wherein the respective additional processing clusters implement a same query processing engine as is implemented by the processing cluster; executing, by the processing cluster, the plan to perform the database query, comprising sending requests, by the individual ones of the compute nodes, to the respective additional processing clusters to perform the one or more operations of the plan; and returning, by the processing cluster, a result of the database query.

Clause 55. The one or more non-transitory, computer-readable storage media of clause 54, storing further instructions that when executed on or across the one or more computing devices, cause the one or more computing devices to implement: receiving a second database query at the processing cluster; generating, by the processing cluster, a plan to perform the second database query at the processing cluster alone; executing, by the processing cluster, the plan to perform the second database query; and returning, by the processing cluster, a result of the second database query generated based on the execution of the plan to perform the second database query.

Clause 56. The one or more non-transitory, computer-readable storage media of any one of clauses 54-55, wherein the compute nodes are implement on different respective hosts, and wherein individual ones of the respective additional processing clusters are hosted on a same one of the hosts that hosts the compute node that instructed the additional processing cluster. Clause 57. The one or more non-transitory, computer-readable storage media of any one of clauses 54-56, wherein the requests sent to the respective additional processing clusters to perform the one or more operations of the plan are sent over a network from respective hosts for the individual ones of the compute nodes to different hosts that of the additional processing clusters. Clause 58. The one or more non-transitory, computer-readable storage media of any one of clauses 54-57, wherein, in determining to use the respective additional processing clusters for the individual ones of the compute nodes, the program instructions cause the one or more computing devices to implement generating, by a leader node of the processing cluster, an initial plan to perform the database query from a query tree generated as a result of parsing the database query; evaluating, by the leader node of the processing cluster, the initial plan to determine that the respective additional processing clusters are to be used; and wherein, in generating the plan to perform the database query at the processing cluster, the program instructions cause the one or more computing devices to implement including, by the leader node of the processing cluster, the one or more operations to instruct the individual ones of the compute nodes to use the respective additional processing clusters as part of rewriting the initial query plan to generate an optimized query plan, wherein the optimized query plan executed by the processing cluster.

Clause 59. The one or more non-transitory, computer-readable storage media of any one of clauses 54-58, wherein at least one operation result corresponding to the one or more operations performed by the additional processing clusters is sent by one of the additional processing clusters to one of the compute nodes that did not instruct the one additional processing cluster.

Clause 60. The one or more non-transitory, computer-readable storage media of any one of clauses 54-59, wherein the processing cluster and the respective additional processing clusters are implemented as part of a data warehouse service offered by a provider network that hosts the database and wherein data for the database is stored in a separate storage service offered by the provider network.

Clause 61. A system, comprising: a plurality of computing devices, respectively comprising at least one processor and a memory that implement a database service of a provider network, wherein the database service is configured to: receive a request to restore a manually managed database currently hosted across a first topology of a first plurality of nodes as a service managed database, wherein the restore of the manually managed database as a service managed database restores the database into a second plurality of nodes with a second topology different than the first topology of the first plurality of nodes, wherein the manually managed database is distributed across the first topology of nodes according to a custom distribution scheme; generate a plan to map different portions of the manually managed database in the first topology to respective target locations in the second plurality of nodes according to the second topology using a general distribution scheme; perform the plan to map the different portions of the database in the first topology to the respective target locations in the second plurality of nodes according to the second topology; after performing the plan: make the service managed database available for access using the second plurality of nodes; and modify, as a background process, the distribution of the service managed database across the second topology of the second plurality of nodes to match the custom distribution scheme.

Clause 62. The system of clause 61, wherein the database service is configured to encrypt the database as part of the performance of the plan to move the different portions of the database in the first topology to the respective target locations in the second plurality of nodes according to the second topology.

Clause 63. The system of any one of clauses 61-62, wherein the database service is configured to merge one or more portions of the database in the second topology of the second plurality of nodes as part of the performance of the plan to move the different portions of the database in the first topology to the respective target locations in the second plurality of nodes according to the second topology.

Clause 64. The system of any one of clauses 61-63, wherein to modify the distribution of the service managed database across the second topology of the second plurality of nodes to match the custom distribution scheme, the database service is configured to create one or more shadow tables of the database to apply the modifications to match the custom distribution scheme.

Clause 65. A method, comprising: receiving a request to restore a database currently hosted across a first topology of a first plurality of nodes into a second plurality of nodes with a second topology different than the first topology of the first plurality of nodes, wherein the database is distributed across the first topology of nodes according to a custom distribution scheme; generating a plan to map different portions of the database in the first topology to respective target locations in the second plurality of nodes according to the second topology using a general distribution scheme; performing the plan to map the different portions of the database in the first topology to the respective target locations in the second plurality of nodes according to the second topology; after performing the plan: making the database available for access using the second plurality of nodes; and modifying, as a background process, the distribution of the database across the second topology of the second plurality of nodes to match the custom distribution scheme.

Clause 66. The method of clause 65, wherein performing the plan comprises encrypting the database as part of moving the different portions of the database in the first topology to the respective target locations in the second plurality of nodes according to the second topology.

Clause 67. The method of any one of clauses 65-66, wherein the plan merges one or more portions of the database in the second topology of the second plurality of nodes as part of the performance of the plan to move the different portions of the database in the first topology to the respective target locations in the second plurality of nodes according to the second topology.

Clause 68. The method of any one of clauses 65-67, wherein modifying the distribution of the database across the second topology of the second plurality of nodes to match the custom distribution scheme comprises creating one or more shadow tables of the database to apply the modifications to match the custom distribution scheme and swapping to the one or more shadow tables after the one or more shadow tables are made consistent.

Clause 69. The method of any one of clauses 65-68, wherein the second topology includes at least one of a greater number of nodes or slices per node than the first topology.

Clause 70. The method of any one of clauses 65-69, wherein a first query is performed by the second plurality of nodes using a query plan that utilizes the general distribution scheme, and wherein a second query is performed by the second query of nodes after the distribution of the database is modified across the second topology of the second plurality of nodes to match the custom distribution scheme using the custom distribution scheme.

Clause 71. The method of any one of clauses 65-70, wherein generating the plan comprises obtaining a manifest that describes the first topology and mapping the different portions of the database in the first topology to the respective target locations in the second plurality of nodes according to the second topology using a round-robin distribution scheme.

Clause 72. The method of any one of clauses 65-71, wherein the custom distribution scheme is a distribution scheme applied using respectively specified keys for one or more tables of the database.

Clause 73. The method of any one of clauses 65-72, wherein the first topology is associated with a manual management mode for the database offered by a database service hosting the database and wherein the second topology is associated with a serverless management mode offered by the database service.

Clause 74. One or more non-transitory, computer-readable storage media, storing program instructions that when executed on or across one or more computing devices cause the one or more computing devices to implement: receiving a request to restore a database currently hosted across a first topology of a first plurality of nodes into a second plurality of nodes with a second topology different than the first topology of the first plurality of nodes, wherein the database is distributed across the first topology of nodes according to a custom distribution scheme; generating a plan to map different portions of the database in the first topology to respective target locations in the second plurality of nodes according to the second topology using a general distribution scheme; performing the plan to map the different portions of the database in the first topology to the respective target locations in the second plurality of nodes according to the second topology; after performing the plan: making the database available for access using the second plurality of nodes; and modifying, as a background process, the distribution of the database across the second topology of the second plurality of nodes to match the custom distribution scheme.

Clause 75. The one or more non-transitory, computer-readable storage media of clause 74, wherein the plan merges one or more portions of the database in the second topology of the second plurality of nodes as part of the performance of the plan to move the different portions of the database in the first topology to the respective target locations in the second plurality of nodes according to the second topology.

Clause 76. The one or more non-transitory, computer-readable storage media of any one of clauses 74-75, wherein, in modifying the distribution of the database across the second topology of the second plurality of nodes to match the custom distribution scheme, the program instructions cause the one or more computing devices to implement creating one or more shadow tables of the database to apply the modifications to match the custom distribution scheme and swapping to the one or more shadow tables after the one or more shadow tables are made consistent.

Clause 77. The one or more non-transitory, computer-readable storage media of any one of clauses 74-76, wherein a first query is performed by the second plurality of nodes using a query plan that utilizes the general distribution scheme, and wherein a second query is performed by the second query of nodes after the distribution of the database is modified across the second topology of the second plurality of nodes to match the custom distribution scheme using the custom distribution scheme.

Clause 78. The one or more non-transitory, computer-readable storage media of any one of clauses 74-77, wherein, in generating the plan, the program instructions cause the one or more computing devices to implement obtaining a manifest that describes the first topology and mapping the different portions of the database in the first topology to the respective target locations in the second plurality of nodes according to the second topology using a round-robin distribution scheme.

Clause 79. The one or more non-transitory, computer-readable storage media of any one of clauses 74-78, wherein the custom distribution scheme is a distribution scheme applied using respectively specified keys for one or more tables of the database.

Clause 80. The one or more non-transitory, computer-readable storage media of any one of clauses 74-79, wherein the first topology is associated with a manual management mode for the database offered by a database service hosting the database and wherein the second topology is associated with a serverless management mode offered by the database service.

Claims

87 CLAIMS WHAT IS CLAIMED IS:

1. A system, comprising: one or more computing devices, respectively comprising a processor and a memory that implement a processing cluster for a database service, the processing cluster comprising a leader node and one or more compute nodes, and wherein the leader node in the processing cluster is configured to: monitor a network endpoint at a proxy service associated with a database managed by the database service to detect an idle period for the database; select from a plurality of different management actions a management action for the processing cluster to perform during the detected idle period for the database according to the detected idle period; and send one or more requests to cause the management action for the processing cluster to be performed, wherein the database is available for read requests received via the network endpoint during performance of the management action by another processing cluster of the database service.

2. The system of claim 1, wherein to send the one or more requests to cause the management action for the processing cluster to be performed, the leader node is configured to send one or more requests to cause a software patch to be installed at the processing cluster.

3. The system of any one of claims 1-2, wherein to send the one or more requests to cause the management action for the processing cluster to be performed, the leader node is configured to send one or more requests to release the one or more compute nodes for other tasks of the database service and pause the processing cluster.

4. The system of claim 3, wherein the leader node is further configured to: detect a resume event for the processing cluster; determine a number of compute nodes to add to the processing cluster; and cause the determined number of compute nodes to be added to the processing cluster. 88

5. A method, comprising: monitoring, by a leader node of a processing cluster for accessing a database, a network endpoint at a proxy service associated with a database managed by a database service to detect an idle period for the database; determining, by the leader node, a management action for the processing cluster to perform during the detected idle period for the database; and causing, by the leader node, the management action for the processing cluster to be performed, wherein the database is available for read requests received via the network endpoint during performance of the management action by another processing cluster of the database service.

6. The method of claim 5, wherein causing the management action for the processing cluster to be performed, comprises causing a software patch to be installed at the processing cluster.

7. The method of any one of claims 5-6, wherein causing the management action for the processing cluster to be performed, comprises causing a vacuum operation to consolidate storage space for one or more records of the database left as a result of one or more deletions performed on the database.

8. The method of any one of claims 5-7, further comprising performing a metadata query at the leader node when the processing cluster is paused as a result of the management action.

9. The method of any one of claims 5-8, wherein causing the management action for the processing cluster to be performed, comprises sending one or more requests to release one or more compute nodes of the processing cluster for other tasks of the database service and pause the processing cluster.

10. The method of claim 9, further comprising: detecting a resume event for the processing cluster; determining a number of compute nodes to add to the processing cluster; and causing the determined number of compute nodes to be added to the processing cluster. 89

11. The method of claim 10, wherein the determined number of compute nodes is different than a number of compute nodes released from the processing cluster when the processing cluster was paused.

12. The method of any one of claims 5-11, wherein the determined management action is received at the leader node from a control plane of the database service.

13. The method of any one of claims 5-12, wherein causing the management action for the processing cluster to be performed, comprises causing a materialized view for the database to be created or updated.

14. One or more non-transitory, computer-readable storage media, storing program instructions that when executed on or across one or more computing devices cause the one or more computing devices to implement: monitoring, by a leader node of a processing cluster for accessing a database, a network endpoint at a proxy service associated with a database managed by a database service to detect an idle period for the database, wherein the database was created in response to a request that specified that the database was to be managed by the database service; determining, by the leader node, a management action for the processing cluster to perform during the detected idle period for the database; and causing, by the leader node, the management action for the processing cluster to be performed, wherein the database is available for read requests received via the network endpoint during performance of the management action.

15. The one or more non-transitory, computer-readable storage media of claim 14, wherein, in causing the management action for the processing cluster to be performed, the program instructions cause the one or more computing devices to implement causing a software patch to be installed at the processing cluster.