CN108369588B - Database level automatic storage management - Google Patents

Database level automatic storage management

Info

Publication number: CN108369588B
Application number: CN201680071276.6A
Authority: CN (China)
Prior art keywords: file, group, files, storage management, database
Other versions: CN108369588A
Other languages: Chinese (zh)
Inventors: H·D·钱, P·V·巴盖尔, H·南达拉, A·L·索利斯, S·塞瓦拉基
Assignee (original and current): Oracle International Corp
Application filed by Oracle International Corp
Legal status: Active

Classifications

    • G06F11/2069 Management of state, configuration or failover (redundant persistent mass storage using active fault-masking, by mirroring)
    • G06F11/2058 Redundant persistent mass storage by mirroring using more than 2 mirrored copies
    • G06F16/13 File access structures, e.g. distributed indices
    • G06F16/164 File meta data generation
    • G06F16/182 Distributed file systems
    • G06F16/185 Hierarchical storage management [HSM] systems, e.g. file migration or policies thereof
    • G06F16/21 Design, administration or maintenance of databases
    • G06F16/22 Indexing; Data structures therefor; Storage structures
    • G06F16/2246 Trees, e.g. B+trees
    • G06F16/27 Replication, distribution or synchronisation of data between databases or within a distributed database system; Distributed database system architectures therefor

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Software Systems (AREA)
  • Quality & Reliability (AREA)
  • Computing Systems (AREA)
  • Human Computer Interaction (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

Techniques are described herein for associating storage management attributes with a set of files of a database, referred to herein as a "file group." In this system, storage management attributes are defined at the database level. Thus, multiple databases may be stored across a single disk group, gaining the benefit of operating multiple block access devices in parallel, while each respective database may be associated with a respective file group in a one-to-one relationship, so that each database may have different storage management attributes.

Description

Database level automatic storage management
Technical Field
The present invention relates to volume management, and more particularly to improved computer-implemented techniques for database-level automatic storage management.
Background
A volume manager may be used to create a storage pool, referred to as a "disk group," consisting of multiple "physical" block-accessible devices, and to present the multiple block devices as a single logical volume with higher I/O bandwidth and greater fault tolerance. Typically, the volume manager allows disks or third-party storage arrays to be added to the logical volume on the fly. A file system is a logical abstraction that allows applications to access logical volumes using file and directory names instead of block addresses.
Typically, storage is managed at the disk group level. If a database administrator (DBA) wants to maintain multiple databases with different capability and availability constraints, the DBA must allocate a disk group for each set of availability constraints: one disk group is created for a test database; a second disk group is created for a database requiring two-way mirroring; and a third disk group is made for a database requiring three-way mirroring.
Unfortunately, when provisioning disk groups, a database administrator must decide which resources to assign to different disk groups before knowing all the requirements of the databases that may be stored on those disk groups. Over-provisioning a disk group to meet predicted future high availability constraints may result in wasted storage resources. Starving a disk group may create a later need to migrate another database's data off a less-used disk so that the disk can be added to the starved disk group, and the newly added disk then needs to be rebalanced. Data movement is a slow and computationally expensive process, and therefore should be avoided as much as possible. Data movement may also hinder the performance of other databases using the disk groups.
The approaches described in this section are approaches that could be pursued, but not necessarily approaches that have been previously conceived or pursued. Thus, unless otherwise indicated, it should not be assumed that any of the approaches described in this section qualify as prior art merely by virtue of their inclusion in this section.
Drawings
In the drawings:
FIG. 1 is a block diagram illustrating an example of a system architecture of a storage management system configured for database level automated storage management;
FIG. 2A is a flowchart illustrating an example program flow for creating database level storage management metadata;
FIG. 2B is a flowchart illustrating an example program flow for automatically recovering lost data of multiple databases after a disk failure;
FIG. 3A is a block diagram illustrating an example of a system architecture of a storage management system after a disk failure;
FIG. 3B is a block diagram illustrating an example of a system architecture of a storage management system after initiating lost data recovery;
FIG. 3C is a block diagram illustrating an example of a system architecture of a storage management system after recovery of lost data is complete.
FIG. 4 is a block diagram illustrating a computer system that may be used to implement the techniques described herein.
Detailed Description
In the following description, for purposes of explanation, numerous specific details are set forth in order to provide a thorough understanding of the present invention. It may be evident, however, that the present invention may be practiced without these specific details. In other instances, well-known structures and devices are shown in block diagram form in order to avoid unnecessarily obscuring the present invention.
General overview
Techniques are described herein for associating storage management attributes with a set of files of a database, referred to herein as a "file group." In this system, storage management attributes are defined at the database level. Thus, multiple databases may be stored across a single disk group, gaining the benefit of operating multiple block access devices in parallel, while each respective database may be associated with a respective file group in a one-to-one relationship, so that each database may have different storage management attributes.
Overview of the System
FIG. 1 is a block diagram illustrating an example of a system architecture of a storage management system 100 configured for database-level automated storage management. The system architecture includes two database server computers ("nodes") 102, 122 coupled to addressable block storage units (referred to herein as "disks" for simplicity) 142, 144, 146, 148, 150, 152. The nodes 102, 122 include volatile memory 106, 126 and processors 104, 124 that execute database server instances 108, 128 and Automatic Storage Management (ASM) instances 112, 132. In an alternative embodiment, rDBMS 100 may include one or more database server computers, each executing one or more database server instances, coupled to one or more databases stored on one or more shared persistent storage devices (e.g., hard disk, flash memory) through an automated storage management layer that includes one or more automated storage management instances. For example, while in the illustrated embodiment database server computer 102 is executing a single database server instance 108, in an alternative embodiment a single database server computer may execute three database server instances, where each database server instance is operatively coupled to the same shared disk(s) through a single automated storage management instance.
The database server instances 108, 128 execute database commands submitted to the database server computers 102, 122 by one or more users or database applications. These user and database applications may be referred to herein as external entities to indicate that they are external to the internal programs and structures of rDBMS 100. External entities may connect to rDBMS 100 through a network in a client-server relationship.
Each database server instance 108, 128 also includes processes such as a query optimizer (not shown), a query coordinator (not shown), and one or more processes (not shown), called "slave processes," that perform database operations in parallel. A slave process may contain one or more threads; when a slave process contains a single thread, the slave process itself may be referred to as a thread. When a thread reads data, it may be referred to as a reader thread; when a thread writes data, it may be referred to as a writer thread.
The database management system (DBMS) 100 of FIG. 1 includes an Automated Storage Management (ASM) stack layer 110 that includes two automated storage management instances 112, 132. An ASM instance is a volume manager that automatically performs storage management operations between the block and file levels. Reads and writes are routed through the ASM stack layer 110 to transparently provide automated storage management between the database server instances 108, 128 and persistent storage (e.g., disks 142-152). For example, a single read operation performed by database server instance 108 may be routed by ASM instance 112 to either a primary copy of the data (e.g., 170A or 180A) or a mirror copy of the data (e.g., 170B, 170C, or 180B) to provide highly available access to the data at any given time. As another example, a single write operation performed by the database server instance 108 may be cloned into additional writes in order to propagate changes to both the primary copy of the data (e.g., 170A or 180A) and the mirror copies of the data (e.g., 170B and 170C, or 180B). The ASM layer 110 may be configured to require all writes to complete before performing the next database operation, or the ASM layer 110 may be configured to perform the writes asynchronously.
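The read routing and write fan-out just described can be condensed into a short, self-contained sketch. The class names and in-memory copies below are illustrative assumptions, not the patent's implementation:

```python
# Minimal sketch of ASM-layer I/O routing, assuming in-memory "copies".
class MirrorCopy:
    """One physical copy of a file (e.g., 170A, 170B, or 170C in FIG. 1)."""
    def __init__(self, size):
        self.data = bytearray(size)
        self.available = True

    def read(self, offset, length):
        return bytes(self.data[offset:offset + length])

    def write(self, offset, payload):
        self.data[offset:offset + len(payload)] = payload

class AsmRouter:
    """Stands in for ASM stack layer 110 between server and disks."""
    def __init__(self, copies):
        self.copies = copies

    def read(self, offset, length):
        # A single read is routed to any available copy, primary or
        # mirror, which is what provides highly available access.
        for copy in self.copies:
            if copy.available:
                return copy.read(offset, length)
        raise IOError("no available copy of the data")

    def write(self, offset, payload):
        # A single logical write is cloned into one physical write per
        # copy, propagating the change to the primary and every mirror.
        for copy in self.copies:
            if copy.available:
                copy.write(offset, payload)

router = AsmRouter([MirrorCopy(4096) for _ in range(3)])  # redundancy: high
router.write(0, b"header")
assert router.read(0, 6) == b"header"
```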
The disks 142-152 include data and metadata stored in files. The ASM stack layer 110 may store files in stripes. To stripe a file, an ASM instance divides the file into stripes of equal size and spreads the data evenly across multiple disks in a disk group (e.g., disk group 140). The stripes are of constant size and are shown in FIG. 1 as equally sized boxes. A file contains many stripes, but only a few are shown for simplicity. In some cases, a file may be stored entirely on a single disk (e.g., file 154).
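As a rough illustration of this striping policy, the sketch below cuts a file into fixed-size stripes and spreads them over the disks of a disk group; the 1 MB stripe size and round-robin placement are assumptions for illustration only:

```python
# Sketch of striping: equal-size stripes spread evenly across a disk group.
STRIPE_SIZE = 1 << 20  # constant stripe size (the equal boxes in FIG. 1)

def stripe_file(file_bytes, disks):
    """Return a placement map: stripe index -> (disk, stripe bytes)."""
    stripes = [file_bytes[i:i + STRIPE_SIZE]
               for i in range(0, len(file_bytes), STRIPE_SIZE)]
    # Round-robin placement spreads the data evenly, so sequential I/O
    # engages multiple block access devices in parallel.
    return {i: (disks[i % len(disks)], s) for i, s in enumerate(stripes)}

# A file smaller than one stripe lands entirely on a single disk,
# like file 154 in FIG. 1.
```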
A database includes a tablespace, which is a collection of files used to store data for database objects (e.g., tables, temporary tables, indexes, logs, and control files). Metadata about these database objects may be stored in the file itself, or in a separate file called a data dictionary. The disks 142-152 may contain data from multiple databases and, thus, contain multiple different control files and data files belonging to different databases.
The automated storage management layer 110 creates and stores files used by the automated storage instances to manage data in the database. The ASM instance may read the PST files (e.g., 154) to determine which disk belongs to which disk group, and read the file group files (e.g., file 164) to determine the subset of files belonging to any file group. The file group file may also be used to determine which attributes belong to a particular file group. An ASM instance may cache data from any ASM file in volatile memory local to the ASM instance using the data to reduce the overhead required to perform storage management operations.
Performing operations in the database management system 100 typically involves calling multiple layers, with stack layers calling other stack layers. These calls may involve many stack layers in deep nesting and recursive traversal. One example is a DML statement that inserts a row into a table. A SQL stack layer (e.g., a process in the database server instance 108 or 128) may receive and analyze the SQL statement to formulate an execution plan, and then invoke a segment stack layer (e.g., another process in the database server instance 108 or 128) to find free space for the row in the specified table. The segment stack layer may match the specified table with the appropriate segment and call the tablespace stack layer to find a free extent with enough space for the row. The tablespace stack layer may look up or create a free extent in a data file created by the ASM stack layer. The call returns to the segment stack layer, where the segment's extent map may be updated. The call then returns to the SQL stack layer, which may now pass the free extent to a data stack layer (not shown) for insertion of the row. The data stack layer formats the row into row data and stores the row data in the free extent, updating any associated indices or other row metadata as needed.
Block addressable storage units
A byte is eight bits and is the smallest amount of data that can be addressed, retrieved from memory, or written to byte-addressable memory. Thus, to manipulate a bit in a byte, the byte containing the bit must be fetched into register(s) by an instruction referencing the byte (or the word containing the byte), and the bit manipulated according to the instruction. In contrast, the minimum addressable unit of block addressable memory is a block. A block includes multiple bytes and multiple words. For block addressable memory, a block is the minimum amount of data that can be addressed, retrieved from memory, or written to memory. Examples of block addressable memory include flash memory and disk memory. To manipulate a bit or a byte in a block, the block containing those bits is loaded into byte-addressable memory by an instruction, issued to a block-based interface, that references the block; the bits or bytes are then manipulated in byte-addressable memory.
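The read-modify-write cycle this implies can be sketched with POSIX-style calls; the 4 KB block size is an assumption, and the file descriptor is assumed to have been opened on a block device elsewhere:

```python
import os

BLOCK_SIZE = 4096  # assumed block size for illustration

def modify_byte(fd, byte_offset, new_value):
    block_no = byte_offset // BLOCK_SIZE
    # 1. Fetch the whole block containing the target byte into
    #    byte-addressable main memory.
    block = bytearray(os.pread(fd, BLOCK_SIZE, block_no * BLOCK_SIZE))
    # 2. Manipulate the single byte in byte-addressable memory.
    block[byte_offset % BLOCK_SIZE] = new_value
    # 3. Write the whole block back through the block-based interface.
    os.pwrite(fd, bytes(block), block_no * BLOCK_SIZE)
```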
Disk groups
The disk group includes a set of one or more block addressable storage units configured to store a plurality of database files for a plurality of databases. The block addressable storage units may comprise physical hard disks, logical units, third party storage arrays, or any type of SCSI device. The term "disk" as used herein refers to any form of persistent storage device in a broad sense.
Disk groups are self-describing: the metadata defining which block addressable storage units belong to a disk group is stored within the disk group's own block addressable storage units. Metadata defining a disk group is stored in one or more files called Partnership and Status Tables (PSTs). A PST contains metadata about all block addressable storage units in the disk group: disk number, disk status, partner disk number, and so on. In case of a disk failure, multiple copies of the PST must be available. For this reason, a complete PST file is located on a single disk, rather than being striped across multiple disks.
In some embodiments, a copy of the PST file need not be located on every disk in the disk group. Instead, a copy of the PST file is located on a majority of the disks in the disk group. For example, five copies of the PST files 154, 156, 158, 160, 162 are located on five of the six disks 142-152 in disk group 140. If a disk storing a PST file fails, an ASM instance (e.g., ASM instance 112 or 132) may copy the PST file to another disk that previously did not contain a PST file. One disk's PST file may also be compared to another disk's PST file to determine whether it is up to date. The PST file may contain a timestamp used to determine which file is more current when two or more copies differ.
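The PST bookkeeping described above can be sketched as follows; the field names and the quorum helper are illustrative assumptions:

```python
from dataclasses import dataclass

@dataclass
class PstFile:
    disk_numbers: list   # every block addressable unit in the disk group
    disk_status: dict    # disk number -> "online" / "failed"
    partners: dict       # disk number -> partner disk numbers
    timestamp: float     # used to decide which copy is most current

def most_current_pst(copies):
    # If two or more PST copies differ, the newest timestamp wins.
    return max(copies, key=lambda pst: pst.timestamp)

def replace_lost_pst_copy(pst_disks, all_disks, majority):
    # If a disk holding a PST fails, copy the PST to a disk that did not
    # previously hold one, keeping a majority of the disks covered.
    if len(pst_disks) < majority:
        spare = next(d for d in all_disks if d not in pst_disks)
        pst_disks.append(spare)  # the ASM instance writes a PST copy here
    return pst_disks
```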
For more information on disk groups and PST files, see the ASM specification: Belden, Eric, et al., "Oracle Automatic Storage Management Administrator's Guide, 12c Release 1 (12.1)," E41058-11, published May 2015, the entire contents of which are incorporated herein by reference as if fully set forth herein.
File group
A file group describes a collection of files and one or more storage management attributes associated with those files. For any particular storage management attribute, all files of a particular file type in the file group must share the same value for that storage management attribute. Even though the files in a file group may share the same value for each particular storage management attribute, files within the file group may be automatically managed differently based on their file types. For example, all files of the same file type in the file group share the same value of the redundancy attribute, but some files may be automatically mirrored additional times based on their file type. The storage management attributes of a file group may be set or changed at any time after the file group has been created.
A file group is contained within a single disk group and is dedicated to a single database, multitenant Container Database (CDB), or Pluggable Database (PDB). For example, in FIG. 1, file group 170 is dedicated to database DB1 stored in example files 172, 174, and 176. File group 180 is dedicated to database DB2 stored in example files 182, 184, and 186.
The database (or CDB or PDB) may have only one file group per disk group. The main benefit of a file group is the ability to have different storage management attributes for each database in the same disk group.
A file group directory file (e.g., file 164) contains metadata describing all existing file groups in the disk group, including a list of the files associated with each file group and the file group attributes of each file group. The file group directory file is typically automatically three-way mirrored.
The files in the file group may be striped like any other file. A file group directory file (e.g., file 164) may store metadata about which files belong to which file group. The file group directory file may also store metadata about which storage management attributes belong to which file groups. Metadata of a file specifies which file group describes the file; instead, the metadata of the file group lists the set of files described by the file group. In other words, the metadata of the file and the metadata of the file group are two pieces of metadata pointing to each other.
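The two-way metadata can be pictured as a pair of record types that point at each other; all names below are illustrative assumptions:

```python
from dataclasses import dataclass, field

@dataclass
class AsmFile:
    name: str
    file_group: str           # file metadata names its describing group

@dataclass
class FileGroup:
    name: str                 # e.g., a group dedicated to database DB1
    database: str             # the single DB/CDB/PDB it serves
    attributes: dict = field(default_factory=dict)  # {"redundancy": "high"}
    files: list = field(default_factory=list)       # group lists its files

def add_file(group, file_name):
    f = AsmFile(file_name, group.name)  # file -> group
    group.files.append(f)               # group -> file
    return f
```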
Quota groups
In some embodiments, multiple file groups may be grouped into quota groups. The quota group may be used to define further storage management attributes for a set of file groups. Certain attributes, such as the maximum amount of storage that a file in a set of file groups can occupy, can be set using the quota group identifier. By setting the maximum storage that the files in the quota group can occupy, the database administrator can ensure that there will be enough space to make mirror copies of all file groups in the disk group.
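A quota check of this kind might look like the sketch below, which extends the file group records sketched earlier with a hypothetical quota_group field and per-file sizes:

```python
# Hypothetical quota enforcement: refuse an allocation that would push the
# combined size of a quota group's file groups past its maximum.
def can_allocate(quota_group, file_groups, new_bytes):
    used = sum(f.size
               for fg in file_groups
               if fg.quota_group == quota_group.name
               for f in fg.files)
    # Keeping usage under max_bytes ensures there is still space to make
    # mirror copies of all file groups in the disk group.
    return used + new_bytes <= quota_group.max_bytes
```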
Redundancy
Redundancy is a storage management attribute that defines how many copies of a primary file must be stored as a set of mirror copies. The primary and mirror copies are typically stored on separate subsets of block addressable storage units called "failure groups." Thus, if a disk holding either the primary or a mirror copy fails, the entire required set of data and metadata will still be available in another failure group.
Accepted values for the redundancy attribute include: unprotected, meaning only the primary copy exists; mirror, meaning two copies exist in total; high, meaning three copies exist in total; parity, meaning there is one parity extent for every N (configurable) data extents, the purpose of the parity extent being to enable reconstruction of data after the loss of one data extent; and double parity, meaning there are two parity extents for every N (configurable) data extents. With double parity, there is enough information to recover from the loss of two data extents.
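For the parity values, a worked XOR example shows how the loss of one data extent is rebuilt from the parity extent plus the survivors (single parity only; double parity adds a second, independently coded parity extent to survive two losses):

```python
# Single-parity sketch: one XOR parity extent per N data extents lets any
# one lost data extent be rebuilt from the parity plus the survivors.
def parity_extent(data_extents):
    parity = bytearray(len(data_extents[0]))
    for extent in data_extents:
        for i, b in enumerate(extent):
            parity[i] ^= b
    return bytes(parity)

def reconstruct_lost_extent(surviving_extents, parity):
    lost = bytearray(parity)
    for extent in surviving_extents:
        for i, b in enumerate(extent):
            lost[i] ^= b
    return bytes(lost)

data = [b"\x01\x02", b"\x04\x08", b"\x10\x20"]        # N = 3 data extents
p = parity_extent(data)
assert reconstruct_lost_extent([data[0], data[2]], p) == data[1]
```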
In some embodiments, some files or extents are three-way mirrored whenever possible. For example, the file group directory file (e.g., file group file 164) is three-way mirrored whenever possible. In some embodiments, even if the redundancy level is set to mirror (only two copies), the 0th extent of each database's first data file is mirrored three times. The 0th extent includes a header data block containing a large amount of metadata about the file. If only the 0th extent survives a catastrophic failure, this data can be used in conjunction with a tape backup to restore the database.
Mirror split copies
The mirror split copy attribute is a storage management attribute of a file group that defines the number of additional copies to be created, for each data file in the file group, for the mirror split process. Mirror splitting is used to provision a base image for sparse clones of a database. The number of additional copies for mirror splitting is independent of file redundancy and does not obey the failure group rules; that is, these copies will not necessarily be placed on separate failure groups. When a mirror split is performed for a file group, the additional copy may eventually serve as the primary copy of the file. The attribute may apply only to certain file types, such as database data files. For example, if file group 170 is given the value "mirror" for the mirror split copy attribute, DB1 data file 174 may be mirrored on multiple disks within the failure group (e.g., DB1 data file 174-1 will have a copy on disk 142 and disk 144; DB1 data file 174-2 will have a copy on disk 144 and disk 142).
Stripe size
Striping is a storage management attribute of a file group that defines the stripe size for the files in the file group. The importance of this attribute depends on the size of the files in a particular database. Stripes are contiguous block addressable spaces, so a larger stripe size means more contiguous reads from a single read head of a hardware disk. However, if the file size is small, a larger stripe size results in more wasted space. Striping may be fine-tuned with the data block size of each particular database so that each database has stripes sized to maximize read and write efficiency.
Software versioning
Software versioning is a storage management attribute of a file group that defines the version of software that the database uses, for backward compatibility purposes. If a database is shut down for maintenance, the software versioning attribute ensures that the disk group is not upgraded to a newer version of software that could potentially make the database unusable.
Recovery attributes
When a block addressable storage unit of a disk group fails, the data from that storage unit needs to be recovered and rebalanced onto the remaining disks. Each addressable storage unit (e.g., a hard disk) may be sized to store a large amount of data (e.g., 10 TB). Moving this amount of data from the most recent copies to a new disk can take a significant amount of time and computing resources. Various attributes may be set at the database level to ensure that a high priority database is up and running with full backup copies as quickly as possible.
A. Priority level
The file group with the highest priority is fully restored first, then the file group with the next highest priority, and so on. In some embodiments, this is an even finer grained process in which file types are also taken into account. For example, a control file is necessary for a database instance to access the data files of a particular database, and thus the control file is given priority over the data files. Similarly, redo files are more important to database operations than data files, so a database's redo files are restored before its data files.
B. Power limit
Another attribute that can be fine-tuned for each database is the power limit. The power limit attribute is a storage management attribute that defines how many I/Os may be outstanding before an ASM instance (e.g., ASM instance 112 or 132) must wait before sending the next set of I/Os. This attribute is important because a database instance that is recovering a database may compete for resources used during the normal execution of operations on another database. Setting this attribute to a high number of outstanding I/Os will minimize the duration of recovery operations, but may reduce the performance of an application using another database. Thus, by keeping this number high for high priority databases and low for low priority databases, the ASM layer operations that restore and rebalance a low priority database are less likely to interfere with the normal operation of a high priority database.
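As a minimal illustration of this pacing behavior, the sketch below batches recovery I/Os by a per-file-group power limit; the function name and the submit/wait callbacks are assumptions, not part of the patent:

```python
# Minimal sketch of power-limit pacing: at most `power_limit` recovery
# I/Os are outstanding at once; the next batch is not sent until the
# current batch drains. `submit` and `wait_all` are hypothetical hooks.
def paced_recovery(io_requests, power_limit, submit, wait_all):
    for start in range(0, len(io_requests), power_limit):
        batch = io_requests[start:start + power_limit]
        handles = [submit(req) for req in batch]
        # A low limit keeps a low-priority database's recovery from
        # starving normal I/O; a high limit finishes recovery sooner.
        wait_all(handles)
```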
Overview of the implementation
FIG. 2A is a flowchart illustrating a program flow 200 for creating database-level storage management metadata. At step 202, the ASM instance associates a plurality of block addressable storage units as part of a disk group. The disk group is configured to store files of a plurality of databases. Associating the plurality of block addressable storage units may also include creating a PST file defining the disks in the disk group, and storing multiple copies of the PST file, as shown at step 204. Associating the plurality of block addressable storage units may further comprise setting default storage management attributes for files within the disk group, as shown at step 206. If individual storage management attributes are not set at the database level, the default storage management attributes are applied.
At step 208, the ASM instance stores a plurality of file groups in the disk group. Each of the file groups may have only files corresponding to a single database. Storing the file groups may also include creating a file group directory file that defines the file groups, and striping these files across a subset of the disks of the disk group (referred to herein as a "failure group"), as shown at step 210. The file group directory file is configured to describe the identity of each file group, its association with a database, all of its attribute names and the values of those attributes, and all files described by the file group.
When a file of a file group is mirrored, the file group directory file will also be mirrored. Storing the file groups may further include creating a given file group identifier for a given database and, for each file created for the given database, storing the file in the file group directory as being associated with the given file group identifier, as shown at step 212. Depending on the configuration, file groups may be created automatically or manually. At database creation time, the storage management attributes may be initially set to the default attributes set at step 206.
Arrow 216 indicates that step 212 may need to be performed multiple times. Each time a new database is created, a new file group and associated file group identifier are created in response. In some embodiments, a file group directory and attribute files are created and maintained for all databases. In an alternative embodiment, a new file group directory and attribute file is created each time a new database is created.
At step 218, the ASM instance stores the storage management attributes for each file group. These attributes serve as metadata that a database administrator can set and ASM instances can read to determine constraints when automatically performing future storage management operations. As shown in step 220, the database administrator may use DDL statements to set any individual storage management attributes of any database in the disk group.
Arrow 222 indicates that the operation may be performed multiple times. Before adding any data to a particular database, a database administrator may set one or more storage management attributes at database creation time. The database administrator may then later update the same attributes or update different attributes of the database. It is also critical to update storage management attributes after a database management architecture change, such as a disk failure or a vertical or horizontal upgrade of the storage resources of the database management system architecture. For example, if a disk fails, a default power limit may be set, but after installing and adding a new disk to the disk group, the DBA may wish to increase the power limit on weekends.
As another example, the database may increase or decrease its redundancy attribute at any time. For example, the redundancy attribute of DB1 may be reduced from "high" to "mirror", in which case the file will be rebalanced to look more like the DB2 file in FIG. 1. As an alternative example, the redundancy attribute of DB2 may be increased from "mirror" to "high", in which case the file would be rebalanced to look more like the DB1 file in FIG. 1.
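For concreteness, the attribute-setting step (step 220) might look roughly like the following; the DDL text and the connection helper are hypothetical sketches, not syntax taken from the patent:

```python
# Hypothetical sketch of step 220: a DBA issues a DDL statement to set a
# per-database (file group) storage management attribute. The statement
# syntax and the connection API are illustrative assumptions.
def set_filegroup_attribute(conn, disk_group, file_group, name, value):
    ddl = (f"ALTER DISKGROUP {disk_group} MODIFY FILEGROUP {file_group} "
           f"SET '{name}' = '{value}'")
    conn.execute(ddl)  # the ASM instance records the value in file 164

# A file group inherits the disk group defaults at creation (step 206);
# any attribute can be changed later (arrow 222), e.g.:
#   set_filegroup_attribute(conn, "DATA", "DB2_FG", "redundancy", "high")
# which would trigger a rebalance from two copies to three, as described
# in the preceding paragraph.
```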
Maintaining DB(s) based on storage management metadata
At startup, the ASM instance 112, 132 reads PST files (e.g., PST files 154-162) to determine which addressable storage units (e.g., disks 142-152) are in a disk group (e.g., disk group 140). When a database server instance (e.g., database server instance 108) receives a selection of a database for use, a call is made to the corresponding ASM layer 110. The ASM layer then reads the storage management metadata to determine which file group corresponds to the database. The ASM instance can cache data from any ASM file into volatile memory local to the ASM instance that is using the data to reduce the overhead required to perform future storage management operations. During database operations, the ASM instance reads file group directory files (e.g., file 164) as cached data or from disk to determine which file belongs to which file group and which attributes belong to each file group.
For example, assume that database server instance 108 receives an application request to use a first database. The request is parsed and passed to the corresponding ASM instance 112, and ASM instance 112 reads file group directory file 164 to determine which files of the database are described by file group 170. The attributes of the file group are used to determine storage management attributes such as redundancy: high. When the database server instance 108 receives a DML command that causes a write to be issued, the ASM instance 112 propagates the write to three different copies of the database 170A, 170B, 170C to meet the redundancy requirements.
As another example, assume that database server instance 128 receives an application request to use a second database. The request is parsed and passed to the corresponding ASM instance 132. ASM instance 132 reads file group directory file 164 to determine which files of the database are described by file group 180. The attributes of the file group are used to determine storage management attributes, such as redundancy: mirror. When the database server instance 128 receives a DML command that causes a write to be issued, the ASM instance 132 propagates the write to two different copies of the database 180A, 180B to satisfy the redundancy requirement.
Recovery overview
FIG. 2B is a flowchart illustrating a program flow 250 for recovering the lost data of a failed disk. At step 252, the ASM instance detects a failure in at least one block addressable storage unit of the disk group. Detecting the failure may also include the ASM instance determining that a disk defined in the latest PST file of the disk group is unavailable, as shown at step 254. In some embodiments, the ASM instance checks a standby attribute for the amount of time to wait before starting recovery, as shown at step 256. After waiting the specified amount of time to see whether the disk comes back online at step 258, the ASM instance begins the recovery process. This wait prevents the recovery process from being initiated every time a disk is temporarily taken offline.
At step 260, the ASM instance automatically recovers the missing data of the files in each respective file group according to the respective storage management attributes of each respective file group. Recovering the missing data may also include the ASM instance provisioning free space on the available disks in the disk group, as shown at step 262. Recovering the missing data may also include the ASM instance determining, from the file group directory file, a priority and a power limit for the data of each file group to be recovered, as shown at step 264. Data from the file group with the highest priority level is restored first. The ASM instance then determines, from the ASM directory files of the file group, which disks have the latest version of the missing data. Finally, the ASM instance recovers the missing data of the files in the highest priority file group subject to the power limit attribute of that particular file group. Arrow 270 indicates that step 266 may be performed multiple times. Each time the process repeats, the ASM instance recovers the lost data of the next highest priority file group.
In some embodiments, the ASM instance may additionally prioritize data from files within a file group based on the file types of those files. This involves determining the file type of each stripe of a file to be restored and restoring stripes of higher priority file types before stripes of lower priority file types.
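Putting the priority, file-type, and power limit attributes together, recovery flow 250 can be summarized in the following sketch; the attribute names and object fields are illustrative assumptions:

```python
# Sketch of recovery flow 250: file groups are restored highest priority
# first (steps 264-266), files within a group in file-type order, and
# each group's I/O is paced by its power limit. Names are assumptions.
FILE_TYPE_RANK = {"control": 0, "redo": 1, "data": 2}

def recover_disk_group(file_groups, recover_stripe):
    # Priority 1 file groups are fully restored before priority 2, etc.
    for fg in sorted(file_groups, key=lambda g: g.attributes["priority"]):
        stripes = [s for f in fg.files for s in f.lost_stripes]
        # Within a file group, control files and redo logs precede
        # data files, per the priority discussion above.
        stripes.sort(key=lambda s: FILE_TYPE_RANK[s.file_type])
        for stripe in stripes:
            # Copy each stripe from the most current surviving mirror,
            # pacing I/O according to this group's power limit.
            recover_stripe(stripe, fg.attributes["power_limit"])
```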
Example of recovery after a disk failure
FIG. 3A is a block diagram illustrating the system architecture of cluster 100 after a disk failure. The system architecture 300 after the disk failure reflects the loss of disk 150, and with it the PST file 162 of disk group 140, the mirror 170C-1 of the first database, and the mirror 180B-2 of the second database. After the disk failure, an ASM instance (e.g., ASM instance 112) scans PST files 154, 156, 158, and 160 to determine which disk failed and whether the PST files are current. The ASM instance then scans the copies of the file directory (not shown) to determine which copies to mirror from when recovering the lost data and rebalancing. After reviewing the directory files (not shown), the ASM instance determines that the mirror 170C-1 of the first database and the mirror 180B-2 of the second database need to be restored.
FIG. 3B is a block diagram illustrating the system architecture of cluster 100 after recovery and rebalancing are initiated. The system architecture after recovery and rebalancing have been initiated includes using extra space on disk 152 to reconstruct the lost copies of the data from disk 150:
After determining from the PST file that the file 154 is up-to-date, the PST file 322 is reconstructed from the PST file 154 of the disk group 140;
after determining from the file directory that copy 170A-1 is up-to-date, the mirror 370C-1 of the first database begins to be rebuilt from the primary copy 170A-1; and
after determining from the file directory that the primary copy 180A-2 is up-to-date, the mirror 380B-2 of the second database has not yet begun to be rebuilt, but it will be rebuilt from the primary copy 180A-2.
In this example, file group 170 has a priority attribute that is ranked higher than the file group priority of file group 180. For example,
the file group 170 may have a priority attribute with the rank "priority 1" among the attributes of the files 164; and
the file group 180 may have a priority attribute with the rank "priority 2" among the attributes of the file 164.
The effect of the higher priority ranking is that mirror 370C-1 of file group 170 is completed before mirror 380B-2 of file group 180.
In this example, the file types have the following priorities:
a) ASM metadata
b) Control files and redo logs in high priority file groups
c) Data files in high priority file groups
d) Control files and redo logs in low priority file groups
e) Data files in low priority file groups
Thus, in this example, the files are completed in the following priority order: file group directory file 364C-1 > database control file 372C-1 > database redo file 376C-1. The database data file is not yet complete.
FIG. 3C is a block diagram illustrating the system architecture after recovery and rebalancing have been completed. The system architecture 340 after recovery and rebalancing have been completed includes the complete PST file 322 of disk group 140; a finished mirror 370C-1 of the primary 170A-1, with file group directory file 364C-1, database control file 372C-1, database redo file 376C-1, and database data file 374C-1; and a completed mirror 380B-2, with database control file 382B-2, database data file 384B-2, and database redo file 386B-2.
If another disk is added to the architecture 340, rebalancing will begin again, and the system will eventually look like the original state 100 of the cluster system architecture.
Overview of hardware
According to one embodiment, the techniques described herein are implemented by one or more special-purpose computing devices. A special purpose computing device may be hardwired to perform the techniques, or may include digital electronic devices such as one or more Application Specific Integrated Circuits (ASICs) or Field Programmable Gate Arrays (FPGAs) permanently programmed to perform the techniques, or may include one or more general purpose hardware processors programmed to perform the techniques pursuant to program instructions in firmware, memory, other storage, or a combination. Such special purpose computing devices may also incorporate custom hardwired logic, ASICs or FPGAs, and custom programming to implement the techniques. A special-purpose computing device may be a desktop computer system, portable computer system, handheld device, networked device, or any other device that incorporates hardwired and/or program logic to implement the techniques.
For example, FIG. 4 is a block diagram that illustrates a computer system 400 upon which an embodiment of the invention may be implemented. Computer system 400 includes a bus 402 or other communication mechanism for communicating information, and a hardware processor 404 coupled with bus 402 for processing information. Hardware processor 404 may be, for example, a general purpose microprocessor.
Computer system 400 also includes a main memory 406, such as a Random Access Memory (RAM) or other dynamic storage device, coupled to bus 402 for storing information and instructions to be executed by processor 404. Main memory 406 also may be used for storing temporary variables or other intermediate information during execution of instructions to be executed by processor 404. Such instructions, when stored in a non-transitory storage medium accessible to processor 404, make computer system 400 a special-purpose machine that is customized to perform the operations specified in the instructions.
Computer system 400 also includes a Read Only Memory (ROM) 408 or other static storage device coupled to bus 402 for storing static information and instructions for processor 404. A storage device 410, such as a magnetic disk, optical disk, or solid state drive, is provided and coupled to bus 402 for storing information and instructions.
Computer system 400 may be coupled via bus 402 to a display 412, such as a Cathode Ray Tube (CRT), for displaying information to a computer user. An input device 414, including alphanumeric and other keys, is coupled to bus 402 for communicating information and command selections to processor 404. Another type of user input device is cursor control 416, such as a mouse, a trackball, or cursor direction keys for communicating direction information and command selections to processor 404 and for controlling cursor movement on display 412. Such input devices typically have two degrees of freedom in two axes, a first axis (e.g., x) and a second axis (e.g., y), that allows the device to specify positions in a plane.
Computer system 400 may implement the techniques described herein using custom hardwired logic, one or more ASICs or FPGAs, firmware, and/or program logic that, in combination with the computer system, causes computer system 400 to become or programs computer system 400 as a special purpose machine. According to one embodiment, the techniques herein are performed by computer system 400 in response to processor 404 executing one or more sequences of one or more instructions contained in main memory 406. Such instructions may be read into main memory 406 from another storage medium, such as storage device 410. Execution of the sequences of instructions contained in main memory 406 causes processor 404 to perform the process steps described herein. In alternative embodiments, hard-wired circuitry may be used in place of or in combination with software instructions.
The term "storage medium" as used herein refers to any non-transitory medium that stores data and/or instructions that cause a machine to operate in a specified manner. Such storage media may include non-volatile media and/or volatile media. Non-volatile media includes, for example, optical, magnetic disks, or solid-state drives, such as storage device 410. Volatile media includes dynamic memory, such as main memory 406. Common forms of storage media include, for example, a floppy disk, a flexible disk, hard disk, solid state drive, magnetic tape, or any other magnetic data storage medium, a CD-ROM, any other optical data storage medium, any physical medium with patterns of holes, a RAM, a PROM, and EPROM, a FLASH-EPROM, NVRAM, any other memory chip or cartridge.
Storage media is distinct from but can be used in conjunction with transmission media. Transmission media participate in the transfer of information between storage media. For example, transmission media includes coaxial cables, copper wire and fiber optics, including the wires that comprise bus 402. Transmission media can also take the form of acoustic or light waves, such as those generated during radio-wave and infrared data communications.
Various forms of media may be involved in carrying one or more sequences of one or more instructions to processor 404 for execution. For example, the instructions may initially be carried on a magnetic disk or solid state drive of a remote computer. The remote computer can load the instructions into its dynamic memory and send the instructions over a telephone line using a modem. A modem local to computer system 400 can receive the data on the telephone line and use an infra-red transmitter to convert the data to an infra-red signal. An infra-red detector can receive the data carried in the infra-red signal and appropriate circuitry can place the data on bus 402. Bus 402 carries the data to main memory 406, from which main memory 406 processor 404 retrieves and executes the instructions. The instructions received by main memory 406 may optionally be stored on storage device 410 either before or after execution by processor 404.
Computer system 400 also includes a communication interface 418 coupled to bus 402. Communication interface 418 provides a two-way data communication coupling to a network link 420, where network link 420 is connected to a local network 422. For example, communication interface 418 may be an Integrated Services Digital Network (ISDN) card, cable modem, satellite modem, or a modem to provide a data communication connection to a corresponding type of telephone line. As another example, communication interface 418 may be a Local Area Network (LAN) card to provide a data communication connection to a compatible LAN. Wireless links may also be implemented. In any such implementation, communication interface 418 sends and receives electrical, electromagnetic or optical signals that carry digital signal streams representing various types of information.
Network link 420 typically provides data communication through one or more networks to other data devices. For example, network link 420 may provide a connection through local network 422 to a host computer 424 or to data equipment operated by an Internet Service Provider (ISP) 426. ISP 426 in turn provides data communication services through the global packet data communication network now commonly referred to as the "internet" 428. Local network 422 and Internet 428 both use electrical, electromagnetic or optical signals that carry digital data streams. The signals through the various networks and the signals on network link 420 and through communication interface 418, which carry the digital data to and from computer system 400, are exemplary forms of transmission media.
Computer system 400 can send messages and receive data, including program code, through the network(s), network link 420 and communication interface 418. In the Internet example, a server 430 might transmit a requested code for an application program through Internet 428, ISP 426, local network 422 and communication interface 418.
The received code may be executed by processor 404 as it is received, and/or stored in storage device 410, or other non-volatile storage for later execution.
In the foregoing specification, embodiments of the invention have been described with reference to numerous specific details that may vary from implementation to implementation. The specification and drawings are, accordingly, to be regarded in an illustrative rather than a restrictive sense. The sole and exclusive indicator of the scope of the invention, and what is intended by the applicants to be the scope of the invention, is the literal and equivalent scope of the set of claims that issue from this application, in the specific form in which such claims issue, including any subsequent correction.

Claims (23)

1. A method, comprising:
associating a plurality of block addressable storage units as part of a disk group configured to store files of a plurality of databases;
storing a plurality of file groups in the disk group, each file group of the plurality of file groups having files of a single respective database of the plurality of databases;
for each file group of the plurality of file groups, storing, in association with the file group, a storage management attribute that describes how to manage storage of the file group, wherein the storage management attribute is defined at a database level, values of the storage management attributes of different file groups are different, and the values of the storage management attributes of the file groups are specified in a data definition language, DDL, statement;
wherein a plurality of file groups are grouped into a quota group, the quota group defining further storage management attributes for a group of file groups, an
Wherein the method is performed by one or more computing devices.
2. The method of claim 1, wherein the storage management attributes comprise specific storage management attributes, wherein storing the storage management attributes for each file group comprises:
storing a first value of a particular storage management attribute of a first file group;
storing a second value of the particular storage management attribute of the second file group;
wherein the first value is different from the second value.
3. The method of claim 2, wherein the particular storage management attribute is redundancy, the method further comprising:
automatically maintaining a first number of copies of a first database of the plurality of databases on block addressable storage units of the disk group according to a first redundancy attribute of the first file group having files of the first database;
automatically maintaining a second number of copies of a second database of the plurality of databases on block addressable storage units of the disk group according to a second redundancy attribute of the second file group having files of the second database;
wherein the first number of copies is different from the second number of copies.
4. The method of claim 2, further comprising:
striping each file of the first file group across a first subset of the plurality of block addressable storage units of the disk group;
striping each file of the second file group across a second subset of the plurality of block addressable storage units of the disk group;
wherein the first subset and the second subset intersect with respect to at least one block addressable storage unit such that a first set of stripes of files from the first group of files on the at least one block addressable storage unit have different storage management attributes than a second set of stripes of files from the second group of files.
5. The method of claim 1, further comprising:
detecting a failure in at least one of the plurality of block addressable storage units of the disk group;
automatically recovering a first set of missing data from files in a particular file group according to particular attributes of the particular file group.
6. The method of claim 5, wherein the particular attribute of the particular set of files is a priority, the method further comprising:
based on the particular attribute, the first set of missing data is recovered from files in another particular set of files before recovering a second set of missing data from files in the particular set of files.
7. The method of claim 6, further comprising:
determining a file type for each stripe in the first set of missing data based on the replica of the first set of missing data;
the first set of stripes of the first particular file type is restored before the second set of stripes of the second particular file type is restored.
8. One or more non-transitory computer-readable media storing one or more sequences of instructions which, when executed by one or more processors, cause performance of:
associating a plurality of block addressable storage units as part of a disk group configured to store files of a plurality of databases;
storing a plurality of file groups in the disk group, each file group of the plurality of file groups having files of a single respective database of the plurality of databases;
for each file group of the plurality of file groups, storing, in association with the file group, a storage management attribute that describes how to manage storage of the file group, wherein the storage management attribute is defined at a database level, values of the storage management attributes of different file groups are different, and the values of the storage management attributes of the file groups are specified in a data definition language, DDL, statement;
wherein a plurality of file groups are grouped into a quota group that defines further storage management attributes for a group of file groups.
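
The claim specifies that attribute values are set through DDL statements and that file groups roll up into quota groups carrying further attributes. The snippet below sketches what such DDL might look like, loosely modeled on Oracle ASM flex disk group commands; the exact statement syntax is an assumption and is not quoted from the patent:

```python
# Hypothetical DDL; every object name and the statement syntax are
# illustrative assumptions.
ddl = [
    # A file group tied to one database, with its own attribute value:
    "ALTER DISKGROUP data ADD FILEGROUP fg_sales DATABASE salesdb "
    "SET 'redundancy' = 'high'",
    # A quota group defines a further attribute (a shared space quota)
    # for a set of file groups:
    "ALTER DISKGROUP data ADD QUOTAGROUP qg_prod SET 'quota' = '500G'",
    "ALTER DISKGROUP data MODIFY FILEGROUP fg_sales "
    "SET 'quota_group' = 'qg_prod'",
]
for statement in ddl:
    print(statement)  # a real system would execute these via its SQL layer
```
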
9. The one or more non-transitory computer-readable media of claim 8, wherein the storage management attributes comprise a particular storage management attribute; wherein storing the storage management attributes for each file group comprises:
storing a first value of the particular storage management attribute of a first file group;
storing a second value of the particular storage management attribute of a second file group;
wherein the first value is different from the second value.
10. The one or more non-transitory computer-readable media of claim 9, wherein the particular storage management attribute is redundancy; wherein the one or more non-transitory computer-readable media store instructions that, when executed by the one or more processors, further cause:
automatically maintaining a first number of copies of a first database of the plurality of databases on block addressable storage units of the disk group according to a first redundancy attribute of the first file group having files of the first database;
automatically maintaining a second number of copies of a second database of the plurality of databases on block addressable storage units of the disk group according to a second redundancy attribute of the second file group having files of the second database;
wherein the first number of copies is different from the second number of copies.
11. The one or more non-transitory computer-readable media of claim 9, wherein the one or more non-transitory computer-readable media store instructions that, when executed by the one or more processors, further cause:
striping each file of the first file group across a first subset of the plurality of block addressable storage units of the disk group;
striping each file of the second file group across a second subset of the plurality of block addressable storage units of the disk group;
wherein the first subset and the second subset intersect with respect to at least one block addressable storage unit, such that a first set of stripes from files of the first file group on the at least one block addressable storage unit has different storage management attributes than a second set of stripes from files of the second file group.
12. The one or more non-transitory computer-readable media of claim 8, wherein the one or more non-transitory computer-readable media store instructions that, when executed by the one or more processors, further cause:
detecting a failure in at least one of the plurality of block addressable storage units of the disk group;
automatically recovering a first set of missing data from files in a particular file group according to a particular attribute of the particular file group.
13. The one or more non-transitory computer-readable media of claim 12, wherein the particular attribute of the particular file group is a priority; wherein the one or more non-transitory computer-readable media store instructions that, when executed by the one or more processors, further cause:
based on the particular attribute, recovering the first set of missing data from files in the particular file group before recovering a second set of missing data from files in another file group.
14. The one or more non-transitory computer-readable media of claim 13, wherein the one or more non-transitory computer-readable media store instructions that, when executed by the one or more processors, further cause:
determining a file type for each stripe in the first set of missing data based on a replica of the first set of missing data;
restoring a first set of stripes of a first particular file type before restoring a second set of stripes of a second particular file type.
15. A storage management system comprising one or more computing devices configured to:
associate a plurality of block addressable storage units as part of a disk group configured to store files of a plurality of databases;
store a plurality of file groups in the disk group, each file group of the plurality of file groups having files of a single respective database of the plurality of databases;
for each file group of the plurality of file groups, store, in association with the file group, a storage management attribute that describes how to manage storage of the file group, wherein the storage management attribute is defined at a database level, values of the storage management attributes of different file groups are different, and the values of the storage management attributes of the file groups are specified in a data definition language (DDL) statement;
wherein a plurality of file groups are grouped into a quota group that defines further storage management attributes for a group of file groups.
16. The storage management system of claim 15, wherein the storage management attributes comprise a particular storage management attribute; wherein the one or more computing devices are further configured to:
store a first value of the particular storage management attribute of a first file group;
store a second value of the particular storage management attribute of a second file group;
wherein the first value is different from the second value.
17. The storage management system of claim 16, wherein the particular storage management attribute is redundancy; wherein the one or more computing devices are further configured to:
automatically maintain a first number of copies of a first database of the plurality of databases on block addressable storage units of the disk group according to a first redundancy attribute of the first file group having files of the first database;
automatically maintain a second number of copies of a second database of the plurality of databases on block addressable storage units of the disk group according to a second redundancy attribute of the second file group having files of the second database;
wherein the first number of copies is different from the second number of copies.
18. The storage management system of claim 16, wherein the one or more computing devices are further configured to:
stripe each file of the first file group across a first subset of the plurality of block addressable storage units of the disk group;
stripe each file of the second file group across a second subset of the plurality of block addressable storage units of the disk group;
wherein the first subset and the second subset intersect with respect to at least one block addressable storage unit, such that a first set of stripes from files of the first file group on the at least one block addressable storage unit has different storage management attributes than a second set of stripes from files of the second file group.
19. The storage management system of claim 15, wherein the one or more computing devices are further configured to:
detect a failure in at least one of the plurality of block addressable storage units of the disk group;
automatically recover a first set of missing data from files in a particular file group according to a particular attribute of the particular file group.
20. The storage management system of claim 19, wherein the particular attribute of the particular file group is a priority; wherein the one or more computing devices are further configured to:
based on the particular attribute, recover the first set of missing data from files in the particular file group before recovering a second set of missing data from files in another file group.
21. The storage management system of claim 20, wherein the one or more computing devices are further configured to:
determine a file type for each stripe in the first set of missing data based on a replica of the first set of missing data;
restore a first set of stripes of a first particular file type before restoring a second set of stripes of a second particular file type.
22. An apparatus comprising means for performing the method of any one of claims 1-7.
23. An apparatus, comprising:
a processor; and
a memory coupled to the processor and including instructions stored thereon that, when executed by the processor, cause the processor to perform the method of any one of claims 1-7.
CN201680071276.6A 2015-10-15 2016-10-17 Database level automatic storage management Active CN108369588B (en)

Applications Claiming Priority (3)

Application Number Priority Date Filing Date Title
US201562242086P 2015-10-15 2015-10-15
US62/242,086 2015-10-15
PCT/US2016/057428 WO2017066808A1 (en) 2015-10-15 2016-10-17 Database-level automatic storage management

Publications (2)

Publication Number Publication Date
CN108369588A (en) 2018-08-03
CN108369588B (en) 2023-01-20

Family

ID=57394660

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201680071276.6A Active CN108369588B (en) 2015-10-15 2016-10-17 Database level automatic storage management

Country Status (3)

Country Link
US (2) US11016865B2 (en)
CN (1) CN108369588B (en)
WO (1) WO2017066808A1 (en)

Families Citing this family (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US11176001B2 (en) * 2018-06-08 2021-11-16 Google Llc Automated backup and restore of a disk group
CN109067865A * 2018-07-26 2018-12-21 Zhengzhou Yunhai Information Technology Co., Ltd. Method and device for establishing a storage system partnership
US10810248B2 (en) * 2018-09-24 2020-10-20 Salesforce.Com, Inc. Upgrading a database from a first version to a second version
US11150991B2 (en) * 2020-01-15 2021-10-19 EMC IP Holding Company LLC Dynamically adjusting redundancy levels of storage stripes
US11372734B1 (en) * 2021-01-26 2022-06-28 International Business Machines Corporation Database recovery based on workload priorities
US12008014B2 (en) * 2021-07-30 2024-06-11 Oracle International Corporation Data guard at PDB (pluggable database) level

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101515273A * 2001-08-03 2009-08-26 Isilon Systems, Inc. Systems and methods providing metadata for tracking of information on a distributed file system of storage devices
CN102165448A * 2008-09-30 2011-08-24 Microsoft Corporation Storage tiers for database server system
CN104951540A * 2015-06-19 2015-09-30 Alibaba Group Holding Ltd. File processing method and device

Family Cites Families (13)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US5403639A (en) * 1992-09-02 1995-04-04 Storage Technology Corporation File server having snapshot application data groups
US6029168A (en) * 1998-01-23 2000-02-22 Tricord Systems, Inc. Decentralized file mapping in a striped network file system in a distributed computing environment
US7383288B2 (en) * 2001-01-11 2008-06-03 Attune Systems, Inc. Metadata based file switch and switched file system
US7146524B2 (en) * 2001-08-03 2006-12-05 Isilon Systems, Inc. Systems and methods for providing a distributed file system incorporating a virtual hot spare
US7653832B2 (en) * 2006-05-08 2010-01-26 Emc Corporation Storage array virtualization using a storage block mapping protocol client and server
JP4374364B2 (en) * 2006-10-17 2009-12-02 株式会社日立製作所 Storage apparatus, storage system, and storage apparatus power control method
US20080183963A1 (en) * 2007-01-31 2008-07-31 International Business Machines Corporation System, Method, And Service For Providing A Generic RAID Engine And Optimizer
WO2008117295A2 (en) * 2007-03-28 2008-10-02 Unison Play Ltd. Distributed storage management
US8150850B2 (en) * 2008-01-07 2012-04-03 Akiban Technologies, Inc. Multiple dimensioned database architecture
JP2012103833A (en) * 2010-11-09 2012-05-31 Sony Corp Information processing apparatus, electronic equipment, information processing method, and program
US8788505B2 (en) * 2011-04-27 2014-07-22 Verisign, Inc Systems and methods for a cache-sensitive index using partial keys
US9552248B2 (en) * 2014-12-11 2017-01-24 Pure Storage, Inc. Cloud alert to replica
US11017014B2 (en) * 2015-05-22 2021-05-25 Box, Inc. Using shared metadata to preserve logical associations between files when the files are physically stored in dynamically-determined cloud-based storage structures

Also Published As

Publication number Publication date
US11016865B2 (en) 2021-05-25
WO2017066808A1 (en) 2017-04-20
US11847034B2 (en) 2023-12-19
US20170109246A1 (en) 2017-04-20
CN108369588A (en) 2018-08-03
US20210240585A1 (en) 2021-08-05

Similar Documents

Publication Publication Date Title
CN108369588B (en) Database level automatic storage management
CN109906448B (en) Method, apparatus, and medium for facilitating operations on pluggable databases
US10635658B2 (en) Asynchronous shared application upgrade
US10572551B2 (en) Application containers in container databases
US11550667B2 (en) Pluggable database archive
US8103628B2 (en) Directed placement of data in a redundant data storage system
US10642837B2 (en) Relocating derived cache during data rebalance to maintain application performance
US9146934B2 (en) Reduced disk space standby
US10789131B2 (en) Transportable backups for pluggable database relocation
US8204892B2 (en) Performance boost for sort operations
US8086810B2 (en) Rapid defragmentation of storage volumes
EP3365807A1 (en) Application containers for container databases
US10489356B1 (en) Truncate and append database operation
US7200625B2 (en) System and method to enhance availability of a relational database
CN111352766A (en) Database double-activity implementation method and device
US10936558B2 (en) Content-based data migration
US10353920B2 (en) Efficient mirror data re-sync
CN115481198A (en) Data table synchronization method and device, computer equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant