CN1829974A

CN1829974A - Parallel recovery by non-failed nodes

Info

Publication number: CN1829974A
Application number: CN200480021752.0A
Authority: CN
Inventors: Roger Bamford; Sashikanth Chandrasekaran; Angelo Pruscino
Original assignee: Oracle International Corp
Current assignee: Oracle International Corp
Priority date: 2003-08-01
Filing date: 2004-07-28
Publication date: 2006-09-06
Anticipated expiration: 2024-07-28
Also published as: CN1829988A; CN100565460C; CN1829974B; CN100449539C; CN1829962A; CN100429622C; CN1829961A

Abstract

Various techniques are described for improving the performance of a shared-nothing database system in which at least two of the nodes that are running the shared-nothing database system have shared access to a disk. Specifically, techniques are provided for recovering the data owned by a failed node using multiple recovery nodes operating in parallel. The data owned by a failed node is reassigned to recovery nodes that have access to the shared disk on which the data resides. The recovery logs of the failed node are read by the recovery nodes, or by a coordinator process that distributes the recovery tasks to the recovery nodes.

Description

Parallel recovery by non-failed nodes

Technical field

The present invention relates to techniques for managing data in a shared-nothing database system running on shared disk hardware.

Background

Multi-processing computer systems typically fall into three categories: shared everything systems, shared disk systems, and shared-nothing systems. In shared everything systems, processes on all processors have direct access to all volatile memory devices (hereinafter generally referred to as "memory") and to all non-volatile memory devices (hereinafter generally referred to as "disks") in the system. Consequently, a high degree of wiring between the various computer components is required to provide shared everything functionality. In addition, there are scalability limits to shared everything architectures.

In shared disk systems, processors and memories are grouped into nodes. Each node in a shared disk system may itself constitute a shared everything system that includes multiple processors and multiple memories. Processes on all processors can access all disks in the system, but only the processes on processors that belong to a particular node can directly access the memory within that node. Shared disk systems generally require less wiring than shared everything systems. Shared disk systems also adapt easily to unbalanced workload conditions because all nodes can access all data. However, shared disk systems are susceptible to coherence overhead. For example, if a first node has modified data and a second node wants to read or modify the same data, then various steps may have to be taken to ensure that the correct version of the data is provided to the second node.

In shared-nothing systems, all processors, memories and disks are grouped into nodes. In shared-nothing systems, as in shared disk systems, each node may itself constitute a shared everything system or a shared disk system. Only the processes running on a particular node can directly access the memories and disks within that node. Of the three general types of multi-processing systems, shared-nothing systems typically require the least amount of wiring between the various system components. However, shared-nothing systems are the most susceptible to unbalanced workload conditions. For example, all of the data to be accessed during a particular task may reside on the disks of a particular node. Consequently, only processes running within that node can be used to perform the work granule, even though processes on other nodes remain idle.

Databases that run on multi-node systems typically fall into two categories: shared disk databases and shared-nothing databases.

Shared disk databases

A shared disk database coordinates work based on the assumption that all data managed by the database system is visible to all processing nodes that are available to the database system. Consequently, in a shared disk database, the server may assign any work to a process on any node, regardless of the location of the disk that contains the data that will be accessed during the work.

Because all nodes have access to the same data, and each node has its own private cache, numerous versions of the same data item may reside in the caches of any number of the nodes. Unfortunately, this means that when one node requires a particular version of a particular data item, the node must coordinate with the other nodes to have that version of the data item shipped to the requesting node. Thus, shared disk databases are said to operate on the concept of "data shipping", where data must be shipped to the node that has been assigned to work on the data.

Such data shipping requests may result in "pings". Specifically, a ping occurs when a copy of a data item that is needed by one node resides in the cache of another node. A ping may require the data item to be written to disk and then read from disk. Performing the disk operations necessitated by pings can significantly reduce the performance of the database system.

Shared disk databases may be run on both shared-nothing and shared disk computer systems. To run a shared disk database on a shared-nothing computer system, software support may be added to the operating system, or additional hardware may be provided, to allow processes to access remote disks.

Shared-nothing databases

A shared-nothing database assumes that a process can only access data if the data is contained on a disk that belongs to the same node as the process. Consequently, if a particular node wants an operation to be performed on a data item that is owned by another node, the particular node must send a request to the other node for the other node to perform the operation. Thus, instead of shipping data between nodes, shared-nothing databases are said to perform "function shipping".

Because any given piece of data is owned by only one node, only that node (the "owner" of the data) will ever have a copy of the data in its cache. Consequently, there is no need for the type of cache coherency mechanism that is required in shared disk database systems. Further, shared-nothing systems do not suffer the performance penalties associated with pings, since a node that owns a data item will not be asked to save a cached version of the data item to disk so that another node can then load the data item into its cache.

Shared-nothing databases may be run on both shared disk and shared-nothing multi-processing systems. To run a shared-nothing database on a shared disk machine, a mechanism may be provided for partitioning the database and assigning ownership of each partition to a particular node.

The fact that only the owning node may operate on a piece of data means that the workload in a shared-nothing database may become severely unbalanced. For example, in a system of ten nodes, 90% of all work requests may involve data that is owned by one of the nodes. Consequently, that one node is overworked while the computational resources of the other nodes are underutilized. To "rebalance" the workload, a shared-nothing database may be taken offline, and the data (and ownership thereof) may be redistributed among the nodes. However, this process involves moving potentially huge amounts of data, and may only temporarily solve the workload imbalance.

Failures in database systems

A database server failure can occur when a problem arises that prevents a database server from continuing work. Database server failures may result from hardware problems, such as a power outage, or from software problems, such as an operating system or database system crash. Database server failures can also occur expectedly, for example, when a SHUTDOWN ABORT or a STARTUP FORCE statement is issued to an Oracle database server.

Due to the way in which database updates are performed on data files in some database systems, at any given point in time a data file may contain some data blocks that (1) have been tentatively modified by uncommitted transactions and/or (2) do not yet reflect updates performed by committed transactions. Thus, a database recovery operation must be performed after a database server failure to restore the database to the transaction-consistent state it possessed just prior to the failure. In a transaction-consistent state, a database reflects all changes made by committed transactions and none of the changes made by uncommitted transactions.

A typical database system performs several steps during a database server recovery. First, the database system "rolls forward", or reapplies to the data files all of the changes recorded in the redo log. Rolling forward proceeds through as many redo log files as necessary to bring the database forward in time to reflect all of the changes made prior to the time of the crash. Rolling forward usually includes applying the changes in online redo log files, and may also include applying changes recorded in archived redo log files (online redo files that are archived before being reused). After rolling forward, the data blocks contain all committed changes, as well as any uncommitted changes that were recorded in the redo log prior to the crash.

Rollback segments include records for undoing uncommitted changes that remain after the roll-forward operation. In database recovery, the information contained in the rollback segments is used to undo the changes made by transactions that were uncommitted at the time of the crash. The process of undoing changes made by uncommitted transactions is referred to as "rolling back" the transactions.
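The two recovery phases described above can be illustrated with a minimal sketch. It is a toy model, not the behavior of any particular database server: the record layout `(txn_id, block_id, old_value, new_value)`, the function name `recover`, and the use of the old value stored in each record as the undo information are all assumptions made for illustration.

```python
def recover(data_files, redo_log, committed_txns):
    """Roll forward every logged change, then roll back uncommitted ones.

    Hypothetical record layout: (txn_id, block_id, old_value, new_value).
    """
    blocks = dict(data_files)

    # Roll forward: reapply all logged changes in log order, whether or not
    # the transaction that made them committed.
    for _txn, block, _old, new in redo_log:
        blocks[block] = new

    # Roll back: undo the changes of uncommitted transactions, newest first.
    for txn, block, old, _new in reversed(redo_log):
        if txn not in committed_txns:
            blocks[block] = old

    return blocks


data_files = {"A": 1, "B": 2}
redo_log = [
    ("t1", "A", 1, 10),   # t1 committed before the crash
    ("t2", "B", 2, 20),   # t2 was still open at the crash
    ("t2", "A", 10, 30),
]
recovered = recover(data_files, redo_log, committed_txns={"t1"})
```

In this sketch only the change made by the committed transaction `t1` survives; both changes made by `t2` are undone.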

The techniques described herein are not limited to environments in which rollback segments are used for undoing transactions. For example, in some database environments, the undo and redo are written to a single sequential log. In such environments, recovery may be performed based on the contents of the single log, rather than on distinct redo and undo logs.

Failures in shared-nothing database systems

In any multi-node computer system, one or more nodes may fail while one or more other nodes remain functional. In a shared-nothing database system, failure of a node typically makes the data items owned by the failed node unavailable. Before those data items can be accessed again, a recovery operation must be performed on them. The faster the recovery operation is performed, the more quickly the data items become available.

In a shared-nothing database system, recovery operations may be performed using either no partitioning or pre-failure partitioning. When no partitioning is used, a single non-failed node assumes ownership of all data items previously owned by the failed node. The non-failed node then proceeds to perform the entire recovery operation itself. Because the no-partitioning approach uses the processing power of only one active node, recovery takes much longer than it would if the recovery operation were shared across many active nodes. This is how recovery is typically done in shared-nothing databases, where the recovering node needs to have access to the data of the failed node. To simplify the hardware configuration, a "buddy" system is typically used, in which the nodes are divided into pairs of nodes, each of which has access to the other's data and is responsible for recovering the other in the event of a failure.

According to the pre-failure partitioning approach, the data owned by the failed node is partitioned into distinct shared-nothing database fragments prior to the failure. After the failure, each of the distinct fragments is assigned to a different non-failed node for recovery. Because the recovery operation is spread among many nodes, recovery can be completed faster than if it were performed by only one node. However, it is rarely known exactly when a node will fail. Thus, for a node to be recovered using the pre-failure partitioning approach, the partitioning, which typically involves dividing the CPU and main memory of the node among the database fragments, is usually performed long before any failure actually occurs. Unfortunately, while the node is partitioned in this way, the steady-state runtime performance of the node is reduced. Various factors lead to such a performance reduction. For example, the resources of each physical node may be underutilized. Although multiple partitions are owned by the same physical node, the partitions cannot share memory for the buffer pool, package cache, and so on. This causes underutilization because a single piece of memory can be better utilized than fragmented pieces of memory. In addition, the interprocess communication for a given workload increases as the number of partitions increases. For example, an application that scales to four partitions may not scale to twelve partitions. However, using the pre-failure partitioning approach for parallel recovery after a failure, twelve partitions may be required.

Brief description of the drawings

The present invention is illustrated by way of example, and not by way of limitation, in the figures of the accompanying drawings, in which like reference numerals refer to similar elements and in which:

Fig. 1 is a block diagram illustrating a cluster that includes two shared disk subsystems, according to an embodiment of the invention; and

Fig. 2 is a block diagram of a computer system on which embodiments of the invention may be implemented.

Detailed description

Various techniques are described hereafter for improving the performance of a shared-nothing database system that includes a shared disk storage system. In the following description, for the purposes of explanation, numerous specific details are set forth in order to provide a thorough understanding of the present invention. It will be apparent, however, that the present invention may be practiced without these specific details. In other instances, well-known structures and devices are shown in block diagram form in order to avoid unnecessarily obscuring the present invention.

Functional overview

Various techniques are described hereafter for improving the performance of a shared-nothing database system in which at least two of the nodes that are running the shared-nothing database system have shared access to a disk. As dictated by the shared-nothing architecture of the database system, each piece of data is still owned by only one node at any given time. However, the fact that at least some of the nodes that are running the shared-nothing database system have shared access to a disk is exploited to rebalance and recover the shared-nothing database system more efficiently.

Specifically, techniques are provided for recovering the data owned by a failed node using multiple recovery nodes operating in parallel. The data owned by the failed node is reassigned to recovery nodes that have access to the shared disk on which the data resides. The recovery logs of the failed node are read by the recovery nodes, or by a coordinator process that distributes the recovery tasks to the recovery nodes.

Exemplary cluster that includes shared disk systems

Fig. 1 is a block diagram illustrating a cluster 100 on which embodiments of the invention may be implemented. Cluster 100 includes five nodes 102, 104, 106, 108 and 110 that are coupled by an interconnect 130 that allows the nodes to communicate with each other. Cluster 100 includes two disks 150 and 152. Nodes 102, 104 and 106 have access to disk 150, and nodes 108 and 110 have access to disk 152. Thus, the subsystem that includes nodes 102, 104 and 106 and disk 150 constitutes a first shared disk system, while the subsystem that includes nodes 108 and 110 and disk 152 constitutes a second shared disk system.

Cluster 100 is an example of a relatively simple system, with only two shared disk subsystems and no overlapping membership between the shared disk subsystems. Actual systems may be much more complex than cluster 100, with hundreds of nodes, hundreds of shared disks, and many-to-many relationships between the nodes and the shared disks. In such a system, a single node that has access to many disks may, for example, be a member of several distinct shared disk subsystems, where each shared disk subsystem includes one of the shared disks and all nodes that have access to that shared disk.

Shared-nothing database on a shared disk system

For the purpose of illustration, it shall be assumed that a shared-nothing database system is running on cluster 100, where the database managed by the shared-nothing database system is stored on disks 150 and 152. Based on the shared-nothing nature of the database system, the data may be segregated into five groups or partitions 112, 114, 116, 118 and 120. Each of the partitions is assigned to a corresponding node. The node assigned to a partition is considered to be the exclusive owner of all data that resides in that partition. In the present example, nodes 102, 104, 106, 108 and 110 respectively own partitions 112, 114, 116, 118 and 120. The partitions 112, 114 and 116 that are owned by the nodes that have access to disk 150 (nodes 102, 104 and 106) are stored on disk 150. Similarly, the partitions 118 and 120 that are owned by the nodes that have access to disk 152 (nodes 108 and 110) are stored on disk 152.

As dictated by the shared-nothing nature of the database system running on cluster 100, any piece of data is owned by at most one node at any given time. In addition, access to the shared data is coordinated by function shipping. For example, in the context of a database system that supports the SQL language, a node that does not own a particular piece of data may cause an operation to be performed on that data by forwarding fragments of SQL statements to the node that does own the piece of data.

Ownership map

To perform function shipping efficiently, all nodes need to know which nodes own which data. Accordingly, an ownership map is established, where the ownership map indicates the data-to-node ownership assignments. During runtime, the various nodes consult the ownership map to route SQL fragments to the correct nodes.
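The routing role of the ownership map can be sketched as follows. The node and partition numbers follow Fig. 1, but the `Node` class, the `ship` function and the flat partition-to-node dictionary are hypothetical names introduced only for this illustration; a real system would ship the fragment over the interconnect rather than call a local method.

```python
class Node:
    """Stand-in for a database node that executes shipped SQL fragments."""

    def __init__(self, name):
        self.name = name
        self.executed = []

    def execute(self, fragment):
        # Record the fragment instead of running real SQL.
        self.executed.append(fragment)
        return f"node {self.name} executed: {fragment}"


def ship(fragment, partition, ownership_map, nodes):
    """Function shipping: send the fragment to the node that owns the data."""
    owner = ownership_map[partition]
    return nodes[owner].execute(fragment)


nodes = {n: Node(n) for n in (102, 104, 106, 108, 110)}
# Partition-to-node ownership as in the example of Fig. 1.
ownership_map = {112: 102, 114: 104, 116: 106, 118: 108, 120: 110}

result = ship("UPDATE t SET c = 1 WHERE k = 7", 116, ownership_map, nodes)
```

Here a fragment that touches partition 116 is executed by node 106, whichever node originally received the statement.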

According to one embodiment, the data-to-node mapping need not be determined at compilation time of an SQL (or any other database access language) statement. Rather, as shall be described in greater detail hereafter, the data-to-node mapping may be established and revised during runtime. Using the techniques described hereafter, when ownership changes from one node that has access to the disk on which the data resides to another node that has access to the disk on which the data resides, the ownership change is performed without moving the data from its persistent location on the disk.

Locking

Locks are structures used to coordinate access to a resource among several entities that have access to the resource. In the case of a shared-nothing database system, there is no need for global locking to coordinate accesses to the user data in the shared-nothing database, since any given piece of data is owned by only one node. However, since all of the nodes of the shared-nothing database require access to the ownership map, some locking may be required to prevent inconsistent updates to the ownership map.

According to one embodiment, a two-node locking scheme is used when ownership of a piece of data is being reassigned from one node (the "old owner") to another node (the "new owner"). Further, a global locking mechanism may be used to control access to the metadata associated with the shared-nothing database. Such metadata may include, for example, the ownership map.

If ownership of the data is being redistributed for the purpose of parallel recovery, a locking scheme for the ownership map is not required. Specifically, if the ownership does not change during runtime, a simple scheme can be used to parallelize the recovery among the survivors. For example, if there are N survivors, the first survivor can be responsible for recovering all data owned by the dead node that falls into the first 1/N of the buckets, and so on. After the recovery is complete, ownership of all data owned by the dead node reverts to a single node.
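The simple 1/N scheme can be sketched as below. The function name `split_for_recovery` and the choice of contiguous slices over the sorted bucket numbers are assumptions; the text only requires that each survivor can compute its own share without consulting a lock.

```python
def split_for_recovery(dead_buckets, survivors):
    """Give the i-th survivor the i-th 1/N slice of the dead node's buckets."""
    buckets = sorted(dead_buckets)
    n = len(survivors)
    plan = {}
    for i, survivor in enumerate(survivors):
        # Integer arithmetic keeps the slices contiguous and non-overlapping,
        # and covers every bucket even when the count is not divisible by n.
        start = i * len(buckets) // n
        end = (i + 1) * len(buckets) // n
        plan[survivor] = buckets[start:end]
    return plan


# Node 102 has failed; nodes 104 and 106 share its six buckets.
plan = split_for_recovery(dead_buckets=[0, 1, 2, 3, 4, 5], survivors=[104, 106])
```

Because every survivor applies the same deterministic rule to the same inputs, all survivors arrive at the same plan independently.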

Bucket-based partitioning

As mentioned above, the data that is managed by the shared-nothing database is partitioned, and the data in each partition is exclusively owned by one node. According to one embodiment, the partitions are established by assigning the data to logical buckets, and then assigning each of the buckets to a partition. Thus, the data-to-node mapping in the ownership map includes a data-to-bucket mapping and a bucket-to-node mapping.

According to one embodiment, the data-to-bucket mapping is established by applying a hash function to the name of each data item. Similarly, the bucket-to-node mapping may be established by applying another hash function to identifiers associated with the buckets. Alternatively, one or both of the mappings may be established using range-based partitioning, list partitioning, or by simply enumerating each individual relationship. For example, one million data items may be mapped to fifty buckets by splitting the namespace of the data items into fifty ranges. The fifty buckets may then be mapped to five nodes by storing a record for each bucket that (1) identifies the bucket and (2) identifies the node currently assigned the bucket.
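A minimal sketch of the two-level mapping follows, using the figures from the example (one million items, fifty buckets, five nodes). Integer item identifiers, the use of `zlib.crc32` as the hash function, and the round-robin initial bucket-to-node records are assumptions made for illustration.

```python
import zlib

NUM_ITEMS = 1_000_000
NUM_BUCKETS = 50
NODES = [102, 104, 106, 108, 110]


def bucket_by_range(item_id):
    """Range partitioning: split the item namespace into fifty equal ranges."""
    return item_id // (NUM_ITEMS // NUM_BUCKETS)


def bucket_by_hash(item_name):
    """Hash partitioning: hash the item name and reduce it to a bucket number."""
    return zlib.crc32(item_name.encode("utf-8")) % NUM_BUCKETS


# One record per bucket: bucket number -> node currently assigned the bucket.
bucket_to_node = {b: NODES[b % len(NODES)] for b in range(NUM_BUCKETS)}


def owner_of(item_id):
    """Data-to-node lookup = data-to-bucket followed by bucket-to-node."""
    return bucket_to_node[bucket_by_range(item_id)]
```

The ownership map here holds fifty records rather than one million, which is the size reduction the bucket indirection is meant to provide.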

The use of buckets significantly reduces the size of the ownership map relative to a mapping in which a separate mapping record is stored for each data item. Further, in embodiments where the number of buckets exceeds the number of nodes, the use of buckets makes it relatively easy to reassign ownership of a subset of the data owned by a given node. For example, a new node may be assigned a single bucket from a node that is currently assigned ten buckets. Such a reassignment would simply involve revising the record that indicates the bucket-to-node mapping for that bucket. The data-to-bucket mapping of the reassigned data would not have to be changed.
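The reassignment described above touches a single record, which the following sketch makes explicit. The node names `"n1"` and `"n2"`, the function `reassign_bucket`, and the modulo data-to-bucket function are hypothetical.

```python
def data_to_bucket(item_id, num_buckets=10):
    """Hypothetical data-to-bucket mapping; it never consults node assignments."""
    return item_id % num_buckets


def reassign_bucket(bucket_to_node, bucket, new_owner):
    """Return a copy of the map with exactly one bucket record revised."""
    updated = dict(bucket_to_node)
    updated[bucket] = new_owner
    return updated


# Node "n1" currently owns ten buckets; new node "n2" takes over bucket 9.
before = {b: "n1" for b in range(10)}
after = reassign_bucket(before, bucket=9, new_owner="n2")

# Only one record differs between the two maps.
changed = [b for b in before if before[b] != after[b]]
```

Item 19 maps to bucket 9 both before and after the change; only the owner of that bucket differs.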

As mentioned above, the data-to-bucket mapping may be established using any one of a variety of techniques, including but not limited to hash partitioning, range partitioning or list values. If range-based partitioning is used and the number of ranges is not significantly greater than the number of nodes, then the database server may employ finer-grained (narrower) ranges to achieve the desired number of buckets, so long as the range key used to partition the data items is a value that will not change (for example, a date). If the range key is a value that could change, then in response to a change to the range key value for a particular data item, the data item is removed from its old bucket and added to the bucket that corresponds to the new value of the data item's range key.

Establishing the initial assignment of ownership

Using the mapping techniques described above, ownership of a single table or index can be shared among multiple nodes. Initially, the assignment of ownership may be random. For example, a user may select the key and partitioning technique (for example, hash, range, list, etc.) for the data-to-bucket mapping, and the partitioning technique for the bucket-to-node mapping, but need not specify the initial assignment of buckets to nodes. The database server may then determine the key for the bucket-to-node mapping based on the key for the data-to-bucket mapping, and create the initial bucket-to-node assignments without regard to the specific data and database objects represented by the buckets.

For example, if the user chooses to partition the object based on key A, the database server will use key A to determine the bucket-to-node mapping. In some cases, the database server can append extra keys or apply a different function (as long as it preserves the data-to-bucket mapping) to the key used for the data-to-bucket mapping. For example, if the object is hash partitioned into four data buckets using key A, the database server may subdivide each of those four buckets into three buckets (to allow for flexible assignment of buckets to nodes) by applying a hash function to key B to determine the bucket-to-node mapping, or by simply increasing the number of hash values to 12. If the hash is a modulo function, the 0th, 4th and 8th bucket-to-node buckets will correspond to the 0th data-to-bucket bucket, the 1st, 5th and 9th bucket-to-node buckets will correspond to the 1st data-to-bucket bucket, and so on.
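The modulo case can be checked with a short sketch. The function names are hypothetical, and the key is assumed to be a non-negative integer hash value. The correspondence holds because 4 divides 12, so reducing a value modulo 12 and then modulo 4 gives the same result as reducing it modulo 4 directly.

```python
DATA_BUCKETS = 4    # data-to-bucket buckets chosen by the user
NODE_BUCKETS = 12   # finer bucket-to-node buckets used by the server


def data_bucket(key_hash):
    return key_hash % DATA_BUCKETS


def node_bucket(key_hash):
    return key_hash % NODE_BUCKETS


def data_bucket_of(node_bucket_no):
    """Recover the data-to-bucket bucket from a bucket-to-node bucket number."""
    return node_bucket_no % DATA_BUCKETS


# Group the twelve fine buckets by the coarse bucket they subdivide.
groups = {
    d: [n for n in range(NODE_BUCKETS) if data_bucket_of(n) == d]
    for d in range(DATA_BUCKETS)
}
```

The grouping reproduces the correspondence stated in the text: fine buckets 0, 4 and 8 subdivide data bucket 0, and fine buckets 1, 5 and 9 subdivide data bucket 1.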

As another example, if the object is range partitioned on a key A of type DATE, then the data-to-bucket mapping could be specified by using a function year(date) that returns the year. The bucket-to-node mapping, however, could be computed internally by the database server using a function that returns the month and year of the date. Each year partition is thus divided into twelve bucket-to-node buckets. In this way, if the database server determines that the data of a particular year is accessed frequently (which will typically be the current year), it can redistribute those twelve buckets among the other nodes.
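The date-based example can be sketched as follows. The function names `data_bucket`, `node_bucket` and `spread_year`, and the round-robin redistribution of the twelve monthly buckets, are assumptions for illustration.

```python
from datetime import date


def data_bucket(d):
    """Data-to-bucket mapping: one bucket per year."""
    return d.year


def node_bucket(d):
    """Bucket-to-node bucket: one bucket per (year, month) pair."""
    return (d.year, d.month)


def spread_year(bucket_to_node, year, nodes):
    """Redistribute the twelve monthly buckets of a hot year across nodes."""
    updated = dict(bucket_to_node)
    for month in range(1, 13):
        updated[(year, month)] = nodes[(month - 1) % len(nodes)]
    return updated


# Initially node 102 owns every monthly bucket of 2004.
bucket_to_node = {(2004, m): 102 for m in range(1, 13)}
spread = spread_year(bucket_to_node, 2004, nodes=[102, 104, 106])
```

The first element of each `(year, month)` bucket number is the year, so the data-to-bucket bucket can always be derived from the bucket-to-node bucket.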

In both examples given above, given a bucket-to-node bucket number, the database server can uniquely determine the data-to-bucket bucket number. Also in those examples, the user selects the key and partitioning technique for the data-to-bucket mapping. However, in alternative embodiments, the user may not select the key and partitioning technique for the data-to-bucket mapping. Rather, the key and partitioning technique for the data-to-bucket mapping may also be determined automatically by the database server.

According to one embodiment, the database server makes the initial bucket-to-node assignments based on how many buckets should be assigned to each node. For example, nodes with greater capacity may be assigned more buckets. However, in the initial assignments, the decision of which particular buckets should be assigned to which nodes is random.

In an alternative embodiment, the database server does take into account which data is represented by a bucket when making the bucket-to-node assignments. For example, assume that the data for a particular table is divided among several buckets. The database server may intentionally assign all of those buckets to the same node, or intentionally distribute ownership of those buckets among many nodes. Similarly, the database server may, in the initial assignment, attempt to assign the buckets associated with tables to the same nodes as the buckets associated with indexes for those tables. Conversely, the database server may attempt to assign the buckets associated with tables to different nodes than the nodes to which the buckets associated with indexes for those tables are assigned.

Parallel recovery of shared data owned by one or more nodes across surviving nodes

One or more nodes of a distributed shared-nothing database system may fail. To ensure the availability of the data managed by the shared-nothing database system, the buckets owned by the failed nodes ("dead nodes") must be reassigned to nodes that have not failed. Typically, the bucket-to-node mapping information will be stored in a database catalog that is located on a shared disk. By inspecting the database catalog, the non-failed nodes of the shared-nothing database system can determine the list of partition buckets that were owned by the dead nodes.

Once the partition buckets owned by the dead nodes have been identified, the partition buckets are redistributed among the surviving nodes. Significantly, this redistribution can take place without moving the underlying data, as long as the surviving node that is assigned ownership of a bucket has access to the shared disk that contains the data mapped to the bucket. For example, assume that node 102 of cluster 100 fails. If node 102 owned the bucket that corresponds to partition 112, then that bucket can be reassigned to either node 104 or node 106 without changing the physical location of the data on disk 150.
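The redistribution can be sketched as a catalog update restricted to survivors that can reach the bucket's disk. The disk access sets follow Fig. 1; the bucket numbers, the `bucket_disk` dictionary, the round-robin choice among candidates and the function name are hypothetical.

```python
def reassign_dead_buckets(bucket_to_node, bucket_disk, disk_access, dead):
    """Move each bucket of the dead node to a survivor sharing its disk."""
    updated = dict(bucket_to_node)
    next_index = {}  # per-disk round-robin position
    for bucket in sorted(bucket_to_node):
        if bucket_to_node[bucket] != dead:
            continue
        disk = bucket_disk[bucket]
        # Only survivors with access to the disk can take ownership
        # without the data being moved.
        candidates = sorted(disk_access[disk] - {dead})
        if not candidates:
            raise RuntimeError(f"no survivor can access disk {disk}")
        i = next_index.get(disk, 0)
        updated[bucket] = candidates[i % len(candidates)]
        next_index[disk] = i + 1
    return updated


disk_access = {150: {102, 104, 106}, 152: {108, 110}}
bucket_disk = {0: 150, 1: 150, 2: 150, 3: 152}
bucket_to_node = {0: 102, 1: 102, 2: 102, 3: 108}

after_failure = reassign_dead_buckets(bucket_to_node, bucket_disk, disk_access, dead=102)
```

Only the catalog records change in this sketch; `bucket_disk` is never modified, which mirrors the point that the data stays where it is on disk 150.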

After the ownership of the buckets previously owned by the dead node has been reassigned, roll-forward and rollback operations are performed on the items in those buckets by the surviving nodes. According to one embodiment, the surviving nodes to which the failed node's buckets are assigned include only those surviving nodes that have access both to the failed node's redo log and to the data owned by the failed node. Alternatively, if the surviving nodes that perform recovery have access to the failed node's data but not to the failed node's redo log, then a coordinator may scan the redo log and distribute the redo records it contains based on the bucket for which the redo was generated.

According to one embodiment, the nodes that are performing recovery write the blocks being recovered to disk in a particular order to avoid problems. Specifically, if a large amount of recovery needs to be performed (for example, during media recovery), the recovering nodes take checkpoints or write the recovered blocks to disk. However, when writing blocks to disk under these circumstances, the recovering nodes may not be able to perform the writes in an arbitrary order. For example, if the redo generated for block A precedes the redo generated for block B, and blocks A and B are being recovered by two separate nodes, then block B cannot be written before block A, especially if this would mean that the checkpoint for the failed node's redo thread could be advanced past the redo for block B. To avoid this problem, the recovering nodes may exchange with each other their earliest dirty recovery blocks (blocks to which redo from the failed node has been applied). A node can write its block if that block is the earliest dirty recovery block. In this way, the blocks are written in order.
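The ordering rule can be simulated with a minimal sketch. Each dirty recovery block is modeled as a `(redo_position, block_id)` pair, where the redo position stands in for the location of the block's redo in the failed node's log; that representation, the single-process simulation of the exchange, and the function name are assumptions.

```python
def write_in_redo_order(dirty_by_node):
    """Simulate recovering nodes that write only the globally earliest block.

    dirty_by_node maps a node to its dirty recovery blocks, each given as a
    (redo_position, block_id) pair.
    """
    pending = {node: sorted(blocks) for node, blocks in dirty_by_node.items()}
    written = []
    while any(pending.values()):
        # Each node advertises its own earliest dirty recovery block.
        advertised = {n: blocks[0] for n, blocks in pending.items() if blocks}
        # Only the node holding the overall earliest block may write.
        writer = min(advertised, key=advertised.get)
        written.append((writer, pending[writer].pop(0)))
    return written


writes = write_in_redo_order({
    104: [(1, "A"), (4, "D")],
    106: [(2, "B"), (3, "C")],
})
```

Node 106 holds block B but does not write it until node 104 has written block A, so the write order follows the redo order even though the blocks are split across nodes.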

Because several nodes participate in the recovery operation, the recovery operation is performed faster than under the no-partitioning approach described above. In addition, unlike the pre-failure partitioning approach described above, the reassignment of storage segment ownership takes place after the failure, so no run-time penalty is incurred.

The techniques described here for distributing recovery operations among multiple nodes for parallel recovery apply equally to parallel media recovery of an object owned by a single node. Specifically, when the media containing the object fails, ownership of portions of the object can be assigned to multiple nodes for the duration of the recovery. After recovery has completed, ownership can be collapsed back to a single node.

According to one embodiment, to handle nested failures, the database system keeps track of whether undo has been applied to a block. Tracking the application of undo is useful because the earlier parts of a transaction that modified different partitions may already have been rolled back, while later changes may not yet have been rolled back.
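The following sketch suggests one way such tracking might work. It assumes logical undo records of the form (undo identifier, block, delta) and a set of already-applied undo identifiers that survives the nested failure. Neither the record layout nor the persistence mechanism is taken from the description.

```python
def apply_undo(blocks, undo_records, applied):
    """Apply each undo record at most once.
    blocks: dict mapping block id to a numeric value.
    undo_records: list of (undo_id, block_id, delta) to roll back.
    applied: set of undo ids already applied (assumed to be durable)."""
    for undo_id, block_id, delta in undo_records:
        if undo_id in applied:
            continue  # rolled back before the nested failure; skip it
        blocks[block_id] -= delta
        applied.add(undo_id)
    return blocks

# Hypothetical transaction that added 2 to Y and 5 to X.
blocks = {"X": 15, "Y": 7}
undo = [("u2", "Y", 2), ("u1", "X", 5)]
applied = set()

apply_undo(blocks, undo[:1], applied)  # first attempt stops after u2
apply_undo(blocks, undo, applied)      # retry after the nested failure
```

Without the applied set, the retry would subtract the delta for Y a second time and leave the block in an incorrect state.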

According to one embodiment, the partition storage segment number is stored in the redo record. For example, if a redo record indicates that a change was made to a block that belongs to a particular storage segment, then the number of that storage segment is stored in the redo record. Consequently, when applying redo records, the recovery process can automatically skip those redo records whose partition storage segment numbers indicate storage segments that do not require recovery.
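As a minimal sketch of this filtering, a redo record can be modeled with its storage segment number as one of its fields. The field names and the record layout below are hypothetical; only the presence of a storage segment number in each redo record comes from the description.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class RedoRecord:
    seq: int      # position in the failed node's redo log (assumed field)
    segment: int  # partition storage segment number stored in the record
    block: str    # block to which the change applies
    change: str   # opaque description of the change

def select_redo(redo_log, segments_to_recover):
    """Keep only the records whose stored segment number needs recovery."""
    return [rec for rec in redo_log if rec.segment in segments_to_recover]

log = [
    RedoRecord(1, 7, "blk-a", "set x=1"),
    RedoRecord(2, 9, "blk-b", "set y=2"),
    RedoRecord(3, 7, "blk-a", "set x=3"),
]
to_apply = select_redo(log, segments_to_recover={7})
```

The record for storage segment 9 is skipped without examining the block it refers to, which is the benefit of storing the number in the record itself.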

When applying redo, all recovering nodes can scan the failed node's redo log, or a single recovery coordinator can scan the log and distribute blocks of redo to the nodes participating in the recovery. In embodiments where the recovery coordinator distributes blocks of redo, the redo is distributed based on partition storage segment number. Thus, the recovering node assigned to recover a particular storage segment receives from the recovery coordinator the redo for all data items that belong to that storage segment.
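The coordinator-based variant might look like the following sketch, in which redo records are plain tuples of (sequence, storage segment number, payload). The tuple layout and the function name are assumptions made for illustration.

```python
from collections import defaultdict

def distribute_redo(redo_log, segment_to_recovery_node):
    """Scan the log once and route each record to the recovery node
    that was assigned the record's storage segment.
    redo_log: iterable of (seq, segment_number, payload) in log order."""
    per_node = defaultdict(list)
    for seq, segment, payload in redo_log:
        node = segment_to_recovery_node.get(segment)
        if node is None:
            continue  # segment is not being recovered; nothing to send
        per_node[node].append((seq, segment, payload))
    return dict(per_node)

redo_log = [(1, 7, "a"), (2, 8, "b"), (3, 7, "c"), (4, 9, "d")]
routed = distribute_redo(redo_log, {7: "node104", 8: "node106"})
```

Each recovery node receives its records in log order, so the redo for any one storage segment is applied in the order in which it was generated.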

During the recovery operation, a particular piece of data may move from one partition to another. According to one embodiment, an operation that moves an object from one partition to another is treated as a delete followed by an insert. Consequently, there are no ordering dependencies between blocks of redo that belong to different storage segments.

Selective parallelization

According to one embodiment, only selected portions of the recovery operation are parallelized. For example, a particular node may be designated as the recovery coordinator. During recovery, the recovery coordinator serially recovers all of the data that requires recovery until it encounters a recovery task that satisfies the parallelization criteria. For example, the parallelization criteria may specify that parallel recovery should be used for objects that exceed a certain size threshold. Consequently, when the recovery coordinator encounters such an object during the recovery process, the database server reassigns ownership of the storage segments that correspond to the large object so that several nodes can participate in the parallel recovery of that object. Upon completion of that task, ownership of the data may be reassigned back to the recovery coordinator.
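A size-threshold criterion of this kind can be sketched as follows. The event tuples, the round-robin handout of storage segments, and all names are assumptions of the sketch; the description states only that ownership is reassigned for the large object and may be returned afterwards.

```python
def recover_selectively(objects, coordinator, helpers, size_threshold):
    """objects: list of (name, size, segments) in the order encountered.
    Returns a list of events describing how each object is recovered."""
    events = []
    for name, size, segments in objects:
        if size <= size_threshold or not helpers:
            # Below the threshold: the coordinator recovers it serially.
            events.append(("serial", name, coordinator))
            continue
        # Criterion met: hand ownership of the segments to several nodes.
        owners = {seg: helpers[i % len(helpers)]
                  for i, seg in enumerate(segments)}
        events.append(("parallel", name, owners))
        # Once the task completes, ownership returns to the coordinator.
        events.append(("returned", name, coordinator))
    return events

objects = [("small_tbl", 10, [1]), ("big_tbl", 5000, [2, 3, 4])]
events = recover_selectively(objects, coordinator=104,
                             helpers=[104, 106], size_threshold=1000)
```

Here only big_tbl exceeds the threshold, so only its storage segments change hands, and they do so only for the duration of that task.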

Storage segments in transit

While ownership of a storage segment is being transferred from one node (the "old owner") to another node (the "new owner"), the storage segment is considered to be "in transit". If the old owner and/or the new owner fails while the storage segment is in transit, additional recovery steps may be necessary. The additional recovery steps that are required are dictated by the ownership transfer technique used by the database system. If the ownership transfer technique allows both the old owner and the new owner to have dirty versions of data items that belong to an in-transit storage segment, then recovery may involve (1) using the cached dirty versions of data items that reside in the surviving node, and (2) merging and applying the redo logs of the old owner and the new owner. Similarly, if a partition storage segment was in transit at the time of the failure, undo logs generated by multiple nodes may need to be applied to roll back the data items that belong to that storage segment.
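Step (2) above can be pictured as an ordered merge of two logs. The sketch below assumes that every redo record carries a change number that is comparable across nodes and that each log is already sorted by that number. The description does not say how records from different logs are ordered, so this is an illustrative assumption only.

```python
import heapq

def merge_redo_logs(old_owner_log, new_owner_log):
    """Merge two redo logs, each sorted by change number, into a single
    stream ordered by change number.
    Records are (change_number, block_id, change) tuples."""
    return list(heapq.merge(old_owner_log, new_owner_log,
                            key=lambda rec: rec[0]))

old_log = [(10, "blk-a", "v1"), (40, "blk-b", "v1")]
new_log = [(25, "blk-a", "v2"), (50, "blk-a", "v3")]
merged = merge_redo_logs(old_log, new_log)
```

Applying the merged stream in order replays the changes to blk-a from both owners in the sequence in which they were made.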

Determining which storage segments need recovery

When a node fails, the storage-segment-to-node mapping can be examined to determine which storage segments belonged to the failed node and therefore require recovery. According to one embodiment, a first pass is made over the storage-segment-to-node mapping to determine which storage segments require recovery. After the first pass, all storage segments that do not require recovery can be made available for access immediately. A second pass is then made, during which recovery operations are performed on the storage segments that require recovery. The recovery performed during the second pass can be accomplished by a single node that is designated as the owner of all of the data owned by the dead node, or can be distributed among the surviving nodes using the ownership mapping.
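The two passes can be sketched as follows. The recover callback, the round-robin distribution in the second pass, and the function name are assumptions of this sketch. Passing a single survivor models the embodiment in which one node becomes the owner of all of the dead node's data.

```python
def two_pass_recovery(segment_owner, failed_node, survivors, recover):
    """segment_owner maps a segment number to its owning node.
    recover(node, segment) is called once per segment needing recovery.
    Returns (segments available after pass 1, new owners after pass 2)."""
    if not survivors:
        raise ValueError("at least one surviving node is required")
    # Pass 1: find the segments that belonged to the failed node.
    needs_recovery = sorted(
        seg for seg, node in segment_owner.items() if node == failed_node)
    # Everything else can be made available for access immediately.
    available = sorted(
        seg for seg in segment_owner if seg not in needs_recovery)
    # Pass 2: recover the remaining segments on the surviving nodes.
    nodes = sorted(survivors)
    new_owner = {}
    for i, segment in enumerate(needs_recovery):
        node = nodes[i % len(nodes)]  # assumed round-robin distribution
        recover(node, segment)
        new_owner[segment] = node
    return available, new_owner

recovered = []
available, new_owner = two_pass_recovery(
    {1: 102, 2: 104, 3: 102, 4: 106}, failed_node=102,
    survivors={104, 106},
    recover=lambda node, seg: recovered.append((node, seg)))
```

Segments 2 and 4 are usable as soon as the first pass ends, while segments 1 and 3 become usable only after the second pass has recovered them.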

Hardware overview

Fig. 2 is a block diagram that illustrates a computer system 200 upon which an embodiment of the invention may be implemented. Computer system 200 includes a bus 202 or other communication mechanism for communicating information, and a processor 204 coupled with bus 202 for processing information. Computer system 200 also includes a main memory 206, such as a random access memory (RAM) or other dynamic storage device, coupled to bus 202 for storing information and instructions to be executed by processor 204. Main memory 206 may also be used for storing temporary variables or other intermediate information during execution of instructions by processor 204. Computer system 200 further includes a read-only memory (ROM) 208 or other static storage device coupled to bus 202 for storing static information and instructions for processor 204. A storage device 210, such as a magnetic disk or optical disk, is provided and coupled to bus 202 for storing information and instructions.

Computer system 200 may be coupled via bus 202 to a display 212, such as a cathode ray tube (CRT), for displaying information to a computer user. An input device 214, including alphanumeric and other keys, is coupled to bus 202 for communicating information and command selections to processor 204. Another type of user input device is cursor control 216, such as a mouse, a trackball, or cursor direction keys, for communicating direction information and command selections to processor 204 and for controlling cursor movement on display 212. This input device typically has two degrees of freedom in two axes, a first axis (for example, X) and a second axis (for example, Y), which allows the device to specify positions in a plane.

The invention is related to the use of computer system 200 for implementing the techniques described herein. According to one embodiment of the invention, those techniques are performed by computer system 200 in response to processor 204 executing one or more sequences of one or more instructions contained in main memory 206. Such instructions may be read into main memory 206 from another computer-readable medium, such as storage device 210. Execution of the sequences of instructions contained in main memory 206 causes processor 204 to perform the process steps described herein. In alternative embodiments, hard-wired circuitry may be used in place of or in combination with software instructions to implement the invention. Thus, embodiments of the invention are not limited to any specific combination of hardware circuitry and software.

The term "computer-readable medium" as used herein refers to any medium that participates in providing instructions to processor 204 for execution. Such a medium may take many forms, including but not limited to non-volatile media, volatile media, and transmission media. Non-volatile media include, for example, optical or magnetic disks, such as storage device 210. Volatile media include dynamic memory, such as main memory 206. Transmission media include coaxial cables, copper wire, and fiber optics, including the wires that make up bus 202. Transmission media can also take the form of acoustic or light waves, such as those generated during radio-wave and infrared data communications.

Common forms of computer-readable media include, for example, a floppy disk, a flexible disk, a hard disk, magnetic tape, or any other magnetic medium; a CD-ROM or any other optical medium; punch cards, paper tape, or any other physical medium with patterns of holes; a RAM, a PROM, an EPROM, a FLASH-EPROM, or any other memory chip or cartridge; a carrier wave as described hereinafter; or any other medium from which a computer can read.

Various forms of computer-readable media may be involved in carrying one or more sequences of one or more instructions to processor 204 for execution. For example, the instructions may initially be carried on a magnetic disk of a remote computer. The remote computer can load the instructions into its dynamic memory and send the instructions over a telephone line using a modem. A modem local to computer system 200 can receive the data on the telephone line and use an infrared transmitter to convert the data to an infrared signal. An infrared detector can receive the data carried in the infrared signal, and appropriate circuitry can place the data on bus 202. Bus 202 carries the data to main memory 206, from which processor 204 retrieves and executes the instructions. The instructions received by main memory 206 may optionally be stored on storage device 210 either before or after execution by processor 204.

Computer system 200 also includes a communication interface 218 coupled to bus 202. Communication interface 218 provides two-way data communication and is coupled to a network link 220 that is connected to a local network 222. For example, communication interface 218 may be an integrated services digital network (ISDN) card or a modem that provides a data communication connection to a corresponding type of telephone line. As another example, communication interface 218 may be a local area network (LAN) card that provides a data communication connection to a compatible LAN. Wireless links may also be used. In any such implementation, communication interface 218 sends and receives electrical, electromagnetic, or optical signals that carry digital data streams representing various types of information.

Network link 220 typically provides data communication through one or more networks to other data devices. For example, network link 220 may provide a connection through local network 222 to a host computer 224 or to data equipment operated by an Internet Service Provider (ISP) 226. ISP 226 in turn provides data communication services through the worldwide packet data communication network now commonly referred to as the "Internet" 228. Local network 222 and Internet 228 both use electrical, electromagnetic, or optical signals that carry digital data streams. The signals through the various networks, and the signals on network link 220 and through communication interface 218, which carry the digital data to and from computer system 200, are exemplary forms of carrier waves transporting the information.

Computer system 200 can send messages and receive data, including program code, through the network(s), network link 220, and communication interface 218. In the Internet example, a server 230 might transmit requested code for an application program through Internet 228, ISP 226, local network 222, and communication interface 218.

The received code may be executed by processor 204 as it is received, and/or stored in storage device 210 or other non-volatile storage for later execution. In this manner, computer system 200 may obtain application code in the form of a carrier wave.

The foregoing describes only preferred embodiments of the invention and is not intended to limit the invention. Those skilled in the art will appreciate that the invention admits of various modifications and variations. Any modification, equivalent substitution, or improvement made within the spirit and principles of the invention shall fall within the scope of protection of the invention.

Claims

1. A method for managing data, the method comprising the steps of:

Maintaining a plurality of persistent data items on persistent storage accessible to a plurality of nodes, the persistent data items including a particular data item stored at a particular location on the persistent storage;

Assigning exclusive ownership of each of the plurality of persistent data items to one of the plurality of nodes, wherein a particular node of the plurality of nodes is assigned exclusive ownership of the particular data item;

When any node wants an operation performed that involves the particular data item, the node that desires the operation to be performed shipping the operation to the particular node, for the particular node to perform the operation on the particular data item, because the particular data item is exclusively owned by the particular node;

In response to a failure that involves a set of persistent data items exclusively owned by a single node, performing the steps of:

Assigning, to each of two or more recovery nodes, exclusive ownership of a subset of the set of persistent data items involved in the failure; and

Each recovery node of the two or more recovery nodes performing a recovery operation on the subset of persistent data items assigned to that recovery node.

2. The method according to claim 1, wherein the failure is a media failure of the persistent storage that stores the set of persistent data items.

3. The method according to claim 1, wherein:

The failure is a failure of the node that has exclusive ownership of the set of persistent data items; and

The step of assigning includes assigning, to each of two or more recovery nodes, exclusive ownership of a subset of the persistent data items that were exclusively owned by the failed node.

4. The method according to claim 3, wherein:

The two or more recovery nodes include a first recovery node and a second recovery node; and

At least a portion of the recovery operation performed by the first recovery node on the subset of data exclusively assigned to the first recovery node is performed in parallel with at least a portion of the recovery operation performed by the second recovery node on the subset of data exclusively assigned to the second recovery node.

5. The method according to claim 3, further comprising:

Organizing the plurality of persistent data items into a plurality of storage segments;

Establishing a mapping between the plurality of storage segments and the plurality of nodes, wherein each node has exclusive ownership of the data items that belong to all storage segments that are mapped to that node; and

Determining, based on the mapping, which data items need to be recovered.

6. The method according to claim 5, further comprising:

Performing a first pass over the mapping to determine which storage segments have data items that need to be recovered;

Performing a second pass over the mapping to perform recovery on the data items that need to be recovered; and

After performing the first pass and before completing the second pass, making the data items that belong to all storage segments that do not need to be recovered available for access.

7. The method according to claim 3, wherein each recovery node of the two or more recovery nodes performs the recovery operation on the persistent storage based on a recovery log associated with the failed node.

8. The method according to claim 7, further comprising the step of a recovery coordinator scanning the recovery log associated with the failed node and distributing recovery records to the two or more recovery nodes.

9. The method according to claim 7, wherein each of the two or more recovery nodes scans the recovery log associated with the failed node.

10. The method according to claim 3, wherein:

The step of each recovery node of the two or more recovery nodes performing a recovery operation includes applying undo records to blocks; and

The method further comprises the step of tracking which undo records have been applied.

11. The method according to claim 5, further comprising the step of, prior to the failure, the failed node storing a storage segment number in a redo record generated by the failed node, wherein the storage segment number indicates the storage segment to which the data item associated with the redo record belongs.

12. The method according to claim 3, wherein recovery of the failed node involves a plurality of tasks, and the method further comprises the steps of:

A recovery coordinator determining that a first set of one or more tasks required for recovery of the failed node should be performed serially, and that a second set of one or more tasks required for recovery of the failed node should be performed in parallel;

Performing the first set of one or more tasks serially; and

Performing the second set of one or more tasks in parallel using the two or more recovery nodes.

13. The method according to claim 12, wherein the step of determining that the second set of one or more tasks required for recovery of the failed node should be performed in parallel is based, at least in part, on the size of one or more objects that need to be recovered.

14. The method according to claim 12, wherein:

Ownership of the data items involved in the second set of one or more tasks is passed from the recovery coordinator to the two or more recovery nodes to allow the two or more recovery nodes to perform the second set of one or more tasks; and

After the second set of one or more tasks has been performed and before recovery of the failed node is complete, ownership of the data items involved in the second set of one or more tasks is returned from the two or more recovery nodes to the recovery coordinator.

15. A computer-readable medium carrying one or more sequences of instructions which, when executed by one or more processors, cause the one or more processors to perform the method recited in claim 3.

16. A computer-readable medium carrying one or more sequences of instructions which, when executed by one or more processors, cause the one or more processors to perform the method recited in claim 4.

17. A computer-readable medium carrying one or more sequences of instructions which, when executed by one or more processors, cause the one or more processors to perform the method recited in claim 5.

18. A computer-readable medium carrying one or more sequences of instructions which, when executed by one or more processors, cause the one or more processors to perform the method recited in claim 6.

19. A computer-readable medium carrying one or more sequences of instructions which, when executed by one or more processors, cause the one or more processors to perform the method recited in claim 7.

20. A computer-readable medium carrying one or more sequences of instructions which, when executed by one or more processors, cause the one or more processors to perform the method recited in claim 8.

21. A computer-readable medium carrying one or more sequences of instructions which, when executed by one or more processors, cause the one or more processors to perform the method recited in claim 9.

22. A computer-readable medium carrying one or more sequences of instructions which, when executed by one or more processors, cause the one or more processors to perform the method recited in claim 10.

23. A computer-readable medium carrying one or more sequences of instructions which, when executed by one or more processors, cause the one or more processors to perform the method recited in claim 11.

24. A computer-readable medium carrying one or more sequences of instructions which, when executed by one or more processors, cause the one or more processors to perform the method recited in claim 12.

25. A computer-readable medium carrying one or more sequences of instructions which, when executed by one or more processors, cause the one or more processors to perform the method recited in claim 13.

26. A computer-readable medium carrying one or more sequences of instructions which, when executed by one or more processors, cause the one or more processors to perform the method recited in claim 14.