CN1829961A - Ownership reassignment in a shared-nothing database system - Google Patents

Ownership reassignment in a shared-nothing database system

Info

Publication number
CN1829961A
Authority
CN
China
Prior art keywords
node
carried out
data item
nodes
processor
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN 200480021585
Other languages
Chinese (zh)
Other versions
CN100565460C (en)
Inventor
罗杰·J·班福德
萨希坎什·钱德拉塞克拉
安杰洛·普鲁希诺
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Oracle International Corp
Oracle America Inc
Original Assignee
Oracle International Corp
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Oracle International Corp
Publication of CN1829961A
Application granted
Publication of CN100565460C
Legal status: Active

Landscapes

  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
  • Storage Device Security (AREA)

Abstract

Various techniques are described for improving the performance of a shared-nothing database system in which at least two of the nodes that are running the shared-nothing database system have shared access to a disk. Specifically, techniques are provided for changing the ownership of data in a shared-nothing database without changing the location of the data on persistent storage. Because the persistent storage location for the data is not changed during a transfer of ownership of the data, ownership can be transferred more freely and with less of a performance penalty than would otherwise be incurred by a physical relocation of the data. Various techniques are also described for providing fast run-time reassignment of ownership. Because the reassignment can be performed during run-time, the shared-nothing system does not have to be taken offline to perform the reassignment. Further, the techniques describe how the reassignment can be performed with relatively fine granularity, avoiding the need to perform bulk reassignment of large amounts of data across all nodes merely to reassign ownership of a few data items on one of the nodes.

Description

Ownership reassignment in a shared-nothing database system
Technical field
The present invention relates to techniques for managing data in a shared-nothing database system running on shared disk hardware.
Background
Multi-processing computer systems typically fall into three categories: shared-everything systems, shared disk systems, and shared-nothing systems. In shared-everything systems, processes on all processors have direct access to all volatile memory devices (hereinafter generally referred to as "memory") and all non-volatile memory devices (hereinafter generally referred to as "disks") in the system. Consequently, a high degree of wiring between the various computer components is required to provide shared-everything functionality. In addition, there are scalability limits to shared-everything architectures.
In shared disk systems, processors and memory are grouped into nodes. Each node in a shared disk system may itself constitute a shared-everything system that includes multiple processors and multiple memories. Processes on all processors can access all disks in the system, but only the processes on processors that belong to a particular node can directly access the memory within that node. Shared disk systems generally require less wiring than shared-everything systems. Shared disk systems also adapt easily to unbalanced workload conditions because all nodes can access all data. However, shared disk systems are susceptible to coherence overhead. For example, if a first node has modified data and a second node wants to read or modify the same data, then several steps may have to be taken to ensure that the correct version of the data is provided to the second node.
In shared-nothing systems, all processors, memories, and disks are grouped into nodes. As in shared disk systems, each node in a shared-nothing system may itself constitute a shared-everything system or a shared disk system. Only the processes running on a particular node can directly access the memory and disks within that node. Of the three general types of multi-processing systems, shared-nothing systems generally require the least wiring between the various system components. However, shared-nothing systems are the most susceptible to unbalanced workload conditions. For example, all of the data to be accessed during a particular task may reside on the disks of a particular node. Consequently, only processes running within that node can be used to perform the work granule, even though processes on other nodes remain idle.
Databases that run on multi-node systems typically fall into two categories: shared disk databases and shared-nothing databases.
Shared disk databases
A shared disk database coordinates work based on the assumption that all data managed by the database system is visible to all processing nodes that are available to the database system. Consequently, in a shared disk database, the server may assign any work to a process on any node, regardless of the location of the disk that contains the data that will be accessed during the work.
Because all nodes have access to the same data, and each node has its own private cache, numerous versions of the same data item may reside in the caches of any number of the nodes. Unfortunately, this means that when one node requires a particular version of a particular data item, that node must coordinate with the other nodes to have the particular version of the data item shipped to the requesting node. Thus, shared disk databases are said to operate on the concept of "data shipping", where data must be shipped to the node that has been assigned to work on the data.
Such data shipping requests may result in "pings". Specifically, a ping occurs when a copy of a data item that is needed by one node resides in the cache of another node. A ping may require the data item to be written to disk and then read from disk. The disk operations necessitated by pings can significantly reduce the performance of the database system.
Shared disk databases may be run on both shared-nothing and shared disk computer systems. To run a shared disk database on a shared-nothing computer system, software support may be added to the operating system, or additional hardware may be provided, to allow processes to have access to remote disks.
Shared-nothing databases
A shared-nothing database assumes that a process can access data only if the data is contained on a disk that belongs to the same node as the process. Consequently, if a particular node wants an operation to be performed on a data item that is owned by another node, the particular node must send a request to the other node for the other node to perform the operation. Thus, instead of shipping data between nodes, shared-nothing databases are said to perform "function shipping".
Because any given piece of data is owned by only one node, only that node (the "owner" of the data) will ever have a copy of the data in its cache. Consequently, there is no need for the type of cache coherency mechanism that is required in shared disk database systems. Further, shared-nothing systems do not suffer the performance penalties associated with pings, because a node that owns a data item is never asked to save a cached version of the data item to disk so that another node can then load the data item into its cache.
Shared-nothing databases may be run on both shared disk and shared-nothing multi-processing systems. To run a shared-nothing database on a shared disk machine, a mechanism may be provided for partitioning the database and assigning ownership of each partition to a particular node.
The fact that only the owning node may operate on a piece of data means that the workload in a shared-nothing database may become severely unbalanced. For example, in a system of ten nodes, 90% of all work requests may involve data that is owned by one of the nodes. Consequently, that one node is overworked while the computational resources of the other nodes are underutilized. To "rebalance" the workload, a shared-nothing database may be taken offline, and the data (and the ownership thereof) may be redistributed among the nodes. However, this process involves moving potentially huge amounts of data, and may only temporarily solve the workload skew.
Brief description of the drawings
The present invention is illustrated by way of example, and not by way of limitation, in the figures of the accompanying drawings, in which like reference numerals refer to similar elements and in which:
Fig. 1 is a block diagram illustrating a cluster that includes two shared disk subsystems, according to an embodiment of the invention; and
Fig. 2 is a block diagram of a computer system on which embodiments of the invention may be implemented.
Detailed description
Various techniques are described hereafter for improving the performance of a shared-nothing database system that includes a shared disk storage system. In the following description, for the purposes of explanation, numerous specific details are set forth in order to provide a thorough understanding of the present invention. It will be apparent, however, that the present invention may be practiced without these specific details. In other instances, well-known structures and devices are shown in block diagram form in order to avoid unnecessarily obscuring the present invention.
Functional overview
Various techniques are described hereafter for improving the performance of a shared-nothing database system in which at least two of the nodes that are running the shared-nothing database system have shared access to a disk. As dictated by the shared-nothing architecture of the database system, each piece of data is still owned by only one node at any given time. However, the fact that at least some of the nodes that are running the shared-nothing database system have shared access to a disk is exploited to rebalance and recover the shared-nothing database system more efficiently.
Specifically, techniques are provided for changing the ownership of data in a shared-nothing database without changing the location of the data on persistent storage. Because the persistent storage location for the data is not changed during a transfer of ownership of the data, ownership can be transferred more freely and with less of a performance penalty than would otherwise be incurred by a physical relocation of the data.
Various techniques are also described for providing fast run-time reassignment of ownership. Because the reassignment can be performed during run time, the shared-nothing system does not have to be taken offline to perform the reassignment. Further, the techniques describe how the reassignment can be performed with relatively fine granularity, avoiding the need to perform bulk reassignment of large amounts of data across all nodes merely to reassign ownership of a few data items on one of the nodes.
Exemplary cluster that includes shared disk systems
Fig. 1 is a block diagram illustrating a cluster 100 on which embodiments of the invention may be implemented. Cluster 100 includes five nodes 102, 104, 106, 108 and 110 that are coupled by an interconnect 130 that allows the nodes to communicate with each other. Cluster 100 includes two disks 150 and 152. Nodes 102, 104 and 106 have access to disk 150, and nodes 108 and 110 have access to disk 152. Thus, the subsystem that includes nodes 102, 104 and 106 and disk 150 constitutes a first shared disk system, while the subsystem that includes nodes 108 and 110 and disk 152 constitutes a second shared disk system.
Cluster 100 is an example of a relatively simple system that includes two shared disk subsystems with no overlapping membership between the shared disk subsystems. Actual systems may be much more complex than cluster 100, with hundreds of nodes, hundreds of shared disks, and many-to-many relationships between the nodes and shared disks. In such a system, a single node that has access to many disks may, for example, be a member of several distinct shared disk subsystems, where each shared disk subsystem includes one of the shared disks and all nodes that have access to that shared disk.
Shared-nothing database on shared disk system
For the purpose of illustration, it shall be assumed that a shared-nothing database system is running on cluster 100, where the database managed by the shared-nothing database system is stored on disks 150 and 152. Based on the shared-nothing nature of the database system, the data may be segregated into five groups or partitions 112, 114, 116, 118 and 120. Each of the partitions is assigned to a corresponding node. The node assigned to a partition is considered to be the exclusive owner of all data that resides in that partition. In the present example, nodes 102, 104, 106, 108 and 110 respectively own partitions 112, 114, 116, 118 and 120. The partitions 112, 114 and 116 owned by the nodes that have access to disk 150 (nodes 102, 104 and 106) are stored on disk 150. Similarly, the partitions 118 and 120 owned by the nodes that have access to disk 152 (nodes 108 and 110) are stored on disk 152.
As dictated by the shared-nothing nature of the database system running on cluster 100, any piece of data is owned by at most one node at any given time. In addition, access to the shared data is coordinated by function shipping. For example, in the context of a database system that supports the SQL language, a node that does not own a particular piece of data may cause an operation to be performed on that data by forwarding fragments of SQL statements to the node that does own the piece of data.
Ownership map
To perform function shipping efficiently, all nodes need to know which nodes own which data. Accordingly, an ownership map is established, where the ownership map indicates the data-to-node ownership assignments. At run time, the various nodes consult the ownership map to route SQL fragments to the correct nodes.
According to one embodiment, the data-to-node mapping need not be determined at compilation time of an SQL (or any other database access language) statement. Rather, as shall be described in greater detail hereafter, the data-to-node mapping may be established and revised during run time. Using the techniques described hereafter, when ownership changes from one node that has access to the disk on which the data resides to another node that has access to the disk on which the data resides, the ownership change is performed without moving the data from its persistent location on the disk.
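The run-time lookup described above can be illustrated with a minimal sketch. It assumes a plain in-memory dictionary as the ownership map; the names `Node`, `ship_function`, and the item and node labels are hypothetical and do not come from the patent.

```python
# Hypothetical sketch: route an operation to whichever node owns the data.
class Node:
    def __init__(self, name):
        self.name = name
        self.executed = []  # operations this node has run locally

    def execute(self, operation, item):
        self.executed.append((operation, item))
        return f"{self.name} ran {operation} on {item}"


def ship_function(ownership_map, nodes, operation, item):
    """Look up the owner at run time and send the operation there."""
    owner = ownership_map[item]  # consulted at run time, not compile time
    return nodes[owner].execute(operation, item)


nodes = {name: Node(name) for name in ("node102", "node104")}
ownership_map = {"row_a": "node102", "row_b": "node104"}

first = ship_function(ownership_map, nodes, "UPDATE", "row_a")
ownership_map["row_a"] = "node104"  # ownership changes; the data stays put
second = ship_function(ownership_map, nodes, "UPDATE", "row_a")
```

Because the owner is resolved on each call, the second request reaches the new owner without any change to the statement that issued it.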
Locking
Locks are structures used to coordinate access to a resource among several entities that have access to the resource. In the case of a shared-nothing database system, there is no need for global locking to coordinate accesses to the user data in the shared-nothing database, since any given piece of data is only owned by a single node. However, since all of the nodes of the shared-nothing database require access to the ownership map, some locking may be required to prevent inconsistent updates to the ownership map.
According to one embodiment, a two-node locking scheme is used when ownership of a piece of data is being reassigned from one node (the "old owner") to another node (the "new owner"). Further, a global locking mechanism may be used to control access to the metadata associated with the shared-nothing database. Such metadata may include, for example, the ownership map.
Transferring ownership without moving data
According to one aspect of the invention, ownership of data can be changed from one node (the old owner) to another node (the new owner) that has connectivity to the data, without moving the data. For example, assume that a particular data item currently resides in partition 112. Because the data item resides in partition 112, the data item is owned by node 102. To change ownership of the data to node 104, the data must cease to belong to partition 112 and instead belong to partition 114. In a conventional implementation of a shared-nothing database system, such a change of ownership would typically cause the data item to actually be moved from one physical location on disk 150 that corresponds to partition 112 to another physical location on disk 150 that corresponds to partition 114.
In contrast, according to an embodiment of the invention, partitions 112 and 114 are not physical partitions that are bound to specific locations of disk 150. Rather, partitions 112 and 114 are location-independent partitions that merely represent the sets of data items that are currently owned by nodes 102 and 104, respectively, regardless of where the specific data items reside on disk 150. Thus, because partitions 112 and 114 are location-independent, data items can be moved from one partition to the other (i.e. assigned from one owner to the other) without any actual movement of the data on disk 150.
While changing the ownership of a data item does not require movement of the data item, it does require a change in the ownership map. Unlike the user data in the shared-nothing database, the ownership map is shared among the various nodes. Consequently, portions of the ownership map may be cached in the private caches of the various nodes. Therefore, in response to the change of ownership of a data item, the ownership map is changed, and the cached copies of the affected portion of the ownership map are invalidated.
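As a rough illustration of the cache invalidation just described, the sketch below keeps one authoritative mapping plus a per-node cache of entries, and drops the affected cached entries on reassignment. The class `OwnershipMap` and its methods are assumptions made for this example; the patent does not specify a data structure or the locking that a real system would need around these steps.

```python
class OwnershipMap:
    """Hypothetical shared ownership map with per-node cached entries."""

    def __init__(self, assignments):
        self._owners = dict(assignments)  # authoritative item -> owner
        self._caches = {}                 # node name -> {item: owner}

    def lookup(self, node, item):
        """Return the owner of item as seen by node, caching the answer."""
        cache = self._caches.setdefault(node, {})
        if item not in cache:
            cache[item] = self._owners[item]
        return cache[item]

    def reassign(self, item, new_owner):
        """Change the owner and invalidate every cached copy of the entry."""
        self._owners[item] = new_owner
        for cache in self._caches.values():
            cache.pop(item, None)


omap = OwnershipMap({"item1": "node102"})
before = omap.lookup("node106", "item1")  # node106 now caches the entry
omap.reassign("item1", "node104")         # nothing on disk is touched
after = omap.lookup("node106", "item1")   # cache miss, re-read from the map
```

Only the map entry changes; the data item itself is never read or written during the reassignment.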
According to an alternative embodiment, ownership change is implemented similar to a schema change to the underlying object. Specifically, after the change is made to the ownership map, compiled statements that reference the ownership map are invalidated and recompiled to use the new ownership map.
Adding and removing nodes
During the operation of cluster 100, it may be desirable to add nodes to or remove nodes from cluster 100. In conventional shared-nothing systems, such operations would involve moving large amounts of data, frequently from one physical partition of a file or disk to another physical partition in another file or disk. By using location-independent partitions, the only data that must be physically relocated is data whose ownership is being transferred to a node that does not have access to the disk on which the data currently resides.
For example, assume that a new node X is added to cluster 100, and that node X has access to disk 152 but not to disk 150. To rebalance the workload between the nodes, some data that is currently owned by nodes 102-110 may be reassigned to node X. Because the data whose old owners are nodes 102-106 resides on disk 150, to which node X does not have access, that data must be physically moved to disk 152, to which node X has access. However, because the data whose old owners are nodes 108 and 110 already resides on disk 152, to which node X has access, the ownership of that data may be transferred to node X by updating the ownership map without moving the actual data.
Similarly, when a node is removed from cluster 100, only those data items that are transferred to a node that does not currently have access to the disk on which the data items reside need to be physically relocated. Those data items whose ownership is transferred to a node that has access to the disk on which the data resides need not be moved. For example, if node 102 is removed from cluster 100 and ownership of all of the data items previously owned by node 102 is transferred to node 104, then none of the data items will need to be physically relocated in response to the change of ownership.
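The rule for when data must physically move reduces to a connectivity check. The sketch below encodes the disk access of the nodes in Fig. 1 plus the new node X from the example; the table `DISK_ACCESS` and the function name are hypothetical.

```python
# Disk connectivity from the Fig. 1 example, plus the added node X.
DISK_ACCESS = {
    "node102": {"disk150"},
    "node104": {"disk150"},
    "node106": {"disk150"},
    "node108": {"disk152"},
    "node110": {"disk152"},
    "nodeX": {"disk152"},
}


def needs_physical_move(data_disk, new_owner, disk_access=DISK_ACCESS):
    """Data moves only if the new owner cannot reach the disk it lives on."""
    return data_disk not in disk_access[new_owner]


# Data from nodes 102-106 lives on disk 150, which node X cannot reach.
move_from_150_to_x = needs_physical_move("disk150", "nodeX")
# Data from nodes 108 and 110 lives on disk 152: a map update is enough.
move_from_152_to_x = needs_physical_move("disk152", "nodeX")
# Removing node 102 and handing its data to node 104 moves nothing.
move_102_to_104 = needs_physical_move("disk150", "node104")
```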
Gradual transfer of ownership
According to one embodiment, the performance penalty associated with the bulk reassignment of data ownership in response to added or removed nodes is lessened by performing the ownership transfer gradually rather than all at once. For example, when a new node is added to a cluster, rather than transfer the ownership of enough data to make the new node as busy as the existing nodes, the system may instead initially transfer ownership of few or no data items to the new node. According to one embodiment, ownership is gradually transferred based on workload needs. For example, the transfer of ownership of data may be triggered when the system detects that one of the nodes is getting overworked. In response to detecting that a node is overworked, some of the data items that belong to the overworked node may be assigned to the previously added node. Gradually, more and more data items may be reassigned from the overworked node to the new node until it is detected that the overworked node is no longer overworked.
On the other hand, the reassignment of ownership may be triggered when the workload on an existing node falls below a certain threshold. Specifically, it may be desirable to transfer some ownership responsibilities from an otherwise busy node to the new node when the workload of the busy node lessens, in order to avoid having the reassignment operation reduce the performance of an already overworked node.
With respect to the gradual transfer of ownership from a removed node, the ownership transfers may be triggered, for example, by necessity. For example, if a data item X was owned by a removed node, then data item X may be reassigned to another node upon detecting that some node has requested an operation that involves data item X. Similarly, transfer of ownership from a removed node to an existing node may be triggered when the workload of the existing node falls below a certain threshold.
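One way to picture the workload-triggered variant is a step function that moves a single unit of ownership at a time. The sketch below is an assumption-laden illustration: the load figures, the 0.8 threshold, and the use of buckets as the unit of transfer are chosen for the example and are not specified at this point in the patent.

```python
def transfer_one_bucket(bucket_owner, load, new_node, threshold):
    """Move a single bucket from the busiest overloaded node, if any.

    bucket_owner maps bucket id -> owning node; load maps node -> utilization.
    Returns the bucket that was moved, or None when no node is overloaded.
    """
    overloaded = [n for n in load if n != new_node and load[n] > threshold]
    if not overloaded:
        return None
    busiest = max(overloaded, key=lambda n: load[n])
    for bucket, owner in sorted(bucket_owner.items()):
        if owner == busiest:
            bucket_owner[bucket] = new_node
            return bucket
    return None


bucket_owner = {1: "a", 2: "a", 3: "b"}
load = {"a": 0.95, "b": 0.40, "new": 0.0}

moved = transfer_one_bucket(bucket_owner, load, "new", threshold=0.8)
load["a"] = 0.70  # pretend the monitor now reports a lighter load on "a"
moved_again = transfer_one_bucket(bucket_owner, load, "new", threshold=0.8)
```

Calling the step repeatedly, with fresh load readings each time, yields the gradual behavior described above: transfers stop once no node exceeds the threshold.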
Bucket-based partitioning
As mentioned above, the data that is managed by the shared-nothing database is partitioned, and the data in each partition is exclusively owned by one node. According to one embodiment, the partitions are established by assigning the data to logical buckets, and then assigning each of the buckets to a partition. Thus, the data-to-node mapping in the ownership map includes a data-to-bucket mapping and a bucket-to-node mapping.
According to one embodiment, the data-to-bucket mapping is established by applying a hash function to the name of each data item. Similarly, the bucket-to-node mapping may be established by applying another hash function to identifiers associated with the buckets. Alternatively, one or both of the mappings may be established using range-based partitioning, or by simply enumerating each individual relationship. For example, one million data items may be mapped to fifty buckets by splitting the namespace of the data items into fifty ranges. The fifty buckets may then be mapped to five nodes by storing a record for each bucket that (1) identifies the bucket and (2) identifies the node currently assigned the bucket.
The use of buckets significantly reduces the size of the ownership map relative to a mapping in which a separate mapping record is stored for each data item. Further, in embodiments where the number of buckets exceeds the number of nodes, the use of buckets makes it relatively easy to reassign ownership of a subset of the data owned by a given node. For example, a new node may be assigned a single bucket from a node that is currently assigned ten buckets. Such a reassignment would simply involve revising the record that indicates the bucket-to-node mapping for that bucket. The data-to-bucket mapping of the reassigned data would not have to be changed.
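The two-level mapping can be sketched as a hash from item name to bucket followed by a small bucket-to-node table. The example below uses fifty buckets and five nodes as in the text; the choice of SHA-256, the round-robin initial assignment, and all function names are assumptions for illustration only.

```python
import hashlib

NUM_BUCKETS = 50


def bucket_for(item_name, num_buckets=NUM_BUCKETS):
    """Data-to-bucket mapping: a stable hash of the item name."""
    digest = hashlib.sha256(item_name.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_buckets


# Bucket-to-node mapping: one record per bucket (round-robin is an assumption).
nodes = ["node102", "node104", "node106", "node108", "node110"]
bucket_to_node = {b: nodes[b % len(nodes)] for b in range(NUM_BUCKETS)}


def owner_of(item_name):
    """Data-to-node mapping is the composition of the two mappings."""
    return bucket_to_node[bucket_for(item_name)]


def reassign_bucket(bucket, new_node):
    """Reassignment touches one bucket record; item hashes are unchanged."""
    bucket_to_node[bucket] = new_node


item_bucket = bucket_for("employee:42")
reassign_bucket(item_bucket, "nodeX")
```

The ownership map here holds fifty records regardless of how many data items exist, which is the size reduction the paragraph above describes.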
As mentioned above, the data-to-bucket mapping may be established using any one of a variety of techniques, including but not limited to hash partitioning, range partitioning, or list values. If range-based partitioning is used and the number of ranges is not significantly greater than the number of nodes, then the database server may employ finer-grained (narrower) ranges to achieve the desired number of buckets, so long as the range key used to partition the data items is a value that will not change (for example, a date). If the range key is a value that could change, then in response to a change to the range key value for a particular data item, the data item is removed from its old bucket and added to the bucket that corresponds to the new value of the data item's range key.
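For the range-based case, a sorted list of upper bounds and a binary search are enough to find a bucket, and a key change becomes a remove-then-add. The bounds and key values below are hypothetical, and each item is identified by its range key purely to keep the sketch short.

```python
import bisect

# Hypothetical bounds: bucket 0 holds keys < "g", bucket 1 keys < "n",
# bucket 2 keys < "t", and bucket 3 holds everything else.
UPPER_BOUNDS = ["g", "n", "t"]


def range_bucket(key):
    """Range-based data-to-bucket mapping via binary search."""
    return bisect.bisect_right(UPPER_BOUNDS, key)


def change_key(buckets, old_key, new_key):
    """Move an item between buckets when its range key changes."""
    buckets[range_bucket(old_key)].discard(old_key)
    buckets[range_bucket(new_key)].add(new_key)


buckets = [set() for _ in range(len(UPPER_BOUNDS) + 1)]
buckets[range_bucket("apple")].add("apple")
change_key(buckets, "apple", "pear")
```

Narrowing the ranges to obtain more buckets only requires adding entries to the bounds list.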
Tree-based partitioning
Another way of partitioning the data items managed by the database system into subsets is to use a hierarchical sorted structure (for example, a BTree) such that the upper levels of the tree structure (for example, the root) are owned by all nodes, and the lower levels (for example, the leaf nodes) are partitioned among the nodes. According to one embodiment, the tree structure includes a plurality of subtrees, where each subtree is assigned to a particular node. Further, each lower-level tree node corresponds to a set of data. The set of data associated with a lower-level tree node is owned by the node associated with the subtree that includes that tree node.
In such an embodiment, when ownership of a subtree changes, the upper levels are invalidated through a locking/broadcast scheme. The pointers at the lower levels are modified to move ownership of the subtree under a different node.
Handling dirty versions during reassignment
As mentioned above, when the new owner of data has access to the disk on which the data resides, ownership of the data is changed by reassigning buckets to nodes without physically moving the location of the data on disk. However, it is possible that the old owner has in its volatile memory one or more "dirty" versions of the reassigned data items. A dirty version of a data item is a version that includes changes that are not reflected in the version that currently resides on disk.
According to one embodiment, dirty versions of data items are written to the shared disk as part of the ownership transfer operation. Consequently, when the new owner reads from disk a data item for which it has recently acquired ownership, the version of the item read by the new owner will reflect the most recent changes made by the previous owner.
Alternatively, to avoid the overhead associated with writing the dirty versions of data items to disk, the dirty versions of the data items can be purged from the volatile memory of the old owner before the dirty data items are written to the shared disk, provided that redo is forced and not overwritten. Specifically, when the owning node makes a change to a data item, a "redo" record is generated that reflects the change. As long as the redo record for the change is forced to disk at or before the change of ownership of the data item, the old owner can purge the dirty version of the data item without first saving the dirty version to disk. In this case, the new owner can reconstruct the most recent version of the data item by (1) reading the redo record to determine what changes must be made to the disk version of the data item and (2) making the indicated changes to the disk version of the data item.
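The two reconstruction steps can be sketched as replaying redo records over a copy of the on-disk version. The record layout used here (a sequence number, a field name, and a new value) is an assumption for illustration; the patent does not define the redo format.

```python
def reconstruct(disk_version, redo_records):
    """Apply redo records, oldest first, to a copy of the on-disk version."""
    current = dict(disk_version)  # leave the disk image untouched
    for record in sorted(redo_records, key=lambda r: r["seq"]):
        current[record["field"]] = record["new_value"]
    return current


disk_version = {"balance": 100}
# Redo forced to disk by the old owner before it purged its dirty copy.
redo_records = [
    {"seq": 2, "field": "balance", "new_value": 80},
    {"seq": 1, "field": "balance", "new_value": 90},
]

latest = reconstruct(disk_version, redo_records)
```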
As another alternative, the dirty data items are transferred to the new owner's cache either voluntarily (proactively by the old owner) or on demand (in response to requests from the new owner) as transactions request the data in the new owner's node.
Changes to a data item may be reflected in multiple recovery logs if the dirty versions of the data item are not flushed to disk before the ownership change. For example, assume that a first node makes a change to a data item, and then ownership of the data item is transferred to a second node. The first node may flush the redo log to disk, but transfer the dirty version of the data item directly to the second node without first storing it on disk. The second node may then make a second change to the data item. Assume that the second node flushes to disk the redo record for the second change, and then the second node fails before storing the dirty version of the data item to disk. Under these conditions, the changes that have to be reapplied to the disk version of the data item are reflected in both the redo log of the first node and the redo log of the second node. According to one embodiment, online recovery is performed by merging the redo logs to recover the data item.
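Merging per-node redo logs can be pictured as a k-way merge over logs that are each already in order. The sketch below assumes every redo record carries a sequence number that orders changes across nodes; that ordering mechanism and the tuple layout are assumptions, since the patent does not describe how records from different logs are ordered.

```python
import heapq


def merge_redo_logs(*logs):
    """Merge per-node redo logs, each sorted by sequence number, into one."""
    return list(heapq.merge(*logs, key=lambda record: record[0]))


# Records are (sequence number, node, change); values are illustrative.
first_node_log = [(1, "node1", "x=1"), (4, "node1", "x=4")]
second_node_log = [(2, "node2", "x=2"), (3, "node2", "x=3")]

merged = merge_redo_logs(first_node_log, second_node_log)
replay_order = [change for _, _, change in merged]
```

Replaying the merged stream against the disk version applies the changes of both owners in the order they were made.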
According to an embodiment, can be under the situation of not waiting for the affairs submission of revising data item, the entitlement of transferring data item.Therefore, the change of data item being made by single affairs can extend to a plurality of redo logs.In these cases, the affairs mechanism of returning of database server is set to cancel the change of a plurality of daily records, wherein, and data block is carried out destruction operation with the order of the reversed in order that data block is made a change.In addition, provide the medium recovery mechanism that can merge all possessory redo logs, wherein, consolidation procedure comprises the REDO Record of making that is changed when data are backed up.
Reassignment without blocking updates
According to one aspect of the invention, reassignment of the ownership of data items is performed without blocking updates to the data that is being reassigned. According to one embodiment, the ownership reassignment is performed without blocking updates by causing the database server to wait for all transactions that access any data item belonging to the reassigned memory paragraph to commit, and for all dirty data items belonging to that memory paragraph to be flushed to disk. Under these circumstances, data belonging to the reassigned memory paragraph can be updated immediately (without waiting for dirty versions to be flushed to disk) if the old owner did not have exclusive-mode or shared-mode access to the data. If the old owner did have exclusive-mode access to the data, then the old owner may have dirty versions of the data in its cache, so updates are delayed until the old owner writes the dirty pages (or the redo for the associated changes) to the shared disk.
Before allowing new updates to a data item whose ownership has just been transferred, the database server may be configured to wait for in-progress updates that were requested of the old owner to complete. Alternatively, the database server may be configured to abort the in-progress operations and then reissue the transactions to the new owner. According to one embodiment, the decision whether to wait for a given in-progress operation to complete is made based on a variety of factors. Such factors may include, for example, how much work has already been done for the operation.
In some cases, waiting for the updates requested of the old owner to complete may create false deadlocks. For example, assume that row A is in memory paragraph 1, and rows B and C are in memory paragraph 2. Assume that transaction T1 has updated row A, and that another transaction T2 has updated row B. Assume that at this point the ownership of memory paragraph 2 is remapped to a new node. If T1 now wants to update row C, T1 will wait for the remapping to complete. Thus, T1 will wait for T2. If T2 then wants to update row A, there is a deadlock between T1 and T2.
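The deadlock in this example corresponds to a cycle in a waits-for graph. The following sketch, in which the graph representation and the `has_cycle` function are assumptions made for illustration, replays the scenario: T1 waits for T2 because of the remapping, and the cycle closes once T2 requests row A.

```python
def has_cycle(waits_for):
    """Return True if the waits-for graph (txn -> set of txns it waits on) has a cycle."""
    visiting, done = set(), set()

    def visit(txn):
        if txn in done:
            return False
        if txn in visiting:
            return True  # reached a transaction that is already on the current path
        visiting.add(txn)
        found = any(visit(nxt) for nxt in waits_for.get(txn, ()))
        visiting.discard(txn)
        done.add(txn)
        return found

    return any(visit(txn) for txn in list(waits_for))


# T1 wants row C, so it waits for the remapping, which in turn waits for T2.
waits_for = {"T1": {"T2"}}
deadlock_before = has_cycle(waits_for)

# T2 now wants row A, which T1 has already updated.
waits_for["T2"] = {"T1"}
deadlock_after = has_cycle(waits_for)
```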
According to one embodiment, even when ownership of several memory paragraphs is to be transferred, the ownership transfer is performed one memory paragraph at a time, to minimize the amount of time a transaction has to wait in order to access data in the reassigned memory paragraph.
Techniques for transferring ownership
The following examples illustrate techniques, according to various embodiments of the invention, for transferring ownership of data within a shared-nothing database implemented on a shared-disk system. In the following examples, it is assumed that the ownership of the data is changed while transactions that have modified the data are still active. That is, the database system does not wait for in-progress transactions that access the data to be reassigned to quiesce.
One technique for transferring ownership is described below with reference to an example in which the ownership of a subset of an object ("memory paragraph B") is changed from node X to node Y. According to one embodiment, the database system initially marks memory paragraph B as being "in transition" from node X to node Y. The change to the ownership map is then broadcast to all nodes, or invalidated through global locking.
According to one embodiment, query execution plans that involve data in memory paragraph B are regenerated in response to the change to the ownership map. Alternatively, the cached map is invalidated and reloaded in response to the change to the ownership map.
After the reassignment, any new sub-query or data manipulation language (DML) fragment that accesses data in memory paragraph B will be shipped to node Y. Optionally, SQL fragments that are currently running in X can be rolled back before the memory paragraph is marked as being in transition from X to Y. These fragments may then be reissued to node Y after the reassignment. It should be noted that the transaction to which these fragments belong is not itself rolled back; only the current call is rolled back and resent to the new owner. In particular, the changes that node X made to data in memory paragraph B in previous calls are untouched.
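A minimal sketch of an ownership map with an "in transition" state is shown below. The class and method names, the `version` counter used as a stand-in for invalidating cached copies of the map, and the identifier `bucket` (which stands for a memory paragraph) are all assumptions made for illustration. The sketch also assumes that new fragments are routed to the new owner as soon as the transfer begins.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Ownership:
    owner: str
    moving_to: Optional[str] = None  # set while the bucket is "in transition"


class OwnershipMap:
    def __init__(self):
        self._entries = {}
        self.version = 0  # bumped on every change so cached copies can be invalidated

    def assign(self, bucket, node):
        self._entries[bucket] = Ownership(node)
        self.version += 1

    def begin_transfer(self, bucket, new_node):
        self._entries[bucket].moving_to = new_node
        self.version += 1

    def complete_transfer(self, bucket):
        entry = self._entries[bucket]
        if entry.moving_to is None:
            raise ValueError("bucket is not in transition")
        entry.owner, entry.moving_to = entry.moving_to, None
        self.version += 1

    def in_transition(self, bucket):
        return self._entries[bucket].moving_to is not None

    def route(self, bucket):
        """Node to which a new fragment touching `bucket` is shipped."""
        entry = self._entries[bucket]
        return entry.moving_to or entry.owner
```

For example, after `assign("B", "X")` and `begin_transfer("B", "Y")`, a call to `route("B")` returns `"Y"` even though the transfer has not yet completed.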
According to one embodiment, node X can detect, through a simple local locking scheme, that there are no in-progress calls accessing data in memory paragraph B. The locking scheme may involve, for example, causing every process that accesses data in a memory paragraph to obtain a shared lock/latch on that memory paragraph. When the memory paragraph is to be reassigned, the process performing the reassignment waits until it can obtain an exclusive lock/latch on the memory paragraph. By obtaining an exclusive lock/latch on the memory paragraph, the reassignment process ensures that no other process is currently accessing the memory paragraph.
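The shared/exclusive latch described above can be sketched with a condition variable. The `BucketLatch` class and its method names are hypothetical (here `bucket` stands for a memory paragraph); the sketch makes no attempt to prevent starvation of the reassignment process by a steady stream of readers.

```python
import threading


class BucketLatch:
    """Shared/exclusive latch guarding one bucket (memory paragraph)."""

    def __init__(self):
        self._cond = threading.Condition()
        self._readers = 0
        self._exclusive = False

    def acquire_shared(self):
        # Taken by every process that accesses data in the bucket.
        with self._cond:
            while self._exclusive:
                self._cond.wait()
            self._readers += 1

    def release_shared(self):
        with self._cond:
            self._readers -= 1
            self._cond.notify_all()

    def acquire_exclusive(self, timeout=None):
        # Taken by the reassignment process; succeeds only when nobody holds the latch.
        with self._cond:
            free = self._cond.wait_for(
                lambda: self._readers == 0 and not self._exclusive, timeout
            )
            if free:
                self._exclusive = True
            return free

    def release_exclusive(self):
        with self._cond:
            self._exclusive = False
            self._cond.notify_all()
```

Holding the exclusive latch while the memory paragraph is marked as in transition guarantees that no call is in the middle of accessing it at that moment.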
According to one embodiment, because of potential deadlocks, node X does not wait for all calls to complete successfully before marking the memory paragraph as being in transition. The following is an example of how such a deadlock could occur. Consider three rows, 1, 2, and 3, in a memory paragraph that is to be remapped.
The following sequence of events can lead to a deadlock:
(a) T1 updates row 2.
(b) T2 performs a multi-row update of row 1 followed by row 2.
T2 now waits for T1.
(c) It is decided that the memory paragraph will be remapped.
(d) T1 wants to update row 3. T1 now waits for T2.
According to one embodiment, the forced abort of in-progress calls at node X is avoided by allowing the in-progress calls to continue to execute normally as long as the data that they access in memory paragraph B is in cache. In other words, X cannot read a block of memory paragraph B from disk without inter-node locking. If there is a cache miss and the memory paragraph is in transition, X must send a message to Y and either retrieve the block from Y or read the block from disk. While the memory paragraph is in transition, a cache coherence protocol is used between X and Y.
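The read path at the old owner during the transition can be sketched as below. Plain dictionaries stand in for X's cache, Y's cache, and the shared disk, and the lookup in `peer_cache` stands in for the message that X sends to Y; the function name and its parameters are assumptions made for illustration.

```python
def read_block(block_id, local_cache, in_transition, peer_cache, disk):
    """Resolve a read at old owner X; returns (value, source of the value)."""
    if block_id in local_cache:
        # Cache hit: the in-progress call proceeds normally.
        return local_cache[block_id], "local cache"
    if in_transition:
        # Cache miss during the transition: X must consult Y before touching disk.
        if block_id in peer_cache:
            return peer_cache[block_id], "peer cache"
        return disk[block_id], "disk after checking with peer"
    return disk[block_id], "disk"


disk = {"b1": "old", "b2": "old"}
x_cache = {"b1": "x-dirty"}
y_cache = {"b2": "y-dirty"}
hit = read_block("b1", x_cache, True, y_cache, disk)
miss = read_block("b2", x_cache, True, y_cache, disk)
```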
Handling requests at the new owner
After memory paragraph B has been reassigned to node Y, any new SQL fragment that needs to access data in memory paragraph B begins executing in node Y. The technique that node Y uses when reading newly transferred data items from disk may vary depending on what the previous owner, node X, did before the memory paragraph was marked as being in transition. The following cases are examples that illustrate how different scenarios may be handled by the new owner, node Y.
Case A: Assume that node X aborted all in-progress calls and wrote all dirty blocks that map to this memory paragraph to the shared disk. For efficiency, each node can link its dirty data items to a per-memory-paragraph object queue. Under these circumstances, node Y can read directly from disk. No cache coherence is needed. Node Y is immediately marked as the new owner of the memory paragraph.
Case B: Assume that node X aborted all in-progress calls but did not write out the dirty data items. Under these circumstances, before reading a block from disk, node Y needs to retrieve the block from node X or verify that node X does not have a dirty copy. If X has a dirty copy, a past image is left in node X, both for recovery and to ensure that the checkpoint does not advance past a change that was made in node X and has not yet been reflected on disk by a block write in node Y. After all dirty data items in X have been written (either by X itself, or as past images (PI) purged by a write in Y), the memory paragraph state is changed from in transition to having Y as the owner. Y can now access disk blocks in this memory paragraph without checking with X.
If node X fails while the memory paragraph state is in transition, and node Y does not have the current copy of a data item, the recovering node (node Y, if it survives) will need to apply the redo that was generated for this memory paragraph in node X (and possibly the redo generated in node Y, if node Y also failed).
Case C: Assume that node X aborted the in-progress calls and purged the dirty data items. Under these circumstances, node Y can read blocks directly from disk without cache coherence. However, a block may need to be brought up to date if there is unapplied redo in X. The memory paragraph is considered to be in transition until all of the redo generated in X has been applied by Y and written to disk. This is needed to prevent X from advancing its checkpoint past a piece of redo that has not yet been reflected on disk.
If node X fails while the memory paragraph state is in transition, and node Y does not have the current copy of a data item, the recovering node (node Y, if it survives) will need to apply the redo that was generated for this memory paragraph in node X (and possibly the redo generated in node Y, if node Y also failed).
Case D: Assume that node X continues to execute the in-progress calls. Node X and node Y each need to check with the other before accessing a block from disk. When all in-progress calls have completed and all blocks have either been written out or transferred to Y, Y is marked as the new owner. From this point on, no cache coherence is needed.
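The four cases above can be summarized as a lookup table from what node X did to what node Y must do. The table below is only a reading aid under the stated cases; the `NewOwnerPlan` type and its field names are hypothetical and not part of any described implementation.

```python
from typing import NamedTuple


class NewOwnerPlan(NamedTuple):
    check_with_old_owner: bool  # must Y consult X before reading a block from disk?
    may_apply_old_redo: bool    # may Y have to apply redo generated in X to a block?
    transition_ends: str        # when Y is marked as the owner


PLANS = {
    # X aborted in-progress calls and wrote all dirty blocks to shared disk.
    "A": NewOwnerPlan(False, False, "immediately"),
    # X aborted in-progress calls but kept its dirty data items.
    "B": NewOwnerPlan(True, False, "after all dirty data items in X are written"),
    # X aborted in-progress calls and purged its dirty data items.
    "C": NewOwnerPlan(False, True, "after all redo from X is applied and written"),
    # X continues to execute in-progress calls.
    "D": NewOwnerPlan(True, False, "after calls finish and blocks are written or transferred"),
}

cases_needing_coherence = sorted(c for c, p in PLANS.items() if p.check_with_old_owner)
```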
If node X or node Y fails while the memory paragraph state is in transition, recovery must be performed. The recovery techniques used in this context may be similar to those described in U.S. Patent No. 6,353,836 and in U.S. Patent Application No. 10/092,047, filed on March 4, 2002, each of which is incorporated herein in its entirety. If both nodes fail, the redo logs from X and Y need to be merged.
Allowing node X to continue to execute in-progress calls, as described in Case D, brings various benefits. In particular, it allows ownership to be reassigned with minimal impact on ongoing transactions. However, it requires a cache coherence and recovery scheme to be implemented between node X and node Y for memory paragraph B.
One approach to providing cache coherence under these circumstances involves having node X obtain locks on all of the data items that it currently has cached for memory paragraph B. Node Y may be assigned as the master/directory node for all data items in B. A notification that memory paragraph B will be moved from X to Y is then sent to all nodes. This notification invalidates or updates the ownership map, so that further accesses to B are sent to Y.
If a failure occurs before this point, the ownership reassignment operation is aborted. Otherwise, the map is updated to indicate that Y is the new owner and that cache coherence is in effect.
X and Y then execute a cache coherence protocol for all data items in memory paragraph B, such as the protocol described in U.S. Patent No. 6,353,836 and U.S. Patent Application No. 10/092,047. When there are no more dirty data items in B, the cache coherence protocol can be stopped. Y can release any locks that it may have obtained for data items in B.
According to one embodiment, a cache coherence protocol is always in effect, such that the node that owns a memory paragraph is also the master of the locks on it. In most cases, each node allocates only local locks (because it is the master), and cache coherence messages are needed only when ownership changes. When ownership changes, the locks opened in the memory paragraph can be dynamically remastered to the new owner.
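Remastering can be pictured as moving the lock entries for a memory paragraph from the old master's lock table to the new one. The nested-dictionary layout and the `remaster_locks` function below are assumptions chosen to keep the sketch short; a real lock manager would also have to handle requests that arrive while the move is in progress.

```python
def remaster_locks(lock_tables, bucket, old_owner, new_owner):
    """Move every lock opened on `bucket` (a memory paragraph) to the new master."""
    moved = lock_tables[old_owner].pop(bucket, {})
    lock_tables[new_owner].setdefault(bucket, {}).update(moved)
    return len(moved)


# lock_tables[node][bucket][data item] -> lock mode held on that item
lock_tables = {
    "X": {"B": {"row1": "exclusive", "row2": "shared"}},
    "Y": {},
}
moved_count = remaster_locks(lock_tables, "B", "X", "Y")
```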
Modifications before the ownership change
According to one embodiment, a sub-transaction that modified data in node X before the ownership change is still considered active, because its changes will be needed if the transaction rolls back. If the transaction rolls back and it had made changes to this memory paragraph in node X, node Y will need to apply the undo logs by reading them from the part of the shared-disk log that belongs to X.
A sub-transaction that modified data in node X before the ownership change may update the same data in node Y. This requires that requests for transactional locks in node Y, such as row locks or page locks, be coordinated with node X. If a sub-transaction requests a lock and the lock is already held by its sibling transaction in node X, the lock request is granted. However, if the lock is held by an unrelated transaction, the lock request is blocked. The waits-for graph reflects this wait as a wait on the parent transaction, so that global deadlocks can be detected. Once the ownership change is complete and all transactions that obtained locks to access data in memory paragraph B in node X have completed (committed or aborted), only local locks are needed for transactional locks.
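The grant-or-block rule for transactional locks can be sketched as follows. The `parent_of` mapping from sub-transaction to parent transaction and both function names are hypothetical. Recording the waits-for edge between the two parent transactions is one possible reading of the text, not a confirmed detail.

```python
def lock_decision(requester, holder, parent_of):
    """Decide a lock request made in node Y against the holder known to node X."""
    if holder is None:
        return "grant"
    if parent_of[requester] == parent_of[holder]:
        return "grant"  # the holder is a sibling sub-transaction of the requester
    return "block"      # the holder belongs to an unrelated transaction


def waits_for_edge(requester, holder, parent_of):
    """Edge added to the waits-for graph, expressed at the parent-transaction level."""
    return (parent_of[requester], parent_of[holder])


parent_of = {"T1.x": "T1", "T1.y": "T1", "T2.x": "T2"}
sibling = lock_decision("T1.y", "T1.x", parent_of)
unrelated = lock_decision("T1.y", "T2.x", parent_of)
edge = waits_for_edge("T1.y", "T2.x", parent_of)
```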
Requests for transactional locks can always be coordinated locally by causing the database server to wait for transactions to complete before initiating the ownership change.
Hardware overview
Fig. 2 is a block diagram that illustrates a computer system 200 upon which an embodiment of the invention may be implemented. Computer system 200 includes a bus 202 or other communication mechanism for communicating information, and a processor 204 coupled with bus 202 for processing information. Computer system 200 also includes a main memory 206, such as a random access memory (RAM) or other dynamic storage device, coupled to bus 202 for storing information and instructions to be executed by processor 204. Main memory 206 may also be used for storing temporary variables or other intermediate information during execution of instructions to be executed by processor 204. Computer system 200 further includes a read-only memory (ROM) 208 or other static storage device coupled to bus 202 for storing static information and instructions for processor 204. A storage device 210, such as a magnetic disk or optical disk, is provided and coupled to bus 202 for storing information and instructions.
Computer system 200 may be coupled via bus 202 to a display 212, such as a cathode ray tube (CRT), for displaying information to a computer user. An input device 214, including alphanumeric and other keys, is coupled to bus 202 for communicating information and command selections to processor 204. Another type of user input device is cursor control 216, such as a mouse, a trackball, or cursor direction keys, for communicating direction information and command selections to processor 204 and for controlling cursor movement on display 212. This input device typically has two degrees of freedom in two axes, a first axis (for example, x) and a second axis (for example, y), which allows the device to specify positions in a plane.
The invention is related to the use of computer system 200 for implementing the techniques described herein. According to one embodiment of the invention, those techniques are performed by computer system 200 in response to processor 204 executing one or more sequences of one or more instructions contained in main memory 206. Such instructions may be read into main memory 206 from another computer-readable medium, such as storage device 210. Execution of the sequences of instructions contained in main memory 206 causes processor 204 to perform the process steps described herein. In alternative embodiments, hard-wired circuitry may be used in place of or in combination with software instructions to implement the invention. Thus, embodiments of the invention are not limited to any specific combination of hardware circuitry and software.
The term "computer-readable medium" as used herein refers to any medium that participates in providing instructions to processor 204 for execution. Such a medium may take many forms, including but not limited to non-volatile media, volatile media, and transmission media. Non-volatile media include, for example, optical or magnetic disks, such as storage device 210. Volatile media include dynamic memory, such as main memory 206. Transmission media include coaxial cables, copper wire, and fiber optics, including the wires that make up bus 202. Transmission media can also take the form of acoustic or light waves, such as those generated during radio-wave and infrared data communications.
Common forms of computer-readable media include, for example, a floppy disk, a flexible disk, a hard disk, magnetic tape, or any other magnetic medium; a CD-ROM or any other optical medium; punch cards, paper tape, or any other physical medium with patterns of holes; a RAM, a PROM, an EPROM, a FLASH-EPROM, or any other memory chip or cartridge; a carrier wave as described below; or any other medium from which a computer can read.
Various forms of computer-readable media may be involved in carrying one or more sequences of one or more instructions to processor 204 for execution. For example, the instructions may initially be carried on a magnetic disk of a remote computer. The remote computer can load the instructions into its dynamic memory and send the instructions over a telephone line using a modem. A modem local to computer system 200 can receive the data on the telephone line and use an infrared transmitter to convert the data to an infrared signal. An infrared detector can receive the data carried in the infrared signal, and appropriate circuitry can place the data on bus 202. Bus 202 carries the data to main memory 206, from which processor 204 retrieves and executes the instructions. The instructions received by main memory 206 may optionally be stored on storage device 210 either before or after execution by processor 204.
Computer system 200 also includes a communication interface 218 coupled to bus 202. Communication interface 218 provides a two-way data communication coupling to a network link 220 that is connected to a local network 222. For example, communication interface 218 may be an integrated services digital network (ISDN) card or a modem that provides a data communication connection to a corresponding type of telephone line. As another example, communication interface 218 may be a local area network (LAN) card that provides a data communication connection to a compatible LAN. Wireless links may also be implemented. In any such implementation, communication interface 218 sends and receives electrical, electromagnetic, or optical signals that carry digital data streams representing various types of information.
Network link 220 typically provides data communication through one or more networks to other data devices. For example, network link 220 may provide a connection through local network 222 to a host computer 224 or to data equipment operated by an Internet Service Provider (ISP) 226. ISP 226 in turn provides data communication services through the worldwide packet data communication network now commonly referred to as the "Internet" 228. Local network 222 and Internet 228 both use electrical, electromagnetic, or optical signals that carry digital data streams. The signals through the various networks, and the signals on network link 220 and through communication interface 218, which carry the digital data to and from computer system 200, are exemplary forms of carrier waves transporting the information.
Computer system 200 can send messages and receive data, including program code, through the network(s), network link 220, and communication interface 218. In the Internet example, a server 230 might transmit requested code for an application program through Internet 228, ISP 226, local network 222, and communication interface 218.
The received code may be executed by processor 204 as it is received, and/or stored in storage device 210 or other non-volatile storage for later execution. In this manner, computer system 200 may obtain application code in the form of a carrier wave.
The foregoing describes only preferred embodiments of the invention and is not intended to limit the invention; to those skilled in the art, the invention admits of various changes and variations. Any modification, equivalent replacement, or improvement made within the spirit and principles of the invention shall fall within the scope of protection of the invention.

Claims (67)

1. A method for managing data, the method comprising the steps of:
maintaining a plurality of persistent data items on persistent storage accessible to a plurality of nodes, the persistent data items including a particular data item stored at a particular location on said persistent storage;
assigning exclusive ownership of each of the persistent data items to one of the nodes, wherein a particular node of said plurality of nodes is assigned exclusive ownership of said particular data item;
when any node wants an operation performed that involves said particular data item, the node that desires the operation to be performed ships the operation to said particular node, because said particular data item resides at said particular location, for said particular node to perform the operation on said particular data item;
while the first node continues to operate, reassigning ownership of said particular data item from said particular node to another node without moving said particular data item from said particular location on said persistent storage;
after the reassignment, when any node wants an operation performed that involves said particular data item, the node that desires the operation to be performed ships the operation to said other node, because said particular data item resides at said particular location, for said other node to perform the operation on said particular data item.
2. The method according to claim 1, wherein the step of reassigning ownership of said particular data item from said particular node to another node includes updating an ownership map that is shared among said plurality of nodes.
3. The method according to claim 1, wherein said plurality of nodes are nodes of a multi-node database system.
4. The method according to claim 3, wherein said multi-node database system includes nodes that do not have access to said persistent storage.
5. The method according to claim 3, wherein:
said persistent storage is a first persistent storage of a plurality of persistent storages used by said multi-node database system; and
the method further comprises reassigning ownership of a second data item from a first node that has access to said first persistent storage to a second node that has access to a second persistent storage but does not have access to said first persistent storage; and
wherein the step of reassigning ownership of said second data item includes moving said second data item from said first persistent storage to said second persistent storage.
6. The method according to claim 3, wherein the step of reassigning ownership of said particular data item from said particular node to another node is performed in response to the addition of said other node to said multi-node database system.
7. The method according to claim 3, wherein:
the step of reassigning ownership of said particular data item from said particular node to another node is performed in anticipation of the removal of said particular node from said multi-node database system; and
the method further comprises the step of, in anticipation of the removal of said particular node from said multi-node database system, physically moving a second data item from said persistent storage to another persistent storage, wherein said second data item is reassigned from said particular node to a node of said multi-node database system that does not have access to said persistent storage.
8. The method according to claim 3, wherein the step of reassigning ownership of said particular data item from said particular node to another node is performed as part of a gradual transfer of ownership from said particular node to one or more other nodes.
9. The method according to claim 8, wherein the gradual transfer is initiated in response to detecting that said particular node is overworked relative to one or more other nodes of said multi-node database system.
10. The method according to claim 9, wherein the gradual transfer is terminated in response to detecting that said particular node is no longer overworked relative to one or more other nodes of said multi-node database system.
11. The method according to claim 3, wherein the step of reassigning ownership of said particular data item from said particular node to another node is performed as part of a gradual transfer of ownership from one or more other nodes to said other node, wherein the gradual transfer is initiated in response to detecting that said other node is underworked relative to one or more other nodes in said multi-node database system.
12. The method according to claim 3, further comprising the steps of:
after a first node is removed from said multi-node system, continuing to have a set of data items owned by said first node; and
in response to detecting requests for operations that involve said data items, reassigning ownership of the data items from said first node to one or more other nodes.
13. The method according to claim 3, further comprising the steps of:
after a first node is removed from said multi-node system, continuing to have a set of data items owned by said first node; and
in response to detecting that the workload of a second node has fallen below a predetermined threshold, reassigning ownership of the data items from said first node to said second node.
14. The method according to claim 1, wherein:
at the time said particular data item is to be reassigned to said other node, said particular node stores a dirty version of said particular data item in volatile memory; and
the step of reassigning ownership of said particular data item from said particular node to another node includes writing said dirty version of said particular data item to said persistent storage.
15. The method according to claim 1, wherein:
at the time said particular data item is to be reassigned to said other node, said particular node stores a dirty version of said particular data item in volatile memory; and
the step of reassigning ownership of said particular data item from said particular node to another node includes forcing to persistent storage one or more redo records associated with said dirty version, and purging said dirty version from said volatile memory without writing said dirty version of said particular data item to said persistent storage;
said other node reconstructs said dirty version by applying said one or more redo records to the version of said particular data item that resides on said persistent storage.
16. The method according to claim 1, wherein:
at the time said particular data item is to be reassigned to said other node, said particular node stores a dirty version of said particular data item in volatile memory; and
the method further comprises the step of transferring said dirty version of said particular data item from volatile memory associated with said particular node to volatile memory associated with said other node.
17. The method according to claim 16, wherein the step of transferring said dirty version is performed proactively by said particular node without said other node requesting said dirty version.
18. The method according to claim 16, wherein the step of transferring said dirty version is performed by said particular node in response to a request for said dirty version from said other node.
19. The method according to claim 1, wherein:
the step of reassigning ownership of said particular data item from said particular node to another node is performed without waiting for a transaction that is modifying said data item to commit;
said transaction makes a first set of modifications while said particular data item is owned by said particular node; and
said transaction makes a second set of modifications while said particular data item is owned by said other node.
20. The method according to claim 19, further comprising rolling back the changes made by said transaction by rolling back said second set of modifications based on undo records in an undo log associated with said other node, and rolling back said first set of modifications based on undo records in an undo log associated with said particular node.
21. The method according to claim 1, wherein the method comprises the steps of:
said other node receiving a request to update said data item;
determining whether said particular node had exclusive-mode or shared-mode access to said data item;
if said particular node did not have exclusive-mode or shared-mode access to said data item, then said other node updating said particular data item without waiting for said particular node to flush to persistent storage any dirty version of said data item, or redo for the dirty version.
22. The method according to claim 1, further comprising the steps of:
in response to transferring ownership of said particular data item to said other node, aborting an in-progress operation that involves said particular data item;
after ownership of said particular data item has been transferred to said particular node, re-executing said in-progress operation.
23. The method according to claim 1, wherein:
an operation that involves said particular data item is in progress at the time the transfer of ownership of said particular data item is to be performed;
the method further comprises the step of determining, based on a set of one or more factors, whether to wait for said in-progress operation to complete; and
if it is determined not to wait for said in-progress operation to complete, aborting said in-progress operation.
24. The method according to claim 23, wherein said set of one or more factors includes how much work has already been performed by said in-progress operation.
25. A method for managing data, the method comprising the steps of:
maintaining a plurality of persistent data items on persistent storage accessible to a plurality of nodes;
assigning ownership of each of said persistent data items to one of said nodes by assigning each data item to one of a plurality of memory paragraphs; and
assigning each memory paragraph to one of said plurality of nodes;
wherein the node to which a memory paragraph is assigned is established as the owner of all data items assigned to said memory paragraph;
when a first node wants an operation performed that involves a data item owned by a second node, said first node ships the operation to said second node for said second node to perform the operation.
26. method according to claim 25 wherein, is carried out and described each data item is distributed to one step in a plurality of memory paragraphs by the title relevant with each data item being used hash function.
27. method according to claim 25 wherein, is carried out and described each memory paragraph is distributed to one step in described a plurality of node by the identifier relevant with each memory paragraph being used hash function.
28. method according to claim 25 wherein, is used to carry out based on the subregion of scope and described each data item is distributed to one step in a plurality of memory paragraphs.
29. method according to claim 25 wherein, is used to carry out based on the subregion of scope and described each memory paragraph is distributed to one step in described a plurality of node.
30. method according to claim 25 wherein, is carried out the relation of memory paragraph and described each data item is distributed to one step in a plurality of memory paragraphs by enumerating the individual data item.
31. method according to claim 25 wherein, is carried out the relation of node and described each memory paragraph is distributed to one step in described a plurality of node by enumerating single memory paragraph.
32. method according to claim 25, wherein, the quantity of memory paragraph is greater than the quantity of node, and described memory paragraph is a many-to-one relationship to the relation of node.
33. method according to claim 25 further may further comprise the steps, by revising the mapping of memory paragraph to node not revising data item under the situation of the mapping of memory paragraph, the entitlement that will be mapped to all data item of memory paragraph is redistributed to Section Point from first node.
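The two-level ownership mapping of claims 25 to 33 can be illustrated with a short sketch. Everything named here is an assumption made for the example: the intermediate storage unit is called a bucket, `NUM_BUCKETS`, `bucket_of`, and `OwnershipMap` are hypothetical, and CRC32 stands in for whatever hash function an actual system would use. Data items are hashed to buckets by name, buckets are enumerated against nodes, and ownership moves by editing a single bucket-to-node entry.

```python
import zlib

NUM_BUCKETS = 8  # assumed value; the claims only require more buckets than nodes


def bucket_of(item_name: str) -> int:
    """Data item -> bucket, by hashing the name associated with the item."""
    return zlib.crc32(item_name.encode("utf-8")) % NUM_BUCKETS


class OwnershipMap:
    """Bucket -> node map; the item -> bucket map is the fixed hash above."""

    def __init__(self, nodes):
        # Enumerate the bucket-to-node relation explicitly, filled round-robin
        # here so that several buckets map to each node (many-to-one).
        self.bucket_to_node = {b: nodes[b % len(nodes)] for b in range(NUM_BUCKETS)}

    def owner_of(self, item_name: str) -> str:
        return self.bucket_to_node[bucket_of(item_name)]

    def reassign_bucket(self, bucket: int, new_node: str) -> None:
        """Move ownership of every item in `bucket` by editing one map entry.

        The item-to-bucket mapping is left untouched.
        """
        self.bucket_to_node[bucket] = new_node

    def execute(self, requesting_node: str, item_name: str, op: str) -> str:
        """Run `op` locally if the requester owns the item, else ship it."""
        owner = self.owner_of(item_name)
        if owner == requesting_node:
            return f"{owner} runs {op} on {item_name}"
        return f"{requesting_node} ships {op} on {item_name} to {owner}"
```

Keeping the bucket count larger than the node count means a single bucket can be handed to another node without touching the assignment of any other bucket, which is what makes fine-grained reassignment cheap in this model.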
34. a computer-readable medium carries one or more instruction sequences, when described instruction sequence is carried out by one or more processors, described one or more processor is carried out in the method described in the claim 1.
35. a computer-readable medium carries one or more instruction sequences, when described instruction sequence is carried out by one or more processors, described one or more processor is carried out in the method described in the claim 2.
36. a computer-readable medium carries one or more instruction sequences, when described instruction sequence is carried out by one or more processors, described one or more processor is carried out in the method described in the claim 3.
37. a computer-readable medium carries one or more instruction sequences, when described instruction sequence is carried out by one or more processors, described one or more processor is carried out in the method described in the claim 4.
38. a computer-readable medium carries one or more instruction sequences, when described instruction sequence is carried out by one or more processors, described one or more processor is carried out in the method described in the claim 5.
39. a computer-readable medium carries one or more instruction sequences, when described instruction sequence is carried out by one or more processors, described one or more processor is carried out in the method described in the claim 6.
40. a computer-readable medium carries one or more instruction sequences, when described instruction sequence is carried out by one or more processors, described one or more processor is carried out in the method described in the claim 7.
41. a computer-readable medium carries one or more instruction sequences, when described instruction sequence is carried out by one or more processors, described one or more processor is carried out in the method described in the claim 8.
42. a computer-readable medium carries one or more instruction sequences, when described instruction sequence is carried out by one or more processors, described one or more processor is carried out in the method described in the claim 9.
43. a computer-readable medium carries one or more instruction sequences, when described instruction sequence is carried out by one or more processors, described one or more processor is carried out in the method described in the claim 10.
44. a computer-readable medium carries one or more instruction sequences, when described instruction sequence is carried out by one or more processors, described one or more processor is carried out in the method described in the claim 11.
45. a computer-readable medium carries one or more instruction sequences, when described instruction sequence is carried out by one or more processors, described one or more processor is carried out in the method described in the claim 12.
46. a computer-readable medium carries one or more instruction sequences, when described instruction sequence is carried out by one or more processors, described one or more processor is carried out in the method described in the claim 13.
47. a computer-readable medium carries one or more instruction sequences, when described instruction sequence is carried out by one or more processors, described one or more processor is carried out in the method described in the claim 14.
48. a computer-readable medium carries one or more instruction sequences, when described instruction sequence is carried out by one or more processors, described one or more processor is carried out in the method described in the claim 15.
49. a computer-readable medium carries one or more instruction sequences, when described instruction sequence is carried out by one or more processors, described one or more processor is carried out in the method described in the claim 16.
50. a computer-readable medium carries one or more instruction sequences, when described instruction sequence is carried out by one or more processors, described one or more processor is carried out in the method described in the claim 17.
51. a computer-readable medium carries one or more instruction sequences, when described instruction sequence is carried out by one or more processors, described one or more processor is carried out in the method described in the claim 18.
52. a computer-readable medium carries one or more instruction sequences, when described instruction sequence is carried out by one or more processors, described one or more processor is carried out in the method described in the claim 19.
53. a computer-readable medium carries one or more instruction sequences, when described instruction sequence is carried out by one or more processors, described one or more processor is carried out in the method described in the claim 20.
54. a computer-readable medium carries one or more instruction sequences, when described instruction sequence is carried out by one or more processors, described one or more processor is carried out in the method described in the claim 21.
55. a computer-readable medium carries one or more instruction sequences, when described instruction sequence is carried out by one or more processors, described one or more processor is carried out in the method described in the claim 22.
56. a computer-readable medium carries one or more instruction sequences, when described instruction sequence is carried out by one or more processors, described one or more processor is carried out in the method described in the claim 23.
57. a computer-readable medium carries one or more instruction sequences, when described instruction sequence is carried out by one or more processors, described one or more processor is carried out in the method described in the claim 24.
58. a computer-readable medium carries one or more instruction sequences, when described instruction sequence is carried out by one or more processors, described one or more processor is carried out in the method described in the claim 25.
59. a computer-readable medium carries one or more instruction sequences, when described instruction sequence is carried out by one or more processors, described one or more processor is carried out in the method described in the claim 26.
60. a computer-readable medium carries one or more instruction sequences, when described instruction sequence is carried out by one or more processors, described one or more processor is carried out in the method described in the claim 27.
61. a computer-readable medium carries one or more instruction sequences, when described instruction sequence is carried out by one or more processors, described one or more processor is carried out in the method described in the claim 28.
62. a computer-readable medium carries one or more instruction sequences, when described instruction sequence is carried out by one or more processors, described one or more processor is carried out in the method described in the claim 29.
63. a computer-readable medium carries one or more instruction sequences, when described instruction sequence is carried out by one or more processors, described one or more processor is carried out in the method described in the claim 30.
64. a computer-readable medium carries one or more instruction sequences, when described instruction sequence is carried out by one or more processors, described one or more processor is carried out in the method described in the claim 31.
65. a computer-readable medium carries one or more instruction sequences, when described instruction sequence is carried out by one or more processors, described one or more processor is carried out in the method described in the claim 32.
66. a computer-readable medium carries one or more instruction sequences, when described instruction sequence is carried out by one or more processors, described one or more processor is carried out in the method described in the claim 33.
67. A method for use in a multi-node shared-nothing database system, the method comprising the steps of:
A first node of said multi-node shared-nothing database system initially acting as the exclusive owner of a first data item and a second data item, wherein said first data item and said second data item are data items persistently stored in a database managed by said multi-node shared-nothing database system;
Without changing the location of said first data item on persistent storage and without shutting down said first node, reassigning ownership of said first data item from said first node to a second node of said multi-node shared-nothing database system; and
After the reassignment of ownership, said first node continuing to act as the owner of said second data item and continuing to handle all requests for operations on said second data item.
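As a rough illustration of claim 67, the following sketch keeps ownership metadata separate from the on-disk location of each item, so that reassigning one item changes neither its storage location nor the ownership of any other item. The `SharedDiskCluster` class, its method names, and the `(file, offset)` location format are hypothetical and exist only for this example.

```python
class SharedDiskCluster:
    """Toy model: every node can reach the disk, but each item has one owner."""

    def __init__(self):
        self.disk_location = {}  # item -> (file, offset); reassignment never edits this
        self.owner = {}          # item -> owning node

    def store(self, item, location, owner):
        """Record where an item lives on disk and which node owns it."""
        self.disk_location[item] = location
        self.owner[item] = owner

    def reassign(self, item, new_owner):
        """Change only the ownership metadata; the data stays where it is."""
        self.owner[item] = new_owner

    def handle(self, item, op):
        """All requests for an item are handled by its current owner."""
        return f"{self.owner[item]} performs {op} on {item} at {self.disk_location[item]}"


cluster = SharedDiskCluster()
cluster.store("A", ("file1", 0), "node1")
cluster.store("B", ("file1", 4096), "node1")
cluster.reassign("A", "node2")  # node1 keeps running and still owns B
```

After the reassignment, requests for item A go to the second node while requests for item B are still served by the first node, and both items remain at their original disk locations.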
CNB200480021585XA 2003-08-01 2004-07-28 Method for managing data Active CN100565460C (en)

Applications Claiming Priority (3)

Application Number Priority Date Filing Date Title
US49201903P 2003-08-01 2003-08-01
US60/492,019 2003-08-01
US10/665,062 2003-09-17

Publications (2)

Publication Number Publication Date
CN1829961A true CN1829961A (en) 2006-09-06
CN100565460C CN100565460C (en) 2009-12-02

Family

ID=36947551

Family Applications (4)

Application Number Title Priority Date Filing Date
CNB2004800215879A Active CN100429622C (en) 2003-08-01 2004-07-28 Dynamic reassignment of data ownership
CN2004800217520A Active CN1829974B (en) 2003-08-01 2004-07-28 Parallel recovery by non-failed nodes
CNB200480021585XA Active CN100565460C (en) 2003-08-01 2004-07-28 Method for managing data
CNB2004800219070A Active CN100449539C (en) 2003-08-01 2004-07-28 Ownership reassignment in a shared-nothing database system

Family Applications Before (2)

Application Number Title Priority Date Filing Date
CNB2004800215879A Active CN100429622C (en) 2003-08-01 2004-07-28 Dynamic reassignment of data ownership
CN2004800217520A Active CN1829974B (en) 2003-08-01 2004-07-28 Parallel recovery by non-failed nodes

Family Applications After (1)

Application Number Title Priority Date Filing Date
CNB2004800219070A Active CN100449539C (en) 2003-08-01 2004-07-28 Ownership reassignment in a shared-nothing database system

Country Status (1)

Country Link
CN (4) CN100429622C (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103098041A (en) * 2010-03-31 2013-05-08 EMC Corporation Apparatus and method for query prioritization in a shared nothing distributed database
CN106462447A (en) * 2014-04-11 2017-02-22 NetApp, Inc. Connectivity-aware storage controller load balancing

Families Citing this family (13)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US7979626B2 (en) * 2008-05-13 2011-07-12 Microsoft Corporation Flash recovery employing transaction log
CN102521307A (en) * 2011-12-01 2012-06-27 Beijing Kingbase Information Technologies Inc. Parallel query processing method for shared-nothing database cluster in cloud computing environment
US8799569B2 (en) * 2012-04-17 2014-08-05 International Business Machines Corporation Multiple enhanced catalog sharing (ECS) cache structure for sharing catalogs in a multiprocessor system
CN102968503B (en) * 2012-12-10 2015-10-07 Dawning Information Industry (Beijing) Co., Ltd. Database system and data processing method for a database system
US9367472B2 (en) * 2013-06-10 2016-06-14 Oracle International Corporation Observation of data in persistent memory
CN103399894A (en) * 2013-07-23 2013-11-20 Institute of Information Engineering, Chinese Academy of Sciences Distributed transaction processing method based on a shared storage pool
CN107766001B (en) * 2017-10-18 2021-05-25 Chengdu Sobey Digital Technology Co., Ltd. Storage quota method based on user group
CN108924184B (en) * 2018-05-31 2022-02-25 Advanced New Technologies Co., Ltd. Data processing method and server
CN110895483A (en) * 2018-09-12 2020-03-20 Beijing Qihoo Technology Co., Ltd. Task recovery method and device
US11100086B2 (en) * 2018-09-25 2021-08-24 Wandisco, Inc. Methods, devices and systems for real-time checking of data consistency in a distributed heterogenous storage system
US11874816B2 (en) * 2018-10-23 2024-01-16 Microsoft Technology Licensing, Llc Lock free distributed transaction coordinator for in-memory database participants
CN110134735A (en) * 2019-04-10 2019-08-16 Alibaba Group Holding Limited Storage method and device for distributed transaction log
CN112650561B (en) * 2019-10-11 2023-04-11 Jinzhuan Xinke Co., Ltd. Transaction management method, system, network device and readable storage medium

Family Cites Families (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
IL99923A0 (en) * 1991-10-31 1992-08-18 Ibm Israel Method of operating a computer in a network
US5625811A (en) * 1994-10-31 1997-04-29 International Business Machines Corporation Method and system for database load balancing
US5696898A (en) * 1995-06-06 1997-12-09 Lucent Technologies Inc. System and method for database access control
CA2176775C (en) * 1995-06-06 1999-08-03 Brenda Sue Baker System and method for database access administration
US5903898A (en) * 1996-06-04 1999-05-11 Oracle Corporation Method and apparatus for user selectable logging
US5907849A (en) * 1997-05-29 1999-05-25 International Business Machines Corporation Method and system for recovery in a partitioned shared nothing database system using virtual share disks
US6493726B1 (en) * 1998-12-29 2002-12-10 Oracle Corporation Performing 2-phase commit with delayed forget
US20010047377A1 (en) * 2000-02-04 2001-11-29 Sincaglia Nicholas William System for distributed media network and meta data server
EP2562662B1 (en) * 2001-06-28 2019-08-21 Oracle International Corporation Partitioning ownership of a database among different database servers to control access to the database

Also Published As

Publication number Publication date
CN1829974A (en) 2006-09-06
CN100449539C (en) 2009-01-07
CN100429622C (en) 2008-10-29
CN1829974B (en) 2010-06-23
CN1829962A (en) 2006-09-06
CN1829988A (en) 2006-09-06
CN100565460C (en) 2009-12-02

Similar Documents

Publication Publication Date Title
JP4557975B2 (en) Reassign ownership in a non-shared database system
US11755481B2 (en) Universal cache management system
CN1262942C (en) Method, equipment and system for obtaining global promotion tool using dataless transaction
Levandoski et al. High performance transactions in deuteronomy
US7277897B2 (en) Dynamic reassignment of data ownership
AU2004262370B2 (en) Parallel recovery by non-failed nodes
CN1272714C (en) Method, equipment and system for distribution and access storage imaging tool in data processing system
CN1829961A (en) Ownership reassignment in a shared-nothing database system
US20060101081A1 (en) Distributed Database System Providing Data and Space Management Methodology
AU2001271680B2 (en) Partitioning ownership of a database among different database servers to control access to the database
JPH05210637A (en) Method of simultaneously controlling access
CN1653451A (en) Providing a useable version of the data item
US11880318B2 (en) Local page writes via pre-staging buffers for resilient buffer pool extensions
Morin et al. An efficient and scalable approach for implementing fault-tolerant DSM architectures
JP2005502957A5 (en)
CN1258152C (en) Method, device and system for managing release promoting bit
Cai et al. The design and evaluation of a buffer algorithm for database machines
AU2004262380A1 (en) Dynamic reassignment of data ownership

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
CI02 Correction of invention patent application

Correction item: Priority

Correct: 2003.09.17 US 10/665,062

False: Second priority claim missing

Number: 36

Page: The title page

Volume: 22

COR Change of bibliographic data

Free format text: CORRECT: PRIORITY; FROM: MISSING THE SECOND ARTICLE OF PRIORITY TO: 2003.9.17 US 10/665,062

C14 Grant of patent or utility model
GR01 Patent grant