WO2010037794A2 - Monitoring mechanism for a distributed database - Google Patents
Monitoring mechanism for a distributed database
- Publication number
- WO2010037794A2 (PCT/EP2009/062714 / EP2009062714W)
- Authority
- WO
- WIPO (PCT)
- Prior art keywords
- replica
- node
- partition
- master
- nodes
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Ceased
Classifications
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/20—Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
- G06F16/27—Replication, distribution or synchronisation of data between databases or within a distributed database system; Distributed database system architectures therefor
- G06F16/278—Data partitioning, e.g. horizontal or vertical partitioning
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/20—Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
- G06F16/27—Replication, distribution or synchronisation of data between databases or within a distributed database system; Distributed database system architectures therefor
- G06F16/275—Synchronous replication
Definitions
- the present invention relates to distributed databases and, more particularly, to a geographically distributed database usable as a common database for Telecom networks. More specifically, the invention pertains to an enhanced distributed database system as well as a method of handling such distributed database system.
- the present invention addresses the issue of having a common centralized database for a diversity of applications in the field of Telecom networks.
- Most Telecom networks supporting a variety of different generations of telecommunication systems, being wireline or wireless systems, conventionally make use of one or more common centralized databases to store subscription and subscriber data as well as service data for a variety of applications residing in the Telecommunication network or in a 3rd party service network but accessible to subscribers of said Telecom network.
- As Telecom networks grow, newer generations of telecommunication systems turn up and the existing common centralized databases are not always adaptable to, or suitable to fit, the needs of all telecommunication systems in a Telecom network. Nevertheless, Telecom networks share quite similar requirements to be fulfilled by any particular database system usable therein.
- a conventional and centralized Telecom database is generally required to support, at least, the following characteristics: resiliency and high availability; consistency; high performance and low latency; high capacity; scalability; geographical redundancy; flexible deployment and data model; single point of access (one in each geographical location); and no single point of failure.
- the longer the signalling path is between a database client and the database itself, the greater the risk of encountering a node unable to further submit the signalling towards its final destination.
- the longer this signalling path is, the more load the Telecom network supports and the longer the execution times become.
- the Telecom networks are nowadays spread over different and distant territories, often interconnected through different access networks belonging to other network operators; in this scenario, the longer this signalling path is, the more network operators may be involved and the more costs are derived therefrom.
- a distributed database may be regarded as a plurality of databases physically or logically distributed, likely under control of a central database management system, and wherein storage devices are not all necessarily attached to a common processing unit.
- the distributed database might be built up with multiple computers located in the same physical location, or may be dispersed over a network of interconnected computers.
- the present invention is aimed to at least minimize the above drawbacks and provides for an enhanced distributed database system with a plurality of nodes, wherein each node is arranged for storing a replica of at least one partition of data, and a method of handling said distributed database system.
- This method comprises the steps of: partitioning data to be stored into a number 'p' of partitions; replicating each partition into a number 'r' of replicas; for each partition, distributing the number 'r' of replicas amongst a corresponding number 'r' of nodes selected from the plurality of nodes; configuring each node with a list of identifiers of other nodes usable to address each other; activating more than one node amongst the plurality of nodes; monitoring at each active node at least one event selected from: latest updating of each replica, replica status, status of local resources in charge of each replica, and connectivity status of each replica; upon activation or deactivation of a node amongst the plurality of nodes, determining which node amongst the active nodes is considered current master node for each partition and in charge of current master replica for said partition; and, for any request received in a node to read/write data in the distributed database system, determining the partition said data belongs to and the current master node in charge of the current master replica for said partition, and routing said request towards said current master node.
- the method may include in at least one node amongst the active nodes: a step of collecting from each active node information about the at least one event selected from: latest updating of each replica, replica status, status of local resources in charge of each replica and connectivity status of each replica; a step of applying pre-configured rules for prioritizing each replica in the active nodes depending on the collected events for each replica; and a step of selecting a replica in a specific node for each partition with a highest replica priority, this replica to be considered the master replica and this specific node to be considered the master node for the partition.
- the present invention provides for two main embodiments, namely modes of operation, in determining which node amongst the active nodes is considered master node for each partition and in charge of current master replica for the partition.
- each node in the distributed database system may be arranged to carry out the steps of: collecting from each active node information about the at least one event selected from: latest updating of each replica, replica status, status of local resources in charge of each replica and connectivity status of each replica; applying pre-configured rules for prioritizing each replica in the active nodes, depending on the collected events for each replica; selecting a replica in a specific node for each partition with a highest replica priority, this replica to be considered the master replica and this specific node to be considered the master node for the partition.
- the method may further comprise a step of determining the order in which the active nodes have been activated. Where this is the case, all the nodes are arranged for determining the order in which the active nodes were activated, so that the active node firstly activated is considered to be a so-called System Master Monitor node in charge of carrying out the steps of: collecting from each active node information about the at least one event selected from: latest updating of each replica, replica status, status of local resources in charge of each replica and connectivity status of each replica; applying pre-configured rules for prioritizing each replica in the active nodes, depending on the collected events for each replica; selecting a replica in a specific node for each partition with a highest replica priority, this replica to be considered the master replica and this specific node to be considered the master node for the partition; and informing the other active nodes about the master replica selected for each partition and the master node holding said master replica.
- the step in this method of distributing the number 'r' of replicas amongst a corresponding number 'r' of nodes for each partition may include a step of configuring each replica with a default replica priority to be applied where other criteria produce the same replica priorities.
- the step of determining which node amongst the active nodes is considered master node for each partition, and in charge of current master replica for said partition, may include in the at least one active node the steps of: collecting from at least one active node information of being configured with a given default replica priority for a partition; and selecting a replica in a specific node for each partition with a highest default replica priority, this replica to be considered the master replica and this specific node to be considered the master node for the partition. In particular, where the method operates in accordance with the above first mode of operation, not only the at least one node, but rather each node in the distributed database system may be arranged to carry out these steps.
- the active node firstly activated is considered a System Master Monitor and the step of determining which node amongst the active nodes is considered master node for each partition, and in charge of current master replica for said partition, may include in said System Master Monitor node the steps of: collecting from at least one active node information of being configured with a given default replica priority for a partition; selecting a replica in a specific node for each partition with a highest default replica priority, this replica to be considered the master replica and this specific node to be considered the master node for the partition; and informing the other active nodes about the master replica selected for each partition and the master node holding said master replica.
- the method may further comprise a step of copying for each partition at each active node the contents of the current master replica from the current master node in charge of the current master replica for said partition. Where this step of copying takes place, the method may further comprise a step of marking for each replica copied at each active node at least one of: the latest updating made, replica status, status of local resources in charge of the replica and connectivity status of the replica. This is especially advantageous in order to select another master replica in the future if the current master node in charge of said master replica fails, goes down, becomes unavailable, or becomes an inactive node.
- an enhanced distributed database system with a plurality of nodes, wherein each node is arranged for storing a replica of at least one partition of data.
- each node includes: a data storage for storing a replica of at least one data partition of data to be stored and for storing identifiers of other nodes usable to address each other; an input/output unit for communicating with other nodes of the distributed database system and with clients requesting read/write operations in the distributed database system; a monitoring unit for monitoring at least one event selected from: latest updating of each replica, replica status, status of local resources in charge of each replica, and connectivity status of each replica; and a processing unit, in cooperation with the data storage, the monitoring unit and the input/output unit, for determining which node amongst active nodes of the distributed database system is considered current master node for each partition and in charge of current master replica for said partition, and for determining, for any request received to read/write data, the partition said data belongs to and the current master node in charge of the current master replica for said partition, so that said request can be routed towards said current master node.
- the processing unit, the monitoring unit, the data storage and the input/output unit of each node may be arranged for: collecting, from each active node of the distributed database system, information about the at least one event selected from: latest updating of each replica, replica status, status of local resources in charge of each replica and connectivity status of each replica; applying pre-configured rules for prioritizing each replica in the active nodes depending on the collected events for each replica; and selecting a replica in a specific node for each partition with a highest replica priority, this replica to be considered the master replica and this specific node to be considered the master node for the partition.
- the data storage of each node may be arranged for storing an indicator configured per replica basis to indicate a default replica priority to be applied where other criteria produce the same replica priorities.
- the processing unit, the monitoring unit, the data storage and the input/output unit of each node may further be arranged for: collecting from at least one active node information of being configured with a given default replica priority for a partition; and selecting a replica in a specific node for each partition with a highest default replica priority, this replica to be considered the master replica and this specific node to be considered the master node for the partition.
- the processing unit, the monitoring unit, the data storage and the input/output unit of each node may be arranged for collecting, from each active node of the distributed database system, information to determine the order in which the active nodes have been activated.
- the active node firstly activated may be considered a System Master Monitor
- the processing unit, the monitoring unit, the data storage and the input/output unit of the System Master Monitor may be arranged for: collecting from each active node information about the at least one event selected from: latest updating of each replica, replica status, status of local resources in charge of each replica and connectivity status of each replica; applying pre-configured rules for prioritizing each replica in the active nodes depending on the collected events for each replica; selecting a replica in a specific node for each partition with a highest replica priority, this replica to be considered the master replica and this specific node to be considered the master node for the partition; and informing the other active nodes about the master replica selected for each partition and the master node holding said master replica.
- the processing unit, the monitoring unit, the data storage and the input/output unit of the System Master Monitor may further be arranged for: collecting from at least one active node information of being configured with a given default replica priority for a partition; selecting a replica in a specific node for each partition with a highest default replica priority, this replica to be considered the master replica and this specific node to be considered the master node for the partition; and informing the other active nodes about the master replica selected for each partition and the master node holding said master replica.
- the processing unit, the monitoring unit, the data storage and the input/output unit of each node are further arranged for copying for each partition at each active node the contents of the current master replica from the current master node in charge of the current master replica for said partition.
- the processing unit, the monitoring unit, the data storage and the input/output unit of each node may further be arranged for marking for each replica copied at each active node the latest updating made, replica status, status of local resources in charge of the replica and connectivity status of the replica.
- the invention may be practised by a computer program, in accordance with a third aspect of the invention, the computer program being loadable into an internal memory of a computer with input and output units as well as with a processing unit, and comprising executable code adapted to carry out the above method steps.
- this executable code may be recorded in a carrier readable in the computer.
- FIG. 1A illustrates a simplified view of the sequence of actions to be performed for partitioning data to be stored into a number 'p' of partitions and replicating each partition into a number 'r' of replicas.
- FIG. 1B illustrates a simplified view of the sequence of actions to be performed for distributing for each partition the number 'r' of replicas amongst a corresponding number 'r' of nodes selected from the plurality of nodes and for configuring each node with a list of identifiers of other nodes usable to address each other.
- FIG. 1C illustrates a simplified view of the sequence of actions to be performed, as a continuation of the actions illustrated in FIG. 1A and FIG. 1B, for carrying out a method of handling a distributed database system with a plurality of nodes, wherein each node is arranged for storing a replica of at least one partition of data.
- FIG. 2 shows a simplified view of an exemplary configuration of a plurality of nodes in the distributed database with useful data to describe some embodiments of the invention.
- FIG. 3 illustrates an exemplary implementation of a node amongst the plurality of nodes included in the distributed database.
- FIG. 4 shows a simplified view of an exemplary configuration of a node amongst the plurality of nodes in the distributed database with useful data arranged as clusters to describe some embodiments of the invention.
- FIG. 5 illustrates a simplified view of the sequence of actions to be performed for routing any request received in a node to read/write data in the distributed database towards the current master node in charge of the current master replica for the partition which such data belongs to.
- FIG. 6 illustrates an exemplary state machine provided for in accordance with an embodiment of the invention to determine which node amongst the active nodes is considered a Controller System Monitor in charge of coordinating the other nodes and deciding which the master node is for each replica.
- FIG. 7 illustrates an exemplary sequence of actions to be carried out with support of the state machine shown in Fig. 6 in order to determine which node amongst the active nodes is considered a Controller System Monitor upon activation of a number of nodes in the distributed database system.
- FIG. 8 shows still another exemplary sequence of actions to be carried out with support of the state machine shown in Fig. 6 in order to determine which node amongst the active nodes is considered a Controller System Monitor upon inactivity of a previously considered Controller System Monitor.
- a Telecom database system may include several geographically distributed nodes, wherein each node may include several data storage units and wherein each data storage unit in each node may allocate a particular replica of a subset of the data, namely a replica of a partition.
- the distribution of a data set 10 amongst data storage units of nodes 1-4 may be carried out by following a number of steps provided for in accordance with the invention.
- the data set 10 is partitioned during a step S-005 into a number of partitions 11-14, each partition comprising a particular subset of the data set 10. Then, for each partition a number of replicas are generated during a step S-010.
- the number of replicas for each partition is not required to be the same for all the partitions.
- four replicas 111-114 are generated for the partition 11, whereas three replicas 121-123 and 141-143 are respectively generated for partitions 12 and 14, and only two replicas 131-132 are generated for the partition 13.
- these replicas may be grouped per partition basis during a preliminary step S-015 of determining the required geographical distribution.
- each node is assigned during a step S-017 an identifier usable for addressing purposes.
- the exemplary illustration in Fig. 1B shows a distributed database system consisting of four nodes 1-4 with respective identifiers N-1 ID, N-2 ID, N-3 ID and N-4 ID.
- the replicas generated for each partition may be distributed amongst the nodes that the database system consists of.
- the first replica 111 of the first partition is stored in node 1
- the second replica 112 of the first partition is stored in node 2
- the third replica 113 of the first partition is stored in node 3
- the fourth replica 114 of the first partition is stored in node 4
- the first replica 121 of the second partition is stored in node 3
- the second replica 122 of the second partition is stored in node 1
- the third replica 123 of the second partition is stored in node 2
- the first replica 131 of the third partition is stored in node 4
- the second replica 132 of the third partition is stored in node 1
- the first replica 141 of the fourth partition is stored in node 3
- the second replica 142 of the fourth partition is stored in node 4
- the third replica 143 of the fourth partition is stored in node 2.
- each node may also be configured during this step with identifiers of the other nodes.
- the node 1 stores identifiers 151 identifying the nodes 2, 3, 4;
- the node 2 stores identifiers 152 identifying the nodes 1, 3, 4;
- the node 3 stores identifiers 153 identifying the nodes 1, 2, 4;
- the node 4 stores identifiers 154 identifying the nodes 1, 2, 3.
- each node 1-4 respectively includes a replica 1101, 2101, 3101, and 4101 for a number of partitions, and a default priority per replica basis 1102, 2102, 3102, and 4102.
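By way of illustration only, the partitioned and replicated layout described above might be captured in a small data model. The sketch below is not part of the claimed method; the names `Replica` and `NodeConfig` and the example default-priority values are hypothetical assumptions that merely mirror the distribution of Fig. 1B and Fig. 2 (node 2 holding replicas 112, 123 and 143, a list of peer identifiers, and a default priority per replica).

```python
# Hypothetical sketch of the per-node configuration described for Fig. 1B / Fig. 2.
# Names and example values are illustrative assumptions, not the patented data model.
from dataclasses import dataclass, field
from typing import Dict, List

@dataclass
class Replica:
    replica_id: int            # e.g. 112 = second replica of partition 11
    partition_id: int          # partition the replica belongs to
    default_priority: int      # default replica priority (used as a tie-breaker)
    last_update: float = 0.0   # latest updating of the replica contents

@dataclass
class NodeConfig:
    node_id: str                                                  # e.g. "N-2 ID"
    peers: List[str]                                              # identifiers of the other nodes
    replicas: Dict[int, Replica] = field(default_factory=dict)    # keyed by partition id

# Example: node 2 as distributed in Fig. 1B (replicas 112, 123, 143); priorities are made up.
node2 = NodeConfig(
    node_id="N-2 ID",
    peers=["N-1 ID", "N-3 ID", "N-4 ID"],
    replicas={
        11: Replica(112, 11, default_priority=2),
        12: Replica(123, 12, default_priority=3),
        14: Replica(143, 14, default_priority=3),
    },
)
```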
- the distributed database system is ready for entering in operation, node by node or as simultaneously as wanted by the operator.
- Fig. 1C illustrates the subsequent sequence of actions to carry out the method of handling said distributed database system in accordance with another aspect of the invention. Even though all the nodes behave in a similar manner, the complete sequence of actions at each node may depend on the order in which the different nodes are activated, as further described in accordance with embodiments of the invention.
- Fig. 1C exemplary illustrates a scenario where the node 2 is the firstly activated one, during a step S-030, followed by the activation of the node 3 during a step S-035, then the node 4 during a step S-060, and finally the node 1 during a step S-070.
- the activation of each node 2, 3, 4, 1 may be followed by respective steps S-040, S-045, S-080, S-075 of determining at each active node a start time when said node has been activated.
- This optional step is useful to further determine the order in which the active nodes have been activated where the database system is operating with a node acting as a System Master Monitor in charge of coordinating the other nodes and deciding which is the master replica for each partition, as further commented. In this respect, as Fig. 2 illustrates, each node 1-4 includes a respective indication 1104, 2104, 3104, 4104 indicating the working time that the node has been active since the start time.
- the activation of each node 2, 3, 4, 1 is generally followed by respective steps S-050, S-055, S-095, S-090 of monitoring at each active node 2, 3, 4, 1 at least one event selected from: latest updating of each replica, replica status, status of local resources in charge of each replica, and connectivity status of each replica.
- each node 1-4 respectively includes an indicator of the latest updating per replica basis 1103, 2103, 3103, 4103, along with the replica 1101, 2101, 3101, 4101 for a number of partitions, and the default priority per replica basis 1102, 2102, 3102, 4102 already commented above.
- other data may be stored per replica basis at each node 1-4 such as those illustrated in Fig. 4.
- the node 2 respectively includes an indicator 312, 323, 343 of connectivity status per replica basis along with the replica 112, 123, 143 for a number of partitions, and a replica status per replica basis 212, 223, 243 indicating whether the replica is considered the master replica, an active replica being up and running but not being the master, or a stand-by replica on which configuration is taking place so that it cannot be considered up and running.
- each node may have a further storage field per replica, not illustrated in any drawing, to include indicators of status of local resources per replica basis so that the process of supervising the local resources for each replica may be isolated from the step of monitoring such event.
- the above step of monitoring at each active node at least one event of the choice may include a step of supervising the local resources for each replica and a step of determining the status of each local resource.
- each node 1-4 of this distributed database system may be configured with a number of clusters wherein each cluster includes: a replica of a partition, an indicator of connectivity status for the replica, a replica status for the replica, an indicator of the latest updating per replica basis, and a default priority per replica basis.
- as exemplary illustrated in Fig. 4, the node 2 comprises such clusters.
- each node initiates during respective steps S-065, S-085, S-100, S-110 the sending of so-called 'Alive' messages towards all the other nodes known to each node.
- some actions in node 1 occur before corresponding actions in node 4, even though the node 4 was initiated before the node 1. This is generally possible since the processor load in one node may be higher than in another node, thus resulting in slower performance in the former, and it may also be due to different signalling delays through different network paths.
- the node 2 had sent during the step S-065 the so-called 'Alive' messages before the node 1 had been activated.
- the node 2 may be aware that the node 1 had not received the original 'Alive' message and the node 1 does not have the complete information from all the nodes.
- a node like the node 2 in the present exemplary situation carries out a step S-105 of monitoring again the events and a step S-115 of submitting again the 'Alive' messages towards all the nodes known to the node 2.
- the 'Alive' message is received at the node 2 from the node 3 during a step S-110 intermediate between the step S-105 of monitoring the events at the node 2 and the step S-115 of submitting the 'Alive' messages, so that the latest 'Alive' message submitted from the node 2 may be considered up to date with each node being aware of all other active nodes.
- certain events determined in operation are taken into consideration on deciding the master replica for a partition and the master node in charge of said master replica.
- the following information may be taken into account: which replica for each partition has the most recently updated contents with complete information, namely, with the updating level; the replica status of each replica for each partition since, obviously, only up and running replicas are eligible; the connectivity status of each replica for each partition; and the default priority configured for each replica of a partition.
- this default priority may be configured to overrule the results of the previous criteria or may be configured to be just applicable where the previous criteria produce the same results for more than one replica of a partition.
- pre-configured rules may be applied for prioritizing each replica in the active nodes depending on these events determined in operation.
- the pre-configured rules may be such that the connectivity status takes priority over the updating level, or such that the default priority is taken into account immediately after the replica status, or any other criteria on the events for prioritizing replicas.
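As one possible reading of the above criteria, the pre-configured rules could be expressed as an ordered comparison: only up-and-running replicas are eligible, then connectivity, then updating level, with the configured default priority breaking ties. The sketch below is purely illustrative; the rule order and the field names are assumptions, since the description allows any ordering of these criteria.

```python
# Illustrative sketch of one possible set of pre-configured prioritization rules.
# The assumed order (connectivity over updating level, default priority as tie-break)
# is only one of the orderings the description allows.
from typing import Dict, Iterable

def select_master(collected: Iterable[Dict]) -> Dict:
    """Select, for one partition, the eligible replica with the highest priority."""
    eligible = [i for i in collected if i["status"] == "up"]   # only up-and-running replicas
    return max(eligible, key=lambda i: (i["connectivity"],
                                        i["update_level"],
                                        i["default_priority"]))

# Hypothetical collected events for one partition:
events = [
    {"node": "N-1 ID", "status": "up", "connectivity": 1, "update_level": 10, "default_priority": 1},
    {"node": "N-3 ID", "status": "up", "connectivity": 1, "update_level": 12, "default_priority": 2},
]
master_node = select_master(events)["node"]   # -> "N-3 ID" under these assumed rules
```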
- regarding the contents of each replica for a partition as well as the updating of such contents, conventional routines may be provided, whereby each replica updates contents at certain times from the master replica.
- not all the replicas for a partition update contents at the same time, and not all the replicas for the partition progress with the updating at the same ratio.
- each node in the distributed database system has to monitor how far the updating has progressed so that the information exchanged with the other nodes in this respect takes due account of whether the replica contents can be considered complete information or not, and the point in time when the updating was carried out.
- each node 1-4 may be configured with a virtual IP address of the other nodes, or make use of the respective node identifiers 151-154 to identify and address each other, and each node 1-4 may periodically send an 'Alive' message to the other nodes, for example, after expiry of successive delay times.
- the 'Alive' messages may be sent with the known TCP protocol instead of making use of UDP for some kind of heartbeat, so as to avoid the possibility of unidirectional link failures, which are easily detected with TCP.
- each 'Alive' message includes a node identifier N-1 ID, N-2 ID, N-3 ID, N-4 ID identifying the sender node to the receiver nodes and may include, for each replica of a partition, at least one of: an identifier of the partition that the replica belongs to, the replica status, updating level, updating time, connectivity status, and default priority.
- each 'Alive' message sent from any node 1-4 towards the other nodes in the distributed database system may respectively include an indication of the working time 1104, 2104, 3104, 4104 that the sender node has been active since its start time.
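The content of such an 'Alive' message, as enumerated above, could be serialized along the following lines. This is a hedged sketch: the JSON encoding, the field names and the TCP helper are assumptions; the description only requires that the listed items be carried, preferably over TCP.

```python
# Hypothetical encoding of an 'Alive' message; field names and JSON framing are assumptions.
import json
import socket
import time

def build_alive_message(node_id: str, start_time: float, replicas: list) -> bytes:
    """Pack the sender identity, its working time and per-replica monitoring data."""
    payload = {
        "node_id": node_id,                        # e.g. "N-2 ID"
        "working_time": time.time() - start_time,  # time the sender has been active
        "replicas": [
            {
                "partition_id": r["partition_id"],
                "replica_status": r["replica_status"],   # master / active / stand-by
                "update_level": r["update_level"],
                "update_time": r["update_time"],
                "connectivity": r["connectivity"],
                "default_priority": r["default_priority"],
            }
            for r in replicas
        ],
    }
    return json.dumps(payload).encode("utf-8")

def send_alive(peer_host: str, peer_port: int, message: bytes) -> None:
    """Send one 'Alive' message over TCP, so unidirectional link failures surface as errors."""
    with socket.create_connection((peer_host, peer_port), timeout=5.0) as conn:
        conn.sendall(message)
```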
- the present invention provides for two main embodiments, namely modes of operation, in determining which node amongst the active nodes is considered master node for each partition and in charge of current master replica for the partition.
- each node receives 'Alive' messages from all other nodes with monitoring information determined therein, so that all nodes can process the same information and arrive at determining, for each partition, the same master node in charge of the current master replica.
- each node in the distributed database system may be arranged to carry out the steps of: collecting from each active node information about the at least one event selected from: latest updating of each replica, replica status, status of local resources in charge of each replica and connectivity status of each replica; applying pre-configured rules for prioritizing each replica in the active nodes, depending on the collected events for each replica; selecting a replica in a specific node for each partition with a highest replica priority, this replica to be considered the master replica and this specific node to be considered the master node for the partition.
- each node in the distributed database system may be arranged to further carry out the steps of: collecting from at least one active node information of being configured with a given default replica priority for a partition; and selecting a replica in a specific node for each partition with a highest default replica priority, this replica to be considered the master replica and this specific node to be considered the master node for the partition.
- all the nodes process the 'Alive' messages to determine the order in which the active nodes were activated so that the active node firstly activated is considered to be a so-called System Master Monitor node in charge of carrying out the steps of: collecting from each active node information about the at least one event selected from: latest updating of each replica, replica status, status of local resources in charge of each replica and connectivity status of each replica; applying pre- configured rules for prioritizing each replica in the active nodes, depending on the collected events for each replica; selecting a replica in a specific node for each partition with a highest replica priority, this replica to be considered the master replica and this specific node to be considered the master node for the partition; and informing the other active nodes about the master replica selected for each partition and the master node holding said master replica.
- the System Master Monitor node in the distributed database system may be arranged to further carry out the steps of: collecting from at least one active node information of being configured with a given default replica priority for a partition; selecting a replica in a specific node for each partition with a highest default replica priority, this replica to be considered the master replica and this specific node to be considered the master node for the partition; and informing the other active nodes about the master replica selected for each partition and the master node holding said master replica.
- the present invention generally provides for a distributed database system with a plurality of nodes, such as nodes 1-4, wherein each node is arranged for storing a replica of at least one partition of data.
- each node comprises a data storage 15 for storing a replica 2101 of at least one data partition 112, 123, 143 of data to be stored, and for storing identifiers 152 of other nodes usable to address each other; an input/output unit 30 for communicating with other nodes 1, 3, 4 of the distributed database system; a monitoring unit 60 for monitoring at least one event selected from: latest updating 2103 of each replica, replica status 212, 223, 243, status of local resources in charge of each replica, and connectivity status 312, 323, 343 of each replica; and a processing unit 20, acting in cooperation with the data storage, the monitoring unit and the input/output unit, for determining which node amongst active nodes of the distributed database system is considered current master node 2105 for each partition and in charge of current master replica for said partition.
- the processing unit 20, the monitoring unit 60, the data storage 15 and the input/output unit 30 of each node are arranged for: collecting, from each active node of the distributed database system, information about the at least one event selected from: latest updating of each replica 1103, 2103, 3103, 4103, replica status 212, 223, 243, status of local resources in charge of each replica and connectivity status 312, 323, 343 of each replica; applying the pre- configured rules commented above for prioritizing each replica in the active nodes depending on the collected events for each replica; and selecting a replica in a specific node for each partition with a highest replica priority, this replica to be considered the master replica and this specific node to be considered the master node 2105, 4105 for the partition.
- the processing unit 20, the monitoring unit 60, the data storage 15 and the input/output unit 30 of each node are arranged for: collecting from at least one active node information of being configured with a given default replica priority for a partition; and selecting a replica in a specific node for each partition with a highest default replica priority, this replica to be considered the master replica and this specific node to be considered the master node for the partition.
- the processing unit 20, the monitoring unit 60, the data storage 15 and the input/output unit 30 of each node are arranged for collecting, from each active node of the distributed database system, information 1104, 2104, 3104, 4104 to determine the order in which the active nodes have been activated, so that the active node firstly activated is considered a System Master Monitor.
- the processing unit 20, the monitoring unit 60, the data storage 15 and the input/output unit 30 of said System Master Monitor may further be arranged for informing the other active nodes about the master replica selected for each partition and the master node holding said master replica.
- the above method of handling the distributed database system may require a particular discussion on how clients of the distributed database system may access the information stored therein and, more particularly, for the purpose of read or write operations.
- clients of the distributed database system such as a Home Location Register, an Authentication Centre, or a Home Subscriber Server of a Telecom network might be, can access any data from any node of the distributed database system.
- in order to preserve data consistency amongst the different replicas, only one of the replicas would receive a request for reading and/or writing, and this would be the master replica.
- the data in the master replica is from time to time updated into the other replicas.
- each node may include one or more access gateways (hereinafter AG), wherein the AG is an entity in charge of forwarding the requests to read/write data towards the node where the master replica is located; and, since the database protocol may be different from the accessing protocol, the AG is in charge of receiving the request from a client to read/write data and accessing the other nodes in the distributed database system.
- LB: Load Balancer
- the node 2 may be provided with an LB 190a and three AG's 191a-193a, and be in charge of a replica 112 for a first partition 11, a replica 123 for a second partition 12, and a replica 143 for a fourth partition 14, wherein the replica 112 is the master replica for the first partition and the node 2 is the master node in charge of said master replica for the first partition; whereas the node 3 may be provided with an LB 190b and three AG's 191b-193b, and be in charge of a replica 113 for a first partition 11, a replica 121 for a second partition 12, and a replica 141 for a fourth partition 14, wherein the replica 121 is the master replica for the second partition and the node 3 is the master node in charge of said master replica for the second partition.
- this method of handling the distributed database system includes, for any request received in a node to read/write data in the distributed database system, a step of determining the partition that said data belongs to and the current master node in charge of the current master replica for said partition, and a step of routing said request to said current master node.
- requests from a client 5 to read/write data in the distributed database system may be received at any node, such as the node 2.
- Fig. 5 exemplary illustrates a request to read/write data received at the LB 190a during a step S-150. This request may be assigned during a step S-151 to AG 193a, which determines that the data to read/write belongs to the second partition 12, and this AG also determines that the current master node in charge of said partition is the node 3. Then, the AG 193a routes the request towards the node 3 during a step S-152.
- This request may be received in the LB 190b of the node 3, if more than one AG exists in said node 3, or may be received in a unique AG if the node 3 only includes one AG, or may be received in a particular AG 191b if such AG of the node 3 is known to the AG 193a by configuration means, as exemplary depicted in this Fig. 5.
- the AG 191b of the node 3 receiving the request accesses during a step S-152 the master replica 121 for the second partition 12 to read/write data therein in accordance with the request.
- Fig. 5 also illustrates the case where the request is received in a master node holding the master replica for the partition which the data belongs to.
- a request to read/write data is received at the LB 190a of the node 2 during a step S-160.
- This request may be assigned during a step S-161 to AG 191a, which determines that the data to read/write belongs to the first partition 11, and this AG also determines that the current master node in charge of said partition is this node 2. Then, the AG 191a internally routes the request to access during a step S-162 the master replica 112 for the first partition 11 to read/write data therein in accordance with the request.
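A possible shape for the AG routing decision illustrated in Fig. 5 is sketched below: the AG maps the keyed data to a partition, looks up the current master node for that partition, and either serves the request locally or forwards it towards the master node. The hash-based partition lookup and the function names are assumptions introduced for illustration only.

```python
# Illustrative AG routing logic for Fig. 5; the hash-based partition lookup and the
# forward/local split are assumptions, not a mandated implementation.
from typing import Callable, Dict

def route_request(key: str,
                  local_node: str,
                  num_partitions: int,
                  master_of: Dict[int, str],
                  read_write_local: Callable[[int, str], object],
                  forward: Callable[[str, int, str], object]):
    """Route a read/write request to the master node of the partition the key belongs to."""
    # a stable hash would be used in practice; hash() is only for illustration
    partition = hash(key) % num_partitions
    master_node = master_of[partition]          # current master node for that partition
    if master_node == local_node:
        # the local replica is the master replica: access it directly (steps S-161/S-162)
        return read_write_local(partition, key)
    # otherwise forward the request towards the master node (steps S-151/S-152)
    return forward(master_node, partition, key)
```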
- the LB of any node of the distributed database system may be constructed with an input/output unit 50 dedicated for communicating with the client 5, as illustrated in Fig. 3, and with resources of the processing unit 20 arranged for selecting the appropriate AG to balance the workload and performance of the node.
- This input/output unit 50 may be an integral part of the input/output unit 30 that each node of the distributed database system comprises, or may be a separate module included in said input/output unit 30.
- the AG of any node of the distributed database system may be constructed with an input/output unit 40 dedicated for communicating with other nodes 1, 3, 4, as illustrated in Fig. 3, and with resources of the processing unit 20 arranged for determining the partition that the data to read/write belongs to, for determining whether the master node in charge of the master replica for said partition is the current node or another node of the distributed database system, and for accessing the data storage 15 to access said data, where the master node in charge of the master replica for said partition is the current node, or for routing the request towards another node determined to be the current master node for the partition, otherwise.
- the monitoring unit 60 may include a unique unit to monitor and compile the above events, a so-called Local System Monitor (hereinafter LSM) in the instant specification, or might include an active LSM and a stand-by LSM so that the latter could take over operations in case the former fails.
- any reference to the monitoring unit or to the LSM is to be construed as meaning the active LSM in the node under reference.
- the monitoring unit 60 of the so-called System Master Monitor node is considered to be a Controller System Monitor (hereinafter CSM) whereas each monitoring unit of the other nodes in the distributed database system is still referred to as LSM.
- the CSM may thus decide which the master replica is, by taking into account the events information received from the LSM of each node in the distributed database system, and by applying the pre-configured rules for prioritizing each replica in the active nodes.
- the CSM communicates to each LSM in the other nodes what the master replica for each partition is.
- the exemplary state-machine illustrated in Fig. 6 may be applied, wherein transitions between states are due either to the reception of an 'Alive' message from other node or the expiration of a timer.
- rather than referring to the System Master Monitor node versus the other nodes of the distributed database system, we refer in this discussion to their respective monitoring units 60, namely, to the CSM of the System Master Monitor node versus each LSM of said other nodes.
- each LSM is listening to the 'Alive' messages with information about the other nodes and, particularly, listening to 'Alive' messages from the CSM, if any, with the master replicas.
- Each LSM is also sending information about its own node, distributing information about the master replicas internally in its own node to any local AG and, optionally, forwarding the information to the rest of the nodes in the 'Alive' messages.
- a node reaching this state decides which replica is the master replica for every partition in the distributed database system.
- any LSM starts in inactive state and, as soon as the node owning the LSM sends 'Alive' messages to all the other nodes in the distributed database system, this LSM goes to active state. That is, the transition ST-1 from the inactive state to active state is the sending of 'Alive' messages from the node owning the LSM in inactive state towards all the other nodes in the distributed database system.
- each LSM detects that one or more nodes are down if, after a configurable period of time, a so-called INACTIVE_TIME in the instant specification, it receives no 'Alive' messages from said one or more nodes. The transition to inactive makes any LSM reset its execution time, thus becoming "young" again. That prevents a previous CSM that was isolated from becoming CSM again and sending deprecated information.
- This optional embodiment may be carried out by appropriately configuring the distributed database system.
- a suitable configuration parameter can be reset so that the account of the number of nodes isolated from the system is not relevant for letting the nodes go to the active state, thereby letting the system operate with more than one sub-network even if isolated from each other.
- any node with the LSM in the active state can go back to inactive state if not enough 'Alive' messages from other nodes are received, namely, less than (n+1)/2 nodes up including the CSM, or less than n/2+1 not including the CSM.
- where 'Alive' messages have been received from all the other nodes, the LSM in the receiver node becomes CSM. That is, the transition ST-2.1 from the active state to CSM state is the reception of enough 'Alive' messages from the other nodes, with no more nodes remaining to send messages in the distributed database system. Particularly in a two-node system, the DELAY_TIME may have elapsed with no messages received, and the node becomes CSM as well.
- From the Potential CSM state, a node can go back to the inactive state, for the same reasons as commented for previous transitions. Otherwise, where a still expected 'Alive' message is received from a node indicating its LSM is a consolidated CSM, or where a still expected 'Alive' message is received from a node showing that another LSM started earlier, the transition ST-3.1 illustrated in Fig. 6 takes place and the LSM in Potential CSM state goes to active state in the node receiving any of these expected 'Alive' messages.
- there may be two types of timers: the so-called DELAY_TIME timer in the instant specification, which is the time an LSM waits for 'Alive' messages from the rest of the nodes informing about an older LSM before proclaiming itself as CSM; and the so-called INACTIVE_TIME timer in the instant specification, which is the time an LSM waits for 'Alive' messages from a node before concluding that such node is down and unavailable.
- the node with a monitoring unit 60 consolidated as CSM starts sending 'Alive' messages including the list of master replicas.
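Read together with Fig. 6, the LSM states and the transitions driven by 'Alive' messages and timers could be modelled roughly as below. The state names follow the description; the transition conditions are simplified assumptions (for instance, the quorum test is reduced to counting known active nodes), so this is a sketch rather than the claimed state machine.

```python
# Rough sketch of the Fig. 6 state machine; quorum and timer handling are simplified assumptions.
from enum import Enum, auto

class LsmState(Enum):
    INACTIVE = auto()       # not yet announcing itself
    ACTIVE = auto()         # exchanging 'Alive' messages with the other nodes
    POTENTIAL_CSM = auto()  # oldest LSM seen so far, awaiting confirmation within DELAY_TIME
    CSM = auto()            # consolidated Controller System Monitor

class LsmStateMachine:
    def __init__(self, total_nodes: int):
        self.state = LsmState.INACTIVE
        self.total_nodes = total_nodes
        self.known_active = set()
        self.older_seen = False     # an older LSM or a consolidated CSM has been heard

    def on_alive_sent(self) -> None:
        # ST-1: sending 'Alive' messages to all other nodes moves the LSM to ACTIVE
        if self.state is LsmState.INACTIVE:
            self.state = LsmState.ACTIVE

    def on_alive_received(self, sender: str, sender_is_older: bool, sender_is_csm: bool) -> None:
        self.known_active.add(sender)
        if sender_is_older or sender_is_csm:
            self.older_seen = True
        if self.older_seen:
            # ST-3.1: another LSM started earlier (or a CSM exists): remain a plain active node
            self.state = LsmState.ACTIVE
        elif len(self.known_active) == self.total_nodes - 1:
            # ST-2.1: 'Alive' messages received from all other nodes and this LSM is the oldest
            self.state = LsmState.CSM
        else:
            self.state = LsmState.POTENTIAL_CSM   # oldest so far, but some nodes still expected

    def on_delay_time_expired(self) -> None:
        # a Potential CSM that heard nothing better within DELAY_TIME consolidates as CSM
        if self.state is LsmState.POTENTIAL_CSM:
            self.state = LsmState.CSM

    def on_inactive_time_expired(self, remaining_active: int) -> None:
        # without enough active nodes the LSM falls back to inactive and "resets its age"
        if remaining_active < (self.total_nodes + 1) // 2:
            self.state = LsmState.INACTIVE
            self.known_active.clear()
            self.older_seen = False
```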
- Fig. 7 shows an exemplary sequence of actions amongst nodes of a distributed database system to determine which node amongst 3 exemplary nodes is the node with a monitoring unit 60 consolidated as CSM.
- the CSM is the LSM that is activated first.
- each LSM at start up sends 'Alive' messages with the working time 2104, 4104, as exemplary shown in Fig. 2, to the rest of the nodes.
- the nodes that are part of the system are known by configuration.
- After sending the 'Alive' messages, the LSM waits for a certain period of time, DELAY_TIME, to receive 'Alive' messages from other nodes.
- the phase of establishing the CSM finishes when that time is gone, or before that, if 'Alive' messages from sufficient other nodes are received. During that phase, a potential CSM can be assigned, according to the information received until that time, but it won't be confirmed till this phase is over.
- the first node to be activated is node 1 during a step S-200.
- when the LSM starts, namely, the monitoring unit 60 of the node 1, it sends 'Alive' messages during the step S-205 to the other two nodes 2 and 3 of the distributed database system, and also starts a timer of DELAY_TIME seconds. Since the other nodes are not running yet, each corresponding LSM does not receive such 'Alive' messages.
- when the LSM of node 2 starts running during a step S-210, it sends 'Alive' messages to nodes 1 and 3 during a step S-220 and starts its own timer of DELAY_TIME seconds.
- the LSM of the node 3 starts during a step S-215 and starts its own timer of DELAY_TIME seconds.
- when the node 1 receives the 'Alive' message from the node 2, it appoints itself as Potential CSM during a step S-225 since, with the information it has, it is the LSM that started earliest, whilst waiting for information from the node 3. At this stage, the node 3 is able to send 'Alive' messages towards nodes 1 and 2 during a step S-230.
- when the node 2 receives the 'Alive' message from the node 3, it appoints itself as Potential CSM during a step S-240 since, with the information it has, the LSM of the node 2 has started earlier than the LSM of the node 3. Likewise, when the node 3 receives the 'Alive' message from the node 2, it appoints node 2 as Potential CSM during a step S-245 since, with the information it has, the LSM of the node 2 has started earlier, whilst both nodes 2 and 3 are awaiting information from the node 1, still unknown to the node 3.
- when the node 1 receives the 'Alive' message from the node 3, it appoints itself as consolidated CSM during a step S-235 since the node 1 has already received 'Alive' messages from all the nodes in the system and, with that information, the LSM of the node 1 is the one that was activated earliest. Then, the node 1 sends 'Alive' messages during a step S-250 to the other nodes informing them that the node 1 is the consolidated CSM. The nodes 2 and 3, eventually receiving the 'Alive' messages from the node 1, realize and thus mark during respective steps S-255 and S-260 that the monitoring unit of the node 1 is the CSM, and no other node can assume this role in the current situation.
- each node may determine which node is considered the CSM, i.e. the one that has the longest working time amongst the active nodes.
- in order for a node to become a CSM, such node could wait at least the DELAY_TIME for assuring there has been some time for receiving remote 'Alive' messages from any other potential CSM.
- the node reaching the CSM state may communicate its decision to the remaining nodes in order to assure there are no ambiguities, though this communication is not necessary where all the nodes behave in accordance with the first mode of operation.
- This communication may be TCP-based in order to assure reliability.
- this behaviour may be indirectly achieved when the CSM communicates the replica configuration to the other nodes.
- since the replica status is included in the 'Alive' messages, once elected, the CSM knows which replica status configuration is suitable for working. In fact, this information may be known to all nodes, but they may wait for the CSM to confirm under the second mode of operation.
- Each node could maintain a list with the current nodes from which it has received an 'Alive' message, along with the replica status received within. Each time an 'Alive' message is received, the sender node may be set as active. Master nodes for partitions may only be elected amongst active nodes. A node may be set as not available when there has been a period, a so-called INACTIVE_PERIOD, without receiving messages from it. This time could be two or three times the mean time of the message receiving period, which could be set initially to the aforementioned DELAY_TIME. The node may also be set as unavailable if the 'Alive' messages sent to it do not get through. In this way node availability is detected very fast.
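A minimal way to keep such a per-node availability list, assuming a monotonic clock and an INACTIVE_PERIOD derived from the message cadence as suggested above, is sketched here; the class and attribute names are hypothetical.

```python
# Hypothetical availability tracker for remote nodes based on 'Alive' message arrivals.
import time

class NodeAvailability:
    def __init__(self, inactive_period: float):
        self.inactive_period = inactive_period   # e.g. 2-3x the mean 'Alive' interval
        self.last_seen = {}                      # node_id -> monotonic time of last 'Alive'

    def on_alive(self, node_id: str) -> None:
        """Mark the sender as active each time one of its 'Alive' messages arrives."""
        self.last_seen[node_id] = time.monotonic()

    def active_nodes(self) -> set:
        """Only nodes heard from within INACTIVE_PERIOD are eligible to host master replicas."""
        now = time.monotonic()
        return {n for n, t in self.last_seen.items() if now - t <= self.inactive_period}
```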
- the node with monitoring unit 60 considered the CSM may submit towards the other nodes in the distributed database system its node identifier, working time, replica status for each replica, master node for each replica, node status from the state-machine, list of active nodes from which 'Alive' messages have been received (even if offline), current Master Replica Information (hereinafter MRI), update time and node Id (including exec time) for the CSM node that set the MRI.
- the MRI message can include a list of replicas with their hosting node.
- the MRI message might also be sent to offline nodes, in order to receive confirmation for the MRI.
- an offline node could be the link between a real CSM node and a so-called sub-CSM node, the latter being an LSM running in a node that believes it is the master, but is not so. Anyhow, as stated before, no replica of an offline node could be set as master. Therefore, in order to avoid this problem, the CSM process may wait the DELAY_TIME before sending the MRI.
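For illustration only, the MRI items enumerated above (node identifier, working time, per-replica status, master node per partition, node state, list of active nodes, and the stamp of the CSM that set the MRI) might be gathered as follows; the field names are assumptions.

```python
# Hypothetical layout of the Master Replica Information (MRI) sent by the CSM;
# field names are assumptions drawn from the items listed in the description.
def build_mri(csm_node_id: str, working_time: float, masters: dict,
              replica_status: dict, active_nodes: list, node_state: str,
              update_time: float) -> dict:
    return {
        "node_id": csm_node_id,            # identifier of the CSM node
        "working_time": working_time,      # how long the CSM has been active
        "node_state": node_state,          # state from the Fig. 6 state machine
        "replica_status": replica_status,  # status of each replica known to the CSM
        "masters": masters,                # partition_id -> node_id hosting the master replica
        "active_nodes": active_nodes,      # nodes from which 'Alive' messages were received
        "mri_set_by": {"node_id": csm_node_id, "update_time": update_time},
    }
```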
- the CSM election mechanism is such that all monitoring processes are synchronized. As explained before the idea is that the oldest process becomes the CSM. In determining which the oldest LSM is, several embodiments are foreseeable.
- in one embodiment, the present invention takes into account the working time, that is, each process determines, with its local time, how many seconds it has been working from the start, and sends that info in the 'Alive' message. There may be a subjacent drawback in this case, as latency probably has an important role to play. The receiving node will see the working time sent, but not the latency, so it would be difficult to accurately determine whether the receiving node is younger or older than the sending node. Latency times could be measured, via ping messages, and a mean could be established.
- in another embodiment, the present invention takes into account the start-up time.
- all processes send their start-up time in the 'Alive' messages. This time will be relative to local machine time at each node. Therefore, there may be a need to synchronize all machines in the system.
- Linux systems, and operating systems in general, have solved this problem many years ago using NTP.
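The two ageing criteria above could be compared as in this small sketch; the latency correction applied to the working-time variant is an assumption following the ping-based measurement suggested in the description, and the start-up-time variant assumes NTP-synchronized clocks.

```python
# Illustrative comparison of the two ageing criteria; the latency correction is an assumption.
def effective_working_time(reported_working_time: float, mean_latency: float) -> float:
    """Working-time variant: add the measured mean latency so the remote age is not underestimated."""
    return reported_working_time + mean_latency

def is_remote_older_by_startup(local_startup: float, remote_startup: float) -> bool:
    """Start-up-time variant: assumes NTP-synchronized clocks; the smaller start-up time is older."""
    return remote_startup < local_startup
```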
- in accordance with the present invention, upon activation or deactivation of any node of the distributed database system, there is a step of determining amongst the active nodes which one is the current master node in charge of the master replica for each partition.
- the present invention provides for two embodiments: a first embodiment where all the nodes work independently to determine which the master node is; and a second embodiment wherein a CSM is determined to decide which the master node is for each partition.
- Fig. 8 illustrates an exemplary situation wherein the currently considered CSM, which exemplary is the node 3, goes down and becomes unavailable for the other nodes, namely, for nodes 1 and 2.
- the node 3 acting as CSM sent a latest set of 'Alive' messages during a step S-300 towards the other nodes 1 and 2.
- upon respective reception of such messages, both nodes 1 and 2 trigger a so-called Inactivity Period which is not reset until new 'Alive' messages are received from the node 3. Since the Inactivity Period expires at both nodes 1 and 2 without having received further 'Alive' messages from the node 3, both nodes 1 and 2 mark the node 3 as unavailable during respective steps S-320 and S-330.
- the node 2 was older than the node 1, and this information was known to node 2 since it had received the latest periodic 'Alive' message from the node 1 during the step S-310. Therefore, the node 2 goes itself to the Potential CSM state during a step S-340, and in accordance with the state-machine illustrated in Fig. 6. Then, the node 2 sends during a step S-350 its own periodic 'Alive' messages towards nodes 1 and 3, the latter being unavailable to receive such message.
- the node 1 receiving the 'Alive' message from the node 2 is aware that the node 2 is older and arrives at the conclusion during a step S-360 that the node 2 is the current CSM. After the so-called DELAY_TIME has expired at the node 2, the monitoring unit 60 of the node 2 is consolidated as CSM during a step S-370, and the node 2 is the current System Master Monitor.
- when the node 3 recovers again, which is not shown in any drawing, the node 3 sends its own periodic 'Alive' messages towards nodes 1 and 2. Upon receipt of such messages at nodes 1 and 2, they both mark the node 3 as available again. As already commented above, the node 3 has reset its working time or its startup time, as the case may be, so that the situation does not change and, after being in a Potential CSM state during a DELAY_TIME, the current CSM is still the monitoring unit of the node 2, which reaches the CSM state in accordance with the state-machine shown in Fig. 6 (see the sketch below).
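- The sketch below illustrates this election under the assumption that the startup time is the age criterion; the class names, the states and the tick-based consolidation are illustrative simplifications of the state-machine of Fig. 6:

```python
import time
from enum import Enum, auto

class MonitorState(Enum):
    LSM = auto()
    POTENTIAL_CSM = auto()
    CSM = auto()

class LocalMonitor:
    """Illustrative election logic: the node believing it is the oldest among
    the available nodes enters POTENTIAL_CSM and, if no older node shows up
    within DELAY_TIME, consolidates itself as CSM."""

    def __init__(self, node_id, startup_time, delay_time):
        self.node_id = node_id
        self.startup_time = startup_time
        self.delay_time = delay_time          # the DELAY_TIME of the description
        self.state = MonitorState.LSM
        self.peers = {}                       # node_id -> startup_time of available peers
        self.potential_since = None

    def on_alive(self, node_id, startup_time):
        """Called whenever an 'Alive' message is received from a peer."""
        self.peers[node_id] = startup_time
        self._reevaluate()

    def on_peer_unavailable(self, node_id):
        """Called when the Inactivity Period for a peer expires."""
        self.peers.pop(node_id, None)
        self._reevaluate()

    def _i_am_oldest(self):
        return all((self.startup_time, self.node_id) < (t, n)
                   for n, t in self.peers.items())

    def _reevaluate(self):
        if self._i_am_oldest():
            if self.state is MonitorState.LSM:
                self.state = MonitorState.POTENTIAL_CSM
                self.potential_since = time.monotonic()
        else:
            self.state = MonitorState.LSM
            self.potential_since = None

    def tick(self):
        """Consolidate as CSM once DELAY_TIME has elapsed in POTENTIAL_CSM."""
        if (self.state is MonitorState.POTENTIAL_CSM
                and time.monotonic() - self.potential_since >= self.delay_time):
            self.state = MonitorState.CSM
        return self.state

# Node 2 (startup time 50) learns node 1 (startup time 120) is younger and
# that the old CSM, node 3, has become unavailable.
m = LocalMonitor(node_id=2, startup_time=50.0, delay_time=0.0)
m.on_alive(1, 120.0)
m.on_peer_unavailable(3)
print(m.tick())    # -> MonitorState.CSM
```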
- where a new node is added to the system, the monitoring unit of the new node can start.
- the new node sends 'Alive' messages to the already existing active nodes, which thereby become aware of the new node.
- once the CSM becomes aware of the new replicas in the new node, it reconfigures the system accordingly, if necessary.
- where the system operates in accordance with the above first mode of operation, that is, without having a CSM, all nodes respectively process the information received in the 'Alive' messages and arrive at the same conclusion on which node is the master node in charge of the master replica for each partition (see the sketch below).
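- The criteria actually used to pick a master replica may involve further per-replica indicators; the sketch below merely uses the oldest hosting node as an assumed rule, to illustrate that identical 'Alive'/MRI inputs lead every node to exactly the same result:

```python
def elect_masters(alive_info: dict) -> dict:
    """Pick a master node per partition from the same 'Alive'/MRI information
    at every node: here, the oldest available node hosting a replica of the
    partition wins, with ties broken by node id, so all nodes that see the
    same input compute exactly the same result."""
    partitions = {p for info in alive_info.values() for p in info["replicas"]}
    masters = {}
    for partition in sorted(partitions):
        candidates = [(info["startup_time"], node_id)
                      for node_id, info in alive_info.items()
                      if partition in info["replicas"]]
        masters[partition] = min(candidates)[1]
    return masters

# alive_info maps node_id -> data gathered from that node's 'Alive' messages.
alive_info = {
    1: {"startup_time": 120.0, "replicas": {"p1", "p2"}},
    2: {"startup_time": 50.0,  "replicas": {"p1"}},
    3: {"startup_time": 80.0,  "replicas": {"p2", "p3"}},
}
print(elect_masters(alive_info))   # -> {'p1': 2, 'p2': 3, 'p3': 3}
```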
- the 'Alive' messages comprise the information in the so-called MRI messages. Nevertheless, the MRI messages may be sent separately from the 'Alive' messages.
- the addition of a new partition to a running distributed database system may be an easy task with the behaviour previously described.
- all the corresponding indicators explained above on a per-replica basis have to be configured as well.
- the elected System Master Monitor node with the CSM under the second mode of operation, or every node under the first mode of operation, sooner or later receives 'Alive' messages from all the nodes, wherein information for the new replicas is received from the nodes hosting said new replicas.
- the monitoring process is expected to be highly available too. Therefore, as already commented above, two LSMs may be running in each node: an active LSM and a stand-by LSM.
- the active LSM forwards all received messages to the stand-by LSM, particularly the 'Alive' messages and also any MRI message, if such an MRI message is received.
- if the active LSM fails, the stand-by LSM takes over immediately, so that a node cannot go down simply because the monitoring process has failed (see the sketch below).
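- A minimal sketch of this active/stand-by forwarding pattern; in a real node the two LSMs are separate processes rather than one object, so the following is only an illustration of the state mirroring:

```python
class MonitoringPair:
    """Both LSMs of one node, modelled here as a single object for brevity:
    the active LSM forwards every received message to the stand-by LSM, so
    the stand-by can take over immediately if the active process dies."""

    def __init__(self):
        self.active_log = []     # messages processed by the active LSM
        self.standby_log = []    # mirrored copy kept by the stand-by LSM
        self.active_alive = True

    def receive(self, message: str) -> None:
        if self.active_alive:
            self.active_log.append(message)
            self.standby_log.append(message)   # forwarded to the stand-by
        else:
            self.standby_log.append(message)   # the stand-by has taken over

    def active_crashed(self) -> None:
        self.active_alive = False              # monitoring continues seamlessly

pair = MonitoringPair()
pair.receive("Alive from node 2")
pair.active_crashed()
pair.receive("MRI update")
print(pair.standby_log)   # the stand-by missed nothing
```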
- the monitoring process may be carried out by the monitoring unit 60 likely in cooperation with the processing unit 20.
- all 'Alive' messages received from the other nodes, as well as 'Alive' messages sent to configured nodes, may be sent by multicast; no special treatment is necessary.
- MRI messages received by any LSM of a node, or sent by the CSM of the System Master Monitor, may be received or sent by multicast.
- the stand-by process may detect that its current MRI does not match the elaborated one, and may initiate the process.
- for multicast, the MRI messages may need an additional flag for the "confirmation" state at the very end, in order to use common code for parsing.
- all multicast messages might also include, in the last part, one byte that represents the weight of the process, in order to resolve race conditions between active and stand-by (see the sketch below).
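- A minimal sketch of such a layout, assuming an illustrative JSON payload (the actual MRI encoding is not specified here), with the "confirmation" flag byte and the weight byte placed at the very end:

```python
import json

def encode_multicast(payload: dict, confirmed: bool, weight: int) -> bytes:
    """Append one 'confirmation' flag byte and one process-weight byte to the
    serialized payload, so active and stand-by can parse a common format and
    resolve race conditions by comparing weights."""
    if not 0 <= weight <= 255:
        raise ValueError("weight must fit in one byte")
    return json.dumps(payload).encode() + bytes([1 if confirmed else 0, weight])

def decode_multicast(packet: bytes):
    *body, confirmation, weight = packet
    return json.loads(bytes(body).decode()), bool(confirmation), weight

pkt = encode_multicast({"partition": "p1", "master": 2}, confirmed=True, weight=7)
print(decode_multicast(pkt))   # -> ({'partition': 'p1', 'master': 2}, True, 7)
```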
- This section describes how the active and stand-by processes may detect their state, and what they are supposed to do.
- the process may start listening for any multicast message in the distributed database system. If some multicast is received, it may set itself as stand-by and may update any internal data according to the received state (it may not process the state machine, as its elected state is received from the 'Alive' message). If the process does not receive any multicast message within a DELAY_TIME period after starting, or after the previously received multicast packet, it may start listening (opening the port) for 'Alive' and MRI messages, and may send its current MRI (which could be empty at initial start-up) together with the process weight (which is different for each process in the node).
- the active process may stop listening on the previous port and may degrade to stand-by (this may solve initial race conditions, which are not supposed to happen during normal operation).
- the purpose of comparing the empty MRI is to prevent a more heavily weighted process from getting control if restarted very fast (anyway, it could wait DELAY_TIME after restarting, so this could not happen unless it starts at the same time as the current active process sends its 'Alive' message, and also at the same time as the neighbour nodes send their 'Alive' messages, which are also multicast). A sketch of this start-up role detection follows below.
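- The following sketch uses an assumed multicast group and port (not taken from the description) and omits the MRI comparison details; it only shows the DELAY_TIME listening step that decides between the active and stand-by roles:

```python
import socket

MULTICAST_GROUP = "239.0.0.1"   # illustrative multicast address, not from the text
MULTICAST_PORT = 50000          # illustrative port

def detect_role(delay_time: float) -> str:
    """Listen for any multicast traffic for DELAY_TIME: if something is heard,
    another process is already active and this one becomes stand-by;
    otherwise it assumes the active role."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM, socket.IPPROTO_UDP)
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
    sock.bind(("", MULTICAST_PORT))
    membership = socket.inet_aton(MULTICAST_GROUP) + socket.inet_aton("0.0.0.0")
    sock.setsockopt(socket.IPPROTO_IP, socket.IP_ADD_MEMBERSHIP, membership)
    sock.settimeout(delay_time)
    try:
        sock.recv(65535)          # any multicast packet within DELAY_TIME
        return "stand-by"
    except socket.timeout:
        return "active"           # silence for DELAY_TIME: take the active role
    finally:
        sock.close()

print(detect_role(delay_time=2.0))
```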
- the invention may also be practised by a computer program, loadable into an internal memory of a computer with input and output units as well as with a processing unit.
- To this end, this computer program comprises executable code adapted to carry out the above method steps when running in the computer.
- the executable code may be recorded on a carrier readable by a computer.
Landscapes
- Engineering & Computer Science (AREA)
- Databases & Information Systems (AREA)
- Theoretical Computer Science (AREA)
- Computing Systems (AREA)
- Data Mining & Analysis (AREA)
- Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
- Information Transfer Between Computers (AREA)
- Computer And Data Communications (AREA)
- Data Exchanges In Wide-Area Networks (AREA)
Priority Applications (3)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| JP2011529547A JP5557840B2 (ja) | 2008-10-03 | 2009-09-30 | 分散データベースの監視メカニズム |
| EP09783614A EP2350876A2 (en) | 2008-10-03 | 2009-09-30 | Monitoring mechanism for a distributed database |
| US13/121,561 US8375001B2 (en) | 2008-10-03 | 2009-09-30 | Master monitoring mechanism for a geographical distributed database |
Applications Claiming Priority (2)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| US10240808P | 2008-10-03 | 2008-10-03 | |
| US61/102,408 | 2008-10-03 |
Publications (2)
| Publication Number | Publication Date |
|---|---|
| WO2010037794A2 true WO2010037794A2 (en) | 2010-04-08 |
| WO2010037794A3 WO2010037794A3 (en) | 2010-06-24 |
Family
ID=41203944
Family Applications (1)
| Application Number | Title | Priority Date | Filing Date |
|---|---|---|---|
| PCT/EP2009/062714 Ceased WO2010037794A2 (en) | 2008-10-03 | 2009-09-30 | Monitoring mechanism for a distributed database |
Country Status (4)
| Country | Link |
|---|---|
| US (1) | US8375001B2 (en) |
| EP (1) | EP2350876A2 (en) |
| JP (1) | JP5557840B2 (en) |
| WO (1) | WO2010037794A2 (en) |
Cited By (4)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| JP2012064217A (ja) * | 2010-09-20 | 2012-03-29 | Thomson Licensing | データ復元方法及びサーバ装置 |
| US8326801B2 (en) * | 2010-11-17 | 2012-12-04 | Microsoft Corporation | Increasing database availability during fault recovery |
| JP2013525895A (ja) * | 2010-04-23 | 2013-06-20 | コンピュヴェルデ アーベー | 分散データストレージ |
| EP2898435B1 (en) * | 2012-10-16 | 2018-12-12 | Huawei Technologies Co., Ltd. | System and method for flexible distributed massively parallel processing (mpp) |
Families Citing this family (83)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| US7675854B2 (en) | 2006-02-21 | 2010-03-09 | A10 Networks, Inc. | System and method for an adaptive TCP SYN cookie with time validation |
| US8312507B2 (en) | 2006-10-17 | 2012-11-13 | A10 Networks, Inc. | System and method to apply network traffic policy to an application session |
| US8443062B2 (en) * | 2008-10-23 | 2013-05-14 | Microsoft Corporation | Quorum based transactionally consistent membership management in distributed storage systems |
| US8332365B2 (en) | 2009-03-31 | 2012-12-11 | Amazon Technologies, Inc. | Cloning and recovery of data volumes |
| US8582450B1 (en) * | 2009-09-30 | 2013-11-12 | Shoretel, Inc. | Status reporting system |
| US9960967B2 (en) | 2009-10-21 | 2018-05-01 | A10 Networks, Inc. | Determining an application delivery server based on geo-location information |
| US9215275B2 (en) | 2010-09-30 | 2015-12-15 | A10 Networks, Inc. | System and method to balance servers based on server load status |
| US9609052B2 (en) | 2010-12-02 | 2017-03-28 | A10 Networks, Inc. | Distributing application traffic to servers based on dynamic service response time |
| US8897154B2 (en) | 2011-10-24 | 2014-11-25 | A10 Networks, Inc. | Combining stateless and stateful server load balancing |
| US9094364B2 (en) | 2011-12-23 | 2015-07-28 | A10 Networks, Inc. | Methods to manage services over a service gateway |
| US20130311421A1 (en) * | 2012-01-06 | 2013-11-21 | Citus Data Bilgi Islemleri Ticaret A.S. | Logical Representation of Distributed Database Table Updates in an Append-Only Log File |
| US10860563B2 (en) | 2012-01-06 | 2020-12-08 | Microsoft Technology Licensing, Llc | Distributed database with modular blocks and associated log files |
| US9753999B2 (en) | 2012-01-06 | 2017-09-05 | Citus Data Bilgi Islemieri Ticaret A.S. | Distributed database with mappings between append-only files and repartitioned files |
| US10176184B2 (en) * | 2012-01-17 | 2019-01-08 | Oracle International Corporation | System and method for supporting persistent store versioning and integrity in a distributed data grid |
| US10044582B2 (en) | 2012-01-28 | 2018-08-07 | A10 Networks, Inc. | Generating secure name records |
| US8965921B2 (en) * | 2012-06-06 | 2015-02-24 | Rackspace Us, Inc. | Data management and indexing across a distributed database |
| CN108027805B (zh) | 2012-09-25 | 2021-12-21 | A10网络股份有限公司 | 数据网络中的负载分发 |
| US9843484B2 (en) | 2012-09-25 | 2017-12-12 | A10 Networks, Inc. | Graceful scaling in software driven networks |
| US10021174B2 (en) | 2012-09-25 | 2018-07-10 | A10 Networks, Inc. | Distributing service sessions |
| US10002141B2 (en) * | 2012-09-25 | 2018-06-19 | A10 Networks, Inc. | Distributed database in software driven networks |
| US9053166B2 (en) | 2012-12-10 | 2015-06-09 | Microsoft Technology Licensing, Llc | Dynamically varying the number of database replicas |
| US9824132B2 (en) * | 2013-01-08 | 2017-11-21 | Facebook, Inc. | Data recovery in multi-leader distributed systems |
| US9531846B2 (en) | 2013-01-23 | 2016-12-27 | A10 Networks, Inc. | Reducing buffer usage for TCP proxy session based on delayed acknowledgement |
| US9900252B2 (en) | 2013-03-08 | 2018-02-20 | A10 Networks, Inc. | Application delivery controller and global server load balancer |
| US9378068B2 (en) | 2013-03-13 | 2016-06-28 | International Business Machines Corporation | Load balancing for a virtual networking system |
| US9438670B2 (en) * | 2013-03-13 | 2016-09-06 | International Business Machines Corporation | Data replication for a virtual networking system |
| US11030055B2 (en) * | 2013-03-15 | 2021-06-08 | Amazon Technologies, Inc. | Fast crash recovery for distributed database systems |
| WO2014144837A1 (en) | 2013-03-15 | 2014-09-18 | A10 Networks, Inc. | Processing data packets using a policy based network path |
| JP6028641B2 (ja) * | 2013-03-21 | 2016-11-16 | 富士通株式会社 | 情報処理システム、情報処理装置の制御プログラム及び情報処理システムの制御方法 |
| US10038693B2 (en) | 2013-05-03 | 2018-07-31 | A10 Networks, Inc. | Facilitating secure network traffic by an application delivery controller |
| US9489443B1 (en) * | 2013-05-24 | 2016-11-08 | Amazon Technologies, Inc. | Scheduling of splits and moves of database partitions |
| US9548940B2 (en) | 2013-06-09 | 2017-01-17 | Apple Inc. | Master election among resource managers |
| US9053167B1 (en) * | 2013-06-19 | 2015-06-09 | Amazon Technologies, Inc. | Storage device selection for database partition replicas |
| CN104283906B (zh) * | 2013-07-02 | 2018-06-19 | 华为技术有限公司 | 分布式存储系统、集群节点及其区间管理方法 |
| US9569513B1 (en) | 2013-09-10 | 2017-02-14 | Amazon Technologies, Inc. | Conditional master election in distributed databases |
| US9633051B1 (en) * | 2013-09-20 | 2017-04-25 | Amazon Technologies, Inc. | Backup of partitioned database tables |
| US9507843B1 (en) * | 2013-09-20 | 2016-11-29 | Amazon Technologies, Inc. | Efficient replication of distributed storage changes for read-only nodes of a distributed database |
| US9430545B2 (en) * | 2013-10-21 | 2016-08-30 | International Business Machines Corporation | Mechanism for communication in a distributed database |
| US9794135B2 (en) | 2013-11-11 | 2017-10-17 | Amazon Technologies, Inc. | Managed service for acquisition, storage and consumption of large-scale data streams |
| US10230770B2 (en) | 2013-12-02 | 2019-03-12 | A10 Networks, Inc. | Network proxy layer for policy-based application proxies |
| US9639589B1 (en) * | 2013-12-20 | 2017-05-02 | Amazon Technologies, Inc. | Chained replication techniques for large-scale data streams |
| US9942152B2 (en) | 2014-03-25 | 2018-04-10 | A10 Networks, Inc. | Forwarding data packets using a service-based forwarding policy |
| US9942162B2 (en) | 2014-03-31 | 2018-04-10 | A10 Networks, Inc. | Active application response delay time |
| US9519510B2 (en) * | 2014-03-31 | 2016-12-13 | Amazon Technologies, Inc. | Atomic writes for multiple-extent operations |
| US10264071B2 (en) | 2014-03-31 | 2019-04-16 | Amazon Technologies, Inc. | Session management in distributed storage systems |
| US10372685B2 (en) | 2014-03-31 | 2019-08-06 | Amazon Technologies, Inc. | Scalable file storage service |
| US9785510B1 (en) | 2014-05-09 | 2017-10-10 | Amazon Technologies, Inc. | Variable data replication for storage implementing data backup |
| US9906422B2 (en) | 2014-05-16 | 2018-02-27 | A10 Networks, Inc. | Distributed system to determine a server's health |
| US10129122B2 (en) | 2014-06-03 | 2018-11-13 | A10 Networks, Inc. | User defined objects for network devices |
| US9992229B2 (en) | 2014-06-03 | 2018-06-05 | A10 Networks, Inc. | Programming a data network device using user defined scripts with licenses |
| US9986061B2 (en) | 2014-06-03 | 2018-05-29 | A10 Networks, Inc. | Programming a data network device using user defined scripts |
| US9613078B2 (en) | 2014-06-26 | 2017-04-04 | Amazon Technologies, Inc. | Multi-database log with multi-item transaction support |
| US9734021B1 (en) | 2014-08-18 | 2017-08-15 | Amazon Technologies, Inc. | Visualizing restoration operation granularity for a database |
| CN104615660A (zh) * | 2015-01-05 | 2015-05-13 | 浪潮(北京)电子信息产业有限公司 | 一种监控数据库性能的方法和系统 |
| US9904722B1 (en) | 2015-03-13 | 2018-02-27 | Amazon Technologies, Inc. | Log-based distributed transaction management |
| CN107851105B (zh) | 2015-07-02 | 2022-02-22 | 谷歌有限责任公司 | 具有副本位置选择的分布式存储系统 |
| US10581976B2 (en) | 2015-08-12 | 2020-03-03 | A10 Networks, Inc. | Transmission control of protocol state exchange for dynamic stateful service insertion |
| US10243791B2 (en) | 2015-08-13 | 2019-03-26 | A10 Networks, Inc. | Automated adjustment of subscriber policies |
| US10031935B1 (en) * | 2015-08-21 | 2018-07-24 | Amazon Technologies, Inc. | Customer-requested partitioning of journal-based storage systems |
| US10608956B2 (en) | 2015-12-17 | 2020-03-31 | Intel Corporation | Adaptive fabric multicast schemes |
| US10423493B1 (en) | 2015-12-21 | 2019-09-24 | Amazon Technologies, Inc. | Scalable log-based continuous data protection for distributed databases |
| US10853182B1 (en) | 2015-12-21 | 2020-12-01 | Amazon Technologies, Inc. | Scalable log-based secondary indexes for non-relational databases |
| US10567500B1 (en) | 2015-12-21 | 2020-02-18 | Amazon Technologies, Inc. | Continuous backup of data in a distributed data store |
| US10754844B1 (en) | 2017-09-27 | 2020-08-25 | Amazon Technologies, Inc. | Efficient database snapshot generation |
| US10990581B1 (en) | 2017-09-27 | 2021-04-27 | Amazon Technologies, Inc. | Tracking a size of a database change log |
| US11182372B1 (en) | 2017-11-08 | 2021-11-23 | Amazon Technologies, Inc. | Tracking database partition change log dependencies |
| US11042503B1 (en) | 2017-11-22 | 2021-06-22 | Amazon Technologies, Inc. | Continuous data protection and restoration |
| US11269731B1 (en) | 2017-11-22 | 2022-03-08 | Amazon Technologies, Inc. | Continuous data protection |
| US10922310B2 (en) | 2018-01-31 | 2021-02-16 | Red Hat, Inc. | Managing data retrieval in a data grid |
| US10621049B1 (en) | 2018-03-12 | 2020-04-14 | Amazon Technologies, Inc. | Consistent backups based on local node clock |
| US11126505B1 (en) | 2018-08-10 | 2021-09-21 | Amazon Technologies, Inc. | Past-state backup generator and interface for database systems |
| CN110874382B (zh) * | 2018-08-29 | 2023-07-04 | 阿里云计算有限公司 | 一种数据写入方法、装置及其设备 |
| CN110928943B (zh) * | 2018-08-29 | 2023-06-20 | 阿里云计算有限公司 | 一种分布式数据库及数据写入方法 |
| US11042454B1 (en) | 2018-11-20 | 2021-06-22 | Amazon Technologies, Inc. | Restoration of a data source |
| US11500743B2 (en) * | 2019-02-01 | 2022-11-15 | Nuodb, Inc. | Node failure detection and resolution in distributed databases |
| CN109933289B (zh) * | 2019-03-15 | 2022-06-10 | 深圳市网心科技有限公司 | 一种存储副本部署方法、系统及电子设备和存储介质 |
| US12007954B1 (en) | 2020-05-08 | 2024-06-11 | Amazon Technologies, Inc. | Selective forwarding for multi-statement database transactions |
| US11816073B1 (en) * | 2020-05-08 | 2023-11-14 | Amazon Technologies, Inc. | Asynchronously forwarding database commands |
| KR102382189B1 (ko) * | 2020-07-30 | 2022-04-05 | 주식회사 엘지유플러스 | 다중화 액티브 데이터베이스의 리플리케이션 갭 감지 방법 및 장치 |
| EP4189534B1 (en) * | 2020-08-03 | 2024-12-11 | Hitachi Vantara LLC | Randomization of heartbeat communications among multiple partition groups |
| US11494356B2 (en) | 2020-09-23 | 2022-11-08 | Salesforce.Com, Inc. | Key permission distribution |
| US12423199B2 (en) * | 2022-06-06 | 2025-09-23 | Mongodb, Inc. | Systems and methods for synchronizing between a source database cluster and a destination database cluster |
| US12229163B2 (en) * | 2023-01-31 | 2025-02-18 | Singlestore, Inc. | High availability with consensus in database systems |
Family Cites Families (6)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| US5440727A (en) * | 1991-12-18 | 1995-08-08 | International Business Machines Corporation | Asynchronous replica management in shared nothing architectures |
| US5608903A (en) * | 1994-12-15 | 1997-03-04 | Novell, Inc. | Method and apparatus for moving subtrees in a distributed network directory |
| US6539381B1 (en) * | 1999-04-21 | 2003-03-25 | Novell, Inc. | System and method for synchronizing database information |
| US6574750B1 (en) * | 2000-01-06 | 2003-06-03 | Oracle Corporation | Preserving consistency of passively-replicated non-deterministic objects |
| US20080052327A1 (en) * | 2006-08-28 | 2008-02-28 | International Business Machines Corporation | Secondary Backup Replication Technique for Clusters |
| US7788233B1 (en) * | 2007-07-05 | 2010-08-31 | Amazon Technologies, Inc. | Data store replication for entity based partition |
- 2009
- 2009-09-30 JP JP2011529547A patent/JP5557840B2/ja active Active
- 2009-09-30 EP EP09783614A patent/EP2350876A2/en not_active Ceased
- 2009-09-30 WO PCT/EP2009/062714 patent/WO2010037794A2/en not_active Ceased
- 2009-09-30 US US13/121,561 patent/US8375001B2/en active Active
Also Published As
| Publication number | Publication date |
|---|---|
| JP2012504807A (ja) | 2012-02-23 |
| US20110178985A1 (en) | 2011-07-21 |
| WO2010037794A3 (en) | 2010-06-24 |
| US8375001B2 (en) | 2013-02-12 |
| JP5557840B2 (ja) | 2014-07-23 |
| EP2350876A2 (en) | 2011-08-03 |
Similar Documents
| Publication | Publication Date | Title |
|---|---|---|
| US8375001B2 (en) | Master monitoring mechanism for a geographical distributed database | |
| EP2347563B1 (en) | Distributed master election | |
| US9811541B2 (en) | System and method for supporting lazy deserialization of session information in a server cluster | |
| CN111615066B (zh) | 一种基于广播的分布式微服务注册及调用方法 | |
| US6330605B1 (en) | Proxy cache cluster | |
| US7844851B2 (en) | System and method for protecting against failure through geo-redundancy in a SIP server | |
| US20020075870A1 (en) | Method and apparatus for discovering computer systems in a distributed multi-system cluster | |
| US20080177861A1 (en) | Method and apparatus for dynamic resource discovery and information distribution in a data network | |
| EP1989863A1 (en) | Gateway for wireless mobile clients | |
| CN104620559A (zh) | 用于支持分布式数据网格集群中的可伸缩消息总线的系统和方法 | |
| JP2013102527A (ja) | フェデレーションインフラストラクチャ内の一貫性 | |
| WO2009140979A1 (en) | Resource pooling in a blade cluster switching center server | |
| EP0776502A2 (en) | Scalable distributed computing environment | |
| EP1987657A1 (en) | Scalable wireless messaging system | |
| EP2616967B1 (en) | System including a middleware machine environment | |
| JP2017517064A (ja) | トランザクションミドルウェアマシン環境におけるドメイン間メッセージ通信のためにバイパスドメインモデルおよびプロキシモデルをサポートし、かつサービス情報を更新するためのシステムおよび方法 | |
| CN115412530B (zh) | 一种多集群场景下服务的域名解析方法及系统 | |
| CN113055461A (zh) | 一种基于ZooKeeper的无人集群分布式协同指挥控制方法 | |
| CN103842964A (zh) | 在事务中间件机器环境中支持准确负载平衡的系统及方法 | |
| CN110233886B (zh) | 一种面向海量微服务的高可用服务治理系统及实现方法 | |
| US20220019380A1 (en) | Methods providing network service restoration context and related service instance sets and storage resource nodes | |
| CN111756780A (zh) | 一种同步连接信息的方法和负载均衡系统 | |
| JP4459999B2 (ja) | 投票を活用した無停止サービスシステム及びそのシステムにおける情報更新及び提供方法 | |
| CN1745541A (zh) | 基于网际协议的通信系统的资源共享 | |
| WO2022130005A1 (en) | Granular replica healing for distributed databases |
Legal Events
| Date | Code | Title | Description |
|---|---|---|---|
| | 121 | Ep: the epo has been informed by wipo that ep was designated in this application | Ref document number: 09783614; Country of ref document: EP; Kind code of ref document: A2 |
| | DPE1 | Request for preliminary examination filed after expiration of 19th month from priority date (pct application filed from 20040101) | |
| | WWE | Wipo information: entry into national phase | Ref document number: 2011529547; Country of ref document: JP |
| | WWE | Wipo information: entry into national phase | Ref document number: 13121561; Country of ref document: US |
| | NENP | Non-entry into the national phase | Ref country code: DE |
| | WWE | Wipo information: entry into national phase | Ref document number: 2009783614; Country of ref document: EP |