CN108063782A - Node downtime takeover method and device, and node cluster system - Google Patents

Node downtime takeover method and device, and node cluster system

Info

Publication number
CN108063782A
Authority
CN
China
Prior art keywords
node
task
distributed cache
downtime
lock
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201610979682.4A
Other languages
Chinese (zh)
Inventor
刘绍华
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Gridsum Technology Co Ltd
Original Assignee
Beijing Gridsum Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Gridsum Technology Co Ltd filed Critical Beijing Gridsum Technology Co Ltd
Priority to CN201610979682.4A priority Critical patent/CN108063782A/en
Publication of CN108063782A publication Critical patent/CN108063782A/en
Pending legal-status Critical Current


Classifications

    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04LTRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L67/00Network arrangements or protocols for supporting network services or applications
    • H04L67/01Protocols
    • H04L67/10Protocols in which an application is distributed across nodes in the network
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00Error detection; Error correction; Monitoring
    • G06F11/07Responding to the occurrence of a fault, e.g. fault tolerance
    • G06F11/14Error detection or correction of the data by redundancy in operation
    • G06F11/1402Saving, restoring, recovering or retrying
    • G06F11/1415Saving, restoring, recovering or retrying at system level
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04LTRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L67/00Network arrangements or protocols for supporting network services or applications
    • H04L67/50Network services
    • H04L67/56Provisioning of proxy services
    • H04L67/568Storing data temporarily at an intermediate stage, e.g. caching

Landscapes

  • Engineering & Computer Science (AREA)
  • Computer Networks & Wireless Communication (AREA)
  • Signal Processing (AREA)
  • Theoretical Computer Science (AREA)
  • Quality & Reliability (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Multi Processors (AREA)

Abstract

The embodiment of the invention discloses a node downtime takeover method and device, and a node cluster system, which are used for improving the speed and stability of recovering the tasks of a down node. The method of the embodiment of the invention includes: the second node acquires the downtime information of the first node; the second node judges whether a lock corresponding to a first task exists in the distributed cache, wherein the first task is stored in the distributed cache when the first node works normally; if the distributed cache does not have a lock corresponding to the first task, the second node forms a lock by using the distributed cache; the second node acquires the first task of the first node from the distributed cache and takes over the first task. Because the distributed cache has the characteristic of improving the reading speed of instructions and data and can process a large amount of data, the speed of recovering the tasks of the down node is improved; the distributed cache cooperates with the node cluster, so the stability of task recovery can be improved and the confusion generated when the node cluster recovers tasks is avoided.

Description

Node downtime takeover method and device, and node cluster system
Technical Field
The present invention relates to the field of data processing, and in particular, to a method and an apparatus for taking over a node downtime, and a node cluster system.
Background
Distributed applications are often deployed on clusters consisting of multiple nodes. When one of the nodes is down, the other nodes need to take over the tasks executed on the down node, for example, clearing the records of the down node, restarting the tasks on the nodes which normally work, and the like.
In the existing approach, the tasks of all nodes are stored in a database for persistence. When a node goes down, in order to take over its tasks, the other nodes generally recover the data of the down node through a root node dedicated to handling downtime.
However, in such an approach, the nodes send multiple query requests to the database storing the tasks, and the task recovery process is heavily constrained by the database.
Disclosure of Invention
The embodiment of the invention provides a method and a device for taking over a down node and a node cluster system, which are used for improving the speed and the stability of a task of recovering the down node.
In order to solve the above technical problem, an embodiment of the present invention provides the following technical solutions:
a method of node downtime takeover, the method comprising:
the second node acquires the downtime information of the first node;
the second node judges whether a lock corresponding to a first task exists in the distributed cache or not, wherein the first task is stored in the distributed cache when the first node works normally;
if the distributed cache does not have the lock corresponding to the first task, the second node forms the lock by using the distributed cache;
the second node acquires the first task of the first node from the distributed cache;
the second node takes over the first task.
In order to solve the above technical problem, an embodiment of the present invention further provides the following technical solutions:
a node downtime takeover apparatus, the apparatus comprising:
the downtime information acquisition unit is used for acquiring downtime information of a first node;
the judging unit is used for judging whether a lock corresponding to a first task exists in the distributed cache, wherein the first task is stored in the distributed cache when the first node works normally;
a lock forming unit, configured to form a lock by using the distributed cache by the second node if the lock corresponding to the first task does not exist in the distributed cache;
a task obtaining unit, configured to obtain the first task of the first node from the distributed cache;
a takeover unit for taking over the first task.
In order to solve the above technical problem, an embodiment of the present invention further provides the following technical solutions:
a node cluster system, wherein the node cluster system comprises at least three nodes, and the nodes comprise a node downtime takeover apparatus,
wherein the node downtime takeover apparatus is the above-mentioned node downtime takeover apparatus.
According to the technical scheme, the embodiment of the invention has the following advantages:
after the second node acquires the downtime information of the first node, if the distributed cache does not have a lock corresponding to the first task, the second node forms a lock by using the distributed cache, so that a third node which acquires the downtime information of the first node does not recover the first task after detecting that the lock formed by the second node exists in the distributed cache; the first task is stored in the distributed cache when the first node works normally, and the second node then acquires the first task of the first node from the distributed cache and takes over the first task. The distributed cache has the characteristic of improving the reading speed of instructions and data and can process a large amount of data, so the speed of recovering the tasks of the down node is improved. Because the distributed cache cooperates with the node cluster, a plurality of nodes can acquire the tasks of the down node from the distributed cache to recover them, while the lock ensures that only one of the nodes recovers the tasks of the down node; this improves the stability of the task recovery process and avoids the confusion generated when the node cluster recovers tasks.
Drawings
Fig. 1 is a usage scenario diagram related to a method for taking over a node downtime according to an embodiment of the present invention;
fig. 2 is a flowchart of a method for taking over a node downtime according to another embodiment of the present invention;
fig. 3 is a flowchart of a method for taking over a node downtime according to another embodiment of the present invention;
fig. 4 is a schematic structural diagram of a node downtime takeover apparatus according to another embodiment of the present invention;
fig. 5 is a schematic structural diagram of a node cluster system according to another embodiment of the present invention.
Detailed Description
The embodiment of the invention provides a method and a device for taking over a down node and a node cluster system, which are used for improving the speed and the stability of a task of recovering the down node.
Fig. 1 is a usage scenario diagram related to the node downtime takeover method provided in an embodiment of the present invention. As shown in fig. 1, the usage scenario includes a plurality of nodes combined into an Akka cluster; fig. 1 shows three of the nodes: a first node 101, a second node 102, and a third node 103. The usage scenario also includes a distributed cache 104, into which each node stores its tasks in real time as it executes them. After a node goes down, the other nodes read its tasks from the distributed cache and take them over.
The Akka cluster environment means that, in the distributed environment, all nodes in the cluster can send messages to each other and obtain information about node creation and shutdown.
Distributed here means that the cluster environment is composed of a plurality of nodes.
After the second node 102 acquires the downtime information of the first node 101 in the Akka message subscription mode, if the distributed cache 104 does not have a lock corresponding to the first task, the second node forms a lock by using the distributed cache, so that the third node 103 which acquires the downtime information of the first node 101 does not recover the first task after detecting that the lock formed by the second node exists in the distributed cache; the first task is stored in the distributed cache when the first node works normally, and the second node then acquires the first task of the first node from the distributed cache and takes over the first task. The first node, the second node and the third node belong to the Akka cluster, and the Akka message subscription mode is as follows: when the Akka cluster detects the downtime information of the first node, it sends the downtime information of the first node to the second node. The distributed cache has the characteristic of improving the reading speed of instructions and data and can process a large amount of data, so the speed of recovering the tasks of the down node is improved. Because the distributed cache cooperates with the Akka cluster, a plurality of nodes can acquire the tasks of the down node from the distributed cache to recover them, while the lock ensures that only one node recovers the tasks of the down node; this improves the stability of the task recovery process and avoids the confusion caused when the Akka cluster recovers tasks.
Fig. 2 is a flowchart of a node downtime takeover method according to an embodiment of the present invention, where the method is applied to an Akka cluster environment composed of a plurality of nodes. Referring to fig. 2 in combination with the usage scenario shown in fig. 1 and the above contents, in an embodiment of the present invention, the flow of the node downtime takeover method includes:
step 201: the second node acquires the downtime information of the first node;
step 202: the second node judges whether the distributed cache has a lock corresponding to the first task, wherein the first task is stored in the distributed cache when the first node works normally;
step 203: if the distributed cache does not have the lock corresponding to the first task, the second node forms the lock by using the distributed cache;
step 204: the second node acquires a first task of the first node from the distributed cache;
step 205: the second node takes over the first task.
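Purely as an illustrative sketch of steps 201 to 205 (the DistributedCache trait, its method names, and the key layout below are assumptions introduced for clarity, not part of the disclosed embodiment), the flow could be written in Scala roughly as follows:

```scala
// Hypothetical cache interface; the embodiment does not name a concrete cache product or API.
trait DistributedCache {
  def putIfAbsent(key: String, value: String): Boolean // true => the key was newly created (lock formed)
  def get(key: String): Option[String]
  def delete(key: String): Unit
}

final case class Task(id: String, payload: String)

// Sketch of the second node handling the first node's downtime (steps 201-205).
class SecondNode(cache: DistributedCache, selfId: String) {
  private def lockKey(taskId: String) = s"lock:$taskId" // assumed key layout
  private def taskKey(taskId: String) = s"task:$taskId" // assumed key layout

  // Step 201: called once the downtime information of the first node arrives.
  def onFirstNodeDown(firstTaskId: String): Unit = {
    // Steps 202-203: proceed only if no other node has already formed the lock.
    if (cache.putIfAbsent(lockKey(firstTaskId), selfId)) {
      // Step 204: read the task the first node stored while it worked normally.
      cache.get(taskKey(firstTaskId)).foreach(payload => takeOver(Task(firstTaskId, payload))) // step 205
    }
    // else: another node formed the lock first, so this node does not recover the task.
  }

  private def takeOver(task: Task): Unit =
    println(s"node $selfId takes over task ${task.id}") // placeholder for recreating the task locally
}
```

The atomic putIfAbsent call is what lets a single node win the race to recover the task; the later optional steps below refine this basic flow.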
Alternatively,
the second node acquires the downtime information of the first node, and the method comprises the following steps:
the second node acquires the downtime information of the first node in an Akka message subscription mode;
the message subscription mode of Akka is as follows:
and when the Akka cluster detects the downtime information of the first node, sending the downtime information of the first node to the second node.
Alternatively,
the second node takes over the first task, including:
the second node locally creates a second task according to the first task;
the second node stores the second task to the distributed cache;
or,
and the second node distributes the first task to the third node, so that the third node creates a third task according to the first task and stores the third task to the distributed cache.
Alternatively,
after the second node takes over the first task, the method further comprises:
the second node deletes the first task from the distributed cache.
Alternatively,
after the second node determines whether the distributed cache has the lock corresponding to the first task, the method further includes:
if the distributed cache has a lock, the second node does not recover the first task, wherein the lock is formed by the third node by utilizing the distributed cache;
when the third node recovers the first task and the second node acquires the downtime information of the third node, the second node executes a step of judging whether the distributed cache has a lock corresponding to the first task.
Alternatively,
after the second node forms the lock using the distributed cache, the method further comprises:
the third node acquiring the downtime information of the first node detects whether a lock corresponding to the first task exists in the distributed cache;
if the distributed cache has a lock corresponding to the first task, the third node does not recover the first task.
Alternatively,
when the second node recovers the first task, the third node acquires the downtime information of the second node in an Akka message subscription mode, wherein a lock formed by the node is released when the node crashes;
the third node judges whether the distributed cache has a lock corresponding to the first task;
if the distributed cache does not have the lock, the third node forms the lock by utilizing the distributed cache;
the third node acquires the first task of the first node from the distributed cache;
the third node takes over the first task.
In summary, after the second node acquires the downtime information of the first node, if the distributed cache does not have a lock corresponding to the first task, the second node forms a lock by using the distributed cache, so that a third node which acquires the downtime information of the first node does not recover the first task after detecting that the lock formed by the second node exists in the distributed cache; the first task is stored in the distributed cache when the first node works normally, and the second node then acquires the first task of the first node from the distributed cache and takes over the first task. The distributed cache has the characteristic of improving the reading speed of instructions and data and can process a large amount of data, so the speed of recovering the tasks of the down node is improved. Because the distributed cache cooperates with the node cluster, a plurality of nodes can acquire the tasks of the down node from the distributed cache to recover them, while the lock ensures that only one of the nodes recovers the tasks of the down node; this improves the stability of the task recovery process and avoids the confusion generated when the node cluster recovers tasks.
Fig. 3 is a flowchart of a node downtime takeover method according to an embodiment of the present invention, where the method is applied to an Akka cluster environment composed of a plurality of nodes. Referring to fig. 3 in combination with the usage scenario shown in fig. 1 and the above contents, in an embodiment of the present invention, the flow of the node downtime takeover method includes:
step 301: the first node operates normally and stores the first task to the distributed cache.
In the embodiment of the present invention, the Akka cluster includes a first node, a second node, and a third node, that is, the first node, the second node, and the third node belong to the Akka cluster.
Of course in some embodiments the Akka cluster may include more nodes.
When the first node works normally, it executes the first task and stores the first task to the distributed cache in real time.
For example, the first node executes the tasks WorkerActor1 and WorkerActor2, where WorkerActor1 clears the data of database A and WorkerActor2 queries information B from the database. At this moment the first node works normally, and it stores the information of WorkerActor1 and WorkerActor2 into the distributed cache in real time and persists it.
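As a hedged illustration of step 301 (the cache stand-in and key naming below are assumptions; the embodiment does not specify a cache client), a node might persist its running tasks in real time like this:

```scala
import scala.collection.concurrent.TrieMap

// In-memory stand-in for the distributed cache; only the "write task state as you run" idea matters here.
object CacheStub {
  private val store = TrieMap.empty[String, String]
  def put(key: String, value: String): Unit = store.put(key, value)
  def get(key: String): Option[String]      = store.get(key)
}

object FirstNodePersistence {
  // Step 301: while the first node works normally it stores each running task to the cache.
  def persistRunningTasks(nodeId: String): Unit = {
    val tasks = Map(
      "WorkerActor1" -> "clear the data of database A",
      "WorkerActor2" -> "query information B from the database"
    )
    tasks.foreach { case (name, description) =>
      CacheStub.put(s"task:$nodeId:$name", description) // assumed key layout
    }
  }
}
```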
Step 302: the first node goes down, and work on the first task stops.
When the first node is down, it stops working on the first task.
Downtime, also called a crash, means that the node cannot work normally.
Step 303: and when the Akka cluster detects the downtime information of the first node, the Akka cluster sends the downtime information of the first node to the second node.
Because the first node, the second node and the third node form an Akka cluster environment, in the Akka distribution environment, all nodes in the cluster can send messages to each other and obtain the information of node creation and closing.
Therefore, when the Akka cluster detects the downtime information of the first node, the downtime information of the first node is sent to the second node and the third node. This is the Akka message subscription mode, that is, when the Akka cluster detects the downtime information of the first node, it sends the downtime information of the first node to the second node. The second node and the third node acquire the downtime information of the first node through this Akka message subscription mode.
Specifically, the principle of the Akka message subscription mode is as follows: the nodes of the Akka cluster form an abstract Akka cluster system, and the Akka system obtains the running state of each node by periodically sending heartbeats, that is, probe signals sent between the nodes. The second node subscribes to the running state of the whole system from the Akka cluster system, so it is notified whenever the state of any node of the Akka system changes, for example when a node starts or shuts down. When the first node shuts down, the Akka cluster system obtains this message and sends it to the second node.
Of course, in some embodiments, the Akka cluster system may send the downtime information of the first node to each node of the Akka cluster.
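A minimal sketch of this subscription mechanism using Akka's classic cluster API (the event classes are standard Akka ones; the listener actor itself is a hypothetical example, not code from the embodiment):

```scala
import akka.actor.{Actor, ActorLogging}
import akka.cluster.Cluster
import akka.cluster.ClusterEvent.{InitialStateAsEvents, MemberRemoved, UnreachableMember}

// Runs on the second (and third) node; the cluster extension delivers member-state changes as messages.
class DowntimeListener extends Actor with ActorLogging {
  private val cluster = Cluster(context.system)

  override def preStart(): Unit =
    cluster.subscribe(self, InitialStateAsEvents, classOf[MemberRemoved], classOf[UnreachableMember])

  override def postStop(): Unit = cluster.unsubscribe(self)

  override def receive: Receive = {
    case UnreachableMember(member) =>
      log.info("member {} is unreachable (possible downtime)", member.address)
    case MemberRemoved(member, previousStatus) =>
      log.info("member {} removed (was {}); start the lock check and takeover flow", member.address, previousStatus)
  }
}
```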
Step 304: the second node determines whether the distributed cache has a lock corresponding to the first task, if the distributed cache does not have a lock corresponding to the first task, step 305 is executed, and if the distributed cache has a lock corresponding to the first task, step 306 is executed.
After the second node acquires the downtime information of the first node through an Akka message subscription mode, the second node detects whether a lock corresponding to the first task exists in the distributed cache.
Every node which acquires the downtime information has the opportunity to execute the recovery operation of the first task. In order to prevent a plurality of nodes from executing the task recovery operation for the down node, which would cause confusion of the tasks and functional redundancy in the cluster, in the method of the embodiment of the invention the node which first acquires the downtime information executes the task recovery operation; the specific implementation is to determine, through the lock, which node acquired the downtime information first.
Each node which acquires the downtime information detects whether a lock corresponding to the tasks of the down node exists in the distributed cache; if a node does not detect the lock, this indicates that the node acquired the downtime information first, and it can execute the recovery operation of the tasks of the down node. So that the nodes which acquire the downtime information later do not execute the recovery operation, the node which acquires the downtime information first forms a distributed lock in the distributed cache; the later nodes then detect the lock corresponding to the tasks of the down node in the distributed cache and do not execute the recovery operation.
For example, after the second node acquires the downtime information of the first node, the second node determines whether a lock corresponding to the first task exists in the distributed cache; if the distributed cache does not have a lock, go to step 305; if the distributed cache has a lock, step 306 is performed.
Step 305: the second node forms a lock using the distributed cache.
If the second node does not detect a lock corresponding to the first task in the distributed cache, this indicates that the second node is the node which first acquired the downtime information, and the second node executes the recovery operation for the first task. So that other nodes do not also execute the recovery operation, the second node forms a lock by using the distributed cache; a third node which acquires the downtime information of the first node then detects that the lock formed by the second node exists in the distributed cache and does not recover the first task. In this way the distributed lock ensures the integrity of the downtime recovery process.
That is, in the method according to the embodiment of the present invention, after the second node forms a lock using the distributed cache, the third node that acquires the downtime information of the first node detects whether the distributed cache has a lock corresponding to the first task; if the distributed cache has a lock corresponding to the first task, the third node does not recover the first task.
The lock of the embodiment of the invention is a distributed lock, which is a mechanism for controlling synchronized access to shared resources among distributed systems. In a distributed system, the actions of its components often need to be coordinated. If one resource or a group of resources is shared between different systems, or between different hosts of the same system, access to these resources usually requires mutual exclusion to prevent interference and ensure consistency; in such cases a distributed lock is used.
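As a minimal sketch of how such a distributed lock could be formed on top of the cache, assuming the cache exposes an atomic create-if-absent operation (a SETNX-style call); the interface below is a placeholder, not an API named by the embodiment:

```scala
// Placeholder for an atomic "create the key only if it does not already exist" cache operation.
trait LockingCache {
  def setIfAbsent(key: String, value: String): Boolean // true => the caller formed the lock
  def exists(key: String): Boolean
}

object DowntimeLock {
  // Steps 304-305: returns true if this node formed the lock and should recover the first task.
  def tryFormLock(cache: LockingCache, taskId: String, nodeId: String): Boolean =
    cache.setIfAbsent(s"lock:$taskId", nodeId) // storing the node id records who owns the lock

  // Step 306 on the other nodes: an existing lock means another node is already recovering.
  def anotherNodeRecovering(cache: LockingCache, taskId: String): Boolean =
    cache.exists(s"lock:$taskId")
}
```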
Step 306: the second node does not resume the first task.
When the second node judges that the distributed cache has the lock corresponding to the first task, this indicates that another node acquired the downtime information first; in order to avoid the confusion and functional redundancy caused by a plurality of nodes recovering the first task, the second node does not recover the first task.
For example, when the third node acquires the downtime information of the first node first, the third node does not detect a lock corresponding to the first task in the distributed cache, so the third node forms the lock by using the distributed cache; after the second node detects this lock, the second node does not recover the first task.
In some embodiments, the third node may itself go down while recovering the first task. For the stability of the task recovery process, if the third node goes down while recovering the first task, the other nodes in the Akka cluster acquire the downtime information of the third node. When the second node acquires the downtime information of the third node, the third node has gone down and the lock formed by the third node in the distributed cache is released, so the second node executes the step of judging whether the distributed cache has a lock corresponding to the first task. If the lock does not exist in the distributed cache, the second node is the node which first acquired the downtime information of the third node, so the second node executes step 305 and recovers the first task in place of the third node; otherwise the second node executes step 306.
Step 307: the second node acquires a first task of the first node from the distributed cache;
the second node first acquires the downtime information of the first node, so that the second node executes the recovery operation of the first task after the second node forms a lock by using the distributed cache. Wherein the recovery operation includes retrieving a first task of the first node from the distributed cache and taking over the first task.
Because the first task is stored in the distributed cache when the first node works normally, and the distributed cache has the characteristic of improving the reading speed of instructions and data and can process a large amount of data, the speed of recovering the tasks of the down node is improved.
For example, as shown in fig. 1, after the first node goes down, the second node obtains information of the tasks WorkerActor1 and WorkerActor2 from the distributed cache.
Step 308: the second node takes over the first task;
and after the second node acquires the first task, executing the takeover of the first task.
The method for taking over the first task comprises the following steps:
1) and after the second node locally creates a second task according to the first task, the second node stores the second task to the distributed cache.
Or,
2) and the second node distributes the first task to the third node, so that the third node creates a third task according to the first task and stores the third task to the distributed cache.
For example, as shown in fig. 1, after the first node goes down, the second node takes over the task of the first node, the second node takes over the task WorkerActor1 locally, creates a task WorkerActor3, and stores the task WorkerActor3 in the distributed cache in real time. And the second node distributes the task WorkerActor2 to the third node, the third node creates a task WorkerActor4 according to the task, and stores the task WorkerActor4 in the distributed cache.
Step 309: the second node deletes the first task from the distributed cache.
After the second node completes taking over the first task of the down first node, the second node deletes the first task from the distributed cache to remove data that is no longer needed. Of course, in some embodiments, step 309 may not be performed.
For example, after the second node successfully completes the takeover, the information of the tasks WorkerActor1 and WorkerActor2 in the distributed cache is deleted. In this way, the tasks WorkerActor3 and WorkerActor4 completely replace the original tasks WorkerActor1 and WorkerActor2.
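To illustrate steps 308 and 309 under the same kind of assumptions as the earlier sketches (the cache interface, key layout and task representation are hypothetical), the two take-over options and the final cleanup might look like:

```scala
final case class TaskInfo(name: String, description: String)

// Hypothetical cache operations needed for takeover and cleanup.
trait TaskCache {
  def put(key: String, value: String): Unit
  def delete(key: String): Unit
}

class TakeoverHandler(cache: TaskCache, selfId: String) {
  // Option 1 (step 308): recreate the task locally, e.g. WorkerActor1 -> WorkerActor3, and re-persist it.
  def takeOverLocally(failed: TaskInfo, newName: String): Unit = {
    val recreated = failed.copy(name = newName)
    cache.put(s"task:$selfId:${recreated.name}", recreated.description)
    // ... start the recreated worker here ...
  }

  // Option 2 (step 308): hand the task to another healthy node, e.g. WorkerActor2 -> WorkerActor4
  // on the third node, which recreates and persists it itself.
  def distributeTo(sendToThirdNode: TaskInfo => Unit, failed: TaskInfo): Unit =
    sendToThirdNode(failed)

  // Step 309: once the takeover succeeds, the original entries of the down node can be deleted.
  def cleanUp(downNodeId: String, failed: TaskInfo): Unit =
    cache.delete(s"task:$downNodeId:${failed.name}")
}
```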
In some embodiments, in order to improve the stability of task recovery executed by the method according to the embodiments of the present invention, in the AKKA cluster, there is no root node dedicated to processing downtime, but all nodes may process downtime recovery.
For example, while the second node is recovering the first task, that is, while the second node is acquiring the first task from the distributed cache or taking over the first task, the second node goes down, and the Akka cluster detects that the second node is down. After the second node goes down, the lock formed by the second node in the distributed cache is released.
The third node acquires the downtime information of the second node through an Akka message subscription mode, and judges whether a lock corresponding to the first task exists in the distributed cache or not;
if the lock does not exist in the distributed cache, the third node forms the lock by using the distributed cache, which means the third node is the node which first acquired the downtime information of the second node, so the third node can recover the first task in place of the second node. The third node forms the lock in the distributed cache to prevent other nodes from recovering the first task, then acquires the first task of the first node from the distributed cache and takes it over. That is, the third node either reconstructs the first task locally or sends the first task to another healthy node, which reconstructs it, and the reconstructed task is cached in the distributed cache. After the third node takes over the first task successfully, it may delete the first task from the distributed cache. If the third node detects that the distributed cache has a lock corresponding to the first task, the third node does not recover the first task.
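The embodiment states that a lock formed by a node is released when that node goes down, without fixing the mechanism. One hedged way to sketch this (an assumption for illustration, not the disclosed method) is to record the owner in the lock value and let survivors clear locks whose owner has been removed from the cluster before re-running the check-and-acquire flow:

```scala
// Placeholder cache operations used by the failover sketch.
trait FailoverCache {
  def setIfAbsent(key: String, value: String): Boolean
  def get(key: String): Option[String]
  def delete(key: String): Unit
}

object LockFailover {
  /** Called on a surviving node (e.g. the third node) when the downtime information of the
    * lock owner (e.g. the second node) arrives. Returns true if this node now holds the lock
    * and should recover the first task in place of the down owner. */
  def onLockOwnerDown(cache: FailoverCache, taskId: String, downNodeId: String, selfId: String): Boolean = {
    val key = s"lock:$taskId"
    if (cache.get(key).contains(downNodeId)) cache.delete(key) // the owner is gone: treat its lock as released
    cache.setIfAbsent(key, selfId) // re-run the check: the first survivor to create the key takes over
  }
}
```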
In summary, after the second node acquires the downtime information of the first node in the Akka message subscription mode, if the distributed cache does not have a lock corresponding to the first task, the second node forms a lock by using the distributed cache, so that a third node which acquires the downtime information of the first node does not recover the first task after detecting that the distributed cache has the lock formed by the second node; the first task is stored in the distributed cache when the first node works normally, and the second node then acquires the first task of the first node from the distributed cache and takes over the first task. The first node, the second node and the third node belong to the Akka cluster, and the Akka message subscription mode is as follows: when the Akka cluster detects the downtime information of the first node, it sends the downtime information of the first node to the second node. The distributed cache has the characteristic of improving the reading speed of instructions and data and can process a large amount of dynamic data, so the speed of recovering the tasks of the down node is improved. Because the distributed cache cooperates with the Akka cluster, a plurality of nodes can acquire the tasks of the down node from the distributed cache to recover them, while the lock ensures that only one node recovers the tasks of the down node; this improves the stability of the task recovery process and avoids the confusion caused when the Akka cluster recovers tasks.
Fig. 4 is a schematic structural diagram of a node downtime takeover apparatus according to an embodiment of the present invention. The apparatus is integrated on a node, the node belongs to an Akka cluster, and the Akka cluster includes at least three nodes. Referring to fig. 4 in combination with the usage scenario shown in fig. 1 and the above contents, in an embodiment of the present invention, the node downtime takeover apparatus includes:
a downtime information obtaining unit 401, configured to obtain downtime information of a first node;
a determining unit 402, configured to determine whether a lock corresponding to a first task exists in the distributed cache, where the first task is stored in the distributed cache when the first node works normally;
a locking unit 403, configured to form a lock by using the distributed cache if the distributed cache does not have the lock corresponding to the first task;
a task obtaining unit 404, configured to obtain a first task of a first node from a distributed cache;
a takeover unit 405 for taking over a first task;
alternatively,
the downtime information acquiring unit 401 is further configured to acquire downtime information of the first node in an Akka message subscription manner;
the message subscription mode of Akka is as follows:
and when the Akka cluster detects the downtime information of the first node, sending the downtime information of the first node to the second node.
Alternatively,
a take-over unit 405 comprising:
a creation module 406 for creating a second task locally from the first task;
a storage module 407, configured to store the second task in the distributed cache;
or,
the takeover unit 405 is further configured to distribute the first task to the third node, so that the third node creates the third task according to the first task and then stores the third task in the distributed cache.
Alternatively,
the device still includes:
a deleting unit 408 configured to delete the first task from the distributed cache.
Alternatively,
the task obtaining unit 404 is further configured to not restore the first task if the distributed cache has a lock, where the lock is formed by the third node using the distributed cache;
when the third node is recovering the first task and the downtime information acquisition unit 401 acquires the downtime information of the third node, the determination unit 402 performs the step of determining whether a lock corresponding to the first task exists in the distributed cache.
To sum up, after the downtime information acquisition unit 401 acquires the downtime information of the first node, if the distributed cache does not have a lock corresponding to the first task, the locking unit 403 forms a lock by using the distributed cache, so that a third node which acquires the downtime information of the first node does not recover the first task after detecting that the distributed cache has the lock formed by the second node; the first task is stored in the distributed cache when the first node works normally, the task obtaining unit 404 then obtains the first task of the first node from the distributed cache, and the takeover unit 405 takes over the first task. The distributed cache has the characteristic of improving the reading speed of instructions and data and can process a large amount of dynamic data, so the speed of recovering the tasks of the down node is improved. Because the distributed cache cooperates with the node cluster, a plurality of nodes can acquire the tasks of the down node from the distributed cache to recover them, while the lock ensures that only one of the nodes recovers the tasks of the down node; this improves the stability of the task recovery process and avoids the confusion caused when the node cluster recovers tasks.
Fig. 5 is a schematic structural diagram of a node cluster system according to an embodiment of the present invention, where the node cluster system includes at least three nodes. Referring to fig. 5 in combination with the usage scenario shown in fig. 1 and the above contents, in an embodiment of the present invention, the nodes of the node cluster system 500 include a node downtime takeover apparatus,
the node downtime takeover apparatus is, for example, the node downtime takeover apparatus shown in the embodiment shown in fig. 4. For details, reference is made to the above exemplary embodiments, which are not described in detail here.
In summary, after the second node acquires the downtime information of the first node, if the distributed cache does not have a lock corresponding to the first task, the second node forms a lock by using the distributed cache, so that a third node which acquires the downtime information of the first node does not recover the first task after detecting that the lock formed by the second node exists in the distributed cache; the first task is stored in the distributed cache when the first node works normally, and the second node then acquires the first task of the first node from the distributed cache and takes over the first task. The distributed cache has the characteristic of improving the reading speed of instructions and data and can process a large amount of dynamic data, so the speed of recovering the tasks of the down node is improved. Because the distributed cache cooperates with the node cluster, a plurality of nodes can acquire the tasks of the down node from the distributed cache to recover them, while the lock ensures that only one of the nodes recovers the tasks of the down node; this improves the stability of the task recovery process and avoids the confusion caused when the node cluster recovers tasks.
It is clear to those skilled in the art that, for convenience and brevity of description, the specific working processes of the above-described systems, apparatuses and units may refer to the corresponding processes in the foregoing method embodiments, and are not described herein again.
In the several embodiments provided in the present application, it should be understood that the disclosed system, apparatus and method may be implemented in other manners. For example, the above-described apparatus embodiments are merely illustrative, and for example, the division of the units is only one logical division, and other divisions may be realized in practice, for example, a plurality of units or components may be combined or integrated into another system, or some features may be omitted, or not executed. In addition, the shown or discussed mutual coupling or direct coupling or communication connection may be an indirect coupling or communication connection through some interfaces, devices or units, and may be in an electrical, mechanical or other form.
The units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the units can be selected according to actual needs to achieve the purpose of the solution of the embodiment.
In addition, functional units in the embodiments of the present invention may be integrated into one processing unit, or each unit may exist alone physically, or two or more units are integrated into one unit. The integrated unit can be realized in a form of hardware, and can also be realized in a form of a software functional unit.
The integrated unit, if implemented in the form of a software functional unit and sold or used as a stand-alone product, may be stored in a computer readable storage medium. Based on such understanding, the technical solution of the present invention may be embodied in the form of a software product, which is stored in a storage medium and includes instructions for causing a computer device (which may be a personal computer, a server, or a network device) to execute all or part of the steps of the method according to the embodiments of the present invention. And the aforementioned storage medium includes: a U-disk, a removable hard disk, a Read-only Memory (ROM), a Random Access Memory (RAM), a magnetic disk or an optical disk, and other various media capable of storing program codes.
The above-mentioned embodiments are only used for illustrating the technical solutions of the present invention, and not for limiting the same; although the present invention has been described in detail with reference to the foregoing embodiments, it will be understood by those of ordinary skill in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some technical features may be equivalently replaced; and such modifications or substitutions do not depart from the spirit and scope of the corresponding technical solutions of the embodiments of the present invention.

Claims (10)

1. A method for taking over a node downtime, the method comprising:
the second node acquires the downtime information of the first node;
the second node judges whether a lock corresponding to a first task exists in the distributed cache or not, wherein the first task is stored in the distributed cache when the first node works normally;
if the distributed cache does not have the lock corresponding to the first task, the second node forms the lock by using the distributed cache;
the second node acquires the first task of the first node from the distributed cache;
the second node takes over the first task.
2. The method of claim 1,
the second node acquiring the downtime information of the first node comprises the following steps:
the second node acquires the downtime information of the first node in an Akka message subscription mode;
the message subscription mode of the Akka is as follows:
and when the Akka cluster detects the downtime information of the first node, sending the downtime information of the first node to the second node.
3. The method of claim 1,
the second node takes over the first task, including:
the second node locally creates a second task according to the first task;
the second node stores the second task to the distributed cache;
or,
and the second node distributes the first task to the third node so that the third node stores the third task to the distributed cache after creating the third task according to the first task.
4. The method according to any one of claims 1 to 3,
after the second node takes over the first task, the method further comprises:
the second node deletes the first task from the distributed cache.
5. The method of claim 1,
after the second node determines whether the distributed cache has the lock corresponding to the first task, the method further includes:
if the lock exists in the distributed cache, the second node does not recover the first task, wherein the lock is formed by the third node by utilizing the distributed cache;
when the third node recovers the first task, and when the second node acquires the downtime information of the third node, the second node executes a step of judging whether the distributed cache has a lock corresponding to the first task.
6. The method of claim 1,
after the second node forms a lock using the distributed cache, the method further comprises:
the third node which acquires the downtime information of the first node detects whether the distributed cache has a lock corresponding to the first task;
and if the distributed cache has a lock corresponding to the first task, the third node does not recover the first task.
7. The method of claim 2,
when the second node recovers the first task, the third node acquires the downtime information of the second node in an Akka message subscription mode, wherein a lock formed by the node is released when the node is down;
the third node judges whether the distributed cache has a lock corresponding to the first task;
if the lock does not exist in the distributed cache, the third node forms the lock by using the distributed cache;
the third node acquires the first task of the first node from the distributed cache;
the third node takes over the first task.
8. A node downtime takeover apparatus, the apparatus comprising:
the downtime information acquisition unit is used for acquiring downtime information of a first node;
the judging unit is used for judging whether a lock corresponding to a first task exists in the distributed cache, wherein the first task is stored in the distributed cache when the first node works normally;
a lock forming unit, configured to form a lock by using the distributed cache by the second node if the lock corresponding to the first task does not exist in the distributed cache;
a task obtaining unit, configured to obtain the first task of the first node from the distributed cache;
a takeover unit for taking over the first task.
9. The apparatus of claim 8,
the downtime information acquisition unit is also used for acquiring the downtime information of the first node in an Akka message subscription mode;
the message subscription mode of the Akka is as follows:
and when the Akka cluster detects the downtime information of the first node, sending the downtime information of the first node to the second node.
10. A node cluster system, characterized in that the node cluster system comprises at least three nodes, the nodes comprise a node downtime takeover apparatus,
wherein the node downtime takeover apparatus is the node downtime takeover apparatus of claim 8 or 9.
CN201610979682.4A 2016-11-08 2016-11-08 Node downtime takeover method and device, and node cluster system Pending CN108063782A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201610979682.4A CN108063782A (en) Node downtime takeover method and device, and node cluster system

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201610979682.4A CN108063782A (en) Node downtime takeover method and device, and node cluster system

Publications (1)

Publication Number Publication Date
CN108063782A true CN108063782A (en) 2018-05-22

Family

ID=62136814

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201610979682.4A Pending CN108063782A (en) Node downtime takeover method and device, and node cluster system

Country Status (1)

Country Link
CN (1) CN108063782A (en)

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111459963A (en) * 2020-04-07 2020-07-28 中国建设银行股份有限公司 Core accounting transaction concurrent processing method and device
CN112162698A (en) * 2020-09-17 2021-01-01 北京浪潮数据技术有限公司 Cache partition reconstruction method, device, equipment and readable storage medium
CN113835930A (en) * 2021-09-26 2021-12-24 杭州谐云科技有限公司 Cache service recovery method, system and device based on cloud platform

Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103458036A (en) * 2013-09-03 2013-12-18 杭州华三通信技术有限公司 Access device and method of cluster file system
CN103501338A (en) * 2013-09-30 2014-01-08 华为技术有限公司 Lock recovery method, equipment and network file system
CN103577546A (en) * 2013-10-12 2014-02-12 北京奇虎科技有限公司 Method and equipment for data backup, and distributed cluster file system
CN103684941A (en) * 2013-11-23 2014-03-26 广东新支点技术服务有限公司 Arbitration server based cluster split-brain prevent method and device
CN104115469A (en) * 2011-09-23 2014-10-22 混合电路逻辑有限公司 System for live -migration and automated recovery of applications in a distributed system
US20140379775A1 (en) * 2013-08-12 2014-12-25 Fred Korangy Actor system and method for analytics and processing of big data
CN105426271A (en) * 2015-12-22 2016-03-23 华为技术有限公司 Lock management method and device for distributed storage system
CN105912402A (en) * 2016-04-11 2016-08-31 深圳益邦阳光有限公司 Scheduling method and apparatus based on Actor model

Patent Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104115469A (en) * 2011-09-23 2014-10-22 混合电路逻辑有限公司 System for live -migration and automated recovery of applications in a distributed system
US20140379775A1 (en) * 2013-08-12 2014-12-25 Fred Korangy Actor system and method for analytics and processing of big data
CN103458036A (en) * 2013-09-03 2013-12-18 杭州华三通信技术有限公司 Access device and method of cluster file system
CN103501338A (en) * 2013-09-30 2014-01-08 华为技术有限公司 Lock recovery method, equipment and network file system
CN103577546A (en) * 2013-10-12 2014-02-12 北京奇虎科技有限公司 Method and equipment for data backup, and distributed cluster file system
CN103684941A (en) * 2013-11-23 2014-03-26 广东新支点技术服务有限公司 Arbitration server based cluster split-brain prevent method and device
CN105426271A (en) * 2015-12-22 2016-03-23 华为技术有限公司 Lock management method and device for distributed storage system
CN105912402A (en) * 2016-04-11 2016-08-31 深圳益邦阳光有限公司 Scheduling method and apparatus based on Actor model

Cited By (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111459963A (en) * 2020-04-07 2020-07-28 中国建设银行股份有限公司 Core accounting transaction concurrent processing method and device
CN111459963B (en) * 2020-04-07 2024-03-15 中国建设银行股份有限公司 Concurrent processing method and device for core accounting transaction
CN112162698A (en) * 2020-09-17 2021-01-01 北京浪潮数据技术有限公司 Cache partition reconstruction method, device, equipment and readable storage medium
CN112162698B (en) * 2020-09-17 2024-02-13 北京浪潮数据技术有限公司 Cache partition reconstruction method, device, equipment and readable storage medium
CN113835930A (en) * 2021-09-26 2021-12-24 杭州谐云科技有限公司 Cache service recovery method, system and device based on cloud platform
CN113835930B (en) * 2021-09-26 2024-02-06 杭州谐云科技有限公司 Cache service recovery method, system and device based on cloud platform

Similar Documents

Publication Publication Date Title
CN105389230B (en) A kind of continuous data protection system and method for combination snapping technique
US10565071B2 (en) Smart data replication recoverer
CN103294675B (en) Data-updating method and device in a kind of distributed memory system
US8818954B1 (en) Change tracking
CN110807064A (en) Data recovery device in RAC distributed database cluster system
CN103516736A (en) Data recovery method of distributed cache system and a data recovery device of distributed cache system
CN104077380B (en) A kind of data de-duplication method, apparatus and system
CN109643310B (en) System and method for redistribution of data in a database
CN107919977B (en) Online capacity expansion and online capacity reduction method and device based on Paxos protocol
CN110196818B (en) Data caching method, caching device and storage system
CN110351313B (en) Data caching method, device, equipment and storage medium
US10838825B2 (en) Implementing snapshot sets for consistency groups of storage volumes
WO2013163864A1 (en) Data persistence processing method and device and database system
CN110825562B (en) Data backup method, device, system and storage medium
CN103500130A (en) Method for backing up dual-computer hot standby data in real time
CN104899071A (en) Recovery method and recovery system of virtual machine in cluster
WO2016061956A1 (en) Data processing method for distributed file system and distributed file system
US11748215B2 (en) Log management method, server, and database system
US12045137B2 (en) Data backup method, apparatus, and system
US10346610B1 (en) Data protection object store
CN108063782A (en) Node downtime takeover method and device, and node cluster system
CN104461773A (en) Backup deduplication method of virtual machine
US20090063486A1 (en) Data replication using a shared resource
CN105956032A (en) Cache data synchronization method, system and apparatus
CN101937378B (en) Method for carrying out back-up protection on data of storage equipment and computer system

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
CB02 Change of applicant information

Address after: 100080 No. 401, 4th Floor, Haitai Building, 229 North Fourth Ring Road, Haidian District, Beijing

Applicant after: Beijing Guoshuang Technology Co.,Ltd.

Address before: 100086 Cuigong Hotel, 76 Zhichun Road, Shuangyushu District, Haidian District, Beijing

Applicant before: Beijing Guoshuang Technology Co.,Ltd.

CB02 Change of applicant information
RJ01 Rejection of invention patent application after publication

Application publication date: 20180522

RJ01 Rejection of invention patent application after publication