CN113703669A

CN113703669A - Management method, system, equipment and storage medium for cache partition

Info

Publication number: CN113703669A
Application number: CN202110807418.3A
Authority: CN
Inventors: 侯红生; 刘文志
Original assignee: Suzhou Inspur Intelligent Technology Co Ltd
Current assignee: Suzhou Inspur Intelligent Technology Co Ltd
Priority date: 2021-07-16
Filing date: 2021-07-16
Publication date: 2021-11-26
Anticipated expiration: 2041-07-16
Also published as: CN113703669B

Abstract

The application discloses a method for managing a cache partition, comprising: receiving an event notification sent by a cluster; after receiving information fed back by the service end indicating that the task corresponding to the previous event notification has been executed, updating node information of the cache partition according to the previous event notification, and sending the task corresponding to the current event notification to the service end; and after receiving information fed back by the service end indicating that the task corresponding to the current event notification has been executed, updating the node information of the cache partition according to the current event notification. Applying this scheme avoids abnormal node information for the cache partition. The application also discloses a management system, a device and a storage medium for the cache partition, which have corresponding technical effects.

Description

Management method, system, equipment and storage medium for cache partition

Technical Field

The present invention relates to the field of storage technologies, and in particular, to a method, a system, a device, and a storage medium for managing a cache partition.

Background

As storage requirements rise, high-performance clusters formed from multiple nodes are being applied more and more widely.

In practice, particularly in single partition mode, the node information of a cache partition is sometimes abnormal. For example, a cache partition that should exist on only one node may exist on both nodes of an IO group. When that cache partition is deleted, only the copy on one node is removed and the copy on the other node remains, which causes problems in the service configuration. This situation mostly occurs during the T2 failure recovery process of the cluster. A T2 failure means that all nodes in an IO group exit the cluster at the same time because of a failure, and recovery from a T2 failure means that all nodes in the IO group rejoin the cluster at the same time.

In summary, how to effectively avoid abnormal node information of the cache partition is a technical problem that those skilled in the art need to solve.

Disclosure of Invention

The invention aims to provide a management method, a management system, a management device and a storage medium of a cache partition, so as to effectively avoid node information exception of the cache partition.

In order to solve the technical problems, the invention provides the following technical scheme:

a management method of a cache partition comprises the following steps:

receiving an event notification sent by a cluster;

after receiving information fed back by the service end indicating that the task corresponding to the previous event notification has been executed, updating node information of the cache partition according to the previous event notification, and sending the task corresponding to the current event notification to the service end;

and after receiving information fed back by the service end indicating that the task corresponding to the current event notification has been executed, updating the node information of the cache partition according to the current event notification.

Preferably, the receiving the event notification sent by the cluster includes:

receiving an event notification sent by a cluster and putting the event notification into a preset notification queue;

correspondingly, after the node information of the cache partition is updated according to the event notification, the method further includes:

and deleting the event notification in the notification queue.

Preferably, when the received event notification is an event notification indicating a failure of the first node, the sending a task corresponding to the current event notification to the service end includes:

sending the task corresponding to the current event notification to the service end so that a surviving node takes over the cache partition of the first node;

when the received event notification is an event notification indicating failure recovery of the first node, the sending a task corresponding to the current event notification to the service end includes:

sending the task corresponding to the current event notification to the service end so that the taken-over cache partition of the first node is restored to the first node.

Preferably, the cache partition of the cluster is in a single partition mode.

Preferably, the method further comprises the following steps:

recording information when an event notification indicating failure recovery of the first node is received within a first time period after an event notification indicating a failure of the first node is received.

A management system for a cache partition, comprising:

an event notification receiving unit, configured to receive an event notification sent by a cluster;

the execution unit is used for updating the node information of the cache partition according to the last event notification after receiving the information which is fed back by the service end and indicates that the task corresponding to the last event notification is executed, and sending the task corresponding to the current event notification to the service end; and after receiving the information which is fed back by the service end and represents that the task corresponding to the event notification is executed, updating the node information of the cache partition according to the event notification.

Preferably, the event notification receiving unit is specifically configured to:

receive an event notification sent by the cluster and put the event notification into a preset notification queue;

correspondingly, the system further comprises:

a queue updating unit, configured to delete the current event notification from the notification queue after the execution unit updates the node information of the cache partition according to the current event notification.

Preferably, when the received event notification is an event notification indicating a failure of the first node, the execution unit sends the task corresponding to the current event notification to the service end so that a surviving node takes over the cache partition of the first node;

when the received event notification is an event notification indicating failure recovery of the first node, the execution unit sends the task corresponding to the current event notification to the service end so that the taken-over cache partition of the first node is restored to the first node.

A management device of a cache partition, comprising:

a memory for storing a computer program;

a processor for executing said computer program to implement the steps of the method for managing cache partitions of any of the above.

A computer-readable storage medium, having stored thereon a computer program which, when being executed by a processor, carries out the steps of the method for managing a cache partition according to any one of the preceding claims.

The applicant has observed that, in the conventional T2 failure recovery process, the cache partition module restores its own service configuration immediately upon each event notification sent by the cluster. During T2 failure recovery, a node sometimes fails again, for example because of a warm restart, and then recovers and rejoins the cluster within a short time. In this special scenario, when the cache partition is in single partition mode, the cache partition module starts a takeover process for the cache partition of the failed node according to the current node information. Because the failed node recovers and rejoins the cluster after only a short interval, the cluster notifies the cache partition module of the node-join event while the takeover process is still running, and the cache partition node information is updated. The cache partition module then creates a cache partition on the recovered node according to the new node information. However, because the interval between the node failing and rejoining the cluster is so short, the earlier failure takeover process has not been executed correctly, so the cache partition may end up being created on both nodes.

According to the scheme of the present application, after an event notification sent by the cluster is received, the node information of the cache partition is not updated immediately, nor is a new task executed immediately. Instead, only when information fed back by the service end indicates that the task corresponding to the previous event notification has been executed is the node information of the cache partition updated according to the previous event notification, after which the task corresponding to the current event notification is sent to the service end. Likewise, only when information fed back by the service end indicates that the task corresponding to the current event notification has been executed is the node information of the cache partition updated according to the current event notification. In other words, the node information of the cache partition is updated not on receipt of an event notification but only once the task corresponding to that event notification has been executed, so the abnormal node information seen in the conventional scheme does not occur.

Drawings

In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings used in the description of the embodiments or the prior art will be briefly described below, it is obvious that the drawings in the following description are only some embodiments of the present invention, and for those skilled in the art, other drawings can be obtained according to the drawings without creative efforts.

FIG. 1 is a flowchart illustrating an embodiment of a method for managing a cache partition according to the present invention;

FIG. 2 is a schematic structural diagram of a management system of a cache partition according to the present invention.

Detailed Description

The core of the invention is to provide a management method of the cache partition, which can avoid the condition that the node information of the cache partition is abnormal.

In order that those skilled in the art will better understand the disclosure, the invention will be described in further detail with reference to the accompanying drawings and specific embodiments. It is to be understood that the described embodiments are merely exemplary of the invention, and not restrictive of the full scope of the invention. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.

Referring to FIG. 1, which is a flowchart of an implementation of a method for managing a cache partition according to the present invention, the method may include the following steps:

Step S101: receiving an event notification sent by the cluster.

Specifically, the cluster sends information such as node joining and node exiting in the form of event notification, and the partition module, i.e., the cache partition module, may receive the event notification sent by the cluster. Due to the wide application of SSD (Solid State Disk), the cache partition of the present application may be specifically an SSD cache partition.

Step S102: after receiving information fed back by the service end indicating that the task corresponding to the previous event notification has been executed, updating the node information of the cache partition according to the previous event notification, and sending the task corresponding to the current event notification to the service end.

The present application does not update the node information immediately after receiving an event notification sent by the cluster. This is because the applicant has observed that, in the conventional T2 failure recovery process, the cache partition module restores its own service configuration immediately upon each event notification sent by the cluster. During T2 failure recovery, a node sometimes fails again, for example because of a warm restart, and then recovers and rejoins the cluster within a short time. In this special scenario, when the cache partition is in single partition mode, the cache partition module starts a takeover process for the cache partition of the failed node according to the current node information. Because the failed node recovers and rejoins the cluster after only a short interval, the cluster notifies the cache partition module of the node-join event while the takeover process is still running, and the cache partition node information is updated. The cache partition module then creates a cache partition on the recovered node according to the new node information. However, because the interval between the node failing and rejoining the cluster is so short, the earlier failure takeover process has not been executed correctly, so the cache partition may end up being created on both nodes.

Through the above analysis, the applicant also found that the cache partition becomes abnormal only when a node fails and then returns to normal within a short time, so that the node information has already been updated by a newly sent event notification while the earlier failure takeover process has not been executed correctly. Abnormal node information of the cache partition mainly arises in the T2 failure recovery process because, in that process, a node relatively often exits the cluster (for example because of a warm restart) and rejoins within a short time, whereas in other situations a node rarely exits the cluster and recovers so quickly. Therefore, the scheme of the present application avoids abnormal node information of the cache partition not only during T2 failure recovery but also when a node fails for other reasons and returns to normal within a short time.

After an event notification sent by the cluster has been received, receiving information fed back by the service end indicating that the task corresponding to the previous event notification has been executed means that this task is complete, so the present application then updates the node information of the cache partition according to the previous event notification. It follows that each event notification must be retained from the time it is received at least until its corresponding task has been executed correctly and the node information of the cache partition has been updated according to it. Event notifications can be retained in various ways, as long as this requirement is met. For example, in the embodiment described later, event notifications are placed in a queue; alternatively, in some cases, each event notification may be stored in a preset storage space that is cleaned up once it is full.

After the node information of the cache partition has been updated according to the previous event notification, the present application sends the task corresponding to the current event notification to the service end. The service end, which is generally referred to as the agent, executes the task.

Step S103: after receiving information fed back by the service end indicating that the task corresponding to the current event notification has been executed, updating the node information of the cache partition according to the current event notification.

After the task corresponding to the current event notification is sent to the service end, the service end executes the task. Regardless of whether the cache partition module receives a new event notification during execution, it must wait until it receives information fed back by the service end indicating that the task corresponding to the current event notification has been executed, that is, that the task has been executed correctly, and only then updates the node information of the cache partition according to the current event notification.
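The ordering of steps S101 to S103 can be illustrated with a minimal Python sketch. It is not taken from the application: the class name `DeferredUpdater`, the `(kind, node)` event encoding, the `send_task` callback and the representation of node information as a set of online nodes are all assumptions made for illustration. Notifications are retained in a plain list with a cursor, which corresponds loosely to keeping them in a preset storage space.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Set, Tuple

Event = Tuple[str, str]  # hypothetical encoding: (kind, node), kind is "fail" or "join"


@dataclass
class DeferredUpdater:
    send_task: Callable[[Event], None]                # hands a task to the service end
    node_info: Set[str] = field(default_factory=set)  # nodes believed to be online
    log: List[Event] = field(default_factory=list)    # retained event notifications
    cursor: int = 0                                   # notification whose task is in flight
    busy: bool = False

    def receive(self, event: Event) -> None:
        """S101: retain the notification; node_info is left untouched."""
        self.log.append(event)
        if not self.busy:
            self._dispatch()

    def task_done(self) -> None:
        """S102/S103: feedback arrived, so node_info may now be updated."""
        if not self.busy:
            raise RuntimeError("no task is in flight")
        kind, node = self.log[self.cursor]
        if kind == "fail":
            self.node_info.discard(node)
        else:
            self.node_info.add(node)
        self.cursor += 1
        self.busy = False
        if self.cursor < len(self.log):   # a later notification is waiting
            self._dispatch()

    def _dispatch(self) -> None:
        self.busy = True
        self.send_task(self.log[self.cursor])
```

In this sketch a "join" notification that arrives while the "fail" task is still running changes nothing until `task_done()` is called for the earlier task, which is the property the scheme relies on.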

In a specific embodiment of the present invention, step S101 may specifically include:

receiving an event notification sent by the cluster and putting the event notification into a preset notification queue;

correspondingly, after the node information of the cache partition is updated according to the current event notification in step S103, the method may further include:

deleting the current event notification from the notification queue.

As described above, after receiving an event notification sent by the cluster, the present application does not immediately update the node information of the cache partition according to that event notification; the update is made only after the corresponding task has been executed. The event notification must therefore be retained at least until its task has been executed correctly, and during this time new event notifications may continue to arrive at the cache partition module. In this embodiment, each event notification received from the cluster is simply placed into a preset notification queue. A newly received event notification is placed at the tail of the queue, and the head of the queue is the event notification whose task is currently being executed.

In this embodiment, since the preset notification queue is used to sort the event notifications, after the node information of the cache partition is updated according to the current event notification, the current event notification does not need to be retained, and thus the current event notification in the notification queue can be deleted.
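The queue embodiment might be sketched as follows, using Python's `collections.deque`. This is only an illustration under assumptions: the class name `NotificationQueue`, the dictionary layout of a notification (`kind`, `node`), the `send_task` callback and the `node_info` mapping from node name to an online flag are hypothetical and do not come from the application.

```python
from collections import deque
from typing import Callable, Deque, Dict, Optional


class NotificationQueue:
    """Head = notification whose task is running; tail = newest notification."""

    def __init__(self, send_task: Callable[[dict], None], node_info: Dict[str, bool]):
        self.queue: Deque[dict] = deque()
        self.send_task = send_task
        self.node_info = node_info  # node name -> online flag (hypothetical layout)

    def put(self, notification: dict) -> None:
        """Place a newly received notification at the tail of the queue."""
        self.queue.append(notification)
        if len(self.queue) == 1:          # nothing was running, so start this task
            self.send_task(notification)

    def on_feedback(self) -> Optional[dict]:
        """Called when the service end reports that the head task has finished."""
        if not self.queue:
            return None
        done = self.queue[0]
        # Node information is updated only now, after the task has been executed.
        self.node_info[done["node"]] = done["kind"] == "join"
        self.queue.popleft()              # the handled notification is deleted
        if self.queue:
            self.send_task(self.queue[0])  # the new head becomes the running task
        return done
```

A `deque` is chosen here because appending at the tail and removing from the head are both constant-time operations, which matches the head and tail roles described above.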

In a specific embodiment of the present invention, when the received event notification is an event notification indicating a failure of the first node, the sending of the task corresponding to the current event notification to the service end described in step S102 includes:

sending the task corresponding to the current event notification to the service end so that a surviving node takes over the cache partition of the first node;

when the received event notification is an event notification indicating failure recovery of the first node, the sending of the task corresponding to the current event notification to the service end described in step S102 includes:

sending the task corresponding to the current event notification to the service end so that the taken-over cache partition of the first node is restored to the first node.

In this embodiment, when the received event notification is an event notification indicating that the first node has failed, a task corresponding to the event notification may be sent to the service end, so as to take over the cache partition of the first node by using the surviving node. The first node may have 1 or more cache partitions, and may implement takeover of these cache partitions by 1 or more surviving nodes, and a specific takeover rule may be set according to an actual need, which is not described in this application.

When the received event notification indicates failure recovery of the first node, the first node has returned to normal, and a task corresponding to the event notification can be sent to the service end so that the taken-over cache partition of the first node is restored to the first node. The first node may be any node in the cluster.
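The two kinds of task can be pictured with a small Python sketch operating on a partition-to-node map. The application leaves the takeover rule open, so the round-robin assignment used here is purely an assumption, as are the function names `takeover`, `restore` and `run_task`, the event kinds `"fail"` and `"recover"`, and the `owner` and `home` dictionaries.

```python
from typing import Dict, List


def takeover(owner: Dict[str, str], failed: str, survivors: List[str]) -> None:
    """Move every partition currently on `failed` to a surviving node.

    Round-robin over `survivors` is an assumed rule, not one from the source.
    """
    if not survivors:
        raise ValueError("no surviving node available")
    moved = [part for part, node in sorted(owner.items()) if node == failed]
    for i, part in enumerate(moved):
        owner[part] = survivors[i % len(survivors)]


def restore(owner: Dict[str, str], home: Dict[str, str], recovered: str) -> None:
    """Return every partition whose home node is `recovered` back to that node."""
    for part, node in home.items():
        if node == recovered:
            owner[part] = recovered


def run_task(kind: str, node: str, owner: Dict[str, str],
             home: Dict[str, str], online: List[str]) -> None:
    """Pick the task that corresponds to the kind of event notification."""
    if kind == "fail":
        takeover(owner, node, [n for n in online if n != node])
    elif kind == "recover":
        restore(owner, home, node)
    else:
        raise ValueError(f"unknown event kind: {kind}")
```

Here `home` records where each partition belongs in normal operation and `owner` records where it currently lives, so a takeover followed by a restore returns `owner` to the same state as `home`.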

As described above, in the conventional scheme the node information of the cache partition becomes abnormal mainly in single partition mode, so the cache partition of the cluster in the present application may be in single partition mode. It should be noted, however, that although abnormal node information is less likely in other partition modes, the scheme of the present application can still be adopted in those modes without affecting its implementation.

In an embodiment of the present invention, the method may further include:

when an event notification indicating failure recovery of the first node is received within a first time period after the event notification indicating failure of the first node is received, information is recorded.

If an event notification indicating failure recovery of the first node is received within the first time period after an event notification indicating a failure of the first node is received, this embodiment records information such as the number of the first node and the date on which the abnormal situation occurred. This helps staff to count and handle such special cases, that is, it makes operation and maintenance more convenient.
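One possible way to detect and record such a quick failure and recovery is sketched below in Python. The class name `FlapRecorder`, the `window_s` parameter standing in for the first time period, and the fields of each record are assumptions for illustration; the application only states that information such as the node number and the date is recorded.

```python
import time
from typing import Dict, List, Optional


class FlapRecorder:
    """Records nodes that recover within `window_s` seconds of failing."""

    def __init__(self, window_s: float):
        self.window_s = window_s                 # the "first time period" (assumed unit: seconds)
        self.failed_at: Dict[str, float] = {}    # node -> time of its failure notification
        self.records: List[dict] = []            # recorded quick fail/recover cases

    def on_fail(self, node: str, now: Optional[float] = None) -> None:
        self.failed_at[node] = time.time() if now is None else now

    def on_recover(self, node: str, now: Optional[float] = None) -> bool:
        """Return True and store a record if recovery fell inside the window."""
        now = time.time() if now is None else now
        t_fail = self.failed_at.pop(node, None)
        if t_fail is not None and now - t_fail <= self.window_s:
            self.records.append(
                {"node": node, "failed_at": t_fail, "recovered_at": now}
            )
            return True
        return False
```

The optional `now` argument lets timestamps be injected, so the window logic can be exercised without waiting in real time.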

By applying the technical scheme provided by the embodiment of the invention, after an event notification sent by the cluster is received, the node information of the cache partition is not updated immediately, nor is a new task executed immediately. Instead, only when information fed back by the service end indicates that the task corresponding to the previous event notification has been executed is the node information of the cache partition updated according to the previous event notification, after which the task corresponding to the current event notification is sent to the service end. Likewise, only when information fed back by the service end indicates that the task corresponding to the current event notification has been executed is the node information of the cache partition updated according to the current event notification. In other words, the node information of the cache partition is updated not on receipt of an event notification but only once the task corresponding to that event notification has been executed, so the abnormal node information seen in the conventional scheme does not occur.

Corresponding to the above method embodiments, the embodiments of the present invention further provide a management system for cache partitions, which can be referred to in correspondence with the above.

Referring to FIG. 2, a schematic structural diagram of a management system of a cache partition in the present invention is shown, including:

an event notification receiving unit 201, configured to receive an event notification sent by a cluster;

an execution unit 202, configured to update node information of a cache partition according to a previous event notification after receiving information indicating that a task corresponding to the previous event notification is executed and fed back by a service end, and send a task corresponding to the current event notification to the service end; and after receiving information which is fed back by the service end and represents that the task corresponding to the current event notification is executed, updating the node information of the cache partition according to the current event notification.

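In an embodiment of the present invention, the event notification receiving unit 201 is specifically configured to:

receive an event notification sent by the cluster and put the event notification into a preset notification queue;

correspondingly, the system further includes:

a queue updating unit, configured to delete the current event notification from the notification queue after the execution unit 202 updates the node information of the cache partition according to the current event notification.

In an embodiment of the present invention, when the received event notification is an event notification indicating a failure of the first node, the execution unit 202 sends the task corresponding to the current event notification to the service end so that a surviving node takes over the cache partition of the first node;

when the received event notification is an event notification indicating failure recovery of the first node, the execution unit 202 sends the task corresponding to the current event notification to the service end so that the taken-over cache partition of the first node is restored to the first node.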

In a specific embodiment of the present invention, the cache partition of the cluster is in a single partition mode.

In one embodiment of the present invention, the system further includes:

an information recording unit, configured to record information when an event notification indicating failure recovery of the first node is received within a first time period after an event notification indicating a failure of the first node is received.

Corresponding to the above method and system embodiments, the embodiments of the present invention further provide a management device for cache partitions and a computer-readable storage medium, which may be referred to in correspondence with the above. The computer readable storage medium has a computer program stored thereon, and the computer program, when executed by a processor, implements the steps of the method for managing a cache partition in any of the above embodiments. A computer-readable storage medium as referred to herein may include Random Access Memory (RAM), memory, Read Only Memory (ROM), electrically programmable ROM, electrically erasable programmable ROM, registers, hard disk, a removable disk, a CD-ROM, or any other form of storage medium known in the art.

The management device of the cache partition may include:

a memory for storing a computer program;

a processor for executing a computer program to implement the steps of the method for managing cache partitions in any of the above embodiments.

It is further noted that, herein, relational terms such as first and second, and the like may be used solely to distinguish one entity or action from another entity or action without necessarily requiring or implying any actual such relationship or order between such entities or actions. Also, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising an … …" does not exclude the presence of other identical elements in a process, method, article, or apparatus that comprises the element.

Those of skill would further appreciate that the various illustrative elements and algorithm steps described in connection with the embodiments disclosed herein may be implemented as electronic hardware, computer software, or combinations of both, and that the various illustrative components and steps have been described above generally in terms of their functionality in order to clearly illustrate this interchangeability of hardware and software. Whether such functionality is implemented as hardware or software depends upon the particular application and design constraints imposed on the implementation. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present invention.

The principle and the implementation of the present invention are explained in the present application by using specific examples, and the above description of the embodiments is only used to help understanding the technical solution and the core idea of the present invention. It should be noted that, for those skilled in the art, it is possible to make various improvements and modifications to the present invention without departing from the principle of the present invention, and those improvements and modifications also fall within the scope of the claims of the present invention.

Claims

1. A method for managing a cache partition, comprising:

receiving an event notification sent by a cluster;

2. The method for managing the cache partition according to claim 1, wherein the receiving the event notification sent by the cluster includes:

and deleting the event notification in the notification queue.

3. The method according to claim 1, wherein when the received event notification is an event notification indicating a failure of the first node, the sending a task corresponding to the current event notification to the service end includes:

4. The method for managing the cache partition according to claim 1, wherein the cache partition of the cluster is in a single partition mode.

5. The method for managing the cache partition according to any one of claims 1 to 4, further comprising:

6. A management system for a cache partition, comprising:

7. The management system of a cache partition according to claim 6, wherein the event notification receiving unit is specifically configured to:

correspondingly, the method also comprises the following steps:

8. The system according to claim 6, wherein when the received event notification is an event notification indicating a failure of the first node, the execution unit sends a task corresponding to the current event notification to the service end, and the task includes:

9. A management apparatus for a cache partition, comprising:

a memory for storing a computer program;

a processor for executing said computer program to implement the steps of the method for managing a cache partition according to any one of claims 1 to 5.

10. A computer-readable storage medium, on which a computer program is stored which, when being executed by a processor, carries out the steps of the method for managing cache partitions according to any one of claims 1 to 5.