CN113032106A - Automatic detection method and device for IO suspension abnormality of computing node

Info

Publication number: CN113032106A
Application number: CN202110477121.5A
Authority: CN
Inventors: 张志雄; 魏亮; 杨晓峰; 许振峰
Original assignee: Industrial and Commercial Bank of China Ltd ICBC
Current assignee: Industrial and Commercial Bank of China Ltd ICBC
Priority date: 2021-04-29
Filing date: 2021-04-29
Publication date: 2021-06-25
Anticipated expiration: 2041-04-29
Also published as: CN113032106B

Abstract

The invention discloses a method and a device for automatically detecting IO suspension abnormality of a computing node, and relates to the technical field of cloud computing. The method comprises: collecting, in real time, the IO states of all virtual machines on a computing node, wherein the IO states comprise a return state and a suspension state; counting, at fixed intervals, the number of IOs in the suspension state and the total number of IOs, and determining the ratio of the number of IOs in the suspension state to the total number of IOs; and determining whether the IO of the computing node is in an abnormal state according to the magnitude relation between the ratio and a preset threshold. The method and the device allow IO abnormality of a computing node to be discovered promptly, so that effective handling measures can be taken on the abnormal computing node in time, improving response speed and accelerating fault recovery.

Description

Automatic detection method and device for IO suspension abnormality of computing node

Technical Field

The invention relates to the technical field of cloud computing, and in particular to a method and a device for automatically detecting IO suspension abnormality of a computing node.

Background

This section is intended to provide a background or context to the embodiments of the invention that are recited in the claims. The description herein is not admitted to be prior art by inclusion in this section.

In recent years, with the rapid development of cloud computing technology, its application has become more and more extensive. A typical cloud platform uses distributed storage as the storage resource supplied to virtual machines. Because distributed storage can only detect faults inside its own cluster, when communication between the computing resources and the storage is interrupted, the distributed storage cluster cannot identify the abnormal computing node by itself. The IO of the virtual machines on the abnormal computing node then stays in a suspension state for a long time, and a virtual machine in the IO suspension state has a normal heartbeat but cannot provide services externally.

At present, for scenarios such as IO suspension of a computing node, typical cloud vendors do not have a good handling mechanism. The problem is basically discovered through alarms, after which operation and maintenance personnel manually power down the computing node and evacuate the virtual machines on it to recover the environment. This handling approach suffers from slow response, difficult fault recovery, and low efficiency.

Disclosure of Invention

The embodiment of the invention provides an automatic detection method for IO suspension abnormality of a computing node, which is used to discover IO abnormality of the computing node promptly, so that effective handling measures can be taken on the abnormal computing node in time, improving response speed and accelerating fault recovery. The method comprises the following steps:

collecting, in real time, the IO states of all virtual machines on a computing node, wherein the IO states comprise a return state and a suspension state;

counting the number of IO in the suspension state and the total number of IO at fixed intervals, and determining the ratio of the number of IO in the suspension state to the total number of IO;

and determining whether the IO of the computing node is in an abnormal state or not according to the size relation between the ratio and a preset threshold.

The embodiment of the present invention further provides an automatic detection device for computing node IO suspension abnormality, which is used to find computing node IO abnormality in time, and further take effective processing measures to the abnormal computing node in time, so as to improve response speed and accelerate failure recovery speed, and the device includes:

an acquisition module, configured to acquire, in real time, the IO states of all virtual machines on a computing node, wherein the IO states comprise a return state and a suspension state;

a statistics module, configured to count, at fixed intervals, the number of IOs in the suspension state and the total number of IOs, and to determine the ratio of the number of IOs in the suspension state to the total number of IOs;

and the determining module is used for determining whether the IO of the computing node is in an abnormal state or not according to the size relation between the ratio and a preset threshold.

The embodiment of the invention also provides computer equipment which comprises a memory, a processor and a computer program which is stored on the memory and can run on the processor, wherein the processor executes the computer program to realize the automatic detection method for IO suspension abnormality of the computing node.

The embodiment of the invention also provides a computer readable storage medium, which stores a computer program for executing the above method for automatically detecting the IO suspension abnormality of the computing node.

In the embodiment of the invention, the IO states of the virtual machines on the computing node are acquired in real time, so whether each IO is suspended can be known promptly. After each fixed interval, the number of IOs in the suspension state and the total number of IOs of the computing node are counted, and whether the computing node is in an abnormal state with a large number of suspended IOs is determined from the magnitude relation between the ratio of the two and a preset threshold. The cloud platform thereby detects the IO suspension scenario of a computing node automatically and discovers it in time. Computing nodes in an abnormal state can therefore be handled effectively and promptly, response speed is improved, and fault recovery efficiency is greatly improved compared with manual operation and maintenance.

Drawings

To illustrate the embodiments of the present invention or the technical solutions in the prior art more clearly, the drawings used in the description of the embodiments or the prior art are briefly described below. The drawings in the following description show only some embodiments of the present invention, and those skilled in the art can obtain other drawings from them without creative effort. In the drawings:

FIG. 1 is a flowchart of an automatic detection method for IO suspension abnormality of a computing node in an embodiment of the present invention;

FIG. 2 is a flowchart of another automatic detection method for IO suspension abnormality of a computing node in an embodiment of the present invention;

FIG. 3 is a flowchart of another automatic detection method for IO suspension abnormality of a computing node in an embodiment of the present invention;

FIG. 4 is a schematic structural diagram of an automatic detection device for IO suspension abnormality of a computing node in an embodiment of the present invention;

FIG. 5 is a schematic structural diagram of a computer device in an embodiment of the present invention.

Detailed Description

In order to make the objects, technical solutions and advantages of the embodiments of the present invention more apparent, the embodiments of the present invention are further described in detail below with reference to the accompanying drawings. The exemplary embodiments and descriptions of the present invention are provided to explain the present invention, but not to limit the present invention.

In order to solve the problems that IO abnormality is discovered slowly and the corresponding handling response is not timely, the embodiment of the invention provides an automatic detection method for IO suspension abnormality of a computing node, applied to a cloud platform. As shown in FIG. 1, the method comprises the following steps 101 to 103:

Step 101, acquiring IO states of all virtual machines of the computing node in real time, wherein the IO states comprise a return state and a suspension state.

The suspension state means that an IO request sent by the computing node has been under processing at the next layer (for example, the storage side) for a long time without any return being received. Under normal conditions an IO has a return, which indicates either success (for example, 0 is returned) or failure (for example, a non-zero value is returned, with different values corresponding to different abnormal states). An IO that has received a return code and responds normally is in the return state; an IO that remains in interaction with the next layer and has no return code is in the suspension state.

In another implementation manner of the embodiment of the present invention, IO return time may also be recorded, so as to assist in determining whether IO is abnormal according to the length of the return time.
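As a minimal sketch of the two IO states of step 101, assuming that an IO request is represented by a record holding an optional return code and optional timestamps, the classification may look as follows. The names `IOState`, `IORequest`, `issued_at` and `returned_at` are hypothetical and are not taken from the embodiment.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class IOState(Enum):
    RETURNED = "returned"    # a return code was received (success or failure)
    SUSPENDED = "suspended"  # no return code has been received yet


@dataclass
class IORequest:
    """Hypothetical record of one IO request issued by a virtual machine."""
    vm_id: str
    issued_at: float                    # seconds on any consistent clock
    return_code: Optional[int] = None   # None until the next layer answers
    returned_at: Optional[float] = None

    @property
    def state(self) -> IOState:
        # 0 means success and a non-zero value means some failure;
        # both count as "returned" because a return code exists.
        if self.return_code is None:
            return IOState.SUSPENDED
        return IOState.RETURNED

    def return_time(self) -> Optional[float]:
        """Elapsed time until the return, or None while still suspended."""
        if self.returned_at is None:
            return None
        return self.returned_at - self.issued_at
```

Note that a failed IO (non-zero return code) is still in the return state under this reading; only the absence of any return code marks the suspension state.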

Step 102, counting the number of IOs in the suspension state and the total number of IOs at fixed time intervals, and determining the ratio of the number of IOs in the suspension state to the total number of IOs.

It should be noted that, if a large number of IO are in the suspended state for a certain period of time and no return exists, the virtual machine heartbeat is normal but the service cannot be provided normally.

In order to find out the abnormal condition that the virtual machine cannot provide the external service in time, in the embodiment of the invention, the number of the IO in the suspension state and the total number of the IO are counted at fixed intervals, so as to confirm whether a large number of IOs are in the suspension state at the current moment and whether the virtual machine provides the external service normally.

In the embodiment of the invention, the fixed interval is set by operation and maintenance personnel. A shorter interval allows an abnormal state to be found and handled earlier, but misjudgment is more likely when load rises suddenly or the network oscillates. A longer interval makes handling of the IO suspension scenario slower but more reliable. The interval may therefore be set by balancing handling speed against reliability; for example, it may be set to 30 s.
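The counting of step 102 may be sketched as follows, assuming each IO is represented only by its return code, with `None` standing for "no return yet". The function names, the `collect` callback, and the choice to report a ratio of 0.0 when no IO exists are assumptions made for illustration; the embodiment does not specify the behavior for an empty sample.

```python
import time
from typing import Callable, Iterable, Iterator, Optional, Tuple


def suspended_ratio(return_codes: Iterable[Optional[int]]) -> Tuple[int, int, float]:
    """Return (suspended count, total count, ratio) for one snapshot."""
    suspended = 0
    total = 0
    for code in return_codes:
        total += 1
        if code is None:  # no return code: the IO is suspended
            suspended += 1
    # Assumption: with no IO at all, report 0.0 instead of dividing by zero.
    ratio = suspended / total if total else 0.0
    return suspended, total, ratio


def sample_every(
    collect: Callable[[], Iterable[Optional[int]]],
    interval_s: float = 30.0,  # the 30 s example value from the description
    rounds: int = 1,
) -> Iterator[float]:
    """Yield one ratio per fixed interval, for a limited number of rounds."""
    for _ in range(rounds):
        time.sleep(interval_s)
        yield suspended_ratio(collect())[2]
```

For example, a snapshot with return codes `[0, None, None, 5, None]` contains three suspended IOs out of five.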

Step 103, determining whether the IO of the computing node is in an abnormal state according to the magnitude relation between the ratio and a preset threshold.

Specifically, as shown in FIG. 2, step 103 may be executed as the following step 1031 or step 1032:

Step 1031, if the ratio is greater than or equal to the preset threshold, determining that the computing node IO is in an abnormal state.

Step 1032, if the ratio is smaller than a preset threshold, determining that the computing node IO is in a normal state.

The preset threshold may be determined empirically, based on the proportion of suspended IO at which a virtual machine can no longer provide external services; in general, the preset threshold may be set to 80%.
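The comparison of steps 1031 and 1032 may be sketched as follows. The constant and function names are hypothetical; the value 0.8 is the 80% example given above, and the comparison is inclusive because step 1031 treats a ratio equal to the threshold as abnormal.

```python
PRESET_THRESHOLD = 0.8  # the 80% example value; tunable in practice


def is_io_abnormal(ratio: float, threshold: float = PRESET_THRESHOLD) -> bool:
    """Step 1031 / 1032: abnormal when ratio >= threshold, normal otherwise."""
    return ratio >= threshold
```

With 30 IOs in total, 24 suspended IOs give a ratio of exactly 80% and are judged abnormal, while 23 suspended IOs fall below the threshold and are judged normal.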

In one implementation, when it is determined that the computing node IO is in an abnormal state, the cloud platform managing the computing nodes performs processing uniformly.

Specifically, if the computing node IO is determined to be in an abnormal state, checking whether redundant computing resources exist in other computing nodes of the current cloud platform; determining a processing mode of the computing node with IO abnormality according to whether other computing nodes have redundant computing resources; and processing the IO abnormal computing nodes according to a processing mode.

Redundant computing resources are resources provisioned in duplicate on the computing nodes; when a component in a computing node fails, they take over the role of the failed component.

Therefore, if redundant computing resources exist in other computing nodes of the cloud platform, the computing node with abnormal IO can be powered off and its virtual machines evacuated to the computing nodes with redundant computing resources; once a virtual machine has been evacuated to a computing node with redundant computing resources, its service is recovered. The redundant computing resources thus provide an emergency function for the abnormal computing node, the virtual machine service can be recovered within seconds, and service recovery efficiency is greatly improved compared with manual operation and maintenance.

In another implementation, if the other computing nodes do not have redundant computing resources, the computing node with abnormal IO is powered off and the virtual machine service is stopped.

Both processing modes power off the computing node, so that the high availability mechanism of the computing node can recover its service and the purpose of emergency processing is achieved.
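The choice between the two processing modes may be sketched as follows. How the cloud platform discovers redundant computing resources is not specified in the embodiment, so the sketch assumes the caller already has a list of node names that currently hold redundant resources. `HandlingMode`, `choose_handling` and the node names are hypothetical, and picking the first candidate as the evacuation target is an arbitrary illustrative choice.

```python
from enum import Enum
from typing import List, Optional, Tuple


class HandlingMode(Enum):
    POWER_OFF_AND_EVACUATE = "power off node, evacuate VMs, recover service"
    POWER_OFF_AND_STOP = "power off node, stop VM service"


def choose_handling(
    abnormal_node: str, redundant_nodes: List[str]
) -> Tuple[HandlingMode, Optional[str]]:
    """Pick a processing mode and, if any, an evacuation target node."""
    # The abnormal node itself can never be an evacuation target.
    candidates = [node for node in redundant_nodes if node != abnormal_node]
    if candidates:
        return HandlingMode.POWER_OFF_AND_EVACUATE, candidates[0]
    return HandlingMode.POWER_OFF_AND_STOP, None
```

Both branches include powering off the abnormal node, matching the description; they differ only in whether the virtual machines are evacuated or their service is stopped.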

Recovery of the service does not necessarily mean that the cause of the computing node failure has been resolved, and the processing mode adopted may not be applicable in some cases. After receiving the notification, operation and maintenance personnel investigate the cause of the IO abnormality and determine whether the processing mode was appropriate, so as to reduce the probability of the fault recurring.

FIG. 3 is a flowchart of another method for automatically detecting IO suspension abnormality of a computing node in the embodiment of the present invention. As can be seen from FIG. 3, different processing modes are applied to the computing node and its virtual machines depending on how the ratio of the number of suspended IOs to the total number of IOs compares with the preset threshold, and on whether redundant computing resources exist, so that the IO suspension scenario is detected and recovered automatically.

The embodiment of the invention also provides an automatic detection device for IO suspension abnormality of a computing node, which is described in the following embodiment. Because the principle by which the device solves the problem is similar to that of the automatic detection method for IO suspension abnormality of a computing node, the implementation of the device can refer to the implementation of the method, and repeated parts are not described again.

As shown in FIG. 4, the apparatus 400 includes an acquisition module 401, a statistics module 402, and a determining module 403.

The acquisition module 401 is configured to acquire IO states of all virtual machines on a computing node in real time, where the IO states include a return state and a suspension state;

the statistics module 402 is configured to count the number of IOs in the suspension state and the total number of IOs at fixed intervals, and determine the ratio of the number of IOs in the suspension state to the total number of IOs;

the determining module 403 is configured to determine whether the computing node IO is in an abnormal state according to a size relationship between the ratio and a preset threshold.

In an implementation manner of the embodiment of the present invention, the determining module 403 is configured to:

when the ratio is greater than or equal to a preset threshold value, determining that the IO of the computing node is in an abnormal state;

and when the ratio is smaller than the preset threshold, determining that the computing node IO is in a normal state.

In one implementation manner of the embodiment of the present invention, the apparatus 400 further includes:

the checking module 404 is configured to, when it is determined that the computing node IO is in an abnormal state, check whether redundant computing resources exist in other computing nodes of the current cloud platform;

the determining module 403 is further configured to determine a processing mode of the computing node with IO exception according to whether there is redundant computing resource in other computing nodes;

and the processing module 405 is configured to process the computing node with the IO exception according to a processing mode.

In an implementation manner of the embodiment of the present invention, when there are redundant computing resources in other computing nodes, the processing module 405 is configured to:

powering off the computing node with the abnormal IO, and evacuating the virtual machine to the computing node with the redundant computing resource;

and when the virtual machine is evacuated to the computing nodes with redundant computing resources, the service of the virtual machine is recovered.

In an implementation manner of the embodiment of the present invention, when there is no redundant computing resource in other computing nodes, the processing module 405 is configured to:

powering off the computing node and stopping the virtual machine service.

In one implementation manner of the embodiment of the present invention, the apparatus 400 further includes:

the communication module 406, configured to send a node processing notification to operation and maintenance personnel, where the node processing notification includes the processing mode of the computing node with the IO exception.
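How modules 401 to 406 might cooperate in one detection round can be sketched as follows. This is an illustrative arrangement under stated assumptions, not the structure of apparatus 400 itself: the class name, the three callback fields, the string labels for the processing modes, and the handling of an empty sample are all hypothetical, and the actual power-off and evacuation operations of processing module 405 are left out.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class Apparatus400:
    # Hypothetical hooks standing in for platform-specific behavior.
    collect_return_codes: Callable[[], List[Optional[int]]]  # acquisition module 401
    find_redundant_nodes: Callable[[], List[str]]            # checking module 404
    notify_staff: Callable[[str], None]                      # communication module 406
    threshold: float = 0.8                                   # 80% example value

    def suspended_ratio(self, codes: List[Optional[int]]) -> float:
        """Statistics module 402: share of IOs with no return code yet."""
        if not codes:
            return 0.0  # assumption: a node with no IO counts as normal
        return sum(code is None for code in codes) / len(codes)

    def is_abnormal(self, ratio: float) -> bool:
        """Determining module 403: abnormal when ratio >= threshold."""
        return ratio >= self.threshold

    def choose_mode(self, redundant_nodes: List[str]) -> str:
        """Determining module 403, further configured to pick the mode."""
        if redundant_nodes:
            return "power_off_and_evacuate"
        return "power_off_and_stop_service"

    def run_once(self) -> str:
        """One detection round; returns 'normal' or the chosen mode."""
        ratio = self.suspended_ratio(self.collect_return_codes())
        if not self.is_abnormal(ratio):
            return "normal"
        mode = self.choose_mode(self.find_redundant_nodes())
        # Processing module 405 would act on the node here (omitted).
        self.notify_staff(mode)
        return mode
```

Passing the platform-specific parts in as callbacks keeps the decision logic testable without a real cloud platform; for instance, `notify_staff` can simply append to a list during a dry run.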

An embodiment of the present invention further provides a computer device. FIG. 5 is a schematic diagram of the computer device in the embodiment of the present invention. The computer device is capable of implementing all steps of the automatic detection method for IO suspension abnormality of a computing node in the foregoing embodiment, and specifically includes:

a processor 501, a memory 502, a communication interface 503, and a communication bus 504;

the processor 501, the memory 502 and the communication interface 503 complete mutual communication through the communication bus 504; the communication interface 503 is used for implementing information transmission between related devices;

the processor 501 is configured to call a computer program in the memory 502, and when the processor executes the computer program, the automatic detection method for the IO suspension abnormality of the compute node in the foregoing embodiment is implemented.

As will be appreciated by one skilled in the art, embodiments of the present invention may be provided as a method, system, or computer program product. Accordingly, the present invention may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, the present invention may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein.

The present invention is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems), and computer program products according to embodiments of the invention. It will be understood that each flow and/or block of the flow diagrams and/or block diagrams, and combinations of flows and/or blocks in the flow diagrams and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.

These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.

These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.

The above-mentioned embodiments are intended to illustrate the objects, technical solutions and advantages of the present invention in further detail, and it should be understood that the above-mentioned embodiments are only exemplary embodiments of the present invention, and are not intended to limit the scope of the present invention, and any modifications, equivalent substitutions, improvements and the like made within the spirit and principle of the present invention should be included in the scope of the present invention.

Claims

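1. A method for automatically detecting IO suspension abnormality of a computing node, characterized by comprising the following steps:

collecting, in real time, the IO states of all virtual machines on a computing node, wherein the IO states comprise a return state and a suspension state;

counting the number of IO in the suspension state and the total number of IO at fixed intervals, and determining the ratio of the number of IO in the suspension state to the total number of IO;

and determining whether the IO of the computing node is in an abnormal state according to the magnitude relation between the ratio and a preset threshold.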

2. The method according to claim 1, wherein determining whether a compute node IO is in an abnormal state according to a magnitude relation between the ratio and a preset threshold comprises:

if the ratio is larger than or equal to a preset threshold value, determining that the IO of the computing node is in an abnormal state;

and if the ratio is smaller than the preset threshold, determining that the computing node IO is in a normal state.

3. The method according to claim 2, wherein after determining whether the computing node IO is in an abnormal state according to a magnitude relation between the ratio and a preset threshold, the method further comprises:

if the computing node IO is determined to be in an abnormal state, checking whether redundant computing resources exist in other computing nodes of the current cloud platform;

determining a processing mode of the computing node with IO abnormality according to whether other computing nodes have redundant computing resources;

and processing the IO abnormal computing node according to the processing mode.

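4. The method of claim 3, wherein when the other computing nodes have redundant computing resources, processing the computing node with the IO exception according to the processing mode includes:

powering off the computing node with the abnormal IO, and evacuating the virtual machine to the computing node with the redundant computing resources;

and recovering the service of the virtual machine once the virtual machine has been evacuated to the computing node with redundant computing resources.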

5. The method according to claim 3, wherein when there are no redundant computing resources in other computing nodes, processing the computing node with the IO exception according to the processing method includes:

powering off the computing node and stopping the virtual machine service.

6. The method according to any one of claims 3 to 5, further comprising:

and sending a node processing notification to operation and maintenance personnel, wherein the node processing notification comprises a processing mode of the computing node with IO exception.

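7. An automatic detection device for IO suspension abnormality of a computing node, characterized by comprising:

an acquisition module, configured to acquire IO states of all virtual machines on a computing node in real time, wherein the IO states comprise a return state and a suspension state;

a statistics module, configured to count the number of IO in the suspension state and the total number of IO at fixed intervals, and determine the ratio of the number of IO in the suspension state to the total number of IO;

and a determining module, configured to determine whether the IO of the computing node is in an abnormal state according to the magnitude relation between the ratio and a preset threshold.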

8. The apparatus of claim 7, further comprising:

the checking module is used for checking whether redundant computing resources exist in other computing nodes of the current cloud platform when the computing node IO is determined to be in an abnormal state;

the determining module is further used for determining a processing mode of the computing node with the IO abnormality according to whether the other computing nodes have redundant computing resources;

and the processing module is used for processing the computing node with the IO exception according to the processing mode.

9. A computer device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, wherein the processor implements the method of any one of claims 1 to 6 when executing the computer program.

10. A computer-readable storage medium, characterized in that the computer-readable storage medium stores a computer program for executing the method of any one of claims 1 to 6.