
CN115470061A - Distributed storage system I/O sub-health intelligent detection and recovery method

Info

Publication number: CN115470061A
Application number: CN202211233825.9A
Authority: CN
Inventors: 杨自新
Original assignee: CLP Cloud Digital Intelligence Technology Co Ltd
Current assignee: CLP Cloud Digital Intelligence Technology Co Ltd
Priority date: 2022-10-10
Filing date: 2022-10-10
Publication date: 2022-12-13

Abstract

The invention relates to an intelligent detection and recovery method for I/O sub-health in a distributed storage system. The method comprises: setting I/O statistical points on the key services of each node along the I/O path of the storage system; collecting and recording I/O abnormal data at the I/O statistical points; obtaining the types of I/O abnormal data recorded at each I/O statistical point and the occurrence count of each type; inferring, by means of a sub-health detection algorithm, the key services or nodes that are in an I/O sub-health state; and handling those key services or nodes by self-healing or isolation. The method periodically collects I/O abnormal data on key services through I/O statistical points pre-embedded in the storage system, automatically infers which services exhibit I/O anomalies according to the sub-health detection algorithm, and handles the abnormal services by self-healing or isolation. This prevents fault propagation, eliminates risks promptly, avoids service interruption, and improves the stability and safety of system operation.

Description

Intelligent detection and recovery method for I/O sub-health of distributed storage system

Technical Field

The invention belongs to the technical field of storage system operation and maintenance, and particularly relates to an I/O sub-health intelligent detection and recovery method of a distributed storage system and an intelligent detection and recovery system related to the method.

Background

Practitioners have found that when a distributed storage system is in an I/O (input/output) sub-health state during operation, it can in most cases still provide I/O read and write service. If the operation and maintenance system fails to detect the sub-health state in time, the fault spreads and can lead to serious consequences such as interruption of the storage service.

Distributed Asynchronous Object Storage (DAOS) is the foundation of the exascale storage stack built by Intel. Specifically, DAOS is an open-source, software-defined, scale-out object store that provides high-bandwidth, low-latency, high-IOPS storage containers for high-performance computing applications. As a lightweight system, DAOS runs end to end in user space and bypasses the operating system entirely. Rather than carrying over an I/O model designed for high-latency block storage, DAOS adopts an I/O model with native support for fine-grained data access, which unlocks the performance of next-generation storage technologies and supports data-centric workflows. However, DAOS currently has no effective mechanism for detecting and handling the I/O sub-health state of the storage system, so the operational risk is high.

In view of the above problems, no ideal solution has been proposed.

Disclosure of Invention

To address the lack of an effective mechanism in DAOS for detecting and handling the I/O sub-health state of the storage system, and the resulting high operational risk, the invention provides the following solution.

The method sets I/O statistical points on the key services along the I/O path of the storage system to collect I/O abnormal data, obtains for each service the I/O failure count and the count of I/Os exceeding the latency threshold, infers according to a sub-health detection algorithm which key service or node is in an I/O sub-health state (I/O timeout, I/O failure, or I/O hang-up), and then decides whether to handle the fault by self-healing or by isolation, so that fault propagation and service interruption are avoided.

Specifically, in a first aspect, the present invention provides a distributed storage system I/O sub-health intelligent detection and recovery method, including:

S1: setting I/O statistical points on the key services of each node along the I/O path of the storage system;

S2: collecting and recording I/O abnormal data at the I/O statistical points;

S3: obtaining the types of I/O abnormal data recorded at each I/O statistical point and the occurrence count of each type;

S4: inferring, from the obtained data and by means of a sub-health detection algorithm, the key services or nodes that are in an I/O sub-health state;

S5: handling the key services or nodes that are in an I/O sub-health state by self-healing or isolation, so that fault propagation and service interruption are avoided.

Further, according to some embodiments of the present invention, the I/O abnormal data in step S2 of the method for intelligently detecting and recovering I/O sub-health of a distributed storage system of the present invention includes abnormal data in the following three scenarios:

I/O timeout: the I/O latency exceeds a threshold of T seconds (the latency requirement differs between services and depends on the specific service; for example, 800 ms to 3 s);

I/O failure: the I/O returns an error;

I/O hang-up: the I/O hangs and does not return.
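The three scenarios can be illustrated with a minimal Python sketch. The names `IOAnomaly` and `classify_io` are hypothetical, and the `hang_deadline_s` parameter is an assumption: the text only says that a hung I/O "does not return" and does not state how long to wait before declaring a hang.

```python
from enum import Enum
from typing import Optional


class IOAnomaly(Enum):
    TIMEOUT = "timeout"   # completed, but latency exceeded the threshold T
    FAILURE = "failure"   # completed with an error
    HANG = "hang"         # never returned


def classify_io(elapsed_s: float, completed: bool, error: bool,
                timeout_s: float, hang_deadline_s: float) -> Optional[IOAnomaly]:
    """Classify one I/O observation; returns None when the I/O looks healthy."""
    if not completed:
        # Assumption: an outstanding I/O counts as hung only after a deadline.
        return IOAnomaly.HANG if elapsed_s >= hang_deadline_s else None
    if error:
        return IOAnomaly.FAILURE
    if elapsed_s > timeout_s:
        return IOAnomaly.TIMEOUT
    return None
```

With `timeout_s=0.8`, a completed I/O that took 1.2 s is classified as a timeout, while an I/O still outstanding after the assumed 30 s deadline is classified as a hang.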

Further, according to some embodiments of the present invention, the sub-health detection algorithm in S4 of the I/O sub-health intelligent detection and recovery method of the distributed storage system according to the present invention includes:

(1) I/O anomaly counting method

M consecutive periods are monitored; if abnormal data is detected in N of them, the I/O is determined to be abnormal, where M ≥ N ≥ 1;

(2) I/O anomaly inference method

An I/O is initiated from the client, and along the key service path of each node on the I/O path the upper-layer service calls the interface of the lower-layer service. When an upper-layer service judges that its lower-layer service is abnormal, and that lower-layer service does not further indicate that its own lower-layer service is abnormal, the I/O anomaly is determined to be on that lower-layer service;

when an upper-layer service judges that its lower-layer service is abnormal, that lower-layer service further indicates that its own lower-layer service is abnormal, and the latter gives no further downward indication, the I/O anomaly is determined to be on the latter service;

identification proceeds layer by layer in the same way until all nodes and key services with I/O anomalies have been identified.

Further, according to some embodiments of the present invention, in the method for intelligently detecting and recovering I/O sub-health of a distributed storage system according to the present invention, in S5, a self-healing or isolation manner is used to process a key service or node in which an I/O sub-health state occurs, and an applicable recovery manner is selected according to the following decision logic:

(1) If it is finally inferred that only one service on a single node has an I/O anomaly, the anomaly is recovered by self-healing of that service and an alarm is reported; if the I/O anomaly of the service is detected again within unit time S1, the service is isolated and an alarm is reported;

(2) If it is finally inferred that two or more services on only one node have I/O anomalies, the anomaly is attributed to the node, the node is recovered by a self-healing restart and an alarm is reported; if the I/O anomaly of the node is detected again within unit time S2, the node is isolated and an alarm is reported;

(3) If it is finally inferred that two or more nodes have I/O anomalies, the situation is treated as a group event, an alarm is reported, and manual intervention is prompted for recovery.

Further, in the method for intelligently detecting and recovering I/O sub-health of the distributed storage system, the unit time S1 is 12 to 48 hours and the unit time S2 is 3 to 15 days.
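The "self-heal first, isolate on recurrence within the unit time" rule can be sketched as follows. This is only an illustration: the class `RecoveryEscalator` is hypothetical, and measuring the window from the moment of the previous self-healing is an assumption, since the text does not say where the unit time starts.

```python
from typing import Dict, Hashable

HOUR = 3600.0
DAY = 24 * HOUR


class RecoveryEscalator:
    """Self-heal on first detection; isolate if the anomaly recurs within the window."""

    def __init__(self, window_s: float) -> None:
        self.window_s = window_s
        self._last_heal: Dict[Hashable, float] = {}

    def on_anomaly(self, target: Hashable, now_s: float) -> str:
        last = self._last_heal.get(target)
        if last is not None and now_s - last <= self.window_s:
            # Recurrence inside the unit time: escalate to isolation.
            del self._last_heal[target]
            return "isolate"
        # First occurrence, or the previous one is older than the window.
        self._last_heal[target] = now_s
        return "self_heal"
```

One escalator with a 24-hour window could track services (unit time S1) and a second one with a 7-day window could track nodes (unit time S2); in both cases an alarm would be reported alongside the returned action.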

In a second aspect, the present invention further provides an intelligent detection and recovery system for I/O sub-health of a distributed storage system, where the intelligent detection and recovery system includes:

the I/O abnormal data acquisition module is used for acquiring I/O abnormal data recorded by each I/O statistical point location;

the I/O abnormal data analysis module is used for analyzing the type of I/O abnormal data and counting the occurrence frequency of various types of abnormal data;

the detection algorithm module is internally provided with a sub-health detection algorithm and is used for deducing key services or nodes with I/O sub-health states;

and the recovery decision module is internally provided with decision logic and is used for selecting an applicable recovery mode.

Further, according to some embodiments of the present invention, the sub-health detection algorithm in the I/O sub-health intelligent detection and recovery system of the distributed storage system of the present invention comprises:

(1) I/O anomaly counting method

(2) I/O anomaly inference method

Further, according to some embodiments of the present invention, the decision logic in the I/O sub-health intelligence detection and recovery system of the distributed storage system of the present invention comprises:

(2) If it is finally inferred that two or more services on only one node have I/O anomalies, the anomaly is attributed to the node, the node is recovered by a self-healing restart and an alarm is reported; if the I/O anomaly of the node is detected again within unit time S2, the node is isolated and an alarm is reported;

In a third aspect, the present invention further provides a computer-readable storage medium on which a computer program is stored; when executed by a processor, the computer program implements the steps of the I/O sub-health intelligent detection and recovery method of the distributed storage system described above.

In conclusion, the intelligent detection and recovery method for I/O sub-health of the distributed storage system has the following characteristics:

(1) The method periodically collects I/O abnormal data on key services through I/O statistical points pre-embedded in the storage system, automatically infers which services exhibit I/O anomalies according to the sub-health detection algorithm, and then handles the abnormal services by an intelligently selected self-healing or isolation mode, which effectively prevents fault propagation.

(2) With its built-in sub-health detection algorithm and recovery decision logic, the method detects and recovers from I/O anomalies intelligently: it identifies the key services in I/O sub-health automatically, without human intervention, and restores the system by self-healing or isolation, thereby eliminating risks promptly, avoiding service interruption, and improving the stability and safety of system operation.

Drawings

To illustrate the technical solution of the embodiments of the present invention more clearly, the drawings used in the description of the embodiments are briefly introduced below. The drawings described below illustrate some, but not all, embodiments of the invention, and those skilled in the art may derive other drawings from them without inventive effort.

FIG. 1 is a schematic structural diagram of a monitoring cluster according to an embodiment of the present invention.

FIG. 2 is a flow chart of an I/O anomaly inference method according to an embodiment of the present invention.

Note on FIG. 2: service A1 indicates that service B2 is abnormal, and services A2 and A3 indicate that service B1 is abnormal; services B1 and B2 in turn both indicate that service C3 is abnormal, from which it is inferred that service C3 is abnormal.

FIG. 3 is a flow chart of an I/O sub-health intelligent detection and recovery method of a distributed storage system according to the present invention.

FIG. 4 is a schematic diagram of a distributed storage system I/O sub-health intelligent detection and recovery system according to the present invention.

Detailed Description

To make the objects, technical solutions, and advantages of the present invention clearer, the technical solutions of the present invention are described clearly and completely below with reference to specific embodiments. The embodiments described are only some, not all, of the embodiments of the present invention; the invention may be embodied or carried out in various other specific forms, and the details of the specification may be modified and changed in various ways without departing from the spirit of the invention.

It should be noted that, in the case of no conflict, the features in the following embodiments and examples may be combined with each other; moreover, all other embodiments that can be derived by one skilled in the art from the embodiments disclosed herein without any inventive step are intended to be within the scope of the present disclosure.

It should be noted that the aspects described herein may be embodied in a wide variety of forms and that any specific structure and/or function described herein is merely illustrative. Based on the disclosure, one skilled in the art should appreciate that one aspect described herein may be implemented independently of any other aspects and that two or more of these aspects may be combined in various ways, e.g., an apparatus may be implemented and/or a method practiced using any number of the aspects set forth herein.

The present invention will be described in detail below with reference to the embodiments shown in FIGS. 1 to 4.

Referring to FIG. 3, the method for intelligently detecting and recovering I/O sub-health of a distributed storage system according to the present invention includes:

(1) Cluster function

An I/O sub-health detection service, monitor, is deployed in the storage system. The service consists of a server component and a client component. The server component runs as an independent process on each storage node and operates in master-slave mode for reliability (for example monitor_server1, monitor_server2, and monitor_server3 in FIG. 1); it provides cluster master election, I/O sub-health data aggregation, inference, and isolation decision functions. The client component is loaded into each key service process on every storage node; it collects the I/O latency, I/O error, and I/O hang data of each key service (for example service A, service B, and service C in FIG. 2) and reports them to the monitor_server of its node, which in turn reports them to the master monitor_server process.

(2) Periodic detection reporting

Statistical points are pre-embedded in the key services on the I/O path (for example service A, service B, and service C in FIG. 2). Each key service periodically detects I/O abnormal data through monitor_client and reports it to monitor_server. Abnormal I/O covers the following three scenarios:

I/O timeout: the I/O latency exceeds a threshold of T seconds (the latency requirement differs between services and depends on the specific service; for example, 800 ms to 3 s);

I/O failure: the I/O returns an error;

I/O hang-up: the I/O hangs and does not return;
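A statistical point of this kind might be sketched as a per-service counter that is flushed once per detection period. The class `StatPoint` and the report layout below are hypothetical; the text does not specify the data format exchanged between monitor_client and monitor_server.

```python
from collections import Counter
from dataclasses import dataclass, field
from typing import Any, Dict


@dataclass
class StatPoint:
    """Per-service counter of abnormal I/O events within one detection period."""
    node: str
    service: str
    counts: Counter = field(default_factory=Counter)

    def record(self, anomaly_type: str) -> None:
        # Called on the I/O path whenever an abnormal I/O is observed.
        self.counts[anomaly_type] += 1

    def flush(self) -> Dict[str, Any]:
        # Build the periodic report and reset the counters for the next period.
        report = {"node": self.node, "service": self.service,
                  "counts": dict(self.counts)}
        self.counts.clear()
        return report
```

Usage: a key service would call `record("timeout")` or `record("failure")` as anomalies occur, and the client component would call `flush()` at the end of each period and forward the resulting report.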

(3) Intelligent reasoning based on sub-health detection algorithm

(3-1) I/O anomaly counting method

I/O timeout: M1 (for example, 8) consecutive periods are monitored; if an I/O latency greater than T seconds is detected in N1 (for example, 4) of them (M1 ≥ N1 ≥ 1), an I/O timeout anomaly is determined;

I/O failure: M2 (for example, 8) consecutive periods are monitored; if an I/O error is detected in N2 (for example, 4) of them (M2 ≥ N2 ≥ 1), an I/O failure anomaly is determined;

I/O hang-up: if an I/O hangs and does not return, an I/O hang-up anomaly is determined.
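The N-of-M counting rule can be sketched with a sliding window over the most recent M periods. The class name `MOfNDetector` is hypothetical, and the use of a sliding (rather than tumbling) window is an assumption; the text only says that M consecutive periods are monitored.

```python
from collections import deque


class MOfNDetector:
    """Flags an anomaly when at least N of the last M periods were abnormal."""

    def __init__(self, m: int, n: int) -> None:
        if not (m >= n >= 1):
            raise ValueError("require M >= N >= 1")
        self.n = n
        # A bounded deque drops the oldest period automatically.
        self._window = deque(maxlen=m)

    def observe(self, period_abnormal: bool) -> bool:
        self._window.append(period_abnormal)
        return sum(self._window) >= self.n
```

With the example values M1 = 8 and N1 = 4, `MOfNDetector(8, 4)` reports an I/O timeout anomaly as soon as four of the most recent eight periods contained an I/O slower than T seconds. A separate detector instance would be kept per service and per anomaly type.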

(3-2) I/O anomaly inference method

The I/O anomaly inference logic is illustrated with the high-latency scenario. As shown in FIG. 2, an I/O is initiated from the client and follows the key service path client -> service A -> service B -> service C:

(3-2-1) an upper-layer service calls the interface of its lower-layer service. Service A observes an anomaly on service B, such as high latency, and preliminarily infers that service B is abnormal; if service B does not further indicate that service C is abnormal, the anomaly is determined to be on service B;

(3-2-2) service A observes an anomaly on service B, such as high latency, and preliminarily infers that service B is abnormal; if service B further indicates that service C is abnormal and service C gives no further downward indication, the anomaly is determined to be on service C (the same reasoning continues if service C has lower-layer services of its own);

identification proceeds layer by layer according to this logic until all nodes and key services with I/O anomalies have been identified.
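The layer-by-layer inference can be sketched as a walk down the "indicates abnormal" relation. The function `locate_anomalies` and the dictionary representation are hypothetical; the sketch assumes each service's indications about its lower layer have already been gathered into a mapping.

```python
from typing import Dict, Iterable, Set


def locate_anomalies(indications: Dict[str, Set[str]],
                     entry_services: Iterable[str]) -> Set[str]:
    """indications[s] is the set of lower-layer services that s reports as abnormal."""
    located: Set[str] = set()
    visited: Set[str] = set()
    stack = list(entry_services)
    while stack:
        svc = stack.pop()
        if svc in visited:
            continue
        visited.add(svc)
        for lower in indications.get(svc, ()):
            if indications.get(lower):
                # The lower service blames its own lower layer: keep descending.
                stack.append(lower)
            else:
                # No further downward indication: the anomaly is located here.
                located.add(lower)
    return located


# Scenario from the note on FIG. 2: A1 -> B2, A2 and A3 -> B1, B1 and B2 -> C3.
fig2 = {"A1": {"B2"}, "A2": {"B1"}, "A3": {"B1"}, "B1": {"C3"}, "B2": {"C3"}}
print(locate_anomalies(fig2, ["A1", "A2", "A3"]))  # prints {'C3'}
```

In case (3-2-1), where service A flags service B and service B flags nothing, the same function returns `{"B"}`.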

(4) Intelligent decision recovery

(4-1) if it is finally inferred that only one service on a single node has an I/O anomaly, the anomaly is recovered by self-healing of that service and an alarm is reported; if the I/O anomaly of the service is detected again within unit time S1 (for example, 24 hours), the service is isolated (that is, stopped) and an alarm is reported;

(4-2) if it is finally inferred that two or more services on only one node have I/O anomalies, the anomaly is attributed to the node, the node is recovered by a self-healing restart and an alarm is reported; if the I/O anomaly of the node is detected again within unit time S2 (for example, 7 days), the node is isolated (all services on the node are stopped) and an alarm is reported;

(4-3) if it is finally inferred that two or more nodes have I/O anomalies, the situation is treated as a group event, an alarm is reported, and manual intervention is prompted for recovery.
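The choice between the three cases depends only on how many nodes and services were located as abnormal, which can be sketched as follows. The function `decide_recovery` and the action labels are hypothetical, and the sketch covers only the first-occurrence action; re-detection within S1 or S2 is not modeled here.

```python
from collections import defaultdict
from typing import Dict, Iterable, Optional, Set, Tuple


def decide_recovery(abnormal: Iterable[Tuple[str, str]]) -> Optional[Tuple[str, str]]:
    """abnormal: (node, service) pairs located by inference. Returns (action, target)."""
    by_node: Dict[str, Set[str]] = defaultdict(set)
    for node, service in abnormal:
        by_node[node].add(service)
    if not by_node:
        return None
    if len(by_node) >= 2:
        # Case (4-3): group event, alarm and manual intervention.
        return ("manual_intervention", "cluster")
    node, services = next(iter(by_node.items()))
    if len(services) == 1:
        # Case (4-1): a single service on a single node.
        return ("self_heal_service", f"{node}/{next(iter(services))}")
    # Case (4-2): two or more services on one node.
    return ("restart_node", node)
```

An alarm would be reported in every non-empty case; the returned action only distinguishes which recovery step accompanies it.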

Referring to FIG. 4, the I/O sub-health intelligent detection and recovery system of the distributed storage system of the present invention comprises:

Each module operates according to the I/O sub-health intelligent detection and recovery method of the distributed storage system described above.

The embodiments of the present invention are described in a progressive manner, and the same or similar parts among the embodiments can be referred to each other.

The above description is only an example of the present invention, and is not intended to limit the present invention. Various modifications and alterations to this invention will become apparent to those skilled in the art. Any modification, replacement, or the like that comes within the spirit and principle of the present invention should be included in the scope of the claims of the present invention.

Claims

1. A distributed storage system I/O sub-health intelligent detection and recovery method is characterized by comprising the following steps:

2. The method for intelligently detecting and recovering I/O sub-health of a distributed storage system according to claim 1, wherein the I/O abnormal data in S2 comprises abnormal data of the following three scenarios:

I/O timeout: the I/O latency exceeds a threshold of T seconds;

I/O failure: the I/O returns an error;

I/O hang-up: the I/O hangs and does not return.

3. The distributed storage system I/O sub-health intelligent detection and recovery method of claim 2, wherein the sub-health detection algorithm in S4 comprises:

(1) I/O anomaly counting method

(2) I/O anomaly inference method

4. The method according to claim 1, wherein in S5, the critical service or node in which the I/O sub-health state occurs is processed in a self-healing or isolation manner, and an applicable recovery manner is selected according to the following decision logic:

(1) If it is finally inferred that only one service on a single node has an I/O anomaly, the anomaly is recovered by self-healing of that service and an alarm is reported; if the I/O anomaly of the service is detected again within unit time S1, the service is isolated and an alarm is reported;

5. The distributed storage system I/O sub-health intelligent detection and recovery method of claim 4, wherein the unit time S1 is 12 to 48 hours and the unit time S2 is 3 to 15 days.

6. An intelligent detection and recovery system for I/O sub-health of a distributed storage system, the intelligent detection and recovery system comprising:

7. The distributed storage system I/O sub-health intelligent detection and recovery system of claim 6, wherein the sub-health detection algorithm comprises:

(1) I/O anomaly counting method

M consecutive periods are monitored; if abnormal data is detected in N of them, the I/O is determined to be abnormal, where M ≥ N ≥ 1;

(2) I/O anomaly inference method

8. The distributed storage system I/O sub-health intelligent detection and recovery system of claim 6, wherein the decision logic comprises:

9. A computer-readable storage medium having stored thereon a computer program which, when executed by a processor, performs the steps of the distributed storage system I/O sub-health intelligent detection and recovery method of any of claims 1-5.