CN115470061A - Distributed storage system I/O sub-health intelligent detection and recovery method - Google Patents

Distributed storage system I/O sub-health intelligent detection and recovery method

Publication number
CN115470061A
Authority
CN
China
Prior art keywords
service
sub
lower layer
health
abnormal
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202211233825.9A
Other languages
Chinese (zh)
Inventor
杨自新
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
CLP Cloud Digital Intelligence Technology Co Ltd
Original Assignee
CLP Cloud Digital Intelligence Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by CLP Cloud Digital Intelligence Technology Co Ltd
Priority to CN202211233825.9A
Publication of CN115470061A
Legal status: Pending

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00 Error detection; Error correction; Monitoring
    • G06F11/22 Detection or location of defective computer hardware by testing during standby operation or during idle time, e.g. start-up testing
    • G06F11/2205 Detection or location of defective computer hardware by testing during standby operation or during idle time, e.g. start-up testing using arrangements specific to the hardware being tested
    • G06F11/2221 Detection or location of defective computer hardware by testing during standby operation or during idle time, e.g. start-up testing using arrangements specific to the hardware being tested to test input/output devices or peripheral units
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00 Error detection; Error correction; Monitoring
    • G06F11/07 Responding to the occurrence of a fault, e.g. fault tolerance
    • G06F11/14 Error detection or correction of the data by redundancy in operation
    • G06F11/1402 Saving, restoring, recovering or retrying
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00 Error detection; Error correction; Monitoring
    • G06F11/22 Detection or location of defective computer hardware by testing during standby operation or during idle time, e.g. start-up testing
    • G06F11/2273 Test methods
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00 Error detection; Error correction; Monitoring
    • G06F11/22 Detection or location of defective computer hardware by testing during standby operation or during idle time, e.g. start-up testing
    • G06F11/26 Functional testing

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • General Engineering & Computer Science (AREA)
  • Quality & Reliability (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Computer Hardware Design (AREA)
  • Hardware Redundancy (AREA)

Abstract

The invention relates to an intelligent detection and recovery method for I/O sub-health of a distributed storage system. The method comprises: setting I/O statistical points on the key services of each node on the I/O path of the storage system; collecting and recording I/O abnormal data at the I/O statistical points; obtaining the types of I/O abnormal data recorded at each I/O statistical point and the occurrence count of each type; inferring, through a sub-health detection algorithm, the key services or nodes in an I/O sub-health state; and handling those key services or nodes by self-healing or isolation. In this method, I/O abnormal data on key services are collected periodically through I/O statistical points pre-embedded in the storage system, the services exhibiting I/O anomalies are inferred automatically by the sub-health detection algorithm, and the abnormal services are handled by self-healing or isolation, which prevents fault propagation, eliminates risks in time, avoids service interruption, and improves the stability and safety of system operation.

Description

Intelligent detection and recovery method for I/O sub-health of distributed storage system
Technical Field
The invention belongs to the technical field of storage system operation and maintenance, and particularly relates to an I/O sub-health intelligent detection and recovery method of a distributed storage system and an intelligent detection and recovery system related to the method.
Background
Practitioners have found that when a distributed storage system is in an I/O (input/output) sub-health state during operation, in most cases it can still provide I/O read-write service. If the operation and maintenance system cannot perceive the sub-health state in time, the fault spreads and can lead to serious consequences such as interruption of the storage service.
Distributed Asynchronous Object Storage (DAOS) is the foundation of the exascale storage stack built by Intel. Specifically, DAOS is an open-source, software-defined, scale-out object store that provides high-bandwidth, low-latency, and high-IOPS storage containers for high-performance computing applications. As a lightweight system, DAOS runs end to end in user space and bypasses the operating system entirely. Rather than continuing the I/O model designed for high-latency block storage, DAOS adopts an I/O model with native support for fine-grained data access, which unlocks the performance of next-generation storage technology and supports data-centric workflows. However, the current DAOS system has no effective mechanism for detecting and handling the I/O sub-health state of the storage system, so the operational risk is high.
No satisfactory solution to these problems has been proposed so far.
Disclosure of Invention
The invention provides a solution to the problem that the DAOS system has no effective mechanism for detecting and handling the I/O sub-health state of the storage system and that the operational risk of the system is therefore high.
I/O statistical points are set on the key services of the storage system's I/O path to collect I/O abnormal data; the I/O failure count and the count of I/Os exceeding the latency threshold are obtained for each service; a sub-health detection algorithm infers which key service or node is in an I/O sub-health state (I/O timeout, I/O failure, or I/O hang-up); and the fault is then handled by self-healing or isolation as decided automatically, so that fault propagation and service interruption are avoided.
Specifically, in a first aspect, the present invention provides a distributed storage system I/O sub-health intelligent detection and recovery method, including:
S1: setting I/O statistical points on the key services of each node on the I/O path of the storage system;
S2: collecting and recording I/O abnormal data at the I/O statistical points;
S3: obtaining the types of I/O abnormal data recorded at each I/O statistical point and the occurrence count of each type;
S4: inferring, from the obtained data and through a sub-health detection algorithm, the key services or nodes in an I/O sub-health state;
S5: handling the key services or nodes in an I/O sub-health state by self-healing or isolation, so that fault propagation and service interruption are avoided.
Further, according to some embodiments of the present invention, the I/O abnormal data in step S2 of the method for intelligently detecting and recovering I/O sub-health of a distributed storage system of the present invention includes abnormal data in the following three scenarios:
I/O timeout: the I/O latency exceeds a threshold of T seconds (latency requirements differ between services, so T depends on the specific service, for example 800 ms to 3 s);
I/O failure: the I/O reports an error;
I/O hang-up: the I/O hangs and does not return.
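The three scenarios can be sketched as a small classifier. This is only an illustration under stated assumptions: the `IoSample` record layout, the `classify` function, and the `hang_s` cut-off (how long an unreturned request waits before it is treated as hung) are hypothetical and are not specified in the text; `timeout_s` plays the role of the threshold T.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class Anomaly(Enum):
    TIMEOUT = "io_timeout"
    FAILURE = "io_failure"
    HANG = "io_hang"


@dataclass
class IoSample:
    """One observed I/O request (hypothetical record layout)."""
    elapsed_s: float               # seconds since the request was issued
    completed: bool                # False while the request has not returned
    error: Optional[str] = None    # error text if the request failed


def classify(sample: IoSample, timeout_s: float, hang_s: float) -> Optional[Anomaly]:
    """Map a sample to one of the three scenarios, or None if it is healthy.

    timeout_s corresponds to the threshold T in the text. hang_s is an
    assumed cut-off after which an unreturned request counts as hung.
    """
    if not sample.completed:
        # Still outstanding: only a hang once it has waited long enough.
        return Anomaly.HANG if sample.elapsed_s >= hang_s else None
    if sample.error is not None:
        return Anomaly.FAILURE
    if sample.elapsed_s > timeout_s:
        return Anomaly.TIMEOUT
    return None
```

For example, with T = 0.8 s, a request that completes without error after 1.5 s would be classified as a timeout, while one that completes in 0.2 s would be healthy.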
Further, according to some embodiments of the present invention, the sub-health detection algorithm in S4 of the I/O sub-health intelligent detection and recovery method of the distributed storage system according to the present invention includes:
(1) I/O anomaly counting method
M consecutive periods are examined, and when abnormal data is detected in N of them, the I/O is determined to be abnormal, where M ≥ N ≥ 1;
(2) I/O anomaly inference method
An I/O is initiated from the client, and along the key service path of each node on the I/O path an upper-layer service calls the interface of a lower-layer service. When the upper-layer service judges that the lower-layer service is abnormal, and that lower-layer service does not in turn indicate that the service below it is abnormal, the I/O anomaly is determined to be on that lower-layer service;
when the upper-layer service judges that the lower-layer service is abnormal, that lower-layer service in turn indicates that the service below it is abnormal, and the service below gives no further downward indication, the I/O anomaly is determined to be on the service below;
identification proceeds layer by layer in the same way until all nodes and key services with I/O anomalies have been identified.
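One possible formal reading of the counting rule is given below. It assumes the M periods form a sliding window ending at the current period t, which the text does not state explicitly; the indicator a_i is notation introduced here for illustration.

```latex
a_i =
\begin{cases}
1 & \text{if abnormal data is detected in period } i,\\
0 & \text{otherwise,}
\end{cases}
\qquad
\text{I/O abnormal at period } t \iff \sum_{i=t-M+1}^{t} a_i \ge N,
\qquad M \ge N \ge 1 .
```

Under this reading, N = 1 flags an anomaly on the first abnormal period, while N = M requires every period in the window to be abnormal.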
Further, according to some embodiments of the present invention, in the method for intelligently detecting and recovering I/O sub-health of a distributed storage system according to the present invention, in S5, a self-healing or isolation manner is used to process a key service or node in which an I/O sub-health state occurs, and an applicable recovery manner is selected according to the following decision logic:
(1) If the final inference is that only one service on one node has an I/O anomaly, the anomaly is recovered by self-healing the service, and an alarm is reported; if an I/O anomaly of that service is detected again within unit time S1, the service is isolated and an alarm is reported;
(2) If the final inference is that two or more services on only one node have I/O anomalies, a node I/O anomaly is inferred; the anomaly is recovered by restarting the node as a self-healing measure, and an alarm is reported; if an I/O anomaly of that node is detected again within unit time S2, the node is isolated and an alarm is reported;
(3) If the final inference is that two or more nodes have I/O anomalies, a group event is inferred, an alarm is reported, and manual intervention for recovery is prompted.
Further, in the method for intelligently detecting and recovering I/O sub-health of the distributed storage system, S1 is 12-48 hours, and S2 is 3-15 days.
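The escalation from self-healing to isolation within a unit time can be sketched as follows. The `RecurrenceTracker` class, its method name, and the string action labels are hypothetical; the window passed in would be chosen from the stated ranges (for instance 24 hours for a service, 7 days for a node).

```python
from datetime import datetime, timedelta
from typing import Dict


class RecurrenceTracker:
    """Escalates from self-healing to isolation when an anomaly recurs.

    A target (a service or a node, identified by an arbitrary string) is
    self-healed the first time. If it is reported again within `window`
    of that self-heal, the tracker returns "isolate" instead.
    """

    def __init__(self, window: timedelta) -> None:
        self.window = window
        self._last_heal: Dict[str, datetime] = {}

    def action_for(self, target: str, now: datetime) -> str:
        last = self._last_heal.get(target)
        if last is not None and now - last <= self.window:
            return "isolate"
        # First occurrence, or the previous self-heal is older than the window.
        self._last_heal[target] = now
        return "self_heal"
```

In this sketch an alarm would be reported in both branches; alarm delivery is left out because the text does not describe its interface.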
In a second aspect, the present invention further provides an intelligent detection and recovery system for I/O sub-health of a distributed storage system, where the intelligent detection and recovery system includes:
the I/O abnormal data acquisition module is used for acquiring I/O abnormal data recorded by each I/O statistical point location;
the I/O abnormal data analysis module is used for analyzing the type of I/O abnormal data and counting the occurrence frequency of various types of abnormal data;
the detection algorithm module is internally provided with a sub-health detection algorithm and is used for deducing key services or nodes with I/O sub-health states;
and the recovery decision module is internally provided with decision logic and is used for selecting an applicable recovery mode.
Further, according to some embodiments of the present invention, the sub-health detection algorithm in the I/O sub-health intelligent detection and recovery system of the distributed storage system of the present invention comprises:
(1) I/O anomaly counting method
M consecutive periods are examined, and when abnormal data is detected in N of them, the I/O is determined to be abnormal, where M ≥ N ≥ 1;
(2) I/O anomaly inference method
An I/O is initiated from the client, and along the key service path of each node on the I/O path an upper-layer service calls the interface of a lower-layer service. When the upper-layer service judges that the lower-layer service is abnormal, and that lower-layer service does not in turn indicate that the service below it is abnormal, the I/O anomaly is determined to be on that lower-layer service;
when the upper-layer service judges that the lower-layer service is abnormal, that lower-layer service in turn indicates that the service below it is abnormal, and the service below gives no further downward indication, the I/O anomaly is determined to be on the service below;
identification proceeds layer by layer in the same way until all nodes and key services with I/O anomalies have been identified.
Further, according to some embodiments of the present invention, the decision logic in the I/O sub-health intelligence detection and recovery system of the distributed storage system of the present invention comprises:
(1) If the final inference is that only one service on one node has an I/O anomaly, the anomaly is recovered by self-healing the service, and an alarm is reported; if an I/O anomaly of that service is detected again within unit time S1, the service is isolated and an alarm is reported;
(2) If the final inference is that two or more services on only one node have I/O anomalies, a node I/O anomaly is inferred; the anomaly is recovered by restarting the node as a self-healing measure, and an alarm is reported; if an I/O anomaly of that node is detected again within unit time S2, the node is isolated and an alarm is reported;
(3) If the final inference is that two or more nodes have I/O anomalies, a group event is inferred, an alarm is reported, and manual intervention for recovery is prompted.
In a third aspect, the present invention further provides a computer-readable storage medium, where a computer program is stored, and the computer program, when executed by a processor, implements the steps of the above-mentioned method for detecting and recovering I/O sub-health intelligence of a distributed storage system.
In conclusion, the intelligent detection and recovery method for I/O sub-health of the distributed storage system has the following characteristics:
(1) I/O abnormal data on key services are collected periodically through I/O statistical points pre-embedded in the storage system, the services exhibiting I/O anomalies are inferred automatically by the sub-health detection algorithm, and the abnormal services are then handled by self-healing or isolation as selected automatically, which effectively prevents fault propagation.
(2) Intelligent detection and recovery of I/O anomalies is achieved through a built-in sub-health detection algorithm and recovery decision logic. The key services in an I/O sub-health state are identified automatically without human intervention, and the system is recovered by self-healing or isolation, which eliminates risks in time, avoids service interruption, and improves the stability and safety of system operation.
Drawings
In order to illustrate the technical solutions of the embodiments of the present invention more clearly, the drawings used in the description of the embodiments are briefly introduced below. It is to be understood that the drawings in the following description illustrate some, but not all, embodiments of the invention, and that those skilled in the art may derive other drawings from them without inventive effort.
Fig. 1 is a schematic structural diagram of a monitoring cluster according to an embodiment of the present invention.
FIG. 2 is a flow chart of an I/O anomaly inference method according to an embodiment of the invention.
Explanation: service A1 indicates that service B2 is abnormal, and services A2 and A3 indicate that service B1 is abnormal; services B1 and B2 in turn both indicate that service C3 is abnormal, from which it is inferred that service C3 is abnormal.
FIG. 3 is a flow chart of an I/O sub-health intelligent detection and recovery method of a distributed storage system according to the present invention.
FIG. 4 is a schematic diagram of a distributed storage system I/O sub-health intelligent detection and recovery system according to the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention more apparent, the technical solutions of the present invention will be clearly and completely described below with reference to specific embodiments. It is to be understood that the embodiments described are merely illustrative of some, but not all, of the present invention and that the invention may be embodied or carried out in various other specific forms, and that various modifications and changes in the details of the specification may be made without departing from the spirit of the invention.
It should be noted that, in the case of no conflict, the features in the following embodiments and examples may be combined with each other; moreover, all other embodiments that can be derived by one skilled in the art from the embodiments disclosed herein without any inventive step are intended to be within the scope of the present disclosure.
It should be noted that the aspects described herein may be embodied in a wide variety of forms and that any specific structure and/or function described herein is merely illustrative. Based on the disclosure, one skilled in the art should appreciate that one aspect described herein may be implemented independently of any other aspects and that two or more of these aspects may be combined in various ways, e.g., an apparatus may be implemented and/or a method practiced using any number of the aspects set forth herein.
The present invention will be described in detail below with reference to the embodiments shown in fig. 1 to 4.
Referring to fig. 3, the method for intelligently detecting and recovering I/O sub-health of a distributed storage system according to the present invention includes:
(1) Cluster function
An I/O sub-health detection service, monitor, is deployed in the storage system. The service consists of a server and a client. The server component runs as an independent process on each storage node, operates in master-slave mode for reliability (such as monitor_server1, monitor_server2 and monitor_server3 in Fig. 1), and provides cluster master election, I/O sub-health data aggregation, inference, and isolation decision functions. The client component is loaded into each key business process on every storage node and is responsible for collecting the I/O latency, I/O error, and I/O hang-up data of each key service (such as service A, service B, and service C in Fig. 2); it reports the data to the node's monitor_server, which in turn reports it to the master monitor_server process.
(2) Periodic detection reporting
Statistical points are pre-embedded in the key services on the I/O path (for example, service A, service B, and service C in Fig. 2). Each key service periodically detects I/O abnormal data through monitor_client and reports it to monitor_server. Abnormal I/O covers the following three scenarios:
I/O timeout: the I/O latency exceeds a threshold of T seconds (latency requirements differ between services, so T depends on the specific service, for example 800 ms to 3 s);
I/O failure: the I/O reports an error;
I/O hang-up: the I/O hangs and does not return.
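A pre-embedded statistical point with periodic reporting might look like the sketch below. The `StatPoint` class, its `record` and `flush` methods, and the `report` callback (standing in for the path from monitor_client to monitor_server) are all hypothetical; the text does not describe the actual interfaces or transport.

```python
from collections import Counter
from typing import Callable, Dict


class StatPoint:
    """Counts anomalies for one key service during the current period."""

    def __init__(self, service: str,
                 report: Callable[[str, Dict[str, int]], None]) -> None:
        self.service = service
        self._report = report          # assumed upstream reporting hook
        self._counts: Counter = Counter()

    def record(self, anomaly_type: str) -> None:
        """Called on the I/O path whenever an abnormal I/O is observed."""
        self._counts[anomaly_type] += 1

    def flush(self) -> None:
        """End the period: send the counts upstream and start a new period."""
        self._report(self.service, dict(self._counts))
        self._counts.clear()
```

A timer in the hosting process would call `flush()` once per detection period, so that each report carries the anomaly types seen in that period together with their counts.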
(3) Intelligent reasoning based on sub-health detection algorithm
(3-1) I/O Exception counting method
I/O timeout: M1 (for example, 8) consecutive periods are examined, and if the I/O latency exceeds T seconds in N1 (for example, 4) of them, where M1 ≥ N1 ≥ 1, an I/O timeout anomaly is determined;
I/O failure: M2 (for example, 8) consecutive periods are examined, and if an I/O error is reported in N2 (for example, 4) of them, where M2 ≥ N2 ≥ 1, an I/O failure anomaly is determined;
I/O hang-up: if an I/O hangs and does not return, an I/O hang-up anomaly is determined.
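The N-of-M rule for timeouts and failures can be sketched with a fixed-length window. The sketch assumes the M periods are the most recent ones (a sliding window), which is one plausible reading of the text; the `NOfMDetector` class is hypothetical.

```python
from collections import deque
from typing import Deque


class NOfMDetector:
    """Flags an anomaly when at least N of the last M periods were abnormal."""

    def __init__(self, m: int, n: int) -> None:
        if not (m >= n >= 1):
            raise ValueError("require M >= N >= 1")
        self.n = n
        # deque(maxlen=m) drops the oldest period automatically.
        self._window: Deque[bool] = deque(maxlen=m)

    def observe(self, period_abnormal: bool) -> bool:
        """Record one period's result and return the current verdict."""
        self._window.append(period_abnormal)
        return sum(self._window) >= self.n
```

One detector instance would be kept per service and per anomaly type, for example `NOfMDetector(8, 4)` for timeouts using the example values M1 = 8 and N1 = 4. Requiring several abnormal periods rather than one trades detection delay for fewer false alarms caused by a single slow period.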
(3-2) I/O anomaly inference method
The I/O anomaly inference logic is illustrated using a high-latency scenario as an example. As shown in Fig. 2, an I/O is initiated from the client and passes through the key services on the I/O path: client -> service A -> service B -> service C.
(3-2-1) An upper-layer service calls the interface of a lower-layer service. If service A judges that service B shows an anomaly such as high latency, service B is preliminarily inferred to be abnormal; if service B does not in turn indicate that service C is abnormal, the anomaly is determined to be on service B;
(3-2-2) If service A judges that service B shows an anomaly such as high latency, service B is preliminarily inferred to be abnormal; if service B in turn indicates that service C is abnormal and service C gives no further downward indication, the anomaly is determined to be on service C (if service C has lower-layer services of its own, the same reasoning continues);
Identification proceeds layer by layer according to this logic until all nodes and key services with I/O anomalies have been identified.
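The layer-by-layer identification can be sketched as a downward walk over the reported indications. The data layout (a mapping from each service to the set of lower-layer services it flags as abnormal) and the `locate_faults` function are assumptions made for illustration; the service names follow the Fig. 2 explanation.

```python
from typing import Dict, Iterable, Set


def locate_faults(indications: Dict[str, Set[str]],
                  entry_points: Iterable[str]) -> Set[str]:
    """Follow abnormal indications downward from the top-layer services.

    A flagged service that flags nothing below it is where the anomaly
    is determined to be; otherwise the walk continues one layer down.
    """
    faults: Set[str] = set()
    seen: Set[str] = set()
    stack = [s for e in entry_points for s in indications.get(e, ())]
    while stack:
        svc = stack.pop()
        if svc in seen:
            continue
        seen.add(svc)
        lower = indications.get(svc, set())
        if lower:
            stack.extend(lower)   # blame is passed further down
        else:
            faults.add(svc)       # no further downward indication
    return faults


# Indications as described for Fig. 2.
fig2 = {"A1": {"B2"}, "A2": {"B1"}, "A3": {"B1"},
        "B1": {"C3"}, "B2": {"C3"}}
```

With the Fig. 2 indications and entry points A1, A2 and A3, the walk ends at service C3, matching the inference in the figure explanation.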
(4) Intelligent decision recovery
(4-1) If the final inference is that only one service on one node has an I/O anomaly, the anomaly is recovered by self-healing the service, and an alarm is reported; if an I/O anomaly of that service is detected again within unit time S1 (for example, 24 hours), the service is isolated (that is, stopped) and an alarm is reported;
(4-2) If the final inference is that two or more services on only one node have I/O anomalies, a node I/O anomaly is inferred; the anomaly is recovered by restarting the node as a self-healing measure, and an alarm is reported; if an I/O anomaly of that node is detected again within unit time S2 (for example, 7 days), the node is isolated (all services on the node are stopped) and an alarm is reported;
(4-3) If the final inference is that two or more nodes have I/O anomalies, a group event is inferred, an alarm is reported, and manual intervention for recovery is prompted.
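The scope part of this decision logic (one service, one node, or several nodes) can be sketched as a pure function over the inferred faults. The `decide` function, the (node, service) tuple layout, and the action labels are hypothetical; the escalation to isolation on recurrence within S1 or S2 is deliberately left out of this sketch.

```python
from typing import Iterable, Set, Tuple


def decide(faults: Iterable[Tuple[str, str]]) -> str:
    """Choose a first-occurrence recovery action from (node, service) faults."""
    fault_set: Set[Tuple[str, str]] = set(faults)
    if not fault_set:
        return "none"
    nodes = {node for node, _ in fault_set}
    if len(nodes) >= 2:
        # Two or more nodes affected: treated as a group event.
        return "alarm_manual_intervention"
    if len(fault_set) >= 2:
        # Two or more services on a single node: treated as a node anomaly.
        return "restart_node"
    return "self_heal_service"
```

For instance, `decide([("node1", "service_B")])` selects service self-healing, while faults on both node1 and node2 select the manual-intervention alarm. Checking the node count first keeps the group-event case from being mistaken for a single-node anomaly.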
Referring to fig. 4, the I/O sub-health intelligent detection and recovery system of the distributed storage system of the present invention comprises:
the I/O abnormal data acquisition module is used for acquiring I/O abnormal data recorded by each I/O statistical point location;
the I/O abnormal data analysis module is used for analyzing the type of I/O abnormal data and counting the occurrence frequency of various types of abnormal data;
the detection algorithm module is internally provided with a sub-health detection algorithm and is used for deducing key services or nodes with I/O sub-health states;
and the recovery decision module is internally provided with decision logic and is used for selecting an applicable recovery mode.
And each module is operated according to the I/O sub-health intelligent detection and recovery method of the distributed storage system.
The embodiments of the present invention are described in a progressive manner, and the same or similar parts among the embodiments can be referred to each other.
The above description is only an example of the present invention, and is not intended to limit the present invention. Various modifications and alterations to this invention will become apparent to those skilled in the art. Any modification, replacement, or the like that comes within the spirit and principle of the present invention should be included in the scope of the claims of the present invention.

Claims (9)

1. A distributed storage system I/O sub-health intelligent detection and recovery method is characterized by comprising the following steps:
S1: setting I/O statistical points on the key services of each node on the I/O path of the storage system;
S2: collecting and recording I/O abnormal data at the I/O statistical points;
S3: obtaining the types of I/O abnormal data recorded at each I/O statistical point and the occurrence count of each type;
S4: inferring, from the obtained data and through a sub-health detection algorithm, the key services or nodes in an I/O sub-health state;
S5: handling the key services or nodes in an I/O sub-health state by self-healing or isolation, so that fault propagation and service interruption are avoided.
2. The method for intelligently detecting and recovering I/O sub-health of a distributed storage system according to claim 1, wherein the I/O abnormal data in S2 comprises abnormal data of the following three scenarios:
I/O timeout: the I/O latency exceeds a threshold of T seconds;
I/O failure: the I/O reports an error;
I/O hang-up: the I/O hangs and does not return.
3. The distributed storage system I/O sub-health intelligence detection and recovery method of claim 2, wherein the sub-health detection algorithm in S4 comprises:
(1) I/O anomaly counting method
M consecutive periods are examined, and when abnormal data is detected in N of them, the I/O is determined to be abnormal, where M ≥ N ≥ 1;
(2) I/O anomaly inference method
An I/O is initiated from the client, and along the key service path of each node on the I/O path an upper-layer service calls the interface of a lower-layer service. When the upper-layer service judges that the lower-layer service is abnormal, and that lower-layer service does not in turn indicate that the service below it is abnormal, the I/O anomaly is determined to be on that lower-layer service;
when the upper-layer service judges that the lower-layer service is abnormal, that lower-layer service in turn indicates that the service below it is abnormal, and the service below gives no further downward indication, the I/O anomaly is determined to be on the service below;
identification proceeds layer by layer in the same way until all nodes and key services with I/O anomalies have been identified.
4. The method according to claim 1, wherein in S5, the critical service or node in which the I/O sub-health state occurs is processed in a self-healing or isolation manner, and an applicable recovery manner is selected according to the following decision logic:
(1) If the final inference is that only one service on one node has an I/O anomaly, the anomaly is recovered by self-healing the service, and an alarm is reported; if an I/O anomaly of that service is detected again within unit time S1, the service is isolated and an alarm is reported;
(2) If the final inference is that two or more services on only one node have I/O anomalies, a node I/O anomaly is inferred; the anomaly is recovered by restarting the node as a self-healing measure, and an alarm is reported; if an I/O anomaly of that node is detected again within unit time S2, the node is isolated and an alarm is reported;
(3) If the final inference is that two or more nodes have I/O anomalies, a group event is inferred, an alarm is reported, and manual intervention for recovery is prompted.
5. The distributed storage system I/O sub-health intelligence detection and recovery method of claim 4, wherein S1 is 12-48 hours and S2 is 3-15 days.
6. An intelligent detection and recovery system for I/O sub-health of a distributed storage system, the intelligent detection and recovery system comprising:
the I/O abnormal data acquisition module is used for acquiring I/O abnormal data recorded by each I/O statistical point location;
the I/O abnormal data analysis module is used for analyzing the type of I/O abnormal data and counting the occurrence frequency of various types of abnormal data;
the detection algorithm module is internally provided with a sub-health detection algorithm and is used for deducing key services or nodes with I/O sub-health states;
and the recovery decision module is internally provided with decision logic and is used for selecting an applicable recovery mode.
7. The distributed storage system I/O sub-health intelligence detection and recovery system of claim 6, wherein the sub-health detection algorithm comprises:
(1) I/O anomaly counting method
M consecutive periods are examined, and when abnormal data is detected in N of them, the I/O is determined to be abnormal, where M ≥ N ≥ 1;
(2) I/O anomaly inference method
An I/O is initiated from the client, and along the key service path of each node on the I/O path an upper-layer service calls the interface of a lower-layer service. When the upper-layer service judges that the lower-layer service is abnormal, and that lower-layer service does not in turn indicate that the service below it is abnormal, the I/O anomaly is determined to be on that lower-layer service;
when the upper-layer service judges that the lower-layer service is abnormal, that lower-layer service in turn indicates that the service below it is abnormal, and the service below gives no further downward indication, the I/O anomaly is determined to be on the service below;
identification proceeds layer by layer in the same way until all nodes and key services with I/O anomalies have been identified.
8. The distributed storage system I/O sub-health intelligence detection and recovery system of claim 6, wherein the decision logic comprises:
(1) If the final inference is that only one service on one node has an I/O anomaly, the anomaly is recovered by self-healing the service, and an alarm is reported; if an I/O anomaly of that service is detected again within unit time S1, the service is isolated and an alarm is reported;
(2) If the final inference is that two or more services on only one node have I/O anomalies, a node I/O anomaly is inferred; the anomaly is recovered by restarting the node as a self-healing measure, and an alarm is reported; if an I/O anomaly of that node is detected again within unit time S2, the node is isolated and an alarm is reported;
(3) If the final inference is that two or more nodes have I/O anomalies, a group event is inferred, an alarm is reported, and manual intervention for recovery is prompted.
9. A computer-readable storage medium having stored thereon a computer program which, when executed by a processor, performs the steps of the distributed storage system I/O sub-health intelligent detection and recovery method of any of claims 1-5.

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211233825.9A CN115470061A (en) 2022-10-10 2022-10-10 Distributed storage system I/O sub-health intelligent detection and recovery method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202211233825.9A CN115470061A (en) 2022-10-10 2022-10-10 Distributed storage system I/O sub-health intelligent detection and recovery method

Publications (1)

Publication Number Publication Date
CN115470061A true CN115470061A (en) 2022-12-13

Family

ID=84337935

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202211233825.9A Pending CN115470061A (en) 2022-10-10 2022-10-10 Distributed storage system I/O sub-health intelligent detection and recovery method

Country Status (1)

Country Link
CN (1) CN115470061A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20240036990A1 (en) * 2021-06-15 2024-02-01 Inspur Suzhou Intelligent Technology Co., Ltd. Inference service management method, apparatus and system for inference platform, and medium
US11994958B2 (en) * 2021-06-15 2024-05-28 Inspur Suzhou Intelligent Technology Co., Ltd. Inference service management method, apparatus and system for inference platform, and medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination