CN117076236A - Storage pool fault detection method, device, equipment, medium and product - Google Patents

Publication number
CN117076236A
Authority
CN
China
Prior art keywords
storage pool
server
disk
information
suspension
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202310960539.0A
Other languages
Chinese (zh)
Inventor
邹萌萍
张晓燕
柳跃
赵堤
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Industrial and Commercial Bank of China Ltd ICBC
Original Assignee
Industrial and Commercial Bank of China Ltd ICBC
Application filed by Industrial and Commercial Bank of China Ltd ICBC
Priority to CN202310960539.0A
Publication of CN117076236A

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00 Error detection; Error correction; Monitoring
    • G06F11/30 Monitoring
    • G06F11/3003 Monitoring arrangements specially adapted to the computing system or computing system component being monitored
    • G06F11/3034 Monitoring arrangements specially adapted to the computing system or computing system component being monitored where the computing system component is a storage system, e.g. DASD based or network based
    • G06F11/3041 Monitoring arrangements specially adapted to the computing system or computing system component being monitored where the computing system component is an input/output interface
    • G06F11/3065 Monitoring arrangements determined by the means or processing involved in reporting the monitored data
    • G06F11/32 Monitoring with visual or acoustical indication of the functioning of the machine
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00 Energy efficient computing, e.g. low power processors, power management or thermal management

Abstract

The disclosure provides a storage pool fault detection method, device, equipment, medium and product, relates to the technical field of big data, and can be applied to the technical field of financial technology. The method comprises the following steps: obtaining disk IO information of each server corresponding to the storage pool; determining the servers with IO suspension according to the disk IO information of each server; and determining a storage pool fault detection result according to the number of the servers with IO suspension, wherein the storage pool fault detection result comprises a storage pool storage abnormality or a disk performance abnormality of the servers on the storage pool.

Description

Storage pool fault detection method, device, equipment, medium and product
Technical Field
The disclosure relates to the technical field of big data and can be applied to the technical field of financial technology, and in particular relates to a storage pool fault detection method, device, equipment, medium and product.
Background
IO, also written I/O, refers to Input/Output. Disk IO refers to the reading and writing of bytes on a disk, i.e., the read/write capability of the disk. In a storage failure scenario (such as a broken storage link), IO errors of a physical disk are passed through the virtualization layer to the front end of the virtual machine. When the virtual machine receives the IO errors, the user file system inside the virtual machine may become read-only, and the virtual machine then needs to be restarted or manually recovered by the user.
Currently, data centers are widely promoting cloud services and the number of virtual machines is growing explosively, so the risk that a hidden fault in a single storage pool causes abnormalities across a large number of virtual machines needs to be addressed quickly. In this context, monitoring storage pools for storage failures is extremely important. With servers deployed at larger scale, traditional manual inspection can hardly discover existing storage pool faults in time and cannot meet the monitoring requirements of an ever-growing server fleet. An automatic monitoring system is urgently needed that can quickly discover hidden storage pool faults in the data center while meeting requirements of efficiency and universality, so that production problems are addressed proactively and emergency response time is shortened.
Disclosure of Invention
Accordingly, the primary objective of the present disclosure is to provide a method, apparatus, device, medium and product for detecting storage pool faults, which aim to at least partially solve the technical problem that traditional manual inspection can hardly discover storage pool faults in time and is not suited to the monitoring requirements of servers of ever-increasing scale.
To achieve the above object, a first aspect of an embodiment of the present disclosure provides a storage pool failure detection method, applied to a server, including: obtaining disk IO information of each server corresponding to the storage pool; determining a server with IO suspension according to the disk IO information of each server; and determining a storage pool fault detection result according to the number of the servers with IO suspension, wherein the storage pool fault detection result comprises storage pool storage abnormality or disk performance abnormality of the servers on the storage pool.
According to an embodiment of the present disclosure, the obtaining disk IO information of each server corresponding to the storage pool includes: writing a timing acquisition program; the timing acquisition program is deployed on each server; and calling the timing acquisition program to read the disk IO information of each server in batches every preset time period in the detection time period.
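The timed, batched read described above can be sketched as follows. This is a minimal illustration, not the program of the disclosure: it assumes the disk IO information is read from a file in the Linux /proc/diskstats layout (the disclosure's FIG. 4 refers to a /proc file but the exact format is not given here), and the names `parse_diskstats`, `collect_batches`, `SAMPLE`, and the dictionary keys are hypothetical.

```python
from typing import Callable, Dict, List

# Sample text in the Linux /proc/diskstats layout (an assumption, see above).
SAMPLE = (
    "   8       0 sda 1200 30 45000 800 3400 120 98000 2100 2 1500 2900\n"
    " 253       0 vda 500 0 4000 100 700 10 8000 300 0 200 400\n"
)

def parse_diskstats(text: str) -> Dict[str, Dict[str, int]]:
    """Extract the three counters used for IO suspension detection, per disk."""
    stats: Dict[str, Dict[str, int]] = {}
    for line in text.splitlines():
        fields = line.split()
        if len(fields) < 12:  # skip blank or truncated lines
            continue
        stats[fields[2]] = {
            "reads_completed": int(fields[3]),   # read IO completion count
            "writes_completed": int(fields[7]),  # write IO completion count
            "in_flight": int(fields[11]),        # IO requests currently queued
        }
    return stats

def collect_batches(
    read_text: Callable[[], str],
    batches: int,
    interval_s: float,
    sleep: Callable[[float], None],
) -> List[Dict[str, Dict[str, int]]]:
    """Read one sample per preset interval; reader and sleep are injected."""
    samples = []
    for i in range(batches):
        samples.append(parse_diskstats(read_text()))
        if i < batches - 1:
            sleep(interval_s)  # wait for the next preset time period
    return samples
```

On a Linux host, `read_text` could be `lambda: open("/proc/diskstats").read()` and `sleep` could be `time.sleep`; injecting both keeps the sketch easy to exercise without real disks.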
According to an embodiment of the present disclosure, the obtaining disk IO information of each server corresponding to the storage pool further includes: calling the timing acquisition program to read parameters of each server, wherein the parameters comprise IP and/or host names; and calling the timing acquisition program to respectively pack parameters and disk IO information of each server to obtain a first JSON string, wherein the first JSON string comprises two layers of JSON strings, one layer of JSON string corresponds to the parameters, and the other layer of JSON string corresponds to the disk IO information.
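One possible shape for the two-layer first JSON string is sketched below. The key names `params` and `disk_io`, the function name `pack_first_json`, and the sample values are assumptions made for illustration; the disclosure only states that one layer holds the server parameters and the other holds the disk IO information.

```python
import json
from typing import Dict

def pack_first_json(ip: str, hostname: str, disk_io: Dict[str, int]) -> str:
    """Pack server parameters and disk IO information into a two-layer JSON string."""
    payload = {
        "params": {"ip": ip, "hostname": hostname},  # layer for server parameters
        "disk_io": disk_io,                          # layer for disk IO information
    }
    return json.dumps(payload)

# Hypothetical sample values.
first_json = pack_first_json(
    "10.0.0.11",
    "vm-app-01",
    {"reads_completed": 1200, "writes_completed": 3400, "in_flight": 2},
)
```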
According to an embodiment of the present disclosure, the obtaining disk IO information of each server corresponding to the storage pool further includes: receiving the first JSON character string packaged by the timing acquisition program; parsing the first JSON character string to obtain the parameters and a second JSON character string respectively, wherein the second JSON character string is a single-layer JSON string and corresponds to the disk IO information; and sending the second JSON character string to Kafka for caching, with the parameters used as the storage index.
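The parse-and-forward step might look like the sketch below. To stay self-contained it uses a hypothetical in-memory stand-in (`InMemoryProducer`) rather than a real Kafka client; the key names `params` and `disk_io` and the choice of the IP (falling back to the host name) as the message key are likewise assumptions.

```python
import json
from typing import List, Optional, Tuple

class InMemoryProducer:
    """Stand-in for a Kafka producer; it only records (key, value) pairs."""
    def __init__(self) -> None:
        self.sent: List[Tuple[Optional[str], str]] = []

    def send(self, key: Optional[str], value: str) -> None:
        self.sent.append((key, value))

def forward_to_cache(first_json: str, producer: InMemoryProducer) -> Tuple[Optional[str], str]:
    """Split the two-layer string and forward the single-layer disk IO part."""
    outer = json.loads(first_json)
    params = outer["params"]
    second_json = json.dumps(outer["disk_io"])        # single-layer JSON string
    key = params.get("ip") or params.get("hostname")  # parameter used as storage index
    producer.send(key, second_json)
    return key, second_json

# Hypothetical input in the assumed two-layer shape.
first_json = json.dumps({
    "params": {"ip": "10.0.0.11", "hostname": "vm-app-01"},
    "disk_io": {"reads_completed": 1200, "writes_completed": 3400, "in_flight": 2},
})
producer = InMemoryProducer()
key, second_json = forward_to_cache(first_json, producer)
```

With a real client, the `send` call would be replaced by that client's publish call, with the parameter passed as the message key.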
According to the embodiment of the disclosure, the disk IO information comprises read IO completion times, write IO completion times and IO request numbers in a current IO queue; the determining the server with IO suspension according to the disk IO information of each server comprises the following steps: for each server, determining whether the read IO completion times and the write IO completion times are increased before and after a preset time period, and determining whether the IO request number in the current IO queue is zero; and determining that the server generates IO suspension in response to the fact that the read IO completion times and the write IO completion times are not increased and the IO request number in the current IO queue is not zero.
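The suspension condition above reduces to a comparison of two samples taken a preset time period apart. A minimal sketch, assuming each sample is a dictionary with the hypothetical keys `reads_completed`, `writes_completed`, and `in_flight`:

```python
from typing import Dict

def is_io_suspended(prev: Dict[str, int], curr: Dict[str, int]) -> bool:
    """Return True when no read or write completed but requests are still queued."""
    reads_stalled = curr["reads_completed"] <= prev["reads_completed"]
    writes_stalled = curr["writes_completed"] <= prev["writes_completed"]
    queue_not_empty = curr["in_flight"] != 0
    return reads_stalled and writes_stalled and queue_not_empty

# Hypothetical samples taken before and after the preset time period.
before = {"reads_completed": 1200, "writes_completed": 3400, "in_flight": 2}
stalled = {"reads_completed": 1200, "writes_completed": 3400, "in_flight": 5}
healthy = {"reads_completed": 1260, "writes_completed": 3400, "in_flight": 1}
idle = {"reads_completed": 1200, "writes_completed": 3400, "in_flight": 0}
```

Note that an idle disk (counters unchanged, empty queue) is not flagged: the non-empty queue is what distinguishes a hung disk from one with nothing to do.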
According to an embodiment of the present disclosure, the determining the storage pool failure detection result according to the number of servers where the IO suspension occurs includes: respectively counting a first total number of servers with IO suspension on each storage pool in the current batch; and determining that the storage pool is abnormally stored in response to the first total number being greater than a first storage pool IO suspension controllable threshold, wherein the first storage pool IO suspension controllable threshold is determined according to a steady state of the storage pool.
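The per-pool count for the current batch can be sketched as follows. The server-to-pool mapping, the function name, and the threshold value are hypothetical; the disclosure says only that the threshold is derived from the steady state of the storage pool.

```python
from collections import Counter
from typing import Dict, Iterable, List

def pools_with_storage_anomaly(
    suspended_servers: Iterable[str],
    server_to_pool: Dict[str, str],
    pool_threshold: int,
) -> List[str]:
    """Return pools whose count of suspended servers in this batch exceeds the threshold."""
    counts = Counter(server_to_pool[s] for s in suspended_servers)  # first total per pool
    return sorted(pool for pool, n in counts.items() if n > pool_threshold)

# Hypothetical mapping and batch.
server_to_pool = {"vm1": "poolA", "vm2": "poolA", "vm3": "poolA", "vm4": "poolB"}
abnormal = pools_with_storage_anomaly(["vm1", "vm2", "vm3", "vm4"], server_to_pool, 2)
```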
According to an embodiment of the present disclosure, the determining the storage pool failure detection result according to the number of servers where the IO suspension occurs further includes: calculating the ratio of the number of servers with IO suspension in the current batch to the total number of servers determined to have IO suspension across the detection batches in the detection time period; and determining that the disk performance of the servers on the storage pool is abnormal in response to the ratio being greater than a server IO suspension controllable threshold, wherein the server IO suspension controllable threshold is determined according to the stable state of the servers.
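One reading of this ratio check is sketched below. It is an interpretation rather than the disclosure's definition: the denominator is taken to be the number of distinct servers flagged in any batch of the detection time period (including the current batch), and the function name and threshold value are hypothetical.

```python
from typing import Iterable, List, Set, Tuple

def disk_performance_anomaly(
    current_batch: Iterable[str],
    window_batches: List[Set[str]],
    server_threshold: float,
) -> Tuple[float, bool]:
    """Compare the current batch's share of all suspended servers with a threshold."""
    current = set(current_batch)
    seen_in_window = set().union(*window_batches) | current  # distinct servers in window
    if not seen_in_window:
        return 0.0, False  # nothing suspended, nothing to report
    ratio = len(current) / len(seen_in_window)
    return ratio, ratio > server_threshold

# Hypothetical window of three batches; the last one is the current batch.
window = [{"vm1"}, {"vm3"}, {"vm1", "vm2"}]
ratio, abnormal = disk_performance_anomaly({"vm1", "vm2"}, window, 0.5)
```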
According to an embodiment of the present disclosure, the determining the storage pool failure detection result according to the number of servers where the IO suspension occurs further includes: respectively counting the second total number of servers with IO hanging in the detection time period of the storage pool to which the servers with IO hanging in the current batch belong; and determining that the storage pool is abnormally stored in response to the second total number being greater than a second storage pool IO suspension controllable threshold, wherein the second storage pool IO suspension controllable threshold is determined according to a steady state of the storage pool.
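The window-level check could be sketched as follows, assuming that the second total counts distinct servers per pool over all batches in the detection time period and that only pools touched by the current batch are examined. The names and the threshold values are hypothetical.

```python
from typing import Dict, Iterable, List, Set

def pools_exceeding_window_total(
    current_batch: Iterable[str],
    window_batches: List[Set[str]],
    server_to_pool: Dict[str, str],
    second_threshold: int,
) -> List[str]:
    """Return affected pools whose suspended-server total over the window exceeds the threshold."""
    affected = {server_to_pool[s] for s in current_batch}
    per_pool: Dict[str, Set[str]] = {pool: set() for pool in affected}
    for batch in window_batches:
        for server in batch:
            pool = server_to_pool[server]
            if pool in per_pool:  # ignore pools not touched by the current batch
                per_pool[pool].add(server)
    return sorted(p for p, servers in per_pool.items() if len(servers) > second_threshold)

# Hypothetical mapping and window; the last batch is the current one.
server_to_pool = {"vm1": "poolA", "vm2": "poolA", "vm3": "poolA", "vm9": "poolB"}
window = [{"vm2"}, {"vm3", "vm9"}, {"vm1"}]
abnormal = pools_exceeding_window_total({"vm1"}, window, server_to_pool, 2)
```

This catches a pool whose servers hang a few at a time: no single batch exceeds the first threshold, yet the total over the window does.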
According to an embodiment of the present disclosure, the detection method further includes: and sending early warning information under the condition of determining the fault of the storage pool, wherein the method for sending the early warning information comprises at least one of mail early warning, short message early warning, monitoring system early warning and operation and maintenance webpage display early warning.
According to the embodiment of the disclosure, the early warning information comprises an early warning type, an early warning level, the name of the storage pool with the fault, the parameters and applications of the servers with IO suspension on the storage pool with the fault, and the disk IO change information of the servers with IO suspension.
According to the embodiment of the disclosure, the early warning information is arranged progressively in the following order: the early warning type, the early warning level, the name of the storage pool with the fault, the parameters and applications of the servers with IO suspension on the storage pool with the fault, and the disk IO change information of the servers with IO suspension.
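A possible layout for such a progressively ordered warning record is sketched below. All field names, the function name, and the sample values are hypothetical; the sketch relies only on the fact that Python dictionaries preserve insertion order, so the record serializes from the most general field (type) down to the most specific (per-server disk IO change).

```python
import json
from typing import Dict, List

def build_warning(
    warning_type: str,
    warning_level: str,
    pool_name: str,
    servers: List[Dict[str, object]],
) -> Dict[str, object]:
    """Assemble warning fields in the progressive order described above."""
    return {
        "warning_type": warning_type,
        "warning_level": warning_level,
        "pool_name": pool_name,
        "servers": [
            {
                "params": s["params"],                  # IP and/or host name
                "application": s["application"],        # application on the server
                "disk_io_change": s["disk_io_change"],  # counters before and after
            }
            for s in servers
        ],
    }

warning = build_warning(
    "storage pool storage abnormality",
    "critical",
    "poolA",
    [{
        "params": {"ip": "10.0.0.11", "hostname": "vm-app-01"},
        "application": "transfer-service",
        "disk_io_change": {"reads_completed": [1200, 1200], "in_flight": [2, 5]},
    }],
)
message = json.dumps(warning, indent=2)  # text body for mail, SMS, or web display
```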
A second aspect of an embodiment of the present disclosure provides a storage pool failure detection apparatus, including: the acquisition module is used for acquiring disk IO information of each server corresponding to the storage pool; the first determining module is used for determining the server with IO suspension according to the disk IO information of each server; and the second determining module is used for determining a storage pool fault detection result according to the number of the servers with IO suspension, wherein the storage pool fault detection result comprises storage pool storage abnormality or disk performance abnormality of the servers on the storage pool.
A third aspect of an embodiment of the present disclosure provides an electronic device, including: one or more processors; and a storage device for storing one or more programs, wherein the one or more programs, when executed by the one or more processors, cause the one or more processors to perform the storage pool failure detection method described above.
A fourth aspect of the disclosed embodiments provides a computer-readable storage medium having stored thereon executable instructions that, when executed by a processor, cause the processor to perform the storage pool failure detection method described above.
A fifth aspect of the disclosed embodiments provides a computer program product comprising a computer program which, when executed by a processor, implements the storage pool failure detection method described above.
The storage pool fault detection method, device, equipment, medium and product provided by the embodiment of the disclosure have the following beneficial effects:
Because whether IO suspension occurs on each server is determined from the disk IO information of that server and, in combination with the characteristics of the storage pool, whether the storage pool is abnormal is determined according to the number of servers with IO suspension, storage pools that may have storage faults can be actively discovered, and automatic detection of storage pool faults is realized. Meanwhile, because the state of the storage pool can be fed back intuitively and accurately merely from whether IO suspension occurs on the servers, the efficiency of storage pool fault detection is improved, and efficient detection of storage pool faults across large-scale server deployments can be readily realized.
Because the disk IO information used for detecting storage pool faults is obtained automatically at regular intervals by writing and calling the timing acquisition program, storage pool faults can be detected automatically and periodically without manual inspection by operation and maintenance personnel.
In the process of acquiring the disk IO information, the disk IO information and the parameters of the server are packaged into JSON character strings. Reading a file and packaging its contents as a JSON string is faster than collecting the information by running a command and occupies fewer server resources, so the efficiency of storage pool fault detection can be further improved, the method is better suited to wide deployment, and its universality is improved. In addition, two layers of JSON strings are used when the disk IO information is acquired: one layer corresponds to the server parameters and the other corresponds to the disk IO information, which makes the large volume of collected disk IO information convenient to parse. The two-layer JSON string is then parsed and the disk IO information is extracted into a single-layer JSON string to better serve storage pool fault detection.
Because the read IO completion times, the write IO completion times and the IO request numbers in the current IO queues can accurately reflect the states of the disk IOs, whether the server is hung by the IO can be accurately determined through the read IO completion times, the write IO completion times and the IO request numbers in the current IO queues, and therefore the accuracy of fault detection of the storage pool is improved.
Different strategies are adopted for different abnormal conditions of the storage pool to determine whether a storage pool fault has occurred. Because the different strategies cover different types of storage pool faults, performing storage pool fault detection based on them improves detection accuracy.
By sending early warning information when a storage pool fault is determined, the method can, on the one hand, notify operation and maintenance personnel in time to handle the storage pool fault, and on the other hand, comprehensively present relevant information to the user, including early warning records and handling records of storage pool faults and server disk IO abnormalities, information on the servers involved in the storage pool fault, and the disk IO trend of the abnormal servers, so that the user can conveniently analyze the root cause of the storage pool fault. In addition, by reasonably defining the specific format of the early warning information, the relevant information can be presented to the user more intuitively.
Drawings
In order to more clearly illustrate the embodiments of the present disclosure or the technical solutions in the prior art, the drawings that are required in the embodiments or the description of the prior art will be briefly described below, and it is apparent that the drawings in the following description are only some embodiments of the present disclosure, and other drawings may be obtained from the structures shown in these drawings without inventive effort to those of ordinary skill in the art.
FIG. 1 schematically illustrates a system architecture 100 of a storage pool failure detection method and apparatus according to an embodiment of the present disclosure;
FIG. 2 schematically illustrates a flow chart of a storage pool failure detection method according to an embodiment of the present disclosure;
FIG. 3 is a flow chart schematically illustrating the obtaining of disk IO information of each server corresponding to a storage pool in operation S201 shown in FIG. 2 according to an embodiment of the present disclosure;
FIG. 4 schematically shows an example diagram of /proc/disks, according to an embodiment of the present disclosure;
FIG. 5 schematically illustrates a flowchart of acquiring disk IO information of each server corresponding to a storage pool in operation S201 illustrated in FIG. 2 according to another embodiment of the present disclosure;
FIG. 6 is a flow chart schematically illustrating the obtaining of disk IO information of each server corresponding to a storage pool in operation S201 shown in FIG. 2 according to another embodiment of the present disclosure;
FIG. 7 schematically illustrates a flowchart of determining a server where IO suspension occurs according to disk IO information of each server in operation S202 shown in FIG. 2 according to an embodiment of the present disclosure;
FIG. 8 schematically illustrates a flowchart of determining a storage pool failure detection result according to the number of servers where IO hanging occurs in operation S203 shown in FIG. 2 according to an embodiment of the present disclosure;
FIG. 9 schematically illustrates a flowchart of determining a storage pool failure detection result according to the number of servers where IO hanging occurs in operation S203 shown in FIG. 2 according to another embodiment of the present disclosure;
FIG. 10 schematically illustrates a flowchart of determining a storage pool failure detection result according to the number of servers where IO hanging occurs in operation S203 shown in FIG. 2 according to another embodiment of the present disclosure;
FIG. 11 schematically illustrates a flow chart of a method of storage pool failure detection according to another embodiment of the present disclosure;
FIG. 12 schematically illustrates a block diagram of a storage pool failure detection apparatus according to an embodiment of the present disclosure;
FIG. 13 schematically illustrates a block diagram of a storage pool failure detection apparatus according to another embodiment of the present disclosure;
FIG. 14 schematically illustrates a block diagram of an electronic device adapted to implement the above-described method according to an embodiment of the present disclosure.
Detailed Description
Hereinafter, embodiments of the present disclosure will be described with reference to the accompanying drawings. It should be understood that the description is only exemplary and is not intended to limit the scope of the present disclosure. In the following detailed description, for purposes of explanation, numerous specific details are set forth in order to provide a thorough understanding of the embodiments of the present disclosure. It may be evident, however, that one or more embodiments may be practiced without these specific details. In addition, in the following description, descriptions of well-known structures and techniques are omitted so as not to unnecessarily obscure the concepts of the present disclosure.
The terminology used herein is for the purpose of describing particular embodiments only and is not intended to be limiting of the disclosure. The terms "comprises," "comprising," and/or the like, as used herein, specify the presence of stated features, steps, operations, and/or components, but do not preclude the presence or addition of one or more other features, steps, operations, or components.
All terms (including technical and scientific terms) used herein have the same meaning as commonly understood by one of ordinary skill in the art unless otherwise defined. It should be noted that the terms used herein should be construed to have meanings consistent with the context of the present specification and should not be construed in an idealized or overly formal manner.
Where expressions like "at least one of A, B and C, etc." are used, the expressions should generally be interpreted in accordance with the meaning as commonly understood by those skilled in the art (e.g., "a system having at least one of A, B and C" shall include, but not be limited to, a system having A alone, B alone, C alone, A and B together, A and C together, B and C together, and/or A, B, C together, etc.). Where a formulation similar to "at least one of A, B or C, etc." is used, in general such a formulation should be interpreted in accordance with the ordinary understanding of one skilled in the art (e.g., "a system with at least one of A, B or C" would include but not be limited to systems with A alone, B alone, C alone, A and B together, A and C together, B and C together, and/or A, B, C together, etc.).
Some block diagrams and/or flowcharts are shown in the figures. It will be understood that some blocks of the block diagrams and/or flowchart illustrations, or combinations of blocks in the block diagrams and/or flowchart illustrations, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, or other programmable data processing apparatus, such that the instructions, when executed by the processor, create means for implementing the functions/acts specified in the block diagrams and/or flowcharts. The techniques of this disclosure may be implemented in hardware and/or software (including firmware, microcode, etc.). Additionally, the techniques of this disclosure may take the form of a computer program product on a computer-readable storage medium having instructions stored thereon, the computer program product being for use by or in connection with an instruction execution system.
In the technical solution of the present disclosure, the user information involved (including, but not limited to, user personal information, user image information, and user equipment information such as location information) and data (including, but not limited to, data for analysis, stored data, and displayed data) are information and data authorized by the user or fully authorized by all parties. The collection, storage, use, processing, transmission, provision, disclosure, and application of the related data all comply with the relevant laws, regulations, and standards of the relevant countries and regions; necessary security measures are taken; public order is not prejudiced; and corresponding operation entries are provided for the user to choose to authorize or refuse.
In view of the technical problems in the related art, the embodiment of the disclosure provides a storage pool fault detection method, which is applied to a server and comprises the following steps: obtaining disk IO information of each server corresponding to the storage pool; determining the servers with IO suspension according to the disk IO information of each server; and determining a storage pool fault detection result according to the number of the servers with IO suspension, wherein the storage pool fault detection result comprises a storage pool storage abnormality or a disk performance abnormality of the servers on the storage pool.
Fig. 1 schematically illustrates a system architecture 100 of a storage pool failure detection method and apparatus according to an embodiment of the present disclosure. It should be noted that fig. 1 is only an example of a system architecture to which embodiments of the present disclosure may be applied to assist those skilled in the art in understanding the technical content of the present disclosure, but does not mean that embodiments of the present disclosure may not be used in other devices, systems, environments, or scenarios.
As shown in FIG. 1, a system architecture 100 according to this embodiment may include cloud platform servers 101, 102, 103, a cache module 104, a network 105, and a server 106. The network 105 serves as the medium providing communication links between the cloud platform servers 101, 102, 103 and the cache module 104 and the server 106, and between the cache module 104 and the server 106.
The cloud platform servers 101, 102, 103 may be, for example, virtual machines, each of which is backed by a corresponding storage pool. A failure of a storage pool may cause abnormalities across a large number of virtual machines, so that normal business services cannot be provided to users.
The caching module 104 may be configured to cache disk IO information of the cloud platform servers 101, 102, 103. For example, the caching module may build an engine table (source), a local table (local), and a materialized view (consumer) in a ClickHouse database. The engine table may be created based on the Kafka engine; it points to the Kafka server and automatically consumes the disk IO information, storing it into the source table. The materialized view builds a mapping from the engine table to the local table, so that all of the data is ultimately stored in the local table for downstream use.
The network 105 may include various connection types, such as wired or wireless communication links, or fiber optic cables, among others. The wired mode may be, for example, a connection using any of a number of interfaces; the wireless mode may be, for example, a connection using any one of a plurality of wireless technology standards such as Bluetooth, Wi-Fi, infrared, ZigBee, etc.
The server 106 may be a background server capable of providing storage pool anomaly detection. The server 106 deploys an acquisition program on the cloud platform servers 101, 102, 103 through the network 105 to obtain disk IO information of each server corresponding to the storage pool, determines the servers with IO suspension according to the disk IO information of each server, and determines a storage pool fault detection result according to the number of the servers with IO suspension.
It should be noted that the storage pool failure detection method provided by the embodiments of the present disclosure may be performed by the server 106. Accordingly, the failure detection apparatus for the storage pool provided by the embodiments of the present disclosure may be disposed in the server 106. Alternatively, the storage pool failure detection method provided by the embodiments of the present disclosure may also be performed by a server or a server cluster that is different from the server 106 and is capable of communicating with the cloud platform servers 101, 102, 103 and/or the cache module 104 and/or the server 106. Accordingly, the storage pool failure detection apparatus provided by the embodiments of the present disclosure may also be disposed in a server or a server cluster that is different from the server 106 and is capable of communicating with the cloud platform servers 101, 102, 103 and/or the cache module 104 and/or the server 106. Alternatively, the storage pool fault detection method provided by the embodiments of the present disclosure may be partially executed by the server 106, partially executed by the cloud platform servers 101, 102, 103, and partially executed by the cache module 104. Correspondingly, the storage pool fault detection device provided in the embodiments of the present disclosure may be partially disposed in the server 106, partially disposed in the cloud platform servers 101, 102, 103, and partially disposed in the cache module 104.
It should be understood that the number of cloud platform servers, cache modules, networks, and servers in fig. 1 is merely illustrative. There may be any number of cloud platform servers, caching modules, networks, and servers, as desired for implementation.
The storage pool fault detection method provided by the embodiment of the disclosure can be applied to the field of financial technology. For example, a bank needs to deploy virtual machines to provide various types of services for users, such as a credit card application service, a transfer service, an information change service, and the like. With the diversification of service types, the diversification of functions provided by each service, and the rapid growth of the user base, the number of virtual machines deployed by banks is increasing explosively. With virtual machines (servers) deployed at large scale, traditional manual inspection can hardly discover existing storage pool faults in time and cannot meet the monitoring requirements of an ever-growing virtual machine fleet. Against this background, the storage pool fault detection method provided by the embodiment of the disclosure changes the traditional passive approach, in which an abnormality is discovered only after a storage fault has already occurred and must be resolved. The method realizes automatic detection of storage pool faults, accurately discovers hidden storage pool faults in the data center while meeting requirements of efficiency and universality, addresses production problems proactively, and shortens emergency response time.
It should be understood that the storage pool fault detection method provided by the embodiments of the present disclosure is not limited to application in the field of financial technology, but may be used in any field other than the financial field. The above description is exemplary only, and the storage pool fault detection method of the embodiments of the present disclosure may be applied to any field related to storage pool fault detection, such as e-commerce, logistics, and other technical fields.
A scenario of storage pool failure detection will be described below based on fig. 1, and a storage pool failure detection method according to an embodiment of the present disclosure will be described in detail with reference to fig. 2 to 11.
FIG. 2 schematically illustrates a flow chart of a storage pool failure detection method according to an embodiment of the present disclosure.
As shown in fig. 2, the storage pool failure detection method is applied to the server, and may include operations S201 to S203, for example.
In operation S201, disk IO information of each server corresponding to the storage pool is obtained.
In operation S202, the server where IO suspension occurs is determined according to the disk IO information of each server.
In operation S203, a storage pool failure detection result is determined according to the number of servers where the IO suspension occurs, the storage pool failure detection result including a storage pool storage abnormality or a disk performance abnormality of a server on the storage pool.
In the embodiments of the present disclosure, IO suspension may be understood as follows. In a storage failure scenario (such as a broken storage link), an IO error of the physical disk is passed through the virtualization layer to the front end of the virtual machine. When the virtual machine receives the IO error, the user file system inside it may become read-only, so that the virtual machine must be restarted or the user must recover it manually. To handle this case, the virtualization platform provides a disk IO suspension capability: when a storage fault occurs, IO issued by the virtual machine is suspended when it reaches the host side, and no IO error is returned to the virtual machine during the suspension time. The file system inside the virtual machine therefore does not become read-only because of the IO error but instead appears to hang, while the virtual machine back end retries the IO at the specified suspension interval. If the storage fault is resolved within the suspension time, the suspended IO is written to disk and the file system inside the virtual machine resumes operation automatically without a restart; if the storage fault is not resolved within the suspension time, an error is reported to the virtual machine and the user is notified.
According to the embodiments of the present disclosure, whether IO suspension occurs on each server is determined from that server's disk IO information, and whether the storage pool is abnormal is determined from the number of servers with IO suspension in combination with the characteristics of the storage pool. Storage pools that may have a storage fault can thus be discovered proactively, and storage pool faults are detected automatically. Moreover, because the state of the storage pool is inferred directly from whether IO suspension occurs on its servers, the efficiency of storage pool fault detection is improved, and efficient detection remains practical with a large number of servers.
Fig. 3 schematically illustrates a flowchart of acquiring disk IO information of each server corresponding to a storage pool in operation S201 illustrated in fig. 2 according to an embodiment of the present disclosure.
As shown in fig. 3, the obtaining of the disk IO information of each server corresponding to the storage pool in operation S201 may include, for example, operations S301 to S303.
In operation S301, a timing acquisition program is written.
In operation S302, a timing acquisition program is deployed on each server.
In operation S303, in the detection period, the timing acquisition program is called to read the disk IO information of each server in batches every preset period.
In the embodiments of the present disclosure, the timing acquisition program may be written in the C language, and the disk IO information of each disk on the server may be obtained from /proc/diskstats, where the disk IO information may include the number of read IO completions, the number of write IO completions, and the number of IO requests in the current IO queue.
Fig. 4 schematically shows an example diagram of /proc/diskstats according to an embodiment of the present disclosure.
As shown in fig. 4, column 4 (reads), column 8 (writes), and column 12 (index) respectively represent the number of read IO completions, the number of write IO completions, and the number of IO requests in the current IO queue. For example, the content in the box in fig. 4 indicates that the number of read IO completions is 19638, the number of write IO completions is 22289779, and the number of IO requests in the current IO queue is 0.
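The column layout described above can be illustrated with a short parsing sketch. It is not the C acquisition program of the disclosure; it is a minimal Python illustration that assumes whitespace-separated fields in the Linux /proc/diskstats order, and the names `DiskIO`, `parse_disk_lines`, and `SAMPLE` are hypothetical. Only the three boxed values (19638, 22289779, 0) come from the text; the remaining numbers in the sample line are made up for illustration.

```python
from typing import Dict, NamedTuple


class DiskIO(NamedTuple):
    reads: int     # column 4: read IOs completed
    writes: int    # column 8: write IOs completed
    in_queue: int  # column 12: IO requests currently in the IO queue


def parse_disk_lines(text: str) -> Dict[str, DiskIO]:
    """Map each device name (column 3) to the three counters used for detection."""
    result: Dict[str, DiskIO] = {}
    for line in text.splitlines():
        fields = line.split()
        if len(fields) < 12:
            continue  # skip blank or truncated lines
        # Columns are 1-based in the text, so column 4 is fields[3], and so on.
        result[fields[2]] = DiskIO(int(fields[3]), int(fields[7]), int(fields[11]))
    return result


# Illustrative line; only fields 4, 8 and 12 reflect the example in the text.
SAMPLE = "   8       0 sda 19638 1200 1543210 8000 22289779 3400 987654321 120000 0 50000 128000\n"
stats = parse_disk_lines(SAMPLE)
```

In a deployment, the text would be read from the file on each server at every collection interval rather than taken from a constant.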
It should be appreciated that the time interval for the timing acquisition may be set according to the actual situation, and the disclosure is not limited, and for example, disk IO information may be acquired once per minute by default.
According to the embodiment of the disclosure, the disk IO information for detecting the storage pool fault is automatically obtained at regular time by writing the regular acquisition program and calling the regular acquisition program, so that the storage pool fault can be automatically detected at regular time without manual detection of operation and maintenance personnel.
Fig. 5 schematically illustrates a flowchart for acquiring disk IO information of each server corresponding to a storage pool in operation S201 illustrated in fig. 2 according to another embodiment of the present disclosure.
As shown in fig. 5, the obtaining of the disk IO information of each server corresponding to the storage pool in operation S201 may include, for example, operations S501 to S502.
In operation S501, a timing acquisition program is called to read parameters of each server, wherein the parameters include IP and/or hostname.
In operation S502, a timing acquisition program is called to package parameters and disk IO information of each server, so as to obtain a first JSON string, where the first JSON string includes two layers of JSON strings, one layer of JSON string corresponds to the parameters, and the other layer of JSON string corresponds to the disk IO information.
In the embodiments of the present disclosure, because the IP or hostname of a server is unique, it can be obtained together with the disk IO information and used as a label to distinguish and store the disk IO information of different servers. This ensures that the disk IO information of different servers is not confused at any point from acquisition to later use, and provides the data basis for storage pool fault detection. In addition, to facilitate data acquisition, transmission, and subsequent analysis, the parameters and disk IO information of each server are packaged into a two-layer JSON string (JSON_1, the original two-layer JSON string), in which one layer corresponds to the parameters and the other layer corresponds to the disk IO information. Because the server parameters and the disk IO information sit in different layers, the layer representing the disk IO information can be quickly extracted from the two-layer JSON string in later processing.
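The packaging step can be sketched as follows. The disclosure does not give the concrete JSON layout, so the layout here is an assumption: the outer layer carries the server parameters (`ip`, `hostname`) and a nested `disk_io` object carries the counters. The function name `pack_first_json` and all key names are hypothetical.

```python
import json
from typing import Dict


def pack_first_json(ip: str, hostname: str, disk_io: Dict[str, object]) -> str:
    """Build a two-layer JSON string: outer layer = parameters, inner layer = disk IO."""
    payload = {
        "ip": ip,               # outer layer: server parameters
        "hostname": hostname,
        "disk_io": disk_io,     # inner layer: disk IO information
    }
    return json.dumps(payload, sort_keys=True)


# Illustrative values; the counters reuse the example from fig. 4.
json_1 = pack_first_json(
    "10.0.0.1",
    "vm-a",
    {"disk": "sda", "reads": 19638, "writes": 22289779, "in_queue": 0},
)
```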
Fig. 6 schematically illustrates a flowchart for acquiring disk IO information of each server corresponding to a storage pool in operation S201 illustrated in fig. 2 according to another embodiment of the present disclosure.
As shown in fig. 6, the obtaining of the disk IO information of each server corresponding to the storage pool in operation S201 may include operations S601 to S603, for example.
In operation S601, the first JSON string packaged by the timing acquisition program is received.
In operation S602, the first JSON string is parsed to obtain parameters and a second JSON string, where the second JSON string is a single-layer JSON string, and the second JSON string corresponds to disk IO information.
In operation S603, the second JSON string is sent to Kafka for caching, with the parameters used as the storage index.
In the embodiments of the present disclosure, after the server receives the first JSON string sent by the timing acquisition program, it extracts the required metrics to form a single-layer JSON string (JSON_2, the processed single-layer JSON string) and then pushes it to Kafka for caching. That is, for disk data of large volume collected from a large number of servers, Kafka serves as a buffer that relieves the write pressure on the database. Specifically, an engine table (source), a local table (local), and a materialized view (consumer) are created in the ClickHouse database. The engine table is created with the Kafka engine and points to the Kafka server, so that JSON_2 is consumed automatically and stored in source; the materialized view maps the engine table to the local table, so that all of the data is ultimately stored in the local table for downstream use.
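The parsing step of operations S601 to S603 can be sketched as below. This is a minimal illustration under the same assumed layout as before (outer parameters plus a nested `disk_io` object); `split_first_json` and the key names are hypothetical, and no Kafka client is called. The function only returns the pair that could be handed to a producer as message key and value.

```python
import json
from typing import Tuple


def split_first_json(json_1: str) -> Tuple[str, str]:
    """Return (storage index, JSON_2) from a two-layer JSON_1 string."""
    outer = json.loads(json_1)
    inner = outer.pop("disk_io")                     # inner layer: disk IO information
    key = outer.get("ip") or outer.get("hostname")   # server parameter used as index
    json_2 = json.dumps(inner, sort_keys=True)       # single-layer JSON string
    return key, json_2


# Illustrative JSON_1 in the assumed layout.
JSON_1 = (
    '{"ip": "10.0.0.1", "hostname": "vm-a", '
    '"disk_io": {"disk": "sda", "reads": 19638, "writes": 22289779, "in_queue": 0}}'
)
key, json_2 = split_first_json(JSON_1)
```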
According to the embodiments of the present disclosure, the disk IO information and the server parameters are packaged into a JSON string during acquisition. Obtaining the information by reading the file and packaging it as a JSON string is faster than obtaining it by executing a command, and it occupies fewer server resources, so the efficiency of storage pool fault detection can be further improved, which makes the method better suited to wide deployment and improves its universality. In addition, a two-layer JSON string is used at acquisition time, with one layer corresponding to the server parameters and the other to the disk IO information, which makes the large volume of collected disk IO information easy to parse; the two-layer JSON string is then parsed and the disk IO information is extracted into a single-layer JSON string to better serve storage pool fault detection.
Fig. 7 schematically illustrates a flowchart of determining a server where an IO suspension occurs according to disk IO information of each server in operation S202 illustrated in fig. 2 according to an embodiment of the present disclosure.
As shown in fig. 7, determining the server where IO suspension occurs according to the disk IO information of each server in operation S202 may include, for example, operations S701 to S702.
In operation S701, for each server, it is determined whether the number of read IO completion times and the number of write IO completion times are increased before and after the interval of the preset period of time, and whether the number of IO requests in the current IO queue is zero is determined.
In operation S702, in response to neither the read IO completion count nor the write IO completion count increasing and the number of IO requests in the current IO queue being non-zero, it is determined that the server is experiencing IO suspension.
In the embodiments of the present disclosure, a cloud platform server management system may also be configured, in which configuration information of each server, such as the storage pool and the application to which the server belongs, is recorded. A timing monitoring program written in Java traverses the servers recorded in the cloud platform server management system to perform IO suspension monitoring.
Specifically, for each server in the traversal, the disk IO readings at the start and end of the most recent interval t (where t is the preset inspection interval) are obtained from the local table, and the IO suspension judgment is performed. If, within the interval t, neither the read IO completion count nor the write IO completion count has increased while the number of IO requests in the current IO queue is not 0, then IO requests exist in the interval but cannot be completed, and the disk is considered to possibly be experiencing IO suspension.
The judgment may be, for example: if reads_now = reads_comp and writes_now = writes_comp and index_now > 0, the server is treated as experiencing IO suspension. Here, reads_comp is the read IO completion count at the previous moment, reads_now is the read IO completion count at the current moment, writes_comp is the write IO completion count at the previous moment, writes_now is the write IO completion count at the current moment, and index_now is the number of IO requests in the current IO queue.
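The judgment condition can be written as a small predicate. The sketch below is illustrative only: the disclosure describes a Java monitoring program, whereas this uses Python, and the names `Reading` and `is_io_suspended` are hypothetical. The field `in_queue` plays the role of the IO request count in the current IO queue.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Reading:
    reads: int     # read IO completion count
    writes: int    # write IO completion count
    in_queue: int  # IO requests in the current IO queue


def is_io_suspended(prev: Reading, now: Reading) -> bool:
    """True when no read or write completed over the interval but requests are still queued."""
    return (
        now.reads == prev.reads
        and now.writes == prev.writes
        and now.in_queue > 0
    )


# Counters are frozen and three requests are pending, so this is treated as IO suspension.
suspended = is_io_suspended(Reading(100, 200, 0), Reading(100, 200, 3))
# Writes advanced during the interval, so the disk is still making progress.
healthy = is_io_suspended(Reading(100, 200, 0), Reading(100, 250, 3))
```

An idle disk (frozen counters and an empty queue) is deliberately not flagged, which is why the queue condition is part of the predicate.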
According to the embodiments of the present disclosure, the read IO completion count, the write IO completion count, and the number of IO requests in the current IO queue accurately reflect the state of disk IO. Whether IO suspension occurs on a server can therefore be determined accurately from these three values, which improves the accuracy of storage pool fault detection.
Further, a storage pool fault may be of several types. For example, if IO suspension occurs on multiple virtual machines in one storage pool at the same time, the storage pool is regarded as abnormal; as another example, if IO suspension occurs frequently on one virtual machine over a period of time, operation and maintenance personnel are advised to check the state of that virtual machine. Accordingly, the embodiments of the present disclosure provide a variety of judgment strategies.
Fig. 8 schematically illustrates a flowchart of determining a storage pool failure detection result according to the number of servers where IO hanging occurs in operation S203 illustrated in fig. 2 according to an embodiment of the present disclosure.
As shown in fig. 8, determining the storage pool failure detection result according to the number of servers where the IO suspension occurs in operation S203 may include, for example, operations S801 to S802.
In operation S801, a first total number of servers on each storage pool in the current batch where IO suspension occurs is counted.
In operation S802, a storage pool storage anomaly is determined in response to the first total number being greater than a first storage pool IO suspension controllable threshold, wherein the first storage pool IO suspension controllable threshold is determined from a stable state of the storage pool.
For example, the total number of servers with IO suspension on each storage pool in the current monitoring batch, current_batch_hang_count, is counted; if current_batch_hang_count > μ (the first storage pool IO suspension controllable threshold, which can be evaluated and adjusted according to the stability of the storage pool), the storage pool is considered to have a storage fault.
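The per-batch strategy can be sketched as a count per storage pool followed by a threshold comparison. This is a minimal illustration; the function name `pools_with_storage_anomaly`, the server-to-pool mapping `pool_of`, and the sample data are assumptions, with `mu` standing for the threshold μ.

```python
from collections import Counter
from typing import Dict, Iterable, List


def pools_with_storage_anomaly(
    hung_servers: Iterable[str], pool_of: Dict[str, str], mu: int
) -> List[str]:
    """Return pools whose count of IO-suspended servers in the current batch exceeds mu."""
    current_batch_hang_count = Counter(pool_of[s] for s in hung_servers)
    return sorted(p for p, n in current_batch_hang_count.items() if n > mu)


# Hypothetical mapping, as might be recorded in the server management system.
POOL_OF = {"vm-a": "pool-1", "vm-b": "pool-1", "vm-c": "pool-1", "vm-d": "pool-2"}
abnormal = pools_with_storage_anomaly(["vm-a", "vm-b", "vm-c", "vm-d"], POOL_OF, mu=2)
```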
Fig. 9 schematically illustrates a flowchart of determining a storage pool failure detection result according to the number of servers where IO hanging occurs in operation S203 illustrated in fig. 2 according to another embodiment of the present disclosure.
As shown in fig. 9, determining the storage pool failure detection result according to the number of servers where the IO suspension occurs in operation S203 may include, for example, operations S901 to S902.
In operation S901, the ratio of the number of servers in which IO suspension occurs in the current batch to the total number of servers in which IO suspension occurs in all detection batches in the detection period is determined.
In operation S902, a disk performance anomaly of the server on the storage pool is determined in response to the ratio being greater than a server IO suspension controllable threshold, wherein the server IO suspension controllable threshold is determined from a steady state of the server.
For example, for a server on which IO suspension occurs in the current batch, the proportion hang_percentage of monitoring batches within the time period T in which the server was judged to be IO-suspended is computed; if hang_percentage > γ (the server IO suspension controllable threshold, which can be evaluated and adjusted according to the stability of the server), the disk performance of the server is judged to be abnormal.
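One possible reading of this example is that hang_percentage is computed per server as the share of monitoring batches in the period T in which that server was judged IO-suspended. The sketch below follows that reading, which is an assumption rather than a definition taken from the disclosure; the function names and the sample flags are hypothetical, and `gamma` stands for the threshold γ.

```python
from typing import Sequence


def hang_percentage(batch_flags: Sequence[bool]) -> float:
    """Share of monitoring batches in period T in which the server was IO-suspended."""
    if not batch_flags:
        return 0.0  # no batches observed yet
    return sum(batch_flags) / len(batch_flags)


def disk_performance_abnormal(batch_flags: Sequence[bool], gamma: float) -> bool:
    """Flag the server when its suspension share exceeds the threshold gamma."""
    return hang_percentage(batch_flags) > gamma


# One flag per monitoring batch within T; the last entry is the current batch.
flags = [True, False, True, True]
ratio = hang_percentage(flags)
abnormal = disk_performance_abnormal(flags, gamma=0.5)
```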
Fig. 10 schematically illustrates a flowchart of determining a storage pool failure detection result according to the number of servers where IO hanging occurs in operation S203 illustrated in fig. 2 according to another embodiment of the present disclosure.
As shown in fig. 10, determining the storage pool failure detection result according to the number of servers where the IO suspension occurs in operation S203 may include, for example, operations S1001 to S1002.
In operation S1001, for the storage pool to which a server with IO suspension in the current batch belongs, a second total number of servers on that storage pool in which IO suspension occurs during the detection period is counted.
In operation S1002, a storage pool storage anomaly is determined in response to the second total number being greater than a second storage pool IO suspension controllable threshold, wherein the second storage pool IO suspension controllable threshold is determined from a steady state of the storage pool.
For example, for a storage pool in which IO suspension occurs in the current batch, the total number sum_batch_hang_count of servers on that storage pool in which IO suspension occurs within the time period T is counted; if sum_batch_hang_count > α (the second storage pool IO suspension controllable threshold, which can be evaluated and adjusted according to the stability of the storage pool), the storage pool is treated as abnormal.
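The period-wide strategy can be sketched as follows. The disclosure does not say whether a server suspended in several batches is counted once or several times; this sketch assumes distinct servers are counted. The function name, the mapping `pool_of`, and the sample batches are hypothetical, and `alpha` stands for the threshold α.

```python
from typing import Dict, List, Set


def sum_batch_hang_count(
    batches: List[Set[str]], pool_of: Dict[str, str], pool: str
) -> int:
    """Count distinct servers of `pool` that were IO-suspended in any batch within T."""
    hung: Set[str] = set()
    for batch in batches:
        hung.update(s for s in batch if pool_of[s] == pool)
    return len(hung)


# Hypothetical data: each set holds the servers judged IO-suspended in one batch.
POOL_OF = {"vm-a": "pool-1", "vm-b": "pool-1", "vm-c": "pool-1", "vm-d": "pool-2"}
BATCHES = [{"vm-a"}, {"vm-a", "vm-b"}, {"vm-c", "vm-d"}]
total = sum_batch_hang_count(BATCHES, POOL_OF, "pool-1")
alpha = 2
pool_abnormal = total > alpha
```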
According to the embodiments of the present disclosure, different strategies are used to determine whether a storage pool fault has occurred, depending on the kind of abnormality. Because the different strategies cover different types of storage pool faults, combining them improves the accuracy of storage pool fault detection.
FIG. 11 schematically illustrates a flow chart of a storage pool failure detection method according to another embodiment of the present disclosure.
As shown in fig. 11, the storage pool failure detection method may further include, for example, operation S1101.
In operation S1101, in case of determining a failure of the storage pool, early warning information is sent, where the method for sending early warning information includes at least one of mail early warning, short message early warning, monitoring system early warning, and operation and maintenance web page display early warning.
In the embodiments of the present disclosure, the mail early warning function integrates with the Notes mailbox system: early warning mails notify the relevant operation and maintenance personnel and application owners of the storage pool or server affected by a storage pool fault, so that they can investigate and handle the abnormality.
In the embodiments of the present disclosure, the monitoring system early warning function integrates with the monitoring system: taking the attention level of each application into account, alarm information for high-attention applications is sent to the alarm event list, and on-duty operation and maintenance staff are notified to handle the related data disk utilization alarms.
In the embodiments of the present disclosure, storage pool faults and IO suspension warnings are displayed on the web page of the self-developed operation and maintenance platform, which gives a comprehensive, detailed view covering the early warning records and handling records of storage pool faults and server disk IO abnormalities, the server information related to each storage pool fault, the disk IO trend of abnormal servers, and so on.
Further, the early warning information may include the early warning type, the early warning level, the name of the faulty storage pool, the parameters and owning application of each server with IO suspension on the faulty storage pool, and the disk IO change information of each server with IO suspension. Preferably, the early warning information is arranged progressively in the following order: early warning type, early warning level, name of the faulty storage pool, parameters and owning application of the server with IO suspension on the faulty storage pool, and disk IO change information of the server with IO suspension.
For example, the early warning types include storage pool abnormality, server disk performance abnormality, and the like. The early warning level of a storage pool abnormality is a first level and that of a server disk performance abnormality is a second level; the first level may be higher than the second level, which the present disclosure does not specifically limit and which may be set according to application requirements. When the storage pool fault detection result is a storage pool abnormality, the early warning information may be: storage pool abnormality → first level → storage pool name → parameters and owning application of server A with IO suspension → disk IO change information of server A with IO suspension. When the storage pool fault detection result is a server disk performance abnormality, the early warning information may be: server disk performance abnormality → second level → storage pool name → parameters and owning application of server A with IO suspension → disk IO change information of server A with IO suspension.
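The progressive layout described above can be assembled by joining the fields in order with an arrow separator. The sketch below is illustrative; `format_warning`, its parameter names, and the sample values are hypothetical, and the exact wording of each field is not specified by the disclosure.

```python
def format_warning(
    warning_type: str,
    level: str,
    pool_name: str,
    server_params: str,
    application: str,
    io_change: str,
) -> str:
    """Join the warning fields in the progressive order given in the text."""
    fields = [
        warning_type,
        level,
        pool_name,
        f"{server_params} ({application})",  # server parameters and owning application
        io_change,
    ]
    return " → ".join(fields)


# Illustrative values only.
message = format_warning(
    "storage pool abnormality",
    "first level",
    "pool-1",
    "10.0.0.1/vm-a",
    "transfer-service",
    "reads 19638 -> 19638, writes 22289779 -> 22289779, queue 3",
)
```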
According to the embodiments of the present disclosure, the early warning information is sent to the user in several forms. On the one hand, operation and maintenance personnel are notified in time to investigate and handle the storage pool problem; on the other hand, related information is presented comprehensively, covering the early warning records and handling records of storage pool faults and server disk IO abnormalities, the server information related to each storage pool fault, the disk IO trend of abnormal servers, and so on, which helps the user analyze the root cause of the storage pool fault. In addition, a well-defined format for the early warning information lets the related information be presented to the user more intuitively.
It should be noted that the storage pool fault detection method provided by the embodiments of the present disclosure may be developed on J2EE, and the related software used may be Jetty, Quartz, MySQL, and the like. The storage pool fault detection method can be linked with the cloud platform server management system, the monitoring system, the mailbox, and other related tool platforms, provides operation and maintenance personnel with friendly, multi-angle result views, and has good universality and suitability for wide deployment.
In summary, the storage pool fault detection method provided by the embodiments of the present disclosure achieves full coverage of IO suspension monitoring for open platform servers, and meets the efficiency and accuracy requirements of disk IO anomaly monitoring for open platform servers as the data center scales up. The method detects storage pool faults on the basis of IO suspension, proactively discovers abnormal storage pools by taking the scale and characteristics of the storage pool into account, and notifies the responsible professionals in time, thereby effectively solving the problem of faults being discovered passively and too late.
Based on the storage pool fault detection method shown in fig. 2 to 11, the embodiment of the present disclosure further provides a storage pool fault detection device, and the storage pool fault detection device of the embodiment of the present disclosure will be described below by way of fig. 12 to 13 based on the scenario described in fig. 1.
Fig. 12 schematically illustrates a block diagram of a storage pool failure detection apparatus according to an embodiment of the present disclosure.
As shown in fig. 12, the storage pool failure detection apparatus 1200 may include an acquisition module 1210, a first determination module 1220, and a second determination module 1230.
The obtaining module 1210 is configured to obtain disk IO information of each server corresponding to the storage pool. The acquisition module 1210 may be configured to perform the operation S201 described above, which is not described herein.
The first determining module 1220 is configured to determine, according to disk IO information of each server, a server where IO suspension occurs. The first determining module 1220 may be configured to perform the operation S202 described above, which is not described herein.
A second determining module 1230 is configured to determine a storage pool failure detection result according to the number of servers that generate the IO suspension, where the storage pool failure detection result includes a storage pool storage abnormality or a disk performance abnormality of a server on the storage pool. The second determining module 1230 may be configured to perform the operation S203 described above, which is not described herein.
Fig. 13 schematically illustrates a block diagram of a storage pool failure detection apparatus according to another embodiment of the present disclosure.
As shown in fig. 13, the storage pool failure detection apparatus 1200 may further include an early warning module 1240, for example.
The early warning module 1240 is configured to send early warning information when it is determined that a storage pool fault exists, where the manner of sending the early warning information includes at least one of mail early warning, short message early warning, monitoring system early warning, and operation and maintenance web page display early warning. The early warning module 1240 may be used to perform the operation S1101 described above, which is not described herein again.
Any number of modules, sub-modules, units, sub-units, or at least some of the functionality of any number of the sub-units according to embodiments of the present disclosure may be implemented in one module. Any one or more of the modules, sub-modules, units, sub-units according to embodiments of the present disclosure may be implemented as split into multiple modules. Any one or more of the modules, sub-modules, units, sub-units according to embodiments of the present disclosure may be implemented at least in part as a hardware circuit, such as a Field Programmable Gate Array (FPGA), a Programmable Logic Array (PLA), a system-on-chip, a system-on-substrate, a system-on-package, an Application Specific Integrated Circuit (ASIC), or in any other reasonable manner of hardware or firmware that integrates or encapsulates the circuit, or in any one of or a suitable combination of three of software, hardware, and firmware. Alternatively, one or more of the modules, sub-modules, units, sub-units according to embodiments of the present disclosure may be at least partially implemented as computer program modules, which when executed, may perform the corresponding functions.
For example, any of the acquisition module 1210, the first determination module 1220, the second determination module 1230, and the pre-warning module 1240 may be combined in one module/unit/sub-unit or any of them may be split into a plurality of modules/units/sub-units. Alternatively, at least some of the functionality of one or more of these modules/units/sub-units may be combined with at least some of the functionality of other modules/units/sub-units and implemented in one module/unit/sub-unit. According to embodiments of the present disclosure, at least one of the acquisition module 1210, the first determination module 1220 and the second determination module 1230, and the pre-warning module 1240 may be implemented at least in part as hardware circuitry, such as a Field Programmable Gate Array (FPGA), a Programmable Logic Array (PLA), a system-on-chip, a system-on-substrate, a system-on-package, an application-specific integrated circuit (ASIC), or may be implemented in hardware or firmware in any other reasonable manner of integrating or packaging the circuitry, or in any one of or a suitable combination of any of the three implementations of software, hardware, and firmware. Alternatively, at least one of the acquisition module 1210, the first determination module 1220 and the second determination module 1230, and the pre-warning module 1240 may be at least partially implemented as computer program modules, which when executed, may perform the corresponding functions.
It should be noted that, the storage pool fault detection device portion in the embodiment of the present disclosure corresponds to the storage pool fault detection method portion in the embodiment of the present disclosure, and specific implementation details and technical effects thereof are the same, which are not described herein again.
Fig. 14 schematically illustrates a block diagram of an electronic device adapted to implement the above-described method according to an embodiment of the present disclosure. The electronic device shown in fig. 14 is merely an example, and should not impose any limitations on the functionality and scope of use of embodiments of the present disclosure.
As shown in fig. 14, an electronic device 1400 according to an embodiment of the present disclosure includes a processor 1401 that can perform various appropriate actions and processes according to a program stored in a Read Only Memory (ROM) 1402 or a program loaded from a storage section 1408 into a Random Access Memory (RAM) 1403. The processor 1401 may include, for example, a general purpose microprocessor (e.g., a CPU), an instruction set processor and/or an associated chipset and/or a special purpose microprocessor (e.g., an Application Specific Integrated Circuit (ASIC)), or the like. The processor 1401 may also include on-board memory for caching purposes. The processor 1401 may include a single processing unit or a plurality of processing units for performing different actions of the method flows according to embodiments of the present disclosure.
In the RAM 1403, various programs and data necessary for the operation of the electronic device 1400 are stored. The processor 1401, ROM 1402, and RAM 1403 are connected to each other through a bus 1404. The processor 1401 performs various operations of the method flow according to the embodiment of the present disclosure by executing programs in the ROM 1402 and/or the RAM 1403. Note that the program may be stored in one or more memories other than the ROM 1402 and the RAM 1403. The processor 1401 may also perform various operations of the method flow according to embodiments of the present disclosure by executing programs stored in the one or more memories.
According to an embodiment of the disclosure, the electronic device 1400 may also include an input/output (I/O) interface 1405, which is also connected to the bus 1404. The electronic device 1400 may also include one or more of the following components connected to the I/O interface 1405: an input section 1406 including a keyboard, a mouse, and the like; an output section 1407 including a display such as a Cathode Ray Tube (CRT) or a Liquid Crystal Display (LCD), a speaker, and the like; a storage section 1408 including a hard disk or the like; and a communication section 1409 including a network interface card such as a LAN card, a modem, and the like. The communication section 1409 performs communication processing via a network such as the internet. A drive 1410 is also connected to the I/O interface 1405 as needed. A removable medium 1411, such as a magnetic disk, an optical disk, a magneto-optical disk, or a semiconductor memory, is mounted on the drive 1410 as needed so that a computer program read from it can be installed into the storage section 1408 as needed.
According to embodiments of the present disclosure, the method flow according to embodiments of the present disclosure may be implemented as a computer software program. For example, embodiments of the present disclosure include a computer program product comprising a computer program embodied on a computer-readable storage medium, the computer program comprising program code for performing the method shown in the flowcharts. In such an embodiment, the computer program can be downloaded and installed from a network via the communication section 1409 and/or installed from the removable medium 1411. When the computer program is executed by the processor 1401, the above-described functions defined in the system of the embodiments of the present disclosure are performed. The systems, devices, apparatuses, modules, units, etc. described above may be implemented by computer program modules according to embodiments of the disclosure.
The present disclosure also provides a computer-readable storage medium that may be embodied in the apparatus/device/system described in the above embodiments; or may exist alone without being assembled into the apparatus/device/system. The computer-readable storage medium carries one or more programs which, when executed, implement methods in accordance with embodiments of the present disclosure.
According to embodiments of the present disclosure, the computer-readable storage medium may be a non-volatile computer-readable storage medium. Examples may include, but are not limited to: a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing. In the context of this disclosure, a computer-readable storage medium may be any tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device.
For example, according to embodiments of the present disclosure, the computer-readable storage medium may include the ROM 1402 and/or the RAM 1403 described above and/or one or more memories other than the ROM 1402 and the RAM 1403.
The flowcharts and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of systems, methods and computer program products according to various embodiments of the present disclosure. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of code, which comprises one or more executable instructions for implementing the specified logical function(s). It should also be noted that, in some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams or flowchart illustration, and combinations of blocks in the block diagrams or flowchart illustration, can be implemented by special purpose hardware-based systems which perform the specified functions or acts, or by combinations of special purpose hardware and computer instructions. Those skilled in the art will appreciate that the features recited in the various embodiments of the disclosure and/or in the claims may be combined and/or integrated in various ways, even if such combinations or integrations are not explicitly recited in the disclosure. In particular, the features recited in the various embodiments of the present disclosure and/or the claims may be combined and/or integrated in various ways without departing from the spirit and teachings of the present disclosure. All such combinations and/or integrations fall within the scope of the present disclosure.

Claims (15)

1. A storage pool fault detection method, applied to a server, comprising:
obtaining disk IO information of each server corresponding to the storage pool;
determining a server with IO suspension according to the disk IO information of each server; and
determining a storage pool fault detection result according to the number of servers with IO suspension, wherein the storage pool fault detection result comprises a storage pool storage abnormality or a disk performance abnormality of the servers on the storage pool.
2. The storage pool fault detection method according to claim 1, wherein the obtaining disk IO information of each server corresponding to the storage pool comprises:
writing a timing acquisition program;
deploying the timing acquisition program on each server; and
within the detection time period, calling the timing acquisition program once every preset time period to read the disk IO information of each server in batches.
3. The storage pool fault detection method according to claim 2, wherein the obtaining disk IO information of each server corresponding to the storage pool further comprises:
calling the timing acquisition program to read parameters of each server, wherein the parameters comprise an IP and/or a host name; and
calling the timing acquisition program to package the parameters and the disk IO information of each server respectively to obtain a first JSON string, wherein the first JSON string comprises two layers of JSON strings, one layer corresponding to the parameters and the other layer corresponding to the disk IO information.
4. The storage pool fault detection method according to claim 3, wherein the obtaining disk IO information of each server corresponding to the storage pool further comprises:
receiving the first JSON string packaged by the timing acquisition program;
parsing the first JSON string to obtain the parameters and a second JSON string respectively, wherein the second JSON string is a single-layer JSON string corresponding to the disk IO information; and
sending the second JSON string to kafka for caching, with the parameters as a storage index.
5. The storage pool fault detection method according to claim 2, wherein the disk IO information comprises a read IO completion count, a write IO completion count, and a number of IO requests in a current IO queue;
the determining the server with IO suspension according to the disk IO information of each server comprises:
for each server, determining whether the read IO completion count and the write IO completion count have increased over a preset time period, and determining whether the number of IO requests in the current IO queue is zero; and
determining that IO suspension has occurred on the server in response to neither the read IO completion count nor the write IO completion count having increased and the number of IO requests in the current IO queue being non-zero.
6. The storage pool fault detection method according to claim 2, wherein the determining the storage pool fault detection result according to the number of servers with IO suspension comprises:
counting, for each storage pool, a first total number of servers with IO suspension on that storage pool in the current batch; and
determining that storage of the storage pool is abnormal in response to the first total number being greater than a first storage pool IO suspension controllable threshold, wherein the first storage pool IO suspension controllable threshold is determined according to a stable state of the storage pool.
7. The storage pool fault detection method according to claim 2, wherein the determining the storage pool fault detection result according to the number of servers with IO suspension further comprises:
calculating the ratio of the number of servers with IO suspension in the current batch to the total number of servers determined to have IO suspension across the detection batches within the detection time period; and
determining that the disk performance of the servers on the storage pool is abnormal in response to the ratio being greater than a server IO suspension controllable threshold, wherein the server IO suspension controllable threshold is determined according to a stable state of the servers.
8. The storage pool fault detection method according to claim 2, wherein the determining the storage pool fault detection result according to the number of servers with IO suspension further comprises:
for each storage pool to which a server with IO suspension in the current batch belongs, counting a second total number of servers with IO suspension on that storage pool within the detection time period; and
determining that storage of the storage pool is abnormal in response to the second total number being greater than a second storage pool IO suspension controllable threshold, wherein the second storage pool IO suspension controllable threshold is determined according to the stable state of the storage pool.
9. The storage pool fault detection method according to claim 1, further comprising:
sending early warning information when a storage pool fault is determined, wherein the manner of sending the early warning information comprises at least one of an email early warning, a short message early warning, a monitoring system early warning, and an operation and maintenance webpage display early warning.
10. The storage pool fault detection method according to claim 9, wherein the early warning information comprises an early warning type, an early warning level, the name of the faulty storage pool, the parameters and application of each server with IO suspension on the faulty storage pool, and disk IO change information of each server with IO suspension.
11. The storage pool fault detection method according to claim 10, wherein the early warning information is arranged in progressive order: early warning type, early warning level, name of the faulty storage pool, parameters of each server with IO suspension on the faulty storage pool, and disk IO change information of each server with IO suspension.
12. A storage pool fault detection apparatus, comprising:
an acquisition module configured to obtain disk IO information of each server corresponding to the storage pool;
a first determining module configured to determine a server with IO suspension according to the disk IO information of each server; and
a second determining module configured to determine a storage pool fault detection result according to the number of servers with IO suspension, wherein the storage pool fault detection result comprises a storage pool storage abnormality or a disk performance abnormality of the servers on the storage pool.
13. An electronic device, comprising:
one or more processors;
storage means for storing one or more programs,
wherein the one or more programs, when executed by the one or more processors, cause the one or more processors to perform the method of any of claims 1-11.
14. A computer readable storage medium having stored thereon executable instructions which, when executed by a processor, cause the processor to perform the method according to any of claims 1 to 11.
15. A computer program product comprising a computer program which, when executed by a processor, implements the method according to any one of claims 1 to 11.
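
The suspension test of claim 5 and the threshold comparisons of claims 6 and 7 can be illustrated with a minimal Python sketch. This is only one possible reading of the claim language, not the applicant's implementation. The names `DiskIOSample`, `is_io_suspended`, `suspended_servers`, `pools_with_storage_anomaly` and `server_disk_anomaly` are hypothetical, as are the sample values and thresholds. The three counters resemble the reads-completed, writes-completed and in-progress fields that Linux exposes in /proc/diskstats, but the claims do not name a data source.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class DiskIOSample:
    """One reading of a server's disk IO counters (hypothetical structure)."""
    reads_completed: int   # cumulative read IO completion count
    writes_completed: int  # cumulative write IO completion count
    in_flight: int         # IO requests currently in the IO queue


def is_io_suspended(prev: DiskIOSample, curr: DiskIOSample) -> bool:
    """Claim 5: neither completion count increased, yet the queue is not empty."""
    no_progress = (curr.reads_completed <= prev.reads_completed
                   and curr.writes_completed <= prev.writes_completed)
    return no_progress and curr.in_flight != 0


def suspended_servers(prev_batch: dict, curr_batch: dict) -> set:
    """Compare two consecutive batches keyed by host name or IP."""
    return {host for host, curr in curr_batch.items()
            if host in prev_batch and is_io_suspended(prev_batch[host], curr)}


def pools_with_storage_anomaly(suspended: set, pool_of: dict, pool_threshold: int) -> set:
    """Claim 6: flag pools whose suspended-server count exceeds the threshold."""
    counts = defaultdict(int)
    for host in suspended:
        counts[pool_of[host]] += 1
    return {pool for pool, n in counts.items() if n > pool_threshold}


def server_disk_anomaly(current_count: int, period_total: int, ratio_threshold: float) -> bool:
    """Claim 7 (assumed reading): current-batch count over the period total."""
    if period_total == 0:
        return False  # nothing was suspended during the detection period
    return current_count / period_total > ratio_threshold


# Illustrative data only: server "a" makes progress, "b" and "c" are stuck.
prev = {"a": DiskIOSample(100, 50, 0), "b": DiskIOSample(10, 10, 2), "c": DiskIOSample(5, 5, 1)}
curr = {"a": DiskIOSample(120, 50, 0), "b": DiskIOSample(10, 10, 3), "c": DiskIOSample(5, 5, 4)}
pool_of = {"a": "pool1", "b": "pool1", "c": "pool1"}

stuck = suspended_servers(prev, curr)
flagged = pools_with_storage_anomaly(stuck, pool_of, pool_threshold=1)
```

In this sketch the pool is flagged only when the count strictly exceeds the threshold, matching the "greater than" wording of claims 6 and 7. The threshold values themselves are placeholders, since the claims state only that they are determined from the stable state of the storage pool or servers.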
CN202310960539.0A 2023-08-01 2023-08-01 Storage pool fault detection method, device, equipment, medium and product Pending CN117076236A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310960539.0A CN117076236A (en) 2023-08-01 2023-08-01 Storage pool fault detection method, device, equipment, medium and product

Publications (1)

Publication Number Publication Date
CN117076236A true CN117076236A (en) 2023-11-17

Family

ID=88718574

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310960539.0A Pending CN117076236A (en) 2023-08-01 2023-08-01 Storage pool fault detection method, device, equipment, medium and product

Country Status (1)

Country Link
CN (1) CN117076236A (en)

Similar Documents

Publication Publication Date Title
JP6849672B2 (en) Systems and methods for application security and risk assessment and testing
US20190378073A1 (en) Business-Aware Intelligent Incident and Change Management
CN106101130B (en) A kind of network malicious data detection method, apparatus and system
CN110309130A (en) A kind of method and device for host performance monitor
US11093349B2 (en) System and method for reactive log spooling
CN105404581B (en) A kind of evaluating method and device of database
US20130007538A1 (en) Systems and methods for fast detection and diagnosis of system outages
US9348685B2 (en) Intermediate database management layer
US8959051B2 (en) Offloading collection of application monitoring data
CN109992454A (en) The method, apparatus and storage medium of fault location
US10657027B2 (en) Aggregating data for debugging software
CN111316272A (en) Advanced cyber-security threat mitigation using behavioral and deep analytics
CN114338684B (en) Energy management system and method
CN109558272A (en) The fault recovery method and device of server
CN110275992A (en) Emergency processing method, device, server and computer readable storage medium
US20180287914A1 (en) System and method for management of services in a cloud environment
CN116450461A (en) Method, device, equipment and medium for processing hard disk faults of storage cluster
CN117076236A (en) Storage pool fault detection method, device, equipment, medium and product
US11775654B2 (en) Anomaly detection with impact assessment
CN111901172B (en) Application service monitoring method and system based on cloud computing environment
CN113132431B (en) Service monitoring method, service monitoring device, electronic device, and medium
US10740030B2 (en) Stopping a plurality of central processing units for data collection based on attributes of tasks
CN113794719B (en) Network abnormal traffic analysis method and device based on elastic search technology and electronic equipment
CN117997715A (en) Alarm processing method, device, equipment, storage medium and program product
CN116136818A (en) Health inspection method, device, equipment and medium for message queue

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination