CN116450461A - Method, device, equipment and medium for processing hard disk faults of storage cluster - Google Patents

Info

Publication number
CN116450461A
Authority
CN
China
Prior art keywords
hard disk
abnormal
flow
data
hard disks
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202310434607.XA
Other languages
Chinese (zh)
Inventor
柳跃
毛玉华
张晓燕
樊潇
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Industrial and Commercial Bank of China Ltd ICBC
Original Assignee
Industrial and Commercial Bank of China Ltd ICBC
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Industrial and Commercial Bank of China Ltd ICBC filed Critical Industrial and Commercial Bank of China Ltd ICBC
Priority to CN202310434607.XA priority Critical patent/CN116450461A/en
Publication of CN116450461A publication Critical patent/CN116450461A/en
Pending legal-status Critical Current

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00 Error detection; Error correction; Monitoring
    • G06F11/30 Monitoring
    • G06F11/3003 Monitoring arrangements specially adapted to the computing system or computing system component being monitored
    • G06F11/3006 Monitoring arrangements specially adapted to the computing system or computing system component being monitored where the computing system is distributed, e.g. networked systems, clusters, multiprocessor systems
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00 Error detection; Error correction; Monitoring
    • G06F11/30 Monitoring
    • G06F11/3003 Monitoring arrangements specially adapted to the computing system or computing system component being monitored
    • G06F11/3034 Monitoring arrangements specially adapted to the computing system or computing system component being monitored where the computing system component is a storage system, e.g. DASD based or network based
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00 Error detection; Error correction; Monitoring
    • G06F11/30 Monitoring
    • G06F11/3055 Monitoring arrangements for monitoring the status of the computing system or of the computing system component, e.g. monitoring if the computing system is on, off, available, not available
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00 Error detection; Error correction; Monitoring
    • G06F11/30 Monitoring
    • G06F11/32 Monitoring with visual or acoustical indication of the functioning of the machine
    • G06F11/324 Display of status information
    • G06F11/327 Alarm or error message display
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00 Energy efficient computing, e.g. low power processors, power management or thermal management

Abstract

The disclosure provides a method for processing hard disk failures of a storage cluster, which can be applied to the field of big data and the technical field of finance. The method comprises: acquiring I/O response data and traffic data of a plurality of hard disks; determining, according to the I/O response data and/or the traffic data, that at least one of the plurality of hard disks is an abnormal hard disk; and performing a traffic test on the abnormal hard disk, and replacing the abnormal hard disk if the result of the traffic test is abnormal. The disclosure also provides an apparatus, a device, a storage medium and a program product for processing hard disk failures of a storage cluster.

Description

Method, device, equipment and medium for processing hard disk faults of storage cluster
Technical Field
The present disclosure relates to the field of big data and the field of finance, and in particular, to a method, an apparatus, a device, a medium, and a program product for processing hard disk failures of a storage cluster.
Background
With the development of data center services, centralized data storage in conventional data centers faces many new challenges. As architecture transformation deepens, distributed block storage has become a standard back end for infrastructure cloud storage and is widely used to supply storage resources to virtual machines. At present, distributed block storage systems that use three-copy data redundancy are widely deployed as the OpenStack back end; the aim is to safeguard stored data by trading space for reliability.
Because the data redundancy mode is three copies, the usable capacity is only one third of the total capacity of the distributed storage system. As data volume grows explosively after services go online, storing the same amount of data under three-copy redundancy requires three times the physical hardware of a single-copy mode. With so much physical hardware, OSD hard disk failures occur frequently, and the failure rate rises further the longer the hardware stays in service. To ensure service stability, the three copies of data must be restored quickly after a hard disk fails so that no data is lost, which makes fast handling of hard disk failures a pressing problem.
To address this problem, the common industry practice is to check the OSD log and the message log for errors related to "Currently unreadable sectors". The related alarms are uploaded for centralized monitoring, and operation and maintenance personnel then intervene manually. The drawback of this approach is that the emergency response takes too long, so the storage pool stays in a copy-degraded state for a long time, which affects the stability of the storage cluster.
Disclosure of Invention
In view of the foregoing, and to at least partially solve the above technical problems, the present disclosure provides a method, apparatus, device, medium and program product for processing hard disk failures of a storage cluster that improve the stability of the storage cluster.
According to a first aspect of the present disclosure, there is provided a method for processing hard disk failures of a storage cluster, including: acquiring I/O response data and traffic data of a plurality of hard disks; determining, according to the I/O response data and/or the traffic data, that at least one of the plurality of hard disks is an abnormal hard disk; and performing a traffic test on the abnormal hard disk, and replacing the abnormal hard disk if the result of the traffic test is abnormal.
According to an embodiment of the present disclosure, acquiring I/O response data and traffic data of a plurality of hard disks includes: acquiring iostat data of the plurality of hard disks; and extracting the await column from the iostat data to obtain the I/O response data.
According to an embodiment of the present disclosure, acquiring I/O response data and traffic data of a plurality of hard disks includes: in a first period, monitoring the read-write traffic of the plurality of hard disks; and in a second period, issuing probe traffic to test the plurality of hard disks; wherein the service traffic in the first period is greater than the service traffic in the second period.
According to an embodiment of the present disclosure, determining that at least one of the plurality of hard disks is an abnormal hard disk according to the I/O response data and/or the traffic data includes: determining that at least one of the plurality of hard disks is an abnormal hard disk if the I/O response data is greater than a first threshold; and/or determining that at least one of the plurality of hard disks is an abnormal hard disk if the difference ratio between the read-write traffic and the average traffic within a third period is greater than a second threshold; and/or determining that at least one of the plurality of hard disks is an abnormal hard disk if the difference ratio between the probe traffic and the average traffic within a fourth period is greater than a third threshold.
According to an embodiment of the present disclosure, performing a traffic test on the abnormal hard disk and replacing the abnormal hard disk if the result of the traffic test is abnormal includes: suspending the service traffic of the abnormal hard disk; performing the traffic test on the abnormal hard disk without triggering reconstruction; performing three-copy reconstruction on the abnormal hard disk if the result of the traffic test is abnormal; replacing the reconstructed abnormal hard disk; and performing three-copy reconstruction on the replacement hard disk.
According to an embodiment of the present disclosure, performing three-copy reconstruction on the replacement hard disk includes: determining the service traffic of the plurality of hard disks; determining the reconstruction traffic of the replacement hard disk according to the service traffic; and performing three-copy reconstruction on the replacement hard disk according to the reconstruction traffic.
According to an embodiment of the present disclosure, the method further includes: restoring the service traffic of the abnormal hard disk if the result of the traffic test is normal.
According to an embodiment of the present disclosure, after three-copy reconstruction is performed on the abnormal hard disk because the result of the traffic test is abnormal, the method further includes: determining alarm information of the abnormal hard disk; centrally monitoring the alarm information; and, after three-copy reconstruction of the replacement hard disk, cancelling the alarm on the abnormal hard disk.
According to an embodiment of the present disclosure, replacing the abnormal hard disk whose traffic test result is abnormal includes: replacing that abnormal hard disk in the second period.
A second aspect of the present disclosure provides an apparatus for processing hard disk failures of a storage cluster, including: an acquisition module configured to acquire I/O response data and traffic data of a plurality of hard disks; a judging module configured to determine, according to the I/O response data and/or the traffic data, that at least one of the plurality of hard disks is an abnormal hard disk; and a replacement module configured to perform a traffic test on the abnormal hard disk and to replace the abnormal hard disk if the result of the traffic test is abnormal.
A third aspect of the present disclosure provides an electronic device, comprising: one or more processors; and a memory for storing one or more programs, wherein the one or more programs, when executed by the one or more processors, cause the one or more processors to perform the method of any of the embodiments described above.
A fourth aspect of the present disclosure also provides a computer readable storage medium having stored thereon executable instructions which, when executed by a processor, cause the processor to perform the method of any of the embodiments described above.
A fifth aspect of the present disclosure also provides a computer program product comprising a computer program which, when executed by a processor, implements the method of any of the embodiments described above.
Compared with the prior art, the method, the device, the electronic equipment, the storage medium and the program product for processing the hard disk faults of the storage cluster have at least the following beneficial effects:
(1) The method of the present disclosure determines whether a hard disk is abnormal by combining its I/O response time with its traffic. This comprehensive judgment reduces missed detection of abnormal hard disks, and the secondary traffic test performed on each abnormal hard disk also reduces false detection. In addition, the system automatically detects and isolates faulty OSD hard disks, which shortens the time the storage pool spends degraded. The emergency response time is shortened accordingly, and the service impact of long-term degradation of a distributed storage pool is avoided.
(2) The method of the present disclosure monitors the read-write traffic of the hard disks during both the service peak period and the service low-peak period. Through continuous fault detection, and provided the redundancy of the storage pool is sufficient, the stability of the storage cluster can be effectively ensured even if the failure rate of OSD hard disks increases.
(3) The method of the present disclosure replaces hard disks during the service low-peak period, which avoids the slow I/O response of the storage cluster that replacing a hard disk during the peak period would cause.
Drawings
The foregoing and other objects, features and advantages of the disclosure will be more apparent from the following description of embodiments of the disclosure with reference to the accompanying drawings, in which:
FIG. 1 schematically illustrates an application scenario diagram of a method, apparatus, device, medium and program product for hard disk failure handling of a storage cluster according to an embodiment of the present disclosure;
FIG. 2 schematically illustrates a flow chart of a method of hard disk failure handling of a storage cluster according to an embodiment of the disclosure;
FIG. 3 schematically illustrates a flow chart of a method of acquiring I/O response data for a plurality of hard disks, in accordance with an embodiment of the present disclosure;
FIG. 4 schematically illustrates a flow chart of a method of acquiring traffic data for a plurality of hard disks in accordance with an embodiment of the present disclosure;
FIG. 5 schematically illustrates a flow chart of a method of determining whether a hard disk is abnormal in accordance with an embodiment of the present disclosure;
FIG. 6 schematically illustrates a flow chart of a method of handling an abnormal hard disk in accordance with an embodiment of the present disclosure;
FIG. 7 schematically illustrates a method flow diagram for three-copy reconstruction of a hard disk in accordance with an embodiment of the present disclosure;
FIG. 8 schematically illustrates a flow chart of a method of hard disk failure handling of a storage cluster according to another embodiment of the present disclosure;
FIG. 9 schematically illustrates a block diagram of a hard disk failure handling apparatus of a storage cluster according to an embodiment of the disclosure; and
FIG. 10 schematically illustrates a block diagram of an electronic device adapted to implement a method for handling hard disk failures of a storage cluster, according to an embodiment of the disclosure.
Detailed Description
Hereinafter, embodiments of the present disclosure will be described with reference to the accompanying drawings. It should be understood that the description is only exemplary and is not intended to limit the scope of the present disclosure. In the following detailed description, for purposes of explanation, numerous specific details are set forth in order to provide a thorough understanding of the embodiments of the present disclosure. It may be evident, however, that one or more embodiments may be practiced without these specific details. In addition, in the following description, descriptions of well-known structures and techniques are omitted so as not to unnecessarily obscure the concepts of the present disclosure.
The terminology used herein is for the purpose of describing particular embodiments only and is not intended to be limiting of the disclosure. The terms "comprises," "comprising," and/or the like, as used herein, specify the presence of stated features, steps, operations, and/or components, but do not preclude the presence or addition of one or more other features, steps, operations, or components.
All terms (including technical and scientific terms) used herein have the same meaning as commonly understood by one of ordinary skill in the art unless otherwise defined. It should be noted that the terms used herein should be construed to have meanings consistent with the context of the present specification and should not be construed in an idealized or overly formal manner.
Where expressions like "at least one of A, B and C, etc." are used, the expressions should generally be interpreted in accordance with the meaning commonly understood by those skilled in the art (e.g., "a system having at least one of A, B and C" shall include, but not be limited to, a system having A alone, B alone, C alone, A and B together, A and C together, B and C together, and/or A, B and C together, etc.).
The embodiments of the disclosure provide a method, an apparatus, a device, a medium and a program product for processing hard disk failures of a storage cluster. They may be used in the financial field or in any other field; their field of application is not limited.
In the technical solution of the disclosure, the acquisition, storage, application and other handling of users' personal information all comply with the relevant laws and regulations, necessary security measures are taken, and public order and good customs are not violated.
In the technical solution of the disclosure, the user's authorization or consent is obtained before the user's personal information is obtained or collected.
Fig. 1 schematically illustrates an application scenario diagram of a method, an apparatus, a device, a medium and a program product for processing hard disk failures of a storage cluster according to an embodiment of the present disclosure.
As shown in fig. 1, an application scenario 100 according to this embodiment may include terminal devices 101, 102, 103, a network 104 and a server 105. The network 104 is used as a medium to provide communication links between the terminal devices 101, 102, 103 and the server 105. The network 104 may include various connection types, such as wired, wireless communication links, or fiber optic cables, among others.
The user may interact with the server 105 via the network 104 using the terminal devices 101, 102, 103 to receive or send messages or the like. Various communication client applications, such as shopping class applications, web browser applications, search class applications, instant messaging tools, mailbox clients, social platform software, etc. (by way of example only) may be installed on the terminal devices 101, 102, 103.
The terminal devices 101, 102, 103 may be a variety of electronic devices having a display screen and supporting web browsing, including but not limited to smartphones, tablets, laptop and desktop computers, and the like.
The server 105 may be a server providing various services, such as a background management server (by way of example only) providing support for websites browsed by users using the terminal devices 101, 102, 103. The background management server may analyze and process the received data such as the user request, and feed back the processing result (e.g., the web page, information, or data obtained or generated according to the user request) to the terminal device.
It should be noted that, the method for handling hard disk failures of a storage cluster according to the embodiment of the disclosure may be generally performed by the server 105. Accordingly, the hard disk failure handling apparatus of the storage cluster provided in the embodiments of the present disclosure may be generally disposed in the server 105. The hard disk failure processing method of the storage cluster provided by the embodiment of the present disclosure may also be performed by a server or a server cluster that is different from the server 105 and is capable of communicating with the terminal devices 101, 102, 103 and/or the server 105. Accordingly, the hard disk failure processing apparatus of the storage cluster provided by the embodiments of the present disclosure may also be provided in a server or a server cluster that is different from the server 105 and is capable of communicating with the terminal devices 101, 102, 103 and/or the server 105.
It should be understood that the number of terminal devices, networks and servers in fig. 1 is merely illustrative. There may be any number of terminal devices, networks, and servers, as desired for implementation.
Storage service (OSD): a service running on each storage server node in the storage cluster. It is responsible for managing the storage media on the server node, accepting I/O requests from the computing nodes, and storing user data on the storage media (hard disks). Each storage service corresponds to one storage medium (one hard disk). Each storage node contains, for example, 12 OSD hard disks. Each storage pool contains a plurality of storage nodes, and each resource domain contains a plurality of storage pools. A production service virtual machine may use multiple resource domains. The 12 OSD hard disks are, for example, data disks for storing data. In addition, each storage node is also provided with, for example, a system disk for storing the operating system.
The hard disk failure processing method of the storage cluster of the disclosed embodiment will be described in detail below with reference to fig. 2 to 8 based on the scenario described in fig. 1.
Fig. 2 schematically illustrates a flowchart of a method of hard disk failure handling of a storage cluster according to an embodiment of the disclosure.
As shown in fig. 2, an embodiment of the present disclosure provides a method for processing hard disk failures of a storage cluster, for example, including:
S210, acquiring I/O response data and traffic data of a plurality of hard disks.
For example, the plurality of hard disks are the hard disks contained in all storage nodes of the distributed storage cluster. The distributed storage cluster may have one or several resource domains, each resource domain comprising a plurality of storage pools and each storage pool comprising a plurality of storage nodes. The I/O response data is the response time of each hard disk to I/O requests and is used to judge the responsiveness of the hard disk.
For example, the I/O response data is the await column in the iostat data, i.e., the average latency of I/O requests over a period of time.
FIG. 3 schematically illustrates a flow chart of a method of acquiring I/O response data for a plurality of hard disks, according to an embodiment of the disclosure.
For example, the I/O response data of a plurality of hard disks is acquired through steps S311 to S312.
In step S311, the iostat data of the plurality of hard disks is acquired.
In step S312, the await column in the iostat data is extracted to obtain the I/O response data. By analyzing the await column, a response-time threshold can be determined from routine operating data (for example, data from the past 3 days), and it can then be judged whether the responsiveness of a given hard disk is abnormal.
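Steps S311 and S312 can be sketched as a small text parser. This is a minimal illustration, not the patent's implementation: the function name `extract_await` and the sample report are made up, and the sketch assumes an `iostat -x` style table whose header row contains a column literally named `await`. Some sysstat versions report only `r_await` and `w_await`, in which case the parser returns an empty result and would need adapting.

```python
from typing import Dict, List, Optional


def extract_await(iostat_text: str) -> Dict[str, float]:
    """Return {device: await_ms} parsed from one `iostat -x` style report."""
    result: Dict[str, float] = {}
    columns: Optional[List[str]] = None
    for line in iostat_text.splitlines():
        fields = line.split()
        if not fields:
            columns = None  # a blank line ends the current table
            continue
        if fields[0].rstrip(":") == "Device":
            columns = fields  # header row of a device table
            continue
        # Only rows that line up with a header containing "await" are used.
        if columns and "await" in columns and len(fields) == len(columns):
            result[fields[0]] = float(fields[columns.index("await")])
    return result


# Hypothetical sample report (values invented for illustration).
SAMPLE = """\
Device:  rrqm/s wrqm/s  r/s   w/s  rkB/s  wkB/s avgrq-sz avgqu-sz await r_await w_await svctm %util
sda        0.00   1.20 3.50 12.00 140.0  520.0    85.16     0.05  3.10    2.00    3.42  0.90  1.40
sdb        0.00   0.80 2.10  9.40  84.0  410.0    85.91     1.90 165.00  150.00  168.35  9.80 11.30
"""

await_by_disk = extract_await(SAMPLE)
```

Locating the column by header name rather than by fixed position keeps the sketch tolerant of column-order differences between tool versions.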
Fig. 4 schematically illustrates a flow chart of a method of acquiring traffic data for a plurality of hard disks according to an embodiment of the present disclosure.
For example, the traffic data of the plurality of hard disks is acquired through steps S411 to S412.
In step S411, in a first period, the read-write traffic of the plurality of hard disks is monitored.
In step S412, in a second period, probe traffic is issued to test the plurality of hard disks. The service traffic in the first period is greater than the service traffic in the second period.
For example, the first period is the service peak period (e.g., 06:00-20:00), when service traffic is relatively heavy, and the second period is the service low-peak period (e.g., 20:00-06:00), when service traffic is relatively light. By monitoring the read-write traffic of the hard disks during both the peak period and the low-peak period, fault detection runs continuously, and, provided the redundancy of the storage pool is sufficient, the stability of the storage cluster can be effectively ensured even if the failure rate of OSD hard disks increases.
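The choice between passive monitoring and active probing can be sketched as a time-of-day check. The window 06:00-20:00 is the example given above; the function name, the returned labels, and the half-open boundary handling (06:00 counts as peak, 20:00 as low-peak) are assumptions made for illustration.

```python
from datetime import time

# Example window from the text; a real deployment would tune these values.
PEAK_START = time(6, 0)
PEAK_END = time(20, 0)


def monitoring_mode(now: time) -> str:
    """Pick the data-collection mode for the current time of day."""
    if PEAK_START <= now < PEAK_END:
        # First period: service traffic is heavy enough to observe directly.
        return "monitor_service_traffic"
    # Second period: service traffic is light, so probe traffic is issued.
    return "probe_traffic_test"
```

Because the low-peak window wraps past midnight, it is simplest to define it as "everything outside the peak window" rather than as its own interval.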
S220, determining, according to the I/O response data and/or the traffic data, that at least one of the plurality of hard disks is an abnormal hard disk.
Fig. 5 schematically illustrates a flowchart of a method of determining whether a hard disk is abnormal according to an embodiment of the present disclosure.
For example, it is determined whether the hard disk is abnormal or not through steps S521 to S523.
In step S521, if the I/O response data is greater than a first threshold, it is determined that at least one of the plurality of hard disks is an abnormal hard disk; and/or
For example, the first threshold may be determined from the routine response time of the hard disk (e.g., over the past 3 days), with the average response time of the hard disk over those 3 days taken as the first threshold. When the real-time I/O response time of a hard disk exceeds this average response time, the hard disk is judged to be an abnormal hard disk.
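The first-threshold check in step S521 can be sketched as follows, assuming the recent await samples (for example, those collected over 3 days) are already available as a list of milliseconds. The function names are hypothetical.

```python
from statistics import mean
from typing import Sequence


def first_threshold(history_await_ms: Sequence[float]) -> float:
    """Average response time over the history window (e.g. the past 3 days)."""
    return mean(history_await_ms)


def exceeds_first_threshold(current_await_ms: float,
                            history_await_ms: Sequence[float]) -> bool:
    """True when the real-time response time exceeds the historical average."""
    return current_await_ms > first_threshold(history_await_ms)
```

A plain average is a sensitive trigger, since ordinary fluctuation can push a healthy disk above it. In the method described here that is tolerable, because a disk flagged at this stage still has to fail the separate traffic test before it is replaced.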
In step S522, if the difference ratio between the read-write traffic and the average traffic within a third period is greater than a second threshold, it is determined that at least one of the plurality of hard disks is an abnormal hard disk; and/or
For example, the third period may be the length of time for which the storage system tolerates the traffic of a hard disk deviating from that of the majority of normally operating hard disks, for example 1 hour. The second threshold is, for example, 30%; that is, when the traffic of a hard disk differs from that of the normal hard disks by more than 30% for more than one hour, the hard disk is determined to be an abnormal hard disk.
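The sustained-deviation check in step S522 can be sketched as below. The text does not give a formula for the difference ratio, so this sketch assumes |disk traffic - average traffic| / average traffic, and it assumes one sample per minute so that 60 consecutive samples stand for the 1-hour third period. All names are hypothetical.

```python
from typing import Sequence


def difference_ratio(disk_traffic: float, average_traffic: float) -> float:
    """Relative deviation of one disk from the average (assumed definition)."""
    if average_traffic <= 0:
        return 0.0  # no meaningful average to compare against
    return abs(disk_traffic - average_traffic) / average_traffic


def deviates_for_period(disk_traffic: Sequence[float],
                        average_traffic: Sequence[float],
                        threshold: float = 0.30,
                        required_samples: int = 60) -> bool:
    """True if every one of the latest `required_samples` samples deviates."""
    if len(disk_traffic) != len(average_traffic):
        raise ValueError("series must have the same length")
    if len(disk_traffic) < required_samples:
        return False  # not enough history to cover the whole period yet
    recent = zip(disk_traffic[-required_samples:],
                 average_traffic[-required_samples:])
    return all(difference_ratio(d, a) > threshold for d, a in recent)
```

Requiring every sample in the window to exceed the threshold means a single normal reading resets the judgment, which matches the idea of a deviation that lasts "more than one hour".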
In step S523, if the difference ratio between the probe traffic and the average traffic within a fourth period is greater than a third threshold, it is determined that at least one of the plurality of hard disks is an abnormal hard disk.
For example, in the service low-peak period there is too little service traffic, so the traffic difference between an abnormal hard disk and a normal hard disk is small and the two are hard to tell apart. The difference is therefore amplified by additional traffic, for example by having the system disk send probe traffic to the data disk. The fourth period may be the same as the third period, or may differ from it according to the traffic characteristics of the low-peak period. Similarly, the third threshold may be the same as or different from the second threshold.
It can be understood that whether a hard disk is faulty may be judged from any one of the I/O response time of the hard disk, the read-write traffic during the service peak period, and the probe traffic during the service low-peak period, or from any combination of the three. Monitoring the working performance of the hard disk through both its I/O response time and its actual service traffic balances the accuracy and timeliness of fault detection and improves the stability of the storage cluster.
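One way to express "any one of the three, or any combination" is to make the set of active criteria configurable. The sketch below is an assumption-laden illustration: the criterion names are hypothetical labels for steps S521 to S523, and the enabled criteria are combined with a logical OR, which the text permits but does not mandate.

```python
from typing import Dict, Iterable

# Hypothetical labels for the checks in steps S521, S522 and S523.
ALL_CRITERIA = ("io_response", "peak_traffic", "probe_traffic")


def is_suspect(check_results: Dict[str, bool],
               enabled: Iterable[str] = ALL_CRITERIA) -> bool:
    """Flag a disk if any enabled criterion reports an anomaly (OR assumed)."""
    enabled = tuple(enabled)
    unknown = set(enabled) - set(ALL_CRITERIA)
    if unknown:
        raise ValueError(f"unknown criteria: {sorted(unknown)}")
    # A criterion with no recorded result is treated as "no anomaly".
    return any(check_results.get(name, False) for name in enabled)
```

For example, `is_suspect({"io_response": True}, enabled=["peak_traffic"])` ignores the slow-response result because only the peak-traffic criterion is enabled.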
S230, performing a traffic test on the abnormal hard disk, and replacing the abnormal hard disk if the result of the traffic test is abnormal.
FIG. 6 schematically illustrates a flow chart of a method of handling an abnormal hard disk in accordance with an embodiment of the present disclosure.
For example, the abnormal hard disk is handled in steps S631 to S635.
In step S631, the service traffic of the abnormal hard disk is suspended.
For example, after a hard disk is determined to be abnormal through steps S521 to S523, it may be temporarily marked down so that it provides no service, that is, its service traffic is suspended. Marking the abnormal hard disk down temporarily isolates it from the storage cluster so that a separate traffic test can be run on it instead of replacing it directly. This avoids false detection of hard disk faults caused by traffic fluctuation or an unreasonable threshold setting, and to some extent reduces the frequency of hard disk replacement, the operation and maintenance workload, and the cost.
It can be understood that, logically, suspending the service traffic of the abnormal hard disk need not belong to step S230; it may instead be a separate operation between steps S220 and S230.
In step S632, the traffic test is performed on the abnormal hard disk without triggering reconstruction.
For example, for an OSD hard disk that has been temporarily marked down, the system disk sends probe I/O traffic to the data disk without triggering reconstruction, and the hard disk provisionally judged abnormal is tested. The test method may refer to steps S521 to S523 and is not repeated here.
In step S633, if the result of the traffic test is abnormal, three-copy reconstruction is performed on the abnormal hard disk.
For example, when the further traffic test on the temporarily marked-down hard disk still gives an abnormal result, the abnormal hard disk is determined to be a faulty hard disk that must be replaced. Before the hard disk is replaced, the data stored on it must be backed up and reconstructed, that is, migrated from the one faulty hard disk to a plurality of normally working hard disks. At this point the reconstruction traffic occupies only a small share of the bandwidth and does not affect the storage of service data.
For example, if the result of the traffic test is normal, the service traffic of the abnormal hard disk is restored. When the traffic test on the temporarily marked-down hard disk gives a normal result, the hard disk is judged to be normal and does not need to be replaced. It is then automatically returned to the storage pool and its service traffic is restored. A hard disk that was falsely detected as faulty is thus quickly re-tested, judged normal, and returned to normal storage duty, which improves hardware utilization.
For example, after three-copy reconstruction is performed on the abnormal hard disk because the result of the traffic test is abnormal, alarm information for the abnormal hard disk can be determined and monitored centrally. In this method, the storage system automatically checks the fault state of OSD hard disks, automatically triggers isolation, automatically performs reconstruction, and automatically adds the replacement OSD hard disk to the storage pool once the faulty hard disk has been replaced. No manual processing based on alarm logs is required, which greatly shortens the emergency response time. Uploading the alarm information of abnormal hard disks for centralized monitoring also makes it easy to spot-check the storage system's automatic fault handling when needed, to further tune the related parameters and thresholds, and to improve the operational reliability of the storage system.
In step S634, the reconstructed abnormal hard disk is replaced.
For example, after the three-copy reconstruction of the data in the faulty hard disk is complete, the faulty hard disk is replaced. The replacement may be performed manually or by automated means.
Preferably, the abnormal hard disk whose flow test result is abnormal is replaced in the second period. The second period is the off-peak period of the service; replacing the faulty hard disk during the off-peak period avoids the slow I/O response of the storage cluster that replacing a hard disk during the peak period would cause.
Step S635, performing three-copy reconstruction on the replaced hard disk.
For example, after three-copy reconstruction is performed on the replaced hard disk, the alarm for the abnormal hard disk is canceled.
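By way of illustration only, steps S631 to S635 together with the restore and alarm handling described above can be read as a single control flow. The following is a minimal sketch under stated assumptions: the `DiskOps` hooks, their names, and the returned status strings are hypothetical placeholders for operations of the storage system and do not correspond to any real storage API.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class DiskOps:
    """Hypothetical hooks into the storage system; every name is an assumption."""
    suspend: Callable[[str], None]        # S631: mark the disk down
    flow_test_ok: Callable[[str], bool]   # S632: probe test, no reconstruction
    rebuild_from: Callable[[str], None]   # S633: migrate data off the disk
    replace: Callable[[str], str]         # S634: returns the new disk id
    rebuild_onto: Callable[[str], None]   # S635: rebuild onto the new disk
    resume: Callable[[str], None]         # restore service traffic
    raise_alarm: Callable[[str], None]    # upload alarm to central monitoring
    clear_alarm: Callable[[str], None]    # cancel the alarm


def handle_abnormal_disk(disk_id: str, ops: DiskOps) -> str:
    """Isolate, re-test, then either restore or rebuild-and-replace."""
    ops.suspend(disk_id)
    if ops.flow_test_ok(disk_id):
        # False detection: put the disk back into service.
        ops.resume(disk_id)
        return "restored"
    # Confirmed fault: protect the data first, then swap the hardware.
    ops.rebuild_from(disk_id)
    ops.raise_alarm(disk_id)
    new_disk_id = ops.replace(disk_id)
    ops.rebuild_onto(new_disk_id)
    ops.clear_alarm(disk_id)
    return "replaced"
```

Passing the operations in as callables keeps the ordering of the steps separate from how each step is carried out, so the replacement hook may equally stand for a manual or an automated replacement.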
FIG. 7 schematically illustrates a method flow diagram for three-copy reconstruction of a hard disk in accordance with an embodiment of the present disclosure.
For example, the replaced hard disk is subjected to three-copy reconstruction in steps S7351 to S7353.
In step S7351, traffic flows of the plurality of hard disks are determined.
Step S7352, determining the reconstruction flow of the replaced hard disk according to the service flow.
For example, after the physical hardware has been replaced, the newly installed OSD hard disk is automatically added to the storage pool, and the reconstruction traffic is automatically adjusted according to the service traffic, with priority given to the service. Because data is migrated from a plurality of hard disks to the newly installed hard disk during its three-copy reconstruction, the reconstruction traffic occupies more bandwidth at this time. To avoid affecting the normal storage of service data, the reconstruction traffic needs to be limited.
And step S7353, performing three-copy reconstruction on the replaced hard disk according to the reconstruction flow.
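By way of illustration only, steps S7351 to S7353 can be sketched as deriving a reconstruction bandwidth from the measured service traffic. The disclosure does not specify a formula, so the rule below (serve the service first, keep a reserve, and give the remainder to reconstruction subject to a floor) is an assumption, as are the function name, the parameter names, and the default values.

```python
def reconstruction_rate(link_capacity_mb_s: float,
                        service_traffic_mb_s: float,
                        reserve_ratio: float = 0.2,
                        min_rate_mb_s: float = 10.0) -> float:
    """Bandwidth granted to reconstruction, in MB/s (illustrative rule).

    Service traffic is served first and a reserve is kept for bursts.
    Reconstruction receives what remains, but never less than a small
    floor so that the rebuild still makes progress.
    """
    if link_capacity_mb_s <= 0:
        raise ValueError("link capacity must be positive")
    if not 0.0 <= reserve_ratio < 1.0:
        raise ValueError("reserve ratio must be in [0, 1)")
    usable = link_capacity_mb_s * (1.0 - reserve_ratio)
    spare = usable - service_traffic_mb_s
    return max(min_rate_mb_s, spare)
```

Under this rule the reconstruction rate falls as the service traffic rises, which matches the service-priority behaviour described above; the floor is a design choice that trades a small amount of service bandwidth for a bounded rebuild time.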
Fig. 8 schematically illustrates a flowchart of a method for handling hard disk failures of a storage cluster according to another embodiment of the present disclosure.
For example, as shown in fig. 8, the storage system collects the I/O traffic of each OSD hard disk, checks the read/write traffic of the hard disks in separate time periods, and temporarily marks down any hard disk whose traffic is detected as abnormal. The await data of each OSD hard disk may be collected at the same time, and any hard disk with a high await value is temporarily marked down. The temporarily marked-down abnormal hard disks are then re-tested; the service traffic of a hard disk whose re-test result is normal is restored, and traffic monitoring continues in a loop. A hard disk whose re-test result is abnormal is automatically removed from the storage pool after data reconstruction, and the alarm information is uploaded to centralized monitoring. After the faulty hard disk is replaced, the newly installed hard disk is automatically added to the storage pool and the alarm for the faulty hard disk is canceled.
Based on the above hard disk fault processing method for a storage cluster, the present disclosure further provides a hard disk fault processing apparatus for a storage cluster. The apparatus is described in detail below with reference to fig. 9.
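By way of illustration only, the detection stage of fig. 8 can be sketched as a single scan over per-disk metrics that flags a hard disk when its await value is high or when its traffic deviates strongly from the average. The sketch assumes that the average is taken across the hard disks within one sampling period, which the description does not state; the function name and both threshold defaults are likewise hypothetical.

```python
from statistics import mean


def find_abnormal_disks(await_ms: dict[str, float],
                        traffic_mb_s: dict[str, float],
                        await_threshold_ms: float = 100.0,
                        deviation_threshold: float = 0.5) -> set[str]:
    """Return the ids of hard disks to be temporarily marked down.

    A disk is flagged if its await exceeds the await threshold, or if
    its traffic differs from the cross-disk average by more than the
    given ratio. Both thresholds are illustrative assumptions.
    """
    abnormal = {d for d, a in await_ms.items() if a > await_threshold_ms}
    if traffic_mb_s:
        avg = mean(traffic_mb_s.values())
        if avg > 0:
            abnormal |= {d for d, t in traffic_mb_s.items()
                         if abs(t - avg) / avg > deviation_threshold}
    return abnormal
```

The flagged set is only a list of candidates: each disk in it would still go through the isolation and re-test described above before any replacement is decided.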
Fig. 9 schematically illustrates a block diagram of a hard disk failure processing apparatus of a storage cluster according to an embodiment of the present disclosure.
As shown in fig. 9, the apparatus 900 of this embodiment includes an acquisition module 910, a determination module 920, and a replacement module 930.
The acquisition module 910 is configured to acquire I/O response data and traffic data of a plurality of hard disks. In an embodiment, the acquisition module 910 may be configured to perform the operation S210 described above, which is not described again here.
The determination module 920 is configured to determine, according to the I/O response data and/or the traffic data, that at least one of the plurality of hard disks is an abnormal hard disk. In an embodiment, the determination module 920 may be configured to perform the operation S220 described above, which is not described again here.
The replacement module 930 is configured to perform a flow test on the abnormal hard disk and to replace the abnormal hard disk whose flow test result is abnormal. In an embodiment, the replacement module 930 may be configured to perform the operation S230 described above, which is not described again here.
According to an embodiment of the present disclosure, any of the acquisition module 910, the determination module 920, and the replacement module 930 may be combined and implemented in one module, or any one of them may be split into a plurality of modules. Alternatively, at least part of the functionality of one or more of these modules may be combined with at least part of the functionality of other modules and implemented in one module. According to embodiments of the present disclosure, at least one of the acquisition module 910, the determination module 920, and the replacement module 930 may be implemented at least in part as a hardware circuit, such as a Field Programmable Gate Array (FPGA), a Programmable Logic Array (PLA), a system-on-chip, a system-on-substrate, a system-on-package, or an Application Specific Integrated Circuit (ASIC); or may be implemented in hardware or firmware in any other reasonable manner of integrating or packaging a circuit; or may be implemented in any one of, or a suitable combination of, software, hardware, and firmware. Alternatively, at least one of the acquisition module 910, the determination module 920, and the replacement module 930 may be at least partially implemented as computer program modules that, when executed, perform the corresponding functions.
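By way of illustration only, when the three modules are implemented as computer program modules, their cooperation can be sketched as three interchangeable callables composed by one apparatus object. The class name, the field names, the `Metrics` type, and the `run_once` method are hypothetical and serve only to show how operations S210 to S230 chain together.

```python
from dataclasses import dataclass
from typing import Callable, Iterable

# Per-disk metrics keyed by disk id, e.g. {"osd.0": {"await_ms": 4.0}}.
Metrics = dict[str, dict[str, float]]


@dataclass
class FaultHandlingApparatus:
    """Illustrative composition of modules 910, 920 and 930."""
    acquire: Callable[[], Metrics]                  # acquisition module 910
    determine: Callable[[Metrics], Iterable[str]]   # determination module 920
    test_and_replace: Callable[[str], str]          # replacement module 930

    def run_once(self) -> dict[str, str]:
        """Run S210, S220 and S230 once; map each flagged disk to its outcome."""
        metrics = self.acquire()
        return {disk: self.test_and_replace(disk)
                for disk in self.determine(metrics)}
```

Because each module is only a callable here, two of them could be merged into one function or one could be split into several, in line with the combination and splitting of modules described above.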
Fig. 10 schematically illustrates a block diagram of an electronic device adapted to implement a method for handling hard disk failures of a storage cluster, according to an embodiment of the disclosure.
As shown in fig. 10, an electronic device 1000 according to an embodiment of the present disclosure includes a processor 1001 that can perform various appropriate actions and processes according to a program stored in a Read Only Memory (ROM) 1002 or a program loaded from a storage section 1008 into a Random Access Memory (RAM) 1003. The processor 1001 may include, for example, a general purpose microprocessor (e.g., a CPU), an instruction set processor and/or an associated chipset and/or a special purpose microprocessor (e.g., an Application Specific Integrated Circuit (ASIC)), or the like. The processor 1001 may also include on-board memory for caching purposes. The processor 1001 may include a single processing unit or multiple processing units for performing different actions of the method flows according to embodiments of the present disclosure.
In the RAM 1003, various programs and data necessary for the operation of the electronic device 1000 are stored. The processor 1001, the ROM 1002, and the RAM 1003 are connected to each other by a bus 1004. The processor 1001 performs various operations of the method flow according to the embodiment of the present disclosure by executing programs in the ROM 1002 and/or the RAM 1003. Note that the program may be stored in one or more memories other than the ROM 1002 and the RAM 1003. The processor 1001 may also perform various operations of the method flow according to the embodiments of the present disclosure by executing programs stored in the one or more memories.
According to an embodiment of the disclosure, the electronic device 1000 may also include an input/output (I/O) interface 1005, the input/output (I/O) interface 1005 also being connected to the bus 1004. The electronic device 1000 may also include one or more of the following components connected to the I/O interface 1005: an input section 1006 including a keyboard, a mouse, and the like; an output section 1007 including a Cathode Ray Tube (CRT), a Liquid Crystal Display (LCD), etc., and a speaker, etc.; a storage section 1008 including a hard disk or the like; and a communication section 1009 including a network interface card such as a LAN card, a modem, or the like. The communication section 1009 performs communication processing via a network such as the internet. The drive 1010 is also connected to the I/O interface 1005 as needed. A removable medium 1011, such as a magnetic disk, an optical disk, a magneto-optical disk, a semiconductor memory, or the like, is installed as needed in the drive 1010, so that a computer program read out therefrom is installed as needed in the storage section 1008.
The present disclosure also provides a computer-readable storage medium that may be embodied in the apparatus/device/system described in the above embodiments; or may exist alone without being assembled into the apparatus/device/system. The computer-readable storage medium carries one or more programs which, when executed, implement methods in accordance with embodiments of the present disclosure.
According to embodiments of the present disclosure, the computer-readable storage medium may be a non-volatile computer-readable storage medium, which may include, for example, but is not limited to: a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing. In the context of this disclosure, a computer-readable storage medium may be any tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device. For example, according to embodiments of the present disclosure, the computer-readable storage medium may include ROM 1002 and/or RAM 1003 and/or one or more memories other than ROM 1002 and RAM 1003 described above.
Embodiments of the present disclosure also include a computer program product comprising a computer program containing program code for performing the methods shown in the flowcharts. The program code, when executed in a computer system, causes the computer system to implement the hard disk fault processing method for a storage cluster provided by embodiments of the present disclosure.
The above-described functions defined in the system/apparatus of the embodiments of the present disclosure are performed when the computer program is executed by the processor 1001. The systems, apparatus, modules, units, etc. described above may be implemented by computer program modules according to embodiments of the disclosure.
In one embodiment, the computer program may be based on a tangible storage medium such as an optical storage device, a magnetic storage device, or the like. In another embodiment, the computer program may also be transmitted in the form of signals on a network medium, distributed, and downloaded and installed via the communication section 1009, and/or installed from the removable medium 1011. The computer program may include program code that may be transmitted using any appropriate network medium, including but not limited to: wireless, wired, etc., or any suitable combination of the foregoing.
In such an embodiment, the computer program may be downloaded and installed from a network via the communication section 1009, and/or installed from the removable medium 1011. When the computer program is executed by the processor 1001, the above-described functions defined in the system of the embodiments of the present disclosure are performed. According to embodiments of the present disclosure, the systems, devices, apparatuses, modules, units, etc. described above may be implemented by computer program modules.
According to embodiments of the present disclosure, program code for performing computer programs provided by embodiments of the present disclosure may be written in any combination of one or more programming languages, and in particular, such computer programs may be implemented in high-level procedural and/or object-oriented programming languages, and/or assembly/machine languages. Programming languages include, but are not limited to, Java, C++, Python, C, or similar programming languages. The program code may execute entirely on the user's computing device, partly on the user's device, partly on a remote computing device, or entirely on the remote computing device or server. In the case of remote computing devices, the remote computing device may be connected to the user computing device through any kind of network, including a Local Area Network (LAN) or a Wide Area Network (WAN), or may be connected to an external computing device (e.g., connected via the Internet using an Internet service provider).
The flowcharts and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of systems, methods and computer program products according to various embodiments of the present disclosure. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of code, which comprises one or more executable instructions for implementing the specified logical function(s). It should also be noted that, in some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams or flowchart illustration, and combinations of blocks in the block diagrams or flowchart illustration, can be implemented by special purpose hardware-based systems which perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.
Those skilled in the art will appreciate that the features recited in the various embodiments of the disclosure and/or in the claims may be combined and/or integrated in a variety of ways, even if such combinations or integrations are not explicitly recited in the disclosure. In particular, the features recited in the various embodiments of the present disclosure and/or the claims may be variously combined and/or integrated without departing from the spirit and teachings of the present disclosure. All such combinations and/or integrations fall within the scope of the present disclosure.
The embodiments of the present disclosure are described above. However, these examples are for illustrative purposes only and are not intended to limit the scope of the present disclosure. Although the embodiments are described above separately, this does not mean that the measures in the embodiments cannot be used advantageously in combination. The scope of the disclosure is defined by the appended claims and equivalents thereof. Various alternatives and modifications can be made by those skilled in the art without departing from the scope of the disclosure, and such alternatives and modifications are intended to fall within the scope of the disclosure.

Claims (13)

1. A method for handling hard disk failures in a storage cluster, comprising:
acquiring I/O response data and flow data of a plurality of hard disks;
judging that at least one of the plurality of hard disks is an abnormal hard disk according to the I/O response data and/or the flow data; and
carrying out a flow test on the abnormal hard disk, and replacing the abnormal hard disk whose flow test result is abnormal.
2. The method of claim 1, wherein the obtaining the I/O response data and the traffic data for the plurality of hard disks comprises:
acquiring iostat data of the plurality of hard disks;
and extracting the await column data in the iostat data to obtain the I/O response data.
3. The method of claim 1, wherein the obtaining the I/O response data and the traffic data for the plurality of hard disks comprises:
monitoring the read-write flow of the plurality of hard disks in a first period; and
in a second period, issuing a detection flow test to the plurality of hard disks;
wherein the traffic flow in the first period is greater than the traffic flow in the second period.
4. The method of claim 3, wherein the determining that at least one of the plurality of hard disks is an abnormal hard disk based on the I/O response data and/or the traffic data comprises:
determining that at least one of the plurality of hard disks is an abnormal hard disk if the I/O response data is determined to be greater than a first threshold; and/or
determining that at least one of the plurality of hard disks is an abnormal hard disk under the condition that the difference value ratio of the read-write flow to the average flow in the third period is larger than a second threshold value; and/or
determining that at least one of the plurality of hard disks is an abnormal hard disk under the condition that the difference value ratio of the detected flow to the average flow in the fourth time period is larger than a third threshold value.
5. The method of claim 1, wherein the performing the flow test on the abnormal hard disk and replacing the abnormal hard disk with an abnormal flow test result comprises:
suspending the service flow of the abnormal hard disk;
under the condition of no reconstruction, carrying out flow test on the abnormal hard disk;
under the condition that the flow test result is abnormal, carrying out three-copy reconstruction on the abnormal hard disk;
replacing the reconstructed abnormal hard disk; and
and carrying out three-copy reconstruction on the replaced hard disk.
6. The method of claim 5, wherein the performing three-copy reconstruction of the replaced hard disk comprises:
determining service flow of the plurality of hard disks;
determining the reconstruction flow of the replaced hard disk according to the service flow;
and carrying out three-copy reconstruction on the replaced hard disk according to the reconstruction flow.
7. The method as recited in claim 5, further comprising:
and recovering the service flow of the abnormal hard disk under the condition that the flow test result is normal.
8. The method according to claim 5, wherein, in the case that the result of the traffic test is abnormal, after performing three-copy reconstruction on the abnormal hard disk, further comprising:
determining alarm information of the abnormal hard disk;
uploading the alarm information to centralized monitoring; and
and after the replaced hard disk is subjected to three-copy reconstruction, canceling the alarm on the abnormal hard disk.
9. The method of claim 3, wherein replacing the abnormal hard disk for which the flow test result is abnormal comprises:
and replacing the abnormal hard disk with abnormal flow test results in the second period.
10. A hard disk failure handling apparatus for a storage cluster, comprising:
the acquisition module is used for acquiring I/O response data and flow data of a plurality of hard disks;
the judging module is used for judging that at least one of the plurality of hard disks is an abnormal hard disk according to the I/O response data and/or the flow data; and
and the replacement module is used for carrying out flow test on the abnormal hard disk and replacing the abnormal hard disk with abnormal flow test results.
11. An electronic device, comprising:
one or more processors;
storage means for storing one or more programs,
wherein the one or more programs, when executed by the one or more processors, cause the one or more processors to perform the method of any of claims 1-9.
12. A computer readable storage medium having stored thereon executable instructions which, when executed by a processor, cause the processor to perform the method according to any of claims 1-9.
13. A computer program product comprising a computer program which, when executed by a processor, implements the method according to any one of claims 1 to 9.
CN202310434607.XA 2023-04-21 2023-04-21 Method, device, equipment and medium for processing hard disk faults of storage cluster Pending CN116450461A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310434607.XA CN116450461A (en) 2023-04-21 2023-04-21 Method, device, equipment and medium for processing hard disk faults of storage cluster

Publications (1)

Publication Number Publication Date
CN116450461A true CN116450461A (en) 2023-07-18

Family

ID=87128458

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310434607.XA Pending CN116450461A (en) 2023-04-21 2023-04-21 Method, device, equipment and medium for processing hard disk faults of storage cluster

Country Status (1)

Country Link
CN (1) CN116450461A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117271247A (en) * 2023-11-23 2023-12-22 深圳市钜邦科技有限公司 SSD solid state disk testing method
CN117271247B (en) * 2023-11-23 2024-03-08 深圳市钜邦科技有限公司 SSD solid state disk testing method

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination