WO2017020747A1 - Method and apparatus for detecting a slow disk - Google Patents

Method and apparatus for detecting a slow disk

Info

Publication number
WO2017020747A1
WO2017020747A1 PCT/CN2016/091605 CN2016091605W WO2017020747A1 WO 2017020747 A1 WO2017020747 A1 WO 2017020747A1 CN 2016091605 W CN2016091605 W CN 2016091605W WO 2017020747 A1 WO2017020747 A1 WO 2017020747A1
Authority
WO
WIPO (PCT)
Prior art keywords
delay
interval
delays
average
sampling
Prior art date
Application number
PCT/CN2016/091605
Other languages
English (en)
French (fr)
Inventor
张金冬
李静辉
龚学文
Original Assignee
华为技术有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 华为技术有限公司
Priority to EP16832235.2A (EP3318975A4)
Publication of WO2017020747A1 (zh)
Priority to US15/884,413 (US20180157438A1)

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F3/00 Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
    • G06F3/06 Digital input from, or digital output to, record carriers, e.g. RAID, emulated record carriers or networked record carriers
    • G06F3/0601 Interfaces specially adapted for storage systems
    • G06F3/0628 Interfaces specially adapted for storage systems making use of a particular technique
    • G06F3/0653 Monitoring storage devices or systems
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00 Error detection; Error correction; Monitoring
    • G06F11/07 Responding to the occurrence of a fault, e.g. fault tolerance
    • G06F11/0703 Error or fault processing not based on redundancy, i.e. by taking additional measures to deal with the error or fault not making use of redundancy in operation, in hardware, or in data representation
    • G06F11/0751 Error or fault detection not based on redundancy
    • G06F11/0754 Error or fault detection not based on redundancy by exceeding limits
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00 Error detection; Error correction; Monitoring
    • G06F11/07 Responding to the occurrence of a fault, e.g. fault tolerance
    • G06F11/16 Error detection or correction of the data by redundancy in hardware
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00 Error detection; Error correction; Monitoring
    • G06F11/30 Monitoring
    • G06F11/3003 Monitoring arrangements specially adapted to the computing system or computing system component being monitored
    • G06F11/3034 Monitoring arrangements specially adapted to the computing system or computing system component being monitored where the computing system component is a storage system, e.g. DASD based or network based
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00 Error detection; Error correction; Monitoring
    • G06F11/30 Monitoring
    • G06F11/34 Recording or statistical evaluation of computer activity, e.g. of down time, of input/output operation; Recording or statistical evaluation of user activity, e.g. usability assessment
    • G06F11/3409 Recording or statistical evaluation of computer activity, e.g. of down time, of input/output operation; Recording or statistical evaluation of user activity, e.g. usability assessment for performance assessment
    • G06F11/3419 Recording or statistical evaluation of computer activity, e.g. of down time, of input/output operation; Recording or statistical evaluation of user activity, e.g. usability assessment for performance assessment by assessing time
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F3/00 Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
    • G06F3/06 Digital input from, or digital output to, record carriers, e.g. RAID, emulated record carriers or networked record carriers
    • G06F3/0601 Interfaces specially adapted for storage systems
    • G06F3/0602 Interfaces specially adapted for storage systems specifically adapted to achieve a particular effect
    • G06F3/0604 Improving or facilitating administration, e.g. storage management
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F3/00 Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
    • G06F3/06 Digital input from, or digital output to, record carriers, e.g. RAID, emulated record carriers or networked record carriers
    • G06F3/0601 Interfaces specially adapted for storage systems
    • G06F3/0602 Interfaces specially adapted for storage systems specifically adapted to achieve a particular effect
    • G06F3/061 Improving I/O performance
    • G06F3/0611 Improving I/O performance in relation to response time
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F3/00 Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
    • G06F3/06 Digital input from, or digital output to, record carriers, e.g. RAID, emulated record carriers or networked record carriers
    • G06F3/0601 Interfaces specially adapted for storage systems
    • G06F3/0628 Interfaces specially adapted for storage systems making use of a particular technique
    • G06F3/0655 Vertical data movement, i.e. input-output transfer; data movement between one or more hosts and one or more storage devices
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F3/00 Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
    • G06F3/06 Digital input from, or digital output to, record carriers, e.g. RAID, emulated record carriers or networked record carriers
    • G06F3/0601 Interfaces specially adapted for storage systems
    • G06F3/0628 Interfaces specially adapted for storage systems making use of a particular technique
    • G06F3/0655 Vertical data movement, i.e. input-output transfer; data movement between one or more hosts and one or more storage devices
    • G06F3/0659 Command handling arrangements, e.g. command buffers, queues, command scheduling
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F3/00 Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
    • G06F3/06 Digital input from, or digital output to, record carriers, e.g. RAID, emulated record carriers or networked record carriers
    • G06F3/0601 Interfaces specially adapted for storage systems
    • G06F3/0668 Interfaces specially adapted for storage systems adopting a particular infrastructure
    • G06F3/0671 In-line storage system
    • G06F3/0673 Single storage device
    • G06F3/0674 Disk device
    • G06F3/0676 Magnetic disk device
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F3/00 Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
    • G06F3/06 Digital input from, or digital output to, record carriers, e.g. RAID, emulated record carriers or networked record carriers
    • G06F3/0601 Interfaces specially adapted for storage systems
    • G06F3/0602 Interfaces specially adapted for storage systems specifically adapted to achieve a particular effect
    • G06F3/0614 Improving the reliability of storage systems
    • G06F3/0616 Improving the reliability of storage systems in relation to life time, e.g. increasing Mean Time Between Failures [MTBF]

Definitions

  • The delay of I/O operations on each hard disk in the storage system can be monitored in real time to detect whether any hard disk is a slow disk.
  • The average delay of the I/O operations performed by the hard disk in each first period is computed and compared with a preset time threshold; if the average delay is greater than or equal to the time threshold, a threshold event is recorded. The number of threshold events that occur in a second period (the second period is longer than the first period) is then counted and compared with a preset count threshold; if the number is greater than or equal to the count threshold, the hard disk may be determined to be a slow disk.
  • The time threshold is usually set relatively large, which may reduce the accuracy of detecting a slow disk.
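The threshold-event scheme described above can be sketched as follows. This is only an illustrative sketch: the function name, parameter names and units are assumptions introduced here, not part of the patent text.

```python
def is_slow_disk_baseline(avg_delays_ms, time_threshold_ms, count_threshold):
    """Threshold-event check over one second period.

    avg_delays_ms holds one average I/O delay per first period, all taken
    from the same (longer) second period. All names are hypothetical.
    """
    # A threshold event is a first period whose average delay is greater
    # than or equal to the preset time threshold.
    threshold_events = sum(1 for d in avg_delays_ms if d >= time_threshold_ms)
    # The disk is judged slow when enough threshold events occurred.
    return threshold_events >= count_threshold
```

Because a single absolute time threshold is applied regardless of how busy the disk is, the threshold has to be chosen conservatively, which is the weakness the text points out.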
  • Embodiments of the present invention provide a method and apparatus for detecting a slow disk, which can improve the accuracy of detecting a slow disk.
  • an embodiment of the present invention provides a method for detecting a slow disk, where the method includes:
  • the delay-related index value is a value that changes correspondingly as the delay changes;
  • determining the first interval to which the first delay-related index value belongs, where the first interval is one of a plurality of intervals into which the maximum delay-related index value is divided in advance;
  • if the first interval is a full interval, calculating the ratio of the first delay to the interval average delay to obtain a first ratio, where a full interval is an interval in which the number of delay-related index values, among all those acquired in all sampling periods, that fall within it reaches a first threshold;
  • the interval average delay is an average of a plurality of second delays in the first interval;
  • the plurality of second delays correspond one-to-one to a plurality of sampling periods, each second delay is acquired in its corresponding sampling period, and each sampling period corresponds to one delay-related index value;
  • determining that the hard disk is a slow disk.
  • the method further includes:
  • if the first number does not reach the first threshold, it is determined that the first interval is not a full interval, and sampling proceeds to the next sampling period.
  • the delay related indicator value is a utilization rate of the hard disk read and write data
  • the delay related index value is a read/write speed of the hard disk read and write data.
  • the first threshold is N
  • the first interval corresponds to N second delays
  • N is an integer greater than or equal to 1.
  • the N second delays are arranged in sampling order, and the plurality of second delays are the first M second delays among the N second delays, where M is an integer, N/3 ≤ M ≤ 2N/3, and N/3 and 2N/3 are both taken as integers.
  • the plurality of second delays are the first N/2 second delays among the N second delays, where N/2 is taken as an integer.
  • An average value of the plurality of second delays is an arithmetic mean of the plurality of second delays or a geometric mean of the plurality of second delays;
  • An average of the plurality of first ratios is an arithmetic mean of the plurality of first ratios or a geometric mean of the plurality of first ratios.
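As a hedged illustration of the averaging described above, the sketch below computes an interval average delay from the first M of the N second delays, using either the arithmetic or the geometric mean. The function name, the default of M = N/2 rounded down, and the `mean` parameter are assumptions made for this example.

```python
import math

def interval_average_delay(second_delays, m=None, mean="arithmetic"):
    """Average the first m second delays of a full interval.

    second_delays must be in sampling order. When m is omitted, the first
    N/2 delays are used (rounded down, at least one); this default is an
    assumption of the sketch.
    """
    n = len(second_delays)
    if n == 0:
        raise ValueError("a full interval needs at least one second delay")
    if m is None:
        m = max(1, n // 2)
    chosen = second_delays[:m]
    if mean == "arithmetic":
        return sum(chosen) / len(chosen)
    if mean == "geometric":
        # Requires strictly positive delays.
        return math.exp(sum(math.log(d) for d in chosen) / len(chosen))
    raise ValueError("mean must be 'arithmetic' or 'geometric'")
```

Using only the earliest samples keeps the reference delay tied to the period when the disk was first observed, so later degradation shows up as a ratio above 1 rather than being absorbed into the average.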
  • the method is applied in a scenario with multiple hard disks; the method is performed on a first hard disk, and the first hard disk is one of the plurality of hard disks.
  • the method further includes:
  • the method is the same as the method for obtaining the first average value corresponding to the first hard disk;
  • the method further includes:
  • determining that a hard disk corresponding to a second ratio, among the plurality of second ratios, that is greater than or equal to the fourth threshold is a slow disk.
  • an embodiment of the present invention provides an apparatus for detecting a slow disk, where the apparatus includes:
  • the sampling unit is configured to periodically perform sampling during the detection period, and complete the following process in each sampling period:
  • determining the first interval to which the first delay-related index value belongs, where the first interval is one of a plurality of intervals into which the maximum delay-related index value is divided in advance;
  • if the first interval is a full interval, calculating the ratio of the first delay to the interval average delay to obtain a first ratio, where a full interval is an interval in which the number of delay-related index values, among all those acquired in all sampling periods, that fall within it reaches a first threshold;
  • the interval average delay is an average of a plurality of second delays in the first interval;
  • the plurality of second delays correspond one-to-one to a plurality of sampling periods, each second delay is acquired in its corresponding sampling period, and each sampling period corresponds to one delay-related index value;
  • the detecting unit is configured to complete the following process after the end of each detection cycle and before the start of the next detection cycle:
  • the sampling unit is further configured to: in each sampling period, after determining the first interval to which the first delay-related index value belongs, record as a first number the number of delay-related index values, among all those acquired in all sampling periods up to and including the current one, that fall within the first interval; determine whether the first number reaches the first threshold; if the first number reaches the first threshold, determine that the first interval is a full interval; and if the first number does not reach the first threshold, determine that the first interval is not a full interval and proceed to sampling in the next sampling period, where each sampling period corresponds to one delay-related index value.
  • the delay related indicator value is a utilization rate of the hard disk read and write data
  • the delay related index value is a read/write speed of the hard disk read and write data.
  • the first threshold is N
  • the first interval corresponds to N second delays
  • N is an integer greater than or equal to 1.
  • the sampling unit is further configured to, before calculating the ratio of the first delay to the interval average delay, calculate the average of the plurality of second delays among the N second delays to obtain the interval average delay.
  • the N second delays are arranged in sampling order, and the plurality of second delays are the first M second delays among the N second delays, where M is an integer, N/3 ≤ M ≤ 2N/3, and N/3 and 2N/3 are both taken as integers.
  • the plurality of second delays are the first N/2 second delays among the N second delays, where N/2 is taken as an integer.
  • An average value of the plurality of second delays calculated by the sampling unit is an arithmetic mean of the plurality of second delays or a geometric mean of the plurality of second delays;
  • the average value of the plurality of first ratios calculated by the detecting unit is an arithmetic mean of the plurality of first ratios or a geometric mean of the plurality of first ratios.
  • the device is applied in a scenario with multiple hard disks; the device detects the first hard disk, and the first hard disk is one of the plurality of hard disks;
  • the detecting unit is further configured to: acquire a plurality of first ratio averages in one-to-one correspondence with the hard disks other than the first hard disk among the plurality of hard disks; when the first ratio average corresponding to each of the plurality of hard disks is less than the third threshold, calculate the average of the plurality of first ratio averages corresponding to the plurality of hard disks to obtain a first average; calculate the ratio of the first ratio average of each of the plurality of hard disks to the first average to obtain a plurality of second ratios; and determine that a hard disk corresponding to a second ratio, among the plurality of second ratios, that is greater than or equal to a fourth threshold is a slow disk.
  • the first ratio average corresponding to each of the other hard disks is obtained in the same way as the first ratio average corresponding to the first hard disk.
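The multi-disk comparison described above can be sketched roughly as follows. The function name, the dictionary layout and the use of an arithmetic mean for the first average are assumptions for illustration; the handling of disks that already reach the third threshold follows the single-disk rule described elsewhere in the text.

```python
def find_slow_disks(first_ratio_averages, third_threshold, fourth_threshold):
    """Return the names of disks judged slow, sorted alphabetically.

    first_ratio_averages maps a disk name to its first ratio average for
    the detection period. All names here are hypothetical.
    """
    if not first_ratio_averages:
        return []
    # Single-disk rule: a first ratio average at or above the third
    # threshold already marks the disk as slow.
    over = [d for d, avg in first_ratio_averages.items() if avg >= third_threshold]
    if over:
        return sorted(over)
    # Every disk is below the third threshold: compare disks to each other.
    first_average = sum(first_ratio_averages.values()) / len(first_ratio_averages)
    second_ratios = {d: avg / first_average for d, avg in first_ratio_averages.items()}
    return sorted(d for d, r in second_ratios.items() if r >= fourth_threshold)
```

The cross-disk step catches a disk that has drifted relative to its peers even though its own ratio average has not yet crossed the third threshold.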
  • Embodiments of the present invention provide a method and apparatus for detecting a slow disk.
  • The method includes periodically performing sampling during a detection period and, in each sampling period: acquiring a first delay of the hard disk reading and writing data in the current sampling period and a first delay-related index value, where the first delay-related index value is a specific value of the delay-related index value, and the delay-related index value is a value that changes correspondingly as the delay changes; determining a first interval to which the first delay-related index value belongs, where the first interval is one of a plurality of intervals into which the maximum delay-related index value is divided in advance; and, if the first interval is a full interval, calculating the ratio of the first delay to the interval average delay to obtain a first ratio.
  • A full interval is an interval in which the number of delay-related index values, among all those acquired in all sampling periods, that fall within it reaches the first threshold.
  • The interval average delay is an average of a plurality of second delays in the first interval; the plurality of second delays correspond one-to-one to a plurality of sampling periods, each second delay is acquired in its corresponding sampling period, and each sampling period corresponds to one delay-related index value.
  • The delay-related index value changes correspondingly with the delay, that is, the delay is closely related to the delay-related index value. Therefore, by dividing the maximum delay-related index value into intervals and, in each interval, sampling the delays whose delay-related index values belong to that interval, a uniform measurement standard is ensured for the delays sampled within one interval, which improves the accuracy of detecting a slow disk.
  • The first ratio is calculated only after the first interval becomes a full interval, that is, after the number of delay-related index values, among all those acquired in all sampling periods, that fall within the first interval reaches the first threshold. The sampling process before the interval is full can be regarded as a learning process.
  • This ensures that a sufficient number of delay-related index values have been obtained in the first interval (that is, sampling in the first interval has gone on long enough) before the first ratio is calculated, thereby improving the accuracy of detecting a slow disk.
  • In each detection period, the first ratio average is calculated only when the number of delay-related index values, among all those acquired in all sampling periods of that detection period, that fall within full intervals is greater than or equal to the second threshold.
  • Because the first ratio average calculated in the embodiment of the present invention is an average of the proportions given by the plurality of first ratios, not an actual delay value, it can accurately represent the performance trend of the hard disk.
  • By setting the third threshold and comparing the first ratio average with it, the hard disk can be accurately detected as a slow disk when its performance changes, which further improves the accuracy of detecting a slow disk.
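Putting the steps above together, a per-disk detector might look roughly like the sketch below. It is a minimal illustration under stated assumptions, not the patented implementation: the class and method names are hypothetical, intervals are given as ascending upper bounds, the interval average uses the arithmetic mean of the first N/2 second delays, and the sample that fills an interval also produces a first ratio.

```python
class SlowDiskDetector:
    """Hypothetical per-disk detector following the described steps."""

    def __init__(self, upper_bounds, first_threshold, second_threshold, third_threshold):
        self.upper_bounds = upper_bounds          # ascending interval upper bounds
        self.first_threshold = first_threshold    # N samples make an interval full
        self.second_threshold = second_threshold  # min. first ratios per detection period
        self.third_threshold = third_threshold    # slow-disk limit on the ratio average
        self.second_delays = [[] for _ in upper_bounds]  # learning samples per interval
        self.first_ratios = []                    # first ratios of the current period

    def _interval_index(self, index_value):
        for i, upper in enumerate(self.upper_bounds):
            if index_value < upper:
                return i
        return len(self.upper_bounds) - 1         # the last interval is closed

    def sample(self, delay, index_value):
        """Process one sampling period; return the first ratio or None."""
        learned = self.second_delays[self._interval_index(index_value)]
        if len(learned) < self.first_threshold:
            learned.append(delay)                 # interval still learning
        if len(learned) < self.first_threshold:
            return None                           # not a full interval yet
        m = max(1, self.first_threshold // 2)
        interval_average = sum(learned[:m]) / m
        ratio = delay / interval_average
        self.first_ratios.append(ratio)
        return ratio

    def end_detection_period(self):
        """Return True if the disk is judged slow for the finished period."""
        ratios, self.first_ratios = self.first_ratios, []
        if len(ratios) < self.second_threshold:
            return False                          # too few samples in full intervals
        return sum(ratios) / len(ratios) >= self.third_threshold
```

A caller would invoke `sample` once per sampling period with the measured delay and delay-related index value, and `end_detection_period` once at the end of each detection period.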
  • FIG. 1 is a schematic structural diagram of a cloud storage system according to an embodiment of the present disclosure
  • FIG. 2 is a first flowchart of a method for detecting a slow disk according to an embodiment of the present invention
  • FIG. 3 is a second flowchart of a method for detecting a slow disk according to an embodiment of the present invention
  • FIG. 4 is a third flowchart of a method for detecting a slow disk according to an embodiment of the present invention
  • FIG. 5 is a fourth flowchart of a method for detecting a slow disk according to an embodiment of the present invention
  • FIG. 6 is a schematic structural diagram of an apparatus for detecting a slow disk according to an embodiment of the present invention.
  • FIG. 7 is a schematic diagram of hardware of a device for detecting a slow disk according to an embodiment of the present invention.
  • the method and apparatus for detecting a slow disk provided by the embodiment of the present invention can be applied to a hard disk detection scenario. Specifically, the method and apparatus for detecting a slow disk provided by the embodiment of the present invention can be applied to a hard disk detection scenario in a cloud storage system.
  • A cloud storage system is a concept extended and developed from the concept of a cloud computing system. It refers to a system that, through functions such as cluster applications, grid technology or distributed file systems, brings together a large number of storage devices of various types in a network via application software so that they jointly provide data storage and service access functions.
  • A cloud computing system needs to be configured with a large number of storage devices, such as hard disks, at which point it becomes a cloud storage system; a cloud storage system is therefore a cloud computing system with data storage and management at its core.
  • FIG. 1 is a schematic structural diagram of a cloud storage system according to an embodiment of the present invention. A cloud storage system includes a large number of storage devices, such as the various hard disks in FIG. 1, so these storage devices usually need to be maintained in order to improve the performance of the cloud storage system (such as storage performance and management performance). Take hard disks as an example. While in use, some hard disks may develop a large delay when performing read and write operations because of magnetic degradation, bad sectors, vibration or other environmental and mechanical problems. Therefore, to improve the storage efficiency of the cloud storage system, these hard disks must be checked in a timely manner to find any hard disk with a large read/write delay, that is, a slow disk. After a slow disk is detected, it can be isolated from the cloud storage system (for example, isolated in software or automatically ejected in hardware), which improves the storage capability of the cloud storage system.
  • The hard disk in the embodiments of the present invention may be a solid state drive (SSD), a hard disk drive (HDD), or a hybrid hard disk (HHD).
  • An SSD uses flash memory chips for storage.
  • An HDD uses magnetic platters for storage.
  • An HHD is a hard disk that integrates a magnetic hard disk with flash memory.
  • The method for detecting a slow disk provided by the embodiment of the present invention may be performed by an apparatus for detecting a slow disk. The apparatus may be a detecting node in the cloud storage system, and the detecting node may be an independent computer node.
  • The detecting node may also be a functional unit integrated in a computer node; the present invention does not specifically limit this.
  • In the following, the method is described with a detecting node as the executing entity.
  • Because the detection process is similar for each hard disk, the method for detecting a slow disk provided by the embodiment of the present invention is described below using one hard disk as an example.
  • an embodiment of the present invention provides a method for detecting a slow disk, and the method may include:
  • the detecting node periodically performs sampling, and in each sampling period, the detecting node executes S100-S102.
  • The detecting node obtains a first delay of the hard disk reading and writing data in the current sampling period and a first delay-related index value, where the first delay-related index value is a specific value of the delay-related index value.
  • The delay-related index value is a value that changes correspondingly as the delay changes.
  • The first delay-related index value is the delay-related index value obtained in the current sampling period.
  • The first delay mentioned above is the average delay of the hard disk reading and writing data over a period of time. Because this average delay is well known to those skilled in the art, it is not described in detail here.
  • The prefix "first" in "first delay" and "first delay-related index value" only denotes a specific delay or delay-related index value.
  • The prefixes "first", "second" and the like that appear in subsequent steps have similar meanings.
  • The delay-related index value is the value of an index related to the delay when the hard disk reads and writes data (that is, the average delay described above); in other words, the value of the index has a certain regular correspondence with the delay.
  • The delay-related index value may be the "utilization rate" of the hard disk read/write data or the "read/write speed" of the hard disk read/write data, either of which has a certain relationship with the delay of the hard disk reading and writing data. For example, when the delay is large, the corresponding utilization rate is high (or the corresponding read/write speed is low).
  • The "utilization rate" and the "read/write speed" of the hard disk read/write data are used as the delay-related index value only to illustrate the method for detecting a slow disk in the embodiment of the present invention.
  • They do not limit the embodiments of the present invention in any way; that is, the embodiments of the present invention do not preclude the delay-related index value from being another value that changes correspondingly with the delay.
  • For example, the first delay-related index value may be a value of the "utilization rate" of the hard disk read/write data (such as 20% or 40%); the higher the value, the busier the system and the greater the first delay.
  • The first delay and the first delay-related index value may be obtained in multiple ways, for example with tools provided by the operating system.
  • On the Linux operating system, the iostat tool can report data such as the utilization rate of the hard disk and the average delay of the hard disk reading and writing data over a period of time; custom tools can also be developed to obtain these data. How to use the tools that come with the system and how to develop tools for obtaining these data are known to those skilled in the art and are not described here.
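As one possible form of a custom tool, the sketch below derives an average delay and a utilization rate from two snapshots of a device line in Linux `/proc/diskstats`, the kernel counters that iostat itself reads. Using `/proc/diskstats` directly is an assumption of this example (the text only names iostat), and the function and key names are hypothetical.

```python
def parse_diskstats_line(line):
    """Extract the counters needed here from one /proc/diskstats line."""
    f = line.split()
    return {
        "name": f[2],
        "ios": int(f[3]) + int(f[7]),      # reads completed + writes completed
        "io_ms": int(f[6]) + int(f[10]),   # ms spent reading + ms spent writing
        "busy_ms": int(f[12]),             # ms during which I/O was in flight
    }

def delay_and_utilization(prev, curr, interval_ms):
    """Return (average delay in ms, utilization in 0..1) between snapshots."""
    ios = curr["ios"] - prev["ios"]
    avg_delay_ms = (curr["io_ms"] - prev["io_ms"]) / ios if ios else 0.0
    utilization = min(1.0, (curr["busy_ms"] - prev["busy_ms"]) / interval_ms)
    return avg_delay_ms, utilization
```

Taking one snapshot at the start and one at the end of a sampling period yields the first delay and, if utilization is chosen as the index, the first delay-related index value for that period.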
  • The detecting node determines a first interval to which the first delay-related index value belongs, where the first interval is one of a plurality of intervals into which the maximum delay-related index value is divided in advance.
  • The software developer may obtain the maximum delay-related index value of the hard disk read/write data in advance, divide it into multiple intervals, and write these intervals into the software program that is executed when the slow disk is detected.
  • The methods for obtaining the maximum delay-related index value may differ for different delay-related indexes. The following therefore takes the "utilization rate" of the hard disk read/write data and the "read/write speed" of the hard disk read/write data as the delay-related index in turn.
  • For each, a method for obtaining the maximum delay-related index value of the hard disk read/write data is described by way of example.
  • When the delay-related index is the "utilization rate" of the hard disk read/write data, the software developer can directly take the theoretical maximum "utilization rate" of the hard disk read/write data as the maximum delay-related index value.
  • For example, the maximum delay-related index value can be 100%.
  • Alternatively, the software developer can use the iostat tool to obtain the maximum delay-related index value.
  • In this case the maximum delay-related index value is set by the iostat tool according to the medium of the hard disk when the hard disk reads and writes data.
  • Depending on the medium of the hard disk, the iostat tool may set the maximum delay-related index value to an integral multiple of the maximum "utilization rate" of the hard disk read/write data (such as 200%).
  • When the delay-related index is the "read/write speed" of the hard disk read/write data, the maximum delay-related index value can be obtained in the following ways.
  • In the first way, the software developer obtains the value from development experience. For example, after understanding the design of the application system and the way the application performs I/O operations, the software developer can estimate a plausible value and use it as the maximum delay-related index value.
  • In the second way, a software developer without such development experience can run a read/write test on the hard disk and obtain the maximum delay-related index value from the test.
  • In the third way, the software developer can directly use the nominal value of the hard disk as the maximum delay-related index value. This value is usually provided by the hard disk manufacturer; for example, it can be seen among the parameters of the hard disk when the hard disk is purchased.
  • For example, given the "maximum continuous data transfer rate" of the hard disk (such as 210 M/s), the software developer can use it as the maximum delay-related index value.
  • The maximum delay-related index value obtained in the first way is the most accurate, the value obtained in the second way is the next most accurate, and the value obtained in the third way is the least accurate.
  • When the delay-related indicator value is the "utilization rate" of the hard disk's reads and writes, the maximum utilization rate can be divided into intervals. For example, if the maximum utilization rate is 100%, the range 0-100% can be divided at steps of 20% into five intervals: [0, 20%), [20%, 40%), [40%, 60%), [60%, 80%), and [80%, 100%].
  • When the delay-related indicator value is the "read/write speed" of the hard disk's reads and writes, the maximum read/write speed can be divided into intervals (the maximum read/write speed can be the maximum amount of data read and written per unit time). For example, if the maximum read/write speed is 50 M/s, the range 0-50 M/s can be divided at steps of 10 M/s into five intervals: [0, 10 M/s), [10 M/s, 20 M/s), [20 M/s, 30 M/s), [30 M/s, 40 M/s), and [40 M/s, 50 M/s].
  • the present invention includes but is not limited to the above-described partitioning method.
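The equal-width division described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names `divide_intervals` and `interval_index` are hypothetical, and the choice to make only the last interval closed follows the two examples in the text rather than any requirement of the method.

```python
def divide_intervals(max_value, step):
    """Split [0, max_value] into consecutive intervals of width `step`.

    Every interval is half-open [lo, hi) except the last one, which is
    closed so that the maximum value itself belongs to an interval.
    """
    if max_value <= 0 or step <= 0:
        raise ValueError("max_value and step must be positive")
    bounds = []
    lo = 0
    while lo < max_value:
        hi = min(lo + step, max_value)
        bounds.append((lo, hi))
        lo = hi
    return bounds


def interval_index(bounds, value):
    """Return the index of the interval that contains `value`."""
    last = len(bounds) - 1
    for i, (lo, hi) in enumerate(bounds):
        if lo <= value < hi or (i == last and value == hi):
            return i
    raise ValueError(f"{value} lies outside the divided range")


utilization = divide_intervals(100, 20)  # utilization rate in percent
speed = divide_intervals(50, 10)         # read/write speed in M/s
```

With these definitions, `interval_index(speed, 33)` selects the interval (30, 40), which corresponds to [30 M/s, 40 M/s).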
  • The number of intervals into which the maximum delay-related indicator value is divided may be set according to factors such as the accuracy of the maximum delay-related indicator value obtained in actual slow disk detection, the size of the first threshold (the maximum number of delay-related indicator values that must be acquired in each interval), and the required accuracy of slow disk detection; the present invention is not limited in this respect.
  • The number of intervals may be increased to improve the accuracy of slow disk detection; otherwise, fewer intervals may be used.
  • The first threshold may be set larger; however, if there are too many intervals, each interval takes longer to reach the first threshold, which lowers the sensitivity of slow disk detection, so this factor should be taken into account when choosing the number of intervals.
  • If the required accuracy of slow disk detection is high, more intervals may be used.
  • These factors can be balanced when dividing the intervals so that a suitable number of intervals is chosen, achieving a balance between the accuracy and the sensitivity of slow disk detection.
  • Take the case in which the first delay-related indicator value is a value of the "read/write speed" of the hard disk's reads and writes, here called the "first read/write speed". If the first read/write speed is 33 M/s, the first delay-related indicator value is 33 M/s, and the first interval to which it belongs is [30 M/s, 40 M/s).
  • After the current sampling, the detecting node needs to record how many of the delay-related indicator values obtained in all sampling periods (the current sampling period and all previous sampling periods), that is, all values obtained before the current sampling plus the first delay-related indicator value obtained this time, fall into the first interval; this count is the "first number".
  • The first number is the number of all delay-related indicator values that fall into the first interval. For example, suppose there have been 100 sampling periods in total and 100 delay-related indicator values have been obtained in them; if 80 of these values fall into the first interval, the first number is 80.
  • The first number is a parameter that accumulates continuously, so the process of recording the first number can also be regarded as the process of "updating the first number".
  • After the detecting node determines the first interval to which the first delay-related indicator value of the current sampling belongs, it "updates the first number"; for example, if the first number was 630, it adds 1, and the updated first number is 631.
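The bookkeeping for the "first number" can be sketched as a per-interval counter. The class name `IntervalCounter` and its methods are hypothetical, chosen only for illustration; the comparison against the first threshold assumes, as the text states elsewhere, that an interval counts as full once its count reaches that threshold.

```python
from collections import defaultdict


class IntervalCounter:
    """Tracks how many sampled indicator values fell into each interval."""

    def __init__(self, first_threshold):
        self.first_threshold = first_threshold
        # Maps an interval index to its running count (the "first number").
        self.counts = defaultdict(int)

    def record(self, interval):
        """Add the current sample to `interval` and return the updated count."""
        self.counts[interval] += 1
        return self.counts[interval]

    def is_full(self, interval):
        """An interval is full once its count reaches the first threshold."""
        return self.counts[interval] >= self.first_threshold
```

For instance, if the count for an interval is already 630, one more call to `record` for that interval returns 631.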
  • One interval, namely the first interval, is taken as an example for description; the detection process for the other intervals is similar to that for the first interval and is not described in detail in the embodiments of the present invention.
  • the detecting node calculates a ratio of the first delay to the interval average delay to obtain a first ratio.
  • A full interval in the embodiment of the present invention is an interval for which the number of delay-related indicator values, among all those acquired in all sampling periods, that fall within it has reached the first threshold. As described in S101, whenever the delay-related indicator value obtained in a sampling period falls within the first interval, the count of such values is recorded. In this way the detecting node records the number of all delay-related indicator values obtained in all sampling periods that fall into the first interval, that is, the first number, and can subsequently judge from the first number whether the number of delay-related indicator values obtained in the interval has reached the first threshold.
  • The second delay mentioned above is a statistic, namely the average delay of the hard disk's reads and writes over a period of time. Because computing such an average delay is well known to those skilled in the art, it is not described in detail here.
  • The first threshold in the embodiment of the present invention may be set according to actual detection requirements, for example according to the required accuracy of hard disk detection. The higher the required detection accuracy (the more data to be sampled), the larger the first threshold is set; the lower the required detection accuracy (the less data to be sampled), the smaller the first threshold is set. It can be adjusted to suit the actual usage scenario and other detection requirements, and the present invention imposes no limitation.
  • The detecting node needs to sample 1000 times in 1000 sampling periods, that is, obtain 1000 second delay-related indicator values and 1000 second delays. The plurality of second delays may be several of these 1000 second delays: they may be all 1000 second delays, or only some of them.
  • The interval average delay is the average of the plurality of second delays. For example, if the plurality of second delays are all 1000 acquired second delays, the interval average delay is the average of those 1000 second delays; if they are only some of the 1000 acquired second delays, the interval average delay is the average of that subset.
  • Which second delays are used may be chosen according to actual detection requirements, and the present invention imposes no limitation.
  • the average value of the multiple second delays may be an arithmetic mean of the multiple second delays, or may be a geometric average of the multiple second delays, which is not specifically limited in the present invention.
  • the arithmetic mean of the plurality of second delays may be an unweighted arithmetic mean or a weighted arithmetic mean;
  • the geometric mean of the plurality of second delays may be an unweighted geometric mean or a weighted geometric mean.
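The four averaging options named above (unweighted or weighted, arithmetic or geometric) can be written compactly. The sketch below is illustrative only: the function names are hypothetical, and the text does not say how weights would be chosen, so they are simply passed in by the caller.

```python
import math


def arithmetic_mean(values, weights=None):
    """Weighted arithmetic mean; equal weights give the unweighted mean."""
    if weights is None:
        weights = [1.0] * len(values)
    if not values or len(values) != len(weights):
        raise ValueError("need matching, non-empty values and weights")
    return sum(w * v for w, v in zip(weights, values)) / sum(weights)


def geometric_mean(values, weights=None):
    """Weighted geometric mean, computed through logarithms.

    All values must be positive, which holds for delays.
    """
    if weights is None:
        weights = [1.0] * len(values)
    if not values or len(values) != len(weights):
        raise ValueError("need matching, non-empty values and weights")
    if any(v <= 0 for v in values):
        raise ValueError("geometric mean requires positive values")
    log_sum = sum(w * math.log(v) for w, v in zip(weights, values))
    return math.exp(log_sum / sum(weights))
```

The geometric mean is less affected by a few very large delays than the arithmetic mean, which may be one reason to prefer it in some deployments.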
  • The detecting node can calculate the ratio of the first delay sampled in the current sampling period to the interval average delay to obtain a first ratio.
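As a small worked illustration of this step, the first ratio is the current delay divided by the interval average delay. The function name `first_ratio` is hypothetical, the arithmetic mean is assumed for the interval average, and the numbers in the test are made up for the example.

```python
def first_ratio(first_delay, second_delays):
    """Ratio of the currently sampled delay to the interval average delay.

    `second_delays` are the earlier delays chosen for the interval; their
    unweighted arithmetic mean is used here as the interval average delay.
    """
    if not second_delays:
        raise ValueError("at least one second delay is required")
    interval_average = sum(second_delays) / len(second_delays)
    return first_delay / interval_average
```

A ratio well above 1 means the current delay is high compared with what was learned for the same interval of the indicator value.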
  • Because one detection period includes a plurality of sampling periods, if the current detection period has not ended after the current sampling period, the detecting node returns to S100 and continues execution.
  • In the method for detecting a slow disk provided by the embodiment of the present invention, the detecting node samples periodically within each detection period and executes the above S100-S102 in each sampling period until the detection period ends.
  • The detecting node calculates the average of the plurality of first ratios calculated in multiple sampling periods to obtain a first ratio average; the multiple sampling periods in this step are the sampling periods in which the delay-related indicator values falling within the full intervals were acquired.
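The end-of-period decision can be sketched as follows. The document states elsewhere that a hard disk is judged slow when the first ratio average is greater than or equal to the third threshold; the function name `is_slow_disk` is hypothetical and the arithmetic mean is an assumed choice of average.

```python
def is_slow_disk(first_ratios, third_threshold):
    """Decide at the end of a detection period whether the disk is slow.

    `first_ratios` are the ratios collected in sampling periods whose
    indicator values fell into full intervals.
    """
    if not first_ratios:
        raise ValueError("no first ratios were collected in this period")
    first_ratio_average = sum(first_ratios) / len(first_ratios)
    return first_ratio_average >= third_threshold
```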
  • the above S102 may specifically be:
  • When N is a multiple of 3, N/3 and 2N/3 are integers and are used directly as the bounds on M; when N is not a multiple of 3, N/3 and 2N/3 are both non-integers, and their integer parts are used.
  • When N is not a multiple of 3, the integer part of N/3 plus 1 may also be used; correspondingly, the integer part of 2N/3 plus 1 may also be used.
  • For example, M can take the integer part of 100/3, which is 33, plus 1, that is, 34.
  • When N is even, N/2 is an integer and M is that integer; when N is odd, N/2 is not an integer and M is the integer part of N/2.
  • For example, when N is 100, N/2 is 50 and M can be 50, that is, the M second delays are the first 50 of the 100 second delays; when N is 121, N/2 is 60.5 and M can be 60, that is, the M second delays are the first 60 of the 121 second delays.
  • When N is an odd number, M may also take the integer part of N/2 plus 1.
  • For example, M can take the integer part of 121/2, which is 60, plus 1, that is, 61.
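The rounding rules for M can be checked with a short sketch. The function names `m_bounds` and `half_m` and the `round_up` flag are hypothetical; they only encode the two options the text gives (take the integer part, or take the integer part plus 1 when the quotient is not an integer).

```python
def m_bounds(n, round_up=False):
    """Integer bounds (low, high) for M under N/3 <= M <= 2N/3."""
    low, high = n // 3, (2 * n) // 3
    if round_up and n % 3 != 0:
        # Alternative rule: integer part plus 1 when N is not a multiple of 3.
        low, high = low + 1, high + 1
    return low, high


def half_m(n, round_up=False):
    """M when the first N/2 second delays are used."""
    m = n // 2
    if round_up and n % 2 != 0:
        # Alternative rule: integer part plus 1 when N is odd.
        m += 1
    return m
```

For N = 100 this gives bounds of 33 and 66 (or 34 and 67 with the alternative rule), and for N = 121 it gives M = 60 (or 61).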
  • The fewer second delays are selected, the less sample data is used for detection, the more scattered the sampling result, and the lower the accuracy of slow disk detection. Correspondingly, the fewer second delays are selected, the smaller the interval average delay, so a slow disk is detected more easily (possibly within relatively few detection periods); that is, the sensitivity of slow disk detection is higher.
  • the following takes the first N/2 second delays of the N second delays as an example to describe the specific selection of the multiple second delays.
  • Suppose 11 second delays are sampled before the first interval becomes full, and the second delays rise suddenly after a certain sampling; for example, the 11 second delays are 13 s, 14 s, 15 s, 17 s, 20 s, 21 s, 22 s, 24 s, 25 s, 28 s, and 30 s.
  • The average of the first five of the 11 second delays, taken as the interval average delay, is about 16 s; the average of all 11 second delays, taken as the interval average delay, is about 21 s.
  • Suppose the plurality of second delays are the first 5 of the 11 second delays, so that the interval average delay is about 16 s. Because the interval average delay is the denominator when the first ratio (first delay / interval average delay) is calculated, the smaller the interval average delay, the larger the first ratio and the larger the average of the multiple first ratios, and the easier it is to exceed the set third threshold; that is, the slow disk is detected more easily and the sensitivity of slow disk detection is higher. Correspondingly, since only the first half of the 11 second delays is used, the detection result may be inaccurate, so the accuracy of slow disk detection is lower.
  • Compared with calculating the interval average delay from all N second delays, calculating it from the first M of the N second delays corresponding to the first interval makes the ratio of the first delay to the interval average delay, that is, the first ratio, larger, and therefore makes the average of the plurality of first ratios larger. In this way, the sensitivity of slow disk detection is appropriately increased while its accuracy is preserved, balancing the accuracy and the sensitivity of slow disk detection.
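The effect of using only the first M second delays can be reproduced numerically with the 11 example delays given in the text. The helper name `interval_average` and the current delay of 32 are assumptions made only for this illustration.

```python
# The 11 example second delays from the text, in sampling order.
second_delays = [13, 14, 15, 17, 20, 21, 22, 24, 25, 28, 30]


def interval_average(delays, m=None):
    """Arithmetic mean of the first `m` delays, or of all delays if m is None."""
    chosen = delays if m is None else delays[:m]
    return sum(chosen) / len(chosen)


avg_first_five = interval_average(second_delays, 5)  # 79 / 5
avg_all = interval_average(second_delays)            # 229 / 11

first_delay = 32  # hypothetical delay sampled after the interval is full
ratio_first_five = first_delay / avg_first_five
ratio_all = first_delay / avg_all
```

Here `avg_first_five` is 15.8 (about 16) and `avg_all` is roughly 20.8 (about 21), so the same current delay yields a larger first ratio when only the first five delays are used, which is the sensitivity gain the text describes.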
  • An embodiment of the present invention provides a method for detecting a slow disk that is applied to a scenario with multiple hard disks; the detecting node executes the method for a first hard disk, which is one of the multiple hard disks.
  • an embodiment of the present invention further provides a method for detecting a slow disk, where the method includes:
  • The detecting node acquires a first ratio average corresponding to the first hard disk.
  • For the first hard disk, the detecting node performs the steps shown in FIG. 2 of the above embodiment other than S111 (including S100-S102 in S10 and S110 in S11); or the steps shown in FIG. 3 of the above embodiment other than S111 (including S100-S105 in S10, or S100-S101, S103-S104, and S106 in S10, and S110 in S11); or the steps shown in FIG. 4 of the above embodiment other than S111 (including S100-S101, S107, and S102a in S10, and S110 in S11), to obtain the first ratio average corresponding to the first hard disk.
  • the method for obtaining the first ratio average value corresponding to each of the other hard disks is the same as the method for acquiring the first ratio average value corresponding to the first hard disk. For details, refer to the method for obtaining the first ratio average value corresponding to the first hard disk, which is not described here.
  • By performing the steps described above for each of the plurality of hard disks, the detecting node obtains a plurality of first ratio averages in one-to-one correspondence with the hard disks. If the first ratio average corresponding to each hard disk is smaller than the third threshold, that is, the hard disks have been detected separately and no slow disk has been found, then, as shown in FIG. 5, the method for detecting a slow disk provided by the embodiment of the present invention may further include:
  • the detecting node calculates an average value of the plurality of first ratio average values corresponding to the plurality of hard disks one by one to obtain a first average value.
  • For the calculation of the arithmetic mean of the plurality of first ratio averages, refer to the calculation of the arithmetic mean of the plurality of second delays in the embodiment shown in FIG. 2.
  • For the calculation of the geometric mean, refer to the calculation of the geometric mean of the plurality of second delays in the embodiment shown in FIG. 2 above; details are not repeated here.
  • the detecting node determines, among the plurality of second ratios, that the hard disk corresponding to the second ratio greater than or equal to the fourth threshold is a slow disk.
  • the foregoing fourth threshold may be preset according to actual detection requirements, and is not specifically limited in the present invention.
  • The fourth threshold can be used to measure the fluctuation of each hard disk's average value relative to that of all the hard disks.
  • The smaller the fourth threshold, the smaller the fluctuation that is tolerated, so that even a slight fluctuation of a hard disk during detection may cause its second ratio to exceed the fourth threshold, which in turn can improve the accuracy and sensitivity of slow disk detection.
  • When the detecting node detects the multiple hard disks separately and no slow disk is found, the detecting node may also adopt the method shown in FIG. 5 described above. Detecting across multiple hard disks may find slow disks that single-disk detection misses, thereby improving the accuracy of slow disk detection.
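The cross-disk check can be sketched as below. It follows the document's definitions (each disk's second ratio is its first ratio average divided by the average over all disks, compared against the fourth threshold), but the function name `find_slow_disks`, the disk names, and the threshold values are hypothetical, and the arithmetic mean is an assumed choice for the first average.

```python
def find_slow_disks(first_ratio_averages, third_threshold, fourth_threshold):
    """Return the names of disks judged slow.

    `first_ratio_averages` maps a disk name to its first ratio average.
    Single-disk detection is applied first; the cross-disk comparison runs
    only when every disk is below the third threshold.
    """
    if not first_ratio_averages:
        raise ValueError("no disks to compare")
    slow = [d for d, v in first_ratio_averages.items() if v >= third_threshold]
    if slow:
        return slow
    values = list(first_ratio_averages.values())
    first_average = sum(values) / len(values)
    second_ratios = {d: v / first_average for d, v in first_ratio_averages.items()}
    return [d for d, r in second_ratios.items() if r >= fourth_threshold]


disks = {"disk_a": 1.0, "disk_b": 1.1, "disk_c": 1.9}  # made-up values
suspects = find_slow_disks(disks, third_threshold=2.0, fourth_threshold=1.3)
```

In this made-up example no disk reaches the third threshold on its own, yet `disk_c` stands out against its peers and is the only one returned.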
  • An embodiment of the present invention provides a method for detecting a slow disk that is applied to a scenario with multiple hard disks. When the first ratio average corresponding to each of the plurality of hard disks, as acquired by the detecting node in the foregoing embodiment, is less than the third threshold, that is, when no slow disk has been detected, the hard disks can further be checked against one another; that is, the embodiment of the present invention makes use of the fact that hard disks of the same type have similar parameters and performance characteristics.
  • the embodiment of the present invention provides a device for detecting a slow disk.
  • The device for detecting a slow disk may be a detecting node in a cloud storage system; the detecting node may be an independent computer node or a functional unit integrated in a computer node, which is not specifically limited in the present invention.
  • the apparatus for detecting a slow disk provided by the embodiment of the present invention may include a sampling unit 10 and a detecting unit 11;
  • the sampling unit 10 is configured to periodically perform sampling during the detection period, and complete the following process in each sampling period:
  • the delay-related indicator value is a value that changes correspondingly with the delay;
  • determining the first interval to which the first delay-related indicator value belongs, where the first interval is one of a plurality of intervals into which the maximum delay-related indicator value is divided in advance;
  • if the first interval is a full interval, calculating the ratio of the first delay to the interval average delay to obtain a first ratio, where a full interval is an interval for which the number of delay-related indicator values, among all those acquired in all sampling periods, that fall within it has reached the first threshold; the interval average delay is the average of a plurality of second delays in the first interval; the plurality of second delays correspond one-to-one to a plurality of sampling periods; each second delay is acquired in its corresponding sampling period; and each sampling period corresponds to one delay-related indicator value.
  • the detecting unit 11 is configured to complete the following process after the end of each detection cycle and before the start of the next detection cycle:
  • calculating the average of the plurality of first ratios calculated by the sampling unit in multiple sampling periods to obtain a first ratio average, where the multiple sampling periods are the sampling periods in which the delay-related indicator values falling within the full intervals were acquired; and if the first ratio average is greater than or equal to the third threshold, determining that the hard disk is a slow disk.
  • The sampling unit 10 is further configured to: in each sampling period, after determining the first interval to which the first delay-related indicator value belongs, record the number of delay-related indicator values, among all those acquired in all sampling periods up to and including the current sampling, that fall into the first interval, this number being the first number; determine whether the first number reaches the first threshold; if the first number reaches the first threshold, determine that the first interval is a full interval; and if the first number does not reach the first threshold, determine that the first interval is not a full interval and proceed to sampling in the next sampling period, where each sampling period corresponds to one delay-related indicator value.
  • the delay related indicator value is a utilization rate of the hard disk read and write data
  • the first threshold is N
  • the first interval corresponds to N second delays
  • N is an integer greater than or equal to 1.
  • The N second delays are arranged in sampling order, and the plurality of second delays are the first M of the N second delays, where M is an integer, N/3 ≤ M ≤ 2N/3, and N/3 and 2N/3 are taken as integers.
  • The plurality of second delays are the first N/2 of the N second delays, where N/2 is taken as an integer.
  • an average value of the multiple second delays calculated by the sampling unit 10 is an arithmetic mean of the multiple second delays or a geometric mean of the multiple second delays;
  • the average value of the plurality of first ratios calculated by the detecting unit 11 is an arithmetic mean of the plurality of first ratios or a geometric mean of the plurality of first ratios.
  • the device is applied to a scenario of multiple hard disks, where the device detects the first hard disk, and the first hard disk is one of the plurality of hard disks;
  • The detecting unit 11 is further configured to: acquire a plurality of first ratio averages in one-to-one correspondence with the hard disks other than the first hard disk among the plurality of hard disks; when the first ratio average corresponding to each of the plurality of hard disks is less than the third threshold, calculate the average of the plurality of first ratio averages, which correspond one-to-one to the plurality of hard disks, to obtain a first average; calculate the ratio of the first ratio average of each of the plurality of hard disks to the first average to obtain a plurality of second ratios; and determine that a hard disk whose second ratio is greater than or equal to the fourth threshold is a slow disk. The first ratio average corresponding to each of the other hard disks is obtained in the same way as the first ratio average corresponding to the first hard disk.
  • With the apparatus for detecting a slow disk provided by the embodiment of the present invention, when the device detects the plurality of hard disks separately and finds no slow disk (that is, the first ratio average corresponding to each of the plurality of hard disks is smaller than the third threshold), the device can also detect across the multiple hard disks, so that a slow disk missed by single-disk detection may be found, thereby improving the accuracy of slow disk detection.
  • An embodiment of the present invention provides a device for detecting a slow disk.
  • the delay related index value obtained by the device changes correspondingly with the delay, that is, the delay is closely related to the delay related index value.
  • The device calculates the first ratio only after the first interval has become a full interval (that is, after the number of delay-related indicator values, among all those acquired in all sampling periods, that fall within the first interval reaches the first threshold; the sampling before the interval is full can be regarded as a learning process). This ensures that the first ratio is calculated only after enough delay-related indicator values have been obtained in the first interval (that is, after the first interval has been sampled sufficiently), thereby improving the accuracy of slow disk detection.
  • The device calculates the first ratio average only when, in each detection period, the number of delay-related indicator values acquired in all sampling periods that fall within the full intervals is greater than or equal to the second threshold. This guarantees that the learning process has ended in most intervals, that is, most intervals have been sampled sufficiently before the first ratio average is calculated, which also improves the accuracy of slow disk detection.
  • The first ratio average calculated by the apparatus for detecting a slow disk in the embodiment of the present invention is an average of proportions obtained from the plurality of first ratios rather than an actual delay value; the first ratio average can accurately reflect the performance of the hard disk.
  • an embodiment of the present invention provides a device for detecting a slow disk.
  • The device for detecting a slow disk may be a detecting node in a cloud storage system; the detecting node may be an independent computer node or a functional unit integrated in a computer node, which is not specifically limited in the present invention.
  • the apparatus for detecting a slow disk may include a processor 20, a memory 21, a communication interface 22, and a system bus 23.
  • the processor 20, the memory 21, and the communication interface 22 are connected by the system bus 23 and complete communication with each other.
  • The processor 20 may be a central processing unit (CPU) or an application-specific integrated circuit (ASIC), or may be otherwise configured to implement the embodiments of the present invention.
  • the communication interface 22 may be a communication interface for the device that detects the slow disk to communicate with other devices.
  • The memory 21 may include volatile memory, such as random-access memory (RAM); the memory 21 may also include non-volatile memory, such as read-only memory (ROM), flash memory, an SSD, an HDD, or an HHD; the memory 21 may also include a combination of the above types of memory.
  • the processor 20 may perform the method according to any one of the methods of FIG. 2 to FIG. 5 by reading a program stored in the memory, which specifically includes:
  • the processor 20 is configured to periodically perform sampling during a detection period, and complete the following process in each sampling period:
  • determining the first interval to which the first delay-related indicator value belongs, where the first interval is one of a plurality of intervals into which the maximum delay-related indicator value is divided in advance;
  • if the first interval is a full interval, calculating the ratio of the first delay to the interval average delay to obtain a first ratio, where a full interval is an interval for which the number of delay-related indicator values, among all those acquired in all sampling periods, that fall within it has reached the first threshold; the interval average delay is the average of a plurality of second delays in the first interval; the plurality of second delays correspond one-to-one to a plurality of sampling periods; each second delay is acquired in its corresponding sampling period; and each sampling period corresponds to one delay-related indicator value.
  • the processor 20 is further configured to: after each detection period ends, before the start of the next detection period, complete the following process:
  • calculating the average of the plurality of first ratios calculated in multiple sampling periods to obtain a first ratio average, where the multiple sampling periods are the sampling periods in which the delay-related indicator values falling within the full intervals were acquired; and if the first ratio average is greater than or equal to the third threshold, determining that the hard disk is a slow disk.
  • The memory 21 is configured to store the software program with which the processor 20 performs the above process of detecting a slow disk, so that the processor 20 completes the process by executing the software program.
  • The processor 20 is further configured to: in each sampling period, after determining the first interval to which the first delay-related indicator value belongs, record the number of delay-related indicator values, among all those acquired in all sampling periods up to and including the current sampling, that fall into the first interval, this number being the first number; determine whether the first number reaches the first threshold; if the first number reaches the first threshold, determine that the first interval is a full interval; and if the first number does not reach the first threshold, determine that the first interval is not a full interval and proceed to sampling in the next sampling period, where each sampling period corresponds to one delay-related indicator value.
  • the delay related indicator value is a utilization rate of the hard disk read and write data
  • the delay related index value is a read/write speed of the hard disk read and write data.
  • the first threshold is N
  • the first interval corresponds to N second delays
  • N is an integer greater than or equal to 1.
  • The processor 20 is further configured to, before calculating the ratio of the first delay to the interval average delay, calculate the average of the plurality of second delays among the N second delays to obtain the interval average delay.
  • The N second delays are arranged in sampling order, and the plurality of second delays are the first M of the N second delays, where M is an integer, N/3 ≤ M ≤ 2N/3, and N/3 and 2N/3 are taken as integers.
  • The plurality of second delays are the first N/2 of the N second delays, where N/2 is taken as an integer.
  • the average value of the multiple second delays calculated by the processor 20 is an arithmetic mean of the multiple second delays or a geometric mean of the multiple second delays;
  • the average of the plurality of first ratios calculated by the processor 20 is an arithmetic mean of the plurality of first ratios or a geometric mean of the plurality of first ratios.
  • the device is applied to a scenario of multiple hard disks, where the device detects the first hard disk, and the first hard disk is one of the plurality of hard disks;
  • The processor 20 is further configured to: obtain a plurality of first ratio averages in one-to-one correspondence with the hard disks other than the first hard disk among the plurality of hard disks; when the first ratio average corresponding to each of the plurality of hard disks is less than the third threshold, calculate the average of the plurality of first ratio averages, which correspond one-to-one to the plurality of hard disks, to obtain a first average; calculate the ratio of the first ratio average of each of the plurality of hard disks to the first average to obtain a plurality of second ratios; and determine that a hard disk whose second ratio is greater than or equal to the fourth threshold is a slow disk. The first ratio average corresponding to each of the other hard disks is obtained in the same way as the first ratio average corresponding to the first hard disk.
  • With the apparatus for detecting a slow disk provided by the embodiment of the present invention, when the device detects the plurality of hard disks separately and finds no slow disk (that is, the first ratio average corresponding to each of the plurality of hard disks is smaller than the third threshold), the device can also detect across the multiple hard disks, so that a slow disk missed by single-disk detection may be found, thereby improving the accuracy of slow disk detection.
  • An embodiment of the present invention provides a device for detecting a slow disk.
  • the delay related index value obtained by the device changes correspondingly with the delay, that is, the delay is closely related to the delay related index value.
  • the device calculates the first ratio only after the first interval has become a full interval (that is, after the number of delay-related index values acquired in all sampling periods that fall within the first interval has reached the first threshold; the sampling before the interval is full can be regarded as a learning process), which ensures that the first ratio is calculated only after enough delay-related index values have been obtained in the first interval (that is, the first interval has been sampled enough times), thereby improving the accuracy of detecting slow disks.
  • the device calculates the first ratio average only when the number of delay-related index values acquired in all sampling periods of a detection period that fall within the full intervals is greater than or equal to the second threshold. This ensures that the learning process has ended in most of the intervals, that is, most of the intervals have been sampled enough times before the first ratio average is calculated, which also improves the accuracy of detecting slow disks.
  • the first ratio average calculated by the apparatus for detecting the slow disk in the embodiment of the present invention is a proportional average obtained from the plurality of first ratios rather than an actual delay value, so the first ratio average can accurately reflect the trend of the hard disk's performance.
  • the disclosed system, apparatus, and method may be implemented in other manners.
  • the device embodiments described above are merely illustrative.
  • the division of the modules or units is only a logical function division.
  • in an actual implementation there may be another division manner; for example, multiple units or components may be combined or integrated into another system, or some features may be ignored or not executed.
  • the mutual coupling or direct coupling or communication connection shown or discussed may be an indirect coupling or communication connection through some interface, device or unit, and may be in an electrical, mechanical or other form.
  • the units described as separate components may or may not be physically separated, and the components displayed as units may or may not be physical units, that is, may be located in one place, or may be distributed to multiple network units. Some or all of the units may be selected according to actual needs to achieve the purpose of the solution of the embodiment.
  • each functional unit in each embodiment of the present invention may be integrated into one processing unit, or each unit may exist physically separately, or two or more units may be integrated into one unit.
  • the above integrated unit can be implemented in the form of hardware or in the form of a software functional unit.
  • the integrated unit if implemented in the form of a software functional unit and sold or used as a standalone product, may be stored in a computer readable storage medium.
  • the technical solution of the present invention, in essence or in the part that contributes to the prior art, or all or part of the technical solution, may be embodied in the form of a software product stored in a storage medium, including instructions that cause a computer device (which can be a personal computer, a server, or a network device) or a processor to perform all or part of the steps of the methods of the various embodiments of the present invention.
  • the foregoing storage medium includes various media that can store program codes, such as a USB flash drive, a mobile hard disk, a ROM, a RAM, a magnetic disk, or an optical disk.

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Human Computer Interaction (AREA)
  • Quality & Reliability (AREA)
  • Computing Systems (AREA)
  • Computer Hardware Design (AREA)
  • Mathematical Physics (AREA)
  • Debugging And Monitoring (AREA)
  • Signal Processing For Digital Recording And Reproducing (AREA)

Abstract

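The present invention provides a method and an apparatus for detecting a slow disk, relates to the computer field, and can improve the accuracy of slow-disk detection. The method includes: sampling periodically within a detection period and, at each sampling: obtaining a first delay of the hard disk's data reads and writes at this sampling and a first delay-related index value; determining the first interval to which the first delay-related index value belongs; and, if the first interval is full, calculating the ratio of the first delay to the interval average delay, that is, a first ratio. After each detection period ends and before the next detection period begins: if the number of delay-related index values obtained in all sampling periods of this detection period that fall into the full intervals is greater than or equal to a second threshold, calculating the average of the first ratios calculated in the sampling periods in which those delay-related index values were obtained, that is, a first ratio average; and, if the first ratio average is greater than or equal to a third threshold, determining that the hard disk is a slow disk.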

Description

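Method and apparatus for detecting a slow disk
This application claims priority to Chinese Patent Application No. 201510466756.X, filed with the Chinese Patent Office on July 31, 2015 and entitled "Method and apparatus for detecting a slow disk", which is incorporated herein by reference in its entirety.
Technical Field
The present invention relates to the computer field, and in particular to a method and an apparatus for detecting a slow disk.
Background
While a hard disk in a storage system is in use, magnetic degradation, bad sectors, vibration, and other environmental and mechanical problems can increase the delay of its read and write operations, that is, its input/output (I/O) operations. A hard disk whose I/O delay has increased in this way is called a slow disk.
Usually, to reduce the impact of slow disks on the read/write performance of a storage system, the delay of each hard disk's I/O operations is monitored in real time while the storage system is running, in order to detect whether those hard disks are slow disks. Specifically, taking one hard disk as an example: the average delay of the disk's I/O operations is computed for each first period and compared with a preset time threshold, and if the average delay is greater than or equal to the time threshold, one threshold event is recorded. The number of threshold events that occur in each second period (the second period being longer than the first period) is then counted and compared with a preset count threshold, and if that number is greater than or equal to the count threshold, the hard disk is determined to be a slow disk.
However, to avoid false detections when a single large read or write raises the average delay, the time threshold is usually set fairly high, which can reduce the accuracy of slow-disk detection.
Summary
Embodiments of the present invention provide a method and an apparatus for detecting a slow disk, which can improve the accuracy of slow-disk detection.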
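According to a first aspect, an embodiment of the present invention provides a method for detecting a slow disk, the method including:
within a detection period, sampling periodically, and performing the following in each sampling period:
obtaining a first delay of the hard disk's data reads and writes in this sampling period and a first delay-related index value, where the first delay-related index value is a specific value of the delay-related index value, and the delay-related index value is a value that changes correspondingly as the delay changes;
determining the first interval to which the first delay-related index value belongs, where the first interval is one of multiple intervals obtained by dividing the maximum delay-related index value in advance;
if the first interval is a full interval, calculating the ratio of the first delay to the interval average delay to obtain a first ratio, where a full interval is an interval for which the number of delay-related index values obtained in all sampling periods that fall into it has reached a first threshold, the interval average delay is the average of multiple second delays in the first interval, the multiple second delays correspond one-to-one to a first plurality of sampling periods, each second delay is obtained in its corresponding sampling period, and each sampling period corresponds to one delay-related index value;
after each detection period ends and before the next detection period begins, performing the following:
if the number of delay-related index values obtained in all sampling periods of this detection period that fall into the full intervals is greater than or equal to a second threshold, calculating the average of the multiple first ratios calculated in a second plurality of sampling periods to obtain a first ratio average, where the second plurality of sampling periods are the sampling periods in which the delay-related index values falling into the full intervals were obtained;
if the first ratio average is greater than or equal to a third threshold, determining that the hard disk is a slow disk.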
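With reference to the first aspect, in a first possible implementation of the first aspect, in each sampling period, after the determining of the first interval to which the first delay-related index value belongs, the method further includes:
recording, as a first count, the number of delay-related index values obtained in all sampling periods that fall into the first interval after this sampling, where each sampling period corresponds to one delay-related index value;
determining whether the first count has reached the first threshold;
if the first count has reached the first threshold, determining that the first interval is a full interval;
if the first count has not reached the first threshold, determining that the first interval is not a full interval, and proceeding to the next sampling period.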
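With reference to the first aspect or the first possible implementation of the first aspect, in a second possible implementation of the first aspect,
the delay-related index value is the utilization of the hard disk when reading and writing data; or
the delay-related index value is the read/write speed of the hard disk when reading and writing data.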
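With reference to the first aspect or either of the first and second possible implementations of the first aspect, in a third possible implementation of the first aspect, the first threshold is N, the first interval corresponds to N second delays, N is an integer greater than or equal to 1, and before the calculating of the ratio of the first delay to the interval average delay, the method further includes:
calculating the average of the multiple second delays among the N second delays to obtain the interval average delay.
With reference to the third possible implementation of the first aspect, in a fourth possible implementation of the first aspect,
the N second delays are arranged in sampling order, the multiple second delays are the first M of the N second delays, M is an integer, N/3 ≤ M ≤ 2N/3, and N/3 and 2N/3 are both rounded to integers.
With reference to the fourth possible implementation of the first aspect, in a fifth possible implementation of the first aspect, M = N/2,
the multiple second delays are the first N/2 of the N second delays, and N/2 is rounded to an integer.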
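With reference to the first aspect or any one of the first to fifth possible implementations of the first aspect, in a sixth possible implementation of the first aspect,
the average of the multiple second delays is the arithmetic mean of the multiple second delays or the geometric mean of the multiple second delays;
the average of the multiple first ratios is the arithmetic mean of the multiple first ratios or the geometric mean of the multiple first ratios.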
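With reference to the first aspect or any one of the first to sixth possible implementations of the first aspect, in a seventh possible implementation of the first aspect, the method is applied to a scenario with multiple hard disks and is performed for a first hard disk, the first hard disk being one of the multiple hard disks; the method further includes:
obtaining multiple first ratio averages in one-to-one correspondence with the hard disks other than the first hard disk among the multiple hard disks, where the first ratio average corresponding to each of the other hard disks is obtained in the same way as the first ratio average corresponding to the first hard disk;
when the first ratio average corresponding to each of the multiple hard disks is less than the third threshold, the method further includes:
calculating the average of the multiple first ratio averages that correspond one-to-one to the multiple hard disks, to obtain a first average;
calculating the ratio of the first ratio average corresponding to each of the multiple hard disks to the first average, to obtain multiple second ratios;
determining that, among the multiple second ratios, a hard disk corresponding to a second ratio greater than or equal to a fourth threshold is a slow disk.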
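According to a second aspect, an embodiment of the present invention provides an apparatus for detecting a slow disk, the apparatus including:
a sampling unit, configured to sample periodically within a detection period and to complete the following process in each sampling period:
obtaining a first delay of the hard disk's data reads and writes in this sampling period and a first delay-related index value, where the first delay-related index value is a specific value of the delay-related index value, and the delay-related index value is a value that changes correspondingly as the delay changes;
determining the first interval to which the first delay-related index value belongs, where the first interval is one of multiple intervals obtained by dividing the maximum delay-related index value in advance;
if the first interval is a full interval, calculating the ratio of the first delay to the interval average delay to obtain a first ratio, where a full interval is an interval for which the number of delay-related index values obtained in all sampling periods that fall into it has reached a first threshold, the interval average delay is the average of multiple second delays in the first interval, the multiple second delays correspond one-to-one to a first plurality of sampling periods, each second delay is obtained in its corresponding sampling period, and each sampling period corresponds to one delay-related index value;
a detection unit, configured to complete the following process after each detection period ends and before the next detection period begins:
if the number of delay-related index values obtained in all sampling periods of this detection period that fall into the full intervals is greater than or equal to a second threshold, calculating the average of the multiple first ratios calculated by the sampling unit in a second plurality of sampling periods to obtain a first ratio average, where the second plurality of sampling periods are the sampling periods in which the delay-related index values falling into the full intervals were obtained; and, if the first ratio average is greater than or equal to a third threshold, determining that the hard disk is a slow disk.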
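With reference to the second aspect, in a first possible implementation of the second aspect,
the sampling unit is further configured to, in each sampling period and after determining the first interval to which the first delay-related index value belongs, record as a first count the number of delay-related index values obtained in all sampling periods that fall into the first interval after this sampling; determine whether the first count has reached the first threshold; if the first count has reached the first threshold, determine that the first interval is a full interval; and if the first count has not reached the first threshold, determine that the first interval is not a full interval and proceed to the next sampling period, where each sampling period corresponds to one delay-related index value.
With reference to the second aspect or the first possible implementation of the second aspect, in a second possible implementation of the second aspect,
the delay-related index value is the utilization of the hard disk when reading and writing data; or
the delay-related index value is the read/write speed of the hard disk when reading and writing data.
With reference to the second aspect or either of the first and second possible implementations of the second aspect, in a third possible implementation of the second aspect, the first threshold is N, the first interval corresponds to N second delays, and N is an integer greater than or equal to 1,
the sampling unit is further configured to, before calculating the ratio of the first delay to the interval average delay, calculate the average of the multiple second delays among the N second delays to obtain the interval average delay.
With reference to the third possible implementation of the second aspect, in a fourth possible implementation of the second aspect,
the N second delays are arranged in sampling order, the multiple second delays are the first M of the N second delays, M is an integer, N/3 ≤ M ≤ 2N/3, and N/3 and 2N/3 are both rounded to integers.
With reference to the fourth possible implementation of the second aspect, in a fifth possible implementation of the second aspect, M = N/2,
the multiple second delays are the first N/2 of the N second delays, and N/2 is rounded to an integer.
With reference to the second aspect or any one of the first to fifth possible implementations of the second aspect, in a sixth possible implementation of the second aspect,
the average of the multiple second delays calculated by the sampling unit is the arithmetic mean of the multiple second delays or the geometric mean of the multiple second delays;
the average of the multiple first ratios calculated by the detection unit is the arithmetic mean of the multiple first ratios or the geometric mean of the multiple first ratios.
With reference to the second aspect or any one of the first to sixth possible implementations of the second aspect, in a seventh possible implementation of the second aspect, the apparatus is applied to a scenario with multiple hard disks and performs detection for a first hard disk, the first hard disk being one of the multiple hard disks;
the detection unit is further configured to obtain multiple first ratio averages in one-to-one correspondence with the hard disks other than the first hard disk among the multiple hard disks; when the first ratio average corresponding to each of the multiple hard disks is less than the third threshold, calculate the average of the multiple first ratio averages that correspond one-to-one to the multiple hard disks, to obtain a first average; calculate the ratio of the first ratio average corresponding to each of the multiple hard disks to the first average, to obtain multiple second ratios; and determine that, among the multiple second ratios, a hard disk corresponding to a second ratio greater than or equal to a fourth threshold is a slow disk, where the first ratio average corresponding to each of the other hard disks is obtained in the same way as the first ratio average corresponding to the first hard disk.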
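Embodiments of the present invention provide a method and an apparatus for detecting a slow disk. The method includes sampling periodically within a detection period and, in each sampling period: obtaining a first delay of the hard disk's data reads and writes in this sampling period and a first delay-related index value, where the first delay-related index value is a specific value of the delay-related index value, and the delay-related index value is a value that changes correspondingly as the delay changes; determining the first interval to which the first delay-related index value belongs, where the first interval is one of multiple intervals obtained by dividing the maximum delay-related index value in advance; and, if the first interval is a full interval, calculating the ratio of the first delay to the interval average delay to obtain a first ratio, where a full interval is an interval for which the number of delay-related index values obtained in all sampling periods that fall into it has reached a first threshold, the interval average delay is the average of multiple second delays in the first interval, the multiple second delays correspond one-to-one to a first plurality of sampling periods, each second delay is obtained in its corresponding sampling period, and each sampling period corresponds to one delay-related index value. After each detection period ends and before the next detection period begins: if the number of delay-related index values obtained in all sampling periods of this detection period that fall into the full intervals is greater than or equal to a second threshold, the average of the multiple first ratios calculated in a second plurality of sampling periods is calculated to obtain a first ratio average, where the second plurality of sampling periods are the sampling periods in which the delay-related index values falling into the full intervals were obtained; and, if the first ratio average is greater than or equal to a third threshold, the hard disk is determined to be a slow disk.
Based on the above technical solution, in the method for detecting a slow disk provided by the embodiments of the present invention: first, because the delay-related index value changes correspondingly as the delay changes, that is, the delay and the delay-related index value are closely related, dividing the maximum delay-related index value into intervals and sampling, within each interval, the delays that correspond to the delay-related index values belonging to that interval ensures that the delays sampled within one interval share a uniform yardstick, which improves the accuracy of slow-disk detection. Second, the first ratio is calculated only after the first interval has become a full interval (that is, after the number of delay-related index values obtained in all sampling periods that fall into the first interval has reached the first threshold; the sampling before the interval is full can be regarded as a learning process), which ensures that enough delay-related index values have been obtained in the first interval (that is, the first interval has been sampled enough times) before the first ratio is calculated, and this also improves detection accuracy. Third, the first ratio average is calculated only when the number of delay-related index values obtained in all sampling periods of a detection period that fall into the full intervals is greater than or equal to the second threshold, which ensures that most intervals have finished learning, that is, have been sampled enough times, before the first ratio average is calculated, and this too improves detection accuracy. In addition, because the first ratio average is a proportional average obtained from multiple first ratios rather than an actual delay value, it can accurately reflect the trend of the hard disk's performance; by setting a third threshold and comparing the first ratio average with it, a slow disk can be detected accurately when the hard disk's performance changes, which further improves detection accuracy.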
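Brief Description of the Drawings
FIG. 1 is a schematic architecture diagram of a cloud storage system according to an embodiment of the present invention;
FIG. 2 is a first flowchart of a method for detecting a slow disk according to an embodiment of the present invention;
FIG. 3 is a second flowchart of a method for detecting a slow disk according to an embodiment of the present invention;
FIG. 4 is a third flowchart of a method for detecting a slow disk according to an embodiment of the present invention;
FIG. 5 is a fourth flowchart of a method for detecting a slow disk according to an embodiment of the present invention;
FIG. 6 is a schematic structural diagram of an apparatus for detecting a slow disk according to an embodiment of the present invention;
FIG. 7 is a schematic hardware diagram of an apparatus for detecting a slow disk according to an embodiment of the present invention.
Detailed Description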
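The method and apparatus for detecting a slow disk provided by the embodiments of the present invention can be applied to hard disk detection scenarios, and in particular to hard disk detection in a cloud storage system.
A cloud storage system is a new concept extended and developed from the concept of a cloud computing system. It refers to a system that, through functions such as cluster applications, grid technology, or distributed file systems, uses application software to bring together a large number of storage devices of different types in a network so that they work cooperatively and jointly provide data storage and service access to the outside. When the core of what a cloud computing system computes and processes is the storage and management of large amounts of data, the system must be configured with a large number of storage devices, such as hard disks, and it then becomes a cloud storage system. A cloud storage system is therefore a cloud computing system whose core is data storage and management.
FIG. 1 is a schematic architecture diagram of a cloud storage system according to an embodiment of the present invention. Because a cloud storage system includes a large number of storage devices, such as the various hard disks in FIG. 1, these storage devices usually need to be maintained in order to improve the performance of the cloud storage system (for example, its storage and management performance). Taking hard disks as the storage devices: while these disks are in use, magnetic degradation, bad sectors, vibration, or other environmental and mechanical problems may cause some of them to have large delays during read and write operations. To improve the storage efficiency of the cloud storage system, the disks therefore need to be checked in time so that disks with large read/write delays, that is, slow disks, are detected. Once a slow disk is detected, it can be isolated from the cloud storage system (for example, removed in software or automatically ejected in hardware), which improves the storage efficiency of the cloud storage system.
The hard disk in the embodiments of the present invention may be a solid state drive (SSD), a hard disk drive (HDD), a hybrid hard drive (HHD), or another type of hard disk; the present invention does not limit this. An SSD stores data in flash memory chips, an HDD stores data on magnetic platters, and an HHD is a hard disk that integrates a magnetic disk and flash memory.
The method and apparatus for detecting a slow disk provided by the embodiments of the present invention are described in detail below with reference to the drawings. The method may be executed by an apparatus for detecting a slow disk, which may be a detection node in the cloud storage system. The detection node may be an independent computer node or a functional unit integrated in a computer node; the present invention does not specifically limit this. In the following embodiments, to describe the method more clearly and completely, the detection node is used as the executing entity by way of example.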
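Embodiment 1
In a cloud storage system, the method for detecting a slow disk provided by the embodiments of the present invention follows a similar detection process for every hard disk, so this embodiment describes the method using one hard disk as an example.
As shown in FIG. 2, an embodiment of the present invention provides a method for detecting a slow disk, and the method may include:
S10. Within a detection period, the detection node samples periodically, and in each sampling period the detection node performs S100-S102.
S100. The detection node obtains a first delay of the hard disk's data reads and writes in this sampling period and a first delay-related index value, where the first delay-related index value is a specific value of the delay-related index value, and the delay-related index value is a value that changes correspondingly as the delay changes.
In this embodiment of the present invention, the first delay-related index value is the delay-related index value obtained in this sampling period.
The first delay (and the delay in general) is the average delay of the hard disk's data reads and writes over a period of time. Because this average delay is well known to those skilled in the art, it is not described in detail here.
In this embodiment of the present invention, the prefix "first" in "first delay" and "first delay-related index value" only denotes a particular delay or delay-related index value; prefixes such as "first" and "second" in later steps have a similar meaning.
The delay-related index value is the value of an index related to the delay of the hard disk's data reads and writes (that is, the average delay above); in other words, there is a regular correspondence between the value of that index and the delay. Specifically, the delay-related index value may be the "utilization" or the "read/write speed" of the hard disk when reading and writing data, or another index that has some correspondence with the read/write delay. For example, when the delay is large, the corresponding utilization is high (or the corresponding read/write speed is low). Utilization and read/write speed are used only to illustrate the method and do not limit the embodiments of the present invention; that is, the delay-related index value may also be any other value that changes correspondingly as the delay changes.
For convenience in the following description, the embodiments of the present invention do not strictly distinguish between "utilization" and "the value of the utilization", or between similar pairs (such as "index" and "the value of the index"); those skilled in the art will understand that they mean the same thing. For example, "the utilization is XX" can be understood as "the value of the utilization is XX".
In this embodiment of the present invention, the first delay-related index value may be a value of the hard disk's read/write "utilization" (such as 20% or 40%), and the higher this value, the larger the first delay. Alternatively, the first delay-related index value may be a value of the hard disk's "read/write speed" (such as 20 M/s or 50 M/s); the higher this value, the busier the system and the larger the first delay.
In this embodiment of the present invention, the first delay and the first delay-related index value can be obtained in several ways. For example, they can be obtained with tools that ship with the operating system: on Linux, the iostat tool can report the utilization of a hard disk and the average delay of its data reads and writes over a period of time. Custom tools can also be developed to obtain them. How to use the system's own tools and how to develop tools for obtaining these data are well known to those skilled in the art and are not described here.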
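S101. The detection node determines the first interval to which the first delay-related index value belongs, where the first interval is one of multiple intervals obtained by dividing the maximum delay-related index value in advance.
In the method provided by the embodiments of the present invention, a software developer may obtain the maximum delay-related index value of the hard disk's data reads and writes in advance, divide it into multiple intervals, and write these intervals into the software program that is executed when detecting slow disks.
Because different delay-related index values may be obtained in different ways, the following uses the "utilization" and the "read/write speed" of the hard disk as examples to describe how the maximum delay-related index value is obtained.
When the delay-related index value is the "utilization", the software developer can in general simply take the maximum delay-related index value to be the theoretical maximum of the utilization, for example 100%. In rare cases the developer may obtain the maximum delay-related index value through the iostat tool; in that case the maximum is set by the iostat tool according to the disk's medium while the disk reads and writes data. For example, depending on the medium, the iostat tool may set the maximum delay-related index value to an integer multiple of the theoretical maximum utilization (such as 200%).
When the delay-related index value is the "read/write speed", the maximum delay-related index value can be obtained in the following ways. In the first way, the software developer obtains it from development experience: after understanding the design of the application system and the way the application performs I/O operations, the developer can estimate a plausible value as the maximum. In the second way, a developer without such experience runs a read/write test on the hard disk and obtains the maximum from that test. In the third way, the developer directly uses the disk's nominal value, which is usually provided by the disk manufacturer; for example, the disk's specifications at purchase include a "maximum sustained data transfer rate" (such as 210 M/s), and the developer can use that rate as the maximum delay-related index value. Of the three ways, the first gives the most precise maximum delay-related index value, the second is next, and the third is the least precise.
For example, if the delay-related index value is the "utilization", the maximum utilization can be divided: assuming the maximum utilization is 100%, the range 0-100% can be divided at intervals of 20% into the five intervals [0, 20%), [20%, 40%), [40%, 60%), [60%, 80%), and [80%, 100%]. If the delay-related index value is the "read/write speed", the maximum read/write speed (in practice, the largest amount of data read or written per unit time) can be divided: assuming the maximum read/write speed is 50 M/s, the range 0-50 M/s can be divided at intervals of 10 M/s into the five intervals [0, 10 M/s), [10 M/s, 20 M/s), [20 M/s, 30 M/s), [30 M/s, 40 M/s), and [40 M/s, 50 M/s].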
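The interval lookup described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `interval_index`, the equal-width split, and the choice to pass utilization as a number from 0 to 100 are not taken from the source.

```python
def interval_index(value, max_value, num_intervals):
    """Return the 0-based index of the equal-width interval containing `value`.

    Intervals are half-open [lo, hi), except the last one, which is closed
    so that value == max_value still belongs to an interval.
    (Hypothetical helper; the source only describes the partitioning.)
    """
    if not 0 <= value <= max_value:
        raise ValueError("value must lie in [0, max_value]")
    width = max_value / num_intervals
    # min() folds value == max_value into the last (closed) interval.
    return min(int(value // width), num_intervals - 1)


# Read/write speed in M/s, maximum 50 M/s, five intervals of 10 M/s each.
speed_interval = interval_index(33, 50, 5)    # index 3, i.e. [30, 40)
# Utilization expressed as a percentage number, maximum 100, five intervals.
util_interval = interval_index(20, 100, 5)    # index 1, i.e. [20, 40)
```

Passing whole-number percentages avoids floating-point edge cases at interval boundaries that fractions such as 0.6 can cause.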
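It should be noted that the above description of dividing the maximum delay-related index value is only illustrative; the present invention includes but is not limited to the division methods described. The number of intervals can be set according to the precision of the obtained maximum delay-related index value, the size of the first threshold (which limits the maximum number of delay-related index values to be obtained for each interval), and the required detection accuracy; the present invention does not specifically limit it. First, when the obtained maximum delay-related index value is not very precise, more intervals can be used to improve detection accuracy; otherwise fewer intervals can be used. Second, the first threshold can be set larger to improve detection accuracy, but if there are too many intervals, each interval takes longer to reach the first threshold and detection sensitivity falls, so this factor argues for fewer intervals. Third, when high detection accuracy is required, more intervals can be used. In short, these three factors can be balanced when dividing intervals so that a suitable number of intervals is chosen and a balance is reached between detection accuracy and sensitivity.
After the detection node obtains the first delay-related index value, it needs to determine, among the pre-divided intervals, the first interval to which the first delay-related index value belongs. Taking the first delay-related index value to be a value of the "read/write speed", for example a "first read/write speed": if the first read/write speed is 33 M/s, the first delay-related index value is 33 M/s, and the first interval to which it belongs is [30 M/s, 40 M/s).
Further, after determining the first interval to which the first delay-related index value obtained this time belongs, the detection node needs to record the number of delay-related index values obtained in all sampling periods (this sampling period and all earlier ones), including those obtained before this sampling and the first delay-related index value obtained this time, that fall into the first interval. This number is called the "first count". In this embodiment of the present invention, only one delay-related index value is obtained in each sampling period; some of these values fall into the first interval and some do not, and the first count is the number of those that do. For example, if there are 100 sampling periods in total, yielding 100 delay-related index values, and 80 of those values fall into the first interval, the first count is 80.
It is easy to see that the first count is a parameter that keeps accumulating, so recording the first count can also be regarded as "updating the first count". In practice the first count can be recorded in many ways. A common one is to keep a variable and increment it (for example, by 1) whenever the delay-related index value obtained in a sampling period falls into the first interval, which can be written in a programming language as first_number=first_number+1, where first_number stands for the first count.
For example, suppose that before this sampling the detection node has recorded 630 delay-related index values falling into the first interval. In this sampling, after determining the first interval to which the sampled first delay-related index value belongs, the detection node "updates the first count", that is, adds 1 to 630, so the updated first count is 631.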
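It can be understood that the detection node's detection process is similar for every interval. This embodiment and the following embodiments therefore use one interval, the first interval, as an example; the detection process for other intervals is similar and is not described again.
S102. If the first interval is a full interval, the detection node calculates the ratio of the first delay to the interval average delay to obtain a first ratio.
A full interval in the embodiments of the present invention is an interval for which the number of delay-related index values obtained in all sampling periods that fall into it has reached the first threshold. As described in S101, whenever the delay-related index value obtained in a sampling period falls into the first interval, it is counted; in this way the detection node records the number of delay-related index values obtained in all sampling periods that fall into the first interval, that is, the first count, and can then use the first count to determine whether that number has reached the first threshold.
The interval average delay is the average of multiple second delays in the first interval. The multiple second delays correspond one-to-one to multiple sampling periods, and each second delay is obtained in its corresponding sampling period; these sampling periods all precede the point at which the number of delay-related index values falling into the first interval reaches the first threshold, that is, they are sampling periods before the first interval is full. Specifically, a detection period contains multiple sampling periods and the detection node performs S100 in every one of them, so before the first interval is full the detection node collects, in each sampling period, the delay-related index value for that period (here called a second delay-related index value) and the delay of the hard disk's data reads and writes (here called a second delay).
The second delay is the average delay of the hard disk's data reads and writes over a period of time. Because this average delay is well known to those skilled in the art, it is not described in detail here.
Optionally, the first threshold can be set according to actual detection requirements, for example according to the required detection accuracy. The higher the required accuracy (the more data must be sampled), the larger the first threshold; the lower the required accuracy (the less data must be sampled), the smaller the first threshold. It can be adjusted to the actual usage scenario and other detection requirements; the present invention does not limit it.
For example, suppose the first threshold is 1000, that is, the first interval needs 1000 samples. Before the first interval is full, the detection node samples 1000 times in 1000 sampling periods, obtaining 1000 second delay-related index values and 1000 second delays. The multiple second delays can be all 1000 second delays or only some of them.
The interval average delay is the average of the multiple second delays. For example, if the multiple second delays are all 1000 second delays, the interval average delay is the average of those 1000 second delays; if the multiple second delays are only some of the 1000 second delays, the interval average delay is the average of that subset. The multiple second delays can be chosen according to actual detection requirements; the present invention does not limit the choice.
Optionally, the average of the multiple second delays may be their arithmetic mean or their geometric mean; the present invention does not specifically limit this. The arithmetic mean may be unweighted or weighted, and the geometric mean may likewise be unweighted or weighted.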
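For example, taking the unweighted arithmetic mean and the unweighted geometric mean, suppose the average of 5 second delays is to be calculated and the 5 second delays are 10 seconds, 11 seconds, 12 seconds, 12 seconds, and 10 seconds. The arithmetic mean of the 5 second delays is (10+11+12+12+10)/5 = 11 (seconds), and the geometric mean of the 5 second delays is
$$\sqrt[5]{10 \times 11 \times 12 \times 12 \times 10} = \sqrt[5]{158400} \approx 10.96 \text{ (seconds)}.$$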
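Those skilled in the art will understand that the 5 second delays above are only an illustrative example of how the arithmetic mean and the geometric mean are calculated and do not limit the embodiments of the present invention. In practice, the multiple second delays are usually much more numerous; for example, 500 or 800 of the 1000 second delays above may be selected.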
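As a quick check of the two kinds of average, the five example delays can be fed to Python's standard `statistics` module (`geometric_mean` requires Python 3.8 or later). The variable names are illustrative only.

```python
from statistics import mean, geometric_mean

# The five example second delays, in seconds.
second_delays = [10, 11, 12, 12, 10]

arithmetic = mean(second_delays)            # (10+11+12+12+10)/5
geometric = geometric_mean(second_delays)   # fifth root of 158400
```

The arithmetic mean is exactly 11 seconds, and the geometric mean is slightly smaller, about 10.96 seconds; the geometric mean never exceeds the arithmetic mean for positive values.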
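In a sampling period, if the number of delay-related index values obtained so far that fall into the first interval has reached the first threshold, that is, the first interval is a full interval, the detection node calculates the ratio of the first delay sampled in this sampling period to the interval average delay, obtaining one first ratio.
It can be understood that, within a detection period, when the first interval to which the first delay-related index value sampled in the current sampling period belongs is a full interval (that is, the number of delay-related index values obtained in all sampling periods that fall into the first interval has reached the first threshold), the detection node calculates the ratio of the first delay sampled this time to the interval average delay to obtain the first ratio.
Further, because one detection period contains multiple sampling periods, if the current detection period has not ended after this sampling period, the detection node returns to S100 and continues.
In the method provided by the embodiments of the present invention, the detection node samples periodically in each detection period and performs S100-S102 in each sampling period until the detection period ends.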
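S11. After each detection period ends and before the next detection period begins, the detection node performs S110-S111.
S110. If the number of delay-related index values obtained in all sampling periods of this detection period that fall into the full intervals is greater than or equal to the second threshold, the detection node calculates the average of the multiple first ratios calculated in multiple sampling periods to obtain a first ratio average, where the multiple sampling periods in this step are the sampling periods in which the delay-related index values falling into the full intervals were obtained.
It should be noted that "multiple sampling periods" in this step does not mean the same as the occurrences of "multiple sampling periods" in the description of step S102. Those skilled in the art can tell the exact meaning in each step from the context, so for convenience the term is not strictly qualified here with "first" or "second". Likewise, later occurrences of "multiple sampling periods" are not qualified, and their meaning can be determined from the context.
In this embodiment of the present invention, several intervals may be full intervals within one detection period. In that case, some of the delay-related index values obtained in all sampling periods of the detection period fall into full intervals and the rest fall into intervals that are not yet full. After the detection period ends and before the next one begins, the detection node needs to determine whether the number of delay-related index values obtained in all sampling periods of this detection period that fall into the full intervals is greater than or equal to the second threshold, and only when it is does the node calculate the average of the multiple first ratios calculated in the multiple sampling periods to obtain the first ratio average. Here the multiple sampling periods are the sampling periods in which the delay-related index values falling into the full intervals were obtained. A count greater than or equal to the second threshold means that the learning process has ended for most intervals, that is, most intervals are already full, and that enough delay-related index values have been obtained; calculating the first ratio average at that point keeps detection accuracy high.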
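Suppose there are 6 intervals in total. After the previous detection period, 3 intervals are full (interval A, interval B, and interval C) and 3 are not (interval D, interval E, and interval F). In this detection period 30 delay-related index values are obtained, of which:
5 fall into interval A,
8 fall into interval B,
6 fall into interval C,
5 fall into interval D,
4 fall into interval E,
2 fall into interval F. Then:
the number of delay-related index values falling into the full intervals = the number falling into interval A + the number falling into interval B + the number falling into interval C = 5+8+6 = 19.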
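Optionally, the second threshold may be preset according to actual detection requirements; the present invention does not specifically limit it.
The second threshold limits the number of delay-related index values obtained by the end of each detection period that fall into the full intervals. The larger the second threshold, the more sampled data is required, the more the sampling results converge, and the more accurate the slow-disk detection based on them; but because more data must be sampled, more time is needed and sensitivity (which reflects how quickly detection happens) is lower. Conversely, the smaller the second threshold, the less sampled data is required, so accuracy is somewhat lower, but less time is needed and sensitivity is higher.
Preferably, to strike a better overall balance between detection accuracy and sensitivity, the second threshold can be set to half of the total number of delay-related index values obtained in one detection period (that is, half of the total number of samples in one detection period). For example, if a detection period is 5 minutes and a sampling period is 10 seconds, that is, one sample every 10 seconds with one delay-related index value per sample, then 30 delay-related index values are obtained in the detection period, and the second threshold can be set to half of 30, that is, 15.
In the example above, the number of delay-related index values obtained in all sampling periods of this detection period that fall into the full intervals is 19, which is greater than the second threshold of 15, so the detection node can calculate the first ratio average.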
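The gate applied at the end of a detection period can be sketched as follows, using the six-interval example. The function name, the dictionary layout, and the use of integer division for "one half" are assumptions made for illustration; the source only gives the counts and the thresholds.

```python
def samples_in_full_intervals(counts, full_intervals):
    """Count this period's samples that fell into intervals already full.

    counts: dict mapping interval name -> samples that fell into it this period.
    full_intervals: set of interval names that are full.
    (Hypothetical helper, not taken from the source.)
    """
    return sum(n for name, n in counts.items() if name in full_intervals)


# Samples per interval in one detection period (5 min at one sample per 10 s).
counts = {"A": 5, "B": 8, "C": 6, "D": 5, "E": 4, "F": 2}
total_samples = sum(counts.values())          # 30
second_threshold = total_samples // 2         # half of the samples: 15

in_full = samples_in_full_intervals(counts, {"A", "B", "C"})   # 5 + 8 + 6
may_compute_average = in_full >= second_threshold
```

With these numbers 19 of the 30 samples fall into full intervals, so the first ratio average may be computed for this period.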
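In this embodiment of the present invention, the detection node samples once in every sampling period, that is, obtains one delay-related index value, and if the first interval to which that value belongs is a full interval, it calculates the ratio of the first delay sampled this time to the interval average delay, that is, a first ratio. For an interval that is not full, the detection node does not calculate a first ratio. For a full interval, each time a delay-related index value falling into that interval is obtained (that is, each sample), one first ratio is calculated in the sampling period in which that value was obtained.
In the example above, if 5 of the delay-related index values obtained in all sampling periods of this detection period fall into interval A, then 5 first ratios are calculated, one in each of the 5 sampling periods in which those values were obtained;
likewise, if 8 of the delay-related index values obtained in all sampling periods of this detection period fall into interval B, then 8 first ratios are calculated, one in each of the 8 sampling periods in which those values were obtained;
and if 6 of the delay-related index values obtained in all sampling periods of this detection period fall into interval C, then 6 first ratios are calculated, one in each of the 6 sampling periods in which those values were obtained.
The multiple first ratios calculated in the multiple sampling periods are therefore the 5, 8, and 6 first ratios above, 19 first ratios in total. The detection node calculates the average of these 19 first ratios to obtain the first ratio average.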
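Optionally, the average of the multiple first ratios may be their arithmetic mean or their geometric mean; the present invention does not specifically limit this. The arithmetic mean may be unweighted or weighted, and the geometric mean may likewise be unweighted or weighted.
The arithmetic mean of the multiple first ratios is calculated in the same way as the arithmetic mean of the multiple second delays above, and the geometric mean of the multiple first ratios is calculated in the same way as the geometric mean of the multiple second delays above; the details are not repeated here.
S111. If the first ratio average is greater than or equal to the third threshold, the detection node determines that the hard disk is a slow disk.
When the first ratio average calculated by the detection node is greater than or equal to the third threshold, the detection node determines that the hard disk under detection is a slow disk.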
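The decision in S111 reduces to averaging the first ratios and comparing the result with the third threshold. The sketch below assumes unweighted means; the function name, the `use_geometric` flag, and the example threshold values are hypothetical, since the source leaves the third threshold to the implementer.

```python
from statistics import mean, geometric_mean


def is_slow_disk(first_ratios, third_threshold, use_geometric=False):
    """Return True if the first ratio average reaches the third threshold.

    first_ratios: first ratios computed during one detection period.
    use_geometric: pick the geometric mean instead of the arithmetic mean.
    (Illustrative helper; thresholds are chosen by the implementer.)
    """
    if use_geometric:
        first_ratio_average = geometric_mean(first_ratios)
    else:
        first_ratio_average = mean(first_ratios)
    return first_ratio_average >= third_threshold


# A disk whose delays are about twice the learned interval averages.
degraded = is_slow_disk([1.0, 2.0, 3.0], third_threshold=2.0)
# A disk whose delays stay close to the learned interval averages.
healthy = is_slow_disk([1.0, 1.1, 0.9], third_threshold=1.5)
```

Because each first ratio is relative to the average learned for its own interval, a single threshold can be applied regardless of how busy the disk was when each sample was taken.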
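The third threshold may be set according to actual detection requirements; the present invention does not specifically limit it. For example, if high detection accuracy is required, the third threshold can be set somewhat larger; the detection node may then need more detection periods to detect a slow disk, so accuracy is higher, but because more detection periods are needed, sensitivity is lower. If high sensitivity is required, the third threshold can be set somewhat smaller; the detection node may then detect a slow disk within fewer detection periods, so sensitivity is higher, but because fewer detection periods are used, accuracy is lower.
Further, after determining that the hard disk under detection is a slow disk, the detection node can report the result to the relevant processing module by writing a log, raising an alarm, or displaying it in an interface, so that the processing module can isolate the disk; for example, the processing module can remove the disk from the cloud storage system in software or eject it automatically in hardware.
In the method for detecting a slow disk provided by the embodiments of the present invention, the detection node samples periodically within a detection period and, in each sampling period, obtains a first delay of the hard disk's data reads and writes in this sampling period and a first delay-related index value; determines the first interval to which the first delay-related index value belongs; and, if the first interval is a full interval, calculates the ratio of the first delay to the interval average delay to obtain a first ratio. After each detection period ends and before the next detection period begins, if the number of delay-related index values obtained in all sampling periods of this detection period that fall into the full intervals is greater than or equal to the second threshold, the detection node calculates the average of the multiple first ratios calculated in multiple sampling periods to obtain a first ratio average, where the multiple sampling periods are the sampling periods in which the delay-related index values falling into the full intervals were obtained; and, if the first ratio average is greater than or equal to the third threshold, it determines that the hard disk is a slow disk.
Based on the above technical solution, in the method for detecting a slow disk provided by the embodiments of the present invention: first, because the delay-related index value changes correspondingly as the delay changes, that is, the delay and the delay-related index value are closely related, dividing the maximum delay-related index value into intervals and sampling, within each interval, the delays that correspond to the delay-related index values belonging to that interval ensures that the delays sampled within one interval share a uniform yardstick, which improves the accuracy of slow-disk detection. Second, the first ratio is calculated only after the first interval has become a full interval (that is, after the number of delay-related index values obtained in all sampling periods that fall into the first interval has reached the first threshold; the sampling before the interval is full can be regarded as a learning process), which ensures that enough delay-related index values have been obtained in the first interval (that is, the first interval has been sampled enough times) before the first ratio is calculated, and this also improves detection accuracy. Third, the first ratio average is calculated only when the number of delay-related index values obtained in all sampling periods of a detection period that fall into the full intervals is greater than or equal to the second threshold, which ensures that most intervals have finished learning, that is, have been sampled enough times, before the first ratio average is calculated, and this too improves detection accuracy. In addition, because the first ratio average is a proportional average obtained from multiple first ratios rather than an actual delay value, it can accurately reflect the trend of the hard disk's performance; by setting a third threshold and comparing the first ratio average with it, a slow disk can be detected accurately when the hard disk's performance changes, which further improves detection accuracy.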
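Embodiment 2
Based on Embodiment 1, this embodiment of the present invention provides a method for detecting a slow disk.
Optionally, with reference to FIG. 2 and as shown in FIG. 3, for S10 above, in each sampling period the detection node additionally performs S103-S105, or performs S103-S104 and S106. Specifically, after S101, that is, after the detection node determines the first interval to which the first delay-related index value belongs, the method may further include:
S103. The detection node records, as a first count, the number of delay-related index values obtained in all sampling periods that fall into the first interval after this sampling, where each sampling period corresponds to one delay-related index value.
For the detailed description and examples of the first count, that is, the number of delay-related index values obtained in all sampling periods that fall into the first interval after this sampling, refer to the description and examples in S101 of the foregoing embodiment; they are not repeated here.
S104. The detection node determines whether the first count has reached the first threshold.
Because the first threshold limits the maximum number of delay-related index values to be obtained for each interval (that is, the number of delay-related index values to be sampled during each interval's learning process), after each sampling the detection node can record the number of delay-related index values obtained in all sampling periods that fall into the first interval, compare this first count with the first threshold, and so determine whether the learning process of the first interval has ended, that is, whether the first interval has enough sampled data.
S105. If the first count has reached the first threshold, the detection node determines that the first interval is a full interval.
S106. If the first count has not reached the first threshold, the detection node determines that the first interval is not a full interval and proceeds to the next sampling period.
If the first count recorded by the detection node has reached the first threshold, the detection node determines that the first interval is a full interval; otherwise it determines that the first interval is not a full interval and proceeds to the next sampling period. For example, with a first threshold of 1000: if the recorded first count is 1000, the detection node determines that the first interval is a full interval; if the recorded first count is 800, the detection node determines that the first interval is not a full interval and proceeds to the next sampling period for the 801st sample.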
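Steps S103-S106 amount to a per-interval counter with a threshold check. The following is a minimal sketch; the class name `IntervalCounter` and its attribute names are assumptions made for illustration, not names used by the source.

```python
class IntervalCounter:
    """Tracks how many samples have fallen into one interval (hypothetical)."""

    def __init__(self, first_threshold):
        self.first_threshold = first_threshold
        self.first_count = 0   # the "first count" for this interval

    def record(self):
        """S103: count one more sample; S104-S106: report whether the interval is full."""
        self.first_count += 1
        return self.first_count >= self.first_threshold


first_interval = IntervalCounter(first_threshold=1000)

first_interval.first_count = 799          # pretend 799 samples were already recorded
full_at_800 = first_interval.record()     # 800th sample: still learning

first_interval.first_count = 999
full_at_1000 = first_interval.record()    # 1000th sample: interval becomes full
```

In a complete implementation one such counter would be kept for every interval, indexed by the interval the sample falls into.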
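In this embodiment of the present invention, in each sampling period of a detection period, after S101 the detection node performs S103-S105, then performs S102, and then returns to S100 (that is, it proceeds to the next sampling period after S102) until the detection period ends. Alternatively, after S101 the detection node performs S103-S104 and S106 and then returns to S100 (that is, it proceeds to the next sampling period after S106) until the detection period ends.
Optionally, with reference to FIG. 2 and as shown in FIG. 4, the first threshold in the foregoing embodiment is N and the first interval corresponds to N second delays. Before S102, in which the detection node calculates the ratio of the first delay to the interval average delay to obtain the first ratio if the first interval is a full interval, the method further includes:
S107. If the first interval is a full interval, the detection node calculates the average of the multiple second delays among the N second delays to obtain the interval average delay.
In this embodiment of the present invention, N may be any integer greater than or equal to 1.
S102 may specifically be:
S102a. The detection node calculates the ratio of the first delay to the interval average delay to obtain the first ratio.
In this embodiment of the present invention, when the first interval is a full interval, the number of second delays corresponding to the first interval equals the first threshold; in this embodiment that number is N. The multiple second delays may be all N second delays or only some of them, chosen according to actual detection requirements; the present invention does not limit the choice.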
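Optionally, the N second delays are arranged in sampling order, the multiple second delays are the first M of the N second delays, M is an integer, N/3 ≤ M ≤ 2N/3, and N/3 and 2N/3 are both rounded to integers.
When N is a multiple of 3, N/3 and 2N/3 are both integers and are used as they are; when N is not a multiple of 3, N/3 and 2N/3 are fractional and their integer parts are used.
For example, if N = 100, then 100/3 can be taken as 33 and 200/3 as 66, so the range of M is 33 ≤ M ≤ 66.
Of course, when N is not a multiple of 3, the integer part of N/3 plus 1 may be used instead, and likewise the integer part of 2N/3 plus 1. For example, M may be the integer part of 100/3, that is 33, plus 1, that is 34.
Preferably, M = N/2: the multiple second delays are the first N/2 of the N second delays, and N/2 is rounded to an integer.
When N is even, N/2 is an integer and M takes that value; when N is odd, N/2 is fractional and M takes its integer part.
For example, when N is 100, N/2 is 50 and M can be 50, that is, the M second delays are the first 50 of the 100 second delays; when N is 121, N/2 is 60.5 and M can be 60, that is, the M second delays are the first 60 of the 121 second delays.
Of course, when N is odd, M may instead be the integer part of N/2 plus 1; for example, M may be the integer part of 121/2, that is 60, plus 1, that is 61.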
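The rounding rules for M can be written with integer division. The helper names below are made up for illustration, and the sketch uses the round-down variant; the text also permits rounding up by one.

```python
def m_bounds(n):
    """Return (lower, upper) bounds for M, using the integer parts of N/3 and 2N/3."""
    return n // 3, (2 * n) // 3


def preferred_m(n):
    """Return the preferred M = N/2, taking the integer part when N is odd."""
    return n // 2


lower, upper = m_bounds(100)    # integer parts of 33.33 and 66.67
m_even = preferred_m(100)       # N even
m_odd = preferred_m(121)        # N odd: integer part of 60.5
```

For N = 100 this gives 33 ≤ M ≤ 66 with a preferred M of 50, and for N = 121 a preferred M of 60, matching the worked values in the text.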
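In this embodiment of the present invention, the number of second delays selected determines both the accuracy and the sensitivity of slow-disk detection. The more second delays are selected, the more sampled data is used, the more the sampling results converge, and the higher the detection accuracy; at the same time, selecting more second delays may make the interval average delay larger, so a slow disk is less easily detected (more detection periods may be needed), that is, sensitivity is lower. The fewer second delays are selected, the less sampled data is used, the more scattered the sampling results, and the lower the detection accuracy; at the same time, selecting fewer second delays may make the interval average delay smaller, so a slow disk is more easily detected (possibly within fewer detection periods), that is, sensitivity is higher.
The selection of the multiple second delays is described in detail below, taking the first N/2 of the N second delays as an example.
Suppose that in practice 11 second delays are sampled before the first interval is full and that they rise abruptly after a certain sample, for example 13 s, 14 s, 15 s, 17 s, 20 s, 21 s, 22 s, 24 s, 25 s, 28 s, and 30 s. The average of the first 5 of these 11 second delays, used as the interval average delay, is about 16 s; the average of all 11 second delays, used as the interval average delay, is about 21 s.
If the first half of the 11 second delays is taken (11/2 = 5.5, whose integer part is 5), that is, the first 5 second delays, their average, the interval average delay, is about 16 s. Because the interval average delay is the denominator when the first ratio (first delay / interval average delay) is calculated, a smaller interval average delay gives a larger first ratio and hence a larger average of the first ratios, which more easily exceeds the third threshold; a slow disk is therefore detected more easily and sensitivity is higher. On the other hand, because only the first half of the 11 second delays is used, the detection result may be less reliable, so accuracy is lower. If all 11 second delays are taken, their average, the interval average delay, is about 21 s. A larger interval average delay gives a smaller first ratio and hence a smaller average of the first ratios, which less easily exceeds the third threshold; a slow disk is therefore detected less easily and sensitivity is lower. On the other hand, because all 11 second delays are used, the detection result is likely to be more reliable, so accuracy is higher.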
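The effect of the selection on the first ratio can be reproduced with the 11 example delays. The first delay of 30 s used for the comparison is an assumed value chosen only to illustrate the denominator effect.

```python
from statistics import mean

# The 11 example second delays in sampling order, in seconds.
second_delays = [13, 14, 15, 17, 20, 21, 22, 24, 25, 28, 30]

m = len(second_delays) // 2                     # integer part of 11/2
avg_first_half = mean(second_delays[:m])        # 79 / 5 = 15.8, about 16 s
avg_all = mean(second_delays)                   # 229 / 11, about 21 s

# Hypothetical first delay sampled after the interval is full.
first_delay = 30
ratio_with_half = first_delay / avg_first_half  # smaller denominator
ratio_with_all = first_delay / avg_all          # larger denominator
```

The same first delay yields a larger first ratio against the first-half average than against the full average, which is why selecting fewer, earlier second delays makes detection more sensitive.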
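In the method provided by the embodiments of the present invention, all N second delays corresponding to the first interval may be selected, that is, the average of the N second delays is used as the interval average delay. Alternatively, only some of the N second delays may be selected. Because sampled data in practice often shows an abrupt rise, it is usually preferable, in order to balance detection accuracy and sensitivity, to select the first N/2 of the N second delays. The choice can be set according to actual detection requirements; the present invention does not limit it.
Compared with calculating the interval average delay from all N second delays, calculating it from the first M of the N second delays corresponding to the first interval makes the ratio of the first delay to the interval average delay, that is, the first ratio, larger, and therefore makes the average of the multiple first ratios larger. This moderately raises detection sensitivity while preserving detection accuracy, achieving a balance between the two.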
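Embodiment 3
Based on the foregoing embodiments, this embodiment of the present invention provides a method for detecting a slow disk that is applied to a scenario with multiple hard disks. The detection node performs the method for a first hard disk, which is one of the multiple hard disks.
As shown in FIG. 5, this embodiment of the present invention further provides a method for detecting a slow disk, the method including:
S201. The detection node obtains the first ratio average corresponding to the first hard disk.
Based on the descriptions of Embodiment 1 and Embodiment 2, the detection node obtains the first ratio average corresponding to the first hard disk by performing, for the first hard disk, the steps shown in FIG. 2 other than S111 (S100-S102 in S10 and S110 in S11); or the steps shown in FIG. 3 other than S111 (S100-S105 in S10, or S100-S101, S103-S104 and S106 in S10, together with S110 in S11); or the steps shown in FIG. 4 other than S111 (S100-S101, S107 and S102a in S10, together with S110 in S11).
S202. The detection node obtains multiple first ratio averages in one-to-one correspondence with the hard disks other than the first hard disk among the multiple hard disks.
The first ratio average corresponding to each of the other hard disks is obtained in the same way as the first ratio average corresponding to the first hard disk; refer to the description above, which is not repeated here.
In particular, the first ratio average corresponding to each of the multiple hard disks is calculated from the multiple first ratios corresponding to that hard disk. For the description of the first ratio, refer to the embodiment shown in FIG. 2; it is not repeated here.
By performing the steps described above for each of the multiple hard disks, the detection node obtains multiple first ratio averages in one-to-one correspondence with the multiple hard disks. If the first ratio average corresponding to each of these hard disks is less than the third threshold, that is, no slow disk is found when the disks are detected individually, then, as shown in FIG. 5, the method may further include: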
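S203. The detection node calculates the average of the multiple first ratio averages that correspond one-to-one to the multiple hard disks, to obtain a first average.
To ensure detection accuracy, detection across the multiple hard disks is also performed within the same interval. In this embodiment, for example, detection across the multiple hard disks is performed within the first interval.
For example, suppose there are 5 hard disks: hard disk A, hard disk B, hard disk C, hard disk D, and hard disk E. After obtaining the first ratio average TA of hard disk A, TB of hard disk B, TC of hard disk C, TD of hard disk D, and TE of hard disk E, the detection node calculates the average of TA, TB, TC, TD, and TE, that is, the first average.
The average of the multiple first ratio averages may be their arithmetic mean or their geometric mean; the present invention does not specifically limit this. The arithmetic mean may be unweighted or weighted, and the geometric mean may likewise be unweighted or weighted.
The arithmetic mean of the multiple first ratio averages is calculated in the same way as the arithmetic mean of the multiple second delays in the embodiment shown in FIG. 2, and the geometric mean of the multiple first ratio averages is calculated in the same way as the geometric mean of the multiple second delays in that embodiment; the details are not repeated here.
S204. The detection node calculates the ratio of the first ratio average corresponding to each of the multiple hard disks to the first average, to obtain multiple second ratios.
S205. The detection node determines that, among the multiple second ratios, a hard disk corresponding to a second ratio greater than or equal to the fourth threshold is a slow disk.
Optionally, the fourth threshold may be preset according to actual detection requirements; the present invention does not specifically limit it.
Because the multiple hard disks are homogeneous, their performance is close and the variation between disks is small, so the fourth threshold can be used to measure how far each disk deviates from the average of all disks. In this embodiment of the present invention, a smaller fourth threshold means that a smaller deviation of each disk from the average of all disks is tolerated, so that even a slight deviation during detection may push the second ratio above the fourth threshold, which improves both the accuracy and the sensitivity of slow-disk detection.
Preferably, to ensure detection accuracy: when a single disk is detected, the third threshold can be set somewhat larger so that a sudden fluctuation in the disk's performance does not produce an inaccurate result; when multiple hard disks are detected, the fourth threshold can be set somewhat smaller, because the variation between the disks is usually small.
In the method provided by the embodiments of the present invention, when the detection node detects the multiple hard disks individually and finds no slow disk (that is, the multiple first ratio averages that correspond one-to-one to these hard disks are all less than the third threshold), the detection node can further perform detection across the multiple hard disks using the method shown in FIG. 5, so that a slow disk missed by single-disk detection may be found, which improves the accuracy of slow-disk detection.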
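Steps S203-S205 can be sketched as below. The function name, the dictionary layout, the use of an unweighted arithmetic mean, and the numeric thresholds are all assumptions for illustration; the source leaves the averaging method and both thresholds to the implementer.

```python
from statistics import mean


def cross_disk_slow_disks(first_ratio_averages, third_threshold, fourth_threshold):
    """Return the disks flagged by inter-disk comparison (hypothetical helper).

    first_ratio_averages: dict mapping disk name -> its first ratio average.
    The comparison only applies when single-disk detection found nothing,
    i.e. every first ratio average is below the third threshold.
    """
    values = list(first_ratio_averages.values())
    if any(v >= third_threshold for v in values):
        return []  # single-disk detection already fires; not handled here
    first_average = mean(values)                               # S203
    second_ratios = {disk: v / first_average                   # S204
                     for disk, v in first_ratio_averages.items()}
    return [disk for disk, r in second_ratios.items()          # S205
            if r >= fourth_threshold]


# Five homogeneous disks; E drifts but stays under the third threshold.
averages = {"A": 1.0, "B": 1.0, "C": 1.0, "D": 1.0, "E": 1.6}
flagged = cross_disk_slow_disks(averages, third_threshold=2.0, fourth_threshold=1.3)
```

Here the first average is 1.12, so disk E has a second ratio of about 1.43 and is flagged, while the other disks sit near 0.89.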
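Further, after determining that one of the multiple hard disks is a slow disk, the detection node can report the result to the relevant processing module by writing a log, raising an alarm, or displaying it in an interface, so that the processing module can isolate the disk; for example, the processing module can remove the disk from the cloud storage system in software or eject it automatically in hardware.
This embodiment of the present invention provides a method for detecting a slow disk that is applied to a scenario with multiple hard disks. When the first ratio average obtained by the detection node for each of the multiple hard disks is less than the third threshold, that is, when no slow disk is found by detecting the disks individually, inter-disk detection can additionally be performed on the multiple hard disks. The embodiment exploits the fact that homogeneous disks have similar parameters and performance, and checks the ratio of each disk's first ratio average to the first average (the average of the multiple first ratio averages that correspond one-to-one to the multiple hard disks), that is, the second ratio. A disk whose first ratio average deviates even slightly from the first average may thus be detected, so a slow disk can be found even when individual detection finds none, which improves the accuracy of slow-disk detection.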
实施例四
如图6所示,本发明实施例提供一种检测慢盘的装置,该检测慢盘的装置可以为云存储系统中的一个检测节点,该检测节点可以为一个独立的计算机节点,也可以为一个集成在计算机节点中的功能单元等,本发明不作具体限定。
具体的,本发明实施例提供的检测慢盘的装置可以包括采样单元10和检测单元11;其中,
所述采样单元10,用于在检测周期内,周期性地进行采样,且在每次采样周期内,完成如下过程:
获取本次采样周期内硬盘读写数据的第一延时以及第一延时相关指标值,其中,所述第一延时相关指标值为延时相关指标值的一个具体值,所述延时相关指标值为一个会随延时变化而发生相应变化的一个值;
确定所述第一延时相关指标值所属的第一区间;其中,所述第一区间为预先针对最大延时相关指标值划分的多个区间中的一个;
若所述第一区间是已满区间,则计算所述第一延时与区间平均延时的比值,得到第一比值;其中,所述已满区间是在所有采样周期内获取到的所有延时相关指标值落入到该区间的个数达到第一阈值的区间,所述区间平均延时为所述第一区间中的多个第二延时的平均值,所述多个第二延时一一对应于多个采样周期,每个第二延时在与之对应的采样周期内被获取,其中,每个采样周期对应于一个延时相关指标值。
检测单元11,用于在每次检测周期结束后,下一个检测周期开始前,完成如下过程:
若在本次检测周期内的所有采样周期内获取到的所有延时相关指标值落入到各个已满区间的个数大于或等于第二阈值,则计算所述采样单元在多个采样周期内计算的多个第一比值的平均值,得到第一比值平均值,所述多个采样周期为获取到落入各个已满区间的多个延时相关指标值的采样周期;以及若所述第一比值平均值大于或等于第三阈值,则确定所述硬盘为慢盘。
可选的,所述采样单元10,还用于在每次采样周期内,确定所述第一延时相关指标值所属的第一区间之后,记录经过本次采样后,在所有采样周期内获取到的所有延时相关指标值落入到所述第一区间的个数为第一个数;并判断所述第一个数是否达到所述第一阈值;以及若所述第一个数达到所述第一阈值,则确定所述第一区间是已满区间;若所述第一个数没有达到所述第一阈值,则确定所述第一区间不是已满区间,并进入下次采样周期采样,其中,每个采样周期对应于一个延时相关指标值。
可选的,所述延时相关指标值为所述硬盘读写数据的利用率;或者,
所述延时相关指标值为所述硬盘读写数据的读写速度。
可选的,所述第一阈值为N,所述第一区间对应N个第二延时,N为大于或等于1的整数,
所述采样单元10,还用于在计算所述第一延时与区间平均延时的比值之前,计算所述N个第二延时中的所述多个第二延时的平均值,得到所述区间平均延时。
可选的,所述N个第二延时按照采样顺序依次排列,所述多个第二延时为所述N个第二延时中的前M个第二延时,M为整数,N/3≤M≤2N/3,且N/3和2N/3均取整数。
可选的,M=N/2,
所述多个第二延时为所述N个第二延时中的前N/2个第二延时,且N/2取整数。
可选的,所述采样单元10计算的所述多个第二延时的平均值为所述多个第二延时的算术平均值或者所述多个第二延时的几何平均值;
所述检测单元11计算的所述多个第一比值的平均值为所述多个第一比值的算术平均值或者所述多个第一比值的几何平均值。
可选的,所述装置应用于多个硬盘的场景,所述装置针对第一硬盘进行检测,所述第一硬盘为所述多个硬盘中的其中一个硬盘;
所述检测单元11,还用于获取与所述多个硬盘中除所述第一硬盘外的其他硬盘一一对应的多个第一比值平均值;并当与所述多个硬盘中的每个硬盘对应的第一比值平均值均小于所述第三阈值时,计算与所述多个硬盘一一对应的多个第一比值平均值的平均值,得到第一平均值;且计算与所述多个硬盘中的每个硬盘对应的第一比值平均值与所述第一平均值的比值,得到多个第二比值;以及确定所述多个第二比值中,与大于或等于第四阈值的第二比值对应的硬盘为慢盘,其中,所述其他硬盘中的每个硬盘对应的第一比值平均 值的获取方法与所述第一硬盘对应的第一比值平均值的获取方法相同。
本发明实施例提供的检测慢盘的装置,当该装置对多个硬盘分别进行检测未检测到有慢盘(即与这多个硬盘中的每个硬盘对应的第一比值平均值均小于第三阈值)时,该装置还可以在多个硬盘之间进行检测,从而可能会检测出单盘检测未检测出的慢盘,进而能够提高检测慢盘的准确度。
本发明实施例提供一种检测慢盘的装置,首先,由于该装置获取的延时相关指标值会随延时的变化而发生相应的变化,即延时与延时相关指标值密切相关,因此通过将最大延时相关指标值划分区间,并在每个区间内采样与属于该区间的延时相关指标值对应的延时,可以保证一个区间内采样的延时有统一的衡量标准,从而提高检测慢盘的准确度。其次,该装置在第一区间是已满区间(即在所有采样周期内获取到的所有延时相关指标值落入到第一区间的个数达到第一阈值)后才计算第一比值(在没满之前的采样过程可认为是学习过程),可以保证在第一区间获取足够多个延时相关指标值(即在第一区间采样足够多次)后再计算第一比值,从而能够提高检测慢盘的准确度。其次,该装置在每次检测周期内的所有采样周期内获取到的所有延时相关指标值落入到各个已满区间的个数大于或等于第二阈值时再计算第一比值平均值,可以保证在大部分区间已经结束学习过程,即大部分区间已经采样了足够多次后再计算第一比值平均值,也能够提高检测慢盘的准确度。此外,由于本发明实施例的检测慢盘的装置计算的第一比值平均值为由多个第一比值得到的一个比例平均值,其并不是实际的延时数值,因此该第一比值平均值可以准确地体现硬盘的性能变化趋势,通过设置第三阈值,以及将第一比值平均值与第三阈值进行比较,可以在硬盘性能发生变化时准确地检测出硬盘是慢盘,从而进一步提高检测慢盘的准确度。
实施例五
如图7所示,本发明实施例提供一种检测慢盘的装置,所述检测慢盘的装置可以为云存储系统中的一个检测节点,该检测节点可以为一个独立的计算机节点,也可以为一个集成在计算机节点中的功能单元等,本发明不作具体限定。
具体的,本发明实施例提供的所述检测慢盘的装置可以包括处理器20、存储器21、通信接口22,以及系统总线23。所述处理器20、存储器21以及通信接口22之间通过所述系统总线23连接并完成相互间的通信。
所述处理器20可以是一个中央处理器(英文:central processing unit,缩写:CPU),或者是特定集成电路(英文:application specific integrated circuit,缩写:ASIC),或者是被配置成实施本发明实施例的一个或多个集成电路。
所述通信接口22可以为所述检测慢盘的装置与其他设备进行通信的通信接口。
所述存储器21可以包括易失性存储器(英文:volatile memory),例如随机存取存储器(英文:random-access memory,缩写:RAM);所述存储器21也可以包括非易失性存储器(英文:non-volatile memory),例如只读存储器(英文:read-only memory,缩写:ROM),快闪存储器(英文:flash memory),SSD、HDD或HHD;所述存储器21还可以包括上述种类的存储器的组合。
当本发明实施例提供的所述检测慢盘的装置运行时,所述处理器20可以通过读取存储在存储器的程序来执行图2~图5任意之一所述的方法流程,具体包括:
所述处理器20,用于在检测周期内,周期性地进行采样,且在每次采样周期内,完成如下过程:
获取本次采样周期内硬盘读写数据的第一延时以及第一延时相关指标值,其中,所述第一延时相关指标值为延时相关指标值的一个具体值,所述延时相关指标值为一个会随延时变化而发生相应变 化的一个值;
确定所述第一延时相关指标值所属的第一区间;其中,所述第一区间为预先针对最大延时相关指标值划分的多个区间中的一个;
若所述第一区间是已满区间,则计算所述第一延时与区间平均延时的比值,得到第一比值;其中,所述已满区间是在所有采样周期内获取到的所有延时相关指标值落入到该区间的个数达到第一阈值的区间,所述区间平均延时为所述第一区间中的多个第二延时的平均值,所述多个第二延时一一对应于多个采样周期,每个第二延时在与之对应的采样周期内被获取,其中,每个采样周期对应于一个延时相关指标值。
所述处理器20,还用于在每次检测周期结束后,下一个检测周期开始前,完成如下过程:
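if the number of delay-related index values obtained in all sampling periods of this detection period that fall into the full intervals is greater than or equal to the second threshold, calculating the average of the multiple first ratios calculated by the processor 20 in multiple sampling periods to obtain a first ratio average, where the multiple sampling periods are the sampling periods in which the delay-related index values falling into the full intervals were obtained; and, if the first ratio average is greater than or equal to the third threshold, determining that the hard disk is a slow disk.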
所述存储器21,用于存储所述处理器20执行上述检测慢盘的过程的软件程序,从而所述处理器20通过执行所述软件程序,完成上述检测慢盘的过程。
可选的,所述处理器20,还用于在每次采样周期内,确定所述第一延时相关指标值所属的第一区间之后,记录经过本次采样后,在所有采样周期内获取到的所有延时相关指标值落入到所述第一区间的个数为第一个数;并判断所述第一个数是否达到所述第一阈值;以及若所述第一个数达到所述第一阈值,则确定所述第一区间是已满区间;若所述第一个数没有达到所述第一阈值,则确定所述第一区间不是已满区间,并进入下次采样周期采样,其中,每个采样周期对应于一个延时相关指标值。
可选的,所述延时相关指标值为所述硬盘读写数据的利用率;或者,
所述延时相关指标值为所述硬盘读写数据的读写速度。
可选的,所述第一阈值为N,所述第一区间对应N个第二延时,N为大于或等于1的整数,
所述处理器20,还用于在计算所述第一延时与区间平均延时的比值之前,计算所述N个第二延时中的所述多个第二延时的平均值,得到所述区间平均延时。
可选的,所述N个第二延时按照采样顺序依次排列,所述多个第二延时为所述N个第二延时中的前M个第二延时,M为整数,N/3≤M≤2N/3,且N/3和2N/3均取整数。
可选的,M=N/2,
所述多个第二延时为所述N个第二延时中的前N/2个第二延时,且N/2取整数。
可选的,所述处理器20计算的所述多个第二延时的平均值为所述多个第二延时的算术平均值或者所述多个第二延时的几何平均值;
所述处理器20计算的所述多个第一比值的平均值为所述多个第一比值的算术平均值或者所述多个第一比值的几何平均值。
可选的,所述装置应用于多个硬盘的场景,所述装置针对第一硬盘进行检测,所述第一硬盘为所述多个硬盘中的其中一个硬盘;
所述处理器20,还用于获取与所述多个硬盘中除所述第一硬盘外的其他硬盘一一对应的多个第一比值平均值;并当与所述多个硬盘中的每个硬盘对应的第一比值平均值均小于所述第三阈值时,计算与所述多个硬盘一一对应的多个第一比值平均值的平均值,得到第一平均值;且计算与所述多个硬盘中的每个硬盘对应的第一比值平均值与所述第一平均值的比值,得到多个第二比值;以及确定所述多个第二比值中,与大于或等于第四阈值的第二比值对应的硬盘为慢盘,其中,所述其他硬盘中的每个硬盘对应的第一比值平均值 的获取方法与所述第一硬盘对应的第一比值平均值的获取方法相同。
本发明实施例提供的检测慢盘的装置,当该装置对多个硬盘分别进行检测未检测到有慢盘(即与这多个硬盘中的每个硬盘对应的第一比值平均值均小于第三阈值)时,该装置还可以在多个硬盘之间进行检测,从而可能会检测出单盘检测未检测出的慢盘,进而能够提高检测慢盘的准确度。
本发明实施例提供一种检测慢盘的装置,首先,由于该装置获取的延时相关指标值会随延时的变化而发生相应的变化,即延时与延时相关指标值密切相关,因此通过将最大延时相关指标值划分区间,并在每个区间内采样与属于该区间的延时相关指标值对应的延时,可以保证一个区间内采样的延时有统一的衡量标准,从而提高检测慢盘的准确度。其次,该装置在第一区间是已满区间(即在所有采样周期内获取到的所有延时相关指标值落入到第一区间的个数达到第一阈值)后才计算第一比值(在没满之前的采样过程可认为是学习过程),可以保证在第一区间获取足够多个延时相关指标值(即在第一区间采样足够多次)后再计算第一比值,从而能够提高检测慢盘的准确度。其次,该装置在每次检测周期内的所有采样周期内获取到的所有延时相关指标值落入到各个已满区间的个数大于或等于第二阈值时再计算第一比值平均值,可以保证在大部分区间已经结束学习过程,即大部分区间已经采样了足够多次后再计算第一比值平均值,也能够提高检测慢盘的准确度。此外,由于本发明实施例的检测慢盘的装置计算的第一比值平均值为由多个第一比值得到的一个比例平均值,其并不是实际的延时数值,因此该第一比值平均值可以准确地体现硬盘的性能变化趋势,通过设置第三阈值,以及将第一比值平均值与第三阈值进行比较,可以在硬盘性能发生变化时准确地检测出硬盘是慢盘,从而进一步提高检测慢盘的准确度。
通过以上的实施方式的描述,所属领域的技术人员可以清楚地 了解到,为描述的方便和简洁,仅以上述各功能模块的划分进行举例说明,实际应用中,可以根据需要而将上述功能分配由不同的功能模块完成,即将装置的内部结构划分成不同的功能模块,以完成以上描述的全部或者部分功能。上述描述的系统,装置和单元的具体工作过程,可以参考前述方法实施例中的对应过程,在此不再赘述。
在本申请所提供的几个实施例中,应该理解到,所揭露的系统,装置和方法,可以通过其它的方式实现。例如,以上所描述的装置实施例仅仅是示意性的,例如,所述模块或单元的划分,仅仅为一种逻辑功能划分,实际实现时可以有另外的划分方式,例如多个单元或组件可以结合或者可以集成到另一个系统,或一些特征可以忽略,或不执行。另一点,所显示或讨论的相互之间的耦合或直接耦合或通信连接可以是通过一些接口,装置或单元的间接耦合或通信连接,可以是电性,机械或其它的形式。
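The units described as separate components may or may not be physically separate, and the components shown as units may or may not be physical units; that is, they may be located in one place or distributed over a plurality of network units. Some or all of the units may be selected according to actual needs to achieve the objectives of the solutions of the embodiments.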
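In addition, the functional units in the embodiments of the present invention may be integrated into one processing unit, each unit may exist alone physically, or two or more units may be integrated into one unit. The integrated unit may be implemented in the form of hardware or in the form of a software functional unit.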
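When the integrated unit is implemented in the form of a software functional unit and sold or used as an independent product, it may be stored in a computer-readable storage medium. Based on this understanding, the technical solutions of the present invention essentially, or the part contributing to the prior art, or all or part of the technical solutions, may be embodied in the form of a software product. The computer software product is stored in a storage medium and includes several instructions for causing a computer device (which may be a personal computer, a server, a network device, or the like) or a processor to perform all or some of the steps of the methods described in the embodiments of the present invention. The foregoing storage medium includes any medium that can store program code, such as a USB flash drive, a removable hard disk, a ROM, a RAM, a magnetic disk, or an optical disc.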
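The foregoing descriptions are merely specific implementations of the present invention, but the protection scope of the present invention is not limited thereto. Any variation or replacement readily conceivable by a person skilled in the art within the technical scope disclosed in the present invention shall fall within the protection scope of the present invention. Therefore, the protection scope of the present invention shall be subject to the protection scope of the claims.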

Claims (16)

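  1. A method for detecting a slow disk, wherein the method comprises:
    periodically performing sampling within a detection period, and performing the following in each sampling period:
    obtaining a first delay of reading and writing data by a hard disk in the current sampling period, and a first delay-related indicator value, wherein the first delay-related indicator value is a specific value of a delay-related indicator value, and the delay-related indicator value is a value that changes correspondingly as the delay changes;
    determining a first interval to which the first delay-related indicator value belongs, wherein the first interval is one of a plurality of intervals obtained by dividing, in advance, a maximum delay-related indicator value;
    if the first interval is a full interval, calculating a ratio of the first delay to an interval average delay to obtain a first ratio, wherein a full interval is an interval into which the number of delay-related indicator values falling, among all delay-related indicator values obtained in all sampling periods, reaches a first threshold; the interval average delay is an average of a plurality of second delays in the first interval; the plurality of second delays are in one-to-one correspondence with a first plurality of sampling periods; each second delay is obtained in the sampling period corresponding to it; and each sampling period corresponds to one delay-related indicator value;
    after each detection period ends and before the next detection period starts, performing the following:
    if the number of delay-related indicator values that fall into the full intervals, among all delay-related indicator values obtained in all sampling periods within the current detection period, is greater than or equal to a second threshold, calculating an average of a plurality of first ratios calculated in a second plurality of sampling periods to obtain a first ratio average, wherein the second plurality of sampling periods are the sampling periods in which the delay-related indicator values falling into the full intervals were obtained; and
    if the first ratio average is greater than or equal to a third threshold, determining that the hard disk is a slow disk.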
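  2. The method according to claim 1, wherein in each sampling period, after the determining a first interval to which the first delay-related indicator value belongs, the method further comprises:
    recording, as a first number, the number of delay-related indicator values that have fallen into the first interval among all delay-related indicator values obtained in all sampling periods up to and including the current sample, wherein each sampling period corresponds to one delay-related indicator value;
    determining whether the first number reaches the first threshold;
    if the first number reaches the first threshold, determining that the first interval is a full interval; and
    if the first number does not reach the first threshold, determining that the first interval is not a full interval, and proceeding to sampling in the next sampling period.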
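  3. The method according to claim 1 or 2, wherein
    the delay-related indicator value is the utilization of the hard disk in reading and writing data; or
    the delay-related indicator value is the read/write speed at which the hard disk reads and writes data.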
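  4. The method according to any one of claims 1 to 3, wherein the first threshold is N, the first interval corresponds to N second delays, N is an integer greater than or equal to 1, and before the calculating a ratio of the first delay to an interval average delay, the method further comprises:
    calculating the average of the plurality of second delays among the N second delays to obtain the interval average delay.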
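  5. The method according to claim 4, wherein
    the N second delays are arranged in sampling order, the plurality of second delays are the first M of the N second delays, M is an integer, N/3 ≤ M ≤ 2N/3, and N/3 and 2N/3 are each rounded to an integer.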
  6. The method according to claim 5, wherein M = N/2,
    the plurality of second delays are the first N/2 of the N second delays, and N/2 is rounded to an integer.
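  7. The method according to any one of claims 1 to 6, wherein
    the average of the plurality of second delays is the arithmetic mean or the geometric mean of the plurality of second delays; and
    the average of the plurality of first ratios is the arithmetic mean or the geometric mean of the plurality of first ratios.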
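  8. The method according to any one of claims 1 to 7, wherein the method is applied to a scenario with a plurality of hard disks and is performed for a first hard disk, the first hard disk being one of the plurality of hard disks; and the method further comprises:
    obtaining a plurality of first ratio averages in one-to-one correspondence with the hard disks other than the first hard disk among the plurality of hard disks, wherein the first ratio average corresponding to each of the other hard disks is obtained in the same way as the first ratio average corresponding to the first hard disk;
    when the first ratio average corresponding to each of the plurality of hard disks is less than the third threshold, the method further comprises:
    calculating an average of the plurality of first ratio averages in one-to-one correspondence with the plurality of hard disks to obtain a first average;
    calculating a ratio of the first ratio average corresponding to each of the plurality of hard disks to the first average to obtain a plurality of second ratios; and
    determining that, among the plurality of second ratios, a hard disk corresponding to a second ratio greater than or equal to a fourth threshold is a slow disk.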
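  9. An apparatus for detecting a slow disk, wherein the apparatus comprises:
    a sampling unit, configured to periodically perform sampling within a detection period and, in each sampling period, complete the following process:
    obtaining a first delay of reading and writing data by a hard disk in the current sampling period, and a first delay-related indicator value, wherein the first delay-related indicator value is a specific value of a delay-related indicator value, and the delay-related indicator value is a value that changes correspondingly as the delay changes;
    determining a first interval to which the first delay-related indicator value belongs, wherein the first interval is one of a plurality of intervals obtained by dividing, in advance, a maximum delay-related indicator value;
    if the first interval is a full interval, calculating a ratio of the first delay to an interval average delay to obtain a first ratio, wherein a full interval is an interval into which the number of delay-related indicator values falling, among all delay-related indicator values obtained in all sampling periods, reaches a first threshold; the interval average delay is an average of a plurality of second delays in the first interval; the plurality of second delays are in one-to-one correspondence with a first plurality of sampling periods; each second delay is obtained in the sampling period corresponding to it; and each sampling period corresponds to one delay-related indicator value; and
    a detection unit, configured to complete the following process after each detection period ends and before the next detection period starts:
    if the number of delay-related indicator values that fall into the full intervals, among all delay-related indicator values obtained in all sampling periods within the current detection period, is greater than or equal to a second threshold, calculating an average of a plurality of first ratios calculated by the sampling unit in a second plurality of sampling periods to obtain a first ratio average, wherein the second plurality of sampling periods are the sampling periods in which the delay-related indicator values falling into the full intervals were obtained; and if the first ratio average is greater than or equal to a third threshold, determining that the hard disk is a slow disk.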
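  10. The apparatus according to claim 9, wherein
    the sampling unit is further configured to: in each sampling period, after determining the first interval to which the first delay-related indicator value belongs, record, as a first number, the number of delay-related indicator values that have fallen into the first interval among all delay-related indicator values obtained in all sampling periods up to and including the current sample; determine whether the first number reaches the first threshold; if the first number reaches the first threshold, determine that the first interval is a full interval; and if the first number does not reach the first threshold, determine that the first interval is not a full interval and proceed to sampling in the next sampling period, wherein each sampling period corresponds to one delay-related indicator value.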
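  11. The apparatus according to claim 9 or 10, wherein
    the delay-related indicator value is the utilization of the hard disk in reading and writing data; or
    the delay-related indicator value is the read/write speed at which the hard disk reads and writes data.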
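  12. The apparatus according to any one of claims 9 to 11, wherein the first threshold is N, the first interval corresponds to N second delays, and N is an integer greater than or equal to 1;
    the sampling unit is further configured to: before calculating the ratio of the first delay to the interval average delay, calculate the average of the plurality of second delays among the N second delays to obtain the interval average delay.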
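  13. The apparatus according to claim 12, wherein
    the N second delays are arranged in sampling order, the plurality of second delays are the first M of the N second delays, M is an integer, N/3 ≤ M ≤ 2N/3, and N/3 and 2N/3 are each rounded to an integer.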
  14. The apparatus according to claim 13, wherein M = N/2,
    the plurality of second delays are the first N/2 of the N second delays, and N/2 is rounded to an integer.
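  15. The apparatus according to any one of claims 9 to 14, wherein
    the average of the plurality of second delays calculated by the sampling unit is the arithmetic mean or the geometric mean of the plurality of second delays; and
    the average of the plurality of first ratios calculated by the detection unit is the arithmetic mean or the geometric mean of the plurality of first ratios.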
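  16. The apparatus according to any one of claims 9 to 15, wherein the apparatus is applied to a scenario with a plurality of hard disks, the apparatus performs detection on a first hard disk, and the first hard disk is one of the plurality of hard disks;
    the detection unit is further configured to: obtain a plurality of first ratio averages in one-to-one correspondence with the hard disks other than the first hard disk among the plurality of hard disks; when the first ratio average corresponding to each of the plurality of hard disks is less than the third threshold, calculate an average of the plurality of first ratio averages in one-to-one correspondence with the plurality of hard disks to obtain a first average; calculate a ratio of the first ratio average corresponding to each of the plurality of hard disks to the first average to obtain a plurality of second ratios; and determine that, among the plurality of second ratios, a hard disk corresponding to a second ratio greater than or equal to a fourth threshold is a slow disk, wherein the first ratio average corresponding to each of the other hard disks is obtained in the same way as the first ratio average corresponding to the first hard disk.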
PCT/CN2016/091605 2015-07-31 2016-07-25 Method and device for detecting slow disk WO2017020747A1 (zh)

Priority Applications (2)

Application Number Priority Date Filing Date Title
EP16832235.2A EP3318975A4 (en) 2015-07-31 2016-07-25 Method and device for detecting slow disk
US15/884,413 US20180157438A1 (en) 2015-07-31 2018-01-31 Slow-disk detection method and apparatus

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201510466756.X 2015-07-31
CN201510466756.XA CN106407051B (zh) 2015-07-31 2015-07-31 Method and device for detecting slow disk

Related Child Applications (1)

Application Number Title Priority Date Filing Date
US15/884,413 Continuation US20180157438A1 (en) 2015-07-31 2018-01-31 Slow-disk detection method and apparatus

Publications (1)

Publication Number Publication Date
WO2017020747A1 true WO2017020747A1 (zh) 2017-02-09

Family

ID=57942453

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2016/091605 WO2017020747A1 (zh) 2015-07-31 2016-07-25 Method and device for detecting slow disk

Country Status (4)

Country Link
US (1) US20180157438A1 (zh)
EP (1) EP3318975A4 (zh)
CN (1) CN106407051B (zh)
WO (1) WO2017020747A1 (zh)

Families Citing this family (15)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
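CN107329710A (zh) * 2017-07-06 2017-11-07 郑州云海信息技术有限公司 Storage performance optimization method and system, and storage software
CN109426592A (zh) * 2017-08-24 2019-03-05 上海交通大学 Disk detection method
CN107943605B (zh) * 2017-11-14 2021-03-19 青岛海信移动通信技术股份有限公司 Memory card processing method and apparatus
CN109815037B (zh) * 2017-11-22 2021-07-20 华为技术有限公司 Slow disk detection method and storage array
CN110865896B (zh) * 2018-08-27 2021-03-23 华为技术有限公司 Slow disk detection method and apparatus, and computer-readable storage medium
CN110928698B (zh) * 2018-09-19 2023-06-16 阿里巴巴集团控股有限公司 Data sending and receiving control method and apparatus, computing device, and storage medium
CN109815048B (zh) * 2019-01-31 2022-11-08 新华三技术有限公司成都分公司 Data reading method, apparatus, and device
CN111813585A (zh) * 2019-04-10 2020-10-23 伊姆西Ip控股有限责任公司 Prediction and handling of slow disks
US11301316B2 (en) * 2019-07-12 2022-04-12 Ebay Inc. Corrective database connection management
CN112241343B (zh) * 2019-07-19 2024-02-23 深信服科技股份有限公司 Slow disk detection method and apparatus, electronic device, and readable storage medium
CN110795314B (zh) * 2019-11-04 2023-10-03 北京小米移动软件有限公司 Method and apparatus for detecting slow node, and computer-readable storage medium
CN111274070B (zh) * 2019-11-04 2021-10-15 华为技术有限公司 Hard disk detection method and apparatus, and electronic device
CN113312218A (zh) * 2021-03-31 2021-08-27 阿里巴巴新加坡控股有限公司 Disk detection method and apparatus
CN114706720B (zh) * 2022-06-06 2022-09-06 南京鹏云网络科技有限公司 Method, system, device, and storage medium for determining slow disks in a distributed storage system
CN117573483B (zh) * 2024-01-16 2024-04-02 苏州元脑智能科技有限公司 Hard disk removal method and apparatus, storage medium, and electronic device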

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2005017735A1 (ja) * 2003-08-19 2005-02-24 Fujitsu Limited System and program for detecting a bottleneck in a disk array device
CN103019623A (zh) * 2012-12-10 2013-04-03 华为技术有限公司 Storage disk processing method and apparatus
CN103488544A (zh) * 2013-09-26 2014-01-01 华为技术有限公司 Processing method and apparatus for detecting slow disk
CN103744613A (zh) * 2013-12-17 2014-04-23 记忆科技(深圳)有限公司 System and method for reducing I/O write latency
CN103810062A (zh) * 2014-03-05 2014-05-21 华为技术有限公司 Slow disk detection method and apparatus

Family Cites Families (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2006092070A (ja) * 2004-09-22 2006-04-06 Nec Corp Disk array device, control method therefor, and control program
US7475292B2 (en) * 2005-10-26 2009-01-06 Siemens Corporate Research, Inc. System and method for triggering software rejuvenation using a customer affecting performance metric
JP4472617B2 (ja) * 2005-10-28 2010-06-02 富士通株式会社 RAID system, RAID controller, and rebuild/copy-back processing method therefor
US7992047B2 (en) * 2008-01-08 2011-08-02 International Business Machines Corporation Context sensitive detection of failing I/O devices
CN102147708B (zh) * 2010-02-10 2012-12-12 华为数字技术(成都)有限公司 Disk detection method and apparatus
CN102568522B (zh) * 2011-12-31 2015-08-19 曙光信息产业股份有限公司 Method and apparatus for testing hard disk performance

Also Published As

Publication number Publication date
EP3318975A1 (en) 2018-05-09
US20180157438A1 (en) 2018-06-07
CN106407051B (zh) 2019-01-11
CN106407051A (zh) 2017-02-15
EP3318975A4 (en) 2018-07-11

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 16832235

Country of ref document: EP

Kind code of ref document: A1

WWE Wipo information: entry into national phase

Ref document number: 2016832235

Country of ref document: EP

NENP Non-entry into the national phase

Ref country code: DE