CN114116374A

CN114116374A - Hard disk monitoring method, system, device and medium

Info

Publication number: CN114116374A
Application number: CN202111229365.8A
Authority: CN
Inventors: 苏军
Original assignee: Zhengzhou Yunhai Information Technology Co Ltd
Current assignee: Zhengzhou Yunhai Information Technology Co Ltd
Priority date: 2021-10-21
Filing date: 2021-10-21
Publication date: 2022-03-01

Abstract

The invention discloses a hard disk monitoring method, which comprises the following steps: acquiring monitoring parameters; configuring, according to the monitoring parameters, a corresponding threshold, acquisition interval and alarm strategy for each monitoring parameter; periodically acquiring corresponding values according to the acquisition interval of each monitoring parameter; judging whether to trigger an alarm according to the alarm strategy of each monitoring parameter, the plurality of periodically acquired values and the corresponding threshold; and reporting alarm information in response to an alarm being triggered. The invention also discloses a system, a computer device and a readable storage medium. The scheme provided by the invention considers the diversity of SSD operation, so that the current state of the SSD is analyzed from multiple dimensions and alarm information is reported when an SSD failure is predicted.

Description

Hard disk monitoring method, system, device and medium

Technical Field

The invention relates to the field of hard disks, and in particular to a hard disk monitoring method, system, device and storage medium.

Background

With the development and wide application of technologies such as the Internet, cloud computing and the Internet of Things, massive amounts of data are generated constantly and need to be processed and stored, and the rapid development of information technology places higher demands on the performance of storage systems. Solid state disks (SSDs) are widely used because of their fast read/write speed and low energy consumption. As the PE (program/erase) cycle count increases, and under the influence of Tcross (temperature crossing, i.e. the difference between the write temperature and the read temperature), read disturb, data retention and similar effects, the NAND may enter an unstable state. This shows up as more frequent triggering of data error correction flows, and even as data decoding failures, all of which are manifestations of abnormal SSD operation and affect the reliability of the SSD.

Disclosure of Invention

In view of the above, in order to overcome at least one aspect of the above problems, an embodiment of the present invention provides a hard disk monitoring method, including:

acquiring a monitoring parameter;

configuring a corresponding threshold value, a corresponding acquisition interval and a corresponding alarm strategy of each monitoring parameter according to the monitoring parameters;

periodically acquiring corresponding values according to the acquisition intervals corresponding to each monitoring parameter;

judging whether to trigger an alarm or not according to an alarm strategy corresponding to each monitoring parameter, a plurality of values acquired periodically and a corresponding threshold value;

and reporting alarm information in response to an alarm being triggered.
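The configuration step can be pictured as attaching a threshold, an acquisition interval and an alarm strategy to each monitoring parameter. The following is only a minimal sketch under stated assumptions, not the implementation described by the invention: the names `MonitorConfig`, `latest_exceeds`, `delta_exceeds` and `should_alarm`, the parameter names, and every numeric value are hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, Sequence

# An alarm strategy maps (collected values, threshold) to a yes/no decision.
AlarmStrategy = Callable[[Sequence[float], float], bool]


def latest_exceeds(values: Sequence[float], threshold: float) -> bool:
    """Trigger when the most recent sample is above the threshold."""
    return bool(values) and values[-1] > threshold


def delta_exceeds(values: Sequence[float], threshold: float) -> bool:
    """Trigger when the growth between the last two samples is above the threshold."""
    return len(values) >= 2 and (values[-1] - values[-2]) > threshold


@dataclass(frozen=True)
class MonitorConfig:
    name: str                # hypothetical parameter name
    threshold: float         # hypothetical threshold value
    interval_s: int          # acquisition interval in seconds
    strategy: AlarmStrategy  # how the samples are compared with the threshold


def should_alarm(cfg: MonitorConfig, values: Sequence[float]) -> bool:
    """Apply the parameter's own strategy to its periodically collected values."""
    return cfg.strategy(values, cfg.threshold)


# Hypothetical configuration table: one entry per monitoring parameter.
CONFIGS = {
    "bad_blocks": MonitorConfig("bad_blocks", 40, 600, latest_exceeds),
    "read_retry_failures": MonitorConfig("read_retry_failures", 100, 600, delta_exceeds),
}
```

Keeping the strategy as a plain function per parameter is one way to let an absolute-value check and an increment check coexist in the same table; other designs are equally possible.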

In some embodiments, the periodically acquiring the corresponding values according to the acquisition interval corresponding to each monitoring parameter further includes:

acquiring the current number of bad blocks on each physical LUN;

and in response to detecting the newly added bad blocks, obtaining the accumulated number of bad blocks according to the current number of bad blocks on the corresponding physical LUN.

In some embodiments, determining whether to trigger an alarm according to the alarm policy corresponding to each monitoring parameter, the plurality of values periodically collected, and the corresponding threshold, further includes:

and triggering a first-level alarm in response to the accumulated number of bad blocks on the corresponding physical LUN being larger than a bad block number threshold.

periodically acquiring the current count of each error correction type;

and calculating, for each error correction type, the increment between the counts acquired in two adjacent periods.

and triggering an alarm of the corresponding level according to the increment of the count, in response to the increment of the count being larger than the increment threshold.

periodically collecting the temperature of each temperature sensor and calculating the difference between each temperature sensor and the other sensors.

judging whether the value of a temperature sensor reaches a temperature threshold and whether the difference is greater than a temperature difference threshold;

reporting a temperature abnormality in response to the value of the temperature sensor reaching the temperature threshold;

and reporting uneven heat dissipation within the disk in response to the temperature difference reaching the temperature difference threshold.

Based on the same inventive concept, according to another aspect of the present invention, an embodiment of the present invention further provides a hard disk monitoring system, including:

an obtaining module configured to acquire monitoring parameters;

a configuration module configured to configure, according to the monitoring parameters, a corresponding threshold, acquisition interval and alarm strategy for each monitoring parameter;

an acquisition module configured to periodically acquire corresponding values according to the acquisition interval of each monitoring parameter;

a judging module configured to judge whether to trigger an alarm according to the alarm strategy of each monitoring parameter, the plurality of periodically acquired values and the corresponding threshold;

and an alarm module configured to report alarm information in response to an alarm being triggered.

Based on the same inventive concept, according to another aspect of the present invention, an embodiment of the present invention further provides a computer apparatus, including:

at least one processor; and

a memory storing a computer program operable on the processor, wherein the processor executes the program to perform the steps of:

acquiring a monitoring parameter;

and responding to the triggering alarm and reporting alarm information.

acquiring the current number of bad blocks on each physical LUN;

periodically acquiring the current count of each error correction type;

Based on the same inventive concept, according to another aspect of the present invention, an embodiment of the present invention further provides a computer-readable storage medium storing a computer program which, when executed by a processor, performs the steps of:

acquiring a monitoring parameter;

and responding to the triggering alarm and reporting alarm information.

acquiring the current number of bad blocks on each physical LUN;

periodically acquiring the current count of each error correction type;

The invention has one of the following beneficial technical effects: the scheme provided by the invention considers the diversity of the SSD disk operation, so that the current situation of the SSD is analyzed from multiple dimensions, and the alarm information is reported when the SSD disk failure is predicted.

Drawings

In order to illustrate the embodiments of the present invention or the technical solutions in the prior art more clearly, the drawings used in describing the embodiments or the prior art are briefly introduced below. The drawings described below show only some embodiments of the present invention; those skilled in the art can obtain other embodiments from these drawings without creative effort.

Fig. 1 is a schematic flowchart of a hard disk monitoring method according to an embodiment of the present invention;

Fig. 2 is a schematic structural diagram of a hard disk monitoring system according to an embodiment of the present invention;

Fig. 3 is a schematic structural diagram of a computer device according to an embodiment of the present invention;

Fig. 4 is a schematic structural diagram of a computer-readable storage medium according to an embodiment of the present invention.

Detailed Description

In order to make the objects, technical solutions and advantages of the present invention more apparent, the following embodiments of the present invention are described in further detail with reference to the accompanying drawings.

It should be noted that all expressions using "first" and "second" in the embodiments of the present invention are used to distinguish two entities or parameters that share the same name but are not identical. "First" and "second" are used merely for convenience of description and should not be construed as limiting the embodiments of the present invention; this is not repeated in the following embodiments.

According to an aspect of the present invention, an embodiment of the present invention provides a hard disk monitoring method, as shown in fig. 1, which may include the steps of:

S1, acquiring monitoring parameters;

S2, configuring, according to the monitoring parameters, a corresponding threshold, acquisition interval and alarm strategy for each monitoring parameter;

S3, periodically collecting corresponding values according to the acquisition interval of each monitoring parameter;

S4, judging whether to trigger an alarm according to the alarm strategy of each monitoring parameter, the plurality of periodically collected values and the corresponding threshold;

and S5, reporting alarm information in response to an alarm being triggered.
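The collection, judgment and reporting steps can be sketched as a polling loop in which each parameter is sampled at its own interval. The example below is an illustrative assumption rather than the method of the invention: `run_monitor`, the simulated one-second tick, the reader functions and all thresholds are hypothetical, and the alarm strategy is reduced to "latest value above threshold" for brevity.

```python
from typing import Callable, Dict, List, Tuple


def run_monitor(
    intervals: Dict[str, int],
    thresholds: Dict[str, float],
    readers: Dict[str, Callable[[int], float]],
    total_ticks: int,
) -> List[Tuple[int, str, float]]:
    """Simulate collection, judgment and reporting over `total_ticks` seconds.

    Returns the alarms raised as (tick, parameter name, value) tuples.
    """
    alarms: List[Tuple[int, str, float]] = []
    # History is kept so that richer strategies could inspect several samples.
    history: Dict[str, List[float]] = {name: [] for name in intervals}
    for tick in range(1, total_ticks + 1):
        for name, interval in intervals.items():
            if tick % interval != 0:
                continue  # this parameter is not due yet
            value = readers[name](tick)             # collect a value
            history[name].append(value)
            if value > thresholds[name]:            # placeholder alarm strategy
                alarms.append((tick, name, value))  # report the alarm
    return alarms


# Hypothetical parameters: temperature polled every 60 s, ECC count every 600 s.
alarms = run_monitor(
    intervals={"temperature": 60, "ecc_count": 600},
    thresholds={"temperature": 70.0, "ecc_count": 1000.0},
    readers={
        "temperature": lambda t: 65.0 + t / 60.0,  # rises one degree per minute
        "ecc_count": lambda t: float(t),
    },
    total_ticks=600,
)
```

With these made-up readings the temperature first exceeds its threshold at tick 360, so only temperature alarms are produced; a firmware implementation would use hardware timers rather than a simulated tick.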

The scheme provided by the invention considers the diversity of the SSD disk operation, so that the current situation of the SSD is analyzed from multiple dimensions, and the alarm information is reported when the SSD disk failure is predicted.

acquiring the current number of bad blocks on each physical LUN;

Specifically, it can be detected whether the total number of bad blocks on each physical LUN exceeds the limit. The threshold depends on the type of NAND die. When a newly grown bad block (GBB) is detected, the accumulated value for that LUN is checked against the threshold; if it exceeds the threshold, a quality problem with the individual NAND die is determined and a severe alarm is raised.
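The per-LUN accumulation described above can be sketched as follows. This is a minimal illustration under assumptions: the class name `BadBlockTracker`, the alarm label `"severe"` and the threshold value used in the usage note are hypothetical, and a real threshold would depend on the NAND type.

```python
from collections import defaultdict
from typing import Dict, Optional


class BadBlockTracker:
    """Accumulates grown bad blocks (GBB) per physical LUN."""

    def __init__(self, threshold: int) -> None:
        self.threshold = threshold  # hypothetical; depends on the NAND type
        self.accumulated: Dict[int, int] = defaultdict(int)

    def on_new_bad_blocks(self, lun: int, new_blocks: int = 1) -> Optional[str]:
        """Record newly detected bad blocks and return an alarm level, if any."""
        self.accumulated[lun] += new_blocks
        if self.accumulated[lun] > self.threshold:
            return "severe"  # accumulated count for this LUN exceeds the limit
        return None
```

For example, with `BadBlockTracker(threshold=2)`, the first two bad blocks reported on LUN 0 return `None` and the third returns `"severe"`, while the counts of other LUNs are unaffected.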

periodically acquiring the current count of each error correction type;

Specifically, a timer (e.g., 10 minutes) may be set to periodically poll the current count of each error correction type, where the error correction types include first error, read retry failure and soft decoding failure. The difference from the previous round is calculated to obtain the increase in each of the three error counts, which is then used to judge whether the increase within the detection period exceeds the corresponding thresholds T0, T1 and T2.

It should be noted that the firmware (FW) design takes into account the impact of the error correction flow on performance. If the increase within a short time is large, the disk may be in an unstable state; in that case the current basic information, such as the temperature difference and the data retention of the problem block, needs to be recorded, and an alarm is reported at the same time.
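The round-to-round difference calculation for the error correction counts can be sketched as below. This is an assumption-based example only: `ErrorCountMonitor`, the dictionary keys and the numeric stand-ins for T0, T1 and T2 are hypothetical and are not values given in the source.

```python
from typing import Dict, List, Optional

# Hypothetical increment thresholds standing in for T0, T1 and T2.
INCREMENT_THRESHOLDS = {
    "first_error": 1000,        # T0
    "read_retry_failure": 100,  # T1
    "soft_decode_failure": 10,  # T2
}


class ErrorCountMonitor:
    """Compares each polled count with the count from the previous round."""

    def __init__(self, thresholds: Dict[str, int]) -> None:
        self.thresholds = thresholds
        self.previous: Optional[Dict[str, int]] = None

    def poll(self, counts: Dict[str, int]) -> List[str]:
        """Return the error types whose increase since the last poll exceeds its threshold."""
        exceeded: List[str] = []
        if self.previous is not None:
            for kind, limit in self.thresholds.items():
                if counts[kind] - self.previous[kind] > limit:
                    exceeded.append(kind)
        self.previous = dict(counts)  # remember this round for the next difference
        return exceeded
```

The first poll only establishes a baseline and reports nothing; each later poll reports the error types whose increment is too large, at which point a caller could record the temperature difference and other context before raising the alarm.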

Specifically, it can be detected whether any of the multiple temperature sensors in the SSD is abnormal: 1) whether the value of each temperature sensor exceeds the working temperature threshold, and 2) whether the temperature difference between sensors is large; if so, this indicates uneven heat dissipation within the disk. Considering the influence of temperature changes on the NAND, a timer period (for example, 1 minute) is set and the difference between the current temperature and the previous temperature is evaluated; if the difference exceeds a specific threshold T3, the abnormal information is recorded and an alarm is reported.

It should be noted that both excessive temperature and a large temperature difference may have an unpredictable influence on the correctness of data in the SSD, and this information may be recorded as a basis for failure analysis.
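The two sensor checks (absolute temperature and sensor-to-sensor difference) can be illustrated with a short sketch. It is an assumption-laden example: `check_temperatures`, the report strings, the sensor names and the threshold values in the usage note are hypothetical, and the time-based check against T3 is not covered here.

```python
from itertools import combinations
from typing import Dict, List


def check_temperatures(
    readings: Dict[str, float],
    temp_threshold: float,
    diff_threshold: float,
) -> List[str]:
    """Return report strings for over-temperature sensors and uneven heat dissipation."""
    reports: List[str] = []
    # Check 1: each sensor against the working temperature threshold.
    for name, value in readings.items():
        if value >= temp_threshold:
            reports.append(f"temperature abnormal: {name}")
    # Check 2: every pair of sensors against the temperature difference threshold.
    for (a, ta), (b, tb) in combinations(readings.items(), 2):
        if abs(ta - tb) > diff_threshold:
            reports.append(f"uneven heat dissipation: {a} vs {b}")
    return reports
```

Calling `check_temperatures({"s0": 45.0, "s1": 72.0, "s2": 50.0}, 70.0, 20.0)` flags sensor `s1` as abnormal and reports uneven heat dissipation for the pairs involving `s1`.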

The scheme provided by the invention considers the diversity of SSD operation, so that the current state of the SSD is analyzed from multiple dimensions and alarm information is reported when an SSD failure is predicted. By periodically updating key operating data and judging against rules whether the SSD is running normally, the running state of the SSD is monitored, suspicious running states can be readily detected, and the performance and reliability of the SSD can also be assessed.

Based on the same inventive concept, according to another aspect of the present invention, an embodiment of the present invention further provides a hard disk monitoring system 400, as shown in fig. 2, including:

an obtaining module 401 configured to obtain a monitoring parameter;

a configuration module 402 configured to configure a corresponding threshold, a corresponding acquisition interval, and a corresponding alarm policy for each monitoring parameter according to the monitoring parameter;

an acquisition module 403 configured to periodically acquire corresponding values according to the acquisition intervals corresponding to each monitoring parameter;

a judging module 404 configured to judge whether to trigger an alarm according to an alarm policy corresponding to each monitoring parameter, a plurality of values periodically collected, and a corresponding threshold;

an alarm module 405 configured to report alarm information in response to an alarm being triggered.

acquiring the current number of bad blocks on each physical LUN;

periodically acquiring the current count of each error correction type;

Based on the same inventive concept, according to another aspect of the present invention, as shown in fig. 3, an embodiment of the present invention further provides a computer apparatus 501, comprising:

at least one processor 520; and

a memory 510, the memory 510 storing a computer program 511 executable on the processor, the processor 520 executing the program to perform the steps of:

S1, acquiring monitoring parameters;

and S5, reporting alarm information in response to an alarm being triggered.

acquiring the current number of bad blocks on each physical LUN;

periodically acquiring the current count of each error correction type;

Based on the same inventive concept, according to another aspect of the present invention, as shown in fig. 4, an embodiment of the present invention further provides a computer-readable storage medium 601, where the computer-readable storage medium 601 stores computer program instructions 610, and the computer program instructions 610, when executed by a processor, perform the following steps:

S1, acquiring monitoring parameters;

and S5, reporting alarm information in response to an alarm being triggered.

acquiring the current number of bad blocks on each physical LUN;

periodically acquiring the current count of each error correction type;

Finally, it should be noted that, as will be understood by those skilled in the art, all or part of the processes of the methods of the above embodiments may be implemented by a computer program, which may be stored in a computer-readable storage medium, and when executed, may include the processes of the embodiments of the methods described above.

Further, it should be appreciated that the computer-readable storage media (e.g., memory) herein can be either volatile memory or nonvolatile memory, or can include both volatile and nonvolatile memory.

Those of skill would further appreciate that the various illustrative logical blocks, modules, circuits, and algorithm steps described in connection with the disclosure herein may be implemented as electronic hardware, computer software, or combinations of both. To clearly illustrate this interchangeability of hardware and software, various illustrative components, blocks, modules, circuits, and steps have been described above generally in terms of their functionality. Whether such functionality is implemented as software or hardware depends upon the particular application and design constraints imposed on the overall system. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the disclosed embodiments of the present invention.

The foregoing is an exemplary embodiment of the present disclosure, but it should be noted that various changes and modifications could be made herein without departing from the scope of the present disclosure as defined by the appended claims. The functions, steps and/or actions of the method claims in accordance with the disclosed embodiments described herein need not be performed in any particular order. Furthermore, although elements of the disclosed embodiments of the invention may be described or claimed in the singular, the plural is contemplated unless limitation to the singular is explicitly stated.

It should be understood that, as used herein, the singular forms "a", "an" and "the" are intended to include the plural forms as well, unless the context clearly supports the exception. It should also be understood that "and/or" as used herein is meant to include any and all possible combinations of one or more of the associated listed items.

The numbers of the embodiments disclosed in the embodiments of the present invention are merely for description, and do not represent the merits of the embodiments.

It will be understood by those skilled in the art that all or part of the steps for implementing the above embodiments may be implemented by hardware, or may be implemented by a program instructing relevant hardware, and the program may be stored in a computer-readable storage medium, and the above-mentioned storage medium may be a read-only memory, a magnetic disk or an optical disk, etc.

Those of ordinary skill in the art will understand that the discussion of any embodiment above is exemplary only and is not intended to imply that the scope of the disclosure of the embodiments of the invention, including the claims, is limited to these examples. Within the idea of an embodiment of the invention, technical features in the above embodiment or in different embodiments may also be combined, and there are many other variations of the different aspects of the embodiments of the invention as described above, which are not provided in detail for the sake of brevity. Therefore, any omissions, modifications, substitutions, improvements and the like made without departing from the spirit and principles of the embodiments of the present invention are intended to be included within the scope of the embodiments of the present invention.

Claims

1. A hard disk monitoring method is characterized by comprising the following steps:

acquiring a monitoring parameter;

and responding to the triggering alarm and reporting alarm information.

2. The method of claim 1, wherein the periodically acquiring the corresponding values according to the acquisition intervals corresponding to each monitoring parameter respectively, further comprises:

acquiring the current number of bad blocks on each physical LUN;

3. The method of claim 2, wherein determining whether to trigger an alarm according to the alarm policy corresponding to each monitoring parameter, the plurality of values periodically collected, and the corresponding threshold value further comprises:

4. The method of claim 1, wherein the periodically acquiring the corresponding values according to the acquisition intervals corresponding to each monitoring parameter respectively, further comprises:

periodically acquiring the current count of each error correction type;

5. The method of claim 4, wherein determining whether to trigger an alarm according to the alarm policy corresponding to each monitoring parameter, the plurality of values periodically collected, and the corresponding threshold value further comprises:

6. The method of claim 1, wherein the periodically acquiring the corresponding values according to the acquisition intervals corresponding to each monitoring parameter respectively, further comprises:

7. The method of claim 6, wherein determining whether to trigger an alarm according to the alarm policy corresponding to each monitoring parameter, the plurality of values periodically collected, and the corresponding threshold value further comprises:

8. A hard disk monitoring system, comprising:

an obtaining module configured to acquire monitoring parameters;

9. A computer device, comprising:

at least one processor; and

a memory storing a computer program operable on the processor, wherein the processor executes the program to perform the steps of the method according to any one of claims 1-7.

10. A computer-readable storage medium, in which a computer program is stored which, when being executed by a processor, is adapted to carry out the steps of the method according to any one of claims 1 to 7.