CN114003477B - Method, system, terminal and storage medium for collecting diagnosis information of slow disk

Info

Publication number: CN114003477B
Application number: CN202111256592.XA
Authority: CN
Inventors: 李燕红; 苑忠科
Original assignee: Suzhou Inspur Intelligent Technology Co Ltd
Current assignee: Suzhou Inspur Intelligent Technology Co Ltd
Priority date: 2021-10-27
Filing date: 2021-10-27
Publication date: 2023-08-22
Anticipated expiration: 2041-10-27
Also published as: CN114003477A

Abstract

The invention provides a slow disk diagnosis information collection method, system, terminal and storage medium. The method comprises: starting slow disk detection and synchronously tracking the input/output process with the longest delay; obtaining a slow disk detection result and sending a power-on time query request to the slow disk; and receiving the power-on time returned by the slow disk in response to the query request, and outputting the power-on time and the input/output process with the longest delay to a diagnosis log. The invention meets the diagnosis requirements of disk manufacturers, and its implementation is simple and efficient. It solves the problem that a slow disk is difficult for the disk manufacturer to diagnose because the disk reports no instruction fault while it is slow.

Description

Method, system, terminal and storage medium for collecting diagnosis information of slow disk

Technical Field

The invention relates to the technical field of servers, in particular to a slow disk diagnosis information collection method, a system, a terminal and a storage medium.

Background

A slow disk is, simply put, a hard disk whose IO access rate is low. The usual causes are hardware problems such as bad tracks or abnormal magnetic heads, so slow disk detection reflects the health of the hard disk to a certain extent. If a hard disk is in a sub-health state, service performance degrades and request processing stalls, and in severe cases the service may become unavailable. The current iBMA supports slow disk detection on Linux and checks the pass-through disks and logical disks recognized by the OS layer; if a slow disk is detected, an alarm is reported to the iBMC. Alarm recovery is also supported.

A common way to collect disk diagnosis information today is for the disk manufacturer to provide a set of instructions. When a disk fails or misbehaves, the storage system issues these instructions, records the information returned by the disk in a log, and sends the log to the disk manufacturer for diagnosis and analysis.

A slow disk, however, is merely slow at reading and writing. When a disk is slow, read/write operations and management commands usually still complete normally; only the responses are slow. Diagnosis logs are collected only when a disk fails, and no information is designated for collection when the disk is merely slow. In addition, the time kept on the disk is not the same as the system time, so even if information is recorded it is difficult to correlate, and the diagnosis range is hard to narrow down. For example, the power-on time (POH) of the disk, which IO or IOs were slow, which IO caused the slowdown, and how the adjacent IOs performed are all difficult to determine.

Disclosure of Invention

In order to address the above shortcomings of the prior art, the present invention provides a method, a system, a terminal and a storage medium for collecting diagnosis information of a slow disk.

In a first aspect, the present invention provides a method for collecting diagnosis information of a slow disk, comprising:

starting slow disk detection and synchronously tracking the input/output process with the longest delay;

obtaining a slow disk detection result, and sending a power-on time query request to the slow disk;

and receiving the power-on time returned by the slow disk based on the power-on time query request, and outputting the power-on time and the input/output process with the longest delay to a diagnosis log.

Further, starting slow disk detection and synchronously tracking the input/output process with the longest delay includes:

starting to monitor the delay of input/output processes at the same time as slow disk detection is started, and storing the information of the input/output process with the longest delay in a circular array;

and comparing the delay of each newly monitored input/output process with the delay of the input/output process stored in the circular array, and storing the information with the larger delay in the circular array.

Further, obtaining the slow disk detection result and sending a power-on time query request to the slow disk includes:

parsing a target slow disk from the slow disk detection result;

and sending a log sense 15 instruction to the target slow disk.

Further, the method further comprises:

reading the power-on time and the information of the input/output process with the longest delay from the diagnosis log;

performing time correction on the disk diagnosis result information according to the power-on time and the system time;

and analyzing the time at which the slow disk occurred and the instruction that caused it according to the information of the input/output process with the longest delay.

In a second aspect, the present invention provides a slow disk diagnosis information collection system, comprising:

a process statistics unit, used for starting slow disk detection and synchronously tracking the input/output process with the longest delay;

a request sending unit, used for obtaining the slow disk detection result and sending a power-on time query request to the slow disk;

and an information output unit, used for receiving the power-on time returned by the slow disk based on the power-on time query request, and outputting the power-on time and the input/output process with the longest delay to the diagnosis log.

Further, the process statistics unit includes:

a process screening module, used for starting to monitor the delay of input/output processes at the same time as slow disk detection is started, and storing the information of the input/output process with the longest delay in a circular array;

and a circular storage module, used for comparing the delay of each newly monitored input/output process with the delay of the input/output process stored in the circular array, and storing the information with the larger delay in the circular array.

Further, the request sending unit includes:

a slow disk parsing module, used for parsing the target slow disk from the slow disk detection result;

and an instruction sending module, used for sending the log sense 15 instruction to the target slow disk.

Further, the system further comprises:

a log reading unit, used for reading the power-on time and the information of the input/output process with the longest delay from the diagnosis log;

a time correction unit, used for performing time correction on the disk diagnosis result information according to the power-on time and the system time;

and a process analysis unit, used for analyzing the time at which the slow disk occurred and the instruction that caused it according to the information of the input/output process with the longest delay.

In a third aspect, a terminal is provided, including:

a processor, a memory, wherein,

the memory is used for storing a computer program,

the processor is configured to call and run the computer program from the memory, so that the terminal performs the method described above.

In a fourth aspect, there is provided a computer storage medium having instructions stored therein which, when run on a computer, cause the computer to perform the method of the above aspects.

The beneficial effects of the slow disk diagnosis information collection method, system, terminal and storage medium are as follows. Slow disk detection is started and the input/output process with the longest delay is tracked at the same time; after the slow disk detection result is obtained, a power-on time query request is sent to the slow disk; the power-on time returned by the slow disk in response is then received, and the power-on time and the input/output process with the longest delay are output to a diagnosis log. The invention meets the diagnosis requirements of disk manufacturers, and its implementation is simple and efficient. It solves the problem that a slow disk is difficult for the disk manufacturer to diagnose because the disk reports no instruction fault while it is slow.

In addition, the invention has reliable design principle, simple structure and very wide application prospect.

Drawings

In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings that are required to be used in the description of the embodiments or the prior art will be briefly described below, and it will be obvious to those skilled in the art that other drawings can be obtained from these drawings without inventive effort.

FIG. 1 is a schematic flow chart of a method of one embodiment of the invention.

FIG. 2 is another schematic flow chart of a method of one embodiment of the invention.

FIG. 3 is a schematic block diagram of a system of one embodiment of the present invention.

FIG. 4 is a schematic structural diagram of a terminal according to an embodiment of the present invention.

Detailed Description

In order to make the technical solution of the present invention better understood by those skilled in the art, the technical solution of the present invention will be clearly and completely described below with reference to the accompanying drawings in the embodiments of the present invention, and it is apparent that the described embodiments are only some embodiments of the present invention, not all embodiments. All other embodiments, which can be made by those skilled in the art based on the embodiments of the present invention without making any inventive effort, shall fall within the scope of the present invention.

The following explains key terms appearing in the present invention.

A magnetic disk (disk) is a memory device that stores data using magnetic recording technology. The magnetic disk is a main storage medium of a computer; it can store a large amount of binary data and retains the data after power is cut off. Early computers used floppy disks, whereas the disk commonly used today is the hard disk.

Generally, a user only needs to observe the relationship between the current value, the worst value and the critical value of each SMART parameter, and to pay attention to the status prompt, in order to get an approximate picture of the health of the hard disk. The meaning of each parameter is briefly described below; items marked in red are critical to service life, and items marked in blue are specific to solid state disks (SSDs). In flash-based solid state disks, storage cells fall into two categories: SLC (Single-Level Cell) and MLC (Multi-Level Cell). SLC is expensive and low in capacity, but it is fast and reliable, with an endurance of up to 100,000 erase cycles, 10 times that of MLC. MLC offers large capacity at low cost, but its performance lags significantly behind SLC. To protect the service life of MLC, the controller chip also runs a wear-leveling algorithm that spreads writes evenly across the storage cells, so as to reach a mean time between failures of 1 million hours. As a result, solid state disks have many SMART parameters that mechanical hard disks lack, such as the erase count of the storage cells and spare block statistics. These additions are defined by each manufacturer; some have no detailed explanation and some explanations are not necessarily accurate, so they are given here for reference only. The SSD-specific items here apply to the SandForce controller chip; items from other manufacturers are noted separately.

The Power-On Time Count (POH) parameter has a clear meaning: it indicates how long the hard disk has been powered on. The raw data value directly accumulates the device's power-on time, so for a new hard disk it should be close to 0. The counting unit differs between hard disks, however: some count in hours, others in minutes, seconds, or even 30-second units, as defined by the disk manufacturer. The critical value of this parameter is typically 0, and the current value gradually decreases as the power-on time increases. A current value approaching the critical value indicates that the hard disk is approaching its expected design life, which of course does not mean that the hard disk will fail or must be scrapped immediately. The remaining life or failure probability can be roughly estimated with reference to the MTBF (mean time between failures) value given by the disk manufacturer for that type of hard disk. For solid state disks, note that device-initiated power management (DIPM) affects this statistic: if DIPM is enabled, sleep time is not included in the power-on count; if DIPM is turned off, time spent in the active, idle and sleep states is all counted.
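Because the counting unit is vendor-defined, a raw POH count has to be normalized before values from different disks can be compared. The following Python sketch illustrates one way to do this. The table `POH_UNIT_SECONDS`, its keys and the function `poh_to_hours` are hypothetical names introduced for this example; which unit a given disk actually uses must come from the manufacturer.

```python
# Seconds represented by one POH count, keyed by an illustrative unit label.
# The labels are assumptions for this sketch; the real unit is vendor-defined.
POH_UNIT_SECONDS = {
    "hours": 3600,
    "minutes": 60,
    "seconds": 1,
    "30s": 30,
}


def poh_to_hours(raw_count: int, unit: str) -> float:
    """Convert a raw power-on count to hours using the disk's counting unit."""
    if raw_count < 0:
        raise ValueError("raw_count must be non-negative")
    try:
        seconds_per_count = POH_UNIT_SECONDS[unit]
    except KeyError:
        raise ValueError(f"unknown POH unit: {unit!r}") from None
    return raw_count * seconds_per_count / 3600
```

For example, a raw count of 120 on a disk that counts in minutes and a raw count of 240 on a disk that counts in 30-second units both correspond to 2 hours.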

FIG. 1 is a schematic flow chart of a method of one embodiment of the invention. The execution subject of FIG. 1 may be a slow disk diagnosis information collection system. Because the disk records its time as POH, which does not correspond to the storage system time, a command for querying the disk's POH is added to the storage system and issued when the disk is judged to be slow, so as to obtain the POH. In this embodiment, the IO with the longest delay is tracked during slow disk detection; when a disk is judged to be slow, the POH query instruction is issued, and the returned POH, the longest-delay IO and related information are recorded in the system log for diagnosis by the disk manufacturer.

As shown in fig. 1, the method includes:

step 110, starting slow disk detection and synchronously tracking the input/output process with the longest delay;

step 120, obtaining a slow disk detection result, and sending a power-on time query request to the slow disk;

step 130, receiving the power-on time returned by the slow disk based on the power-on time query request, and outputting the power-on time and the input/output process with the longest delay to a diagnosis log.

In other words, the method tracks the IO with the longest delay during slow disk detection, issues a POH query instruction once a disk is judged to be slow, and records the returned POH together with the longest-delay IO in the system log for diagnosis by the disk manufacturer.
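Steps 110 to 130 can be read as a small pipeline. The Python sketch below is illustrative only: the function `collect_slow_disk_diagnostics`, its callable parameters and the log line format are assumptions made for this example and are not specified by the text. Latency tracking and detection are shown one after the other for brevity, although the method runs them at the same time.

```python
from typing import Callable, Iterable, Optional, Tuple

# (description, latency in seconds); a simplified stand-in for real IO records.
Io = Tuple[str, float]


def collect_slow_disk_diagnostics(
    completed_ios: Iterable[Io],
    detect_slow_disk: Callable[[], Optional[str]],
    query_poh: Callable[[str], int],
    write_log: Callable[[str], None],
) -> bool:
    """Run steps 110-130 once; return True if a slow disk was logged."""
    # Step 110: keep the IO with the longest delay seen during detection.
    longest: Optional[Io] = None
    for io in completed_ios:
        if longest is None or io[1] > longest[1]:
            longest = io

    # Step 120: obtain the detection result and query the slow disk's POH.
    disk = detect_slow_disk()
    if disk is None:
        return False  # no slow disk, nothing to collect
    poh = query_poh(disk)

    # Step 130: output the POH and the longest-delay IO to the diagnosis log.
    write_log(f"slow_disk={disk} poh={poh} longest_io={longest}")
    return True
```

Passing the detector, the POH query and the log writer as callables keeps the sketch independent of any particular storage stack.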

To facilitate understanding of the present invention, the slow disk diagnosis information collection method is further described below, based on its principles and in combination with the collection process in the embodiment.

Specifically, referring to FIG. 2, the method for collecting the diagnosis information of the slow disk includes:

S1, starting slow disk detection and synchronously tracking the input/output process with the longest delay.

Monitoring of input/output process delays starts at the same time as slow disk detection, and the information of the input/output process with the longest delay is stored in a circular array. The delay of each newly monitored input/output process is compared with the delay of the input/output process stored in the circular array, and the information with the larger delay is stored in the circular array.

Specifically, tracking of the longest-delay IO begins when slow disk detection starts. During the slow disk detection stage, a flag variable records whether an IO is the most time-consuming one so far. When each IO completes, the elapsed time from sending the IO down to receiving the reply from the disk is calculated and compared with the historical maximum; the IO with the largest elapsed time is the longest IO, and it is recorded in the circular array.
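The tracking described above can be sketched in Python as follows. This is a minimal illustration under stated assumptions: the names `IoRecord` and `LongestIoTracker`, the record fields and the default capacity of 8 are hypothetical, and the text does not specify the size of the circular array or its eviction rule. Here every IO that sets a new maximum is written to the next slot of a fixed-size ring, so the ring holds the most recent record-setting IOs.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class IoRecord:
    op: str             # illustrative, e.g. "READ" or "WRITE"
    lba: int            # logical block address of the IO
    submit_ts: float    # system time when the IO was sent down (seconds)
    complete_ts: float  # system time when the disk's reply arrived (seconds)

    @property
    def latency(self) -> float:
        return self.complete_ts - self.submit_ts


class LongestIoTracker:
    def __init__(self, capacity: int = 8) -> None:
        if capacity <= 0:
            raise ValueError("capacity must be positive")
        self._ring: List[Optional[IoRecord]] = [None] * capacity
        self._next = 0  # index of the slot that will be written next
        self._longest: Optional[IoRecord] = None

    def on_complete(self, rec: IoRecord) -> bool:
        """Call when an IO completes; returns the 'is longest so far' flag."""
        is_longest = self._longest is None or rec.latency > self._longest.latency
        if is_longest:
            self._ring[self._next] = rec  # overwrites the oldest slot when full
            self._next = (self._next + 1) % len(self._ring)
            self._longest = rec
        return is_longest

    @property
    def longest(self) -> Optional[IoRecord]:
        return self._longest

    def history(self) -> List[IoRecord]:
        """Record-setting IOs still held in the ring, oldest first."""
        ordered = self._ring[self._next:] + self._ring[:self._next]
        return [r for r in ordered if r is not None]
```

Keeping a short history of successive maxima, rather than a single value, preserves some context about how the delay grew, which may help when examining adjacent IO behavior.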

S2, acquiring a detection result of the slow disk, and sending a power-on time inquiry request to the slow disk.

A target slow disk is parsed from the slow disk detection result, and a log sense 15 instruction is sent to the target slow disk. Specifically, after slow disk detection determines that a certain disk is slow, the information of that disk is obtained, and a log sense 15 instruction is then issued to it.
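The text does not spell out the command bytes. The Python sketch below assumes that "log sense 15" refers to the SCSI LOG SENSE command (operation code 4Dh) with page code 15h, and builds the corresponding 10-byte command descriptor block (CDB). The function name `build_log_sense_cdb`, the default allocation length of 512 bytes and the choice of cumulative page control are assumptions for this example. Actually sending the CDB to a device (for example through the Linux SG_IO ioctl) is outside the scope of the sketch.

```python
import struct

LOG_SENSE_OPCODE = 0x4D   # SCSI LOG SENSE(10) operation code
ASSUMED_POH_PAGE = 0x15   # assumption: "15" in the text is page code 15h
PC_CUMULATIVE = 0b01      # page control field: cumulative values


def build_log_sense_cdb(page_code: int = ASSUMED_POH_PAGE,
                        subpage: int = 0,
                        alloc_len: int = 512) -> bytes:
    """Build a 10-byte LOG SENSE CDB requesting one log page."""
    if not 0 <= page_code <= 0x3F:
        raise ValueError("page_code must fit in 6 bits")
    if not 0 <= subpage <= 0xFF:
        raise ValueError("subpage must fit in 8 bits")
    if not 0 <= alloc_len <= 0xFFFF:
        raise ValueError("alloc_len must fit in 16 bits")
    return struct.pack(
        ">BBBBBHHB",
        LOG_SENSE_OPCODE,
        0x00,                              # PPC = 0, SP = 0
        (PC_CUMULATIVE << 6) | page_code,  # PC in bits 7-6, page code in bits 5-0
        subpage,
        0x00,                              # reserved
        0x0000,                            # parameter pointer
        alloc_len,                         # allocation length (big-endian)
        0x00,                              # control byte
    )
```

With the defaults, byte 2 of the CDB is 55h (page control 01b combined with page code 15h) and the allocation length field is 0200h.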

S3, receiving the power-on time returned by the slow disk based on the power-on time query request, and outputting the power-on time and the input/output process with the longest delay to a diagnosis log.

The POH value of the disk is extracted from the information returned by the disk. The slow disk information and the POH are recorded in the log, and the longest IO held in the circular array is recorded in the log as well. In this way, both the POH and the most time-consuming IO are collected.
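As an illustration of this step, the Python sketch below extracts a power-on value from a reply and renders one diagnosis log line. It rests on several assumptions that should be checked against the disk's documentation: that the reply is SCSI log page 15h, that the page starts with a 4-byte header, and that the first log parameter (code 0000h) carries the accumulated power-on minutes in its bytes 4 to 7. The function names, the JSON log format and the idea of storing the system time next to the POH are hypothetical choices for this example.

```python
import json
import struct

ASSUMED_PAGE_CODE = 0x15  # assumption: the reply is log page 15h


def parse_power_on_minutes(page: bytes) -> int:
    """Extract accumulated power-on minutes from a LOG SENSE reply.

    Assumed layout: 4-byte page header, then a status parameter with
    code 0000h whose bytes 4..7 hold the power-on minutes (big-endian).
    """
    if len(page) < 12:
        raise ValueError("reply too short")
    if page[0] & 0x3F != ASSUMED_PAGE_CODE:
        raise ValueError("unexpected log page code")
    (param_code,) = struct.unpack_from(">H", page, 4)
    if param_code != 0x0000:
        raise ValueError("status parameter is not first in the page")
    (minutes,) = struct.unpack_from(">I", page, 8)
    return minutes


def format_diag_record(disk_id: str, poh_minutes: int,
                       system_time: float, longest_io: dict) -> str:
    """Render one diagnosis log line (hypothetical JSON format)."""
    return json.dumps({
        "disk": disk_id,
        "poh_minutes": poh_minutes,
        "system_time": system_time,  # system clock when the POH was read
        "longest_io": longest_io,
    }, sort_keys=True)
```

Writing the system time in the same record as the POH gives a single reference point that links the disk's clock to the storage system's clock.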

The power-on time and the information of the input/output process with the longest delay are then read from the diagnosis log; the disk diagnosis result information is time-corrected according to the power-on time and the system time; and the time at which the slow disk occurred and the instruction that caused it are analyzed from the information of the input/output process with the longest delay.
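The text does not give a formula for the time correction. One plausible reading is that the logged pair (POH, system time) serves as a reference point, so that an event stamped with a POH value in the disk's own records can be mapped onto the system timeline. The Python sketch below shows that mapping under two assumptions: the POH is counted in minutes, and the disk stayed powered on between the event and the reference reading. The function name `disk_poh_to_system_time` is hypothetical.

```python
from datetime import datetime, timedelta, timezone


def disk_poh_to_system_time(event_poh_minutes: int,
                            ref_poh_minutes: int,
                            ref_system_time: datetime) -> datetime:
    """Map a POH-stamped disk event onto the system clock.

    ref_poh_minutes and ref_system_time are the pair recorded in the
    diagnosis log when the slow disk was queried.
    """
    elapsed = timedelta(minutes=ref_poh_minutes - event_poh_minutes)
    return ref_system_time - elapsed


# Usage: an event the disk stamped 30 power-on minutes before the reference.
ref = datetime(2021, 10, 27, 12, 0, tzinfo=timezone.utc)
event_time = disk_poh_to_system_time(99970, 100000, ref)
```

If the disk counts in a different unit, the raw values would need to be converted to minutes first.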

In this embodiment, an instruction for querying the POH is added and issued when a disk is slow, so that the POH is obtained; in addition, the most time-consuming IO is tracked. Collecting this information makes it easier for the disk manufacturer to diagnose the disk problem. The method meets the diagnosis requirements of disk manufacturers and is simple and efficient to implement. It solves the problem that a slow disk is difficult for the manufacturer to diagnose because the disk reports no instruction fault while it is slow. From the record of the most time-consuming IOs and the disk POH record, the manufacturer can determine when the slow disk occurred, which instruction caused it, and what happened on the disk, and can therefore narrow down the problem faster and speed up diagnosis.

As shown in fig. 3, the system 300 includes:

a process statistics unit 310, configured to start slow disk detection and synchronously track the input/output process with the longest delay;

a request sending unit 320, configured to obtain a slow disk detection result and send a power-on time query request to the slow disk;

and an information output unit 330, configured to receive the power-on time returned by the slow disk based on the power-on time query request, and output the power-on time and the input/output process with the longest delay to the diagnosis log.

Optionally, as an embodiment of the present invention, the process statistics unit includes:

Optionally, as an embodiment of the present invention, the request sending unit includes:

Optionally, as an embodiment of the present invention, the system further includes:

The system provided by this embodiment adds an instruction for querying the POH and issues it when a disk is slow, so that the POH is obtained; in addition, the most time-consuming IO is tracked. Collecting this information makes it easier for the disk manufacturer to diagnose the disk problem. The system meets the diagnosis requirements of disk manufacturers and is simple and efficient to implement. It solves the problem that a slow disk is difficult for the manufacturer to diagnose because the disk reports no instruction fault while it is slow. From the record of the most time-consuming IOs and the disk POH record, the manufacturer can determine when the slow disk occurred, which instruction caused it, and what happened on the disk, and can therefore narrow down the problem faster and speed up diagnosis.

FIG. 4 is a schematic structural diagram of a terminal 400 according to an embodiment of the present invention, where the terminal 400 may be used to execute the slow disk diagnosis information collection method provided by the embodiment of the present invention.

The terminal 400 may include: processor 410, memory 420, and communication unit 430. The components may communicate via one or more buses, and it will be appreciated by those skilled in the art that the configuration of the server as shown in the drawings is not limiting of the invention, as it may be a bus-like structure, a star-like structure, or include more or fewer components than shown, or may be a combination of certain components or a different arrangement of components.

The memory 420 may be used to store instructions for execution by the processor 410, and may be implemented by any type of volatile or nonvolatile memory device or combination thereof, such as static random access memory (SRAM), electrically erasable programmable read-only memory (EEPROM), erasable programmable read-only memory (EPROM), programmable read-only memory (PROM), read-only memory (ROM), magnetic memory, flash memory, magnetic disk, or optical disk. When the instructions in the memory 420 are executed by the processor 410, the terminal 400 is enabled to perform some or all of the steps in the method embodiments described above.

The processor 410 is the control center of the storage terminal. It connects the various parts of the entire electronic terminal using various interfaces and lines, and performs the functions of the electronic terminal and/or processes data by running or executing software programs and/or modules stored in the memory 420 and invoking data stored in the memory. The processor may consist of integrated circuits (ICs), for example a single packaged IC, or multiple packaged ICs with the same or different functions connected together. For example, the processor 410 may include only a central processing unit (CPU). In the embodiment of the invention, the CPU may have a single computing core or multiple computing cores.

The communication unit 430 is used to establish a communication channel so that the storage terminal can communicate with other terminals, receiving user data sent by other terminals or sending user data to other terminals.

The present invention also provides a computer storage medium in which a program may be stored; when executed, the program may perform some or all of the steps in the embodiments provided by the present invention. The storage medium may be a magnetic disk, an optical disk, a read-only memory (ROM), a random access memory (RAM), or the like.

Therefore, the invention starts slow disk detection and synchronously tracks the input/output process with the longest delay, sends a power-on time query request to the slow disk after the slow disk detection result is obtained, receives the power-on time returned by the slow disk in response, and outputs the power-on time and the input/output process with the longest delay to the diagnosis log. The invention meets the diagnosis requirements of disk manufacturers, and its implementation is simple and efficient. It solves the problem that a slow disk is difficult for the disk manufacturer to diagnose because the disk reports no instruction fault while it is slow. The technical effects achieved by this embodiment are described above and are not repeated here.

It will be apparent to those skilled in the art that the techniques of the embodiments of the present invention may be implemented in software plus a necessary general-purpose hardware platform. Based on this understanding, the technical solution in the embodiments of the present invention, in essence or in the part contributing to the prior art, may be embodied in the form of a software product. The software product is stored in a storage medium capable of storing program code, such as a USB flash drive, a removable hard disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk or an optical disk, and includes several instructions for causing a computer terminal (which may be a personal computer, a server, a second terminal, a network terminal, etc.) to execute all or part of the steps of the method described in the embodiments of the present invention.

The same or similar parts between the various embodiments in this specification are referred to each other. In particular, for the terminal embodiment, since it is substantially similar to the method embodiment, the description is relatively simple, and reference should be made to the description in the method embodiment for relevant points.

In the several embodiments provided by the present invention, it should be understood that the disclosed systems and methods may be implemented in other ways. For example, the system embodiments described above are merely illustrative, e.g., the division of the elements is merely a logical functional division, and there may be additional divisions when actually implemented, e.g., multiple elements or components may be combined or integrated into another system, or some features may be omitted or not performed. Alternatively, the coupling or direct coupling or communication connection shown or discussed with each other may be through some interface, system or unit indirect coupling or communication connection, which may be in electrical, mechanical or other form.

The units described as separate units may or may not be physically separate, and units shown as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the units may be selected according to actual needs to achieve the purpose of the solution of this embodiment.

In addition, each functional unit in the embodiments of the present invention may be integrated in one processing unit, or each unit may exist alone physically, or two or more units may be integrated in one unit.

Although the present invention has been described in detail by way of preferred embodiments with reference to the accompanying drawings, the present invention is not limited thereto. Various equivalent modifications and substitutions may be made to the embodiments of the present invention by those skilled in the art without departing from the spirit and scope of the present invention, and it is intended that all such modifications and substitutions fall within the scope of the present invention as defined by the appended claims. Therefore, the protection scope of the present invention shall be subject to the protection scope of the claims.

Claims

1. A method for collecting diagnostic information of a slow disc, comprising:

2. The method of claim 1, wherein starting the slow disk detection and synchronizing the input-output process that counts the longest delay comprises:

3. The method of claim 1, wherein obtaining the slow disk detection result and sending a power-on time query request to the slow disk comprises:

parsing a target slow disk from the slow disk detection result;

sending a log sense 15 instruction to the target slow disk.

4. The method according to claim 1, wherein the method further comprises:

5. A slow disk diagnostic information collection system comprising:

6. The system of claim 5, wherein the process statistics unit comprises:

7. The system according to claim 5, wherein the request transmitting unit includes:

8. The system of claim 5, wherein the system further comprises:

9. A terminal, comprising:

a processor;

a memory for storing execution instructions of the processor;

wherein the processor is configured to perform the method of any of claims 1-4.

10. A computer readable storage medium storing a computer program, which when executed by a processor implements the method of any one of claims 1-4.