CN112506744B - Method, device and equipment for monitoring running state of NVMe hard disk - Google Patents

Method, device and equipment for monitoring running state of NVMe hard disk

Info

Publication number
CN112506744B
Authority
CN
China
Prior art keywords
state
hard disk
abnormal
information
nvme hard
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202011453229.2A
Other languages
Chinese (zh)
Other versions
CN112506744A (en
Inventor
李世坤
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Inspur Electronic Information Industry Co Ltd
Original Assignee
Inspur Electronic Information Industry Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Inspur Electronic Information Industry Co Ltd filed Critical Inspur Electronic Information Industry Co Ltd
Priority to CN202011453229.2A priority Critical patent/CN112506744B/en
Publication of CN112506744A publication Critical patent/CN112506744A/en
Application granted granted Critical
Publication of CN112506744B publication Critical patent/CN112506744B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00 Error detection; Error correction; Monitoring
    • G06F11/30 Monitoring
    • G06F11/3003 Monitoring arrangements specially adapted to the computing system or computing system component being monitored
    • G06F11/3037 Monitoring arrangements specially adapted to the computing system or computing system component being monitored where the computing system component is a memory, e.g. virtual memory, cache
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00 Error detection; Error correction; Monitoring
    • G06F11/07 Responding to the occurrence of a fault, e.g. fault tolerance
    • G06F11/0703 Error or fault processing not based on redundancy, i.e. by taking additional measures to deal with the error or fault not making use of redundancy in operation, in hardware, or in data representation
    • G06F11/0706 Error or fault processing not based on redundancy, i.e. by taking additional measures to deal with the error or fault not making use of redundancy in operation, in hardware, or in data representation the processing taking place on a specific hardware platform or in a specific software environment
    • G06F11/073 Error or fault processing not based on redundancy, i.e. by taking additional measures to deal with the error or fault not making use of redundancy in operation, in hardware, or in data representation the processing taking place on a specific hardware platform or in a specific software environment in a memory management context, e.g. virtual memory or cache management
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00 Error detection; Error correction; Monitoring
    • G06F11/07 Responding to the occurrence of a fault, e.g. fault tolerance
    • G06F11/0703 Error or fault processing not based on redundancy, i.e. by taking additional measures to deal with the error or fault not making use of redundancy in operation, in hardware, or in data representation
    • G06F11/0751 Error or fault detection not based on redundancy
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00 Error detection; Error correction; Monitoring
    • G06F11/07 Responding to the occurrence of a fault, e.g. fault tolerance
    • G06F11/0703 Error or fault processing not based on redundancy, i.e. by taking additional measures to deal with the error or fault not making use of redundancy in operation, in hardware, or in data representation
    • G06F11/0751 Error or fault detection not based on redundancy
    • G06F11/0754 Error or fault detection not based on redundancy by exceeding limits
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00 Error detection; Error correction; Monitoring
    • G06F11/07 Responding to the occurrence of a fault, e.g. fault tolerance
    • G06F11/0703 Error or fault processing not based on redundancy, i.e. by taking additional measures to deal with the error or fault not making use of redundancy in operation, in hardware, or in data representation
    • G06F11/0793 Remedial or corrective actions
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00 Error detection; Error correction; Monitoring
    • G06F11/30 Monitoring
    • G06F11/3055 Monitoring arrangements for monitoring the status of the computing system or of the computing system component, e.g. monitoring if the computing system is on, off, available, not available
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00 Error detection; Error correction; Monitoring
    • G06F11/30 Monitoring
    • G06F11/3058 Monitoring arrangements for monitoring environmental properties or parameters of the computing system or of the computing system component, e.g. monitoring of power, currents, temperature, humidity, position, vibrations
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00 Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Quality & Reliability (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Debugging And Monitoring (AREA)

Abstract

The invention discloses a method, device and equipment for monitoring the running state of an NVMe hard disk. The method comprises the following steps: acquiring operation information of the NVMe hard disk, wherein the operation information comprises at least one of temperature information, read-write speed information, writing state information and bad block number information; determining the running state of the NVMe hard disk according to the operation information, wherein the running state comprises a normal operation state and an abnormal early warning state; and, when the running state is the abnormal early warning state, generating and outputting early warning information corresponding to the running state. Because early warning information is output as soon as an abnormal early warning state is detected, abnormal conditions are reported before the NVMe hard disk fails completely, potential risks are identified in time, serious faults such as downtime and data loss are avoided, and repairable faults of the NVMe hard disk can be repaired and removed in time.

Description

Method, device and equipment for monitoring running state of NVMe hard disk
Technical Field
The invention relates to the technical field of servers, in particular to an operation state monitoring method, device and equipment for an NVMe hard disk.
Background
NVMe (Non-Volatile Memory Express, the non-volatile memory host controller interface specification) is an interface specification that connects storage devices to a server over the PCIe bus. With the development of big data, the requirements on data response speed keep rising, and NVMe hard disks are used more and more widely: from the initial use of a small number of NVMe hard disks for cache acceleration to the current use of large numbers of NVMe hard disks for hot data storage. Because the storage media differ, NVMe hard disks have characteristics different from SATA hard disks. For example, a mechanical hard disk with a SATA interface consumes about 10 W, whereas a standard NVMe hard disk can consume up to 25 W, so heat dissipation is a more severe challenge; in addition, the storage (flash) chips of an NVMe hard disk have a limited read-write service life, and improper use wears the disk out prematurely.
At present, a fault of an NVMe hard disk is generally detected through the fault indicator lamp on the NVMe backplane, so the problem is found only when data storage becomes abnormal or when the machine room is inspected. After the failed NVMe hard disk is found, a technician removes it from the server and analyzes or repairs it with dedicated equipment. Because such failures occur without warning, an NVMe hard disk that has already failed is often difficult to repair. Therefore, how to monitor the running state of an NVMe hard disk while it is in operation, so as to avoid serious faults such as downtime and data loss caused by the failure of the NVMe hard disk, is a problem that urgently needs to be solved.
Disclosure of Invention
The invention aims to provide a method, device and equipment for monitoring the running state of an NVMe hard disk, so that, by issuing fault early warnings before the NVMe hard disk fails completely, serious faults such as downtime and data loss caused by faults of the NVMe hard disk are avoided.
In order to solve the technical problems, the invention provides a method for monitoring the running state of an NVMe hard disk, which comprises the following steps:
acquiring operation information of an NVMe hard disk; wherein the operation information comprises at least one of temperature information, read-write speed information, writing state information and bad block number information;
determining the running state of the NVMe hard disk according to the running information; the operation state comprises a normal operation state and an abnormal early warning state;
and when the running state is the abnormal early warning state, generating and outputting early warning information corresponding to the running state.
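The three steps above can be sketched as one monitoring cycle. This is only an illustrative outline: the function name `monitor_once`, the state strings and the stand-in callables are assumptions made for the example and do not come from the patent.

```python
from typing import Callable, Dict, List

NORMAL = "normal"  # hypothetical label for the normal operation state


def monitor_once(
    acquire: Callable[[], Dict[str, float]],
    determine: Callable[[Dict[str, float]], List[str]],
    output: Callable[[str], None],
) -> List[str]:
    """Run one cycle: acquire information, determine state, warn if abnormal."""
    info = acquire()                     # step 1: operation information
    states = determine(info)             # step 2: running state(s)
    for state in states:                 # step 3: warn only for abnormal states
        if state != NORMAL:
            output(f"early warning: NVMe disk in state '{state}'")
    return states


# Usage with stand-in callables and made-up values.
messages: List[str] = []
result = monitor_once(
    acquire=lambda: {"temperature_c": 65.0},
    determine=lambda info: (
        ["mild_temperature_abnormal"] if info["temperature_c"] > 60 else [NORMAL]
    ),
    output=messages.append,
)
```

Passing the three steps in as callables keeps the cycle independent of how the information is actually read from the disk, which the text leaves open.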
Optionally, when the operation information includes the writing state information, determining, according to the operation information, an operation state of the NVMe hard disk includes:
judging whether the writing state information is in a writable state or not;
if not, determining the running state as a writing abnormal state; wherein the abnormal early warning state includes the writing abnormal state.
Optionally, when the operation information includes the read-write speed information, determining, according to the operation information, an operation state of the NVMe hard disk includes:
judging whether the read-write speed information is larger than a preset read-write speed or not;
if not, determining the running state as a read-write speed abnormal state; wherein the abnormal early warning state includes the read-write speed abnormal state.
Optionally, when the operation information includes the temperature information, determining, according to the operation information, an operation state of the NVMe hard disk includes:
judging whether the temperature information is larger than a first temperature threshold value or not;
if yes, judging whether the temperature information is larger than a second temperature threshold value or not; wherein the second temperature threshold is greater than the first temperature threshold;
if the temperature information is not greater than the second temperature threshold, determining that the running state is a mild temperature abnormal state;
if the temperature information is greater than the second temperature threshold, determining that the running state is a serious temperature abnormal state; wherein the abnormal early warning state includes the mild temperature abnormal state and the serious temperature abnormal state.
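The two-threshold temperature judgment can be written as a small function. A minimal sketch: the default values 60 and 80 are the example thresholds given later in the description, while the function name and state strings are hypothetical.

```python
def classify_temperature(temp_c: float, first_c: float = 60.0, second_c: float = 80.0) -> str:
    """Map a temperature reading to a state using two ordered thresholds."""
    if second_c <= first_c:
        raise ValueError("second threshold must be greater than the first")
    if temp_c <= first_c:          # not greater than the first threshold
        return "normal"
    if temp_c <= second_c:         # above the first, not above the second
        return "mild_temperature_abnormal"
    return "serious_temperature_abnormal"
```

Both comparisons are strict "greater than" tests, so a reading exactly equal to a threshold falls into the lower tier.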
Optionally, when the operation information includes the bad block number information, determining, according to the operation information, an operation state of the NVMe hard disk includes:
judging whether the bad block number information is greater than an early warning bad block number;
if yes, judging whether the bad block number information is greater than a dangerous bad block number; wherein the dangerous bad block number is greater than the early warning bad block number;
if the bad block number information is not greater than the dangerous bad block number, determining that the running state is a mild bad block abnormal state;
and if the bad block number information is greater than the dangerous bad block number, determining that the running state is a serious bad block abnormal state.
Optionally, the method further comprises:
and executing processing operation corresponding to the running state when the running state is the abnormal early warning state.
Optionally, when the abnormal early warning state includes a writing abnormal state, a read-write speed abnormal state, a mild temperature abnormal state, a serious temperature abnormal state, a mild bad block abnormal state and a serious bad block abnormal state, the executing the processing operation corresponding to the running state includes:
if the running state is the writing abnormal state, performing online upgrade repair on the firmware of the NVMe hard disk;
if the running state is the read-write speed abnormal state, performing a garbage cleaning operation on the NVMe hard disk;
if the running state is the mild temperature abnormal state, adjusting the I/O speed of the NVMe hard disk to a preset low-speed value; the preset low-speed value is smaller than a preset normal value corresponding to a normal running state;
if the running state is the serious temperature abnormal state, stopping the read-write operation of the NVMe hard disk;
if the running state is the mild bad block abnormal state, repairing the bad blocks in the NVMe hard disk;
and if the running state is the serious bad block abnormal state, backing up the data in the NVMe hard disk to a preset hard disk.
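The state-to-operation mapping listed above lends itself to a dispatch table. The sketch below only records which operation would run; the state strings and the helper names `build_actions` and `handle` are assumptions, and the real operations are not implemented.

```python
from typing import Callable, Dict, List


def build_actions(log: List[str]) -> Dict[str, Callable[[str], None]]:
    """Return a table mapping each abnormal state to its processing operation."""
    def step(description: str) -> Callable[[str], None]:
        def run(disk: str) -> None:
            log.append(f"{disk}: {description}")  # stand-in for the real operation
        return run

    return {
        "write_abnormal": step("upgrade and repair firmware online"),
        "rw_speed_abnormal": step("run garbage cleaning"),
        "mild_temperature_abnormal": step("lower I/O speed to preset low-speed value"),
        "serious_temperature_abnormal": step("stop read/write operations"),
        "mild_bad_block_abnormal": step("repair bad blocks"),
        "serious_bad_block_abnormal": step("back up data to preset hard disk"),
    }


def handle(state: str, disk: str, actions: Dict[str, Callable[[str], None]]) -> bool:
    """Run the operation for `state`; return False if no operation is registered."""
    action = actions.get(state)
    if action is None:
        return False
    action(disk)
    return True
```

A table keeps each state's operation in one place, so adding or removing a handled state does not touch the dispatch logic.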
Optionally, the performing online upgrade repair on the firmware of the NVMe hard disk includes:
matching the firmware corresponding to the NVMe hard disk according to the important product data information of the NVMe hard disk;
downloading the firmware into the NVMe hard disk and activating the firmware.
The invention also provides an operation state monitoring device of the NVMe hard disk, which comprises:
the information acquisition module is used for acquiring the operation information of the NVMe hard disk; wherein the operation information comprises at least one of temperature information, read-write speed information, writing state information and bad block number information;
the state determining module is used for determining the operation state of the NVMe hard disk according to the operation information; the operation state comprises a normal operation state and an abnormal early warning state;
and the early warning module is used for generating and outputting early warning information corresponding to the running state when the running state is the abnormal early warning state.
The invention also provides operation state monitoring equipment of the NVMe hard disk, which comprises:
a memory for storing a computer program;
and the processor is used for realizing the steps of the operation state monitoring method of the NVMe hard disk when executing the computer program.
The invention provides a method for monitoring the running state of an NVMe hard disk, which comprises the following steps: acquiring operation information of the NVMe hard disk, wherein the operation information comprises at least one of temperature information, read-write speed information, writing state information and bad block number information; determining the running state of the NVMe hard disk according to the operation information, wherein the running state comprises a normal operation state and an abnormal early warning state; and, when the running state is the abnormal early warning state, generating and outputting early warning information corresponding to the running state.
Therefore, when the running state of the NVMe hard disk is an abnormal early warning state, the invention generates and outputs the early warning information corresponding to the running state, so that monitored abnormal conditions are reported before the NVMe hard disk fails completely. Potential risks are identified in time, serious faults such as downtime and data loss are avoided, repairable faults of the NVMe hard disk can be removed in time, and small faults are prevented from accumulating into large, unrepairable ones. In addition, the invention also provides a running state monitoring device and equipment for the NVMe hard disk, which have the same beneficial effects.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings that are required to be used in the embodiments or the description of the prior art will be briefly described below, and it is obvious that the drawings in the following description are only embodiments of the present invention, and that other drawings can be obtained according to the provided drawings without inventive effort for a person skilled in the art.
FIG. 1 is a flow chart of a method for monitoring the operation state of an NVMe hard disk according to an embodiment of the present invention;
FIG. 2 is a schematic system structure diagram of an operation state monitoring method of an NVMe hard disk according to an embodiment of the present invention;
FIG. 3 is a flow chart of another method for monitoring the operation status of an NVMe hard disk according to an embodiment of the present invention;
FIG. 4 is a block diagram of an NVMe hard disk running state monitoring device according to an embodiment of the present invention;
FIG. 5 is a schematic structural diagram of operation state monitoring equipment for an NVMe hard disk according to an embodiment of the present invention.
Detailed Description
For the purpose of making the objects, technical solutions and advantages of the embodiments of the present invention more apparent, the technical solutions of the embodiments of the present invention will be clearly and completely described below with reference to the accompanying drawings in the embodiments of the present invention, and it is apparent that the described embodiments are some embodiments of the present invention, but not all embodiments of the present invention. All other embodiments, which can be made by those skilled in the art based on the embodiments of the invention without making any inventive effort, are intended to be within the scope of the invention.
Referring to fig. 1, fig. 1 is a flowchart of an operation state monitoring method of an NVMe hard disk according to an embodiment of the present invention. The method may include:
step 101: acquiring operation information of an NVMe hard disk; wherein the operation information includes at least one of temperature information, read-write speed information, write status information, and bad block number information.
It can be understood that the processor of the operation state monitoring device (such as a server) of the NVMe hard disk in this step can obtain the operation information of the connected NVMe hard disk, that is, the information to be monitored in the operation and use process of the NVMe hard disk; as shown in fig. 2, a processor running a control system in a server may obtain respective running information of each NVMe hard disk connected through an NVMe backplane.
Specifically, for the specific content of the operation information of the NVMe hard disk obtained by the processor in the step, the operation information can be set by a designer according to a practical scene and user requirements, for example, the operation information can include the temperature (i.e., temperature information) of the NVMe hard disk when in operation and use; the operation information can also comprise the data read-write speed (namely read-write speed information) of the NVMe hard disk when in operation and use, such as the data read speed and the data write speed; the running information may also include a writing state (i.e., writing state information) when the NVMe hard disk is running and in use, such as a writable state when the NVMe hard disk is capable of writing data or an unwritable state when the NVMe hard disk is incapable of writing data; the operation information may further include the number of bad blocks (i.e., bad block number information) when the NVMe hard disk is in operation.
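One way to hold operation information whose items are each optional, while still enforcing "at least one", is a small record type. This is a sketch only; the class name, field names and units are hypothetical.

```python
from dataclasses import dataclass, fields
from typing import Optional


@dataclass(frozen=True)
class OperationInfo:
    """Operation information of one NVMe hard disk; every item is optional."""
    temperature_c: Optional[float] = None   # temperature information
    read_mb_s: Optional[float] = None       # data read speed
    write_mb_s: Optional[float] = None      # data write speed
    writable: Optional[bool] = None         # writing state information
    bad_blocks: Optional[int] = None        # bad block number information

    def __post_init__(self) -> None:
        # The method requires at least one kind of information to be present.
        if all(getattr(self, f.name) is None for f in fields(self)):
            raise ValueError("operation information must contain at least one item")
```

Using `None` for missing items lets later checks skip any judgment whose input was not collected.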
Step 102: determining the running state of the NVMe hard disk according to the running information; the operation states comprise a normal operation state and an abnormal early warning state.
It can be understood that the abnormal early warning state in this step may be an operation state when the NVMe hard disk has a problem (i.e. a fault), that is, when the operation state of the NVMe hard disk is in the abnormal early warning state, the processor may determine that the NVMe hard disk has a problem, and early warning needs to be performed.
Specifically, for the specific number and types of the abnormal early-warning states in the step, the designer can set the abnormal early-warning states by himself according to the practical scene and the user requirement, for example, the abnormal early-warning states can include any one or more of a writing abnormal state corresponding to the writing state abnormality of the NVMe hard disk, a reading and writing speed abnormal state corresponding to the reading and writing speed abnormality of the NVMe hard disk, a temperature abnormal state corresponding to the temperature abnormality of the NVMe hard disk and a bad block abnormal state corresponding to the bad block number abnormality of the NVMe hard disk; accordingly, the temperature anomaly state may include a mild temperature anomaly state and a severe temperature anomaly state, and the bad block anomaly state may include a mild bad block anomaly state and a severe bad block anomaly state.
Correspondingly, regarding the specific way in which the processor determines the running state of the NVMe hard disk according to the operation information in this step: if the abnormal early warning state includes six abnormal states, namely a writing abnormal state, a read-write speed abnormal state, a mild temperature abnormal state, a serious temperature abnormal state, a mild bad block abnormal state and a serious bad block abnormal state, the processor can use the corresponding contents of the operation information, sequentially or separately, to detect whether the running state of the NVMe hard disk is in any of the six abnormal states; if the running state is in none of the six abnormal states, the running state of the NVMe hard disk is determined to be the normal operation state. For example, the processor may judge whether the writing state information in the operation information is the writable state; if not, that is, the NVMe hard disk cannot write data, the running state of the NVMe hard disk is determined to be the writing abnormal state. The processor may judge whether the read-write speed information in the operation information is greater than the preset read-write speed, that is, whether the read speed and the write speed in the read-write speed information are both greater than their respective thresholds (the preset read-write speed); if the read-write speed information is not greater than the preset read-write speed, the running state of the NVMe hard disk is determined to be the read-write speed abnormal state. The processor may judge whether the temperature information in the operation information is greater than a first temperature threshold (e.g., threshold 1 in FIG. 3); if yes, it judges whether the temperature information is greater than a second temperature threshold (e.g., threshold 2 in FIG. 3), wherein the second temperature threshold is greater than the first temperature threshold; if the temperature information is not greater than the second temperature threshold, the running state of the NVMe hard disk is determined to be the mild temperature abnormal state; if the temperature information is greater than the second temperature threshold, the running state is determined to be the serious temperature abnormal state. The processor may judge whether the bad block number information in the operation information is greater than the early warning bad block number; if yes, it judges whether the bad block number information is greater than the dangerous bad block number (e.g., the threshold corresponding to the number of bad blocks in FIG. 3), wherein the dangerous bad block number is greater than the early warning bad block number; if the bad block number information is not greater than the dangerous bad block number, the running state is determined to be the mild bad block abnormal state; if it is greater than the dangerous bad block number, the running state is determined to be the serious bad block abnormal state.
Correspondingly, the running state of the NVMe hard disk in this embodiment may be in one or more abnormal early warning states or normal running states.
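Because a disk may be in several abnormal states at once, the four judgments can be combined into a function that returns a set of states. A minimal sketch under stated assumptions: the dictionary keys, state strings, speed thresholds and bad block thresholds are all made up for illustration; only the 60/80 temperature values echo the examples in the description.

```python
from typing import Any, Dict, Set

# Illustrative thresholds; the speed and bad-block values are hypothetical.
THRESHOLDS: Dict[str, float] = {
    "min_read_mb_s": 1000.0,
    "min_write_mb_s": 500.0,
    "temp_first_c": 60.0,
    "temp_second_c": 80.0,
    "bad_warn": 50,
    "bad_danger": 200,
}


def determine_states(info: Dict[str, Any], t: Dict[str, float] = THRESHOLDS) -> Set[str]:
    """Return every abnormal state that applies, or {'normal'} if none does."""
    states: Set[str] = set()

    if info.get("writable") is False:
        states.add("write_abnormal")

    if "read_mb_s" in info and "write_mb_s" in info:
        # Both speeds must exceed their own thresholds to count as normal.
        fast_enough = (info["read_mb_s"] > t["min_read_mb_s"]
                       and info["write_mb_s"] > t["min_write_mb_s"])
        if not fast_enough:
            states.add("rw_speed_abnormal")

    temp = info.get("temperature_c")
    if temp is not None and temp > t["temp_first_c"]:
        states.add("serious_temperature_abnormal" if temp > t["temp_second_c"]
                   else "mild_temperature_abnormal")

    bad = info.get("bad_blocks")
    if bad is not None and bad > t["bad_warn"]:
        states.add("serious_bad_block_abnormal" if bad > t["bad_danger"]
                   else "mild_bad_block_abnormal")

    return states or {"normal"}
```

Items that are absent from `info` are simply not judged, which matches operation information that contains only some of the four kinds.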
Step 103: and when the running state is an abnormal early warning state, generating and outputting early warning information corresponding to the running state.
It can be understood that the purpose of this step can be that when the processor determines that the running state of the NVMe hard disk is an abnormal early warning state, the processor reminds the user that the abnormal problem exists in the NVMe hard disk by generating and outputting early warning information corresponding to the abnormal early warning state where the running state is located, so that the user can repair the abnormal problem in the NVMe hard disk in time.
Specifically, when the abnormal early warning state includes six abnormal states including a writing abnormal state, a reading and writing speed abnormal state, a light temperature abnormal state, a serious temperature abnormal state, a light bad block abnormal state and a serious bad block abnormal state, if the running state is in the writing abnormal state, the early warning information generated and output in the step may include writing abnormal information; if the running state is in the abnormal reading and writing speed state, the early warning information generated and output in the step can comprise abnormal reading and writing speed information; if the running state is in a mild temperature abnormal state, the early warning information generated and output in the step can comprise mild temperature abnormal information; if the operation state is in a serious temperature abnormality state, the early warning information generated and output in the step can comprise the serious temperature abnormality information; if the running state is in a slight bad block abnormal state, the early warning information generated and output in the step can comprise the slight bad block abnormal information; if the running state is in a serious bad block abnormal state, the early warning information generated and output in the step can comprise the serious bad block abnormal information.
Correspondingly, for the specific mode of outputting the early warning information corresponding to the operation state by the processor in the step, the method can be set by a designer, as shown in fig. 2, the processor running with the control system in the server can output the early warning information to the management center, so that a user can check the early warning information in the management center; the processor can also output the early warning information to a display for display.
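Generating one message per abnormal state and sending it to several destinations (for instance a management center and a display) can be sketched with pluggable sinks. The message texts, state strings and the `emit_warnings` helper are hypothetical; the sinks here are plain lists standing in for real outputs.

```python
from typing import Callable, Dict, Iterable, List

WARNING_TEXT: Dict[str, str] = {
    "write_abnormal": "writing abnormality",
    "rw_speed_abnormal": "read-write speed abnormality",
    "mild_temperature_abnormal": "mild temperature abnormality",
    "serious_temperature_abnormal": "serious temperature abnormality",
    "mild_bad_block_abnormal": "mild bad block abnormality",
    "serious_bad_block_abnormal": "serious bad block abnormality",
}


def emit_warnings(
    disk: str,
    states: Iterable[str],
    sinks: Iterable[Callable[[str], None]],
) -> List[str]:
    """Build one message per abnormal state and pass each to every sink."""
    sink_list = list(sinks)
    # Sort for a stable order; states without a text (e.g. 'normal') are skipped.
    messages = [f"[{disk}] {WARNING_TEXT[s]}" for s in sorted(states) if s in WARNING_TEXT]
    for message in messages:
        for sink in sink_list:
            sink(message)
    return messages
```

For example, `emit_warnings("nvme0", {"write_abnormal"}, [center.append, screen.append])` would deliver the same message to both lists.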
Further, the method provided by this embodiment may further include: when the running state of the NVMe hard disk is a target abnormal early warning state, the processor executes the processing operation corresponding to the running state; the target abnormal early warning state may be some or all of the abnormal early warning states. If the abnormal early warning state includes six abnormal states, namely a writing abnormal state, a read-write speed abnormal state, a mild temperature abnormal state, a serious temperature abnormal state, a mild bad block abnormal state and a serious bad block abnormal state, the target abnormal early warning state may include the four less serious fault conditions, namely the writing abnormal state, the read-write speed abnormal state, the mild temperature abnormal state and the mild bad block abnormal state, so that these four repairable faults of the NVMe hard disk are removed automatically and online by executing the processing operations corresponding to the running states. For example, when the running state is the writing abnormal state, the processor can perform online upgrade repair on the firmware of the NVMe hard disk; when the running state is the read-write speed abnormal state, the processor can perform a garbage cleaning operation on the NVMe hard disk; when the running state is the mild temperature abnormal state, the processor can adjust the I/O speed of the NVMe hard disk to a preset low-speed value, which is smaller than the preset normal value corresponding to the normal operation state; when the running state is the mild bad block abnormal state, the processor can repair the bad blocks in the NVMe hard disk. Correspondingly, as shown in FIG. 3, when the running state is the serious temperature abnormal state, that is, the temperature information is greater than the second temperature threshold (e.g., threshold 2 in FIG. 3), the processor may generate and output serious temperature abnormality information to remind the user to check the heat dissipation system or the environment of the NVMe hard disk; when the running state is the serious bad block abnormal state, that is, the number of bad blocks is greater than the dangerous bad block number (e.g., the threshold corresponding to the number of bad blocks in FIG. 3), the processor may generate and output serious bad block abnormality information to remind the user to back up the data in advance and replace the hard disk, so that data loss or service interruption is avoided.
Correspondingly, the target abnormal early warning state can be all abnormal early warning states, namely, when the running state of the NVMe hard disk is the abnormal early warning state, the processor executes the processing operation corresponding to the running state. If the running state is a serious temperature abnormal state, the processor can stop the read-write operation of the NVMe hard disk; when the running state is a serious bad block abnormal state, the data in the NVMe hard disk can be backed up to a preset hard disk.
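The notion of a configurable target set can be illustrated by splitting the detected states into those that are processed automatically and those that are only reported. The set contents follow the four repairable states named in the text; the names `REPAIRABLE` and `split_by_target` are assumptions.

```python
from typing import FrozenSet, Iterable, Set, Tuple

# The four less serious states the text proposes as one possible target set.
REPAIRABLE: FrozenSet[str] = frozenset({
    "write_abnormal",
    "rw_speed_abnormal",
    "mild_temperature_abnormal",
    "mild_bad_block_abnormal",
})


def split_by_target(
    states: Iterable[str],
    target: FrozenSet[str] = REPAIRABLE,
) -> Tuple[Set[str], Set[str]]:
    """Return (states to process automatically, states to warn about only)."""
    abnormal = set(states) - {"normal"}
    return abnormal & target, abnormal - target
```

Passing a target set that contains all six states reproduces the alternative in which every abnormal state triggers its processing operation.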
Specifically, when the processor detects that the NVMe hard disk is working above the first temperature threshold (for example, 60 °C), the I/O speed can be reduced. Since the NVMe interface generally runs at the 8 Gb/s of PCIe 3.0, the I/O speed of the NVMe hard disk can be adjusted to the 5 Gb/s of PCIe 2.0 (that is, the preset low-speed value); once the speed is reduced, the chip generates less heat, and after the temperature falls back to a normal value the PCIe 3.0 I/O speed is restored. If adjusting the I/O speed cannot lower the temperature of the NVMe hard disk, or the heat dissipation system has a problem, then when the working temperature of the NVMe hard disk is detected to exceed the second temperature threshold (for example, 80 °C), the read-write operation of the NVMe hard disk can be stopped and serious temperature abnormality information can be output to remind the user to check the heat dissipation system or the environment, thereby avoiding hardware damage to the NVMe hard disk caused by high temperature.
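The two-threshold temperature handling can be illustrated with a small sketch. It assumes the example thresholds of 60 °C and 80 °C from the description; the names `IoMode` and `next_io_mode` are hypothetical, and the sketch only selects a mode. How the link speed is actually changed, and whether a stopped disk resumes automatically, are not specified in the text and are left out.

```python
from enum import Enum

FIRST_TEMP_THRESHOLD_C = 60.0   # example value from the description
SECOND_TEMP_THRESHOLD_C = 80.0  # example value from the description


class IoMode(Enum):
    NORMAL_SPEED = "normal"  # e.g. the PCIe 3.0 speed (preset normal value)
    LOW_SPEED = "low"        # e.g. the PCIe 2.0 speed (preset low-speed value)
    STOPPED = "stopped"      # read-write operations halted


def next_io_mode(temperature_c: float) -> IoMode:
    """Select the I/O mode for the latest temperature reading.

    Comparisons are strict ("greater than"), matching the description.
    """
    if temperature_c > SECOND_TEMP_THRESHOLD_C:
        return IoMode.STOPPED
    if temperature_c > FIRST_TEMP_THRESHOLD_C:
        return IoMode.LOW_SPEED
    return IoMode.NORMAL_SPEED
```

Because the function is stateless, a reading that falls back to the first threshold or below selects `NORMAL_SPEED` again, which corresponds to restoring the PCIe 3.0 speed once the temperature has returned to a normal value.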
The longer the NVMe hard disk is used, the more data has been written to it; once the whole space has been written, old blocks must be erased before new data can be written, which reduces the writing speed. When the processor detects that the read-write speed information of the NVMe hard disk is not greater than the preset read-write speed, the processor may perform a garbage cleaning operation on the NVMe hard disk. Garbage cleaning erases in advance the space that has been used and is now marked as holding no valid data, so that a later write needs only the write operation itself, which improves the writing speed.
When the processor detects that the writing state information of the NVMe hard disk is not a writable state, that is, the NVMe hard disk cannot write data, the processor may determine that the firmware of the NVMe hard disk is abnormal and repair it by upgrading the firmware online. For example, the processor may match the firmware corresponding to the NVMe hard disk according to the VPD (Vital Product Data) information of the NVMe hard disk, download the firmware into the NVMe hard disk, and activate the firmware. For instance, the processor may perform the firmware update through the open-source command-line tool nvme-cli: it downloads the firmware (FW) into the DRAM (dynamic random access memory) of the NVMe hard disk with the download command and then sends a firmware command to activate the firmware, thereby completing the online upgrade repair of the firmware.
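As a rough illustration of the download-then-activate sequence, the sketch below builds the two nvme-cli invocations, `nvme fw-download` and `nvme fw-commit`. The device path, firmware file path, slot and commit action are assumptions supplied by the caller; suitable slot and action values depend on the drive and should be taken from its documentation. The helper names are hypothetical, and matching the firmware image from VPD information is not shown.

```python
import subprocess
from typing import List


def build_firmware_update_commands(device: str, firmware_path: str,
                                   slot: int, action: int) -> List[List[str]]:
    """Build the nvme-cli argument lists for a download followed by a commit."""
    # Step 1: transfer the firmware image to the controller.
    download = ["nvme", "fw-download", device, f"--fw={firmware_path}"]
    # Step 2: commit (activate) the downloaded image in the given slot.
    commit = ["nvme", "fw-commit", device, f"--slot={slot}", f"--action={action}"]
    return [download, commit]


def run_firmware_update(device: str, firmware_path: str,
                        slot: int, action: int) -> None:
    """Run both commands in order; raises CalledProcessError if either fails."""
    for argv in build_firmware_update_commands(device, firmware_path, slot, action):
        subprocess.run(argv, check=True)
```

Keeping command construction separate from execution lets the argument lists be inspected or logged before anything is sent to the drive.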
As the bad blocks of the NVMe hard disk gradually increase, the available space of the whole disk shrinks and garbage cleaning runs more often; the growth in bad blocks also affects the stability of the hard disk. Therefore, when the processor detects that the running state of the NVMe hard disk is the mild bad block abnormal state, it may repair the bad blocks in the NVMe hard disk, so that false bad blocks caused by abnormal power failure and similar causes are erased through the bad block repair and the number of usable blocks increases. Because the number of bad blocks grows with use and wear, when the processor detects that the running state of the NVMe hard disk is the serious bad block abnormal state, it may output serious bad block abnormality information to remind the user to replace the disk with a new one and so avoid data loss or service interruption; the processor may also back up the data in the NVMe hard disk to a preset hard disk automatically, which covers the case where the user cannot replace the disk in time.
In the embodiment of the invention, when the running state of the NVMe hard disk is an abnormal early warning state, early warning information corresponding to the running state is generated and output, so that a monitored abnormality can be flagged in time before the NVMe hard disk fails completely and potential risks can be identified promptly, avoiding serious faults such as downtime and data loss; repairable faults of the NVMe hard disk can thus be removed in time, which prevents small faults from accumulating into large, unrepairable ones.
Referring to fig. 4, fig. 4 is a block diagram illustrating an operation status monitoring device for an NVMe hard disk according to an embodiment of the present invention. The apparatus may include:
the information acquisition module 10 is used for acquiring the operation information of the NVMe hard disk; the running information comprises at least one of temperature information, read-write speed information, writing state information and bad block number information;
the state determining module 20 is configured to determine an operation state of the NVMe hard disk according to the operation information; the operation state comprises a normal operation state and an abnormal early warning state;
and the early warning module 30 is used for generating and outputting early warning information corresponding to the operation state when the operation state is an abnormal early warning state.
Optionally, when the operation information includes writing state information, the state determining module 20 may include:
the first determining submodule is used for judging whether the writing state information is a writable state; if it is not the writable state, determining that the running state is the writing abnormal state; the abnormal early warning state includes the writing abnormal state.
Optionally, when the operation information includes read-write speed information, the state determining module 20 may include:
the second determining submodule is used for judging whether the read-write speed information is greater than a preset read-write speed; if the read-write speed information is not greater than the preset read-write speed, determining that the running state is the read-write speed abnormal state; the abnormal early warning state includes the read-write speed abnormal state.
Optionally, when the operation information includes temperature information, the state determining module 20 may include:
the temperature judging sub-module is used for judging whether the temperature information is larger than a first temperature threshold value or not;
the third determining submodule is used for judging whether the temperature information is greater than the second temperature threshold if the temperature information is greater than the first temperature threshold; if the temperature information is not greater than the second temperature threshold, determining that the running state is the mild temperature abnormal state; if the temperature information is greater than the second temperature threshold, determining that the running state is the serious temperature abnormal state; the second temperature threshold is greater than the first temperature threshold, and the abnormal early warning state includes the mild temperature abnormal state and the serious temperature abnormal state.
Optionally, when the running information includes bad block number information, the state determining module 20 may include:
the bad block judging sub-module is used for judging whether the number information of the bad blocks is larger than the number of the early warning bad blocks;
a fourth determining sub-module, configured to determine whether the bad block number information is greater than the dangerous bad block number if the bad block number information is greater than the early warning bad block number; if the bad block number information is not greater than the dangerous bad block number, determining that the running state is the mild bad block abnormal state; if the bad block number information is greater than the dangerous bad block number, determining that the running state is the serious bad block abnormal state; the dangerous bad block number is greater than the early warning bad block number.
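The four determinations made by the state determining module can be sketched together as below. This is a minimal illustration rather than the module itself: the class and field names are hypothetical, the temperature defaults reuse the 60 °C and 80 °C examples from the description, and the speed and bad block thresholds are placeholder values invented for the example. An empty result stands for the normal operation state.

```python
from dataclasses import dataclass
from enum import Enum
from typing import List, Optional


class State(Enum):
    WRITE_ABNORMAL = "write_abnormal"
    RW_SPEED_ABNORMAL = "rw_speed_abnormal"
    MILD_TEMPERATURE_ABNORMAL = "mild_temperature_abnormal"
    SERIOUS_TEMPERATURE_ABNORMAL = "serious_temperature_abnormal"
    MILD_BAD_BLOCK_ABNORMAL = "mild_bad_block_abnormal"
    SERIOUS_BAD_BLOCK_ABNORMAL = "serious_bad_block_abnormal"


@dataclass
class OperationInfo:
    """Operation information; any field may be absent (None)."""
    temperature_c: Optional[float] = None
    rw_speed_mb_s: Optional[float] = None
    writable: Optional[bool] = None
    bad_blocks: Optional[int] = None


@dataclass
class Thresholds:
    first_temp_c: float = 60.0         # example from the description
    second_temp_c: float = 80.0        # example from the description
    min_rw_speed_mb_s: float = 500.0   # placeholder, not from the source
    warning_bad_blocks: int = 50       # placeholder, not from the source
    dangerous_bad_blocks: int = 200    # placeholder, not from the source


def determine_states(info: OperationInfo, th: Thresholds) -> List[State]:
    """Return every abnormal state indicated by the available information."""
    states: List[State] = []
    if info.writable is not None and not info.writable:
        states.append(State.WRITE_ABNORMAL)
    if info.rw_speed_mb_s is not None and info.rw_speed_mb_s <= th.min_rw_speed_mb_s:
        states.append(State.RW_SPEED_ABNORMAL)
    if info.temperature_c is not None and info.temperature_c > th.first_temp_c:
        serious = info.temperature_c > th.second_temp_c
        states.append(State.SERIOUS_TEMPERATURE_ABNORMAL if serious
                      else State.MILD_TEMPERATURE_ABNORMAL)
    if info.bad_blocks is not None and info.bad_blocks > th.warning_bad_blocks:
        serious = info.bad_blocks > th.dangerous_bad_blocks
        states.append(State.SERIOUS_BAD_BLOCK_ABNORMAL if serious
                      else State.MILD_BAD_BLOCK_ABNORMAL)
    return states
```

Returning a list is a design choice of this sketch: since the operation information may contain several kinds of data at once, more than one abnormal state can be reported for a single sample.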
Optionally, the apparatus may further include:
and the online repair module is used for executing the processing operation corresponding to the running state when the running state is an abnormal early warning state.
Optionally, when the abnormal early warning state includes a writing abnormal state, a read-write speed abnormal state, a mild temperature abnormal state, a serious temperature abnormal state, a mild bad block abnormal state and a serious bad block abnormal state, the online repair module may include:
the firmware repairing sub-module is used for carrying out online upgrade repair on the firmware of the NVMe hard disk if the running state is the writing abnormal state;
the garbage cleaning sub-module is used for performing a garbage cleaning operation on the NVMe hard disk if the running state is the read-write speed abnormal state;
the first cooling sub-module is used for adjusting the I/O speed of the NVMe hard disk to a preset low-speed value if the running state is the mild temperature abnormal state; the preset low-speed value is smaller than a preset normal value corresponding to the normal running state;
the second cooling sub-module is used for stopping the read-write operation of the NVMe hard disk if the running state is the serious temperature abnormal state;
the bad block repair sub-module is used for repairing bad blocks in the NVMe hard disk if the running state is the mild bad block abnormal state;
and the backup sub-module is used for backing up the data in the NVMe hard disk to a preset hard disk if the running state is the serious bad block abnormal state.
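The one-to-one mapping between abnormal states and sub-modules listed above lends itself to a dispatch table. The sketch below is illustrative only: the state keys and handler names are hypothetical, and each handler is a placeholder that merely returns a description of the operation instead of touching a drive.

```python
from typing import Callable, Dict

Handler = Callable[[str], str]


# Placeholder handlers: each stands in for one sub-module.
def upgrade_firmware(device: str) -> str:
    return f"{device}: firmware upgraded online"


def clean_garbage(device: str) -> str:
    return f"{device}: garbage cleaning performed"


def lower_io_speed(device: str) -> str:
    return f"{device}: I/O speed set to preset low-speed value"


def stop_read_write(device: str) -> str:
    return f"{device}: read-write operations stopped"


def repair_bad_blocks(device: str) -> str:
    return f"{device}: bad blocks repaired"


def back_up_data(device: str) -> str:
    return f"{device}: data backed up to preset hard disk"


REPAIR_TABLE: Dict[str, Handler] = {
    "write_abnormal": upgrade_firmware,
    "rw_speed_abnormal": clean_garbage,
    "mild_temperature_abnormal": lower_io_speed,
    "serious_temperature_abnormal": stop_read_write,
    "mild_bad_block_abnormal": repair_bad_blocks,
    "serious_bad_block_abnormal": back_up_data,
}


def repair_online(state: str, device: str) -> str:
    """Run the processing operation registered for the given running state."""
    handler = REPAIR_TABLE.get(state)
    if handler is None:
        return f"{device}: no processing operation for state {state!r}"
    return handler(device)
```

With a table like this, restricting automatic handling to a subset of states only requires removing entries, which mirrors the optional nature of the target abnormal early warning state.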
Optionally, the firmware repairing sub-module may be specifically configured to match the firmware corresponding to the NVMe hard disk according to the vital product data (VPD) information of the NVMe hard disk, download the firmware into the NVMe hard disk, and activate the firmware.
In this embodiment, when the operation state of the NVMe hard disk is an abnormal early warning state, the early warning module 30 generates and outputs early warning information corresponding to the operation state, so that a monitored abnormality can be flagged in time before the NVMe hard disk fails completely and potential risks can be identified promptly, avoiding serious faults such as downtime and data loss; repairable faults of the NVMe hard disk can thus be removed in time, which prevents small faults from accumulating into large, unrepairable ones.
Referring to fig. 5, fig. 5 is a schematic structural diagram of an operation status monitoring device for an NVMe hard disk according to an embodiment of the present invention. The device 1 may comprise:
a memory 11 for storing a computer program; the processor 12 is configured to implement the steps of the method for monitoring the operation state of the NVMe hard disk according to the above embodiment when executing the computer program.
The device 1 may comprise a memory 11, a processor 12 and a bus 13.
The memory 11 includes at least one type of readable storage medium including flash memory, a hard disk, a multimedia card, a card type memory (e.g., SD or DX memory, etc.), a magnetic memory, a magnetic disk, an optical disk, etc. The memory 11 may in some embodiments be an internal storage unit of the device 1. The memory 11 may in other embodiments also be an external storage device of the device 1, such as a plug-in hard disk provided on a server, a Smart Media Card (SMC), a Secure Digital (SD) Card, a Flash Card or the like. Further, the memory 11 may also include both an internal storage unit and an external storage device of the device 1. The memory 11 may be used not only to store application software installed in the device 1 and various types of data, such as the code of a program that executes the method for monitoring the running state of the NVMe hard disk, but also to temporarily store data that has been output or is to be output.
The processor 12 may be a central processing unit (Central Processing Unit, CPU), controller, microcontroller, microprocessor or other data processing chip in some embodiments for running program codes or processing data stored in the memory 11, such as code of a program for executing the running status monitoring method of the NVMe hard disk, etc.
The bus 13 may be a peripheral component interconnect standard (peripheral component interconnect, PCI) bus, or an extended industry standard architecture (extended industry standard architecture, EISA) bus, among others. The bus may be classified as an address bus, a data bus, a control bus, etc. For ease of illustration, only one thick line is shown in fig. 5, but not only one bus or one type of bus.
Further, the device may also comprise a network interface 14, which network interface 14 may optionally comprise a wired interface and/or a wireless interface (e.g. WI-FI interface, bluetooth interface, etc.), typically used to establish a communication connection between the device 1 and other electronic devices.
Optionally, the device 1 may further comprise a user interface 15; the user interface 15 may comprise a display and an input unit such as keys, and may optionally further comprise a standard wired interface and a wireless interface. Optionally, in some embodiments, the display may be an LED display, a liquid crystal display, a touch-sensitive liquid crystal display, an OLED (Organic Light-Emitting Diode) touch device, or the like. The display may also be referred to as a display screen or display unit, as appropriate, and is used for displaying information processed in the device 1 and for displaying a visual user interface.
Fig. 5 shows only a device 1 with components 11-15, it being understood by a person skilled in the art that the structure shown in fig. 5 does not constitute a limitation of the device 1, and may comprise fewer or more components than shown, or may combine certain components, or a different arrangement of components.
In addition, the embodiment of the invention also discloses a computer readable storage medium, and the computer readable storage medium stores a computer program which realizes the steps of the operation state monitoring method of the NVMe hard disk provided by the embodiment when being executed by a processor.
Wherein the storage medium may include: a U-disk, a removable hard disk, a Read-Only Memory (ROM), a random access Memory (Random Access Memory, RAM), a magnetic disk, or an optical disk, or other various media capable of storing program codes.
In the description, the embodiments are described in a progressive manner; each embodiment focuses on its differences from the other embodiments, and the same or similar parts of the embodiments may be referred to one another. The apparatus, device and computer-readable storage medium of the embodiments are described more briefly because they correspond to the method of the embodiments; for the relevant details, refer to the description of the method.
The method, the device and the equipment for monitoring the running state of the NVMe hard disk are described in detail. The principles and embodiments of the present invention have been described herein with reference to specific examples, the description of which is intended only to facilitate an understanding of the method of the present invention and its core ideas. It should be noted that it will be apparent to those skilled in the art that various modifications and adaptations of the invention can be made without departing from the principles of the invention and these modifications and adaptations are intended to be within the scope of the invention as defined in the following claims.

Claims (6)

1. The method for monitoring the running state of the NVMe hard disk is characterized by comprising the following steps of:
acquiring operation information of an NVMe hard disk; wherein the operation information comprises at least one of temperature information, read-write speed information, writing state information and bad block number information;
determining the running state of the NVMe hard disk according to the running information; the operation state comprises a normal operation state and an abnormal early warning state;
when the operation information includes the temperature information, determining, according to the operation information, an operation state of the NVMe hard disk includes: judging whether the temperature information is larger than a first temperature threshold value or not; if yes, judging whether the temperature information is larger than a second temperature threshold value or not; wherein the second temperature threshold is greater than the first temperature threshold; if the temperature information is not greater than the second temperature threshold, determining that the running state is a mild temperature abnormal state; if the temperature information is greater than the second temperature threshold, determining that the running state is a serious temperature abnormal state; wherein the abnormal early warning state includes the mild temperature abnormal state and the serious temperature abnormal state;
when the operation information includes the bad block number information, determining the operation state of the NVMe hard disk according to the operation information includes: judging whether the bad block number information is larger than the early warning bad block number or not; if yes, judging whether the bad block number information is larger than the dangerous bad block number or not; wherein the dangerous bad block number is larger than the early warning bad block number; if the bad block number information is not greater than the dangerous bad block number, determining that the running state is a mild bad block abnormal state; if the bad block number information is greater than the dangerous bad block number, determining that the running state is a serious bad block abnormal state;
when the running state is the abnormal early warning state, generating and outputting early warning information corresponding to the running state;
when the abnormal early warning state is a target abnormal early warning state, performing online removal of the fault of the NVMe hard disk by executing a processing operation corresponding to the running state, wherein the target abnormal early warning state comprises: a writing abnormal state, a read-write speed abnormal state, the mild temperature abnormal state and the mild bad block abnormal state;
if the running state is the writing abnormal state, the online removal of the fault of the NVMe hard disk by executing the processing operation corresponding to the running state includes: performing online upgrade repair on the firmware of the NVMe hard disk;
if the running state is the read-write speed abnormal state, the online removal of the fault of the NVMe hard disk by executing the processing operation corresponding to the running state includes: performing a garbage cleaning operation on the NVMe hard disk;
if the running state is the mild abnormal temperature state, the online fault elimination of the NVMe hard disk by executing the processing operation corresponding to the running state includes: the I/O speed of the NVMe hard disk is adjusted to a preset low-speed value, wherein the preset low-speed value is smaller than a preset normal value corresponding to a normal running state; if the running state is the serious temperature abnormal state, stopping the read-write operation of the NVMe hard disk;
if the running state is the mild bad block abnormal state, the online fault elimination of the NVMe hard disk by executing the processing operation corresponding to the running state includes: the false bad blocks are erased through bad block repair, and bad blocks in the NVMe hard disk are repaired; and if the running state is the serious bad block abnormal state, backing up the data in the NVMe hard disk to a preset hard disk.
2. The method for monitoring an operation state of an NVMe hard disk according to claim 1, wherein when the operation information includes the writing state information, determining the operation state of the NVMe hard disk according to the operation information includes:
judging whether the writing state information is in a writable state or not;
if not, determining the running state as a writing abnormal state; wherein the abnormal early warning state includes the writing abnormal state.
3. The method for monitoring the operation state of the NVMe hard disk according to claim 1, wherein when the operation information includes the read-write speed information, determining the operation state of the NVMe hard disk according to the operation information includes:
judging whether the read-write speed information is larger than a preset read-write speed or not;
if not, determining the running state as a read-write speed abnormal state; wherein the abnormal early warning state includes the read-write speed abnormal state.
4. The method for monitoring the operation state of the NVMe hard disk according to claim 1, wherein the online upgrade repair of the firmware of the NVMe hard disk comprises:
matching the firmware corresponding to the NVMe hard disk according to the vital product data information of the NVMe hard disk;
downloading the firmware into the NVMe hard disk and activating the firmware.
5. An operation state monitoring device of an NVMe hard disk, comprising:
the information acquisition module is used for acquiring the operation information of the NVMe hard disk; wherein the operation information comprises at least one of temperature information, read-write speed information, writing state information and bad block number information;
the state determining module is used for determining the operation state of the NVMe hard disk according to the operation information; the operation state comprises a normal operation state and an abnormal early warning state;
when the operation information includes the temperature information, the state determining module is specifically configured to: judging whether the temperature information is larger than a first temperature threshold value or not; if yes, judging whether the temperature information is larger than a second temperature threshold value or not; wherein the second temperature threshold is greater than the first temperature threshold; if the temperature information is not greater than the second temperature threshold, determining that the running state is a mild temperature abnormal state; if the temperature information is greater than the second temperature threshold, determining that the running state is a serious temperature abnormal state; wherein the abnormal early warning state includes the mild temperature abnormal state and the serious temperature abnormal state;
when the operation information includes the bad block number information, the state determining module is specifically configured to: judging whether the bad block number information is larger than the early warning bad block number or not; if yes, judging whether the bad block number information is larger than the dangerous bad block number or not; wherein the dangerous bad block number is larger than the early warning bad block number; if the bad block number information is not greater than the dangerous bad block number, determining that the running state is a mild bad block abnormal state; if the bad block number information is greater than the dangerous bad block number, determining that the running state is a serious bad block abnormal state;
the early warning module is used for generating and outputting early warning information corresponding to the running state when the running state is the abnormal early warning state;
the online repair module is configured to, when the abnormal early warning state is a target abnormal early warning state, perform online removal of the fault of the NVMe hard disk by executing a processing operation corresponding to the running state, where the target abnormal early warning state includes: a writing abnormal state, a read-write speed abnormal state, the mild temperature abnormal state and the mild bad block abnormal state;
the online repair module is specifically configured to: if the running state is the writing abnormal state, perform online upgrade repair on the firmware of the NVMe hard disk;
the online repair module is also specifically configured to: if the running state is the read-write speed abnormal state, perform a garbage cleaning operation on the NVMe hard disk;
the online repair module is also specifically configured to: if the running state is the mild temperature abnormal state, adjusting the I/O speed of the NVMe hard disk to a preset low-speed value, wherein the preset low-speed value is smaller than a preset normal value corresponding to a normal running state; if the running state is the serious temperature abnormal state, stopping the read-write operation of the NVMe hard disk;
the online repair module is also specifically configured to: if the running state is the mild bad block abnormal state, the false bad block is erased through bad block repair, and the bad block in the NVMe hard disk is repaired; and if the running state is the serious bad block abnormal state, backing up the data in the NVMe hard disk to a preset hard disk.
6. An operation state monitoring device for an NVMe hard disk, comprising:
a memory for storing a computer program;
a processor for implementing the steps of the method for monitoring the operation state of the NVMe hard disk according to any one of claims 1 to 4 when executing the computer program.
CN202011453229.2A 2020-12-11 2020-12-11 Method, device and equipment for monitoring running state of NVMe hard disk Active CN112506744B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011453229.2A CN112506744B (en) 2020-12-11 2020-12-11 Method, device and equipment for monitoring running state of NVMe hard disk

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202011453229.2A CN112506744B (en) 2020-12-11 2020-12-11 Method, device and equipment for monitoring running state of NVMe hard disk

Publications (2)

Publication Number Publication Date
CN112506744A CN112506744A (en) 2021-03-16
CN112506744B true CN112506744B (en) 2023-08-25

Family

ID=74973296

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011453229.2A Active CN112506744B (en) 2020-12-11 2020-12-11 Method, device and equipment for monitoring running state of NVMe hard disk

Country Status (1)

Country Link
CN (1) CN112506744B (en)

Families Citing this family (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113190179B (en) * 2021-05-26 2022-02-11 北京自由猫科技有限公司 Method for prolonging service life of mechanical hard disk, storage device and system
CN113625957B (en) * 2021-06-30 2024-02-13 济南浪潮数据技术有限公司 Method, device and equipment for detecting hard disk faults
CN113556404A (en) * 2021-08-03 2021-10-26 广东九博科技股份有限公司 Communication method and system between single disks in equipment
CN113901530B (en) * 2021-09-10 2024-01-09 苏州浪潮智能科技有限公司 Method, device and equipment for early warning protection of defensive property of hard disk and readable medium

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103218173A (en) * 2013-03-27 2013-07-24 华为技术有限公司 Method and device for storage control
CN106201801A (en) * 2016-07-18 2016-12-07 联想(北京)有限公司 A kind of electronic equipment and error-reporting method
CN107943652A (en) * 2017-11-22 2018-04-20 郑州云海信息技术有限公司 Hard disk control method, device and readable storage medium storing program for executing in a kind of storage system
CN109408328A (en) * 2018-10-08 2019-03-01 郑州云海信息技术有限公司 A kind of monitoring method, device and the equipment of hard disk health status
CN110704228A (en) * 2019-09-29 2020-01-17 至誉科技(武汉)有限公司 Solid state disk exception handling method and system
CN111858244A (en) * 2020-07-16 2020-10-30 苏州浪潮智能科技有限公司 Hard disk monitoring method, system, device and medium

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
EP3757810B1 (en) * 2016-12-28 2023-04-05 Huawei Technologies Co., Ltd. Packet forwarding method, device, and system in nvme over fabric

Also Published As

Publication number Publication date
CN112506744A (en) 2021-03-16

Similar Documents

Publication Publication Date Title
CN112506744B (en) Method, device and equipment for monitoring running state of NVMe hard disk
US9875036B2 (en) Concurrent upgrade and backup of non-volatile memory
US20090100287A1 (en) Monitoring Apparatus and a Monitoring Method Thereof
US7921341B2 (en) System and method for reproducing memory error
US20120271983A1 (en) Computing device and data synchronization method
US8423729B2 (en) Part information restoration method, part information management method and electronic apparatus
CN104021058A (en) Method for quickly starting test board card
CN109445561B (en) Power failure protection system and method applied to server and server
CN115525486A (en) SSD SMBUS temperature alarm and low power consumption state test verification method and device
CN106935272B (en) Method and device for opening eMMC back door debugging
US9411666B2 (en) Anticipatory protection of critical jobs in a computing system
CN104657232A (en) BIOS automatic recovery system and BIOS automatic recovery method
US8024604B2 (en) Information processing apparatus and error processing
CN210721440U (en) PCIE card abnormity recovery device, PCIE card and PCIE expansion system
CN110825547B (en) PCIE card exception recovery device and method based on SMBUS
CN115242753B (en) Network card MAC address burning method, system, electronic equipment and storage medium
CN114758715B (en) Method, device and equipment for lighting hard disk fault lamp and readable storage medium
CN115795568A (en) Liquid cooling server liquid leakage protection method, device, equipment and storage medium
JP2016146071A (en) Hard disk drive device diagnosis device and copying device with hard disk drive device diagnosis function
CN104678292A (en) Test method and device for CPLD (Complex Programmable Logic Device)
CN114218001A (en) Fault repairing method and device, electronic equipment and readable storage medium
CN112506817A (en) Method and equipment for controlling hard disk backboard LED
JP2001256005A (en) Hard disk device
CN103106089A (en) Upgrading method and system for intelligent platform management controller
US11983304B2 (en) On-board secure storage system for detecting unauthorized access or failure and performing predetermined processing

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant