CN114924929A - NVMe hard disk fault early warning method, system and computer equipment - Google Patents

Publication number
CN114924929A
Authority
CN
China
Prior art keywords
hard disk
nvme hard
early warning
state information
nvme
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202210429927.1A
Other languages
Chinese (zh)
Inventor
黄凯
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Suzhou Inspur Intelligent Technology Co Ltd
Original Assignee
Suzhou Inspur Intelligent Technology Co Ltd
Application filed by Suzhou Inspur Intelligent Technology Co Ltd
Priority to CN202210429927.1A
Publication of CN114924929A
Legal status: Pending

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00 - Error detection; Error correction; Monitoring
    • G06F11/30 - Monitoring
    • G06F11/3003 - Monitoring arrangements specially adapted to the computing system or computing system component being monitored
    • G06F11/3037 - Monitoring arrangements specially adapted to the computing system or computing system component being monitored where the computing system component is a memory, e.g. virtual memory, cache
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00 - Error detection; Error correction; Monitoring
    • G06F11/30 - Monitoring
    • G06F11/32 - Monitoring with visual or acoustical indication of the functioning of the machine
    • G06F11/324 - Display of status information
    • G06F11/327 - Alarm or error message display

Abstract

The application relates to an NVMe hard disk fault early warning method, system, and computer equipment. The method comprises the following steps: a central processing unit acquires state information of an NVMe hard disk and, if the state information is abnormal, resets the NVMe hard disk; a complex programmable logic device sends the state information of the NVMe hard disk and the number of resets of the NVMe hard disk to a baseboard management controller; and if the number of resets of the NVMe hard disk is not less than a first preset value, a fault early warning is triggered. The baseboard management controller can thus monitor the operating parameters and health of the NVMe hard disk in real time, faults of the NVMe hard disk can be detected and predicted, the operating risk of the NVMe hard disk can be better evaluated, operating faults can be resolved more quickly, and high availability of the whole system can be achieved while reducing the consumption of manpower and material resources.

Description

NVMe hard disk fault early warning method, system and computer equipment
Technical Field
The application relates to the technical field of hardware monitoring, in particular to an NVMe hard disk fault early warning method, system and computer equipment.
Background
With the spread of cloud computing and data centers, server deployments have reached the ten-thousand scale, and the number of NVMe hard disks in use is larger still. Although the failure rate of an individual NVMe hard disk is low, the number of failures rises as the number of NVMe hard disks grows and their time in service increases. It is therefore very necessary to predict NVMe hard disk failures with an automatic fault early warning method, so that an NVMe hard disk that is about to fail can be replaced in time and the service quality of the NVMe hard disks, and of the whole system, can be improved.
In the prior art, an NVMe hard disk can only record its own information by means of self-monitoring, analysis and reporting technology (SMART). The user is notified of fault information only when a fault has occurred and the system is started, and no automatic prediction is provided. When the number of NVMe hard disks is large and they have been in service for a long time, this harms the service quality of the NVMe hard disks and, in severe cases, can even cause data loss. To reduce the adverse effects of NVMe hard disk failures, operation and maintenance staff must intervene to monitor the operating state of the NVMe hard disks in real time, but manual maintenance increases labor and time costs.
Therefore, there is an urgent need for an NVMe hard disk fault early warning method, system, and computer equipment that can monitor the working efficiency of a server's NVMe hard disks in real time and issue a fault early warning before an NVMe hard disk fails.
Disclosure of Invention
Therefore, to solve the above technical problem, it is necessary to provide an NVMe hard disk fault early warning method, system, and computer equipment, which can automatically reset an egress link port.
In one aspect, an NVMe hard disk fault early warning method is provided. The method comprises: a central processing unit acquires state information of an NVMe hard disk and, if the state information is abnormal, resets the NVMe hard disk and obtains the number of resets of the NVMe hard disk; a complex programmable logic device sends the state information of the NVMe hard disk and the number of resets to a baseboard management controller; and if the number of resets of the NVMe hard disk is not less than a first preset value, a fault early warning is triggered.
Further, the method further comprises: the baseboard management controller acquires abnormal state information of the NVMe hard disk and generates a log file; the number of occurrences of each type of abnormal state information of the NVMe hard disk is obtained based on the log file and preset abnormal state information types; and if the number of occurrences of any type of abnormal state information is not less than a second preset value, a fault early warning corresponding to that abnormal state information is triggered.
Further, before the fault early warning is triggered, the method further comprises: detecting whether the VMD driver is in an available state; if the VMD driver is available, the central processing unit sends a fault early warning signal to the complex programmable logic device to carry out VPP fault early warning; and if the VMD driver is unavailable, the baseboard management controller sends a fault early warning signal to the complex programmable logic device to carry out fault early warning by the baseboard management controller.
Further, when the VPP fault early warning is carried out, the method further comprises: the central processing unit sends the fault early warning signal to the complex programmable logic device through the VPP IIC; and the complex programmable logic device parses the fault early warning signal, sets the VMD driver to a special fault early warning state, and carries out the VPP fault early warning.
Further, the method further comprises: providing, for the NVMe hard disk, a plurality of third warning devices, one corresponding to each type of abnormal state information; and if the number of occurrences of any type of abnormal state information of the NVMe hard disk is not less than the second preset value, triggering the corresponding third warning device to issue the fault early warning for that abnormal state information.
Further, the NVMe hard disk state information comprises NVMe hard disk position information and NVMe hard disk working state information. Resetting the NVMe hard disk by the central processing unit comprises: monitoring whether the working state information of the NVMe hard disk is abnormal and, if it is abnormal, resetting the NVMe hard disk based on its position information.
Further, obtaining the first preset value comprises: building a trained model based on basic NVMe hard disk information and NVMe hard disk fault information; obtaining, from the trained model, the number of fault early warnings that occur before an NVMe hard disk is damaged; and obtaining the first preset value based on that number, the first preset value being smaller than the number of fault early warnings.
Further, carrying out the fault early warning comprises: the complex programmable logic device detects the in-place (presence) state information of the NVMe hard disk and sends it to the baseboard management controller; the baseboard management controller sends a fault early warning signal to the complex programmable logic device based on the in-place state information of the hard disk; and the complex programmable logic device parses the fault early warning signal and carries out the fault early warning.
In another aspect, an NVMe hard disk fault early warning system is provided. The system comprises a complex programmable logic device, a baseboard management controller, a central processing unit, and an NVMe hard disk. The central processing unit is communicatively connected to the NVMe hard disk and is used to acquire state information of the NVMe hard disk; if the state information is abnormal, the central processing unit is further used to reset the NVMe hard disk and obtain the number of resets, and if the number of resets is not less than a first preset value, to trigger a fault early warning. The complex programmable logic device is communicatively connected to the NVMe hard disk and is used to acquire the state information and the number of resets of the NVMe hard disk. The baseboard management controller is communicatively connected to the complex programmable logic device and is used to receive the state information and the number of resets sent by the complex programmable logic device; if the number of resets is not less than the first preset value, the baseboard management controller is further used to trigger a fault early warning.
Further, the baseboard management controller is further used to acquire abnormal state information of the NVMe hard disk, generate a log file, and obtain the number of occurrences of each type of abnormal state information of the NVMe hard disk based on the log file and preset abnormal state information types; if the number of occurrences of any type of abnormal state information is not less than a second preset value, a fault early warning corresponding to that abnormal state information is triggered.
In a further aspect, a computer-readable storage medium is provided, having a computer program stored thereon, the computer program, when executed by a processor, performing the following steps: a central processing unit acquires state information of an NVMe hard disk and, if the state information is abnormal, resets the NVMe hard disk and obtains the number of resets of the NVMe hard disk; a complex programmable logic device sends the state information of the NVMe hard disk and the number of resets to a baseboard management controller; and if the number of resets of the NVMe hard disk is not less than a first preset value, a fault early warning is triggered.
According to the above NVMe hard disk fault early warning method, system, and computer equipment, the central processing unit acquires state information of the NVMe hard disk and, if the state information is abnormal, resets the NVMe hard disk and obtains the number of resets; the complex programmable logic device sends the state information and the number of resets to the baseboard management controller; and if the number of resets is not less than a first preset value, a fault early warning is triggered. In this way, the baseboard management controller can monitor the operating parameters and health of the NVMe hard disk in real time, faults of the NVMe hard disk can be detected and predicted, the operating risk of the NVMe hard disk can be better evaluated, operating faults can be resolved more quickly, and high availability of the whole system can be achieved while reducing the consumption of manpower and material resources.
Drawings
FIG. 1 is a schematic flow chart of an NVMe hard disk failure early warning method in an embodiment;
FIG. 2 is a schematic flow chart of an NVMe hard disk failure early warning method in an embodiment;
FIG. 3 is a block diagram of an NVMe hard disk failure early warning system in an embodiment;
FIG. 4 is a block diagram of an NVMe hard disk failure early warning system in an embodiment;
FIG. 5 is a block diagram of a computer device in one embodiment.
Detailed Description
In order to make the objects, technical solutions, and advantages of the present application clearer, the present application is described in further detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein merely illustrate the present application and are not intended to limit it.
In the prior art, an NVMe hard disk can only record its own information by means of self-monitoring, analysis and reporting technology (SMART). The user is notified of fault information only when a fault has occurred and the system is started, and no automatic prediction is provided. When the number of NVMe hard disks is large and they have been in service for a long time, this harms the service quality of the NVMe hard disks and, in severe cases, can even cause data loss. To reduce the adverse effects of NVMe hard disk failures, operation and maintenance staff must intervene to monitor the operating state of the NVMe hard disks in real time, but manual maintenance increases labor and time costs.
Example one
Against this background, the invention provides an NVMe hard disk fault early warning method. As shown in FIG. 1 and FIG. 2, the method comprises: a central processing unit acquires state information of an NVMe hard disk and, if the state information is abnormal, resets the NVMe hard disk and obtains the number of resets of the NVMe hard disk, there being at least one NVMe hard disk; a complex programmable logic device sends the state information of the NVMe hard disk and the number of resets to a baseboard management controller; and if the number of resets of the NVMe hard disk is not less than a first preset value, a fault early warning is triggered. It should be understood that the NVMe hard disk is an NVMe SSD.
In this way, a fault early warning is issued for an NVMe hard disk that is likely to fail, and operation and maintenance staff can promptly identify that disk from the warning. This helps them carry out fault early warning, prediction, diagnosis, isolation, and recovery for NVMe hard disks, helps prevent data loss, reduces the consumption of manpower and material resources, and improves system stability.
In one embodiment, each time the baseboard management controller detects, through the complex programmable logic device, that the NVMe hard disk has been reset, it adds 1 to the reset count of that NVMe hard disk in a count stack; once the reset count exceeds the first preset value, a fault early warning is issued. The initial value of the count stack is 0. The fault early warning may take the form of a flashing light, for example by lighting a breathing lamp, or of a buzzer alarm. The form of the warning is not limited; those skilled in the art can choose reasonably according to the actual situation, provided that the warning alerts operation and maintenance staff.
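The counting step can be illustrated with a minimal Python sketch. The class name ResetCounter and its methods are hypothetical and do not come from the application; the comparison uses the "not less than" wording of the method statement, and the default threshold of 15 is the example value given later in the description.

```python
from dataclasses import dataclass

FIRST_PRESET_VALUE = 15  # example value taken from the description


@dataclass
class ResetCounter:
    """Hypothetical controller-side reset counter for one NVMe hard disk."""

    threshold: int = FIRST_PRESET_VALUE
    count: int = 0  # the count starts at 0

    def record_reset(self) -> bool:
        """Add 1 for an observed reset; return True if a warning should fire."""
        self.count += 1
        return self.count >= self.threshold  # "not less than" the preset value


# Small threshold so the behaviour is easy to follow.
counter = ResetCounter(threshold=3)
results = [counter.record_reset() for _ in range(4)]
print(results)
```

With a threshold of 3, the first two resets return False and every reset from the third onward returns True.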
In one embodiment, the method further comprises: the baseboard management controller acquires abnormal state information of the NVMe hard disk and generates a log file; the number of occurrences of each type of abnormal state information of the NVMe hard disk is obtained based on the log file and preset abnormal state information types; and if the number of occurrences of any type of abnormal state information is not less than a second preset value, a fault early warning corresponding to that abnormal state information is triggered. The abnormal state information of the NVMe hard disk includes at least disk dropping and abnormal data transmission. Taking these two as examples, during fault early warning the occurrences of abnormal data transmission and of disk dropping are counted separately. In addition, the abnormal state information of the NVMe hard disk is recorded in log files, which helps operation and maintenance staff reconstruct the running state of the NVMe hard disk and analyze the cause of a fault in depth.
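As a rough sketch of the log-based counting, the following Python code tallies occurrences per category. The log line format, the category names, and the second preset value of 2 are all assumptions made for illustration; the application does not specify any of them.

```python
from collections import Counter

# Hypothetical category names for the two examples named in the text.
PRESET_CATEGORIES = {"disk_drop", "data_transfer_error"}
SECOND_PRESET_VALUE = 2  # assumed value; the application does not fix it


def categories_to_warn(log_lines, threshold=SECOND_PRESET_VALUE):
    """Count each preset category in the log and return those at or above the threshold."""
    counts = Counter()
    for line in log_lines:
        fields = line.split()
        if not fields:
            continue  # skip blank lines
        category = fields[-1]  # assumed format: "<timestamp> <slot> <category>"
        if category in PRESET_CATEGORIES:
            counts[category] += 1
    return sorted(c for c, n in counts.items() if n >= threshold)


log = [
    "t1 slot0 disk_drop",
    "t2 slot0 data_transfer_error",
    "t3 slot0 disk_drop",
    "t4 slot0 unknown_event",
]
warned = categories_to_warn(log)
print(warned)
```

Here only disk dropping occurs twice, so it is the only category returned.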
In one embodiment, the method further comprises: if the central processing unit resets the NVMe hard disk, a first warning device is triggered to carry out fault early warning; if the number of resets of the NVMe hard disk is not less than the first preset value, a second warning device is triggered to carry out fault early warning; and the warning frequency of the second warning device is higher than that of the first warning device. Because the first warning device is triggered whenever the NVMe hard disk is reset, operation and maintenance staff can obtain the running state of the NVMe hard disk in time. When the number of resets is not less than the first preset value, the NVMe hard disk is very likely to fail, so the second warning device, with its higher warning frequency, is used to make the warning more conspicuous and to help operation and maintenance staff find and resolve the problem in time.
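The two-tier selection can be sketched as follows. The device names and the blink frequencies are invented for illustration; the text only requires that the second device warn at a higher frequency than the first.

```python
# Assumed blink frequencies in Hz; only their ordering is taken from the text.
BLINK_HZ = {"first_device": 1.0, "second_device": 4.0}


def select_warning(reset_count, first_preset_value=15):
    """Return the hypothetical warning device to trigger for a given reset count."""
    if reset_count <= 0:
        return None  # no reset has happened, so nothing to warn about
    if reset_count >= first_preset_value:
        return "second_device"  # disk is very likely to fail
    return "first_device"  # any reset triggers the first-level warning


print(select_warning(1), select_warning(15))
```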
In one embodiment, obtaining the first preset value comprises: building a trained model based on basic NVMe hard disk information and NVMe hard disk fault information; obtaining, from the trained model, the number of fault early warnings that occur before an NVMe hard disk is damaged; and obtaining the first preset value based on that number, the first preset value being smaller than the number of fault early warnings. In the present application, the first preset value used for NVMe hard disk fault early warning is 15. Technical staff built a model from the NVMe hard disk fault logs of 38 product projects; probability statistics from the resulting model show that the number of fault early warnings before an NVMe hard disk is damaged lies between 18 and 20. To ensure that a fault early warning can be issued before the NVMe hard disk is damaged, the first preset value is set to 15 in the present application. It should be understood that the present application does not limit the first preset value; those skilled in the art can choose it reasonably based on the project characteristics and the type of NVMe hard disk.
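The relationship between the observed counts and the threshold can be illustrated with a small sketch. It does not reproduce the trained model; it only assumes that the model yields a list of pre-damage counts and that a fixed safety margin is subtracted from the smallest one. The function name and the margin of 3 are assumptions chosen so that the example reproduces the value 15.

```python
def first_preset_value(counts_before_damage, margin=3):
    """Pick a threshold strictly below the smallest observed pre-damage count."""
    if margin < 1:
        raise ValueError("margin must be at least 1")
    value = min(counts_before_damage) - margin
    if value < 1:
        raise ValueError("observed counts are too small for this margin")
    return value


observed = [18, 19, 20]  # range reported in the description
threshold = first_preset_value(observed)
print(threshold)
```

Any margin of at least 1 satisfies the stated requirement that the first preset value be smaller than the observed number of warnings.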
In one embodiment, the method further comprises: providing, for the NVMe hard disk, a plurality of third warning devices, one corresponding to each type of abnormal state information; and if the number of occurrences of any type of abnormal state information of the NVMe hard disk is not less than a second preset value, triggering the corresponding third warning device to issue the fault early warning for that abnormal state information. Again taking abnormal data transmission and disk dropping as examples, a third warning device is provided for each of them. When abnormal data transmission and/or disk dropping occurs, the number of occurrences of each is obtained in real time; if the number of occurrences of abnormal data transmission is not less than the second preset value, the third warning device corresponding to abnormal data transmission is triggered, and if the number of occurrences of disk dropping is not less than the second preset value, the third warning device corresponding to disk dropping is triggered. Signalling the cause of an NVMe hard disk fault through a dedicated warning device helps operation and maintenance staff identify and resolve the fault promptly and efficiently. It should be understood that the specific second preset value is not limited and can be determined by those skilled in the art according to the actual situation.
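A minimal sketch of the per-category warning devices is given below. The class CategoryWarner, the device names, and the threshold of 2 are hypothetical; the sketch only shows one counter and one device per abnormal state type, updated in real time as events arrive.

```python
class CategoryWarner:
    """Hypothetical mapping from abnormal state types to dedicated warning devices."""

    def __init__(self, devices, second_preset_value):
        self.devices = dict(devices)  # category -> device name
        self.threshold = second_preset_value
        self.counts = {category: 0 for category in self.devices}

    def record(self, category):
        """Count one event; return the device to trigger, or None."""
        if category not in self.counts:
            return None  # not a preset category
        self.counts[category] += 1
        if self.counts[category] >= self.threshold:
            return self.devices[category]
        return None


warner = CategoryWarner(
    {"disk_drop": "drop_led", "data_transfer_error": "transfer_led"},
    second_preset_value=2,
)
events = ["disk_drop", "data_transfer_error", "disk_drop"]
triggered = [warner.record(e) for e in events]
print(triggered)
```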
In one embodiment, the NVMe hard disk state information comprises NVMe hard disk position information and NVMe hard disk working state information. Resetting the NVMe hard disk by the central processing unit comprises: monitoring whether the working state information of the NVMe hard disk is abnormal and, if it is abnormal, resetting the NVMe hard disk based on its position information.
In one embodiment, when there are multiple NVMe hard disks, the central processing unit acquires the state information of each NVMe hard disk separately and, if the state information of any NVMe hard disk is abnormal, resets that NVMe hard disk. The complex programmable logic device acquires the state information and the number of resets of each NVMe hard disk and sends them to the baseboard management controller. The baseboard management controller compares the number of resets of each NVMe hard disk with the first preset value and, if the number of resets of an NVMe hard disk is not less than the first preset value, triggers a fault early warning. That is, a fourth warning device can be provided for each NVMe hard disk; when the number of resets of an NVMe hard disk is not less than the first preset value, the fourth warning device provided for that NVMe hard disk carries out the fault early warning, helping operation and maintenance staff identify the failing NVMe hard disk immediately.
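The multi-disk case can be sketched by keeping one reset counter per slot. The class name, the use of an integer slot number as position information, and the small threshold are assumptions for illustration only.

```python
from collections import defaultdict


class MultiDiskMonitor:
    """Hypothetical per-slot reset bookkeeping for several NVMe hard disks."""

    def __init__(self, first_preset_value=15):
        self.threshold = first_preset_value
        self.reset_counts = defaultdict(int)  # slot number -> number of resets

    def on_reset(self, slot):
        """Record that the disk in the given slot was reset."""
        self.reset_counts[slot] += 1

    def slots_to_warn(self):
        """Return the slots whose reset count is not less than the threshold."""
        return sorted(s for s, n in self.reset_counts.items() if n >= self.threshold)


monitor = MultiDiskMonitor(first_preset_value=2)
for slot in (0, 1, 1, 3):
    monitor.on_reset(slot)
print(monitor.slots_to_warn())
```

Only slot 1 has been reset twice, so only its warning device would be triggered.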
In one embodiment, before the fault early warning is triggered, the method further comprises: detecting whether the VMD driver is in an available state; if the VMD driver is available, the central processing unit sends a fault early warning signal to the complex programmable logic device to carry out VPP fault early warning; and if the VMD driver is unavailable, the baseboard management controller sends a fault early warning signal to the complex programmable logic device to carry out fault early warning by the baseboard management controller. It should be understood that if the VMD driver is unavailable, that is, in VMD Disable or PCH AHCI mode, the baseboard management controller can actively issue a fault early warning instruction to the complex programmable logic device through the IIC bus and thereby obtain control of the fault early warning.
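The path selection described here reduces to a single branch, sketched below. The enum and function names are hypothetical, and detecting the actual VMD state is outside the scope of the sketch, so it is passed in as a boolean.

```python
from enum import Enum


class WarningPath(Enum):
    """Hypothetical labels for the two signalling paths described in the text."""

    VPP = "CPU to CPLD over VPP IIC"
    BMC = "BMC to CPLD over IIC bus"


def choose_warning_path(vmd_available: bool) -> WarningPath:
    """Use the VPP path when VMD is available, otherwise fall back to the BMC path."""
    return WarningPath.VPP if vmd_available else WarningPath.BMC


print(choose_warning_path(True).name, choose_warning_path(False).name)
```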
In one embodiment, when the VPP fault early warning is carried out, the method further comprises: the central processing unit sends the fault early warning signal to the complex programmable logic device through the VPP IIC; and the complex programmable logic device parses the fault early warning signal, sets the VMD driver to a special fault early warning state, and carries out a VPP lighting warning. The VMD driver is set to the special fault early warning state to ensure that the fault early warning signal sent by the central processing unit through the VPP IIC is delivered successfully.
In the prior art, the system cannot predict faults of server storage devices. Only when a fault has occurred can the system raise an alarm to notify operation and maintenance staff, the direct cause of the fault is not shown clearly, and only the running state information of the server storage device is recorded in a log file. In the present application, the central processing unit acquires the state information of the NVMe hard disk in real time and resets any NVMe hard disk whose state information is abnormal; the baseboard management controller obtains the state information and the number of resets of the NVMe hard disk through the complex programmable logic device; and the number of resets is compared with the first preset value so that a fault early warning is issued before the NVMe hard disk is damaged. Because operation and maintenance staff are notified before the NVMe hard disk is damaged, the maintenance cost and maintenance time of the system are reduced, the data loss rate is lowered, and the availability of the system is improved. It should be understood that the fault early warning described herein is the same concept as the fault warning shown in FIG. 1 and FIG. 2 of the drawings.
It should be understood that although the steps in the flowcharts of FIG. 1 and FIG. 2 are shown in the order indicated by the arrows, they are not necessarily performed in that order. Unless explicitly stated herein, the steps are not limited to the exact order illustrated and may be performed in other orders. Moreover, at least some of the steps in FIG. 1 and FIG. 2 may include multiple sub-steps or stages, which are not necessarily performed at the same time but may be performed at different times, and which are not necessarily performed sequentially but may be performed in turn or alternately with other steps or with at least some of the sub-steps or stages of other steps.
Example two
In one embodiment, as shown in FIG. 3 and FIG. 4, an NVMe hard disk fault early warning system is provided, comprising a complex programmable logic device, a baseboard management controller, a central processing unit, and an NVMe hard disk. The central processing unit is communicatively connected to the NVMe hard disk and is used to acquire state information of the NVMe hard disk, to reset the NVMe hard disk and obtain the number of resets if the state information is abnormal, and to trigger a fault early warning if the number of resets is not less than a first preset value. The complex programmable logic device is communicatively connected to the NVMe hard disk and is used to acquire the state information and the number of resets of the NVMe hard disk. The baseboard management controller is communicatively connected to the complex programmable logic device and is used to receive the NVMe hard disk state information and the number of resets sent by the complex programmable logic device; if the number of resets is not less than the first preset value, the baseboard management controller is further used to trigger a fault early warning. The baseboard management controller is also communicatively connected to the NVMe hard disk and is used to acquire state information of the NVMe hard disk backplane, for example by reading the operating temperature, part number, and form of the NVMe hard disk backplane.
In one embodiment, the baseboard management controller is further used to acquire abnormal state information of the NVMe hard disk, generate a log file, and obtain the number of occurrences of each type of abnormal state information of the NVMe hard disk based on the log file and preset abnormal state information types; if the number of occurrences of any type of abnormal state information is not less than a second preset value, a fault early warning corresponding to that abnormal state information is triggered.
For specific limitations of the NVMe hard disk fault early warning system, reference may be made to the limitations of the NVMe hard disk fault early warning method above, which are not repeated here. Each module in the NVMe hard disk fault early warning system can be implemented wholly or partly in software, hardware, or a combination of the two. The modules can be embedded in, or independent of, a processor in the computer device in hardware form, or can be stored in a memory of the computer device in software form, so that the processor can call them and execute the corresponding operations.
EXAMPLE III
In one embodiment, a computer device is provided, which may be a server, the internal structure of which may be as shown in Fig. 5. The computer device includes a processor, a memory, a network interface, and a database connected by a system bus. The processor of the computer device provides computing and control capabilities. The memory of the computer device comprises a non-volatile storage medium and an internal memory. The non-volatile storage medium stores an operating system, a computer program, and a database. The internal memory provides an environment for running the operating system and the computer program stored in the non-volatile storage medium. When executed by the processor, the computer program implements an NVMe hard disk fault early warning method.
In one embodiment, a computer-readable storage medium is provided, having a computer program stored thereon; when executed by a processor, the computer program implements the following steps: a central processing unit obtains state information of an NVMe hard disk and, if the state information is abnormal, resets the NVMe hard disk and obtains the reset count of the NVMe hard disk; a complex programmable logic device sends the state information and the reset count of the NVMe hard disk to a baseboard management controller; and if the reset count of the NVMe hard disk is not less than a first preset value, a fault early warning is triggered. The state information of the NVMe hard disk comprises NVMe hard disk position information and NVMe hard disk working state information. Resetting the NVMe hard disk by the central processing unit includes: monitoring whether the working state information of the NVMe hard disk is abnormal and, if it is abnormal, resetting the NVMe hard disk based on the position information of the NVMe hard disk.
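The reset-count check described above can be illustrated with a minimal Python sketch. It is not the implementation of this application: the class name `ResetMonitor`, the slot strings, and the threshold value 3 are assumptions chosen for illustration, and the actual drive reset is only represented by incrementing a counter.

```python
from dataclasses import dataclass, field

# Assumed value for illustration; the text does not give a number
# for the "first preset value".
FIRST_PRESET_VALUE = 3


@dataclass
class ResetMonitor:
    """Hypothetical helper: counts resets per drive position (slot)."""
    threshold: int = FIRST_PRESET_VALUE
    reset_counts: dict = field(default_factory=dict)

    def handle_status(self, slot: str, status_ok: bool) -> bool:
        """Process one status sample.

        Returns True when a fault early warning should be triggered.
        """
        if status_ok:
            return False
        # Abnormal state: the drive at this position would be reset here;
        # this sketch only records that a reset took place.
        self.reset_counts[slot] = self.reset_counts.get(slot, 0) + 1
        # "Not less than the first preset value" is a >= comparison.
        return self.reset_counts[slot] >= self.threshold
```

With `ResetMonitor(threshold=2)`, the first abnormal sample for a slot only increments the counter, and the second one returns `True`.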
In one embodiment, the computer program, when executed by the processor, implements the following steps: the baseboard management controller acquires the abnormal state information of the NVMe hard disk and generates a log file; the occurrence frequency of each abnormal state information type of the NVMe hard disk is acquired based on the log file and preset abnormal state information types; and if the occurrence frequency of any abnormal state information type of the NVMe hard disk is not less than a second preset value, the fault early warning corresponding to that abnormal state information is triggered.
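As a rough illustration of the log-based check, the following Python sketch counts how often each preset abnormal state type appears in a log and returns the types that reach the threshold. The log line format (`<timestamp> <slot> <type>`), the type names, and the threshold value are all assumptions; the text does not specify them.

```python
from collections import Counter

# Assumed values for illustration only.
SECOND_PRESET_VALUE = 2
PRESET_TYPES = {"over_temperature", "link_down", "media_error"}  # hypothetical names


def types_to_warn(log_lines, preset_types=PRESET_TYPES, threshold=SECOND_PRESET_VALUE):
    """Return the sorted abnormal state types whose count is >= threshold.

    Each log line is assumed to look like "<timestamp> <slot> <type>".
    Lines that do not match, or whose type is not preset, are ignored.
    """
    counts = Counter()
    for line in log_lines:
        parts = line.split()
        if len(parts) == 3 and parts[2] in preset_types:
            counts[parts[2]] += 1
    return sorted(t for t, n in counts.items() if n >= threshold)
```

Each returned type would then map to its own fault early warning, which is the per-type behaviour the paragraph describes.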
In one embodiment, the computer program, when executed by the processor, implements the following steps: detecting whether the VMD driver is in an available state; if the VMD driver is in an available state, the central processing unit sends a fault early warning signal to the complex programmable logic device to perform VPP fault early warning; and if the VMD driver is in an unavailable state, the baseboard management controller sends a fault early warning signal to the complex programmable logic device to perform baseboard management controller fault early warning. When VPP fault early warning is performed, the central processing unit sends the fault early warning signal to the complex programmable logic device through the VPP IIC; the complex programmable logic device parses the fault early warning signal, sets the VMD driver to a special fault early warning state, and performs VPP fault early warning.
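The choice between the two warning paths can be sketched as a small routing function. This Python example only models the decision of who sends the signal to the complex programmable logic device; the enum, the function names, and the dictionary layout are hypothetical, and no real VMD, VPP, or IIC interface is called.

```python
from enum import Enum


class WarningPath(Enum):
    """Hypothetical labels for the two warning paths."""
    VPP = "cpu_via_vpp_iic"
    BMC = "bmc"


def select_warning_path(vmd_available: bool) -> WarningPath:
    """VMD driver available -> VPP path; otherwise fall back to the BMC path."""
    return WarningPath.VPP if vmd_available else WarningPath.BMC


def route_warning(vmd_available: bool, slot: str) -> dict:
    """Describe which component sends the warning signal to the CPLD."""
    path = select_warning_path(vmd_available)
    sender = "CPU" if path is WarningPath.VPP else "BMC"
    return {"sender": sender, "receiver": "CPLD", "path": path.value, "slot": slot}
```

In both branches the receiver is the complex programmable logic device, so the fallback changes only the sender and the transport.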
In one embodiment, the computer program, when executed by the processor, implements the following steps: if the occurrence frequency of any abnormal state information type of the NVMe hard disk is not less than a second preset value, a third warning device is triggered to send out the fault early warning corresponding to that abnormal state information. A third warning device is provided for each abnormal state information category of the NVMe hard disk.
In one embodiment, the computer program, when executed by the processor, implements the following steps: establishing a training model based on basic information of the NVMe hard disk and fault information of the NVMe hard disk; acquiring, based on the training model, the number of fault early warnings issued for the NVMe hard disk before it is damaged; and acquiring the first preset value based on that number of fault early warnings, wherein the first preset value is smaller than the number of fault early warnings.
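The text does not describe the training model itself, so the following Python sketch swaps in a plain historical mean as a stand-in predictor, purely to show how a first preset value strictly smaller than the predicted number of warnings could be derived. The function names, the `margin` parameter, and the use of a mean are assumptions, not the method of this application.

```python
import math
from statistics import mean


def predicted_warnings_before_failure(history):
    """Stand-in for the trained model: mean number of warnings seen before past failures.

    `history` is a list of warning counts, one per previously failed drive.
    """
    if not history:
        raise ValueError("history must not be empty")
    return mean(history)


def first_preset_value(history, margin=1):
    """Pick a threshold strictly below the predicted warning count.

    ceil(p) - margin is strictly less than p for any margin >= 1.
    """
    if margin < 1:
        raise ValueError("margin must be at least 1")
    predicted = predicted_warnings_before_failure(history)
    candidate = math.ceil(predicted) - margin
    if candidate < 1:
        raise ValueError("predicted count too small to derive a positive threshold")
    return candidate
```

Keeping the threshold below the predicted count means the warning fires before the point at which comparable drives were damaged.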
It will be understood by those skilled in the art that all or part of the processes of the methods of the embodiments described above can be implemented by a computer program instructing the relevant hardware. The computer program can be stored in a non-volatile computer-readable storage medium and, when executed, can include the processes of the method embodiments described above. Any reference to memory, storage, database, or other medium used in the embodiments provided herein may include non-volatile and/or volatile memory. Non-volatile memory can include read-only memory (ROM), programmable ROM (PROM), electrically programmable ROM (EPROM), electrically erasable programmable ROM (EEPROM), or flash memory. Volatile memory can include random access memory (RAM) or external cache memory. By way of illustration and not limitation, RAM is available in a variety of forms such as static RAM (SRAM), dynamic RAM (DRAM), synchronous DRAM (SDRAM), double data rate SDRAM (DDR SDRAM), enhanced SDRAM (ESDRAM), Synchlink DRAM (SLDRAM), Rambus direct RAM (RDRAM), direct Rambus dynamic RAM (DRDRAM), and Rambus dynamic RAM (RDRAM).
The technical features of the above embodiments can be combined arbitrarily. For the sake of brevity, not all possible combinations of these technical features are described; however, as long as a combination of technical features contains no contradiction, it should be considered to fall within the scope of this specification.
The above embodiments express only several implementations of the present application. Their description is relatively specific and detailed, but it should not be construed as limiting the scope of the invention. It should be noted that a person skilled in the art can make several variations and modifications without departing from the concept of the present application, all of which fall within the scope of protection of the present application. Therefore, the scope of protection of the present patent application shall be subject to the appended claims.

Claims (10)

1. An NVMe hard disk fault early warning method, characterized by comprising the following steps:
a central processing unit obtains state information of an NVMe hard disk and, if the state information of the NVMe hard disk is abnormal, resets the NVMe hard disk and obtains the reset count of the NVMe hard disk;
a complex programmable logic device sends the state information of the NVMe hard disk and the reset count of the NVMe hard disk to a baseboard management controller;
and if the reset count of the NVMe hard disk is not less than a first preset value, triggering a fault early warning.
2. The NVMe hard disk fault pre-warning method of claim 1, further comprising:
the baseboard management controller acquires the abnormal state information of the NVMe hard disk and generates a log file;
acquiring the occurrence frequency of any abnormal state information type of the NVMe hard disk based on the log file and a preset abnormal state information type;
and if the occurrence frequency of any abnormal state information type of the NVMe hard disk is not less than a second preset value, triggering fault early warning corresponding to the abnormal state information.
3. The NVMe hard disk failure early warning method of claim 1, wherein before the failure early warning is triggered, the method further comprises:
detecting whether the VMD driver is in an available state;
if the VMD driver is in an available state, the central processing unit sends a fault early warning signal to the complex programmable logic device to carry out VPP fault early warning;
and if the VMD driver is in an unavailable state, the baseboard management controller sends a fault early warning signal to the complex programmable logic device to perform baseboard management controller fault early warning.
4. The NVMe hard disk fault early warning method of claim 3, wherein when VPP fault early warning is performed, the method further comprises:
the central processing unit sends a fault early warning signal to the complex programmable logic device through the VPP IIC;
and the complex programmable logic device analyzes the fault early warning signal, sets the VMD driver to be in a special fault early warning state and performs VPP fault early warning.
5. The NVMe hard disk fault pre-warning method of claim 2, further comprising:
setting a plurality of third alarm devices corresponding to each abnormal state information category based on the NVMe hard disk;
and if the occurrence frequency of any abnormal state information type of the NVMe hard disk is not less than a second preset value, triggering a third warning device to send out fault warning corresponding to the abnormal state information.
6. The NVMe hard disk fault early warning method of claim 1, wherein obtaining the first preset value comprises:
establishing a training model based on basic NVMe hard disk information and NVMe hard disk fault information;
acquiring the failure early warning times of the NVMe hard disk before damage based on the training model;
and acquiring the first preset value based on the failure early warning times, wherein the first preset value is smaller than the failure early warning times.
7. The NVMe hard disk fault early warning method according to any one of claims 1-6, wherein the NVMe hard disk state information comprises: NVMe hard disk position information and NVMe hard disk working state information;
the central processing unit resets the NVMe hard disk, including: and monitoring whether the working state information of the NVMe hard disk is abnormal or not, and if the working state information of the NVMe hard disk is abnormal, resetting the NVMe hard disk by the central processing unit based on the position information of the NVMe hard disk.
8. An NVMe hard disk fault early warning system, characterized by comprising: a complex programmable logic device, a baseboard management controller, a central processing unit, and an NVMe hard disk;
the central processing unit is in communication connection with the NVMe hard disk and is used for acquiring state information of the NVMe hard disk; if the state information of the NVMe hard disk is abnormal, the central processing unit is also used for resetting the NVMe hard disk and acquiring the reset count of the NVMe hard disk; and if the reset count of the NVMe hard disk is not less than a first preset value, the central processing unit is also used for triggering a fault early warning;
the complex programmable logic device is in communication connection with the NVMe hard disk and is used for acquiring the state information of the NVMe hard disk and the reset count of the NVMe hard disk;
the baseboard management controller is in communication connection with the complex programmable logic device and is used for receiving the state information of the NVMe hard disk and the reset count of the NVMe hard disk sent by the complex programmable logic device; if the reset count of the NVMe hard disk is not less than the first preset value, the baseboard management controller is also used for triggering a fault early warning.
9. The NVMe hard disk failure early warning system of claim 8,
the baseboard management controller is further used for acquiring abnormal state information of the NVMe hard disk, generating a log file, and acquiring the occurrence frequency of any abnormal state information type of the NVMe hard disk based on the log file and a preset abnormal state information type; and if the occurrence frequency of any abnormal state information type of the NVMe hard disk is not less than a second preset value, triggering the fault early warning corresponding to the abnormal state information.
10. A computer device having a computer program stored thereon, wherein the computer program, when executed by a processor, implements the method of any of claims 1 to 7.
CN202210429927.1A 2022-04-22 2022-04-22 NVMe hard disk fault early warning method, system and computer equipment Pending CN114924929A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210429927.1A CN114924929A (en) 2022-04-22 2022-04-22 NVMe hard disk fault early warning method, system and computer equipment

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210429927.1A CN114924929A (en) 2022-04-22 2022-04-22 NVMe hard disk fault early warning method, system and computer equipment

Publications (1)

Publication Number Publication Date
CN114924929A true CN114924929A (en) 2022-08-19

Family

ID=82807463

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210429927.1A Pending CN114924929A (en) 2022-04-22 2022-04-22 NVMe hard disk fault early warning method, system and computer equipment

Country Status (1)

Country Link
CN (1) CN114924929A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115543755A (en) * 2022-11-25 2022-12-30 苏州浪潮智能科技有限公司 Performance monitoring method, device, system, equipment and medium
CN115658362A (en) * 2022-10-26 2023-01-31 超聚变数字技术有限公司 Method for determining hard disk state and related equipment

Similar Documents

Publication Publication Date Title
KR101856543B1 (en) Failure prediction system based on artificial intelligence
CN109783262B (en) Fault data processing method, device, server and computer readable storage medium
CN114924929A (en) NVMe hard disk fault early warning method, system and computer equipment
CN111897671A (en) Failure recovery method, computer device, and storage medium
CN113658414B (en) Mine equipment fault early warning method and device, terminal equipment and storage medium
CN110740061A (en) Fault early warning method and device and computer storage medium
CN111881014A (en) System test method, device, storage medium and electronic equipment
CN112286771A (en) Alarm method for monitoring global resources
CN115794588A (en) Memory fault prediction method, device and system and monitoring server
CN113704018A (en) Application operation and maintenance data processing method and device, computer equipment and storage medium
CN103763143A (en) Method and system for equipment abnormality alarming based on storage server
CN114118991A (en) Third-party system monitoring system, method, device, equipment and storage medium
CN110674149A (en) Service data processing method and device, computer equipment and storage medium
CN110873613A (en) Method and device for processing machine room abnormity based on temperature monitoring
CN111159051B (en) Deadlock detection method, deadlock detection device, electronic equipment and readable storage medium
CN116820820A (en) Server fault monitoring method and system
CN114341835A (en) Gas monitoring system
CN114356722A (en) Monitoring alarm method, system, equipment and storage medium for server cluster
CN113808725A (en) Equipment early warning system and method
CN112416896A (en) Data abnormity warning method and device, storage medium and electronic device
JP2003345629A (en) System monitor device, system monitoring method used for the same, and program therefor
Hayasaka et al. Method for detection of lot defects for maintenance of ICT power supplies and air conditioning equipment and verification results
CN111209131B (en) Method and system for determining faults of heterogeneous system based on machine learning
CN117076186B (en) Memory fault detection method, system, device, medium and server
US11703846B2 (en) Equipment failure diagnostics using Bayesian inference

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination