CN118098332A - Solid state disk fault positioning method, device, equipment and readable storage medium - Google Patents

Info

Publication number
CN118098332A
Authority
CN
China
Prior art keywords
fault
disk
disc
bandwidth
outputting
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202410225819.1A
Other languages
Chinese (zh)
Inventor
陈冀
陈庆陆
郑善龙
秦文政
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Suzhou Metabrain Intelligent Technology Co Ltd
Original Assignee
Suzhou Metabrain Intelligent Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Suzhou Metabrain Intelligent Technology Co Ltd filed Critical Suzhou Metabrain Intelligent Technology Co Ltd
Priority to CN202410225819.1A priority Critical patent/CN118098332A/en
Publication of CN118098332A publication Critical patent/CN118098332A/en
Pending legal-status Critical Current

Landscapes

  • Debugging And Monitoring (AREA)

Abstract

The application discloses a solid state disk fault positioning method, a device, equipment and a readable storage medium, wherein the method comprises the following steps: after a fault disc determined on a production line is inserted into a server, PCIe equipment identification is carried out; if the PCIe device is not identified, outputting fault positioning information of master control welding or power supply of the fault disk; if the PCIe device is identified, the firmware is burned into the fault disk, and then flash memory identification is carried out; if the flash memory is not identified, a disc log of the fault disc is obtained, a fault point is determined by using the disc log, and fault positioning information of the fault point of the fault disc is output; if the flash memory is identified, the bandwidth state of the fault disk is acquired, the fault point is determined by utilizing the bandwidth state, and the fault locating information of the fault point of the fault disk is output. The technical effects are as follows: the positioning efficiency of the controller problem in the solid state disk can be effectively improved, and maintenance personnel can conveniently perform high-efficiency positioning and maintenance.

Description

Solid state disk fault positioning method, device, equipment and readable storage medium
Technical Field
The present application relates to the field of storage technologies, and in particular, to a method, an apparatus, a device, and a readable storage medium for locating a solid state disk fault.
Background
The controller (master control) process of solid state disks (SSD, Solid State Drives) is becoming more and more complex, and the welding and material requirements on the controller in the production stage are becoming higher and higher. In the production process, during SMT (Surface Mount Technology), obvious defects such as lifted pins, tilting, body deformation and obvious head-in-pillow solder joints can be detected through AOI (Automated Optical Inspection).
However, as the number of hardware units inside the controller increases, the BGA (Ball Grid Array) process of the controller and the structure of the disk PCB (Printed Circuit Board) are becoming more and more complex, and a large number of process defects and controller material defects cannot be detected through AOI or X-ray inspection. As a result, it is difficult to ensure that the controller functions normally when the disk is delivered to a customer and to avoid batch problems.
In summary, how to effectively solve the problems of solid state disk fault location and the like is a technical problem that needs to be solved by those skilled in the art at present.
Disclosure of Invention
The application aims to provide a solid state disk fault positioning method, device and equipment and a readable storage medium, which can effectively position faults of a solid state disk and can effectively ensure the quality of disks produced by a production line.
In order to solve the technical problems, the application provides the following technical scheme:
A solid state disk fault locating method comprises the following steps:
After a fault disc determined on a production line is inserted into a server, PCIe equipment identification is carried out;
If the PCIe device is not identified, outputting fault positioning information of master control welding or power supply of the fault disk;
if the PCIe device is identified, the firmware is burned into the fault disk, and then flash memory identification is carried out;
If the flash memory is not identified, a disc log of the fault disc is obtained, a fault point is determined by using the disc log, and fault positioning information of the fault point of the fault disc is output;
if the flash memory is identified, acquiring the bandwidth state of the fault disk, determining a fault point by utilizing the bandwidth state, and outputting fault positioning information of the fault point of the fault disk.
Preferably, determining a fault point by using the disc log, and outputting fault location information of the fault point of the fault disc, including:
Determining whether all hardware units inside the controller of the fault disk are started or not by utilizing the disk log;
if not, determining the fault point as the defect of the main control body, and outputting fault positioning information of the fault disc with the defect of the main control body;
if yes, checking whether the initialization of the main control to the flash memory is abnormal, and if the initialization failure of the flash memory of a certain channel occurs, outputting fault positioning information of the abnormal channel existing in the fault disk.
Preferably, determining a fault point by using the bandwidth state, and outputting fault location information of the fault point of the fault disc, including:
judging whether the bandwidth state is matched with the standard bandwidth state of the fault disc or not;
if the bandwidth states are matched, determining that the bandwidth states are normal;
If the bandwidth states are not matched, determining that the bandwidth states are abnormal, determining that the fault point is the master control welding or PCIe link hardware connection, and outputting fault positioning information of the master control welding or PCIe link hardware connection of the fault disk.
Preferably, determining whether the bandwidth status matches a standard bandwidth status of the failed disk includes:
judging whether the bandwidth in the bandwidth state is consistent with the standard bandwidth in the standard bandwidth state and whether the rate in the bandwidth state is consistent with the standard rate in the standard bandwidth state;
If yes, determining that the bandwidth state is matched with the standard bandwidth state, performing IO test on the fault disk, performing fault location processing and outputting fault location information;
If not, determining that the bandwidth state is not matched with the standard bandwidth state;
correspondingly, if the bandwidth state is abnormal, determining the fault point as the master control welding or PCIe link hardware connection, and outputting fault positioning information of the master control welding or PCIe link hardware connection of the fault disk, wherein the fault positioning information comprises the following steps:
If the speed of the fault disc is smaller than the standard speed in the standard bandwidth state, determining the fault point as master control welding, and outputting fault positioning information of the fault disc in the master control welding;
If the bandwidth of the fault disk is smaller than the standard bandwidth in the standard bandwidth state, determining the fault point as the PCIe link hardware connection, and outputting fault positioning information of the fault disk in the PCIe link hardware connection;
wherein the PCIe link hardware connection comprises a hardware connection from a connector to a master PCIe link.
Preferably, if the bandwidth status is normal, the method further includes:
IO test is carried out on the fault disc;
In the IO test process, if the disk drops and the flash memory is lost, restarting the equipment and carrying out flash memory identification again;
If the flash memory is identified, outputting fault positioning information that the fault disk has an abnormal PCIe link hardware connection, detecting the PCIe link hardware connection, determining a cold solder joint or a defective material, determining the cold solder joint or the defective material as the fault point, and outputting the fault positioning information of the fault disk at the cold solder joint or the defective material;
and if the flash memory is not identified, outputting fault positioning information of the fault disk with abnormal controller.
Preferably, if the flash memory is not identified, outputting fault location information that the controller is abnormal in the fault disk, including:
in the IO test process, if the flash memory is not identified, capturing a hardware unit register interacted with the flash memory in the controller;
Checking whether a target register corresponding to a main control in the hardware unit register is normal or not;
If not, outputting fault positioning information that the controller of the fault disc has abnormal register values;
if yes, acquiring a current disc log, and determining whether a hardware unit of the controller has crashed by using the current disc log; if a hardware unit has crashed, outputting fault positioning information that a hardware unit in the controller of the fault disk has crashed.
Preferably, if the hardware unit crash does not occur, the method further comprises:
capturing uncorrectable error keywords of a controller hardware unit in a current disc log, and judging whether uncorrectable errors occur or not;
if yes, outputting fault locating information of the fault disc with abnormal controller.
A solid state disk fault locating device comprises:
the PCIe identification module is used for carrying out PCIe equipment identification after the fault disk determined on the production line is inserted into the server;
The first fault locating module is used for outputting fault locating information of master control welding or power supply of the fault disc if PCIe equipment is not identified;
the flash memory identification module is used for carrying out flash memory identification after burning the firmware to the fault disk if the PCIe equipment is identified;
The second fault location module is used for acquiring a disc log of the fault disc if the flash memory is not identified, determining a fault point by utilizing the disc log and outputting fault location information of the fault point of the fault disc;
and the third fault location module is used for acquiring the bandwidth state of the fault disc if the flash memory is identified, determining a fault point by utilizing the bandwidth state, and outputting fault location information of the fault point of the fault disc.
An electronic device, comprising:
a memory for storing a computer program;
and the processor is used for realizing the steps of the solid state disk fault positioning method when executing the computer program.
A readable storage medium having stored thereon a computer program which when executed by a processor implements the steps of the solid state disk fault location method described above.
After the fault disc determined on the production line is inserted into the server, PCIe equipment identification is carried out by applying the method provided by the embodiment of the application; if the PCIe device is not identified, outputting fault positioning information of master control welding or power supply of the fault disk; if the PCIe device is identified, the firmware is burned into the fault disk, and then flash memory identification is carried out; if the flash memory is not identified, a disc log of the fault disc is obtained, a fault point is determined by using the disc log, and fault positioning information of the fault point of the fault disc is output; if the flash memory is identified, the bandwidth state of the fault disk is acquired, the fault point is determined by utilizing the bandwidth state, and the fault locating information of the fault point of the fault disk is output.
In the application, after the fault disk determined on the production line is inserted into the server, PCIe device identification is first carried out on the fault disk. If the PCIe device cannot be identified, the fault disk has a master control welding or power supply problem, and fault positioning information of the master control welding or power supply of the fault disk can be output directly, so that the fault disk can be checked and the fault handled based on the fault positioning information. If the PCIe device can be identified, firmware is burned into the fault disk and flash memory identification is then carried out. If the flash memory cannot be identified, the fault point can be determined based on the disc log, and fault positioning information of the fault point of the fault disc is output; if the flash memory can be identified, the bandwidth state of the fault disk can be acquired, the fault point is determined by utilizing the bandwidth state, and the fault positioning information of the fault point of the fault disk is output.
The technical effects are as follows: by inserting the fault disc into the server and performing a series of device identification, information acquisition and judgment steps, the specific fault point in the fault disc can be determined, and the fault positioning information corresponding to the fault point can be output. This can effectively improve the efficiency of locating controller problems in the solid state disk, makes it convenient for maintenance personnel to locate and repair faults efficiently, allows abnormal master control problems to be classified afterwards, and allows batch problems to be found in time.
Correspondingly, the embodiment of the application also provides a solid state disk fault locating device, equipment and a readable storage medium corresponding to the solid state disk fault locating method, which have the technical effects and are not repeated herein.
Drawings
In order to more clearly illustrate the embodiments of the present application or the technical solutions in the related art, the drawings that are required to be used in the embodiments or the related technical descriptions will be briefly described, and it is apparent that the drawings in the following description are only some embodiments of the present application, and other drawings may be obtained according to the drawings without inventive effort for those skilled in the art.
FIG. 1 is a flowchart of a method for locating a solid state disk fault in an embodiment of the present application;
FIG. 2 is a schematic diagram illustrating an implementation of a method for locating a solid state disk fault in an embodiment of the present application;
FIG. 3 is a schematic diagram of an IO test process according to an embodiment of the present application;
FIG. 4 is a schematic structural diagram of a solid state disk fault locating device according to an embodiment of the present application;
FIG. 5 is a schematic structural diagram of an electronic device according to an embodiment of the present application;
FIG. 6 is a schematic structural diagram of an electronic device according to an embodiment of the present application.
Detailed Description
The controller is one of the core devices of the solid state disk, and when it is defective the problem is difficult to locate, often requiring cooperation across multiple fields such as software engineers, hardware engineers, production line FA engineers and test engineers. At present, each manufacturer treats its production line as a private domain, and the industry has no unified standard flow for analyzing and locating such problems. Most analysis depends on the personal experience of FA engineers, who replace components in whichever direction they suspect, or analyze bad disks individually, one disk at a time; there is no universal automated flow. For manufacturers with huge output, this way of handling problem disks is too inefficient and can cause problem disks to accumulate, which seriously affects production efficiency and the supply rhythm.
Aiming at the above problem of fault positioning efficiency, the application provides a solid state disk fault positioning method, device, electronic equipment and readable storage medium which, for bad disks in solid state disk production, can export controller registers and log keywords according to the state of the disk and analyze the registers and logs to locate the type of controller defect. The process is realized automatically without manual intervention, so that maintenance personnel who do not know the internal principles of the solid state disk can conveniently determine the controller problem, and the efficiency of analysis, positioning and production is improved.
In order to better understand the aspects of the present application, the present application will be described in further detail with reference to the accompanying drawings and detailed description. It will be apparent that the described embodiments are only some, but not all, embodiments of the application. All other embodiments, which can be made by those skilled in the art based on the embodiments of the application without making any inventive effort, are intended to be within the scope of the application.
Referring to fig. 1 and 2, the method includes the steps of:
S101, after a fault disc determined on a production line is inserted into a server, PCIe equipment identification is carried out.
Specifically, the fault disc can be a to-be-analyzed bad problem disc which appears in a production line when the solid state disk is subjected to factory testing.
After the failed disk is inserted into the server, the server may perform PCIe (PCI-Express, Peripheral Component Interconnect Express, a high-speed serial computer expansion bus standard) device identification for the failed disk.
For example, the PCIe state of the disk may be obtained by executing an lspci | grep command with a keyword for the solid state disk, where the keyword is the PCIe identification defined by the disk software and specified by the vendor's software developers, and is not specifically limited here.
If the faulty disk has a master control welding problem or a power supply problem, a PCIe device may not be identified. Thus, in the event that no PCIe device is identified, step S102 may be performed; when the PCIe device can be identified, step S103 may be performed.
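The identification check in S101 can be sketched as follows. This is only a minimal illustration, assuming that the vendor-defined keyword appears somewhere in the text printed by lspci; the function names, the keyword and the sample output are hypothetical and are not taken from the application.

```python
import subprocess


def pcie_device_present(lspci_output: str, keyword: str) -> bool:
    """Return True if any lspci line mentions the vendor-defined keyword."""
    needle = keyword.lower()
    return any(needle in line.lower() for line in lspci_output.splitlines())


def read_lspci() -> str:
    """Run plain `lspci` and return its text output (requires pciutils)."""
    result = subprocess.run(["lspci"], capture_output=True, text=True, check=False)
    return result.stdout


def next_step_after_pcie_check(lspci_output: str, keyword: str) -> str:
    """Map the identification result to the next step of the method."""
    # S103: burn firmware, then identify flash; S102: report welding/power fault.
    return "S103" if pcie_device_present(lspci_output, keyword) else "S102"


if __name__ == "__main__":
    # Hypothetical sample output; on a real server use read_lspci() instead.
    sample = "01:00.0 Non-Volatile memory controller: ExampleVendor Device 1234"
    print(next_step_after_pcie_check(sample, "ExampleVendor"))
```

Keeping the text matching separate from the call to lspci means the branching logic can be exercised on saved command output, without a disk attached.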
S102, if the PCIe device is not identified, outputting fault location information of master control welding or power supply of the fault disk.
The fault location information may be printed out, or may be presented in other manners, such as flashing a corresponding signal light or outputting a sound.
If the disk has no PCIe state, that is, the PCIe device cannot be identified, a prompt may be printed: with high probability (99%), either a serious defect such as offset occurred in the master control welding and was not caught by the earlier AOI inspection, or the power supply of the disk has an obvious problem and needs to be checked.
S103, if the PCIe device is identified, the firmware is burned into the fault disk, and then flash memory identification is carried out.
If PCIe displays normally, that is, the PCIe device can be identified, the corresponding disk can be found and the corresponding firmware burned into it. The firmware to be burned may be preset firmware for testing, or may be the firmware that is burned to the disk in actual production so that the disk can be used normally for storage.
After the firmware is burned, flash memory identification (also called disk identification, that is, nvme identification) may be performed on the failed disk. For example, the command nvme list is executed after burning to observe whether an nvme device is present.
If nvme is not recognized, step S104 is executed, and if nvme is recognized, step S105 is executed.
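The flash memory identification check can be sketched as below. It is a minimal illustration that assumes the text output of nvme list starts each device row with a node path such as /dev/nvme0n1; the function names and the sample output are hypothetical.

```python
def nvme_nodes(nvme_list_output: str) -> list[str]:
    """Collect device node paths from `nvme list` text output."""
    nodes = []
    for line in nvme_list_output.splitlines():
        fields = line.split()
        # Assumption: device rows begin with the node path; header and
        # separator rows do not.
        if fields and fields[0].startswith("/dev/nvme"):
            nodes.append(fields[0])
    return nodes


def next_step_after_flash_check(nvme_list_output: str) -> str:
    """S105 when an nvme device is present, otherwise S104."""
    return "S105" if nvme_nodes(nvme_list_output) else "S104"


if __name__ == "__main__":
    # Hypothetical sample output with one device row.
    sample = (
        "Node             SN        Model\n"
        "---------------- --------- -----------\n"
        "/dev/nvme0n1     S0000001  Example SSD\n"
    )
    print(next_step_after_flash_check(sample))
```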
S104, if the flash memory is not identified, acquiring a disc log of the fault disc, determining a fault point by using the disc log, and outputting fault positioning information of the fault point of the fault disc.
If no nvme devices are present, i.e., no flash memory is identified, a disk log may be derived, which may be used to determine a point of failure. After determining the fault point, outputting the fault locating information of the fault point of the fault disc.
In one specific embodiment of the present application, determining a fault point by using a disc log, and outputting fault location information of the fault point on the fault disc, includes:
Determining whether all hardware units inside the controller of the failed disk are started by utilizing the disk log;
if not, determining the fault point as the defect of the main control body, and outputting fault positioning information of the fault disc with the defect of the main control body;
If yes, checking whether the initialization of the main control to the flash memory is abnormal, and if the initialization of the flash memory of a certain channel fails, outputting fault location information of the abnormal channel of the fault disc.
That is, in the case that no nvme device can be identified, after the disk log is acquired, whether all cores (the hardware units inside the controller) of the failed disk have been started can be determined based on the records in the disk log.
In one embodiment of the present application, determining whether all hardware units within a controller of a failed disk have been started using a disk log includes:
Capturing keywords of a disc log;
using the keywords to index whether all cores were started when the master control was initialized.
That is, keywords in the disk log can be captured and used to index whether all cores were started when the master control was initialized. Indexing by keyword speeds up the judging process.
If any core is not started, the initialization is abnormal; this can be classified as a defective main control body, and the master control needs to be replaced.
If all cores of the master control initialize normally, whether the master control's initialization of the nand (the flash memory used for storing data in a solid state disk) is abnormal is checked. When the nand initialization of a certain channel fails, the corresponding abnormal channel is prompted so that maintenance personnel can check that channel, which narrows the checking range.
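The keyword-based log analysis of S104 can be sketched as follows. The application does not disclose the log format, so the two keyword patterns, the function name and the expected core count used here are purely hypothetical placeholders for whatever the vendor firmware actually prints.

```python
import re

# Hypothetical log keywords; real firmware logs will use vendor-specific text.
CORE_STARTED = re.compile(r"core(\d+) started")
CHANNEL_INIT_FAILED = re.compile(r"nand init failed: channel (\d+)")


def locate_from_log(log_text: str, expected_cores: int) -> str:
    """Apply the two checks in order: core start-up, then flash channel init."""
    started = {int(m.group(1)) for m in CORE_STARTED.finditer(log_text)}
    missing = sorted(set(range(expected_cores)) - started)
    if missing:
        # Not every core started: classified as a defective controller body.
        return f"controller body defective: cores not started {missing}"
    failed = sorted({int(m.group(1)) for m in CHANNEL_INIT_FAILED.finditer(log_text)})
    if failed:
        # Cores are fine but a flash channel failed to initialize.
        return f"abnormal flash channels {failed}"
    return "no fault located from log"


if __name__ == "__main__":
    log = "core0 started\ncore1 started\nnand init failed: channel 3\n"
    print(locate_from_log(log, expected_cores=2))
```

The order of the checks follows the text: the channel check is only meaningful once all cores are known to have started.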
S105, if the flash memory is identified, acquiring the bandwidth state of the fault disk, determining a fault point by utilizing the bandwidth state, and outputting fault location information of the fault point of the fault disk.
If the nvme devices can be identified normally at this time, a failure point can be determined based on the bandwidth status, thereby outputting failure location information that the failed disk exists at the failure point.
In one specific embodiment of the present application, determining a failure point by using a bandwidth status, and outputting failure location information of the failure point on the failed disk, including:
Judging whether the bandwidth state is matched with the standard bandwidth state of the fault disc;
if the bandwidth states are matched, determining that the bandwidth states are normal;
If not matched, determining that the bandwidth state is abnormal, determining that the fault point is the master control welding or PCIe link hardware connection, and outputting fault positioning information of the master control welding or PCIe link hardware connection of the fault disk.
A solid state disk fault may cause an abnormal bandwidth state, such as reduced bandwidth or reduced rate. A bandwidth state anomaly is often related to an abnormal master control welding or PCIe link hardware connection; therefore, by comparing the bandwidth state with the standard bandwidth state, it can be determined whether to output fault positioning information of the master control welding or PCIe link hardware connection of the fault disk.
Specifically, when the bandwidth state is matched with the standard bandwidth state of the fault disk, the bandwidth state is normal, if the bandwidth state is not matched with the standard bandwidth state, the bandwidth state is abnormal, at the moment, the fault point can be determined to be the main control welding or PCIe link hardware connection, and fault positioning information about the connection of the fault disk in the main control welding or PCIe link hardware connection can be printed.
In one embodiment, determining whether the bandwidth status matches a standard bandwidth status of a failed disk includes:
Judging whether the bandwidth in the bandwidth state is consistent with the standard bandwidth in the standard bandwidth state, and whether the rate in the bandwidth state is consistent with the standard rate in the standard bandwidth state;
If yes, determining that the bandwidth state is matched with the standard bandwidth state, performing IO test on the fault disk, performing fault location processing and outputting fault location information;
if not, determining that the bandwidth state is not matched with the standard bandwidth state;
Correspondingly, if the bandwidth state is abnormal, determining the fault point as the master control welding or PCIe link hardware connection, and outputting fault positioning information of the master control welding or PCIe link hardware connection of the fault disk, wherein the method comprises the following steps:
if the speed of the fault disc is smaller than the standard speed in the standard bandwidth state, determining the fault point as the master control welding, and outputting fault positioning information of the fault disc in the master control welding;
If the bandwidth of the fault disk is smaller than the standard bandwidth in the standard bandwidth state, determining the fault point as PCIe link hardware connection, and outputting fault positioning information of the fault disk in the PCIe link hardware connection;
wherein the PCIe link hardware connection comprises a hardware connection from the connector to the master PCIe link.
That is, the bandwidth state may specifically include a bandwidth and a rate. When the matching determination is performed, both the bandwidth and the rate need to be compared: the bandwidth state is normal only when both match the standard, and otherwise the bandwidth state is abnormal.
For different abnormal conditions, the fault point can be accurately positioned according to specific abnormal conditions, and specific fault point prompts are output.
For example, whether the bandwidth of the disk is normal can be confirmed through the command lspci -s bdf -vvv, where bdf is the bus_id corresponding to the disk and is obtained through the lspci | grep command for the solid state disk. For a PCIe 4.0 disk, for example, the state obtained by the command is 16GT/s x4, where 16GT/s is the rate of the disk and x4 is the bandwidth of the disk. When a mismatch occurs, a prompt is printed. If the rate is reduced (that is, the current rate is lower than the standard rate, for example 16GT/s becomes 8GT/s or 5GT/s), there is a high probability of a controller welding problem, and subsequent checking and maintenance are performed. If the bandwidth is reduced (that is, the current bandwidth is lower than the standard bandwidth, for example x4 becomes x2 or x1), the hardware connection of the disk from the connector to the master control PCIe link is checked; a hardware welding problem on the link is the most likely cause, and the corresponding prompt is printed.
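The bandwidth state comparison can be sketched as follows. This is a minimal illustration that assumes the negotiated link state appears on the LnkSta line of the lspci -vvv output in the usual "Speed ...GT/s, Width x..." form; the function names, the default standard values (16GT/s x4, the PCIe 4.0 example above) and the sample text are assumptions for illustration.

```python
import re
from typing import List, Optional, Tuple

# Matches e.g. "LnkSta: Speed 8GT/s (downgraded), Width x4 (ok)".
LNKSTA = re.compile(r"LnkSta:\s*Speed\s+([\d.]+)GT/s.*?Width\s+x(\d+)")


def parse_link_status(lspci_vvv_output: str) -> Optional[Tuple[float, int]]:
    """Extract (rate in GT/s, lane width) from `lspci -vvv` text, if present."""
    match = LNKSTA.search(lspci_vvv_output)
    if match is None:
        return None
    return float(match.group(1)), int(match.group(2))


def classify_link(speed: float, width: int,
                  std_speed: float = 16.0, std_width: int = 4) -> List[str]:
    """Return the suspected fault points; an empty list means the state matches."""
    findings = []
    if speed < std_speed:
        # Reduced rate: master control (controller) welding is suspected.
        findings.append("master control welding")
    if width < std_width:
        # Reduced bandwidth: the connector-to-controller link is suspected.
        findings.append("PCIe link hardware connection")
    return findings


if __name__ == "__main__":
    sample = ("\t\tLnkCap:\tPort #0, Speed 16GT/s, Width x4\n"
              "\t\tLnkSta:\tSpeed 8GT/s (downgraded), Width x4 (ok)\n")
    status = parse_link_status(sample)
    if status is not None:
        print(classify_link(*status))
```

When the returned list is empty the bandwidth state is normal, which in the method corresponds to continuing with the IO test.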
After the fault disc determined on the production line is inserted into the server, PCIe equipment identification is carried out by applying the method provided by the embodiment of the application; if the PCIe device is not identified, outputting fault positioning information of master control welding or power supply of the fault disk; if the PCIe device is identified, the firmware is burned into the fault disk, and then flash memory identification is carried out; if the flash memory is not identified, a disc log of the fault disc is obtained, a fault point is determined by using the disc log, and fault positioning information of the fault point of the fault disc is output; if the flash memory is identified, the bandwidth state of the fault disk is acquired, the fault point is determined by utilizing the bandwidth state, and the fault locating information of the fault point of the fault disk is output.
In the application, after the fault disk determined on the production line is inserted into the server, PCIe device identification is first carried out on the fault disk. If the PCIe device cannot be identified, the fault disk has a master control welding or power supply problem, and fault positioning information of the master control welding or power supply of the fault disk can be output directly, so that the fault disk can be checked and the fault handled based on the fault positioning information. If the PCIe device can be identified, firmware is burned into the fault disk and flash memory identification is then carried out. If the flash memory cannot be identified, the fault point can be determined based on the disc log, and fault positioning information of the fault point of the fault disc is output; if the flash memory can be identified, the bandwidth state of the fault disk can be acquired, the fault point is determined by utilizing the bandwidth state, and the fault positioning information of the fault point of the fault disk is output.
The technical effects are as follows: by inserting the fault disc into the server and performing a series of device identification, information acquisition and judgment steps, the specific fault point in the fault disc can be determined, and the fault positioning information corresponding to the fault point can be output. This can effectively improve the efficiency of locating controller problems in the solid state disk, makes it convenient for maintenance personnel to locate and repair faults efficiently, allows abnormal master control problems to be classified afterwards, and allows batch problems to be found in time.
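The overall branching of steps S101 to S105 can be summarized in a short sketch. It only models the decisions, taking the three observations as already-collected inputs; the class name, field names and result strings are hypothetical and chosen for illustration.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class DiskObservation:
    """Results gathered from the server; None means the check was not reached."""
    pcie_identified: bool
    flash_identified: Optional[bool] = None
    bandwidth_matches_standard: Optional[bool] = None


def locate_fault(obs: DiskObservation) -> str:
    """Walk the decision flow S101 to S105 and name the resulting branch."""
    if not obs.pcie_identified:
        return "S102: master control welding or power supply"
    # In the real flow, firmware is burned here before flash identification.
    if not obs.flash_identified:
        return "S104: determine fault point from disk log"
    if obs.bandwidth_matches_standard:
        return "bandwidth normal: continue with IO test"
    return "S105: master control welding or PCIe link hardware connection"


if __name__ == "__main__":
    print(locate_fault(DiskObservation(pcie_identified=True,
                                       flash_identified=True,
                                       bandwidth_matches_standard=False)))
```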
It should be noted that, based on the above embodiments, the embodiments of the present application further provide corresponding improvements. The preferred/improved embodiments relate to the same steps as those in the above embodiments or the steps corresponding to the steps may be referred to each other, and the corresponding advantages may also be referred to each other, so that detailed descriptions of the preferred/improved embodiments are omitted herein.
In a specific embodiment of the present application, if the bandwidth status is normal, the fault point may also be determined by performing an IO test on the fault disk. Referring to fig. 3, the specific implementation process includes:
An IO test is performed on the fault disk;
During the IO test, if the disk drops and the flash memory is lost, the device is restarted and flash memory identification is performed again;
If the flash memory is identified, fault location information indicating that the fault disk has an abnormal PCIe link hardware connection is output, the PCIe link hardware connection is inspected, a cold-soldered object or a defective-material object is determined and taken as the fault point, and fault location information indicating that the fault disk has the cold-soldered object or the defective-material object is output;
If the flash memory is not identified, fault location information indicating that the fault disk has a controller abnormality is output.
In this embodiment, the IO test refers to a read-write test performed on the solid state disk.
Specifically, the IO test consists of sequential reads and writes and random reads and writes of the fault disk under different pressures. If the disk drops during the IO test and the nvme device is lost, the machine can be restarted automatically and the state of the disk confirmed again with the command nvme list.
If the nvme device can be identified by the nvme list command at this point, a device on the link is faulty. The relevant devices on the PCIe link need to be checked, and the link from the connector to the main control can be inspected for cold solder joints or defective materials; a prompt is printed at the terminal in this case. If the link is repeatedly confirmed to be free of problems, a software bug (a logic error in the program) can be suspected.
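The check after the restart can be sketched as below. This is a hedged illustration only: it assumes the default tabular output of `nvme list`, in which each device row begins with its `/dev/nvme…` node, and it takes the already captured command output as a string. The function names and returned strings are hypothetical.

```python
def nvme_device_present(nvme_list_output: str) -> bool:
    """Return True if any line of the listing starts with an NVMe device node."""
    return any(
        line.strip().startswith("/dev/nvme")
        for line in nvme_list_output.splitlines()
    )


def classify_after_reboot(nvme_list_output: str) -> str:
    """Map the post-restart identification result to a fault location."""
    if nvme_device_present(nvme_list_output):
        # Device came back after the restart: suspect the PCIe link hardware.
        return "PCIe link hardware connection abnormal"
    # Device is still missing: suspect the controller itself.
    return "controller abnormal"
```

Keeping the parsing separate from the command invocation lets the decision be exercised against saved outputs from the production line.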
In a specific embodiment of the present application, in an IO test process, if a flash memory is not identified, outputting fault location information that a fault disk has a controller abnormality, including:
in the IO test process, if the flash memory is not identified, capturing a hardware unit register interacted with the flash memory in the controller;
checking whether a target register corresponding to the main control in the hardware unit register is normal or not;
if not, outputting fault location information indicating that the fault disk has an abnormal register value in the controller;
if yes, acquiring the current disk log, and using it to determine whether a hardware unit of the controller has crashed; if a hardware unit has crashed, outputting fault location information indicating that the fault disk has a hardware unit crash in the controller.
That is, if the flash memory is not identified during the IO test, the hardware unit registers in the controller that interact with the flash memory can be grabbed to check whether the target register corresponding to the main control is normal, and the corresponding fault location information is output according to the result.
In one embodiment of the present application, if no hardware unit crash occurs, the method further includes:
capturing uncorrectable error keywords of a controller hardware unit in a current disc log, and judging whether uncorrectable errors occur or not;
If yes, outputting fault location information of the fault disc with abnormal controller.
When no hardware unit has crashed, whether an uncorrectable error has occurred can be judged by checking whether the disk log contains an uncorrectable error keyword. When an uncorrectable error has occurred, it can also be determined that the fault disk has a controller abnormality.
The two adjacent embodiments can be combined in practice. That is, when the disk drops during the IO test and, after the server is restarted, it is confirmed that the nvme device is lost, the following operations may be performed in sequence:
(1) First, grab the hardware unit registers on the controller that interact with the NAND and check whether the corresponding register on the main control is set at this time; if the register value is abnormal, the problem lies in the controller;
(2) Second, if the register value is normal, export the disk log, capture the core crash keyword, and check whether a core crash has occurred in the controller; if so, the problem lies in the controller;
(3) Finally, capture the uncorrectable error keyword of the controller hardware unit in the log; if an uncorrectable error has occurred, the problem also lies in the controller.
A corresponding message can be printed for each of the three cases, indicating the specific cause of the controller abnormality.
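The three checks can be sketched as an ordered triage function. This is a sketch under stated assumptions: the keyword tuples are hypothetical placeholders (the actual log keywords depend on the controller firmware and are not given in the text), and the register check is assumed to have been reduced to a single boolean beforehand.

```python
from typing import Optional

# Hypothetical keywords; real firmware logs will use their own strings.
CORE_CRASH_KEYWORDS = ("core crash",)
UNCORRECTABLE_KEYWORDS = ("uncorrectable error",)


def triage_controller(register_ok: bool, disk_log: str) -> Optional[str]:
    """Apply checks (1) to (3) in order and return the first cause found."""
    if not register_ok:
        return "abnormal register value in controller"
    log = disk_log.lower()
    if any(keyword in log for keyword in CORE_CRASH_KEYWORDS):
        return "hardware unit crash in controller"
    if any(keyword in log for keyword in UNCORRECTABLE_KEYWORDS):
        return "uncorrectable error in controller hardware unit"
    return None  # none of the three causes was confirmed
```

Returning the first matching cause mirrors the sequential order of the steps, so the disk log is only consulted when the register value is normal.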
In one embodiment of the present application, the method further comprises:
classifying fault positioning information output by each fault disc;
listing a bad detail table according to the classification result;
Carrying out centralized analysis on faults occurring in the incoming material batch based on the detail list, and classifying possible faults occurring in the incoming material batch of the controller;
And carrying out centralized analysis on the defects of the factory process based on the detail table, and outputting an analysis result.
That is, production line quality can be improved while defective controllers are being located. For example, when internal controller problems are found, they can be classified and listed in a bad detail table, the defects occurring in an incoming material batch can be analyzed centrally, and the defects that may occur in an incoming batch of controllers can be classified. If the problem is a material problem, it can be fed back to the original manufacturer of the controller for FA (Failure Analysis), which reduces the subsequent verification of bad materials and improves material quality by identifying the root cause of the defect; if the problem is caused by the factory process, it can be fed back to the factory production line so that the production process is improved and the problem is prevented from recurring.
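A bad detail table of this kind can be sketched as a simple count per batch and fault type. The record layout (a batch identifier paired with the fault location string) and the function name are assumptions made for illustration; the application does not specify the table format.

```python
from collections import Counter
from typing import Iterable, List, Tuple


def build_defect_table(
    records: Iterable[Tuple[str, str]],
) -> List[Tuple[str, str, int]]:
    """Count (batch_id, fault_location) pairs and list the most frequent first."""
    counts = Counter(records)
    return sorted(
        ((batch, fault, n) for (batch, fault), n in counts.items()),
        key=lambda row: (-row[2], row[0], row[1]),
    )
```

Sorting by descending count puts recurring faults in the same batch at the top, which is where a batch-level material or process problem would first become visible.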
Corresponding to the above method embodiment, the embodiment of the present application further provides a solid state disk fault positioning device, where the solid state disk fault positioning device described below and the solid state disk fault positioning method described above may be referred to correspondingly.
Referring to fig. 4, the apparatus includes the following modules:
The PCIe identification module 101 is configured to perform PCIe device identification after the fault disk determined on the production line is inserted into the server;
The first fault location module 102 is configured to output fault location information of a master control welding or a power supply for a fault disk if the PCIe device is not identified;
The flash memory identification module 103 is configured to, if the PCIe device is identified, burn the firmware into the failed disk, and then identify the flash memory;
The second fault location module 104 is configured to obtain a disc log of the fault disc if the flash memory is not identified, determine a fault point by using the disc log, and output fault location information of the fault point of the fault disc;
And the third fault location module 105 is configured to acquire a bandwidth state of the fault disc if the flash memory is identified, determine a fault point by using the bandwidth state, and output fault location information of the fault disc in which the fault point exists.
By applying the device provided by the embodiment of the application, after the fault disc determined on the production line is inserted into the server, PCIe equipment identification is performed; if the PCIe device is not identified, outputting fault positioning information of master control welding or power supply of the fault disk; if the PCIe device is identified, the firmware is burned into the fault disk, and then flash memory identification is carried out; if the flash memory is not identified, a disc log of the fault disc is obtained, a fault point is determined by using the disc log, and fault positioning information of the fault point of the fault disc is output; if the flash memory is identified, the bandwidth state of the fault disk is acquired, the fault point is determined by utilizing the bandwidth state, and the fault locating information of the fault point of the fault disk is output.
In the present application, after a fault disk determined on the production line is inserted into the server, PCIe device identification is first performed on the fault disk. If the PCIe device cannot be identified, the fault disk has a master control welding or power supply problem, and fault location information indicating that the fault disk has a master control welding or power supply problem can be output directly, so that the fault disk can be checked and the fault handled based on that information. If the PCIe device is identified, the firmware is burned into the fault disk and flash memory identification is then performed. If the flash memory cannot be identified, the fault point can be determined based on the disk log, and fault location information indicating that the fault disk has the fault point is output; if the flash memory can be identified, the bandwidth state of the fault disk can be acquired, the fault point is determined by using the bandwidth state, and fault location information indicating that the fault disk has the fault point is output.
The technical effects are as follows: by inserting the fault disk into the server and performing a series of device identification, information acquisition, and judgment steps, the specific fault point in the fault disk can be determined and the fault location information corresponding to that fault point can be output. This effectively improves the efficiency of locating controller problems in the solid state disk, makes it easier for maintenance personnel to locate and repair faults, allows master control abnormalities to be classified afterwards, and helps batch problems to be found in time.
In one embodiment of the present application, the second fault location module 104 is specifically configured to determine, using the disk log, whether all hardware units inside the controller of the faulty disk have been started;
if not, determining the fault point as a defect of the main control body, and outputting fault location information indicating that the fault disk has a main control body defect;
If yes, checking whether the initialization of the main control to the flash memory is abnormal, and if the initialization of the flash memory of a certain channel fails, outputting fault location information of the abnormal channel of the fault disc.
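The disk log analysis can be sketched as follows. The log line formats matched by the two regular expressions, the list of expected hardware units, and the returned strings are all hypothetical; a real disk log would need patterns that match the firmware's actual messages.

```python
import re
from typing import Iterable

# Hypothetical log formats, assumed only for this sketch.
UNIT_STARTED = re.compile(r"unit (\w+) started")
CHANNEL_FAIL = re.compile(r"flash init failed: channel (\d+)")


def analyze_disk_log(disk_log: str, expected_units: Iterable[str]) -> str:
    """Check hardware unit start-up first, then per-channel flash initialization."""
    started = set(UNIT_STARTED.findall(disk_log))
    missing = sorted(set(expected_units) - started)
    if missing:
        return "main control body defect (units not started: " + ", ".join(missing) + ")"
    failed = sorted({int(channel) for channel in CHANNEL_FAIL.findall(disk_log)})
    if failed:
        return "abnormal flash channel(s): " + ", ".join(map(str, failed))
    return "no fault found in disk log"
```

Checking unit start-up before flash initialization follows the order in the text: a channel failure is only reported when all hardware units inside the controller have started.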
In one embodiment of the present application, the third fault location module 105 is specifically configured to determine whether the bandwidth state matches the standard bandwidth state of the fault disk;
if they match, determining that the bandwidth state is normal;
If they do not match, determining that the bandwidth state is abnormal, determining that the fault point is the master control welding or the PCIe link hardware connection, and outputting fault location information indicating that the fault disk has a master control welding or PCIe link hardware connection fault.
In one embodiment of the present application, the third fault location module 105 is specifically configured to determine whether the bandwidth in the bandwidth state is consistent with the standard bandwidth in the standard bandwidth state, and whether the rate in the bandwidth state is consistent with the standard rate in the standard bandwidth state;
If yes, determining that the bandwidth state is matched with the standard bandwidth state, performing IO test on the fault disk, performing fault location processing and outputting fault location information;
if not, determining that the bandwidth state is not matched with the standard bandwidth state;
Correspondingly, if the bandwidth state is abnormal, determining the fault point as the master control welding or PCIe link hardware connection, and outputting fault positioning information of the master control welding or PCIe link hardware connection of the fault disk, wherein the method comprises the following steps:
if the rate of the fault disk is smaller than the standard rate in the standard bandwidth state, determining the fault point as the master control welding, and outputting fault location information indicating that the fault disk has a master control welding fault;
If the bandwidth of the fault disk is smaller than the standard bandwidth in the standard bandwidth state, determining the fault point as the PCIe link hardware connection, and outputting fault location information indicating that the fault disk has a PCIe link hardware connection fault;
wherein the PCIe link hardware connection comprises a hardware connection from the connector to the master PCIe link.
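The bandwidth comparison can be sketched as below. Treating "bandwidth" as the negotiated PCIe lane width and "rate" as the link speed in GT/s is an assumption of this sketch, as are the `LinkState` structure, its field names, and the returned strings.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class LinkState:
    width: int        # negotiated lane count, e.g. 4 for x4 (assumed meaning of "bandwidth")
    speed_gts: float  # link speed in GT/s, e.g. 16.0 (assumed meaning of "rate")


def classify_bandwidth(current: LinkState, standard: LinkState) -> List[str]:
    """Return the suspected fault points; an empty list means no shortfall was found."""
    faults = []
    if current.speed_gts < standard.speed_gts:
        faults.append("master control welding")
    if current.width < standard.width:
        faults.append("PCIe link hardware connection")
    return faults
```

An empty result corresponds to a normal bandwidth state, after which the IO test would be run. The sketch only flags values below the standard, which are the two cases the text spells out.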
In one embodiment of the present application, the method further comprises:
the fourth fault locating module is used for carrying out IO test on the fault disc if the bandwidth state is normal;
During the IO test, if the disk drops and the flash memory is lost, restarting the device and performing flash memory identification again;
If the flash memory is identified, outputting fault location information indicating that the fault disk has an abnormal PCIe link hardware connection, inspecting the PCIe link hardware connection, determining a cold-soldered object or a defective-material object and taking it as the fault point, and outputting fault location information indicating that the fault disk has the cold-soldered object or the defective-material object;
if the flash memory is not identified, outputting fault location information indicating that the fault disk has a controller abnormality.
In a specific embodiment of the present application, the fourth fault location module is specifically configured to grab a hardware unit register in the controller that interacts with the flash memory if the flash memory is not identified in the IO test process;
checking whether a target register corresponding to the main control in the hardware unit register is normal or not;
if not, outputting fault location information indicating that the fault disk has an abnormal register value in the controller;
if yes, acquiring the current disk log, and using it to determine whether a hardware unit of the controller has crashed; if a hardware unit has crashed, outputting fault location information indicating that the fault disk has a hardware unit crash in the controller.
In a specific embodiment of the present application, the fourth fault location module is further configured to, if no hardware unit crash occurs, grab an uncorrectable error key of the controller hardware unit in the current disk log, and determine whether an uncorrectable error occurs;
If yes, outputting fault location information of the fault disc with abnormal controller.
Corresponding to the above method embodiment, the embodiment of the present application further provides an electronic device, where an electronic device described below and a solid state disk fault locating method described above may be referred to correspondingly.
Referring to fig. 5, the electronic device includes:
A memory 332 for storing a computer program;
the processor 322 is configured to implement the steps of the solid state disk fault locating method according to the above method embodiment when executing the computer program.
Specifically, referring to fig. 6, fig. 6 is a schematic diagram of a specific structure of an electronic device according to the present embodiment. The electronic device may vary considerably with configuration or performance, and may include one or more processors (central processing units, CPU) 322 and a memory 332, where the memory 332 stores one or more computer programs 342 or data 344. The memory 332 may be transient storage or persistent storage. The program stored in the memory 332 may include one or more modules (not shown), each of which may include a series of instruction operations for the data processing apparatus. Further, the processor 322 may be configured to communicate with the memory 332 and execute, on the electronic device 301, the series of instruction operations in the memory 332.
The electronic device 301 may also include one or more power supplies 326, one or more wired or wireless network interfaces 350, one or more input/output interfaces 358, and/or one or more operating systems 341.
The steps in the solid state disk fault locating method described above may be implemented by the structure of the electronic device.
Corresponding to the above method embodiment, the embodiment of the present application further provides a readable storage medium, where a readable storage medium described below and a solid state disk fault locating method described above may be referred to correspondingly.
A readable storage medium, on which a computer program is stored, which when executed by a processor implements the steps of the solid state disk fault locating method of the above method embodiment.
The readable storage medium may be a USB flash drive, a removable hard disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk, an optical disk, or any other readable storage medium that can store program code.
In this specification, the embodiments are described in a progressive manner. Each embodiment focuses on its differences from the other embodiments, and the same or similar parts of the embodiments may be cross-referenced. Because the device disclosed in the embodiments corresponds to the method disclosed in the embodiments, its description is relatively brief, and the relevant points can be found in the description of the method.
Those of skill would further appreciate that the various illustrative elements and algorithm steps described in connection with the embodiments disclosed herein may be implemented as electronic hardware, computer software, or combinations of both, and that the various illustrative elements and steps are described above generally in terms of functionality in order to clearly illustrate the interchangeability of hardware and software. Whether such functionality is implemented as hardware or software depends upon the particular application and design constraints imposed on the solution. Those skilled in the art may implement the described functionality using different approaches for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present application.
The steps of a method or algorithm described in connection with the embodiments disclosed herein may be embodied directly in hardware, in a software module executed by a processor, or in a combination of the two. The software modules may be disposed in Random Access Memory (RAM), memory, read Only Memory (ROM), electrically programmable ROM, electrically erasable programmable ROM, registers, hard disk, a removable disk, a CD-ROM, or any other form of storage medium known in the art.
Finally, it is further noted that, in this document, relational terms such as first and second are used solely to distinguish one entity or action from another entity or action, without necessarily requiring or implying any actual such relationship or order between such entities or actions. Moreover, the terms "include", "comprise", or any other variation thereof are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus.
The principles and embodiments of the present application have been described herein with reference to specific examples, and the description of the above embodiments is intended only to assist in understanding the method of the present application and its core idea. Meanwhile, those skilled in the art may, following the idea of the present application, make changes to the specific embodiments and the scope of application. In summary, the content of this specification should not be construed as limiting the present application.

Claims (10)

1. A solid state disk fault positioning method, characterized by comprising the following steps:
After a fault disc determined on a production line is inserted into a server, PCIe equipment identification is carried out;
If the PCIe device is not identified, outputting fault positioning information of master control welding or power supply of the fault disk;
if the PCIe device is identified, the firmware is burned into the fault disk, and then flash memory identification is carried out;
If the flash memory is not identified, a disc log of the fault disc is obtained, a fault point is determined by using the disc log, and fault positioning information of the fault point of the fault disc is output;
if the flash memory is identified, acquiring the bandwidth state of the fault disk, determining a fault point by utilizing the bandwidth state, and outputting fault positioning information of the fault point of the fault disk.
2. The method of claim 1, wherein determining a failure point using the disk log and outputting failure location information of the failure point where the failure disk exists, comprises:
Determining whether all hardware units inside the controller of the fault disk are started or not by utilizing the disk log;
if not, determining the fault point as a defect of the main control body, and outputting fault location information indicating that the fault disk has a main control body defect;
if yes, checking whether the initialization of the main control to the flash memory is abnormal, and if the initialization failure of the flash memory of a certain channel occurs, outputting fault positioning information of the abnormal channel existing in the fault disk.
3. The method according to claim 1 or 2, wherein determining a failure point using the bandwidth status and outputting failure location information that the failure disk has the failure point, comprises:
judging whether the bandwidth state is matched with the standard bandwidth state of the fault disc or not;
if the bandwidth states are matched, determining that the bandwidth states are normal;
If the bandwidth states are not matched, determining that the bandwidth states are abnormal, determining that the fault point is the master control welding or PCIe link hardware connection, and outputting fault positioning information of the master control welding or PCIe link hardware connection of the fault disk.
4. A method according to claim 3, wherein determining whether the bandwidth status matches a standard bandwidth status of the failed disk comprises:
judging whether the bandwidth in the bandwidth state is consistent with the standard bandwidth in the standard bandwidth state and whether the rate in the bandwidth state is consistent with the standard rate in the standard bandwidth state;
If yes, determining that the bandwidth state is matched with the standard bandwidth state, performing IO test on the fault disk, performing fault location processing and outputting fault location information;
If not, determining that the bandwidth state is not matched with the standard bandwidth state;
correspondingly, if the bandwidth state is abnormal, determining the fault point as the master control welding or PCIe link hardware connection, and outputting fault positioning information of the master control welding or PCIe link hardware connection of the fault disk, wherein the fault positioning information comprises the following steps:
If the rate of the fault disk is smaller than the standard rate in the standard bandwidth state, determining the fault point as the master control welding, and outputting fault location information indicating that the fault disk has a master control welding fault;
If the bandwidth of the fault disk is smaller than the standard bandwidth in the standard bandwidth state, determining the fault point as the PCIe link hardware connection, and outputting fault location information indicating that the fault disk has a PCIe link hardware connection fault;
wherein the PCIe link hardware connection comprises a hardware connection from a connector to a master PCIe link.
5. The method of claim 3, further comprising, if the bandwidth status is normal:
IO test is carried out on the fault disc;
During the IO test, if the disk drops and the flash memory is lost, restarting the device and performing flash memory identification again;
If the flash memory is identified, outputting fault location information indicating that the fault disk has an abnormal PCIe link hardware connection, inspecting the PCIe link hardware connection, determining a cold-soldered object or a defective-material object and taking it as the fault point, and outputting fault location information indicating that the fault disk has the cold-soldered object or the defective-material object;
and if the flash memory is not identified, outputting fault positioning information of the fault disk with abnormal controller.
6. The method of claim 5, wherein outputting fault location information of the faulty disk having a controller exception if no flash memory is identified, comprises:
in the IO test process, if the flash memory is not identified, capturing a hardware unit register interacted with the flash memory in the controller;
Checking whether a target register corresponding to a main control in the hardware unit register is normal or not;
If not, outputting fault location information indicating that the fault disk has an abnormal register value in the controller;
if yes, acquiring the current disk log, and using it to determine whether a hardware unit of the controller has crashed; if a hardware unit has crashed, outputting fault location information indicating that the fault disk has a hardware unit crash in the controller.
7. The method of claim 6, wherein if no hardware unit crash occurs, further comprising:
capturing uncorrectable error keywords of a controller hardware unit in a current disc log, and judging whether uncorrectable errors occur or not;
if yes, outputting fault locating information of the fault disc with abnormal controller.
8. A solid state disk fault positioning device, characterized by comprising:
the PCIe identification module is used for carrying out PCIe equipment identification after the fault disk determined on the production line is inserted into the server;
The first fault locating module is used for outputting fault locating information of master control welding or power supply of the fault disc if PCIe equipment is not identified;
the flash memory identification module is used for carrying out flash memory identification after burning the firmware to the fault disk if the PCIe equipment is identified;
The second fault location module is used for acquiring a disc log of the fault disc if the flash memory is not identified, determining a fault point by utilizing the disc log and outputting fault location information of the fault point of the fault disc;
and the third fault location module is used for acquiring the bandwidth state of the fault disc if the flash memory is identified, determining a fault point by utilizing the bandwidth state, and outputting fault location information of the fault point of the fault disc.
9. An electronic device, comprising:
a memory for storing a computer program;
a processor for implementing the steps of the solid state disk fault locating method according to any one of claims 1 to 7 when executing the computer program.
10. A readable storage medium, characterized in that the readable storage medium has stored thereon a computer program which, when executed by a processor, implements the steps of the solid state disk fault localization method according to any one of claims 1 to 7.
CN202410225819.1A 2024-02-29 2024-02-29 Solid state disk fault positioning method, device, equipment and readable storage medium Pending CN118098332A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202410225819.1A CN118098332A (en) 2024-02-29 2024-02-29 Solid state disk fault positioning method, device, equipment and readable storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202410225819.1A CN118098332A (en) 2024-02-29 2024-02-29 Solid state disk fault positioning method, device, equipment and readable storage medium

Publications (1)

Publication Number Publication Date
CN118098332A true CN118098332A (en) 2024-05-28

Family

ID=91162995

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202410225819.1A Pending CN118098332A (en) 2024-02-29 2024-02-29 Solid state disk fault positioning method, device, equipment and readable storage medium

Country Status (1)

Country Link
CN (1) CN118098332A (en)

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination