CN110502399B

CN110502399B - Fault detection method and device

Info

Publication number: CN110502399B
Application number: CN201910783312.7A
Authority: CN
Inventors: 孙辽东
Original assignee: Guangdong Inspur Big Data Research Co Ltd
Current assignee: Guangdong Inspur Smart Computing Technology Co Ltd
Priority date: 2019-08-23
Filing date: 2019-08-23
Publication date: 2023-09-01
Anticipated expiration: 2039-08-23
Also published as: CN110502399A

Abstract

The invention provides a fault detection method and device. The method comprises: sending a first detection instruction to a node according to a preset detection period; upon receiving GPU device information corresponding to the first detection instruction, sending a second detection instruction to the node; upon receiving abnormal information corresponding to the second detection instruction, matching the GPU device information against pre-generated device file information; and, if the GPU device information does not match the device file information, determining that the node has a GPU fault. Matching the GPU device information against the pre-generated device file information makes it possible to judge whether a GPU card has failed, so that GPU card faults can be found quickly.

Description

Fault detection method and device

Technical Field

The present invention relates to the field of computer technologies, and in particular, to a fault detection method and apparatus.

Background

With the development of information technology, artificial intelligence has come to be widely applied in many aspects of daily life. It can greatly improve work efficiency and make life more convenient. In artificial intelligence, building a neural network model usually requires training that model.

The inventor's research shows that, when training tasks for neural network models run on the nodes of an AI platform, faults such as the loss of a GPU card occur frequently. When one GPU card in a node fails, none of the training tasks on that node can run normally. In the prior art, however, such a GPU card fault is often difficult to discover.

Disclosure of Invention

The technical problem to be solved by the invention is to provide a fault detection method that matches GPU device information against pre-generated device file information in order to judge whether a GPU card has failed, so that GPU card faults can be found quickly.

The invention also provides a fault detection device for ensuring that the method can be implemented and applied in practice.

A fault detection method, comprising:

sending a first detection instruction to the node according to a preset detection period;

when GPU equipment information corresponding to the first detection instruction is received, a second detection instruction is sent to the node;

when abnormal information corresponding to the second detection instruction is received, matching the GPU equipment information with equipment file information generated in advance;

and if the GPU equipment information is not matched with the equipment file information, determining that the node has GPU faults.

Optionally, in the above method, the process of generating the device file information comprises:

judging whether the node meets a preset detection condition;

when the node meets the preset detection condition, sending a second detection instruction to the node to acquire initial device information of the node;

and performing format conversion on the initial device information according to a preset format conversion rule to obtain the device file information of the node.

Optionally, in the above method, matching the GPU device information against the pre-generated device file information comprises:

acquiring each identifier contained in the device file information and each bus identifier (BUSID) contained in the GPU device information;

matching each identifier against each BUSID;

if there is an identifier that matches none of the BUSIDs, determining that the GPU corresponding to that identifier in the node has a card-drop fault;

and if, for every identifier, a matching BUSID exists in the GPU device information, determining that the node has no GPU card-drop fault.

Optionally, the above method further comprises:

updating a fault record corresponding to the node in a preset fault record file;

judging whether the updated fault record meets preset alarm conditions or not;

and when the updated fault record meets the preset alarm condition, calling a preset message middleware to perform alarm operation.

Optionally, the above method further comprises:

collecting the running condition information of the node;

traversing a preset configuration file based on the running condition information to determine a fault cause;

and generating prompt information corresponding to the fault reason.

A fault detection device comprising:

the first sending unit is used for sending a first detection instruction to the node according to a preset detection period;

the second sending unit is used for sending a second detection instruction to the node when GPU equipment information corresponding to the first detection instruction is received;

the matching unit is used for matching the GPU equipment information with equipment file information generated in advance when abnormal information corresponding to the second detection instruction is received;

and the determining unit is used for determining that the node has GPU faults when the GPU equipment information is not matched with the equipment file information.

Optionally, the above device further comprises:

a first judging unit, configured to judge whether the node meets a preset detection condition;

an obtaining unit, configured to send a second detection instruction to the node when the node meets a preset detection condition, so as to obtain initial equipment information of the node;

and the execution unit is used for carrying out format conversion on the initial equipment information according to a preset format conversion rule to obtain the equipment file information of the node.

Optionally, in the above device, the matching unit comprises:

an acquisition subunit, configured to acquire each identifier contained in the device file information and each bus identifier (BUSID) contained in the GPU device information;

a matching subunit, configured to match each identifier against each BUSID;

a first determining subunit, configured to determine, when there is an identifier that matches none of the BUSIDs, that the GPU corresponding to that identifier in the node has a card-drop fault;

and a second determining subunit, configured to determine that the node has no GPU card-drop fault when, for every identifier, a matching BUSID exists in the GPU device information.

Optionally, the above device further comprises:

an updating unit, configured to update, in a preset fault record file, a fault record corresponding to the node;

a second judging unit, configured to judge whether the updated fault record meets a preset alarm condition;

and the calling unit is used for calling preset message middleware to carry out alarm operation when the updated fault record meets preset alarm conditions.

Optionally, the above device further comprises:

the acquisition unit is used for acquiring the running condition information of the node;

the searching unit is used for traversing a preset configuration file according to the running condition information so as to determine a fault reason;

and the generating unit is used for generating prompt information corresponding to the fault reason.

Compared with the prior art, the invention has the following advantages:

The invention provides a fault detection method comprising: sending a first detection instruction to a node according to a preset detection period; upon receiving GPU device information corresponding to the first detection instruction, sending a second detection instruction to the node; upon receiving abnormal information corresponding to the second detection instruction, matching the GPU device information against pre-generated device file information; and, if the GPU device information does not match the device file information, determining that the node has a GPU fault. Matching the GPU device information against the pre-generated device file information makes it possible to judge whether a GPU card has failed, so that GPU card faults can be found quickly.

Drawings

To illustrate the technical solutions of the embodiments of the present invention more clearly, the drawings needed in the description of the embodiments are briefly introduced below. The drawings described below show only some embodiments of the present invention; a person skilled in the art may obtain other drawings from them without inventive effort.

FIG. 1 is a flow chart of a fault detection method according to the present invention;

FIG. 2 is a flow chart of another method of detecting faults according to the present invention;

FIG. 3 is a flow chart of another method of detecting faults according to the present invention;

FIG. 4 is a flowchart of a fault detection method according to another embodiment of the present invention;

FIG. 5 is a diagram illustrating an exemplary method for detecting faults according to the present invention;

FIG. 6 is a schematic structural diagram of a fault detection device according to the present invention;

FIG. 7 is a schematic structural diagram of an electronic device according to the present invention.

Detailed Description

The following description of the embodiments of the present invention will be made clearly and completely with reference to the accompanying drawings, in which it is apparent that the embodiments described are only some embodiments of the present invention, but not all embodiments. All other embodiments, which can be made by those skilled in the art based on the embodiments of the invention without making any inventive effort, are intended to be within the scope of the invention.

The invention is operational with numerous general-purpose or special-purpose computing environments or configurations, for example: personal computers, server computers, hand-held or portable devices, tablet devices, multiprocessor devices, distributed computing environments that include any of the above devices, and the like.

An embodiment of the invention provides a fault detection method that can be applied to various system platforms. The method may be executed by a computer terminal or by the processor of any of various mobile devices. A flow chart of the method is shown in FIG. 1, and the method specifically comprises the following steps:

S101: Send a first detection instruction to the node according to a preset detection period.

In the method provided by the embodiment of the invention, the node may be any node in an AI platform and may be used for training a neural network model; the detection period may be set by a technician according to actual conditions; and the first detection instruction may be an lspci instruction.

S102: When GPU device information corresponding to the first detection instruction is received, send a second detection instruction to the node.

In the method provided by the embodiment of the invention, when the node responds to the first detection instruction, it feeds back the physical device information corresponding to that instruction. The physical device information includes the GPU device information, which may include the bus identifier (BUSID) of each GPU of the node, manufacturer information of the GPU, and the like.

Specifically, receiving GPU device information corresponding to the first detection instruction indicates that the node is fitted with a GPU card, so a second detection instruction needs to be sent to further detect the state of the node's GPU card; the second detection instruction may be an nvidia-smi instruction. If no GPU device information corresponding to the first detection instruction is received, the node is not fitted with a GPU card.

S103: When abnormal information corresponding to the second detection instruction is received, match the GPU device information against the pre-generated device file information.

In the method provided by the embodiment of the invention, if the information returned for the second detection instruction is empty or is an error message, the GPU device information is matched against the pre-generated device file information. The device file information records an identifier for each GPU in the node; specifically, the identifier may be a BUSID.

S104: If the GPU device information does not match the device file information, determine that the node has a GPU fault.

In the method provided by the embodiment of the invention, if any identifier in the device file information matches none of the BUSIDs in the GPU device information, the node has a GPU card-drop fault.

The fault detection method provided by the embodiment of the invention thus comprises: sending a first detection instruction to the node according to a preset detection period; upon receiving GPU device information corresponding to the first detection instruction, sending a second detection instruction to the node; upon receiving abnormal information corresponding to the second detection instruction, matching the GPU device information against the pre-generated device file information; and, if the GPU device information does not match the device file information, determining that the node has a GPU fault. Matching the GPU device information against the pre-generated device file information makes it possible to judge whether a GPU card has failed, so that GPU card faults can be found quickly.
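Steps S101 to S104 can be sketched as a single detection cycle. This is a minimal illustration under stated assumptions, not the patented implementation: the names `CycleResult` and `run_detection_cycle` are hypothetical, the two detection instructions are passed in as callables so that the control flow can be shown without invoking real commands, and an exception raised by the second instruction is treated as abnormal information.

```python
from typing import Callable, Iterable, NamedTuple, Set


class CycleResult(NamedTuple):
    state: str          # "no_gpu", "healthy", "no_fault" or "fault"
    missing: Set[str]   # recorded identifiers not reported by the first instruction


def run_detection_cycle(
    first_instruction: Callable[[], Set[str]],
    second_instruction: Callable[[], str],
    device_file_ids: Iterable[str],
) -> CycleResult:
    """Run one detection cycle (S101 to S104) and report the outcome."""
    # S101: the first instruction returns the BUSIDs of the GPUs it can see.
    busids = set(first_instruction())
    if not busids:
        # No GPU device information: the node is treated as having no GPU card.
        return CycleResult("no_gpu", set())

    # S102: GPU device information was received, so send the second instruction.
    try:
        status = second_instruction()
    except Exception:
        status = ""  # assumption: an error counts as abnormal information

    if status.strip():
        # Normal response: no matching is needed in this cycle.
        return CycleResult("healthy", set())

    # S103: abnormal (empty or error) response, so match against the device file.
    missing = set(device_file_ids) - busids

    # S104: any recorded identifier without a matching BUSID indicates a fault.
    return CycleResult("fault" if missing else "no_fault", missing)
```

In practice the cycle would be triggered once per preset detection period, for example by a scheduler, with the callables wrapping the real commands.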

In the fault detection method provided by the embodiment of the present invention, based on the above implementation, the process of generating the device file information specifically comprises:

judging whether the node meets a preset detection condition;

when the node meets the preset detection condition, sending a second detection instruction to the node to acquire initial device information of the node;

and performing format conversion on the initial device information according to a preset format conversion rule to obtain the device file information of the node.

In the method provided by the embodiment of the invention, judging whether the node meets the preset detection condition comprises: judging whether the version information of the node's operating system meets a preset version requirement; if so, judging whether the first detection command and the second detection command are available; and, when both commands are available, determining that the node meets the preset detection condition. When the version information of the node's operating system does not meet the preset version requirement, or the first detection command or the second detection command is unavailable, the node is determined not to meet the preset detection condition.
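The detection-condition check can be sketched as follows. The function name `meets_detection_conditions` is hypothetical, and two interpretations are assumptions rather than statements from the text: the preset version requirement is modeled as a minimum version tuple, and a command counts as available when it can be found on the executable search path.

```python
import shutil
from typing import Callable, Optional, Sequence, Tuple


def meets_detection_conditions(
    os_version: Tuple[int, ...],
    min_version: Tuple[int, ...],
    commands: Sequence[str] = ("lspci", "nvidia-smi"),
    which: Callable[[str], Optional[str]] = shutil.which,
) -> bool:
    """Return True only if the OS version and both commands pass the check."""
    # Assumption: the version requirement is a minimum version, compared as tuples.
    if os_version < min_version:
        return False
    # Assumption: "available" means the command resolves on the search path.
    return all(which(command) is not None for command in commands)
```

The `which` parameter exists only so the lookup can be replaced in tests; by default it uses the standard library's `shutil.which`.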

Specifically, when the node meets the preset detection condition, a second detection instruction is sent to the node to obtain information about each GPU card of the node, which may include host, gpu_uuid, gpu_index and BUSID. The obtained information about each GPU card is recorded as the initial device information, and a specific operation is performed on the BUSIDs in the initial device information to obtain the device file information; the specific operation may be converting each BUSID to lowercase characters and storing it in hexadecimal.

In the method provided by the embodiment of the invention, the device information of the node is obtained in advance so that it can be compared with the GPU device information obtained by the first detection instruction to judge whether the node has a GPU card-drop fault. This makes it convenient for the AI platform to reschedule tasks and prevents training tasks from being delayed.
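The generation of the device file information can be sketched as below. The sketch assumes the initial device information arrives as comma-separated lines of index, UUID and bus ID, such as `nvidia-smi --query-gpu=index,uuid,pci.bus_id --format=csv,noheader` prints; the record layout, the function names and the sample values are illustrative. Trimming the PCI domain to four hexadecimal digits is an added assumption, made so that the stored BUSID has the same shape as the domain-qualified addresses printed by `lspci -D`.

```python
from typing import Dict, List


def normalize_busid(busid: str) -> str:
    """Lowercase a BUSID and render its domain as four hexadecimal digits."""
    busid = busid.strip().lower()
    domain, sep, rest = busid.partition(":")
    if sep and rest.count(":") == 1:
        # Form "domain:bus:device.function": re-render the domain in 4-digit hex.
        return f"{int(domain, 16):04x}:{rest}"
    return busid  # no domain part, keep "bus:device.function" as is


def build_device_records(host: str, query_output: str) -> List[Dict[str, str]]:
    """Convert raw per-GPU lines into device file records."""
    records = []
    for line in query_output.splitlines():
        if not line.strip():
            continue
        index, uuid, busid = [field.strip() for field in line.split(",")]
        records.append(
            {
                "host": host,
                "gpu_index": index,
                "gpu_uuid": uuid,
                "busid": normalize_busid(busid),
            }
        )
    return records


# Made-up sample input for illustration only.
SAMPLE_QUERY_OUTPUT = "0, GPU-aaaa, 00000000:3B:00.0\n1, GPU-bbbb, 00000000:AF:00.0\n"
```

The resulting list can then be serialized to a local file, for example as JSON, and read back at comparison time.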

In the fault detection method provided by the embodiment of the present invention, based on the above implementation, matching the GPU device information against the pre-generated device file information may, as shown in FIG. 2, comprise:

S201: Acquire each identifier contained in the device file information and each bus identifier (BUSID) contained in the GPU device information.

In the method provided by the embodiment of the invention, the device file information is parsed to obtain each identifier, which may be a BUSID, and the GPU device information is parsed to obtain each BUSID it contains.

S202: Match each identifier against each BUSID.

In the method provided by the embodiment of the invention, each identifier in the pre-stored device file information is matched against each BUSID in the GPU device information acquired through the first detection instruction.

S203: If there is an identifier that matches none of the BUSIDs, determine that the GPU corresponding to that identifier in the node has a card-drop fault.

In the method provided by the embodiment of the invention, if the device file information contains a BUSID that does not exist in the GPU device information, the GPU corresponding to that BUSID has a card-drop fault.

S204: If, for every identifier, a matching BUSID exists in the GPU device information, determine that the node has no GPU card-drop fault.

In the method provided by the embodiment of the invention, if a corresponding BUSID can be found in the GPU device information for every identifier in the device file information, the node has no GPU card-drop fault.

In the method provided by the embodiment of the invention, the BUSIDs in the device file information are compared with the BUSIDs obtained by the first detection instruction to judge whether the node has a GPU card-drop fault. The failed GPU card can be identified quickly from its BUSID, which improves detection efficiency and facilitates subsequent repair work by technicians.
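The matching in S201 to S204 amounts to checking every recorded identifier for membership in the set of detected BUSIDs. A minimal sketch, with the hypothetical name `find_dropped_gpus` and the assumption that both sides are compared case-insensitively:

```python
from typing import Iterable, List


def find_dropped_gpus(
    file_ids: Iterable[str], detected_busids: Iterable[str]
) -> List[str]:
    """Return the recorded identifiers that match none of the detected BUSIDs."""
    detected = {busid.strip().lower() for busid in detected_busids}
    # S203: an identifier with no matching BUSID marks a dropped GPU card.
    # S204: an empty result means every identifier found a matching BUSID.
    return [i for i in file_ids if i.strip().lower() not in detected]
```

Returning the unmatched identifiers, rather than a plain boolean, keeps the BUSID of each failed card available for the fault record.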

In the fault detection method provided by the embodiment of the present invention, based on the above implementation, as shown in FIG. 3, the fault detection method further comprises:

S301: Update the fault record corresponding to the node in a preset fault record file.

In the method provided by the embodiment of the invention, each GPU card in the node that has a card-drop fault is recorded; the record comprises the identifier of the GPU card, the fault time and the number of faults. One or more GPU cards may have a card-drop fault.

S302: Judge whether the updated fault record meets a preset alarm condition.

In the method provided by the embodiment of the invention, it is judged whether this is the first fault of the GPU card. If it is, the updated fault record is determined to meet the preset alarm condition. If it is not, the historical alarm time of the GPU card closest to the current moment is obtained, and it is judged whether the interval between the current fault time and that historical alarm time is greater than a preset time interval threshold; if it is, the updated fault record is determined to meet the preset alarm condition.

S303: When the updated fault record meets the preset alarm condition, call a preset message middleware to perform an alarm operation.

In the method provided by the embodiment of the invention, when the alarm condition is met, alarm information is generated for the GPU card that currently has a card-drop fault, and the upper-layer service is notified through a RESTful interface or message middleware, so that the upper-layer service can reschedule GPU resources after receiving the message.

By setting the alarm condition, the method provided by the embodiment of the invention avoids repeated alarms.
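The record update and alarm decision of S301 and S302 can be sketched as follows. The `FaultRecord` fields and the function name are hypothetical, the fault record file is modeled as an in-memory dictionary, and times are plain numbers in seconds.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass
class FaultRecord:
    gpu_id: str
    fault_count: int = 0
    last_fault_time: Optional[float] = None
    last_alarm_time: Optional[float] = None


def record_fault_and_check_alarm(
    records: Dict[str, FaultRecord],
    gpu_id: str,
    now: float,
    min_interval: float,
) -> bool:
    """Update the record for gpu_id and return True if an alarm should be raised."""
    record = records.setdefault(gpu_id, FaultRecord(gpu_id))
    first_fault = record.fault_count == 0

    # S301: update the fault time and the number of faults.
    record.fault_count += 1
    record.last_fault_time = now

    # S302: alarm on the first fault, or when the most recent alarm is older
    # than the preset interval threshold.
    should_alarm = (
        first_fault
        or record.last_alarm_time is None
        or (now - record.last_alarm_time) > min_interval
    )
    if should_alarm:
        record.last_alarm_time = now
    return should_alarm
```

A repeated fault inside the interval still increments the count but raises no second alarm, which is how the repeated-alarm case is suppressed.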

In the fault detection method provided by the embodiment of the present invention, based on the above implementation, as shown in FIG. 4, the fault detection method further comprises:

S401: Collect the running condition information of the node.

In the method provided by the embodiment of the invention, the temperature information of the node, the information about processes running on the node, and the like may be collected.

S402: Traverse a preset configuration file based on the running condition information to determine the fault cause.

In the method provided by the embodiment of the invention, the fault cause corresponding to the running condition information is obtained by traversing the preset configuration file, which may contain the fault cause corresponding to each item of running condition information.

S403: Generate prompt information corresponding to the fault cause.

In the method provided by the embodiment of the invention, once the fault cause corresponding to the running condition is obtained, prompt information may be generated to inform the user of the possible cause of the current GPU card fault.

By prompting the user with the fault cause, the method provided by the embodiment of the invention lets the user quickly find a solution based on that cause, so that the fault can be resolved quickly.
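Steps S402 and S403 can be sketched as a lookup over a rule table. The text does not specify the configuration file format, so everything here is an assumption for illustration: the rule layout, the metric names, the thresholds and the function names are all hypothetical.

```python
from typing import Any, Dict, List

# Hypothetical rule table standing in for the preset configuration file.
# Metric names and thresholds are invented for illustration.
EXAMPLE_RULES: List[Dict[str, Any]] = [
    {"metric": "temperature_c", "above": 90, "cause": "GPU or node overheating"},
    {"metric": "gpu_process_count", "above": 64, "cause": "too many processes using the GPU"},
]


def find_fault_causes(
    conditions: Dict[str, float], rules: List[Dict[str, Any]]
) -> List[str]:
    """S402: traverse the rules and collect causes whose condition is met."""
    causes = []
    for rule in rules:
        observed = conditions.get(rule["metric"])
        if observed is not None and observed > rule["above"]:
            causes.append(rule["cause"])
    return causes


def build_prompt(busid: str, causes: List[str]) -> str:
    """S403: generate prompt information for the fault causes found."""
    if not causes:
        return f"GPU {busid} dropped; no matching cause found in the configuration."
    return f"GPU {busid} dropped; possible cause(s): " + "; ".join(causes)
```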

In a specific application, the fault detection method provided by the embodiment of the invention may be implemented as a script. FIG. 5 is a schematic diagram of the fault detection process provided in an embodiment of the present invention; the specific steps are as follows:

Step a1: Obtain the GPU information of the node through the nvidia-smi command and store it in a local file to obtain the device file information of the node. In a specific implementation, environment detection is first performed on the node: after the node comes online, its installation environment is checked, including the system version, the NVIDIA driver version, and whether the nvidia-smi and lspci commands can be used normally. When the environment detection passes, basic information about each GPU, including host, gpu_uuid, gpu_index and BUSID, is acquired through the nvidia-smi command, and the BUSID is specially processed (converted to lowercase and stored in hexadecimal).

Step a2: Data comparison. In a specific implementation, the lspci command is used to acquire all PCI information related to NVIDIA GPUs, mainly the BUSIDs. If no information is acquired, the node is considered to have no GPU card, and no fault notification is needed.

The nvidia-smi command is then used to obtain the basic information of the GPUs. If the command returns normally, no card has dropped and no fault notification is needed.

When nvidia-smi returns no GPU information or returns abnormally, the BUSIDs acquired by lspci are compared with the information in the file. If a GPU recorded in the file is not found by the lspci command, that GPU card is considered to have dropped; the fault information is recorded, written to the fault record file (InfluxDB), and the upper-layer service is notified.

The number of notifications and the alarm time of each fault are recorded; when the same GPU card fails again within the unit time, the upper-layer service does not need to be notified again.

The data comparison is packaged as a script, which is executed periodically through the system crontab.
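The lspci part of the data comparison can be sketched as a parser over the command's text output. The function name is hypothetical and the sample lines are illustrative rather than captured from a real node. The sketch assumes one device per line in the usual "address class: description" layout and selects devices by the word "NVIDIA" in the description; filtering by PCI vendor ID (for example `lspci -D -d 10de:`) would be an alternative.

```python
import re
from typing import List

# A PCI address, optionally prefixed by a domain, followed by the description.
_LSPCI_LINE = re.compile(
    r"^(?P<busid>(?:[0-9a-fA-F]{4,}:)?[0-9a-fA-F]{2}:[0-9a-fA-F]{2}\.[0-7])\s+(?P<desc>.*)$"
)


def nvidia_busids_from_lspci(lspci_output: str) -> List[str]:
    """Extract lowercase BUSIDs of NVIDIA devices from lspci text output."""
    busids = []
    for line in lspci_output.splitlines():
        match = _LSPCI_LINE.match(line.strip())
        if match and "nvidia" in match.group("desc").lower():
            busids.append(match.group("busid").lower())
    return busids


# Illustrative sample in the style of `lspci -D` output.
SAMPLE_LSPCI_OUTPUT = """\
0000:00:1f.2 SATA controller: Intel Corporation C620 Series Chipset SATA Controller
0000:3b:00.0 3D controller: NVIDIA Corporation GV100GL [Tesla V100 PCIe 32GB] (rev a1)
0000:af:00.0 3D controller: NVIDIA Corporation GV100GL [Tesla V100 PCIe 32GB] (rev a1)
"""
```

An empty result corresponds to the case in which the node is considered to have no GPU card.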

Step a3: In a specific implementation, after a card-drop fault is determined, the upper-layer service is notified through a RESTful interface or message middleware, and the upper-layer service reschedules GPU resources after receiving the message.

The specific implementation manners and the derivative processes of the implementation manners are all within the protection scope of the invention.

Corresponding to the method shown in FIG. 1, an embodiment of the present invention further provides a fault detection device for implementing that method. The fault detection device may be applied to a computer terminal or to various mobile devices. Its structural schematic diagram is shown in FIG. 6, and it specifically comprises:

a first sending unit 501, configured to send a first detection instruction to a node according to a preset detection period;

a second sending unit 502, configured to send a second detection instruction to the node when GPU device information corresponding to the first detection instruction is received;

a matching unit 503, configured to match the GPU device information with device file information generated in advance when abnormal information corresponding to the second detection instruction is received;

a determining unit 504, configured to determine that the node has a GPU fault when the GPU device information does not match the device file information.

The fault detection device provided by the embodiment of the invention sends a first detection instruction to the node according to a preset detection period; upon receiving GPU device information corresponding to the first detection instruction, sends a second detection instruction to the node; upon receiving abnormal information corresponding to the second detection instruction, matches the GPU device information against the pre-generated device file information; and, if the GPU device information does not match the device file information, determines that the node has a GPU fault. Matching the GPU device information against the pre-generated device file information makes it possible to judge whether a GPU card has failed, so that GPU card faults can be found quickly.

The device provided by the embodiment of the invention further comprises:

In the apparatus provided by the embodiment of the present invention, the matching unit 503 includes:

a matching subunit, configured to match each identifier with each BUSID;

The device provided by the embodiment of the invention further comprises:

The embodiment of the invention also provides a storage medium, which comprises stored instructions, wherein the equipment where the storage medium is located is controlled to execute the fault detection method when the instructions run.

The embodiment of the present invention further provides an electronic device, whose structural schematic diagram is shown in fig. 7, specifically including a memory 601, and one or more instructions 602, where the one or more instructions 602 are stored in the memory 601, and configured to be executed by the one or more processors 603, where the one or more instructions 602 perform the following operations:

It should be noted that the embodiments in this specification are described progressively: each embodiment focuses on its differences from the others, and for identical or similar parts the embodiments may be referred to one another. Because the device embodiments are substantially similar to the method embodiments, they are described relatively briefly; for relevant details, refer to the description of the method embodiments.

Finally, it is further noted that relational terms such as first and second, and the like are used solely to distinguish one entity or action from another entity or action without necessarily requiring or implying any actual such relationship or order between such entities or actions. Moreover, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising one … …" does not exclude the presence of other like elements in a process, method, article, or apparatus that comprises the element.

For convenience of description, the above devices are described as being functionally divided into various units, respectively. Of course, the functions of each element may be implemented in the same piece or pieces of software and/or hardware when implementing the present invention.

From the above description of embodiments, it will be apparent to those skilled in the art that the present invention may be implemented in software plus a necessary general hardware platform. Based on such understanding, the technical solution of the present invention may be embodied essentially or in a part contributing to the prior art in the form of a software product, which may be stored in a storage medium, such as a ROM/RAM, a magnetic disk, an optical disk, etc., including several instructions for causing a computer device (which may be a personal computer, a server, or a network device, etc.) to execute the method described in the embodiments or some parts of the embodiments of the present invention.

The fault detection method and device provided by the present invention have been described in detail above. Specific examples have been used to illustrate the principles and embodiments of the invention, and the description of these examples is intended only to help in understanding the method and its core idea. A person skilled in the art may, following the ideas of the present invention, make changes to the specific embodiments and the scope of application. In view of the above, the content of this description should not be construed as limiting the present invention.

Claims

1. A fault detection method, comprising:

sending a first detection instruction to the node according to a preset detection period; the first detection instruction is used for detecting whether the GPU card is loaded in the node; the first detection instruction comprises an Lspci instruction;

when GPU equipment information corresponding to the first detection instruction is received, a second detection instruction is sent to the node; the second detection instruction is used for detecting the state of the GPU card; the second detection instruction comprises an nvidia_smi instruction;

when abnormal information corresponding to the second detection instruction is received, each identifier contained in the pre-generated equipment file information and each bus identifier BUSID contained in the GPU equipment information are obtained;

matching each identifier with each BUSID respectively;

if the identification which is not matched with each BUSID exists, determining that the GPU corresponding to the identification in the node has a card failure;

if the BUSID matched with each identifier exists in the GPU equipment information, determining that the node does not have GPU card-off faults;

collecting the running condition information of the node;

and generating prompt information corresponding to the fault reason.

2. The method of claim 1, wherein the generating of the device file information comprises:

judging whether the node meets preset detection conditions or not;

3. The method as recited in claim 1, further comprising:

judging whether the updated fault record meets a preset alarm condition or not;

4. A fault detection device, comprising:

the first sending unit is used for sending a first detection instruction to the node according to a preset detection period; the first detection instruction is used for detecting whether the GPU card is loaded in the node; the first detection instruction comprises an Lspci instruction;

the second sending unit is used for sending a second detection instruction to the node when GPU equipment information corresponding to the first detection instruction is received; the second detection instruction is used for detecting the state of the GPU card; the second detection instruction comprises an nvidia_smi instruction;

the determining unit is used for determining that the node has GPU faults when the GPU equipment information is not matched with the equipment file information;

the matching unit includes:

a matching subunit, configured to match each identifier with each BUSID;

the second determining subunit is used for determining that the node does not have a GPU card-off fault when the BUSID matched with each identifier exists in the GPU equipment information;

5. The apparatus as recited in claim 4, further comprising:

6. The apparatus as recited in claim 4, further comprising:

the second judging unit is used for judging whether the updated fault record meets the preset alarm condition or not;