CN110502399B - Fault detection method and device - Google Patents

Publication number
CN110502399B
Authority
CN
China
Prior art keywords
node
gpu
information
preset
detection
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201910783312.7A
Other languages
Chinese (zh)
Other versions
CN110502399A (en)
Inventor
孙辽东
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Guangdong Inspur Smart Computing Technology Co Ltd
Original Assignee
Guangdong Inspur Big Data Research Co Ltd
Application filed by Guangdong Inspur Big Data Research Co Ltd
Priority to CN201910783312.7A
Publication of CN110502399A
Application granted
Publication of CN110502399B

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00: Error detection; Error correction; Monitoring
    • G06F11/30: Monitoring
    • G06F11/3003: Monitoring arrangements specially adapted to the computing system or computing system component being monitored
    • G06F11/3024: Monitoring arrangements specially adapted to the computing system or computing system component being monitored where the computing system component is a central processing unit [CPU]
    • G06F11/32: Monitoring with visual or acoustical indication of the functioning of the machine
    • G06F11/324: Display of status information
    • G06F11/327: Alarm or error message display
    • G06F11/34: Recording or statistical evaluation of computer activity, e.g. of down time, of input/output operation; Recording or statistical evaluation of user activity, e.g. usability assessment
    • G06F11/3452: Performance evaluation by statistical analysis
    • Y: GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02: TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D: CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00: Energy efficient computing, e.g. low power processors, power management or thermal management

Abstract

The invention provides a fault detection method and device. The method comprises: sending a first detection instruction to a node according to a preset detection period; when GPU device information corresponding to the first detection instruction is received, sending a second detection instruction to the node; when abnormal information corresponding to the second detection instruction is received, matching the GPU device information against device file information generated in advance; and if the GPU device information does not match the device file information, determining that the node has a GPU fault. Because the GPU device information is matched against pre-generated device file information, the method can determine whether a GPU card has failed and can find GPU card faults quickly.

Description

Fault detection method and device
Technical Field
The present invention relates to the field of computer technologies, and in particular, to a fault detection method and apparatus.
Background
With the development of information technology, artificial intelligence is widely applied in many aspects of daily life. It can greatly improve work efficiency and make life more convenient. Building a neural network model for an artificial intelligence application usually requires training that model.
The inventor's research shows that when training tasks for neural network models run on the nodes of an AI platform, faults such as the loss of a GPU card occur frequently. When one GPU card in a node fails, none of the training tasks on that node can execute normally. In the prior art, however, such a GPU card fault is often difficult to discover.
Disclosure of Invention
The technical problem to be solved by the invention is to provide a fault detection method that matches GPU device information against device file information generated in advance, so as to determine whether a GPU card has failed and to find GPU card faults quickly.
The invention also provides a fault detection device, which ensures that the method can be implemented and applied in practice.
A fault detection method, comprising:
sending a first detection instruction to a node according to a preset detection period;
when GPU device information corresponding to the first detection instruction is received, sending a second detection instruction to the node;
when abnormal information corresponding to the second detection instruction is received, matching the GPU device information against device file information generated in advance;
and if the GPU device information does not match the device file information, determining that the node has a GPU fault.
Optionally, in the above method, the process of generating the device file information includes:
judging whether the node meets a preset detection condition;
when the node meets the preset detection condition, sending a second detection instruction to the node to acquire initial device information of the node;
and performing format conversion on the initial device information according to a preset format conversion rule to obtain the device file information of the node.
Optionally, in the above method, matching the GPU device information against the device file information generated in advance includes:
acquiring each identifier contained in the device file information and each bus identifier (BUSID) contained in the GPU device information;
matching each identifier against each BUSID;
if an identifier matches none of the BUSIDs, determining that the GPU corresponding to that identifier in the node has a card-drop fault;
and if, for every identifier, a matching BUSID exists in the GPU device information, determining that the node has no GPU card-drop fault.
Optionally, the above method further includes:
updating a fault record corresponding to the node in a preset fault record file;
judging whether the updated fault record meets a preset alarm condition;
and when the updated fault record meets the preset alarm condition, calling preset message middleware to perform an alarm operation.
Optionally, the above method further includes:
collecting running condition information of the node;
traversing a preset configuration file based on the running condition information to determine a fault cause;
and generating prompt information corresponding to the fault cause.
A fault detection device, comprising:
a first sending unit, configured to send a first detection instruction to a node according to a preset detection period;
a second sending unit, configured to send a second detection instruction to the node when GPU device information corresponding to the first detection instruction is received;
a matching unit, configured to match the GPU device information against device file information generated in advance when abnormal information corresponding to the second detection instruction is received;
and a determining unit, configured to determine that the node has a GPU fault when the GPU device information does not match the device file information.
Optionally, the above device further includes:
a first judging unit, configured to judge whether the node meets a preset detection condition;
an obtaining unit, configured to send a second detection instruction to the node when the node meets the preset detection condition, so as to obtain initial device information of the node;
and an execution unit, configured to perform format conversion on the initial device information according to a preset format conversion rule to obtain the device file information of the node.
Optionally, in the above device, the matching unit includes:
an acquisition subunit, configured to acquire each identifier contained in the device file information and each bus identifier (BUSID) contained in the GPU device information;
a matching subunit, configured to match each identifier against each BUSID;
a first determining subunit, configured to determine, when an identifier matches none of the BUSIDs, that the GPU corresponding to that identifier in the node has a card-drop fault;
and a second determining subunit, configured to determine that the node has no GPU card-drop fault when, for every identifier, a matching BUSID exists in the GPU device information.
Optionally, the above device further includes:
an updating unit, configured to update, in a preset fault record file, a fault record corresponding to the node;
a second judging unit, configured to judge whether the updated fault record meets a preset alarm condition;
and a calling unit, configured to call preset message middleware to perform an alarm operation when the updated fault record meets the preset alarm condition.
Optionally, the above device further includes:
a collecting unit, configured to collect running condition information of the node;
a searching unit, configured to traverse a preset configuration file according to the running condition information so as to determine a fault cause;
and a generating unit, configured to generate prompt information corresponding to the fault cause.
Compared with the prior art, the invention has the following advantages:
The invention provides a fault detection method comprising: sending a first detection instruction to a node according to a preset detection period; when GPU device information corresponding to the first detection instruction is received, sending a second detection instruction to the node; when abnormal information corresponding to the second detection instruction is received, matching the GPU device information against device file information generated in advance; and if the GPU device information does not match the device file information, determining that the node has a GPU fault. Because the GPU device information is matched against pre-generated device file information, the method can determine whether a GPU card has failed and can find GPU card faults quickly.
Drawings
To illustrate the technical solutions of the embodiments of the present invention more clearly, the drawings needed in the description of the embodiments are briefly introduced below. The drawings described below show only some embodiments of the present invention; a person skilled in the art can obtain other drawings from them without inventive effort.
FIG. 1 is a flow chart of a fault detection method according to the present invention;
FIG. 2 is a flow chart of another method of detecting faults according to the present invention;
FIG. 3 is a flow chart of another method of detecting faults according to the present invention;
FIG. 4 is a flowchart of a fault detection method according to another embodiment of the present invention;
FIG. 5 is a diagram illustrating an exemplary method for detecting faults according to the present invention;
FIG. 6 is a schematic structural diagram of a fault detection device according to the present invention;
FIG. 7 is a schematic structural diagram of an electronic device according to the present invention.
Detailed Description
The technical solutions in the embodiments of the present invention are described clearly and completely below with reference to the accompanying drawings. The described embodiments are only some, not all, of the embodiments of the present invention. All other embodiments obtained by those skilled in the art based on these embodiments without inventive effort fall within the scope of protection of the present invention.
The invention is operational with numerous general-purpose or special-purpose computing environments or configurations, for example: personal computers, server computers, hand-held or portable devices, tablet devices, multiprocessor devices, distributed computing environments that include any of the above devices, and the like.
An embodiment of the invention provides a fault detection method that can be applied to various system platforms. The execution subject of the method may be a computer terminal or a processor of any of various mobile devices. A flow chart of the method is shown in FIG. 1, and the method specifically includes the following steps:
S101: Send a first detection instruction to the node according to a preset detection period.
In the method provided by the embodiment of the invention, the node may be any node in an AI platform and may be used for training a neural network model; the detection period may be set by a technician according to actual conditions; the first detection instruction may be an lspci instruction.
S102: When GPU device information corresponding to the first detection instruction is received, send a second detection instruction to the node.
In the method provided by the embodiment of the invention, when the node responds to the first detection instruction, it feeds back the physical device information corresponding to that instruction. The physical device information includes GPU device information, which may include the bus identifier (BUSID) of each GPU of the node, the GPU manufacturer information, and the like.
Specifically, receiving GPU device information corresponding to the first detection instruction indicates that the node has a GPU card installed, so a second detection instruction must be sent to further check the state of the node's GPU cards; the second detection instruction may be an nvidia-smi instruction. If no GPU device information corresponding to the first detection instruction is received, the node has no GPU card installed.
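As an illustration of how BUSIDs might be extracted from the response to the first detection instruction, the following sketch parses text laid out like typical `lspci` output (bus address, device class, then vendor and device name). The function name `nvidia_busids`, the sample text and the vendor-string filter are assumptions made for this example and are not taken from the patent; a plain vendor-name filter would also pick up non-GPU NVIDIA functions, so a real script may need a stricter test.

```python
import re

# Bus address at the start of a line, with or without a PCI domain prefix
# (for example "3b:00.0" or "0000:3b:00.0").
_BUS_ADDR = re.compile(
    r"^((?:[0-9a-fA-F]{4}:)?[0-9a-fA-F]{2}:[0-9a-fA-F]{2}\.[0-7])\s"
)


def nvidia_busids(lspci_output: str) -> list[str]:
    """Return the lowercased bus addresses of lines that mention NVIDIA."""
    busids = []
    for line in lspci_output.splitlines():
        if "nvidia" not in line.lower():
            continue  # not an NVIDIA device line
        match = _BUS_ADDR.match(line)
        if match:
            busids.append(match.group(1).lower())
    return busids


# Illustrative sample text, not captured from a real node.
sample = (
    "00:1f.2 SATA controller: Example Vendor AHCI Controller\n"
    "3b:00.0 3D controller: NVIDIA Corporation Device 1db6 (rev a1)\n"
    "af:00.0 3D controller: NVIDIA Corporation Device 1db6 (rev a1)\n"
)
print(nvidia_busids(sample))
```

An empty result corresponds to the case described above in which the node is treated as having no GPU card.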
S103: When abnormal information corresponding to the second detection instruction is received, match the GPU device information against the device file information generated in advance.
In the method provided by the embodiment of the invention, if the information returned for the second detection instruction is null information or error information, the GPU device information is matched against the pre-generated device file information. The device file information records the identifiers of the GPUs installed in the node; specifically, each identifier may be a BUSID.
S104: If the GPU device information does not match the device file information, determine that the node has a GPU fault.
In the method provided by the embodiment of the invention, if any identifier in the device file information matches none of the BUSIDs in the GPU device information, the node has a GPU card-drop fault.
The fault detection method provided by the embodiment of the invention comprises: sending a first detection instruction to the node according to a preset detection period; when GPU device information corresponding to the first detection instruction is received, sending a second detection instruction to the node; when abnormal information corresponding to the second detection instruction is received, matching the GPU device information against device file information generated in advance; and if the GPU device information does not match the device file information, determining that the node has a GPU fault. Because the GPU device information is matched against pre-generated device file information, the method can determine whether a GPU card has failed and can find GPU card faults quickly.
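Steps S101 to S104 can be sketched as a single detection cycle. This is a minimal sketch under stated assumptions, not the patented implementation: the two probes are passed in as callables so that no real command is run, "abnormal information" is assumed to mean a non-zero exit status or empty output, and the names `detect_once`, `first_probe` and `second_probe` are hypothetical.

```python
from typing import Callable, Iterable, Optional


def detect_once(
    first_probe: Callable[[], list[str]],
    second_probe: Callable[[], tuple[int, str]],
    device_file_ids: Iterable[str],
) -> Optional[list[str]]:
    """Run one detection cycle.

    Returns None when the node reports no GPU devices, an empty list when
    no identifier is missing, or the sorted identifiers judged missing.
    """
    busids = first_probe()                  # S101: first detection instruction
    if not busids:
        return None                         # no GPU card on this node
    returncode, output = second_probe()     # S102: second detection instruction
    if returncode == 0 and output.strip():
        return []                           # normal response, nothing to match
    present = {b.lower() for b in busids}   # S103: match against the device file
    missing = {i.lower() for i in device_file_ids} - present
    return sorted(missing)                  # S104: non-empty means a GPU fault


# Stubbed probes: one GPU is visible on the bus, the second probe fails.
result = detect_once(
    first_probe=lambda: ["3b:00.0"],
    second_probe=lambda: (1, ""),
    device_file_ids=["3b:00.0", "af:00.0"],
)
print(result)  # ['af:00.0']
```

Running the cycle at the preset detection period is left to the caller, for example a scheduler that invokes the function repeatedly.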
In the fault detection method provided by the embodiment of the present invention, based on the above implementation process, the process of generating the device file information specifically includes:
judging whether the node meets a preset detection condition;
when the node meets the preset detection condition, sending a second detection instruction to the node to acquire initial device information of the node;
and performing format conversion on the initial device information according to a preset format conversion rule to obtain the device file information of the node.
In the method provided by the embodiment of the invention, judging whether the node meets the preset detection condition includes: judging whether the version information of the node's operating system meets a preset version requirement; if so, judging whether the first detection instruction and the second detection instruction are available; and determining that the node meets the preset detection condition when both instructions are available. When the version information of the node's operating system does not meet the preset version requirement, or the first detection instruction is unavailable, or the second detection instruction is unavailable, the node is determined not to meet the preset detection condition.
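The detection-condition check described above might look like the following sketch. It makes two assumptions that the patent does not state: the "preset version requirement" is treated as a minimum dotted version number, and "available" is approximated by the command being found on `PATH` via `shutil.which`, which is weaker than confirming the command actually runs. The function names are hypothetical.

```python
import shutil
from typing import Callable, Iterable, Optional


def parse_version(text: str) -> tuple[int, ...]:
    """Turn a dotted version such as '7.6' into a comparable tuple (7, 6)."""
    return tuple(int(part) for part in text.split("."))


def meets_detection_condition(
    os_version: str,
    minimum_version: str,
    commands: Iterable[str] = ("lspci", "nvidia-smi"),
    which: Callable[[str], Optional[str]] = shutil.which,
) -> bool:
    """Check the OS version first, then that every command can be located."""
    if parse_version(os_version) < parse_version(minimum_version):
        return False
    return all(which(cmd) is not None for cmd in commands)


# A stub lookup keeps the example independent of the local machine.
print(meets_detection_condition("7.6", "7.4", which=lambda cmd: "/usr/bin/" + cmd))
```

Passing the lookup function as a parameter is only a convenience for testing; by default the real `shutil.which` is used.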
Specifically, when the node meets the preset detection condition, a second detection instruction is sent to the node to obtain information about each GPU card of the node, which may include host, gpu_uuid, gpu_index and BUSID. The obtained information about each GPU card is recorded as the initial device information, and a specific operation is applied to each BUSID in the initial device information to obtain the device file information; the specific operation may be converting each BUSID to lowercase characters and storing it in hexadecimal form.
In the method provided by the embodiment of the invention, the device information of the node is obtained in advance so that it can be compared with the GPU device information obtained through the first detection instruction to judge whether the node has a GPU card-drop fault. This makes it convenient for the AI platform to reschedule tasks and prevents training tasks from being delayed.
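The format conversion can be illustrated with a short sketch. It assumes the initial device information arrives as comma-separated rows of GPU index, UUID and bus ID, which is the shape produced by a query such as `nvidia-smi --query-gpu=index,uuid,pci.bus_id --format=csv,noheader`. The JSON layout, the function names and the trimming of an 8-digit PCI domain to 4 digits (so that the value lines up with `lspci -D` style addresses) are choices made for this example, not details from the patent.

```python
import csv
import io
import json


def normalize_busid(busid: str) -> str:
    """Lowercase a bus ID and trim an 8-digit PCI domain to 4 digits."""
    parts = busid.strip().lower().split(":")
    if len(parts) == 3 and len(parts[0]) == 8:
        parts[0] = parts[0][-4:]
    return ":".join(parts)


def build_device_file(host: str, csv_text: str) -> str:
    """Convert 'index, uuid, bus id' rows into a JSON device file."""
    records = []
    for row in csv.reader(io.StringIO(csv_text), skipinitialspace=True):
        if len(row) != 3:
            continue  # skip blank or malformed rows
        gpu_index, gpu_uuid, busid = (field.strip() for field in row)
        records.append({
            "host": host,
            "gpu_index": int(gpu_index),
            "gpu_uuid": gpu_uuid,
            "busid": normalize_busid(busid),
        })
    return json.dumps(records, indent=2)


# Illustrative rows with placeholder UUIDs.
rows = "0, GPU-aaaa, 00000000:3B:00.0\n1, GPU-bbbb, 00000000:AF:00.0\n"
print(build_device_file("node1", rows))
```

Whatever normalization is chosen, the same rule has to be applied to the BUSIDs read later from the first detection instruction, otherwise the comparison would report false mismatches.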
In the fault detection method provided by the embodiment of the present invention, based on the above implementation process, matching the GPU device information against the device file information generated in advance may, as shown in FIG. 2, include:
S201: Acquire each identifier contained in the device file information and each bus identifier (BUSID) contained in the GPU device information.
In the method provided by the embodiment of the invention, the device file information is parsed to obtain each identifier, which may be a BUSID, and the GPU device information is parsed to obtain each BUSID it contains.
S202: Match each identifier against each BUSID.
In the method provided by the embodiment of the invention, each identifier in the pre-stored device file information is matched against each BUSID in the GPU device information acquired through the first detection instruction.
S203: If an identifier matches none of the BUSIDs, determine that the GPU corresponding to that identifier in the node has a card-drop fault.
In the method provided by the embodiment of the invention, if the device file information contains a BUSID that does not appear in the GPU device information, the GPU corresponding to that BUSID has a card-drop fault.
S204: If, for every identifier, a matching BUSID exists in the GPU device information, determine that the node has no GPU card-drop fault.
In the method provided by the embodiment of the invention, if a corresponding BUSID can be found in the GPU device information for every identifier in the device file information, the node has no GPU card-drop fault.
In the method provided by the embodiment of the invention, the BUSIDs in the device file information are compared with the BUSIDs obtained through the first detection instruction to judge whether the node has a GPU card-drop fault. The faulty GPU card can be identified quickly from its BUSID, which improves detection efficiency and facilitates subsequent repair work by technicians.
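Steps S201 to S204 amount to a set difference between the recorded identifiers and the BUSIDs currently visible. The sketch below shows that comparison; the function name `find_dropped_gpus` is hypothetical, and case-insensitive matching is an assumption based on the lowercase conversion applied when the device file is generated.

```python
from typing import Iterable


def find_dropped_gpus(file_ids: Iterable[str], live_busids: Iterable[str]) -> list[str]:
    """Return device-file identifiers that have no matching live BUSID."""
    live = {busid.strip().lower() for busid in live_busids}
    recorded = {ident.strip().lower() for ident in file_ids}
    return sorted(recorded - live)


recorded_ids = ["0000:3b:00.0", "0000:af:00.0"]

# S203: one recorded identifier is no longer visible on the bus.
print(find_dropped_gpus(recorded_ids, ["0000:3B:00.0"]))  # ['0000:af:00.0']

# S204: every recorded identifier has a matching BUSID.
print(find_dropped_gpus(recorded_ids, ["0000:af:00.0", "0000:3b:00.0"]))  # []
```

BUSIDs that are visible on the bus but absent from the device file are ignored here, because the text only treats missing recorded identifiers as faults.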
In the fault detection method provided by the embodiment of the present invention, based on the above implementation process, as shown in FIG. 3, the fault detection method further includes:
S301: Update the fault record corresponding to the node in a preset fault record file.
In the method provided by the embodiment of the invention, the GPU cards in the node that have a card-drop fault are recorded; each record includes the identifier of the GPU card, the fault time and the number of faults. One or more GPU cards may have a card-drop fault.
S302: Judge whether the updated fault record meets a preset alarm condition.
In the method provided by the embodiment of the invention, it is first judged whether this is the first fault of the GPU card. If it is, the updated fault record is determined to meet the preset alarm condition. If it is not, the historical alarm time of the GPU card closest to the current moment is obtained, and it is judged whether the interval between the current fault time and that historical alarm time is greater than a preset time interval threshold; if the interval is greater than the threshold, the updated fault record is determined to meet the preset alarm condition.
S303: When the updated fault record meets the preset alarm condition, call preset message middleware to perform an alarm operation.
In the method provided by the embodiment of the invention, when the alarm condition is met, alarm information corresponding to the GPU card that currently has a card-drop fault is generated, and the upper-layer service is notified through a RESTful interface or message middleware, so that the upper-layer service reschedules GPU resources after receiving the message.
By setting the alarm condition, the method provided by the embodiment of the invention avoids repeated alarms.
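The record update and the alarm condition of S301 and S302 can be sketched as follows. The `FaultRecord` fields mirror the items named in the text (identifier, fault time, number of faults) plus the most recent alarm time; the class, the function name and the use of seconds as the time unit are assumptions for illustration, and persistence to the fault record file is omitted.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class FaultRecord:
    gpu_id: str
    fault_count: int = 0
    last_fault_time: Optional[float] = None
    last_alarm_time: Optional[float] = None


def record_fault_and_check_alarm(
    record: FaultRecord, now: float, min_interval: float
) -> bool:
    """Update the record for a new fault and report whether to raise an alarm."""
    first_fault = record.fault_count == 0
    record.fault_count += 1            # S301: update the fault record
    record.last_fault_time = now
    # S302: alarm on the first fault, or when the latest alarm is older
    # than the preset interval threshold.
    should_alarm = first_fault or (
        record.last_alarm_time is None
        or now - record.last_alarm_time > min_interval
    )
    if should_alarm:
        record.last_alarm_time = now
    return should_alarm


rec = FaultRecord("0000:af:00.0")
print(record_fault_and_check_alarm(rec, now=1000.0, min_interval=600.0))  # True
print(record_fault_and_check_alarm(rec, now=1300.0, min_interval=600.0))  # False
print(record_fault_and_check_alarm(rec, now=1700.0, min_interval=600.0))  # True
```

The second call is suppressed because only 300 seconds have passed since the last alarm, which is the repeated-alarm case the alarm condition is meant to filter out.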
In the fault detection method provided by the embodiment of the present invention, based on the above implementation process, as shown in FIG. 4, the fault detection method further includes:
S401: Collect the running condition information of the node.
In the method provided by the embodiment of the invention, the temperature information of the node, the information about processes running on the node, and the like may be collected.
S402: Traverse a preset configuration file based on the running condition information to determine the fault cause.
In the method provided by the embodiment of the invention, the fault cause corresponding to the running condition information is obtained by traversing the preset configuration file, which may contain the fault cause corresponding to each kind of running condition information.
S403: Generate prompt information corresponding to the fault cause.
In the method provided by the embodiment of the invention, once the fault cause corresponding to the running condition is obtained, prompt information can be generated to tell the user the likely cause of the current GPU card fault.
By prompting the user with the fault cause, the method provided by the embodiment of the invention lets the user quickly find a solution and resolve the fault.
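The patent does not describe the layout of the configuration file, so the sketch below assumes one possible form: a list of rules, each naming a metric, a threshold and a cause. The metric names, thresholds and cause strings are invented for illustration only, as are the function names.

```python
from typing import Any

# Hypothetical configuration: each rule maps a metric threshold to a cause.
RULES: list[dict[str, Any]] = [
    {"metric": "gpu_temperature_c", "above": 90, "cause": "GPU overheating"},
    {"metric": "zombie_processes", "above": 0, "cause": "stuck processes holding the GPU"},
]


def find_fault_causes(status: dict[str, float], rules: list[dict[str, Any]]) -> list[str]:
    """Traverse the rules and collect every cause whose threshold is exceeded."""
    causes = []
    for rule in rules:
        value = status.get(rule["metric"])
        if value is not None and value > rule["above"]:
            causes.append(rule["cause"])
    return causes


def build_prompt(causes: list[str]) -> str:
    """Generate the prompt text shown to the user (S403)."""
    if not causes:
        return "No known fault cause matched the collected status."
    return "Possible fault causes: " + "; ".join(causes)


status = {"gpu_temperature_c": 95.0, "zombie_processes": 0}
print(build_prompt(find_fault_causes(status, RULES)))
```

Keeping the rules in data rather than in code matches the idea of a preset configuration file that can be extended without changing the detection script.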
In a specific application, the fault detection method provided by the embodiment of the invention can perform fault detection in the form of a script. FIG. 5 is a schematic diagram of a fault detection process provided by an embodiment of the present invention; the specific steps are as follows:
Step a1: obtain the GPU information of the node through the nvidia-smi command and store it in a local file to obtain the device file information of the node. In the specific implementation, environment detection is performed on the node first: after the node comes online, its installation environment is checked, including the system version, the NVIDIA driver version, and whether the nvidia-smi and lspci commands can be used normally. When the environment detection passes, basic GPU information including host, gpu_uuid, gpu_index and BUSID is acquired through the nvidia-smi command, and the BUSID is specially processed (converted to lowercase and stored in hexadecimal form).
Step a2: data comparison. In the specific implementation, the lspci command is used to acquire all PCI information related to NVIDIA GPUs, mainly the BUSIDs. If no information is acquired, the node is considered to have no GPU card and no fault notification is required.
The nvidia-smi command is then used to obtain the basic GPU information. If the command returns normally, no card has dropped and no fault notification is required.
When nvidia-smi acquires no GPU information or behaves abnormally, the BUSIDs acquired through lspci are compared with the information in the file. If a GPU recorded in the file is not found through the lspci command, that GPU card is considered lost; the fault information is recorded, written to the fault record file influxdb, and the upper-layer service is notified.
The number of notifications and the alarm time of each fault are recorded; when the same GPU card fails again within the unit time, the upper-layer service does not need to be notified.
The data comparison is packaged into a script, which is executed periodically through the system crontab.
Step a3: after a card-drop fault is determined, the upper-layer service is notified through a RESTful interface or message middleware, and the upper-layer service reschedules GPU resources after receiving the message.
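The notification in the last step could be structured as below. The message fields, the fault label `gpu_card_dropped` and both function names are assumptions for illustration; the patent does not define a message format. The transport is passed in as a callable so that the same message can be handed to an HTTP client or a message-middleware producer, neither of which is shown here.

```python
import json
import time
from typing import Callable, Iterable, Optional


def build_alarm_message(
    host: str,
    dropped_busids: Iterable[str],
    timestamp: Optional[float] = None,
) -> str:
    """Serialize one card-drop alarm as JSON (hypothetical field names)."""
    payload = {
        "host": host,
        "fault": "gpu_card_dropped",
        "busids": sorted(dropped_busids),
        "time": time.time() if timestamp is None else timestamp,
    }
    return json.dumps(payload, sort_keys=True)


def notify_upper_layer(message: str, send: Callable[[str], None]) -> None:
    """Hand the message to whichever transport the deployment uses."""
    send(message)


# An in-memory list stands in for the RESTful interface or message middleware.
outbox: list[str] = []
notify_upper_layer(
    build_alarm_message("node1", ["0000:af:00.0"], timestamp=0.0),
    outbox.append,
)
print(outbox[0])
```

Separating message construction from delivery keeps the detection script independent of the chosen transport.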
The specific implementation manners and the derivative processes of the implementation manners are all within the protection scope of the invention.
Corresponding to the method shown in FIG. 1, an embodiment of the present invention further provides a fault detection device for implementing that method. The fault detection device may be applied to a computer terminal or to various mobile devices. Its structural schematic diagram is shown in FIG. 6, and it specifically includes:
a first sending unit 501, configured to send a first detection instruction to a node according to a preset detection period;
a second sending unit 502, configured to send a second detection instruction to the node when GPU device information corresponding to the first detection instruction is received;
a matching unit 503, configured to match the GPU device information with device file information generated in advance when abnormal information corresponding to the second detection instruction is received;
a determining unit 504, configured to determine that the node has a GPU fault when the GPU device information does not match the device file information.
The fault detection device provided by the embodiment of the invention can send a first detection instruction to the node according to a preset detection period; when GPU device information corresponding to the first detection instruction is received, send a second detection instruction to the node; when abnormal information corresponding to the second detection instruction is received, match the GPU device information against device file information generated in advance; and if the GPU device information does not match the device file information, determine that the node has a GPU fault. Because the GPU device information is matched against pre-generated device file information, the device can determine whether a GPU card has failed and can find GPU card faults quickly.
The device provided by the embodiment of the invention further comprises:
a first judging unit, configured to judge whether the node meets a preset detection condition;
an obtaining unit, configured to send a second detection instruction to the node when the node meets a preset detection condition, so as to obtain initial equipment information of the node;
and the execution unit is used for carrying out format conversion on the initial equipment information according to a preset format conversion rule to obtain the equipment file information of the node.
In the apparatus provided by the embodiment of the present invention, the matching unit 503 includes:
the acquisition subunit is used for acquiring each identifier contained in the equipment file information and each bus identifier BUSID contained in the GPU equipment information;
a matching subunit, configured to match each identifier with each BUSID;
a first determining subunit, configured to determine, when there is an identifier that matches none of the BUSIDs, that the GPU corresponding to that identifier in the node has a card-drop fault;
and a second determining subunit, configured to determine that the node has no GPU card-drop fault when the GPU device information contains a BUSID matching each identifier.
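The matching performed by these subunits amounts to checking which identifiers recorded in the device file have no counterpart among the BUSIDs currently reported by the node. A minimal Python sketch, assuming the GPU device information is `lspci`-style text whose lines start with a PCI address and that GPU lines can be recognized by a vendor string (both assumptions; the function names are hypothetical):

```python
from typing import Iterable, List


def bus_ids_from_lspci(lspci_output: str, vendor: str = "NVIDIA") -> List[str]:
    """Collect the leading PCI address of every line mentioning the vendor."""
    ids = []
    for line in lspci_output.splitlines():
        fields = line.split()
        if fields and vendor.lower() in line.lower():
            ids.append(fields[0].lower())
    return ids


def dropped_identifiers(identifiers: Iterable[str],
                        bus_ids: Iterable[str]) -> List[str]:
    """Return the device-file identifiers that match none of the BUSIDs."""
    present = set(bus_ids)
    return [ident for ident in identifiers if ident not in present]
```

An empty result corresponds to the second determining subunit's outcome (no card-drop fault); any returned identifier names a GPU that was present when the device file was generated but is no longer visible.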
The device provided by the embodiment of the invention further comprises:
an updating unit, configured to update, in a preset fault record file, a fault record corresponding to the node;
a second judging unit, configured to judge whether the updated fault record meets a preset alarm condition;
and the calling unit is used for calling preset message middleware to carry out alarm operation when the updated fault record meets preset alarm conditions.
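The record-then-alarm behaviour of these three units can be sketched as below. The sketch makes several assumptions that the text does not confirm: the fault record file is a JSON file mapping node names to fault counts, the alarm condition is a simple count threshold, and `publish` is a hypothetical callable standing in for the message middleware client.

```python
import json
from pathlib import Path
from typing import Callable


def record_fault(record_path: Path, node: str, threshold: int,
                 publish: Callable[[str], None]) -> int:
    """Increment the node's fault count and alarm once it reaches the threshold."""
    # Load the existing records, or start fresh if the file does not exist yet.
    records = json.loads(record_path.read_text()) if record_path.exists() else {}
    records[node] = records.get(node, 0) + 1
    record_path.write_text(json.dumps(records))

    # Assumed alarm condition: the count has reached the preset threshold.
    if records[node] >= threshold:
        publish(f"GPU fault on {node}: {records[node]} occurrence(s)")
    return records[node]
```

Passing the publisher in as a callable keeps the sketch independent of any particular middleware; a real deployment would wrap its own client in that callable.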
The device provided by the embodiment of the invention further comprises:
the acquisition unit is used for acquiring the running condition information of the node;
the searching unit is used for traversing a preset configuration file according to the running condition information so as to determine a fault reason;
and the generating unit is used for generating prompt information corresponding to the fault reason.
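One way to picture the lookup of a fault cause is a traversal of keyword rules, as in the Python sketch below. The contents of the preset configuration file are not disclosed, so the rule format, the keywords, and the cause strings are all hypothetical placeholders, as are the function names.

```python
from typing import Dict, List, Optional

# Hypothetical configuration entries: a keyword expected in the running
# condition information and the fault cause it is taken to indicate.
RULES: List[Dict[str, str]] = [
    {"keyword": "power_lost", "cause": "GPU power supply interrupted"},
    {"keyword": "temperature_high", "cause": "GPU overheating"},
]


def find_cause(running_info: str, rules: List[Dict[str, str]]) -> Optional[str]:
    """Traverse the rules in order and return the first matching cause."""
    text = running_info.lower()
    for rule in rules:
        if rule["keyword"] in text:
            return rule["cause"]
    return None


def make_prompt(node: str, cause: Optional[str]) -> str:
    """Build the prompt information for an operator."""
    if cause is None:
        return f"{node}: GPU fault detected, cause unknown"
    return f"{node}: GPU fault detected, probable cause: {cause}"
```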
An embodiment of the invention also provides a storage medium comprising stored instructions; when the instructions run, the device in which the storage medium is located is controlled to execute the fault detection method described above.
An embodiment of the present invention further provides an electronic device, whose structure is shown schematically in Fig. 7. It includes a memory 601, one or more processors 603, and one or more instructions 602 that are stored in the memory 601 and configured to be executed by the one or more processors 603 to perform the following operations:
sending a first detection instruction to the node according to a preset detection period;
when GPU equipment information corresponding to the first detection instruction is received, a second detection instruction is sent to the node;
when abnormal information corresponding to the second detection instruction is received, matching the GPU equipment information with equipment file information generated in advance;
and if the GPU equipment information is not matched with the equipment file information, determining that the node has GPU faults.
It should be noted that the embodiments in this specification are described progressively: each embodiment focuses on how it differs from the others, and for parts that are identical or similar the embodiments may be referred to one another. Because the apparatus embodiments are substantially similar to the method embodiments, they are described relatively briefly; for the relevant points, refer to the description of the method embodiments.
Finally, it is further noted that relational terms such as first and second are used solely to distinguish one entity or action from another, without necessarily requiring or implying any actual relationship or order between such entities or actions. Moreover, the terms "comprises," "comprising," or any other variation thereof are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising a …" does not exclude the presence of other like elements in the process, method, article, or apparatus that comprises the element.
For convenience of description, the above device is described as being divided by function into separate units. Of course, when implementing the present invention, the functions of the units may be implemented in one or more pieces of software and/or hardware.
From the above description of embodiments, it will be apparent to those skilled in the art that the present invention may be implemented in software plus a necessary general hardware platform. Based on such understanding, the technical solution of the present invention may be embodied essentially or in a part contributing to the prior art in the form of a software product, which may be stored in a storage medium, such as a ROM/RAM, a magnetic disk, an optical disk, etc., including several instructions for causing a computer device (which may be a personal computer, a server, or a network device, etc.) to execute the method described in the embodiments or some parts of the embodiments of the present invention.
The fault detection method and device provided by the present invention have been described in detail above. Specific examples are used herein to illustrate the principles and embodiments of the invention, and the description of the embodiments is intended only to help in understanding the method of the invention and its core idea. At the same time, those skilled in the art may, in accordance with the idea of the invention, make changes to the specific embodiments and the scope of application. In view of the above, the content of this specification should not be construed as limiting the present invention.

Claims (6)

1. A fault detection method, comprising:
sending a first detection instruction to the node according to a preset detection period; the first detection instruction is used for detecting whether the GPU card is loaded in the node; the first detection instruction comprises an Lspci instruction;
when GPU equipment information corresponding to the first detection instruction is received, a second detection instruction is sent to the node; the second detection instruction is used for detecting the state of the GPU card; the second detection instruction comprises an nvidia_smi instruction;
when abnormal information corresponding to the second detection instruction is received, each identifier contained in the pre-generated equipment file information and each bus identifier BUSID contained in the GPU equipment information are obtained;
matching each identifier with each BUSID respectively;
if there is an identifier that matches none of the BUSIDs, determining that the GPU corresponding to that identifier in the node has a card-drop fault;
if the GPU equipment information contains a BUSID matching each identifier, determining that the node has no GPU card-drop fault;
collecting the running condition information of the node;
traversing a preset configuration file based on the running condition information to determine a fault cause;
and generating prompt information corresponding to the fault reason.
2. The method of claim 1, wherein the generating of the device file information comprises:
judging whether the node meets preset detection conditions or not;
when the node meets a preset detection condition, a second detection instruction is sent to the node so as to acquire initial equipment information of the node;
and carrying out format conversion on the initial equipment information according to a preset format conversion rule to obtain the equipment file information of the node.
3. The method as recited in claim 1, further comprising:
updating a fault record corresponding to the node in a preset fault record file;
judging whether the updated fault record meets a preset alarm condition or not;
and when the updated fault record meets the preset alarm condition, calling a preset message middleware to perform alarm operation.
4. A fault detection device, comprising:
the first sending unit is used for sending a first detection instruction to the node according to a preset detection period; the first detection instruction is used for detecting whether the GPU card is loaded in the node; the first detection instruction comprises an Lspci instruction;
the second sending unit is used for sending a second detection instruction to the node when GPU equipment information corresponding to the first detection instruction is received; the second detection instruction is used for detecting the state of the GPU card; the second detection instruction comprises an nvidia_smi instruction;
the matching unit is used for matching the GPU equipment information with equipment file information generated in advance when abnormal information corresponding to the second detection instruction is received;
the determining unit is used for determining that the node has GPU faults when the GPU equipment information is not matched with the equipment file information;
the matching unit includes:
the acquisition subunit is used for acquiring each identifier contained in the equipment file information and each bus identifier BUSID contained in the GPU equipment information;
a matching subunit, configured to match each identifier with each BUSID;
the first determining subunit is used for determining, when there is an identifier that matches none of the BUSIDs, that the GPU corresponding to that identifier in the node has a card-drop fault;
the second determining subunit is used for determining that the node has no GPU card-drop fault when the GPU equipment information contains a BUSID matching each identifier;
the acquisition unit is used for acquiring the running condition information of the node;
the searching unit is used for traversing a preset configuration file according to the running condition information so as to determine a fault reason;
and the generating unit is used for generating prompt information corresponding to the fault reason.
5. The apparatus as recited in claim 4, further comprising:
a first judging unit, configured to judge whether the node meets a preset detection condition;
an obtaining unit, configured to send a second detection instruction to the node when the node meets a preset detection condition, so as to obtain initial equipment information of the node;
and the execution unit is used for carrying out format conversion on the initial equipment information according to a preset format conversion rule to obtain the equipment file information of the node.
6. The apparatus as recited in claim 4, further comprising:
an updating unit, configured to update, in a preset fault record file, a fault record corresponding to the node;
the second judging unit is used for judging whether the updated fault record meets the preset alarm condition or not;
and the calling unit is used for calling preset message middleware to carry out alarm operation when the updated fault record meets preset alarm conditions.
CN201910783312.7A 2019-08-23 2019-08-23 Fault detection method and device Active CN110502399B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910783312.7A CN110502399B (en) 2019-08-23 2019-08-23 Fault detection method and device

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201910783312.7A CN110502399B (en) 2019-08-23 2019-08-23 Fault detection method and device

Publications (2)

Publication Number Publication Date
CN110502399A CN110502399A (en) 2019-11-26
CN110502399B true CN110502399B (en) 2023-09-01

Family

ID=68588972

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910783312.7A Active CN110502399B (en) 2019-08-23 2019-08-23 Fault detection method and device

Country Status (1)

Country Link
CN (1) CN110502399B (en)

Families Citing this family (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111400114A (en) * 2020-03-06 2020-07-10 湖南城市学院 Deep recursion network-based big data computer system fault detection method and system
CN111935727B (en) * 2020-07-10 2023-01-31 展讯半导体(成都)有限公司 Communication exception handling method, master node, indoor distribution system and storage medium
CN112988517A (en) * 2021-03-26 2021-06-18 山东英信计算机技术有限公司 GPU card-dropping monitoring method based on BMC

Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2013109722A (en) * 2011-11-24 2013-06-06 Toshiba Corp Computer, computer system and failure information management method
WO2017129117A1 (en) * 2016-01-29 2017-08-03 腾讯科技(深圳)有限公司 Gpu resource reconstruction method, user device, system, and storage medium
CN109213648A (en) * 2018-09-03 2019-01-15 郑州云海信息技术有限公司 RACK cabinet switching on and shutting down stability test method, apparatus, terminal and storage medium
CN109491871A (en) * 2018-11-20 2019-03-19 浪潮电子信息产业股份有限公司 A kind of equipment information acquiring method and device of GPU
CN109684144A (en) * 2018-12-26 2019-04-26 郑州云海信息技术有限公司 A kind of method and device of GPU-BOX system testing
US10325343B1 (en) * 2017-08-04 2019-06-18 EMC IP Holding Company LLC Topology aware grouping and provisioning of GPU resources in GPU-as-a-Service platform
CN109975688A (en) * 2019-03-25 2019-07-05 北京百度网讯科技有限公司 General evaluating method and device for heterogeneous chip

Family Cites Families (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105988918B (en) * 2015-02-26 2019-03-08 阿里巴巴集团控股有限公司 The method and apparatus for predicting GPU failure
US10552280B2 (en) * 2017-12-14 2020-02-04 Microsoft Technology Licensing, Llc In-band monitor in system management mode context for improved cloud platform availability

Also Published As

Publication number Publication date
CN110502399A (en) 2019-11-26

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant