CN118051393A

CN118051393A - Inspection method, apparatus, device, storage medium, and computer program product

Info

Publication number: CN118051393A
Application number: CN202410224218.9A
Authority: CN
Inventors: 孟召潮; 秦晓宁; 陈颖; 王添; 孙建旭
Original assignee: Ningchang Information Technology Hangzhou Co ltd
Current assignee: Ningchang Information Technology Hangzhou Co ltd
Priority date: 2024-02-28
Filing date: 2024-02-28
Publication date: 2024-05-17

Abstract

The application relates to an inspection method, apparatus, device, storage medium, and computer program product. In the method, the inspection device can inspect the working states of the graphics processors in a plurality of servers to be inspected at one time, and the inspection is carried out over a network, so that no staff need to inspect on site and inspection efficiency is greatly improved. In addition, compared with the traditional method of inspecting against inspection indexes, the method acquires the static information and the dynamic information of the graphics processor in real time, which reflects the working state of the graphics processor comprehensively and accurately and therefore improves inspection accuracy to a certain extent.

Description

Inspection method, apparatus, device, storage medium, and computer program product

Technical Field

The present application relates to the field of fault early warning technologies, and in particular, to a method, an apparatus, a device, a storage medium, and a computer program product for inspection.

Background

A graphics processor (Graphics Processing Unit, GPU for short) in a server is the computing component that processes relatively complex graphics or images. A GPU has more processing units and stronger parallel processing capability than an ordinary CPU. However, an individual GPU has a relatively high failure rate during high-speed operation, so fault inspection of the GPU is necessary to ensure its normal operation.

Currently, state inspection of a GPU is generally performed on site by staff based on inspection indexes, so that GPU faults can be discovered in time and on-site operation and maintenance can be carried out.

However, the above fault inspection method for the GPU has a problem of low efficiency.

Disclosure of Invention

In view of the foregoing, it is desirable to provide an inspection method, apparatus, device, storage medium, and computer program product that can improve inspection efficiency.

In a first aspect, the present application provides a method for inspecting, the method being applied to inspection equipment, the method comprising:

acquiring working state information of a graphic processor in each server to be inspected in a network;

and performing fault detection on each graphics processor according to the working state information to obtain a detection result corresponding to each server to be inspected.

According to the inspection method provided by the embodiment of the application, the working state information of the graphic processor in each server to be inspected in the network is obtained, and then fault detection is carried out on each graphic processor according to the working state information, so that the detection result corresponding to each server to be inspected is obtained. In the method, the inspection equipment can inspect the working states of the graphic processors in the plurality of servers to be inspected at one time, and the inspection work of the graphic processors can be realized through a network, so that the inspection work of workers is not needed on site, and the inspection efficiency can be greatly improved; in addition, compared with the traditional method for finishing inspection through inspection indexes, the method can comprehensively and accurately reflect the working state of the graphic processor by acquiring the static information and the dynamic information of the graphic processor in real time, so that the method can improve the inspection accuracy to a certain extent.

In one embodiment, acquiring working state information of a graphics processor in each server to be patrolled and examined in a network includes:

Executing an information acquisition script, and sending an information acquisition request to each server to be patrolled and examined;

And receiving the working state information returned by each server to be patrolled and examined.

According to the method provided by the embodiment of the application, the working state information of the graphic processor in each server to be patrolled and examined can be automatically acquired by executing the information acquisition script, and the servers do not need to be manually connected one by one to inquire, so that the data acquisition efficiency is improved. Moreover, by executing the information acquisition script, all the servers to be inspected can be ensured to acquire the information according to the same standard, and the problem of inconsistent data caused by manual operation errors is avoided.

In one embodiment, the working state information includes static information and dynamic information, and executing the information acquisition script and sending an information acquisition request to each server to be inspected includes:

Calling a static function in the information acquisition script, sending a static information acquisition request to each server to be patrolled and examined, and calling a dynamic function in the information acquisition script, and sending a dynamic information acquisition request to each server to be patrolled and examined;

The static acquisition request is used for acquiring static information, and the dynamic acquisition request is used for acquiring dynamic information.

The method of the embodiment of the application can automatically acquire the state information of the graphic processor by executing the static function and the dynamic function, and can comprehensively monitor various state information of the graphic processor by acquiring the static information and the dynamic information.

In one embodiment, invoking a static function in an information acquisition script, sending a static information acquisition request to each server to be patrolled and examined, and invoking a dynamic function in the information acquisition script, sending a dynamic information acquisition request to each server to be patrolled and examined, including:

calling a static function in the information acquisition script every first preset time, sending a static information acquisition request to each server to be patrolled and examined, and calling a dynamic function in the information acquisition script every second preset time, sending a dynamic information acquisition request to each server to be patrolled and examined;

Wherein the first preset time is greater than the second preset time.

According to the method provided by the embodiment of the application, static information is relatively stable and basically unchanged compared with dynamic information. Acquiring the static information less frequently than the dynamic information therefore preserves the comprehensiveness and timeliness of information acquisition while saving resources and avoiding unnecessary waste.

In one embodiment, fault detection is performed on each graphics processor according to each piece of working state information to obtain a detection result corresponding to each server to be patrolled and examined, including:

Executing an information analysis script, and carrying out cluster analysis on each piece of working state information to obtain an analysis result;

And carrying out fault detection on each graphic processor according to the analysis result corresponding to each piece of working state information to obtain the detection result corresponding to each server to be patrolled and examined.

According to the method provided by the embodiment of the application, the abnormal working state can be identified by carrying out cluster analysis on the working state information, and then based on the result of the cluster analysis, fault detection can be rapidly carried out on each graphic processor, so that the inspection efficiency is improved.

In one embodiment, the working state information includes static information and dynamic information, and the clustering analysis is performed on each working state information to obtain an analysis result, including:

for each operating state information, a first distance between static information and static standard information is determined, and a second distance between dynamic information and dynamic standard information is determined.

In one embodiment, according to an analysis result corresponding to each working state information, fault detection is performed on each graphics processor to obtain a detection result corresponding to each server to be patrolled and examined, including:

for each piece of working state information, determining whether a first distance corresponding to the working state information is smaller than a static distance threshold value, and determining whether a second distance corresponding to the working state information is smaller than a dynamic distance threshold value;

if the first distance is smaller than the static distance threshold and the second distance is smaller than the dynamic distance threshold, determining that the detection result indicates that the graphics processor in the server to be inspected is in a healthy state;

and if the first distance is not smaller than the static distance threshold or the second distance is not smaller than the dynamic distance threshold, determining that the detection result indicates that the graphics processor in the server to be inspected is in an unhealthy state.

According to the method provided by the embodiment of the application, the health state of the graphic processor is analyzed by the distance between each piece of working state information and the standard information, so that the data analysis efficiency can be improved, and the inspection efficiency can be further improved.
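The two-distance health check described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the application's own script: the source does not say which distance metric is used, so Euclidean distance is assumed here, and the function names `euclidean` and `is_healthy`, as well as the numeric encoding of static information, are hypothetical.

```python
import math
from typing import Sequence


def euclidean(a: Sequence[float], b: Sequence[float]) -> float:
    """Euclidean distance between two equal-length numeric vectors (assumed metric)."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def is_healthy(
    static_vec: Sequence[float],
    dynamic_vec: Sequence[float],
    static_standard: Sequence[float],
    dynamic_standard: Sequence[float],
    static_threshold: float,
    dynamic_threshold: float,
) -> bool:
    """Return True only if both distances are strictly below their thresholds."""
    # First distance: static information vs. static standard information.
    first_distance = euclidean(static_vec, static_standard)
    # Second distance: dynamic information vs. dynamic standard information.
    second_distance = euclidean(dynamic_vec, dynamic_standard)
    return first_distance < static_threshold and second_distance < dynamic_threshold
```

A distance exactly equal to its threshold counts as "not smaller", so the graphics processor is reported as unhealthy in that case, matching the wording of the embodiment.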

In a second aspect, the present application also provides a patrol device, which includes:

The acquisition module is used for acquiring the working state information of the graphic processor in each server to be inspected in the network;

And the detection module is used for carrying out fault detection on each graphic processor according to the working state information to obtain detection results corresponding to each server to be patrolled and examined.

In a third aspect, the present application also provides a computer device comprising a memory and a processor, the memory storing a computer program, the processor implementing the following steps when executing the computer program:

In a fourth aspect, the present application also provides a computer readable storage medium having stored thereon a computer program which when executed by a processor performs the steps of:

In a fifth aspect, the application also provides a computer program product comprising a computer program which, when executed by a processor, performs the steps of:

The inspection method, the inspection device, the equipment, the storage medium and the computer program product are characterized in that the inspection method obtains the working state information of the graphic processor in each server to be inspected in the network, and then performs fault detection on each graphic processor according to the working state information to obtain the detection result corresponding to each server to be inspected. In the method, the inspection equipment can inspect the working states of the graphic processors in the plurality of servers to be inspected at one time, and the inspection work of the graphic processors can be realized through a network, so that the inspection work of workers is not needed on site, and the inspection efficiency can be greatly improved; in addition, compared with the traditional method for finishing inspection through inspection indexes, the method can comprehensively and accurately reflect the working state of the graphic processor by acquiring the static information and the dynamic information of the graphic processor in real time, so that the method can improve the inspection accuracy to a certain extent.

Drawings

FIG. 1 is a schematic diagram of a patrol system according to an embodiment;

FIG. 2 is a schematic flow chart of an inspection method in one embodiment;

FIG. 3 is a flow chart of a method of inspection according to another embodiment;

FIG. 4 is a schematic flow chart of a method of inspection according to another embodiment;

FIG. 5 is a schematic flow chart of a method of inspection according to another embodiment;

FIG. 6 is a schematic flow chart of a method of inspection according to another embodiment;

FIG. 7 is a block diagram of an inspection apparatus according to one embodiment;

FIG. 8 is a block diagram of an inspection apparatus according to another embodiment;

FIG. 9 is a block diagram of an inspection apparatus according to another embodiment;

FIG. 10 is an internal structural view of a computer device in one embodiment.

Detailed Description

The present application will be described in further detail with reference to the drawings and examples, in order to make the objects, technical solutions and advantages of the present application more apparent. It should be understood that the specific embodiments described herein are for purposes of illustration only and are not intended to limit the scope of the application.

A graphics processor (Graphics Processing Unit, GPU for short) in a server is the computing component that processes relatively complex graphics or images. A GPU has more processing units and stronger parallel processing capability than an ordinary CPU. However, an individual GPU has a relatively high failure rate during high-speed operation, so fault inspection of the GPU is necessary to ensure its normal operation. Currently, state inspection of a GPU is generally performed on site by staff based on inspection indexes, so that GPU faults can be discovered in time and on-site operation and maintenance can be carried out. However, this fault inspection method has the problem of low efficiency. The application provides an inspection method that aims to solve this technical problem, and the following embodiments describe the inspection method in detail.

The inspection method provided by the embodiment of the application can be applied to the inspection network system shown in fig. 1. The inspection network system includes an inspection device 01 and a plurality of servers 02 to be inspected, which can be connected in a wired or wireless manner; for example, the inspection device 01 can be arranged in a local area network. The inspection device 01 is used for inspecting the graphics processors in the servers 02 to be inspected. The inspection device 01 may be any of various personal computers, notebook computers, smart phones, tablet computers, Internet of Things devices, and portable wearable devices; alternatively, the inspection device 01 may be a server. The server 02 to be inspected may likewise be any of various personal computers, notebook computers, smart phones, tablet computers, Internet of Things devices, and portable wearable devices; alternatively, the server 02 to be inspected may be a server.

It will be appreciated by those skilled in the art that the structure shown in fig. 1 is merely a block diagram of a portion of the structure associated with the present inventive arrangements and is not limiting of the patrol network system to which the present inventive arrangements are applied, and that a particular patrol network system may include more or fewer components than shown, or may combine certain components, or may have a different arrangement of components.

In one embodiment, as shown in fig. 2, a patrol method is provided, and the patrol method is applied to the patrol device in fig. 1, and the patrol method includes the following steps:

S101, working state information of a graphic processor in each server to be inspected in the network is obtained.

The network may be a local area network or another network, and the servers in the network can communicate with each other. The working state information reflects the working state of the graphics processor and includes static information and dynamic information. The static information includes at least one of the following items of the graphics processor: serial number (serial_number), product name (product_name), part number (part_number), globally unique identifier (GUID), firmware version (firmware_version), vendor identifier (vendor_id), sub-vendor identifier (sub-vendor_id), character string or code of the device (device_id), and character string or code of the sub-device (sub-device_id). The dynamic information includes at least one of: device state (drive_state), temperature value (temperature), power state (power), current memory utilization (current_memory_utilization), current resource utilization (current_sm_utilization), external power state (external_power), disabled page state for single-bit errors (retire_page_sbe), disabled page state for double-bit errors (retire_page_dbe), current error-correcting memory state (current_ecc_state), current multi-instance state (current_mig_state), load utilization time (current_utilization_time), resource utilization time (sm_utilization_time), current graphics frequency (current_graphics_clock), current memory frequency (current_memory_clock), high-speed connection channel (NVLink) state (nvlink_status), row remapping state (row_remapping_state), link speed, link width, reset state, and alert state.

In the embodiment of the application, the servers to be inspected and the inspection device can be placed in one network in advance, and the software used for inspection is installed on the inspection device. For example, the inspection device and the servers to be inspected are connected in a local area network, and a Linux system, the ipmitool tool, and a Python environment are installed on the inspection device. Before the inspection device is used for inspection, it is necessary to determine whether each server to be inspected can communicate with the inspection device, and a server is inspected only when it can. The number of servers to be inspected that are deployed at one time can be adjusted as required or according to the performance of the inspection device: it can be increased when the performance of the inspection device is good and reduced when the performance is poor. During inspection, the inspection device can acquire, in real time and over the network, the working state information of the graphics processor in one or more servers to be inspected in the same network. Specifically, the working state information of the graphics processor can be obtained by entering a command line. Alternatively, it may be obtained by calling an interface or by using monitoring software.

S102, fault detection is carried out on each graphic processor according to the working state information, and detection results corresponding to each server to be patrolled and examined are obtained.

The detection result includes that the graphics processor is in a healthy state (normal state) or the graphics processor is in an unhealthy state (abnormal state).

In the embodiment of the application, after the inspection equipment acquires the working state information of the graphic processor in each server to be inspected, the inspection equipment can directly analyze whether the working state information is abnormal or not, further analyze whether the graphic processor is in a health state or not, and acquire the detection result corresponding to each server to be inspected. For example, comparing each working state information with the normal working state information, and performing fault detection on each graphic processor according to the comparison result to obtain a detection result corresponding to each server to be patrolled and examined. Optionally, fault detection can be performed on each graphics processor through each piece of working state information and a corresponding preset threshold value, so as to obtain a detection result corresponding to each server to be patrolled and examined. Optionally, after the inspection device obtains the working state information of the graphics processor in each server to be inspected, each working state information may be saved in a designated file, and when the working state information needs to be analyzed, the working state information may be obtained from the file. Optionally, after the inspection device obtains the working state information of the graphics processor in each server to be inspected, the inspection device may further process each working state information, for example, perform alignment processing in a time stamp manner, and then perform fault detection on the working state information after alignment processing to obtain a detection result corresponding to each server to be inspected.
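The optional variant above, in which each piece of working state information is checked against a corresponding preset threshold, might look like the following sketch. The threshold values, the choice of metrics, and the function name `detect_faults` are illustrative assumptions; the source does not give concrete limits.

```python
from typing import Dict

# Hypothetical upper limits per metric; the source does not specify values.
PRESET_THRESHOLDS: Dict[str, float] = {
    "temperature": 85.0,
    "current_memory_utilization": 0.95,
}


def detect_faults(
    samples: Dict[str, Dict[str, float]],
    thresholds: Dict[str, float],
) -> Dict[str, str]:
    """Map each server to 'healthy' or 'unhealthy' from its reported metrics."""
    results: Dict[str, str] = {}
    for server, metrics in samples.items():
        # A metric counts as abnormal when it is reported and exceeds its limit.
        exceeded = [
            name
            for name, limit in thresholds.items()
            if name in metrics and metrics[name] > limit
        ]
        results[server] = "unhealthy" if exceeded else "healthy"
    return results
```

Metrics that a server does not report are simply skipped in this sketch; a stricter policy could treat a missing metric as a fault.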

According to the inspection method provided by the embodiment of the application, the working state information of the graphic processor in each server to be inspected in the network is obtained, and then fault detection is carried out on each graphic processor according to the working state information, so that the detection result corresponding to each server to be inspected is obtained. In the method, the inspection equipment can inspect the working states of the graphic processors in the plurality of servers to be inspected at one time, and the inspection work of the graphic processors can be realized through a network, so that the inspection work of workers is not needed on site, and the inspection efficiency can be greatly improved; in addition, compared with the traditional method for finishing inspection through inspection indexes, the method can comprehensively and accurately reflect the working state of the graphic processor by acquiring the static information and the dynamic information of the graphic processor in real time, so that the method can improve the inspection accuracy to a certain extent. Moreover, the method provides the inspection equipment which is specially used for inspecting the graphics processor of each equipment in the network, can realize the inspection method at regular intervals or according to commands, and can achieve the aim of high-efficiency inspection when facing the inspection task of a large number of network equipment.

In an embodiment, a specific implementation manner of obtaining the working state information of the graphics processor is further provided, as shown in fig. 3, where "obtaining the working state information of the graphics processor in each server to be patrolled and examined in the network" in step S101 includes:

S201, executing an information acquisition script, and sending an information acquisition request to each server to be patrolled and examined.

The information acquisition script is a script written in advance using Linux commands and is used for acquiring the working state information of each server to be inspected. The information acquisition request is used for acquiring the working state information of the graphics processor in each server to be inspected.

In the embodiment of the application, an information acquisition script that can run with the ipmitool tool can be written in advance and stored on the inspection device under a preset path. When the inspection device determines that the servers to be inspected are reachable, it can call and execute the information acquisition script with ipmitool. While executing the script, the inspection device generates an information acquisition request according to the content indicated by the script and then sends the request to the servers to be inspected through the network. Optionally, the information acquisition request may be sent to all servers to be inspected at one time. Optionally, the information acquisition request may be sent to the servers to be inspected in order of connection time. Optionally, the information acquisition request may be sent to the servers to be inspected in order of their device numbers. Optionally, the information acquisition request may also be sent to the servers to be inspected in a preset order.

S202, receiving the working state information returned by each server to be patrolled and examined.

In the embodiment of the application, after each server to be inspected receives the information acquisition request sent by the inspection equipment, the information acquisition request can be analyzed, the working state information to be acquired by the inspection equipment is analyzed, then the working state information of each server to be inspected is sent to the inspection equipment, and the inspection equipment can receive the working state information returned by each server to be inspected.
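Steps S201 and S202 amount to sending one request per server and gathering the replies. The sketch below illustrates the "send to all servers at one time" option with a thread pool. The `fetch` callable stands in for whatever actually queries a server (it is a hypothetical parameter, not something the source defines), and recording a failure as an `error` entry is an assumption of this sketch.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Dict, Iterable


def collect_all(
    servers: Iterable[str],
    fetch: Callable[[str], dict],
    max_workers: int = 8,
) -> Dict[str, dict]:
    """Query every server concurrently and return one result per server."""
    server_list = list(servers)
    results: Dict[str, dict] = {}
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # Submit one request per server, keeping the submission order.
        futures = {server: pool.submit(fetch, server) for server in server_list}
        for server, future in futures.items():
            try:
                results[server] = future.result()
            except Exception as exc:
                # One unreachable server should not abort the whole inspection.
                results[server] = {"error": str(exc)}
    return results
```

Sending requests in connection-time order or device-number order, as the embodiment also allows, would only require sorting `server_list` before submission.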

In one embodiment, the working state information includes static information and dynamic information, and on the basis of the static information and the dynamic information, a specific implementation manner of sending an information acquisition request is further provided, where the "executing an information acquisition script in step S201 sends an information acquisition request to each server to be patrolled and examined" includes: calling a static function in the information acquisition script, sending a static information acquisition request to each server to be patrolled and examined, and calling a dynamic function in the information acquisition script, and sending a dynamic information acquisition request to each server to be patrolled and examined.

The working state information includes static information and dynamic information. The information acquisition request includes a static information acquisition request for acquiring the static information and a dynamic information acquisition request for acquiring the dynamic information. The information acquisition script includes a static function and a dynamic function; in the embodiment of the application, the static function may be named the static_info_messages function and the dynamic function may be named the dynamic_info_messages function. Optionally, the information acquisition script may further include a collection function, which may be named the data_collect function. Optionally, the information acquisition script may further include a main function, which may be named the main function and which handles the execution order and overall execution logic of the static function, the dynamic function, and the collection function. The instruction format of the static function and the dynamic function may be: ipmitool -H (management IP address of the server to be inspected) -I lanplus -U (login user name of the server to be inspected) -P (login password of the server to be inspected), followed by the specific information instruction to be run against the server to be inspected. The inspection device may access each server to be inspected based on a user name access list file in a fixed format (in the embodiment of the application, the file may be named ip_user_passwd.log). The file stores the login information of the servers to be inspected, one server per line, in the format IP: 1.1.1.1; User: admin; Passwd: 111111. For example, the entries for 4 servers to be inspected are as follows, and entries for further servers follow the same pattern.

IP: 1.1.1.1; User: Admin0; Passwd: 111111

IP: 1.1.1.2; User: Admin1; Passwd: 111112

IP: 1.1.1.3; User: Admin2; Passwd: 111113

IP: 1.1.1.4; User: Admin3; Passwd: 111114
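As an illustration of how such an access list could be consumed, the sketch below parses the entries and builds an ipmitool argument list for each server. The helper names `parse_access_list` and `build_ipmitool_args` are hypothetical. The options `-H`, `-I lanplus`, `-U`, and `-P` are standard ipmitool options; the trailing command is a placeholder because the source does not name the specific information instruction. The parser accepts both ASCII and full-width separators as a precaution.

```python
import re
from typing import Dict, List

# One entry per line: IP, User, Passwd, separated by ';' or '；',
# with ':' or '：' between each key and its value.
_ENTRY = re.compile(
    r"IP\s*[:：]\s*(?P<ip>[^;；\s]+)\s*[;；]\s*"
    r"User\s*[:：]\s*(?P<user>[^;；\s]+)\s*[;；]\s*"
    r"Passwd\s*[:：]\s*(?P<passwd>\S+)",
    re.IGNORECASE,
)


def parse_access_list(text: str) -> List[Dict[str, str]]:
    """Return one {'ip', 'user', 'passwd'} dict per well-formed line."""
    entries: List[Dict[str, str]] = []
    for line in text.splitlines():
        match = _ENTRY.search(line)
        if match:  # blank or malformed lines are skipped
            entries.append(match.groupdict())
    return entries


def build_ipmitool_args(entry: Dict[str, str], command: List[str]) -> List[str]:
    """Build the argument list for one remote ipmitool call (not executed here)."""
    return [
        "ipmitool",
        "-H", entry["ip"],      # management IP address
        "-I", "lanplus",        # IPMI v2.0 LAN interface
        "-U", entry["user"],    # login user name
        "-P", entry["passwd"],  # login password
        *command,               # placeholder for the specific information instruction
    ]
```

The resulting list could be passed to `subprocess.run(args, capture_output=True, text=True)` on a machine where ipmitool is installed; building the list separately keeps the parsing logic testable without network access.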

In the embodiment of the application, when the inspection equipment executes the information acquisition script, the static function in the information acquisition script can be called, the static function is executed to send a static information acquisition request to each server to be inspected, the dynamic function in the information acquisition script can be called, and the dynamic function is executed to send a dynamic information acquisition request to each server to be inspected. Optionally, after the inspection device obtains the static information and the dynamic information by using the static function and the dynamic function, the collection function may be used to save the static information and the dynamic information into a preset folder, for example, save the data into an intermediate file data.csv file. Alternatively, the static information acquisition request and the dynamic information acquisition request may be simultaneously sent to each server to be patrolled and examined, and the static information acquisition request may be sent first, or the dynamic information acquisition request may be sent first. Alternatively, the frequencies of sending the static information acquisition request and the dynamic information acquisition request may be the same or different.
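The collection step can be illustrated with a small sketch that writes the gathered records to CSV. The function is named `data_collect` after the collection function mentioned in the embodiment, but its signature and the row layout (one dict per record) are assumptions made for illustration.

```python
import csv
import io
from typing import Dict, Iterable, List, TextIO


def data_collect(rows: Iterable[Dict[str, object]], out: TextIO) -> None:
    """Write static and dynamic records to `out` as CSV with a unified header."""
    row_list = list(rows)
    # Build the header from every key seen, in first-seen order, so that
    # static-only and dynamic-only records can share one file.
    fieldnames: List[str] = []
    for row in row_list:
        for key in row:
            if key not in fieldnames:
                fieldnames.append(key)
    writer = csv.DictWriter(out, fieldnames=fieldnames, restval="")
    writer.writeheader()
    writer.writerows(row_list)


if __name__ == "__main__":
    buffer = io.StringIO()
    data_collect(
        [
            {"ip": "1.1.1.1", "serial_number": "SN0"},
            {"ip": "1.1.1.1", "temperature": 55},
        ],
        buffer,
    )
    print(buffer.getvalue())
```

To produce the intermediate file mentioned above, the same function can be called with `open("data.csv", "w", newline="")` as the output stream.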

In an embodiment, on the basis of the above embodiment, a specific implementation of executing the information acquisition script is further provided. The step of "calling a static function in the information acquisition script, sending a static information acquisition request to each server to be inspected, and calling a dynamic function in the information acquisition script, sending a dynamic information acquisition request to each server to be inspected" includes: calling the static function in the information acquisition script every first preset time and sending a static information acquisition request to each server to be inspected, and calling the dynamic function in the information acquisition script every second preset time and sending a dynamic information acquisition request to each server to be inspected.

Wherein the first preset time is greater than the second preset time.

In the embodiment of the application, when the inspection device executes the information acquisition script, the static function in the information acquisition script can be called every first preset time to send a static information acquisition request to each server to be inspected; for example, the static function can be called every 60 minutes to acquire the static information. The dynamic function in the information acquisition script can be called every second preset time to send a dynamic information acquisition request to each server to be inspected; for example, the dynamic function can be called every 6 minutes to acquire the dynamic information. Optionally, after the inspection device obtains the static information and the dynamic information by using the static function and the dynamic function, the static information and the dynamic information are stored in a designated array and then saved to a preset folder by using the collection function, for example, saved to the intermediate file data.csv.
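The two acquisition intervals can be sketched as a simple due-check evaluated once per minute. The 60-minute and 6-minute values come from the example in this embodiment; the function name `due_calls` and the minute-tick model are assumptions made for illustration.

```python
from typing import List

STATIC_INTERVAL_MIN = 60   # example interval for static information
DYNAMIC_INTERVAL_MIN = 6   # example interval for dynamic information


def due_calls(
    minute: int,
    static_interval: int = STATIC_INTERVAL_MIN,
    dynamic_interval: int = DYNAMIC_INTERVAL_MIN,
) -> List[str]:
    """Return which acquisition functions are due at the given elapsed minute."""
    calls: List[str] = []
    if minute % static_interval == 0:
        calls.append("static")
    if minute % dynamic_interval == 0:
        calls.append("dynamic")
    return calls


if __name__ == "__main__":
    # Over two hours, static information is requested far less often.
    static_count = sum("static" in due_calls(m) for m in range(120))
    dynamic_count = sum("dynamic" in due_calls(m) for m in range(120))
    print(static_count, dynamic_count)
```

A real loop would call the static and dynamic functions whenever they appear in the returned list and then sleep until the next minute; keeping the due-check pure makes the schedule easy to verify.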

In an embodiment, a specific implementation for obtaining the detection result corresponding to each server to be inspected is further provided. As shown in fig. 4, step S102 of performing fault detection on each graphics processor according to each piece of working state information to obtain the detection result corresponding to each server to be inspected includes:

S301, executing an information analysis script, and performing cluster analysis on each piece of working state information to obtain an analysis result.

The information analysis script is written in advance using Linux code instructions and is used for analyzing the working state information of the graphics processor in each server to be inspected, so as to determine whether the graphics processor is in a healthy state. The information analysis script includes an average function (the data_mean function) and a center function (the Center point function). The average function is used for calculating the average value of the standard data, and the center function is used for calculating the center point of the standard data, where the standard data is the data of a server to be inspected that is in a healthy state.

The analysis result is used for representing the relation between the current working state information of each server to be inspected and the healthy working state information of each server to be inspected.

In the embodiment of the application, after the inspection equipment acquires the working state information of the graphics processor in each server to be inspected, it can input each piece of working state information into a preset clustering algorithm, for example a k-means clustering algorithm, and perform cluster analysis on each piece of working state information through that algorithm to obtain an analysis result. Optionally, each piece of working state information can instead be input into a preset clustering model, which performs the cluster analysis to obtain the analysis result.
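As an illustration of the cluster analysis step, the following is a minimal plain-Python sketch of k-means (Lloyd's iteration). It is an assumption-laden example rather than the information analysis script itself: the function names, the two-feature samples (imagined here as temperature and power readings), and the choice of the first k points as initial centers are all hypothetical simplifications.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two equal-length numeric sequences."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def nearest(point, centers):
    """Index of the center closest to the given point."""
    return min(range(len(centers)), key=lambda i: euclidean(point, centers[i]))


def kmeans(points, k, iterations=20):
    """Cluster points into k groups; returns (centers, labels)."""
    # Simplified, deterministic initialization: use the first k points.
    centers = [list(p) for p in points[:k]]
    for _ in range(iterations):
        clusters = [[] for _ in range(k)]
        for p in points:
            clusters[nearest(p, centers)].append(p)
        for i, members in enumerate(clusters):
            if members:  # keep the old center if a cluster is empty
                centers[i] = [sum(col) / len(members) for col in zip(*members)]
    labels = [nearest(p, centers) for p in points]
    return centers, labels


# Hypothetical working state samples: (temperature, power).
samples = [(40, 100), (85, 300), (42, 105), (41, 98), (86, 310), (84, 295)]
centers, labels = kmeans(samples, k=2)
```

With these sample values the points separate into one group near (41, 101) and one group near (85, 302), which is the kind of relation to healthy data that the analysis result is meant to capture.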

S302, performing fault detection on each graphics processor according to the analysis result corresponding to each piece of working state information, to obtain the detection result corresponding to each server to be inspected.

In the embodiment of the application, after the inspection equipment acquires the analysis results corresponding to the working state information of the graphics processor in each server to be inspected, it can directly analyze whether each analysis result is abnormal, thereby determining whether the graphics processor is in a healthy state and obtaining the detection result corresponding to each server to be inspected. For example, the current analysis results are compared with the analysis results obtained during normal operation, and fault detection is performed on each graphics processor according to the comparison results to obtain the detection result corresponding to each server to be inspected. Optionally, fault detection can be performed on each graphics processor by comparing each analysis result with a corresponding preset threshold, so as to obtain the detection result corresponding to each server to be inspected.

In an embodiment, the working state information in the foregoing embodiment includes static information and dynamic information. On this basis, a specific implementation for obtaining the analysis result is further provided, in which "performing cluster analysis on each piece of working state information to obtain an analysis result" in step S301 includes: for each piece of working state information, determining a first distance between the static information and static standard information, and determining a second distance between the dynamic information and dynamic standard information.

The first distance reflects the difference between the static information and the static standard information: the larger the first distance, the larger that difference, that is, the worse the health state. The second distance reflects the difference between the dynamic information and the dynamic standard information: the larger the second distance, the larger that difference, that is, the worse the health state.

In the embodiment of the application, after the inspection equipment acquires the working state information of the graphics processor in each server to be inspected, it can execute the information analysis script data_k-means.py, which determines, for each piece of working state information, a first distance between the static information and the static standard information and a second distance between the dynamic information and the dynamic standard information. Specifically, the inspection equipment can read the intermediate file data.csv and then execute the information analysis script, which performs cluster analysis on the real-time static information and dynamic information in data.csv by using the k-means clustering algorithm and automatically calculates the distance between each piece of static information or dynamic information and the corresponding average value point.
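The distance calculation can be illustrated with a short Python sketch. It assumes, purely for illustration, that static and dynamic information have already been encoded as numeric tuples and that the distance is Euclidean; the feature meanings, sample values, and function names are hypothetical and are not taken from data_k-means.py.

```python
import math


def mean_point(samples):
    """Average value point of a list of equal-length numeric samples."""
    return [sum(col) / len(samples) for col in zip(*samples)]


def distance(sample, reference):
    """Euclidean distance between a sample and a reference point."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(sample, reference)))


# Hypothetical healthy ("standard") data collected beforehand.
static_standard = [(8, 16), (8, 16), (8, 16)]        # e.g. GPU count, link width
dynamic_standard = [(40, 100), (42, 105), (41, 98)]  # e.g. temperature, power

# Hypothetical real-time working state information of one server.
static_now = (8, 16)
dynamic_now = (44, 105)

# First distance: static information vs. static standard information.
first_distance = distance(static_now, mean_point(static_standard))
# Second distance: dynamic information vs. dynamic standard information.
second_distance = distance(dynamic_now, mean_point(dynamic_standard))
```

Here the static information matches the standard exactly, so the first distance is 0, while the dynamic sample lies 5 units from the average value point (41, 101).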

In an embodiment, a specific implementation for obtaining the detection result corresponding to a server to be inspected is further provided. As shown in fig. 5, step S302 of performing fault detection on each graphics processor according to the analysis result corresponding to each piece of working state information, to obtain the detection result corresponding to each server to be inspected, includes:

S401, for each piece of working state information, determining whether a first distance corresponding to the working state information is smaller than a static distance threshold value, and determining whether a second distance corresponding to the working state information is smaller than a dynamic distance threshold value.

The static distance threshold represents the distance from the average value of the static standard information to the center point of the static standard information; it serves as a threshold for measuring the health state of the server to be inspected and can be set to 0. The dynamic distance threshold represents the distance from the average value of the dynamic standard information to the center point of the dynamic standard information; it likewise serves as a threshold for measuring the health state of the server to be inspected, can be determined from the health data, and can be quantified and adjusted according to the actual working state of the GPU.

In the embodiment of the application, after the first distance and the second distance of the graphics processor in each server to be inspected are obtained, the inspection device can determine, for each server to be inspected, whether the first distance is smaller than a static distance threshold value and whether the second distance is smaller than a dynamic distance threshold value.

S402, if the first distance is smaller than the static distance threshold and the second distance is smaller than the dynamic distance threshold, determining that the detection result indicates that the graphics processor in the server to be inspected is in a healthy state.

In the embodiment of the application, if the first distance is smaller than the static distance threshold and the second distance is smaller than the dynamic distance threshold, the inspection equipment determines that the detection result indicates that the graphics processor in the server to be inspected is in a healthy state.

S403, if the first distance is not smaller than the static distance threshold or the second distance is not smaller than the dynamic distance threshold, determining that the detection result indicates that the graphics processor in the server to be inspected is in an unhealthy state.

In the embodiment of the application, if the first distance is not smaller than the static distance threshold or the second distance is not smaller than the dynamic distance threshold, the inspection equipment determines that the detection result indicates that the graphics processor in the server to be inspected is in an unhealthy state. Furthermore, when the graphics processor in the server to be inspected is in an unhealthy state, an early warning of the fault can be issued. Specifically, the fault information can be displayed on the interface of the inspection equipment as an alarm, or an alarm device can be triggered to give an alarm, which is not limited in this embodiment.
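Steps S401 to S403 amount to a two-threshold decision per server, which can be sketched as follows. The function names, server names, distances, and threshold values are hypothetical placeholders chosen for illustration; how the thresholds are actually derived is described in the text above.

```python
def detect(first_distance, second_distance, static_threshold, dynamic_threshold):
    """Apply S401 to S403: healthy only if both distances are below their thresholds."""
    if first_distance < static_threshold and second_distance < dynamic_threshold:
        return "healthy"
    return "unhealthy"


def inspect_all(distances, static_threshold, dynamic_threshold):
    """Return the detection result per server and the servers to raise an alarm for."""
    results = {
        server: detect(first, second, static_threshold, dynamic_threshold)
        for server, (first, second) in distances.items()
    }
    alarms = sorted(s for s, r in results.items() if r == "unhealthy")
    return results, alarms


# Hypothetical (first distance, second distance) pairs for three servers.
distances = {
    "server-01": (0.2, 3.0),
    "server-02": (0.2, 9.5),
    "server-03": (1.5, 2.0),
}
results, alarms = inspect_all(distances, static_threshold=1.0, dynamic_threshold=6.0)
```

Because the comparison is strict, a distance exactly equal to its threshold is treated as unhealthy in this sketch, and a threshold of 0 could never be satisfied by a non-negative distance under this literal reading; the example therefore uses positive placeholder thresholds. The `alarms` list is what would be shown on the inspection equipment's interface or passed to an alarm device.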

Combining all the above embodiments, an inspection method is also provided. As shown in fig. 6, the method includes:

S501, calling a static function in the information acquisition script every first preset time, and sending a static information acquisition request to each server to be inspected.

S502, calling a dynamic function in the information acquisition script every second preset time, and sending a dynamic information acquisition request to each server to be inspected. The first preset time is less than the second preset time.

S503, receiving the working state information returned by each server to be inspected. The working state information includes static information and dynamic information.

S504, for each piece of working state information, determining a first distance between the static information and the static standard information, and determining a second distance between the dynamic information and the dynamic standard information.

S505, for each piece of working state information, determining whether the first distance corresponding to the working state information is smaller than a static distance threshold, and determining whether the second distance corresponding to the working state information is smaller than a dynamic distance threshold.

S506, if the first distance is smaller than the static distance threshold and the second distance is smaller than the dynamic distance threshold, determining that the detection result indicates that the graphics processor in the server to be inspected is in a healthy state.

S507, if the first distance is not smaller than the static distance threshold or the second distance is not smaller than the dynamic distance threshold, determining that the detection result indicates that the graphics processor in the server to be inspected is in an unhealthy state.

The inspection method provided by the embodiment of the application is clearly and logically designed in terms of software scripts. It provides an IPMITool-based out-of-band automatic inspection and early-warning script for GPU dynamic and static information, together with a data analysis script based on the k-means clustering algorithm. By combining the IPMITool tool with the k-means clustering algorithm, the method can give early warning of faults, narrow down the fault range in advance, and improve operation and maintenance efficiency. It monitors the running state of the GPUs on the servers to be inspected comprehensively and issues health warnings according to the distance from the average value point of the health data.

In terms of hardware design, the method has simple requirements, is convenient to deploy, is lightweight and highly extensible, and requires little investment while giving good results. Capacity can be expanded continuously after deployment, without shutting down the servers to be inspected or affecting their work. The method can fully meet the requirements of a small data center, is particularly suitable for server test manufacturers and the like, and also suits scenarios in which a server to be inspected contains a plurality of GPUs. In addition, a data reading mode based on a single GPU can be added to the IPMITool-based out-of-band automatic inspection and early-warning script and to the k-means-based data analysis script, so that one inspection device can inspect the dynamic and static information of a plurality of GPUs on a plurality of servers to be inspected and give early warning.

The method can inspect GPU dynamic and static information. The inspection frequency can be adjusted dynamically according to the user's scenario, the rules for the inspected information can be specified, and the inspected information can be adjusted dynamically as the technology is iteratively upgraded. The inspection range of the method is between 1 and 5000 and can be adjusted in real time according to the performance of the inspection server. Inspection requires only a one-time deployment, provided that the inspection server and the servers to be inspected can communicate over the network. The method supports fault early warning and inspection of both common PCIe standard GPUs and integrated HGX GPUs.

The method runs in a Python environment on a Linux system, calls the k-means clustering algorithm from a Python library, and uses it to analyze the data acquired with the IPMITool tool. The k-means clustering algorithm is a clustering algorithm that uses distance as its metric. The main idea is that collected GPU health data concentrate near a center point rather than being scattered randomly. A center point is calculated from the health data, and the distance between the data collected in real time and that center point is then calculated. The greater the distance, the farther the data are from the health data and the worse the health state; conversely, the smaller the distance, the better the health state. The distance between the average value of the health data and the center point is used as the health threshold for early warning, and the fault location can be determined from abnormal or outlying data, thereby narrowing down the fault location.
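To make the out-of-band acquisition concrete, the sketch below builds one ipmitool command line per server to be inspected. It is an assumption, not the script of the application: the source does not state which ipmitool subcommand is used, so `sdr list` is only a placeholder, and the function name, BMC addresses, and user name are hypothetical.

```python
def build_ipmitool_command(host, user, subcommand=("sdr", "list")):
    """Build an out-of-band ipmitool command line for one server's BMC.

    -I lanplus selects the IPMI v2.0 LAN interface, -H and -U give the BMC
    address and user name, and -E reads the password from the IPMI_PASSWORD
    environment variable instead of the command line.
    """
    return ["ipmitool", "-I", "lanplus", "-H", host, "-U", user, "-E", *subcommand]


# Hypothetical BMC addresses of the servers to be inspected.
bmc_hosts = ["192.0.2.11", "192.0.2.12"]
commands = [build_ipmitool_command(host, "admin") for host in bmc_hosts]
```

Each command list could then be passed to `subprocess.run(cmd, capture_output=True, text=True)` on the inspection equipment, and the parsed output appended to the intermediate file for analysis.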

The method of each step has been described in the foregoing embodiments; for details, refer to the foregoing description, which is not repeated here.

It should be understood that, although the steps in the flowcharts related to the embodiments described above are shown sequentially as indicated by the arrows, these steps are not necessarily performed in the order indicated by the arrows. Unless explicitly stated herein, the execution of the steps is not strictly limited to that order, and the steps may be executed in other orders. Moreover, at least some of the steps in the flowcharts described in the above embodiments may include a plurality of sub-steps or stages, which are not necessarily performed at the same time but may be performed at different times, and which are not necessarily performed sequentially but may be performed in turn or alternately with other steps or with at least some of the sub-steps or stages of other steps.

Based on the same inventive concept, the embodiment of the application also provides an inspection device for implementing the inspection method described above. The solution provided by the device is implemented in a manner similar to that described for the method, so for the specific limitations of the one or more inspection device embodiments provided below, refer to the limitations of the inspection method above, which are not repeated here.

In one embodiment, as shown in fig. 7, an inspection device is provided, comprising:

The obtaining module 10, configured to obtain working state information of the graphics processor in each server to be inspected in the network.

The detection module 11, configured to perform fault detection on each graphics processor according to each piece of working state information, to obtain the detection result corresponding to each server to be inspected.

In one embodiment, as shown in fig. 8, the obtaining module 10 includes:

The obtaining unit 101, configured to execute the information acquisition script and send an information acquisition request to each server to be inspected.

The receiving unit 102, configured to receive the working state information returned by each server to be inspected.

In one embodiment, the obtaining unit 101 is configured to call a static function in the information acquisition script and send a static information acquisition request to each server to be inspected, and to call a dynamic function in the information acquisition script and send a dynamic information acquisition request to each server to be inspected; the static information acquisition request is used for acquiring static information, and the dynamic information acquisition request is used for acquiring dynamic information.

In one embodiment, the obtaining unit 101 is specifically configured to call the static function in the information acquisition script every first preset time and send a static information acquisition request to each server to be inspected, and to call the dynamic function in the information acquisition script every second preset time and send a dynamic information acquisition request to each server to be inspected; wherein the first preset time is less than the second preset time.

In one embodiment, as shown in fig. 9, the detection module 11 includes:

The analysis unit 110, configured to execute the information analysis script and perform cluster analysis on each piece of working state information to obtain an analysis result.

The detection unit 111, configured to perform fault detection on each graphics processor according to the analysis result corresponding to each piece of working state information, to obtain the detection result corresponding to each server to be inspected.

In one embodiment, the analysis unit 110 is configured to determine, for each piece of working state information, a first distance between the static information and the static standard information, and a second distance between the dynamic information and the dynamic standard information.

In one embodiment, the detection unit 111 includes:

The first determining subunit, configured to determine, for each piece of working state information, whether the first distance corresponding to the working state information is smaller than a static distance threshold, and whether the second distance corresponding to the working state information is smaller than a dynamic distance threshold.

The second determining subunit, configured to determine that the detection result indicates that the graphics processor in the server to be inspected is in a healthy state if the first distance is smaller than the static distance threshold and the second distance is smaller than the dynamic distance threshold.

The third determining subunit, configured to determine that the detection result indicates that the graphics processor in the server to be inspected is in an unhealthy state if the first distance is not smaller than the static distance threshold or the second distance is not smaller than the dynamic distance threshold.

The modules in the inspection device described above may be implemented in whole or in part by software, hardware, or a combination thereof. The modules may be embedded in, or independent of, a processor in the computer device in hardware form, or may be stored in a memory in the computer device in software form, so that the processor can call them and execute the operations corresponding to each module.

In one embodiment, a computer device, which may be a terminal or a server, may be provided, and an internal structure diagram thereof may be as shown in fig. 10, and includes a processor, a memory, an input/output interface, a communication interface, a display unit, and an input device. The processor, the memory and the input/output interface are connected through a system bus, and the communication interface, the display unit and the input device are connected to the system bus through the input/output interface. Wherein the processor of the computer device is configured to provide computing and control capabilities. The memory of the computer device includes a non-volatile storage medium and an internal memory. The non-volatile storage medium stores an operating system and a computer program. The internal memory provides an environment for the operation of the operating system and computer programs in the non-volatile storage media. The input/output interface of the computer device is used to exchange information between the processor and the external device. The communication interface of the computer device is used for carrying out wired or wireless communication with an external terminal, and the wireless mode can be realized through WIFI, a mobile cellular network, NFC (near field communication) or other technologies. The computer program is executed by a processor to implement a patrol method. The display unit of the computer device is used for forming a visual picture, and can be a display screen, a projection device or a virtual reality imaging device. The display screen can be a liquid crystal display screen or an electronic ink display screen, and the input device of the computer equipment can be a touch layer covered on the display screen, can also be a key, a track ball or a touch pad arranged on the shell of the computer equipment, and can also be an external keyboard, a touch pad or a mouse and the like.

It will be appreciated by those skilled in the art that the structure shown in FIG. 10 is merely a block diagram of some of the structures associated with the present inventive arrangements and is not limiting of the computer device to which the present inventive arrangements may be applied, and that a particular computer device may include more or fewer components than shown, or may combine some of the components, or have a different arrangement of components.

In one embodiment, a computer device is provided comprising a memory and a processor, the memory having stored therein a computer program, the processor when executing the computer program performing the steps of:

In one embodiment, the processor when executing the computer program further performs the steps of:

calling a static function in the information acquisition script, sending a static information acquisition request to each server to be patrolled and examined, and calling a dynamic function in the information acquisition script, and sending a dynamic information acquisition request to each server to be patrolled and examined.

and calling the static function in the information acquisition script every first preset time, sending a static information acquisition request to each server to be patrolled and examined, and calling the dynamic function in the information acquisition script every second preset time, and sending a dynamic information acquisition request to each server to be patrolled and examined.

The computer device provided in the foregoing embodiments has similar implementation principles and technical effects to those of the foregoing method embodiments, and will not be described herein in detail.

In one embodiment, a computer readable storage medium is provided having a computer program stored thereon, which when executed by a processor, performs the steps of:

In one embodiment, the computer program when executed by the processor further performs the steps of:

The foregoing embodiment provides a computer readable storage medium, which has similar principles and technical effects to those of the foregoing method embodiment, and will not be described herein.

In one embodiment, a computer program product is provided comprising a computer program which, when executed by a processor, performs the steps of:

The foregoing embodiment provides a computer program product, which has similar principles and technical effects to those of the foregoing method embodiment, and will not be described herein.

Those skilled in the art will appreciate that all or part of the methods described above may be implemented by a computer program stored on a non-transitory computer readable storage medium, and that the program, when executed, may include the steps of the method embodiments described above. Any reference to memory, database, or other medium used in the embodiments provided herein may include at least one of non-volatile and volatile memory. The non-volatile memory may include read-only memory (Read-Only Memory, ROM), magnetic tape, floppy disk, flash memory, optical memory, high density embedded non-volatile memory, resistive random access memory (ReRAM), magnetoresistive random access memory (Magnetoresistive Random Access Memory, MRAM), ferroelectric random access memory (Ferroelectric Random Access Memory, FRAM), phase change memory (Phase Change Memory, PCM), graphene memory, and the like. Volatile memory can include random access memory (Random Access Memory, RAM) or external cache memory, and the like. By way of illustration, and not limitation, RAM can take various forms, such as static random access memory (Static Random Access Memory, SRAM) or dynamic random access memory (Dynamic Random Access Memory, DRAM). The databases referred to in the embodiments provided herein may include at least one of a relational database and a non-relational database. The non-relational database may include, but is not limited to, a blockchain-based distributed database and the like. The processor referred to in the embodiments provided in the present application may be a general-purpose processor, a central processing unit, a graphics processor, a digital signal processor, a programmable logic unit, a data processing logic unit based on quantum computing, or the like, but is not limited thereto.

The technical features of the above embodiments may be combined arbitrarily. For brevity, not all possible combinations of these technical features are described; however, as long as a combination of technical features contains no contradiction, it should be considered to be within the scope of this description.

The foregoing examples present only a few embodiments of the application. Their description is specific and detailed, but they should not therefore be construed as limiting the scope of the application. It should be noted that those skilled in the art can make several variations and modifications without departing from the spirit of the application, all of which fall within the scope of the application. Accordingly, the scope of protection of the application shall be determined by the appended claims.

Claims

1. A method of inspection, the method being applied to an inspection apparatus, the method comprising:

And performing fault detection on each graphic processor according to each piece of working state information to obtain a detection result corresponding to each server to be inspected.

2. The method according to claim 1, wherein the obtaining the working state information of the graphics processor in each server to be patrolled and examined in the network includes:

and receiving the working state information returned by each server to be inspected.

3. The method according to claim 2, wherein the working state information includes static information and dynamic information, and the executing the information acquisition script and sending an information acquisition request to each of the servers to be inspected includes:

invoking a static function in the information acquisition script, sending a static information acquisition request to each to-be-inspected server, invoking a dynamic function in the information acquisition script, and sending a dynamic information acquisition request to each to-be-inspected server;

the static acquisition request is used for acquiring the static information, and the dynamic acquisition request is used for acquiring the dynamic information.

4. The method of claim 3, wherein the invoking the static function in the information acquisition script, sending a static information acquisition request to each of the servers to be inspected, and invoking the dynamic function in the information acquisition script, sending a dynamic information acquisition request to each of the servers to be inspected, comprises:

Invoking a static function in the information acquisition script every first preset time, sending a static information acquisition request to each server to be patrolled and examined, and invoking a dynamic function in the information acquisition script every second preset time, sending a dynamic information acquisition request to each server to be patrolled and examined;

Wherein the first preset time is less than the second preset time.

5. The method of claim 1, wherein the performing fault detection on each graphics processor according to each piece of working state information to obtain a detection result corresponding to each server to be patrolled and examined includes:

6. The method according to claim 5, wherein the working state information includes static information and dynamic information, and the performing cluster analysis on each working state information to obtain an analysis result includes:

For each of the operational status information, a first distance between the static information and static standard information is determined, and a second distance between the dynamic information and dynamic standard information is determined.

7. The method of claim 6, wherein the performing fault detection on each graphics processor according to the analysis result corresponding to each piece of operating state information to obtain a detection result corresponding to each server to be patrolled and examined includes:

For each piece of working state information, determining whether the first distance corresponding to the working state information is smaller than a static distance threshold value, and determining whether the second distance corresponding to the working state information is smaller than a dynamic distance threshold value;

If the first distance is smaller than the static distance threshold and the second distance is smaller than the dynamic distance threshold, determining that the detection result indicates that a graphic processor in the server to be patrolled and examined is in a health state;

And if the first distance is not smaller than the static distance threshold or the second distance is not smaller than the dynamic distance threshold, determining that the detection result indicates that the graphics processor in the server to be patrolled and examined is in an unhealthy state.

8. A patrol device, the device comprising:

And the detection module is used for carrying out fault detection on each graphic processor according to each piece of working state information to obtain a detection result corresponding to each server to be patrolled and examined.

9. A computer device comprising a memory and a processor, the memory storing a computer program, characterized in that the processor implements the steps of the method of any of claims 1 to 7 when the computer program is executed.

10. A computer readable storage medium, on which a computer program is stored, characterized in that the computer program, when being executed by a processor, implements the steps of the method of any of claims 1 to 7.