CN118051393A - Inspection method, apparatus, device, storage medium, and computer program product - Google Patents

Publication number
CN118051393A
CN118051393A CN202410224218.9A CN202410224218A CN118051393A CN 118051393 A CN118051393 A CN 118051393A CN 202410224218 A CN202410224218 A CN 202410224218A CN 118051393 A CN118051393 A CN 118051393A
Authority
CN
China
Prior art keywords
information
server
static
dynamic
working state
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202410224218.9A
Other languages
Chinese (zh)
Inventor
孟召潮
秦晓宁
陈颖
王添
孙建旭
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Ningchang Information Technology Hangzhou Co ltd
Original Assignee
Ningchang Information Technology Hangzhou Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Ningchang Information Technology Hangzhou Co ltd
Priority to CN202410224218.9A
Publication of CN118051393A

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00 Error detection; Error correction; Monitoring
    • G06F11/22 Detection or location of defective computer hardware by testing during standby operation or during idle time, e.g. start-up testing
    • G06F11/2205 Detection or location of defective computer hardware by testing during standby operation or during idle time, e.g. start-up testing using arrangements specific to the hardware being tested
    • G06F11/2236 Detection or location of defective computer hardware by testing during standby operation or during idle time, e.g. start-up testing using arrangements specific to the hardware being tested to test CPU or processors
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/23 Clustering techniques

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • General Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Evolutionary Biology (AREA)
  • Evolutionary Computation (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Artificial Intelligence (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Computer Hardware Design (AREA)
  • Quality & Reliability (AREA)
  • Debugging And Monitoring (AREA)

Abstract

The application relates to an inspection method, an inspection apparatus, a device, a storage medium, and a computer program product. In the method, the inspection device can inspect the working states of the graphics processors in multiple servers to be inspected at one time, and the inspection can be carried out over a network, so staff do not need to inspect on site and inspection efficiency can be greatly improved. In addition, compared with the traditional method of completing inspection through inspection indexes, the method acquires the static information and the dynamic information of the graphics processor in real time, which reflects the working state of the graphics processor comprehensively and accurately, so the method can improve inspection accuracy to a certain extent.

Description

Inspection method, apparatus, device, storage medium, and computer program product
Technical Field
The present application relates to the field of fault early warning technologies, and in particular, to a method, an apparatus, a device, a storage medium, and a computer program product for inspection.
Background
A graphics processor (Graphics Processing Unit, GPU for short) in a server is a computing component that handles the more complex graphics or image processing in the server, and it therefore has more processing units and stronger parallel processing capability than an ordinary CPU. However, a GPU has a comparatively high failure rate when running at high speed, so fault inspection of the GPU is necessary to ensure that it operates normally.
Currently, state inspection of a GPU is generally performed on site by staff based on inspection indexes, so that GPU faults are discovered in time and on-site operation and maintenance can be carried out.
However, this fault inspection method for the GPU is inefficient.
Disclosure of Invention
In view of the foregoing, it is desirable to provide an inspection method, apparatus, device, storage medium, and computer program product that can improve inspection efficiency.
In a first aspect, the present application provides an inspection method applied to an inspection device, the method comprising:
acquiring working state information of a graphics processor in each server to be inspected in a network; and
performing fault detection on each graphics processor according to the working state information to obtain a detection result corresponding to each server to be inspected.
According to the inspection method provided by the embodiment of the application, the working state information of the graphics processor in each server to be inspected in the network is acquired, and fault detection is then performed on each graphics processor according to the working state information to obtain the detection result corresponding to each server to be inspected. In the method, the inspection device can inspect the working states of the graphics processors in multiple servers to be inspected at one time, and the inspection can be carried out over a network, so staff do not need to inspect on site and inspection efficiency can be greatly improved. In addition, compared with the traditional method of completing inspection through inspection indexes, the method acquires the static information and the dynamic information of the graphics processor in real time, which reflects the working state of the graphics processor comprehensively and accurately, so the method can improve inspection accuracy to a certain extent.
In one embodiment, acquiring the working state information of the graphics processor in each server to be inspected in the network includes:
executing an information acquisition script and sending an information acquisition request to each server to be inspected; and
receiving the working state information returned by each server to be inspected.
According to the method provided by the embodiment of the application, the working state information of the graphic processor in each server to be patrolled and examined can be automatically acquired by executing the information acquisition script, and the servers do not need to be manually connected one by one to inquire, so that the data acquisition efficiency is improved. Moreover, by executing the information acquisition script, all the servers to be inspected can be ensured to acquire the information according to the same standard, and the problem of inconsistent data caused by manual operation errors is avoided.
In one embodiment, the working state information includes static information and dynamic information, and executing the information acquisition script and sending the information acquisition request to each server to be inspected includes:
calling a static function in the information acquisition script to send a static information acquisition request to each server to be inspected, and calling a dynamic function in the information acquisition script to send a dynamic information acquisition request to each server to be inspected;
wherein the static information acquisition request is used to acquire the static information, and the dynamic information acquisition request is used to acquire the dynamic information.
The method of the embodiment of the application can automatically acquire the state information of the graphic processor by executing the static function and the dynamic function, and can comprehensively monitor various state information of the graphic processor by acquiring the static information and the dynamic information.
In one embodiment, calling the static function in the information acquisition script to send the static information acquisition request to each server to be inspected, and calling the dynamic function in the information acquisition script to send the dynamic information acquisition request to each server to be inspected, includes:
calling the static function in the information acquisition script every first preset time to send the static information acquisition request to each server to be inspected, and calling the dynamic function in the information acquisition script every second preset time to send the dynamic information acquisition request to each server to be inspected;
Wherein the first preset time is less than the second preset time.
According to the method provided by the embodiment of the application, because the static information is relatively stable and basically unchanged compared with the dynamic information, the method can ensure the comprehensiveness and instantaneity of information acquisition and simultaneously can effectively save resources and reduce unnecessary resource waste by reducing the acquisition frequency of the static information.
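The two-interval calling scheme can be illustrated with a minimal sketch. It assumes a simple tick-based loop; the helper names due_functions and schedule are hypothetical and do not come from the application, and the interval values are left as parameters because the application does not give concrete values.

```python
def due_functions(elapsed_seconds, first_preset, second_preset):
    """Return which functions are due at the given elapsed time.

    The static function is due every `first_preset` seconds and the
    dynamic function every `second_preset` seconds.
    """
    due = []
    if elapsed_seconds % first_preset == 0:
        due.append("static")
    if elapsed_seconds % second_preset == 0:
        due.append("dynamic")
    return due


def schedule(total_seconds, first_preset, second_preset, tick=1):
    """List (time, due functions) pairs for every tick on which a call is due."""
    plan = []
    for t in range(0, total_seconds + 1, tick):
        names = due_functions(t, first_preset, second_preset)
        if names:
            plan.append((t, names))
    return plan
```

In a real script the returned names would be mapped to the static and dynamic functions of the information acquisition script, and the loop would sleep between ticks.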
In one embodiment, performing fault detection on each graphics processor according to each piece of working state information to obtain the detection result corresponding to each server to be inspected includes:
executing an information analysis script and performing cluster analysis on each piece of working state information to obtain an analysis result; and
performing fault detection on each graphics processor according to the analysis result corresponding to each piece of working state information to obtain the detection result corresponding to each server to be inspected.
According to the method provided by the embodiment of the application, the abnormal working state can be identified by carrying out cluster analysis on the working state information, and then based on the result of the cluster analysis, fault detection can be rapidly carried out on each graphic processor, so that the inspection efficiency is improved.
In one embodiment, the working state information includes static information and dynamic information, and performing cluster analysis on each piece of working state information to obtain the analysis result includes:
for each piece of working state information, determining a first distance between the static information and static standard information, and determining a second distance between the dynamic information and dynamic standard information.
In one embodiment, performing fault detection on each graphics processor according to the analysis result corresponding to each piece of working state information to obtain the detection result corresponding to each server to be inspected includes:
for each piece of working state information, determining whether the first distance corresponding to the working state information is smaller than a static distance threshold, and determining whether the second distance corresponding to the working state information is smaller than a dynamic distance threshold;
if the first distance is smaller than the static distance threshold and the second distance is smaller than the dynamic distance threshold, determining that the detection result indicates that the graphics processor in the server to be inspected is in a healthy state; and
if the first distance is not smaller than the static distance threshold or the second distance is not smaller than the dynamic distance threshold, determining that the detection result indicates that the graphics processor in the server to be inspected is in an unhealthy state.
According to the method provided by the embodiment of the application, the health state of the graphic processor is analyzed by the distance between each piece of working state information and the standard information, so that the data analysis efficiency can be improved, and the inspection efficiency can be further improved.
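The distance-and-threshold decision described above can be sketched as follows. The application does not specify how the distances are computed, so this sketch assumes, purely for illustration, a count of mismatching fields as the first (static) distance and a Euclidean distance over numeric values as the second (dynamic) distance. The function names and all threshold values are hypothetical.

```python
import math


def static_distance(static_info, static_standard):
    """Assumed metric: number of standard fields whose value differs."""
    return sum(
        1
        for key, expected in static_standard.items()
        if static_info.get(key) != expected
    )


def dynamic_distance(dynamic_values, dynamic_standard):
    """Assumed metric: Euclidean distance between numeric vectors."""
    return math.sqrt(
        sum((v - s) ** 2 for v, s in zip(dynamic_values, dynamic_standard))
    )


def detect(static_info, dynamic_values, static_standard, dynamic_standard,
           static_threshold, dynamic_threshold):
    """Healthy only if both distances are below their thresholds."""
    first = static_distance(static_info, static_standard)
    second = dynamic_distance(dynamic_values, dynamic_standard)
    if first < static_threshold and second < dynamic_threshold:
        return "healthy"
    return "unhealthy"
```

Because the comparison is strict, a distance exactly equal to its threshold yields an unhealthy result, matching the "not smaller than" branch of the embodiment.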
In a second aspect, the present application also provides an inspection apparatus, which includes:
an acquisition module, configured to acquire working state information of a graphics processor in each server to be inspected in a network; and
a detection module, configured to perform fault detection on each graphics processor according to the working state information to obtain a detection result corresponding to each server to be inspected.
In a third aspect, the present application also provides a computer device comprising a memory and a processor, the memory storing a computer program, and the processor implementing the following steps when executing the computer program:
acquiring working state information of a graphics processor in each server to be inspected in a network; and
performing fault detection on each graphics processor according to the working state information to obtain a detection result corresponding to each server to be inspected.
In a fourth aspect, the present application also provides a computer readable storage medium having stored thereon a computer program which, when executed by a processor, performs the steps of:
acquiring working state information of a graphics processor in each server to be inspected in a network; and
performing fault detection on each graphics processor according to the working state information to obtain a detection result corresponding to each server to be inspected.
In a fifth aspect, the present application also provides a computer program product comprising a computer program which, when executed by a processor, performs the steps of:
acquiring working state information of a graphics processor in each server to be inspected in a network; and
performing fault detection on each graphics processor according to the working state information to obtain a detection result corresponding to each server to be inspected.
With the above inspection method, apparatus, device, storage medium, and computer program product, the working state information of the graphics processor in each server to be inspected in the network is acquired, and fault detection is then performed on each graphics processor according to the working state information to obtain the detection result corresponding to each server to be inspected. In the method, the inspection device can inspect the working states of the graphics processors in multiple servers to be inspected at one time, and the inspection can be carried out over a network, so staff do not need to inspect on site and inspection efficiency can be greatly improved. In addition, compared with the traditional method of completing inspection through inspection indexes, the method acquires the static information and the dynamic information of the graphics processor in real time, which reflects the working state of the graphics processor comprehensively and accurately, so the method can improve inspection accuracy to a certain extent.
Drawings
FIG. 1 is a schematic diagram of an inspection system according to an embodiment;
FIG. 2 is a schematic flow chart of an inspection method in one embodiment;
FIG. 3 is a flow chart of a method of inspection according to another embodiment;
FIG. 4 is a schematic flow chart of a method of inspection according to another embodiment;
FIG. 5 is a schematic flow chart of a method of inspection according to another embodiment;
FIG. 6 is a schematic flow chart of a method of inspection according to another embodiment;
FIG. 7 is a block diagram of an inspection apparatus according to one embodiment;
FIG. 8 is a block diagram of an inspection apparatus according to another embodiment;
FIG. 9 is a block diagram of an inspection apparatus according to another embodiment;
FIG. 10 is an internal structural view of a computer device in one embodiment.
Detailed Description
The present application will be described in further detail with reference to the drawings and examples, in order to make the objects, technical solutions and advantages of the present application more apparent. It should be understood that the specific embodiments described herein are for purposes of illustration only and are not intended to limit the scope of the application.
A graphics processor (Graphics Processing Unit, GPU for short) in a server is a computing component that handles the more complex graphics or image processing in the server, and it therefore has more processing units and stronger parallel processing capability than an ordinary CPU. However, a GPU has a comparatively high failure rate when running at high speed, so fault inspection of the GPU is necessary to ensure that it operates normally. Currently, state inspection of a GPU is generally performed on site by staff based on inspection indexes, so that GPU faults are discovered in time and on-site operation and maintenance can be carried out. However, this fault inspection method for the GPU is inefficient. To solve this technical problem, the application provides an inspection method, which is described in detail in the following embodiments.
The inspection method provided by the embodiment of the application can be applied to the inspection network system shown in FIG. 1. The inspection network system includes an inspection device 01 and a plurality of servers 02 to be inspected, which can be connected in a wired or wireless manner; for example, the inspection device 01 and the servers 02 to be inspected can be placed in the same local area network. The inspection device 01 is used to inspect the graphics processors in the servers 02 to be inspected. The inspection device 01 can be any of various personal computers, notebook computers, smart phones, tablet computers, Internet of Things devices, and portable wearable devices; alternatively, the inspection device 01 may be a server. The server 02 to be inspected can likewise be any of various personal computers, notebook computers, smart phones, tablet computers, Internet of Things devices, and portable wearable devices; alternatively, the server 02 to be inspected may be a server.
It will be appreciated by those skilled in the art that the structure shown in fig. 1 is merely a block diagram of a portion of the structure associated with the present inventive arrangements and is not limiting of the patrol network system to which the present inventive arrangements are applied, and that a particular patrol network system may include more or fewer components than shown, or may combine certain components, or may have a different arrangement of components.
In one embodiment, as shown in FIG. 2, an inspection method is provided. The inspection method is applied to the inspection device in FIG. 1 and includes the following steps:
S101, working state information of a graphic processor in each server to be inspected in the network is obtained.
The network may be a local area network or another network, and the servers in the network can communicate with one another. The working state information reflects the working state of the graphics processor and includes static information and dynamic information. The static information includes at least one of the following for the graphics processor: a serial number (serial_number), a product name (product_name), a part number (part_number), a globally unique identifier (GUID), a firmware version (firmware_version), a vendor identifier (vendor_id), a sub-vendor identifier (sub-vendor_id), a character string or code of the device (vendor_id), and a character string or code of the sub-device (sub-device_id). The dynamic information includes at least one of: a device state (drive_state), a temperature value (temperature), a power state (power), a current memory utilization (current_memory_utilization), a current resource utilization (current_sm_utilization), an external power state (external_power), disabled page states (retire_page_sbe, retire_page_dbe), a current error-correcting memory state (current_ecc_state), a current multi-instance state (current_mig_state), a negative time utilization (current_utilization_time), a resource utilization time (sm_utilization_time), a current graphics frequency (current_graphics_clock), a current memory frequency (current_memory_clock), a high-speed connection channel state (nvlink_status), a row remapping state (row_remapping_state), a link speed (5435_k_state), a link width (6783), a reset state (STATEPCIE _state), and alert states (power_state (high-alert state) in a current_memory_memory_ utilization).
In the embodiment of the application, the servers to be inspected and the inspection device can be placed in one network in advance, and the software used for inspection is installed on the inspection device. For example, the inspection device and the servers to be inspected are connected in a local area network, and a Linux system, the ipmitool tool, and a Python environment are installed on the inspection device. Before the inspection device is used for inspection, it must be determined whether each server to be inspected can communicate with the inspection device, and a server is inspected only if it can. The servers to be inspected can all be deployed at one time, and their number can be adjusted as required or according to the performance of the inspection device: the number can be increased when the performance of the inspection device is good and reduced when it is poor. During inspection, the inspection device can acquire, in real time over the network, the working state information of the graphics processors in one or more servers to be inspected in the same network. Specifically, the working state information of the graphics processor can be obtained by entering a command line. Alternatively, it can be obtained by calling an interface, or by using monitoring software.
S102, fault detection is carried out on each graphic processor according to the working state information, and detection results corresponding to each server to be patrolled and examined are obtained.
The detection result includes that the graphics processor is in a healthy state (normal state) or the graphics processor is in an unhealthy state (abnormal state).
In the embodiment of the application, after the inspection equipment acquires the working state information of the graphic processor in each server to be inspected, the inspection equipment can directly analyze whether the working state information is abnormal or not, further analyze whether the graphic processor is in a health state or not, and acquire the detection result corresponding to each server to be inspected. For example, comparing each working state information with the normal working state information, and performing fault detection on each graphic processor according to the comparison result to obtain a detection result corresponding to each server to be patrolled and examined. Optionally, fault detection can be performed on each graphics processor through each piece of working state information and a corresponding preset threshold value, so as to obtain a detection result corresponding to each server to be patrolled and examined. Optionally, after the inspection device obtains the working state information of the graphics processor in each server to be inspected, each working state information may be saved in a designated file, and when the working state information needs to be analyzed, the working state information may be obtained from the file. Optionally, after the inspection device obtains the working state information of the graphics processor in each server to be inspected, the inspection device may further process each working state information, for example, perform alignment processing in a time stamp manner, and then perform fault detection on the working state information after alignment processing to obtain a detection result corresponding to each server to be inspected.
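The option of checking each piece of working state information against a corresponding preset threshold can be sketched as below. This is a minimal illustration under stated assumptions: the field names temperature and current_memory_utilization are taken from the dynamic information listed above, while the function names, the treatment of thresholds as upper limits, and any limit values are hypothetical.

```python
def exceeded_fields(dynamic_info, thresholds):
    """Return the fields whose value is above its preset upper limit.

    Fields missing from `dynamic_info` are skipped rather than
    treated as faults (an assumption made for this sketch).
    """
    exceeded = {}
    for field, limit in thresholds.items():
        value = dynamic_info.get(field)
        if value is not None and value > limit:
            exceeded[field] = value
    return exceeded


def detection_result(dynamic_info, thresholds):
    """Map the threshold check to a healthy or unhealthy result."""
    if exceeded_fields(dynamic_info, thresholds):
        return "unhealthy"
    return "healthy"
```

Returning the offending fields alongside the result makes it easier to see which metric caused a graphics processor to be reported as unhealthy.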
According to the inspection method provided by the embodiment of the application, the working state information of the graphics processor in each server to be inspected in the network is acquired, and fault detection is then performed on each graphics processor according to the working state information to obtain the detection result corresponding to each server to be inspected. In the method, the inspection device can inspect the working states of the graphics processors in multiple servers to be inspected at one time, and the inspection can be carried out over a network, so staff do not need to inspect on site and inspection efficiency can be greatly improved. In addition, compared with the traditional method of completing inspection through inspection indexes, the method acquires the static information and the dynamic information of the graphics processor in real time, which reflects the working state of the graphics processor comprehensively and accurately, so the method can improve inspection accuracy to a certain extent. Moreover, the method provides an inspection device dedicated to inspecting the graphics processor of each device in the network; the inspection can be run periodically or on command, so inspection remains efficient even when a large number of network devices must be inspected.
In an embodiment, a specific implementation manner of obtaining the working state information of the graphics processor is further provided, as shown in fig. 3, where "obtaining the working state information of the graphics processor in each server to be patrolled and examined in the network" in step S101 includes:
S201, executing an information acquisition script, and sending an information acquisition request to each server to be inspected.
The information acquisition script is a script written in advance using Linux commands and is used to acquire the working state information of each server to be inspected. The information acquisition request is used to acquire the working state information of the graphics processor in each server to be inspected.
In the embodiment of the application, an information acquisition script that can run with the ipmitool tool can be written in advance and stored on the inspection device under a preset path. When the inspection device determines that the servers to be inspected are reachable, it can call and execute the information acquisition script with ipmitool. While executing the script, the inspection device generates an information acquisition request according to the content indicated by the script and sends the request to the servers to be inspected over the network. Optionally, the information acquisition request may be sent to all servers to be inspected at one time. Optionally, it may be sent to the servers to be inspected in order of connection time, in order of the device numbers of the servers, or in another preset order.
S202, receiving the working state information returned by each server to be inspected.
In the embodiment of the application, after a server to be inspected receives the information acquisition request sent by the inspection device, it parses the request to determine which working state information the inspection device needs and then sends that working state information to the inspection device. The inspection device thus receives the working state information returned by each server to be inspected.
According to the method provided by the embodiment of the application, the working state information of the graphic processor in each server to be patrolled and examined can be automatically acquired by executing the information acquisition script, and the servers do not need to be manually connected one by one to inquire, so that the data acquisition efficiency is improved. Moreover, by executing the information acquisition script, all the servers to be inspected can be ensured to acquire the information according to the same standard, and the problem of inconsistent data caused by manual operation errors is avoided.
In one embodiment, the working state information includes static information and dynamic information, and on this basis a specific implementation of sending the information acquisition request is further provided. The step of "executing an information acquisition script, and sending an information acquisition request to each server to be inspected" in step S201 includes: calling a static function in the information acquisition script to send a static information acquisition request to each server to be inspected, and calling a dynamic function in the information acquisition script to send a dynamic information acquisition request to each server to be inspected.
The information acquisition request includes a static information acquisition request for acquiring the static information and a dynamic information acquisition request for acquiring the dynamic information. The information acquisition script includes a static function and a dynamic function; in the embodiment of the application, the static function may be named static_info_messages and the dynamic function may be named dynamic_info_messages. Optionally, the information acquisition script may further include a collection function, which may be named data_collect. Optionally, the information acquisition script may further include a main function, which may be named main and which controls the execution order and overall execution logic of the static function, the dynamic function, and the collection function. The instructions issued by the static function and the dynamic function may take the form: ipmitool -H <management IP address of the server to be inspected> -I lanplus -U <login user name of the server to be inspected> -P <login password of the server to be inspected> <specific information instruction to be accessed on the server to be inspected>. The inspection device may access each server to be inspected based on a user name access list file in a fixed format (in the embodiment of the application, the file may be named ip_user_passwd.log). The file stores the login information of the servers to be inspected, one server per line, in the form IP:1.1.1.1;User:admin;Passwd:111111. The entries for four servers to be inspected are shown below; entries for additional servers follow the same format.
IP:1.1.1.1;User:Admin0;Passwd:111111
IP:1.1.1.2;User:Admin1;Passwd:111112
IP:1.1.1.3;User:Admin2;Passwd:111113
IP:1.1.1.4;User:Admin3;Passwd:111114
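As an illustration of how such an access list could be turned into out-of-band queries, the following is a minimal Python sketch. The helper names parse_access_line and build_ipmitool_command are hypothetical, and the query "sdr list" is only a placeholder, since the embodiment does not name the specific information instruction. The sketch only builds the command line and does not execute it.

```python
from typing import Dict, List


def parse_access_line(line: str) -> Dict[str, str]:
    """Parse one 'IP:...;User:...;Passwd:...' row into a dict."""
    fields: Dict[str, str] = {}
    for part in line.strip().split(";"):
        if not part:
            continue
        # Split on the first colon only, so values may contain colons.
        key, _, value = part.partition(":")
        fields[key.strip()] = value.strip()
    return fields


def build_ipmitool_command(entry: Dict[str, str], query: List[str]) -> List[str]:
    """Build the argument list for an ipmitool call over the lanplus interface."""
    return [
        "ipmitool", "-I", "lanplus",
        "-H", entry["IP"],
        "-U", entry["User"],
        "-P", entry["Passwd"],
        *query,
    ]


# Sample content in the format of ip_user_passwd.log.
access_list = """\
IP:1.1.1.1;User:Admin0;Passwd:111111
IP:1.1.1.2;User:Admin1;Passwd:111112
"""

entries = [parse_access_line(row) for row in access_list.splitlines() if row.strip()]
# "sdr list" is a placeholder query (assumption); the source does not specify it.
commands = [build_ipmitool_command(entry, ["sdr", "list"]) for entry in entries]
print(commands[0])
```

In practice each argument list could be passed to subprocess.run, which avoids shell quoting problems with passwords.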
In the embodiment of the application, when the inspection device executes the information acquisition script, it can call the static function in the script to send a static information acquisition request to each server to be inspected, and call the dynamic function in the script to send a dynamic information acquisition request to each server to be inspected. Optionally, after the inspection device obtains the static information and the dynamic information through the static function and the dynamic function, the collection function may be used to save the static information and the dynamic information into a preset folder, for example into an intermediate file data.csv. Optionally, the static information acquisition request and the dynamic information acquisition request may be sent to each server to be inspected at the same time, or either request may be sent before the other. Optionally, the static information acquisition request and the dynamic information acquisition request may be sent at the same frequency or at different frequencies.
By executing the static function and the dynamic function, the method of the embodiment of the application can automatically acquire the state information of the graphics processor, and by acquiring both the static information and the dynamic information it can comprehensively monitor the various kinds of state information of the graphics processor.
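The collection step described above can be sketched as follows. This is a minimal illustration under stated assumptions: the row layout (IP, kind, field name, value) and the field names gpu_model and temperature_c are hypothetical, because the embodiment does not define the layout of data.csv.

```python
import csv
import io
from typing import Dict, TextIO


def data_collect(out: TextIO, server_ip: str,
                 static_info: Dict[str, str],
                 dynamic_info: Dict[str, float]) -> None:
    """Write one CSV row per collected field: ip, kind, name, value."""
    writer = csv.writer(out, lineterminator="\n")
    for name, value in static_info.items():
        writer.writerow([server_ip, "static", name, value])
    for name, value in dynamic_info.items():
        writer.writerow([server_ip, "dynamic", name, value])


# An in-memory buffer stands in for the intermediate file data.csv.
buffer = io.StringIO()
data_collect(buffer, "1.1.1.1",
             {"gpu_model": "example-gpu"},   # hypothetical static field
             {"temperature_c": 55.0})        # hypothetical dynamic field
csv_text = buffer.getvalue()
print(csv_text)
```

To write to a real file, the buffer could be replaced by open("data.csv", "a", newline="").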
In an embodiment, on the basis of "calling a static function in the information acquisition script to send a static information acquisition request to each server to be inspected, and calling a dynamic function in the information acquisition script to send a dynamic information acquisition request to each server to be inspected" in the above embodiment, a specific implementation of executing the information acquisition script is further provided, which includes: calling the static function in the information acquisition script every first preset time to send a static information acquisition request to each server to be inspected, and calling the dynamic function in the information acquisition script every second preset time to send a dynamic information acquisition request to each server to be inspected.
The first preset time is greater than the second preset time.
In the embodiment of the application, when the inspection device executes the information acquisition script, it can call the static function in the script every first preset time to send a static information acquisition request to each server to be inspected; for example, the static information can be acquired every 60 minutes by calling the static function. It can also call the dynamic function in the script every second preset time to send a dynamic information acquisition request to each server to be inspected; for example, the dynamic information can be acquired every 6 minutes by calling the dynamic function. Optionally, after the inspection device obtains the static information and the dynamic information through the static function and the dynamic function, the static information and the dynamic information are stored in a designated array and are then saved into a preset folder by the collection function, for example into an intermediate file data.csv.
According to the method provided by the embodiment of the application, because the static information is relatively stable and essentially unchanged compared with the dynamic information, acquiring the static information less frequently preserves the comprehensiveness and timeliness of information acquisition while saving resources and avoiding unnecessary waste.
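The two collection intervals can be illustrated with a small scheduling sketch. It uses the example values given above (static information every 60 minutes, dynamic information every 6 minutes); the function name due_functions and the minute-based tick are assumptions made for illustration only.

```python
from typing import Dict, List

STATIC_INTERVAL_MIN = 60   # static information changes rarely
DYNAMIC_INTERVAL_MIN = 6   # dynamic information changes often


def due_functions(elapsed_minutes: int) -> List[str]:
    """Return which acquisition functions are due at the given minute."""
    due: List[str] = []
    if elapsed_minutes % STATIC_INTERVAL_MIN == 0:
        due.append("static")
    if elapsed_minutes % DYNAMIC_INTERVAL_MIN == 0:
        due.append("dynamic")
    return due


# Minutes within the first hour at which at least one function runs.
schedule: Dict[int, List[str]] = {
    minute: due_functions(minute)
    for minute in range(0, 61)
    if due_functions(minute)
}
for minute, functions in schedule.items():
    print(minute, functions)
```

In this sketch the dynamic function runs at eleven points within the first hour (minutes 0, 6, ..., 60), while the static function runs only at minutes 0 and 60.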
In an embodiment, a specific implementation of obtaining the detection result corresponding to each server to be inspected is further provided. As shown in FIG. 4, step S102 of "performing fault detection on each graphics processor according to each piece of working state information to obtain a detection result corresponding to each server to be inspected" includes:
S301, executing an information analysis script and performing cluster analysis on each piece of working state information to obtain an analysis result.
The information analysis script is written in advance using Linux code instructions and is used to analyze the working state information of the graphics processor in each server to be inspected, so as to determine whether the graphics processor is in a healthy state. The information analysis script includes an average function (data_mean function) and a center function (Center point function); the average function calculates the mean of the standard data, the center function calculates the center point of the standard data, and the standard data are the data of a server to be inspected in a healthy state.
The analysis result is used for representing the relation between the current working state information of each server to be inspected and the healthy working state information of each server to be inspected.
In the embodiment of the application, after the inspection device acquires the working state information of the graphics processor in each server to be inspected, it can input each piece of working state information into a preset clustering algorithm, for example a k-means clustering algorithm, and perform cluster analysis on the working state information with that algorithm to obtain an analysis result. Optionally, the working state information can instead be input into a preset clustering model, which performs the cluster analysis to obtain the analysis result.
S302, performing fault detection on each graphics processor according to the analysis result corresponding to each piece of working state information to obtain a detection result corresponding to each server to be inspected.
In the embodiment of the application, after the inspection device obtains the analysis result corresponding to the working state information of the graphics processor in each server to be inspected, it can directly check whether each analysis result is abnormal, thereby determining whether the graphics processor is in a healthy state and obtaining the detection result corresponding to each server to be inspected. For example, the current analysis result is compared with the analysis result obtained during normal operation, and fault detection is performed on each graphics processor according to the comparison result. Optionally, fault detection can be performed on each graphics processor by comparing each analysis result with a corresponding preset threshold.
According to the method provided by the embodiment of the application, abnormal working states can be identified by performing cluster analysis on the working state information, and fault detection can then be carried out quickly on each graphics processor based on the result of the cluster analysis, which improves inspection efficiency.
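For readers unfamiliar with the k-means clustering algorithm mentioned in step S301, the following is a minimal pure-Python sketch of its two alternating steps (assignment and update). It is not the information analysis script of the embodiment: the sample values (temperature, power) are invented for illustration, and the initial centers are simply the first k points, which is an assumption made to keep the example deterministic.

```python
import math
from typing import List, Sequence, Tuple

Point = Tuple[float, ...]


def mean_point(points: Sequence[Point]) -> Point:
    """Coordinate-wise arithmetic mean of a non-empty set of points."""
    n = len(points)
    return tuple(sum(coords) / n for coords in zip(*points))


def k_means(points: Sequence[Point], k: int,
            iterations: int = 20) -> Tuple[List[Point], List[int]]:
    """Plain k-means: returns (centers, cluster label of each point)."""
    centers = list(points[:k])  # deterministic initialization (assumption)
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: each point joins its nearest center.
        labels = [
            min(range(k), key=lambda c: math.dist(p, centers[c]))
            for p in points
        ]
        # Update step: each center moves to the mean of its members.
        new_centers = []
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            new_centers.append(mean_point(members) if members else centers[c])
        if new_centers == centers:
            break  # converged
        centers = new_centers
    return centers, labels


# Hypothetical (temperature, power) samples: three normal, two abnormal.
samples = [(50.0, 100.0), (52.0, 110.0), (51.0, 105.0),
           (90.0, 300.0), (92.0, 310.0)]
centers, labels = k_means(samples, k=2)
print(centers, labels)
```

With these samples the three normal readings end up in one cluster and the two abnormal readings in the other.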
In an embodiment, the working state information in the foregoing embodiment includes static information and dynamic information. On this basis, a specific implementation of obtaining the analysis result is further provided, in which "performing cluster analysis on each piece of working state information to obtain an analysis result" in step S301 includes: for each piece of working state information, determining a first distance between the static information and static standard information, and determining a second distance between the dynamic information and dynamic standard information.
The first distance reflects the difference between the static information and the static standard information: the larger the first distance, the larger that difference and the worse the health state. The second distance reflects the difference between the dynamic information and the dynamic standard information: the larger the second distance, the larger that difference and the worse the health state.
In the embodiment of the application, after the inspection device acquires the working state information of the graphics processor in each server to be inspected, it can execute an information analysis script data_k-means.py, which determines, for each piece of working state information, the first distance between the static information and the static standard information and the second distance between the dynamic information and the dynamic standard information. Specifically, the inspection device can load the intermediate file data.csv and then execute the information analysis script, which performs cluster analysis on the real-time static information and dynamic information in data.csv using the k-means clustering algorithm and automatically calculates the distance between each piece of static information or dynamic information and the corresponding mean point.
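The distance calculation described above can be sketched as follows. This is a simplified illustration, assuming that both static and dynamic information have been encoded as numeric vectors and that the standard information is represented by the mean point of healthy samples (for a single cluster, the k-means center coincides with this mean). The sample values and the name distance_to_standard are hypothetical.

```python
import math
from typing import Sequence, Tuple

Point = Tuple[float, ...]


def mean_point(samples: Sequence[Point]) -> Point:
    """Coordinate-wise mean of the healthy (standard) samples."""
    n = len(samples)
    return tuple(sum(coords) / n for coords in zip(*samples))


def distance_to_standard(current: Point, healthy_samples: Sequence[Point]) -> float:
    """Euclidean distance from the current reading to the healthy mean point."""
    return math.dist(current, mean_point(healthy_samples))


# Hypothetical static vectors: (GPU count, memory size in GB).
static_healthy = [(8.0, 80.0), (8.0, 80.0)]
first_distance = distance_to_standard((8.0, 80.0), static_healthy)

# Hypothetical dynamic vectors: (temperature, power).
dynamic_healthy = [(50.0, 100.0), (52.0, 110.0), (51.0, 105.0)]
second_distance = distance_to_standard((54.0, 109.0), dynamic_healthy)

print(first_distance, second_distance)
```

Here the static reading matches the standard exactly, so the first distance is zero, while the dynamic reading lies some distance away from the healthy mean point (51, 105).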
In an embodiment, a specific implementation of obtaining the detection result corresponding to a server to be inspected is further provided. As shown in FIG. 5, step S302 of "performing fault detection on each graphics processor according to the analysis result corresponding to each piece of working state information to obtain a detection result corresponding to each server to be inspected" includes:
S401, for each piece of working state information, determining whether the first distance corresponding to the working state information is smaller than a static distance threshold, and determining whether the second distance corresponding to the working state information is smaller than a dynamic distance threshold.
The static distance threshold represents the distance from the mean of the static standard information to the center point of the static standard information; it serves as a threshold for measuring the health state of the server to be inspected and may be set to 0. The dynamic distance threshold represents the distance from the mean of the dynamic standard information to the center point of the dynamic standard information; it also serves as a threshold for measuring the health state of the server to be inspected, can be determined from the health data, and can be quantified and adjusted according to the actual working state of the GPU.
In the embodiment of the application, after the first distance and the second distance of the graphics processor in each server to be inspected are obtained, the inspection device can determine, for each server to be inspected, whether the first distance is smaller than the static distance threshold and whether the second distance is smaller than the dynamic distance threshold.
S402, if the first distance is smaller than the static distance threshold and the second distance is smaller than the dynamic distance threshold, determining that the detection result indicates that the graphics processor in the server to be inspected is in a healthy state.
In the embodiment of the application, if the first distance is smaller than the static distance threshold and the second distance is smaller than the dynamic distance threshold, the detection result is determined to indicate that the graphics processor in the server to be inspected is in a healthy state.
S403, if the first distance is not smaller than the static distance threshold or the second distance is not smaller than the dynamic distance threshold, determining that the detection result indicates that the graphics processor in the server to be inspected is in an unhealthy state.
In the embodiment of the application, if the first distance is not smaller than the static distance threshold or the second distance is not smaller than the dynamic distance threshold, the detection result is determined to indicate that the graphics processor in the server to be inspected is in an unhealthy state. Furthermore, when the graphics processor in the server to be inspected is in an unhealthy state, an early warning of the fault can be issued. Specifically, the fault information can be displayed on the interface of the inspection device as an alarm, or an alarm device can be triggered; this embodiment places no limitation on the manner of alarming.
According to the method provided by the embodiment of the application, the health state of the graphics processor is analyzed from the distance between each piece of working state information and the standard information, which improves data analysis efficiency and in turn inspection efficiency.
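The decision logic of steps S401 to S403 can be sketched as follows. The function names and the threshold values are hypothetical and chosen only for illustration. Note that the comparison is strict ("smaller than"), so the sketch uses small positive thresholds; a threshold of exactly 0 could never be satisfied under a strict comparison.

```python
from typing import Dict, List, Tuple


def detect(first_distance: float, second_distance: float,
           static_threshold: float, dynamic_threshold: float) -> str:
    """Healthy only if both distances are strictly below their thresholds."""
    if first_distance < static_threshold and second_distance < dynamic_threshold:
        return "healthy"
    return "unhealthy"


def unhealthy_servers(distances: Dict[str, Tuple[float, float]],
                      static_threshold: float,
                      dynamic_threshold: float) -> List[str]:
    """Return the IPs of servers whose GPU should trigger an early warning."""
    return [
        ip for ip, (d1, d2) in distances.items()
        if detect(d1, d2, static_threshold, dynamic_threshold) == "unhealthy"
    ]


# Hypothetical (first distance, second distance) per server to be inspected.
distances = {
    "1.1.1.1": (0.0, 5.0),
    "1.1.1.2": (0.0, 9.9),
    "1.1.1.3": (0.0, 42.0),
}
alerts = unhealthy_servers(distances, static_threshold=1.0, dynamic_threshold=10.0)
print(alerts)
```

The resulting list could then be shown on the interface of the inspection device or passed to an alarm device.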
Combining all the above embodiments, an inspection method is also provided. As shown in FIG. 6, the method includes:
S501, calling the static function in the information acquisition script every first preset time to send a static information acquisition request to each server to be inspected.
S502, calling the dynamic function in the information acquisition script every second preset time to send a dynamic information acquisition request to each server to be inspected. The first preset time is greater than the second preset time.
S503, receiving the working state information returned by each server to be inspected. The working state information includes static information and dynamic information.
S504, for each piece of working state information, determining a first distance between the static information and the static standard information, and determining a second distance between the dynamic information and the dynamic standard information.
S505, for each piece of working state information, determining whether the first distance corresponding to the working state information is smaller than the static distance threshold, and determining whether the second distance corresponding to the working state information is smaller than the dynamic distance threshold.
S506, if the first distance is smaller than the static distance threshold and the second distance is smaller than the dynamic distance threshold, determining that the detection result indicates that the graphics processor in the server to be inspected is in a healthy state.
S507, if the first distance is not smaller than the static distance threshold or the second distance is not smaller than the dynamic distance threshold, determining that the detection result indicates that the graphics processor in the server to be inspected is in an unhealthy state.
The inspection method provided by the embodiment of the application has a clear design and coherent logic at the software script level. It provides an IPMITool-based out-of-band automatic inspection and early-warning script for the dynamic and static information of the GPU (graphics processing unit), together with a data analysis script based on the k-means clustering algorithm. By combining the IPMITool tool with the k-means clustering algorithm, the method can give early warning of faults, narrow down the fault range in advance, and improve operation and maintenance effectiveness; it monitors the running state of the GPUs on the servers to be inspected in all respects and issues health warnings based on the distance from the mean point of the health data. At the hardware level, the method has simple requirements and is easy to deploy, lightweight, highly extensible, and low in cost. Capacity can be expanded continuously after deployment, without shutting down the servers to be inspected or affecting their work. The method can fully meet the needs of a small data center, is particularly suitable for server test manufacturers and the like, and also suits scenarios in which a server to be inspected contains multiple GPUs. In addition, a data reading mode based on a single GPU can be added to the IPMITool-based out-of-band automatic inspection and early-warning script and to the k-means-based data analysis script, so that one inspection device can inspect the dynamic and static information of multiple GPUs on multiple servers to be inspected and give early warnings for them.
The method can inspect GPU dynamic and static information. The inspection frequency can be adjusted dynamically according to the user's scenario, and the routing rules for the inspection information can be specified. As the technology is iteratively upgraded, the inspection information can also be adjusted dynamically. The inspection range of the method can be between 1 and 5000 and can be adjusted in real time according to the performance of the inspection server. The inspection needs to be deployed only once, provided that the inspection server and the servers to be inspected can communicate over the network. The method supports fault early warning and inspection for both common PCIe-standard GPUs and integrated HGX GPUs. The method runs in a Python environment on a Linux system and calls the k-means clustering algorithm from a Python library to analyze the data acquired with the IPMITool tool. The k-means clustering algorithm is a clustering algorithm that uses distance as its metric. The main idea is that collected GPU health data concentrate near a center point rather than being randomly scattered. The center point is computed from the health data, and the distance between the data collected in real time and the center point is then calculated: the greater the distance, the farther the data are from the health data and the worse the health state; conversely, the smaller the distance, the better the health state. The health threshold used for early warning is the distance between the mean of the health data and the center point, and the fault location can be determined from abnormal or outlying data, thereby narrowing down the fault location.
Each step has been described in the foregoing embodiments; for details, refer to the foregoing description, which is not repeated here.
It should be understood that, although the steps in the flowcharts of the embodiments described above are shown in sequence as indicated by the arrows, these steps are not necessarily performed in that order. Unless explicitly stated herein, the execution order of the steps is not strictly limited, and the steps may be performed in other orders. Moreover, at least some of the steps in the flowcharts may include a plurality of sub-steps or stages, which are not necessarily performed at the same time but may be performed at different times; these sub-steps or stages are not necessarily performed sequentially, and may be performed in turn or alternately with other steps or with at least some of the sub-steps or stages of other steps.
Based on the same inventive concept, the embodiment of the application also provides an inspection apparatus for implementing the inspection method. The solution provided by the apparatus is implemented in a manner similar to that described for the method, so for the specific limitations of the one or more embodiments of the inspection apparatus provided below, refer to the limitations of the inspection method above, which are not repeated here.
In one embodiment, as shown in FIG. 7, an inspection apparatus is provided, comprising:
An obtaining module 10, configured to obtain working state information of the graphics processor in each server to be inspected in the network.
A detection module 11, configured to perform fault detection on each graphics processor according to the working state information to obtain a detection result corresponding to each server to be inspected.
In one embodiment, as shown in FIG. 8, the obtaining module 10 includes:
An obtaining unit 101, configured to execute an information acquisition script and send an information acquisition request to each server to be inspected.
A receiving unit 102, configured to receive the working state information returned by each server to be inspected.
In one embodiment, the obtaining unit 101 is configured to call a static function in the information acquisition script to send a static information acquisition request to each server to be inspected, and to call a dynamic function in the information acquisition script to send a dynamic information acquisition request to each server to be inspected; the static information acquisition request is used to acquire static information, and the dynamic information acquisition request is used to acquire dynamic information.
In one embodiment, the obtaining unit 101 is specifically configured to call the static function in the information acquisition script every first preset time to send a static information acquisition request to each server to be inspected, and to call the dynamic function in the information acquisition script every second preset time to send a dynamic information acquisition request to each server to be inspected; the first preset time is greater than the second preset time.
In one embodiment, as shown in FIG. 9, the detection module 11 includes:
An analysis unit 110, configured to execute the information analysis script and perform cluster analysis on each piece of working state information to obtain an analysis result.
A detection unit 111, configured to perform fault detection on each graphics processor according to the analysis result corresponding to each piece of working state information, so as to obtain a detection result corresponding to each server to be inspected.
In one embodiment, the analysis unit 110 is configured to determine, for each piece of working state information, a first distance between the static information and the static standard information and a second distance between the dynamic information and the dynamic standard information.
In one embodiment, the detection unit 111 includes:
A first determining subunit, configured to determine, for each piece of working state information, whether the first distance corresponding to the working state information is smaller than the static distance threshold and whether the second distance corresponding to the working state information is smaller than the dynamic distance threshold.
A second determining subunit, configured to determine that the detection result indicates that the graphics processor in the server to be inspected is in a healthy state if the first distance is smaller than the static distance threshold and the second distance is smaller than the dynamic distance threshold.
A third determining subunit, configured to determine that the detection result indicates that the graphics processor in the server to be inspected is in an unhealthy state if the first distance is not smaller than the static distance threshold or the second distance is not smaller than the dynamic distance threshold.
The modules in the inspection apparatus described above may be implemented in whole or in part by software, hardware, or a combination thereof. The modules may be embedded in, or independent of, a processor in the computer device in hardware form, or may be stored in a memory of the computer device in software form, so that the processor can call them and execute the operations corresponding to each module.
In one embodiment, a computer device is provided, which may be a terminal or a server and whose internal structure may be as shown in FIG. 10. The computer device includes a processor, a memory, an input/output interface, a communication interface, a display unit, and an input device. The processor, the memory, and the input/output interface are connected through a system bus, and the communication interface, the display unit, and the input device are connected to the system bus through the input/output interface. The processor of the computer device provides computing and control capabilities. The memory of the computer device includes a non-volatile storage medium and an internal memory. The non-volatile storage medium stores an operating system and a computer program. The internal memory provides an environment for running the operating system and the computer program stored in the non-volatile storage medium. The input/output interface of the computer device is used to exchange information between the processor and external devices. The communication interface of the computer device is used for wired or wireless communication with an external terminal; the wireless mode can be implemented through Wi-Fi, a mobile cellular network, NFC (near field communication), or other technologies. The computer program, when executed by the processor, implements an inspection method. The display unit of the computer device is used to form a visible picture and can be a display screen, a projection device, or a virtual reality imaging device. The display screen can be a liquid crystal display screen or an electronic ink display screen. The input device of the computer device can be a touch layer covering the display screen, a key, trackball, or touch pad arranged on the housing of the computer device, or an external keyboard, touch pad, mouse, or the like.
It will be appreciated by those skilled in the art that the structure shown in FIG. 10 is merely a block diagram of some of the structures associated with the present inventive arrangements and is not limiting of the computer device to which the present inventive arrangements may be applied, and that a particular computer device may include more or fewer components than shown, or may combine some of the components, or have a different arrangement of components.
In one embodiment, a computer device is provided, comprising a memory and a processor, the memory storing a computer program, and the processor implementing the following steps when executing the computer program:
acquiring working state information of the graphics processor in each server to be inspected in a network;
performing fault detection on each graphics processor according to the working state information to obtain a detection result corresponding to each server to be inspected.
In one embodiment, the processor further implements the following steps when executing the computer program:
executing an information acquisition script and sending an information acquisition request to each server to be inspected;
receiving the working state information returned by each server to be inspected.
In one embodiment, the processor further implements the following steps when executing the computer program:
calling a static function in the information acquisition script to send a static information acquisition request to each server to be inspected, and calling a dynamic function in the information acquisition script to send a dynamic information acquisition request to each server to be inspected.
In one embodiment, the processor further implements the following steps when executing the computer program:
calling the static function in the information acquisition script every first preset time to send a static information acquisition request to each server to be inspected, and calling the dynamic function in the information acquisition script every second preset time to send a dynamic information acquisition request to each server to be inspected.
In one embodiment, the processor further implements the following steps when executing the computer program:
executing an information analysis script and performing cluster analysis on each piece of working state information to obtain an analysis result;
performing fault detection on each graphics processor according to the analysis result corresponding to each piece of working state information to obtain a detection result corresponding to each server to be inspected.
In one embodiment, the processor further implements the following steps when executing the computer program:
for each piece of working state information, determining a first distance between the static information and the static standard information, and determining a second distance between the dynamic information and the dynamic standard information.
In one embodiment, the processor further implements the following steps when executing the computer program:
for each piece of working state information, determining whether the first distance corresponding to the working state information is smaller than the static distance threshold, and determining whether the second distance corresponding to the working state information is smaller than the dynamic distance threshold;
if the first distance is smaller than the static distance threshold and the second distance is smaller than the dynamic distance threshold, determining that the detection result indicates that the graphics processor in the server to be inspected is in a healthy state;
if the first distance is not smaller than the static distance threshold or the second distance is not smaller than the dynamic distance threshold, determining that the detection result indicates that the graphics processor in the server to be inspected is in an unhealthy state.
The computer device provided in the foregoing embodiments has similar implementation principles and technical effects to those of the foregoing method embodiments, and will not be described herein in detail.
In one embodiment, a computer readable storage medium is provided having a computer program stored thereon, which when executed by a processor, performs the steps of:
acquiring working state information of a graphic processor in each server to be inspected in a network;
And performing fault detection on each graphic processor according to the working state information to obtain detection results corresponding to each server to be patrolled and examined.
In one embodiment, the computer program when executed by the processor further performs the steps of:
Executing an information acquisition script, and sending an information acquisition request to each server to be patrolled and examined;
And receiving the working state information returned by each server to be patrolled and examined.
In one embodiment, the computer program when executed by the processor further performs the steps of:
calling a static function in the information acquisition script to send a static information acquisition request to each server to be inspected, and calling a dynamic function in the information acquisition script to send a dynamic information acquisition request to each server to be inspected.
In one embodiment, the computer program when executed by the processor further performs the steps of:
calling the static function in the information acquisition script at intervals of a first preset time to send a static information acquisition request to each server to be inspected, and calling the dynamic function in the information acquisition script at intervals of a second preset time to send a dynamic information acquisition request to each server to be inspected.
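To illustrate the two independent intervals, the sketch below lists when each function would fire over a time window. It is a simplified model rather than the applicant's scheduler: `build_schedule`, the use of whole seconds, and the choice to start both timers at time 0 are assumptions.

```python
from typing import List, Tuple


def build_schedule(first_preset: int, second_preset: int, horizon: int) -> List[Tuple[int, str]]:
    """List (time, function) pairs up to and including `horizon` seconds.

    The static function fires every `first_preset` seconds and the dynamic
    function every `second_preset` seconds, both starting from time 0.
    """
    if first_preset <= 0 or second_preset <= 0:
        raise ValueError("preset times must be positive")
    events = [(t, "static") for t in range(0, horizon + 1, first_preset)]
    events += [(t, "dynamic") for t in range(0, horizon + 1, second_preset)]
    # Sort by time; at equal times "dynamic" sorts before "static" alphabetically.
    return sorted(events)
```

With preset times of 2 and 3 seconds and a 6-second window, the static function is called at 0, 2, 4, and 6 and the dynamic function at 0, 3, and 6. A real script would wait for each due time and issue the corresponding request instead of returning a list.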
In one embodiment, the computer program when executed by the processor further performs the steps of:
executing an information analysis script, and performing cluster analysis on each piece of working state information to obtain an analysis result;
and performing fault detection on each graphics processor according to the analysis result corresponding to each piece of working state information to obtain the detection result corresponding to each server to be inspected.
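The application does not state which clustering technique the information analysis script uses. One possible reading, shown here purely as an assumption, is to treat the centroid of all collected state vectors as the reference point and to flag servers that lie far from it. The names `centroid` and `detect`, the numeric encoding of the working state information, and the single shared threshold are hypothetical.

```python
import math
from statistics import fmean
from typing import Dict, List


def centroid(vectors: List[List[float]]) -> List[float]:
    """Per-feature mean of equally sized vectors."""
    return [fmean(column) for column in zip(*vectors)]


def detect(states: Dict[str, List[float]], threshold: float) -> Dict[str, bool]:
    """Map each server to True when its vector lies within `threshold` of the centroid."""
    center = centroid(list(states.values()))
    # math.dist computes the Euclidean distance between two points.
    return {server: math.dist(vec, center) < threshold for server, vec in states.items()}
```

In this sketch a server whose state vector sits far from the rest of the fleet receives `False`, which would correspond to an unhealthy detection result for that server.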
In one embodiment, the computer program when executed by the processor further performs the steps of:
for each piece of working state information, determining a first distance between the static information and static standard information, and determining a second distance between the dynamic information and dynamic standard information.
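The distance computation above can be sketched as follows. The application does not name a distance metric, so the Euclidean distance used here is an assumption, as are the function name `state_distances` and the representation of static and dynamic information as numeric sequences.

```python
import math
from typing import Sequence, Tuple


def state_distances(
    static_info: Sequence[float],
    static_standard: Sequence[float],
    dynamic_info: Sequence[float],
    dynamic_standard: Sequence[float],
) -> Tuple[float, float]:
    """Return (first distance, second distance) for one piece of working state information."""
    if len(static_info) != len(static_standard):
        raise ValueError("static vectors must have the same length")
    if len(dynamic_info) != len(dynamic_standard):
        raise ValueError("dynamic vectors must have the same length")
    first = math.dist(static_info, static_standard)      # static vs. static standard
    second = math.dist(dynamic_info, dynamic_standard)   # dynamic vs. dynamic standard
    return first, second
```

Keeping the two distances separate, rather than merging them into one score, matches the later step in which each distance is compared with its own threshold.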
In one embodiment, the computer program when executed by the processor further performs the steps of:
for each piece of working state information, determining whether a first distance corresponding to the working state information is smaller than a static distance threshold value, and determining whether a second distance corresponding to the working state information is smaller than a dynamic distance threshold value;
if the first distance is smaller than the static distance threshold and the second distance is smaller than the dynamic distance threshold, determining that the detection result indicates that the graphics processor in the server to be inspected is in a healthy state;
and if the first distance is not smaller than the static distance threshold or the second distance is not smaller than the dynamic distance threshold, determining that the detection result indicates that the graphics processor in the server to be inspected is in an unhealthy state.
The computer-readable storage medium provided in the foregoing embodiment has implementation principles and technical effects similar to those of the foregoing method embodiments, which are not repeated here.
In one embodiment, a computer program product is provided comprising a computer program which, when executed by a processor, performs the steps of:
acquiring working state information of a graphics processor in each server to be inspected in a network;
and performing fault detection on each graphics processor according to the working state information to obtain a detection result corresponding to each server to be inspected.
In one embodiment, the computer program when executed by the processor further performs the steps of:
executing an information acquisition script, and sending an information acquisition request to each server to be inspected;
and receiving the working state information returned by each server to be inspected.
In one embodiment, the computer program when executed by the processor further performs the steps of:
calling a static function in the information acquisition script to send a static information acquisition request to each server to be inspected, and calling a dynamic function in the information acquisition script to send a dynamic information acquisition request to each server to be inspected.
In one embodiment, the computer program when executed by the processor further performs the steps of:
calling the static function in the information acquisition script at intervals of a first preset time to send a static information acquisition request to each server to be inspected, and calling the dynamic function in the information acquisition script at intervals of a second preset time to send a dynamic information acquisition request to each server to be inspected.
In one embodiment, the computer program when executed by the processor further performs the steps of:
executing an information analysis script, and performing cluster analysis on each piece of working state information to obtain an analysis result;
and performing fault detection on each graphics processor according to the analysis result corresponding to each piece of working state information to obtain the detection result corresponding to each server to be inspected.
In one embodiment, the computer program when executed by the processor further performs the steps of:
for each piece of working state information, determining a first distance between the static information and static standard information, and determining a second distance between the dynamic information and dynamic standard information.
In one embodiment, the computer program when executed by the processor further performs the steps of:
for each piece of working state information, determining whether a first distance corresponding to the working state information is smaller than a static distance threshold value, and determining whether a second distance corresponding to the working state information is smaller than a dynamic distance threshold value;
if the first distance is smaller than the static distance threshold and the second distance is smaller than the dynamic distance threshold, determining that the detection result indicates that the graphics processor in the server to be inspected is in a healthy state;
and if the first distance is not smaller than the static distance threshold or the second distance is not smaller than the dynamic distance threshold, determining that the detection result indicates that the graphics processor in the server to be inspected is in an unhealthy state.
The computer program product provided in the foregoing embodiment has implementation principles and technical effects similar to those of the foregoing method embodiments, which are not repeated here.
Those skilled in the art will appreciate that all or part of the methods described above may be implemented by a computer program stored on a non-transitory computer-readable storage medium; when executed, the computer program may perform the steps of the method embodiments described above. Any reference to memory, a database, or another medium used in the embodiments provided herein may include at least one of non-volatile and volatile memory. Non-volatile memory may include read-only memory (ROM), magnetic tape, floppy disk, flash memory, optical memory, high-density embedded non-volatile memory, resistive random access memory (ReRAM), magnetoresistive random access memory (MRAM), ferroelectric random access memory (FRAM), phase change memory (PCM), graphene memory, and the like. Volatile memory may include random access memory (RAM), external cache memory, and the like. By way of illustration and not limitation, RAM may take various forms, such as static random access memory (SRAM) or dynamic random access memory (DRAM). The databases referred to in the embodiments provided herein may include at least one of a relational database and a non-relational database. The non-relational database may include, but is not limited to, a blockchain-based distributed database and the like. The processor referred to in the embodiments provided in the present application may be, but is not limited to, a general-purpose processor, a central processing unit, a graphics processor, a digital signal processor, a programmable logic unit, a data processing logic unit based on quantum computing, or the like.
The technical features of the above embodiments may be combined arbitrarily. For brevity, not all possible combinations of these technical features are described; however, any combination that involves no contradiction should be considered within the scope of this description.
The foregoing embodiments represent only a few implementations of the application. They are described specifically and in detail, but this should not be construed as limiting the scope of the application. It should be noted that those skilled in the art can make several variations and modifications without departing from the spirit of the application, all of which fall within the scope of the application. Accordingly, the scope of the application shall be determined by the appended claims.

Claims (10)

1. An inspection method, the method being applied to an inspection apparatus, the method comprising:
acquiring working state information of a graphics processor in each server to be inspected in a network;
and performing fault detection on each graphics processor according to each piece of working state information to obtain a detection result corresponding to each server to be inspected.
2. The method according to claim 1, wherein the acquiring working state information of a graphics processor in each server to be inspected in a network comprises:
executing an information acquisition script, and sending an information acquisition request to each server to be inspected;
and receiving the working state information returned by each server to be inspected.
3. The method according to claim 2, wherein the working state information includes static information and dynamic information, and the executing an information acquisition script and sending an information acquisition request to each server to be inspected comprises:
invoking a static function in the information acquisition script to send a static information acquisition request to each server to be inspected, and invoking a dynamic function in the information acquisition script to send a dynamic information acquisition request to each server to be inspected;
wherein the static information acquisition request is used for acquiring the static information, and the dynamic information acquisition request is used for acquiring the dynamic information.
4. The method of claim 3, wherein the invoking a static function in the information acquisition script to send a static information acquisition request to each server to be inspected, and invoking a dynamic function in the information acquisition script to send a dynamic information acquisition request to each server to be inspected, comprises:
invoking the static function in the information acquisition script at intervals of a first preset time to send a static information acquisition request to each server to be inspected, and invoking the dynamic function in the information acquisition script at intervals of a second preset time to send a dynamic information acquisition request to each server to be inspected;
wherein the first preset time is less than the second preset time.
5. The method of claim 1, wherein the performing fault detection on each graphics processor according to each piece of working state information to obtain a detection result corresponding to each server to be inspected comprises:
executing an information analysis script, and performing cluster analysis on each piece of working state information to obtain an analysis result;
and performing fault detection on each graphics processor according to the analysis result corresponding to each piece of working state information to obtain the detection result corresponding to each server to be inspected.
6. The method according to claim 5, wherein the working state information includes static information and dynamic information, and the performing cluster analysis on each piece of working state information to obtain an analysis result comprises:
for each piece of working state information, determining a first distance between the static information and static standard information, and determining a second distance between the dynamic information and dynamic standard information.
7. The method of claim 6, wherein the performing fault detection on each graphics processor according to the analysis result corresponding to each piece of working state information to obtain the detection result corresponding to each server to be inspected comprises:
for each piece of working state information, determining whether the first distance corresponding to the working state information is smaller than a static distance threshold, and determining whether the second distance corresponding to the working state information is smaller than a dynamic distance threshold;
if the first distance is smaller than the static distance threshold and the second distance is smaller than the dynamic distance threshold, determining that the detection result indicates that the graphics processor in the server to be inspected is in a healthy state;
and if the first distance is not smaller than the static distance threshold or the second distance is not smaller than the dynamic distance threshold, determining that the detection result indicates that the graphics processor in the server to be inspected is in an unhealthy state.
8. An inspection apparatus, the apparatus comprising:
an acquisition module, configured to acquire working state information of a graphics processor in each server to be inspected in a network;
and a detection module, configured to perform fault detection on each graphics processor according to each piece of working state information to obtain a detection result corresponding to each server to be inspected.
9. A computer device comprising a memory and a processor, the memory storing a computer program, characterized in that the processor implements the steps of the method of any of claims 1 to 7 when the computer program is executed.
10. A computer readable storage medium, on which a computer program is stored, characterized in that the computer program, when being executed by a processor, implements the steps of the method of any of claims 1 to 7.
CN202410224218.9A 2024-02-28 2024-02-28 Inspection method, apparatus, device, storage medium, and computer program product Pending CN118051393A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202410224218.9A CN118051393A (en) 2024-02-28 2024-02-28 Inspection method, apparatus, device, storage medium, and computer program product


Publications (1)

Publication Number Publication Date
CN118051393A true CN118051393A (en) 2024-05-17

Family

ID=91053297

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202410224218.9A Pending CN118051393A (en) 2024-02-28 2024-02-28 Inspection method, apparatus, device, storage medium, and computer program product

Country Status (1)

Country Link
CN (1) CN118051393A (en)

Similar Documents

Publication Publication Date Title
US11029972B2 (en) Method and system for profile learning window optimization
CN108683562B (en) Anomaly detection positioning method and device, computer equipment and storage medium
CN107506300B (en) User interface testing method, device, server and storage medium
US9064048B2 (en) Memory leak detection
US9569325B2 (en) Method and system for automated test and result comparison
US11582130B2 (en) Performance monitoring in a distributed storage system
US10552761B2 (en) Non-intrusive fine-grained power monitoring of datacenters
CN110704277B (en) Method for monitoring application performance, related equipment and storage medium
TW201732789A (en) Disk failure prediction method and apparatus
JP2009223886A (en) Method, program and device (consolidated display of resource performance trends) for generating consolidated representation of performance trends for a plurality of resources in data processing system
US10708142B2 (en) Methods, systems, and computer readable media for providing cloud visibility
CN110515758B (en) Fault positioning method and device, computer equipment and storage medium
WO2023092946A1 (en) Memory patrol inspection method and apparatus, and medium
US20100036981A1 (en) Finding Hot Call Paths
US9276826B1 (en) Combining multiple signals to determine global system state
CN117149550A (en) Solid state disk performance detection method and device and electronic equipment
US11809271B1 (en) System and method for identifying anomalies in data logs using context-based analysis
CN117130886A (en) Fault monitoring method, device, computer equipment and storage medium
CN118051393A (en) Inspection method, apparatus, device, storage medium, and computer program product
US9860155B1 (en) Code coverage and data analysis
CN113031969B (en) Equipment deployment inspection method and device, computer equipment and storage medium
CN111324516A (en) Method and device for automatically recording abnormal event, storage medium and electronic equipment
CN116955129A (en) Automatic generation of code function and test case mappings
US11093170B2 (en) Dataset splitting based on workload footprint analysis
US11886327B2 (en) Training a system to recognize scroll bars in an application under test

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination