CN114138579A - Prometheus-based GPU (graphics processing unit) interactive test method, device, equipment and readable medium - Google Patents

Prometheus-based GPU (graphics processing unit) interactive test method, device, equipment and readable medium

Info

Publication number
CN114138579A
Authority
CN
China
Prior art keywords
gpu
test
prometheus
data
monitoring
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Withdrawn
Application number
CN202111436978.9A
Other languages
Chinese (zh)
Inventor
刘益嘉
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Suzhou Inspur Intelligent Technology Co Ltd
Original Assignee
Suzhou Inspur Intelligent Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Application filed by Suzhou Inspur Intelligent Technology Co Ltd
Priority to CN202111436978.9A
Publication of CN114138579A
Legal status: Withdrawn

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00 Error detection; Error correction; Monitoring
    • G06F11/22 Detection or location of defective computer hardware by testing during standby operation or during idle time, e.g. start-up testing
    • G06F11/2205 Detection or location of defective computer hardware by testing during standby operation or during idle time, e.g. start-up testing using arrangements specific to the hardware being tested
    • G06F11/2236 Detection or location of defective computer hardware by testing during standby operation or during idle time, e.g. start-up testing using arrangements specific to the hardware being tested to test CPU or processors
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00 Error detection; Error correction; Monitoring
    • G06F11/22 Detection or location of defective computer hardware by testing during standby operation or during idle time, e.g. start-up testing
    • G06F11/2273 Test methods
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00 Error detection; Error correction; Monitoring
    • G06F11/22 Detection or location of defective computer hardware by testing during standby operation or during idle time, e.g. start-up testing
    • G06F11/2289 Detection or location of defective computer hardware by testing during standby operation or during idle time, e.g. start-up testing by configuration test

Landscapes

  • Engineering & Computer Science (AREA)
  • General Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Computer Hardware Design (AREA)
  • Quality & Reliability (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Techniques For Improving Reliability Of Storages (AREA)
  • Debugging And Monitoring (AREA)
  • Test And Diagnosis Of Digital Computers (AREA)

Abstract

The invention discloses a Prometheus-based GPU interaction testing method, device, computer equipment and medium, wherein the method comprises the following steps: configuring a GPU stress test environment, and installing a GPU driver and CUDA; detecting whether the GPUs recognized by the system are consistent with the actual configuration; detecting whether the FW version of the GPU is consistent with the FW version required by the test; simulating the actual load environment of the GPU server by stressing the CPU, the memory, the hard disk and the network card; stressing the GPU with the gpu-burn-master tool; collecting real-time data, including the power consumption, temperature, performance state, GPU utilization and video memory utilization of the GPU, through a Prometheus monitoring system, and monitoring the stress data of the other components; and outputting visualized test data through Grafana to check the test result.

Description

Prometheus-based GPU (graphics processing unit) interactive test method, device, equipment and readable medium
Technical Field
The invention relates to the technical field of computers, in particular to a Prometheus-based GPU interaction test method, device, equipment and readable medium.
Background
With the development of artificial intelligence technology, GPU servers are used in a growing number of application scenarios. For a GPU server, the stability of the GPU is crucial: it determines whether the whole machine can keep working continuously and stably under high power. In testing, GPU stability is usually measured with a GPU stress test.
There are many GPU stress test tools, for example gpu-burn-master, the Thermal Test in the NVQual tool, and nbody.
However, these GPU stress tests generally load only the GPU, ignoring the actual working environment of the GPU server and the influence of other components on GPU stability.
In addition, after stressing the GPU with such a tool, a tester typically checks only whether the logs produced by the tool and the system logs contain anomalies such as error reports, and cannot properly analyze the instantaneous values and fluctuations of the GPU's other indexes.
Disclosure of Invention
In view of this, an object of the embodiments of the present invention is to provide a Prometheus-based GPU interactive test method. The method improves on existing GPU stress testing by stressing the whole GPU server: the CPU, the memory, the hard disk and the network card are stressed at the same time as the GPU. This yields an interactive GPU stress test and solves the problem that ordinary GPU stress tests load only the GPU. During the interactive test, a Prometheus-based monitoring system tracks the fluctuation of each GPU index and collects the data required by the test in real time, and Grafana presents the data visually. This helps testers analyze and process logs and pinpoint problems, and addresses the incomplete test items and inaccurate results of existing tests.
The embodiment of the invention also aims to provide a Prometheus-based GPU interaction testing device.
The embodiment of the invention also aims to provide the computer equipment.
An object of an embodiment of the present invention is also to provide a computer-readable storage medium.
Based on the above purpose, an aspect of the embodiments of the present invention provides a Prometheus-based GPU interactive test method. The method comprises: configuring a GPU stress test environment, and installing a GPU driver and CUDA; detecting whether the GPUs recognized by the system are consistent with the actual configuration, if so, performing the next step, and if not, checking the link connections and repeating this step; detecting whether the FW version of the GPU is consistent with the FW version required by the test, if so, performing the next step, and if not, flashing the required FW version to the GPU and repeating this step; simulating the actual load environment of the GPU server by stressing the CPU, the memory, the hard disk and the network card; stressing the GPU with the gpu-burn-master tool; collecting real-time data, including the power consumption, temperature, performance state, GPU utilization and video memory utilization of the GPU, through a Prometheus monitoring system, and monitoring the stress data of the other components; and outputting visualized test data through Grafana, wherein the test passes if the GPU test data is normal and the system generates no error log, and the problem is analyzed and located according to the test data if the GPU test data is abnormal.
In some embodiments, configuring the GPU stress test environment and installing the GPU driver and CUDA includes: uninstalling the nouveau GPU driver that ships with the system, and installing a driver that matches the installed GPU; and installing CUDA and configuring the CUDA environment variables.
In some embodiments, detecting whether the GPUs recognized by the system are consistent with the actual configuration, if so, performing the next step, and if not, checking the link connections and repeating this step includes: storing the actual configuration information; checking which GPUs are recognized using the nvidia-smi command provided by the newly installed GPU driver; and comparing the two, if they are consistent, performing the next step, and if not, checking the actual link connections with the lspci command and repeating this step.
In some embodiments, detecting whether the FW version of the GPU is consistent with the FW version required by the test, if so, performing the next step, and if not, flashing the required FW version to the GPU and repeating this step includes: storing the FW version file required by the test; detecting the FW version of the GPU with the nvflash tool; and comparing the two, if they are consistent, performing the next step, and if not, flashing the FW with the nvflash tool and the corresponding FW version file and repeating this step.
In some embodiments, simulating the actual load environment of the GPU server by stressing the CPU, the memory, the hard disk and the network card includes: stressing the CPU with the stress tool; stressing the memory with the memtester tool; stressing the hard disk with the fio tool; and stressing the network card with the iperf tool.
In some embodiments, collecting real-time data through the Prometheus monitoring system and monitoring the stress data of the other components includes: installing the DCGM tool to manage and monitor the GPU; deploying the monitoring metrics using gpu-monitoring-tools; and installing Prometheus to monitor the test index data during the test.
In some embodiments, outputting visualized test data through Grafana, wherein the test passes if the GPU test data is normal and the system generates no error log, and the problem is analyzed and located according to the test data if the GPU test data is abnormal, includes: installing the Grafana tool to visualize the data of the Prometheus monitoring system; confirming that the test passes when, during the stress test, every GPU test data index is normal, the whole machine shows no hang, blue screen, crash or black screen, the system logs and BMC logs contain no errors such as fail or error, the hard disk SMART status is normal, and the network card bandwidth performance is normal; and, when a GPU test data index is abnormal, extracting the stress test data of the other components at the same moment and for a period before and after it for specific analysis.
On the other hand, the embodiment of the invention also provides a Prometheus-based GPU interaction testing device. The device comprises: a test environment configuration unit configured for configuration and detection of the GPU stress test environment; a pressure environment simulation unit configured for simulating the pressure environment of the GPU; a GPU stress test unit configured for GPU stress testing; a Prometheus monitoring unit configured for monitoring test index data during the test; and a test result output unit configured to output the test result and analyze it.
In some embodiments, the test environment configuration unit is configured to configure a GPU stress test environment, install a GPU driver and a CUDA, detect whether GPU identification information is consistent with an actual configuration, detect a connection condition of a link if not, detect whether an FW version of the GPU is consistent with an FW version required by the test, and perform FW version refresh of the GPU if not.
In some embodiments, the pressure environment simulation unit is configured to simulate an actual pressure environment of the GPU server, and pressurize the CPU, the memory, the hard disk, and the network card.
In some embodiments, the GPU stress test unit is configured to pressurize the GPU by a GPU-burn-master.
In some embodiments, the Prometheus monitoring unit is configured to perform real-time data acquisition by the Prometheus monitoring system, including indexes such as power consumption, temperature, performance status, GPU usage rate, and video memory usage rate of the GPU, and monitor the pressurization data of other components.
In some embodiments, the test result output unit is configured to output the visualized test data through Grafana, wherein the test passes if the GPU test data is normal and the system generates no error log, and the problem is analyzed and located according to the test data if the GPU test data is abnormal.
In another aspect of the embodiments of the present invention, there is also provided a computer device, including: at least one processor; and a memory storing computer instructions executable on the processor, the instructions when executed by the processor implementing steps of the method comprising: configuring a GPU stress test environment, and installing a GPU driver and CUDA; detecting whether the GPUs recognized by the system are consistent with the actual configuration, if so, performing the next step, and if not, checking the link connections and repeating this step; detecting whether the FW version of the GPU is consistent with the FW version required by the test, if so, performing the next step, and if not, flashing the required FW version to the GPU and repeating this step; simulating the actual load environment of the GPU server by stressing the CPU, the memory, the hard disk and the network card; stressing the GPU with the gpu-burn-master tool; collecting real-time data, including the power consumption, temperature, performance state, GPU utilization and video memory utilization of the GPU, through a Prometheus monitoring system, and monitoring the stress data of the other components; and outputting visualized test data through Grafana, wherein the test passes if the GPU test data is normal and the system generates no error log, and the problem is analyzed and located according to the test data if the GPU test data is abnormal.
In some embodiments, configuring the GPU stress test environment and installing the GPU driver and CUDA includes: uninstalling the nouveau GPU driver that ships with the system, and installing a driver that matches the installed GPU; and installing CUDA and configuring the CUDA environment variables.
In some embodiments, detecting whether the GPUs recognized by the system are consistent with the actual configuration, if so, performing the next step, and if not, checking the link connections and repeating this step includes: storing the actual configuration information; checking which GPUs are recognized using the nvidia-smi command provided by the newly installed GPU driver; and comparing the two, if they are consistent, performing the next step, and if not, checking the actual link connections with the lspci command and repeating this step.
In some embodiments, detecting whether the FW version of the GPU is consistent with the FW version required by the test, if so, performing the next step, and if not, flashing the required FW version to the GPU and repeating this step includes: storing the FW version file required by the test; detecting the FW version of the GPU with the nvflash tool; and comparing the two, if they are consistent, performing the next step, and if not, flashing the FW with the nvflash tool and the corresponding FW version file and repeating this step.
In some embodiments, simulating the actual load environment of the GPU server by stressing the CPU, the memory, the hard disk and the network card includes: stressing the CPU with the stress tool; stressing the memory with the memtester tool; stressing the hard disk with the fio tool; and stressing the network card with the iperf tool.
In some embodiments, collecting real-time data through the Prometheus monitoring system and monitoring the stress data of the other components includes: installing the DCGM tool to manage and monitor the GPU; deploying the monitoring metrics using gpu-monitoring-tools; and installing Prometheus to monitor the test index data during the test.
In some embodiments, outputting visualized test data through Grafana, wherein the test passes if the GPU test data is normal and the system generates no error log, and the problem is analyzed and located according to the test data if the GPU test data is abnormal, includes: installing the Grafana tool to visualize the data of the Prometheus monitoring system; confirming that the test passes when, during the stress test, every GPU test data index is normal, the whole machine shows no hang, blue screen, crash or black screen, the system logs and BMC logs contain no errors such as fail or error, the hard disk SMART status is normal, and the network card bandwidth performance is normal; and, when a GPU test data index is abnormal, extracting the stress test data of the other components at the same moment and for a period before and after it for specific analysis.
In a further aspect of the embodiments of the present invention, a computer-readable storage medium is also provided, which stores a computer program that implements the above method steps when executed by a processor.
The invention has at least the following beneficial technical effects:
the Prometheus-based GPU interactive testing method, carried out with the Prometheus-based GPU interactive testing device, improves on existing GPU stress testing by stressing the whole GPU server: the CPU, the memory, the hard disk and the network card are stressed at the same time as the GPU. This yields an interactive GPU stress test and solves the problem that ordinary GPU stress tests load only the GPU. During the interactive test, a Prometheus-based monitoring system tracks the fluctuation of each GPU index and collects the data required by the test in real time, and Grafana presents the data visually. This helps testers analyze and process logs and pinpoint problems, and addresses the incomplete test items and inaccurate results of existing tests.
Drawings
In order to illustrate the embodiments of the present invention or the technical solutions in the prior art more clearly, the drawings used in the description of the embodiments or the prior art are briefly introduced below. The drawings in the following description show only some embodiments of the present invention, and those skilled in the art can obtain other embodiments from these drawings without creative effort.
FIG. 1 is a schematic diagram of an embodiment of a Prometheus-based GPU interactive test method provided by the present invention;
FIG. 2 is a schematic diagram of an embodiment of a Prometheus-based GPU interaction testing apparatus according to the present invention;
FIG. 3 is a schematic diagram of an embodiment of a computer device provided by the present invention;
FIG. 4 is a schematic diagram of an embodiment of a computer-readable storage medium provided by the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention more apparent, the following embodiments of the present invention are described in further detail with reference to the accompanying drawings.
It should be noted that all expressions using "first" and "second" in the embodiments of the present invention are used to distinguish two non-identical entities or parameters that share the same name. "First" and "second" are used merely for convenience of description and should not be construed as limiting the embodiments of the present invention, and this is not explained again in the following embodiments.
In view of the foregoing, in a first aspect of the embodiments of the present invention, an embodiment of a Prometheus-based GPU interactive test method is provided. FIG. 1 is a schematic diagram illustrating an embodiment of the Prometheus-based GPU interaction testing method according to the present invention. As shown in FIG. 1, the Prometheus-based GPU interaction testing method according to the embodiment of the present invention includes the following steps:
001. configuring a GPU stress test environment, and installing a GPU driver and CUDA;
002. detecting whether the GPUs recognized by the system are consistent with the actual configuration, if so, performing the next step, and if not, checking the link connections and repeating this step;
003. detecting whether the FW version of the GPU is consistent with the FW version required by the test, if so, performing the next step, and if not, flashing the required FW version to the GPU and repeating this step;
004. simulating the actual load environment of the GPU server by stressing the CPU, the memory, the hard disk and the network card;
005. stressing the GPU with the gpu-burn-master tool;
006. collecting real-time data, including the power consumption, temperature, performance state, GPU utilization and video memory utilization of the GPU, through a Prometheus monitoring system, and monitoring the stress data of the other components; and
007. outputting visualized test data through Grafana, wherein the test passes if the GPU test data is normal and the system generates no error log, and the problem is analyzed and located according to the test data if the GPU test data is abnormal.
In this embodiment, the Prometheus-based interactive test method for GPU servers provided by the invention brings the GPU stress test environment closer to the real working environment of the GPU server. The real-time test data collected by the Prometheus system covers rich and accurate indexes and, combined with the Grafana visual interface, lets the tester observe the test data conveniently, makes problems that occur during testing easier to pinpoint, and improves the accuracy of GPU stability testing.
Monitoring the test data with the Prometheus monitoring system during the GPU interactive test makes the data more accurate and reliable, and the Grafana visual interface makes the results easy to observe and gives a starting point for analysis and localization when the test runs into problems.
In some embodiments of the present invention, configuring the GPU stress test environment and installing the GPU driver and CUDA includes: uninstalling the nouveau GPU driver that ships with the system, and installing a driver that matches the installed GPU; and installing CUDA and configuring the CUDA environment variables.
The nouveau GPU driver that ships with the system is uninstalled with the following commands:
vim /boot/efi/EFI/redhat/grub.cfg
After LANG=en_US.UTF-8, enter modprobe.blacklist=nouveau vga=791, then save and exit
echo "blacklist nouveau" >> /etc/modprobe.d/blacklist.conf
yum -y remove xorg-x11-drv-nouveau
Reboot, then use lsmod | grep nouveau to check whether the driver was unloaded successfully;
Install the GPU driver: download the driver that matches the actual GPU model;
Install the GPU driver and CUDA with ./*.run, taking care not to install the driver bundled with CUDA;
Configure the CUDA environment variables with the following commands:
Add the following to ~/.bashrc
export LD_LIBRARY_PATH=/usr/local/cuda-11.1/lib64:$LD_LIBRARY_PATH
export PATH=/usr/local/cuda-11.1/bin:$PATH
Save and exit, then run source ~/.bashrc
Use nvcc -V to check whether CUDA was installed successfully.
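The two checks at the end of this procedure (nouveau no longer loaded, CUDA toolchain reachable) can be scripted. The following is a minimal sketch; the function names nouveau_unloaded and tool_available are illustrative and do not come from the patent.

```shell
# Illustrative helpers; the function names are not part of the patent.

# Succeeds when the module list on stdin (lsmod format) has no nouveau entry.
nouveau_unloaded() {
    ! grep -q '^nouveau[[:space:]]'
}

# Succeeds when the named tool can be found on PATH.
tool_available() {
    command -v "$1" >/dev/null 2>&1
}

# Example use on the server under test:
#   lsmod | nouveau_unloaded && echo "nouveau is not loaded"
#   tool_available nvcc && nvcc -V
```

Reading the module list from stdin keeps the check independent of the machine it runs on, so it can also be applied to a saved lsmod dump.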
In some embodiments of the present invention, detecting whether the GPUs recognized by the system are consistent with the actual configuration, if so, performing the next step, and if not, checking the link connections and repeating this step includes: storing the actual configuration information; checking which GPUs are recognized using the nvidia-smi command provided by the newly installed GPU driver; and comparing the two, if they are consistent, performing the next step, and if not, checking the actual link connections with the lspci command and repeating this step.
In this embodiment, the actual configuration information is stored, the nvidia-smi command available after the GPU driver is installed is used to check which GPUs are recognized, and the two are compared; if they are consistent, the next step is performed, and if not, the actual link connections are checked with the lspci | grep -i nvidia command.
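As an illustration of this comparison, the sketch below counts the devices in `nvidia-smi -L` output and compares the count with an expected value. It assumes that the stored "actual configuration" is simply a GPU count, which the patent does not specify; the function names are hypothetical.

```shell
# Counts GPUs in `nvidia-smi -L` style output on stdin
# (one "GPU <n>: ..." line per recognized device).
count_gpus() {
    grep -c '^GPU [0-9][0-9]*:' || true
}

# Compares the expected count ($1) with the recognized count ($2).
check_gpu_count() {
    if [ "$1" -eq "$2" ]; then
        echo "OK: $2 GPU(s) recognized"
    else
        echo "MISMATCH: expected $1, recognized $2; check links with lspci | grep -i nvidia"
        return 1
    fi
}

# Example use on the server under test (expected count is an assumption):
#   check_gpu_count 8 "$(nvidia-smi -L | count_gpus)"
```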
In some embodiments of the present invention, detecting whether the FW version of the GPU is consistent with the FW version required by the test, if so, performing the next step, and if not, flashing the required FW version to the GPU and repeating this step includes: storing the FW version file required by the test; detecting the FW version of the GPU with the nvflash tool; and comparing the two, if they are consistent, performing the next step, and if not, flashing the FW with the nvflash tool and the corresponding FW version file and repeating this step.
In this embodiment, the FW version file required by the test is stored, and the nvflash tool is used to detect the GPU FW version; the two are compared, and if they are consistent, the next step is performed, and if not, the FW is flashed using the nvflash tool and the corresponding FW version file.
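The version comparison itself can be sketched as follows. The patent reads the version with nvflash; because its exact invocation is not given, this sketch instead assumes the versions arrive one per line on stdin, for example from `nvidia-smi --query-gpu=vbios_version --format=csv,noheader`, which is a substitution and not the patent's method. The function name is hypothetical.

```shell
# Reads one FW (VBIOS) version per line from stdin, reports every GPU whose
# version differs from the required version ($1), and fails if any differs.
check_fw_versions() (
    required=$1
    status=0
    idx=0
    while IFS= read -r version; do
        if [ "$version" != "$required" ]; then
            echo "GPU $idx: found $version, required $required"
            status=1
        fi
        idx=$((idx + 1))
    done
    exit "$status"
)

# Example use (the required version string is a placeholder):
#   nvidia-smi --query-gpu=vbios_version --format=csv,noheader \
#       | check_fw_versions "<required FW version>"
```

A non-zero exit status signals that at least one GPU needs to be flashed before the test continues.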
In some embodiments of the present invention, simulating the actual load environment of the GPU server by stressing the CPU, the memory, the hard disk and the network card includes: stressing the CPU with the stress tool; stressing the memory with the memtester tool; stressing the hard disk with the fio tool; and stressing the network card with the iperf tool.
In this embodiment, the stress tool is installed and used to stress the CPU, with the following command:
nohup stress -c <number of processes> -t 172800 &
The memtester tool is installed and used to stress the memory, with the following command:
memtester <amount of memory to test> <number of test iterations>
The fio tool is installed and used to stress the hard disk; the parameters required by the fio test are written into fio_parameter.txt, and the command is as follows:
nohup fio fio_parameter.txt &
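The patent does not list the contents of fio_parameter.txt. The sketch below writes one plausible job file; every parameter value in it (I/O engine, block size, queue depth, job count, target device) is an assumption chosen for illustration, with only the 172800-second runtime taken from the other stress commands in this embodiment.

```shell
# Writes an illustrative fio job file to the path given as $1.
# All values are assumptions; /dev/sdX is a placeholder, and pointing
# fio at a raw device destroys the data on it.
write_fio_job() {
    cat > "$1" <<'EOF'
[global]
ioengine=libaio
direct=1
time_based
runtime=172800
group_reporting

[disk-stress]
filename=/dev/sdX
rw=randrw
bs=4k
iodepth=32
numjobs=4
EOF
}

# Example use:
#   write_fio_job fio_parameter.txt
```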
The test machine and the auxiliary machine are connected with a network cable, and the iperf tool is installed on both ends and used to stress the network card, with the following commands:
Auxiliary end: iperf -s
Test end: iperf -c <auxiliary end IP> -w 512k -i 1 -t 172800 -P <number of processes>
The GPU is stressed with gpu-burn-master, with the following commands:
unzip gpu-burn-master.zip
cd gpu-burn-master
make
./gpu-burn -d $((60*60*48)) | tee -a gpu-burn-result.log
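The stress, fio and iperf commands above share a 48-hour duration (60*60*48 = 172800 seconds), so they can be started from one place. The sketch below is one hypothetical way to do that; the launch and start_all helpers and the DRY_RUN switch are illustrative additions, and memtester and gpu-burn are left out because the former takes no duration argument and the latter runs in the foreground.

```shell
DURATION=$((60 * 60 * 48))   # 48 hours in seconds (172800)

# Prints the command when DRY_RUN=1, otherwise runs it in the background.
launch() {
    if [ "${DRY_RUN:-0}" = "1" ]; then
        echo "$*"
    else
        nohup "$@" >/dev/null 2>&1 &
    fi
}

# $1 = stress process count, $2 = auxiliary end IP, $3 = iperf process count
start_all() {
    launch stress -c "$1" -t "$DURATION"
    launch fio fio_parameter.txt
    launch iperf -c "$2" -w 512k -i 1 -t "$DURATION" -P "$3"
}

# Example: preview the commands without starting anything.
#   DRY_RUN=1; start_all 4 <auxiliary end IP> 2
```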
In some embodiments of the present invention, collecting real-time data through the Prometheus monitoring system and monitoring the stress data of the other components includes: installing the DCGM tool to manage and monitor the GPU; deploying the monitoring metrics using gpu-monitoring-tools; and installing Prometheus to monitor the test index data during the test.
In this embodiment, the DCGM tool is installed with the following command:
dpkg -i datacenter-gpu-manager_1.7.2_amd64.deb
The monitoring metrics are deployed using gpu-monitoring-tools, with the following commands:
git clone https://gitee.com/JackTpy/gpu-monitoring-tools.git
go env -w GOPROXY=https://goproxy.cn
cd gpu-monitoring-tools/
make binary
make install
dcgm-exporter
vim /etc/systemd/system/dcgm-exporter.service
Enter the following:
[Unit]
Description=dcgm-exporter service
[Service]
User=root
ExecStart=/usr/bin/dcgm-exporter
TimeoutStopSec=10
Restart=on-failure
RestartSec=5
[Install]
WantedBy=multi-user.target
Save and exit
systemctl daemon-reload
systemctl enable dcgm-exporter
systemctl start dcgm-exporter
systemctl status dcgm-exporter
The CPU is monitored using node_cpu_seconds_total;
the memory is monitored using the metrics with the node_memory prefix;
the hard disk is monitored using node_disk_reads_completed_total and node_disk_writes_completed_total;
the network card is monitored using node_network_receive_bytes_total;
Prometheus is installed with the following commands:
tar -C /usr/local/ -xvf prometheus-2.20.1.linux-amd64.tar.gz
ln -sv /usr/local/prometheus-2.20.1.linux-amd64/ /usr/local/Prometheus
/usr/local/Prometheus/prometheus --config.file=/usr/local/Prometheus/prometheus.yml &
The Prometheus monitoring page is available at <server IP>:9090.
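The contents of prometheus.yml are not shown in the patent. The sketch below writes a minimal scrape configuration under two assumptions: that the node_* metrics listed above come from a node_exporter instance (the patent does not name the exporter), and that both exporters listen on their commonly used default ports, 9400 for dcgm-exporter and 9100 for node_exporter. The 15-second scrape interval and the job names are likewise illustrative.

```shell
# Writes an illustrative Prometheus scrape configuration to the path in $1.
# Ports, job names and the scrape interval are assumptions, not patent values.
write_prometheus_config() {
    cat > "$1" <<'EOF'
global:
  scrape_interval: 15s

scrape_configs:
  - job_name: dcgm-exporter
    static_configs:
      - targets: ['localhost:9400']
  - job_name: node-exporter
    static_configs:
      - targets: ['localhost:9100']
EOF
}

# Example use:
#   write_prometheus_config /usr/local/Prometheus/prometheus.yml
```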
In some embodiments of the present invention, outputting visualized test data through Grafana, wherein the test passes if the GPU test data is normal and the system generates no error log, and the problem is analyzed and located according to the test data if the GPU test data is abnormal, includes: installing the Grafana tool to visualize the data of the Prometheus monitoring system; confirming that the test passes when, during the stress test, every GPU test data index is normal, the whole machine shows no hang, blue screen, crash or black screen, the system logs and BMC logs contain no errors such as fail or error, the hard disk SMART status is normal, and the network card bandwidth performance is normal; and, when a GPU test data index is abnormal, extracting the stress test data of the other components at the same moment and for a period before and after it for specific analysis.
In this embodiment, Grafana is installed to output the visualized test data, with the following commands:
rpm -ivh grafana-5.4.2-1.x86_64.rpm --force --nodeps
systemctl daemon-reload
systemctl start grafana-server.service
systemctl enable grafana-server.service
The Grafana page is available at <server IP>:3000, and the default user name and password are both admin.
Passing the test if the GPU test data is normal and no error log is generated in the system includes the following:
during the stress test, the whole machine shows no hang, blue screen, crash or black screen;
every GPU test data index on the Grafana page is within the normal range, and the monitoring of the other components is normal;
the system logs and BMC logs are collected with the following commands:
ipmitool sel elist > /root/GPU_stress_log/sel.log
cat /var/log/messages > /root/GPU_stress_log/messages
cat /var/log/dmesg > /root/GPU_stress_log/dmesg
cat /var/log/mcelog > /root/GPU_stress_log/mcelog
If no error information such as fail or error appears in the logs, the test is passed.
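The log check described above can be sketched as a small shell function. This is an illustration under stated assumptions, not the patented procedure: the function name scan_logs is hypothetical, and a plain keyword match for fail or error is a coarse filter that will also flag harmless lines such as "error count: 0", so any hits still need manual review.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not from the patent): scan collected logs for "fail" or "error".

# $1: directory holding the collected logs, e.g. /root/GPU_stress_log
# Prints one "file: count" line per log and returns 0 only if no line matched.
scan_logs() {
    local dir="$1" total=0 n f
    for f in "$dir"/*; do
        [ -f "$f" ] || continue
        # grep -c prints the number of matching lines; -i ignores case.
        n=$(grep -c -i -E 'fail|error' "$f" || true)
        echo "$f: ${n:-0}"
        total=$((total + ${n:-0}))
    done
    [ "$total" -eq 0 ]
}
```

Running scan_logs /root/GPU_stress_log after the four collection commands gives a per-file count and a zero exit status only when nothing matched.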
If the GPU test data is abnormal, analyzing and locating the problem according to the test data includes the following:
the abnormal indexes of the GPU test data are observed, and the stress test data of the other components at the same moment and for a period before and after it are extracted for specific analysis. This allows a comparison over time that shows which component affected the stability of the GPU at that moment or during that period, which helps locate and analyze the problem and suggests ways to solve practical problems.
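Extracting the data of other components for a period before and after an anomaly maps naturally onto a Prometheus range query. The sketch below is an assumption-laden illustration: window_bounds and prom_range are hypothetical names, the anomaly time is taken as a Unix timestamp in seconds, and curl and port 9090 are assumed as in the installation steps above.

```shell
#!/usr/bin/env bash
# Hypothetical helpers (not from the patent) for pulling data around an anomaly.

# Print "start end" Unix timestamps for a window centred on the anomaly.
# $1: anomaly time in Unix seconds, $2: margin in seconds on each side
window_bounds() {
    echo "$(( $1 - $2 )) $(( $1 + $2 ))"
}

# Fetch one metric over that window through the Prometheus range-query API.
# $1: server address, $2: PromQL expression, $3: anomaly time,
# $4: margin in seconds, $5: resolution step such as 15s
prom_range() {
    local bounds start end
    bounds=$(window_bounds "$3" "$4")
    start=${bounds% *}
    end=${bounds#* }
    curl -sG "http://$1:9090/api/v1/query_range" \
        --data-urlencode "query=$2" \
        --data-urlencode "start=$start" \
        --data-urlencode "end=$end" \
        --data-urlencode "step=$5"
}
```

Calling prom_range once per component metric with the same anomaly time and margin yields series that share a time axis, which is what the comparison described above needs.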
In view of the above object, a second aspect of the embodiments of the present invention provides a Prometheus-based GPU interaction testing apparatus. Fig. 2 is a schematic diagram illustrating an embodiment of a Prometheus-based GPU interaction testing apparatus according to the present invention. As shown in fig. 2, the Prometheus-based GPU interaction testing apparatus according to the embodiment of the present invention includes the following components: a test environment configuration unit 011 configured for configuration and detection of a GPU stress test environment; a pressure environment simulation unit 012 configured to simulate a pressure environment of the GPU; a GPU stress test unit 013 configured for GPU stress testing; a Prometheus monitoring unit 014 configured to monitor test index data in a test process; and a test result output unit 015 configured to output a test result and analyze the test result.
In some embodiments of the invention, the test environment configuration unit 011 is further configured to: configure a GPU stress test environment and install a GPU driver and CUDA; detect whether the GPU identification information is consistent with the actual configuration and, if not, check the link connection; and detect whether the FW version of the GPU is consistent with the FW version required by the test and, if not, refresh the FW version of the GPU.
In some embodiments of the present invention, the pressure environment simulation unit 012 is further configured to: simulate the actual pressure environment of the GPU server by pressurizing the CPU, the memory, the hard disk and the network card.
In some embodiments of the invention, the GPU stress test unit 013 is further configured to: pressurize the GPU with the GPU-burn-master tool.
In some embodiments of the invention, the Prometheus monitoring unit 014 is further configured to: acquire real-time data through the Prometheus monitoring system, including indexes such as the power consumption, temperature, performance state, GPU utilization rate and video memory utilization rate of the GPU, while monitoring the pressurization data of the other components.
In some embodiments of the present invention, the test result output unit 015 is further configured to: output visual test data through Grafana; if the GPU test data is normal and no error log is generated in the system, the test is passed, and if the GPU test data is abnormal, the problem is analyzed and located according to the test data.
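The consistency check performed by the test environment configuration unit can be sketched as a comparison between the number of GPUs the driver reports and the number the configuration calls for. The helper name check_gpu_count and the expected count are hypothetical; the sketch assumes the listing comes from nvidia-smi -L, which prints one line beginning with "GPU " per detected device.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not from the patent): compare detected GPUs with the expected count.

# Reads `nvidia-smi -L` style output on stdin; $1 is the expected number of GPUs.
# Prints both counts and returns 0 when they match, 1 otherwise.
check_gpu_count() {
    local expected="$1" found
    # Count only top-level device lines, which start with "GPU ".
    found=$(grep -c '^GPU ' || true)
    echo "expected=$expected found=$found"
    [ "$found" -eq "$expected" ]
}
```

A typical call would be nvidia-smi -L | check_gpu_count 8, where 8 stands in for the stored actual configuration; a non-zero exit status is the cue to inspect the link connection.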
In view of the above object, a third aspect of the embodiments of the present invention provides a computer device. Fig. 3 is a schematic diagram of an embodiment of a computer device provided by the present invention. As shown in fig. 3, the computer device of the embodiment of the present invention includes the following components: at least one processor 021; and a memory 022, the memory 022 storing computer instructions 023 executable on the processor, the instructions when executed by the processor implementing the steps of the following method: configuring a GPU pressure test environment, and installing a GPU driver and a CUDA; detecting whether the GPU identification condition is consistent with the actual configuration, if so, performing the next step, otherwise, detecting the link connection condition and continuing the step; detecting whether the FW version of the GPU is consistent with the FW version required by the test, if so, performing the next step, and if not, refreshing the FW version of the GPU and continuing the step; simulating the actual pressure environment of the GPU server, and pressurizing a CPU, a memory, a hard disk and a network card; pressurizing the GPU by a GPU-burn-master tool; acquiring real-time data including power consumption, temperature, performance state, GPU utilization rate and video memory utilization rate of the GPU through a Prometheus monitoring system, and monitoring pressurization data of other components; and outputting visual test data through Grafana, wherein if the GPU test data is normal and no error log is generated in the system, the test is passed, and if the GPU test data is abnormal, the problem is analyzed and located according to the test data.
The invention also provides a computer readable storage medium. FIG. 4 is a schematic diagram illustrating an embodiment of a computer-readable storage medium provided by the present invention. As shown in fig. 4, the computer readable storage medium 031 stores a computer program 032 which, when executed by a processor, performs the method as described above.
Finally, it should be noted that, as one of ordinary skill in the art can appreciate, all or part of the processes in the methods of the above embodiments can be implemented by a computer program instructing related hardware. The program of the Prometheus-based GPU interaction testing method can be stored in a computer readable storage medium and, when executed, can include the processes of the method embodiments described above. The storage medium of the program may be a magnetic disk, an optical disk, a Read Only Memory (ROM), a Random Access Memory (RAM), or the like. The embodiments of the computer program may achieve the same or similar effects as any of the above-described method embodiments.
Furthermore, the methods disclosed according to embodiments of the present invention may also be implemented as a computer program executed by a processor, which may be stored in a computer-readable storage medium. When executed by the processor, the computer program performs the above-described functions defined in the methods disclosed in the embodiments of the invention.
Further, the above method steps and system elements may also be implemented using a controller and a computer readable storage medium for storing a computer program for causing the controller to implement the functions of the above steps or elements.
Those of skill would further appreciate that the various illustrative logical blocks, modules, circuits, and algorithm steps described in connection with the disclosure herein may be implemented as electronic hardware, computer software, or combinations of both. To clearly illustrate this interchangeability of hardware and software, various illustrative components, blocks, modules, circuits, and steps have been described above generally in terms of their functionality. Whether such functionality is implemented as software or hardware depends upon the particular application and design constraints imposed on the overall system. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the disclosed embodiments of the present invention.
In one or more exemplary designs, the functions may be implemented in hardware, software, firmware, or any combination thereof. If implemented in software, the functions may be stored on or transmitted over as one or more instructions or code on a computer-readable medium. Computer-readable media includes both computer storage media and communication media including any medium that facilitates transfer of a computer program from one place to another. A storage medium may be any available medium that can be accessed by a general purpose or special purpose computer. By way of example, and not limitation, such computer-readable media can comprise RAM, ROM, EEPROM, CD-ROM or other optical disk storage, magnetic disk storage or other magnetic storage devices, or any other medium that can be used to carry or store desired program code in the form of instructions or data structures and that can be accessed by a general-purpose or special-purpose computer, or a general-purpose or special-purpose processor. Also, any connection is properly termed a computer-readable medium. For example, if the software is transmitted from a website, server, or other remote source using a coaxial cable, fiber optic cable, twisted pair, digital subscriber line (DSL), or wireless technologies such as infrared, radio, and microwave, then the coaxial cable, fiber optic cable, twisted pair, DSL, or wireless technologies such as infrared, radio, and microwave are all included in the definition of medium. Disk and disc, as used herein, includes Compact Disc (CD), laser disc, optical disc, Digital Versatile Disc (DVD), floppy disk, and blu-ray disc, where disks usually reproduce data magnetically, while discs reproduce data optically with lasers. Combinations of the above should also be included within the scope of computer-readable media.
The foregoing is an exemplary embodiment of the present disclosure, but it should be noted that various changes and modifications could be made herein without departing from the scope of the present disclosure as defined by the appended claims. The functions, steps and/or actions of the method claims in accordance with the disclosed embodiments described herein need not be performed in any particular order. Furthermore, although elements of the disclosed embodiments of the invention may be described or claimed in the singular, the plural is contemplated unless limitation to the singular is explicitly stated.
It should be understood that, as used herein, the singular forms "a", "an" and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise. It should also be understood that "and/or" as used herein is meant to include any and all possible combinations of one or more of the associated listed items.
The numbers of the embodiments disclosed in the embodiments of the present invention are merely for description, and do not represent the merits of the embodiments.
It will be understood by those skilled in the art that all or part of the steps for implementing the above embodiments may be implemented by hardware, or may be implemented by a program instructing relevant hardware, and the program may be stored in a computer-readable storage medium, and the above-mentioned storage medium may be a read-only memory, a magnetic disk or an optical disk, etc.
Those of ordinary skill in the art will understand that: the discussion of any embodiment above is meant to be exemplary only, and is not intended to intimate that the scope of the disclosure, including the claims, of embodiments of the invention is limited to these examples; within the idea of an embodiment of the invention, also technical features in the above embodiment or in different embodiments may be combined and there are many other variations of the different aspects of the embodiments of the invention as described above, which are not provided in detail for the sake of brevity. Therefore, any omissions, modifications, substitutions, improvements, and the like that may be made without departing from the spirit and principles of the embodiments of the present invention are intended to be included within the scope of the embodiments of the present invention.

Claims (10)

1. A Prometheus-based GPU interaction testing method is characterized by comprising the following steps:
configuring a GPU pressure test environment, and installing a GPU driver and a CUDA;
detecting whether the GPU identification condition is consistent with the actual configuration, if so, performing the next step, otherwise, detecting the link connection condition and continuing the step;
detecting whether the FW version of the GPU is consistent with the FW version required by the test, if so, performing the next step, and if not, refreshing the FW version of the GPU and continuing the step;
simulating the actual pressure environment of the GPU server, and pressurizing a CPU, a memory, a hard disk and a network card;
pressurizing the GPU by a GPU-burn-master tool;
acquiring real-time data including power consumption, temperature, performance state, GPU utilization rate and video memory utilization rate of the GPU through a Prometheus monitoring system, and monitoring pressurization data of other components; and
outputting visual test data through Grafana, wherein if the GPU test data is normal and no error log is generated in the system, the test is passed, and if the GPU test data is abnormal, the problem is analyzed and located according to the test data.
2. The Prometheus-based GPU interactive testing method of claim 1, wherein configuring a GPU stress testing environment, installing a GPU driver and CUDA comprises:
unloading a GPU driver nouveau carried by the system, and installing a driver matched with the existing GPU; and
the CUDA is installed and environment variables are configured for the CUDA.
3. The Prometheus-based GPU interactive testing method according to claim 1, wherein detecting whether a GPU identification condition is consistent with an actual configuration, if so, performing the next step, and if not, detecting a link connection condition and continuing the step includes:
storing actual configuration information;
monitoring the recognition condition of the GPU through the nvidia-smi command provided by the newly installed GPU driver; and
comparing whether the two are consistent, if so, carrying out the next step, and if not, detecting the actual link connection condition by using an lspci command and continuing the step.
4. The Prometheus-based GPU interactive testing method as claimed in claim 1, wherein detecting whether the FW version of the GPU is consistent with the FW version required by the test, if so, performing the next step, and if not, performing the FW version refresh of the GPU and continuing the step, comprising:
storing an FW version file of the test requirements;
detecting an FW version of the GPU through an nvflash tool;
comparing whether the two are consistent, if so, carrying out the next step, and if not, refreshing through the nvflash tool and the corresponding FW version file and continuing the step.
5. The Prometheus-based GPU interactive testing method of claim 1, wherein simulating an actual pressure environment of a GPU server to pressurize a CPU, a memory, a hard disk, and a network card comprises:
pressurizing the CPU through the stress tool;
pressurizing the memory by a memtester tool;
pressurizing the hard disk through a fio tool; and
the network card is pressurized by an iperf tool.
6. The Prometheus-based GPU interactive testing method of claim 1, wherein performing real-time data acquisition by a Prometheus monitoring system and monitoring pressurization data of other components comprises:
installing a DCGM tool, and managing and monitoring a GPU;
deploying the monitoring index by using gpu-monitoring-tools; and
installing Prometheus to monitor the test index data in the test process.
7. The Prometheus-based GPU interaction testing method according to claim 1, wherein outputting visual test data through Grafana, passing the test if the GPU test data is normal and no error log is generated in the system, and analyzing and locating the problem according to the test data if the GPU test data is abnormal comprises:
installing a Grafana tool, and visually displaying the data of the Prometheus monitoring system;
confirming that the test is passed when, during the stress test, every GPU test data index is normal, the whole machine shows no hang, blue screen, crash or black screen, the system logs and BMC logs report no fail or error, the hard disk SMART log is normal, and the network card bandwidth performance is normal; and
observing the abnormal indexes of the GPU test data, and extracting the stress test data of other components at the same moment and for a period before and after it for specific analysis.
8. A Prometheus-based GPU interaction testing device is characterized by comprising:
a test environment configuration unit configured for configuration and detection of a GPU pressure test environment;
a pressure environment simulation unit configured for simulating the pressure environment of the GPU;
a GPU stress test unit configured for GPU stress testing;
a Prometheus monitoring unit configured for monitoring test index data in the test process; and
a test result output unit configured for outputting the test result and analyzing the test result.
9. A computer device, comprising:
at least one processor; and
a memory storing computer instructions executable on the processor, the instructions when executed by the processor implementing the steps of the method of any one of claims 1 to 7.
10. A computer-readable storage medium, in which a computer program is stored which, when being executed by a processor, carries out the steps of the method according to any one of claims 1 to 7.
CN202111436978.9A 2021-11-29 2021-11-29 Prometous-based GPU (graphics processing Unit) interactive test method, device, equipment and readable medium Withdrawn CN114138579A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111436978.9A CN114138579A (en) 2021-11-29 2021-11-29 Prometous-based GPU (graphics processing Unit) interactive test method, device, equipment and readable medium

Publications (1)

Publication Number Publication Date
CN114138579A true CN114138579A (en) 2022-03-04

Family

ID=80389282

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111436978.9A Withdrawn CN114138579A (en) 2021-11-29 2021-11-29 Prometous-based GPU (graphics processing Unit) interactive test method, device, equipment and readable medium

Country Status (1)

Country Link
CN (1) CN114138579A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115391124A (en) * 2022-10-27 2022-11-25 瀚博半导体(上海)有限公司 Method and device for testing power consumption of graphic chip

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104407951A (en) * 2014-11-05 2015-03-11 浪潮电子信息产业股份有限公司 Method for automatically testing server
CN107423183A (en) * 2017-04-25 2017-12-01 郑州云海信息技术有限公司 A kind of GTX series video card calculates the applied voltage test method of performance
CN110413462A (en) * 2019-06-29 2019-11-05 苏州浪潮智能科技有限公司 A kind of server stress test method and device
CN113392005A (en) * 2021-06-16 2021-09-14 中国工商银行股份有限公司 Large file processing test method and system

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
WW01 Invention patent application withdrawn after publication

Application publication date: 20220304