CN112231157B - AI server HCA card performance test method and system based on hardware topology

Publication number: CN112231157B
Application number: CN202011027122.1A
Authority: CN (China)
Other languages: Chinese (zh)
Other versions: CN112231157A (en)
Inventor: 徐屹蓝
Current Assignee (as listed; may be inaccurate): Suzhou Inspur Intelligent Technology Co Ltd
Original Assignee: Suzhou Inspur Intelligent Technology Co Ltd
Legal status (an assumption, not a legal conclusion): Active
Priority date (an assumption, not a legal conclusion): 2020-09-25
Filing date: 2020-09-25
Publication date: 2023-01-10
Prior art keywords: pcie, delay, hca card, hca, standard

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00 Error detection; Error correction; Monitoring
    • G06F11/22 Detection or location of defective computer hardware by testing during standby operation or during idle time, e.g. start-up testing
    • G06F11/2205 Detection or location of defective computer hardware by testing during standby operation or during idle time, e.g. start-up testing using arrangements specific to the hardware being tested
    • G06F11/2221 Detection or location of defective computer hardware by testing during standby operation or during idle time, e.g. start-up testing using arrangements specific to the hardware being tested to test input/output devices or peripheral units
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00 Error detection; Error correction; Monitoring
    • G06F11/22 Detection or location of defective computer hardware by testing during standby operation or during idle time, e.g. start-up testing
    • G06F11/2273 Test methods
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00 Error detection; Error correction; Monitoring
    • G06F11/22 Detection or location of defective computer hardware by testing during standby operation or during idle time, e.g. start-up testing
    • G06F11/26 Functional testing
    • G06F11/273 Tester hardware, i.e. output processing circuits

Abstract

The invention provides a method and a system for testing the performance of an AI server HCA card based on hardware topology. The method comprises: acquiring the HCA card slot, the device information of the upstream PCIe devices, and the bandwidth rate; identifying the PCIe switch chip type from the upstream PCIe device information; calculating the actual delay from the delay standard for a CPU directly connected to the PCIe card slot and the delay data of the PCIe switch chip provided by the manufacturer; re-establishing the HCA card test pass standard according to the upstream PCIe bandwidth rate and the actual delay; and using that pass standard as the test standard for the performance test of the AI server HCA card in the hardware topology. By formulating a new delay standard from the upstream PCIe conditions of the HCA card, the invention avoids misjudging test results because of a mismatched standard, gives testers a new basis for analyzing test results, and helps improve overall test quality.

Description

AI server HCA card performance test method and system based on hardware topology
Technical Field
The invention belongs to the technical field of chip testing, and particularly relates to a method and a system for testing the performance of an AI server HCA card based on hardware topology.
Background
The HCA card is a key component of an AI server and can meet the computing requirements of various fields. In an AI server, the HCA card not only provides communication between servers but also implements RDMA. The bandwidth and delay performance of the HCA card is therefore a major concern for AI server testers and the main basis for judging whether an HCA card compatibility test passes.
The HCA card delay standard in general use today assumes that the CPU is directly connected to the PCIe card slot. The PCIe topology of an AI server model is complex, however, and the actual link between each slot and the CPU differs. Some slots are directly connected to the CPU and see no delay from PCIe switch chips, while others reach the CPU through several PCIe switch chips, which affects the bandwidth and delay of the HCA card. The general delay standard therefore cannot be applied to performance testing of AI server HCA cards, and the test standard has to be determined from the actual hardware topology of the AI model. In addition, a general HCA card compatibility test checks the PCIe link speed only for the slot that holds the HCA card. For an AI server, the upstream PCIe bandwidth rate must be tested as well as the bandwidth rate of the slot itself, and this test also depends on the hardware topology. The existing test method is therefore limited and one-sided.
Disclosure of Invention
In view of the above defects in the prior art, the present invention provides a method and a system for testing the performance of an HCA card of an AI server based on hardware topology, so as to solve the above technical problems.
In a first aspect, the present invention provides a method for testing the performance of an AI server HCA card based on hardware topology, including:
acquiring the HCA card slot, the device information of the upstream PCIe devices, and the bandwidth rate;
identifying the PCIe switch chip type according to the upstream PCIe device information;
calculating the actual delay according to the delay data for a CPU directly connected to the PCIe card slot and the delay data of the PCIe switch chip provided by the manufacturer;
re-establishing the HCA card test pass standard according to the upstream PCIe bandwidth rate and the actual delay;
and using the HCA card test pass standard as the test standard to carry out the performance test of the AI server HCA card in the hardware topology.
Further, the method further comprises:
acquiring the PCIe tree structure of the hardware topology by means of a shell script;
wherein the device information of the upstream PCIe devices is obtained from the configuration of each node in the PCIe tree structure.
Further, acquiring the HCA card slot, the device information of the upstream PCIe devices, and the bandwidth rate includes:
traversing the HCA cards and capturing the BUS_ID and PCIe bandwidth rate of each HCA card;
acquiring the device information of the upstream PCIe devices;
calling a function to acquire the upstream PCIe bandwidth rate corresponding to the BUS_ID.
Further, the device information of the upstream PCIe devices includes the PCIe bus, bridge, and device type.
Further, the actual delay is calculated as: actual delay = delay of the CPU directly connected to the PCIe slot + (delay of the PCIe switch chip) × 2.
Further, the method further comprises:
acquiring the delay data of the timer chip;
and updating the actual delay by adding the delay of the timer chip to it.
Further, the method further comprises:
obtaining the PCIe bandwidth rate of the HCA card corresponding to the BUS_ID, the upstream PCIe bandwidth rate, and the actual delay, and establishing the test pass standard of the HCA card from them.
In a second aspect, the present invention provides a performance testing system for an AI server HCA card based on a hardware topology, including:
an information acquisition unit configured to acquire the HCA card slot, the device information of the upstream PCIe devices, and the bandwidth rate;
a type identification unit configured to identify the PCIe switch chip type according to the upstream PCIe device information;
a delay calculating unit configured to calculate the actual delay according to the delay data for a CPU directly connected to the PCIe card slot and the delay data of the PCIe switch chip provided by the manufacturer;
a standard formulating unit configured to re-establish the HCA card test pass standard according to the upstream PCIe bandwidth rate and the actual delay;
and a performance test unit configured to carry out the performance test of the AI server HCA card in the hardware topology using the HCA card test pass standard as the test standard.
The beneficial effects of the invention are as follows.
The method and system for testing the performance of an AI server HCA card based on hardware topology automatically acquire bandwidth rate data by traversing the PCIe devices on the upstream path of the HCA card, which reduces manual operation time during the delay test and improves test efficiency. The hardware topology is acquired automatically, and a new delay standard is formulated from the PCIe conditions of the HCA card, which avoids misjudging test results because of a mismatched standard, gives testers a new basis for analyzing test results, and helps improve overall test quality.
In addition, the invention has reliable design principle, simple structure and very wide application prospect.
Drawings
To illustrate the embodiments of the present invention or the technical solutions of the prior art more clearly, the drawings used in describing them are briefly introduced below. Those skilled in the art can obtain other drawings from these drawings without creative effort.
FIG. 1 is a schematic flow diagram of a method of one embodiment of the invention.
FIG. 2 is a schematic block diagram of a system of one embodiment of the present invention.
Detailed Description
In order to make those skilled in the art better understand the technical solution of the present invention, the technical solution in the embodiment of the present invention will be clearly and completely described below with reference to the drawings in the embodiment of the present invention, and it is obvious that the described embodiment is only a part of the embodiment of the present invention, and not all embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
The following explains key terms appearing in the present invention.
FIG. 1 is a schematic flow diagram of a method of one embodiment of the invention. The execution subject in fig. 1 may be an AI server HCA card performance testing system based on a hardware topology.
As shown in fig. 1, the method includes:
step 110, acquiring the HCA card slot, the device information of the upstream PCIe devices, and the bandwidth rate;
step 120, identifying the PCIe switch chip type according to the upstream PCIe device information;
step 130, calculating the actual delay according to the delay data for a CPU directly connected to the PCIe card slot and the delay data of the PCIe switch chip provided by the manufacturer;
step 140, re-establishing the HCA card test pass standard according to the upstream PCIe bandwidth rate and the actual delay;
step 150, using the HCA card test pass standard as the test standard to carry out the performance test of the AI server HCA card in the hardware topology.
Optionally, as an embodiment of the present invention, the method further includes:
acquiring the PCIe tree structure of the hardware topology by means of a shell script;
wherein the device information of the upstream PCIe devices is obtained from the configuration of each node in the PCIe tree structure.
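One possible way to list the devices upstream of a card is sketched here. It is not the script described in this document, which works from the lspci tree output. As a swapped-in technique it assumes a Linux host, where the resolved sysfs path of a PCI device contains one path component per bridge between the root complex and the device. The function name upstream_chain and the address in the usage comment are hypothetical.

```shell
# Print the upstream PCIe devices of an endpoint, root port first.
# Argument: a resolved sysfs device path, for example the output of
#   readlink -f /sys/bus/pci/devices/0000:3d:00.0   (hypothetical address)
upstream_chain() {
    # Split the path on '/', keep components shaped like
    # domain:bus:device.function, then drop the last one (the endpoint itself).
    printf '%s\n' "$1" | tr '/' '\n' \
        | grep -E '^[0-9a-fA-F]{4}:[0-9a-fA-F]{2}:[0-9a-fA-F]{2}\.[0-7]$' \
        | sed '$d'
}
```

Each address printed by the function can then be passed to lspci -s to read the bus, bridge, and device type of that upstream node.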
Optionally, as an embodiment of the present invention, acquiring the HCA card slot, the device information of the upstream PCIe devices, and the bandwidth rate includes:
traversing the HCA cards and capturing the BUS_ID and PCIe bandwidth rate of each HCA card;
acquiring the device information of the upstream PCIe devices;
calling a function to acquire the upstream PCIe bandwidth rate corresponding to the BUS_ID.
Optionally, as an embodiment of the present invention, the device information of the upstream PCIe devices includes the PCIe bus, bridge, and device type.
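The capture step can be sketched as two small filters over lspci text. This is a minimal illustration, not the script of the embodiment: the function names hca_bus_ids and link_status are hypothetical, and the parser assumes the "LnkSta: Speed ..., Width ..." line that lspci -vv prints for a PCIe device (reading it usually requires root privileges).

```shell
# Read plain lspci output on stdin and print the bus address (first field)
# of every InfiniBand controller.
hca_bus_ids() {
    grep -i 'infiniband' | awk '{ print $1 }'
}

# Read "lspci -s <address> -vv" output on stdin and print the negotiated
# link speed and width, e.g. "8GT/s x16". The LnkSta2 line is not matched.
link_status() {
    sed -n 's/.*LnkSta:[[:space:]]*Speed \([^ ,]*\).*Width \(x[0-9]*\).*/\1 \2/p'
}

# Usage on a real host (not run here):
#   lspci | hca_bus_ids | while read -r id; do
#       printf '%s %s\n' "$id" "$(lspci -s "$id" -vv | link_status)"
#   done
```

The same link_status filter can be applied to each upstream bridge address, which gives the upstream PCIe bandwidth rate alongside that of the HCA card itself.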
Optionally, as an embodiment of the present invention, the actual delay is calculated as: actual delay = delay of the CPU directly connected to the PCIe slot + (delay of the PCIe switch chip) × 2.
Optionally, as an embodiment of the present invention, the method further includes:
acquiring the delay data of the timer chip;
and updating the actual delay by adding the delay of the timer chip to it.
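The delay formula is simple enough to express as a one-line helper. The sketch below is illustrative only: the function name actual_delay is hypothetical, the timer chip delay is assumed to be added once (the text does not say whether it is doubled), and the 0.60 us CPU-direct figure in the comment is an invented example value. Only the 0.2 us switch delay echoes an example given in this description.

```shell
# actual delay = CPU-direct delay + 2 * switch chip delay (+ timer chip delay)
# Arguments, all in microseconds:
#   $1 delay of a CPU directly connected to the PCIe slot
#   $2 delay of one PCIe switch chip
#   $3 optional timer chip delay, defaults to 0 (assumed to be added once)
actual_delay() {
    awk -v cpu="$1" -v sw="$2" -v timer="${3:-0}" \
        'BEGIN { printf "%.2f\n", cpu + 2 * sw + timer }'
}

# Example: actual_delay 0.60 0.20   prints 1.00
```

For a slot that is directly connected to the CPU, passing 0 as the switch chip delay reduces the result to the general CPU-direct standard.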
Optionally, as an embodiment of the present invention, the method further includes:
obtaining the PCIe bandwidth rate of the HCA card corresponding to the BUS_ID, the upstream PCIe bandwidth rate, and the actual delay, and formulating the test pass standard of the HCA card from them.
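The text does not state how the three quantities are combined into a pass standard, so the following is only one plausible reading, labeled as an assumption: both the HCA link and its upstream link must run at the expected speed, and the measured delay must not exceed the calculated actual delay. The function name check_hca and its argument order are hypothetical.

```shell
# Hypothetical pass/fail check; the combination rule is an assumption.
# Arguments:
#   $1 expected link speed (e.g. 8GT/s)   $2 HCA card link speed
#   $3 upstream PCIe link speed           $4 measured delay in us
#   $5 calculated actual delay in us, used as the limit
check_hca() {
    if [ "$2" != "$1" ] || [ "$3" != "$1" ]; then
        echo "FAIL: link speed"
        return 1
    fi
    # awk performs the floating-point comparison; exit status 0 means
    # the measured delay is within the limit.
    if awk -v m="$4" -v l="$5" 'BEGIN { exit !(m <= l) }'; then
        echo "PASS"
    else
        echo "FAIL: delay"
        return 1
    fi
}
```

Keeping the speed check separate from the delay check lets a tester see whether a failure comes from a degraded upstream link or from the delay itself.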
To facilitate understanding of the present invention, the method is further described below on the basis of its principle and with reference to the HCA card performance test process of the embodiment.
Specifically, the method for testing the performance of an AI server HCA card based on hardware topology includes:
(1) PCIe uses a tree topology, and its system architecture generally consists of several types of PCIe devices. Obtaining the PCIe tree topology reveals the PCIe devices on the upstream path of the HCA card, which provides the basis for later calculating the delay from the types of those upstream devices. Because a new standard is calculated from the topology result, the method adapts to the actual conditions of different server models, the test results are meaningful and accurate as a reference, and the probability of misjudgment by testers is reduced.
(2) The shell script uses lspci | grep -i "infiniband" | awk '{print $1}' to locate each HCA card under test and capture its position information and BUS ID, and acquires the upstream PCIe device information of the card, which includes the PCIe bus, the bridge, and the device type (for example, Device 9797). The captured information is written to the file $Cur_Dir/BUS_id.log, where Cur_Dir=$(cd "$(dirname "$0")"; pwd), that is, the directory containing the script. The BUS_ID serves as the unique label of the HCA card under test, which makes the procedure easier for the user.
(3) All HCA card slot BUS_IDs are traversed with cat $Cur_Dir/BUS_id.log | while read line. For each one, the Bus IDs of the PCIe devices upstream of the HCA card are traversed with lspci -t -vvv | grep $line | cut -d '[' -f "$i" | cut -d ']' -f1, where $i is the field number passed to cut. A function is then called to obtain the detailed bandwidth rate corresponding to each BUS_ID, and a list of the information is output. The PCIe bandwidth rate of the HCA card corresponding to the BUS_ID is acquired at the same time. Examining HCA slot bandwidth cannot be limited to the HCA card itself, so the performance standard of the HCA card is analyzed by combining the upstream PCIe bandwidth rate with the PCIe bandwidth rate of the HCA card itself.
(4) The delay of a PCIe switch chip depends on the chip type, and the specific delay figures come from the manufacturers' internal documents. A delay information list is compiled from the data already at hand, for example: switch chip of type xx, delay 0.2 us. Using grep, the script looks up the type of the switch chip upstream of the HCA card under test in the delay information list and retrieves the corresponding delay data. Two systems are interconnected during the test, so in theory the traffic passes through two stages of switch chips, and the PCIe switch chip delay has to be counted twice. The calculation formula is: actual delay = delay of the CPU directly connected to the PCIe card slot + (delay of the PCIe switch chip) × 2. The actual delay is also affected by the timer chip. The system itself cannot display the timer information, but third-party software can output the timer chip information, and the delay introduced by the timer chip is added to the actual delay so that the performance standard is more accurate.
(5) The test pass standard of the HCA card is re-established from the PCIe bandwidth rate of the HCA card corresponding to the BUS_ID, the upstream PCIe bandwidth rate, and the actual delay. When the HCA card under test is replaced, the standard must be recalculated for the actual situation by the same method, which avoids the deviations in test results that a single uniform test standard would cause.
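The lookup in the delay information list described in step (4) might look like the sketch below. The file layout (one "chip-type delay-in-us" pair per line) and the function name switch_delay are assumptions, since the text only says that the list is searched with grep and does not give its format.

```shell
# Look up the delay of a switch chip type in a delay information list.
# Arguments:
#   $1 path to the list file, one "<chip type> <delay in us>" pair per line
#      (hypothetical format)
#   $2 chip type identified on the upstream path of the HCA card
# Prints the delay of the first matching entry, or nothing if the type
# is not in the list.
switch_delay() {
    grep -iw -- "$2" "$1" | head -n 1 | awk '{ print $2 }'
}

# Usage (hypothetical file name and chip type):
#   delay=$(switch_delay "$Cur_Dir/switch_delay.list" "xx")
```

An empty result is worth treating as an error in a real script, because a missing entry would otherwise silently fall back to the CPU-direct standard.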
As shown in fig. 2, the system 200 includes:
an information obtaining unit 210 configured to acquire the HCA card slot, the device information of the upstream PCIe devices, and the bandwidth rate;
a type identification unit 220 configured to identify the PCIe switch chip type according to the upstream PCIe device information;
a delay calculating unit 230 configured to calculate the actual delay according to the delay standard for a CPU directly connected to the PCIe card slot and the delay data of the PCIe switch chip provided by the manufacturer;
a standard establishing unit 240 configured to re-establish the HCA card test pass standard according to the upstream PCIe bandwidth rate and the actual delay;
and a performance testing unit 250 configured to carry out the performance test of the AI server HCA card in the hardware topology using the HCA card test pass standard as the test standard.
Although the present invention has been described in detail with reference to the accompanying drawings and preferred embodiments, it is not limited thereto. Those skilled in the art can make various equivalent modifications or substitutions to the embodiments without departing from the spirit and scope of the present invention, and such modifications or substitutions, including any changes or substitutions that a person skilled in the art could readily conceive within the technical scope disclosed herein, fall within the scope of the present invention. The protection scope of the present invention is therefore defined by the claims.

Claims (6)

1. A method for testing the performance of an AI server HCA card based on hardware topology, characterized by comprising the following steps:
acquiring the HCA card slot, the device information of the upstream PCIe devices, and the bandwidth rate;
identifying the PCIe switch chip type according to the upstream PCIe device information;
calculating the actual delay according to the delay standard for a CPU directly connected to the PCIe card slot and the delay data of the PCIe switch chip provided by the manufacturer;
re-establishing the HCA card test pass standard according to the upstream PCIe bandwidth rate and the actual delay;
using the HCA card test pass standard as the test standard to carry out the performance test of the AI server HCA card in the hardware topology;
wherein the actual delay is calculated as: actual delay = delay of the CPU directly connected to the PCIe slot + (delay of the PCIe switch chip) × 2;
the method further comprising:
acquiring the delay data of the timer chip;
and updating the actual delay by adding the delay of the timer chip to it.
2. The method for testing the performance of the AI server HCA card based on the hardware topology as recited in claim 1, further comprising:
acquiring the PCIe tree structure of the hardware topology by means of a shell script;
wherein the device information of the upstream PCIe devices is obtained from the configuration of each node in the PCIe tree structure.
3. The method for testing the performance of the AI server HCA card based on the hardware topology of claim 1, wherein acquiring the HCA card slot, the device information of the upstream PCIe devices, and the bandwidth rate comprises:
traversing the HCA cards and capturing the BUS_ID and PCIe bandwidth rate of each HCA card;
acquiring the device information of the upstream PCIe devices;
and calling a function to acquire the upstream PCIe bandwidth rate corresponding to the BUS_ID.
4. The AI server HCA card performance testing method based on hardware topology of claim 1, wherein the device information of the upstream PCIe comprises PCIe bus, bridge and device type.
5. The method for testing the performance of the AI server HCA card based on the hardware topology as recited in claim 1, further comprising:
obtaining the PCIe bandwidth rate of the HCA card corresponding to the BUS_ID, the upstream PCIe bandwidth rate, and the actual delay, and establishing the test pass standard of the HCA card from them.
6. A system for testing the performance of an AI server HCA card based on hardware topology, characterized by comprising:
an information acquisition unit configured to acquire the HCA card slot, the device information of the upstream PCIe devices, and the bandwidth rate;
a type identification unit configured to identify the PCIe switch chip type according to the upstream PCIe device information;
a delay calculating unit configured to calculate the actual delay according to the delay data for a CPU directly connected to the PCIe card slot and the delay data of the PCIe switch chip provided by the manufacturer;
a standard formulating unit configured to re-establish the HCA card test pass standard according to the upstream PCIe bandwidth rate and the actual delay;
a performance test unit configured to carry out the performance test of the AI server HCA card in the hardware topology using the HCA card test pass standard as the test standard;
wherein the actual delay is calculated as: actual delay = delay of the CPU directly connected to the PCIe slot + (delay of the PCIe switch chip) × 2;
the AI server HCA card performance testing system being further configured to perform the steps of:
acquiring the delay data of the timer chip;
and updating the actual delay by adding the delay of the timer chip to it.
Priority Applications (1)

Application Number    Priority Date    Filing Date    Title
CN202011027122.1A     2020-09-25       2020-09-25     AI server HCA card performance test method and system based on hardware topology (CN112231157B, Active)

Publications (2)

Publication Number    Publication Date
CN112231157A          2021-01-15
CN112231157B          2023-01-10

Family

ID=74108863

Country Status (1)

Country    Link
CN (1)     CN112231157B (en)

Legal Events

Code Title
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant