CN112231157B - AI server HCA card performance test method and system based on hardware topology - Google Patents
- Publication number: CN112231157B
- Application number: CN202011027122.1A
- Authority: CN (China)
- Prior art keywords: pcie, delay, hca card, hca, standard
- Legal status: Active (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F11/00—Error detection; Error correction; Monitoring
- G06F11/22—Detection or location of defective computer hardware by testing during standby operation or during idle time, e.g. start-up testing
- G06F11/2205—Detection or location of defective computer hardware by testing during standby operation or during idle time, e.g. start-up testing using arrangements specific to the hardware being tested
- G06F11/2221—Detection or location of defective computer hardware by testing during standby operation or during idle time, e.g. start-up testing using arrangements specific to the hardware being tested to test input/output devices or peripheral units
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F11/00—Error detection; Error correction; Monitoring
- G06F11/22—Detection or location of defective computer hardware by testing during standby operation or during idle time, e.g. start-up testing
- G06F11/2273—Test methods
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F11/00—Error detection; Error correction; Monitoring
- G06F11/22—Detection or location of defective computer hardware by testing during standby operation or during idle time, e.g. start-up testing
- G06F11/26—Functional testing
- G06F11/273—Tester hardware, i.e. output processing circuits
Abstract
The invention provides a method and a system for testing the performance of an AI server HCA card based on hardware topology. The method comprises: acquiring the HCA card slot, the device information of the upstream PCIe, and the bandwidth rate; identifying the PCIe switch chip type from the device information of the upstream PCIe; calculating the actual delay from the delay standard for a PCIe card slot directly connected to the CPU and the delay data of the PCIe switch chip provided by the manufacturer; re-establishing the HCA card test pass standard according to the upstream PCIe bandwidth rate and the actual delay; and using that pass standard as the test standard for the performance test of the AI server HCA card under the hardware topology. By formulating a new delay standard that takes the upstream PCIe conditions of the HCA card into account, the invention avoids misjudging test results because of a mismatched standard, gives testers a new basis for analyzing test results, and helps improve overall test quality.
Description
Technical Field
The invention belongs to the technical field of chip testing, and particularly relates to a method and a system for testing the performance of an AI server HCA card based on hardware topology.
Background
The HCA card is a key component of an AI server and can meet the computing requirements of many fields. In an AI server, the HCA card not only enables communication between servers but also implements RDMA. The bandwidth and delay performance of the HCA card is therefore a major concern for AI server testers and the main basis for judging whether an HCA card passes its compatibility test.
The current general HCA card delay standard assumes that the PCIe card slot is directly connected to the CPU. The PCIe topology of an AI server, however, is complex, and the actual link between each slot and the CPU differs. Some slots are directly connected to the CPU and see no delay from PCIe switch chips; others reach the CPU through several PCIe switch chips, which affect the bandwidth and delay of the HCA card. The general delay standard therefore cannot be applied to the performance test of an AI server HCA card, and the test standard must be determined from the actual hardware topology of the AI server model. In addition, a general HCA card compatibility test usually checks the PCIe link speed only for the slot holding the HCA card. For an AI server, the upstream PCIe bandwidth rate must be tested as well, and this test also depends on the hardware topology. The existing test method is therefore limited and one-sided.
Disclosure of Invention
In view of the above defects in the prior art, the present invention provides a method and a system for testing the performance of an HCA card of an AI server based on hardware topology, so as to solve the above technical problems.
In a first aspect, the present invention provides a method for testing performance of an HCA card of an AI server based on hardware topology, including:
acquiring HCA card slot, equipment information of uplink PCIe and bandwidth rate;
identifying the switch chip type of PCIe according to the equipment information of the uplink PCIe;
calculating actual delay according to delay data of a CPU directly connected with a PCIe card slot and delay data of a PCIe switch chip provided by a manufacturer;
re-establishing an HCA card test passing standard according to the bandwidth rate and the actual delay of the uplink PCIe;
and taking the HCA card passing standard as a test standard to carry out the performance test of the AI server HCA card of the hardware topology.
Further, the method further comprises:
acquiring the PCIe tree structure of the hardware topology by means of a shell script;
wherein the device information of the upstream PCIe can be acquired from the configuration of each node in the PCIe tree structure.
Further, the acquiring the HCA card slot, the device information of the uplink PCIe, and the bandwidth rate includes:
traversing the HCA cards, and capturing the BUS_ID and the PCIe bandwidth rate of each HCA card;
acquiring the device information of the upstream PCIe;
calling a function to acquire the upstream PCIe bandwidth rate corresponding to the BUS_ID.
Further, the device information of the upstream PCIe includes the PCIe bus, bridge, and device type.
Further, the calculation formula of the actual delay time is as follows: actual latency = latency of the CPU directly connected to the PCIe slot + (latency of PCIe switch chip) × 2.
Further, the method further comprises:
acquiring delay data of a timer chip;
and adding the delay data update of the timer chip into the actual delay.
Further, the method further comprises:
obtaining the PCIe bandwidth rate of the HCA card corresponding to the BUS_ID, the upstream PCIe bandwidth rate, and the actual delay, and establishing the test pass standard of the HCA card.
In a second aspect, the present invention provides a performance testing system for an AI server HCA card based on a hardware topology, including:
the information acquisition unit is configured to acquire the HCA card slot, the equipment information of the uplink PCIe and the bandwidth rate;
the type identification unit is configured to identify the switch chip type of the PCIe according to the equipment information of the uplink PCIe;
the delay calculating unit is configured to calculate actual delay according to delay data of the CPU directly connected with the PCIe card slot and delay data of the PCIe switch chip provided by a manufacturer;
the standard formulating unit is configured for re-formulating the HCA card test passing standard according to the bandwidth rate and the actual delay of the uplink PCIe;
and the performance test unit is configured to perform the performance test of the AI server HCA card of the hardware topology by taking the HCA card passing test standard as a test standard.
The beneficial effects of the invention are as follows.
The method and the system for testing the performance of the AI server HCA card based on hardware topology automatically acquire bandwidth rate data by traversing the PCIe devices on the upstream link of the HCA card, which reduces manual operation time in the delay test and improves test efficiency. The hardware topology is acquired automatically, and a new delay standard is formulated from the upstream PCIe conditions of the HCA card, which avoids misjudging test results because of a mismatched standard, gives testers a new basis for analyzing test results, and helps improve overall test quality.
In addition, the invention has reliable design principle, simple structure and very wide application prospect.
Drawings
In order to more clearly illustrate the embodiments or technical solutions in the prior art of the present invention, the drawings used in the description of the embodiments or prior art will be briefly described below, and it is obvious for those skilled in the art that other drawings can be obtained based on these drawings without creative efforts.
FIG. 1 is a schematic flow diagram of a method of one embodiment of the invention.
FIG. 2 is a schematic block diagram of a system of one embodiment of the present invention.
Detailed Description
In order to make those skilled in the art better understand the technical solution of the present invention, the technical solution in the embodiment of the present invention will be clearly and completely described below with reference to the drawings in the embodiment of the present invention, and it is obvious that the described embodiment is only a part of the embodiment of the present invention, and not all embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
The following explains key terms appearing in the present invention.
FIG. 1 is a schematic flow diagram of a method of one embodiment of the invention. The execution subject in fig. 1 may be an AI server HCA card performance testing system based on a hardware topology.
As shown in fig. 1, the method includes:
Step 150: taking the HCA card test pass standard as the test standard, and carrying out the performance test of the AI server HCA card under the hardware topology.
Optionally, as an embodiment of the present invention, the method further includes:
acquiring the PCIe tree structure of the hardware topology by means of a shell script;
wherein the device information of the upstream PCIe can be acquired from the configuration of each node in the PCIe tree structure.
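As an illustration only (this is not the script described in the embodiment), the upstream chain of a device can also be derived on Linux from the resolved sysfs path of `/sys/bus/pci/devices/<BUS_ID>`, whose components list each bridge between the root complex and the device. The function name `upstream_chain` and the sample path are hypothetical.

```python
import re

# Matches a full PCI address: domain:bus:device.function, e.g. 0000:3b:00.0
_BDF = re.compile(r"^[0-9a-fA-F]{4}:[0-9a-fA-F]{2}:[0-9a-fA-F]{2}\.[0-7]$")


def upstream_chain(resolved_sysfs_path: str) -> list:
    """Return the PCI addresses above a device, root port first.

    The last address in the path is the device itself and is excluded.
    """
    parts = [p for p in resolved_sysfs_path.split("/") if _BDF.match(p)]
    return parts[:-1]


# Hypothetical resolved path for an HCA card behind one PCIe switch
# (root port -> switch upstream port -> switch downstream port -> HCA).
sample = (
    "/sys/devices/pci0000:3a/0000:3a:00.0/0000:3b:00.0/"
    "0000:3c:08.0/0000:3d:00.0"
)
print(upstream_chain(sample))
```

On a real system the input would typically come from `os.path.realpath` applied to the device's sysfs link. A chain longer than one entry suggests that the slot is not directly attached to a root port.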
Optionally, as an embodiment of the present invention, the acquiring of the HCA card slot, the device information of the upstream PCIe, and the bandwidth rate includes:
traversing the HCA cards, and capturing the BUS_ID and the PCIe bandwidth rate of each HCA card;
acquiring the device information of the upstream PCIe;
calling a function to acquire the upstream PCIe bandwidth rate corresponding to the BUS_ID.
Optionally, as an embodiment of the present invention, the device information of the upstream PCIe includes the PCIe bus, bridge, and device type.
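The bandwidth-rate acquisition above could be sketched as follows, assuming the rate is read from the `LnkSta` line that `lspci -vv` prints for a device. The patent does not say which field its function reads, so this choice, the name `link_status`, and the sample text are assumptions.

```python
import re

# lspci -vv prints a line such as "LnkSta: Speed 8GT/s, Width x16";
# newer pciutils versions add "(ok)" or "(downgraded)" after each field.
_LNKSTA = re.compile(r"LnkSta:\s*Speed\s+([\d.]+)GT/s[^,]*,\s*Width\s+x(\d+)")


def link_status(lspci_vv_text: str):
    """Return (speed in GT/s, lane count) from lspci -vv text, or None."""
    m = _LNKSTA.search(lspci_vv_text)
    if m is None:
        return None
    return float(m.group(1)), int(m.group(2))


# Hypothetical excerpt of `lspci -vv -s <BUS_ID>` for an upstream port.
sample = """\
    LnkCap: Port #0, Speed 8GT/s, Width x16, ASPM not supported
    LnkSta: Speed 8GT/s (ok), Width x16 (ok)
"""
print(link_status(sample))
```

Running the same parser on the HCA card's own BUS_ID and on each upstream BUS_ID yields the two bandwidth rates that the method compares.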
Optionally, as an embodiment of the present invention, a calculation formula of the actual delay time is: actual latency = latency of the CPU directly connected to the PCIe slot + (latency of PCIe switch chip) × 2.
Optionally, as an embodiment of the present invention, the method further includes:
acquiring delay data of a timer chip;
and adding the delay data update of the timer chip into the actual delay.
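A minimal sketch of the delay calculation, assuming all delays are expressed in microseconds and that the timer chip delay is added once (the text does not say whether it should also be doubled). The function name and the 0.6 µs CPU-direct value are hypothetical.

```python
def actual_latency_us(cpu_direct_us: float, switch_us: float,
                      timer_us: float = 0.0) -> float:
    """Latency standard for a slot behind a PCIe switch, in microseconds.

    Implements: CPU-direct latency + 2 x switch latency, because the two
    interconnected systems each contribute one switch stage.
    Assumption: the timer chip latency is added once; the source does not
    say whether it should also be doubled.
    """
    return cpu_direct_us + 2.0 * switch_us + timer_us


# 0.2 us is the example switch latency given in the description;
# 0.6 us for the CPU-direct standard is a made-up illustration value.
print(round(actual_latency_us(0.6, 0.2), 3))
```

For a slot directly connected to the CPU, passing `switch_us=0.0` reduces the result to the general CPU-direct standard.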
Optionally, as an embodiment of the present invention, the method further includes:
obtaining the PCIe bandwidth rate of the HCA card corresponding to the BUS_ID, the upstream PCIe bandwidth rate, and the actual delay, and formulating the test pass standard of the HCA card.
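One possible way to represent such a pass standard is sketched below. The patent does not state how the three quantities are compared against measurements, so the rule used here (both links reach the expected speed and the measured delay does not exceed the calculated actual delay) is an assumption, as are all names and values.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PassStandard:
    """Per-slot pass criteria; field names are illustrative."""
    hca_speed_gts: float        # expected link speed of the HCA card slot
    upstream_speed_gts: float   # expected link speed of the upstream PCIe port
    max_latency_us: float       # actual delay computed for this slot


def passes(std: PassStandard, hca_speed_gts: float,
           upstream_speed_gts: float, measured_latency_us: float) -> bool:
    """Assumed rule: both links reach their expected speed and the
    measured latency does not exceed the slot's computed delay."""
    return (hca_speed_gts >= std.hca_speed_gts
            and upstream_speed_gts >= std.upstream_speed_gts
            and measured_latency_us <= std.max_latency_us)


# Hypothetical standard for a slot behind one switch stage per system.
std = PassStandard(hca_speed_gts=8.0, upstream_speed_gts=8.0, max_latency_us=1.0)
print(passes(std, 8.0, 8.0, 0.95))   # within the standard
print(passes(std, 8.0, 5.0, 0.95))   # upstream link negotiated below expectation
```

Keeping one `PassStandard` per BUS_ID makes it explicit that slots with different upstream topologies are judged against different limits.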
To facilitate understanding of the present invention, the method for testing the performance of an AI server HCA card based on hardware topology is further described below in terms of its principle and with reference to the HCA card performance test process of the embodiment.
Specifically, the method for testing the performance of the HCA card of the AI server based on the hardware topology includes:
(1) PCIe uses a tree topology, and a PCIe system generally consists of several types of PCIe devices. Acquiring the PCIe tree topology reveals the PCIe devices on the upstream link of the HCA card, which provides the basis for later calculating the delay from the upstream device types. A new standard is then calculated from the topology result. This approach adapts to the actual conditions of different server models, makes the test results meaningful and accurate, and reduces the probability of misjudgment by testers.
(2) The shell script uses lspci | grep -i "infiniband" | awk '{print $1}' to locate the HCA card under test and acquire its upstream PCIe device information, which includes the PCIe bus, the bridge, and the device type (here, Device 9797). The HCA card position information and BUS_ID are captured and written to the file $Cur_Dir/BUS_id.log, where Cur_Dir=$(cd "$(dirname "$0")"; pwd). The BUS_ID serves as the unique label of the HCA card under test, which makes the procedure easier for the user.
(3) All HCA card slot BUS_IDs are traversed with cat $Cur_Dir/BUS_id.log | while read line. For each one, the bus IDs of the PCIe devices on the upstream link of the HCA card are traversed with lspci -t -vvv | grep $line | cut -d '[' -f "$i" | cut -d ']' -f1, where $i is the field number passed to cut. A function is called to obtain the detailed bandwidth rate corresponding to each BUS_ID, and an information list is output. The PCIe bandwidth rate of the HCA card corresponding to the BUS_ID is acquired at the same time. The examination of HCA slot bandwidth cannot be limited to the HCA card itself, so the performance standard of the HCA card is analyzed by combining the upstream PCIe bandwidth rate with the PCIe bandwidth rate of the HCA card itself.
(4) The delay of a PCIe switch chip depends on the switch chip type. The specific delay data come from the manufacturer's internal documents, and a delay information list is compiled from the data already at hand, for example: switch chip type xx, delay 0.2 µs. The script uses grep to look up, in the list of PCIe switch chip types and delays, the delay data for the switch chip type upstream of the HCA card under test. Two systems are interconnected during the test, so in theory the traffic passes through two stages of switch chips, and the PCIe switch chip delay must be counted twice. The calculation formula is: actual delay = delay of the CPU directly connected to the PCIe card slot + (delay of the PCIe switch chip) × 2. The actual delay is also related to the timer chip. The system itself cannot query the timer information, but third-party software can output the timer chip information, and the delay introduced by the timer chip is added to the actual delay to make the performance standard more accurate.
(5) The test pass standard of the HCA card is re-established according to the PCIe bandwidth rate of the HCA card corresponding to the BUS_ID, the upstream PCIe bandwidth rate, and the actual delay. When the HCA card under test is replaced, the standard must be updated with this method according to the actual situation, so that a single uniform test standard does not bias the test results.
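The lookup in step (4) is done with grep in the embodiment. An equivalent sketch in Python is shown below, assuming a plain-text list with one "chip type, delay in µs" pair per line. The file layout, the function names, and the "yy" entry are hypothetical; only the "xx, 0.2 µs" example comes from the description.

```python
def load_latency_list(text: str) -> dict:
    """Parse 'chip_type latency_us' lines; '#' starts a comment."""
    table = {}
    for raw in text.splitlines():
        line = raw.split("#", 1)[0].strip()
        if not line:
            continue
        chip, latency = line.split()
        table[chip.lower()] = float(latency)
    return table


def switch_latency_us(table: dict, chip_type: str) -> float:
    """Return the vendor latency for a switch type; 0.0 if the slot is
    CPU-direct (chip_type is empty). Unknown types raise KeyError so a
    missing entry is never silently treated as zero."""
    if not chip_type:
        return 0.0
    return table[chip_type.lower()]


# Hypothetical list; "xx" and 0.2 us come from the description's example,
# "yy" and its value are invented for illustration.
latency_list = """\
# chip_type  latency_us
xx  0.2
yy  0.15
"""
table = load_latency_list(latency_list)
print(switch_latency_us(table, "XX"))
```

Raising an error for an unknown switch type is a deliberate choice in this sketch: a slot behind an unlisted switch would otherwise be judged against the CPU-direct standard, which is the misjudgment the method aims to avoid.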
As shown in fig. 2, the system 200 includes:
an information obtaining unit 210 configured to obtain the HCA card slot, the device information of the uplink PCIe, and the bandwidth rate;
the type identification unit 220 is configured to identify a switch chip type of PCIe according to the device information of the uplink PCIe;
the delay calculating unit 230 is configured to calculate an actual delay according to a delay standard of the CPU directly connected to the PCIe slot and delay data of the PCIe switch chip provided by a manufacturer;
a standard establishing unit 240 configured to re-establish an HCA card test passing standard according to the bandwidth rate and the actual delay of the uplink PCIe;
and a performance testing unit 250 configured to perform the performance test of the AI server HCA card of the hardware topology by using the HCA card test passing standard as a test standard.
Although the present invention has been described in detail in connection with the preferred embodiments with reference to the accompanying drawings, the present invention is not limited thereto. Those skilled in the art can make various equivalent modifications or substitutions to the embodiments of the present invention without departing from its spirit and scope, and such modifications or substitutions fall within the scope of the present invention. Any change or substitution that a person skilled in the art can readily conceive within the technical scope disclosed herein also falls within the scope of the present invention. Therefore, the protection scope of the present invention shall be subject to the protection scope of the claims.
Claims (6)
1. A performance test method for an AI server HCA card based on hardware topology is characterized by comprising the following steps:
acquiring HCA card slot, equipment information of uplink PCIe and bandwidth rate;
identifying the switch chip type of PCIe according to the equipment information of the uplink PCIe;
calculating actual delay according to the delay standard of the CPU directly connected with the PCIe card slot and delay data of the PCIe switch chip provided by a manufacturer;
re-establishing an HCA card test passing standard according to the bandwidth rate and the actual delay of the uplink PCIe;
taking the HCA card passing standard as a test standard, and carrying out the performance test of the AI server HCA card of the hardware topology;
the calculation formula of the actual delay is as follows: actual delay = delay of CPU directly connected to PCIe slot + (delay of PCIe switch chip) × 2;
the method further comprises the following steps:
acquiring delay data of a timer chip;
and adding the delay data update of the timer chip into the actual delay.
2. The method for testing the performance of the AI server HCA card based on the hardware topology as recited in claim 1, further comprising:
acquiring a PCIe tree structure of a hardware topology in a shell script mode;
and the equipment information of the uplink PCIe can be acquired through the configuration of each node in the PCIe tree structure.
3. The method for testing the performance of the HCA card of the AI server based on the hardware topology of claim 1, wherein the obtaining of the device information and the bandwidth rate of the HCA card slot and the uplink PCIe comprises:
traversing the HCA cards, and capturing the BUS_ID and the PCIe bandwidth rate of each HCA card;
acquiring the device information of the upstream PCIe;
and calling a function to acquire the upstream PCIe bandwidth rate corresponding to the BUS _ ID.
4. The AI server HCA card performance testing method based on hardware topology of claim 1, wherein the device information of the upstream PCIe comprises PCIe bus, bridge and device type.
5. The method for testing the performance of the AI server HCA card based on the hardware topology as recited in claim 1, further comprising:
obtaining the PCIe bandwidth rate of the HCA card corresponding to the BUS_ID, the upstream PCIe bandwidth rate, and the actual delay, and establishing the test pass standard of the HCA card.
6. An AI server HCA card performance test system based on hardware topology, characterized by comprising:
the information acquisition unit is configured to acquire the HCA card slot, the equipment information of the uplink PCIe and the bandwidth rate;
the type identification unit is configured to identify the switch chip type of the PCIe according to the equipment information of the uplink PCIe;
the delay calculating unit is configured for calculating actual delay according to delay data of the CPU directly connected with the PCIe card slot and delay data of a PCIe switch chip provided by a manufacturer;
the standard formulating unit is configured for re-formulating the HCA card test passing standard according to the bandwidth rate and the actual delay of the uplink PCIe;
the performance test unit is configured to perform the performance test of the hardware topology AI server HCA card by taking the HCA card passing test standard as a test standard;
the calculation formula of the actual delay is as follows: the actual delay = the delay of the CPU directly connecting the PCIe slot + (delay of PCIe switch chip) × 2;
the AI server HCA card performance testing system is further configured to perform the steps of:
acquiring delay data of a timer chip;
and adding the delay data update of the timer chip into the actual delay.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202011027122.1A CN112231157B (en) | 2020-09-25 | 2020-09-25 | AI server HCA card performance test method and system based on hardware topology |
Publications (2)
Publication Number | Publication Date |
---|---|
CN112231157A | 2021-01-15 |
CN112231157B | 2023-01-10 |
Family
ID=74108863
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202011027122.1A Active CN112231157B (en) | 2020-09-25 | 2020-09-25 | AI server HCA card performance test method and system based on hardware topology |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN112231157B (en) |
Families Citing this family (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN115037651B (en) * | 2022-06-24 | 2023-07-11 | 苏州浪潮智能科技有限公司 | RDMA bandwidth transmission test method, system and storage medium |
Citations (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN104601410A (en) * | 2015-02-02 | 2015-05-06 | 浪潮电子信息产业股份有限公司 | Server automatic HCA card bandwidth testing method |
CN107193699A (en) * | 2017-05-22 | 2017-09-22 | 郑州云海信息技术有限公司 | One kind tests the wide time-delay method of HCA cassette tapes automatically by RDMA modes |
CN107992438A (en) * | 2017-11-24 | 2018-05-04 | 郑州云海信息技术有限公司 | A kind of server and in server flexible configuration PCIe topologys method |
Also Published As
Publication number | Publication date |
---|---|
CN112231157A (en) | 2021-01-15 |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||