CN110376503B

CN110376503B - AI acceleration chip performance test method and device

Info

Publication number: CN110376503B
Application number: CN201910565843.9A
Authority: CN
Inventors: 陈坚; 汪玉; 林峰; 葛广君; 梁爽
Original assignee: Fuzhou Institute Of Data Technology Co ltd
Current assignee: Fuzhou Institute Of Data Technology Co ltd
Priority date: 2019-06-27
Filing date: 2019-06-27
Publication date: 2021-07-27
Anticipated expiration: 2039-06-27
Also published as: CN110376503A

Abstract

The invention discloses a method and a device for testing the performance of an AI acceleration chip. A data record is formed by sampling and recording the start time and the end time of each instruction of each module in the chip; the data records are then sorted into a list, and the list is processed to obtain the instruction running duration and related parameters of each module of the chip; in addition, the parallel instructions for a specified module or a specified time can be looked up in the list and printed as text or displayed as a graph for parallelism analysis. The invention provides not only performance analysis of the computing part but also parallelism analysis of the communication part and the computing part. The invention can search for the parallel instructions within a condition range according to set conditions and analyze instruction parallelism, thereby providing better support for the performance optimization of the chip.

Description

AI acceleration chip performance test method and device

Technical Field

The invention relates to the field of chip testing, in particular to an AI acceleration chip performance testing method and device.

Background

An AI acceleration chip generally includes modules for instruction scheduling, convolution operation, pooling, activation function calculation, and data loading and unloading. Because mainstream AI algorithms currently have a huge number of parameters, data must be moved repeatedly between on-chip memory and off-chip memory. Therefore, the degree of matching between the computing bandwidth and the data bandwidth of the system is the main factor affecting the performance of an AI acceleration chip. In addition, the diversity of AI network models leads to high uncertainty in the operating efficiency of the individual modules. The performance of an AI chip therefore depends not only on the individual performance of the instruction modules but also on data loading and unloading and on the scheduling efficiency of the other instruction modules. It is necessary to provide a test method that analyzes the performance of each instruction module and the parallelism between the instruction modules, so as to facilitate subsequent improvement of chip performance.

Because chips are used for different purposes, existing performance test methods have different focuses. Some test methods focus on the performance of the chip under different utilization rates, which facilitates chip selection; some focus on the running performance of a particular module, which facilitates subsequent improvement; and some focus on the parallelism of the operation of multiple modules of a chip, in order to predict the final performance of the chip at the development stage. At present, the following typical methods are mainly available:

1) Controlling the utilization rate of the chip, obtaining the chip performance test results under different utilization rates, and taking the geometric mean. For example, in the patent "CPU performance evaluation method and device" with application number 201310161217.6, the utilization rate of the CPU is controlled by a control instruction; a benchmark test is performed on the central processing unit (CPU) to obtain performance test results of the CPU under each utilization rate, wherein each performance test result represents the performance of the CPU under one load; and the geometric mean of the multiple performance test results is calculated to obtain the final performance evaluation result of the CPU. That patent can only predict performance by testing instruction segments; it cannot accurately measure the running time of each internal module or the parallelism of different internal modules, and it cannot identify the modules that cause performance bottlenecks.

2) Providing bypass circuits for different modules in the chip, so that the performance of the module under test can be tested after bypassing. For example, in the patent "cell performance test method and system chip for artificial intelligence module" with application number 201910103596.0, for a plurality of AI processing units arranged in a two-dimensional array, each processing unit includes an enable input terminal for receiving an enable signal and pausing or starting the operation of the processing unit according to the enable signal; the processing units that share dimension 1 and/or dimension 2 with the processing unit under test can be configured into a bypass state so as to performance-test the processing unit under test; giving the processing units a bypass function makes the AI module more convenient to test. Because that patent tests the module under test through a bypass, the overall performance of the chip cannot be estimated, and the parallelism among different modules cannot be observed in a real scenario.

3) Predicting the overall performance of the chip by plotting the running time of each computation block in the chip. For example, in the patent "method for predicting GPU performance and corresponding computer system" with application number 201510387995.6, a set of test applications is run on the GPU chip to be evaluated; a set of scalar performance counters and vector performance counters is captured; a model for evaluating and predicting GPU performance for different chip configurations is created based on the captured scalar and vector performance counters; and a performance score of the GPU chip is predicted and bottlenecks in the GPU pipeline are identified. That patent builds a performance model for the parallelism of the computation modules of the GPU; however, a chip includes both a communication part and a computation part, and the communication part sometimes has the greater influence on overall performance.

Disclosure of Invention

The invention aims to provide a method and a device for testing the performance of an AI accelerating chip, which not only provide the performance analysis of a computing part, but also provide the parallelism analysis of a communication part and the computing part. The invention can search the parallel instructions in the condition range according to the set conditions and analyze the instruction parallelism, thereby providing better support for the performance optimization of the chip.

The technical scheme adopted by the invention is as follows:

a performance test method for an AI accelerating chip comprises the following steps:

step 1, starting a global test, and distributing a test instruction to each module of an AI acceleration chip;

step 2, respectively sampling and acquiring the starting time and the ending time of each instruction operated by each module to form a data record and uploading the data record to an external performance analyzer;

step 3, the performance analyzer arranges the data records into a list according to modules;

step 4, calculating the list by using a script language to obtain the running time of each instruction;

step 5, respectively accumulating the running durations of all instructions of each module to obtain the total running duration of each module, and using the per-module totals to identify the instructions that occupy the most running time;

step 6, calculating the list by using a script language to obtain the interval of adjacent instructions in each module;

step 7, searching a parallel instruction operation result in a specified range to be analyzed from the list;

step 8, outputting and displaying the parallel instruction running results found by the search.

Further, each module of the AI acceleration chip in step 1 includes a specific computation module and a communication module responsible for data handling.

Further, in step 2, the AI acceleration chip acquires the reference time through a timer, and records the start time and the end time of the instruction by using the reference time.

Further, the data record in step 2 is written into the volatile memory and uploaded to the performance analyzer after the memory capacity reaches the waterline or the instruction execution is finished.

Further, the data record in step 2 is directly uploaded to the performance analyzer through a communication interface arranged on the AI acceleration chip.

Further, the scripting language in step 4 or 6 is python language.

Further, the table lookup in step 7 includes three modes, specifically as follows:

mode 1: searching parallel instructions in a time range according to set time;

mode 2: searching parallel instructions in a time range according to the specified module instruction serial number;

mode 3: finding the index of the instruction with the longest instruction interval in a given module, and searching for parallel instructions within the corresponding time range.

Further, the specific steps of mode 3 are:

step 7-1, sorting the instruction intervals in the specified module to obtain the index of the instruction with the largest instruction interval;

step 7-2, taking the start time of the instruction q instructions before that index as Time min, and the start time of the instruction p instructions after that index as Time max, where the values of q and p are set by the user;

step 7-3, searching for parallel instructions using Time min and Time max as the time range.
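The three sub-steps of mode 3 can be illustrated with a minimal Python sketch. It is not the patented implementation: the function names, the 0-based indexing, the clamping of the window at the ends of the list, and the overlap rule used to decide whether an instruction is "parallel" with the window are all assumptions made for illustration.

```python
from typing import Dict, List, Tuple

Record = Tuple[int, int]  # (start, end) of one instruction, in timer ticks


def largest_gap_index(rows: List[Record]) -> int:
    """Step 7-1: 0-based index n whose gap Start(n) - End(n-1) is largest.

    Needs at least two records; ties resolve to the earliest index.
    """
    return max(range(1, len(rows)), key=lambda n: rows[n][0] - rows[n - 1][1])


def mode3_window(rows: List[Record], q: int, p: int) -> Tuple[int, int]:
    """Step 7-2: Time min / Time max from the start times q before and p after."""
    n = largest_gap_index(rows)
    lo = max(n - q, 0)               # assumption: clamp at the first instruction
    hi = min(n + p, len(rows) - 1)   # assumption: clamp at the last instruction
    return rows[lo][0], rows[hi][0]


def parallel_instructions(
    table: Dict[str, List[Record]], t_min: int, t_max: int
) -> Dict[str, Tuple[int, int]]:
    """Step 7-3: first and last instruction index per module inside the window.

    Assumption: an instruction counts if its [start, end] overlaps the window.
    """
    found = {}
    for module, rows in table.items():
        hits = [i for i, (s, e) in enumerate(rows) if s <= t_max and e >= t_min]
        if hits:
            found[module] = (hits[0], hits[-1])
    return found


# Hypothetical sample data: two modules with a long idle gap in "load".
table = {
    "load": [(0, 10), (12, 20), (50, 60), (62, 70)],
    "conv": [(5, 25), (26, 48), (55, 80)],
}
t_min, t_max = mode3_window(table["load"], q=1, p=1)
result = parallel_instructions(table, t_min, t_max)
```

Looking at what the other modules were doing around the largest gap is what makes this mode useful for locating stalls, for example a load module waiting on a long convolution.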

The invention also discloses an AI acceleration chip performance testing device, which comprises a chip internal data record generating circuit and a performance analyzer. The chip internal data record generating circuit comprises a test control circuit, a timer, and an instruction time record summarizing and communication circuit. The test control circuit is connected to the timer and to each module in the chip, and is used for controlling the chip to start or finish the performance test. The timer is used for generating a time reference and providing it to each module in the chip. Each module of the AI acceleration chip is provided with a time sampling circuit, which is used for acquiring the running start time and running end time of each instruction in that module. The instruction time record summarizing and communication circuit is used for summarizing the generated instruction running times and uploading them to the performance analyzer.

Further, the instruction time record summarizing and communication circuit comprises a competition judging and record keeping circuit, an internal RAM memory and a communication interface,

the competition judging and record keeping circuit writes the records into an internal RAM (random access memory) according to a fair rotation principle, and when the number of the AI accelerating chip modules is X, all the records are written in X clock cycles;

the competition judging and record keeping circuit is either a single set, in which case the number of running cycles of each instruction is greater than the number of AI acceleration chip modules; or multiple sets operating in rotation, in which case the number of running cycles of each instruction is greater than the number of AI acceleration chip modules and the rotation period of each set is less than the running period of the instruction;

the internal RAM memory is used for storing the recorded data and providing storage state information;

the communication interface is in communication connection with the performance analyzer, and the record is uploaded to the performance analyzer through the communication interface after the capacity of the internal RAM memory reaches a waterline or the instruction operation is finished;

the performance analyzer is a PC, a tablet, a smart phone or a cloud server.

By adopting the technical scheme, the data record is formed by sampling and recording the starting time and the ending time of each instruction of each module in the chip; then, the data records are sorted into a list, and the list is subjected to corresponding calculation processing to obtain the instruction operation duration and relevant parameters of each module of the chip; in addition, parallel instructions of a specified module or specified time can be searched and obtained from the list and are printed in characters or displayed in a graph line for parallel analysis. The invention not only provides the performance analysis of the computing part, but also provides the parallelism analysis of the communication part and the computing part. The invention can search the parallel instructions in the condition range according to the set conditions and analyze the instruction parallelism, thereby providing better support for the performance optimization of the chip.

Drawings

The invention is described in further detail below with reference to the accompanying drawings and the detailed description;

FIG. 1 is a schematic diagram of the testing principle of the present invention;

FIG. 2 is a schematic diagram of an instruction time record summarizing and communication circuit according to the present invention;

FIG. 3 is a schematic diagram of a performance test data generation process according to the present invention;

FIG. 4 is a schematic analysis flow chart of the performance analyzer of the present invention;

FIG. 5 is a schematic representation of a collated list of performance analyzers of the present invention;

FIG. 6 is a schematic diagram of mode 1 in the list lookup of the present invention;

FIG. 7 is a schematic diagram of mode 2 in the list lookup of the present invention;

FIG. 8 is a schematic diagram of mode 3 in the list lookup of the present invention;

FIG. 9 is a graph illustrating the result of the total duration of instruction execution according to the present invention;

FIG. 10 is a diagram illustrating the search results of parallel instructions within a set time range according to the present invention;

FIG. 11 is a diagram illustrating a lookup result of parallel instructions within a time range corresponding to an instruction with an index assigned according to the present invention.

Detailed Description

As shown in any one of figs. 1 to 11, the present invention discloses an AI acceleration chip performance testing device, which comprises a chip internal data record generating circuit and a performance analyzer, wherein the chip internal data record generating circuit comprises a test control circuit, a timer, and an instruction time record summarizing and communication circuit.

The test control circuit is connected to the timer and to each module in the chip and is used for controlling the chip to start or finish the performance test. The timer is used for generating a time reference and providing it to each module in the chip. Each module of the AI acceleration chip is provided with a time sampling circuit, which is used for acquiring the running start time and running end time of each instruction in that module. The modules of the AI acceleration chip can be specific computation modules or communication modules responsible for data transport, so the invention can test both the computation modules and the communication modules. The instruction time record summarizing and communication circuit is used for summarizing the sampled instruction running time data and uploading it to the performance analyzer.

Specifically, as shown in fig. 2, the instruction time record summarizing and communication circuit includes a contention resolution and record keeping circuit which, when multiple records arrive at the same time, writes them into an internal small-capacity RAM memory according to the principle of fair rotation. When the number of modules is X, all records can be written within X clock cycles; the circuit therefore has an applicability condition: the number of running cycles of each instruction must be greater than the number of modules. If this condition is not met, multiple such circuits are provided, and the rotation period of each circuit is smaller than the instruction running period.
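The fair rotation principle can be modelled in software as a round-robin arbiter that grants one pending record per clock cycle. The following Python sketch is only a behavioural model under that assumption; the class and function names are hypothetical, and the actual circuit is hardware whose details the text does not give.

```python
from typing import List, Optional


class RoundRobinArbiter:
    """Behavioural model: grant at most one pending module per clock cycle."""

    def __init__(self, num_modules: int) -> None:
        self.num_modules = num_modules
        self.next_start = 0  # module examined first on the next cycle

    def grant(self, pending: List[bool]) -> Optional[int]:
        """Return the module granted this cycle, or None if nothing is pending."""
        for offset in range(self.num_modules):
            idx = (self.next_start + offset) % self.num_modules
            if pending[idx]:
                # Rotate priority so the granted module is examined last next time.
                self.next_start = (idx + 1) % self.num_modules
                return idx
        return None


def drain(arbiter: RoundRobinArbiter, pending: List[bool]) -> List[int]:
    """Write out all pending records, one per cycle; return the grant order."""
    pending = list(pending)
    order = []
    while any(pending):
        idx = arbiter.grant(pending)
        pending[idx] = False
        order.append(idx)
    return order


arbiter = RoundRobinArbiter(num_modules=4)
# Worst case: all X = 4 modules produce a record in the same cycle.
first = drain(arbiter, [True, True, True, True])
second = drain(arbiter, [True, False, True, False])
third = drain(arbiter, [True, True, False, True])
```

In this model the worst case takes exactly X cycles to drain, which is why each instruction must run for more than X cycles if a module is never to produce a second record before its first one has been written.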

The internal small-capacity RAM is used to store the recorded data and to provide storage status information. Each written record contains the instruction type, the sequence number of the instruction within its type (that is, which instruction of that type it is), and the start and end times of the instruction.

When the stored amount reaches a certain waterline (chosen so that the RAM is guaranteed not to overflow), or when all instructions have finished running, the communication interface is started and the records are uploaded to the performance analyzer. The communication interface can be an out-of-band interface of the chip, such as an SPI interface or a gigabit interface. Note that the transmission bandwidth of the communication interface must be sufficient to ensure that the internal small-capacity RAM does not overflow.
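The record layout and the waterline-triggered upload described above can be sketched in Python as follows. This is a software model only: the field names, the `RecordBuffer` class, and the `upload` callback standing in for the communication interface are assumptions, not part of the patent text.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass(frozen=True)
class InstRecord:
    inst_type: str  # instruction type, e.g. "load" or "conv"
    seq: int        # sequence number within that instruction type
    start: int      # start time in timer ticks
    end: int        # end time in timer ticks


class RecordBuffer:
    """Model of the small RAM: upload a batch once the waterline is reached."""

    def __init__(
        self,
        capacity: int,
        waterline: int,
        upload: Callable[[List[InstRecord]], None],
    ) -> None:
        if not 0 < waterline <= capacity:
            raise ValueError("waterline must lie within the RAM capacity")
        self.capacity = capacity
        self.waterline = waterline
        self.upload = upload  # stands in for the communication interface
        self._records: List[InstRecord] = []

    def write(self, record: InstRecord) -> None:
        self._records.append(record)
        if len(self._records) >= self.waterline:
            self.flush()

    def flush(self) -> None:
        """Also called once when all instructions have finished running."""
        if self._records:
            self.upload(list(self._records))
            self._records.clear()


uploaded: List[List[InstRecord]] = []
buffer = RecordBuffer(capacity=8, waterline=3, upload=uploaded.append)
for n in range(4):
    buffer.write(InstRecord("load", n, start=10 * n, end=10 * n + 8))
buffer.flush()  # end of test: upload whatever is left
```

The model uploads instantaneously; in hardware the waterline has to sit far enough below the capacity that records arriving during an upload still fit, which is the bandwidth condition stated above.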

Specifically, as an implementation manner, the performance analyzer is a PC, a tablet, or a smart phone, and has the following functions:

a) performing data analysis on the uploaded data records;

b) displaying or printing the running time of each instruction;

c) counting the total running time of each type of instruction;

d) searching for and displaying the instructions running in parallel within a condition range according to set conditions.

Further, the invention also discloses a method for testing the performance of the AI accelerating chip, which comprises the following specific steps:

Steps 1 to 2 of the present invention relate to the process of generating performance test data, as shown in fig. 3.

step 1, receiving an instruction operation starting signal, starting a global timer, providing a time reference, starting a global test, and distributing a test instruction to each module of an AI acceleration chip; further, each module of the AI acceleration chip in step 1 includes a specific computation module and a communication module responsible for data handling.

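step 2, respectively sampling and acquiring the start time and the end time of each instruction run by each module to form a data record and uploading the data record to an external performance analyzer; specifically, the start time and the end time of each instruction are saved as one record;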

Further, as one embodiment, the data record in step 2 is written into a volatile memory and uploaded to the performance analyzer after the memory capacity reaches the waterline or the instruction execution is finished. As another embodiment, the data record in step 2 is uploaded directly to the performance analyzer through a communication interface arranged on the AI acceleration chip, to reduce the use of internal memory.

Steps 3 to 8 of the present invention relate to the analysis flow of the performance analyzer, as shown in fig. 4, where Start(n) refers to the start time of the nth instruction in the module and End(n) refers to the end time of the nth instruction in the module.

Step 3, the performance analyzer arranges the data records into a list by module; the instruction time records of each module are arranged into the list format shown in fig. 5 to facilitate subsequent analysis and processing, wherein Start represents the start time of the instruction and End represents the end time of the instruction.
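As a rough illustration of step 3, the uploaded records can be grouped by module and ordered by sequence number. The tuple layout `(module, seq, start, end)` and the function name `build_table` are assumptions for this sketch; the actual list format is the one shown in fig. 5.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

RawRecord = Tuple[str, int, int, int]  # (module, seq, start, end)
Row = Tuple[int, int, int]             # (seq, start, end)


def build_table(records: List[RawRecord]) -> Dict[str, List[Row]]:
    """Group records by module and sort each module's rows by sequence number."""
    grouped: Dict[str, List[Row]] = defaultdict(list)
    for module, seq, start, end in records:
        grouped[module].append((seq, start, end))
    return {module: sorted(rows) for module, rows in grouped.items()}


# Hypothetical upload in which records from two modules arrive interleaved.
raw = [
    ("conv", 1, 26, 48),
    ("load", 0, 0, 10),
    ("conv", 0, 5, 25),
    ("load", 1, 12, 20),
]
table = build_table(raw)
```

Sorting by sequence number matters because records from different modules are interleaved by the arbitration circuit and may reach the analyzer out of per-module order.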

Step 4, calculating the list by using a script language, for example the python language, to obtain the running duration of each instruction. The formula is as follows:

Module_x_Inst_cycle(n) = End(n) - Start(n), 1 ≤ n ≤ index_max

wherein Start(n) is the start time of the nth instruction in module x; End(n) is the end time of the nth instruction in module x; Module_x_Inst_cycle(n) is the running duration of the nth instruction of module x; and index_max is the maximum instruction number.

Step 5, respectively accumulating the running durations of all instructions of each module to obtain the total running duration of each module; the per-module totals are used to identify the instructions that occupy the most running time;
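Steps 4 and 5 amount to a subtraction per instruction followed by a sum per module. A minimal Python sketch, with hypothetical function names and sample data, might look like this:

```python
from typing import Dict, List, Tuple

Record = Tuple[int, int]  # (Start(n), End(n)) in timer ticks


def instruction_cycles(rows: List[Record]) -> List[int]:
    """Step 4: Module_x_Inst_cycle(n) = End(n) - Start(n) for every instruction."""
    return [end - start for start, end in rows]


def module_totals(table: Dict[str, List[Record]]) -> Dict[str, int]:
    """Step 5: total running duration of each module."""
    return {module: sum(instruction_cycles(rows)) for module, rows in table.items()}


def busiest_module(table: Dict[str, List[Record]]) -> str:
    """Module whose instructions occupy the most running time in total."""
    totals = module_totals(table)
    return max(totals, key=totals.get)


# Hypothetical per-module list as produced by step 3.
table = {
    "load": [(0, 10), (12, 20)],
    "conv": [(5, 25), (26, 48)],
}
cycles = instruction_cycles(table["load"])
totals = module_totals(table)
```

These totals are per module and ignore overlap between modules, so they indicate which module is busiest rather than the wall-clock time of the whole run.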

step 6, calculating the list by using a script language to obtain the interval of adjacent instructions in each module; the script language is python language, and the specific formula is as follows:

Module_x_gap_cycle(n) = Start(n) - End(n-1), 2 ≤ n ≤ index_max

wherein Module_x_gap_cycle(n) is the interval between the nth instruction and the (n-1)th instruction in module x.
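The gap formula can be computed in the same way as the duration formula. In the sketch below the function name and sample data are hypothetical, and Python's 0-based list positions are used, so element 0 of the result corresponds to n = 2 in the formula's 1-based numbering.

```python
from typing import List, Tuple

Record = Tuple[int, int]  # (Start(n), End(n)) in timer ticks


def gap_cycles(rows: List[Record]) -> List[int]:
    """Module_x_gap_cycle(n) = Start(n) - End(n-1) for each adjacent pair."""
    return [rows[n][0] - rows[n - 1][1] for n in range(1, len(rows))]


# Hypothetical module whose third instruction starts after a long idle period.
rows = [(0, 10), (12, 20), (50, 60)]
gaps = gap_cycles(rows)
```

A large gap means the module sat idle between two instructions, which is the situation that lookup mode 3 is designed to investigate.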

Step 7, searching a parallel instruction operation result in a specified range to be analyzed from the list; further, the table lookup in step 7 includes three modes, specifically as follows:

Mode 1: searching for parallel instructions within a time range according to a set time; the principle is shown in fig. 6, where Time min and Time max bound the display time range set by the user.

Mode 2: searching for parallel instructions within a time range according to specified instruction sequence numbers of a specified module; the principle is shown in fig. 7, where Time min and Time max bound the display time range set by the user.

further, as shown in fig. 8, the specific steps of mode 3 are:

Step 8, outputting and displaying the parallel instruction running results found by the search. The search results are displayed as a graph, or the parallel instruction running results are printed as text.

The effect of the present invention will be described below with reference to a test example of an AI accelerator chip.

1) Fig. 9 shows the displayed result of the total instruction running durations measured for the AI acceleration chip, where Load refers to the total running duration of the data load instructions, Save refers to the total running duration of the calculation result store instructions, Conv refers to the total running duration of the convolution instructions, and Pooling refers to the total running duration of the pooling calculation instructions.

2) As shown in fig. 10, the parallel instructions within the set time range of the AI accelerator chip are further tested, and the display results are as follows:

Time from 1000 to 15000000 inst seq is:

Load inst seq: [2 : 7141]

save inst seq: [0 : 1749]

conv inst seq: [0 : 311]

misc inst seq: [0 : 271]

3) As shown in fig. 11, the parallel instructions within the time range corresponding to instructions with specified indexes (instruction indexes 1000 to 1200 of specified module 1) are further tested, and the display results are as follows:

Time from 2841401 to 3155305 inst seq is:

Load inst seq: [1000 : 1200]

save inst seq: [536 : 639]

conv inst seq: [74 : 81]

misc inst seq: [110 : 130]
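Text reports of the kind shown in examples 2) and 3) could be produced by a lookup such as the following Python sketch. It covers mode 1 (a time range given directly) and mode 2 (a time range derived from an instruction index range). The function names, the overlap rule, and the choice of Start(first) and End(last) as the mode 2 window are assumptions; the exact conventions of figs. 6 and 7 are not stated in the text.

```python
from typing import Dict, List, Tuple

Record = Tuple[int, int]  # (start, end) in timer ticks


def find_parallel(
    table: Dict[str, List[Record]], t_min: int, t_max: int
) -> Dict[str, Tuple[int, int]]:
    """Mode 1: first and last instruction index per module inside the window.

    Assumption: an instruction counts if its [start, end] overlaps the window.
    """
    found = {}
    for module, rows in table.items():
        hits = [i for i, (s, e) in enumerate(rows) if s <= t_max and e >= t_min]
        if hits:
            found[module] = (hits[0], hits[-1])
    return found


def window_for_indices(rows: List[Record], first: int, last: int) -> Tuple[int, int]:
    """Mode 2: time window spanned by instructions first..last of one module.

    Assumption: the window runs from Start(first) to End(last).
    """
    return rows[first][0], rows[last][1]


def format_report(table: Dict[str, List[Record]], t_min: int, t_max: int) -> str:
    """Render the lookup result in the style of the text output shown above."""
    lines = [f"Time from {t_min} to {t_max} inst seq is:"]
    for module, (lo, hi) in find_parallel(table, t_min, t_max).items():
        lines.append(f"{module} inst seq: [{lo} : {hi}]")
    return "\n".join(lines)


# Hypothetical sample data for two modules.
table = {
    "load": [(0, 10), (12, 20), (50, 60), (62, 70)],
    "conv": [(5, 25), (26, 48), (55, 80)],
}
mode1 = find_parallel(table, 11, 49)
t_min, t_max = window_for_indices(table["load"], 2, 3)
report = format_report(table, t_min, t_max)
```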

Claims

1. An AI acceleration chip performance testing method, characterized in that it comprises the following steps:

2. The AI acceleration chip performance testing method of claim 1, characterized in that: each module of the AI acceleration chip in step 1 includes a specific computation module and a communication module responsible for data handling.

3. The AI acceleration chip performance testing method of claim 1, characterized in that: in step 2, the AI accelerating chip acquires the reference time through a timer and records the start time and the end time of the instruction by using the reference time.

4. The AI acceleration chip performance testing method of claim 1, characterized in that: the data record in step 2 is written into a volatile memory and uploaded to a performance analyzer after the memory capacity reaches a waterline or the instruction operation is finished.

5. The AI acceleration chip performance testing method of claim 1, characterized in that: the data record in step 2 is directly uploaded to a performance analyzer through a communication interface arranged on the AI acceleration chip.

6. The AI acceleration chip performance testing method of claim 1, characterized in that: the scripting language in step 4 or 6 is python language.

7. The AI acceleration chip performance testing method of claim 1, characterized in that: the table lookup in step 7 includes three modes, specifically as follows:

mode 1: searching parallel instructions in a time range according to set time;

mode 3: finding the index of the instruction with the longest instruction interval in a given module, and searching for parallel instructions within the corresponding time range.

8. The AI acceleration chip performance testing method of claim 7, characterized in that: the specific steps of mode 3 are:

9. An AI acceleration chip performance testing device, characterized in that: the device comprises a chip internal data record generating circuit and a performance analyzer, wherein the chip internal data record generating circuit comprises a test control circuit, a timer and an instruction time record summarizing and communication circuit; the test control circuit is connected to the timer and to each module in the chip and is used for controlling the chip to start or finish the performance test; the timer is used for generating a time reference and providing the time reference to each module in the chip; each module of the AI acceleration chip is provided with a time sampling circuit, and the time sampling circuit is used for acquiring the running start time and the running end time of each instruction in that module; and the instruction time record summarizing and communication circuit is used for summarizing the sampled instruction running time data and uploading the instruction running time data to the performance analyzer.

10. The AI acceleration chip performance testing device of claim 9, characterized in that: the instruction time record summarizing and communication circuit comprises a competition judging and record keeping circuit, an internal RAM memory and a communication interface,

the performance analyzer is a PC, a tablet, a smart phone or a cloud server.