CN111242314A - Deep learning accelerator benchmark test method and device - Google Patents



Publication number
CN111242314A
Authority
CN
China
Prior art keywords
network model
test
accelerator
tested
determining
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202010017521.3A
Other languages
Chinese (zh)
Other versions
CN111242314B (en)
Inventor
张蔚敏
孙明俊
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
China Academy of Information and Communications Technology CAICT
Original Assignee
China Academy of Information and Communications Technology CAICT
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by China Academy of Information and Communications Technology CAICT filed Critical China Academy of Information and Communications Technology CAICT
Priority to CN202010017521.3A
Publication of CN111242314A
Application granted
Publication of CN111242314B
Legal status: Active

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N20/00 Machine learning
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00 Energy efficient computing, e.g. low power processors, power management or thermal management


Abstract

The application provides a deep learning accelerator benchmark test method and device, wherein the method comprises the following steps: selecting a network model for testing according to the deployment position and the application scene of the accelerator to be tested; acquiring test data and preprocessing the test data; configuring a benchmark test environment component, loading the network model, and running the preprocessed test data on the accelerator to be tested; and acquiring a test result, and determining a test index of the tested accelerator. The method can comprehensively and accurately carry out benchmark test on the deep learning accelerator.

Description

Deep learning accelerator benchmark test method and device
Technical Field
The invention relates to the technical field of benchmark testing, in particular to a deep learning accelerator benchmark testing method and device.
Background
With the rapid development of AI technologies and applications represented by deep learning, many vendors integrate dedicated AI acceleration chips/processors or IP blocks into their platforms or products. Such hardware systems are proliferating, and how to fairly and systematically evaluate chip performance and guide hardware optimization has become a research hotspot.
As algorithms and applications gradually mature, deploying trained models to complete inference tasks has become a market-demand hotspot, and inference scenarios are abundant, for example in data centers, autonomous driving, security, mobile phones, and smart homes. Against a background in which the market's technical routes are unsettled and products are uneven, a benchmark test evaluation system based on a clear index system reflecting the current state of the art is provided to objectively characterize AI chips and thereby promote the healthy development of the industry.
As computer architectures evolve, it becomes more difficult to compare the performance of various computer systems simply by looking at their specifications. It is therefore necessary to develop corresponding tests that can compare different architectures, and how to perform a comprehensive and efficient benchmark test is an urgent technical problem to be solved.
Disclosure of Invention
In view of this, the present application provides a deep learning accelerator benchmark testing method and apparatus, which can comprehensively and accurately perform a benchmark test on a deep learning accelerator.
In order to solve the technical problem, the technical scheme of the application is realized as follows:
in one embodiment, a deep learning accelerator benchmarking method is provided, the method comprising:
selecting a network model for testing according to the deployment position and the application scene of the accelerator to be tested;
acquiring test data and preprocessing the test data;
configuring a benchmark test environment component, loading the network model, and running the preprocessed test data on the accelerator to be tested;
and acquiring a test result, and determining a test index of the tested accelerator.
In another embodiment, a deep learning accelerator benchmark test apparatus is provided, the apparatus comprising: a selection unit, a processing unit, a test unit, an acquisition unit, and a determination unit;
the selection unit is used for selecting a network model for testing according to the deployment position and the application scene of the accelerator to be tested;
the processing unit is used for acquiring test data and preprocessing the test data;
the test unit is used for configuring a benchmark test environment component, loading the network model selected by the selection unit, and running test data preprocessed by the processing unit on the accelerator to be tested;
the acquisition unit is used for acquiring a test result;
the determining unit is used for determining the test index of the tested accelerator according to the test result acquired by the acquiring unit.
In another embodiment, an electronic device is provided that includes a memory, a processor, and a computer program stored on the memory and executable on the processor, the processor implementing the steps of the deep learning accelerator benchmarking method as described when executing the program.
In another embodiment, a computer readable storage medium is provided, having stored thereon a computer program which, when executed by a processor, performs the steps of the deep learning accelerator benchmarking method.
According to the technical scheme, the benchmark test is realized by selecting the model, preprocessing the test data and configuring the test environment assembly, the benchmark test index is determined according to the test result, and the benchmark test can be comprehensively and accurately performed on the deep learning accelerator.
Drawings
In order to more clearly illustrate the technical solutions in the embodiments of the present application, the drawings needed to be used in the description of the embodiments are briefly introduced below, and it is obvious that the drawings in the following description are only some embodiments of the present application, and it is obvious for those skilled in the art to obtain other drawings based on these drawings without inventive labor.
FIG. 1 is a schematic diagram of a basic test flow of a deep learning accelerator according to an embodiment of the present application;
FIG. 2 is a diagram of benchmark set-up data flow;
FIG. 3 is a diagram illustrating a benchmark hardware environment in an embodiment of the present application;
FIG. 4 is a schematic diagram of a benchmark index system in an embodiment of the present application;
FIG. 5 is a schematic diagram of an apparatus for implementing the above technique in an embodiment of the present application;
fig. 6 is a schematic physical structure diagram of an electronic device according to an embodiment of the present invention.
Detailed Description
The technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the drawings in the embodiments of the present application, and it is obvious that the described embodiments are only a part of the embodiments of the present application, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present application.
The terms "first," "second," "third," "fourth," and the like in the description and in the claims, as well as in the drawings, if any, are used for distinguishing between similar elements and not necessarily for describing a particular sequential or chronological order. It is to be understood that the data so used is interchangeable under appropriate circumstances such that the embodiments of the invention described herein are, for example, capable of operation in sequences other than those illustrated or otherwise described herein. Furthermore, the terms "comprising" and "having," as well as any variations thereof, are intended to cover non-exclusive inclusions. For example, a process, method, system, article, or apparatus that comprises a list of steps or elements is not necessarily limited to those steps or elements explicitly listed, but may include other steps or elements not explicitly listed or inherent to such process, method, article, or apparatus.
The technical solution of the present invention will be described in detail with specific examples. Several of the following embodiments may be combined with each other and some details of the same or similar concepts or processes may not be repeated in some embodiments.
The embodiment of the application provides a deep learning accelerator benchmark test method, which realizes benchmark test by selecting a model, preprocessing test data and configuring a test environment assembly, determines benchmark test indexes according to test results, and can comprehensively and accurately perform benchmark test on a deep learning accelerator.
Referring to fig. 1, fig. 1 is a schematic diagram of a basic test flow of a deep learning accelerator in an embodiment of the present application. The method comprises the following specific steps:
step 101, selecting a network model for testing according to the deployment position and the application scene of the accelerator to be tested.
The deployment location may be: cloud, edge, end;
the application scenario may be: image classification, target detection, semantic segmentation, super-resolution, face recognition and the like.
Selecting the network model for testing according to the deployment position and application scenario of the accelerator to be tested includes:
selecting a network model according to the deployment position and the application scenario of the accelerator to be tested.
Taking the deployment location "end" (device side) as an example:
the network model selected for the image classification application scenario may be: MobileNet;
the network model selected for the target detection application scenario may be: SSD;
the network model selected for the semantic segmentation application scenario may be: DeepLab v3;
the network model selected for the super-resolution application scenario may be: VDSR;
the network model selected for the face recognition application scenario may be: Inception-ResNet.
During a specific test, one network model can be selected according to the corresponding deployment position and application scenario, or further network models can be added to form a set of models under test; each network model is then used for testing to obtain a test result, and finally the benchmark indices are determined.
If network models are added, the newly added network model and the already-selected network models must satisfy the model-difference-degree principle.
The model-difference-degree principle is satisfied when the difference degree between two models is greater than a preset value, such as 30%, but is not limited thereto.
Suppose one model is denoted as M and the other model is denoted as M′. The difference degree between the two models may be calculated, for example, as a weighted sum of the relative differences of five model parameters:
Diff(M, M′) = K1·d(MAC) + K2·d(Input size) + K3·d(Model size) + K4·d(Accuracy) + K5·d(Layer), where d(p) = |p_M − p_M′| / max(p_M, p_M′),
wherein MAC is the accumulated multiply-add operation count of the model, Input size is the input size of the model, Model size is the memory occupied by the model, Accuracy is the preset accuracy of the model, Layer is the number of layers of the model, and K1, K2, K3, K4, K5 are the weights of the respective parameters, the sum of the 5 weights being 1.
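The model-selection check above can be sketched in Python. The relative-difference normalization and all parameter values below are illustrative assumptions; only the five parameters and the sum-to-1 weights follow the text:

```python
def model_difference(m1, m2, weights=None):
    """Difference degree between two network models as a weighted sum of
    per-parameter relative differences (the normalization is an assumption;
    the five parameters and sum-to-1 weights K1..K5 follow the text)."""
    keys = ("mac", "input_size", "model_size", "accuracy", "layers")
    if weights is None:
        weights = dict.fromkeys(keys, 0.2)  # K1..K5, summing to 1
    assert abs(sum(weights.values()) - 1.0) < 1e-9
    return sum(
        weights[k] * abs(m1[k] - m2[k]) / max(m1[k], m2[k]) for k in keys
    )

# Illustrative parameter values, not measurements from the patent.
mobilenet = {"mac": 569e6, "input_size": 224, "model_size": 16.9,
             "accuracy": 0.709, "layers": 28}
candidate = {"mac": 1200e6, "input_size": 300, "model_size": 27.3,
             "accuracy": 0.750, "layers": 50}

# A candidate joins the model pool only if it differs enough, e.g. > 30%.
print(model_difference(mobilenet, candidate) > 0.30)  # True
```

A model identical to one already in the pool yields a difference of 0 and is rejected.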
Step 102, obtaining test data and preprocessing the test data.
In this step, the test data is preprocessed, including:
and carrying out scaling processing and standardization processing on the test data so that the processed test data meets the requirements of the selected network model.
An image preprocessing process is given below by taking image test data as an example, but is not limited to the following preprocessing process:
firstly, the size of each frame image in the test data is scaled according to the input requirement of the network model:
if the test data contains N frames of images, with Ii denoting the ith frame and height and width being the input size required by the model, the scaled image is denoted I′i, where I′i = createScaledBitmap(Ii, height, width), createScaledBitmap being the scaling function provided by Java (the Android Bitmap API);
secondly, three-channel z-score normalization is applied to the scaled images. Denoting the benchmark input image sequence as images:
images_i^(j)(k, l) = (I′_i^(j)(k, l) − μj) / σj,
where j = 1, 2, 3 indexes the RGB channels, k = 1, …, height, l = 1, …, width, μj is the mean of channel j of the image, and σj is the standard deviation of channel j.
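The scaling and per-channel z-score steps above can be sketched with NumPy. The nearest-neighbour resize stands in for the Java createScaledBitmap call and the epsilon guard is an illustrative assumption:

```python
import numpy as np

def scale_image(img, height, width):
    """Nearest-neighbour resize standing in for Android's createScaledBitmap
    (the resampling method is an assumption; only the target size matters)."""
    h, w = img.shape[:2]
    rows = np.arange(height) * h // height
    cols = np.arange(width) * w // width
    return img[rows][:, cols]

def z_score_normalize(img):
    """Three-channel z-score: per RGB channel, subtract the mean and divide
    by the standard deviation."""
    mean = img.mean(axis=(0, 1), keepdims=True)  # one mean per channel
    std = img.std(axis=(0, 1), keepdims=True)    # one std per channel
    return (img - mean) / (std + 1e-8)           # epsilon guards flat channels

# One synthetic test frame; a real harness would iterate over N frames.
frame = np.random.randint(0, 256, size=(480, 640, 3)).astype(np.float32)
model_input = z_score_normalize(scale_image(frame, 224, 224))
print(model_input.shape)  # (224, 224, 3)
```

After normalization each channel of the model input has mean approximately 0 and standard deviation approximately 1, as the selected network models expect.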
step 103, configuring a benchmark test environment component, loading the network model, and running the preprocessed test data on the accelerator to be tested.
After the network model is selected, the benchmark test environment components are configured to complete construction of the runtime environment, including determining the system environment, the hardware acceleration mode of the processor, and the implementation format of the model, and completing model conversion, including work such as pruning and quantization.
When the environment is built, the way the framework calls the accelerator under test can be determined according to the model of the accelerator; several ways can be prepared in advance, and the way used in a specific test is determined by the accelerator model.
The accelerator is invoked using the selected network model (the selected network model is by default a trained model; if no trained model exists, training can be performed first, and the trained network model is then used for testing). The invocation modes are specifically as follows:
first, invocation through a deep learning inference framework;
for example, the NCNN framework + the CPU of the accelerator under test (mobile phone side, ARM-based).
Secondly, invocation through the accelerator's own acceleration stack;
for example, HiAI (the own stack of the accelerator under test) + NPU (Kirin 980 board);
SNPE + the DSP of the accelerator under test (Qualcomm Snapdragon 855/865 handset);
TensorRT + the GPU of the accelerator under test (NVIDIA T4).
And thirdly, invocation through Android NN.
For example, the TFLite framework via the Android NN API + the CPU of the accelerator under test (an Android handset, such as one carrying an MTK chip);
referring to fig. 2, fig. 2 is a schematic diagram of benchmark test component building data flow, fig. 2 shows three ways of calling the accelerator under test, wherein ① represents a first calling way, ② represents a second calling way, and ③ represents a third calling way.
If the accelerator under test is a CPU, the first or third calling mode can be selected; if it is a GPU, the second or third; if a DSP or NPU, the second or third; if an NNA, the second; and if a DPU, the second.
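The mapping from accelerator type to permitted invocation modes can be prototyped as a small lookup table. The mode names and this function are hypothetical; only the CPU/GPU/DSP/NPU/NNA/DPU mapping follows the text:

```python
# Permitted invocation modes per accelerator type.
CALL_MODES = {
    "CPU": {"inference_framework", "android_nn"},  # e.g. NCNN / TFLite
    "GPU": {"vendor_stack", "android_nn"},         # e.g. TensorRT
    "DSP": {"vendor_stack", "android_nn"},         # e.g. SNPE
    "NPU": {"vendor_stack", "android_nn"},         # e.g. HiAI
    "NNA": {"vendor_stack"},
    "DPU": {"vendor_stack"},
}

def select_call_mode(accelerator_type, preferred=None):
    """Pick how the benchmark harness invokes the accelerator under test."""
    modes = CALL_MODES[accelerator_type.upper()]
    if preferred is not None and preferred in modes:
        return preferred
    return sorted(modes)[0]  # deterministic fallback

print(select_call_mode("NNA"))                          # vendor_stack
print(select_call_mode("CPU", preferred="android_nn"))  # android_nn
```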
During testing, determining a mode for calling the accelerator to be tested according to the model of the accelerator to be tested;
and loading the network model during testing, and calling the accelerator to be tested by the preprocessed test data in a determined mode.
And 104, acquiring a test result and determining a test index of the tested accelerator.
During a specific test, the test time, the number of input image frames, the test accuracy, and the power consumption need to be acquired; acquiring power consumption additionally requires a power consumption test tool, for example a power consumption test instrument. Referring to fig. 3, fig. 3 is a schematic diagram of the benchmark test hardware environment in an embodiment of the present application, showing the connection between the test apparatus, the power consumption test tool, and the Device Under Test (DUT). The accelerator under test is carried on the device under test, such as a mobile phone, board, or server.
In the embodiment of the application, determining the test index of the tested accelerator according to the test result includes one or any combination of the following:
time delay, throughput, loss of precision value, power consumption, computational power energy consumption ratio.
Wherein,
determining the time delay comprises: using the ith network model test for each input of an image in a single frameTest time T of1iAnd the total number of frames N of the input image1iThen the time delay Metric is determined1Comprises the following steps:
Figure BDA0002359457100000071
determining the throughput comprises: when inputting images in batches of B2i = 2^j frames (j = 0, 1, 2, 3, …, 7), recording the test time T2ij used when testing with the ith network model and the number N2ij of inputs, and determining the throughput of the ith network model at batch size 2^j as Throughput2ij = N2ij × B2ij / T2ij; among the batch modes satisfying T2ij/N2ij < 7 s, the throughput with the largest value is taken as the throughput Throughput2i determined when testing with the ith network model (for a network model with 8 batch input modes, one throughput is obtained per mode; only modes with T2ij/N2ij < 7 s are considered, and the maximum throughput under that constraint is selected); the throughput Metric2 of the accelerator under test is then determined as:
Metric2 = (1/Num) · Σ_i Throughput2i;
determining the loss of precision value comprises: obtaining the test accuracy Accuracy_i^real of the ith network model when testing with it; the precision loss Metric3 is then determined as:
Metric3 = (1/Num) · Σ_i (Accuracy_i − Accuracy_i^real) / Accuracy_i;
wherein Accuracy_i is the preset accuracy of the ith network model;
determining the power consumption comprises: acquiring the energy Wi consumed during the test with the ith network model and the total frame number N4i of the input images; the power consumption Metric4 is then determined as:
Metric4 = (1/Num) · Σ_i (Wi / N4i);
the unit of Wi is the watt-second;
determining the computational-power-to-energy ratio comprises: determining the computational-power-to-energy ratio Metric5 as the ratio of the computing rate to the average power, averaged over the models:
Metric5 = (1/Num) · Σ_i (MACi × N4i / T4i) / (Wi / T4i);
wherein T4i is the test time when testing with the ith network model, N4i is the total frame number of the input images, MACi is the accumulated multiply-add operation count of the ith network model, and Num is the number of network models.
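The five indices above can be aggregated from raw measurements. The sketch below assumes a plain average over the Num network models and per-frame accounting; the field names and all values are illustrative, not taken from the patent:

```python
from dataclasses import dataclass

@dataclass
class ModelRun:
    """Per-model measurements collected by the benchmark harness (illustrative)."""
    t_single: float         # total test time in single-frame mode, seconds
    n_single: int           # frames processed in single-frame mode
    best_throughput: float  # best frames/s among batch sizes meeting the 7 s rule
    acc_real: float         # accuracy measured on the accelerator under test
    acc_ref: float          # preset (reference) accuracy of the model
    energy: float           # energy consumed during the test, watt-seconds
    n_frames: int           # frames processed during the energy test
    macs: float             # multiply-add operations per frame

def benchmark_indices(runs):
    """Aggregate the five benchmark indices by averaging over the model set."""
    num = len(runs)
    latency = sum(r.t_single / r.n_single for r in runs) / num          # s/frame
    throughput = sum(r.best_throughput for r in runs) / num             # frames/s
    precision_loss = sum((r.acc_ref - r.acc_real) / r.acc_ref for r in runs) / num
    power = sum(r.energy / r.n_frames for r in runs) / num              # J/frame
    ops_per_joule = sum(r.macs * r.n_frames / r.energy for r in runs) / num
    return latency, throughput, precision_loss, power, ops_per_joule

run = ModelRun(t_single=50.0, n_single=1000, best_throughput=120.0,
               acc_real=0.700, acc_ref=0.709, energy=200.0,
               n_frames=1000, macs=569e6)
lat, thr, loss, pw, ratio = benchmark_indices([run])
print(lat)  # 0.05
```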
The five test indices can be displayed in an index-system diagram, which gives a clear and visual picture of each index for the processors under test.
Referring to fig. 4, fig. 4 is a schematic diagram of the benchmark test index system in the embodiment of the present application. In fig. 4, the center is the origin, and the farther an index value lies from the center, the stronger the corresponding capability of the processor under test. The overall performance of a processor under test can also be evaluated comprehensively: the larger the area enclosed by connecting the five index values in sequence, the better its performance. In fig. 4, since the area for processor 1 (enclosed by the dashed line) is larger than the area for processor 2 (enclosed by the solid line), the performance of processor 1 is better than that of processor 2.
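The area comparison described for fig. 4 can be computed directly: for five values plotted on equally spaced radar axes, the enclosed polygon is a sum of triangles. A minimal sketch, with illustrative index values already normalized so that larger is better:

```python
import math

def radar_area(values):
    """Area of the polygon obtained by plotting index values on equally
    spaced radar-chart axes and connecting them in sequence."""
    n = len(values)
    wedge = math.sin(2.0 * math.pi / n)  # sine of the angle between axes
    return sum(0.5 * values[i] * values[(i + 1) % n] * wedge
               for i in range(n))

processor1 = [0.9, 0.8, 0.7, 0.9, 0.8]  # dashed pentagon in fig. 4
processor2 = [0.6, 0.7, 0.5, 0.6, 0.7]  # solid pentagon in fig. 4
print(radar_area(processor1) > radar_area(processor2))  # True
```

Indices where smaller is better, such as time delay and power consumption, would first be inverted or rescaled before plotting.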
The embodiment of the application is based on benchmark testing, an objective evaluation method that plays an important role in the evolution of computer architectures, and provides an end-to-end deep learning processor evaluation method targeted at the particularities of deep learning processors in architecture design and implementation. Model similarity is analyzed through key model information and a threshold is set to determine the model pool, so that the tested models cover operator-level evaluation (micro test) while reflecting processor performance in specified application scenarios (macro test); the five classes of indices output by the benchmark tool objectively and comprehensively reflect the level of a deep learning processor/accelerator/IP. The method provided by the embodiment of the application for evaluating the performance of a deep learning processor as actually deployed offers valuable ideas and suggestions for building an AI chip benchmark test system and evaluation platform, and has broad market prospects and application value.
Based on the same inventive concept, the embodiment of the application also provides a deep learning accelerator benchmark testing device. Referring to fig. 5, fig. 5 is a schematic structural diagram of an apparatus applied to the above technology in the embodiment of the present application. The device comprises: a selection unit 501, a processing unit 502, a test unit 503, an acquisition unit 504, and a determination unit 505;
a selecting unit 501, configured to select a network model for testing according to a deployment position and an application scenario of an accelerator to be tested;
a processing unit 502, configured to obtain test data and perform preprocessing on the test data;
the test unit 503 is configured to configure a benchmark test environment component, load the network model selected by the selection unit 501, and run test data preprocessed by the processing unit 502 on the accelerator to be tested;
an obtaining unit 504, configured to obtain a test result;
a determining unit 505, configured to determine a test index of the accelerator under test according to the test result obtained by the obtaining unit 504.
Preferably,
the selection unit 501 is specifically configured to select a reference network model according to the deployment position and the application scenario of the accelerator to be tested; if the network model is added, the principle of model difference degree is satisfied between the newly added network model and the selected network model.
Preferably,
the processing unit 502 is specifically configured to perform scaling processing and normalization processing on the test data, so that the processed test data meets the requirements of the selected network model.
The test unit 503 is specifically configured to determine a mode of invoking the accelerator to be tested according to the model of the accelerator to be tested; and loading the network model, and calling the accelerator to be tested by the preprocessed test data in a determined mode.
Wherein the mode is:
invocation through a deep learning inference framework;
or, invocation through the accelerator's own acceleration stack;
or, invocation through Android NN.
Preferably, the test index comprises one or any combination of the following:
time delay, throughput, loss of precision value, power consumption, computational power energy consumption ratio.
The units of the above embodiments may be integrated into one body, or may be separately deployed; may be combined into one unit or further divided into a plurality of sub-units.
In another embodiment, an electronic device is also provided, comprising a memory, a processor, and a computer program stored on the memory and executable on the processor, the processor implementing the steps of the deep learning accelerator benchmarking method when executing the program.
In another embodiment, a computer readable storage medium is also provided having stored thereon computer instructions that, when executed by a processor, may implement the steps in the deep learning accelerator benchmarking method.
Fig. 6 is a schematic physical structure diagram of an electronic device according to an embodiment of the present invention. As shown in fig. 6, the electronic device may include: a Processor (Processor)610, a communication Interface (Communications Interface)620, a Memory (Memory)630 and a communication bus 640, wherein the Processor 610, the communication Interface 620 and the Memory 630 communicate with each other via the communication bus 640. The processor 610 may call logic instructions in the memory 630 to perform the following method:
selecting a network model for testing according to the deployment position and the application scene of the accelerator to be tested;
acquiring test data and preprocessing the test data;
configuring a benchmark test environment component, loading the network model, and running the preprocessed test data on the accelerator to be tested;
and acquiring a test result, and determining a test index of the tested accelerator.
In addition, the logic instructions in the memory 630 may be implemented in software functional units and stored in a computer readable storage medium when the logic instructions are sold or used as independent products. Based on such understanding, the technical solution of the present invention may be embodied in the form of a software product, which is stored in a storage medium and includes instructions for causing a computer device (which may be a personal computer, a server, or a network device) to execute all or part of the steps of the method according to the embodiments of the present invention. And the aforementioned storage medium includes: a U-disk, a removable hard disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk or an optical disk, and other various media capable of storing program codes.
The above-described embodiments of the apparatus are merely illustrative, and the units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of the present embodiment. One of ordinary skill in the art can understand and implement it without inventive effort.
Through the above description of the embodiments, those skilled in the art will clearly understand that each embodiment can be implemented by software plus a necessary general hardware platform, and certainly can also be implemented by hardware. With this understanding in mind, the above-described technical solutions may be embodied in the form of a software product, which can be stored in a computer-readable storage medium such as ROM/RAM, magnetic disk, optical disk, etc., and includes instructions for causing a computer device (which may be a personal computer, a server, or a network device, etc.) to execute the methods described in the embodiments or some parts of the embodiments.
The above description is only for the purpose of illustrating the preferred embodiments of the present invention and is not to be construed as limiting the invention, and any modifications, equivalents, improvements and the like made within the spirit and principle of the present invention should be included in the scope of the present invention.

Claims (10)

1. A deep learning accelerator benchmarking method, the method comprising:
selecting a network model for testing according to the deployment position and the application scene of the accelerator to be tested;
acquiring test data and preprocessing the test data;
configuring a benchmark test environment component, loading the network model, and running the preprocessed test data on the accelerator to be tested;
and acquiring a test result, and determining a test index of the tested accelerator.
2. The method of claim 1, wherein selecting the network model for testing according to the deployment location and application scenario of the accelerator under test comprises:
selecting a network model according to the deployment position and the application scene of the accelerator to be tested;
if the network model is added, the principle of model difference degree is satisfied between the newly added network model and the selected network model.
3. The method of claim 1, wherein the preprocessing the test data comprises:
and carrying out scaling processing and standardization processing on the test data so that the processed test data meets the requirements of the selected network model.
4. The method of claim 1, wherein the loading the network model and the running of the preprocessed test data on the accelerator under test comprises:
determining a mode for calling the accelerator to be tested according to the model of the accelerator to be tested;
and loading the network model, and calling the accelerator to be tested by the preprocessed test data in a determined mode.
5. The method of claim 4, wherein the mode is:
calling through a deep learning inference framework;
or, calling through the accelerator's own architecture;
or, calling through Android NN.
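Claims 4 and 5 together describe a dispatch from the accelerator model to one of three invocation modes. A minimal sketch follows; the lookup table entries (`"npu-x1"`, `"gpu-a"`) are purely hypothetical device names, and no real framework API is invoked.

```python
from enum import Enum, auto

class CallMode(Enum):
    """The three invocation modes enumerated in claim 5."""
    INFERENCE_FRAMEWORK = auto()   # via a deep learning inference framework
    VENDOR_ARCHITECTURE = auto()   # via the accelerator's own architecture
    ANDROID_NN = auto()            # via Android NN

def choose_call_mode(accelerator_model):
    """Claim 4: determine the call mode from the accelerator model.

    The lookup table below is hypothetical; real entries would come
    from the vendor's documentation.
    """
    table = {
        "npu-x1": CallMode.ANDROID_NN,
        "gpu-a": CallMode.INFERENCE_FRAMEWORK,
    }
    return table.get(accelerator_model, CallMode.VENDOR_ARCHITECTURE)
```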
6. The method of claim 1, wherein the test indicator comprises one or any combination of the following:
time delay, throughput, precision loss value, power consumption, and computing-power-to-energy ratio.
7. The method of claim 6, wherein:
determining the time delay comprises: recording, for each test in which images are input one frame at a time, the test time T1i used when testing with the ith network model and the total number of input image frames N1i, and determining the time delay Metric1 as:

Metric1 = (1/Num) · Σ(i=1..Num) T1i / N1i

determining the throughput comprises: recording, for each test in which images are input B2ij = 2^j frames at a time, the test time T2ij used when testing with the ith network model and the total number of input image frames N2ij; determining the throughput of the ith network model at 2^j frames per input as Throughput2ij = N2ij × B2ij / T2ij; taking, among the throughputs satisfying T2ij / N2ij < 7 s, the largest value as the throughput Throughput2i determined when testing with the ith network model; and determining the throughput Metric2 of the accelerator under test as:

Metric2 = (1/Num) · Σ(i=1..Num) Throughput2i

wherein j is an integer not less than 0 and not more than 7;

determining the precision loss value comprises: acquiring, when testing with the ith network model, the test accuracy Accuracyi_real of the ith network model, and determining the precision loss value Metric3 as:

Metric3 = (1/Num) · Σ(i=1..Num) (Accuracyi − Accuracyi_real) / Accuracyi

wherein Accuracyi is the preset accuracy of the ith network model;

determining the power consumption comprises: acquiring the energy Wi consumed during the test with the ith network model and the total number of input image frames N4i, and determining the power consumption Metric4 as:

Metric4 = (1/Num) · Σ(i=1..Num) Wi / N4i

determining the computing-power-to-energy ratio comprises: determining the computing-power-to-energy ratio Metric5 as:

Metric5 = (1/Num) · Σ(i=1..Num) (N4i / T4i) / (Wi / T4i)

wherein T4i is the test time used when testing with the ith network model and Num is the number of network models.
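The five indicators of claim 7 can be sketched as one aggregation function. The formula images in the published text are unavailable, so the arithmetic mean over the Num models and the record layout (`t1`, `n1`, `throughputs`, `acc`, `w`, `n4`, `t4`) are assumptions made for illustration.

```python
# Hedged sketch of claim 7's five indicators, assuming each indicator
# is the arithmetic mean over the Num tested network models.

def metrics(records, preset_acc):
    num = len(records)
    # Metric1: mean per-frame latency in single-frame mode.
    m1 = sum(r["t1"] / r["n1"] for r in records) / num
    # Metric2: per model, best throughput among the batch runs
    # (batch, time, frames) whose per-frame latency stays below 7 s.
    m2 = sum(
        max(n * b / t for (b, t, n) in r["throughputs"] if t / n < 7.0)
        for r in records
    ) / num
    # Metric3: mean relative loss of test accuracy vs. preset accuracy.
    m3 = sum((p - r["acc"]) / p for r, p in zip(records, preset_acc)) / num
    # Metric4: mean energy consumed per input frame.
    m4 = sum(r["w"] / r["n4"] for r in records) / num
    # Metric5: frame rate divided by average power draw, i.e. frames
    # processed per unit of energy.
    m5 = sum((r["n4"] / r["t4"]) / (r["w"] / r["t4"]) for r in records) / num
    return m1, m2, m3, m4, m5

# One-model example: 100 frames in 10 s, 50 J consumed, test accuracy
# 0.70 against a preset accuracy of 0.75.
record = {"t1": 10.0, "n1": 100, "acc": 0.70, "w": 50.0, "n4": 100,
          "t4": 10.0, "throughputs": [(1, 10.0, 100), (2, 10.0, 100)]}
m1, m2, m3, m4, m5 = metrics([record], [0.75])
```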
8. A deep learning accelerator benchmark test apparatus, the apparatus comprising: a selection unit, a processing unit, a test unit, an acquisition unit, and a determination unit;
the selection unit is configured to select a network model for testing according to the deployment location and application scenario of the accelerator under test;
the processing unit is configured to acquire test data and preprocess the test data;
the test unit is configured to configure the benchmark test environment components, load the network model selected by the selection unit, and run the test data preprocessed by the processing unit on the accelerator under test;
the acquisition unit is configured to acquire a test result;
the determination unit is configured to determine the test indicators of the accelerator under test according to the test result acquired by the acquisition unit.
9. An electronic device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, characterized in that the processor implements the method according to any of claims 1-7 when executing the program.
10. A computer-readable storage medium, on which a computer program is stored, which program, when being executed by a processor, is adapted to carry out the method of any one of claims 1 to 7.
CN202010017521.3A 2020-01-08 2020-01-08 Deep learning accelerator benchmark test method and device Active CN111242314B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010017521.3A CN111242314B (en) 2020-01-08 2020-01-08 Deep learning accelerator benchmark test method and device


Publications (2)

Publication Number Publication Date
CN111242314A true CN111242314A (en) 2020-06-05
CN111242314B CN111242314B (en) 2023-03-21

Family

ID=70880216

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010017521.3A Active CN111242314B (en) 2020-01-08 2020-01-08 Deep learning accelerator benchmark test method and device

Country Status (1)

Country Link
CN (1) CN111242314B (en)

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112581366A (en) * 2020-11-30 2021-03-30 黑龙江大学 Portable image super-resolution system and system construction method
WO2022012046A1 (en) * 2020-07-17 2022-01-20 苏州浪潮智能科技有限公司 Method and apparatus for selecting optimization direction of benchmark test in deep learning
CN117494759A (en) * 2023-11-24 2024-02-02 深圳市蓝鲸智联科技股份有限公司 Micro hardware machine learning method and system

Citations (5)

Publication number Priority date Publication date Assignee Title
WO2017157203A1 (en) * 2016-03-18 2017-09-21 阿里巴巴集团控股有限公司 Reference test method and device for supervised learning algorithm in distributed environment
CN109376041A (en) * 2018-09-19 2019-02-22 广州优亿信息科技有限公司 A kind of Benchmark test system and its workflow for AI chip for cell phone
CN109918281A (en) * 2019-03-12 2019-06-21 中国人民解放军国防科技大学 Multi-bandwidth target accelerator efficiency testing method
CN110096401A (en) * 2019-05-13 2019-08-06 苏州浪潮智能科技有限公司 A kind of server data process performance test method and device
CN110515811A (en) * 2019-08-09 2019-11-29 中国信息通信研究院 Terminal artificial intelligence performance benchmark test method and device


Non-Patent Citations (2)

Title
ZHANG Weimin: "Current Status and Development Trends of Deep Neural Network Hardware Benchmarking", Information and Communications Technology and Policy *
YANG Xuyu et al.: "Research on Deep Learning Acceleration Technology", Computer Systems & Applications *


Also Published As

Publication number Publication date
CN111242314B (en) 2023-03-21


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant