CN114170490A - Image identification method and system based on self-adaptive data quantization and polyhedral template - Google Patents

Image identification method and system based on self-adaptive data quantization and polyhedral template Download PDF

Info

Publication number
CN114170490A
CN114170490A (application CN202111458489.3A)
Authority
CN
China
Prior art keywords
data
neural network
image
optimal
format
Prior art date
Legal status
Pending
Application number
CN202111458489.3A
Other languages
Chinese (zh)
Inventor
李甜 (Li Tian)
蒋力 (Jiang Li)
刘方鑫 (Liu Fangxin)
Current Assignee
Shanghai Jiaotong University
Original Assignee
Shanghai Jiaotong University
Priority date
Filing date
Publication date
Application filed by Shanghai Jiaotong University filed Critical Shanghai Jiaotong University
Priority to CN202111458489.3A priority Critical patent/CN114170490A/en
Publication of CN114170490A publication Critical patent/CN114170490A/en
Pending legal-status Critical Current

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/21Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Health & Medical Sciences (AREA)
  • Software Systems (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Biophysics (AREA)
  • Biomedical Technology (AREA)
  • Mathematical Physics (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Biology (AREA)
  • Compression Or Coding Systems Of Tv Signals (AREA)

Abstract

The invention relates to an image recognition method and system based on adaptive data quantization and a polyhedral template. The method comprises the following steps: determining an optimal value set for AFP by adopting a Bayesian-optimization-based search algorithm, according to a neural network encoded in a conventional data format; encoding the neural network in the AFP format according to the optimal value set, generating a dynamically quantized and encoded neural network; describing the data computation pattern of each layer of the dynamically quantized and encoded neural network with a code template carrying optimal parameters, and generating executable code for the network; acquiring image data encoded in the AFP format; and deploying the executable code to the end-side device to recognize the image data and obtain an image recognition result. The method accelerates neural network computation during image recognition through the quantization technique and the polyhedron-based search strategy, improves computational precision, and saves device resources.

Description

Image identification method and system based on self-adaptive data quantization and polyhedral template
Technical Field
The invention relates to the technical field of image recognition, in particular to an image recognition method and system based on self-adaptive data quantization and a polyhedron template.
Background
In recent years, as the number and variety of edge devices have grown explosively, combining Deep Neural Networks (DNNs) with edge devices has enabled numerous applications, such as Tesla's autonomous driving, Alibaba's City Brain, and face recognition. However, deep neural networks have a large number of parameters and layers, and these characteristics place great demands on the computing resources of end-side devices (for example, energy consumption, memory, and the like). Therefore, constrained by the limited resources of end-side devices, how to accelerate the operation of neural networks on them has become an important problem.
To save computational resources, DNN compression techniques are widely used, and an important branch of DNN compression is quantization. Quantization turns the continuously distributed values in a neural network into discretely distributed ones, for example representing float32 data as int8, so that the data are represented with a lower-bit-width encoding. A lower-bit-width encoding reduces the space needed to store the data and the amount of data transported over the bus, and effectively reduces the cost of operand computation, thereby lowering energy consumption. Low-bit-width operands also lighten the load of the dominant operation in neural network computation (the multiply-accumulate operation), accelerating the network's execution. Current methods commonly perform quantization with a fixed-length data format (sign bit, exponent bits, mantissa bits). But for a given batch of data, the distribution may be concentrated in a certain interval; a fixed-bit-width quantization scheme then wastes part of the representation range, lacks precision where the data are dense, and cannot adapt to changes in the data, which affects the precision and efficiency of the neural network computation.
In order to speed up computation on designated hardware, how the quantized data are mapped onto the hardware architecture is also important. For example, on an x86 CPU, to fully exploit the locality of data accesses between memory and cache, the data block size is generally designed to match the cache-line size. Programmers traditionally optimize by hand for different hardware architectures, but this costs a lot of manpower, so template-based search strategies are widely used: the programmer specifies a search space containing combinations of parameters (block size, loop order, and the like), and a search strategy (profiling, a cost model, and the like) then looks for an optimal solution. However, a template-based search can produce an explosively large search space, and the descriptive power of the search template remains limited, so the solution found and the actual search cost are not necessarily well matched. That is, searching a huge search space consumes a great deal of time and resources, and the optimal solution is not always found.
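The cache-blocking idea above, in which the block size is tuned to the cache, can be illustrated with a loop-tiling sketch (illustrative code, not from the patent); the tile size is exactly the kind of block-size parameter a template-based search would explore:

```python
def matmul_naive(A, B, n):
    # reference n x n matrix multiply over row-major nested lists
    C = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            s = 0.0
            for k in range(n):
                s += A[i][k] * B[k][j]
            C[i][j] = s
    return C

def matmul_tiled(A, B, n, tile):
    # the same computation iterated block by block, so each tile x tile
    # working set stays cache-resident; `tile` is the searchable
    # block-size parameter discussed in the text
    C = [[0.0] * n for _ in range(n)]
    for ii in range(0, n, tile):
        for jj in range(0, n, tile):
            for kk in range(0, n, tile):
                for i in range(ii, min(ii + tile, n)):
                    for j in range(jj, min(jj + tile, n)):
                        s = C[i][j]
                        for k in range(kk, min(kk + tile, n)):
                            s += A[i][k] * B[k][j]
                        C[i][j] = s
    return C
```

Both functions compute the same result; only the iteration order, and hence the memory-access locality, differs, which is what a search over block size and loop order exploits.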
The application scenario of a neural network is determined by its structure (number of layers, and so on); different neural networks may thus serve different scenarios, such as image recognition and object detection. Because conventional quantized data formats waste precision and representation range on a data set, and conventional template-based search strategies suffer from limited descriptive power and a mismatch between search effect and search cost, applying a neural network to recognition of faces or objects currently suffers from low computational precision, low computational efficiency, and serious waste of resources.
Disclosure of Invention
The invention aims to provide an image identification method and system based on self-adaptive data quantization and a polyhedral template, so that the calculation efficiency of a neural network is accelerated, the calculation precision is improved and the equipment resources are saved when image identification is carried out.
In order to achieve the purpose, the invention provides the following scheme:
an image identification method based on adaptive data quantization and polyhedron template includes:
acquiring a neural network which is selected by the end-side equipment and encoded in a traditional data format; the legacy data format comprises the FP32 data format;
determining an optimal value set for the Adaptive Floating-Point (AFP) representation of the data by adopting a Bayesian-optimization-based search algorithm, according to the neural network encoded in the conventional data format; the optimal value set comprises the exponent bit width, mantissa bit width, and offset value corresponding to each layer in the neural network;
according to the optimal value set, encoding the neural network by adopting the AFP format to generate a dynamically quantized and encoded neural network;
acquiring a code template based on a polyhedron technology and the size of a memory of the end-side equipment;
determining the optimal parameters of the code template according to the size of the memory of the end-side equipment; the optimal parameter comprises a data block size;
describing a data calculation mode of each layer of the neural network after the dynamic quantization coding by adopting a code template with the optimal parameters, and generating an executable code of the neural network after the dynamic quantization coding;
acquiring image data encoded in the AFP format; the image data comprises face image data and object image data;
and deploying the executable code to the end-side equipment to identify the image data to obtain an image identification result.
Optionally, the determining, according to the neural network encoded in the conventional data format, an optimal numerical value set of AFP by using a search algorithm based on bayesian optimization specifically includes:
generating a candidate set of a plurality of possible values according to a Bayesian optimization algorithm;
and selecting the optimal value set from the candidate set according to a preset calculation function.
Optionally, the determining the optimal parameter of the code template according to the size of the memory of the end-side device specifically includes:
determining a candidate set of a search space according to the size of the memory of the end-side device;
using the Integer Set Library (ISL) for polyhedral analysis, taking the code template as input and outputting a schedule tree representing the computation relations;
and determining the optimal parameters in the candidate set of the search space by adopting a two-step search strategy based on different scheduling trees.
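As a hedged illustration of how the memory size can bound the candidate set of the search space, the sketch below enumerates power-of-two tile candidates whose working set fits a given memory budget; the three-block working-set formula and the power-of-two restriction are assumptions for illustration, not details from the patent:

```python
def candidate_tiles(mem_bytes, elem_bytes=4, max_tile=1024):
    """Enumerate power-of-two tile sizes whose working set, taken here
    as three tile x tile blocks (input, weight, output), fits within
    the device memory budget `mem_bytes`."""
    out = []
    t = 4
    while t <= max_tile:
        if 3 * t * t * elem_bytes <= mem_bytes:
            out.append(t)
        t *= 2
    return out
```

A subsequent search step would then evaluate only these surviving candidates, rather than the full explosively large space.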
Optionally, the acquiring image data encoded in the AFP format specifically includes:
acquiring an image received by the end-side device; the image comprises a face image or an object image;
and encoding the image in the AFP format to generate image data encoded in the AFP format.
An image recognition system based on adaptive data quantization and polyhedral templates, comprising:
the traditional coding neural network acquisition module is used for acquiring a neural network coded in a traditional data format and selected by the end-side equipment; the legacy data format comprises the FP32 data format;
the Bayesian optimization search module is used for determining an optimal value set for the Adaptive Floating-Point (AFP) representation of the data by adopting a Bayesian-optimization-based search algorithm according to the neural network encoded in the conventional data format; the optimal value set comprises the exponent bit width, mantissa bit width, and offset value corresponding to each layer in the neural network;
an AFP quantization coding module, which is used for coding the neural network by adopting the AFP format according to the optimal value set and generating a dynamically quantized and coded neural network;
the code template acquisition module is used for acquiring a code template based on a polyhedron technology and the size of the memory of the end-side equipment;
the optimal parameter determining module is used for determining the optimal parameters of the code template according to the size of the memory of the end-side equipment; the optimal parameter comprises a data block size;
the executable code generation module is used for describing a data calculation mode of each layer of the neural network after the dynamic quantization coding by adopting a code template with the optimal parameters and generating executable codes of the neural network after the dynamic quantization coding;
the image data acquisition module is used for acquiring the image data coded in the AFP format; the image data comprises face image data and object image data;
and the image identification module is used for deploying the executable code to the end-side equipment to identify the image data to obtain an image identification result.
Optionally, the bayesian optimization searching module specifically includes:
the candidate set generating unit is used for generating a candidate set of a plurality of possible numerical values according to a Bayesian optimization algorithm;
and the optimal value set selecting unit is used for selecting the optimal value set from the candidate set according to a preset calculation function.
Optionally, the optimal parameter determining module specifically includes:
the search space determining unit is used for determining a candidate set of a search space according to the size of the memory of the end-side equipment;
the scheduling tree output unit is used for outputting, via the Integer Set Library (ISL) for polyhedral analysis, a schedule tree representing the computation relations, taking the code template as input;
and the optimal parameter determining unit is used for determining the optimal parameters in the candidate set of the search space by adopting a two-step search strategy based on different scheduling trees.
Optionally, the image data acquiring module specifically includes:
the image acquisition unit is used for acquiring images received by the end-side equipment; the image comprises a face image or an object image;
and the image AFP coding unit is used for coding the image in the AFP format and generating image data coded in the AFP format.
According to the specific embodiment provided by the invention, the invention discloses the following technical effects:
the invention provides an image identification method and system based on self-adaptive data quantization and a polyhedron template, wherein the method comprises the following steps: acquiring a neural network which is selected by the end-side equipment and encoded in a traditional data format; the legacy data format comprises the FP32 data format; determining an optimal numerical value set of a floating point expression form (AFP) of the self-adaptive data by adopting a Bayesian optimization-based search algorithm according to the neural network coded in the traditional data format; the optimal value set comprises exponent bit widths, mantissa bit widths and offset values corresponding to each layer in the neural network; according to the optimal value set, encoding the neural network by adopting the AFP format to generate a dynamically quantized and encoded neural network; acquiring a code template based on a polyhedron technology and the size of a memory of the end-side equipment; determining the optimal parameters of the code template according to the size of the memory of the end-side equipment; the optimal parameter comprises a data block size; describing a data calculation mode of each layer of the neural network after the dynamic quantization coding by adopting a code template with the optimal parameters, and generating an executable code of the neural network after the dynamic quantization coding; acquiring image data encoded in the AFP format; the image data comprises face image data and object image data; and deploying the executable code to the end-side equipment to identify the image data to obtain an image identification result.
In order to solve the waste of precision and representation range that conventional quantized data formats exhibit on a data set, the invention proposes quantization using an Adaptive Floating-Point (AFP) representation of the data; meanwhile, to address the limited descriptive power and the mismatch between search effect and search strategy of conventional TVM-based templates, the invention proposes a code template design and search strategy design based on polyhedral techniques. Finally, the invention accelerates neural network computation during image recognition through the quantization technique and the polyhedron-based search strategy, improves computational precision, and saves device resources (energy consumption and the like).
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings needed to be used in the embodiments will be briefly described below, and it is obvious that the drawings in the following description are only some embodiments of the present invention, and it is obvious for those skilled in the art to obtain other drawings without inventive exercise.
Fig. 1 is a flowchart of an image recognition method based on adaptive data quantization and a polyhedron template according to an embodiment of the present invention;
FIG. 2 is a schematic diagram of a neural network acceleration scheme based on adaptive data quantization and polyhedral templates according to an embodiment of the present invention;
FIG. 3 is a diagram illustrating a conventional quantized data format according to an embodiment of the present invention;
FIG. 4 is a diagram illustrating an adaptive data quantization algorithm according to an embodiment of the present invention;
FIG. 5 is a schematic diagram of an algorithm for two layers of for loops according to an embodiment of the present invention;
FIG. 6 is a schematic diagram of an algorithm after affine transformation according to an embodiment of the present invention;
FIG. 7 is a diagram of a code template provided by an embodiment of the present invention;
fig. 8 is a structural diagram of an image recognition system based on adaptive data quantization and a polyhedron template according to an embodiment of the present invention.
Detailed Description
The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
The invention aims to provide an image identification method and system based on self-adaptive data quantization and a polyhedral template, so that the calculation efficiency of a neural network is accelerated, the calculation precision is improved and the equipment resources are saved when image identification is carried out.
In order to make the aforementioned objects, features and advantages of the present invention comprehensible, embodiments accompanied with figures are described in further detail below.
Fig. 1 is a flowchart of an image identification method based on adaptive data quantization and a polyhedron template according to an embodiment of the present invention. Fig. 2 is a schematic diagram of a neural network acceleration scheme based on quantization of adaptive data and a polyhedral template according to an embodiment of the present invention. Referring to fig. 1 and 2, an image recognition method based on adaptive data quantization and a polyhedron template according to the present invention includes:
step 101: and acquiring the neural network which is selected by the end-side device and is encoded in the traditional data format.
The computation of a neural network includes inference and training, and involves a large number of multiply-add operations. To reduce the cost of these operations, five conventional data formats are in common use. Fig. 3 is a diagram illustrating conventional quantized data formats according to an embodiment of the present invention. As shown in fig. 3, the conventional data formats include INT8, FP32, FP16, BF16, and TF32.
FP32, TF32, FP16 and BF16 are all floating-point formats. A floating-point format divides the bit string into three parts: a sign bit, exponent bits, and mantissa bits. The sign bit identifies whether the datum is positive or negative; the exponent bits determine the range of representable data; the mantissa bits determine the precision of the representation. For example, given an FP32 datum b_fp32 = {b_31, b_30, …, b_23, b_22, …, b_0} with b_i ∈ {0, 1}: the sign bit b_31 occupies 1 bit, the exponent bits b_30 to b_23 occupy 8 bits, and the mantissa bits b_22 to b_0 occupy 23 bits. The value represented is computed according to the following formula:

value = (-1)^{b_31} × 2^{(e − 127)} × (1 + Σ_{i=1}^{23} b_{23−i} · 2^{−i}),  where e = Σ_{i=0}^{7} b_{23+i} · 2^{i}

Thus the range of representable data is determined mainly by the exponent bits (longer exponent bits give a larger range), while the precision is determined mainly by the mantissa bits (the longer the mantissa, the greater the precision). Meanwhile, the number of multiply-add computations is tied to the bit-width information: shorter bit widths require fewer multiply-add computations.
In the present conventional data representations, the bit widths of the exponent bits and mantissa bits are fixed, which means each format has a fixed representation range. But the distribution of parameters in each layer differs from one neural network to another. It has been observed that these distributions approximately follow a Gaussian: the data are concentrated within a certain range, dense inside it and sparse outside it. Higher precision (longer mantissa bits) is therefore required in the densely populated range, yet the current fixed representations cannot adapt to the distribution of the data. This makes the matching of data representation to data distribution inefficient, and ultimately affects the precision of the final neural network computation.
Meanwhile, the existing fixed-bit-width representations also suffer from round-off error, the difference between the computed approximate value and the exact value. For example, suppose data of FP32 is represented in BF16. The integer ranges represented by FP32 and BF16 are the same, because the exponent bits of the two formats are the same; but the mantissa of FP32 is longer than that of BF16, so the precision of FP32 is greater, and the precision lost in the fractional part produces rounding error. In addition, representing FP32 data in FP16 risks overflow: FP32 and FP16 represent different ranges, and FP16 can overflow in computations on large values. The dynamic range of FP16 (6 × 10^-8 to 65504) is much narrower than that of FP32 (1.4 × 10^-45 to 1.7 × 10^38), and its precision (2^-10) is much lower than that of FP32 (2^-23).
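Both failure modes can be reproduced in a few lines: truncating an FP32 bit pattern to BF16 keeps the 8-bit exponent (so the range matches FP32) but drops 16 mantissa bits (rounding error), while FP16 simply cannot represent magnitudes beyond 65504 (overflow). A minimal sketch, with illustrative helper names:

```python
import struct

def to_bf16(x):
    """Truncate an FP32 value to BF16 by zeroing the low 16 bits:
    the 8-bit exponent is unchanged, so the range matches FP32, but
    the mantissa shrinks from 23 to 7 bits, so precision is lost."""
    bits = struct.unpack('>I', struct.pack('>f', x))[0]
    return struct.unpack('>f', struct.pack('>I', bits & 0xFFFF0000))[0]

FP16_MAX = 65504.0  # largest finite FP16 value

def fp16_overflows(x):
    # FP32 values beyond +/-65504 cannot be represented as finite FP16
    return abs(x) > FP16_MAX
```

For example, 1 + 2^-10 truncates to exactly 1.0 in BF16 (rounding error), while 1e5 overflows FP16 but 60000 does not.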
Step 102: and determining the optimal numerical value set of the floating point expression form AFP of the self-adaptive data by adopting a Bayesian optimization-based search algorithm according to the neural network coded by the traditional data format.
Model quantization refers to changing the continuously distributed values in the neural network into discretely distributed values. Quantization is divided into two categories according to the way of quantization: uniform quantization and non-uniform quantization.
Uniform quantization refers to mapping continuously distributed data uniformly onto a discrete space. Generally, a uniform quantization method defines a range of values [min, max] in the discrete space, and then maps the continuously distributed values onto the nearest integers. For example, all floating-point numbers may be scaled by a scaling factor into INT8's range of [-128, 127].
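A minimal sketch of such a uniform mapping follows (a symmetric variant that targets [-127, 127]; the scaling formula is an illustrative choice, not one fixed by the patent):

```python
def quantize_int8(values):
    """Symmetric uniform quantization of a list of floats onto [-127, 127]."""
    scale = max(abs(v) for v in values) / 127.0 or 1.0  # guard all-zero input
    q = [max(-127, min(127, round(v / scale))) for v in values]
    return q, scale

def dequantize(q, scale):
    # map the integers back to floats; the residual against the
    # original values is the quantization error
    return [qi * scale for qi in q]
```

Every in-range value is reproduced to within half a quantization step (scale / 2), which is the rounding error the text discusses.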
Non-uniform quantization refers to mapping continuously distributed data onto variable interval ranges according to the characteristics of the data. For example, a clustering algorithm analyzes the data values, selects those that best represent the current distribution, and then quantizes according to them, for instance by compressing the exponent bits and mantissa bits within FP32.
A uniform distribution is not suited to current neural networks, because the numerical distribution of the parameters in a convolutional layer approximately follows a Gaussian distribution W ~ N(μ, δ²), where μ is the mean and δ² is the variance. That is, most values are distributed around zero, and parameters with very large absolute values are essentially absent. Mapping them uniformly would concentrate most values on a few integers, causing calculation errors to accumulate and ultimately a sharp drop in the accuracy of the neural network. Meanwhile, uniform quantization generally converts a type into a low-bit data type, and during transmission through the network layers it must re-quantize for continuity of computation. For example, for a convolution whose input is FP32 and whose weights are FP32: the input and weights are first quantized to INT8, the multiply-add computation is performed on the INT8 data, and finally the INT8 result is de-quantized back to FP32 and passed to the BatchNorm layer for computation.
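The re-quantization round trip in the convolution example can be sketched as a single dot product: quantize input and weights to INT8, accumulate in integers, and de-quantize the result for the next layer (an illustrative sketch of the data flow, not the patent's kernel):

```python
def int8_dot(x, w):
    """Quantize `x` and `w` to INT8, multiply-accumulate in integers,
    then de-quantize the integer result back to float, mirroring the
    INT8 convolution followed by de-quantization described above."""
    sx = max(abs(v) for v in x) / 127.0
    sw = max(abs(v) for v in w) / 127.0
    qx = [round(v / sx) for v in x]           # INT8 input
    qw = [round(v / sw) for v in w]           # INT8 weights
    acc = sum(a * b for a, b in zip(qx, qw))  # integer accumulator
    return acc * sx * sw                      # back to float for the next layer
```

The result is close to, but generally not equal to, the exact floating-point dot product; the residual is the accumulated quantization error discussed above.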
The data types used by non-uniform quantization are still floating-point (BF16, etc.), which reduces the error introduced by quantization. However, in existing floating-point representations the bit widths of the exponent bits and mantissa bits are fixed, and such fixed representations cannot adapt to the data distribution; the resulting mismatch between data representation and data distribution degrades the precision of the final neural network computation. Meanwhile, the existing fixed-bit-width representations also suffer from round-off error.
In order to solve the problem of the waste of precision and representation range of the traditional quantized data format on a data set, the invention provides a scheme capable of carrying out quantization by adopting an adaptive data format.
In this step 102, the input is a neural network encoded in a conventional data format, for example a neural network of depth L encoded in the conventional FP32 data format. For each layer of data represented in FP32, the bit widths of the exponent bits and mantissa bits are fixed. However, when the distribution of the data is concentrated in a certain interval, the exponent bits need not be numerous; the mantissa bits should instead be increased to raise precision, thereby reducing the precision error and rounding error introduced in the quantization process.
Therefore, in order to improve the above problem, the present invention provides a set of Adaptive Floating-Point (AFP) representations of data, which can dynamically adjust exponent bits and mantissa bits according to the distribution of data.
For each layer of data in the network, the adaptive floating-point representation b_AFP consists of a sign bit s (1 bit), exponent bits (n_exp bits), and mantissa bits (n_man bits), where the bit widths n_exp and n_man are adjusted according to the actual data of each layer. The specific data encoding format is:

b_AFP = {s, e_{n_exp−1}, …, e_0, m_{n_man−1}, …, m_0}

For a floating-point number encoded as such a 0/1 binary string, the numeric value it represents can be obtained according to the following formula, where k is the layer's offset value:

value = (-1)^{s} × 2^{(e − k)} × (1 + Σ_{i=1}^{n_man} m_{n_man−i} · 2^{−i}),  where e = Σ_{i=0}^{n_exp−1} e_i · 2^{i}

Therefore, compared with the traditional fixed-bit-width encoding of FP32, the invention can dynamically adjust the exponent bits, mantissa bits, and offset value k according to the distribution of the data, thereby reducing the error of the quantization process. Thus, in an AFP-encoded neural network, the parameters of each layer are effectively described by one set of encoding bit widths and an offset value, (n_exp, n_man, k).
For the set of encoding bit widths and the offset value (n_exp, n_man, k): because the distribution of values differs between the layers of a neural network, the actual values of (n_exp, n_man, k) should be adjusted per layer. The invention explains below how the designed Bayesian-optimization-based search algorithm adaptively adjusts the actual bit widths (n_exp, n_man) and the offset value k of each layer according to the distribution of that layer's data.
Fig. 4 is a schematic diagram of an adaptive data quantization algorithm according to an embodiment of the present invention. Referring to fig. 4, the adaptive data quantization algorithm has three inputs:
(1) N_FP^L: a neural network of depth L encoded in the conventional FP format. Before quantization to the AFP format, the set of floating-point numbers in each layer l, represented in FP32, is W_l.
(2) The number of sampled data n. The invention generates a candidate set D_n of n possible values according to a Bayesian Optimization algorithm; each candidate represents one possible value of (n_exp, n_man, k). The algorithm selects the optimal solution from the candidate set.
(3) The iteration count N of the algorithm, i.e., the maximum number of search iterations.
After the search algorithm finishes, there are two outputs:
(1) The optimal value set θ* of the per-layer parameters (n_exp, n_man, k).
(2) The overall neural network parameters in the AFP encoding, obtained after the bit width and offset value of each layer have been determined.
The invention needs to determine the encoding of each layer according to the distribution of the data, i.e., to determine the sizes of the three parameters (n_exp, n_man, k) of the encoding. The step 102 of determining an optimal numerical value set of AFP by using a Bayesian-optimization-based search algorithm according to the neural network encoded in the conventional data format specifically includes:
step 2.1: generating a candidate set of a plurality of possible values according to a Bayesian optimization algorithm;
A candidate set D_n is generated based on a Gaussian model (Line 1 of the algorithm of fig. 4), and the optimal parameters (n_exp, n_man, k) are selected from the candidate set according to a specific calculation function (Line 5 of the algorithm of fig. 4).
Step 2.2: selecting the optimal value set from the candidate set according to a preset calculation function. The optimal value set includes an exponent bit width, a mantissa bit width, and an offset value corresponding to each layer in the neural network.
The data are quantized according to the optimal parameters (Line 7-Line 8), the error of the quantized data is calculated (Line 9-Line 11), and the parameters needed by the overall search strategy are updated (Line 12-Line 13). The optimal error is updated at the same time (Line 14-Line 16). When the whole algorithm finishes, the output is the value set corresponding to the optimal error.
Referring to fig. 4, the specific search process is as follows; Line X denotes the X-th line of the algorithm.
Line 1: the algorithm generates a candidate set D_n of n possible values according to a Bayesian Optimization algorithm; each candidate represents one possible value of (n_exp, n_man, k). This step corresponds to fitting the possible distribution of the target parameters (n_exp, n_man, k) with the Gaussian model updated in Line 13.
Line 2: the number of iterations N is initialized.
Line 3: the optimum error J is initialized. One of the quantitative evaluation indexes is a small error J caused by the quantization. The error J is measured by using the product of the Kullback-Leibler divergence (KL divergence) and the cost of the candidate set. The goal of the joint search is therefore to minimize the error J, i.e.:
Figure BDA0003388806500000114
where D_KL denotes the KL divergence between the quantization of the present invention and the FP32 representation, defined as follows:

D_KL(P ‖ Q) = Σ_i P(i) · log( P(i) / Q(i) ),

with P the distribution of the original FP32 data and Q the distribution of the AFP-quantized data. C represents the cost of the current candidate set, and λ_l is an exponential coefficient. The cost counts the total number of bit operations for the additions and multiplications, where computing a multiplication requires n_exp + (1 + n_man)^2 bits and an addition requires accessing [addition-cost expression given in the figure]; C is defined accordingly in the figure.
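As a rough illustration of how the error J could be evaluated in practice, the sketch below computes a histogram-based KL divergence between the FP32 values and their quantized counterparts, and multiplies it by a bit-operation cost. The histogram binning, the default λ, and the dropping of the addition term (elided in the original) are our illustrative assumptions; only the multiplication cost n_exp + (1 + n_man)^2 comes from the text.

```python
import numpy as np

def kl_divergence(p, q, eps=1e-12):
    """D_KL(P || Q) over two histograms (P: original FP32 data,
    Q: AFP-quantized data)."""
    p = p / p.sum()
    q = q / q.sum()
    return float(np.sum(p * np.log((p + eps) / (q + eps))))

def error_J(w_fp32, w_afp, n_exp, n_man, n_mul, lam=1.0, bins=64):
    """Sketch of J = D_KL * C**lam. Only the multiplication cost
    n_exp + (1 + n_man)**2 is taken from the text; the addition term
    is dropped, and bins/lam are illustrative assumptions."""
    edges = np.histogram_bin_edges(w_fp32, bins=bins)
    p, _ = np.histogram(w_fp32, bins=edges)
    q, _ = np.histogram(w_afp, bins=edges)
    C = n_mul * (n_exp + (1 + n_man) ** 2)  # multiplication bit cost
    return kl_divergence(p.astype(float), q.astype(float)) * C ** lam
```

Identical distributions give D_KL = 0, so a candidate encoding that preserves the layer's value distribution drives J toward the pure cost floor.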
line 4: and setting a circulation condition of the whole search algorithm.
Line 5: α represents a specific calculation function. Its first input θ is the parameter set of the entire network, and its second input is the candidate set D_n. The algorithm assigns the optimal solution (k', e', p') obtained by this function to θ*: k' represents the offset value, e' the exponent bit width n_exp, and p' the mantissa bit width n_man.
Line 6: the number of bits Nq ' of the overall code is calculated to be 1+ e ' + p '.
Line 7 and Line 8: according to the bit number (bit width) and offset value determined in this iteration, the data W_l are quantized, and the algorithm obtains the quantized data Ŵ_l.
Line 9: according to W_l and Ŵ_l determined during the iteration, the KL divergence D_KL between the AFP quantization of the invention and the FP32 representation is calculated.
Line 10: the cost C of the current candidate set is calculated based on the bit widths Nq' and e' obtained in this iteration.
Line 11: the new error J_new is calculated from the KL divergence D_KL and the cost C (raised to the exponential coefficient λ_l) obtained in this iteration.
Line 12: based on the new error J_new and the original candidate set D_n, a new candidate set D_{n+1} is generated.
Line 13: the Gaussian model gp_new is updated.
Line 14 and Line 15: if the current latest error J_new is the optimal (minimum) error, J is updated. The output of the algorithm is the value set corresponding to the optimal error J, including the exponent bit width, mantissa bit width, and offset value corresponding to each layer in the neural network.
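The control flow of Lines 1-16 can be mirrored in a short Python sketch. To stay self-contained, the Gaussian-process surrogate and the acquisition function α are replaced here by plain random re-sampling of the candidate set, the error J is replaced by a mean-squared error, and `quantize` is a simplified stand-in for AFP rounding — none of these substitutions are the patent's actual components; only the loop structure follows the algorithm.

```python
import numpy as np

rng = np.random.default_rng(0)

def quantize(w, n_exp, n_man, k):
    """Simplified stand-in for AFP rounding: clamp the exponent range
    implied by n_exp/k and keep n_man fractional mantissa bits."""
    sign = np.sign(w)
    a = np.abs(w) + 1e-30
    e = np.clip(np.floor(np.log2(a)), -k, (1 << n_exp) - 1 - k)
    step = 2.0 ** (e - n_man)
    return sign * np.round(a / step) * step

def search(w, n_candidates=20, n_iters=30):
    """Skeleton of the Line 1-16 loop; the GP model and acquisition
    function are replaced by random re-sampling of the candidate set,
    purely to illustrate the control flow."""
    best = (None, np.inf)
    for _ in range(n_iters):                        # Line 4: loop condition
        cands = [(int(rng.integers(2, 6)),          # n_exp  (Line 1)
                  int(rng.integers(2, 10)),         # n_man
                  int(rng.integers(1, 8)))          # k
                 for _ in range(n_candidates)]
        for n_exp, n_man, k in cands:               # Line 5: pick candidate
            w_q = quantize(w, n_exp, n_man, k)      # Lines 7-8: quantize
            err = float(np.mean((w - w_q) ** 2))    # Lines 9-11: error (MSE stand-in)
            if err < best[1]:                       # Lines 14-16: keep optimum
                best = ((n_exp, n_man, k), err)
    return best

params, err = search(rng.normal(size=1000))
```

In the real algorithm the candidate set D_{n+1} would be re-proposed by the updated Gaussian model (Lines 12-13) instead of drawn uniformly.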
Step 103: and according to the optimal value set, encoding the neural network by adopting the AFP format to generate a dynamically quantized and encoded neural network.
The input processed by the invention is a neural network of depth L encoded in the conventional FP32 data format. For this FP32-encoded network, the invention adopts the Adaptive data Floating-Point (AFP) representation to encode the parameters of each layer; the bit-width and offset-value set of the parameters is (n_exp, n_man, k).
For the per-layer parameter bit-width set (n_exp, n_man, k) after encoding, in order to determine the actual bit-width lengths and the offset value, the invention adopts a Bayesian-optimization-based search algorithm to search for the bit widths and the offset value. After the quantization data of each layer (the exponent bit width, mantissa bit width, and offset value corresponding to each layer in the neural network) are determined, the invention finally obtains the neural network encoded by dynamic quantization of adaptive data.
After the neural network is encoded in the AFP format according to the optimal value set and the dynamically quantized and encoded neural network is generated, it must be deployed to different hardware: different frameworks generate computation code for the data, and the computation rules of that code (including the order of the loops, the parameters of the loops, etc.) affect the execution efficiency of the neural network. Generating a set of executable computation code that can make full use of the hardware is therefore very important.
TVM (Tensor Virtual Machine) is currently the most widely used deep neural network compiler, and its main search strategy is template-based AutoTVM search. Specifically, the flow of the template search used by the current TVM is as follows: (1) a user defines a template, which includes the loop order, data-block partitioning, vectorization, etc.; (2) the parameters to be searched within the template, such as the tile size (data block size), are defined; (3) based on the given template and parameters, the framework constructs a search space and searches the parameters in this space according to a certain strategy. For example, a specific strategy computes the cost of a set of parameters using a cost model (e.g., a neural network fitted to the current cost), so that, guided by the cost model, the search can find the most efficient set of parameters. This template-search method is widely used; for example, the TVM code base contains more than 15,000 lines of template code.
One important tool for optimizing convolution is the affine transformation, which essentially changes the access order and access stride of a loop so that data dependencies during accesses are reduced as much as possible and parallel computation is maximized. For example, as shown in fig. 5, for the original two-layer for loop with i = 2 and N = 3, the inner loop computes A[2, 1] = Function(A[2, 0], A[1, 1]) and then A[2, 2] = Function(A[2, 1], A[1, 2]). If multiple threads execute the inner loop in parallel, then at the same time point the result A[2, 2] depends on A[2, 1], so the computation cannot be parallelized.
The affine transformation sets y = j and x = i + j, after which the overall loop becomes as shown in fig. 6. Referring to fig. 6, when x = 3 and N = 3, the inner loop computes A[2, 1] = Function(A[2, 0], A[1, 1]) and A[1, 2] = Function(A[1, 1], A[0, 2]); there is no data dependency between A[2, 1] and A[1, 2], so they can be computed in parallel at the same time point. By reducing data dependence and increasing parallelism, the affine transformations supported by the polyhedral technique can accelerate neural network computation.
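The skew described above (y = j, x = i + j) can be checked numerically. The sketch below runs the original loop nest and the skewed (wavefront) loop nest and verifies that they produce identical results; the dependence pattern A[i][j] = Function(A[i][j-1], A[i-1][j]) is taken from the example, while f, N, and the initialization are illustrative.

```python
import numpy as np

def f(a, b):  # stand-in for Function in the text
    return a + b

N = 4

# Original loop nest: A[i][j] depends on A[i][j-1] and A[i-1][j],
# so the inner j loop cannot run its iterations in parallel.
A = np.ones((N, N))
for i in range(1, N):
    for j in range(1, N):
        A[i, j] = f(A[i, j - 1], A[i - 1, j])

# Skewed loop nest (x = i + j, y = j): for a fixed x, every inner
# iteration touches a distinct anti-diagonal cell whose inputs lie on
# the previous diagonal x-1, so the inner y loop has no mutual
# dependence and could be parallelized.
B = np.ones((N, N))
for x in range(2, 2 * N - 1):
    for y in range(max(1, x - N + 1), min(x, N)):  # j = y, i = x - y
        i, j = x - y, y
        B[i, j] = f(B[i, j - 1], B[i - 1, j])

assert np.array_equal(A, B)  # same results, different schedule
```

The transformation changes only the iteration order, not the values computed, which is exactly why the polyhedral framework may apply it safely.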
However, the existing TVM template-based search algorithm essentially searches on a specific hardware architecture by enumeration, which makes the search space very large and the search efficiency low. The cost model in the search strategy may be based on the results of actual execution on hardware, whose running time can be very long, leading to very long search times. Moreover, template-based search essentially enumerates the loop order and the sizes of the loop's data blocks, which limits the loop optimizations it can describe. For example, on an x86 CPU architecture, in order to exploit maximum data locality across the multi-level caches (the three cache levels L1, L2, and L3), the affine transformation in the polyhedral technique can maximize the locality and parallelism of the data blocks accessed at different data points by transforming the coordinate axes. Prior work has shown that such affine transformations play a very important role in accelerating convolution and fully utilizing the characteristics of the hardware, but the conventional TVM-based search method cannot describe such affine transformations.
Step 104: and acquiring a code template based on a polyhedron technology and the memory size of the end-side equipment.
For the quantized neural network, in order to deploy it to hardware, each framework generates code describing the way each layer of data is calculated by the neural network. Different code description modes (such as parameters of loops in the code) can affect the efficiency of accessing the memory and calculating the unit. In order to improve the computational efficiency of quantized data on hardware, the invention provides a template code optimization and generation scheme based on a polyhedron technology. By the scheme, the method and the device can search the optimal numerical value of the parameter in the code template, so that the code which can fully utilize hardware and has the highest calculation efficiency is generated.
After the quantization strategies of steps 102 and 103 are performed, in order to design a set of computation code that can fully utilize the hardware, the invention designs, based on the polyhedral technique, a set of code templates for accelerating neural network computation, in which the overall loop order is fixed but the parameter TILE_SIZE (data block size) inside the loops is left undetermined. TILE_SIZE inside the template is the parameter to be searched. TILE_SIZE affects the locality between data, the dependencies between data, and the parallelism of the computation. In general, the better the locality between data, the lower the dependencies, and the higher the parallelism, the better the computation efficiency. When TILE_SIZE is smaller, each access touches fewer points, which reduces the dependencies between data; when TILE_SIZE matches the size of the underlying cache-line data block, the locality of the data can be better exploited. Therefore, different underlying architectures have different optimal TILE_SIZE values.
Fig. 7 is a schematic diagram of a code template according to an embodiment of the present invention. Referring to fig. 7, the invention illustrates the design of a template using, as an example, convolution code operating on data encoded with AFP. The convolution of the neural network has two four-dimensional input tensors encoded with AFP: input[N, C, H, W] (the input) and weight[F, C, KH, KW] (the weights). N denotes the batch of the input data, C the channels of the input data, H the length of the input data, W the width of the input data, F the number of convolution kernels in the weights, KH the length of a convolution kernel, and KW its width. The output is output[N, F, OH, OW], where OH denotes the length of the output and OW its width. The code template describes the order in which the convolution computation accesses the data (oj -> f_TILE -> … -> ifm), the data block size TILE_SIZE of the accessed data, and the calculation rule (Line 17).
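As a rough, unoptimized illustration of the kind of loop structure such a template fixes (with TILE_SIZE left as a tunable parameter), the following sketch tiles the output-channel loop F of a direct convolution. The loop ordering, the NCHW/FCHW layouts, stride 1, and the absence of padding are our assumptions for the example; only the idea of a tunable TILE_SIZE comes from the template.

```python
import numpy as np

def conv2d_tiled(inp, weight, tile_f=8):
    """Toy direct convolution with the F (output-channel) loop tiled,
    loosely mirroring the oj -> f_TILE -> ... -> ifm ordering of Fig. 7.
    TILE_SIZE here is tile_f; stride 1, no padding, NCHW layout assumed."""
    N, C, H, W = inp.shape
    F, _, KH, KW = weight.shape
    OH, OW = H - KH + 1, W - KW + 1
    out = np.zeros((N, F, OH, OW), dtype=inp.dtype)
    for n in range(N):
        for f0 in range(0, F, tile_f):               # f_TILE loop
            for f in range(f0, min(f0 + tile_f, F)):  # f within the tile
                for oh in range(OH):
                    for ow in range(OW):
                        patch = inp[n, :, oh:oh + KH, ow:ow + KW]
                        out[n, f, oh, ow] = np.sum(patch * weight[f])
    return out
```

Changing tile_f does not change the result, only the order in which weight[f] blocks are revisited — which is exactly the degree of freedom the template exposes for the hardware search.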
Step 105: and determining the optimal parameters of the code template according to the memory size of the end-side equipment. The optimal parameter includes a data block size.
After the code template is determined, the invention needs to determine the optimal TILE_SIZE according to the underlying hardware. The specific processing flow is as follows.
The overall input is the code template described above.
The step 105 of determining the optimal parameter of the code template according to the size of the memory of the end-side device specifically includes:
step 5.1: and determining a candidate set of a search space according to the size of the memory of the end-side device.
The invention determines several TILE_SIZE values according to the memory size of the underlying hardware. These TILE_SIZE values constitute the candidate set of the search space.
Step 5.2: and utilizing an integer set library based on polyhedron technology analysis, taking the code template as input, and outputting a scheduling tree representing a calculation relation.
Based on the given hardware (e.g., an end-side CPU, GPU, etc.), to determine the optimal TILE_SIZE the invention utilizes the Integer Set Library (ISL) for polyhedral analysis. ISL takes the code template as input, analyzes the data dependencies in the code, and outputs a scheduling tree that represents the computation relations using the specific syntax and semantics of polyhedral analysis. Therefore, given different TILE_SIZE values, ISL generates different scheduling trees.
Step 5.3: and determining the optimal parameters in the candidate set of the search space by adopting a two-step search strategy based on different scheduling trees.
Based on the different scheduling trees, the invention proposes a two-step search strategy to determine the optimal TILE_SIZE. In the first step, for each complete execution of the innermost loop in a scheduling tree, the size of the data block that must be accessed is used as a screening threshold: the invention queries the data block sizes in the scheduling trees in turn, takes those close to the size of the underlying memory bank as the candidate set of the optimal solution, and keeps the corresponding scheduling trees. In the second step, the code is rewritten according to the rules described in each retained scheduling tree (e.g., assigning the value of TILE_SIZE, transforming the coordinate axes, etc.) and then deployed to the actual hardware to run. Taking the actual running time as the evaluation criterion, the code with the shortest running time is selected as the optimal generated code.
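A toy version of this two-step strategy might look as follows. The footprint model (tile² elements of 4 bytes each), the cache size, and the candidate values are illustrative assumptions of ours; in the actual scheme the first-step block sizes come from the ISL scheduling trees rather than a closed-form footprint.

```python
import time

def pick_tile_size(kernel, candidates, cache_bytes=32 * 1024, itemsize=4):
    """Two-step selection: (1) statically keep candidates whose innermost
    data-block footprint is close to the cache/memory-bank size;
    (2) time the survivors on the actual hardware and return the fastest.
    The tile*tile*itemsize footprint model is an illustrative assumption."""
    # Step 1: static filter by data-block footprint
    survivors = [t for t in candidates
                 if 0.25 * cache_bytes <= t * t * itemsize <= cache_bytes]
    if not survivors:
        survivors = candidates  # fall back to timing everything
    # Step 2: run each survivor, keep the fastest
    best, best_dt = None, float("inf")
    for t in survivors:
        t0 = time.perf_counter()
        kernel(t)  # run the rewritten code with TILE_SIZE = t
        dt = time.perf_counter() - t0
        if dt < best_dt:
            best, best_dt = t, dt
    return best
```

The static filter is what keeps the expensive second step tractable: only block sizes that can plausibly exploit the memory hierarchy are ever executed.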
The output is the optimized code obtained finally, i.e. the executable code of the neural network after dynamic quantization coding.
Step 106: and describing a data calculation mode of each layer of the neural network after the dynamic quantization coding by adopting a code template with the optimal parameters, and generating an executable code of the neural network after the dynamic quantization coding.
Based on the above-described code templates designed, the present invention can result in an overall network of executable code encoded in AFP format that can run efficiently on hardware. Thus, based on the AFP format and optimized executable code, the present invention can accelerate overall computation.
Step 107: image data encoded in the AFP format is acquired. The image data includes face image data and object image data.
The different application scenarios of neural networks are determined by their structure (number of layers, etc.). Thus, different neural networks may be applied in different scenarios, such as image recognition, object detection, etc.
The step 107 of acquiring the image data encoded in the AFP format specifically includes:
step 7.1: acquiring an image received by the end-side device; the image comprises a face image or an object image;
step 7.2: and encoding the image in the AFP format to generate image data encoded in the AFP format.
Step 108: and deploying the executable code to the end-side equipment to identify the image data to obtain an image identification result.
The end-side device receives an image of a human face or an object and encodes it as image data in the FP32 data format. Meanwhile, the end-side device selects a neural network of a particular structure encoded in the traditional data format, i.e., a collection of multi-layer data encoded in FP32. These two parts of data enter the processing flow of the invention as the raw inputs.
In order to simultaneously accelerate computation on the adaptive data, the invention first expresses these two parts of data in the AFP format. To determine the actual sizes of the encoding bit widths (n_exp, n_man) and the offset value k in the triple (n_exp, n_man, k), the algorithm of step 102 is used. After this search, all parameters have determined bit widths, and both parts of data are represented as data in the AFP format.
To deploy the above processing to the end-side device, the end-side device generates executable code. The executable code expresses the calculation rule of each layer; the calculation of each layer performs the corresponding operations (such as multiplication, addition, division, etc.) on the two parts of data. The rules by which this code is written directly affect the final efficiency. Therefore, the code that computes on the data encoded in the AFP format is taken as input, and the optimized code is obtained through the search process of steps 105 and 106.
And deploying the optimized executable code encoded in the AFP format to a corresponding end-side device for calculation. For the input face or object image data, the result of image recognition (0/1) of whether matching is successful or not is output through the above calculation, 0 indicates that matching is failed, and 1 indicates that matching is successful.
The invention provides an image identification method based on self-adaptive data quantization and a polyhedral template, and also provides an image identification system based on self-adaptive data quantization and a polyhedral template. Fig. 8 is a structural diagram of an image recognition system based on adaptive data quantization and a polyhedron template according to an embodiment of the present invention, and referring to fig. 8, the system includes:
a conventional coding neural network obtaining module 801, configured to obtain a neural network coded in a conventional data format and selected by an end-side device; the legacy data format comprises the FP32 data format;
a bayesian optimization search module 802, configured to determine, according to the neural network encoded in the conventional data format, an optimal numerical value set of a floating point expression form AFP of the adaptive data by using a search algorithm based on bayesian optimization; the optimal value set comprises exponent bit widths, mantissa bit widths and offset values corresponding to each layer in the neural network;
an AFP quantization encoding module 803, configured to encode the neural network according to the optimal value set in the AFP format, and generate a dynamically quantized and encoded neural network;
a code template obtaining module 804, configured to obtain a code template based on a polyhedral technology and a memory size of the end-side device;
an optimal parameter determining module 805, configured to determine an optimal parameter of the code template according to the size of the memory of the end-side device; the optimal parameter comprises a data block size;
an executable code generating module 806, configured to use a code template with the optimal parameter to describe a data calculation manner of each layer of the dynamically quantized and encoded neural network, and generate an executable code of the dynamically quantized and encoded neural network;
an image data acquisition module 807 for acquiring image data encoded in the AFP format; the image data comprises face image data and object image data;
an image recognition module 808, configured to deploy the executable code to the end-side device to recognize the image data, so as to obtain an image recognition result.
The bayesian optimization searching module 802 specifically includes:
the candidate set generating unit is used for generating a candidate set of a plurality of possible numerical values according to a Bayesian optimization algorithm;
and the optimal value set selecting unit is used for selecting the optimal value set from the candidate set according to a preset calculation function.
The optimal parameter determining module 805 specifically includes:
the search space determining unit is used for determining a candidate set of a search space according to the size of the memory of the end-side equipment;
the scheduling tree output unit is used for outputting a scheduling tree representing a calculation relation by using the code template as input by using an integer set library based on polyhedron technology analysis;
and the optimal parameter determining unit is used for determining the optimal parameters in the candidate set of the search space by adopting a two-step search strategy based on different scheduling trees.
The image data obtaining module 807 specifically includes:
the image acquisition unit is used for acquiring images received by the end-side equipment; the image comprises a face image or an object image;
and the image AFP coding unit is used for coding the image in the AFP format and generating image data coded in the AFP format.
In order to avoid wasting the precision and representation range of traditional quantized data formats on a data set, the invention proposes a scheme for quantization in an adaptive data format. At the same time, to address the descriptive limitations of the traditional TVM-based template and the mismatch between search effect and search strategy, the invention proposes a search-space design and a search-strategy design based on the polyhedral technique. Finally, the quantization technique and the polyhedron-based search strategy together accelerate the computation of the neural network, improve the computation precision, and save device resources (energy consumption, etc.).
The embodiments in the present description are described in a progressive manner, each embodiment focuses on differences from other embodiments, and the same and similar parts among the embodiments are referred to each other. For the system disclosed by the embodiment, the description is relatively simple because the system corresponds to the method disclosed by the embodiment, and the relevant points can be referred to the method part for description.
The principles and embodiments of the present invention have been described herein using specific examples, which are provided only to help understand the method and the core concept of the present invention; meanwhile, for a person skilled in the art, according to the idea of the present invention, the specific embodiments and the application range may be changed. In view of the above, the present disclosure should not be construed as limiting the invention.

Claims (8)

1. An image recognition method based on adaptive data quantization and polyhedral templates, comprising:
acquiring a neural network which is selected by the end-side equipment and encoded in a traditional data format; the legacy data format comprises the FP32 data format;
determining an optimal numerical value set of a floating point expression form (AFP) of the self-adaptive data by adopting a Bayesian optimization-based search algorithm according to the neural network coded in the traditional data format; the optimal value set comprises exponent bit widths, mantissa bit widths and offset values corresponding to each layer in the neural network;
according to the optimal value set, encoding the neural network by adopting the AFP format to generate a dynamically quantized and encoded neural network;
acquiring a code template based on a polyhedron technology and the size of a memory of the end-side equipment;
determining the optimal parameters of the code template according to the size of the memory of the end-side equipment; the optimal parameter comprises a data block size;
describing a data calculation mode of each layer of the neural network after the dynamic quantization coding by adopting a code template with the optimal parameters, and generating an executable code of the neural network after the dynamic quantization coding;
acquiring image data encoded in the AFP format; the image data comprises face image data and object image data;
and deploying the executable code to the end-side equipment to identify the image data to obtain an image identification result.
2. The method according to claim 1, wherein determining an optimal set of AFP values using a bayesian-based optimization search algorithm based on said neural network encoded in a legacy data format comprises:
generating a candidate set of a plurality of possible values according to a Bayesian optimization algorithm;
and selecting the optimal value set from the candidate set according to a preset calculation function.
3. The method according to claim 1, wherein the determining the optimal parameter of the code template according to the memory size of the end-side device specifically comprises:
determining a candidate set of a search space according to the size of the memory of the end-side device;
using an integer set library based on polyhedron technology analysis, taking the code template as input, and outputting a scheduling tree representing a calculation relationship;
and determining the optimal parameters in the candidate set of the search space by adopting a two-step search strategy based on different scheduling trees.
4. The method according to claim 1, wherein said obtaining image data encoded in said AFP format specifically comprises:
acquiring an image received by the end-side device; the image comprises a face image or an object image;
and encoding the image in the AFP format to generate image data encoded in the AFP format.
5. An image recognition system based on adaptive data quantization and polyhedral templates, comprising:
the traditional coding neural network acquisition module is used for acquiring a neural network coded in a traditional data format and selected by the end-side equipment; the legacy data format comprises the FP32 data format;
the Bayesian optimization search module is used for determining an optimal numerical value set of a floating point expression form AFP of the self-adaptive data by adopting a search algorithm based on Bayesian optimization according to the neural network coded in the traditional data format; the optimal value set comprises exponent bit widths, mantissa bit widths and offset values corresponding to each layer in the neural network;
an AFP quantization coding module, which is used for coding the neural network by adopting the AFP format according to the optimal value set and generating a dynamically quantized and coded neural network;
the code template acquisition module is used for acquiring a code template based on a polyhedron technology and the size of the memory of the end-side equipment;
the optimal parameter determining module is used for determining the optimal parameters of the code template according to the size of the memory of the end-side equipment; the optimal parameter comprises a data block size;
the executable code generation module is used for describing a data calculation mode of each layer of the neural network after the dynamic quantization coding by adopting a code template with the optimal parameters and generating executable codes of the neural network after the dynamic quantization coding;
the image data acquisition module is used for acquiring the image data coded in the AFP format; the image data comprises face image data and object image data;
and the image identification module is used for deploying the executable code to the end-side equipment to identify the image data to obtain an image identification result.
6. The system of claim 5, wherein the bayesian optimization search module specifically comprises:
the candidate set generating unit is used for generating a candidate set of a plurality of possible numerical values according to a Bayesian optimization algorithm;
and the optimal value set selecting unit is used for selecting the optimal value set from the candidate set according to a preset calculation function.
7. The system of claim 5, wherein the optimal parameter determination module specifically comprises:
the search space determining unit is used for determining a candidate set of a search space according to the size of the memory of the end-side equipment;
the scheduling tree output unit is used for outputting a scheduling tree representing a calculation relation by using the code template as input by using an integer set library based on polyhedron technology analysis;
and the optimal parameter determining unit is used for determining the optimal parameters in the candidate set of the search space by adopting a two-step search strategy based on different scheduling trees.
8. The system according to claim 5, wherein the image data acquisition module specifically comprises:
the image acquisition unit is used for acquiring images received by the end-side equipment; the image comprises a face image or an object image;
and the image AFP coding unit is used for coding the image in the AFP format and generating image data coded in the AFP format.
CN202111458489.3A 2021-12-02 2021-12-02 Image identification method and system based on self-adaptive data quantization and polyhedral template Pending CN114170490A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111458489.3A CN114170490A (en) 2021-12-02 2021-12-02 Image identification method and system based on self-adaptive data quantization and polyhedral template

Publications (1)

Publication Number Publication Date
CN114170490A true CN114170490A (en) 2022-03-11

Family

ID=80482526

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111458489.3A Pending CN114170490A (en) 2021-12-02 2021-12-02 Image identification method and system based on self-adaptive data quantization and polyhedral template

Country Status (1)

Country Link
CN (1) CN114170490A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114664395A (en) * 2022-03-25 2022-06-24 上海交通大学 Neural network and Bayes optimization-based thermal radiation material design method and system


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination