WO2021068469A1 - Quantization and fixed-point fusion method and apparatus for neural network - Google Patents

Quantization and fixed-point fusion method and apparatus for neural network

Info

Publication number
WO2021068469A1
Authority
WO
WIPO (PCT)
Prior art keywords
current layer
scale
calculation
fixed point
Application number
PCT/CN2020/083797
Other languages
French (fr)
Chinese (zh)
Inventor
Qi Nan (齐南)
Original Assignee
Baidu Online Network Technology (Beijing) Co., Ltd.
Application filed by Baidu Online Network Technology (Beijing) Co., Ltd.
Publication of WO2021068469A1

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks

Definitions

  • the preset processing parameters include at least one of: the quantization amplitude of the input data of the current layer, the quantization amplitude of the weights of the current layer, the quantization amplitude of the output data of the current layer, the batch-normalization scale of the current layer, and the batch-normalization bias value of the current layer.
  • at least one processor
  • Figure 4 is a schematic structural diagram of a neural network quantization and fixed point fusion device according to an embodiment of the present application
  • Solution (3) usually quantizes only the parameter part, and cannot achieve the desired effect in terms of network acceleration and operating efficiency.
  • the input data of the current layer of the neural network in FIG. 1 may include an input feature map (fm).
  • the "int_n" on the connecting line between "fm" and "compute" indicates that the input feature map of the current layer is quantized, and the subscript n indicates the number of quantized bits.
  • the input data and weights of the current layer of the neural network can be quantized.
  • the environmental parameters of the parking space, the heading angle of the vehicle body, and the weights are quantized.
  • in the current layer, the quantized weights are used to perform a calculation operation on the quantized input data to obtain the calculation result.
  • the result of the calculation operation may be a steering-wheel angle capable of completing automatic parking control.
  • fixed-point processing is performed on the preset processing parameters.
  • the preset processing parameters include the first bias value of the current layer.
  • the post-processing operation includes adding the fixed-point first bias value to the result of the calculation operation.
  • Fig. 3 is a flow chart of convolution calculation of a neural network quantization and fixed-point fusion method according to an embodiment of the present application.
  • a convolutional layer is taken as an example to show the calculation process of the quantization and fixed-point fusion method.
  • the calculation process includes an accelerated calculation process and an initialization calculation process.
  • the rectangular box labeled 1 in FIG. 3 represents the accelerated calculation process, and the rectangular box labeled 2 represents the initialization calculation process.
  • the calculation operation includes a convolution operation
  • the fixed-point unit 300 is configured to:
  • the memory 502 may include a storage program area and a storage data area.
  • the storage program area may store an operating system and an application program required by at least one function; the storage data area may store data created from use of the electronic device that executes the neural network quantization and fixed-point fusion method, etc.
  • the memory 502 may include a high-speed random access memory, and may also include a non-transitory memory, such as at least one magnetic disk storage device, a flash memory device, or other non-transitory solid-state storage devices.
  • the memory 502 may optionally include memories arranged remotely relative to the processor 501, and these remote memories may be connected via a network to an electronic device that executes the neural network quantization and fixed-point fusion method. Examples of such networks include, but are not limited to, the Internet, corporate intranets, local area networks, mobile communication networks, and combinations thereof.

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Health & Medical Sciences (AREA)
  • Computing Systems (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Molecular Biology (AREA)
  • Artificial Intelligence (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Image Analysis (AREA)
  • Facsimile Image Signal Circuits (AREA)
  • Compression, Expansion, Code Conversion, And Decoders (AREA)

Abstract

A neural network quantization and fixed-point fusion method and apparatus, an electronic device, and a storage medium, relating to the field of artificial intelligence and in particular to autonomous driving. The method comprises: quantizing the input data and weights of the current layer of the neural network (S110); in the current layer, performing a calculation operation on the quantized input data using the quantized weights to obtain a calculation result (S120); performing fixed-point processing on preset processing parameters (S130); and performing a post-processing operation on the calculation result using the fixed-point preset processing parameters to obtain the output of the current layer (S140). By fusing quantization with fixed-point processing, the method significantly reduces the bandwidth required for data transmission between operators, effectively reduces the computation load of the acceleration unit, makes full use of the acceleration unit's strength in fixed-point calculation, lowers the resource requirements of the computation, and improves computational efficiency while saving resources.

Description

Quantization and fixed-point fusion method and apparatus for a neural network
This application claims priority to Chinese patent application No. 201910966512.6, filed with the Chinese Patent Office on October 11, 2019 and entitled "Quantization and fixed-point fusion method and apparatus for a neural network", the entire content of which is incorporated herein by reference.
Technical field
This application relates to the field of information technology, and in particular to the field of artificial intelligence, especially autonomous driving (including autonomous parking).
Background
Traditional neural network computation is based on high-bit floating-point operations, which wastes a large amount of computing resources and is prone to overfitting, reducing the generalization ability of the model. In traditional neural network acceleration methods, even when low-bit floating-point or integer arithmetic is used, precision is wasted in the intermediate floating-point operations, so the final result must be truncated before use in subsequent steps. This both wastes precision and reduces computing power.
The same problem exists in the field of artificial intelligence, especially autonomous driving. For example, in application scenarios in autonomous parking, traditional neural network computation is based on high-bit floating-point operations, wasting a large amount of computing resources. Likewise, in traditional neural network acceleration methods, even when low-bit floating-point or integer arithmetic is used, precision is wasted in the intermediate floating-point operations, which both wastes precision and reduces computing power.
Summary of the invention
The embodiments of the present application provide a neural network quantization and fixed-point fusion method, apparatus, electronic device, and storage medium, to solve at least the above technical problems in the prior art.
In a first aspect, an embodiment of the present application provides a neural network quantization and fixed-point fusion method, including:
quantizing the input data and weights of the current layer of the neural network;
in the current layer, performing a calculation operation on the quantized input data using the quantized weights to obtain a calculation result;
performing fixed-point processing on preset processing parameters; and
performing a post-processing operation on the calculation result using the fixed-point preset processing parameters to obtain the output of the current layer.
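As a rough illustration, the four steps above can be sketched in scalar form. The uniform quantization scheme, the power-of-two scale, and the function names below are assumptions for illustration, not the patent's exact design:

```python
# Illustrative scalar sketch of steps S110-S140. The uniform quantization
# scheme, the power-of-two scale, and the function names are assumptions.

def quantize(x, scale, n_bits=8):
    """S110: map a float onto an n-bit signed integer grid using a quantization amplitude (scale)."""
    q_max = 2 ** (n_bits - 1) - 1
    return max(-q_max - 1, min(q_max, round(x / scale)))

def compute(q_input, q_weight):
    """S120: integer calculation operation on quantized data (a single multiply here)."""
    return q_input * q_weight

def to_fixed_point(value, q_frac=16):
    """S130: fixed-point processing -- an integer with q_frac fractional bits."""
    return round(value * (1 << q_frac))

def post_process(result, fixed_bias, q_frac=16):
    """S140: align the integer result to the fixed-point format, then add the bias."""
    return (result << q_frac) + fixed_bias

q_in = quantize(0.5, scale=1 / 128)   # 64
q_w = quantize(0.25, scale=1 / 128)   # 32
acc = compute(q_in, q_w)              # 2048
out = post_process(acc, to_fixed_point(0.25))
```

All arithmetic after S110 stays in integers, which is the point of the fusion: only the initial quantization touches floats.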
In the embodiments of this application, fusing quantization with fixed-point processing significantly reduces the bandwidth required for data transmission between operators, effectively reduces the computation load of the acceleration unit, and makes full use of the acceleration unit's strength in fixed-point calculation. At the same time, quantization and fixed-point processing lower the resource requirements of the computation, saving resources while improving computational efficiency.
In one embodiment, the calculation operation includes multiplying the quantized input data by the quantized weights;
the preset processing parameters include a first bias value of the current layer; and
the post-processing operation includes adding the fixed-point first bias value to the calculation result.
In the embodiments of the present application, the quantization of the input data and weights used in the calculation operation is fused with the fixed-point processing of the preset processing parameters used in the post-processing operation, achieving a good neural network acceleration effect.
In one embodiment, the calculation operation includes a convolution operation; and
the preset processing parameters include at least one of: the quantization amplitude of the input data of the current layer, the quantization amplitude of the weights of the current layer, the quantization amplitude of the output data of the current layer, the batch-normalization scale of the current layer, and the batch-normalization bias value of the current layer.
In the embodiments of the present application, quantization and fixed-point processing are fused in the convolution calculation flow, and the input data, weights, and output data of the current layer are used for accelerated processing, improving computational efficiency.
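To illustrate the kind of integer arithmetic in the convolution flow, here is a hypothetical sketch of a quantized 1-D convolution; the valid-mode, cross-correlation form (as in most deep-learning frameworks), the helper name, and the sample values are assumptions:

```python
# Hypothetical sketch of the quantized convolution: int_n inputs and weights
# multiply-accumulate into a wider int_m result. The 1-D valid-mode form
# (implemented as cross-correlation, as in most deep-learning frameworks) and
# the sample values are assumptions.

def quantized_conv1d(q_input, q_weight):
    """Valid-mode 1-D convolution over already-quantized integer data."""
    k = len(q_weight)
    return [
        sum(q_input[i + j] * q_weight[j] for j in range(k))
        for i in range(len(q_input) - k + 1)
    ]

# int8-range operands; the accumulated results need a wider (int_m) range.
acc = quantized_conv1d([10, -5, 7, 3], [2, -1])  # [25, -17, 11]
```

The accumulator widening is why the text distinguishes the n-bit operand type from the m-bit result type with m > n.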
In one embodiment, performing fixed-point processing on the preset processing parameters includes:
performing a fusion calculation on at least two preset processing parameters to obtain a scale value of the current layer and a second bias value of the current layer.
In the embodiments of the present application, the preset processing parameters are fused by calculation, and the fusion result serves as the data basis for the subsequent fixed-point processing; fixed-point processing increases the effective computing power of the acceleration unit.
In one embodiment, the post-processing operation includes: multiplying the calculation result by the scale value of the current layer, and then adding the second bias value of the current layer.
In the embodiments of the present application, the fixed-point fusion result is used for the post-processing operation, merging quantization with fixed-point processing, which greatly improves both the acceleration effect and operating efficiency.
In one embodiment, the fusion calculation on at least two preset processing parameters uses the following formulas:
new_scale = bn_scale * input_scale * weight_scale / output_scale;
new_bias = bn_bias / output_scale,
where new_scale is the scale value of the current layer, bn_scale is the batch-normalization scale of the current layer, input_scale is the quantization amplitude of the input data of the current layer, weight_scale is the quantization amplitude of the weights of the current layer, output_scale is the quantization amplitude of the output data of the current layer, new_bias is the second bias value of the current layer, and bn_bias is the batch-normalization bias value of the current layer.
In the embodiments of this application, the above preset processing parameters are fused by calculation, and the fusion result serves as the data basis for the subsequent fixed-point processing. Fixed-point processing effectively reduces the computation load of the acceleration unit and makes full use of its strength in fixed-point calculation.
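The two fusion formulas above can be sketched directly in code; the variable names follow the text, while the numeric parameter values are made up for illustration:

```python
# The fusion formulas from the text, with made-up parameter values.
# Variable names follow the patent (bn_scale, input_scale, weight_scale,
# output_scale, bn_bias).

def fuse_parameters(bn_scale, input_scale, weight_scale, output_scale, bn_bias):
    """Fuse the preset processing parameters into the layer scale and second bias."""
    new_scale = bn_scale * input_scale * weight_scale / output_scale
    new_bias = bn_bias / output_scale
    return new_scale, new_bias

def post_process(conv_result, new_scale, new_bias):
    """Post-processing: multiply the calculation result by the layer scale, then add the second bias."""
    return conv_result * new_scale + new_bias

new_scale, new_bias = fuse_parameters(
    bn_scale=1.5, input_scale=0.02, weight_scale=0.05, output_scale=0.1, bn_bias=0.4)
y = post_process(1000, new_scale, new_bias)  # approx. 1000 * 0.015 + 4.0
```

In deployment, new_scale and new_bias would themselves be converted to fixed-point values offline, so the per-layer post-processing stays in integer arithmetic.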
In a second aspect, an embodiment of the present application provides a neural network quantization and fixed-point fusion apparatus, including:
a quantization unit, configured to quantize the input data and weights of the current layer of the neural network;
a first operation unit, configured to, in the current layer, perform a calculation operation on the quantized input data using the quantized weights to obtain a calculation result;
a fixed-point unit, configured to perform fixed-point processing on preset processing parameters; and
a second operation unit, configured to perform a post-processing operation on the calculation result using the fixed-point preset processing parameters to obtain the output of the current layer.
In one embodiment, the calculation operation includes multiplying the quantized input data by the quantized weights;
the preset processing parameters include a first bias value of the current layer; and
the post-processing operation includes adding the fixed-point first bias value to the calculation result.
In one embodiment, the calculation operation includes a convolution operation; and
the preset processing parameters include at least one of: the quantization amplitude of the input data of the current layer, the quantization amplitude of the weights of the current layer, the quantization amplitude of the output data of the current layer, the quantization amplitude of the batch-normalization scale of the current layer, and the quantization amplitude of the batch-normalization bias value of the current layer.
In one embodiment, the fixed-point unit is configured to:
perform a fusion calculation on at least two preset processing parameters to obtain a scale value of the current layer and a second bias value of the current layer.
In one embodiment, the post-processing operation includes: multiplying the calculation result by the scale value of the current layer, and then adding the second bias value of the current layer.
In one embodiment, the fixed-point unit performs the fusion calculation using the following formulas:
new_scale = bn_scale * input_scale * weight_scale / output_scale;
new_bias = bn_bias / output_scale,
where new_scale is the scale value of the current layer, bn_scale is the batch-normalization scale of the current layer, input_scale is the quantization amplitude of the input data of the current layer, weight_scale is the quantization amplitude of the weights of the current layer, output_scale is the quantization amplitude of the output data of the current layer, new_bias is the second bias value of the current layer, and bn_bias is the batch-normalization bias value of the current layer.
In a third aspect, an embodiment of the present application provides an electronic device, including:
at least one processor; and
a memory communicatively connected to the at least one processor, where
the memory stores instructions executable by the at least one processor, and the instructions are executed by the at least one processor so that the at least one processor can perform the method provided in any embodiment of the present application.
In a fourth aspect, an embodiment of the present application provides a non-transitory computer-readable storage medium storing computer instructions, the computer instructions being used to cause a computer to perform the method provided in any embodiment of the present application.
An embodiment of the above application has the following advantages or beneficial effects: through the fusion of quantization and fixed-point processing, the bandwidth required for data transmission between operators is significantly reduced, the computation load of the acceleration unit is effectively reduced, and the acceleration unit's strength in fixed-point calculation is fully exploited. At the same time, quantization and fixed-point processing lower the resource requirements of the computation, saving resources while improving computational efficiency.
Other effects of the above optional implementations will be described below in conjunction with specific embodiments.
Description of the drawings
The drawings are provided for a better understanding of the solution and do not limit the present application. In the drawings:
Fig. 1 is a flowchart of a neural network quantization and fixed-point fusion method according to an embodiment of the present application;
Fig. 2 is a diagram of the quantization and fixed-point fusion relationship in the method according to an embodiment of the present application;
Fig. 3 is a convolution calculation flowchart of the method according to an embodiment of the present application;
Fig. 4 is a schematic structural diagram of a neural network quantization and fixed-point fusion apparatus according to an embodiment of the present application;
Fig. 5 is a block diagram of an electronic device for implementing the neural network quantization and fixed-point fusion method according to an embodiment of the present application.
Detailed description of embodiments
Exemplary embodiments of the present application are described below with reference to the accompanying drawings, including various details of the embodiments to aid understanding; these should be regarded as merely exemplary. Therefore, those of ordinary skill in the art should recognize that various changes and modifications can be made to the embodiments described herein without departing from the scope and spirit of the present application. Likewise, for clarity and conciseness, descriptions of well-known functions and structures are omitted below.
Current solutions for neural network inference on embedded platforms include: (1) selecting a small network; (2) model pruning and compression; (3) parameter quantization.
These solutions have the following drawbacks.
Solution (1) can only handle simple tasks and suits simple scenarios; complex tasks require models with more complex structures, so this method cannot meet their needs.
Solution (2) reduces the internal branches of a larger network. It appears able to handle both simple and complex tasks, but it is not suited to acceleration units that run with a high degree of parallelism: because the network structure is altered, parallel implementation becomes problematic, and the acceleration capability of the acceleration unit cannot be exploited.
Solution (3) usually quantizes only the parameter part, and cannot achieve the desired effect in terms of network acceleration and operating efficiency.
Fig. 1 is a flowchart of the neural network quantization and fixed-point fusion method according to an embodiment of the present application. Referring to Fig. 1, the method includes:
Step S110: quantize the input data and weights of the current layer of the neural network;
Step S120: in the current layer, perform a calculation operation on the quantized input data using the quantized weights to obtain a calculation result;
Step S130: perform fixed-point processing on preset processing parameters;
Step S140: perform a post-processing operation on the calculation result using the fixed-point preset processing parameters to obtain the output of the current layer.
In concrete implementations of neural networks, acceleration techniques can be used to improve operating efficiency. Taking convolutional neural networks as an example, acceleration methods may include quantization, which uses low numerical precision in computation to increase speed. Specifically, quantization compresses the original network by reducing the number of bits needed to represent each weight. In one example, N bits can represent 2 to the power N values, i.e., the weights of the network are modified so that they can take only 2 to the power N distinct values. Low-bit quantization quantizes the data of the computation and of the processing steps before and after it while preserving accuracy, reducing the representation range of the data. Taking embedded platforms as an example: in current neural network inference, on the one hand, the inputs, outputs, and corresponding weights of each layer tend to be normally distributed or sparse; on the other hand, high-precision data representation brings effects similar to overfitting and wastes result precision. Neural network inference on embedded platforms therefore usually adopts low-bit quantization to improve the robustness of the model and the computing power of the platform.
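A minimal sketch of the low-bit quantization idea, assuming a symmetric per-tensor scheme (the text states only that inputs and weights are quantized to n bits, not the exact scheme):

```python
# Minimal sketch of low-bit quantization: with n bits a tensor is restricted to
# 2**n integer levels. The symmetric per-tensor scheme is an assumption; the
# patent only states that inputs and weights are quantized.

def quantization_amplitude(values, n_bits):
    """Amplitude (scale) that maps the largest magnitude onto the largest n-bit integer."""
    q_max = 2 ** (n_bits - 1) - 1
    return max(abs(v) for v in values) / q_max

def quantize_tensor(values, n_bits):
    scale = quantization_amplitude(values, n_bits)
    q_max = 2 ** (n_bits - 1) - 1
    return [max(-q_max, min(q_max, round(v / scale))) for v in values], scale

weights = [0.8, -0.3, 0.05, -0.8]
q_weights, scale = quantize_tensor(weights, n_bits=8)
# q * scale approximately reconstructs the original float values
```

The per-tensor amplitude here plays the role of the "quantization amplitude" parameters (input_scale, weight_scale, output_scale) that the method later fuses.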
Fixed-point processing converts data from floating-point representation to fixed-point representation. Acceleration units on embedded platforms, such as FPGAs (Field Programmable Gate Arrays), DSPs (digital signal processors), and GPUs (Graphics Processing Units), all support fixed-point calculation well and execute it more efficiently. Fixed-point processing not only fully preserves result accuracy but also maximizes the effective computing power of the acceleration unit.
Most numerical data processed by computers contain fractions, and there are generally two ways to represent the radix point. One fixes the radix point of all numerical data at an agreed position; this is fixed-point notation, and such numbers are fixed-point numbers. The other lets the position of the radix point float; this is floating-point notation, and such numbers are floating-point numbers. In other words, a fixed-point number has a fixed radix point: no bit in the computer is dedicated to the radix point, whose position is agreed by convention. A floating-point number is one whose radix point position can move. To widen the representable range and prevent overflow, some application scenarios represent data as floating-point numbers; floating-point notation is similar to scientific notation.
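The fixed p.q idea (p integer bits, q fractional bits, introduced with Fig. 2 below) can be sketched as follows; the choice of q = 8 fractional bits is illustrative only:

```python
# Sketch of fixed-point p.q representation: an integer whose low q bits hold the
# fractional part. The choice q = 8 is illustrative only.

Q_FRAC = 8  # number of fractional bits (the "q" in fixed p.q)

def float_to_fixed(x, q_frac=Q_FRAC):
    """Convert a float into a fixed-point integer with q_frac fractional bits."""
    return round(x * (1 << q_frac))

def fixed_to_float(f, q_frac=Q_FRAC):
    return f / (1 << q_frac)

def fixed_mul(a, b, q_frac=Q_FRAC):
    """Multiplying two p.q numbers yields 2q fractional bits; shift back down to q."""
    return (a * b) >> q_frac

a = float_to_fixed(1.5)   # 384
b = float_to_fixed(0.25)  # 64
prod = fixed_mul(a, b)    # 96, i.e. 0.375
```

Because every operation is plain integer arithmetic plus shifts, this maps directly onto the FPGA/DSP/GPU fixed-point units mentioned above.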
Taking embedded platforms as an example, accelerating the neural network is essential to improving computational efficiency. Combining the characteristics of the algorithm itself with those of embedded acceleration units, the embodiments of the present invention provide an efficient computation-optimization method that fuses quantization with fixed-point processing. Fig. 2 is a diagram of the quantization and fixed-point fusion relationship in the method according to an embodiment of the present application. As shown in Fig. 2, the method mainly comprises two calculation steps, low-bit quantized calculation and fixed-point calculation; the two steps depend on and influence each other, and combining them effectively achieves a more efficient acceleration effect.
As shown in Fig. 2, the low-bit quantized calculation step involves "fm", "conv weight", and "compute". Here "fm" stands for feature map and denotes the input feature map of the current layer; "conv weight" denotes the weights of the convolutional layer; and "compute" denotes the calculation operation of the current layer, such as multiplication by the weights or the convolution operation of a convolutional layer.
Referring to Figs. 1 and 2, in one example "fm" itself may be an integer. The input data of the current layer of the neural network in Fig. 1 may include an input feature map (fm). In Fig. 2, the "int_n" on the connecting line between "fm" and "compute" indicates that the input feature map of the current layer is quantized, with the subscript n denoting the number of quantized bits.
Referring to Fig. 1 and Fig. 2, when the current layer is a convolutional layer, the weights of the current layer of the neural network in Fig. 1 are the weights of the convolutional layer (conv weight). In Fig. 2, the "int_n" on the line connecting "conv weight" and "compute" indicates that the weights of the current layer are quantized, with the subscript n denoting the bit width after quantization. The output of "compute" in Fig. 2 is the calculation-operation result of Fig. 1. The "int_m" on the line connecting "compute" and "post compute" indicates that the output of "compute" is integer data, with the subscript m denoting its bit width. In one example, n may be 8, 4 or 2, and m > n.
As shown in Fig. 2, the fixed-point calculation step involves "post compute" and "bias". "Post compute" denotes the post-processing operations in the current layer, such as adding a bias or normalization. Referring to Fig. 1 and Fig. 2, the preset processing parameters in Fig. 1 may include bias, i.e. the bias value among the network parameters of the current layer. The "bias" in Fig. 2 indicates that fixed-point conversion of the bias produces a fixed p.q value, where the subscripts p and q denote the numbers of bits occupied by the integer part and the fractional part of the fixed-point representation, respectively.
Referring to Fig. 2, the fixed p.q value is fed into "post compute" for the post-processing operation, producing the calculation result of the current layer. This result may be more precise than required; that is, its precision may exceed the precision required of the output. The result therefore needs further quantization.
As shown in Fig. 2, "quant" denotes quantizing the result of the post-processing operation, for example by truncating fractional bits or carrying. A common carrying scheme is rounding to the nearest value.
As shown in Fig. 2, the result of the "quant" step serves as the input of the next layer of the neural network, i.e. "fm(next)" in Fig. 2.
As noted above, the embodiments of the present invention do not restrict the quantization bit width, which may be 8, 4 or even 2 bits. Once the bit width of the low-bit quantization is determined, the fixed-point parameters can be determined adaptively; for example, the integer width p and fractional width q of the fixed-point representation can be chosen according to the data-precision requirements of the specific application. In one implementation, the acceleration of neural network inference may also be achieved through quantization alone or through fixed-point conversion alone.
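The int_n quantization and the fixed p.q representation described above can be sketched as follows. The function names and the symmetric range mapping are illustrative assumptions for this sketch, not the exact scheme of the embodiment; `scale` plays the role of the quantization amplitude collected during training.

```python
import numpy as np

def quantize(x, scale, n):
    """Map floating-point values into n-bit signed integers (int_n).

    `scale` is the quantization amplitude (e.g. the max |x| observed in
    training); values are scaled into the n-bit signed range and rounded.
    """
    qmax = 2 ** (n - 1) - 1
    q = np.round(x / scale * qmax)
    return np.clip(q, -qmax - 1, qmax).astype(np.int64)

def to_fixed_point(x, p, q):
    """Fixed p.q encoding: p integer bits, q fractional bits.

    The value is stored as the integer round(x * 2^q), saturated to the
    representable range (one sign bit assumed inside p).
    """
    scaled = np.round(x * (1 << q))
    limit = (1 << (p + q - 1)) - 1
    return np.clip(scaled, -limit - 1, limit).astype(np.int64)

def from_fixed_point(xq, q):
    """Decode a fixed-point integer back to a floating-point value."""
    return np.asarray(xq, dtype=np.float64) / (1 << q)
```

With n = 8 this reproduces the usual int8 range; smaller n (4 or 2 bits) only changes `qmax`.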
In the embodiments of this application, fusing quantization with fixed-point processing significantly lowers the bandwidth demanded by data transfers between operators, effectively reduces the computation load of the acceleration unit, and fully exploits the unit's fixed-point computing strengths. Because the bit width of the data transferred between operators drops exponentially, computing resources are saved as well; and since each operator can start computing directly on quantized data, the resources previously consumed by intermediate data conversion are eliminated. Together, quantization and fixed-point conversion lower the resource requirements of the computation, improving efficiency while saving resources.
In one implementation, the calculation operation in Fig. 1 includes multiplying the quantized input data by the quantized weights;
the preset processing parameters include a first bias value of the current layer; and
the post-processing operation includes adding the calculation-operation result to the first bias value after fixed-point conversion.
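The multiply-then-add structure of this implementation can be sketched as below. The function name and the assumption that the bias has already been aligned to the accumulator's fixed-point domain are illustrative, not the embodiment's exact arrangement.

```python
import numpy as np

def multiply_then_bias(input_q, weight_q, bias_fp):
    """Low-bit multiply followed by a fixed-point bias add (a sketch).

    input_q and weight_q are int_n quantized tensors; their products
    accumulate into a wider integer (int_m, m > n). bias_fp is the first
    bias value, assumed pre-converted to the same fixed-point domain as
    the accumulator, so the post-processing is a plain integer add.
    """
    acc = input_q.astype(np.int64) @ weight_q.astype(np.int64)
    return acc + bias_fp
```

The key property is that no floating-point arithmetic appears between the quantized inputs and the biased result.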
Referring to Fig. 1 and Fig. 2, in this implementation the "low-bit quantization calculation" in Fig. 2 may specifically include multiplying the quantized input data by the quantized weights, where "compute" may specifically include multiplying the quantized input feature map by the quantized weights.
Referring to Fig. 1 and Fig. 2, the preset processing parameters in Fig. 1 may include the bias value among the network parameters of the current layer, also called the first bias value. The "fixed-point calculation" in Fig. 2 may specifically include adding the result of the calculation operation to the first bias value after fixed-point conversion, where "post compute" may specifically include adding the fixed p.q value to the result of the multiplication above.
In the embodiments of this application, quantization of the input data and weights used in the calculation operation is fused with fixed-point conversion of the preset processing parameters used in the post-processing operation, achieving a strong neural network acceleration effect.
The embodiments of this application can be applied in the field of artificial intelligence, and especially in autonomous driving. For example, in an autonomous parking scenario, the environmental parameters of the parking space and the heading angle of the parking vehicle can serve as input data, and the steering-wheel angle as the output. The environmental parameters of the parking space may include its size and position. In one example, the input data may also include the vehicle-body position, such as the coordinates of the right-rear point of the body. The steering-wheel angle is computed by the trained neural network, and automatic parking control is then completed based on the computed steering-wheel angle, the parking-space data and the body position.
In the autonomous parking scenario, the input data and weights of the current layer of the neural network can be quantized, for example the environmental parameters of the parking space, the heading angle of the vehicle, and the weights. In the current layer, the quantized weights are applied to the quantized input data in a calculation operation to obtain the calculation-operation result; for example, the result may be a steering-wheel angle capable of completing automatic parking control. The preset processing parameters, which may include the first bias value of the current layer, are then converted to fixed point. Using the fixed-point preset processing parameters, a post-processing operation is applied to the calculation-operation result to obtain the output of the current layer; for example, the post-processing operation includes adding the calculation-operation result to the fixed-point first bias value.
In the autonomous parking scenario, quantization of the input data and weights used in the calculation operation is fused with fixed-point conversion of the preset processing parameters used in the post-processing operation, achieving a strong neural network acceleration effect. This fusion significantly lowers the bandwidth demanded by data transfers between operators, effectively reduces the computation load of the acceleration unit, and fully exploits the unit's fixed-point computing strengths, improving computing efficiency while saving resources.
Fig. 3 is a convolution-calculation flow chart of the quantization and fixed-point fusion method for a neural network according to an embodiment of the present application. Fig. 3 takes a convolutional layer as an example to show the calculation flow of the fusion method, which comprises an accelerated calculation flow and an initialization calculation flow: the rectangle labeled 1 in Fig. 3 represents the accelerated calculation flow, and the rectangle labeled 2 represents the initialization calculation flow.
Referring to Fig. 1 and Fig. 3, step S110 in Fig. 1 may specifically include the following in the accelerated calculation flow: "weight int_n" indicates quantizing the weights of the convolutional layer, with the subscript n denoting the bit width after quantization, and "input int_n" indicates quantizing the input data of the convolutional layer, again with the subscript n denoting the quantized bit width.
In one implementation, the calculation operation includes a convolution operation.
Referring to Fig. 1 and Fig. 3, in a convolutional layer the calculation operation of step S120 in Fig. 1 may specifically include a convolution operation. "Multi compute" in Fig. 3 represents the convolution operation; its result, "Conv result (int_m)", is integer data whose bit width is denoted by the subscript m.
Referring to Fig. 1 and Fig. 3, the result of the fixed-point conversion of the preset processing parameters in step S130 in Fig. 1 may specifically include the final "New weight" result produced in the initialization calculation flow. The final "New weight" result comprises New scale (the scale value of the current layer) and New bias (the second bias value of the current layer).
In one implementation, the post-processing operation includes multiplying the calculation-operation result by the scale value of the current layer and then adding the second bias value of the current layer.
Referring to Fig. 1 and Fig. 3, the post-processing operation of step S140 in Fig. 1 may specifically include the operation performed by "Multi and add" in the accelerated calculation flow: multiplying "Conv result" by New scale and then adding New bias. This operation yields the output data "output (int_n)" of the current layer, integer data whose bit width is denoted by the subscript n.
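The "Multi and add" step can be sketched as follows, assuming (as an illustration, not the embodiment's exact format) that New scale and New bias have been converted to fixed point with `q_frac` fractional bits; the rounding right-shift at the end plays the role of the final requantization to int_n.

```python
import numpy as np

def multi_and_add(conv_result, new_scale_fp, new_bias_fp, q_frac, n):
    """Rescale the int_m convolution result and requantize to int_n.

    new_scale_fp and new_bias_fp are the fused parameters in fixed-point
    form with q_frac fractional bits (the bias pre-scaled into the same
    domain as the product). The shift rounds to nearest.
    """
    acc = conv_result.astype(np.int64) * new_scale_fp + new_bias_fp
    half = 1 << (q_frac - 1)          # rounding offset
    out = (acc + half) >> q_frac      # drop the fractional bits
    qmax = 2 ** (n - 1) - 1
    return np.clip(out, -qmax - 1, qmax).astype(np.int64)
```

For example, a conv result of 10 with New scale = 0.5 (128 in fixed 8.8) and New bias = 2 (512 in the same domain) yields an int8 output of 7.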
In one example, n may be 8, 4 or 2, and m > n.
In the embodiments of this application, the post-processing operation uses the fixed-point result of the fusion calculation, so that quantization and fixed-point conversion are fused, yielding substantial gains in both acceleration and running efficiency.
In one implementation, the preset processing parameters include at least one of: the quantization amplitude of the input data of the current layer, the quantization amplitude of the weights of the current layer, the quantization amplitude of the output data of the current layer, the batch-normalization scale of the current layer, and the batch-normalization bias value of the current layer.
Referring to Fig. 1 and Fig. 3, the preset processing parameters in Fig. 1 may specifically include "input scale", "Weight scale", "output scale", "bn scale" and "bn bias" in Fig. 3. In the initialization calculation flow, these preset processing parameters mean the following:
"input scale" is the quantization amplitude of the input data of the current layer collected as statistics during training, for example the maximum of the input data of the current layer.
"Weight scale" is the quantization amplitude of the weights of the current layer collected during training, for example the maximum of the weights of the current layer.
"output scale" is the quantization amplitude of the output data of the current layer collected during training, for example the maximum of the output data of the current layer.
"bn scale" is the batch-normalization scale of the current layer, one of the batch normalization (bn) parameters produced by training.
"bn bias" is the batch-normalization bias value of the current layer, also one of the bn parameters produced by training.
In the embodiments of this application, quantization and fixed-point conversion are fused within the convolution calculation flow, and the input data, weights and output data of the current layer are used in the acceleration processing, improving computing efficiency.
The above is an example of the preset processing parameters of a convolutional layer. In specific applications, the preset processing parameters can be adjusted according to the network structure.
In one implementation, the fixed-point conversion of the preset processing parameters in step S130 in Fig. 1 includes:
performing a fusion calculation on at least two preset processing parameters to obtain the scale value of the current layer and the second bias value of the current layer.
Referring to Fig. 1 and Fig. 3, "Weight fusion" in Fig. 3 denotes the fusion calculation over the preset processing parameters.
In the embodiments of this application, the preset processing parameters are combined in a fusion calculation whose result serves as the data basis for the subsequent fixed-point conversion; fixed-point conversion increases the effective computing power of the acceleration unit.
In one implementation, the fusion calculation over at least two preset processing parameters uses the following formulas:
new_scale=bn_scale*input_scale*weight_scale/output_scale;new_scale=bn_scale*input_scale*weight_scale/output_scale;
new_bias=bn_bias/output_scale,new_bias=bn_bias/output_scale,
where new_scale is the scale value of the current layer, bn_scale is the batch-normalization scale of the current layer, input_scale is the quantization amplitude of the input data of the current layer, weight_scale is the quantization amplitude of the weights of the current layer, output_scale is the quantization amplitude of the output data of the current layer, new_bias is the second bias value of the current layer, and bn_bias is the batch-normalization bias value of the current layer.
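The two formulas can be transcribed directly; as the text describes, the fusion is computed offline at full precision and the results are only afterwards truncated to fixed point. The `to_fixed` helper and its `q_frac` parameter are illustrative assumptions about that truncation step.

```python
def fuse_parameters(bn_scale, bn_bias, input_scale, weight_scale, output_scale):
    """Offline fusion of the preset processing parameters (full precision).

    Implements new_scale and new_bias exactly as given by the formulas
    above; the accelerator later consumes fixed-point versions of both.
    """
    new_scale = bn_scale * input_scale * weight_scale / output_scale
    new_bias = bn_bias / output_scale
    return new_scale, new_bias

def to_fixed(value, q_frac):
    """Truncate a fused result to fixed point with q_frac fractional bits."""
    return round(value * (1 << q_frac))
```

Because everything here runs in the initialization flow, it costs no time on the acceleration unit at inference.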
In the embodiments of this application, the fusion calculation over these preset processing parameters yields a result that serves as the data basis for the subsequent fixed-point conversion; fixed-point conversion effectively reduces the computation load of the acceleration unit and fully exploits the unit's fixed-point computing strengths.
Referring to Fig. 3, "Fix point" denotes the fixed-point conversion. In one implementation, the "Weight fusion" calculation produces an intermediate "New weight" result, which is then converted to fixed point to obtain the final "New weight" result.
As shown in Fig. 3, the operation flows or data labeled 3 represent low-bit data and the corresponding calculations. The operation flows or data labeled 4 represent the initialization calculation flow, which is based on floating-point arithmetic and does not occupy the acceleration unit; the acceleration unit may include hardware devices such as an FPGA (Field Programmable Gate Array) or a GPU (Graphics Processing Unit). The operation flows or data labeled 5 represent fixed-point data and the related computations.
A neural network is usually a multi-layer structure in which each layer performs one processing operation, and each processing operation corresponds to one operator. Still taking the embedded acceleration platform as an example, the low-bit quantization processing in the embodiments of this application adds quantization rules to the weight parameters of the neural network during training, so that the trained weights are represented as low-bit data; during training, quantization-parameter statistics are collected on the inputs and outputs of each operator, and the collected quantization parameters are applied during inference of the neural network on the embedded acceleration platform. Given this premise, during fixed-point conversion the precision required of each operator's inputs and outputs need not be high: reaching the quantized low-bit representation suffices, so the operator's output does not need full-precision computation as long as the precision of the quantized output is guaranteed. During fixed-point processing, the low-bit calculation result is combined with the preset processing parameters to obtain the final low-bit result; the preset processing parameters may, for example, be computed at full precision in the initialization calculation flow and then converted to fixed point.
The computation order of the acceleration unit, that is, the actual computation order after operator fusion, must be considered during training of the neural network. Take, for example, a network structure of conv (convolution) + add + bn + relu (Rectified Linear Unit): the four operators correspond to four layers of the network and can be fused into one operator for acceleration. Such a structure can be processed in a manner similar to Fig. 2; the actual computation first performs the multiply-accumulate of the convolution, then folds the add operation into bn, with the low-bit quantization parameters also folded into bn. Because the preset processing parameters involve the input data, the weights and the output data, this order differs from the traditional computation order, which usually contains only separate computation steps over the input data or the weights. At the same time, the fixed-point operations must be taken into account during training, meaning that special quantization must be performed at certain positions.
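The fused conv + add + bn + relu order can be sketched as a single integer-domain operator. The assumption that the add and the quantization parameters have already been folded into a fused scale and bias during initialization follows the text; the function name and fixed-point format are illustrative.

```python
import numpy as np

def fused_conv_add_bn_relu(x_q, w_q, new_scale_fp, new_bias_fp, q_frac, n):
    """One fused operator replacing conv + add + bn + relu (a sketch).

    x_q, w_q: int_n quantized tensors. new_scale_fp / new_bias_fp carry
    the folded add, bn and quantization parameters in fixed point with
    q_frac fractional bits.
    """
    acc = x_q.astype(np.int64) @ w_q.astype(np.int64)   # conv multiply-accumulate
    acc = acc * new_scale_fp + new_bias_fp              # add + bn in fixed point
    acc = np.maximum(acc, 0)                            # relu in the integer domain
    half = 1 << (q_frac - 1)
    out = (acc + half) >> q_frac                        # requantize to int_n
    return np.clip(out, 0, 2 ** (n - 1) - 1).astype(np.int64)
```

The whole chain stays in integer arithmetic, which is what lets the four layers occupy the acceleration unit as a single operator.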
The embodiments of this application determine the training process according to the actual inference process: operator fusion must be considered during training so that the operators are computed in the agreed order, keeping the computation order of training consistent with that of inference. Moreover, during back-propagation, the gradients must be computed using the fixed quantization scales of the inputs and outputs to update the weight information; after the weight information is computed, the input and output quantization scales are updated using statistical methods. In other words, both the updating of the quantization scales and the acquisition of the weights are completed over the course of training. During training, the fixed-point representation can be updated automatically into the preset processing parameters by constraining the data range, where the constrained data range may specify the bit widths occupied by the integer part and the fractional part of the fixed-point representation.
During inference, the entire computation can be carried out according to the low-bit and fixed-point requirements to ensure accurate results. To keep inference consistent with training, the computation order must first be kept the same; then, when the preset processing parameters are fused, they are first computed offline at full precision, and in actual use they are truncated according to the fixed-point requirements to guarantee data correctness.
In the embodiments of this application, low-bit quantization and fixed-point conversion are fused together not only in the forward inference stage: how to fuse and fully adapt the two must also be considered during network training. For different neural-network frameworks, the operator-fusion part must be considered during training, including how to incorporate the above quantization and fixed-point conversion accurately into the operator-fusion process, for example by designing in advance at which step quantization is performed and at which step fixed-point conversion is performed, so as to guarantee the correctness of both the data and the process.
Fig. 4 is a schematic structural diagram of a quantization and fixed-point fusion apparatus for a neural network according to an embodiment of the present application. As shown in Fig. 4, the apparatus includes:
a quantization unit 100, configured to quantize the input data and weights of the current layer of the neural network;
a first operation unit 200, configured to apply, in the current layer, the quantized weights to the quantized input data in a calculation operation to obtain a calculation-operation result;
a fixed-point unit 300, configured to perform fixed-point conversion on preset processing parameters; and
a second operation unit 400, configured to perform a post-processing operation on the calculation-operation result using the fixed-point preset processing parameters, to obtain the output result of the current layer.
In one implementation, the calculation operation includes multiplying the quantized input data by the quantized weights;
the preset processing parameters include a first bias value of the current layer; and
the post-processing operation includes adding the calculation-operation result to the first bias value after fixed-point conversion.
In one implementation, the calculation operation includes a convolution operation; and
the preset processing parameters include at least one of: the quantization amplitude of the input data of the current layer, the quantization amplitude of the weights of the current layer, the quantization amplitude of the output data of the current layer, the batch-normalization scale of the current layer, and the batch-normalization bias value of the current layer.
In one implementation, the fixed-point unit 300 is configured to:
perform a fusion calculation on at least two preset processing parameters to obtain the scale value of the current layer and the second bias value of the current layer.
In one implementation, the post-processing operation includes multiplying the calculation-operation result by the scale value of the current layer and then adding the second bias value of the current layer.
In one implementation, the fixed-point unit 300 is configured to perform the fusion calculation using the following formulas:
new_scale=bn_scale*input_scale*weight_scale/output_scale;new_scale=bn_scale*input_scale*weight_scale/output_scale;
new_bias=bn_bias/output_scale,new_bias=bn_bias/output_scale,
where new_scale is the scale value of the current layer, bn_scale is the batch-normalization scale of the current layer, input_scale is the quantization amplitude of the input data of the current layer, weight_scale is the quantization amplitude of the weights of the current layer, output_scale is the quantization amplitude of the output data of the current layer, new_bias is the second bias value of the current layer, and bn_bias is the batch-normalization bias value of the current layer.
For the functions of the units in the quantization and fixed-point fusion apparatus for a neural network of the embodiments of this application, refer to the corresponding descriptions in the method above; they are not repeated here.
According to the embodiments of the present application, the present application also provides an electronic device and a readable storage medium.
Fig. 5 is a block diagram of an electronic device for the quantization and fixed-point fusion method for a neural network according to an embodiment of the present application. Electronic devices are intended to represent various forms of digital computers, such as laptop computers, desktop computers, workstations, personal digital assistants, servers, blade servers, mainframe computers and other suitable computers. Electronic devices may also represent various forms of mobile apparatus, such as personal digital assistants, cellular telephones, smart phones, wearable devices and other similar computing apparatus. The components shown herein, their connections and relationships, and their functions are merely examples and are not intended to limit the implementations of the application described and/or claimed herein.
As shown in FIG. 5, the electronic device includes: one or more processors 501, a memory 502, and interfaces for connecting the components, including a high-speed interface and a low-speed interface. The components are interconnected by different buses and may be mounted on a common motherboard or mounted in other ways as needed. The processor may process instructions executed within the electronic device, including instructions stored in or on the memory to display graphical information of a graphical user interface (GUI) on an external input/output device (such as a display device coupled to an interface). In other implementations, multiple processors and/or multiple buses may be used together with multiple memories, if desired. Likewise, multiple electronic devices may be connected, with each device providing part of the necessary operations (for example, as a server array, a group of blade servers, or a multi-processor system). In FIG. 5, one processor 501 is taken as an example.
The memory 502 is the non-transitory computer-readable storage medium provided by the present application. The memory stores instructions executable by at least one processor, so that the at least one processor performs the neural network quantization and fixed-point fusion method provided by the present application. The non-transitory computer-readable storage medium of the present application stores computer instructions, which are used to cause a computer to perform the neural network quantization and fixed-point fusion method provided by the present application.
As a non-transitory computer-readable storage medium, the memory 502 may be used to store non-transitory software programs, non-transitory computer-executable programs, and modules, such as the program instructions/modules corresponding to the neural network quantization and fixed-point fusion method in the embodiments of the present application (for example, the quantization unit 100, the first operation unit 200, the fixed-point unit 300, and the second operation unit 400 shown in FIG. 4). By running the non-transitory software programs, instructions, and modules stored in the memory 502, the processor 501 executes the various functional applications and data processing of the server, that is, implements the neural network quantization and fixed-point fusion method in the above method embodiments.
The memory 502 may include a program storage area and a data storage area. The program storage area may store an operating system and an application program required by at least one function; the data storage area may store data created according to the use of the electronic device that performs the neural network quantization and fixed-point fusion method, and the like. In addition, the memory 502 may include a high-speed random access memory, and may also include a non-transitory memory, such as at least one magnetic disk storage device, a flash memory device, or another non-transitory solid-state storage device. In some embodiments, the memory 502 may optionally include memories arranged remotely with respect to the processor 501, and these remote memories may be connected through a network to the electronic device that performs the neural network quantization and fixed-point fusion method. Examples of such networks include, but are not limited to, the Internet, intranets, local area networks, mobile communication networks, and combinations thereof.
The electronic device that performs the neural network quantization and fixed-point fusion method may further include an input device 503 and an output device 504. The processor 501, the memory 502, the input device 503, and the output device 504 may be connected by a bus or in other ways; in FIG. 5, connection by a bus is taken as an example.
The input device 503 may receive input digital or character information and generate key signal inputs related to the user settings and function control of the electronic device that performs the neural network quantization and fixed-point fusion method, and may be, for example, a touch screen, a keypad, a mouse, a trackpad, a touchpad, a pointing stick, one or more mouse buttons, a trackball, a joystick, or another input device. The output device 504 may include a display device, an auxiliary lighting device (for example, an LED), a tactile feedback device (for example, a vibration motor), and the like. The display device may include, but is not limited to, a liquid crystal display (LCD), a light-emitting diode (LED) display, and a plasma display. In some implementations, the display device may be a touch screen.
Various implementations of the systems and techniques described herein may be realized in digital electronic circuit systems, integrated circuit systems, application-specific integrated circuits (ASICs), computer hardware, firmware, software, and/or combinations thereof. These various implementations may include implementation in one or more computer programs that are executable and/or interpretable on a programmable system including at least one programmable processor, which may be a special-purpose or general-purpose programmable processor that can receive data and instructions from a storage system, at least one input device, and at least one output device, and transmit data and instructions to the storage system, the at least one input device, and the at least one output device.
These computer programs (also referred to as programs, software, software applications, or code) include machine instructions for a programmable processor and may be implemented using a high-level procedural and/or object-oriented programming language and/or an assembly/machine language. As used herein, the terms "machine-readable medium" and "computer-readable medium" refer to any computer program product, apparatus, and/or device (for example, a magnetic disk, an optical disc, a memory, or a programmable logic device (PLD)) used to provide machine instructions and/or data to a programmable processor, including a machine-readable medium that receives machine instructions as machine-readable signals. The term "machine-readable signal" refers to any signal used to provide machine instructions and/or data to a programmable processor.
To provide interaction with a user, the systems and techniques described herein may be implemented on a computer having: a display device (for example, a CRT (cathode-ray tube) or LCD (liquid crystal display) monitor) for displaying information to the user; and a keyboard and a pointing device (for example, a mouse or a trackball) through which the user can provide input to the computer. Other kinds of devices may also be used to provide interaction with the user; for example, the feedback provided to the user may be any form of sensory feedback (for example, visual feedback, auditory feedback, or tactile feedback), and input from the user may be received in any form (including acoustic input, voice input, or tactile input).
The systems and techniques described herein may be implemented in a computing system that includes a back-end component (for example, as a data server), or a computing system that includes a middleware component (for example, an application server), or a computing system that includes a front-end component (for example, a user computer with a graphical user interface or a web browser through which the user can interact with implementations of the systems and techniques described herein), or a computing system that includes any combination of such back-end, middleware, or front-end components. The components of the system may be interconnected by any form or medium of digital data communication (for example, a communication network). Examples of communication networks include a local area network (LAN), a wide area network (WAN), and the Internet.
The computer system may include clients and servers. A client and a server are generally remote from each other and typically interact through a communication network. The client-server relationship arises by virtue of computer programs running on the respective computers and having a client-server relationship with each other.
According to the technical solutions of the embodiments of the present application, the fusion of quantization processing and fixed-point processing significantly lowers the bandwidth requirement for data transmission between operators, effectively reduces the amount of computation on the acceleration unit, and makes full use of the advantages of the acceleration unit's fixed-point computation. At the same time, quantization processing and fixed-point processing reduce the resource requirements of the computation, improving computational efficiency while saving resources.
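For illustration only, the fused per-layer flow of this technical solution (quantize the inputs and weights, perform the calculation operation in integer arithmetic, then apply a single fused post-processing step) might be sketched as follows. The symmetric int8 scheme, the data types, and the function names are assumptions for the sketch, not details fixed by the application:

```python
import numpy as np

def quantize(x, amplitude):
    # Illustrative symmetric quantization of a float tensor to int8
    # using a per-tensor quantization amplitude.
    return np.clip(np.round(x / amplitude), -128, 127).astype(np.int8)

def fused_layer(x_q, w_q, new_scale, new_bias):
    # Integer accumulation (the fixed-point calculation operation),
    # followed by one fused post-processing step:
    # multiply by new_scale, then add new_bias.
    acc = x_q.astype(np.int32) @ w_q.astype(np.int32)
    return acc.astype(np.float32) * new_scale + new_bias

x_q = quantize(np.array([[0.5, -1.0]]), amplitude=0.5)   # -> [[1, -2]]
w_q = np.array([[2], [3]], dtype=np.int8)
out = fused_layer(x_q, w_q, new_scale=0.5, new_bias=1.0)
```

Because the per-element accumulation here is 1*2 + (-2)*3 = -4, the fused post-processing gives -4 * 0.5 + 1.0 = -1.0; only this final step touches floating point, which is what lets the operator-to-operator traffic stay in low-bit integers.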
It should be understood that steps may be reordered, added, or deleted using the various forms of flow shown above. For example, the steps described in the present application may be performed in parallel, sequentially, or in a different order, as long as the desired result of the technical solutions disclosed in the present application can be achieved; no limitation is imposed herein.
The above specific implementations do not constitute a limitation on the protection scope of the present application. Those skilled in the art should understand that various modifications, combinations, sub-combinations, and substitutions may be made according to design requirements and other factors. Any modification, equivalent replacement, improvement, and the like made within the spirit and principles of the present application shall be included in the protection scope of the present application.

Claims (14)

  1. A neural network quantization and fixed-point fusion method, characterized by comprising:
    quantizing input data and weights of a current layer of a neural network;
    in the current layer, performing a calculation operation on the quantized input data using the quantized weights, to obtain a calculation operation result;
    performing fixed-point processing on preset processing parameters; and
    performing a post-processing operation on the calculation operation result using the fixed-point-processed preset processing parameters, to obtain an output result of the current layer.
  2. The method according to claim 1, characterized in that:
    the calculation operation comprises multiplying the quantized input data by the quantized weights;
    the preset processing parameters comprise a first bias value of the current layer; and
    the post-processing operation comprises adding the calculation operation result to the fixed-point-processed first bias value.
  3. The method according to claim 1, characterized in that:
    the calculation operation comprises a convolution operation; and
    the preset processing parameters comprise at least one of: a quantization amplitude of the input data of the current layer, a quantization amplitude of the weights of the current layer, a quantization amplitude of output data of the current layer, a batch-normalization scale of the current layer, and a batch-normalization bias value of the current layer.
  4. The method according to claim 3, characterized in that performing fixed-point processing on the preset processing parameters comprises:
    performing a fusion calculation on at least two of the preset processing parameters to obtain a scale value of the current layer and a second bias value of the current layer.
  5. The method according to claim 4, characterized in that the post-processing operation comprises: multiplying the calculation operation result by the scale value of the current layer, and then adding the second bias value of the current layer.
  6. The method according to claim 4, characterized in that performing the fusion calculation on at least two of the preset processing parameters comprises performing the fusion calculation using the following formulas:
    new_scale=bn_scale*input_scale*weight_scale/output_scale;
    new_bias=bn_bias/output_scale,
    wherein new_scale denotes the scale value of the current layer, bn_scale denotes the batch-normalization scale of the current layer, input_scale denotes the quantization amplitude of the input data of the current layer, weight_scale denotes the quantization amplitude of the weights of the current layer, output_scale denotes the quantization amplitude of the output data of the current layer, new_bias denotes the second bias value of the current layer, and bn_bias denotes the batch-normalization bias value of the current layer.
  7. A neural network quantization and fixed-point fusion apparatus, characterized by comprising:
    a quantization unit, configured to quantize input data and weights of a current layer of a neural network;
    a first operation unit, configured to: in the current layer, perform a calculation operation on the quantized input data using the quantized weights, to obtain a calculation operation result;
    a fixed-point unit, configured to perform fixed-point processing on preset processing parameters; and
    a second operation unit, configured to perform a post-processing operation on the calculation operation result using the fixed-point-processed preset processing parameters, to obtain an output result of the current layer.
  8. The apparatus according to claim 7, characterized in that:
    the calculation operation comprises multiplying the quantized input data by the quantized weights;
    the preset processing parameters comprise a first bias value of the current layer; and
    the post-processing operation comprises adding the calculation operation result to the fixed-point-processed first bias value.
  9. The apparatus according to claim 7, characterized in that:
    the calculation operation comprises a convolution operation; and
    the preset processing parameters comprise at least one of: a quantization amplitude of the input data of the current layer, a quantization amplitude of the weights of the current layer, a quantization amplitude of output data of the current layer, a quantization amplitude of the batch-normalization scale of the current layer, and a quantization amplitude of the batch-normalization bias value of the current layer.
  10. The apparatus according to claim 9, characterized in that the fixed-point unit is configured to:
    perform a fusion calculation on at least two of the preset processing parameters to obtain a scale value of the current layer and a second bias value of the current layer.
  11. The apparatus according to claim 10, characterized in that the post-processing operation comprises: multiplying the calculation operation result by the scale value of the current layer, and then adding the second bias value of the current layer.
  12. The apparatus according to claim 10, characterized in that the fixed-point unit is configured to perform the fusion calculation using the following formulas:
    new_scale=bn_scale*input_scale*weight_scale/output_scale;
    new_bias=bn_bias/output_scale,
    wherein new_scale denotes the scale value of the current layer, bn_scale denotes the batch-normalization scale of the current layer, input_scale denotes the quantization amplitude of the input data of the current layer, weight_scale denotes the quantization amplitude of the weights of the current layer, output_scale denotes the quantization amplitude of the output data of the current layer, new_bias denotes the second bias value of the current layer, and bn_bias denotes the batch-normalization bias value of the current layer.
  13. An electronic device, characterized by comprising:
    at least one processor; and
    a memory communicatively connected to the at least one processor, wherein
    the memory stores instructions executable by the at least one processor, and the instructions are executed by the at least one processor to enable the at least one processor to perform the method according to any one of claims 1-6.
  14. A non-transitory computer-readable storage medium storing computer instructions, characterized in that the computer instructions are used to cause a computer to perform the method according to any one of claims 1-6.
PCT/CN2020/083797 2019-10-11 2020-04-08 Quantization and fixed-point fusion method and apparatus for neural network WO2021068469A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201910966512.6A CN110705696B (en) 2019-10-11 2019-10-11 Quantization and fixed-point fusion method and device for neural network
CN201910966512.6 2019-10-11

Publications (1)

Publication Number Publication Date
WO2021068469A1 true WO2021068469A1 (en) 2021-04-15

Family

ID=69198512

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2020/083797 WO2021068469A1 (en) 2019-10-11 2020-04-08 Quantization and fixed-point fusion method and apparatus for neural network

Country Status (2)

Country Link
CN (1) CN110705696B (en)
WO (1) WO2021068469A1 (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115019150A (en) * 2022-08-03 2022-09-06 深圳比特微电子科技有限公司 Target detection fixed point model establishing method and device and readable storage medium

Families Citing this family (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110705696B (en) * 2019-10-11 2022-06-28 阿波罗智能技术(北京)有限公司 Quantization and fixed-point fusion method and device for neural network
CN113222097A (en) * 2020-01-21 2021-08-06 上海商汤智能科技有限公司 Data processing method and related product
CN113326942B (en) * 2020-02-28 2024-06-11 上海商汤智能科技有限公司 Model reasoning method and device, electronic equipment and storage medium

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20160328645A1 (en) * 2015-05-08 2016-11-10 Qualcomm Incorporated Reduced computational complexity for fixed point neural network
CN108345939A (en) * 2017-01-25 2018-07-31 微软技术许可有限责任公司 Neural network based on fixed-point calculation
CN109902745A (en) * 2019-03-01 2019-06-18 成都康乔电子有限责任公司 A kind of low precision training based on CNN and 8 integers quantization inference methods
CN110008952A (en) * 2019-03-26 2019-07-12 深兰科技(上海)有限公司 A kind of target identification method and equipment
CN110705696A (en) * 2019-10-11 2020-01-17 百度在线网络技术(北京)有限公司 Quantization and fixed-point fusion method and device for neural network

Family Cites Families (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US10373050B2 (en) * 2015-05-08 2019-08-06 Qualcomm Incorporated Fixed point neural network based on floating point neural network quantization
CN115688877A (en) * 2017-06-06 2023-02-03 格兰菲智能科技有限公司 Method and computing device for fixed-point processing of data to be quantized
CN107748915A (en) * 2017-11-02 2018-03-02 北京智能管家科技有限公司 Compression method, device, equipment and the medium of deep neural network DNN models
CN110059811A (en) * 2017-11-06 2019-07-26 畅想科技有限公司 Weight buffer
CN108334945B (en) * 2018-01-30 2020-12-25 中国科学院自动化研究所 Acceleration and compression method and device of deep neural network

Also Published As

Publication number Publication date
CN110705696B (en) 2022-06-28
CN110705696A (en) 2020-01-17

Similar Documents

Publication Publication Date Title
WO2021068469A1 (en) Quantization and fixed-point fusion method and apparatus for neural network
KR102396936B1 (en) Method, device, electronic device and storage medium for acquiring reading comprehension model
CN106990937B (en) Floating point number processing device and processing method
CN111832701B (en) Model distillation method, model distillation device, electronic equipment and storage medium
KR20210114853A (en) Method and apparatus for updating parameter of model
US11216615B2 (en) Method, device and storage medium for predicting punctuation in text
CN110807331B (en) Polyphone pronunciation prediction method and device and electronic equipment
KR20210148918A (en) Method, device, equipment and storage medium for acquiring word vector based on language model
JP7210830B2 (en) Speech processing system, speech processing method, electronic device and readable storage medium
CN111666077B (en) Operator processing method and device, electronic equipment and storage medium
WO2022057502A1 (en) Method and device for implementing dot product operation, electronic device, and storage medium
KR20220003444A (en) Optimizer learning method and apparatus, electronic device and readable storage medium
CN112529189A (en) Model compression method and device, electronic equipment and storage medium
US11243743B2 (en) Optimization of neural networks using hardware calculation efficiency and adjustment factors
US20220113943A1 (en) Method for multiply-add operations for neural network
CN112036561B (en) Data processing method, device, electronic equipment and storage medium
WO2022027862A1 (en) Method and device for quantifying neural network model
CN110782029B (en) Neural network prediction method and device, electronic equipment and automatic driving system
CN112507692B (en) Method and device for establishing style text generation model
WO2022206138A1 (en) Operation method and apparatus based on neural network
CN115237991A (en) Data format conversion method and device and matrix processing method and device
CN115952790A (en) Information extraction method and device
JP2023103419A (en) Operation method, device, chip, electronic equipment, and storage medium
CN115237992A (en) Data format conversion method and device and matrix processing method and device
CN115951860A (en) Data processing device, data processing method and electronic equipment

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 20875017

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 20875017

Country of ref document: EP

Kind code of ref document: A1

32PN Ep: public notification in the ep bulletin as address of the adressee cannot be established

Free format text: NOTING OF LOSS OF RIGHTS PURSUANT TO RULE 112(1) EPC (EPO FORM 1205A DATED 13.03.2023)

122 Ep: pct application non-entry in european phase

Ref document number: 20875017

Country of ref document: EP

Kind code of ref document: A1