Disclosure of Invention
In the field of intelligent detection, where products must meet both high real-time performance and high precision requirements, the invention provides a neural network quantization method capable of satisfying both performance and precision requirements.
The method is realized by the following technical scheme:
a neural network quantization method, comprising the steps of:
setting an inference precision threshold value of a neural network;
separately quantizing the output characteristic data quantization parameter of each layer and the weight quantization parameter of each layer;
for the associated layers preceding and following one another in the neural network, updating the output characteristic data quantization parameter of the input layer according to the output characteristic data quantization parameter of the current layer, wherein an associated layer is a layer whose input and output characteristic data are unchanged;
and, taking the inference precision threshold as a constraint condition, optimizing the output characteristic data quantization parameters according to the result of comparing, for each layer, the inference floating-point values with the separately quantized fixed-point values converted back to floating point.
Preferably, the method for separately quantizing the weight quantization parameter of each layer includes:
analyzing each network layer and quantizing according to the floating-point values of the layer statistics to obtain the weight quantization parameter of each layer and a new quantized weight file.
Preferably, the method for separately quantizing the output characteristic data quantization parameter of each layer includes:
for the output characteristic data of each layer, counting its floating-point values and quantizing to obtain the output characteristic data quantization parameter of each layer.
Preferably, the specific method for optimizing the output feature data quantization parameters according to the result of comparing the inference floating-point values with the separately quantized fixed-point values of each layer comprises the following steps:
deducing a first floating point characteristic value of each layer according to the input and output data of each layer;
and deducing a fixed point characteristic value according to the separately quantized parameters, converting it into a second floating point characteristic value, comparing the first and second floating point characteristic values, and judging whether the inference precision threshold is met; if so, the quantization parameters are determined; otherwise, the output characteristic data quantization parameters are adjusted while the weight quantization parameters remain unchanged.
Preferably, updating the output feature data quantization parameter of the input layer according to the output feature data quantization parameter of the current layer specifically includes: identifying the associated layer, and updating the output characteristic quantization parameter of the input layer of the current associated layer with the output characteristic data quantization parameter of the current associated layer, so that the output characteristic quantization parameters of the associated layer and its input layer remain consistent.
Preferably, the method for optimizing the output characteristic data quantization parameters includes: setting the bit width used to store the results of the multiply-and-accumulate operations inside the convolution to the maximum limit such that the accumulated results do not exceed that bit width, and then, taking the inference precision threshold as a constraint condition, gradually reducing the bit width to improve performance according to the result of comparing each layer's inference floating-point values with the separately quantized fixed-point values.
Furthermore, in order to solve the problems in the prior art that balancing performance and precision in neural network reasoning requires an auxiliary circuit or additional chips and that this adds extra cost, the invention provides a neural network reasoning technical scheme that implements the deep neural network reasoning process on an existing Soc chip and realizes the forward reasoning process with the coprocessor carried by the Soc chip, thereby shortening the development period and saving cost.
Based on the above neural network quantization method, the main processor of the Soc chip is adopted to call an optimization model file produced from the neural network optimized by the neural network quantization method, and the coprocessor carried by the Soc chip executes the fixed-point reasoning process of the network layers of the optimization model file.
Preferably, executing the fixed-point reasoning process of the network layers of the optimization model file by the coprocessor carried by the Soc chip specifically includes:
the nonlinear network layers are executed using a high-level language, and the other network layers execute the forward reasoning process using neon instructions.
The invention also provides a neural network quantization device, which comprises the following modules:
the network conversion module is used for acquiring an original model file and converting the original model file into internally recognizable data;
the network separation quantization module is used for quantizing the output characteristic data quantization parameter according to the floating-point reasoning result and quantizing the weight quantization parameter according to the weights;
the network comparison module is used for comparing the difference between the floating-point values and the quantized fixed-point values converted back to floating point, optimizing the characteristic data quantization parameters, and controlling the inference precision within a preset inference precision threshold;
the network feedback module is used for updating the output characteristic data quantization parameter of the input layer according to the output characteristic data quantization parameter of the current layer and iterating the update until no associated layer remains;
the network reasoning module is used for realizing a fixed-point reasoning process of a network layer;
and the network packaging module is used for packaging the optimized neural network data into an optimized model file which can be called and stored.
Furthermore, the invention also provides a neural network reasoning system, which comprises a server and an Soc chip, wherein the server comprises a storage unit and a processing unit, the Soc chip comprises a main processor and a coprocessor,
the processing unit is used for optimizing the original model file by the above neural network quantization method to form an optimization model file;
the storage unit is used for storing the optimization model file;
The main processor is used for calling the optimization model file;
the coprocessor is used for executing the fixed-point reasoning process of each network layer.
The invention has the beneficial effects that:
1) The invention optimizes the characteristic data quantization parameters under the control of the inference precision threshold, realizes continuous optimization of performance, and balances the requirements of precision and performance.
2) The invention realizes the fixed-point reasoning process of each network layer using the coprocessor inside the Soc chip; by exploiting the characteristics of this coprocessor, floating-point and various high- and low-precision quantized reasoning can be realized without being limited by a hardware quantization function, the programming is flexible, and the cost is lower than in the prior art.
Detailed Description
The present invention will be described in further detail with reference to examples, which are illustrative of the present invention and are not to be construed as limiting it. The numerical labels such as S100 do not represent a temporal order of operations; they are labels used for clarity of explanation.
A neural network quantization method, as shown in fig. 1, comprising the following steps:
S100, setting an inference precision threshold of the lightweight deep neural network;
S200, using quantization-set pictures to run the reasoning network, separately quantizing the output characteristic data quantization parameter of each layer and the weight quantization parameter of each layer;
S300, for the associated layers preceding and following one another in the neural network, updating the output characteristic data quantization parameter of the input layer according to the output characteristic data quantization parameter of the current layer;
and S400, taking the inference precision threshold as a constraint condition, optimizing the output characteristic data quantization parameters according to the result of comparing, for each layer, the inference floating-point values with the separately quantized fixed-point values converted back to floating point.
In step S100, the inference precision threshold of the lightweight deep neural network is set. Before the threshold is set, implementation of the forward reasoning process on the Soc chip needs to be guaranteed: before the lightweight deep neural network reasoning, the bit width used to store the results of the multiply-and-accumulate operations inside the convolution is enlarged to the maximum limit such that the accumulated results do not exceed that bit width, the precision and performance are measured, and then the bit width is gradually reduced to improve performance while ensuring the precision requirement, so that a reasonable inference precision threshold is set.
In this embodiment, a quantization mode of 16-bit weights and 8-bit output characteristic data is adopted; that is, the product of 8-bit output characteristic data and a 16-bit weight requires at least 24 bits of storage, and the result is first stored with an enlarged 30-bit width so that the accumulated multiply-and-add results do not exceed the maximum limit of the bit width.
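For illustration only, the following short calculation (a sketch, not part of the invention; `product_bits` and `accumulator_bits` are hypothetical helper names) shows why a 24-bit product fits comfortably in a 30-bit accumulator:

```python
import math

def product_bits(feat_bits: int, weight_bits: int) -> int:
    # A signed a-bit by b-bit multiplication needs at most a + b bits to hold the product.
    return feat_bits + weight_bits

def accumulator_bits(feat_bits: int, weight_bits: int, num_terms: int) -> int:
    # Accumulating N products can add up to ceil(log2(N)) bits on top of the product width.
    return product_bits(feat_bits, weight_bits) + math.ceil(math.log2(num_terms))

print(product_bits(8, 16))          # 24: minimum storage for one 8-bit x 16-bit product
print(accumulator_bits(8, 16, 64))  # 30: up to 64 products can be summed safely in a 30-bit accumulator
```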
In the above preferred embodiment of the Soc-chip-based neural network separation quantization method, separately quantizing the output characteristic data and weight quantization parameters of each layer specifically includes: the weights and the output characteristic data can be separately and independently quantized layer by layer; to balance network precision and hardware memory cost, this embodiment adopts the 16-bit weight and 8-bit output characteristic data quantization mode disclosed above.
In step S200, the method for separately quantizing the weight quantization parameter of each layer includes: analyzing each network layer and quantizing according to the floating-point values of the layer statistics to obtain the weight quantization parameter of each layer and a new quantized weight file. Referring to fig. 2, the original model weight file is obtained, each network layer (CONV1, CONV2, DECONV1, DECONV2, FC, and the like) is analyzed, quantization is performed according to the maximum and minimum values of the layer statistics, and the weight quantization parameter and new quantized weight file of each layer are obtained.
The quantization parameter is obtained by the following formula:
f = (bw - 1) - ceil(log2(max(abs(Xmax), abs(Xmin))))
where f represents the quantization parameter; Xmax is the maximum value of the input floating-point numbers and Xmin is the minimum value; abs takes the absolute value and is a built-in function of the system library; bw is the converted bit width, with 8-bit and 16-bit widths currently supported (bw is 8 for 8-bit quantization and 16 for 16-bit quantization); max takes the maximum of the two values and is a built-in function of the system library; log2 is a built-in function of the system library; and ceil rounds up and is a built-in function of the system library.
The floating-point to fixed-point conversion is expressed as follows:
X = round(x * 2^f)
where round is a rounding built-in function of the system library, x is the floating-point number and f is the quantization parameter; when X exceeds the upper limit of the bit width, the upper limit value is taken, and when X is below the lower limit of the bit width, the lower limit value is taken.
The fixed-point to floating-point conversion is expressed as follows:
x = X * 2^(-f)
where f is the quantization parameter and X is the fixed-point number.
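As a minimal sketch of the three formulas above (function names are illustrative, not part of the invention), the conversions can be written as:

```python
import math

def quant_param(x_min: float, x_max: float, bw: int) -> int:
    """Quantization parameter f for the given floating-point range and bit width bw (8 or 16)."""
    max_abs = max(abs(x_max), abs(x_min))
    return (bw - 1) - math.ceil(math.log2(max_abs))

def float_to_fixed(x: float, f: int, bw: int) -> int:
    """Convert a floating-point value to a bw-bit fixed-point value with quantization parameter f."""
    lo, hi = -(1 << (bw - 1)), (1 << (bw - 1)) - 1
    X = round(x * (2 ** f))
    return min(max(X, lo), hi)      # clamp to the bit-width limits

def fixed_to_float(X: int, f: int) -> float:
    """Convert a fixed-point value back to floating point."""
    return X * (2 ** -f)

# Example: a layer whose statistics give a range of [-3.2, 5.7] quantized to 8 bits gives f = 4.
f = quant_param(-3.2, 5.7, bw=8)
print(f, float_to_fixed(1.25, f, bw=8), fixed_to_float(float_to_fixed(1.25, f, bw=8), f))
```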
In step S200, the method for separately quantizing quantization parameters of each layer of output characteristic data includes: and for the output characteristic data of each layer, counting the floating point numerical value of the output characteristic data of each layer, and quantizing to obtain the quantization parameter of the output characteristic data of each layer.
Referring to fig. 3 in particular, which shows a forward implementation process of the network, after quantized data is input, each layer outputs one feature data, the quantization is respectively CONV1, CONV2, DECONV1, DECONV2, FC, etc. according to the layer, quantization is performed according to the maximum and minimum values of the layer statistical data, and the quantization parameter of the feature data output by each layer is counted and advanced.
Respectively obtaining the quantization parameters of the output characteristic data of each layer and the weight quantization parameters of each layer through the step 200; the quantized parameters of the separately quantized output feature data are used for feedback of the quantized correlation layer and for parameter comparison in balancing accuracy and performance. The following is a detailed description:
step S300, for a preceding and following related layer in the neural network, updating an output feature data quantization parameter of an input layer according to an output feature data quantization parameter of a current layer, specifically including:
The related layer is a layer in which input/output characteristic data is not changed and quantization parameters of input/output are kept uniform. And identifying the associated layer, and updating the output characteristic quantization parameter of the input layer of the current associated layer by using the output characteristic data quantization parameter of the current associated layer so as to keep the output characteristic quantization parameter of the associated layer and the input layer consistent.
These layers of association include:
the concat layer is used for realizing splicing of input characteristic data;
the crop layer realizes the function of keeping the input characteristic size consistent;
and the shuffle layer realizes the transposition function of the input characteristic channel.
As shown in fig. 4, the quantization parameters of the quantized output feature data of the layers are 6, 5, 4, (3, 4), 5, 3. When concat is identified as an associated layer, a backward modification of the quantization parameter values is performed: the output characteristic data quantization parameter of the current concat layer is used to update the quantization parameters of its input layers max pool and conv3, which are kept consistent and modified to 5, as shown in fig. 5.
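A minimal sketch of this feedback step is given below; the layer names, types, and parameter values are hypothetical and chosen only to mirror the example of fig. 4 and fig. 5:

```python
# Hypothetical network description: each layer records its input layers and the
# quantization parameter of its output feature data.
layers = {
    "max_pool": {"inputs": [],                    "type": "pool",   "feat_qparam": 6},
    "conv3":    {"inputs": ["max_pool"],          "type": "conv",   "feat_qparam": 4},
    "concat":   {"inputs": ["max_pool", "conv3"], "type": "concat", "feat_qparam": 5},
}

ASSOCIATED_TYPES = {"concat", "crop", "shuffle"}  # layers that leave the feature values unchanged

def feed_back_qparams(layers: dict) -> None:
    """Copy each associated layer's output-feature quantization parameter back to its
    input layers, iterating until no further change occurs."""
    changed = True
    while changed:
        changed = False
        for layer in layers.values():
            if layer["type"] not in ASSOCIATED_TYPES:
                continue
            for src in layer["inputs"]:
                if layers[src]["feat_qparam"] != layer["feat_qparam"]:
                    layers[src]["feat_qparam"] = layer["feat_qparam"]
                    changed = True

feed_back_qparams(layers)
print({name: layer["feat_qparam"] for name, layer in layers.items()})
# {'max_pool': 5, 'conv3': 5, 'concat': 5}
```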
In step S400, taking the inference precision threshold as a constraint condition, optimizing the output characteristic data quantization parameters according to the result of comparing, for each layer, the inference floating-point values with the separately quantized fixed-point values converted back to floating point specifically includes:
Deducing a first floating point characteristic value of each layer according to input and output data of each layer;
and deducing a fixed-point characteristic value according to the separately quantized parameters, converting it into a second floating-point characteristic value, comparing the first and second floating-point characteristic values, and judging whether the inference precision threshold is met; if so, the quantization parameters are determined; otherwise, the output characteristic data quantization parameters are adjusted while the weight quantization parameters remain unchanged.
As described above, in order to balance precision and performance, in the preset configuration process the storage bit width is first enlarged to 30 bits so as to ensure that the accumulated multiply-and-add results do not exceed the maximum limit of the bit width.
For each layer, the inference floating-point values are compared with the separately quantized fixed-point values, and it is judged whether the difference is controlled within the inference precision threshold. Weighing performance against precision and testing on the chip, the weight quantization parameters are kept fixed, and the precision is kept within the inference precision threshold while the bit width used to store the convolution multiply-and-accumulate results is reduced.
If analysis shows that the set inference precision threshold is not met, the performance and precision are brought within a reasonable range by reducing the output characteristic data quantization parameter.
Specifically, the calculation method is as follows:
deducing a first floating point characteristic value D1 of each layer according to the test picture data, namely a floating point numerical value of the original network model;
deducing a fixed-point characteristic value according to the separately quantized parameters, converting it into a second floating-point characteristic value D0, comparing the first and second floating-point characteristic values (calculating a difference value D2), and judging whether the inference precision threshold D3 is met; if so, the quantization parameters are determined; otherwise, the output characteristic data quantization parameters are adjusted while the weight quantization parameters remain unchanged.
The difference value is calculated as:
D2 = |D1 - D0|
To meet the inference precision requirement, D3 >= D2 is taken as the precondition for adjustment: on the premise that the inference precision requirement is satisfied, the performance is made as high as possible. Precision and performance are balanced on the principle that the smaller the storage bit width of the intermediate multiply-and-accumulate results in the forward pass, the faster the performance; since a 16-bit by 8-bit multiplication requires at least 24 bits of accumulation to meet the calculation requirement, the bit width is gradually reduced starting from 30 bits.
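One possible reading of this adjustment loop is sketched below; `run_float` and `run_fixed` are hypothetical callables standing in for floating-point and fixed-point inference, and using the maximum absolute difference as D2 is an assumption of the sketch:

```python
def meets_precision(d1, d0, d3: float) -> bool:
    """True when the dequantized fixed-point features d0 stay within the
    inference precision threshold d3 of the floating-point features d1."""
    d2 = max(abs(a - b) for a, b in zip(d1, d0))
    return d3 >= d2

def tune_feature_qparam(run_float, run_fixed, feat_qparam: int, d3: float) -> int:
    """Lower the output-feature quantization parameter until the threshold is met;
    the weight quantization parameters are never touched here."""
    d1 = run_float()
    while feat_qparam > 0:
        d0 = run_fixed(feat_qparam)          # fixed-point inference converted back to float
        if meets_precision(d1, d0, d3):
            break
        feat_qparam -= 1                     # adjust only the feature-data quantization parameter
    return feat_qparam

# Toy usage with stand-in inference functions:
print(tune_feature_qparam(lambda: [1.0, -0.5],
                          lambda q: [round(1.0 * 2**q) / 2**q, round(-0.5 * 2**q) / 2**q],
                          feat_qparam=4, d3=0.01))
```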
Example 2:
This embodiment discloses a neural network reasoning method, in which the main processor of an Soc chip calls an optimization model file produced from a neural network optimized according to the neural network quantization method disclosed in embodiment 1, and the coprocessor carried by the Soc chip executes the fixed-point reasoning process of the network layers of the optimization model file.
Executing the fixed-point reasoning process of the network layers of the optimization model file through the coprocessor carried by the Soc chip specifically comprises: the nonlinear network layers are executed using a high-level language, and the other network layers execute the forward reasoning process using neon instructions.
Specifically, the coprocessor in the Soc chip can be an ARM (Advanced RISC Machine) Cortex-A or Cortex-R series processor equipped with neon coprocessor registers, which adopt single-instruction multiple-data execution, operate on multiple vector data types simultaneously, and support fast 8-bit and 16-bit operations. The network-layer reasoning process is realized with the coprocessor in the Soc chip; as a preferred option, nonlinear network layers such as sigmoid and tanh are implemented in a high-level language, and the other network layers realize the forward reasoning process using the neon (instruction) registers.
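For illustration only, the split between floating-point nonlinear layers and fixed-point layers might look like the following sketch; on the actual chip the fixed-point path would be vectorized with neon instructions rather than written as a Python loop:

```python
import math

NONLINEAR = {"sigmoid", "tanh"}  # kept in floating point and written in a high-level language

def run_nonlinear(layer_type: str, data, qparam: int):
    """Dequantize, apply the nonlinearity in floating point, then requantize."""
    out = []
    for x in data:
        v = x * 2 ** -qparam
        v = math.tanh(v) if layer_type == "tanh" else 1.0 / (1.0 + math.exp(-v))
        out.append(round(v * 2 ** qparam))
    return out

def run_relu_fixed(data):
    """Example of a layer that stays entirely in fixed point; this is the kind of
    elementwise loop the neon SIMD registers execute over 8/16-bit lanes."""
    return [x if x > 0 else 0 for x in data]

print(run_nonlinear("sigmoid", [16, -16, 0], qparam=4))  # [12, 4, 8]
print(run_relu_fixed([16, -16, 0]))                      # [16, 0, 0]
```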
Example 3:
This embodiment takes face detection as an example and further discloses a neural network reasoning method, realizing an Soc-chip-based deep learning reasoning method for a face detection network on the basis of the above quantization method while balancing reasoning precision and performance. The fixed-point reasoning process of each network layer is realized with the coprocessor inside the Soc chip; by exploiting the characteristics of this coprocessor, the forward reasoning process is realized with neon instructions or a high-level language, and floating-point and various high- and low-precision quantized reasoning can be realized without being limited by a hardware quantization function, with flexible programming. Preferably, nonlinear network layers such as sigmoid and tanh are implemented in a high-level language, and the other network layers realize the forward reasoning process using the neon (instruction) registers.
The method specifically comprises the following steps:
Pre-configuration: the external flash memory is at least 8 MB, the detection distance is controlled within 2 meters, the processor adopts ARMv7-A, and the processor is externally connected to a video collector. Because the Soc chip carries a coprocessor that can execute the neon instruction set and the preset detection distance is within 2 meters, the input resolution of the algorithm can use a low resolution of 256 × 144, which greatly improves performance while meeting the requirement. Of the 8 MB flash, the system and the executable program occupy 7 MB and 1 MB is given to the algorithm; after separation quantization, the weights can be kept within 1 MB with 16-bit quantization or within 512 KB with 8-bit quantization, and in consideration of network precision, 16-bit quantization meets the requirement.
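A quick back-of-the-envelope check of this weight budget (the parameter count below is a hypothetical figure chosen only to be consistent with the sizes quoted above):

```python
params = 512 * 1024                  # assumed number of weights (hypothetical)
print(params * 2 / 1024 ** 2, "MB")  # 16-bit quantization: 1.0 MB
print(params * 1 / 1024, "KB")       # 8-bit quantization: 512.0 KB
```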
Based on the methods disclosed in embodiment 1 and embodiment 2, the method mainly comprises the following steps:
1) A coprocessor is used to realize the weighted forward processes of the network, such as 16-bit x 8-bit convolution, deconvolution, and full connection, with the intermediate multiply-and-accumulate results stored at a maximum width of 30 bits (a sketch of this fixed-point forward pass follows the list below);
2) realizing the calculation process of the 8-bit x 8-bit network layers without weights;
3) setting an inference precision threshold value of a face detection neural network;
4) separately quantizing the fixed weights to generate a fixed-point weight file;
5) quantizing the output feature data quantization parameters using the quantization-set pictures;
6) carrying out the feedback operation for the associated layers: finding the associated layers and replacing the output characteristic data quantization parameters;
7) combining the execution order, quantization parameters, convolution kernel sizes, edge padding, and weight file of the face detection neural network layers into a model file, and storing the model file in flash;
8) testing the face detection neural network with test pictures, optimizing the quantization parameters of the algorithm, testing precision and performance layer by layer, comparing against the precision threshold while reducing the bit width, finding suitable output characteristic data quantization parameters while keeping the weight quantization parameters unchanged, and updating the model file;
9) integrating the system code, adding the video acquisition function and network preprocessing, having the Soc chip call the model file to start the feature-extraction processing, detecting the face bounding box, and displaying the box for the user to check.
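The fixed-point forward pass referenced in step 1) can be illustrated with the following simplified one-dimensional convolution (a sketch only; the quantization parameters and data are made up, and a real implementation would run on the neon registers of the coprocessor):

```python
def conv1d_fixed(feat, weights, feat_q, w_q, out_q, out_bw=8, acc_bw=30):
    """1-D valid convolution with 8-bit features and 16-bit weights: products are
    summed in a wide (here 30-bit) accumulator and only the final result is
    rescaled and clamped back to the output bit width."""
    acc_lo, acc_hi = -(1 << (acc_bw - 1)), (1 << (acc_bw - 1)) - 1
    out_lo, out_hi = -(1 << (out_bw - 1)), (1 << (out_bw - 1)) - 1
    k = len(weights)
    out = []
    for i in range(len(feat) - k + 1):
        acc = 0
        for j in range(k):
            acc += feat[i + j] * weights[j]       # each product needs up to 24 bits
            acc = min(max(acc, acc_lo), acc_hi)   # accumulator stays within its 30-bit range
        # The accumulator carries scale 2^(feat_q + w_q); rescale to the output parameter.
        y = acc >> (feat_q + w_q - out_q)
        out.append(min(max(y, out_lo), out_hi))
    return out

# Features [1.0, 2.0, -1.0, 0.5] at f=4 and weights [0.5, -0.25] at f=10, output at f=4:
print(conv1d_fixed([16, 32, -16, 8], [512, -256], feat_q=4, w_q=10, out_q=4))  # [0, 20, -10]
```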
Example 4:
This embodiment discloses an Soc-chip-based neural network separation quantization device, which comprises the following modules:
The network conversion module is used for converting a neural network trained on other platforms into internally identifiable data, including the convolution kernel size, convolution kernel edge padding, stride, network layer execution order, weight data, and the like. Since the processor cannot recognize the trained model directly, conversion is needed so that the processor can read this information and run the forward process in the order of the original network model structure.
The network separation quantization module quantizes the output characteristic data quantization parameter according to the floating-point reasoning result and directly quantizes the weight quantization parameter according to the weight data. Specifically, the same quantization formula is adopted for the output characteristic data and the weight data.
The network feedback module performs repeated feedback iterations according to the associated-layer processing strategy disclosed in embodiment 1 until no associated layer remains.
The network comparison module is used for comparing the difference between the floating-point values and the fixed-point values converted back to floating point and controlling this difference within the precision threshold. In the inference process, network precision is balanced against performance: taking the inference precision threshold as a constraint condition, a difference value is calculated between the floating-point inference and fixed-point inference results, the difference value is compared with the inference precision threshold, and the quantization parameters are optimized, that is, performance is continuously optimized on the premise that precision is ensured.
The network packing module is used for packing the weight quantization parameters and the internally identifiable data into a model file, that is, the converted information is assembled into a model file that the processor can recognize.
The network reasoning module uses the coprocessor and the neon instruction set to accelerate the network-layer reasoning process, realizing the network reasoning process with gradual optimization.
Example 5:
According to embodiment 1 and embodiment 2, this embodiment discloses a neural network reasoning system, which includes a server and an Soc chip, wherein the server includes a storage unit and a processing unit, and the Soc chip includes a main processor and a coprocessor,
the processing unit is used for optimizing the original model file by the above neural network quantization method to form an optimization model file;
the storage unit is used for storing the optimization model file;
the main processor is used for calling the optimization model file;
the coprocessor is used for executing the fixed-point reasoning process of each network layer.
In the several embodiments provided in the present application, it should be understood that the disclosed apparatus and method may be implemented in other ways. For example, the above-described device embodiments are merely illustrative, and for example, the division of the modules or units is only one logical functional division, and there may be other divisions when actually implemented, for example, a plurality of units or components may be combined or may be integrated into another device, or some features may be omitted, or not executed.
The units may or may not be physically separate, and components displayed as units may be one physical unit or a plurality of physical units, that is, may be located in one place, or may be distributed in a plurality of different places. Some or all of the units can be selected according to actual needs to achieve the purpose of the solution of the embodiment.
In addition, functional units in the embodiments of the present invention may be integrated into one processing unit, or each unit may exist alone physically, or two or more units are integrated into one unit. The integrated unit can be realized in a form of hardware, and can also be realized in a form of a software functional unit.
The integrated unit, if implemented in the form of a software functional unit and sold or used as a stand-alone product, may be stored in a readable storage medium. Based on such understanding, the technical solution of the embodiments of the present invention, in essence or in the part contributing to the prior art, may be embodied in whole or in part in the form of a software product, which is stored in a storage medium and includes several instructions to enable a device (which may be a single-chip microcomputer, a chip, or the like) or a processor to execute all or part of the steps of the method according to the embodiments of the present invention. The aforementioned storage medium includes various media capable of storing program code, such as a USB flash drive, a removable hard disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk, or an optical disk.
The above description is only an embodiment of the present invention, but the scope of the present invention is not limited thereto, and any changes or substitutions within the technical scope of the present invention are intended to be covered by the scope of the present invention. Therefore, the protection scope of the present invention shall be subject to the protection scope of the appended claims.