CN112990457B - Offline quantization optimization method, device, equipment, medium and program product - Google Patents


Info

Publication number
CN112990457B
CN112990457B
Authority
CN
China
Prior art keywords
tuning
network
tuned
convolution layer
weight parameter
Prior art date
Legal status
Active
Application number
CN202110324266.1A
Other languages
Chinese (zh)
Other versions
CN112990457A (en)
Inventor
陈泓昊
黄明飞
王海涛
Current Assignee
Open Intelligent Machine Shanghai Co ltd
Original Assignee
Open Intelligent Machine Shanghai Co ltd
Priority date
Filing date
Publication date
Application filed by Open Intelligent Machine Shanghai Co ltd filed Critical Open Intelligent Machine Shanghai Co ltd
Priority to CN202110324266.1A priority Critical patent/CN112990457B/en
Publication of CN112990457A publication Critical patent/CN112990457A/en
Application granted granted Critical
Publication of CN112990457B publication Critical patent/CN112990457B/en

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods
    • G06N3/082 Learning methods modifying the architecture, e.g. adding, deleting or silencing nodes or connections
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Health & Medical Sciences (AREA)
  • Computing Systems (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Molecular Biology (AREA)
  • Artificial Intelligence (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)
  • Image Analysis (AREA)

Abstract

The embodiment of the present application provides an offline quantization tuning method, device, equipment, medium and program product. The method comprises the following steps: obtaining a network to be tuned that comprises a plurality of convolution layers from a preset network model; adjusting the weight parameters of each convolution layer in the network to be tuned; determining tuning output results of the network to be tuned in different weight parameter distribution states; and determining tuning weight parameters of each convolution layer in the network to be tuned according to the similarity between the simulated quantization output result and the tuning output results corresponding to different weight parameters, so as to realize efficient tuning of the model on the basis of a small amount of input data.

Description

Offline quantization optimization method, device, equipment, medium and program product
Technical Field
Embodiments of the present application relate to the technical field of artificial intelligence, and in particular to an offline quantization tuning method, device, equipment, medium and program product for models.
Background
With the rapid development of artificial intelligence technology, neural network models have wide application in the fields of system identification, pattern recognition, intelligent control and the like.
At present, quantization of neural network models mainly takes two forms: offline quantization and quantization retraining. Existing offline quantization approaches mostly adopt an iterative weight updating scheme to obtain higher quantization precision, and offline quantization has the advantages of requiring only a small amount of input data and being convenient to use.
However, when performing offline quantization, there is a need for an efficient way to tune the model while evaluating it with only such a small amount of input data.
Disclosure of Invention
The offline quantization tuning method, device, equipment, medium and program product provided by the embodiments of the present application can realize efficient tuning of a model on the basis of a small amount of input data.
In a first aspect, an embodiment of the present application provides an offline quantization tuning method, including:
Obtaining a network to be tuned in a preset network model, wherein the network to be tuned comprises a plurality of convolution layers;
Adjusting weight parameters of all convolution layers in the network to be tuned, wherein the output precision of the network to be tuned is kept unchanged;
Determining tuning output results of the network to be tuned in different weight parameter distribution states;
And determining tuning weight parameters of each convolution layer in the network to be tuned according to the similarity between the simulated quantization output result and the tuning output results corresponding to different weight parameters.
In one possible design, the adjusting the weight parameters of each convolution layer in the network to be tuned includes:
Allocating a basic weight parameter combination to the network to be tuned, wherein the network to be tuned comprises N convolution layers, the basic weight parameter combination comprises N basic weight parameters, the convolution layers of the network to be tuned correspond one-to-one to the basic weight parameters in the basic weight parameter combination, and N is a positive integer;
And allocating a tuning weight parameter combination to the network to be tuned, wherein the tuning weight parameter combination comprises N tuning weight parameters, the convolution layers of the network to be tuned correspond one-to-one to the tuning weight parameters in the tuning weight parameter combination, and the product of the tuning multiples of the tuning weight parameters relative to their corresponding basic weight parameters is a preset fixed value.
In one possible design, determining the tuning output result of the network to be tuned in different weight parameter allocation states includes:
And determining a tuning output result of the network to be tuned according to different tuning weight parameter combinations, wherein each tuning weight parameter combination is used as a basic weight parameter combination for next tuning weight parameter distribution.
In one possible design, the determining the tuning output result of the network to be tuned according to different tuning weight parameter combinations includes:
If the network to be tuned comprises a first common convolution layer and a second common convolution layer, wherein the second common convolution layer follows the first common convolution layer, determining the output of the second common convolution layer as the tuning output result according to different tuning weight parameter combinations; or
If the network to be tuned comprises a first common convolution layer, an intermediate specific convolution layer and a second common convolution layer, wherein the intermediate specific convolution layer follows the first common convolution layer and the second common convolution layer follows the intermediate specific convolution layer, determining the output of the second common convolution layer as the tuning output result according to different tuning weight parameter combinations.
In one possible design, at least one linear rectification function is provided between the first common convolution layer and the second common convolution layer.
In one possible design, the determining the tuning weight parameters of each convolution layer in the network to be tuned according to the similarity between the simulated quantization output result and the tuning output results corresponding to different weight parameters includes:
Respectively determining the cosine similarity between the simulated quantization output result and the tuning output result corresponding to each tuning weight parameter combination;
And determining the tuning weight parameters in the tuning weight parameter combination with the highest cosine similarity as the tuning weight parameters of the convolution layers in the network to be tuned.
In one possible design, the obtaining the network to be tuned in the preset network model includes:
Sequentially acquiring each network to be selected in the preset network model, and determining the number of continuous convolution layers in the network to be selected and the convolution type of each convolution layer;
And if the number of the continuous convolution layers meets the preset layer number condition and the convolution type meets the preset type condition, determining the network to be selected as the network to be tuned.
In a second aspect, an embodiment of the present application provides an offline quantization tuning device, including:
an acquisition module, used for acquiring a network to be tuned in a preset network model, wherein the network to be tuned comprises a plurality of convolution layers;
The processing module is used for adjusting the weight parameters of each convolution layer in the network to be tuned, wherein the output precision of the network to be tuned is kept unchanged;
the processing module is further used for determining tuning output results of the network to be tuned in different weight parameter distribution states;
and a tuning module, used for determining tuning weight parameters of each convolution layer in the network to be tuned according to the similarity between the simulated quantization output result and the tuning output results corresponding to different weight parameters.
In one possible design, the processing module is specifically configured to:
Allocating a basic weight parameter combination to the network to be tuned, wherein the network to be tuned comprises N convolution layers, the basic weight parameter combination comprises N basic weight parameters, the convolution layers of the network to be tuned correspond one-to-one to the basic weight parameters in the basic weight parameter combination, and N is a positive integer;
And allocating a tuning weight parameter combination to the network to be tuned, wherein the tuning weight parameter combination comprises N tuning weight parameters, the convolution layers of the network to be tuned correspond one-to-one to the tuning weight parameters in the tuning weight parameter combination, and the product of the tuning multiples of the tuning weight parameters relative to their corresponding basic weight parameters is a preset fixed value.
In one possible design, the tuning module is specifically configured to:
And determining a tuning output result of the network to be tuned according to different tuning weight parameter combinations, wherein each tuning weight parameter combination is used as a basic weight parameter combination for next tuning weight parameter distribution.
In one possible design, the processing module is specifically configured to:
If the network to be tuned comprises a first common convolution layer and a second common convolution layer, wherein the second common convolution layer follows the first common convolution layer, determining the output of the second common convolution layer as the tuning output result according to different tuning weight parameter combinations; or
If the network to be tuned comprises a first common convolution layer, an intermediate specific convolution layer and a second common convolution layer, wherein the intermediate specific convolution layer follows the first common convolution layer and the second common convolution layer follows the intermediate specific convolution layer, determining the output of the second common convolution layer as the tuning output result according to different tuning weight parameter combinations.
In one possible design, at least one linear rectification function is provided between the first common convolution layer and the second common convolution layer.
In one possible design, the tuning module is specifically configured to:
Respectively determining the cosine similarity between the simulated quantization output result and the tuning output result corresponding to each tuning weight parameter combination;
And determining the tuning weight parameters in the tuning weight parameter combination with the highest cosine similarity as the tuning weight parameters of the convolution layers in the network to be tuned.
In one possible design, the acquisition module is specifically configured to:
Sequentially acquiring each network to be selected in the preset network model, and determining the number of continuous convolution layers in the network to be selected and the convolution type of each convolution layer;
And if the number of the continuous convolution layers meets the preset layer number condition and the convolution type meets the preset type condition, determining the network to be selected as the network to be tuned.
In a third aspect, an embodiment of the present application further provides an electronic device, including: the device comprises a processor and a memory, wherein the processor is respectively connected with the memory;
The memory is used for storing a computer program of the processor;
Wherein the processor is configured to implement any one of the possible offline quantization tuning methods of the first aspect by executing the computer program.
In a fourth aspect, embodiments of the present application also provide a machine-readable storage medium having stored thereon executable instructions that when executed by a machine cause the implementation of any of the possible offline quantization tuning methods of the first aspect.
In a fifth aspect, embodiments of the present application also provide a computer program product comprising a computer program which, when executed by a processor, implements any one of the possible offline quantization tuning methods of the first aspect.
In the above technical solution, a network to be tuned that includes a plurality of convolution layers is obtained from the preset network model, the weight parameters of each convolution layer in the network to be tuned are adjusted, the tuning output results of the network to be tuned in different weight parameter distribution states are determined, and the tuning weight parameters of each convolution layer in the network to be tuned are determined according to the similarity between the simulated quantization output result and the tuning output results corresponding to different weight parameters, so that efficient tuning of the model can be realized on the basis of a small amount of input data.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present application, the following description will briefly explain the drawings used as needed in the embodiments or the description of the prior art. However, it should be understood by those skilled in the art that the drawings in the following description are only some examples of the present application and do not limit the scope thereof.
FIG. 1 is a diagram of an application network architecture of an off-line quantization tuning method according to an exemplary embodiment of the present application;
FIG. 2 is a flow chart of an off-line quantization tuning method according to an exemplary embodiment of the present application;
FIG. 3 is a flow chart of an off-line quantization tuning method according to an exemplary embodiment of the present application;
Fig. 4 is a schematic structural diagram of an off-line quantization tuning apparatus according to another exemplary embodiment of the present application;
fig. 5 is a schematic structural view of an electronic device according to an exemplary embodiment of the present application.
Detailed Description
The technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the accompanying drawings in the embodiments of the present application. It will be appreciated by those of ordinary skill in the art that the embodiments described are some, but not all, of the embodiments of the application. Based on the embodiments in the application, any suitable modification or variation may be made by a person skilled in the art, so as to obtain all other embodiments.
At present, quantization of neural network models mainly takes two forms: offline quantization and quantization retraining. Existing offline quantization approaches mostly adopt an iterative weight updating scheme to obtain higher quantization precision, and offline quantization has the advantages of requiring only a small amount of input data and being convenient to use. However, when offline quantization is performed, evaluating with such a small number of inputs carries a considerable risk of overfitting: since the number of pictures loaded for offline quantization is far smaller than the number of training pictures, existing offline quantization methods may not only lose FP32 precision but also introduce a large risk of overfitting.
Therefore, existing quantization strategies cannot adequately guarantee both the offline quantization tuning effect and the generalization effect, and the large number of weight tuning schemes that cannot be evaluated means that the final result may be severely overfitted.
In view of this, the embodiments of the present application provide an offline quantization optimization method, apparatus, device, medium, and program product, which aim to efficiently perform optimization on a quantization model based on a small amount of input data. The above technical scheme will be described in detail with reference to specific embodiments.
Fig. 1 is a diagram of an application network architecture of an offline quantization tuning method according to an exemplary embodiment of the present application. As shown in Fig. 1, the offline quantization tuning method provided in this embodiment is applied to a preset neural network model, which may include a plurality of convolution layers. An inter-layer detection module may be used to judge whether two layers containing weight parameters meet the joint tuning requirements; an inter-layer search module then scales the inter-layer weight parameters, so that the final output of the jointly tuned layers remains the same as the original FP32 output while the weight parameter distribution is changed, which also reduces the risk of overfitting to a certain extent. The inter-layer search module may then be iterated a plurality of times by an inter-layer evaluation module, which evaluates the quantization loss. Finally, an update module screens the results obtained by the search module according to the evaluation results to obtain the final quantized tuning distribution.
Fig. 2 is a flow chart illustrating an offline quantization tuning method according to an exemplary embodiment of the present application. As shown in fig. 2, the offline quantization tuning method provided in this embodiment includes:
S101, obtaining a network to be tuned in a preset network model.
In this step, a network to be tuned in a preset network model is obtained, wherein the network to be tuned includes a plurality of convolution layers. It should be noted that the network to be tuned may include two continuous convolution layers or three continuous convolution layers, and a linear rectification function (Rectified Linear Unit, ReLU) activation may further be included between the convolution layers. The ReLU, also known as the rectified linear unit, is a commonly used activation function in artificial neural networks and usually refers to the nonlinear functions represented by the ramp function and its variants.
S102, adjusting weight parameters of all convolution layers in the network to be tuned.
In this step, the weight parameters of each convolution layer in the network to be tuned are adjusted, while the output precision of the network to be tuned is kept unchanged. The method can be applied to offline quantization tuning at 8-bit and lower bit widths: the weights of the multiple layers of the neural network are jointly equalized, so that the weight distribution is modified, without changing the final FP32 output, to meet the requirements of the quantization distribution, thereby realizing an efficient offline quantization approach that is lossless in FP32 precision.
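As a concrete illustration of the simulated quantization used in the evaluation below, the following minimal Python sketch assumes symmetric per-tensor quantization; the bit width, rounding scheme and the helper name fake_quantize are illustrative assumptions rather than a prescribed implementation of this embodiment.

    import numpy as np

    def fake_quantize(tensor: np.ndarray, num_bits: int = 8) -> np.ndarray:
        # Quantize to num_bits and immediately dequantize, so the returned values
        # carry the rounding error that real low-bit inference would introduce.
        qmax = 2 ** (num_bits - 1) - 1            # e.g. 127 for signed 8-bit
        scale = float(np.max(np.abs(tensor))) / qmax
        if scale == 0.0:
            scale = 1.0                           # all-zero tensor: nothing to quantize
        q = np.clip(np.round(tensor / scale), -qmax, qmax)
        return q * scale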
S103, determining tuning output results of the network to be tuned in different weight parameter distribution states.
After the weight parameters of each convolution layer in the network to be tuned have been adjusted with different weight parameters, the tuning output results of the network to be tuned in the different weight parameter distribution states are determined.
S104, determining tuning weight parameters of each convolution layer in the network to be tuned according to the similarity between the simulated quantization output result and the tuning output results corresponding to different weight parameters.
Finally, the cosine similarity between the simulated quantization output result and the tuning output result corresponding to each tuning weight parameter combination is determined respectively, and the tuning weight parameter combination with the highest cosine similarity is then taken as the tuning weight parameters of the convolution layers in the network to be tuned.
In this embodiment, a network to be tuned that includes a plurality of convolution layers is obtained from a preset network model, the weight parameters of each convolution layer in the network to be tuned are adjusted, the tuning output results of the network to be tuned in different weight parameter distribution states are then determined, and the tuning weight parameters of each convolution layer in the network to be tuned are determined according to the similarity between the simulated quantization output result and the tuning output results corresponding to different weight parameters, so that efficient tuning of the model can be achieved on the basis of a small amount of input data.
In addition, the output precision of the network to be tuned is always kept unchanged in the process of adjusting the weight parameters of each convolution layer in the network to be tuned, so that the distribution of weights can be modified on the basis of not changing final output so as to meet the requirement of quantization distribution, and a precision lossless and efficient offline quantization mode is realized.
Fig. 3 is a flow chart illustrating an offline quantization tuning method according to an exemplary embodiment of the present application. As shown in fig. 3, the offline quantization tuning method provided in this embodiment includes:
S201, sequentially acquiring each network to be selected in a preset network model, and determining the number of continuous convolution layers in the network to be selected and the convolution type of each convolution layer.
In this step, each network to be selected in the preset network model is sequentially obtained, and the number of continuous convolution layers in the network to be selected and the convolution type of each convolution layer are determined. The network to be tuned may include two continuous convolution layers or three continuous convolution layers, and the convolution type of a convolution layer may be a common convolution layer or a specific convolution layer (e.g., a depthwise convolution layer).
S202, if the number of continuous convolution layers meets the preset layer number condition and the convolution type meets the preset type condition, determining the network to be selected as the network to be tuned.
In this step, for example, the network to be tuned may include two consecutive common convolutions (a ReLU may be located between them); specifically, it may include a first common convolution layer and a second common convolution layer, where the second common convolution layer follows the first common convolution layer. Alternatively, the network to be tuned may include three convolution layers, specifically a first common convolution layer, an intermediate specific convolution layer and a second common convolution layer, where the intermediate specific convolution layer follows the first common convolution layer and the second common convolution layer follows the intermediate specific convolution layer. A network to be selected whose number of continuous convolution layers meets the preset layer number condition and whose convolution types meet the preset type condition is taken as a network to be tuned for joint tuning.
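The layer-count and type check of S201-S202 can be sketched as follows; the type tags "common" and "depthwise" and the helper name is_network_to_be_tuned are illustrative assumptions, with the accepted patterns taken from the two examples above.

    from typing import List, Tuple

    ACCEPTED_PATTERNS: Tuple[Tuple[str, ...], ...] = (
        ("common", "common"),               # first common conv -> second common conv
        ("common", "depthwise", "common"),  # common conv -> intermediate specific conv -> common conv
    )

    def is_network_to_be_tuned(layer_types: List[str]) -> bool:
        # The candidate qualifies if its consecutive convolution layers match a preset
        # layer-number condition and type condition.
        return tuple(layer_types) in ACCEPTED_PATTERNS

    print(is_network_to_be_tuned(["common", "depthwise", "common"]))  # True
    print(is_network_to_be_tuned(["common"]))                         # False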
S203, basic weight parameter combinations are distributed for the network to be tuned.
S204, distributing tuning weight parameter combinations for the network to be tuned.
In S203-S204, a basic weight parameter combination may be allocated to the network to be tuned, wherein the network to be tuned includes N convolution layers, the basic weight parameter combination includes N basic weight parameters, the convolution layers of the network to be tuned correspond one-to-one to the basic weight parameters in the basic weight parameter combination, and N is a positive integer. A tuning weight parameter combination is then allocated to the network to be tuned, wherein the tuning weight parameter combination includes N tuning weight parameters, the convolution layers of the network to be tuned correspond one-to-one to the tuning weight parameters in the tuning weight parameter combination, and the product of the tuning multiples of the tuning weight parameters relative to their corresponding basic weight parameters is a preset fixed value.
In one possibility, the network to be tuned includes two convolution layers. Through relative scaling of the continuous convolutions, for example enlarging the weights of the first convolution layer 10 times and reducing those of the second convolution layer 10 times, the product of the tuning multiples of the tuning weight parameters relative to their corresponding basic weight parameters is 1, so that the FP32 output of the continuous convolutions remains consistent with the original FP32 output.
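A minimal numeric check of this relative scaling, assuming positive scale factors, no bias terms, and the two convolutions collapsed to matrix multiplications with a ReLU in between; per-channel scaling, which a practical implementation would likely use, is omitted for brevity.

    import numpy as np

    rng = np.random.default_rng(0)
    x = rng.standard_normal((4, 8))       # calibration input activations
    w1 = rng.standard_normal((8, 16))     # weights of the first convolution layer
    w2 = rng.standard_normal((16, 3))     # weights of the second convolution layer

    def forward(a: np.ndarray, b: np.ndarray) -> np.ndarray:
        return np.maximum(x @ a, 0.0) @ b  # conv1 -> ReLU -> conv2, collapsed to matmuls

    s = 10.0                               # tuning multiple of the first layer
    y_base = forward(w1, w2)               # original FP32 output
    y_scaled = forward(w1 * s, w2 / s)     # product of the tuning multiples is s * (1/s) = 1
    print(np.allclose(y_base, y_scaled))   # True: the FP32 output is unchanged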
S205, determining a tuning output result of the network to be tuned according to different tuning weight parameter combinations.
Specifically, the tuning output result of the network to be tuned is determined according to different tuning weight parameter combinations, wherein each tuning weight parameter combination is used as the basic weight parameter combination for the next round of tuning weight parameter allocation. That is, the corresponding quantized output is obtained under each scaling multiple, and all observation results are recorded.
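Continuing the sketches above (fake_quantize, forward, w1, w2), the search of S205 can be pictured as follows; the candidate scale list is purely illustrative.

    candidate_scales = [0.25, 0.5, 1.0, 2.0, 4.0, 10.0]
    outputs_per_scale = {}
    for s in candidate_scales:
        qw1 = fake_quantize(w1 * s)               # first layer scaled up, then simulated-quantized
        qw2 = fake_quantize(w2 / s)               # second layer scaled down by the same multiple
        outputs_per_scale[s] = forward(qw1, qw2)  # record the observed output for this distribution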
Optionally, if the network to be tuned includes a first common convolution layer and a second common convolution layer, where the second common convolution layer follows the first common convolution layer, the output of the second common convolution layer is determined as the tuning output result according to different tuning weight parameter combinations.
Or, if the network to be tuned includes a first common convolution layer, an intermediate specific convolution layer (for example, a depthwise convolution) and a second common convolution layer, where the intermediate specific convolution layer follows the first common convolution layer and the second common convolution layer follows the intermediate specific convolution layer, the output of the second common convolution layer is determined as the tuning output result according to different tuning weight parameter combinations.
S206, respectively determining the cosine similarity between the simulated quantization output result and the tuning output result corresponding to each tuning weight parameter combination.
S207, determining a tuning weight parameter combination with highest cosine similarity as a tuning weight parameter of each convolution layer in the network to be tuned.
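The selection in S206-S207 can then be sketched as below, assuming the evaluation compares each combination's simulated-quantized output against the FP32 reference y_base from the earlier sketch; the helper cosine_similarity is an illustrative assumption.

    import numpy as np

    def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
        a, b = a.ravel(), b.ravel()
        return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12))

    best_scale = max(outputs_per_scale,
                     key=lambda s: cosine_similarity(y_base, outputs_per_scale[s]))
    w1, w2 = w1 * best_scale, w2 / best_scale   # keep the best-scoring weight distribution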
On the basis of the above embodiment, the weight quantization tuning of all continuous convolutions of the full model can be completed by repeating S202-S207. The weight parameters among the multiple layers of the neural network are thereby jointly equalized, so that the weight distribution is modified without changing the final FP32 output, meeting the requirements of the quantization distribution and achieving the effect of quantization tuning.
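Repeating the per-group search over every detected group can be pictured as the hypothetical full-model pass below; it reuses fake_quantize and cosine_similarity from the sketches above, and representing each group as a plain pair of weight matrices is a simplifying assumption.

    import numpy as np

    def tune_group(x, w_a, w_b, scales=(0.25, 0.5, 1.0, 2.0, 4.0, 10.0)):
        # One round of S203-S207 for a conv -> ReLU -> conv group.
        fp32_ref = np.maximum(x @ w_a, 0.0) @ w_b
        def quantized_output(s):
            return np.maximum(x @ fake_quantize(w_a * s), 0.0) @ fake_quantize(w_b / s)
        best = max(scales, key=lambda s: cosine_similarity(fp32_ref, quantized_output(s)))
        return w_a * best, w_b / best            # tuning weight parameters for this group

    def tune_all_groups(x, groups):
        # groups: iterable of (w_first, w_second) weight pairs detected as in S201-S202.
        return [tune_group(x, w_a, w_b) for w_a, w_b in groups]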
Fig. 4 is a schematic structural diagram of an off-line quantization tuning apparatus according to another exemplary embodiment of the present application. As shown in fig. 4, the offline quantization tuning device 300 provided in this embodiment includes:
An obtaining module 301, configured to obtain a network to be tuned in a preset network model, where the network to be tuned includes a plurality of convolution layers;
The processing module 302 is configured to adjust weight parameters of each convolution layer in the network to be tuned, where output accuracy of the network to be tuned remains unchanged;
the processing module 302 is further configured to determine tuning output results of the network to be tuned in different weight parameter allocation states;
The tuning module 303 is configured to determine tuning weight parameters of each convolution layer in the network to be tuned according to the similarity between the simulated quantization output result and the tuning output results corresponding to different weight parameters.
In one possible design, the processing module 302 is specifically configured to:
Allocating a basic weight parameter combination to the network to be tuned, wherein the network to be tuned comprises N convolution layers, the basic weight parameter combination comprises N basic weight parameters, the convolution layers of the network to be tuned correspond one-to-one to the basic weight parameters in the basic weight parameter combination, and N is a positive integer;
And allocating a tuning weight parameter combination to the network to be tuned, wherein the tuning weight parameter combination comprises N tuning weight parameters, the convolution layers of the network to be tuned correspond one-to-one to the tuning weight parameters in the tuning weight parameter combination, and the product of the tuning multiples of the tuning weight parameters relative to their corresponding basic weight parameters is a preset fixed value.
In one possible design, the tuning module 303 is specifically configured to:
And determining a tuning output result of the network to be tuned according to different tuning weight parameter combinations, wherein each tuning weight parameter combination is used as a basic weight parameter combination for next tuning weight parameter distribution.
In one possible design, the processing module 302 is specifically configured to:
If the network to be tuned comprises a first common convolution layer and a second common convolution layer, wherein the second common convolution layer follows the first common convolution layer, determining the output of the second common convolution layer as the tuning output result according to different tuning weight parameter combinations; or
If the network to be tuned comprises a first common convolution layer, an intermediate specific convolution layer and a second common convolution layer, wherein the intermediate specific convolution layer follows the first common convolution layer and the second common convolution layer follows the intermediate specific convolution layer, determining the output of the second common convolution layer as the tuning output result according to different tuning weight parameter combinations.
In one possible design, at least one linear rectification function is provided between the first common convolution layer and the second common convolution layer.
In one possible design, the tuning module 303 is specifically configured to:
Respectively determining the cosine similarity between the simulated quantization output result and the tuning output result corresponding to each tuning weight parameter combination;
And determining the tuning weight parameters in the tuning weight parameter combination with the highest cosine similarity as the tuning weight parameters of the convolution layers in the network to be tuned.
In one possible design, the obtaining module 301 is specifically configured to:
Sequentially acquiring each network to be selected in the preset network model, and determining the number of continuous convolution layers in the network to be selected and the convolution type of each convolution layer;
And if the number of the continuous convolution layers meets the preset layer number condition and the convolution type meets the preset type condition, determining the network to be selected as the network to be tuned.
In the embodiment of the application, the division of the modules is only one logic function division, and other division modes can be adopted in actual implementation. For example, multiple modules or components may be combined or may be integrated into another system. In addition, the coupling between the various modules may be direct coupling or indirect coupling. In addition, each functional module in the embodiment of the present application may be integrated in one processing module, or may exist separately and physically.
If the described functions are implemented in the form of software functional modules and sold or used as a stand-alone product, they may be stored on a machine-readable storage medium. Accordingly, aspects of the present application may be embodied in a software product, which may be stored on a machine-readable storage medium and may include instructions for causing an electronic device to perform all or part of the processes of the aspects described in the embodiments of the present application. The storage medium may include a ROM, a RAM, a removable disk, a hard disk, a magnetic disk, an optical disk, or other media capable of storing program code.
Fig. 5 is a schematic structural view of an electronic device according to an exemplary embodiment of the present application. As shown in fig. 5, the electronic device 400 provided in this embodiment includes:
a processor 401 and a memory 402, the processor 401 being connected to the memory 402;
the memory 402 is configured to store a computer program of the processor 401;
Wherein the processor 401 is configured to implement the steps of any of the method embodiments described above by executing the computer program.
Alternatively, the memory 402 may be separate or integrated with the processor 401.
When the memory 402 is a device independent from the processor 401, the electronic apparatus 400 may further include:
a bus 403 for connecting the processor 401 and the memory 402.
In addition, the embodiment of the application also provides a machine-readable storage medium. The machine-readable storage medium may store executable instructions that, when executed by a machine, cause the machine to perform the specific processes in the above method embodiments.
The machine-readable storage medium of the present application described above may be a computer-readable signal medium or a computer-readable storage medium or any combination of the two. The computer readable storage medium can be, for example, but not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or a combination of any of the foregoing. More specific examples of the computer-readable storage medium may include, but are not limited to: an electrical connection having one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing. In the context of this document, a computer readable storage medium may be any tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device. In the present application, however, the computer-readable signal medium may include a data signal propagated in baseband or as part of a carrier wave, with the computer-readable program code embodied therein. Such a propagated data signal may take any of a variety of forms, including, but not limited to, electro-magnetic, optical, or any suitable combination of the foregoing. A computer readable signal medium may also be any computer readable medium that is not a computer readable storage medium and that can communicate, propagate, or transport a program for use by or in connection with an instruction execution system, apparatus, or device. Program code embodied on a computer readable medium may be transmitted using any appropriate medium, including but not limited to: electrical wires, fiber optic cables, RF (radio frequency), and the like, or any suitable combination of the foregoing.
The embodiments of the present application also provide a program product comprising a computer program stored in a readable storage medium. The computer program may be read from a readable storage medium by at least one processor of an electronic device, the at least one processor executing the computer program to cause the electronic device to perform the steps of the method described above.
Furthermore, those of skill in the art will appreciate that the various illustrative elements and algorithm steps described in connection with the embodiments disclosed herein may be implemented as electronic hardware, or combinations of software and electronic hardware. Whether such functionality is implemented as hardware or software depends upon the particular application and design constraints imposed on the solution. Those skilled in the art may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present application.
The above is merely an embodiment of the present application, and the scope of the present application is not limited thereto. Those skilled in the art can make changes or substitutions within the technical scope of the present disclosure, and such changes or substitutions should be included in the scope of the present disclosure.

Claims (10)

1. An offline quantization tuning method, applied to a preset neural network model, wherein what the offline quantization loads is pictures, the method comprising the following steps:
Obtaining a network to be tuned in a preset network model, wherein the network to be tuned comprises a plurality of convolution layers;
Adjusting weight parameters of each convolution layer in the network to be tuned and jointly equalizing the weights among the multiple layers of the neural network, wherein the output precision of the network to be tuned is kept unchanged;
After the weight parameters of each convolution layer in the network to be tuned have been adjusted with different weight parameters, determining tuning output results of the network to be tuned in different weight parameter distribution states;
Determining tuning weight parameters of each convolution layer in the network to be tuned according to the similarity between the simulated quantization output result and the tuning output results corresponding to different weight parameters;
the adjusting the weight parameters of each convolution layer in the network to be tuned comprises the following steps:
Allocating a basic weight parameter combination to the network to be tuned, wherein the network to be tuned comprises N convolution layers, the basic weight parameter combination comprises N basic weight parameters, the convolution layers of the network to be tuned correspond one-to-one to the basic weight parameters in the basic weight parameter combination, and N is a positive integer;
And allocating a tuning weight parameter combination to the network to be tuned, wherein the tuning weight parameter combination comprises N tuning weight parameters, the convolution layers of the network to be tuned correspond one-to-one to the tuning weight parameters in the tuning weight parameter combination, and the product of the tuning multiples of the tuning weight parameters relative to their corresponding basic weight parameters is a preset fixed value.
2. The offline quantization tuning method according to claim 1, wherein determining tuning output results of the network to be tuned in different weight parameter allocation states comprises:
And determining a tuning output result of the network to be tuned according to different tuning weight parameter combinations, wherein each tuning weight parameter combination is used as a basic weight parameter combination for next tuning weight parameter distribution.
3. The offline quantization tuning method according to claim 2, wherein the determining the tuning output result of the network to be tuned according to different tuning weight parameter combinations includes:
If the network to be tuned comprises a first common convolution layer and a second common convolution layer, wherein the second common convolution layer follows the first common convolution layer, determining the output of the second common convolution layer as the tuning output result according to different tuning weight parameter combinations; or
If the network to be tuned comprises a first common convolution layer, an intermediate specific convolution layer and a second common convolution layer, wherein the intermediate specific convolution layer follows the first common convolution layer and the second common convolution layer follows the intermediate specific convolution layer, determining the output of the second common convolution layer as the tuning output result according to different tuning weight parameter combinations.
4. The offline quantization tuning method according to claim 3, wherein at least one linear rectification function is provided between the first common convolution layer and the second common convolution layer.
5. The offline quantization tuning method according to any one of claims 1-4, wherein determining tuning weight parameters of each convolution layer in the network to be tuned according to the similarity between the simulated quantization output result and the tuning output results corresponding to different weight parameters includes:
Respectively determining the cosine similarity between the simulated quantization output result and the tuning output result corresponding to each tuning weight parameter combination;
And determining the tuning weight parameters in the tuning weight parameter combination with the highest cosine similarity as the tuning weight parameters of the convolution layers in the network to be tuned.
6. The method for offline quantization tuning according to any one of claims 1-4, wherein the obtaining a network to be tuned in a preset network model includes:
Sequentially acquiring each network to be selected in the preset network model, and determining the number of continuous convolution layers in the network to be selected and the convolution type of each convolution layer;
And if the number of the continuous convolution layers meets the preset layer number condition and the convolution type meets the preset type condition, determining the network to be selected as the network to be tuned.
7. An offline quantization tuning device, applied to a preset neural network model, wherein what the offline quantization loads is pictures, the device comprising:
an acquisition module, used for acquiring a network to be tuned in a preset network model, wherein the network to be tuned comprises a plurality of convolution layers;
a processing module, used for adjusting weight parameters of each convolution layer in the network to be tuned and jointly equalizing the weights among the multiple layers of the neural network, wherein the output precision of the network to be tuned is kept unchanged;
The processing module is further used for determining tuning output results of the to-be-tuned network in different weight parameter distribution states after the weight parameters of all convolution layers in the to-be-tuned network are adjusted by using different weight parameters;
and a tuning module, used for determining tuning weight parameters of each convolution layer in the network to be tuned according to the similarity between the simulated quantization output result and the tuning output results corresponding to different weight parameters;
the adjusting the weight parameters of each convolution layer in the network to be tuned comprises the following steps:
Allocating a basic weight parameter combination to the network to be tuned, wherein the network to be tuned comprises N convolution layers, the basic weight parameter combination comprises N basic weight parameters, the convolution layers of the network to be tuned correspond one-to-one to the basic weight parameters in the basic weight parameter combination, and N is a positive integer;
And allocating a tuning weight parameter combination to the network to be tuned, wherein the tuning weight parameter combination comprises N tuning weight parameters, the convolution layers of the network to be tuned correspond one-to-one to the tuning weight parameters in the tuning weight parameter combination, and the product of the tuning multiples of the tuning weight parameters relative to their corresponding basic weight parameters is a preset fixed value.
8. An electronic device, comprising: the device comprises a processor and a memory, wherein the processor is respectively connected with the memory;
The memory is used for storing a computer program of the processor;
Wherein the processor is configured to implement the off-line quantization tuning method of any one of claims 1 to 6 by executing the computer program.
9. A machine-readable storage medium having stored thereon executable instructions that when executed by a machine cause the offline quantization tuning method according to any one of claims 1 to 6 to be implemented.
10. A computer program product comprising a computer program, characterized in that the computer program, when executed by a processor, implements the off-line quantitative tuning method of any one of claims 1 to 6.
CN202110324266.1A 2021-03-26 2021-03-26 Offline quantization optimization method, device, equipment, medium and program product Active CN112990457B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110324266.1A CN112990457B (en) 2021-03-26 2021-03-26 Offline quantization optimization method, device, equipment, medium and program product

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110324266.1A CN112990457B (en) 2021-03-26 2021-03-26 Offline quantization optimization method, device, equipment, medium and program product

Publications (2)

Publication Number Publication Date
CN112990457A CN112990457A (en) 2021-06-18
CN112990457B true CN112990457B (en) 2024-05-03

Family

ID=76333790

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110324266.1A Active CN112990457B (en) 2021-03-26 2021-03-26 Offline quantization optimization method, device, equipment, medium and program product

Country Status (1)

Country Link
CN (1) CN112990457B (en)

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107644254A (en) * 2017-09-09 2018-01-30 复旦大学 Convolutional neural network weight parameter quantization training method and system
CN111144511A (en) * 2019-12-31 2020-05-12 上海云从汇临人工智能科技有限公司 Image processing method, system, medium and electronic terminal based on neural network
CN111368978A (en) * 2020-03-02 2020-07-03 开放智能机器(上海)有限公司 Precision improving method for offline quantization tool
CN111723901A (en) * 2019-03-19 2020-09-29 百度在线网络技术(北京)有限公司 Training method and device of neural network model
CN112400176A (en) * 2019-06-12 2021-02-23 上海寒武纪信息科技有限公司 Neural network quantitative parameter determination method and related product

Family Cites Families (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109146064B (en) * 2018-09-05 2023-07-25 腾讯科技(深圳)有限公司 Neural network training method, device, computer equipment and storage medium
US11676029B2 (en) * 2019-06-12 2023-06-13 Shanghai Cambricon Information Technology Co., Ltd Neural network quantization parameter determination method and related products

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107644254A (en) * 2017-09-09 2018-01-30 复旦大学 Convolutional neural network weight parameter quantization training method and system
CN111723901A (en) * 2019-03-19 2020-09-29 百度在线网络技术(北京)有限公司 Training method and device of neural network model
CN112400176A (en) * 2019-06-12 2021-02-23 上海寒武纪信息科技有限公司 Neural network quantitative parameter determination method and related product
CN111144511A (en) * 2019-12-31 2020-05-12 上海云从汇临人工智能科技有限公司 Image processing method, system, medium and electronic terminal based on neural network
CN111368978A (en) * 2020-03-02 2020-07-03 开放智能机器(上海)有限公司 Precision improving method for offline quantization tool

Also Published As

Publication number Publication date
CN112990457A (en) 2021-06-18

Similar Documents

Publication Publication Date Title
Yang et al. Netadapt: Platform-aware neural network adaptation for mobile applications
CN106485316A (en) Neural network model compression method and device
CN110956202B (en) Image training method, system, medium and intelligent device based on distributed learning
CN111723901B (en) Training method and device for neural network model
CN112101525A (en) Method, device and system for designing neural network through NAS
CN110414630A (en) The training method of neural network, the accelerated method of convolutional calculation, device and equipment
CN112906294A (en) Quantization method and quantization device for deep learning model
CN112200296B (en) Network model quantization method and device, storage medium and electronic equipment
CN114444668A (en) Network quantization method, network quantization system, network quantization apparatus, network quantization medium, and image processing method
CN112990457B (en) Offline quantization optimization method, device, equipment, medium and program product
CN112906883A (en) Hybrid precision quantization strategy determination method and system for deep neural network
CN115906927B (en) Data access analysis method and system based on artificial intelligence and cloud platform
Yang et al. Resource-aware pareto-optimal automated machine learning platform
CN116402123A (en) Pre-training model fine tuning method and system based on learning strategy
CN111797991A (en) Deep network model compression system, method and device
CN115392441A (en) Method, apparatus, device and medium for on-chip adaptation of quantized neural network model
CN115345303A (en) Convolutional neural network weight tuning method, device, storage medium and electronic equipment
CN114004334A (en) Model compression method, model compression system, server and storage medium
CN114626284A (en) Model processing method and related device
CN111930670A (en) Heterogeneous intelligent processing quantization device, quantization method, electronic device and storage medium
Cai et al. ACF: An Adaptive Compression Framework for Multimodal Network in Embedded Devices
CN116739049A (en) Network compression method and device and storage medium
CN117454943A (en) Automatic model compression method, device and medium
Khoram et al. TOCO: A framework for compressing neural network models based on tolerance analysis
CN110298438A (en) The method of adjustment and adjustment device of neural network model

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant