CN117077740B - Model quantization method and device - Google Patents

Model quantization method and device

Info

Publication number
CN117077740B
CN117077740B · CN202311235065.XA · CN202311235065A
Authority
CN
China
Prior art keywords
bit width
network
weight
data
activation function
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202311235065.XA
Other languages
Chinese (zh)
Other versions
CN117077740A (en)
Inventor
石强
唐巍
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Honor Device Co Ltd
Original Assignee
Honor Device Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Honor Device Co Ltd filed Critical Honor Device Co Ltd
Priority to CN202311235065.XA priority Critical patent/CN117077740B/en
Publication of CN117077740A publication Critical patent/CN117077740A/en
Application granted granted Critical
Publication of CN117077740B publication Critical patent/CN117077740B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G06N 3/02 Neural networks; G06N 3/04 Architecture, e.g. interconnection topology
    • G06N 3/0495 Quantised networks; Sparse networks; Compressed networks
    • G06N 3/042 Knowledge-based neural networks; Logical representations of neural networks
    • G06N 3/045 Combinations of networks
    • G06N 3/0464 Convolutional networks [CNN, ConvNet]
    • G06N 3/047 Probabilistic or stochastic networks
    • G06N 3/048 Activation functions
    • G06N 3/08 Learning methods; G06N 3/084 Backpropagation, e.g. using gradient descent
    • G06N 5/04 Inference or reasoning models; G06N 5/046 Forward inferencing; Production systems
    • G06V 10/82 Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Evolutionary Computation (AREA)
  • Artificial Intelligence (AREA)
  • Software Systems (AREA)
  • General Physics & Mathematics (AREA)
  • Computing Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • General Health & Medical Sciences (AREA)
  • Mathematical Physics (AREA)
  • General Engineering & Computer Science (AREA)
  • Biophysics (AREA)
  • Molecular Biology (AREA)
  • Biomedical Technology (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Probability & Statistics with Applications (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Databases & Information Systems (AREA)
  • Medical Informatics (AREA)
  • Multimedia (AREA)
  • Compression, Expansion, Code Conversion, And Decoders (AREA)

Abstract

The embodiment of the application provides a model quantization method and device, comprising the following steps: the first network receives first data, wherein the first data is floating point type data in a target model; the first network predicts the bit width of at least one weight and/or the bit width of at least one activation function in a second network according to the first data, and transmits the bit width of the at least one weight and/or the bit width of the at least one activation function to the second network; and the second network receives the first data, and performs forward reasoning according to the first data based on the bit width of the at least one weight and/or the bit width of the at least one activation function to obtain output data of the second network. According to the method and the device, the waste of computing resources in the model quantization process can be reduced.

Description

Model quantization method and device
Technical Field
The present disclosure relates to the field of model quantization technologies, and in particular, to a method and apparatus for model quantization.
Background
Model quantization is a model acceleration method that converts the floating point operations of a neural network into fixed point operations. Quantizing a floating point model can accelerate the model's inference speed, reduce device power consumption, and reduce storage space. At present, when a quantization network is used for model quantization, the bit width used by the quantization network is fixed for different data to be quantized; in other words, the quantization network uses the same computing resources to process different data to be quantized, so the quantization network wastes computing resources in the model quantization process.
Disclosure of Invention
The application provides a model quantization method and device, which can reduce the waste of computing resources in the model quantization process.
In a first aspect, an embodiment of the present application provides a model quantization method, including: the first network receives first data, wherein the first data is floating point type data in a target model; the first network predicts the bit width of at least one weight and/or the bit width of at least one activation function in a second network according to the first data, and transmits the bit width of the at least one weight and/or the bit width of the at least one activation function to the second network; and the second network receives the first data, and performs forward reasoning according to the first data based on the bit width of the at least one weight and/or the bit width of the at least one activation function to obtain output data of the second network. According to the method, the first data are input into the first network, the first network predicts the bit width of at least one weight and/or the bit width of at least one activation function in the second network according to the first data, and the second network then carries out forward reasoning on the first data based on the bit width of the at least one weight and/or the bit width of the at least one activation function predicted by the first network, so as to obtain the output data of the second network. In this way, the bit width of at least one weight and/or the bit width of at least one activation function in the second network can be adaptively adjusted for different first data to be quantized, so that different bit widths can be used for different first data; that is, the second network can perform quantization processing on different floating point type data using appropriate computing resources, which reduces the waste of computing resources of the model quantization network. The first network in the method may be the bit controller in the subsequent embodiments, and the second network may be the main network in the subsequent embodiments.
Optionally, the first network predicts the bit width of at least one weight and/or the bit width of at least one activation function in the second network according to the first data, including: the first network predicts the bit width probability distribution of at least one weight and/or the bit width probability distribution of at least one activation function in the second network according to the first data; the bit width probability distribution of the weight comprises: the probability of using each bit width in a preset bit width set by weight when processing the first data; the bit width probability distribution of the activation function includes: activating a function to use the probability of each bit width in a preset bit width set when processing the first data; determining the bit width of the at least one weight from the bit width probability distribution of the at least one weight and/or determining the bit width of the at least one activation function from the bit width probability distribution of the at least one activation function.
Optionally, the determining the bit width of the at least one weight according to the bit width probability distribution of the at least one weight includes: for one weight, the bit width with the highest probability is obtained from the bit width probability distribution of the one weight as the bit width of the one weight.
Optionally, the determining the bit width of the at least one activation function according to the bit width probability distribution of the at least one activation function includes: for one activation function, the bit width with the highest probability is obtained from the bit width probability distribution of the one activation function as the bit width of the one activation function.
Optionally, the training method of the first network and the second network includes: receiving first training data, wherein the first training data is floating point type training data; inputting the first training data into the first network, and predicting, by the first network, the bit width of at least one weight and/or the bit width of at least one activation function in the second network according to the first training data; transmitting the bit width of the at least one weight and/or the bit width of the at least one activation function predicted by the first network to the second network; and inputting the first training data into the second network, and performing, by the second network, forward propagation processing according to the first training data based on the received bit width of each weight and/or of each activation function, to obtain output data of the second network.
Optionally, the training method of the first network and the second network further includes: and calculating a loss function according to the output data of the second network, and adjusting parameters in the second network according to the loss function, wherein the parameters comprise weights.
Optionally, the first network includes: at least one convolutional layer and at least one fully-connected layer, the method further comprising: and obtaining the output of the last full-connection layer of the first network, and adjusting the weight of each full-connection layer in the first network according to the output of the last full-connection layer.
Optionally, the method further comprises: and acquiring the adjusted parameters of the specified convolution layer in the second network, and adjusting the parameters of the corresponding convolution layer of the specified convolution layer in the first network according to the adjusted parameters of the specified convolution layer.
Optionally, the first network predicting the bit width of at least one weight and/or the bit width of at least one activation function in the second network according to the first training data includes: the first network predicts the bit width probability distribution of at least one weight and/or the bit width probability distribution of at least one activation function in the second network according to the first training data; the bit width of the at least one weight is determined according to the bit width probability distribution of the at least one weight, and the bit width of the at least one activation function is determined according to the bit width probability distribution of the at least one activation function.
In a second aspect, an embodiment of the present application provides an electronic device, including: a processor, a memory; wherein one or more computer programs are stored in the memory, the one or more computer programs comprising instructions, which when executed by the processor, cause the electronic device to perform the method of any of the first aspects.
In a third aspect, embodiments of the present application provide a computer-readable storage medium having a computer program stored therein, which when run on a computer, causes the computer to perform the method of any of the first aspects.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present application, the drawings that are needed in the embodiments will be briefly described below, and it is obvious that the drawings in the following description are only some embodiments of the present application, and that other drawings may be obtained according to these drawings without inventive effort for a person skilled in the art.
FIG. 1A is a schematic diagram of a model quantization network according to an embodiment of the present application;
FIG. 1B is a schematic diagram of another architecture of a model quantization network according to an embodiment of the present application;
FIG. 2 is a schematic flow chart of a model quantization method according to an embodiment of the present application;
FIG. 3 is a schematic diagram of a forward reasoning process of a single convolutional layer according to an embodiment of the present application;
FIG. 4 is a flowchart of a training method of a model quantization network according to an embodiment of the present application;
fig. 5 is a schematic diagram of a weight sharing of a bit controller and a main network according to an embodiment of the present application;
fig. 6 is a flow chart of a testing method of a model quantization network according to an embodiment of the present application.
Detailed Description
The terminology used in the description section of the present application is for the purpose of describing particular embodiments of the present application only and is not intended to be limiting of the present application.
Model quantization is a model acceleration method that converts the floating point operations of a neural network into fixed point operations; specifically, it uses scaling factors to map the floating point weights and activation values in a model to low-bit-width integer values. Quantizing a floating point model accelerates inference of the model on the device side, reduces device power consumption, and reduces storage space. A bit is the smallest unit used in a computer to store or transmit information or signals; the bit width, also called the number of bits, is the number of bits used. At present, when a quantization network is used for model quantization, the bit width used by the quantization network is fixed for different data to be quantized; in other words, the quantization network uses the same computing resources to process different data to be quantized, so the quantization network wastes computing resources in the model quantization process. For example, identifying a clear cat picture requires fewer computing resources, while a blurred cat picture requires more computing resources; because the quantization network has a fixed bit width, i.e., fixed computing resources, it processes both the clear and the blurred cat picture with the same computing resources, which wastes computing resources at least when identifying the clear cat picture.
Therefore, the embodiment of the application provides a model quantization method and device, which can reduce the waste of computing resources in the model quantization process.
Fig. 1A is a schematic structural diagram of a model quantization network provided in an embodiment of the present application, in fig. 1A, each convolution layer is assumed to include m channels, each channel has a weight corresponding to the channel, and an activation layer and an activation function included in the activation layer are not shown in fig. 1A.
As shown in fig. 1A, the model quantization network may include: bit controller and main network.
The bit controller is used for: predicting the bit width of at least one weight and/or the bit width of at least one activation function in the main network according to the received floating point type data, and transmitting the predicted bit width of the at least one weight and/or the bit width of the at least one activation function to the main network, to serve as the bit width of the at least one weight and/or the bit width of the at least one activation function in the main network. When the model quantization network is being trained, the received floating point type data can be training data; after training of the model quantization network is finished and model quantization is performed, the received floating point type data can be floating point type data of the target model that needs quantization processing. Taking fig. 1A as an example, in which the bit controller predicts the bit width of the weight of each channel in the main network and the bit width of each activation function (not shown in fig. 1A), the bit widths framed with a relatively thicker frame are the bit widths predicted by the bit controller for the weights of the channels.
The main network is used for: and according to the received floating point data, carrying out forward propagation based on the bit width of at least one weight and/or the bit width of at least one activation function determined by the bit controller, so as to obtain output data of the main network.
Optionally, the bit controller and the floating point type data received by the main network are identical, thereby ensuring that the bit width of the weights and/or activation functions predicted by the bit controller is relatively more accurate.
In some embodiments, the main network may be a quantization network in the related art that performs model quantization. As long as the bit width set and the output size of the fully-connected layers of the bit controller are adapted accordingly, the model quantization network in the embodiments of the present application can be obtained through training and remains compatible with quantization networks in the related art that need to perform model quantization.
In some embodiments, the bit controller may include: at least 2 convolution layers, at least one fully-connected layer, and a bit width determination module. For example, in the network structure shown in fig. 1B, in order to prevent the bit controller from greatly increasing the storage and computation of the whole model quantization network, the bit controller may include 3 convolution layers, 2 fully-connected layers, and a bit width determination module, and the main network includes n convolution layers, where n is a natural number greater than 1.
As shown in fig. 1B, taking the bit controller to determine the bit width of each weight and the bit width of each activation function in the main network according to floating point data as an example, at this time, the last full connection layer of the bit controller is connected to the bit width determining module, the bit width probability distribution of each weight and the bit width probability distribution of each activation function of the main network are output to the bit width determining module, the bit width determining module is connected to the main network, and the bit width of each weight and the bit width of each activation function are output to the main network.
It is understood that in other embodiments provided herein, the bit width determination module may also exist independently of the bit controller, in which case the model quantization network may include: the system comprises a bit controller, a bit width determining module and a main network.
In some embodiments, the 2 fully-connected layers of the bit controller may be implemented by a multilayer perceptron (MLP).
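The following is a minimal PyTorch-style sketch of the bit controller structure described above (3 convolution layers, 2 fully-connected layers implemented as an MLP, followed by a softmax over the candidate bit widths). All layer sizes, channel counts and the input format are assumptions for illustration, not values taken from the patent.

```python
import torch
import torch.nn as nn

class BitController(nn.Module):
    def __init__(self, n_layers: int, bit_choices=(2, 4, 8, 16)):
        super().__init__()
        self.bit_choices = torch.tensor(bit_choices)
        # 3 convolution layers that extract features from the floating point input
        self.features = nn.Sequential(
            nn.Conv2d(3, 16, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(16, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),
        )
        # 2 fully-connected layers (an MLP); the last layer outputs one score per
        # candidate bit width, per main-network layer, for weights and activations
        self.mlp = nn.Sequential(
            nn.Linear(64, 128), nn.ReLU(),
            nn.Linear(128, n_layers * len(bit_choices) * 2),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        h = self.features(x).flatten(1)
        logits = self.mlp(h).view(x.size(0), -1, 2, len(self.bit_choices))
        return torch.softmax(logits, dim=-1)  # bit width probability distributions
```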
The bit width of the weight refers to the bit width of the quantized weight.
The bit width of the activation function refers to the bit width of the output value of the activation function after quantization. The output value of the activation function is also referred to as an activation value in some related art, and the bit width of the activation function may also be referred to as an activation bit width or a bit width of an activation value in some related art.
A bit width probability distribution, which may also be referred to as a bit-width probability distribution, includes: a probability value for each bit width in the preset bit width set.
The bit width probability distribution corresponding to the weight refers to: the weight uses a probability value for each bit width in a set of preset bit widths.
The bit width probability distribution corresponding to the activation function refers to: the output of the activation function uses probability values for each bit width in a set of preset bit widths.
For example, assuming that the preset bit width set is {2bit, 4bit, 8bit, 16bit}, for a weight, the bit width probability distribution corresponding to the weight may include: (0.2, 0.1, 0.5, 0.2), i.e., the probability that the weight uses the 2-bit width is 0.2, the probability that it uses the 4-bit width is 0.1, the probability that it uses the 8-bit width is 0.5, and the probability that it uses the 16-bit width is 0.2.
Fig. 2 is a schematic flow chart of a model quantization method according to an embodiment of the present application, as shown in fig. 2, the method may include:
step 201: the bit controller receives first data.
The first data may be floating point type data in a target model, which may be any floating point type model that requires model quantization.
Step 202: the bit controller predicts the bit width of at least one weight and/or the bit width of at least one activation function in the main network according to the first data, and transmits the predicted bit width of at least one weight and/or the bit width of at least one activation function to the main network.
Alternatively, the bit controller may predict the bit width of the at least one weight and/or the bit width of the at least one activation function in the main network based on the complexity of the first data. Taking the case where the first data is image data as an example, the complexity of the first data may specifically include: the degree of blurring of the image, the brightness of the image, whether the image subject is complete, and the like. For example, the more blurred the image of the first data, the higher its complexity; the higher the image brightness of the first data, the relatively lower its complexity; and the more complete the image subject of the first data, the relatively lower its complexity.
Optionally, the bit controller in this step predicts the bit width of at least one weight and/or the bit width of at least one activation function in the main network according to the first data, which may specifically include:
the bit controller predicts bit width probability distribution of at least one weight and/or bit width probability distribution of at least one activation function in the main network according to the first data; the bit-width probability distribution of the weights includes: the probability of using each bit width in the preset bit width set by weight when processing the first data; the bit-wide probability distribution of the activation function includes: activating a function to use the probability of each bit width in a preset bit width set when processing the first data;
The bit width of the at least one weight is determined from the bit width probability distribution of the at least one weight and/or the bit width of the at least one activation function is determined from the bit width probability distribution of the at least one activation function.
Optionally, the bit controller determining the bit width of the at least one weight according to the bit width probability distribution of the at least one weight may include:
for one weight, the bit width with the highest probability is obtained from the bit width probability distribution of the one weight as the bit width of the one weight.
Continuing the previous example, assume that the predetermined set of bit widths is {2bit, 4bit, 8bit, 16bit }, the bit width probability distribution of a weight includes: (0.2, 0.1, 0.5, 0.2), so that the bit width of 8 bits with the highest probability can be used as the bit width of the weight.
In one embodiment, the bit controller may implement the above-described obtaining the highest probability bit width from the bit width probability distribution of one weight by using an argmax function.
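A short sketch of this bit width selection step is given below, using the example distribution (0.2, 0.1, 0.5, 0.2) over the preset bit width set {2, 4, 8, 16}; the variable names are illustrative only.

```python
import numpy as np

bit_choices = np.array([2, 4, 8, 16])
weight_probs = np.array([0.2, 0.1, 0.5, 0.2])       # bit width probability distribution of one weight
weight_bits = bit_choices[np.argmax(weight_probs)]  # argmax picks the highest-probability width
print(weight_bits)                                  # -> 8
```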
Optionally, the bit controller determines the bit width of the at least one activation function according to a bit width probability distribution of the at least one activation function, comprising:
For one activation function, the bit width with the highest probability is obtained from the bit width probability distribution of the one activation function as the bit width of the one activation function.
Continuing the previous example, assuming that the preset bit width set is {2bit, 4bit, 8bit, 16bit}, the bit width probability distribution of an activation function may include: (0.1, 0.1, 0.1, 0.7), so that the bit width with the highest probability, 16 bits, can be used as the bit width of the activation function.
In some embodiments, the bit controller may implement the above-described obtaining the highest probability bit width from the bit width probability distribution of one activation function via the argmax function.
In some embodiments, to reduce the data throughput of the bit controller, each channel in the same convolutional layer may be set to the same bit width, at which time the bit controller may predict one bit width for all weights in one convolutional layer as the bit width for all weights of that convolutional layer.
In some embodiments, to reduce the data throughput of the bit controller, each activation function in the same activation layer may be set to the same bit width, at which time the bit controller may predict one bit width for all activation functions in one activation layer as the bit widths of all activation functions of that activation layer.
For example, assuming that the preset bit width set is {2bit, 4bit, 8bit, 16bit} and the main network includes n convolution layers, if each convolution layer and each activation layer has its own independent bit width, the channels in each convolution layer use the same bit width, and the activation functions in each activation layer use the same bit width, then the output of the last fully-connected layer of the bit controller is a vector of n × 4 × 2 dimensions, and each value in the vector represents the probability of a respective bit width being used in each layer (convolution layer and activation layer) when processing the first data.
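A hedged sketch of how such an n × 4 × 2 dimensional output could be turned into per-layer bit widths is shown below, assuming n = 5 main-network layers, the preset bit width set {2, 4, 8, 16}, and one distribution each for a layer's weights and its activation functions; the exact memory layout is an assumption.

```python
import torch

n = 5
bit_choices = torch.tensor([2, 4, 8, 16])
fc_out = torch.randn(n * 4 * 2)                        # output of the last fully-connected layer
probs = torch.softmax(fc_out.view(n, 2, 4), dim=-1)    # [layer, weight/activation, bit width]
weight_bits = bit_choices[probs[:, 0].argmax(dim=-1)]  # one bit width per convolution layer
act_bits = bit_choices[probs[:, 1].argmax(dim=-1)]     # one bit width per activation layer
```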
Step 203: the main network receives the first data, and performs forward propagation processing according to the first data based on the bit width of at least one weight and/or the bit width of at least one activation function predicted by the bit controller, so as to obtain output data of the main network.
Alternatively, the output data of the main network may be fixed point type data.
Optionally, for the weight and/or the activation function of the bit width that is not predicted by the bit controller in the main network, the bit width may be preset for the weight and/or the activation function, and the specific implementation may be implemented by using a method for presetting the bit width in the related art, which is not described in detail in the embodiments of the present application.
For example, referring to the main network structure shown in fig. 1A, if the bit controller predicts the bit widths of the weights of channels 1~m in convolution layer 1, bit widths may be preset for the channels (channels 1~m) of convolution layers 2~n of the main network respectively, so that in this step, when the main network performs quantization processing on the first data, the weights of channels 1~m of convolution layer 1 use the bit widths predicted by the bit controller, and the weights of the channels of convolution layers 2~n use the corresponding preset bit widths.
In some embodiments, referring to fig. 3, a forward reasoning process of a single convolution layer of the main network is shown. The bit width of the weights of the convolution layer may be preset or predicted by the bit controller. For example, in fig. 3, all weights of the convolution layer are predicted by the bit controller to use the same 8-bit width, so that for the weights and input data (W, X) of the convolution layer, the weights W can be converted into fixed point data (denoted here as W_q) based on the 8-bit width of the weights, and forward processing is then carried out by the convolution layer to obtain the output data of the convolution layer (denoted here as Y_q).
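A simplified sketch of this single-layer forward reasoning is given below, assuming symmetric uniform quantization with one per-tensor scaling factor; the patent does not prescribe this exact quantizer, so it is illustrative only.

```python
import torch
import torch.nn.functional as F

def fake_quantize(t: torch.Tensor, bits: int) -> torch.Tensor:
    qmax = 2 ** (bits - 1) - 1                     # e.g. 127 for 8 bit
    scale = t.abs().max() / qmax                   # scaling factor
    q = torch.clamp(torch.round(t / scale), -qmax, qmax)
    return q * scale                               # fixed point values mapped back for the convolution

W = torch.randn(16, 3, 3, 3)                       # floating point weights W of the convolution layer
X = torch.randn(1, 3, 32, 32)                      # floating point input data X
W_q = fake_quantize(W, bits=8)                     # 8-bit width predicted by the bit controller
Y_q = F.conv2d(X, W_q, padding=1)                  # output data of the convolution layer
```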
In the method shown in fig. 2, a bit controller is set up alongside the main network. The first data is input into the bit controller, the bit controller predicts the bit width of at least one weight and/or the bit width of at least one activation function in the main network according to the first data, and the main network then performs forward propagation processing on the first data based on the bit width of the at least one weight and/or the bit width of the at least one activation function predicted by the bit controller, to obtain the output data of the main network. In this way, the bit width of at least one weight and/or at least one activation function in the main network can be adaptively adjusted for different floating point data to be quantized, different bit widths can be used for different floating point data, and the main network can use appropriate computing resources for the quantization processing of different floating point data, thereby reducing the waste of computing resources of the model quantization network.
It can be appreciated that the more weights and activation functions for which the bit controller predicts bit widths, the more the waste of computing resources of the main network, that is, of the model quantization network, is reduced. Therefore, in some embodiments, in order to better reduce the waste of computing resources of the model quantization network, the bit controller in step 202 may predict the bit width of each weight and the bit width of each activation function in the main network according to the first data. In this way, the bit width of each weight and of each activation function in the main network can be adaptively adjusted for different first data, so the main network can use the corresponding bit widths, i.e., appropriate computing resources, for different first data, and the waste of computing resources of the model quantization network can be better reduced.
In the following, the training method of the model quantization network is described by taking as an example the case in which the bit controller predicts the bit width of each weight and the bit width of each activation function in the main network based on the first training data.
Fig. 4 is a flow chart of a training method of a model quantization network according to an embodiment of the present application, as shown in fig. 4, the method may include:
step 401: a set of preset bit widths and a target floating point number per second (floating-point operations per second).
The step is performed in advance, and is not required to be performed every time training of the model quantization network is performed.
In this step, the bit width set may be determined based on the bit widths that the main network may use; in other words, the bit width set includes the bit widths that the main network may use. The bit width set may include at least 2 bit widths.
In this embodiment, the bit width set {2 bits, 4 bits, 8 bits, 16 bits} is taken as an example. It is understood that this bit width set is merely an example and is not intended to limit the implementation of the bit width set in the embodiments of the present application.
The above-mentioned FLOPs refers to the number of floating point operations performed per second, which may also be referred to as the peak speed per second.
The target FLOPs may be set based on the sum of the FLOPs of all the convolution layers of the main network; that is, the target FLOPs is the target value of the sum of the FLOPs of all the convolution layers of the main network.
Step 402: the bit controller and the primary network are initialized separately.
In some embodiments, the initialization data of the bit controller and the initialization data of the main network may be preset respectively. The initialization data of the bit controller may include: the initial values of the weights in its convolution layers and fully-connected layers; the initialization data of the main network may include: the initial values of the weights in its convolution layers. In this step, the initialization data of the bit controller may be read to initialize the bit controller, and the initialization data of the main network may be read to initialize the main network.
In other embodiments, to reduce the data storage space and the amount of model training, the convolution layers of the bit controller may be identical in structure to designated convolution layers of the main network; accordingly, the convolution layers of the bit controller may be initialized using the initialization data of the designated convolution layers of the main network. For example, as shown in fig. 5, assuming that the bit controller includes 3 convolution layers whose structures are correspondingly identical to those of the first 3 convolution layers of the main network (convolution layers 1 to 3 of the main network), the preset initialization data of the bit controller may include: the initial value of each weight in its fully-connected layers, and the preset initialization data of the main network may include: the initial values of the weights in its convolution layers. In this step, the initialization data of the first 3 convolution layers of the main network may be read to initialize the 3 convolution layers of the bit controller, the initialization data of the bit controller may be read to initialize the fully-connected layers of the bit controller, and the initialization data of the main network may be read to initialize the main network.
Step 403: first training data is acquired and is input into the bit controller as input data of the bit controller.
The first training data may be floating point sample data.
Step 404: the bit controller predicts the bit width probability distribution of each weight and the bit width probability distribution corresponding to each activation function in the main network according to the first training data.
In combination with the model quantization network structure shown in fig. 1B, the bit controller may implement predicting the bit width probability distribution of each weight and the bit width probability distribution corresponding to each activation function in the main network using 3 convolutional layers and 2 fully-connected layers.
Step 405: the bit controller determines the bit width of each weight according to the bit width probability distribution of each weight, and determines the bit width of each activation function according to the bit width probability distribution corresponding to each activation function.
Alternatively, for each weight, the bit controller may select, as the bit width of the weight, the bit width having the largest probability from the bit width probability distribution corresponding to the weight. In some embodiments, the bit controller may use an argmax function to select the bit width with the highest probability from the bit width probability distributions corresponding to the weights.
Alternatively, for each activation function, the bit controller may select, from the bit width probability distribution corresponding to the activation function, the bit width with the largest probability as the bit width of the activation function. In some embodiments, the bit controller may use the argmax function to select the bit width with the highest probability from the bit width probability distributions corresponding to the activation function.
In connection with fig. 1B, this step may be performed in particular by a bit width determination module in the bit controller.
Step 406: the bit controller transmits the determined bit width of each weight and the bit width of each activation function to the main network as the bit width of each weight and the bit width of each activation function in the main network.
Step 407: the first training data is input into the main network as input data of the main network.
Step 408: the main network performs forward propagation according to the first training data based on the bit width of each weight and of each activation function determined by the bit controller, so as to obtain the output data of the main network.
The implementation of the forward propagation procedure of each convolution layer in the main network may be illustrated with reference to fig. 3 in step 203, which is not described here in detail.
Step 409: calculate the loss function of the main network according to the output data of the main network, the target FLOPs, and the FLOPs of each convolution layer in the main network.
In some embodiments, the loss function of the main network $\mathcal{L}$ can be calculated with the following formula:

$$\mathcal{L} = \mathcal{L}_{0} + \alpha \left| \sum_{i=1}^{n} F_i - F_{\mathrm{target}} \right|$$

wherein $\mathcal{L}_{0}$ is a conventional loss function, $\alpha$ is a penalty coefficient, $F_i$ is the FLOPs of the quantization processing performed by the i-th convolution layer for the first training data, and $F_{\mathrm{target}}$ is the target FLOPs preset in step 401.

In the above calculation formula, $\mathcal{L}_{0}$ may be calculated from the first training data and the output data of the main network.

The conventional loss function described above may be a loss function used in the related art of neural network training, for example the L1 loss function, also referred to as the absolute value loss function; in that case $\mathcal{L}_{0}$ in the above calculation formula can be noted as $\mathcal{L}_{0} = \frac{1}{N}\sum_{j=1}^{N}\left| y_j - \hat{y}_j \right|$, where $y_j$ is the label of the j-th training sample and $\hat{y}_j$ is the corresponding output of the main network.

The above loss function $\mathcal{L}$ adds the FLOPs as an optimization target; by presetting the target FLOPs, the model quantization network is guided to reduce its time delay (latency), so that the model quantization network obtained through training meets the latency requirement.
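A sketch of this FLOPs-regularized loss is shown below, assuming the L1 task loss; layer_flops would be the FLOPs implied by the bit widths used in this forward pass, and alpha is the penalty coefficient. All names are illustrative.

```python
import torch

def main_network_loss(output, target, layer_flops, target_flops, alpha=0.1):
    task_loss = torch.mean(torch.abs(output - target))    # conventional L1 loss
    flops_penalty = abs(sum(layer_flops) - target_flops)  # deviation from the target FLOPs
    return task_loss + alpha * flops_penalty
```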
Step 410: and carrying out back propagation according to the loss function of the main network, and adjusting each parameter in the main network.
Parameters in the main network include: the weights of the channels in the convolution layers.
In this step, updating each parameter in the main network according to the loss function of the main network may be implemented by related technologies in neural network training, which is not limited in this embodiment.
Step 411: and adjusting parameters of all the full connection layers of the bit controller according to the output data of the last full connection layer of the bit controller, and adjusting parameters of each convolution layer in the bit controller according to a loss function of the main network.
Parameters of the full connection layer include: weights of full connection layer. The parameters of the convolutional layer include: the weights of the various channels of the convolutional layer.
In some embodiments, the implementation of adjusting the parameters of each convolution layer in the bit controller according to the loss function of the main network may refer to the implementation of adjusting each parameter of the main network according to the loss function of the main network in step 410; in other words, the convolution layers in the bit controller may be treated like convolution layers in the main network when adjusting their parameters. In this case, the execution order between steps 409 to 410 and step 411 is not limited, that is, step 411 may be executed anywhere between step 408 and step 412.
In other embodiments, in order to reduce the data processing amount in the training process, the weights of the convolution layers in the bit controller may share the weights of the corresponding convolution layers in the main network, in other words, the weights of the convolution layers in the bit controller are always the same as the weights of the corresponding convolution layers in the main network, and in this step, the weights of the convolution layers in the bit controller may be adjusted according to the weights of the corresponding convolution layers in the main network. For example, if the bit controller includes 3 convolution layers corresponding to the first 3 convolution layers in the main network, after the parameters of the convolution layers of the main network are adjusted, the parameters (including weights) of the convolution layer 1 of the main network may be shared to the convolution layer 1 of the bit controller, the parameters (including weights) of the convolution layer 2 of the main network may be shared to the convolution layer 2 of the bit controller, and the parameters (including weights) of the convolution layer 3 of the main network may be shared to the convolution layer 3 of the bit controller, so that the parameters of the 3 convolution layers of the bit controller correspond to the parameters of the first 3 convolution layers in the main network.
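A hypothetical sketch of this weight sharing is given below: after the main network's parameters have been adjusted, its first 3 convolution layers are copied into the bit controller's 3 convolution layers. The attribute name conv_layers is an assumption about how the layers are stored.

```python
import torch

@torch.no_grad()
def share_conv_weights(bit_controller, main_network, num_shared: int = 3):
    for i in range(num_shared):
        src = main_network.conv_layers[i]      # convolution layer i of the main network
        dst = bit_controller.conv_layers[i]    # corresponding convolution layer of the bit controller
        dst.load_state_dict(src.state_dict())  # parameters (including weights) stay identical
```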
In some embodiments, parameters of each fully connected layer in the bit controller may be adjusted by a Gumbel-Softmax function.
The formula of Gumbel-Softmax is shown below:

$$y_j = \frac{\exp\left(\left(\log \pi_j + g_j\right)/\tau\right)}{\sum_{k=1}^{z} \exp\left(\left(\log \pi_k + g_k\right)/\tau\right)}$$

where z is the number of probabilities output by the last fully-connected layer, $\pi_j$ is the j-th output of the fully-connected layer, $\pi_k$ is the k-th output of the fully-connected layer, $g_j$ and $g_k$ are random noise following the Gumbel distribution, obtained as $g = -\log\left(-\log u\right)$ with $u$ a random number uniformly distributed on (0, 1), and $\tau$ is a temperature parameter; the smaller $\tau$ is, the closer the softmax output is to a one-hot vector.

In this embodiment, by controlling the temperature parameter $\tau$, the result of the Gumbel-Softmax function can approach a one-hot vector and thus achieve a result similar to that of the argmax function; unlike argmax, however, Gumbel-Softmax is differentiable and can therefore be used to adjust the parameters of the fully-connected layers in the bit controller.
In this step, the implementation of adjusting the parameters of each fully-connected layer in the bit controller by using the Gumbel-Softmax function may be realized using the related art of neural network training, and is not described here.
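A minimal sketch matching the Gumbel-Softmax formula above is shown here; PyTorch also provides torch.nn.functional.gumbel_softmax with equivalent behaviour, which could be used instead.

```python
import torch

def gumbel_softmax(log_probs: torch.Tensor, tau: float = 1.0) -> torch.Tensor:
    u = torch.rand_like(log_probs)                       # u ~ Uniform(0, 1)
    g = -torch.log(-torch.log(u + 1e-20) + 1e-20)        # Gumbel(0, 1) noise
    return torch.softmax((log_probs + g) / tau, dim=-1)  # approaches one-hot as tau -> 0
```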
Step 412: calculate the FLOPs and the quantization error of the main network in this forward propagation.
The FLOPs of the main network in this forward propagation can be the sum of the FLOPs of each convolution layer in this forward propagation, namely $F = \sum_{i=1}^{n} F_i$, where $F_i$ is the FLOPs of the i-th convolution layer.
The quantization error of the main network in the forward propagation may be an error between the output data of the floating point main network corresponding to the first training data and the output data of the fixed point main network.
For a main network having the same convolution layer structure, if quantization processing is performed on weights and activation values and the like in the main network, the main network may be referred to as a fixed-point main network based on fixed-point data after the quantization processing, and if forward propagation is performed on weights and activation values in the main network without quantization processing but using floating-point type data, the main network may be referred to as a floating-point main network.
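A sketch of this quantization error measurement is given below, assuming it is taken as the mean absolute difference between the outputs of the floating point main network and the fixed point (quantized) main network on the same first training data; the choice of mean absolute error is an assumption.

```python
import torch

def quantization_error(float_output: torch.Tensor, quant_output: torch.Tensor) -> torch.Tensor:
    # error between the floating point main network output and the fixed point main network output
    return torch.mean(torch.abs(float_output - quant_output))
```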
Step 413: judge, based on the loss function, the FLOPs of the main network in this forward propagation, and the quantization error, whether each of them meets its convergence condition. If the convergence conditions are met, stop training to obtain a trained model quantization network; if the convergence conditions are not met, repeat steps 403 to 412 with newly acquired first training data to continue training the model quantization network, until the convergence conditions are judged to be met in step 413, thereby obtaining the trained model quantization network.
In this step, an early-stopping strategy may be used, and the trained model quantization network is obtained after the loss function, the FLOPs of the main network in this forward propagation, and the quantization error all converge.
The embodiment of the application does not limit how the convergence conditions corresponding to the loss function, the FLOPs of the main network in this forward propagation, and the quantization error are set in this step; they can be implemented using the related art, as long as the judgment of whether the model quantization network converges is performed based on these three parameters.
After the trained model quantization network is obtained, the test of the model quantization network can be completed through the test flow shown in fig. 6.
As shown in fig. 6, the test method includes:
step 601: first test data is acquired.
The first test data is floating point type data.
Step 602: the first test data is input to a bit controller, which predicts the bit width probability distribution of each weight and the bit width probability distribution of each activation function in the main network based on the first test data.
Step 603: the bit controller determines the bit width of each weight according to the predicted bit width probability distribution of each weight in the main network, and determines the bit width of each activation function according to the predicted bit width probability distribution of each activation function.
Step 604: the bit controller transmits the bit width of each weight and each activation function to the primary network as the bit width of each weight and each activation function in the primary network.
Step 605: and inputting the first test data into a main network, and performing forward propagation by the main network according to the first test data based on each weight predicted by the bit controller and the bit width of each activation function to obtain output data of the main network.
In the test, the parameters in the main network and the bit controller may be adjusted based on the output data of the main network obtained by the test, and the specific implementation may be realized by referring to steps 409 to 411, which are not described herein.
The model quantization network obtained through the above training and testing, that is, the model quantization network used in fig. 2, includes a bit controller and a main network, and can implement quantization processing of floating point type data in a model.
The implementation of the above test method may refer to the implementation of the training method of the model quantization network in fig. 4, which is not described herein.
The embodiment of the application also provides electronic equipment, which comprises a processor and a memory, wherein the processor is used for realizing the method provided by the embodiment of the application.
The present embodiments also provide a computer-readable storage medium having a computer program stored therein, which when run on a computer, causes the computer to perform the method provided by the embodiments of the present application.
The present embodiments also provide a computer program product comprising a computer program which, when run on a computer, causes the computer to perform the method provided by the embodiments of the present application.
In the embodiments of the present application, "at least one" means one or more, and "a plurality" means two or more. "and/or", describes an association relation of association objects, and indicates that there may be three kinds of relations, for example, a and/or B, and may indicate that a alone exists, a and B together, and B alone exists. Wherein A, B may be singular or plural. The character "/" generally indicates that the context-dependent object is an "or" relationship. "at least one of the following" and the like means any combination of these items, including any combination of single or plural items. For example, at least one of a, b and c may represent: a, b, c, a and b, a and c, b and c or a and b and c, wherein a, b and c can be single or multiple.
Those of ordinary skill in the art will appreciate that the various elements and algorithm steps described in the embodiments disclosed herein can be implemented as a combination of electronic hardware, computer software, and electronic hardware. Whether such functionality is implemented as hardware or software depends upon the particular application and design constraints imposed on the solution. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present application.
It will be clear to those skilled in the art that, for convenience and brevity of description, specific working procedures of the above-described systems, apparatuses and units may refer to corresponding procedures in the foregoing method embodiments, and are not repeated herein.
In several embodiments provided herein, any of the functions, if implemented in the form of software functional units and sold or used as a stand-alone product, may be stored in a computer-readable storage medium. Based on such understanding, the technical solution of the present application may be embodied essentially, or in a part contributing to the prior art, or in part, in the form of a software product stored in a storage medium, including several instructions for causing a computer device (which may be a personal computer, a server, a network device, etc.) to perform all or part of the steps of the methods described in the embodiments of the present application. The aforementioned storage medium includes: a USB flash disk, a removable hard disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk, an optical disk, or various other media capable of storing program code.
The foregoing is merely specific embodiments of the present application, and any changes or substitutions that may be easily contemplated by those skilled in the art within the technical scope of the present application are intended to be included in the scope of the present application. The protection scope of the present application shall be subject to the protection scope of the claims.

Claims (10)

1. A method of model quantization, comprising:
the first network receives first data, wherein the first data is floating point type data in a target model;
the first network predicts the bit width of each weight and the bit width of each activation function in a second network according to the first data, and transmits the bit width of each weight and the bit width of each activation function to the second network; the each weight is the weight of each channel of each convolution layer in the second network, and the each activation function is the each activation function of each activation layer in the second network;
the second network receives the first data, and performs forward reasoning according to the first data based on the bit width of each weight and the bit width of each activation function to obtain output data of the second network;
Wherein the first network predicts the bit width of each weight and the bit width of each activation function in the second network according to the first data, comprising:
the first network predicts the bit width probability distribution of each weight and the bit width probability distribution of each activation function in the second network according to the first data; the bit width probability distribution of the weight comprises: the probability of using each bit width in a preset bit width set by weight when processing the first data; the bit width probability distribution of the activation function includes: activating a function to use the probability of each bit width in a preset bit width set when processing the first data;
and determining the bit width of each weight according to the bit width probability distribution of each weight, and determining the bit width of each activation function according to the bit width probability distribution of each activation function.
2. The method of claim 1, wherein said determining the bit width of each weight from the bit width probability distribution of each weight comprises:
for one weight, the bit width with the highest probability is obtained from the bit width probability distribution of the one weight as the bit width of the one weight.
3. The method of claim 1, wherein said determining the bit width of each activation function based on the bit width probability distribution of each activation function comprises:
for one activation function, the bit width with the highest probability is obtained from the bit width probability distribution of the one activation function as the bit width of the one activation function.
4. A method according to any one of claims 1 to 3, wherein the training method of the first network and the second network comprises:
receiving first training data, wherein the first training data is floating point type training data;
inputting the first training data into a first network, and predicting the bit width of each weight and the bit width of each activation function in a second network by the first network according to the first training data;
transmitting the bit width of each weight and the bit width of each activation function obtained by prediction of the first network to the second network;
and inputting the first training data into the second network, and performing, by the second network, forward propagation processing according to the first training data based on the received bit width of each weight and of each activation function, to obtain output data of the second network.
5. The method of claim 4, wherein the training method of the first network and the second network further comprises:
and calculating a loss function according to the output data of the second network, and adjusting parameters in the second network according to the loss function, wherein the parameters comprise weights.
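
A hedged sketch of one training step along the lines of claims 4 and 5, reusing the hypothetical BitWidthPredictor and select_bit_widths helpers from the sketch above. The symmetric fake-quantization scheme, the cross-entropy task loss, and the weight_groups() accessor on the second network are assumptions that the claims do not specify.

```python
import torch
import torch.nn.functional as F


def fake_quantize(w: torch.Tensor, n_bits: int) -> torch.Tensor:
    """Quantize-dequantize a tensor to n_bits (assumed symmetric scheme)."""
    qmax = 2 ** (n_bits - 1) - 1
    scale = w.abs().max().clamp(min=1e-8) / qmax
    return torch.round(w / scale).clamp(-qmax, qmax) * scale


def training_step(predictor, quant_net, optimizer, x, target):
    # Claim 4: the first network predicts bit widths from the training data
    # and transmits them to the second network.
    w_probs, _a_probs = predictor(x)
    w_bits = select_bit_widths(w_probs)

    # Forward propagation of the second network under the received bit widths.
    # Only weight quantization is shown; activations would be handled
    # analogously by the activation layers. weight_groups() is an assumed
    # accessor returning one per-channel weight tensor per predicted bit width.
    for channel_weight, n_bits in zip(quant_net.weight_groups(), w_bits):
        channel_weight.data.copy_(fake_quantize(channel_weight.data, n_bits))
    output = quant_net(x)

    # Claim 5: compute a loss from the second network's output and adjust
    # its parameters (including the weights) accordingly.
    loss = F.cross_entropy(output, target)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```
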
6. The method of claim 5, wherein the first network comprises: at least one convolutional layer and at least one fully-connected layer, the method further comprising:
and obtaining the output of the last fully-connected layer of the first network, and adjusting the weight of each fully-connected layer in the first network according to the output of the last fully-connected layer.
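
One possible reading of claim 6, sketched under assumptions: only the fully-connected layers of the first network are updated, and the update is driven by a hypothetical objective computed from the output of the last fully-connected layers (here, the expected bit width under the predicted distributions). The claim itself does not fix the objective; BIT_WIDTH_SET comes from the earlier sketch.

```python
import torch
import torch.nn as nn


def adjust_fc_layers(predictor: nn.Module, x: torch.Tensor, lr: float = 1e-4):
    # Collect only the fully-connected (Linear) parameters of the first network.
    fc_params = [p for m in predictor.modules()
                 if isinstance(m, nn.Linear) for p in m.parameters()]
    fc_optimizer = torch.optim.SGD(fc_params, lr=lr)

    # Output of the last fully-connected layers: bit width probabilities.
    w_probs, a_probs = predictor(x)

    # Hypothetical objective: the expected bit width under the predicted
    # distributions, which rewards choosing smaller bit widths.
    bits = torch.tensor(BIT_WIDTH_SET, dtype=torch.float32)
    expected_bits = ((w_probs * bits).sum(dim=-1).mean()
                     + (a_probs * bits).sum(dim=-1).mean())

    fc_optimizer.zero_grad()
    expected_bits.backward()
    fc_optimizer.step()
```
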
7. The method as recited in claim 6, further comprising:
and acquiring the adjusted parameters of the specified convolutional layer in the second network, and adjusting, according to the adjusted parameters of the specified convolutional layer, the parameters of the convolutional layer in the first network that corresponds to the specified convolutional layer.
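
A minimal sketch of claim 7, under the assumption that the specified convolutional layer in the second network and its counterpart in the first network share the same module name and parameter shapes; the layer-name lookup below is purely illustrative.

```python
import torch
import torch.nn as nn


@torch.no_grad()
def sync_specified_conv(first_net: nn.Module, second_net: nn.Module, layer_name: str):
    # Acquire the adjusted parameters of the specified convolutional layer
    # in the second network ...
    src = dict(second_net.named_modules())[layer_name]
    # ... and copy them into the corresponding convolutional layer of the
    # first network (assumed to have the same name and shape).
    dst = dict(first_net.named_modules())[layer_name]
    dst.weight.copy_(src.weight)
    if src.bias is not None and dst.bias is not None:
        dst.bias.copy_(src.bias)
```
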
8. The method of claim 4, wherein the first network predicting the bit width of each weight and the bit width of each activation function in the second network based on the first training data comprises:
The first network predicts the bit width probability distribution of each weight and the bit width probability distribution of each activation function in the second network according to the first training data;
the bit width of each weight is determined according to the bit width probability distribution of each weight, and the bit width of each activation function is determined according to the bit width probability distribution of each activation function.
9. An electronic device, comprising:
a processor, a memory; wherein one or more computer programs are stored in the memory, the one or more computer programs comprising instructions, which when executed by the processor, cause the electronic device to perform the method of any of claims 1-8.
10. A computer readable storage medium, characterized in that the computer readable storage medium has stored therein a computer program which, when run on a computer, causes the computer to perform the method of any of claims 1 to 8.
CN202311235065.XA 2023-09-25 2023-09-25 Model quantization method and device Active CN117077740B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202311235065.XA CN117077740B (en) 2023-09-25 2023-09-25 Model quantization method and device

Publications (2)

Publication Number Publication Date
CN117077740A CN117077740A (en) 2023-11-17
CN117077740B true CN117077740B (en) 2024-03-12

Family

ID=88713686

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202311235065.XA Active CN117077740B (en) 2023-09-25 2023-09-25 Model quantization method and device

Country Status (1)

Country Link
CN (1) CN117077740B (en)

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20190050710A1 (en) * 2017-08-14 2019-02-14 Midea Group Co., Ltd. Adaptive bit-width reduction for neural networks

Patent Citations (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110555508A (en) * 2018-05-31 2019-12-10 北京深鉴智能科技有限公司 Artificial neural network adjusting method and device
CN110969251A (en) * 2019-11-28 2020-04-07 中国科学院自动化研究所 Neural network model quantification method and device based on label-free data
CN113762499A (en) * 2020-06-04 2021-12-07 合肥君正科技有限公司 Method for quantizing weight by channels
CN115238883A (en) * 2021-04-23 2022-10-25 Oppo广东移动通信有限公司 Neural network model training method, device, equipment and storage medium
WO2023050707A1 (en) * 2021-09-28 2023-04-06 苏州浪潮智能科技有限公司 Network model quantization method and apparatus, and computer device and storage medium
CN114239792A (en) * 2021-11-01 2022-03-25 荣耀终端有限公司 Model quantization method, device and storage medium
WO2023165139A1 (en) * 2022-03-04 2023-09-07 上海商汤智能科技有限公司 Model quantization method and apparatus, device, storage medium and program product
CN116720563A (en) * 2022-09-19 2023-09-08 荣耀终端有限公司 Method and device for improving fixed-point neural network model precision and electronic equipment
CN116634162A (en) * 2023-04-13 2023-08-22 南京大学 Post-training quantization method for rate-distortion optimized image compression neural network
CN116187420A (en) * 2023-05-04 2023-05-30 上海齐感电子信息科技有限公司 Training method, system, equipment and medium for lightweight deep neural network

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Instance-Aware Dynamic Neural Network Quantization; Zhenhua Liu et al.; 2022 IEEE; pp. 1-10 *

Also Published As

Publication number Publication date
CN117077740A (en) 2023-11-17

Similar Documents

Publication Publication Date Title
CN113242568B (en) Task unloading and resource allocation method in uncertain network environment
CN113067873B (en) Edge cloud collaborative optimization method based on deep reinforcement learning
US20240054332A1 (en) Adaptive quantization for neural networks
US20180032865A1 (en) Prediction apparatus, prediction method, and prediction program
CN113778691B (en) Task migration decision method, device and system
CN113238989A (en) Apparatus, method and computer-readable storage medium for quantizing data
CN111160531A (en) Distributed training method and device of neural network model and electronic equipment
CN113238987B (en) Statistic quantizer, storage device, processing device and board card for quantized data
US11423313B1 (en) Configurable function approximation based on switching mapping table content
CN117077740B (en) Model quantization method and device
CN114830137A (en) Method and system for generating a predictive model
CN113238976B (en) Cache controller, integrated circuit device and board card
CN113238988B (en) Processing system, integrated circuit and board for optimizing parameters of deep neural network
US20030026497A1 (en) Scalable expandable system and method for optimizing a random system of algorithms for image quality
KR20220010419A (en) Electronice device and learning method for low complexity artificial intelligentce model learning based on selecting the dynamic prediction confidence thresholed
CN111614358B (en) Feature extraction method, system, equipment and storage medium based on multichannel quantization
CN113238975A (en) Memory, integrated circuit and board card for optimizing parameters of deep neural network
CN112561050B (en) Neural network model training method and device
CN115037608A (en) Quantization method, device, equipment and readable storage medium
CN116959489B (en) Quantization method and device for voice model, server and storage medium
US11797850B2 (en) Weight precision configuration method and apparatus, computer device and storage medium
US20220019891A1 (en) Electronic device and learning method for learning of low complexity artificial intelligence model based on selecting dynamic prediction confidence threshold
CN114860345B (en) Calculation unloading method based on cache assistance in smart home scene
US20210240439A1 (en) Arithmetic processing device, arithmetic processing method, and non-transitory computer-readable storage medium
CN116958616A (en) Classification model training method, image classification method and related equipment

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant