CN110728350B - Quantization for machine learning models - Google Patents

Quantization for machine learning models

Info

Publication number
CN110728350B
Authority
CN
China
Prior art keywords
current value
quantization parameter
parameter
processing
quantization
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201810715757.7A
Other languages
Chinese (zh)
Other versions
CN110728350A (en)
Inventor
张东擎
杨蛟龙
华刚
叶东强子
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Microsoft Technology Licensing LLC
Original Assignee
Microsoft Technology Licensing LLC
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Microsoft Technology Licensing LLC filed Critical Microsoft Technology Licensing LLC
Priority to CN201810715757.7A priority Critical patent/CN110728350B/en
Publication of CN110728350A publication Critical patent/CN110728350A/en
Application granted granted Critical
Publication of CN110728350B publication Critical patent/CN110728350B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Links

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Health & Medical Sciences (AREA)
  • Computing Systems (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Molecular Biology (AREA)
  • Artificial Intelligence (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Feedback Control In General (AREA)
  • Compression, Expansion, Code Conversion, And Decoders (AREA)

Abstract

According to implementations of the present disclosure, a scheme for quantization of a machine learning model is presented. In this scheme, the current value of a processing parameter used by a processing unit in the machine learning model is obtained. The current value of the processing parameter is quantized based on current values of a predetermined number of base quantization parameters and current values of binary quantization parameters specific to the processing parameter, where the binary quantization parameters correspond respectively to the base quantization parameters and the predetermined number is the same as the number of bits used for quantization, to obtain a quantized value of the processing parameter. Based on the difference between the quantized value of the processing parameter and the current value of the processing parameter, the current values of the base quantization parameters and the binary quantization parameters are updated for quantization specific to the processing parameter. In this way, not only can the reduction in storage and processing overhead brought by network quantization be obtained, but quantization accuracy can also be further improved.

Description

Quantization for machine learning models
Background
In recent years, machine learning technology has developed continuously and achieved great breakthroughs in fields such as computer vision, speech processing and artificial intelligence, remarkably improving the performance of machine algorithms in tasks such as image classification, object detection, speech recognition, machine translation and content filtering, and being widely applied in industries such as the Internet and video surveillance. A machine learning model processes input data based on a set of parameters. The parameter set of a machine learning model is typically very large and its internal processing complex, so model representation and computational overhead are a significant obstacle in many applications, especially for devices with limited storage and computational resources. To overcome this difficulty, some schemes have been proposed to reduce the complexity of the model, including simplifying the model structure, reducing the parameters of the model, and quantization. Quantization refers to transforming a numerical value involved in the processing of a machine learning model from a higher precision representation (e.g., floating point number format) to a lower precision quantized value (e.g., a value represented by a small number of bits), which can significantly reduce the complexity of model representation and processing.
Disclosure of Invention
According to implementations of the present disclosure, a scheme for quantization of a machine learning model is presented. In this scheme, the current value of a processing parameter used by a processing unit in the machine learning model is obtained. The current value of the processing parameter is quantized based on current values of a predetermined number of base quantization parameters and current values of binary quantization parameters specific to the processing parameter, where the binary quantization parameters correspond respectively to the base quantization parameters and the predetermined number is the same as the number of bits used for quantization, to obtain a quantized value of the processing parameter. Based on the difference between the quantized value of the processing parameter and the current value of the processing parameter, the current values of the base quantization parameters and the binary quantization parameters are updated for quantization specific to the processing parameter. In this way, not only can the reduction in storage and processing overhead brought by network quantization be obtained, but quantization accuracy can also be further improved.
The summary is provided to introduce a selection of concepts in a simplified form that are further described below in the detailed description. This summary is not intended to identify key features or essential features of the claimed subject matter, nor is it intended to be used to limit the scope of the claimed subject matter.
Drawings
FIG. 1 illustrates a block diagram of a computing device capable of implementing various implementations of the disclosure;
FIG. 2 shows histogram statistical distributions of weights and possible values of inputs at different layers in a trained machine learning model;
FIGS. 3A and 3B illustrate examples of quantization functions according to implementations of the present disclosure;
FIG. 4 illustrates a flow chart of a process for quantization of a machine learning model in accordance with some implementations of the present disclosure;
FIG. 5A illustrates a comparison of a histogram statistical distribution of floating point values of a processing parameter prior to quantization and quantized values of the processing parameter after quantization in accordance with implementations of the present disclosure; and
Fig. 5B illustrates a comparison of histogram statistical distributions of an input's floating point values before quantization and its quantized values after quantization according to implementations of the present disclosure.
In the drawings, the same or similar reference numerals are used to designate the same or similar elements.
Detailed Description
The present disclosure will now be discussed with reference to several example implementations. It should be understood that these implementations are discussed only to enable one of ordinary skill in the art to better understand and thus practice the present disclosure, and are not meant to imply any limitation on the scope of the present subject matter.
As used herein, the term "comprising" and its variants are to be interpreted as open-ended terms meaning "including but not limited to". The term "based on" is to be interpreted as "based at least in part on". The terms "one implementation" and "an implementation" are to be interpreted as "at least one implementation". The term "another implementation" is to be interpreted as "at least one other implementation". The terms "first," "second," and the like may refer to different or the same objects. Other explicit and implicit definitions may also be included below.
FIG. 1 illustrates a block diagram of a computing device 100 capable of implementing various implementations of the disclosure. It should be understood that the computing device 100 illustrated in fig. 1 is merely exemplary and should not be construed as limiting the functionality and scope of the implementations described in this disclosure. As shown in fig. 1, computing device 100 takes the form of a general purpose computing device. Components of computing device 100 may include, but are not limited to, one or more processors or processing units 110, memory 120, storage 130, one or more communication units 140, one or more input devices 150, and one or more output devices 160.
In some implementations, the computing device 100 may be implemented as various user terminals or service terminals having computing capabilities. The service terminals may be servers, large computing devices, etc. provided by various service providers. The user terminal is, for example, any type of mobile terminal, fixed terminal or portable terminal, including a mobile handset, a station, a unit, a device, a multimedia computer, a multimedia tablet, an internet node, a communicator, a desktop computer, a laptop computer, a notebook computer, a netbook computer, a tablet computer, a personal communication system (PCS) device, a personal navigation device, a personal digital assistant (PDA), an audio/video player, a digital camera/camcorder, a positioning device, a television receiver, a radio broadcast receiver, an electronic book device, a game device, or any combination thereof, including accessories and peripherals for these devices, or any combination thereof. It is also contemplated that the computing device 100 can support any type of user interface (such as "wearable" circuitry, etc.).
The processing unit 110 may be an actual or virtual processor and is capable of performing various processes according to programs stored in the memory 120. In a multiprocessor system, multiple processing units execute computer-executable instructions in parallel to increase the parallel processing capabilities of computing device 100. The processing unit 110 may also be referred to as a Central Processing Unit (CPU), microprocessor, controller, microcontroller.
Computing device 100 typically includes a number of computer storage media. Such media may be any available media that is accessible by computing device 100 and includes, but is not limited to, volatile and non-volatile media, removable and non-removable media. The memory 120 may be volatile memory (e.g., registers, cache, random access memory (RAM)), non-volatile memory (e.g., read-only memory (ROM), electrically erasable programmable read-only memory (EEPROM), flash memory), or some combination thereof. Memory 120 may include a quantization module 122 configured to perform the functions of the various implementations described herein. The quantization module 122 may be accessed and executed by the processing unit 110 to implement the corresponding functions.
Storage device 130 may be a removable or non-removable medium and may include a machine-readable medium that can be used to store information and/or data and that can be accessed within computing device 100. Computing device 100 may further include additional removable/non-removable, volatile/nonvolatile storage media. Although not shown in fig. 1, a magnetic disk drive for reading from or writing to a removable, nonvolatile magnetic disk and an optical disk drive for reading from or writing to a removable, nonvolatile optical disk may be provided. In these cases, each drive may be connected to a bus (not shown) by one or more data medium interfaces.
Communication unit 140 enables communication with additional computing devices via a communication medium. Additionally, the functionality of the components of computing device 100 may be implemented in a single computing cluster or in multiple computing machines capable of communicating over a communication connection. Accordingly, computing device 100 may operate in a networked environment using logical connections to one or more other servers, personal Computers (PCs), or another general network node.
The input device 150 may be one or more of a variety of input devices such as a mouse, keyboard, touch screen, camera, trackball, voice input device, and the like. The output device 160 may be one or more output devices such as a display, speakers, printer, etc. Computing device 100 may also communicate with one or more external devices (not shown), such as storage devices, display devices, etc., with one or more devices that enable a user to interact with computing device 100, or with any device (e.g., network card, modem, etc.) that enables computing device 100 to communicate with one or more other computing devices, as desired, via communication unit 140. Such communication may be performed via an input/output (I/O) interface (not shown).
In some implementations, some or all of the various components of computing device 100 may be provided in the form of a cloud computing architecture in addition to being integrated on a single device. In a cloud computing architecture, these components may be remotely located and may work together to implement the functionality described in this disclosure. In some implementations, cloud computing provides computing, software, data access, and storage services that do not require the end user to know the physical location or configuration of the system or hardware that provides these services. In various implementations, cloud computing provides services over a wide area network (such as the internet) using an appropriate protocol. For example, cloud computing providers offer applications over a wide area network, and they may be accessed through a web browser or any other computing component. Software or components of the cloud computing architecture and corresponding data may be stored on a server at a remote location. Computing resources in a cloud computing environment may be consolidated at remote data center locations or they may be dispersed. The cloud computing infrastructure may provide services through a shared data center even though they appear as a single access point to users. Accordingly, the components and functionality described herein may be provided from a service provider at a remote location using a cloud computing architecture. Alternatively, they may be provided from a conventional server, or they may be installed directly or otherwise on a client device.
Computing device 100 may be used to implement quantization for machine learning models in various implementations of the present disclosure. The computing device 100 is capable of receiving the constructed machine learning model 170 via the input device 150 in order to perform quantization with respect to the model. The machine learning model 170 is capable of learning certain knowledge and abilities from existing data for processing new data. The machine learning model 170 may be designed to perform various tasks such as image classification, object detection, speech recognition, machine translation, content filtering, and the like. Examples of machine learning model 170 include, but are not limited to, various types of deep neural networks (DNNs), support vector machines (SVMs), decision trees, random forest models, and the like. In implementations of the present disclosure, a machine learning model may also be referred to as a "learning network". Hereinafter, the terms "learning model", "learning network", "model" and "network" are used interchangeably.
The machine learning model 170 is shown in fig. 1 as a deep neural network. Deep neural networks have a hierarchical architecture, with each network layer having one or more processing nodes (called neurons or filters) for processing. In deep neural networks, the output of the previous layer after processing is the input of the next layer, where the first layer in the architecture receives the network input for processing and the output of the last layer is provided as the network output. As shown in fig. 1, the machine learning model 170 includes network layers 172, 174, 176, etc., wherein the network layer 172 receives network inputs and the network layer 176 provides network outputs.
In deep neural networks, the primary processing operations within the network are interleaved linear and nonlinear transformations. These processes are distributed among the various processing nodes. Fig. 1 also shows an enlarged view of one node 171 in the model 170. The node 171 receives a plurality of input values a1, a2, a3, etc., and processes the input values based on respective processing parameters (such as weights w1, w2, w3, etc.) to generate an output z. Node 171 may be designed to process input with an activation function, which may be expressed as:
z = σ(w^T a)  (1)
where a = [a1, a2, ..., an]^T represents the input vector of node 171 (including elements a1, a2, a3, etc.); w = [w1, w2, ..., wn]^T represents the weight vector in the processing parameters used by node 171 (including elements w1, w2, w3, etc.), each weight being used to weight a respective input; n represents the number of input values; and σ(·) represents the activation function used by node 171, which may be a linear or nonlinear function. Common activation functions in neural networks include the sigmoid, ReLU, tanh and maxout functions. The output of node 171 may also be referred to as an activation value. Depending on the network design, the output (i.e., the activation value) of each network layer may be provided as input to one, several or all of the nodes of the next layer.
In some implementations, the processing parameters of node 171 may also include a bias for each input, at which point equation (1) may be rewritten as:
z = σ(w^T a + b)  (2)
where b represents the bias vector (including elements b1, b2, b3, etc.) in the processing parameters used by node 171, each bias being used to offset the result of the corresponding input and weight.
Each network layer in the machine learning model 170 may include one or more nodes 171. When the processing in the machine learning model 170 is viewed in units of network layers, the processing of each network layer may also be expressed similarly in the form of equation (1) or equation (2), where a represents the input vector of the network layer and w represents the weights of the network layer.
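As an illustration (not part of the patent disclosure), the node computation of equations (1) and (2) can be sketched in a few lines of Python; the sigmoid activation and the concrete numbers are arbitrary examples.

```python
import numpy as np

def node_output(a, w, b=0.0):
    """Output of one processing node: sigma(w^T a + b), with sigmoid as an example activation."""
    z = w @ a + b                     # weighted inputs plus bias, as in equation (2)
    return 1.0 / (1.0 + np.exp(-z))   # sigma(.)

a = np.array([0.5, -1.2, 3.0])        # example inputs a1, a2, a3
w = np.array([0.8, 0.1, -0.4])        # example weights w1, w2, w3
print(node_output(a, w, b=0.2))
```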
The processing unit 110 of the computing device 100 may perform quantization with respect to the machine learning model 170 by running the quantization module 122. The quantization module 122 is configured to perform quantization operations on values involved in the processing of the machine learning model 170 to obtain quantized values. The quantized value may be provided to the output unit 160 for output to a desired destination device.
It should be appreciated that the architecture of the machine learning model shown in fig. 1, and the number of network layers and processing nodes therein, are illustrative. In different applications, the machine learning model may be designed with other architectures as desired.
Some possible reasons for performing quantization with respect to the machine learning model will be analyzed below. As can be seen from the processing of the machine learning model described above with respect to fig. 1, as the architecture of the machine learning model becomes more complex (e.g., includes more network layers, more processing nodes, and/or more connections between network layers (e.g., full connections)), the processing in the model becomes more complex because more processing parameters need to be used and more inputs need to be processed. For example, some layers of a machine learning model perform convolution operations and consist of a plurality of convolution filters; the dimensionality of the processing parameters required for such a layer is C×H×W per filter, where C, H and W represent the number of channels, the kernel height and the kernel width of the convolution kernels of the convolution layer, respectively. Many machine learning models also employ fully-connected layers, i.e., the outputs of all processing nodes of a previous layer are connected to each node of the subsequent layer. The number of processing parameters required in such a network is also very large. Typically, many deep neural networks have tens of millions of weights, bias values of the same order of magnitude, and a large number of inputs communicated internally. This presents difficulties for both training and using machine learning models.
In a training process of the machine learning model, training data is input into the machine learning model, which processes the input based on the current values of the processing parameters to generate a network output. The processing parameters are updated continuously according to the training objectives. The updated value of the processing parameter is treated as the current value and continues to be used to process the current input. Thus, during the training process, the processing operations in the machine learning model are continuously performed based on the current values of the processing parameters and the input current values until the convergence condition of the training is reached. The parameter set obtained after training is stored to represent the machine learning model. During use of the trained machine learning model, the processing parameters are invoked to process the input of the model, as required by the use.
The processing parameters of a machine learning model and the actual values of its inputs generally take a floating point form, so it is natural to store and process these values in a floating point format in a computer. However, representing the processing parameters and/or inputs of a machine learning model in floating point format incurs significant storage overhead. Furthermore, inner product operations need to be performed between the vector of processing parameters and the input vector, and such operations between floating point numbers cause very high computational overhead. It can be seen that both the storage of the processing parameters of the machine learning model and the overhead of the model's processing are large. This makes training of machine learning models, and even use of machine learning models, difficult to perform on resource-limited devices such as mobile phones, thus greatly limiting the practical application of machine learning models.
Some schemes have been proposed for reducing the complexity of the representation of machine learning models and improving processing efficiency. One type of scheme proposes machine learning model architectures that are compact from the standpoint of improving processing efficiency. For example, one such scheme proposes the use of micro-networks to enhance local modeling, replacing the fully-connected layer in the network, which consumes a lot of processing, with a global average pooling layer. Another type of scheme reduces the number of processing parameters in existing machine learning model architectures. For example, by studying the redundancy of filter weights in deep neural networks, some schemes propose to use low-rank approximations to replace pre-trained weights. Still other schemes focus on how to reduce the number of network layer connections in order to reduce the parameters required for network processing. Yet another possible approach is to regularize the network with structured sparsity to obtain a machine learning model that is easy to implement in hardware.
Still other schemes involve network quantization. Network quantization refers to transforming a numerical value involved in processing of a machine learning model from a higher precision representation (e.g., floating point number format) to a lower precision quantized value (e.g., a value represented by a small number of bits) using a quantizer (or quantization device, quantization module). In this way, a binary quantized value represented by a small number of bits can be stored and processing is performed using this quantized value, which can significantly reduce complexity of model representation and processing. Network quantization is a reduction in complexity achieved at the expense of network processing accuracy. In general, the quantization process in a typical quantizer can be expressed as a quantization function that is a piecewise constant function:
Q(x) = q_i, if x ∈ (t_i, t_{i+1}]  (3)
where x represents a value to be quantized and q_i (i = 1, ..., m) represents the i-th quantization level. Through the quantization function, the quantizer quantizes all values within one quantization interval to the corresponding quantization level, and each quantized value can be encoded with log2(m) bits as the index of its quantization level.
Quantizing the network weights can generate a highly compressed and memory-efficient machine learning model. For example, if each weight value is quantized using n bits, a compression rate of 32/n or 64/n can be achieved compared to the case of using a 32-bit or 64-bit floating point number format. Further, if both the weights and the inputs to be processed are quantized, the inner product of the weights and the inputs in equation (1) or equation (2) may be performed by bit-wise operations such as xnor (the exclusive-nor logical operation) and popcnt (the operation of counting the number of "1"s in a bit string). Such bit-wise operations are very efficient. On many general processing platforms, such as CPUs and GPUs, xnor and popcnt operations can process at least 64 bits in one or a few clock cycles, which yields a substantial processing speed-up (possibly up to a 64-fold speed-up). It follows that reducing the storage and computational complexity of machine learning models through quantization is very effective.
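To make the bit-wise claim above concrete, the following sketch (illustrative only, not from the patent) shows how the inner product of two {-1, +1} vectors packed into integers reduces to an xor (the complement of xnor) plus a population count.

```python
def pack_pm1(vec):
    """Pack a list of -1/+1 values into an integer bit mask (+1 -> bit 1, -1 -> bit 0)."""
    bits = 0
    for i, v in enumerate(vec):
        if v == 1:
            bits |= 1 << i
    return bits

def binary_dot(a_bits, b_bits, n):
    """Inner product of two packed {-1,+1} vectors of length n via xor + popcount."""
    diff = (a_bits ^ b_bits) & ((1 << n) - 1)  # positions where the signs differ
    return n - 2 * bin(diff).count("1")        # matches minus mismatches

w = [1, -1, -1, 1, 1, -1]
a = [1, 1, -1, -1, 1, 1]
assert binary_dot(pack_pm1(w), pack_pm1(a), len(w)) == sum(wi * ai for wi, ai in zip(w, a))
```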
Some network quantization schemes already exist. For example, some quantization schemes propose to quantize the weights of the model to only two possible values, such as-1 and 1 (this is also referred to as binarization or 1-bit quantization). In other quantization schemes, it is also proposed to quantize both the weights of the model and the inputs of the various layers. Each weight and input may also be quantized to more bits for higher accuracy.
However, existing schemes all employ uniform quantization, i.e., the width of each quantization interval (also referred to as the quantization step size) is the same. For example, in simple binary quantization, the quantization function is the sign function, where Q(x) = +1 if x ≥ 0 and Q(x) = -1 if x < 0. If 2 or more bits are selected for quantization, all quantization steps (q_{i+1} - q_i) are equal. On the basis of uniform quantization, existing schemes use the same quantization manner (i.e., the same quantization interval configuration) to quantize the processing parameters and inputs of the whole model.
The inventors have found through research that a single uniform quantization scheme is not suitable for all machine learning models and all network layers or processing nodes in the same machine learning model. If the quantization interval selected during quantization is not suitable, larger quantization errors are caused, and network accuracy is reduced. Quantization error refers to the difference between a quantized value (expressed in a finite number of bits) and a true value. It is generally desirable that an optimized quantizer produces a minimum quantization error for all input data distributions, which can be expressed as:
where p (x) represents the probability density function of the value x to be quantized, Q (x) represents the quantizer currently applied to x, and Q * (x) represents the optimized quantizer. Equation (4) shows that the quantizer is optimized such that a minimum quantization error is obtained for all possible distributions of the value x.
However, the values of the processing parameters and inputs may vary differently across models or across network layers, and the values of the processing parameters are unknown during the network training phase. These reasons all result in an inability to ensure uniform quantization is the optimal quantization choice for either the processing parameters or the input of the machine learning model.
Fig. 2 shows histogram statistical distributions of the possible values of weights and inputs at different layers in a trained machine learning model, where the values are represented in floating point number format. In each histogram, each vertical bar counts the number of values falling within a value range of equal step size. Histogram 210 shows the distribution of weight values at one processing node in one layer of the model; histogram 220 shows the distribution of weight values at one processing node in another layer. Histogram 230 shows the distribution of input values received at the same layer as histogram 210; histogram 240 shows the distribution of input values received at the same layer as histogram 220. As can be seen from fig. 2, the distributions of weights and inputs are complex and may vary between different layers and different processing nodes. Therefore, a single uniform quantization obviously cannot ensure that the quantization error is minimal. In the training phase, it is also impractical to select a preferred quantizer by simple analysis of the values, since the values of the processing parameters are not yet determined.
According to an implementation of the present disclosure, a scheme for quantization of a machine learning model is presented. The scheme does not adopt a fixed uniform quantization mode, but adopts a learnable quantization mode: how the processing parameters for a particular processing unit (e.g., a network layer or a processing node in a network layer) are quantized is specifically learned for that processing unit in a machine learning model, thereby obtaining a processing unit-specific quantization. In this way, not only can a reduction in storage and processing overhead due to network quantization be obtained, but quantization accuracy can be further improved.
To learn quantization for a particular processing unit in a machine learning model, a naive strategy would be to directly optimize the quantization levels {q_i} of the quantizer. However, such a strategy would make the quantization function incompatible with bit-wise operations, which is undesirable in the context of machine learning models.
By studying the quantization space, the inventors limited the quantization process to subspaces compatible with bit-wise operations. In this subspace, the quantization values may be determined by a combination of values of a set of base quantization parameters and a set of binary quantization parameter values, which facilitates determining a particular quantization manner for a processing unit of a particular machine learning model using a learning manner. Before describing how to learn quantization for a particular processing unit, a quantization subspace that is the basis for a learnable quantization is first introduced. The inventors have found that uniform quantization may be considered compatible with bit-wise operations. The reason is that uniform quantization is actually mapping floating point values to nearest fixed point integer values with one normalization factor, so that the key property that uniform quantization can be compatible with bit-wise operations is that quantized values can be decomposed into linear combinations of bit values, and thus uniform quantization is also referred to as linear coding. In particular, the integer q represented by the binary encoding of K bits is actually the inner product between the following two vectors, namely
q = [2^0, 2^1, ..., 2^{K-1}] [b_1, b_2, ..., b_K]^T  (5)
where b_i ∈ {0,1} (i = 1, ..., K), i.e., b = [b_1, ..., b_K]^T ∈ {0,1}^K.
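A one-line illustration of equation (5), under the assumption that b_1 is the least significant bit:

```python
bits = [1, 0, 1]                                   # b_1, b_2, b_3 (K = 3)
q = sum(b * (2 ** i) for i, b in enumerate(bits))  # inner product [1, 2, 4] . [1, 0, 1] = 5
```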
In view of the compatibility with bitwise operations provided by uniform quantization, in order to learn quantization, in implementations of the present disclosure quantization can be implemented on the basis of values of a set of basic quantization parameters and values of a set of binary quantization parameters. Such quantization may be expressed as a quantization function as follows:
Q_ours(x, v) = v^T e_i, if x ∈ (t_i, t_{i+1}]  (6)
where v represents the base quantization parameter vector, whose elements comprise K base quantization parameters v_1, ..., v_K, i.e., v = [v_1, ..., v_K]^T ∈ R^K (K is the number of bits used for quantization, also referred to as the bit width); e_i denotes a binary quantization parameter vector. The elements of a binary quantization parameter vector are binary quantization parameters corresponding respectively to the base quantization parameters, and each binary quantization parameter may take one of a predetermined pair of values. For example, each binary quantization parameter may be chosen between 0 and 1, or between -1 and 1 (for ease of description, the binary case of -1 and 1 is discussed below: e_i ∈ {-1,1}^K, i = 1, ..., 2^K), i.e., each binary quantization parameter vector ranges from [-1, ..., -1] to [1, ..., 1]. For quantization with K bits, 2^K quantization levels are required, each quantization level being q_i = v^T e_i, where i = 1, ..., 2^K. Given the quantization levels {q_i} and assuming q_1 ≤ q_2 ≤ ... ≤ q_{2^K}, it can be deduced that, for any value x to be quantized, the interval boundaries minimizing the quantization error (e.g., the quantization error in equation (4)) are t_i = (q_{i-1} + q_i)/2 (i = 2, ..., 2^K), with t_1 = -∞ and t_{2^K+1} = +∞.
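The learned quantizer of equation (6) can be sketched as follows. This is a minimal, illustrative implementation assuming e_i ∈ {-1, 1}^K; the function and variable names are not taken from the patent.

```python
import itertools
import numpy as np

def learned_quantize(x, v):
    """Quantize scalar x with base quantization parameters v, as in equation (6)."""
    K = len(v)
    codes = np.array(list(itertools.product([-1.0, 1.0], repeat=K)))  # all candidate e_i
    levels = codes @ v                                                # q_i = v^T e_i
    order = np.argsort(levels)
    levels, codes = levels[order], codes[order]
    thresholds = (levels[:-1] + levels[1:]) / 2.0                     # t_i = (q_{i-1} + q_i) / 2
    idx = int(np.searchsorted(thresholds, x))                         # interval containing x
    return levels[idx], codes[idx]

v = np.array([0.3, 0.7])              # example 2-bit base quantization parameters
q, e = learned_quantize(0.5, v)       # -> level 0.4 with binary code [-1, +1]
```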
Fig. 3A and 3B show some examples of quantization functions according to equation (6). Fig. 3A shows the case of 2-bit quantization, and fig. 3B shows the case of 3-bit quantization. In the example of fig. 3A, the value of the base quantization parameter vector is represented as v = [v_1, v_2]^T. Diagram 310 shows how the quantization levels are generated based on the values of the base quantization parameters and the corresponding binary quantization parameters, while piecewise curve 312 shows the corresponding quantization function. In the example of fig. 3B, the value of the base quantization parameter vector is represented as v = [v_1, v_2, v_3]^T. Diagram 320 shows how the quantization levels are generated based on the values of the base quantization parameters and the corresponding binary quantization parameters, while piecewise curve 322 shows the corresponding quantization function.
As is apparent from the above discussion, with quantization represented by a set of base quantization parameters and a set of binary quantization parameters, quantization can be performed by determining the values of the base quantization parameters and the binary quantization parameters while remaining compatible with bit-wise operations. On this basis, implementations of the present disclosure learn such quantization for a particular machine learning model. Various implementations of the present disclosure are further described below by way of specific examples.
Referring now to fig. 4, there is illustrated a flow chart of a process 400 for quantization of a machine learning model according to some implementations of the present disclosure. Process 400 may be implemented by computing device 100. For ease of discussion, process 400 will be described with reference to FIG. 1. The computing device 100 trains the quantization module 122 according to the process 400 so that the quantization module 122 can perform quantization for the machine learning model 170 with the trained quantization parameters (including the base quantization parameters and the binary quantization parameters).
At block 410, the computing device 100 obtains current values of the processing parameters used by the processing units in the machine learning model 170. According to an implementation of the present disclosure, for a particular processing unit, it is desirable to learn an optimized quantization manner for quantizing the processing parameters of that processing unit. In some implementations, the processing unit may be a processing node 171 of the machine learning model 170, also referred to as a filter or neuron. In such an implementation, processing unit specific quantization parameters may be trained for quantization of the processing unit's processing parameters. In other implementations, the processing unit may be a network layer of the machine learning model 170, such as network layers 172, 174, 176, and so on. In such an implementation, network layer specific quantization parameters may be trained for quantization of the processing parameters of the processing unit.
In some implementations, appropriate quantization parameters may be determined for the trained machine learning model 170. In the trained machine learning model 170, the values of the processing parameters used by each processing unit, such as a processing node or network layer, have already been determined. In this case, the current values of the processing parameters obtained by the computing device 100 are the trained processing parameters of the processing unit. In some implementations, the machine learning model 170 may be trained jointly with the base quantization parameters and the binary quantization parameters used for quantization. This is because the values of the processing parameters are not yet determined during the training process and are updated continuously with each round of training. In this case, the current values of the processing parameters obtained by the computing device 100 are the values of the processing parameters determined in the current iteration of training. The joint training of the machine learning model 170 and the two types of quantization parameters will be described in detail below. In some implementations, the current values of the processing parameters may use a floating point format, such as 32-bit or 64-bit floating point numbers.
The processing unit may have a plurality of processing parameters forming a set of processing parameters. The processing parameters may include weights for weighting the inputs of the processing units. The number of weights is related to or the same as the number of inputs to the node. The processing parameters may also include bias values for biasing the weighted results. For processing units in the form of processing nodes or network layers, the number of processing parameters used therein may be large. As will be discussed later, it is desirable to be able to determine appropriate quantization parameters to be suitable for proper quantization of possible values of these processing parameters. In this way, the determined quantization parameter is specific to the processing parameter of the processing unit.
At block 420, the computing device 100 quantizes the current value of the process parameter based on the current value of the predetermined number of base quantization parameters and the current value of the process parameter specific binary quantization parameter to obtain a quantized value of the process parameter. The binary quantization parameter and the base quantization parameter are discussed above. The binary quantization parameter corresponds to the base quantization parameter, respectively, and the number of base quantization parameters is the same as the number of bits used for quantization. The number of bits used for quantization may be specified by the user or determined according to accuracy requirements, for example. The quantization for each processing parameter can be represented by equation (6). The quantized value of each processing parameter can be determined by equation (9).
The process 400 is used to train the values of the base quantization parameter and the binary quantization parameter, and thus typically requires multiple iterations to determine their final values. The current value of the base quantization parameter and the current value of the binary quantization parameter represent the values of the base quantization parameter and the binary quantization parameter determined in the current iteration. In the initial stage, the base quantization parameter and the binary quantization parameter may have randomized values.
At block 430, the computing device 100 updates the current values of the base quantization parameter and the binary quantization parameter for quantization of the processing parameters specific to the processing unit based on the difference between the quantized value of the processing parameter and the current value of the processing parameter. The difference between the quantized value of the processing parameter and the current value of the processing parameter is used as the update target of the quantization parameters. It is expected that the values of the base quantization parameter and the binary quantization parameter are updated so as to continuously reduce this difference.
An objective function may be constructed based on the difference between the quantized value of the processing parameter and the current value of the processing parameter in order to guide the updating of the base quantization parameter and the binary quantization parameter. The objective function may be expressed as a loss function related to that difference. Assume that x ∈ R^N is the vector of current values of the processing parameters to be quantized, where N represents the number of processing parameters. Assume further that the base quantization parameter is represented as v ∈ R^K, where K is the predetermined number of base quantization parameters, which is the same as the number of bits used for quantization; and that the binary quantization parameter is represented as B ∈ {-1,1}^{K×N}, whose i-th row B_i ∈ {-1,1}^N (i = 1, ..., K) contains the binary values of the i-th bit used to quantize all values in x, each corresponding to the i-th base quantization parameter. The objective function can be expressed as:
v*, B* = argmin_{v,B} ||B^T v - x||_2^2, subject to B ∈ {-1,1}^{K×N}  (7)
where v* and B* represent the updated values of v and B, respectively. As can be seen from equation (7), the updating of the base quantization parameter and the binary quantization parameter aims at finding values of v and B such that the quantization error of the processing parameters is small or minimized.
In some implementations, since equation (7) is complex, it is difficult to solve it directly for the optimal solution, and this becomes even more difficult as the size of B increases. To increase computational efficiency, in some implementations the base quantization parameter and the binary quantization parameter are updated alternately in a block coordinate descent manner. In particular, the current value of the binary quantization parameter may be updated with the current value of the base quantization parameter held fixed, and the current value of the base quantization parameter may be updated with the current value of the binary quantization parameter held fixed. The two updates may alternate, with only one of the two types of quantization parameters being updated in each iteration.
When the current value of the base quantization parameter is fixed (i.e., given v), the current value of the binary quantization parameter may be updated by looking up the quantization interval. In particular, a plurality of quantization intervals may be determined based on the current value of the base quantization parameter and a plurality of candidate values of the binary quantization parameter. The candidate values of the binary quantization parameter vector are the combinations of -1 and +1 of the predetermined length, i.e., each binary quantization parameter vector may take a value ranging from [-1, ..., -1] to [1, ..., 1]. As discussed above, the quantization intervals may be defined by the boundaries t_i = (q_{i-1} + q_i)/2 (i = 2, ..., 2^K), where q_i is determined from the current value of the base quantization parameter and the respective candidate value of the binary quantization parameter. Each quantization interval corresponds to a range of values; for example, a quantization interval may correspond to the value range (t_i, t_{i+1}].
Then, one quantization interval is selected from the plurality of quantization intervals such that the current value of the processing parameter falls within the value range corresponding to the selected quantization interval. According to the quantization, each quantization interval corresponds to a candidate quantization value to which values falling within that interval are quantized. The quantization interval is selected by looking up which interval the current value of the processing parameter falls into. Thereafter, the current value of the binary quantization parameter is updated to one of the candidate values that define the selected quantization interval. For each quantization interval except the first and the last, both ends are defined by quantization values. In some implementations, the candidate value generating the smaller of the two boundaries of the quantization interval may be consistently selected. For example, if the current value of the processing parameter falls within the quantization interval defined by t_i and t_{i+1}, the candidate value of the binary quantization parameter used to generate t_i may be selected as the updated value of the binary quantization parameter.
When the current value of the binary quantization parameter is fixed (i.e., given B), equation (7) reduces to a linear regression problem with a closed-form solution, which can be expressed as:
v* = (B B^T)^{-1} B x  (8)
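The alternating update described above can be sketched as follows (an illustrative reading of equations (7) and (8), not the patent's exact procedure): with v fixed, each value's binary code is found by interval lookup; with B fixed, v has the closed-form least-squares solution of equation (8).

```python
import itertools
import numpy as np

def update_B(x, v):
    """With v fixed, pick for each x_j the binary code of the interval it falls into."""
    K = len(v)
    codes = np.array(list(itertools.product([-1.0, 1.0], repeat=K)))
    levels = codes @ v
    order = np.argsort(levels)
    levels, codes = levels[order], codes[order]
    thresholds = (levels[:-1] + levels[1:]) / 2.0
    idx = np.searchsorted(thresholds, x)
    return codes[idx].T                            # B has shape (K, N)

def update_v(x, B):
    """With B fixed, solve equation (8): v* = (B B^T)^{-1} B x."""
    return np.linalg.solve(B @ B.T, B @ x)

x = np.random.randn(1000)                          # current values of the processing parameters
v = np.sort(np.abs(np.random.randn(2)))            # random initial base quantization parameters
for _ in range(10):                                # alternate until the error stops decreasing
    B = update_B(x, v)
    v = update_v(x, B)
error = np.sum((B.T @ v - x) ** 2)                 # quantization error of equation (7)
```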
In some implementations, computing device 100 iterates through the operations in blocks 420 and 430, and the base quantization parameter or the binary quantization parameter may be updated once per iteration. In some implementations, to make the updates more stable, the base quantization parameter and the binary quantization parameter may be updated using a moving average strategy. Each update requires that the quantization error of the processing parameters in equation (7) be reduced or at least unchanged.
As mentioned above, the computing device 100 may jointly train the base quantization parameter and the binary quantization parameter during the training of the machine learning model 170, i.e., jointly update the current values of the base quantization parameter and the binary quantization parameter together with the current values of the processing parameters. Through such joint training, not only can good quantization parameters be obtained to reduce quantization error, but the quantization is also adapted to better approach the training objective, thereby improving model accuracy.
Generally, training of the machine learning model 170 consists of two passes, forward propagation and backward propagation. In the forward propagation phase, the machine learning model 170 receives the model inputs used for training and processes them based on the current values of the processing parameters to generate the final model output. In this forward propagation, the current values of the processing parameters may be obtained in order to perform the updating of the base quantization parameter and the binary quantization parameter. The updating of the processing parameters is performed in the backward propagation phase of the training process, where the current values of the processing parameters are updated based on the objective function of the machine learning model 170. In some implementations, the computing device 100 may update the current value of the base quantization parameter and the current value of the binary quantization parameter in the forward propagation phase of the training process.
In the backward propagation phase, one approach is to update the processing parameters based on backward-propagated gradients. However, because quantization is introduced, the gradient of the quantization function is almost everywhere zero, which makes gradient back-propagation difficult. In some implementations, the gradient used for updating the processing parameters may therefore be specially defined. Specifically, for values of the processing parameter lying between the thresholds t_2 and t_{2^K} defined in equation (6), the gradient may be set to 1, and for values in other ranges the gradient may be set to 0.
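A hedged sketch of this backward-pass approximation (a clipped straight-through-estimator reading of the passage above; names are illustrative):

```python
import numpy as np

def quantizer_grad_mask(x, thresholds):
    """Gradient of the quantizer w.r.t. x: 1 between the outermost finite thresholds
    t_2 and t_{2^K} (given as a sorted array), 0 elsewhere."""
    return ((x > thresholds[0]) & (x <= thresholds[-1])).astype(np.float32)
```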
How the base quantization parameter and the binary quantization parameter are determined for quantization of the processing parameters specific to the processing unit is discussed above. In some implementations, particular quantization parameter values may be determined for multiple processing units in the machine learning model 170 for subsequent quantization. Because the values of the processing parameters of different processing units may vary greatly, the accuracy of quantization can be improved by determining specific quantization parameters for different processing units, thereby improving the accuracy of processing in a machine learning model.
In some implementations, in addition to determining specific quantization parameters for the processing parameters of a processing unit, computing device 100 may also determine quantization parameters for the input of the processing unit for quantizing the input values. The input of the processing unit is the output of the previous layer, also called the activation value. Input-specific quantization parameters are necessary because the distribution of the input values of the processing unit may also differ from that of the values of the processing parameters. By quantizing both the processing parameters and the input values of the processing unit, in addition to reducing the storage requirements for the processing parameters and the input values, the processing operations in the processing unit become efficient bit-wise operations, which greatly increases processing efficiency and reduces processing resource overhead.
It is described here in detail how quantization according to implementations of the present disclosure enables bit-wise operations over the processing parameters and the inputs. Assume that quantization parameter values specific to both the processing parameters and the inputs have been determined. For the processing parameters, assume that the values of the processing parameters of one processing unit are w ∈ R^N (N represents the number of processing parameters), the base quantization parameter values for quantizing the processing parameters of the processing unit are v^w ∈ R^{K_w}, and the binary quantization parameters are B^w ∈ {-1,1}^{K_w×N}, where K_w is the predetermined number of bits used to quantize the processing parameters and the i-th row B^w_i contains the binary values of the i-th bit used to quantize all values in w, each corresponding to a base quantization parameter. Similarly, for the input, assume that the input values are a ∈ R^N, the base quantization parameter values used to quantize the input of the processing unit are v^a ∈ R^{K_a}, and the binary quantization parameters are B^a ∈ {-1,1}^{K_a×N}, where K_a is the predetermined number of bits used to quantize the input and B^a_j contains the binary values of the j-th bit used to quantize all values in a. On this basis, the multiplication (inner product) operation between the processing parameters and the input values can be expressed as:
Q(w, v^w)^T Q(a, v^a) = Σ_{i=1}^{K_w} Σ_{j=1}^{K_a} v^w_i v^a_j (B^w_i ⊙ B^a_j)  (9)
where ⊙ indicates an inner product performed by bit-wise operations such as xnor and popcnt.
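A minimal sketch of equation (9), assuming both the weights and inputs have already been quantized with their own base quantization parameters (v_w, v_a) and binary codes (B_w, B_a); in a real deployment each row product B_w[i] @ B_a[j] would be carried out with xnor and popcnt as in the earlier sketch. Names are illustrative.

```python
import numpy as np

def quantized_inner_product(v_w, B_w, v_a, B_a):
    """v_w: (K_w,), B_w: (K_w, N) in {-1,+1}; v_a: (K_a,), B_a: (K_a, N) in {-1,+1}."""
    total = 0.0
    for i in range(len(v_w)):
        for j in range(len(v_a)):
            total += v_w[i] * v_a[j] * float(B_w[i] @ B_a[j])  # one binary inner product term
    return total
```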
The determination of the quantization parameter values for the input of the processing unit is similar to the determination of the quantization parameter values for the processing parameters. In particular, the computing device 100 may obtain the current value of the input of the processing unit. The processing unit processes the current value of the input using the current values of the processing parameters. The computing device 100 may also quantize the current value of the input based on another predetermined number of base quantization parameters and the current values of the corresponding binary quantization parameters to obtain the quantized value of the input. The base quantization parameter and the binary quantization parameter here are parameters specific to the input of the processing unit, sometimes also referred to as the second base quantization parameter and the second binary quantization parameter. The other predetermined number is the same as the number of bits used to quantize the input. The number of bits used to quantize the input may be the same as or different from the number of bits used to quantize the processing parameters, and may be set according to user configuration or actual needs.
Similarly, the computing device 100 may also update the current values of the base quantization parameter and the binary quantization parameter for quantization of the input specific to the processing unit based on the difference between the quantized value of the input and the current value of the input. This update is similar to the updating of the base quantization parameter and the binary quantization parameter specific to the processing parameters, for example by minimizing equation (7). The updating of the input-specific base quantization parameter and binary quantization parameter may also be performed during the training of the machine learning model 170, in the forward propagation phase of the training process, and may likewise be performed alternately. For brevity, the updating of the base quantization parameter and the binary quantization parameter specific to the input is not described again in detail.
The current values of the input that can be obtained during the training phase of the machine learning model 170 are limited, because the current values of the input depend primarily on the training data used for model training, whereas the full range of inputs the model will receive during use is unknown at this stage and may vary widely. In view of this problem, in some implementations a moving average strategy may be applied when updating the base quantization parameter.
The above discusses how the values of the base quantization parameter and the binary quantization parameter specific to a processing unit (e.g., a processing node or a network layer) are determined through training and learning. In some implementations, for the processing parameters, the values of the base quantization parameter and the binary quantization parameter may be determined per processing node, since the distributions of the values of the processing parameters of different processing nodes may differ widely. In some implementations, for the input (i.e., the activation values), the values of the base quantization parameter and the binary quantization parameter may be determined per network layer, since the input is transmitted across network layers and different processing nodes in a network layer are connected to the outputs of the previous layer, so the values of the inputs received by different processing nodes in the same layer do not differ significantly or are even the same.
After determining the values of the base quantization parameter and the binary quantization parameter, quantization module 122 in computing device 100 may quantize the corresponding processing parameter values or input values using the quantization parameter values specific to the processing parameters or the input of the processing unit. In the use phase of the machine learning model 170, the values of the base quantization parameter and the binary quantization parameter specific to the processing parameters are used to quantize the trained values of the processing parameters, and the quantized values of the processing parameters are stored. In some implementations, in determining a quantized value, the quantized value of the processing parameter may be determined based on the inner product of the current value of the base quantization parameter and the current value of the binary quantization parameter, as discussed above with respect to equation (6).
According to implementations of the present disclosure, by training the quantization parameters specific to a processing unit that are used to perform quantization, and in particular by training them jointly with the training of the machine learning model, better quantization parameters can be obtained, enabling more accurate quantization. Fig. 5A compares the histogram distribution of the floating-point values of a processing parameter (here, weights) before quantization with that of the quantized values after quantization according to implementations of the present disclosure. Fig. 5B compares the histogram distribution of the floating-point values of an input before quantization with that of the quantized values after quantization according to implementations of the present disclosure. Note that the histograms of floating-point values are conventional histograms in which each bar of a fixed bin width counts the floating-point values falling in the corresponding value range, whereas each bar in the histograms of quantized values counts all the values quantized to the corresponding bin.
Histogram 510 shows the distribution of floating point values of weights at one processing node of one layer in the machine learning model, and histogram 512 shows the distribution of quantized values after quantization of floating point values of weights at the same processing node in accordance with implementations of the present disclosure. Histogram 520 shows a distribution of floating point values of weights at one processing node of another layer in the machine learning model, and histogram 522 shows a distribution of quantized values after quantization of floating point values of weights at the same processing node in accordance with implementations of the present disclosure.
Histogram 530 shows the distribution of floating point values of the input at one layer in the machine learning model, and histogram 532 shows the distribution of quantized values after quantization of the floating point values of the input at the same layer in accordance with implementations of the present disclosure. Histogram 540 shows the distribution of floating point values of the input of another layer in the machine learning model, and histogram 542 shows the distribution of quantized values after quantization of the floating point values of the input of the same layer according to implementations of the present disclosure.
As can be seen from Fig. 5A and Fig. 5B, quantization according to implementations of the present disclosure need not be uniform quantization; it can vary from layer to layer or from processing unit to processing unit and thus follow the true distribution of the processing parameters or inputs of each processing unit more closely. Compared with uniform quantization, this can effectively reduce quantization error.
Some example implementations of the present disclosure are listed below.
In one aspect, the present disclosure provides a computer-implemented method comprising: obtaining current values of processing parameters used by processing units in the machine learning model; quantizing the current value of the processing parameter based on current values of a predetermined number of base quantization parameters and current values of binary quantization parameters, the binary quantization parameters respectively corresponding to the base quantization parameters and the predetermined number being the same as the number of bits used for quantization of the processing parameter, to obtain a quantized value of the processing parameter; and updating the current value of the base quantization parameter and the current value of the binary quantization parameter based on the difference between the quantized value of the processing parameter and the current value of the processing parameter, for quantization of the processing parameter specific to the processing unit.
In some implementations, updating the current value of the base quantization parameter and the current value of the binary quantization parameter includes: in the training process of the machine learning model, the current values of the basic quantization parameters and the current values of the binary quantization parameters and the current values of the processing parameters are updated jointly.
In some implementations, updating the current value of the base quantization parameter and the current value of the binary quantization parameter in conjunction with updating the current value of the processing parameter includes: in the forward propagation phase of the training process, updating the current value of the basic quantization parameter and the current value of the binary quantization parameter; and updating the current values of the processing parameters during a backward propagation phase of the training process.
In some implementations, updating the current value of the base quantization parameter and the current value of the binary quantization parameter includes: updating the current value of the binary quantization parameter while the current value of the base quantization parameter is fixed; and updating the current value of the base quantization parameter while the current value of the binary quantization parameter is fixed.
In some implementations, updating the current value of the binary quantization parameter includes: determining a plurality of quantization intervals based on the current value of the base quantization parameter and a plurality of candidate values of the binary quantization parameter, each quantization interval corresponding to a range of values and being defined by an inner product of the current value of the base quantization parameter and at least one of the plurality of candidate values of the binary quantization parameter; selecting a quantization interval in the quantization intervals so that the current value of the processing parameter falls within a value range corresponding to the selected quantization interval; and updating the current value of the binary quantization parameter to one of the at least one candidate value for defining the selected quantization interval.
In some implementations, the base quantization parameter is a first base quantization parameter and the binary quantization parameter is a first binary quantization parameter, and the method further comprises: obtaining a current value of an input of the processing unit, the processing unit processing the current value of the input using the current value of the processing parameter; quantizing the current value of the input based on current values of another predetermined number of second base quantization parameters and current values of second binary quantization parameters, the second binary quantization parameters respectively corresponding to the second base quantization parameters and the other predetermined number being the same as the number of bits used for quantization of the input, to obtain a quantized value of the input; and updating the current values of the second base quantization parameter and the second binary quantization parameter based on the difference between the quantized value of the input and the current value of the input, for quantization of the input specific to the processing unit.
In some implementations, quantifying the current value of the processing parameter includes: determining an inner product result of the current value of the basic quantization parameter and the current value of the binary quantization parameter; and determining a quantized value of the processing parameter based on the inner product result.
In some implementations, the processing unit includes a network layer of the machine learning model or a processing node in the network layer of the machine learning model.
In some implementations, the current value of the processing parameter uses a floating point number format.
In another aspect, the present disclosure provides an electronic device. The device includes: a processing unit; and a memory coupled to the processing unit and containing instructions stored thereon that, when executed by the processing unit, cause the device to perform actions comprising: obtaining current values of processing parameters used by processing units in the machine learning model; quantizing the current value of the processing parameter based on current values of a predetermined number of base quantization parameters and current values of binary quantization parameters, the binary quantization parameters respectively corresponding to the base quantization parameters and the predetermined number being the same as the number of bits used for quantization of the processing parameter, to obtain a quantized value of the processing parameter; and updating the current value of the base quantization parameter and the current value of the binary quantization parameter based on the difference between the quantized value of the processing parameter and the current value of the processing parameter, for quantization of the processing parameter specific to the processing unit.
In some implementations, updating the current value of the base quantization parameter and the current value of the binary quantization parameter includes: in the training process of the machine learning model, the current values of the basic quantization parameters and the current values of the binary quantization parameters and the current values of the processing parameters are updated jointly.
In some implementations, updating the current value of the base quantization parameter and the current value of the binary quantization parameter in conjunction with updating the current value of the processing parameter includes: in the forward propagation phase of the training process, updating the current value of the basic quantization parameter and the current value of the binary quantization parameter; and updating the current values of the processing parameters during a backward propagation phase of the training process.
In some implementations, updating the current value of the base quantization parameter and the current value of the binary quantization parameter includes: updating the current value of the binary quantization parameter while the current value of the base quantization parameter is fixed; and updating the current value of the base quantization parameter while the current value of the binary quantization parameter is fixed.
In some implementations, updating the current value of the binary quantization parameter includes: determining a plurality of quantization intervals based on the current value of the base quantization parameter and a plurality of candidate values of the binary quantization parameter, each quantization interval corresponding to a range of values and being defined by an inner product of the current value of the base quantization parameter and at least one of the plurality of candidate values of the binary quantization parameter; selecting a quantization interval in the quantization intervals so that the current value of the processing parameter falls within a value range corresponding to the selected quantization interval; and updating the current value of the binary quantization parameter to one of the at least one candidate value for defining the selected quantization interval.
In some implementations, the base quantization parameter is a first base quantization parameter and the binary quantization parameter is a first binary quantization parameter, and the actions further comprise: obtaining a current value of an input of the processing unit, the processing unit processing the current value of the input using the current value of the processing parameter; quantizing the current value of the input based on current values of another predetermined number of second base quantization parameters and current values of second binary quantization parameters, the second binary quantization parameters respectively corresponding to the second base quantization parameters and the other predetermined number being the same as the number of bits used for quantization of the input, to obtain a quantized value of the input; and updating the current values of the second base quantization parameter and the second binary quantization parameter based on the difference between the quantized value of the input and the current value of the input, for quantization of the input specific to the processing unit.
In some implementations, quantifying the current value of the processing parameter includes: determining an inner product result of the current value of the basic quantization parameter and the current value of the binary quantization parameter; and determining a quantized value of the processing parameter based on the inner product result.
In some implementations, the processing unit includes a network layer of the machine learning model or a processing node in the network layer of the machine learning model.
In some implementations, the current value of the processing parameter uses a floating point number format.
In yet another aspect, the present disclosure provides a computer program product. The computer program product is stored in a computer storage medium and includes machine-executable instructions that, when executed by a device, cause the device to perform acts comprising: obtaining current values of processing parameters used by processing units in the machine learning model; quantizing the current value of the processing parameter based on current values of a predetermined number of base quantization parameters and current values of binary quantization parameters, the binary quantization parameters respectively corresponding to the base quantization parameters and the predetermined number being the same as the number of bits used for quantization of the processing parameter, to obtain a quantized value of the processing parameter; and updating the current value of the base quantization parameter and the current value of the binary quantization parameter based on the difference between the quantized value of the processing parameter and the current value of the processing parameter, for quantization of the processing parameter specific to the processing unit.
In some implementations, updating the current value of the base quantization parameter and the current value of the binary quantization parameter includes: in the training process of the machine learning model, the current values of the basic quantization parameters and the current values of the binary quantization parameters and the current values of the processing parameters are updated jointly.
In some implementations, updating the current value of the base quantization parameter and the current value of the binary quantization parameter in conjunction with updating the current value of the processing parameter includes: in the forward propagation phase of the training process, updating the current value of the basic quantization parameter and the current value of the binary quantization parameter; and updating the current values of the processing parameters during a backward propagation phase of the training process.
In some implementations, updating the current value of the base quantization parameter and the current value of the binary quantization parameter includes: updating the current value of the binary quantization parameter while the current value of the base quantization parameter is fixed; and updating the current value of the base quantization parameter while the current value of the binary quantization parameter is fixed.
In some implementations, updating the current value of the binary quantization parameter includes: determining a plurality of quantization intervals based on the current value of the base quantization parameter and a plurality of candidate values of the binary quantization parameter, each quantization interval corresponding to a range of values and being defined by an inner product of the current value of the base quantization parameter and at least one of the plurality of candidate values of the binary quantization parameter; selecting a quantization interval in the quantization intervals so that the current value of the processing parameter falls within a value range corresponding to the selected quantization interval; and updating the current value of the binary quantization parameter to one of the at least one candidate value for defining the selected quantization interval.
In some implementations, the base quantization parameter is a first base quantization parameter and the binary quantization parameter is a first binary quantization parameter, and the acts further comprise: obtaining a current value of an input of the processing unit, the processing unit processing the current value of the input using the current value of the processing parameter; quantizing the current value of the input based on current values of another predetermined number of second base quantization parameters and current values of second binary quantization parameters, the second binary quantization parameters respectively corresponding to the second base quantization parameters and the other predetermined number being the same as the number of bits used for quantization of the input, to obtain a quantized value of the input; and updating the current values of the second base quantization parameter and the second binary quantization parameter based on the difference between the quantized value of the input and the current value of the input, for quantization of the input specific to the processing unit.
In some implementations, quantifying the current value of the processing parameter includes: determining an inner product result of the current value of the basic quantization parameter and the current value of the binary quantization parameter; and determining a quantized value of the processing parameter based on the inner product result.
In some implementations, the processing unit includes a network layer of the machine learning model or a processing node in the network layer of the machine learning model.
In some implementations, the current value of the processing parameter uses a floating point number format.
In yet another aspect, the present disclosure provides a computer-readable medium having stored thereon computer-executable instructions that, when executed by a device, cause the device to perform the method of the above aspects.
The functions described above herein may be performed, at least in part, by one or more hardware logic components. For example, and without limitation, exemplary types of hardware logic components that may be used include: Field Programmable Gate Arrays (FPGAs), Application Specific Integrated Circuits (ASICs), Application Specific Standard Products (ASSPs), Systems on a Chip (SOCs), Complex Programmable Logic Devices (CPLDs), and the like.
Program code for carrying out methods of the present disclosure may be written in any combination of one or more programming languages. Such program code may be provided to a processor or controller of a general-purpose computer, special-purpose computer, or other programmable data processing apparatus, such that the program code, when executed by the processor or controller, causes the functions/operations specified in the flowcharts and/or block diagrams to be implemented. The program code may execute entirely on the machine, partly on the machine, as a stand-alone software package, partly on the machine and partly on a remote machine, or entirely on the remote machine or server.
In the context of this disclosure, a machine-readable medium may be a tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device. The machine-readable medium may be a machine-readable signal medium or a machine-readable storage medium. The machine-readable medium may include, but is not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any suitable combination of the foregoing. More specific examples of a machine-readable storage medium would include an electrical connection based on one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.
Moreover, although operations are depicted in a particular order, this should not be understood as requiring that such operations be performed in the particular order shown or in sequential order, or that all illustrated operations be performed, to achieve desirable results. In certain circumstances, multitasking and parallel processing may be advantageous. Likewise, while several specific implementation details are included in the above discussion, these should not be construed as limiting the scope of the present disclosure. Certain features that are described in the context of separate implementations may also be implemented in combination in a single implementation. Conversely, various features that are described in the context of a single implementation can also be implemented in multiple implementations separately or in any suitable subcombination.
Although the subject matter has been described in language specific to structural features and/or methodological acts, it is to be understood that the subject matter defined in the appended claims is not necessarily limited to the specific features or acts described above. Rather, the specific features and acts described above are example forms of implementing the claims.

Claims (18)

1. A computer-implemented method, comprising:
Obtaining current values of processing parameters used by a processing unit in a machine learning model, the processing unit comprising a network layer of the machine learning model or a processing node in a network layer of the machine learning model;
Quantizing the current value of the processing parameter based on current values of a predetermined number of base quantization parameters and current values of binary quantization parameters, the binary quantization parameters respectively corresponding to the base quantization parameters and the predetermined number being the same as the number of bits used for quantization of the processing parameter, to obtain a quantized value of the processing parameter; and
Updating the current value of the base quantization parameter and the current value of the binary quantization parameter based on a difference between the quantized value of the processing parameter and the current value of the processing parameter for quantization of the processing parameter specific to the processing unit,
Wherein updating the current value of the binary quantization parameter comprises: updating the current value of the binary quantization parameter while fixing the current value of the base quantization parameter;
Wherein the updated current value of the base quantization parameter and the updated current value of the binary quantization parameter are different from the updated current value of the base quantization parameter and the updated current value of the binary quantization parameter for another processing unit in the machine learning model.
2. The method of claim 1, wherein updating the current value of the base quantization parameter and the current value of the binary quantization parameter comprises:
in the training process of the machine learning model, the current values of the basic quantization parameter and the current values of the binary quantization parameter and the current values of the processing parameter are updated jointly.
3. The method of claim 2, wherein updating the current value of the base quantization parameter and the current value of the binary quantization parameter in conjunction with updating the current value of the processing parameter comprises:
in a forward propagation phase of the training process, updating a current value of the base quantization parameter and a current value of the binary quantization parameter; and
In a backward propagation phase of the training process, the current values of the processing parameters are updated.
4. The method of claim 1, wherein updating the current value of the base quantization parameter comprises: updating the current value of the base quantization parameter while the current value of the binary quantization parameter is fixed.
5. The method of claim 4, wherein updating the current value of the binary quantization parameter comprises:
Determining a plurality of quantization intervals based on the current value of the base quantization parameter and a plurality of candidate values of the binary quantization parameter, each quantization interval corresponding to a range of values and being defined by an inner product of the current value of the base quantization parameter and at least one of the plurality of candidate values of the binary quantization parameter;
Selecting one of the quantization intervals so that the current value of the processing parameter falls within a value range corresponding to the selected quantization interval; and
Updating a current value of the binary quantization parameter to one of the at least one candidate value for defining the selected quantization interval.
6. The method of claim 1, wherein the base quantization parameter is a first base quantization parameter and the binary quantization parameter is a first binary quantization parameter, the method further comprising:
Obtaining a current value of an input of the processing unit, the processing unit processing the current value of the input using the current value of the processing parameter;
quantizing a current value of the input based on current values of another predetermined number of second base quantization parameters and current values of second binary quantization parameters, the second binary quantization parameters respectively corresponding to the second base quantization parameters and the other predetermined number being the same as the number of bits used for quantization of the input, to obtain a quantized value of the input; and
Based on a difference between the input quantized value and the input current value, current values of the second base quantization parameter and the second binary quantization parameter are updated for quantization of the input specific to the processing unit.
7. The method of claim 1, wherein quantifying the current value of the processing parameter comprises:
determining an inner product result of the current value of the basic quantization parameter and the current value of the binary quantization parameter; and
A quantized value of the processing parameter is determined based on the inner product result.
8. The method of claim 1, wherein the current value of the processing parameter uses a floating point format.
9. An electronic device, comprising:
A processing unit; and
A memory coupled to the processing unit and containing instructions stored thereon, which when executed by the processing unit, cause the device to perform actions comprising:
Obtaining current values of processing parameters used by a processing unit in a machine learning model, the processing unit comprising a network layer of the machine learning model or a processing node in a network layer of the machine learning model;
Quantizing the current value of the processing parameter based on current values of a predetermined number of base quantization parameters and current values of binary quantization parameters, the binary quantization parameters respectively corresponding to the base quantization parameters and the predetermined number being the same as the number of bits used for quantization of the processing parameter, to obtain a quantized value of the processing parameter; and
Updating the current value of the base quantization parameter and the current value of the binary quantization parameter based on a difference between the quantized value of the processing parameter and the current value of the processing parameter for quantization of the processing parameter specific to the processing unit,
Wherein updating the current value of the binary quantization parameter comprises: updating the current value of the binary quantization parameter while fixing the current value of the base quantization parameter;
Wherein the updated current value of the base quantization parameter and the updated current value of the binary quantization parameter are different from the updated current value of the base quantization parameter and the updated current value of the binary quantization parameter for another processing unit in the machine learning model.
10. The apparatus of claim 9, wherein updating the current value of the base quantization parameter and the current value of the binary quantization parameter comprises:
in the training process of the machine learning model, the current values of the basic quantization parameter and the current values of the binary quantization parameter and the current values of the processing parameter are updated jointly.
11. The apparatus of claim 10, wherein updating the current value of the base quantization parameter and the current value of the binary quantization parameter in conjunction with updating the current value of the processing parameter comprises:
in a forward propagation phase of the training process, updating a current value of the base quantization parameter and a current value of the binary quantization parameter; and
In a backward propagation phase of the training process, the current values of the processing parameters are updated.
12. The apparatus of claim 9, wherein updating the current value of the base quantization parameter comprises: updating the current value of the base quantization parameter while the current value of the binary quantization parameter is fixed.
13. The apparatus of claim 12, wherein updating a current value of the binary quantization parameter comprises:
Determining a plurality of quantization intervals based on the current value of the base quantization parameter and a plurality of candidate values of the binary quantization parameter, each quantization interval corresponding to a range of values and being defined by an inner product of the current value of the base quantization parameter and at least one of the plurality of candidate values of the binary quantization parameter;
Selecting one of the quantization intervals so that the current value of the processing parameter falls within a value range corresponding to the selected quantization interval; and
Updating a current value of the binary quantization parameter to one of the at least one candidate value for defining the selected quantization interval.
14. The apparatus of claim 9, wherein the base quantization parameter is a first base quantization parameter and the binary quantization parameter is a first binary quantization parameter, the apparatus further comprising:
Obtaining a current value of an input of the processing unit, the processing unit processing the current value of the input using the current value of the processing parameter;
quantizing a current value of the input based on current values of another predetermined number of second base quantization parameters and current values of second binary quantization parameters, the second binary quantization parameters respectively corresponding to the second base quantization parameters and the other predetermined number being the same as the number of bits used for quantization of the input, to obtain a quantized value of the input; and
Based on a difference between the input quantized value and the input current value, current values of the second base quantization parameter and the second binary quantization parameter are updated for quantization of the input specific to the processing unit.
15. The apparatus of claim 9, wherein quantifying the current value of the processing parameter comprises:
determining an inner product result of the current value of the basic quantization parameter and the current value of the binary quantization parameter; and
A quantized value of the processing parameter is determined based on the inner product result.
16. The apparatus of claim 9, wherein the current value of the processing parameter uses a floating point format.
17. A computer program product stored in a computer storage medium and comprising machine-executable instructions that, when executed by a device, cause the device to perform acts comprising:
Obtaining current values of processing parameters used by a processing unit in a machine learning model, the processing unit comprising a network layer of the machine learning model or a processing node in a network layer of the machine learning model;
Quantizing the current value of the processing parameter based on current values of a predetermined number of base quantization parameters and current values of binary quantization parameters, the binary quantization parameters respectively corresponding to the base quantization parameters and the predetermined number being the same as the number of bits used for quantization of the processing parameter, to obtain a quantized value of the processing parameter; and
Updating the current value of the base quantization parameter and the current value of the binary quantization parameter based on a difference between the quantized value of the processing parameter and the current value of the processing parameter for quantization of the processing parameter specific to the processing unit,
Wherein updating the current value of the binary quantization parameter comprises: updating the current value of the binary quantization parameter while fixing the current value of the base quantization parameter;
Wherein the updated current value of the base quantization parameter and the updated current value of the binary quantization parameter are different from the updated current value of the base quantization parameter and the updated current value of the binary quantization parameter for another processing unit in the machine learning model.
18. The computer program product of claim 17, wherein updating the current value of the base quantization parameter and the current value of the binary quantization parameter comprises:
in the training process of the machine learning model, the current values of the basic quantization parameter and the current values of the binary quantization parameter and the current values of the processing parameter are updated jointly.
CN201810715757.7A 2018-06-29 2018-06-29 Quantization for machine learning models Active CN110728350B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201810715757.7A CN110728350B (en) 2018-06-29 2018-06-29 Quantization for machine learning models


Publications (2)

Publication Number Publication Date
CN110728350A CN110728350A (en) 2020-01-24
CN110728350B true CN110728350B (en) 2024-07-26

Family

ID=69216820

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201810715757.7A Active CN110728350B (en) 2018-06-29 2018-06-29 Quantization for machine learning models

Country Status (1)

Country Link
CN (1) CN110728350B (en)

Families Citing this family (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113361678A (en) * 2020-03-04 2021-09-07 北京百度网讯科技有限公司 Training method and device of neural network model
CN111382844B (en) * 2020-03-11 2023-07-07 华南师范大学 Training method and device for deep learning model
CN113392954B (en) * 2020-03-13 2023-01-24 华为技术有限公司 Data processing method and device of terminal network model, terminal and storage medium
CN111738403B (en) * 2020-04-26 2024-06-07 华为技术有限公司 Neural network optimization method and related equipment
CN112652299B (en) * 2020-11-20 2022-06-17 北京航空航天大学 Quantification method and device of time series speech recognition deep learning model
CN113258935B (en) * 2021-05-25 2022-03-04 山东大学 Communication compression method based on model weight distribution in federated learning
CN113315604B (en) * 2021-05-25 2022-06-03 电子科技大学 Adaptive gradient quantization method for federated learning
US11468370B1 (en) 2022-03-07 2022-10-11 Shandong University Communication compression method based on model weight distribution in federated learning

Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107636697A (en) * 2015-05-08 2018-01-26 高通股份有限公司 The fixed point neutral net quantified based on floating-point neutral net

Family Cites Families (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2012104926A (en) * 2010-11-08 2012-05-31 Oki Electric Ind Co Ltd Quantization parameter control device, quantization parameter control method, and program
US20180107926A1 (en) * 2016-10-19 2018-04-19 Samsung Electronics Co., Ltd. Method and apparatus for neural network quantization

Patent Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107636697A (en) * 2015-05-08 2018-01-26 高通股份有限公司 The fixed point neutral net quantified based on floating-point neutral net

Also Published As

Publication number Publication date
CN110728350A (en) 2020-01-24


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant