CN113255375A - Translation model quantization method and device - Google Patents

Translation model quantization method and device Download PDF

Info

Publication number
CN113255375A
Authority
CN
China
Prior art keywords
variable
quantization
training
stage
training stage
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202010087096.5A
Other languages
Chinese (zh)
Inventor
Wu Xiaolin (吴晓琳)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Xiaomi Pinecone Electronic Co Ltd
Original Assignee
Beijing Xiaomi Pinecone Electronic Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Xiaomi Pinecone Electronic Co Ltd filed Critical Beijing Xiaomi Pinecone Electronic Co Ltd
Priority to CN202010087096.5A priority Critical patent/CN113255375A/en
Publication of CN113255375A publication Critical patent/CN113255375A/en
Pending legal-status Critical Current

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N20/00Machine learning

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Software Systems (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Medical Informatics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Physics & Mathematics (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Mathematical Physics (AREA)
  • Artificial Intelligence (AREA)
  • Machine Translation (AREA)

Abstract

The disclosure relates to a translation model quantization method and device. The method relates to deep learning technology and solves the problem of precision loss in the quantization process. The method comprises the following steps: respectively carrying out local quantization processing of a pseudo quantization layer on at least one training stage variable obtained in a training stage to obtain a precision loss correction variable corresponding to each training stage variable; and performing inference stage processing by taking the precision loss correction variable as an input. The technical scheme provided by the disclosure is suitable for quantizing operators that account for a large share of the computation and whose quantized values are stable, and reduces precision loss while improving operating efficiency.

Description

Translation model quantization method and device
Technical Field
The present disclosure relates to deep learning technologies, and in particular, to a translation model quantization method and apparatus.
Background
Owing to its superior performance on translation tasks, the transformer model has become the mainstream model for translation. However, the transformer model is difficult to run directly on a mobile terminal because of its large number of parameters, large model size, and slow running speed. Quantizing the parameters can effectively reduce the size of the transformer model and improve its running speed.
Taking 8-bit integer (int8) quantization as an example, the model size can be effectively reduced and the CPU of the mobile terminal can be fully utilized; typically, a CPU computes int8 operations about 4 times faster than 32-bit floating point (fp32) operations.
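By way of an illustrative figure (numbers chosen here for illustration, not stated in the patent): a transformer with 60 million parameters occupies about 60 x 10^6 x 4 bytes, roughly 240 MB, when its weights are stored as fp32, but only about 60 MB once the same weights are stored as int8.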
As shown in fig. 1, a transformer model that uses quantization involves two processes, a training stage and an inference stage: training is usually performed with fp32 variables, while the inference stage uses int8-type operators. Fig. 2 is a schematic diagram of the conversion between fp32 and int8 variables over the whole process; an int8-type Operator (OP) requires int8 input. Before being used as input, the fp32 variables obtained in the training stage therefore need to be quantized into int8, and the quantization is given by the following formula:
q = (x - min) / (max - min) × (h - l) + l
wherein max is the maximum value in the vector being converted and min is the minimum value in that vector; h is the maximum of the target range and l is the minimum of the target range, and for an int8-type variable h and l lie in the range 0-255; x is the fp32 value being converted. This method produces a loss of precision in the quantization (Quantize) and/or dequantization (Dequantize) stages, and not all OPs are suitable for quantization. For example, int8 versions of OPs that satisfy |f(x)| >> x, such as exp and log, may have numerical stability problems. The int8 versions of reduction operations suffer from error accumulation, including computing the cumulative product (cumprod), cumulative sum (cumsum), mean (mean), norm (norm), standard deviation (std), sum (sum), and variance (var) of each row of an array.
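A minimal Python sketch of this linear quantization and its inverse helps locate where the precision is lost; the function names, the uint8 target range (h = 255, l = 0) and the sample vector below are illustrative assumptions rather than material from the patent:

    import numpy as np

    def quantize(x, h=255.0, l=0.0):
        # map the fp32 range [min(x), max(x)] linearly onto the target range [l, h]
        mn, mx = float(x.min()), float(x.max())
        q = np.round((x - mn) / (mx - mn) * (h - l) + l)
        return q.astype(np.uint8), mn, mx

    def dequantize(q, mn, mx, h=255.0, l=0.0):
        # invert the mapping; the rounding applied in quantize() is not recovered
        return (q.astype(np.float32) - l) / (h - l) * (mx - mn) + mn

    x = np.array([0.013, -0.42, 0.37, 1.25], dtype=np.float32)
    q, mn, mx = quantize(x)
    x_rec = dequantize(q, mn, mx)  # close to x but not bit-exact: the gap is the precision loss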
Disclosure of Invention
To overcome the problems in the related art, the present disclosure provides a translation model quantization method and apparatus.
According to a first aspect of the embodiments of the present disclosure, there is provided a translation model quantization method, including:
respectively carrying out local quantization processing of a pseudo quantization layer on at least one training stage variable obtained in a training stage to obtain a precision loss correction variable corresponding to each training stage variable;
and performing inference stage processing by taking the precision loss correction variable as an input.
Further, the training phase variable is a 32-bit floating point number, and the precision loss correction variable is a 32-bit floating point number.
Further, the step of performing local quantization processing of the pseudo quantization layer on at least one training phase variable obtained in the training phase to obtain the precision loss correction variable corresponding to each training phase variable includes:
processing, through the pseudo quantization layer and according to the following expression, an output matrix that is formed by the at least one training stage variable and participates in matrix multiplication, to obtain the precision loss correction variable corresponding to each training stage variable:
q(x; a, b, n) = round((clamp(x; a, b) - a) / s(a, b, n)) × s(a, b, n) + a
wherein a is the minimum value of the training phase variable, b is the maximum value of the training phase variable, and n is the number of int8 quantization levels (a value in the range of 0-256),
s(a, b, n) = (b - a) / (n - 1)
clamp(x; a, b) := min(max(x, a), b), where x is the value of the training phase variable, x is of 32-bit floating point type, and x is the input from which the precision loss correction variable is calculated.
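Purely for illustration, the expression above can be written out as the following Python sketch; the helper name fake_quantize and the sample values are assumptions of this write-up and do not appear in the patent:

    import numpy as np

    def fake_quantize(x, a, b, n=256):
        # s(a, b, n): width of one quantization step between a and b
        s = (b - a) / (n - 1)
        # clamp(x; a, b) := min(max(x, a), b)
        clamped = np.clip(x, a, b)
        # snap to the nearest of the n levels, then map back into fp32
        return (np.round((clamped - a) / s) * s + a).astype(np.float32)

    # fp32 in, fp32 out: the result is a precision loss correction variable already
    # restricted to the grid of values that the int8 inference stage will use
    x = np.array([-1.7, -0.03, 0.55, 2.4], dtype=np.float32)
    x_corrected = fake_quantize(x, a=float(x.min()), b=float(x.max()))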
Further, the step of performing inference phase processing with the precision loss correction variable as an input includes:
forming an input matrix of the inference stage by using the precision loss correction variable corresponding to the at least one training stage variable;
and performing the processing of the inference stage by taking the input matrix as input.
Further, before the step of performing local quantization processing of the pseudo quantization layer on at least one training phase variable obtained in the training phase to obtain the precision loss correction variable corresponding to each training phase variable, the method further includes:
and training by using a training sample to obtain the at least one training phase variable.
According to a second aspect of the embodiments of the present disclosure, there is provided a translation model quantizing device including:
the local quantization module is used for respectively carrying out local quantization processing of a pseudo quantization layer on at least one training stage variable obtained in a training stage to obtain a precision loss correction variable corresponding to each training stage variable;
and the inference module is used for carrying out inference stage processing by taking the precision loss correction variable as input.
Further, the training phase variable is a 32-bit floating point number, the precision loss correction variable is a 32-bit floating point number,
the local quantization module is specifically configured to process, through the pseudo quantization layer and according to the following expression, an output matrix that is formed by the at least one training stage variable and participates in matrix multiplication, to obtain the precision loss correction variable corresponding to each of the training stage variables:
q(x; a, b, n) = round((clamp(x; a, b) - a) / s(a, b, n)) × s(a, b, n) + a
wherein a is the minimum value of the training phase variable, b is the maximum value of the training phase variable, and n is the number of int8 quantization levels (a value in the range of 0-256),
s(a, b, n) = (b - a) / (n - 1)
clamp(x; a, b) := min(max(x, a), b), where x is the value of the training phase variable, x is of 32-bit floating point type, and x is the input from which the precision loss correction variable is calculated.
Further, the inference module comprises:
the matrix construction submodule is used for constructing an input matrix of the inference stage by using the precision loss correction variable corresponding to the at least one training stage variable;
and the inference submodule is used for carrying out the processing of the inference stage by taking the input matrix as input.
According to a third aspect of embodiments of the present disclosure, there is provided a computer apparatus comprising:
a processor;
a memory for storing processor-executable instructions;
wherein the processor is configured to:
respectively carrying out local quantization processing of a pseudo quantization layer on at least one training stage variable obtained in a training stage to obtain a precision loss correction variable corresponding to each training stage variable;
and performing inference stage processing by taking the precision loss correction variable as an input.
According to a fourth aspect of embodiments of the present disclosure, there is provided a non-transitory computer-readable storage medium having instructions therein, which when executed by a processor of a mobile terminal, enable the mobile terminal to perform a translation model quantization method, the method comprising:
respectively carrying out local quantization processing of a pseudo quantization layer on at least one training stage variable obtained in a training stage to obtain a precision loss correction variable corresponding to each training stage variable;
and performing inference stage processing by taking the precision loss correction variable as an input.
The technical scheme provided by the embodiments of the disclosure can have the following beneficial effects: local quantization processing of a pseudo quantization layer is respectively performed on at least one training stage variable obtained in the training stage to obtain a precision loss correction variable corresponding to each training stage variable, and inference stage processing is then performed by taking the precision loss correction variable as input. Through local quantization, the parameter distribution is adapted to the distribution of the int8 inference stage, the problem of precision loss generated in the quantization process is solved, and precision loss is reduced while operating efficiency is improved.
It is to be understood that both the foregoing general description and the following detailed description are exemplary and explanatory only and are not restrictive of the disclosure.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments consistent with the invention and together with the description, serve to explain the principles of the invention.
Fig. 1 is a schematic diagram of an implementation principle of a transformer model using a quantization method.
FIG. 2 is a schematic representation of the transition between the fp32 variable and the int8 variable over the course of the process.
FIG. 3 is a flow diagram illustrating a translation model quantification method in accordance with an exemplary embodiment.
Fig. 4 is a schematic diagram illustrating an implementation principle of a matrix multiplication quantization method according to an exemplary embodiment.
Fig. 5 is a schematic diagram illustrating a principle of a local quantization processing performed by a translation model quantization method according to an exemplary embodiment.
Fig. 6 is a block diagram illustrating a matrix multiply quantization apparatus according to an example embodiment.
Fig. 7 is a block diagram illustrating the structure of an inference module 602 according to an exemplary embodiment.
Fig. 8 is a block diagram illustrating an apparatus (a general structure of a mobile terminal) according to an example embodiment.
Detailed Description
Reference will now be made in detail to the exemplary embodiments, examples of which are illustrated in the accompanying drawings. When the following description refers to the accompanying drawings, like numbers in different drawings represent the same or similar elements unless otherwise indicated. The embodiments described below do not represent all embodiments consistent with the present invention. Rather, they are merely examples of apparatus and methods consistent with certain aspects of the invention, as detailed in the appended claims.
In the transformer model, quantization processing causes a loss of precision before the inference stage is entered.
In order to solve this problem, embodiments of the present disclosure provide a translation model quantization method and apparatus. Through local quantization, the parameter distribution is adapted to the distribution of the int8 inference stage, the problem of precision loss generated in the quantization process is solved, and precision loss is reduced while operating efficiency is improved.
An exemplary embodiment of the present disclosure provides a translation model quantization method, a flow of quantization processing using the method is shown in fig. 3, and the method includes:
Step 301, training is performed by using a training sample to obtain at least one training phase variable.
In this step, the training phase is carried out in a conventional manner, and at least one training phase variable is obtained.
Step 302, respectively carrying out local quantization processing of a pseudo quantization layer on at least one training stage variable obtained in the training stage to obtain a precision loss correction variable corresponding to each training stage variable.
In this step, during quantization processing, local quantization processing of a pseudo quantization layer is performed on each variable operator separately to obtain a precision loss correction variable. The local quantization process simulates the precision loss of the inference stage, so the precision loss caused by quantization can be effectively reduced and the parameter distribution is adapted to the algorithm of the inference stage.
In the embodiment of the present disclosure, the quantization process may target matrix multiplication (MatMul): as shown in fig. 4, a pseudo quantization layer (e.g., Fake-int8) is added, and local quantization processing is performed in the pseudo quantization layer on operators that account for a large share of the computation and whose quantized values are stable.
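A hedged sketch of this placement, reusing the illustrative fake_quantize() helper from the sketch earlier in this description; the wrapper name and the choice to pseudo-quantize both operands are assumptions, not claim language:

    def matmul_with_fake_int8(inputs, weights):
        # Fake-int8 pseudo quantization layer, applied locally around this MatMul only
        w_q = fake_quantize(weights, a=float(weights.min()), b=float(weights.max()))
        x_q = fake_quantize(inputs, a=float(inputs.min()), b=float(inputs.max()))
        # the multiplication itself still runs in fp32 during training, but on values
        # that already lie on the int8 grid, so the inference-time loss is simulated
        return x_q @ w_q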
Step 303, performing inference stage processing by taking the precision loss correction variable as an input.
In this step, the precision loss correction variables corresponding to the at least one training stage variable may be formed into an input matrix of the inference stage, and the processing of the inference stage may be performed with the input matrix as input.
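One hedged way to picture the hand-off from step 302 into the int8 inference stage (the helper name and the uint8 target below are illustrative assumptions):

    import numpy as np

    def to_int8_operator_input(corrected):
        # assemble the precision loss correction variables into the inference-stage
        # input matrix and convert it with the linear quantization formula above
        mn, mx = float(corrected.min()), float(corrected.max())
        q = np.round((corrected - mn) / (mx - mn) * 255.0).astype(np.uint8)
        return q, mn, mx  # int8-ready matrix plus the range the int8 OP needs for dequantization

Because the correction variables already lie on the quantization grid chosen during training, this final conversion introduces little additional rounding error.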
An exemplary embodiment of the present disclosure further provides a translation model quantization method, where the training phase variable involved is a 32-bit floating point number, and the precision loss correction variable is a 32-bit floating point number. The principle of local quantization using this method is shown in fig. 5. The specific way of carrying out local quantization processing on the training phase variable of fp32 is as follows:
processing, through the pseudo quantization layer and according to the following expression, an output matrix that is formed by the at least one training stage variable and participates in matrix multiplication, to obtain the precision loss correction variable corresponding to each training stage variable:
q(x; a, b, n) = round((clamp(x; a, b) - a) / s(a, b, n)) × s(a, b, n) + a
wherein a is the minimum value of the training phase variable, b is the maximum value of the training phase variable, and n is the number of int8 quantization levels (a value in the range of 0-256),
s(a, b, n) = (b - a) / (n - 1)
clamp(x; a, b) := min(max(x, a), b), where x is the value of the training phase variable, x is of 32-bit floating point type, and x is the input from which the precision loss correction variable is calculated.
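As a worked example with illustrative numbers (not taken from the patent): for a = -1, b = 1 and n = 256, the step size is s(a, b, n) = (1 - (-1)) / 255 ≈ 0.00784; an input x = 0.30 is unchanged by the clamp, maps to round((0.30 - (-1)) / 0.00784) = 166 steps above a, and is returned as the fp32 value -1 + 166 × 0.00784 ≈ 0.302, which becomes the precision loss correction variable for this x.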
An exemplary embodiment of the present disclosure also provides a translation model quantizing device, whose structure is shown in fig. 6, including:
a local quantization module 601, configured to respectively perform local quantization processing of a pseudo quantization layer on at least one training stage variable obtained in a training stage, to obtain a precision loss correction variable corresponding to each training stage variable;
and the inference module 602 is configured to perform inference phase processing by using the precision loss correction variable as an input.
Further, the training phase variable is a 32-bit floating point number, the precision loss correction variable is a 32-bit floating point number,
the local quantization module 601 is specifically configured to process, through the pseudo quantization layer and according to the following expression, an output matrix that is formed by the at least one training stage variable and participates in matrix multiplication, to obtain the precision loss correction variable corresponding to each of the training stage variables:
q(x; a, b, n) = round((clamp(x; a, b) - a) / s(a, b, n)) × s(a, b, n) + a
wherein a is the minimum value of the training phase variable, b is the maximum value of the training phase variable, and n is the number of int8 quantization levels (a value in the range of 0-256),
s(a, b, n) = (b - a) / (n - 1)
clamp(x; a, b) := min(max(x, a), b), where x is the value of the training phase variable, x is of 32-bit floating point type, and x is the input from which the precision loss correction variable is calculated.
Further, the structure of the inference module 602 is shown in fig. 7, and includes:
a matrix construction submodule 701, configured to construct an input matrix of the inference phase with precision loss correction variables corresponding to the at least one training phase variable;
and the inference submodule 702 is configured to perform the processing of the inference stage by using the input matrix as an input.
With regard to the apparatus in the above-described embodiment, the specific manner in which each module performs the operation has been described in detail in the embodiment related to the method, and will not be elaborated here. The device can be integrated in the terminal, and the terminal can realize corresponding functions.
There is also provided, in accordance with an exemplary embodiment of the present disclosure, a computer apparatus, including:
a processor;
a memory for storing processor-executable instructions;
wherein the processor is configured to:
respectively carrying out local quantization processing of a pseudo quantization layer on at least one training stage variable obtained in a training stage to obtain a precision loss correction variable corresponding to each training stage variable;
and performing inference stage processing by taking the precision loss correction variable as an input.
FIG. 8 is a block diagram illustrating an apparatus 800 for translation model quantization in accordance with an example embodiment. For example, the apparatus 800 may be a mobile phone, a computer, a digital broadcast terminal, a messaging device, a game console, a tablet device, a medical device, an exercise device, a personal digital assistant, and the like.
Referring to fig. 8, the apparatus 800 may include one or more of the following components: a processing component 802, a memory 804, a power component 806, a multimedia component 808, an audio component 810, an input/output (I/O) interface 812, a sensor component 814, and a communication component 816.
The processing component 802 generally controls overall operation of the device 800, such as operations associated with display, telephone calls, data communications, camera operations, and recording operations. The processing components 802 may include one or more processors 820 to execute instructions to perform all or a portion of the steps of the methods described above. Further, the processing component 802 can include one or more modules that facilitate interaction between the processing component 802 and other components. For example, the processing component 802 can include a multimedia module to facilitate interaction between the multimedia component 808 and the processing component 802.
The memory 804 is configured to store various types of data to support operation at the device 800. Examples of such data include instructions for any application or method operating on device 800, contact data, phonebook data, messages, pictures, videos, and so forth. The memory 804 may be implemented by any type or combination of volatile or non-volatile memory devices such as Static Random Access Memory (SRAM), electrically erasable programmable read-only memory (EEPROM), erasable programmable read-only memory (EPROM), programmable read-only memory (PROM), read-only memory (ROM), magnetic memory, flash memory, magnetic or optical disks.
Power component 806 provides power to the various components of device 800. The power components 806 may include a power management system, one or more power sources, and other components associated with generating, managing, and distributing power for the device 800.
The multimedia component 808 includes a screen that provides an output interface between the device 800 and a user. In some embodiments, the screen may include a Liquid Crystal Display (LCD) and a Touch Panel (TP). If the screen includes a touch panel, the screen may be implemented as a touch screen to receive an input signal from a user. The touch panel includes one or more touch sensors to sense touch, slide, and gestures on the touch panel. The touch sensor may not only sense the boundary of a touch or slide action, but also detect the duration and pressure associated with the touch or slide operation. In some embodiments, the multimedia component 808 includes a front facing camera and/or a rear facing camera. The front-facing camera and/or the rear-facing camera may receive external multimedia data when the device 800 is in an operating mode, such as a shooting mode or a video mode. Each front camera and rear camera may be a fixed optical lens system or have a focal length and optical zoom capability.
The audio component 810 is configured to output and/or input audio signals. For example, the audio component 810 includes a Microphone (MIC) configured to receive external audio signals when the apparatus 800 is in an operational mode, such as a call mode, a recording mode, and a voice recognition mode. The received audio signals may further be stored in the memory 804 or transmitted via the communication component 816. In some embodiments, audio component 810 also includes a speaker for outputting audio signals.
The I/O interface 812 provides an interface between the processing component 802 and peripheral interface modules, which may be keyboards, click wheels, buttons, etc. These buttons may include, but are not limited to: a home button, a volume button, a start button, and a lock button.
The sensor assembly 814 includes one or more sensors for providing various aspects of state assessment for the device 800. For example, the sensor assembly 814 may detect the open/closed state of the device 800 and the relative positioning of components, such as the display and keypad of the apparatus 800. The sensor assembly 814 may also detect a change in position of the apparatus 800 or a component of the apparatus 800, the presence or absence of user contact with the apparatus 800, orientation or acceleration/deceleration of the apparatus 800, and a change in temperature of the apparatus 800. Sensor assembly 814 may include a proximity sensor configured to detect the presence of a nearby object without any physical contact. The sensor assembly 814 may also include a light sensor, such as a CMOS or CCD image sensor, for use in imaging applications. In some embodiments, the sensor assembly 814 may also include an acceleration sensor, a gyroscope sensor, a magnetic sensor, a pressure sensor, or a temperature sensor.
The communication component 816 is configured to facilitate communications between the apparatus 800 and other devices in a wired or wireless manner. The device 800 may access a wireless network based on a communication standard, such as WiFi, 2G or 3G, or a combination thereof. In an exemplary embodiment, the communication component 816 receives a broadcast signal or broadcast related information from an external broadcast management system via a broadcast channel. In an exemplary embodiment, the communication component 816 further includes a Near Field Communication (NFC) module to facilitate short-range communications. For example, the NFC module may be implemented based on Radio Frequency Identification (RFID) technology, infrared data association (IrDA) technology, Ultra Wideband (UWB) technology, Bluetooth (BT) technology, and other technologies.
In an exemplary embodiment, the apparatus 800 may be implemented by one or more Application Specific Integrated Circuits (ASICs), Digital Signal Processors (DSPs), Digital Signal Processing Devices (DSPDs), Programmable Logic Devices (PLDs), Field Programmable Gate Arrays (FPGAs), controllers, micro-controllers, microprocessors or other electronic components for performing the above-described methods.
In an exemplary embodiment, a non-transitory computer-readable storage medium comprising instructions, such as the memory 804 comprising instructions, executable by the processor 820 of the device 800 to perform the above-described method is also provided. For example, the non-transitory computer readable storage medium may be a ROM, a Random Access Memory (RAM), a CD-ROM, a magnetic tape, a floppy disk, an optical data storage device, and the like.
A non-transitory computer readable storage medium having instructions therein, which when executed by a processor of a mobile terminal, enable the mobile terminal to perform a translation model quantization method, the method comprising:
respectively carrying out local quantization processing of a pseudo quantization layer on at least one training stage variable obtained in a training stage to obtain a precision loss correction variable corresponding to each training stage variable;
and performing inference stage processing by taking the precision loss correction variable as an input.
According to the translation model quantization method and device provided by the embodiment of the disclosure, local quantization processing of a pseudo quantization layer is respectively performed on at least one training stage variable obtained in a training stage to obtain a precision loss correction variable corresponding to each training stage variable, and then the precision loss correction variable is used as input to perform inference stage processing. The parameter distribution is adapted to the distribution of the int8 inference stage through local quantization, the problem of precision loss generated in the quantization process is solved, and the precision loss is reduced while the operation efficiency is improved.
The size of the translation model can be effectively reduced, the translation model can be conveniently deployed on terminals such as mobile terminals, and the model inference speed can be effectively increased on the premise of ensuring model precision. The precision loss caused by quantization is reduced, numerical stability after quantization can be ensured, and operating efficiency is improved.
Other embodiments of the invention will be apparent to those skilled in the art from consideration of the specification and practice of the invention disclosed herein. This application is intended to cover any variations, uses, or adaptations of the invention following, in general, the principles of the invention and including such departures from the present disclosure as come within known or customary practice within the art to which the invention pertains. It is intended that the specification and examples be considered as exemplary only, with a true scope and spirit of the invention being indicated by the following claims.
It will be understood that the invention is not limited to the precise arrangements described above and shown in the drawings and that various modifications and changes may be made without departing from the scope thereof. The scope of the invention is limited only by the appended claims.

Claims (10)

1. A translation model quantization method, comprising:
respectively carrying out local quantization processing of a pseudo quantization layer on at least one training stage variable obtained in a training stage to obtain a precision loss correction variable corresponding to each training stage variable;
and performing inference stage processing by taking the precision loss correction variable as an input.
2. The translation model quantization method of claim 1, wherein the training phase variable is a 32-bit floating point number and the loss of precision correction variable is a 32-bit floating point number.
3. The method for quantizing a translation model according to claim 1 or 2, wherein the step of performing local quantization processing on at least one training stage variable obtained in the training stage by using a pseudo quantization layer respectively to obtain the precision loss correction variable corresponding to each training stage variable comprises:
processing, through the pseudo quantization layer and according to the following expression, an output matrix that is formed by the at least one training stage variable and participates in matrix multiplication, to obtain the precision loss correction variable corresponding to each training stage variable:
q(x; a, b, n) = round((clamp(x; a, b) - a) / s(a, b, n)) × s(a, b, n) + a
wherein a is the minimum value of the training phase variable, b is the maximum value of the training phase variable, and n is the number of int8 quantization levels (a value in the range of 0-256),
s(a, b, n) = (b - a) / (n - 1)
clamp(x; a, b) := min(max(x, a), b), where x is the value of the training phase variable, x is of 32-bit floating point type, and x is the input from which the precision loss correction variable is calculated.
4. The translation model quantization method of claim 1, wherein said step of performing inference phase processing with said loss of precision correction variable as an input comprises:
forming an input matrix of the inference stage by using the precision loss correction variable corresponding to the at least one training stage variable;
and performing the processing of the inference stage by taking the input matrix as input.
5. The method according to claim 1, wherein before the step of performing local quantization processing of the pseudo quantization layer on at least one training stage variable obtained in the training stage to obtain the precision loss correction variable corresponding to each training stage variable, the method further comprises:
and training by using a training sample to obtain the at least one training phase variable.
6. A translation model quantization apparatus, comprising:
the local quantization module is used for respectively carrying out local quantization processing of a pseudo quantization layer on at least one training stage variable obtained in a training stage to obtain a precision loss correction variable corresponding to each training stage variable;
and the inference module is used for carrying out inference stage processing by taking the precision loss correction variable as input.
7. The translation model quantizing device according to claim 6, wherein the training phase variable is a 32-bit floating point number, the precision loss correction variable is a 32-bit floating point number,
the local quantization module is specifically configured to process, through the pseudo quantization layer and according to the following expression, an output matrix that is formed by the at least one training stage variable and participates in matrix multiplication, to obtain the precision loss correction variable corresponding to each of the training stage variables:
q(x; a, b, n) = round((clamp(x; a, b) - a) / s(a, b, n)) × s(a, b, n) + a
wherein a is the minimum value of the training phase variable, b is the maximum value of the training phase variable, and n is the number of int8 quantization levels (a value in the range of 0-256),
s(a, b, n) = (b - a) / (n - 1)
clamp(x; a, b) := min(max(x, a), b), where x is the value of the training phase variable, x is of 32-bit floating point type, and x is the input from which the precision loss correction variable is calculated.
8. The translation model quantization apparatus of claim 6, wherein the inference module comprises:
the matrix construction submodule is used for constructing an input matrix of the inference stage by using the precision loss correction variable corresponding to the at least one training stage variable;
and the inference submodule is used for carrying out the processing of the inference stage by taking the input matrix as input.
9. A computer device, comprising:
a processor;
a memory for storing processor-executable instructions;
wherein the processor is configured to:
respectively carrying out local quantization processing of a pseudo quantization layer on at least one training stage variable obtained in a training stage to obtain a precision loss correction variable corresponding to each training stage variable;
and performing inference stage processing by taking the precision loss correction variable as an input.
10. A non-transitory computer readable storage medium having instructions therein, which when executed by a processor of a mobile terminal, enable the mobile terminal to perform a translation model quantization method, the method comprising:
respectively carrying out local quantization processing of a pseudo quantization layer on at least one training stage variable obtained in a training stage to obtain a precision loss correction variable corresponding to each training stage variable;
and performing inference stage processing by taking the precision loss correction variable as an input.
CN202010087096.5A 2020-02-11 2020-02-11 Translation model quantization method and device Pending CN113255375A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010087096.5A CN113255375A (en) 2020-02-11 2020-02-11 Translation model quantization method and device

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010087096.5A CN113255375A (en) 2020-02-11 2020-02-11 Translation model quantization method and device

Publications (1)

Publication Number Publication Date
CN113255375A true CN113255375A (en) 2021-08-13

Family

ID=77219560

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010087096.5A Pending CN113255375A (en) 2020-02-11 2020-02-11 Translation model quantization method and device

Country Status (1)

Country Link
CN (1) CN113255375A (en)


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination