CN112529189A - Model compression method and device, electronic equipment and storage medium - Google Patents

Model compression method and device, electronic equipment and storage medium Download PDF

Info

Publication number
CN112529189A
Authority
CN
China
Prior art keywords: matrix, quantization, model, reference value, value corresponding
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202011247207.0A
Other languages
Chinese (zh)
Inventor
王桂彬
董昊
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Baidu Netcom Science and Technology Co Ltd
Original Assignee
Beijing Baidu Netcom Science and Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Baidu Netcom Science and Technology Co Ltd
Priority to CN202011247207.0A
Publication of CN112529189A
Legal status: Pending

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N5/00 Computing arrangements using knowledge-based models
    • G06N5/04 Inference or reasoning models
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F17/00 Digital computing or data processing equipment or methods, specially adapted for specific functions
    • G06F17/10 Complex mathematical operations
    • G06F17/16 Matrix or vector computation, e.g. matrix-matrix or matrix-vector multiplication, matrix factorization
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/06 Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons
    • G06N3/063 Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons using electronic means

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Mathematical Physics (AREA)
  • Data Mining & Analysis (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • Software Systems (AREA)
  • Mathematical Optimization (AREA)
  • Health & Medical Sciences (AREA)
  • Biophysics (AREA)
  • Pure & Applied Mathematics (AREA)
  • Mathematical Analysis (AREA)
  • Computational Mathematics (AREA)
  • Artificial Intelligence (AREA)
  • Computational Linguistics (AREA)
  • Evolutionary Computation (AREA)
  • Biomedical Technology (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Algebra (AREA)
  • Databases & Information Systems (AREA)
  • Neurology (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Compression, Expansion, Code Conversion, And Decoders (AREA)

Abstract

The application discloses a model compression method and apparatus, an electronic device, and a storage medium, relating to fields of artificial intelligence such as deep learning and speech recognition. The method includes the following steps: when a single-precision floating-point matrix operation is required in the model inference process, quantizing the left matrix of the two multiplied matrices by rows to obtain a first quantization matrix, and quantizing the right matrix of the two multiplied matrices by columns to obtain a second quantization matrix; multiplying the first quantization matrix and the second quantization matrix to obtain a third matrix as the fixed-point operation result; and performing inverse quantization according to the third matrix to obtain a fourth matrix, which is taken as the matrix operation result. By applying this scheme, the model inference speed can be improved, and the scheme has general applicability.

Description

Model compression method and device, electronic equipment and storage medium
Technical Field
The present application relates to the field of artificial intelligence technologies, and in particular, to a method and an apparatus for model compression in the fields of deep learning and speech recognition, an electronic device, and a storage medium.
Background
With the development of technology, deep learning models are applied more and more widely, and in order to continuously improve model precision, the depth and volume of models keep increasing. Taking speech recognition as an example, from feedforward deep neural networks to recurrent neural networks to encoder-decoder (Encoder-Decoder) models, each technological change has brought greater computational requirements to model inference.
Currently, the deployment of deep learning applications is gradually migrating from cloud servers to end devices. Although the computational performance of end devices keeps improving, it still cannot meet the requirements in practical applications, and the mismatch between model inference and the hardware resources of end devices urgently needs to be resolved.
To solve the above problem, the following implementation is mostly adopted: on the basis of the original large model, a more simplified model structure is obtained by reducing the number of network nodes or connections, thereby achieving model compression. However, this approach is closely tied to the model structure and has poor operability and repeatability, so its application scenarios are limited.
Disclosure of Invention
The application provides a model compression method, a model compression device, electronic equipment and a storage medium.
A method of model compression, comprising:
when matrix operation of a single-precision floating point is required in the model reasoning process, quantizing a left matrix of two multiplied matrixes according to rows to obtain a first quantization matrix, and quantizing a right matrix of the two multiplied matrixes according to columns to obtain a second quantization matrix;
multiplying the first quantization matrix and the second quantization matrix to obtain a third matrix serving as a fixed-point operation result;
and carrying out inverse quantization according to the third matrix to obtain a fourth matrix, and taking the fourth matrix as the result of the matrix operation.
A model compression apparatus, comprising: a quantization module, an operation module, and an inverse quantization module;
the quantization module is used for quantizing a left matrix of the two multiplied matrixes according to rows to obtain a first quantization matrix and quantizing a right matrix of the two multiplied matrixes according to columns to obtain a second quantization matrix when the matrix operation of a single-precision floating point is required in the model reasoning process;
the operation module is used for multiplying the first quantization matrix and the second quantization matrix to obtain a third matrix serving as a fixed-point operation result;
and the inverse quantization module is used for carrying out inverse quantization according to the third matrix to obtain a fourth matrix, and the fourth matrix is used as the matrix operation result.
An electronic device, comprising:
at least one processor; and
a memory communicatively coupled to the at least one processor; wherein:
the memory stores instructions executable by the at least one processor to enable the at least one processor to perform a method as described above.
A non-transitory computer readable storage medium having stored thereon computer instructions for causing the computer to perform the method as described above.
One embodiment of the above application has the following advantages or benefits: through quantization processing, single-precision floating point can be converted into fixed point, so that during matrix operations in the model inference process, fixed-point operations replace floating-point operations, which effectively compresses the model volume and improves the model inference speed; moreover, the scheme is applicable to various model structures and therefore has universal applicability.
It should be understood that the statements in this section do not necessarily identify key or critical features of the embodiments of the present disclosure, nor do they limit the scope of the present disclosure. Other features of the present disclosure will become apparent from the following description.
Drawings
The drawings are included to provide a better understanding of the present solution and are not intended to limit the present application. Wherein:
FIG. 1 is a flow chart of an embodiment of a model compression method described herein;
FIG. 2 is a schematic diagram of the quantization fine-tuning training process described in the present application;
FIG. 3 is a schematic diagram of a component structure of an embodiment of a model compression apparatus 30 according to the present application;
fig. 4 is a block diagram of an electronic device according to the method of an embodiment of the present application.
Detailed Description
The following description of the exemplary embodiments of the present application, taken in conjunction with the accompanying drawings, includes various details of the embodiments of the application for the understanding of the same, which are to be considered exemplary only. Accordingly, those of ordinary skill in the art will recognize that various changes and modifications of the embodiments described herein can be made without departing from the scope and spirit of the present application. Also, descriptions of well-known functions and constructions are omitted in the following description for clarity and conciseness.
In addition, it should be understood that the term "and/or" herein merely describes an association relationship between associated objects, indicating that three relationships may exist; for example, A and/or B may mean: A exists alone, both A and B exist, or B exists alone. In addition, the character "/" herein generally indicates that the former and latter associated objects are in an "or" relationship.
FIG. 1 is a flow chart of an embodiment of a model compression method described herein. As shown in fig. 1, the following detailed implementation is included.
In step 101, when a single-precision floating-point matrix operation is required in the model inference process, a left matrix of the two multiplied matrices is quantized by rows to obtain a first quantization matrix, and a right matrix of the two multiplied matrices is quantized by columns to obtain a second quantization matrix.
In step 102, the first quantization matrix and the second quantization matrix are multiplied to obtain a third matrix as a result of the fixed-point operation.
In step 103, inverse quantization is performed according to the third matrix to obtain a fourth matrix, and the fourth matrix is used as a result of matrix operation.
Matrix operation is the most central and costly operation in the model inference process. In the scheme of this method embodiment, single-precision floating point can be converted into fixed point through quantization processing, so that fixed-point operation replaces floating-point operation during matrix operations in model inference, which effectively compresses the model volume and improves the model inference speed. Moreover, the scheme is applicable to various model structures and various architectures, such as ARM (Advanced RISC Machines) and x86, and thus has universal applicability.
The matrix operation generally refers to the multiplication of two matrices, and may be handled by quantizing the left matrix of the two multiplied matrices by rows to obtain a first quantization matrix, and quantizing the right matrix by columns to obtain a second quantization matrix. For example, for the matrix operation W * X, W is the left matrix and X is the right matrix.
For the left matrix, quantization may be performed by rows to obtain the first quantization matrix. Preferably, a reference value corresponding to each row of elements in the left matrix may be determined, and for each element in the left matrix, the quantized value of the element may be determined according to the reference value corresponding to the row where the element is located and a predetermined bit width; the value of each element in the first quantization matrix is the corresponding quantized value.
For each row of elements in the left matrix, the maximum value among the absolute values of the elements included in that row may be taken as the reference value corresponding to the row. That is, fabsmax(w_{i,*}) may be taken as the reference value of each row, where fabsmax(w_{i,*}) represents the maximum of the absolute values of the elements included in the i-th row (i representing any row). For example, if the i-th row includes 10 elements whose absolute values are absolute value 1 through absolute value 10, and absolute value 6 is the largest, then absolute value 6 is taken as the reference value corresponding to that row.
Then, for each element in the left matrix, the quotient of the value of the element and the reference value corresponding to the row where the element is located may be calculated, and the product of the obtained quotient and 2^(B-1) may be taken as the quantized value of the element, where B represents a predetermined bit width.
For example, for element a:
w'_{i,j} = w_{i,j} / fabsmax(w_{i,*}) × 2^(B-1);    (1)
where w_{i,j} represents the value of element a, fabsmax(w_{i,*}) represents the reference value, and w'_{i,j} represents the quantized value of element a. The specific value of B may be determined according to actual needs, such as 8, 4, or another value.
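To make the row-wise quantization concrete, the following is a minimal NumPy sketch of equation (1). It is an illustration rather than the patented implementation: the function name, the round-to-nearest step, and the guard against all-zero rows are assumptions, since the text does not specify them.

```python
import numpy as np

def quantize_rows(W, B=8):
    """Quantize the left matrix W row by row, per equation (1).

    Returns the integer matrix W' together with the per-row
    reference values fabsmax(w_{i,*}), which are needed later
    for inverse quantization.
    """
    # Reference value of each row: the maximum absolute value in that row.
    ref = np.abs(W).max(axis=1, keepdims=True)       # shape (m, 1)
    ref = np.where(ref == 0.0, 1.0, ref)             # assumption: guard all-zero rows
    # w'_{i,j} = w_{i,j} / fabsmax(w_{i,*}) * 2^(B-1), rounded to an integer.
    Wq = np.round(W / ref * 2 ** (B - 1)).astype(np.int32)
    return Wq, ref
```

Quantizing the right matrix by columns is the same computation with the reduction taken over axis 0, as used in the pipeline sketch further below.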
For the right matrix, quantization may be performed by columns to obtain the second quantization matrix. Preferably, a reference value corresponding to each column of elements in the right matrix may be determined, and for each element in the right matrix, the quantized value of the element may be determined according to the reference value corresponding to the column where the element is located and the predetermined bit width; the value of each element in the second quantization matrix is the corresponding quantized value.
For each column of elements in the right matrix, the maximum value among the absolute values of the elements included in that column may be taken as the reference value corresponding to the column. That is, fabsmax(x_{*,j}) may be taken as the reference value of each column, where fabsmax(x_{*,j}) represents the maximum of the absolute values of the elements included in the j-th column (j representing any column).
Then, for each element in the right matrix, the quotient of the value of the element and the reference value corresponding to the column where the element is located may be calculated, and the product of the obtained quotient and 2^(B-1) may be taken as the quantized value of the element.
For example, for element b:
x'_{i,j} = x_{i,j} / fabsmax(x_{*,j}) × 2^(B-1);    (2)
where x_{i,j} represents the value of element b, fabsmax(x_{*,j}) represents the reference value, and x'_{i,j} represents the quantized value of element b; the specific value of B may be determined according to actual needs.
After the first quantization matrix and the second quantization matrix are obtained respectively in the above manner, the first quantization matrix and the second quantization matrix may be multiplied to obtain a third matrix as a result of the fixed-point operation.
That is:
O' = W' × X';    (3)
where W' represents the first quantization matrix, X' represents the second quantization matrix, and O' represents the third matrix.
Through quantization processing, single-precision floating point is converted into fixed point, so that fixed-point operation replaces floating-point operation during matrix operation, which effectively compresses the model volume and improves the model inference speed.
Because of the quantization processing, inverse quantization is correspondingly required according to the third matrix to obtain a fourth matrix, which is taken as the final matrix operation result.
Preferably, for each element in the fourth matrix, the following processing may be performed: and calculating the product of the value of the corresponding element of the element in the third matrix, the reference value corresponding to the row of the element and the reference value corresponding to the column of the element, and taking the obtained product as the value of the element, wherein the corresponding element is the element at the same position.
For example, for element c:
o_{i,j} = o'_{i,j} × fabsmax(w_{i,*}) × fabsmax(x_{*,j});    (4)
where o'_{i,j} represents the value of the element corresponding to element c in the third matrix, fabsmax(w_{i,*}) represents the reference value corresponding to the row where element c is located, fabsmax(x_{*,j}) represents the reference value corresponding to the column where element c is located, and o_{i,j} represents the value of element c in the fourth matrix.
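Putting equations (1) through (4) together, the sketch below reuses quantize_rows from the previous sketch and runs the full pipeline: quantize both matrices, multiply in integers, and inverse-quantize. One assumption is flagged in the comments: the text of equation (4) states only the product of the three terms, so the division by (2^(B-1))^2, which undoes the two scale factors introduced by equations (1) and (2), is added here so the result lands back on the original floating-point scale.

```python
def quantized_matmul(W, X, B=8):
    """Approximate the floating-point product W @ X with fixed-point
    arithmetic, following equations (1) to (4)."""
    s = 2 ** (B - 1)
    # Quantize the left matrix by rows (equation (1)).
    Wq, w_ref = quantize_rows(W, B)                  # w_ref: shape (m, 1)
    # Quantize the right matrix by columns (equation (2)).
    x_ref = np.abs(X).max(axis=0, keepdims=True)     # shape (1, n)
    x_ref = np.where(x_ref == 0.0, 1.0, x_ref)
    Xq = np.round(X / x_ref * s).astype(np.int32)
    # Fixed-point multiplication (equation (3)): integer arithmetic only.
    Oq = Wq @ Xq
    # Inverse quantization (equation (4)); the division by s**2 is an
    # assumption that undoes the two scale factors from (1) and (2).
    return Oq * w_ref * x_ref / s ** 2
```

As a quick check, for random fp32 matrices the result of quantized_matmul with B = 8 should track W @ X closely, while each stored weight of the left matrix shrinks from 32 bits to B bits.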
In model inference, the left matrix is usually a weight matrix and is fixed, so the quantization processing of the left matrix can be completed offline; the right matrix, however, depends on the input, so its quantization processing needs to be completed online.
As mentioned above, the specific value of the bit width B may be determined according to actual needs, such as 8, 4, or another value. A symmetric fixed-point representation may be used in this application; the representation range of bit width B is (-2^(B-1), 2^(B-1)). For example, int4 represents the range [-7, 7]; compared with the single-precision floating point fp32, each parameter takes 4 bits instead of 32, so int4 achieves 8 times compression.
The model compression method described in the present application may also be combined with other compression methods; for example, after the model is compressed according to the method described herein, a Huffman compression method may be applied to compress the model further.
As the representation bit width decreases, the gap between the precision of the fixed-point model and that of the floating-point model gradually widens, and the low-bit model needs to be fine-tuned to recover the precision of the original model. Therefore, the present application further provides a progressive low-precision model training method to improve the precision, stability, and reproducibility of the model.
Preferably, a single-precision model may be trained in a single-precision training manner to serve as the initial model, and quantization fine-tuning training may then be performed on the model parameters of the initial model to obtain the final model. The final model can then be used for model inference and other processing.
That is, an optimized single-precision model can be obtained according to the existing normal single-precision training process and used as the initial model, serving as the basis for the subsequent quantization fine-tuning training.
For the model parameters of the initial model, the following first processing may be performed: quantizing the weight matrix (namely the left matrix) in the model parameters by rows, and performing inverse quantization on the quantization result to obtain the processed model parameters; performing forward calculation and backward calculation according to the processed model parameters to obtain the model parameter gradient; and updating the model parameters according to the model parameter gradient, then repeating the first processing for the updated model parameters until a predetermined end condition is met.
Based on the above description, FIG. 2 is a schematic diagram of the quantization fine-tuning training process described in the present application. As shown in FIG. 2, before the training of each batch of training data begins, the weight matrix may first be quantized by rows for the latest model parameters. That is, the reference value corresponding to each row of elements in the weight matrix is determined, and for each element in the weight matrix, the quantized value of the element is determined according to the reference value corresponding to the row where the element is located and the predetermined bit width: for each row of elements, the maximum value among the absolute values of the elements included in the row may be taken as the reference value corresponding to the row; and for each element, the quotient of the value of the element and the reference value corresponding to its row may be calculated, with the product of the obtained quotient and 2^(B-1) taken as the quantized value of the element, where B represents the bit width. Floating-point parameter values can then be recovered through inverse quantization, yielding the processed model parameters. Next, forward calculation and backward calculation can be performed in sequence with the processed model parameters and the training data to obtain the model parameter gradient. Further, the model parameters may be updated according to the model parameter gradient, and the above process repeated for the updated model parameters until a predetermined end condition is met.
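As an illustration of the per-batch step shown in FIG. 2, the sketch below fake-quantizes the weight matrix, i.e. quantization immediately followed by inverse quantization, so the forward pass sees floating-point values carrying the quantization error. The forward, backward, and update calls are hypothetical placeholders; the text leaves those details to existing techniques.

```python
def fake_quantize_rows(W, B=4):
    """Row-wise quantize-dequantize of the weight matrix, as performed
    on the latest model parameters before each training batch."""
    ref = np.abs(W).max(axis=1, keepdims=True)
    ref = np.where(ref == 0.0, 1.0, ref)
    s = 2 ** (B - 1)
    # Quantize per equation (1), then immediately invert the scaling so
    # the result is a floating-point matrix carrying quantization error.
    return np.round(W / ref * s) / s * ref

# One fine-tuning step, in outline (forward/backward/update are
# hypothetical placeholders, not APIs from the text):
#   W_processed = fake_quantize_rows(W, B)
#   grad = backward(forward(W_processed, batch))
#   W -= learning_rate * grad    # the float master weights are updated
```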
In the above process, how to perform inverse quantization, how to obtain the model parameter gradient, and how to update the model parameters according to the gradient can all be implemented using existing techniques. In addition, in practical applications, the above processing is preferably performed in order from the input layer to the output layer until the predetermined end condition is met.
The predetermined end condition may be set according to actual needs; for example, it may be that the precision or the volume of the model reaches expectations.
It should be noted that the foregoing method embodiments are described as a series of acts or combinations for simplicity in explanation, but it should be understood by those skilled in the art that the present application is not limited by the order of acts described, as some steps may occur in other orders or concurrently depending on the application. Further, those skilled in the art should also appreciate that the embodiments described in the specification are preferred embodiments and that the acts and modules referred to are not necessarily required in this application.
The above is a description of method embodiments, and the embodiments of the present application are further described below by way of apparatus embodiments.
Fig. 3 is a schematic diagram of a structure of a model compressing apparatus 30 according to an embodiment of the present application. As shown in fig. 3, includes: a quantization module 301, an operation module 302 and an inverse quantization module 303.
The quantization module 301 is configured to quantize a left matrix of the two multiplied matrices according to rows to obtain a first quantization matrix, and quantize a right matrix of the two multiplied matrices according to columns to obtain a second quantization matrix, when the single-precision floating-point matrix operation is required in the model inference process.
And an operation module 302, configured to multiply the first quantization matrix and the second quantization matrix to obtain a third matrix as a result of the fixed-point operation.
And an inverse quantization module 303, configured to perform inverse quantization according to the third matrix to obtain a fourth matrix, and use the fourth matrix as a result of matrix operation.
When the left matrix is quantized, the quantization module 301 may determine a reference value corresponding to each row of elements in the left matrix, and may determine, for each element in the left matrix, a quantization value of the element according to the reference value corresponding to the row of the element and a predetermined bit width.
When the right matrix is quantized, the quantization module 301 may determine a reference value corresponding to each column of elements in the right matrix, and may determine, for each element in the right matrix, a quantization value of the element according to the reference value corresponding to the column of the element and a predetermined bit width.
Preferably, the quantization module 301 may take, for each row of elements in the left matrix, the maximum value among the absolute values of the elements included in that row as the reference value.
Similarly, the quantization module 301 may take, for each column of elements in the right matrix, the maximum value among the absolute values of the elements included in that column as the reference value.
In addition, the quantization module 301 may, for each element in the left matrix, calculate the quotient of the value of the element and the reference value corresponding to the row where the element is located, and take the product of the quotient and 2^(B-1) as the quantized value of the element, where B denotes the predetermined bit width.
Similarly, the quantization module 301 may, for each element in the right matrix, calculate the quotient of the value of the element and the reference value corresponding to the column where the element is located, and take the product of the quotient and 2^(B-1) as the quantized value of the element.
The operation module 302 may multiply the first quantization matrix and the second quantization matrix to obtain a third matrix as a result of the fixed-point operation.
Then, the inverse quantization module 303 may perform the following processing for each element in the fourth matrix respectively: and calculating the product of the value of the corresponding element of the element in the third matrix, the reference value corresponding to the row of the element and the reference value corresponding to the column of the element, and taking the product as the value of the element, wherein the corresponding element is the element at the same position.
As shown in fig. 3, the apparatus may further include: a preprocessing module 300, configured to train in a single-precision training manner to obtain a single-precision model as the initial model, and to perform quantization fine-tuning training on the model parameters of the initial model to obtain the final model.
Generally, the left matrix is a weight matrix. The preprocessing module 300 may perform the following first processing on the model parameters: quantizing the weight matrix in the model parameters by rows, and performing inverse quantization on the quantization result to obtain the processed model parameters; performing forward calculation and backward calculation according to the processed model parameters to obtain the model parameter gradient; and updating the model parameters according to the model parameter gradient, then repeating the first processing for the updated model parameters until a predetermined end condition is met.
For a specific work flow of the apparatus embodiment shown in fig. 3, reference is made to the related description in the foregoing method embodiment, and details are not repeated.
In a word, with the scheme of the apparatus embodiment of the present application, single-precision floating point can be converted into fixed point through quantization processing, so that fixed-point operation replaces floating-point operation during matrix operations in the model inference process, which effectively compresses the model volume and improves the model inference speed; in addition, the scheme is applicable to various model structures and various architectures, and thus has universal applicability.
The scheme can be applied in the field of artificial intelligence, and in particular relates to fields such as deep learning and speech recognition. Artificial intelligence is the discipline of making computers simulate certain human thinking processes and intelligent behaviors (such as learning, reasoning, thinking, and planning), and covers both hardware and software technologies: artificial intelligence hardware technologies generally include sensors, dedicated artificial intelligence chips, cloud computing, distributed storage, and big data processing; artificial intelligence software technologies mainly include computer vision, speech recognition, natural language processing, machine learning/deep learning, big data processing, and knowledge graph technologies.
According to an embodiment of the present application, an electronic device and a readable storage medium are also provided.
Fig. 4 is a block diagram of an electronic device according to the method of the embodiment of the present application. Electronic devices are intended to represent various forms of digital computers, such as laptops, desktops, workstations, personal digital assistants, servers, blade servers, mainframes, and other appropriate computers. The electronic device may also represent various forms of mobile devices, such as personal digital processing, cellular phones, smart phones, wearable devices, and other similar computing devices. The components shown herein, their connections and relationships, and their functions, are meant to be examples only, and are not meant to limit implementations of the present application that are described and/or claimed herein.
As shown in fig. 4, the electronic apparatus includes: one or more processors Y01, a memory Y02, and interfaces for connecting the various components, including a high-speed interface and a low-speed interface. The various components are interconnected using different buses and may be mounted on a common motherboard or in other manners as desired. The processor may process instructions for execution within the electronic device, including instructions stored in or on the memory to display graphical information for a graphical user interface on an external input/output device (such as a display device coupled to the interface). In other embodiments, multiple processors and/or multiple buses may be used, along with multiple memories, as desired. Also, multiple electronic devices may be connected, with each device providing portions of the necessary operations (e.g., as a server array, a group of blade servers, or a multi-processor system). In fig. 4, one processor Y01 is taken as an example.
Memory Y02 is a non-transitory computer readable storage medium as provided herein. Wherein the memory stores instructions executable by at least one processor to cause the at least one processor to perform the methods provided herein. The non-transitory computer readable storage medium of the present application stores computer instructions for causing a computer to perform the methods provided herein.
Memory Y02 is provided as a non-transitory computer readable storage medium that can be used to store non-transitory software programs, non-transitory computer executable programs, and modules, such as program instructions/modules corresponding to the methods of the embodiments of the present application. The processor Y01 executes various functional applications of the server and data processing, i.e., implements the method in the above-described method embodiments, by executing non-transitory software programs, instructions, and modules stored in the memory Y02.
The memory Y02 may include a storage program area and a storage data area, wherein the storage program area may store an operating system, an application program required for at least one function; the storage data area may store data created according to use of the electronic device, and the like. Additionally, the memory Y02 may include high speed random access memory, and may also include non-transitory memory, such as at least one magnetic disk storage device, flash memory device, or other non-transitory solid state storage device. In some embodiments, memory Y02 may optionally include memory located remotely from processor Y01, which may be connected to the electronic device via a network. Examples of such networks include, but are not limited to, the internet, intranets, blockchain networks, local area networks, mobile communication networks, and combinations thereof.
The electronic device may further include: an input device Y03 and an output device Y04. The processor Y01, the memory Y02, the input device Y03 and the output device Y04 may be connected by a bus or other means, and the bus connection is exemplified in fig. 4.
The input device Y03 may receive input numeric or character information and generate key signal inputs related to user settings and function control of the electronic device, such as a touch screen, keypad, mouse, track pad, touch pad, pointer, one or more mouse buttons, track ball, joystick, or other input device. The output device Y04 may include a display device, an auxiliary lighting device, a tactile feedback device (e.g., a vibration motor), and the like. The display device may include, but is not limited to, a liquid crystal display, a light emitting diode display, and a plasma display. In some implementations, the display device can be a touch screen.
Various implementations of the systems and techniques described here can be realized in digital electronic circuitry, integrated circuitry, application specific integrated circuits, computer hardware, firmware, software, and/or combinations thereof. These various embodiments may include: implemented in one or more computer programs that are executable and/or interpretable on a programmable system including at least one programmable processor, which may be special or general purpose, receiving data and instructions from, and transmitting data and instructions to, a storage system, at least one input device, and at least one output device.
These computer programs (also known as programs, software applications, or code) include machine instructions for a programmable processor, and may be implemented using high-level procedural and/or object-oriented programming languages, and/or assembly/machine languages. As used herein, the terms "machine-readable medium" and "computer-readable medium" refer to any computer program product, apparatus, and/or device (e.g., magnetic discs, optical disks, memory, programmable logic devices) used to provide machine instructions and/or data to a programmable processor, including a machine-readable medium that receives machine instructions as a machine-readable signal. The term "machine-readable signal" refers to any signal used to provide machine instructions and/or data to a programmable processor.
To provide for interaction with a user, the systems and techniques described here can be implemented on a computer having: a display device (e.g., a cathode ray tube or a liquid crystal display monitor) for displaying information to a user; and a keyboard and a pointing device (e.g., a mouse or a trackball) by which a user can provide input to the computer. Other kinds of devices may also be used to provide for interaction with a user; for example, feedback provided to the user can be any form of sensory feedback (e.g., visual feedback, auditory feedback, or tactile feedback); and input from the user may be received in any form, including acoustic, speech, or tactile input.
The systems and techniques described here can be implemented in a computing system that includes a back-end component (e.g., as a data server), or that includes a middleware component (e.g., an application server), or that includes a front-end component (e.g., a user computer having a graphical user interface or a web browser through which a user can interact with an implementation of the systems and techniques described here), or any combination of such back-end, middleware, or front-end components. The components of the system can be interconnected by any form or medium of digital data communication (e.g., a communication network). Examples of communication networks include: local area networks, wide area networks, blockchain networks, and the internet.
The computer system may include clients and servers. A client and server are generally remote from each other and typically interact through a communication network. The relationship of client and server arises by virtue of computer programs running on the respective computers and having a client-server relationship to each other. The server can be a cloud server, also called a cloud computing server or cloud host, a host product in a cloud computing service system, which overcomes the defects of high management difficulty and weak service expansibility in traditional physical hosts and VPS (Virtual Private Server) services.
It should be understood that various forms of the flows shown above may be used, with steps reordered, added, or deleted. For example, the steps described in the present application may be executed in parallel, sequentially, or in different orders, as long as the desired results of the technical solutions disclosed in the present application can be achieved; no limitation is imposed herein.
The above-described embodiments should not be construed as limiting the scope of the present application. It should be understood by those skilled in the art that various modifications, combinations, sub-combinations and substitutions may be made in accordance with design requirements and other factors. Any modification, equivalent replacement, and improvement made within the spirit and principle of the present application shall be included in the protection scope of the present application.

Claims (16)

1. A method of model compression, comprising:
when matrix operation of a single-precision floating point is required in the model reasoning process, quantizing a left matrix of two multiplied matrixes according to rows to obtain a first quantization matrix, and quantizing a right matrix of the two multiplied matrixes according to columns to obtain a second quantization matrix;
multiplying the first quantization matrix and the second quantization matrix to obtain a third matrix serving as a fixed-point operation result;
and carrying out inverse quantization according to the third matrix to obtain a fourth matrix, and taking the fourth matrix as the result of the matrix operation.
2. The method of claim 1, wherein,
the quantizing the left matrix of the two multiplied matrices by rows comprises: respectively determining a reference value corresponding to each row of elements in the left matrix; for each element in the left matrix, determining a quantization value of the element according to a reference value corresponding to a row where the element is located and a preset bit width;
the quantizing the right matrix of the multiplied two matrices by columns comprises: respectively determining a reference value corresponding to each column of elements in the right matrix; and aiming at each element in the right matrix, determining a quantization value of the element according to a reference value corresponding to the column of the element and a preset bit width.
3. The method of claim 2, wherein,
the respectively determining the reference value corresponding to each row of elements in the left matrix comprises: regarding each row of elements in the left matrix, respectively taking the maximum value among the absolute values of the elements included in the row as the reference value;
the respectively determining the reference value corresponding to each column of elements in the right matrix comprises: regarding each column of elements in the right matrix, respectively taking the maximum value among the absolute values of the elements included in the column as the reference value.
4. The method of claim 2, wherein,
the determining, for each element in the left matrix, a quantization value of the element according to a reference value corresponding to a row in which the element is located and a predetermined bit width comprises: calculating the quotient of the value of the element and the reference value corresponding to the row of the element, and taking the product of the quotient and 2^(B-1) as the quantized value of the element, wherein B represents the bit width;
the determining, for each element in the right matrix, a quantization value of the element according to a reference value corresponding to a column in which the element is located and a predetermined bit width comprises: calculating the quotient of the value of the element and the reference value corresponding to the column of the element, and taking the product of the quotient and 2^(B-1) as the quantized value of the element.
5. The method of claim 2, wherein the inverse quantizing according to the third matrix to obtain a fourth matrix comprises:
for each element in the fourth matrix, respectively performing the following processing: and calculating the product of the value of the corresponding element of the element in the third matrix, the reference value corresponding to the row of the element and the reference value corresponding to the column of the element, and taking the product as the value of the element, wherein the corresponding element is the element at the same position.
6. The method of claim 1, further comprising:
training according to a single-precision training mode to obtain a single-precision model as an initial model;
and carrying out quantization fine-tuning training on the model parameters of the initial model to obtain a final model.
7. The method of claim 6, wherein the left matrix is a weight matrix;
the performing quantitative fine tuning training on the model parameters of the initial model comprises:
for the model parameters, the following first processing is performed:
quantizing the weight matrix in the model parameters according to rows, and carrying out inverse quantization on quantization results to obtain processed model parameters;
performing forward calculation and backward calculation according to the processed model parameters to obtain a model parameter gradient;
and updating the model parameters according to the gradient of the model parameters, and repeatedly executing the first processing aiming at the updated model parameters until a preset ending condition is met.
8. A model compression apparatus, comprising: a quantization module, an operation module, and an inverse quantization module;
the quantization module is used for quantizing a left matrix of the two multiplied matrixes according to rows to obtain a first quantization matrix and quantizing a right matrix of the two multiplied matrixes according to columns to obtain a second quantization matrix when the matrix operation of a single-precision floating point is required in the model reasoning process;
the operation module is used for multiplying the first quantization matrix and the second quantization matrix to obtain a third matrix serving as a fixed-point operation result;
and the inverse quantization module is used for carrying out inverse quantization according to the third matrix to obtain a fourth matrix, and the fourth matrix is used as the matrix operation result.
9. The apparatus of claim 8, wherein,
the quantization module respectively determines a reference value corresponding to each row of elements in the left matrix, and determines a quantization value of each element in the left matrix according to the reference value corresponding to the row of the element and a preset bit width;
and the quantization module respectively determines a reference value corresponding to each column of elements in the right matrix, and determines a quantization value of each element according to the reference value corresponding to the column of the element and a preset bit width aiming at each element in the right matrix.
10. The apparatus of claim 9, wherein,
the quantization module is used for taking, for each row of elements in the left matrix, the maximum value among the absolute values of the elements included in the row as the reference value;
and the quantization module is used for taking, for each column of elements in the right matrix, the maximum value among the absolute values of the elements included in the column as the reference value.
11. The apparatus of claim 9, wherein,
the quantization module calculates, for each element in the left matrix, the quotient of the value of the element and the reference value corresponding to the row where the element is located, and takes the product of the quotient and 2^(B-1) as the quantized value of the element, wherein B represents the bit width;
the quantization module calculates, for each element in the right matrix, the quotient of the value of the element and the reference value corresponding to the column where the element is located, and takes the product of the quotient and 2^(B-1) as the quantized value of the element.
12. The apparatus of claim 9, wherein,
the inverse quantization module performs the following processing respectively for each element in the fourth matrix: and calculating the product of the value of the corresponding element of the element in the third matrix, the reference value corresponding to the row of the element and the reference value corresponding to the column of the element, and taking the product as the value of the element, wherein the corresponding element is the element at the same position.
13. The apparatus of claim 8, further comprising: a preprocessing module;
the preprocessing module is used for training in a single-precision training manner to obtain a single-precision model as an initial model, and for performing quantization fine-tuning training on the model parameters of the initial model to obtain a final model.
14. The apparatus of claim 13, wherein the left matrix is a weight matrix;
the preprocessing module performs the following first processing for the model parameters: quantizing the weight matrix in the model parameters according to rows, and carrying out inverse quantization on quantization results to obtain processed model parameters; performing forward calculation and backward calculation according to the processed model parameters to obtain a model parameter gradient; and updating the model parameters according to the gradient of the model parameters, and repeatedly executing the first processing aiming at the updated model parameters until a preset ending condition is met.
15. An electronic device, comprising:
at least one processor; and
a memory communicatively coupled to the at least one processor; wherein:
the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the method of any one of claims 1-7.
16. A non-transitory computer readable storage medium having stored thereon computer instructions for causing the computer to perform the method of any one of claims 1-7.
CN202011247207.0A 2020-11-10 2020-11-10 Model compression method and device, electronic equipment and storage medium Pending CN112529189A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011247207.0A CN112529189A (en) 2020-11-10 2020-11-10 Model compression method and device, electronic equipment and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202011247207.0A CN112529189A (en) 2020-11-10 2020-11-10 Model compression method and device, electronic equipment and storage medium

Publications (1)

Publication Number Publication Date
CN112529189A true CN112529189A (en) 2021-03-19

Family

ID=74980098

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011247207.0A Pending CN112529189A (en) 2020-11-10 2020-11-10 Model compression method and device, electronic equipment and storage medium

Country Status (1)

Country Link
CN (1) CN112529189A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2023207039A1 (en) * 2022-04-28 2023-11-02 北京百度网讯科技有限公司 Data processing method and apparatus, and device and storage medium
CN116992965A (en) * 2023-09-27 2023-11-03 之江实验室 Reasoning method, device, computer equipment and storage medium of transducer large model

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108628807A (en) * 2017-03-20 2018-10-09 北京百度网讯科技有限公司 Processing method, device, equipment and the computer readable storage medium of floating-point matrix number
US10366344B1 (en) * 2016-03-31 2019-07-30 Symantec Corporation Systems and methods for selecting features for classification
CN110610237A (en) * 2019-09-17 2019-12-24 普联技术有限公司 Quantitative training method and device of model and storage medium
US20200218982A1 (en) * 2019-01-04 2020-07-09 Microsoft Technology Licensing, Llc Dithered quantization of parameters during training with a machine learning tool
CN111429142A (en) * 2020-06-10 2020-07-17 腾讯科技(深圳)有限公司 Data processing method and device and computer readable storage medium
CN111667054A (en) * 2020-06-05 2020-09-15 北京百度网讯科技有限公司 Method and device for generating neural network model, electronic equipment and storage medium

Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US10366344B1 (en) * 2016-03-31 2019-07-30 Symantec Corporation Systems and methods for selecting features for classification
CN108628807A (en) * 2017-03-20 2018-10-09 北京百度网讯科技有限公司 Processing method, device, equipment and the computer readable storage medium of floating-point matrix number
US20200218982A1 (en) * 2019-01-04 2020-07-09 Microsoft Technology Licensing, Llc Dithered quantization of parameters during training with a machine learning tool
CN110610237A (en) * 2019-09-17 2019-12-24 普联技术有限公司 Quantitative training method and device of model and storage medium
CN111667054A (en) * 2020-06-05 2020-09-15 北京百度网讯科技有限公司 Method and device for generating neural network model, electronic equipment and storage medium
CN111429142A (en) * 2020-06-10 2020-07-17 腾讯科技(深圳)有限公司 Data processing method and device and computer readable storage medium

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2023207039A1 (en) * 2022-04-28 2023-11-02 北京百度网讯科技有限公司 Data processing method and apparatus, and device and storage medium
CN116992965A (en) * 2023-09-27 2023-11-03 之江实验室 Reasoning method, device, computer equipment and storage medium of transducer large model
CN116992965B (en) * 2023-09-27 2024-01-09 之江实验室 Reasoning method, device, computer equipment and storage medium of transducer large model

Similar Documents

Publication Publication Date Title
CN111667054B (en) Method, device, electronic equipment and storage medium for generating neural network model
CN111598216B (en) Method, device and equipment for generating student network model and storage medium
CN111079938B (en) Question-answer reading understanding model obtaining method and device, electronic equipment and storage medium
CN111539227B (en) Method, apparatus, device and computer storage medium for training semantic representation model
KR102602195B1 (en) Quantization of trained long-short-term memory neural networks
CN110705696B (en) Quantization and fixed-point fusion method and device for neural network
CN110807331B (en) Polyphone pronunciation prediction method and device and electronic equipment
EP3923205A1 (en) Method and apparatus for distilling model, electronic device, and storage medium
CN111738419B (en) Quantification method and device for neural network model
CN114612749B (en) Neural network model training method and device, electronic device and medium
CN110795569A (en) Method, device and equipment for generating vector representation of knowledge graph
CN112529189A (en) Model compression method and device, electronic equipment and storage medium
CN111563593A (en) Training method and device of neural network model
CN111767833A (en) Model generation method and device, electronic equipment and storage medium
EP3852013A1 (en) Method, apparatus, and storage medium for predicting punctuation in text
EP3832792A2 (en) Filter debugging method, device, electronic apparatus, readable storage medium and computer program product
CN111666077B (en) Operator processing method and device, electronic equipment and storage medium
CN111241838A (en) Text entity semantic relation processing method, device and equipment
CN114494814A (en) Attention-based model training method and device and electronic equipment
KR20220038607A (en) Method, apparatus, electronic device and recording medium for implementing dot product operation
CN111311000B (en) User consumption behavior prediction model training method, device, equipment and storage medium
CN111767832A (en) Model generation method and device, electronic equipment and storage medium
CN111753759A (en) Model generation method and device, electronic equipment and storage medium
JP2022512211A (en) Image processing methods, equipment, in-vehicle computing platforms, electronic devices and systems
CN113642654B (en) Image feature fusion method and device, electronic equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination