CN118035628A - Matrix vector multiplication operator realization method and device supporting mixed bit quantization

Matrix vector multiplication operator realization method and device supporting mixed bit quantization

Info

Publication number
CN118035628A
CN118035628A
Authority
CN
China
Prior art keywords
precision
bit
quantization
matrix
vector
Prior art date
Legal status
Granted
Application number
CN202410433080.3A
Other languages
Chinese (zh)
Other versions
CN118035628B (en)
Inventor
汪玉
洪可
毛秋力
Current Assignee
Tsinghua University
Original Assignee
Tsinghua University
Priority date
Filing date
Publication date
Application filed by Tsinghua University filed Critical Tsinghua University
Priority to CN202410433080.3A priority Critical patent/CN118035628B/en
Publication of CN118035628A publication Critical patent/CN118035628A/en
Application granted granted Critical
Publication of CN118035628B publication Critical patent/CN118035628B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Landscapes

  • Compression, Expansion, Code Conversion, And Decoders (AREA)

Abstract

The application relates to the technical field of deep neural networks, and in particular to a matrix vector multiplication operator implementation method and device supporting mixed bit quantization. The method comprises the following steps: acquiring a weight matrix of a quantized large language model and an activation vector of half-precision floating-point precision; loading the weight matrix into registers using a vector access method for mixed-bit-precision data; and dequantizing the weight matrix in the registers to half-precision floating-point precision using a dequantization method based on an interleaved thread arrangement strategy, and obtaining the matrix-vector multiplication result in combination with the half-precision activation vector. This solves problems in the related art in which a mixed-precision-quantized weight matrix tends to cause inefficient memory access and instruction branches during dequantization.

Description

Matrix vector multiplication operator realization method and device supporting mixed bit quantization
Technical Field
The application relates to the technical field of deep neural networks, and in particular to a matrix vector multiplication operator implementation method and device supporting mixed bit quantization.
Background
In recent years, LLMs (Large Language Models) based on the Transformer structure have shown outstanding performance in natural language processing tasks such as intelligent question answering and text generation, but parameter counts in the billions often pose challenges to practical deployment. Parameter compression technologies such as quantization and sparsification can effectively reduce the huge storage, memory access and computation costs brought by the LLM parameter scale. Among them, quantization technology has matured: a large body of quantization work can effectively reduce the parameter scale while preserving LLM inference quality, and LLM inference acceleration can be achieved through several effective GEMV (Generalized Matrix-Vector Multiplication) / GEMM (Generalized Matrix-Matrix Multiplication) operators based on PTQ (Post-Training Quantization) weight quantization. However, these operators assume weights quantized at a single precision and cannot effectively handle the GEMV/GEMM computation corresponding to weights quantized at mixed bit precision.
In the related art, mixed-precision quantization has gradually become a promising quantization scheme: because the data in the LLM weights differ in importance, mixed-precision quantization can compress the parameter scale to the greatest extent while preserving the relatively important information.
However, in the related art, when loading a mixed-precision-quantized weight matrix, the number of bits loaded at a time differs for data quantized at different bit widths, so a single access mode often cannot achieve efficient access to all the data; moreover, because data quantized at different precisions require different dequantization procedures, a mixed-precision-quantized weight matrix easily causes instruction branches on SIMD (Single Instruction Multiple Data) architectures. Improvement is therefore needed.
Disclosure of Invention
The application provides a matrix vector multiplication operator implementation method and device supporting mixed bit quantization, an electronic device and a storage medium, which are used to solve the problems in the related art that a mixed-precision-quantized weight matrix tends to cause inefficient memory access and instruction branches during dequantization.
An embodiment of a first aspect of the present application provides a method for implementing a matrix vector multiplication operator supporting mixed bit quantization, including the following steps: acquiring a weight matrix of a quantized large language model and an activation vector of half-precision floating-point precision; loading the weight matrix into registers using a vector access method for mixed-bit-precision data; and dequantizing the weight matrix in the registers to half-precision floating-point precision using a dequantization method based on an interleaved thread arrangement strategy, and obtaining the matrix-vector multiplication result in combination with the half-precision activation vector.
Optionally, in one embodiment of the present application, obtaining the quantized weight matrix and the activation vector of half-precision floating-point precision includes: performing group quantization on the weight matrix along the input-channel dimension to obtain multiple groups of input channels and generate the quantized weight matrix, wherein input channels of a preset group size are divided into a corresponding group, and input channels in the same group use the same quantization parameters.
Optionally, in one embodiment of the present application, generating the weight matrix includes: quantizing different groups of input channels among the multiple groups of input channels with their corresponding quantization precision; and rearranging groups that use the same quantization precision offline so that they are placed at adjacent positions.
Optionally, in an embodiment of the present application, loading the weight matrix into registers using a vector access method for mixed-bit-precision data includes: for 4-bit data, 16 8-bit integer type data are loaded using 4 access instructions; for 3-bit data, 12 8-bit integer type data are loaded using 3 access instructions; for 2-bit data, 8 8-bit integer type data are loaded using 2 access instructions; for 1-bit data, 4 8-bit integer type data are loaded using 1 access instruction.
Optionally, in an embodiment of the present application, dequantizing the weight matrix in the registers to half-precision floating-point precision using a dequantization method based on an interleaved thread arrangement strategy, and obtaining the matrix-vector multiplication result in combination with the half-precision activation vector, includes: based on a preset interleaved thread arrangement, adding a preset step length between the next dequantization address and the previous dequantization address of each thread in the same thread bundle, so that multiple threads in the same thread bundle execute the same dequantization function.
An embodiment of a second aspect of the present application provides a matrix vector multiplication operator implementation apparatus supporting mixed bit quantization, including: an acquisition module, configured to acquire a weight matrix of a quantized large language model and an activation vector of half-precision floating-point precision; a loading module, configured to load the weight matrix into registers using a vector access method for mixed-bit-precision data; and a calculation module, configured to dequantize the weight matrix in the registers to half-precision floating-point precision using a dequantization method based on an interleaved thread arrangement strategy, and obtain the matrix-vector multiplication result in combination with the half-precision activation vector.
Optionally, in one embodiment of the present application, the acquisition module includes: a generation unit, configured to perform group quantization on the weight matrix along the input-channel dimension to obtain multiple groups of input channels and generate the quantized weight matrix, wherein input channels of a preset group size are divided into a corresponding group, and input channels in the same group use the same quantization parameters.
Optionally, in one embodiment of the present application, the generating unit includes: a quantization subunit, configured to quantize different groups of input channels among the multiple groups of input channels with their corresponding quantization precision; and an arrangement subunit, configured to rearrange groups that use the same quantization precision offline so that they are placed at adjacent positions.
Optionally, in one embodiment of the present application, the loading module includes: a first loading unit for loading 16 8-bit integer type data using 4 access instructions for 4-bit data; a second loading unit for loading 12 8-bit integer type data using 3 access instructions for 3-bit data; a third loading unit for loading 8 8-bit integer type data using 2 access instructions for 2-bit data; and a fourth loading unit for loading 4 8-bit integer type data using 1 access instruction for 1-bit data.
Optionally, in one embodiment of the present application, the computing module includes: the thread scheduling unit is used for adding a preset step length between the next dequantization address and the last dequantization address of each thread in the same thread bundle based on a preset interleaving thread arrangement mode so as to enable a plurality of threads in the same thread bundle to execute the same dequantization function.
An embodiment of a third aspect of the present application provides an electronic device, including: a memory and a processor, wherein the memory stores a computer program capable of running on the processor, and the processor executes the program to implement the matrix vector multiplication operator implementation method supporting mixed bit quantization as described in the above embodiments.
An embodiment of a fourth aspect of the present application provides a computer-readable storage medium storing a computer program which, when executed by a processor, implements the matrix vector multiplication operator implementation method supporting mixed bit quantization as described above.
An embodiment of a fifth aspect of the present application provides a computer program product comprising a computer program which, when executed, implements the matrix vector multiplication operator implementation method supporting mixed bit quantization as described above.
According to the embodiments of the application, the quantized weight matrix and the half-precision activation vector are obtained, the weight matrix is loaded into registers using a vector access method for mixed-bit-precision data, and the weight matrix in the registers is dequantized to half-precision floating-point precision using a dequantization method based on an interleaved thread arrangement strategy, so that the matrix-vector multiplication result is obtained in combination with the half-precision activation vector. Vector access can thus be achieved for mixed-bit-precision data, maximizing bandwidth utilization, and threads in the same thread bundle can execute the same instruction, avoiding thread-bundle-level instruction branches. This solves the problems in the related art that, when loading a mixed-precision-quantized weight matrix, a single access mode cannot efficiently access data quantized at different bit widths, and that the differing dequantization procedures of different data easily cause instruction branches on SIMD architectures.
Additional aspects and advantages of the application will be set forth in part in the description which follows and, in part, will be obvious from the description, or may be learned by practice of the application.
Drawings
The foregoing and/or additional aspects and advantages of the application will become apparent and readily appreciated from the following description of the embodiments, taken in conjunction with the accompanying drawings, in which:
FIG. 1 is a flow chart of the calculation of a Decoder-only structure generative large model according to an embodiment of the present application;
FIG. 2 is a flow chart of a method for implementing a matrix vector multiplier supporting mixed bit quantization according to an embodiment of the present application;
FIG. 3 is a schematic diagram of GEMV operator implementation of PTQ weight 1-4 bit mixed precision quantization according to one embodiment of the present application;
FIG. 4 is a schematic diagram of a vector access method for 1- to 4-bit mixed data according to one embodiment of the present application;
FIG. 5 is a schematic diagram of a dequantization method based on an interleaved thread placement strategy according to an embodiment of the application;
fig. 6 is a schematic structural diagram of a matrix vector multiplier implementation device supporting mixed bit quantization according to an embodiment of the present application;
fig. 7 is a schematic structural diagram of an electronic device according to an embodiment of the present application.
Detailed Description
Embodiments of the present application are described in detail below, examples of which are illustrated in the accompanying drawings, wherein like or similar reference numerals refer to like or similar elements or elements having like or similar functions throughout. The embodiments described below by referring to the drawings are illustrative and intended to explain the present application and should not be construed as limiting the application.
The method, the device, the electronic equipment and the storage medium for realizing the matrix vector multiplication operator supporting mixed bit quantization are described below with reference to the accompanying drawings.
Before introducing the matrix vector multiplication operator implementation method supporting mixed bit quantization in the embodiments of the application, the related art is briefly introduced. Quantization technology is mainly divided into two types, PTQ and QAT (Quantization Aware Training). PTQ performs low-bit quantization on an already trained LLM, does not require the LLM training process, has lower usage cost, and is therefore widely used at present. PTQ-based work in turn includes quantization of the activations and quantization of the weights. Because the weights are the main LLM parameters, quantizing only the weights can effectively reduce the parameter scale and thereby reduce the cost of memory access and storage.
For PTQ-based LLM weight quantization, the source of inference acceleration is the reduction in the amount of weight-matrix data that must be moved, which greatly reduces the memory access overhead during computation. Most mainstream generative large models now adopt a decoder-only structure; as shown in FIG. 1, actual inference involves two stages, prefill and decoding. When the batch size is 1, the activation in the decoding stage is the feature of a single word (token) and is represented as a vector during computation, so the linear layers of the decoding stage mainly perform multiplication of the weight matrix with a feature vector, i.e., GEMV. When the batch size is greater than 1 (typically a power of 2 between 2 and 64), the linear-layer computation in the decoding stage is mainly the multiplication of the weight matrix with a combination of several feature vectors, i.e., a GEMM with a small input dimension, which can be regarded as an extension of GEMV. Using a corresponding quantization technique greatly reduces the total number of bits that GEMV and GEMM load for the weights.
For LLM weight quantization in the related art, the number of bits loaded at a time differs for data quantized at different bit widths when loading a mixed-precision-quantized weight matrix, so a single access mode often cannot achieve efficient access to all the data; and because data quantized at different precisions require different dequantization procedures, instruction branches are easily caused on SIMD architectures. The application provides a matrix vector multiplication operator implementation method supporting mixed bit quantization. In this method, the weight matrix of a quantized large language model and an activation vector of half-precision floating-point precision are acquired, the weight matrix is loaded into registers using a vector access method for mixed-bit-precision data, and the weight matrix in the registers is dequantized to half-precision floating-point precision using a dequantization method based on an interleaved thread arrangement strategy, so that the matrix-vector multiplication result is obtained in combination with the half-precision activation vector. Vector access can thus be achieved for mixed-bit-precision data (for example, 1- to 4-bit data), maximizing bandwidth utilization, and threads in the same thread bundle can execute the same instruction, avoiding thread-bundle-level instruction branches. This solves the above problems of inefficient access and instruction branching in the related art.
Specifically, fig. 2 is a schematic flow chart of a matrix vector multiplier implementation method supporting mixed bit quantization according to an embodiment of the present application.
As shown in fig. 2, the matrix vector multiplier implementation method supporting mixed bit quantization includes the following steps:
in step S201, a quantized weight matrix and an activation vector of half-precision floating-point number precision are acquired.
It will be appreciated that quantization refers to the process of converting a floating-point model into a fixed-point model, which may be performed offline in advance, and that half-precision floating-point precision refers to representing data with half-precision floating-point numbers (FP16, i.e., 16-bit floating-point numbers).
Specifically, referring to fig. 3, in an embodiment of the present application, a quantized weight matrix (for example, quantized offline) and an activation vector of half-precision floating-point precision may be obtained as the input of the matrix vector multiplication operator. The quantized weight matrix adopts group quantization along the input-channel dimension: input channels of a preset group size form one group, and input channels of the same group use the same quantization parameters. For a mixed-precision-quantized weight matrix, different groups of input channels may choose the same or different precision for quantization; for example, mixed-precision quantization with any combination of 1 to 4 bits can be supported. Further, for convenience of storage and access, input-channel groups using the same quantization precision are arranged at adjacent positions through offline rearrangement.
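As an illustrative sketch of the offline rearrangement described above (this is not the patent's own code; the function name reorder_groups_by_precision and the per-group bit-width list group_bits are assumptions introduced for this example), the groups can be reordered on the host so that all groups sharing a quantization precision become contiguous:
#include <algorithm>
#include <cstdint>
#include <vector>

// Returns a permutation of group indices that places input-channel groups with the
// same bit width at adjacent positions (one contiguous segment per precision).
std::vector<int> reorder_groups_by_precision(const std::vector<int>& group_bits) {
    std::vector<int> order(group_bits.size());
    for (size_t i = 0; i < order.size(); ++i) order[i] = static_cast<int>(i);
    // Stable sort keeps the original relative order within each precision class.
    std::stable_sort(order.begin(), order.end(),
                     [&](int a, int b) { return group_bits[a] > group_bits[b]; });
    return order;  // order[k] = original index of the k-th group after rearrangement
}
The input-channel groups of the quantized weight matrix, together with their per-group quantization parameters, would then be permuted according to this order before the operator is launched.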
According to the embodiment of the application, obtaining the quantized weight matrix reduces the model storage requirement and saves storage space. Using the quantized weight matrix and the half-precision activation vector for inference reduces the amount of computation, improves computational efficiency and the real-time performance of the model, and reduces power consumption and latency.
Optionally, in one embodiment of the present application, obtaining the quantized weight matrix and the activation vector of half-precision floating-point precision includes: performing group quantization on the weight matrix along the input-channel dimension to obtain multiple groups of input channels and generate the quantized weight matrix, wherein input channels of a preset group size are divided into a corresponding group, and input channels in the same group use the same quantization parameters.
It will be understood that the preset group size refers to the number of input channels included in each group set in advance when the input channel group quantization is performed.
The embodiment of the application performs group quantization on the weight matrix along the input-channel dimension to obtain multiple groups of input channels and generate the quantized weight matrix, with input channels of the same group using the same quantization parameters; this reduces the storage requirement of the quantized model and saves storage space. Because the same quantization parameters (e.g., the same quantization precision) are used within a group of input channels, channel-level parallel computation can be performed during calculation, improving computational efficiency and accelerating model inference.
Optionally, in one embodiment of the present application, generating the weight matrix includes: quantizing different groups of input channels among the multiple groups of input channels with their corresponding quantization precision; and rearranging groups that use the same quantization precision offline so that they are placed at adjacent positions.
By arranging groups with the same quantization precision at adjacent positions, the embodiment of the application improves storage locality, reduces the span of memory accesses, and improves the effective utilization of GPU memory bandwidth, thereby accelerating model loading and inference; adjacent groups with the same quantization precision also reduce data-type conversion and precision-adjustment operations during computation, improving computational performance and efficiency.
In step S202, the weight matrix is loaded (e.g., from the Graphics Processing Unit (GPU) memory) into registers using a vector access method for mixed-bit-precision data.
Specifically, the embodiment of the present application stores data quantized with different bit widths (e.g., 1- to 4-bit mixed data) as 8-bit integer (INT8) type data in different ways, which directly affects how the mixed-precision-quantized weight matrix is accessed. As shown in fig. 4, for example, 2 pieces of 4-bit data are stored as 1 INT8 value, 8 pieces of 3-bit data are stored as 3 INT8 values, 4 pieces of 2-bit data are stored as 1 INT8 value, and 8 pieces of 1-bit data are stored as 1 INT8 value. If the mixed-precision-quantized weight matrix were accessed only according to this storage layout, then in order to dequantize 8 half-precision floating-point values, one thread would use a single access instruction to load 4, 3, 2 or 1 INT8 values, respectively, from GPU memory into registers. However, the maximum memory access request of the GPU is 128 bytes, and a single instruction issued by one thread bundle (warp) needs to request 128 bytes to fully utilize the memory bandwidth. Since each thread bundle contains multiple (e.g., 32) threads, each thread needs to request 4 bytes to maximize bandwidth utilization. Therefore, loading the mixed-precision-quantized weight matrix only according to the storage layout would waste 1/4 to 3/4 of the bandwidth on each request.
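The following host-side sketch illustrates the packing arithmetic described above; it is an assumption-level example rather than the patent's storage code, and the LSB-first bit order and the helper name pack_bits are choices made only for this illustration. It reproduces the counts given above: 8 quantized weights occupy 4, 3, 2 or 1 byte(s) for 4-, 3-, 2- or 1-bit quantization, respectively.
#include <cstdint>
#include <vector>

// Packs the unsigned quantized values in q (field width `bits`, 1 to 4) into
// consecutive bytes, LSB-first, so that 8 values occupy exactly `bits` bytes.
std::vector<uint8_t> pack_bits(const std::vector<uint8_t>& q, int bits) {
    std::vector<uint8_t> out((q.size() * bits + 7) / 8, 0);
    for (size_t i = 0; i < q.size(); ++i) {
        const uint32_t bitpos = static_cast<uint32_t>(i) * bits;
        const uint32_t byte = bitpos >> 3, shift = bitpos & 7;
        const uint16_t v = static_cast<uint16_t>(q[i] & ((1u << bits) - 1)) << shift;
        out[byte] |= static_cast<uint8_t>(v & 0xFF);
        if (shift + bits > 8) out[byte + 1] |= static_cast<uint8_t>(v >> 8);  // only 3-bit fields cross a byte boundary
    }
    return out;  // e.g. 8 weights -> 4 / 3 / 2 / 1 byte(s) for 4- / 3- / 2- / 1-bit data
}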
Loading the weight matrix into registers using the vector access strategy for mixed-bit-precision data reduces the number of GPU memory accesses, improves memory-access efficiency and parallelism, further reduces power consumption and latency, avoids frequent memory-read operations, and improves computation speed.
Optionally, in an embodiment of the present application, loading the weight matrix into registers using the vector access method for mixed-bit-precision data includes: for 4-bit data, 16 8-bit integer type data are loaded using 4 access instructions; for 3-bit data, 12 8-bit integer type data are loaded using 3 access instructions; for 2-bit data, 8 8-bit integer type data are loaded using 2 access instructions; for 1-bit data, 4 8-bit integer type data are loaded using 1 access instruction.
Specifically, as shown in fig. 4, the embodiment of the present application may use a coarse-grained vector access strategy for 1- to 4-bit mixed data, loading data of different bit precisions with different numbers of access instructions. For example, for 4-bit data, a total of 16 INT8 values are loaded using 4 access instructions; for 3-bit data, 12 INT8 values are loaded using 3 access instructions; for 2-bit data, 8 INT8 values are loaded using 2 access instructions; for 1-bit data, 4 INT8 values are loaded using 1 access instruction. In this way, each access instruction of a thread loads 4 bytes of data from GPU memory, and multiple instructions can further be merged into one vectorized access instruction, so that memory accesses are performed at a coarser granularity.
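A minimal device-side sketch of this coarse-grained access pattern is given below. It is illustrative only: the per-lane layout (each lane owning `bits` consecutive 4-byte words), the 16-byte alignment of the packed weight pointer, and the function name load_packed_weights are assumptions of this example rather than requirements stated by the application.
#include <cstdint>

// Each thread requests 4 bytes per access instruction, so a 32-thread warp issues a
// full 128-byte transaction; for 4-/2-bit data the per-thread loads are further fused
// into a single 16-/8-byte vectorized load (uint4 / uint2).
__device__ __forceinline__ void load_packed_weights(const uint8_t* __restrict__ qw,
                                                    int bits, int lane,
                                                    uint32_t regs[4]) {
    // Assumed layout: lane `lane` owns `bits` consecutive 32-bit words of packed data.
    const uint32_t* p = reinterpret_cast<const uint32_t*>(qw) + lane * bits;
    switch (bits) {  // `bits` is uniform within a precision segment, so all lanes take the same branch
        case 4: { uint4 v = *reinterpret_cast<const uint4*>(p);            // 16 INT8, 4 loads fused into one
                  regs[0] = v.x; regs[1] = v.y; regs[2] = v.z; regs[3] = v.w; break; }
        case 3: { regs[0] = p[0]; regs[1] = p[1]; regs[2] = p[2]; break; } // 12 INT8, 3 loads
        case 2: { uint2 v = *reinterpret_cast<const uint2*>(p);            // 8 INT8, 2 loads fused into one
                  regs[0] = v.x; regs[1] = v.y; break; }
        default: { regs[0] = p[0]; break; }                                // 1-bit: 4 INT8, 1 load
    }
}
The loaded registers would then feed the dequantization step described in step S203 below.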
By loading the weight matrix into registers with the coarse-grained vector access strategy for 1- to 4-bit mixed data, the number of data-loading instructions can be reduced and the efficiency of GPU memory access improved; in particular, for data with smaller bit widths, using the corresponding number of access instructions effectively improves memory-access efficiency and achieves full utilization of the memory bandwidth when loading the mixed-precision-quantized weight matrix.
In step S203, the weight matrix in the register is dequantized to the half-precision floating point number precision by using the dequantization method based on the interleaving thread arrangement policy, and the matrix vector multiplication result is obtained by combining the activation vector of the half-precision floating point number precision.
It will be appreciated that the calculation result may be obtained by multiplying the dequantized weight matrix of the half-precision floating point precision by the activation vector of the half-precision floating point precision.
Specifically, the embodiment of the present application stores data quantized with different bit widths as INT8 type data in different ways; accordingly, data of different bit precisions need to be restored to half-precision floating-point precision using different dequantization (dequante) functions, which can be embodied as if-else instructions, for example with the following pseudo code:
if 4bits quantization then
W = dequante_4bits(qW)
else if 3bits quantization then
W = dequante_3bits(qW)
else if 2bits quantization then
W = dequante_2bits(qW)
else if 1bit quantization then
W = dequante_1bit(qW)
end if
It should be noted that, on SIMD architectures such as GPUs, when some threads execute an instruction branch, the remaining threads that do not need to execute that instruction are idle; the same applies to the dequantization of a mixed-precision-quantized weight matrix. As shown in fig. 5, if each thread is responsible for dequantizing a continuous range of addresses in the weight matrix, different threads need to execute different dequantization functions, which greatly reduces the efficiency of parallel computation. The dequantization method based on the interleaved thread arrangement strategy in the embodiment of the present application interleaves the threads and thereby effectively avoids instruction branches inside a thread bundle (warp).
Optionally, in one embodiment of the present application, dequantizing the weight matrix in the registers to half-precision floating-point precision using the dequantization method based on the interleaved thread arrangement strategy, and obtaining the matrix-vector multiplication result in combination with the half-precision activation vector, includes: based on a preset interleaved thread arrangement, adding a preset step length between the next dequantization address and the previous dequantization address of each thread in the same thread bundle, so that multiple threads in the same thread bundle execute the same dequantization function.
It can be understood that the preset interleaved thread arrangement schedules and arranges the threads according to a preset rule, so that the parallel capability of hardware multithreading is fully utilized.
Specifically, the embodiment of the application adds a preset step length between each thread's next dequantization address and its previous dequantization address, based on the preset interleaved thread arrangement, so that the 32 threads in the same thread bundle execute the same dequantization function. For example, if there are T threads in total and a single thread dequantizes M half-precision floating-point values per step, the step length may be set to the product of T and M. Assuming 32 threads per thread bundle and M = 8, and given that in practice the quantization group size usually takes the value 128, then as long as each bit precision has 2 or more groups it can be guaranteed that the 32 threads of a thread bundle execute the same dequantization function. Fig. 5 uses 4 threads as an example to illustrate the difference between the dequantization method based on the interleaved thread arrangement strategy and a dequantization method without interleaved thread arrangement. As shown in fig. 5, with the interleaved thread arrangement, on the first execution (loop 0) thread 0, thread 1, thread 2 and thread 3 all perform 4-bit dequantization; on the second execution (loop 1) threads 0-3 all perform 3-bit dequantization; on the third execution (loop 2) threads 0-3 all perform 2-bit dequantization; and on the fourth execution (loop 3) threads 0-3 all perform 1-bit dequantization. Because all threads execute the same dequantization function, instruction branches within a thread bundle are effectively avoided. It should be appreciated that fig. 5 is a simplified example; the actual number of threads (T), the number of FP16 values a single thread dequantizes (M), and the number of executions may take any suitable values, and the application is not limited in this respect. Since thread bundles are the units that actually perform parallel computation on a GPU, avoiding instruction branches within a thread bundle is highly beneficial.
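The kernel fragment below sketches the interleaved arrangement under stated assumptions: T is the total number of threads, M = 8 FP16 values are produced per step, the group size is a multiple of M, the arrays group_bits, group_byte_off and scales as well as the LSB-first packing from the earlier sketch are illustrative, zero-points are omitted, and the dequantized weights are written back to global memory for clarity, whereas the description above keeps them in registers and fuses them with the GEMV computation. Within one step a warp covers 32*M consecutive weights, so as long as each precision segment spans at least that many weights (for example two groups of 128) every lane of the warp executes the same dequantization branch.
#include <cuda_fp16.h>
#include <cstdint>

// Extract the i-th BITS-wide field of a group (LSB-first packing, as in the packing
// sketch above) and rescale it. Assumes at least one byte of tail padding after the
// packed buffer so the two-byte gather below never reads out of bounds.
template <int BITS>
__device__ __forceinline__ half dequant_one(const uint8_t* qg, int i, half scale) {
    const int bitpos = i * BITS;
    const uint32_t word = qg[bitpos >> 3] | (static_cast<uint32_t>(qg[(bitpos >> 3) + 1]) << 8);
    const uint32_t q = (word >> (bitpos & 7)) & ((1u << BITS) - 1u);
    return __float2half(static_cast<float>(q) * __half2float(scale));
}

// Interleaved thread arrangement: thread t dequantizes the spans starting at
// t*M, t*M + T*M, t*M + 2*T*M, ... (step length = T*M, as described above).
__global__ void dequant_interleaved(const uint8_t* __restrict__ qweight,
                                    const uint32_t* __restrict__ group_byte_off,  // precomputed byte offset of each group
                                    const uint8_t* __restrict__ group_bits,       // bit width (1..4) of each group
                                    const half* __restrict__ scales,              // scale of each group
                                    half* __restrict__ w_fp16,
                                    long num_elems, int group_size) {
    constexpr int M = 8;                                     // FP16 values produced per step
    const long T = static_cast<long>(blockDim.x) * gridDim.x;
    const long tid = static_cast<long>(blockIdx.x) * blockDim.x + threadIdx.x;
    for (long base = tid * M; base < num_elems; base += T * M) {
        const long g = base / group_size;                    // quantization group of this span
        const uint8_t* qg = qweight + group_byte_off[g];     // packed data of that group
        const int local = static_cast<int>(base - g * group_size);
        const half s = scales[g];
        switch (group_bits[g]) {                             // same branch for every lane of the warp
            case 4: for (int k = 0; k < M; ++k) w_fp16[base + k] = dequant_one<4>(qg, local + k, s); break;
            case 3: for (int k = 0; k < M; ++k) w_fp16[base + k] = dequant_one<3>(qg, local + k, s); break;
            case 2: for (int k = 0; k < M; ++k) w_fp16[base + k] = dequant_one<2>(qg, local + k, s); break;
            default: for (int k = 0; k < M; ++k) w_fp16[base + k] = dequant_one<1>(qg, local + k, s); break;
        }
    }
}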
Further, speedups are normalized against a baseline half-precision floating-point implementation in PyTorch (PyTorch 1.13.0 + CUDA 11.6), and the experiments were run on one NVIDIA RTX 3090 GPU. According to the experimental results, on average the scheme of the application achieves a 2.36x speedup over the baseline at 4-bit precision, 2.31x at 3-bit precision, 2.38x at 2-bit precision, and 1.78x at 1- to 4-bit mixed precision.
By arranging threads based on the preset interleaving, the embodiment of the application makes full use of multithreaded parallel processing capability and improves the parallelism of the dequantization operation; adding the preset step length between the next dequantization address and the previous dequantization address lets multiple threads in a thread bundle execute the same dequantization function, reducing the time cost of dequantization and accelerating computation.
According to the matrix vector multiplication operator implementation method supporting mixed bit quantization provided by the embodiment of the application, the quantized weight matrix and the half-precision activation vector are obtained, the weight matrix is loaded into registers using the vector access strategy for mixed-bit-precision data, and the weight matrix in the registers is dequantized to half-precision floating-point precision using the dequantization strategy based on the interleaved thread arrangement, so that the matrix-vector multiplication result is obtained in combination with the half-precision activation vector. Vector access can thus be achieved for 1- to 4-bit data, maximizing bandwidth utilization, and threads in the same thread bundle can execute the same instruction, avoiding thread-bundle-level instruction branches. This solves the problems in the related art that, when loading a mixed-precision-quantized weight matrix, a single access mode cannot efficiently access data quantized at different bit widths, and that the differing dequantization procedures of different data easily cause instruction branches on SIMD architectures.
Next, a matrix vector multiplier implementation device supporting mixed bit quantization according to an embodiment of the present application will be described with reference to the accompanying drawings.
Fig. 6 is a schematic diagram of a matrix vector multiplier implementation apparatus supporting mixed bit quantization according to an embodiment of the present application.
As shown in fig. 6, the matrix vector multiplier implementation apparatus 10 supporting mixed bit quantization includes: an acquisition module 100, a loading module 200 and a calculation module 300.
Specifically, the obtaining module 100 is configured to obtain a weight matrix of the quantized large language model and an activation vector of half-precision floating point number precision.
The loading module 200 is configured to load the weight matrix into the register using a vector access method for mixed bit precision data.
The calculation module 300 is configured to dequantize the weight matrix in the register to a half-precision floating point number precision by using a dequantization method based on an interleaving thread arrangement policy, and acquire a matrix vector multiplication calculation result by combining an activation vector of the half-precision floating point number precision.
Optionally, in one embodiment of the present application, the acquiring module 100 includes: and a generating unit.
The generating unit is used for carrying out grouping quantization on the weight matrix according to the dimension of the input channels to obtain a plurality of groups of input channels so as to generate a quantized weight matrix, wherein the input channels with the preset group size are divided into corresponding groups, and the input channels in the same group use the same quantization parameters.
Optionally, in one embodiment of the present application, the generating unit includes: quantization subunit and permutation subunit.
The quantization subunit is used for quantizing different groups of input channels in the multiple groups of input channels by using corresponding quantization precision;
an arrangement subunit for performing an offline rearrangement of groups using the same quantization accuracy to arrange them at adjacent positions.
Optionally, in one embodiment of the present application, the loading module 200 includes: the device comprises a first loading unit, a second loading unit, a third loading unit and a fourth loading unit.
The first loading unit is used for loading 16 8-bit integer type data by using 4 access instructions for 4-bit data;
a second loading unit for loading 12 8-bit integer type data using 3 access instructions for 3-bit data;
A third loading unit for loading 8 8-bit integer type data using 2 access instructions for 2-bit data;
And a fourth loading unit for loading 4 8-bit integer type data using 1 access instruction for 1-bit data.
Optionally, in one embodiment of the present application, the computing module 300 includes: and a thread scheduling unit.
The thread scheduling unit is used for adding a preset step length between the next dequantization address and the last dequantization address of each thread in the same thread bundle based on a preset interleaving thread arrangement mode so that a plurality of threads in the same thread bundle execute the same dequantization function.
It should be noted that the foregoing explanation of the embodiment of the matrix vector multiplier implementation method supporting mixed bit quantization is also applicable to the matrix vector multiplier implementation device supporting mixed bit quantization of this embodiment, and will not be repeated here.
According to the matrix vector multiplication operator implementation apparatus supporting mixed bit quantization provided by the embodiment of the application, the weight matrix of the quantized large language model and the half-precision activation vector are obtained, the weight matrix is loaded into registers using the vector access method for mixed-bit-precision data, and the weight matrix in the registers is dequantized to half-precision floating-point precision using the dequantization method based on the interleaved thread arrangement strategy, so that the matrix-vector multiplication result is obtained in combination with the half-precision activation vector. Vector access can thus be achieved for mixed-bit-precision data, maximizing bandwidth utilization, and threads in the same thread bundle can execute the same instruction, avoiding thread-bundle-level instruction branches. This solves the problems in the related art that, when loading a mixed-precision-quantized weight matrix, a single access mode cannot efficiently access data quantized at different bit widths, and that the differing dequantization procedures of different data easily cause instruction branches on SIMD architectures.
Fig. 7 is a schematic structural diagram of an electronic device according to an embodiment of the present application. The electronic device may include:
A memory 701 and a processor 702, wherein the memory 701 stores a computer program executable on the processor 702.
The processor 702 implements the matrix vector multiplier implementation method supporting mixed bit quantization provided in the above embodiment when executing a program.
Further, the electronic device further includes:
A communication interface 703 for communication between the memory 701 and the processor 702.
Memory 701 for storing a computer program executable on processor 702.
The memory 701 may include high-speed RAM, and may further include non-volatile memory, such as at least one magnetic disk memory.
If the memory 701, the processor 702 and the communication interface 703 are implemented independently, the communication interface 703, the memory 701 and the processor 702 may be connected to each other through a bus and communicate with each other. The bus may be an Industry Standard Architecture (ISA) bus, a Peripheral Component Interconnect (PCI) bus, or an Extended Industry Standard Architecture (EISA) bus, among others. The bus may be divided into an address bus, a data bus, a control bus, etc. For ease of illustration, only one thick line is shown in fig. 7, but this does not mean there is only one bus or one type of bus.
Alternatively, in a specific implementation, if the memory 701, the processor 702, and the communication interface 703 are integrated on a chip, the memory 701, the processor 702, and the communication interface 703 may communicate with each other through internal interfaces.
The processor 702 may be a Central Processing Unit (CPU), an Application-Specific Integrated Circuit (ASIC), or one or more integrated circuits configured to implement embodiments of the application.
The embodiment of the application also provides a computer readable storage medium, on which a computer program is stored, which when being executed by a processor, implements the matrix vector multiplier implementation method supporting mixed bit quantization as above.
The embodiments of the present application also provide a computer program product comprising a computer program which, when executed by a processor, implements a matrix vector multiplier implementation method as described above that supports mixed bit quantization.
In the description of the present specification, a description referring to terms "one embodiment," "some embodiments," "examples," "specific examples," or "some examples," etc., means that a particular feature, structure, material, or characteristic described in connection with the embodiment or example is included in at least one embodiment or example of the present application. In this specification, schematic representations of the above terms are not necessarily directed to the same embodiment or example. Furthermore, the particular features, structures, materials, or characteristics described may be combined in any suitable manner in any one or N embodiments or examples. Furthermore, the different embodiments or examples described in this specification and the features of the different embodiments or examples may be combined and combined by those skilled in the art without contradiction.
Furthermore, the terms "first," "second," and the like, are used for descriptive purposes only and are not to be construed as indicating or implying a relative importance or implicitly indicating the number of technical features indicated. Thus, a feature defining "a first" or "a second" may explicitly or implicitly include at least one such feature. In the description of the present application, "N" means at least two, for example, two, three, etc., unless specifically defined otherwise.
Any process or method descriptions in flow charts or otherwise described herein may be understood as representing modules, segments, or portions of code which include one or more executable instructions for implementing specific logical functions or steps of the process, and additional implementations are included within the scope of the preferred embodiment of the present application in which functions may be executed out of order from that shown or discussed, including substantially concurrently or in reverse order from that shown or discussed, depending on the functionality involved, as would be understood by those reasonably skilled in the art of the embodiments of the present application.
Logic and/or steps represented in the flowcharts or otherwise described herein, e.g., a ordered listing of executable instructions for implementing logical functions, can be embodied in any computer-readable medium for use by or in connection with an instruction execution system, apparatus, or device, such as a computer-based system, processor-containing system, or other system that can fetch the instructions from the instruction execution system, apparatus, or device and execute the instructions. For the purposes of this description, a "computer-readable medium" can be any means that can contain, store, communicate, propagate, or transport the program for use by or in connection with the instruction execution system, apparatus, or device. More specific examples (a non-exhaustive list) of the computer-readable medium would include the following: an electrical connection (electronic device) having one or N wires, a portable computer cartridge (magnetic device), a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber device, and a portable compact disc read-only memory (CDROM). In addition, the computer readable medium may even be paper or other suitable medium on which the program is printed, as the program may be electronically captured, via optical scanning of the paper or other medium, then compiled, interpreted or otherwise processed in a suitable manner if necessary, and then stored in a computer memory.
It is to be understood that portions of the present application may be implemented in hardware, software, firmware, or a combination thereof. In the above-described embodiments, the N steps or methods may be implemented in software or firmware stored in a memory and executed by a suitable instruction execution system. If implemented in hardware, as in another embodiment, they may be implemented by any one or a combination of the following techniques, which are well known in the art: discrete logic circuits having logic gates for implementing logic functions on data signals, application-specific integrated circuits having suitable combinational logic gates, Programmable Gate Arrays (PGAs), Field-Programmable Gate Arrays (FPGAs), and the like.
Those of ordinary skill in the art will appreciate that all or a portion of the steps carried out in the method of the above-described embodiments may be implemented by a program to instruct related hardware, where the program may be stored in a computer readable storage medium, and where the program, when executed, includes one or a combination of the steps of the method embodiments.
In addition, each functional unit in the embodiments of the present application may be integrated in one processing module, or each unit may exist alone physically, or two or more units may be integrated in one module. The integrated modules may be implemented in hardware or in software functional modules. The integrated modules may also be stored in a computer readable storage medium if implemented in the form of software functional modules and sold or used as a stand-alone product.
The above-mentioned storage medium may be a read-only memory, a magnetic disk or an optical disk, or the like. While embodiments of the present application have been shown and described above, it will be understood that the above embodiments are illustrative and not to be construed as limiting the application, and that variations, modifications, alternatives and variations may be made to the above embodiments by one of ordinary skill in the art within the scope of the application.

Claims (13)

1. The matrix vector multiplication operator implementation method supporting mixed bit quantization is characterized by comprising the following steps:
Acquiring a weight matrix of the quantized large language model and an activation vector of half-precision floating point number precision;
Loading the weight matrix into a register by using a vector access method for mixed bit precision data; and
And dequantizing the weight matrix in the register to half-precision floating point number precision by using a dequantization method based on an interweaved thread arrangement strategy, and acquiring a matrix vector multiplication calculation result by combining an activation vector of the half-precision floating point number precision.
2. The method according to claim 1, wherein the obtaining the quantized weight matrix and the activation vector of the half-precision floating point number precision comprises:
and carrying out grouping quantization on the weight matrix according to the dimension of the input channels to obtain a plurality of groups of input channels so as to generate the quantized weight matrix, wherein the input channels with the preset group size are divided into corresponding groups, and the input channels in the same group use the same quantization parameters.
3. The method of matrix vector multiplier implementation supporting mixed bit quantization according to claim 2, wherein said generating said weight matrix comprises:
Quantizing different groups of input channels in the multiple groups of input channels by using corresponding quantization precision;
groups using the same quantization accuracy are rearranged offline to arrange them in adjacent positions.
4. A matrix vector multiplier implementation method supporting mixed bit quantization according to any of claims 1-3, wherein said loading said weight matrix into a register using a vector access method for mixed bit precision data comprises:
for 4-bit data, 16 8-bit integer type data is loaded using 4 access instructions;
For 3-bit data, 12 8-bit integer type data is loaded using 3 access instructions;
For 2-bit data, 8 8-bit integer type data is loaded using 2 access instructions;
for 1-bit data, 4 8-bit integer type data is loaded using 1 access instruction.
5. A method according to any one of claims 1-3, wherein the dequantizing the weight matrix in the register to a half-precision floating point number precision using a dequantization method based on an interleaved thread placement policy, and obtaining a matrix vector multiplication result in combination with an activation vector of the half-precision floating point number precision, comprises:
Based on a preset interleaving thread arrangement mode, a preset step length is added between the next dequantization address and the last dequantization address of each thread in the same thread bundle, so that a plurality of threads in the same thread bundle execute the same dequantization function.
6. A matrix vector multiplier implementation apparatus supporting mixed bit quantization, comprising:
The acquisition module is used for acquiring a weight matrix of the quantized large language model and an activation vector of the half-precision floating point number precision;
the loading module is used for loading the weight matrix into a register by using a vector access method for mixed bit precision data; and
And the calculation module is used for dequantizing the weight matrix in the register to half-precision floating point number precision by using a dequantization method based on an interweaved thread arrangement strategy, and acquiring a matrix vector multiplication calculation result by combining an activation vector of the half-precision floating point number precision.
7. The matrix vector multiplier implementation device according to claim 6, wherein said acquisition module comprises:
The generation unit is used for carrying out grouping quantization on the weight matrix according to the dimension of the input channels to obtain a plurality of groups of input channels so as to generate the quantized weight matrix, wherein the input channels with the preset group size are divided into corresponding groups, and the input channels in the same group use the same quantization parameters.
8. The matrix vector multiplier implementation apparatus supporting mixed bit quantization according to claim 7, wherein said generating unit comprises:
a quantization subunit, configured to quantize input channels of different groups of the multiple groups of input channels using corresponding quantization precision;
an arrangement subunit for performing an offline rearrangement of groups using the same quantization accuracy to arrange them at adjacent positions.
9. The matrix vector multiplier implementation supporting mixed bit quantization according to any one of claims 6-8, wherein said loading module comprises:
A first loading unit for loading 16 8-bit integer type data using 4 access instructions for 4-bit data;
a second loading unit for loading 12 8-bit integer type data using 3 access instructions for 3-bit data;
A third loading unit for loading 8 8-bit integer type data using 2 access instructions for 2-bit data;
And a fourth loading unit for loading 4 8-bit integer type data using 1 access instruction for 1-bit data.
10. The matrix vector multiplier implementation supporting mixed bit quantization according to any one of claims 6-8, wherein said calculation module comprises:
The thread scheduling unit is used for adding a preset step length between the next dequantization address and the last dequantization address of each thread in the same thread bundle based on a preset interleaving thread arrangement mode so as to enable a plurality of threads in the same thread bundle to execute the same dequantization function.
11. An electronic device, comprising: a memory, a processor, wherein the memory has a computer program stored thereon and capable of running on the processor, the processor executing the program to implement the matrix vector multiplier implementation method supporting mixed bit quantization as claimed in any one of claims 1-5.
12. A computer readable storage medium having stored thereon a computer program, the program being executable by a processor for implementing the matrix vector multiplier implementation method supporting mixed bit quantization as claimed in any one of claims 1-5.
13. A computer program product comprising a computer program for implementing the matrix vector multiplier implementation method supporting mixed bit quantization according to any of claims 1-5 when executed.
CN202410433080.3A 2024-04-11 2024-04-11 Matrix vector multiplication operator realization method and device supporting mixed bit quantization Active CN118035628B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202410433080.3A CN118035628B (en) 2024-04-11 2024-04-11 Matrix vector multiplication operator realization method and device supporting mixed bit quantization

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202410433080.3A CN118035628B (en) 2024-04-11 2024-04-11 Matrix vector multiplication operator realization method and device supporting mixed bit quantization

Publications (2)

Publication Number Publication Date
CN118035628A true CN118035628A (en) 2024-05-14
CN118035628B CN118035628B (en) 2024-06-11

Family

ID=90989891

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202410433080.3A Active CN118035628B (en) 2024-04-11 2024-04-11 Matrix vector multiplication operator realization method and device supporting mixed bit quantization

Country Status (1)

Country Link
CN (1) CN118035628B (en)


Patent Citations (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20220114479A1 (en) * 2020-10-14 2022-04-14 Samsung Electronics Co., Ltd. Systems and methods for automatic mixed-precision quantization search
CN113076663A (en) * 2021-05-06 2021-07-06 华南理工大学 Dynamic hybrid precision model construction method and system
CN114047903A (en) * 2021-11-09 2022-02-15 上海交通大学 Mixed precision operation unit applied to data stream driven reconfigurable array
US20220092391A1 (en) * 2021-12-07 2022-03-24 Santiago Miret System and method of using neuroevolution-enhanced multi-objective optimization for mixed-precision quantization of deep neural networks
CN114418088A (en) * 2021-12-28 2022-04-29 南京大学 Model training method
CN116227563A (en) * 2022-12-09 2023-06-06 中国航空无线电电子研究所 Convolutional neural network compression and acceleration method based on data quantization
CN116227332A (en) * 2022-12-21 2023-06-06 北京视海芯图微电子有限公司 Method and system for quantizing mixed bits of transformers
CN116502691A (en) * 2023-03-22 2023-07-28 山东海量信息技术研究院 Deep convolutional neural network mixed precision quantization method applied to FPGA
CN117852593A (en) * 2023-12-18 2024-04-09 中国科学院信息工程研究所 Compression method for distilling perception mixing precision quantification
CN117574976A (en) * 2024-01-16 2024-02-20 北京大学 Large language model software and hardware collaborative quantization acceleration calculation method and system

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
HAOCHENG XI et al.: "Training transformers with 4-bit integers", arXiv:2306.11987v2 [cs.LG], 22 June 2023 (2023-06-22), pages 1-22 *
YUHANG LI et al.: "Efficient bitwidth search for practical mixed precision neural networks", arXiv:2003.07577v1 [cs.LG], 17 March 2020 (2020-03-17), pages 1-21 *
YANG Chun et al.: "A survey of quantization methods for deep neural network models" (in Chinese), Chinese Journal of Engineering, vol. 45, no. 10, 31 October 2023 (2023-10-31), pages 1613-1629 *

Also Published As

Publication number Publication date
CN118035628B (en) 2024-06-11

Similar Documents

Publication Publication Date Title
US11698773B2 (en) Accelerated mathematical engine
CN111062472B (en) Sparse neural network accelerator based on structured pruning and acceleration method thereof
CN110415157B (en) Matrix multiplication calculation method and device
US20180046895A1 (en) Device and method for implementing a sparse neural network
CN112465110B (en) Hardware accelerator for convolution neural network calculation optimization
CN111931918B (en) Neural network accelerator
US20200293868A1 (en) Method and apparatus to efficiently process and execute artificial intelligence operations
CN109993293B (en) Deep learning accelerator suitable for heap hourglass network
US11657262B2 (en) Processing matrix operations for rate limited systems
CN109543815A (en) The accelerating method and device of neural network
CN113222102A (en) Optimization method for neural network model quantification
CN118035628B (en) Matrix vector multiplication operator realization method and device supporting mixed bit quantization
CN116167419A (en) Architecture compatible with N-M sparse transducer accelerator and acceleration method
CN115130672A (en) Method and device for calculating convolution neural network by software and hardware collaborative optimization
CN114897133A (en) Universal configurable Transformer hardware accelerator and implementation method thereof
CN118034785B (en) Instruction compression method, device, accelerator and storage medium
CN112215349A (en) Sparse convolution neural network acceleration method and device based on data flow architecture
CN220983883U (en) Matrix computing device, chiplet apparatus and artificial intelligence accelerator device
CN116109468B (en) Graphics processing unit, instruction compiling method, storage medium, and terminal device
Kim et al. QUICK: Quantization-aware Interleaving and Conflict-free Kernel for efficient LLM inference
US20230161479A1 (en) Zero skipping techniques for reducing data movement
Zhao et al. HeteGen: Heterogeneous Parallel Inference for Large Language Models on Resource-Constrained Devices
CN113313251A (en) Deep separable convolution fusion method and system based on data stream architecture
CN115115018A (en) Acceleration system for long and short memory neural network
CN116483748A (en) Efficient data loading method, equipment and storage medium based on deep learning

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant