CN118034785A - Instruction compression method, device, accelerator and storage medium - Google Patents

Instruction compression method, device, accelerator and storage medium

Info

Publication number
CN118034785A
Authority
CN
China
Prior art keywords
instruction
instructions
language model
compressed
length
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202410432921.9A
Other languages
Chinese (zh)
Inventor
汪玉
杨昕昊
王鸿懿
杨华中
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Tsinghua University
Original Assignee
Tsinghua University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Tsinghua University filed Critical Tsinghua University
Priority to CN202410432921.9A priority Critical patent/CN118034785A/en
Publication of CN118034785A publication Critical patent/CN118034785A/en
Pending legal-status Critical Current

Landscapes

  • Executing Machine-Instructions (AREA)

Abstract

The invention relates to the technical field of large language model processing, and in particular to an instruction compression method, device, accelerator and storage medium. The method comprises the following steps: determining an instruction multiplexing ratio for an instruction to be compressed in the current instruction set of a large language model accelerator according to the parallelism of the instruction to be compressed, where the instruction to be compressed is used to perform at least one calculation in a processing stage of the large language model; and generating and storing a plurality of instructions based on the instruction multiplexing ratio of the instruction to be compressed, where each instruction is configured to perform the calculation of the instruction to be compressed for input tokens within a different length range. This solves the problems in the related art that online compilation is too slow to meet real-time requirements while offline compilation requires a large amount of storage space at high cost.

Description

Instruction compression method, device, accelerator and storage medium
Technical Field
The present invention relates to the field of large language model processing technologies, and in particular, to an instruction compression method, an instruction compression device, an accelerator, and a storage medium.
Background
Inference with a large language model can be divided into two stages: a pre-fill stage and a decoding stage. The model first processes all input tokens at once and generates the first token of the answer; this process of processing the input and generating the first token is commonly referred to as the pre-fill stage. The model then takes the token generated in the previous round as input, performs inference again, and generates the next token, iterating continuously until the generated token is an end marker or the total number of tokens reaches a preset maximum, at which point iteration stops and inference is complete. This process of continually generating new tokens is referred to as the decoding stage, and each generation of a single new token is one decoding step. In other words, the inference process of a large model is one pre-fill followed by multiple decodings.
In the related art, the number of input tokens is not fixed during large language model inference, and the number of output tokens (the number of decoding steps) is likewise uncertain. This means that pre-fill and decoding computations with inputs of different shapes must be supported. Because a domain-specific instruction set supports neither dynamic branch jumps nor dynamic shapes, computation over different input shapes can only be supported by using different instruction sequences. However, since a large model has many parameters and many layers, online compilation is slow and compilation times are long, so real-time requirements cannot be met; offline compilation, in turn, requires a large amount of storage space, which is limited and costly.
Disclosure of Invention
The invention provides an instruction compression method, an instruction compression device, an accelerator and a storage medium, which are used to solve the problems in the related art that real-time requirements cannot be met because online compilation is slow, and that offline compilation requires a large amount of storage space at high cost.
An embodiment of a first aspect of the present invention provides an instruction compression method for a large language model accelerator, where a token length register is provided in the large language model accelerator, and the method includes the following steps: determining an instruction multiplexing proportion of an instruction to be compressed in a current instruction set of a large language model accelerator according to the parallelism of the instruction to be compressed in the current instruction set, wherein the instruction to be compressed is at least used for executing one calculation in a processing stage of the large language model; a plurality of instructions are generated and stored based at least on an instruction multiplexing ratio of the instructions to be compressed, wherein each instruction is configured to enable computation of the instructions to be compressed that support input tokens of a different length range.
Optionally, the method further comprises: obtaining an input token and determining the length of the input token; storing a length of the input token with the token length register; selecting an instruction of the plurality of instructions that supports a range of lengths including the length to perform a corresponding calculation; processing the calculated output result based on the length of the input token stored in the token length register to obtain a final result of the calculation.
Optionally, the instruction multiplexing proportion is smaller than or equal to the parallelism of the instructions to be compressed.
Optionally, the number N of the plurality of instructions is equal to a ratio of the maximum length L of the input token supported by the large language model to the instruction multiplexing ratio K.
Optionally, an nth instruction of the plurality of instructions supports calculation of the instruction to be compressed for an input token having a length in the range [(n-1)×K+1, n×K], where n is a positive integer less than or equal to N.
Optionally, the processing the calculated output result based on the length of the input token stored in the token length register to obtain a final result of the calculation includes: and intercepting the first N of the calculated output results as the final result of the calculation, wherein N is equal to the length of the input token.
Optionally, the instruction to be compressed includes one or more of: a matrix-matrix multiplication instruction, a matrix-vector multiplication instruction, and a scalar calculation instruction.
Optionally, the processing stage includes a pre-fill stage and a decode stage.
An embodiment of a second aspect of the present invention provides an instruction compression apparatus for a large language model accelerator, including: the processing module is used for determining the instruction multiplexing proportion of the instructions to be compressed in the current instruction set according to the parallelism of the instructions to be compressed in the current instruction set of the large language model accelerator, and the instructions to be compressed are at least used for executing one calculation in the processing stage of the large language model; and the generation module is used for generating and storing a plurality of instructions based on at least the instruction multiplexing proportion of the instructions to be compressed, wherein each instruction is configured to realize the calculation of the instructions to be compressed for supporting the input tokens with different length ranges.
Optionally, the method further comprises: the acquisition module is used for acquiring the input token and determining the length of the input token; storing the length of the input token with a token length register; selecting an instruction of which the supported length range includes a length from among the plurality of instructions to perform a corresponding calculation; the calculated output result is processed based on the length of the input token stored in the token length register to obtain a final result of the calculation.
Optionally, the instruction multiplexing ratio is less than or equal to the parallelism of the instructions to be compressed.
Optionally, the number N of the plurality of instructions is equal to a ratio of the maximum length L of the input token supported by the large language model to the instruction multiplexing ratio K.
Optionally, an nth instruction of the plurality of instructions supports calculation of the instruction to be compressed for an input token having a length in the range [(n-1)×K+1, n×K], where n is a positive integer less than or equal to N.
Optionally, the method further comprises: and the intercepting module is used for intercepting the first N of the calculated output results as a final result of calculation, wherein N is equal to the length of the input token.
Optionally, the instruction to be compressed includes one or more of: a matrix-matrix multiplication instruction, a matrix-vector multiplication instruction, and a scalar calculation instruction.
Optionally, the processing stage includes a pre-fill stage and a decode stage.
An embodiment of a third aspect of the present invention provides a large language model accelerator, comprising: the system comprises a token length register, a memory, a processor and a computer program stored on the memory and capable of running on the processor, wherein the processor executes the program to realize the instruction compression method facing the large language model accelerator according to the embodiment.
An embodiment of a fourth aspect of the present invention provides a computer-readable storage medium having stored thereon a computer program for execution by a processor for implementing the instruction compression method for a large language model accelerator as described in the above embodiment.
Therefore, the invention has at least the following beneficial effects:
By multiplexing instruction sequences and adding a token length register to the accelerator, the embodiment of the invention allows pre-fill or decoding computations of different lengths to reuse the same instruction, so that one instruction sequence serves large language model pre-fill or decoding inference for a number of different token lengths. This solves the problem that the unfixed number of input and output tokens during large language model inference would otherwise require a large instruction storage space, and thus reduces the instruction storage space.
Additional aspects and advantages of the invention will be set forth in part in the description which follows and, in part, will be obvious from the description, or may be learned by practice of the invention.
Drawings
The foregoing and/or additional aspects and advantages of the invention will become apparent and readily appreciated from the following description of the embodiments, taken in conjunction with the accompanying drawings, in which:
FIG. 1 is a flow chart of a method for instruction compression for a large language model accelerator according to an embodiment of the present invention;
FIG. 2 is a schematic diagram of an instruction compression scheme for a large language model specific accelerator provided in accordance with an embodiment of the present invention;
FIG. 3 is a schematic diagram of pre-filling and decoding delays at different input word lengths provided in accordance with an embodiment of the present invention;
FIG. 4 is a schematic diagram of instruction multiplexing provided according to an embodiment of the present invention;
FIG. 5 is a block diagram of an instruction compression apparatus for a large language model accelerator according to an embodiment of the present invention;
FIG. 6 is a schematic diagram of a large language model accelerator according to an embodiment of the present invention.
Detailed Description
Embodiments of the present invention are described in detail below, examples of which are illustrated in the accompanying drawings, wherein like or similar reference numerals refer to like or similar elements or elements having like or similar functions throughout. The embodiments described below by referring to the drawings are illustrative and intended to explain the present invention and should not be construed as limiting the invention.
At present, transformer-based large language models have had a significant impact in various fields and have demonstrated far greater accuracy and capability on a variety of tasks than small models such as traditional convolutional neural networks. Meanwhile, dedicated hardware accelerators for convolutional neural networks are a mature technology; instruction-based neural network accelerators offer high flexibility and performance and are the most common design scheme among neural network accelerators. Such an accelerator designs a dedicated instruction set according to the characteristics of the computing task.
To maximize performance and accelerator energy efficiency, the instruction set is typically not generic and is highly specific to the computing task. Taking convolutional neural network accelerator design as an example, a coarse-grained domain-specific instruction set is typically used (e.g., one convolution instruction implements a convolution operation, and one pooling instruction implements a pooling operation). A coarse-grained instruction set has a simple structure and relatively fixed functions, so it executes efficiently and achieves higher efficiency, performance and hardware energy efficiency than general-purpose processors such as a CPU (Central Processing Unit) or a GPU (Graphics Processing Unit).
Such coarse-grained domain-specific instruction sets used in neural network accelerator design are generally not generic: they do not support branching, jumping and the like, and can only process instruction sequences linearly. However, since the data flow of a convolutional neural network is static, dynamic jumps are never required. For example, the input of a convolutional neural network is image data with a fixed resolution (e.g., 224×224×3), so the data flow is completely fixed and can be fully determined at the offline compilation stage. Thus, although the instruction set does not support branch or jump operations, this does not hinder efficient inference of the neural network.
In the large language model inference process, the number of input tokens in the pre-fill stage is not fixed, and the number of output tokens (the number of decoding steps) is also uncertain, which means that pre-fill and decoding computations with inputs of different shapes must be supported. Since the domain-specific instruction set supports neither dynamic branch jumps nor dynamic shapes, computation over different input shapes can only be supported by using different instruction sequences. However, this approach currently raises the following key issues:
(1) Online compilation is slow. A large model has a large number of parameters and layers, so compilation takes a long time and is costly. Even with manually designed instruction templates, dynamically instantiating the templates with different parameters at compile time still takes on the order of seconds. Typical decoding speeds on existing GPUs such as the V100 are 40-50 tokens/s (about 20-25 ms per token), roughly two orders of magnitude faster than a compilation time on the order of seconds. This solution therefore cannot meet real-time requirements at all.
(2) Offline compilation has a large storage overhead. To meet real-time requirements, all possible instructions could be generated in advance in an offline stage and the instructions to run selected dynamically at runtime. However, because a large model involves a large amount of computation and a large number of parameters, the instruction sequence for each inference is long and requires a large storage space. Taking the LLaMA2-7B model as an example, the total supported token length is 4096, so all single pre-fill instructions and single decoding instructions for token lengths 1-4096 would need to be generated in advance in the offline stage to support inputs and outputs of arbitrary length. The average size of a single pre-fill instruction sequence is on the order of hundreds of MB and that of a single decoding instruction sequence is on the order of MB. Storing all instructions would therefore require hundreds of GB, far exceeding even the model weights, and could not fit at all in the off-chip DDR (Double Data Rate) or HBM (High Bandwidth Memory) storage of an accelerator, which typically has a capacity of only 8-16 GB.
To meet the real-time requirements of large language models, online compilation is impractical. The invention therefore adopts an offline compilation scheme and reduces the size of the instruction sequences in order to control the memory space occupied by instructions, thereby supporting token inputs and outputs of various lengths within a limited storage space.
The following describes an instruction compression method, apparatus, accelerator, and storage medium of an embodiment of the present invention with reference to the accompanying drawings. Specifically, fig. 1 is a flow chart of an instruction compression method for a large language model accelerator according to an embodiment of the present invention.
As shown in fig. 1, in the instruction compression method for a large language model accelerator, a token length register is set in the large language model accelerator, wherein the method comprises the following steps:
In step S101, an instruction multiplexing ratio of the instruction to be compressed in the current instruction set is determined according to the parallelism of the instruction to be compressed in the current instruction set of the large language model accelerator, the instruction to be compressed being used at least for performing one calculation in a processing stage of the large language model.
The instruction to be compressed includes one or more of: a matrix-matrix multiplication instruction, a matrix-vector multiplication instruction, and a scalar calculation instruction.
Wherein the processing stage comprises a pre-fill stage and a decode stage.
It can be understood that the embodiment of the invention can determine the proportion of multiplexing the instructions in the instruction to be compressed according to the parallelism of the hardware design corresponding to the current instruction set.
It should be noted that the instruction to be compressed may include only instructions in the instruction set that are related to data computation; that is, instruction multiplexing may be applied only to computation-related instructions, while instructions that only transfer data are not multiplexed. The multiplexing ratio of the pre-fill stage may be set to the parallelism of the MM instruction, the multiplexing ratio of the decoding stage may be set to the parallelism of the MV instruction, and these values may be chosen according to the practical situation without specific limitation.
Parallelism refers to the maximum number of instructions or data elements processed in parallel and may be related to the specific instruction set architecture, hardware design, performance, and so on. For example, if the current instruction to be compressed is a matrix-matrix multiplication instruction, its parallelism is the parallelism of the hardware matrix-matrix multiplication module, say 128. Since the multiplexing ratio must be less than or equal to the instruction parallelism, the corresponding instruction multiplexing ratio is then 1-128; to obtain the maximum instruction multiplexing ratio, it is preferably equal to the parallelism, namely 128. Any multiplexing ratio less than or equal to the parallelism may be selected, and this is not specifically limited.
Specifically, suppose the instruction set defines a matrix-matrix multiplication instruction MM (Matrix-Matrix multiplication) and a matrix-vector multiplication instruction MV (Matrix-Vector multiplication), with MM parallelism 128 and MV parallelism 16. If the output length to be computed in the pre-fill stage is 1-128, one MM instruction suffices; if the output length is 129-256, two MM instructions are needed, and the MV instruction behaves analogously. The multiplexing ratio must therefore be less than or equal to the instruction parallelism so that instruction sequences for different lengths contain the same number of instructions. Moreover, because the instruction parallelism is large, performance is essentially unchanged when the actual token length varies within one parallelism-sized range: each instruction takes a similar time to compute, and even if only 1 output is required, the computation is still carried out at full parallelism.
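This relationship can be expressed as a simple ceiling division. The short Python sketch below is only illustrative (the function and constant names are not taken from the patent); it shows why every token length inside one parallelism-sized bucket needs the same number of compute instructions:

```python
import math

def instructions_needed(output_length: int, parallelism: int) -> int:
    """Number of compute instructions needed when each instruction
    produces up to `parallelism` outputs at once."""
    return math.ceil(output_length / parallelism)

MM_PARALLELISM, MV_PARALLELISM = 128, 16  # figures from the example above

assert instructions_needed(1, MM_PARALLELISM) == 1       # lengths 1..128   -> 1 MM
assert instructions_needed(128, MM_PARALLELISM) == 1
assert instructions_needed(129, MM_PARALLELISM) == 2      # lengths 129..256 -> 2 MMs
assert instructions_needed(4096, MV_PARALLELISM) == 256   # decoding with MV
# All lengths inside one parallelism-sized bucket need the same instruction
# count, so a multiplexing ratio <= the parallelism lets them share a sequence.
```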
In step S102, a plurality of instructions are generated and stored based at least on the instruction multiplexing ratio of the instructions to be compressed, wherein each instruction is configured to enable computation of instructions to be compressed supporting input tokens of different length ranges.
The instruction multiplexing proportion is smaller than or equal to the parallelism of the instructions to be compressed.
The number N of the multiple instructions is equal to the ratio of the maximum length L of the input tokens supported by the large language model to the instruction multiplexing proportion K.
The nth instruction among the plurality of instructions supports calculation of the instruction to be compressed for an input token with a length in the range [(n-1)×K+1, n×K], where n is a positive integer less than or equal to N.
For example, for a large language model with a total supported token length of 4096 and MM instruction parallelism of 128 in the pre-fill stage, the multiplexing ratio may be 128, in which case the number N of MM instructions is 4096/128 = 32, far smaller than 4096. The first MM instruction supports input tokens in the length range [1, 128], the second MM instruction supports the length range [129, 256], and so on, with the 32nd MM instruction supporting the length range [3969, 4096].
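The enumeration of instruction variants and their length ranges in this example can be written out directly; the following sketch (names are illustrative, not from the patent) reproduces the N = L/K variants and the range [(n-1)×K+1, n×K] covered by each:

```python
def instruction_length_ranges(max_token_length: int, multiplexing_ratio: int):
    """Inclusive token-length range [(n-1)*K+1, n*K] covered by each of the
    N = L / K generated instruction variants."""
    n_instructions = max_token_length // multiplexing_ratio  # N = L / K
    return [((n - 1) * multiplexing_ratio + 1, n * multiplexing_ratio)
            for n in range(1, n_instructions + 1)]

ranges = instruction_length_ranges(max_token_length=4096, multiplexing_ratio=128)
print(len(ranges))   # 32 instruction variants instead of 4096
print(ranges[0])     # (1, 128)     -> first MM instruction
print(ranges[-1])    # (3969, 4096) -> 32nd MM instruction
```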
It can be understood that the embodiment of the invention generates and stores a plurality of instructions based on the instruction multiplexing ratio of the instruction to be compressed and uses these instructions to carry out the calculation of the instruction to be compressed for input tokens in different length ranges, thereby solving the problem that the unfixed number of input and output tokens during large language model inference would otherwise require a large instruction storage space, and reducing the instruction storage space.
In an embodiment of the present invention, the method further includes: obtaining an input token and determining the length of the input token; storing the length of the input token with a token length register; selecting an instruction of which the supported length range among the plurality of instructions includes the length to perform a corresponding calculation; the calculated output result is processed based on the length of the input token stored in the token length register to obtain a calculated final result.
Wherein the length of each input token is allowed to be different.
It can be understood that the embodiment of the invention determines the length of the input token sequence for each inference, stores that length in the added token length register, and allows pre-fill or decoding computations of different lengths to reuse the same instruction, thereby avoiding write-back of erroneous data. One set of instruction sequences thus serves large language model pre-fill or decoding inference for multiple different token lengths, solving the problem that the unfixed number of input and output tokens during large language model inference would otherwise require a large instruction storage space, and reducing the instruction storage space.
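A minimal sketch of this runtime flow is shown below; the accelerator object, its execute method, and the token_length_register attribute are hypothetical placeholders used only to make the steps of the method explicit, not an actual accelerator interface:

```python
def run_compressed_inference(input_tokens, instruction_variants,
                             multiplexing_ratio, accelerator):
    """Illustrative flow: store the real token length, pick the instruction
    variant whose supported range covers it, then keep only the valid outputs."""
    length = len(input_tokens)                        # determine the token length
    accelerator.token_length_register = length        # store it in the register
    variant = instruction_variants[(length - 1) // multiplexing_ratio]
    raw_output = accelerator.execute(variant, input_tokens)  # padded computation
    return raw_output[:length]                        # final result: first `length` rows
```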
It should be noted that, in order to implement multiplexing of the instruction, a token length register may be added on hardware, where the actual input token length is stored, so as to avoid erroneous data writing back. For example, when the actual input token length is 31, an instruction sequence of token length 128 (e.g., including MM) is used, but 31 is written in the hardware token length register.
Take the MM instruction as an example of a linear operation instruction. Suppose the length of the actual input token sequence is n₁, and the MM instruction was compiled for a token length of n (n > n₁): its first matrix is of dimension n×m, its second matrix is of dimension m×p, and the output matrix is of dimension n×p. The first n₁×p part of the output matrix (the first n₁ rows) is taken as the correct result. In the previous example, when the MM is computed, only the first 31 of the 128 output rows are written out, preventing the trailing erroneous results from being written back to off-chip storage.
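The row-truncation behaviour of this MM example can be emulated in a few lines of NumPy. This is only an emulation of what the hardware does with the token length register; the matrix sizes are assumed for illustration, matching the 31-out-of-128 example above:

```python
import numpy as np

def mm_with_truncation(a: np.ndarray, b: np.ndarray, token_length: int) -> np.ndarray:
    """Emulate an MM instruction compiled for n padded rows: the full n x p product
    is computed, but only the first n1 = token_length rows are written back."""
    full_output = a @ b                   # (n x m) @ (m x p) -> n x p
    return full_output[:token_length]     # keep only the first n1 rows

n, m, p, n1 = 128, 64, 64, 31             # instruction compiled for n=128, real length 31
padded_input = np.zeros((n, m))
padded_input[:n1] = np.random.rand(n1, m)  # rows beyond n1 are padding
weight = np.random.rand(m, p)
result = mm_with_truncation(padded_input, weight, token_length=n1)
print(result.shape)                        # (31, 64): padded rows are never written back
```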
For a nonlinear calculation instruction in the instruction set, if the dimension of the output result differs from that of the input token, the final result is not necessarily the first n₁ entries of the output; how the output is processed to obtain the final result is determined in combination with the specific nonlinear computation. The specific processing can be set according to the actual situation and is not specifically limited by the invention.
In an embodiment of the present invention, processing a calculated output result based on a length of an input token stored in a token length register to obtain a calculated final result includes: the first N of the computed output results are truncated as the final result of the computation, where N is equal to the length of the input token.
It can be understood that, in the embodiment of the invention, the calculated output is truncated to the length of the input token to obtain the final result of the calculation, so that the same instruction can be reused and the instruction storage space is reduced.
According to the instruction compression method for the large language model accelerator provided by the embodiment of the invention, the multiplexing ratio of instructions in the current instruction set is determined according to the parallelism of the current instruction set of the large language model accelerator; the same instruction in the current instruction set is reused, within the instruction multiplexing ratio, to process one or more input tokens; and the token length register is used to store the length of the input tokens so that the output results are handled correctly. A single set of instruction sequences thus supports large language model pre-fill or decoding inference for multiple different token lengths, solving the problem that the unfixed number of input and output tokens during large language model inference would otherwise require a large instruction storage space, and reducing the instruction storage space.
The instruction compression method for the large language model accelerator of the present invention will be described in detail with reference to fig. 2 to 4, specifically as follows:
The embodiment of the invention designs the strategy of instruction multiplexing and adds a token length register (shown in figure 2) in the accelerator, so as to allow the same instruction to be multiplexed by the pre-filling or decoding calculation with different lengths. Specifically, by setting a multiplexing ratio, token lengths within this range can share the same instruction. The selection of the multiplexing proportion is related to specific hardware design and performance. In addition, a token length register is added in the accelerator, and the value in the register is updated before each reasoning, so that the multiplexing of instructions with different token lengths is realized.
The choice of multiplexing ratio is closely related to the parallelism design of the instructions to be compressed in the instruction set. For example, suppose the instruction set defines a matrix-matrix multiplication instruction MM and a matrix-vector multiplication instruction MV, with MM parallelism 128 and MV parallelism 16; the multiplexing ratio of MM is then 128 and that of MV is 16, equal to the parallelism of the corresponding instruction. If the output length to be computed for the input token in the pre-fill stage is 1-128, one MM instruction suffices; if the required output length is 129-256, two MM instructions are needed, and MV behaves analogously. That is, the number of instructions changes only when the output length crosses a multiple of the instruction parallelism. The multiplexing ratio must therefore be less than or equal to the instruction parallelism so that instruction sequences for different lengths contain the same number of instructions. Moreover, because the instruction parallelism is large, performance is essentially unchanged when the actual token length varies within one parallelism-sized range: each compute instruction takes a similar time, and even if only 1 output is required, the computation is still carried out at full parallelism.
The pre-fill and decoding performance of an accelerator designed for large language models was tested at different token lengths on an AMD Alveo U280, with MM and MV instruction parallelism of 128 and 16, respectively. The pre-fill stage uses MM computations and the decoding stage uses MV computations. As shown in fig. 3, the latency per inference increases in a staircase pattern as the token length grows, and the width of each step corresponds to the parallelism of the instruction. That is, multiplexing instructions within the range of the instruction parallelism causes almost no performance degradation. Therefore, to achieve maximum instruction multiplexing, the instruction multiplexing ratio of the pre-fill stage may be set to the parallelism of the MM instruction, and that of the decoding stage to the parallelism of the MV instruction.
As shown in fig. 4, to implement instruction multiplexing, a token length register needs to be added to the hardware; it stores the actual input token length so as to avoid write-back of erroneous data. For example, when the actual pre-fill input length is 31, the instruction sequence for token length 128 is used (by padding the input, input tokens with lengths ranging from 1 to 128 can all be supported), but 31 is written into the hardware token length register. When the MM is computed, only the first 31 of the 128 output results are written out, preventing the trailing erroneous results from being written back to off-chip storage.
Verification was carried out on a large language model accelerator based on a U280 FPGA. The accelerator's instruction set comprises a data movement instruction; a matrix-matrix multiplication instruction with a parallelism of 128; a matrix-vector multiplication instruction with a parallelism of 16; a scalar calculation instruction with a parallelism of 16; and a data synchronization instruction. The data movement instruction and the data synchronization instruction do not involve computation, so the concept of parallelism does not apply to them and they are not treated as instructions to be compressed. For a typical large language model, LLaMA-7B, the accelerator's average instruction size is about 3 MB for a single decoding inference and about 282 MB for a single pre-fill inference. Without the instruction multiplexing scheme of the invention, all instruction sequences for token lengths 1-4096 would need to be stored, requiring about 1140 GB of storage space. After applying the instruction multiplexing scheme of the invention, only 9.56 GB of instruction storage is required, reducing the instruction storage space by a factor of about 119.
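The storage figures quoted above follow from simple arithmetic over the stated instruction sizes; the sketch below reproduces them, assuming one pre-fill and one decoding sequence per token length in the uncompressed case:

```python
MAX_TOKENS = 4096
PREFILL_MB, DECODE_MB = 282, 3           # average instruction sizes stated above
MM_PAR, MV_PAR = 128, 16                 # multiplexing ratios = instruction parallelism

naive_gb = MAX_TOKENS * (PREFILL_MB + DECODE_MB) / 1024
compressed_gb = (MAX_TOKENS // MM_PAR * PREFILL_MB +
                 MAX_TOKENS // MV_PAR * DECODE_MB) / 1024

print(f"without multiplexing: ~{naive_gb:.0f} GB")                # ~1140 GB
print(f"with multiplexing:    ~{compressed_gb:.2f} GB")            # ~9.56 GB
print(f"reduction:            ~{naive_gb / compressed_gb:.0f}x")   # ~119x
```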
In summary, by multiplexing instructions and adding a token length register to the accelerator, the embodiment of the invention allows pre-fill or decoding computations of different lengths to reuse the same instruction. This solves the problem that the unfixed number of input and output tokens during large language model inference would otherwise require a large instruction storage space, and instruction multiplexing thus effectively reduces the instruction storage space.
Next, an instruction compression device for a large language model accelerator according to an embodiment of the present invention will be described with reference to the accompanying drawings.
FIG. 5 is a block diagram of an instruction compression apparatus for a large language model accelerator according to an embodiment of the present invention.
As shown in fig. 5, the instruction compression apparatus 10 for a large language model accelerator includes: a processing module 100 and a generating module 200.
The processing module 100 is configured to determine an instruction multiplexing ratio of an instruction to be compressed in a current instruction set according to parallelism of the instruction to be compressed in the current instruction set of the large language model accelerator, where the instruction to be compressed is at least used for executing one calculation in a processing stage of the large language model; the generation module 200 is configured to generate and store a plurality of instructions based at least on an instruction multiplexing ratio of instructions to be compressed, wherein each instruction is configured to enable computation of instructions to be compressed that support input tokens of different length ranges.
In an embodiment of the present invention, the method further includes: the acquisition module is used for acquiring the input token and determining the length of the input token; storing the length of the input token with a token length register; selecting an instruction of which the supported length range includes a length from among the plurality of instructions to perform a corresponding calculation; the calculated output result is processed based on the length of the input token stored in the token length register to obtain a calculated final result.
In the embodiment of the invention, the instruction multiplexing proportion is smaller than or equal to the parallelism of the instructions to be compressed.
In the embodiment of the invention, the number N of the multiple instructions is equal to the ratio of the maximum length L of the input tokens supported by the large language model to the instruction multiplexing proportion K.
In an embodiment of the present invention, an nth instruction of the plurality of instructions supports calculation of the instruction to be compressed for an input token having a length in the range [(n-1)×K+1, n×K], where n is a positive integer less than or equal to N.
In an embodiment of the present invention, the method further includes: and the intercepting module is used for intercepting the first N of the calculated output results as a final result of calculation, wherein N is equal to the length of the input token.
In an embodiment of the present invention, the instruction to be compressed includes one or more of: a matrix-matrix multiplication instruction, a matrix-vector multiplication instruction, and a scalar calculation instruction.
In an embodiment of the invention, the processing stage includes a pre-fill stage and a decode stage.
It should be noted that the foregoing explanation of the embodiment of the instruction compression method for the large language model accelerator is also applicable to the instruction compression device for the large language model accelerator of this embodiment, and will not be repeated herein.
According to the instruction compression device for the large language model accelerator provided by the embodiment of the invention, the multiplexing ratio of instructions in the current instruction set is determined according to the parallelism of the current instruction set of the large language model accelerator; the same instruction in the current instruction set is reused, within the instruction multiplexing ratio, to process one or more input tokens; and the token length register is used to store the length of the input tokens. A single set of instruction sequences thus supports large language model pre-fill or decoding inference for multiple different token lengths, solving the problem that the unfixed number of input and output tokens during large language model inference would otherwise require a large instruction storage space, and reducing the instruction storage space.
FIG. 6 is a schematic diagram of a large language model accelerator according to an embodiment of the present invention. The large language model accelerator may include:
A token length register 604, a memory 601, a processor 602, and a computer program stored on the memory 601 and executable on the processor 602.
The processor 602 implements the instruction compression method for the large language model accelerator provided in the above embodiment when executing a program.
Further, the large language model accelerator further includes:
A communication interface 603 for communication between the memory 601 and the processor 602.
A memory 601 for storing a computer program executable on the processor 602; the token length register 604 may be within the memory 601 or external to the memory 601 or may be partially internal to the memory 601 or partially external to the memory 601.
The memory 601 may comprise high-speed RAM (Random Access Memory), and may also include non-volatile memory, such as at least one disk memory.
If the memory 601, the processor 602, and the communication interface 603 are implemented independently, the communication interface 603, the memory 601, and the processor 602 may be connected to each other through a bus and communicate with each other. The bus may be an ISA (Industry Standard Architecture) bus, a PCI (Peripheral Component Interconnect) bus, or an EISA (Extended Industry Standard Architecture) bus, among others. The bus may be divided into an address bus, a data bus, a control bus, and so on. For ease of illustration, only one thick line is shown in fig. 6, but this does not mean there is only one bus or one type of bus.
Alternatively, in a specific implementation, if the memory 601, the processor 602, and the communication interface 603 are integrated on a chip, the memory 601, the processor 602, and the communication interface 603 may perform communication with each other through internal interfaces.
The processor 602 may be a CPU (Central Processing Unit), an ASIC (Application Specific Integrated Circuit), or one or more integrated circuits configured to implement embodiments of the present invention.
In some embodiments, the large language model accelerator may include an FPGA (Field Programmable Gate Array), an ASIC, a GPU, or the like.
The embodiment of the invention also provides a computer readable storage medium, on which a computer program is stored, which when executed by a processor implements the instruction compression method for a large language model accelerator as above.
In the description of the present specification, a description referring to terms "one embodiment," "some embodiments," "examples," "specific examples," or "some examples," etc., means that a particular feature, structure, material, or characteristic described in connection with the embodiment or example is included in at least one embodiment or example of the present invention. In this specification, schematic representations of the above terms are not necessarily directed to the same embodiment or example. Furthermore, the particular features, structures, materials, or characteristics described may be combined in any suitable manner in any one or N embodiments or examples. Furthermore, the different embodiments or examples described in this specification and the features of the different embodiments or examples may be combined and combined by those skilled in the art without contradiction.
Furthermore, the terms "first," "second," and the like, are used for descriptive purposes only and are not to be construed as indicating or implying a relative importance or implicitly indicating the number of technical features indicated. Thus, a feature defining "a first" or "a second" may explicitly or implicitly include at least one such feature. In the description of the present invention, "N" means at least two, for example, two, three, etc., unless specifically defined otherwise.
Any process or method descriptions in flow charts or otherwise described herein may be understood as representing modules, segments, or portions of code which include one or more executable instructions for implementing specific logical functions or steps of the process, and additional implementations are included within the scope of the preferred embodiment of the present invention in which functions may be executed out of order from that shown or discussed, including substantially concurrently or in reverse order from that shown or discussed, depending on the functionality involved, as would be understood by those reasonably skilled in the art of the embodiments of the present invention.
It is to be understood that portions of the present invention may be implemented in hardware, software, firmware, or a combination thereof. In the above-described embodiments, the N steps or methods may be implemented in software or firmware stored in a memory and executed by a suitable instruction execution system. As with the other embodiments, if implemented in hardware, may be implemented using any one or combination of the following techniques, as is well known in the art: discrete logic circuits having logic gates for implementing logic functions on data signals, application specific integrated circuits having suitable combinational logic gates, programmable gate arrays, field programmable gate arrays, and the like.
Those of ordinary skill in the art will appreciate that all or a portion of the steps carried out in the method of the above-described embodiments may be implemented by a program to instruct related hardware, where the program may be stored in a computer readable storage medium, and where the program, when executed, includes one or a combination of the steps of the method embodiments.
While embodiments of the present invention have been shown and described above, it will be understood that the above embodiments are illustrative and not to be construed as limiting the invention, and that variations, modifications, alternatives and variations may be made to the above embodiments by one of ordinary skill in the art within the scope of the invention.

Claims (11)

1. An instruction compression method for a large language model accelerator is characterized in that a token length register is arranged in the large language model accelerator, and the method comprises the following steps:
Determining an instruction multiplexing proportion of an instruction to be compressed in a current instruction set of a large language model accelerator according to the parallelism of the instruction to be compressed in the current instruction set, wherein the instruction to be compressed is at least used for executing one calculation in a processing stage of the large language model;
a plurality of instructions are generated and stored based at least on an instruction multiplexing ratio of the instructions to be compressed, wherein each instruction is configured to enable computation of the instructions to be compressed that support input tokens of a different length range.
2. The large language model accelerator oriented instruction compression method of claim 1, further comprising:
obtaining an input token and determining the length of the input token;
storing a length of the input token with the token length register;
Selecting an instruction of the plurality of instructions that supports a range of lengths including the length to perform a corresponding calculation;
Processing the calculated output result based on the length of the input token stored in the token length register to obtain a final result of the calculation.
3. The large language model accelerator oriented instruction compression method of claim 1, wherein the instruction multiplexing ratio is less than or equal to the parallelism of the instructions to be compressed.
4. The method for compressing instructions for accelerator of large language model according to claim 1, wherein the number N of the plurality of instructions is equal to the ratio of the maximum length L of the input tokens supported by the large language model to the instruction multiplexing ratio K.
5. The method according to claim 4, wherein an nth instruction of the plurality of instructions supports calculation of the instruction to be compressed for an input token having a length in the range [(n-1)×K+1, n×K], wherein n is a positive integer less than or equal to N.
6. The method of instruction compression for large language model accelerator according to any one of claims 1 to 5, wherein the processing the calculated output result based on the length of the input token stored in the token length register to obtain the calculated final result comprises:
And intercepting the first N of the calculated output results as the final result of the calculation, wherein N is equal to the length of the input token.
7. The large language model accelerator oriented instruction compression method according to any one of claims 1 to 5, wherein the instruction to be compressed comprises one or more of: a matrix-matrix multiplication instruction, a matrix-vector multiplication instruction, and a scalar calculation instruction.
8. The method of instruction compression for large language model accelerators according to any one of claims 1-5, wherein the processing stage comprises a pre-fill stage and a decode stage.
9. An instruction compression device for a large language model accelerator, wherein a token length register is arranged in the large language model accelerator, and the device comprises:
the processing module is used for determining the instruction multiplexing proportion of the instructions to be compressed in the current instruction set according to the parallelism of the instructions to be compressed in the current instruction set of the large language model accelerator, and the instructions to be compressed are at least used for executing one calculation in the processing stage of the large language model;
And a generation module for generating and storing a plurality of instructions based at least on the instruction multiplexing ratio of the instructions to be compressed, wherein each instruction is configured to implement the calculation of the instructions to be compressed supporting input tokens of different length ranges.
10. A large language model accelerator, comprising: a token length register, a memory, a processor and a computer program stored on the memory and executable on the processor, the processor executing the program to implement the large language model accelerator oriented instruction compression method of any one of claims 1-8.
11. A computer-readable storage medium having stored thereon a computer program, wherein the program is executed by a processor for implementing the large language model accelerator oriented instruction compression method of any one of claims 1-8.
CN202410432921.9A 2024-04-11 2024-04-11 Instruction compression method, device, accelerator and storage medium Pending CN118034785A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202410432921.9A CN118034785A (en) 2024-04-11 2024-04-11 Instruction compression method, device, accelerator and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202410432921.9A CN118034785A (en) 2024-04-11 2024-04-11 Instruction compression method, device, accelerator and storage medium

Publications (1)

Publication Number Publication Date
CN118034785A true CN118034785A (en) 2024-05-14

Family

ID=90988088

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202410432921.9A Pending CN118034785A (en) 2024-04-11 2024-04-11 Instruction compression method, device, accelerator and storage medium

Country Status (1)

Country Link
CN (1) CN118034785A (en)


Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination