CN118094037B - Large language model video memory management method and device, electronic equipment and storage medium - Google Patents

Large language model video memory management method and device, electronic equipment and storage medium

Info

Publication number
CN118094037B
CN118094037B
Authority
CN
China
Prior art keywords
activation vector
parameter
output
stage
input
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202410437749.6A
Other languages
Chinese (zh)
Other versions
CN118094037A (en)
Inventor
汪玉
毛秋力
洪可
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Tsinghua University
Original Assignee
Tsinghua University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Tsinghua University filed Critical Tsinghua University
Priority to CN202410437749.6A priority Critical patent/CN118094037B/en
Publication of CN118094037A publication Critical patent/CN118094037A/en
Application granted granted Critical
Publication of CN118094037B publication Critical patent/CN118094037B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Links

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90Details of database functions independent of the retrieved data types
    • G06F16/95Retrieval from the web
    • G06F16/957Browsing optimisation, e.g. caching or content distillation
    • G06F16/9574Browsing optimisation, e.g. caching or content distillation of access to content, e.g. by caching
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90Details of database functions independent of the retrieved data types
    • G06F16/95Retrieval from the web
    • G06F16/951Indexing; Web crawling techniques
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/10Text processing
    • G06F40/12Use of codes for handling textual entities
    • G06F40/126Character encoding
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/0499Feedforward networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N5/00Computing arrangements using knowledge-based models
    • G06N5/04Inference or reasoning models
    • YGENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02DCLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • Artificial Intelligence (AREA)
  • Computational Linguistics (AREA)
  • Software Systems (AREA)
  • Mathematical Physics (AREA)
  • Computing Systems (AREA)
  • Health & Medical Sciences (AREA)
  • General Health & Medical Sciences (AREA)
  • Evolutionary Computation (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Molecular Biology (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Memory System Of A Hierarchy Structure (AREA)

Abstract

The disclosure relates to the field of machine learning, and in particular to a video memory management method and apparatus for a large language model, an electronic device, and a storage medium. The number of text units that currently require large language model inference is determined, the memory occupied by the activation vectors generated in the pre-filling stage is determined from that number, and an activation vector cache region is then partitioned at the tail of the KV cache region in the video memory accordingly. After the decoding stage is entered, activation vectors are taken out of the activation vector cache region for decoding, yielding the added KV data and the activation vectors generated in the decoding stage. The added KV data are stored into the activation vector cache region, and the activation vectors generated in the decoding stage are stored after the added KV data. By partitioning space for the activation vectors inside the KV cache region, the method and apparatus multiplex the KV cache region, solve the problem of discontinuous KV data, and avoid wasting video memory resources.

Description

Large language model video memory management method and device, electronic equipment and storage medium
Technical Field
The disclosure relates to the field of machine learning, and in particular relates to a method and a device for managing video memory of a large language model, electronic equipment and a storage medium.
Background
The "emergence" phenomenon that large language models (Large Language Model, LLM) exhibit once scaled beyond a certain size has demonstrated striking application potential in fields such as text generation, machine translation, and programming assistance. However, large models require far more video memory resources than conventional convolutional neural networks (Convolutional Neural Networks, CNNs). Commonly used CNN parameter counts are typically on the order of millions, for example AlexNet (60 million) and VGG (138 million), whereas common LLM parameter counts are typically on the order of billions, for example the multiple versions of Llama2 (7 billion, 13 billion, 70 billion); the parameter count of GPT-4 may even be several thousand billion. Such ultra-large-scale parameters place great strain on GPU (graphics processing unit) memory. The related art suffers from problems such as memory waste and discontinuous data when managing the data generated during large language model inference.
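For a rough sense of scale, the following back-of-the-envelope Python sketch (added here for illustration, assuming FP16 weights at 2 bytes per parameter and ignoring KV cache and activation memory) converts the parameter counts above into weight memory:

```python
# Illustrative only: weight memory at 2 bytes per parameter (FP16), no KV cache or activations.
params = {"AlexNet": 60e6, "VGG": 138e6, "Llama2-7B": 7e9, "Llama2-70B": 70e9}
for name, n in params.items():
    print(f"{name}: ~{n * 2 / 1e9:.1f} GB of weight memory")
# A 7-billion-parameter model alone needs roughly 14 GB before any KV data or
# activation vectors are stored, which is why video memory management matters.
```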
Disclosure of Invention
In view of this, the disclosure proposes a method, an apparatus, an electronic device, and a storage medium for managing a video memory of a large language model, which aims to reasonably manage a memory and avoid memory space waste.
According to a first aspect of the present disclosure, there is provided a memory management method of a large language model, the method including:
determining the number of text units corresponding to an input text which is required to be subjected to large language model reasoning at present, wherein the large language model reasoning process comprises a pre-filling stage and a decoding stage;
determining storage positions corresponding to the activation vectors generated in the pre-filling stage according to the number of the text units;
According to the storage position, the tail part of the KV cache area in the video memory is used as an activation vector cache area;
in response to entering a decoding stage, extracting an activation vector from the activation vector cache area for decoding, and obtaining added KV data and an activation vector generated in the decoding stage;
and storing the added KV data into the activation vector cache area, and storing the activation vector generated in the decoding stage after the added KV data.
In one possible implementation manner, using the tail of the KV cache region in the video memory as the active vector cache region according to the storage location includes:
Determining an external storage space outside a KV cache area in the video memory according to the maximum memory space occupied by the activation vector generated in the decoding stage;
Correcting the storage position according to the external storage space;
and using the tail of the KV cache area in the video memory as an activation vector cache area according to the corrected storage position.
In one possible implementation manner, the activation vector buffer area has a shape of [ bs, seqlen, 2×dim+ hdim ], where bs is a batch size, seqlen is a number of text units, dim is a feature dimension corresponding to each text unit, and hdim is a feature dimension corresponding to each text unit after expansion.
In one possible implementation manner, the activation vector cache area includes a first area, a second area and a third area;
The first and second regions have a shape [ bs, seqlen, dim ] and the third region has a shape [ bs, seqlen, hdim ].
In one possible implementation, the activation vector stored in the first region includes a first input parameter, a first residual parameter, and a first output parameter generated during an attention layer processing stage, and a second input parameter, a second residual parameter, and a second output parameter generated during a feed forward neural network processing stage.
In one possible implementation, the activation vector stored in the second region includes a first input parameter, a third output parameter, a fourth output parameter, and a third input parameter generated during the attention layer processing stage, and a fifth output parameter, a fourth input parameter, and a sixth output parameter generated during the feedforward neural network processing stage.
In one possible implementation, the activation vector stored in the third region includes a seventh output parameter, a fifth input parameter, and an eighth output parameter generated in the attention layer processing stage, and a ninth output parameter and a sixth input parameter generated in the feedforward neural network processing stage.
According to a second aspect of the present disclosure, there is provided a memory management apparatus of a large language model, the apparatus comprising:
The information determining module is used for determining the number of text units corresponding to the input text which is required to be subjected to large language model reasoning at present, and the large language model reasoning process comprises a pre-filling stage and a decoding stage;
the memory calculation module is used for determining a storage position corresponding to the activation vector generated in the pre-filling stage according to the number of the text units;
The buffer dividing module is used for utilizing the tail part of the KV buffer area in the video memory as an activation vector buffer area according to the storage position;
The data decoding module is used for responding to the entering of a decoding stage, extracting an activation vector from the activation vector buffer area for decoding, and obtaining added KV data and an activation vector generated in the decoding stage;
And the data storage module is used for storing the added KV data into the activation vector cache area and storing the activation vector generated in the decoding stage after the added KV data.
In one possible implementation manner, the cache segmentation module is further configured to:
Determining an external storage space outside a KV cache area in the video memory according to the maximum memory space occupied by the activation vector generated in the decoding stage;
Correcting the storage position according to the external storage space;
and using the tail of the KV cache area in the video memory as an activation vector cache area according to the corrected storage position.
In one possible implementation manner, the activation vector buffer area has a shape of [ bs, seqlen, 2×dim+ hdim ], where bs is a batch size, seqlen is a number of text units, dim is a feature dimension corresponding to each text unit, and hdim is a feature dimension corresponding to each text unit after expansion.
In one possible implementation manner, the activation vector cache area includes a first area, a second area and a third area;
The first and second regions have a shape [ bs, seqlen, dim ] and the third region has a shape [ bs, seqlen, hdim ].
In one possible implementation, the activation vector stored in the first region includes a first input parameter, a first residual parameter, and a first output parameter generated during an attention layer processing stage, and a second input parameter, a second residual parameter, and a second output parameter generated during a feed forward neural network processing stage.
In one possible implementation, the activation vector stored in the second region includes a first input parameter, a third output parameter, a fourth output parameter, and a third input parameter generated during the attention layer processing stage, and a fifth output parameter, a fourth input parameter, and a sixth output parameter generated during the feedforward neural network processing stage.
In one possible implementation, the activation vector stored in the third region includes a seventh output parameter, a fifth input parameter, and an eighth output parameter generated in the attention layer processing stage, and a ninth output parameter and a sixth input parameter generated in the feedforward neural network processing stage.
According to a third aspect of the present disclosure, there is provided an electronic device comprising: a processor; a memory for storing processor-executable instructions; wherein the processor is configured to implement the above-described method when executing the instructions stored by the memory.
According to a fourth aspect of the present disclosure, there is provided a non-transitory computer readable storage medium having stored thereon computer program instructions, wherein the computer program instructions, when executed by a processor, implement the above-described method.
According to a fifth aspect of the present disclosure, there is provided a computer program product comprising computer readable code, or a non-transitory computer readable storage medium carrying computer readable code, which when run in a processor of an electronic device, performs the above method.
In the embodiment of the disclosure, a first activation vector memory generated in a pre-filling stage is further determined by determining the number of text units currently needing large language model reasoning, and then the tail of a KV cache region in a video memory is utilized as an activation vector cache region according to the first activation vector memory. And after entering the decoding stage, the activation vector is taken out from the activation vector cache area for decoding, so that the added KV data and the activation vector generated in the decoding stage are obtained. And storing the added KV data into an activation vector cache area, and storing the activation vector generated in the decoding stage after the KV data. The method and the device can realize multiplexing of the KV cache region by utilizing the mode of storing the activation vector at the tail part of the KV cache region, solve the problem of discontinuous KV data and avoid the waste of memory resources.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate exemplary embodiments, features and aspects of the present disclosure and together with the description, serve to explain the principles of the disclosure.
Fig. 1 illustrates a flowchart of a memory management method of a large language model according to an embodiment of the present disclosure.
Fig. 2 shows a schematic diagram of determining an activation vector cache area according to an embodiment of the present disclosure.
Fig. 3 shows a schematic diagram of a network block structure in a large language model according to an embodiment of the present disclosure.
Fig. 4 shows a schematic diagram of an activation vector cache region according to an embodiment of the present disclosure.
Fig. 5 shows a schematic diagram of a memory management apparatus of a large language model according to an embodiment of the present disclosure.
Fig. 6 shows a schematic diagram of an electronic device according to an embodiment of the disclosure.
Detailed Description
Various exemplary embodiments, features and aspects of the disclosure will be described in detail below with reference to the drawings. In the drawings, like reference numbers indicate identical or functionally similar elements. Although various aspects of the embodiments are illustrated in the accompanying drawings, the drawings are not necessarily drawn to scale unless specifically indicated.
The word "exemplary" is used herein to mean "serving as an example, embodiment, or illustration. Any embodiment described herein as "exemplary" is not necessarily to be construed as preferred or advantageous over other embodiments.
In addition, numerous specific details are set forth in the following detailed description in order to provide a better understanding of the present disclosure. It will be understood by those skilled in the art that the present disclosure may be practiced without some of these specific details. In some instances, methods, means, elements, and circuits well known to those skilled in the art have not been described in detail in order not to obscure the present disclosure.
The video memory management method of the large language model of the embodiment of the disclosure can be executed by an electronic device such as a terminal device or a server. The terminal device may be any fixed or mobile terminal such as a User Equipment (UE), a mobile device, a user terminal, a cellular phone, a cordless phone, a Personal Digital Assistant (PDA), a handheld device, a computing device, a vehicle-mounted device, or a wearable device. The server may be a single server or a server cluster composed of a plurality of servers. Any such electronic device can implement the video memory management method of the large language model of the embodiment of the disclosure by having the processor invoke the computer-readable instructions stored in the memory.
Fig. 1 illustrates a flowchart of a memory management method of a large language model according to an embodiment of the present disclosure. As shown in fig. 1, the video memory management method of the large language model according to the embodiment of the present disclosure may include the following steps S10 to S50.
Step S10, determining the number of text units corresponding to the input text which is required to be subjected to large language model reasoning at present.
In one possible implementation, the electronic device determines the number of text units that need to be inferred by the large language model. A text unit, i.e., a token in the large language model, is the smallest processing unit of the text input to the large language model and may be a word, a punctuation mark, a number, or the like. When inferring over an input text, the large language model splits the input text into a number of tokens and then performs inference and analysis with each token as a unit. Before the large language model performs inference, the embodiment of the present disclosure determines the number of text units obtained by splitting the text currently input to the large language model, and then performs video memory management according to that number and the maximum number of text units that the video memory corresponding to the large language model can accommodate.
Optionally, the electronic device also needs to determine in advance the maximum number of text units that can be accommodated by the KV cache region in the video memory during large language model inference, where the KV cache region (KV cache) is used to store the KV (key, value) data generated during large language model inference.
And step S20, determining a storage position corresponding to the activation vector generated in the pre-filling stage according to the number of the text units.
In one possible implementation, the large language model is composed of multiple cascaded network blocks, where each network block includes two major parts, an attention layer (Attention layer) and a feedforward neural network (FFN), each of which contains multiple operators. The inference process of the large language model includes a pre-filling stage (Prefill) and a decoding stage (Decode) carried out by the above cascaded network blocks, where the input of the pre-filling stage is a sequence of text units and the input of the decoding stage is a single text unit. Therefore, in the pre-filling stage the activation vectors of all text units need to be stored in the video memory, and the memory footprint is large, whereas in the decoding stage only the activation vector of a single text unit needs to be stored, and the memory footprint is small. The memory footprint of the video memory reaches its peak after the pre-filling stage, and most of this space goes unused in the subsequent decoding stage, causing severe memory waste. Based on this background, the embodiment of the present disclosure determines in advance, according to the number of text units, the storage location of the activation vectors generated in the pre-filling stage, so that part of the KV data storage region (KV cache) is partitioned in advance for storing the activation vectors; this space is released in the decoding stage so that the KV data newly added in the decoding stage can be stored in it contiguously, thereby multiplexing the KV data storage region and avoiding memory waste while ensuring the continuity of the KV data.
Alternatively, the storage location of the activation vectors in the embodiment of the present disclosure may be calculated from the number of text units, where the storage location is a location inside the KV cache region that stores KV data. The KV cache region may have the shape [layernum, 2, bs, maxtoken, dim], where layernum is the number of layers of the large model, bs is the batch size, dim is the feature dimension corresponding to each text unit, and maxtoken is the maximum number of text units the KV cache region can hold. The electronic device calculates the storage location for the activation vectors in the KV cache region based on the maximum number of text units maxinput that can actually be input to the large language model, where maxinput is calculated from the maximum number of text units maxtoken that the KV cache region can accommodate. The maximum memory space occupied by the activation vectors generated during the pre-filling stage has the shape [bs, maxinput, 2×dim+hdim], where hdim is the expanded feature dimension corresponding to each text unit, i.e., the feature dimension obtained after each text unit is expanded by an operator in the feedforward neural network FFN, and 2×dim+hdim is the feature dimension corresponding to each text unit at the storage location. Since maxinput defines the maximum number of text units that can be input when the KV cache is multiplexed, the storage location of the activation vectors may be determined according to the shape of the maximum storage space occupied by the activation vectors generated in the pre-filling stage.
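As an illustration of this calculation, the Python sketch below assumes that maxinput is chosen so that the KV data of maxinput text units plus a prefill activation buffer of shape [bs, maxinput, 2×dim+hdim] together fit inside a KV cache region sized for maxtoken text units; the function name, the element-count arithmetic, and the exact sizing rule are assumptions for illustration, not the formula given in the original.

```python
# Sketch under the stated assumptions: element counts only, no dtype or alignment handling.
def plan_prefill_activation_buffer(layernum, bs, maxtoken, dim, hdim):
    # Total elements in a KV cache region of shape [layernum, 2, bs, maxtoken, dim].
    kv_cache_elems = layernum * 2 * bs * maxtoken * dim
    # Elements consumed per text unit: its KV data across all layers plus its
    # prefill activation slot of width 2*dim + hdim.
    per_token_kv = layernum * 2 * bs * dim
    per_token_act = bs * (2 * dim + hdim)
    # Largest input length whose KV data and activation buffer both fit in the region.
    maxinput = kv_cache_elems // (per_token_kv + per_token_act)
    # The activation vector cache region of shape [bs, maxinput, 2*dim + hdim]
    # is placed at the tail of the KV cache region.
    act_elems = bs * maxinput * (2 * dim + hdim)
    act_offset = kv_cache_elems - act_elems
    return maxinput, act_offset

# Llama2-7B-like sizes, purely illustrative:
print(plan_prefill_activation_buffer(layernum=32, bs=1, maxtoken=4096, dim=4096, hdim=11008))
```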
And step S30, utilizing the tail of the KV cache area in the video memory as an activation vector cache area according to the storage position.
In one possible implementation, after the storage location for storing the activation vectors in the KV cache region is determined, a region corresponding to that storage location may be partitioned at the tail of the KV cache region as the activation vector cache region for storing the activation vectors. In some embodiments, the region of shape [bs, maxinput, 2×dim+hdim] at the tail of the KV cache region serves as the activation vector cache region.
Fig. 2 shows a schematic diagram of determining an activation vector cache region according to an embodiment of the present disclosure. As shown in fig. 2, the upper part is a conventional video memory allocation method and the lower part is the video memory allocation method of the embodiment of the present disclosure. In the existing video memory allocation method, after the activation vectors are stored, the pre-allocated storage space is used very little in the subsequent decoding stage, which causes serious memory waste. Meanwhile, if the storage space of the activation vectors is released after the pre-filling stage and reallocated to the newly added KV data, the KV data become spatially discontinuous and additional computation overhead is incurred. The video memory allocation method of the embodiment of the present disclosure uses the tail of the KV cache region as the activation vector cache region; after the pre-filling stage, this tail is used as KV cache again for caching the KV data newly added in the decoding stage, which ensures the continuity of the KV data, and the activation data of the decoding stage are stored after the newly added KV data, which avoids wasting video memory space.
Further, in order to ensure that the activation vector cache region inside the KV cache region can be used to store the newly added KV data and to further reduce video memory waste, the electronic device may also determine the activation vector memory corresponding to the activation vectors generated in the decoding stage, that is, the size of the memory space these activation vectors need to occupy, and then set an external storage space outside the KV cache region according to this activation vector memory. For example, a storage space whose size equals the activation vector memory is partitioned after the KV cache region as the external storage space, the storage location is corrected according to the external storage space, and the activation vector cache region is partitioned at the tail of the KV cache region in the video memory according to the corrected storage location. For example, the activation vector cache region originally located at the tail of the KV cache region is shifted backward by the size of the activation vector memory, so that the tail of the activation vector cache region is no longer aligned with the tail of the KV cache region but with the tail of the external storage space. This video memory allocation manner ensures that the activation vectors are stored into the external storage space during the decoding stage, while the activation vector cache region pre-partitioned inside the KV cache region continues to be used for storing the newly added KV data.
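A minimal sketch of this correction, under the assumption that all sizes are expressed as element counts and that the external storage space sits immediately after the KV cache region (names are illustrative, not from the original):

```python
def correct_activation_offset(kv_cache_elems, act_elems, decode_act_elems):
    # External storage space reserved right after the KV cache region, sized to the
    # maximum activation memory of a single decoding step (decode_act_elems).
    external_offset = kv_cache_elems
    # Shift the prefill activation buffer backward by the same amount, so its tail
    # aligns with the tail of the external space instead of the KV cache tail.
    corrected_offset = (kv_cache_elems - act_elems) + decode_act_elems
    return corrected_offset, external_offset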
And step S40, in response to entering a decoding stage, extracting the activation vector from the activation vector buffer area for decoding, and obtaining the added KV data and the activation vector generated in the decoding stage.
In one possible implementation, while the inference of the large language model is in the pre-filling stage, the activation vectors generated by the inference process are stored in the activation vector cache region at the tail of the KV cache region. When the inference of the large language model enters the decoding stage, the currently stored activation vectors are taken out of the activation vector cache region for decoding, and the added KV data and the activation vectors generated in the decoding stage are obtained through decoding.
And step S50, storing the added KV data into the active vector cache area, and storing the active vector generated in the decoding stage after the added KV data.
In one possible implementation, after the decoding inference process of the large language model, the newly added KV data are stored into the activation vector cache region inside the KV cache region that held the activation vectors during the pre-filling stage, and the activation vectors generated in the decoding stage are stored after the added KV data, for example in an external cache region outside the KV cache region in the video memory, or inside the KV cache region.
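A small bookkeeping sketch of this decoding-step placement, assuming offsets and sizes are element counts and that the decode activations go to the external space described above (all names are illustrative assumptions):

```python
def decode_step_slots(kv_write_offset, added_kv_elems, external_offset):
    # Newly added KV data continue right after the existing KV data, inside the region
    # that held the prefill activation vectors, so the KV data stay contiguous.
    kv_slot = kv_write_offset
    next_kv_write_offset = kv_write_offset + added_kv_elems
    # The single-token activation vectors of this decoding step are stored after the
    # added KV data, e.g. in the external space reserved beyond the KV cache region.
    act_slot = external_offset
    return kv_slot, next_kv_write_offset, act_slot
```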
Fig. 3 shows a schematic diagram of a network block structure in a large language model according to an embodiment of the present disclosure. As shown in fig. 3, the inference process of the large language model includes a pre-filling stage and a decoding stage, and each stage performs inference through cascaded network blocks. Each network block comprises two major parts, an attention layer (Attention layer) and a feedforward neural network (FFN, Feed Forward Network), each of which contains multiple operators. Optionally, the attention layer mainly includes RMSnorm (Root Mean Square Layer Normalization), Q Project (Q projection operator), K Project (K projection operator), V Project (V projection operator), BMM (Batch Matrix Multiplication), SoftMax, O Project (output projection operator), Residual, and the like, and the FFN mainly includes RMSnorm, Linear1 (first linear operator), Linear2 (second linear operator), Silu (Sigmoid Linear Unit), Linear3 (third linear operator), Residual, and the like. Each operator generates a large number of activation vectors as intermediate results during computation. Since seqlen can be very large (very long text), the intermediate data occupy a very large amount of memory; the Flash Attention method can therefore be used to fuse the three operators BMM, SoftMax, and BMM so as to hide their intermediate results. Meanwhile, the output shape of Linear1 and Linear2 in the FFN is [bs, seqlen, hdim] (hdim is the expanded feature dimension, which is 11008 for Llama2-7B); they and Silu can be fused into dual_linear to hide the intermediate results.
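As a concrete illustration of the dual_linear fusion just described, the following PyTorch-style sketch assumes a Llama2-style FFN in which Silu is applied to the Linear1 branch; the function and weight names are illustrative assumptions, not the patent's code.

```python
import torch
import torch.nn.functional as F

def dual_linear(x, w1, w2):
    # x: [bs, seqlen, dim]; w1, w2: [dim, hdim] (for Llama2-7B, dim=4096 and hdim=11008).
    # Fusing Linear1, Linear2 and Silu keeps only one [bs, seqlen, hdim] intermediate.
    return F.silu(x @ w1) * (x @ w2)

def ffn(x, w1, w2, w3):
    h = dual_linear(x, w1, w2)   # [bs, seqlen, hdim]
    return h @ w3                # w3: [hdim, dim]; Linear3 projects back to [bs, seqlen, dim]

# Usage with small illustrative sizes:
bs, seqlen, dim, hdim = 1, 8, 4096, 11008
x = torch.randn(bs, seqlen, dim)
w1, w2, w3 = torch.randn(dim, hdim), torch.randn(dim, hdim), torch.randn(hdim, dim)
out = ffn(x, w1, w2, w3)         # shape [1, 8, 4096]
```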
Optionally, the operators are invoked many times in both the pre-filling stage and the decoding stage, and the resulting activation vectors need to be stored. In the pre-filling stage, the embodiment of the present disclosure stores the activation vectors into the activation vector cache region at the tail of the KV cache region, and in the decoding stage it stores the activation vectors into an external cache region outside the KV cache region in the video memory. In each stage, the cache region in the video memory used for storing activation vectors may be further divided into a first region, a second region, and a third region. The shapes of the first region and the second region are [bs, seqlen, dim], and the shape of the third region is [bs, seqlen, hdim], where bs is the batch size, seqlen is the number of text units, dim is the feature dimension corresponding to each text unit, and hdim is the expanded feature dimension corresponding to each text unit.
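One way to realize this three-region division is to view a single activation buffer of shape [bs, seqlen, 2×dim+hdim] as three slices along its last axis; the sketch below shows this, and the concatenated last-axis layout as well as the use of torch are assumptions made for illustration.

```python
import torch

def split_activation_buffer(buf, dim, hdim):
    # buf: [bs, seqlen, 2*dim + hdim], split along the last axis into the three regions.
    first = buf[..., :dim]             # [bs, seqlen, dim]
    second = buf[..., dim:2 * dim]     # [bs, seqlen, dim]
    third = buf[..., 2 * dim:]         # [bs, seqlen, hdim]
    return first, second, third

buf = torch.empty(1, 128, 2 * 4096 + 11008)
first, second, third = split_activation_buffer(buf, dim=4096, hdim=11008)
```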
Fig. 4 shows a schematic diagram of an activation vector cache region according to an embodiment of the present disclosure. As shown in fig. 4, the data stored in the three areas of the first area, the second area, and the third area are different, and may be preset according to the calculation order of each operator. Wherein the leftmost column is used to characterize the activation vectors stored by the first region at different times, the middle column is used to characterize the activation vectors stored by the second region at different times, and the rightmost column is used to characterize the activation vectors stored by the third region at different times.
Optionally, the activation vectors stored in the first region include a first input parameter, a first residual parameter, and a first output parameter generated in the attention layer processing stage, and a second input parameter, a second residual parameter, and a second output parameter generated in the feedforward neural network processing stage. The activation vectors stored in the second region include a first input parameter, a third output parameter, a fourth output parameter, and a third input parameter generated in the attention layer processing stage, and a fifth output parameter, a fourth input parameter, and a sixth output parameter generated in the feedforward neural network processing stage. The activation vectors stored in the third region include a seventh output parameter, a fifth input parameter, and an eighth output parameter generated in the attention layer processing stage, and a ninth output parameter and a sixth input parameter generated in the feedforward neural network processing stage. The activation vectors stored in each region are stored sequentially in chronological order.
In the first region, the first input parameter and the first residual parameter are both inputs of the operator RMSnorm in the attention layer, and the first output parameter is the final output of the attention layer; the second input parameter and the second residual parameter are inputs of the operator RMSnorm in the feedforward neural network, and the second output parameter is the final output of the feedforward neural network. In the second region, the third output parameter is the output of the operator RMSnorm in the attention layer, the fourth output parameter is the output of the operator Flash Attention obtained by fusing BMM, SoftMax, and BMM, the third input parameter is the input of the operator O Project, the fifth output parameter is the output of the operator RMSnorm in the feedforward neural network, the fourth input parameter is the input of the operator dual_linear obtained by fusing Linear1, Linear2, and Silu, and the sixth output parameter is the output of the operator Linear3. In the third region, the seventh output parameter is the output of the operator Q Project in the attention layer, the fifth input parameter is the input of the operator Flash Attention obtained by fusing BMM, SoftMax, and BMM, the eighth output parameter is the output of the operator O Project, the ninth output parameter is the output of the operator dual_linear obtained by fusing Linear1, Linear2, and Silu, and the sixth input parameter is the input of the operator Linear3.
Based on the above partitioning of the storage regions, the moment at which all three cache regions are fully used occurs in the FFN, where the two activation vectors input to and output from the dual_linear operator or the Linear3 operator fully occupy the second region and the third region, and the residual (Residual) fully occupies the first region. Therefore, the embodiment of the present disclosure can ensure that intermediate data storage does not conflict during large language model inference while using the minimum storage space, thereby avoiding storage space waste.
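The peak-usage argument above can be checked with a short element-count sketch (illustrative assumption: counting elements only, at the dual_linear / Linear3 step of the FFN):

```python
def peak_ffn_elems(bs, seqlen, dim, hdim):
    residual = bs * seqlen * dim        # first region: residual kept for the skip connection
    ffn_io = bs * seqlen * dim          # second region: e.g. dual_linear input / Linear3 output
    hidden = bs * seqlen * hdim         # third region: dual_linear output / Linear3 input
    # Sum equals bs * seqlen * (2*dim + hdim), exactly the activation buffer size,
    # so the three regions are sufficient and never conflict at the peak.
    return residual + ffn_io + hidden
```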
According to the above technical features, the embodiment of the present disclosure can multiplex the KV cache region by storing the activation vectors inside the KV cache region, thereby solving the problem of discontinuous KV data and avoiding the waste of memory resources. Further, the region storing the activation vectors is finely divided, so that the activation vectors are stored without conflict in the minimum storage space, which further avoids wasting storage space.
Fig. 5 shows a schematic diagram of a memory management apparatus of a large language model according to an embodiment of the present disclosure. As shown in fig. 5, the memory management device of the large language model according to the embodiment of the present disclosure may include:
the information determining module 50 is configured to determine the number of text units corresponding to an input text that is currently required to perform large language model reasoning, where the large language model reasoning process includes a pre-filling stage and a decoding stage;
The memory calculation module 51 is configured to determine a storage location corresponding to an activation vector generated in the pre-filling stage according to the number of text units;
the buffer dividing module 52 is configured to use the tail of the KV buffer area in the video memory as an active vector buffer area according to the storage location;
a data decoding module 53, configured to, in response to entering a decoding stage, extract an activation vector from the activation vector buffer area to perform decoding, and obtain added KV data and an activation vector generated in the decoding stage;
the data storage module 54 is configured to store the added KV data in the active vector buffer area, and store the active vector generated in the decoding stage after the added KV data.
In one possible implementation, the cache splitting module 52 is further configured to:
Determining an external storage space outside a KV cache area in the video memory according to the maximum memory space occupied by the activation vector generated in the decoding stage;
Correcting the storage position according to the external storage space;
and using the tail of the KV cache area in the video memory as an activation vector cache area according to the corrected storage position.
In one possible implementation manner, the activation vector buffer area has a shape of [ bs, seqlen, 2×dim+ hdim ], where bs is a batch size, seqlen is a number of text units, dim is a feature dimension corresponding to each text unit, and hdim is a feature dimension corresponding to each text unit after expansion.
In one possible implementation manner, the activation vector cache area includes a first area, a second area and a third area;
The first and second regions have a shape [ bs, seqlen, dim ] and the third region has a shape [ bs, seqlen, hdim ].
In one possible implementation, the activation vector stored in the first region includes a first input parameter, a first residual parameter, and a first output parameter generated during an attention layer processing stage, and a second input parameter, a second residual parameter, and a second output parameter generated during a feed forward neural network processing stage.
In one possible implementation, the activation vector stored in the second region includes a first input parameter, a third output parameter, a fourth output parameter, and a third input parameter generated during the attention layer processing stage, and a fifth output parameter, a fourth input parameter, and a sixth output parameter generated during the feedforward neural network processing stage.
In one possible implementation, the activation vector stored in the third region includes a seventh output parameter, a fifth input parameter, and an eighth output parameter generated in the attention layer processing stage, and a ninth output parameter and a sixth input parameter generated in the feedforward neural network processing stage.
In some embodiments, functions or modules included in an apparatus provided by the embodiments of the present disclosure may be used to perform a method described in the foregoing method embodiments, and specific implementations thereof may refer to descriptions of the foregoing method embodiments, which are not repeated herein for brevity.
The disclosed embodiments also provide a computer readable storage medium having stored thereon computer program instructions which, when executed by a processor, implement the above-described method. The computer readable storage medium may be a volatile or nonvolatile computer readable storage medium.
The embodiment of the disclosure also provides an electronic device, which comprises: a processor; a memory for storing processor-executable instructions; wherein the processor is configured to implement the above-described method when executing the instructions stored by the memory.
Embodiments of the present disclosure also provide a computer program product comprising computer readable code, or a non-transitory computer readable storage medium carrying computer readable code, which when run in a processor of an electronic device, performs the above method.
Fig. 6 shows a schematic diagram of an electronic device 1900 according to an embodiment of the disclosure. For example, electronic device 1900 may be provided as a server or terminal device. Referring to FIG. 6, electronic device 1900 includes a processing component 1922 that further includes one or more processors and memory resources represented by memory 1932 for storing instructions, such as application programs, that can be executed by processing component 1922. The application programs stored in memory 1932 may include one or more modules each corresponding to a set of instructions. Further, processing component 1922 is configured to execute instructions to perform the methods described above.
The electronic device 1900 may also include a power component 1926 configured to perform power management of the electronic device 1900, a wired or wireless network interface 1950 configured to connect the electronic device 1900 to a network, and an input/output interface 1958 (I/O interface). The electronic device 1900 may operate based on an operating system stored in the memory 1932, such as Windows Server™, Mac OS X™, Unix™, Linux™, FreeBSD™, or the like.
In an exemplary embodiment, a non-transitory computer readable storage medium is also provided, such as memory 1932, including computer program instructions executable by processing component 1922 of electronic device 1900 to perform the methods described above.
The present disclosure may be a system, method, and/or computer program product. The computer program product may include a computer readable storage medium having computer readable program instructions embodied thereon for causing a processor to implement aspects of the present disclosure.
The computer readable storage medium may be a tangible device that can hold and store instructions for use by an instruction execution device. The computer readable storage medium may be, for example, but not limited to, an electronic storage device, a magnetic storage device, an optical storage device, an electromagnetic storage device, a semiconductor storage device, or any suitable combination of the foregoing. More specific examples (a non-exhaustive list) of the computer-readable storage medium would include the following: portable computer disks, hard disks, random Access Memory (RAM), read-only memory (ROM), erasable programmable read-only memory (EPROM or flash memory), static Random Access Memory (SRAM), portable compact disk read-only memory (CD-ROM), digital Versatile Disks (DVD), memory sticks, floppy disks, mechanical coding devices, punch cards or in-groove structures such as punch cards or grooves having instructions stored thereon, and any suitable combination of the foregoing. Computer-readable storage media, as used herein, are not to be construed as transitory signals per se, such as radio waves or other freely propagating electromagnetic waves, electromagnetic waves propagating through waveguides or other transmission media (e.g., optical pulses through fiber optic cables), or electrical signals transmitted through wires.
The computer readable program instructions described herein may be downloaded from a computer readable storage medium to a respective computing/processing device or to an external computer or external storage device over a network, such as the internet, a local area network, a wide area network, and/or a wireless network. The network may include copper transmission cables, fiber optic transmissions, wireless transmissions, routers, firewalls, switches, gateway computers and/or edge servers. The network interface card or network interface in each computing/processing device receives computer readable program instructions from the network and forwards the computer readable program instructions for storage in a computer readable storage medium in the respective computing/processing device.
The computer program instructions for performing the operations of the present disclosure may be assembly instructions, Instruction Set Architecture (ISA) instructions, machine-related instructions, microcode, firmware instructions, state setting data, or source or object code written in any combination of one or more programming languages, including an object-oriented programming language such as Smalltalk, C++, or the like, and conventional procedural programming languages, such as the "C" programming language or similar programming languages. The computer readable program instructions may be executed entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer, or entirely on the remote computer or server. In the case of a remote computer, the remote computer may be connected to the user's computer through any kind of network, including a Local Area Network (LAN) or a Wide Area Network (WAN), or may be connected to an external computer (for example, through the Internet using an Internet service provider). In some embodiments, aspects of the present disclosure are implemented by personalizing electronic circuitry, such as programmable logic circuitry, Field Programmable Gate Arrays (FPGAs), or Programmable Logic Arrays (PLAs), with state information of computer readable program instructions, which can execute the computer readable program instructions.
Various aspects of the present disclosure are described herein with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems) and computer program products according to embodiments of the disclosure. It will be understood that each block of the flowchart illustrations and/or block diagrams, and combinations of blocks in the flowchart illustrations and/or block diagrams, can be implemented by computer-readable program instructions.
These computer readable program instructions may be provided to a processor of a general purpose computer, special purpose computer, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions/acts specified in the flowchart and/or block diagram block or blocks. These computer readable program instructions may also be stored in a computer readable storage medium that can direct a computer, programmable data processing apparatus, and/or other devices to function in a particular manner, such that the computer readable medium having the instructions stored therein includes an article of manufacture including instructions which implement the function/act specified in the flowchart and/or block diagram block or blocks.
The computer readable program instructions may also be loaded onto a computer, other programmable data processing apparatus, or other devices to cause a series of operational steps to be performed on the computer, other programmable apparatus or other devices to produce a computer implemented process such that the instructions which execute on the computer, other programmable apparatus or other devices implement the functions/acts specified in the flowchart and/or block diagram block or blocks.
The flowcharts and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of systems, methods and computer program products according to various embodiments of the present disclosure. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of instructions, which comprises one or more executable instructions for implementing the specified logical function(s). In some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams and/or flowchart illustration, and combinations of blocks in the block diagrams and/or flowchart illustration, can be implemented by special purpose hardware-based systems which perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.
The foregoing description of the embodiments of the present disclosure has been presented for purposes of illustration and description, and is not intended to be exhaustive or limited to the embodiments disclosed. Many modifications and variations will be apparent to those of ordinary skill in the art without departing from the scope and spirit of the various embodiments described. The terminology used herein was chosen in order to best explain the principles of the embodiments, the practical application, or the technical improvements in the marketplace, or to enable others of ordinary skill in the art to understand the embodiments disclosed herein.

Claims (9)

1. A method for managing video memory of a large language model, the method comprising:
determining the number of text units corresponding to an input text which is required to be subjected to large language model reasoning at present, wherein the large language model reasoning process comprises a pre-filling stage and a decoding stage;
determining storage positions corresponding to the activation vectors generated in the pre-filling stage according to the number of the text units;
According to the storage position, the tail part of the KV cache area in the video memory is used as an activation vector cache area;
in response to entering a decoding stage, extracting an activation vector from the activation vector cache area for decoding, and obtaining added KV data and an activation vector generated in the decoding stage;
storing the added KV data into the activation vector cache area, and storing the activation vector generated in the decoding stage after the added KV data;
According to the storage position, using the tail of the KV cache area in the video memory as an activation vector cache area, comprising:
Determining an external storage space outside a KV cache area in the video memory according to the maximum memory space occupied by the activation vector generated in the decoding stage;
Correcting the storage position according to the external storage space;
and using the tail of the KV cache area in the video memory as an activation vector cache area according to the corrected storage position.
2. The method of claim 1, wherein the activation vector buffer area has a shape of [bs, seqlen, 2×dim+hdim], where bs is a batch size, seqlen is a number of text units, dim is a feature dimension corresponding to each text unit, and hdim is a feature dimension corresponding to each text unit after expansion.
3. The method of claim 2, wherein the activation vector cache region comprises a first region, a second region, and a third region;
The first and second regions have a shape [ bs, seqlen, dim ] and the third region has a shape [ bs, seqlen, hdim ].
4. The method according to claim 3, wherein the activation vector stored in the first region comprises a first input parameter, a first residual parameter and a first output parameter generated during an attention layer processing stage, and a second input parameter, a second residual parameter and a second output parameter generated during a feedforward neural network processing stage, the first input parameter and the first residual parameter being inputs of the operator RMSnorm in the attention layer, the first output parameter being a final output of the attention layer, the second input parameter and the second residual parameter being inputs of the operator RMSnorm in the feedforward neural network, the second output parameter being a final output of the feedforward neural network.
5. The method according to claim 3, wherein the activation vector stored in the second region comprises a first input parameter, a third output parameter, a fourth output parameter and a third input parameter generated in the attention layer processing stage, and a fifth output parameter, a fourth input parameter and a sixth output parameter generated in the feedforward neural network processing stage, wherein the first input parameter is an input of the operator RMSnorm in the attention layer, the third output parameter is an output of the operator RMSnorm in the attention layer, the fourth output parameter is an output of the operator Flash Attention obtained by fusing BMM, SoftMax, and BMM, the third input parameter is an input of the operator O Project, the fifth output parameter is an output of the operator RMSnorm in the feedforward neural network, the fourth input parameter is an input of the operator dual_linear obtained by fusing Linear1, Linear2 and Silu, and the sixth output parameter is an output of the operator Linear3.
6. The method according to claim 3, wherein the activation vector stored in the third region includes a seventh output parameter, a fifth input parameter and an eighth output parameter generated in the attention layer processing stage, and a ninth output parameter and a sixth input parameter generated in the feedforward neural network processing stage, the seventh output parameter being an output of the operator Q Project in the attention layer, the fifth input parameter being an input of the operator Flash Attention obtained by fusing BMM, SoftMax, and BMM, the eighth output parameter being an output of the operator O Project, the ninth output parameter being an output of the operator dual_linear obtained by fusing Linear1, Linear2 and Silu, and the sixth input parameter being an input of the operator Linear3.
7. A memory management device for a large language model, the device comprising:
The information determining module is used for determining the number of text units corresponding to the input text which is required to be subjected to large language model reasoning at present, and the large language model reasoning process comprises a pre-filling stage and a decoding stage;
the memory calculation module is used for determining a storage position corresponding to the activation vector generated in the pre-filling stage according to the number of the text units;
The buffer dividing module is used for utilizing the tail part of the KV buffer area in the video memory as an activation vector buffer area according to the storage position;
The data decoding module is used for responding to the entering of a decoding stage, extracting an activation vector from the activation vector buffer area for decoding, and obtaining added KV data and an activation vector generated in the decoding stage;
The data storage module is used for storing the added KV data into the activation vector cache area and storing the activation vector generated in the decoding stage after the added KV data;
the cache segmentation module is further configured to:
Determining an external storage space outside a KV cache area in the video memory according to the maximum memory space occupied by the activation vector generated in the decoding stage;
Correcting the storage position according to the external storage space;
and using the tail of the KV cache area in the video memory as an activation vector cache area according to the corrected storage position.
8. An electronic device, comprising:
A processor;
A memory for storing processor-executable instructions;
Wherein the processor is configured to implement the method of any one of claims 1 to 6 when executing the instructions stored by the memory.
9. A non-transitory computer readable storage medium having stored thereon computer program instructions, which when executed by a processor, implement the method of any of claims 1 to 6.
CN202410437749.6A 2024-04-12 2024-04-12 Large language model video memory management method and device, electronic equipment and storage medium Active CN118094037B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202410437749.6A CN118094037B (en) 2024-04-12 2024-04-12 Large language model video memory management method and device, electronic equipment and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202410437749.6A CN118094037B (en) 2024-04-12 2024-04-12 Large language model video memory management method and device, electronic equipment and storage medium

Publications (2)

Publication Number Publication Date
CN118094037A (en) 2024-05-28
CN118094037B (en) 2024-06-25

Family

ID=91156384

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202410437749.6A Active CN118094037B (en) 2024-04-12 2024-04-12 Large language model video memory management method and device, electronic equipment and storage medium

Country Status (1)

Country Link
CN (1) CN118094037B (en)

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116627345A (en) * 2023-06-02 2023-08-22 北京云思智学科技有限公司 High-performance KV caching method and device applied to massive value key value pairs
CN117194056A (en) * 2023-11-07 2023-12-08 苏州元脑智能科技有限公司 Large language model reasoning optimization method, device, computer equipment and storage medium

Family Cites Families (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US11093143B2 (en) * 2019-07-12 2021-08-17 Samsung Electronics Co., Ltd. Methods and systems for managing key-value solid state drives (KV SSDS)
CN117370488A (en) * 2023-10-13 2024-01-09 Oppo广东移动通信有限公司 Data processing method, device, electronic equipment and computer readable storage medium
CN117273084A (en) * 2023-10-25 2023-12-22 上海壁仞科技股份有限公司 Calculation method and device of neural network model, electronic equipment and storage medium
CN117349032B (en) * 2023-12-05 2024-02-20 城云科技(中国)有限公司 Method and device for improving throughput of large language model
CN117852653A (en) * 2024-01-24 2024-04-09 上海壁仞科技股份有限公司 Space allocation method, device, equipment and medium for model reasoning

Patent Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116627345A (en) * 2023-06-02 2023-08-22 北京云思智学科技有限公司 High-performance KV caching method and device applied to massive value key value pairs
CN117194056A (en) * 2023-11-07 2023-12-08 苏州元脑智能科技有限公司 Large language model reasoning optimization method, device, computer equipment and storage medium

Also Published As

Publication number Publication date
CN118094037A (en) 2024-05-28

Similar Documents

Publication Publication Date Title
CN110347873B (en) Video classification method and device, electronic equipment and storage medium
CN117350360B (en) Fine tuning method and device for large model, electronic equipment and storage medium
CN117217288B (en) Fine tuning method and device for large model, electronic equipment and storage medium
CN117707791B (en) Method, apparatus and storage medium for performing attention calculations
CN111898338A (en) Text generation method and device and electronic equipment
CN117688386A (en) Parameter adjustment method and device for large model, electronic equipment and storage medium
CN118094037B (en) Large language model video memory management method and device, electronic equipment and storage medium
CN117130663B (en) Instruction reading method, L2 instruction cache, electronic equipment and storage medium
CN111158907B (en) Data processing method and device, electronic equipment and storage medium
CN116009792B (en) Data reading and writing device and method in image processing and electronic equipment
CN112269957A (en) Picture processing method, device, equipment and storage medium
CN113516196B (en) Named entity recognition data enhancement method, named entity recognition data enhancement device, electronic equipment and named entity recognition data enhancement medium
CN114595047A (en) Batch task processing method and device
CN114429629A (en) Image processing method and device, readable storage medium and electronic equipment
US11275507B2 (en) Method, electronic device, and computer storage medium for information processing
CN117350354B (en) Training method and device for large model, electronic equipment and storage medium
CN113361677A (en) Quantification method and device of neural network model
CN105117420A (en) Method and apparatus for updating word stock of input method
CN116542298B (en) Data processing method, device, electronic equipment and storage medium
CN107562442B (en) Method and device for reading data
CN117934323B (en) Image generation method, device, electronic equipment and storage medium
CN118034660B (en) Graph compiling method and device for large language model fusion operator and storage medium
CN118379749B (en) Visual language model parameter alignment method and device, storage medium and electronic equipment
CN111723548B (en) Comment expansion method and comment expansion device
CN118313482A (en) Continuous learning training method, device, medium and equipment for large language model

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant