CN117933270B - Large language model long text output method, device, equipment and storage medium - Google Patents

Large language model long text output method, device, equipment and storage medium

Info

Publication number
CN117933270B
CN117933270B (application CN202410340500.3A)
Authority
CN
China
Prior art keywords
value
round
text
length
values
Prior art date
Legal status
Active
Application number
CN202410340500.3A
Other languages
Chinese (zh)
Other versions
CN117933270A (en)
Inventor
Name withheld at the inventor's request
Current Assignee
Shencun Technology Wuxi Co ltd
Original Assignee
Shencun Technology Wuxi Co ltd
Priority date
Filing date
Publication date
Application filed by Shencun Technology Wuxi Co ltd filed Critical Shencun Technology Wuxi Co ltd
Priority to CN202410340500.3A priority Critical patent/CN117933270B/en
Publication of CN117933270A publication Critical patent/CN117933270A/en
Application granted granted Critical
Publication of CN117933270B publication Critical patent/CN117933270B/en

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/279 Recognition of textual entities
    • G06F40/30 Semantic analysis
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00 Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Machine Translation (AREA)

Abstract

The application discloses a method, apparatus, device, and storage medium for long text output from a large language model, relating to the field of processors. The method extracts and stores the corresponding KV values according to the semantic information of a prompt; extracts the ith-round KV values while long text is continuously output, computes the ith-round token, and generates text information from that round's token; determines the length of the ith-round KV set corresponding to the ith-round KV values, and segments and filters the set, in text generation order and according to the set length, to obtain the (i+1)th-round KV set of the (i+1)th-round KV values; replaces the ith-round KV values in video memory with the (i+1)th-round KV values, generates the (i+1)th-round token, and generates the corresponding text information. In this scheme the accumulated KV values are segmented by time: each iteration retains the adjacent (recent) part and suitably screens the distant part to form a new set. Compressing the KV values at every iteration greatly reduces video memory use while preserving the semantic association between earlier and later parts of the long text output.

Description

Large language model long text output method, device, equipment and storage medium
Technical Field
Embodiments of the present application relate to the field of processors, and in particular to a method, apparatus, device, and storage medium for long text output from a large language model.
Background
Generating long text is one of the more challenging tasks for a large language model. The reasons large language models currently struggle to generate long text are as follows:
1. Limited context understanding: a language model must take the preceding content into account when generating text to ensure the result is coherent and reasonable. As text length increases, the model must maintain understanding and memory of an ever longer context, which increases its complexity and computational cost.
2. Limited training data: generating long text requires a large amount of training data to capture the complexity and diversity of language. However, collecting and annotating long text data is expensive and time-consuming. The training sets currently available may be relatively small, which can limit the ability to generate long text.
3. Progressive error accumulation: as long text is generated, errors in the large language model gradually accumulate, so the generated text progressively loses accuracy and consistency. Even if the text the model generates at the start is correct, incorrect predictions or incomplete sentence structures can appear as generation proceeds, degrading the quality of the text as a whole.
4. Semantic consistency is hard to guarantee: when generating long text, the model must maintain consistent semantics and logic across the whole text. This is challenging because it requires understanding and reasoning about complex relationships in the text while ensuring the generated text is coherent and reasonable overall.
To address these problems, current large language models compute attention over all preceding tokens each time a new token is generated, saving the weights of all previous words; once the text length exceeds a limit, a large amount of attention data is held in video memory or main memory. When the available video memory is insufficient, this large-volume storage causes out-of-memory errors and text generation fails. A present-day LLM has at least billions of parameters, each a floating-point number typically stored in float32, bfloat16, or float16 format. To occupy as little video memory as possible, few models are run at float32 precision; bfloat16 is usual, and in some cases float16. A few examples of how much video memory is needed just to load a model in bfloat16 (see the check after this list):
GPT-3 requires 2 × 175 GB = 350 GB of video memory;
Bloom requires 2 × 176 GB = 352 GB of video memory;
Llama-2-70b requires 2 × 70 GB = 140 GB of video memory;
Falcon-40b requires 2 × 40 GB = 80 GB of video memory;
MPT-30b requires 2 × 30 GB = 60 GB of video memory.
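The arithmetic behind these figures is simply two bytes per bfloat16 parameter. A quick check in Python (parameter counts in billions, as listed above):

```python
# bfloat16 stores each parameter in 2 bytes, so the weights alone of a model
# need roughly 2 bytes * parameter count of video memory.
params_in_billions = {
    "GPT-3": 175,
    "Bloom": 176,
    "Llama-2-70b": 70,
    "Falcon-40b": 40,
    "MPT-30b": 30,
}
for name, billions in params_in_billions.items():
    print(f"{name}: {2 * billions} GB of video memory")
```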
Owing to current technological limits, the GPU with the largest video memory on the market has 80 GB, and the largest consumer-grade card only 24 GB. Meeting the above requirements today would demand a huge investment in GPUs for cluster deployment, which hinders the training and computation of large models and also limits semantic consistency across long text outputs.
Disclosure of Invention
The embodiments of the present application provide a method, apparatus, device, and storage medium for long text output from a large language model, addressing the problem that long text output from a large language model exhausts video memory.
In one aspect, the present application provides a method for long text output from a large language model, the method comprising:
receiving a model prompt, performing text information conversion according to the semantic information of the prompt, and extracting and storing the corresponding text key-value (KV) values; the KV values are stored, in the text generation order of the long text, in the video memory running the large language model;
extracting the ith-round KV values while the long text is continuously output, computing the ith-round text tokens from the ith-round KV values, and generating text information from the ith-round tokens; the ith-round KV values are the KV set formed by all KV values stored while the preceding long text was output; the KV set comprises a matched Key sequence set and Value sequence set, and the values in the Key and Value sequences produced by each round of model computation are arranged in model iteration order; i is a positive integer;
determining the length of the ith-round KV set corresponding to the ith-round KV values, and dividing that set, in text generation order and according to the set length, into a first KV set and a second KV set; the first KV set holds data generated far from the current iteration period, and the KV values in the second KV set are data generated near the current iteration period;
filtering the KV values of the first KV set, and combining the filtered first KV set with the second KV set into the (i+1)th-round KV set of the (i+1)th-round KV values;
replacing the ith-round KV values in video memory with the (i+1)th-round KV values, generating the (i+1)th-round tokens from the (i+1)th-round KV values, and generating the corresponding text information.
Specifically, before computing the ith-round text tokens, the method further includes:
generating the KV values produced by the nth iteration and determining the storage margin of the video memory;
when the storage margin of the video memory is not below the minimum threshold, merging the KV values generated in round n directly into video memory, forming the nth-round KV values together with the KV values generated in the previous n-1 rounds; n is a positive integer not greater than i.
Specifically, when the storage margin of the video memory is below the minimum threshold, the KV set stored in video memory is sorted in text generation order, then segmented and filtered.
Specifically, filtering the KV values of the first KV set and combining the filtered first KV set with the second KV set into the (i+1)th-round KV set of the (i+1)th-round KV values includes:
when the storage margin of the video memory is below the minimum threshold, determining the target set length of the model's per-iteration KV set in the video-memory-saturated state and the retention length of the second KV set; the number of KV values in the second KV set is not less than the number of KV values newly added per iteration;
computing the difference between the target set length and the retention length of the second KV set to determine the screening length of the first KV set;
randomly screening the KV values in the first KV set, and reordering the screened, filtered KV values in text generation order;
splicing the filtered and reordered first KV set with the second KV set to obtain the (i+1)th-round KV set of the target set length.
Specifically, elements of the Key sequence set and the Value sequence set in the first KV set correspond one-to-one by ID, and during screening the data sharing the same ID in the two sequence sets are screened synchronously; during splicing, the Key sequence sets and the Value sequence sets of the filtered first KV set and the second KV set are spliced and reordered respectively. An end-to-end sketch of this method is given below.
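The following is a minimal sketch of the loop described by the steps above. The model interface (prefill, step, free_margin_low) and the default values S = 70 and m = 40 are illustrative assumptions, not an API defined by the patent:

```python
import random

def generate_long_text(model, prompt, max_new_tokens, S=70, m=40):
    """Sketch of the claimed loop: extract round-i KV, compute the round-i
    token, split/filter/splice when memory is tight, then replace the set."""
    kv = model.prefill(prompt)            # KV values of the prompt, in generation order
    text = []
    for _ in range(max_new_tokens):
        token, new_kv = model.step(kv)    # round-i token computed from round-i KV values
        text.append(token)
        kv.append(new_kv)                 # merge the newly generated KV value
        if model.free_margin_low():       # storage margin below the minimum threshold
            first, second = kv[:-m], kv[-m:]          # far / near split
            keep = sorted(random.sample(range(len(first)), S - m))
            kv = [first[j] for j in keep] + second    # (i+1)th-round KV set
    return "".join(text)
```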
In another aspect, the present application provides a large language model long text output apparatus, the apparatus comprising:
the extraction and conversion module, configured to receive a model prompt, perform text information conversion according to the semantic information of the prompt, and extract and store the corresponding text key-value (KV) values; the KV values are stored, in the text generation order of the long text, in the video memory running the large language model;
the computing module, configured to extract the ith-round KV values while the long text is continuously output, compute the ith-round text tokens from the ith-round KV values, and generate text information from the ith-round tokens; the ith-round KV values are the set formed by all KV values stored while the preceding long text was output; the KV set comprises a matched Key sequence set and Value sequence set, and the values in the Key and Value sequences produced by each round of model computation are arranged in model iteration order; i is a positive integer;
the segmentation and filtering module, configured to determine the length of the ith-round KV set corresponding to the ith-round KV values, and to divide that set, in text generation order and according to the set length, into a first KV set and a second KV set; the first KV set holds data generated far from the current iteration period, and the KV values in the second KV set are data generated near the current iteration period;
the segmentation and filtering module is further configured to filter the KV values of the first KV set and combine the filtered first KV set with the second KV set into the (i+1)th-round KV set of the (i+1)th-round KV values;
and the updating module, configured to replace the ith-round KV values in video memory with the (i+1)th-round KV values, generate the (i+1)th-round tokens from the (i+1)th-round KV values, and generate the corresponding text information.
In yet another aspect, the present application provides a computer device, including a processor and a memory, where at least one instruction, at least one program, a set of codes, or a set of instructions is stored, where the at least one instruction, the at least one program, the set of codes, or the set of instructions is loaded and executed by the processor to implement the large language model long text output method described in the above aspect.
In yet another aspect, the present application provides a computer readable storage medium having stored therein at least one instruction, at least one program, a set of codes, or a set of instructions, the at least one instruction, the at least one program, the set of codes, or the set of instructions being loaded and executed by a processor to implement the large language model long text output method of the above aspect.
The technical scheme provided by the embodiments of the application has at least the following beneficial effects. The scheme abandons the traditional large-language-model text generation method, in which producing the next token requires the KV values of all previous tokens, so the computation grows with text length and consumes ever more resources, making very long texts such as a novel or a legal document impossible to generate. Instead, the scheme randomly samples KV values while retaining the adjacent KV values, so the KV values of all preceding tokens need not be kept; resource consumption stays bounded, and long texts such as novels and papers can be generated without limit.
The KV mechanism introduced here also speeds up computation: with the hyperparameter N = 3000, i.e., 3000 KV values kept in video memory, the measured text generation speed improves by 60%. This matters greatly for latency-sensitive products, for example human-machine interaction scenarios such as asking a language model in a car for a route to a destination, which has strict timeliness requirements.
Randomly sampled KV also reduces dependence on graphics card performance, so the model in this scheme can run and be deployed on a wide range of devices: servers, mobile phones, cars, and other terminals, truly bringing large language models into everyday life.
Drawings
FIG. 1 is the attention layer structure of an LLM in the related art;
FIG. 2 is a flowchart of the large language model long text output method provided by an embodiment of the present application;
FIG. 3 is a schematic diagram of generating tokens from the KV cache according to an embodiment of the present application;
FIG. 4 compares the attention used for token generation under the present scheme with conventional dense attention over all tokens;
FIG. 5 compares video memory occupancy over time for the two schemes;
FIG. 6 is a block diagram of the large language model long text output apparatus provided by an embodiment of the present application.
Detailed Description
For the purpose of making the objects, technical solutions and advantages of the present application more apparent, the embodiments of the present application will be described in further detail with reference to the accompanying drawings.
References herein to "a plurality" means two or more. "and/or", describes an association relationship of an association object, and indicates that there may be three relationships, for example, a and/or B, and may indicate: a exists alone, A and B exist together, and B exists alone. The character "/" generally indicates that the context-dependent object is an "or" relationship.
Given current GPU limits, the GPU with the largest video memory on the market is the A100 with 80 GB, and the largest consumer-grade card, the RTX4090, has 24 GB. Most models loaded at bfloat16 precision need more than 80 GB, so various tricks are required to fit the weights in video memory before inference, and long text raises the video memory requirements further, a consequence of its nature. Under such limits most language models cannot continuously output very long text (currently limited to a 2k context, i.e., about 2000 tokens, roughly 500 Chinese characters), because existing models compute all preceding attention weights: each time a word is generated, the weights of the previous words are saved, and once the text length exceeds a limit, the large amount of stored attention overflows video memory or main memory, so text generation fails. If the model is then required to continue outputting, it must be loaded and run again, inevitably losing the association with the preceding text.
This is because the self-attention layer is the core of a large language model (LLM); the two self-attention modules shown in FIG. 1 enable the model to understand the contextual relationships between input tokens. However, both the computation and the memory footprint of the self-attention layer grow quadratically with the number of input tokens (also called the sequence length, denoted N below). The main modules in the self-attention mechanism are matrix multiplication and softmax; matrix multiplication is the most resource-hungry operation and the core cause of video memory occupancy.
While this is not noticeable for shorter input sequences (fewer than 1000 input tokens), it is a serious problem for longer ones (e.g., around 16000 input tokens). For example, for a sequence X of length N, the self-attention output O is calculated as:

O = \operatorname{softmax}(QK^{T} / \sqrt{d})\,V, where Q = XW_Q, K = XW_K, V = XW_V, and X = (x_1, x_2, x_3, \dots, x_N).

The two projected matrices Q and K each contain N vectors, so the complexity of QK^T (in both space and time) grows as N^2, a very aggressive growth factor in space and time; avoiding this growth is necessary, and it is the biggest reason currently generated texts are short. In the conventional method, an LLM typically has multiple attention heads, so several self-attention computations run in parallel. Assuming the LLM has 40 attention heads and runs at bfloat16 precision, the memory needed to store the QK^T matrices is 40 × 2 × N^2 bytes: only about 50 MB when N = 1000, but 19 GB when N = 16000 and nearly 1 TB when N = 100000, just for the QK^T matrices (checked in the script below). In short, as the input context grows, the memory required by the default self-attention algorithm quickly becomes enormous. Moreover, as LLMs improve at text understanding and generation, they are applied to increasingly complex tasks. Early AI models translated or summarized a few sentences, but large language models are now asked to manage whole pages of text, such as a book or an entire movie script, which requires the ability to handle long input text.
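The quadratic blow-up quoted above follows directly from the 40 × 2 × N^2 estimate. A small script (the function name is ours) reproduces the figures:

```python
def qkt_bytes(n_tokens: int, heads: int = 40, bytes_per_value: int = 2) -> int:
    """Memory to hold the QK^T score matrices: heads * 2 bytes * N^2 (bfloat16)."""
    return heads * bytes_per_value * n_tokens ** 2

for n in (1_000, 16_000, 100_000):
    print(f"N={n}: {qkt_bytes(n) / 2**30:.2f} GiB")
# N=1000  -> under 0.1 GiB
# N=16000 -> about 19 GiB
# N=100000 -> about 745 GiB, i.e., near 1 TB
```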
Quantization techniques (which reduce a model's storage needs and computational complexity by shrinking the bit width of the data representation) can convert these 32-bit or 64-bit floating-point weights and activations into lower-bit integer or fixed-point numbers, shrinking the model and improving its efficiency on edge devices. But the effect is not decisive, and accuracy of whole-text prediction is lost.
Existing approaches improve LLMs along three lines: length extrapolation, context window extension, and better utilization of long text in output. Note that progress in these directions does not necessarily transfer between them; for example, extending the context size of LLMs does not improve performance beyond that context size, and neither approach guarantees efficient use of long contexts. The framework of this patent belongs mainly to the first category: applying LLMs to text far exceeding the pre-training window size, possibly even text of unlimited length. We neither extend the attention window size of LLMs nor enhance the model's memory and use of long text; the latter two directions are orthogonal to our emphasis but can be combined with our technique.
For example, in a 40-layer Llama (https://ai.meta.com/llama/) model, when a prompt (i.e., hint) of length 16 is input, the program saves all the KV in attention before computing the 17th token, with dimensions 1 × 40 × 16 × 128, where 128 is the dimension of each head and there are 40 heads, making up 5120 dimensions. When predicting the 18th token, the KV dimensions are 1 × 40 × 17 × 128; when predicting the 19th token, 1 × 40 × 18 × 128. As each freshly sampled token is appended to the input text, the stored KV dimensions keep growing with the text length; an ordinary text can run to thousands or tens of thousands of tokens, and the model's parameters also occupy video memory, so resource consumption is enormous. Keeping all the preceding KV in storage without reduction is therefore unrealistic.
The goal of length extrapolation is to let language models trained on shorter text handle longer text. One main research direction is relative position encoding for the Transformer model so it can operate outside the training window. One such study is Rotary Position Embeddings (RoPE) (Su et al., 2021), which performs relative position integration of the query and key in every attention layer. Despite RoPE's great potential, follow-up studies (Press et al., 2022; Chen et al., 2023) show it performs poorly on text beyond the training window. Another approach is ALiBi (Press et al., 2022), which biases the attention score by the distance between query and key, thereby introducing relative position information. While this improves extrapolation, our testing on the MPT model shows failure when the text length far exceeds the training length. Current methods have not achieved infinite-length extrapolation, so no existing LLM is suitable for streaming applications.
Context window extension focuses on enlarging the context window of LLMs so they can process more tokens in one forward pass. A major research direction is training efficiency: because of the quadratic complexity of attention computation, developing long-context LLMs is both a computational and a memory challenge. Solutions include system-level optimizations such as FlashAttention (Dao et al., 2022; Dao, 2023), which speeds up attention and reduces memory use, and approximate attention methods (Zaheer et al., 2020; Beltagy et al., 2020; Wang et al., 2020; Kitaev et al., 2020), which trade model quality for efficiency. Recently there have been many efforts to extend pre-trained LLMs with RoPE (Chen et al., 2023; kaiokendev, 2023; bloc97, 2023; Peng et al., 2023), involving position interpolation and fine-tuning. However, all of these techniques extend the LLM context window only to a limited degree, which falls short of the unlimited-input processing this work chiefly targets.
The above analysis raises the question of whether, for any long text model (given the characteristics of LLMs), all tokens generated earlier are actually useful when generating the next token. Distant tokens, whose text was output long ago, matter less, while the tokens adjacent to the text about to be output necessarily carry the relevant context. A targeted-discarding scheme can therefore compress the KV values, saving video memory while still supporting unlimited text output.
Fig. 2 is a flowchart of a large language model long text output method according to an embodiment of the present application, including the following steps:
Step 201: receive a model prompt, perform text information conversion according to the semantic information of the prompt, and extract and store the corresponding text key-value (KV) values.
The model prompt is the information input to the large language model; after it is fed in, the model performs semantic analysis and text conversion, during which KV values are continuously extracted and generated. A KV value is a key-value pair in the attention layer, i.e., a Key and its Value. KV values produced as the model runs are stored, in the text generation order of the long text (i.e., chronologically), in the video memory of the device running the large language model (e.g., the GPU's on-board memory), and the continuously generated and stored KV values form a KV set.
Step 202: extract the ith-round KV values while the long text is continuously output, compute the ith-round text tokens from the ith-round KV values, and generate text information from the ith-round tokens.
Text generation is described per iteration round: when the LLM reaches round i of the long text output stage, the previously stored ith-round KV values must be fetched from GPU video memory. Because long text requires semantic association, an iteration round cannot use only the KV values it adds itself; it must compute together with the earlier KV values, i.e., the ith-round tokens are computed from the ith-round KV values, and text information is generated from those tokens.
Note that the ith-round KV values here mean the KV set formed by all KV values stored while the preceding long text was output, not the KV values produced by the ith iteration alone: a single model iteration produces only one KV value, which cannot constitute a set. The KV values in this set are arranged in text generation order. The set comprises a matched Key sequence set and Value sequence set, and the values in the Key and Value sequences produced by each round of model computation are arranged in model iteration order. Here i is a positive integer. A minimal sketch of this paired structure follows.
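The matched Key/Value pairing can be sketched as two parallel sequences kept in generation order; the class and field names below are illustrative, not from the patent:

```python
from dataclasses import dataclass, field
from typing import Any, List

@dataclass
class RoundKVSet:
    """Round-i KV set: matched Key and Value sequences in generation order.
    keys[j] and values[j] share the same ID j, so any screening must keep
    or drop both together."""
    keys: List[Any] = field(default_factory=list)
    values: List[Any] = field(default_factory=list)

    def append(self, k: Any, v: Any) -> None:
        # one KV value is produced per model iteration
        self.keys.append(k)
        self.values.append(v)

    def __len__(self) -> int:
        return len(self.keys)
```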
Specifically, since no KV set is stored in video memory when the model is initialized, and ample capacity is available for caching, the KV set simply accumulates in the early stage of model operation. That is, before computing the ith-round text tokens, the method further includes the following steps:
A. Generate the KV values produced by the nth iteration and determine the storage margin of the video memory.
During initialization and the early stage of operation, each iteration round adds a new KV value to the set. At the same time, the capacity already used by the stored KV set (i.e., the storage size of the (n-1)th-round KV values) is computed synchronously, and it is judged whether the video memory still has a sufficient storage margin.
B. When the storage margin of the video memory is not below the minimum threshold, merge the KV values generated in round n directly into video memory, forming the nth-round KV values together with the KV values generated in the previous n-1 rounds; n is a positive integer not greater than i.
When the storage margin is not below the minimum threshold, video memory capacity is sufficient and nothing else needs to be considered; model text output at this stage works much like a traditional LLM, and the round-n KV values can be merged straight into video memory, joining the KV values of the previous n-1 rounds to form the nth-round KV values. At this point n is a positive integer not greater than i, i.e., no KV filtering has happened yet.
C. When the storage margin of the video memory is below the minimum threshold, sort the KV set stored in video memory in text generation order, then segment and filter it.
The storage margin threshold can be set according to the actual situation; a reserved portion is generally required to keep the graphics card operating normally. Below the threshold, KV values can no longer be merged into the set unconditionally each round; instead the segmentation and filtering mechanism for the KV set is triggered, as sketched below.
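Steps A-C reduce to an append plus a threshold check. A sketch building on the RoundKVSet above; the threshold value and the `compact` callback (which stands in for the segmentation-and-filtering mechanism of the following steps) are assumptions for illustration:

```python
MIN_MARGIN_BYTES = 512 * 1024 ** 2   # illustrative minimum threshold, set per device

def store_round_kv(kv_set, new_k, new_v, free_vram_bytes, compact):
    """Steps A-C: append the round-n KV value (the set is already kept in
    text generation order); trigger segmentation-and-filtering only when
    the video memory storage margin drops below the minimum threshold."""
    kv_set.append(new_k, new_v)              # step B: merge into the nth-round KV set
    if free_vram_bytes < MIN_MARGIN_BYTES:   # step C: margin below the threshold
        compact(kv_set)                      # segmentation-and-filtering mechanism
```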
Step 203: determine the length of the ith-round KV set corresponding to the ith-round KV values, and segment and filter that set, in text generation order and according to the set length, to obtain the (i+1)th-round KV set of the (i+1)th-round KV values.
In the iteration periods after the segmentation and filtering mechanism is triggered, each round first extracts the KV set retained from the preceding computation and segments and filters it. In particular, the first time segmentation filtering is triggered, the KV set is first split and then some KV values are filtered out according to the configured filtering mechanism, sharply reducing the count. In subsequent iteration periods, although new KV values are produced, continuous filtering keeps the (i+1)th-round KV set of the (i+1)th-round KV values within a certain length range (or at a fixed length), so the cache cannot overflow during long text output. The step specifically includes:
D. Divide the ith-round KV set corresponding to the ith-round KV values into a first KV set and a second KV set in text generation order.
Since the KV values in the set are ordered by text generation, the KV values generated in round i sit at the end of the set. Splitting first requires a division point, from which the complete KV set is cut into two. In this embodiment the division point is taken N positions from the last KV value of the set, counting in reverse order, dividing the ith-round KV set into the first and second KV sets. The first KV set holds data generated far from the current iteration period; the KV values in the second set are data generated near it. The value of N is set differently for different large language models, but the length (count) of the second KV set must exceed the number of KV values newly added per iteration period. In practice each iteration adds only one KV value, and the second KV set must contain at least several (far more than 1) consecutive KV values so that the most recently generated text is retained, ensuring the linguistic coherence and accuracy of the text generated next.
E. Filter the KV values of the first KV set, and combine the filtered first KV set with the second KV set into the (i+1)th-round KV set of the (i+1)th-round KV values.
The filtering mechanism can use random sampling or uniform (average) sampling, because long text output may correlate with earlier text; earlier "information" is thereby taken into account, avoiding "distortion" between subsequent text and what came before.
In one possible embodiment, the filtering and combining can be achieved as follows:
S1. When the storage margin of the video memory is below the minimum threshold, determine the target set length S of the model's per-iteration KV set in the video-memory-saturated state and the retention length m of the second KV set.
The target set length S can be a fixed value or a range; its purpose is to keep the KV values cached in video memory after each round within a stable range, so that storage stays bounded and no data overflow occurs. The retention length m of the second KV set is chosen according to the scenario or the hardware's video memory capability.
S2. Compute the difference between the target set length S and the retention length m of the second KV set to determine the screening length n of the first KV set.
S3. Randomly screen the KV values in the first KV set, and reorder the screened, filtered KV values in text generation order.
Mathematically, a retention length m of 40 at the division point is written [-40:], i.e., the last 40 elements; the elements before it are written [0:-40]. From [0:-40], 100 to 1000 elements are drawn by a random method, the exact number depending on the actual situation.
Let X = [x1, x2, x3, x4, x5] be an input sequence, where each xi is the ith element. First, initialize a set of key-value pairs (KV pairs), where both the Key and the Value correspond to elements of the input sequence; in this example letters stand for values, e.g., V = [a, b, c, d, e]. Next, randomly select a subset of key-value pairs for computing the next token. Assuming 3 key-value pairs are retained, one might select K = [x1, x3, x5] with the corresponding V = [a, c, e]. These pairs are then used to compute the next token: to generate the 6th token, random attention is applied, i.e., key-value pairs are randomly selected and a weighted sum is computed over them. Here K = [x1, x3, x5] and the corresponding V = [a, c, e] were selected at random; when the sequence is long enough, a random function random(n) can decide whether a given key-value pair is retained, where n is the number of pairs to screen, i.e., the screening length.
S4. Splice the filtered and reordered first KV set with the second KV set to obtain the (i+1)th-round KV set of the target set length S.
Elements of the Key sequence set and the Value sequence set in the first KV set correspond one-to-one by ID, and during screening the data sharing the same ID in the two sequence sets are screened synchronously. During splicing, the Key sequence sets and the Value sequence sets of the filtered first KV set and the second KV set are spliced and reordered respectively.
Building on the example above, suppose the stored key-value pairs are {KV0, KV1, KV2, KV3, KV4, ..., KV100}. When the 101st token must be generated, the position of the division point, i.e., the retention length m, is first determined, yielding the second KV set: with m = 40 the second KV set is {KV61, KV62, ..., KV100}, and the first KV set to be filtered is {KV0, KV1, KV2, KV3, ..., KV60}. Assuming the target set length S = 70, the screening length of the first KV set is n = 30, meaning 30 KV values are randomly screened out of {KV0, KV1, KV2, KV3, ..., KV60} and spliced with {KV61, KV62, ..., KV100}, as in the sketch below.
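The worked example maps directly onto list slicing; a minimal sketch with the numbers above (the seed is only for reproducibility):

```python
import random

random.seed(0)                          # reproducible sketch
kv = [f"KV{j}" for j in range(101)]     # {KV0, ..., KV100} in text generation order

S, m = 70, 40                           # target set length and retention length
first, second = kv[:-m], kv[-m:]        # [0:-40] and [-40:]: far and near parts
n = S - m                               # screening length of the first KV set: 30

keep = sorted(random.sample(range(len(first)), n))   # random screening ...
filtered_first = [first[j] for j in keep]            # ... kept in generation order

next_round = filtered_first + second    # splice into the (i+1)th-round KV set
assert len(next_round) == S             # length pinned at the target set length
```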
By contrast, a traditional large language model must save all KV and compute over it to output the next token, e.g., "the weather is nice today, I want to go out _": when generating the "_" word, all preceding information must be captured, and as the text grows the memory consumed by KV storage grows with it. That traditional scheme is therefore unsuited to generating long text.
With the present scheme, suppose the model first receives a prompt such as: "Suppose you are a senior computer science professor proficient in C++; please advise a university computer science student on how to learn it." This prompt must be fed to the model in full for the model to compute.
The model then generates the corresponding answer, e.g.: "As a senior computer science professor, I suggest you learn C++ systematically. Here are some suggestions: Theoretical basis: before starting with C++, make sure you know the basic concepts of programming, including variables, data types, conditional statements, loops, and so on. Understanding the underlying concepts of computer science will help you better understand the advanced features of C++." A traditional model would keep generating until the video memory explodes and generation cannot continue, because with too many words the storage occupied by the history becomes too large.
After the near-retention and far-screening operations, the stretch of text closest to the current position, e.g., "make sure you know the basic concepts of programming, including variables, data types, conditional statements, loops, and so on. Understanding the underlying concepts of computer science will help you better understand the advanced features of C++", about 65 words, corresponds to the second KV set and is kept in full. The earlier information (the prompt) is sampled by random extraction, e.g., a few fragmentary phrases such as "as a senior", "you learn systematically", and "some suggestions"; these form the filtered first KV set.
The two retained parts are combined and spliced together, and the next token is generated from this KV information. This greatly relieves storage pressure while keeping a large amount of useful information, and generation can repeat indefinitely without any length limit.
Step 204: replace the ith-round KV values in video memory with the (i+1)th-round KV values, generate the (i+1)th-round tokens from the (i+1)th-round KV values, and generate the corresponding text information.
The spliced set is then stored as the (i+1)th-round KV set, and the 101st token is determined from it.
Assuming the target set length S is fixed, the total length of the KV set stays unchanged in subsequent iterative computation: each filtering pass keeps the last 40 values and randomly draws 30 from the preceding content. Once dynamic balance is reached, each iteration in effect appends one new KV value at the end and randomly removes one KV value from the 31 oldest, exhibiting the effect of a sliding KV window, as the simulation below illustrates.
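A quick simulation of this steady state (names and seed are illustrative); the set length never grows past S:

```python
import random

random.seed(0)
S, m = 70, 40
window = [f"KV{j}" for j in range(S)]        # a balanced ith-round KV set
for step in range(1000):                     # 1000 further iterations
    window.append(f"KV{S + step}")           # the one new KV value of this round
    window.pop(random.randrange(S - m + 1))  # evict one of the 31 oldest values
    assert len(window) == S                  # sliding KV window: length constant
```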
FIG. 3 is a schematic diagram of generating tokens from the KV cache; the arrows indicate the KV information used when computing the next token. When generating the 7th token, values 1, 3, 5, 6 are randomly selected as the information to generate from; for the 8th token, 1, 3, 5, 7; for the 9th token, 1, 5, 7, 8.
Notably, the randomly selected KV window need not always be the same size, but one constraint holds: the KV immediately preceding the word being generated is kept contiguously, while distant KV survives at random. The KV window size is chosen case by case: since the scheme adapts to different devices with different video memory and RAM, this is a trade-off between precision and performance. The KV window length is 1024 on the server side and 32 on small devices such as cars and Raspberry Pi boards.
FIG. 4 compares the attention used for token generation under the present scheme with conventional full-history generation. Because LLMs are trained with a causal language model objective, the upper triangle of the attention matrix is not needed, which is why the upper-triangle attention scores in both diagrams are empty (i.e., probability 0). Conventional long text generation is called dense attention: each generated token requires computing all preceding KV values before prediction (e.g., the lower-right corner marks the token being generated). Dense attention has time complexity O(T^2), growing with the cache size, and its performance drops once the text length exceeds the pre-training length. Compared with saving all KV, the random attention proposed here randomly discards some earlier KV; when the next token is generated, random selection continues and the KV length is held at N, a hyperparameter. With N = 1000 the saved values occupy about 50 MB, so consumer-grade devices such as the RTX3090 or RTX4080 suffice for inference, instead of keeping the KV of the entire text in video memory. This yields a simple, efficient framework that lets LLMs trained with a limited attention window handle text of unlimited length without fine-tuning. The scheme exploits the fact that attention to long-past text is diffuse, and keeping a random sample of it holds the attention score distribution close to normal; it simply samples the KV of attention-convergence tokens at random to anchor the attention computation and stabilize model performance, as the sketch below illustrates. The mechanism achieves this goal in models including Llama-2, MPT, Falcon, and Pythia.
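A toy numpy sketch of attention restricted to N randomly retained KV positions (not the patent's code; random vectors stand in for real keys and values). Per new token the cost is O(N) rather than growing with the full history:

```python
import numpy as np

rng = np.random.default_rng(0)
d = 128                                  # per-head dimension, as in the Llama example
T = 5000                                 # tokens generated so far
keys = rng.standard_normal((T, d))
values = rng.standard_normal((T, d))
N = 1000                                 # hyperparameter: KV values retained

kept = np.sort(rng.choice(T, size=N, replace=False))   # random KV retention
q = rng.standard_normal(d)               # query of the token being generated

scores = keys[kept] @ q / np.sqrt(d)      # scaled dot products over kept keys only
weights = np.exp(scores - scores.max())
weights /= weights.sum()                  # softmax over the N retained positions
out = weights @ values[kept]              # attention output used for the next token
```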
To demonstrate feasibility, the experiments here use the open-source Llama model (https://huggingface.co/meta-llama/Llama-2-7b-hf) on a machine with an RTX3090 graphics card, with the KV retention length controlled so the video memory does not grow without bound, i.e., text can be generated indefinitely. As the comparison of video memory occupancy over time in FIG. 5 shows, after generating for a while, conventional dense attention explodes the video memory at a length of 4000, so generation fails at roughly 500 seconds. In this scheme's experiment (random attention), video memory grows as inference proceeds but drops after reaching a peak (when the storage margin falls below the minimum threshold) and then stops growing, because the random selection of KV takes effect: once the KV count is held steady, the video memory no longer grows and text generation continues indefinitely. Results on the PG19 (Compressive Transformers for Long-Range Sequence Modelling) test set are shown in Table 1.
Table 1: Dense Attention versus Random Attention (Llama-2-13B, PPL on PG19)

    Attention setting                  PPL
    Dense attention (short distance)   5.40
    Dense attention (long distance)    5158
    Random attention (N = 1000)        10.22
    Random attention (N = 2000)        9.83
    Random attention (N = 3000)        9.64
The smaller the perplexity (PPL), the better. Table 1 shows the perplexity of the Llama-2-13B language model under different attention mechanisms. PPL under dense attention at short distance is 5.40, while at long distance it reaches 5158, indicating severe performance degradation when processing long text. In contrast, PPL on long text with random attention (N = 1000, 2000, 3000) stays comparatively low at 10.22, 9.83, and 9.64 respectively, highlighting that random attention can perform better in long text scenarios. These results further reveal the Llama-2-13B model's comprehension and processing ability under different attention settings: short-distance dense attention shows low perplexity, but its performance suffers markedly on long text, whereas random attention on long text is more stable, reaching its lowest PPL at N = 3000. This suggests that in some situations, introducing a random attention mechanism helps improve the model's overall performance.
This design targets the pain point of existing long text generation: all KV is saved in order to generate the next word, even though KV from long ago contributes little to later words. It is the first to propose randomly selecting KV and retaining adjacent KV values in place of keeping all KV values; experiments show the method loses little precision, and because the KV volume is reduced it can run on different devices, so a server, a personal phone, or a car can all generate long text, demonstrating the practicality of this patent.
In summary, the scheme abandons the traditional large-language-model text generation method, in which producing the next token requires the KV values of all previous tokens, so the computation grows with text length and consumes ever more resources, making very long texts such as a novel or a legal document impossible to generate. Instead, the scheme randomly samples KV values while retaining the adjacent KV values, so the KV values of all preceding tokens need not be kept; resource consumption stays bounded, and long texts such as novels and papers can be generated without limit.
The KV mechanism introduced here also speeds up computation: with the hyperparameter N = 3000, i.e., 3000 KV values kept in video memory, the measured text generation speed improves by 60%, which matters greatly for latency-sensitive products, for example human-machine interaction scenarios such as asking a language model in a car for a route to a destination, which has strict timeliness requirements.
Randomly sampled KV also reduces dependence on graphics card performance, so the model in this scheme can run and be deployed on a wide range of devices: servers, mobile phones, cars, and other terminals, truly bringing large language models into everyday life.
Fig. 6 is a block diagram of a large language model long text output device according to an embodiment of the present application, the device including:
The extraction and conversion module 610 is configured to receive a model prompt, perform text information conversion according to the semantic information of the prompt, and extract and store the corresponding text key-value (KV) values; the KV values are stored, in the text generation order of the long text, in the video memory running the large language model.
The computing module 620 is configured to extract the ith-round KV values while the long text is continuously output, compute the ith-round text tokens from the ith-round KV values, and generate text information from the ith-round tokens; the ith-round KV values are the KV set stored while the preceding long text was output, with the KV values in the set arranged in text generation order; i is a positive integer.
The segmentation and filtering module 630 is configured to determine the length of the ith-round KV set corresponding to the ith-round KV values, and to segment and filter that set, in text generation order and according to the set length, to obtain the (i+1)th-round KV set of the (i+1)th-round KV values.
The updating module 640 is configured to replace the ith-round KV values in video memory with the (i+1)th-round KV values, generate the (i+1)th-round tokens from the (i+1)th-round KV values, and generate the corresponding text information.
Specifically, the extraction and conversion module is further configured such that the KV set comprises a matched Key sequence set and Value sequence set, with the values in the Key and Value sequences produced by each round of model computation arranged in model iteration order.
Specifically, the computing module 620 is further configured to: generate the KV values produced by the nth iteration and determine the storage margin of the video memory;
when the storage margin of the video memory is not below the minimum threshold, merge the KV values generated in round n directly into video memory, forming the nth-round KV values together with the KV values generated in the previous n-1 rounds, n being a positive integer not greater than i; and when the storage margin of the video memory is below the minimum threshold, sort the KV set stored in video memory in text generation order, then segment and filter it.
Specifically, the segmentation and filtering module 630 is further configured to: divide the ith-round KV set corresponding to the ith-round KV values into a first KV set and a second KV set in text generation order, the first KV set being data generated far from the current iteration period and the KV values in the second KV set being data generated near the current iteration period;
and filter the KV values of the first KV set and combine the filtered first KV set with the second KV set into the (i+1)th-round KV set of the (i+1)th-round KV values.
Specifically, the segmentation and filtering module 630 is further configured to: when the storage margin of the video memory is below the minimum threshold, determine the target set length of the model's per-iteration KV set in the video-memory-saturated state and the retention length of the second KV set, the number of KV values in the second KV set being not less than the number of KV values newly added per iteration;
compute the difference between the target set length and the retention length of the second KV set to determine the screening length of the first KV set;
randomly screen the KV values in the first KV set and reorder the screened, filtered KV values in text generation order;
and splice the filtered and reordered first KV set with the second KV set to obtain the (i+1)th-round KV set of the target set length.
The updating module 640 is further configured such that elements of the Key sequence set and the Value sequence set in the first KV set correspond one-to-one by ID, with the data sharing the same ID in the two sequence sets screened synchronously during screening;
and during splicing, the Key sequence sets and the Value sequence sets of the filtered first KV set and the second KV set are spliced and reordered respectively.
The embodiment of the application also provides a computer device, which comprises a processor and a memory, wherein at least one instruction, at least one section of program, a code set or an instruction set is stored in the memory, and the at least one instruction, the at least one section of program, the code set or the instruction set is loaded and executed by the processor to realize the large language model long text output method in the aspect.
Embodiments of the present application further provide a computer readable storage medium, where at least one instruction, at least one program, a code set, or an instruction set is stored, where the at least one instruction, the at least one program, the code set, or the instruction set is loaded and executed by a processor to implement the large language model long text output method described in the foregoing aspect.
The foregoing describes preferred embodiments of the present invention. It should be understood that the invention is not limited to the specific embodiments described above; devices and structures not described in detail are to be understood as implemented in the manner common in the art. Any person skilled in the art may make variations, modifications, or equivalent adaptations without departing from the technical solution of the present invention, and these do not affect its essential content; any simple modification or equivalent variation of the above embodiments made according to the technical substance of the present invention still falls within the scope of the technical solution of the present invention.

Claims (7)

1. A method for long text output from a large language model, the method comprising:
receiving a model prompt, performing text information conversion according to the semantic information of the prompt, and extracting and storing the corresponding text key-value (KV) values; the KV values being stored, in the text generation order of the long text, in the video memory running the large language model;
extracting the ith-round KV values while the long text is continuously output, computing the ith-round text tokens from the ith-round KV values, and generating text information from the ith-round tokens; the ith-round KV values being the KV set formed by all KV values stored while the preceding long text was output, the KV set comprising a matched Key sequence set and Value sequence set, with the values in the Key and Value sequences produced by each round of model computation arranged in model iteration order; i being a positive integer;
determining the length of the ith-round KV set corresponding to the ith-round KV values, and dividing that set, in text generation order and according to the set length, into a first KV set and a second KV set; the first KV set being data generated far from the current iteration period, and the KV values in the second KV set being data generated near the current iteration period;
when the storage margin of the video memory is below the minimum threshold, determining the target set length of the model's per-iteration KV set in the video-memory-saturated state and the retention length of the second KV set; the number of KV values in the second KV set being not less than the number of KV values newly added per iteration;
computing the difference between the target set length and the retention length of the second KV set to determine the screening length of the first KV set;
randomly screening the KV values in the first KV set and reordering the screened, filtered KV values in text generation order;
splicing the filtered and reordered first KV set with the second KV set to obtain the (i+1)th-round KV set of the target set length;
replacing the ith-round KV values in video memory with the (i+1)th-round KV values, generating the (i+1)th-round tokens from the (i+1)th-round KV values, and generating the corresponding text information.
2. The large language model long text output method according to claim 1, wherein before calculating the ith round text token, the method further comprises:
Generating the KV values of the nth iteration, and determining the storage margin of the video memory;
When the storage margin of the video memory is not lower than the minimum threshold, directly merging the KV values generated in the nth round into the video memory to form the nth round KV value together with the KV values generated in the previous n-1 rounds; n is a positive integer not greater than i.
3. The large language model long text output method according to claim 2, wherein when the storage margin of the video memory is lower than the minimum threshold, the KV sets stored in the video memory are sorted in the text generation order and then cut and filtered.
4. The large language model long text output method according to claim 1, wherein elements of the Key sequence set and the Value sequence set in the first KV set correspond one-to-one by ID, and the two sequence sets are screened synchronously on the same IDs during screening;
During splicing, the Key sequence sets and the Value sequence sets of the filtered first KV set and of the second KV set are each spliced and reordered separately.
5. A large language model long text output device, the device comprising:
The extraction and conversion module is configured to receive a model prompt, convert it into text information according to the semantic information of the prompt, and extract and store the corresponding text key-value pair (KV) values; the KV values are stored, in the text generation order of the long text, in the video memory running the large language model;
The computing module is configured to extract an ith round KV value when the long text is continuously output, calculate an ith round text token from the ith round KV value, and generate text information from the ith round token; the ith round KV value is the set formed by all KV values stored while the preceding long text was output, the KV set comprises a matched Key sequence set and Value sequence set, and the values in the Key sequence and the Value sequence produced by each round of model calculation are arranged in the model iteration order; wherein i is a positive integer;
The segmentation and filtering module is configured to determine the set length of the ith round KV set corresponding to the ith round KV value, and divide, according to the set length, the ith round KV set into a first KV set and a second KV set in the text generation order; wherein the first KV set holds data generated far from the current iteration period, and the second KV set holds data generated near the current iteration period;
When the storage margin of the video memory is lower than a minimum threshold, determine the target set length of the KV set for each model iteration in the saturated state of the video memory, and the retained length of the second KV set; the number of KV values in the second KV set is not less than the number of KV values newly added in each iteration;
Calculate the difference between the target set length and the retained length of the second KV set to determine the screening length of the first KV set;
Randomly screen the KV values in the first KV set down to the screening length, and reorder the retained KV values according to the text generation order;
Splice the filtered and reordered first KV set with the second KV set to obtain the (i+1)th round KV set of the target set length;
The update module is configured to replace the ith round KV value in the video memory with the (i+1)th round KV value, generate an (i+1)th round token from the (i+1)th round KV value, and generate the corresponding text information.
6. A computer device comprising a processor and a memory having stored therein at least one instruction, at least one program, code set, or instruction set that is loaded and executed by the processor to implement the large language model long text output method of any one of claims 1 to 4.
7. A computer readable storage medium having stored therein at least one instruction, at least one program, a set of codes, or a set of instructions, the at least one instruction, the at least one program, the set of codes, or the set of instructions being loaded and executed by a processor to implement the large language model long text output method of any one of claims 1 to 4.
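Read together, claims 1 to 3 describe a per-iteration loop: the newly generated KV values are merged directly while the video memory margin holds, and the stored set is cut and filtered once the margin falls below the minimum threshold. The sketch below illustrates that loop under the same assumed tensor layout, reusing compress_kv from the earlier sketch; memory_margin, min_margin, target_len, and keep_recent are illustrative names, not terms from the patent.

    import torch

    def memory_margin() -> int:
        # Free bytes on the current CUDA device (the "storage margin").
        free, _total = torch.cuda.mem_get_info()
        return free

    def merge_round(kv_keys, kv_values, new_k, new_v,
                    min_margin, target_len, keep_recent):
        # Form the nth round KV value: merge directly while the margin holds;
        # otherwise cut and filter the stored set first, as in claims 2 and 3.
        if memory_margin() < min_margin:
            # Saturated: shrink the stored set, leaving room for this round.
            kv_keys, kv_values = compress_kv(
                kv_keys, kv_values,
                target_len - new_k.size(0), keep_recent)
        # Merge the newly generated KV values in text-generation order,
        # forming the nth round KV value together with the earlier rounds.
        kv_keys = torch.cat([kv_keys, new_k], dim=0)
        kv_values = torch.cat([kv_values, new_v], dim=0)
        return kv_keys, kv_values

Because the second set always retains at least as many entries as one iteration adds, the most recent context survives every compression, which preserves the continuity between adjacent rounds of generated text.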
CN202410340500.3A 2024-03-25 2024-03-25 Large language model long text output method, device, equipment and storage medium Active CN117933270B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202410340500.3A CN117933270B (en) 2024-03-25 2024-03-25 Large language model long text output method, device, equipment and storage medium


Publications (2)

Publication Number Publication Date
CN117933270A CN117933270A (en) 2024-04-26
CN117933270B (en) 2024-05-24

Family

ID=90761283

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202410340500.3A Active CN117933270B (en) 2024-03-25 2024-03-25 Large language model long text output method, device, equipment and storage medium

Country Status (1)

Country Link
CN (1) CN117933270B (en)


Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20220036153A1 (en) * 2020-07-29 2022-02-03 Thayermahan, Inc. Ultra large language models as ai agent controllers for improved ai agent performance in an environment

Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2022224246A1 (en) * 2021-04-19 2022-10-27 Deepkeep Ltd. Device, system, and method for protecting machine learning, artificial intelligence, and deep learning units
CN116050397A (en) * 2023-03-07 2023-05-02 知呱呱(天津)大数据技术有限公司 Method, system, equipment and storage medium for generating long text abstract
CN116956906A (en) * 2023-07-14 2023-10-27 腾讯科技(北京)有限公司 Text generation method and device and electronic equipment
CN117371428A (en) * 2023-09-25 2024-01-09 百度国际科技(深圳)有限公司 Text processing method and device based on large language model
CN117610509A (en) * 2023-11-30 2024-02-27 北京理工大学 Text generation method based on diffusion language model
CN117349275A (en) * 2023-12-04 2024-01-05 中电数创(北京)科技有限公司 Text structuring method and system based on large language model

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Progressive Generation of Long Text with Pretrained Language Models; Bowen Tan et al.; arXiv; 2021-04-14; full text *

Also Published As

Publication number Publication date
CN117933270A (en) 2024-04-26


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant