CN114385785A - Rapid inference method and system supporting high-concurrency large-scale generative language models - Google Patents

Rapid inference method and system supporting high-concurrency large-scale generative language models

Info

Publication number
CN114385785A
Authority
CN
China
Prior art keywords
text
intermediate matrix
matrix
attention
preamble
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202111594472.0A
Other languages
Chinese (zh)
Inventor
易泽轩
胡江礼
邓磊
李文龙
王进
王晖
曾炜
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Peng Cheng Laboratory
Original Assignee
Peng Cheng Laboratory
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Peng Cheng Laboratory filed Critical Peng Cheng Laboratory
Priority to CN202111594472.0A
Publication of CN114385785A
Legal status: Pending

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33 Querying
    • G06F16/3331 Query processing
    • G06F16/334 Query execution
    • G06F16/3344 Query execution using natural language analysis
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N5/00 Computing arrangements using knowledge-based models
    • G06N5/04 Inference or reasoning models

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Artificial Intelligence (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Evolutionary Computation (AREA)
  • Databases & Information Systems (AREA)
  • Machine Translation (AREA)

Abstract

The invention discloses a method and system for fast inference of a large-scale generative language model supporting high concurrency. The method comprises the following steps: acquiring the attention intermediate values of the preamble text of step i and the predicted text of step i, and storing the attention intermediate values of the preamble text of step i; acquiring the attention intermediate value corresponding to the predicted text of step i, and obtaining the attention output result corresponding to the preamble text of step i+1 according to the attention intermediate value corresponding to the predicted text of step i and the attention intermediate values corresponding to the preamble text of step i; and generating the predicted text of step i+1 according to the attention output result corresponding to the preamble text of step i+1. The invention can accelerate the inference speed of large-scale generative language models and shorten the time a user waits for model output.

Description

Rapid inference method and system supporting high-concurrency large-scale generative language models
Technical Field
The invention relates to the technical field of natural language processing, and in particular to a fast inference method and system supporting a high-concurrency large-scale generative language model.
Background
More and more language models with huge numbers of parameters are emerging. When a large-scale generative language model is deployed as an online inference service, its inference speed is limited by the model scale: inference is slow, and users often need to wait a long time to obtain the model's inference result.
Thus, there is a need for improvements and enhancements in the art.
Disclosure of Invention
Aiming at the above defects in the prior art, the invention provides a fast inference method and system supporting a high-concurrency large-scale generative language model, with the aim of solving the problem of slow inference of large-scale generative language models in the prior art.
In order to solve the above technical problem, the technical scheme adopted by the invention is as follows:
In a first aspect of the present invention, a fast inference method supporting a high-concurrency large-scale generative language model is provided, where the method includes:
acquiring the attention intermediate values of the preamble text of step i and the predicted text of step i, and storing the attention intermediate values of the preamble text of step i;
acquiring the attention intermediate value corresponding to the predicted text of step i, and obtaining the attention output result corresponding to the preamble text of step i+1 according to the attention intermediate value corresponding to the predicted text of step i and the attention intermediate values corresponding to the preamble text of step i;
generating the predicted text of step i+1 according to the attention output result corresponding to the preamble text of step i+1;
wherein i is a positive integer.
In the fast inference method supporting a high-concurrency large-scale generative language model, the attention intermediate values of the preamble text of step i include the Q, K, V vectors corresponding to each word in the preamble text of step i.
In the fast inference method supporting a high-concurrency large-scale generative language model, the storing of the attention intermediate values of the preamble text of step i includes:
obtaining a first intermediate matrix and a second intermediate matrix of step i according to the K, V vectors of the preamble text of step i, wherein the sizes of the first intermediate matrix and the second intermediate matrix are both a preset size M × N, where M is the dimension of each vector in the matrix and N is the number of vectors;
wherein the first m vectors in the first intermediate matrix and the second intermediate matrix are the K and V vectors, respectively, corresponding to each word in the preamble text of step i, and the vectors after the m-th vector are 0;
and storing the first intermediate matrix and the second intermediate matrix of step i in video memory.
In the fast inference method supporting a high-concurrency large-scale generative language model, the obtaining of the attention output result corresponding to the preamble text of step i+1 according to the attention intermediate value corresponding to the predicted text of step i and the attention intermediate values corresponding to the preamble text of step i includes:
calculating the Q, K, V vectors corresponding to the predicted text of step i, and storing the K and V vectors corresponding to the predicted text of step i into a third intermediate matrix and a fourth intermediate matrix, respectively, wherein the sizes of the third intermediate matrix and the fourth intermediate matrix are both the preset size, the (m+1)-th vector of the third intermediate matrix is the K vector corresponding to the predicted text of step i, the (m+1)-th vector of the fourth intermediate matrix is the V vector corresponding to the predicted text of step i, and the remaining vectors are 0;
updating the first intermediate matrix and the second intermediate matrix of step i according to the third intermediate matrix and the fourth intermediate matrix to obtain the first intermediate matrix and the second intermediate matrix of step i+1;
and calculating the attention output result corresponding to the preamble text of step i+1 according to the first intermediate matrix and the second intermediate matrix of step i+1 and the Q vector corresponding to the predicted text of step i.
In the fast inference method supporting a high-concurrency large-scale generative language model, the updating of the first intermediate matrix and the second intermediate matrix of step i according to the third intermediate matrix and the fourth intermediate matrix to obtain the first intermediate matrix and the second intermediate matrix of step i+1 includes:
performing a summation operation on the third intermediate matrix and the first intermediate matrix of step i to obtain the first intermediate matrix of step i+1, and performing a summation operation on the fourth intermediate matrix and the second intermediate matrix of step i to obtain the second intermediate matrix of step i+1.
In the fast inference method supporting a high-concurrency large-scale generative language model, the acquiring of the attention intermediate values of the preamble text of step i includes:
when i = 1, calculating the Q, K, V vectors corresponding to each word in the preamble text of step i according to a preset first weight matrix, a preset second weight matrix and a preset third weight matrix.
In the fast inference method supporting a high-concurrency large-scale generative language model, the calculating of the Q, K, V vectors corresponding to each word in the preamble text of step i according to the preset first weight matrix, second weight matrix and third weight matrix includes:
acquiring the embedding vector of each word in the preamble text of step i;
and multiplying the embedding vector of each word by the first weight matrix, the second weight matrix and the third weight matrix, respectively, to obtain the Q, K, V vectors corresponding to each word.
In the fast inference method supporting a high-concurrency large-scale generative language model, after the predicted text of step i+1 is generated according to the attention output result corresponding to the preamble text of step i+1, the method further includes:
stopping inference and outputting the inference result when the predicted text of step i+1 includes a preset terminator, or when the total length of the predicted text of step i+1 and the preamble text of step i+1 reaches a preset threshold, wherein the inference result includes the preamble text of step i+1 and the predicted text of step i+1; otherwise, performing the inference of step i+2.
In a second aspect of the present invention, a terminal is provided, where the terminal includes a processor and a computer-readable storage medium communicatively connected to the processor; the computer-readable storage medium is adapted to store a plurality of instructions, and the processor is adapted to call the instructions in the computer-readable storage medium to perform the steps of any one of the above methods for fast inference supporting a high-concurrency large-scale generative language model.
In a third aspect of the present invention, a fast inference system supporting a high-concurrency large-scale generative language model is provided, the system comprising: a front-end device and at least one terminal provided by the second aspect of the invention;
the front-end device is used for receiving an inference request and sending the preamble text corresponding to the inference request to a target terminal among the at least one terminal according to the load of each terminal;
and the terminal is used for performing inference according to the preamble text corresponding to the inference request and returning the inference result to the front-end device.
In the fast inference system supporting a high-concurrency large-scale generative language model, the front-end device includes a request receiving module and a queue management module;
the request receiving module is used for receiving the inference request and preprocessing the text in the inference request to obtain the preamble text corresponding to the inference request;
and the queue management module is used for judging whether the current task queue has reached a preset length; if so, it returns prompt information to the sender of the inference request, otherwise it sends the preamble text corresponding to the inference request to the target terminal with the lowest load according to the load of each terminal and increases the length of the task queue by one.
In the fast inference system supporting a high-concurrency large-scale generative language model, the front-end device further includes a post-processing module;
the post-processing module is used for receiving the inference result output by the terminal, post-processing the inference result and sending the post-processed inference result to the sender of the inference request;
the queue management module is further configured to reduce the length of the task queue by one after the post-processing module receives the inference result from the terminal.
In a fourth aspect of the present invention, there is provided a computer-readable storage medium storing one or more programs, which are executable by one or more processors to implement the steps of any of the above methods for fast inference supporting a high-concurrency large-scale generative language model.
Compared with the prior art, in the method and system for fast inference of a large-scale generative language model supporting high concurrency provided by the invention, when the attention output result of step i+1 is calculated, the attention intermediate values of step i are reused: only the attention intermediate value of the predicted text generated in step i needs to be newly calculated, and the attention intermediate values of the preamble text of step i do not need to be recomputed. This reduces the attention computation of step i+1, accelerates the inference speed of large-scale generative language models, and shortens the time a user waits for model output.
Drawings
FIG. 1 is a flow chart of an embodiment of a fast inference method supporting a high-concurrency large-scale generative language model provided by the present invention;
FIG. 2 is a schematic diagram of an embodiment of a fast inference method supporting a high-concurrency large-scale generative language model according to the present invention;
FIG. 3 is a schematic diagram illustrating an effect of an embodiment of a fast inference method supporting a high-concurrency large-scale generative language model according to the present invention;
fig. 4 is a schematic diagram of an embodiment of a terminal provided in the present invention;
FIG. 5 is a diagram of an embodiment of a fast inference system supporting a large-scale generative language model with high concurrency provided by the present invention;
fig. 6 is a schematic diagram of the processing flow when a user submits an inference task, in an embodiment of the fast inference system supporting a high-concurrency large-scale generative language model provided by the present invention.
Detailed Description
In order to make the objects, technical solutions and effects of the present invention clearer, the present invention is further described in detail below with reference to the accompanying drawings and examples. It should be understood that the specific embodiments described herein are merely illustrative of the invention and are not intended to limit the invention.
The fast inference method supporting a high-concurrency large-scale generative language model provided by the invention can be applied to a terminal with computing capability. The terminal can execute the method to process the text to be processed, and the terminal can be, but is not limited to, various computers, mobile terminals, intelligent household appliances, wearable devices, and the like.
Example one
As shown in fig. 1, in an embodiment of the fast inference method for supporting a highly concurrent large-scale generative language model, the method includes the steps of:
s100, acquiring the attention median of the preamble text in the step i and the predicted text in the step i, and storing the attention median of the preamble text in the step i.
Specifically, the attention median value referred to in this embodiment refers to Q, K, V vectors in the attention mechanism, and the attention median value of the preamble text of step i includes Q, K, V vectors corresponding to each word in the preamble text of step i. In the generative language model, most of them adopt an attention mechanism to make reasoning, and each step of the generative language task is a process of inputting a text, then predicting the next word of the text, then combining to generate a new text as the input of the next step, and then predicting the next word. That is, the prologue text of the ith step and the predicted text of the ith step are combined to be the prologue text of the (i + 1) th step. Specifically, in this embodiment, K, V vectors corresponding to each word in the preamble text of the step i are saved for the calculation of the step i +1, where i is a positive integer.
When i > 1, the attention median value of the preamble text of the ith step may be calculated according to the attention median values of the preamble text of the (i-1) th step and the predicted text, and the specific process may refer to the description of the process for acquiring the attention median value of the preamble text of the (i + 1) th step, and when i is 1, the attention median value of the preamble text of the ith step is calculated from the word vector of each word of the preamble text of the ith step. Namely, the obtaining of the attention median value of the preamble text of the ith step includes:
and when i is 1, calculating an Q, K, V vector corresponding to each word in the preamble text of the step i according to a preset first weight matrix, a preset second weight matrix and a preset third weight matrix.
The calculating Q, K, V vectors corresponding to each word in the preamble text of the step i according to a preset first weight matrix, a preset second weight matrix and a preset third weight matrix comprises:
acquiring an embedded vector of each word in the preamble text of the step i;
and multiplying the embedded vector of each word by the first weight matrix, the second weight matrix and the third weight matrix respectively to obtain Q, K, V vectors corresponding to each word.
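For illustration, a minimal NumPy sketch of this projection step is given below; the weight-matrix names, shapes and random values are assumptions for the example, not values taken from the patent.

    import numpy as np

    d_model, d_head = 8, 8  # illustrative dimensions (assumed)
    rng = np.random.default_rng(0)
    w_q = rng.standard_normal((d_model, d_head))  # preset first weight matrix (assumed)
    w_k = rng.standard_normal((d_model, d_head))  # preset second weight matrix (assumed)
    w_v = rng.standard_normal((d_model, d_head))  # preset third weight matrix (assumed)

    def project_qkv(embeddings):
        """Multiply each word's embedding vector by the three weight
        matrices to obtain its Q, K, V vectors (one row per word)."""
        return embeddings @ w_q, embeddings @ w_k, embeddings @ w_v

    # m = 3 words of preamble text, each with an embedding of size d_model
    embeddings = rng.standard_normal((3, d_model))
    q, k, v = project_qkv(embeddings)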
The storing of the attention intermediate values of the preamble text of step i includes:
obtaining a first intermediate matrix and a second intermediate matrix of step i according to the K, V vectors of the preamble text of step i, wherein the sizes of the first intermediate matrix and the second intermediate matrix are both a preset size M × N, where M is the dimension of each vector in the matrix and N is the number of vectors;
wherein the first m vectors in the first intermediate matrix and the second intermediate matrix are the K and V vectors, respectively, corresponding to each word in the preamble text of step i, and the vectors after the m-th vector are 0;
and storing the first intermediate matrix and the second intermediate matrix of step i in video memory.
In this embodiment, the attention intermediate values corresponding to the preamble text of step i are multiplexed in the inference of step i+1, so that only the attention intermediate value corresponding to the predicted text of step i needs to be newly calculated in the inference of step i+1.
In this embodiment, the K and V vectors corresponding to all the words of the sentence at each step of the inference are stored in corresponding matrices of a preset size, so that the model does not need to define separate computation patterns for the varying numbers of K, V vectors corresponding to different preamble texts; it only needs to define a computation pattern for a matrix of fixed size. In this embodiment, the matrices corresponding to the K vectors and the V vectors are the first intermediate matrix and the second intermediate matrix, respectively; the sizes of both are the preset size M × N, where M is the dimension of each vector in the matrix and N is the number of vectors. The K and V vectors of the preamble text of step i are stored into the first intermediate matrix and the second intermediate matrix of step i, respectively. Specifically, if there are m words in the preamble text of step i, the first m vectors of the first intermediate matrix of step i are the K vectors corresponding to each word in the preamble text of step i, the first m vectors of the second intermediate matrix of step i are the V vectors corresponding to each word in the preamble text of step i, and the other vectors of the first intermediate matrix and the second intermediate matrix are set to 0. The first intermediate matrix and the second intermediate matrix of step i are stored in video memory (GPU memory), which saves host-to-device (host2device) data synchronization time: at step i+1 the first intermediate matrix and the second intermediate matrix can be read directly to obtain the K, V vectors needed in the attention calculation, thereby saving time.
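A minimal sketch of this fixed-size buffering, assuming illustrative values of M and N (an actual implementation would allocate these buffers in GPU video memory rather than as NumPy arrays):

    import numpy as np

    M, N = 8, 16  # preset size: M = vector dimension, N = max number of vectors (assumed values)

    def make_intermediate_matrices(k_vectors, v_vectors):
        """Build the first/second intermediate matrices of step i: columns
        0..m-1 hold the K (resp. V) vectors of the m preamble words; the
        remaining N - m columns stay zero, so the matrix shape never changes."""
        m = k_vectors.shape[1]
        first = np.zeros((M, N))   # K cache
        second = np.zeros((M, N))  # V cache
        first[:, :m] = k_vectors
        second[:, :m] = v_vectors
        return first, second

    # e.g. m = 3 preamble words, each K/V vector of dimension M
    first, second = make_intermediate_matrices(np.ones((M, 3)), np.ones((M, 3)))

Because every step sees the same M × N shape, a single compiled attention computation can serve preamble texts of any length up to N.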
Referring to fig. 1 again, the method provided in this embodiment further includes the steps of:
s200, acquiring an attention intermediate value corresponding to the predicted text in the ith step, and acquiring an attention output result corresponding to the preamble text in the (i + 1) th step according to the attention intermediate value corresponding to the predicted text in the ith step and the attention intermediate value corresponding to the preamble text in the ith step.
As shown in fig. 2, in the conventional generative model, after the predicted text of the ith step is generated based on the preamble text of the ith step, the preamble text of the ith step and the predicted text of the ith step are combined and then input to the model to generate the predicted text of the (i + 1) th step, for example, in fig. 2, "one ancient is" is input to the model to calculate attention, the predicted word "language" is generated, and "one ancient is" is input to the model to calculate attention, the predicted word "model" is generated, and in this process, K, V vectors of the preamble text of the ith step are repeatedly calculated. In the embodiment, when the attention of the i +1 th step is calculated, K, V vectors of the preamble text of the i th step are multiplexed, for example, "one chi is" is input into the model to calculate the attention, the predictive word "language" is generated, then "language" is input into the model to calculate the attention intermediate vector, and the corresponding attention intermediate vector "chi is input into the model, so that the predictive word" model "is obtained through calculation.
Specifically, the obtaining an attention output result corresponding to the preamble text of the (i + 1) th step according to the attention intermediate value corresponding to the predicted text of the (i) th step and the attention intermediate value corresponding to the preamble text of the (i) th step includes:
calculating Q, K, V vectors corresponding to the predicted text in the ith step, respectively storing K, V vectors corresponding to the predicted text in the ith step into a third intermediate matrix and a fourth intermediate matrix, wherein the sizes of the third intermediate matrix and the fourth intermediate matrix are the preset sizes, the m +1 th vector of the third intermediate matrix is the K vector corresponding to the predicted text in the ith step, the m +1 th vector of the fourth intermediate matrix is the V vector corresponding to the predicted text in the ith step, and the rest vectors are 0;
updating the first intermediate matrix and the second intermediate matrix in the ith step according to the third intermediate matrix and the fourth intermediate matrix to obtain the first intermediate matrix and the second intermediate matrix in the (i + 1) th step;
and calculating an attention output result corresponding to the preamble text of the step i +1 according to the first intermediate matrix and the second intermediate matrix of the step i +1 and the Q vector corresponding to the predicted text of the step i.
In the autoregressive generating language model, each step generates a word, namely the predicted text of the step i is a word, the predicted text of the step i is input, Q, K, V vectors corresponding to the predicted text of the step i are obtained, the Q vector corresponding to the predicted text of the step i is kept unchanged, and K, V vectors of the predicted text of the step i and K, V vectors of the preamble text of the step i are combined to obtain the attention median of the step i + 1. Specifically, firstly, the K, V vectors calculated for the predicted text of the ith step are saved as a third intermediate matrix and a fourth intermediate matrix, and the sizes of the third intermediate matrix and the fourth intermediate matrix are both the preset size, that is, M × N. And the m +1 th vector of the third intermediate matrix is a K vector corresponding to the predicted text of the ith step, the m +1 th vector of the fourth intermediate matrix is a V vector corresponding to the predicted text of the ith step, the rest vectors are 0, the third intermediate matrix and the first intermediate matrix of the ith step are added to update the first intermediate matrix of the ith step to obtain the first intermediate matrix of the i +1 th step, and the fourth intermediate matrix and the second intermediate matrix of the ith step are added to update the second intermediate matrix of the ith step to obtain the second intermediate matrix of the i +1 th step.
And calculating an attention output result corresponding to the preamble text of the i +1 th step according to the first intermediate matrix and the second intermediate matrix of the i +1 th step and the Q vector corresponding to the predicted text of the i th step, wherein the preamble text of the i +1 th step is a combination of the preamble text of the i th step and the predicted text of the i th step. The formula for the attention output result can be expressed as:
Figure BDA0003430147040000091
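Continuing the sketch above, the following illustrative function performs the step-i+1 update and attention computation for a single head: it builds the third/fourth intermediate matrices holding only the new word's K and V vectors at position m+1, sums them into the cached matrices, and then applies the formula with the new word's Q vector (all names and the single-head simplification are assumptions for the example):

    import numpy as np

    def step_update_and_attend(first, second, q_new, k_new, v_new, m):
        """Update the cached first/second intermediate matrices with the
        step-i predicted word's K/V, then compute its attention output."""
        third = np.zeros_like(first)
        fourth = np.zeros_like(second)
        third[:, m] = k_new       # (m+1)-th vector of the third intermediate matrix
        fourth[:, m] = v_new      # (m+1)-th vector of the fourth intermediate matrix
        first = first + third     # first intermediate matrix of step i+1
        second = second + fourth  # second intermediate matrix of step i+1
        valid = m + 1             # number of real (non-zero) vectors
        scores = q_new @ first[:, :valid] / np.sqrt(q_new.shape[0])
        weights = np.exp(scores - scores.max())
        weights /= weights.sum()  # softmax over the valid positions only
        return first, second, second[:, :valid] @ weights

Restricting the softmax to the first m+1 columns keeps the zero-padded columns of the fixed-size matrices from contributing to the result.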
and S300, generating a predicted text of the step i +1 according to the attention output result corresponding to the preamble text of the step i + 1.
After generating the predicted text of the (i + 1) th step according to the attention output result corresponding to the preamble text of the (i + 1) th step, the method provided by this embodiment further includes the steps of:
and stopping reasoning and outputting a reasoning result when the predicted text of the (i + 1) th step comprises a preset terminator or the total length of the predicted text of the (i + 1) th step and the pre-preamble text of the (i + 1) th step is transmitted to a preset threshold, wherein the reasoning result comprises the pre-preamble text of the (i + 1) th step and the predicted text of the (i + 1) th step, otherwise, performing reasoning of the (i + 2) th step.
The preset terminator can be a preset symbol such as a period, a space and the like. The inference result is a combination of the preceding text and the predicted text of the last step.
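The overall loop can be pictured with the following minimal sketch; model_step stands for one inference step that internally reuses the cached intermediate matrices, and the terminator set and length threshold are assumed example values, not values from the patent:

    def generate(preamble_tokens, model_step, terminators=(".",), max_len=128):
        """Run inference step by step until the predicted word is a preset
        terminator or the total length reaches the preset threshold."""
        tokens = list(preamble_tokens)
        while True:
            next_token = model_step(tokens)   # step i: predict one word
            tokens.append(next_token)         # forms the preamble text of step i+1
            if next_token in terminators or len(tokens) >= max_len:
                return tokens                 # inference result: preamble + predicted text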
Experiments were carried out on an existing large-scale generative language model (the Pengcheng PanGu model) using the method provided in this embodiment; the results are shown in fig. 3. With the fast inference method of this embodiment, the 2.6-billion-parameter model has a single-word generation latency of about 30 ms, compared with an original inference latency of about 120 ms; on the 13-billion-parameter model, the single-word generation latency is about 50 ms, compared with an original inference latency of about 285 ms. Through the intermediate-state multiplexing mechanism, the fast inference method of this embodiment shortens the input fed to the language model at each step from the full fixed sequence length to a single word without losing model precision, effectively reducing the amount of computation in the model inference process and significantly improving the inference speed of large-scale generative language models. The method of this embodiment applies to language models with various decoder architectures, is universal, is suitable for optimizing the inference engine of large-scale online inference services, improves the user experience of the inference service, and has good extensibility.
It should be understood that, although the steps in the flowcharts shown in the figures of the present specification are displayed in the order indicated by the arrows, they are not necessarily performed in that order. Unless explicitly stated otherwise herein, the execution of these steps is not strictly limited to the order shown, and the steps may be performed in other orders. Moreover, at least a portion of the steps in the flowcharts may include multiple sub-steps or multiple stages, which are not necessarily performed at the same moment but may be performed at different moments, and their order of execution is not necessarily sequential; they may be performed in turn or alternately with other steps or with at least a portion of the sub-steps or stages of other steps.
It will be understood by those skilled in the art that all or part of the processes of the methods of the embodiments described above can be implemented by a computer program instructing the relevant hardware; the program can be stored in a non-volatile computer-readable storage medium and, when executed, can include the processes of the embodiments of the methods described above. Any reference to memory, storage, databases, or other media used in the embodiments provided herein may include non-volatile and/or volatile memory. Non-volatile memory can include read-only memory (ROM), programmable ROM (PROM), electrically programmable ROM (EPROM), electrically erasable programmable ROM (EEPROM), or flash memory. Volatile memory can include random access memory (RAM) or external cache memory. By way of illustration and not limitation, RAM is available in a variety of forms such as static RAM (SRAM), dynamic RAM (DRAM), synchronous DRAM (SDRAM), double data rate SDRAM (DDR SDRAM), enhanced SDRAM (ESDRAM), SyncLink DRAM (SLDRAM), Rambus direct RAM (RDRAM), direct Rambus dynamic RAM (DRDRAM), and Rambus dynamic RAM (RDRAM).
Example two
Based on the method provided by the above embodiment, the present invention further provides a fast inference system supporting a high-concurrency large-scale generative language model. As shown in fig. 5, the system includes a front-end device and at least one terminal; the terminal includes a processor and a computer-readable storage medium communicatively connected to the processor, the computer-readable storage medium is adapted to store a plurality of instructions, and the processor is adapted to call the instructions in the computer-readable storage medium to perform the steps of the fast inference method supporting a high-concurrency large-scale generative language model of the first embodiment.
The front-end device is used for receiving an inference request and sending the preamble text corresponding to the inference request to a target terminal among the at least one terminal according to the load of each terminal;
and the terminal is used for performing inference according to the preamble text corresponding to the inference request and returning the inference result to the front-end device.
Specifically, as shown in fig. 6, the front-end device can receive requests sent by each user terminal (client) in an online user pool; a user sends an inference request to the front-end device through the user terminal. The front-end device includes a request receiving module and a queue management module. The request receiving module is used for receiving the inference request and preprocessing the text in the inference request to obtain the preamble text corresponding to the inference request, where the preprocessing can be an operation such as filtering sensitive words from the text of the user's inference request; the request receiving module then sends the preamble text corresponding to the inference request to the queue management module. The queue management module is used for judging whether the current task queue has reached a preset length: if so, it returns prompt information to the sender of the inference request; if not, it sends the preamble text corresponding to the inference request to the target terminal with the lowest load according to the load of each terminal and increases the length of the task queue by one. When the current task queue has reached the preset length, the pending inference requests have reached the predetermined processing upper limit, and prompt information can be returned, for example a "[abnormal response]" character string is returned to the sender of the inference request. When the current task queue has not reached the preset length, the inference request can be sent to one of the terminals for processing. Specifically, the queue management module obtains the load condition of each terminal, determines the terminal with the lowest current load as the target terminal, sends the preamble text corresponding to the inference request to the target terminal, and increases the length of the task queue by one. In this way, high-concurrency load-balancing configurations can be supported, increasing the number of simultaneous online users the large-scale model inference service system can serve. After receiving the preamble text, the target terminal performs inference according to the method provided by the first embodiment and returns the output inference result to the front-end device, which returns it to the sender of the inference request. A sketch of this dispatch logic is given below.
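The following sketch illustrates the dispatch logic (class and attribute names are illustrative assumptions; the patent only states that the system is built on the python-flask framework):

    class QueueManager:
        """Illustrative sketch of the queue management module."""

        def __init__(self, terminals, max_len=64):
            self.terminals = terminals  # terminal objects exposing .load and .infer()
            self.max_len = max_len      # preset task-queue length (assumed value)
            self.queue_len = 0

        def dispatch(self, preamble_text):
            if self.queue_len >= self.max_len:
                return "[abnormal response]"  # prompt information: upper limit reached
            # choose the terminal with the lowest current load as the target terminal
            target = min(self.terminals, key=lambda t: t.load)
            self.queue_len += 1               # task queue length plus one
            result = target.infer(preamble_text)
            self.queue_len -= 1               # reduced by one once the result is received
            return result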
The front-end device further includes a post-processing module, which is used for receiving the inference result output by the terminal, post-processing the inference result and then sending it to the sender of the inference request; post-processing the inference result can include operations such as filtering sensitive words and de-duplication. After the post-processing module receives the inference result from the terminal, the length of the task queue is reduced by one. By preprocessing the inference request and post-processing the inference result output by the terminal, reasonable and compliant input content and model generation results can be effectively ensured.
The system provided by the invention is implemented based on the python-flask framework; the code is lightweight and easy to port.
Example three
Based on the method provided by the above embodiment, the present invention also provides a terminal, as shown in fig. 4, where the terminal includes a processor 10 and a memory 20. Fig. 4 shows only some of the components of the terminal, but it is to be understood that not all of the shown components are required to be implemented, and that more or fewer components may be implemented instead.
The memory 20 may in some embodiments be an internal storage unit of the terminal, such as a hard disk or memory of the terminal. In other embodiments, the memory 20 may also be an external storage device of the terminal, such as a plug-in hard disk, a Smart Media Card (SMC), a Secure Digital (SD) card, or a Flash Card provided on the terminal. Further, the memory 20 may also include both an internal storage unit and an external storage device of the terminal. The memory 20 is used for storing application software installed in the terminal and various data, and may also be used to temporarily store data that has been output or is to be output. In one embodiment, the memory 20 stores a fast inference program 30 supporting a high-concurrency large-scale generative language model, and the program 30 is executable by the processor 10 to implement the fast inference method supporting a high-concurrency large-scale generative language model of the present application.
The processor 10 may in some embodiments be a central processing unit (CPU), microprocessor or other chip, and is used to run the program code stored in the memory 20 or process data, for example to execute the fast inference method supporting a high-concurrency large-scale generative language model.
In one embodiment, when the processor 10 executes the fast inference program 30 supporting a high-concurrency large-scale generative language model in the memory 20, the following steps are implemented:
acquiring the attention intermediate values of the preamble text of step i and the predicted text of step i, and storing the attention intermediate values of the preamble text of step i;
acquiring the attention intermediate value corresponding to the predicted text of step i, and obtaining the attention output result corresponding to the preamble text of step i+1 according to the attention intermediate value corresponding to the predicted text of step i and the attention intermediate values corresponding to the preamble text of step i;
generating the predicted text of step i+1 according to the attention output result corresponding to the preamble text of step i+1;
wherein i is a positive integer.
The attention intermediate values of the preamble text of step i include the Q, K, V vectors corresponding to each word in the preamble text of step i.
The storing of the attention intermediate values of the preamble text of step i includes:
obtaining a first intermediate matrix and a second intermediate matrix of step i according to the K, V vectors of the preamble text of step i, wherein the sizes of the first intermediate matrix and the second intermediate matrix are both a preset size M × N, where M is the dimension of each vector in the matrix and N is the number of vectors;
wherein the first m vectors in the first intermediate matrix and the second intermediate matrix are the K and V vectors, respectively, corresponding to each word in the preamble text of step i, and the vectors after the m-th vector are 0;
and storing the first intermediate matrix and the second intermediate matrix of step i in video memory.
The obtaining of the attention output result corresponding to the preamble text of step i+1 according to the attention intermediate value corresponding to the predicted text of step i and the attention intermediate values corresponding to the preamble text of step i includes:
calculating the Q, K, V vectors corresponding to the predicted text of step i, and storing the K and V vectors corresponding to the predicted text of step i into a third intermediate matrix and a fourth intermediate matrix, respectively, wherein the sizes of the third intermediate matrix and the fourth intermediate matrix are both the preset size, the (m+1)-th vector of the third intermediate matrix is the K vector corresponding to the predicted text of step i, the (m+1)-th vector of the fourth intermediate matrix is the V vector corresponding to the predicted text of step i, and the remaining vectors are 0;
updating the first intermediate matrix and the second intermediate matrix of step i according to the third intermediate matrix and the fourth intermediate matrix to obtain the first intermediate matrix and the second intermediate matrix of step i+1;
and calculating the attention output result corresponding to the preamble text of step i+1 according to the first intermediate matrix and the second intermediate matrix of step i+1 and the Q vector corresponding to the predicted text of step i.
The updating of the first intermediate matrix and the second intermediate matrix of step i according to the third intermediate matrix and the fourth intermediate matrix to obtain the first intermediate matrix and the second intermediate matrix of step i+1 includes:
performing a summation operation on the third intermediate matrix and the first intermediate matrix of step i to obtain the first intermediate matrix of step i+1, and performing a summation operation on the fourth intermediate matrix and the second intermediate matrix of step i to obtain the second intermediate matrix of step i+1.
The acquiring of the attention intermediate values of the preamble text of step i includes:
when i = 1, calculating the Q, K, V vectors corresponding to each word in the preamble text of step i according to a preset first weight matrix, a preset second weight matrix and a preset third weight matrix.
The calculating of the Q, K, V vectors corresponding to each word in the preamble text of step i according to the preset first weight matrix, second weight matrix and third weight matrix includes:
acquiring the embedding vector of each word in the preamble text of step i;
and multiplying the embedding vector of each word by the first weight matrix, the second weight matrix and the third weight matrix, respectively, to obtain the Q, K, V vectors corresponding to each word.
After the predicted text of step i+1 is generated according to the attention output result corresponding to the preamble text of step i+1, the method further includes:
stopping inference and outputting the inference result when the predicted text of step i+1 includes a preset terminator, or when the total length of the predicted text of step i+1 and the preamble text of step i+1 reaches a preset threshold, wherein the inference result includes the preamble text of step i+1 and the predicted text of step i+1; otherwise, performing the inference of step i+2.
Example four
The present invention also provides a computer-readable storage medium storing one or more programs, which are executable by one or more processors to implement the steps of the fast inference method supporting a high-concurrency large-scale generative language model as described above.
Finally, it should be noted that: the above examples are only intended to illustrate the technical solution of the present invention, but not to limit it; although the present invention has been described in detail with reference to the foregoing embodiments, it will be understood by those of ordinary skill in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some technical features may be equivalently replaced; and such modifications or substitutions do not depart from the spirit and scope of the corresponding technical solutions of the embodiments of the present invention.

Claims (13)

1. A fast inference method supporting a high-concurrency large-scale generative language model, characterized by comprising the following steps:
acquiring the attention intermediate values of the preamble text of step i and the predicted text of step i, and storing the attention intermediate values of the preamble text of step i;
acquiring the attention intermediate value corresponding to the predicted text of step i, and obtaining the attention output result corresponding to the preamble text of step i+1 according to the attention intermediate value corresponding to the predicted text of step i and the attention intermediate values corresponding to the preamble text of step i;
generating the predicted text of step i+1 according to the attention output result corresponding to the preamble text of step i+1;
wherein i is a positive integer.
2. The fast inference method supporting a high-concurrency large-scale generative language model according to claim 1, wherein the attention intermediate values of the preamble text of step i comprise the Q, K, V vectors corresponding to each word in the preamble text of step i.
3. The fast inference method supporting a high-concurrency large-scale generative language model according to claim 2, wherein the storing of the attention intermediate values of the preamble text of step i comprises:
obtaining a first intermediate matrix and a second intermediate matrix of step i according to the K, V vectors of the preamble text of step i, wherein the sizes of the first intermediate matrix and the second intermediate matrix are both a preset size M × N, where M is the dimension of each vector in the matrix and N is the number of vectors;
wherein the first m vectors in the first intermediate matrix and the second intermediate matrix are the K and V vectors, respectively, corresponding to each word in the preamble text of step i, and the vectors after the m-th vector are 0;
and storing the first intermediate matrix and the second intermediate matrix of step i in video memory.
4. The fast inference method supporting a high-concurrency large-scale generative language model according to claim 3, wherein the obtaining of the attention output result corresponding to the preamble text of step i+1 according to the attention intermediate value corresponding to the predicted text of step i and the attention intermediate values corresponding to the preamble text of step i comprises:
calculating the Q, K, V vectors corresponding to the predicted text of step i, and storing the K and V vectors corresponding to the predicted text of step i into a third intermediate matrix and a fourth intermediate matrix, respectively, wherein the sizes of the third intermediate matrix and the fourth intermediate matrix are both the preset size, the (m+1)-th vector of the third intermediate matrix is the K vector corresponding to the predicted text of step i, the (m+1)-th vector of the fourth intermediate matrix is the V vector corresponding to the predicted text of step i, and the remaining vectors are 0;
updating the first intermediate matrix and the second intermediate matrix of step i according to the third intermediate matrix and the fourth intermediate matrix to obtain the first intermediate matrix and the second intermediate matrix of step i+1;
and calculating the attention output result corresponding to the preamble text of step i+1 according to the first intermediate matrix and the second intermediate matrix of step i+1 and the Q vector corresponding to the predicted text of step i.
5. The fast inference method supporting a high-concurrency large-scale generative language model according to claim 4, wherein the updating of the first intermediate matrix and the second intermediate matrix of step i according to the third intermediate matrix and the fourth intermediate matrix to obtain the first intermediate matrix and the second intermediate matrix of step i+1 comprises:
performing a summation operation on the third intermediate matrix and the first intermediate matrix of step i to obtain the first intermediate matrix of step i+1, and performing a summation operation on the fourth intermediate matrix and the second intermediate matrix of step i to obtain the second intermediate matrix of step i+1.
6. The fast inference method supporting a high-concurrency large-scale generative language model according to claim 2, wherein the acquiring of the attention intermediate values of the preamble text of step i comprises:
when i = 1, calculating the Q, K, V vectors corresponding to each word in the preamble text of step i according to a preset first weight matrix, a preset second weight matrix and a preset third weight matrix.
7. The fast inference method supporting a high-concurrency large-scale generative language model according to claim 6, wherein the calculating of the Q, K, V vectors corresponding to each word in the preamble text of step i according to the preset first weight matrix, second weight matrix and third weight matrix comprises:
acquiring the embedding vector of each word in the preamble text of step i;
and multiplying the embedding vector of each word by the first weight matrix, the second weight matrix and the third weight matrix, respectively, to obtain the Q, K, V vectors corresponding to each word.
8. The fast inference method supporting a high-concurrency large-scale generative language model according to claim 1, wherein after the predicted text of step i+1 is generated according to the attention output result corresponding to the preamble text of step i+1, the method further comprises:
stopping inference and outputting the inference result when the predicted text of step i+1 comprises a preset terminator, or when the total length of the predicted text of step i+1 and the preamble text of step i+1 reaches a preset threshold, wherein the inference result comprises the preamble text of step i+1 and the predicted text of step i+1; otherwise, performing the inference of step i+2.
9. A terminal, characterized in that the terminal comprises: a processor and a computer-readable storage medium communicatively connected to the processor, the computer-readable storage medium being adapted to store a plurality of instructions, and the processor being adapted to call the instructions in the computer-readable storage medium to perform the steps of the fast inference method supporting a high-concurrency large-scale generative language model according to any one of claims 1 to 8.
10. A fast inference system supporting a high-concurrency large-scale generative language model, characterized in that the system comprises: a front-end device and at least one terminal according to claim 9;
the front-end device is used for receiving an inference request and sending the preamble text corresponding to the inference request to a target terminal among the at least one terminal according to the load of each terminal;
and the terminal is used for performing inference according to the preamble text corresponding to the inference request and returning the inference result to the front-end device.
11. The fast inference system supporting a high-concurrency large-scale generative language model according to claim 10, wherein the front-end device comprises a request receiving module and a queue management module;
the request receiving module is used for receiving the inference request and preprocessing the text in the inference request to obtain the preamble text corresponding to the inference request;
and the queue management module is used for judging whether the current task queue has reached a preset length; if so, returning prompt information to the sender of the inference request, otherwise sending the preamble text corresponding to the inference request to the target terminal with the lowest load according to the load of each terminal and increasing the length of the task queue by one.
12. The fast inference system supporting a high-concurrency large-scale generative language model according to claim 11, wherein the front-end device further comprises a post-processing module;
the post-processing module is used for receiving the inference result output by the terminal, post-processing the inference result and sending the post-processed inference result to the sender of the inference request;
the queue management module is further configured to reduce the length of the task queue by one after the post-processing module receives the inference result from the terminal.
13. A computer-readable storage medium storing one or more programs, the one or more programs being executable by one or more processors to implement the steps of the fast inference method supporting a high-concurrency large-scale generative language model according to any one of claims 1 to 8.
CN202111594472.0A 2021-12-23 2021-12-23 Rapid inference method and system supporting high-concurrency large-scale generative language model Pending CN114385785A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111594472.0A CN114385785A (en) Rapid inference method and system supporting high-concurrency large-scale generative language model

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202111594472.0A CN114385785A (en) Rapid inference method and system supporting high-concurrency large-scale generative language model

Publications (1)

Publication Number Publication Date
CN114385785A true CN114385785A (en) 2022-04-22

Family

ID=81197888

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111594472.0A Pending CN114385785A (en) Rapid inference method and system supporting high-concurrency large-scale generative language model

Country Status (1)

Country Link
CN (1) CN114385785A (en)


Legal Events

PB01 Publication
SE01 Entry into force of request for substantive examination