CN112818663A

CN112818663A - Processing method for language model, text generation method, text generation device and medium

Info

Publication number: CN112818663A
Application number: CN202110057292.2A
Authority: CN
Inventors: 熊鹰; 王晓晖; 陈家泽; 李磊
Original assignee: Beijing Youzhuju Network Technology Co Ltd
Current assignee: Beijing Youzhuju Network Technology Co Ltd
Priority date: 2021-01-15
Filing date: 2021-01-15
Publication date: 2021-05-18
Also published as: WO2022151966A1

Abstract

Embodiments of the disclosure relate to a processing method for a language model, a text generation method, a device and a medium. The language model is deployed in an electronic device, and a plurality of computing operations located between computations of a target type within the computation of the same feature layer of the language model are merged into one fused computing operation. The processing method for the language model comprises the following steps: upon determining that the fused computing operation is to be performed, a CPU of the electronic device sends an operation instruction containing the plurality of computing operations to a GPU; in response to receiving the operation instruction, the GPU processes the plurality of computing operations. This reduces the scheduling overhead between the CPU and the GPU and the overhead of the GPU repeatedly reading and writing video memory while the language model is processed, which improves the computing efficiency of the GPU and of the language model and reduces the latency of text processing based on the language model.

Description

Processing method for language model, text generation method, text generation device and medium

Technical Field

Embodiments of the disclosure relate to the technical field of computers, and in particular to a processing method for a language model, a text generation method, a device and a medium.

Background

In a natural language generation task, a language model is usually adopted for text prediction generation, and the calculation amount of the language model is usually huge, which brings difficulty to actual deployment and execution.

In the related art, a computation-graph approach is usually adopted. However, this approach requires a large number of GPU (Graphics Processing Unit) operators, which causes additional overhead such as operator scheduling and video memory transfers. As a result, after a language model is deployed, text processing based on the language model suffers from high execution latency and low usability.

Disclosure of Invention

This summary is provided to introduce a selection of concepts in a simplified form that are further described below in the detailed description. This summary is not intended to identify key features or essential features of the claimed subject matter, nor is it intended to be used to limit the scope of the claimed subject matter.

The embodiment of the disclosure provides a processing method, a text generation method, a device and a medium for a language model.

In a first aspect, an embodiment of the present disclosure provides a processing method for a language model, where the language model is deployed in an electronic device, and multiple computing operations between computations of a target type in computations of a same feature layer of the language model are merged into one fused computing operation, the method including:

upon determining that the fused computing operation is to be performed, a Central Processing Unit (CPU) of the electronic device sending an operation instruction containing the plurality of computing operations to a GPU;

in response to receiving the operation instruction, the GPU processes the plurality of computing operations.

In a second aspect, an embodiment of the present disclosure provides a text generation method, where the method includes:

receiving a text to be processed;

inputting the text to be processed into a language model, and obtaining a next candidate character corresponding to the text to be processed and probability information corresponding to each candidate character, wherein the language model is deployed in an electronic device, and a plurality of computing operations between target type computing in computing of the same feature layer of the language model are combined into a fused computing operation, and the fused computing operation is executed in a manner that a CPU of the electronic device sends an operating instruction containing the computing operations to a GPU to process the computing operations;

ranking probability information corresponding to each candidate character, and determining a plurality of target characters from the candidate characters based on a ranking result;

and respectively splicing each target character at the end of the text to be processed to obtain a plurality of spliced texts so as to obtain a target text corresponding to the text to be processed, wherein the target text is a text finally generated based on the text to be processed.

In a third aspect, an embodiment of the present disclosure provides a processing apparatus for a language model, where the language model is deployed in an electronic device, and multiple computing operations between computations of a target type in computations of a same feature layer of the language model are merged into one fused computing operation, the apparatus including:

a sending module, configured to send, to a GPU, an operation instruction including the plurality of computing operations when it is determined that the fused computing operation is to be executed;

and a processing module, configured to cause the GPU to process the plurality of computing operations in response to receiving the operation instruction.

In a fourth aspect, an embodiment of the present disclosure provides a text generation apparatus, where the apparatus includes:

the receiving module is used for receiving the text to be processed;

the input module is used for inputting the text to be processed into a language model, and obtaining a next candidate character corresponding to the text to be processed and probability information corresponding to each candidate character, wherein the language model is deployed in an electronic device, and a plurality of computing operations between target type computing in computing of the same feature layer of the language model are combined into one fused computing operation, and the fused computing operation is executed in a manner that a CPU of the electronic device sends an operation instruction containing the computing operations to a GPU and the GPU processes the computing operations;

the sorting module is used for sorting the probability information corresponding to each candidate character and determining a plurality of target characters from the candidate characters based on the sorting result;

and the splicing module is used for splicing each target character at the end of the text to be processed respectively to obtain a plurality of spliced texts so as to obtain a target text corresponding to the text to be processed, wherein the target text is a text finally generated based on the text to be processed.

In a fifth aspect, the disclosed embodiments provide a computer readable medium, on which a computer program is stored, which when executed by a processing apparatus implements the steps of the method of the first aspect, or which when executed by a processing apparatus implements the steps of the method of the second aspect.

In a sixth aspect, an embodiment of the present disclosure provides an electronic device, including:

a storage device having a computer program stored thereon;

processing means for executing the computer program in the storage means to carry out the steps of the method of the first aspect or to carry out the steps of the method of the second aspect.

In the above technical solution, when a language model is deployed in an electronic device, a plurality of computing operations between target types of computing in computing of the same feature layer of the language model are merged into one fused computing operation, so that when it is determined that the fused computing operation is to be executed, a CPU of the electronic device sends an operation instruction including the plurality of computing operations to a GPU; in response to receiving the operation instruction, the GPU processes the plurality of computing operations. Therefore, by the technical scheme, the scheduling overhead between the CPU and the GPU and the repeated read-write overhead of the GPU on the video memory in the processing process of the language model can be effectively reduced, so that the calculation efficiency of the GPU can be effectively improved, the calculation efficiency of the language model is improved, the delay of text processing based on the language model is effectively reduced, and support is provided for ensuring the real-time performance of processing based on the language model.

Additional features and advantages of the disclosure will be set forth in the detailed description which follows.

Drawings

The above and other features, advantages and aspects of various embodiments of the present disclosure will become more apparent by referring to the following detailed description when taken in conjunction with the accompanying drawings. Throughout the drawings, the same or similar reference numbers refer to the same or similar elements. It should be understood that the drawings are schematic and that elements and features are not necessarily drawn to scale. In the drawings:

FIG. 1 is a flow diagram of a processing method for a language model provided in accordance with one embodiment of the present disclosure;

FIG. 2A is a schematic flow chart of a GPU computing operation in the related art;

FIG. 2B is a schematic flow chart diagram illustrating a GPU computing operation provided in accordance with one embodiment of the present disclosure;

FIG. 3 is a flow diagram of a text generation method provided in accordance with one embodiment of the present disclosure;

FIG. 4 is a flow diagram of an exemplary implementation of sorting probability information corresponding to each candidate character and determining a plurality of target characters from the candidate characters based on a result of the sorting according to one embodiment of the present disclosure;

FIG. 5 is a block diagram of a processing device for a language model provided in accordance with one embodiment of the present disclosure;

FIG. 6 is a block diagram of a text generation apparatus provided in accordance with one embodiment of the present disclosure;

FIG. 7 illustrates a schematic diagram of an electronic device suitable for use in implementing embodiments of the present disclosure.

Detailed Description

Embodiments of the present disclosure will be described in more detail below with reference to the accompanying drawings. While certain embodiments of the present disclosure are shown in the drawings, it is to be understood that the present disclosure may be embodied in various forms and should not be construed as limited to the embodiments set forth herein, but rather are provided for a more thorough and complete understanding of the present disclosure. It should be understood that the drawings and embodiments of the disclosure are for illustration purposes only and are not intended to limit the scope of the disclosure.

It should be understood that the various steps recited in the method embodiments of the present disclosure may be performed in a different order, and/or performed in parallel. Moreover, method embodiments may include additional steps and/or omit performing the illustrated steps. The scope of the present disclosure is not limited in this respect.

The term "include" and variations thereof as used herein are open-ended, i.e., "including but not limited to". The term "based on" is "based, at least in part, on". The term "one embodiment" means "at least one embodiment"; the term "another embodiment" means "at least one additional embodiment"; the term "some embodiments" means "at least some embodiments". Relevant definitions for other terms will be given in the following description.

It should be noted that the terms "first", "second", and the like in the present disclosure are only used for distinguishing different devices, modules or units, and are not used for limiting the order or interdependence relationship of the functions performed by the devices, modules or units.

It is noted that references to "a", "an", and "the" modifications in this disclosure are intended to be illustrative rather than limiting, and that those skilled in the art will recognize that "one or more" may be used unless the context clearly dictates otherwise.

The names of messages or information exchanged between devices in the embodiments of the present disclosure are for illustrative purposes only, and are not intended to limit the scope of the messages or information.

The present disclosure provides a processing method for a language model. The language model is deployed in an electronic device, and a plurality of computing operations located between computations of a target type within the computation of the same feature layer of the language model are merged into one fused computing operation. In one possible embodiment, the computation of the target type may be matrix multiplication. In a feature layer of a language model, a series of computing operations usually has to be performed to obtain the calculation result output by the feature layer. In this embodiment, the computing operations in that series that lie between two matrix multiplications may be combined into one fused computing operation, i.e., multiple computing operations are combined into one operator. As another example, the target types of computation may be matrix multiplication and matrix dot multiplication; e.g., multiple computing operations between a matrix multiplication and a matrix dot multiplication may be combined into one fused computing operation, or multiple computing operations between two matrix multiplications may be combined into one fused computing operation. The target type can be preset according to the actual usage scenario, so that the corresponding fused computing operation can be preset according to the target type.
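The grouping described above can be illustrated with a minimal sketch. It assumes a feature layer is given as an ordered list of operation names and that matrix multiplication is the target type; the function name `fuse_between_targets` and the operation names are hypothetical and do not come from the disclosure. Operations after the last target-type computation are also grouped here, which is a simplification made for illustration.

```python
from typing import List, Set

def fuse_between_targets(ops: List[str], target_types: Set[str]) -> List[List[str]]:
    """Group consecutive non-target ops into one fused group.

    Each target-type op (e.g. a matrix multiplication) stays as its own
    operator; every run of ops between two target ops becomes one group.
    """
    schedule: List[List[str]] = []
    pending: List[str] = []
    for op in ops:
        if op in target_types:
            if pending:
                schedule.append(pending)  # close the fused group before the target op
                pending = []
            schedule.append([op])
        else:
            pending.append(op)
    if pending:
        schedule.append(pending)  # trailing ops after the last target op
    return schedule

# Hypothetical op sequence of one feature layer (names are illustrative only).
layer_ops = ["matmul", "add_bias", "gelu", "dropout", "matmul", "add_residual", "layer_norm"]
schedule = fuse_between_targets(layer_ops, {"matmul"})
for group in schedule:
    print(group)
```

In this sketch the seven operations are issued as four operators instead of seven, because the three operations between the two matrix multiplications travel in one group.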

Fig. 1 is a flowchart illustrating a processing method for a language model according to an embodiment of the present disclosure. The method includes the following steps:

in step 11, when determining that the fusion computing operation is about to be executed, the CPU of the electronic device sends an operation instruction containing a plurality of computing operations to the GPU;

in step 12, in response to receiving the operation instruction, the GPU processes the plurality of computing operations.

In the related art, when a calculation operation in a language model is executed, each calculation operation is usually packaged individually as one operator. The CPU sends the operator to the GPU, the GPU reads data from the video memory into the corresponding registers and performs the calculation, and then writes the calculation result back to the video memory. Illustratively, calculation operations B1, B2 and B3 lie between matrix multiplication A1 and matrix multiplication A2 in the same feature layer, and the calculation process is as shown in FIG. 2A.

Based on the solutions in the related art, the CPU packages the calculation operations B1, B2 and B3 as 3 separate GPU calculation operations. That is, the CPU first sends an operation instruction corresponding to B1 to the GPU, and the GPU reads the data corresponding to B1 from the video memory into registers, performs the calculation, and writes the result back to the video memory. The CPU then sends an operation instruction corresponding to B2 to the GPU, and the GPU reads the data corresponding to B2 from the video memory into registers, performs the calculation, and writes the result back to the video memory. Finally, the CPU sends an operation instruction corresponding to B3 to the GPU, and the GPU reads the data corresponding to B3 from the video memory into registers, performs the calculation, and writes the result back to the video memory. In this process, the data calculated by B1 cannot be used directly in the calculation of B2; it has to be written back to the video memory first and then read from the video memory again. Sending one operation instruction per operation increases the scheduling overhead of the CPU, and the repeated reads from and writes to the video memory between GPU calculation operations increase the overhead of the GPU repeatedly reading and writing the video memory.

In the technical solution of the embodiment of the present disclosure, the computing operations B1, B2 and B3 may be merged into one fused computing operation B. The CPU can therefore send a single operation instruction to the GPU, where the operation instruction contains the fused computing operation B, that is, the operations B1, B2 and B3. The GPU can thus execute a plurality of computing operations with one scheduling by the CPU, and needs to read from and write back to the video memory only once while executing them, which reduces the number of video memory reads and writes.
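The difference between the two flows can be made concrete with a small CPU-side simulation. This is only a sketch under stated assumptions: `VideoMemory` is a hypothetical counter standing in for GPU video memory, and `b1`, `b2`, `b3` are arbitrary element-wise functions chosen for illustration, not the operations of any particular model.

```python
class VideoMemory:
    """Toy stand-in for GPU video memory that counts reads and writes."""

    def __init__(self, data):
        self.data = dict(data)
        self.reads = 0
        self.writes = 0

    def read(self, key):
        self.reads += 1
        return self.data[key]

    def write(self, key, value):
        self.writes += 1
        self.data[key] = value

# Illustrative element-wise operations standing in for B1, B2 and B3.
def b1(x): return [v + 1.0 for v in x]
def b2(x): return [v * 2.0 for v in x]
def b3(x): return [max(v, 0.0) for v in x]

def run_unfused(mem, ops):
    """One operation instruction per op; every op round-trips through video memory."""
    launches = 0
    for op in ops:
        launches += 1
        x = mem.read("x")       # load operands into "registers"
        mem.write("x", op(x))   # write the intermediate result back
    return launches

def run_fused(mem, ops):
    """One operation instruction for all ops; intermediates stay in "registers"."""
    launches = 1
    x = mem.read("x")
    for op in ops:
        x = op(x)
    mem.write("x", x)
    return launches

ops = [b1, b2, b3]
unfused_mem = VideoMemory({"x": [-2.0, 0.5]})
fused_mem = VideoMemory({"x": [-2.0, 0.5]})
unfused_launches = run_unfused(unfused_mem, ops)
fused_launches = run_fused(fused_mem, ops)
print(unfused_launches, unfused_mem.reads, unfused_mem.writes)
print(fused_launches, fused_mem.reads, fused_mem.writes)
```

Both variants produce the same result; the fused variant needs one instruction, one read and one write where the unfused variant needs three of each.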

Thus, in the above technical solution, when a language model is deployed in an electronic device, multiple computing operations between target types of computing in computing of the same feature layer of the language model are merged into one fused computing operation, so that when it is determined that the fused computing operation is to be executed, a CPU of the electronic device sends an operation instruction including the multiple computing operations to a GPU; in response to receiving the operation instruction, the GPU processes the plurality of computing operations. Therefore, by the technical scheme, the scheduling overhead between the CPU and the GPU and the repeated read-write overhead of the GPU on the video memory in the processing process of the language model can be effectively reduced, so that the calculation efficiency of the GPU can be effectively improved, the calculation efficiency of the language model is improved, the delay of text processing based on the language model is effectively reduced, and support is provided for ensuring the real-time performance of processing based on the language model.

When the GPU processes a calculation operation, time is consumed not only by the actual computation but usually also by data transfers in the video memory, which affects the processing efficiency of the language model. Accordingly, the present disclosure also provides the following embodiments.

In a possible embodiment, the video memory space corresponding to the language model is predetermined by:

determining, according to a preset text length and parameter information in the language model, the storage space usage required for text processing by the language model. The parameter information includes the data length of the model parameters used for computation in the language model; for example, a model parameter may be a weight parameter used in the computation of a feature layer. The parameter information may further include the data length of the calculation result obtained by computing with the model parameters. For example, feature layers that contain parameters may include a convolutional layer, a fully connected layer, a BatchNorm layer, an Embedding layer and the like, and the data length of the calculation result of each such feature layer can be determined. In the language model, the standard data length of a calculation result obtained from the model parameters is usually fixed; that is, the calculation result differs for different input data, but its data length is never greater than the standard data length corresponding to the model parameters.

The preset text length is the preset maximum length of input text that the language model can accept. It can be determined according to the actual usage scenario and is not limited in the embodiment of the present disclosure. After the preset text length and the parameter information are determined, the corresponding space usage can be determined according to the encoding used. Illustratively, if a parameter matrix has size 10 × 10 and is stored in fp32, i.e., each element is encoded with 4 bytes, the space usage of the parameter matrix is 400 (10 × 10 × 4) bytes. As an example, the sum of the space usage corresponding to a text of the preset text length and the space usage corresponding to each piece of parameter information in the language model may be determined as the storage space usage.
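The arithmetic above can be sketched as follows. The helper names `tensor_bytes` and `storage_usage` are hypothetical, and the sketch assumes that a text of the preset length is stored as a `max_text_len × hidden_size` matrix; the disclosure does not specify how the text is represented.

```python
BYTES_PER_ELEMENT = {"fp32": 4, "fp16": 2}

def tensor_bytes(shape, dtype="fp32"):
    """Bytes needed for a tensor: number of elements times bytes per element."""
    count = 1
    for dim in shape:
        count *= dim
    return count * BYTES_PER_ELEMENT[dtype]

def storage_usage(max_text_len, hidden_size, param_shapes, dtype="fp32"):
    """Sum of the space for the longest accepted text and for every parameter tensor.

    Assumption: the text occupies a (max_text_len, hidden_size) matrix.
    """
    text_bytes = tensor_bytes((max_text_len, hidden_size), dtype)
    param_bytes = sum(tensor_bytes(shape, dtype) for shape in param_shapes)
    return text_bytes + param_bytes

matrix_bytes = tensor_bytes((10, 10), "fp32")          # the 10 x 10 fp32 example
total_bytes = storage_usage(8, 10, [(10, 10), (10,)])  # illustrative sizes only
print(matrix_bytes, total_bytes)
```

The first value reproduces the 400-byte figure from the text; the second adds a hypothetical 8 × 10 text buffer (320 bytes) and a 10-element bias vector (40 bytes).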

Then, a storage space whose size equals the storage space usage is requested from the video memory of the electronic device and used as the video memory space. Requesting space in the video memory is a conventional operation in the art and is not described herein again.

Therefore, with this technical scheme, all of the video memory that the language model needs during computation can be determined in advance and requested up front. Video memory no longer has to be requested dynamically while the language model is computing, which effectively reduces the extra overhead of dynamically requesting or reclaiming video memory during computation, further improves the operating efficiency of the GPU, and improves the processing efficiency of the language model.
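A minimal sketch of this idea is a buffer that is requested once and then sub-divided by offset. The class `PreallocatedArena` is hypothetical, and a Python `bytearray` stands in for video memory; a real deployment would request device memory through the GPU runtime instead.

```python
class PreallocatedArena:
    """Toy arena: one up-front request, then offset-based sub-allocation."""

    def __init__(self, total_bytes: int) -> None:
        self.buffer = bytearray(total_bytes)  # the only allocation ever made
        self.offset = 0
        self.regions = {}

    def reserve(self, name: str, nbytes: int) -> memoryview:
        """Hand out the next nbytes of the arena without allocating new memory."""
        if self.offset + nbytes > len(self.buffer):
            raise MemoryError(f"arena exhausted while reserving {name!r}")
        start = self.offset
        self.offset += nbytes
        self.regions[name] = (start, nbytes)
        return memoryview(self.buffer)[start:start + nbytes]

# Illustrative sizes: the total is computed before the model runs.
arena = PreallocatedArena(total_bytes=760)
weights = arena.reserve("weights", 400)
activations = arena.reserve("activations", 360)
print(arena.regions)
```

Because the total is known before inference starts, every later `reserve` call is only bookkeeping, which mirrors the removal of dynamic requests during computation.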

In the embodiment of the present disclosure, requesting the video memory in advance avoids dynamic requests during computation. Further, to avoid the reduction in video memory utilization that would result from requesting too much video memory, the present disclosure also provides the following embodiments.

In a possible embodiment, the language model includes a plurality of iterative feature layers for performing iterative computations, and a computation result obtained by performing iterative computations by each iterative feature layer corresponds to a same memory address in the video memory space. The calculation result obtained by performing iterative calculation on the iterative feature layer may represent a data result obtained by performing calculation on the iterative feature layer based on the input data of the iterative feature layer and the model parameter in the iterative feature layer.

A language model usually contains multiple feature layers that perform iterative computation. For example, GPT and Transformer models contain multiple Transformer layers stacked together to form the language model, where the input of the N-th feature layer is the output of the (N-1)-th feature layer, and every feature layer performs the same computing operations, except that the model parameters used by each feature layer may differ. Therefore, in the embodiment of the present disclosure, the storage spaces for the calculation results of the plurality of feature layers may be reused by one another, so as to maximize the utilization of the video memory. In the disclosure, the calculation result obtained by iterative computation in each iterative feature layer corresponds to the same storage address in the video memory space. That is, the calculation result of the N-th feature layer can be stored directly at the storage location of the calculation result of the (N-1)-th layer, overwriting it, which effectively improves the utilization of the video memory without affecting the accuracy of the data during computation. Correspondingly, when the storage space usage is determined, the space usage for the calculation results of the multiple iterative feature layers may be the space usage for any one of those calculation results, so that the calculation results of N layers occupy the video memory space of only 1 layer. This keeps the video memory space requested in advance accurate, avoids the resource waste caused by requesting too much video memory space, and improves both the access speed and the video memory utilization.
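The overwrite-in-place behaviour can be sketched with a toy stack of layers. The functions `result_bytes` and `run_stacked_layers` are hypothetical, and each "layer" here is a simple element-wise update, chosen so that overwriting the previous result is trivially safe; real Transformer layers are more involved.

```python
def result_bytes(num_layers: int, bytes_per_result: int, reuse: bool) -> int:
    """Space for layer results: one slot if layers share an address, else one per layer."""
    return bytes_per_result if reuse else num_layers * bytes_per_result

def run_stacked_layers(inputs, layer_scales):
    """Run layers whose output feeds the next layer, all writing to one buffer."""
    buffer = list(inputs)  # single result buffer shared by every layer
    for scale in layer_scales:  # one (illustrative) parameter per layer
        for i in range(len(buffer)):
            # The result of layer N overwrites the result of layer N-1.
            buffer[i] = buffer[i] * scale + 1.0
    return buffer

outputs = run_stacked_layers([1.0, 2.0], [2.0, 3.0])
print(outputs)
print(result_bytes(12, 4096, reuse=False), result_bytes(12, 4096, reuse=True))
```

With 12 hypothetical layers and 4096 bytes per result, sharing one address cuts the result storage from 49152 bytes to 4096 bytes.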

In a possible embodiment, the language model includes an encoder module and a decoder module, and the calculation results of the encoder module and the decoder module correspond to the same memory address in the video memory space. The calculation result of the encoder module is used for representing a data result obtained by calculation in the encoder module based on the input data of the encoder module and the model parameters in the encoder module. Likewise, the calculation results of a decoder module are used to represent the data results of calculations made in the decoder module based on the input data to the decoder module and the model parameters in the decoder module.

In this embodiment, the language model includes an encoder module and a decoder module, and since the calculation of the encoder module is performed only once, the decoder module can directly reuse the video memory space of the encoder module when iteratively calculating the probability information of the language model, that is, the calculation result of the decoder module is directly stored into the storage space corresponding to the calculation result of the encoder module, so as to implement the covering storage of data, thereby implementing that the calculation results of the two modules are stored only in the video memory space corresponding to the calculation result of one module, further improving the memory access speed of the video memory, and simultaneously ensuring the effective utilization of the video memory.

In a possible embodiment, the output results of the language model may be sorted based on multiple parallel processing threads in the GPU, so that sampling of the output results of the language model may be accelerated by repeatedly using the parallel threads, thereby further improving the processing efficiency of the language model and providing accurate data support for subsequent processing of the output results based on the language model.
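The disclosure does not spell out the sorting scheme, so the following is only a CPU-side analogy of one common pattern: split the output probabilities into chunks, let each worker find its local top-k, then merge the partial results. The function `parallel_top_k` is hypothetical, and Python threads stand in for GPU threads.

```python
import heapq
from concurrent.futures import ThreadPoolExecutor

def parallel_top_k(probs, k, num_workers=4):
    """Return the k (index, probability) pairs with the highest probability."""
    indexed = list(enumerate(probs))
    chunk = max(1, (len(indexed) + num_workers - 1) // num_workers)
    chunks = [indexed[i:i + chunk] for i in range(0, len(indexed), chunk)]

    def local_top_k(part):
        # Each worker only ranks its own slice of the vocabulary.
        return heapq.nlargest(k, part, key=lambda pair: pair[1])

    with ThreadPoolExecutor(max_workers=num_workers) as pool:
        partial = list(pool.map(local_top_k, chunks))

    # Merge step: the global top-k must be among the local top-k lists.
    merged = [pair for part in partial for pair in part]
    return heapq.nlargest(k, merged, key=lambda pair: pair[1])

top2 = parallel_top_k([0.05, 0.4, 0.1, 0.25, 0.2], k=2, num_workers=2)
print(top2)
```

The merge is correct because any element of the global top-k is necessarily in the top-k of its own chunk.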

In a possible embodiment, in order to further improve the adaptability to the language model, in the deployment phase of the language model, the model parameters of the language model may be structurally represented by a protobuf protocol, so that the model structure and the parameter expression may be normalized. Therefore, standard language model structures and parameter expressions can be provided, and language models under different training frameworks can be compatible for use.

The present disclosure also provides a text generation method. Fig. 3 is a flowchart of a text generation method provided according to an embodiment of the present disclosure. The method may include:

in step 31, a text to be processed is received, wherein the text to be processed may be a text input by a user as the basis for text generation.

In step 32, a text to be processed is input into a language model, and a next candidate character corresponding to the text to be processed and probability information corresponding to each candidate character are obtained, wherein the language model is deployed in an electronic device, and a plurality of computing operations between target-type computing in computing of the same feature layer of the language model are combined into a fused computing operation, and the fused computing operation is executed by sending an operating instruction containing the plurality of computing operations to a GPU by a CPU of the electronic device so as to process the plurality of computing operations by the GPU. The processing method of the language model is described in detail above, and is not described herein again.

In this step, the text to be processed may be input into the language model, so that the next character of the text to be processed may be predicted by performing calculation processing on a plurality of feature layers in the language model, and thus the next candidate character corresponding to the text to be processed and the probability information corresponding to each candidate character are obtained.

In step 33, the probability information corresponding to each candidate character is sorted, and a plurality of target characters are determined from the candidate characters based on the sorting result.

For example, all pieces of probability information may be ranked together in descending order, and the candidate characters corresponding to the top M pieces of probability information may be determined as the target characters.

In step 34, each target character is spliced at the end of the text to be processed, so as to obtain a plurality of spliced texts, so as to obtain a target text corresponding to the text to be processed, where the target text is a text finally generated based on the text to be processed.

Exemplarily, the text to be processed is abcde. If the determined target characters are f₁, f₂ and f₃, each target character can be spliced to the text to be processed, giving 3 spliced texts, namely abcdef₁, abcdef₂ and abcdef₃. These 3 spliced texts may be determined as target texts generated based on the text to be processed.
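Steps 33 and 34 can be sketched in a few lines. The function `select_and_splice` and the candidate probabilities are hypothetical; the sketch assumes the language model output is available as a mapping from candidate character to probability.

```python
def select_and_splice(text, candidates, m):
    """Rank candidates by probability (descending), keep the top m, splice each onto text."""
    ranked = sorted(candidates.items(), key=lambda item: item[1], reverse=True)
    targets = [char for char, _ in ranked[:m]]
    return [text + char for char in targets]

# Illustrative probabilities for the next character after "abcde".
candidates = {"f": 0.5, "g": 0.3, "h": 0.15, "z": 0.05}
spliced = select_and_splice("abcde", candidates, m=3)
print(spliced)
```

With M = 3, the three most probable candidates each yield one spliced text, matching the three-text example above.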

Therefore, according to the technical scheme, in the process of text generation based on the text to be processed, the text to be processed is calculated based on the language model, and a plurality of calculation operations between the calculation of the target type in the calculation of the same characteristic layer of the language model are combined into one fused calculation operation, so that the scheduling overhead between the CPU and the GPU and the repeated read-write overhead of the GPU on the video memory in the processing process of the language model can be effectively reduced, the calculation efficiency of the GPU can be effectively improved, the calculation efficiency of the language model can be further improved, the delay of text processing based on the language model can be effectively reduced, the real-time performance of the output result of the language model can be improved, the efficiency and the accuracy of text generation can be improved, the requirement of online real-time use of a user can be met, and the use experience of the user can be improved.

In one possible embodiment, the method may further comprise:

for each spliced text, determining whether the spliced text meets the text generation requirement. The text generation requirement may be preset, and the present disclosure does not limit this. In one possible embodiment, the spliced text may be determined to satisfy the text generation requirement when it satisfies any one of the following conditions:

The first condition is: the last character in the spliced text is a terminator, where the terminator may be a symbol representing the end of a sentence, such as a period, a question mark or an exclamation mark, and may be set according to the actual usage scenario. These are merely examples and do not limit the present disclosure.

The second condition is: the length of the spliced text reaches the termination length, where the maximum length of the spliced text may be set as the termination length, so as to avoid generating an overlong spliced text that would be inconvenient for the user.

And under the condition that the spliced text meets the text generation requirement, determining the spliced text as the target text.

When the spliced text meets the text generation requirement, the generated spliced text meets the use requirement of the user, and at the moment, the spliced text can be used as a target text so as to prompt the user subsequently.

When the spliced text does not meet the text generation requirement, the spliced text is taken as a new text to be processed, and the method is re-executed from step 32 (inputting the received text to be processed into the language model to obtain the next candidate characters corresponding to the text to be processed and the probability information corresponding to each candidate character) through step 34 (splicing each target character at the end of the text to be processed to obtain a plurality of spliced texts), followed by the step of determining, for each spliced text, whether the spliced text meets the text generation requirement.

In this embodiment, for each spliced text, when the spliced text does not satisfy the text generation requirement, prediction and splicing need to be performed continuously based on the spliced text, so that the above steps can be repeatedly performed to further obtain a plurality of target texts, thereby providing more options of the target texts for a user, meeting the use requirements of the user, and improving the use experience of the user.
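The repeated prediction, sorting and splicing can be sketched as follows. This is a minimal illustration under stated assumptions: `toy_model` is a hypothetical stand-in for the language model, and the terminator set, the termination length of 8 and the parameter `top_k` are values chosen only for the example.

```python
from typing import Callable, Dict, List

TERMINATORS = {".", "?", "!"}  # assumed terminator set
MAX_LENGTH = 8                 # assumed termination length


def meets_requirement(text: str) -> bool:
    """Return True if either termination condition holds."""
    return text[-1] in TERMINATORS or len(text) >= MAX_LENGTH


def generate(text: str,
             predict: Callable[[str], Dict[str, float]],
             top_k: int) -> List[str]:
    """Repeat predict, sort and splice until every spliced text terminates."""
    pending, targets = [text], []
    while pending:
        current = pending.pop()
        probs = predict(current)                              # step 32
        ranked = sorted(probs, key=probs.get, reverse=True)   # step 33
        for ch in ranked[:top_k]:
            spliced = current + ch                            # step 34
            if meets_requirement(spliced):
                targets.append(spliced)   # becomes a target text
            else:
                pending.append(spliced)   # new text to be processed
    return targets


def toy_model(text: str) -> Dict[str, float]:
    """Hypothetical model: same candidate distribution for any input."""
    return {"f": 0.5, ".": 0.3, "g": 0.2}
```

Calling `generate("abcde", toy_model, top_k=2)` yields several target texts, each of which either ends in a terminator or has reached the termination length.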

In a possible embodiment, the video memory space corresponding to the language model is predetermined in the following manner:

determining the storage space usage amount corresponding to text processing of the language model according to a preset text length and parameter information in the language model, wherein the parameter information comprises the data length of the model parameters used for calculation in the language model and the data length of the calculation results corresponding to calculation based on the model parameters;

and applying a storage space with the size of the storage space usage amount from the video memory of the electronic equipment as the video memory space.
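One way such a usage amount might be estimated is sketched below. The source names only the inputs (preset text length, data length of the parameters, data length of the results); the formula, the `LayerInfo` structure and the assumption that result size grows linearly with text length are illustrative choices, not the method of the disclosure.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class LayerInfo:
    """Hypothetical per-layer description used only for this estimate."""
    param_elems: int             # number of model-parameter elements
    result_elems_per_token: int  # result elements produced per character


def storage_usage_bytes(layers: List[LayerInfo],
                        max_text_len: int,
                        bytes_per_elem: int = 4) -> int:
    """Estimate the bytes to request once, before any text is processed."""
    param_bytes = sum(l.param_elems for l in layers) * bytes_per_elem
    # Assumption: results scale linearly with the preset text length.
    result_bytes = (sum(l.result_elems_per_token for l in layers)
                    * max_text_len * bytes_per_elem)
    return param_bytes + result_bytes
```

The returned value would then be the size requested from the video memory in a single allocation.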

In a possible embodiment, the language model includes a plurality of iterative feature layers for performing iterative computations, and a computation result of each iterative feature layer performing iterative computations corresponds to a same memory address in the video memory space;

accordingly, the method further comprises:

in the process of calculating the language model, for each iteration feature layer, when the iteration feature layer obtains a calculation result, the calculation result is stored in the space indicated by the storage address, so as to overwrite the current content stored in the space indicated by the storage address.
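The overwriting of a shared storage address can be illustrated with a plain Python list standing in for the pre-allocated video memory block. This is a simplified sketch: the layers are assumed to be elementwise functions, and `run_iterative_layers` is a hypothetical name.

```python
from typing import Callable, List


def run_iterative_layers(x: List[float],
                         layers: List[Callable[[float], float]]) -> List[float]:
    """Apply each layer in turn, writing every result into one shared buffer."""
    buffer = list(x)  # allocated once; stands in for the shared address
    for layer in layers:
        for i in range(len(buffer)):
            # The new result overwrites the previous layer's result in place.
            buffer[i] = layer(buffer[i])
    return buffer
```

Because each layer's output replaces its input, the buffer never grows with the number of layers.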

In a possible embodiment, the language model includes an encoder module and a decoder module, and the calculation results of the encoder module and the decoder module correspond to the same memory address in the video memory space;

accordingly, the method further comprises:

in the process of performing calculation by the language model, in the case that the decoder module obtains a calculation result, the calculation result is stored in the space indicated by the storage address so as to overwrite the calculation result of the encoder module stored in the space indicated by the storage address.

The specific implementation of the above process has been described in detail above, and is not described herein again.

By the technical scheme, the access efficiency and the utilization rate of the video memory in the text generation process can be further improved, so that the text generation efficiency is further improved, and the real-time performance of text generation is ensured.

In one possible embodiment, in step 33, the probability information corresponding to each candidate character is sorted, and an exemplary implementation manner of determining a plurality of target characters from the candidate characters based on a result of the sorting is as follows, as shown in fig. 4, and this step may include:

In step 41, the probability information of the candidate characters is evenly distributed to a plurality of processing threads of the GPU, so that each processing thread sorts the probability information assigned to it. An even distribution algorithm may be used to distribute the probability information to the processing threads. Each processing thread only needs to sort its own share of the probability information, which effectively reduces the time required for sorting. Each processing thread may use an existing sorting algorithm to sort its probability information, which is not described herein again.

In step 42, a first preset number of probability information is obtained from the probability information in each processing thread according to the respective sorting result of each processing thread, and the probability information is used as candidate probability information.

In step 41, each processing thread may rank the probability information corresponding to the processing thread according to the order of the probability information from large to small, and then in step 42, for each processing thread, the probability information of the first preset number in the ranking result of the thread may be determined as candidate probability information, that is, the higher probability information may be concurrently selected from the probability information corresponding to each processing thread as candidate probability information, so as to ensure the accuracy of the target character determined based on the candidate probability information subsequently.

In step 43, the candidate probability information is sorted in descending order, and the candidate characters corresponding to the probability information ranked within the top second preset number are determined as the target characters. The product of the first preset number and the number of processing threads is greater than or equal to the second preset number, that is, the total number of candidate probability information items taken from all processing threads is at least the second preset number, so as to ensure that enough target characters can be obtained.

In this step, after the candidate probability information is taken out from each processing thread, the ranking may be re-performed based on the plurality of candidate probability information to further determine the target character based on the ranking result. In the process, sorting may be performed based on a fast register algorithm between the processing threads to determine the second preset number of probability information.

For example, the first preset number may be smaller than the second preset number, so that the efficiency of selecting data of each processing thread may be improved to some extent, and the efficiency of sorting the multiple candidate probability information may be improved, so as to further improve the efficiency of determining the target character.

Through the technical scheme, the probability information can be sequenced through the multiple parallel processing threads, so that the sequencing efficiency of the probability information can be effectively improved, the determination efficiency of the target characters is improved, and meanwhile, the accuracy of the determined target characters can be ensured to a certain extent. In addition, when the candidate probability information is sequenced, the candidate probability information corresponding to each processing thread has orderliness, so that the efficiency of sequencing the candidate probability information can be further improved, and the processing efficiency and the generation real-time performance of the text generation method are improved.
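Steps 41 to 43 can be sketched in Python as follows. The threads are emulated sequentially, the round-robin distribution is one assumed form of even distribution, and `parallel_top_k` with its parameter names is hypothetical.

```python
from typing import Dict, List


def parallel_top_k(probs: Dict[str, float],
                   num_threads: int,
                   per_thread: int,
                   k: int) -> List[str]:
    """Pick k target characters via per-thread sorting and a final merge."""
    if per_thread * num_threads < k:
        raise ValueError("per_thread * num_threads must be at least k")
    items = list(probs.items())
    # Step 41: distribute evenly; each emulated thread sorts its own share.
    shards = [sorted(items[t::num_threads], key=lambda kv: kv[1], reverse=True)
              for t in range(num_threads)]
    # Step 42: take the first `per_thread` entries of each sorted share.
    candidates = [kv for shard in shards for kv in shard[:per_thread]]
    # Step 43: sort the candidates in descending order and keep the top k.
    candidates.sort(key=lambda kv: kv[1], reverse=True)
    return [ch for ch, _ in candidates[:k]]
```

When `per_thread` is smaller than `k`, the result is not guaranteed to equal the exact top `k`: a thread that happens to hold several high-probability characters contributes only its first `per_thread` of them. This matches the statement above that accuracy is ensured to a certain extent.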

In one possible embodiment, the method may further comprise:

and determining display text from the target text, and outputting the display text, wherein the display text can be output to a display interface for prompting a user.

Wherein determining display text from the target text may comprise:

randomly selecting a third preset number of texts from the target texts as the display texts. In the technical scheme of the disclosure, in the process of determining the target text, characters with high probability information are selected from the candidate characters for predictive splicing, so that part of texts can be randomly selected from the target text as display texts to prompt a user, different outputs can be prompted to the user for the same input, and the variety of prompting the user is improved.

Or in another embodiment, determining display text from the target text may include:

and for each target text, determining priority information corresponding to the target text according to probability information corresponding to each target character generated in the target text generation process, and selecting a third preset number of texts as the display texts according to the sequence of the priority information from high to low.

The sum of probability information corresponding to each target character generated in the target text generation process can be determined as the priority information, so that the target texts are sequenced based on the priority information, and the display texts are determined, so that the target texts with higher probability can be preferentially displayed, the viewing time of a user is saved, and the use experience of the user is improved.
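Both ways of determining the display texts can be sketched as follows. The `Target` tuple, which pairs each target text with the probabilities of the characters spliced while generating it, and the function names are assumptions made for the illustration.

```python
import random
from typing import List, Optional, Tuple

# (target text, probabilities of the target characters spliced to build it)
Target = Tuple[str, List[float]]


def select_random(targets: List[Target], n: int,
                  seed: Optional[int] = None) -> List[str]:
    """Randomly pick up to n target texts as display texts."""
    rng = random.Random(seed)
    chosen = rng.sample(targets, min(n, len(targets)))
    return [text for text, _ in chosen]


def select_by_priority(targets: List[Target], n: int) -> List[str]:
    """Pick the n target texts with the highest summed probability."""
    ranked = sorted(targets, key=lambda t: sum(t[1]), reverse=True)
    return [text for text, _ in ranked[:n]]
```

Note that a plain sum favors longer target texts, since every additional character adds a positive term to the priority.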

The present disclosure also provides a processing apparatus for a language model, the language model being deployed in an electronic device, and a plurality of computing operations between computations of target types in computations of the same feature layer of the language model being merged into one fused computing operation, as shown in fig. 5, the apparatus 10 including:

a sending module 101, configured to send, to a GPU, an operation instruction including the plurality of computing operations when it is determined that the fused computing operation is to be executed;

the processing module 102 is configured to, in response to receiving the operation instruction, process the plurality of computing operations by the GPU.

Optionally, the video memory space corresponding to the language model is predetermined in the following manner:

Optionally, the language model includes a plurality of iterative feature layers for performing iterative computations, and a computation result obtained by performing iterative computations on each iterative feature layer corresponds to the same storage address in the video memory space.

Optionally, the language model includes an encoder module and a decoder module, and the calculation results of the encoder module and the decoder module correspond to the same memory address in the video memory space.

Optionally, the output results of the language model are ordered based on a plurality of parallel processing threads in the GPU.

The present disclosure also provides a text generating apparatus, as shown in fig. 6, the apparatus 20 includes:

a receiving module 201, configured to receive a text to be processed;

an input module 202, configured to input the text to be processed into a language model, and obtain a next candidate character corresponding to the text to be processed and probability information corresponding to each candidate character, where the language model is deployed in an electronic device, and multiple computing operations between computations of target types in computations of the same feature layer of the language model are merged into one fused computing operation, and the fused computing operation is executed in a manner that a CPU of the electronic device sends an operation instruction including the multiple computing operations to a GPU and the GPU processes the multiple computing operations;

the sorting module 203 is configured to sort probability information corresponding to each candidate character, and determine a plurality of target characters from the candidate characters based on a result of the sorting;

a splicing module 204, configured to splice each target character at the end of the text to be processed, to obtain multiple spliced texts, so as to obtain a target text corresponding to the text to be processed, where the target text is a text finally generated based on the text to be processed.

Optionally, the apparatus further comprises:

the first determining module is used for determining whether the spliced texts meet text generation requirements or not aiming at each spliced text;

the second determining module is used for determining the spliced text as the target text under the condition that the spliced text meets the text generation requirement;

a third determining module, configured to, when the spliced text does not meet the text generation requirement, take the spliced text as a new text to be processed and trigger: the input module to input the text to be processed into the language model and obtain the next candidate characters corresponding to the text to be processed and the probability information corresponding to each candidate character; the sorting module to sort the probability information corresponding to each candidate character and determine a plurality of target characters from the candidate characters based on a result of the sorting; the splicing module to splice each target character at the end of the text to be processed to obtain a plurality of spliced texts; and the first determining module to determine, for each spliced text, whether the spliced text meets the text generation requirement.

Optionally, the language model includes a plurality of iterative feature layers for performing iterative computations, and a computation result of each iterative feature layer for performing iterative computations corresponds to a same storage address in the video memory space;

the device further comprises:

and the first storage module is used for storing the calculation result in the space indicated by the storage address under the condition that the iteration feature layer obtains the calculation result in the process of calculating the language model so as to cover the current content stored in the space indicated by the storage address.

Optionally, the language model includes an encoder module and a decoder module, and the calculation results of the encoder module and the decoder module correspond to the same memory address in the video memory space;

the device further comprises:

and the second storage module is used for storing the calculation result in the space indicated by the storage address under the condition that the decoder module obtains the calculation result in the process of calculating the language model so as to cover the calculation result of the encoder module stored in the space indicated by the storage address.

Optionally, the sorting module includes:

the distribution submodule is used for uniformly distributing the probability information of each candidate character to a plurality of processing threads of the GPU so that each processing thread can sequence the probability information in the processing thread;

the obtaining submodule is used for obtaining probability information of a first preset quantity from each probability information in each processing thread according to the respective sequencing result of each processing thread to serve as candidate probability information;

and the determining submodule is used for sorting the candidate probability information in descending order, and determining the candidate characters corresponding to the probability information ranked within the top second preset number as the target characters, wherein the product of the first preset number and the number of the processing threads is greater than or equal to the second preset number.

Optionally, the apparatus further comprises:

the fourth determining module is used for determining a display text from the target text and outputting the display text;

wherein the fourth determining module comprises:

the first selection submodule is used for randomly selecting a third preset number of texts from the target texts as the display texts; or

the second selection submodule is used for determining, for each target text, priority information corresponding to the target text according to the probability information corresponding to each target character generated in the process of generating the target text, and selecting a third preset number of texts as the display texts in descending order of the priority information.

Referring now to FIG. 7, a block diagram of an electronic device 600 suitable for use in implementing embodiments of the present disclosure is shown. The terminal device in the embodiments of the present disclosure may include, but is not limited to, a mobile terminal such as a mobile phone, a notebook computer, a digital broadcast receiver, a PDA (personal digital assistant), a PAD (tablet computer), a PMP (portable multimedia player), a vehicle terminal (e.g., a car navigation terminal), and the like, and a stationary terminal such as a digital TV, a desktop computer, and the like. The electronic device shown in fig. 7 is only an example, and should not bring any limitation to the functions and the scope of use of the embodiments of the present disclosure.

As shown in fig. 7, electronic device 600 may include a processing means (e.g., central processing unit, graphics processor, etc.) 601 that may perform various appropriate actions and processes in accordance with a program stored in a Read Only Memory (ROM) 602 or a program loaded from a storage means 608 into a Random Access Memory (RAM) 603. In the RAM 603, various programs and data necessary for the operation of the electronic apparatus 600 are also stored. The processing device 601, the ROM 602, and the RAM 603 are connected to each other via a bus 604. An input/output (I/O) interface 605 is also connected to bus 604.

Generally, the following devices may be connected to the I/O interface 605: input devices 606 including, for example, a touch screen, touch pad, keyboard, mouse, camera, microphone, accelerometer, gyroscope, etc.; output devices 607 including, for example, a Liquid Crystal Display (LCD), a speaker, a vibrator, and the like; storage 608 including, for example, tape, hard disk, etc.; and a communication device 609. The communication means 609 may allow the electronic device 600 to communicate with other devices wirelessly or by wire to exchange data. While fig. 7 illustrates an electronic device 600 having various means, it is to be understood that not all illustrated means are required to be implemented or provided. More or fewer devices may alternatively be implemented or provided.

In particular, according to an embodiment of the present disclosure, the processes described above with reference to the flowcharts may be implemented as computer software programs. For example, embodiments of the present disclosure include a computer program product comprising a computer program carried on a non-transitory computer readable medium, the computer program containing program code for performing the method illustrated by the flow chart. In such an embodiment, the computer program may be downloaded and installed from a network via the communication means 609, or may be installed from the storage means 608, or may be installed from the ROM 602. The computer program, when executed by the processing device 601, performs the above-described functions defined in the methods of the embodiments of the present disclosure.

It should be noted that the computer readable medium in the present disclosure can be a computer readable signal medium or a computer readable storage medium or any combination of the two. A computer readable storage medium may be, for example, but not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any combination of the foregoing. More specific examples of the computer readable storage medium may include, but are not limited to: an electrical connection having one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing. In the present disclosure, a computer readable storage medium may be any tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device. In contrast, in the present disclosure, a computer readable signal medium may comprise a propagated data signal with computer readable program code embodied therein, either in baseband or as part of a carrier wave. Such a propagated data signal may take many forms, including, but not limited to, electro-magnetic, optical, or any suitable combination thereof. A computer readable signal medium may also be any computer readable medium that is not a computer readable storage medium and that can communicate, propagate, or transport a program for use by or in connection with an instruction execution system, apparatus, or device. Program code embodied on a computer readable medium may be transmitted using any appropriate medium, including but not limited to: electrical wires, optical cables, RF (radio frequency), etc., or any suitable combination of the foregoing.

In some embodiments, clients and servers may communicate using any currently known or future developed network protocol, such as HTTP (HyperText Transfer Protocol), and may interconnect with any form or medium of digital data communication (e.g., a communication network). Examples of communication networks include a local area network ("LAN"), a wide area network ("WAN"), an internetwork (e.g., the Internet), and peer-to-peer networks (e.g., ad hoc peer-to-peer networks), as well as any currently known or future developed network.

The computer readable medium may be embodied in the electronic device; or may exist separately without being assembled into the electronic device.

The computer readable medium carries one or more programs. A language model is deployed in the electronic device, and a plurality of computing operations between target-type computations in the computation of a same feature layer of the language model are merged into one fused computing operation. The one or more programs, when executed by the electronic device, cause the electronic device to: upon determining that the fused computing operation is to be performed, send, by a CPU of the electronic device, an operation instruction containing the plurality of computing operations to a GPU; and in response to receiving the operation instruction, process, by the GPU, the plurality of computing operations.

Alternatively, the computer readable medium carries one or more programs which, when executed by the electronic device, cause the electronic device to: receiving a text to be processed; inputting the text to be processed into a language model, and obtaining a next candidate character corresponding to the text to be processed and probability information corresponding to each candidate character, wherein the language model is deployed in an electronic device, and a plurality of computing operations between target type computing in computing of the same feature layer of the language model are combined into a fused computing operation, and the fused computing operation is executed in a manner that a CPU of the electronic device sends an operating instruction containing the computing operations to a GPU to process the computing operations; ranking probability information corresponding to each candidate character, and determining a plurality of target characters from the candidate characters based on a ranking result; and respectively splicing each target character at the end of the text to be processed to obtain a plurality of spliced texts so as to obtain a target text corresponding to the text to be processed, wherein the target text is a text finally generated based on the text to be processed.

Computer program code for carrying out operations of the present disclosure may be written in any combination of one or more programming languages, including but not limited to object oriented programming languages such as Java, Smalltalk and C++, and conventional procedural programming languages such as the "C" programming language or similar programming languages. The program code may execute entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer, or entirely on the remote computer or server. In the case of a remote computer, the remote computer may be connected to the user's computer through any type of network, including a Local Area Network (LAN) or a Wide Area Network (WAN), or the connection may be made to an external computer (for example, through the Internet using an Internet service provider).

The flowchart and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of systems, methods and computer program products according to various embodiments of the present disclosure. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of code, which comprises one or more executable instructions for implementing the specified logical function(s). It should also be noted that, in some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams and/or flowchart illustration, and combinations of blocks in the block diagrams and/or flowchart illustration, can be implemented by special purpose hardware-based systems which perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.

The modules described in the embodiments of the present disclosure may be implemented by software or hardware. The name of a module does not in some cases constitute a limitation on the module itself, and for example, the sending module may also be described as "a module that sends an operation instruction containing the plurality of computing operations to the GPU by the CPU of the electronic device when it is determined that the fused computing operation is to be executed".

The functions described herein above may be performed, at least in part, by one or more hardware logic components. For example, without limitation, exemplary types of hardware logic components that may be used include: Field Programmable Gate Arrays (FPGAs), Application Specific Integrated Circuits (ASICs), Application Specific Standard Products (ASSPs), Systems on a Chip (SOCs), Complex Programmable Logic Devices (CPLDs), and the like.

In the context of this disclosure, a machine-readable medium may be a tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device. The machine-readable medium may be a machine-readable signal medium or a machine-readable storage medium. A machine-readable medium may include, but is not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any suitable combination of the foregoing. More specific examples of a machine-readable storage medium would include an electrical connection based on one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.

Example 1 provides, according to one or more embodiments of the present disclosure, a processing method for a language model, wherein the language model is deployed in an electronic device, and a plurality of computation operations between computations of a target type in computations of a same feature layer of the language model are merged into one fused computation operation, the method including:

upon determining that the fused computing operation is to be performed, a CPU of the electronic device sends an operation instruction containing the plurality of computing operations to a GPU;

Example 2 provides the method of example 1, wherein the video memory space corresponding to the language model is predetermined by:

Example 3 provides the method of example 2, where the language model includes a plurality of iterative feature layers for iterative computation, and a computation result obtained by performing iterative computation on each iterative feature layer corresponds to a same storage address in the video memory space.

Example 4 provides the method of example 2, wherein the language model includes an encoder module and a decoder module, and the calculation results of the encoder module and the decoder module correspond to the same memory address in the video memory space.

Example 5 provides the method of any of examples 1-4, in accordance with one or more embodiments of the present disclosure, wherein the output results of the language model are sorted based on a plurality of parallel processing threads in the GPU.

Example 6 provides a text generation method according to one or more embodiments of the present disclosure, wherein the method includes:

receiving a text to be processed;

Example 7 provides the method of example 6, wherein the method further comprises, in accordance with one or more embodiments of the present disclosure:

for each spliced text, determining whether the spliced text meets the text generation requirement;

determining the spliced text as the target text under the condition that the spliced text meets the text generation requirement;

and when the spliced text does not meet the text generation requirement, taking the spliced text as a new text to be processed, and re-executing the steps from inputting the text to be processed into the language model to obtain the next candidate characters corresponding to the text to be processed and the probability information corresponding to each candidate character, through determining, for each spliced text, whether the spliced text meets the text generation requirement.

Example 8 provides the method of example 6, wherein the video memory space corresponding to the language model is predetermined by:

Example 9 provides the method of example 8, where the language model includes a plurality of iterative feature layers for iterative computation, and a computation result of the iterative computation performed by each iterative feature layer corresponds to a same storage address in the video memory space;

the method further comprises the following steps:

Example 10 provides the method of example 8, wherein the language model includes an encoder module and a decoder module, and the calculation results of the encoder module and the decoder module correspond to the same memory address in the video memory space;

the method further comprises the following steps:

Example 11 provides the method of example 6, wherein sorting the probability information corresponding to each of the candidate characters and determining a plurality of target characters from the candidate characters based on a result of the sorting includes:

evenly distributing the probability information of the candidate characters across a plurality of processing threads of the GPU, so that each processing thread sorts the probability information assigned to it;

acquiring, from the probability information in each processing thread and according to that thread's sorting result, a first preset number of pieces of probability information as candidate probability information;

and sorting the candidate probability information in descending order, and determining the candidate characters corresponding to the probability information ranked within the top second preset number as the target characters, wherein the product of the first preset number and the number of processing threads is greater than or equal to the second preset number.
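The three steps above amount to a partitioned top-k selection, which can be sketched as follows. This is an illustration only: CPU threads from Python's `concurrent.futures` stand in for GPU processing threads, the function name `top_k_partitioned` and its parameter names are hypothetical, the round-robin split is one assumed way of distributing the work evenly, and each thread is assumed to keep its highest-ranked entries.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Dict, List, Tuple


def top_k_partitioned(probs: Dict[str, float], num_threads: int,
                      first_n: int, second_n: int) -> List[str]:
    """Select target characters by per-thread sorting, then a final merge sort."""
    # Condition stated in the text: first_n * num_threads >= second_n.
    if first_n * num_threads < second_n:
        raise ValueError("first_n * num_threads must be >= second_n")

    items: List[Tuple[str, float]] = list(probs.items())
    # Step 1: spread the entries evenly (round-robin) over the threads.
    shares = [items[i::num_threads] for i in range(num_threads)]

    def local_top(share: List[Tuple[str, float]]) -> List[Tuple[str, float]]:
        # Step 2: each thread sorts its own share and keeps first_n entries.
        return sorted(share, key=lambda kv: kv[1], reverse=True)[:first_n]

    with ThreadPoolExecutor(max_workers=num_threads) as pool:
        candidates = [kv for part in pool.map(local_top, shares) for kv in part]

    # Step 3: sort the surviving candidates in descending order, keep second_n.
    candidates.sort(key=lambda kv: kv[1], reverse=True)
    return [ch for ch, _ in candidates[:second_n]]


example_probs = {"a": 0.05, "b": 0.4, "c": 0.1, "d": 0.25, "e": 0.2}
targets = top_k_partitioned(example_probs, num_threads=2, first_n=2, second_n=3)
```

The stated condition guarantees that enough candidates survive the per-thread step to fill the second preset number. Note that when the first preset number is smaller than the second, a thread holding more than its share of the globally highest probabilities would drop some of them, so the merged result is exact only if the first preset number is at least the second or the distribution happens to spread the top entries out.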

Example 12 provides the method of any of examples 6-11, wherein the method further comprises, in accordance with one or more embodiments of the present disclosure:

determining a display text from the target text, and outputting the display text;

wherein determining display text from the target text comprises:

randomly selecting a third preset number of texts from the target texts as the display texts; or

Example 13 provides, in accordance with one or more embodiments of the present disclosure, a processing apparatus for a language model deployed in an electronic device and having multiple computing operations between target-type computations in computations of a same feature layer of the language model merged into one fused computing operation, the apparatus comprising:

Example 14 provides, in accordance with one or more embodiments of the present disclosure, a text generation apparatus, the apparatus comprising:

the receiving module is used for receiving the text to be processed;

Example 15 provides a computer readable medium having stored thereon a computer program that, when executed by a processing apparatus, implements the steps of the method of any of examples 1-5 or that, when executed by a processing apparatus, implements the steps of the method of any of examples 6-12.

Example 16 provides, in accordance with one or more embodiments of the present disclosure, an electronic device, comprising:

a storage device having a computer program stored thereon;

processing means for executing said computer program in said storage means to carry out the steps of the method of any of examples 1-5 or to carry out the steps of the method of any of examples 6-12.

The foregoing description is merely an illustration of the preferred embodiments of the disclosure and of the technical principles employed. It will be appreciated by those skilled in the art that the scope of the disclosure herein is not limited to the particular combination of features described above, but also encompasses other embodiments formed by any combination of the features described above or their equivalents without departing from the spirit of the disclosure. For example, a technical solution may be formed by replacing the above features with features having similar functions that are disclosed in (but not limited to) this disclosure.

Further, while operations are depicted in a particular order, this should not be understood as requiring that such operations be performed in the particular order shown or in sequential order. Under certain circumstances, multitasking and parallel processing may be advantageous. Likewise, while several specific implementation details are included in the above discussion, these should not be construed as limitations on the scope of the disclosure. Certain features that are described in the context of separate embodiments can also be implemented in combination in a single embodiment. Conversely, various features that are described in the context of a single embodiment can also be implemented in multiple embodiments separately or in any suitable subcombination.

Although the subject matter has been described in language specific to structural features and/or methodological acts, it is to be understood that the subject matter defined in the appended claims is not necessarily limited to the specific features or acts described above. Rather, the specific features and acts described above are disclosed as example forms of implementing the claims. With regard to the apparatus in the above-described embodiment, the specific manner in which each module performs the operation has been described in detail in the embodiment related to the method, and will not be elaborated here.

Claims

1. A processing method for a language model, wherein the language model is deployed in an electronic device, and a plurality of computation operations between computations of a target type in computations of a same feature layer of the language model are merged into one fused computation operation, the method comprising:

2. The method according to claim 1, wherein the video memory space corresponding to the language model is predetermined by:

3. The method according to claim 2, wherein the language model includes a plurality of iterative feature layers for performing iterative computations, and a computation result obtained by performing iterative computations by each iterative feature layer corresponds to a same memory address in the video memory space.

4. The method of claim 2, wherein the language model comprises an encoder module and a decoder module, and the calculation results of the encoder module and the decoder module correspond to the same memory address in the video memory space.

5. The method according to any of claims 1-4, wherein the output results of the language model are sorted in parallel by multiple processing threads in the GPU.

6. A method of text generation, the method comprising:

receiving a text to be processed;

7. The method of claim 6, further comprising:

8. The method of claim 6, wherein the video memory space corresponding to the language model is predetermined by:

9. The method according to claim 8, wherein the language model includes a plurality of iterative feature layers for performing iterative computations, and a computation result of each iterative feature layer performing iterative computation corresponds to a same memory address in the video memory space;

the method further comprises the following steps:

10. The method of claim 8, wherein the language model comprises an encoder module and a decoder module, and the calculation results of the encoder module and the decoder module correspond to the same memory address in the video memory space;

the method further comprises the following steps:

11. The method of claim 6, wherein sorting the probability information corresponding to each of the candidate characters and determining a plurality of target characters from the candidate characters based on a result of the sorting comprises:

12. The method according to any one of claims 6-11, further comprising:

wherein determining display text from the target text comprises:

13. A processing apparatus for a language model, wherein the language model is deployed in an electronic device, and a plurality of computation operations between computations of a target type in computations of a same feature layer of the language model are merged into one fused computation operation, the apparatus comprising:

14. An apparatus for generating text, the apparatus comprising:

the receiving module is used for receiving the text to be processed;

15. A computer-readable medium, on which a computer program is stored, which, when being executed by processing means, carries out the steps of the method of any one of claims 1 to 5, or which, when being executed by processing means, carries out the steps of the method of any one of claims 6 to 12.

16. An electronic device, comprising:

a storage device having a computer program stored thereon;

processing means for executing the computer program in the storage means to carry out the steps of the method of any one of claims 1 to 5 or to carry out the steps of the method of any one of claims 6 to 12.