CN114676760B - Pre-training model reasoning processing method and device, electronic equipment and storage medium - Google Patents

Pre-training model reasoning processing method and device, electronic equipment and storage medium

Info

Publication number
CN114676760B
CN114676760B (application CN202210234098.1A)
Authority
CN
China
Prior art keywords
model
processing
reasoning
processed
bit
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202210234098.1A
Other languages
Chinese (zh)
Other versions
CN114676760A (en)
Inventor
贾超
郑直
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Innovation Zhiyuan Technology Co., Ltd.
Original Assignee
Beijing Zhiyuan Artificial Intelligence Research Institute
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Zhiyuan Artificial Intelligence Research Institute
Priority to CN202210234098.1A
Publication of CN114676760A
Application granted
Publication of CN114676760B
Legal status: Active
Anticipated expiration

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00 Pattern recognition
    • G06F 18/20 Analysing
    • G06F 18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F 18/214 Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 5/00 Computing arrangements using knowledge-based models
    • G06N 5/04 Inference or reasoning models
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D 10/00 Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Artificial Intelligence (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Evolutionary Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Computational Linguistics (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)

Abstract

The invention provides a pre-training model reasoning processing method and apparatus, an electronic device, and a storage medium. The method is applied to a server that performs reasoning processing on a model to be processed, and comprises the following steps: determining the model to be processed, where the model to be processed is represented by high-bit floating-point numbers and is obtained through pre-training, and the bit width of the high-bit floating-point numbers is greater than or equal to a first bit-number threshold; and converting, based on a model quantization technique, the model parameters of the model to be processed from the high-bit floating-point representation to a low-bit fixed-point representation, so as to accelerate the reasoning processing of the model to be processed, where the bit width of the low-bit numbers is less than or equal to a second bit-number threshold. The method achieves low cost and high processing speed for the large-scale model to be processed during reasoning.

Description

Pre-training model reasoning processing method and device, electronic equipment and storage medium
Technical Field
The present invention relates to the field of model processing, and in particular to a pre-training model reasoning processing method and apparatus, an electronic device, and a storage medium.
Background
In recent years, large-scale pre-training models have become a research hotspot; for example, large-scale pre-training language models are a focus of research in the field of natural language processing. Advances in pre-training language models have made it possible to train large-scale models comprising billions or even hundreds of billions of parameters (e.g., OpenAI's GPT-3 and the WuDao 2.0 models). These large-scale models have achieved surprisingly good results on numerous natural language processing tasks and have attracted continued attention from many researchers.
While large-scale pre-trained language models perform surprisingly well across multiple tasks, their enormous parameter counts also pose a significant challenge for model reasoning. The reasoning process requires frequent invocation of the model, which inevitably incurs significant costs in time, storage, and money.
Disclosure of Invention
The invention provides a pre-training model reasoning processing method and apparatus, an electronic device, and a storage medium, which are intended to overcome the high cost of large-scale pre-training models during reasoning in the prior art, and to achieve low cost and high processing speed for large-scale pre-training models in the reasoning process.
The invention provides a pre-training model reasoning processing method, applied to a server that performs reasoning processing on a model to be processed, comprising the following steps: determining the model to be processed, where the model to be processed is represented by high-bit floating-point numbers and is obtained through pre-training, and the bit width of the high-bit floating-point numbers is greater than or equal to a first bit-number threshold; and converting, based on a model quantization technique, the model parameters of the model to be processed from the high-bit floating-point representation to a low-bit fixed-point representation, so as to accelerate the reasoning processing of the model to be processed, where the bit width of the low-bit numbers is less than or equal to a second bit-number threshold.
According to the pre-training model reasoning processing method provided by the invention, the model parameters include linear layer parameters of the model to be processed, and converting the model parameters of the model to be processed from the high-bit floating-point representation to the low-bit fixed-point representation based on the model quantization technique comprises: quantizing the linear layer parameters based on model quantization calculation to obtain quantized linear layer parameters, where the quantized linear layer parameters are represented by the low-bit numbers.
According to the pre-training model reasoning processing method provided by the invention, in the process of performing accelerated reasoning on the model to be processed, the method further comprises: quantizing the hidden state of the model to be processed based on the model quantization technique to obtain a quantized hidden state, where the quantized hidden state is represented by the low-bit numbers; performing the operations related to reasoning based on the quantized linear layer parameters and the quantized hidden state to obtain a reasoning operation result; and dequantizing the reasoning operation result to obtain a dequantized reasoning operation result, which is taken as the accelerated reasoning operation result, where the dequantized reasoning operation result is represented by the high-bit floating-point numbers.
According to the pre-training model reasoning processing method provided by the invention, after the model parameters of the model to be processed are converted from the high-bit floating-point representation to the low-bit fixed-point representation based on the model quantization technique, the method further comprises: performing low-bit adaptation training on the model to be processed based on a training data set, and taking the trained model as the final model for accelerated reasoning processing.
According to the pre-training model reasoning processing method provided by the invention, the server comprises a central processing unit and a graphics processor, and in the process of performing accelerated reasoning on the model to be processed, the method further comprises: storing the model parameters of the model to be processed in the central processing unit; in response to the model to be processed performing reasoning, loading the model parameters from the central processing unit into the graphics processor for accelerated reasoning computation; and releasing the model parameters loaded into the graphics processor and the generated computational graph in response to completion of the accelerated reasoning computation.
The pre-training model reasoning processing method provided by the invention further comprises: partitioning the video memory of the graphics processor into at least a first video memory pool and a second video memory pool. Loading the model parameters from the central processing unit into the graphics processor for accelerated reasoning computation then comprises: based on the first and second video memory pools, performing, at the same moment and in alternation between the pools, the loading of model parameters from the central processing unit to the graphics processor and the computation on model parameters already in the graphics processor.
The invention also provides a pre-training model reasoning processing apparatus, applied to a server that performs reasoning processing on a model to be processed, comprising: a determining module, configured to determine the model to be processed, where the model to be processed is represented by high-bit floating-point numbers and is obtained through pre-training, and the bit width of the high-bit floating-point numbers is greater than or equal to a first bit-number threshold; and a processing module, configured to convert the model parameters of the model to be processed from the high-bit floating-point representation to a low-bit fixed-point representation based on a model quantization technique, so as to accelerate the reasoning processing of the model to be processed, where the bit width of the low-bit numbers is less than or equal to a second bit-number threshold.
The invention also provides an electronic device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, wherein the processor implements the pre-training model reasoning processing method as any one of the above when executing the program.
The invention also provides a non-transitory computer readable storage medium having stored thereon a computer program which, when executed by a processor, implements a pre-training model reasoning process as described in any of the above.
The invention also provides a computer program product comprising a computer program which when executed by a processor implements a pre-training model reasoning process as described in any one of the above.
According to the pre-training model reasoning processing method and apparatus, the electronic device, and the storage medium, the model parameters of the model to be processed are converted from a high-bit floating-point representation to a low-bit fixed-point representation based on a model quantization technique, which compresses the large-scale model to be processed and reduces the number of model parameters while maintaining comparable model performance, thereby achieving low cost and high processing speed for the large-scale model during reasoning.
Drawings
In order to more clearly illustrate the technical solutions of the invention or of the prior art, the drawings used in the description of the embodiments or the prior art are briefly introduced below. It is obvious that the drawings described below illustrate some embodiments of the invention, and that a person skilled in the art can obtain other drawings from them without inventive effort.
FIG. 1 is a technical scheme framework diagram of a server applying the pre-training model reasoning processing method provided by the invention;
FIG. 2 is a schematic flow chart of the pre-training model reasoning process method provided by the invention;
FIG. 3 is a second flow chart of the pre-training model reasoning process method provided by the invention;
FIG. 4 is a schematic diagram of an application scenario of the model quantization scheme provided by the present invention;
FIG. 5 is a third flow chart of the pre-training model reasoning process method provided by the invention;
FIG. 6 is a schematic diagram of an application scenario of the model operation offloading technique provided by the present invention;
FIG. 7 is a schematic diagram of a CPU-GPU scheduling optimization scheme provided by the invention;
FIG. 8 is a graph of performance comparisons of various pre-training model reasoning processes;
FIG. 9 is a schematic diagram of the structure of the pre-training model reasoning processing apparatus provided by the present invention;
FIG. 10 is a schematic structural diagram of an electronic device provided by the present invention.
Detailed Description
For the purpose of making the objects, technical solutions and advantages of the present invention more apparent, the technical solutions of the present invention will be clearly and completely described below with reference to the accompanying drawings, and it is apparent that the described embodiments are some embodiments of the present invention, not all embodiments. All other embodiments, which can be made by those skilled in the art based on the embodiments of the invention without making any inventive effort, are intended to be within the scope of the invention.
In recent years, large-scale pre-training models have become a research hotspot; for example, large-scale pre-training language models are a focus of research in the field of natural language processing. A large-scale pre-training model here refers to a model with more than a billion parameters.
For ease of description, in this application the large-scale pre-training model is exemplified by a large-scale pre-training language model. It should be understood that the scope of the present application is not limited to large-scale pre-trained language models.
Advances in pre-training language models have made it possible to train very large scale models comprising billions or even hundreds of billions of parameters (e.g., the OpenAI GPT-3 and WuDao 2.0 models). These very large scale models have achieved surprisingly good results on many natural language processing tasks and have attracted continued attention from many researchers.
While very large scale pre-trained language models perform surprisingly well across multiple tasks, their enormous parameter counts also pose a significant challenge for model reasoning. The reasoning process requires frequent invocation of the model; a common way to meet this requirement is to build a large GPU cluster that stacks computing power, which inevitably incurs significant costs in time, storage, and money. Large enterprises and research institutions have enough computing power to run very large models for reasoning, but they still need corresponding computational acceleration techniques to reduce usage costs. Small enterprises and individual users generally lack the funds to build the corresponding GPU resources, and therefore cannot use very large models for reasoning at all.
For these reasons, the present application provides a pre-training model reasoning processing method that enables low-resource reasoning and accelerated reasoning, so that a very large scale model with tens of billions of parameters can run on consumer-grade graphics cards (such as the NVIDIA GTX 1060 and GTX 1080 Ti), while also running faster than existing frameworks on enterprise-grade GPUs (such as the NVIDIA Tesla V100 and A100).
The pre-training model reasoning processing method provided by the invention can be applied to a server that performs reasoning processing on a model to be processed. As shown in FIG. 1, the technical framework of such a server comprises a model layer, an algorithm layer, an implementation layer, and a hardware layer. The application performs low-resource adaptation and acceleration of the large-scale pre-training language model at two levels: the algorithm level and the implementation level. At the algorithm level, the application uses model quantization to compress the large-scale model, reducing the number of model parameters while maintaining comparable performance. At the underlying implementation level, model reasoning is further accelerated using model operation offloading, mixed-precision operator implementations, and CPU-GPU scheduling optimization.
The process of the pre-training model reasoning processing method provided by the invention will be described with reference to the following examples.
Fig. 2 is a schematic flow chart of a pre-training model reasoning processing method provided by the invention.
In an exemplary embodiment of the present invention, the pre-training model reasoning processing method may be applied to a server that performs reasoning processing on a model to be processed. The model to be processed is obtained by pre-training on a training set and may be a large-scale model.
As can be seen in conjunction with fig. 2, the pre-training model reasoning process may include steps 210 and 220, each of which will be described separately below.
In step 210, a model to be processed is determined, where the model to be processed is represented by a high-bit floating point number and is obtained through pre-training, and the number of bits of the high-bit floating point number is greater than or equal to a first bit number threshold.
In an example, the model to be processed may be represented using a high-bit floating point number, wherein the number of bits of the high-bit floating point number is greater than or equal to the first bit number threshold. It should be noted that the first bit number threshold may be adjusted according to practical situations, and in an example, the first bit number threshold may be 16 bits.
In step 220, based on the model quantization technique, the model parameters of the model to be processed are converted from the high-bit floating point number representation to the low-bit number representation, so as to implement the accelerated reasoning process on the model to be processed, wherein the bit number of the low-bit number is smaller than or equal to the second bit number threshold.
In one embodiment, the model parameters of the model to be processed may be converted from a high-bit floating-point representation to a low-bit fixed-point representation based on a model quantization technique. Model quantization aims to replace the high-bit floating-point representation of the model parameters with a low-bit representation. In practice, a pre-training language model (also called the model to be processed) usually uses 32-bit or 16-bit floating-point numbers; after model quantization, the model can be represented with 8-bit, 4-bit, or even 1-bit fixed-point numbers, which greatly reduces video memory usage and facilitates accelerated reasoning on the model to be processed.
It should be noted that the bit width of the low-bit numbers is less than or equal to the second bit-number threshold; this threshold may be adjusted according to the actual situation, and in one example it is 8 bits.
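To make the quantization step concrete, the following is a minimal sketch of symmetric per-tensor int8 quantization in Python with PyTorch. The scale choice, granularity, and function names are illustrative assumptions on our part; the patent does not publish its exact quantization formula.

```python
import torch

def quantize_int8(w: torch.Tensor):
    # Symmetric per-tensor quantization: map fp32 values onto [-127, 127].
    scale = w.abs().max() / 127.0
    q = torch.clamp((w / scale).round(), -127, 127).to(torch.int8)
    return q, scale

def dequantize_int8(q: torch.Tensor, scale: torch.Tensor):
    # Recover an fp32 approximation of the original tensor.
    return q.to(torch.float32) * scale

w = torch.randn(4096, 4096)        # a linear-layer weight matrix in fp32
q, scale = quantize_int8(w)        # int8 storage is 4x smaller than fp32
w_hat = dequantize_int8(q, scale)
print((w - w_hat).abs().max())     # worst-case rounding error is about scale / 2
```

Storing the int8 tensor plus one scale in place of the fp32 tensor is what cuts video memory usage by roughly a factor of four in this sketch.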
According to the pre-training model reasoning processing method described above, converting the model parameters of the model to be processed from a high-bit floating-point representation to a low-bit fixed-point representation based on a model quantization technique compresses the large-scale model, reduces the number of model parameters while maintaining comparable performance, and thereby achieves low cost and high processing speed for the large-scale model during reasoning.
In an exemplary embodiment of the present invention, the model parameters may include linear layer parameters of the model to be processed, and converting the model parameters from the high-bit floating-point representation to the low-bit fixed-point representation based on the model quantization technique may be implemented as follows: quantizing the linear layer parameters based on model quantization calculation to obtain quantized linear layer parameters, where the quantized linear layer parameters are represented by low-bit numbers. In this embodiment, representing the linear layer parameters with low-bit numbers compresses the pre-training model while preserving its generalization ability, thereby achieving low cost and high processing speed for the large-scale model during reasoning.
To further describe the pre-training model reasoning process method provided by the present invention, the following description will be made with reference to fig. 3.
FIG. 3 is a second flow chart of the pre-training model reasoning process method provided by the invention.
In an exemplary embodiment of the present invention, as can be seen in conjunction with fig. 3, in the process of performing the accelerated reasoning process on the model to be processed, the pre-training model reasoning process method may include steps 310 to 330, and each step will be described separately.
In step 310, quantization is performed on the hidden state of the model to be processed based on the model quantization technique, so as to obtain a quantized hidden state, where the quantized hidden state is represented by a low-bit number.
In step 320, the operations related to reasoning are performed based on the quantized linear layer parameters and the quantized hidden state, yielding a reasoning operation result.
In this embodiment, performing the reasoning computation on low-bit fixed-point linear layer parameters and low-bit fixed-point hidden states effectively reduces resource usage during reasoning, thereby lowering the cost of reasoning and increasing the running speed.
In step 330, the reasoning operation result is dequantized to obtain a dequantized reasoning operation result, which is taken as the accelerated reasoning operation result; the dequantized result is represented by high-bit floating-point numbers.
In practice, dequantizing the reasoning operation result ensures that it remains compatible with the other high-bit parameters of the model, while the amount of computation inside the reasoning operation itself is reduced.
As shown in FIG. 4, a model represented by high-bit (e.g., 32-bit) floating-point numbers can be quantized into a model represented by low-bit (e.g., 8-bit) fixed-point numbers. In practice, an input state quantity represented by high-bit floating-point numbers (e.g., the linear layer parameters and hidden states, corresponding to float32 on the left side of the equation in FIG. 4) is converted into an input state quantity represented by low-bit fixed-point numbers (corresponding to int8 on the left side of the equation in FIG. 4). The relevant computation is then performed on the converted input state quantities to obtain the reasoning operation result (corresponding to int8 on the right side of the equation in FIG. 4). Performing the reasoning computation on low-bit linear layer parameters and low-bit hidden states effectively reduces resource usage during reasoning, thereby lowering the cost of reasoning and increasing the running speed.
Finally, the completed result is converted back into a computation result represented by high-bit floating-point numbers (corresponding to float32 on the right side of the equation in FIG. 4). This ensures that the dequantized reasoning result remains compatible with the other parameters while the amount of computation inside the reasoning operation is reduced.
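The quantize-compute-dequantize pattern of FIG. 4 can be sketched as follows. This is an illustrative CPU reference implementation assuming PyTorch and the quantize_int8 helper above, not the patent's actual kernel (a production kernel would run the integer matmul on int8 tensor cores):

```python
import torch

def int8_linear(x: torch.Tensor, w_q: torch.Tensor, w_scale: torch.Tensor):
    # Quantize the hidden state (activation) to int8 on the fly.
    x_scale = x.abs().max() / 127.0
    x_q = torch.clamp((x / x_scale).round(), -127, 127).to(torch.int8)
    # Integer matmul, accumulating in int32 to avoid overflow.
    acc = x_q.to(torch.int32) @ w_q.to(torch.int32).t()
    # Dequantize: a single floating-point rescale restores the fp32 output.
    return acc.to(torch.float32) * (x_scale * w_scale)
```

The heavy work (the matmul) happens entirely in integer arithmetic; only the cheap rescale at the end touches floating point, which is where the speed and memory savings come from.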
To reduce the impact of quantization on model performance, a small portion of the pre-training-phase data may be used to adapt the model to the low-bit representation.
In an exemplary embodiment of the present invention, the pre-training model reasoning processing method described in FIG. 2 is taken as an example for further description. After the model parameters of the model to be processed are converted from the high-bit floating-point representation to the low-bit fixed-point representation based on the model quantization technique, the method may further comprise: performing low-bit adaptation training on the model to be processed based on a training data set, and taking the trained model as the final model for accelerated reasoning processing.
In one embodiment, during the low-bit adaptation training phase, the model weights may still be kept in a high-bit representation. In each forward pass, the high-bit model weights are quantized into low-bit model weights, and the model computation is performed with the quantized weights. After the low-bit adaptation training phase ends, the high-bit model weights are discarded, and model reasoning proceeds with the corresponding low-bit model weights. This embodiment improves the reasoning accuracy of the low-bit model to be processed.
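A minimal sketch of such low-bit adaptation training, assuming PyTorch, is given below. The straight-through estimator used for gradients is a standard quantization-aware-training device and an assumption on our part, since the patent does not spell out its training rule:

```python
import torch

class QuantAdaptLinear(torch.nn.Linear):
    # Keeps fp32 master weights during adaptation training; the forward pass
    # uses their int8-quantized version, so the model learns to tolerate
    # quantization. After training, only the quantized weights are kept.
    def forward(self, x: torch.Tensor) -> torch.Tensor:
        scale = self.weight.abs().max() / 127.0
        w_q = torch.clamp((self.weight / scale).round(), -127, 127) * scale
        # Straight-through estimator: the forward pass uses quantized weights,
        # while gradients flow back into the fp32 master weights unchanged.
        w_ste = self.weight + (w_q - self.weight).detach()
        return torch.nn.functional.linear(x, w_ste, self.bias)
```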
In order that a graphics processor (Graphics Processing Unit, GPU) with a small amount of video memory can still hold the model parameters and the computational graph at the same time, the present application also uses a model operation offloading technique to make use of host (CPU) memory.
The process of using host memory via the model operation offloading technique is described below in connection with FIG. 5.
In an exemplary embodiment of the present invention, the server may include a central processing unit (Central Processing Unit, CPU) and a graphics processor. As shown in FIG. 5, in the process of performing accelerated reasoning on the model to be processed, the pre-training model reasoning processing method may further include steps 510 to 530, each of which is described below.
In step 510, the model parameters of the model to be processed are stored in the memory managed by the central processing unit.
In step 520, in response to the model to be processed performing the reasoning process, the model parameters are loaded into the graphics processor by the central processing unit for performing the accelerated reasoning process calculation.
In step 530, model parameters loaded into the graphics processor and the generated computational graph are released in the graphics processor in response to the accelerated reasoning process computation being completed.
In one embodiment, as shown in FIG. 6, the main idea of model operation offloading is to store the parameters of the model to be processed in the memory used by the CPU. During the layer-by-layer computation of the model, the offloaded parameters are loaded from the CPU to the GPU for computation. After the computation completes, the loaded model parameters and the computational graph are released, saving GPU video memory. Because the model parameters can be divided into many blocks, the model operation offloading technique is essential for enabling large-scale models to run on devices with limited computing resources.
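The offloading loop might look like the following sketch, assuming PyTorch modules whose parameters start on the CPU; the per-layer granularity matches the description above, while the function itself is illustrative:

```python
import torch

@torch.no_grad()  # inference only: no computational graph is retained
def offloaded_forward(layers, x, device="cuda"):
    x = x.to(device)
    for layer in layers:          # layers: list of nn.Module kept on the CPU
        layer.to(device)          # load this block's parameters into the GPU
        x = layer(x)              # compute on the GPU
        layer.to("cpu")           # release GPU memory before the next block
    return x
```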
The model operation offloading technique fragments storage and requires frequent communication between the CPU and the GPU, which creates runtime overhead. To address this problem, a CPU-GPU scheduling optimization technique may be used.
The CPU-GPU scheduling optimization scheme of the present invention is described with reference to the following examples.
In an exemplary embodiment of the present invention, the pre-training model reasoning processing method may further comprise: partitioning the video memory of the graphics processor into at least a first video memory pool and a second video memory pool. Loading the model parameters from the central processing unit into the graphics processor for accelerated reasoning computation may then be implemented as follows: based on the first and second video memory pools, performing, at the same moment and in alternation between the pools, the loading of the next block of model parameters from the central processing unit to the graphics processor and the computation on the neighboring block of model parameters already in the graphics processor.
As shown in FIG. 7, the GPU video memory may be partitioned. In one example, two video memory pools are provided and used alternately for computing and for loading parameters. The remaining portion of the video memory is designated as "fixed" memory, in which some model parameters are stored permanently so that not all parameters have to be scheduled.
In practice, one layer of the large-scale pre-training language model is used as the partitioning granularity. In one example, the pre-trained language model includes 12 layers, while the GPU video memory can hold the storage and operations of 8 layers. Two video memory pools are then created out of 2 layers' worth of video memory; the remaining video memory stores the parameters of the first 6 layers, and the parameters of the last 6 layers are scheduled through the two pools. Because the computation and the loading of neighboring blocks of model parameters happen simultaneously (e.g., computing the parameters of layer 7 while loading the parameters of layer 8 at time T2), the runtime overhead caused by frequent CPU-GPU communication is effectively reduced.
It can also be seen from FIG. 7 that, with the two video memory pools in use, CPU-GPU communication and GPU model computation proceed essentially simultaneously; the CPU-GPU communication time is completely hidden inside the computation time, so the time cost of weight loading becomes negligible.
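A double-buffered version of the offloading loop, using a side CUDA stream to overlap the copy of layer i+1 with the computation of layer i, could be sketched as follows. Pinned host memory and the exact pool bookkeeping are assumptions here, since the patent describes the scheme only at the level of FIG. 7:

```python
import torch

@torch.no_grad()
def double_buffered_forward(layers, x, device="cuda"):
    # Two pools in effect: while the current layer computes on the default
    # stream, the next layer's weights are copied in on a side stream.
    copy_stream = torch.cuda.Stream()
    layers = list(layers)
    layers[0].to(device)                      # preload the first layer
    x = x.to(device)
    for i, layer in enumerate(layers):
        if i + 1 < len(layers):
            with torch.cuda.stream(copy_stream):
                # non_blocking copies only overlap if host tensors are pinned
                layers[i + 1].to(device, non_blocking=True)
        x = layer(x)                          # compute the current layer
        torch.cuda.current_stream().wait_stream(copy_stream)
        layer.to("cpu")                       # free this layer's pool
    return x
```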
In yet another embodiment, to make better use of the tensor cores in the GPU, a set of efficient mixed-precision operators may also be designed and implemented to increase the running speed of the reasoning process on the model to be processed.
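The patent's custom mixed-precision operators are not published; as a stand-in illustration of the idea, PyTorch's autocast runs matmul-heavy operations in fp16 on tensor cores while keeping numerically sensitive operations in fp32:

```python
import torch

@torch.no_grad()
def mixed_precision_forward(model, x):
    # autocast selects fp16 or fp32 per operation: GEMMs hit the tensor
    # cores in fp16, while reductions such as softmax stay in fp32.
    with torch.autocast(device_type="cuda", dtype=torch.float16):
        return model(x.cuda())
```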
In practice, the advantages of the present application over the previous solution can be illustrated with the 11-billion-parameter Chinese pre-training language model CPM-2. The original scheme is based on the distributed toolkits DeepSpeed and Megatron (both open-source software). The performance of different pre-training model reasoning processing methods on different GPUs can be compared by the speed of model decoding.
As can be seen from FIG. 8, the pre-training model reasoning processing method provided by the application enables reasoning with the 11-billion-parameter CPM-2 model on consumer-grade graphics cards (such as the GTX 1060 and GTX 1080 Ti). It thus allows most individual users to run large-model reasoning on a personal terminal and obtain the desired results.
On GPUs that already support large-model reasoning (such as the Tesla V100 and A100), the pre-training model reasoning processing method provided by the application achieves a 4-6x speedup. In particular, the model quantization technique and the mixed-precision operators in this technical scheme yield a large performance improvement over the previous scheme.
As described above, converting the model parameters of the model to be processed from a high-bit floating-point representation to a low-bit fixed-point representation based on a model quantization technique compresses the large-scale model, reduces the number of model parameters while maintaining comparable performance, and achieves low cost and high processing speed for the large-scale model during reasoning.
Based on the same conception, the invention also provides a pre-training model reasoning processing device.
The pre-training model reasoning processing apparatus provided by the invention is described below; the apparatus described below and the pre-training model reasoning processing method described above may be referred to in correspondence with each other.
Fig. 9 is a schematic structural diagram of a pre-training model reasoning processing device provided by the invention.
In an exemplary embodiment of the present invention, the pre-training model reasoning processing apparatus may be applied to a server that performs reasoning processing on the model to be processed. As shown in FIG. 9, the apparatus may comprise a determining module 910 and a processing module 920, each of which is described below.
The determination module 910 may be configured to determine a model to be processed, where the model to be processed is represented using high-bit floating-point numbers and is obtained by pre-training, where the number of bits of the high-bit floating-point numbers is greater than or equal to a first bit number threshold.
The processing module 920 may be configured to convert model parameters of the model to be processed from using a high-bit floating point representation to using a low-bit number representation based on a model quantization technique, where the number of bits of the low-bit number is less than or equal to a second bit number threshold, to implement an accelerated reasoning process for the model to be processed.
In an exemplary embodiment of the present invention, the model parameters may include linear layer parameters of the model to be processed, and the processing module 920 may convert the model parameters from the high-bit floating-point representation to the low-bit fixed-point representation based on the model quantization technique as follows: quantizing the linear layer parameters based on model quantization calculation to obtain quantized linear layer parameters, where the quantized linear layer parameters are represented by low-bit numbers.
In an exemplary embodiment of the present invention, in the process of performing accelerated reasoning on the model to be processed, the processing module 920 may be further configured to: quantize the hidden state of the model to be processed based on the model quantization technique to obtain a quantized hidden state, where the quantized hidden state is represented by low-bit numbers; perform the operations related to reasoning based on the quantized linear layer parameters and the quantized hidden state to obtain a reasoning operation result; and dequantize the reasoning operation result to obtain a dequantized reasoning operation result, which is taken as the accelerated reasoning operation result, where the dequantized result is represented by high-bit floating-point numbers.
In an exemplary embodiment of the present invention, the processing module 920 may be further configured to: perform low-bit adaptation training on the model to be processed based on a training data set, and take the trained model as the final model for accelerated reasoning processing.
In an exemplary embodiment of the present invention, the server may include a central processing unit and a graphics processor, and in the process of performing accelerated reasoning on the model to be processed, the processing module 920 may be further configured to: store the model parameters of the model to be processed in the central processing unit; in response to the model to be processed performing reasoning, load the model parameters from the central processing unit into the graphics processor for accelerated reasoning computation; and release the model parameters loaded into the graphics processor and the generated computational graph when the accelerated reasoning computation completes.
In an exemplary embodiment of the present invention, the processing module 920 may be further configured to: partition the video memory of the graphics processor into at least a first video memory pool and a second video memory pool. The processing module 920 loads the model parameters from the central processing unit into the graphics processor for accelerated reasoning computation as follows: based on the first and second video memory pools, performing, at the same moment and in alternation between the pools, the loading of neighboring blocks of model parameters from the central processing unit to the graphics processor and the computation on model parameters already in the graphics processor.
FIG. 10 illustrates the physical structure of an electronic device. As shown in FIG. 10, the electronic device may include: a processor 1010, a communication interface (Communications Interface) 1020, a memory 1030, and a communication bus 1040, where the processor 1010, the communication interface 1020, and the memory 1030 communicate with each other via the communication bus 1040. The processor 1010 may invoke logic instructions in the memory 1030 to perform the pre-training model reasoning processing method, which is applied to a server that performs reasoning processing on a model to be processed and comprises: determining the model to be processed, where the model to be processed is represented by high-bit floating-point numbers and is obtained through pre-training, and the bit width of the high-bit floating-point numbers is greater than or equal to a first bit-number threshold; and converting, based on a model quantization technique, the model parameters of the model to be processed from the high-bit floating-point representation to a low-bit fixed-point representation, so as to accelerate the reasoning processing of the model to be processed, where the bit width of the low-bit numbers is less than or equal to a second bit-number threshold.
Further, the logic instructions in the memory 1030 described above may be implemented in the form of software functional units and stored in a computer readable storage medium when sold or used as a stand alone product. Based on this understanding, the technical solution of the present invention may be embodied essentially or in a part contributing to the prior art or in a part of the technical solution, in the form of a software product stored in a storage medium, comprising several instructions for causing a computer device (which may be a personal computer, a server, a network device, etc.) to perform all or part of the steps of the method according to the embodiments of the present invention. And the aforementioned storage medium includes: a U-disk, a removable hard disk, a Read-Only Memory (ROM), a random access Memory (RAM, random Access Memory), a magnetic disk, or an optical disk, or other various media capable of storing program codes.
In another aspect, the present invention also provides a computer program product comprising a computer program that can be stored on a non-transitory computer-readable storage medium. When the computer program is executed by a processor, the computer can perform the pre-training model reasoning processing method provided above, applied to a server that performs reasoning processing on a model to be processed, the method comprising: determining the model to be processed, where the model to be processed is represented by high-bit floating-point numbers and is obtained through pre-training, and the bit width of the high-bit floating-point numbers is greater than or equal to a first bit-number threshold; and converting, based on a model quantization technique, the model parameters of the model to be processed from the high-bit floating-point representation to a low-bit fixed-point representation, so as to accelerate the reasoning processing of the model to be processed, where the bit width of the low-bit numbers is less than or equal to a second bit-number threshold.
In yet another aspect, the present invention also provides a non-transitory computer-readable storage medium storing a computer program that, when executed by a processor, implements the pre-training model reasoning processing method provided above, applied to a server that performs reasoning processing on a model to be processed, the method comprising: determining the model to be processed, where the model to be processed is represented by high-bit floating-point numbers and is obtained through pre-training, and the bit width of the high-bit floating-point numbers is greater than or equal to a first bit-number threshold; and converting, based on a model quantization technique, the model parameters of the model to be processed from the high-bit floating-point representation to a low-bit fixed-point representation, so as to accelerate the reasoning processing of the model to be processed, where the bit width of the low-bit numbers is less than or equal to a second bit-number threshold.
The apparatus embodiments described above are merely illustrative. Units described as separate components may or may not be physically separate, and components shown as units may or may not be physical units; they may be located in one place or distributed over a plurality of network units. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of this embodiment. Those of ordinary skill in the art can understand and implement the invention without inventive effort.
From the above description of the embodiments, it will be apparent to those skilled in the art that the embodiments may be implemented by means of software plus necessary general hardware platforms, or of course may be implemented by means of hardware. Based on this understanding, the foregoing technical solution may be embodied essentially or in a part contributing to the prior art in the form of a software product, which may be stored in a computer readable storage medium, such as ROM/RAM, a magnetic disk, an optical disk, etc., including several instructions for causing a computer device (which may be a personal computer, a server, or a network device, etc.) to execute the method described in the respective embodiments or some parts of the embodiments.
It will further be appreciated that although operations are depicted in the drawings in a particular order, this should not be understood as requiring that such operations be performed in the particular order shown or in sequential order, or that all illustrated operations be performed, to achieve desirable results. In certain circumstances, multitasking and parallel processing may be advantageous.
Finally, it should be noted that: the above embodiments are only for illustrating the technical solution of the present invention, and are not limiting; although the invention has been described in detail with reference to the foregoing embodiments, it will be understood by those of ordinary skill in the art that: the technical scheme described in the foregoing embodiments can be modified or some technical features thereof can be replaced by equivalents; such modifications and substitutions do not depart from the spirit and scope of the technical solutions of the embodiments of the present invention.

Claims (5)

1. A method for reasoning processing of a pre-trained model, wherein the method is applied to a server for reasoning processing of a model to be processed, and the method comprises the following steps:
determining the model to be processed, wherein the model to be processed is represented by a high-bit floating point number and is obtained through pre-training, and the bit number of the high-bit floating point number is greater than or equal to a first bit number threshold value;
converting model parameters of the model to be processed from using the high-bit floating point number representation to using a low-bit number representation based on a model quantization technology, to implement an accelerated reasoning process on the model to be processed, wherein the number of bits of the low-bit number is less than or equal to a second bit number threshold value, wherein,
the model parameters comprise linear layer parameters of the model to be processed, and converting the model parameters of the model to be processed from the high-bit floating point number representation to the low-bit fixed-point number representation based on the model quantization technology comprises the following steps:
based on model quantization calculation, carrying out quantization treatment on the linear layer parameters to obtain quantized linear layer parameters, wherein the quantized linear layer parameters are represented by the low bit number;
in the process of carrying out acceleration reasoning processing on the model to be processed, the method further comprises the following steps:
based on a model quantization technology, carrying out quantization treatment on the hidden state of the model to be treated to obtain a quantized hidden state, wherein the quantized hidden state is represented by the low-bit number;
performing an operation related to reasoning processing based on the quantized linear layer parameters and the quantized hidden states to obtain a reasoning processing operation result;
performing inverse quantization processing on the reasoning processing operation result to obtain a reasoning processing operation result after inverse quantization processing, and taking the reasoning processing operation result after inverse quantization processing as an acceleration reasoning processing operation result, wherein the reasoning processing operation result after inverse quantization processing adopts the high-bit floating point number representation;
the server comprises a central processor and a graphic processor, and in the process of carrying out acceleration reasoning processing on the model to be processed, the method further comprises the following steps:
storing model parameters relating to the model to be processed to the central processor;
responding to the model to be processed to perform reasoning processing, and loading the model parameters into the graphic processor by the central processing unit to perform accelerated reasoning processing calculation;
releasing the model parameters and the generated computational graph loaded into the graphics processor in response to completion of the accelerated reasoning process computation;
the method further comprises the steps of: dividing the video memory of the graphics processor into at least a first video memory pool and a second video memory pool;
the loading of the model parameters into the graphic processor by the central processing unit for accelerated reasoning processing calculation comprises the following steps:
and at the same time, alternately executing the loading processing of loading the model parameters to the graphics processor by the central processing unit based on the first video memory pool and the second video memory pool, and executing the operation processing of the model parameters in the graphics processor.
2. The method of claim 1, further comprising, after the model parameters of the model to be processed are converted from using the high bit floating point representation to using a low bit point representation based on a model quantization technique:
and based on a training data set, performing low-bit adaptive training on the to-be-processed model, and taking the trained to-be-processed model as a to-be-processed model after final accelerated reasoning processing.
3. A pre-training model reasoning processing apparatus, the apparatus being applied to a server for reasoning processing a model to be processed, the apparatus comprising:
the determining module is used for determining the model to be processed, wherein the model to be processed is represented by a high-bit floating point number and is obtained through pre-training, and the bit number of the high-bit floating point number is larger than or equal to a first bit number threshold value;
the processing module is used for converting the model parameters of the model to be processed from the high-bit floating point number representation to the low-bit number representation based on a model quantization technology, so as to realize the acceleration reasoning processing of the model to be processed, wherein the bit number of the low-bit number is smaller than or equal to a second bit number threshold value,
the model parameters may include linear layer parameters of the model to be processed, and the processing module converts the model parameters of the model to be processed from a high-bit floating point representation to a low-bit point representation based on a model quantization technique in the following manner: based on model quantization calculation, carrying out quantization treatment on the linear layer parameters to obtain quantized linear layer parameters, wherein the quantized linear layer parameters are represented by low bit numbers;
in the process of carrying out acceleration reasoning processing on the model to be processed, the processing module is configured to be used for: based on a model quantization technology, carrying out quantization treatment on the hidden state of the model to be treated to obtain a quantized hidden state, wherein the quantized hidden state is represented by a low-bit number; performing an operation related to reasoning processing based on the quantized linear layer parameters and the quantized hidden states to obtain a reasoning processing operation result; performing inverse quantization processing on the reasoning processing operation result to obtain a reasoning processing operation result after inverse quantization processing, and taking the reasoning processing operation result after inverse quantization processing as an acceleration reasoning processing operation result, wherein the reasoning processing operation result after inverse quantization processing adopts high-bit floating point number representation;
the server comprises a central processor and a graphic processor, and in the process of carrying out acceleration reasoning processing on the model to be processed, the processing module is configured to be used for: storing model parameters related to the model to be processed into a central processing unit; responding to the model to be processed to perform reasoning processing, loading model parameters into a graphic processor by a central processing unit to perform accelerated reasoning processing calculation; releasing model parameters loaded into the graphics processor and the generated computational graph in the graphics processor in response to completion of the accelerated reasoning process computation;
the processing module is configured to: dividing the video memory of the graphic processor into at least a first video memory pool and a second video memory pool; the processing module loads the model parameters from the central processing unit to the graphic processor for acceleration reasoning processing calculation by adopting the following modes: and at the same time, alternately executing the loading processing of the adjacent model parameters loaded to the graphic processor by the central processing unit based on the first video memory pool and the second video memory pool, and performing the operation processing of the model parameters in the graphic processor.
4. An electronic device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, wherein the processor implements the pre-trained model reasoning process of any of claims 1 to 2 when the program is executed.
5. A non-transitory computer readable storage medium having stored thereon a computer program, which when executed by a processor implements the pre-training model reasoning process of any of claims 1 to 2.
CN202210234098.1A 2022-03-10 2022-03-10 Pre-training model reasoning processing method and device, electronic equipment and storage medium Active CN114676760B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210234098.1A CN114676760B (en) 2022-03-10 2022-03-10 Pre-training model reasoning processing method and device, electronic equipment and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210234098.1A CN114676760B (en) 2022-03-10 2022-03-10 Pre-training model reasoning processing method and device, electronic equipment and storage medium

Publications (2)

Publication Number Publication Date
CN114676760A (en) 2022-06-28
CN114676760B (en) 2023-06-02

Family

ID=82072583

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210234098.1A Active CN114676760B (en) 2022-03-10 2022-03-10 Pre-training model reasoning processing method and device, electronic equipment and storage medium

Country Status (1)

Country Link
CN (1) CN114676760B (en)

Family Cites Families (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111723901B (en) * 2019-03-19 2024-01-12 百度在线网络技术(北京)有限公司 Training method and device for neural network model
CN111882058A (en) * 2020-06-24 2020-11-03 苏州浪潮智能科技有限公司 4-bit quantization method and system of neural network
CN112200311A (en) * 2020-09-17 2021-01-08 苏州浪潮智能科技有限公司 4-bit quantitative reasoning method, device, equipment and readable medium
CN112884146B (en) * 2021-02-25 2024-02-13 香港理工大学深圳研究院 Method and system for training model based on data quantization and hardware acceleration

Also Published As

Publication number Publication date
CN114676760A (en) 2022-06-28

Similar Documents

Publication Publication Date Title
CN110555450B (en) Face recognition neural network adjusting method and device
CN107862374B (en) Neural network processing system and processing method based on assembly line
CN109002889B (en) Adaptive iterative convolution neural network model compression method
CN110659725B (en) Neural network model compression and acceleration method, data processing method and device
Xu et al. Reform: Static and dynamic resource-aware dnn reconfiguration framework for mobile device
CN111178258B (en) Image identification method, system, equipment and readable storage medium
US20220004884A1 (en) Convolutional Neural Network Computing Acceleration Method and Apparatus, Device, and Medium
CN113205818B (en) Method, apparatus and storage medium for optimizing a speech recognition procedure
CN113570033B (en) Neural network processing unit, neural network processing method and device
CN112200300A (en) Convolutional neural network operation method and device
US20210248456A1 (en) Optimization methods for quantization of neural network models
KR20210043295A (en) Method and apparatus for quantizing data of neural network
CN114676760B (en) Pre-training model reasoning processing method and device, electronic equipment and storage medium
CN114676761B (en) Pre-training model training processing method and device, electronic equipment and storage medium
Chatterjee et al. Towards optimal quantization of neural networks
KR20200139909A (en) Electronic apparatus and method of performing operations thereof
CN115222028A (en) One-dimensional CNN-LSTM acceleration platform based on FPGA and implementation method
CN113128682B (en) Automatic neural network model adaptation method and device
CN115130672A (en) Method and device for calculating convolution neural network by software and hardware collaborative optimization
CN114841339A (en) Network model quantification method and device, electronic equipment and storage medium
CN114781618A (en) Neural network quantization processing method, device, equipment and readable storage medium
CN114298329A (en) Model training method, device, equipment and storage medium
Siswanto Block sparsity and weight initialization in neural network pruning
CN113313253A (en) Neural network compression method, data processing device and computer equipment
CN114386469A (en) Method and device for quantizing convolutional neural network model and electronic equipment

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant
TR01 Transfer of patent right

Effective date of registration: 20231123

Address after: No. 502, 5th floor, No. 150 Chengfu Road, Haidian District, Beijing, 100084

Patentee after: Beijing innovation Zhiyuan Technology Co.,Ltd.

Address before: B201d-1, F3, building 8, yard 1, Zhongguancun East Road, Haidian District, Beijing 100080

Patentee before: Beijing Zhiyuan Artificial Intelligence Research Institute