WO2021143397A1 - Resource reuse method, apparatus and device based on GPU virtualization - Google Patents

Resource reuse method, apparatus and device based on GPU virtualization

Info

Publication number
WO2021143397A1
WO2021143397A1 (application PCT/CN2020/134523)
Authority
WO
WIPO (PCT)
Prior art keywords
resource
call request
api call
framework layer
calculation
Application number
PCT/CN2020/134523
Other languages
French (fr)
Chinese (zh)
Inventor
Junping Zhao (赵军平)
Original Assignee
Alipay (Hangzhou) Information Technology Co., Ltd. (支付宝(杭州)信息技术有限公司)
Application filed by Alipay (Hangzhou) Information Technology Co., Ltd.
Publication of WO2021143397A1

Classifications

    • G06F 9/5077: Logical partitioning of resources; management or configuration of virtualized resources
    • G06F 9/45558: Hypervisor-specific management and integration aspects
    • G06F 9/5027: Allocation of resources to service a request, where the resource is a machine, e.g. CPUs, servers, terminals
    • G06F 2009/45583: Memory management, e.g. access or allocation

Definitions

  • This application relates to the field of computer technology, and in particular to a resource reuse method, device and equipment based on GPU virtualization.
  • A Graphics Processing Unit (GPU) is a microprocessor that can be used for efficient calculation and processing of images and graphics. More and more artificial intelligence technologies are implemented on GPUs. To allocate GPU resources reasonably, GPU virtualization technology emerged. With GPU virtualization, different artificial intelligence (AI) tasks can share the resources of one or more GPUs to perform calculations. This safe and efficient way of managing GPU resources is being used by more and more users. However, when GPU virtualization technology is used to execute AI tasks at present, the operating efficiency of AI tasks based on GPU virtualization still needs to be improved.
  • In view of this, the embodiments of the present application provide a resource reuse method, apparatus and device based on GPU virtualization, which are used to improve the operating efficiency when executing AI tasks based on GPU virtualization technology.
  • An embodiment of this specification provides a resource reuse method based on GPU virtualization, applied to the client in a GPU virtualization system, including: obtaining a first API call request sent by the AI framework layer for creating a first resource; determining the memory address where pre-stored data matching the first resource is located, the data matching the first resource including setting parameters for the first resource; feeding back to the AI framework layer the memory address where the data matching the first resource is located; obtaining a second API call request sent by the AI framework layer for setting the first resource; feeding back to the AI framework layer a message indicating that the setting is successful; obtaining a third API call request sent by the AI framework layer for performing calculation based on the first resource; generating, based on the third API call request, a first calculation instruction for the first resource; and sending the first calculation instruction and the data matching the first resource to a GPU driver.
  • An embodiment of this specification provides a resource reuse apparatus based on GPU virtualization, applied to the client in a GPU virtualization system, including: a first acquisition module, configured to obtain a first API call request sent by the AI framework layer for creating a first resource; a first determining module, configured to determine the memory address where pre-stored data matching the first resource is located, the data matching the first resource including setting parameters for the first resource; a first feedback module, configured to feed back to the AI framework layer the memory address where the data matching the first resource is located; a second acquisition module, configured to obtain a second API call request sent by the AI framework layer for setting the first resource; a second feedback module, configured to feed back to the AI framework layer a message indicating that the setting is successful; a third acquisition module, configured to obtain a third API call request sent by the AI framework layer for performing calculation based on the first resource; a first calculation instruction generation module, configured to generate, based on the third API call request, a first calculation instruction for the first resource; and a first sending module, configured to send the first calculation instruction and the data matching the first resource to a GPU driver.
  • An embodiment of this specification provides a resource reuse device based on GPU virtualization, including: at least one processor; and a memory communicatively connected to the at least one processor, where the memory stores instructions executable by the at least one processor, and the instructions are executed by the at least one processor so that the at least one processor can: obtain a first API call request sent by the AI framework layer for creating a first resource; determine the memory address where pre-stored data matching the first resource is located, the data matching the first resource including setting parameters for the first resource; feed back to the AI framework layer the memory address where the data matching the first resource is located; obtain a second API call request sent by the AI framework layer for setting the first resource; feed back to the AI framework layer a message indicating that the setting is successful; obtain a third API call request sent by the AI framework layer for performing calculation based on the first resource; generate, based on the third API call request, a first calculation instruction for the first resource; and send the first calculation instruction and the data matching the first resource to the GPU driver.
  • By pre-storing the setting parameters for the first resource in the client of the GPU virtualization system, the client can respond directly to the first API call request and the second API call request sent by the AI framework layer, without sending them to the GPU driver for processing. The AI framework layer therefore does not need to wait for the GPU driver to feed back the processing results of the first and second API call requests, which reduces the time the AI framework layer spends waiting for request responses and improves the execution efficiency of the AI task.
  • FIG. 1 is a schematic diagram of an application scenario of a resource reuse method based on GPU virtualization in an embodiment of this specification
  • FIG. 2 is a schematic flowchart of a resource reuse method based on GPU virtualization provided by an embodiment of this specification
  • FIG. 3 is a schematic diagram of a scenario in which an address pointer of data corresponding to a second resource is written into a queue provided in an embodiment of this specification;
  • FIG. 4 is a schematic structural diagram of a resource multiplexing device based on GPU virtualization corresponding to FIG. 2 provided by an embodiment of this specification;
  • FIG. 5 is a schematic structural diagram of a resource multiplexing device based on GPU virtualization corresponding to FIG. 2 provided by an embodiment of this specification.
  • FIG. 1 is a schematic diagram of an application scenario of a resource reuse method based on GPU virtualization provided by an embodiment of this specification.
  • As shown in FIG. 1, the AI framework layer 1011 and the client 1012 of the GPU virtualization system can be deployed on the user's terminal device 101. The AI framework layer 1011 can be used to build various models (for example, convolutional neural networks (CNN), recurrent neural networks (RNN), long short-term memory networks (LSTM), and generative adversarial networks (GAN)) and to control these models to run on the CPU or GPU. In practical applications, the AI framework layer can be implemented using TensorFlow, PyTorch, or Caffe2.
  • The client 1012 of the GPU virtualization system can interact with the server 102 of the GPU virtualization system and with the GPU 104 that provides resources, so as to realize the discovery, application, access, and built-in optimization of virtual GPU resources. The client 1012 can also record the resources and status information required within one iteration cycle of a model, and reuse the recorded resources and status information to reduce the number of API call requests sent to the GPU driver 1041 in the GPU 104.
  • The GPU 104 may include a GPU driver 1041 and GPU hardware 1042. The GPU driver 1041 responds to the API call requests sent by the client 1012. The GPU hardware 1042 can be implemented using, for example, an NVIDIA P100 GPU, an NVIDIA Tesla V100, or a GeForce GTX 1080.
  • The server 102 of the GPU virtualization system is responsible for GPU service and virtualization management. Specifically, the server 102 can divide and pre-allocate virtual GPU resources according to a configuration strategy, save the mapping relationship between virtual and physical resources, and report GPU resource call requests to the GPU resource scheduler 103. The GPU resource scheduler 103 responds to GPU resource call requests and implements the scheduling and allocation of resources in the GPU 104. In practical applications, the GPU resource scheduler 103 can be implemented using K8S or Kubemaker.
  • The inventor found through research that when AI tasks use algorithms such as deep neural networks (DNN), the operators required to perform each round of iterative operations are usually the same. Therefore, the resources and status information required by each operator within one iteration cycle of the AI task can be obtained and cached, and these resources and status information can be reused during each iteration cycle, thereby greatly reducing API call operations and optimizing the operating efficiency when executing AI tasks based on GPU virtualization technology.
  • FIG. 2 is a schematic flowchart of a resource reuse method based on GPU virtualization provided by an embodiment of this specification. From a program point of view, the execution subject of the process can be the client in the GPU virtualization system. As shown in FIG. 2, the process may include step 202 to step 216.
  • Step 202 Obtain a first API call request sent by the AI framework layer for creating a first resource.
  • the first API call request may be used to create a resource descriptor corresponding to the first resource.
  • the first API call request may include multiple API call requests for synchronous calls.
  • the first API call request may include an API call request for creating an input data descriptor and an output data descriptor.
  • Step 204 Determine the memory address where the pre-stored data matching the first resource is located; the data matching the first resource includes setting parameters for the first resource.
  • In the embodiment of this specification, the terminal device where the client 1012 of the GPU virtualization system is located may pre-store data matching the first resource; the data matching the first resource may be the data generated after the properties of the resource descriptor of the first resource have been set. Therefore, when the client 1012 receives the first API call request sent by the AI framework layer for creating the first resource, it does not need to send the first API call request to the GPU driver to create the data corresponding to the first resource, and only needs to determine the memory address where the pre-stored data matching the first resource is located.
  • In practice, the first N iteration cycles executed after an AI task is started are usually a warm-up phase. The AI task can use the warm-up phase to build computation graphs, allocate resources, and find the best operators. Accordingly, the client 1012 of the GPU virtualization system may obtain and store the data matching the first resource during the warm-up phase, so as to facilitate subsequent use.
  • Step 206 Feed back to the AI framework layer the memory address where the data matching the first resource is located.
  • Specifically, a message indicating that the resource descriptor corresponding to the first resource has been created successfully can be fed back to the AI framework layer; this message can carry the memory address where the determined data matching the first resource is located.
  • Step 208 Obtain a second API call request sent by the AI framework layer for setting the first resource.
  • The second API call request may include multiple synchronously called API call requests, and may be used to request the setting of attributes of the resource descriptor corresponding to the first resource, such as its shape, padding, data type, and data alignment.
  • Step 210 Feed back a message indicating that the setting is successful to the AI framework layer.
  • As mentioned above, the user's terminal device pre-stores the data matching the first resource, that is, the data generated after the attributes of the resource descriptor of the first resource were set. Therefore, when the client 1012 of the GPU virtualization system obtains the second API call request for setting the first resource, it does not need to send the second API call request to the GPU driver for processing, but can directly feed back to the AI framework layer a message indicating that the attributes of the resource descriptor of the first resource were set successfully. This reduces the time the AI framework layer waits for the GPU driver to feed back a response to the second API call request, thereby helping to improve the execution efficiency of the AI task.
  • Step 212 Obtain a third API call request sent by the AI framework layer for calculation based on the first resource.
  • Step 214 Generate a first calculation instruction for the first resource based on the third API call request.
  • Step 216 Send the first calculation instruction and the data matching the first resource to the GPU driver.
  • In the embodiment of this specification, the client 1012 of the GPU virtualization system may send the data matching the first resource (that is, the data generated after the attributes of the resource descriptor of the first resource were set) to the GPU driver together with the first calculation instruction. The GPU driver can then configure the resource and execute the calculation task according to the received first calculation instruction and the data matching the first resource. Therefore, there is no need to send the first API call request and the second API call request of the AI framework layer to the GPU driver; this fast path is sketched in code below.
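The fast path of steps 202 to 216 can be illustrated as follows. This is a minimal sketch under assumed names (Client, cached_params, gpu_driver, resource_key); it is not the patented implementation. A real client would intercept concrete GPU API calls (for example, descriptor creation and setting calls) rather than generic method calls, and `id(data)` merely stands in for an actual memory address.

```python
# Minimal sketch of the client fast path (steps 202 to 216).
# All names here are hypothetical assumptions, not the patented code.

class Client:
    def __init__(self, cached_params, gpu_driver):
        # cached_params: resource key -> setting parameters recorded during
        # the warm-up phase (the "data matching the first resource").
        self.cached_params = cached_params
        self.gpu_driver = gpu_driver

    def on_create(self, resource_key):
        # Steps 202-206: answer locally instead of forwarding to the driver;
        # return the memory address of the pre-stored matching data.
        data = self.cached_params[resource_key]
        return id(data)  # stands in for the memory address

    def on_set(self, resource_key, attrs):
        # Steps 208-210: the attributes were already applied in warm-up,
        # so acknowledge success without involving the GPU driver.
        return "SET_OK"

    def on_compute(self, resource_key, kernel_args):
        # Steps 212-216: build the calculation instruction and send it to
        # the GPU driver together with the cached setting parameters.
        instruction = ("COMPUTE", resource_key, kernel_args)
        self.gpu_driver.submit(instruction, self.cached_params[resource_key])
        return "LAUNCHED"
```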
  • In summary, by pre-storing the setting parameters for the first resource, the client can respond directly to the first API call request and the second API call request sent by the AI framework layer, without sending them to the GPU driver for processing. The AI framework layer therefore does not need to wait for the GPU driver to feed back the processing results of the first and second API call requests, which reduces the time the AI framework layer spends waiting for request responses and improves the efficiency of AI task execution.
  • the data matching the first resource can be obtained in the warm-up phase after the AI task is started, and stored, so as to be used in the subsequent iterative process.
  • the embodiment of this specification provides an implementation manner for obtaining data matching the first resource in the warm-up phase of the AI task.
  • Before step 202, the method may also include the following steps: in the warm-up phase after the AI task is started, obtaining a fourth API call request sent by the AI framework layer for creating a second resource; creating the second resource and storing the data corresponding to the second resource at a memory address; feeding back the memory address to the AI framework layer; obtaining a fifth API call request sent by the AI framework layer for setting the second resource; setting the data at the memory address based on the fifth API call request; feeding back to the AI framework layer a message indicating that the setting is successful; obtaining a sixth API call request sent by the AI framework layer for performing calculation based on the second resource; generating, based on the sixth API call request, a second calculation instruction for the second resource; and sending the second calculation instruction to the GPU driver.
  • In the embodiment of this specification, the AI framework layer generates API call requests for creating resources, setting resources, and performing calculations for each operator in the AI task, so that the GPU virtualization system executes the task. During the warm-up phase, the client generates and stores the resource setting parameters corresponding to each operator.
  • Specifically, after obtaining the fourth API call request sent by the AI framework layer for creating the second resource, the client of the GPU virtualization system may respond to it by creating the data corresponding to the second resource in the memory of the device where the client is located, and determining the storage address of that data as the memory address where the data corresponding to the second resource is located. The client may then send the AI framework layer a message indicating that the creation is successful and carrying the memory address. When the client receives the fifth API call request sent by the AI framework layer for setting the second resource, the client can respond to it by setting the data corresponding to the second resource stored on the client's device, obtaining the setting parameters of the second resource, and feeding back to the AI framework layer a message indicating that the setting is successful. A sketch of this warm-up path follows below.
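The following sketch illustrates the warm-up path under the same hypothetical names as the sketch above; a real client would forward concrete GPU API calls and record the parameters they carry.

```python
# Sketch of the warm-up phase (fourth/fifth/sixth API call requests):
# the client keeps the GPU driver in sync while recording the setting
# parameters locally for reuse in later iterations. Hypothetical names.

class WarmupClient:
    def __init__(self, gpu_driver):
        self.gpu_driver = gpu_driver
        self.cached_params = {}  # resource key -> recorded setting data

    def on_create(self, resource_key):
        # Create the data corresponding to the second resource in the
        # memory of the client's device, and mirror the call to the driver.
        self.cached_params[resource_key] = {}
        self.gpu_driver.create(resource_key)
        return id(self.cached_params[resource_key])  # memory address

    def on_set(self, resource_key, attrs):
        # Record the setting parameters locally and forward the setting.
        self.cached_params[resource_key].update(attrs)
        self.gpu_driver.set(resource_key, attrs)
        return "SET_OK"

    def on_compute(self, resource_key, kernel_args):
        # Generate the second calculation instruction and send it on.
        self.gpu_driver.submit(("COMPUTE", resource_key, kernel_args))
        return "LAUNCHED"
```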
  • In the embodiment of this specification, the memory address where the pre-stored data matching the first resource is located, as determined in step 204, may be the same as the memory address where the data corresponding to the second resource is located, as determined by the client of the GPU virtualization system. Likewise, the fourth API call request may be the same as the first API call request, and the fifth API call request may be the same as the second API call request.
  • In the warm-up phase, the client of the GPU virtualization system may also send the fourth API call request and the fifth API call request to the GPU driver, so that the GPU driver generates the data corresponding to the second resource (that is, the data containing the setting parameters of the second resource), which enables the GPU driver to execute the second calculation instruction.
  • Since the data corresponding to the second resource generated by the GPU driver is usually stored in the GPU cache, the AI framework layer will, after the GPU driver successfully responds to the second calculation instruction, send the client an instruction to delete the data corresponding to the second resource from the GPU cache, so as to avoid occupying the GPU cache.
  • Based on this, after sending the second calculation instruction to the GPU driver, the following steps may further be included: obtaining a seventh API call request sent by the AI framework layer for deleting the data corresponding to the second resource; retaining the data corresponding to the second resource; and feeding back to the AI framework layer a message indicating that the deletion is successful.
  • In the embodiment of this specification, the client of the GPU virtualization system may forward the seventh API call request sent by the AI framework layer to the GPU driver, so that the GPU driver deletes the data corresponding to the second resource from the GPU cache. However, since the device where the client is located also stores data corresponding to the second resource, and in order to reuse that data in subsequent iterations, the client will, after receiving the seventh API call request, retain the data corresponding to the second resource stored on its device, while feeding back to the AI framework layer a message indicating that the deletion is successful, so that the AI framework layer can proceed to process other operators in the AI task.
  • By having the client of the GPU virtualization system retain the data corresponding to the second resource on its device, the data corresponding to the second resource (that is, the data matching the first resource) is pre-stored, which facilitates the subsequent execution of iterative calculations in the AI task. Moreover, the client can feed back the deletion-success message to the AI framework layer based on the data corresponding to the second resource stored on its own device, without the AI framework layer waiting for the GPU driver to return the processing result of the seventh API call request; this reduces the waiting time of the AI framework layer and helps improve the execution efficiency of the AI task, as in the sketch below.
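A sketch of this deletion handling, continuing the hypothetical WarmupClient above: the client forwards the deletion so the GPU cache is freed, but keeps its local copy.

```python
# Sketch of handling the seventh API call request (deletion), as a
# continuation of the hypothetical WarmupClient sketch above.

class RetainingClient(WarmupClient):
    def on_delete(self, resource_key):
        # Forward the deletion so the GPU driver frees its cached copy...
        self.gpu_driver.delete(resource_key)
        # ...but retain the local copy for reuse in later iterations,
        # while still reporting success to the AI framework layer.
        assert resource_key in self.cached_params
        return "DELETE_OK"
```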
  • In the embodiment of this specification, after creating the second resource, the following steps may further be included: determining the calculation flow corresponding to the fourth API call request, and writing the address pointer of the storage address of the data corresponding to the second resource into the queue corresponding to that calculation flow. Since an AI task is usually implemented with one or more stream computing tasks, and stream computing tasks have strict requirements on the execution order of each operator, in the warm-up phase after the AI task is started the execution order of each operator included in one complete iteration of the AI task can be determined first, together with the stream computing task corresponding to each operator. Then, according to the determined execution order and the corresponding stream computing tasks, the address pointer of the storage address of the resource setting parameters corresponding to each operator is written into the queue of the stream computing task corresponding to that operator, so as to facilitate the subsequent iterative process.
  • FIG. 3 is a schematic diagram of a scenario in which an address pointer of a storage address of a resource setting parameter is written into a queue provided in an embodiment of this specification.
  • As shown in FIG. 3, the circular queue 301 contains the operators that need to be executed in one complete iteration; among them, the operator at position 3011 in the circular queue 301 (that is, OP3) is the operator of the stream computing task currently being executed.
  • the first queue 302 contains address pointers of storage addresses of resource setting parameters corresponding to OP1 and OP2.
  • the second queue 303 contains the address pointer of the storage address of the resource setting parameter corresponding to OP3. It can be seen that the stream computing tasks corresponding to OP1 and OP2 are the same, and the stream computing tasks corresponding to OP1 (or OP2) and OP3 are different.
  • When the AI framework layer requests the calculation of the OP4 operator (the operator at position 3012 in the circular queue 301), and the stream computing task (that is, calculation flow) corresponding to the OP4 operator is the same as the stream computing task corresponding to the OP3 operator, the address pointer of the storage address of the resource setting parameters corresponding to the OP4 operator can also be written into the second queue, so as to obtain the updated second queue 304. A sketch of this per-flow recording follows below.
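A sketch of the per-flow recording (hypothetical names; `stream_id` stands for the calculation flow an operator belongs to):

```python
from collections import defaultdict, deque

# Sketch of recording, per calculation flow (stream), the address pointers
# of the resource setting parameters in operator execution order, as in
# FIG. 3. All names are hypothetical.

stream_queues = defaultdict(deque)  # stream id -> queue of address pointers

def record_pointer(stream_id, address_pointer):
    # Called during warm-up in operator execution order, so each queue ends
    # up ordered like the steps of one complete iteration of its flow.
    stream_queues[stream_id].append(address_pointer)
```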
  • In the embodiment of this specification, step 204 of determining the memory address where the pre-stored data matching the first resource is located may specifically include the following steps: determining the calculation flow corresponding to the first API call request; and reading the address pointer stored at the head of the queue corresponding to the calculation flow, where the address pointer points to the memory address where the data matching the first resource is located.
  • Specifically, the operator for which the first API call request creates a resource can be determined, and the calculation flow (that is, the stream computing task) corresponding to that operator can be used as the calculation flow corresponding to the first API call request. In the warm-up phase, the address pointer of the storage address of the resource setting parameters corresponding to each operator was written into the queue of the calculation flow corresponding to that operator; therefore, the address pointer of the memory address where the data matching the first resource is located can be read from the queue of the calculation flow corresponding to the first API call request.
  • In the embodiment of this specification, after the address pointer stored at the head of the queue is read, the address pointer can be deleted from the head of the queue and rewritten to the end of the queue, so that the address pointer then stored at the head of the queue is the one required by the next operation of the calculation flow corresponding to the queue. Specifically, after reading the address pointer stored at the head of the queue, the method may further include: deleting the address pointer from the head of the queue, and writing the address pointer to the end of the queue. This rotation makes it convenient for the client of the GPU virtualization system to reuse the pre-stored resource setting parameters when performing the iterative calculations in the AI task.
  • In the embodiment of this specification, the queue is read on a first-in, first-out basis; that is, the data written into the queue first is read first. The address pointers are stored in the queue in the order of the steps within each iteration. Therefore, once the address pointers have been stored in the queue, reuse only requires reading the first address pointer from the queue when a new round of iterative calculation starts; within the same round, when the address pointers of the storage addresses of the reused data are needed in the execution order of subsequent steps, they only need to be read from the queue in sequence. This simplifies the handling of the mapping relationship between the sequence of steps and the reused data, as sketched below.
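Continuing the sketch above, the first-in, first-out read plus head-to-tail rotation can look like this:

```python
def reuse_pointer(stream_id):
    # Later iterations (step 204): read the address pointer at the head of
    # this calculation flow's queue, then rotate it to the tail so the next
    # operator of the same flow finds its pointer at the head.
    queue = stream_queues[stream_id]
    address_pointer = queue.popleft()  # read and delete from the head
    queue.append(address_pointer)      # rewrite to the end of the queue
    return address_pointer
```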
  • In the embodiment of this specification, before step 202 of obtaining the first API call request sent by the AI framework layer for creating the first resource, the method may further include: judging whether the calculation of the current round of the iterative process is completed, and obtaining a judgment result.
  • Accordingly, step 202 of obtaining the first API call request sent by the AI framework layer for creating the first resource may specifically include: when the judgment result indicates that the calculation of the current round of the iterative process is completed, obtaining the first API call request sent by the AI framework layer for creating the first resource.
  • In the embodiment of this specification, one round of the iterative process in an AI task may refer to one round of processing the neural network model using the forward propagation algorithm and the backpropagation algorithm.
  • The embodiment of this specification provides an implementation for judging whether the calculation of the current round of the iterative process is completed. In the forward propagation process, the first address pointer corresponding to the storage address of the output result of the first layer of the model in the AI task can be recorded; in the backward gradient propagation process, the second address pointer corresponding to the storage address of the current input data can be monitored, and it is determined whether the second address pointer is the same as the first address pointer. For example, if the first layer of the model in the AI task is a convolution calculation, the input used when calculating the gradient of the last convolution in backpropagation should be the same as the output of the first layer of the model. Therefore, the second address pointer corresponding to the storage address of the input data currently used when computing the gradient can be compared with the first address pointer corresponding to the storage address of the output result of the first layer of the model. If they are the same, the calculation of the current iteration round is complete, and the operators of the next iteration round can be calculated; that is, step 202 can be executed, ensuring correct operation of the iteration loop in the AI task, as sketched below.
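A sketch of this boundary check (hypothetical names; the pointers stand for the storage addresses described above):

```python
# Sketch of iteration-boundary detection: record the address pointer of
# the first layer's forward output; during backpropagation, the round is
# complete when the current input's address pointer equals it.

class IterationBoundaryDetector:
    def __init__(self):
        self.first_layer_output_ptr = None

    def on_forward_first_layer(self, output_ptr):
        # First address pointer: storage address of the first layer output.
        self.first_layer_output_ptr = output_ptr

    def iteration_finished(self, current_input_ptr):
        # Second address pointer: storage address of the current input in
        # the backward gradient propagation; equality marks the boundary.
        return current_input_ptr == self.first_layer_output_ptr
```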
  • The resource reuse method based on GPU virtualization provided by the embodiment of this specification reuses the pre-stored resource setting data, so that the client of the GPU virtualization system can respond first to the API call requests sent by the AI framework layer for creating and setting resources, instead of forwarding them to the GPU driver. This applies, for example, to running a CNN model (for example, AlexNet).
  • In addition, the resource reuse method based on GPU virtualization provided by the embodiment of this specification can be deployed flexibly, supports bare-metal, container, or virtual machine operation, is cloud-friendly, and has good applicability.
  • FIG. 4 is a schematic structural diagram of a resource multiplexing device based on GPU virtualization corresponding to FIG. 2 provided by an embodiment of this specification, and the device can be applied to a client in a GPU virtualization system.
  • the device may include the following modules.
  • the first obtaining module 402 is configured to obtain the first API call request sent by the AI framework layer for creating the first resource.
  • the first determining module 404 is configured to determine a memory address where the pre-stored data matching the first resource is located; the data matching the first resource includes setting parameters for the first resource.
  • the first feedback module 406 is configured to feed back to the AI framework layer the memory address where the data matching the first resource is located.
  • the second obtaining module 408 is configured to obtain a second API call request sent by the AI framework layer for setting the first resource.
  • the second feedback module 410 is configured to feed back a message indicating that the setting is successful to the AI framework layer.
  • the third obtaining module 412 is configured to obtain a third API call request sent by the AI framework layer for calculation based on the first resource.
  • the first calculation instruction generating module 414 is configured to generate a first calculation instruction for the first resource based on the third API call request.
  • the first sending module 416 is configured to send the first calculation instruction and the data matching the first resource to the GPU driver.
  • The resource reuse apparatus based on GPU virtualization pre-stores the setting parameters for the first resource, so that the client in the GPU virtualization system can respond directly to the first API call request and the second API call request sent by the AI framework layer, without sending them to the GPU driver for processing. Therefore, the AI framework layer does not need to wait for the GPU driver to feed back the processing results of the first and second API call requests, which reduces the time the AI framework layer spends waiting for request responses and improves the execution efficiency of the AI task.
  • the device may further include: a fourth acquisition module, configured to acquire a fourth API call request sent by the AI framework layer for creating a second resource in a warm-up phase after the AI task is started.
  • the creation module is configured to create the second resource and store the data corresponding to the second resource in the memory address.
  • the third feedback module is configured to feed back the memory address to the AI framework layer.
  • the fifth acquisition module is configured to acquire the fifth API call request sent by the AI framework layer for setting the second resource.
  • the setting module is configured to set the data in the memory address based on the fifth API call request.
  • the fourth feedback module is used to feed back a message indicating that the setting is successful to the AI framework layer.
  • the sixth acquisition module is configured to acquire the sixth API call request sent by the AI framework layer for calculation based on the second resource.
  • the second calculation instruction generation module is configured to generate a second calculation instruction for the second resource based on the sixth API call request.
  • the second sending module is configured to send the second calculation instruction to the GPU driver.
  • the device may further include: a seventh obtaining module, configured to obtain a seventh API call request sent by the AI framework layer for deleting data corresponding to the second resource.
  • the reservation module is used to reserve the data corresponding to the second resource.
  • the fifth feedback module is used to feed back a message indicating successful deletion to the AI framework layer.
  • Optionally, the first determining module 404 may be specifically configured to determine the calculation flow corresponding to the first API call request, and to read the address pointer stored at the head of the queue corresponding to the calculation flow, where the address pointer points to the memory address where the data matching the first resource is located.
  • the device may further include: a deletion module, configured to delete the address pointer from the head of the queue.
  • the first writing module is used to write the address pointer to the end of the queue.
  • the device may further include: a second determining module, configured to determine the calculation flow corresponding to the fourth API call request.
  • the second writing module is configured to write the address pointer of the storage address of the data corresponding to the second resource into the queue corresponding to the calculation stream.
  • the device may further include: a judgment module for judging whether the calculation of the current round of iterative process is completed, and obtaining the judgment result.
  • the first obtaining module is specifically configured to obtain the first API call request sent by the AI framework layer for creating the first resource when the judgment result indicates that the calculation of the current iteration process is completed.
  • Optionally, the judgment module may be specifically configured to: in the forward propagation process, record the first address pointer corresponding to the storage address of the output result of the first layer of the model in the AI task; in the backward gradient propagation process, monitor the second address pointer corresponding to the storage address of the current input data; and determine whether the second address pointer is the same as the first address pointer.
  • the embodiment of this specification also provides a device corresponding to the above method.
  • FIG. 5 is a schematic structural diagram of a resource multiplexing device based on GPU virtualization corresponding to FIG. 2 provided by an embodiment of this specification.
  • The device 500 may include: at least one processor 510; and a memory 530 communicatively connected to the at least one processor, where the memory 530 stores instructions 520 executable by the at least one processor 510, and the instructions are executed by the at least one processor 510 so that the at least one processor 510 can: obtain the first API call request sent by the AI framework layer for creating the first resource; determine the memory address where the pre-stored data matching the first resource is located, the data matching the first resource including the setting parameters for the first resource; and feed back to the AI framework layer the memory address where the data matching the first resource is located.
  • The resource reuse device based on GPU virtualization pre-stores the setting parameters for the first resource, so that the client in the GPU virtualization system carried on the device can respond first to the first API call request and the second API call request sent by the AI framework layer, without sending them to the GPU driver for processing. Therefore, the AI framework layer does not need to wait for the GPU driver to feed back the processing results of the first and second API call requests, which reduces the time the AI framework layer spends waiting for request responses and improves the execution efficiency of the AI task.
  • An improvement of a technology can be clearly distinguished as a hardware improvement (for example, an improvement of a circuit structure such as a diode, a transistor, or a switch) or a software improvement (an improvement of a method flow). However, with the development of technology, the improvement of many of today's method flows can be regarded as a direct improvement of the hardware circuit structure. Designers almost always obtain the corresponding hardware circuit structure by programming the improved method flow into a hardware circuit. Therefore, it cannot be said that an improvement of a method flow cannot be realized by a hardware entity module.
  • For example, a programmable logic device (PLD) (for example, a field programmable gate array (FPGA)) is an integrated circuit whose logic functions are determined by the user's programming of the device. The source code before compilation is written in a specific programming language called a hardware description language (HDL). There is not just one HDL but many, such as ABEL (Advanced Boolean Expression Language), AHDL (Altera Hardware Description Language), HDCal, JHDL, Lava, Lola, MyHDL, PALASM, and RHDL; at present, VHDL (Very-High-Speed Integrated Circuit Hardware Description Language) and Verilog are the most commonly used.
  • The controller can be implemented in any suitable manner. For example, the controller can take the form of a microprocessor or processor together with a computer-readable medium storing computer-readable program code (such as software or firmware) executable by the (micro)processor, logic gates, switches, an application-specific integrated circuit (ASIC), a programmable logic controller, or an embedded microcontroller. Examples of controllers include, but are not limited to, the following microcontrollers: ARC 625D, Atmel AT91SAM, Microchip PIC18F26K20, and Silicon Labs C8051F320; a memory controller can also be implemented as part of the memory's control logic. In addition to implementing the controller purely as computer-readable program code, it is entirely possible to logically program the method steps so that the controller realizes the same functions in the form of logic gates, switches, application-specific integrated circuits, programmable logic controllers, embedded microcontrollers, and the like. Therefore, such a controller can be regarded as a hardware component, and the devices included in it for realizing various functions can also be regarded as structures within the hardware component. Or even, the devices for realizing various functions can be regarded both as software modules for implementing the method and as structures within the hardware component.
  • a typical implementation device is a computer.
  • the computer may be, for example, a personal computer, a laptop computer, a cell phone, a camera phone, a smart phone, a personal digital assistant, a media player, a navigation device, an email device, a game console, a tablet computer, a wearable device, or Any combination of these devices.
  • The embodiments of the present invention can be provided as a method, a system, or a computer program product. Therefore, the present invention may take the form of an entirely hardware embodiment, an entirely software embodiment, or an embodiment combining software and hardware. Moreover, the present invention may take the form of a computer program product implemented on one or more computer-usable storage media (including but not limited to disk storage, CD-ROM, and optical storage) containing computer-usable program code.
  • These computer program instructions can also be stored in a computer-readable memory that can direct a computer or other programmable data processing device to work in a specific manner, so that the instructions stored in the computer-readable memory produce an article of manufacture including an instruction device, which implements the functions specified in one or more processes in the flowchart and/or one or more blocks in the block diagram.
  • These computer program instructions can also be loaded onto a computer or other programmable data processing device, so that a series of operating steps are executed on the computer or other programmable device to produce computer-implemented processing; the instructions executed on the computer or other programmable device thereby provide steps for implementing the functions specified in one or more processes in the flowchart and/or one or more blocks in the block diagram.
  • In a typical configuration, the computing device includes one or more processors (CPUs), input/output interfaces, network interfaces, and memory. The memory may include non-persistent memory, random access memory (RAM), and/or non-volatile memory in computer-readable media, such as read-only memory (ROM) or flash memory (flash RAM).
  • Computer-readable media include permanent and non-permanent, removable and non-removable media, and information storage can be realized by any method or technology. Information can be computer-readable instructions, data structures, program modules, or other data. Examples of computer storage media include, but are not limited to, phase-change memory (PRAM), static random access memory (SRAM), dynamic random access memory (DRAM), other types of random access memory (RAM), read-only memory (ROM), electrically erasable programmable read-only memory (EEPROM), flash memory or other memory technologies, CD-ROM, digital versatile discs (DVD) or other optical storage, magnetic cassettes, magnetic tape storage or other magnetic storage devices, or any other non-transmission media that can be used to store information accessible by a computing device. According to the definition herein, computer-readable media do not include transitory media, such as modulated data signals and carrier waves.
  • This application may be described in the general context of computer-executable instructions executed by a computer, such as a program module.
  • program modules include routines, programs, objects, components, data structures, etc. that perform specific tasks or implement specific abstract data types.
  • This application can also be practiced in distributed computing environments. In these distributed computing environments, tasks are performed by remote processing devices connected through a communication network. In a distributed computing environment, program modules can be located in local and remote computer storage media including storage devices.

Abstract

Disclosed in embodiments of the present specification are a resource reuse method, apparatus and device based on GPU virtualization. The scheme comprises: pre-storing a setting parameter of a first resource in a client, so that the client can locally process a first API call request sent by an AI framework layer for creating the first resource and a second API call request for setting the first resource, without forwarding them to a GPU driver; and enabling the client, upon obtaining a third API call request sent by the AI framework layer for performing calculation on the basis of the first resource, to send a generated first calculation instruction for the first resource and the pre-stored setting parameter of the first resource to the GPU driver, thereby executing an AI task by utilizing GPU virtualization technology.

Description

Resource reuse method, apparatus and device based on GPU virtualization

Technical field

This application relates to the field of computer technology, and in particular to a resource reuse method, apparatus and device based on GPU virtualization.

Background

A Graphics Processing Unit (GPU) is a microprocessor that can be used for efficient calculation and processing of images and graphics. More and more artificial intelligence technologies are implemented on GPUs. To allocate GPU resources reasonably, GPU virtualization technology emerged. With GPU virtualization, different artificial intelligence (AI) tasks can share the resources of one or more GPUs to perform calculations. This safe and efficient way of managing GPU resources is being used by more and more users. However, when GPU virtualization technology is used to execute AI tasks at present, the operating efficiency of AI tasks based on GPU virtualization still needs to be improved.
Summary of the invention

In view of this, the embodiments of the present application provide a resource reuse method, apparatus and device based on GPU virtualization, which are used to improve the operating efficiency when executing AI tasks based on GPU virtualization technology.

To solve the above technical problems, the embodiments of this specification are implemented as follows.

An embodiment of this specification provides a resource reuse method based on GPU virtualization, applied to the client in a GPU virtualization system, including: obtaining a first API call request sent by the AI framework layer for creating a first resource; determining the memory address where pre-stored data matching the first resource is located, the data matching the first resource including setting parameters for the first resource; feeding back to the AI framework layer the memory address where the data matching the first resource is located; obtaining a second API call request sent by the AI framework layer for setting the first resource; feeding back to the AI framework layer a message indicating that the setting is successful; obtaining a third API call request sent by the AI framework layer for performing calculation based on the first resource; generating, based on the third API call request, a first calculation instruction for the first resource; and sending the first calculation instruction and the data matching the first resource to a GPU driver.

An embodiment of this specification provides a resource reuse apparatus based on GPU virtualization, applied to the client in a GPU virtualization system, including: a first acquisition module, configured to obtain a first API call request sent by the AI framework layer for creating a first resource; a first determining module, configured to determine the memory address where pre-stored data matching the first resource is located, the data matching the first resource including setting parameters for the first resource; a first feedback module, configured to feed back to the AI framework layer the memory address where the data matching the first resource is located; a second acquisition module, configured to obtain a second API call request sent by the AI framework layer for setting the first resource; a second feedback module, configured to feed back to the AI framework layer a message indicating that the setting is successful; a third acquisition module, configured to obtain a third API call request sent by the AI framework layer for performing calculation based on the first resource; a first calculation instruction generation module, configured to generate, based on the third API call request, a first calculation instruction for the first resource; and a first sending module, configured to send the first calculation instruction and the data matching the first resource to a GPU driver.

An embodiment of this specification provides a resource reuse device based on GPU virtualization, including: at least one processor; and a memory communicatively connected to the at least one processor, where the memory stores instructions executable by the at least one processor, and the instructions are executed by the at least one processor so that the at least one processor can: obtain a first API call request sent by the AI framework layer for creating a first resource; determine the memory address where pre-stored data matching the first resource is located, the data matching the first resource including setting parameters for the first resource; feed back to the AI framework layer the memory address where the data matching the first resource is located; obtain a second API call request sent by the AI framework layer for setting the first resource; feed back to the AI framework layer a message indicating that the setting is successful; obtain a third API call request sent by the AI framework layer for performing calculation based on the first resource; generate, based on the third API call request, a first calculation instruction for the first resource; and send the first calculation instruction and the data matching the first resource to the GPU driver.

The above technical solution adopted in the embodiments of this specification can achieve the following beneficial effects: by pre-storing the setting parameters for the first resource in the client of the GPU virtualization system, the client can respond directly to the first API call request and the second API call request sent by the AI framework layer, without sending them to the GPU driver for processing; the AI framework layer therefore does not need to wait for the GPU driver to feed back the processing results of the first and second API call requests, which reduces the time the AI framework layer spends waiting for request responses and improves the execution efficiency of the AI task.
Description of the drawings

The drawings described here are used to provide a further understanding of this application and constitute a part of this application. The exemplary embodiments of this application and their descriptions are used to explain this application and do not constitute an improper limitation of this application. In the drawings:

FIG. 1 is a schematic diagram of an application scenario of a resource reuse method based on GPU virtualization in an embodiment of this specification;

FIG. 2 is a schematic flowchart of a resource reuse method based on GPU virtualization provided by an embodiment of this specification;

FIG. 3 is a schematic diagram of a scenario in which an address pointer of data corresponding to a second resource is written into a queue, provided by an embodiment of this specification;

FIG. 4 is a schematic structural diagram of a resource reuse apparatus based on GPU virtualization corresponding to FIG. 2, provided by an embodiment of this specification;

FIG. 5 is a schematic structural diagram of a resource reuse device based on GPU virtualization corresponding to FIG. 2, provided by an embodiment of this specification.
Detailed Description
To make the objectives, technical solutions, and advantages of this application clearer, the technical solutions of this application will be described clearly and completely below with reference to specific embodiments of this application and the corresponding drawings. Obviously, the described embodiments are only some, rather than all, of the embodiments of this application. Based on the embodiments of this application, all other embodiments obtained by a person of ordinary skill in the art without creative effort shall fall within the protection scope of this application.
FIG. 1 is a schematic diagram of an application scenario of a resource reuse method based on GPU virtualization provided by an embodiment of this specification. As shown in FIG. 1, the AI framework layer 1011 and the client 1012 of the GPU virtualization system can run on a user's terminal device 101. The AI framework layer 1011 can be used to build various models (for example, convolutional neural networks (CNN), recurrent neural networks (RNN), long short-term memory networks (LSTM), and generative adversarial networks (GAN)) and to control the execution of these models on a CPU or GPU. In practice, the AI framework layer can be implemented with TensorFlow, PyTorch, Caffe2, or the like.
The client 1012 of the GPU virtualization system can interact with the server 102 of the GPU virtualization system and with the GPU 104 that provides the resources, thereby implementing discovery, application, access, and built-in optimization of virtual GPU resources. The client 1012 can also record the resources and state information required within one iteration cycle of a model, and reuse the recorded resources and state information to reduce the number of API call requests sent to the GPU driver 1041 in the GPU 104.
The GPU 104 can include a GPU driver 1041 and GPU hardware 1042. The GPU driver 1041 can respond to API call requests sent by the client 1012, while the GPU hardware 1042 can be implemented with, for example, an NVIDIA P100 GPU, an NVIDIA Tesla V100, or a GeForce GTX 1080.
The server 102 of the GPU virtualization system can be responsible for GPU services and virtualization management. Specifically, the server 102 can partition and pre-allocate virtual GPU resources according to a configuration policy, maintain the mapping between virtual and physical resources, and report GPU resource call requests to the GPU resource scheduler 103. The GPU resource scheduler 103 can respond to the GPU resource call requests and implement scheduling and allocation of the resources in the GPU 104. In practice, the GPU resource scheduler 103 can be implemented with K8S or Kubemaker.
Research has found that when virtualized GPU resources are used to execute an AI task, the client in the GPU virtualization system needs to request the resources required by each operator from the GPU driver through application programming interfaces (APIs). For example, for the convolution operator in an AI task, as many as 14 APIs are called. Table 1 shows the API information corresponding to the convolution operator.
[Table 1: the APIs called by the convolution operator. In the original publication this table is reproduced as an image (PCTCN2020134523-appb-000001); its row-by-row contents are not recoverable here.]
From Table 1, it can be seen that of the 14 API calls the client in the GPU virtualization system must perform for the convolution operator, 13 are synchronous calls and 1 is asynchronous. For each synchronous API call, the client forwards the operation data corresponding to the API to the GPU driver, waits for the GPU driver to finish processing, and only then reports the successful call back to the AI framework layer. In other words, the AI framework layer can issue the next API call only after receiving the result of the previous call from the client, which adds considerable latency to the AI task. Moreover, an AI task typically runs tens of thousands to hundreds of thousands of iterations, and every iteration invokes these synchronous APIs again, which severely degrades the computational efficiency of the AI task.
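For ease of understanding, this cost pattern can be modeled with the minimal C++ sketch below. The sketch is illustrative only and is not part of the claimed method: the 50-microsecond round trip and the loop bounds are assumptions, and the real client-driver transport is not specified in this document; only the figure of 13 synchronous calls per convolution operator comes from Table 1.

```cpp
#include <chrono>
#include <iostream>
#include <thread>

// Simulate one synchronous API call: the client forwards the request to the
// GPU driver and blocks until the driver replies (latency value is assumed).
void sync_api_call(std::chrono::microseconds driver_round_trip) {
    std::this_thread::sleep_for(driver_round_trip);  // framework stalls here
}

int main() {
    using namespace std::chrono;
    const int sync_calls_per_conv = 13;  // from Table 1
    const microseconds rtt{50};          // assumed client->driver round trip
    const int simulated_iterations = 3;  // real tasks run 10^4..10^5 rounds

    auto t0 = steady_clock::now();
    for (int i = 0; i < simulated_iterations; ++i)
        for (int c = 0; c < sync_calls_per_conv; ++c)
            sync_api_call(rtt);
    auto t1 = steady_clock::now();
    std::cout << "time blocked on synchronous calls: "
              << duration_cast<microseconds>(t1 - t0).count() << " us\n";
}
```

Because the blocked time scales with the number of iterations, even a small per-call round trip accumulates into a substantial share of total training time.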
In view of the above problem, the inventor found through research that when an AI task uses an algorithm such as a deep neural network (DNN), the operators executed in each round of iteration are usually the same. Therefore, by obtaining and caching the resources and state information required by each operator within one iteration cycle of the AI task, and reusing these resources and state information in subsequent iterations, the number of API call operations can be greatly reduced, thereby optimizing the runtime efficiency of AI tasks executed on GPU virtualization technology.
FIG. 2 is a schematic flowchart of a resource reuse method based on GPU virtualization provided by an embodiment of this specification. From a program perspective, the execution subject of the process can be the client in the GPU virtualization system. As shown in FIG. 2, the process can include steps 202 to 216.
Step 202: Obtain a first API call request sent by the AI framework layer for creating a first resource.
In this embodiment of the specification, the first API call request can be used to create a resource descriptor corresponding to the first resource. In practice, the first API call request can include multiple synchronously invoked API call requests; for example, it can include an API call request for creating an input data descriptor, an API call request for creating an output data descriptor, an API call request for creating a weight data descriptor, and an API call request for creating a convolution descriptor.
Step 204: Determine the memory address of pre-stored data matching the first resource, where the data matching the first resource contains setting parameters for the first resource.
In this embodiment of the specification, the terminal device on which the client 1012 of the GPU virtualization system runs can pre-store data matching the first resource; this data can be the data generated after the attributes of the resource descriptor of the first resource were set. Therefore, when the client 1012 receives the first API call request sent by the AI framework layer for creating the first resource, it does not need to forward the request to the GPU driver to create the data corresponding to the first resource; it only needs to determine the memory address of the pre-stored data matching the first resource.
In practice, the first N iterations executed after an AI task starts usually constitute a warm-up phase, during which the AI task builds the computation graph, allocates resources, and searches for the best operators. In this embodiment of the specification, the client 1012 of the GPU virtualization system can use this warm-up phase to obtain and store the data matching the first resource for later use.
Step 206: Feed back to the AI framework layer the memory address of the data matching the first resource.
In this embodiment of the specification, after the memory address of the pre-stored data matching the first resource is determined, a message indicating that the resource descriptor corresponding to the first resource was created successfully can be fed back to the AI framework layer; this message can carry the determined memory address of the data matching the first resource.
Step 208: Obtain a second API call request sent by the AI framework layer for setting the first resource.
In this embodiment of the specification, the second API call request can include multiple synchronously invoked API call requests; the second API call request can be used to request setting of attributes of the resource descriptor corresponding to the first resource, such as shape, padding, data type, and data alignment.
Step 210: Feed back to the AI framework layer a message indicating that the setting succeeded.
In this embodiment of the specification, the user's terminal device pre-stores the data matching the first resource, that is, the data generated after the attributes of the resource descriptor of the first resource were set. Therefore, when the client 1012 of the GPU virtualization system receives the second API call request for setting the first resource, it does not need to forward the request to the GPU driver for processing; instead, it can directly feed back to the AI framework layer a message indicating that the attributes of the resource descriptor of the first resource were set successfully. This reduces the time the AI framework layer would otherwise spend waiting for the GPU driver's response to the second API call request, which helps improve the execution efficiency of the AI task.
Step 212: Obtain a third API call request sent by the AI framework layer for performing a calculation based on the first resource.
Step 214: Generate, based on the third API call request, a first calculation instruction for the first resource.
Step 216: Send the first calculation instruction and the data matching the first resource to the GPU driver.
In this embodiment of the specification, the client 1012 of the GPU virtualization system can send the data matching the first resource (that is, the data generated after the attributes of the resource descriptor of the first resource were set) to the GPU driver. The GPU driver can then configure the resource and execute the computation task according to the received first calculation instruction for the first resource and the matching data. As a result, neither the first API call request nor the second API call request of the AI framework layer needs to be forwarded to the GPU driver.
In this embodiment of the specification, by pre-storing the setting parameters of the first resource at the client of the GPU virtualization system, the client can respond directly to the first and second API call requests sent by the AI framework layer, without forwarding them to the GPU driver for processing. The AI framework layer therefore does not have to wait for the GPU driver's processing results for the first and second API call requests, which reduces the time the AI framework layer spends waiting for responses and improves the execution efficiency of the AI task.
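For ease of understanding, the client-side fast path of steps 202 to 216 can be sketched in C++ as follows. The sketch is illustrative only: all type, function, and command names are invented, and the actual client-driver transport and data layout are not specified in this document.

```cpp
#include <cstdint>
#include <deque>
#include <iostream>

// Cached per-operator setup data recorded during warm-up. The layout is a
// placeholder; the document only says it holds the descriptor settings.
struct ResourceData { uintptr_t addr; };

std::deque<ResourceData*> stream_queue;  // queue of the relevant compute stream

// Stand-in for the unspecified client->driver transport.
void send_to_driver(uint32_t cmd, const ResourceData* data) {
    std::cout << "driver <- cmd " << cmd << ", data @ 0x"
              << std::hex << data->addr << std::dec << "\n";
}

// Steps 202-206: answer the create request from the cache and feed back the
// memory address of the matching data; nothing is forwarded to the driver.
ResourceData* on_create_request() { return stream_queue.front(); }

// Steps 208-210: acknowledge the set request locally, since the cached data
// already contains the setting parameters.
bool on_set_request(ResourceData*) { return true; }

// Steps 212-216: only the compute request reaches the GPU driver, together
// with the cached setup data.
void on_compute_request(ResourceData* d) {
    const uint32_t kComputeCmd = 1;  // hypothetical command id
    send_to_driver(kComputeCmd, d);
}

int main() {
    static ResourceData conv{0x1000};  // pretend warm-up stored this entry
    stream_queue.push_back(&conv);
    ResourceData* d = on_create_request();
    if (on_set_request(d)) on_compute_request(d);
}
```

Under these assumptions, thirteen of the fourteen calls of Table 1 would be answered locally and only the compute step crosses into the driver.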
Based on the method of FIG. 2, the embodiments of this specification further provide some specific implementations of the method, which are described below.
In this embodiment of the specification, the data matching the first resource can be obtained and stored during the warm-up phase after the AI task starts, for use in subsequent iterations. This embodiment provides an implementation for obtaining the data matching the first resource during the warm-up phase of the AI task. Specifically, before step 202, the method can further include the following steps: during the warm-up phase after the AI task starts, obtaining a fourth API call request sent by the AI framework layer for creating a second resource; creating the second resource and storing the data corresponding to the second resource at the memory address; feeding back the memory address to the AI framework layer; obtaining a fifth API call request sent by the AI framework layer for setting the second resource; setting the data at the memory address based on the fifth API call request; feeding back to the AI framework layer a message indicating that the setting succeeded; obtaining a sixth API call request sent by the AI framework layer for performing a calculation based on the second resource; generating, based on the sixth API call request, a second calculation instruction for the second resource; and sending the second calculation instruction to the GPU driver.
In this embodiment of the specification, during the warm-up phase after the AI task starts, the AI framework layer generates API call requests for creating resources, setting resources, and performing calculations for each operator of the AI task, so that the client of the GPU virtualization system generates and stores the setting parameters of the resources corresponding to each operator.
Specifically, after obtaining the fourth API call request sent by the AI framework layer for creating the second resource, the client of the GPU virtualization system can, in response to the fourth API call request, create the data corresponding to the second resource in the memory of the device where the client is located, and use the storage address of that data as the memory address of the data corresponding to the second resource. The client can then send the AI framework layer a message indicating successful creation that carries this memory address.
When the client of the GPU virtualization system receives the fifth API call request sent by the AI framework layer for setting the second resource, the client can, in response to the fifth API call request, set the data corresponding to the second resource stored on the device where the client is located, obtain the setting parameters of the second resource, and feed back to the AI framework layer a message indicating that the setting succeeded.
In this embodiment of the specification, when the operator corresponding to the second resource and the operator corresponding to the first resource are the same operator in different iterations, the memory address of the pre-stored data matching the first resource determined in step 204 can be the same as the memory address of the data corresponding to the second resource determined by the client of the GPU virtualization system. Correspondingly, the fourth API call request can be the same as the first API call request, and the fifth API call request can be the same as the second API call request.
In this embodiment of the specification, the client of the GPU virtualization system can also forward the fourth and fifth API call requests to the GPU driver, so that the GPU driver generates the data corresponding to the second resource (that is, data containing the setting parameters of the second resource), which enables the GPU driver to execute the second calculation instruction.
Since the data corresponding to the second resource generated by the GPU driver is usually stored in the GPU cache, to avoid occupying the GPU cache, the AI framework sends an instruction to the client after the GPU driver has successfully responded to the second calculation instruction; this instruction is used to delete the data corresponding to the second resource from the GPU cache.
Therefore, after obtaining the sixth API call request sent by the AI framework layer for performing a calculation based on the second resource, the method can further include the following steps: obtaining a seventh API call request sent by the AI framework layer for deleting the data corresponding to the second resource; retaining the data corresponding to the second resource; and feeding back to the AI framework layer a message indicating that the deletion succeeded.
In this embodiment of the specification, the client of the GPU virtualization system can forward the seventh API call request, sent by the AI framework layer for deleting the data corresponding to the second resource, to the GPU driver, so that the GPU driver deletes the data corresponding to the second resource from the GPU cache. However, the device where the client is located also stores the data corresponding to the second resource, and this copy must remain reusable in subsequent iterations. Therefore, after receiving the seventh API call request, the client retains the data corresponding to the second resource stored on its device and feeds back a message indicating successful deletion to the AI framework layer, so that the AI framework layer can proceed to process the other operators of the AI task.
In this embodiment of the specification, by having the client of the GPU virtualization system retain the data corresponding to the second resource on its device, the data corresponding to the second resource (that is, the data matching the first resource) is pre-stored for use in the subsequent iterative computations of the AI task. Moreover, because the client can feed back a deletion-success message to the AI framework layer based on its own record of the stored data, the AI framework layer does not have to wait for the GPU driver's processing result for the seventh API call request, which reduces waiting time at the AI framework layer and helps improve the execution efficiency of the AI task.
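For ease of understanding, the warm-up-phase bookkeeping described above (the fourth, fifth, and seventh API call requests) can be sketched in C++ as follows. The names and the record layout are invented for the sketch; forwarding to the GPU driver is indicated only in comments, since the transport is not specified in this document.

```cpp
#include <cstdint>
#include <unordered_map>
#include <vector>

// Client-side record of one resource created during warm-up.
// Field names are assumptions made for this sketch.
struct ResourceRecord {
    std::vector<uint8_t> settings;  // parameters applied by the set request
    bool retained = false;
};

std::unordered_map<uint64_t, ResourceRecord> client_store;
uint64_t next_handle = 1;

// Fourth API call request: create the local copy (the same request is also
// forwarded to the GPU driver, not shown) and feed its address back.
uint64_t on_warmup_create() {
    uint64_t h = next_handle++;
    client_store.emplace(h, ResourceRecord{});
    return h;
}

// Fifth API call request: record the setting parameters in the local copy.
void on_warmup_set(uint64_t h, const std::vector<uint8_t>& params) {
    client_store[h].settings = params;
}

// Seventh API call request: the delete is forwarded so the driver frees its
// GPU-cache copy (not shown), but the client-side copy is retained for reuse
// and a deletion-success message is returned at once.
bool on_warmup_delete(uint64_t h) {
    client_store[h].retained = true;
    return true;  // fed back to the AI framework layer as "deleted"
}

int main() {
    uint64_t h = on_warmup_create();
    on_warmup_set(h, {1, 2, 3});  // stand-ins for shape, padding, dtype...
    on_warmup_delete(h);
}
```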
In this embodiment of the specification, to facilitate the use of the data corresponding to the second resource when the AI task is subsequently executed, the following steps can further be included after retaining the data corresponding to the second resource: determining the compute stream corresponding to the fourth API call request; and writing the address pointer of the storage address of the data corresponding to the second resource into the queue corresponding to that compute stream.
In this embodiment of the specification, one or more stream computing tasks are usually used to implement an AI task, and a stream computing task imposes strict requirements on the execution order of the operators. Therefore, in the warm-up phase after the AI task starts, the execution order of the operators contained in one complete iteration of the AI task can first be determined, along with the stream computing task corresponding to each operator. Then, according to the determined execution order and the corresponding stream computing tasks, the address pointers of the storage addresses of the resource setting parameters corresponding to the operators are written into the queues of the stream computing tasks corresponding to those operators, for use in subsequent iterations.
For ease of understanding, the process of writing the address pointer of the storage address of a resource setting parameter into a queue is illustrated with an example. Assume the operators to be executed in one complete iteration of an AI task are, in order: OP1, OP2, OP3, and OP4. FIG. 3 is a schematic diagram of a scenario, provided in an embodiment of this specification, in which the address pointer of the storage address of a resource setting parameter is written into a queue. As shown in FIG. 3(a), the ring queue 301 contains the operators to be executed in one complete iteration, where the operator at position 3011 in the ring queue 301 (that is, OP3) is the operator currently being executed by the stream computing task. The first queue 302 contains the address pointers of the storage addresses of the resource setting parameters corresponding to OP1 and OP2. The second queue 303 contains the address pointer of the storage address of the resource setting parameter corresponding to OP3. It follows that OP1 and OP2 correspond to the same stream computing task, while OP1 (or OP2) and OP3 correspond to different stream computing tasks.
After the stream computing task completes the calculation of operator OP3, the AI framework layer can request the calculation of operator OP4. As shown in FIG. 3(b), the operator at position 3012 in the ring queue 301 (that is, OP4) is the operator currently being executed by the stream computing task. If it is determined that the stream computing task (that is, the compute stream) corresponding to operator OP4 is the same as the stream computing task corresponding to operator OP3, the address pointer of the storage address of the resource setting parameter corresponding to operator OP4 can also be written into the second queue, yielding the updated second queue 304.
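For ease of understanding, the per-stream bookkeeping in this example can be sketched in C++ as follows. The stream identifiers and data addresses are invented, and a standard FIFO container stands in for the queue implementation, which this document does not specify.

```cpp
#include <cstdint>
#include <deque>
#include <map>

// One FIFO queue of data-address pointers per compute stream.
std::map<int, std::deque<uintptr_t>> queues_by_stream;

void record(int stream_id, uintptr_t data_addr) {
    queues_by_stream[stream_id].push_back(data_addr);
}

int main() {
    record(0, 0xA1);  // OP1 -> first queue (stream 0)
    record(0, 0xA2);  // OP2 -> same stream, appended behind OP1
    record(1, 0xA3);  // OP3 -> a different stream, so the second queue
    record(1, 0xA4);  // OP4 shares OP3's stream: second queue becomes {OP3, OP4}
}
```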
In this embodiment of the specification, step 204 of determining the memory address of the pre-stored data matching the first resource can specifically include the following steps: determining the compute stream corresponding to the first API call request; and reading the address pointer stored at the head of the queue corresponding to that compute stream, where the address pointer points to the memory address of the data matching the first resource.
In this embodiment of the specification, the operator corresponding to the resource to be created by the first API call request can be determined, and the compute stream (that is, the stream computing task) corresponding to that operator can be used as the compute stream corresponding to the first API call request.
In this embodiment of the specification, because during the warm-up phase of the AI task the address pointers of the storage addresses of the setting parameters of the resources corresponding to the operators were written, in the operators' execution order, into the queues of the compute streams corresponding to those operators, the address pointer of the memory address of the data matching the first resource can be read from the queue of the compute stream corresponding to the first API call request.
In practice, each time an address pointer is read from a queue, it can be deleted from the queue and rewritten at the tail of the queue, so that the address pointer stored at the head of the queue is always the one needed for the next run of the compute stream corresponding to that queue.
Therefore, after reading the address pointer stored at the head of the queue corresponding to the compute stream, the method can further include: deleting the address pointer from the head of the queue; and writing the address pointer to the tail of the queue.
In this embodiment of the specification, after the address pointer stored at the head of the queue is read, it is deleted from the head and written to the tail of the queue, which allows the client of the GPU virtualization system to reuse the pre-stored resource setting parameters when executing the iterative computations of the AI task.
Because the address pointers are stored in a queue, they are read on a first-in-first-out basis: the data written into the queue first is read first, in order. Moreover, the address pointers are stored in the queue in the order of the steps of each iteration. Therefore, once the address pointers are stored in a queue, reuse is straightforward: at the start of a new round of iterative calculation, the first address pointer is read from the queue, and within the same iteration, whenever the address pointer of the storage address of the reused data is needed in step-execution order, the pointers stored in the queue are simply read in sequence. This simplifies the handling of the mapping between the step order and the reused data.
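For ease of understanding, the read-then-rotate rule can be sketched in C++ as follows; the queue contents reuse the invented addresses of the OP3/OP4 example above.

```cpp
#include <cassert>
#include <cstdint>
#include <deque>

// Read the pointer at the head, delete it, and rewrite it at the tail, so
// consecutive reads walk the iteration's steps in order and the queue is
// back in its starting state when the round completes.
uintptr_t read_and_rotate(std::deque<uintptr_t>& q) {
    uintptr_t p = q.front();
    q.pop_front();
    q.push_back(p);
    return p;
}

int main() {
    std::deque<uintptr_t> q{0xA3, 0xA4};  // a stream's queue after warm-up
    assert(read_and_rotate(q) == 0xA3);   // iteration k: OP3, then OP4
    assert(read_and_rotate(q) == 0xA4);
    assert(read_and_rotate(q) == 0xA3);   // iteration k+1 starts at OP3 again
}
```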
However, after the address pointers are stored in a queue, it must be ensured that the first address pointer stored in the queue is the one read at the beginning of each round of iterative calculation. This can be achieved as follows.
Specifically, in practice, it is necessary to confirm that the current round of the iterative process has finished before executing, at the start of the next round, the computation tasks corresponding to the operators of the next round, so as to avoid errors.
Therefore, before step 202 of obtaining the first API call request sent by the AI framework layer for creating the first resource, the method can further include: judging whether the current round of the iterative process has finished its calculation, to obtain a judgment result.
Step 202 of obtaining the first API call request sent by the AI framework layer for creating the first resource can then specifically include: when the judgment result indicates that the calculation of the current round of the iterative process has finished, obtaining the first API call request sent by the AI framework layer for creating the first resource.
In this embodiment of the specification, when the model used by the AI task is a neural network model, one round of the iterative process of the AI task can refer to one pass of processing the neural network model with the forward propagation algorithm and the backpropagation algorithm, respectively.
On this basis, this embodiment of the specification provides an implementation for judging whether the current round of the iterative process has finished its calculation.
Specifically, the first address pointer corresponding to the storage address of the output result of the first layer of the model in the AI task can be recorded; during backward gradient propagation, the second address pointer corresponding to the storage address of the current input data is monitored; and it is judged whether the second address pointer is the same as the first address pointer.
For example, assume the first layer of the model in the AI task is a convolution. When the backpropagation algorithm is used, the input used when computing the gradient of the last convolution should be the same as the output of the first layer of the model. Therefore, the second address pointer corresponding to the storage address of the input data currently used for the gradient computation can be compared with the first address pointer corresponding to the storage address of the output result of the first layer of the model. If they are the same, the calculation of the current round of the iterative process has finished, and the operators of the next round can be computed, that is, step 202 can be executed, thereby ensuring the correct operation of the iterative loop of the AI task.
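For ease of understanding, the pointer-comparison check can be sketched in C++ as follows; the addresses are invented, and the hook names are placeholders for wherever the client records the first layer's output and observes backward-propagation inputs.

```cpp
#include <cstdint>
#include <iostream>

uintptr_t first_layer_output_addr = 0;  // recorded during the forward pass

void on_first_layer_output(uintptr_t addr) { first_layer_output_addr = addr; }

// Called while monitoring backward gradient propagation: the round is
// complete when the current input address equals the recorded output
// address of the model's first layer.
bool iteration_finished(uintptr_t backward_input_addr) {
    return backward_input_addr == first_layer_output_addr;
}

int main() {
    on_first_layer_output(0x2000);
    std::cout << iteration_finished(0x3000) << "\n";  // 0: still mid-iteration
    std::cout << iteration_finished(0x2000) << "\n";  // 1: iteration complete
}
```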
The embodiments of this specification provide a resource reuse method based on GPU virtualization. When an AI task is executed, pre-stored resource setting data can be reused so that the client of the GPU virtualization system can handle the API call requests sent by the AI framework layer up front, eliminating about 80% of the synchronous API calls the AI framework layer would otherwise need to issue to the GPU driver. This significantly reduces the performance loss of GPU virtualization technology, as well as the resource and time consumption of AI task execution. Experiments show that when a CNN model (for example, AlexNet) is run in TensorFlow using the GPU-virtualization-based resource reuse method provided by the embodiments of this application, running efficiency improves by 11% compared with the prior art. In addition, the method can be deployed flexibly, supports running on bare metal, in containers, or in virtual machines, is cloud-friendly, and has good applicability.
Based on the same idea, an embodiment of this specification further provides an apparatus corresponding to the above method. FIG. 4 is a schematic structural diagram of a resource reuse apparatus based on GPU virtualization corresponding to FIG. 2, provided by an embodiment of this specification; the apparatus can be applied to a client in a GPU virtualization system. As shown in FIG. 4, the apparatus can include the following modules.
A first obtaining module 402, configured to obtain a first API call request sent by the AI framework layer for creating a first resource.
A first determining module 404, configured to determine the memory address of pre-stored data matching the first resource, where the data matching the first resource contains setting parameters for the first resource.
A first feedback module 406, configured to feed back to the AI framework layer the memory address of the data matching the first resource.
A second obtaining module 408, configured to obtain a second API call request sent by the AI framework layer for setting the first resource.
A second feedback module 410, configured to feed back to the AI framework layer a message indicating that the setting succeeded.
A third obtaining module 412, configured to obtain a third API call request sent by the AI framework layer for performing a calculation based on the first resource.
A first calculation instruction generation module 414, configured to generate, based on the third API call request, a first calculation instruction for the first resource.
A first sending module 416, configured to send the first calculation instruction and the data matching the first resource to the GPU driver.
In this embodiment of the specification, by pre-storing the setting parameters of the first resource, the resource reuse apparatus based on GPU virtualization enables the client of the GPU virtualization system to respond directly to the first and second API call requests sent by the AI framework layer, without forwarding them to the GPU driver for processing. The AI framework layer therefore does not have to wait for the GPU driver's processing results for these two requests, which reduces the time the AI framework layer spends waiting for responses and improves the execution efficiency of AI tasks.
In this embodiment of the specification, the apparatus can further include: a fourth obtaining module, configured to obtain, during the warm-up phase after the AI task starts, a fourth API call request sent by the AI framework layer for creating a second resource; a creation module, configured to create the second resource and store the data corresponding to the second resource at the memory address; a third feedback module, configured to feed back the memory address to the AI framework layer; a fifth obtaining module, configured to obtain a fifth API call request sent by the AI framework layer for setting the second resource; a setting module, configured to set the data at the memory address based on the fifth API call request; a fourth feedback module, configured to feed back to the AI framework layer a message indicating that the setting succeeded; a sixth obtaining module, configured to obtain a sixth API call request sent by the AI framework layer for performing a calculation based on the second resource; a second calculation instruction generation module, configured to generate, based on the sixth API call request, a second calculation instruction for the second resource; and a second sending module, configured to send the second calculation instruction to the GPU driver.
In this embodiment of the specification, the apparatus can further include: a seventh obtaining module, configured to obtain a seventh API call request sent by the AI framework layer for deleting the data corresponding to the second resource; a retention module, configured to retain the data corresponding to the second resource; and a fifth feedback module, configured to feed back to the AI framework layer a message indicating that the deletion succeeded.
In this embodiment of the specification, the first determining module 404 can be specifically configured to: determine the compute stream corresponding to the first API call request; and read the address pointer stored at the head of the queue corresponding to that compute stream, where the address pointer points to the memory address of the data matching the first resource.
In this embodiment of the specification, the apparatus can further include: a deletion module, configured to delete the address pointer from the head of the queue; and a first writing module, configured to write the address pointer to the tail of the queue.
In this embodiment of the specification, the apparatus can further include: a second determining module, configured to determine the compute stream corresponding to the fourth API call request; and a second writing module, configured to write the address pointer of the storage address of the data corresponding to the second resource into the queue corresponding to that compute stream.
In this embodiment of the specification, the apparatus can further include: a judgment module, configured to judge whether the current round of the iterative process has finished its calculation, to obtain a judgment result. The first obtaining module is specifically configured to obtain the first API call request sent by the AI framework layer for creating the first resource when the judgment result indicates that the calculation of the current round of the iterative process has finished.
In this embodiment of the specification, the judgment module can be specifically configured to: record the first address pointer corresponding to the storage address of the output result of the first layer of the model in the AI task; during backward gradient propagation, monitor the second address pointer corresponding to the storage address of the current input data; and judge whether the second address pointer is the same as the first address pointer.
Based on the same idea, an embodiment of this specification further provides a device corresponding to the above method.
FIG. 5 is a schematic structural diagram of a resource reuse device based on GPU virtualization corresponding to FIG. 2, provided by an embodiment of this specification. As shown in FIG. 5, the device 500 can include: at least one processor 510; and a memory 530 communicatively connected to the at least one processor, where the memory 530 stores instructions 520 executable by the at least one processor 510, the instructions being executed by the at least one processor 510 to enable the at least one processor 510 to: obtain a first API call request sent by the AI framework layer for creating a first resource; determine the memory address of pre-stored data matching the first resource, where the data matching the first resource contains setting parameters for the first resource; feed back to the AI framework layer the memory address of the data matching the first resource; obtain a second API call request sent by the AI framework layer for setting the first resource; feed back to the AI framework layer a message indicating that the setting succeeded; obtain a third API call request sent by the AI framework layer for performing a calculation based on the first resource; generate, based on the third API call request, a first calculation instruction for the first resource; and send the first calculation instruction and the data matching the first resource to the GPU driver.
In this embodiment of the specification, by pre-storing the setting parameters of the first resource, the resource reuse device based on GPU virtualization enables the client of the GPU virtualization system carried on the device to respond directly to the first and second API call requests sent by the AI framework layer, without forwarding them to the GPU driver for processing. The AI framework layer therefore does not have to wait for the GPU driver's processing results for these two requests, which reduces the time the AI framework layer spends waiting for responses and improves the execution efficiency of AI tasks.
In the 1990s, whether a technical improvement was an improvement in hardware (for example, an improvement in a circuit structure such as a diode, a transistor, or a switch) or an improvement in software (an improvement in a method flow) could be clearly distinguished. With the development of technology, however, improvements to many of today's method flows can be regarded as direct improvements to hardware circuit structures. Designers almost always obtain the corresponding hardware circuit structure by programming an improved method flow into a hardware circuit. Therefore, it cannot be said that an improvement of a method flow cannot be implemented with a hardware entity module. For example, a programmable logic device (PLD), such as a field programmable gate array (FPGA), is an integrated circuit whose logic functions are determined by the user's programming of the device. Designers program a digital system onto a single PLD by themselves, without asking a chip manufacturer to design and fabricate a dedicated integrated circuit chip. Moreover, instead of manually making integrated circuit chips, this programming is nowadays mostly implemented with "logic compiler" software, which is similar to the software compilers used in program development; the original code to be compiled must be written in a particular programming language called a hardware description language (HDL). There is not just one HDL but many, such as ABEL (Advanced Boolean Expression Language), AHDL (Altera Hardware Description Language), Confluence, CUPL (Cornell University Programming Language), HDCal, JHDL (Java Hardware Description Language), Lava, Lola, MyHDL, PALASM, and RHDL (Ruby Hardware Description Language); VHDL (Very-High-Speed Integrated Circuit Hardware Description Language) and Verilog are currently the most commonly used. A person skilled in the art will also appreciate that a hardware circuit implementing a logical method flow can be readily obtained simply by logically programming the method flow in one of the above hardware description languages and programming it into an integrated circuit.
The controller can be implemented in any suitable manner. For example, the controller can take the form of a microprocessor or processor together with a computer-readable medium storing computer-readable program code (such as software or firmware) executable by the (micro)processor, logic gates, switches, an application-specific integrated circuit (ASIC), a programmable logic controller, or an embedded microcontroller. Examples of controllers include, but are not limited to, the following microcontrollers: ARC 625D, Atmel AT91SAM, Microchip PIC18F26K20, and Silicone Labs C8051F320. A memory controller can also be implemented as part of the control logic of a memory. A person skilled in the art also knows that, besides implementing the controller purely as computer-readable program code, the method steps can be logically programmed so that the controller achieves the same functions in the form of logic gates, switches, application-specific integrated circuits, programmable logic controllers, embedded microcontrollers, and the like. Such a controller can therefore be regarded as a hardware component, and the means included in it for implementing various functions can also be regarded as structures within the hardware component. Or even, the means for implementing various functions can be regarded both as software modules implementing the method and as structures within the hardware component.
The systems, apparatuses, modules, or units described in the above embodiments can be specifically implemented by a computer chip or an entity, or by a product having a certain function. A typical implementation device is a computer. Specifically, the computer can be, for example, a personal computer, a laptop computer, a cellular phone, a camera phone, a smartphone, a personal digital assistant, a media player, a navigation device, an email device, a game console, a tablet computer, a wearable device, or a combination of any of these devices.
For convenience of description, the above apparatus is described in terms of functions divided into various units. Of course, when this application is implemented, the functions of the units can be implemented in one or more pieces of software and/or hardware.
A person skilled in the art should understand that the embodiments of the present invention can be provided as a method, a system, or a computer program product. Therefore, the present invention can take the form of an entirely hardware embodiment, an entirely software embodiment, or an embodiment combining software and hardware aspects. Moreover, the present invention can take the form of a computer program product implemented on one or more computer-usable storage media (including but not limited to disk storage, CD-ROM, and optical storage) containing computer-usable program code.
The present invention is described with reference to flowcharts and/or block diagrams of methods, devices (systems), and computer program products according to embodiments of the present invention. It should be understood that each process and/or block in the flowcharts and/or block diagrams, and combinations of processes and/or blocks in the flowcharts and/or block diagrams, can be implemented by computer program instructions. These computer program instructions can be provided to a processor of a general-purpose computer, a special-purpose computer, an embedded processor, or another programmable data processing device to produce a machine, so that the instructions executed by the processor of the computer or other programmable data processing device produce an apparatus for implementing the functions specified in one or more processes of the flowcharts and/or one or more blocks of the block diagrams.
These computer program instructions can also be stored in a computer-readable memory capable of directing a computer or another programmable data processing device to work in a particular manner, so that the instructions stored in the computer-readable memory produce an article of manufacture including an instruction apparatus that implements the functions specified in one or more processes of the flowcharts and/or one or more blocks of the block diagrams.
These computer program instructions can also be loaded onto a computer or another programmable data processing device, so that a series of operational steps are performed on the computer or other programmable device to produce computer-implemented processing, whereby the instructions executed on the computer or other programmable device provide steps for implementing the functions specified in one or more processes of the flowcharts and/or one or more blocks of the block diagrams.
In a typical configuration, a computing device includes one or more processors (CPUs), input/output interfaces, network interfaces, and memory.
内存可能包括计算机可读介质中的非永久性存储器,随机存取存储器(RAM)和/或非易失性内存等形式,如只读存储器(ROM)或闪存(flash RAM)。内存是计算机可读介质的示例。The memory may include non-permanent memory in a computer-readable medium, random access memory (RAM) and/or non-volatile memory, such as read-only memory (ROM) or flash memory (flash RAM). Memory is an example of computer readable media.
计算机可读介质包括永久性和非永久性、可移动和非可移动媒体可以由任何方法或技术来实现信息存储。信息可以是计算机可读指令、数据结构、程序的模块或其他数 据。计算机的存储介质的例子包括,但不限于相变内存(PRAM)、静态随机存取存储器(SRAM)、动态随机存取存储器(DRAM)、其他类型的随机存取存储器(RAM)、只读存储器(ROM)、电可擦除可编程只读存储器(EEPROM)、快闪记忆体或其他内存技术、只读光盘只读存储器(CD-ROM)、数字多功能光盘(DVD)或其他光学存储、磁盒式磁带,磁带式磁盘存储或其他磁性存储设备或任何其他非传输介质,可用于存储可以被计算设备访问的信息。按照本文中的界定,计算机可读介质不包括暂存电脑可读媒体(transitory media),如调制的数据信号和载波。Computer-readable media include permanent and non-permanent, removable and non-removable media, and information storage can be realized by any method or technology. Information can be computer-readable instructions, data structures, program modules, or other data. Examples of computer storage media include, but are not limited to, phase change memory (PRAM), static random access memory (SRAM), dynamic random access memory (DRAM), other types of random access memory (RAM), read-only memory (ROM), electrically erasable programmable read-only memory (EEPROM), flash memory or other memory technology, CD-ROM, digital versatile disc (DVD) or other optical storage, Magnetic cartridges, magnetic tape storage or other magnetic storage devices or any other non-transmission media can be used to store information that can be accessed by computing devices. According to the definition in this article, computer-readable media does not include transitory media, such as modulated data signals and carrier waves.
还需要说明的是,术语“包括”、“包含”或者其任何其他变体意在涵盖非排他性的包含,从而使得包括一系列要素的过程、方法、商品或者设备不仅包括那些要素,而且还包括没有明确列出的其他要素,或者是还包括为这种过程、方法、商品或者设备所固有的要素。在没有更多限制的情况下,由语句“包括一个……”限定的要素,并不排除在包括所述要素的过程、方法、商品或者设备中还存在另外的相同要素。It should also be noted that the terms "include", "include" or any other variants thereof are intended to cover non-exclusive inclusion, so that a process, method, commodity or equipment including a series of elements not only includes those elements, but also includes Other elements that are not explicitly listed, or also include elements inherent to such processes, methods, commodities, or equipment. If there are no more restrictions, the element defined by the sentence "including a..." does not exclude the existence of other identical elements in the process, method, commodity, or equipment that includes the element.
本申请可以在由计算机执行的计算机可执行指令的一般上下文中描述,例如程序模块。一般地,程序模块包括执行特定任务或实现特定抽象数据类型的例程、程序、对象、组件、数据结构等等。也可以在分布式计算环境中实践本申请,在这些分布式计算环境中,由通过通信网络而被连接的远程处理设备来执行任务。在分布式计算环境中,程序模块可以位于包括存储设备在内的本地和远程计算机存储介质中。This application may be described in the general context of computer-executable instructions executed by a computer, such as a program module. Generally, program modules include routines, programs, objects, components, data structures, etc. that perform specific tasks or implement specific abstract data types. This application can also be practiced in distributed computing environments. In these distributed computing environments, tasks are performed by remote processing devices connected through a communication network. In a distributed computing environment, program modules can be located in local and remote computer storage media including storage devices.
本说明书中的各个实施例均采用递进的方式描述,各个实施例之间相同相似的部分互相参见即可,每个实施例重点说明的都是与其他实施例的不同之处。尤其,对于系统实施例而言,由于其基本相似于方法实施例,所以描述的比较简单,相关之处参见方法实施例的部分说明即可。The various embodiments in this specification are described in a progressive manner, and the same or similar parts between the various embodiments can be referred to each other, and each embodiment focuses on the difference from other embodiments. In particular, as for the system embodiment, since it is basically similar to the method embodiment, the description is relatively simple, and for related parts, please refer to the part of the description of the method embodiment.
以上所述仅为本申请的实施例而已,并不用于限制本申请。对于本领域技术人员来说,本申请可以有各种更改和变化。凡在本申请的精神和原理之内所作的任何修改、等同替换、改进等,均应包含在本申请的权利要求范围之内。The above descriptions are only examples of the present application, and are not used to limit the present application. For those skilled in the art, this application can have various modifications and changes. Any modification, equivalent replacement, improvement, etc. made within the spirit and principle of this application shall be included in the scope of the claims of this application.

Claims (17)

1. A resource reuse method based on GPU virtualization, applied to a client in a GPU virtualization system, comprising:
    obtaining a first API call request, sent by an AI framework layer, for creating a first resource;
    determining a memory address where pre-stored data matching the first resource is located, the data matching the first resource containing setting parameters for the first resource;
    feeding back, to the AI framework layer, the memory address where the data matching the first resource is located;
    obtaining a second API call request, sent by the AI framework layer, for setting the first resource;
    feeding back, to the AI framework layer, a message indicating that the setting succeeded;
    obtaining a third API call request, sent by the AI framework layer, for performing a calculation based on the first resource;
    generating, based on the third API call request, a first calculation instruction for the first resource; and
    sending the first calculation instruction and the data matching the first resource to a GPU driver.
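(Illustrative sketch only, not part of the claims: the reuse-phase flow of claim 1 could be approximated in Python roughly as below. All names here, such as ReuseClient and GpuDriver, are hypothetical shorthand for the claimed client that intercepts the create, set, and compute API calls from the AI framework layer.)

    from collections import deque

    class GpuDriver:
        # Hypothetical stand-in for the real GPU driver interface.
        def submit(self, instruction, data_addr):
            print("driver received", instruction, "with data at", data_addr)

    class ReuseClient:
        def __init__(self):
            # One deque of address pointers per calculation stream; each
            # pointer references pre-stored data that already contains the
            # resource's setting parameters.
            self.stream_queues = {}

        def on_create(self, stream_id):
            # First API call request: create nothing; feed back the memory
            # address of matching pre-stored data instead.
            return self.stream_queues[stream_id][0]

        def on_set(self, addr, params):
            # Second API call request: the cached data already holds the
            # setting parameters, so only a success message is fed back.
            return "SET_OK"

        def on_compute(self, driver, addr, args):
            # Third API call request: generate the calculation instruction
            # and send it, together with the matched data, to the GPU driver.
            driver.submit({"op": "compute", "args": args}, addr)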
2. The method according to claim 1, wherein before obtaining the first API call request sent by the AI framework layer for creating the first resource, the method further comprises:
    in a warm-up phase after an AI task is started, obtaining a fourth API call request, sent by the AI framework layer, for creating a second resource;
    creating the second resource, and storing data corresponding to the second resource at the memory address;
    feeding back the memory address to the AI framework layer;
    obtaining a fifth API call request, sent by the AI framework layer, for setting the second resource;
    setting the data at the memory address based on the fifth API call request;
    feeding back, to the AI framework layer, a message indicating that the setting succeeded;
    obtaining a sixth API call request, sent by the AI framework layer, for performing a calculation based on the second resource;
    generating, based on the sixth API call request, a second calculation instruction for the second resource; and
    sending the second calculation instruction to the GPU driver.
3. The method according to claim 2, wherein after obtaining the sixth API call request sent by the AI framework layer for performing a calculation based on the second resource, the method further comprises:
    obtaining a seventh API call request, sent by the AI framework layer, for deleting the data corresponding to the second resource;
    retaining the data corresponding to the second resource; and
    feeding back, to the AI framework layer, a message indicating that the deletion succeeded.
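(Continuing the same hypothetical sketch, and reusing its deque import: the warm-up path of claims 2 and 3 performs the creation once for real, caches the address of the resulting data, and later answers the deletion request with success while retaining the data. Here real_api stands in for the underlying vendor API and is an assumption, not something named in the source.)

    def warmup_create(client, stream_id, real_api, params):
        # Fourth and fifth API call requests, executed for real during the
        # warm-up phase after the AI task starts; the data address is cached
        # on the queue of the corresponding calculation stream.
        addr = real_api.create_resource()
        real_api.set_resource(addr, params)
        client.stream_queues.setdefault(stream_id, deque()).append(addr)
        return addr

    def on_delete(client, addr):
        # Seventh API call request: feed back a success message, but retain
        # the data at addr so that later create requests can reuse it.
        return "DELETE_OK"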
4. The method according to claim 1, wherein determining the memory address where the pre-stored data matching the first resource is located specifically comprises:
    determining a calculation stream corresponding to the first API call request; and
    reading, from a queue corresponding to the calculation stream, an address pointer stored at the head of the queue, the address pointer pointing to the memory address where the data matching the first resource is located.
5. The method according to claim 4, wherein after reading the address pointer stored at the head of the queue from the queue corresponding to the calculation stream, the method further comprises:
    deleting the address pointer from the head of the queue; and
    writing the address pointer to the tail of the queue.
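(A minimal sketch of the head-to-tail rotation in claims 4 and 5, assuming Python's collections.deque as the per-stream queue:)

    def next_reusable_address(queue):
        # Read the address pointer stored at the head of the queue, delete
        # it from the head, and write it back at the tail, so the cached
        # resources of one calculation stream are reused in round-robin
        # order across iterations.
        addr = queue[0]
        queue.popleft()
        queue.append(addr)
        return addr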
6. The method according to claim 3, wherein after retaining the data corresponding to the second resource, the method further comprises:
    determining a calculation stream corresponding to the fourth API call request; and
    writing an address pointer of the storage address of the data corresponding to the second resource into a queue corresponding to the calculation stream.
7. The method according to claim 1, wherein before obtaining the first API call request sent by the AI framework layer for creating the first resource, the method further comprises:
    judging whether the calculation of a current round of an iterative process is completed, to obtain a judgment result;
    and wherein obtaining the first API call request sent by the AI framework layer for creating the first resource specifically comprises:
    when the judgment result indicates that the calculation of the current round of the iterative process is completed, obtaining the first API call request sent by the AI framework layer for creating the first resource.
8. The method according to claim 7, wherein judging whether the calculation of the current round of the iterative process is completed specifically comprises:
    recording a first address pointer corresponding to a storage address of an output result of a first layer of a model in the AI task;
    during backward gradient propagation, monitoring a second address pointer corresponding to a storage address of current input data; and
    judging whether the second address pointer is the same as the first address pointer.
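(Hypothetical sketch of claims 7 and 8: backward gradient propagation consumes layer outputs in reverse order, so the first layer's forward output is the last tensor read; seeing its address again as the current backward input therefore marks the end of the round.)

    class IterationBoundaryDetector:
        def __init__(self):
            self.first_layer_output_ptr = None

        def record_first_layer_output(self, ptr):
            # First address pointer: where the model's first layer stored
            # its forward output result in this round.
            self.first_layer_output_ptr = ptr

        def round_completed(self, backward_input_ptr):
            # Second address pointer, monitored during backward propagation;
            # equality with the recorded pointer means the round is complete.
            return backward_input_ptr == self.first_layer_output_ptr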
9. A resource reuse apparatus based on GPU virtualization, applied to a client in a GPU virtualization system, comprising:
    a first obtaining module, configured to obtain a first API call request, sent by an AI framework layer, for creating a first resource;
    a first determining module, configured to determine a memory address where pre-stored data matching the first resource is located, the data matching the first resource containing setting parameters for the first resource;
    a first feedback module, configured to feed back, to the AI framework layer, the memory address where the data matching the first resource is located;
    a second obtaining module, configured to obtain a second API call request, sent by the AI framework layer, for setting the first resource;
    a second feedback module, configured to feed back, to the AI framework layer, a message indicating that the setting succeeded;
    a third obtaining module, configured to obtain a third API call request, sent by the AI framework layer, for performing a calculation based on the first resource;
    a first calculation instruction generation module, configured to generate, based on the third API call request, a first calculation instruction for the first resource; and
    a first sending module, configured to send the first calculation instruction and the data matching the first resource to a GPU driver.
10. The apparatus according to claim 9, further comprising:
    a fourth obtaining module, configured to obtain, in a warm-up phase after an AI task is started, a fourth API call request, sent by the AI framework layer, for creating a second resource;
    a creating module, configured to create the second resource and store data corresponding to the second resource at the memory address;
    a third feedback module, configured to feed back the memory address to the AI framework layer;
    a fifth obtaining module, configured to obtain a fifth API call request, sent by the AI framework layer, for setting the second resource;
    a setting module, configured to set the data at the memory address based on the fifth API call request;
    a fourth feedback module, configured to feed back, to the AI framework layer, a message indicating that the setting succeeded;
    a sixth obtaining module, configured to obtain a sixth API call request, sent by the AI framework layer, for performing a calculation based on the second resource;
    a second calculation instruction generation module, configured to generate, based on the sixth API call request, a second calculation instruction for the second resource; and
    a second sending module, configured to send the second calculation instruction to the GPU driver.
11. The apparatus according to claim 10, further comprising:
    a seventh obtaining module, configured to obtain a seventh API call request, sent by the AI framework layer, for deleting the data corresponding to the second resource;
    a retaining module, configured to retain the data corresponding to the second resource; and
    a fifth feedback module, configured to feed back, to the AI framework layer, a message indicating that the deletion succeeded.
12. The apparatus according to claim 9, wherein the first determining module is specifically configured to:
    determine a calculation stream corresponding to the first API call request; and
    read, from a queue corresponding to the calculation stream, an address pointer stored at the head of the queue, the address pointer pointing to the memory address where the data matching the first resource is located.
13. The apparatus according to claim 12, further comprising:
    a deleting module, configured to delete the address pointer from the head of the queue; and
    a first writing module, configured to write the address pointer to the tail of the queue.
14. The apparatus according to claim 11, further comprising:
    a second determining module, configured to determine a calculation stream corresponding to the fourth API call request; and
    a second writing module, configured to write an address pointer of the storage address of the data corresponding to the second resource into a queue corresponding to the calculation stream.
15. The apparatus according to claim 9, further comprising:
    a judging module, configured to judge whether the calculation of a current round of an iterative process is completed, to obtain a judgment result;
    wherein the first obtaining module is specifically configured to obtain the first API call request, sent by the AI framework layer, for creating the first resource when the judgment result indicates that the calculation of the current round of the iterative process is completed.
16. The apparatus according to claim 15, wherein the judging module is specifically configured to:
    record a first address pointer corresponding to a storage address of an output result of a first layer of a model in the AI task;
    during backward gradient propagation, monitor a second address pointer corresponding to a storage address of current input data; and
    judge whether the second address pointer is the same as the first address pointer.
17. A resource reuse device based on GPU virtualization, comprising:
    at least one processor; and
    a memory communicatively connected to the at least one processor, wherein
    the memory stores instructions executable by the at least one processor, the instructions being executed by the at least one processor to enable the at least one processor to:
    obtain a first API call request, sent by an AI framework layer, for creating a first resource;
    determine a memory address where pre-stored data matching the first resource is located, the data matching the first resource containing setting parameters for the first resource;
    feed back, to the AI framework layer, the memory address where the data matching the first resource is located;
    obtain a second API call request, sent by the AI framework layer, for setting the first resource;
    feed back, to the AI framework layer, a message indicating that the setting succeeded;
    obtain a third API call request, sent by the AI framework layer, for performing a calculation based on the first resource;
    generate, based on the third API call request, a first calculation instruction for the first resource; and
    send the first calculation instruction and the data matching the first resource to a GPU driver.
PCT/CN2020/134523 2020-01-14 2020-12-08 Resource reuse method, apparatus and device based on gpu virtualization WO2021143397A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202010037822.2 2020-01-14
CN202010037822.2A CN110851285B (en) 2020-01-14 2020-01-14 Resource multiplexing method, device and equipment based on GPU virtualization

Publications (1)

Publication Number Publication Date
WO2021143397A1 (en)

Family

ID=69610693

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2020/134523 WO2021143397A1 (en) 2020-01-14 2020-12-08 Resource reuse method, apparatus and device based on gpu virtualization

Country Status (2)

Country Link
CN (1) CN110851285B (en)
WO (1) WO2021143397A1 (en)

Families Citing this family (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110851285B (en) * 2020-01-14 2020-04-24 Alipay (Hangzhou) Information Technology Co., Ltd. Resource multiplexing method, device and equipment based on GPU virtualization
CN111427702A (en) * 2020-03-12 2020-07-17 Beijing Mininglamp Software System Co., Ltd. Artificial intelligence AI system and data processing method
EP4195045A4 (en) * 2020-08-14 2023-09-27 Huawei Technologies Co., Ltd. Data interaction method between main cpu and npu, and computing device
CN112035220A (en) * 2020-09-30 2020-12-04 Beijing Baidu Netcom Science and Technology Co., Ltd. Processing method, device and equipment for operation task of development machine and storage medium

Patent Citations (4)

Publication number Priority date Publication date Assignee Title
CN104216783A (en) * 2014-08-20 2014-12-17 Shanghai Jiao Tong University Method for automatically managing and controlling virtual GPU (Graphics Processing Unit) resource in cloud gaming
CN105242957A (en) * 2015-09-28 2016-01-13 Guangzhou Yunzhuo Information Technology Co., Ltd. Method and system for cloud computing system to allocate GPU resources to virtual machine
CN110058926A (en) * 2018-01-18 2019-07-26 EMC IP Holding Company LLC Method, device and computer-readable medium for processing GPU tasks
CN110851285A (en) * 2020-01-14 2020-02-28 Alipay (Hangzhou) Information Technology Co., Ltd. Resource multiplexing method, device and equipment based on GPU virtualization

Family Cites Families (1)

Publication number Priority date Publication date Assignee Title
CN108108248A (en) * 2017-12-28 2018-06-01 Zhengzhou Yunhai Information Technology Co., Ltd. CPU+GPU cluster management method, device and equipment for realizing target detection

Also Published As

Publication number Publication date
CN110851285A (en) 2020-02-28
CN110851285B (en) 2020-04-24

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application (Ref document number: 20913375; Country of ref document: EP; Kind code of ref document: A1)
NENP Non-entry into the national phase (Ref country code: DE)
122 Ep: pct application non-entry in european phase (Ref document number: 20913375; Country of ref document: EP; Kind code of ref document: A1)