WO2021159820A1 - Data transmission and task processing methods, apparatuses and devices - Google Patents

Data transmission and task processing methods, apparatuses and devices Download PDF

Info

Publication number
WO2021159820A1
WO2021159820A1 (PCT/CN2020/132846; CN2020132846W)
Authority
WO
WIPO (PCT)
Prior art keywords
address
virtual address
physical memory
request
gpu
Prior art date
Application number
PCT/CN2020/132846
Other languages
French (fr)
Chinese (zh)
Inventor
赵军平 (Junping Zhao)
Original Assignee
支付宝(杭州)信息技术有限公司 (Alipay (Hangzhou) Information Technology Co., Ltd.)
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 支付宝(杭州)信息技术有限公司 (Alipay (Hangzhou) Information Technology Co., Ltd.)
Publication of WO2021159820A1

Links

Images

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 12/00 Accessing, addressing or allocating within memory systems or architectures
    • G06F 12/02 Addressing or allocation; Relocation
    • G06F 12/08 Addressing or allocation; Relocation in hierarchically structured memory systems, e.g. virtual memory systems
    • G06F 12/10 Address translation
    • G06F 12/109 Address translation for multiple virtual address spaces, e.g. segmentation
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 12/00 Accessing, addressing or allocating within memory systems or architectures
    • G06F 12/02 Addressing or allocation; Relocation
    • G06F 12/08 Addressing or allocation; Relocation in hierarchically structured memory systems, e.g. virtual memory systems
    • G06F 12/0802 Addressing of a memory level in which the access to the desired data or data block requires associative addressing means, e.g. caches
    • G06F 12/0866 Addressing of a memory level in which the access to the desired data or data block requires associative addressing means, e.g. caches, for peripheral storage systems, e.g. disk cache
    • G06F 12/0873 Mapping of cache memory to specific storage devices or parts thereof

Definitions

  • This application relates to the field of computer technology, and in particular to data transmission and task processing methods, apparatuses, and devices.
  • Deep Learning is widely used in the field of Artificial Intelligence (AI).
  • Typical DL tasks require strong computing power, so most such tasks currently run on acceleration devices such as the Graphics Processing Unit (GPU). Accelerator chips, typified by graphics processors, are an important foundation for the development and deployment of AI.
  • The embodiments of the present application provide data transmission and task processing methods, apparatuses, and devices for improving the efficiency of GPU resource virtualization.
  • An embodiment of this specification provides a data transmission method, applied to a server in a GPU virtualization system, including: obtaining a data transmission request sent by a client; obtaining a first virtual address from the data transmission request; obtaining the physical memory address corresponding to the first virtual address; determining a second virtual address corresponding to the physical memory address based on the mapping relationship between physical memory addresses and virtual addresses; obtaining the GPU address allocated for the data transmission request; generating a data copy instruction from the second virtual address to the GPU address; and calling an interface of the GPU driver to execute the data copy instruction.
  • An embodiment of this specification provides a task processing method, applied to a server in a GPU virtualization system, including: obtaining a task calculation request sent by a client; obtaining a first virtual address from the task calculation request; obtaining the physical memory address corresponding to the first virtual address; determining a second virtual address corresponding to the physical memory address based on the mapping relationship between physical memory addresses and virtual addresses; obtaining the GPU address allocated for the task calculation request; generating a data copy instruction from the second virtual address to the GPU address; calling an interface of the GPU driver to execute the data copy instruction; sending the task calculation request to the GPU; after the GPU completes the calculation task corresponding to the task calculation request, generating processing state information corresponding to the task calculation request; and storing the processing state information.
  • An embodiment of this specification provides a data transmission method, applied to a client in a GPU virtualization system, including: obtaining a data transmission request sent by an application; obtaining a first virtual address from the data transmission request; determining the physical memory address corresponding to the first virtual address based on the mapping relationship between physical memory addresses and virtual addresses; and sending the data transmission request and the physical memory address to the server, so that the server performs data transmission according to the data transmission request and the physical memory address, wherein the server determines a second virtual address corresponding to the physical memory address based on the mapping relationship between physical memory addresses and virtual addresses; obtains the GPU address allocated for the data transmission request; generates a data copy instruction from the second virtual address to the GPU address; and calls an interface of the GPU driver to execute the data copy instruction.
  • An embodiment of this specification provides a task processing method, applied to a client in a GPU virtualization system, including: obtaining a task processing request sent by an application; forwarding the task processing request so that the server can obtain it; issuing a synchronization request when the processing status information of the task processing request sent by the server is obtained; obtaining the first virtual address of the task processing request when the success notification of the synchronization request sent by the server is obtained; determining the physical memory address corresponding to the first virtual address based on the mapping relationship between physical memory addresses and virtual addresses; reading the calculation result of the task processing request from the physical memory address corresponding to the first virtual address; and sending the calculation result of the task processing request to the application.
  • A data transmission device includes: a data transmission request acquisition module, configured to acquire a data transmission request sent by a client; a first virtual address acquisition module, configured to acquire the first virtual address in the data transmission request; a physical memory address acquisition module, configured to acquire the physical memory address corresponding to the first virtual address; a second virtual address determination module, configured to determine the second virtual address corresponding to the physical memory address based on the mapping relationship between physical memory addresses and virtual addresses; a GPU address acquisition module, configured to acquire the GPU address allocated for the data transmission request; a data copy instruction generation module, configured to generate a data copy instruction from the second virtual address to the GPU address; and an interface calling module, configured to call the interface of the GPU driver to execute the data copy instruction.
  • A task processing device includes: a task calculation request obtaining module, configured to obtain a task calculation request sent by a client; and a first virtual address obtaining module, configured to obtain the first virtual address in the task calculation request.
  • A data transmission device includes: a data transmission request obtaining module, configured to obtain a data transmission request sent by an application; a first virtual address obtaining module, configured to obtain the first virtual address in the data transmission request; a physical memory address determination module, configured to determine the physical memory address corresponding to the first virtual address based on the mapping relationship between physical memory addresses and virtual addresses; and a sending module, configured to send the data transmission request and the physical memory address to the server, so that the server performs data transmission according to the data transmission request and the physical memory address, wherein the server determines the second virtual address corresponding to the physical memory address based on the mapping relationship between physical memory addresses and virtual addresses; obtains the GPU address allocated for the data transmission request; generates a data copy instruction from the second virtual address to the GPU address; and calls the interface of the GPU driver to execute the data copy instruction.
  • A task processing device includes: a task processing request acquisition module, configured to acquire a task processing request sent by an application; a task processing request forwarding module, configured to forward the task processing request so that the server can obtain it; a synchronization request sending module, configured to send a synchronization request when the processing status information of the task processing request sent by the server is obtained; a first virtual address obtaining module, configured to obtain the first virtual address of the task processing request after the success notification of the synchronization request sent by the server is obtained; a physical memory address determination module, configured to determine the physical memory address corresponding to the first virtual address based on the mapping relationship between physical memory addresses and virtual addresses; a calculation result reading module, configured to read the calculation result of the task processing request from the physical memory address corresponding to the first virtual address; and a calculation result sending module, configured to send the calculation result of the task processing request to the application.
  • A data transmission device includes: at least one processor; and a memory communicatively connected to the at least one processor, the memory storing an instruction executable by the at least one processor, the instruction being executed by the at least one processor so that the at least one processor can: obtain a data transmission request sent by a client; obtain the first virtual address in the data transmission request; obtain the physical memory address corresponding to the first virtual address; determine the second virtual address corresponding to the physical memory address based on the mapping relationship between physical memory addresses and virtual addresses; obtain the GPU address allocated for the data transmission request; generate a data copy instruction from the second virtual address to the GPU address; and call the interface of the GPU driver to execute the data copy instruction.
  • A task processing device includes: at least one processor; and a memory communicatively connected to the at least one processor, the memory storing an instruction executable by the at least one processor, the instruction being executed by the at least one processor so that the at least one processor can: obtain a task calculation request sent by a client; obtain the first virtual address in the task calculation request; obtain the physical memory address corresponding to the first virtual address; determine the second virtual address corresponding to the physical memory address based on the mapping relationship between physical memory addresses and virtual addresses; obtain the GPU address allocated for the task calculation request; generate a data copy instruction from the second virtual address to the GPU address; and call the interface of the GPU driver to execute the data copy instruction.
  • A data transmission device includes: at least one processor; and a memory communicatively connected to the at least one processor, the memory storing an instruction executable by the at least one processor, the instruction being executed by the at least one processor so that the at least one processor can: obtain a data transmission request sent by an application; obtain the first virtual address in the data transmission request; determine the physical memory address corresponding to the first virtual address based on the mapping relationship between physical memory addresses and virtual addresses; and send the data transmission request and the physical memory address to the server, so that the server performs data transmission according to the data transmission request and the physical memory address, wherein the server determines the second virtual address corresponding to the physical memory address based on the mapping relationship between physical memory addresses and virtual addresses; obtains the GPU address allocated for the data transmission request; generates a data copy instruction from the second virtual address to the GPU address; and calls an interface of the GPU driver to execute the data copy instruction.
  • A task processing device includes: at least one processor; and a memory communicatively connected to the at least one processor, the memory storing an instruction executable by the at least one processor, the instruction being executed by the at least one processor so that the at least one processor can: obtain a task processing request sent by an application; forward the task processing request so that the server can obtain it; issue a synchronization request when the processing status information of the task processing request sent by the server is obtained; obtain the first virtual address of the task processing request when the success notification of the synchronization request sent by the server is obtained; determine the physical memory address corresponding to the first virtual address based on the mapping relationship between physical memory addresses and virtual addresses; read the calculation result of the task processing request from the physical memory address corresponding to the first virtual address; and send the calculation result of the task processing request to the application.
  • An embodiment of the present specification provides a computer-readable medium having computer-readable instructions stored thereon, the computer-readable instructions being executable by a processor to implement the foregoing methods.
  • The above technical solution adopted in the embodiments of this specification can achieve the following beneficial effects. The physical memory address is mapped both to the first virtual address of the client and to the second virtual address of the server; that is, the client and the server share the same physical memory, and the generated data copy instruction copies the data in the physical memory address directly to the GPU address. Because the first virtual address of the client and the second virtual address of the server are retained, the original program is not changed, achieving transparency. Moreover, only one copy of the data, from the physical memory address to the GPU address, takes place, which reduces the number of memory copies. Therefore, there is no need to allocate temporary memory on the client and the server to store copied data, which significantly improves memory utilization, effectively reduces cost, and improves the efficiency of GPU resource virtualization.
  • Figure 1 is a schematic diagram of the GPU software virtualization process based on request-data forwarding (multiple memory copies);
  • FIG. 2 is a schematic structural diagram of the overall GPU virtualization module based on transparent memory sharing provided by an embodiment of this specification;
  • FIG. 3 is a schematic flowchart of a data transmission method provided by an embodiment of this specification;
  • FIG. 4 is a schematic flowchart of a task processing method provided by an embodiment of this specification;
  • FIG. 5 is a schematic diagram of multi-queue management provided by an embodiment of this specification.
  • FIG. 6 is a schematic flowchart of another data transmission method provided by an embodiment of this specification.
  • FIG. 7 is a schematic flowchart of another task processing method provided by an embodiment of the specification.
  • FIG. 8 is a schematic structural diagram of a data transmission device corresponding to FIG. 3 provided by an embodiment of this specification.
  • FIG. 9 is a schematic structural diagram of a data transmission device corresponding to FIG. 3 provided by an embodiment of this specification.
  • Transparent memory sharing: on the data plane, a specially designed and optimized way of sharing data, aimed at requiring no changes to existing programs, no data movement, and transparent virtualization.
  • This solution uses a transparent memory sharing mechanism to implement GPU control requests and data exchange.
  • FIG. 2 shows a schematic structural diagram of the overall GPU virtualization module based on transparent memory sharing provided by an embodiment of this specification. As shown in FIG. 2:
  • Models and applications: models such as Convolutional Neural Networks (CNN), Recurrent Neural Networks (RNN), Long Short-Term Memory (LSTM), and Generative Adversarial Networks (GAN), etc.
  • Applications include model training or online model services.
  • AI framework layer: common DL frameworks, such as TensorFlow, PyTorch, Caffe2, etc.
  • GPU Server: responsible for GPU service and virtualization management; a long-running daemon running on top of the GPU driver.
  • A GPU server has one service instance (which can be packaged and run in Docker). According to configuration strategies (such as environment variables or configuration files), it divides and pre-allocates virtual GPU resources, saves the mapping relationship between virtual and physical resources, and reports to the cluster scheduler (such as K8S, Kubemaker).
  • The client lib (for example, packaged together with the application model as a Docker image) is responsible for the discovery of, application for, access to, and necessary built-in optimization of virtual GPU resources, and records the correspondence between virtual and physical resources.
  • The client exports the GPU access API to the application, such as Nvidia CUDA (only the internal resource application and low-level implementation are replaced, for decoupling).
  • One server or one physical GPU can run multiple clients.
  • A transparent shared memory module and an efficient GPU request processing module need to be deployed on both the server and the client.
  • The client and server in the embodiments of this specification refer to the client and server in the GPU virtualization system, which differ from the client and server in the conventional sense. They are modeled after the functions and roles of conventional clients and servers, are constructed in software, and are not physical entities.
  • Scheduler: the GPU resource scheduler within the cluster, such as K8S. The client application first needs to apply for GPU resources from the scheduler, and the scheduler is then responsible for scheduling execution.
  • FIG. 3 is a schematic flowchart of a data transmission method provided by an embodiment of this specification. From a program perspective, the execution subject of the process may be the server in the GPU virtualization system.
  • As shown in FIG. 3, the process may include steps 302 to 314.
  • Step 302: Obtain a data transmission request sent by the client.
  • The data transmission request is initiated by an application on the client; the client receives the data transmission request initiated by the application and then forwards it to the server.
  • The data transmission request can be an independent data request or a subtask of a task processing request. For example, if the GPU is required to complete a calculation task, the data to be calculated must first be transmitted to the GPU address, and the GPU then performs calculations on the data at the GPU address. In that case, the data transmission request in step 302 may be a data-transmission subtask of the foregoing calculation task.
  • Step 304: Obtain the first virtual address in the data transmission request.
  • A virtual address (Virtual Address) identifies a non-physical address.
  • A data transmission request often includes the direction of the transfer, that is, the data is transmitted from one address to another address.
  • A data request can only use virtual addresses and cannot include actual physical addresses. This prevents the client from directly accessing physical addresses and performing unsafe operations such as data tampering. Therefore, the address in the data transmission request is a virtual address.
  • The term "first virtual address" is used only to distinguish it from other virtual addresses; "first" has no other meaning.
  • Step 306: Obtain the physical memory address corresponding to the first virtual address.
  • The physical memory address corresponding to the first virtual address needs to be determined according to the mapping relationship between virtual addresses and physical memory addresses.
  • The physical memory address is an address in a memory pool set aside for storing shared data, and the specific location can be represented by an offset.
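The offset representation can be sketched as follows. This is an illustrative model only: names like `SharedMemoryPool` are assumptions, not from the application, and a real implementation would back the pool with OS shared memory (e.g. `shm_open`/`mmap`) rather than a `bytearray`.

```python
class SharedMemoryPool:
    """Illustrative model of the shared memory pool described above.

    A bytearray stands in for the shared physical pages; a real system
    would map an OS shared-memory segment instead.
    """

    def __init__(self, size):
        self.base = bytearray(size)  # stand-in for the shared physical pages
        self.next_offset = 0

    def allocate(self, length):
        """Reserve `length` bytes and return their offset into the pool."""
        if self.next_offset + length > len(self.base):
            raise MemoryError("shared pool exhausted")
        offset = self.next_offset
        self.next_offset += length
        return offset

    def write(self, offset, data):
        self.base[offset:offset + len(data)] = data

    def read(self, offset, length):
        return bytes(self.base[offset:offset + length])


pool = SharedMemoryPool(1024)
offset = pool.allocate(5)  # a "physical memory address" expressed as an offset
pool.write(offset, b"hello")
```

Because both the client and the server map the same pool, an offset is meaningful to both sides even though their virtual base addresses differ.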
  • The mapping between virtual addresses and physical memory addresses can be saved in a table stored on the client.
  • Alternatively, the client may send the physical memory address corresponding to the first virtual address to the server.
  • The server sends a request to the client to obtain the physical memory address corresponding to the first virtual address; after receiving the request, the client returns the physical memory address corresponding to the first virtual address to the server.
  • Step 308: Determine a second virtual address corresponding to the physical memory address based on the mapping relationship between the physical memory address and the virtual address.
  • In order to achieve no changes to existing programs, no data movement, and transparent memory sharing, after the server obtains the actual physical memory address of the data, it also needs to convert it into a virtual address that can be used and recognized within the server process.
  • The term "second virtual address" is likewise used only for identification. The second virtual address appears only in the program within the server's process and has nothing to do with the client.
  • The correspondence between a physical memory address and a second virtual address is unique; that is, a physical memory address corresponds to a unique second virtual address, so the second virtual address corresponding to the physical memory address can be determined based on the mapping relationship between physical memory addresses and virtual addresses.
  • The mapping relationship between physical memory addresses and virtual addresses can be represented by a mapping table.
  • The server queries whether the physical memory address is included in the mapping table. If it is found, the physical memory address has already been mapped on the server side and no further mapping is needed. If it is not found, the physical memory address has not yet been mapped on the server side, so it needs to be mapped to obtain the second virtual address.
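The map-once behavior described above can be sketched like this. `ServerAddressMapper` and `_map_into_process` are hypothetical names; in a real server, an OS mapping primitive (e.g. `mmap` of the shared segment) would take the place of `_map_into_process`, and the address constants are illustrative.

```python
class ServerAddressMapper:
    """Maps each physical memory address to a server-side (second) virtual
    address exactly once, caching the result in a mapping table."""

    def __init__(self):
        self.mapping_table = {}  # physical memory address -> second virtual address
        self._next_virtual = 0x7F0000000000  # illustrative virtual address base

    def _map_into_process(self, physical_address):
        # Stand-in for the real mapping primitive (e.g. mmap of the shared
        # segment into the server process); returns a fresh virtual address.
        virtual_address = self._next_virtual
        self._next_virtual += 0x1000
        return virtual_address

    def second_virtual_address(self, physical_address):
        # Query the mapping table first; map only if the address is absent.
        if physical_address not in self.mapping_table:
            self.mapping_table[physical_address] = self._map_into_process(physical_address)
        return self.mapping_table[physical_address]


mapper = ServerAddressMapper()
first_lookup = mapper.second_virtual_address(0x5000)
```

Repeated lookups for the same physical memory address return the cached second virtual address, so each shared region is mapped into the server process only once.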
  • Step 310: Obtain the GPU address allocated for the data transmission request.
  • The data in the physical memory address is copied to the GPU address to realize data calculation on the GPU. Therefore, a corresponding GPU address also needs to be allocated for the transmitted data.
  • The server can issue a GPU address allocation instruction according to the data transmission request, and the GPU driver calls the interface to complete the GPU address allocation. The GPU driver then sends the allocated GPU address to the server.
  • The server may also obtain the GPU address allocated for the data transmission request in other ways, which is not specifically limited in the embodiments of this specification.
  • The data transmission request may also include the length of the data.
  • The GPU address may be allocated by the server or by another execution subject.
  • Step 312: Generate a data copy instruction from the second virtual address to the GPU address.
  • The data copy instruction can be generated according to the source address and the destination address of the data.
  • Step 314: Call the interface of the GPU driver to execute the data copy instruction.
  • The data copy instruction generated in step 312 requires the GPU driver to complete the interaction with the GPU. Therefore, the interface of the GPU driver needs to be called to execute the data copy instruction, that is, to complete the data copy from the physical memory address to the GPU address.
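Steps 302 to 314 can be summarized in one hypothetical handler. Every name here is an assumption for illustration; the stub classes stand in for the client's address table, the server's mapper, and the real GPU driver interface (which would be something like a cudaMemcpy-style call).

```python
def handle_data_transmission(request, client_table, mapper, gpu_driver):
    """Hypothetical server-side handler tracing steps 302-314.

    request      -- carries the client's first virtual address and data length
    client_table -- first virtual address -> physical memory address (step 306)
    mapper       -- yields the server's second virtual address (step 308)
    gpu_driver   -- allocates GPU memory and executes the copy (steps 310-314)
    """
    first_virtual = request["first_virtual_address"]            # step 304
    physical = client_table[first_virtual]                      # step 306
    second_virtual = mapper.second_virtual_address(physical)    # step 308
    gpu_address = gpu_driver.allocate(request["length"])        # step 310
    instruction = ("copy", second_virtual, gpu_address, request["length"])  # step 312
    gpu_driver.execute(instruction)                             # step 314
    return gpu_address


class _StubMapper:
    def second_virtual_address(self, physical_address):
        return physical_address + 0x1000  # illustrative fixed offset


class _StubGpuDriver:
    def __init__(self):
        self.executed = []

    def allocate(self, length):
        return 0xD000  # illustrative GPU address

    def execute(self, instruction):
        self.executed.append(instruction)


driver = _StubGpuDriver()
gpu_addr = handle_data_transmission(
    {"first_virtual_address": 0xA000, "length": 64},
    {0xA000: 0x5000},
    _StubMapper(),
    driver,
)
```

Note that the handler never touches the data itself: only one copy occurs, from the shared physical memory (via the second virtual address) to the GPU address.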
  • The method in FIG. 3 maps the physical memory address to the first virtual address of the client and the second virtual address of the server respectively; that is, the client and the server share the same physical memory, and the generated data copy instruction copies the data in the physical memory address directly to the GPU address. Because the first virtual address of the client and the second virtual address of the server are retained, the original program is not changed and transparency is achieved. Moreover, only one data copy, from the physical memory address to the GPU address, takes place, which reduces the number of memory copies; therefore, there is no need to allocate temporary memory on the client and server to store copied data, which significantly improves utilization, effectively reduces cost, and improves the efficiency of GPU resource virtualization.
  • The method may further include: determining whether the physical memory address is stored in a mapping table; if not, generating a second virtual address corresponding to the physical memory address and storing the mapping relationship between the physical memory address and the second virtual address in the mapping table. Determining the second virtual address corresponding to the physical memory address specifically includes: if the physical memory address is stored in the mapping table, obtaining the second virtual address corresponding to the physical memory address.
  • When the server determines the second virtual address corresponding to the physical memory address, it first needs to determine whether the physical memory address exists in the mapping table. If it exists, the virtual address mapping for that physical memory address has already been completed on the server; in other words, the physical memory is mapped only once on the server. If the physical memory address is not in the mapping table, the server has not yet mapped it, so the server performs the mapping operation to generate a second virtual address for the physical memory address and then stores the mapping relationship between the physical memory address and the second virtual address.
  • FIG. 4 is a schematic flowchart of a task processing method provided by an embodiment of this specification. From a program perspective, the execution subject of the process may be the server in the GPU virtualization system. As shown in FIG. 4, the process may include steps 402 to 420.
  • Step 402: Obtain the task calculation request sent by the client.
  • The task calculation request can correspond to various calculation tasks, such as matrix multiplication, convolution, and so on.
  • The task calculation request is initiated by the application; after the client obtains it, the client forwards it to the server.
  • Step 404: Obtain the first virtual address in the task calculation request.
  • The task calculation request can include some information related to the calculation data. However, the task calculation request does not directly contain the data itself; instead, it records the address where the data is stored. Since the actual physical memory address cannot be exposed to the client, the address is expressed as a virtual address. For example, if matrix A and matrix B are to be multiplied, the first virtual addresses are the addresses where matrix A and matrix B are stored. There may be one or multiple first virtual addresses; the number depends on the specific task calculation request.
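For the matrix-multiplication example, such a request might look like the following. This is a purely illustrative structure; the field names and address values are assumptions, not taken from the application.

```python
# Hypothetical task calculation request for C = A x B. The request carries
# only first virtual addresses of the operands, never the matrix data itself;
# all field names and address values below are illustrative assumptions.
task_calculation_request = {
    "operation": "matmul",
    "first_virtual_addresses": {
        "matrix_a": 0x7F00A0000000,  # client-side view of where A is stored
        "matrix_b": 0x7F00B0000000,  # client-side view of where B is stored
    },
    "result_virtual_address": 0x7F00C0000000,  # where the result will land
}
```

Here two first virtual addresses are carried because the operation has two operands; a convolution might carry addresses for the input tensor and the kernel instead.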
  • Step 406: Obtain the physical memory address corresponding to the first virtual address.
  • After the first virtual address is determined in step 404, the physical memory address corresponding to the first virtual address also needs to be determined. For details, refer to step 306 in the first embodiment.
  • Step 408: Determine a second virtual address corresponding to the physical memory address based on the mapping relationship between the physical memory address and the virtual address.
  • Step 410: Obtain the GPU address allocated for the task calculation request.
  • the data in the physical memory address is copied to the GPU address to implement GPU data calculation. Therefore, it is also necessary to allocate a corresponding GPU address for the transmitted data.
• the server can issue a GPU address allocation instruction according to the data transmission request, and the GPU driver calls the interface to complete the GPU address allocation. The GPU driver then sends the allocated GPU address to the server.
  • the server may also obtain the GPU address allocated for the data transmission request through other methods, which is not specifically limited in the embodiment of this specification.
  • the data transmission request may also include the length of the data.
  • the GPU address allocation subject can be the server or other execution subjects.
  • the GPU address allocated for the task calculation request may include the storage address of the calculation data, and may also include the storage address of the calculation result.
• GPU addresses can also be allocated according to the size of the data.
  • Step 412 Generate a data copy instruction from the second virtual address to the GPU address.
  • Step 414 Call the interface of the GPU driver to execute the data copy instruction.
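Steps 410 to 414 can be sketched as building a host-to-device copy instruction and handing it to the driver interface. The `FakeDriver` below simulates device memory with a Python dict; in a real deployment the call would be the GPU driver's host-to-device copy interface (a cudaMemcpy-style API), which this sketch only imitates, and all names are illustrative.

```python
from dataclasses import dataclass

@dataclass
class CopyInstruction:
    src_vaddr: int   # second virtual address (server-side view of shared memory)
    dst_gpu: int     # GPU address allocated for the task calculation request
    length: int      # number of bytes to copy

class FakeDriver:
    """Stand-in for the GPU driver interface; device memory is a dict."""
    def __init__(self):
        self.device_mem = {}
        self.next_gpu_addr = 0x1000

    def alloc(self, length):
        # Allocate a GPU address region of the requested size (step 410).
        addr = self.next_gpu_addr
        self.next_gpu_addr += length
        return addr

    def execute_copy(self, instr, host_mem):
        # Copy `length` bytes from the host (shared) memory to device memory.
        data = host_mem[instr.src_vaddr:instr.src_vaddr + instr.length]
        self.device_mem[instr.dst_gpu] = bytes(data)

host_mem = bytearray(b"matrix-A-bytes")              # shared memory, simplified
driver = FakeDriver()
gpu_addr = driver.alloc(len(host_mem))               # step 410: allocate GPU address
instr = CopyInstruction(0, gpu_addr, len(host_mem))  # step 412: build copy instruction
driver.execute_copy(instr, host_mem)                 # step 414: driver executes the copy
```

The key point the sketch captures is that the copy reads directly from the shared memory region named by the second virtual address, so no intermediate staging buffer is involved.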
  • Step 416 Send the task calculation request to the GPU.
  • Step 418 After the GPU completes the calculation task corresponding to the task calculation request, generate processing state information corresponding to the task calculation request.
• the GPU obtains the data at the GPU address, performs calculations according to the task calculation request, and stores the calculation result at the GPU address allocated for the calculation result. The GPU then notifies the server of the completion status of the calculation, and the server generates processing status information corresponding to the task calculation request.
  • Step 420 Store the processing status information.
• the server can actively send the processing status information to the client, or store the information so that the client can conveniently query it.
• the method in FIG. 4 maps the physical memory address to the first virtual address of the client and the second virtual address of the server respectively; that is, the client and the server share the same physical memory, and the generated data copy instruction copies the data directly from the physical memory address to the GPU address. Because the first virtual address of the client and the second virtual address of the server are retained, the original program is not changed, and transparency is achieved. Moreover, only one data copy, from the physical memory address to the GPU address, is performed, which reduces the number of memory copies. Therefore, there is no need to allocate temporary memory for the client and the server to store copied data, which significantly improves resource utilization, effectively reduces cost, and improves the efficiency of GPU resource virtualization.
  • the first virtual address may have different functions, for example, one type is used to store calculation data, and the other type is used to store calculation results.
• when the first virtual address includes a calculation data acquisition virtual address and a calculation result storage virtual address, the determining the second virtual address may specifically include: determining the second virtual address corresponding to the first physical memory address.
• the obtaining the GPU address allocated for the task calculation request specifically includes: obtaining the calculation data storage GPU address and the calculation result storage GPU address allocated for the task calculation request; and the generating a data copy instruction from the second virtual address to the GPU address specifically includes: generating a data copy instruction from the second virtual address to the calculation data storage GPU address.
  • the calculation data storage GPU address is used to store the data copied from the physical memory address, that is, to store the source data for calculation.
  • the calculation result storage GPU address is used to store the calculation result. After the GPU completes the calculation task, it temporarily stores the calculation result in the calculation result storage GPU address, and when the client calls it, it copies the data to the corresponding physical memory address.
• after generating the processing status information corresponding to the task calculation request, the method may further include: when the calculation result synchronization request sent by the client is obtained, obtaining the second physical memory address corresponding to the calculation result storage virtual address; determining, based on the mapping relationship between the physical memory address and the virtual address, the third virtual address corresponding to the second physical memory address; generating a data copy instruction for copying the calculation result from the calculation result storage GPU address to the third virtual address; and calling the interface of the GPU driver to execute the data copy instruction.
• before synchronization, the client or application cannot obtain the calculation result. Therefore, after the client learns that the GPU has completed the calculation task, it issues a calculation result synchronization request; that is, the calculation result in the GPU address is copied to the physical memory address for the client application to read.
  • Copying the calculation result from the GPU address to the physical memory address is the opposite process of copying the calculation data from the physical memory address to the GPU address. Since the server initiates the data copy request, it is necessary to determine which virtual address corresponds to the server side of the physical memory address where the calculation result is stored, that is, the third virtual address.
• when the first virtual address includes the calculation data acquisition virtual address and the calculation result storage virtual address, the second physical memory address corresponding to the calculation result storage virtual address may be determined based on the mapping relationship between the physical memory address and the virtual address.
  • the server determines the third virtual address corresponding to the second physical memory address according to the mapping relationship.
  • a data copy instruction for copying the calculation result from the calculation result storage GPU address to the third virtual address is generated, and the GPU driver interface is called to execute the data copy instruction.
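The two-step lookup used during result synchronization (calculation result storage virtual address → second physical memory address → third virtual address) can be sketched with two dictionaries standing in for the client-side and server-side mapping tables. The table contents and function name are illustrative assumptions.

```python
# Client-side table: first virtual addresses -> physical memory addresses.
client_map = {0xA000: 0x2000}   # 0xA000 = calculation result storage virtual address
# Server-side table: physical memory addresses -> server-side virtual addresses.
server_map = {0x2000: 0xB000}

def third_virtual_address(result_vaddr):
    # Step 1: resolve the second physical memory address from the client mapping.
    phys = client_map[result_vaddr]
    # Step 2: resolve the server-side (third) virtual address for that physical address.
    return server_map[phys]

third = third_virtual_address(0xA000)
# A device-to-host copy instruction would then use `third` as its destination,
# writing the result directly into the shared physical memory.
```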
• the obtaining the task calculation request sent by the client may specifically include: obtaining the task calculation request sent by the client from a submission queue, the submission queue containing a plurality of unprocessed task calculation requests submitted by the client; after generating the processing status information corresponding to the task calculation request, the method further includes: sending the processing status information corresponding to the task calculation request to a completion queue, the completion queue containing a plurality of pieces of processing status information submitted by the server that have not been read by the client.
• when the client submits a GPU request, it puts the request in the submission queue and then returns (for example, for asynchronous requests); a worker thread is then responsible for sending the request to the server, or the server actively queries the request queue for new requests. After the server receives a request, it executes the processing and puts the processing result on the completion queue.
  • the client can asynchronously query the processing status information in the completion queue.
  • Multiple task calculation requests are stored in the submission queue, and they are sorted according to the time when they enter the submission queue.
• the task calculation request that enters the submission queue first is obtained by the server first, and the task calculation request that enters the submission queue later is obtained by the server later. For example, if task 1, task 2, and task 3 are submitted to the submission queue one after another, the server will first obtain task 1, then task 2, and finally task 3 from the submission queue.
• the same principle applies to completion queues.
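The FIFO behavior of the submission and completion queues described above can be sketched with `collections.deque`: the server drains requests in arrival order and posts status entries to the completion queue in the same order. The task names are illustrative.

```python
from collections import deque

submission_queue = deque()
completion_queue = deque()

# Client submits task 1, task 2, task 3 in order and returns immediately.
for task in ("task-1", "task-2", "task-3"):
    submission_queue.append(task)

# Server drains the submission queue (FIFO) and posts status to the completion queue.
processed = []
while submission_queue:
    task = submission_queue.popleft()      # earliest-submitted task first
    processed.append(task)
    completion_queue.append((task, "done"))

# Client asynchronously polls the completion queue, also in FIFO order.
first_status = completion_queue.popleft()
```

In the actual system the queues would live in the shared memory region so that request messages themselves need no extra copies; the deque here only models the ordering.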
  • This solution also combines the queuing mechanism with transparent shared memory, that is, all requests from the client and the server are allocated on the shared memory to avoid memory copies of the request message (request) encountered when the request is forwarded.
• This method proposes using an efficient software method to make GPU hardware as efficient and lossless as the CPU, thereby significantly improving utilization and effectively reducing cost. In the exclusive-use case, it can also optimize performance; for large-scale deployment, a pure-software virtualization method simplifies operation, maintenance, and management.
  • FIG. 6 is a schematic flowchart of another data transmission method provided by an embodiment of this specification. From a program point of view, the execution subject of the process can be the client applied to the GPU virtualization system. As shown in FIG. 6, the process may include step 602 to step 608.
  • Step 602 Obtain a data transmission request sent by the application.
• the application and the client reside together, and the data transmission request sent by the application is transmitted through the client.
  • Step 604 Obtain the first virtual address in the data transmission request.
  • the data address in the data transmission request is a virtual address, and the client first needs to obtain the first virtual address, and then perform related operations.
  • Step 606 Determine the physical memory address corresponding to the first virtual address based on the mapping relationship between the physical memory address and the virtual address.
  • the physical memory address corresponding to the first virtual address needs to be determined according to the relationship between the virtual address and the physical memory address.
  • the relationship between the virtual address and the physical memory address can be saved in a table and stored on the client. After obtaining the first virtual address, the client can query the stored correspondence between the first virtual address and the physical memory address, and then determine the physical memory address corresponding to the first virtual address.
  • Step 608 Send the data transmission request and the physical memory address to the server, so that the server performs data transmission according to the data transmission request and the physical memory address, wherein the server is based on the physical memory address A mapping relationship with a virtual address, determining a second virtual address corresponding to the physical memory address; acquiring a GPU address allocated for the data transmission request; generating a data copy instruction from the second virtual address to the GPU address; Call the GPU driver interface to execute the data copy instruction.
  • the data transmission method provided in the third embodiment and the data transmission method provided in the first embodiment are respectively described from the perspective of the client and the server, and many of the contents are similar.
• the method in FIG. 6 maps the physical memory address to the first virtual address of the client and the second virtual address of the server respectively; that is, the client and the server share the same physical memory, and the generated data copy instruction copies the data directly from the physical memory address to the GPU address. Because the first virtual address of the client and the second virtual address of the server are retained, the original program is not changed, and transparency is achieved. Moreover, only one data copy, from the physical memory address to the GPU address, is performed, which reduces the number of memory copies. Therefore, there is no need to allocate temporary memory for the client and the server to store copied data, which significantly improves resource utilization, effectively reduces cost, and improves the efficiency of GPU resource virtualization.
• before the acquiring the data transmission request sent by the application, the method may further include: acquiring the memory allocation request sent by the application; acquiring the data in the memory allocation request and storing the data to a first physical memory address; mapping the first physical memory address to the process space of the application to generate a first virtual address corresponding to the first physical memory address; and sending the first virtual address to the application and storing the mapping relationship between the first physical memory address and the first virtual address.
  • the application initiates a memory allocation request, such as calling malloc(len).
• the client obtains the memory allocation request sent by the application and allocates memory of the required length from the memory pool, for example, a segment of the memory pool with a starting offset and length L. The client then maps the memory into the process space of the application to obtain the mapped virtual address H. The virtual address H and its location information in the memory pool (the offset) are stored in the mapping table, which can be implemented as a hash table. Finally, the virtual address H is returned to the application so that the application can perform normal data reading and writing.
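The allocation flow above, carving (offset, length L) out of a memory pool, mapping it into the application's process space to get virtual address H, and recording H → offset in a hash table, can be sketched as follows. A plain `bytearray` stands in for the shared memory pool and a dict for the hash table; real code would map OS shared memory into the process instead, and the class and address constants are illustrative.

```python
class MemoryPool:
    """Sketch of the client-side pool allocator backing malloc(len)."""

    def __init__(self, size):
        self.pool = bytearray(size)   # stand-in for the shared physical memory pool
        self.next_offset = 0          # simple bump allocator
        self.mapping = {}             # hash table: virtual address H -> (offset, length)

    def malloc(self, length):
        # Carve a segment (offset, length) out of the pool.
        offset = self.next_offset
        self.next_offset += length
        # "Map" it into the process space; here H is simply derived from the offset.
        h = 0x7F0000000000 + offset
        self.mapping[h] = (offset, length)
        return h                      # returned to the application for normal R/W

    def resolve(self, h):
        # Later lookups (e.g. step 606) translate H back to its pool location.
        return self.mapping[h]

pool = MemoryPool(1 << 20)
h = pool.malloc(256)
offset, length = pool.resolve(h)
```

Because the application only ever sees H, it reads and writes through an ordinary pointer while the client can still recover the pool offset, which is what makes the sharing transparent.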
  • the determining the physical memory address corresponding to the first virtual address may specifically include: determining the physical memory address corresponding to the first virtual address according to the mapping relationship.
  • FIG. 7 is a schematic flowchart of another task processing method provided by an embodiment of this specification. From a program point of view, the execution subject of the process can be the client applied to the GPU virtualization system. As shown in FIG. 7, the process may include step 702 to step 714.
  • Step 702 Obtain a task processing request sent by the application.
  • Step 704 Forward the task processing request so that the server can obtain it.
  • Step 706 When the processing status information of the task processing request sent by the server is obtained, a synchronization request is issued.
  • Step 708 After obtaining the successful notification of the synchronization request sent by the server, obtain the first virtual address of the task processing request.
  • Step 710 Determine the physical memory address corresponding to the first virtual address based on the mapping relationship between the physical memory address and the virtual address.
  • Step 712 Read the calculation result of the task processing request from the physical memory address corresponding to the first virtual address.
  • Step 714 Send the calculation result of the task processing request to the application.
  • the first virtual address includes a virtual address for obtaining calculation data and a virtual address for storing calculation results.
  • the obtaining the first virtual address of the task processing request may specifically include: obtaining a virtual address for storing a calculation result of the task processing request.
• the determining the physical memory address corresponding to the first virtual address may specifically include: determining the physical memory address corresponding to the calculation result storage virtual address.
  • the forwarding the task processing request may specifically include: sending the task processing request to a submission queue, so that the server can obtain the task processing request from the submission queue, and the submission queue Contains multiple unprocessed task calculation requests submitted by the client.
• the sending a synchronization request may specifically include: querying the completion queue, and sending a synchronization request when the processing status information of the task processing request sent by the server is queried; the completion queue contains a plurality of pieces of processing status information submitted by the server that have not been read by the client.
• the obtaining the first virtual address of the task processing request may specifically include: querying the completion queue, and acquiring the first virtual address of the task processing request after the success notification of the synchronization request sent by the server is queried.
• when the client submits a GPU request, it puts the request in the submission queue and then returns (for example, for asynchronous requests); a worker thread is then responsible for sending the request to the server, or the server actively queries the request queue for new requests. After the server receives a request, it executes the processing and puts the processing result on the completion queue.
  • the client can asynchronously query the processing status information in the completion queue.
  • This solution also combines the queuing mechanism with transparent shared memory, that is, all requests from the client and the server are allocated on the shared memory to avoid memory copies of the request message (request) encountered when the request is forwarded.
• This method proposes using an efficient software method to make GPU hardware as efficient and lossless as the CPU, thereby significantly improving utilization and effectively reducing cost. In the exclusive-use case, it can also optimize performance; for large-scale deployment, a pure-software virtualization method simplifies operation, maintenance, and management.
  • the execution subject is a machine equipped with a client and a server.
• the method may include the following steps: the client obtains the task calculation request sent by the application, the client having a virtual memory sharing function; the client sends the task calculation request to the submission queue; the server obtains the task calculation request from the submission queue, the server having a virtual memory sharing function; the server obtains the calculation data acquisition virtual address and the calculation result storage virtual address in the task calculation request; the server obtains the first physical memory address corresponding to the calculation data acquisition virtual address; the server determines the second virtual address corresponding to the first physical memory address; the server obtains the GPU address allocated for the task calculation request; the server generates a data copy instruction from the second virtual address to the GPU address, so as to call an interface to execute the data copy from the physical memory address to the GPU address; the server sends the task calculation request to the GPU; and after the GPU completes the calculation task corresponding to the task calculation request, the server generates processing status information corresponding to the task calculation request.
• the program adopted by the method provided in the embodiments of this specification runs in user mode and can be applied in user space. There are multiple implementation methods for different scenarios, allowing flexible deployment, summarized as follows:
• Both the server and the client run on the host OS (such as Linux), and the server takes over all GPU access through the GPU driver, including exclusive use of a certain GPU0 or shared use of GPU1 based on configuration. If the client and the server are on the same machine, the communication can be IPC (such as UNIX socket, pipe, or shmem); if they are not on the same machine, socket/RDMA communication is used.
• In a containerized environment, the server can run in a containerized manner, take over the physical GPU, and export virtual GPU resources.
  • the client (such as K8S pod) runs on the same physical machine and is connected to the server.
  • the communication between the client and the server can be IPC or network.
• In a typical virtual machine environment, the GPU is passed through to a specific virtual machine; the server or client is then started in the VM guest OS, after which the setup is equivalent to a bare-metal environment.
• High performance: a transparent memory sharing mechanism is used to avoid additional memory copies, and polling-based multi-queue request processing can efficiently respond to the high-frequency request calls of typical deep learning tasks. Compared with known methods, performance is significantly improved; software virtualization using this method can achieve no performance loss, and its virtualization efficiency is significantly better than known industrial/academic hardware and software virtualization solutions.
  • multiple request queues are provided for each device, including submission and completion queues, to improve scalability and cope with concurrent access by multiple cards.
• the method supports a variety of deployment environments, can interface with all known AI frameworks and models, and is transparent and non-intrusive; the core method can be independent of the GPU device and can also support other acceleration devices such as Ali AI chips.
  • FIG. 8 is a schematic structural diagram of a data transmission device corresponding to FIG. 3 provided by an embodiment of the specification.
  • the device may include: a data transmission request obtaining module 801, configured to obtain a data transmission request sent by a client; and a first virtual address obtaining module 802, configured to obtain a first virtual address in the data transmission request.
  • physical memory address obtaining module 803, configured to obtain the physical memory address corresponding to the first virtual address
  • second virtual address determining module 804, configured to determine the physical memory based on the mapping relationship between the physical memory address and the virtual address The second virtual address corresponding to the address
  • the GPU address acquisition module 805, which is used to acquire the GPU address allocated for the data transmission request
• the data copy instruction generation module 806, which is used to generate a data copy instruction from the second virtual address to the GPU address
• the interface calling module 807, which is used to call the GPU driver interface to execute the data copy instruction.
  • the device may further include: a judging module, configured to judge whether the physical memory address is stored in a mapping table.
  • the second virtual address generating module is configured to, if not, generate a second virtual address corresponding to the physical memory address, and store the mapping relationship between the physical memory address and the second virtual address in the mapping table.
  • the second virtual address determining module 804 may be specifically configured to: if yes, obtain the second virtual address corresponding to the physical memory address.
• the embodiment of this specification also provides a task processing device corresponding to FIG. 4. The device includes: a task calculation request obtaining module, configured to obtain a task calculation request sent by a client; a first virtual address obtaining module, configured to obtain the first virtual address in the task calculation request; a physical memory address obtaining module, configured to obtain the physical memory address corresponding to the first virtual address; a second virtual address determining module, configured to determine, based on the mapping relationship between the physical memory address and the virtual address, the second virtual address corresponding to the physical memory address;
• a first GPU address obtaining module, configured to obtain the GPU address allocated for the task calculation request;
• a data copy instruction generation module, configured to generate a data copy instruction from the second virtual address to the GPU address; a first GPU driver interface calling module, configured to call the GPU driver interface to execute the data copy instruction;
• a task calculation request sending module, configured to send the task calculation request to the GPU;
• a processing state information generation module, configured to generate processing state information corresponding to the task calculation request after the GPU completes the calculation task corresponding to the task calculation request;
• a processing state information storage module, configured to store the processing state information.
  • the first virtual address includes a calculation data acquisition virtual address and a calculation result storage virtual address
  • the first virtual address acquisition module may be specifically used to: acquire the calculation data acquisition virtual address in the task calculation request
  • the physical memory address obtaining module may be specifically used to: obtain the first physical memory address corresponding to the virtual address obtained by the calculation data;
  • the second virtual address determining module may be specifically used to: determine the first physical The second virtual address corresponding to the memory address.
  • the GPU address obtaining module may be specifically used to: obtain the calculation data storage GPU address and the calculation result storage GPU address allocated for the task calculation request; the generation from the second virtual address to the The data copy instruction of the GPU address specifically includes: generating a data copy instruction from the second virtual address to the computing data storage GPU address.
  • the device may further include: a second physical memory address obtaining module, configured to obtain the second physical memory corresponding to the virtual address of the calculation result when the calculation result synchronization request sent by the client is obtained Address; the third virtual address acquisition module, used to determine the third virtual address corresponding to the second physical memory address based on the mapping relationship between the physical memory address and the virtual address; the second data copy instruction generation module, used to generate the calculation The result is copied from the calculation result storage GPU address to the data copy instruction of the third virtual address; the second GPU driver interface calling module is used to call the GPU driver interface to execute the data copy instruction.
  • the task calculation request obtaining module may be specifically used to: obtain a task calculation request sent by the client from a submission queue; the submission queue contains a plurality of unprocessed submissions submitted by the client Task calculation request; after generating the processing status information corresponding to the task calculation request, the device may further include: a processing status information sending module, configured to send the processing status information corresponding to the task calculation request to the completion queue, so The completion queue contains a plurality of processing status information submitted by the server but not read by the client.
• the embodiment of this specification also provides a data transmission device corresponding to FIG. 6, including: a data transmission request obtaining module, configured to obtain a data transmission request sent by an application; a first virtual address obtaining module, configured to obtain the first virtual address in the data transmission request; a physical memory address determining module, configured to determine the physical memory address corresponding to the first virtual address based on the mapping relationship between the physical memory address and the virtual address; and a data transmission request and physical memory address sending module, configured to send the data transmission request and the physical memory address to the server, so that the server performs data transmission according to the data transmission request and the physical memory address, wherein the server determines, based on the mapping relationship between the physical memory address and the virtual address, the second virtual address corresponding to the physical memory address; obtains the GPU address allocated for the data transmission request; generates a data copy instruction from the second virtual address to the GPU address; and calls the GPU driver interface to execute the data copy instruction.
  • the device may further include: a memory allocation request acquiring module, configured to acquire the memory allocation request sent by the application; and a data acquiring module, configured to acquire the memory allocation The data in the request; a data storage module for storing the data in a first physical memory address; a first virtual address generating module for mapping the first physical memory address to the process space of the application to generate A first virtual address corresponding to the first physical memory address; a storage module, configured to send the first virtual address to the application, and store the mapping relationship between the physical memory address and the first virtual address.
  • the physical memory address determining module may be specifically configured to determine the physical memory address corresponding to the first virtual address according to the mapping relationship.
  • The embodiment of the present specification also provides a task processing device corresponding to FIG. 7, including: a task processing request acquisition module, configured to acquire a task processing request sent by an application; a task processing request forwarding module, configured to forward the task processing request so that the server can acquire it; a synchronization request sending module, configured to send a synchronization request when the processing status information of the task processing request sent by the server is acquired; a first virtual address acquisition module, configured to acquire the first virtual address of the task processing request after the success notification of the synchronization request sent by the server is acquired; a physical memory address determining module, configured to determine, based on the mapping relationship between physical memory addresses and virtual addresses, the physical memory address corresponding to the first virtual address; a calculation result reading module, configured to read the calculation result of the task processing request from the physical memory address corresponding to the first virtual address; and a calculation result sending module, configured to send the calculation result of the task processing request to the application.
  • The first virtual address includes a calculation data acquisition virtual address and a calculation result storage virtual address. The first virtual address acquisition module may be specifically configured to acquire the calculation result storage virtual address of the task processing request, and the physical memory address determining module may be specifically configured to determine the physical memory address corresponding to the calculation result storage virtual address.
  • The task processing request forwarding module may be specifically configured to send the task processing request to a submission queue so that the server can acquire the task processing request from the submission queue; the submission queue contains multiple unprocessed task calculation requests submitted by the client.
  • The synchronization request sending module may be specifically configured to query the completion queue and send a synchronization request when the processing status information of the task processing request sent by the server is found; the completion queue contains multiple pieces of processing status information submitted by the server and not yet read by the client.
  • The first virtual address acquisition module may be specifically configured to query the completion queue and acquire the first virtual address of the task processing request after the success notification of the synchronization request sent by the server is found.
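The submission-queue/completion-queue exchange above can be sketched as follows. This is an illustrative assumption, not the patent's code: `queue.Queue` stands in for whatever shared-memory queues the real client and server would use, and the request fields are hypothetical.

```python
from queue import Queue

submission_queue = Queue()   # unprocessed task calculation requests from the client
completion_queue = Queue()   # processing status information written by the server

# Client side: forward the task processing request into the submission queue.
request = {"task_id": 1, "first_virtual_address": 0x7F00_0000_0000}
submission_queue.put(request)

# Server side (simulated): take the request, process it, publish its status.
task = submission_queue.get()
completion_queue.put({"task_id": task["task_id"], "status": "done",
                      "first_virtual_address": task["first_virtual_address"]})

# Client side: query the completion queue, then use the first virtual
# address it carries to locate the calculation result.
status = completion_queue.get()
assert status["status"] == "done"
result_va = status["first_virtual_address"]
```

The queues decouple the two sides: the client never blocks on the GPU directly, and the server batches whatever requests are pending.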
  • the embodiment of this specification also provides a device corresponding to the above method.
  • FIG. 9 is a schematic structural diagram of a data transmission device corresponding to FIG. 3 provided by an embodiment of this specification.
  • The device 900 may include: at least one processor 910; and a memory communicatively connected with the at least one processor 910, where the memory stores instructions 920 executable by the at least one processor 910, and the instructions 920 are executed by the at least one processor 910 so that the at least one processor 910 can: obtain the data transmission request sent by the client; obtain the first virtual address in the data transmission request; obtain the physical memory address corresponding to the first virtual address; determine, based on the mapping relationship between physical memory addresses and virtual addresses, the second virtual address corresponding to the physical memory address; obtain the GPU address allocated for the data transmission request; generate a data copy instruction from the second virtual address to the GPU address; and call an interface of the GPU driver to execute the data copy instruction.
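The server-side flow of translating the client's first virtual address to a physical memory address, then to the server's own second virtual address, and finally copying toward an allocated GPU address can be sketched in plain Python. All tables, addresses, and the `gpu_copy` function are hypothetical stand-ins for the real GPU driver interface, chosen only to make the address chain visible.

```python
client_virt_to_phys = {0x7F00_0000_0000: 0x1000}   # client-side mapping relationship
phys_to_server_virt = {0x1000: 0x5500_0000_0000}   # server-side mapping relationship

gpu_memory = {}   # simulated GPU address space

def gpu_copy(src_virtual: int, gpu_addr: int, payload: bytes) -> None:
    """Stand-in for calling the GPU driver's copy interface."""
    gpu_memory[gpu_addr] = (src_virtual, payload)

def handle_data_transmission(request: dict) -> int:
    first_va = request["first_virtual_address"]      # from the data transmission request
    phys = client_virt_to_phys[first_va]             # physical memory address
    second_va = phys_to_server_virt[phys]            # second virtual address
    gpu_addr = 0xA000                                # GPU address allocated for this request
    gpu_copy(second_va, gpu_addr, request["data"])   # execute the data copy instruction
    return gpu_addr

addr = handle_data_transmission(
    {"first_virtual_address": 0x7F00_0000_0000, "data": b"tensor"})
assert gpu_memory[addr] == (0x5500_0000_0000, b"tensor")
```

The point of the two-step translation is that client and server share the same physical pages while each keeps its own virtual view, so no extra host-to-host copy is needed before the host-to-GPU copy.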
  • the embodiment of this specification also provides a task processing device corresponding to FIG. 4.
  • The device may include: at least one processor; and a memory communicatively connected with the at least one processor, where the memory stores instructions executable by the at least one processor, and the instructions are executed by the at least one processor so that the at least one processor can: obtain a task calculation request sent by a client; obtain a first virtual address in the task calculation request; obtain a physical memory address corresponding to the first virtual address; determine, based on the mapping relationship between physical memory addresses and virtual addresses, the second virtual address corresponding to the physical memory address; obtain the GPU address allocated for the task calculation request; generate a data copy instruction from the second virtual address to the GPU address; call an interface of the GPU driver to execute the data copy instruction; send the task calculation request to the GPU; when the GPU completes the calculation task corresponding to the task calculation request, generate the processing status information corresponding to the task calculation request; and store the processing status information.
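The tail of that flow, generating and storing processing status information once the GPU finishes, can be sketched as below. This is a hedged illustration with hypothetical names; `run_on_gpu` merely simulates a calculation so the status-handling step has something to record.

```python
status_store = {}   # stands in for the server's stored processing status information

def run_on_gpu(task: dict) -> bytes:
    """Stand-in for sending the task calculation request to the GPU."""
    return task["data"][::-1]   # pretend computation: reverse the bytes

def process_task(task: dict) -> None:
    result = run_on_gpu(task)
    # When the GPU completes the calculation task, generate and store the
    # processing status information corresponding to the task calculation request.
    status_store[task["task_id"]] = {"status": "completed", "result": result}

process_task({"task_id": 7, "data": b"abc"})
assert status_store[7]["status"] == "completed"
```

The stored status is what the client later discovers by querying the completion queue before issuing its synchronization request.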
  • the embodiment of this specification also provides a data transmission device corresponding to FIG. 6.
  • The device may include: at least one processor; and a memory communicatively connected with the at least one processor, where the memory stores instructions executable by the at least one processor, and the instructions are executed by the at least one processor so that the at least one processor can: obtain a data transmission request sent by an application; obtain a first virtual address in the data transmission request; determine, based on the mapping relationship between physical memory addresses and virtual addresses, the physical memory address corresponding to the first virtual address; and send the data transmission request and the physical memory address to the server, so that the server performs data transmission according to the data transmission request and the physical memory address, where the server determines the second virtual address corresponding to the physical memory address based on the mapping relationship between physical memory addresses and virtual addresses, obtains the GPU address allocated for the data transmission request, generates a data copy instruction from the second virtual address to the GPU address, and calls an interface of the GPU driver to execute the data copy instruction.
  • the embodiment of this specification also provides a task processing device corresponding to FIG. 7.
  • The device may include: at least one processor; and a memory communicatively connected with the at least one processor, where the memory stores instructions executable by the at least one processor, and the instructions are executed by the at least one processor so that the at least one processor can: obtain the task processing request sent by the application; forward the task processing request so that the server can obtain it; issue a synchronization request when the processing status information of the task processing request sent by the server is obtained; obtain the first virtual address of the task processing request after the success notification of the synchronization request sent by the server is obtained; determine, based on the mapping relationship between physical memory addresses and virtual addresses, the physical memory address corresponding to the first virtual address; read the calculation result of the task processing request from the physical memory address corresponding to the first virtual address; and send the calculation result of the task processing request to the application.
  • An embodiment of the present specification provides a computer-readable medium having computer-readable instructions stored thereon, and the computer-readable instructions can be executed by a processor to implement any of the methods described above.
  • The improvement of a technology can be clearly distinguished as a hardware improvement (for example, an improvement in a circuit structure such as a diode, a transistor, or a switch) or a software improvement (an improvement in a method flow). However, the improvement of many methods and processes today can be regarded as a direct improvement of the hardware circuit structure: designers almost always obtain the corresponding hardware circuit structure by programming the improved method flow into a hardware circuit. Therefore, it cannot be said that the improvement of a method flow cannot be realized by a hardware entity module.
  • For example, a Programmable Logic Device (PLD) (for example, a Field Programmable Gate Array (FPGA)) can be programmed using a Hardware Description Language (HDL). There is not just one HDL but many kinds, such as ABEL (Advanced Boolean Expression Language), AHDL (Altera Hardware Description Language), HDCal, JHDL, Lava, Lola, MyHDL, PALASM, and RHDL; at present, VHDL (Very-High-Speed Integrated Circuit Hardware Description Language) and Verilog are the most commonly used.
  • the controller can be implemented in any suitable manner.
  • The controller can take the form of, for example, a microprocessor or processor together with a computer-readable medium storing computer-readable program code (such as software or firmware) executable by the (micro)processor, logic gates, switches, an application-specific integrated circuit (ASIC), a programmable logic controller, or an embedded microcontroller. Examples of controllers include, but are not limited to, the following microcontrollers: ARC625D, Atmel AT91SAM, Microchip PIC18F26K20, and Silicon Labs C8051F320. A memory controller can also be implemented as part of the memory control logic.
  • In addition to implementing the controller purely as computer-readable program code, it is entirely possible to logically program the method steps so that the controller realizes the same functions in the form of logic gates, switches, application-specific integrated circuits, programmable logic controllers, embedded microcontrollers, and the like. Therefore, such a controller can be regarded as a hardware component, and the devices included in it for realizing various functions can also be regarded as structures within the hardware component. Or even, the devices for realizing various functions can be regarded both as software modules for implementing the method and as structures within the hardware component.
  • a typical implementation device is a computer.
  • The computer may be, for example, a personal computer, a laptop computer, a cellular phone, a camera phone, a smartphone, a personal digital assistant, a media player, a navigation device, an email device, a game console, a tablet computer, a wearable device, or any combination of these devices.
  • The embodiments of the present invention can be provided as a method, a system, or a computer program product. Therefore, the present invention may take the form of an entirely hardware embodiment, an entirely software embodiment, or an embodiment combining software and hardware. Moreover, the present invention may take the form of a computer program product implemented on one or more computer-usable storage media (including but not limited to disk storage, CD-ROM, optical storage, etc.) containing computer-usable program code.
  • These computer program instructions can also be stored in a computer-readable memory that can direct a computer or other programmable data processing device to work in a specific manner, so that the instructions stored in the computer-readable memory produce an article of manufacture including an instruction device, and the instruction device implements the functions specified in one or more processes in the flowchart and/or one or more blocks in the block diagram.
  • These computer program instructions can also be loaded onto a computer or other programmable data processing device, so that a series of operation steps are executed on the computer or other programmable device to produce computer-implemented processing; the instructions executed on the computer or other programmable device thus provide steps for implementing the functions specified in one or more processes in the flowchart and/or one or more blocks in the block diagram.
  • the computing device includes one or more processors (CPUs), input/output interfaces, network interfaces, and memory.
  • The memory may include non-persistent memory, random access memory (RAM), and/or non-volatile memory in a computer-readable medium, such as read-only memory (ROM) or flash memory (flash RAM). Memory is an example of a computer-readable medium.
  • Computer-readable media include permanent and non-permanent, removable and non-removable media, and information storage can be realized by any method or technology.
  • the information can be computer-readable instructions, data structures, program modules, or other data.
  • Examples of computer storage media include, but are not limited to, phase-change memory (PRAM), static random access memory (SRAM), dynamic random access memory (DRAM), other types of random access memory (RAM), read-only memory (ROM), electrically erasable programmable read-only memory (EEPROM), flash memory or other memory technologies, CD-ROM, digital versatile disc (DVD) or other optical storage, magnetic cassettes, magnetic tape storage or other magnetic storage devices, or any other non-transmission media that can be used to store information accessible by a computing device. As defined herein, computer-readable media do not include transitory media, such as modulated data signals and carrier waves.
  • This application may be described in the general context of computer-executable instructions executed by a computer, such as a program module.
  • program modules include routines, programs, objects, components, data structures, etc. that perform specific tasks or implement specific abstract data types.
  • This application can also be practiced in distributed computing environments. In these distributed computing environments, tasks are performed by remote processing devices connected through a communication network. In a distributed computing environment, program modules can be located in local and remote computer storage media including storage devices.

Abstract

Data transmission and task processing methods, apparatuses and devices. A solution comprises: obtaining a data transmission request sent by a client (302); obtaining a first virtual address in the data transmission request (304); obtaining a physical memory address corresponding to the first virtual address (306); on the basis of a mapping relationship between the physical memory address and the virtual address, determining a second virtual address corresponding to the physical memory address (308); obtaining a GPU address allocated for the data transmission request (310); generating a data copying instruction from the second virtual address to the GPU address (312); and calling an interface driven by the GPU to execute the data copying instruction (314).

Description

Method, device, and equipment for data transmission and task processing
Technical Field
This application relates to the field of computer technology, and in particular to a data transmission and task processing method, device, and equipment.
Background
Deep learning (DL) is widely used in the field of artificial intelligence (AI). AI, and deep learning in particular, is now widely applied in scenarios such as payment (face recognition), loss assessment (image recognition), and interaction and customer service (speech recognition), and has achieved remarkable results. Typical DL tasks require strong computing power, so most such tasks currently run on acceleration devices such as the Graphics Processing Unit (GPU). Accelerator chips, represented by graphics processors, are an important guarantee for the development and deployment of AI. However, GPUs generally suffer from low average utilization in use.
A solution for increasing the average utilization of the GPU is therefore needed.
Summary of the Invention
In view of this, the embodiments of the present application provide a data transmission and task processing method, device, and equipment for improving the efficiency of GPU resource virtualization.
An embodiment of this specification provides a data transmission method applied to the server in a GPU virtualization system, including: obtaining a data transmission request sent by a client; obtaining a first virtual address in the data transmission request; obtaining the physical memory address corresponding to the first virtual address; determining, based on the mapping relationship between physical memory addresses and virtual addresses, a second virtual address corresponding to the physical memory address; obtaining a GPU address allocated for the data transmission request; generating a data copy instruction from the second virtual address to the GPU address; and calling an interface of the GPU driver to execute the data copy instruction.
An embodiment of this specification provides a task processing method applied to the server in a GPU virtualization system, including: obtaining a task calculation request sent by a client; obtaining a first virtual address in the task calculation request; obtaining the physical memory address corresponding to the first virtual address; determining, based on the mapping relationship between physical memory addresses and virtual addresses, a second virtual address corresponding to the physical memory address; obtaining a GPU address allocated for the task calculation request; generating a data copy instruction from the second virtual address to the GPU address; calling an interface of the GPU driver to execute the data copy instruction; sending the task calculation request to the GPU; when the GPU completes the calculation task corresponding to the task calculation request, generating the processing status information corresponding to the task calculation request; and storing the processing status information.
An embodiment of this specification provides a data transmission method applied to the client in a GPU virtualization system, including: obtaining a data transmission request sent by an application; obtaining a first virtual address in the data transmission request; determining, based on the mapping relationship between physical memory addresses and virtual addresses, the physical memory address corresponding to the first virtual address; and sending the data transmission request and the physical memory address to the server, so that the server performs data transmission according to the data transmission request and the physical memory address, where the server determines a second virtual address corresponding to the physical memory address based on the mapping relationship between physical memory addresses and virtual addresses, obtains a GPU address allocated for the data transmission request, generates a data copy instruction from the second virtual address to the GPU address, and calls an interface of the GPU driver to execute the data copy instruction.
An embodiment of this specification provides a task processing method applied to the client in a GPU virtualization system, including: obtaining a task processing request sent by an application; forwarding the task processing request so that the server can obtain it; issuing a synchronization request when the processing status information of the task processing request sent by the server is obtained; obtaining the first virtual address of the task processing request after the success notification of the synchronization request sent by the server is obtained; determining, based on the mapping relationship between physical memory addresses and virtual addresses, the physical memory address corresponding to the first virtual address; reading the calculation result of the task processing request from the physical memory address corresponding to the first virtual address; and sending the calculation result of the task processing request to the application.
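The client-side data transmission method can be sketched as follows. This is an illustrative assumption rather than the patent's code: the client resolves the application's first virtual address to a physical memory address via its stored mapping and forwards both to the server, instead of copying the data across the client/server boundary itself. Names and addresses are hypothetical.

```python
virt_to_phys = {0x7F00_0000_0000: 0x2000}   # client's stored mapping relationship

sent_to_server = []   # stands in for the client -> server channel

def send_to_server(request: dict, phys_addr: int) -> None:
    """Stand-in for transmitting the request and physical address to the server."""
    sent_to_server.append((request, phys_addr))

def handle_app_request(request: dict) -> None:
    first_va = request["first_virtual_address"]
    phys = virt_to_phys[first_va]   # determine the physical memory address
    send_to_server(request, phys)   # server performs the GPU copy from here

handle_app_request({"first_virtual_address": 0x7F00_0000_0000, "op": "h2d_copy"})
assert sent_to_server[0][1] == 0x2000
```

Only the address crosses the boundary; the payload stays in the shared physical pages, which is what avoids the extra copy that plain request forwarding would incur.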
An embodiment of this specification provides a data transmission device, including: a data transmission request acquisition module, configured to obtain a data transmission request sent by a client; a first virtual address acquisition module, configured to obtain a first virtual address in the data transmission request; a physical memory address acquisition module, configured to obtain the physical memory address corresponding to the first virtual address; a second virtual address determining module, configured to determine, based on the mapping relationship between physical memory addresses and virtual addresses, a second virtual address corresponding to the physical memory address; a GPU address acquisition module, configured to obtain a GPU address allocated for the data transmission request; a data copy instruction generation module, configured to generate a data copy instruction from the second virtual address to the GPU address; and an interface calling module, configured to call an interface of the GPU driver to execute the data copy instruction.
An embodiment of this specification provides a task processing device, including: a task calculation request acquisition module, configured to obtain a task calculation request sent by a client; a first virtual address acquisition module, configured to obtain a first virtual address in the task calculation request; a physical memory address acquisition module, configured to obtain the physical memory address corresponding to the first virtual address; a second virtual address determining module, configured to determine, based on the mapping relationship between physical memory addresses and virtual addresses, a second virtual address corresponding to the physical memory address; a GPU address acquisition module, configured to obtain a GPU address allocated for the task calculation request; a data copy instruction generation module, configured to generate a data copy instruction from the second virtual address to the GPU address; a GPU driver interface calling module, configured to call an interface of the GPU driver to execute the data copy instruction; a task calculation request sending module, configured to send the task calculation request to the GPU; a processing status information generation module, configured to generate the processing status information corresponding to the task calculation request after the GPU completes the calculation task corresponding to the task calculation request; and a processing status information storage module, configured to store the processing status information.
An embodiment of this specification provides a data transmission device, including: a data transmission request acquisition module, configured to obtain a data transmission request sent by an application; a first virtual address acquisition module, configured to obtain a first virtual address in the data transmission request; a physical memory address determining module, configured to determine, based on the mapping relationship between physical memory addresses and virtual addresses, the physical memory address corresponding to the first virtual address; and a data transmission request and physical memory address sending module, configured to send the data transmission request and the physical memory address to the server, so that the server performs data transmission according to the data transmission request and the physical memory address, where the server determines a second virtual address corresponding to the physical memory address based on the mapping relationship between physical memory addresses and virtual addresses, obtains a GPU address allocated for the data transmission request, generates a data copy instruction from the second virtual address to the GPU address, and calls an interface of the GPU driver to execute the data copy instruction.
An embodiment of this specification provides a task processing device, including: a task processing request acquisition module, configured to obtain a task processing request sent by an application; a task processing request forwarding module, configured to forward the task processing request so that the server can obtain it; a synchronization request sending module, configured to issue a synchronization request when the processing status information of the task processing request sent by the server is obtained; a first virtual address acquisition module, configured to obtain the first virtual address of the task processing request after the success notification of the synchronization request sent by the server is obtained; a physical memory address determining module, configured to determine, based on the mapping relationship between physical memory addresses and virtual addresses, the physical memory address corresponding to the first virtual address; a calculation result reading module, configured to read the calculation result of the task processing request from the physical memory address corresponding to the first virtual address; and a calculation result sending module, configured to send the calculation result of the task processing request to the application.
An embodiment of this specification provides a data transmission equipment, including: at least one processor; and a memory communicatively connected with the at least one processor, where the memory stores instructions executable by the at least one processor, and the instructions are executed by the at least one processor so that the at least one processor can: obtain a data transmission request sent by a client; obtain a first virtual address in the data transmission request; obtain the physical memory address corresponding to the first virtual address; determine, based on the mapping relationship between physical memory addresses and virtual addresses, a second virtual address corresponding to the physical memory address; obtain a GPU address allocated for the data transmission request; generate a data copy instruction from the second virtual address to the GPU address; and call an interface of the GPU driver to execute the data copy instruction.
An embodiment of this specification provides a task processing equipment, including: at least one processor; and a memory communicatively connected with the at least one processor, where the memory stores instructions executable by the at least one processor, and the instructions are executed by the at least one processor so that the at least one processor can: obtain a task calculation request sent by a client; obtain a first virtual address in the task calculation request; obtain the physical memory address corresponding to the first virtual address; determine, based on the mapping relationship between physical memory addresses and virtual addresses, a second virtual address corresponding to the physical memory address; obtain a GPU address allocated for the task calculation request; generate a data copy instruction from the second virtual address to the GPU address; call an interface of the GPU driver to execute the data copy instruction; send the task calculation request to the GPU; when the GPU completes the calculation task corresponding to the task calculation request, generate the processing status information corresponding to the task calculation request; and store the processing status information.
An embodiment of this specification provides a data transmission equipment, including: at least one processor; and a memory communicatively connected with the at least one processor, where the memory stores instructions executable by the at least one processor, and the instructions are executed by the at least one processor so that the at least one processor can: obtain a data transmission request sent by an application; obtain a first virtual address in the data transmission request; determine, based on the mapping relationship between physical memory addresses and virtual addresses, the physical memory address corresponding to the first virtual address; and send the data transmission request and the physical memory address to the server, so that the server performs data transmission according to the data transmission request and the physical memory address, where the server determines a second virtual address corresponding to the physical memory address based on the mapping relationship between physical memory addresses and virtual addresses, obtains a GPU address allocated for the data transmission request, generates a data copy instruction from the second virtual address to the GPU address, and calls an interface of the GPU driver to execute the data copy instruction.
An embodiment of this specification provides a task processing equipment, including: at least one processor; and a memory communicatively connected with the at least one processor, where the memory stores instructions executable by the at least one processor, and the instructions are executed by the at least one processor so that the at least one processor can: obtain the task processing request sent by the application; forward the task processing request so that the server can obtain it; issue a synchronization request when the processing status information of the task processing request sent by the server is obtained; obtain the first virtual address of the task processing request after the success notification of the synchronization request sent by the server is obtained; determine, based on the mapping relationship between physical memory addresses and virtual addresses, the physical memory address corresponding to the first virtual address; read the calculation result of the task processing request from the physical memory address corresponding to the first virtual address; and send the calculation result of the task processing request to the application.
本说明书实施例提供的一种计算机可读介质,其上存储有计算机可读指令,所述计算机可读指令可被处理器执行以实现上述方法。An embodiment of the present specification provides a computer-readable medium having computer-readable instructions stored thereon, and the computer-readable instructions can be executed by a processor to implement the foregoing method.
本说明书实施例采用的上述至少一个技术方案能够达到以下有益效果:通过将物理内存地址与客户端的第一虚拟地址、服务端的第二虚拟地址分别进行映射,即客户端与服务端共享相同的物理内存,生成数据拷贝指令直接将物理内存地址中的数据拷贝到GPU地址中。因为保留了客户端的第一虚拟地址和服务端的第二虚拟地址,即实现了不对原有的程序进行改动,实现了透明化。而且,只经历了一次从物理内存地址到GPU地址的数据拷贝,减少了数据内存拷贝的次数,因此无需为客户端和服务端分配临时内存来存储拷贝的数据,显著提高利用率,有效降低成本,提高了GPU资源虚拟化的效率。The above-mentioned at least one technical solution adopted in the embodiments of this specification can achieve the following beneficial effects: by mapping the physical memory address with the first virtual address of the client and the second virtual address of the server respectively, that is, the client and the server share the same physical In the memory, the data copy instruction is generated to directly copy the data in the physical memory address to the GPU address. Because the first virtual address of the client and the second virtual address of the server are retained, the original program is not changed, and transparency is realized. Moreover, only one copy of the data from the physical memory address to the GPU address has been experienced, which reduces the number of data memory copies. Therefore, there is no need to allocate temporary memory for the client and server to store the copied data, which significantly improves the utilization rate and effectively reduces the cost. , Improve the efficiency of GPU resource virtualization.
Description of the drawings
The drawings described here are used to provide a further understanding of this application and constitute a part of this application. The exemplary embodiments of this application and their descriptions are used to explain this application and do not constitute an improper limitation of this application. In the drawings:
Figure 1 is a schematic flowchart of GPU software virtualization based on request/data forwarding (multiple memory copies);
Figure 2 is a schematic diagram of the overall module structure of GPU virtualization based on transparent memory sharing provided by an embodiment of this specification;
Figure 3 is a schematic flowchart of a data transmission method provided by an embodiment of this specification;
Figure 4 is a schematic flowchart of a task processing method provided by an embodiment of this specification;
Figure 5 is a schematic diagram of multi-queue management provided by an embodiment of this specification;
Figure 6 is a schematic flowchart of another data transmission method provided by an embodiment of this specification;
Figure 7 is a schematic flowchart of another task processing method provided by an embodiment of this specification;
Figure 8 is a schematic structural diagram of a data transmission apparatus, corresponding to Figure 3, provided by an embodiment of this specification;
Figure 9 is a schematic structural diagram of a data transmission device, corresponding to Figure 3, provided by an embodiment of this specification.
Detailed description
To make the purposes, technical solutions, and advantages of this application clearer, the technical solutions of this application are described clearly and completely below in conjunction with specific embodiments of this application and the corresponding drawings. Obviously, the described embodiments are only some of the embodiments of this application rather than all of them. Based on the embodiments in this application, all other embodiments obtained by a person of ordinary skill in the art without creative effort fall within the protection scope of this application.
Transparent memory sharing: a data-sharing method on the data plane, specially designed and optimized so that existing programs need no modification, no data is moved, and virtualization remains transparent.
Known software virtualization usually requires the client to intercept GPU requests and then forward the requests (for example, resource applications, task submissions, and so on) to the server. After receiving a request, the server performs the necessary control, sends the request to the GPU driver, and finally forwards the result back to the client, as shown in Figure 1. In this process, the biggest performance constraint is that all GPU requests (commands, parameters) and data must undergo multiple memory copies: the GPU request and data are first copied to the client, then copied from the client to the server, and finally copied from the server into the GPU address. Although this approach achieves virtualization, compared with the native, non-virtualized approach of copying the GPU request and data directly into the GPU address, it incurs two additional data memory copies. If the memory copies are carried over the network, data-processing efficiency suffers, which fundamentally limits the performance of software virtualization; the two extra copies also incur considerable CPU and memory overhead.
To improve virtualization efficiency, this solution adopts a transparent memory sharing mechanism to implement GPU control requests and data exchange.
The technical solutions provided by the embodiments of this application are described in detail below with reference to the drawings.
Figure 2 is a schematic diagram of the overall module structure of GPU virtualization based on transparent memory sharing provided by an embodiment of this specification. As shown in Figure 2:
Models and applications: models such as Convolutional Neural Networks (CNN), Recurrent Neural Networks (RNN), Long Short-Term Memory networks (LSTM), Generative Adversarial Networks (GAN), and so on. Applications include model training and online model serving.
AI framework layer: common DL frameworks, such as TensorFlow, PyTorch, Caffe2, etc.
Server: responsible for GPU service and virtualization management; a long-running daemon on top of the GPU driver. Usually one GPU server has one service instance (which can be packaged and run in Docker). According to configuration policies (for example, environment variables or configuration files), it partitions and pre-allocates virtual GPU resources, maintains the mapping relationship between virtual and physical resources, and reports to the cluster scheduler (such as K8S, Kubemaker).
Client: a client library shipped with the application model (for example, packaged together as a Docker image), responsible for the discovery, application, and access of virtual GPU resources and for necessary built-in optimizations, and for recording the correspondence between virtual and physical resources. The client exports a GPU access API, such as Nvidia CUDA, to the application (internally it merely decouples resource application from the underlying implementation). One server (or one physical GPU) can run multiple clients.
Among these components, to improve virtualization efficiency, a transparent shared-memory module and an efficient GPU request-processing module (send, process, return result) must be deployed on both the server and the client; this is the core of this solution. In addition, the client and server in the embodiments of this specification are the client and server of the GPU virtualization system, which differ from the client and server in the conventional sense: they are constructed in software, modeled on the functions and roles of conventional clients and servers, and are not physical entities.
Scheduler: a cluster-wide GPU resource scheduler, such as K8S. A client application first applies to the scheduler for GPU resources, and the scheduler is then responsible for scheduling execution.
Embodiment 1
Figure 3 is a schematic flowchart of a data transmission method provided by an embodiment of this specification. From a program perspective, the execution subject of the process may be the server in a GPU virtualization system.
As shown in Figure 3, the process may include steps 302 to 314.
Step 302: obtain a data transmission request sent by the client.
In the embodiments of this specification, the data transmission request is initiated by an application on the client; the client receives the request initiated by the application and forwards it to the server.
The data transmission request may be an independent data request or a subtask within a task processing request. For example, if the GPU is required to complete a computing task, the data to be computed must first be transmitted to a GPU address, and the GPU then computes on the data at that GPU address. The data transmission request in step 302 may then be the data-transmission subtask of that computing task.
Step 304: obtain the first virtual address in the data transmission request.
A virtual address identifies an address that is not a physical entity address.
A data transmission request usually includes the direction of the transfer, that is, transferring data from one address to another. However, on the client or in a client application, a data request can only use virtual addresses and cannot include actual physical addresses. This prevents the client from referencing physical addresses directly and performing unsafe operations such as data tampering. Therefore, the address in the data transmission request is a virtual address; it is called the first virtual address here only to distinguish it from other virtual addresses, and "first" carries no other meaning.
Step 306: obtain the physical memory address corresponding to the first virtual address.
To determine the actual address of the data, the physical memory address corresponding to the first virtual address must be determined from the relationship between virtual addresses and physical memory addresses. This physical memory address is an address in a memory pool set aside specifically for storing shared data, and the specific location can be expressed as an offset.
The relationship between virtual addresses and physical memory addresses can be kept in a table stored on the client. When forwarding the data transmission request, the client may also send the physical memory address corresponding to the first virtual address to the server. Alternatively, the server sends the client a request to "obtain the physical memory address corresponding to the first virtual address", and after receiving the request the client forwards that physical memory address to the server.
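The client-side bookkeeping described above amounts to a small translation table that is consulted when a request is forwarded. The following sketch is illustrative only; the class name, request layout, and address values are assumptions, not the implementation of this specification:

```python
class ClientAddressTable:
    """Client-side map from first virtual addresses to physical memory
    addresses (offsets into the shared memory pool). Illustrative sketch."""

    def __init__(self):
        self._table = {}  # first virtual address -> physical memory address

    def register(self, first_vaddr, phys_addr):
        # Record the mapping when the shared buffer is set up.
        self._table[first_vaddr] = phys_addr

    def forward_request(self, request):
        # When forwarding a data transmission request, attach the physical
        # memory address corresponding to the request's first virtual address.
        phys_addr = self._table[request["first_vaddr"]]
        return {**request, "phys_addr": phys_addr}


table = ClientAddressTable()
table.register(first_vaddr=0x7F00, phys_addr=0x1000)  # pool offset 0x1000
forwarded = table.forward_request({"first_vaddr": 0x7F00, "length": 4096})
```

In this sketch the first variant is shown (the client attaches the physical address when forwarding); the second variant would invoke the same lookup only after the server asks for it.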
Step 308: determine, based on the mapping relationship between physical memory addresses and virtual addresses, the second virtual address corresponding to the physical memory address.
To leave existing programs unmodified, move no data, and achieve transparent memory sharing, the server, after obtaining the physical memory address actually corresponding to the data, must also translate it into a virtual address that can be used and recognized within the server's process, identified here as the second virtual address. The second virtual address appears only in the program of the server's process and has nothing to do with the client. Moreover, within the server's program the correspondence between a physical memory address and a second virtual address is unique, that is, one physical memory address corresponds to exactly one second virtual address, so the second virtual address corresponding to the physical memory address can be determined from the mapping relationship between physical memory addresses and virtual addresses. This mapping relationship can be represented as a mapping table: the server queries whether the table contains the physical memory address. If it is found, the physical memory address has already been mapped on the server and no further mapping is needed; if it is not found, the physical memory address has not yet been mapped on the server, so a mapping is performed to obtain the second virtual address.
Step 310: obtain the GPU address allocated for the data transmission request.
The data transmission request copies the data at the physical memory address into a GPU address so that the GPU can compute on it. Therefore, a corresponding GPU address must also be allocated for the transmitted data.
The server can issue a GPU address allocation instruction according to the data transmission request, and the GPU driver calls its interface to complete the GPU address allocation. The GPU driver then sends the allocated GPU address to the server. Of course, the server may also obtain the GPU address allocated for the data transmission request through other methods, which the embodiments of this specification do not specifically limit.
The data transmission request may also include the length of the data, and the subject that allocates the GPU address may be the server or another execution subject.
Step 312: generate a data copy instruction from the second virtual address to the GPU address.
Determining the second virtual address fixes the source address of the data, and determining the GPU address fixes its destination address; the data copy instruction can therefore be generated from the source address and the destination address.
Step 314: call the interface of the GPU driver to execute the data copy instruction.
The data copy instruction generated in step 312 requires the GPU driver to carry out the interaction with the GPU. Therefore, the interface of the GPU driver must be called to execute the data copy instruction, that is, to complete the data copy from the physical memory address to the GPU address.
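Steps 312 and 314 can be summarized in a minimal sketch. The `GpuDriver` interface below is a hypothetical stand-in for the real driver API (in practice this would be something like a CUDA host-to-device copy); it exists only to make the two-step structure concrete:

```python
from dataclasses import dataclass


@dataclass
class CopyInstruction:
    src_vaddr: int     # second virtual address (server's view of shared memory)
    dst_gpu_addr: int  # GPU address allocated for the request
    length: int


class GpuDriver:
    """Hypothetical driver interface; a real server would invoke an actual
    host-to-device copy through the GPU driver here."""

    def __init__(self):
        self.executed = []

    def copy_to_device(self, instr):
        self.executed.append(instr)  # pretend the copy ran
        return True


def transmit(second_vaddr, gpu_addr, length, driver):
    # Step 312: generate the copy instruction from source and destination.
    instr = CopyInstruction(second_vaddr, gpu_addr, length)
    # Step 314: call the driver interface to execute it.
    return driver.copy_to_device(instr)


driver = GpuDriver()
ok = transmit(second_vaddr=0x2000, gpu_addr=0xD000, length=4096, driver=driver)
```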
In the method of Figure 3, the physical memory address is mapped to the client's first virtual address and to the server's second virtual address respectively, that is, the client and the server share the same physical memory, and the generated data copy instruction copies the data at the physical memory address directly into the GPU address. Because the client's first virtual address and the server's second virtual address are retained, the original program requires no modification, achieving transparency. Moreover, the data undergoes only a single copy, from the physical memory address to the GPU address, which reduces the number of data memory copies; no temporary memory needs to be allocated on the client or the server to store copied data, which significantly improves utilization, effectively reduces cost, and improves the efficiency of GPU resource virtualization.
Based on the method of Figure 3, the embodiments of this specification also provide some specific implementations of the method, described below.
Specifically, before determining the second virtual address corresponding to the physical memory address, the method may further include: judging whether the physical memory address is stored in a mapping table; if not, generating a second virtual address corresponding to the physical memory address, and storing the mapping relationship between the physical memory address and the second virtual address in the mapping table. Determining the second virtual address corresponding to the physical memory address then specifically includes: if so, obtaining the second virtual address corresponding to the physical memory address.
In one or more embodiments of this specification, when the server determines the second virtual address corresponding to a physical memory address, it first determines whether the physical memory address exists in the mapping table. If it exists, the mapping of the physical memory address to a server-side virtual address has already been completed; in other words, a physical memory address is mapped only once on the server. If the physical memory address does not exist in the mapping table, it has not yet been mapped by the server, so a mapping operation can be performed to generate a second virtual address for the physical memory address, and the mapping relationship between the physical memory address and the second virtual address is then stored.
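The map-once behavior described above is a lookup-or-create operation on the server's mapping table. The sketch below illustrates the logic; the address generation is a placeholder assumption (a real server would obtain the second virtual address from an mmap-style mapping of the shared physical memory, not from a counter):

```python
class ServerMappingTable:
    """Server-side map from physical memory addresses to second virtual
    addresses; each physical address is mapped at most once (sketch)."""

    def __init__(self):
        self._phys_to_vaddr = {}
        self._next_vaddr = 0x2000  # placeholder base address for illustration

    def second_vaddr_for(self, phys_addr):
        if phys_addr in self._phys_to_vaddr:
            # Already in the mapping table: reuse the existing mapping.
            return self._phys_to_vaddr[phys_addr]
        # Not yet mapped on the server: create the mapping exactly once
        # and store it in the table for later lookups.
        vaddr = self._next_vaddr
        self._next_vaddr += 0x1000
        self._phys_to_vaddr[phys_addr] = vaddr
        return vaddr


table = ServerMappingTable()
first = table.second_vaddr_for(0x1000)
second = table.second_vaddr_for(0x1000)  # same physical address: same mapping
```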
Embodiment 2
Figure 4 is a schematic flowchart of a task processing method provided by an embodiment of this specification. From a program perspective, the execution subject of the process may be the server in a GPU virtualization system. As shown in Figure 4, the process may include steps 402 to 420.
Step 402: obtain a task calculation request sent by the client.
The task calculation request may be any of various computing tasks, such as matrix multiplication, convolution, and so on. The task calculation request is initiated by an application; after the client obtains it, the client forwards it to the server.
Step 404: obtain the first virtual address in the task calculation request.
The task calculation request may include some information related to the data to be computed; however, the request does not directly contain the data itself but instead records the addresses at which the data is stored. Since actual physical memory addresses cannot appear on the client, the addresses are expressed as virtual addresses. For example, for multiplying matrix A by matrix B, the first virtual addresses are the addresses at which matrix A and matrix B are stored. There may be one first virtual address or several; the number depends on the specific task calculation request.
Step 406: obtain the physical memory address corresponding to the first virtual address.
After the first virtual address is determined in step 404, the physical memory address corresponding to the first virtual address must also be determined; for details, refer to step 306 of Embodiment 1.
Step 408: determine, based on the mapping relationship between physical memory addresses and virtual addresses, the second virtual address corresponding to the physical memory address.
Step 410: obtain the GPU address allocated for the task calculation request.
For the task calculation request, the data at the physical memory address is copied into a GPU address so that the GPU can compute on it. Therefore, a corresponding GPU address must also be allocated for the transmitted data.
The server can issue a GPU address allocation instruction according to the request, and the GPU driver calls its interface to complete the GPU address allocation. The GPU driver then sends the allocated GPU address to the server. Of course, the server may also obtain the allocated GPU address through other methods, which the embodiments of this specification do not specifically limit.
The request may also include the length of the data, and the subject that allocates the GPU address may be the server or another execution subject.
It should be noted that the GPU addresses allocated for the task calculation request may include storage addresses for the computation data and may also include a storage address for the calculation result. GPU addresses may also be allocated according to the size of the data.
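Size-based allocation of the two address classes noted above can be sketched as follows. The bump-pointer allocator and the base address are illustrative assumptions standing in for whatever allocation scheme the GPU driver actually provides:

```python
class GpuAllocator:
    """Toy size-based bump allocator returning separate GPU addresses for
    computation data and for the calculation result (illustrative)."""

    def __init__(self, base=0xD000_0000):
        self._next = base

    def _alloc(self, size):
        addr = self._next
        self._next += size
        return addr

    def alloc_for_task(self, data_size, result_size):
        return {
            "data_gpu_addr": self._alloc(data_size),      # holds copied source data
            "result_gpu_addr": self._alloc(result_size),  # holds the GPU's result
        }


alloc = GpuAllocator()
addrs = alloc.alloc_for_task(data_size=8192, result_size=4096)
```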
Step 412: generate a data copy instruction from the second virtual address to the GPU address.
Step 414: call the interface of the GPU driver to execute the data copy instruction.
Step 416: send the task calculation request to the GPU.
Step 418: after the GPU completes the calculation task corresponding to the task calculation request, generate processing status information corresponding to the task calculation request.
The GPU obtains the data at the GPU address, performs the computation according to the task calculation request, and stores the result at the GPU address allocated for the calculation result. The GPU then notifies the server that the computation is complete, and the server generates the processing status information corresponding to the task calculation request.
Step 420: store the processing status information.
The server may actively send the processing status information to the client, or it may store the information so that the client can query it.
In the method of Figure 4, the physical memory address is mapped to the client's first virtual address and to the server's second virtual address respectively, that is, the client and the server share the same physical memory, and the generated data copy instruction copies the data at the physical memory address directly into the GPU address. Because the client's first virtual address and the server's second virtual address are retained, the original program requires no modification, achieving transparency. Moreover, the data undergoes only a single copy, from the physical memory address to the GPU address, which reduces the number of data memory copies; no temporary memory needs to be allocated on the client or the server to store copied data, which significantly improves utilization, effectively reduces cost, and improves the efficiency of GPU resource virtualization.
在说明书的一个或者多个实施例中,第一虚拟地址可以有不同的作用,比如,一类用来存储计算数据,一类用来存储计算结果。具体的,所述第一虚拟地址包括计算数据获取虚拟地址和计算结果存放虚拟地址,所述获取所述任务计算请求中的第一虚拟地址,具体可以包括:获取所述任务计算请求中的计算数据获取虚拟地址;所述获取所述第一虚拟地址对应的物理内存地址,具体可以包括:获取所述计算数据获取虚拟地址对应的第一物理内存地址;所述确定所述物理内存地址对应的第二虚拟地址,具体可以包括:确定所述第一物理内存地址对应的第二虚拟地址。In one or more embodiments of the specification, the first virtual address may have different functions, for example, one type is used to store calculation data, and the other type is used to store calculation results. Specifically, the first virtual address includes a calculation data acquisition virtual address and a calculation result storage virtual address, and the acquisition of the first virtual address in the task calculation request may specifically include: acquiring the calculation in the task calculation request Data acquisition virtual address; said acquiring the physical memory address corresponding to the first virtual address may specifically include: acquiring the calculation data acquiring the first physical memory address corresponding to the virtual address; said determining the physical memory address corresponding to the The second virtual address may specifically include: determining the second virtual address corresponding to the first physical memory address.
由于第一虚拟地址包括两类,因此在获取第一虚拟地址的时候,要判断此虚拟地址是用来获取计算数据的,还是用来存储计算结果的。如果第一虚拟地址的种类分错了,那么就会导致任务技术请求失败。例如,矩阵A与矩阵B相乘,矩阵A存储在虚拟地址A中,矩阵B存储在虚拟地址B中,计算结果C存储在虚拟地址C中,如果,从虚拟地址A和虚拟地址C中获取数据,然后再进行乘法运算,显然计算结果是错误的。Since the first virtual address includes two types, when obtaining the first virtual address, it is necessary to determine whether the virtual address is used to obtain calculation data or to store calculation results. If the type of the first virtual address is wrong, it will cause the task technology request to fail. For example, matrix A is multiplied by matrix B, matrix A is stored in virtual address A, matrix B is stored in virtual address B, and the calculation result C is stored in virtual address C. If, get from virtual address A and virtual address C Data, and then multiply, obviously the calculation result is wrong.
Since the first virtual address includes multiple types, classification is also required when allocating GPU addresses for the task calculation request. Specifically, acquiring the GPU address allocated for the task calculation request includes: acquiring the calculation data storage GPU address and the calculation result storage GPU address allocated for the task calculation request. Generating the data copy instruction from the second virtual address to the GPU address includes: generating a data copy instruction from the second virtual address to the calculation data storage GPU address.

The calculation data storage GPU address is used to store the data copied from the physical memory address, i.e., the source data of the calculation. The calculation result storage GPU address is used to store the calculation result. After the GPU completes the calculation task, it temporarily stores the result at the calculation result storage GPU address; once the client issues a call, the data is copied to the corresponding physical memory address.

In one or more embodiments of the present specification, after the processing status information corresponding to the task calculation request is generated, the method may further include: when a calculation result synchronization request sent by the client is received, acquiring the second physical memory address corresponding to the calculation result storage virtual address; determining, based on the mapping relationship between physical memory addresses and virtual addresses, the third virtual address corresponding to the second physical memory address; generating a data copy instruction for copying the calculation result from the calculation result storage GPU address to the third virtual address; and invoking the interface of the GPU driver to execute the data copy instruction.

Although the GPU has completed the calculation task, the client or the application cannot yet obtain the calculation result. Therefore, after learning that the GPU has finished the calculation task, the client issues a calculation result synchronization request, i.e., a request to copy the calculation result from the GPU address to the physical memory address so that the client's application can read it.

Copying the calculation result from the GPU address to the physical memory address is the reverse of copying calculation data from the physical memory address to the GPU address. Since the server initiates this data copy, it must determine which virtual address on the server side corresponds to the physical memory address where the calculation result is stored, i.e., the third virtual address.
In the task calculation request, the first virtual address includes the calculation data acquisition virtual address and the calculation result storage virtual address. Based on the mapping relationship between physical memory addresses and virtual addresses, the second physical memory address corresponding to the calculation result storage virtual address can be determined. The server then determines, according to the mapping relationship, the third virtual address corresponding to the second physical memory address, generates a data copy instruction for copying the calculation result from the calculation result storage GPU address to the third virtual address, and invokes the interface of the GPU driver to execute the data copy instruction.
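The result read-back path just described can be sketched as follows. All table names, addresses, and helpers are hypothetical stand-ins; the real system would resolve addresses through its own mapping tables and invoke the GPU driver to perform the copy:

```python
# Illustrative sketch of the result read-back path.
# client-side table: calculation result storage virtual address -> physical offset
client_vaddr_to_phys = {"vaddr_C": 0x2000}
# server-side table: physical offset -> server virtual address
phys_to_server_vaddr = {0x2000: "svaddr_C"}

gpu_memory = {"gpu_result_addr": [1.0, 2.0, 3.0]}  # simulated GPU memory
host_memory = {}                                    # simulated shared host memory

def sync_result(result_vaddr, gpu_addr):
    """Resolve the third virtual address and copy the result back to host memory."""
    phys = client_vaddr_to_phys[result_vaddr]   # second physical memory address
    server_vaddr = phys_to_server_vaddr[phys]   # third virtual address
    # the generated copy instruction: GPU address -> third virtual address
    host_memory[server_vaddr] = list(gpu_memory[gpu_addr])
    return server_vaddr

third_vaddr = sync_result("vaddr_C", "gpu_result_addr")
```

Because the third virtual address and the client's result-storage virtual address map to the same physical memory, the client can read the result without any further copy.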
To improve scalability, for example with multiple GPU cards in a single machine, a multi-queue request management method is proposed: each GPU maintains multiple request queues, including a submission queue (SubmitQ) and a completion queue (CompleteQ), as shown in Fig. 5. Specifically, acquiring the task calculation request sent by the client may include: acquiring the task calculation request sent by the client from a submission queue, where the submission queue contains multiple unprocessed task calculation requests submitted by the client. After generating the processing status information corresponding to the task calculation request, the method further includes: sending the processing status information corresponding to the task calculation request to a completion queue, where the completion queue contains multiple pieces of processing status information submitted by the server that have not yet been read by the client.

When the client submits a GPU request, it places the request in the submission queue and returns immediately (for example, for an asynchronous request); a worker thread is then responsible for delivering the request to the server, or the server actively polls the submission queue for new requests. After receiving a request, the server processes it and places the result in the completion queue. The client can asynchronously query the completion queue for processing status information.
Multiple task calculation requests are stored in the submission queue, ordered by the time they entered it: a request that entered the submission queue earlier is fetched by the server first, and one that entered later is fetched afterwards. For example, if task 1, task 2, and task 3 are submitted to the submission queue in that order, the server fetches task 1 first, then task 2, and finally task 3. The completion queue works on the same principle.
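The FIFO behavior of the two queues can be sketched minimally as follows (queue contents and task names are illustrative assumptions):

```python
from collections import deque

# Minimal sketch of the per-GPU queue pair.
submit_q = deque()    # SubmitQ: unprocessed requests, FIFO order
complete_q = deque()  # CompleteQ: status entries not yet read by the client

# client submits three tasks in order
for task in ("task1", "task2", "task3"):
    submit_q.append(task)

# server drains the submission queue in arrival order and posts statuses
served = []
while submit_q:
    req = submit_q.popleft()   # earliest request first
    served.append(req)
    complete_q.append((req, "done"))
```

The client would then pop entries from `complete_q` in the same arrival order to learn each task's processing status.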
This solution further combines the queue mechanism with transparent shared memory: all requests of the client and the server are allocated in shared memory, which avoids the memory copy of the request message that would otherwise occur when a request is forwarded.
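A minimal sketch of this idea, assuming a POSIX-style shared memory segment and an invented request format; the real system's segment layout and message encoding are not specified in the source:

```python
from multiprocessing import shared_memory

# Sketch: build the request message directly in shared memory, so that handing
# it to the server requires no copy of the request body. Size and message
# format are illustrative assumptions.
shm = shared_memory.SharedMemory(create=True, size=64)
try:
    request = b"MATMUL vaddr_A vaddr_B vaddr_C"
    shm.buf[:len(request)] = request       # client writes the request in place

    # "server side": attach to the same segment and read the request as-is
    peer = shared_memory.SharedMemory(name=shm.name)
    received = bytes(peer.buf[:len(request)])
    peer.close()
finally:
    shm.close()
    shm.unlink()
```

Because both endpoints view the same segment, only a reference (the segment name or an offset) needs to be passed between them, which is precisely the request-message copy the scheme avoids.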
This method proposes using an efficient software approach to make GPU hardware as efficiently and losslessly shareable as a CPU, thereby significantly improving utilization and effectively reducing cost. In the exclusive-use case, performance can additionally be optimized; in large-scale deployments, the pure-software virtualization approach simplifies operation, maintenance, and management.
Embodiment 3

Fig. 6 is a schematic flowchart of another data transmission method provided by an embodiment of the present specification. From a program perspective, the execution subject of the flow may be a client applied in a GPU virtualization system. As shown in Fig. 6, the flow may include steps 602 to 608.

Step 602: Acquire a data transmission request sent by an application.

In this embodiment, the application and the client reside together, and the data transmission request sent by the application is transmitted through the client.
Step 604: Acquire the first virtual address in the data transmission request.

The data address in the data transmission request is a virtual address; the client first needs to acquire the first virtual address before performing the subsequent operations.

Step 606: Determine, based on the mapping relationship between physical memory addresses and virtual addresses, the physical memory address corresponding to the first virtual address.

To determine the actual location of the data, the physical memory address corresponding to the first virtual address must be determined from the relationship between virtual addresses and physical memory addresses. This relationship may be kept in a table stored on the client. After acquiring the first virtual address, the client can look up the stored correspondence between the first virtual address and physical memory addresses to determine the physical memory address corresponding to the first virtual address.

Step 608: Send the data transmission request and the physical memory address to the server, so that the server performs the data transmission according to the data transmission request and the physical memory address. The server determines, based on the mapping relationship between physical memory addresses and virtual addresses, the second virtual address corresponding to the physical memory address; acquires the GPU address allocated for the data transmission request; generates a data copy instruction from the second virtual address to the GPU address; and invokes the interface of the GPU driver to execute the data copy instruction.
The data transmission method of Embodiment 3 and that of Embodiment 1 describe the same scheme from the perspectives of the client and the server respectively, so much of their content is similar; for details not explained in Embodiment 3, refer to the explanations in Embodiment 1.

In the method of Fig. 6, the physical memory address is mapped both to the first virtual address of the client and to the second virtual address of the server, i.e., the client and the server share the same physical memory, and the generated data copy instruction copies the data at the physical memory address directly to the GPU address. Because the first virtual address of the client and the second virtual address of the server are both preserved, the original program needs no modification, achieving transparency. Moreover, only a single data copy from the physical memory address to the GPU address takes place, reducing the number of memory copies; no temporary memory needs to be allocated for the client or the server to hold copied data, which significantly improves utilization, effectively reduces cost, and improves the efficiency of GPU resource virtualization.

In one or more embodiments of the present specification, before the data transmission request sent by the application is acquired, the method may further include: acquiring a memory allocation request sent by the application; acquiring the data in the memory allocation request; storing the data at a first physical memory address; mapping the first physical memory address into the process space of the application to generate a first virtual address corresponding to the first physical memory address; and sending the first virtual address to the application while storing the mapping relationship between the physical memory address and the first virtual address.
The application initiates a memory allocation request, for example by calling malloc(len). The client receives the memory allocation request and allocates, from a memory pool, a region satisfying the length requirement, for example a segment of length L starting at offset within the pool. It then maps this memory into the application's process space to obtain the mapped virtual address H. The virtual address H and the location information within the pool (the offset) are recorded in a mapping table, which may be implemented as a hash table. Finally, the virtual address H is returned to the application so that the application can read and write data normally.
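The allocation flow above can be sketched as follows. The pool size, the bump-pointer allocation strategy, the synthetic handle standing in for the mapped virtual address H, and the `my_malloc` helper are all illustrative assumptions:

```python
import mmap

# Sketch of client-side allocation from a shared pool.
POOL_SIZE = 4096
pool = mmap.mmap(-1, POOL_SIZE)   # anonymous mapping stands in for the shared pool
vaddr_table = {}                  # hash table: virtual address H -> pool offset
next_offset = 0

def my_malloc(length):
    """Allocate `length` bytes from the pool and record the H -> offset mapping."""
    global next_offset
    if next_offset + length > POOL_SIZE:
        raise MemoryError("pool exhausted")
    offset = next_offset
    next_offset += length
    # In the real system H is the address obtained by mapping this pool region
    # into the application's process space; a synthetic handle stands in here.
    handle = f"H@{offset}"
    vaddr_table[handle] = offset
    return handle

h = my_malloc(16)
pool.seek(vaddr_table[h])
pool.write(b"A" * 16)             # application reads/writes via H as usual
```

The server side would hold its own table keyed by the same pool offsets, which is what lets both sides resolve a shared physical location from their respective virtual addresses.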
In one or more embodiments of the present specification, determining the physical memory address corresponding to the first virtual address may specifically include: determining the physical memory address corresponding to the first virtual address according to the mapping relationship.

Embodiment 4

Fig. 7 is a schematic flowchart of another task processing method provided by an embodiment of the present specification. From a program perspective, the execution subject of the flow may be a client applied in a GPU virtualization system. As shown in Fig. 7, the flow may include steps 702 to 714.
Step 702: Acquire a task processing request sent by an application.

Step 704: Forward the task processing request so that the server can acquire it.

Step 706: When the processing status information of the task processing request sent by the server is acquired, issue a synchronization request.

Step 708: After the success notification of the synchronization request sent by the server is acquired, acquire the first virtual address of the task processing request.

Step 710: Determine, based on the mapping relationship between physical memory addresses and virtual addresses, the physical memory address corresponding to the first virtual address.

Step 712: Read the calculation result of the task processing request from the physical memory address corresponding to the first virtual address.

Step 714: Send the calculation result of the task processing request to the application.
Optionally, the first virtual address includes a calculation data acquisition virtual address and a calculation result storage virtual address.

Acquiring the first virtual address of the task processing request may specifically include: acquiring the calculation result storage virtual address of the task processing request.

Determining the physical memory address corresponding to the first virtual address may specifically include: determining the physical memory address corresponding to the calculation result storage virtual address.

Optionally, forwarding the task processing request may specifically include: sending the task processing request to a submission queue so that the server acquires the task processing request from the submission queue, where the submission queue contains multiple unprocessed task calculation requests submitted by the client.

Issuing a synchronization request when the processing status information of the task processing request sent by the server is acquired specifically includes: querying the completion queue and, when the processing status information of the task processing request sent by the server is found, issuing a synchronization request, where the completion queue contains multiple pieces of processing status information submitted by the server that have not yet been read by the client.

Acquiring the first virtual address of the task processing request after the success notification of the synchronization request sent by the server is acquired specifically includes: querying the completion queue and, when the success notification of the synchronization request sent by the server is found, acquiring the first virtual address of the task processing request.
When the client submits a GPU request, it places the request in the submission queue and returns immediately (for example, for an asynchronous request); a worker thread is then responsible for delivering the request to the server, or the server actively polls the submission queue for new requests. After receiving a request, the server processes it and places the result in the completion queue. The client can asynchronously query the completion queue for processing status information.

This solution further combines the queue mechanism with transparent shared memory: all requests of the client and the server are allocated in shared memory, which avoids the memory copy of the request message that would otherwise occur when a request is forwarded.

This method proposes using an efficient software approach to make GPU hardware as efficiently and losslessly shareable as a CPU, thereby significantly improving utilization and effectively reducing cost. In the exclusive-use case, performance can additionally be optimized; in large-scale deployments, the pure-software virtualization approach simplifies operation, maintenance, and management.
Embodiment 5
Another task processing method provided by an embodiment of the present specification is executed by a machine hosting both a client and a server. The method may include the following steps: the client acquires a task calculation request sent by an application, the client having a virtual memory sharing function; the client sends the task calculation request to a submission queue; the server, which also has a virtual memory sharing function, acquires the task calculation request from the submission queue; the server acquires the calculation data acquisition virtual address and the calculation result storage virtual address in the task calculation request; the server acquires the first physical memory address corresponding to the calculation data acquisition virtual address; the server determines the second virtual address corresponding to the first physical memory address; the server acquires the GPU address allocated for the task calculation request; the server generates a data copy instruction from the second virtual address to the GPU address, so as to invoke an interface to execute the data copy from the physical memory address to the GPU address; the server sends the task calculation request to the GPU; after the GPU completes the calculation task corresponding to the task calculation request, the server generates the processing status information corresponding to the task calculation request and sends it to a completion queue; the client queries the completion queue for the processing status information corresponding to the task calculation request and, on finding it, issues a synchronization request to the submission queue; on acquiring the synchronization request from the submission queue, the server acquires the second physical memory address corresponding to the calculation result storage virtual address; the server determines the third virtual address corresponding to the second physical memory address; the server generates an instruction for copying the calculation result from the GPU address to the third virtual address, so as to invoke an interface to execute the data copy from the GPU address to the second physical memory address; when the data copy is complete, the server sends a synchronization completion notification to the completion queue; and when the synchronization completion notification is found in the completion queue, the client acquires the calculation result from the second physical memory address and sends it to the application.
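The end-to-end flow of this embodiment can be condensed into the following sketch, in which plain Python dictionaries and deques stand in for the shared physical memory, the two mapping tables, GPU memory, and the two queues; every name and the "double each element" task are illustrative assumptions:

```python
from collections import deque

# Compact end-to-end sketch of Embodiment 5.
submit_q, complete_q = deque(), deque()
phys_mem = {0x1000: [1, 2, 3], 0x2000: None}       # shared physical memory
client_map = {"vaddr_in": 0x1000, "vaddr_out": 0x2000}  # client virtual -> physical
server_map = {0x1000: "sv_in", 0x2000: "sv_out"}        # physical -> server virtual
gpu_mem = {}

# client: submit the task calculation request
submit_q.append({"op": "double", "in_phys": client_map["vaddr_in"],
                 "out_phys": client_map["vaddr_out"]})

# server: fetch the request, copy data to the GPU, "compute", post the status
req = submit_q.popleft()
second_vaddr = server_map[req["in_phys"]]           # second virtual address
gpu_mem["gpu_in"] = list(phys_mem[req["in_phys"]])  # one copy: host -> GPU
gpu_mem["gpu_out"] = [2 * v for v in gpu_mem["gpu_in"]]
complete_q.append((req["op"], "done"))

# client: sees the status, issues a synchronization request
_, status = complete_q.popleft()
submit_q.append({"op": "sync", "out_phys": client_map["vaddr_out"]})

# server: copy the result back through the third virtual address
sync = submit_q.popleft()
third_vaddr = server_map[sync["out_phys"]]          # third virtual address
phys_mem[sync["out_phys"]] = list(gpu_mem["gpu_out"])  # GPU -> host
complete_q.append(("sync", "done"))

result = phys_mem[client_map["vaddr_out"]]          # client reads the result
```

Each direction involves exactly one host/GPU copy; the host side never duplicates the data, because the client and the server address the same physical memory through their respective virtual addresses.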
The program adopted by the methods provided in the embodiments of the present specification runs in user mode and can be applied in user space. There are multiple implementations for different scenarios, allowing flexible deployment. These are summarized as follows:

1. Bare-metal environment (no virtualization technology): both the server and the client run on the host OS (e.g., Linux). The server takes over all GPU access through the GPU driver and, depending on configuration, may use a given GPU0 exclusively or share GPU1. If the client and the server are on the same machine, communication can use IPC (e.g., UNIX socket, pipe, or shmem); if they are not on the same machine, socket/RDMA communication is used.

2. Containerized environment: in a container environment, the server can run in containerized form, taking over the physical GPUs and exporting virtual GPU resources. Clients (e.g., K8S pods) run on the same physical machine and link to the server; client-server communication can use IPC or the network.

3. Virtual machine environment: in a typical virtual machine environment, a GPU is passed through to a specific virtual machine; the server or client is then started inside the VM guest OS, after which the setup is equivalent to the bare-metal environment.
The technical effects achievable by the above solution are as follows:

1. High performance: the transparent memory sharing mechanism avoids extra memory copies, and polling-based multi-queue request processing can efficiently handle the high-frequency request calls of typical deep learning tasks. Performance is significantly improved over known methods: software virtualization using this method can achieve virtually no performance loss, and its virtualization efficiency is clearly better than known hardware and software virtualization solutions from industry and academia.

2. Low overhead: thanks to the transparent shared memory mechanism, no temporary memory needs to be allocated, greatly reducing memory overhead; efficient lock-free polling also keeps CPU overhead low (constant overhead).

3. Scalability: owing to the above efficiency and low overhead, the solution can handle concurrent access to multiple GPU cards in a single machine.

4. Transparent and non-intrusive: existing applications need no modification or recompilation, API-level compatibility is maintained, and the core framework can be conveniently extended to support other heterogeneous acceleration devices, such as NPUs.

5. Based on the transparent memory sharing above, multiple request queues, including submission and completion queues, are provided for each device, improving scalability and handling concurrent access by multiple cards.

6. Low overhead: runtime extra memory allocation is greatly reduced, and a single CPU core suffices to support multi-card concurrency.

7. Universality, flexibility, and extensibility: the method supports multiple deployment environments, can interface with all known AI frameworks and models, and is transparent and non-intrusive; the core method is independent of GPU devices and can also support other acceleration devices, such as Alibaba's AI chips.
Based on the same idea, the embodiments of the present specification further provide apparatuses corresponding to the above methods. Fig. 8 is a schematic structural diagram of a data transmission apparatus, corresponding to Fig. 3, provided by an embodiment of the present specification. As shown in Fig. 8, the apparatus may include: a data transmission request acquisition module 801, configured to acquire a data transmission request sent by a client; a first virtual address acquisition module 802, configured to acquire the first virtual address in the data transmission request; a physical memory address acquisition module 803, configured to acquire the physical memory address corresponding to the first virtual address; a second virtual address determination module 804, configured to determine, based on the mapping relationship between physical memory addresses and virtual addresses, the second virtual address corresponding to the physical memory address; a GPU address acquisition module 805, configured to acquire the GPU address allocated for the data transmission request; a data copy instruction generation module 806, configured to generate a data copy instruction from the second virtual address to the GPU address; and an interface invocation module 807, configured to invoke the interface of the GPU driver to execute the data copy instruction.

Optionally, the apparatus may further include: a judgment module, configured to judge whether the physical memory address is stored in a mapping table;

a second virtual address generation module, configured to, if not, generate a second virtual address corresponding to the physical memory address and store the mapping relationship between the physical memory address and the second virtual address in the mapping table.

The second virtual address determination module 804 may be specifically configured to: if so, acquire the second virtual address corresponding to the physical memory address.
本说明书实施例还提供了对应于图4的一种任务处理装置,所述装置包括:任务计算请求获取模块,用于获取客户端发送的任务计算请求;第一虚拟地址获取模块,用于获取所述任务计算请求中的第一虚拟地址;物理内存地址获取模块,用于获取所述第一虚拟地址对应的物理内存地址;第二虚拟地址确定模块,用于基于物理内存地址与虚拟地址的映射关系,确定所述物理内存地址对应的第二虚拟地址;第一GPU地址获取模块,用于获取为所述任务计算请求分配的GPU地址;数据拷贝指令生成模块,用于生成从所述第二虚拟地址至所述GPU地址的数据拷贝指令;第一GPU驱动接口调用模块,用于调用GPU驱动的接口执行所述数据拷贝指令;任务计算请求发送模块,用于将所述任务计算请求发送至GPU;处理状态信息生成模块,用于当所述GPU完成所述任务计算请求对应的计算任务后,生成所述任务计算请求对应的处理状态信息;处理状态信息存储模块,用于存储所述处理状态信息。The embodiment of this specification also provides a task processing device corresponding to FIG. 4, the device includes: a task calculation request obtaining module, configured to obtain a task calculation request sent by a client; and a first virtual address obtaining module, configured to obtain The first virtual address in the task calculation request; a physical memory address obtaining module for obtaining the physical memory address corresponding to the first virtual address; a second virtual address determining module for calculating the physical memory address and the virtual address The mapping relationship is used to determine the second virtual address corresponding to the physical memory address; the first GPU address obtaining module is used to obtain the GPU address allocated for the task calculation request; the data copy instruction generation module is used to generate the slave 2. The data copy instruction from the virtual address to the GPU address; the first GPU driver interface calling module is used to call the GPU driver interface to execute the data copy instruction; the task calculation request sending module is used to send the task calculation request To the GPU; a processing state information generation module, used to generate processing state information corresponding to the task calculation request after the GPU completes the calculation task corresponding to the task calculation request; processing state information storage module, used to store the Processing status information.
可选的,所述第一虚拟地址包括计算数据获取虚拟地址和计算结果存放虚拟地址,所述第一虚拟地址获取模块,具体可以用于:获取所述任务计算请求中的计算数据获取虚拟地址;所述物理内存地址获取模块,具体可以用于:获取所述计算数据获取虚拟地址对应的第一物理内存地址;所述第二虚拟地址确定模块,具体可以用于:确定所述第一物理内存地址对应的第二虚拟地址。Optionally, the first virtual address includes a calculation data acquisition virtual address and a calculation result storage virtual address, and the first virtual address acquisition module may be specifically used to: acquire the calculation data acquisition virtual address in the task calculation request The physical memory address obtaining module may be specifically used to: obtain the first physical memory address corresponding to the virtual address obtained by the calculation data; the second virtual address determining module may be specifically used to: determine the first physical The second virtual address corresponding to the memory address.
Optionally, the GPU address obtaining module may be specifically configured to obtain a calculation-data storage GPU address and a calculation-result storage GPU address allocated for the task calculation request; and the generating a data copy instruction from the second virtual address to the GPU address specifically includes: generating a data copy instruction from the second virtual address to the calculation-data storage GPU address.
Optionally, the apparatus may further include: a second physical memory address obtaining module, configured to obtain, when a calculation result synchronization request sent by the client is obtained, a second physical memory address corresponding to the calculation-result storage virtual address; a third virtual address obtaining module, configured to determine, based on the mapping relationship between physical memory addresses and virtual addresses, a third virtual address corresponding to the second physical memory address; a second data copy instruction generating module, configured to generate a data copy instruction for copying a calculation result from the calculation-result storage GPU address to the third virtual address; and a second GPU driver interface calling module, configured to call the interface of the GPU driver to execute the data copy instruction.
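The optional result sync-back path can be illustrated with a small sketch: translate the result-storage virtual address to its second physical address, derive the server-side third virtual address, and copy the result out of GPU memory. Every name and the dict stand-ins are assumptions for illustration only:

```python
def sync_result(result_va, va_to_pa, pa_to_server_va, gpu_memory, host_memory,
                result_gpu_addr):
    """Handle a calculation-result synchronization request (illustrative):
    result-storage VA -> second physical address -> third virtual address,
    then copy the result from GPU memory back to host memory."""
    pa = va_to_pa[result_va]                  # second physical memory address
    third_va = pa_to_server_va[pa]            # third virtual address
    host_memory[third_va] = gpu_memory[result_gpu_addr]   # GPU -> host copy
    return host_memory[third_va]
```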
Optionally, the task calculation request obtaining module may be specifically configured to obtain the task calculation request sent by the client from a submission queue, where the submission queue contains multiple unprocessed task calculation requests submitted by the client. After the processing state information corresponding to the task calculation request is generated, the apparatus may further include: a processing state information sending module, configured to send the processing state information corresponding to the task calculation request to a completion queue, where the completion queue contains multiple pieces of processing state information submitted by the server and not yet read by the client.
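The submission/completion queue exchange described above can be modeled minimally as follows. The queue entries' field names and the immediate "done" state are assumptions; a real server would dispatch each request to the GPU before posting its processing state:

```python
from collections import deque

# Submission queue: client -> server, unprocessed task calculation requests.
# Completion queue: server -> client, processing state not yet read by the client.
submission_queue = deque()
completion_queue = deque()

def client_submit(request):
    submission_queue.append(request)

def server_poll():
    """Drain the submission queue and post one state entry per handled request."""
    while submission_queue:
        request = submission_queue.popleft()
        # ... dispatch the request to the GPU here ...
        completion_queue.append({"req_id": request["req_id"], "state": "done"})

def client_check():
    """Read (and consume) one piece of processing state information, if any."""
    return completion_queue.popleft() if completion_queue else None
```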
The embodiments of this specification further provide a data transmission apparatus corresponding to FIG. 6, including: a data transmission request obtaining module, configured to obtain a data transmission request sent by an application; a first virtual address obtaining module, configured to obtain a first virtual address in the data transmission request; a physical memory address determining module, configured to determine, based on a mapping relationship between physical memory addresses and virtual addresses, a physical memory address corresponding to the first virtual address; and a data transmission request and physical memory address sending module, configured to send the data transmission request and the physical memory address to a server, so that the server performs data transmission according to the data transmission request and the physical memory address, where the server determines, based on a mapping relationship between physical memory addresses and virtual addresses, a second virtual address corresponding to the physical memory address; obtains a GPU address allocated for the data transmission request; generates a data copy instruction from the second virtual address to the GPU address; and calls an interface of a GPU driver to execute the data copy instruction.
Optionally, before the obtaining a data transmission request sent by an application, the apparatus may further include: a memory allocation request obtaining module, configured to obtain a memory allocation request sent by the application; a data obtaining module, configured to obtain data in the memory allocation request; a data storage module, configured to store the data at a first physical memory address; a first virtual address generating module, configured to map the first physical memory address into a process space of the application to generate a first virtual address corresponding to the first physical memory address; and a storage module, configured to send the first virtual address to the application and store a mapping relationship between the physical memory address and the first virtual address.
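The allocation path above (store the data at a first physical address, map it into the application's process space, return the first virtual address, and keep the mapping) can be sketched like this. Addresses here are synthetic integers, not real physical memory; a real implementation would allocate shared memory and mmap it into the application's process space:

```python
import itertools

_next_pa = itertools.count(0x1000, 0x1000)
physical_memory = {}   # "physical" address -> stored data
va_to_pa = {}          # stored mapping: first virtual address -> physical address

def alloc_and_map(data):
    """Store data at a first physical address, map it into the process space,
    remember the mapping, and hand the first virtual address to the application."""
    pa = next(_next_pa)
    physical_memory[pa] = data
    va = pa | 0x7F0000000000          # synthetic first virtual address
    va_to_pa[va] = pa
    return va

def resolve(va):
    """Determine the physical memory address from the stored mapping."""
    return va_to_pa[va]
```

With the mapping stored at allocation time, the later lookup in the data transmission path reduces to a single table query, which is what `resolve` models.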
Optionally, the physical memory address determining module may be specifically configured to determine, according to the mapping relationship, the physical memory address corresponding to the first virtual address.
The embodiments of this specification further provide a task processing apparatus corresponding to FIG. 7, including: a task processing request obtaining module, configured to obtain a task processing request sent by an application; a task processing request forwarding module, configured to forward the task processing request so that a server can obtain it; a synchronization request sending module, configured to issue a synchronization request when processing state information of the task processing request sent by the server is obtained; a first virtual address obtaining module, configured to obtain a first virtual address of the task processing request after a success notification of the synchronization request sent by the server is obtained; a physical memory address determining module, configured to determine, based on a mapping relationship between physical memory addresses and virtual addresses, a physical memory address corresponding to the first virtual address; a calculation result reading module, configured to read a calculation result of the task processing request from the physical memory address corresponding to the first virtual address; and a calculation result sending module, configured to send the calculation result of the task processing request to the application.
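The client-side sequence this apparatus implements (forward the request, wait for its processing state, synchronize, then read the result through the VA-to-PA mapping) can be sketched with a stub server. All names are illustrative assumptions:

```python
class StubServer:
    """Stand-in for the server; completes every task immediately so the
    client-side control flow can be exercised on its own."""

    def submit(self, task):
        pass                                  # request forwarded for the server

    def wait_for_status(self, task_id):
        return "done"                         # processing state information

    def sync(self, task_id):
        return True                           # success notification of the sync

def client_run(task, server, va_to_pa, physical_memory):
    server.submit(task)                       # forward the task processing request
    if server.wait_for_status(task["id"]) == "done":
        if server.sync(task["id"]):           # issue the synchronization request
            pa = va_to_pa[task["result_va"]]  # result-storage VA -> physical addr
            return physical_memory[pa]        # read the calculation result
    return None
```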
Optionally, the first virtual address includes a calculation-data acquisition virtual address and a calculation-result storage virtual address; the first virtual address obtaining module may be specifically configured to obtain the calculation-result storage virtual address of the task processing request; and the physical memory address determining module may be specifically configured to determine a physical memory address corresponding to the calculation-result storage virtual address.
Optionally, the task processing request forwarding module may be specifically configured to send the task processing request to a submission queue so that the server obtains the task processing request from the submission queue, where the submission queue contains multiple unprocessed task calculation requests submitted by the client; the synchronization request sending module may be specifically configured to query a completion queue and issue a synchronization request when the processing state information of the task processing request sent by the server is found, where the completion queue contains multiple pieces of processing state information submitted by the server and not yet read by the client; and the first virtual address obtaining module may be specifically configured to query the completion queue and obtain the first virtual address of the task processing request after the success notification of the synchronization request sent by the server is found.
Based on the same idea, the embodiments of this specification further provide devices corresponding to the foregoing methods.
FIG. 9 is a schematic structural diagram of a data transmission device corresponding to FIG. 3 according to an embodiment of this specification. As shown in FIG. 9, the device 900 may include: at least one processor 910; and a memory 930 communicatively connected to the at least one processor, where the memory 930 stores instructions 920 executable by the at least one processor 910, and the instructions are executed by the at least one processor 910 to enable the at least one processor 910 to: obtain a data transmission request sent by a client; obtain a first virtual address in the data transmission request; obtain a physical memory address corresponding to the first virtual address; determine, based on a mapping relationship between physical memory addresses and virtual addresses, a second virtual address corresponding to the physical memory address; obtain a GPU address allocated for the data transmission request; generate a data copy instruction from the second virtual address to the GPU address; and call an interface of a GPU driver to execute the data copy instruction.
The embodiments of this specification further provide a task processing device corresponding to FIG. 4. The device may include: at least one processor; and a memory communicatively connected to the at least one processor, where the memory stores instructions executable by the at least one processor, and the instructions are executed by the at least one processor to enable the at least one processor to: obtain a task calculation request sent by a client; obtain a first virtual address in the task calculation request; obtain a physical memory address corresponding to the first virtual address; determine, based on a mapping relationship between physical memory addresses and virtual addresses, a second virtual address corresponding to the physical memory address; obtain a GPU address allocated for the task calculation request; generate a data copy instruction from the second virtual address to the GPU address; call an interface of a GPU driver to execute the data copy instruction; send the task calculation request to a GPU; after the GPU completes a calculation task corresponding to the task calculation request, generate processing state information corresponding to the task calculation request; and store the processing state information.
The embodiments of this specification further provide a data transmission device corresponding to FIG. 6. The device may include: at least one processor; and a memory communicatively connected to the at least one processor, where the memory stores instructions executable by the at least one processor, and the instructions are executed by the at least one processor to enable the at least one processor to: obtain a data transmission request sent by an application; obtain a first virtual address in the data transmission request; determine, based on a mapping relationship between physical memory addresses and virtual addresses, a physical memory address corresponding to the first virtual address; and send the data transmission request and the physical memory address to a server, so that the server performs data transmission according to the data transmission request and the physical memory address, where the server determines, based on a mapping relationship between physical memory addresses and virtual addresses, a second virtual address corresponding to the physical memory address; obtains a GPU address allocated for the data transmission request; generates a data copy instruction from the second virtual address to the GPU address; and calls an interface of a GPU driver to execute the data copy instruction.
The embodiments of this specification further provide a task processing device corresponding to FIG. 7. The device may include: at least one processor; and a memory communicatively connected to the at least one processor, where the memory stores instructions executable by the at least one processor, and the instructions are executed by the at least one processor to enable the at least one processor to: obtain a task processing request sent by an application; forward the task processing request so that a server can obtain it; issue a synchronization request when processing state information of the task processing request sent by the server is obtained; after a success notification of the synchronization request sent by the server is obtained, obtain a first virtual address of the task processing request; determine, based on a mapping relationship between physical memory addresses and virtual addresses, a physical memory address corresponding to the first virtual address; read a calculation result of the task processing request from the physical memory address corresponding to the first virtual address; and send the calculation result of the task processing request to the application.
An embodiment of this specification further provides a computer-readable medium storing computer-readable instructions, where the computer-readable instructions can be executed by a processor to implement any one of the foregoing methods.
In the 1990s, an improvement to a technology could clearly be classified as either a hardware improvement (for example, an improvement to circuit structures such as diodes, transistors, and switches) or a software improvement (an improvement to a method flow). With the development of technology, however, improvements to many of today's method flows can be regarded as direct improvements to hardware circuit structures. Designers almost always obtain a corresponding hardware circuit structure by programming an improved method flow into a hardware circuit. Therefore, it cannot be said that an improvement to a method flow cannot be implemented by a hardware entity module. For example, a programmable logic device (PLD) (for example, a field programmable gate array (FPGA)) is an integrated circuit whose logic functions are determined by a user's programming of the device. A designer programs to "integrate" a digital system onto a single PLD, without asking a chip manufacturer to design and fabricate a dedicated integrated circuit chip. Moreover, instead of manually fabricating integrated circuit chips, this kind of programming is now mostly implemented with "logic compiler" software, which is similar to the software compiler used in program development, and the original code to be compiled must likewise be written in a specific programming language, known as a hardware description language (HDL). There is not just one HDL but many, such as ABEL (Advanced Boolean Expression Language), AHDL (Altera Hardware Description Language), Confluence, CUPL (Cornell University Programming Language), HDCal, JHDL (Java Hardware Description Language), Lava, Lola, MyHDL, PALASM, and RHDL (Ruby Hardware Description Language); VHDL (Very-High-Speed Integrated Circuit Hardware Description Language) and Verilog are currently the most commonly used. A person skilled in the art should also understand that a hardware circuit implementing a logic method flow can easily be obtained simply by performing slight logic programming on the method flow in one of the above hardware description languages and programming it into an integrated circuit.
The controller may be implemented in any suitable manner. For example, the controller may take the form of a microprocessor or processor together with a computer-readable medium storing computer-readable program code (such as software or firmware) executable by the (micro)processor, logic gates, switches, an application-specific integrated circuit (ASIC), a programmable logic controller, or an embedded microcontroller. Examples of controllers include, but are not limited to, the following microcontrollers: ARC 625D, Atmel AT91SAM, Microchip PIC18F26K20, and Silicon Labs C8051F320; a memory controller may also be implemented as part of the control logic of a memory. A person skilled in the art also knows that, in addition to implementing the controller purely as computer-readable program code, the method steps can be logically programmed so that the controller implements the same functions in the form of logic gates, switches, application-specific integrated circuits, programmable logic controllers, embedded microcontrollers, and the like. Such a controller can therefore be regarded as a hardware component, and the apparatuses included in it for implementing various functions can also be regarded as structures within the hardware component. Or, the apparatuses for implementing various functions can even be regarded both as software modules implementing the method and as structures within the hardware component.
The systems, apparatuses, modules, or units described in the foregoing embodiments may be specifically implemented by a computer chip or an entity, or by a product having certain functions. A typical implementation device is a computer. Specifically, the computer may be, for example, a personal computer, a laptop computer, a cellular phone, a camera phone, a smartphone, a personal digital assistant, a media player, a navigation device, an email device, a game console, a tablet computer, a wearable device, or a combination of any of these devices.
For convenience of description, the foregoing apparatuses are described by dividing their functions into various units. Of course, when this application is implemented, the functions of the units may be implemented in one or more pieces of software and/or hardware.
A person skilled in the art should understand that the embodiments of the present invention may be provided as a method, a system, or a computer program product. Therefore, the present invention may take the form of an entirely hardware embodiment, an entirely software embodiment, or an embodiment combining software and hardware aspects. Moreover, the present invention may take the form of a computer program product implemented on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, and optical storage) containing computer-usable program code.
The present invention is described with reference to flowcharts and/or block diagrams of the method, device (system), and computer program product according to the embodiments of the present invention. It should be understood that each process and/or block in the flowcharts and/or block diagrams, and combinations of processes and/or blocks in the flowcharts and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general-purpose computer, a special-purpose computer, an embedded processor, or another programmable data processing device to produce a machine, so that the instructions executed by the processor of the computer or the other programmable data processing device produce an apparatus for implementing the functions specified in one or more processes of the flowcharts and/or one or more blocks of the block diagrams.
These computer program instructions may also be stored in a computer-readable memory that can direct a computer or another programmable data processing device to work in a specific manner, so that the instructions stored in the computer-readable memory produce an article of manufacture including an instruction apparatus, where the instruction apparatus implements the functions specified in one or more processes of the flowcharts and/or one or more blocks of the block diagrams.
These computer program instructions may also be loaded onto a computer or another programmable data processing device, so that a series of operation steps are performed on the computer or the other programmable device to produce computer-implemented processing, and the instructions executed on the computer or the other programmable device thereby provide steps for implementing the functions specified in one or more processes of the flowcharts and/or one or more blocks of the block diagrams.
In a typical configuration, a computing device includes one or more processors (CPUs), an input/output interface, a network interface, and a memory.
The memory may include non-persistent storage in a computer-readable medium, random access memory (RAM), and/or non-volatile memory, such as read-only memory (ROM) or flash memory (flash RAM). Memory is an example of a computer-readable medium.
Computer-readable media include persistent and non-persistent, removable and non-removable media, and information storage can be implemented by any method or technology. The information may be computer-readable instructions, data structures, program modules, or other data. Examples of computer storage media include, but are not limited to, phase-change memory (PRAM), static random access memory (SRAM), dynamic random access memory (DRAM), other types of random access memory (RAM), read-only memory (ROM), electrically erasable programmable read-only memory (EEPROM), flash memory or other memory technologies, compact disc read-only memory (CD-ROM), digital versatile disc (DVD) or other optical storage, magnetic cassettes, magnetic tape or magnetic disk storage or other magnetic storage devices, or any other non-transmission medium, which can be used to store information accessible by a computing device. As defined herein, computer-readable media do not include transitory media, such as modulated data signals and carrier waves.
It should also be noted that the terms "include", "comprise", or any other variant thereof are intended to cover a non-exclusive inclusion, so that a process, method, commodity, or device that includes a series of elements includes not only those elements but also other elements not expressly listed, or elements inherent to such a process, method, commodity, or device. In the absence of further limitation, an element defined by the phrase "including a/an ..." does not exclude the presence of additional identical elements in the process, method, commodity, or device that includes the element.
This application may be described in the general context of computer-executable instructions executed by a computer, such as program modules. Generally, program modules include routines, programs, objects, components, data structures, and the like that perform specific tasks or implement specific abstract data types. This application may also be practiced in distributed computing environments, in which tasks are performed by remote processing devices connected through a communication network. In a distributed computing environment, program modules may be located in both local and remote computer storage media, including storage devices.
The embodiments in this specification are described in a progressive manner; for identical or similar parts of the embodiments, reference may be made to one another, and each embodiment focuses on its differences from the other embodiments. In particular, the system embodiment is described relatively simply because it is substantially similar to the method embodiment, and for related parts, reference may be made to the description of the method embodiment.
The foregoing descriptions are merely embodiments of this application and are not intended to limit this application. For a person skilled in the art, this application may have various modifications and variations. Any modification, equivalent replacement, improvement, and the like made within the spirit and principles of this application shall fall within the scope of the claims of this application.

Claims (22)

  1. A data transmission method, applied to a server in a GPU virtualization system, the method comprising:
    obtaining a data transmission request sent by a client;
    obtaining a first virtual address in the data transmission request;
    obtaining a physical memory address corresponding to the first virtual address;
    determining, based on a mapping relationship between physical memory addresses and virtual addresses, a second virtual address corresponding to the physical memory address;
    obtaining a GPU address allocated for the data transmission request;
    generating a data copy instruction from the second virtual address to the GPU address; and
    calling an interface of a GPU driver to execute the data copy instruction.
  2. The method according to claim 1, before the determining a second virtual address corresponding to the physical memory address, further comprising:
    determining whether the physical memory address is stored in a mapping table; and
    if not, generating a second virtual address corresponding to the physical memory address, and storing a mapping relationship between the physical memory address and the second virtual address in the mapping table;
    wherein the determining a second virtual address corresponding to the physical memory address specifically comprises:
    if yes, obtaining the second virtual address corresponding to the physical memory address.
  3. A task processing method, applied to a server in a GPU virtualization system, the method comprising:
    acquiring a task computation request sent by a client;
    acquiring a first virtual address in the task computation request;
    acquiring a physical memory address corresponding to the first virtual address;
    determining, based on a mapping relationship between physical memory addresses and virtual addresses, a second virtual address corresponding to the physical memory address;
    acquiring a GPU address allocated for the task computation request;
    generating a data copy instruction from the second virtual address to the GPU address;
    invoking an interface of a GPU driver to execute the data copy instruction;
    sending the task computation request to a GPU;
    after the GPU completes a computation task corresponding to the task computation request, generating processing status information corresponding to the task computation request; and
    storing the processing status information.
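The server-side sequence of claim 3 can be sketched with every hardware interaction injected as a callable, so the control flow itself stays visible. All parameter names (`phys_of_vaddr`, `alloc_gpu_addr`, `gpu_driver`) are illustrative stand-ins, not the patent's interfaces.

```python
def handle_task_request(request, phys_of_vaddr, second_vaddr_of_phys,
                        alloc_gpu_addr, gpu_driver):
    """Claim 3 sketch: resolve addresses, copy input to the GPU, run the
    task, and return processing status information to be stored."""
    first_vaddr = request["first_vaddr"]
    phys = phys_of_vaddr(first_vaddr)          # first vaddr -> physical address
    second_vaddr = second_vaddr_of_phys(phys)  # physical -> second (server) vaddr
    gpu_addr = alloc_gpu_addr(request)         # GPU address for this request
    gpu_driver.copy(second_vaddr, gpu_addr)    # host -> device data copy
    gpu_driver.run(request, gpu_addr)          # hand the request to the GPU
    status = {"request": request["id"], "state": "done"}
    return status                              # processing status information
```

With stub callables this runs end to end, which makes the claimed ordering (resolve, copy, compute, report) easy to check in isolation.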
  4. The method according to claim 3, wherein the first virtual address comprises a computation-data acquisition virtual address and a computation-result storage virtual address, and the acquiring a first virtual address in the task computation request specifically comprises:
    acquiring the computation-data acquisition virtual address in the task computation request;
    the acquiring a physical memory address corresponding to the first virtual address specifically comprises:
    acquiring a first physical memory address corresponding to the computation-data acquisition virtual address; and
    the determining a second virtual address corresponding to the physical memory address specifically comprises:
    determining a second virtual address corresponding to the first physical memory address.
  5. The method according to claim 4, wherein the acquiring a GPU address allocated for the task computation request specifically comprises:
    acquiring a computation-data storage GPU address and a computation-result storage GPU address allocated for the task computation request; and
    the generating a data copy instruction from the second virtual address to the GPU address specifically comprises:
    generating a data copy instruction from the second virtual address to the computation-data storage GPU address.
  6. The method according to claim 5, after the generating processing status information corresponding to the task computation request, further comprising:
    when a computation-result synchronization request sent by the client is acquired, acquiring a second physical memory address corresponding to the computation-result storage virtual address;
    determining, based on the mapping relationship between physical memory addresses and virtual addresses, a third virtual address corresponding to the second physical memory address;
    generating a data copy instruction for copying a computation result from the computation-result storage GPU address to the third virtual address; and
    invoking an interface of the GPU driver to execute the data copy instruction.
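Claim 6's result path (a device-to-host copy into a freshly resolved third virtual address) can be sketched as below. `gpu_copy_to_host` and the mapping dictionary are hypothetical stand-ins for the GPU driver interface and the mapping table, not the patent's own names.

```python
def sync_result(result_gpu_addr, result_phys_addr, phys_to_vaddr, gpu_copy_to_host):
    """Claim 6 sketch: on a computation-result synchronization request,
    resolve the result's physical address to a third (server-side) virtual
    address, then copy the result out of GPU memory into that address."""
    third_vaddr = phys_to_vaddr[result_phys_addr]   # mapping-table lookup
    gpu_copy_to_host(result_gpu_addr, third_vaddr)  # device -> host copy
    return third_vaddr
```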
  7. The method according to claim 3, wherein the acquiring a task computation request sent by a client specifically comprises:
    acquiring the task computation request sent by the client from a submission queue, the submission queue containing a plurality of unprocessed task computation requests submitted by the client; and
    after the generating processing status information corresponding to the task computation request, the method further comprises:
    sending the processing status information corresponding to the task computation request to a completion queue, the completion queue containing a plurality of pieces of processing status information submitted by the server but not yet read by the client.
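The submission/completion queue pair of claim 7 can be sketched with in-process queues. This is an analogy: the names are illustrative, and the patent's queues are shared between the client and server processes rather than living inside one Python process.

```python
from queue import Queue

# Claim 7 sketch: a submission queue of unprocessed client requests and a
# completion queue of status entries not yet read by the client.
submission_queue: Queue = Queue()
completion_queue: Queue = Queue()

def client_submit(request):
    submission_queue.put(request)    # client enqueues an unprocessed request

def server_process_one(compute):
    request = submission_queue.get()  # server drains the submission queue
    compute(request)                  # run the computation task on the GPU
    completion_queue.put({"request": request, "state": "done"})  # publish status
```

The decoupling matters: the client never blocks on the GPU directly, it only polls the completion queue for status entries.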
  8. A data transmission method, applied to a client in a GPU virtualization system, the method comprising:
    acquiring a data transmission request sent by an application;
    acquiring a first virtual address in the data transmission request;
    determining, based on a mapping relationship between physical memory addresses and virtual addresses, a physical memory address corresponding to the first virtual address; and
    sending the data transmission request and the physical memory address to a server, so that the server performs data transmission according to the data transmission request and the physical memory address, wherein the server determines, based on a mapping relationship between physical memory addresses and virtual addresses, a second virtual address corresponding to the physical memory address; acquires a GPU address allocated for the data transmission request; generates a data copy instruction from the second virtual address to the GPU address; and invokes an interface of a GPU driver to execute the data copy instruction.
  9. The method according to claim 8, before the acquiring a data transmission request sent by an application, further comprising:
    acquiring a memory allocation request sent by the application;
    acquiring data in the memory allocation request;
    storing the data at a first physical memory address;
    mapping the first physical memory address into a process space of the application to generate a first virtual address corresponding to the first physical memory address; and
    sending the first virtual address to the application, and storing a mapping relationship between the first physical memory address and the first virtual address.
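The allocation step of claim 9 can be sketched with an anonymous `mmap` region. This is an analogy, not the patent's mechanism: user space cannot normally address physical memory directly, so the offset inside the shared region stands in for the physical memory address, and the `("proc-vaddr", offset)` tuple stands in for the first virtual address handed back to the application.

```python
import mmap

class ClientAllocator:
    """Claim 9 sketch: store the request's data, record a 'physical' address,
    and hand the application a first virtual address mapped onto it."""

    def __init__(self, size=4096):
        self._region = mmap.mmap(-1, size)   # anonymous shared backing store
        self._next_offset = 0
        self.vaddr_of_phys = {}              # "physical addr" -> first virtual addr

    def allocate(self, data):
        phys = self._next_offset             # offset plays the physical address
        self._region.seek(phys)
        self._region.write(data)             # store the data at that address
        self._next_offset += len(data)
        first_vaddr = ("proc-vaddr", phys)   # mapping into the app's process space
        self.vaddr_of_phys[phys] = first_vaddr
        return first_vaddr                   # returned to the application
```

Storing the address pair at allocation time is what lets claim 10 later resolve the first virtual address back to its physical address with a plain table lookup.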
  10. The method according to claim 9, wherein the determining a physical memory address corresponding to the first virtual address specifically comprises:
    determining the physical memory address corresponding to the first virtual address according to the mapping relationship.
  11. A task processing method, applied to a client in a GPU virtualization system, the method comprising:
    acquiring a task processing request sent by an application;
    forwarding the task processing request for acquisition by a server;
    when processing status information of the task processing request sent by the server is acquired, issuing a synchronization request;
    after a success notification of the synchronization request sent by the server is acquired, acquiring a first virtual address of the task processing request;
    determining, based on a mapping relationship between physical memory addresses and virtual addresses, a physical memory address corresponding to the first virtual address;
    reading a computation result of the task processing request from the physical memory address corresponding to the first virtual address; and
    sending the computation result of the task processing request to the application.
  12. The method according to claim 11, wherein the first virtual address comprises a computation-data acquisition virtual address and a computation-result storage virtual address;
    the acquiring a first virtual address of the task processing request specifically comprises:
    acquiring the computation-result storage virtual address of the task processing request; and
    the determining a physical memory address corresponding to the first virtual address specifically comprises:
    determining a physical memory address corresponding to the computation-result storage virtual address.
  13. The method according to claim 11, wherein the forwarding the task processing request specifically comprises:
    sending the task processing request to a submission queue, so that the server acquires the task processing request from the submission queue, the submission queue containing a plurality of unprocessed task computation requests submitted by the client;
    the issuing a synchronization request when processing status information of the task processing request sent by the server is acquired specifically comprises:
    querying a completion queue, and issuing the synchronization request when the processing status information of the task processing request sent by the server is found, the completion queue containing a plurality of pieces of processing status information submitted by the server but not yet read by the client; and
    the acquiring a first virtual address of the task processing request after a success notification of the synchronization request sent by the server is acquired specifically comprises:
    querying the completion queue, and acquiring the first virtual address of the task processing request after the success notification of the synchronization request sent by the server is found.
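The client side of claim 13 (poll the completion queue, issue the synchronization request on seeing "done", read the result on the sync success notification) can be sketched as below. The entry shapes, state strings, and callbacks are illustrative assumptions, not the patent's wire format.

```python
from collections import deque

def poll_and_read(completion_queue, request_id, send_sync, read_result):
    """Claim 13 sketch: the client queries the completion queue; on its
    request's 'done' status it issues a synchronization request, and on the
    subsequent sync success notification it reads the computation result."""
    while completion_queue:
        entry = completion_queue.popleft()
        if entry.get("request") != request_id:
            continue                         # status for some other request
        if entry.get("state") == "done":
            send_sync(request_id)            # issue the synchronization request
        elif entry.get("state") == "sync-ok":
            return read_result(request_id)   # read from the mapped memory
    return None
```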
  14. A data transmission apparatus, comprising:
    a data transmission request acquisition module, configured to acquire a data transmission request sent by a client;
    a first virtual address acquisition module, configured to acquire a first virtual address in the data transmission request;
    a physical memory address acquisition module, configured to acquire a physical memory address corresponding to the first virtual address;
    a second virtual address determination module, configured to determine, based on a mapping relationship between physical memory addresses and virtual addresses, a second virtual address corresponding to the physical memory address;
    a GPU address acquisition module, configured to acquire a GPU address allocated for the data transmission request;
    a data copy instruction generation module, configured to generate a data copy instruction from the second virtual address to the GPU address; and
    an interface invocation module, configured to invoke an interface of a GPU driver to execute the data copy instruction.
  15. A task processing apparatus, comprising:
    a task computation request acquisition module, configured to acquire a task computation request sent by a client;
    a first virtual address acquisition module, configured to acquire a first virtual address in the task computation request;
    a physical memory address acquisition module, configured to acquire a physical memory address corresponding to the first virtual address;
    a second virtual address determination module, configured to determine, based on a mapping relationship between physical memory addresses and virtual addresses, a second virtual address corresponding to the physical memory address;
    a GPU address acquisition module, configured to acquire a GPU address allocated for the task computation request;
    a data copy instruction generation module, configured to generate a data copy instruction from the second virtual address to the GPU address;
    a GPU driver interface invocation module, configured to invoke an interface of a GPU driver to execute the data copy instruction;
    a task computation request sending module, configured to send the task computation request to a GPU;
    a processing status information generation module, configured to generate, after the GPU completes a computation task corresponding to the task computation request, processing status information corresponding to the task computation request; and
    a processing status information storage module, configured to store the processing status information.
  16. A data transmission apparatus, comprising:
    a data transmission request acquisition module, configured to acquire a data transmission request sent by an application;
    a first virtual address acquisition module, configured to acquire a first virtual address in the data transmission request;
    a physical memory address determination module, configured to determine, based on a mapping relationship between physical memory addresses and virtual addresses, a physical memory address corresponding to the first virtual address; and
    a data transmission request and physical memory address sending module, configured to send the data transmission request and the physical memory address to a server, so that the server performs data transmission according to the data transmission request and the physical memory address, wherein the server determines, based on a mapping relationship between physical memory addresses and virtual addresses, a second virtual address corresponding to the physical memory address; acquires a GPU address allocated for the data transmission request; generates a data copy instruction from the second virtual address to the GPU address; and invokes an interface of a GPU driver to execute the data copy instruction.
  17. A task processing apparatus, comprising:
    a task processing request acquisition module, configured to acquire a task processing request sent by an application;
    a task processing request forwarding module, configured to forward the task processing request for acquisition by a server;
    a synchronization request sending module, configured to issue a synchronization request when processing status information of the task processing request sent by the server is acquired;
    a first virtual address acquisition module, configured to acquire a first virtual address of the task processing request after a success notification of the synchronization request sent by the server is acquired;
    a physical memory address determination module, configured to determine, based on a mapping relationship between physical memory addresses and virtual addresses, a physical memory address corresponding to the first virtual address;
    a computation result reading module, configured to read a computation result of the task processing request from the physical memory address corresponding to the first virtual address; and
    a computation result sending module, configured to send the computation result of the task processing request to the application.
  18. A data transmission device, comprising:
    at least one processor; and
    a memory communicatively connected to the at least one processor, wherein
    the memory stores instructions executable by the at least one processor, the instructions being executed by the at least one processor to enable the at least one processor to:
    acquire a data transmission request sent by a client;
    acquire a first virtual address in the data transmission request;
    acquire a physical memory address corresponding to the first virtual address;
    determine, based on a mapping relationship between physical memory addresses and virtual addresses, a second virtual address corresponding to the physical memory address;
    acquire a GPU address allocated for the data transmission request;
    generate a data copy instruction from the second virtual address to the GPU address; and
    invoke an interface of a GPU driver to execute the data copy instruction.
  19. A task processing device, comprising:
    at least one processor; and
    a memory communicatively connected to the at least one processor, wherein
    the memory stores instructions executable by the at least one processor, the instructions being executed by the at least one processor to enable the at least one processor to:
    acquire a task computation request sent by a client;
    acquire a first virtual address in the task computation request;
    acquire a physical memory address corresponding to the first virtual address;
    determine, based on a mapping relationship between physical memory addresses and virtual addresses, a second virtual address corresponding to the physical memory address;
    acquire a GPU address allocated for the task computation request;
    generate a data copy instruction from the second virtual address to the GPU address;
    invoke an interface of a GPU driver to execute the data copy instruction;
    send the task computation request to a GPU;
    after the GPU completes a computation task corresponding to the task computation request, generate processing status information corresponding to the task computation request; and
    store the processing status information.
  20. A data transmission device, comprising:
    at least one processor; and
    a memory communicatively connected to the at least one processor, wherein
    the memory stores instructions executable by the at least one processor, the instructions being executed by the at least one processor to enable the at least one processor to:
    acquire a data transmission request sent by an application;
    acquire a first virtual address in the data transmission request;
    determine, based on a mapping relationship between physical memory addresses and virtual addresses, a physical memory address corresponding to the first virtual address; and
    send the data transmission request and the physical memory address to a server, so that the server performs data transmission according to the data transmission request and the physical memory address, wherein the server determines, based on a mapping relationship between physical memory addresses and virtual addresses, a second virtual address corresponding to the physical memory address; acquires a GPU address allocated for the data transmission request; generates a data copy instruction from the second virtual address to the GPU address; and invokes an interface of a GPU driver to execute the data copy instruction.
  21. A task processing device, comprising:
    at least one processor; and
    a memory communicatively connected to the at least one processor, wherein
    the memory stores instructions executable by the at least one processor, the instructions being executed by the at least one processor to enable the at least one processor to:
    acquire a task processing request sent by an application;
    forward the task processing request for acquisition by a server;
    when processing status information of the task processing request sent by the server is acquired, issue a synchronization request;
    after a success notification of the synchronization request sent by the server is acquired, acquire a first virtual address of the task processing request;
    determine, based on a mapping relationship between physical memory addresses and virtual addresses, a physical memory address corresponding to the first virtual address;
    read a computation result of the task processing request from the physical memory address corresponding to the first virtual address; and
    send the computation result of the task processing request to the application.
  22. A computer-readable medium having computer-readable instructions stored thereon, wherein the computer-readable instructions are executable by a processor to implement the method according to any one of claims 1 to 13.
PCT/CN2020/132846 2020-02-11 2020-11-30 Data transmission and task processing methods, apparatuses and devices WO2021159820A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202010086948.9 2020-02-11
CN202010086948.9A CN111309649B (en) 2020-02-11 2020-02-11 Data transmission and task processing method, device and equipment

Publications (1)

Publication Number Publication Date
WO2021159820A1

Family

ID=71145245

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2020/132846 WO2021159820A1 (en) 2020-02-11 2020-11-30 Data transmission and task processing methods, apparatuses and devices

Country Status (2)

Country Link
CN (1) CN111309649B (en)
WO (1) WO2021159820A1 (en)

Families Citing this family (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111309649B (en) * 2020-02-11 2021-05-25 支付宝(杭州)信息技术有限公司 Data transmission and task processing method, device and equipment
CN112925737B (en) * 2021-03-30 2022-08-05 上海西井信息科技有限公司 PCI heterogeneous system data fusion method, system, equipment and storage medium
CN114359015B (en) * 2021-12-08 2023-08-04 北京百度网讯科技有限公司 Data transmission method, device and graphic processing server
CN114741214B (en) * 2022-04-01 2024-02-27 新华三技术有限公司 Data transmission method, device and equipment
CN114884881B (en) * 2022-05-12 2023-07-07 福建天晴在线互动科技有限公司 Data compression transmission method and terminal

Citations (4)

Publication number Priority date Publication date Assignee Title
WO2005036405A1 (en) * 2003-10-08 2005-04-21 Unisys Corporation Computer system para-virtualization using a hypervisor that is implemented in a partition of the host system
CN107025183A (en) * 2012-08-17 2017-08-08 英特尔公司 Shared virtual memory
CN108804199A (en) * 2017-05-05 2018-11-13 龙芯中科技术有限公司 Graphics processor virtual method and device
CN111309649A (en) * 2020-02-11 2020-06-19 支付宝(杭州)信息技术有限公司 Data transmission and task processing method, device and equipment

Family Cites Families (4)

Publication number Priority date Publication date Assignee Title
CN102521015B (en) * 2011-12-08 2014-03-26 华中科技大学 Equipment virtualization method under embedded platform
CN103559078B (en) * 2013-11-08 2017-04-26 华为技术有限公司 GPU (Graphics Processing Unit) virtualization realization method as well as vertex data caching method and related device
JP6846537B2 (en) * 2016-12-27 2021-03-24 深▲せん▼前海達闥雲端智能科技有限公司Cloudminds (Shenzhen) Robotics Systems Co., Ltd. Display methods, equipment and electronics for multi-operating systems
CN107193759A (en) * 2017-04-18 2017-09-22 上海交通大学 The virtual method of device memory administrative unit

Also Published As

Publication number Publication date
CN111309649A (en) 2020-06-19
CN111309649B (en) 2021-05-25


Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 20918696

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 20918696

Country of ref document: EP

Kind code of ref document: A1