CN112600882B - Hardware acceleration method based on shared memory communication mode - Google Patents


Info

Publication number
CN112600882B
CN112600882B (application CN202011389606.0A)
Authority
CN
China
Prior art keywords
hardware
copy request
accelerator
request
copy
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202011389606.0A
Other languages
Chinese (zh)
Other versions
CN112600882A (en)
Inventor
李健
庄树隽
管海兵
Current Assignee
Shanghai Jiaotong University
Original Assignee
Shanghai Jiaotong University
Priority date
Filing date
Publication date
Application filed by Shanghai Jiaotong University filed Critical Shanghai Jiaotong University
Priority claimed from CN202011389606.0A
Publication of CN112600882A
Application granted
Publication of CN112600882B
Legal status: Active

Classifications

    • H04L 69/30: Definitions, standards or architectural aspects of layered protocol stacks
    • H04L 43/10: Active monitoring, e.g. heartbeat, ping or trace-route
    • H04L 67/06: Protocols specially adapted for file transfer, e.g. file transfer protocol [FTP]
    • H04L 67/1078: Peer-to-peer [P2P] networks; resource delivery mechanisms for data block transmission
    • H04L 67/63: Routing a service request depending on the request content or context

Abstract

The invention discloses a hardware acceleration method based on a shared-memory communication mode, relating to the field of network protocols. While preserving the low coupling between the user-mode protocol stack and the application, the method reduces the performance impact of copy operations during communication, and still achieves good performance in scenarios with many large-packet requests.

Description

Hardware acceleration method based on shared memory communication mode
Technical Field
The invention relates to the field of network protocols, in particular to a hardware acceleration method based on a shared memory communication mode.
Background
A network protocol stack is a concrete software implementation of a suite of computer network protocols. It is responsible for packaging data sent by upper-layer network applications into network packets, sending those packets out through the network card, and ensuring the stability and correctness of packet transmission across the whole network link. At present, network protocol stacks on end hosts are mostly implemented in kernel mode following the OSI seven-layer model. However, the traditional kernel-mode network protocol stack suffers from several inefficiencies, such as frequent context switching and global lock contention. With the rapid growth of network traffic in recent years, these inefficiencies have made the network protocol stack the major performance bottleneck in transmission.
To address these inefficiencies, researchers have explored alternative approaches. RDMA (Remote Direct Memory Access) is one alternative, but it is limited by requiring a network card with RDMA support. Another is to implement the network protocols directly in user mode, i.e., a user-mode network protocol stack. This scheme avoids the overhead of frequent switching between kernel mode and user mode, and the convenience of user-mode development greatly shortens the cycle for developing and deploying new network features.
Once the user-mode network protocol stack is implemented, it needs to communicate with the network application above it. At present there are two communication modes: in the LibOS mode, the protocol stack is embedded in the application process as a library and communicates with the application through function calls; in the shared-memory mode, the network protocol stack runs as a separate process and communicates with the application asynchronously through shared memory.
The advantage of the LibOS mode is that its low-overhead function-call communication and RTC (Run-To-Completion) thread model reduce communication cost and yield better performance. Its disadvantages are threefold. First, the function interface is tightly coupled with the application, so development and deployment must be synchronized with the application, slowing down the rollout of new network features. Second, it poses security risks: a malicious application can attack the protocol stack. Finally, the protocol stack shares core computing resources with the application process and cannot be scheduled flexibly.
The advantages of the shared-memory mode are low coupling, a short cycle for developing and deploying new network features, flexible scheduling of computing resources, and support for advanced functions such as transparent protocol-stack upgrades. In addition, the shared memory serves as an intermediate layer that isolates the stack from malicious applications. However, communication between the application and the protocol stack requires two copy operations; when there are many large-packet requests, these copies occupy excessive CPU resources and seriously degrade the throughput and latency of the protocol stack.
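The two copy operations of the shared-memory mode can be illustrated with a minimal sketch (all identifiers are hypothetical, not VPP or NGINX code): the application first copies its buffer into the shared-memory ring, and the protocol stack then copies the data out again into its own send buffer before transmission.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Illustration of the two-copy cost of the shared-memory mode:
 *   copy 1: application buffer     -> shared-memory ring
 *   copy 2: shared-memory ring     -> protocol-stack send buffer
 * A single slot stands in for the real ring buffer. */

enum { SHM_SLOT = 2048 };
static char shm_ring[SHM_SLOT];

/* copy 1: performed in the application's context */
size_t app_send(const char *data, size_t len) {
    memcpy(shm_ring, data, len);
    return len;
}

/* copy 2: performed in the protocol stack's context */
size_t stack_fetch(char *sendbuf, size_t len) {
    memcpy(sendbuf, shm_ring, len);
    return len;
}
```

For large packets both copies are CPU-bound memcpy work, which is exactly the cost the invention offloads to dedicated hardware.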
In summary, both communication modes between a user-mode protocol stack and an application have drawbacks: the LibOS mode cannot satisfy developers' needs for rapid development and advanced functions, while the shared-memory mode cannot satisfy the protocol stack's need for high performance.
Accordingly, those skilled in the art are devoted to developing a hardware acceleration method based on a shared memory communication scheme.
Disclosure of Invention
In view of the above drawbacks of the prior art, the technical problem to be solved by the present invention is how to make a communication mode achieve both the advantages associated with low coupling and high communication performance.
To achieve the above object, the present invention provides a hardware acceleration method based on a shared-memory communication mode, which uses VPP as the user-mode network protocol stack and the NGINX application as the web server, the two communicating through shared memory. The protocol stack includes an asynchronous copy module, a decision maker, and a virtual memory-copy acceleration layer.
Further, copying is performed by Intel's IOAT dedicated memory-copy hardware: the copy between the shared memory and the protocol stack is offloaded from the CPU to the IOAT hardware.
Further, the function of the asynchronous copy module is realized in three steps:
step 1, allocating a sufficient number of send buffers to hold the packets to be copied from the shared memory, then translating between virtual and physical addresses, packaging the copy parameters into a copy request, and passing the copy request to the decision maker, which decides whether the copy request is handled by the CPU or by the IOAT acceleration hardware; finally, the copy requests offloaded to the IOAT are placed into a waiting buffer for temporary storage;
step 2, periodically fetching completed copy requests from the IOAT hardware descriptor queue;
and step 3, after the copy completes, the network transport-layer protocol performs some protocol-related processing, and finally the copied packet is sent to the next VPP protocol processing node.
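The submit and poll steps above can be sketched as a small ring-buffer pipeline (a minimal simulation; all identifiers are hypothetical, and memcpy stands in for the IOAT DMA engine, so only the control flow is shown):

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define MAX_INFLIGHT 64

typedef struct {
    void       *dst;   /* send buffer inside the protocol stack */
    const void *src;   /* packet data in the shared-memory region */
    size_t      len;
    int         done;  /* set when the (simulated) engine completes */
} copy_request_t;

typedef struct {
    copy_request_t ring[MAX_INFLIGHT]; /* waiting buffer for offloaded requests */
    int head, tail;
} ioat_queue_t;

/* Step 1 (submit): package the copy parameters into a request and enqueue it. */
int submit_copy(ioat_queue_t *q, void *dst, const void *src, size_t len) {
    int next = (q->tail + 1) % MAX_INFLIGHT;
    if (next == q->head)
        return -1;                        /* waiting buffer is full */
    q->ring[q->tail] = (copy_request_t){ dst, src, len, 0 };
    q->tail = next;
    return 0;
}

/* Step 2 (poll): periodically harvest completed requests from the queue;
 * memcpy here takes the place of the asynchronous IOAT copy engine. */
int poll_completions(ioat_queue_t *q) {
    int completed = 0;
    while (q->head != q->tail) {
        copy_request_t *r = &q->ring[q->head];
        memcpy(r->dst, r->src, r->len);
        r->done = 1;
        q->head = (q->head + 1) % MAX_INFLIGHT;
        completed++;
    }
    return completed;
}
/* Step 3 (post-copy processing, e.g. setting retransmission timers) is
 * transport-protocol work and is omitted from this sketch. */
```

The point of the split is that submission returns immediately, so the protocol stack keeps processing other packets while the hardware copies in the background.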
Further, the decision maker decides whether to hand the request to the CPU or the IOAT hardware based on the data size of the copy request.
Further, the virtual memory-copy acceleration layer isolates the hardware offload logic from the protocol stack logic.
Further, the protocol stack offloads copy requests onto the hardware accelerator by invoking the virtual offload interface.
Further, the virtual copy accelerator has a fault tolerance mechanism for multiple error conditions.
Further, when copy requests exceed the length of the hardware accelerator's queue, the virtual copy accelerator temporarily hands the excess copy requests to the CPU for processing.
Further, for permanently unavailable errors, the virtual copy accelerator preferentially finds another hardware accelerator and sends the outstanding requests to the new hardware accelerator in one batch, replacing the faulty hardware accelerator.
Further, for permanently unavailable errors, when no other hardware accelerator is available, the virtual copy accelerator hands the copy requests to the CPU for processing.
The technical effects are as follows:
the method can reduce the influence of copy operation on performance in communication under the condition of ensuring low coupling of a user mode protocol stack and application, and can still achieve good performance under the scene of more large-packet requests.
The conception, the specific structure and the technical effects of the present invention will be further described with reference to the accompanying drawings to fully understand the objects, the features and the effects of the present invention.
Drawings
FIG. 1 is a general flow diagram of a preferred embodiment of the present invention;
FIG. 2 is a diagram of a comparison of CPU occupied by various operations in the protocol stack in accordance with a preferred embodiment of the present invention;
FIG. 3 is a block diagram of an asynchronous copy module framework in accordance with a preferred embodiment of the present invention;
FIG. 4 is a graph comparing the copy speed of the IOAT hardware and CPU of a preferred embodiment of the present invention;
FIG. 5 is an architecture diagram of the virtual memory copy accelerator layer in accordance with a preferred embodiment of the present invention.
Detailed Description
The technical contents of the preferred embodiments of the present invention will be more clearly and easily understood by referring to the drawings attached to the specification. The present invention may be embodied in many different forms of embodiments and the scope of the invention is not limited to the embodiments set forth herein.
First, VPP is used as the user-mode network protocol stack and the NGINX application as the web server; the two communicate through shared memory. FIG. 1 is the general architecture diagram of the present invention: the asynchronous copy module is embedded in the VPP protocol stack, and the decision maker decides whether to hand a request to the CPU or to the IOAT hardware according to the data size of the copy request.
As shown in fig. 2, we quantitatively analyzed the CPU share occupied by the copy operation in this scenario. Experiments show that:
(1) When the file size requested by the client is 4KB or 8KB, the memory copy operation occupies about 20%-30% of CPU resources; but when the file size exceeds 32KB, the memory copy becomes the main performance bottleneck in the protocol stack, occupying about 60% of CPU resources.
(2) As the requested file size gradually increases, the CPU resources occupied by the memory copy grow linearly.
(3) As the requested files grow and the memory copy cost rises, the throughput of the VPP user-mode protocol stack degrades from 40% faster than the kernel-mode network protocol stack to 40% slower.
As shown in fig. 3, the module works in three phases. Commit phase: a sufficiently large send buffer is allocated to hold the packet to be copied from the shared memory, virtual-to-physical address translation is performed, and the copy parameters are packaged into a copy request. The copy request is handed to the decision maker, which decides whether it goes to the CPU or to the IOAT acceleration hardware; copy requests offloaded to the IOAT are placed into a waiting buffer for temporary storage. Polling phase: completed copy requests are periodically fetched from the IOAT hardware descriptor queue. Post-copy phase: after the copy completes, the network transport-layer protocol performs some protocol-related processing (such as setting a retransmission timer) and finally sends the copied packet to the next VPP protocol processing node.
As shown in FIG. 4, we found that the IOAT copy speed approaches that of the CPU at 1KB. When the requested data is smaller than 1KB, the CPU copies faster; above 1KB, the IOAT copies faster. Therefore, when copying data larger than 1KB we offload the request to the IOAT hardware; otherwise we hand the copy request to the CPU.
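The decision rule above reduces to a single threshold comparison. A minimal sketch (the constant and names are illustrative, derived from the FIG. 4 crossover point, not VPP identifiers):

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical decision maker: the measured CPU/IOAT crossover is ~1KB,
 * so copies larger than that are offloaded to the IOAT hardware. */

#define COPY_OFFLOAD_THRESHOLD 1024  /* bytes; from the FIG. 4 measurement */

typedef enum { EXEC_CPU, EXEC_IOAT } copy_executor_t;

copy_executor_t choose_executor(size_t copy_len) {
    /* small copies stay on the CPU; large copies go to the accelerator */
    return copy_len > COPY_OFFLOAD_THRESHOLD ? EXEC_IOAT : EXEC_CPU;
}
```

In practice the threshold would be tuned per machine, since the crossover point depends on the specific CPU and IOAT generation.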
As shown in fig. 5, the protocol stack offloads copy requests by calling the virtual offload interface, and the virtual accelerator in turn offloads the requests to the hardware accelerator, thereby separating protocol-stack logic from hardware-driver logic. Developers can implement these interfaces and bind acceleration hardware on different or future machines to the VPP protocol stack without understanding the upper protocol-stack logic, allowing the present invention to support more hardware accelerators.
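One common way to realize such a virtual interface in C is a table of function pointers that each backend fills in; a minimal sketch under that assumption (the names are hypothetical, and a synchronous CPU backend stands in for a real driver):

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical virtual copy-accelerator interface: the protocol stack only
 * calls through this table, so an IOAT backend (or any future engine) can
 * be bound without touching protocol-stack logic. */

typedef struct vcopy_accel {
    int (*submit)(struct vcopy_accel *self, void *dst,
                  const void *src, size_t len);
    int (*poll)(struct vcopy_accel *self);  /* returns # completed requests */
} vcopy_accel_t;

/* Trivial CPU backend: copies synchronously, so every submitted request is
 * already complete by the time poll() is called. */
static int cpu_submit(vcopy_accel_t *self, void *dst,
                      const void *src, size_t len) {
    (void)self;
    memcpy(dst, src, len);
    return 0;
}
static int cpu_poll(vcopy_accel_t *self) { (void)self; return 1; }

vcopy_accel_t cpu_backend = { cpu_submit, cpu_poll };
```

An IOAT backend would implement the same two entry points over DMA descriptor rings; the stack's calling code stays unchanged either way.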
The invention implements a fault-tolerance mechanism in the virtual copy accelerator module, covering multiple error conditions. For temporary unavailability, such as copy requests overflowing the hardware accelerator's queue, the mechanism temporarily hands the excess copy requests to the CPU for processing. For permanent unavailability, such as an IOAT hardware error or a channel error caused by an illegal copy address, the mechanism preferentially finds another hardware accelerator and sends the outstanding requests to the new accelerator in one batch, replacing the faulty one. If no other hardware accelerator is available, the mechanism hands the copy requests to the CPU for processing.
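The fault-tolerance policy just described is a small decision table; a sketch under hypothetical names (the error codes and actions are illustrative, not actual VPP or IOAT driver values):

```c
#include <assert.h>

/* Hypothetical encoding of the fault-tolerance policy:
 *   queue overflow (temporary)  -> overflow requests fall back to the CPU
 *   permanent hardware error    -> switch to a spare accelerator if any,
 *                                  otherwise fall back to the CPU          */

typedef enum { ERR_NONE, ERR_QUEUE_FULL, ERR_PERMANENT } accel_err_t;
typedef enum { ACT_RETRY_SAME, ACT_FALLBACK_CPU, ACT_SWITCH_ACCEL } accel_action_t;

accel_action_t handle_error(accel_err_t err, int spare_accel_available) {
    switch (err) {
    case ERR_QUEUE_FULL:   /* temporary unavailability */
        return ACT_FALLBACK_CPU;
    case ERR_PERMANENT:    /* IOAT hardware or channel error */
        return spare_accel_available ? ACT_SWITCH_ACCEL : ACT_FALLBACK_CPU;
    default:
        return ACT_RETRY_SAME;
    }
}
```

Because the CPU path is always available as a last resort, copy requests are never lost even when every accelerator fails.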
The foregoing detailed description of the preferred embodiments of the invention has been presented. It should be understood that numerous modifications and variations could be devised by those skilled in the art in light of the present teachings without departing from the inventive concepts. Therefore, the technical solutions available to those skilled in the art through logic analysis, reasoning and limited experiments based on the prior art according to the concept of the present invention should be within the scope of protection defined by the claims.

Claims (7)

1. A hardware acceleration method based on a shared-memory communication mode, which uses VPP as a user-mode network protocol stack and NGINX as the network application, the two communicating through shared memory, characterized by comprising the following three stages:
a submission stage: allocating a send buffer to store a packet to be copied from a shared memory, then translating between a virtual address and a physical address, and packaging copy parameters into a copy request, wherein the copy request is handed to a decision maker, the decision maker decides to give the copy request to a CPU or to IOAT acceleration hardware, and after deciding to give the copy request to the IOAT acceleration hardware, the decision maker places the copy request offloaded to the IOAT acceleration hardware into a waiting buffer for temporary storage;
a polling stage: periodically obtaining completed copy requests from the IOAT acceleration hardware descriptor queue;
and a post-copying stage: after the copy between the shared memory and the protocol stack is completed, the network transport-layer protocol sets a retransmission timer, and then sends the copied packet to the next VPP protocol processing node.
2. The method of claim 1, wherein the decision-maker decides whether to hand a request to a CPU or the IOAT acceleration hardware according to a data size of the copy request.
3. The method as claimed in claim 2, wherein the protocol stack implements offloading of the copy request to the virtual accelerator by calling a virtual offload function interface, and then the virtual accelerator offloads the copy request to the hardware accelerator, thereby implementing separation of protocol stack logic and hardware driver logic.
4. The method as claimed in claim 3, wherein the virtual accelerator has a fault tolerance mechanism for multiple error conditions.
5. The method of claim 4, wherein when copy requests exceed the length of the cache queue in the hardware accelerator, the virtual accelerator temporarily hands the excess copy requests to the CPU for processing.
6. The method of claim 4, wherein for a request that is permanently unavailable, the virtual accelerator preferentially finds another hardware accelerator and sends the outstanding copy requests to the new hardware accelerator in one batch, replacing the faulty hardware accelerator.
7. The method of claim 4, wherein for a request that is permanently unavailable, the virtual accelerator hands a copy request to the CPU for processing when no other hardware accelerator is available.
CN202011389606.0A 2020-12-01 2020-12-01 Hardware acceleration method based on shared memory communication mode Active CN112600882B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011389606.0A CN112600882B (en) 2020-12-01 2020-12-01 Hardware acceleration method based on shared memory communication mode


Publications (2)

Publication Number Publication Date
CN112600882A (en) 2021-04-02
CN112600882B (en) 2022-03-08

Family

ID=75187681

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011389606.0A Active CN112600882B (en) 2020-12-01 2020-12-01 Hardware acceleration method based on shared memory communication mode

Country Status (1)

Country Link
CN (1) CN112600882B (en)

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113360293B (en) * 2021-06-02 2023-09-08 奥特酷智能科技(南京)有限公司 Vehicle body electrical network architecture based on remote virtual shared memory mechanism
CN113364856B (en) * 2021-06-03 2023-06-30 奥特酷智能科技(南京)有限公司 Vehicle-mounted Ethernet system based on shared memory and heterogeneous processor

Citations (5)

Publication number Priority date Publication date Assignee Title
WO2001014959A2 (en) * 1999-08-16 2001-03-01 Z-Force Corporation System of reusable software parts and methods of use
CN106537340A (en) * 2014-07-16 2017-03-22 戴尔产品有限公司 Input/output acceleration device and method for virtualized information handling systems
CN106663021A (en) * 2014-06-26 2017-05-10 英特尔公司 Intelligent gpu scheduling in a virtualization environment
CN110865953A (en) * 2019-10-08 2020-03-06 华南师范大学 Asynchronous copying method and device
CN111314429A (en) * 2020-01-19 2020-06-19 上海交通大学 Network request processing system and method

Family Cites Families (1)

Publication number Priority date Publication date Assignee Title
US20150261701A1 (en) * 2014-03-14 2015-09-17 International Business Machines Corporation Device table in system memory


Also Published As

Publication number Publication date
CN112600882A (en) 2021-04-02

Similar Documents

Publication Publication Date Title
EP2645674B1 (en) Interrupt management
US5617424A (en) Method of communication between network computers by dividing packet data into parts for transfer to respective regions
US6163812A (en) Adaptive fast path architecture for commercial operating systems and information server applications
CN115516832A (en) Network and edge acceleration tile (NEXT) architecture
US7783769B2 (en) Accelerated TCP (Transport Control Protocol) stack processing
US7089289B1 (en) Mechanisms for efficient message passing with copy avoidance in a distributed system using advanced network devices
US20060190942A1 (en) Processor task migration over a network in a multi-processor system
CN112600882B (en) Hardware acceleration method based on shared memory communication mode
US7924848B2 (en) Receive flow in a network acceleration architecture
CN110768994B (en) Method for improving SIP gateway performance based on DPDK technology
EP1565817A2 (en) Embedded transport acceleration architecture
CN111966446B (en) RDMA virtualization method in container environment
WO2020171989A1 (en) Rdma transport with hardware integration and out of order placement
CN111459418A (en) RDMA (remote direct memory Access) -based key value storage system transmission method
WO2020171988A1 (en) Rdma transport with hardware integration
CN111614631A (en) User mode assembly line framework firewall system
CN112929210B (en) Method and system for gateway routing application plug-in built on WebFlux framework and application of gateway routing application plug-in
CN111158782B (en) DPDK technology-based Nginx configuration hot update system and method
US7412454B2 (en) Data structure supporting random delete and timer function
CN110445580B (en) Data transmission method and device, storage medium, and electronic device
US20050188070A1 (en) Vertical perimeter framework for providing application services
Sterbenz et al. AXON: Application-oriented lightweight transport protocol design
US20040240388A1 (en) System and method for dynamic assignment of timers in a network transport engine
CN113746802B (en) Method in network function virtualization and VNF device with full storage of local state and remote state
Melnyk Modeling of the messages search mechanism in the messaging process on the basis of TCP protocols

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant