CN112600882B - Hardware acceleration method based on shared memory communication mode - Google Patents


Info

Publication number
CN112600882B
CN112600882B (application CN202011389606.0A)
Authority
CN
China
Prior art keywords
hardware
copy request
accelerator
request
copy
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202011389606.0A
Other languages
Chinese (zh)
Other versions
CN112600882A (en)
Inventor
李健
庄树隽
管海兵
Current Assignee
Shanghai Jiaotong University
Original Assignee
Shanghai Jiaotong University
Priority date
Filing date
Publication date
Application filed by Shanghai Jiaotong University filed Critical Shanghai Jiaotong University
Priority claimed from CN202011389606.0A
Publication of CN112600882A
Application granted
Publication of CN112600882B
Legal status: Active

Classifications

    • H04L 69/30: Definitions, standards or architectural aspects of layered protocol stacks
    • H04L 43/10: Active monitoring, e.g. heartbeat, ping or trace-route
    • H04L 67/06: Protocols specially adapted for file transfer, e.g. file transfer protocol [FTP]
    • H04L 67/1078: Peer-to-peer [P2P] networks; resource delivery mechanisms for data block transmission
    • H04L 67/63: Routing a service request depending on the request content or context

Abstract

The invention discloses a hardware acceleration method based on a shared-memory communication mode, relating to the field of network protocols. While preserving the low coupling between the user-mode protocol stack and the application, the method reduces the performance impact of copy operations during communication, and still achieves good performance in scenarios with many large-packet requests.

Description

Hardware acceleration method based on shared memory communication mode
Technical Field
The invention relates to the field of network protocols, in particular to a hardware acceleration method based on a shared memory communication mode.
Background
A network protocol stack is a concrete software implementation of a suite of computer network protocols. It is responsible for packaging data sent by upper-layer network applications into network packets, sending those packets out through the network card, and ensuring the stability and correctness of packet transmission across the whole network link. At present, network protocol stacks on end hosts are mostly implemented in kernel mode following the OSI seven-layer model. However, the traditional kernel-mode network protocol stack suffers from several inefficiencies, such as frequent context switching and global lock contention. With the rapid growth of network traffic in recent years, these inefficiencies have made the network protocol stack the major performance bottleneck in transmission.
To address these inefficiencies, researchers have explored alternative approaches. RDMA (Remote Direct Memory Access) is one alternative, but it is limited by requiring a network card with RDMA support. Another is to implement the network protocols directly in user mode, i.e., a user-mode network protocol stack. This scheme avoids the overhead of frequent switching between kernel mode and user mode, and the convenience of user-mode development greatly shortens the cycle for developing and deploying new network features.
Once the user-mode network protocol stack is implemented, it needs to communicate with the network application above it. At present there are two communication modes: in the LibOS mode, the protocol stack is embedded in the application process as a library and communicates with the application through function calls; in the shared-memory mode, the network protocol stack runs as a separate process and communicates with the application asynchronously through shared memory.
The advantage of the LibOS mode is that its low-overhead function-call communication and RTC (Run-To-Completion) thread model reduce communication cost and yield better performance. Its disadvantages are threefold. First, the function interface is tightly coupled with the application, so development and deployment must be synchronized with the application, slowing down the rollout of new network features. Second, it poses security risks: a malicious application can attack the protocol stack. Finally, the protocol stack shares core computing resources with the application process and cannot be scheduled flexibly.
The advantages of the shared-memory mode are low coupling, a short cycle for developing and deploying new network features, flexible scheduling of computing resources, and support for advanced functions such as transparent protocol-stack upgrades. In addition, the shared memory serves as an intermediate layer that isolates the stack from malicious applications. However, communication between the application and the protocol stack requires two copy operations; when there are many large-packet requests, these copies occupy excessive CPU resources and seriously degrade the throughput and latency of the protocol stack.
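The two copy operations of the shared-memory mode can be illustrated with a minimal sketch (all identifiers are hypothetical, not VPP or NGINX code): the application first copies its buffer into the shared-memory ring, and the protocol stack then copies the data out again into its own send buffer before transmission.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Illustration of the two-copy cost of the shared-memory mode:
 *   copy 1: application buffer     -> shared-memory ring
 *   copy 2: shared-memory ring     -> protocol-stack send buffer
 * A single slot stands in for the real ring buffer. */

enum { SHM_SLOT = 2048 };
static char shm_ring[SHM_SLOT];

/* copy 1: performed in the application's context */
size_t app_send(const char *data, size_t len) {
    memcpy(shm_ring, data, len);
    return len;
}

/* copy 2: performed in the protocol stack's context */
size_t stack_fetch(char *sendbuf, size_t len) {
    memcpy(sendbuf, shm_ring, len);
    return len;
}
```

For large packets both copies are CPU-bound memcpy work, which is exactly the cost the invention offloads to dedicated hardware.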
In summary, both communication modes between a user-mode protocol stack and an application have drawbacks: the LibOS mode cannot satisfy developers' needs for rapid development and advanced functions, while the shared-memory mode cannot satisfy the protocol stack's need for high performance.
Accordingly, those skilled in the art are devoted to developing a hardware acceleration method based on a shared memory communication scheme.
Disclosure of Invention
In view of the above drawbacks of the prior art, the technical problem to be solved by the present invention is how to make a communication mode achieve both the advantages associated with low coupling and high communication performance.
To achieve the above object, the present invention provides a hardware acceleration method based on a shared-memory communication mode, which uses VPP as the user-mode network protocol stack and the NGINX application as the web server, the two communicating through shared memory. The protocol stack includes an asynchronous copy module, a decision maker, and a virtual memory-copy acceleration layer.
Further, copying is performed by Intel's IOAT dedicated memory-copy hardware: the copy between the shared memory and the protocol stack is offloaded from the CPU to the IOAT hardware.
Further, the function of the asynchronous copy module is realized in three steps:
step 1, allocating a sufficient number of send buffers to hold the packets to be copied from the shared memory, then translating between virtual and physical addresses, packaging the copy parameters into a copy request, and passing the copy request to the decision maker, which decides whether the copy request is handled by the CPU or by the IOAT acceleration hardware; finally, the copy requests offloaded to the IOAT are placed into a waiting buffer for temporary storage;
step 2, periodically fetching completed copy requests from the IOAT hardware descriptor queue;
and step 3, after the copy completes, the network transport-layer protocol performs some protocol-related processing, and finally the copied packet is sent to the next VPP protocol processing node.
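The submit and poll steps above can be sketched as a small ring-buffer pipeline (a minimal simulation; all identifiers are hypothetical, and memcpy stands in for the IOAT DMA engine, so only the control flow is shown):

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define MAX_INFLIGHT 64

typedef struct {
    void       *dst;   /* send buffer inside the protocol stack */
    const void *src;   /* packet data in the shared-memory region */
    size_t      len;
    int         done;  /* set when the (simulated) engine completes */
} copy_request_t;

typedef struct {
    copy_request_t ring[MAX_INFLIGHT]; /* waiting buffer for offloaded requests */
    int head, tail;
} ioat_queue_t;

/* Step 1 (submit): package the copy parameters into a request and enqueue it. */
int submit_copy(ioat_queue_t *q, void *dst, const void *src, size_t len) {
    int next = (q->tail + 1) % MAX_INFLIGHT;
    if (next == q->head)
        return -1;                        /* waiting buffer is full */
    q->ring[q->tail] = (copy_request_t){ dst, src, len, 0 };
    q->tail = next;
    return 0;
}

/* Step 2 (poll): periodically harvest completed requests from the queue;
 * memcpy here takes the place of the asynchronous IOAT copy engine. */
int poll_completions(ioat_queue_t *q) {
    int completed = 0;
    while (q->head != q->tail) {
        copy_request_t *r = &q->ring[q->head];
        memcpy(r->dst, r->src, r->len);
        r->done = 1;
        q->head = (q->head + 1) % MAX_INFLIGHT;
        completed++;
    }
    return completed;
}
/* Step 3 (post-copy processing, e.g. setting retransmission timers) is
 * transport-protocol work and is omitted from this sketch. */
```

The point of the split is that submission returns immediately, so the protocol stack keeps processing other packets while the hardware copies in the background.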
Further, the decision maker decides whether to hand the request to the CPU or the IOAT hardware based on the data size of the copy request.
Further, the virtual memory-copy acceleration layer isolates the hardware offload logic from the protocol stack logic.
Further, the protocol stack offloads copy requests onto the hardware accelerator by invoking the virtual offload interface.
Further, the virtual copy accelerator has a fault tolerance mechanism for multiple error conditions.
Further, when copy requests exceed the length of the hardware accelerator's queue, the virtual copy accelerator temporarily hands the excess copy requests to the CPU for processing.
Further, for permanently unavailable errors, the virtual copy accelerator preferentially finds another hardware accelerator and sends the outstanding requests to the new hardware accelerator in one batch, replacing the faulty hardware accelerator.
Further, for permanently unavailable errors, when no other hardware accelerator is available, the virtual copy accelerator hands the copy requests to the CPU for processing.
The technical effects are as follows:
the method can reduce the influence of copy operation on performance in communication under the condition of ensuring low coupling of a user mode protocol stack and application, and can still achieve good performance under the scene of more large-packet requests.
The conception, the specific structure and the technical effects of the present invention will be further described with reference to the accompanying drawings to fully understand the objects, the features and the effects of the present invention.
Drawings
FIG. 1 is a general flow diagram of a preferred embodiment of the present invention;
FIG. 2 is a diagram of a comparison of CPU occupied by various operations in the protocol stack in accordance with a preferred embodiment of the present invention;
FIG. 3 is a block diagram of an asynchronous copy module framework in accordance with a preferred embodiment of the present invention;
FIG. 4 is a graph comparing the copy speed of the IOAT hardware and CPU of a preferred embodiment of the present invention;
FIG. 5 is an architecture diagram of the virtual memory copy accelerator layer in accordance with a preferred embodiment of the present invention.
Detailed Description
The technical contents of the preferred embodiments of the present invention will be more clearly and easily understood by referring to the drawings attached to the specification. The present invention may be embodied in many different forms of embodiments and the scope of the invention is not limited to the embodiments set forth herein.
First, VPP is used as the user-mode network protocol stack and the NGINX application as the web server; the two communicate through shared memory. FIG. 1 is the general architecture diagram of the present invention: the asynchronous copy module is embedded in the VPP protocol stack, and the decision maker decides whether to hand a request to the CPU or to the IOAT hardware according to the data size of the copy request.
As shown in fig. 2, we quantitatively analyzed the CPU share occupied by the copy operation in this scenario. Experiments show that:
(1) When the file size requested by the client is 4KB or 8KB, the memory copy operation occupies about 20%-30% of CPU resources; but when the file size exceeds 32KB, the memory copy becomes the main performance bottleneck in the protocol stack, occupying about 60% of CPU resources.
(2) As the requested file size gradually increases, the CPU resources occupied by the memory copy grow linearly.
(3) As the requested files grow and the memory copy cost rises, the throughput of the VPP user-mode protocol stack degrades from 40% faster than the kernel-mode network protocol stack to 40% slower.
As shown in fig. 3, the module works in three phases. Commit phase: a sufficiently large send buffer is allocated to hold the packet to be copied from the shared memory, virtual-to-physical address translation is performed, and the copy parameters are packaged into a copy request. The copy request is handed to the decision maker, which decides whether it goes to the CPU or to the IOAT acceleration hardware; copy requests offloaded to the IOAT are placed into a waiting buffer for temporary storage. Polling phase: completed copy requests are periodically fetched from the IOAT hardware descriptor queue. Post-copy phase: after the copy completes, the network transport-layer protocol performs some protocol-related processing (such as setting a retransmission timer) and finally sends the copied packet to the next VPP protocol processing node.
As shown in FIG. 4, we found that the IOAT copy speed approaches that of the CPU at 1KB. When the requested data is smaller than 1KB, the CPU copies faster; above 1KB, the IOAT copies faster. Therefore, when copying data larger than 1KB we offload the request to the IOAT hardware; otherwise we hand the copy request to the CPU.
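The decision rule above reduces to a single threshold comparison. A minimal sketch (the constant and names are illustrative, derived from the FIG. 4 crossover point, not VPP identifiers):

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical decision maker: the measured CPU/IOAT crossover is ~1KB,
 * so copies larger than that are offloaded to the IOAT hardware. */

#define COPY_OFFLOAD_THRESHOLD 1024  /* bytes; from the FIG. 4 measurement */

typedef enum { EXEC_CPU, EXEC_IOAT } copy_executor_t;

copy_executor_t choose_executor(size_t copy_len) {
    /* small copies stay on the CPU; large copies go to the accelerator */
    return copy_len > COPY_OFFLOAD_THRESHOLD ? EXEC_IOAT : EXEC_CPU;
}
```

In practice the threshold would be tuned per machine, since the crossover point depends on the specific CPU and IOAT generation.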
As shown in fig. 5, the protocol stack offloads copy requests by calling the virtual offload interface, and the virtual accelerator in turn offloads the requests to the hardware accelerator, thereby separating protocol-stack logic from hardware-driver logic. Developers can implement these interfaces and bind acceleration hardware on different or future machines to the VPP protocol stack without understanding the upper protocol-stack logic, allowing the present invention to support more hardware accelerators.
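One common way to realize such a virtual interface in C is a table of function pointers that each backend fills in; a minimal sketch under that assumption (the names are hypothetical, and a synchronous CPU backend stands in for a real driver):

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical virtual copy-accelerator interface: the protocol stack only
 * calls through this table, so an IOAT backend (or any future engine) can
 * be bound without touching protocol-stack logic. */

typedef struct vcopy_accel {
    int (*submit)(struct vcopy_accel *self, void *dst,
                  const void *src, size_t len);
    int (*poll)(struct vcopy_accel *self);  /* returns # completed requests */
} vcopy_accel_t;

/* Trivial CPU backend: copies synchronously, so every submitted request is
 * already complete by the time poll() is called. */
static int cpu_submit(vcopy_accel_t *self, void *dst,
                      const void *src, size_t len) {
    (void)self;
    memcpy(dst, src, len);
    return 0;
}
static int cpu_poll(vcopy_accel_t *self) { (void)self; return 1; }

vcopy_accel_t cpu_backend = { cpu_submit, cpu_poll };
```

An IOAT backend would implement the same two entry points over DMA descriptor rings; the stack's calling code stays unchanged either way.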
The invention implements a fault-tolerance mechanism in the virtual copy accelerator module, covering multiple error conditions. For temporary unavailability, such as copy requests overflowing the hardware accelerator's queue, the mechanism temporarily hands the excess copy requests to the CPU for processing. For permanent unavailability, such as an IOAT hardware error or a channel error caused by an illegal copy address, the mechanism preferentially finds another hardware accelerator and sends the outstanding requests to the new accelerator in one batch, replacing the faulty one. If no other hardware accelerator is available, the mechanism hands the copy requests to the CPU for processing.
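The fault-tolerance policy just described is a small decision table; a sketch under hypothetical names (the error codes and actions are illustrative, not actual VPP or IOAT driver values):

```c
#include <assert.h>

/* Hypothetical encoding of the fault-tolerance policy:
 *   queue overflow (temporary)  -> overflow requests fall back to the CPU
 *   permanent hardware error    -> switch to a spare accelerator if any,
 *                                  otherwise fall back to the CPU          */

typedef enum { ERR_NONE, ERR_QUEUE_FULL, ERR_PERMANENT } accel_err_t;
typedef enum { ACT_RETRY_SAME, ACT_FALLBACK_CPU, ACT_SWITCH_ACCEL } accel_action_t;

accel_action_t handle_error(accel_err_t err, int spare_accel_available) {
    switch (err) {
    case ERR_QUEUE_FULL:   /* temporary unavailability */
        return ACT_FALLBACK_CPU;
    case ERR_PERMANENT:    /* IOAT hardware or channel error */
        return spare_accel_available ? ACT_SWITCH_ACCEL : ACT_FALLBACK_CPU;
    default:
        return ACT_RETRY_SAME;
    }
}
```

Because the CPU path is always available as a last resort, copy requests are never lost even when every accelerator fails.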
The foregoing detailed description of the preferred embodiments of the invention has been presented. It should be understood that numerous modifications and variations could be devised by those skilled in the art in light of the present teachings without departing from the inventive concepts. Therefore, the technical solutions available to those skilled in the art through logic analysis, reasoning and limited experiments based on the prior art according to the concept of the present invention should be within the scope of protection defined by the claims.

Claims (7)

1. A hardware acceleration method based on a shared-memory communication mode, which uses VPP as a user-mode network protocol stack and NGINX as the network application, the two communicating through shared memory, characterized by comprising the following three stages:
a submission stage: allocating a send buffer to store a packet to be copied from a shared memory, then translating between a virtual address and a physical address, and packaging copy parameters into a copy request, wherein the copy request is handed to a decision maker, the decision maker decides to give the copy request to a CPU or to IOAT acceleration hardware, and after deciding to give the copy request to the IOAT acceleration hardware, the decision maker places the copy request offloaded to the IOAT acceleration hardware into a waiting buffer for temporary storage;
a polling stage: periodically obtaining completed copy requests from the IOAT acceleration hardware descriptor queue;
and a post-copying stage: after the copy between the shared memory and the protocol stack is completed, the network transport-layer protocol sets a retransmission timer, and then sends the copied packet to the next VPP protocol processing node.
2. The method of claim 1, wherein the decision-maker decides whether to hand a request to a CPU or the IOAT acceleration hardware according to a data size of the copy request.
3. The method as claimed in claim 2, wherein the protocol stack implements offloading of the copy request to the virtual accelerator by calling a virtual offload function interface, and then the virtual accelerator offloads the copy request to the hardware accelerator, thereby implementing separation of protocol stack logic and hardware driver logic.
4. The method as claimed in claim 3, wherein the virtual accelerator has a fault tolerance mechanism for multiple error conditions.
5. The method of claim 4, wherein when copy requests exceed the length of the cache queue in the hardware accelerator, the virtual accelerator temporarily hands the excess copy requests to the CPU for processing.
6. The method of claim 4, wherein for a request that is permanently unavailable, the virtual accelerator preferentially finds another hardware accelerator and sends the outstanding copy requests to the new hardware accelerator in one batch, replacing the faulty hardware accelerator.
7. The method of claim 4, wherein for a request that is permanently unavailable, the virtual accelerator hands a copy request to the CPU for processing when no other hardware accelerator is available.
CN202011389606.0A 2020-12-01 2020-12-01 Hardware acceleration method based on shared memory communication mode Active CN112600882B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011389606.0A CN112600882B (en) 2020-12-01 2020-12-01 Hardware acceleration method based on shared memory communication mode


Publications (2)

Publication Number Publication Date
CN112600882A (en) 2021-04-02
CN112600882B (en) 2022-03-08

Family

ID=75187681

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011389606.0A Active CN112600882B (en) 2020-12-01 2020-12-01 Hardware acceleration method based on shared memory communication mode

Country Status (1)

Country Link
CN (1) CN112600882B (en)

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113360293B (en) * 2021-06-02 2023-09-08 奥特酷智能科技(南京)有限公司 Vehicle body electrical network architecture based on remote virtual shared memory mechanism
CN113364856B (en) * 2021-06-03 2023-06-30 奥特酷智能科技(南京)有限公司 Vehicle-mounted Ethernet system based on shared memory and heterogeneous processor

Citations (5)

Publication number Priority date Publication date Assignee Title
WO2001014959A2 (en) * 1999-08-16 2001-03-01 Z-Force Corporation System of reusable software parts and methods of use
CN106537340A (en) * 2014-07-16 2017-03-22 戴尔产品有限公司 Input/output acceleration device and method for virtualized information handling systems
CN106663021A (en) * 2014-06-26 2017-05-10 英特尔公司 Intelligent gpu scheduling in a virtualization environment
CN110865953A (en) * 2019-10-08 2020-03-06 华南师范大学 Asynchronous copying method and device
CN111314429A (en) * 2020-01-19 2020-06-19 上海交通大学 Network request processing system and method

Family Cites Families (1)

Publication number Priority date Publication date Assignee Title
US20150261701A1 (en) * 2014-03-14 2015-09-17 International Business Machines Corporation Device table in system memory


Also Published As

Publication number Publication date
CN112600882A (en) 2021-04-02

Similar Documents

Publication Publication Date Title
EP2645674B1 (en) Interrupt management
US5617424A (en) Method of communication between network computers by dividing packet data into parts for transfer to respective regions
US6163812A (en) Adaptive fast path architecture for commercial operating systems and information server applications
CN115516832A (en) Network and edge acceleration tile (NEXT) architecture
US7783769B2 (en) Accelerated TCP (Transport Control Protocol) stack processing
US7089289B1 (en) Mechanisms for efficient message passing with copy avoidance in a distributed system using advanced network devices
US20060190942A1 (en) Processor task migration over a network in a multi-processor system
CN112600882B (en) Hardware acceleration method based on shared memory communication mode
US7924848B2 (en) Receive flow in a network acceleration architecture
CN110768994B (en) Method for improving SIP gateway performance based on DPDK technology
EP1565817A2 (en) Embedded transport acceleration architecture
CN111966446B (en) RDMA virtualization method in container environment
WO2020171989A1 (en) Rdma transport with hardware integration and out of order placement
CN111459418A (en) RDMA (remote direct memory Access) -based key value storage system transmission method
WO2020171988A1 (en) Rdma transport with hardware integration
CN111614631A (en) User mode assembly line framework firewall system
CN112929210B (en) Method and system for gateway routing application plug-in built on WebFlux framework and application of gateway routing application plug-in
CN111158782B (en) DPDK technology-based Nginx configuration hot update system and method
US7412454B2 (en) Data structure supporting random delete and timer function
CN110445580B (en) Data transmission method and device, storage medium, and electronic device
US20050188070A1 (en) Vertical perimeter framework for providing application services
Sterbenz et al. AXON: Application-oriented lightweight transport protocol design
US20040240388A1 (en) System and method for dynamic assignment of timers in a network transport engine
CN113746802B (en) Method in network function virtualization and VNF device with full storage of local state and remote state
Melnyk Modeling of the messages search mechanism in the messaging process on the basis of TCP protocols

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant