CN111163121A - Ultra-low-delay high-performance network protocol stack processing method and system - Google Patents

Ultra-low-delay high-performance network protocol stack processing method and system

Info

Publication number
CN111163121A
Authority
CN
China
Prior art keywords
protocol stack
processing
network protocol
ultra
data
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201911137090.8A
Other languages
Chinese (zh)
Inventor
陈伟杰 (Chen Weijie)
胡康桥 (Hu Kangqiao)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Hexin Interconnect Technology Qingdao Co ltd
Original Assignee
Hexin Interconnect Technology Qingdao Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Hexin Interconnect Technology Qingdao Co ltd filed Critical Hexin Interconnect Technology Qingdao Co ltd
Priority to CN201911137090.8A priority Critical patent/CN111163121A/en
Publication of CN111163121A publication Critical patent/CN111163121A/en
Pending legal-status Critical Current

Classifications

    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L69/00 Network arrangements, protocols or services independent of the application payload and not provided for in the other groups of this subclass
    • H04L69/30 Definitions, standards or architectural aspects of layered protocol stacks

Landscapes

  • Engineering & Computer Science (AREA)
  • Computer Security & Cryptography (AREA)
  • Computer Networks & Wireless Communication (AREA)
  • Signal Processing (AREA)
  • Memory System Of A Hierarchy Structure (AREA)

Abstract

The invention discloses an ultra-low-latency, high-performance network protocol stack processing method and system, relating to the technical field of computer transmission networks. The method comprises the following steps: constructing a system architecture free of memory copies and kernel processing; executing, based on that architecture, a network protocol stack processing mechanism optimization strategy applied at the protocol stack interaction layer; adopting hardware pipelining and software multithreading to realize latency hiding for the network protocol stack and ASIC hardening of the protocol stack; and adopting an OOHLS-based ASIC network protocol stack implementation strategy, so that the hardware implementation of the protocol stack is comprehensively optimized through high-level synthesis. The invention reduces bulk data copying and processor interrupt overhead at the system software level and further optimizes the protocol stack hardening strategy, thereby overcoming the processing time of the network protocol stack, rapidly improving ASIC implementation efficiency, and greatly improving service-level network transmission performance and ASIC development efficiency.

Description

Ultra-low-delay high-performance network protocol stack processing method and system
Technical Field
The embodiment of the invention relates to the field of computer network processing, in particular to a method and a system for processing an ultra-low-delay high-performance network protocol stack.
Background
A network protocol stack is designed to provide reliable, secure logical links between processes and, on top of those links, reliable network transmission services. Notably, TCP makes no assumptions about the underlying network; its primary function is to provide a reliable logical link. To communicate reliably over an unreliable network, the network protocol stack must provide the following functions: basic data transfer, data reliability, appropriate flow control, maintenance of per-connection communication state, multiplexing of parallel connections, and guarantees of communication priority and security.
With the rapid development of network technologies, especially optical fiber, fiber-optic communication networks have become the main means of network transmission, and network bandwidth keeps increasing. Network applications demand high throughput, low latency, high bandwidth, low host overhead, and low storage overhead. In the prior art, however, especially since the appearance of gigabit and 10-gigabit networks, network protocol processing has occupied a considerable share of CPU resources; on a 10-gigabit Ethernet, protocol processing on the host system becomes the performance bottleneck of reliable network transmission.
This manifests mainly as follows: in the prior art, frequent interrupt operations force processing to proceed in a fragmented, piece-by-piece manner, a strategy that inevitably introduces considerable system overhead; when the network bandwidth is large, frequent interrupts seriously degrade the overall performance of the system.
In conventional data copying, data crosses the memory bus many times during the reception or transmission of network data, so the memory bus must supply several times the bandwidth of the network data stream, and the complex copy operations further degrade overall system performance.
In the traditional system architecture, the complexity of network protocol processing brings huge overhead and seriously affects system performance.
Disclosure of Invention
The embodiments of the invention aim to provide an ultra-low-latency, high-performance network protocol stack processing method and system that solve the prior-art problem of excessive software processing overhead in the network protocol stack, which occupies considerable CPU resources and degrades system performance; in addition, a high-level-synthesis ASIC implementation strategy rapidly improves the efficiency of ASIC hardware implementation and protocol stack service development for the hardware protocol stack.
In order to achieve the above object, the embodiments of the present invention mainly provide the following technical solutions:
in a first aspect, an embodiment of the present invention provides an ultra-low-latency high-performance network protocol stack processing method,
the method comprising the following steps: constructing a system architecture without memory copies and kernel processing; executing, based on the system architecture, a network protocol stack processing mechanism optimization strategy applied at the protocol stack interaction layer; adopting hardware pipelining and software multithreading to realize latency hiding of the software protocol stack, the latency hiding being used to realize hardening of the protocol stack; and adopting an OOHLS (object-oriented high-level synthesis) based ASIC (application-specific integrated circuit) network protocol stack implementation strategy to optimize the protocol stack hardening method.
Further, constructing a system architecture without memory copies and kernel processing specifically includes: the system architecture comprises a physical layer, a link layer, a network layer, a transport layer and an application layer, wherein enhanced direct memory access provides application-to-application communication, and data is placed directly into the application's buffer at the transport layer, realizing zero buffer copies in the operating system and direct remote-memory access by applications; interrupt rate adjustment is used to balance the latency of the underlying hardware against the throughput of the system.
Further, the network protocol stack processing mechanism optimization strategy specifically includes: performing checksum calculation with a mechanism that processes 32-bit checksum data in parallel, so that it can be integrated with a general-purpose processor; and accelerating protocol processing with caching and congestion control.
Further, the mechanism for processing 32-bit checksum data in parallel specifically includes: iterating over the data layer by layer with a recurrence method, taking the result of the first serial step as the initial value of the second, the result of the second as the initial value of the third, and so on, layer by layer, to obtain the parallel logical relation among the bits of the CRC register when any n bits of data are input in parallel.
Further, the method for accelerating protocol processing specifically includes: adding compiler options so that precompilation gathers rarely executed code at the tail of a program or function and concentrates frequently executed code together; delaying the moment of code cloning (delayed cloning) and cloning together code that is frequently executed by several program segments, to reduce the number of function calls and instruction jumps; and inlining frequently executed functions into the processing code.
Further, the hardware pipelining method specifically includes: transmitting messages in a pipelined fashion with a cut-through DMA strategy; configuring several parallel protocol engines in hardware for parallel processing; and pipelining data on the SoC with a data acceleration mechanism.
Further, when the cut-through DMA strategy is used for data transmission, the transmission conditions include: the DMA engine is idle and the data entering the DMA buffer has reached a preset threshold, or the message length exceeds 1 KB.
Further, the OOHLS-based ASIC network protocol stack implementation strategy specifically includes: synthesizing a high-level language directly into a gate-level circuit through object-oriented high-level synthesis (OOHLS) and template metaprogramming, unifying the chip performance model with the physical implementation of the hardware circuit.
In a second aspect, an embodiment of the present invention further provides an ultra-low-latency high-performance network protocol stack processing system,
the system comprising: at least one processor and a memory; the memory being used to store one or more program instructions; the processor being used to run the one or more program instructions to execute the ultra-low-latency high-performance network protocol stack processing method.
In a third aspect, an embodiment of the present invention further provides a computer-readable storage medium,
the computer storage medium containing one or more program instructions which, when executed by a processor, perform the ultra-low-latency high-performance network protocol stack processing method.
The technical solutions provided by the embodiments of the present invention have at least the following advantages:
The embodiments of the invention provide a system architecture that avoids memory copies and kernel processing, reducing bulk data copying and processor interrupt overhead at the system level; the network protocol stack processing mechanism optimization strategy optimizes the system at the protocol stack interaction layer; hardware pipelining and software multithreading realize latency hiding, overcoming the processing time of a software protocol stack; and the OOHLS-based ASIC network protocol stack implementation strategy rapidly improves ASIC design efficiency, eases hardware protocol stack implementation and tuning, and greatly improves system performance.
Drawings
Fig. 1 is a step diagram of a processing method of an ultra-low latency high performance network protocol stack according to an embodiment of the present invention.
Fig. 2 is a schematic diagram illustrating an adjustment method of an interrupt rate according to an embodiment of the present invention.
Fig. 3 is a schematic structural diagram of an ultra-low latency high-performance network protocol stack processing system according to an embodiment of the present invention.
Detailed Description
The following description of the embodiments of the present invention is provided for illustrative purposes, and other advantages and effects of the present invention will become apparent to those skilled in the art from the present disclosure.
In the following description, for purposes of explanation and not limitation, specific details are set forth such as particular system structures, interfaces, techniques, etc. in order to provide a thorough understanding of the present invention. It will be apparent, however, to one skilled in the art that the present invention may be practiced in other embodiments that depart from these specific details. In other instances, detailed descriptions of well-known systems, circuits, and methods are omitted so as not to obscure the description of the present invention with unnecessary detail.
An embodiment of the present invention provides an ultra-low-latency high-performance network protocol stack processing method. Referring to fig. 1, the method includes:
S1, constructing a system architecture without memory copies and kernel processing;
in the prior art, because of uncertainty of network transmission delay, an asynchronous interrupt communication mode is generally adopted between a system and a network interface controller. The network driver interacts with the kernel one message at a time, and the processing ensures that the driver does not need to care about the specific implementation of the protocol, and the protocol does not need to care about the physical transmission of data, but generates more interrupt overhead. The embodiment of the invention builds a new system architecture, which comprises a physical layer, a link layer, a network layer, a transport layer and an application layer, wherein, different from the prior art, the embodiment provides the capability of direct communication from the application program to the application program through enhanced direct memory access, can avoid context switching of the application program, and directly puts data into a cache of the application program in the transport layer, so as to realize zero-cache copy of an operating system and direct access of a remote memory application program, and further realize the system ultra-low delay of a network protocol stack.
Referring to fig. 2, the interrupt rate is adjusted to balance the latency of the underlying hardware against the throughput of the system, and the interrupt rule is determined by whether latency or throughput has priority: when latency has priority, an interrupt tends to be generated for each packet; when throughput has priority, a single interrupt tends to cover all packets arriving within a small time window.
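A minimal sketch of such an interrupt-rate decision (the per-window coalescing rule and the 50-microsecond window below are illustrative assumptions, not values specified by this embodiment):

    #include <cstdint>

    enum class Priority { Latency, Throughput };

    // Decide whether the arrival of one more packet should raise an interrupt.
    // Latency priority: one interrupt per packet. Throughput priority: one
    // interrupt per coalescing window, covering every packet inside it.
    bool should_interrupt(Priority prio, uint64_t now_ns,
                          uint64_t& window_start_ns,
                          uint64_t window_ns = 50000) {   // 50 us, illustrative
        if (prio == Priority::Latency) return true;       // interrupt per packet
        if (now_ns - window_start_ns >= window_ns) {
            window_start_ns = now_ns;                     // open a new window
            return true;                                  // one interrupt per batch
        }
        return false;                                     // coalesce: no interrupt yet
    }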
Through the enhanced direct memory access and the interrupt rate adjustment strategy, bulk data copying and processor interrupt overhead are reduced at the system level.
S2, executing a network protocol stack processing mechanism optimization strategy applied to the protocol stack interaction layer based on the system architecture;
specifically, a processing mechanism for parallel processing of 32-bit checksum data is adopted for checksum calculation so as to be integrated with a general-purpose processor; protocol processing acceleration is achieved with caching and congestion control.
Although the serial algorithm is slow, each of its calculation steps follows directly from the generation principle of the CRC-32 check code, and by that principle the logical relationship between the result of each serial step and its initial value can be converted, through simple derivation, into a parallel logical relation. The mechanism for processing 32-bit checksum data in parallel therefore specifically includes:
and iterating the data layer by adopting a recurrence method, taking the result of the first serial algorithm as the initial value of the second serial algorithm, taking the result of the second serial algorithm as the initial value of the third serial algorithm, and iterating layer by layer according to the rule to obtain the parallel logic relation between the data in the CRC register when any n-bit data is input in parallel.
This parallel logical relation skips the bit-by-bit computation of the serial algorithm and greatly simplifies the calculation; values for any n bits can be taken as the situation requires and the logical relation re-derived, which gives far more flexibility than the table-lookup method of the prior art. When the method is executed, the program can be modified according to the derived relation, maximizing computation speed within a given space budget.
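The recurrence itself can be sketched in a few lines of C++ (the reflected CRC-32 polynomial 0xEDB88320 is assumed here for illustration): one serial step is composed n times, each step taking the previous result as its initial value, and unrolling this loop while expanding the XORs symbolically is exactly what yields the combinational parallel relation for an n-bit input:

    #include <cstdint>
    #include <cstddef>

    // One step of the bitwise (serial) CRC-32 algorithm: shift in a single
    // data bit and reduce by the reflected polynomial 0xEDB88320.
    static inline uint32_t crc32_step(uint32_t crc, uint32_t bit) {
        uint32_t mix = (crc ^ bit) & 1u;
        crc >>= 1;
        if (mix) crc ^= 0xEDB88320u;
        return crc;
    }

    // Recurrence: the register after step k is the initial value of step k+1.
    // Unrolling these n steps gives the parallel relation for an n-bit input.
    static inline uint32_t crc32_nbits(uint32_t crc, uint32_t data, int n) {
        for (int i = 0; i < n; ++i)
            crc = crc32_step(crc, (data >> i) & 1u);
        return crc;
    }

    uint32_t crc32(const uint8_t* buf, std::size_t len) {
        uint32_t crc = 0xFFFFFFFFu;
        for (std::size_t i = 0; i < len; ++i)
            crc = crc32_nbits(crc, buf[i], 8);   // 8 bits per iteration; any n works
        return crc ^ 0xFFFFFFFFu;
    }

In hardware, the unrolled XOR equations are instantiated directly, so all n input bits are absorbed in a single clock cycle.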
Further, the method for implementing protocol processing acceleration specifically includes:
in executable code, instructions are not all executed equally often; some code, such as error handling or initialization, runs with low probability. For such code, compiler options can be added so that precompilation gathers the rarely executed code at the tail of the program or function and concentrates the frequently executed code together; this effectively reduces jumps and thereby the miss rate of the instruction cache.
Delayed cloning postpones the moment of code cloning and clones together code that is frequently executed by several program segments, reducing function calls and instruction jumps. For example, in TCP processing, cloning after the TCP connection has been established makes a large amount of information available, and many pieces of state information become constants. Moreover, if the calling instruction lies close to the called function, the jump can be converted into a PC-relative branch, avoiding the loading of useless instructions and reducing instruction cache misses.
Inlining frequently executed functions into the processing code eliminates a large amount of call overhead, gives the compiler more context, and allows the code to be optimized more thoroughly.
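These techniques can be sketched with the GCC/Clang __builtin_expect hint and function inlining (the packet-handling functions below are hypothetical placeholders, not code from this embodiment):

    #include <cstdio>

    // GCC/Clang branch annotations: the compiler lays the unlikely error
    // path out at the end of the function, keeping the hot path dense in
    // the instruction cache and minimizing taken jumps.
    #define likely(x)   __builtin_expect(!!(x), 1)
    #define unlikely(x) __builtin_expect(!!(x), 0)

    // Hot-path helper: marking it inline lets the compiler paste the body
    // into the caller, removing call overhead and exposing more context.
    static inline int parse_header(const unsigned char* p) {
        return p[0] & 0x0F;
    }

    int process_packet(const unsigned char* pkt, int len) {
        if (unlikely(len < 20)) {                  // rare error path, laid out cold
            std::fprintf(stderr, "runt packet\n");
            return -1;
        }
        return parse_header(pkt);                  // frequent path stays fall-through
    }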
S3, adopting hardware pipelining and software multithreading to hide the latency of the software protocol stack and to harden the protocol stack;
specifically, a typical DMA transfer always adopts a store-and-forward strategy, and only when a message completely enters a main memory or a NIC memory, the corresponding DMA is initialized for transfer. In this embodiment, the intercepted DMA strategy is adopted to perform pipelined transmission on the packet, and the transmission can be performed as long as the transmission condition is satisfied, where the transmission condition includes: when the DMA engine is idle and the data entering the DMA buffer reaches a preset threshold, the message length exceeds 1 KB.
To further improve performance, this embodiment configures several parallel protocol engines in hardware for parallel processing.
In addition, a data acceleration mechanism pipelines data on the SoC. Data acceleration can be realized with a data processing accelerator that mainly comprises a cyclic redundancy check unit, an extract-and-compare unit and an adder, programmed and controlled by a microcontroller.
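One way the microcontroller might program these units is sketched below as a software model; the command format is a hypothetical illustration, and the CRC unit is replaced by a trivial stand-in rather than a real CRC circuit:

    #include <cstdint>

    // Hypothetical descriptor with which the microcontroller programs the
    // accelerator: each command selects one unit and carries its operand.
    enum class Unit : uint8_t { Crc, ExtractCompare, Add };

    struct AccelCmd {
        Unit     unit;
        uint32_t offset;    // byte offset in the packet the unit operates on
        uint32_t operand;   // compare value, addend, or CRC seed
    };

    // Software model of one accelerator operation on a packet word. In
    // hardware the three units form pipeline stages, so while one unit
    // works on packet i the next unit already works on packet i+1.
    uint32_t run_unit(const AccelCmd& c, const uint8_t* pkt) {
        uint32_t word = uint32_t(pkt[c.offset])             |
                        (uint32_t(pkt[c.offset + 1]) << 8)  |
                        (uint32_t(pkt[c.offset + 2]) << 16) |
                        (uint32_t(pkt[c.offset + 3]) << 24);
        switch (c.unit) {
            case Unit::Crc:            return word ^ c.operand;          // stand-in for the CRC unit
            case Unit::ExtractCompare: return word == c.operand ? 1u : 0u; // field extract-and-match
            case Unit::Add:            return word + c.operand;          // checksum-style addition
        }
        return 0;
    }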
In general, multiple TCP streams are only weakly correlated, so this embodiment can process different TCP streams with multithreading; the concurrency lets the latency of memory accesses hide well in the overlapped processing of the TCP streams.
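A minimal sketch of distributing streams over worker threads by hashing the TCP 4-tuple (the key layout and mixing constant are illustrative assumptions):

    #include <cstdint>
    #include <cstddef>

    // A TCP flow is identified by its 4-tuple; distinct flows are only
    // weakly correlated and may be processed on distinct threads.
    struct FlowKey {
        uint32_t src_ip, dst_ip;
        uint16_t src_port, dst_port;
    };

    // Hash the 4-tuple to pick a worker: every segment of one flow lands on
    // the same thread (preserving per-flow order), while different flows
    // overlap and hide each other's memory-access latency.
    std::size_t pick_worker(const FlowKey& k, std::size_t n_workers) {
        uint64_t h = (uint64_t(k.src_ip) << 32) ^ k.dst_ip;
        h ^= (uint64_t(k.src_port) << 16) ^ k.dst_port;
        h *= 0x9E3779B97F4A7C15ull;          // 64-bit mixing constant
        return std::size_t(h >> 32) % n_workers;
    }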
And S4, adopting an OOHLS-based ASIC network protocol stack implementation strategy to optimize the protocol stack hardening method and the ASIC development efficiency.
ASIC stands for Application-Specific Integrated Circuit. OOHLS, object-oriented high-level synthesis, hides the state information of an object inside the object; internal information is not accessed directly but is operated on and accessed through the methods the class provides.
In this embodiment, a high-level language is synthesized directly into a gate-level circuit through object-oriented high-level synthesis (OOHLS) and template metaprogramming, unifying the chip performance model with the physical implementation of the hardware circuit and rapidly improving ASIC chip design efficiency.
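The style can be illustrated with a short C++ sketch (an assumed example of the OOHLS approach using a CRC engine as the module, not the embodiment's actual code): the object's state is private and reachable only through its methods, and the template parameter fixes the parallel input width at compile time, so each fully specialized instantiation can be mapped onto its own gate-level datapath:

    #include <cstdint>

    // Template parameter W is the parallel input width; it is resolved at
    // compile time, so CrcEngine<8> and CrcEngine<32> become two distinct
    // hardware modules after high-level synthesis.
    template <unsigned W>
    class CrcEngine {
    public:
        void reset() { state_ = 0xFFFFFFFFu; }

        // Consume W input bits per call; the W-fold unrolled loop becomes
        // W levels of combinational XOR logic after synthesis.
        void absorb(uint32_t bits) {
            for (unsigned i = 0; i < W; ++i) {
                uint32_t mix = (state_ ^ (bits >> i)) & 1u;
                state_ = (state_ >> 1) ^ (mix ? 0xEDB88320u : 0u);
            }
        }

        uint32_t digest() const { return state_ ^ 0xFFFFFFFFu; }

    private:
        uint32_t state_ = 0xFFFFFFFFu;  // hidden object state, reached only via methods
    };

    CrcEngine<8>  byte_wide_engine;   // one module per width,
    CrcEngine<32> word_wide_engine;   // instantiated at compile time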
The embodiment of the invention thus provides a system architecture that avoids memory copies and kernel processing, reducing bulk data copying and processor interrupt overhead at the system level; the network protocol stack processing mechanism optimization strategy optimizes the system at the protocol stack interaction layer; hardware pipelining and software multithreading realize latency hiding, overcoming the processing time of a software protocol stack; and the OOHLS-based ASIC network protocol stack implementation strategy rapidly improves ASIC design efficiency, eases hardware protocol stack implementation and tuning, and greatly improves system performance.
Referring to fig. 3, corresponding to the foregoing embodiment, an embodiment of the present invention provides an ultra-low latency high performance network protocol stack processing system, where the system includes: at least one processor 02 and memory 01;
the memory 01 is used for storing one or more program instructions;
the processor 02 is configured to execute one or more program instructions to perform an ultra-low latency high performance network protocol stack processing method.
In accordance with the foregoing embodiments, the present invention further provides a computer-readable storage medium containing one or more program instructions which, when executed by a processor, perform the ultra-low-latency high-performance network protocol stack processing method.
The disclosed embodiments of the present invention provide a computer-readable storage medium having stored therein computer program instructions which, when run on a computer, cause the computer to perform the above-described method.
In an embodiment of the invention, the processor may be an integrated circuit chip with large-scale information processing capability. The processor may be a general purpose processor, an Application Specific Integrated Circuit (ASIC), a Field Programmable Gate Array (FPGA) or other programmable logic device, discrete Gate or transistor logic device, or discrete hardware component.
The methods, steps and logic blocks disclosed in the embodiments of the present invention may be implemented or performed by such a processor. A general-purpose processor may be a microprocessor or any conventional processor. The steps of the method disclosed in connection with the embodiments may be embodied directly in a hardware decoding processor, or performed by a combination of hardware and software modules in a decoding processor. The software modules may be located in RAM, flash memory, ROM, PROM, EPROM, registers, or other storage media well known in the art. The processor reads the information in the storage medium and completes the steps of the method in combination with its hardware.
The storage medium may be a memory, for example, which may be volatile memory or nonvolatile memory, or which may include both volatile and nonvolatile memory.
The nonvolatile Memory may be a Read-Only Memory (ROM), a Programmable ROM (PROM), an Erasable PROM (EPROM), an Electrically Erasable PROM (EEPROM), or a flash Memory.
The volatile Memory may be a Random Access Memory (RAM), which serves as an external cache. By way of example, and not limitation, many forms of RAM are available, such as Static RAM (SRAM), Dynamic RAM (DRAM), Synchronous DRAM (SDRAM), Double Data Rate SDRAM (DDR SDRAM), and Enhanced SDRAM (ESDRAM).
The storage media described in connection with the embodiments of the invention are intended to comprise, without being limited to, these and any other suitable types of memory.
Those skilled in the art will appreciate that the functionality described herein may be implemented in hardware, software, or a combination of the two in one or more of the examples above. When implemented in software, the corresponding functionality may be stored on, or transmitted as, one or more instructions or code on a computer-readable medium. Computer-readable media include both computer storage media and communication media, including any medium that facilitates transfer of a computer program from one place to another. A storage medium may be any available medium accessible by a general-purpose or special-purpose computer.
The above-mentioned embodiments, objects, technical solutions and advantages of the present invention are further described in detail, it should be understood that the above-mentioned embodiments are only exemplary embodiments of the present invention, and are not intended to limit the scope of the present invention, and any modifications, equivalent substitutions, improvements and the like made on the basis of the technical solutions of the present invention should be included in the scope of the present invention.

Claims (10)

1. A processing method of an ultra-low-latency high-performance network protocol stack is characterized by comprising the following steps:
constructing a system architecture without memory copies and kernel processing;
executing a network protocol stack processing mechanism optimization strategy applied at a network protocol stack interaction layer based on the system architecture;
realizing latency hiding of a software protocol stack by adopting hardware pipelining and software multithreading, the latency hiding being used to realize hardening of the network protocol stack;
and adopting an OOHLS (object-oriented high-level synthesis) based ASIC (application-specific integrated circuit) network protocol stack implementation strategy to optimize the protocol stack hardening method.
2. The method according to claim 1, wherein constructing a system architecture without memory copies and kernel processing specifically comprises:
the system architecture comprises a physical layer, a link layer, a network layer, a transport layer and an application layer, wherein enhanced direct memory access provides application-to-application communication, and data is placed directly into the application's buffer at the transport layer, realizing zero buffer copies in the operating system and direct remote-memory access by applications;
wherein interrupt rate adjustment is used to balance the latency of the underlying hardware against the throughput of the system.
3. The method according to claim 1, wherein the network protocol stack processing mechanism optimization strategy specifically comprises:
performing checksum calculation with a mechanism that processes 32-bit checksum data in parallel, so that it can be integrated with a general-purpose processor;
accelerating protocol processing with caching and congestion control.
4. The method according to claim 3, wherein the mechanism for processing 32-bit checksum data in parallel specifically comprises:
iterating over the data layer by layer with a recurrence method, taking the result of the first serial step as the initial value of the second, the result of the second as the initial value of the third, and iterating layer by layer in turn, to obtain the parallel logical relation among the bits of the CRC register when any n bits of data are input in parallel.
5. The method according to claim 3, wherein the method for accelerating protocol processing specifically comprises:
adding compiler options so that precompilation gathers rarely executed code at the tail of a program or function and concentrates frequently executed code together;
delaying the moment of code cloning (delayed cloning) and cloning together code frequently executed by several program segments, to reduce function calls and instruction jumps;
inlining frequently executed functions into the processing code.
6. The method according to claim 1, wherein the hardware pipelining method specifically comprises:
transmitting messages in a pipelined fashion with a cut-through DMA strategy;
configuring several parallel protocol engines in hardware for parallel processing;
pipelining data on the SoC with a data acceleration mechanism.
7. The method according to claim 6, wherein, when the cut-through DMA strategy is used for data transmission, the transmission conditions comprise:
the DMA engine is idle and the data entering the DMA buffer has reached a preset threshold, or the message length exceeds 1 KB.
8. The method according to claim 1, wherein the OOHLS-based ASIC network protocol stack implementation strategy specifically comprises:
synthesizing a high-level language directly into a gate-level circuit through object-oriented high-level synthesis (OOHLS) and template metaprogramming, unifying the chip performance model with the physical implementation of the hardware circuit.
9. An ultra-low latency high performance network protocol stack processing system, the system comprising: at least one processor and memory;
the memory is to store one or more program instructions;
the processor, configured to execute one or more program instructions to perform the method of processing the ultra-low latency high performance network protocol stack according to any one of claims 1 to 8.
10. A computer-readable storage medium containing one or more program instructions which, when executed by a processor, perform the ultra-low-latency high-performance network protocol stack processing method according to any one of claims 1 to 8.
CN201911137090.8A 2019-11-19 2019-11-19 Ultra-low-delay high-performance network protocol stack processing method and system Pending CN111163121A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201911137090.8A CN111163121A (en) 2019-11-19 2019-11-19 Ultra-low-delay high-performance network protocol stack processing method and system

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201911137090.8A CN111163121A (en) 2019-11-19 2019-11-19 Ultra-low-delay high-performance network protocol stack processing method and system

Publications (1)

Publication Number Publication Date
CN111163121A true CN111163121A (en) 2020-05-15

Family

ID=70556043

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201911137090.8A Pending CN111163121A (en) 2019-11-19 2019-11-19 Ultra-low-delay high-performance network protocol stack processing method and system

Country Status (1)

Country Link
CN (1) CN111163121A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115473861A (en) * 2022-08-18 2022-12-13 珠海高凌信息科技股份有限公司 High-performance processing system and method based on communication and calculation separation and storage medium

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN1926834A (en) * 2004-03-31 2007-03-07 英特尔公司 Accelerated TCP (transport control protocol) stack processing
CN101321282A (en) * 2007-06-08 2008-12-10 上海晶视电子科技有限公司 Programmable network protocol processing system and method used for IPTV
CN105516191A (en) * 2016-01-13 2016-04-20 成都市智讯联创科技有限责任公司 10-gigabit Ethernet TCP offload engine (TOE) system realized based on FPGA
CN105917310A (en) * 2013-09-26 2016-08-31 阿普福米克斯有限公司 System and method for improving TCP performance in virtualized environments

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN1926834A (en) * 2004-03-31 2007-03-07 英特尔公司 Accelerated TCP (transport control protocol) stack processing
CN101321282A (en) * 2007-06-08 2008-12-10 上海晶视电子科技有限公司 Programmable network protocol processing system and method used for IPTV
CN105917310A (en) * 2013-09-26 2016-08-31 阿普福米克斯有限公司 System and method for improving TCP performance in virtualized environments
CN105516191A (en) * 2016-01-13 2016-04-20 成都市智讯联创科技有限责任公司 10-gigabit Ethernet TCP offload engine (TOE) system realized based on FPGA

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
左飞飞 (Zuo Feifei) et al., "Improved parallel CRC-32 check code algorithm based on the recurrence method", Journal of Detection & Control (《探测与控制学报》) *
王圣 (Wang Sheng), "Research on key technologies of TCP acceleration for high-speed networks", China Doctoral Dissertations Full-text Database, Information Science and Technology series (《中国博士学位论文全文数据库(信息科技辑)》) *

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115473861A (en) * 2022-08-18 2022-12-13 珠海高凌信息科技股份有限公司 High-performance processing system and method based on communication and calculation separation and storage medium
CN115473861B (en) * 2022-08-18 2023-11-03 珠海高凌信息科技股份有限公司 High-performance processing system and method based on communication and calculation separation and storage medium


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication (application publication date: 20200515)