CN112612518B - Network checksum algorithm optimization method based on Feiteng platform - Google Patents


Info

Publication number
CN112612518B
CN112612518B
Authority
CN
China
Prior art keywords
data
neon
cnt
buff
result
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202011420425.XA
Other languages
Chinese (zh)
Other versions
CN112612518A (en)
Inventor
胡海
刘正元
刘云
肖林逵
黄锦慧
李佑鸿
彭灿
孙立明
张铎
李唯实
曾驰
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Kirin Software Co Ltd
Original Assignee
Kirin Software Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Kirin Software Co Ltd filed Critical Kirin Software Co Ltd
Priority to CN202011420425.XA priority Critical patent/CN112612518B/en
Publication of CN112612518A publication Critical patent/CN112612518A/en
Application granted granted Critical
Publication of CN112612518B publication Critical patent/CN112612518B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00Arrangements for program control, e.g. control units
    • G06F9/06Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/30Arrangements for executing machine instructions, e.g. instruction decode
    • G06F9/30003Arrangements for executing specific machine instructions
    • G06F9/30007Arrangements for executing specific machine instructions to perform operations on data operands
    • G06F9/30036Instructions to perform operations on packed data, e.g. vector, tile or matrix operations
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00Arrangements for program control, e.g. control units
    • G06F9/06Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/30Arrangements for executing machine instructions, e.g. instruction decode
    • G06F9/32Address formation of the next instruction, e.g. by incrementing the instruction counter
    • G06F9/322Address formation of the next instruction, e.g. by incrementing the instruction counter for non-sequential address
    • G06F9/325Address formation of the next instruction, e.g. by incrementing the instruction counter for non-sequential address for loops, e.g. loop detection or loop counter
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04LTRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L41/00Arrangements for maintenance, administration or management of data switching networks, e.g. of packet switching networks
    • H04L41/08Configuration management of networks or network elements
    • H04L41/0896Bandwidth or capacity management, i.e. automatically increasing or decreasing capacities

Landscapes

  • Engineering & Computer Science (AREA)
  • Software Systems (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Mathematical Physics (AREA)
  • Computer Networks & Wireless Communication (AREA)
  • Signal Processing (AREA)
  • Detection And Prevention Of Errors In Transmission (AREA)

Abstract

The invention discloses a network checksum algorithm optimization method based on the Feiteng platform, implemented as follows: first, 128 bits of data are loaded into a 128-bit NEON register to reduce the number of loop iterations; next, a NEON vector pairwise-add instruction treats the data in the 128-bit NEON register as eight 16-bit values and adds them pairwise, and once the data stream has been processed up to a certain length, processing switches to ARM64 assembly; finally, the 64-bit result is converted to a 16-bit checksum. Compared with the prior art, the network checksum algorithm optimization method based on the Feiteng platform effectively reduces the delay introduced by the checksum algorithm when the network receives UDP data, thereby improving UDP packet transmission efficiency, and it has the advantages of independent controllability, an original implementation, and a marked practical effect.

Description

Network checksum algorithm optimization method based on Feiteng platform
Technical Field
The invention belongs to the technical field of communication and computers, and particularly relates to a network checksum algorithm optimization method based on a Feiteng platform.
Background
The domestic Feiteng series of processors is based on the ARM64 architecture, is fully compatible with the ARMv8 instruction set, and implements the NEON extension instructions internally. The SIMD part of the extension instruction set compensates for the Feiteng processor's weakness in CPU frequency and can improve the memory-access and data-computation speed of data-intensive applications. Common data-intensive applications include graphics computing, entertainment audio, and data verification.
Ethernet is the most common communication protocol standard in today's local area networks, and many protocols are available at the transport layer; among them, the UDP protocol is widely used in local area networks because of its compact structure and low transmission overhead. UDP is a simple datagram-oriented transport layer protocol that provides connectionless, unreliable transport of data streams. A UDP datagram is carried as the data segment of an Ethernet packet; the UDP header contains a source port, a destination port, the UDP length, and the UDP checksum, and the calculation of the 16-bit UDP checksum covers all of the data following the checksum field.
The existing UDP checksum calculation method divides the UDP pseudo header, the UDP header, and the data segment into 16-bit values, adds the data cyclically in groups, adds any generated carry back into the low bit of the running sum, inverts the accumulated result bit by bit, and writes the computed result back into the UDP checksum field. Computing the UDP checksum therefore requires cyclically adding every 16-bit word in the data stream one by one, and as the amount of data carried in a UDP packet grows, the number of these stepwise additions grows accordingly, which greatly reduces the efficiency of UDP packet transmission.
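The conventional calculation described above can be sketched in portable C as follows. The function name is illustrative; this is a sketch of the standard Internet ones'-complement checksum, not code taken from the patent:

```c
#include <stddef.h>
#include <stdint.h>

/* Reference (unoptimized) UDP/Internet checksum: sum 16-bit words with
 * end-around carry, then invert. One 16-bit word per loop iteration,
 * which is exactly the bottleneck the patent targets. */
uint16_t checksum_scalar(const uint8_t *buff, size_t len)
{
    uint32_t sum = 0;
    while (len > 1) {                 /* one 16-bit word per iteration */
        sum += (uint32_t)(buff[0] << 8 | buff[1]);
        buff += 2;
        len  -= 2;
    }
    if (len)                          /* odd trailing byte, padded with 0 */
        sum += (uint32_t)buff[0] << 8;
    while (sum >> 16)                 /* fold carries back in (end-around) */
        sum = (sum & 0xffff) + (sum >> 16);
    return (uint16_t)~sum;
}
```

For the classic RFC 1071 example data {00 01 f2 03 f4 f5 f6 f7}, this returns 0x220d (the complement of the folded sum 0xddf2).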
Chinese patent application CN201210087407.3 provides a UDP checksum calculation method that first sets the UDP checksum field to a constant, then calculates according to the traditional UDP checksum method, and finally appends the obtained result to the end of the UDP data portion. This approach simplifies the packing flow of UDP packets, so that all data can be packed and sent immediately after a single read, but it takes no effective measure to reduce the number of stepwise additions over the data stream when the checksum is computed.
Chinese patent application CN201510536324.1 provides a checksum calculation method and a network processor. A multithreaded micro-engine first obtains the calculation parameters for the current thread and sends them to a calculation unit; the calculation unit then performs the checksum computation while a thread scheduling module puts the current thread to sleep; when the computation completes, the calculation unit writes the checksum into the current thread's checksum register and instructs the scheduling module to wake the thread; and when the scheduling module moves the thread from the awakened state into the working state, the micro-engine writes the computed checksum into the position corresponding to the current thread in the data storage unit. By embedding checksum computation into the micro-engine's pipeline, this method reduces scheduling overhead and improves network-processor performance, but again it takes no effective measure to reduce the number of stepwise additions in the checksum computation itself.
Disclosure of Invention
In order to solve the above problems, the invention provides a network checksum algorithm optimization method based on the Feiteng platform, comprising the following steps:
S1: determine the NEON loop count cnt_neon and the assembly loop count cnt_asm;
S2: define NEON register variables VA and VB and initialize them to 0;
S3: judge whether cnt_neon > 0 holds; if yes, go to step S4; if not, go to step S7;
S4: load eight 16-bit values from buff into VB;
S5: use the UADALP vector addition instruction to complete the vector addition of VA and VB;
S6: decrement cnt_neon by 1, advance buff by 16 bytes, and return to step S3;
S7: accumulate the four 32-bit values in VA into result;
S8: judge whether cnt_asm > 0; if yes, go to step S9; if not, go to step S12;
S9: load four 16-bit values from buff into X1;
S10: complete the result += X1 accumulation using the ADDS and ADCS addition instructions;
S11: decrement cnt_asm by 1, advance buff by 8 bytes, and return to step S8;
S12: cyclically accumulate the remaining data in buff into result;
S13: convert result to a 16-bit number and invert it.
Compared with the prior art, the invention has the following advantages:
(1) The design and implementation of the optimized checksum algorithm are independently researched and developed, so the method carries complete intellectual property rights.
(2) The implementation is original: it makes full use of the NEON feature of the Feiteng processor, combines it with the characteristics of the checksum algorithm to exploit the advantages of NEON instructions, and further reduces the number of stepwise cyclic additions by combining assembly instructions.
(3) The implementation effect is significant: the per-iteration data path is widened from 16 bits before optimization to 16 bytes after optimization, which greatly reduces the delay introduced by the checksum algorithm when the network receives UDP data and improves UDP network bandwidth.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings used in the description of the embodiments or the prior art will be briefly described below, and it is obvious that the drawings in the following description are only some embodiments of the present invention, and for those skilled in the art, other drawings can be obtained according to these drawings without creative efforts.
FIG. 1 is a flow chart of a generic Checksum algorithm on a Feiteng platform in the prior art;
FIG. 2 is a diagram illustrating the execution of UADALP VA.4S and VB.8H instructions.
Fig. 3 is a flowchart of the Feiteng platform-based Checksum algorithm optimization method of the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention more apparent, the present invention will be described in further detail with reference to the accompanying drawings in conjunction with the following detailed description. It should be understood that the description is intended to be exemplary only, and is not intended to limit the scope of the present invention. Moreover, in the following description, descriptions of well-known structures and techniques are omitted so as to not unnecessarily obscure the concepts of the present invention.
As shown in fig. 2 to fig. 3, in an embodiment of the present application, the invention provides a network checksum algorithm optimization method based on the Feiteng platform, comprising the steps of:
S1: determine the NEON loop count cnt_neon and the assembly loop count cnt_asm;
S2: define NEON register variables VA and VB and initialize them to 0;
S3: judge whether cnt_neon > 0 holds; if yes, go to step S4; if not, go to step S7;
S4: load eight 16-bit values from buff into VB;
S5: use the UADALP vector addition instruction to complete the vector addition of VA and VB;
S6: decrement cnt_neon by 1, advance buff by 16 bytes, and return to step S3;
S7: accumulate the four 32-bit values in VA into result;
S8: judge whether cnt_asm > 0; if yes, go to step S9; if not, go to step S12;
S9: load four 16-bit values from buff into X1;
S10: complete the result += X1 accumulation using the ADDS and ADCS addition instructions;
S11: decrement cnt_asm by 1, advance buff by 8 bytes, and return to step S8;
S12: cyclically accumulate the remaining data in buff into result;
S13: convert result to a 16-bit number and invert it.
The most time-consuming part of FIG. 1 is the loop, in which only 16 bits of data can be processed per iteration and, besides the accumulation itself, an overflow condition must be handled on every iteration, which consumes a large number of CPU instruction cycles.
The full Feiteng processor series is 64-bit, and its registers support 64-bit-wide operands. The ADDS addition instruction affects the Feiteng CPU's condition-code flags: the carry flag C is set to 1 when the result overflows and to 0 when it does not. Using the ADDS and ADCS addition instructions, four 16-bit values can be processed in one operation without any extra overflow handling, since ADCS folds the carry flag back into the next addition. In addition, the whole Feiteng series supports NEON, whose 128-bit operations can process eight 16-bit values at a time, twice as many as the ADDS/ADCS path. However, NEON has no carry chain comparable to that of the ADDS and ADCS instructions. To fully exploit its advantages, the accumulation is therefore performed with the NEON vector pairwise-add-and-accumulate instruction UADALP VA.4S, VB.8H, depicted schematically in FIG. 2. This instruction processes eight 16-bit values at a time, and because the accumulation u32 += u16 + u16 cannot overflow within 0xffff iterations, no overflow or carry handling is needed inside the loop as long as the iteration count does not exceed 0xffff. Data beyond 0xffff iterations is accelerated with the ARM64 assembly instructions ADDS and ADCS. This combination retains the advantage of the 128-bit NEON instruction while avoiding the cost of overflow handling.
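The ADDS/ADCS idea, adding 64 bits (four 16-bit words) at a time and feeding the carry flag straight back into the sum, can be sketched portably by detecting overflow after each add. The helper name is hypothetical, and on AArch64 a compiler may lower the overflow check to exactly this ADCS-style carry chain:

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Portable sketch of the ARM64 ADDS/ADCS pattern: consume the buffer in
 * 64-bit chunks (four 16-bit words per add) and fold the carry out of
 * bit 63 back into the running sum (end-around carry). */
uint64_t sum64_with_carry(const uint8_t *buff, size_t nwords64, uint64_t sum)
{
    while (nwords64--) {
        uint64_t v;
        memcpy(&v, buff, 8);          /* load 4 x 16-bit lanes at once */
        sum += v;
        if (sum < v)                  /* unsigned wrap-around = carry out */
            sum += 1;
        buff += 8;
    }
    return sum;
}
```

Because the end-around carry is re-absorbed immediately, the 64-bit running sum can later be folded down to 16 bits without losing any carries.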
Fig. 1 is a flowchart of the general, unoptimized network Checksum algorithm on the current Feiteng platform, and fig. 3 is a flowchart of the Checksum algorithm after NEON optimization. The optimized flow builds on fig. 1, widening the data-computation path and raising the efficiency of data transmission. As before, the embodiment steps do not consider data address alignment; if the data address is unaligned, only a simple conversion is needed. The specific embodiment steps are as follows:
step S201: a 64bit wide result is defined and initialized to 0.
Step S202: the number of cycles cnt _ NEON of the NEON instruction is determined, ensuring that it is not greater than 0xffff, and the number of assembly cycles cnt _ asm is determined.
Step S203: a 4-channel 128-bit NEON register variable VA, 32-bits per channel, is defined and initialized to 0, and an 8-channel 128-bit NEON register variable VB, 16-bits per channel, is defined for loading data from buff.
Step S204: steps S205, S206, S207 are repeated until cnt _ neon is 0.
Step S205: 8 16 bits of data are loaded from the buff into 8 channels of the NEON register variable VB via LDR instructions.
Step S206: the operation of a0+ ═ B0+ B1, a1+ ═ B2+ B3, a2+ ═ B4+ B5, A3+ ═ B6+ B7 is completed by using a NEON vector pair-add instruction UADALP.
Step S207: the data buff is shifted backward by 16 bytes and cnt _ neon is decremented by 1.
Step S208: the nenon register variable VA accumulates to result, i.e., result + ═ a0+ a1+ a2+ A3;
step S209: the steps S210, S211, S212 are repeated until the cnt _ asm is 0.
Step S210: 4 16 bits of data are loaded from buf into register X1 by the LDR instruction.
Step S211: with the ADDS ADCS add instruction, result + X1+ C is completed.
Step S212: the data buff is shifted backward by 8 bytes, and cnt _ asm is decremented by 1.
Step S213: and circularly accumulating the residual data in the buf to result, converting the result into a 16-bit number, and inverting to obtain the checksum.
In step S202), the cnt _ neon is set to 0xffff at maximum to prevent overflow of the neon vector pair-wise addition instruction UADALP, and the U32+ ═ U16+ U16 operation can ensure that the carry is not generated by overflow when the loop addition is performed 0xffff times.
In step S206), a0, a1, a2, A3 refer to 4 32-bit lanes in the NEON register variable VA, B0, B1, B2, B3, B4, B5, B6, B7 refer to 8 16-bit lanes in the NEON register variable VB, and the vector addition operation is specifically completed as shown in fig. 2.
In step S213), the processing is performed in units of 128bit/64bit in the previous step, so that there may be a few data left unprocessed in buf, and the accumulation processing herein needs to consider the overflow condition, but because there is only a data amount less than 8 bytes, no large delay is caused. The result performs 64-bit conversion and 16-bit operation, namely dividing 64 bits into 4 pieces of 16-bit data and accumulating the 4 pieces of 16-bit data.
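Putting steps S201 through S213 together, the data flow can be modelled in portable C: a 4-element 32-bit accumulator array stands in for the NEON register VA and the pairwise accumulation performed by UADALP VA.4S, VB.8H, and the final fold converts the 64-bit result to a 16-bit checksum. This is a behavioural sketch under the assumptions above (function name hypothetical), not the platform assembly:

```c
#include <stddef.h>
#include <stdint.h>

/* Behavioural model of the optimized flow: per 16-byte block, eight
 * 16-bit lanes B0..B7 are pairwise-accumulated into four 32-bit lanes
 * A0..A3 (modelling UADALP), the lanes are summed into a 64-bit result,
 * the tail is accumulated word by word, and the result is folded to
 * 16 bits and inverted. */
uint16_t checksum_neon_model(const uint8_t *buff, size_t len)
{
    uint32_t va[4] = {0, 0, 0, 0};    /* stands in for NEON register VA */
    uint64_t result = 0;

    /* "NEON" phase: 16 bytes (eight 16-bit lanes) per iteration. The u32
     * lanes cannot overflow for fewer than 0xffff iterations. */
    while (len >= 16) {
        for (int lane = 0; lane < 4; lane++) {
            uint16_t b_even = (uint16_t)(buff[4*lane + 0] << 8 | buff[4*lane + 1]);
            uint16_t b_odd  = (uint16_t)(buff[4*lane + 2] << 8 | buff[4*lane + 3]);
            va[lane] += (uint32_t)b_even + b_odd;   /* A_i += B_2i + B_2i+1 */
        }
        buff += 16;
        len  -= 16;
    }
    result = (uint64_t)va[0] + va[1] + va[2] + va[3];

    /* Tail: remaining 16-bit words, then an odd byte if any (step S213). */
    while (len > 1) {
        result += (uint16_t)(buff[0] << 8 | buff[1]);
        buff += 2;
        len  -= 2;
    }
    if (len)
        result += (uint64_t)buff[0] << 8;

    /* Fold 64 -> 16 bits with end-around carry, then invert. */
    while (result >> 16)
        result = (result & 0xffff) + (result >> 16);
    return (uint16_t)~result;
}
```

Because ones'-complement addition is associative, this model produces the same checksum as the scalar word-by-word reference, which is what makes the lane-parallel rearrangement legitimate.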
Compared with the prior art, the invention has the following advantages:
(1) The design and implementation of the optimized checksum algorithm are independently researched and developed, so the method carries complete intellectual property rights.
(2) The implementation is original: it makes full use of the NEON feature of the Feiteng processor, combines it with the characteristics of the checksum algorithm to exploit the advantages of NEON instructions, and further reduces the number of stepwise cyclic additions by combining assembly instructions.
(3) The implementation effect is significant: the per-iteration data path is widened from 16 bits before optimization to 16 bytes after optimization, which greatly reduces the delay introduced by the checksum algorithm when the network receives UDP data and improves UDP network bandwidth.
It is to be understood that the above-described embodiments of the present invention are merely illustrative of or explaining the principles of the invention and are not to be construed as limiting the invention. Therefore, any modification, equivalent replacement, improvement and the like made without departing from the spirit and scope of the present invention should be included in the protection scope of the present invention. Further, it is intended that the appended claims cover all such variations and modifications as fall within the scope and boundaries of the appended claims or the equivalents of such scope and boundaries.

Claims (1)

1. A network checksum algorithm optimization method based on a Feiteng platform, characterized by comprising the following steps:
S1: determining the NEON loop count cnt_neon and the assembly loop count cnt_asm;
S2: defining NEON register variables VA and VB, and initializing them to 0;
S3: judging whether cnt_neon > 0 holds; if yes, going to step S4; if not, going to step S7;
S4: loading eight 16-bit values from buff into VB;
S5: adopting the UADALP vector addition instruction to complete the vector addition of VA and VB;
S6: decrementing cnt_neon by 1, advancing buff by 16 bytes, and returning to step S3;
S7: accumulating the four 32-bit values in VA into result;
S8: judging whether cnt_asm > 0; if yes, going to step S9; if not, going to step S12;
S9: loading four 16-bit values from buff into X1;
S10: completing the result += X1 accumulation using the ADDS and ADCS addition instructions;
S11: decrementing cnt_asm by 1, advancing buff by 8 bytes, and returning to step S8;
S12: cyclically accumulating the remaining data in buff into result;
S13: converting result to a 16-bit number and inverting it.
CN202011420425.XA 2020-12-08 2020-12-08 Network checksum algorithm optimization method based on Feiteng platform Active CN112612518B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011420425.XA CN112612518B (en) 2020-12-08 2020-12-08 Network checksum algorithm optimization method based on Feiteng platform


Publications (2)

Publication Number Publication Date
CN112612518A CN112612518A (en) 2021-04-06
CN112612518B (en) 2022-04-01

Family

ID=75229268

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011420425.XA Active CN112612518B (en) 2020-12-08 2020-12-08 Network checksum algorithm optimization method based on Feiteng platform

Country Status (1)

Country Link
CN (1) CN112612518B (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114968653B (en) * 2022-07-14 2022-11-11 麒麟软件有限公司 Method for determining RAIDZ check value of ZFS file system

Citations (8)

Publication number Priority date Publication date Assignee Title
CN104937542A (en) * 2013-01-23 2015-09-23 国际商业机器公司 Vector checksum instruction
CN106293870A (en) * 2015-06-29 2017-01-04 联发科技股份有限公司 Computer system and strategy thereof guide compression method
CN106484503A (en) * 2015-08-27 2017-03-08 深圳市中兴微电子技术有限公司 A kind of computational methods of verification sum and network processing unit
US9648102B1 (en) * 2012-12-27 2017-05-09 Iii Holdings 2, Llc Memcached server functionality in a cluster of data processing nodes
CN107094369A (en) * 2014-09-26 2017-08-25 英特尔公司 Instruction and logic for providing SIMD SM3 Cryptographic Hash Functions
CN108139907A (en) * 2015-10-14 2018-06-08 Arm有限公司 Vector data send instructions
CN110620585A (en) * 2018-06-20 2019-12-27 英特尔公司 Supporting random access of compressed data
CN111654265A (en) * 2020-06-19 2020-09-11 京东方科技集团股份有限公司 Quick checking circuit, method and device

Family Cites Families (1)

Publication number Priority date Publication date Assignee Title
US20200177660A1 (en) * 2020-02-03 2020-06-04 Intel Corporation Offload of streaming protocol packet formation


Non-Patent Citations (1)

Title
Yuanwei Fang et al.; "UDP: a programmable accelerator for extract-transform-load workloads and more"; Proceedings of the 50th Annual IEEE/ACM International Symposium on Microarchitecture; 2017-10-14; full text *

Also Published As

Publication number Publication date
CN112612518A (en) 2021-04-06


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant