CN112612518B - Network checksum algorithm optimization method based on Feiteng platform - Google Patents


Info

Publication number
CN112612518B
CN112612518B
Authority
CN
China
Prior art keywords
data
neon
cnt
buff
result
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202011420425.XA
Other languages
Chinese (zh)
Other versions
CN112612518A (en)
Inventor
胡海
刘正元
刘云
肖林逵
黄锦慧
李佑鸿
彭灿
孙立明
张铎
李唯实
曾驰
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Kirin Software Co Ltd
Original Assignee
Kirin Software Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Kirin Software Co Ltd filed Critical Kirin Software Co Ltd
Priority to CN202011420425.XA priority Critical patent/CN112612518B/en
Publication of CN112612518A publication Critical patent/CN112612518A/en
Application granted granted Critical
Publication of CN112612518B publication Critical patent/CN112612518B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00Arrangements for program control, e.g. control units
    • G06F9/06Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/30Arrangements for executing machine instructions, e.g. instruction decode
    • G06F9/30003Arrangements for executing specific machine instructions
    • G06F9/30007Arrangements for executing specific machine instructions to perform operations on data operands
    • G06F9/30036Instructions to perform operations on packed data, e.g. vector, tile or matrix operations
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00Arrangements for program control, e.g. control units
    • G06F9/06Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/30Arrangements for executing machine instructions, e.g. instruction decode
    • G06F9/32Address formation of the next instruction, e.g. by incrementing the instruction counter
    • G06F9/322Address formation of the next instruction, e.g. by incrementing the instruction counter for non-sequential address
    • G06F9/325Address formation of the next instruction, e.g. by incrementing the instruction counter for non-sequential address for loops, e.g. loop detection or loop counter
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04LTRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L41/00Arrangements for maintenance, administration or management of data switching networks, e.g. of packet switching networks
    • H04L41/08Configuration management of networks or network elements
    • H04L41/0896Bandwidth or capacity management, i.e. automatically increasing or decreasing capacities

Landscapes

  • Engineering & Computer Science (AREA)
  • Software Systems (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Mathematical Physics (AREA)
  • Computer Networks & Wireless Communication (AREA)
  • Signal Processing (AREA)
  • Detection And Prevention Of Errors In Transmission (AREA)

Abstract

The invention discloses a network checksum algorithm optimization method based on the Feiteng platform, implemented as follows: first, 128 bits of data are loaded into a 128-bit NEON register to reduce the number of loop iterations; next, a NEON vector pairwise-add instruction treats the data in the 128-bit NEON register as eight 16-bit values and adds them pairwise, and once the data stream has been processed up to a certain length, processing switches to ARM64 assembly; finally, the 64-bit result is converted to a 16-bit checksum. Compared with the prior art, the network checksum algorithm optimization method based on the Feiteng platform effectively reduces the delay introduced by the checksum algorithm when the network receives UDP data, thereby improving UDP packet transmission efficiency, and it has the advantages of independent controllability, an original implementation, and a marked practical effect.

Description

Network checksum algorithm optimization method based on Feiteng platform
Technical Field
The invention belongs to the technical field of communication and computers, and particularly relates to a network checksum algorithm optimization method based on a Feiteng platform.
Background
The domestic Feiteng series of processors is based on the ARM64 architecture, is fully compatible with the ARMv8 instruction set, and implements the NEON extension instructions internally. The SIMD part of the extension instruction set compensates for the Feiteng processor's weakness in CPU frequency and can improve the memory-access and data-computation speed of data-intensive applications. Common data-intensive applications include graphics computing, entertainment audio, and data verification.
Ethernet is the most common communication protocol standard in today's local area networks, and many protocols are available at the transport layer; among them, the UDP protocol is widely used in local area networks because of its compact structure and low transmission overhead. UDP is a simple datagram-oriented transport layer protocol that provides connectionless, unreliable transport of data streams. A UDP datagram is carried as the data segment of an Ethernet packet; the UDP header contains a source port, a destination port, the UDP length, and the UDP checksum, and the calculation of the 16-bit UDP checksum covers all of the data following the checksum field.
The existing UDP checksum calculation method divides the UDP pseudo header, the UDP header, and the data segment into 16-bit values, adds the data cyclically in groups, adds any generated carry back into the low bit of the running sum, inverts the accumulated result bit by bit, and writes the computed result back into the UDP checksum field. Computing the UDP checksum therefore requires cyclically adding every 16-bit word in the data stream one by one, and as the amount of data carried in a UDP packet grows, the number of these stepwise additions grows accordingly, which greatly reduces the efficiency of UDP packet transmission.
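The conventional calculation described above can be sketched in portable C as follows. The function name is illustrative; this is a sketch of the standard Internet ones'-complement checksum, not code taken from the patent:

```c
#include <stddef.h>
#include <stdint.h>

/* Reference (unoptimized) UDP/Internet checksum: sum 16-bit words with
 * end-around carry, then invert. One 16-bit word per loop iteration,
 * which is exactly the bottleneck the patent targets. */
uint16_t checksum_scalar(const uint8_t *buff, size_t len)
{
    uint32_t sum = 0;
    while (len > 1) {                 /* one 16-bit word per iteration */
        sum += (uint32_t)(buff[0] << 8 | buff[1]);
        buff += 2;
        len  -= 2;
    }
    if (len)                          /* odd trailing byte, padded with 0 */
        sum += (uint32_t)buff[0] << 8;
    while (sum >> 16)                 /* fold carries back in (end-around) */
        sum = (sum & 0xffff) + (sum >> 16);
    return (uint16_t)~sum;
}
```

For the classic RFC 1071 example data {00 01 f2 03 f4 f5 f6 f7}, this returns 0x220d (the complement of the folded sum 0xddf2).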
Chinese patent application CN201210087407.3 provides a UDP checksum calculation method that first sets the UDP checksum field to a constant, then calculates according to the traditional UDP checksum method, and finally appends the obtained result to the end of the UDP data portion. This approach simplifies the packing flow of UDP packets, so that all data can be packed and sent immediately after a single read, but it takes no effective measure to reduce the number of stepwise additions over the data stream when the checksum is computed.
Chinese patent application CN201510536324.1 provides a checksum calculation method and a network processor. A multithreaded micro-engine first obtains the calculation parameters for the current thread and sends them to a calculation unit; the calculation unit then performs the checksum computation while a thread scheduling module puts the current thread to sleep; when the computation completes, the calculation unit writes the checksum into the current thread's checksum register and instructs the scheduling module to wake the thread; and when the scheduling module moves the thread from the awakened state into the working state, the micro-engine writes the computed checksum into the position corresponding to the current thread in the data storage unit. By embedding checksum computation into the micro-engine's pipeline, this method reduces scheduling overhead and improves network-processor performance, but again it takes no effective measure to reduce the number of stepwise additions in the checksum computation itself.
Disclosure of Invention
In order to solve the above problems, the invention provides a network checksum algorithm optimization method based on the Feiteng platform, comprising the following steps:
S1: determine the NEON loop count cnt_neon and the assembly loop count cnt_asm;
S2: define NEON register variables VA and VB and initialize them to 0;
S3: judge whether cnt_neon > 0 holds; if yes, go to step S4; if not, go to step S7;
S4: load eight 16-bit values from buff into VB;
S5: use the UADALP vector addition instruction to complete the vector addition of VA and VB;
S6: decrement cnt_neon by 1, advance buff by 16 bytes, and return to step S3;
S7: accumulate the four 32-bit values in VA into result;
S8: judge whether cnt_asm > 0; if yes, go to step S9; if not, go to step S12;
S9: load four 16-bit values from buff into X1;
S10: complete the result += X1 accumulation using the ADDS and ADCS addition instructions;
S11: decrement cnt_asm by 1, advance buff by 8 bytes, and return to step S8;
S12: cyclically accumulate the remaining data in buff into result;
S13: convert result to a 16-bit number and invert it.
Compared with the prior art, the invention has the following advantages:
(1) The design and implementation of the optimized checksum algorithm are independently researched and developed, so the method carries complete intellectual property rights.
(2) The implementation is original: it makes full use of the NEON feature of the Feiteng processor, combines it with the characteristics of the checksum algorithm to exploit the advantages of NEON instructions, and further reduces the number of stepwise cyclic additions by combining assembly instructions.
(3) The implementation effect is significant: the per-iteration data path is widened from 16 bits before optimization to 16 bytes after optimization, which greatly reduces the delay introduced by the checksum algorithm when the network receives UDP data and improves UDP network bandwidth.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings used in the description of the embodiments or the prior art will be briefly described below, and it is obvious that the drawings in the following description are only some embodiments of the present invention, and for those skilled in the art, other drawings can be obtained according to these drawings without creative efforts.
FIG. 1 is a flow chart of a generic Checksum algorithm on a Feiteng platform in the prior art;
FIG. 2 is a diagram illustrating the execution of UADALP VA.4S and VB.8H instructions.
Fig. 3 is a flowchart of the Feiteng platform-based Checksum algorithm optimization method of the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention more apparent, the present invention will be described in further detail with reference to the accompanying drawings in conjunction with the following detailed description. It should be understood that the description is intended to be exemplary only, and is not intended to limit the scope of the present invention. Moreover, in the following description, descriptions of well-known structures and techniques are omitted so as to not unnecessarily obscure the concepts of the present invention.
As shown in fig. 2 to fig. 3, in an embodiment of the present application, the invention provides a network checksum algorithm optimization method based on the Feiteng platform, comprising the steps of:
S1: determine the NEON loop count cnt_neon and the assembly loop count cnt_asm;
S2: define NEON register variables VA and VB and initialize them to 0;
S3: judge whether cnt_neon > 0 holds; if yes, go to step S4; if not, go to step S7;
S4: load eight 16-bit values from buff into VB;
S5: use the UADALP vector addition instruction to complete the vector addition of VA and VB;
S6: decrement cnt_neon by 1, advance buff by 16 bytes, and return to step S3;
S7: accumulate the four 32-bit values in VA into result;
S8: judge whether cnt_asm > 0; if yes, go to step S9; if not, go to step S12;
S9: load four 16-bit values from buff into X1;
S10: complete the result += X1 accumulation using the ADDS and ADCS addition instructions;
S11: decrement cnt_asm by 1, advance buff by 8 bytes, and return to step S8;
S12: cyclically accumulate the remaining data in buff into result;
S13: convert result to a 16-bit number and invert it.
The most time-consuming part of FIG. 1 is the loop, in which only 16 bits of data can be processed per iteration and, besides the accumulation itself, an overflow condition must be handled on every iteration, which consumes a large number of CPU instruction cycles.
The full Feiteng processor series is 64-bit, and its registers support 64-bit-wide operands. The ADDS addition instruction affects the Feiteng CPU's condition-code flags: the carry flag C is set to 1 when the result overflows and to 0 when it does not. Using the ADDS and ADCS addition instructions, four 16-bit values can be processed in one operation without any extra overflow handling, since ADCS folds the carry flag back into the next addition. In addition, the whole Feiteng series supports NEON, whose 128-bit operations can process eight 16-bit values at a time, twice as many as the ADDS/ADCS path. However, NEON has no carry chain comparable to that of the ADDS and ADCS instructions. To fully exploit its advantages, the accumulation is therefore performed with the NEON vector pairwise-add-and-accumulate instruction UADALP VA.4S, VB.8H, depicted schematically in FIG. 2. This instruction processes eight 16-bit values at a time, and because the accumulation u32 += u16 + u16 cannot overflow within 0xffff iterations, no overflow or carry handling is needed inside the loop as long as the iteration count does not exceed 0xffff. Data beyond 0xffff iterations is accelerated with the ARM64 assembly instructions ADDS and ADCS. This combination retains the advantage of the 128-bit NEON instruction while avoiding the cost of overflow handling.
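The ADDS/ADCS idea, adding 64 bits (four 16-bit words) at a time and feeding the carry flag straight back into the sum, can be sketched portably by detecting overflow after each add. The helper name is hypothetical, and on AArch64 a compiler may lower the overflow check to exactly this ADCS-style carry chain:

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Portable sketch of the ARM64 ADDS/ADCS pattern: consume the buffer in
 * 64-bit chunks (four 16-bit words per add) and fold the carry out of
 * bit 63 back into the running sum (end-around carry). */
uint64_t sum64_with_carry(const uint8_t *buff, size_t nwords64, uint64_t sum)
{
    while (nwords64--) {
        uint64_t v;
        memcpy(&v, buff, 8);          /* load 4 x 16-bit lanes at once */
        sum += v;
        if (sum < v)                  /* unsigned wrap-around = carry out */
            sum += 1;
        buff += 8;
    }
    return sum;
}
```

Because the end-around carry is re-absorbed immediately, the 64-bit running sum can later be folded down to 16 bits without losing any carries.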
Fig. 1 is a flowchart of the general, unoptimized network Checksum algorithm on the current Feiteng platform, and fig. 3 is a flowchart of the Checksum algorithm after NEON optimization. The optimized flow builds on fig. 1, widening the data-computation path and raising the efficiency of data transmission. As before, the embodiment steps do not consider data address alignment; if the data address is unaligned, only a simple conversion is needed. The specific embodiment steps are as follows:
step S201: a 64bit wide result is defined and initialized to 0.
Step S202: the number of cycles cnt _ NEON of the NEON instruction is determined, ensuring that it is not greater than 0xffff, and the number of assembly cycles cnt _ asm is determined.
Step S203: a 4-channel 128-bit NEON register variable VA, 32-bits per channel, is defined and initialized to 0, and an 8-channel 128-bit NEON register variable VB, 16-bits per channel, is defined for loading data from buff.
Step S204: steps S205, S206, S207 are repeated until cnt _ neon is 0.
Step S205: 8 16 bits of data are loaded from the buff into 8 channels of the NEON register variable VB via LDR instructions.
Step S206: the operation of a0+ ═ B0+ B1, a1+ ═ B2+ B3, a2+ ═ B4+ B5, A3+ ═ B6+ B7 is completed by using a NEON vector pair-add instruction UADALP.
Step S207: the data buff is shifted backward by 16 bytes and cnt _ neon is decremented by 1.
Step S208: the nenon register variable VA accumulates to result, i.e., result + ═ a0+ a1+ a2+ A3;
step S209: the steps S210, S211, S212 are repeated until the cnt _ asm is 0.
Step S210: 4 16 bits of data are loaded from buf into register X1 by the LDR instruction.
Step S211: with the ADDS ADCS add instruction, result + X1+ C is completed.
Step S212: the data buff is shifted backward by 8 bytes, and cnt _ asm is decremented by 1.
Step S213: and circularly accumulating the residual data in the buf to result, converting the result into a 16-bit number, and inverting to obtain the checksum.
In step S202), the cnt _ neon is set to 0xffff at maximum to prevent overflow of the neon vector pair-wise addition instruction UADALP, and the U32+ ═ U16+ U16 operation can ensure that the carry is not generated by overflow when the loop addition is performed 0xffff times.
In step S206), a0, a1, a2, A3 refer to 4 32-bit lanes in the NEON register variable VA, B0, B1, B2, B3, B4, B5, B6, B7 refer to 8 16-bit lanes in the NEON register variable VB, and the vector addition operation is specifically completed as shown in fig. 2.
In step S213), the processing is performed in units of 128bit/64bit in the previous step, so that there may be a few data left unprocessed in buf, and the accumulation processing herein needs to consider the overflow condition, but because there is only a data amount less than 8 bytes, no large delay is caused. The result performs 64-bit conversion and 16-bit operation, namely dividing 64 bits into 4 pieces of 16-bit data and accumulating the 4 pieces of 16-bit data.
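Putting steps S201 through S213 together, the data flow can be modelled in portable C: a 4-element 32-bit accumulator array stands in for the NEON register VA and the pairwise accumulation performed by UADALP VA.4S, VB.8H, and the final fold converts the 64-bit result to a 16-bit checksum. This is a behavioural sketch under the assumptions above (function name hypothetical), not the platform assembly:

```c
#include <stddef.h>
#include <stdint.h>

/* Behavioural model of the optimized flow: per 16-byte block, eight
 * 16-bit lanes B0..B7 are pairwise-accumulated into four 32-bit lanes
 * A0..A3 (modelling UADALP), the lanes are summed into a 64-bit result,
 * the tail is accumulated word by word, and the result is folded to
 * 16 bits and inverted. */
uint16_t checksum_neon_model(const uint8_t *buff, size_t len)
{
    uint32_t va[4] = {0, 0, 0, 0};    /* stands in for NEON register VA */
    uint64_t result = 0;

    /* "NEON" phase: 16 bytes (eight 16-bit lanes) per iteration. The u32
     * lanes cannot overflow for fewer than 0xffff iterations. */
    while (len >= 16) {
        for (int lane = 0; lane < 4; lane++) {
            uint16_t b_even = (uint16_t)(buff[4*lane + 0] << 8 | buff[4*lane + 1]);
            uint16_t b_odd  = (uint16_t)(buff[4*lane + 2] << 8 | buff[4*lane + 3]);
            va[lane] += (uint32_t)b_even + b_odd;   /* A_i += B_2i + B_2i+1 */
        }
        buff += 16;
        len  -= 16;
    }
    result = (uint64_t)va[0] + va[1] + va[2] + va[3];

    /* Tail: remaining 16-bit words, then an odd byte if any (step S213). */
    while (len > 1) {
        result += (uint16_t)(buff[0] << 8 | buff[1]);
        buff += 2;
        len  -= 2;
    }
    if (len)
        result += (uint64_t)buff[0] << 8;

    /* Fold 64 -> 16 bits with end-around carry, then invert. */
    while (result >> 16)
        result = (result & 0xffff) + (result >> 16);
    return (uint16_t)~result;
}
```

Because ones'-complement addition is associative, this model produces the same checksum as the scalar word-by-word reference, which is what makes the lane-parallel rearrangement legitimate.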
Compared with the prior art, the invention has the following advantages:
(1) The design and implementation of the optimized checksum algorithm are independently researched and developed, so the method carries complete intellectual property rights.
(2) The implementation is original: it makes full use of the NEON feature of the Feiteng processor, combines it with the characteristics of the checksum algorithm to exploit the advantages of NEON instructions, and further reduces the number of stepwise cyclic additions by combining assembly instructions.
(3) The implementation effect is significant: the per-iteration data path is widened from 16 bits before optimization to 16 bytes after optimization, which greatly reduces the delay introduced by the checksum algorithm when the network receives UDP data and improves UDP network bandwidth.
It is to be understood that the above-described embodiments of the present invention are merely illustrative of or explaining the principles of the invention and are not to be construed as limiting the invention. Therefore, any modification, equivalent replacement, improvement and the like made without departing from the spirit and scope of the present invention should be included in the protection scope of the present invention. Further, it is intended that the appended claims cover all such variations and modifications as fall within the scope and boundaries of the appended claims or the equivalents of such scope and boundaries.

Claims (1)

1. A network checksum algorithm optimization method based on a Feiteng platform, characterized by comprising the following steps:
S1: determining the NEON loop count cnt_neon and the assembly loop count cnt_asm;
S2: defining NEON register variables VA and VB, and initializing them to 0;
S3: judging whether cnt_neon > 0 holds; if yes, going to step S4; if not, going to step S7;
S4: loading eight 16-bit values from buff into VB;
S5: adopting the UADALP vector addition instruction to complete the vector addition of VA and VB;
S6: decrementing cnt_neon by 1, advancing buff by 16 bytes, and returning to step S3;
S7: accumulating the four 32-bit values in VA into result;
S8: judging whether cnt_asm > 0; if yes, going to step S9; if not, going to step S12;
S9: loading four 16-bit values from buff into X1;
S10: completing the result += X1 accumulation using the ADDS and ADCS addition instructions;
S11: decrementing cnt_asm by 1, advancing buff by 8 bytes, and returning to step S8;
S12: cyclically accumulating the remaining data in buff into result;
S13: converting result to a 16-bit number and inverting it.
CN202011420425.XA 2020-12-08 2020-12-08 Network checksum algorithm optimization method based on Feiteng platform Active CN112612518B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011420425.XA CN112612518B (en) 2020-12-08 2020-12-08 Network checksum algorithm optimization method based on Feiteng platform


Publications (2)

Publication Number Publication Date
CN112612518A CN112612518A (en) 2021-04-06
CN112612518B (en) 2022-04-01

Family

ID=75229268

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011420425.XA Active CN112612518B (en) 2020-12-08 2020-12-08 Network checksum algorithm optimization method based on Feiteng platform

Country Status (1)

Country Link
CN (1) CN112612518B (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114968653B (en) * 2022-07-14 2022-11-11 麒麟软件有限公司 Method for determining RAIDZ check value of ZFS file system

Citations (8)

Publication number Priority date Publication date Assignee Title
CN104937542A (en) * 2013-01-23 2015-09-23 国际商业机器公司 Vector checksum instruction
CN106293870A (en) * 2015-06-29 2017-01-04 联发科技股份有限公司 Computer system and strategy thereof guide compression method
CN106484503A (en) * 2015-08-27 2017-03-08 深圳市中兴微电子技术有限公司 A kind of computational methods of verification sum and network processing unit
US9648102B1 (en) * 2012-12-27 2017-05-09 Iii Holdings 2, Llc Memcached server functionality in a cluster of data processing nodes
CN107094369A (en) * 2014-09-26 2017-08-25 英特尔公司 Instruction and logic for providing SIMD SM3 Cryptographic Hash Functions
CN108139907A (en) * 2015-10-14 2018-06-08 Arm有限公司 Vector data send instructions
CN110620585A (en) * 2018-06-20 2019-12-27 英特尔公司 Supporting random access of compressed data
CN111654265A (en) * 2020-06-19 2020-09-11 京东方科技集团股份有限公司 Quick checking circuit, method and device

Family Cites Families (1)

Publication number Priority date Publication date Assignee Title
US20200177660A1 (en) * 2020-02-03 2020-06-04 Intel Corporation Offload of streaming protocol packet formation


Non-Patent Citations (1)

Title
Yuanwei Fang et al.; "UDP: a programmable accelerator for extract-transform-load workloads and more"; Proceedings of the 50th Annual IEEE/ACM International Symposium on Microarchitecture; 2017-10-14; full text *

Also Published As

Publication number Publication date
CN112612518A (en) 2021-04-06


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant