CN112035898A - Multi-node multi-channel high-speed parallel processing method and system - Google Patents

Multi-node multi-channel high-speed parallel processing method and system

Info

Publication number
CN112035898A
CN112035898A (application CN202010844411.4A)
Authority
CN
China
Prior art keywords
processed
data packet
reverse
host
memory nodes
Prior art date
Legal status
Withdrawn
Application number
CN202010844411.4A
Other languages
Chinese (zh)
Inventor
吴世勇
苏庆会
李银龙
王凯霖
王斌
冯驰
王中原
卫志刚
徐诺
姬少锋
Current Assignee
Zhengzhou Xinda Jiean Information Technology Co Ltd
Original Assignee
Zhengzhou Xinda Jiean Information Technology Co Ltd
Priority date
Filing date
Publication date
Application filed by Zhengzhou Xinda Jiean Information Technology Co Ltd filed Critical Zhengzhou Xinda Jiean Information Technology Co Ltd
Priority to CN202010844411.4A
Publication of CN112035898A
Current legal status: Withdrawn

Classifications

    • G06F 21/71: Protecting specific internal or peripheral components, in which the protection of a component leads to protection of the entire computer, to assure secure computing or processing of information
    • G06F 13/28: Handling requests for interconnection or transfer for access to input/output bus using burst mode transfer, e.g. direct memory access (DMA), cycle steal
    • G06F 21/602: Protecting data; providing cryptographic facilities or services
    • G06F 21/85: Protecting input, output or interconnection devices, e.g. bus-connected or in-line devices
    • G06F 5/06: Methods or arrangements for data conversion without changing the order or content of the data handled, for changing the speed of data flow, i.e. speed regularising or timing, e.g. delay lines, FIFO buffers; over- or underrun control therefor

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Computer Hardware Design (AREA)
  • Computer Security & Cryptography (AREA)
  • Software Systems (AREA)
  • Mathematical Physics (AREA)
  • Health & Medical Sciences (AREA)
  • Bioethics (AREA)
  • General Health & Medical Sciences (AREA)
  • Data Exchanges In Wide-Area Networks (AREA)

Abstract

The invention provides a multi-node multi-channel high-speed parallel processing method and system. The method comprises the following steps: selecting one virtual channel from a plurality of virtual channels as the target virtual channel based on the task requirement of a to-be-processed data packet, and determining the corresponding host forward buffer and host reverse buffer as the target buffers based on the target virtual channel; allocating the required forward memory nodes and reverse memory nodes for the to-be-processed data packet from the target buffers, the host writing the to-be-processed data packet into the allocated forward memory node among the m forward memory nodes and writing a command word into the corresponding command word FIFO; and, when the forward DMA module polls that command word FIFO, determining the address information of the allocated forward memory node and the length of the to-be-processed data packet based on the command word information and transmitting the to-be-processed data packet to the algorithm module; and so on. The invention enables high-speed transmission and high-speed processing of data.

Description

Multi-node multi-channel high-speed parallel processing method and system
Technical Field
The invention relates to the field of computer technology, and in particular to a multi-node multi-channel high-speed parallel processing method and system.
Background
An FPGA (Field-Programmable Gate Array) is a further development of programmable devices such as PAL (Programmable Array Logic), GAL (Generic Array Logic) and CPLD (Complex Programmable Logic Device). As a semi-custom circuit in the field of application-specific integrated circuits, it overcomes both the inflexibility of fully custom circuits and the limited gate count of earlier programmable devices.
At present, the main control function of an encryption card is usually implemented in an FPGA. When data on the host needs to be sent to the encryption card for encryption and decryption, the requirements on the algorithms used are generally diverse, covering both symmetric and asymmetric algorithms. If the data to be encrypted and decrypted is transmitted and received over a single channel, data congestion is likely when the data volume is large. Moreover, if a large amount of data from one task occupies the front of the data stream, the data of other tasks can only be transmitted after that task is finished, which easily leaves part of the processing units in the FPGA idle and keeps resource utilization low. In some scenarios, a task may need to encrypt and decrypt a group of ordered data packets whose sizes differ, so the time spent on each packet also differs; the group of ordered packets then easily becomes disordered after encryption and decryption, and this packet-ordering problem is difficult to solve with the traditional single-channel mode.
In addition, if several tasks of the same type must be processed over the same channel within the same time period and the data of one task occupies the processing channel, the data of other tasks awaiting processing, especially urgent task data, can only be sent after the previous task's data has been processed. If the previous task carries a large amount of data, the processing channel is occupied for a long time and the other tasks become congested. In other words, this processing mode degrades the transmission efficiency and processing progress of the other pending task data, especially urgent task data, and thus lowers the processing efficiency of the system.
Disclosure of Invention
In view of the above problems, it is desirable to provide a multi-node multi-channel high-speed parallel processing method and system, which support multi-channel data transmission and implement quick and orderly transmission and processing of multi-task data.
A first aspect of the present invention provides a multi-node multi-channel high-speed parallel processing method, including the following steps:
selecting one virtual channel from a plurality of virtual channels as the target virtual channel based on the task requirement of a to-be-processed data packet, and determining the corresponding host forward buffer and host reverse buffer as the target buffers based on the target virtual channel; wherein a plurality of virtual channels are constructed between the host and the FPGA chip, the plurality of forward memory nodes in the host forward buffer of the same virtual channel correspond one-to-one with the plurality of reverse memory nodes in the host reverse buffer, the forward memory nodes are each used to cache a to-be-processed data packet for the FPGA chip to read, and the reverse memory nodes are each used to receive a completed data packet processed by the FPGA chip;
allocating the required forward memory nodes and reverse memory nodes for the to-be-processed data packet from the target buffers; the host writes the to-be-processed data packet into the allocated forward memory node among the m forward memory nodes and writes a command word into the corresponding command word FIFO; when the forward DMA module polls that command word FIFO, it determines the address information of the allocated forward memory node and the length of the to-be-processed data packet based on the command word information, and transmits the to-be-processed data packet to the algorithm module;
the algorithm module receives the to-be-processed data packet, performs operation processing to obtain the corresponding completed data packet, and writes a status word into the corresponding status word FIFO; when the reverse DMA module polls that status word FIFO, it determines the address information of the allocated reverse memory node and the length of the completed data packet based on the status word information, and writes the completed data packet into the allocated reverse memory node.
A second aspect of the invention provides a multi-node multi-channel high-speed parallel processing system, which comprises an FPGA chip and a host; the FPGA chip is communicatively connected to the host, and a plurality of virtual channels are constructed between them to transmit the data packets of different tasks;
the host comprises a plurality of host forward buffers and a plurality of host reverse buffers, and the host forward buffers, the host reverse buffers and the virtual channels correspond one-to-one; the FPGA chip comprises a DMA module, a plurality of command word FIFOs, a plurality of status word FIFOs and an algorithm module;
each host forward buffer comprises a plurality of forward memory nodes and each host reverse buffer comprises a plurality of reverse memory nodes; the plurality of forward memory nodes in the host forward buffer of the same virtual channel correspond one-to-one with the plurality of reverse memory nodes in the host reverse buffer; the forward memory nodes are each used to cache a to-be-processed data packet for the FPGA chip to read, and the reverse memory nodes are each used to receive a completed data packet processed by the FPGA chip;
the DMA module comprises a forward DMA module and a reverse DMA module; the forward DMA module polls and reads the plurality of command word FIFOs, reads the to-be-processed data packet in the corresponding forward memory node through the corresponding virtual channel based on the command word information, and transmits it to the algorithm module; the reverse DMA module polls and reads the plurality of status word FIFOs, and writes the completed data packet processed by the algorithm module into the corresponding reverse memory node through the corresponding virtual channel based on the status word information; the algorithm module is used to receive the to-be-processed data packet and perform operation processing to obtain the corresponding completed data packet;
and, within the same time period, when the FPGA chip and the host process to-be-processed data packets required by the same task, the steps of the above multi-node multi-channel high-speed parallel processing method are executed.
Compared with the prior art, the present invention has the following prominent substantive features and notable advantages:
1) The invention provides a multi-node multi-channel high-speed parallel processing method and system that process multiple kinds of tasks in parallel through a plurality of virtual channels and process multiple data packets of the same task in parallel through the different nodes of each virtual channel, thereby improving the processing efficiency of the system. The host only needs to be concerned with the corresponding forward/reverse memory nodes, the DMA module is responsible for reading and writing data to each forward/reverse memory node at high speed, and the multiple algorithms of the algorithm module perform parallel high-speed operation processing; the nodes, the DMA module and the algorithm module are independent of one another and operate asynchronously. Compared with the traditional synchronous mode of operation, the invention improves efficiency while maximizing the utilization of every resource;
2) the required forward memory nodes and reverse memory nodes are allocated to a to-be-processed data packet from the target buffer based on the length of the to-be-processed data packet or the calculated length of the completed data packet, which avoids the disorder caused by a mismatch between the length of the to-be-processed data packet and a forward memory node, or between the length of the completed data packet and a reverse memory node;
3) if the total number of forward memory nodes currently required is greater than or equal to the preset value m, the sending order of the to-be-processed data packets is determined based on their urgency, which solves the congestion of other task data, especially urgent task data, caused by a previous task with a large data volume occupying the processing channel for a long time;
4) when a current to-be-processed data packet is being processed and a new to-be-processed data packet required by the same task is received, the processing mode is chosen based on the relationship between the number of forward memory nodes required by the new packet and the number of currently idle forward memory nodes in the target host forward buffer; this achieves pipelined processing of to-be-processed data packets of the same task and reduces the host overhead when handling such requests;
5) the completed data can be kept in order while high-speed transmission and high-speed processing of data are achieved, solving the disorder that a traditional algorithm module produces after operating in parallel on a group of ordered to-be-processed data packets;
6) the number of virtual channels and nodes can be expanded according to actual needs, so the method is suitable for more application scenarios and has high applicability;
7) the algorithm module comprises a plurality of algorithm units that support parallel operation on multiple tasks, and each algorithm unit in turn comprises a plurality of algorithm subunits that support parallel operation on multiple data packets of the same task, so the operation efficiency of the algorithm module and the utilization rate of each algorithm are both high.
Description of the drawings:
FIG. 1 is a block diagram of a multi-node, multi-channel, high-speed parallel processing system of the present invention;
FIG. 2 is a schematic diagram of multi-node parallel processing for a channel according to the present invention;
FIG. 3 is a flow chart of a multi-node multi-channel high-speed parallel processing method according to the invention.
Detailed description of the embodiments:
in order to make the present invention clearer, the technical solution of the present invention is further described in detail by the following embodiments.
FIG. 3 is a flow chart of a multi-node multi-channel high-speed parallel processing method according to the invention.
As shown in fig. 3, a first aspect of the present invention provides a multi-node multi-channel high-speed parallel processing method, including the following steps:
selecting one virtual channel from a plurality of virtual channels as the target virtual channel based on the task requirement of a to-be-processed data packet, and determining the corresponding host forward buffer and host reverse buffer as the target buffers based on the target virtual channel; wherein a plurality of virtual channels are constructed between the host and the FPGA chip, the plurality of forward memory nodes in the host forward buffer of the same virtual channel correspond one-to-one with the plurality of reverse memory nodes in the host reverse buffer, the forward memory nodes are each used to cache a to-be-processed data packet for the FPGA chip to read, and the reverse memory nodes are each used to receive a completed data packet processed by the FPGA chip; m forward memory nodes are preset in each host forward buffer and m reverse memory nodes are preset in each host reverse buffer; the target buffers comprise a target host forward buffer and a target host reverse buffer;
allocating the required forward memory nodes and reverse memory nodes for the to-be-processed data packet from the target buffers; the host writes the to-be-processed data packet into the allocated forward memory node among the m forward memory nodes and writes a command word into the corresponding command word FIFO; when the forward DMA module polls that command word FIFO, it determines the address information of the allocated forward memory node and the length of the to-be-processed data packet based on the command word information, and transmits the to-be-processed data packet to the algorithm module;
the algorithm module receives the to-be-processed data packet, performs operation processing to obtain the corresponding completed data packet, and writes a status word into the corresponding status word FIFO; when the reverse DMA module polls that status word FIFO, it determines the address information of the allocated reverse memory node and the length of the completed data packet based on the status word information, and writes the completed data packet into the allocated reverse memory node.
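To make the data layout behind these steps concrete, the following C sketch shows a hypothetical host-side view of one virtual channel: m forward memory nodes paired one-to-one with m reverse memory nodes. All names, field sizes and the node capacity are illustrative assumptions, not definitions taken from the patent.

/* Hypothetical host-side view of one virtual channel. The i-th command-word
 * and status-word FIFOs live on the FPGA; the host only writes command words
 * and later reads completed packets back from the reverse memory nodes. */
#include <stdint.h>

#define NODE_SIZE 4096u           /* assumed storage capacity of one memory node */
#define M_NODES   16u             /* assumed value of "m" nodes per channel      */

typedef struct {
    uint8_t  data[NODE_SIZE];     /* packet (or sub-packet) payload              */
    uint32_t length;              /* valid bytes currently held in this node     */
    int      in_use;              /* 1 while allocated to a pending packet       */
} mem_node_t;

typedef struct {
    mem_node_t forward[M_NODES];  /* host forward buffer: cached for FPGA reads  */
    mem_node_t reverse[M_NODES];  /* host reverse buffer: filled by reverse DMA  */
} virtual_channel_t;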
Further, when allocating the required forward memory nodes and reverse memory nodes for a to-be-processed data packet from the target buffers, the following is executed: the length of the to-be-processed data packet of an application subject is read, and it is judged whether that length is less than or equal to the storage capacity of one forward memory node;
if the length of the to-be-processed data packet is less than or equal to the storage capacity of one forward memory node, a first application request is generated based on the to-be-processed data packet, and the host allocates one forward memory node and one reverse memory node to the to-be-processed data packet according to the first application request of the application subject;
if the length of the to-be-processed data packet is greater than the storage capacity of one forward memory node, the to-be-processed data is split into a group of ordered to-be-processed sub data packets, and a second application request is generated based on the sub data packets; the host allocates a group of ordered forward memory nodes and reverse memory nodes to the to-be-processed data packet according to the second application request.
It should be noted that the present invention allocates the required forward memory nodes and reverse memory nodes to the to-be-processed data packet from the target buffer based on the length of the to-be-processed data packet, so that the packet length matches the forward memory nodes and the disorder caused by a mismatch between them is avoided.
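The allocation rule just described can be summarized in a few lines of C. The function below is a minimal sketch under the assumption that the node capacity is known to the host; the enum values, parameter names and the ceiling division are illustrative, not the patent's wording.

#include <stddef.h>

typedef enum { REQ_SINGLE_NODE = 1, REQ_ORDERED_GROUP = 2 } request_kind_t;

static request_kind_t choose_request(size_t packet_len, size_t node_capacity,
                                     size_t *nodes_needed)
{
    if (packet_len <= node_capacity) {          /* first application request   */
        *nodes_needed = 1;
        return REQ_SINGLE_NODE;
    }
    /* second application request: split into ceil(len / capacity) ordered
     * sub-packets, each assigned its own forward/reverse node pair           */
    *nodes_needed = (packet_len + node_capacity - 1) / node_capacity;
    return REQ_ORDERED_GROUP;
}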
In one embodiment, based on the task requirement of a to-be-processed data packet a_ij, the i-th virtual channel is selected from the plurality of virtual channels as the target virtual channel, and the i-th host forward buffer and the i-th host reverse buffer are determined as the target buffers based on the target virtual channel. If the number of to-be-processed data packets is 1 and the packet length is less than or equal to the storage capacity of one forward memory node, one forward memory node (the j-th forward memory node) and one reverse memory node (the j-th reverse memory node) are allocated to the to-be-processed data packet according to the first application request. The host writes the to-be-processed data packet a_ij into the j-th forward memory node among the m forward memory nodes and writes a command word into the i-th command word FIFO; when the forward DMA module polls the i-th command word FIFO, it determines the address information of the j-th forward memory node and the length of the to-be-processed data packet a_ij based on the command word information, and transmits a_ij to the algorithm module; the algorithm module receives a_ij, performs operation processing to obtain the corresponding completed data packet A_ij, and writes a status word into the i-th status word FIFO; when the reverse DMA module polls the i-th status word FIFO, it determines the address information of the j-th reverse memory node and the length of A_ij based on the status word information, and writes A_ij into the j-th reverse memory node.
It can be understood that the host writes the to-be-processed data packet a_ij into the j-th forward memory node among the m forward memory nodes while writing the command word b_ij into the i-th command word FIFO; the command word b_ij comprises the length of the to-be-processed data packet a_ij and the address information of the j-th forward memory node in which it is stored. When the forward DMA module polls the i-th command word FIFO and the command word b_ij is updated to the front of that FIFO, the forward DMA module reads the to-be-processed data packet a_ij from the j-th forward memory node based on b_ij, attaches the information related to j to a_ij, and transmits them together to the i-th FPGA forward buffer to wait for the algorithm module to read;
the algorithm module receives the to-be-processed data packet a_ij and performs operation processing to obtain the corresponding completed data packet A_ij; the algorithm module transmits A_ij to the i-th FPGA reverse buffer while writing the status word B_ij into the i-th status word FIFO; the status word B_ij comprises the length of the completed data packet A_ij and the address information of the j-th reverse memory node in the i-th host reverse buffer to which it is to be returned. When the reverse DMA module polls the i-th status word FIFO and the status word B_ij is updated to the front of that FIFO, the reverse DMA module reads the completed data packet A_ij from the i-th FPGA reverse buffer based on B_ij, determines the address information of the j-th reverse memory node from the j-related information carried in B_ij, and writes A_ij into the j-th reverse memory node.
In another embodiment, based on the task requirement of a to-be-processed data packet, the i-th virtual channel is selected from the plurality of virtual channels as the target virtual channel, and the i-th host forward buffer and the i-th host reverse buffer are determined as the target buffers based on the target virtual channel. If the number of to-be-processed data packets is 1 and the packet length is greater than the storage capacity of one forward memory node, the to-be-processed data packet is split into a group of ordered to-be-processed sub data packets, and a second application request is generated based on the sub data packets; a group of ordered forward memory nodes and reverse memory nodes is allocated to the to-be-processed data packet according to the second application request of the application subject; the application subject writes the corresponding to-be-processed sub data packets into the group of ordered forward memory nodes respectively; the forward DMA module determines the packet lengths and the corresponding forward memory nodes based on the command word information in the command word FIFO, then reads the to-be-processed sub data packet of each forward memory node and transmits it to the algorithm module; the algorithm module receives the group of ordered to-be-processed sub data packets and performs parallel operation processing to obtain the corresponding completed sub data packets; the reverse DMA module determines the packet lengths and the corresponding reverse memory nodes based on the status word information in the status word FIFO, and then writes each completed sub data packet into the corresponding reverse memory node, so that a group of ordered completed data packets is reassembled.
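A minimal sketch of the host-side splitting described above, assuming a fixed node capacity of 4096 bytes and the hypothetical array layout from the earlier sketch; the j-th sub-packet goes into the j-th allocated forward node, and the reverse side reassembles the completed packet in the same index order.

#include <stdint.h>
#include <stddef.h>
#include <string.h>

/* Splits one oversized packet into ordered sub-packets, writing the j-th
 * chunk into the j-th forward node; returns how many nodes were filled.    */
static size_t split_into_nodes(const uint8_t *pkt, size_t len,
                               uint8_t nodes[][4096], uint32_t node_len[],
                               size_t max_nodes)
{
    size_t used = 0;
    for (size_t off = 0; off < len && used < max_nodes; used++) {
        size_t chunk = (len - off < 4096) ? (len - off) : 4096;
        memcpy(nodes[used], pkt + off, chunk);   /* j-th sub-packet -> j-th node */
        node_len[used] = (uint32_t)chunk;
        off += chunk;
    }
    return used;
}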
It should be noted that ciphertext data is often longer than the original plaintext data after encryption; if the storage capacity of a reverse memory node is greater than or equal to the length of the completed data packet or completed sub data packet, no disorder occurs. In practical applications, however, the storage capacity of a reverse memory node may be smaller than the completed data packet or completed sub data packet; the reverse memory node then cannot hold the whole completed packet, which can scramble the data after FPGA processing.
Therefore, the invention presets, based on historical experience, a proportional relationship between the length of the to-be-processed data packet (the length of the original plaintext data) and the length of the completed data packet or completed sub data packet after processing by the algorithm module. When an encryption service requirement of an application subject is received, the length of the encrypted completed data packet or completed sub data packet can be calculated from the length of the to-be-processed data packet and the preset proportional relationship, and the forward memory nodes and reverse memory nodes are then allocated based on that calculated length.
Specifically, when allocating the required forward memory nodes and reverse memory nodes for a to-be-processed data packet from the target buffers, the following is executed: the length of the to-be-processed data packet of the application subject is read, the length of the completed data packet is calculated from the length of the to-be-processed data packet and the preset proportional relationship, and it is judged whether the calculated length of the completed data packet is less than or equal to the storage capacity of one reverse memory node;
if the calculated length of the completed data packet is less than or equal to the storage capacity of one reverse memory node, a third application request is generated based on the calculated length, and the host allocates one forward memory node and one reverse memory node to the to-be-processed data packet according to the third application request;
if the calculated length of the completed data packet is greater than the storage capacity of one reverse memory node, a fourth application request is generated based on the calculated length, and the host allocates a group of ordered forward memory nodes and reverse memory nodes to the to-be-processed data packet according to the fourth application request.
It can be understood that the invention may allocate the required forward memory nodes and reverse memory nodes to the to-be-processed data packet from the target buffer based on the length of the to-be-processed data packet or the calculated length of the completed data packet; it may also allocate them based on both lengths together.
It can also be understood that if the number of forward memory nodes obtained from the relationship between the to-be-processed packet length and the storage capacity of one forward memory node differs from the number of reverse memory nodes obtained from the relationship between the calculated completed packet length and the storage capacity of one reverse memory node, the allocation policy with the larger number of forward and reverse memory nodes is selected as the target allocation policy. That is, the forward memory nodes and reverse memory nodes are allocated according to the maximum data length before and after processing, which prevents data disorder.
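A small C sketch of this sizing rule, under assumed values: a 4096-byte node capacity and a preset plaintext-to-ciphertext ratio of 8:9. Both numbers are illustrative stand-ins; the patent does not fix them.

#include <stddef.h>

#define NODE_CAPACITY 4096u
#define RATIO_NUM 9u          /* assumed preset ratio: completed length ~= 9/8 of plaintext */
#define RATIO_DEN 8u

static size_t nodes_for(size_t len)            /* ceil(len / NODE_CAPACITY) */
{
    return (len + NODE_CAPACITY - 1) / NODE_CAPACITY;
}

static size_t nodes_to_allocate(size_t plaintext_len)
{
    size_t completed_len = (plaintext_len * RATIO_NUM + RATIO_DEN - 1) / RATIO_DEN;
    size_t fwd = nodes_for(plaintext_len);      /* based on to-be-processed length        */
    size_t rev = nodes_for(completed_len);      /* based on estimated completed length    */
    return fwd > rev ? fwd : rev;               /* larger allocation wins, preventing disorder */
}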
It should be noted that, when more than one to-be-processed data packet corresponds to the target virtual channel within the same time period, determining the sending order by packet generation time alone can again lead to the problem that a previous task with a large data volume occupies the processing channel for a long time and congests other task data, especially urgent task data. Therefore, when the total number of forward memory nodes currently required is greater than or equal to the preset value m, the invention determines the sending order of the to-be-processed data packets based on their urgency.
Specifically, when allocating the required forward memory nodes and reverse memory nodes for the to-be-processed data packets from the target buffers, the following is executed: when the number of to-be-processed data packets corresponding to the target virtual channel (the i-th virtual channel) within the same time period is two or more, the host obtains the total number of forward memory nodes currently required from the ratio of each packet's length to the storage capacity of one forward memory node, and judges whether this total for the target virtual channel (the i-th virtual channel) is greater than or equal to the preset value m;
if the total number of forward memory nodes currently required for the target virtual channel is greater than or equal to the preset value m, the urgency of each to-be-processed data packet is calculated, the sending order of the packets is determined based on their urgency, and the packets are written into the allocated forward memory nodes in that sending order;
if the total number of forward memory nodes currently required is less than the preset value m, the to-be-processed data packets are written into the allocated forward memory nodes in the order of their generation times.
Further, when calculating the urgency of each to-be-processed data packet, the following is executed: the priority of the application subject that generated the packet is read, and the urgency of the packets is ordered from high to low according to the priority of the application subjects from high to low; if application subjects of the same priority generate to-be-processed data packets of different sizes, the urgency of those packets is ordered from low to high according to their sizes from small to large.
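As a rough illustration, this ordering rule could be expressed as a comparator over pending packets. The struct, the use of qsort and the exact tie-break direction are assumptions layered on the text above, not part of the patent.

#include <stdlib.h>
#include <stdint.h>

typedef struct {
    int      app_priority;   /* higher value = higher-priority application subject */
    uint32_t length;         /* to-be-processed packet length in bytes             */
} pending_pkt_t;

/* Most urgent packet sorts first: priority is the primary key; within one
 * priority the tie is broken by packet size, following the size-based ordering
 * stated above (interpreted here as larger = more urgent).                    */
static int by_urgency(const void *pa, const void *pb)
{
    const pending_pkt_t *a = pa, *b = pb;
    if (a->app_priority != b->app_priority)
        return (b->app_priority > a->app_priority) - (b->app_priority < a->app_priority);
    return (b->length > a->length) - (b->length < a->length);
}

/* Usage: qsort(pkts, n, sizeof(pending_pkt_t), by_urgency); the packets are
 * then written into the allocated forward memory nodes in the sorted order.  */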
It should be noted that, while one or more to-be-processed data packets are being processed, to-be-processed data packets required by the same task may arrive. In the prior art, if a data packet of a task is being processed, other pending task data, especially urgent task data, must wait until the previous task data has been processed before a new to-be-processed data packet can be handled; in particular, if the previous task has a large data volume, the processing channel is occupied for a long time and the other task data (to-be-processed data packets required by the same task) becomes congested.
To address this problem, the invention adopts the following processing strategy: when a current to-be-processed data packet is being processed and a new to-be-processed data packet required by the same task is received, the host obtains the number of forward memory nodes required by the new packet from the relationship between the length of the new packet and the storage capacity of one forward memory node;
it is judged whether the number of forward memory nodes required by the new to-be-processed data packet is less than or equal to the number of currently idle forward memory nodes in the target host forward buffer (the i-th host forward buffer);
if the number of forward memory nodes required by the new packet is less than or equal to the number of currently idle forward memory nodes in the target host forward buffer, a fifth application request is generated based on the new to-be-processed data, and the host allocates the corresponding numbers of forward memory nodes and reverse memory nodes to the new data according to the fifth application request of the application subject;
if the number of forward memory nodes required by the new packet is greater than the number of currently idle forward memory nodes in the target host forward buffer, the system keeps waiting and accumulating idle forward memory nodes until their number is greater than or equal to the number required by the new packet.
It can be understood that if the number of currently idle forward memory nodes is too small to satisfy the new to-be-processed data packet, a period of waiting is needed; if the number of currently idle forward memory nodes is sufficient, the corresponding numbers of forward and reverse memory nodes are allocated to the new packet, the new packet is sent to the allocated forward memory nodes, and its processing flow starts without waiting for the other packets to finish; the new to-be-processed data packet and the current packets still in processing are thus handled asynchronously.
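A minimal C sketch of this idle-node check, assuming a simple per-node in_use flag on the host side; the function and field names are hypothetical.

#include <stddef.h>

static size_t count_idle_forward_nodes(const int in_use[], size_t m)
{
    size_t idle = 0;
    for (size_t j = 0; j < m; j++)
        if (!in_use[j])
            idle++;
    return idle;
}

/* Fifth-application-request path: the new same-task packet may be allocated
 * and sent immediately, without waiting for the packets already in flight.  */
static int can_start_now(size_t nodes_needed, const int in_use[], size_t m)
{
    return nodes_needed <= count_idle_forward_nodes(in_use, m);
}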
In a specific embodiment, when an application subject has a service processing requirement, the system allocates the corresponding forward memory nodes and reverse memory nodes based on that requirement; when the service processing is completed, that is, after the reverse memory node has received the completed data packet, the application subject has read it, and a preset time has elapsed, the allocated forward and reverse memory nodes are released. This saves host memory resources and supports more service processing scenarios of the application subjects. It can be understood that the preset time can be set as required, for example 2 s or 5 s, and this embodiment is not limited in this respect.
In another specific embodiment, if the host has extracted the completed data packet corresponding to a to-be-processed data packet from a reverse memory node, the forward memory node corresponding to that reverse memory node is marked as an idle forward memory node.
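On the host side, the release policy in these two embodiments could look roughly like the following; the node_state_t fields, the timestamp handling and the preset_seconds parameter are assumptions for illustration.

#include <time.h>
#include <stdbool.h>

typedef struct {
    bool   completed_read;      /* application subject has fetched the completed packet */
    time_t read_time;           /* when it was fetched                                   */
    bool   in_use;              /* node pair currently allocated                         */
} node_state_t;

/* Frees the paired forward/reverse nodes once the completed packet has been
 * read and the preset time (e.g. 2 s or 5 s) has elapsed since the read.    */
static void maybe_release(node_state_t *pair, double preset_seconds)
{
    if (pair->in_use && pair->completed_read &&
        difftime(time(NULL), pair->read_time) >= preset_seconds) {
        pair->in_use = false;
        pair->completed_read = false;
    }
}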
Further, when the number of to-be-processed data packets corresponding to the target virtual channel (the i-th virtual channel) within the same time period is two or more, the target algorithm unit (the i-th algorithm unit) calls a corresponding number of algorithm subunits to process in parallel the multiple to-be-processed data packets required by the same task. The algorithm module comprises a plurality of algorithm units which correspond one-to-one with the virtual channels, the FPGA forward buffers and the FPGA reverse buffers; each algorithm unit carries a different task, and each algorithm unit comprises a plurality of identical algorithm subunits which process multiple to-be-processed data packets of the same task in parallel.
It can be understood that the nodes, the DMA module and the algorithm module are independent of one another, so that they operate asynchronously, and the multiple algorithms of the algorithm module perform parallel high-speed operation processing; compared with the traditional synchronous mode of operation, the invention improves efficiency while maximizing the utilization of every resource.
FIG. 1 is a block diagram of a multi-node multi-channel high-speed parallel processing system according to the present invention, and FIG. 2 is a schematic diagram of a multi-node parallel processing according to a certain channel of the present invention.
As shown in fig. 1 and fig. 2, a second aspect of the present invention provides a multi-node multi-channel high-speed parallel processing system, which comprises an FPGA chip and a host; the FPGA chip is communicatively connected to the host, and a plurality of virtual channels are constructed between them to transmit the data packets of different tasks;
the host comprises a plurality of host forward buffers and a plurality of host reverse buffers, and the host forward buffers, the host reverse buffers and the virtual channels correspond one-to-one; each host forward buffer comprises a plurality of forward memory nodes and each host reverse buffer comprises a plurality of reverse memory nodes; the plurality of forward memory nodes in the host forward buffer of the same virtual channel correspond one-to-one with the plurality of reverse memory nodes in the host reverse buffer; the forward memory nodes are each used to cache a to-be-processed data packet for the FPGA chip to read, and the reverse memory nodes are each used to receive a completed data packet processed by the FPGA chip;
the FPGA chip comprises a DMA module, a plurality of command word FIFOs, a plurality of status word FIFOs and an algorithm module; the DMA module comprises a forward DMA module and a reverse DMA module; the forward DMA module polls and reads the plurality of command word FIFOs, reads the to-be-processed data packet in the corresponding forward memory node through the corresponding virtual channel based on the command word information, and transmits it to the algorithm module; the reverse DMA module polls and reads the plurality of status word FIFOs, and writes the completed data packet processed by the algorithm module into the corresponding reverse memory node through the corresponding virtual channel based on the status word information; the algorithm module is used to receive the to-be-processed data packet and perform operation processing to obtain the corresponding completed data packet;
and, within the same time period, when the FPGA chip and the host process to-be-processed data packets required by the same task, the steps of the above multi-node multi-channel high-speed parallel processing method are executed.
It can be understood that the plurality of virtual channels may respectively correspond to a plurality of different task requirements, for example, the task corresponding to the 1 st virtual channel may be encryption/decryption, and the task corresponding to the 2 nd virtual channel may be signature/signature verification, but is not limited thereto.
It can be understood that the number of virtual channels and the number of forward memory nodes in each host forward buffer (or reverse memory nodes in each host reverse buffer) can be expanded according to actual requirements, so that the system is applicable to more application scenarios and has higher applicability.
Furthermore, the plurality of command word FIFOs correspond one-to-one with the plurality of status word FIFOs, the plurality of virtual channels, the plurality of host forward buffers and the plurality of host reverse buffers; the command word FIFOs are each used to indicate whether the corresponding host forward buffer contains a to-be-processed data packet that needs to be transmitted by the DMA module, together with the source node information and the length of that packet. It should be noted that the command word FIFOs and the status word FIFOs store data as first-in first-out (FIFO) queues.
In practical application, when a host has a task to be processed by the FPGA chip, a data packet to be processed is organized and completed in a corresponding host forward buffer area, and a command word FIFO is written in through process application to indicate that a DMA module has the data packet to be transmitted in the virtual channel. Specifically, if a data packet to be processed is formed in a forward buffer of a host, a 32-bit command word is written into the corresponding command word FIFO, and the command word includes length information and address information of the data packet. When a certain command word FIFO is not empty, the DMA module reads the command word in the command word FIFO to obtain the data packet length information and the address information corresponding to the command word, and then the data packet can be transmitted.
It will be appreciated that the number of command words in the command word FIFO may be multiple and that the multiple command words satisfy the requirement of "first in first out", i.e. the command word written first into the command word FIFO should be read by the forward DMA module earlier than the command word written back into the command word FIFO.
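For illustration, a 32-bit command word carrying both fields described above might be packed as follows; the 20/12 bit split and the helper names are assumptions, since the patent only states that length information and address information share one 32-bit word.

#include <stdint.h>

#define CMD_LEN_BITS  20u    /* assumed: low bits hold the packet length in bytes */
#define CMD_NODE_BITS 12u    /* assumed: high bits hold the forward node index j  */

static inline uint32_t cmd_word_pack(uint32_t length, uint32_t node_index)
{
    return ((node_index & ((1u << CMD_NODE_BITS) - 1u)) << CMD_LEN_BITS)
         |  (length     & ((1u << CMD_LEN_BITS)  - 1u));
}

static inline void cmd_word_unpack(uint32_t cmd, uint32_t *length, uint32_t *node_index)
{
    *length     = cmd & ((1u << CMD_LEN_BITS) - 1u);
    *node_index = cmd >> CMD_LEN_BITS;
}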
As shown in fig. 2, it is preset that the ith host forward buffer includes m forward memory nodes, and the ith host forward buffer, the ith virtual channel, and the ith command word FIFO are in one-to-one correspondence;
the host writes a data packet a to be processed into the jth forward memory node in the m forward memory nodesijWhile writing the command word b into the ith command word FIFOijThe command word bijComprises a data packet a to be processedijAnd stored in the jth forward memory nodeAddress information;
when the forward DMA module polls the ith command word FIFO and command word bijWhen updating to the foremost end of the ith command word FIFO, the forward DMA module is based on the command word bijReading a data packet a to be processed in the jth forward memory nodeij
In practical application, when the forward DMA module polls the i-th command word FIFO and finds it is not empty, it reads the command word at the front of that FIFO; if the FIFO is empty, the next virtual channel (for example, the (i+1)-th virtual channel) is polled.
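A pseudocode-style C sketch of this polling behaviour; the helper functions stand in for FPGA-internal logic and are purely hypothetical, and the loop runs indefinitely as a DMA engine would.

#include <stdint.h>
#include <stdbool.h>

bool     cmd_fifo_empty(int channel);              /* assumed FPGA-side helpers */
uint32_t cmd_fifo_pop(int channel);
void     cmd_word_unpack(uint32_t cmd, uint32_t *length, uint32_t *node_index);
void     transfer_to_algorithm(int channel, uint32_t node_index, uint32_t length);

void forward_dma_poll(int num_channels)
{
    for (int i = 0; ; i = (i + 1) % num_channels) {   /* round-robin over channels     */
        if (cmd_fifo_empty(i))
            continue;                                 /* nothing pending: next channel */
        uint32_t length, j;
        cmd_word_unpack(cmd_fifo_pop(i), &length, &j);
        transfer_to_algorithm(i, j, length);          /* read node j, push to algorithm */
    }
}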
Furthermore, the FPGA chip also comprises a plurality of FPGA forward buffer areas and a plurality of FPGA reverse buffer areas, and the plurality of FPGA forward buffer areas, the plurality of FPGA reverse buffer areas and the plurality of virtual channels are in one-to-one correspondence;
the plurality of FPGA forward buffer areas are respectively used for receiving data packets to be processed of different tasks transmitted by the forward DMA module and performing buffer processing so as to wait for the algorithm module to read;
and the plurality of FPGA reverse buffer areas are respectively used for receiving the completion data packets of different tasks processed by the algorithm module and performing cache processing to wait for the reverse DMA module to read.
Specifically, the forward DMA module reads the to-be-processed data packet a_ij from the j-th forward memory node based on the command word b_ij, attaches the information related to j to a_ij, and then transmits them together to the i-th FPGA forward buffer to wait for the algorithm module to read.
It can be understood that the virtual channel described in the present invention mainly represents the mapping link relationship between each host forward/reverse buffer and each FPGA forward/reverse buffer; there is no dedicated physical line. By establishing this mapping link relationship, the data packets in the 1st host forward buffer can only be received by the 1st FPGA forward buffer and not by the other FPGA forward buffers; similarly, the data packets in the 1st FPGA reverse buffer can only be received by the 1st host reverse buffer and not by the other host reverse buffers.
Furthermore, the plurality of status word FIFOs correspond one-to-one with the plurality of command word FIFOs, the plurality of virtual channels, the plurality of host forward buffers, the plurality of host reverse buffers, the plurality of FPGA forward buffers and the plurality of FPGA reverse buffers; the status word FIFOs are each used to indicate whether the corresponding FPGA reverse buffer contains a completed data packet that needs to be transmitted by the reverse DMA module, together with the destination node information and the length of that packet.
Specifically, the algorithm module receives the to-be-processed data packet a_ij and performs operation processing to obtain the corresponding completed data packet A_ij; the algorithm module transmits A_ij to the i-th FPGA reverse buffer while writing the status word B_ij into the i-th status word FIFO; the status word B_ij comprises the length of the completed data packet A_ij and the address information of the j-th reverse memory node in the i-th host reverse buffer to which it is to be returned;
when the reverse DMA module polls the i-th status word FIFO and the status word B_ij is updated to the front of that FIFO, the reverse DMA module reads the completed data packet A_ij from the i-th FPGA reverse buffer based on B_ij, determines the address information of the j-th reverse memory node from the j-related information carried in B_ij, and writes A_ij into the j-th reverse memory node.
Furthermore, the algorithm module comprises a plurality of algorithm units, the plurality of algorithm units respectively correspond to the plurality of virtual channels, the plurality of FPGA forward buffer areas and the plurality of FPGA reverse buffer areas one by one, each algorithm unit respectively bears different tasks, each algorithm unit comprises a plurality of same algorithm subunits, and the plurality of same algorithm subunits respectively carry out parallel operation processing on a plurality of data packets to be processed of the same task.
Specifically, the 1 st algorithm unit is used for processing an encryption and decryption task, and the 1 st algorithm unit includes a plurality of SM4 algorithm subunits, and the plurality of SM4 algorithm subunits can perform encryption and decryption processing on a plurality of data packets to be processed in parallel; the 2 nd algorithm unit is used for signing/signature checking tasks, the 2 nd algorithm unit comprises a plurality of SM2 algorithm subunits, and the SM2 algorithm subunits can carry out signing/signature checking on a plurality of data packets to be processed in parallel; but is not limited thereto.
Preferably, the FPGA chip and the host are connected for data communication through a PCIE interface; the applicable PCIE protocol versions include PCIE 1.0, PCIE 2.0 and PCIE 3.0, but are not limited thereto.
Furthermore, each application main body in the host applies for acquiring a predetermined number of forward memory nodes and reverse memory nodes in the corresponding host forward buffer area and host reverse buffer area respectively based on task requirements, and the forward memory nodes and the reverse memory nodes applied between different application main bodies of the same task do not conflict.
In one application scenario of the invention, two virtual channels are preset, namely a 1st virtual channel for encryption/decryption tasks and a 2nd virtual channel for signature/signature-verification tasks. When a 1st application subject and a 2nd application subject each have an encryption/decryption requirement, the 1st application subject applies to the host for the 1st forward memory node and the 1st reverse memory node of the 1st virtual channel, the 2nd application subject applies for the 2nd forward memory node and the 2nd reverse memory node of the 1st virtual channel, and the two application subjects write their to-be-processed data packets into the 1st and 2nd forward memory nodes in time-slice order. For example, in the first time slice the 1st application subject writes a to-be-processed data packet into the 1st forward memory node and writes the 1st command word into the command word FIFO; in the second time slice the 2nd application subject writes a to-be-processed data packet into the 2nd forward memory node and writes the 2nd command word into the command word FIFO; by the first-in first-out principle of the command word FIFO, the 1st command word is ahead of the 2nd command word.
The forward DMA module of the FPGA chip polls and reads the command word FIFOs of the 1st and 2nd virtual channels according to a fairness principle. When the 1st virtual channel is polled, the 1st command word is read first according to the first-in first-out principle; the forward DMA module reads out the to-be-processed data packet in the 1st forward memory node and transmits it to the algorithm module for encryption and decryption. After the processing is completed, the algorithm module writes the corresponding completed data packet into the 1st FPGA reverse buffer and writes the 1st status word into the status word FIFO of the 1st virtual channel; the 1st status word contains the destination node information and the length of the completed data packet, and the reverse DMA module writes the completed data packet into the 1st reverse memory node based on that status word.
The forward DMA module then continues to poll the 2nd virtual channel. If the command word FIFO of the 2nd virtual channel reads as empty, none of the forward memory nodes corresponding to the 2nd virtual channel holds data, so the forward DMA module returns to poll the 1st virtual channel and reads the to-be-processed data packet in the 2nd forward memory node; if the command word FIFO of the 2nd virtual channel is not empty, the forward DMA module reads out the to-be-processed data packet in the same way as for the 1st virtual channel and transmits it to the algorithm module for encryption and decryption.
It can be understood that, traditionally, multiple application subjects share one memory node, and the task of the next application subject can start only after the task of the previous one has been completed, so transmission and processing efficiency is low; for the application subjects and for each algorithm unit in the algorithm module, waiting times are long, resources easily sit idle and utilization is low. In the invention, the application subjects can each write their to-be-processed data packets into their own forward memory nodes and receive the completed data packets at the corresponding reverse memory nodes; the whole process is transparent to the application subjects, the nodes are independent and do not affect one another, the DMA module of the FPGA chip polls and reads or writes the forward/reverse memory nodes at high speed, and the multiple algorithms in the algorithm module process data in parallel, so task processing efficiency is effectively improved and resource utilization is maximized.
In another application scenario of the invention, if an application subject needs to process an ordered group of to-be-processed data packets (1, 2, 3, …, 10) of the same task, or the length of a to-be-processed data packet exceeds the storage capacity of one forward memory node, the application subject can apply for a group of ordered forward memory nodes (1, 2, 3, …, 10) and reverse memory nodes (1, 2, 3, …, 10) matching the number of packets. The application subject only needs to write the to-be-processed data packets (1, 2, 3, …, 10) into the forward memory nodes (1, 2, 3, …, 10) in order, and later receives at the reverse memory nodes (1, 2, 3, …, 10) a group of ordered completed data packets corresponding to the ordered group of to-be-processed data packets (1, 2, 3, …, 10). On this basis, data can be transmitted and processed at high speed while the completed data packets remain in order.
The present invention realizes parallel processing of multiple kinds of tasks through a plurality of virtual channels, and parallel processing of multiple data packets of the same task through the different nodes corresponding to each virtual channel, thereby improving the processing efficiency of the system. The host only needs to attend to the corresponding forward/reverse memory nodes; the DMA module is responsible for high-speed reading and writing of each forward/reverse memory node, and the multiple algorithms of the algorithm module perform high-speed operations in parallel. The nodes, the DMA module and the algorithm module are independent of one another and operate asynchronously; compared with the traditional synchronous mode of operation, this improves efficiency and maximizes the utilization of each resource.
The present invention realizes ordered arrangement of the completed data packets while achieving high-speed transmission and high-speed processing of data, solving the problem that a traditional algorithm module produces out-of-order results after operating in parallel on a group of ordered data packets to be processed.
Finally, it should be noted that the above embodiments are only intended to illustrate the technical solutions of the present invention and not to limit them; modifications to the specific embodiments of the present invention, or equivalent substitutions of some technical features, made by those skilled in the art without departing from the spirit of the technical solutions of the present invention, shall all be covered by the technical solutions claimed in the present invention.

Claims (10)

1. A multi-node multi-channel high-speed parallel processing method is characterized by comprising the following steps:
selecting one virtual channel from a plurality of virtual channels as a target virtual channel based on the task requirement of a data packet to be processed, and determining the corresponding host forward buffer and host reverse buffer as target buffers based on the target virtual channel; wherein the host is in communication connection with an FPGA chip through the plurality of virtual channels, a plurality of forward memory nodes in the host forward buffer of the same virtual channel correspond one-to-one to a plurality of reverse memory nodes in the host reverse buffer, the plurality of forward memory nodes are respectively used for buffering data packets to be processed for reading by the FPGA chip, and the plurality of reverse memory nodes are respectively used for receiving the completed data packets processed by the FPGA chip;
allocating the required forward memory node and reverse memory node for the data packet to be processed from the target buffer, the host writing the data packet to be processed into the allocated forward memory node among the m forward memory nodes and writing a command word into the corresponding command word FIFO; when the forward DMA module polls the command word FIFO, the forward DMA module determines the address information of the allocated forward memory node and the length of the data packet to be processed based on the command word information, and transmits the data packet to be processed to the algorithm module;
the algorithm module receives the data packet to be processed and performs operation processing to obtain a corresponding completed data packet, and writes a status word into the corresponding status word FIFO; when the reverse DMA module polls the status word FIFO, the reverse DMA module determines the address information of the allocated reverse memory node and the length of the completed data packet based on the status word information, and writes the completed data packet into the allocated reverse memory node.
2. The multi-node multi-channel high-speed parallel processing method according to claim 1, wherein when allocating the required forward memory node and reverse memory node for the data packet to be processed from the target buffer, the following steps are performed: reading the length of the data packet to be processed of an application subject, and judging whether the length of the data packet to be processed is less than or equal to the storage capacity of one forward memory node;
if the length of the data packet to be processed is less than or equal to the storage capacity of one forward memory node, generating a first application request based on the data packet to be processed; the host allocates a forward memory node and a reverse memory node for the data packet to be processed according to the first application request;
if the length of the data packet to be processed is larger than the storage capacity of one forward memory node, dividing the data packet to be processed into a group of ordered sub data packets to be processed, and generating a second application request based on the sub data packets to be processed; and the host allocates a group of ordered forward memory nodes and reverse memory nodes for the data packet to be processed according to the second application request.
3. The multi-node multi-channel high-speed parallel processing method according to claim 2, wherein the application subject writes the corresponding sub data packets to be processed into a group of ordered forward memory nodes respectively; the forward DMA module determines the packet length and the corresponding forward memory node based on the command word information in the command word FIFO, then reads the sub data packet to be processed from each forward memory node and transmits it to the algorithm module; the algorithm module receives the group of ordered sub data packets to be processed and performs parallel operation processing to obtain corresponding completed sub data packets; and after determining the packet length and the corresponding reverse memory node based on the status word information in the status word FIFO, the reverse DMA module writes each completed sub data packet of the determined length into the corresponding reverse memory node, so as to recombine a group of ordered completed data packets.
4. The multi-node multi-channel high-speed parallel processing method according to claim 1, wherein when allocating the required forward memory node and reverse memory node for the data packet to be processed from the target buffer, the following steps are performed: reading the length of the data packet to be processed of an application subject, calculating the length of the completed data packet based on the length of the data packet to be processed and a preset proportional relation, and judging whether the calculated length of the completed data packet is less than or equal to the storage capacity of one reverse memory node;
if the calculated length of the completed data packet is less than or equal to the storage capacity of one reverse memory node, generating a third application request based on the calculated length of the completed data packet; the host allocates a forward memory node and a reverse memory node for the data packet to be processed according to the third application request;
if the calculated length of the completed data packet is larger than the storage capacity of one reverse memory node, generating a fourth application request based on the calculated length of the completed data packet; and the host allocates a group of ordered forward memory nodes and reverse memory nodes for the data packet to be processed according to the fourth application request.
5. The multi-node multi-channel high-speed parallel processing method according to claim 1, wherein when allocating the required forward memory node and reverse memory node for the data packet to be processed from the target buffer, the following steps are performed:
when the number of the data packets to be processed corresponding to the target virtual channel is two or more in the same time period, the host obtains the total number of the currently required forward memory nodes based on the ratio of the length of each data packet to be processed to the storage capacity of one forward memory node, and judges whether the total number of the currently required forward memory nodes corresponding to the target virtual channel is greater than or equal to a preset value m;
if the total number of currently required forward memory nodes corresponding to the target virtual channel is greater than or equal to the preset value m, respectively calculating the urgency of each data packet to be processed, determining the sending sequence of the data packets to be processed based on their urgency, and writing the data packets to be processed into the allocated forward memory nodes respectively according to that sending sequence;
and if the total number of currently required forward memory nodes is less than the preset value m, writing the data packets to be processed into the allocated forward memory nodes respectively in the order in which they were generated.
6. The multi-node multi-channel high-speed parallel processing method according to claim 5, wherein when calculating the urgency of each data packet to be processed, the following is performed:
reading the priority of the application subject that generated the data packet to be processed, and ranking the urgency of the data packets to be processed from high to low according to the priority of their application subjects from high to low; if application subjects of the same priority generate data packets to be processed of different sizes, ranking the urgency of those data packets to be processed from low to high according to their sizes from small to large.
7. The multi-node multi-channel high-speed parallel processing method according to any one of claims 1 to 6, wherein when a current data packet to be processed is being processed and a new data packet to be processed required by the same task is received, the host obtains the number of forward memory nodes required by the new data packet to be processed based on the relationship between the length of the new data packet to be processed and the storage capacity of one forward memory node;
judging whether the number of forward memory nodes required by the new data packet to be processed is less than or equal to the number of currently idle forward memory nodes in the target host forward buffer;
if so, generating a fifth application request based on the new data packet to be processed, and the host allocating a corresponding number of forward memory nodes and reverse memory nodes for the new data packet to be processed according to the fifth application request of the corresponding application subject;
if not, continuing to wait while the number of currently idle forward memory nodes accumulates, until the number of currently idle forward memory nodes is greater than or equal to the number of forward memory nodes required by the new data packet to be processed.
8. A multi-node multi-channel high-speed parallel processing system, comprising an FPGA chip and a host, wherein the FPGA chip is in communication connection with the host and constructs a plurality of virtual channels for transmitting the data packets of different tasks; the host comprises a plurality of host forward buffers and a plurality of host reverse buffers, the host forward buffers, the host reverse buffers and the virtual channels being in one-to-one correspondence; the FPGA chip comprises a DMA module, a plurality of command word FIFOs, a plurality of status word FIFOs and an algorithm module, the DMA module comprising a forward DMA module and a reverse DMA module; the system is characterized in that:
each host forward buffer comprises a plurality of forward memory nodes and each host reverse buffer comprises a plurality of reverse memory nodes, the plurality of forward memory nodes in the host forward buffer of the same virtual channel corresponding one-to-one to the plurality of reverse memory nodes in the host reverse buffer, the forward memory nodes being respectively used for buffering data packets to be processed for reading by the FPGA chip, and the reverse memory nodes being respectively used for receiving the completed data packets processed by the FPGA chip;
and the FPGA chip and the host, when processing the data packets to be processed required by the same task in the same time period, execute the steps of the multi-node multi-channel high-speed parallel processing method according to any one of claims 1 to 7.
9. The multi-node multi-channel high-speed parallel processing system according to claim 8, wherein the algorithm module comprises a plurality of algorithm units corresponding one-to-one to the plurality of virtual channels, the plurality of FPGA forward buffers and the plurality of FPGA reverse buffers; each algorithm unit undertakes the processing of a different task and comprises a plurality of identical algorithm subunits, and the identical algorithm subunits perform parallel operation processing on a plurality of data packets to be processed of the same task.
10. The multi-node multi-channel high-speed parallel processing system according to claim 8, wherein each application subject in the host applies, based on its task requirements, for a predetermined number of forward memory nodes and reverse memory nodes in the corresponding host forward buffer and host reverse buffer respectively, and the forward memory nodes and reverse memory nodes applied for by different application subjects of the same task do not conflict with one another.
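As a non-normative illustration of the allocation logic described in claims 2, 4 and 7, the following C sketch shows how the host might derive the number of memory nodes a pending packet needs and decide whether it can be admitted immediately; NODE_SIZE, the expansion ratio and all function names are assumptions made for this sketch only, not part of the claimed method.

```c
#include <stdint.h>
#include <stdbool.h>

#define NODE_SIZE      4096   /* assumed capacity of one memory node              */
#define EXPANSION_NUM  9      /* assumed preset proportional relation:            */
#define EXPANSION_DEN  8      /*   completed length = pending length * 9 / 8      */

/* Ceiling division: how many nodes are needed to hold `bytes` of data. */
static uint32_t nodes_needed(uint64_t bytes)
{
    return (uint32_t)((bytes + NODE_SIZE - 1) / NODE_SIZE);
}

/* Decide whether a new pending packet of `len` bytes can be admitted now,
 * given `free_fwd_nodes` currently idle forward memory nodes; if not, the
 * caller keeps waiting until enough nodes have been released. */
static bool can_admit(uint64_t len, uint32_t free_fwd_nodes)
{
    uint64_t done_len = len * EXPANSION_NUM / EXPANSION_DEN; /* expected completed length */
    uint32_t fwd  = nodes_needed(len);                       /* nodes for the pending packet   */
    uint32_t rev  = nodes_needed(done_len);                  /* nodes for the completed packet */
    uint32_t need = fwd > rev ? fwd : rev;                   /* forward/reverse nodes are paired */
    return need <= free_fwd_nodes;
}
```

When can_admit returns false, the caller simply continues to wait while idle nodes accumulate, matching the wait-and-accumulate behavior of claim 7.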
CN202010844411.4A 2020-08-20 2020-08-20 Multi-node multi-channel high-speed parallel processing method and system Withdrawn CN112035898A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010844411.4A CN112035898A (en) 2020-08-20 2020-08-20 Multi-node multi-channel high-speed parallel processing method and system

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010844411.4A CN112035898A (en) 2020-08-20 2020-08-20 Multi-node multi-channel high-speed parallel processing method and system

Publications (1)

Publication Number Publication Date
CN112035898A true CN112035898A (en) 2020-12-04

Family

ID=73581035

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010844411.4A Withdrawn CN112035898A (en) 2020-08-20 2020-08-20 Multi-node multi-channel high-speed parallel processing method and system

Country Status (1)

Country Link
CN (1) CN112035898A (en)


Cited By (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113992609A (en) * 2021-09-23 2022-01-28 北京连山科技股份有限公司 Method and system for processing multilink service data disorder
CN113992609B (en) * 2021-09-23 2022-06-14 北京连山科技股份有限公司 Method and system for processing multilink service data disorder
WO2023082560A1 (en) * 2021-11-12 2023-05-19 苏州浪潮智能科技有限公司 Task processing method and apparatus, device, and medium
CN114553776A (en) * 2022-02-28 2022-05-27 深圳市风云实业有限公司 Signal out-of-order control and rate self-adaptive transmission device and transmission method thereof
CN114553776B (en) * 2022-02-28 2023-10-10 深圳市风云实业有限公司 Signal disorder control and rate self-adaptive transmission device and transmission method thereof


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
WW01 Invention patent application withdrawn after publication (application publication date: 20201204)