CN116414344A - Data processing method and device - Google Patents

Data processing method and device Download PDF

Info

Publication number
CN116414344A
CN116414344A
Authority
CN
China
Prior art keywords
data
buffer space
processed
ring buffer
queues
Prior art date
Legal status
Pending
Application number
CN202310460612.8A
Other languages
Chinese (zh)
Inventor
袁暾
郭旭晨
王梓谦
马亮
Current Assignee
Nanjing Tianlang Defense Technology Co., Ltd.
Original Assignee
Nanjing Tianlang Defense Technology Co., Ltd.
Priority date
Filing date
Publication date
Application filed by Nanjing Tianlang Defense Technology Co., Ltd.
Priority to CN202310460612.8A
Publication of CN116414344A
Legal status: Pending

Classifications

    • G: PHYSICS
      • G06: COMPUTING; CALCULATING OR COUNTING
        • G06F: ELECTRIC DIGITAL DATA PROCESSING
          • G06F 5/00: Methods or arrangements for data conversion without changing the order or content of the data handled
            • G06F 5/06: ... for changing the speed of data flow, i.e. speed regularising or timing, e.g. delay lines, FIFO buffers; over- or underrun control therefor
              • G06F 5/065: Partitioned buffers, e.g. allowing multiple independent queues, bidirectional FIFOs
              • G06F 5/08: ... having a sequence of storage locations, the intermediate ones not being accessible for either enqueue or dequeue operations, e.g. using a shift register
                • G06F 5/085: ... in which the data is recirculated
    • Y: GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
      • Y02: TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
        • Y02D: CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
          • Y02D 10/00: Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Data Exchanges In Wide-Area Networks (AREA)

Abstract

The invention discloses a data processing method and device applied to a NUMA architecture. The data processing method comprises the following steps: a receiving node distributes the data to be processed to a plurality of intermediate nodes in a polling manner; each intermediate node takes the data to be processed out of its own first ring buffer queue and processes it with a data processing link; each intermediate node sends the processed data to its own second ring buffer queue; and a sending node takes the data out of all the second ring buffer queues, sorts it, and sends it to the next-stage system. Because a complete data processing link is deployed on every node, each node can complete the data processing flow independently; in addition, the ring buffer queues support polled reception and transmission of data, so the data is processed efficiently and in real time, and processor resources are utilized to the greatest extent.

Description

Data processing method and device
Technical Field
The present invention relates to the field of computer technologies, and in particular, to a data processing method and apparatus.
Background
Referring to fig. 1, which shows a conventional data processing architecture: the functional modules of a signal processing link are grouped into M-1 combinations according to each module's throughput, and the combinations are deployed on M-1 nodes in processing order. During signal processing, each node receives the processing result of the previous node, processes it locally, and passes its own result to the next node.
This processing mode resembles a pipeline: every node repeatedly performs its own step and hands the data to the next node when it finishes. Although it can use processor resources fully, the data processing speed is low, because data can enter the next node only after the previous node has finished with it; when one node's processing time is too long, the nodes behind it sit idle for long periods, so the overall processing efficiency is low.
Disclosure of Invention
To solve these problems, the invention provides a data processing method and apparatus with high data processing efficiency and strong scalability.
In order to achieve the above object, one aspect of the present invention provides a data processing method, applied to a NUMA architecture, comprising:
the receiving node distributes the data to be processed to a plurality of intermediate nodes in a polling manner; each intermediate node is provided with a first ring buffer queue, a second ring buffer queue, and a complete data processing link;
each intermediate node takes the data to be processed out of its own first ring buffer queue, and processes the data with the data processing link;
each intermediate node sends the processed data to its own second ring buffer queue;
and the sending node takes the data out of all the second ring buffer queues, sorts it, and sends it to the next-stage system.
As a preferred technical solution, the first ring buffer queue and the second ring buffer queue each comprise a plurality of reusable memory spaces.
As a preferred technical solution, the step in which the receiving node distributes the data to be processed to a plurality of intermediate nodes in a polling manner further comprises:
the receiving node fills the data to be processed into the memory spaces of the first ring buffer queues in turn, in a polling manner;
after the data to be processed is placed in the current memory space, the white pointer (the write pointer) of the first ring buffer queue is advanced to the position of the next memory space.
As a preferred technical solution, the step in which each intermediate node takes the data to be processed out of its own first ring buffer queue further comprises:
the intermediate node takes the data to be processed out of the memory spaces of its first ring buffer queue in sequence;
after the data to be processed stored in the current memory space has been taken out, the red pointer (the read pointer) of the first ring buffer queue is advanced to the position of the next memory space.
As a preferred technical solution, before each intermediate node takes the data to be processed out of its first ring buffer queue, the method further comprises: the intermediate node detects in real time whether available data exists in the memory spaces of its first ring buffer queue.
As a preferred technical solution, the step in which the intermediate node detects in real time whether available data exists in the memory spaces of its first ring buffer queue further comprises:
the intermediate node detects the positions of the red pointer and the white pointer of its first ring buffer queue in real time;
when the red pointer and the white pointer coincide, there is no available data in the first ring buffer queue;
when the number of memory spaces separating the red pointer and the white pointer equals the total number of memory spaces minus one, the first ring buffer queue is full of available data.
As a preferred technical solution, the step in which each intermediate node sends the processed data to its own second ring buffer queue further comprises:
the intermediate node puts the processed data into the memory spaces of its second ring buffer queue in sequence;
after the processed data is placed in the current memory space, the white pointer of the second ring buffer queue is advanced to the position of the next memory space.
As a preferred technical solution, the step in which the sending node takes the data out of all the second ring buffer queues, sorts it, and sends it to the next-stage system comprises: the sending node selects the data with the smallest sequence number each time and sends it to the next-stage system.
In another aspect, the present invention also provides a data processing apparatus, comprising:
a receiving unit, configured to distribute the data to be processed to a plurality of intermediate nodes in a polling manner; each intermediate node is provided with a first ring buffer queue, a second ring buffer queue, and a complete data processing link;
a processing unit, configured to take the data to be processed out of the respective first ring buffer queues and process it with the data processing link;
a first sending unit, configured to send the processed data to the respective second ring buffer queues;
and a second sending unit, configured to take the data out of all the second ring buffer queues, sort it, and send it to the next-stage system.
Compared with the prior art, the invention has the following effects: a complete data processing link is deployed on every node, so each node can complete the data processing flow independently; the ring buffer queues support polled reception and transmission of data, giving efficient, real-time processing; a node never waits for the previous node to finish, as it would in the traditional data processing architecture, so processor resources are used to the greatest extent; and the method is highly scalable, since the number of intermediate nodes can be configured according to the data volume to raise processing efficiency.
Drawings
FIG. 1 is a prior art architecture diagram for data processing;
FIG. 2 is a schematic diagram of the FT2000 architecture according to an embodiment of the present invention;
FIG. 3 is a flow chart of a data processing method according to an embodiment of the present invention;
FIG. 4 is a diagram of a ring buffer queue according to an embodiment of the present invention;
FIG. 5 is a schematic diagram of the polled distribution of data to be processed in an embodiment of the invention;
FIG. 6 is a schematic diagram of sorting data before sending in an embodiment of the invention;
fig. 7 is a block diagram of a data processing apparatus according to an embodiment of the present invention.
Description of the embodiments
The technical solutions in the embodiments of the present invention are described below clearly and completely with reference to the accompanying drawings. The described embodiments are evidently only some, not all, of the embodiments of the invention. All other embodiments obtained by those of ordinary skill in the art from the embodiments of the invention without creative effort fall within the scope of the invention.
The present embodiment provides a data processing method based on a NUMA architecture, illustrated here on the FT2000 platform, a typical NUMA architecture shown in fig. 2. The platform integrates 64 processor cores divided among 8 Panels; each Panel has 2 Clusters, and each Cluster contains 4 processor cores that share a second-level cache, so a Cluster is logically equivalent to an SMP system. Each Panel also contains two local Directory Control Units (DCUs), a network-on-chip router node (Cell), and a tightly coupled memory controller (MCU). The Panels are connected through on-chip network interfaces, and coherence-maintenance messages, data messages, test and debug messages, interrupt messages, and the like are routed uniformly over the same set of network interfaces.
On this platform, the whole storage space is divided into 8 large spaces according to the different affinities of the Panels and Clusters to storage, each large space corresponding to the nearest Panel; each large space is further divided into 2 subspaces, one for each Cluster. Task deployment and scheduling can fully exploit these characteristics: the structure supports mapping several threads with high mutual affinity onto the same Panel, which reduces global communication between the threads, and combining this with the on-chip data movement and migration mechanism further improves global communication latency and energy efficiency.
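On a Linux-based deployment, mapping high-affinity threads onto the same Panel can be done with CPU affinity. The following is a minimal sketch, not part of the patent, under two assumptions: the platform runs Linux with GNU pthreads, and the 8 cores of a Panel (2 Clusters x 4 cores) are numbered contiguously from 0.

    #define _GNU_SOURCE
    #include <pthread.h>
    #include <sched.h>

    /* Pin the calling thread to the cores of one Panel. The contiguous
     * core numbering assumed here is platform-specific and must be
     * verified against the actual topology. */
    static int bind_to_panel(int panel)
    {
        cpu_set_t set;
        CPU_ZERO(&set);
        for (int c = panel * 8; c < (panel + 1) * 8; c++)
            CPU_SET(c, &set);
        return pthread_setaffinity_np(pthread_self(), sizeof(set), &set);
    }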
To meet a multi-core processor's requirements on memory access bandwidth and latency, the chip implements a hierarchical on-chip storage architecture and a hierarchical network structure, supporting both high-speed on-chip caches and large-capacity storage. Tasks with frequent communication and large data-synchronization volumes use a low-latency, high-bandwidth interconnect and local private caches; tasks with infrequent communication use a highly scalable interconnect with longer latency and distributed shared caches; applications that need cross-Panel access are placed in the nearest possible Panel. Directory control and storage are distributed, with a directory controller and storage in every Panel, maximizing parallel maintenance and access of the coherence protocol. A flexible address mapping mode supports different memory capacities per system configuration: in affinity mode, a Directory Control Unit (DCU) in a Panel accesses only the local memory controller (MCU), memory access channels between Panels do not interfere with each other, and the system has minimal latency and maximal bandwidth; in partial mode, a DCU can access any MCU according to the configuration, supporting systems with different numbers of DDR channels.
In this embodiment, based on these data-affinity characteristics of the multi-core processor architecture, one Cluster of the multi-core platform is treated as an SMP system: a Cluster shares a cache, and the Panel where the Cluster resides has a large DDR space mounted for its use. In other words, a Cluster has its own cache and memory, so one Cluster can be treated as a small homogeneous multi-core CPU.
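Treating each Cluster as a small CPU with its own memory suggests allocating a Cluster's buffers from the DDR local to its Panel. The patent does not name an allocation API; the sketch below assumes a Linux deployment with libnuma, and the node numbering is illustrative.

    #include <numa.h>       /* link with -lnuma */
    #include <stdlib.h>

    /* Allocate a frame buffer on the NUMA node local to a Cluster, so the
     * Cluster works out of its nearest DDR; fall back to malloc when the
     * kernel offers no NUMA support. Release with numa_free(p, size). */
    static void *alloc_local_buffer(size_t size, int node)
    {
        if (numa_available() < 0)
            return malloc(size);
        return numa_alloc_onnode(size, node);
    }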
As shown in fig. 3, on the FT2000 platform described above, the data processing method provided by this embodiment comprises the following steps.
S10: the receiving node distributes the data to be processed to a plurality of intermediate nodes in a polling manner.
It should be noted that a node in this embodiment is a Cluster of the FT2000 platform. For example, Cluster-0 is deployed as the receiving node, i.e. Cluster-0 receives the data and distributes it to the intermediate nodes in a polling manner; a first ring buffer queue, a second ring buffer queue, and a complete data processing link are deployed on each intermediate node. The first ring buffer queue stores input data, and the second ring buffer queue stores output data.
As shown in fig. 4, a ring buffer queue contains N reusable memory spaces in total. The size of each memory space depends on the size of one frame of input data, i.e. each memory space must be larger than the space occupied by one frame of input data, and the memory spaces are sized according to the input data before data processing begins.
The specific polling distribution is shown in fig. 5: the receiving node fills the data to be processed into the first ring buffer queue of each intermediate node in turn, and advances the white pointer (the write pointer) of that queue to the next position. Here a position means the position of a memory space: after a frame of data to be processed is placed in the current memory space, the white pointer points to the start of the next memory space, ready for the next frame of data. A sketch of this loop appears below.
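The distribution loop can be sketched as follows. This is an illustration, not the patent's code: struct ring and ring_put are defined in the ring-buffer sketch two paragraphs below, and acquire_next_frame is a hypothetical data source.

    #include <stdbool.h>

    struct ring;                                /* defined in the sketch below */
    bool ring_put(struct ring *r, void *data);  /* defined in the sketch below */
    void *acquire_next_frame(void);             /* hypothetical data source */

    /* Round-robin (polling) distribution by the receiving node. */
    void receive_loop(struct ring *in_queue[], int n_nodes)
    {
        int next = 0;
        for (;;) {
            void *frame = acquire_next_frame();
            while (!ring_put(in_queue[next], frame))
                ;                            /* queue full: wait for the node */
            next = (next + 1) % n_nodes;     /* advance to the next node */
        }
    }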
In addition, while data is being put into the ring buffer queue, the intermediate node can detect in real time whether available data exists in its first ring buffer queue; as soon as available data arrives, the intermediate node immediately takes it out for processing and advances the red pointer (the read pointer) of the first ring buffer queue to the next position.
When the red pointer and the white pointer coincide, the first ring buffer queue is empty and waits for the receiving node to fill in data. When the red pointer and the white pointer are separated by N-1 memory spaces, the first ring buffer queue is full of available data; at that point the receiving node cannot put more data into the queue and must wait for the intermediate node to read data out of it and free a memory space.
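These pointer rules describe a classic single-producer, single-consumer ring buffer in which one memory space is always left unused to distinguish a full queue from an empty one. The following is a minimal sketch with illustrative names; a production version on a multi-core NUMA machine would need C11 atomics or memory barriers rather than volatile alone.

    #include <stdbool.h>
    #include <stddef.h>

    #define N 8   /* number of memory spaces; sized to the application */

    struct ring {
        void *slot[N];            /* N reusable memory spaces (fig. 4) */
        volatile size_t white;    /* white pointer: next space to write */
        volatile size_t red;      /* red pointer: next space to read */
    };

    /* Empty: the red and white pointers coincide. */
    static bool ring_empty(const struct ring *r)
    {
        return r->red == r->white;
    }

    /* Full: the pointers are separated by N-1 memory spaces, so advancing
     * the white pointer would make it overlap the red pointer. */
    static bool ring_full(const struct ring *r)
    {
        return (r->white + 1) % N == r->red;
    }

    bool ring_put(struct ring *r, void *data)
    {
        if (ring_full(r))
            return false;                  /* producer must wait */
        r->slot[r->white] = data;
        r->white = (r->white + 1) % N;     /* point white to the next space */
        return true;
    }

    void *ring_take(struct ring *r)
    {
        if (ring_empty(r))
            return NULL;                   /* no available data yet */
        void *data = r->slot[r->red];
        r->red = (r->red + 1) % N;         /* point red to the next space */
        return data;
    }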
S20: each intermediate node takes the data to be processed out of its own first ring buffer queue, and processes the data with the data processing link.
Specifically, once data has been placed in the memory of a first ring buffer queue, the nodes behind it can process it. In this embodiment several parallel intermediate nodes process the data, with a complete data processing link deployed on each of them; the functional modules of the data processing link include, for example, FFT, MTD, CFAR, Capon, and EKF. Because the whole data processing link is deployed on one node, one frame of data can be processed completely on one node. The FT2000 architecture contains 16 Clusters, of which 1 Cluster receives data and 1 Cluster sends the processing results, so the intermediate nodes can be expanded to at most 14, allowing 14 frames of data to be processed simultaneously with high processing efficiency. For example, if the signal processing link deployed on an intermediate node needs 20 ms to process 1 frame of data, the real-time requirement of the signal processing system is 5 ms, and the system requires 20% resource redundancy, then 20 ms / 5 ms = 4 nodes are needed for throughput alone, and 4 x 1.2 = 4.8, so expanding the intermediate nodes to 5 meets the system requirement.
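The sizing rule used in this example can be written out explicitly. The sketch below covers only the arithmetic, and reading the 20% redundancy figure as a multiplicative margin on the node count is an interpretation, not something the text states.

    #include <math.h>

    /* Intermediate nodes needed: per-frame processing time divided by the
     * required output interval, with a resource-redundancy margin on top.
     * nodes_needed(20.0, 5.0, 0.20) = ceil(4.0 * 1.2) = 5. */
    static int nodes_needed(double ms_per_frame, double deadline_ms, double margin)
    {
        return (int)ceil((ms_per_frame / deadline_ms) * (1.0 + margin));
    }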
S30: each intermediate node sends the processed data to its own second ring buffer queue.
It should be noted that after an intermediate node finishes processing data, it puts the results into its own second ring buffer queue in sequence; the rules are the same as those for putting input data into the first ring buffer queue and are not repeated here.
S40: the sending node takes the data out of all the second ring buffer queues, sorts it, and sends it to the next-stage system.
Specifically, as shown in fig. 6, the sending node takes the processing results of the signal processing links out of the second ring buffer queue of each intermediate processing node, sorts them by data sequence number, and each time selects the data with the smallest sequence number to send to the next-stage system.
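This in-order merge can be sketched as follows. ring_peek (inspect the head of a queue without removing it) is an assumed helper on top of the ring buffer sketched earlier, and the result layout is illustrative; a complete implementation would also have to wait on a lagging queue whose next frame carries the globally smallest sequence number, which this sketch skips.

    #include <limits.h>
    #include <stddef.h>

    struct result {
        unsigned seq;        /* frame sequence number assigned on receipt */
        void *payload;       /* processed data */
    };

    struct ring;                               /* sketched earlier */
    struct result *ring_peek(struct ring *r);  /* assumed: head, not removed */
    void *ring_take(struct ring *r);           /* from the earlier sketch */

    /* Take from the queue whose head has the smallest sequence number,
     * so frames leave the system in their original order. */
    static struct result *take_next_in_order(struct ring *out[], int n_nodes)
    {
        int best = -1;
        unsigned best_seq = UINT_MAX;
        for (int i = 0; i < n_nodes; i++) {
            struct result *head = ring_peek(out[i]);
            if (head != NULL && head->seq < best_seq) {
                best_seq = head->seq;
                best = i;
            }
        }
        return best < 0 ? NULL : (struct result *)ring_take(out[best]);
    }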
In addition, to verify the technical effect of the invention, this embodiment provides the following test data:
test platform: the domestic multi-core processing platform FT2000 platform.
Test data volume: 528K complex floating-point samples.
Polling signal processing architecture used in this experiment:
1 data receiving node, 1 data sending node, and 6 intermediate processing nodes, occupying 32 processor cores in total.
One intermediate processing node takes about 23 ms to process one frame of data, and the 6 intermediate processing nodes process 6 frames simultaneously, so in this scenario the effective signal processing interval reaches 23 ms / 6 ≈ 3.9 ms.
Traditional signal processing architecture:
1 data receiving node, 1 data sending node, and a signal processing link split into 6 functional-module combinations according to the data throughput of each module, deployed on 6 intermediate processing nodes as shown in fig. 1, occupying 32 processor cores in total.
After the receiving node receives the data, it is passed in sequence through the nodes on which the functional modules are deployed; the total time for the intermediate nodes to process one frame of data is about 21 ms.
Table 1 Comparison of processing effect under different architectures

    Data volume            This embodiment    Conventional architecture
    528K complex samples   3.9 ms             21 ms
The experimental results show that, while occupying the same computing and storage resources, the data processing method provided by this embodiment improves signal processing efficiency and raises the utilization of system resources.
Referring to fig. 7, the present embodiment further provides a data processing apparatus, including:
a receiving unit 100, configured to distribute the data to be processed to a plurality of intermediate nodes in a polling manner, each intermediate node being provided with a first ring buffer queue, a second ring buffer queue, and a complete data processing link; the specific receiving manner and principle are described in detail in step S10 of the data processing method of the above embodiment and are not repeated here;
a processing unit 200, configured to take the data to be processed out of the respective first ring buffer queues and process it with the data processing link; the specific processing manner and principle are described in detail in step S20 and are not repeated here;
a first sending unit 300, configured to send the processed data to the respective second ring buffer queues; the specific sending manner and principle are described in detail in step S30 and are not repeated here;
and a second sending unit 400, configured to take the data out of all the second ring buffer queues, sort it, and send it to the next-stage system; the specific sending manner and principle are described in detail in step S40 and are not repeated here.
Compared with the traditional signal processing architecture, in which a complete signal processing link is split into M-1 functional-module sets deployed on M-1 processing nodes, the polling signal processing architecture deploys a complete signal processing link on each intermediate node, producing several identical processing branches that process multiple frames of data simultaneously.
A signal processing program running on one Cluster avoids the cross-Cluster compute-resource scheduling, data I/O, and similar operations that arise when the program runs across several Clusters, so a program running on a single Cluster should run more efficiently.
In addition, an embodiment of the present invention further provides a computer-readable storage medium storing a program which, when executed, performs some or all of the steps of any of the data processing methods described in the above method embodiments.
In addition, the functional units in the embodiments of the present invention may be integrated in one processing unit, each unit may exist alone physically, or two or more units may be integrated in one unit. The integrated units may be implemented in hardware or as software functional units.
If the integrated units are implemented as software functional units and sold or used as stand-alone products, they may be stored in a computer-readable memory. Based on this understanding, the technical solution of the present invention, in essence or in the part that contributes over the prior art, in whole or in part, may be embodied as a software product stored in a memory and comprising several instructions for causing a computer device (which may be a personal computer, a server, a network device, or the like) to perform all or part of the steps of the methods of the embodiments of the present invention. The aforementioned memory includes any medium capable of storing program code, such as a USB flash drive, a read-only memory (ROM), a random access memory (RAM), a removable hard disk, a magnetic disk, or an optical disk.
Those of ordinary skill in the art will appreciate that all or part of the steps of the methods of the above embodiments may be implemented by a program instructing the associated hardware; the program may be stored in a computer-readable memory, which may include a flash disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk, or an optical disk.
Exemplary flows of data processing according to embodiments of the present invention have been described above with reference to the accompanying drawings. It should be noted that the many details in the above description merely illustrate the invention and do not limit it. In other embodiments of the invention, the method may have more, fewer, or different steps, and the order, inclusion, and functional relationships of the steps may differ from those described and illustrated.

Claims (10)

1. A data processing method applied to a NUMA architecture, comprising:
the receiving node distributes the data to be processed to a plurality of intermediate nodes in a polling manner; each intermediate node is provided with a first ring buffer queue, a second ring buffer queue, and a complete data processing link;
each intermediate node takes the data to be processed out of its own first ring buffer queue, and processes the data with the data processing link;
each intermediate node sends the processed data to its own second ring buffer queue;
and the sending node takes the data out of all the second ring buffer queues, sorts it, and sends it to the next-stage system.
2. The data processing method according to claim 1, wherein the first ring buffer queue and the second ring buffer queue each comprise a plurality of reusable memory spaces.
3. The data processing method according to claim 1, wherein the step in which the receiving node distributes the data to be processed to a plurality of intermediate nodes in a polling manner further comprises:
the receiving node fills the data to be processed into the memory spaces of the first ring buffer queues in turn, in a polling manner;
after the data to be processed is placed in the current memory space, the white pointer (the write pointer) of the first ring buffer queue is advanced to the position of the next memory space.
4. The data processing method according to claim 3, wherein the step in which each intermediate node takes the data to be processed out of its own first ring buffer queue further comprises:
the intermediate node takes the data to be processed out of the memory spaces of its first ring buffer queue in sequence;
after the data to be processed stored in the current memory space has been taken out, the red pointer (the read pointer) of the first ring buffer queue is advanced to the position of the next memory space.
5. The data processing method according to claim 4, wherein before each intermediate node takes the data to be processed out of its first ring buffer queue, the method further comprises:
the intermediate node detects in real time whether available data exists in the memory spaces of its first ring buffer queue.
6. The data processing method according to claim 5, wherein the step in which the intermediate node detects in real time whether available data exists in the memory spaces of its first ring buffer queue further comprises:
the intermediate node detects the positions of the red pointer and the white pointer of its first ring buffer queue in real time;
when the red pointer and the white pointer coincide, there is no available data in the first ring buffer queue;
when the number of memory spaces separating the red pointer and the white pointer equals the total number of memory spaces minus one, the first ring buffer queue is full of available data.
7. The data processing method according to claim 2, wherein the step in which each intermediate node sends the processed data to its own second ring buffer queue further comprises:
the intermediate node puts the processed data into the memory spaces of its second ring buffer queue in sequence;
after the processed data is placed in the current memory space, the white pointer of the second ring buffer queue is advanced to the position of the next memory space.
8. The data processing method according to claim 7, wherein the step in which the sending node takes the data out of all the second ring buffer queues, sorts it, and sends it to the next-stage system comprises:
the sending node selects the data with the smallest sequence number each time and sends it to the next-stage system.
9. A data processing apparatus, comprising:
a receiving unit, configured to distribute the data to be processed to a plurality of intermediate nodes in a polling manner, each intermediate node being provided with a first ring buffer queue, a second ring buffer queue, and a complete data processing link;
a processing unit, configured to take the data to be processed out of the respective first ring buffer queues and process it with the data processing link;
a first sending unit, configured to send the processed data to the respective second ring buffer queues;
and a second sending unit, configured to take the data out of all the second ring buffer queues, sort it, and send it to the next-stage system.
10. A computer-readable storage medium storing a computer program, wherein the computer program, when executed by a processor, implements the steps of the data processing method according to any one of claims 1 to 8.
CN202310460612.8A 2023-04-26 2023-04-26 Data processing method and device Pending CN116414344A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310460612.8A CN116414344A (en) 2023-04-26 2023-04-26 Data processing method and device

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202310460612.8A CN116414344A (en) 2023-04-26 2023-04-26 Data processing method and device

Publications (1)

Publication Number Publication Date
CN116414344A 2023-07-11

Family

ID=87057805

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310460612.8A Pending CN116414344A (en) 2023-04-26 2023-04-26 Data processing method and device

Country Status (1)

Country Link
CN (1) CN116414344A (en)

Similar Documents

Publication Publication Date Title
CN107689948B (en) Efficient data access management device applied to neural network hardware acceleration system
WO2020078470A1 (en) Network-on-chip data processing method and device
CN106503791A (en) System and method for the deployment of effective neutral net
CN103064807A (en) Multi-channel direct memory access controller
US10496329B2 (en) Methods and apparatus for a unified baseband architecture
CN110347626B (en) Server system
CN101989942A (en) Arbitration control method, communication method, arbitrator and communication system
EP3777059B1 (en) Queue in a network switch
CN114564434B (en) General multi-core brain processor, acceleration card and computer equipment
US20220129179A1 (en) Data processing apparatus, data processing system including the same, and operating method thereof
US20230403232A1 (en) Data Transmission System and Method, and Related Device
CN116414344A (en) Data processing method and device
EP3822776A1 (en) System and method for transaction broadcast in a network-on-chip
CN106547707A (en) Cluster memory storage concurrent access Local Priority switched circuit in AP
CN113159302B (en) Routing structure for reconfigurable neural network processor
US9930117B2 (en) Matrix vector multiply techniques
US9146848B2 (en) Link training for a serdes link
CN114490465B (en) Data transmission method and device for direct memory access
CN111045965B (en) Hardware implementation method for multi-channel conflict-free splitting, computer equipment and readable storage medium for operating method
CN116185641B (en) Fusion architecture system, nonvolatile storage system and storage resource acquisition method
US11934337B2 (en) Chip and multi-chip system as well as electronic device and data transmission method
US20230259486A1 (en) Neural processing unit synchronization systems and methods
CN115952117A (en) Fc-equipment-based multi-partition receiving direction dma communication system and method
CN114785857A (en) Dynamic adaptation method, system and storage medium for bandwidth resources of Internet of things
CN114912592A (en) Neural network array acceleration method and device based on convolution pooling fusion

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination