CN118151837A - FPGA-based scatter/gather processing method, electronic device and storage medium - Google Patents


Info

Publication number
CN118151837A
CN118151837A (application CN202211566456.5A)
Authority
CN
China
Prior art keywords
requests; scatter; gather; request; memory cell
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202211566456.5A
Other languages
Chinese (zh)
Inventor
Name withheld at the inventor's request
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
DeepRoute AI Ltd
Original Assignee
DeepRoute AI Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by DeepRoute AI Ltd filed Critical DeepRoute AI Ltd
Priority to CN202211566456.5A priority Critical patent/CN118151837A/en
Publication of CN118151837A publication Critical patent/CN118151837A/en
Pending legal-status Critical Current

Landscapes

  • Memory System Of A Hierarchy Structure (AREA)

Abstract

The application discloses an FPGA-based scatter/gather processing method, an electronic device and a storage medium. The method comprises: obtaining M scatter/gather requests, where M and N are positive integers; distributing the M scatter/gather requests to N memory cell blocks, where each scatter/gather request corresponds to one of the N memory cell blocks, such that the i-th memory cell block corresponds to x_i of the M scatter/gather requests, i is a positive integer less than or equal to N, x_i is between 0 and M, and the numbers of scatter/gather requests corresponding to the N memory cell blocks sum to M; and sending the x_i scatter/gather requests among the M scatter/gather requests to the corresponding memory cell blocks by round-robin arbitration, so that at most N scatter/gather requests are sent in parallel per round. The application can reduce latency and thereby, when applied to a deep convolutional neural network, improve the inference efficiency of the deep convolutional neural network model.

Description

FPGA-based scatter/gather processing method, electronic device and storage medium
Technical Field
The application belongs to the technical field of FPGAs (field programmable gate arrays), and particularly relates to an FPGA-based scatter/gather processing method, an electronic device and a storage medium.
Background
With the development of artificial intelligence, acceleration platforms with high performance, high computing power and low power consumption become increasingly important, and FPGAs, as flexible reprogrammable hardware, play an ever larger role; deep convolutional neural networks are one example. In the inference process of a deep convolutional neural network, scatter and gather are very common operations in computations such as matrix multiplication and reduction.
Gather and scatter are forms of vector addressing, a kind of indirect addressing: gather is an indirect read and scatter is an indirect write. Because the addresses involved are irregular, gather and scatter accesses are usually issued linearly in sequence, one read or write request at a time to the memory cell, which makes reading and writing inefficient and occupies memory bandwidth. Moreover, the scatter process often needs to perform an operation on the data, which further increases its latency.
Disclosure of Invention
The application provides an FPGA-based scatter/gather processing method, an electronic device and a storage medium to solve the above problems.
To achieve the above purpose, the application adopts the following technical scheme:
The application provides an FPGA-based scatter/gather processing method, where the FPGA comprises a memory cell divided into N memory cell blocks, and the method comprises the following steps: obtaining M scatter/gather requests, where M and N are positive integers; distributing the M scatter/gather requests to the N memory cell blocks, where each scatter/gather request corresponds to one of the N memory cell blocks, such that the i-th memory cell block corresponds to x_i of the M scatter/gather requests, where i is a positive integer from 1 to N, x_i is between 0 and M, and the numbers of scatter/gather requests corresponding to the N memory cell blocks sum to M; and sending the x_i scatter/gather requests among the M scatter/gather requests to the corresponding memory cell blocks by round-robin arbitration, so that at most N scatter/gather requests can be sent in parallel per round.
In some embodiments, each of the M scatter/gather requests includes an address identifying the memory cell block to be scattered to or gathered from. Distributing the M scatter/gather requests to the N memory cell blocks comprises: for each of the N memory cell blocks, comparing the address of each of the M scatter/gather requests with the sequence number of that memory cell block, thereby obtaining the x_i scatter/gather requests corresponding to that block, and ordering the x_i scatter/gather requests by sequence number; a scatter/gather request whose address equals the sequence number of a memory cell block corresponds to that memory cell block.
In some embodiments, the x_i scatter/gather requests among the M scatter/gather requests are sent, by round-robin arbitration, to a FIFO buffer located before the interface of the corresponding memory cell block; the x_i scatter/gather requests buffered in the FIFO buffer are then sent into the corresponding memory cell block.
In some embodiments, when the M requests obtained are scatter requests, each of the M scatter requests includes an address identifying the memory cell block to be scattered to and the data to be written to that memory cell block, and the method further comprises: sending the x_i scatter requests among the M scatter requests, by round-robin arbitration, to the FIFO buffer before the interface of the corresponding memory cell block; before the current scatter request enters the FIFO buffer, holding it for one clock cycle to wait for the next scatter request; in response to the current scatter request and the next scatter request having the same address, operation-merging the data of the current scatter request with the data of the next scatter request; and sending the data obtained by operation-merging the consecutively arranged same-address scatter requests among the x_i scatter requests into the corresponding memory cell block.
In some embodiments, when the M requests obtained are gather requests, each of the M gather requests includes an address identifying the memory cell block from which data is to be read, and the method further comprises: sending the x_i gather requests among the M gather requests, by round-robin arbitration, to the FIFO buffer before the interface of the corresponding memory cell block; and, before a gather request in the FIFO buffer is sent into the corresponding memory cell block, buffering its specific information, such as sequence number and address, in at least one further FIFO buffer, so that when the memory cell block sends back the read data, the buffered information indicates where that data is to be stored.
In some embodiments, when the M requests are scatter requests, the method further comprises: obtaining O scatter requests, where O is a positive integer and O is greater than or equal to M; sorting the O scatter requests by address, where each of the O scatter requests includes an address identifying the memory cell block to be scattered to and the data to be written to that memory cell block, so that any two scatter requests with the same address become adjacent; and performing a same-address binary (dichotomy) merge on the sorted O scatter requests, thereby obtaining the M scatter requests: among the sorted O scatter requests, whenever two compared adjacent scatter requests have the same address, their data are merged.
The application also provides an FPGA comprising a memory cell and a memory controller coupled to each other, the memory controller being configured to execute the above FPGA-based scatter/gather processing method.
The application also provides an electronic device connected to the FPGA, comprising a memory and a processor coupled to each other, the processor being configured to execute program instructions stored in the memory to implement the above FPGA-based scatter/gather processing method.
The present application also provides a non-transitory computer-readable storage medium storing program instructions which, when executed by a processor, implement the above FPGA-based scatter/gather processing method.
Compared with the prior art, the application has the following beneficial effects: the memory cell is divided into N memory cell blocks, and the M scatter/gather requests are distributed to the N memory cell blocks, each scatter/gather request corresponding to one of the N memory cell blocks, such that the i-th memory cell block corresponds to x_i of the M scatter/gather requests, where i is a positive integer from 1 to N, x_i is between 0 and M, and the numbers of scatter/gather requests corresponding to the N memory cell blocks sum to M; the x_i scatter/gather requests among the M scatter/gather requests are sent to the corresponding memory cell blocks by round-robin arbitration, so that at most N scatter/gather requests are sent in parallel per round. This reduces time consumption and efficiently distributes each request to its memory cell block, so that, when applied to a deep convolutional neural network, the latency of common operations such as matrix multiplication and reduction can be reduced and the inference efficiency of the deep convolutional neural network model improved.
Drawings
FIG. 1 is a flowchart of an FPGA-based scatter/gather processing method according to an embodiment of the present application;
FIG. 2 is a schematic diagram of an application scenario in which scatter/gather requests are allocated to memory cell blocks according to an embodiment of the present application;
FIG. 3 is a flowchart of an FPGA-based scatter processing method according to an embodiment of the present application;
FIG. 4 is a schematic diagram of an application scenario of scatter request merging according to an embodiment of the present application;
FIG. 5 is a schematic structural diagram of an FPGA according to an embodiment of the present application;
FIG. 6 is a schematic structural diagram of an electronic device according to an embodiment of the present application;
FIG. 7 is a schematic structural diagram of a storage medium according to an embodiment of the present application.
Detailed Description
The embodiments of the present application are described below clearly and completely with reference to the accompanying drawings; evidently, the described embodiments are only some, not all, of the embodiments of the application. All other embodiments obtained by those skilled in the art based on these embodiments without inventive effort fall within the scope of the application.
In addition, where the embodiments of the present application contain descriptions such as "first" and "second", these are for descriptive purposes only and are not to be construed as indicating or implying relative importance or implicitly indicating the number of technical features; a feature qualified by "first" or "second" may explicitly or implicitly include at least one such feature. The technical solutions of the embodiments may be combined with one another, provided the combination can be realized by those skilled in the art; when technical solutions are contradictory or cannot be realized, their combination should be considered absent and outside the scope of protection claimed in the present application.
Referring to fig. 1, fig. 1 is a flowchart of an FPGA-based scatter/gather processing method according to an embodiment of the present application. The method can be applied to an electronic device comprising an FPGA, where the FPGA comprises a memory cell divided into N memory cell blocks, N being a positive integer; the number of memory cells of the FPGA is not limited. The method comprises the following steps:
s110: m scatter/gather requests are acquired.
Here M is a positive integer, and the "/" in "scatter/gather request" means "or", that is, a scatter request or a gather request. Likewise, descriptions such as "scatter/gather request" throughout the application refer to a scatter or a gather request.
S120: m scatter/gather requests are allocated to N blocks of memory cells.
Each scatter/gather request corresponds to one of the N memory cell blocks, such that the i-th memory cell block corresponds to x_i of the M scatter/gather requests, where i is a positive integer from 1 to N, x_i is between 0 and M, and the numbers of scatter/gather requests corresponding to the N memory cell blocks sum to M.
Here i is a positive integer from 1 to N, and x_i takes a value between 0 and M that may differ from memory cell block to memory cell block; that is, the number of allocated scatter/gather requests may differ per block, but the numbers of allocated scatter/gather requests over all memory cell blocks sum to M. For example, with M = 16 and N = 32: for i = 1, the 1st memory cell block corresponds to x_1 = 4 scatter/gather requests; for i = 2, the 2nd corresponds to x_2 = 4; for i = 3, the 3rd corresponds to x_3 = 5; for i = 4, the 4th corresponds to x_4 = 3; and the remaining memory cell blocks correspond to 0 scatter/gather requests.
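The allocation step above can be sketched in a few lines of Python. This is a behavioral model only (the hardware would perform the N address comparisons in parallel); the request addresses below are reconstructed from the Fig. 2 walkthrough later in the text and should be read as illustrative values.

```python
def distribute_requests(addresses, n_blocks):
    """Assign each scatter/gather request (identified by its sequence
    number) to the memory cell block named by the request's address,
    preserving sequence-number order within each block."""
    blocks = [[] for _ in range(n_blocks)]
    for seq, addr in enumerate(addresses):
        blocks[addr].append(seq)  # request `seq` targets block `addr`
    return blocks

# Addresses reconstructed from the Fig. 2 example (M = 16, N = 32):
fig2_addresses = [0, 1, 0, 2, 1, 3, 2, 3, 1, 0, 2, 0, 2, 2, 3, 1]
blocks = distribute_requests(fig2_addresses, 32)
# blocks[0] == [0, 2, 9, 11]   -> x_1 = 4
# blocks[1] == [1, 4, 8, 15]   -> x_2 = 4
# blocks[2] == [3, 6, 10, 12, 13] -> x_3 = 5
# blocks[3] == [5, 7, 14]      -> x_4 = 3
```

The per-block counts reproduce x_1 = 4, x_2 = 4, x_3 = 5, x_4 = 3 from the example, with the remaining 28 blocks receiving 0 requests.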
S130: and transmitting x i disperse/aggregate requests in the M disperse/aggregate requests to corresponding storage unit blocks in a polling arbitration mode so as to realize that N disperse/aggregate requests are transmitted in parallel at most in each round.
Round-robin arbitration means that, of the x_i scatter/gather requests allocated to each memory cell block, one scatter/gather request is sent to that block per round; thus, over the whole memory cell, at most N scatter/gather requests can be sent per round, i.e., up to N scatter/gather requests are sent in parallel. Completing all M scatter/gather requests therefore takes k rounds, i.e., k clock cycles, where k is the largest of x_1, x_2, x_3, …, x_N. Since each x_i is at most M, k is at most M, and whenever the requests are spread over more than one block k is smaller than M; compared with the M clock cycles needed for sequential reading and writing, time consumption is reduced and each request is efficiently distributed to its memory cell block.
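The round-robin schedule and its cycle count k = max(x_i) can be modeled directly. A sketch, using the per-block request lists from the Fig. 2 walkthrough (illustrative values, not the only possible allocation):

```python
def round_robin_schedule(per_block_requests):
    """One request per block per round: round r sends the r-th pending
    request of every block that still has one, so each round transmits
    up to N requests in parallel and the batch finishes in
    k = max_i x_i rounds (clock cycles)."""
    k = max((len(reqs) for reqs in per_block_requests), default=0)
    rounds = [[reqs[r] for reqs in per_block_requests if r < len(reqs)]
              for r in range(k)]
    return k, rounds

# Per-block request lists from the Fig. 2 walkthrough:
per_block = [[0, 2, 9, 11], [1, 4, 8, 15], [3, 6, 10, 12, 13], [5, 7, 14]]
k, rounds = round_robin_schedule(per_block)
# k == 5; rounds[0] == [0, 1, 3, 5]; rounds[1] == [2, 4, 6, 7]; rounds[4] == [13]
```

With 16 requests spread over 4 active blocks, k = 5 rounds suffice instead of the 16 cycles sequential issue would take.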
In this embodiment, the memory cell is divided into N memory cell blocks, and the M scatter/gather requests are distributed to the N memory cell blocks, each scatter/gather request corresponding to one of the N memory cell blocks, such that the i-th memory cell block corresponds to x_i of the M scatter/gather requests, where i is a positive integer from 1 to N, x_i is between 0 and M, and the numbers of scatter/gather requests corresponding to the N memory cell blocks sum to M. The x_i scatter/gather requests among the M scatter/gather requests are sent to the corresponding memory cell blocks by round-robin arbitration, so that at most N scatter/gather requests are sent in parallel per round. This reduces time consumption and efficiently distributes each request to its memory cell block, so that, when applied to a deep convolutional neural network, the latency of common operations such as matrix multiplication and reduction can be reduced and the inference efficiency of the deep convolutional neural network model improved.
As described above, the memory cell is divided into N memory cell blocks. In some embodiments, each of the M scatter/gather requests includes an address identifying the memory cell block it is to be scattered to or gathered from.
The M scatter/gather requests are also assigned consecutive sequence numbers, e.g., 0, 1, 2, …, (M-1), and the N memory cell blocks are assigned consecutive sequence numbers, e.g., 0, 1, 2, …, (N-1).
The following description takes fig. 2 as an example; fig. 2 is a schematic diagram of an application scenario in which scatter/gather requests are allocated to memory cell blocks according to an embodiment of the present application. In the example of fig. 2, there are 16 scatter/gather requests, i.e., M is 16, and the memory cell is divided into 32 memory cell blocks, i.e., N is 32.
The 32 memory cell blocks are sequentially assigned consecutive sequence numbers, so that each of the 32 memory cell blocks has a sequence number, namely 0, 1, 2 … as shown in the figure; the 16 scatter/gather requests are likewise sequentially assigned consecutive sequence numbers 0, 1, 2 …. Each scatter/gather request includes an address indicating the memory cell block to be scattered to or gathered from: for example, the address of scatter/gather request 0 is 0, the address of scatter/gather request 1 is 1, and, following the figure, the address of scatter/gather request 15 is 1.
It should be noted that the application does not limit the other contents of each scatter/gather request; for example, each scatter/gather request may further include the data to be written into the corresponding memory cell block and a valid bit. Although such contents are shown in fig. 2, they are not described in detail here, and any other contents of each scatter/gather request also fall within the scope of the application.
At this time, allocating the M scatter/gather requests to the N memory cell blocks includes: for each of the N memory cell blocks, comparing the address of each of the M scatter/gather requests with the sequence number of that memory cell block, thereby obtaining the x_i scatter/gather requests corresponding to that block, and ordering the x_i scatter/gather requests by sequence number; a scatter/gather request whose address equals the sequence number of a memory cell block corresponds to that memory cell block.
Continuing with the example of FIG. 2, there are 16 scatter/gather requests, i.e., M is 16, and the memory cell is divided into 32 memory cell blocks, i.e., N is 32. For memory cell block 0, comparing the address of each of the 16 scatter/gather requests with the block's sequence number 0 shows that the addresses of scatter/gather requests 0, 2, 9 and 11 are all 0, the same as the sequence number of memory cell block 0; that is, memory cell block 0 corresponds to scatter/gather requests 0, 2, 9 and 11, i.e., to 4 scatter/gather requests, which are ordered by sequence number. For memory cell block 1, the same comparison shows that the addresses of scatter/gather requests 1, 4, 8 and 15 are all 1, the same as the sequence number of memory cell block 1; that is, memory cell block 1 corresponds to scatter/gather requests 1, 4, 8 and 15, i.e., to 4 scatter/gather requests, which are likewise ordered by sequence number. And so on.
For memory cell block 3, comparing the address of each of the 16 scatter/gather requests with the block's sequence number 3 shows that the addresses of scatter/gather requests 5, 7 and 14 are all 3, the same as the sequence number of memory cell block 3; that is, memory cell block 3 corresponds to scatter/gather requests 5, 7 and 14, i.e., to 3 scatter/gather requests, which are ordered by sequence number.
In some embodiments, sending the x_i scatter/gather requests among the M scatter/gather requests to the corresponding memory cell block by round-robin arbitration includes: sending the x_i scatter/gather requests, by round-robin arbitration, to a FIFO buffer located before the interface of the corresponding memory cell block; and sending the x_i scatter/gather requests buffered in the FIFO buffer into the corresponding memory cell block.
A FIFO buffer is a data buffer operating in FIFO (First In, First Out) fashion: the data written into the buffer first is read out of it first.
Continuing with fig. 2, the 16 scatter/gather requests are sent to the corresponding memory cell blocks by round-robin arbitration as follows. In the first clock cycle, i.e., the first round, scatter/gather requests 0, 1, 3 and 5 are sent to the FIFO buffers before the interfaces of memory cell blocks 0, 1, 2 and 3, and the FIFO buffers forward scatter/gather requests 0, 1, 3 and 5 into memory cell blocks 0, 1, 2 and 3. In the second clock cycle, i.e., the second round, scatter/gather requests 2, 4, 6 and 7 are sent to the FIFO buffers before the interfaces of memory cell blocks 0, 1, 2 and 3, and the FIFO buffers forward scatter/gather requests 2, 4, 6 and 7 into memory cell blocks 0, 1, 2 and 3. And so on, until in the 5th clock cycle, i.e., the 5th round, scatter/gather request 13 is sent to the FIFO buffer before the interface of memory cell block 2 and forwarded into memory cell block 2. At that point all 16 scatter/gather requests have been sent into the corresponding memory cell blocks.
In some embodiments, when the requests obtained are scatter requests, each of the M scatter requests includes an address identifying the memory cell block to be scattered to and the data to be written to that memory cell block. For example, read as scatter requests, each entry in the example of fig. 2 includes such an address and data: scatter request 0 includes address 0, indicating memory cell block 0, and data 1 to be written into memory cell block 0.
At this time, sending the x_i scatter requests among the M scatter requests to the FIFO buffer located before the interface of the corresponding memory cell block by round-robin arbitration includes: before each current scatter request enters the FIFO buffer, holding it for one clock cycle to wait for the next scatter request; and, in response to the current scatter request and the next scatter request having the same address, operation-merging the data of the current scatter request with the data of the next scatter request.
The operation merge may be an addition: the data of the current scatter request and the data of the next scatter request are added to merge them.
For example, suppose the first scatter request has address 15 and data 3, and the second scatter request has address 15 and data 5. When the first scatter request arrives it is held for one clock cycle to wait for the next scatter request, i.e., the second one. Since the two addresses are both 15, the data of the first request (3) and of the second request (5) are operation-merged, for example added to give 8, and the result is held for another clock cycle. Suppose the third scatter request also has address 15, with data 4; its data is then operation-merged with 8, for example added to give 12, and so on until all consecutively arranged scatter requests with address 15 have been merged. If the fourth scatter request has address 16, the three consecutively arranged scatter requests with address 15 have at that point completed their operation merge.
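The one-cycle-lookahead combine described above can be sketched as a single pass over the request stream. This is a software model under stated assumptions: the hardware holds each request one clock cycle to peek at the next, which the sketch abstracts into a simple run-length combine with addition as the merge operation.

```python
def merge_consecutive_scatters(requests):
    """Operation-merge (here: add) runs of consecutively arranged
    scatter requests that share an address, emitting one write per run.
    `requests` is a list of (address, data) pairs in arrival order."""
    merged = []
    for addr, data in requests:
        if merged and merged[-1][0] == addr:
            merged[-1] = (addr, merged[-1][1] + data)  # same address: combine
        else:
            merged.append((addr, data))                # new run starts
    return merged

# The worked example: three requests to address 15 with data 3, 5, 4
# collapse into one write of 12; address 16 then starts a new run.
out = merge_consecutive_scatters([(15, 3), (15, 5), (15, 4), (16, 7)])
# out == [(15, 12), (16, 7)]
```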
At this time, sending the x_i scatter requests buffered in the FIFO buffer into the corresponding memory cell block includes: sending the data obtained by operation-merging the consecutively arranged same-address scatter requests among the x_i scatter requests into the corresponding memory cell block.
Continuing with the three scatter requests with address 15 above: when the scatter requests with address 15 are sent to the corresponding memory cell block by round-robin arbitration, the value 12 obtained by operation-merging the three requests is what is sent into the memory cell block.
In other embodiments, when the requests obtained are gather requests, each of the M gather requests includes an address identifying the memory cell block from which data is to be gathered. For example, reading the example shown in fig. 2 as gather requests, gather request 0 includes address 0, indicating that data is to be gathered from memory cell block 0.
At this time, when the x_i gather requests among the M gather requests are sent by round-robin arbitration to the FIFO buffer before the interface of the corresponding memory cell block, the interface is further preceded by another FIFO buffer that caches each gather request as it is sent, indicating which of the N memory cell blocks is valid, and by yet another FIFO buffer that caches the sequence number of the request, indicating, when the memory cell block sends back the read data, where that data is to be stored.
That is, there are three FIFO buffers before the interface of each memory cell block: one for the gather requests to be sent, one caching each gather request as it is sent, and one caching the sequence numbers of the requests so that, when the memory cell block sends back the read data, the destination of that data is known.
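The purpose of the sequence-number FIFO can be illustrated with a small model of the read-back path. This is a sketch, not the patent's implementation: the function name, the block contents and the request offsets are all hypothetical, and the model assumes the memory cell block returns read data in issue order, which is what makes a FIFO of tags sufficient for routing.

```python
from collections import deque

def gather_readback(block_data, issued_gathers):
    """Model of the read-back path: the sequence number of each issued
    gather is pushed into a tag FIFO, and when the memory cell block
    returns data (in issue order), the tag popped from the FIFO tells
    which result slot the returned word belongs to.
    `issued_gathers` is a list of (sequence_number, offset) pairs."""
    tag_fifo = deque(seq for seq, _ in issued_gathers)   # who asked, in order
    results = [None] * len(issued_gathers)
    for word in (block_data[off] for _, off in issued_gathers):  # read-backs
        results[tag_fifo.popleft()] = word               # route by stored tag
    return results

# Hypothetical block contents and three gathers issued out of sequence order:
out = gather_readback([10, 20, 30], [(2, 0), (0, 2), (1, 1)])
# out == [30, 20, 10]: each word lands in the slot of the request that asked.
```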
In some embodiments, as shown in fig. 3, which is a partial flowchart of an FPGA-based scatter processing method according to an embodiment of the present application, when the requests obtained in the above embodiments are scatter requests, i.e., when M scatter requests are obtained, the method further comprises:
S310: o dispersion requests are obtained, wherein O is a positive integer, and O is greater than or equal to M.
S320: and sorting the addresses of the O dispersion requests, wherein each dispersion request in the sorted O dispersion requests has a sequence number and comprises data used for representing the address to be dispersed to the storage unit block and the data to be written into the corresponding storage unit block, and when the addresses of any two dispersion requests in the sorted O dispersion requests are the same, the sequence numbers of the O dispersion requests are adjacent.
Sorting the O scatter requests by address means ordering them according to the address of the memory cell block each scatter request targets. Because scatter requests with the same address become adjacent after sorting, such requests can be merged, for example by operation-merging the data they contain.
S330: and carrying out the same address dichotomy merging operation on the O distributed requests after sequencing, thereby obtaining M distributed requests.
And carrying out the same address dichotomy merging operation on the O distributed requests after sequencing to obtain M distributed requests.
It should be noted that steps S110 to S130 in fig. 3 are detailed in the description of the above embodiments and are not repeated here; steps S310 to S330 precede steps S110 to S130.
Further, in some embodiments, performing the same-address binary merge on the sorted O scatter requests includes: among the sorted O scatter requests, whenever the two compared adjacent scatter requests have the same address, merging their data.
Fig. 4 is a schematic diagram of an application scenario of scatter request merging according to an embodiment of the present application. In fig. 4 it is assumed that there are 16 scatter requests, i.e., O is 16, and the 16 scatter requests are sorted according to the address they contain. Each of the 16 sorted scatter requests is assigned a sequence number, namely 0, 1, 2 … as shown in fig. 4, and each scatter request includes an address identifying the memory cell block to be scattered to and the data to be written to that block. For example, among the 16 sorted scatter requests, scatter request 0 contains address 0 and data 1, scatter request 1 contains address 0 and data 2, and, following the figure, scatter request 15 contains address 11 and data 2.
It should be noted that the present application does not limit the other contents of each scatter request, such as its valid bit; although related contents are shown in fig. 4, they are not described in detail here, and scatter requests carrying other contents also fall within the scope of the present application.
During the same-address dichotomy merging of the sorted O scatter requests, as shown in fig. 4, the valid bit of each scatter request is initially set to 1, indicating that every scatter request is valid. In the first round of merging, the addresses of scatter request 0 and scatter request 1 are compared and found to be the same, so the data contained in scatter request 1 is merged into the data contained in scatter request 0, and the valid bit of the merged-away request (i.e., scatter request 1) is set to 0, indicating that scatter request 1 is invalid (shown in black in the figure). Likewise, the addresses of scatter request 2 and scatter request 3 are the same, so the data contained in scatter request 3 is merged into the data contained in scatter request 2, and the valid bit of scatter request 3 is set to 0 (black in the figure). By analogy, the addresses of scatter request 8 and scatter request 9 are compared and found to differ, so their data are not merged; and the addresses of scatter request 15 and scatter request 14 are the same, so the data contained in scatter request 15 is merged into the data contained in scatter request 14, and the valid bit of scatter request 15 is set to 0 (black in the figure).
Since scatter requests with the same address may remain, a second round of merging is performed. The addresses of scatter request 2 and scatter request 0 are compared and found to be the same, so the data accumulated in scatter request 2 is merged into the data accumulated in scatter request 0, and the valid bit of scatter request 2 is set to 0, indicating that scatter request 2 is invalid (black in the figure). By analogy, the addresses of scatter request 14 and scatter request 12 are the same, so the data contained in scatter request 14 is merged into the data contained in scatter request 12, and the valid bit of scatter request 14 is set to 0 (black in the figure).
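The two rounds walked through above form a reduction tree: round r compares, within each group of 2^r requests, the group head with the request one half-group away, merging the latter into the former when the addresses match and clearing its valid bit. The sketch below models that tree under stated assumptions — each request is a mutable `[address, data, valid]` triple, and addition is assumed as the data-combining operation; neither detail is mandated by the embodiment.

```python
def dichotomy_merge(reqs):
    """Same-address dichotomy merge over address-sorted scatter requests.

    reqs: list of [address, data, valid] triples (assumed layout),
    length a power of two, sorted by address, all valid bits set to 1.
    Returns the surviving (address, data) pairs.
    """
    n = len(reqs)
    stride = 1
    while stride < n:
        for base in range(0, n, 2 * stride):
            a, b = reqs[base], reqs[base + stride]
            # Merge b into a only when both are still valid and the
            # addresses match; the merged-away request goes invalid
            # ("black" in fig. 4).
            if a[2] and b[2] and a[0] == b[0]:
                a[1] += b[1]
                b[2] = 0
        stride *= 2
    return [(r[0], r[1]) for r in reqs if r[2]]
```

With O requests the tree finishes in log2(O) rounds, and every round's comparisons are independent of each other, which is what makes the operation attractive for parallel FPGA logic.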
Referring to fig. 5, fig. 5 is a schematic framework diagram of an FPGA 500 according to an embodiment of the application. The FPGA 500 includes a memory unit 510 and a memory controller 520 coupled to each other, and the memory controller 520 is configured to perform any of the FPGA-based scatter/gather processing methods described above.
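The overall flow the memory controller 520 carries out — assigning the M requests to the N memory cell blocks by the address each request carries, then issuing them round by round under polling (round-robin) arbitration so that at most N requests, one per block, go out in parallel each round — can be sketched in software as follows. The queue layout, the `addr % n_blocks` block-selection rule, and the function names are assumptions for illustration, not the controller's actual implementation.

```python
from collections import deque

def dispatch(requests, n_blocks):
    """Assign (address, data) requests to per-block queues, then drain
    them round by round: each round takes at most one request from each
    non-empty queue, i.e. up to n_blocks requests in parallel."""
    # Step 1: per-block assignment by address (modulo rule assumed).
    queues = [deque() for _ in range(n_blocks)]
    for addr, data in requests:
        queues[addr % n_blocks].append((addr, data))
    # Step 2: polling (round-robin) arbitration.
    rounds = []
    while any(queues):
        # one parallel transfer round: at most one request per block
        rounds.append([q.popleft() for q in queues if q])
    return rounds
```

Requests bound for different blocks leave in the same round, while requests bound for the same block serialize — which is why merging same-address requests beforehand (steps S310 to S330) shortens the per-block queues and reduces the total number of rounds, i.e. the latency.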
Referring to fig. 6, fig. 6 is a schematic framework diagram of an electronic device 600 according to an embodiment of the present application. The electronic device 600 is connected to the FPGA 500 and includes a memory 610 and a processor 620 coupled to each other, and the processor 620 is configured to execute program instructions stored in the memory 610 to implement any of the FPGA-based scatter/gather processing methods described above. In a specific implementation scenario, the electronic device 600 may include, but is not limited to, mobile devices such as a notebook computer and a tablet computer; no limitation is imposed here.
In particular, the processor 620 is configured to control itself and the memory 610 to implement any of the FPGA-based scatter/gather processing methods described above. The processor 620 may also be referred to as a CPU (Central Processing Unit). The processor 620 may be an integrated circuit chip with signal processing capability. The processor 620 may also be a general-purpose processor, a digital signal processor (Digital Signal Processor, DSP), an application-specific integrated circuit (Application Specific Integrated Circuit, ASIC), a field-programmable gate array (Field-Programmable Gate Array, FPGA) or other programmable logic device, a discrete gate or transistor logic device, or a discrete hardware component. A general-purpose processor may be a microprocessor, or the processor may be any conventional processor or the like. In addition, the processor 620 may be implemented jointly by integrated circuit chips.
Referring to fig. 7, fig. 7 is a schematic framework diagram of a non-volatile computer-readable storage medium 700 according to an embodiment of the present application. The non-volatile computer-readable storage medium 700 stores program instructions 701 executable by a processor, and the program instructions 701 are configured to implement any of the FPGA-based scatter/gather processing methods described above.
The above description covers only preferred embodiments of the present application and is not intended to limit the present application; those skilled in the art may make various modifications and variations to it. Any modification, equivalent replacement, improvement, etc. made within the spirit and principles of the present application shall be included in its protection scope.

Claims (10)

1. An FPGA-based scatter/gather processing method, wherein the FPGA includes a memory unit divided into N memory cell blocks, the method comprising:
obtaining M scatter/gather requests, wherein M and N are positive integers;
distributing the M scatter/gather requests to the N memory cell blocks, wherein each scatter/gather request corresponds to one memory cell block of the N memory cell blocks, such that an ith memory cell block of the N memory cell blocks corresponds to x_i scatter/gather requests of the M scatter/gather requests, i is a positive integer from 1 to N, x_i is between 0 and M, and the sum of the numbers of scatter/gather requests corresponding to the N memory cell blocks is M; and
transmitting the x_i scatter/gather requests of the M scatter/gather requests to the corresponding memory cell blocks in a polling arbitration mode, so that at most N scatter/gather requests are transmitted in parallel in each round.
2. The method of claim 1, wherein
each of the M scatter/gather requests includes an address representing the memory cell block to be scattered/gathered; and
distributing the M scatter/gather requests to the N memory cell blocks comprises:
for each memory cell block of the N memory cell blocks, comparing the address of each of the M scatter/gather requests with the sequence number of the memory cell block, thereby obtaining the x_i scatter/gather requests corresponding to the memory cell block, and ordering the x_i scatter/gather requests by their sequence numbers, wherein an address being the same as the sequence number of the memory cell block indicates that the scatter/gather request corresponds to that memory cell block.
3. The method of claim 1, wherein
transmitting the x_i scatter/gather requests of the M scatter/gather requests to the corresponding memory cell blocks in a polling arbitration mode comprises:
transmitting the x_i scatter/gather requests of the M scatter/gather requests, in a polling arbitration mode, to a FIFO buffer located before an interface of the corresponding memory cell block; and
sending the x_i scatter/gather requests buffered in the FIFO buffer into the corresponding memory cell block.
4. The method of claim 3, wherein
each of the M scatter requests includes an address representing the memory cell block to be scattered to and data to be written into that memory cell block;
transmitting the x_i scatter requests of the M scatter requests, in a polling arbitration mode, to the FIFO buffer located before the interface of the corresponding memory cell block comprises:
each time a current scatter request is to be buffered in the FIFO buffer, holding the current scatter request for one clock cycle to wait for the next scatter request; and
in response to the address of the current scatter request being the same as the address of the next scatter request, arithmetically merging the data of the current scatter request with the data of the next scatter request; and
sending the x_i scatter requests buffered in the FIFO buffer into the corresponding memory cell block comprises:
sending, to the corresponding memory cell block, the data obtained by arithmetically merging the consecutively arranged scatter requests with the same address among the x_i scatter requests.
5. The method of claim 3, wherein
each of the M gather requests includes an address representing the memory cell block to be gathered from and the data to be read from that memory cell block; and
when the x_i gather requests of the M gather requests are transmitted, in a polling arbitration mode, to the FIFO buffer located before the interface of the corresponding memory cell block, the interface of the corresponding memory cell block further includes another FIFO buffer and a further FIFO buffer, wherein the another FIFO buffer is configured to buffer the gather requests sent in each round so as to indicate the valid memory cell blocks among the N memory cell blocks, and the further FIFO buffer is configured to buffer the sequence number of the memory cell block so as to indicate, when the memory cell block sends back the read data, the memory unit in which the data is to be stored.
6. The method of claim 1, wherein
obtaining the M scatter requests comprises:
obtaining O scatter requests, wherein O is a positive integer and O is greater than or equal to M;
sorting the O scatter requests by address, wherein each of the sorted O scatter requests has a sequence number and includes an address representing the corresponding memory cell block to be scattered to and data to be written into the corresponding memory cell block, and wherein any two scatter requests among the O scatter requests that have the same address have adjacent sequence numbers; and
performing a same-address dichotomy merging operation on the sorted O scatter requests, thereby obtaining the M scatter requests.
7. The method of claim 6, wherein
performing the same-address dichotomy merging operation on the sorted O scatter requests comprises:
among the sorted O scatter requests, if the addresses of every two adjacent scatter requests are the same, merging their data.
8. An FPGA comprising a memory unit and a memory controller coupled to each other, the memory controller configured to perform the FPGA-based scatter/gather processing method of any of claims 1 to 7.
9. An electronic device connected to an FPGA, comprising a memory and a processor coupled to each other, the processor configured to execute program instructions stored in the memory to implement the FPGA-based scatter/gather processing method of any of claims 1 to 7.
10. A non-transitory computer readable storage medium storing program instructions which, when executed by a processor, are configured to implement the FPGA-based scatter/gather processing method of any of claims 1 to 7.
CN202211566456.5A 2022-12-07 2022-12-07 FPGA-based dispersion/aggregation processing method, electronic equipment and storage medium Pending CN118151837A (en)

Publications (1)

Publication Number Publication Date
CN118151837A (en) 2024-06-07


