CN109101347A

CN109101347A - Pulse compression processing method for an OpenCL-based FPGA heterogeneous computing platform

Info

Publication number: CN109101347A
Application number: CN201810778029.0A
Authority: CN
Inventors: 胡善清; 于嘉程; 王雨薇
Original assignee: Beijing Institute of Technology BIT
Current assignee: Beijing Institute of Technology BIT
Priority date: 2018-07-16
Filing date: 2018-07-16
Publication date: 2018-12-28
Anticipated expiration: 2038-07-16
Also published as: CN109101347B

Abstract

The invention discloses a pulse compression processing method for an OpenCL-based FPGA heterogeneous computing platform. Two arrays of length N, a first array local_buf_1 and a second array local_buf_2, are defined in the inverse Fourier transform (IFFT) kernel. The method is as follows: M groups of echo data (PRTs), each containing N sampling points, pass through the FFT kernel and the conjugate multiplication kernel to produce conjugate multiplication result data. When m is odd, the conjugate multiplication result data of the m-th PRT is stored sequentially in local_buf_1; at the same time, the data in local_buf_2 is divided equally into eight segments, data is fetched from each segment in bit-reversed incrementing order for IFFT computation, and the IFFT result data is output. Once the conjugate multiplication result data of the m-th PRT has been stored completely in local_buf_1, the conjugate multiplication result data of the (m+1)-th PRT is stored sequentially in local_buf_2; at the same time, the data in local_buf_1 is divided equally into eight segments, data is fetched from each segment in bit-reversed incrementing order for IFFT computation, and the IFFT result data is output. The pulse compression results of all M PRTs are finally obtained.

Description

Pulse compression processing method for an OpenCL-based FPGA heterogeneous computing platform

Technical field

The present invention relates to the fields of signal processing and parallel computing, and in particular to a pulse compression processing method for an OpenCL-based FPGA heterogeneous computing platform.

Background art

The development of modern radar signal processing places ever higher demands on processor performance. However, as Moore's law runs into a bottleneck, the computing power of general-purpose processors is increasingly unable to meet practical application requirements. Heterogeneous computing platforms can exploit the strengths of different processor types to accelerate tasks, and they offer advantages over conventional architectures in computing performance, energy efficiency, and real-time behavior. The internal structure of an FPGA gives it strong parallel computing capability at relatively low power consumption, so combining an FPGA with a CPU into a heterogeneous processing platform can effectively raise system computing performance. OpenCL is a cross-platform parallel programming model, based on C/C++, designed specifically for heterogeneous computing platforms, and it is the first industry standard of its kind. As a cross-platform development language, OpenCL offers a new way of developing for FPGAs: the development cycle is short, the level of abstraction is high, and portability is good, which makes up for the shortcomings of traditional development flows. OpenCL-based FPGA heterogeneous computing platforms have become a research focus in both academia and industry.

Pulse compression is widely used in radar signal processing. In a radar system, pulse width is directly proportional to the detectable range and inversely proportional to the range resolution. Pulse compression makes it possible to achieve a long detection range together with high range resolution. It applies matched filtering to the echo of the transmitted pulse, compressing the echo into a narrow pulse and thereby improving the signal-to-noise ratio and range resolution of the received signal. As shown in Fig. 1, the existing pulse compression flow comprises three processing steps: (1) FFT, (2) conjugate multiplication, and (3) IFFT. The three steps have a clear producer-consumer relationship: the output of each step is the input of the next.
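The three-step flow can be illustrated with a small software sketch. This is not the FPGA implementation described in this document: the recursive radix-2 FFT, the helper names (fft, ifft, pulse_compress, chirp), the chirp waveform, and the sizes N = 64 and DELAY = 10 are all assumptions chosen only to make the example self-contained.

```python
import cmath


def fft(x, inverse=False):
    """Recursive radix-2 FFT (unnormalized); len(x) must be a power of two."""
    n = len(x)
    if n == 1:
        return list(x)
    even = fft(x[0::2], inverse)
    odd = fft(x[1::2], inverse)
    sign = 1 if inverse else -1
    out = [0j] * n
    for k in range(n // 2):
        w = cmath.exp(sign * 2j * cmath.pi * k / n) * odd[k]
        out[k] = even[k] + w
        out[k + n // 2] = even[k] - w
    return out


def ifft(x):
    """Inverse FFT with 1/n normalization."""
    n = len(x)
    return [v / n for v in fft(x, inverse=True)]


def pulse_compress(echo, reference):
    """Step 1: FFT. Step 2: conjugate multiplication. Step 3: IFFT."""
    n = len(echo)
    ref_padded = list(reference) + [0j] * (n - len(reference))
    echo_spec = fft(echo)                      # step 1
    ref_spec = fft(ref_padded)
    product = [e * r.conjugate()               # step 2
               for e, r in zip(echo_spec, ref_spec)]
    return ifft(product)                       # step 3


def chirp(length):
    """Hypothetical linear-FM reference pulse used only for this example."""
    return [cmath.exp(1j * cmath.pi * t * t / length) for t in range(length)]


N = 64        # samples per echo (assumed)
DELAY = 10    # echo delay in samples (assumed)
reference = chirp(16)
echo = [0j] * N
for t, value in enumerate(reference):
    echo[DELAY + t] = value

compressed = pulse_compress(echo, reference)
peak_index = max(range(N), key=lambda k: abs(compressed[k]))
```

In this toy case the long echo collapses to a narrow peak located at the echo delay, which is the effect pulse compression is meant to achieve.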

When pulse compression is implemented on an FPGA with OpenCL, three kernels must be mapped onto the FPGA, one for each of the three processing steps of the algorithm. As shown in Fig. 2, the global memory is a DDR chip external to the FPGA, and kernels can exchange data with it. In the typical OpenCL model, multiple kernels exchange data through global memory, with the host scheduling the data. The three kernels of the pulse compression algorithm therefore run entirely in series, and data scheduling introduces considerable processing latency; this operating mode cannot fully exploit the parallel computing capability of the FPGA to reach optimal processing performance. As shown in Fig. 3, Intel FPGA extends the typical OpenCL model with the kernel channel (kernel pipe) inter-kernel communication mechanism, which lets different kernels exchange data directly through a kernel channel, without going through global memory and without host involvement in data scheduling. For processing steps that stand in a producer-consumer relationship, kernel channels can therefore be used to optimize the kernels, achieving pipelined parallel processing and improving performance.

For pulse compression, computational parallelism can be extracted in two ways: (1) The PRTs (echoes) are independent of one another, so the data of different PRTs can be processed in parallel. (2) Each PRT is processed with the FFT/IFFT kernel example provided by Intel, which implements in OpenCL a radix-4 FFT engine that outputs 8 data points per clock cycle and supports different FFT sizes by changing a parameter. Because the three kernels in the pulse compression flow have a clear producer-consumer relationship, pipelined parallel processing can be carried out in units of 8 data points.

However, the implementation faces a technical difficulty: the FFT/IFFT kernel provided by Intel takes its input from eight segments in incrementing order and produces its output in bit-reversed order. Pipelined parallel processing therefore cannot be realized among the FFT, conjugate multiplication, and IFFT kernels directly through kernel channels: after the 8 data points output by the FFT have been conjugate-multiplied, their order must be adjusted before IFFT processing can take place.
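The ordering mismatch can be made concrete with a short sketch. It assumes that stream position p of the FFT output carries the point whose natural index is the bit reversal of p, and that the engine consumes one point from each of eight sections per step in the section order 0, 4, 2, 6, 1, 5, 3, 7 (the order given later in this description when the principle of S2 is explained). The function names and N = 64 are illustrative assumptions.

```python
def bit_reverse(value, bits):
    """Reverse the lowest `bits` bits of `value`."""
    result = 0
    for _ in range(bits):
        result = (result << 1) | (value & 1)
        value >>= 1
    return result


# Section order assumed from the description of the engine's input pattern.
SECTION_ORDER = (0, 4, 2, 6, 1, 5, 3, 7)


def engine_input_indices(step, n):
    """Natural indices the engine wants at one step: one point per section."""
    section = n // 8
    return [step + s * section for s in SECTION_ORDER]


def engine_output_indices(step, n):
    """Natural indices of the 8 points the engine emits at one step."""
    bits = n.bit_length() - 1          # log2(n) for a power of two
    return [bit_reverse(8 * step + k, bits) for k in range(8)]


N = 64
wanted_step0 = engine_input_indices(0, N)
produced_step0 = engine_output_indices(0, N)
wanted_step1 = engine_input_indices(1, N)
produced_step1 = engine_output_indices(1, N)
```

Under these assumptions the two streams agree only for the first block of 8 points; from the second block onward the FFT emits points that the IFFT does not yet want, which is why a reordering buffer is needed between the two.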

A method is therefore urgently needed that achieves high-performance pipelined parallel processing in units of 8 data points while keeping the data order correct.

Summary of the invention

In view of this, the present invention provides a pulse compression processing method for an OpenCL-based FPGA heterogeneous computing platform that achieves high-performance pipelined parallel processing of 8 data points per clock cycle while keeping the data order correct, thereby greatly improving pulse compression performance.

To achieve the above object, the technical solution of the present invention is as follows. The OpenCL-based FPGA heterogeneous computing platform comprises three kernels mapped onto a field-programmable gate array (FPGA) chip: a Fourier transform (FFT) kernel, a conjugate multiplication kernel, and an inverse Fourier transform (IFFT) kernel. Kernel channels establish the inter-kernel data paths between the FFT kernel and the conjugate multiplication kernel, and between the conjugate multiplication kernel and the IFFT kernel. Two arrays, a first array local_buf_1 and a second array local_buf_2, are defined in the IFFT kernel as local caches; the length of each array equals the number of sampling points in one group of echo data (PRT).

This method comprises the following steps:

S1: M groups of echo data (PRTs) are input in sequence to the FFT kernel for Fourier transformation. The FFT result data output by the FFT kernel is passed directly through a kernel channel to the conjugate multiplication kernel, which performs conjugate multiplication to obtain conjugate multiplication result data. Every PRT contains the same number of sampling points, N.

For the conjugate multiplication result data of the m-th PRT, where m is initially 1, execute S2.

S2: When m is odd, the conjugate multiplication result data of the m-th PRT is stored sequentially in the first array local_buf_1. At the same time, the data in the second array local_buf_2 is divided equally into eight segments, data is fetched from each segment in bit-reversed incrementing order for IFFT computation, and the IFFT result data is output. The data in local_buf_2 is initially invalid.

Once the conjugate multiplication result data of the m-th PRT has been stored completely in local_buf_1, the conjugate multiplication result data of the (m+1)-th PRT is stored sequentially in the second array local_buf_2. At the same time, the data in local_buf_1 is divided equally into eight segments, data is fetched from each segment in bit-reversed incrementing order for IFFT computation, and the IFFT result data is output.

S3: Determine whether all M groups of echo data have completed IFFT processing. If so, all IFFT result data output by the IFFT kernel is taken as the pulse compression result for the M PRTs.

Otherwise, m is incremented by 2 and the method returns to S2.
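The alternation between the two arrays in S2 and S3 can be sketched as a schedule at PRT granularity. The function name ping_pong_schedule and the notion of a "slot" (the time needed to store one PRT) are assumptions introduced for illustration; the sketch only tracks which array is written and which is read, not the data itself.

```python
def ping_pong_schedule(num_prt):
    """List, per time slot, which PRT is written where and which PRT is read.

    A read_prt of None marks the first slot, in which local_buf_2 still
    holds invalid data that is not processed.
    """
    buffers = ("local_buf_1", "local_buf_2")
    schedule = []
    # One extra slot at the end drains the last PRT from its buffer.
    for slot in range(num_prt + 1):
        write_prt = slot + 1 if slot < num_prt else None
        read_prt = slot if slot >= 1 else None
        schedule.append({
            "write_prt": write_prt,
            "write_buf": buffers[slot % 2] if write_prt is not None else None,
            "read_prt": read_prt,
            "read_buf": buffers[(slot + 1) % 2],
        })
    return schedule


schedule = ping_pong_schedule(4)
```

For four PRTs the schedule writes PRT 1 to local_buf_1 while local_buf_2 is read (invalid data), then writes PRT 2 to local_buf_2 while PRT 1 is read from local_buf_1, and so on; no slot reads and writes the same array.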

Further, in S2, dividing the data in the second array local_buf_2 equally into eight segments and fetching data from each segment in bit-reversed incrementing order for IFFT computation is specifically as follows:

Each data item of the (m+1)-th PRT stored in the second array local_buf_2 is assigned an index in order.

When m ≠ 1, after the data in local_buf_2 is divided equally into eight segments, the starting indices of the segments are 0, 1, 2, 3, 4, 5, 6, and 7. One data item is taken from each segment in turn in bit-reversed incrementing order, so 8 data points are taken each time and N/8 fetches are made in total. The indices of the 8 data points taken in the i-th fetch are, in order, 0+(~i), 1+(~i), 2+(~i), 3+(~i), 4+(~i), 5+(~i), 6+(~i), 7+(~i), where i = 1, 2, ..., (N/8-1) and (~i) is the result of bit-reversing i over log2(N) bits.

When m = 1, the data in local_buf_2 is initially invalid, and the invalid data is not processed.

Further, in S2, dividing the data in the first array local_buf_1 equally into eight segments, fetching data from each segment in bit-reversed incrementing order for IFFT computation, and outputting the IFFT result data is specifically as follows:

Each data item of the m-th PRT stored in the first array local_buf_1 is assigned an index in order.

After the data in local_buf_1 is divided equally into eight segments, the starting indices of the segments are 0, 1, 2, 3, 4, 5, 6, and 7. One data item is taken from each segment in turn in bit-reversed incrementing order, so 8 data points are taken each time and N/8 fetches are made in total. The indices of the 8 data points taken in the i-th fetch are, in order, 0+(~i), 1+(~i), 2+(~i), 3+(~i), 4+(~i), 5+(~i), 6+(~i), 7+(~i), where i = 1, 2, ..., (N/8-1) and (~i) is the result of bit-reversing i over log2(N) bits.
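The index rule can be written out as a small helper. This is a sketch under two assumptions: the function names are hypothetical, and the fetch counter is taken to run from 0 to N/8-1 so that N/8 fetches cover the whole array (the text lists i = 1, 2, ..., N/8-1).

```python
def bit_reverse(value, bits):
    """Reverse the lowest `bits` bits of `value`; this is the (~i) of the text."""
    result = 0
    for _ in range(bits):
        result = (result << 1) | (value & 1)
        value >>= 1
    return result


def fetch_indices(i, n):
    """Indices k + (~i), k = 0..7, read from the local array in fetch i."""
    bits = n.bit_length() - 1          # log2(n) for a power of two
    reversed_i = bit_reverse(i, bits)
    return [k + reversed_i for k in range(8)]


N = 32
all_fetches = [fetch_indices(i, N) for i in range(N // 8)]
```

Because i is smaller than N/8, its bit reversal over log2(N) bits is always a multiple of 8, so each fetch reads 8 consecutive addresses. For N = 32 the fetches start at addresses 0, 16, 8, and 24, and together they touch every address exactly once.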

Beneficial effects:

The present invention optimizes the pulse compression algorithm on the basis of kernel channels, and uses ping-pong buffering to solve the problem that, because the FFT kernel outputs data in bit-reversed order, IFFT processing cannot follow directly. Across the whole pulse compression flow it achieves high-performance pipelined parallel processing in units of 8 data points, so that the processing times of the FFT, conjugate multiplication, and IFFT stages overlap, greatly shortening the processing time of the pulse compression algorithm.

Brief description of the drawings

Fig. 1 is a flow chart of the existing pulse compression algorithm;

Fig. 2 is a schematic diagram of the multi-kernel operating mode based on the typical OpenCL model;

Fig. 3 is a schematic diagram of the multi-kernel operating mode based on kernel channels;

Fig. 4 is a schematic diagram of the structure of the OpenCL-based FPGA heterogeneous computing platform used by the present invention;

Fig. 5 is a flow chart of the pulse compression processing method of the present invention for the OpenCL-based FPGA heterogeneous computing platform;

Fig. 6 is a schematic diagram of the pipelined parallel operating mode of pulse compression based on kernel channels;

Fig. 7 is a schematic diagram of the pulse compression operating mode based on the typical OpenCL model.

Detailed description of the embodiments

The present invention will now be described in detail with reference to the accompanying drawings and examples.

The embodiment of the present invention takes pulse compression of granularity M × N as an example, that is, M groups of echo data (PRTs) in total, each containing N sampling points.

The operating mode of the FFT/IFFT kernel and the ordering of its input and output data are described in detail below:

The present invention provides a pulse compression processing method for an OpenCL-based FPGA heterogeneous computing platform. As shown in Fig. 4, the OpenCL-based FPGA heterogeneous computing platform comprises three kernels mapped onto a field-programmable gate array (FPGA) chip: a Fourier transform (FFT) kernel, a conjugate multiplication kernel, and an inverse Fourier transform (IFFT) kernel. Kernel channels establish the inter-kernel data paths between the FFT kernel and the conjugate multiplication kernel, and between the conjugate multiplication kernel and the IFFT kernel. Two arrays, a first array local_buf_1 and a second array local_buf_2, are defined in the IFFT kernel as local caches; the length of each array equals the number of sampling points in one group of echo data (PRT).

In the embodiment of the present invention, the two arrays local_buf_1 and local_buf_2 defined in the IFFT kernel each have length N, equal to the number of sampling points in one PRT. Specifically, local_buf_1 and local_buf_2 can be declared as local memory, and the compiler maps and implements the two arrays in the FPGA's on-chip RAM.

On the basis of the above OpenCL-based FPGA heterogeneous computing platform, the flow of the pulse compression processing method provided by the present invention is shown in Fig. 5 and comprises the following steps:

S1: M groups of echo data (PRTs) are input in sequence to the FFT kernel for Fourier transformation. The FFT result data output by the FFT kernel is passed directly through a kernel channel to the conjugate multiplication kernel, which performs conjugate multiplication to obtain conjugate multiplication result data. Every PRT contains the same number of sampling points, N. In the conjugate multiplication kernel, the FFT result data of the echo data PRT is conjugate-multiplied with the FFT result of the reference signal.

For the conjugate multiplication result data of the m-th PRT, where m is initially 1, execute S2.

Specifically:

When m = 1, the data in the second array local_buf_2 is initially invalid; in the embodiment of the present invention, invalid data is not processed.

Specifically:

Each data item of the m-th PRT stored in the first array local_buf_1 is assigned an index in order.

In the present invention, the principle of S2 is as follows:

The FFT kernel outputs its results in bit-reversed order, and the data of each PRT that reaches the IFFT kernel after passing through the FFT and conjugate multiplication kernels is stored sequentially in local_buf_1 or local_buf_2. The data in local_buf_1 and local_buf_2 is therefore stored in bit-reversed order. For each PRT, the data held at the first 8 consecutive addresses of local_buf_1 or local_buf_2, starting at index 0, has the original positions 0, 4 × N/8, 2 × N/8, 6 × N/8, 1 × N/8, 5 × N/8, 3 × N/8, 7 × N/8, which is exactly the order of the first 8 data points that the IFFT engine described above fetches when it accesses the original data in eight incrementing segments. In the IFFT kernel for pulse compression, the data stored in local_buf_1 and local_buf_2 is therefore likewise divided into eight segments, but the starting indices of the segments are 0, 1, 2, 3, 4, 5, 6, 7 rather than 0, 4 × N/8, 2 × N/8, 6 × N/8, 1 × N/8, 5 × N/8, 3 × N/8, 7 × N/8. In each subsequent iteration of a for loop, the data in local_buf_1 and local_buf_2 is taken out in bit-reversed incrementing order; the indices of the 8 data points taken in each iteration are 0+(~i), 1+(~i), 2+(~i), 3+(~i), 4+(~i), 5+(~i), 6+(~i), 7+(~i), where (~i) is the result of bit-reversing, over log2(N) bits, the sequentially incrementing value 1, 2, 3, ..., (N/8-1) of the index within the eight segments of each PRT.
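The principle can be checked numerically. The sketch below fills a model of the local array in bit-reversed order (address a holds the point whose original position is the bit reversal of a), reads 8 consecutive addresses starting at (~i) in each fetch, and compares the original positions obtained with the eight-segment order the IFFT engine expects. The names and N = 64 are assumptions, and the fetch counter is taken to start at 0.

```python
def bit_reverse(value, bits):
    """Reverse the lowest `bits` bits of `value`."""
    result = 0
    for _ in range(bits):
        result = (result << 1) | (value & 1)
        value >>= 1
    return result


# Segment order quoted in the text: 0, 4N/8, 2N/8, 6N/8, 1N/8, 5N/8, 3N/8, 7N/8.
SEGMENT_ORDER = (0, 4, 2, 6, 1, 5, 3, 7)


def reorder_matches_engine(n):
    """Return True if every fetch delivers the positions the engine expects."""
    bits = n.bit_length() - 1
    # Sequential storage of a bit-reversed stream: address -> original position.
    local_buf = [bit_reverse(address, bits) for address in range(n)]
    for i in range(n // 8):
        start = bit_reverse(i, bits)                 # (~i)
        fetched = local_buf[start:start + 8]         # indices 0+(~i) .. 7+(~i)
        expected = [i + s * (n // 8) for s in SEGMENT_ORDER]
        if fetched != expected:
            return False
    return True


first_block = [bit_reverse(address, 6) for address in range(8)]  # N = 64
matches = reorder_matches_engine(64)
```

For N = 64 the first 8 addresses hold original positions 0, 32, 16, 48, 8, 40, 24, 56, that is 0, 4 × N/8, 2 × N/8, 6 × N/8, 1 × N/8, 5 × N/8, 3 × N/8, 7 × N/8, and every later fetch matches the expected order as well.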

S3: Determine whether m = M holds. If so, all IFFT result data output by the IFFT kernel is taken as the pulse compression result for the M PRTs.

Otherwise, m is incremented by 2 and the method returns to S2.

When pulse compression of granularity M × N is computed, the first N/8 iterations of the IFFT kernel's for loop are needed to cache the data of the first PRT locally, and during this time the IFFT engine computes on invalid data. The output of the IFFT kernel must therefore be delayed by a further N/8 loop iterations on top of the original example, for a total wait of (N/8 + N/8) iterations, before the valid output of all results is obtained in the following M × N/8 iterations. The IFFT kernel thus executes (M × N/8 + N/8 + N/8) loop iterations in total. In the first M × N/8 iterations, integer division of the loop index i by N/8 gives the group number base of each PRT (base = 0, 1, 2, ..., M), and the remainder of i modulo N/8 gives the offset address offset of the data within each PRT (offset = 0, 1, 2, ..., N/8). The group number base is used to implement the ping-pong buffering, and the offset address offset is used to fetch the data of each PRT stored in local_buf_1 and local_buf_2, in eight segments and in bit-reversed incrementing order, and feed it to the IFFT engine.
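The loop structure can be modelled in software as follows. This is a sketch of the buffering stage only: it derives base and offset from the loop index, writes 8 incoming points per iteration into one array, and reads 8 points from the other. The function name, the tagging of each point as (PRT number, original position), and the omission of the engine's own additional N/8 iterations of latency are assumptions made for illustration.

```python
def bit_reverse(value, bits):
    """Reverse the lowest `bits` bits of `value`."""
    result = 0
    for _ in range(bits):
        result = (result << 1) | (value & 1)
        value >>= 1
    return result


def simulate_ifft_front_end(m_groups, n):
    """Return the valid 8-point blocks handed to the IFFT engine, in order."""
    bits = n.bit_length() - 1
    steps = n // 8
    buffers = [[None] * n, [None] * n]     # models local_buf_1, local_buf_2
    fed = []
    # M*N/8 iterations to store all PRTs plus N/8 to drain the last one.
    for i in range(m_groups * steps + steps):
        base, offset = divmod(i, steps)    # group number and offset address
        start = bit_reverse(offset, bits)
        block = buffers[(base + 1) % 2][start:start + 8]
        if base >= 1:                      # first N/8 reads are invalid data
            fed.append(block)
        if base < m_groups:                # store the incoming 8 points
            for k in range(8):
                address = 8 * offset + k
                # Incoming stream is bit-reversed: tag (PRT, original position).
                buffers[base % 2][address] = (base + 1,
                                              bit_reverse(address, bits))
    return fed


fed_blocks = simulate_ifft_front_end(3, 64)
```

With M = 3 and N = 64 the model delivers 24 valid blocks: block j belongs to PRT j // 8 + 1 and contains original positions offset + s × N/8 for s in the order 0, 4, 2, 6, 1, 5, 3, 7, with offset = j % 8. Selecting the array by the parity of base is one straightforward way to realize the ping-pong switching described above.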

With the above method, kernel channels enable high-performance pipelined parallel processing of 8 data points per clock cycle across the whole pulse compression algorithm. The operating mode is shown in Fig. 6, where the red arrows indicate data exchange between the kernels and global memory: the FFT kernel reads the raw data from global memory, the conjugate multiplication kernel reads the reference signal from global memory, and the IFFT kernel writes the pulse compression results to global memory. The pulse compression algorithm of granularity 4K × 8K was implemented on a CPU+Arria10 FPGA heterogeneous computing platform, both with the typical OpenCL model and with the optimization method proposed by the present invention, and the kernel execution times were measured; the results are shown in Table 1.

Table 1. Test results before and after optimization of the pulse compression algorithm

As the data in Table 1 shows, when the pulse compression algorithm is implemented with the typical OpenCL model, the three kernels execute serially (the operating mode is shown in Fig. 7), and the total time is the sum of the processing times of the three kernels. When the algorithm is implemented with the method proposed by the present invention, the FFT, conjugate multiplication, and IFFT kernels work as a parallel pipeline, so the processing times of the three stages overlap, which greatly shortens the processing time of the pulse compression algorithm and improves processing performance.

Specifically, for pulse compression of granularity 4K × 8K, the best performance currently achievable on the CPU+Arria10 FPGA heterogeneous computing platform is compared with the result of an eight-core parallel optimized implementation on the DSP C6678; the results are shown in Table 2.

Table 2. Pulse compression algorithm performance comparison across processors

Processor	Arria10 FPGA	DSP C6678
Total time (unit: ms)	42	1200

As the data in Table 2 shows, for pulse compression of granularity 4K × 8K, the Arria10 FPGA achieves a 28.6-fold performance improvement over the DSP C6678 (1200 ms / 42 ms ≈ 28.6).

The present invention therefore achieves high-performance pipelined parallel processing in units of 8 data points, so that the processing times of the FFT, conjugate multiplication, and IFFT stages overlap, greatly shortening the processing time of the pulse compression algorithm.

In summary, the above are merely preferred embodiments of the present invention and are not intended to limit its scope of protection. Any modification, equivalent replacement, or improvement made within the spirit and principles of the present invention shall fall within its scope of protection.

Claims

1. A pulse compression processing method for an OpenCL-based FPGA heterogeneous computing platform, characterized in that the OpenCL-based FPGA heterogeneous computing platform comprises three kernels mapped onto a field-programmable gate array (FPGA) chip, namely a Fourier transform (FFT) kernel, a conjugate multiplication kernel, and an inverse Fourier transform (IFFT) kernel; kernel channels establish the inter-kernel data paths between the FFT kernel and the conjugate multiplication kernel, and between the conjugate multiplication kernel and the IFFT kernel; two arrays, a first array local_buf_1 and a second array local_buf_2, are defined in the IFFT kernel as local caches, the length of each array being equal to the number of sampling points in one group of echo data (PRT);

This method comprises the following steps:

S1: M groups of echo data (PRTs) are input in sequence to the FFT kernel for Fourier transformation; the FFT result data output by the FFT kernel is passed directly through the kernel channel to the conjugate multiplication kernel, which performs conjugate multiplication to obtain conjugate multiplication result data; every PRT contains the same number of sampling points, N;

S2: when m is odd, the conjugate multiplication result data of the m-th PRT is stored sequentially in the first array local_buf_1; at the same time, the data in the second array local_buf_2 is divided equally into eight segments, data is fetched from each segment in bit-reversed incrementing order for IFFT computation, and the IFFT result data is output; the data in the second array local_buf_2 is initially invalid;

once the conjugate multiplication result data of the m-th PRT has been stored completely in the first array local_buf_1, the conjugate multiplication result data of the (m+1)-th PRT is stored sequentially in the second array local_buf_2; at the same time, the data in the first array local_buf_1 is divided equally into eight segments, data is fetched from each segment in bit-reversed incrementing order for IFFT computation, and the IFFT result data is output;

S3: determine whether all M groups of echo data have completed IFFT processing; if so, all IFFT result data output by the IFFT kernel is taken as the pulse compression result for the M PRTs;

otherwise, m is incremented by 2 and the method returns to S2.

2. The method according to claim 1, characterized in that, in S2, dividing the data in the second array local_buf_2 equally into eight segments and fetching data from each segment in bit-reversed incrementing order for IFFT computation is specifically:

each data item of the (m+1)-th PRT stored in the second array local_buf_2 is assigned an index in order;

when m ≠ 1, after the data in the second array local_buf_2 is divided equally into eight segments, the starting indices of the segments are 0, 1, 2, 3, 4, 5, 6, and 7; one data item is taken from each segment in turn in bit-reversed incrementing order, so 8 data points are taken each time and N/8 fetches are made in total; the indices of the 8 data points taken in the i-th fetch are, in order, 0+(~i), 1+(~i), 2+(~i), 3+(~i), 4+(~i), 5+(~i), 6+(~i), 7+(~i), where i = 1, 2, ..., (N/8-1) and (~i) is the result of bit-reversing i over log2(N) bits.

3. The method according to claim 1, characterized in that, in S2, dividing the data in the first array local_buf_1 equally into eight segments, fetching data from each segment in bit-reversed incrementing order for IFFT computation, and outputting the IFFT result data is specifically:

each data item of the m-th PRT stored in the first array local_buf_1 is assigned an index in order;

after the data in the first array local_buf_1 is divided equally into eight segments, the starting indices of the segments are 0, 1, 2, 3, 4, 5, 6, and 7; one data item is taken from each segment in turn in bit-reversed incrementing order, so 8 data points are taken each time and N/8 fetches are made in total; the indices of the 8 data points taken in the i-th fetch are, in order, 0+(~i), 1+(~i), 2+(~i), 3+(~i), 4+(~i), 5+(~i), 6+(~i), 7+(~i), where i = 1, 2, ..., (N/8-1) and (~i) is the result of bit-reversing i over log2(N) bits.