CN105302644B

CN105302644B - FFT accelerator device based on a token task scheduling strategy

Info

Publication number: CN105302644B
Application number: CN201510718777.6A
Authority: CN
Inventors: 雷元武; 鲁建壮; 陈胜刚; 彭元喜; 孙书为; 孙永节; 刘胜; 吴虎成; 李勇; 许邦建; 胡封林; 王耀华
Original assignee: National University of Defense Technology
Current assignee: National University of Defense Technology
Priority date: 2015-10-29
Filing date: 2015-10-29
Publication date: 2019-09-06
Anticipated expiration: 2035-10-29
Also published as: CN105302644A

Abstract

An FFT accelerator device based on a token task scheduling strategy, comprising: an FFT accelerator control module, whose control logic controls batch one-dimensional (1-D) FFT operations and sends read/write control parameters to a bus controller; the bus controller, which generates the control signals for reading and writing DDR memory or on-chip SMC memory according to the parameters from the FFT accelerator control module; an FFT computing array comprising two FFT-PEs of single-memory structure; and a data path and command path asynchronous processing unit, which converts the TeraNet data master port protocol into the internal DMA bus protocol and the TeraNet command slave port protocol into the internal Pbus bus protocol. The access rights to the four memory banks in the two FFT-PEs are defined as tokens, and the read unit, the write unit, the FFT-PE1 execution unit and the FFT-PE2 execution unit act as four functional components that operate on the four memory banks according to the tokens. The invention reduces the waiting overhead of the functional components, shortens the execution time of the FFT algorithm, and improves FFT accelerator performance.

Description

FFT accelerator device based on a token task scheduling strategy

Technical field

The present invention relates generally to the field of microprocessor architecture and chip design, and in particular to an FFT accelerator device for DSP chips based on a token task scheduling strategy.

Background art

The Fast Fourier Transform (FFT) is a fast implementation of the Discrete Fourier Transform (DFT). It exploits the periodicity, conjugate symmetry and reducibility of the complex exponential factors to reorder the signal sequence x(n) according to a fixed rule and ultimately decompose it into a number of short sequences on which the computation is performed. The FFT reduces the computational complexity from the O(n²) of the direct DFT algorithm to O(n log n). The advent of the FFT has allowed the DFT to be applied far more widely in theoretical analysis and in practice. In theoretical computation and analysis, the FFT is applied to spectrum analysis, fast convolution, fast correlation, large-integer multiplication and so on. The FFT is also one of the indispensable tools of digital signal processing: it transforms a signal from the time domain to the frequency domain, where the relevant features of the signal can be analyzed easily. In signal processing, the FFT is used in digital communication, speech signal processing, image processing, power spectrum estimation, radar and other fields.
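The complexity reduction described above can be illustrated with a minimal Python sketch. It is not part of the patented device: the function names `dft` and `fft` are hypothetical, and the recursive radix-2 decimation-in-time form is only one common way to realize the decomposition into short sequences.

```python
import cmath

def dft(x):
    """Direct evaluation of the DFT definition: O(n^2) complex multiplications."""
    n = len(x)
    return [sum(x[k] * cmath.exp(-2j * cmath.pi * k * m / n) for k in range(n))
            for m in range(n)]

def fft(x):
    """Recursive radix-2 decimation-in-time FFT: O(n log n).

    Assumes len(x) is a power of two.
    """
    n = len(x)
    if n == 1:
        return list(x)
    even = fft(x[0::2])   # short sequence of even-indexed samples
    odd = fft(x[1::2])    # short sequence of odd-indexed samples
    out = [0j] * n
    for m in range(n // 2):
        # Butterfly: periodicity and symmetry of the twiddle factor let one
        # product serve two outputs.
        t = cmath.exp(-2j * cmath.pi * m / n) * odd[m]
        out[m] = even[m] + t
        out[m + n // 2] = even[m] - t
    return out
```

Both functions return the same spectrum up to rounding error; only the operation count differs.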

However, some special applications demand very high transform speed and place higher requirements on the performance, power consumption and efficiency of the FFT algorithm, which are difficult to meet with a general-purpose digital signal processor (DSP) chip or CPU chip. Some DSP chips therefore integrate a hardware unit dedicated to the FFT, in which the FFT processing algorithm is implemented in custom dedicated logic and needs no programming. For example, the TI C55x series DSP chips include a tightly coupled FFT accelerator (HWAFFT for short). The accelerator communicates with the C55x DSP through accelerator instructions and supports real and complex FFTs of 8 to 1024 points in 32-bit fixed-point format.

FFT accelerators usually rely on task-level parallelism to improve computing performance and shorten computing latency, which requires tasks to be scheduled among multiple functional components. These functional components must also be synchronized according to task completion. How to schedule tasks effectively and coordinate the functional components is therefore the key to improving FFT accelerator performance. Traditional fence-based (barrier-based) task synchronization and scheduling takes the longest-running task between two adjacent synchronization points as the measure, and the remaining functional components wait. However, the time each functional component needs to complete a task varies with the configuration and the execution environment, so fence-based synchronization and scheduling incurs additional waiting overhead.

Summary of the invention

The technical problem to be solved by the present invention is as follows: in view of the problems of the prior art, the present invention provides an FFT accelerator device based on a token task scheduling strategy that reduces the waiting overhead of the functional components, shortens the execution time of the FFT algorithm, and improves FFT accelerator performance.

In order to solve the above technical problems, the invention adopts the following technical scheme:

An FFT accelerator device based on a token task scheduling strategy, comprising:

an FFT accelerator control module, whose control logic controls batch 1-D FFT operations, sends read/write control parameters to the bus controller, and coordinates computation and data transfer between the FFT-PEs;

a bus controller, which generates the control signals for reading and writing DDR memory or on-chip SMC memory according to the parameters from the FFT accelerator control module;

an FFT computing array comprising two FFT-PEs of single-memory structure, namely FFT_PE1 and FFT_PE2, for computing batch 1-D FFTs; each FFT-PE contains two groups of data memories, which implement "ping-pong" operation between reading initial data or writing results on the one hand and FFT computation on the other; the two FFT-PEs take turns receiving data from memory, performing the FFT, and writing the results back to memory;

a data path and command path asynchronous processing unit, which converts the TeraNet data master port protocol into the internal DMA bus protocol and the TeraNet command slave port protocol into the internal Pbus bus protocol, and also performs the asynchronous crossing between the system clock domain and the FFT clock domain;

tokens, defined as the access rights to the four memory banks in the two FFT-PEs, the four memory banks being FFT1-RAM0, FFT1-RAM1, FFT2-RAM0 and FFT2-RAM1; the read unit, the write unit, the FFT-PE1 execution unit and the FFT-PE2 execution unit act as four functional components that operate on the four memory banks according to the tokens.

As a further improvement of the present invention, the passing of tokens among the functional components must obey the following rules:

Rule 1: the DMA read control module and the DMA write control module in the bus controller can access all four groups of data memories inside the two FFT-PEs;

Rule 2: each FFT-PE can access only its own two groups of data memories, i.e. FFT-PE1 accesses data memories FFT1-RAM0 and FFT1-RAM1, and FFT-PE2 accesses data memories FFT2-RAM0 and FFT2-RAM1;

Rule 3: at any time, each group of data memory may be read or written by only one of the DMA read control module, the DMA write control module and the corresponding FFT-PE; operations on a data memory must not overlap;

Rule 4: for each group of data memory, DMA read, FFT computation and DMA write are executed strictly in sequence, and the read, compute and write operations of any two groups of data processed in the same data memory must not interleave;

Rule 5: at any time, each execution unit may operate on only one of the four data memories; the data memory operations of an execution unit must not overlap.
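As an illustration of how these five rules constrain a schedule, the following Python sketch checks a list of timed operations against them. It is a hypothetical verification aid, not part of the patent: the unit labels `RD`, `WR`, `PE1`, `PE2`, the tuple layout `(unit, bank, start, end)` and the function names are assumptions made for the example. Access-right violations (Rules 1 and 2) are reported together as rule 2.

```python
ALL_BANKS = {"FFT1-RAM0", "FFT1-RAM1", "FFT2-RAM0", "FFT2-RAM1"}
# Rules 1 and 2: which unit may touch which bank.
ALLOWED = {
    "RD": ALL_BANKS,
    "WR": ALL_BANKS,
    "PE1": {"FFT1-RAM0", "FFT1-RAM1"},
    "PE2": {"FFT2-RAM0", "FFT2-RAM1"},
}
STAGE = {"RD": "read", "PE1": "fft", "PE2": "fft", "WR": "write"}

def has_overlap(intervals):
    """True if any two half-open intervals [start, end) overlap."""
    ordered = sorted(intervals)
    return any(prev[1] > nxt[0] for prev, nxt in zip(ordered, ordered[1:]))

def violated_rules(ops):
    """ops: iterable of (unit, bank, start, end). Returns sorted rule numbers violated."""
    violated = set()
    by_bank, by_unit = {}, {}
    for unit, bank, start, end in ops:
        if bank not in ALLOWED[unit]:
            violated.add(2)                      # access rights (Rules 1 and 2)
        by_bank.setdefault(bank, []).append((start, end, STAGE[unit]))
        by_unit.setdefault(unit, []).append((start, end))
    expected = ("read", "fft", "write")
    for items in by_bank.values():
        items.sort()
        if has_overlap([(s, e) for s, e, _ in items]):
            violated.add(3)                      # one accessor per bank at a time
        if any(stage != expected[i % 3] for i, (_, _, stage) in enumerate(items)):
            violated.add(4)                      # strict read -> FFT -> write order
    for intervals in by_unit.values():
        if has_overlap(intervals):
            violated.add(5)                      # one bank per unit at a time
    return sorted(violated)
```

For example, a schedule in which the read unit fills FFT2-RAM0 while FFT-PE1 is computing on FFT1-RAM0 passes all checks, whereas letting FFT-PE1 operate on FFT2-RAM0 is flagged as an access-right violation.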

As a further improvement of the present invention, four token FIFOs are further provided, namely PE1_FIFO, PE2_FIFO, Wrt_FIFO and Rd_FIFO, where PE1_FIFO and PE2_FIFO have a depth of 2 and Wrt_FIFO and Rd_FIFO have a depth of 4.

As a further improvement of the present invention, when memory banks are operated on according to tokens, the following token passing scheduling strategy is used:

(a) the read unit and the write unit can access all four groups of memories, the FFT-PE1 execution unit can operate only on FFT1-RAM0 and FFT1-RAM1, and the FFT-PE2 execution unit can operate only on FFT2-RAM0 and FFT2-RAM1, which satisfies task scheduling Rules 1 and 2;

(b) the flow order of the tokens is fixed; for each memory bank, the source data is read first, the FFT is then computed, and the result is finally written to the specified storage location by the write unit; that is, the token flow order for FFT1-RAM0 and FFT1-RAM1 is read unit → FFT-PE1 → write unit → read unit, and the token flow order for FFT2-RAM0 and FFT2-RAM1 is read unit → FFT-PE2 → write unit → read unit, which satisfies Rule 4;

(c) token exclusivity: the token of each memory bank can be held by only one functional component, which operates on that bank, which satisfies Rule 3;

(d) functional unit exclusivity: each functional unit can hold only one token at a time and operate on one memory bank; as soon as the operation completes, the token is passed to the next execution unit, which satisfies Rule 5.
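The fixed flow order in item (b) can be written down as a small lookup. The Python sketch below is illustrative only: the function name `next_holder` and the string labels `"read"` and `"write"` for the read and write units are assumptions made for the example.

```python
# Which FFT-PE owns each memory bank (Rule 2).
BANK_OWNER = {
    "FFT1-RAM0": "FFT-PE1", "FFT1-RAM1": "FFT-PE1",
    "FFT2-RAM0": "FFT-PE2", "FFT2-RAM1": "FFT-PE2",
}

def next_holder(bank, holder):
    """Return the functional component that receives the bank's token next.

    The cycle is read unit -> owning FFT-PE -> write unit -> read unit.
    Raises ValueError if `holder` may never hold this bank's token.
    """
    order = ["read", BANK_OWNER[bank], "write"]
    if holder not in order:
        raise ValueError(f"{holder} cannot hold the token of {bank}")
    return order[(order.index(holder) + 1) % len(order)]
```

Because the cycle is determined by the bank alone, a token needs to carry only the identity of its bank for every component to know where to forward it.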

As a further improvement of the present invention, the FFT-PE comprises an FFT computation control state machine, a "ping-pong" multi-bank dual-port RAM, a multi-bank selection factor ROM, a CORDIC twiddle factor generation module and a configurable butterfly computation unit.
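The patent does not describe the internals of the CORDIC twiddle factor generation module. As general background only, the following Python sketch shows how the standard rotation-mode CORDIC iteration can produce a twiddle factor W_n^k = exp(-2πjk/n). The names `cordic_cos_sin` and `twiddle`, the iteration count of 24, and the use of floating point are assumptions; a hardware module would typically use fixed-point shifts and adds.

```python
import math

ITERATIONS = 24
# Elementary rotation angles atan(2^-i), as would be stored in a small table.
ANGLES = [math.atan(2.0 ** -i) for i in range(ITERATIONS)]
# Pre-scale by 1/K so that the result needs no gain correction afterwards.
INV_GAIN = 1.0
for i in range(ITERATIONS):
    INV_GAIN /= math.sqrt(1.0 + 2.0 ** (-2 * i))

def cordic_cos_sin(theta):
    """Rotation-mode CORDIC for theta in [-pi/2, pi/2]. Returns (cos, sin)."""
    x, y, z = INV_GAIN, 0.0, theta
    for i in range(ITERATIONS):
        d = 1.0 if z >= 0 else -1.0          # rotate toward zero residual angle
        x, y = x - d * y * 2.0 ** -i, y + d * x * 2.0 ** -i
        z -= d * ANGLES[i]
    return x, y

def twiddle(k, n):
    """Approximate W_n^k = exp(-2*pi*j*k/n) using CORDIC."""
    theta = -2.0 * math.pi * (k % n) / n     # in (-2*pi, 0]
    sign = 1.0
    if theta < -1.5 * math.pi:               # fourth quarter: shift by a full turn
        theta += 2.0 * math.pi
    elif theta < -0.5 * math.pi:             # middle half: shift by pi, negate
        theta += math.pi
        sign = -1.0
    c, s = cordic_cos_sin(theta)
    return complex(sign * c, sign * s)
```

Generating twiddle factors on the fly in this way trades a lookup table for a short sequence of shift-and-add steps, which is one common motivation for CORDIC in FFT hardware.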

Compared with the prior art, the advantages of the present invention are as follows. The FFT accelerator device based on a token task scheduling strategy uses token-based task scheduling together with an event-driven token passing policy: tasks flow among the functional components in the form of tokens, which guarantees the correct execution order. Under any execution conditions, this strategy minimizes the waiting overhead of each functional component, thereby shortening the execution time of the FFT algorithm and maximizing FFT accelerator performance.

Brief description of the drawings

Fig. 1 is a schematic diagram of the topology of the present invention.

Fig. 2 is a space-time diagram of data memory accesses of the present invention in a specific application example.

Fig. 3 is a space-time diagram of the operations performed by each execution unit of the present invention in a specific application example.

Fig. 4 is a structural diagram of the event-driven token passing scheduling strategy of the present invention in a specific application example.

Fig. 5 is a schematic diagram of the control state machine for token-based task scheduling of the present invention in a specific application example.

Specific embodiment

The present invention is described in further detail below with reference to the accompanying drawings and specific embodiments.

As shown in Fig. 1, the FFT accelerator device based on a token task scheduling strategy of the present invention comprises:

an FFT computing array comprising two FFT-PEs of single-memory structure (FFT_PE1 and FFT_PE2) for computing batch 1-D FFTs. Each FFT-PE contains two groups of data memories, which implement "ping-pong" operation between reading initial data or writing results on the one hand and FFT computation on the other. The two FFT-PE units take turns receiving data from memory, performing the FFT, and writing the results back to memory;

a data path and command path asynchronous processing unit, which converts the TeraNet data master port protocol into the internal DMA bus protocol and the TeraNet command slave port protocol into the internal Pbus bus protocol, and also performs the asynchronous crossing between the system clock domain and the FFT clock domain.

The structure of the FFT-PE, as shown in the drawings, mainly comprises an FFT computation control state machine, a "ping-pong" multi-bank dual-port RAM, a multi-bank selection factor ROM, a CORDIC twiddle factor generation module and a configurable butterfly computation unit. The FFT computing array computes batch 1-D FFTs. Each FFT-PE contains two groups of data memories, which implement "ping-pong" operation between reading initial data or writing results on the one hand and FFT computation on the other. The two FFT-PE units take turns receiving data from memory, performing the FFT, and writing the results back to memory.

In the above scheme, differences in FFT size and variations in read/write channel bandwidth make the FFT computation time, the time to read the initial data and the time to write the results unpredictable. If the read/write control of the data memories were governed by fence synchronization of DMA read, DMA write and FFT computation, additional waiting overhead would be incurred. The present invention therefore further proposes a token-based task scheduling strategy that makes maximum use of the read channel bandwidth, the write channel bandwidth and the FFT butterfly computation resources.
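The waiting overhead of fence synchronization can be made concrete with a simplified timing model. The Python sketch below is an assumption-laden illustration, not the patent's implementation: groups are assigned to the four banks in round-robin order (so even-numbered groups go to FFT-PE1 and odd-numbered groups to FFT-PE2), every unit serves groups in index order, and in the fence model every unit that can start a task does so at a synchronization point and all units then wait for the slowest one. The function names are hypothetical.

```python
def token_makespan(rd, fft, wr):
    """Event-driven model: each task starts as soon as its token and unit are free."""
    rd_free = wr_free = 0.0
    pe_free = [0.0, 0.0]
    bank_free = [0.0] * 4                # time at which each bank's token returns
    for i in range(len(rd)):
        rd_free = max(rd_free, bank_free[i % 4]) + rd[i]       # DMA read
        pe = i % 2
        pe_free[pe] = max(rd_free, pe_free[pe]) + fft[i]       # FFT computation
        wr_free = max(pe_free[pe], wr_free) + wr[i]            # DMA write
        bank_free[i % 4] = wr_free
    return wr_free

def fence_makespan(rd, fft, wr):
    """Lockstep model: tasks start only at sync points; a phase lasts as long as its slowest task."""
    n = len(rd)
    n_read = n_written = 0
    next_fft = [0, 1]                    # next group for FFT-PE1 (even) and FFT-PE2 (odd)
    t = 0.0
    while n_written < n:
        # Decide what may start from the state at the synchronization point.
        can_read = n_read < n and n_read - n_written < 4
        can_fft = [next_fft[p] < n_read for p in (0, 1)]
        can_write = next_fft[n_written % 2] > n_written
        durations = []
        if can_read:
            durations.append(rd[n_read]); n_read += 1
        for p in (0, 1):
            if can_fft[p]:
                durations.append(fft[next_fft[p]]); next_fft[p] += 2
        if can_write:
            durations.append(wr[n_written]); n_written += 1
        t += max(durations)
    return t
```

With four groups, read times `[1, 1, 4, 1]`, FFT times of 2 and write times of 1, this model gives a makespan of 10 for the token-driven schedule and 12 for the fence-synchronized one. When all task times are equal, the two models coincide, which matches the observation that the overhead arises from variation in task times.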

In the token-based FFT task scheduling of the present invention, the access rights to the four memory banks in the two FFT-PEs (i.e. FFT1-RAM0, FFT1-RAM1, FFT2-RAM0 and FFT2-RAM1) are defined as tokens, and the four functional components (the read unit, the write unit, the FFT-PE1 execution unit and the FFT-PE2 execution unit) operate on the four memory banks according to the tokens. For a batch FFT computation, the access space-time diagram of the four data memories (FFT1-RAM0, FFT1-RAM1, FFT2-RAM0 and FFT2-RAM1) and the execution space-time diagram of the four execution units (the read unit, the write unit, the FFT-PE1 execution unit and the FFT-PE2 execution unit) are shown in Fig. 2 and Fig. 3. The read, FFT computation and write-back of the i-th group of data are denoted read i, FFT i and write i respectively.

In the FFT accelerator, the four data memories are regarded as the objects operated on and the four execution units as the operators. To guarantee correct execution, the passing of tokens among the functional components must obey the following rules:

Rule 3: at any time, each group of data memory may be read or written by only one of the DMA read control module, the DMA write control module and the corresponding FFT-PE; as shown in Fig. 2, operations on a data memory must not overlap;

Rule 4: for each group of data memory, DMA read, FFT computation and DMA write are executed strictly in sequence, and the read, compute and write operations of any two groups of data processed in the same data memory must not interleave. As shown in Fig. 2, for any group of data i, read i, FFT i and write i are executed in sequence without overlapping, and the read of the next group of data begins only after the write of the previous group has completed;

Rule 5: at any time, each execution unit may operate on only one of the four data memories; as shown in Fig. 3, the data memory operations of an execution unit must not overlap.

In a specific application example, to ensure that tokens are passed effectively between the functional components, the present invention further provides four token FIFOs, namely PE1_FIFO, PE2_FIFO, Wrt_FIFO and Rd_FIFO, where PE1_FIFO and PE2_FIFO have a depth of 2 and Wrt_FIFO and Rd_FIFO have a depth of 4. As before, the access rights to the four memory banks in the two FFT-PEs (FFT1-RAM0, FFT1-RAM1, FFT2-RAM0 and FFT2-RAM1) are defined as tokens, and the four functional components (the read unit, the write unit, the FFT-PE1 execution unit and the FFT-PE2 execution unit) operate on the four memory banks according to the tokens.

The event-driven token passing scheduling strategy is as follows:

(a) the read unit and the write unit can access all four groups of memories, the FFT-PE1 execution unit can operate only on FFT1-RAM0 and FFT1-RAM1, and the FFT-PE2 execution unit can operate only on FFT2-RAM0 and FFT2-RAM1, which satisfies task scheduling Rules 1 and 2;

(b) the flow order of the tokens is fixed, as shown in Fig. 4. For each memory bank, the source data is read first, the FFT is then computed, and the result is finally written to the specified storage location by the write unit; that is, the token flow order for FFT1-RAM0 and FFT1-RAM1 is read unit → FFT-PE1 → write unit → read unit, and the token flow order for FFT2-RAM0 and FFT2-RAM1 is read unit → FFT-PE2 → write unit → read unit, which satisfies Rule 4.

In a specific application example, as shown in Fig. 5, the execution flow of the token-based task scheduling strategy is as follows:

S10: the token generation module first generates four tokens in the order FFT1-RAM0, FFT2-RAM0, FFT1-RAM1, FFT2-RAM1 and stores them in the read module token FIFO (Rd_FIFO);

S20: the read unit fetches a token when Rd_FIFO signals that it is not empty. When the read completes, the token is written to the token FIFO of the corresponding FFT execution unit according to the PE information carried on the token: if the read module has completed a read into FFT1-RAM0 or FFT1-RAM1, the token is written to PE1_FIFO; if it has completed a read into FFT2-RAM0 or FFT2-RAM1, the token is written to PE2_FIFO;

S30: the FFT-PE1 and FFT-PE2 execution units each fetch a token when their token FIFO (PE1_FIFO and PE2_FIFO respectively) signals that it is not empty and start the corresponding FFT computation; when the FFT computation completes, the token is written to the write unit token FIFO (Wrt_FIFO);

S40: the write unit fetches a token when Wrt_FIFO signals that it is not empty and starts writing back the FFT results of the corresponding data memory; when the DMA write completes, the token is written back to the read unit token FIFO (Rd_FIFO).
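Steps S10 to S40 can be traced with a small discrete-time simulation. The Python sketch below is a behavioural model under stated assumptions, not the hardware design: task durations are fixed integers chosen for illustration, a token is handed over in the same time step in which an operation finishes, a released token is re-issued to Rd_FIFO only while data groups remain to be read, and the names `simulate`, `UNIT_FIFO` and the log tuple layout are hypothetical. The FIFO depths are those stated above, and the sketch asserts that they are never exceeded.

```python
from collections import deque

BANKS = ["FFT1-RAM0", "FFT2-RAM0", "FFT1-RAM1", "FFT2-RAM1"]   # S10 generation order
DEPTH = {"Rd_FIFO": 4, "PE1_FIFO": 2, "PE2_FIFO": 2, "Wrt_FIFO": 4}
UNIT_FIFO = {"read": "Rd_FIFO", "FFT-PE1": "PE1_FIFO",
             "FFT-PE2": "PE2_FIFO", "write": "Wrt_FIFO"}

def simulate(num_groups, rd_time=2, fft_time=3, wr_time=2):
    """Return (finish_time, log); log entries are (unit, bank, group, start, end)."""
    duration = {"read": rd_time, "FFT-PE1": fft_time,
                "FFT-PE2": fft_time, "write": wr_time}
    fifo = {name: deque() for name in DEPTH}
    busy = {unit: None for unit in UNIT_FIFO}      # unit -> (bank, group, start, end)
    log, issued, written, t = [], 0, 0, 0

    def push(name, token):
        fifo[name].append(token)
        assert len(fifo[name]) <= DEPTH[name], "token FIFO overflow"

    for bank in BANKS[:min(4, num_groups)]:        # S10: initial tokens
        push("Rd_FIFO", (bank, issued))
        issued += 1

    while True:
        for unit in UNIT_FIFO:                     # completions: pass the token on
            job = busy[unit]
            if job is None or job[3] != t:
                continue
            bank, group = job[0], job[1]
            log.append((unit,) + job)
            busy[unit] = None
            if unit == "read":                     # S20
                push("PE1_FIFO" if bank.startswith("FFT1") else "PE2_FIFO", (bank, group))
            elif unit == "write":                  # S40
                written += 1
                if issued < num_groups:            # reuse the bank for the next group
                    push("Rd_FIFO", (bank, issued))
                    issued += 1
            else:                                  # S30
                push("Wrt_FIFO", (bank, group))
        if written == num_groups:
            return t, log
        for unit, name in UNIT_FIFO.items():       # idle unit + non-empty FIFO: start
            if busy[unit] is None and fifo[name]:
                bank, group = fifo[name].popleft()
                busy[unit] = (bank, group, t, t + duration[unit])
        t += 1
```

Calling `simulate(6)` and grouping the log by bank reproduces the kind of space-time picture described for Fig. 2: on every bank the operations appear as non-overlapping read, FFT and write intervals in that order.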

From the above execution flow it can be seen that the tokens of the four data memories are started, buffered and passed among the four functional execution units by means of the FIFO mechanism. The control state machine for token-based task scheduling is shown in Fig. 4. This state machine mainly controls token generation and the source of token circulation, i.e. the token control of the read unit. In Fig. 4, RdNum_sFFT denotes the number of data groups read by DMA, WrtNum_sFFT denotes the number of data groups written by DMA, and Num_sFFT denotes the number of data groups on which the FFT must be computed; Num_sFFT is configured over the external command bus. RdNum_sFFT is incremented each time a token is pushed into Rd_FIFO for the DMA read module. WrtNum_sFFT is incremented when the DMA write module signals completion.

After initialization, the FFT control state machine is in the idle state (Idle) and the four token FIFOs (Rd_FIFO, PE1_FIFO, PE2_FIFO and Wrt_FIFO) are empty. On receiving the start command from the command bus, the state machine jumps to the token generation state (Initial_RdToken_sFFT).

In the token generation state, tokens are generated according to the number of FFT data groups and pushed into Rd_FIFO. When Num_sFFT > 4, a total of 4 tokens is generated; otherwise Num_sFFT tokens are generated. In Fig. 4, Case1-1 indicates that not enough tokens have been generated yet and generation must continue. Case1-2 indicates that the number of tokens to be generated is Num_sFFT, that it does not exceed 4 (i.e. Num_sFFT ≤ 4), and that Num_sFFT tokens have been generated. This means that each of the four data memories is used at most once during this batch FFT computation; all read tokens are already in Rd_FIFO and no data memory needs to be reused through a token released on DMA write completion, so the FFT control state machine jumps to the wait-for-write-completion state (Wait_Wrt_Finish). Case1-3 indicates that Num_sFFT > 4 and that the tokens of all four data memories required by this computation have been generated; the state machine must then enter the state that waits for a token released on DMA write completion (Wait_WrtToken) so that data memories can be reused.

When a token is released on DMA write completion, the FFT control state machine jumps to the push-read-token state (Startup_RdToken), in which the token is pushed into Rd_FIFO and RdNum_sFFT is used to determine whether the read tokens of all data groups have been pushed. If not all have been pushed (Case2-1: RdNum_sFFT < Num_sFFT), the state machine returns to Wait_WrtToken and continues to wait for tokens released on DMA write completion; otherwise (Case2-2: RdNum_sFFT = Num_sFFT) it jumps to the wait-for-write-completion state (Wait_Wrt_Finish).

In the Wait_Wrt_Finish state, the number of result groups written is checked. If all FFT results have been written back (Case3-1: WrtNum_sFFT = Num_sFFT), the FFT control state machine jumps to the state that empties all token FIFOs (Empty_Token_FIFO); otherwise it continues to wait in Wait_Wrt_Finish. In the Empty_Token_FIFO state, the FFT control state machine clears the contents of all token FIFOs, so that the state machine starts from its initial state the next time a batch FFT computation is launched. In the completion state (Finish_State_FFT), the FFT control state machine issues an FFT-computation-complete interrupt to the system.
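The transitions described above can be summarized as a next-state function. The Python sketch below follows the state and counter names in the text; the function signature, the boolean inputs `start` and `token_released`, and the transitions from Empty_Token_FIFO to Finish_State_FFT and from Finish_State_FFT back to Idle are assumptions, since the text does not spell them out. The counter `rd_num` is taken to already include the token pushed in the current state.

```python
def next_state(state, num, rd_num, wrt_num, start=False, token_released=False):
    """Return the next state of the token control state machine.

    num      -- Num_sFFT, data groups to process (configured externally)
    rd_num   -- RdNum_sFFT, read tokens pushed into Rd_FIFO so far
    wrt_num  -- WrtNum_sFFT, data groups whose DMA write has completed
    """
    if state == "Idle":
        return "Initial_RdToken_sFFT" if start else "Idle"
    if state == "Initial_RdToken_sFFT":
        if rd_num < min(num, 4):
            return "Initial_RdToken_sFFT"      # Case1-1: keep generating tokens
        if num <= 4:
            return "Wait_Wrt_Finish"           # Case1-2: no bank is reused
        return "Wait_WrtToken"                 # Case1-3: banks must be reused
    if state == "Wait_WrtToken":
        return "Startup_RdToken" if token_released else "Wait_WrtToken"
    if state == "Startup_RdToken":
        if rd_num < num:
            return "Wait_WrtToken"             # Case2-1: more groups to read
        return "Wait_Wrt_Finish"               # Case2-2: all read tokens pushed
    if state == "Wait_Wrt_Finish":
        if wrt_num == num:
            return "Empty_Token_FIFO"          # Case3-1: all results written back
        return "Wait_Wrt_Finish"
    if state == "Empty_Token_FIFO":
        return "Finish_State_FFT"              # assumed: proceed to completion
    if state == "Finish_State_FFT":
        return "Idle"                          # assumed: return to idle after interrupt
    raise ValueError(f"unknown state: {state}")
```

For a batch of six groups, for instance, the machine leaves Initial_RdToken_sFFT through Case1-3 once four tokens exist, then alternates between Wait_WrtToken and Startup_RdToken twice before settling in Wait_Wrt_Finish.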

The above is only a preferred embodiment of the present invention. The scope of protection of the present invention is not limited to the above embodiment; all technical solutions that fall within the concept of the present invention belong to its scope of protection. It should be noted that, for those of ordinary skill in the art, improvements and modifications made without departing from the principles of the present invention shall also be regarded as falling within the scope of protection of the present invention.

Claims

1. An FFT accelerator device based on a token task scheduling strategy, characterized by comprising:

an FFT accelerator control module, whose control logic controls batch 1-D FFT operations, sends read/write control parameters to the bus controller, and coordinates computation and data transfer between the FFT-PEs;

a bus controller, which generates the control signals for reading and writing DDR memory or on-chip SMC memory according to the parameters from the FFT accelerator control module;

an FFT computing array comprising two FFT-PEs of single-memory structure, namely FFT_PE1 and FFT_PE2, for computing batch 1-D FFTs; each FFT-PE contains two groups of data memories, which implement "ping-pong" operation between reading initial data or writing results on the one hand and FFT computation on the other; the two FFT-PEs take turns receiving data from memory, performing the FFT, and writing the results back to memory;

a data path and command path asynchronous processing unit, which converts the TeraNet data master port protocol into the internal DMA bus protocol and the TeraNet command master port protocol into the internal Pbus bus protocol, and also performs the asynchronous crossing between the system clock domain and the FFT clock domain;

tokens, defined as the access rights to the four memory banks in the two FFT-PEs, the four memory banks being FFT1-RAM0, FFT1-RAM1, FFT2-RAM0 and FFT2-RAM1; the read unit, the write unit, the FFT-PE1 execution unit and the FFT-PE2 execution unit act as four functional components that operate on the four memory banks according to the tokens;

wherein, when memory banks are operated on according to tokens, the following token passing scheduling strategy is used:

(a) the read unit and the write unit can access all four groups of memories, the FFT-PE1 execution unit can operate only on FFT1-RAM0 and FFT1-RAM1, and the FFT-PE2 execution unit can operate only on FFT2-RAM0 and FFT2-RAM1;

(b) the flow order of the tokens is fixed; for each memory bank, the source data is read first, the FFT is then computed, and the result is finally written to the specified storage location by the write unit; that is, the token flow order for FFT1-RAM0 and FFT1-RAM1 is read unit → FFT-PE1 → write unit → read unit, and the token flow order for FFT2-RAM0 and FFT2-RAM1 is read unit → FFT-PE2 → write unit → read unit;

(c) token exclusivity: the token of each memory bank can be held by only one functional component, which operates on that bank;

(d) functional unit exclusivity: each functional unit can hold only one token at a time and operate on one memory bank; as soon as the operation completes, the token is passed to the next execution unit.

2. The FFT accelerator device based on a token task scheduling strategy according to claim 1, characterized in that the passing of tokens among the functional components must obey the following rules:

Rule 1: the DMA read control module and the DMA write control module in the bus controller can access all four groups of data memories inside the two FFT-PEs;

Rule 2: each FFT-PE can access only its own two groups of data memories, i.e. FFT-PE1 accesses data memories FFT1-RAM0 and FFT1-RAM1, and FFT-PE2 accesses data memories FFT2-RAM0 and FFT2-RAM1;

Rule 3: at any time, each group of data memory may be read or written by only one of the DMA read control module, the DMA write control module and the corresponding FFT-PE; operations on a data memory must not overlap;

Rule 4: for each group of data memory, DMA read, FFT computation and DMA write are executed strictly in sequence, and the read, compute and write operations of any two groups of data processed in the same data memory must not interleave;

Rule 5: at any time, each execution unit may operate on only one of the four data memories; the data memory operations of an execution unit must not overlap.

3. The FFT accelerator device based on a token task scheduling strategy according to claim 2, characterized in that four token FIFOs are further provided, namely PE1_FIFO, PE2_FIFO, Wrt_FIFO and Rd_FIFO, where PE1_FIFO and PE2_FIFO have a depth of 2 and Wrt_FIFO and Rd_FIFO have a depth of 4.

4. The FFT accelerator device based on a token task scheduling strategy according to any one of claims 1 to 3, characterized in that the FFT-PE comprises an FFT computation control state machine, a "ping-pong" multi-bank dual-port RAM, a multi-bank selection factor ROM, a CORDIC twiddle factor generation module and a configurable butterfly computation unit.