CN105373367A - Vector single instruction multiple data-stream (SIMD) operation structure supporting synergistic working of scalar and vector - Google Patents

Vector single instruction multiple data-stream (SIMD) operation structure supporting synergistic working of scalar and vector

Info

Publication number
CN105373367A
CN105373367A CN201510718729.7A CN201510718729A
Authority
CN
China
Prior art keywords
vector
scalar
processing unit
vpe
instruction
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201510718729.7A
Other languages
Chinese (zh)
Other versions
CN105373367B (en)
Inventor
陈书明
彭元喜
雷元武
万江华
郭阳
田甜
彭浩
徐恩
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
National University of Defense Technology
Original Assignee
National University of Defense Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by National University of Defense Technology
Priority to CN201510718729.7A
Publication of CN105373367A
Application granted
Publication of CN105373367B
Legal status: Active
Anticipated expiration

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00Arrangements for program control, e.g. control units
    • G06F9/06Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/30Arrangements for executing machine instructions, e.g. instruction decode
    • G06F9/38Concurrent instruction execution, e.g. pipeline or look ahead
    • G06F9/3885Concurrent instruction execution, e.g. pipeline or look ahead using a plurality of independent parallel functional units
    • G06F9/3887Concurrent instruction execution, e.g. pipeline or look ahead using a plurality of independent parallel functional units controlled by a single instruction for multiple data lanes [SIMD]

Landscapes

  • Engineering & Computer Science (AREA)
  • Software Systems (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Complex Calculations (AREA)
  • Advance Control (AREA)

Abstract

The invention discloses a vector SIMD operation structure supporting scalar-vector cooperative work, which comprises: a unified instruction fetch and dispatch unit, for dispatching instructions simultaneously to a scalar processing unit SPU, a vector processing unit VPU, and a vector array memory AM; the scalar processing unit SPU, responsible for processing serial tasks and for controlling the execution of the vector processing unit VPU; the vector processing unit VPU, responsible for computation-intensive parallel tasks; the vector array memory AM, which provides data and data-movement support for parallel, multi-width vector operations; and a DMA unit, which supplies instructions and data to the scalar processing unit SPU and the vector processing unit VPU. The invention improves overall execution efficiency and parallelism.

Description

Vector SIMD operation structure supporting scalar-vector cooperative work
Technical field
The present invention relates generally to the field of microprocessor architecture and design, and in particular to a vector SIMD operation structure supporting scalar-vector cooperative work.
Background technology
Digital signal processors (DSPs) are widely used in embedded systems as a typical class of embedded microprocessor. Their strong data-processing capability, good programmability, flexibility of use, and low power consumption have brought great opportunities to the development of signal processing, and their applications extend across military and economic fields. In applications such as modern communications, image processing, and radar signal processing, growing data volumes and rising requirements on computational precision and real-time performance usually call for higher-performance microprocessors.
Unlike a traditional CPU, a DSP has the following characteristics: (1) strong computing capability, emphasizing real-time computation over control and transaction processing; (2) dedicated hardware support for typical signal-processing operations, such as multiply-accumulate and linear addressing; (3) the common features of embedded microprocessors: address and instruction paths of no more than 32 bits, and most data paths of no more than 32 bits; imprecise interrupts; a working mode of short-term offline debugging followed by long-term online resident operation (rather than the debug-then-run method of a general-purpose CPU); (4) fast integrated peripheral interfaces oriented toward off-chip devices, which are especially convenient for online transmission and reception of high-speed AD/DA data and which also support high-speed direct links between DSPs.
General scientific computing requires high-performance DSPs, but a traditional DSP used for scientific computing has the following shortcomings: (1) its bit width is small, so computational precision and addressing space are insufficient; general scientific computing applications require at least 64-bit precision; (2) it lacks software and hardware support for task management, file control, process scheduling, and interrupt management; in other words, it lacks an operating-system hardware environment, which complicates the management of general-purpose, multi-job computing tasks; (3) it lacks support for a unified high-level-language programming model, so support for multiple cores, vectors, and data parallelism essentially relies on assembly programming, which is inconvenient for general-purpose programming; (4) it does not support a local-host program debugging model, relying only on cross-debugging and emulation from another machine. These problems severely limit the application of DSPs in the field of general scientific computing.
Practitioners have proposed a "general-purpose computing digital signal processor" (GPDSP): a new multi-core microprocessor architecture that retains the essential embedded characteristics, high performance, and low power consumption of a DSP while efficiently supporting general scientific computing. This architecture overcomes the above problems of ordinary DSPs in scientific computing and can simultaneously provide efficient support for 64-bit high-performance computers and for embedded high-precision signal processing. It has the following features: (1) direct representation of double-precision floating-point and 64-bit fixed-point data, with general-purpose registers, data buses, and instruction widths of 64 bits or more, and address buses of 40 bits or more; (2) tightly coupled heterogeneous multi-core integration of CPU and DSP, where the CPU core supports a complete operating system and the scalar unit of the DSP core supports an operating-system microkernel; (3) a unified programming model across the CPU cores, the DSP cores, and the vector array structure within each DSP core; (4) cross-machine debugging is retained while a local CPU host debugging mode is also provided; (5) the essential characteristics of an ordinary DSP, apart from the bit width, are retained.
Other practitioners have proposed a "data shuffling unit with a switch-matrix memory", which discloses a data shuffling unit implementation and a data shuffling method: shuffle requests in a program are converted into switch matrices in a switch-matrix memory, thereby implementing data selection and recombination. This shuffling unit has the advantages of a simple, flexible, and efficient structure and of supporting shuffles between arbitrary nodes.
A GPDSP usually builds a processing array out of multiple homogeneous 64-bit processing elements to obtain high floating-point performance. However, when a GPDSP uses numerous processing elements to exploit the parallelism of general scientific computing, several problems remain: (1) how to organize the numerous homogeneous processing elements so that they efficiently exploit the many levels of parallelism in general scientific computing; (2) how to coordinate effectively between the scalar unit used for control and the vector units used for computation; (3) how to support the matrix-class operations in general scientific computing, exploiting their extensive data reuse to keep the numerous homogeneous processing elements supplied with operands and thereby improve the computing efficiency of the GPDSP.
Summary of the invention
The technical problem to be solved by the present invention is as follows: in view of the technical problems of the prior art, the invention provides a vector SIMD operation structure supporting scalar-vector cooperative work that improves execution efficiency and parallelism.
To solve the above technical problems, the present invention adopts the following technical solution:
A vector SIMD operation structure supporting scalar-vector cooperative work, comprising:
a unified instruction fetch and dispatch unit, for dispatching instructions simultaneously to the scalar processing unit SPU, the vector processing unit VPU, and the vector array memory AM;
the scalar processing unit SPU, responsible for processing serial tasks and for controlling the execution of the vector processing unit VPU;
the vector processing unit VPU, responsible for computation-intensive parallel tasks;
the vector array memory AM, providing data and data-movement support for parallel, multi-width vector operations; and
the DMA unit, providing instructions and data for the scalar processing unit SPU and the vector processing unit VPU.
As a further improvement of the present invention: the unified instruction fetch and dispatch unit adopts a variable-length N_SI + N_VI issue VLIW instruction structure, fetching and dispatching N_SI scalar instructions and N_VI vector instructions simultaneously; these N_SI + N_VI instructions all support conditional execution, interrupts, and exception handling.
As a further improvement of the present invention: the scalar processing element SPE consists of N_SMAC MAC units and N_SIEU fixed-point execution units IEU; these N_SI pipelines execute in parallel the N_SI scalar instructions of a VLIW instruction packet, performing the serial operations of scientific applications, where N_SI = N_SMAC + N_SIEU.
As a further improvement of the present invention: the vector processing unit VPU consists of N_VPE homogeneous vector processing elements VPE, which perform identical operations on different data under the control of a unified instruction stream, where N_VPE is a power of 2.
As a further improvement of the present invention: each vector processing element VPE comprises N_VMAC MAC units and N_VIEU fixed-point execution units IEU; these N_VI pipelines execute in parallel the N_VI vector instructions of a VLIW instruction packet, performing the parallel operations of scientific applications, where N_VI = N_VMAC + N_VIEU.
As a further improvement of the present invention: data interaction between the vector processing elements VPE is accomplished through a reduction network and a shuffle network.
As a further improvement of the present invention: a 64-bit configuration path is provided between the scalar processing unit SPU and each of the vector processing unit VPU and the vector array memory AM, so that MOV instructions can access the global control configuration registers in the vector processing unit VPU and the vector array memory AM.
As a further improvement of the present invention: between the scalar processing unit SPU and the vector processing unit VPU there are also two data-broadcast paths from the SPU to the VPU, supporting single-word broadcast instructions and double-word broadcast instructions respectively;
the single-word broadcast instruction broadcasts a single word from the SPU register file to the same position in the vector registers of the N_VPE VPEs; its execution performs one write operation to the register file of each of the N_VPE VPEs, completing a transfer of 64*N_VPE bits of data;
the double-word broadcast instruction broadcasts a pair of data Src_o:Src_e from the SPU register file to Dst_o:Dst_e in the register files of the N_VPE VPEs, where a register pair is named by its even member, i.e. VR0 denotes VR1:VR0; its execution performs one write operation to the register file of each of the N_VPE VPEs, completing a transfer of 128*N_VPE bits of data;
executing double-word broadcast operations on the two scalar-vector broadcast paths in parallel achieves a transfer of 256*N_VPE bits of data.
Compared with prior art, the invention has the advantages that:
1. The present invention is a tightly coupled vector SIMD (Single Instruction Multiple Data-stream) operation structure with scalar-vector cooperative work, suited to the multi-core microprocessor GPDSP. It adopts a variable-length multi-issue VLIW (Very Long Instruction Word) instruction structure that fetches and dispatches N_SI scalar instructions and N_VI vector instructions simultaneously, so that the scalar processing element SPE and the vector processing elements VPE execute the parallel instructions of a VLIW packet at the same time. The vector unit of this structure comprises N_VPE homogeneous vector processing elements VPE (N_VPE a power of 2), which execute identical instructions on different data. Data interaction between the SPE and the VPEs is accomplished by register-level data sharing and a fast scalar-to-vector broadcast mechanism, and data interaction between VPEs is accomplished by a reduction network and a shuffle network; these data-interaction mechanisms efficiently support matrix-class and signal-processing-class applications.
2. In the vector SIMD operation structure supporting scalar-vector cooperative work of the present invention, the scalar processing unit and the vector processing unit are organized in a tightly coupled manner, with the variable-length multi-issue VLIW instructions realized by the unified fetch and dispatch unit. This organization exploits the execution efficiency of the scalar and vector processing units to the greatest extent.
3. In the vector SIMD operation structure supporting scalar-vector cooperative work of the present invention, multiple parallel mechanisms fully exploit the parallelism in applications, including sub-word SIMD within the arithmetic units, multi-pipeline VLIW parallelism within the scalar and vector units, and vector SIMD parallelism across the vector processing elements.
4. In the vector SIMD operation structure supporting scalar-vector cooperative work of the present invention, the multiple data-transfer mechanisms between the scalar and vector units and within the vector unit realize fast data interaction among the arithmetic units, improving the execution efficiency of high-performance computing applications.
5. In the vector SIMD operation structure supporting scalar-vector cooperative work of the present invention, the above structural organization, multiple parallel mechanisms, and multiple data-transfer mechanisms fully exploit the potential parallelism of core algorithms (such as matrix multiplication and FFT), improving the execution efficiency of the GPDSP.
Brief description of the drawings
Fig. 1 is a schematic diagram of the structure of the present invention.
Fig. 2 is a schematic diagram of the scalar processing unit of the present invention.
Fig. 3 is a schematic diagram of the vector processing unit of the present invention.
Fig. 4 is a diagram of matrix-vector multiplication in a concrete application example of the present invention.
Fig. 5 is a schematic diagram of the shuffle-based FFT computation process in a concrete application example of the present invention.
Fig. 6 is a schematic diagram of the arrangement of VLIW instruction slots during the FFT computation in a concrete application example of the present invention.
Detailed description
The present invention is described in further detail below with reference to the drawings and specific embodiments.
The present invention is a high-performance general-purpose digital signal processor (GPDSP) with a variable-length multi-issue very long instruction word structure, intended for high-performance computing and also applicable to wireless communication, video, and image processing. This GPDSP is a multi-core processor of a new architecture suited to 64-bit general scientific computing while retaining the essential characteristics of an embedded DSP. Its advantage is that it both maintains the essential characteristics and the high performance and low power consumption of a DSP and efficiently supports general scientific computing; it overcomes the general problems of ordinary DSPs in scientific computing and can simultaneously provide efficient support for 64-bit high-performance computers and for embedded high-precision signal processing.
Fig. 1 shows the overall structure of a kernel of the vector SIMD operation structure supporting scalar-vector cooperative work according to the present invention. The kernel adopts a Harvard architecture, with instructions and data stored separately. The structure of the present invention comprises a unified instruction fetch and dispatch unit, a scalar processing unit SPU, a vector processing unit VPU, a vector array memory AM, and a DMA unit, wherein:
the unified instruction fetch and dispatch unit dispatches instructions simultaneously to the scalar processing unit SPU, the vector processing unit VPU, and the vector array memory AM;
the scalar processing unit SPU is responsible for processing serial tasks and for controlling the execution of the vector processing unit VPU; its scalar processing element SPE consists of N_SMAC MAC units and N_SIEU fixed-point execution units IEU, whose N_SI (N_SI = N_SMAC + N_SIEU) pipelines execute in parallel the N_SI scalar instructions of a VLIW instruction packet, performing the serial operations of scientific applications;
the vector processing unit VPU is responsible for computation-intensive parallel tasks; the VPU consists of N_VPE (N_VPE a power of 2) homogeneous vector processing elements (VPE: Vector Processing Element), which perform identical operations on different data under the control of a unified instruction stream; each VPE comprises N_VMAC MAC units and N_VIEU fixed-point execution units IEU, whose N_VI (N_VI = N_VMAC + N_VIEU) pipelines execute in parallel the N_VI vector instructions of a VLIW instruction packet, performing the parallel operations of scientific applications;
the vector array memory AM provides data and data-movement support for parallel, multi-width vector operations;
the DMA unit provides instructions and data for the scalar processing unit SPU and the vector processing unit VPU.
In a concrete application example, the kernel adopts a variable-length N_SI + N_VI issue VLIW (Very Long Instruction Word) instruction structure that can fetch and dispatch N_SI scalar instructions and N_VI vector instructions simultaneously; these N_SI + N_VI instructions all support conditional execution, interrupts, and exception handling.
The instruction cache (ICache) is 2-way set-associative, with a read-allocate policy and a least-recently-used (LRU) line-replacement policy. The ICache is 64 KB in size, with a 1-cycle access time on a hit. The ICache obtains instruction packets over the EMI interface at the request of the fetch and dispatch unit.
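To make the lookup concrete, the following is a minimal C sketch of the ICache organization described above. The 128-byte line size (one 1024-bit instruction packet per line) and the byte-addressed index/tag split are our assumptions; the patent specifies only the capacity, associativity, and policies.

```c
#include <stdint.h>
#include <stdbool.h>

#define ICACHE_SIZE (64 * 1024)
#define ICACHE_WAYS 2
#define LINE_BYTES  128  /* assumed: one 1024-bit instruction packet */
#define NUM_SETS    (ICACHE_SIZE / (ICACHE_WAYS * LINE_BYTES))  /* 256 */

typedef struct {
    bool     valid;
    uint64_t tag;
    uint8_t  data[LINE_BYTES];
    uint8_t  lru;                      /* 0 = most recently used */
} icache_line_t;

static icache_line_t icache[NUM_SETS][ICACHE_WAYS];

/* Returns the hit line (a 1-cycle access on a hit), or NULL on a miss,
 * in which case the fetch unit would request the packet over EMI and
 * allocate it into the LRU way of the set (read-allocate policy). */
static icache_line_t *icache_lookup(uint64_t addr)
{
    uint64_t set = (addr / LINE_BYTES) % NUM_SETS;
    uint64_t tag = addr / (LINE_BYTES * NUM_SETS);
    for (int w = 0; w < ICACHE_WAYS; w++) {
        if (icache[set][w].valid && icache[set][w].tag == tag)
            return &icache[set][w];
    }
    return NULL;
}
```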
Within the unified fetch and dispatch unit, the instruction fetch component generates a new fetch address according to the addresses sent from the interrupt handling unit, the exception handling unit, the instruction flow-control unit (branch addresses), and the ET unit. It then controls the fetch pipeline and instruction dispatch according to global control, the no-operation information from the branch unit, and the dispatch information from the DP unit.
Within the unified fetch and dispatch unit, the dispatch component receives instruction packets from the instruction fetch component and the ICache, analyzes the instructions in a packet according to their parallel flag bits and functional-unit type fields, and dispatches each instruction to the corresponding functional unit.
An instruction packet is 1024 bits. The instructions that can execute in parallel within one cycle form an execute packet, and an instruction packet may contain multiple execute packets. The present invention supports two instruction formats, 80-bit and 40-bit, and can fetch and dispatch at most N_SI scalar instructions and N_VI vector instructions simultaneously, where the N_SI scalar instructions comprise 2 scalar load/store instructions and N_SMAC + N_SIEU SPU operation instructions, and the N_VI vector instructions comprise 2 vector load/store instructions and N_VMAC + N_VIEU VPU operation instructions. An execute packet therefore contains at most 11 instructions and at least 1; at most 4 of them are 80-bit instructions, which are placed at the head of the execute packet, immediately followed by the 40-bit instructions.
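As an illustration of how execute packets could be carved out of an instruction packet, here is a hedged C sketch. The patent does not disclose the encoding; the parallel flag that chains an instruction to its successor (in the style of other VLIW DSPs) and the unit-type routing below are illustrative assumptions.

```c
#include <stddef.h>
#include <stdint.h>

enum unit { S_LDST, S_MAC, S_IEU, V_LDST, V_MAC, V_IEU };

typedef struct {
    uint8_t   bits;      /* 80 or 40; 80-bit insns lead the packet */
    uint8_t   parallel;  /* assumed: 1 = executes with the next insn */
    enum unit type;      /* functional-unit type field */
} insn_t;

static void route_to_unit(const insn_t *in)
{
    (void)in;            /* would hand the insn to its SPU/VPU pipeline */
}

/* Dispatch one execute packet (1..11 instructions) starting at pkt[0];
 * returns how many instructions it contained. */
size_t dispatch_execute_packet(const insn_t *pkt, size_t n)
{
    size_t i = 0;
    do {
        route_to_unit(&pkt[i]);
    } while (pkt[i++].parallel && i < n && i < 11);
    return i;
}
```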
Because the instructions of the scalar processing unit SPU and the vector processing unit VPU reside in the same execute packet, the two units complete computing tasks cooperatively in a tightly coupled manner under the control of the unified fetch and dispatch unit.
In a concrete application example, the main functions of the scalar processing unit SPU are scalar data memory access, scalar computation, and pipeline branching, jumps, and interrupt operations. The SPU performs the serial operations of an application and controls the operation of the vector unit; it comprises an instruction flow-control unit (SBR), a scalar processing element (SPE), scalar unit control registers (SUCR), and a scalar memory access unit (SM). The SPE contains N_SIEU fixed-point execution units (IEU) and N_SMAC multiply-accumulate (MAC) units, which can execute simultaneously; the scalar memory access unit can execute two scalar load/store instructions simultaneously, reading two words of data from the data cache.
The present invention provides a 64-bit configuration path between the SPU and each of the VPU and the AM, so that MOV instructions can access the global control configuration registers in the VPU and the AM.
In a concrete application example, as shown in Fig. 3, the vector processing unit (VPU) is a scalable vector processing cluster that mainly handles computation-intensive parallel tasks. It consists of N_VPE homogeneous vector processing elements (VPE: Vector Processing Element), each of which provides N_VMAC vector multiply-accumulate (MAC) units and N_VIEU fixed-point execution units (IEU) to support large-scale parallel MAC computation. The VPU uses vector SIMD parallelism to execute the same VLIW instructions on different data, realizing the vector computations of an application; it can perform N_VPE * (N_VMAC + N_VIEU) vector operations simultaneously.
A) Global shared registers
In the present invention, data interaction between the SPU and the VPU can be realized through the global shared registers (SVR); this interaction mechanism is driven by scalar MOV and vector MOV instructions. The global shared registers consist of N_VPE 64-bit registers; the VPU reads and writes them as a vector, while the SPU reads and writes them one by one as scalars. N_VPE scalar MOV instructions write N_VPE 64-bit data from the scalar register file into the SVR, and then 1 vector MOV instruction writes these N_VPE data to the same position of the register files in the N_VPE VPEs; or, conversely, 1 vector MOV instruction reads the N_VPE data at the same position of the register files in the N_VPE VPEs into the SVR, and then N_VPE scalar MOV instructions write these N_VPE data into the scalar register file.
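The following behavioral C model (not the actual ISA) summarizes the SVR exchange just described, taking N_VPE = 16 as in the later examples: N_VPE scalar MOVs fill the shared registers one by one, after which a single vector MOV writes all values to the same register slot of every VPE.

```c
#include <stdint.h>

#define N_VPE 16

static uint64_t svr[N_VPE];          /* global shared registers (64-bit) */
static uint64_t vpe_reg[N_VPE][64];  /* each VPE's local registers R0..R63 */

/* SPU side: N_VPE scalar MOVs, one 64-bit datum per instruction. */
void spu_fill_svr(const uint64_t src[N_VPE])
{
    for (int i = 0; i < N_VPE; i++)
        svr[i] = src[i];             /* scalar MOV: SPU register -> SVR[i] */
}

/* VPU side: one vector MOV writes the same register number r of all
 * N_VPE VPEs in a single operation. */
void vpu_read_svr(int r)
{
    for (int i = 0; i < N_VPE; i++)
        vpe_reg[i][r] = svr[i];      /* vector MOV: SVR -> VPE[i].R[r] */
}
```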
B) Scalar-vector broadcast
Further, in the scalar-vector cooperative operation structure of the present invention, there are also two fast data-broadcast paths from the scalar unit to the vector unit between the SPU and the VPU, supporting single-word and double-word broadcast instructions respectively. 1) Single-word broadcast instruction: a single word (64 bits) from the SPU register file is broadcast to the same position in the vector registers of the N_VPE VPEs; execution requires one write operation to the register file of each of the N_VPE VPEs and completes a transfer of 64*N_VPE bits of data. 2) Double-word broadcast instruction: a pair of data Src_o:Src_e (a double word: 128 bits) from the SPU register file is broadcast to Dst_o:Dst_e in the register files of the N_VPE VPEs, where a register pair is named by its even member, i.e. VR0 denotes VR1:VR0; execution requires only one write operation to the register file of each of the N_VPE VPEs and completes a transfer of 128*N_VPE bits of data. Executing double-word broadcast operations on the two scalar-vector broadcast paths in parallel achieves a transfer of 256*N_VPE bits of data.
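A behavioral C model of the two broadcast forms makes the bit counts explicit; the array names are ours, not the patent's. One write per VPE register file completes 64*N_VPE bits for a single word and 128*N_VPE bits for a register pair; issuing double-word broadcasts on both paths in parallel reaches 256*N_VPE bits.

```c
#include <stdint.h>

#define N_VPE 16

static uint64_t vpe_reg[N_VPE][64];   /* each VPE's vector registers */

/* Single-word broadcast: one 64-bit SPU register value lands in the
 * same vector-register slot r of all N_VPE VPEs. */
void bcast_word(uint64_t src, int r)
{
    for (int i = 0; i < N_VPE; i++)
        vpe_reg[i][r] = src;          /* 64 * N_VPE bits in total */
}

/* Double-word broadcast: the pair Src_o:Src_e goes to Dst_o:Dst_e,
 * where an even register number d names the pair d+1:d (VR0 => VR1:VR0). */
void bcast_dword(uint64_t src_o, uint64_t src_e, int d /* even */)
{
    for (int i = 0; i < N_VPE; i++) {
        vpe_reg[i][d]     = src_e;    /* even half */
        vpe_reg[i][d + 1] = src_o;    /* odd half  */
    }                                 /* 128 * N_VPE bits in total */
}
```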
C) Multi-width reduction and shuffle network
Further, the present invention provides a multi-width reduction tree and a shuffle network to realize data interaction between the register files of the vector processing elements. The reduction tree can reduce the data of all VPEs to a single scalar result, or perform grouped reductions to obtain multiple scalar results; a multi-width reduction partitions all N_VPE VPEs into groups whose reductions execute in parallel, where only equal-sized groups are supported and the group size is an integral power of 2. The shuffle network rearranges data across all VPEs, thereby realizing inter-VPE data communication and bringing great flexibility to vector data processing; it can shuffle the data between VPEs according to different shuffle granularities and shuffle patterns, while the reduction network reduces the data of multiple VPEs into one or more VPEs in the multi-width reduction manner.
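As a sketch of the multi-width grouped reduction semantics, the following C model reduces the N_VPE lanes in equal power-of-2 groups; the hardware performs each group as a tree in parallel, which the model serializes, and the additive reduction is one example of the operation, not a restriction stated by the patent.

```c
#define N_VPE 16

/* Reduce lane[0..N_VPE-1] in groups of `group` lanes (group a power of 2
 * dividing N_VPE); results land in out[0..N_VPE/group - 1]. */
void grouped_reduce_add(const double lane[N_VPE], int group, double out[])
{
    for (int g = 0; g < N_VPE / group; g++) {
        double acc = 0.0;
        for (int i = 0; i < group; i++)   /* a tree in hardware,   */
            acc += lane[g * group + i];   /* serial in this model  */
        out[g] = acc;
    }
}
/* group == N_VPE yields one scalar result; smaller groups yield several. */
```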
D) Vector data access unit
Further, the present invention supports two vector load/store instructions in parallel, providing high memory bandwidth for the N_VPE vector processing elements: the data bandwidth to the VPU is 2*N_VPE*8 B/cycle (with N_VPE = 16, 256 bytes per cycle). The vector data memory adopts a single-port multi-bank organization that supports parallel access; it provides N_VPE dedicated base registers and address offset registers, supporting two indirect addressing modes, linear addressing and circular addressing.
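A short C sketch of the two indirect addressing modes, under the assumptions that the base and offset registers hold byte addresses, that linear addressing post-increments the base by the offset, and that circular buffers are power-of-2 sized so the wrap reduces to a mask:

```c
#include <stdint.h>

/* Linear addressing: the base advances by the offset after each access. */
uint64_t addr_linear(uint64_t *base, uint64_t offset)
{
    uint64_t a = *base;
    *base += offset;
    return a;
}

/* Circular addressing: the address wraps inside a buffer of buf_size
 * bytes starting at buf_start (buf_size assumed a power of 2). */
uint64_t addr_circular(uint64_t *base, uint64_t offset,
                       uint64_t buf_start, uint64_t buf_size)
{
    uint64_t a = *base;
    *base = buf_start + ((a - buf_start + offset) & (buf_size - 1));
    return a;
}
```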
In this embodiment, the VPU is a scalable vector processing cluster consisting of N_VPE homogeneous vector processing elements VPE, which use vector SIMD parallelism to improve performance. The VPU receives the vector operation instructions issued by the dispatch unit, decodes them, and sends them to the corresponding functional units for execution. Each VPE contains 64 local 64-bit general-purpose registers R0-R63.
As shown in Figs. 2 and 3, each VPE integrates N_VMAC multiply-accumulate units (MAC) and N_VIEU fixed-point execution units (IEU) to support basic vector computation. A VPE improves performance through the parallel VLIW execution of these N_VMAC + N_VIEU functional units, each of which executes one vector instruction of a VLIW instruction packet; that is, a VPE contains N_VMAC + N_VIEU pipelines that can execute in parallel. Every execution pipeline in the SPU and the VPEs uses sub-word SIMD to implement 64-bit arithmetic and 32-bit SIMD arithmetic, further improving the performance of 32-bit operations (such as single-precision floating point).
Each MAC unit is composed of three sub-units: a fixed-point MAC, a floating-point MAC, and a floating-point ALU short-instruction unit, where the floating-point MAC and the fixed-point MAC share a 64x64 fixed-point multiplier. The three are independent units sharing the same data path; they cannot start execution or write back in the same cycle, but they can be scheduled in parallel by software pipelining.
Each IEU is composed of a bit-processing unit (BP) and a fixed-point arithmetic logic unit (ALU). The two are independent units sharing the same data path; they cannot start execution or write back in the same cycle, but they can be scheduled in parallel by software pipelining.
A broadcast mechanism exists between the scalar execution unit SPU and the vector execution unit VPU of the GPDSP of the present invention, supporting at most two double-word broadcast operations at a time and accelerating the filling of vector data. Data are broadcast from the scalar unit to the vector unit; the implementation requires only one write operation to the vector register files (VRF) to complete a transfer of 128*N_VPE or 256*N_VPE bits of data.
Transferring data from the SPE into the N_VPE VPEs with the scalar-vector broadcast of the present invention takes only 4 beats, and the transfer is fully pipelined. Completing the same process through the SVR takes 20 beats, because SPE-VPE data interaction through the SVR is serial. By this calculation, the scalar-vector broadcast fills data about five times as fast as the SVR, greatly improving data-filling speed; at the same time, using the scalar-vector broadcast for data reuse reduces memory bandwidth demand and improves overall performance.
Many scientific and engineering applications involve matrix-class operations, which have good data parallelism; in the present invention such operations can exploit the instruction-level parallelism developed through SIMD and VLIW. Below, taking N_VPE = 16 and N_VMAC = 3 as an example, the support of the parallel structure of the present invention for matrix multiplication and FFT is described.
While improving computational performance through large-scale parallel functional units, such a design also places great pressure on memory bandwidth. Exploiting the good data reuse of matrix-class operations, the scalar-vector broadcast operation of the present invention completes a transfer of 2048 or 4096 bits with a single write operation. This effectively exploits the data reuse in applications, reduces memory bandwidth demand, raises the utilization of the vector computing units, greatly improves the efficiency of matrix multiplication, reduces resource occupation, and improves overall performance.
As shown in Fig. 4, the effect of the scalar-vector broadcast operation on computational performance and storage demand is illustrated with the most basic matrix-class operation, matrix-vector multiplication y = A x, where A is an n*m matrix, x is a vector of length m, and y is a vector of length n. On the GPDSP operation structure of the present invention, the matrix A is stored in the vector memory AM and x is stored in the scalar data memory SM; the 16 VPEs compute in rounds in SIMD fashion, with VPE[i] computing the i-th element of the j-th group of 16 result elements in round j, where 1 <= j <= t and t is the number of rounds. As can be seen from Fig. 4, every element of the result vector y reuses the vector x. On the GPDSP of the present invention, the vector x only needs to be read once; in every round, scalar-vector broadcast operations send the elements of x one by one into the vector registers of the 16 VPEs while the data of the 16 corresponding rows of A are read from AM, and the 48 MAC units of the 16 VPEs execute in parallel in a pipelined manner.
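The round-based mapping of Fig. 4 can be modelled in C as follows. The row-to-VPE assignment (row 16*j + i handled by VPE[i] in round j) and the row-major layout of A are our reading of the figure rather than statements in the text; the inner broadcast of x[k] corresponds to the scalar-vector broadcast, and the lane loop to the 16 VPEs executing in SIMD.

```c
#define N_VPE 16

/* y = A * x, with A n-by-m (n a multiple of 16), row-major in AM. */
void matvec(int n, int m, const double *A, const double *x, double *y)
{
    for (int j = 0; j < n / N_VPE; j++) {        /* rounds              */
        double acc[N_VPE] = {0};                 /* one accumulator/VPE */
        for (int k = 0; k < m; k++) {
            double xk = x[k];                    /* scalar-vector bcast */
            for (int i = 0; i < N_VPE; i++)      /* 16 VPEs in SIMD     */
                acc[i] += A[(j * N_VPE + i) * m + k] * xk;
        }
        for (int i = 0; i < N_VPE; i++)
            y[j * N_VPE + i] = acc[i];
    }
}
```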
The present invention is significant for the very wide range of applications that use matrix-class operations. Matrix-class operations such as matrix multiplication appear in numerous scientific computing tasks, and the scalar-vector broadcast is more efficient than the traditional exchange through the shared registers SVR: it completes a transfer of 2048 or 4096 bits of data with a single write operation. This performance advantage rests on the support of the GPDSP operation structure of the present invention: the scalar processing unit SPU and the vector processing unit VPU with its 16 homogeneous vector processing elements VPE together realize the scalar-vector broadcast, which markedly improves the performance of matrix multiplication and has broad application prospects.
The GPDSP operation structure of the present invention applies equally efficiently to the field of signal processing, illustrated here with its most fundamental algorithm, the double-precision floating-point FFT. Because FFT computation accesses data at varying strides, the shuffle-network-based vector SIMD operation structure of the present invention realizes fast data interaction between VPEs and thereby satisfies the data-access requirements of the different strides.
As shown in Figs. 5 and 6, the Cooley-Tukey algorithm decomposes an FFT of arbitrary size into multiple small FFTs of at most 128 points each. For a 128-point FFT, the input data, twiddle factors, and results all fit in the register files of the vector processing elements; each VPE stores the data of 8 points, each point being a double-precision complex number. As shown in Fig. 5, the data are stored across the VPEs in sequence, and the radix-2 FFT of 128 points is divided into 7 stages of butterfly computations. In stages 1, 2, and 3, each VPE operates on the data in its own register file and stores the results back into its own register file. After stage 3, data must be exchanged between the VPEs, which this patent accomplishes with 7 pipelined shuffle instructions; stages 4, 5, and 6 then operate on the shuffled data. After stage 6, one shuffle instruction completes the remaining inter-VPE data interaction, and stage 7 is then executed. Within each stage, each VPE performs 4 butterfly operations, as shown in Fig. 6(A); each butterfly consists of 4 double-precision floating-point multiplications and 6 double-precision floating-point additions/subtractions, as shown in Fig. 6(B). Each VPE therefore performs 16 floating-point multiplications and 24 floating-point additions per stage (40 floating-point operations in total), distributed over the 3 MAC instruction slots of the present invention as shown in Fig. 6(C). From this analysis, a 128-point FFT takes 106 clock cycles (14*7 + 8: 14 cycles for each of the 7 stages plus 8 shuffle instructions) on the VPU structure of the present invention.
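For reference, a C sketch of one radix-2 butterfly shows where the operation counts of Fig. 6(B) come from: the complex multiply b*w costs 4 multiplications and 2 additions/subtractions, and the subsequent add/subtract pair costs 4 more, giving 4 multiplies and 6 add/subtracts per butterfly.

```c
typedef struct { double re, im; } cplx;

/* One radix-2 butterfly: (a, b) -> (a + b*w, a - b*w). */
static void butterfly(cplx *a, cplx *b, cplx w)
{
    cplx t;
    t.re = b->re * w.re - b->im * w.im;   /* 2 mul, 1 sub */
    t.im = b->re * w.im + b->im * w.re;   /* 2 mul, 1 add */
    b->re = a->re - t.re;                 /* 1 sub */
    b->im = a->im - t.im;                 /* 1 sub */
    a->re = a->re + t.re;                 /* 1 add */
    a->im = a->im + t.im;                 /* 1 add */
}
/* 4 butterflies per VPE per stage => 16 multiplies + 24 add/subs, the
 * 40 flops distributed over the 3 MAC slots as in Fig. 6(C). */
```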
The above are only preferred embodiments of the present invention, and the scope of protection of the present invention is not limited to the above embodiments; all technical solutions falling under the concept of the present invention belong to its scope of protection. It should be pointed out that, for those skilled in the art, improvements and modifications that do not depart from the principles of the present invention shall also be regarded as falling within the scope of protection of the present invention.

Claims (8)

1. A vector SIMD operation structure supporting scalar-vector cooperative work, characterized by comprising:
a unified instruction fetch and dispatch unit, for dispatching instructions simultaneously to the scalar processing unit SPU, the vector processing unit VPU, and the vector array memory AM;
the scalar processing unit SPU, responsible for processing serial tasks and for controlling the execution of the vector processing unit VPU;
the vector processing unit VPU, responsible for computation-intensive parallel tasks;
the vector array memory AM, providing data and data-movement support for parallel, multi-width vector operations; and
the DMA unit, providing instructions and data for the scalar processing unit SPU and the vector processing unit VPU.

2. The vector SIMD operation structure supporting scalar-vector cooperative work according to claim 1, characterized in that the unified instruction fetch and dispatch unit adopts a variable-length N_SI + N_VI issue VLIW instruction structure, fetching and dispatching N_SI scalar instructions and N_VI vector instructions simultaneously, these N_SI + N_VI instructions all supporting conditional execution, interrupts, and exception handling.

3. The vector SIMD operation structure supporting scalar-vector cooperative work according to claim 1, characterized in that the scalar processing element SPE consists of N_SMAC MAC units and N_SIEU fixed-point execution units IEU, these N_SI pipelines executing in parallel the N_SI scalar instructions of a VLIW instruction packet and performing the serial operations of scientific applications, where N_SI = N_SMAC + N_SIEU.

4. The vector SIMD operation structure supporting scalar-vector cooperative work according to claim 1, characterized in that the vector processing unit VPU consists of N_VPE homogeneous vector processing elements VPE that perform identical operations on different data under the control of a unified instruction stream, where N_VPE is a power of 2.

5. The vector SIMD operation structure supporting scalar-vector cooperative work according to claim 4, characterized in that each vector processing element VPE comprises N_VMAC MAC units and N_VIEU fixed-point execution units IEU, these N_VI pipelines executing in parallel the N_VI vector instructions of a VLIW instruction packet and performing the parallel operations of scientific applications, where N_VI = N_VMAC + N_VIEU.

6. The vector SIMD operation structure supporting scalar-vector cooperative work according to claim 5, characterized in that data interaction between the vector processing elements VPE is accomplished through a reduction network and a shuffle network.

7. The vector SIMD operation structure supporting scalar-vector cooperative work according to claim 1, characterized in that a 64-bit configuration path is provided between the scalar processing unit SPU and each of the vector processing unit VPU and the vector array memory AM, so that MOV instructions can access the global control configuration registers in the vector processing unit VPU and the vector array memory AM.

8. The vector SIMD operation structure supporting scalar-vector cooperative work according to claim 1, characterized in that between the scalar processing unit SPU and the vector processing unit VPU there are also two data-broadcast paths from the SPU to the VPU, supporting single-word broadcast instructions and double-word broadcast instructions respectively;
the single-word broadcast instruction broadcasts a single word from the SPU register file to the same position in the vector registers of the N_VPE VPEs, its execution performing one write operation to the register file of each of the N_VPE VPEs and completing a transfer of 64*N_VPE bits of data;
the double-word broadcast instruction broadcasts a pair of data Src_o:Src_e from the SPU register file to Dst_o:Dst_e in the register files of the N_VPE VPEs, where a register pair is named by its even member, i.e. VR0 denotes VR1:VR0, its execution performing one write operation to the register file of each of the N_VPE VPEs and completing a transfer of 128*N_VPE bits of data;
executing double-word broadcast operations on the two scalar-vector broadcast paths in parallel achieves a transfer of 256*N_VPE bits of data.
CN201510718729.7A 2015-10-29 2015-10-29 Vector SIMD operation structure supporting scalar-vector cooperative work Active CN105373367B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201510718729.7A CN105373367B (en) 2015-10-29 2015-10-29 Vector SIMD operation structure supporting scalar-vector cooperative work

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201510718729.7A CN105373367B (en) 2015-10-29 2015-10-29 Vector SIMD operation structure supporting scalar-vector cooperative work

Publications (2)

Publication Number Publication Date
CN105373367A true CN105373367A (en) 2016-03-02
CN105373367B CN105373367B (en) 2018-03-02

Family

ID=55375596

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201510718729.7A Active CN105373367B (en) 2015-10-29 2015-10-29 Vector SIMD operation structure supporting scalar-vector cooperative work

Country Status (1)

Country Link
CN (1) CN105373367B (en)

Citations (5)

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101986264A (en) * 2010-11-25 2011-03-16 中国人民解放军国防科学技术大学 Multifunctional floating-point multiply and add calculation device for single instruction multiple data (SIMD) vector microprocessor
CN102012893A (en) * 2010-11-25 2011-04-13 中国人民解放军国防科学技术大学 Extensible vector operation cluster
CN102279818A (en) * 2011-07-28 2011-12-14 中国人民解放军国防科学技术大学 Vector data access and storage control method supporting limited sharing and vector memory
CN103440121A (en) * 2013-08-20 2013-12-11 中国人民解放军国防科学技术大学 Triangular matrix multiplication vectorization method of vector processor
CN104636315A (en) * 2015-02-06 2015-05-20 中国人民解放军国防科学技术大学 GPDSP-oriented matrix LU decomposition vectorization calculation method

Cited By (11)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111651201A (en) * 2016-04-26 2020-09-11 中科寒武纪科技股份有限公司 Device and method for executing vector merging operation
CN111651201B (en) * 2016-04-26 2023-06-13 中科寒武纪科技股份有限公司 Apparatus and method for performing vector merge operation
CN109661647A (en) * 2016-09-13 2019-04-19 Arm有限公司 The multiply-add instruction of vector
CN109661647B (en) * 2016-09-13 2023-03-03 Arm有限公司 Data processing apparatus and method
CN111352894A (en) * 2018-12-20 2020-06-30 深圳市中兴微电子技术有限公司 A single-instruction multi-core system, instruction processing method and storage medium
CN112328958A (en) * 2020-11-10 2021-02-05 河海大学 An optimized data rearrangement method based on radix-64 two-dimensional FFT architecture
CN114626540A (en) * 2020-12-11 2022-06-14 上海阵量智能科技有限公司 Processor and related product
WO2022121275A1 (en) * 2020-12-11 2022-06-16 上海阵量智能科技有限公司 Processor, multithread processing method, electronic device, and storage medium
CN115826910A (en) * 2023-02-07 2023-03-21 成都申威科技有限责任公司 Vector fixed point ALU processing system
CN117435259A (en) * 2023-12-20 2024-01-23 芯瞳半导体技术(山东)有限公司 VPU configuration method, device, electronic equipment and computer-readable storage medium
CN117435259B (en) * 2023-12-20 2024-03-22 芯瞳半导体技术(山东)有限公司 VPU configuration method, device, electronic equipment and computer-readable storage medium

Also Published As

Publication number Publication date
CN105373367B (en) 2018-03-02

Similar Documents

Publication Publication Date Title
CN105373367A (en) Vector single instruction multiple data-stream (SIMD) operation structure supporting synergistic working of scalar and vector
CN102750133B (en) 32-Bit triple-emission digital signal processor supporting SIMD
CN105453071B (en) For providing method, equipment, instruction and the logic of vector group tally function
Fang et al. swdnn: A library for accelerating deep learning applications on sunway taihulight
Dongarra et al. High-performance computing systems: Status and outlook
CN105359129B (en) For providing the method, apparatus, instruction and the logic that are used for group's tally function of gene order-checking and comparison
Kapasi et al. The Imagine stream processor
CN104303142B (en) Use the dispersion of index array and finite state machine
CN105247475B (en) Packed data element concludes processor, method, system and instruction
CN105247477B (en) Multiregister memory reference instruction, processor, method and system
CN102012893B (en) Extensible vector operation device
CN109597646A (en) Processor, method and system with configurable space accelerator
CN109213723A (en) Processor, method and system for the configurable space accelerator with safety, power reduction and performance characteristic
US20130042090A1 (en) Temporal simt execution optimization
CN103049241B (en) A kind of method improving CPU+GPU isomery device calculated performance
CN105190538B (en) System and method for the mobile mark tracking eliminated in operation
WO2016003820A9 (en) System and methods for expandably wide operand instructions
CN102508643A (en) Multicore-parallel digital signal processor and method for operating parallel instruction sets
CN109375949A (en) The processor of instruction is utilized with multiple cores, shared core extension logic and shared core extension
CN107918546A (en) The processor of part register access, method and system are realized using the full register access through mask
CN107667345A (en) Packing data alignment plus computations, processor, method and system
CN112580792B (en) Neural network multi-core tensor processor
CN101504599A (en) Special instruction set micro-processing system suitable for digital signal processing application
CN108369510A (en) For with the instruction of the displacement of unordered load and logic
CN108475192A (en) Dispersion reduces instruction

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant