CN105373367A - Vector single instruction multiple data-stream (SIMD) operation structure supporting synergistic working of scalar and vector - Google Patents

Vector single instruction multiple data-stream (SIMD) operation structure supporting synergistic working of scalar and vector

Info

Publication number
CN105373367A
CN105373367A CN201510718729.7A CN201510718729A
Authority
CN
China
Prior art keywords
vector
scalar
processing unit
vpe
instruction
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201510718729.7A
Other languages
Chinese (zh)
Other versions
CN105373367B (en)
Inventor
陈书明
彭元喜
雷元武
万江华
郭阳
田甜
彭浩
徐恩
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
National University of Defense Technology
Original Assignee
National University of Defense Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by National University of Defense Technology
Priority to CN201510718729.7A
Publication of CN105373367A
Application granted
Publication of CN105373367B
Legal status: Active
Anticipated expiration

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00Arrangements for program control, e.g. control units
    • G06F9/06Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/30Arrangements for executing machine instructions, e.g. instruction decode
    • G06F9/38Concurrent instruction execution, e.g. pipeline or look ahead
    • G06F9/3885Concurrent instruction execution, e.g. pipeline or look ahead using a plurality of independent parallel functional units
    • G06F9/3887Concurrent instruction execution, e.g. pipeline or look ahead using a plurality of independent parallel functional units controlled by a single instruction for multiple data lanes [SIMD]

Landscapes

  • Engineering & Computer Science (AREA)
  • Software Systems (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Complex Calculations (AREA)
  • Advance Control (AREA)

Abstract

The invention discloses a vector SIMD operation structure supporting scalar-vector cooperative work, which comprises: a unified instruction fetch and dispatch unit, for dispatching instructions simultaneously to a scalar processing unit SPU, a vector processing unit VPU, and a vector array memory AM; the scalar processing unit SPU, responsible for processing serial tasks and for controlling the execution of the vector processing unit VPU; the vector processing unit VPU, responsible for computation-intensive parallel tasks; the vector array memory AM, which provides data and data-movement support for parallel, multi-width vector operations; and a DMA unit, which supplies instructions and data to the scalar processing unit SPU and the vector processing unit VPU. The invention improves overall execution efficiency and parallelism.

Description

Vector SIMD operation structure supporting scalar-vector cooperative work
Technical field
The present invention relates generally to the field of microprocessor architecture and design, and in particular to a vector SIMD operation structure supporting scalar-vector cooperative work.
Background technology
Digital signal processors (DSPs) are widely used in embedded systems as a typical class of embedded microprocessor. Their strong data-processing capability, good programmability, flexibility of use, and low power consumption have brought great opportunities to the development of signal processing, and their applications extend across military and economic fields. In applications such as modern communications, image processing, and radar signal processing, growing data volumes and rising requirements on computational precision and real-time performance usually call for higher-performance microprocessors.
Unlike a traditional CPU, a DSP has the following characteristics: (1) strong computing capability, emphasizing real-time computation over control and transaction processing; (2) dedicated hardware support for typical signal-processing operations, such as multiply-accumulate and linear addressing; (3) the common features of embedded microprocessors: address and instruction paths of no more than 32 bits, and most data paths of no more than 32 bits; imprecise interrupts; a working mode of short-term offline debugging followed by long-term online resident operation (rather than the debug-then-run method of a general-purpose CPU); (4) fast integrated peripheral interfaces oriented toward off-chip devices, which are especially convenient for online transmission and reception of high-speed AD/DA data and which also support high-speed direct links between DSPs.
General scientific computing requires high-performance DSPs, but a traditional DSP used for scientific computing has the following shortcomings: (1) its bit width is small, so computational precision and addressing space are insufficient; general scientific computing applications require at least 64-bit precision; (2) it lacks software and hardware support for task management, file control, process scheduling, and interrupt management; in other words, it lacks an operating-system hardware environment, which complicates the management of general-purpose, multi-job computing tasks; (3) it lacks support for a unified high-level-language programming model, so support for multiple cores, vectors, and data parallelism essentially relies on assembly programming, which is inconvenient for general-purpose programming; (4) it does not support a local-host program debugging model, relying only on cross-debugging and emulation from another machine. These problems severely limit the application of DSPs in the field of general scientific computing.
Practitioners have proposed a "general-purpose computing digital signal processor" (GPDSP): a new multi-core microprocessor architecture that retains the essential embedded characteristics, high performance, and low power consumption of a DSP while efficiently supporting general scientific computing. This architecture overcomes the above problems of ordinary DSPs in scientific computing and can simultaneously provide efficient support for 64-bit high-performance computers and for embedded high-precision signal processing. It has the following features: (1) direct representation of double-precision floating-point and 64-bit fixed-point data, with general-purpose registers, data buses, and instruction widths of 64 bits or more, and address buses of 40 bits or more; (2) tightly coupled heterogeneous multi-core integration of CPU and DSP, where the CPU core supports a complete operating system and the scalar unit of the DSP core supports an operating-system microkernel; (3) a unified programming model across the CPU cores, the DSP cores, and the vector array structure within each DSP core; (4) cross-machine debugging is retained while a local CPU host debugging mode is also provided; (5) the essential characteristics of an ordinary DSP, apart from the bit width, are retained.
Other practitioners have proposed a "data shuffling unit with a switch-matrix memory", which discloses a data shuffling unit implementation and a data shuffling method: shuffle requests in a program are converted into switch matrices in a switch-matrix memory, thereby implementing data selection and recombination. This shuffling unit has the advantages of a simple, flexible, and efficient structure and of supporting shuffles between arbitrary nodes.
A GPDSP usually builds a processing array out of multiple homogeneous 64-bit processing elements to obtain high floating-point performance. However, when a GPDSP uses numerous processing elements to exploit the parallelism of general scientific computing, several problems remain: (1) how to organize the numerous homogeneous processing elements so that they efficiently exploit the many levels of parallelism in general scientific computing; (2) how to coordinate effectively between the scalar unit used for control and the vector units used for computation; (3) how to support the matrix-class operations in general scientific computing, exploiting their extensive data reuse to keep the numerous homogeneous processing elements supplied with operands and thereby improve the computing efficiency of the GPDSP.
Summary of the invention
The technical problem to be solved by the present invention is as follows: in view of the technical problems of the prior art, the invention provides a vector SIMD operation structure supporting scalar-vector cooperative work that improves execution efficiency and parallelism.
To solve the above technical problems, the present invention adopts the following technical solution:
A vector SIMD operation structure supporting scalar-vector cooperative work, comprising:
a unified instruction fetch and dispatch unit, for dispatching instructions simultaneously to the scalar processing unit SPU, the vector processing unit VPU, and the vector array memory AM;
the scalar processing unit SPU, responsible for processing serial tasks and for controlling the execution of the vector processing unit VPU;
the vector processing unit VPU, responsible for computation-intensive parallel tasks;
the vector array memory AM, providing data and data-movement support for parallel, multi-width vector operations; and
the DMA unit, providing instructions and data for the scalar processing unit SPU and the vector processing unit VPU.
As a further improvement of the present invention: the unified instruction fetch and dispatch unit adopts a variable-length N_SI + N_VI issue VLIW instruction structure, fetching and dispatching N_SI scalar instructions and N_VI vector instructions simultaneously; these N_SI + N_VI instructions all support conditional execution, interrupts, and exception handling.
As a further improvement of the present invention: the scalar processing element SPE consists of N_SMAC MAC units and N_SIEU fixed-point execution units IEU; these N_SI pipelines execute in parallel the N_SI scalar instructions of a VLIW instruction packet, performing the serial operations of scientific applications, where N_SI = N_SMAC + N_SIEU.
As a further improvement of the present invention: the vector processing unit VPU consists of N_VPE homogeneous vector processing elements VPE, which perform identical operations on different data under the control of a unified instruction stream, where N_VPE is a power of 2.
As a further improvement of the present invention: each vector processing element VPE comprises N_VMAC MAC units and N_VIEU fixed-point execution units IEU; these N_VI pipelines execute in parallel the N_VI vector instructions of a VLIW instruction packet, performing the parallel operations of scientific applications, where N_VI = N_VMAC + N_VIEU.
As a further improvement of the present invention: data interaction between the vector processing elements VPE is accomplished through a reduction network and a shuffle network.
As a further improvement of the present invention: a 64-bit configuration path is provided between the scalar processing unit SPU and each of the vector processing unit VPU and the vector array memory AM, so that MOV instructions can access the global control configuration registers in the vector processing unit VPU and the vector array memory AM.
As a further improvement of the present invention: between the scalar processing unit SPU and the vector processing unit VPU there are also two data-broadcast paths from the SPU to the VPU, supporting single-word broadcast instructions and double-word broadcast instructions respectively;
the single-word broadcast instruction broadcasts a single word from the SPU register file to the same position in the vector registers of the N_VPE VPEs; its execution performs one write operation to the register file of each of the N_VPE VPEs, completing a transfer of 64*N_VPE bits of data;
the double-word broadcast instruction broadcasts a pair of data Src_o:Src_e from the SPU register file to Dst_o:Dst_e in the register files of the N_VPE VPEs, where a register pair is named by its even member, i.e. VR0 denotes VR1:VR0; its execution performs one write operation to the register file of each of the N_VPE VPEs, completing a transfer of 128*N_VPE bits of data;
executing double-word broadcast operations on the two scalar-vector broadcast paths in parallel achieves a transfer of 256*N_VPE bits of data.
Compared with prior art, the invention has the advantages that:
1. The present invention is a tightly coupled vector SIMD (Single Instruction Multiple Data-stream) operation structure with scalar-vector cooperative work, suited to the multi-core microprocessor GPDSP. It adopts a variable-length multi-issue VLIW (Very Long Instruction Word) instruction structure that fetches and dispatches N_SI scalar instructions and N_VI vector instructions simultaneously, so that the scalar processing element SPE and the vector processing elements VPE execute the parallel instructions of a VLIW packet at the same time. The vector unit of this structure comprises N_VPE homogeneous vector processing elements VPE (N_VPE a power of 2), which execute identical instructions on different data. Data interaction between the SPE and the VPEs is accomplished by register-level data sharing and a fast scalar-to-vector broadcast mechanism, and data interaction between VPEs is accomplished by a reduction network and a shuffle network; these data-interaction mechanisms efficiently support matrix-class and signal-processing-class applications.
2. In the vector SIMD operation structure supporting scalar-vector cooperative work of the present invention, the scalar processing unit and the vector processing unit are organized in a tightly coupled manner, with the variable-length multi-issue VLIW instructions realized by the unified fetch and dispatch unit. This organization exploits the execution efficiency of the scalar and vector processing units to the greatest extent.
3. In the vector SIMD operation structure supporting scalar-vector cooperative work of the present invention, multiple parallel mechanisms fully exploit the parallelism in applications, including sub-word SIMD within the arithmetic units, multi-pipeline VLIW parallelism within the scalar and vector units, and vector SIMD parallelism across the vector processing elements.
4. In the vector SIMD operation structure supporting scalar-vector cooperative work of the present invention, the multiple data-transfer mechanisms between the scalar and vector units and within the vector unit realize fast data interaction among the arithmetic units, improving the execution efficiency of high-performance computing applications.
5. In the vector SIMD operation structure supporting scalar-vector cooperative work of the present invention, the above structural organization, multiple parallel mechanisms, and multiple data-transfer mechanisms fully exploit the potential parallelism of core algorithms (such as matrix multiplication and FFT), improving the execution efficiency of the GPDSP.
Brief description of the drawings
Fig. 1 is a schematic diagram of the structure of the present invention.
Fig. 2 is a schematic diagram of the scalar processing unit of the present invention.
Fig. 3 is a schematic diagram of the vector processing unit of the present invention.
Fig. 4 is a diagram of matrix-vector multiplication in a concrete application example of the present invention.
Fig. 5 is a schematic diagram of the shuffle-based FFT computation process in a concrete application example of the present invention.
Fig. 6 is a schematic diagram of the arrangement of VLIW instruction slots during the FFT computation in a concrete application example of the present invention.
Detailed description
The present invention is described in further detail below with reference to the drawings and specific embodiments.
The present invention is a high-performance general-purpose digital signal processor (GPDSP) with a variable-length multi-issue very long instruction word structure, intended for high-performance computing and also applicable to wireless communication, video, and image processing. This GPDSP is a multi-core processor of a new architecture suited to 64-bit general scientific computing while retaining the essential characteristics of an embedded DSP. Its advantage is that it both maintains the essential characteristics and the high performance and low power consumption of a DSP and efficiently supports general scientific computing; it overcomes the general problems of ordinary DSPs in scientific computing and can simultaneously provide efficient support for 64-bit high-performance computers and for embedded high-precision signal processing.
Fig. 1 shows the overall structure of a kernel of the vector SIMD operation structure supporting scalar-vector cooperative work according to the present invention. The kernel adopts a Harvard architecture, with instructions and data stored separately. The structure of the present invention comprises a unified instruction fetch and dispatch unit, a scalar processing unit SPU, a vector processing unit VPU, a vector array memory AM, and a DMA unit, wherein:
the unified instruction fetch and dispatch unit dispatches instructions simultaneously to the scalar processing unit SPU, the vector processing unit VPU, and the vector array memory AM;
the scalar processing unit SPU is responsible for processing serial tasks and for controlling the execution of the vector processing unit VPU; its scalar processing element SPE consists of N_SMAC MAC units and N_SIEU fixed-point execution units IEU, whose N_SI (N_SI = N_SMAC + N_SIEU) pipelines execute in parallel the N_SI scalar instructions of a VLIW instruction packet, performing the serial operations of scientific applications;
the vector processing unit VPU is responsible for computation-intensive parallel tasks; the VPU consists of N_VPE (N_VPE a power of 2) homogeneous vector processing elements (VPE: Vector Processing Element), which perform identical operations on different data under the control of a unified instruction stream; each VPE comprises N_VMAC MAC units and N_VIEU fixed-point execution units IEU, whose N_VI (N_VI = N_VMAC + N_VIEU) pipelines execute in parallel the N_VI vector instructions of a VLIW instruction packet, performing the parallel operations of scientific applications;
the vector array memory AM provides data and data-movement support for parallel, multi-width vector operations;
the DMA unit provides instructions and data for the scalar processing unit SPU and the vector processing unit VPU.
In a concrete application example, the kernel adopts a variable-length N_SI + N_VI issue VLIW (Very Long Instruction Word) instruction structure that can fetch and dispatch N_SI scalar instructions and N_VI vector instructions simultaneously; these N_SI + N_VI instructions all support conditional execution, interrupts, and exception handling.
The instruction cache (ICache) is 2-way set-associative, with a read-allocate policy and a least-recently-used (LRU) line-replacement policy. The ICache is 64 KB in size, with a 1-cycle access time on a hit. The ICache obtains instruction packets over the EMI interface at the request of the fetch and dispatch unit.
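To make the lookup concrete, the following is a minimal C sketch of the ICache organization described above. The 128-byte line size (one 1024-bit instruction packet per line) and the byte-addressed index/tag split are our assumptions; the patent specifies only the capacity, associativity, and policies.

```c
#include <stdint.h>
#include <stdbool.h>

#define ICACHE_SIZE (64 * 1024)
#define ICACHE_WAYS 2
#define LINE_BYTES  128  /* assumed: one 1024-bit instruction packet */
#define NUM_SETS    (ICACHE_SIZE / (ICACHE_WAYS * LINE_BYTES))  /* 256 */

typedef struct {
    bool     valid;
    uint64_t tag;
    uint8_t  data[LINE_BYTES];
    uint8_t  lru;                      /* 0 = most recently used */
} icache_line_t;

static icache_line_t icache[NUM_SETS][ICACHE_WAYS];

/* Returns the hit line (a 1-cycle access on a hit), or NULL on a miss,
 * in which case the fetch unit would request the packet over EMI and
 * allocate it into the LRU way of the set (read-allocate policy). */
static icache_line_t *icache_lookup(uint64_t addr)
{
    uint64_t set = (addr / LINE_BYTES) % NUM_SETS;
    uint64_t tag = addr / (LINE_BYTES * NUM_SETS);
    for (int w = 0; w < ICACHE_WAYS; w++) {
        if (icache[set][w].valid && icache[set][w].tag == tag)
            return &icache[set][w];
    }
    return NULL;
}
```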
Within the unified fetch and dispatch unit, the instruction fetch component generates a new fetch address according to the addresses sent from the interrupt handling unit, the exception handling unit, the instruction flow-control unit (branch addresses), and the ET unit. It then controls the fetch pipeline and instruction dispatch according to global control, the no-operation information from the branch unit, and the dispatch information from the DP unit.
Within the unified fetch and dispatch unit, the dispatch component receives instruction packets from the instruction fetch component and the ICache, analyzes the instructions in a packet according to their parallel flag bits and functional-unit type fields, and dispatches each instruction to the corresponding functional unit.
An instruction packet is 1024 bits. The instructions that can execute in parallel within one cycle form an execute packet, and an instruction packet may contain multiple execute packets. The present invention supports two instruction formats, 80-bit and 40-bit, and can fetch and dispatch at most N_SI scalar instructions and N_VI vector instructions simultaneously, where the N_SI scalar instructions comprise 2 scalar load/store instructions and N_SMAC + N_SIEU SPU operation instructions, and the N_VI vector instructions comprise 2 vector load/store instructions and N_VMAC + N_VIEU VPU operation instructions. An execute packet therefore contains at most 11 instructions and at least 1; at most 4 of them are 80-bit instructions, which are placed at the head of the execute packet, immediately followed by the 40-bit instructions.
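As an illustration of how execute packets could be carved out of an instruction packet, here is a hedged C sketch. The patent does not disclose the encoding; the parallel flag that chains an instruction to its successor (in the style of other VLIW DSPs) and the unit-type routing below are illustrative assumptions.

```c
#include <stddef.h>
#include <stdint.h>

enum unit { S_LDST, S_MAC, S_IEU, V_LDST, V_MAC, V_IEU };

typedef struct {
    uint8_t   bits;      /* 80 or 40; 80-bit insns lead the packet */
    uint8_t   parallel;  /* assumed: 1 = executes with the next insn */
    enum unit type;      /* functional-unit type field */
} insn_t;

static void route_to_unit(const insn_t *in)
{
    (void)in;            /* would hand the insn to its SPU/VPU pipeline */
}

/* Dispatch one execute packet (1..11 instructions) starting at pkt[0];
 * returns how many instructions it contained. */
size_t dispatch_execute_packet(const insn_t *pkt, size_t n)
{
    size_t i = 0;
    do {
        route_to_unit(&pkt[i]);
    } while (pkt[i++].parallel && i < n && i < 11);
    return i;
}
```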
Because the instructions of the scalar processing unit SPU and the vector processing unit VPU reside in the same execute packet, the two units complete computing tasks cooperatively in a tightly coupled manner under the control of the unified fetch and dispatch unit.
In a concrete application example, the main functions of the scalar processing unit SPU are scalar data memory access, scalar computation, and pipeline branching, jumps, and interrupt operations. The SPU performs the serial operations of an application and controls the operation of the vector unit; it comprises an instruction flow-control unit (SBR), a scalar processing element (SPE), scalar unit control registers (SUCR), and a scalar memory access unit (SM). The SPE contains N_SIEU fixed-point execution units (IEU) and N_SMAC multiply-accumulate (MAC) units, which can execute simultaneously; the scalar memory access unit can execute two scalar load/store instructions simultaneously, reading two words of data from the data cache.
The present invention provides a 64-bit configuration path between the SPU and each of the VPU and the AM, so that MOV instructions can access the global control configuration registers in the VPU and the AM.
In a concrete application example, as shown in Fig. 3, the vector processing unit (VPU) is a scalable vector processing cluster that mainly handles computation-intensive parallel tasks. It consists of N_VPE homogeneous vector processing elements (VPE: Vector Processing Element), each of which provides N_VMAC vector multiply-accumulate (MAC) units and N_VIEU fixed-point execution units (IEU) to support large-scale parallel MAC computation. The VPU uses vector SIMD parallelism to execute the same VLIW instructions on different data, realizing the vector computations of an application; it can perform N_VPE * (N_VMAC + N_VIEU) vector operations simultaneously.
A) Global shared registers
In the present invention, data interaction between the SPU and the VPU can be realized through the global shared registers (SVR); this interaction mechanism is driven by scalar MOV and vector MOV instructions. The global shared registers consist of N_VPE 64-bit registers; the VPU reads and writes them as a vector, while the SPU reads and writes them one by one as scalars. N_VPE scalar MOV instructions write N_VPE 64-bit data from the scalar register file into the SVR, and then 1 vector MOV instruction writes these N_VPE data to the same position of the register files in the N_VPE VPEs; or, conversely, 1 vector MOV instruction reads the N_VPE data at the same position of the register files in the N_VPE VPEs into the SVR, and then N_VPE scalar MOV instructions write these N_VPE data into the scalar register file.
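The following behavioral C model (not the actual ISA) summarizes the SVR exchange just described, taking N_VPE = 16 as in the later examples: N_VPE scalar MOVs fill the shared registers one by one, after which a single vector MOV writes all values to the same register slot of every VPE.

```c
#include <stdint.h>

#define N_VPE 16

static uint64_t svr[N_VPE];          /* global shared registers (64-bit) */
static uint64_t vpe_reg[N_VPE][64];  /* each VPE's local registers R0..R63 */

/* SPU side: N_VPE scalar MOVs, one 64-bit datum per instruction. */
void spu_fill_svr(const uint64_t src[N_VPE])
{
    for (int i = 0; i < N_VPE; i++)
        svr[i] = src[i];             /* scalar MOV: SPU register -> SVR[i] */
}

/* VPU side: one vector MOV writes the same register number r of all
 * N_VPE VPEs in a single operation. */
void vpu_read_svr(int r)
{
    for (int i = 0; i < N_VPE; i++)
        vpe_reg[i][r] = svr[i];      /* vector MOV: SVR -> VPE[i].R[r] */
}
```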
B) Scalar-vector broadcast
Further, in the scalar-vector cooperative operation structure of the present invention, there are also two fast data-broadcast paths from the scalar unit to the vector unit between the SPU and the VPU, supporting single-word and double-word broadcast instructions respectively. 1) Single-word broadcast instruction: a single word (64 bits) from the SPU register file is broadcast to the same position in the vector registers of the N_VPE VPEs; execution requires one write operation to the register file of each of the N_VPE VPEs and completes a transfer of 64*N_VPE bits of data. 2) Double-word broadcast instruction: a pair of data Src_o:Src_e (a double word: 128 bits) from the SPU register file is broadcast to Dst_o:Dst_e in the register files of the N_VPE VPEs, where a register pair is named by its even member, i.e. VR0 denotes VR1:VR0; execution requires only one write operation to the register file of each of the N_VPE VPEs and completes a transfer of 128*N_VPE bits of data. Executing double-word broadcast operations on the two scalar-vector broadcast paths in parallel achieves a transfer of 256*N_VPE bits of data.
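A behavioral C model of the two broadcast forms makes the bit counts explicit; the array names are ours, not the patent's. One write per VPE register file completes 64*N_VPE bits for a single word and 128*N_VPE bits for a register pair; issuing double-word broadcasts on both paths in parallel reaches 256*N_VPE bits.

```c
#include <stdint.h>

#define N_VPE 16

static uint64_t vpe_reg[N_VPE][64];   /* each VPE's vector registers */

/* Single-word broadcast: one 64-bit SPU register value lands in the
 * same vector-register slot r of all N_VPE VPEs. */
void bcast_word(uint64_t src, int r)
{
    for (int i = 0; i < N_VPE; i++)
        vpe_reg[i][r] = src;          /* 64 * N_VPE bits in total */
}

/* Double-word broadcast: the pair Src_o:Src_e goes to Dst_o:Dst_e,
 * where an even register number d names the pair d+1:d (VR0 => VR1:VR0). */
void bcast_dword(uint64_t src_o, uint64_t src_e, int d /* even */)
{
    for (int i = 0; i < N_VPE; i++) {
        vpe_reg[i][d]     = src_e;    /* even half */
        vpe_reg[i][d + 1] = src_o;    /* odd half  */
    }                                 /* 128 * N_VPE bits in total */
}
```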
C) Multi-width reduction and shuffle network
Further, the present invention provides a multi-width reduction tree and a shuffle network to realize data interaction between the register files of the vector processing elements. The reduction tree can reduce the data of all VPEs to a single scalar result, or perform grouped reductions to obtain multiple scalar results; a multi-width reduction partitions all N_VPE VPEs into groups whose reductions execute in parallel, where only equal-sized groups are supported and the group size is an integral power of 2. The shuffle network rearranges data across all VPEs, thereby realizing inter-VPE data communication and bringing great flexibility to vector data processing; it can shuffle the data between VPEs according to different shuffle granularities and shuffle patterns, while the reduction network reduces the data of multiple VPEs into one or more VPEs in the multi-width reduction manner.
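As a sketch of the multi-width grouped reduction semantics, the following C model reduces the N_VPE lanes in equal power-of-2 groups; the hardware performs each group as a tree in parallel, which the model serializes, and the additive reduction is one example of the operation, not a restriction stated by the patent.

```c
#define N_VPE 16

/* Reduce lane[0..N_VPE-1] in groups of `group` lanes (group a power of 2
 * dividing N_VPE); results land in out[0..N_VPE/group - 1]. */
void grouped_reduce_add(const double lane[N_VPE], int group, double out[])
{
    for (int g = 0; g < N_VPE / group; g++) {
        double acc = 0.0;
        for (int i = 0; i < group; i++)   /* a tree in hardware,   */
            acc += lane[g * group + i];   /* serial in this model  */
        out[g] = acc;
    }
}
/* group == N_VPE yields one scalar result; smaller groups yield several. */
```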
D) Vector data access unit
Further, the present invention supports two vector load/store instructions in parallel, providing high memory bandwidth for the N_VPE vector processing elements: the data bandwidth to the VPU is 2*N_VPE*8 B/cycle (with N_VPE = 16, 256 bytes per cycle). The vector data memory adopts a single-port multi-bank organization that supports parallel access; it provides N_VPE dedicated base registers and address offset registers, supporting two indirect addressing modes, linear addressing and circular addressing.
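A short C sketch of the two indirect addressing modes, under the assumptions that the base and offset registers hold byte addresses, that linear addressing post-increments the base by the offset, and that circular buffers are power-of-2 sized so the wrap reduces to a mask:

```c
#include <stdint.h>

/* Linear addressing: the base advances by the offset after each access. */
uint64_t addr_linear(uint64_t *base, uint64_t offset)
{
    uint64_t a = *base;
    *base += offset;
    return a;
}

/* Circular addressing: the address wraps inside a buffer of buf_size
 * bytes starting at buf_start (buf_size assumed a power of 2). */
uint64_t addr_circular(uint64_t *base, uint64_t offset,
                       uint64_t buf_start, uint64_t buf_size)
{
    uint64_t a = *base;
    *base = buf_start + ((a - buf_start + offset) & (buf_size - 1));
    return a;
}
```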
In this embodiment, the VPU is a scalable vector processing cluster consisting of N_VPE homogeneous vector processing elements VPE, which use vector SIMD parallelism to improve performance. The VPU receives the vector operation instructions issued by the dispatch unit, decodes them, and sends them to the corresponding functional units for execution. Each VPE contains 64 local 64-bit general-purpose registers R0-R63.
As shown in Figs. 2 and 3, each VPE integrates N_VMAC multiply-accumulate units (MAC) and N_VIEU fixed-point execution units (IEU) to support basic vector computation. A VPE improves performance through the parallel VLIW execution of these N_VMAC + N_VIEU functional units, each of which executes one vector instruction of a VLIW instruction packet; that is, a VPE contains N_VMAC + N_VIEU pipelines that can execute in parallel. Every execution pipeline in the SPU and the VPEs uses sub-word SIMD to implement 64-bit arithmetic and 32-bit SIMD arithmetic, further improving the performance of 32-bit operations (such as single-precision floating point).
Each MAC unit is composed of three sub-units: a fixed-point MAC, a floating-point MAC, and a floating-point ALU short-instruction unit, where the floating-point MAC and the fixed-point MAC share a 64x64 fixed-point multiplier. The three are independent units sharing the same data path; they cannot start execution or write back in the same cycle, but they can be scheduled in parallel by software pipelining.
Each IEU is composed of a bit-processing unit (BP) and a fixed-point arithmetic logic unit (ALU). The two are independent units sharing the same data path; they cannot start execution or write back in the same cycle, but they can be scheduled in parallel by software pipelining.
A broadcast mechanism exists between the scalar execution unit SPU and the vector execution unit VPU of the GPDSP of the present invention, supporting at most two double-word broadcast operations at a time and accelerating the filling of vector data. Data are broadcast from the scalar unit to the vector unit; the implementation requires only one write operation to the vector register files (VRF) to complete a transfer of 128*N_VPE or 256*N_VPE bits of data.
Transferring data from the SPE into the N_VPE VPEs with the scalar-vector broadcast of the present invention takes only 4 beats, and the transfer is fully pipelined. Completing the same process through the SVR takes 20 beats, because SPE-VPE data interaction through the SVR is serial. By this calculation, the scalar-vector broadcast fills data about five times as fast as the SVR, greatly improving data-filling speed; at the same time, using the scalar-vector broadcast for data reuse reduces memory bandwidth demand and improves overall performance.
Many scientific and engineering applications involve matrix-class operations, which have good data parallelism; in the present invention such operations can exploit the instruction-level parallelism developed through SIMD and VLIW. Below, taking N_VPE = 16 and N_VMAC = 3 as an example, the support of the parallel structure of the present invention for matrix multiplication and FFT is described.
While improving computational performance through large-scale parallel functional units, such a design also places great pressure on memory bandwidth. Exploiting the good data reuse of matrix-class operations, the scalar-vector broadcast operation of the present invention completes a transfer of 2048 or 4096 bits with a single write operation. This effectively exploits the data reuse in applications, reduces memory bandwidth demand, raises the utilization of the vector computing units, greatly improves the efficiency of matrix multiplication, reduces resource occupation, and improves overall performance.
As shown in Fig. 4, the effect of the scalar-vector broadcast operation on computational performance and storage demand is illustrated with the most basic matrix-class operation, matrix-vector multiplication y = A x, where A is an n*m matrix, x is a vector of length m, and y is a vector of length n. On the GPDSP operation structure of the present invention, the matrix A is stored in the vector memory AM and x is stored in the scalar data memory SM; the 16 VPEs compute in rounds in SIMD fashion, with VPE[i] computing the i-th element of the j-th group of 16 result elements in round j, where 1 <= j <= t and t is the number of rounds. As can be seen from Fig. 4, every element of the result vector y reuses the vector x. On the GPDSP of the present invention, the vector x only needs to be read once; in every round, scalar-vector broadcast operations send the elements of x one by one into the vector registers of the 16 VPEs while the data of the 16 corresponding rows of A are read from AM, and the 48 MAC units of the 16 VPEs execute in parallel in a pipelined manner.
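The round-based mapping of Fig. 4 can be modelled in C as follows. The row-to-VPE assignment (row 16*j + i handled by VPE[i] in round j) and the row-major layout of A are our reading of the figure rather than statements in the text; the inner broadcast of x[k] corresponds to the scalar-vector broadcast, and the lane loop to the 16 VPEs executing in SIMD.

```c
#define N_VPE 16

/* y = A * x, with A n-by-m (n a multiple of 16), row-major in AM. */
void matvec(int n, int m, const double *A, const double *x, double *y)
{
    for (int j = 0; j < n / N_VPE; j++) {        /* rounds              */
        double acc[N_VPE] = {0};                 /* one accumulator/VPE */
        for (int k = 0; k < m; k++) {
            double xk = x[k];                    /* scalar-vector bcast */
            for (int i = 0; i < N_VPE; i++)      /* 16 VPEs in SIMD     */
                acc[i] += A[(j * N_VPE + i) * m + k] * xk;
        }
        for (int i = 0; i < N_VPE; i++)
            y[j * N_VPE + i] = acc[i];
    }
}
```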
The present invention is significant for the very wide range of applications that use matrix-class operations. Matrix-class operations such as matrix multiplication appear in numerous scientific computing tasks, and the scalar-vector broadcast is more efficient than the traditional exchange through the shared registers SVR: it completes a transfer of 2048 or 4096 bits of data with a single write operation. This performance advantage rests on the support of the GPDSP operation structure of the present invention: the scalar processing unit SPU and the vector processing unit VPU with its 16 homogeneous vector processing elements VPE together realize the scalar-vector broadcast, which markedly improves the performance of matrix multiplication and has broad application prospects.
The GPDSP operation structure of the present invention applies equally efficiently to the field of signal processing, illustrated here with its most fundamental algorithm, the double-precision floating-point FFT. Because FFT computation accesses data at varying strides, the shuffle-network-based vector SIMD operation structure of the present invention realizes fast data interaction between VPEs and thereby satisfies the data-access requirements of the different strides.
As shown in Figs. 5 and 6, the Cooley-Tukey algorithm decomposes an FFT of arbitrary size into multiple small FFTs of at most 128 points each. For a 128-point FFT, the input data, twiddle factors, and results all fit in the register files of the vector processing elements; each VPE stores the data of 8 points, each point being a double-precision complex number. As shown in Fig. 5, the data are stored across the VPEs in sequence, and the radix-2 FFT of 128 points is divided into 7 stages of butterfly computations. In stages 1, 2, and 3, each VPE operates on the data in its own register file and stores the results back into its own register file. After stage 3, data must be exchanged between the VPEs, which this patent accomplishes with 7 pipelined shuffle instructions; stages 4, 5, and 6 then operate on the shuffled data. After stage 6, one shuffle instruction completes the remaining inter-VPE data interaction, and stage 7 is then executed. Within each stage, each VPE performs 4 butterfly operations, as shown in Fig. 6(A); each butterfly consists of 4 double-precision floating-point multiplications and 6 double-precision floating-point additions/subtractions, as shown in Fig. 6(B). Each VPE therefore performs 16 floating-point multiplications and 24 floating-point additions per stage (40 floating-point operations in total), distributed over the 3 MAC instruction slots of the present invention as shown in Fig. 6(C). From this analysis, a 128-point FFT takes 106 clock cycles (14*7 + 8: 14 cycles for each of the 7 stages plus 8 shuffle instructions) on the VPU structure of the present invention.
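For reference, a C sketch of one radix-2 butterfly shows where the operation counts of Fig. 6(B) come from: the complex multiply b*w costs 4 multiplications and 2 additions/subtractions, and the subsequent add/subtract pair costs 4 more, giving 4 multiplies and 6 add/subtracts per butterfly.

```c
typedef struct { double re, im; } cplx;

/* One radix-2 butterfly: (a, b) -> (a + b*w, a - b*w). */
static void butterfly(cplx *a, cplx *b, cplx w)
{
    cplx t;
    t.re = b->re * w.re - b->im * w.im;   /* 2 mul, 1 sub */
    t.im = b->re * w.im + b->im * w.re;   /* 2 mul, 1 add */
    b->re = a->re - t.re;                 /* 1 sub */
    b->im = a->im - t.im;                 /* 1 sub */
    a->re = a->re + t.re;                 /* 1 add */
    a->im = a->im + t.im;                 /* 1 add */
}
/* 4 butterflies per VPE per stage => 16 multiplies + 24 add/subs, the
 * 40 flops distributed over the 3 MAC slots as in Fig. 6(C). */
```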
The above are only preferred embodiments of the present invention, and the scope of protection of the present invention is not limited to the above embodiments; all technical solutions falling under the concept of the present invention belong to its scope of protection. It should be pointed out that, for those skilled in the art, improvements and modifications that do not depart from the principles of the present invention shall also be regarded as falling within the scope of protection of the present invention.

Claims (8)

1. A vector SIMD operation structure supporting scalar-vector cooperative work, characterized by comprising:
a unified instruction fetch and dispatch unit, for dispatching instructions simultaneously to the scalar processing unit SPU, the vector processing unit VPU, and the vector array memory AM;
the scalar processing unit SPU, responsible for processing serial tasks and for controlling the execution of the vector processing unit VPU;
the vector processing unit VPU, responsible for computation-intensive parallel tasks;
the vector array memory AM, providing data and data-movement support for parallel, multi-width vector operations; and
the DMA unit, providing instructions and data for the scalar processing unit SPU and the vector processing unit VPU.

2. The vector SIMD operation structure supporting scalar-vector cooperative work according to claim 1, characterized in that the unified instruction fetch and dispatch unit adopts a variable-length N_SI + N_VI issue VLIW instruction structure, fetching and dispatching N_SI scalar instructions and N_VI vector instructions simultaneously, these N_SI + N_VI instructions all supporting conditional execution, interrupts, and exception handling.

3. The vector SIMD operation structure supporting scalar-vector cooperative work according to claim 1, characterized in that the scalar processing element SPE consists of N_SMAC MAC units and N_SIEU fixed-point execution units IEU, these N_SI pipelines executing in parallel the N_SI scalar instructions of a VLIW instruction packet and performing the serial operations of scientific applications, where N_SI = N_SMAC + N_SIEU.

4. The vector SIMD operation structure supporting scalar-vector cooperative work according to claim 1, characterized in that the vector processing unit VPU consists of N_VPE homogeneous vector processing elements VPE that perform identical operations on different data under the control of a unified instruction stream, where N_VPE is a power of 2.

5. The vector SIMD operation structure supporting scalar-vector cooperative work according to claim 4, characterized in that each vector processing element VPE comprises N_VMAC MAC units and N_VIEU fixed-point execution units IEU, these N_VI pipelines executing in parallel the N_VI vector instructions of a VLIW instruction packet and performing the parallel operations of scientific applications, where N_VI = N_VMAC + N_VIEU.

6. The vector SIMD operation structure supporting scalar-vector cooperative work according to claim 5, characterized in that data interaction between the vector processing elements VPE is accomplished through a reduction network and a shuffle network.

7. The vector SIMD operation structure supporting scalar-vector cooperative work according to claim 1, characterized in that a 64-bit configuration path is provided between the scalar processing unit SPU and each of the vector processing unit VPU and the vector array memory AM, so that MOV instructions can access the global control configuration registers in the vector processing unit VPU and the vector array memory AM.

8. The vector SIMD operation structure supporting scalar-vector cooperative work according to claim 1, characterized in that between the scalar processing unit SPU and the vector processing unit VPU there are also two data-broadcast paths from the SPU to the VPU, supporting single-word broadcast instructions and double-word broadcast instructions respectively;
the single-word broadcast instruction broadcasts a single word from the SPU register file to the same position in the vector registers of the N_VPE VPEs, its execution performing one write operation to the register file of each of the N_VPE VPEs and completing a transfer of 64*N_VPE bits of data;
the double-word broadcast instruction broadcasts a pair of data Src_o:Src_e from the SPU register file to Dst_o:Dst_e in the register files of the N_VPE VPEs, where a register pair is named by its even member, i.e. VR0 denotes VR1:VR0, its execution performing one write operation to the register file of each of the N_VPE VPEs and completing a transfer of 128*N_VPE bits of data;
executing double-word broadcast operations on the two scalar-vector broadcast paths in parallel achieves a transfer of 256*N_VPE bits of data.
CN201510718729.7A 2015-10-29 2015-10-29 Vector SIMD operation structure supporting scalar-vector cooperative work Active CN105373367B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201510718729.7A CN105373367B (en) 2015-10-29 2015-10-29 Vector SIMD operation structure supporting scalar-vector cooperative work

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201510718729.7A CN105373367B (en) 2015-10-29 2015-10-29 Vector SIMD operation structure supporting scalar-vector cooperative work

Publications (2)

Publication Number Publication Date
CN105373367A true CN105373367A (en) 2016-03-02
CN105373367B CN105373367B (en) 2018-03-02

Family

ID=55375596

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201510718729.7A Active CN105373367B (en) 2015-10-29 2015-10-29 Vector SIMD operation structure supporting scalar-vector cooperative work

Country Status (1)

Country Link
CN (1) CN105373367B (en)

Citations (5)

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101986264A (en) * 2010-11-25 2011-03-16 中国人民解放军国防科学技术大学 Multifunctional floating-point multiply and add calculation device for single instruction multiple data (SIMD) vector microprocessor
CN102012893A (en) * 2010-11-25 2011-04-13 中国人民解放军国防科学技术大学 Extensible vector operation cluster
CN102279818A (en) * 2011-07-28 2011-12-14 中国人民解放军国防科学技术大学 Vector data access and storage control method supporting limited sharing and vector memory
CN103440121A (en) * 2013-08-20 2013-12-11 中国人民解放军国防科学技术大学 Triangular matrix multiplication vectorization method of vector processor
CN104636315A (en) * 2015-02-06 2015-05-20 中国人民解放军国防科学技术大学 GPDSP-oriented matrix LU decomposition vectorization calculation method

Cited By (11)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111651201A (en) * 2016-04-26 2020-09-11 中科寒武纪科技股份有限公司 Device and method for executing vector merging operation
CN111651201B (en) * 2016-04-26 2023-06-13 中科寒武纪科技股份有限公司 Apparatus and method for performing vector merge operation
CN109661647A (en) * 2016-09-13 2019-04-19 Arm有限公司 The multiply-add instruction of vector
CN109661647B (en) * 2016-09-13 2023-03-03 Arm有限公司 Data processing apparatus and method
CN111352894A (en) * 2018-12-20 2020-06-30 深圳市中兴微电子技术有限公司 A single-instruction multi-core system, instruction processing method and storage medium
CN112328958A (en) * 2020-11-10 2021-02-05 河海大学 An optimized data rearrangement method based on radix-64 two-dimensional FFT architecture
CN114626540A (en) * 2020-12-11 2022-06-14 上海阵量智能科技有限公司 Processor and related product
WO2022121275A1 (en) * 2020-12-11 2022-06-16 上海阵量智能科技有限公司 Processor, multithread processing method, electronic device, and storage medium
CN115826910A (en) * 2023-02-07 2023-03-21 成都申威科技有限责任公司 Vector fixed point ALU processing system
CN117435259A (en) * 2023-12-20 2024-01-23 芯瞳半导体技术(山东)有限公司 VPU configuration method, device, electronic equipment and computer-readable storage medium
CN117435259B (en) * 2023-12-20 2024-03-22 芯瞳半导体技术(山东)有限公司 VPU configuration method, device, electronic equipment and computer-readable storage medium

Also Published As

Publication number Publication date
CN105373367B (en) 2018-03-02

Similar Documents

Publication Publication Date Title
CN105373367A (en) Vector single instruction multiple data-stream (SIMD) operation structure supporting synergistic working of scalar and vector
CN102750133B (en) 32-Bit triple-emission digital signal processor supporting SIMD
CN105453071B (en) For providing method, equipment, instruction and the logic of vector group tally function
Fang et al. swdnn: A library for accelerating deep learning applications on sunway taihulight
Dongarra et al. High-performance computing systems: Status and outlook
CN105359129B (en) For providing the method, apparatus, instruction and the logic that are used for group's tally function of gene order-checking and comparison
Kapasi et al. The Imagine stream processor
CN104303142B (en) Use the dispersion of index array and finite state machine
CN105247475B (en) Packed data element concludes processor, method, system and instruction
CN105247477B (en) Multiregister memory reference instruction, processor, method and system
CN102012893B (en) Extensible vector operation device
CN109597646A (en) Processor, method and system with configurable space accelerator
CN109213723A (en) Processor, method and system for the configurable space accelerator with safety, power reduction and performance characteristic
US20130042090A1 (en) Temporal simt execution optimization
CN103049241B (en) A kind of method improving CPU+GPU isomery device calculated performance
CN105190538B (en) System and method for the mobile mark tracking eliminated in operation
WO2016003820A9 (en) System and methods for expandably wide operand instructions
CN102508643A (en) Multicore-parallel digital signal processor and method for operating parallel instruction sets
CN109375949A (en) The processor of instruction is utilized with multiple cores, shared core extension logic and shared core extension
CN107918546A (en) The processor of part register access, method and system are realized using the full register access through mask
CN107667345A (en) Packing data alignment plus computations, processor, method and system
CN112580792B (en) Neural network multi-core tensor processor
CN101504599A (en) Special instruction set micro-processing system suitable for digital signal processing application
CN108369510A (en) For with the instruction of the displacement of unordered load and logic
CN108475192A (en) Dispersion reduces instruction

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant