CN105302749B - DMA transfer method for single-instruction multiple-thread mode in GPDSP - Google Patents


Info

Publication number
CN105302749B
CN105302749B (application CN201510718877.9A)
Authority
CN
China
Prior art keywords
data
simt
read
dma
write
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201510718877.9A
Other languages
Chinese (zh)
Other versions
CN105302749A (en)
Inventor
马胜
陈书明
万江华
郭阳
杨柳
陈海燕
刘宗林
丁博
丁一博
陈胜刚
雷元武
王耀华
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
National University of Defense Technology
Original Assignee
National University of Defense Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by National University of Defense Technology
Priority to CN201510718877.9A
Publication of CN105302749A
Application granted
Publication of CN105302749B
Legal status: Active
Anticipated expiration

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F 13/00: Interconnection of, or transfer of information or other signals between, memories, input/output devices or central processing units
    • G06F 13/14: Handling requests for interconnection or transfer
    • G06F 13/20: Handling requests for interconnection or transfer for access to input/output bus
    • G06F 13/28: Handling requests for interconnection or transfer for access to input/output bus using burst mode transfer, e.g. direct memory access DMA, cycle steal
    • G06F 13/282: Cycle stealing DMA


Abstract

The invention discloses a DMA transfer method for the single-instruction multiple-thread (SIMT) mode in a GPDSP. A single DMA transfer transaction is configured to move the data of an SIMT program, stored irregularly in the off-chip memory space, into the on-chip vector memory (VM). After the move, the data are stored neatly in the vector memory VM so that the vector computation units can access them in parallel. The method is simple in principle and convenient to operate, the transfer size is configurable, and data can be supplied to SIMT programs efficiently in background mode, which not only supports the execution of SIMT programs well but also greatly increases the computational performance of the GPDSP.

Description

DMA transfer method for single-instruction multiple-thread mode in GPDSP
Technical field
The present invention relates generally to the direct memory access (DMA) component of a general-purpose digital signal processor (General Purpose Digital Signal Processor, GPDSP), and in particular to a DMA transfer method oriented to the data demands of the single-instruction multiple-thread (Single Instruction Multiple Threads, SIMT) mode, in which a DMA transfer transaction moves the data of an SIMT program, stored irregularly in the off-chip memory space, neatly into the on-chip vector memory (Vector Memory, VM), finally facilitating parallel execution by the vector units.
Background art
A digital signal processor (Digital Signal Processor, DSP), as a typical embedded microprocessor, is widely used in embedded systems. With strong data-processing capability, good programmability, flexible application and low power consumption, it has brought great opportunities to the development of signal processing, and its application fields have extended to many aspects of social and economic development. In today's application fields such as communications, image processing and radar signal processing, as the amount of data to be processed grows and the requirements on computational precision and real-time performance increase, microprocessors of higher performance are usually required.
Unlike a central processing unit (CPU), a DSP has the following characteristics: 1) strong computing capability, with more attention paid to real-time computation than to control and transaction processing; 2) dedicated hardware support for typical signal processing, such as multiply-add operations and linear addressing; 3) the common features of embedded microprocessors: address and instruction paths no wider than 32 bits and most data paths no wider than 32 bits; imprecise interrupts; a working mode of short-term offline debugging followed by long-term online resident operation (rather than the debug-then-run method of a general-purpose CPU); 4) peripheral interfaces dominated by fast peripherals, which is especially beneficial for online transmission and reception of high-speed AD/DA data, with support for high-speed direct links between DSPs.
General scientific computing requires high-performance DSPs, but traditional DSPs have the following disadvantages when used for scientific computing: 1) the bit width is small, so computational precision and addressing space are insufficient, whereas general scientific computing applications require at least 64-bit precision; 2) they lack software and hardware support for task management, file control, process scheduling and interrupt management, in other words they lack the hardware environment for an operating system, which is inconvenient for general, multi-task management; 3) they lack support for a unified high-level-language programming model, and support for multi-core, vector and data parallelism basically relies on assembly programs, which is inconvenient for general programming; 4) they do not support the program-debugging model of a local host and can only be debugged by cross-machine emulation. These problems severely limit the application of DSPs in the field of general scientific computing.
Practitioners have proposed "a general-computing digital signal processor GPDSP", which discloses a novel architecture, the multi-core microprocessor GPDSP, that retains the essential embedded characteristics of a DSP and its advantage of high performance at low power while efficiently supporting general scientific computing. This architecture overcomes the above problems of general DSPs in scientific computing and can simultaneously provide efficient support for 64-bit high-performance computing and embedded high-precision signal processing. It has the following features: 1) direct representation of double-precision floating-point and 64-bit fixed-point data, with general registers, data buses and instruction widths of 64 bits or more and address buses of 40 bits or more; 2) tight coupling of heterogeneous CPU and DSP cores, with the CPU supporting a complete operating system and the scalar unit of the DSP core supporting an operating-system micro-kernel; 3) a unified programming model across the CPU cores, the DSP cores and the vector array structure within the DSP cores; 4) retention of cross-machine debugging from another host while also providing a local-CPU host debugging mode; 5) retention of the basic features of common DSPs except for the bit width.
The above GPDSP usually obtains higher floating-point computing capability by forming a processing array from multiple homogeneous 64-bit processing units. However, because the amount of data a GPDSP must process is huge, large amounts of data must be exchanged between the on-chip memory of the GPDSP and the off-chip memory components. Data stored in the off-chip memory space must first be moved into the on-chip memory space for the kernel to compute on, and the results computed by the kernel must be moved back to the off-chip memory space for preservation. The data-transfer rate between the on-chip memory and the off-chip memory components thus becomes a key factor limiting the processing speed of the GPDSP. Like general-purpose microprocessors, the GPDSP also faces the "memory wall" problem.
Direct memory access (DMA) is a technique that alleviates the "memory wall" problem well: while the processing cores perform computation, DMA moves data at high speed in a background working mode, and the moving process requires no participation of the processing cores. Because DMA overlaps the computation of the kernel with the data movement of the memory, it reduces to some extent the influence of the data-transfer rate between on-chip memory and off-chip memory components on GPDSP processing performance.
Continuously improving processor performance is the goal designers always pursue. To obtain higher computational performance, processor-architecture research has shifted from exploiting traditional instruction-level parallelism to exploiting higher-level thread-level parallelism. In exploiting thread-level parallelism, how to schedule parallel threads efficiently has become the focus of discussion in academia and industry. The appearance of single-instruction multiple-thread (Single Instruction Multiple Threads, SIMT) technology made multi-thread scheduling well realizable; the most outstanding representative is the SIMT multi-thread scheduling technique used in the graphics processors (GPUs) of NVIDIA. Its unique structure and thread-scheduling model provide users with good programmability together with efficient multi-threaded parallel data-processing capability, and its performance in many fields has surpassed that of traditional multi-core processors. If the SIMT scheduling technique of GPUs is fused into other processor architectures, the performance of those processors will be markedly improved.
The vector processing of a GPDSP is completed by the vector processing unit (VPU) and the vector memory-access unit (VMU). Existing vector processing supports the single-instruction multiple-data (Single Instruction Multiple Data, SIMD) execution model: multiple parallel processing elements (PEs) integrated in the VPU execute arithmetic operations in SIMD fashion, while the VMU executes vector memory accesses in SIMD fashion and supplies high-bandwidth vector data to the VPU. However, as the SIMD width increases, the cost of the global stall caused by a global exception grows larger and larger, so the actual operating efficiency does not increase as expected. Therefore, on the basis of exploiting data-level parallelism with SIMD, there is an urgent need to exploit higher-level parallelism, such as thread-level parallelism, to improve system operating efficiency. But current vector memory access only supports vector data whose addresses are contiguous or follow a specific pattern such as a fixed stride, and cannot meet the needs of parallel multi-thread execution.
To effectively exploit thread-level parallelism in vector computation, vector memory access should be made to meet the needs of SIMT programs. There are two possible solutions: one is to modify the vector access method; the other is to use DMA data moves so that source data that are stored irregularly in the off-chip memory space, but whose addresses follow a specific pattern, can be placed neatly into the VM.
Summary of the invention
The technical problem to be solved by the present invention is: in view of the technical problems existing in the prior art, the present invention provides a DMA transfer method for the single-instruction multiple-thread mode in a GPDSP that is simple in principle, convenient to operate and flexibly configurable, and that can markedly improve the computational performance of the processor.
To solve the above technical problem, the specific technical solution adopted by the present invention is:
A DMA transfer method for the single-instruction multiple-thread mode in a GPDSP, in which a single DMA transfer transaction is configured to move the data of an SIMT program, stored irregularly in the off-chip memory space, into the vector memory VM of the kernel. After the move, the data are stored neatly in the vector memory VM so that the vector computation units can access them in parallel.
As a further improvement of the present invention, the flow of the data move is:
S1: Configure the parameters and start the DMA;
An SIMT flag is added to the transfer control word to indicate whether this DMA transfer transaction is a data move for the single-instruction multiple-thread mode. Sixteen groups of base address BA and address offset OA are added, used to compute the positions of the 16 columns of data in the off-chip DDR3 SDRAM and to generate the access addresses of read requests. Sixteen counters CNT are added to indicate the amount of data in each of the 16 columns. The above 16 groups of BA, OA and CNT are configured into the global registers of the DMA by the vector memory VM, while the SIMT flag is written into the DMA parameter RAM over the PBUS bus.
After the parameter configuration is completed, the DMA is started and the transfer parameters are fetched from the parameter RAM.
S2: Transmit read requests;
Read requests are generated according to the transfer parameters, and at the same time the VM write addresses are stored in the write-request bank Mem. After receiving a read request from the DMA, the target peripheral returns data to the DMA. The DMA receives the returned data and deposits it into the write-request bank Mem.
S3: Transmit write requests;
The DMA takes the read-return data and the VM write address out of a valid entry of the write-request bank Mem and issues them to the vector memory VM, while the counters count accordingly. If the count is not finished, data transfer continues by the above process; otherwise the transfer-complete flag register is set.
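The three-step flow above can be sketched as a small software model. This is an illustrative Python sketch, not the patent's hardware: the names `simt_dma_transfer`, `ddr` and `vm_banks` are invented here, and the off-chip DDR3 SDRAM is modeled as a flat array of 32-bit words.

```python
def simt_dma_transfer(ddr, ba, oa, cnt, word_bytes=4):
    """Model of steps S1-S3: one (BA, OA, CNT) triple per VM bank is assumed
    already configured (S1); read requests fetch scattered words from ddr and
    stage them in the write-request bank Mem (S2); Mem is then drained into
    per-bank columns of the VM (S3)."""
    n_banks = len(ba)
    mem = []  # write-request bank Mem: (bank, slot, data) entries
    for k in range(n_banks):
        for i in range(cnt[k]):
            addr = ba[k] + i * oa[k]        # read-request byte address
            data = ddr[addr // word_bytes]  # "returned data" from DDR
            mem.append((k, i, data))        # staged with its VM write slot
    vm_banks = [[None] * cnt[k] for k in range(n_banks)]
    for k, i, data in mem:                  # S3: drain Mem into the VM banks
        vm_banks[k][i] = data
    return vm_banks
```

Words that are scattered in the source array land contiguously in each bank's column, which is the "neat" placement the method aims for.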
As a further improvement of the present invention: after the DMA is started and the transfer parameters are fetched from the parameter RAM, the general channel, upon receiving the transfer parameters, first judges whether the SIMT flag is valid. If the flag is invalid, the transfer proceeds in the original manner; otherwise the 16 newly added groups of BA, OA and CNT parameters are additionally fetched from the global registers.
As a further improvement of the present invention: after the parameters are configured, the DMA is started and enters the data-transfer-request generation flow, whose steps are:
S10: After the parameters are configured, the DMA is started and the parameters are read into the enhanced general physical channel EGPip. EGPip first judges whether this is an SIMT data transfer; if so, it enters the SIMT data-transfer state SIMT_START, otherwise it processes the transfer in the normal transfer mode.
S20: After entering the SIMT_START state, the write counters first judge whether all the data of the SIMT transfer have been written into the VM; if so, the state machine enters the transfer-complete state FINISH, otherwise it stays in SIMT_START. In the SIMT_START state, EGPip generates the read and write requests of the SIMT data transfer. When generating a read request, it first judges whether the read-request counter has finished counting; if the read requests have not all been issued, it continues to generate them, otherwise it stops issuing read requests. After being output from EGPip, a read request competes for the read bus inside the DMA; after winning bus arbitration, the read bus returns a read-acknowledge signal to EGPip while the read request is output from the DMA. After EGPip receives the read-acknowledge signal, it updates the read address and the read-request counter. After an SIMT write request is generated, it competes for the VM write bus; after winning bus arbitration, the VM write bus returns a write-acknowledge signal to EGPip while the write request is output to the VM. After receiving the write-acknowledge signal, EGPip issues the next write request and updates the write counter.
As a further improvement of the present invention: in step S2, a read request of the data transfer fetches the data lying in the same 256-bit space as the read address. The read-request information includes the read-request address, the read-return address, the read mask and the SIMT transfer flag. For each read request, the address at which the returned data will be written into the on-chip memory VM is also generated; this address is stored in the write-request bank Mem inside the DMA and, after the corresponding read-request data returns from the DDR3 SDRAM, is read out together with it.
As a further improvement of the present invention: the address generation requires two clock cycles. In the first clock cycle, the 8 source addresses are first divided into 4 groups and the smaller value of each group is selected; then 2 smaller values are selected from these 4 results. In the second clock cycle, the minimum of the two smaller values obtained in the previous cycle is selected, yielding the lowest of the 8 source addresses. All source addresses are then compared with this lowest address to check whether they lie within the 256-bit space, and finally the read mask is generated according to the comparison results.
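The two-cycle minimum-selection tree can be illustrated in software. In this Python sketch the function name and list representation are assumptions; only the comparison structure (8 to 4 to 2 in the first cycle, 2 to 1 in the second) follows the description above.

```python
def min_address_tree(addrs):
    """Lowest of 8 source addresses via the two-cycle comparison tree:
    cycle 1 reduces 8 -> 4 -> 2, cycle 2 reduces 2 -> 1."""
    assert len(addrs) == 8
    # cycle 1, level 1: 4 pairwise comparisons (8 -> 4)
    level1 = [min(addrs[2 * i], addrs[2 * i + 1]) for i in range(4)]
    # cycle 1, level 2: 2 comparisons (4 -> 2)
    level2 = [min(level1[0], level1[1]), min(level1[2], level1[3])]
    # cycle 2: final comparison (2 -> 1) gives the lowest source address
    return min(level2[0], level2[1])
```

Splitting the reduction across two cycles keeps each cycle to at most two comparator levels, which is a common way to meet timing in hardware.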
As a further improvement of the present invention: step S2 is controlled by a state machine. The initial state of the state machine is the idle state, SIMT_IDLE. If the DMA transfer transaction moves the data of an SIMT program, the state machine passes from SIMT_IDLE into the SIMT_Rd0to7 state. In this state, EGPip generates read requests for the data to be moved into Bank0-7 of the VM, while judging whether the data of Bank8-15 have all been read. If the data of Bank8-15 have not been fully read, the next state is SIMT_Rd8to15. If the data of Bank8-15 have been fully read, it must be judged whether the data of Bank0-7 have been fully read: if Bank8-15 are finished but Bank0-7 are not, the next clock cycle enters the SIMT_Free state; if the data of both Bank8-15 and Bank0-7 have been fully read, the next state is SIMT_IDLE. Similarly, when the state machine is in the SIMT_Rd8to15 state, EGPip generates requests to read the Bank8-15 data. If the data of Bank0-7 have not been fully read, the next state is SIMT_Rd0to7; if the data of Bank0-7 have been fully read, it must be judged whether the Bank8-15 data have been fully read; if not, the next clock cycle enters SIMT_Free, otherwise all the data have been read and the next state is SIMT_IDLE. When the state machine is in SIMT_Free, a "bubble" is inserted into the read-request generation pipeline: if the Bank0-7 data have not been fully read, the next state is SIMT_Rd0to7; if the Bank8-15 data have not been fully read, the next state is SIMT_Rd8to15; if all the data have been read, the next state is SIMT_IDLE.
As a further improvement of the present invention: in step S3, after read-return data is sent from the DDR3 SDRAM to the DMA, the DMA first judges whether this data belongs to a transfer in SIMT transfer mode. If so, the data and the write-mask information are deposited into the write-request bank Mem at the row indicated by the read-return address; otherwise it is handled according to the normal DMA transfer mode. The depth of the write-request bank Mem is 32 and each row has a readable flag; when this flag is '1', the data of that row has returned, and the data stored there can be read out together with the previously written VM address. The lowest row whose readable flag is valid is selected as the address for reading the Mem, and the corresponding data, write-mask information and VM write address are then taken out of the Mem. The write mask is the read mask returned together with the read-return data; it indicates into which banks the data will be written, and the VM generates 8 write-enable signals from this signal. From the data and VM write addresses read out, the write-request data and write-request addresses of the 8 write buses are generated. After a write request is generated, it is output to the VM write buses inside the DMA; after winning bus arbitration, the next Mem row is read and the next VM write request is generated. Write operations proceed by this rule until all the data of this transfer transaction have been transmitted.
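The lowest-ready-row selection in the write-request bank can be sketched as follows. Mem is modeled as a 32-entry Python list in which `None` marks a row whose data has not yet returned; this representation, and the tuple layout `(data, write_mask, vm_addr)`, are assumptions for illustration.

```python
MEM_DEPTH = 32  # depth of the write-request bank Mem

def drain_write_requests(mem):
    """Repeatedly select the readable row with the lowest row number and
    issue its write request, as described for step S3. mem holds MEM_DEPTH
    entries, each None or a (data, write_mask, vm_addr) tuple."""
    issued = []
    ready = [entry is not None for entry in mem]  # per-row readable flags
    while any(ready):
        row = ready.index(True)   # lowest readable row is read first
        issued.append(mem[row])
        ready[row] = False        # row consumed once its request is sent
    return issued
```

In hardware the same selection would be a priority encoder over the 32 readable flags; the loop here just makes the ordering explicit.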
Compared with existing DMA techniques, the advantages of the present invention are:
1. The DMA transfer method for the single-instruction multiple-thread mode in a GPDSP of the present invention moves the data of an SIMT program, stored irregularly in the off-chip memory space, into the kernel vector memory VM through a single DMA transfer transaction. After the move, these data are stored neatly in the VM, effectively supporting the exploitation of thread-level parallelism by vector computation.
2. The method makes full use of the advantages of DMA, such as background operation and a high data-transfer rate, to prepare the data needed by SIMT programs through DMA transfers, which can improve the execution speed of SIMT programs and increase the execution efficiency of vector operations.
3. The method configures the DMA transfer parameters through the VM, which has high data bandwidth, and can thereby greatly shorten the parameter-configuration time.
4. The method makes full use of the advantages of DMA, such as high throughput and background operation, and improves on the existing DMA technique so that off-chip data whose addresses follow a specific pattern can be moved neatly into the on-chip vector memory VM, thereby efficiently supporting vector operations in SIMT mode.
Brief description of the drawings
Fig. 1 is a schematic diagram of the transfer of the data needed by an SIMT program in a concrete application example of the present invention.
Fig. 2 is a schematic flow diagram of the present invention.
Fig. 3 is a schematic diagram of the data-transfer-request generation flow in a concrete application example of the present invention.
Fig. 4 is a schematic diagram of the read-request generation flow of a data transfer in a concrete application example of the present invention.
Fig. 5 is the control state machine of the read-request generation process of a data transfer in a concrete application example of the present invention.
Fig. 6 is a schematic diagram of the write-request generation flow of a data transfer in a concrete application example of the present invention.
Detailed description of the embodiments
The present invention is described in further detail below with reference to the drawings and specific embodiments.
The present invention is a DMA data-transfer method for the single-instruction multiple-thread (Single Instruction, Multiple Threads, SIMT) mode, realized on the basis of traditional DMA techniques.
To support this transfer method, the present invention first adds an SIMT data-transfer flag to the traditional DMA parameters; when this flag is valid, it indicates that the DMA will perform a data transfer for the single-instruction multiple-thread mode. The vector processing unit VPU contains X lanes of processing elements PE, and each lane corresponds to one storage bank of the on-chip vector memory VM. In the initial situation, the data of an SIMT program are stored irregularly in the off-chip memory, but the addresses of the data destined for any one bank of the VM change according to a specific rule. Information used to indicate the storage format of the data in the off-chip memory is therefore added to the DMA global registers, including the base address (Base Address, BA) and offset address (Offset Address, OA) of the data storage, and the amount of data to be moved for each bank (Counter, CNT). There are X groups of this information, the same as the number of banks of the VM.
The newly added parameter BA is the starting address of the data in the off-chip memory space. OA is the difference between the address of one valid datum and the address of the previous valid datum, i.e. the address offset. CNT is the amount of data stored according to the rule described by the above base and address offset, in units of "words", i.e. 32 bits. SIMT_MODE is the SIMT transfer-mode flag; when this flag is "1'b1", it indicates that the DMA is configured to perform an SIMT data transfer. The above BA, OA and CNT have X groups in total, the same as the number of banks of the vector memory VM; that is, the data in the off-chip memory space described by each group of BA, OA and CNT are moved into one corresponding bank of the VM.
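From these definitions, the i-th valid word of a column sits at BA + i*OA. A minimal Python sketch (the function name is illustrative, not from the patent):

```python
def bank_addresses(ba, oa, cnt):
    """Off-chip byte addresses of the CNT valid words described by one
    (BA, OA, CNT) group: starting at base address BA, with consecutive
    valid words OA bytes apart."""
    return [ba + i * oa for i in range(cnt)]
```

With the values the Fig. 1 example gives for bank 1 (BA = 0x44, OA = 0x40, CNT = 3), this yields the addresses 0x44, 0x84, 0xc4.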
If the parameter SIMT_MODE is configured to "1'b1", the enhanced general physical channel (EGPip) reads the conventional transfer parameters from the parameter RAM and also fetches the X groups of newly added BA, OA and CNT parameters that the VM configured into the DMA. Because the DMA data bandwidth is only X*16 bits, the data returned by one read request can be written into at most X/2 banks. To improve the efficiency of read-request generation, the design uses a ping-pong mechanism: if the DMA generates the read request for the low X/2 banks in the current beat, it generates the read request for the high X/2 banks in the next beat. The read-request information includes the read address, the read-return address, the read-mask signal and the SIMT transfer flag. To save global address lines, the VM write address is not sent along with the read request but is stored in the write-request bank Mem inside the DMA. The read-return address carried by a read request is the position in the write-request bank Mem inside the DMA into which the data should be written after it returns.
When sending a read request, the BA and OA of the X/2 banks related to the read request are first taken out and the destination address of each bank is generated; the lowest address is selected as the read-request address, and each address is then compared with this lowest address to check whether it lies within the X*16-bit range of the lowest address. On this basis the read-mask signal is generated, so that only the data lying in the same X*16-bit space as the lowest address are read. While generating the read-request address and the mask signal, EGPip generates the VM addresses into which the returned data of this read request should be written; these addresses are first stored in the write-request bank Mem inside the DMA and, after the data returns, are written out to the VM component together with the data. The Mem position into which the VM write addresses are deposited is encoded into the read-return address of the read request. After the above operations, the DMA issues the read-request information to the off-chip memory space.
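A sketch of the read-address and read-mask selection, under the assumption X = 16 (so one request serves X/2 = 8 banks and the fetch window is X*16 = 256 bits, i.e. 32 bytes); the function name and the bit-per-lane mask encoding are illustrative.

```python
WINDOW_BYTES = 32  # X*16 bits with X = 16 lanes: one 256-bit fetch window

def read_request(addrs):
    """Select the lowest per-bank target address as the read-request address
    and build a read mask with one bit per bank, set only when that bank's
    word lies in the same 256-bit window as the lowest address."""
    base = min(addrs)
    mask = 0
    for lane, a in enumerate(addrs):
        if base <= a < base + WINDOW_BYTES:
            mask |= 1 << lane
    return base, mask
```

Banks whose next word falls outside the window keep their mask bit clear and are served by a later read request.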
After read-request data returns to the DMA, it is first judged whether the returned data belongs to the SIMT transfer mode. If it does, the data and the write-mask information are deposited into the write-request bank Mem at the row indicated by the read-return address; if not, it is handled according to the other transfer modes. After the data is deposited into the write-request bank Mem, the readable flag of that Mem row is set. The depth of the write-request bank Mem is h and each row has a readable flag; the row whose readable flag is valid and whose row number is lowest is selected, and the data, write mask and VM write address stored in that row are taken out of the Mem. According to the write-mask information, X/2 write-enable signals to the VM are generated from the data and VM write addresses read out of the Mem; there are X/2 groups in total, corresponding respectively to the X/2 write buses between the DMA and the VM.
For each set of VM write buses, when a VM write request wins the VM write bus, the request will be written into the VM; at this moment the write bus inside the DMA returns an acknowledge signal to EGPip, and the counter of EGPip decrements itself after receiving this signal. When all the data have been written into the VM, the counter value drops to "0", indicating that this SIMT data-transfer transaction is complete. After the data transfer completes, the DMA sets the corresponding bit of the transfer-complete flag register and, if the interrupt enable is valid, also issues a transfer-complete interrupt for the transaction.
As shown in Fig. 1, which is a schematic diagram of the transfer of the data needed by an SIMT program in a concrete application embodiment of the present invention, the source data of the DMA are stored in the shaded regions of the DDR3 SDRAM space. There are 16 blocks in total, and their storage format can be described by base addresses and address offsets. To meet the data needs of the SIMT program, these data should be transported neatly into the 16 banks of the VM, as shown by the shaded part of the VM in the figure. In this example the source data to be moved are distributed over 8 rows of the DDR3 SDRAM, with an address range of 36'h0-36'h1bc. The shaded part of the VM is the address space into which the data will be written, ranging from 36'h80000_0000 to 36'h80000_00bc. There are 16 each of the base registers (BA), the address-offset registers (OA) and the per-bank data counts (CNT); BA indicates the starting position of each block of data in the DDR3 SDRAM, OA indicates the spacing of two valid data within a data block, and CNT is the amount of data to be moved for each bank, in units of words. In the example given in the figure, BA0 should be configured to 36'h0, BA1 to 36'h44, BA2 to 36'hc8, ..., BA14 to 36'h138 and BA15 to 36'h3c; OA0 should be configured to 16'h80, OA1 to 16'h40, OA2 to 16'h40, ..., OA14 to 16'h40 and OA15 to 16'hc0; the 16 CNTs should all be configured to 16'h3. In addition, the SIMT transfer flag should be configured to "1" and the destination address to 36'h80000_0000. After the parameters are configured, the DMA is started and the data move is completed.
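The numbers in the Fig. 1 example can be checked with a little arithmetic. The sketch below recomputes the addresses of bank 1 from its (BA, OA, CNT) triple and checks the source and destination ranges quoted above; it uses only the parameter values the example states explicitly.

```python
WORD_BYTES = 4  # CNT counts 32-bit words

# Bank 1 parameters from the example: BA1 = 36'h44, OA1 = 16'h40, CNT = 3
addrs = [0x44 + i * 0x40 for i in range(3)]  # 0x44, 0x84, 0xc4

# Every generated address lies in the stated source range 36'h0 .. 36'h1bc
in_range = all(0x0 <= a <= 0x1bc for a in addrs)

# The destination window 36'h80000_0000 .. 36'h80000_00bc holds exactly
# 16 banks * CNT words, consistent with 16 CNTs of 3
dest_words = (0x8000000BC - 0x800000000) // WORD_BYTES + 1
```

The 48-word destination window (16 banks times 3 words) matches the 48 source words the 16 triples describe, which is why the data land "neatly" with no gaps.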
Based on the above analysis of the principle, and as shown in Fig. 2, the steps of the DMA transfer method for the single-instruction multiple-thread mode in a GPDSP of the present invention are:
S1: Configure the parameters and start the DMA;
To improve the DMA parameter-configuration efficiency, the parameter configuration of the present invention is improved on the basis of the traditional approach. The DMA of the present invention has two parameter-configuration sources: one is the peripheral configuration bus (PBUS), the other is the newly added VM configuration path. The PBUS completes the configuration of the traditional parameters, while the VM configuration path completes the configuration of the off-chip storage-format information. The bit width of the PBUS configuration path is 32 bits, while that of the VM configuration path is N*32 bits; therefore, performing parameter configuration through the VM configuration path can greatly improve the configuration efficiency.
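The benefit of the wide VM configuration path can be illustrated with a back-of-the-envelope cycle count. This is a hedged sketch: it treats every parameter as one 32-bit word and assumes N = 16 for the VM path, neither of which the text fixes exactly.

```python
def config_cycles(n_words, words_per_cycle):
    """Cycles needed to write n_words 32-bit parameter words over a
    configuration path carrying words_per_cycle words per cycle."""
    return -(-n_words // words_per_cycle)  # ceiling division

# 16 groups of BA, OA and CNT = 48 parameter words (one word each, assumed)
N_PARAM_WORDS = 16 * 3
pbus_cycles = config_cycles(N_PARAM_WORDS, 1)   # PBUS path: 32 bits/cycle
vm_cycles = config_cycles(N_PARAM_WORDS, 16)    # VM path: N*32 bits, N = 16
```

Under these assumptions the wide path cuts the configuration of the new SIMT parameters from 48 cycles to 3.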
Specifically, the DMA adds several dedicated transfer parameters to support SIMT data moves, including an SIMT flag added to the transfer control word, which indicates whether the current DMA transfer transaction is a data move oriented to the single-instruction multiple-thread mode.
Since the in-core vector memory VM contains 16 banks, 16 groups of base addresses BA and address offsets OA are added; they are used to compute the positions of the 16 columns of data in the off-chip DDR3 SDRAM and to generate the access addresses of the read requests.
In addition, 16 counters CNT are added to indicate the respective data volumes of the 16 columns of data. The above 16 groups of BA, OA and CNT are written into the global registers of the DMA through VM, while the SIMT flag is written into the DMA parameter RAM through the PBUS bus. After the parameter configuration in the VM component is complete, the remaining transfer parameters are configured through the PBUS bus and the DMA is started.
After the DMA starts, the transfer parameters are first fetched from the parameter RAM. Upon receiving the transfer parameters, the general channel first judges whether the SIMT flag is valid. If the flag is invalid, the transfer proceeds in the original mode; otherwise the newly added 16 groups of BA, OA and CNT parameters are further fetched from the global registers.
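The channel's dispatch on the SIMT flag can be paraphrased as a small sketch (all names here are mine, not the patent's):

```python
def dispatch_transfer(params, global_regs):
    """Choose the transfer mode from the parameter-RAM contents.

    Assumption: params carries an 'simt' flag; the extended BA/OA/CNT
    groups live in the DMA global registers and are fetched only when
    the flag is valid.
    """
    if not params.get("simt"):
        return ("normal", params)
    extended = {k: global_regs[k] for k in ("BA", "OA", "CNT")}
    return ("simt", {**params, **extended})

mode, cfg = dispatch_transfer({"simt": 1},
                              {"BA": [0x0], "OA": [0x4], "CNT": [3]})
```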
S2: transfer the read requests;
Read requests are generated according to the transfer parameters, and at the same time the VM write addresses are stored into the write-request memory Mem. Upon receiving a read request from the DMA, the target peripheral returns data to the DMA. The DMA receives the returned data and stores it into the write-request memory Mem.
S3: transfer the write requests;
The DMA takes the read-return data and the VM write address out of a valid entry of the write-request memory Mem and issues them to VM, while a counter counts accordingly. If the count is not finished, data transfer continues through the above process; otherwise the transfer-complete flag register is set.
Steps S2 and S3 above, namely starting the DMA and performing the data move, take place after the transfer parameters required by the SIMT data transfer have been configured. In this process, the DMA first generates the read requests of the SIMT data transfer, fetching the valid data that lie within the same data-bandwidth address range. While generating the read requests, it also generates the VM addresses into which the data will be written, and stores these addresses in the write-request memory Mem inside the DMA. The read-request information includes the read-request enable, the read-request address, the read-return address, the SIMT data-transfer flag, etc. After the read-request data returns to the DMA, it is first stored in the write-request memory Mem; when VM is writable, the data and the previously generated write address are taken out of Mem and issued to VM.
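The decoupling of read issue and write drain through Mem can be sketched as follows (a toy model; slot management is simplified and all names are mine):

```python
class WriteRequestMem:
    """Toy model of the DMA-internal Mem: the VM write address is stored
    at read-issue time, the data joins it when the (possibly out-of-order)
    read return arrives, and ready rows are drained to VM."""

    def __init__(self, depth):
        self.rows = [None] * depth          # None = free slot

    def issue_read(self, vm_addr):
        slot = self.rows.index(None)        # lowest free row
        self.rows[slot] = {"vm_addr": vm_addr, "data": None}
        return slot                         # doubles as the read-return address

    def read_return(self, slot, data):
        self.rows[slot]["data"] = data

    def drain(self):
        """Pop every row whose data has returned, in row order."""
        out = []
        for i, row in enumerate(self.rows):
            if row and row["data"] is not None:
                out.append((row["vm_addr"], row["data"]))
                self.rows[i] = None
        return out

mem = WriteRequestMem(4)
s0 = mem.issue_read(0x80000000)
s1 = mem.issue_read(0x80000020)
mem.read_return(s1, "B")                    # returns arrive out of order
mem.read_return(s0, "A")
drained = mem.drain()
```

The point of the design is visible here: because the write address was parked in Mem at issue time, out-of-order read returns still pair with the right VM destination.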
In the above process, owing to the data-bandwidth limit, one read request can read at most the data of X/2 banks, so a ping-pong mechanism is used to raise the read-request issue rate: the data of the low X/2 banks and the high X/2 banks are read in turn. If the read request issued in the current clock beat reads the data of the low X/2 banks, then the next read request reads the data of the high X/2 banks.
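The ping-pong alternation between the low and high bank halves can be sketched as follows (X and the beat count are placeholders of mine):

```python
def pingpong_targets(x, beats):
    """Alternate the read-request target between the low X/2 and the
    high X/2 banks, one half per clock beat."""
    low = list(range(x // 2))
    high = list(range(x // 2, x))
    return [low if beat % 2 == 0 else high for beat in range(beats)]

# With X = 16 banks, beats alternate banks 0-7 and banks 8-15.
targets = pingpong_targets(16, 4)
```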
The DMA has X/2 independent write buses to the VM port, each with a data bandwidth of 32 bits. These X/2 write buses correspond to X/2 different banks inside VM (the low X/2 or the high X/2), i.e. the data on each write bus enters a different bank of VM. Inside the DMA, a write-request memory Mem is provided to store the read-return data, the VM write addresses and other information; the depth of this memory is h (the specific value of h can be chosen as a trade-off between transfer efficiency and hardware cost).
Figure 3 illustrates the data-transfer-request generation flow of the present invention in a concrete application example. The detailed flow is:
S10: after the parameters are configured, the DMA is started and the parameters are read into the enhanced general channel EGPip. EGPip first determines whether this is an SIMT data transfer: if so, it enters the SIMT data-transfer state SIMT_START; otherwise the transfer is handled in the normal transfer mode.
S20: after entering the SIMT_START state, the write counters are first used to judge whether all the data of the SIMT transfer has been written into VM. If so, the state machine enters the transfer-complete state FINISH; otherwise it stays in SIMT_START. In the SIMT_START state, EGPip generates the read and write requests for the SIMT data transfer. When generating a read request it first judges whether the read-request counter has finished counting: if the read requests have not all been issued, it continues to generate read requests, otherwise it stops issuing them. After being output by EGPip, a read request competes for the read bus inside the DMA; once bus arbitration is won, the read bus returns a read-confirm signal to EGPip while the read request is output from the DMA. Upon receiving the read-confirm signal, EGPip updates the read address and the read-request counter. After an SIMT write request is generated, it competes for the VM write bus; once bus arbitration is won, the VM write bus returns a write-confirm signal to EGPip while the write request is output to VM. Upon receiving the write-confirm signal, EGPip issues the next write request and updates the write counter.
Figure 4 illustrates the read-request generation flow of the data transfer of the present invention in a concrete application example. A read request of an SIMT data transfer fetches the data whose read addresses fall within a 256-bit space. The read-request information includes the read-request address, the read-return address, the read mask and the SIMT transfer flag, etc. Since the returned data from the off-chip storage space arrives at the DMA out of order, for each read request the address at which the read-return data is to be written into the in-core storage space VM must also be generated. This address, however, is not issued together with the read request; it is stored in advance in the write-request memory Mem inside the DMA and read out together with the data after the corresponding read-request data has returned from the DDR3 SDRAM. Generating a read request is precisely the process of generating the above request information. Because of the limit on the design clock frequency, the generation of the read address takes two clock cycles. In the first clock cycle the 8 source addresses are divided into 4 groups, the smaller value in each group is selected, and then 2 smaller values are selected again from these 4 results. In the second clock cycle the minimum of the two smaller values obtained in the previous cycle is selected, thereby obtaining the lowest address among the 8 source addresses. All source addresses are then compared with the lowest address to check whether they fall within the 256-bit space, and the read mask is finally generated from the comparison results. Considering SIMT transfer performance and hardware cost, a write-request memory Mem of depth 32 is provided to hold the SIMT read-return data, the VM write addresses and other information. Each row of Mem has a signal indicating whether it is available; when a read request is issued, the lowest available row is selected as the read-return address of the data. The VM write address, generated at the same time, is first stored in the Mem location indicated by the read-return address.
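The two-cycle tournament minimum and the mask generation described above can be sketched as follows (interpreting the 256-bit space as a 32-byte window starting at the minimum address, which is my assumption):

```python
def read_mask(src_addrs):
    """Find the lowest of 8 source addresses with a two-level tournament,
    mirroring the two-cycle hardware split, then mask in every address
    that falls within the 256-bit (32-byte) window above it."""
    # "Cycle 1": pairwise minima of the 4 groups, then reduce 4 results to 2.
    quads = [min(src_addrs[i], src_addrs[i + 1]) for i in range(0, 8, 2)]
    pair = [min(quads[0], quads[1]), min(quads[2], quads[3])]
    # "Cycle 2": the final minimum, then the window compare for the mask.
    lowest = min(pair)
    mask = [1 if lowest <= a < lowest + 32 else 0 for a in src_addrs]
    return lowest, mask

# Seven addresses within one 32-byte window, one (0x80) outside it.
lowest, mask = read_mask([0x40, 0x44, 0x48, 0x4c, 0x50, 0x54, 0x58, 0x80])
```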
Figure 5 illustrates the control state machine of the read-request generation process of the data transfer of the present invention in a concrete application example. The initial state of the state machine is the idle state, i.e. SIMT_IDLE. If the DMA transfer transaction moves the data of an SIMT program, the state machine moves from SIMT_IDLE to SIMT_Rd0to7; in this state, EGPip generates the read requests for the data to be moved into Bank0-7 of VM while judging whether the data of Bank8-15 has been completely read. If the data of Bank8-15 has not been completely read, the next state is SIMT_Rd8to15. If the data of Bank8-15 has been completely read, it must further be judged whether the data of Bank0-7 has been completely read: if Bank8-15 is finished but Bank0-7 is not, the next clock cycle enters the SIMT_Free state; if the data of both Bank8-15 and Bank0-7 has been completely read, the next state of the state machine is SIMT_IDLE. Similarly, when the state machine is in SIMT_Rd8to15, EGPip generates the requests that read the Bank8-15 data. If the data of Bank0-7 has not been completely read, the next state is SIMT_Rd0to7. If the data of Bank0-7 has been completely read, it must be judged whether the Bank8-15 data has been completely read: if not, the next clock cycle enters SIMT_Free; otherwise all data has been read and the next state is SIMT_IDLE. When the state machine is in SIMT_Free, a "bubble" is inserted into the read-request generation pipeline. If the Bank0-7 data has not been completely read, the next state is SIMT_Rd0to7; if the Bank8-15 data has not been completely read, the next state is SIMT_Rd8to15; if all data has been completely read, the next state is SIMT_IDLE.
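The transition rules of this state machine can be written out as a next-state function (a sketch of Figure 5 as I read it, not a definitive implementation):

```python
def next_state(state, low_done, high_done):
    """Next-state function of the read-request state machine.

    low_done / high_done flag whether the Bank0-7 / Bank8-15 data has
    been completely read; SIMT_Free corresponds to the pipeline "bubble".
    """
    if state == "SIMT_Rd0to7":
        if not high_done:
            return "SIMT_Rd8to15"
        return "SIMT_Free" if not low_done else "SIMT_IDLE"
    if state == "SIMT_Rd8to15":
        if not low_done:
            return "SIMT_Rd0to7"
        return "SIMT_Free" if not high_done else "SIMT_IDLE"
    if state == "SIMT_Free":
        if not low_done:
            return "SIMT_Rd0to7"
        if not high_done:
            return "SIMT_Rd8to15"
        return "SIMT_IDLE"
    return "SIMT_IDLE"
```

The SIMT_Free bubble appears only when one half is finished but the other half's request stream cannot alternate, which is exactly the asymmetric case in the text.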
Figure 6 illustrates the write-request generation flow of the data transfer of the present invention in a concrete application example. First, after the read-return data arrives at the DMA from the DDR3 SDRAM, the DMA judges whether it is data transferred in the SIMT transfer mode. If so, the data and the write-mask information are stored into the row of the write-request memory Mem indicated by the read-return address; otherwise it is handled in the normal DMA transfer mode. The depth of the write-request memory Mem is 32, and each row has a readable flag; when this flag signal is "1", it indicates that the data of that row has returned, and the stored data and the previously written VM address can then be read out. The lowest row whose readable flag is valid is selected as the address for reading Mem, and the corresponding data, write-mask information and VM write address are taken out of Mem. The write mask here is the read mask returned together with the read-return data; it indicates into which banks the data will be written, and VM generates 8 write-enable signals from this signal. The read-out data and the VM write addresses only need to be simply unpacked to produce the write-request data and write-request addresses for the 8 write buses. After a write request is generated, it is output to the write-VM bus inside the DMA; after bus arbitration is won, the next Mem row is read and the next write-VM request is generated. Write operations proceed by the above rule until all the data of this transfer transaction has been transferred.
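Selecting the lowest ready Mem row and unpacking it into per-bus write requests can be sketched as follows (8 buses carrying one word each is my simplification of the unpack step):

```python
def drain_one_row(rows):
    """Pick the lowest Mem row whose readable flag is set and unpack it
    into per-write-bus requests gated by the write mask.

    Each row is (ready, data_words, vm_addrs, mask); only the banks
    whose mask bit is 1 receive a write request.
    """
    for i, (ready, data, addrs, mask) in enumerate(rows):
        if ready:
            reqs = [(addrs[b], data[b]) for b in range(8) if mask[b]]
            return i, reqs
    return None, []

rows = [
    (False, None, None, None),              # data not yet returned
    (True, list("ABCDEFGH"),
     [0x00, 0x04, 0x08, 0x0c, 0x10, 0x14, 0x18, 0x1c],
     [1, 0, 1, 0, 0, 0, 0, 1]),             # write mask gates 3 banks
]
row_idx, reqs = drain_one_row(rows)
```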
The above are merely preferred embodiments of the present invention, and the protection scope of the present invention is not limited to the above embodiments; all technical solutions falling under the concept of the present invention belong to the protection scope of the present invention. It should be pointed out that, for those of ordinary skill in the art, several improvements and modifications that do not depart from the principles of the present invention should also be regarded as within the protection scope of the present invention.

Claims (7)

1. A DMA transfer method oriented to the single-instruction multiple-thread mode in a GPDSP, characterized in that one DMA transfer transaction is configured to move the irregularly stored data of an SIMT program from the off-chip storage space into the vector memory VM of the core; after the move, the data is stored neatly in the vector memory VM so that the vector computation components can access it in parallel;
The flow of the data move is:
Step S1: configure the parameters and start the DMA;
An SIMT flag is added to the transfer control word to indicate whether the current DMA transfer transaction is a data move oriented to the single-instruction multiple-thread mode; 16 groups of base addresses BA and address offsets OA are added, used to compute the positions of the 16 columns of data in the off-chip DDR3 SDRAM and to generate the access addresses of the read requests; 16 counters CNT are added to indicate the respective data volumes of the 16 columns of data; the above 16 groups of BA, OA and CNT are written into the global registers of the DMA through the vector memory VM, and the SIMT flag is written into the DMA parameter RAM through the PBUS bus;
After the parameter configuration is complete, the DMA is started and the transfer parameters are fetched from the parameter RAM;
Step S2: transfer the read requests;
Read requests are generated according to the transfer parameters, and at the same time the VM write addresses are stored into the write-request memory Mem; upon receiving a read request from the DMA, the target peripheral returns data to the DMA; the DMA receives the returned data and stores it into the write-request memory Mem;
Step S3: transfer the write requests;
The DMA takes the read-return data and the VM write address out of a valid entry of the write-request memory Mem and issues them to the vector memory VM, while a counter counts accordingly; if the count is not finished, data transfer continues, otherwise the transfer-complete flag register is set.
2. The DMA transfer method oriented to the single-instruction multiple-thread mode in a GPDSP according to claim 1, characterized in that after the DMA is started and the transfer parameters are fetched from the parameter RAM, upon receiving the transfer parameters the general channel first judges whether the SIMT flag is valid; if the flag is invalid, the transfer proceeds in the original mode; otherwise the newly added 16 groups of BA, OA and CNT parameters are further fetched from the global registers.
3. The DMA transfer method oriented to the single-instruction multiple-thread mode in a GPDSP according to claim 1 or 2, characterized in that after the parameters are configured, starting the DMA enters the data-transfer-request generation flow, whose steps are:
Step S10: after the parameters are configured, the DMA is started and the parameters are read into the enhanced general channel EGPip; EGPip first determines whether this is an SIMT data transfer; if so, it enters the SIMT data-transfer state SIMT_START, otherwise the transfer is handled in the normal transfer mode;
Step S20: after entering the SIMT_START state, the write counters are first used to judge whether all the data of the SIMT transfer has been written into VM; if so, the state machine enters the transfer-complete state FINISH, otherwise it stays in SIMT_START; in the SIMT_START state, EGPip generates the read and write requests for the SIMT data transfer; when generating a read request it first judges whether the read-request counter has finished counting; if the read requests have not all been issued, it continues to generate read requests, otherwise it stops issuing them; after being output by EGPip, a read request competes for the read bus inside the DMA; once bus arbitration is won, the read bus returns a read-confirm signal to EGPip while the read request is output from the DMA; upon receiving the read-confirm signal, EGPip updates the read address and the read-request counter; after an SIMT write request is generated, it competes for the VM write bus; once bus arbitration is won, the VM write bus returns a write-confirm signal to EGPip while the write request is output to VM; upon receiving the write-confirm signal, EGPip issues the next write request and updates the write counter.
4. The DMA transfer method oriented to the single-instruction multiple-thread mode in a GPDSP according to claim 1 or 2, characterized in that in said step S2, a read request of the data transfer fetches the data whose read addresses fall within a 256-bit space; the read-request information includes the read-request address, the read-return address, the read mask and the SIMT transfer flag; for each read request the address at which the read-return data is to be written into the in-core storage space VM is also generated; this address is stored correspondingly in the write-request memory Mem inside the DMA and is read out together with the read-request data after the latter returns from the DDR3 SDRAM.
5. The DMA transfer method oriented to the single-instruction multiple-thread mode in a GPDSP according to claim 4, characterized in that the generation of the address at which the read-return data is written into the in-core storage space VM takes two clock cycles: in the first clock cycle the 8 source addresses are divided into 4 groups, the smaller value in each group is selected, and then 2 smaller values are selected again from these 4 results; in the second clock cycle the minimum of the two smaller values obtained in the previous cycle is selected, giving the lowest address among the 8 source addresses; all source addresses are then compared with the lowest address to check whether they fall within the 256-bit space, and the read mask is finally generated from the comparison results.
6. The DMA transfer method oriented to the single-instruction multiple-thread mode in a GPDSP according to claim 4, characterized in that said step S2 is controlled by a state machine: the initial state of the state machine is the idle state, i.e. SIMT_IDLE; if the DMA transfer transaction moves the data of an SIMT program, the state machine moves from SIMT_IDLE to SIMT_Rd0to7; in this state, EGPip generates the read requests for the data to be moved into Bank0-7 of VM while judging whether the data of Bank8-15 has been completely read; if the data of Bank8-15 has not been completely read, the next state is SIMT_Rd8to15; if the data of Bank8-15 has been completely read, it must further be judged whether the data of Bank0-7 has been completely read; if Bank8-15 is finished but Bank0-7 is not, the next clock cycle enters the SIMT_Free state; if the data of both Bank8-15 and Bank0-7 has been completely read, the next state of the state machine is SIMT_IDLE; similarly, when the state machine is in SIMT_Rd8to15, EGPip generates the requests that read the Bank8-15 data; if the data of Bank0-7 has not been completely read, the next state is SIMT_Rd0to7; if the data of Bank0-7 has been completely read, it must be judged whether the Bank8-15 data has been completely read; if not, the next clock cycle enters SIMT_Free, otherwise all data has been read and the next state is SIMT_IDLE; when the state machine is in SIMT_Free, a "bubble", said "bubble" being a flag, is inserted into the read-request generation pipeline; if the Bank0-7 data has not been completely read, the next state is SIMT_Rd0to7; if the Bank8-15 data has not been completely read, the next state is SIMT_Rd8to15; if all data has been completely read, the next state is SIMT_IDLE.
7. The DMA transfer method oriented to the single-instruction multiple-thread mode in a GPDSP according to claim 1 or 2, characterized in that in said step S3, after the read-return data arrives at the DMA from the DDR3 SDRAM, the DMA first judges whether it is data transferred in the SIMT transfer mode; if so, the data and the write-mask information are stored into the row of the write-request memory Mem indicated by the read-return address; otherwise it is handled in the normal DMA transfer mode; the depth of the write-request memory Mem is 32, and each row has a readable flag; when this flag signal is "1", it indicates that the data of that row has returned, and the stored data and the previously written VM address can then be read out; the lowest row whose readable flag is valid is selected as the address for reading Mem, and the corresponding data, write-mask information and VM write address are taken out of Mem; the write mask is the read mask returned together with the read-return data, indicating into which banks the data will be written, and VM generates 8 write-enable signals from this signal; the read-out data and the VM write addresses produce the write-request data and write-request addresses for the 8 write buses; after a write request is generated, it is output to the write-VM bus inside the DMA; after bus arbitration is won, the next Mem row is read and the next write-VM request is generated; write operations proceed until all the data of this transfer transaction has been transferred.

Publications (2)

Publication Number Publication Date
CN105302749A (en) 2016-02-03
CN105302749B (en) 2018-07-24


