CN105302749B - DMA transfer method for single-instruction multiple-thread mode in a GPDSP - Google Patents
DMA transfer method for single-instruction multiple-thread mode in a GPDSP
- Publication number
- CN105302749B (application CN201510718877.9A)
- Authority
- CN
- China
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Classifications
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F13/00—Interconnection of, or transfer of information or other signals between, memories, input/output devices or central processing units
- G06F13/14—Handling requests for interconnection or transfer
- G06F13/20—Handling requests for interconnection or transfer for access to input/output bus
- G06F13/28—Handling requests for interconnection or transfer for access to input/output bus using burst mode transfer, e.g. direct memory access DMA, cycle steal
- G06F13/282—Cycle stealing DMA
Abstract
The invention discloses a DMA transfer method for the single-instruction multiple-thread (SIMT) mode in a GPDSP. A single DMA transfer transaction is configured to move the non-regularly stored data of an SIMT program from the off-chip storage space to the vector memory VM of the kernel; after the move, the data are stored neatly in the vector memory VM so that the vector computation units can access them in parallel. The method is simple in principle and convenient to operate, the transfer size is configurable, and data can be supplied to SIMT programs efficiently in a background mode, which not only supports the execution of SIMT programs well but also greatly increases the computational performance of the GPDSP.
Description
Technical field
The present invention relates generally to the field of direct memory access (Direct Memory Access, DMA) components in general-purpose digital signal processors (General Purpose Digital Signal Processor, GPDSP), and in particular to a DMA transfer method for the data demands of the single-instruction multiple-thread (Single Instruction, Multiple Threads, SIMT) mode. Through a DMA transfer transaction, the data of an SIMT program stored non-regularly in the off-chip storage space are moved neatly into the on-chip vector memory (Vector Memory, VM), so that the vector lanes can finally execute in parallel.
Background art
A digital signal processor (Digital Signal Processor, DSP) is a typical embedded microprocessor that is widely used in embedded systems. With strong data-processing capability, good programmability, flexible application, and low power consumption, it has brought enormous opportunities to the development of signal processing, and its application fields have extended to all aspects of social and economic development. In today's application fields such as communications, image processing, and radar signal processing, as the amount of data to be processed grows and the requirements on computational precision and real-time performance increase, microprocessors of ever higher performance are needed.
Unlike a central processing unit (CPU), a DSP has the following characteristics: 1) strong computing capability, with more attention paid to real-time computation than to control and transaction processing; 2) dedicated hardware support for typical signal processing, such as multiply-accumulate operations and linear addressing; 3) the common features of embedded microprocessors: address and instruction paths no wider than 32 bits, and most data paths no wider than 32 bits; imprecise interrupts; a working mode of short-term offline debugging followed by long-term online resident operation (rather than the debug-then-run method of general-purpose CPUs); 4) peripheral interfaces centered on fast peripherals, which is especially beneficial for online transmission and reception of high-speed AD/DA data, with high-speed direct links between DSPs also supported.
General scientific computing requires high-performance DSPs, yet traditional DSPs have the following shortcomings when used for scientific computing: 1) the bit width is small, so computational precision and addressing space are insufficient, whereas general scientific computing applications need at least 64-bit precision; 2) software and hardware support for task management, file control, process scheduling, and interrupt management is lacking, in other words the hardware environment for an operating system is missing, which makes general multi-task management inconvenient; 3) a unified high-level-language programming model is lacking, and support for multiple cores, vectors, data parallelism, and so on basically relies on assembly programs, which is inconvenient for general programming; 4) the local-host program-debugging model is not supported, and only cross-debugging emulation from another machine is possible. These problems seriously limit the application of DSPs in the field of general scientific computing.
Practitioners have proposed "a general-purpose computing digital signal processor GPDSP", which discloses a novel architecture, the multi-core microprocessor GPDSP, that can retain the essential embedded characteristics of a DSP and its advantage of high performance at low power while efficiently supporting general scientific computing. This architecture can overcome the above shortcomings of common DSPs in scientific computing, and can simultaneously provide efficient support for 64-bit high-performance computing and embedded high-precision signal processing. It has the following features: 1) direct representation of double-precision floating-point and 64-bit fixed-point data, with general registers, data buses, and instruction bit widths of 64 bits or more, and address buses of 40 bits or more; 2) tight heterogeneous multi-core coupling of CPU and DSP, where the CPU supports a complete operating system and the scalar unit of each DSP core supports an operating-system microkernel; 3) a unified programming model covering the CPU cores, the DSP cores, and the vector array structure within each DSP core; 4) retention of cross-machine manual debugging while also providing a local CPU host debugging mode; 5) retention of the basic features of common DSPs apart from the bit width.
The above GPDSP usually forms a processing array from multiple homogeneous 64-bit processing units to obtain higher floating-point computing capability. However, since the amount of data a GPDSP must handle is huge, large amounts of data need to be exchanged between the on-chip storage units of the GPDSP and the off-chip storage components. Data stored in the off-chip storage space must first be moved to the on-chip storage space so that the kernel can compute on them, and the results computed by the kernel must be moved back to the off-chip storage space for preservation. The data transfer rate between the on-chip storage units and the off-chip storage components thus becomes a key factor limiting the processing speed of the GPDSP. Like general-purpose microprocessors, the GPDSP also faces the "memory wall" problem.
Direct memory access (Direct Memory Access, DMA) is a technology that alleviates the "memory wall" problem well. While the processor cores perform computation, DMA can move data at high speed in a background working mode, and the moving process requires no participation by the processor cores. Because DMA technology overlaps the computation of the kernel with the data moving of the storage units, it reduces to a certain extent the influence of the data transfer rate between the on-chip storage units and the off-chip storage components on the processing performance of the GPDSP.
Continuously improving processor performance has always been the goal pursued by designers. To obtain higher computational performance, processor-architecture research has shifted from exploiting traditional instruction-level parallelism to exploiting higher-level thread-level parallelism. In the course of exploiting thread-level parallelism, how to schedule parallel thread execution efficiently has become the focus of discussion in academia and industry. The appearance of single-instruction multiple-thread (Single Instruction, Multiple Threads, SIMT) technology made multi-thread scheduling well realizable; the most outstanding representative is the SIMT multi-thread scheduling technology used in the graphics processors (GPUs) of NVIDIA, whose unique structure and thread-scheduling model provide users with good programmability together with efficient multi-threaded parallel data-processing capability, and whose performance in many fields has surpassed traditional multi-core processors. If the SIMT scheduling techniques of GPUs are fused into other processor architectures, those processors will obtain obvious performance improvements.
The vector processing of a GPDSP is completed by the vector processing unit (VPU) and the vector memory-access unit (VMU). Existing vector processing supports the single-instruction multiple-data (Single Instruction, Multiple Data, SIMD) execution model: multiple parallel processing elements (PEs) integrated in the VPU execute arithmetic operations in SIMD fashion, while the VMU executes vector memory-access operations in SIMD fashion and supplies high-bandwidth vector data to the VPU. But as the SIMD width grows, the cost of the global stalls caused by global exceptions becomes larger and larger, so actual operating efficiency does not increase as expected. Therefore, on the basis of exploiting data-level parallelism with SIMD, there is an urgent need to exploit higher-level parallelism, such as thread-level parallelism, to improve system operating efficiency. However, current vector memory-access operations only provide access to vector data whose addresses are contiguous or follow a specific pattern such as equal strides, which cannot meet the needs of multi-threaded parallel execution.
To effectively exploit thread-level parallelism for vector computation, vector memory access should meet the needs of SIMT programs. There can be two solutions: first, modify the vector access method; second, use DMA data-moving operations so that source data which are stored non-regularly in the off-chip storage space, but whose addresses follow a specific pattern, can be placed neatly into the VM.
Summary of the invention
The technical problem to be solved by the present invention is: in view of the technical problems of the prior art, the present invention provides a DMA transfer method for the single-instruction multiple-thread mode in a GPDSP that is simple in principle, convenient to operate, and flexibly configurable, and that can obviously improve the computational performance of the processor.
In order to solve the above technical problem, the specific technical solution adopted by the present invention is:
A DMA transfer method for the single-instruction multiple-thread mode in a GPDSP: a single DMA transfer transaction is configured to move the non-regularly stored data of an SIMT program from the off-chip storage space to the vector memory VM of the kernel. After the move, the data are stored neatly in the vector memory VM so that the vector computation units can access them in parallel.
As a further improvement of the present invention, the flow of the data moving is:
S1: configure the parameters and start the DMA.
An SIMT flag is added to the transfer control word to indicate whether this DMA transfer transaction is a data move for the single-instruction multiple-thread mode. Sixteen groups of base addresses BA and address offsets OA are added to compute the positions of the 16 data columns in the off-chip DDR3 SDRAM; they are used to generate the access addresses of read requests. Sixteen counters CNT are added to indicate the respective data volume of each of the 16 data columns. The above 16 groups of BA, OA, and CNT are configured into the global registers of the DMA by the vector memory VM, while the SIMT flag is written into the DMA parameter RAM over the PBUS bus.
After the parameter configuration is completed, the DMA is started and the transfer parameters are fetched from the parameter RAM.
S2: transfer the read requests.
Read requests are generated according to the transfer parameters, while the VM write addresses are kept in the write-request memory bank Mem. After receiving a read request from the DMA, the target peripheral returns data to the DMA; the DMA receives the returned data and deposits it into the write-request memory bank Mem.
S3: transfer the write requests.
The DMA takes the read-return data and the VM write address out of a valid entry of the write-request memory bank Mem and issues them to the vector memory component VM, while the counters count accordingly. If the count is not finished, data transfer continues by the above process; otherwise the transfer-finished flag register is set.
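Read as a whole, steps S1-S3 amount to a gather loop. The following Python sketch is only an illustrative software model of that flow, not the hardware (a sequential loop stands in for the pipelined read/write machinery, and the DDR contents and helper names are invented), under the assumption that the k-th word destined for a bank sits at address BA + k*OA:

```python
NUM_BANKS = 16  # one VM bank per data column, as in the description

def simt_dma_transfer(ddr, ba, oa, cnt):
    """Software model of one SIMT DMA transaction.
    S1: ba/oa/cnt (one value per bank) are the configured parameters.
    S2: issue a read per datum at address BA + k*OA and collect the return.
    S3: write the return into its bank's column and count down to done."""
    vm = [[] for _ in range(NUM_BANKS)]
    remaining = sum(cnt)                       # models the write counters
    for bank in range(NUM_BANKS):
        for k in range(cnt[bank]):
            src = ba[bank] + k * oa[bank]      # S2: read-request address
            data = ddr[src]                    # target peripheral returns data
            vm[bank].append(data)              # S3: issue write to the VM
            remaining -= 1
    done = (remaining == 0)                    # transfer-finished flag
    return vm, done
```

After the call, each `vm[bank]` column holds its data contiguously, which is the "neat" layout the vector units need.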
As a further improvement of the present invention: after the DMA is started and the transfer parameters are fetched from the parameter RAM, the general channel first judges, upon receiving the transfer parameters, whether the SIMT flag is valid. If the flag is invalid, the transfer proceeds in the original manner; otherwise the 16 newly added groups of BA, OA, and CNT parameters are additionally fetched from the global registers.
As a further improvement of the present invention: after the parameters are configured, the DMA is started and enters the data-transfer-request generation flow, whose steps are:
S10: after the parameters are configured, the DMA is started and the parameters are read into the enhanced general physical channel EGPip. EGPip first judges whether this is an SIMT data transfer; if so, it enters the SIMT data-transfer state SIMT_START, otherwise the transfer is handled in the normal transfer mode.
S20: after entering the SIMT_START state, the write counters are first used to judge whether all the data of the SIMT transfer have been written into the VM. If so, the state machine enters the transfer-finished state FINISH; otherwise it stays in SIMT_START. In the SIMT_START state, EGPip generates the read and write requests of the SIMT data transfer. When generating a read request, it first judges whether the read-request counter has finished counting: if the read requests have not all been issued, it continues generating read requests, otherwise it stops issuing them. After being output by EGPip, a read request competes for the read bus inside the DMA; once bus arbitration is obtained, the read bus returns a read-confirm signal to EGPip while the read request is output from the DMA. After EGPip receives the read-confirm signal, it updates the read address and the read-request counter. After an SIMT write request is generated, it competes for the VM write bus; once bus arbitration is obtained, the VM write bus returns a write-confirm signal to EGPip while the write request is output to the VM. After receiving the write-confirm signal, EGPip issues the next write request and updates the write counter.
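The grant-then-update handshake in S20 can be shown in miniature: a request holds its address and counter until the bus confirm arrives. This is a hedged sketch, not the hardware interface; the 32-byte step is an assumption standing in for one 256-bit request, and the function name is invented:

```python
def issue_on_grant(granted, rd_addr, rd_counter, step=32):
    """One beat of the S20 read-request handshake: the request competes for
    the internal read bus, and only after the arbitration grant (the
    read-confirm signal) does EGPip update its read address and counter."""
    if granted:
        return rd_addr + step, rd_counter - 1  # confirm received: advance
    return rd_addr, rd_counter                 # no grant: hold and retry
```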
As a further improvement of the present invention: in step S2, a read request of the data transfer fetches the data lying in the same 256-bit space as the read address. The read-request information includes the read-request address, the read-return address, the read mask, and the SIMT transfer flag. For each read request, the address at which the returned data will be written into the on-chip storage space VM is also generated; this address is kept in the write-request memory bank Mem inside the DMA until the corresponding read-request data returns from the DDR3 SDRAM, after which the two are read out together.
As a further improvement of the present invention: the generation of the addresses takes two clock cycles. In the first clock cycle, the 8 source addresses are first divided into 4 groups and the smaller value of each group is selected; then 2 smaller values are selected from these 4 results. In the second clock cycle, the minimum of the two smaller values obtained in the previous beat is selected, giving the minimum address among the 8 source addresses. All the source addresses are then compared with this lowest address to check whether they lie within the 256-bit space, and finally the read mask is generated according to the comparison results.
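A minimal model of this two-stage minimum selection and mask generation follows; Python stands in for the comparator tree, and the "cycles" are simply the reduction stages of the tournament:

```python
def min_tree_8(addrs):
    """Two-cycle tournament minimum over 8 source addresses, as described:
    cycle 1 reduces 8 -> 4 -> 2 pairwise minima; cycle 2 reduces 2 -> 1."""
    assert len(addrs) == 8
    lvl1 = [min(addrs[2 * i], addrs[2 * i + 1]) for i in range(4)]  # cycle 1
    lvl2 = [min(lvl1[0], lvl1[1]), min(lvl1[2], lvl1[3])]           # cycle 1
    return min(lvl2[0], lvl2[1])                                    # cycle 2

def read_mask(addrs, window_bytes=32):
    """Mask bit i is set when source address i falls inside the 256-bit
    (32-byte) window that starts at the minimum address."""
    base = min_tree_8(addrs)
    return [1 if base <= a < base + window_bytes else 0 for a in addrs]
```

Only the banks whose mask bit is set are served by this read request; the others wait for a later request whose window covers them.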
As a further improvement of the present invention: step S2 is controlled by a state machine. The initial state of the state machine is the idle state, i.e. the SIMT_IDLE state. If the DMA transfer transaction moves the data of an SIMT program, the machine goes from SIMT_IDLE into the SIMT_Rd0to7 state; in this state, EGPip generates the read requests for the data to be moved into Banks 0-7 of the VM while judging whether the data of Banks 8-15 have been completely read. If the data of Banks 8-15 have not been completely read, the next state is SIMT_Rd8to15; if they have, it must be judged whether the data of Banks 0-7 have been completely read. If the data of Banks 8-15 have been completely read but those of Banks 0-7 have not, the next clock cycle enters the SIMT_Free state; if the data of Banks 8-15 and Banks 0-7 have all been read, the next state is SIMT_IDLE. Similarly, when the state machine is in SIMT_Rd8to15, EGPip generates the requests for reading the data of Banks 8-15. If the data of Banks 0-7 have not been completely read, the next state is SIMT_Rd0to7; if they have, it must be judged whether the data of Banks 8-15 have been completely read. If not, the next clock cycle enters SIMT_Free; otherwise all data have been read and the next state is SIMT_IDLE. When the state machine is in SIMT_Free, a "bubble" is inserted into the read-request generation pipeline. If the data of Banks 0-7 have not been completely read, the next state is SIMT_Rd0to7; if the data of Banks 8-15 have not been completely read, the next state is SIMT_Rd8to15; if all data have been read, the next state is SIMT_IDLE.
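The transitions just described can be collected into a single next-state function. This is an illustrative model only: `done_lo`/`done_hi` stand for "all Bank 0-7 / Bank 8-15 read requests issued", and the trigger that leaves SIMT_IDLE is simplified to "an SIMT transaction is present":

```python
def next_state(state, done_lo, done_hi):
    """One step of the read-request control state machine described above."""
    if state == "SIMT_IDLE":
        return "SIMT_Rd0to7"            # an SIMT transaction has started
    if state == "SIMT_Rd0to7":
        if not done_hi:                 # Bank 8-15 still pending: ping-pong
            return "SIMT_Rd8to15"
        if not done_lo:                 # only Bank 0-7 left: insert a bubble
            return "SIMT_Free"
        return "SIMT_IDLE"              # everything read out
    if state == "SIMT_Rd8to15":
        if not done_lo:
            return "SIMT_Rd0to7"
        if not done_hi:
            return "SIMT_Free"
        return "SIMT_IDLE"
    if state == "SIMT_Free":            # one-cycle bubble in the pipeline
        if not done_lo:
            return "SIMT_Rd0to7"
        if not done_hi:
            return "SIMT_Rd8to15"
        return "SIMT_IDLE"
    raise ValueError(state)
```

Note how a group never issues in two consecutive beats: whenever only one group remains, SIMT_Free alternates with it, which is exactly the "bubble" the text describes.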
As a further improvement of the present invention: in step S3, after read-return data is first sent from the DDR3 SDRAM to the DMA, the DMA first judges whether the data was transferred in the SIMT transfer mode. If so, the data and its write-mask information are deposited into the row of the write-request memory bank Mem indicated by the read-return address; otherwise the data is handled according to the DMA's normal transfer mode. The depth of the write-request memory bank Mem is 32, and each row has a readable flag; when this flag is "1", the data of that row has returned, and the stored data and the previously written VM address can be read out. The lowest row whose readable flag is valid is selected as the read address of Mem, and the corresponding data, write-mask information, and VM write address are taken out of Mem. The write mask is the read mask that returned together with the read data; it indicates into which banks the data will be written, and the VM generates 8 write-enable signals from this signal. From the data and VM write addresses read out, the write-request data and write-request addresses of the 8 write buses are generated. After a write request is generated, it is output to the VM write bus inside the DMA; after winning bus arbitration, the next Mem row is read and the next VM write request is generated. Write operations proceed by the above rule until all data of this transfer transaction have been transferred.
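The Mem drain rule (lowest readable row first, write enables derived from the write mask) can be sketched as follows; the dict-based rows and field names are invented stand-ins for the hardware storage:

```python
MEM_DEPTH = 32  # depth of the write-request memory bank, as described

def pop_lowest_ready(mem):
    """Pick the lowest-numbered row whose readable flag is set, consume it,
    and derive the 8 per-bus write enables from the row's write mask.
    mem: list of dicts {ready, data, mask, vm_addr}; mask is an 8-bit int."""
    for row in range(MEM_DEPTH):
        e = mem[row]
        if e["ready"]:
            e["ready"] = False                          # row consumed
            enables = [(e["mask"] >> i) & 1 for i in range(8)]
            return e["data"], e["vm_addr"], enables
    return None                                         # nothing ready yet
```

Repeated calls therefore emit write requests in ascending row order among whatever rows have returned, regardless of the order the DDR returns happened in.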
Compared with existing DMA technology, the advantages of the invention are:
1. In the DMA transfer method for the single-instruction multiple-thread mode in a GPDSP of the present invention, a single DMA transfer transaction moves the non-regularly stored data of an SIMT program from the off-chip storage space into the kernel's vector memory VM; after the move, these data are stored neatly in the VM, effectively supporting the exploitation of thread-level parallelism by vector computation.
2. The DMA transfer method for the single-instruction multiple-thread mode in a GPDSP of the present invention makes full use of the advantages of DMA, such as background operation and high data transfer rate. Preparing the data needed by SIMT programs through DMA transfers can improve the execution speed of SIMT programs and increase the execution efficiency of vector operations.
3. The DMA transfer method for the single-instruction multiple-thread mode in a GPDSP of the present invention configures the DMA transfer parameters through the high-data-bandwidth VM, which can greatly shorten the parameter-configuration time.
4. The DMA transfer method for the single-instruction multiple-thread mode in a GPDSP of the present invention makes full use of the advantages of DMA, such as high throughput and background operation, and perfects existing DMA technology so that off-chip data whose addresses follow a specific pattern can be moved neatly into the on-chip vector memory VM, efficiently supporting vector operations in SIMT mode.
Description of the drawings
Fig. 1 is a schematic diagram of the transfer of the data needed by an SIMT program in a concrete application example of the present invention.
Fig. 2 is a schematic flow diagram of the present invention.
Fig. 3 is a schematic diagram of the data-transfer-request generation flow in a concrete application example of the present invention.
Fig. 4 is a schematic diagram of the read-request generation flow of a data transfer in a concrete application example of the present invention.
Fig. 5 is the control state machine of the read-request generation process of a data transfer in a concrete application example of the present invention.
Fig. 6 is a schematic diagram of the write-request generation flow of a data transfer in a concrete application example of the present invention.
Detailed description of embodiments
The present invention is described in further detail below with reference to the accompanying drawings and specific embodiments.
The present invention is a DMA data-transfer method for the single-instruction multiple-thread (Single Instruction, Multiple Threads, SIMT) mode, realized on the basis of traditional DMA technology.
To support this transfer method, the present invention first adds an SIMT data-transfer flag to the traditional DMA parameters; when this flag is valid, it indicates that the DMA will carry out a data transfer for the single-instruction multiple-thread mode. The vector processing unit VPU contains X processing elements (PEs), and each PE corresponds to one storage bank of the on-chip vector memory VM. In the initial situation, the data of an SIMT program are stored non-regularly in the off-chip storage, but the addresses of the data destined for any one bank of the VM change with a specific rule. For this purpose, information indicating the storage format of the data in the off-chip storage is added to the DMA global registers, including the base address (Base Address, BA) and offset address (Offset Address, OA) of the data storage, and the data volume (Counter, CNT) each bank needs moved. This information comprises X groups, the same as the number of banks of the VM.
The newly added parameter BA is the initial address at which the data is stored in the off-chip storage space. OA is the difference between the address of one valid datum and the address of the previous valid datum, i.e. the address offset. CNT is the volume of data stored according to the rule described by the above base address and address offset, in units of "words", i.e. 32 bits. SIMT_MODE is the SIMT transfer-mode flag; when it is "1'b1", it indicates that the DMA is configured to carry out an SIMT data transfer. The above BA, OA, and CNT comprise X groups in total, the same as the number of banks of the vector memory VM; that is, the data in the off-chip storage space described by each group of BA, OA, and CNT is moved into one corresponding bank of the VM.
If the parameter SIMT_MODE is configured to "1'b1", the enhanced general physical channel (EGPip) reads the conventional transfer parameters from the parameter RAM while also fetching the newly added X groups of BA, OA, and CNT parameters that the VM configured into the DMA. Since the DMA data bandwidth is only X*16 bits, the data returned by one read request can be written into at most X/2 banks. To improve the efficiency of read-request generation, the design uses a ping-pong mechanism: if the DMA generates the read request of the low X/2 banks in the current beat, it generates the read request of the high X/2 banks in the next beat. The read-request information includes the read address, the read-return address, the read-mask signal, the SIMT transfer flag, and so on. To save global address lines, the VM write addresses are not sent along with the read requests but are kept in the write-request memory bank Mem inside the DMA. The read-return address carried by a read request is the position in the internal write-request memory bank Mem into which the data should be written when it returns.
When a read request is sent, the BA and OA of the X/2 banks related to the read request are first taken out and the destination address of each bank is generated. The minimum address is selected as the read-request address, and each address is then compared with this lowest address to see whether it lies within the X*16-bit range starting at the lowest address. On this basis the read-mask signal is generated: only the data lying in the same X*16-bit space as the lowest address are read. While generating the read-request address and mask signal, EGPip generates the addresses at which the returned data of this read request should be written into the VM. These addresses are first kept in the write-request memory bank Mem inside the DMA and, after the data returns, are written out to the VM component together with the data. The position at which the VM write address is deposited in Mem is encoded in the read request as the read-return address. After the above operations, the DMA issues the read-request information to the off-chip storage space.
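The per-beat read-request construction described above (minimum address as the request address, mask limited to one 256-bit window, only the masked banks advancing) can be modeled as below. The list-based bookkeeping is illustrative only, with X = 16 assumed so that one ping-pong group is 8 banks:

```python
def make_read_request(ba, oa, issued, group, window_bytes=32):
    """Build one SIMT read request for one bank group (the low or high X/2
    banks of a ping-pong pair). issued[i] counts the words already requested
    for bank i. Returns the request address (minimum of the group's next
    source addresses) and the read mask marking which banks' next words fall
    inside the 256-bit window at that address; only those banks advance."""
    nxt = {i: ba[i] + issued[i] * oa[i] for i in group}
    req_addr = min(nxt.values())
    mask = {i: int(req_addr <= a < req_addr + window_bytes)
            for i, a in nxt.items()}
    for i in group:
        issued[i] += mask[i]       # masked banks are served by this request
    return req_addr, mask
```

Called repeatedly for alternating low/high groups, this reproduces the behavior that a bank whose next word lies outside the current window simply waits for a later request.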
After read-request data returns to the DMA, it is first judged whether the returned data belongs to the SIMT transfer mode. If it is SIMT-transfer-mode data, the read data and the write-mask information are deposited into the row of the write-request memory bank Mem indicated by the read-return address; if not, it is handled according to the other transfer modes. After data is deposited into the write-request memory bank Mem, the readable flag of that Mem row is set. The depth of the write-request memory bank Mem is h, and each row has a readable flag. The row whose readable flag is valid and whose row number is lowest is selected, and the data, write mask, and VM write address stored in that row are taken out of Mem. According to the write-mask information, X/2 write-enable signals for the VM are generated; the data and VM write addresses read out of Mem comprise X/2 groups, corresponding respectively to the X/2 write buses from the DMA to the VM.
Each VM write bus has its own set of handshake signals: when a VM write request wins arbitration for a VM write bus, the request will be written into the VM, and the write bus inside the DMA returns a confirm signal to EGPip; on receiving this signal, the counter of EGPip performs a self-decrement. When all data have been written into the VM, the value of the counter drops to "0", indicating that this SIMT data-transfer transaction is completed. After the data transfer completes, the DMA sets the corresponding bit of the transfer-completed flag register and, if the interrupt enable is valid, also issues a transaction transfer-completed interrupt.
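The closing counter logic can be modeled as a small class; the class name and the ack-driven interface are assumptions made for illustration, while the behavior (decrement per write confirm, done flag and optional interrupt at zero) follows the description:

```python
class CompletionTracker:
    """Models the EGPip write counter: it starts at the total word count,
    decrements on each VM write-confirm signal, and sets the transfer-done
    flag (and an optional interrupt) when it reaches zero."""
    def __init__(self, total_words, interrupt_enabled=False):
        self.count = total_words
        self.done = False
        self.interrupt = False
        self.interrupt_enabled = interrupt_enabled

    def write_ack(self):
        assert self.count > 0, "ack received after transaction completed"
        self.count -= 1
        if self.count == 0:
            self.done = True                       # flag-register bit set
            self.interrupt = self.interrupt_enabled  # optional interrupt
```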
Fig. 1 is a schematic diagram of the transfer of the data needed by an SIMT program in a concrete application embodiment of the present invention. The source data of the DMA are stored in the shaded regions of the DDR3 SDRAM space; there are 16 blocks in total, and their storage format can be described by base addresses and address offsets. To meet the SIMT program's needs for the data, the above data should be moved neatly into the 16 banks of the VM, as shown by the shaded part of the VM in the figure. In this example the source data to be moved are distributed over 8 rows of the DDR3 SDRAM, with the address range 36'h0-36'h1bc. The parts marked by shading in the VM are the address spaces into which the data will be written, with the range 36'h80000_0000-36'h80000_00bc. There are 16 base-address registers (BA), 16 address-offset registers (OA), and 16 per-bank move counts (CNT): BA indicates the initial position of each data block in the DDR3 SDRAM, OA indicates the spacing of two valid data within one data block, and CNT is the data volume each bank needs moved, in units of words. In the example given in this figure, BA0 should be configured to 36'h0, BA1 to 36'h44, BA2 to 36'hc8, ..., BA14 to 36'h138, and BA15 to 36'h3c; OA0 should be configured to 16'h80, OA1 to 16'h40, OA2 to 16'h40, ..., OA14 to 16'h40, and OA15 to 16'hc0; the 16 CNTs should all be configured to 16'h3. In addition, the SIMT transfer flag should be configured to "1" and the destination address to 36'h80000_0000. After the parameters are configured, the DMA is started and the data move is completed.
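The register values quoted for this example can be checked numerically under the reading that the k-th word of a bank sits at BA + k*OA. The sketch uses only the values the text states explicitly; the elided BA3-BA13 and OA3-OA13 are deliberately left out rather than guessed:

```python
# Only the explicitly stated parameter values from the Fig. 1 example.
ba = {0: 0x0, 1: 0x44, 2: 0xc8, 14: 0x138, 15: 0x3c}
oa = {0: 0x80, 1: 0x40, 2: 0x40, 14: 0x40, 15: 0xc0}
CNT = 3  # 16'h3 words per bank

# Source byte address of word k in bank b: BA[b] + k * OA[b]
addrs = {b: [ba[b] + k * oa[b] for k in range(CNT)] for b in ba}
```

With these values, bank 0's three words sit at 36'h0, 36'h80, 36'h100, and bank 15's last word lands exactly on the top of the quoted source range, 36'h1bc — consistent with the figure's address range.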
Based on the above analysis of the principle, as shown in Fig. 2, the steps of the DMA transfer method for the single-instruction multiple-thread mode in the GPDSP of the present invention are:
S1: configure the parameters and start the DMA.
To improve the DMA parameter-configuration efficiency, the parameter configuration of the present invention is improved on the basis of the traditional approach. The DMA of the present invention has two parameter-configuration sources: first, the peripheral configuration bus (PBUS); second, the newly added VM configuration path. The PBUS completes the configuration of the traditional parameters, and the VM configuration path completes the configuration of the off-chip storage-format information. The bit width of the PBUS configuration path is 32 bits, while the bit width of the VM configuration path is N*32 bits; therefore, performing parameter configuration through the VM configuration path can greatly improve configuration efficiency.
Specifically, some special transfer parameters are added to the DMA to support SIMT data moving, including an SIMT flag added to the transfer control word that indicates whether this DMA transfer transaction is a data move for the single-instruction multiple-thread mode.
Since the on-chip vector memory VM contains 16 banks, 16 groups of base addresses BA and address offsets OA are added; they are used to compute the positions of the 16 data columns in the off-chip DDR3 SDRAM and to generate the access addresses of read requests. In addition, 16 counters CNT are added to indicate the respective data volume of each of the 16 data columns. The above 16 groups of BA, OA, and CNT are configured into the global registers of the DMA by the VM, while the SIMT flag is written into the DMA parameter RAM over the PBUS bus. After the VM component completes its parameter configuration, the other transfer parameters are configured over the PBUS bus and the DMA is started.
After the DMA starts, the transfer parameters are first fetched from the parameter RAM. Upon receiving the transfer parameters, the general channel first judges whether the SIMT flag is valid. If the flag is invalid, the transfer proceeds in the original mode; otherwise the newly added 16 groups of BA, OA and CNT parameters are additionally fetched from the global registers.
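The role of the per-column BA, OA and CNT parameters can be illustrated with a small software model. This is a hedged sketch, not the patent's hardware: the element stride and the concrete register values are assumptions chosen only to mirror the configuration example given earlier in the text.

```python
# Hypothetical model (not the patent's RTL): how the 16 per-column parameter
# groups (base address BA, offset OA, count CNT) could yield the DDR3 read
# addresses for each of the VM's 16 banks/columns.

def column_read_addresses(ba, oa, cnt, stride):
    """For one VM column: CNT addresses starting at BA+OA,
    stepping by `stride` bytes between consecutive elements (assumed)."""
    start = ba + oa
    return [start + i * stride for i in range(cnt)]

def simt_read_addresses(bas, oas, cnts, stride=32):
    """Per-column address lists for all 16 columns."""
    assert len(bas) == len(oas) == len(cnts) == 16
    return [column_read_addresses(b, o, c, stride)
            for b, o, c in zip(bas, oas, cnts)]

# Illustrative values echoing the text's example: OA of 16'h40 / 16'hc0,
# CNT = 16'h3 for every column; the base address is purely illustrative.
bas = [0x8_0000_0000] * 16
oas = [0x40 if i % 2 == 0 else 0xc0 for i in range(16)]
cnts = [3] * 16
addrs = simt_read_addresses(bas, oas, cnts)
```

With these assumed values, each column yields CNT = 3 consecutive element addresses, which is the address stream a read-request generator would consume.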
S2: Transfer the read requests;
Read requests are generated according to the transfer parameters, and the VM write addresses are stored into the write-request memory bank Mem at the same time. After receiving a read request from the DMA, the target peripheral returns data to the DMA. The DMA receives the returned data and deposits it into the write-request memory bank Mem.
S3: Transfer the write requests;
The DMA takes the read-return data and the VM write address out of a valid entry of the write-request memory bank Mem and issues them to the VM, while the counters count accordingly. If the count has not finished, data continues to be transferred by the above process; otherwise the transfer-complete flag register is set.
Steps S2 and S3 above, i.e. starting the DMA to perform the data move, take place after the transfer parameters required for the SIMT data transfer have been configured. In this process, the DMA first generates the read requests of the SIMT data transfer, fetching the valid data stored within one data-bandwidth address range. While generating a read request, the VM addresses to which the returned data will be written are also generated, and these addresses are stored in the write-request memory bank Mem inside the DMA. The read-request information includes the read-request enable, the read-request address, the read-return address, the SIMT data-transfer flag, and so on. After the read-request data returns to the DMA, it is first stored in the write-request memory bank Mem; when the VM is writable, the data and the previously generated write addresses are taken out of Mem and issued to the VM.
In the above process, owing to the data-bandwidth limitation, one read request can read at most the data of X/2 bank lanes, so a ping-pong mechanism is used to raise the read-request issue rate: the data of the low and high X/2 bank lanes is read in turn. If the read request issued in the current clock cycle reads the data of the low X/2 banks, then the read request issued in the next cycle reads the data of the high X/2 banks.
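The ping-pong issue order can be sketched as follows. This is an illustrative model only; X = 16 and the bank numbering follow the text, but the generator function itself is an assumption.

```python
# Illustrative sketch of the ping-pong mechanism: with a bandwidth limit of
# X/2 bank lanes per read request, successive requests alternate between the
# low X/2 and high X/2 banks on consecutive clock beats.

def pingpong_bank_groups(x, n_requests):
    """Return the bank-index group targeted by each successive read request."""
    low = list(range(x // 2))        # low X/2 banks
    high = list(range(x // 2, x))    # high X/2 banks
    return [low if i % 2 == 0 else high for i in range(n_requests)]

groups = pingpong_bank_groups(16, 4)
# Even-numbered beats target the low banks, odd-numbered beats the high banks.
```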
The DMA has X/2 independent write buses to the VM port, each with a data bandwidth of 32 bits. These X/2 write buses correspond one-to-one with X/2 different banks inside the VM (the low X/2 or the high X/2), i.e. the data on each write bus enters a different VM bank. Inside the DMA, a write-request memory bank Mem is provided for storing information such as the read-return data and the VM write addresses; the depth h of this memory bank can be chosen as a trade-off between transfer efficiency and hardware overhead.
As shown in Fig. 3, which is a schematic diagram of the data-transfer-request generation flow in a concrete application example of the invention, the detailed flow is:
S10: After parameter configuration, the DMA is started and the parameters are read into the enhanced general physical channel EGPip. EGPip first judges whether this is a SIMT data transfer. If so, it enters the SIMT data-transfer state SIMT_START; otherwise the transfer is processed in the normal transfer mode.
S20: After entering the SIMT_START state, the write counters are first used to judge whether all the data of the SIMT transfer has been written into the VM. If so, the state machine enters the transfer-complete state FINISH; otherwise it remains in SIMT_START. In the SIMT_START state, EGPip generates the read and write requests of the SIMT data transfer. When generating a read request, it first judges whether the read-request counter has finished counting: if the read requests have not all been issued, it continues generating read requests; otherwise it stops issuing them. After being output by EGPip, a read request competes for the read bus inside the DMA; once bus arbitration is obtained, the read bus returns a read-acknowledge signal to EGPip while the read request is output from the DMA. After EGPip receives the read-acknowledge signal, it updates the read address and the read-request counter. After a SIMT write request is generated, it competes for the VM write bus; once bus arbitration is obtained, the VM write bus returns a write-acknowledge signal to EGPip while the write request is output to the VM. After receiving the write-acknowledge signal, EGPip issues the next write request and updates the write counter.
As shown in Fig. 4, which is a schematic diagram of the read-request generation flow of the data transfer in a concrete application example of the invention, a read request of the SIMT data transfer fetches the data within the 256-bit space aligned with the read address. The read-request information includes the read-request address, the read-return address, the read mask and the SIMT transfer flag. Since the returned data from the off-chip storage space arrives at the DMA out of order, the address at which the read-return data will be written into the on-chip storage space VM must also be generated for each read request. This address, however, is not issued with the read request; it is pre-stored in the write-request memory bank Mem inside the DMA and read out together with the corresponding read-request data after that data returns from the DDR3 SDRAM. Generating a read request is precisely the process of generating the above request information. Owing to the design main-frequency limitation, generating the read address takes two clock cycles. In the first clock cycle, the 8 source addresses are divided into 4 groups and the smaller value of each group is selected; then 2 smaller values are selected from these 4 results. In the second clock cycle, the minimum of the two smaller values obtained in the previous cycle is selected, thereby obtaining the lowest of the 8 source addresses. All source addresses are then compared with the lowest address to check whether they fall within the 256-bit space, and the read mask is finally generated according to the comparison results. Considering SIMT transfer performance and hardware overhead, a write-request memory bank Mem of depth 32 is provided to hold the SIMT read-return data, the VM write addresses and related information. Each row of Mem has a signal indicating whether it is available; when a read request is issued, the lowest row whose available signal is valid is selected as the data read-return address. The VM write address generated at the same time is first stored at the Mem position indicated by the read-return address.
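The two-cycle minimum selection and the read-mask generation described above can be modeled in software as follows. This is a hedged sketch of the comparator tree, not the RTL; the byte-granular address arithmetic is an assumption (the text only states a 256-bit window).

```python
# Software model of the two-stage minimum selection over 8 source addresses
# and the subsequent read-mask generation for the 256-bit (32-byte) window.

WINDOW_BYTES = 256 // 8  # the 256-bit space, in bytes (assumed byte addressing)

def min_address_two_stage(addrs):
    """Mimic the pipelined reduction: cycle 1 does 8 -> 4 -> 2,
    cycle 2 does the final 2 -> 1 comparison."""
    assert len(addrs) == 8
    stage4 = [min(addrs[i], addrs[i + 1]) for i in range(0, 8, 2)]  # cycle 1a
    stage2 = [min(stage4[i], stage4[i + 1]) for i in range(0, 4, 2)]  # cycle 1b
    return min(stage2)                                               # cycle 2

def read_mask(addrs):
    """Compare every source address against the lowest address; bit i of the
    mask is set when address i falls inside the 256-bit window."""
    lowest = min_address_two_stage(addrs)
    mask = 0
    for i, a in enumerate(addrs):
        if lowest <= a < lowest + WINDOW_BYTES:
            mask |= 1 << i
    return lowest, mask
```

Addresses outside the window keep their mask bit clear and would be served by a later read request.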
As shown in Fig. 5, which is the control state machine of the read-request generation of the data transfer in a concrete application example of the invention, the initial state of this state machine is the idle state, i.e. the SIMT_IDLE state. If the DMA transfer transaction moves the data of a SIMT program, the machine enters the SIMT_Rd0to7 state from SIMT_IDLE; in this state, EGPip generates the read requests for the data to be moved into Bank0-7 of the VM while judging whether the data of Bank8-15 has been completely read. If the data of Bank8-15 has not been completely read, the next state is SIMT_Rd8to15. If the data of Bank8-15 has been completely read, it must be judged whether the data of Bank0-7 has been completely read: if the data of Bank8-15 has been completely read but that of Bank0-7 has not, the next clock cycle enters the SIMT_Free state; if the data of both Bank8-15 and Bank0-7 has been completely read, the next state of the state machine is SIMT_IDLE. Similarly, when the state machine is in the SIMT_Rd8to15 state, EGPip generates the requests for reading the Bank8-15 data. If the data of Bank0-7 has not been completely read, the next state is SIMT_Rd0to7. If the data of Bank0-7 has been completely read, it must be judged whether the Bank8-15 data has been completely read: if not, the next clock cycle enters the SIMT_Free state; otherwise all data has been read and the next state is SIMT_IDLE. When the state machine is in SIMT_Free, a "bubble" is inserted into the read-request generation pipeline. If the Bank0-7 data has not been completely read, the next state is SIMT_Rd0to7; if the Bank8-15 data has not been completely read, the next state is SIMT_Rd8to15; if all data has been completely read, the next state is SIMT_IDLE.
As shown in Fig. 6, which is a schematic diagram of the write-request generation flow of the data transfer in a concrete application example of the invention, the flow is as follows. First, after the read-return data is sent from the DDR3 SDRAM to the DMA, the DMA judges whether this data belongs to a transfer in the SIMT transfer mode. If so, the data and the write-mask information are stored into the write-request memory bank Mem at the row indicated by the read-return address; otherwise the data is processed in the normal DMA transfer mode. The depth of the write-request memory bank Mem is 32, and each row has a readable flag; when this flag signal is "1", it indicates that the data of that row has returned, and the stored data and the previously written VM address can then be read out. The lowest row whose readable flag is valid is selected as the address for reading Mem, and the corresponding data, write-mask information and VM write address are taken out of Mem. The write mask here is the read mask that returns together with the read-return data; it indicates into which banks the data will be written, and the VM generates 8 write-enable signals according to this signal. The read-out data and VM write address need only be simply unpacked to generate the write-request data and write-request addresses of the 8 write buses. After a write request is generated, it is output to the VM write bus inside the DMA; after bus arbitration is obtained, the next Mem row is read and the next VM write request is generated. Write operations proceed according to the above rule until all the data of this transfer transaction has been transferred.
The above are merely preferred embodiments of the present invention, and the protection scope of the present invention is not limited to the above embodiments; all technical solutions falling under the concept of the present invention belong to the protection scope of the present invention. It should be pointed out that, for those of ordinary skill in the art, several improvements and modifications made without departing from the principles of the present invention should also be regarded as falling within the protection scope of the present invention.
Claims (7)
1. A DMA transfer method oriented to the single-instruction multiple-thread mode in a GPDSP, characterized in that a DMA transfer transaction is configured to move the data of a non-regular SIMT program stored in the off-chip storage space into the vector memory VM of the core; after the move, the data is stored neatly in the vector memory VM so that the vector computation components can access it in parallel;
the flow of the data move is:
Step S1: configure the parameters and start the DMA;
a SIMT flag is added to the transfer control word to indicate whether the current DMA transfer transaction is a data move oriented to the single-instruction multiple-thread mode; 16 groups of base addresses BA and address offsets OA are added for computing the positions of the 16 data columns in the off-chip DDR3 SDRAM and generating the access addresses of the read requests; 16 counters CNT are added to indicate the respective data volume of the 16 data columns; the above 16 groups of BA, OA and CNT are configured into the global registers of the DMA via the vector memory VM, and the SIMT flag is written into the DMA parameter RAM over the PBUS bus;
after parameter configuration is complete, the DMA is started and the transfer parameters are fetched from the parameter RAM;
Step S2: transfer the read requests;
read requests are generated according to the transfer parameters, while the VM write addresses are stored into the write-request memory bank Mem; after receiving a read request from the DMA, the target peripheral returns data to the DMA; the DMA receives the returned data and deposits it into the write-request memory bank Mem;
Step S3: transfer the write requests;
the DMA takes the read-return data and the VM write address out of a valid entry of the write-request memory bank Mem and issues them to the vector memory VM, while the counters count accordingly; if the count has not finished, data transfer continues; otherwise the transfer-complete flag register is set.
2. The DMA transfer method oriented to the single-instruction multiple-thread mode in a GPDSP according to claim 1, characterized in that, after the DMA is started and the transfer parameters are fetched from the parameter RAM, the general channel, upon receiving the transfer parameters, first judges whether the SIMT flag is valid; if the flag is invalid, the transfer proceeds in the original mode; otherwise the newly added 16 groups of BA, OA and CNT parameters are additionally fetched from the global registers.
3. The DMA transfer method oriented to the single-instruction multiple-thread mode in a GPDSP according to claim 1 or 2, characterized in that, after parameter configuration, starting the DMA enters the data-transfer-request generation flow, whose steps are:
Step S10: after parameter configuration, the DMA is started and the parameters are read into the enhanced general physical channel EGPip; EGPip first judges whether this is a SIMT data transfer; if so, it enters the SIMT data-transfer state SIMT_START; otherwise the transfer is processed in the normal transfer mode;
Step S20: after entering the SIMT_START state, the write counters are first used to judge whether all the data of the SIMT transfer has been written into the VM; if so, the state machine enters the transfer-complete state FINISH; otherwise it remains in SIMT_START; in the SIMT_START state, EGPip generates the read and write requests of the SIMT data transfer; when generating a read request, it first judges whether the read-request counter has finished counting: if the read requests have not all been issued, it continues generating read requests, otherwise it stops issuing them; after being output by EGPip, a read request competes for the read bus inside the DMA; once bus arbitration is obtained, the read bus returns a read-acknowledge signal to EGPip while the read request is output from the DMA; after EGPip receives the read-acknowledge signal, it updates the read address and the read-request counter; after a SIMT write request is generated, it competes for the VM write bus; once bus arbitration is obtained, the VM write bus returns a write-acknowledge signal to EGPip while the write request is output to the VM; after receiving the write-acknowledge signal, EGPip issues the next write request and updates the write counter.
4. The DMA transfer method oriented to the single-instruction multiple-thread mode in a GPDSP according to claim 1 or 2, characterized in that, in said step S2, a read request of the data transfer fetches the data within the 256-bit space aligned with the read address; the read-request information includes the read-request address, the read-return address, the read mask and the SIMT transfer flag; for each read request, the address at which the read-return data is written into the on-chip storage space VM is also generated; this address is stored correspondingly in the write-request memory bank Mem inside the DMA and is read out together with the read-request data after the data returns from the DDR3 SDRAM.
5. The DMA transfer method oriented to the single-instruction multiple-thread mode in a GPDSP according to claim 4, characterized in that generating the address at which the read-return data is written into the on-chip storage space VM takes two clock cycles: in the first clock cycle, the 8 source addresses are first divided into 4 groups and the smaller value of each group is selected, then 2 smaller values are selected from these 4 results; in the second clock cycle, the minimum of the two smaller values obtained in the previous cycle is selected, yielding the lowest of the 8 source addresses; all source addresses are then compared with the lowest address to check whether they fall within the 256-bit space, and the read mask is finally generated according to the comparison results.
6. The DMA transfer method oriented to the single-instruction multiple-thread mode in a GPDSP according to claim 4, characterized in that said step S2 is controlled by a state machine: the initial state of the state machine is the idle state, i.e. the SIMT_IDLE state; if the DMA transfer transaction moves the data of a SIMT program, the machine enters the SIMT_Rd0to7 state from SIMT_IDLE; in this state, EGPip generates the read requests for the data to be moved into Bank0-7 of the VM while judging whether the data of Bank8-15 has been completely read; if the data of Bank8-15 has not been completely read, the next state is SIMT_Rd8to15; if the data of Bank8-15 has been completely read, it must be judged whether the data of Bank0-7 has been completely read: if the data of Bank8-15 has been completely read but that of Bank0-7 has not, the next clock cycle enters the SIMT_Free state; if the data of both Bank8-15 and Bank0-7 has been completely read, the next state of the state machine is SIMT_IDLE; similarly, when the state machine is in the SIMT_Rd8to15 state, EGPip generates the requests for reading the Bank8-15 data; if the data of Bank0-7 has not been completely read, the next state is SIMT_Rd0to7; if the data of Bank0-7 has been completely read, it must be judged whether the Bank8-15 data has been completely read: if not, the next clock cycle enters the SIMT_Free state; otherwise all data has been read and the next state is SIMT_IDLE; when the state machine is in SIMT_Free, a "bubble", said "bubble" being a flag, is inserted into the read-request generation pipeline; if the Bank0-7 data has not been completely read, the next state is SIMT_Rd0to7; if the Bank8-15 data has not been completely read, the next state is SIMT_Rd8to15; if all data has been completely read, the next state is SIMT_IDLE.
7. The DMA transfer method oriented to the single-instruction multiple-thread mode in a GPDSP according to claim 1 or 2, characterized in that, in said step S3, after the read-return data is sent from the DDR3 SDRAM to the DMA, the DMA first judges whether this data belongs to a transfer in the SIMT transfer mode; if so, the data and the write-mask information are stored into the write-request memory bank Mem at the row indicated by the read-return address; otherwise the data is processed in the normal DMA transfer mode; the depth of the write-request memory bank Mem is 32, and each row has a readable flag; when this flag signal is "1", it indicates that the data of that row has returned, and the stored data and the previously written VM address can then be read out; the lowest row whose readable flag is valid is selected as the address for reading Mem, and the corresponding data, the write-mask information and the VM write address are then taken out of Mem; the write mask is the read mask that returns together with the read-return data, indicating into which banks the data will be written, and the VM generates 8 write-enable signals according to this signal; the read-out data and the VM write address are unpacked to generate the write-request data and write-request addresses of the 8 write buses; after a write request is generated, it is output to the VM write bus inside the DMA; after bus arbitration is obtained, the next Mem row is read and the next VM write request is generated; write operations are performed until all the data of this transfer transaction has been transferred.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201510718877.9A CN105302749B (en) | 2015-10-29 | 2015-10-29 | DMA transfer method towards single instrction multithread mode in GPDSP |
Publications (2)
Publication Number | Publication Date |
---|---|
CN105302749A CN105302749A (en) | 2016-02-03 |
CN105302749B true CN105302749B (en) | 2018-07-24 |
Family
ID=55200034
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201510718877.9A Active CN105302749B (en) | 2015-10-29 | 2015-10-29 | DMA transfer method towards single instrction multithread mode in GPDSP |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN105302749B (en) |
Families Citing this family (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN108062282B (en) * | 2017-12-29 | 2020-01-14 | 中国人民解放军国防科技大学 | DMA data merging transmission method in GPDSP |
WO2022266842A1 (en) * | 2021-06-22 | 2022-12-29 | 华为技术有限公司 | Multi-thread data processing method and apparatus |
Citations (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN104679689A (en) * | 2015-01-22 | 2015-06-03 | 中国人民解放军国防科学技术大学 | Multi-core DMA (direct memory access) subsection data transmission method used for GPDSP (general purpose digital signal processor) and adopting slave counting |
Family Cites Families (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US6892266B2 (en) * | 2000-11-15 | 2005-05-10 | Texas Instruments Incorporated | Multicore DSP device having coupled subsystem memory buses for global DMA access |
US20060179172A1 (en) * | 2005-01-28 | 2006-08-10 | Texas Instruments Incorporated | Method and system for reducing power consumption of a direct memory access controller |
2015-10-29: application CN201510718877.9A filed in CN (granted as CN105302749B, status Active)
Patent Citations (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN104679689A (en) * | 2015-01-22 | 2015-06-03 | 中国人民解放军国防科学技术大学 | Multi-core DMA (direct memory access) subsection data transmission method used for GPDSP (general purpose digital signal processor) and adopting slave counting |
Non-Patent Citations (1)
Title |
---|
Implementation of a Highly Reusable DMA Verification Platform; Ding Yibo et al.; Proceedings of the CCF Computer Engineering and Technology Technical Committee Conference; 2015-10-18; pp. 499-505 * |
Also Published As
Publication number | Publication date |
---|---|
CN105302749A (en) | 2016-02-03 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
US9830156B2 (en) | Temporal SIMT execution optimization through elimination of redundant operations | |
CN102750133B (en) | 32-Bit triple-emission digital signal processor supporting SIMD | |
US8639882B2 (en) | Methods and apparatus for source operand collector caching | |
CN104603748B (en) | The processor of instruction is utilized with multiple cores, shared core extension logic and the extension of shared core | |
CN103714039B (en) | universal computing digital signal processor | |
CN103761215B (en) | Matrix transpose optimization method based on graphic process unit | |
CN105190538B (en) | System and method for the mobile mark tracking eliminated in operation | |
CN104813279B (en) | For reducing the instruction of the element in the vector registor with stride formula access module | |
CN103309702A (en) | Uniform load processing for parallel thread sub-sets | |
CN104050033A (en) | System and method for hardware scheduling of indexed barriers | |
US20130145124A1 (en) | System and method for performing shaped memory access operations | |
CN104050706A (en) | Pixel shader bypass for low power graphics rendering | |
CN106648843A (en) | System, method, and apparatus for improving throughput of consecutive transactional memory regions | |
CN103885752A (en) | Programmable blending in multi-threaded processing units | |
CN109426519A (en) | Data inspection is simplified in line with carrying out workload | |
CN104050705A (en) | Handling post-z coverage data in raster operations | |
CN102640132A (en) | Efficient predicated execution for parallel processors | |
CN105373367B (en) | The vectorial SIMD operating structures for supporting mark vector to cooperate | |
CN108694684A (en) | Shared local storage piecemeal mechanism | |
CN103226463A (en) | Methods and apparatus for scheduling instructions using pre-decode data | |
WO2021236527A1 (en) | Intelligent control and distribution of a liquid in a data center | |
CN105893319A (en) | Multi-lane/multi-core system and method | |
CN104375807B (en) | Three-level flow sequence comparison method based on many-core co-processor | |
CN101082900A (en) | System and method for broadcasting instructions/data to a plurality of processors in a multiprocessor device via aliasing | |
WO2022031548A1 (en) | Intelligent server-level testing of datacenter cooling systems |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
C06 | Publication | ||
PB01 | Publication | ||
C10 | Entry into substantive examination | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||