CN103345429B - Highly concurrent memory-access acceleration method, accelerator and CPU based on on-chip RAM - Google Patents
- Publication number: CN103345429B (application CN201310242398.5A)
- Authority: CN (China)
- Prior art keywords: cpu, memory access, queue, read request, request
- Legal status: Active (an assumption, not a legal conclusion; Google has not performed a legal analysis)
- Classifications: Memory System Of A Hierarchy Structure; Multi Processors
Abstract
The invention discloses a highly concurrent memory-access accelerator based on on-chip RAM, an access method, and a processor using the method. The accelerator is independent of the on-chip Cache and its MSHRs, and is connected to the on-chip RAM and to the memory controller. Outstanding access requests are sent through the accelerator and the memory controller to the memory system, thereby overcoming the limited number of concurrent memory accesses available to general-purpose processors in Internet and cloud-computing applications and accelerating highly concurrent memory access.
Description
Technical field
The invention belongs to the field of computers and relates to the internal structural design of a CPU, and more particularly to a highly concurrent memory-access acceleration method, accelerator, and CPU based on on-chip RAM.
Background art
With the development of the Internet and cloud computing, highly concurrent data-processing programs are increasingly common. Such programs typically handle large concurrent workloads submitted in the form of requests or jobs, and the core business of these workloads usually involves the processing and analysis of massive data. Such programs are usually multi-threaded or multi-process, with little or no memory-access dependence between threads or processes.
Consequently, these applications issue a large number of concurrent access requests to the memory system, which poses a challenge to its concurrency. If the concurrency of the memory system is not high enough, it becomes a performance bottleneck for this class of applications.
Fig. 1 shows a typical CPU storage organization. When the CPU needs to read data, it first looks in the Cache. If the required data is present in the Cache (a Cache hit), the data is returned to the CPU directly. If the CPU does not find the required data in the Cache (a Cache miss), the CPU fetches the required data from main memory into the Cache.
The Cache contains a group of registers called MSHRs (Miss Status Handling Registers), dedicated to recording information about the cache-miss read requests that have been sent to main memory. The information recorded by an MSHR includes the Cache-line address, the destination register of the read request, and so on. When main memory completes the read request and returns the data for that Cache line, the recorded information is used to fill the corresponding Cache line and return the data to the destination register. Each Cache-miss read request occupies one MSHR. Once all MSHRs are occupied, new Cache-miss read requests are stalled and cannot be sent to main memory. Therefore the number of outstanding read requests supported by the MSHRs (a read request is outstanding when it has been sent but its data has not yet returned; because it is not yet complete, it must still be tracked by an MSHR) is one of the key factors determining the concurrency of the memory system.
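The stall behaviour just described can be sketched as a toy software model. The class and method names below are illustrative only, not part of any real MSHR interface:

```python
# Toy model of the MSHR bottleneck: each cache-miss read occupies one slot
# until its data returns; once all slots are taken, further misses stall.
class Mshr:
    def __init__(self, entries):
        self.entries = entries          # hardware slot count (e.g. 10)
        self.outstanding = {}           # line address -> destination register

    def issue(self, line_addr, dest_reg):
        """Record a cache-miss read; return False if the miss must stall."""
        if len(self.outstanding) == self.entries:
            return False                # all MSHRs occupied: request stalls
        self.outstanding[line_addr] = dest_reg
        return True

    def complete(self, line_addr):
        """Data returned from main memory: free the slot."""
        return self.outstanding.pop(line_addr)

mshr = Mshr(entries=10)
# With 10 entries and no completions, only the first 10 of 32 misses proceed.
accepted = sum(mshr.issue(addr, f"r{addr}") for addr in range(32))
```

With low-locality workloads, `accepted` saturates at the slot count, which is exactly the bottleneck the patent targets.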
At present, the number of outstanding read requests supported by the MSHRs of typical processors is rather small. For example, on the Cortex-A9 processor, the L2-Cache MSHRs support only 10 outstanding read requests. When an application issues a large number of concurrent access requests with low locality (and therefore many Cache misses) to the memory system, the MSHRs fill up quickly and become the bottleneck of the whole system.
Fig. 2 shows the storage architecture of a certain processor that proposes a completely new memory-access mode which can, in theory, support the issue of a large number of concurrent access requests.
The processor consists of one PPE (Power Processor Element), eight SPEs (Synergistic Processing Elements), one MIC (Memory Interface Controller), and one EIB (Element Interconnect Bus).
The memory-access mechanism of the SPEs is of particular interest here. Each SPE is a microprocessor whose programs run in a local 256 KB storage unit (RAM). When an SPE needs to obtain data from main memory, it must first initialize the DMAC (Direct Memory Access Controller) by writing parameters such as the request's source address, destination address, and length into the DMAC control queue. The DMAC then autonomously moves the data from main memory into local storage according to the parameters in the queue.
In theory, the number of concurrent requests supported by this mechanism is limited only by the number of commands that can be stored in the DMA command queue, or by the capacity of the on-chip RAM. However, the mechanism has two defects:
1. Before each DMA operation starts, several parameters must first be written, such as the source address, destination address, data size, TAG, and transfer direction; this takes several instruction cycles. When an SPE needs to issue a large number of concurrent fine-grained reads, the efficiency of DMA transfer is rather low.
2. The efficiency of DMA state management is low. First, the program must prepare enough space for the returned data of each read request, and it lacks a free-space management mechanism, so the utilization of local storage drops markedly after long runs. Second, the processor learns of DMA completion by software polling of status bits, which is inefficient when the number of access requests grows.
Summary of the invention
To solve the above technical problem, the object of the present invention is to propose a highly concurrent memory-access accelerator based on on-chip RAM and a method that uses the on-chip RAM to manage a large number of concurrent access requests, thereby solving the problem that general-purpose processors support only a limited number of concurrent memory accesses in Internet and cloud-computing applications, and accelerating highly concurrent memory access.
Specifically, the invention discloses a highly concurrent memory-access accelerator based on on-chip RAM. The accelerator is independent of the on-chip Cache and the MSHRs, and is connected to the on-chip RAM and to the memory controller; outstanding access requests are sent through the accelerator and the memory controller to the memory system.
In this accelerator, the number of outstanding access requests supported depends only on the capacity of the on-chip RAM and is not limited by the number of MSHR entries.
In this accelerator, the addressable space contains a read-request table for storing the information of read requests; each entry of the table corresponds to one fixed id number.
Each entry of the read-request table has three fields, storing the type, address, and data of the read request; the type field and address field are filled in by the CPU, and the data field is filled in by the accelerator.
When the data of a read request is too large for the data field, the data field may store only a data pointer, which points to the storage address of the returned data; that storage address is allocated by the CPU.
Each entry of the read-request table is in one of three states: free, new read request, and finished read request. The initial state is free. When the CPU has an access request, it fills in the entry and the state becomes new read request; the accelerator sends the request to the memory controller and, after the data returns, fills it into the data field, whereupon the state becomes finished read request; the CPU then takes the data from the data field and processes it, after which the state returns to free.
Each circular queue comprises a head pointer and a tail pointer. The head and tail pointers of the free queue and the head pointer of the finished queue are software variables maintained by the CPU. The head pointer of the new-read-request queue, its tail pointer, and the tail pointer of the finished queue are hardware registers: the head pointer of the new-read-request queue is maintained by the accelerator; the tail pointer of the new-read-request queue is maintained jointly by the CPU and the accelerator, with the CPU write-only and the accelerator read-only; the tail pointer of the finished queue is maintained jointly by the CPU and the accelerator, with the CPU read-only and the accelerator write-only.
The invention further discloses a highly concurrent memory-access method based on on-chip RAM, comprising providing a memory-access accelerator independent of the on-chip Cache and the MSHRs; the accelerator is connected to the on-chip RAM and to the memory controller, and outstanding access requests are sent through the accelerator and the memory controller to the memory system.
In this method, the CPU writes access requests into the addressable space of the on-chip RAM, and the accelerator reads the requests and executes them; for a read request, after the requested data returns from the memory system, the accelerator places the data into that space and notifies the CPU, which then processes the data.
In this method, the addressable space contains a read-request table for storing the information of read requests; each entry of the table corresponds to one fixed id number.
Each entry of the read-request table has three fields, storing the type, address, and data of the read request; the type field and address field are filled in by the CPU, and the data field is filled in by the accelerator.
When the data of a read request is too large for the data field, the data field may store only a data pointer, which points to the storage address of the returned data; that storage address is allocated by the CPU.
Each entry of the read-request table is in one of three states: free, new read request, and finished read request. The initial state is free; when the CPU has an access request, it fills in the entry and the state becomes new read request; the accelerator sends the request to the memory controller and, after the data returns, fills it into the data field, whereupon the state becomes finished read request; the CPU takes the data from the data field and processes it, after which the state returns to free.
The invention further discloses a highly concurrent memory-access method based on on-chip RAM, comprising the following steps by which the CPU initiates a read request:
Step S701: the CPU queries the state of the free queue in the addressable space of the on-chip RAM and judges whether the free queue is empty (the condition for the free queue to be empty is that its head pointer coincides with its tail pointer); if empty, return; if non-empty, go to step S702.
Step S702: the CPU takes an id from the head of the free queue;
Step S703: the CPU fills in the type field and address field of the read-request entry corresponding to that id;
Step S704: the CPU writes the id to the tail of the new-read-request queue;
Step S705: the CPU transmits the updated tail pointer of the new-read-request queue to the accelerator;
Step S706: the CPU judges whether to continue initiating read requests; if so, go to step S701; if not, return.
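Steps S701-S706 can be sketched in software as follows. A Python deque stands in for each id circular queue (its emptiness test plays the role of the head/tail pointer comparison), and the Read-table layout as a dict of type/addr/data fields is an assumed illustration:

```python
from collections import deque

def cpu_issue_reads(free_q, new_q, read_table, requests):
    """Issue as many (type, addr) read requests as free entries allow."""
    for rtype, addr in requests:
        if not free_q:                        # S701: free queue empty -> stop
            break
        i = free_q.popleft()                  # S702: id from free-queue head
        read_table[i] = {"type": rtype, "addr": addr, "data": None}  # S703
        new_q.append(i)                       # S704: id to new-read-queue tail
        # S705/S706: forward the updated tail pointer and loop

free_q, new_q, table = deque([0, 1, 2]), deque(), {}
cpu_issue_reads(free_q, new_q, table, [("read", 0x100), ("read", 0x200)])
```

After the call, the two ids have migrated from the free queue to the new-read queue, mirroring the movement of id3 from head1 to tail2 in Fig. 6.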
The invention further discloses a highly concurrent memory-access method based on on-chip RAM, comprising the following steps by which the CPU processes the returned data of a read request:
Step S801: the CPU queries the state of the finished queue and judges whether it is empty (the condition for the finished queue to be empty is that its head pointer coincides with its tail pointer); if empty, return; if non-empty, go to step S802;
Step S802: the CPU takes an id from the head of the finished queue;
Step S803: the CPU operates on the data field of the read-request entry corresponding to that id;
Step S804: the CPU writes the id to the tail of the free queue;
Step S805: the CPU judges whether to continue; if so, go to step S801; if not, return.
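Steps S801-S805 can be sketched similarly; deques stand in for the finished and free circular queues, and `handle` is a hypothetical consumer of the returned data:

```python
from collections import deque

def cpu_drain_finished(finished_q, free_q, read_table, handle):
    while finished_q:                       # S801: finished queue non-empty?
        i = finished_q.popleft()            # S802: id from finished-queue head
        handle(read_table[i]["data"])       # S803: operate on the data field
        free_q.append(i)                    # S804: id back to free-queue tail
    # S805: return when the finished queue is empty

table = {3: {"type": "read", "addr": 0x40, "data": 7}}
finished_q, free_q, seen = deque([3]), deque(), []
cpu_drain_finished(finished_q, free_q, table, seen.append)
```

Note that the id returns to the free queue (S804), closing the free/new/finished cycle.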
The invention further discloses a highly concurrent memory-access method based on on-chip RAM, characterized by the following steps by which the accelerator processes a read request:
Step S901: the accelerator continuously queries whether the new-read-request queue is empty; if non-empty, go to step S902; if empty, keep querying in this step;
Step S902: the accelerator takes an id from the head of the new-read-request queue;
Step S903: the accelerator takes out the type field and address field of the read-request entry corresponding to that id;
Step S904: the accelerator fetches the data from main memory and writes it into the data field of the read-request entry corresponding to that id;
Step S905: the accelerator writes the id to the tail of the finished queue.
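The accelerator side (S901-S905) can be sketched as a single drain pass; `memory` is a hypothetical dict standing in for main memory, and deques stand in for the circular queues:

```python
from collections import deque

def accelerator_service(new_q, finished_q, read_table, memory):
    """Drain the new-read queue once (S901's polling loop, unrolled)."""
    while new_q:                                # S901: queue non-empty?
        i = new_q.popleft()                     # S902: id from queue head
        entry = read_table[i]                   # S903: read type and addr
        entry["data"] = memory[entry["addr"]]   # S904: fetch, fill data field
        finished_q.append(i)                    # S905: id to finished tail

memory = {0x100: 11, 0x200: 22}
table = {0: {"type": "read", "addr": 0x100, "data": None},
         1: {"type": "read", "addr": 0x200, "data": None}}
new_q, finished_q = deque([0, 1]), deque()
accelerator_service(new_q, finished_q, table, memory)
```

In hardware the loop runs continuously and may reorder requests; the sequential drain here is a simplification.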
The invention further discloses a highly concurrent memory-access method based on on-chip RAM, comprising:
Step 1: when the CPU initiates a write request, it first checks whether the write circular queue is full; if not full, it fills in the type, address, and write data of the write request;
Step 2: when the accelerator detects that the write circular queue is non-empty, it automatically reads the type, address, and data of the write request at the queue's head pointer;
Step 3: the accelerator issues the write request to the memory controller.
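The write path (steps 1-3) can be sketched as follows. A bounded deque models the single write circular queue carrying type, address, and data directly; the memory controller is modelled by a hypothetical dict:

```python
from collections import deque

def cpu_post_write(write_q, capacity, wtype, addr, data):
    """Step 1: refuse the write if the queue is full, else enqueue it."""
    if len(write_q) == capacity:            # no space left in on-chip RAM
        return False
    write_q.append((wtype, addr, data))     # fill type, address, write data
    return True

def accelerator_drain_writes(write_q, memory):
    """Steps 2-3: pop each pending write and issue it to the controller."""
    while write_q:                          # queue non-empty
        wtype, addr, data = write_q.popleft()
        memory[addr] = data                 # issue to the memory controller

write_q, memory = deque(), {}
assert cpu_post_write(write_q, 2, "write", 0x80, 5)
assert cpu_post_write(write_q, 2, "write", 0x84, 6)
full = cpu_post_write(write_q, 2, "write", 0x88, 7)   # queue full: refused
accelerator_drain_writes(write_q, memory)
```

Because writes return no data, no table entry or finished queue is needed, matching the simplified structure the patent describes.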
The invention further discloses a processor using the highly concurrent memory-access method or the highly concurrent memory-access device of any one of claims 1-17.
The technical effects of the invention are as follows:
1. One or more read-request tables (Read tables) are kept in the on-chip RAM; each entry of a read-request table contains all the necessary information of a read request, such as the request-type field, the destination-address field, and the data field. Because the invention uses the on-chip RAM to record all the information of concurrent requests, the number of concurrent requests is limited only by the size of the on-chip RAM.
2. The entries of the read-request table are divided into three classes by request state: the free class, the new-request class, and the finished class, and the entry ids of each class are stored in their own circular queue, which makes the states of read requests easy to manage. The invention manages a large amount of read- and write-request information with circular queues, avoiding polling of the status bits of many requests; the number of queries is greatly reduced, which yields an obvious acceleration for large numbers of concurrent, unrelated, fine-grained access requests.
3. The accelerator decides whether to initiate a memory access from the non-empty state of the new-request-class circular queue, and obtains the read-request addresses stored in the on-chip RAM by reading the contents of that queue. In this way the accelerator can initiate memory accesses out of order on its own, without software control, thereby supporting out-of-order issue and out-of-order return of access requests and making targeted scheduling of large numbers of access requests convenient.
4. CPU software decides whether a read request has completed from the non-empty state of the finished-class circular queue, and obtains the address in on-chip RAM of the returned data by reading the contents of that queue, avoiding CPU polling of the status bits of many requests and improving software query efficiency.
Brief description of the drawings
Fig. 1 shows an existing typical CPU storage organization;
Fig. 2 shows the storage architecture of a certain processor;
Fig. 3 shows the position of the memory-access accelerator of the invention on a processor;
Fig. 4 shows the Read table in the addressable space of the invention;
Fig. 5 shows the state transitions of each entry of the Read table of the invention;
Fig. 6 shows the use of three circular queues to manage the states of read requests in the invention;
Fig. 7 shows the steps by which the CPU initiates a read request in the invention;
Fig. 8 shows the steps by which the CPU processes the returned data of a read request in the invention;
Fig. 9 shows the steps by which the accelerator processes a read request in the invention;
Fig. 10 shows the management of write requests with one circular queue in the invention.
Detailed description of the embodiments
Aiming at the problem that the number of concurrent access requests of general-purpose processors is limited, the invention proposes the concept of a "memory-access accelerator". The accelerator is an additional path between the CPU and main memory.
Fig. 3 shows the position of the accelerator of the invention on a processor: it bypasses the Cache and the MSHRs, and the number of outstanding read requests it supports is at least an order of magnitude larger than that of the MSHRs. Through the accelerator, an application can therefore send more access requests to the memory system, improving the concurrency of memory access. The processor comprises a CPU 1, an on-chip RAM 3, a memory-access accelerator 4, a Cache 2, MSHRs 3, a memory controller 6, and a main memory 7.
The accelerator requires the CPU to possess an addressable on-chip RAM space. The CPU writes access requests into this RAM space, and the accelerator reads the requests and executes them. For a read request, after the data returns from memory, the accelerator puts the data into the space and notifies the CPU, and the CPU then processes the data.
Fig. 4 shows the read-request table in the addressable space of the invention: the addressable space contains a table that stores read requests, called the Read table.
Each entry of the Read table corresponds to one fixed id number, and the information of a read request can be stored in the entry. Each entry has three fields, type, addr, and data, which store the type, address, and data of the read request respectively. The type field encodes the additional information needed, such as the data length of the request, the priority of the request, and whether it is a scatter/gather read request. Using the type field, together with auxiliary hardware, some advanced memory-access functions not supported by current architectures can be realized. The type and addr fields are filled in by the CPU; the data field is filled in by the accelerator.
Each entry of the Read table is in one of three states: free; new read request (not yet sent to the memory controller); and finished read request (the memory controller has returned the data of the request, and the data has been filled into the data field).
Fig. 5 shows the state transitions of each entry of the Read table. For an entry of the Read table, the initial state is free. When the CPU has an access request, it fills in the entry and the state becomes new read. The accelerator sends the request to the memory controller and, after the data returns, fills it into the entry's data field; the state then becomes finished read. The CPU takes the data from the data field and processes it; after processing, the state of the entry returns to free.
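The Fig. 5 life cycle (free → new read → finished read → free) can be sketched as a small state machine. The `Entry` class and its method names are illustrative, not from the patent:

```python
FREE, NEW_READ, FINISHED = "free", "new_read", "finished_read"

class Entry:
    """One Read-table entry with its type/addr/data fields and state."""
    def __init__(self):
        self.state, self.type, self.addr, self.data = FREE, None, None, None

    def fill_request(self, rtype, addr):    # CPU fills type and addr fields
        assert self.state == FREE
        self.type, self.addr, self.state = rtype, addr, NEW_READ

    def complete(self, data):               # accelerator fills the data field
        assert self.state == NEW_READ
        self.data, self.state = data, FINISHED

    def consume(self):                      # CPU takes and processes the data
        assert self.state == FINISHED
        data, self.state, self.data = self.data, FREE, None
        return data

e = Entry()
e.fill_request("read", 0x1000)
e.complete(42)
value = e.consume()
```

The assertions enforce the same invariant the hardware relies on: each transition is legal from exactly one predecessor state.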
In the above process, three key problems must be solved:
1. How does the CPU find a free entry in the Read table when it issues a request?
2. How does the accelerator find the entries holding new requests?
3. How does the CPU find the entries whose read requests have returned?
To this end, the invention proposes a request-management method based on circular queues.
Fig. 6 shows the use of three circular queues to manage the states of read requests. The three circular queues are the free-entry queue, the new-read queue, and the finished-read queue, which store the ids of the free entries, the new-read-request entries, and the finished-read-request entries of the Read table respectively. All three circular queues reside in the addressable space. Each queue has two pointers, head and tail, indicating the positions of the queue head and queue tail. In the figure, A denotes a return point at which the CPU may decide whether to continue initiating read requests or to start processing data; no other operation may be inserted before the return, and a new operation may be initiated only after the return.
The process by which the CPU initiates a read-request operation with the accelerator is as follows:
1. When the CPU needs to read data from main memory, it first queries whether the free-entry queue is empty. If it is empty, the Read table of Fig. 4 is fully occupied and no free Read-table entry is available for the moment; if it is non-empty, free entries are still available in the Read table. As shown in Fig. 6, the condition for the free-entry queue to be empty is that pointer head1 coincides with pointer tail1.
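The head/tail test used throughout Fig. 6 can be modelled with monotonically increasing counters indexing a ring buffer: the queue is empty exactly when head == tail and full when tail - head equals the ring size. This class is an illustrative software model, not the hardware registers (which may wrap their pointers differently):

```python
class RingQueue:
    """Circular id queue with head/tail counters, as in Fig. 6."""
    def __init__(self, size):
        self.size = size
        self.buf = [None] * size
        self.head = self.tail = 0           # e.g. head1/tail1 in Fig. 6

    def is_empty(self):
        return self.head == self.tail       # pointers coincide

    def is_full(self):
        return self.tail - self.head == self.size

    def push(self, ident):
        assert not self.is_full()
        self.buf[self.tail % self.size] = ident
        self.tail += 1

    def pop(self):
        assert not self.is_empty()
        ident = self.buf[self.head % self.size]
        self.head += 1
        return ident

q = RingQueue(4)
assert q.is_empty()
for i in (3, 9, 2):
    q.push(i)
first = q.pop()
```

Unbounded counters make the empty/full tests unambiguous; a hardware implementation would typically add a wrap bit to achieve the same distinction.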
2. The CPU takes an id from the head of the free-entry queue, finds the address of the Read-table entry corresponding to that id, and fills the type and addr fields of the new request into the entry. At the same time, the CPU stores the id at the tail of the new-read queue.
The CPU's operation on the circular queues is shown by dotted line 1 in Fig. 6; after this operation, id3 has been moved from the position of head1 to the position of tail2.
3. The CPU advances the tail2 pointer by one and sends the new tail2 pointer to the accelerator.
4. The accelerator judges whether the new-read queue is empty by comparing the head2 and tail2 pointers. When the accelerator detects that the new-read queue is non-empty, it automatically takes an id from the head of the new-read queue, uses the id to find the corresponding outstanding read-request entry in the Read table, processes it, and returns the requested data into the data field of the Read-table entry. When the processing completes, it writes the id to the tail of the finished-read queue.
The accelerator's operation on the circular queues is shown by dotted line 2 in Fig. 6; after this operation, id9 has been moved from the position of head2 to the position of tail3.
5. When the CPU needs to process data, it first checks whether the finished-read queue is empty, again by comparing the head and tail pointers. If the finished-read queue is non-empty, the CPU takes out an id, finds the corresponding Read-table entry, and processes the data field of that entry. When the processing completes, it writes the id to the tail of the free-entry queue.
The CPU's operation on the circular queues is shown by dotted line 3 in Fig. 6; after this operation, id2 has been moved from the position of head3 to the position of tail1.
6. The above process can be repeated.
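The whole Fig. 6 cycle can be demonstrated end to end: an id circulates free → new → finished → free while the data moves from memory into the Read table and then to the CPU. Deques stand in for the three queues, and `memory` is a hypothetical dict:

```python
from collections import deque

def round_trip(addr, memory):
    """One complete read request through the three-queue protocol."""
    free_q, new_q, fin_q = deque(range(4)), deque(), deque()
    table = {i: {"type": None, "addr": None, "data": None} for i in range(4)}
    i = free_q.popleft()                          # steps 1-2: take a free id
    table[i].update(type="read", addr=addr)       # fill type and addr fields
    new_q.append(i)                               # steps 2-3: to new queue
    j = new_q.popleft()                           # step 4: accelerator side
    table[j]["data"] = memory[table[j]["addr"]]   # fetch into the data field
    fin_q.append(j)                               # to the finished queue
    k = fin_q.popleft()                           # step 5: CPU consumes data
    data = table[k]["data"]
    free_q.append(k)                              # id returns to free queue
    return data

result = round_trip(0x10, {0x10: 99})
```

Here the CPU and accelerator roles are interleaved sequentially for clarity; in the invention they run concurrently, synchronized only through the queue pointers.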
The handling of write requests is comparatively simple. Because a write request returns no data, the accelerator only needs to pass the write requests coming from the CPU to the memory controller, so its management structure can be greatly simplified.
Fig. 10 shows the management of write requests with one circular queue in the invention: a single write circular queue (write queue) suffices to manage write requests. Here the type, addr, and data fields of a write request are placed directly in the queue. Like the three queues above, the write queue must also reside in the addressable space.
The write circular queue (write queue) is used as follows:
When the CPU needs to issue a new write request, it first checks whether the write queue is full (before issuing a write request, it must first determine whether there is space in the write queue to hold the data; a full queue means there is no space in the on-chip RAM and no further write request can be issued). If the queue is not full, the CPU fills the type, address, and write data of the write request into the position indicated by tail4.
When the accelerator detects that the write queue is non-empty (meaning that data is present and some write request has not yet completed, so the accelerator automatically takes out the request and executes it), it automatically reads the type, addr, and data of the write request at the head4 pointer and issues the write request to the memory controller.
In summary, the invention discloses a highly concurrent memory-access accelerator based on on-chip RAM. The accelerator is independent of the on-chip Cache and the MSHRs, is connected to the on-chip RAM and the memory controller, and sends outstanding access requests through the accelerator and the memory controller to the memory system.
The number of outstanding access requests supported by the accelerator depends only on the capacity of the on-chip RAM and is not limited by the number of MSHR entries. The on-chip RAM is a RAM with an addressable space; the CPU writes access requests into the addressable space, and the accelerator reads the requests and executes them. For a read request, after the requested data returns from the memory system, the accelerator puts the data into the addressable space and notifies the CPU, and the CPU then processes the data.
The on-chip RAM may be the CPU's own on-chip RAM, or an on-chip RAM independent of the CPU.
The addressable space contains a read-request table for storing the information of read requests; each entry of the table corresponds to one fixed id number.
Each entry of the read-request table has three fields, storing the type, address, and data of the read request; the type field and address field are filled in by the CPU, and the data field is filled in by the accelerator.
Each entry of the read-request table is in one of three states: free, new read request, and finished read request. The initial state is free; when the CPU has an access request, it fills in the entry and the state becomes new read request; the accelerator sends the request to the memory controller and, after the data returns, fills it into the data field, whereupon the state becomes finished read request; the CPU takes the data from the data field and processes it, after which the state returns to free.
The three states are managed by three circular queues, each of which comprises position pointers for its queue head and queue tail.
The invention also discloses a kind of high concurrent access method based on RAM on piece, including one is set independently of on piece
Cache and MSHR memory access accelerator, the memory access accelerator are connected with RAM on piece and Memory Controller Hub, do not complete access request
Memory Controller Hub is sent to memory system by the memory access accelerator.
The high concurrent access method based on RAM on piece, CPU writes access request the addressable of RAM on piece on piece
Space, the memory access accelerator read requests perform, and for read request, pending data is after memory system return, the memory access accelerator
Place data into the space and notify CPU, CPU is handled data.
The invention also discloses a kind of high concurrent access method based on RAM on piece, it is characterised in that is initiated including CPU
The step of read request:
Idle queues state in step S701, CPU query piece in RAM addressable spaces, judge idle queues whether be
Sky, if empty, return, if non-NULL, go to step S702.CPU judges that idle queues are for the condition of sky:The head of idle queues
Pointer overlaps with tail pointer.
Step S702, CPU take id from idle queues head of the queue;
Step S703, CPU fill in the type field and address field of read request list item corresponding with the id;
Step S704, CPU write the id tail of the queue of new read request queue;
The new read request queue rear pointer of renewal is transmitted to memory access accelerator by step S705, CPU;
Step S706, CPU judges whether to continue to initiate read request, if so, step S701 is gone to, if it is not, returning.
The invention also discloses a kind of high concurrent access method based on RAM on piece, it is characterised in that including CPU processing
Read request returned data step:
Step S801, CPU inquire about the state for having completed queue, and judgement has completed whether queue is empty, if empty, returns;
If non-NULL, step S802 is gone to;Memory access accelerator judges that completed queue is as the condition of sky:The head pointer of queue is completed
Overlapped with tail pointer.
Step S802, CPU take id from the head of the queue for having completed queue;
Step S803, the data field of CPU operation read request list item corresponding with the id;
Step S804, CPU write the id tail of the queue of idle queues;
Step S805, CPU judges whether to continue to operate, if so, step S801 is then gone to, if it is not, then returning.
The invention also discloses a high-concurrency memory access method based on on-chip RAM, characterized by including a step in which the memory access accelerator handles a read request:
Step S901: the memory access accelerator queries in real time whether the new-read-request queue is empty; if non-empty, go to step S902; if empty, keep querying in this step;
Step S902: the memory access accelerator takes an id from the head of the new-read-request queue;
Step S903: the memory access accelerator reads out the type field and address field of the read request table entry corresponding to that id;
Step S904: the memory access accelerator fetches the data from main memory and writes it into the data field of the read request table entry corresponding to that id;
Step S905: the memory access accelerator writes the id to the tail of the completed queue.
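A minimal sketch of the accelerator's service step (S902-S905), with the S901 polling loop reduced to a non-blocking check so the function can be driven from software. The `dram` array standing in for the memory system behind the memory controller, and all names and sizes, are assumptions of this sketch.

```c
#include <stdint.h>
#include <stdbool.h>
#include <assert.h>

#define QCAP 257

typedef struct { uint32_t type; uint64_t addr; uint64_t data; } req_entry_t;
typedef struct { uint16_t ids[QCAP]; uint16_t head, tail; } ring_t;

static bool ring_empty(const ring_t *q) { return q->head == q->tail; }
static uint16_t ring_pop(ring_t *q) {
    uint16_t id = q->ids[q->head];
    q->head = (q->head + 1) % QCAP;
    return id;
}
static void ring_push(ring_t *q, uint16_t id) {
    q->ids[q->tail] = id;
    q->tail = (q->tail + 1) % QCAP;
}

/* One iteration of the accelerator's service loop.  The type field would
 * steer granularity and advanced operations; this sketch only uses the
 * address field. */
static bool accel_service_one(req_entry_t table[], ring_t *new_q,
                              ring_t *done_q, const uint64_t dram[]) {
    if (ring_empty(new_q))             /* S901: nothing to do yet */
        return false;
    uint16_t id = ring_pop(new_q);     /* S902: take id from new queue head */
    uint64_t addr = table[id].addr;    /* S903: read the entry's type/address */
    table[id].data = dram[addr];       /* S904: fetch memory into data field */
    ring_push(done_q, id);             /* S905: id to completed queue tail */
    return true;
}
```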
In the above high-concurrency memory access method based on on-chip RAM, the condition for judging a queue empty is that the queue's head pointer coincides with its tail pointer.
The memory access accelerator can issue and return large numbers of concurrent read requests out of order.
The invention also discloses a high-concurrency memory access method based on on-chip RAM, including:
Step 1: when the CPU initiates a write request, it first checks whether the write circular queue is full; if not full, it inserts the write request's type, address and write data;
Step 2: when the memory access accelerator detects that the write circular queue is non-empty, it automatically reads the type, address and data of the write request at the queue's head pointer;
Step 3: the memory access accelerator sends the write request to the memory controller.
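The write path of steps 1-3 can be sketched with a circular queue whose slots carry the whole request, since write data flows one way and needs no per-request table entry. The queue capacity, struct names and the `mem` array standing in for the memory controller are illustrative assumptions.

```c
#include <stdint.h>
#include <stdbool.h>
#include <assert.h>

#define WQCAP 65                        /* 64 usable slots; one kept empty */

typedef struct { uint32_t type; uint64_t addr; uint64_t data; } wreq_t;
typedef struct { wreq_t slots[WQCAP]; uint16_t head, tail; } wring_t;

static bool wring_full(const wring_t *q)  { return (q->tail + 1) % WQCAP == q->head; }
static bool wring_empty(const wring_t *q) { return q->head == q->tail; }

/* Step 1: the CPU checks for a full queue, then inserts type, address, data. */
static bool cpu_issue_write(wring_t *q, uint32_t type, uint64_t addr,
                            uint64_t data) {
    if (wring_full(q))
        return false;
    q->slots[q->tail] = (wreq_t){ type, addr, data };
    q->tail = (q->tail + 1) % WQCAP;
    return true;
}

/* Steps 2-3: on non-empty, the accelerator reads the request at the head
 * pointer and forwards it to the memory controller (simulated by `mem`). */
static bool accel_drain_write(wring_t *q, uint64_t mem[]) {
    if (wring_empty(q))
        return false;
    wreq_t r = q->slots[q->head];
    q->head = (q->head + 1) % WQCAP;
    mem[r.addr] = r.data;               /* stands in for the memory controller */
    return true;
}
```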
The invention also discloses a processor that uses the above memory access method and memory access accelerator.
The present invention has the following features:
1. Flexible access granularity: granularity information is encoded in the type field, so access granularity is limited neither by the instruction set nor by the cache line size. Every datum accessed is one the software actually requires, which improves the effective utilization of memory bandwidth.
2. Advanced access functions: by specifying an access type in the type field, which the memory access accelerator then parses and executes, advanced operations such as scatter/gather and linked-list reads and writes can be realized.
3. The type field can carry upper-layer information such as thread number and priority, so the memory access accelerator can perform advanced QoS scheduling.
4. Implementing the addressable space in SRAM best realizes the accelerator's benefit. In this design, the CPU and the memory access accelerator each need several reads and writes of the read request table, the queues and the queue pointers to complete one request, so the read/write speed of the addressable space must be fast enough to yield any acceleration. SRAM is much faster to access than DRAM and is therefore well suited here.
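Feature 1 says access granularity is encoded in the type field, and feature 3 says the same field can carry thread number and priority. The patent fixes no bit layout, so the following encoding is purely a hypothetical illustration of how one type word could carry all three.

```c
#include <stdint.h>
#include <assert.h>

/* Hypothetical type-field layout (not specified by the patent):
 *   bits  0-7  : log2 of access granularity in bytes
 *   bits  8-23 : thread number
 *   bits 24-27 : priority, for QoS scheduling in the accelerator */
enum { GRAN_SHIFT = 0, THREAD_SHIFT = 8, PRIO_SHIFT = 24 };

static uint32_t make_type(uint8_t log2_gran, uint16_t thread, uint8_t prio) {
    return ((uint32_t)(prio & 0xFu) << PRIO_SHIFT) |
           ((uint32_t)thread << THREAD_SHIFT) |
           (uint32_t)log2_gran;
}

static uint32_t granularity_bytes(uint32_t type) { return 1u << (type & 0xFFu); }
static uint16_t thread_of(uint32_t type) { return (uint16_t)(type >> THREAD_SHIFT); }
static uint8_t  prio_of(uint32_t type)   { return (uint8_t)((type >> PRIO_SHIFT) & 0xFu); }
```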
Technical effects of the present invention:
1. One or more read request tables (Read tables) are kept in on-chip RAM; each entry of a read request table contains all the necessary information of a read request, such as the request type field, destination address field and data field. Because the invention records all information of the concurrent requests in on-chip RAM, the number of concurrent requests is limited only by the size of the on-chip RAM.
2. The entries of the read request table are divided by request state into three classes: idle, new-request and completed, and the entry ids of each class are stored in a separate circular queue, which makes the states of the read requests easy to manage. The invention manages a large amount of read and write request information with circular queues, avoiding polling the status bits of many individual requests; the number of queries is greatly reduced, giving an obvious acceleration effect for large numbers of concurrent, unrelated, fine-grained memory access requests.
3. Whether to initiate a memory access is decided by the non-empty state of the "new request" circular queue, and by reading the content of that queue the memory access accelerator obtains the addresses of the read requests stored in the on-chip RAM. In this way the accelerator can initiate memory accesses out of order on its own, without software control, supporting out-of-order issue and out-of-order return of access requests and making targeted scheduling of large numbers of requests convenient.
4. CPU software determines whether read requests have completed from the non-empty state of the "completed" circular queue, and by reading the content of that queue the CPU obtains the on-chip RAM addresses of the data returned by the read requests, avoiding CPU polling of many requests' status bits and improving software lookup efficiency.
Claims (14)
1. A high-concurrency memory access accelerator based on on-chip RAM, characterized in that the memory access accelerator is independent of the on-chip cache and the MSHRs and is connected to the on-chip RAM and to the memory controller, outstanding memory access requests being sent through the memory access accelerator via the memory controller to the memory system, wherein the number of outstanding access requests the memory access accelerator supports depends only on the capacity of the on-chip RAM and is not limited by the number of MSHR entries, and an addressable space in the on-chip RAM holds a read request table for storing the information of read requests, each entry of the read request table corresponding to one intrinsic id number.
2. The high-concurrency memory access accelerator based on on-chip RAM of claim 1, characterized in that each entry of the read request table has three fields for storing the type, address and data of the read request, wherein the type field and address field are filled in by the CPU and the data field is filled in by the memory access accelerator.
3. The high-concurrency memory access accelerator based on on-chip RAM of claim 2, characterized in that, when the data field of the read request table would be too large, the entry may store only a data pointer, the data pointer pointing to the storage address of the returned data, which is allocated by the CPU.
4. The high-concurrency memory access accelerator based on on-chip RAM of claim 1, characterized in that each entry of the read request table is in one of three states: idle, new read request, and completed read request; the initial state is idle; when the CPU has an access request, it fills in the request and the state becomes new read request; the memory access accelerator sends the request to the memory controller, and after the data returns it is filled into the data field and the state becomes completed read request; the CPU fetches and processes the data from the data field, after which the state returns to idle.
5. The high-concurrency memory access accelerator based on on-chip RAM of claim 4, characterized in that each circular queue comprises a head pointer and a tail pointer; the head pointer and tail pointer of the free queue and the head pointer of the completed queue are software variables maintained by the CPU; the head pointer and tail pointer of the new-read-request queue and the tail pointer of the completed queue are hardware registers, the head pointer of the new-read-request queue being maintained by the memory access accelerator; the tail pointer of the new-read-request queue is maintained jointly by the CPU and the memory access accelerator, with the CPU write-only and the memory access accelerator read-only; the tail pointer of the completed queue is maintained jointly by the CPU and the memory access accelerator, with the CPU read-only and the memory access accelerator write-only.
6. A high-concurrency memory access method based on on-chip RAM, characterized by providing a memory access accelerator independent of the on-chip cache and the MSHRs, the memory access accelerator being connected to the on-chip RAM and to the memory controller, outstanding memory access requests being sent through the memory access accelerator via the memory controller to the memory system, wherein the CPU writes access requests into an addressable space of the on-chip RAM and the memory access accelerator reads and executes them; for a read request, after the memory system returns the requested data, the memory access accelerator places the data into that space and notifies the CPU, and the CPU processes the data; the addressable space holds a read request table storing the information of read requests, each entry of the read request table corresponding to one intrinsic id number.
7. The high-concurrency memory access method based on on-chip RAM of claim 6, characterized in that each entry of the read request table has three fields for storing the type, address and data of the read request, wherein the type field and address field are filled in by the CPU and the data field is filled in by the memory access accelerator.
8. The high-concurrency memory access method based on on-chip RAM of claim 7, characterized in that, when the data field of the read request table would be too large, the entry may store only a data pointer, the data pointer pointing to the storage address of the returned data, which is allocated by the CPU.
9. The high-concurrency memory access method based on on-chip RAM of claim 6, characterized in that each entry of the read request table is in one of three states: idle, new read request, and completed read request; the initial state is idle; when the CPU has an access request, it fills in the request and the state becomes new read request; the memory access accelerator sends the request to the memory controller, and after the data returns it is filled into the data field and the state becomes completed read request; the CPU fetches and processes the data from the data field, after which the state returns to idle.
10. The high-concurrency memory access method based on on-chip RAM of claim 6, characterized by further including a step in which the CPU initiates a read request:
Step S701: the CPU queries the state of the free queue in the addressable space of the on-chip RAM and judges whether the free queue is empty, the CPU's condition for the free queue being empty being that its head pointer coincides with its tail pointer; if empty, return; if non-empty, go to step S702;
Step S702: the CPU takes an id from the head of the free queue;
Step S703: the CPU fills in the type field and address field of the read request table entry corresponding to that id;
Step S704: the CPU writes the id to the tail of the new-read-request queue;
Step S705: the CPU passes the updated tail pointer of the new-read-request queue to the memory access accelerator;
Step S706: the CPU judges whether to continue issuing read requests; if so, go to step S701; if not, return.
11. The high-concurrency memory access method based on on-chip RAM of claim 6, characterized by further including a step in which the CPU processes the data returned for a read request:
Step S801: the CPU queries the state of the completed queue and judges whether it is empty, the CPU's condition for the completed queue being empty being that its head pointer coincides with its tail pointer; if empty, return; if non-empty, go to step S802;
Step S802: the CPU takes an id from the head of the completed queue;
Step S803: the CPU operates on the data field of the read request table entry corresponding to that id;
Step S804: the CPU writes the id to the tail of the free queue;
Step S805: the CPU judges whether to continue processing; if so, go to step S801; if not, return.
12. The high-concurrency memory access method based on on-chip RAM of claim 6, characterized by further including a step in which the memory access accelerator handles a read request:
Step S901: the memory access accelerator queries in real time whether the new-read-request queue is empty; if non-empty, go to step S902; if empty, keep querying in this step;
Step S902: the memory access accelerator takes an id from the head of the new-read-request queue;
Step S903: the memory access accelerator reads out the type field and address field of the read request table entry corresponding to that id;
Step S904: the memory access accelerator fetches the data from main memory and writes it into the data field of the read request table entry corresponding to that id;
Step S905: the memory access accelerator writes the id to the tail of the completed queue.
13. The high-concurrency memory access method based on on-chip RAM of claim 6, characterized by further including:
Step 1: when the CPU initiates a write request, it first checks whether the write circular queue is full; if not full, it inserts the write request's type, address and write data;
Step 2: when the memory access accelerator detects that the write circular queue is non-empty, it automatically reads the type, address and data of the write request at the queue's head pointer;
Step 3: the memory access accelerator sends the write request to the memory controller.
14. A processor using the memory access accelerator of any one of claims 1-5.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201310242398.5A CN103345429B (en) | 2013-06-19 | 2013-06-19 | High concurrent memory access accelerated method, accelerator and CPU based on RAM on piece |
Publications (2)
Publication Number | Publication Date |
---|---|
CN103345429A CN103345429A (en) | 2013-10-09 |
CN103345429B true CN103345429B (en) | 2018-03-30 |
Family
ID=49280227
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201310242398.5A Active CN103345429B (en) | 2013-06-19 | 2013-06-19 | High concurrent memory access accelerated method, accelerator and CPU based on RAM on piece |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN103345429B (en) |
Families Citing this family (7)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN104252416B (en) * | 2013-06-28 | 2017-09-05 | 华为技术有限公司 | A kind of accelerator and data processing method |
CN105988952B (en) * | 2015-02-28 | 2019-03-08 | 华为技术有限公司 | The method and apparatus for distributing hardware-accelerated instruction for Memory Controller Hub |
CN105354153B (en) * | 2015-11-23 | 2018-04-06 | 浙江大学城市学院 | A kind of implementation method of close coupling heterogeneous multi-processor data exchange caching |
CN109582600B (en) * | 2017-09-25 | 2020-12-01 | 华为技术有限公司 | Data processing method and device |
CN109086228B (en) * | 2018-06-26 | 2022-03-29 | 深圳市安信智控科技有限公司 | High speed memory chip with multiple independent access channels |
CN110688238B (en) * | 2019-09-09 | 2021-05-07 | 无锡江南计算技术研究所 | Method and device for realizing queue of separated storage |
CN115292236B (en) * | 2022-09-30 | 2022-12-23 | 山东华翼微电子技术股份有限公司 | Multi-core acceleration method and device based on high-speed interface |
Citations (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN102073596A (en) * | 2011-01-14 | 2011-05-25 | 东南大学 | Method for managing reconfigurable on-chip unified memory aiming at instructions |
Family Cites Families (7)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US5813031A (en) * | 1994-09-21 | 1998-09-22 | Industrial Technology Research Institute | Caching tag for a large scale cache computer memory system |
US20040107240A1 (en) * | 2002-12-02 | 2004-06-03 | Globespan Virata Incorporated | Method and system for intertask messaging between multiple processors |
WO2005066796A1 (en) * | 2003-12-22 | 2005-07-21 | Matsushita Electric Industrial Co., Ltd. | Cache memory and its controlling method |
US7467277B2 (en) * | 2006-02-07 | 2008-12-16 | International Business Machines Corporation | Memory controller operating in a system with a variable system clock |
US7809895B2 (en) * | 2007-03-09 | 2010-10-05 | Oracle America, Inc. | Low overhead access to shared on-chip hardware accelerator with memory-based interfaces |
CN101221538B (en) * | 2008-01-24 | 2010-10-13 | 杭州华三通信技术有限公司 | System and method for implementing fast data search in caching |
US9772958B2 (en) * | 2011-10-31 | 2017-09-26 | Hewlett Packard Enterprise Development Lp | Methods and apparatus to control generation of memory access requests |
- 2013-06-19: CN CN201310242398.5A patent/CN103345429B/en (status: active)
Patent Citations (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN102073596A (en) * | 2011-01-14 | 2011-05-25 | 东南大学 | Method for managing reconfigurable on-chip unified memory aiming at instructions |
Also Published As
Publication number | Publication date |
---|---|
CN103345429A (en) | 2013-10-09 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN103345429B (en) | High concurrent memory access accelerated method, accelerator and CPU based on RAM on piece | |
TW449724B (en) | Circuit arrangement and method with state-based transaction scheduling | |
TWI537962B (en) | Memory controlled data movement and timing | |
US6868087B1 (en) | Request queue manager in transfer controller with hub and ports | |
JP5666722B2 (en) | Memory interface | |
CN109032668A (en) | Stream handle with high bandwidth and low-power vector register file | |
US11068418B2 (en) | Determining memory access categories for tasks coded in a computer program | |
CN104375954B (en) | The method and computer system for based on workload implementing that the dynamic of cache is enabled and disabled | |
US11093399B2 (en) | Selecting resources to make available in local queues for processors to use | |
CN110457238A (en) | The method paused when slowing down GPU access request and instruction access cache | |
JP2024513076A (en) | Message passing circuit configuration and method | |
US8566532B2 (en) | Management of multipurpose command queues in a multilevel cache hierarchy | |
US10204060B2 (en) | Determining memory access categories to use to assign tasks to processor cores to execute | |
US20190079795A1 (en) | Hardware accelerated data processing operations for storage data | |
US20090070527A1 (en) | Using inter-arrival times of data requests to cache data in a computing environment | |
CN109783012A (en) | Reservoir and its controller based on flash memory | |
JP4452644B2 (en) | Improved memory performance | |
US20090083496A1 (en) | Method for Improved Performance With New Buffers on NUMA Systems | |
US20050149562A1 (en) | Method and system for managing data access requests utilizing storage meta data processing | |
CN101341471B (en) | Apparatus and method for dynamic cache management | |
KR20210156759A (en) | Systems, methods, and devices for queue availability monitoring | |
US20200097297A1 (en) | System and method for dynamic determination of a number of parallel threads for a request | |
US20100058024A1 (en) | Data Transfer Apparatus, Data Transfer Method And Processor | |
CN108733409A (en) | Execute the method and chip multi-core processor of speculative threads | |
EP4227790B1 (en) | Systems, methods, and apparatus for copy destination atomicity in devices |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
C06 | Publication | ||
PB01 | Publication | ||
C10 | Entry into substantive examination | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||