CN103345429B - High-concurrency memory access acceleration method, accelerator, and CPU based on on-chip RAM

Publication number
CN103345429B
Authority
CN
China
Prior art keywords
cpu
memory access
queue
read request
request
Prior art date
Legal status
Active
Application number
CN201310242398.5A
Other languages
Chinese (zh)
Other versions
CN103345429A (en
Inventor
刘垚
陈明扬
陈明宇
阮元
Current Assignee
Institute of Computing Technology of CAS
Original Assignee
Institute of Computing Technology of CAS
Application filed by Institute of Computing Technology of CAS
Priority to CN201310242398.5A
Publication of CN103345429A
Application granted
Publication of CN103345429B

Landscapes

  • Memory System Of A Hierarchy Structure (AREA)
  • Multi Processors (AREA)

Abstract

The invention discloses a high-concurrency memory access accelerator based on on-chip RAM, a corresponding access method, and a processor using the method. The memory access accelerator is independent of the on-chip Cache and MSHR and is connected to the on-chip RAM and the memory controller. Outstanding memory access requests are sent through the memory access accelerator to the memory controller and on to the memory system. This addresses the limited number of concurrent memory accesses of general-purpose processors in Internet and cloud computing applications and accelerates highly concurrent memory access.

Description

High-concurrency memory access acceleration method, accelerator, and CPU based on on-chip RAM
Technical field
The invention belongs to the field of computers and relates to the structural design inside a CPU, and in particular to a high-concurrency memory access acceleration method, accelerator, and CPU based on on-chip RAM.
Background
With the development of the Internet and cloud computing, highly concurrent data processing programs are becoming increasingly common. Such programs typically handle a large number of concurrent loads submitted in the form of requests (request) or jobs (job), and the core business of these loads usually involves processing and analyzing massive amounts of data. Such programs usually use multiple threads or processes, with little or no memory access dependence between the threads or processes.
Such applications therefore issue a large number of concurrent memory access requests to the memory system, which challenges the concurrency of the memory access system. If that concurrency is not high enough, it becomes the bottleneck that prevents such applications from achieving higher performance.
Fig. 1 shows a typical CPU storage structure. When the CPU needs to read data, it first looks in the Cache. If the required data is in the Cache (Cache hit), the data is returned directly to the CPU. If the CPU does not find the required data in the Cache (Cache miss), the CPU fetches the required data from main memory (Main Memory) into the Cache.
The Cache contains a group of registers, the MSHR (Miss Status Handling Registers), dedicated to recording information about read requests that missed in the Cache and have been sent to main memory (that is, Cache miss requests). The information recorded by the MSHR includes the Cache Line address, the destination register of the read request, and so on. When main memory completes the read request and returns the data of that Cache Line, the recorded information is used to fill the corresponding Cache Line and to return the data to the destination register. Each Cache miss read request occupies one MSHR entry. Once the MSHR is full, new Cache miss read requests are stalled and cannot be sent to main memory. The number of outstanding read requests that the MSHR supports is therefore one of the key factors that determine the concurrency of the memory access system. (An outstanding read request is one that has been issued but whose data has not yet returned. Because it has not completed, the MSHR must still record it.)
At present, the MSHRs of typical processors support relatively few outstanding read requests. For example, on the Cortex-A9 processor, the L2 Cache MSHR supports only 10 outstanding read requests. When an application issues a large number of concurrent memory access requests to the memory system and these requests have low locality (so that many Cache misses occur), the MSHR fills up quickly and becomes the bottleneck of the whole system.
Fig. 2 shows the storage architecture of a certain type of processor. This processor proposes a new memory access mode that can, in theory, support issuing a large number of concurrent memory access requests.
The processor consists of 1 PPE (Power Processor Element), 8 SPEs (Synergistic Processing Element), 1 MIC (Memory Interface Controller), and 1 EIB (Element Interconnect Bus).
The following focuses on the memory access mechanism of the SPE.
Each SPE is a microprocessor whose program runs in a local 256 KB storage unit (RAM). When an SPE needs to obtain data from main memory (Main Memory), it must first initialize the DMAC (Direct Memory Access Controller) by writing parameters such as the request source address, destination address, and request length into the DMAC control queue. The DMAC then moves the data from main memory into the local store on its own, according to the parameters in the queue.
In theory, the number of concurrent requests this mechanism supports is limited only by the number of commands that can be stored in the DMA command queue, or by the limited capacity of the on-chip RAM. However, the mechanism has two defects:
1. Before each DMA operation starts, several parameters must be entered, such as the source address, destination address, data size, TAG, and direction. This process takes several instruction cycles. If the SPE needs to read a large amount of small-granularity data concurrently, DMA transfer efficiency is relatively low;
2. DMA state management is inefficient. First, the program must prepare enough space for the data returned by read requests, and this scheme lacks a free-space management mechanism, so local storage utilization drops substantially after long-running operation. Second, the processor obtains DMA completion status by software polling of status bits, which is inefficient when the number of memory access requests grows.
Summary of the invention
To solve the above technical problems, the object of the invention is to propose a high-concurrency memory access accelerator based on on-chip RAM and a method of managing a large number of concurrent memory access requests using on-chip RAM, thereby addressing the limited number of concurrent memory accesses of general-purpose processors in Internet and cloud computing applications and accelerating highly concurrent memory access.
Specifically, the invention discloses a high-concurrency memory access accelerator based on on-chip RAM. The memory access accelerator is independent of the on-chip Cache and MSHR and is connected to the on-chip RAM and the memory controller. Outstanding memory access requests are sent through the memory access accelerator to the memory controller and on to the memory system.
In the high-concurrency memory access accelerator based on on-chip RAM, the number of outstanding memory access requests that the accelerator supports depends only on the capacity of the on-chip RAM and is not limited by the number of MSHR entries.
In the high-concurrency memory access accelerator based on on-chip RAM, the addressable space contains a read request table for storing the information of read requests, and each entry of the read request table corresponds to a fixed id.
In the high-concurrency memory access accelerator based on on-chip RAM, each entry of the read request table has three fields that store the type, address, and data of the read request. The type field and address field are filled in by the CPU, and the data field is filled in by the memory access accelerator.
In the high-concurrency memory access accelerator based on on-chip RAM, when the data field of the read request table would be too large, it may store only a data pointer. The data pointer points to the storage address of the returned data, and that storage address is allocated by the CPU.
In the high-concurrency memory access accelerator based on on-chip RAM, each entry of the read request table is in one of three states: idle, new read request, and completed read request. The initial state is idle. When the CPU has a memory access request, it fills the request into the entry and the state becomes new read request. The memory access accelerator sends the request to the memory controller and, after the data returns, fills it into the data field, and the state becomes completed read request. The CPU takes the data from the data field and processes it, and the state returns to idle once processing is complete.
In the high-concurrency memory access accelerator based on on-chip RAM, each circular queue includes a head pointer and a tail pointer. The head and tail pointers of the idle queue and the head pointer of the completed queue are software variables maintained by the CPU. The head and tail pointers of the new read request queue and the tail pointer of the completed queue are hardware registers. The head pointer of the new read request queue is maintained by the memory access accelerator. The tail pointer of the new read request queue is maintained jointly by the CPU and the memory access accelerator: the CPU only writes it and the accelerator only reads it. The tail pointer of the completed queue is also maintained jointly: the CPU only reads it and the accelerator only writes it.
The invention also discloses a high-concurrency access method based on on-chip RAM, which includes providing a memory access accelerator independent of the on-chip Cache and MSHR. The memory access accelerator is connected to the on-chip RAM and the memory controller, and outstanding memory access requests are sent through the memory access accelerator to the memory controller and on to the memory system.
In the high-concurrency access method based on on-chip RAM, the on-chip CPU writes memory access requests into the addressable space of the on-chip RAM, and the memory access accelerator reads and executes the requests. For a read request, after the data returns from the memory system, the memory access accelerator places the data into the space and notifies the CPU, and the CPU processes the data.
In the high-concurrency access method based on on-chip RAM, the addressable space contains a read request table for storing the information of read requests, and each entry of the read request table corresponds to a fixed id.
In the high-concurrency access method based on on-chip RAM, each entry of the read request table has three fields that store the type, address, and data of the read request. The type field and address field are filled in by the CPU, and the data field is filled in by the memory access accelerator.
In the high-concurrency access method based on on-chip RAM, when the data field of the read request table would be too large, it may store only a data pointer. The data pointer points to the storage address of the returned data, and that storage address is allocated by the CPU.
In the high-concurrency access method based on on-chip RAM, each entry of the read request table is in one of three states: idle, new read request, and completed read request. The initial state is idle. When the CPU has a memory access request, it fills the request into the entry and the state becomes new read request. The memory access accelerator sends the request to the memory controller and, after the data returns, fills it into the data field, and the state becomes completed read request. The CPU takes the data from the data field and processes it, and the state returns to idle once processing is complete.
The invention also discloses a high-concurrency access method based on on-chip RAM, including the steps by which the CPU initiates read requests:
Step S701, the CPU queries the state of the idle queue in the addressable space of the on-chip RAM and determines whether the idle queue is empty. The idle queue is empty when its head pointer coincides with its tail pointer. If it is empty, return; if it is not empty, go to step S702;
Step S702, the CPU takes an id from the head of the idle queue;
Step S703, the CPU fills in the type field and address field of the read request table entry corresponding to the id;
Step S704, the CPU writes the id to the tail of the new read request queue;
Step S705, the CPU passes the updated tail pointer of the new read request queue to the memory access accelerator;
Step S706, the CPU determines whether to continue initiating read requests. If so, go to step S701; if not, return.
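Steps S701 to S706 can be sketched in software as follows. This is a minimal illustration under stated assumptions, not the patented implementation: `collections.deque` stands in for the circular queues (an empty deque corresponds to head pointer equal to tail pointer), the read request table is a list of dictionaries, and `accel_regs["new_tail"]` is a hypothetical stand-in for the hardware tail-pointer register that the CPU writes in S705.

```python
from collections import deque


def issue_reads(requests, read_table, free_q, new_q, accel_regs):
    """Issue (type, addr) requests while idle entries remain (S701-S706)."""
    issued = []
    for req_type, addr in requests:      # S706: continue while requests remain
        if not free_q:                   # S701: idle queue empty, so return
            break
        entry_id = free_q.popleft()      # S702: take an id from the idle queue head
        read_table[entry_id]["type"] = req_type   # S703: fill in the type field
        read_table[entry_id]["addr"] = addr       # S703: fill in the address field
        new_q.append(entry_id)           # S704: write the id to the new-read queue tail
        accel_regs["new_tail"] += 1      # S705: publish the updated tail (assumed counter)
        issued.append(entry_id)
    return issued


# Usage: two idle entries, three requests; the third is not issued.
table = [{"type": None, "addr": None, "data": None} for _ in range(2)]
free_q, new_q, regs = deque([0, 1]), deque(), {"new_tail": 0}
ids = issue_reads([(0, 0x100), (0, 0x200), (0, 0x300)], table, free_q, new_q, regs)
```

In this sketch the CPU never blocks: when no idle entry is available it simply returns, which matches the "if empty, return" branch of S701.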
The invention also discloses a high-concurrency access method based on on-chip RAM, including the steps by which the CPU processes the data returned for read requests:
Step S801, the CPU queries the state of the completed queue and determines whether it is empty (the completed queue is empty when its head pointer coincides with its tail pointer). If it is empty, return; if it is not empty, go to step S802;
Step S802, the CPU takes an id from the head of the completed queue;
Step S803, the CPU operates on the data field of the read request table entry corresponding to the id;
Step S804, the CPU writes the id to the tail of the idle queue;
Step S805, the CPU determines whether to continue. If so, go to step S801; if not, return.
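A minimal software sketch of steps S801 to S805 is given here for illustration. The function name `drain_completed`, the `handle` callback, and the optional `limit` parameter are hypothetical; `collections.deque` again stands in for the circular queues.

```python
from collections import deque


def drain_completed(read_table, done_q, free_q, handle, limit=None):
    """Process completed read requests and recycle their ids (S801-S805)."""
    handled = 0
    while limit is None or handled < limit:   # S805: decide whether to continue
        if not done_q:                        # S801: completed queue empty, so return
            break
        entry_id = done_q.popleft()           # S802: take an id from the completed queue head
        handle(entry_id, read_table[entry_id]["data"])  # S803: operate on the data field
        free_q.append(entry_id)               # S804: write the id to the idle queue tail
        handled += 1
    return handled


# Usage: entries 1 and 0 completed in that order; entry 2 is already idle.
table = [{"data": b"a"}, {"data": b"b"}, {"data": None}]
done_q, free_q = deque([1, 0]), deque([2])
seen = []
count = drain_completed(table, done_q, free_q, lambda i, d: seen.append((i, d)))
```

The CPU learns which requests have completed purely from the queue contents, so no per-request status bit is polled.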
The invention also discloses a high-concurrency access method based on on-chip RAM, characterized in that it includes the steps by which the memory access accelerator processes read requests:
Step S901, the memory access accelerator continuously checks whether the new read request queue is empty. If it is not empty, go to step S902; if it is empty, keep checking in this step;
Step S902, the memory access accelerator takes an id from the head of the new read request queue;
Step S903, the memory access accelerator reads the type field and address field of the read request table entry corresponding to the id;
Step S904, the memory access accelerator fetches the data from memory and writes it into the data field of the read request table entry corresponding to the id;
Step S905, the memory access accelerator writes the id to the tail of the completed queue.
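The accelerator side of steps S901 to S905 can be modelled in software as shown here. This is only a behavioural sketch under assumptions: a dictionary named `memory` stands in for main memory, the type field is read but not interpreted, and requests are served in order, whereas the hardware described in the text may issue and complete them out of order.

```python
from collections import deque


def accelerator_step(read_table, new_q, done_q, memory):
    """Serve one new read request; return its id, or None if none is pending."""
    if not new_q:                        # S901: queue empty; hardware would keep checking
        return None
    entry_id = new_q.popleft()           # S902: take an id from the new-read queue head
    entry = read_table[entry_id]
    req_type, addr = entry["type"], entry["addr"]   # S903: read the type and address fields
    entry["data"] = memory[addr]         # S904: fetch the data and fill the data field
    done_q.append(entry_id)              # S905: write the id to the completed queue tail
    return entry_id


# Usage: one pending request for address 0x40.
memory = {0x40: b"payload"}
table = [{"type": 0, "addr": 0x40, "data": None}]
new_q, done_q = deque([0]), deque()
served = accelerator_step(table, new_q, done_q, memory)
idle = accelerator_step(table, new_q, done_q, memory)
```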
The invention also discloses a high-concurrency access method based on on-chip RAM, including:
Step 1, when the CPU initiates a write request, it first checks whether the write circular queue is full. If it is not full, the CPU fills in the type, address, and write data of the write request;
Step 2, when the memory access accelerator detects that the write circular queue is not empty, it automatically reads the type, address, and data of the write request at the head pointer of the write circular queue;
Step 3, the memory access accelerator sends the write request to the memory controller.
The invention also discloses a processor that uses the high-concurrency access method or high-concurrency memory access device of any one of claims 1-17.
Technical effects of the invention:
1. On-chip RAM stores one or more read request tables (Read table). Each entry of a read request table holds all the information a read request needs, such as the request type field, destination address field, and data field. Because the invention records all information about concurrent requests in on-chip RAM, the number of concurrent requests is limited only by the size of the on-chip RAM.
2. The entries of the read request table are divided into three classes by request state: idle, new request, and completed. The entry addresses of each class are stored in a separate circular queue, which makes the states of read requests easy to manage. The invention manages a large amount of read and write request information with circular queues and avoids polling the status bits of many requests, so the number of queries drops sharply and a large number of concurrent, unrelated small-granularity memory access requests are clearly accelerated.
3. Whether to initiate a memory access operation is decided by whether the "new request" circular queue is non-empty, and the addresses of the read requests stored in on-chip RAM are obtained by reading the contents of that queue. The memory access accelerator can therefore initiate memory access operations out of order on its own, without software control. This supports out-of-order execution and out-of-order return of memory access requests and makes targeted scheduling of a large number of requests convenient.
4. CPU software determines whether any read request has completed by checking whether the "completed" circular queue is non-empty, and obtains the address of the returned data in on-chip RAM by reading the contents of that queue. This avoids having the CPU poll the status bits of many requests and improves software query efficiency.
Brief description of the drawings
Fig. 1 shows an existing typical CPU storage structure;
Fig. 2 shows the storage architecture of a certain type of processor;
Fig. 3 shows the position of the memory access accelerator of the invention in a processor;
Fig. 4 shows the Read table in the addressable space of the invention;
Fig. 5 shows the state transitions of each entry of the Read table in the invention;
Fig. 6 shows how the invention manages read request states with three circular queues;
Fig. 7 shows the steps by which the CPU initiates read requests in the invention;
Fig. 8 shows the steps by which the CPU processes the data returned for read requests in the invention;
Fig. 9 shows the steps by which the memory access accelerator processes read requests in the invention;
Fig. 10 shows how the invention manages write requests with one circular queue.
Embodiments
To address the limited number of concurrent memory access requests of general-purpose processors, the invention proposes the concept of a "memory access accelerator". The memory access accelerator is an additional path between the CPU and main memory (Memory).
Fig. 3 shows the position of the memory access accelerator of the invention in a processor. It bypasses the Cache and the MSHR, and the number of outstanding read requests it supports is at least an order of magnitude greater than that of the MSHR. Through the memory access accelerator, an application can therefore send more memory access requests to the memory system, which improves memory access concurrency. The processor includes CPU 1, RAM 3, memory access accelerator 4, Cache 2, MSHR 3, memory controller 6, and main memory 7.
The memory access accelerator requires the CPU to have a block of addressable on-chip RAM space. The CPU writes memory access requests into this RAM space, and the memory access accelerator reads and executes them. For a read request, after the data returns from Memory, the memory access accelerator places the data into the space and notifies the CPU, and the CPU then processes the data.
Fig. 4 shows the read request table (Read table) in the addressable space of the invention. The addressable space holds a table that stores read requests, called the Read table.
Each entry of the Read table corresponds to a fixed id, and a Read table entry stores the information of one read request. Each entry has three fields, type, addr, and data, which store the type, address, and data of the read request, respectively. The type field encodes any additional information that is needed, such as the data length of the request, the priority of the request, and whether the request is a scatter/gather read request. With the type field and some auxiliary hardware, advanced memory access functions that current architectures do not support can be implemented. The type and addr fields are filled in by the CPU, and the data field is filled in by the memory access accelerator.
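One way to picture a Read table entry is sketched here. The text does not specify how the type field is encoded, so the bit layout (8 bits of length, 3 bits of priority, 1 scatter/gather flag), the helper names, and the table size of 16 are all assumptions chosen purely for illustration.

```python
from dataclasses import dataclass

LEN_BITS, PRIO_BITS = 8, 3   # assumed field widths, not taken from the patent


def encode_type(length, priority, scatter_gather=False):
    """Pack length, priority, and a scatter/gather flag into one type word."""
    if not 0 <= length < (1 << LEN_BITS):
        raise ValueError("length out of range")
    if not 0 <= priority < (1 << PRIO_BITS):
        raise ValueError("priority out of range")
    return (length
            | (priority << LEN_BITS)
            | (int(scatter_gather) << (LEN_BITS + PRIO_BITS)))


def decode_type(word):
    """Unpack a type word into (length, priority, scatter_gather)."""
    length = word & ((1 << LEN_BITS) - 1)
    priority = (word >> LEN_BITS) & ((1 << PRIO_BITS) - 1)
    scatter_gather = bool((word >> (LEN_BITS + PRIO_BITS)) & 1)
    return length, priority, scatter_gather


@dataclass
class ReadTableEntry:
    type: int = 0      # filled in by the CPU
    addr: int = 0      # filled in by the CPU
    data: bytes = b""  # filled in by the memory access accelerator


# The entry's position in the list plays the role of its fixed id.
read_table = [ReadTableEntry() for _ in range(16)]
read_table[5].type = encode_type(length=64, priority=2, scatter_gather=True)
read_table[5].addr = 0x8000_0000
```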
Each entry of the Read table is in one of three states: idle, new read request (not yet sent to the memory controller), and completed read request (the memory controller has returned the data of the request, and the data has been filled into the data field).
Fig. 5 shows the state transitions of each entry of the Read table in the invention. The initial state of an entry is idle (free). When the CPU has a memory access request, it fills the request into the entry, and the entry's state becomes new read request (new read). The memory access accelerator sends the request to the memory controller and, after the data returns, fills it into the entry's data field, and the entry's state becomes completed read request (finished read). The CPU takes the data from the data field and processes it, and once processing is complete the entry's state returns to idle (free).
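The three states and their transitions can be written as a small table-driven state machine. The state names follow the labels used in the text (free, new read, finished read); the event names `cpu_fill`, `data_returned`, and `cpu_processed` are hypothetical labels introduced only for this sketch.

```python
from enum import Enum


class EntryState(Enum):
    FREE = "free"
    NEW_READ = "new read"
    FINISHED_READ = "finished read"


# (current state, event) -> next state
TRANSITIONS = {
    (EntryState.FREE, "cpu_fill"): EntryState.NEW_READ,
    (EntryState.NEW_READ, "data_returned"): EntryState.FINISHED_READ,
    (EntryState.FINISHED_READ, "cpu_processed"): EntryState.FREE,
}


def step(state, event):
    """Return the next state, or raise ValueError for an illegal transition."""
    try:
        return TRANSITIONS[(state, event)]
    except KeyError:
        raise ValueError(f"illegal transition: {state.value} on {event}") from None
```

Each entry only ever moves around this single cycle, which is what allows the three states to be tracked by three queues of ids rather than by per-entry status bits.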
In the above process, three key problems need to be solved:
1. How does the CPU find the position of an idle entry in the Read table when it sends a request?
2. How does the memory access accelerator find the position of a new request entry?
3. How does the CPU obtain the position of an entry whose read request has returned?
To this end, the invention proposes a request management method based on circular queues.
Fig. 6 shows how the invention manages read request states with three circular queues: the idle circular queue (free entry queue), the new read request circular queue (new read queue), and the completed read request circular queue (finished read queue). They store the ids of the idle entries, new read request entries, and completed read request entries of the Read table, respectively. All three circular queues reside in the addressable space. Each queue has two pointers, head and tail, which indicate the positions of the queue head and queue tail. In the figure, A denotes a return, at which point the CPU can decide whether to continue initiating read requests or to start processing data. No other operation may be inserted before the return; a new operation can be initiated only after it.
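A circular queue with explicit head and tail pointers can be sketched as follows. The empty test (head equal to tail) comes from the text. Leaving one slot unused so that a full queue is distinguishable from an empty one, and starting with every id in the free entry queue, are assumptions of this sketch, as are the names `RingQueue` and `make_queues`.

```python
class RingQueue:
    """Circular queue of ids; empty when head == tail."""

    def __init__(self, slots):
        self.buf = [None] * slots
        self.head = 0   # index of the next id to take
        self.tail = 0   # index of the next free slot

    def is_empty(self):
        return self.head == self.tail

    def is_full(self):
        # One slot stays unused so that full and empty differ (assumption).
        return (self.tail + 1) % len(self.buf) == self.head

    def push(self, item):
        if self.is_full():
            raise OverflowError("queue full")
        self.buf[self.tail] = item
        self.tail = (self.tail + 1) % len(self.buf)

    def pop(self):
        if self.is_empty():
            raise IndexError("queue empty")
        item = self.buf[self.head]
        self.head = (self.head + 1) % len(self.buf)
        return item


def make_queues(num_entries):
    """Create the three queues; initially every Read table id is idle."""
    slots = num_entries + 1
    free_q, new_q, done_q = RingQueue(slots), RingQueue(slots), RingQueue(slots)
    for entry_id in range(num_entries):
        free_q.push(entry_id)
    return free_q, new_q, done_q
```

Because the total number of ids is fixed, sizing each queue at one slot more than the number of Read table entries means no queue can overflow in this model.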
The process by which the CPU initiates a read request using the memory access accelerator is as follows:
1. When the CPU needs to read data from memory, it first checks whether the free entry queue is empty. If it is empty, the Read table in Fig. 4 is fully occupied and no idle Read table entry is available for the moment. If it is not empty, the Read table still has idle entries available. As shown in Fig. 6, the free entry queue is empty when pointer head1 coincides with pointer tail1.
2. The CPU takes an id from the head of the free entry queue, finds the address of the Read table entry corresponding to this id, and fills the type and addr fields of the new request into that entry. The CPU also writes this id to the tail of the new read queue.
The CPU's operation on the circular queues is shown by dotted line 1 in Fig. 6. After this operation completes, id3 has moved from the head1 position to the tail2 position.
3. The CPU advances the tail2 pointer by one position and sends the new tail2 pointer to the memory access accelerator.
4. The memory access accelerator determines whether the new read queue is empty by comparing the head2 and tail2 pointers. When it detects that the new read queue is not empty, it automatically takes a new id from the head of the new read queue, uses the id to find the corresponding outstanding read request entry in the Read table, processes it, and returns the requested data into the data field of the Read table entry. When processing is complete, it writes the id to the tail of the finished read queue.
The memory access accelerator's operation on the circular queues is shown by dotted line 2 in Fig. 6. After this operation completes, id9 has moved from the head2 position to the tail3 position.
5. When the CPU needs to process data, it first checks whether the finished read queue is empty, again by comparing the head and tail pointers. If the finished read queue is not empty, the CPU takes an id, finds the Read table entry corresponding to the id, and processes that entry's data field. When processing is complete, it writes the id to the tail of the free entry queue.
The CPU's operation on the circular queues is shown by dotted line 3 in Fig. 6. After this operation completes, id2 has moved from the head3 position to the tail1 position.
6. The above process can be repeated.
Handling write requests is relatively simple. Because a write request returns no data, the memory access accelerator only needs to pass the write requests from the CPU on to the memory controller, so the management structure can be greatly simplified.
Fig. 10 shows how the invention manages write requests with one circular queue. A single write circular queue (write queue) is enough to manage write requests. The type, addr, and data fields of a write request are placed directly into the queue. Like the three queues above, the write queue must also reside in the addressable space.
The write circular queue (write queue) is used as follows:
When the CPU needs to issue a new write request, it first checks whether the write queue is full. (Before issuing a write request, the CPU must first determine whether the write queue has space to buffer the data. A full queue means there is no space left in the RAM, and no further write request can be issued.) If the queue is not full, the CPU fills the type, address, and write data of the write request into the position indicated by tail4.
When the memory access accelerator detects that the write queue is not empty (the queue holds data, meaning some write requests are still outstanding, and the accelerator automatically takes them out and executes them), it automatically reads the type, addr, and data of the write request at the head4 pointer and sends the write request to the memory controller.
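The write path can be modelled with a single circular queue whose slots hold the type, addr, and data of each write request directly. The pointer names `head4` and `tail4` follow the text. The class name `WriteQueue`, the full test (one slot left unused), and the dictionary standing in for main memory are assumptions of this sketch.

```python
class WriteQueue:
    """Single circular queue holding whole write requests."""

    def __init__(self, slots):
        self.buf = [None] * slots
        self.head4 = 0   # next request for the accelerator to send
        self.tail4 = 0   # next slot for the CPU to fill

    def is_empty(self):
        return self.head4 == self.tail4

    def is_full(self):
        # Assumed convention: one slot is kept unused to tell full from empty.
        return (self.tail4 + 1) % len(self.buf) == self.head4

    def cpu_write(self, req_type, addr, data):
        """CPU side: enqueue a write request, or report False if the queue is full."""
        if self.is_full():
            return False
        self.buf[self.tail4] = (req_type, addr, data)
        self.tail4 = (self.tail4 + 1) % len(self.buf)
        return True

    def accel_drain(self, memory):
        """Accelerator side: send every pending write; return how many were sent."""
        sent = 0
        while not self.is_empty():
            _req_type, addr, data = self.buf[self.head4]
            memory[addr] = data          # stands in for issuing to the memory controller
            self.head4 = (self.head4 + 1) % len(self.buf)
            sent += 1
        return sent
```

No completion queue is needed here because, as the text notes, a write request returns no data to the CPU.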
In summary, the invention discloses a high-concurrency memory access accelerator based on on-chip RAM. The memory access accelerator is independent of the on-chip Cache and MSHR and is connected to the on-chip RAM and the memory controller. Outstanding memory access requests are sent through the memory access accelerator to the memory controller and on to the memory system.
The number of outstanding memory access requests that the memory access accelerator supports depends only on the capacity of the on-chip RAM and is not limited by the number of MSHR entries. The on-chip RAM is a RAM with an addressable space. The on-chip CPU writes memory access requests into the addressable space, and the memory access accelerator reads and executes the requests. For a read request, after the data returns from the memory system, the memory access accelerator places the data into the addressable space and notifies the CPU, and the CPU then processes the data.
The on-chip RAM is either the RAM of the on-chip CPU or independent of the on-chip CPU.
The addressable space contains a read request table for storing the information of read requests, and each entry of the read request table corresponds to a fixed id.
Each entry of the read request table has three fields that store the type, address, and data of the read request. The type field and address field are filled in by the CPU, and the data field is filled in by the memory access accelerator.
Each entry of the read request table is in one of three states: idle, new read request, and completed read request. The initial state is idle. When the CPU has a memory access request, it fills the request into the entry and the state becomes new read request. The memory access accelerator sends the request to the memory controller and, after the data returns, fills it into the data field, and the state becomes completed read request. The CPU takes the data from the data field and processes it, and the state returns to idle once processing is complete.
The three states are managed by three circular queues, and each circular queue includes position pointers for the queue head and queue tail.
The invention also discloses a high-concurrency access method based on on-chip RAM, which includes providing a memory access accelerator independent of the on-chip Cache and MSHR. The memory access accelerator is connected to the on-chip RAM and the memory controller, and outstanding memory access requests are sent through the memory access accelerator to the memory controller and on to the memory system.
In the high-concurrency access method based on on-chip RAM, the on-chip CPU writes memory access requests into the addressable space of the on-chip RAM, and the memory access accelerator reads and executes the requests. For a read request, after the data returns from the memory system, the memory access accelerator places the data into the space and notifies the CPU, and the CPU processes the data.
The invention also discloses a high-concurrency access method based on on-chip RAM, characterized in that it includes the steps by which the CPU initiates read requests:
Step S701, the CPU queries the state of the idle queue in the addressable space of the on-chip RAM and determines whether the idle queue is empty. If it is empty, return; if it is not empty, go to step S702. The idle queue is empty when its head pointer coincides with its tail pointer.
Step S702, the CPU takes an id from the head of the idle queue;
Step S703, the CPU fills in the type field and address field of the read request table entry corresponding to the id;
Step S704, the CPU writes the id to the tail of the new read request queue;
Step S705, the CPU passes the updated tail pointer of the new read request queue to the memory access accelerator;
Step S706, the CPU determines whether to continue initiating read requests. If so, go to step S701; if not, return.
The invention also discloses a high-concurrency memory access method based on on-chip RAM, characterized in that it comprises the following steps by which the CPU processes the data returned for read requests:
Step S801: the CPU queries the state of the completed queue and determines whether the completed queue is empty; if it is empty, return; if it is not empty, go to step S802. The completed queue is treated as empty when its head pointer coincides with its tail pointer.
Step S802: the CPU takes an id from the head of the completed queue;
Step S803: the CPU operates on the data field of the read request table entry corresponding to that id;
Step S804: the CPU writes the id to the tail of the idle queue;
Step S805: the CPU determines whether to continue; if so, go to step S801; if not, return.
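Steps S801 to S805 can be modelled in the same style. The sketch below assumes a pre-filled table in which two of three requests have already completed, in a different order from the one in which they were issued; `drain_completions`, the `handle` callback and the sample values are illustrative names and data, not part of the patent.

```python
from collections import deque

read_table = {
    0: {"type": "read8", "addr": 0x1000, "data": 11},
    1: {"type": "read8", "addr": 0x2000, "data": None},  # still outstanding
    2: {"type": "read8", "addr": 0x3000, "data": 33},
}
done_q = deque([2, 0])  # completions arrived out of order
idle_q = deque()

def drain_completions(handle, limit=None):
    """Process completed entries; return how many were handled."""
    processed = 0
    while limit is None or processed < limit:  # S805: continue or return
        if not done_q:                         # S801: completed queue empty
            break
        entry_id = done_q.popleft()            # S802: id from the queue head
        entry = read_table[entry_id]
        handle(entry["addr"], entry["data"])   # S803: use the data field
        idle_q.append(entry_id)                # S804: entry becomes idle again
        processed += 1
    return processed

results = []
count = drain_completions(lambda addr, data: results.append((addr, data)))
```

The CPU only ever touches entries whose ids appear in the completed queue, so the still-outstanding entry 1 is never inspected.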
The invention also discloses a high-concurrency memory access method based on on-chip RAM, characterized in that it comprises the following steps by which the memory access accelerator processes read requests:
Step S901: the memory access accelerator continuously checks whether the new read request queue is empty; if it is not empty, go to step S902; if it is empty, keep checking in this step;
Step S902: the memory access accelerator takes an id from the head of the new read request queue;
Step S903: the memory access accelerator reads the type field and the address field of the read request table entry corresponding to that id;
Step S904: the memory access accelerator fetches the data from memory and writes it into the data field of the read request table entry corresponding to that id;
Step S905: the memory access accelerator writes the id to the tail of the completed queue.
In the high-concurrency memory access method based on on-chip RAM, the idle queue is judged to be empty when the queue's head pointer coincides with its tail pointer.
The memory access accelerator can issue a large number of concurrent read requests out of order and have them return out of order.
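The accelerator side, including out-of-order return, can be sketched as a small cycle-stepped model. Everything concrete here is an assumption made for illustration: the per-address `latency` table, the `accelerator_step` function, the 10-cycle step and the use of a heap to track requests in flight. The patent itself describes a hardware accelerator, not this software loop.

```python
import heapq
from collections import deque

memory = {0x10: "a", 0x20: "b", 0x30: "c"}
latency = {0x10: 30, 0x20: 10, 0x30: 20}  # hypothetical latencies in cycles

read_table = {i: {"type": "read", "addr": addr, "data": None}
              for i, addr in enumerate([0x10, 0x20, 0x30])}
new_q = deque([0, 1, 2])  # ids queued by the CPU, in issue order
done_q = deque()

def accelerator_step(now, in_flight):
    """One polling pass of the accelerator at time `now`."""
    while new_q:                                # S901: queue is not empty
        entry_id = new_q.popleft()              # S902: id from the queue head
        addr = read_table[entry_id]["addr"]     # S903: read the address field
        heapq.heappush(in_flight, (now + latency[addr], entry_id))
    while in_flight and in_flight[0][0] <= now:
        _, entry_id = heapq.heappop(in_flight)
        addr = read_table[entry_id]["addr"]
        read_table[entry_id]["data"] = memory[addr]  # S904: fill data field
        done_q.append(entry_id)                      # S905: id to queue tail

in_flight = []
for cycle in range(0, 40, 10):
    accelerator_step(cycle, in_flight)
```

Requests were issued as 0, 1, 2 but complete as 1, 2, 0, because each id is appended to the completed queue as soon as its own data arrives.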
The invention also discloses a high-concurrency memory access method based on on-chip RAM, comprising:
Step 1: when the CPU initiates a write request, it first checks whether the write circular queue is full; if it is not full, it fills in the type, address and write data of the write request;
Step 2: when the memory access accelerator detects that the write circular queue is not empty, it automatically reads the type, address and data of the write request at the head pointer of the write circular queue;
Step 3: the memory access accelerator sends the write request to the memory controller.
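The write path above can be sketched as a single circular queue with a full check on the CPU side and an empty check on the accelerator side. The class name `WriteRing`, the four-slot size, the convention of leaving one slot unused to tell "full" from "empty", and the dictionary standing in for memory are all assumptions made for this example.

```python
class WriteRing:
    """Write circular queue: the CPU advances tail, the accelerator head."""
    def __init__(self, slots):
        self.buf = [None] * slots
        self.head = 0
        self.tail = 0

    def is_full(self):
        # Assumption: one slot stays unused so full and empty differ.
        return (self.tail + 1) % len(self.buf) == self.head

    def is_empty(self):
        return self.head == self.tail

    def cpu_write(self, req_type, addr, data):
        if self.is_full():                        # Step 1: check for full
            return False
        self.buf[self.tail] = (req_type, addr, data)
        self.tail = (self.tail + 1) % len(self.buf)
        return True

    def accel_drain(self, memory_controller):
        sent = 0
        while not self.is_empty():                # Step 2: queue not empty
            request = self.buf[self.head]         # read at the head pointer
            self.head = (self.head + 1) % len(self.buf)
            memory_controller(*request)           # Step 3: send to controller
            sent += 1
        return sent

dram = {}  # stands in for main memory behind the memory controller

def memory_controller(req_type, addr, data):
    dram[addr] = data

ring = WriteRing(slots=4)  # holds at most 3 outstanding writes
accepted = [ring.cpu_write("write8", 0x100 + 8 * i, i) for i in range(5)]
sent = ring.accel_drain(memory_controller)
```

Unlike reads, a write needs no completed queue: once the accelerator has forwarded the request, the slot is free again.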
The invention also discloses a processor that uses the above memory access method and memory access accelerator.
The present invention has the following features:
1. Flexible memory access granularity: the granularity is encoded in the type field, so it is not limited by the instruction set or by the Cache line size. Every piece of data accessed is data the software actually needs, which improves the effective utilization of main memory bandwidth.
2. Advanced memory access functions can be implemented: by specifying the access type in the type field and having the memory access accelerator parse and execute it, advanced operations such as scatter/gather and linked-list reads and writes can be implemented.
3. The type field can carry upper-layer information such as thread number and priority, so that the memory access accelerator can perform more advanced QoS scheduling.
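One way to picture a type field that carries granularity, operation kind, thread number and priority is as a packed 32-bit word. The patent does not give a bit layout, so the field widths, the operation codes `OP_READ`, `OP_GATHER` and `OP_LIST_WALK`, and the functions `encode_type` and `decode_type` below are purely hypothetical.

```python
# Hypothetical operation codes; the patent names the operations, not the codes.
OP_READ, OP_GATHER, OP_LIST_WALK = 0, 1, 2

def encode_type(op, size_bytes, thread_id, priority):
    """Pack a request description into an assumed 32-bit type word.

    Assumed layout: bits 0-3 op, bits 4-19 size minus one,
    bits 20-27 thread id, bits 28-31 priority.
    """
    assert 0 <= op < 16 and 1 <= size_bytes <= 65536
    assert 0 <= thread_id < 256 and 0 <= priority < 16
    return (priority << 28) | (thread_id << 20) | ((size_bytes - 1) << 4) | op

def decode_type(word):
    """Unpack the assumed 32-bit type word into its fields."""
    return {
        "op": word & 0xF,
        "size_bytes": ((word >> 4) & 0xFFFF) + 1,
        "thread_id": (word >> 20) & 0xFF,
        "priority": (word >> 28) & 0xF,
    }

word = encode_type(OP_GATHER, size_bytes=4, thread_id=7, priority=2)
fields = decode_type(word)

# A trivial QoS policy: serve the pending request with the highest priority.
pending = [encode_type(OP_READ, 8, thread_id=1, priority=0), word]
next_up = max(pending, key=lambda w: decode_type(w)["priority"])
```

Because the size lives in the request rather than in the instruction encoding, a 4-byte access does not have to drag a whole Cache line across the memory bus.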
4. The addressable space works best as an accelerator when implemented in SRAM. In this design, the CPU and the memory access accelerator must read and write the Read table, the queues and the queue pointers several times to complete one request, so the addressable space must be fast enough to read and write for any acceleration to result. SRAM is much faster to access than DRAM and is well suited to this use.
Technical effects of the present invention:
1. One or more read request tables (Read tables) are kept in the on-chip RAM. Each entry of a read request table holds all the information a read request needs, such as the request type field, the target address field and the data field. Because the invention records all information about concurrent requests in the on-chip RAM, the number of concurrent requests is limited only by the size of the on-chip RAM.
2. The entries of the read request table are divided into three classes by request state: idle, new request, and completed. The entry addresses of each class are stored in their own circular queue, which makes the state of read requests easy to manage. The invention uses circular queues to manage the information of a large number of read and write requests, which avoids polling the status bits of many requests and greatly reduces the number of queries, giving a clear speed-up for large numbers of concurrent, unrelated, fine-grained memory access requests.
3. Whether to initiate a memory access is decided by whether the "new request" circular queue is non-empty, and reading the contents of that queue yields the address in on-chip RAM at which the read request is stored. The memory access accelerator can therefore initiate memory accesses out of order on its own, without software control, which supports out-of-order execution and out-of-order return of memory access requests and makes it convenient to schedule large numbers of requests in a targeted way.
4. CPU software determines whether any read request has completed by checking whether the "completed" circular queue is non-empty. By reading the contents of that queue, the CPU obtains the address in on-chip RAM of the returned data, which avoids having the CPU poll the status bits of many requests and improves the efficiency of software queries.
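The saving from replacing status-bit polling with a completed queue can be illustrated by counting how many locations software has to inspect in each scheme. The table size of 1024, the three completed ids and both helper functions are made up for this example; it counts inspected entries only and is not a performance measurement.

```python
def poll_status_bits(status):
    """Scan every entry; return (completed ids, entries inspected)."""
    done = [i for i, s in enumerate(status) if s == "done"]
    return done, len(status)

def read_done_queue(ring, head, tail):
    """Read only the ids between head and tail; return (ids, slots read)."""
    ids = []
    while head != tail:
        ids.append(ring[head])
        head = (head + 1) % len(ring)
    return ids, len(ids)

N = 1024                      # hypothetical number of table entries
status = ["pending"] * N      # per-entry status bits for the polling scheme
done_ring = [None] * (N + 1)  # completed queue for the queue scheme
tail = 0
for entry_id in (900, 17, 512):   # three requests complete, out of order
    status[entry_id] = "done"
    done_ring[tail] = entry_id
    tail += 1

polled, polled_cost = poll_status_bits(status)
queued, queued_cost = read_done_queue(done_ring, 0, tail)
```

Polling inspects every one of the 1024 entries to find three completions, while the queue scheme reads exactly three slots after one head/tail comparison, and it also preserves the order in which the completions arrived.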

Claims (14)

1. A high-concurrency memory access accelerator based on on-chip RAM, characterized in that the memory access accelerator is independent of the on-chip Cache and the MSHR and is connected to the on-chip RAM and to the memory controller; outstanding memory access requests are sent through the memory access accelerator to the memory controller and on to the memory system; wherein the number of outstanding memory access requests that the memory access accelerator supports depends only on the capacity of the on-chip RAM and is not limited by the number of MSHR entries; and the addressable space of the on-chip RAM contains a read request table for storing information about read requests, each entry of the read request table corresponding to a fixed id number.
2. The high-concurrency memory access accelerator based on on-chip RAM as claimed in claim 1, characterized in that each entry of the read request table has three fields for storing the type, address and data of the read request, wherein the type field and the address field are filled in by the CPU and the data field is filled in by the memory access accelerator.
3. The high-concurrency memory access accelerator based on on-chip RAM as claimed in claim 2, characterized in that, when the data field of the read request table would be too large, it may store only a data pointer that points to the storage address of the returned data, the storage address of the returned data being allocated by the CPU.
4. The high-concurrency memory access accelerator based on on-chip RAM as claimed in claim 1, characterized in that each entry of the read request table is in one of three states: idle, new read request, or completed read request; the initial state is idle; when the CPU has a memory access request, it fills in the request and the state becomes new read request; the memory access accelerator sends the request to the memory controller and, after the data has returned, writes the data into the data field, and the state becomes completed read request; the CPU fetches the data from the data field and processes it, and after processing is complete the state returns to idle.
5. The high-concurrency memory access accelerator based on on-chip RAM as claimed in claim 4, characterized in that each circular queue includes a head pointer and a tail pointer; the head pointer and tail pointer of the idle queue and the head pointer of the completed queue are software variables maintained by the CPU; the head pointer and tail pointer of the new read request queue and the tail pointer of the completed queue are hardware registers; the head pointer of the new read request queue is maintained by the memory access accelerator; the tail pointer of the new read request queue is maintained jointly by the CPU and the memory access accelerator, the CPU only writing it and the memory access accelerator only reading it; and the tail pointer of the completed queue is maintained jointly by the CPU and the memory access accelerator, the CPU only reading it and the memory access accelerator only writing it.
6. A high-concurrency memory access method based on on-chip RAM, characterized in that it comprises providing a memory access accelerator that is independent of the on-chip Cache and the MSHR, the memory access accelerator being connected to the on-chip RAM and to the memory controller, and outstanding memory access requests being sent through the memory access accelerator to the memory controller and on to the memory system; wherein the on-chip CPU writes memory access requests into the addressable space of the on-chip RAM and the memory access accelerator reads the requests and executes them; for a read request, after the data has returned from the memory system, the memory access accelerator places the data into that space and notifies the CPU, and the CPU processes the data; and the addressable space contains a read request table for storing information about read requests, each entry of the read request table corresponding to a fixed id number.
7. The high-concurrency memory access method based on on-chip RAM as claimed in claim 6, characterized in that each entry of the read request table has three fields for storing the type, address and data of the read request, wherein the type field and the address field are filled in by the CPU and the data field is filled in by the memory access accelerator.
8. The high-concurrency memory access method based on on-chip RAM as claimed in claim 7, characterized in that, when the data field of the read request table would be too large, it may store only a data pointer that points to the storage address of the returned data, the storage address of the returned data being allocated by the CPU.
9. The high-concurrency memory access method based on on-chip RAM as claimed in claim 6, characterized in that each entry of the read request table is in one of three states: idle, new read request, or completed read request; the initial state is idle; when the CPU has a memory access request, it fills in the request and the state becomes new read request; the memory access accelerator sends the request to the memory controller and, after the data has returned, writes the data into the data field, and the state becomes completed read request; the CPU fetches the data from the data field and processes it, and after processing is complete the state returns to idle.
10. The high-concurrency memory access method based on on-chip RAM as claimed in claim 6, characterized in that it further comprises the following steps by which the CPU initiates read requests:
Step S701: the CPU queries the state of the idle queue in the addressable space of the on-chip RAM and determines whether the idle queue is empty, the CPU treating the idle queue as empty when its head pointer coincides with its tail pointer; if it is empty, return; if it is not empty, go to step S702;
Step S702: the CPU takes an id from the head of the idle queue;
Step S703: the CPU fills in the type field and the address field of the read request table entry corresponding to that id;
Step S704: the CPU writes the id to the tail of the new read request queue;
Step S705: the CPU passes the updated tail pointer of the new read request queue to the memory access accelerator;
Step S706: the CPU determines whether to continue initiating read requests; if so, go to step S701; if not, return.
11. The high-concurrency memory access method based on on-chip RAM as claimed in claim 6, characterized in that it further comprises the following steps by which the CPU processes the data returned for read requests:
Step S801: the CPU queries the state of the completed queue and determines whether the completed queue is empty, the CPU treating the completed queue as empty when its head pointer coincides with its tail pointer; if it is empty, return; if it is not empty, go to step S802;
Step S802: the CPU takes an id from the head of the completed queue;
Step S803: the CPU operates on the data field of the read request table entry corresponding to that id;
Step S804: the CPU writes the id to the tail of the idle queue;
Step S805: the CPU determines whether to continue; if so, go to step S801; if not, return.
12. The high-concurrency memory access method based on on-chip RAM as claimed in claim 6, characterized in that it further comprises the following steps by which the memory access accelerator processes read requests:
Step S901: the memory access accelerator continuously checks whether the new read request queue is empty; if it is not empty, go to step S902; if it is empty, keep checking in this step;
Step S902: the memory access accelerator takes an id from the head of the new read request queue;
Step S903: the memory access accelerator reads the type field and the address field of the read request table entry corresponding to that id;
Step S904: the memory access accelerator fetches the data from memory and writes it into the data field of the read request table entry corresponding to that id;
Step S905: the memory access accelerator writes the id to the tail of the completed queue.
13. The high-concurrency memory access method based on on-chip RAM as claimed in claim 6, characterized in that it further comprises:
Step 1: when the CPU initiates a write request, it first checks whether the write circular queue is full; if it is not full, it fills in the type, address and write data of the write request;
Step 2: when the memory access accelerator detects that the write circular queue is not empty, it automatically reads the type, address and data of the write request at the head pointer of the write circular queue;
Step 3: the memory access accelerator sends the write request to the memory controller.
14. A processor using the accelerator of any one of claims 1-5.
CN201310242398.5A 2013-06-19 2013-06-19 High concurrent memory access accelerated method, accelerator and CPU based on RAM on piece Active CN103345429B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201310242398.5A CN103345429B (en) 2013-06-19 2013-06-19 High concurrent memory access accelerated method, accelerator and CPU based on RAM on piece

Publications (2)

Publication Number Publication Date
CN103345429A CN103345429A (en) 2013-10-09
CN103345429B true CN103345429B (en) 2018-03-30

Family

ID=49280227

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201310242398.5A Active CN103345429B (en) 2013-06-19 2013-06-19 High concurrent memory access accelerated method, accelerator and CPU based on RAM on piece

Country Status (1)

Country Link
CN (1) CN103345429B (en)

Families Citing this family (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104252416B (en) * 2013-06-28 2017-09-05 华为技术有限公司 A kind of accelerator and data processing method
CN105988952B (en) * 2015-02-28 2019-03-08 华为技术有限公司 The method and apparatus for distributing hardware-accelerated instruction for Memory Controller Hub
CN105354153B (en) * 2015-11-23 2018-04-06 浙江大学城市学院 A kind of implementation method of close coupling heterogeneous multi-processor data exchange caching
CN109582600B (en) * 2017-09-25 2020-12-01 华为技术有限公司 Data processing method and device
CN109086228B (en) * 2018-06-26 2022-03-29 深圳市安信智控科技有限公司 High speed memory chip with multiple independent access channels
CN110688238B (en) * 2019-09-09 2021-05-07 无锡江南计算技术研究所 Method and device for realizing queue of separated storage
CN115292236B (en) * 2022-09-30 2022-12-23 山东华翼微电子技术股份有限公司 Multi-core acceleration method and device based on high-speed interface

Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102073596A (en) * 2011-01-14 2011-05-25 东南大学 Method for managing reconfigurable on-chip unified memory aiming at instructions

Family Cites Families (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US5813031A (en) * 1994-09-21 1998-09-22 Industrial Technology Research Institute Caching tag for a large scale cache computer memory system
US20040107240A1 (en) * 2002-12-02 2004-06-03 Globespan Virata Incorporated Method and system for intertask messaging between multiple processors
CN100517273C (en) * 2003-12-22 2009-07-22 松下电器产业株式会社 Cache memory and its controlling method
US7467277B2 (en) * 2006-02-07 2008-12-16 International Business Machines Corporation Memory controller operating in a system with a variable system clock
US7809895B2 (en) * 2007-03-09 2010-10-05 Oracle America, Inc. Low overhead access to shared on-chip hardware accelerator with memory-based interfaces
CN101221538B (en) * 2008-01-24 2010-10-13 杭州华三通信技术有限公司 System and method for implementing fast data search in caching
US9772958B2 (en) * 2011-10-31 2017-09-26 Hewlett Packard Enterprise Development Lp Methods and apparatus to control generation of memory access requests

Also Published As

Publication number Publication date
CN103345429A (en) 2013-10-09

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant