Background Art
The gap between the growth rates of processor speed and main-memory speed is a prominent contradiction for multi-core processors, and multi-level caches must therefore be used to alleviate it. Multi-core processors currently exist that share a level-1 cache, that share a level-2 cache, and that share only main memory. Usually a multi-core processor adopts the shared level-2 cache structure, in which each processor core has a private level-1 cache and all processor cores share the level-2 cache. The architectural design of the cache itself also bears directly on overall system performance. Within the multi-core structure, whether a shared or an exclusive cache is preferable, how many cache levels should be built on the die, and how much cache to provide all have a great influence on the size, power consumption, layout, performance, and operating efficiency of the whole chip, and are therefore problems that need careful study and inquiry. On the other hand, multi-level caching in turn raises the consistency problem, and whichever cache consistency model and mechanism is adopted will have a material impact on the overall performance of the multi-core processor. The cache consistency models widely adopted in traditional multiprocessor architectures include the sequential consistency model, the weak consistency model, and the release consistency model; the associated cache consistency mechanisms are mainly bus snooping protocols and directory-based protocols. Most present multi-core processor systems adopt bus-based snooping protocols.
The programs executed by the various cores of a multi-core processor sometimes need to share data and synchronize, so the hardware structure must support inter-core communication. An efficient communication mechanism is an important guarantee of high multi-core processor performance. Two on-chip communication mechanisms are currently mainstream: one based on a bus-shared cache structure and one based on an on-chip interconnect structure. In the bus-shared cache structure, the processor cores share a level-2 or level-3 cache used to hold frequently used data and communicate over the bus connecting the cores; the advantages of this scheme are structural simplicity and high communication speed, while its disadvantage is the poor scalability of bus-based structures. In the on-chip interconnect structure, each processor core has an independent processing unit and cache, the cores are linked together by a crossbar switch, a network-on-chip, or similar means, and the cores communicate with one another by message passing. The advantages of this structure are good scalability and guaranteed data bandwidth; its disadvantages are hardware complexity and larger software changes. Perhaps the outcome of the competition between the two is not mutual replacement but cooperation, for example adopting a network-on-chip at global scope while adopting a bus locally, striking a balance between performance and complexity.
In a conventional microprocessor, cache misses and memory-access events both negatively affect the execution efficiency of the processor, and the working efficiency of the bus interface unit (BIU) largely determines the size of this effect. When several processor cores request access to main memory at the same time, or cache misses occur in the private caches of several cores at the same time, the efficiency of the BIU's arbitration mechanism among these multiple access requests and of its mechanism for external memory access determines the overall performance of the multi-core system. Finding an efficient multi-port BIU structure that converts the individual main-memory accesses of the many cores into more efficient burst accesses, finding the burst-access word count that optimizes whole-processor efficiency, and finding an efficient arbitration mechanism for multi-port BIU access will therefore be important topics of multi-core processor research.
In present multi-core processor systems, whether the cache is a level-2 or a level-3 cache and whether it is shared or private, the algorithms for reading and replacing cached data suffer from technical problems such as high algorithmic complexity and long hit latency.
In addition, in the existing level-2 cache schemes, data shared in the level-2 cache usually also has backup copies in the private level-1 caches. Consequently, after cached data in a level-1 cache is changed by a different processor core, the false-sharing problem of the caches appears, data must frequently be reloaded, access latency increases, and system performance degrades.
Summary of the Invention
To solve the technical problems of the cache read and replacement algorithms in existing multi-core processor systems, namely high algorithmic complexity, long hit latency, and false sharing, the present invention provides a method of caching data in a multi-core processor, wherein the multi-core processor comprises a plurality of processor cores, a plurality of dedicated caches coupled one-to-one with the plurality of processor cores, and one general cache coupled to each of the plurality of processor cores, the method comprising:
receiving an instruction to execute a plurality of threads concurrently;

assigning each of the plurality of threads independently to the plurality of processor cores, wherein each of the plurality of processor cores is assigned at most one thread;

for each processor core that has been assigned a thread, in response to a cache request during execution of the thread, storing the data to be cached into the dedicated cache coupled to that core;

when the same cached data is stored in no fewer than a threshold number t of the dedicated caches, storing that same cached data into the general cache.
Preferably, after the same cached data is stored into the general cache, the same cached data is cleared from the dedicated caches that store it, and the storage space the same cached data occupied in those dedicated caches is released.
Preferably, when any of the plurality of processor cores needs to read cached data, the cached data is read from the general cache or from a dedicated cache by querying a cache mapping table.
Preferably, t = s, or t = ⌈s/2⌉, or t = 2, where s is the total number of processor cores in the activated state and ⌈·⌉ denotes rounding up.
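The three threshold choices above can be stated compactly. A minimal sketch in Python; the function name and the strategy labels are illustrative, not taken from the patent:

```python
import math

def promotion_threshold(s, strategy="half"):
    """Threshold t of dedicated caches that must hold the same data
    before it is promoted to the general cache; s is the number of
    processor cores in the activated state."""
    if strategy == "all":    # t = s: every active core must hold the data
        return s
    if strategy == "half":   # t = ceil(s/2): a majority of active cores
        return math.ceil(s / 2)
    if strategy == "pair":   # t = 2: any two cores sharing the data
        return 2
    raise ValueError("unknown strategy")
```

The "half" strategy promotes once a majority of active cores hold the data, trading dedicated-cache space against general-cache traffic.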
The invention also discloses a method of multi-level caching of data in a multi-core processor, the multi-core processor comprising 2^n processor cores and n+1 cache levels, wherein the level-m cache comprises 2^(n+1-m) cache memories; wherein the i-th cache memory of the level-1 cache is used only to store the data to be cached by the thread executed by the i-th processor core, and the j-th cache memory of the level-s cache is used only to store cached data held in common by the (2j-1)-th and (2j)-th cache memories of the level-(s-1) cache, where n is an integer greater than 1, 1 <= m <= n+1, 2 <= s <= n+1, 1 <= i <= 2^n, and 1 <= j <= 2^(n+1-s). The method comprises:
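The geometry of this hierarchy is a binary reduction tree: level m has 2^(n+1-m) memories, and memory j of level s merges memories 2j-1 and 2j of level s-1. A small sketch of that arithmetic; the helper names are illustrative assumptions:

```python
def memories_at_level(n, m):
    """Number of cache memories at level m of an (n+1)-level hierarchy
    serving 2**n processor cores (1 <= m <= n + 1)."""
    assert 1 <= m <= n + 1
    return 2 ** (n + 1 - m)

def children_of(j):
    """Indices of the two level-(s-1) memories merged into
    memory j of the level-s cache."""
    return (2 * j - 1, 2 * j)
```

For n = 2 this yields the 4 / 2 / 1 layout used in the worked example later in the description.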
in response to an instruction to concurrently execute a plurality of threads of the same process, assigning each thread to a different idle processor core and simultaneously activating the idle processor cores that have been assigned threads, so that the cores assigned threads change from the idle state to the busy state;
after the i-th processor core is activated, in response to a cache instruction, first checking whether the data to be cached is already stored in the ⌈i/k⌉-th cache memory of the level-p cache, and if so, sending a confirmation message indicating caching success, where 1 <= i <= 2^n, 2 <= p <= n+1, k = 2^(p-1), and ⌈·⌉ denotes rounding up;
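The check above locates the single cache memory on core i's path at each level. Assuming the ceiling form ⌈i/2^(p-1)⌉, which is consistent with the 1-based indexing used throughout, the index arithmetic can be sketched as follows (the helper name is hypothetical):

```python
import math

def path_memory(i, p):
    """Index of the level-p cache memory on the path of core i
    (1-based; p = 1 gives core i's own level-1 memory)."""
    k = 2 ** (p - 1)
    return math.ceil(i / k)
```

Cores 1 and 2 thus share memory 1 at level 2, cores 3 and 4 share memory 2, and so on up the tree.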
if the data to be cached is not already stored there, storing it into the i-th cache memory of the level-1 cache; then judging level by level from the level-1 cache to the level-n cache: if the (2⌈r/2⌉-1)-th and (2⌈r/2⌉)-th cache memories of the level-t cache both store the data to be cached, dumping the data to be cached into the ⌈r/2⌉-th cache memory of the level-(t+1) cache, clearing the data to be cached from the (2⌈r/2⌉-1)-th and (2⌈r/2⌉)-th cache memories of the level-t cache, and releasing the storage space the data to be cached occupied in those two cache memories, where r is the index, at level t, of the cache memory holding the newly stored copy, 1 <= t <= n, 1 <= r <= 2^(n+1-t), and ⌈·⌉ denotes rounding up; and

sending a confirmation message indicating caching success.
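The level-by-level judgment claimed here (promote the data one level whenever both sibling memories of a pair hold it) can be sketched behaviorally. The dict-of-sets model and function name below are illustrative assumptions, not the patented hardware:

```python
import math

def promote(levels, data, i, n):
    """One level-by-level promotion pass after storing `data` in the
    i-th level-1 memory. `levels[t][r]` is the set of data held by the
    r-th memory of the level-t cache (all indices 1-based)."""
    levels[1][i].add(data)
    r = i
    for t in range(1, n + 1):
        j = math.ceil(r / 2)             # parent memory index at level t+1
        left, right = 2 * j - 1, 2 * j   # sibling pair at level t
        if data in levels[t][left] and data in levels[t][right]:
            levels[t + 1][j].add(data)        # dump upward
            levels[t][left].discard(data)     # clear and release
            levels[t][right].discard(data)
            r = j
        else:
            break
    return levels
```

With n = 2, once all four level-1 memories have received the same data, it ends up in the single level-3 memory with both lower levels cleared.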
The present invention uses a cache mapping table and, on the basis of the executing threads and the data to be cached, reads and replaces data in real time, in particular replacing cached data in real time between the dedicated caches and the general cache and using the multi-level cache to read and replace cached data in real time, thereby achieving lower algorithmic complexity, reducing hit latency, and improving the overall efficiency of a computer system having the multi-core processor.
Embodiments
Fig. 1 shows a block diagram of the multi-core processor involved in the present invention. As shown in Fig. 1, the multi-core processor of the present invention comprises a plurality of processor cores, a plurality of dedicated caches coupled one-to-one with the processor cores, and one general cache coupled to each of the processor cores, wherein each of the dedicated caches is used only to store cached data related to the thread executed by the processor core to which it is coupled, and the one general cache is used to store cached data related to the threads executed by the plurality of processor cores. The multi-core processor also comprises a mapping buffer for storing a cache mapping table. The cache mapping table stores at least the storage relations between cached data and each cache memory (including the dedicated caches and the general cache), the storage relations recording in which cache memories each piece of cached data is stored and with which threads of which processor cores each piece of cached data is associated. The multi-core processor further comprises a cache controller for controlling the plurality of dedicated caches, the general cache, and the mapping buffer, implementing operations on them such as writing, reading, replacement, and query.
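One plausible in-memory shape for such a cache mapping table, recording where each piece of cached data lives and which core and thread pairs it is associated with, is sketched below; the field and function names are assumptions, not from the patent:

```python
from dataclasses import dataclass, field

@dataclass
class MappingEntry:
    # cache memories (dedicated caches or the general cache) holding the data
    locations: set = field(default_factory=set)
    # (core_id, thread_id) pairs associated with the data
    associations: set = field(default_factory=set)

# cache mapping table: data key -> its storage relations
mapping_table = {}

def record_store(key, cache_id, core_id, thread_id):
    """Update the table after the cache controller stores `key`."""
    entry = mapping_table.setdefault(key, MappingEntry())
    entry.locations.add(cache_id)
    entry.associations.add((core_id, thread_id))
```

Every write, promotion, and clearing operation in the description would update such entries so the table always reflects the latest storage relations.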
In the present invention, the total number of the threads of the same process is not greater than the total number of the processor cores, and each of the threads of the same process is assigned to a different processor core, thereby guaranteeing the concurrent execution of the threads of the same process.
Next, the two methods of caching data in a multi-core processor proposed by the present invention are described in detail with reference to the accompanying drawings. The first method comprises: the multi-core processor receives an instruction to concurrently execute a plurality of threads of the same process; it continues with the subsequent steps only if the following conditions are met: the number of idle processor cores is not less than the number of threads of the same process to be executed concurrently, and there exists an assignment that gives each thread a different idle processor core such that the total resources of the idle cores assigned threads are not less than the total processor resources the assigned threads need (that is, the processing power of every idle core assigned a thread can satisfy the needs of the thread assigned to it); otherwise it returns a message indicating that system resources are insufficient. In response to the instruction to concurrently execute the threads of the same process, each of the threads is assigned to an idle one of the processor cores, each thread being assigned to a different idle core such that the processing power of every core assigned a thread fully satisfies the needs of its thread, and at the same time the idle cores that have been assigned threads are activated, changing from the idle state to the busy state.
After an idle processor core in the multi-core processor is activated, it enters the busy state of thread execution and must read and write data as the thread requires, so it needs to cache data, and a core in the busy state therefore frequently sends cache instructions to the cache controller. In response to a cache instruction, the cache controller first queries the cache mapping table to check whether the data to be cached is already stored in the general cache. If it is, the controller updates the cache mapping table and then sends a confirmation message indicating caching success to the processor core that sent the cache instruction, where updating the cache mapping table includes adding the correspondence between the data to be cached, the activated processor core, and the thread executed by the activated processor core. Otherwise the controller stores the data to be cached into the dedicated cache coupled to the processor core that sent the cache instruction, updates the storage relation between the data and that dedicated cache in the cache mapping table, and then performs a query analysis in the cache mapping table with the data to be cached as the query condition. If the query analysis finds that the data to be cached is stored in no fewer than a threshold number t of the dedicated caches, where t = s, or t = ⌈s/2⌉, or t = 2, s being the total number of processor cores currently executing threads and ⌈·⌉ denoting rounding up, then the cache controller stores the data to be cached into the general cache, clears the data from the t dedicated caches, releases the storage space the data occupied in those t dedicated caches, and finally updates the cache mapping table, including adding the storage relation between the data and the general cache and deleting the storage relations between the data and the t dedicated caches. When the above steps end, the cache controller sends a confirmation message indicating caching success to the processor core that sent the cache instruction. The cache mapping table is stored in the mapping buffer.
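The controller flow just described, storing into the requesting core's dedicated cache and promoting to the general cache once the data appears in at least t dedicated caches, can be sketched as a behavioral model; the names and data shapes are assumptions:

```python
def handle_cache_request(dedicated, general, table, key, core_id, t):
    """dedicated: dict core_id -> set of keys; general: set of keys;
    table: dict key -> set of holders (a simplified mapping table)."""
    holders = table.setdefault(key, set())
    if "general" in holders:          # already in the general cache: just confirm
        return "hit-general"
    dedicated[core_id].add(key)       # store into the coupled dedicated cache
    holders.add(core_id)
    if len(holders) >= t:
        general.add(key)              # promote to the general cache
        for c in list(holders):       # clear the dedicated copies, free space
            dedicated[c].discard(key)
        table[key] = {"general"}
        return "promoted"
    return "stored-dedicated"
```

A later request for the same data from any core then hits the general cache directly, which is how the scheme avoids the private-copy false sharing described in the background section.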
In the present invention, after a processor core finishes executing a thread, it first sends a clear instruction to the cache controller. In response to the clear instruction, the cache controller first queries the cache mapping table to check whether the general cache holds cached data associated with the thread that finished. If such data exists, then for each piece of cached data associated with the finished thread the controller continues querying the cache mapping table to check whether more than the threshold number t of threads executed by processor cores are still associated with that data. If no more than t such threads remain, and provided every dedicated cache coupled to a processor core associated with the data has the capacity to store it, the controller dumps the data, according to the records of the cache mapping table, into the dedicated caches coupled to all processor cores associated with it, deletes the data from the general cache, releases the storage space the data occupied in the general cache, and updates the cache mapping table so that it reflects the latest storage relations. Those skilled in the art will know that after a thread finishes, the related data in main memory also needs to be cleared. After the cache controller completes the above steps, it sends a confirmation instruction indicating clearing success to the processor core that sent the clear instruction, and in response to this confirmation instruction that processor core changes from the busy state to the idle state.
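The thread-end path can be modeled in the same style: when no more than t associated cores remain, the data is demoted from the general cache back into the dedicated caches of the cores still associated with it. A sketch under assumed names (capacity checks omitted for brevity):

```python
def handle_thread_end(general, dedicated, assoc, key, finished_core, t):
    """assoc: dict key -> set of core ids whose threads use the data."""
    assoc[key].discard(finished_core)
    remaining = assoc[key]
    if key in general and len(remaining) <= t:
        for c in remaining:       # dump back into the coupled dedicated caches
            dedicated[c].add(key)
        general.discard(key)      # delete from the general cache, free its space
```

This keeps the general cache reserved for data that is actually shared widely enough to justify the threshold t.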
In the present invention, the general cache may reach a full state because cached data is continually written into it, in which case the caching policy needs to be adjusted. To effectively simplify the highly complex algorithms of the prior art, the present invention proposes the following method: when data needs to be written into the general cache, first judge whether the total free storage space of the general cache is not less than the amount of data to be written; if it is, write the data directly into the general cache; if the free space is less than the amount to be written, first dump cached data from the general cache to main memory, then release the corresponding storage space of the general cache, and finally write the data to be written into the general cache.
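The capacity check described above amounts to a simple write path with spill-to-main-memory. A minimal sketch, with capacity accounting simplified to item counts (an assumption; the patent does not specify units):

```python
def write_to_general(general, main_memory, key, capacity):
    """Write `key` into a general cache of the given capacity,
    spilling existing entries to main memory when space runs out."""
    while len(general) >= capacity:   # not enough free space
        victim = general.pop()        # dump some cached data to main memory
        main_memory.add(victim)
    general.add(key)
```

Which entries are chosen as victims is left open here; the description defers the replacement algorithm as a detail known to those skilled in the art.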
Also in response to the clear instruction, the cache controller queries the cache mapping table to check whether the dedicated cache coupled to this processor core holds cached data associated with the finished thread; if so, it clears the cached data found, releases the storage space that cached data occupied, and updates the cache mapping table so that it reflects the latest storage relations.
When cached data needs to be read, the cache mapping table is queried to determine whether to read the cached data from the general cache or from a dedicated cache. For cached data that has been dumped to main memory, if the query in the cache mapping table misses, the cached data can further be read from main memory. The specific read method is not elaborated here.
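The read path then reduces to a mapping-table lookup with a main-memory fallback; a sketch under assumed names:

```python
def read_cached(table, stores, main_memory, key):
    """table: key -> cache name; stores: cache name -> dict key -> value.
    Falls back to main memory when the mapping-table query misses."""
    where = table.get(key)
    if where is not None:
        return stores[where][key]     # general or dedicated cache hit
    return main_memory[key]           # dumped data: read from main memory
```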
The method of multi-level caching of data in a multi-core processor according to the second embodiment of the invention is now described in detail with reference to Figs. 3-4. The multi-core processor comprises 2^n processor cores (where n = 2, 3, 4, 5, 6, 7, 8, or 9) and n+1 cache levels, wherein the level-m cache comprises 2^(n+1-m) cache memories (1 <= m <= n+1). The i-th (1 <= i <= 2^n) cache memory of the level-1 cache is used, and used only, to store the data to be cached by the thread executed by the i-th processor core. The i-th (1 <= i <= 2^(n-1)) cache memory of the level-2 cache is used, and used only, to store cached data held in common by the (2i-1)-th and (2i)-th cache memories of the level-1 cache; further, when the (2i-1)-th and (2i)-th cache memories of the level-1 cache both store the same cached data, the cache controller dumps that same cached data into the i-th cache memory of the level-2 cache, clears the same cached data from the (2i-1)-th and (2i)-th cache memories of the level-1 cache, and releases the storage space the same cached data occupied in those two cache memories. In general, the i-th (1 <= i <= 2^n) cache memory of the level-1 cache is used, and used only, to store the data to be cached by the thread executed by the i-th processor core, and the j-th (1 <= j <= 2^(n+1-s)) cache memory of the level-s (2 <= s <= n+1) cache is used, and used only, to store cached data held in common by the (2j-1)-th and (2j)-th cache memories of the level-(s-1) cache. For example, when n = 2 the multi-core processor comprises 2^2 = 4 processor cores and 3 cache levels, wherein the level-1 cache comprises 2^2 = 4 level-1 cache memories, the level-2 cache comprises 2^1 = 2 level-2 cache memories, and the level-3 cache comprises 2^0 = 1 level-3 cache memory. The multi-core processor also comprises a mapping buffer for storing a cache mapping table; the cache mapping table stores at least the storage relations between cached data and each cache memory (including every cache memory at every level), the storage relations recording in which cache memories each piece of cached data is stored and with which threads of which processor cores each piece of cached data is associated. The multi-core processor further comprises a cache controller for controlling every cache memory at every level and the mapping buffer, implementing operations on them such as writing, reading, replacement, and query. The processor cores, the cache memories at every level, the mapping buffer, and the cache controller are interconnected and communicate over a bus.
After the multi-core processor receives an instruction to concurrently execute a plurality of threads of the same process, it continues with the subsequent steps only if the following conditions are met: the number of idle processor cores is not less than the number of threads of the same process to be executed concurrently, and there exists an assignment that gives each thread a different idle processor core such that the total resources of the idle cores assigned threads are not less than the total processor resources the assigned threads need (that is, the processing power of every idle core assigned a thread can satisfy the needs of the thread assigned to it); otherwise it returns a message indicating that system resources are insufficient. In response to the instruction, each of the threads is assigned to a different idle processor core, the processing power of every core assigned a thread fully satisfying the needs of its thread, and at the same time the idle cores that have been assigned threads are activated, changing from the idle state to the busy state.
After the i-th (1 <= i <= 2^n) processor core of the multi-core processor (with 2^n processor cores, n an integer greater than 1, for example n = 2, 3, 4, 5, 6, 7, 8, or 9) is activated, the activated core enters the busy state of thread execution and must read and write data as the thread requires, so it needs to cache data, and a core in the busy state therefore frequently sends cache instructions to the cache controller. In response to a cache instruction, the cache controller first queries the cache mapping table to check whether the data to be cached is already stored in the ⌈i/k⌉-th cache memory of the level-m (2 <= m <= n+1) cache, where k = 2^(m-1) and ⌈·⌉ denotes rounding up. If it is, the controller updates the cache mapping table and then sends a confirmation message indicating caching success to the i-th processor core, where updating the cache mapping table includes adding the correspondence between the data to be cached, the activated processor core, and the thread executed by the activated processor core. Otherwise the controller stores the data to be cached into the i-th cache memory of the level-1 cache, updates the storage relation between the data and the i-th cache memory of the level-1 cache in the cache mapping table, and then performs a query analysis in the cache mapping table with the data to be cached as the query condition. If the query analysis finds that the (2⌈i/2⌉-1)-th and (2⌈i/2⌉)-th cache memories of the level-1 cache both store the same cached data, the cache controller dumps the data to be cached into the ⌈i/2⌉-th cache memory of the level-2 cache, clears the data from the (2⌈i/2⌉-1)-th and (2⌈i/2⌉)-th cache memories of the level-1 cache, releases the storage space the data occupied in those cache memories, and finally updates the cache mapping table, including adding the storage relation between the data and the corresponding cache memory of the level-2 cache and deleting the storage relations between the data and the cache memories of the level-1 cache. The judgment and handling then proceed level by level in the same way. For example, when n = 2, that is, when the multi-core processor comprises 4 processor cores, besides performing the above steps the controller also performs a query analysis in the cache mapping table with the data to be cached as the query condition: if the query analysis finds that the 1st and 2nd cache memories of the level-2 cache both store the same cached data, the cache controller dumps the data into the 1st (and only) cache memory of the level-3 cache, clears the data from the 1st and 2nd cache memories of the level-2 cache, releases the storage space the data occupied there, and finally updates the cache mapping table, including adding the storage relation between the data and the first cache memory of the level-3 cache and deleting the storage relations between the data and the cache memories of the level-2 cache. In general, the judgment proceeds level by level from the level-1 cache to the level-n cache: if the (2⌈r/2⌉-1)-th and (2⌈r/2⌉)-th cache memories of the level-t cache (where ⌈·⌉ denotes rounding up, r is the index at level t of the cache memory holding the newly stored copy, 1 <= r <= 2^(n+1-t), and 1 <= t <= n) both store the data to be cached, the cache controller dumps the data into the ⌈r/2⌉-th cache memory of the level-(t+1) cache, clears the data from the (2⌈r/2⌉-1)-th and (2⌈r/2⌉)-th cache memories of the level-t cache, releases the storage space the data occupied there, and finally updates the cache mapping table, including adding the storage relation between the data and the ⌈r/2⌉-th cache memory of the level-(t+1) cache and deleting the storage relations between the data and the (2⌈r/2⌉-1)-th and (2⌈r/2⌉)-th cache memories of the level-t cache. When the above steps end, the cache controller sends a confirmation message indicating caching success to the i-th processor core. The cache mapping table is stored in the mapping buffer. Those skilled in the art should understand that all parameters used in the present invention, for example the parameters m, n, p, q, r, s, k, t, i, and j, are integers.
In the second embodiment, the processing performed after a processor core finishes executing a thread is similar to the processing in the first embodiment, and in this embodiment it is carried out under the condition that the cache memories at every level have enough cache space to store the data to be cached. The method of reading cached data in this embodiment is likewise similar to the method of reading cached data in the first embodiment. In addition, although the level-by-level judgment from the level-1 cache to the level-n cache of whether the (2⌈r/2⌉-1)-th and (2⌈r/2⌉)-th cache memories both store the data to be cached is not shown in detail in Fig. 4, those skilled in the art should understand from the detailed description above that the judgment is made level by level; to highlight the essentials of the inventive design, the specific loop processing is not shown.
In the present invention, processor cores can only read, and cannot modify, cached data in the general cache and in the cache memories other than the level-1 cache memories. If modification is needed, the caching must be performed again: that is, when a processor core needs to modify cached data in the general cache or in a non-level-1 cache memory, the modified data is treated as new data to be cached and the method of caching data in a multi-core processor of the present invention is carried out on it, the original cached data and the modified cached data being written as different pieces of data to be cached.
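This modify-by-recaching rule resembles a copy-on-write discipline: the original entry is left in place and the modified value enters the hierarchy as fresh data to be cached. A toy sketch; versioned keys are an illustrative assumption, not part of the patent:

```python
def modify_cached(table, versions, key, new_value):
    """Shared cached data is never updated in place; the modified value
    is cached again under a new version of the key."""
    v = versions.get(key, 0) + 1
    versions[key] = v
    new_key = (key, v)
    table[new_key] = new_value    # written as new data to be cached
    return new_key
```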
The above description of the invention is exemplary only and concentrates on the essential features involved in the technical problems the invention is to solve. Other related details of the invention that those skilled in the art should clearly know, or can easily conceive, are not elaborated; for example, when the free storage space of a dedicated cache or of the general cache is insufficient to store the data to be cached, the previously stored cached data needs to be replaced by a replacement algorithm, which is not elaborated here.
It should be appreciated that the embodiments described above are detailed descriptions of specific embodiments, but the present invention is not limited to these embodiments; various improvements and modifications can be made to the present invention without departing from its spirit and scope. For example, when weight information indicates that the weight of the data to be cached is a low, medium, or high weight, the method of caching data in cache memories of the present invention can be further improved without departing from the spirit and scope of the invention.