An instruction fetch control device for a simultaneous multithreading processor, and a method therefor
Technical field
The present invention relates to the fields of computer architecture and general-purpose microprocessor design, and in particular to an instruction fetch control device for a simultaneous multithreading processor and a method therefor.
Background technology
This section first introduces the environment in which the invention is applied, namely the architecture of a simultaneous multithreading (SMT) microprocessor.
Figure 1 shows the architecture of a simultaneous multithreading processor. To run several threads at once, the processor must give each thread its own set of hardware for holding its running state: a program counter and fixed-point and floating-point register files, together with thread identifiers attached to the entries of storage structures such as the Cache, the TLB and the instruction queues. Here, "instruction queue" is a collective term for all the queues in the SMT processor, including the fetch queue, the decode queue, the rename queue and the fixed-point/floating-point issue queues. In each clock cycle, the fetch unit, guided by the branch predictor, reads instructions of several threads from the instruction Cache and places them in the fetch queue. Instructions leave the fetch queue in first-in, first-out order; each cycle, a number of instructions equal to the decode bandwidth is sent to the decode unit. After decoding and renaming, instructions enter the fixed-point or floating-point issue queue and wait to be issued. When the conditions are met, that is, when the operands are ready and a functional unit is free, an instruction is issued to the corresponding functional unit and executed out of order. After execution, results are written back to the registers in program order by the reorder unit.
As shown in Figure 2, an SMT processor differs from a conventional superscalar processor based on branch prediction and out-of-order execution in that it can fetch and execute instructions from several threads in each clock cycle. Within one clock cycle, an SMT processor can therefore exploit both thread-level parallelism and instruction-level parallelism to eliminate horizontal waste. In addition, when long-latency operations or resource conflicts leave only one thread active, that thread can use all available issue slots, and vertical waste is eliminated by issuing unblocked instructions from other threads. These two improvements greatly increase instruction throughput and overall system performance. The superscalar processor suffers from both kinds of issue-slot waste, so its performance is lower than that of an SMT processor. As the figure shows, an SMT processor can fetch instructions from several threads in each clock cycle and place them in the fetch queue for the execution units, so it uses the issue-slot bandwidth more fully and achieves higher instruction throughput.
While improving instruction throughput and issue-slot utilization, an SMT processor also increases competition among threads for shared hardware resources. Allocating hardware resources sensibly, so that overall multithreaded performance improves without hurting the performance of a single program, depends critically on the fetch method. Moreover, because instructions from several threads may be fetched in each clock cycle, the fetch unit and the memory access unit become the performance bottleneck of an SMT processor. Fetching sensibly, reducing unnecessary premature or useless fetches, and improving fetch efficiency and fetch quality likewise depend critically on the fetch method.
Next, some existing fetch methods for SMT processors are briefly introduced.
The simplest fetch method is the random method: the fetch unit fetches from a randomly chosen candidate thread. Another simple method is round-robin, or polling: the fetch unit, within the fetch bandwidth, reads a fixed or variable number of instructions in turn from all or some of the active threads into the fetch queue for execution. An improved variant is called "round-robin by 1". To make better use of the data that the previously selected threads brought into the Cache, each new scheduling round admits only one new thread and drops one old thread, so most threads continue to be fetched in the current cycle. This improves the utilization of Cache data and reduces the repeated replacement of Cache data caused by frequent thread switching. Round-robin is simple to implement, but it ignores factors such as the progress rate and priority of each thread. Instructions of a thread whose execution is delayed may occupy instruction queue entries for a long time and prevent other threads from fetching and making progress, which lowers the instruction throughput of the whole system. Of these three methods, random and round-robin perform similarly, and "round-robin by 1" performs slightly better than round-robin.
Among existing fetch methods, ICOUNT has the highest fetch efficiency and fetch quality, and programs fetched with it perform best. Under ICOUNT, the fetch unit gives priority to the few threads that occupy the fewest instruction queue entries. If a thread runs fast, its instructions spend little time in the queues and are issued to the functional units quickly, so it occupies few queue entries and receives higher priority at the next fetch. A slow thread, whose instructions are delayed and occupy queue entries for a long time, accumulates instructions and receives lower priority at the next fetch. ICOUNT thus favors fast threads while still ensuring a degree of fetch fairness. ICOUNT admits several parameter combinations, of which ICOUNT.2.8 performs best. It works as follows: two threads are selected in each cycle and 8 instructions are fetched in total. The thread occupying the fewest instruction queue entries is fetched first; fetching from it stops when an instruction Cache miss occurs or the next instruction would cross an instruction Cache line boundary, and the remaining fetch bandwidth is then given to the second selected thread.
ICOUNT fetches as much as possible for the first selected thread, until a stop condition occurs or the maximum fetch bandwidth is reached; only if fetch bandwidth remains does it fetch the remaining number of instructions for the second thread. As a result, the first thread may be given too many instructions while the second is left with too few. Fetch bandwidth is used unevenly, the number of instruction queue entries occupied by the threads differs widely from cycle to cycle, and the instruction queue conflict rate is high. The hit rates of the cache and the TLB are limited as well, which holds back further improvement in processor performance.
Under the ICOUNT fetch method, the fetch device comprises only a thread selector and several counters that record how many instruction queue entries each thread occupies. In each clock cycle, the selector picks the two smallest values among all the counters; the numbers of the corresponding counters are the numbers of the threads to fetch from. The fetch unit then fetches as many instructions as possible from the first selected thread (the one with the smallest counter value), stopping when the maximum fetch bandwidth is reached, a Cache line boundary is met, or a Cache miss occurs. If fetch bandwidth remains, the fetch unit fetches as many instructions as possible from the second thread (the one with the second-smallest counter value) until the same stop conditions are met. In this device, once the selector has compared the relative sizes of the counter values, the values themselves are not used again. This is why fetch bandwidth is used unevenly, and why the ICOUNT fetch device is comparatively inefficient and performs comparatively poorly.
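The ICOUNT.2.8 behaviour described above can be sketched in a few lines of Python. This is only an illustrative model, not the hardware itself: the function names and the `available` list (a hypothetical stand-in for how many instructions each thread can supply before a Cache miss or Cache line boundary stops it) are assumptions introduced here.

```python
def icount_select(counts, k=2):
    """Return the indices of the k threads holding the fewest queue entries.

    Ties are broken by thread number (an assumption; the text does not say).
    """
    return sorted(range(len(counts)), key=lambda t: (counts[t], t))[:k]


def icount_fetch(counts, available, bandwidth=8):
    """Model one ICOUNT.2.8 fetch cycle.

    counts[t]    : instruction queue entries currently held by thread t
    available[t] : hypothetical number of instructions thread t can supply
                   before a stop condition (miss, line boundary) occurs
    """
    first, second = icount_select(counts, 2)
    # The first thread takes as much as it can, up to the full bandwidth.
    got_first = min(bandwidth, available[first])
    # The second thread only receives whatever bandwidth is left over.
    got_second = min(bandwidth - got_first, available[second])
    return {first: got_first, second: got_second}
```

With counts `[12, 3, 7, 20]` and no stop condition (`available = [8, 8, 8, 8]`), the call returns `{1: 8, 2: 0}`: thread 1 takes the whole bandwidth and thread 2 gets nothing, which is the imbalance discussed above.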
Building on ICOUNT, the PICOUNT and PICOUNT2 fetch methods also take thread priority into account. At fetch time, the former computes for each candidate thread a weighted sum of two parameters, its priority and its number of occupied instruction queue entries, and gives fetch priority to the few threads with the smallest sums; the latter gives thread priority a larger weight in the weighted sum. Fetch methods that consider priority can guarantee that high-priority threads are fetched and executed, but low-priority threads may go without running for a long time or lose performance.
Other fetch methods target particular performance metrics. For example, some greedy algorithms fetch from the thread with the lowest Cache or TLB miss rate, from the thread with the lowest average memory access time, or from the thread with the highest instructions-per-cycle (IPC) value. Selection can also be based on how the threads complement one another in their use of hardware resources, for example by minimizing the total data Cache miss rate across threads, minimizing the total average memory access time, or maximizing the total IPC. Several gating fetch methods have also been proposed to improve the efficiency of the issue queues and reduce the time instructions wait in the queues: for example, fetching from a thread stops when the number of its instructions whose execution is delayed reaches a threshold, or when its actual or predicted data Cache miss rate reaches a threshold. Although these methods satisfy a particular performance metric to some extent, they do not select threads according to the processor's final performance metric, so the overall processor performance they achieve is lower than that of ICOUNT-based processors.
Summary of the invention
The object of the present invention is to overcome the uneven use of fetch bandwidth in the prior art, to substantially reduce the average number of instruction queue entries occupied in an SMT processor, to substantially reduce the instruction queue conflict rate, and at the same time to clearly improve the hit rates of the cache and the TLB, thereby improving processor performance. To this end, the invention provides an instruction fetch control device for a simultaneous multithreading processor and a method therefor.
To solve the above technical problem, the invention provides a fetch method for a simultaneous multithreading processor, comprising the following steps:
A) counting the number of instruction queue entries occupied by each of the threads running simultaneously;
B) sorting the counts obtained;
C) selecting the two smallest values from the sorted sequence and taking the two threads corresponding to these values as the threads selected for fetching in the current clock cycle, the other threads not being fetched in this clock cycle;
D) applying a logical negation to each of the two smallest values from step c) and taking the result modulo 16, which yields two new values, and computing from them the upper limit on the number of instructions to fetch for each of the two selected threads;
E) fetching instructions according to these upper limits.
In the above scheme, in step d), the upper limit on the number of instructions fetched for the smaller of the two minimum values is the lesser of two quantities: the fetch bandwidth, and the logical negation of that smaller value taken modulo 16.
In the above scheme, in step d), the upper limit on the number of instructions fetched for the larger of the two minimum values is the lesser of two quantities: the difference between the fetch bandwidth and the number of instructions actually fetched under the upper limit for the smaller value, and the logical negation of that larger value taken modulo 16.
In the above scheme, the value is an upper limit in the sense that fetching from a thread stops early if a cache miss occurs, an instruction crosses a cache line boundary, the thread mispredicts a branch, or a similar event occurs.
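The steps above can be sketched in Python as follows. This is a minimal model under stated assumptions, not a definitive implementation: the function names are hypothetical, ties between equal counts are broken by thread number, and the logical negation is taken to be a bitwise NOT, so that `~count % 16` equals `15 - (count % 16)`.

```python
def select_two(counts):
    """Steps a) to c): return the two threads holding the fewest queue entries."""
    order = sorted(range(len(counts)), key=lambda t: (counts[t], t))
    return order[0], order[1]


def negated_mod(count, modulus=16):
    """Step d): bitwise NOT followed by mod 16.

    In Python, ~count is -count - 1 and % yields a non-negative result,
    so this equals modulus - 1 - (count % modulus).
    """
    return ~count % modulus


def first_limit(min1, bandwidth=8, modulus=16):
    """Upper limit for the thread with the smallest count."""
    return min(bandwidth, negated_mod(min1, modulus))


def second_limit(min2, fetched_first, bandwidth=8, modulus=16):
    """Upper limit for the thread with the second-smallest count.

    fetched_first is the number of instructions the first thread
    actually fetched, which may be below its limit.
    """
    return min(bandwidth - fetched_first, negated_mod(min2, modulus))
```

For counts `[20, 11, 13, 30]` and a fetch bandwidth of 8 (the bandwidth is an assumed value), threads 1 and 2 are selected. Thread 1 may fetch at most `min(8, 4) = 4` instructions; if it fetches all 4, thread 2 may fetch at most `min(4, 2) = 2`.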
The invention also provides a fetch device for a simultaneous multithreading processor, comprising: counters for recording the number of instruction queue entries occupied by each thread, a T-to-2 multiplexer, a first bitwise inverter, a second bitwise inverter, a first modulo-16 unit, a second modulo-16 unit, a first 2-to-1 selector, a second 2-to-1 selector and a subtracter. In each clock cycle, the T-to-2 multiplexer selects the two smallest values among the counter outputs and outputs them, and sends the corresponding counter numbers to the fetch unit. The smallest value passes through the first bitwise inverter and the first modulo-16 unit to the first 2-to-1 selector; the second-smallest value passes through the second bitwise inverter and the second modulo-16 unit to the second 2-to-1 selector. The other input of the first 2-to-1 selector is the fetch bandwidth, and the other input of the second 2-to-1 selector is the output of the subtracter. The inputs of the subtracter are the fetch bandwidth and the output of the first 2-to-1 selector. Each 2-to-1 selector passes the smaller of its two inputs to the fetch unit of the SMT processor. The fetch unit fetches from the corresponding threads according to the outputs of the two selectors, and updates the counter of each thread in the fetch device according to the instructions actually fetched in the current clock cycle.
In the above scheme, the T in "T-to-2 multiplexer" is the number of threads.
In the above scheme, the counters comprise X counters, where X is the number of threads.
In the above scheme, the instruction queue comprises all the queues in the SMT processor, including the fetch queue, the decode queue, the rename queue and the fixed-point/floating-point issue queues.
As can be seen from the above, the invention computes an upper bound on the fetch count for each thread and thereby uses the fetch bandwidth more evenly. The average number of instruction queue entries occupied in the SMT processor falls substantially, the instruction queue conflict rate falls substantially, and the hit rates of the Cache and the TLB (translation lookaside buffer) also improve clearly, so that processor performance is greatly improved.
Description of drawings
Fig. 1 shows the architecture of a prior-art simultaneous multithreading processor;
Fig. 2a is a schematic diagram of the issue slots of a prior-art superscalar processor;
Fig. 2b is a schematic diagram of the issue slots of a prior-art simultaneous multithreading processor;
Fig. 3 is a flow chart of the fetch method of the invention for a simultaneous multithreading processor;
Fig. 4 is a structural diagram of the fetch control device of the invention for a simultaneous multithreading processor.
Embodiment
In the present invention, in each clock cycle the fetch unit of the processor fetches instructions from two threads according to the values supplied by the fetch control device of the invention. The basic idea is to select, among the currently active threads, the two that occupy the fewest instruction queue entries, and to limit the number of instructions read from each selected thread jointly by two quantities: the value obtained by logically negating its count of occupied instruction queue entries and taking the result modulo the constant 16, and the fetch bandwidth of the SMT processor. Under this limit, the fetch unit first fetches for the thread occupying the fewest instruction queue entries and then uses the remaining fetch bandwidth for the second selected thread.
Referring to Fig. 3, a fetch method for a simultaneous multithreading processor comprises the following steps:
Step 100: count the number of instruction queue entries occupied by each of the simultaneously running threads;
Step 110: sort the counts obtained;
Step 120: select the two smallest values from the sorted sequence, denoted $MIN_1$ and $MIN_2$ with $MIN_1 \le MIN_2$; the threads corresponding to these two values, denoted T1 and T2 respectively, are the threads selected for fetching in the current clock cycle, and the other threads are not fetched in this clock cycle;
Step 130: logically negate each of the two smallest values from step 120 to obtain two new values, denoted $\overline{MIN_1}$ and $\overline{MIN_2}$.
If the fetch bandwidth of the SMT processor is N, the upper limit on the number of instructions fetched for thread T1 in this clock cycle is:
$$I_1 = \mathrm{MIN}\left(N,\ \overline{MIN_1} \bmod 16\right) \qquad (1)$$
where MIN denotes the smaller of its two arguments. The value is an upper limit in the sense that fetching from the thread stops early if a cache miss occurs, an instruction crosses a cache line boundary, the thread mispredicts a branch, or a similar event occurs.
After the number of instructions fetched for thread T1 has been determined, the upper limit on the number of instructions fetched for thread T2 is:
$$I_2 = \mathrm{MIN}\left(N - I_1',\ \overline{MIN_2} \bmod 16\right) \qquad (2)$$
where $I_1'$ is the number of instructions actually fetched for thread T1 under the limit of formula (1). This value is likewise an upper limit: fetching from the thread stops early if a cache miss occurs, an instruction crosses a cache line boundary, the thread mispredicts a branch, or a similar event occurs.
Step 140: fetch instructions according to these upper limits.
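A short worked identity may help when reading the two limits. Assuming the count is held as an unsigned binary number at least four bits wide (an assumption; the text does not state the counter width), bitwise negation followed by reduction modulo 16 simply complements the low four bits of a count $c$:

```latex
\overline{c} \bmod 16 \;=\; 15 - (c \bmod 16),
\qquad \text{e.g. } c = 11:\quad \overline{11} \bmod 16 = 15 - 11 = 4 .
```

A thread that already holds 11 queue entries could therefore fetch at most MIN(N, 4) instructions in that cycle. The limit shrinks as the low four bits of the count grow, and returns to 15 when those bits wrap around to zero.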
The invention differs from ICOUNT.2.8, currently the best-performing fetch method for SMT processors, in the number of instructions fetched for each selected thread. ICOUNT fetches as much as possible for the first selected thread, until a stop condition occurs or the maximum fetch bandwidth is reached, and fetches the remaining number of instructions for the second thread only if fetch bandwidth remains. The invention, under the same conditions as ICOUNT, computes an upper bound on the fetch count for each thread instead of setting it to the fetch bandwidth as ICOUNT does. Fetch bandwidth is therefore used more evenly, the situation in which the first thread receives too many instructions and the second too few is avoided, and the overall performance of the threads improves further over ICOUNT.
Referring to Fig. 4, the part inside the dashed box is the part to which the invention relates. In Fig. 4, the invention comprises a feedback-based circuit connected to the fetch unit of the SMT processor, namely the fetch control device for a simultaneous multithreading processor. The device comprises counters 1, a T-to-2 multiplexer 2, a first bitwise inverter 3, a second bitwise inverter 4, a first modulo-16 unit 5, a second modulo-16 unit 6, a first 2-to-1 selector 7, a second 2-to-1 selector 8 and a subtracter 9. The counters 1 record the number of instruction queue entries occupied by each thread; the T in "T-to-2 multiplexer 2" is the number of threads.
While the SMT processor is operating, in each clock cycle the values of the counters 1 are sent to the T-to-2 multiplexer 2, which selects the two smallest of its input values and outputs them: the smallest value on the "selected thread 1" port and the second-smallest on the "selected thread 2" port. The corresponding thread numbers (that is, the counter numbers) are sent to the fetch unit and identify the threads from which instructions are read in this clock cycle. One output value of the T-to-2 multiplexer 2 passes through the first bitwise inverter 3 and the first modulo-16 unit 5 to the first 2-to-1 selector 7; the other output value passes through the second bitwise inverter 4 and the second modulo-16 unit 6 to the second 2-to-1 selector 8. For selected thread 1 (the thread corresponding to the smaller of the two selected values), the other input of the first 2-to-1 selector 7 is N, the fetch bandwidth of the SMT processor. For selected thread 2 (the other selected thread), the other input of the second 2-to-1 selector 8 is the output of the subtracter 9, whose inputs are N and the output of the first 2-to-1 selector 7 for thread 1. The first 2-to-1 selector 7 and the second 2-to-1 selector 8 each pass the smaller of their inputs to the fetch unit of the SMT processor. The fetch unit fetches from the corresponding threads according to these two output values and updates the counters 1 of the threads in the fetch device according to the instructions actually fetched in this clock cycle, which closes the feedback loop.
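The datapath of Fig. 4 can be modelled behaviourally as below. This is a rough sketch under several assumptions: the names are hypothetical, ties are broken by thread number, `supply` stands in for the stop conditions (Cache miss, line boundary, branch misprediction), and only the counter increase due to fetching is modelled, not the decrease as instructions leave the queues. Following the device text, the subtracter takes the output of the first 2-to-1 selector rather than the count actually fetched.

```python
def fetch_control(counters, bandwidth=8, modulus=16):
    """One combinational pass through the fetch control device."""
    # T-to-2 multiplexer 2: pick the two smallest counter values.
    order = sorted(range(len(counters)), key=lambda t: (counters[t], t))
    t1, t2 = order[0], order[1]
    # Bitwise inverters 3 and 4 followed by modulo units 5 and 6.
    inv1 = ~counters[t1] % modulus
    inv2 = ~counters[t2] % modulus
    # First 2-to-1 selector 7: smaller of the bandwidth N and inv1.
    limit1 = min(bandwidth, inv1)
    # Subtracter 9: N minus the output of selector 7.
    remaining = bandwidth - limit1
    # Second 2-to-1 selector 8: smaller of the subtracter output and inv2.
    limit2 = min(remaining, inv2)
    return (t1, limit1), (t2, limit2)


def clock(counters, supply, bandwidth=8, modulus=16):
    """Fetch for one cycle and feed the result back into the counters.

    supply[t] is a hypothetical cap on what thread t can deliver this cycle.
    """
    (t1, limit1), (t2, limit2) = fetch_control(counters, bandwidth, modulus)
    fetched = {t1: min(limit1, supply[t1]), t2: min(limit2, supply[t2])}
    for thread, count in fetched.items():
        counters[thread] += count  # feedback path to counters 1
    return fetched
```

Starting from counters `[9, 14, 30, 25]` with an assumed bandwidth of 8 and unlimited supply, one call to `clock` fetches 6 instructions for thread 0 and 1 for thread 1, leaving the counters at `[15, 15, 30, 25]`.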
The processor to which the fetch device is applied has one independent, shared fetch queue. In each clock cycle, the fetch unit can fetch instructions from several simultaneously running active threads and place them in the shared fetch queue. The fetch queue holds instructions belonging to several different threads for execution by the processor's execution units, and no thread switch is needed when instructions of different threads execute. The threads share the processor's cache and the host system's main memory. When fetching, the SMT processor reads a number of instructions from the shared cache or main memory, according to the program counter value of each selected thread, and places them in the shared fetch queue.
The parts connected after the register renaming unit in Fig. 4 are the same as the corresponding parts in Fig. 1 and are abbreviated with dashed lines in the figure.
The fetch device of the invention has very low implementation complexity: it can be built with little hardware, and its time complexity is constant.
An example illustrates how the invention is carried out. For simplicity, assume that the SMT processor runs 2 threads simultaneously; that the reservation station queue, which holds fixed-point and floating-point instructions after renaming, has 32 entries in total; that the fetch, decode and rename queues (or windows) have 4 entries each; and that the processor is 4-issue and fetches at most 4 instructions per cycle.
Under these assumptions, the total number of queue entries is 32 + 4 × 3 = 44.
Note that when the processor is 4-issue, the modulus used in the invention is 8.
The table below lists every distribution of instruction counts that the two threads can currently hold. To save space, only the case in which thread 1 (T1) holds fewer instructions than thread 2 (T2) is considered; the opposite case is symmetric. In the table, 4’ means that the upper limit on fetching is 4 instructions, but that the fetch must fit within the bandwidth remaining after thread 1 has fetched.
Table 1. Number of instructions fetched per clock cycle under the invention and under ICOUNT.
T1 instructions held | T2 instructions held | Invention: T1 fetch limit | Invention: T2 fetch limit | ICOUNT: T1 fetch limit | ICOUNT: T2 fetch limit |
1 | 1<T2<=43 | 4 | 4’ | 4 | 4’ |
2 | 2<T2<=42 | 4 | 4’ | 4 | 4’ |
3 | 3<T2<=41 | 4 | 4’ | 4 | 4’ |
4 | 4<T2<=40 | 3 | 4’ | 4 | 4’ |
5 | 5<T2<=39 | 2 | 4’ | 4 | 4’ |
6 | 6<T2<=38 | 1 | 4’ | 4 | 4’ |
7 | 7<T2<=37 | 0 | 4’ | 4 | 4’ |
8 | 8<T2<=36 | 4 | 4’ | 4 | 4’ |
9 | 9<T2<=35 | 4 | 4’ | 4 | 4’ |
10 | 10<T2<=34 | 4 | 4’ | 4 | 4’ |
11 | 11<T2<=33 | 4 | 4’ | 4 | 4’ |
12 | 12<T2<=32 | 3 | 4’ | 4 | 4’ |
13 | 13<T2<=31 | 2 | 4’ | 4 | 4’ |
14 | 14<T2<=30 | 1 | 4’ | 4 | 4’ |
15 | 15<T2<=29 | 0 | 4’ | 4 | 4’ |
16 | 16<T2<=28 | 4 | 4’ | 4 | 4’ |
17 | 17<T2<=27 | 4 | 4’ | 4 | 4’ |
18 | 18<T2<=26 | 4 | 4’ | 4 | 4’ |
19 | 19<T2<=25 | 4 | 4’ | 4 | 4’ |
20 | 20<T2<=24 | 3 | 4’ | 4 | 4’ |
21 | 21<T2<=23 | 2 | 4’ | 4 | 4’ |
22 | 22<T2<=22 | 1 | 4’ | 4 | 4’ |
Table 1 shows that the invention effectively limits the number of instructions fetched for a thread, so that fetch bandwidth is allocated more effectively and the performance of the SMT processor improves.
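The T1 column for the invention in Table 1 can be checked with a few lines of Python. The sketch assumes a fetch bandwidth of 4 and a modulus of 8 for the 4-issue example, which is what the tabulated values imply; the function name is hypothetical.

```python
def t1_limit(count, bandwidth=4, modulus=8):
    """Upper fetch limit for T1: the smaller of the bandwidth and (~count) mod 8."""
    return min(bandwidth, ~count % modulus)


# T1 holds between 1 and 22 instructions in Table 1.
limits = [t1_limit(count) for count in range(1, 23)]
```

Here `limits` evaluates to `[4, 4, 4, 3, 2, 1, 0, 4, 4, 4, 4, 3, 2, 1, 0, 4, 4, 4, 4, 3, 2, 1]`, which matches the T1 column row by row, whereas ICOUNT allows 4 in every row.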
Finally, it should be noted that the above embodiments only illustrate, and do not limit, the technical solution of the invention. Although the invention has been described in detail with reference to the above embodiments, those of ordinary skill in the art should understand that the invention may still be modified or equivalently replaced, and that any modification or partial replacement that does not depart from the spirit and scope of the invention falls within the scope of the claims.