CN1716183A - Instruction fetch control device and method for a simultaneous multithreading processor - Google Patents

Info

Publication number
CN1716183A
Authority
CN
China
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN 200410009288
Other languages
Chinese (zh)
Other versions
CN100377076C (en
Inventor
何立强
刘志勇
胡伟武
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Loongson Technology Corp Ltd
Original Assignee
Institute of Computing Technology of CAS
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Institute of Computing Technology of CAS
Priority to CNB2004100092885A (CN100377076C)
Publication of CN1716183A
Application granted
Publication of CN100377076C
Anticipated expiration
Expired - Lifetime

Landscapes

  • Advance Control (AREA)

Abstract

The invention discloses an instruction fetch control device and method for a simultaneous multithreading processor. The device comprises counters that record the number of instruction queue entries occupied by each thread, a T-to-2 multiplexer, a first bitwise inverter, a second bitwise inverter, a first modulo-16 unit, a second modulo-16 unit, a first 2-to-1 selector, a second 2-to-1 selector, and a subtractor. The invention computes, for each selected thread, an upper bound on the number of instructions to fetch, so that the fetch bandwidth is used in a more balanced way. As a result, the average number of instruction queue entries occupied in the simultaneous multithreading processor drops substantially, the instruction queue conflict rate falls markedly, the hit rates of the cache and the TLB (translation lookaside buffer) rise, and processor performance improves considerably.

Description

Instruction fetch control device and method for a simultaneous multithreading processor
Technical field
The present invention relates to computer architecture and general-purpose microprocessor design, and in particular to an instruction fetch control device and method for a simultaneous multithreading processor.
Background art
We first introduce the environment in which the invention is applied, namely the architecture of a simultaneous multithreading microprocessor.
Fig. 1 shows the architecture of a simultaneous multithreading processor. To run several threads at once, the processor provides each thread with its own set of hardware for holding its execution state: a program counter and fixed-point and floating-point register files, plus thread identifiers in the storage structures, namely the cache, the TLB, and the instruction queues. Here, "instruction queue" is a collective term for all queues in the simultaneous multithreading processor, including the fetch queue, the decode queue, the rename queue, and the fixed-point and floating-point issue queues. In each clock cycle, the fetch unit reads instructions of several threads from the instruction cache, guided by the branch predictor, and places them in the fetch queue. Instructions leave the fetch queue in first-in, first-out order, and in each cycle a number of instructions equal to the decode bandwidth is sent to the decode unit. After decoding and renaming, instructions enter the fixed-point or floating-point issue queue and wait to issue. When the conditions are met, that is, when the operands are ready and a functional unit is free, an instruction is issued to the corresponding functional unit for out-of-order execution. After execution completes, the reorder unit writes the results back to the registers in order.
As shown in Fig. 2, a simultaneous multithreading processor differs from a conventional superscalar processor based on branch prediction and out-of-order execution in that it can fetch and execute instructions from several threads in each clock cycle. Within a single clock cycle it can therefore exploit both thread-level and instruction-level parallelism to eliminate horizontal waste. In addition, when long-latency operations or resource conflicts leave only one thread active, that thread can use all available issue slots, and vertical waste is eliminated by issuing unblocked instructions from other threads. Together, these two improvements greatly increase instruction throughput and overall system performance. The superscalar processor suffers from both kinds of issue-slot waste and therefore performs worse than the simultaneous multithreading processor. As the figure shows, the simultaneous multithreading processor can read instructions from several threads in each clock cycle and place them in the fetch queue for the execution units, so it uses the issue-slot bandwidth more fully and raises the instruction throughput of the processor.
While raising instruction throughput and issue-slot utilization, simultaneous multithreading also increases competition among threads for shared hardware resources. The fetch policy is the key factor in allocating hardware resources sensibly, so that overall multithreaded performance improves without hurting the performance of any single program. Furthermore, because instructions are fetched from several threads in each clock cycle, the fetch unit and the memory access unit become the performance bottleneck of a simultaneous multithreading processor. The fetch policy is again the key factor in fetching sensibly, reducing unnecessary, premature, or useless fetches, and improving fetch efficiency and fetch quality.
Next, we briefly review some existing fetch policies for simultaneous multithreading processors.
The simplest fetch policy is the "random" policy: the fetch unit fetches from candidate threads chosen at random. Another simple policy is "round-robin", that is, polling: the fetch unit reads a fixed or variable number of instructions, according to the fetch bandwidth, from all or some of the active threads in turn and places them in the fetch queue. An improved variant is called "round-robin by 1". To make better use of the data that the previously selected threads brought into the cache, each scheduling step brings in only one new thread and drops one old thread. Most threads therefore continue to be fetched in the current cycle, which raises the utilization of cache data and reduces the cases where frequent thread switching causes cache data to be replaced in turn. Round-robin is simple to implement but ignores factors such as the progress rate and priority of each thread. Instructions of some threads may occupy instruction queue entries for a long time because their execution is delayed, which prevents other threads from fetching and making progress and lowers the instruction throughput of the whole system. Of these three policies, random and round-robin perform similarly, and "round-robin by 1" is slightly better than round-robin.
Among existing fetch policies, ICOUNT has the highest fetch efficiency and fetch quality, and programs fetched under it perform best. Under ICOUNT, the fetch unit gives priority to the few threads that occupy the fewest instruction queue entries. If a thread runs fast, its instructions spend little time in the queues and are issued quickly to the functional units, so it occupies few queue entries and obtains a higher priority at the next fetch. The instructions of a slow thread are delayed, occupy queue entries for a long time, and pile up, so the thread has a lower priority at the next fetch. ICOUNT thus favors fast-running threads while still preserving a degree of fetch fairness. ICOUNT has several fetch parameter combinations, of which ICOUNT.2.8 performs best. Its procedure is as follows: two threads are selected each time, and eight instructions are fetched in total. The thread that occupies the fewest instruction queue entries is fetched first, and fetching from it stops when an instruction cache miss occurs or when the next instruction would cross an instruction cache line. The remaining fetch bandwidth is then allocated to the second selected thread.
ICOUNT fetches as much as possible for the first selected thread, until a stop condition occurs or the maximum fetch bandwidth is reached, and fetches the remaining number of instructions for the second thread only if fetch bandwidth is left over. The first thread may therefore fetch too many instructions while the second thread does not get enough. Fetch bandwidth is used unevenly, the number of instruction queue entries occupied by each thread differs widely from cycle to cycle, and the instruction queue conflict rate is high. This in turn limits the hit rates of the cache and the TLB and holds back further improvement of processor performance.
Under the ICOUNT policy, the fetch device consists only of a thread selector and several counters that record the instruction queue entries occupied by each thread. In each clock cycle the selector picks the two smallest values among all counters; the numbers of the corresponding counters are the numbers of the threads to fetch from. The fetch unit then fetches as many instructions as possible from the first selected thread (the one with the smallest counter value), stopping only when the maximum fetch bandwidth is reached, a cache line boundary is met, or a cache miss occurs. If fetch bandwidth remains, the fetch unit fetches as many instructions as possible from the second thread (the one with the second-smallest counter value) until the same stop conditions are met. In this device, once the selector has compared the relative sizes of the counter values at fetch time, the values are not used again. As a result, fetch bandwidth is used unevenly, and the ICOUNT fetch device is comparatively inefficient and performs comparatively poorly.
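The ICOUNT.2.8 behaviour described above can be sketched roughly as follows. This is a minimal behavioural model under stated assumptions, not the hardware: `fetchable` is a hypothetical per-thread input standing for how many instructions can be read from a thread before a stop condition (cache miss or cache line boundary) is hit, and ties between equal counters are broken by thread number.

```python
def icount_fetch(counters, fetchable, bandwidth=8):
    """Model one ICOUNT.2.8 fetch cycle.

    counters[t]  : instruction queue entries currently held by thread t
    fetchable[t] : hypothetical input, instructions readable from thread t
                   before a stop condition occurs
    Returns [(thread, fetched), (thread, fetched)] in priority order.
    """
    # Select the two threads with the smallest counters.
    first, second = sorted(range(len(counters)), key=lambda t: counters[t])[:2]
    # The first thread may take the whole bandwidth ...
    n1 = min(bandwidth, fetchable[first])
    # ... and the second thread only gets what is left over.
    n2 = min(bandwidth - n1, fetchable[second])
    return [(first, n1), (second, n2)]


print(icount_fetch([5, 2, 9], [8, 8, 8]))  # first thread takes everything
print(icount_fetch([5, 2, 9], [8, 3, 8]))  # first thread stops early
```

In the first call the second thread is starved, which is the imbalance the text attributes to ICOUNT.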
Building on ICOUNT, two further fetch policies, PICOUNT and PICOUNT2, take thread priority into account. At fetch time, PICOUNT computes a weighted sum of two parameters for each candidate thread, its priority and the number of instruction queue entries it occupies, and the few threads with the smallest sums obtain the right to fetch. PICOUNT2 gives more weight to thread priority when computing the weighted sum. Priority-aware fetch policies guarantee that high-priority threads are fetched and executed, but they may leave low-priority threads without execution for a long time or degrade their performance.
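A weighted-sum ranking of this kind might look like the sketch below. The text does not give the weights or the priority encoding, so everything specific here is an assumption made for illustration: the function name, the weights `w_priority` and `w_count`, and the convention that a smaller priority value means a more urgent thread.

```python
def picount_rank(threads, w_priority=1.0, w_count=1.0):
    """Rank threads by a weighted sum of priority value and queue occupancy.

    threads: list of (priority_value, queue_count) pairs, one per thread.
    Assumption: a smaller priority_value means a more urgent thread.
    Returns thread numbers, best fetch candidate first.
    """
    scores = [w_priority * p + w_count * c for p, c in threads]
    return sorted(range(len(threads)), key=lambda t: scores[t])


threads = [(0, 10), (3, 4), (1, 6)]
print(picount_rank(threads))                  # equal weights
print(picount_rank(threads, w_priority=4.0))  # priority weighted more heavily
```

Raising `w_priority` moves the ranking toward the behaviour the text describes for PICOUNT2.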
Other fetch policies target different performance metrics. For example, some greedy algorithms fetch from the thread with the lowest cache or TLB miss rate, from the thread with the shortest average memory access time, or from the thread with the highest instructions per cycle (IPC). Threads can also be selected according to how complementary their use of hardware resources is, for example to minimize the total data cache miss rate of all threads, to minimize the total average memory access time, or to maximize the total IPC. Several gating fetch policies have also been proposed to improve the utilization of the issue queues and to reduce the time instructions wait in them, for example stopping fetch from a thread when its number of delayed instructions reaches a certain limit, or when its actual or predicted data cache miss rate reaches a certain limit. Although these methods satisfy a particular performance metric to some extent, they do not select threads with the final performance of the processor in mind, so the overall processor performance they achieve is lower than that of ICOUNT.
Summary of the invention
The object of the invention is to overcome the unbalanced use of fetch bandwidth in the prior art; to substantially reduce the average number of instruction queue entries occupied in a simultaneous multithreading processor, markedly reduce the instruction queue conflict rate, and raise the hit rates of the cache and the TLB, thereby improving processor performance; and to this end to provide an instruction fetch control device and method for a simultaneous multithreading processor.
To solve the above technical problem, the invention provides a fetch method for a simultaneous multithreading processor, comprising the following steps:
a) counting the instruction queue entries occupied by each of the threads running simultaneously;
b) sorting the resulting counts;
c) selecting the two smallest values from the sorted sequence and taking the two threads corresponding to these values as the threads selected for fetch in the current clock cycle, the other threads not fetching in this clock cycle;
d) applying bitwise logical negation to each of the two smallest values of step c) and reducing each result modulo 16 to obtain two new values, from which the upper bound on the number of instructions fetched is computed for each of the two smallest values;
e) fetching according to the upper bounds on the number of instructions fetched.
In the above scheme, the upper bound on the number of instructions fetched for the smaller of the two minimum values of step c) is the lesser of two quantities: the fetch bandwidth, and the value obtained by bitwise-negating that smaller value and reducing it modulo 16.
In the above scheme, the upper bound on the number of instructions fetched for the larger of the two minimum values of step c) is the lesser of two quantities: the fetch bandwidth minus the number of instructions actually fetched under the upper bound of the smaller value, and the value obtained by bitwise-negating the larger value and reducing it modulo 16.
In the above scheme, the limit is an upper bound: while fetching from a thread, if a cache miss occurs, an instruction crosses a cache line boundary, a branch of the thread is mispredicted, or a similar event occurs, fetching from that thread stops early.
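The steps above can be sketched in a few lines of Python. This is an illustrative sketch, not the patented circuit: the function names and the `actual_first` parameter (the number of instructions the first thread really fetched, which may be below its bound when a stop condition occurs) are hypothetical, and ties between equal counts are broken by thread number.

```python
def fetch_cap(queue_count, modulus=16):
    """Bitwise NOT of the occupancy count, reduced modulo `modulus`.

    For non-negative counts this equals (modulus - 1) - (queue_count % modulus).
    """
    return (~queue_count) % modulus


def fetch_limits(counts, bandwidth, actual_first=None, modulus=16):
    """Return ((t1, bound1), (t2, bound2)) for one clock cycle.

    counts[t]    : instruction queue entries occupied by thread t
    actual_first : hypothetical input, instructions the first thread
                   actually fetched; defaults to its full bound
    """
    # Steps a) to c): sort the counts and keep the two smallest.
    t1, t2 = sorted(range(len(counts)), key=lambda t: counts[t])[:2]
    # Step d): bound for the thread with the smaller count.
    bound1 = min(fetch_cap(counts[t1], modulus), bandwidth)
    fetched1 = bound1 if actual_first is None else min(actual_first, bound1)
    # Step d): bound for the second thread uses the bandwidth left over.
    bound2 = min(fetch_cap(counts[t2], modulus), bandwidth - fetched1)
    return (t1, bound1), (t2, bound2)


print(fetch_limits([12, 30, 13], bandwidth=8))
print(fetch_limits([10, 4, 25], bandwidth=8, actual_first=5))
```

In the first call the thread holding 12 entries is capped at 3 instructions instead of the full bandwidth of 8, which leaves room for the second thread.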
The invention also provides a fetch device for a simultaneous multithreading processor, comprising counters that record the number of instruction queue entries occupied by each thread, a T-to-2 multiplexer, a first bitwise inverter, a second bitwise inverter, a first modulo-16 unit, a second modulo-16 unit, a first 2-to-1 selector, a second 2-to-1 selector, and a subtractor. In each clock cycle, the T-to-2 multiplexer selects the two smallest values from the counter outputs and sends the corresponding counter numbers to the fetch unit. The smallest value passes through the first bitwise inverter and the first modulo-16 unit to the first 2-to-1 selector; the second-smallest value passes through the second bitwise inverter and the second modulo-16 unit to the second 2-to-1 selector. The other input of the first 2-to-1 selector is the fetch bandwidth, and the other input of the second 2-to-1 selector is the output of the subtractor, whose inputs are the fetch bandwidth and the output of the first 2-to-1 selector. Each 2-to-1 selector outputs the smaller of its two inputs to the fetch unit of the simultaneous multithreading processor. The fetch unit fetches from the corresponding threads according to the outputs of the two selectors, and the counter of each thread in the fetch device is updated according to the instructions actually fetched in the current clock cycle.
In the above scheme, T in the T-to-2 multiplexer is the number of threads.
In the above scheme, the counters comprise X counters, where X is the number of threads.
In the above scheme, the instruction queue comprises all queues in the simultaneous multithreading processor: the fetch queue, the decode queue, the rename queue, and the fixed-point and floating-point issue queues.
In summary, the invention computes for each selected thread an upper bound on the number of instructions to fetch, so that the fetch bandwidth is used in a more balanced way. The average number of instruction queue entries occupied in the simultaneous multithreading processor drops substantially, the instruction queue conflict rate falls markedly, the hit rates of the cache and the TLB (translation lookaside buffer) rise, and processor performance improves considerably.
Description of drawings
Fig. 1 shows the system architecture of a prior-art simultaneous multithreading processor;
Fig. 2a is a schematic diagram of the instruction issue slots of a prior-art superscalar processor;
Fig. 2b is a schematic diagram of the instruction issue slots of a prior-art simultaneous multithreading processor;
Fig. 3 is a flowchart of the fetch method of the invention for a simultaneous multithreading processor;
Fig. 4 is a structural diagram of the fetch control device of the invention for a simultaneous multithreading processor.
Embodiment
In the present invention, in each clock cycle the fetch unit fetches instructions from two threads according to the values supplied by the fetch control device of the invention. The basic idea is to select, among the currently active threads, the two threads that occupy the fewest instruction queue entries, and to limit the number of instructions read from each selected thread jointly by two quantities: the value obtained by bitwise-negating its occupied-entry count and reducing the result modulo the constant 16, and the fetch bandwidth of the simultaneous multithreading processor. Under this limit, the fetch unit fetches first for the thread that occupies the fewest instruction queue entries and then uses the remaining fetch bandwidth to fetch for the second selected thread.
Referring to Fig. 3, a fetch method for a simultaneous multithreading processor comprises the following steps:
Step 100: count the instruction queue entries occupied by each of the threads running simultaneously;
Step 110: sort the resulting counts;
Step 120: select the two smallest values from the sorted sequence, denoted here $MIN_1$ and $MIN_2$ with $MIN_1 \le MIN_2$, and take the two threads corresponding to these values, denoted T1 and T2 respectively, as the threads selected for fetch in the current clock cycle; the other threads do not fetch in this clock cycle;
Step 130: apply bitwise logical negation to each of the two minimum values of step 120 to obtain two new values, denoted $\overline{MIN_1}$ and $\overline{MIN_2}$. Assuming the fetch bandwidth of the simultaneous multithreading processor is N, the upper bound on the number of instructions fetched from thread T1 in this clock cycle is:
$$I_1 = \mathrm{MIN}\left(\overline{MIN_1} \bmod 16,\ N\right) \qquad (1)$$
Here the MIN operation takes the smaller of its two arguments. The limit is an upper bound: while fetching from this thread, if a cache miss occurs, an instruction crosses a cache line boundary, a branch of the thread is mispredicted, or a similar event occurs, fetching from this thread stops early.
Once the number of instructions fetched from thread T1 is known, the upper bound on the number of instructions fetched from thread T2 is:
$$I_2 = \mathrm{MIN}\left(\overline{MIN_2} \bmod 16,\ N - I_1'\right) \qquad (2)$$
Here $I_1'$ is the number of instructions actually fetched from thread T1 under the limit of formula (1). As before, the limit is an upper bound: while fetching from this thread, if a cache miss occurs, an instruction crosses a cache line boundary, a branch of the thread is mispredicted, or a similar event occurs, fetching from this thread stops early.
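As a worked illustration of formulas (1) and (2), consider the following case. The numbers are chosen here for illustration and do not come from the patent. Inverting every bit of a non-negative count $x$ and then reducing modulo 16 keeps only the inverted low four bits, so

$$\overline{x} \bmod 16 = 15 - (x \bmod 16).$$

Take $N = 8$, $MIN_1 = 12$ and $MIN_2 = 13$, and assume T1 meets no stop condition, so that $I_1' = I_1$:

$$I_1 = \mathrm{MIN}(15 - 12,\ 8) = 3, \qquad I_1' = 3,$$

$$I_2 = \mathrm{MIN}(15 - 13,\ 8 - 3) = \mathrm{MIN}(2,\ 5) = 2.$$

Under these assumed values T1 is limited to 3 instructions and T2 to 2, whereas a policy that bounds the first thread only by the fetch bandwidth would let T1 take up to 8.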
Step 140: fetch according to the upper bounds on the number of instructions fetched.
The invention differs from ICOUNT.2.8, currently the best-performing fetch policy for simultaneous multithreading processors, in the number of instructions fetched from the selected threads. ICOUNT fetches as much as possible for the first selected thread, until a stop condition occurs or the maximum fetch bandwidth is reached, and fetches the remaining number of instructions for the second thread only if fetch bandwidth is left over. Under the same conditions as ICOUNT, the invention instead computes an upper bound on the number of instructions fetched for each thread, rather than setting the bound to the fetch bandwidth as ICOUNT does. The fetch bandwidth is thus used in a more balanced way, the situation where the first thread fetches too many instructions and the second thread does not get enough is avoided, and the overall performance of the threads improves further over ICOUNT.
Referring to Fig. 4, the part inside the dashed box is the part the invention concerns. In Fig. 4, the invention comprises a feedback-based circuit connected to the fetch unit of the simultaneous multithreading processor, namely the fetch control device for a simultaneous multithreading processor. The device comprises counters 1, a T-to-2 multiplexer 2, a first bitwise inverter 3, a second bitwise inverter 4, a first modulo-16 unit 5, a second modulo-16 unit 6, a first 2-to-1 selector 7, a second 2-to-1 selector 8, and a subtractor 9. The counters 1 record the number of instruction queue entries occupied by each thread; T in the T-to-2 multiplexer 2 is the number of threads.
While the simultaneous multithreading processor is running, in each clock cycle the values of the counters 1 are sent to the T-to-2 multiplexer 2, which selects the two smallest of its input values. The smallest value is output on the "selected thread 1" port and the second-smallest value on the "selected thread 2" port, and the corresponding thread numbers (that is, the counter numbers) are sent to the fetch unit; these thread numbers identify the threads from which instructions are read in this clock cycle. One output value of the T-to-2 multiplexer 2 passes through the first bitwise inverter 3 and the first modulo-16 unit 5 to the first 2-to-1 selector 7; the other output value passes through the second bitwise inverter 4 and the second modulo-16 unit 6 to the second 2-to-1 selector 8. For selected thread 1 (the thread corresponding to the smaller of the two selected values), the other input of the first 2-to-1 selector 7 is N, the fetch bandwidth of the simultaneous multithreading processor. For selected thread 2 (the other selected thread), the other input of the second 2-to-1 selector 8 is the output of the subtractor 9, whose inputs are N and the output of the first 2-to-1 selector 7 belonging to thread 1. The first 2-to-1 selector 7 and the second 2-to-1 selector 8 each output the smaller of their input values to the fetch unit of the simultaneous multithreading processor. The fetch unit fetches from the corresponding threads according to the outputs of the two selectors, and the counter 1 of each thread in the fetch device is updated according to the instructions actually fetched in this clock cycle, which closes the feedback loop.
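The datapath and its feedback loop can be modelled behaviourally as in the sketch below. It follows the device description, in which the subtractor takes the output of the first selector (not the number of instructions actually fetched). The class and method names are assumptions made here, and the `left` argument, which models queue entries freed as instructions leave the queues, is a hypothetical addition needed to make the counters go down.

```python
class FetchControl:
    """Behavioural model of the fetch control device (components 1 to 9)."""

    def __init__(self, num_threads, bandwidth=8, bits=4):
        self.counters = [0] * num_threads   # counters (1), one per thread
        self.bandwidth = bandwidth          # N, the fetch bandwidth
        self.mask = (1 << bits) - 1         # 4 bits correspond to modulo 16

    def _cap(self, value):
        # Inverter (3 or 4) followed by modulo unit (5 or 6):
        # keep the low bits of the count and flip them.
        return (value & self.mask) ^ self.mask

    def limits(self):
        # T-to-2 multiplexer (2): two smallest counters, ties by thread number.
        t1, t2 = sorted(range(len(self.counters)),
                        key=self.counters.__getitem__)[:2]
        out1 = min(self._cap(self.counters[t1]), self.bandwidth)  # selector (7)
        remaining = self.bandwidth - out1                         # subtractor (9)
        out2 = min(self._cap(self.counters[t2]), remaining)       # selector (8)
        return (t1, out1), (t2, out2)

    def update(self, fetched, left=None):
        # Feedback path: add the instructions actually fetched this cycle.
        for thread, n in fetched.items():
            self.counters[thread] += n
        # Hypothetical: subtract entries freed by instructions leaving the queues.
        for thread, n in (left or {}).items():
            self.counters[thread] -= n


fc = FetchControl(num_threads=3)
fc.counters = [12, 30, 13]
print(fc.limits())
fc.update({0: 3, 2: 2})
print(fc.counters)
```

Masking and XOR-ing with the same mask is one way to express "invert, then take modulo 16" for non-negative counts; a hardware implementation would simply invert the low four counter bits.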
The processor to which the fetch device is applied has a single shared fetch queue. In each clock cycle the fetch unit can fetch instructions from several active threads that are running simultaneously and place them in the shared fetch queue. The fetch queue holds instructions belonging to several different threads for the execution units of the processor, and no thread switch is needed when instructions of different threads execute. The threads share the cache of the processor and the main memory of the host system. When fetching, the simultaneous multithreading processor reads a number of instructions from the shared cache or main memory according to the program counter of the selected thread and places them in the shared fetch queue.
The parts connected after the register renaming unit in Fig. 4 are the same as the corresponding parts in Fig. 1 and are abbreviated with dashed lines in the figure.
The fetch device of the invention has very low implementation complexity: it can be built with little hardware, and its time complexity is constant.
Below, we illustrate the implementation of the invention with an example. For simplicity, suppose the simultaneous multithreading processor runs 2 threads at once; the reservation station queue has 32 entries in total, holding fixed-point and floating-point instructions after renaming; the fetch, decode, and rename queues (or windows) have 4 entries each; and the processor is 4-issue and fetches at most 4 instructions per cycle.
Under these assumptions, the total number of queue entries is 32 + 4 × 3 = 44.
Note that when the processor is 4-issue, the modulus used by the invention is 8.
The table below lists every possible distribution of instruction counts between the two threads. To save space, only the cases where thread 1 (T1) holds fewer instructions than thread 2 (T2) are shown; the opposite case is symmetric. In the table, 4' means that the fetch upper bound is 4 instructions, but that the fetch must fit within the bandwidth remaining after thread 1 has fetched.
Table 1. Number of instructions fetched per clock cycle: the invention versus ICOUNT.

Current instruction count    Invention fetch count (upper bound)    ICOUNT fetch count (upper bound)
T1     T2                    T1     T2                              T1     T2
1      1<T2<=43              4      4'                              4      4'
2      2<T2<=42              4      4'                              4      4'
3      3<T2<=41              4      4'                              4      4'
4      4<T2<=40              3      4'                              4      4'
5      5<T2<=39              2      4'                              4      4'
6      6<T2<=38              1      4'                              4      4'
7      7<T2<=37              0      4'                              4      4'
8      8<T2<=36              4      4'                              4      4'
9      9<T2<=35              4      4'                              4      4'
10     10<T2<=34             4      4'                              4      4'
11     11<T2<=33             4      4'                              4      4'
12     12<T2<=32             3      4'                              4      4'
13     13<T2<=31             2      4'                              4      4'
14     14<T2<=30             1      4'                              4      4'
15     15<T2<=29             0      4'                              4      4'
16     16<T2<=28             4      4'                              4      4'
17     17<T2<=27             4      4'                              4      4'
18     18<T2<=26             4      4'                              4      4'
19     19<T2<=25             4      4'                              4      4'
20     20<T2<=24             3      4'                              4      4'
21     21<T2<=23             2      4'                              4      4'
22     22<T2<=22             1      4'                              4      4'
Table 1 shows that the invention effectively limits the upper bound on the number of instructions fetched from a thread, so that the fetch bandwidth is allocated more effectively and the performance of the simultaneous multithreading processor improves.
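The T1 column of Table 1 can be reproduced with a short script. This sketch uses the example's parameters (fetch bandwidth 4, modulus 8) and covers only the bound for thread 1; the function name is an assumption made here, and the T2 column, which depends on the bandwidth left over, is not modelled.

```python
def invention_limit(count, bandwidth=4, modulus=8):
    """Upper bound for the first selected thread: MIN(NOT(count) mod modulus, N)."""
    return min((~count) % modulus, bandwidth)


ICOUNT_LIMIT = 4  # ICOUNT lets the first thread use the whole fetch bandwidth

# One row per T1 instruction count from 1 to 22, as in Table 1.
table = [(t1, invention_limit(t1), ICOUNT_LIMIT) for t1 in range(1, 23)]
for t1, invention, icount in table:
    print(f"T1 count {t1:2d}: invention {invention}, ICOUNT {icount}")
```

The bound falls from 4 to 0 as the count approaches each multiple of 8 and then returns to 4, which is the repeating pattern visible in the table.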
Finally, it should be noted that the above embodiment is intended only to illustrate, not to limit, the technical solution of the present invention. Although the invention has been described in detail with reference to the above embodiment, those of ordinary skill in the art should understand that the invention may still be modified or equivalently replaced, and that any modification or partial replacement that does not depart from the spirit and scope of the invention shall fall within the scope of its claims.

Claims (8)

1. An instruction fetch method applied to a simultaneous multithreading processor, comprising the following steps:
a) counting the number of instruction queue entries occupied by each of the threads running simultaneously;
b) sorting the resulting counts;
c) selecting the two smallest values from the sorted sequence and taking the two threads corresponding to these values as the threads selected for fetching in this clock cycle, the other threads not fetching in this clock cycle;
d) performing a logical negation on each of the two smallest values from step c) and taking each result modulo 16 to obtain two corresponding new values, and calculating from them the fetch count upper limit for each of the two smallest values;
e) fetching instructions according to the fetch count upper limits.
2, the method for claim 1 is characterized in that in the step c), is limited on the instruction fetch bar number of the smaller value of described minimum value and gets the finger bandwidth, the value after value mould 16 computings after operating with the smaller value logical inverse of described minimum value, the value of minimum among both.
3, the method for claim 1, it is characterized in that in the step c), be limited on the instruction fetch bar number of the higher value of described minimum value: get and refer to bandwidth and the difference of counting the actual instruction bar number of getting of the upper limit less than the instruction fetch bar of the smaller value of described minimum value, with the value after value mould 16 computings after the operation of the higher value logical inverse of described minimum value, minimum value among both.
4, the method for claim 1, it is characterized in that described instruction fetch bar counts the upper limit and be meant when getting from this thread when referring to, do not hit, instruct situations such as crossing over cache memory border, thread branch misprediction if cache memory takes place, then stop to continue to get finger from this thread.
5. An instruction fetch device applied to a simultaneous multithreading processor, comprising: counters for recording the number of instruction queue entries occupied by each thread, a T-to-2 multiplexer, a first bitwise inverter, a second bitwise inverter, a first modulo-16 arithmetic unit, a second modulo-16 arithmetic unit, a first 2-to-1 selector, a second 2-to-1 selector, and a subtracter; wherein, in each clock cycle, the T-to-2 multiplexer selects the two smallest values from the data values output by the counters and outputs them, and the corresponding counter numbers are output to the instruction fetch unit; the smallest value passes through the first bitwise inverter and the first modulo-16 arithmetic unit and is then delivered to the first 2-to-1 selector; the second-smallest value passes through the second bitwise inverter and the second modulo-16 arithmetic unit and is then delivered to the second 2-to-1 selector; the other input of the first 2-to-1 selector is the fetch bandwidth, and the other input of the second 2-to-1 selector is the output of the subtracter; the inputs of the subtracter are the fetch bandwidth and the output of the first 2-to-1 selector; the first and second 2-to-1 selectors each select the smaller of their input values and output it to the instruction fetch unit of the simultaneous multithreading processor; and the instruction fetch unit fetches instructions from the corresponding threads according to the output values of the first and second 2-to-1 selectors, and updates the value of each thread's counter in the fetch device according to the instructions actually fetched in that clock cycle.
6. The instruction fetch device of claim 5, characterized in that, in the T-to-2 multiplexer, T is the number of threads.
7. The instruction fetch device of claim 5, characterized in that the counters comprise X counters, X being the number of threads.
8. The instruction fetch device of claim 5, characterized in that the instruction queue comprises all of the following queues in the simultaneous multithreading processor: the fetch queue, the decode queue, the rename queue, and the fixed-point/floating-point issue queues.
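Read together, the method steps of claims 1 to 3 and the datapath of claim 5 describe a per-cycle computation that can be sketched in software as follows. This is an illustrative model under stated assumptions, not the claimed hardware: the function name `select_and_limit`, the default fetch bandwidth of 8, and the tie-breaking by thread index are all assumptions, and the subtracter is modelled as taking the first selector's output (one reading of claim 5) rather than the count actually fetched (claim 3).

```python
from typing import List, Tuple


def select_and_limit(
    counts: List[int], bandwidth: int = 8, modulus: int = 16
) -> List[Tuple[int, int]]:
    """Return [(thread_id, fetch_limit), ...] for the two threads chosen this cycle.

    counts[i] is the number of instruction-queue entries held by thread i.
    """
    if len(counts) < 2:
        raise ValueError("need at least two threads")

    # Steps a)-c): order threads by occupancy and keep the two smallest
    # (the T-to-2 multiplexer). Ties are broken by thread index (assumption).
    order = sorted(range(len(counts)), key=lambda i: (counts[i], i))
    first, second = order[0], order[1]

    # Step d): bitwise inverter followed by the modulo unit, for each value.
    inv_first = ~counts[first] % modulus
    inv_second = ~counts[second] % modulus

    # First 2-to-1 selector: the smaller of the bandwidth and the inverted value.
    limit_first = min(bandwidth, inv_first)

    # Subtracter: bandwidth left over after the first thread's limit.
    remaining = bandwidth - limit_first

    # Second 2-to-1 selector: the smaller of the remainder and the inverted value.
    limit_second = min(remaining, inv_second)

    return [(first, limit_first), (second, limit_second)]


# Four threads: threads 2 and 1 hold the fewest entries and are selected.
print(select_and_limit([30, 13, 12, 20]))  # [(2, 3), (1, 2)]
```

In this model the second thread can be left with no bandwidth at all when the first thread's limit equals the full bandwidth; a model following claim 3 would instead subtract the number of instructions the first thread actually fetched.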
CNB2004100092885A 2004-06-30 2004-06-30 Control device and its method for fetching instruction simultaneously used on multiple thread processors Expired - Lifetime CN100377076C (en)

Publications (2)

Publication Number Publication Date
CN1716183A (en) 2006-01-04
CN100377076C CN100377076C (en) 2008-03-26

Family

ID=35822054

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
C14 Grant of patent or utility model
GR01 Patent grant
EE01 Entry into force of recordation of patent licensing contract

Assignee: Beijing Loongson Technology Service Center Co.,Ltd.

Assignor: Institute of Computing Technology, Chinese Academy of Sciences

Contract fulfillment period: 2009.12.16 to 2028.12.31

Contract record no.: 2010990000062

Denomination of invention: Control device and its method for fetching instruction simultaneously used on multiple thread processors

Granted publication date: 20080326

License type: exclusive license

Record date: 20100128

LIC Patent licence contract for exploitation submitted for record

Free format text: EXCLUSIVE LICENSE; TIME LIMIT OF IMPLEMENTING CONTACT: 2009.12.16 TO 2028.12.31; CHANGE OF CONTRACT

Name of requester: BEIJING LOONGSON TECHNOLOGY SERVICE CENTER CO., LT

Effective date: 20100128

EC01 Cancellation of recordation of patent licensing contract

Assignee: LOONGSON TECHNOLOGY Corp.,Ltd.

Assignor: Institute of Computing Technology, Chinese Academy of Sciences

Contract record no.: 2010990000062

Date of cancellation: 20141231

EM01 Change of recordation of patent licensing contract

Change date: 20141231

Contract record no.: 2010990000062

Assignee after: LOONGSON TECHNOLOGY Corp.,Ltd.

Assignee before: Beijing Loongson Technology Service Center Co.,Ltd.

LICC Enforcement, change and cancellation of record of contracts on the licence for exploitation of a patent or utility model
EE01 Entry into force of recordation of patent licensing contract

Application publication date: 20060104

Assignee: LOONGSON TECHNOLOGY Corp.,Ltd.

Assignor: Institute of Computing Technology, Chinese Academy of Sciences

Contract record no.: 2015990000066

Denomination of invention: Control device and its method for fetching instruction simultaneously used on multiple thread processors

Granted publication date: 20080326

License type: Common License

Record date: 20150211

TR01 Transfer of patent right

Effective date of registration: 20200824

Address after: 100095, Beijing, Zhongguancun Haidian District environmental science and technology demonstration park, Liuzhou Industrial Park, No. 2 building

Patentee after: LOONGSON TECHNOLOGY Corp.,Ltd.

Address before: 100080 Haidian District, Zhongguancun Academy of Sciences, South Road, No. 6, No.

Patentee before: Institute of Computing Technology, Chinese Academy of Sciences

EC01 Cancellation of recordation of patent licensing contract

Assignee: LOONGSON TECHNOLOGY Corp.,Ltd.

Assignor: Institute of Computing Technology, Chinese Academy of Sciences

Contract record no.: 2015990000066

Date of cancellation: 20200928

CP01 Change in the name or title of a patent holder

Address after: 100095 Building 2, Longxin Industrial Park, Zhongguancun environmental protection technology demonstration park, Haidian District, Beijing

Patentee after: Loongson Zhongke Technology Co.,Ltd.

Address before: 100095 Building 2, Longxin Industrial Park, Zhongguancun environmental protection technology demonstration park, Haidian District, Beijing

Patentee before: LOONGSON TECHNOLOGY Corp.,Ltd.

CX01 Expiry of patent term

Granted publication date: 20080326
