CN105824605B

CN105824605B - Controllable dynamic multithreading method and processor

Info

Publication number: CN105824605B
Application number: CN201610272367.8A
Authority: CN
Inventors: 王生洪
Original assignee: Individual
Current assignee: Individual
Priority date: 2016-04-28
Filing date: 2016-04-28
Publication date: 2018-04-13
Anticipated expiration: 2036-04-28
Also published as: CN105824605A

Abstract

The invention discloses a controllable dynamic multithreading method and a processor. For a processor with a pipelined structure, the method adds a mark to the instruction structure. The mark carries two pieces of information: the thread to which the marked instruction belongs and the priority of that instruction. The processor controls each instruction according to its mark, issuing and executing the instruction according to the thread and priority information in the mark. The processor comprises at least an instruction system containing the mark, a program execution control unit (Branch) that can identify and track the mark, an instruction decoding circuit that can identify and decode the mark, an arithmetic operation unit that can identify and decode the mark, and the corresponding memory unit. The invention can dynamically schedule all the arithmetic hardware resources of a processor, improving its computing capability without adding much complex hardware.

Description

Controllable dynamic multithreading method and processor

Technical field

The present invention relates to the field of processors, and more particularly to a controllable dynamic multithreading (Dynamic Multi-threading) method and processor.

Background art

To improve the computing capability of processors, many parallel processing techniques have been developed, such as superscalar (Super-scalar), pipelining (Pipeline), very long instruction word (VLIW), and single instruction, multiple data (SIMD). However, because the instructions of a software program are processed sequentially, the instruction and data dependencies (dependencies) that arise during execution frequently force the processor to wait, which limits the efficiency these parallel processing techniques can deliver.

To overcome dependencies in instruction execution, techniques that improve instruction issue efficiency have been developed, such as out-of-order execution (Out-of-Order) and branch prediction (Branch Prediction), but these techniques have their limitations. Either their hardware is extremely complex, or the efficiency gain is limited, and they are unsuitable for embedded systems. An embedded system, especially a mobile device such as a mobile communication device, a vehicle-mounted device, or a wearable device, requires not only high computing capability from its processor but also low power consumption and strong real-time performance.

Multithreaded parallel processing (Multi-Threading) can process two or more completely independent programs in parallel on the same processor, so it handles comparatively well the loss of efficiency caused by control and data dependencies during instruction execution. Among these techniques, simultaneous multithreading (Simultaneous Multi-threading, SMT) and token-triggered multithreading (Token Triggered Multi-threading) have been applied successfully in some processor products. For example, Intel's Hyper-Threading, IBM's POWER5, Sun Microsystems' UltraSPARC T2, and MIPS MT all use SMT. Sandbridge's Sandblaster DSP core uses token-triggered multithreading.

Although SMT can resolve dependency problems during program execution, each thread needs its own set of registers for program execution, and SMT must also add thread-tracking logic at every pipeline stage and enlarge shared resources such as the instruction cache and TLBs. The thread-tracking logic must not only track each thread's progress but also check and determine whether the thread has finished executing. Because a large number of threads are in an executing or partially executed state, the CPU's caches and TLBs must be large enough to avoid unnecessary thrashing between threads. The hardware complexity grows sharply with the number of threads, which makes SMT difficult to apply to embedded and low-power processor designs.

The following table shows the execution process of a typical SMT multithreaded program:

Token-triggered multithreading is a form of time-division multithreading. Because it can execute instructions from only one thread's program in each clock cycle, its hardware is much simpler than that of SMT, but its efficiency is correspondingly lower. Its characteristics are:

1. Only one thread issues instructions in each clock cycle;

2. All threads are started in sequence, as shown in Fig. 1, which simplifies the thread selection circuit;

3. Every thread has the same number of clock cycles for executing its instructions, so no dependency checking or bypass hardware is needed;

4. The result of an operation is guaranteed to be available before the thread's next execution.

The following table shows program execution under token-triggered multithreading:

Clock cycle	Thread	Activity
i	T0	issues instructions j, j+1 and j+2
i+1	T1	issues instructions k and k+1
i+2	T2	issues instruction l
i+3	T3	issues instructions m, m+1 and m+2
i+4	T0	instruction missing; the processor waits
i+5	T1	issues instruction k+2
i+6	T2	issues instructions l+1 and l+2
i+7	T3	instruction missing; the processor waits

However, a token-triggered multithreaded processor can execute only the designated thread in each assigned clock cycle. If, in that cycle, the designated thread cannot issue an instruction because an instruction or data is missing (missing) or because of a dependency, the clock cycle is wasted. To overcome this defect of token-triggered multithreading, opportunity multithreading (also called chance multithreading) was developed.
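As a minimal sketch of the wasted-slot behaviour described above, the following Python model assigns each clock cycle to a thread in fixed rotation and records an empty slot when that thread cannot issue. The function name `token_triggered_schedule` and the `ready` table are illustrative assumptions, not part of the patent; a real core would derive readiness from cache misses and dependency checks.

```python
from typing import Dict, List, Optional


def token_triggered_schedule(
    ready: Dict[int, List[bool]], num_threads: int, cycles: int
) -> List[Optional[int]]:
    """Return, per clock cycle, the issuing thread, or None if the slot is wasted.

    ready[t][c] is True when thread t could issue in cycle c (hypothetical input).
    """
    slots: List[Optional[int]] = []
    for c in range(cycles):
        owner = c % num_threads  # each cycle belongs to exactly one thread
        slots.append(owner if ready[owner][c] else None)
    return slots


# Four threads over eight cycles; T0 stalls in cycle 4 and T3 stalls in cycle 7,
# matching the two "processor waits" rows of the table above.
ready = {t: [True] * 8 for t in range(4)}
ready[0][4] = False
ready[3][7] = False

slots = token_triggered_schedule(ready, num_threads=4, cycles=8)
wasted = slots.count(None)
print(slots, "wasted cycles:", wasted)
```

In this model no other thread may use a stalled thread's cycle, so every stall costs one full clock cycle.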

Opportunity multithreading allows a multithreaded processor, when a thread has no valid instruction in a given clock cycle, to hand that clock cycle to another thread that does have a valid instruction instead of holding (HOLD) the cycle. The clock cycle that would otherwise have been wasted is given to another thread as an "opportunity".

In a multithreaded processor that uses this method, a thread is no longer limited to issuing instructions once per thread cycle. In any clock cycle, a thread that is able to issue can take the "opportunity" whenever the originally scheduled thread has no valid instruction in that cycle.
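The slot-donation idea in the two paragraphs above can be sketched as follows. This is only an illustration under stated assumptions: the function name `opportunity_schedule`, the `ready` table, and the rule that a free slot goes to the next ready thread in rotation order are all hypothetical, since the text does not say how the substitute thread is chosen.

```python
from typing import Dict, List, Optional


def opportunity_schedule(
    ready: Dict[int, List[bool]], num_threads: int, cycles: int
) -> List[Optional[int]]:
    """Return, per clock cycle, the thread that issues (None if no thread is ready)."""
    slots: List[Optional[int]] = []
    for c in range(cycles):
        owner = c % num_threads
        chosen: Optional[int] = None
        # Try the scheduled owner first, then the other threads in rotation order
        # (the rotation order is an assumption made for this sketch).
        for offset in range(num_threads):
            t = (owner + offset) % num_threads
            if ready[t][c]:
                chosen = t
                break
        slots.append(chosen)
    return slots


# Same stall pattern as a round-robin example: T0 stalls in cycle 4, T3 in cycle 7.
ready = {t: [True] * 8 for t in range(4)}
ready[0][4] = False
ready[3][7] = False

slots = opportunity_schedule(ready, num_threads=4, cycles=8)
print(slots)
```

Here the two cycles that a fixed rotation would waste are used by threads 1 and 0 respectively.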

1. Like token-triggered multithreading, opportunity multithreading is a form of time-division multithreading: only one program can execute in each clock cycle. The number of threads it can execute is limited by the number of hardware threads.

2. Opportunity multithreading needs a branch prediction circuit. For a processor with a VLIW structure, the dependencies of every sub-instruction must be predicted, so the prediction circuit is quite complex.

3. Opportunity multithreading needs a set of two-dimensional thread identity (ID) registers to track the execution of each thread's instructions at every pipeline stage and to ensure that result data are not mixed up.

4. In practice, each arithmetic unit of a processor using opportunity multithreading must add, for every thread, a set of two-dimensional data registers that is completely independent of the other threads, to prevent data thrashing between threads in a partially executed state.

5. To issue an instruction in every processor clock cycle, the instruction memory of each thread must also run at the same clock frequency as the processor so that the thread can read its instructions in time. As a result, one advantage of multithreading, the reduced power consumption of the memory, is lost.

The analysis above shows that a processor using opportunity multithreading has considerably more complex hardware than one using token-triggered multithreading. Moreover, so that every thread can read instructions in every clock cycle, the clock frequency of its instruction memory must equal the processor's main clock frequency, which noticeably increases the processor's power consumption. Opportunity multithreading is therefore not suitable for low-power embedded processor designs.

Fig. 2 is a schematic diagram of program execution under opportunity multithreading.

Summary of the invention

The technical problem to be solved by the invention is to provide a controllable dynamic multithreading method and processor that address the defects described in the background art.

The invention adopts the following technical solution to solve the above technical problem:

A controllable dynamic multithreading method: for a processor that uses a pipelined structure and has an I-cache, a mark is added to the instruction structure. The mark carries two pieces of information: the thread to which the marked instruction belongs and the priority of the marked instruction. The priority information indicates the execution order of the instruction and its correlation with the preceding and following instructions. The processor controls each instruction according to its mark, issuing and executing the instruction according to its priority information and the thread to which it belongs.

As a further refinement of the controllable dynamic multithreading method of the invention, the processor controls each instruction according to its mark and issues and executes it according to its priority information and thread through the following steps:

Step 1): read an instruction according to the priority information in the marks of the instructions waiting to be executed;

Step 2): instruction decoding and distribution:

The decoding circuit of the processor decodes the instruction read in step 1) into the mark and the individual sub-instructions, and the distribution logic of the processor assigns the sub-instructions to different arithmetic units for execution according to their functions;

Step 3): instruction execution:

For each sub-instruction, the processor reads the data of the corresponding registers according to the thread information in the mark of the instruction to which it belongs, and stores the execution result in the registers of that thread;

Step 4): jump to step 1).

Depending on the specific hardware implementation, steps 1) and 2) sometimes take multiple clock cycles and sometimes only one clock cycle; step 3) takes n-1 clock cycles, where n is the number of pipeline stages of the processor's arithmetic unit.
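The register handling in step 3) can be illustrated with a small sketch in which the thread number taken from the mark selects a private register file. The names `make_register_files` and `execute`, the two operations, and the register-file sizes are assumptions made for illustration only; the text does not specify an instruction set.

```python
from typing import List


def make_register_files(num_threads: int, regs_per_thread: int) -> List[List[int]]:
    """Create one private register file per thread (sizes are illustrative)."""
    return [[0] * regs_per_thread for _ in range(num_threads)]


def execute(
    register_files: List[List[int]],
    thread: int,
    op: str,
    dst: int,
    src_a: int,
    src_b: int,
) -> None:
    """Execute one sub-instruction using only the registers of the marked thread."""
    regs = register_files[thread]  # selected by the thread number in the mark
    a, b = regs[src_a], regs[src_b]
    if op == "add":
        result = a + b
    elif op == "mul":
        result = a * b
    else:
        raise ValueError(f"unsupported operation: {op}")
    regs[dst] = result  # the result goes back to the same thread's registers


banks = make_register_files(num_threads=4, regs_per_thread=8)
banks[0][1], banks[0][2] = 3, 4
banks[1][1], banks[1][2] = 10, 20

# Two sub-instructions with identical register numbers but different marks.
execute(banks, thread=0, op="add", dst=0, src_a=1, src_b=2)
execute(banks, thread=1, op="mul", dst=0, src_a=1, src_b=2)
print(banks[0][0], banks[1][0])
```

Because the thread number indexes the register file, instructions from different threads can use the same register numbers without overwriting each other's results.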

As a further refinement of the controllable dynamic multithreading method of the invention, the detailed steps of step 1) are as follows:

Step 1.1): the instruction reading circuit of the processor checks whether the I-Cache holds instructions waiting to be executed, that is, whether any instruction is in the Valid state;

Step 1.1.1): if only one instruction is in the Valid state, read that instruction;

Step 1.1.2): if two or more instructions are in the Valid state, check which instruction has the higher priority according to the marks of the instructions;

Step 1.1.2.1): if one instruction has a higher priority than the others, read that instruction;

Step 1.1.2.2): if no instruction has a higher priority than the others, determine whether there is a thread of a previously executed instruction;

Step 1.1.2.2.1): if there is a thread of a previously executed instruction, read an instruction whose thread differs from that thread, or read instructions in thread order;

Step 1.1.2.2.2): if there is no thread of a previously executed instruction, read instructions in thread order.
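The selection rules of step 1) can be sketched as a small arbitration function. This is one possible reading, not the patented circuit: the names `Candidate` and `select_instruction` are hypothetical, a larger number is assumed to mean higher priority, and when several instructions tie for the highest priority the tie-break is applied only among those tied instructions.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Candidate:
    thread: int    # thread number from the mark
    priority: int  # priority from the mark (larger = higher, an assumption)


def select_instruction(
    valid: List[Candidate], last_thread: Optional[int]
) -> Optional[Candidate]:
    """Pick the next instruction to read among those in the Valid state."""
    if not valid:
        return None                      # nothing is Valid, nothing to read
    if len(valid) == 1:
        return valid[0]                  # step 1.1.1
    top = max(c.priority for c in valid)
    best = [c for c in valid if c.priority == top]
    if len(best) == 1:
        return best[0]                   # step 1.1.2.1
    by_thread = sorted(best, key=lambda c: c.thread)
    if last_thread is not None:          # step 1.1.2.2.1
        for c in by_thread:
            if c.thread != last_thread:
                return c
    return by_thread[0]                  # fall back to thread order


print(select_instruction([Candidate(0, 0), Candidate(1, 1)], last_thread=None))
print(select_instruction([Candidate(0, 0), Candidate(1, 0)], last_thread=0))
```

The first call picks thread 1 because of its higher priority; the second picks thread 1 because the priorities tie and thread 0 executed last.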

As a further refinement of the controllable dynamic multithreading method of the invention, the mark is written by software or written automatically by the compiler during compilation.

As a further refinement of the controllable dynamic multithreading method of the invention, the processor is a multiple-instruction-issue processor in which every instruction independently carries its own mark.

As a further refinement of the controllable dynamic multithreading method of the invention, the processor is a multiple-instruction-issue processor in which multiple instructions share one mark.

As a further refinement of the controllable dynamic multithreading method of the invention, the processor is a single-instruction-issue processor in which each instruction has its own corresponding mark.

The invention also discloses a processor based on the controllable dynamic multithreading method, comprising at least an instruction system containing the mark, a program execution control unit that can identify and track the mark, an instruction decoding circuit that can identify and decode the mark, an arithmetic operation unit that can identify and decode the mark, and the corresponding memory unit.

Compared with the prior art, the invention, by adopting the above technical solution, has the following technical effects:

1. The hardware resources of the processor can be scheduled efficiently, and the priority and correlation of instructions can be determined effectively, without a complex multithread control circuit or a complex instruction validity prediction circuit;

2. Instructions are executed in the order of their priorities, so there is no concern that missing instructions or data will waste hardware resources or corrupt operation results;

3. The utilization of the processor's hardware resources is effectively improved, which in turn reduces power consumption.

Brief description of the drawings

Fig. 1 is the thread flow diagram of token-triggered multithreading with four threads;

Fig. 2 is a schematic diagram of program execution under opportunity multithreading;

Fig. 3 is a structure diagram of a single instruction with a mark;

Fig. 4 is a structure diagram of multiple instructions with a single mark;

Fig. 5 is a structure diagram of multiple instructions with multiple marks;

Fig. 6 is a multithreaded execution flow diagram with a 6-stage pipeline;

Fig. 7 is a block diagram of a processor with software-controllable dynamic multithreading.

Detailed description of the embodiments

The technical solution of the invention is described in further detail below with reference to the accompanying drawings:

The invention adds, to the instruction system of a processor that uses a multi-stage pipelined structure, an identifier (mark) holding the thread identity and priority information of the corresponding instruction. While reading (Fetch) an instruction, the instruction system of the processor also obtains the mark, which gives the identity of the thread executing the instruction and its priority. The instruction control unit (Branch) of the processor arranges the processor's hardware resources and the execution order according to the information in the mark. The mark accompanies the instruction through every step of its execution so that the execution of the instruction can be tracked, and its priority information indicates the dependency between this instruction and the instructions/data before and after it, as well as the order in which it should be executed.

The content of the mark can be set by the programmer at programming time, who chooses the thread that executes the program instruction and its execution priority according to the requirements of the application system. Alternatively, the compiler can set the thread automatically during compilation and set the priority according to the function of the program, after determining the correlation between the instruction and the instructions and data before and after it.

The execution thread of a program is set in software, and the priority of every instruction in the program, together with its correlation with the instructions executed before and after it, is attached to each instruction as an identifier (mark). The processor hardware only needs to recognize the information in these marks to schedule its hardware resources dynamically and to execute multithreaded instruction operations efficiently.

Setting and managing the execution threads of a multithreaded processor in software also frees the number of simultaneously running threads from the limits imposed by the processor's instruction issue width and number of pipeline stages. It also avoids the waste of clock cycles and hardware resources that occurs when there are fewer program threads than pipeline stages.

To implement the software-controllable dynamic multithreading method, the instruction system of the processor must, in addition to the usual instruction word of the program, add an identifier containing the thread number and priority information, attached to the instruction word as a mark, as shown in Fig. 3. The mark in the figure is a binary number of at least 2 bits.

Take a 3-bit mark as an example:

Suppose mark = "000": the thread executing the instruction is 0 and the priority is 0 (0 denotes low priority).

Suppose mark = "101": the thread executing the instruction is 1 and the priority is 1 (1 denotes high priority).
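The text does not state which bits of the mark hold the thread number and which hold the priority. Both 3-bit examples above are consistent with a layout in which the most significant bit is the priority and the two low bits are the thread number, so the sketch below assumes that layout; the names `decode_mark` and `encode_mark` and the two bit-width constants are hypothetical.

```python
from typing import Tuple

PRIORITY_BITS = 1  # assumption: one priority bit in the most significant position
THREAD_BITS = 2    # assumption: two thread-number bits below it


def decode_mark(mark: str) -> Tuple[int, int]:
    """Split a binary mark string into (thread, priority) under the assumed layout."""
    if len(mark) != PRIORITY_BITS + THREAD_BITS or set(mark) - {"0", "1"}:
        raise ValueError(f"not a {PRIORITY_BITS + THREAD_BITS}-bit binary mark: {mark!r}")
    priority = int(mark[:PRIORITY_BITS], 2)
    thread = int(mark[PRIORITY_BITS:], 2)
    return thread, priority


def encode_mark(thread: int, priority: int) -> str:
    """Build the binary mark string for a thread number and a priority."""
    if not 0 <= thread < 2 ** THREAD_BITS:
        raise ValueError("thread number does not fit in the mark")
    if not 0 <= priority < 2 ** PRIORITY_BITS:
        raise ValueError("priority does not fit in the mark")
    return format(priority, f"0{PRIORITY_BITS}b") + format(thread, f"0{THREAD_BITS}b")


print(decode_mark("000"))  # thread 0, priority 0
print(decode_mark("101"))  # thread 1, priority 1
print(encode_mark(thread=1, priority=1))
```

A wider mark would simply use larger constants; with this layout a 3-bit mark can distinguish four threads and two priority levels.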

The specific value of the mark can be set by the programmer, who chooses the execution thread and priority of a section of the program according to the system requirements when programming, or it can be generated automatically by the compiler during compilation according to the function of the program.

The software-controllable dynamic multithreading method of the invention can be used not only in single-instruction-issue processors but also in multiple-instruction-issue processors.

In a multiple-instruction-issue processor, the instructions issued together can share one mark, or each instruction can carry its own mark.

Fig. 3 shows the instruction structure of a single instruction with a single mark.

Fig. 4 shows the instruction structure of multiple instructions with a single mark, where instruction words 1, 2, ..., n must be different instructions of the same thread's program. The single-mark instruction-word structure supports only time-division multithreading.

Fig. 5 shows the instruction structure of multiple instruction words with multiple marks, where M denotes a mark. Because each instruction word has its own mark, these instructions can belong to the programs of different threads. This multi-mark instruction structure is suitable for simultaneous multithreading.
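The difference between the single-mark and multi-mark instruction structures can be modelled with two small data types. This is only an illustrative sketch: the class names `Mark`, `SingleMarkWord`, and `MultiMarkWord`, the helper `threads_issued`, and the operation names are all hypothetical.

```python
from dataclasses import dataclass
from typing import Set, Tuple, Union


@dataclass(frozen=True)
class Mark:
    thread: int
    priority: int


@dataclass(frozen=True)
class SingleMarkWord:
    """Several instructions sharing one mark (the Fig. 4 style)."""
    mark: Mark
    instructions: Tuple[str, ...]


@dataclass(frozen=True)
class MultiMarkWord:
    """Several instructions, each with its own mark (the Fig. 5 style)."""
    slots: Tuple[Tuple[Mark, str], ...]


def threads_issued(word: Union[SingleMarkWord, MultiMarkWord]) -> Set[int]:
    """Return the set of threads whose instructions issue together in one word."""
    if isinstance(word, SingleMarkWord):
        return {word.mark.thread}
    return {mark.thread for mark, _ in word.slots}


shared = SingleMarkWord(Mark(0, 0), ("add", "mul", "load"))
mixed = MultiMarkWord(
    ((Mark(0, 1), "add"), (Mark(2, 0), "mul"), (Mark(3, 0), "load"))
)
print(threads_issued(shared), threads_issued(mixed))
```

A single-mark word always issues for exactly one thread per cycle, which is why it is limited to time-division multithreading, while a multi-mark word can mix threads in the same cycle.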

The execution steps of the dynamic multithreading method of the invention are as follows:

Step 1 (clock cycle 0), instruction fetch: the I-Cache read control circuit of the processor checks whether any instruction is waiting to be executed (Valid). If two or more instructions are Valid (the I-Cache of a multithreaded processor should have at least 2 banks), it checks which instruction has the higher priority. If one instruction has a higher priority, that instruction is read; if the priorities are the same, it reads an instruction whose thread differs from the thread of the previously executed instruction, or reads instructions in thread order;

Step 2 (clock cycle 1), instruction decoding and distribution: the decoding circuit decodes instruction 1, instruction 2 and instruction 3, and the distribution logic then assigns them to different arithmetic units for execution according to the functions of the decoded instructions;

Step 3 (clock cycles 2 to n+1), instruction execution: the processor reads the data of the corresponding registers according to the thread information in the mark and stores the execution result in the registers of that thread. Taking the instruction control circuit as an example, it executes program instructions sequentially from the contents of the PC register selected by the thread information in the mark, reads the data of the thread's other function registers (such as the loop counter, jump and condition registers) as the instruction requires, and writes the results of executing the instruction back to those registers of that thread;

The value of n in clock cycles 2 to n+1 is determined by the number of pipeline stages of the processor's arithmetic unit: n equals 4 for a 4-stage pipeline and 6 for a 6-stage pipeline;

After clock cycle n+1 of step 3 completes, execution returns to step 1.

Because the single-mark, multiple-instruction dynamic multithreading architecture of the invention is a time-division multithreaded architecture, when the current program reaches step 2 (clock cycle 1), the I-Cache read control circuit of the processor repeats step 1: it checks the validity (Valid) of the instructions for the next step and decides, according to Valid, which thread's program instruction to read.

When the current program reaches step 3 (clock cycle 2), the I-Cache read control circuit again repeats step 1 and reads the third group of instructions according to the Valid information of the instructions, while the decode and distribution circuit of the processor repeats step 2, decoding and distributing the instructions of program 2; and so on, cycle after cycle.

Fig. 6 shows the execution flow of dynamic multithreading on a structure with a 6-stage pipeline (arithmetic unit). In the figure:

T: thread;

Y: thread number, Y = 0, 1, 2, ..., n, used to denote thread T(Y); for example, T(2) denotes thread 2;

The value of Y is given by the mark in the instruction word;

i: the i-th issue of the same thread within one instruction cycle; in this example an instruction cycle equals 6 clock cycles;

j: pipeline stage;

For example, T(3_{2,4}) denotes issue 2 of thread 3 and its state in pipeline stage 4.

The flow in the figure corresponds to step 3 described above, with n equal to 6; that is, the processor has already read the instruction, decoded it, and distributed it to the corresponding processing unit, which has obtained the thread and priority information.

The operating process of dynamic multithreading is as follows (assuming that programs 0, 1, 2, ..., 5 are all independent threads):

Clock cycle 0 (C0) (clock cycle 0 here corresponds to cycle 2 described above): the processing unit of the processor reads the mark portion of the decoded instruction word and obtains the thread Y of the current instruction. Suppose Y = 0, that is, the instruction belongs to the program of thread 0. The instruction is assigned to thread T(0_{0,0}) and starts executing from pipeline stage 0;

Clock cycle 1 (C1): the processor reads the next instruction and decodes its mark to obtain Y = 1, which shows that the instruction belongs to the program of thread 1. The instruction is assigned to thread T(1_{0,0}) and starts executing from the first pipeline stage (stage 0), while the previous instruction has advanced to pipeline stage 1, so its state becomes T(0_{0,1});

Clock cycle 2 (C2): under normal circumstances the processor would read an instruction of the program of thread 2, that is, Y = 2. For some reason, however, the instruction of thread 2's program is missing, while an instruction of thread 0's program is ready. The processor can then read the mark of that instruction; if decoding gives Y = 0 and a priority equal to 1 (no need to wait for the result of the previous instruction of thread 0's program), the processor assigns thread T(0_{1,0}) and starts executing the instruction. The states of the two earlier instructions become, in order, T(0_{0,2}) and T(1_{0,1});

Clock cycle 3 (C3): the processor reads an instruction and decodes its mark to obtain Y = 3, so the instruction is assigned to thread T(3_{0,0}) and starts executing. The states of the earlier instructions become, in order, T(0_{0,3}), T(1_{0,2}) and T(0_{1,1});

Clock cycle 4 (C4): the processor reads an instruction and decodes its mark to obtain Y = 4, so the instruction is assigned to thread T(4_{0,0}) and starts executing. The states of the earlier instructions become, in order, T(0_{0,4}), T(1_{0,3}), T(0_{1,2}) and T(3_{0,1});

Clock cycle 5 (C5): the processor reads an instruction and decodes its mark to obtain Y = 5, so the instruction is assigned to thread T(5_{0,0}) and starts executing. The states of the earlier instructions become, in order, T(0_{0,5}), T(1_{0,4}), T(0_{1,3}), T(3_{0,2}) and T(4_{0,1}). At this point one instruction cycle ends, and the result of instruction T(0_0) is stored in the corresponding register.
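The cycle-by-cycle walk-through above can be reproduced with a short simulation. This is a sketch under stated assumptions: the function names `pipeline_trace` and `format_state` are hypothetical, one instruction issues per clock cycle, every in-flight instruction advances exactly one stage per cycle, and the issue counter i covers a single instruction cycle.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

State = Tuple[int, int, int]  # (thread Y, issue index i, pipeline stage j)


def pipeline_trace(issue_order: List[int], stages: int) -> List[List[State]]:
    """For each clock cycle, list the state of every instruction in the pipeline."""
    issue_count: Dict[int, int] = defaultdict(int)
    in_flight: List[State] = []  # oldest instruction first
    trace: List[List[State]] = []
    for thread in issue_order:
        # Advance earlier instructions by one stage; drop those past the last stage.
        in_flight = [(y, i, j + 1) for (y, i, j) in in_flight if j + 1 < stages]
        # The newly issued instruction enters stage 0.
        in_flight.append((thread, issue_count[thread], 0))
        issue_count[thread] += 1
        trace.append(list(in_flight))
    return trace


def format_state(state: State) -> str:
    y, i, j = state
    return f"T({y}_{{{i},{j}}})"


# Threads issued in cycles C0..C5: thread 0 takes thread 2's slot in C2.
trace = pipeline_trace([0, 1, 0, 3, 4, 5], stages=6)
for cycle, states in enumerate(trace):
    print(f"C{cycle}:", ", ".join(format_state(s) for s in states))
```

Tracking the triple (Y, i, j) for each instruction is all the bookkeeping this model needs, which is the point the description makes about the hardware.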

The analysis above shows that, with software-controlled dynamic multithreading, the processor only needs to track T(Y_{i,j}) for every instruction to schedule its hardware resources effectively. The configuration of the threads can be arranged flexibly, entirely according to the needs of the system.

Fig. 7 is a logic block diagram of a Harvard-architecture, dynamically controllable multithreaded processor whose threads are set in software. The instruction structure of the processor in the figure is a single-mark, three-instruction-word issue structure. As the figure shows, apart from the few mark bits added to the instruction word structure, the processor is almost identical to a typical processor structure. The mark information must be passed to all arithmetic units. The instruction control unit uses the thread and priority information in the mark to control the reading and control of instructions and to track the execution state of the threads, while the arithmetic operation units use the mark information to ensure that the results of the instruction are not mixed up.

Those skilled in the art will understand that, unless otherwise defined, all terms used herein (including technical and scientific terms) have the same meaning as commonly understood by one of ordinary skill in the art to which the invention belongs. It should also be understood that terms such as those defined in general dictionaries should be interpreted as having meanings consistent with their meanings in the context of the prior art and, unless defined as herein, should not be interpreted in an idealized or overly formal sense.

The specific embodiments described above further explain the objectives, technical solutions and beneficial effects of the invention in detail. It should be understood that the foregoing is merely a specific embodiment of the invention and does not limit the invention. Any modification, equivalent substitution or improvement made within the spirit and principles of the invention shall fall within the protection scope of the invention.

Claims

1. A controllable dynamic multithreading method, characterized in that, for a processor that uses a pipelined structure and has an I-cache, a mark is added to the instruction structure, the mark carrying two pieces of information: the thread to which the marked instruction belongs and the priority information of the marked instruction, the priority information indicating the execution order of the instruction and its correlation with the preceding and following instructions; the processor controls each instruction according to its mark, issuing and executing the instruction according to the priority information of the instruction and the thread to which it belongs, through the following steps:

Step 1.1): the instruction reading circuit of the processor checks whether the I-Cache holds instructions waiting to be executed, that is, whether any instruction is in the Valid state;

Step 1.1.2): if two or more instructions are in the Valid state, check which instruction has the higher priority according to the marks of the instructions;

Step 1.1.2.1): if one instruction has a higher priority than the others, read that instruction;

Step 1.1.2.2): if no instruction has a higher priority than the others, determine whether there is a thread of a previously executed instruction;

Step 1.1.2.2.1): if there is a thread of a previously executed instruction, read an instruction whose thread differs from that thread, or read instructions in thread order;

Step 1.1.2.2.2): if there is no thread of a previously executed instruction, read instructions in thread order;

Step 2): instruction decoding and distribution:

The decoding circuit of the processor decodes the instruction read in step 1) into the mark and the individual sub-instructions, and the distribution logic of the processor assigns the sub-instructions to different arithmetic units for execution according to their functions;

Step 3): instruction execution:

For each sub-instruction, the processor reads the data of the corresponding registers of the thread according to the thread information in the mark of the instruction to which it belongs, and stores the execution result in the registers of that thread;

Step 4): jump to step 1).

2. The controllable dynamic multithreading method according to claim 1, characterized in that the mark is written by software or written automatically by the compiler during compilation.

3. The controllable dynamic multithreading method according to claim 1, characterized in that the processor is a multiple-instruction-issue processor in which every instruction independently carries its own mark.

4. The controllable dynamic multithreading method according to claim 1, characterized in that the processor is a multiple-instruction-issue processor in which multiple instructions share one mark.

5. The controllable dynamic multithreading method according to claim 1, characterized in that the processor is a single-instruction-issue processor in which each instruction has its own corresponding mark.

6. A processor for performing the controllable dynamic multithreading method according to claim 1, characterized in that it comprises at least an instruction system with the mark, a program execution control unit that can identify and track the mark, an instruction decoding circuit that can identify and decode the mark, an arithmetic operation unit that can identify and decode the mark, and the corresponding memory unit.