CN107003859A

CN107003859A - Run-time code parallelization with continuous monitoring of repetitive instruction sequences

Info

Publication number: CN107003859A
Application number: CN201580063897.5A
Authority: CN
Inventors: Noam Mizrahi; Alberto Mandler; Shay Koren; Jonathan Friedmann
Original assignee: Centipede Semi Ltd
Current assignee: Centipede Semi Ltd
Priority date: 2014-12-22
Filing date: 2015-12-09
Publication date: 2017-08-01
Also published as: EP3238040A1; WO2016103092A1; EP3238040A4

Abstract

A method includes, in a processor (20) that executes instructions of program code, monitoring instructions in a repetitive sequence of the instructions that traverses a flow-control trace, so as to construct a specification of register access by the monitored instructions. Based on the specification, multiple hardware threads are invoked to execute respective segments of the repetitive instruction sequence at least partially in parallel. Monitoring of the instructions continues in at least one of the segments during execution.

Description

Run-time code parallelization with continuous monitoring of repetitive instruction sequences

Field of the invention

The present invention relates generally to processor design, and particularly to methods and systems for run-time code parallelization.

Background of invention

Various techniques have been proposed for dynamically parallelizing software code at run time. For example, Akkary and Driscoll describe a processor architecture that enables dynamic multithreaded execution of a single program, in "A Dynamic Multithreading Processor," Proceedings of the 31st Annual International Symposium on Microarchitecture, December 1998, which is incorporated herein by reference.

Marcuello et al. describe a processor microarchitecture that simultaneously executes multiple threads of control obtained from a single program, by means of control speculation techniques that require no compiler or user support, in "Speculative Multithreaded Processors," Proceedings of the 12th International Conference on Supercomputing, 1998, which is incorporated herein by reference.

Marcuello and Gonzales present a microarchitecture that spawns speculative threads from a single-threaded application at run time, in "Clustered Speculative Multithreaded Processors," Proceedings of the 13th International Conference on Supercomputing, 1999, which is incorporated herein by reference.

In "A Quantitative Assessment of Thread-Level Speculation Techniques," Proceedings of the 14th International Parallel and Distributed Processing Symposium, 2000, which is incorporated herein by reference, Marcuello and Gonzales analyze the benefits of different thread speculation techniques and the impact of value prediction, branch prediction, thread initialization overhead and connectivity among thread units.

Ortiz-Arroyo and Lee describe a multithreading architecture called Dynamic Simultaneous Multithreading (DSMT), which executes multiple threads from a single program on a simultaneous multithreading processor core, in "Dynamic Simultaneous Multithreaded Architecture," Proceedings of the 16th International Conference on Parallel and Distributed Computing Systems (PDCS'03), 2003, which is incorporated herein by reference.

Summary of the invention

An embodiment of the present invention that is described herein provides a method including, in a processor that executes instructions of program code, monitoring instructions in a repetitive sequence of the instructions that traverses a flow-control trace, so as to construct a specification of register access by the monitored instructions. Based on the specification, multiple hardware threads are invoked to execute respective segments of the repetitive instruction sequence at least partially in parallel. Monitoring of the instructions continues in at least one of the segments during execution.

In some embodiments, continuing to monitor the instructions includes, in response to detecting in a given segment a change to a different flow-control trace, creating a different specification for the different flow-control trace by monitoring the instructions along the different flow-control trace. The method may include saving the different specification or the different flow-control trace after monitoring the different flow-control trace.

In some embodiments, the repetitive sequence includes a loop or a function. In an embodiment, continuing to monitor the instructions includes continuing to monitor all the segments. Alternatively, continuing to monitor the instructions may include continuing to monitor at least a subset of the segments that follow the flow-control trace. Further alternatively, continuing to monitor the instructions may include selecting a partial subset of the segments and continuing to monitor the segments in the selected subset. Selecting the subset may include selecting every Nth created segment for continued monitoring, selecting segments for continued monitoring in accordance with a predefined periodic pattern, and/or selecting segments for continued monitoring at random.
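As an illustration only, the three selection options named above (every Nth created segment, a predefined periodic pattern, random selection) can be sketched in Python. The function names, the representation of segments by their creation index and the `probability` parameter are assumptions made for this sketch and do not appear in the disclosure.

```python
import random
from typing import List, Sequence


def select_every_nth(num_segments: int, n: int) -> List[int]:
    """Pick every n-th created segment (the n-th, 2n-th, ... by creation order)."""
    return [i for i in range(num_segments) if (i + 1) % n == 0]


def select_by_pattern(num_segments: int, pattern: Sequence[bool]) -> List[int]:
    """Pick segments according to a repeating (periodic) on/off pattern."""
    return [i for i in range(num_segments) if pattern[i % len(pattern)]]


def select_random(num_segments: int, probability: float,
                  rng: random.Random) -> List[int]:
    """Pick each segment independently with the given (hypothetical) probability."""
    return [i for i in range(num_segments) if rng.random() < probability]


# Segments are identified here by their zero-based creation index.
every_third = select_every_nth(10, 3)
periodic = select_by_pattern(6, [True, False, False])
```

Monitoring only a subset trades reaction time to trace changes against monitoring overhead; the sketch does not model that trade-off.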

In some embodiments, the method includes terminating monitoring of the instructions in a given segment a given number of cycles, instructions or micro-ops after the repetitive sequence terminates. In an example embodiment, the given number for the given segment is set based on the given number that was set for a different segment having a different flow-control trace.

There is additionally provided, in accordance with an embodiment of the present invention, a processor including an execution pipeline and a monitoring unit. The execution pipeline is configured to execute instructions of program code. The monitoring unit is configured to monitor the instructions in a repetitive sequence of the instructions that traverses a flow-control trace so as to construct a specification of register access by the monitored instructions, to invoke, based on the specification, multiple hardware threads in the execution pipeline to execute respective segments of the repetitive instruction sequence at least partially in parallel, and to continue monitoring the instructions in at least one of the segments during execution.

There is also provided, in accordance with an embodiment of the present invention, a method including, in a processor that executes instructions of program code, monitoring instructions in a repetitive sequence of the instructions so as to construct a specification of register access by the monitored instructions. A termination criterion is evaluated based on the monitored instructions. If the termination criterion is met, monitoring of the instructions is terminated. If monitoring of the instructions ends without the termination criterion being met, execution of multiple segments of the repetitive instruction sequence is parallelized based on the specification.

In some embodiments, the termination criterion depends on the location of the last write to a register, on the number of registers written, on a count of instructions or micro-ops, on a count of execution cycles and/or on the number of branch instructions exceeding a threshold. Additionally or alternatively, the termination criterion may depend on the monitoring reaching a location in the program code that was monitored previously, on the monitoring reaching a location in the program code that is identified as repetitive, on a branch misprediction that occurs during or before the monitoring, and/or on classification of one or more flags of the processor as global or global-local.
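The threshold-style criteria above can be pictured with a short Python sketch. It is a minimal illustration under stated assumptions: the class and field names, and every numeric limit, are hypothetical and are not taken from the disclosure. Only a few of the listed criteria (count thresholds and a branch misprediction) are modeled.

```python
from dataclasses import dataclass


@dataclass
class MonitorCounters:
    """Running totals collected while a sequence is being monitored."""
    registers_written: int = 0
    instructions: int = 0
    cycles: int = 0
    branches: int = 0
    branch_mispredicted: bool = False


@dataclass(frozen=True)
class Thresholds:
    """Hypothetical limits; the disclosure does not specify any values."""
    max_registers_written: int = 16
    max_instructions: int = 256
    max_cycles: int = 1024
    max_branches: int = 32


def termination_criterion_met(c: MonitorCounters, t: Thresholds) -> bool:
    """True if any modeled criterion says monitoring should be terminated."""
    return (c.registers_written > t.max_registers_written
            or c.instructions > t.max_instructions
            or c.cycles > t.max_cycles
            or c.branches > t.max_branches
            or c.branch_mispredicted)


def may_parallelize(c: MonitorCounters, t: Thresholds) -> bool:
    """Parallelize only if monitoring ended without meeting the criterion."""
    return not termination_criterion_met(c, t)
```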

In an embodiment, the specification is uniquely associated with the flow-control trace traversed by the monitored instructions. In another embodiment, the specification is associated with two or more flow-control traces traversed by the monitored instructions.

In an embodiment, monitoring of the instructions is performed immediately following decoding of the instructions in the execution pipeline of the processor. In another embodiment, monitoring of the instructions is performed before the instructions are executed in the execution pipeline of the processor, including monitoring speculative instructions that will subsequently be flushed. In some embodiments, the method includes retaining the respective names of the registers throughout the monitoring.

There is further provided, in accordance with an embodiment of the present invention, a method including, in a processor that executes instructions of program code, monitoring a repetitive sequence of the instructions, and classifying the registers accessed by the monitored instructions depending on the respective order in which each register is used by the instructions as an operand or as a destination. Execution of multiple segments of the repetitive sequence is parallelized based on the classification of the registers.

In some embodiments, classifying the registers includes classifying at least some of the registers as one of: a local register, whose first occurrence in the monitored sequence is as a destination; a global register, which is used in the monitored sequence only as an operand; and a global-local register, whose first occurrence in the monitored sequence is as an operand, and which is subsequently used in the monitored sequence as a destination.

In an embodiment, classifying the registers includes classifying a given register as a global-local register if the given register first appears in the monitored sequence as a destination in a conditional instruction. In an embodiment, classifying the registers includes classifying a given register as a global-local register if the given register first appears in the monitored sequence as a destination in a conditional instruction, and otherwise classifying the given register as a local register if the condition of the conditional instruction is met.

In another embodiment, classifying the registers includes classifying a given register as a global-local register if the given register first appears in the monitored sequence as both an operand and a destination in the same instruction.

In some embodiments, classifying the registers further includes identifying, for at least a subset of the registers, the respective locations in the monitored sequence of the last write operations to the registers. In a disclosed embodiment, identifying the locations of the last write operations includes counting the writes to at least the subset of the registers. Alternatively, identifying the locations of the last write operations may include recording the addresses of the last write operations.

In an embodiment, identification of the locations of the last write operations is also performed for one or more flags of the processor, in addition to the registers. In another embodiment, the subset of the registers includes at least the registers classified as local registers. In yet another embodiment, the subset of the registers includes at least the registers classified as global-local registers.

In an example embodiment, identification of the locations of the last write operations takes into account conditional write operations to the respective registers. In an embodiment, the classification according to the order of use as an operand or as a destination is also performed for one or more flags of the processor, in addition to the registers.

There is also provided, in accordance with an embodiment of the present invention, a processor including an execution pipeline and a monitoring unit. The execution pipeline is configured to execute instructions of program code. The monitoring unit is configured to monitor instructions in a repetitive sequence of the instructions so as to construct a specification of register access by the monitored instructions, to evaluate a termination criterion based on the monitored instructions, to terminate monitoring of the instructions if the termination criterion is met, and, if monitoring of the instructions ends without the termination criterion being met, to parallelize execution of multiple segments of the repetitive instruction sequence based on the specification.

There is further provided, in accordance with an embodiment of the present invention, a processor including an execution pipeline and a monitoring unit. The execution pipeline is configured to execute instructions of program code. The monitoring unit is configured to monitor a repetitive sequence of the instructions, to classify the registers accessed by the monitored instructions depending on the respective order in which each register is used by the instructions as an operand or as a destination, and to parallelize execution of multiple segments of the repetitive sequence based on the classification of the registers.

The present invention will be more fully understood from the following detailed description of the embodiments thereof, taken together with the drawings, in which:

Brief description of the drawings

Fig. 1 is a block diagram that schematically illustrates a processor that performs run-time code parallelization, in accordance with an embodiment of the present invention;

Fig. 2 is a diagram that schematically illustrates run-time parallelization of a program loop, in accordance with an embodiment of the present invention;

Fig. 3 is a diagram showing a program loop having multiple traces and respective scoreboards, in accordance with an embodiment of the present invention; and

Fig. 4 is a flow chart that schematically illustrates a method for continuous monitoring of a repetitive instruction sequence, in accordance with an embodiment of the present invention.

Detailed description of embodiments

Overview

Embodiments of the present invention that are described herein provide improved methods and devices for run-time parallelization of code in a processor. In the disclosed embodiments, the processor identifies a repetitive sequence of instructions, and creates and executes multiple parallel code sequences referred to as segments, which carry out different occurrences of the sequence. The segments are scheduled for parallel execution by multiple hardware threads.

For example, the repetitive sequence may include a loop, in which case the segments include multiple loop iterations, parts of an iteration or the continuation of the loop. As another example, the repetitive sequence may include a function, in which case the segments include multiple function calls, parts of the function or the function continuation. The parallelization is carried out at run time, on pre-compiled code. The term "repetitive sequence" generally refers to any instruction sequence that is revisited and executed multiple times.

In some embodiments, upon identifying a repetitive sequence, the processor monitors the instructions in the sequence and constructs a "scoreboard," a specification of access to registers by the monitored instructions. The scoreboard is associated with the specific flow-control trace traversed by the monitored sequence. The processor decides how and when to create and execute the multiple segments based on the information collected in the scoreboard and the trace.

In some embodiments, the scoreboard includes a classification of the registers accessed by the monitored instructions. The classification of a register depends on the order in which the register is used as an operand or as a destination in the monitored instructions.

In some embodiments, micro-ops, although distinct from instructions, are monitored in a similar manner to instructions. In other words, in some embodiments the scoreboard is generated and the monitoring is performed at micro-op granularity rather than at instruction granularity.

The classification may distinguish, for example, between local (L) registers whose first occurrence is as a destination, global (G) registers that are used only as operands, and global-local (GL) registers whose first occurrence is as an operand and which are subsequently used as destinations. Additionally or alternatively, the scoreboard may indicate, for at least some of the registers, the location in the monitored sequence of the last write operation to the register. This indication may include, for example, a count of the number of write operations to the register.

In some embodiments, the processor continues to monitor the instructions in one or more of the segments during execution. Such continued monitoring enables the processor to react quickly and efficiently to changes in the flow-control trace that may occur in the monitored segments, e.g., due to data-dependent conditional branch instructions. Several examples of selection criteria, which the processor may use to select the segments for continued monitoring, are described herein.

In some embodiments, the processor terminates and aborts monitoring of a given segment before the segment ends. Various termination criteria that can be used by the processor are described herein. Additional disclosed techniques maintain multiple simultaneous scoreboards for multiple respective flow-control traces, and alternate between them as appropriate.

Processor architecture

Fig. 1 is a block diagram that schematically illustrates a processor 20, in accordance with an embodiment of the present invention. Processor 20 runs pre-compiled software code, while parallelizing the code execution. Parallelization decisions are made by the processor at run time, by analyzing the program instructions as they are fetched from memory and decoded.

In the present example, processor 20 includes an execution pipeline that includes one or more fetching units 24, one or more decoding units 28, an out-of-order (OOO) buffer 32 and execution units 36. Fetching units 24 fetch program instructions from a multi-level instruction cache memory, which in the present example includes a Level-1 (L1) instruction cache 40 and a Level-2 (L2) instruction cache 44.

A branch prediction unit 48 predicts the flow-control traces (referred to herein as "traces" for brevity) that are expected to be traversed by the program during execution. The predictions are typically based on the addresses or Program Counter (PC) values of previous instructions fetched by fetching units 24. Based on the predictions, branch prediction unit 48 instructs fetching units 24 which new instructions are to be fetched. The flow-control predictions of unit 48 also affect the parallelization of code execution, as will be explained below.

Instructions decoded by decoding units 28 are stored in OOO buffer 32, for out-of-order execution by execution units 36, i.e., not in the order in which they have been compiled and stored in memory. Alternatively, the buffered instructions may be executed in order. The buffered instructions are then issued for execution by the various execution units 36. In the present example, execution units 36 include one or more Multiply-Accumulate (MAC) units, one or more Arithmetic Logic Units (ALU) and one or more Load/Store units. Additionally or alternatively, execution units 36 may include other suitable types of execution units, for example Floating-Point Units (FPU).

The results produced by execution units 36 are stored in a register file and/or a multi-level data cache memory, which in the present example includes a Level-1 (L1) data cache 52 and a Level-2 (L2) data cache 56. In some embodiments, L2 data cache memory 56 and L2 instruction cache memory 44 are implemented as separate memory areas in the same physical memory, or simply share the same memory without fixed pre-allocation.

In some embodiments, processor 20 further includes a thread monitoring and execution unit 60 that is responsible for run-time code parallelization. The functions of unit 60 are explained in detail below.

The configuration of processor 20 shown in Fig. 1 is an example configuration that is chosen purely for the sake of conceptual clarity. In alternative embodiments, any other suitable processor configuration can be used. For example, in the configuration of Fig. 1, multithreading is implemented using multiple fetching units 24 and multiple decoding units 28. Each hardware thread may include a fetching unit assigned to fetch instructions for the thread and a decoding unit assigned to decode the fetched instructions. Additionally or alternatively, multithreading may be implemented in many other ways, such as using multiple OOO buffers, separate execution units per thread and/or separate register files per thread. In another embodiment, different threads may include different respective processing cores.

As another example, the processor may be implemented without cache or with a different cache structure, without branch prediction or with a separate branch prediction per thread. The processor may include additional elements such as a reorder buffer (ROB) and register renaming, to name just a few. Further alternatively, the disclosed techniques can be carried out by processors having any other suitable microarchitecture.

Processor 20 can be implemented using any suitable hardware, such as using one or more Application-Specific Integrated Circuits (ASICs), Field-Programmable Gate Arrays (FPGAs) or other device types. Additionally or alternatively, certain elements of processor 20 can be implemented using software, or using a combination of hardware and software elements. The instruction and data cache memories can be implemented using any suitable type of memory, such as Random Access Memory (RAM).

Processor 20 may be programmed in software to carry out the functions described herein. The software may be downloaded to the processor in electronic form, over a network, for example, or it may, alternatively or additionally, be provided and/or stored on non-transitory tangible media, such as magnetic, optical or electronic memory.

Runtime code parallelization

In some embodiments, unit 60 in processor 20 identifies repetitive instruction sequences and parallelizes their execution. Repetitive instruction sequences may include, for example, respective iterations of a program loop, respective occurrences of a function or procedure, or any other suitable sequence of instructions that is revisited and executed multiple times. In the present context, the term "repetitive instruction sequence" refers to an instruction sequence whose flow-control trace (e.g., sequence of PC values) has been executed in the past at least once. Data values (e.g., register values) may differ from one execution to another.
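The definition above can be pictured with a small Python sketch that treats a flow-control trace as the tuple of PC values it traverses and reports a sequence as repetitive once the same tuple has been seen before. The `TraceHistory` class and the example PC values are hypothetical and serve only to illustrate the definition; the sketch says nothing about how unit 60 performs the identification in hardware.

```python
from typing import Iterable, Set, Tuple


class TraceHistory:
    """Remembers flow-control traces, each modeled as a tuple of PC values."""

    def __init__(self) -> None:
        self._seen: Set[Tuple[int, ...]] = set()

    def observe(self, pcs: Iterable[int]) -> bool:
        """Record a trace and return True if it was already executed before."""
        key = tuple(pcs)
        repeated = key in self._seen
        self._seen.add(key)
        return repeated


history = TraceHistory()
first = history.observe([0x100, 0x104, 0x108])   # first execution of this trace
second = history.observe([0x100, 0x104, 0x108])  # same PCs, data values may differ
other = history.observe([0x100, 0x104, 0x120])   # a different trace
```

Only the PC sequence is compared, which mirrors the statement that register values may differ from one execution to another.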

In the disclosed embodiments, processor 20 parallelizes a repetitive instruction sequence by invoking and executing multiple code segments in parallel or semi-parallel using multiple hardware threads. Each thread executes a respective code segment, e.g., a respective iteration of a loop, multiple (not necessarily successive) loop iterations, part of a loop iteration, continuation of a loop, a function or part or continuation thereof, or any other suitable type of segment.

Parallelization of segments in processor 20 is performed using multiple hardware threads. In the example of Fig. 1, although not necessarily, each thread includes a respective fetching unit 24 and a respective decoding unit 28 that have been assigned by unit 60 to execute one or more segments.

In practice, data dependencies exist between segments. For example, a calculation performed in a certain loop iteration may depend on the result of a calculation performed in a previous iteration. The ability to parallelize segments depends to a large extent on such data dependencies.

Fig. 2 is a diagram that schematically illustrates run-time parallelization of a program loop, in accordance with an example embodiment of the present invention. The top of the figure shows an example program loop (reproduced from the bzip benchmark of the SPECint test suite) and the dependencies between instructions. Some dependencies are between instructions in the same loop iteration, while others are between an instruction in a given loop iteration and an instruction in a previous iteration.

The bottom of the figure shows how unit 60 parallelizes this loop using four threads TH1...TH4, in accordance with an embodiment of the present invention. The table spans a total of eleven cycles, and lists which instructions of which threads are executed during each cycle. Each instruction is represented by its iteration number and its instruction number within the iteration. For example, "14" stands for the 4th instruction of the 1st loop iteration. In this example, instructions 5 and 7 are neglected and perfect branch prediction is assumed.

The staggering in execution of the threads is due to data dependencies. For example, thread TH2 cannot execute instructions 21 and 22 (the first two instructions in the second loop iteration) until cycle 1, because instruction 21 (the first instruction in the second iteration) depends on instruction 13 (the third instruction of the first iteration). Similar dependencies exist across the table. Overall, this parallelization scheme is able to execute two loop iterations in six cycles, or one iteration every three cycles.

It is important to note that the parallelization shown in Fig. 2 considers only data dependencies between instructions, and does not consider other constraints such as availability of execution units. Therefore, the cycles in Fig. 2 do not necessarily translate directly into respective clock cycles. For example, instructions that are listed in Fig. 2 as executed in a given cycle may actually be executed in more than one clock cycle, because they compete for the same execution units 36.

Parallelization based on segment monitoring

In some embodiments, unit 60 decides how to parallelize the code by monitoring the instructions in the processor pipeline. In response to identifying a repetitive instruction sequence, unit 60 starts monitoring the sequence as it is fetched, decoded and executed by the processor.

In some implementations, the functionality of unit 60 may be distributed among the multiple hardware threads, such that a given thread can be viewed as monitoring its instructions during execution. Nevertheless, for the sake of clarity, the description that follows assumes that the monitoring functions are carried out by unit 60.

As part of the monitoring process, unit 60 generates the flow-control trace traversed by the monitored instructions, and a monitoring table that is referred to herein as a scoreboard. The scoreboard includes a respective entry for each register that appears in the monitored sequence. In an embodiment, unit 60 classifies each register as Global (G), Local (L) or Global-Local (GL), and indicates the classification in the respective entry in the scoreboard. The classification of a register as G, L or GL depends on the order in which the register is used as an operand (whose value is read) and/or as a destination (to which a value is written) in the monitored sequence.

In an embodiment, a local (L) register is defined as a register whose first occurrence in the monitored sequence is as a destination (subsequent occurrences, if any, may be as operand and/or destination). A global (G) register is defined as a register that is used in the monitored sequence only as an operand, i.e., the register is read but never written to. A global-local (GL) register is defined as a register whose first occurrence in the monitored sequence is as an operand, and which is later used in the monitored sequence as a destination. The first and subsequent occurrences may occur in different instructions or in the same instruction, as long as the order between "first" and "subsequent" is preserved.

In an alternative embodiment, an exception to the above classification concerns a conditional instruction that uses a register as a destination. If such an instruction is the first occurrence of the register in the monitored instructions, the register is classified as GL. Otherwise, the register is classified as local (L) in accordance with the above rules. For example, register r2 in the instruction "mov_cond r2, #5" is classified as GL if this instruction is the first write to r2 in the monitored instructions, and as L otherwise. In another alternative embodiment, if such an instruction is the first occurrence of the register in the monitored instructions, the register is classified as GL. Otherwise, the register is classified as local only if the condition of the instruction is met. If the condition is not met, the register is not classified.
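The G/L/GL rules, together with the first variant of the conditional-instruction exception, can be sketched as follows. This is a minimal Python model under stated assumptions: the `Instr` tuple, the `classify_registers` function and the example trace are hypothetical, and operands of an instruction are assumed to be read before its destinations are written, so a register that is both operand and destination on its first occurrence comes out as GL.

```python
from enum import Enum
from typing import Dict, Iterable, NamedTuple, Tuple


class RegClass(Enum):
    G = "global"
    L = "local"
    GL = "global-local"


class Instr(NamedTuple):
    """Hypothetical instruction model: registers written, registers read."""
    dests: Tuple[str, ...]
    operands: Tuple[str, ...]
    conditional: bool = False


def classify_registers(trace: Iterable[Instr]) -> Dict[str, RegClass]:
    classes: Dict[str, RegClass] = {}
    for ins in trace:
        # Reads come first: a register first seen as an operand starts as G.
        for reg in ins.operands:
            classes.setdefault(reg, RegClass.G)
        for reg in ins.dests:
            if reg not in classes:
                # First occurrence is a write: L, unless the write is conditional,
                # in which case the exception described above classifies it as GL.
                classes[reg] = RegClass.GL if ins.conditional else RegClass.L
            elif classes[reg] is RegClass.G:
                # Read earlier, written now: global-local.
                classes[reg] = RegClass.GL
    return classes


trace = [
    Instr(dests=("r1",), operands=("r2",)),          # r2 read only, r1 written first
    Instr(dests=("r3",), operands=("r1", "r3")),     # r3 read and written together
    Instr(dests=("r5",), operands=(), conditional=True),  # first write is conditional
    Instr(dests=("r1",), operands=(), conditional=True),  # r1 was already local
]
scoreboard = classify_registers(trace)
```

The second variant of the exception (classify as L only when the condition is actually met) would need the run-time outcome of each condition and is not modeled here.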

In an embodiment, unit 60 uses superset classification, i.e., merges together two or more of the classes defined above. In such an embodiment, even if a given register is only local in a given segment, unit 60 still classifies it as GL in order to simplify control.

An alternative way of defining the classification of registers as G, L or GL is to classify a register depending on where the dependencies of the register are generated and used relative to the currently monitored segment: an operand that is generated outside the currently monitored segment is classified as global (G) or global-local (GL), and an operand that is generated inside the currently monitored segment is classified as local (L).

In some embodiments, unit 60 finds and indicates in the scoreboard, for at least some of the registers, the location of the last write to the register in the monitored sequence. This indication is used by unit 60 during execution for deciding when to issue instructions in subsequent segments that depend on this last write. The rationale behind this mechanism is that an instruction in segment X that depends on the value of a register in a previous segment Y can be issued only after the last write to that register in the execution of segment Y.

In one embodiment, the last-write indication is implemented by counting the number of times that the register is written to in the monitored sequence. Unit 60 determines this count (denoted #WRITES) and indicates the #WRITES value in the entry of the register in the scoreboard.

In this embodiment, when executing segment Y, unit 60 counts the number of writes to the register in question. When the count reaches the #WRITES value indicated in the scoreboard, unit 60 concludes that the last write has been encountered, and therefore permits issuing for execution the instructions in segment X that depend on the register in question.
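The #WRITES mechanism can be illustrated with a short Python sketch: one helper derives the per-register write counts from a monitored sequence, and a tracker counts writes while a later segment executes. The names `count_writes` and `LastWriteTracker`, and the representation of each instruction by the tuple of registers it writes, are assumptions made for this sketch.

```python
from collections import Counter
from typing import Dict, Iterable, Tuple


def count_writes(monitored: Iterable[Tuple[str, ...]]) -> Counter:
    """Monitoring phase: #WRITES per register, one tuple of destinations per instruction."""
    writes: Counter = Counter()
    for dests in monitored:
        writes.update(dests)
    return writes


class LastWriteTracker:
    """Execution phase: counts writes in segment Y against the recorded #WRITES."""

    def __init__(self, expected_writes: Dict[str, int]) -> None:
        self.expected = dict(expected_writes)
        self.seen: Counter = Counter()

    def record_write(self, reg: str) -> bool:
        """Count one write; return True if it is the last expected write to reg."""
        self.seen[reg] += 1
        return self.seen[reg] == self.expected.get(reg, 0)

    def is_final(self, reg: str) -> bool:
        """True once dependents of reg in a later segment X may be issued."""
        return self.seen[reg] >= self.expected.get(reg, 0)


expected = count_writes([("r1",), ("r2",), ("r1",)])  # r1 written twice, r2 once
tracker = LastWriteTracker(expected)
after_first = tracker.record_write("r1")   # not yet the last write to r1
after_second = tracker.record_write("r1")  # count reached #WRITES for r1
```

The sketch relies on segment Y following the same trace as the monitored sequence, since the recorded counts are only meaningful for that trace.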

One known solution for mitigating data dependencies is to rename registers, i.e., to assign a given register different names in different segments. In some embodiments, unit 60 refrains from renaming registers, i.e., retains the register names across the different iterations of the repetitive sequence, in order to facilitate counting of #WRITES. In other words, unit 60 keeps the register-renaming maps aligned between segments and threads.

#WRITES mechanism described above is only depicted as finding and indicating to register in the sequence monitored The position being ultimately written mechanism example.In alternative embodiments, unit 60 can exist in any other suitable way Found in scoreboard and indicate the position being ultimately written to register, such as by being recorded in scoreboard to register most The address of write operation afterwards.

In various embodiments, unit 60 is not necessarily required to count the #WRITES of each register.For example, single Member 60 can be counted for being categorized as GL register, register for being categorized as L or both to #WRITES.

In certain embodiments, unit 60 includes condition write instruction in #WRITES counting, is but regardless of the condition It is no to be satisfied.In other embodiments, unit 60 is only in the condition that meets and actually performs when writing just in #WRITES counting Include condition write instruction.

In some embodiments, processor 20 maintains one or more flags that are used in conditional instructions. Examples of flags include a zero flag ("true" if the result of the most recent arithmetic operation was zero, "false" otherwise), a negative flag ("true" if the result of the most recent arithmetic operation was negative, "false" otherwise), a carry flag ("true" if the most recent addition operation produced a carry, "false" otherwise), an overflow flag ("true" if the most recent addition operation caused an overflow, "false" otherwise), or any other suitable flag. Typically, the flags are implemented as respective bits in a dedicated flags register. The flags are updated by various instructions or micro-operations.

In some embodiments, unit 60 monitors the flags and includes them in the scoreboard in a manner similar to the monitoring of registers. For example, unit 60 may classify the flags as G, L or GL, as explained above. Additionally or alternatively, unit 60 may find and record the position of the last write to each flag in the monitored sequence (e.g., by counting and recording #WRITES for the flags).

In some embodiments, unit 60 does not necessarily monitor an entire section from start to end. In an example embodiment, unit 60 may begin monitoring (e.g., counting writes and/or classifying registers) at some midpoint of a section, and update an existing scoreboard.

Continuous monitoring of multiple tracks

In some embodiments, unit 60 continues to monitor the instructions in one or more of the threads during their execution. In other words, the monitoring process does not end once a repetitive instruction sequence has been identified and monitored. During execution, unit 60 continues the monitoring and scoreboard construction process for at least some of the threads. As noted above, the functionality of unit 60 may be distributed among the threads, such that each thread (or at least a subset of the threads) monitors the instructions it executes.

Continuous monitoring of sections during execution is important, for example, for efficiently handling scenarios in which program execution switches from one flow control track to another at runtime. In many practical scenarios, the program alternates between two or more repetitive instruction sequences having different tracks. In some embodiments, unit 60 handles such scenarios by creating and maintaining multiple different scoreboards in parallel, a respective scoreboard for each track.

Fig. 3 is a diagram that schematically illustrates a program loop having multiple tracks and respective scoreboards, in accordance with an embodiment of the present invention. The left-hand side of the figure shows a piece of code having nine instructions. The program loop starts at instruction 2 and loops back at instruction 9.

In this example, instruction 4 is a conditional branch instruction that jumps to instruction 6 and skips instruction 5. Therefore, depending on the outcome of the conditional branch instruction, some sections will follow a track denoted 70A (branch not taken), and others will follow a track denoted 70B (branch taken).
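The two tracks of the Fig. 3 example can be enumerated with a minimal sketch. It assumes instructions are identified simply by their numbers 2 through 9; the function name `follow_track` and its parameters are hypothetical.

```python
def follow_track(branch_taken, first=2, last=9, branch_at=4, target=6):
    """Return the instruction numbers visited in one loop iteration.

    Models the loop of the Fig. 3 example: instructions first..last,
    with a conditional branch at branch_at that jumps to target.
    """
    track, pc = [], first
    while pc <= last:
        track.append(pc)
        if pc == branch_at and branch_taken:
            pc = target      # branch taken: instruction 5 is skipped
        else:
            pc += 1          # fall through to the next instruction
    return tuple(track)


TRACK_70A = follow_track(branch_taken=False)
TRACK_70B = follow_track(branch_taken=True)
print(TRACK_70A)
print(TRACK_70B)
```

Representing a track as a tuple makes it usable as a lookup key, so that a separate scoreboard can be kept for each track.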

In some embodiments, unit 60 monitors at least some of the sections during their execution. Upon detecting that a monitored section begins to follow a previously unknown track, unit 60 creates a separate scoreboard for the new track and records the register classification and #WRITES, as explained above. In the present example, unit 60 creates and maintains a scoreboard 74A for track 70A and a scoreboard 74B for track 70B.

By maintaining multiple scoreboards, unit 60 is able to react quickly to track changes. As long as a section follows a previously monitored track, unit 60 already has a valid scoreboard for that track. Unit 60 can therefore invoke new sections immediately using the available scoreboard. Without this mechanism, the invocation of new sections would be delayed until the scoreboard for the new track is constructed (meaning that efficiency is reduced, and that the processor may incorrectly assume that the track it is monitoring is new).

The multi-track scenario of Fig. 3 is a simple example scenario, depicted in order to demonstrate the mechanisms of continuous monitoring and multiple scoreboards. The disclosed techniques can be used in any other suitable type of scenario in which execution alternates between multiple flow control tracks.

Fig. 4 is a flow chart that schematically illustrates a method for continuous monitoring of repetitive instruction sequences, in accordance with an embodiment of the present invention. The figure shows the combined execution and monitoring in a given thread. Unit 60 typically performs this process for any sequence that is selected for monitoring, and not necessarily for every section being executed.

The method begins with unit 60 providing a given track and a corresponding scoreboard to a given hardware thread, at an initiation step 80. At an execution and monitoring step 84, the thread in question executes the section and performs monitoring in parallel. As part of the monitoring process, the thread generates a scoreboard for the track it follows.

After execution of the section is completed, unit 60 checks whether the track is new, at a checking step 88. In other words, unit 60 checks whether a scoreboard already exists for this track. If the track is new, unit 60 records the scoreboard created for this track, at a recording step 92. This scoreboard will be provided to subsequent threads that follow the same track. Otherwise, i.e., if a scoreboard already exists, the method ends at an end step 96.
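The flow of Fig. 4 can be approximated in software by the following minimal sketch. It makes several assumptions that are not in the text: a section is a list of hypothetical (address, destination) pairs, the track is the tuple of addresses visited, and the scoreboard holds only #WRITES. The provision of an existing scoreboard at step 80 is not modeled.

```python
from collections import Counter


def execute_and_monitor(section):
    """Stand-in for step 84: return the track followed and its scoreboard.

    section: list of (address, destination) pairs, where destination is
    a register name or None (a hypothetical simplification).
    """
    track = tuple(addr for addr, _ in section)
    writes = Counter(dest for _, dest in section if dest is not None)
    return track, dict(writes)


def run_section(section, known_scoreboards):
    """Execute one section and record its scoreboard if the track is new."""
    track, scoreboard = execute_and_monitor(section)  # step 84
    is_new = track not in known_scoreboards           # step 88
    if is_new:
        known_scoreboards[track] = scoreboard         # step 92
    return track, is_new                              # step 96


known = {}
section_a = [(2, "r1"), (3, None), (4, None), (5, "r1"), (6, "r2")]
section_b = [(2, "r1"), (3, None), (4, None), (6, "r2")]
for section in (section_a, section_a, section_b):
    track, is_new = run_section(section, known)
    print(track, "new" if is_new else "known")
```

A section whose track is already in `known_scoreboards` leaves the stored scoreboard untouched, which mirrors the case in which the method ends without recording anything.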

In some embodiments, a scoreboard is uniquely associated with a single flow control track. In other embodiments, a given scoreboard may be associated with two or more tracks.

In some embodiments, unit 60 monitors every section during execution, e.g., using the method of Fig. 4. In alternative embodiments, unit 60 may choose to monitor only a subset of the sections. By controlling the number and identity of the sections selected for monitoring, it is possible to set different trade-offs between computational overhead and parallelization performance.

Unit 60 may use various criteria or logic for selecting which sections to monitor. For example, unit 60 may select sections for monitoring periodically, e.g., every Nth section that is invoked (for some selected constant N). In another embodiment, unit 60 may select sections for monitoring in accordance with a predefined deterministic pattern (e.g., sections 2, 3, 5, 12, 13, 15, 22, 23, 25, ...). As another example, unit 60 may select sections for monitoring at random, e.g., skip a random number of sections, select a section for monitoring, skip another random number of sections, select a section for monitoring, and so on.
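The three selection policies can be sketched as predicates over the index of the invoked section. This is an illustration only: the function names are hypothetical, and reading the example pattern 2, 3, 5, 12, 13, 15, ... as offsets {2, 3, 5} repeating with period 10 is an assumption, since the text does not state the rule behind the pattern.

```python
import random


def every_nth(n):
    """Select every Nth invoked section (indices n, 2n, 3n, ...)."""
    return lambda index: index % n == 0


def by_pattern(offsets, period):
    """Select sections whose index modulo period is one of the offsets."""
    return lambda index: index % period in offsets


def random_skips(max_skip, seed=None):
    """Skip a random number of sections (0..max_skip), then select one."""
    rng = random.Random(seed)
    remaining = [rng.randint(0, max_skip)]  # sections left to skip

    def select(index):
        if remaining[0] == 0:
            remaining[0] = rng.randint(0, max_skip)
            return True
        remaining[0] -= 1
        return False

    return select


pattern = by_pattern({2, 3, 5}, period=10)
print([i for i in range(1, 26) if pattern(i)])
periodic = every_nth(4)
print([i for i in range(1, 13) if periodic(i)])
```

A smaller N, a denser pattern or a smaller `max_skip` means more sections are monitored, which raises monitoring overhead but lets new tracks be detected sooner.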

As another example, unit 60 may select sections for monitoring in response to some predefined event that occurs during the execution of a section. Since different threads may follow different flow control tracks, unit 60 may choose to monitor sections that follow a particular track of interest. Further alternatively, unit 60 may select sections for monitoring during execution using any other suitable criterion.

In an embodiment, the monitoring by unit 60 is performed on the instructions at the output of decoding module 28. At this point in the pipeline, the instructions are still speculative, in the sense that some of the decoded instructions will be flushed and not committed. Flushing may occur, for example, due to branch misprediction. Nevertheless, it is preferable to monitor instructions at this early stage because the instructions are still organized in order. In addition, monitoring instructions early in the pipeline enables unit 60 to make use of the scoreboard (i.e., to invoke parallel sections using the scoreboard) with lower latency.

Monitoring termination criteria

In some embodiments, unit 60 terminates the monitoring of a given section before the end of that section. For this purpose, unit 60 may evaluate and use various termination criteria. Several non-limiting examples of termination criteria include:

■ The number of writes to a register exceeds a threshold.

■ The number of registers that are written exceeds a threshold.

■ The count of instructions or micro-operations exceeds a threshold.

■ The count of execution cycles exceeds a threshold.

■ The number of branch instructions exceeds a threshold.

■ The monitoring reaches a position in the program code that was monitored previously.

■ The monitoring reaches a position in the program code that is identified as repetitive (e.g., a backward branch or a branch with link - BL).

■ A branch misprediction occurs in one of the monitored instructions or in an instruction that precedes the monitoring.

■ The flags are classified as GL or Global.

Further alternatively, any other suitable termination criterion can be used.
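The evaluation of these criteria can be sketched as a single check over the state accumulated while monitoring. This is a hedged illustration: the class names, field names and all threshold values below are hypothetical, since the text does not specify any of them.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional


@dataclass
class MonitorState:
    """State accumulated while monitoring one section (hypothetical)."""
    writes: Dict[str, int] = field(default_factory=dict)  # writes per written register
    instructions: int = 0
    cycles: int = 0
    branches: int = 0
    reached_monitored_code: bool = False
    reached_repetitive_code: bool = False
    branch_mispredicted: bool = False
    flags_global: bool = False  # flags classified as GL or Global


@dataclass
class Limits:
    """Illustrative thresholds; the values are assumptions."""
    writes_per_register: int = 8
    registers_written: int = 16
    instructions: int = 256
    cycles: int = 1024
    branches: int = 8


def termination_reason(s: MonitorState, lim: Limits) -> Optional[str]:
    """Return the first criterion that is met, or None to keep monitoring."""
    if any(n > lim.writes_per_register for n in s.writes.values()):
        return "writes to a register"
    if len(s.writes) > lim.registers_written:
        return "registers written"
    if s.instructions > lim.instructions:
        return "instruction count"
    if s.cycles > lim.cycles:
        return "cycle count"
    if s.branches > lim.branches:
        return "branch count"
    if s.reached_monitored_code:
        return "previously monitored code"
    if s.reached_repetitive_code:
        return "repetitive code"
    if s.branch_mispredicted:
        return "branch misprediction"
    if s.flags_global:
        return "flags are GL or Global"
    return None


print(termination_reason(MonitorState(instructions=300), Limits()))
```

The order of the checks is arbitrary in this sketch; an implementation could evaluate any subset of the criteria, in any order.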

Although the embodiments described herein mainly address general-purpose processors, the methods and systems described herein can also be used in other applications, such as in graphics processing units (GPUs) or other dedicated processors.

It will thus be appreciated that the embodiments described above are cited by way of example, and that the present invention is not limited to what has been particularly shown and described hereinabove. Rather, the scope of the present invention includes both combinations and sub-combinations of the various features described hereinabove, as well as variations and modifications thereof which would occur to persons skilled in the art upon reading the foregoing description and which are not disclosed in the prior art. Documents incorporated by reference in the present patent application are to be considered an integral part of the application, except that, to the extent any terms are defined in these incorporated documents in a manner that conflicts with the definitions made explicitly or implicitly in the present specification, only the definitions in the present specification should be considered.

Claims

1. A method, comprising:

in a processor that executes instructions of program code, monitoring instructions in a repetitive sequence of the instructions that traverses a flow control track, so as to construct a specification of register access by the monitored instructions;

based on the specification, invoking multiple hardware threads to execute respective sections of the repetitive instruction sequence at least partially in parallel; and

continuing to monitor the instructions in at least one of the sections during execution.

2. The method according to claim 1, wherein continuing to monitor the instructions comprises, in response to detecting in a given section a change to a different flow control track, creating and constructing a different specification for the different flow control track by monitoring the instructions along the different flow control track.

3. The method according to claim 2, and comprising saving the different specification or the different flow control track after monitoring the different flow control track.

4. The method according to claim 1 or 2, wherein the repetitive sequence comprises a loop or a function.

5. The method according to claim 1 or 2, wherein continuing to monitor the instructions comprises continuing to monitor all of the sections.

6. The method according to claim 1 or 2, wherein continuing to monitor the instructions comprises continuing to monitor at least a subset of the sections that follow the flow control track.

7. The method according to claim 1 or 2, wherein continuing to monitor the instructions comprises selecting a partial subset of the sections, and continuing to monitor the sections in the selected subset.

8. The method according to claim 7, wherein selecting the subset comprises at least one of the following operations:

selecting every Nth created section for continued monitoring;

selecting the sections for continued monitoring in accordance with a predefined periodic pattern; and

selecting the sections for continued monitoring at random.

9. The method according to claim 1 or 2, and comprising terminating the monitoring of the instructions in a given section within a given number of cycles, instructions or micro-operations after the repetitive sequence ceases.

10. The method according to claim 9, and comprising setting the given number for the given section based on the given number that was set for a different section having a different flow control track.

11. A processor, comprising:

an execution pipeline, which is configured to execute instructions of program code; and

a monitoring unit, which is configured to monitor the instructions in an identified repetitive sequence of the instructions that traverses a flow control track, so as to construct a specification of register access by the monitored instructions, to invoke, based on the specification, multiple hardware threads in the execution pipeline to execute respective sections of the repetitive instruction sequence at least partially in parallel, and to continue to monitor the instructions in at least one of the sections during execution.

12. The processor according to claim 11, wherein, in response to detecting in a given section a change to a different flow control track, the monitoring unit is configured to create and construct a different specification for the different flow control track by monitoring the instructions along the different flow control track.

13. The processor according to claim 12, wherein the monitoring unit is configured to save the different specification or the different flow control track after monitoring the different flow control track.

14. The processor according to claim 11 or 12, wherein the repetitive sequence comprises a loop or a function.

15. The processor according to claim 11 or 12, wherein the monitoring unit is configured to continue to monitor all of the sections.

16. The processor according to claim 11 or 12, wherein the monitoring unit is configured to continue to monitor at least a subset of the sections that follow the flow control track.

17. The processor according to claim 11 or 12, wherein the monitoring unit is configured to select a partial subset of the sections, and to continue to monitor the sections in the selected subset.

18. The processor according to claim 17, wherein the monitoring unit is configured to select the subset by performing at least one of the following operations:

selecting every Nth created section for continued monitoring;

selecting the sections for continued monitoring at random.

19. The processor according to claim 11 or 12, wherein the monitoring unit is configured to terminate the monitoring of the instructions in a given section within a given number of cycles, instructions or micro-operations after the repetitive sequence ceases.

20. The processor according to claim 19, wherein the monitoring unit is configured to set the given number for the given section based on the given number that was set for a different section having a different flow control track.

21. A method, comprising:

in a processor that executes instructions of program code, monitoring instructions in a repetitive sequence of the instructions, so as to construct a specification of register access by the monitored instructions;

evaluating a termination criterion based on the monitored instructions;

if the termination criterion is met, terminating the monitoring of the instructions; and

if the monitoring of the instructions ends without the termination criterion being met, parallelizing execution of multiple sections of the repetitive instruction sequence based on the specification.

22. The method according to claim 21, wherein the termination criterion depends on at least one of the following:

a position of a last write to a register, a number of registers that are written, a count of instructions or micro-operations, a count of execution cycles, or a number of branch instructions exceeding a threshold;

the monitoring reaching a position in the program code that was monitored previously;

the monitoring reaching a position in the program code that is identified as repetitive;

a branch misprediction occurring during or before the monitoring; and

one or more flags of the processor being classified as global or global-local.

23. The method according to claim 21, wherein the specification is uniquely associated with a flow control track traversed by the monitored instructions.

24. The method according to claim 21, wherein the specification is associated with two or more flow control tracks traversed by the monitored instructions.

25. The method according to any one of claims 21-24, wherein the monitoring of the instructions is performed immediately after the instructions are decoded in an execution pipeline of the processor.

26. The method according to any one of claims 21-24, wherein the monitoring of the instructions is performed before the instructions are executed in an execution pipeline of the processor, including monitoring speculative instructions that will subsequently be flushed.

27. The method according to any one of claims 21-24, and comprising retaining the respective names of the registers throughout the monitoring.

28. A method, comprising:

in a processor that executes instructions of program code, monitoring a repetitive sequence of the instructions, and classifying registers accessed by the monitored instructions according to a respective order in which each register is used by the instructions as an operand or as a destination; and

parallelizing execution of multiple sections of the repetitive sequence based on the classification of the registers.

29. The method according to claim 28, wherein classifying the registers comprises classifying at least some of the registers as one of the following:

a local register, whose first occurrence in the monitored sequence is as a destination;

a global register, which is used in the monitored sequence only as an operand; and

a global-local register, whose first occurrence in the monitored sequence is as an operand, and which is subsequently used in the monitored sequence as a destination.

30. The method according to claim 28, wherein classifying the registers comprises classifying a given register as global-local if the given register first appears in the monitored sequence as a destination in a conditional instruction.

31. The method according to claim 28, wherein classifying the registers comprises classifying a given register as global-local if the given register first appears in the monitored sequence as a destination in a conditional instruction, and otherwise, if the condition of the conditional instruction is met, classifying the given register as local.

32. The method according to claim 28, wherein classifying the registers comprises classifying a given register as global-local if the given register first appears in the monitored sequence as both an operand and a destination in the same instruction.

33. The method according to claim 28, wherein classifying the registers further comprises, for at least a subset of the registers, identifying respective positions in the monitored sequence of last write operations to the registers.

34. The method according to claim 33, wherein identifying the positions of the last write operations comprises counting the writes to at least the subset of the registers.

35. The method according to claim 33, wherein identifying the positions of the last write operations comprises recording addresses of the last write operations.

36. The method according to claim 33, wherein the identification of the positions of the last write operations is performed, in addition to the registers, for one or more flags of the processor.

37. The method according to claim 33, wherein the subset of the registers comprises at least the registers classified as local.

38. The method according to claim 33, wherein the subset of the registers comprises at least the registers classified as global-local.

39. The method according to claim 33, wherein the identification of the positions of the last write operations includes conditional write operations to the respective registers.

40. The method according to claim 28, wherein the classification according to the order of use as an operand or as a destination is performed, in addition to the registers, for one or more flags of the processor.

41. A processor, comprising:

a monitoring unit, which is configured to monitor instructions in a repetitive sequence of the instructions so as to construct a specification of register access by the monitored instructions, to evaluate a termination criterion based on the monitored instructions, to terminate the monitoring of the instructions if the termination criterion is met, and, if the monitoring of the instructions ends without the termination criterion being met, to parallelize execution of multiple sections of the repetitive instruction sequence based on the specification.

42. The processor according to claim 41, wherein the termination criterion depends on at least one of the following:

a branch misprediction occurring during or before the monitoring; and

43. The processor according to claim 41, wherein the specification is uniquely associated with a flow control track traversed by the monitored instructions.

44. The processor according to claim 41, wherein the specification is associated with two or more flow control tracks traversed by the monitored instructions.

45. The processor according to any one of claims 41-44, wherein the monitoring unit is configured to monitor the instructions immediately after the instructions are decoded in an execution pipeline of the processor.

46. The processor according to any one of claims 41-44, wherein the monitoring unit is configured to monitor the instructions before the instructions are executed in an execution pipeline of the processor, including monitoring speculative instructions that will subsequently be flushed.

47. The processor according to any one of claims 41-44, wherein the monitoring unit is configured to retain the respective names of the registers throughout the monitoring.

48. A processor, comprising:

a monitoring unit, which is configured to monitor a repetitive sequence of instructions, to classify registers accessed by the monitored instructions according to a respective order in which each register is used by the instructions as an operand or as a destination, and to parallelize execution of multiple sections of the repetitive sequence based on the classification of the registers.

49. The processor according to claim 48, wherein the monitoring unit is configured to classify at least some of the registers as one of the following:

50. The processor according to claim 48, wherein the monitoring unit is configured to classify a given register as global-local if the given register first appears in the monitored sequence as a destination in a conditional instruction.

51. The processor according to claim 48, wherein the monitoring unit is configured to classify a given register as global-local if the given register first appears in the monitored sequence as a destination in a conditional instruction, and otherwise, if the condition of the conditional instruction is met, to classify the given register as local.

52. The processor according to claim 48, wherein the monitoring unit is configured to classify a given register as global-local if the given register first appears in the monitored sequence as both an operand and a destination in the same instruction.

53. The processor according to claim 48, wherein, in classifying the registers, the monitoring unit is configured to identify, for at least a subset of the registers, respective positions in the monitored sequence of last write operations to the registers.

54. The processor according to claim 48, wherein the monitoring unit is configured to identify the positions of the last write operations by counting the writes to at least a subset of the registers.

55. The processor according to claim 48, wherein the monitoring unit is configured to identify the positions of the last write operations by recording addresses of the last write operations.

56. The processor according to claim 48, wherein the monitoring unit is configured to identify the positions of the last write operations, in addition to the registers, for one or more flags of the processor.

57. The processor according to claim 48, wherein the subset of the registers comprises at least the registers classified as local.

58. The processor according to claim 48, wherein the subset of the registers comprises at least the registers classified as global-local.

59. The processor according to claim 48, wherein the identification of the positions of the last write operations includes conditional write operations to the respective registers.

60. The processor according to claim 48, wherein the monitoring unit is configured to perform the classification according to the order of use as an operand or as a destination, in addition to the registers, for one or more flags of the processor.