WO2020061765A1 - Method and device for monitoring performance of processor - Google Patents

Method and device for monitoring performance of processor

Info

Publication number
WO2020061765A1
Authority
WO
WIPO (PCT)
Prior art keywords
instruction
processor
entry
indication field
pause
Prior art date
Application number
PCT/CN2018/107402
Other languages
French (fr)
Chinese (zh)
Inventor
孙涛
周昔平
Original Assignee
华为技术有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 华为技术有限公司
Priority to PCT/CN2018/107402 (WO2020061765A1)
Priority to CN201880094308.3A (CN112219193B)
Publication of WO2020061765A1

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00: Error detection; Error correction; Monitoring
    • G06F11/30: Monitoring
    • G06F11/34: Recording or statistical evaluation of computer activity, e.g. of down time, of input/output operation; Recording or statistical evaluation of user activity, e.g. usability assessment
    • G06F9/00: Arrangements for program control, e.g. control units
    • G06F9/06: Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/30: Arrangements for executing machine instructions, e.g. instruction decode
    • G06F9/38: Concurrent instruction execution, e.g. pipeline, look ahead

Definitions

  • The present application relates to the field of computer technology, and in particular to a method and a device for monitoring processor performance.
  • Superscalar execution means that the issue stage of the processor pipeline can send multiple instructions to the execution stage in each clock cycle, so that multiple instructions execute concurrently within the processor. Out-of-order execution means that the processor may execute instructions in an order other than the one prescribed by the program.
  • Without out-of-order execution, the processor's pipeline stalls while an instruction waits; when superscalar out-of-order execution is used, the processor may instead continue to execute subsequent instructions that do not depend on the pending execution results, so that the execution stage of the pipeline can remain busy. Superscalar out-of-order execution can therefore reduce the average execution time of a program and improve the processing efficiency of the processor.
  • Although superscalar out-of-order execution brings these advantages, it also complicates processor performance analysis. For example, when the pipeline of a superscalar out-of-order execution processor stalls, multiple performance events caused by multiple instructions may be present during that period, and it is difficult for the processor to determine which performance event caused the stall. Because performance events overlap, it is difficult to attribute the stall to a single instruction, and therefore difficult to evaluate the performance overhead caused by each instruction that the processor executes.
  • The embodiments of the present application provide a method and a device for monitoring processor performance, so that a superscalar out-of-order execution processor can determine the instruction that caused a stall after the stall occurs, and can evaluate the performance overhead caused by that instruction after program execution ends.
  • In a first aspect, an embodiment of the present application provides a method for monitoring processor performance.
  • The method includes the following steps: the processor updates a first entry in a first register when a performance event occurs and, when a stall occurs, starts a counter to count the first clock cycle number, that is, the number of clock cycles for which the stall lasts.
  • The first entry is used to indicate an index path to the instruction type of the first instruction that caused the performance event. After the stall ends, the processor stops updating the first entry and determines the instruction type of the first instruction according to the first entry. The processor then adds the first clock cycle number to the cumulative stall cycle number corresponding to the instruction type of the first instruction, and writes the cumulative stall cycle number to a second entry of a second register.
  • The second register is provided with multiple entries that correspond to multiple instruction types, and the entries are used to store the cumulative number of stall cycles caused by instructions of each instruction type.
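  • The steps of the method can be sketched in software as follows. This is only a minimal illustrative model, not the hardware described in the application: the class name `StallMonitor`, its method names, and the list of instruction types are hypothetical, and the first entry is simplified to hold the instruction type directly rather than an index path to it.

```python
from dataclasses import dataclass, field

# Hypothetical instruction types; the application does not enumerate them.
INSTRUCTION_TYPES = ["load", "store", "branch", "div", "other"]

@dataclass
class StallMonitor:
    first_entry: object = None   # models the first entry of the first register
    counter: int = 0             # clock cycles of the current stall
    stalled: bool = False
    # models the second register: one cumulative stall-cycle count per type
    cumulative: dict = field(
        default_factory=lambda: {t: 0 for t in INSTRUCTION_TYPES})

    def on_performance_event(self, instr_type):
        # the first entry is updated whenever a performance event occurs
        self.first_entry = instr_type

    def on_stall_begin(self):
        # start the counter when a stall occurs
        self.stalled = True
        self.counter = 0

    def on_clock_cycle(self):
        if self.stalled:
            self.counter += 1

    def on_stall_end(self):
        # read the first entry and charge the stall to that instruction type
        self.stalled = False
        instr_type = self.first_entry
        self.cumulative[instr_type] += self.counter
        return instr_type, self.counter

mon = StallMonitor()
mon.on_performance_event("load")   # e.g. a cache miss caused by a load
mon.on_stall_begin()
for _ in range(12):
    mon.on_clock_cycle()
blamed, cycles = mon.on_stall_end()
```

  • In this sketch the 12 stall cycles end up in the "load" slot of `cumulative`, which plays the role of the second entry.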
  • With the method provided in the first aspect, the first entry in the first register is updated when a performance event occurs, and updating stops after the processor stall ends. The first entry can therefore be used to indicate an index path to the instruction type of the first instruction that caused the processor stall.
  • The processor starts a counter when a stall occurs, so that after the stall ends the counter holds the first clock cycle number for which the stall lasted. The first clock cycle number indicates the performance overhead caused by the stall. After the stall ends, the first entry can be read to determine the instruction type of the first instruction that caused the stall; at the same time, the second register stores the cumulative number of stall cycles caused by instructions of each instruction type.
  • The first clock cycle number can be accumulated into the entry of the second register that corresponds to the instruction type of the first instruction (that is, the second entry), so that the performance of the processor can be analyzed comprehensively: for example, the percentage of stalls caused by each type of instruction, which types of instructions are prone to stalls, the number of processor stall cycles as a percentage of the total execution cycles of the program, and so on.
  • The type of the instruction that caused a stall can be accurately determined after the processor stalls, and the performance overhead of each type of instruction can be evaluated after program execution ends.
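  • The kind of analysis mentioned above can be illustrated with a short sketch. The function name `stall_report`, the instruction types, and all numbers are made up for illustration, and the share is computed over stall cycles, which is one possible reading of "percentage of stalls".

```python
def stall_report(cumulative_stall_cycles, total_cycles):
    """Summarize per-type stall cycles read back from the second register."""
    total_stall = sum(cumulative_stall_cycles.values())
    share = {
        t: (c / total_stall if total_stall else 0.0)
        for t, c in cumulative_stall_cycles.items()
    }
    return {
        # fraction of all stall cycles attributed to each instruction type
        "share_by_type": share,
        # stall cycles as a fraction of the program's total execution cycles
        "stall_fraction": total_stall / total_cycles,
        # the instruction type that accumulated the most stall cycles
        "worst_type": max(cumulative_stall_cycles,
                          key=cumulative_stall_cycles.get),
    }

# Made-up example data: cumulative stall cycles per instruction type.
report = stall_report({"load": 600, "branch": 300, "div": 100},
                      total_cycles=10_000)
```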
  • The first entry includes a front-end indication field, a type indication field, and a sequence number indication field.
  • The front-end indication field is used to indicate whether a stall occurs at the front end.
  • The type indication field is used to indicate whether a stall occurs before the first instruction is committed.
  • The sequence number indication field is used to indicate a misprediction sequence number of the first instruction.
  • The processor updating the first entry in the first register when a performance event occurs may be implemented as follows: when a first performance event occurs at the front end of the processor pipeline and the reorder buffer (ROB) of the processor is empty, the processor sets the front-end indication field to a first value, sets the type indication field to a second value, and stores the instruction type information of the first instruction in the sequence number indication field.
  • The processor sets the front-end indication field to the first value to indicate that the stall occurs at the front end of the processor pipeline; the processor sets the type indication field to the second value to indicate that the instruction that caused the stall is the last instruction committed before the stall occurred.
  • Because the ROB is empty when this first type of stall occurs, it is difficult to determine the instruction "responsible for the stall" through the relevant entry in the ROB.
  • The sequence number indication field is therefore reused in the first implementation: the instruction type information of the first instruction is stored in the sequence number indication field, so that the "responsible" instruction can be identified directly from the sequence number indication field.
  • The processor updating the first entry in the first register when a performance event occurs may be implemented as follows: when a second performance event does not occur at the front end of the processor pipeline and is not a misprediction event, the processor sets both the front-end indication field and the type indication field to the second value.
  • The processor sets the front-end indication field to the second value to indicate that the stall occurs at the back end of the processor pipeline; the processor sets the type indication field to the second value to indicate that the instruction that caused the stall is the last instruction committed before the stall occurred.
  • The processor updating the first entry in the first register when a performance event occurs may be implemented as follows: when a third performance event is a misprediction event, the processor sets the front-end indication field to the second value, sets the type indication field to the first value, and stores the misprediction sequence number of the third performance event in the sequence number indication field.
  • The processor sets the front-end indication field to the second value to indicate that the stall occurs at the back end of the processor pipeline; the processor sets the type indication field to the first value to indicate that the instruction that caused the stall is the first instruction committed after the stall has ended.
  • The processor updating the first entry in the first register when a performance event occurs may be implemented as follows: when a fourth performance event is a misprediction event, the processor sets the front-end indication field to the second value, sets the type indication field to the first value, and stores the misprediction sequence number of the fourth performance event in the sequence number indication field; when a fifth performance event is also a misprediction event, the processor compares the misprediction sequence number of the fifth performance event with that of the fourth performance event, and saves the smaller of the two misprediction sequence numbers in the sequence number indication field.
  • The processor sets the front-end indication field to the second value to indicate that the stall occurs at the back end of the processor pipeline; the processor sets the type indication field to the first value to indicate that the instruction that caused the stall is the first instruction committed after the stall has ended.
  • The first entry thus stores the relevant information of the performance event with the smaller misprediction sequence number, so that the index information of the instruction that caused the stall can be accurately determined.
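  • The update rules for the three fields can be sketched as follows. The encodings of the "first value" and "second value", the names `FirstEntry` and `update_first_entry`, and the parameter names are assumptions made for illustration. The description does not say what happens for a front-end event when the ROB is not empty, nor how different kinds of events take precedence over one another, so this sketch leaves the entry unchanged in the former case and simply lets later events overwrite earlier ones.

```python
from dataclasses import dataclass

FIRST, SECOND = 1, 0   # assumed encodings of the "first" and "second" value

@dataclass
class FirstEntry:
    front_end: int = SECOND
    type_ind: int = SECOND
    seq: object = None   # misprediction sequence number, or reused for type info

def update_first_entry(entry, *, at_front_end, rob_empty, is_misprediction,
                       instr_type=None, mispredict_seq=None):
    if at_front_end and rob_empty:
        # front-end event with an empty ROB: reuse the sequence number field
        entry.front_end, entry.type_ind = FIRST, SECOND
        entry.seq = instr_type
    elif is_misprediction:
        # keep the smaller sequence number if a misprediction is already recorded
        already = entry.front_end == SECOND and entry.type_ind == FIRST
        entry.front_end, entry.type_ind = SECOND, FIRST
        entry.seq = min(entry.seq, mispredict_seq) if already else mispredict_seq
    elif not at_front_end:
        # back-end event that is not a misprediction
        entry.front_end = entry.type_ind = SECOND
    # front-end event with a non-empty ROB: not covered, entry left unchanged
    return entry

entry = FirstEntry()
for seq in (42, 17, 30):   # three misprediction events in arrival order
    update_first_entry(entry, at_front_end=False, rob_empty=False,
                       is_misprediction=True, mispredict_seq=seq)
```

  • After the three misprediction events, the sequence number indication field holds 17, the smallest of the three sequence numbers.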
  • The processor determining the instruction type of the first instruction according to the first entry may be implemented as follows: when the processor determines that the front-end indication field is the first value, the processor obtains the instruction type information of the first instruction stored in the sequence number indication field.
  • If the front-end indication field in the first entry is the first value, it can be determined that the stall occurred at the front end of the processor pipeline.
  • The instruction that caused the stall is the first instruction committed after the stall ends.
  • The instruction should be the first entry in the ROB (that is, the entry corresponding to the instruction about to be committed in the ROB).
  • However, the ROB is empty, so the index information of the instruction that caused the processor stall cannot be obtained from the ROB.
  • The instruction type information of the first instruction was already saved in the sequence number indication field when the first entry was updated. Therefore, when the front-end indication field is the first value, the instruction type information of the first instruction stored in the sequence number indication field of the first entry can be obtained directly.
  • The processor determining the instruction type of the first instruction according to the first entry may be implemented as follows: when the processor determines that the front-end indication field and the type indication field are both the second value, the processor obtains the instruction type information of the first instruction stored in the first entry of the reorder buffer (ROB).
  • The stall is a fourth type of stall.
  • Because the instruction that caused this stall is the first instruction committed after the stall ends, the instruction should be the first entry in the ROB (that is, the entry of the instruction about to be committed). Therefore, in the second implementation, the instruction type information of the first instruction stored in the first entry of the ROB can be obtained.
  • The processor determining the instruction type of the first instruction according to the first entry may be implemented in the following three ways:
  • When the processor determines that the front-end indication field is the second value, the type indication field is the first value, and the misprediction sequence number stored in the sequence number indication field is the same as the misprediction sequence number stored in the third register, the processor obtains the instruction type information of the first instruction stored in the third register. The third register stores a third entry, which is the entry most recently deleted from the reorder buffer (ROB), and the third entry contains the instruction type information of the first instruction.
  • the processor obtains the reordering buffer when it is determined that the front-end indication field is the second value, the type indication field is the first value, and the mispredicted sequence number stored in the sequence number indication field is greater than the mispredicted sequence number stored in the third register.
  • When the processor determines that the front-end indication field is the second value, the type indication field is the first value, and the misprediction sequence number stored in the sequence number indication field is less than the misprediction sequence number stored in the third register, the processor obtains the instruction type information of the first instruction stored in the first entry of the reorder buffer (ROB).
  • The instruction that caused the stall is the last instruction committed before the stall occurred.
  • The instruction recorded in the third register is the instruction that caused the processor to stall.
  • However, the first entry may be updated out of order: for example, the instruction that caused the stall has not yet reached the commit stage, the performance event recorded in the first entry has been overwritten by another performance event, or the performance event recorded in the first entry overlaps with other performance events (in which case the performance event recorded in the first entry can be considered not to have caused the stall).
  • The misprediction sequence number stored in the sequence number indication field can therefore be compared with the misprediction sequence number stored in the third register, and the comparison result is used to determine the instruction type of the instruction that caused the processor to stall.
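  • The decision logic for reading the first entry after a stall can be sketched as follows. The function name, parameter names, and value encodings are assumptions; `last_committed` stands in for the third register and is modeled as a pair of misprediction sequence number and instruction type. What is read when the stored sequence number is greater than the one in the third register is not recoverable from the description, so the sketch returns `None` for that case rather than guessing.

```python
FIRST, SECOND = 1, 0   # assumed encodings of the "first" and "second" value

def resolve_instruction_type(front_end, type_ind, seq_field,
                             rob_head_type, last_committed):
    """Return the instruction type blamed for the stall, or None if unknown."""
    if front_end == FIRST:
        # front-end stall: the sequence number field was reused for type info
        return seq_field
    if type_ind == SECOND:
        # back-end stall, not a misprediction: read the first ROB entry
        return rob_head_type
    # misprediction: compare against the entry last removed from the ROB
    last_seq, last_type = last_committed
    if seq_field == last_seq:
        return last_type
    if seq_field < last_seq:
        return rob_head_type
    return None   # "greater than" case: left open in this sketch

third_register = (5, "branch")   # made-up contents of the third register
results = [
    resolve_instruction_type(FIRST, SECOND, "load", "store", third_register),
    resolve_instruction_type(SECOND, SECOND, None, "store", third_register),
    resolve_instruction_type(SECOND, FIRST, 5, "store", third_register),
    resolve_instruction_type(SECOND, FIRST, 3, "store", third_register),
    resolve_instruction_type(SECOND, FIRST, 9, "store", third_register),
]
```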
  • After writing the cumulative stall cycle number to the second entry, the processor may also write all entries stored in the second register to the memory connected to the processor in any of the following cases: the second register overflows; the second register triggers an interrupt; the performance monitoring period of the processor ends. Once all the entries saved in the second register have been written to the memory, the second register can be cleared. The performance of the processor can then be evaluated based on the data in the memory, for example by analyzing the percentage of stalls caused by each type of instruction, which types of instructions are prone to stalls, and the number of processor stall cycles as a percentage of the total execution cycles of the program.
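  • A minimal sketch of this write-back behavior is given below. The class `StallCounters`, the `max_value` limit standing in for the register width, and the list standing in for memory are all hypothetical. As a design choice of this sketch, the counters are flushed just before an addition would exceed the limit, which is one possible way to handle overflow and not necessarily the one intended by the application.

```python
class StallCounters:
    """Models the second register plus a spill area in memory."""

    def __init__(self, types, max_value):
        self.max_value = max_value            # assumed register width limit
        self.counts = {t: 0 for t in types}   # the second register
        self.memory = []                      # stands in for attached memory

    def flush(self, reason):
        # copy every entry to memory, then clear the register
        self.memory.append((reason, dict(self.counts)))
        for t in self.counts:
            self.counts[t] = 0

    def add(self, instr_type, cycles):
        # flush first if this addition would overflow the entry
        if self.counts[instr_type] + cycles > self.max_value:
            self.flush("overflow")
        self.counts[instr_type] += cycles

counters = StallCounters(["load", "branch"], max_value=100)
counters.add("load", 80)
counters.add("load", 50)        # would exceed 100, so a flush happens first
counters.flush("period_end")    # end of the performance monitoring period

# Offline analysis can sum the snapshots that were written to memory.
total_load = sum(snapshot["load"] for _, snapshot in counters.memory)
```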
  • In a second aspect, an embodiment of the present application further provides a processor performance monitoring device.
  • The device includes a processor, a first register, a counter, and a second register.
  • The processor is configured to: update a first entry in the first register when a performance event occurs, where the first entry is used to indicate an index path to the instruction type of the first instruction that caused the performance event; when a stall occurs, start the counter to count the first clock cycle number for which the stall lasts; stop updating the first entry after the stall ends, and determine the instruction type of the first instruction according to the first entry; and add the first clock cycle number to the cumulative stall cycle number corresponding to the instruction type of the first instruction, and write the cumulative stall cycle number into a second entry in the second register. The second register is provided with multiple entries that correspond to multiple instruction types, and the entries are used to store the cumulative number of stall cycles caused by instructions of each instruction type.
  • The first register, the counter, and the second register may be integrated on the processor or may be provided separately.
  • The processor performance monitoring device can also be regarded as a processor.
  • the first entry may include a front-end indication field, a type indication field, and a sequence number indication field.
  • the front-end indication field is used to indicate whether a pause occurs in the front-end
  • the type indication field is used to indicate whether the pause occurs before the first instruction is submitted.
  • the sequence number indication field is used to indicate a mispredicted sequence number of the first instruction.
  • When the processor updates the first entry in the first register, the processor is specifically configured to: when a first performance event occurs at the front end of the processor pipeline and the reorder buffer (ROB) of the processor is empty, set the front-end indication field to the first value, set the type indication field to the second value, and store the instruction type information of the first instruction in the sequence number indication field.
  • The processor is specifically configured to: when a second performance event does not occur at the front end of the processor pipeline and is not a misprediction event, set both the front-end indication field and the type indication field to the second value.
  • When the processor updates the first entry in the first register, the processor is specifically configured to: when a third performance event is a misprediction event, set the front-end indication field to the second value, set the type indication field to the first value, and store the misprediction sequence number of the third performance event in the sequence number indication field.
  • When the processor updates the first entry in the first register, the processor is specifically configured to: when a fourth performance event is a misprediction event, set the front-end indication field to the second value, set the type indication field to the first value, and store the misprediction sequence number of the fourth performance event in the sequence number indication field; when a fifth performance event is a misprediction event, compare the misprediction sequence number of the fifth performance event with that of the fourth performance event; and store the smaller of the two misprediction sequence numbers in the sequence number indication field.
  • The processor may determine the instruction type of the first instruction according to the first entry in the following three specific implementations.
  • When determining the instruction type of the first instruction according to the first entry, the processor is specifically configured to: when determining that the front-end indication field is the first value, obtain the instruction type information of the first instruction stored in the sequence number indication field.
  • When determining the instruction type of the first instruction according to the first entry, the processor is specifically configured to: when determining that the front-end indication field and the type indication field are both the second value, obtain the instruction type information of the first instruction stored in the first entry of the reorder buffer (ROB).
  • the device further includes a third register.
  • the third register stores a third entry that was last deleted by the reordering buffer ROB.
  • the third entry contains the instruction type information of the first instruction.
  • When the processor determines that the front-end indication field is the second value, the type indication field is the first value, and the misprediction sequence number stored in the sequence number indication field is the same as the misprediction sequence number stored in the third register, the processor obtains the instruction type information of the first instruction stored in the third register.
  • the processor obtains the reordering buffer when it is determined that the front-end indication field is the second value, the type indication field is the first value, and the mispredicted sequence number stored in the sequence number indication field is greater than the mispredicted sequence number stored in the third register.
  • When the processor determines that the front-end indication field is the second value, the type indication field is the first value, and the misprediction sequence number stored in the sequence number indication field is less than the misprediction sequence number stored in the third register, the processor obtains the instruction type information of the first instruction stored in the first entry of the reorder buffer (ROB).
  • The processor is further configured to: after writing the cumulative stall cycle number to the second entry of the second register, write all the entries stored in the second register to the memory connected to the processor in any of the following cases: the second register overflows; the second register triggers an interrupt; the performance monitoring period of the processor ends.
  • In a third aspect, an embodiment of the present application further provides a processor performance monitoring device.
  • the device includes an update module, a start module, a stop module, and a read module.
  • the updating module is configured to update a first entry in the first register when a performance event occurs, and the first entry is used to indicate an index path of an instruction type of the first instruction that caused the performance event.
  • The start module is used to start a counter when a stall occurs, to count the first clock cycle number for which the stall lasts.
  • the stop module is used to stop updating the first entry after the pause is terminated.
  • the reading module is configured to determine an instruction type of the first instruction according to the first entry.
  • The update module is further configured to add the first clock cycle number to the cumulative stall cycle number corresponding to the instruction type of the first instruction, and to write the cumulative stall cycle number to a second entry in the second register. The second register is provided with multiple entries that correspond to multiple instruction types, and the entries are used to store the cumulative number of stall cycles caused by instructions of each instruction type.
  • The processor performance monitoring device provided in the third aspect may also be used to implement other possible implementation manners in the method for monitoring processor performance provided in the first aspect.
  • An embodiment of the present application further provides a computer-readable storage medium for storing a program used to execute the functions of the first aspect or of any implementation of the first aspect.
  • When the program is executed by a processor, it implements the method described in the first aspect or in any implementation of the first aspect.
  • An embodiment of the present application provides a computer program product containing program code; when the program code contained in the computer program product runs on a computer, the computer executes the method of the first aspect or of any implementation of the first aspect.
  • FIG. 1 is a schematic diagram of a processor instruction pipeline according to an embodiment of the present application;
  • FIG. 2 is a schematic diagram of a processor stall and the corresponding instruction "responsible for the stall" according to an embodiment of the present application;
  • FIG. 3 is a schematic flowchart of a method for monitoring processor performance according to an embodiment of the present application;
  • FIG. 4 is a schematic flowchart of another method for monitoring processor performance according to an embodiment of the present application;
  • FIG. 5 is a schematic structural diagram of a processor performance monitoring device according to an embodiment of the present application;
  • FIG. 6 is a schematic structural diagram of another processor performance monitoring device according to an embodiment of the present application.
  • Using a superscalar out-of-order execution processor can reduce the average execution time of a program and improve the processing efficiency of the processor.
  • However, because a superscalar out-of-order execution processor can issue multiple instructions to the execution stage in one clock cycle and can execute instructions in an order other than program order, performance analysis of such a processor becomes complicated.
  • Top-Down is a performance model based on the utilization of pipeline distribution slots.
  • each instruction microcode (also called a microinstruction or micro-operation, abbreviated uop)
  • This model monitors whether each distribution slot is stalled, tracks the execution of the microcode in each distribution slot (for example, whether it is committed or squashed), divides all distribution slots into four categories, and performs detailed analysis based on the performance monitoring unit (PMU).
  • Statistical Profiling is a method for analyzing the performance of a program. The method randomly selects one microcode out of every N microcodes (uops), and tracks and records all performance events that occur between the dispatch and the completion of that microcode in the instruction pipeline, together with the number of clock cycles for which each performance event lasts. These samples are then analyzed offline to infer the performance bottlenecks of the program.
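  • The sampling step of Statistical Profiling can be illustrated with a short sketch. The function name `sample_uops` and the choice of non-overlapping windows of N uops are assumptions; real profilers may select samples differently. The point of the sketch is that only one uop in N is observed.

```python
import random

def sample_uops(uops, n, rng):
    """Pick one uop at random from every window of n consecutive uops."""
    samples = []
    for start in range(0, len(uops), n):
        window = uops[start:start + n]
        samples.append(rng.choice(window))
    return samples

uops = list(range(100))                           # made-up stream of 100 uops
samples = sample_uops(uops, 10, random.Random(0)) # one sample per 10 uops
coverage = len(samples) / len(uops)               # fraction of uops observed
```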
  • The Top-Down model focuses on the overall picture when analyzing program performance, so it is difficult to locate the specific cause (such as the instructions and operations that caused the stall) when the processor stalls; in addition, many important sub-items in the Top-Down model are based on estimates, so accuracy is difficult to guarantee. The Statistical Profiling model focuses more on analyzing which instruction caused each performance event; because the model randomly samples the execution of individual instructions, insufficient coverage is likely in some cases.
  • The embodiments of the present application provide a method and a device for monitoring processor performance, so that a superscalar out-of-order execution processor can determine the instruction that caused a stall after the stall occurs, and can evaluate the performance overhead caused by that instruction after program execution ends.
  • Superscalar out-of-order execution processors usually use an eight-stage instruction pipeline, as shown in FIG. 1.
  • The eight-stage instruction pipeline includes eight pipeline stages: fetch, decode, register rename, dispatch, issue, execute, writeback, and commit. An instruction must pass through all eight stages to complete.
  • The processor can issue multiple instructions to the execute stage in one clock cycle, thereby executing multiple instructions concurrently inside the processor. In the execute stage, instructions need not be executed in the order prescribed by the program but may be executed out of order, so that while some instructions wait for their source operands, other instructions that do not depend on those operands can be executed first, which improves the throughput of the processor.
  • Superscalar out-of-order execution processors typically support out-of-order issue, out-of-order execution, and in-order submission (commit).
  • Out-of-order issue and out-of-order execution were introduced in the previous paragraph. In-order submission can be understood as follows: suppose the execution order specified by the program is instruction A → instruction B → instruction C, but instruction B needs to wait for a source operand before it can execute, so the actual execution order is instruction A → instruction C → instruction B. Although these three instructions are executed out of order, they must still be submitted in the order instruction A → instruction B → instruction C specified by the program.
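  • The A, B, C example can be illustrated with a minimal sketch. The function name `commit_order` and the cycle numbers are hypothetical, and the sketch assumes an unlimited commit width; it only shows that an instruction cannot be submitted before every earlier instruction in program order has finished executing.

```python
def commit_order(program_order, finish_cycle):
    """Return (cycle, instruction) pairs giving the earliest cycle at which
    each instruction can be submitted under in-order submission."""
    committed = []
    earliest = 0
    for insn in program_order:
        # An instruction cannot be submitted before it has finished executing,
        # nor before any earlier instruction in program order.
        earliest = max(earliest, finish_cycle[insn])
        committed.append((earliest, insn))
    return committed

# Hypothetical finish cycles: B waits for a source operand, so C finishes before B.
finish = {"A": 1, "B": 5, "C": 2}
result = commit_order(["A", "B", "C"], finish)
```

  • Here C finishes executing in cycle 2 but cannot be submitted until B has been submitted in cycle 5.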
  • the above description of the eight-segment instruction pipeline is only an example.
  • The embodiments of the present application are applicable to superscalar out-of-order execution processors. The instruction pipeline model adopted by the superscalar out-of-order execution processor is not specifically limited, as long as the instruction pipeline allows the processor to execute instructions in a superscalar, out-of-order manner.
  • Performance events are events that cause processor performance to fall below the design peak. Performance events can be triggered by instructions or operations executed by the processor. For example, performing long-delay division instructions or cache misses during cache access operations can cause performance events.
  • The performance event penalty (also referred to as the performance event overhead) refers to the number of additional clock cycles required to complete an instruction or an operation compared with the case "when a performance event is not raised".
  • A processor pipeline stall, which can also be referred to as a processor stall, means that the commit section of the instruction pipeline submits no instruction for multiple clock cycles.
  • The number of clock cycles that counts as "multiple clock cycles" may differ between processors; in some cases, the number may be considered to be one.
  • The number of clock cycles included in the multiple clock cycles is not specifically limited.
  • processor stalls are caused by instructions or operations executed by the processor.
  • an instruction is a command language, which is used to specify what operations the processor performs and where the operation object is located, such as a command to add two operation objects and a command to access a cache.
  • Operations refer to specific operations when executing instructions, such as addition operations, operations to access the cache, and so on.
  • For example, a load instruction causes the processor to stall when it misses in the last-level cache; for example, an operation that accesses the cache causes the processor to stall; for example, for a branch instruction (the processor predicts the branch flow of the program and then reads and decodes the instructions of one of the branches in advance to reduce the waiting time of the decoder), if a branch prediction error occurs, a large number of instructions are flushed, which causes the processor to stall.
  • The instruction or operation that causes the processor to stall is referred to as a "culprit."
  • the processor stall is caused by instructions or operations executed by the processor. For example, when the fetch instruction load is missing in the last level cache, we can think that the pause is caused by the load instruction, or we can think that the pause is caused by the operation of accessing the cache when the load instruction is executed. In the embodiment of the present application, the first statement is adopted, that is, the processor stall is considered to be caused by instructions.
  • FIG. 2 illustrates the processor stalls, and the corresponding "culprits", that occur while a processor executes multiple programs.
  • When executing program A, the load 1 instruction took a long time to execute because of a cache miss, which caused a processor stall (Stall 1). Although the load 2 instruction and the divide 3 instruction, which were executed after load 1, also caused performance events, the performance events caused by these two instructions are covered by the pause caused by the load 1 instruction.
  • The pause caused by the load 1 instruction occurs before the load 1 instruction is submitted; that is, the load 1 instruction is the first instruction submitted after the end of Stall 1.
  • This is because Stall 1 arises when the load 1 instruction takes too long to execute: other instructions cannot be submitted before the load 1 instruction finishes executing, which causes the processor to stall. Stall 1 ends once the load 1 instruction finishes executing, at which point the load 1 instruction is submitted.
  • Stall 2 occurs because the instruction pipeline was flushed after the branch 4 instruction was mispredicted; the branch 4 instruction was the last instruction submitted before Stall 2 started. This is because the pause caused by the misprediction of the branch 4 instruction only becomes visible after the instruction is submitted, so Stall 2 occurs after the branch 4 instruction is submitted.
  • the instruction that caused the processor to halt may be the first instruction submitted after the pause is terminated, or the last instruction submitted before the pause begins.
  • Stall 3 occurs at the front end of the processor pipeline, whereas Stall 1, Stall 2, and Stall 4 all occur at the back end of the processor pipeline.
  • The front end of the processor pipeline refers to the pipeline sections that perform instruction fetch, decode, and dispatch; the back end of the processor pipeline refers to the pipeline sections that perform instruction issue, execution, and submission.
  • the processor can judge the start and end of the pause based on the value of the commit width.
  • the performance event may cause the processor to halt in some cases, and may not cause the processor to halt in some cases.
  • a performance event is considered to cause the processor to stall only if a performance event causes the commit section of the instruction pipeline to have no instruction submissions within several clock cycles.
  • the instruction that caused the performance event can be considered a "culprit” that caused the processor to stall.
  • the performance overhead of the "culprit” can be defined as the number of clock cycles that the pause lasts, and the performance overhead of the "culprit” can also be understood as the performance overhead of the pause.
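  • As a minimal sketch of how the start and end of a pause could be derived from the commit width, the function below scans a per-cycle trace of commit widths. The name `find_stalls`, the `min_cycles` threshold, and the trace values are hypothetical; the text only states that a pause means no instruction is submitted for several clock cycles and that the number of cycles is not limited.

```python
def find_stalls(commit_widths, min_cycles=1):
    """Return (start_cycle, duration) for every run of cycles in which the
    commit width is zero for at least min_cycles consecutive cycles."""
    stalls = []
    start = None
    for cycle, width in enumerate(commit_widths):
        if width == 0:
            if start is None:
                start = cycle          # first cycle with no submission
        elif start is not None:
            if cycle - start >= min_cycles:
                stalls.append((start, cycle - start))
            start = None               # submissions resumed
    # A run of zero-width cycles may extend to the end of the trace.
    if start is not None and len(commit_widths) - start >= min_cycles:
        stalls.append((start, len(commit_widths) - start))
    return stalls

# Hypothetical trace: number of instructions submitted in each clock cycle.
widths = [2, 0, 0, 0, 1, 4, 0, 2]
stalls = find_stalls(widths, min_cycles=2)
```

  • With a threshold of two cycles, the three idle cycles starting at cycle 1 count as a pause with a performance overhead of 3 cycles, while the single idle cycle at cycle 6 does not.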
  • the instructions need to be submitted in the order specified by the program regardless of the order of execution of the instructions. Then, for the instructions that have been executed in the execute pipeline but cannot be submitted through the commit pipeline in the order prescribed by the program, the execution results can be stored in the re-order buffer (ROB).
  • one entry in the ROB corresponds to one instruction microcode.
  • Each entry in the ROB includes at least two fields: the instruction type and the execution result.
  • The ROB can be regarded as a circular queue with a head pointer and a tail pointer. All instructions that enter the instruction pipeline are stored in the ROB in the order prescribed by the program, and one entry in the ROB corresponds to one instruction microcode. The head pointer points to the related information (instruction type, execution result, etc.) of the next instruction to be submitted, and the tail pointer points to the related information (instruction type, execution result, etc.) of the instruction microcode most recently stored in the ROB.
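  • The circular-queue view of the ROB can be sketched as follows. The class `ReorderBuffer` and its method names are hypothetical, and, as a simplification, the tail index here marks the next free slot rather than the most recently stored entry.

```python
class ReorderBuffer:
    """Toy circular-queue ROB; each entry holds an instruction type and result."""

    def __init__(self, size):
        self.entries = [None] * size
        self.head = 0    # entry of the next instruction to be submitted
        self.tail = 0    # next free slot (simplification, see lead-in)
        self.count = 0

    def allocate(self, insn_type):
        """Store an instruction in program order and return its entry index."""
        if self.count == len(self.entries):
            raise OverflowError("ROB is full")
        idx = self.tail
        self.entries[idx] = {"type": insn_type, "result": None, "done": False}
        self.tail = (self.tail + 1) % len(self.entries)
        self.count += 1
        return idx

    def complete(self, idx, result):
        """Record the execution result; execution may finish out of order."""
        self.entries[idx]["result"] = result
        self.entries[idx]["done"] = True

    def commit(self):
        """Submit the head entry if it has finished executing, else return None."""
        if self.count == 0 or not self.entries[self.head]["done"]:
            return None
        entry = self.entries[self.head]
        self.entries[self.head] = None
        self.head = (self.head + 1) % len(self.entries)
        self.count -= 1
        return entry

rob = ReorderBuffer(4)
load_idx = rob.allocate("load")
fadd_idx = rob.allocate("fadd")
rob.complete(fadd_idx, 3.5)     # fadd finishes first (out of order)
blocked = rob.commit()          # head is the unfinished load, so nothing is submitted
rob.complete(load_idx, 42)
first = rob.commit()
second = rob.commit()
```

  • The fadd result waits in the ROB until the older load has been submitted, which is how execution results are held for in-order submission.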
  • The first register stores an index path used to indicate an instruction that causes a performance event, that is, an index path of the "culprit".
  • the processor updates the entry in the first register when a performance event occurs. For a performance event that does not cause a processor stall, the performance event will be overwritten by updating an entry in the first register; only the entry that caused the processor stall will eventually be saved in the first register. That is, an entry in the first register corresponds to a pause in the processor.
  • A reading pointer and a writing pointer are maintained in the first register.
  • The reading pointer indicates the entry that needs to be read when the processor stall is terminated; the processor can read this entry to determine the index path of the "culprit". The writing pointer is used to write and update the entries of the first register.
  • A performance event that does not cause a processor stall can be overwritten by updating the entry indicated by the writing pointer. After the pause is terminated, the index path of the "culprit" can be obtained by reading the entry pointed to by the reading pointer, and the instruction that caused the processor to stall can then be determined.
  • both the reading pointer and the writing pointer may point to the first entry in the first register, and at this time, the content of the entry is empty or the default value.
  • When a performance event occurs, the entry pointed to by the writing pointer (that is, the first entry) can be updated; after the processor pause is terminated, updating of the first entry can be stopped by moving the writing pointer to the next entry; then, by reading the entry pointed to by the reading pointer, the instruction that caused the processor to stall can be determined; after the reading is completed, the reading pointer can be moved to the next entry.
  • In this way, the entries of the first register (that is, the index paths of the instructions causing pauses) are written and updated through the writing pointer, and the "culprit" of a pause is determined by reading through the reading pointer.
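  • The pointer handling described above can be sketched as follows. The class `CulpritTracker`, its method names, the entry count, and the wrap-around of both pointers are assumptions made for illustration; the entries are plain strings here rather than real index paths.

```python
class CulpritTracker:
    """Toy model of the first register with a reading and a writing pointer."""

    def __init__(self, num_entries=4):
        self.entries = [None] * num_entries
        self.reading = 0   # entry to read when a pause is terminated
        self.writing = 0   # entry updated on every performance event

    def on_performance_event(self, entry):
        # A later event overwrites an earlier one that did not cause a pause.
        self.entries[self.writing] = entry

    def on_stall_end(self):
        # Moving the writing pointer stops further updates of the finished entry.
        self.writing = (self.writing + 1) % len(self.entries)
        culprit = self.entries[self.reading]
        self.reading = (self.reading + 1) % len(self.entries)
        return culprit

cts = CulpritTracker()
cts.on_performance_event("divide")   # did not cause a pause, gets overwritten
cts.on_performance_event("load")     # this event is followed by a pause
first_culprit = cts.on_stall_end()
cts.on_performance_event("branch")
second_culprit = cts.on_stall_end()
```

  • Only the event recorded last before the pause ends survives in the entry, which matches the statement that one saved entry corresponds to one pause.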
  • The reading pointer (reading) and the writing pointer (writing) in the first register may be moved differently in different scenarios. This is described in detail in the following embodiments and is not repeated here.
  • the first register may have a different name.
  • the first register may be referred to as a culprit tracking register set (CTS).
  • the specific name is not limited.
  • When two consecutive instructions cause the processor to stall, if the first register has only one entry, the entry may be overwritten before the first instruction has been submitted. In that case, it is difficult to determine the "culprit" of the first pause, because the entry related to the first instruction has been overwritten.
  • the first register in the embodiment of the present application may include multiple entries.
  • the number of entries contained in the first register may be the number of pipeline segments in the writeback segment +1.
  • the writeback section is used to control the processor to write the execution result of the instruction back to the memory or the register.
  • When the number of entries in the first register is equal to the number of pipeline segments in the writeback section + 1, the first register has enough entries to record the information of multiple instructions even if multiple instructions executed in parallel cause pauses.
  • the first register when the first register contains multiple entries, the first register can also be regarded as a register group, and each register in the register group contains one entry.
  • the second register stores a cumulative pause cycle number corresponding to an instruction type of each instruction executed by the processor.
  • the instruction type may be an operation instruction, such as a fadd instruction and a divide instruction; and the instruction type may also be an access instruction, such as a load instruction or a store instruction.
  • the instruction may be a branch instruction.
  • The cumulative number of stall cycles corresponding to each instruction type in the second register can be updated in an accumulative manner: when the processor stalls, the counter starts counting the number of clock cycles that the stall lasts; when the stall is terminated, the counter stops counting and the count result is accumulated into an entry in the second register.
  • For example, entry a of the second register records that the cumulative number of stall cycles of the processor caused by load instructions is M. When the processor stalls, the counter starts counting; when the stall is terminated, the counter stops counting, and the count result at this time is N. If the stall is caused by a load instruction, N is accumulated onto entry a in the second register, and entry a then records that the cumulative number of stall cycles of the processor caused by load instructions is M + N.
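  • The M + N accumulation can be shown with a small sketch. The class `StallCostTable` and the cycle counts 120 and 35 are hypothetical values standing in for M and N.

```python
from collections import defaultdict

class StallCostTable:
    """Toy model of the second register: cumulative stall cycles per instruction type."""

    def __init__(self):
        self.cycles = defaultdict(int)

    def accumulate(self, insn_type, stall_cycles):
        # Add the counter value of one finished stall to the matching entry.
        self.cycles[insn_type] += stall_cycles

spct = StallCostTable()
spct.accumulate("load", 120)   # earlier stalls caused by load instructions: M = 120
spct.accumulate("load", 35)    # counter value of the latest stall: N = 35
total_load_cycles = spct.cycles["load"]   # M + N
```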
  • The second register may be implemented by static random-access memory (SRAM) hardware.
  • the second register and the counter may have different names.
  • The second register may be referred to as a stall performance overhead statistics table (SPCT), and the counter may be referred to as a stall cycle counter (SCC); the specific names of the second register and the counter are not limited in the embodiments of the present application.
  • the entry in the second register may be updated into the memory, and then the second register is cleared, so that the performance of the processor may be evaluated according to the data in the memory after the execution of the program is terminated.
  • The processor may update the entries stored in the second register to the memory connected to the processor in any of the following cases: an overflow occurs when an entry in the second register is accumulated; the second register triggers an interrupt; or the processor's performance monitoring period ends.
  • the third register stores an entry that was last deleted by the ROB.
  • After an instruction enters the instruction pipeline, its relevant information (instruction type, execution result, etc.) can be saved in the ROB; after the instruction is submitted, the entry corresponding to the instruction is deleted from the ROB.
  • the third register may have a different name.
  • the third register may be referred to as a last committed instruction (LCI) register in the embodiment of the present application.
  • The specific name of the third register is not limited in the embodiments of the present application.
  • the first entry is updated when a performance event occurs.
  • each entry in the first register includes at least three fields: a front-end indication field, a type indication field, and a sequence number indication field. Updating the first entry updates these three fields in the first entry.
  • The front-end indication field indicates whether the pause occurred at the front end of the processor pipeline; the type indication field indicates whether the pause occurred before the "culprit" was submitted; and the sequence number indication field indicates the misprediction sequence number of the "culprit".
  • Some stalls occur at the front end of the processor pipeline, and some stalls occur at the back end of the processor pipeline. Whether a stall occurs at the front end of the processor pipeline may be indicated by the front-end indication field in the first register.
  • the instruction that caused the processor to halt may be the first instruction submitted after the pause is terminated, or the last instruction submitted before the pause begins.
  • the type indication field in the first register may be used to indicate whether the instruction that caused the processor stall is the first instruction submitted after the stall is terminated or the last instruction submitted before the stall starts.
  • The sequence number indication field can be understood as follows: when an entry in the first register records information about a mispredicted branch instruction, the sequence number indication field of the entry records the sequence number of the branch instruction; when an entry in the first register records information about an instruction other than a branch instruction, the sequence number indication field of the entry is meaningless (it may hold a default value) or is used to record other information (such as information used to determine the "culprit" at the front end of the pipeline).
  • processor stalls are classified into four types:
  • the first type of pause may be referred to as instruction supply pause, which is a pipeline pause caused by a processor pipeline frontend.
  • The first type of pause is caused by misses in the front-end caches of the processor pipeline (such as the I-Cache or I-TLB).
  • the salient feature of the first type of pause is that when it causes a processor pause, the ROB is also empty.
  • the first type of pause we can think of culprit as an operation, such as an I-Cache access operation; at the same time, we can also think of culprit as the instruction that caused the I-Cache to be missing when fetching instructions.
  • the second viewpoint is selected, that is, the first type of pause is caused by instructions.
  • Cache misses do not necessarily cause stalls.
  • For example, when an instruction fetch misses in the L1 I-Cache but hits in the L2 Cache, instructions temporarily cannot be supplied to the back end of the processor pipeline, but the ROB is likely not empty at this time and there are still instructions to submit in every clock cycle.
  • In the embodiments of the present application, a pause is not considered to occur in this case.
  • One of the main goals of modern processor architecture design is to hide the performance event overhead as much as possible. The above situation happens because the overhead of front-end performance events is hidden.
  • The second category:
  • The second type of pause may be referred to as a misprediction pause. A pause caused by instructions being flushed because of a misprediction is classified as a second type of pause. Common mispredictions are branch mispredictions and load-store ordering mispredictions.
  • the salient feature of the second type of pause is that the instruction that caused the pause (culprit) was submitted before the pause occurred, and is the last instruction submitted before the pause occurred.
  • When a misprediction does not cause a pause, the performance overhead of the misprediction is considered, in the embodiments of the present application, to be well hidden by out-of-order execution, and the mispredicted instruction is not recognized as a culprit.
  • The third category:
  • the third type of pause may be referred to as a system instruction pause.
  • The third type of pause is caused by the pipeline being flushed by a specific instruction, commonly the isb instruction.
  • the characteristics of the third type of pause are similar to the second type of pause. Culprit (such as the isb instruction) has been submitted before the pause, and is the last instruction submitted before the pause.
  • The third type of pause has two unique features: 1) when the specific instruction appears, it must cause a pause. For example, when the isb instruction is submitted, the entire instruction pipeline must be flushed, and the performance overhead this causes cannot be hidden; 2) the performance overhead of the third type of pause is almost constant. For example, the number of clock cycles of the pause caused by the isb instruction is almost always equal to the stage depth.
  • The third type of pause does not involve misprediction. However, because the front-end indication field and the type indication field are set in the same way for the second type of pause and the third type of pause, when a third type of pause occurs, a misprediction sequence number can also be written into the sequence number indication field of the entry in the first register, so that the second type of pause and the third type of pause can be distinguished based on the misprediction sequence number.
  • the fourth type of pause may be referred to as a long delay pause.
  • the fourth type of pause can be considered as a pause caused by a long delayed execution of an instruction, that is, an instruction that takes too long to execute causes a pause.
  • Common culprits of the fourth type of pause include a load instruction that misses in the last-level cache, a floating-point division instruction, and a load instruction that accesses shared data in the cache. Executing these instructions often requires tens to hundreds of clock cycles, which prevents subsequent instructions from being executed and submitted and causes the processor to stall.
  • The distinctive feature of the fourth type of pause is that the pause occurs before the culprit is submitted, and the culprit is the first instruction submitted after the pause is terminated.
  • The type of the pause can be determined according to the indication of each field in the first register, and then it can be determined which instruction caused the pause (that is, the culprit).
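  • As a rough sketch of this determination, the function below maps the front-end indication field and the type indication field to a pause category and to the position of the culprit. The function name `classify_stall`, the returned strings, and the concrete values 1 and 0 for the first and second values are illustrative assumptions; a field combination that the text does not describe is rejected.

```python
FIRST_VALUE, SECOND_VALUE = 1, 0   # assumed encoding of the two field values

def classify_stall(front_end, type_ind):
    """Map the two indication fields to (pause category, where the culprit is found)."""
    if front_end == FIRST_VALUE and type_ind == SECOND_VALUE:
        return ("first type", "identified by the sequence number indication field")
    if front_end == SECOND_VALUE and type_ind == FIRST_VALUE:
        return ("second or third type", "last instruction submitted before the pause")
    if front_end == SECOND_VALUE and type_ind == SECOND_VALUE:
        return ("fourth type", "first instruction submitted after the pause")
    raise ValueError("field combination not described in the text")

long_delay = classify_stall(front_end=0, type_ind=0)
flush = classify_stall(front_end=0, type_ind=1)
supply = classify_stall(front_end=1, type_ind=0)
```

  • Telling the second type from the third type additionally requires the sequence number indication field, which this sketch leaves out.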
  • the instruction type may be represented by Stall ID.
  • the second register holds the number of accumulated pause cycles corresponding to different instruction types.
  • the Stall ID in Table 1 can be used to indicate the instruction type.
  • the cumulative number of stall periods corresponding to each instruction type in the second register can be implemented in an accumulative manner: when the processor stalls, the counter starts counting the number of clock cycles for which the stall continues; when the stall is terminated, the counter stops Count and accumulate the count result into the corresponding entry in the second register.
  • After the pause is terminated, the Stall ID of the instruction can be determined, and the count result of the counter is accumulated into the corresponding entry in the second register.
  • After the entries in the second register are updated to the memory, the performance of the processor can be analyzed from the data in the memory to obtain the performance overhead caused by each type of instruction during the performance monitoring period.
  • For example, if the count result of the counter is P, P is added to the count value of the entry corresponding to Oct10 in the second register. If the count value of the entry corresponding to Oct10 in the second register before the pause occurs is Q, the count value of that entry after the accumulation is P + Q.
  • When the entry in the second register is updated to the memory, if the count value of the entry corresponding to Oct10 in the memory before the update is X, the count value of the entry corresponding to Oct10 in the memory after the update is X + P + Q.
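  • The X + P + Q arithmetic can be traced with a short sketch. The function `flush_to_memory`, the dictionary representation, and the numbers 1000, 40, and 25 (standing in for X, Q, and P) are hypothetical; the key "Oct10" is simply the Stall ID used in the example above.

```python
def flush_to_memory(memory, table):
    """Add every entry of the second register to the copy in memory, then clear the register."""
    for stall_id, count in table.items():
        memory[stall_id] = memory.get(stall_id, 0) + count
    table.clear()

memory = {"Oct10": 1000}   # X: value already in memory
table = {"Oct10": 40}      # Q: value in the second register before the pause
table["Oct10"] += 25       # P: counter value of the pause, giving P + Q
flush_to_memory(memory, table)
```

  • After the flush, the memory holds 1065 cycles for Oct10 and the second register is empty, ready for the next accumulation.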
  • other types of pauses other than the four types of pauses may be collectively classified as a fifth type of pause.
  • The obvious characteristics of the first type of pause are that the pause occurs at the front end of the processor pipeline and the ROB is empty.
  • The obvious characteristic of the second and third types of pause is that the instruction pipeline is flushed, whereas the fourth type of pause is difficult to identify clearly from intuitive characteristics. Therefore, a pause can be determined to be a fourth type of pause when it is not a first, second, or third type of pause.
  • For the fifth type of pause, the processing and the setting of the fields in the entry of the first register are similar to those for the fourth type of pause, so the fifth type of pause can be handled with reference to the fourth type of pause and is not described again in the embodiments of the present application.
  • a method for monitoring processor performance includes the following steps.
  • S301: The processor updates the first entry in the first register when a performance event occurs, and, when a pause occurs, starts a counter to count a first number of clock cycles for which the pause lasts.
  • the first register maintains a reading pointer and a writing pointer.
  • The reading pointer indicates the entry that needs to be read when the processor stall is terminated; the processor can use this entry to determine the index path of the "culprit". The writing pointer is used to write and update the entries of the first register. That is, in S301, the first entry updated by the processor may be the entry pointed to by the writing pointer.
  • the first register may include one or more entries.
  • the processor updates the entry in the first register when a performance event occurs. For a performance event that does not cause a processor stall, the performance event will be overwritten by updating an entry in the first register; only the entry that caused the processor stall will eventually be saved in the first register.
  • An entry in the first register corresponds to a stall of the processor, that is, each entry in the first register is used to indicate an index path of an instruction type of the instruction that caused the stall.
  • While the first entry is being updated, it indicates the index path of the instruction type of the instruction that caused the performance event. After the first entry stops being updated, the performance event recorded in the first entry is the performance event that caused the processor to stall, and the instruction that caused that performance event is the instruction that caused the processor to stall; that is, the first entry then indicates the index path of the instruction type of the instruction that caused the processor to stall.
  • the first entry may include a front-end indication field, a type indication field, and a sequence number indication field.
  • the front-end indication field is used to indicate whether a pause occurs in the front-end
  • the type indication field is used to indicate whether the pause occurs before the first instruction is submitted.
  • the sequence number indication field is used to indicate a mispredicted sequence number of the first instruction.
  • The operation of updating the first entry when a performance event occurs can be performed in the execute section of the instruction pipeline; that is, when the execute section judges that the processor performance is lower than the design peak, it can be determined that a performance event has occurred, at which point the first entry in the first register can be updated.
  • the counter is used to record the number of clock cycles in which the pause is continued. This counter starts counting when a pause occurs, and stops counting when the pause ends. After stopping counting, the value recorded by this counter is the number of clock cycles that the pause lasts, that is, the performance overhead of the pause.
  • the types of stalls that occur in the processor are classified into four types. Then, according to the type of the pause, in S301, the processor may update the first entry in multiple ways. Here are four specific ways to update the first entry.
  • The processor updates the first entry in the first register when a performance event occurs, which may be specifically implemented as follows: when a first performance event occurs at the front end of the processor pipeline and the ROB of the processor is empty, the processor sets the front-end indication field to the first value, sets the type indication field to the second value, and stores the instruction type information of the first instruction in the sequence number indication field.
  • processor stalls can occur on the front end or on the back end.
  • the front end indication field of the first entry may be set to the first value to indicate that the pause occurs at the front end of the processor pipeline.
  • The first type of processor stall is the instruction supply stall, that is, a pipeline stall caused by the front end of the processor pipeline.
  • a significant feature of the first type of pause is that the ROB is empty when a pause occurs. Therefore, in the first method, when the first performance event occurs at the front end of the processor pipeline and the ROB of the processor is empty, the front end indication field in the first entry may be set to the first value.
  • the first value may be 1 and the second value may be 0.
  • The sequence number indication field is reused in the first method; that is, the instruction type information of the first instruction is stored in the sequence number indication field, so that the "culprit" can be identified directly from the sequence number indication field.
  • The processor updates the first entry in the first register when a performance event occurs, which can be specifically implemented in the following manner: when a second performance event does not occur at the front end of the processor pipeline and the second performance event is not a misprediction event, the processor sets both the front-end indication field and the type indication field to the second value.
  • It is difficult to determine from intuitive features that a pause is a fourth type of pause, so when the type of a pause is not the first type, the second type, or the third type, the pause can be determined to be a fourth type of pause, that is, a long delay pause.
  • the second performance event does not occur at the front end of the processor pipeline, and the second performance event is not a misprediction event, indicating that the type of the pause is a fourth type of pause.
  • both the front-end indication field and the type indication field of the first entry may be set to the second value to indicate that the current pause is a fourth type of pause.
  • the second value may be 0.
  • The processor updates the first entry in the first register when a performance event occurs, which can be specifically implemented in the following manner: when a third performance event is a misprediction event, the processor sets the front-end indication field to the second value, sets the type indication field to the first value, and stores the misprediction sequence number of the third performance event in the sequence number indication field.
  • the front-end indication field and the type indication field of the second-type pause and the third-type pause are set in the same manner.
  • the pause is a second type pause or a third type pause.
  • the front end indication field of the first entry may be set to a second value, and the type indication field may be set to a first value.
  • the first value may be 1 and the second value may be 0.
  • The processor updates the first entry in the first register when a performance event occurs, which may be specifically implemented as follows: when a fourth performance event is a misprediction event, the processor sets the front-end indication field to the second value, sets the type indication field to the first value, and stores the misprediction sequence number of the fourth performance event in the sequence number indication field; when a fifth performance event that is also a misprediction event occurs, the processor compares the misprediction sequence number of the fifth performance event with the misprediction sequence number of the fourth performance event, and stores the smaller of the two misprediction sequence numbers in the sequence number indication field.
  • the first value may be 1 and the second value may be 0.
  • the processor can predict the branch flow of the program, and then read and decode the instructions of one of the branches in advance, thereby reducing the waiting time for the decoder. If a branch prediction error occurs, the instructions on the instruction pipeline will be emptied, causing the processor to stall.
  • The instruction that causes the processor to stall should be regarded as the first mispredicted instruction, that is, the instruction with the smallest misprediction sequence number. Therefore, in the fourth method, if two misprediction events occur while the first entry is being updated (that is, before the pause is terminated), the first entry stores the information of the performance event with the smaller misprediction sequence number, so that the index information of the instruction that caused the pause can be determined accurately.
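  • The rule of keeping the smaller misprediction sequence number can be sketched as follows. The function `update_on_misprediction`, the dictionary layout of an entry, and the sequence numbers are hypothetical; the values 1 and 0 for the first and second values follow the example values given in the text.

```python
def update_on_misprediction(entry, seq):
    """Return the updated first entry after a misprediction event with sequence number seq."""
    already_mispredicted = (
        entry is not None and entry["front_end"] == 0 and entry["type"] == 1
    )
    if already_mispredicted:
        # Keep the earlier (smaller) misprediction sequence number.
        seq = min(seq, entry["seq"])
    return {"front_end": 0, "type": 1, "seq": seq}

entry = update_on_misprediction(None, 17)    # fourth performance event
entry = update_on_misprediction(entry, 12)   # fifth event is older in program order
after_two = dict(entry)
entry = update_on_misprediction(entry, 30)   # a younger misprediction does not replace it
```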
  • the update process of the first entry may be performed according to actual conditions, and these specific operations may also be performed in the manner indicated by the above four methods.
  • the first entry is updated in the second way; then, a fourth performance event occurs without a processor stall, and the first entry may continue to be updated at this time.
  • the first entry records the related information of the fourth performance event. In this case, we can consider that the performance cost of the second performance event is hidden by the fourth performance event, and the second performance event does not cause the processor to stall.
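The four update methods described above can be sketched as follows. This is a minimal software illustration under stated assumptions, not the hardware described in the application: the names `Entry`, `PerfEvent`, and `update_entry` are hypothetical, the first value is taken as 1 and the second value as 0 as suggested above, and the fourth method (keeping the smaller misprediction sequence number) is folded into the misprediction branch.

```python
from dataclasses import dataclass
from typing import Optional

FIRST, SECOND = 1, 0  # "first value" and "second value" as suggested in the text


@dataclass
class Entry:
    """Hypothetical model of the first entry in the first register."""
    front_end: int = SECOND     # front-end indication field
    kind: int = SECOND          # type indication field
    seq: Optional[int] = None   # sequence number indication field


@dataclass
class PerfEvent:
    """Hypothetical descriptor of a performance event."""
    at_front_end: bool = False
    rob_empty: bool = False
    is_misprediction: bool = False
    mispredict_sn: Optional[int] = None    # misprediction sequence number
    instr_type_info: Optional[int] = None  # e.g. a stall ID


def update_entry(entry: Entry, ev: PerfEvent) -> None:
    if ev.at_front_end and ev.rob_empty:
        # Method 1: front-end event while the ROB is empty.
        entry.front_end, entry.kind = FIRST, SECOND
        entry.seq = ev.instr_type_info
    elif ev.is_misprediction:
        # Methods 3 and 4: record the misprediction, keeping the smallest
        # misprediction sequence number seen before the pause ends.
        already_mispredict = entry.front_end == SECOND and entry.kind == FIRST
        if already_mispredict and entry.seq is not None:
            entry.seq = min(entry.seq, ev.mispredict_sn)
        else:
            entry.seq = ev.mispredict_sn
        entry.front_end, entry.kind = SECOND, FIRST
    elif not ev.at_front_end:
        # Method 2: back-end event that is not a misprediction.
        entry.front_end, entry.kind = SECOND, SECOND
```

Feeding three misprediction events with sequence numbers 7, 3, and 9 into one entry leaves 3 in the sequence number field, which matches the rule that the earliest mispredicted instruction is treated as the cause of the pause.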
  • the entry pointed to by the write pointer (writing) needs to be updated; while updating the entry of the first register, it may also be necessary to read the sequence number indication field of the entry pointed to by the write pointer (writing) to determine how to update the sequence number indication field. After the pause, the entry pointed to by the read pointer (reading) needs to be read to determine the instruction type of the first instruction. That is, the first register may be configured with two read channels and one write channel, and can be regarded as a 2-read, 1-write register group.
  • one read channel is used to read the entry in the first register and determine how to update the entry in the execute section
  • the other read channel is used to read the entry in the first register in the commit section.
  • the write channel is used to update the entry in the first register during the execute segment.
  • S302 The processor stops updating the first entry after the pause is terminated, and determines an instruction type of the first instruction according to the first entry.
  • the processor can update the first entry in various ways when a performance event occurs.
  • the first entry updated according to the performance event will be overwritten, and only the information about the instruction that caused the processor to stall will be saved in the first entry.
  • the first entry records related information of the instruction that caused the processor stall, that is, the first instruction corresponding to the first entry may be considered as the instruction that caused the processor stall.
  • the processor stops updating the first entry when the pause is terminated, which can be achieved by moving the writing pointer in the first register to the next entry after the first entry. That is, the writing pointer in the first register can move to the next entry after each pause; the processor then continues to monitor the performance events that occur, and updates that next entry when a performance event occurs.
  • the writing pointer (writing) is moved to the next entry after each pause.
  • the purpose is to prevent the following situation: when several closely spaced instructions (such as two consecutive instructions) each cause the processor to pause, if the writing pointer (writing) is not moved, the information about the first instruction may be overwritten by the information about the second instruction before it is read in the commit section.
  • the processor may read the first entry pointed by the reading pointer after the pause is terminated, thereby determining the instruction type of the first instruction according to the first entry.
  • the reading pointer can be pointed to the next entry after the first entry, so that when a stall occurs again later, the instruction type of the instruction that caused the stall is determined according to the entry pointed to by the reading pointer.
  • the storage capacity of the first register is limited; after the first entry has been read and the instruction type of the first instruction has been determined from it, the first entry can be deleted or released so that entries that have already been read do not occupy the storage capacity of the first register.
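The pointer handling described above can be illustrated with a small software model. This is only a sketch: the class name `FirstRegister`, its method names, the default size of four entries, and the wrap-around behaviour of the pointers are all assumptions made for illustration and are not stated in the text.

```python
class FirstRegister:
    """Hypothetical model: a small ring of entries with two pointers."""

    def __init__(self, size: int = 4):
        self.entries = [None] * size
        self.writing = 0  # entry updated in the execute section
        self.reading = 0  # entry read in the commit section

    def update(self, info) -> None:
        # While the pause lasts, keep overwriting the entry under `writing`.
        self.entries[self.writing] = info

    def end_of_pause(self) -> None:
        # Stop updating the current entry by moving `writing` to the next one.
        self.writing = (self.writing + 1) % len(self.entries)

    def read_and_release(self):
        # Read the frozen entry, release its storage, then advance `reading`.
        info = self.entries[self.reading]
        self.entries[self.reading] = None
        self.reading = (self.reading + 1) % len(self.entries)
        return info
```

Because the writing pointer advances at the end of every pause, a second pause that follows closely writes into a different entry and cannot overwrite information that the commit section has not yet read.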
  • the processor may update the first entry in S301 in multiple ways. Similarly, in S302, according to the information recorded in the three fields of the first entry, the processor may determine the type of the pause, and then determine the instruction type of the first instruction that caused the pause.
  • the processor may determine the instruction type of the first instruction according to the first entry in the following three specific implementations.
  • the processor determines the instruction type of the first instruction according to the first entry, which may be specifically implemented as follows: when the processor determines that the front-end indication field is the first value, the processor obtains the instruction type information of the first instruction stored in the sequence number indication field.
  • the pause occurs at the front end of the processor pipeline, that is, the pause is the first type of pause.
  • the instruction that caused this pause is the first instruction submitted after the pause is terminated.
  • the instruction should correspond to the first entry in the ROB (that is, the entry corresponding to the instruction about to be submitted in the ROB).
  • the ROB is empty, so the index information of the instruction that caused the processor stall cannot be obtained from the ROB.
  • the instruction type information of the first instruction is stored in the sequence number indication field (as described in the first method of S301). Then, when the front-end indication field is the first value, the instruction type information of the first instruction stored in the sequence number indication field of the first entry can be directly obtained.
  • the processor determines the instruction type of the first instruction according to the first entry, which can be specifically implemented in the following manner: when the processor determines that the front-end indication field and the type indication field are both the second value, the processor obtains the instruction type information of the first instruction stored in the first entry in the ROB.
  • the first entry in the ROB can be understood as the entry pointed by the ROB's heading.
  • the pause is a fourth type pause.
  • the instruction that caused this pause is the first instruction submitted after the pause is terminated; in this case, the instruction should correspond to the first entry in the ROB (that is, the entry corresponding to the instruction about to be submitted in the ROB). Therefore, in the second implementation manner, the instruction type information of the first instruction stored in the first entry in the ROB can be obtained.
  • the processor determines the instruction type of the first instruction according to the first entry, which may be specifically implemented as follows: when the processor determines that the front-end indication field is the second value, the type indication field is the first value, and the mispredicted sequence number stored in the sequence number indication field is the same as the mispredicted sequence number stored in the third register, the instruction type information of the first instruction stored in the third register is obtained.
  • the third register stores the third entry that was last deleted by the ROB. The third entry contains the instruction type information of the first instruction.
  • the pause is the second type pause or the third type pause, and the instruction that caused the pause is the last instruction submitted before the pause occurred.
  • the instruction recorded in the third register is the instruction that caused the processor to pause.
  • the first entry may be updated out of order; for example, the instruction that caused the pause has not yet reached the commit segment, or the performance event recorded in the first entry is overwritten by other performance events, or the performance event recorded in the first entry overlaps with other performance events (in which case it can be considered that the performance event recorded in the first entry did not cause the pause).
  • the mispredicted sequence number stored in the sequence number indication field is compared with the mispredicted sequence number stored in the third register, in order to correctly identify the "responsible person for pause" when the above complex situation occurs:
  • if the mispredicted sequence number stored in the sequence number indication field in the first entry is the same as the mispredicted sequence number stored in the third register, it is determined that the instruction recorded in the first entry is the instruction recorded in the third register (that is, the last instruction submitted before the pause occurred).
  • This situation is similar to the situation shown in program B / D in Figure 2. At this time, we can obtain the instruction type information of the first instruction stored in the third register.
  • when the processor determines that the front-end indication field is the second value, the type indication field is the first value, and the mispredicted sequence number stored in the sequence number indication field is greater than the mispredicted sequence number stored in the third register, the processor obtains the instruction type information of the first instruction stored in the first entry in the ROB.
  • if the mispredicted sequence number stored in the sequence number indication field is greater than the mispredicted sequence number stored in the third register, it can be considered that the instruction indicated by the first entry has not yet reached the commit segment, that is to say, the pause caused by that instruction has not yet occurred. This pause is the last pause caused by the instruction indicated by the first entry. At this time, we should not get the index information of the instruction from the third register, but we should get the index information of the instruction from the first entry in the ROB.
  • when the processor determines that the front-end indication field is the second value, the type indication field is the first value, and the mispredicted sequence number stored in the sequence number indication field is smaller than the mispredicted sequence number stored in the third register, the processor obtains the instruction type information of the first instruction stored in the first entry in the ROB.
  • the performance event recorded in the first entry may be considered to have been overwritten by another performance event (that is, it can be considered that the performance event recorded in the first entry did not cause the processor to stall).
  • the performance event recorded in the first entry is referred to as performance event p
  • the performance event covering performance event p is referred to as performance event q.
  • the reading pointer of the first register may be updated to point to the next entry.
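The three implementations for locating the instruction type information can be summarised as one decision function. This is a sketch under assumptions: the function name `instruction_type_source` and the returned labels are hypothetical, and the first and second values are taken as 1 and 0 as suggested earlier.

```python
from typing import Optional

FIRST, SECOND = 1, 0  # "first value" and "second value"


def instruction_type_source(front_end: int, kind: int,
                            entry_sn: Optional[int] = None,
                            third_register_sn: Optional[int] = None) -> str:
    """Return where the instruction type information should be read from."""
    if front_end == FIRST:
        # Implementation 1: front-end pause, the information is in the
        # sequence number indication field of the first entry.
        return "sequence_number_field"
    if kind == SECOND:
        # Implementation 2: both fields hold the second value, so the
        # first entry in the ROB is used.
        return "rob_first_entry"
    if entry_sn == third_register_sn:
        # Implementation 3, equal sequence numbers: the instruction last
        # removed from the ROB (held in the third register) caused the pause.
        return "third_register"
    # Implementation 3, greater or smaller sequence number: fall back to
    # the first entry in the ROB.
    return "rob_first_entry"
```

For example, `instruction_type_source(0, 1, entry_sn=12, third_register_sn=12)` selects the third register, while any mismatch between the two sequence numbers selects the first entry in the ROB.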
  • S303 The processor adds the first clock cycle number to the accumulated pause cycle number corresponding to the instruction type of the first instruction, and writes the accumulated pause cycle number to the second entry of the second register.
  • the second register is provided with multiple entries, each of which corresponds to multiple instruction types, and the multiple entries are used to store a cumulative number of pause cycles caused by instructions under each instruction type.
  • each type of pause has its own characteristics, and each type of pause can be caused by different or the same type of instructions.
  • the branch instruction can cause a second type of pause
  • the load instruction can cause a second type of pause or a fourth type of pause.
  • the second register stores the cumulative number of pause cycles caused by the instructions under each instruction type.
  • the second register may record the cumulative number of pause periods of the second type of pause caused by the branch instruction as A, the cumulative number of pause periods of the second type of pause caused by the load instruction as B, the cumulative number of pause periods of the fourth type of pause caused by the load instruction as C ... and so on.
  • the process of updating the second entry in the second register may be: accumulating the count result of the counter (that is, the first clock cycle number) into the entry corresponding to the first instruction in the second register. For example, if the method shown in FIG. 3 determines that the stall ID of the first instruction causing the stall is Oct01 and the first clock cycle number counted by the counter is M, then when S303 is executed, M is added to the cumulative value of the entry corresponding to Oct01 in the second register. Assuming that the cumulative value of the entry corresponding to Oct01 before the pause occurs is N, the cumulative value of the entry after the addition is M + N.
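The accumulation step can be shown with a short worked example. The function name `accumulate`, the use of a dictionary to stand in for the second register, and the concrete numbers N = 120 and M = 35 are illustrative assumptions only.

```python
from collections import defaultdict


def accumulate(second_register, stall_id, first_clock_cycles):
    """Add the counter result M to the entry selected by the stall ID."""
    second_register[stall_id] += first_clock_cycles
    return second_register[stall_id]


second_register = defaultdict(int)   # stall ID -> cumulative pause cycles
second_register["Oct01"] = 120       # N: value before the pause (assumed)
total = accumulate(second_register, "Oct01", 35)  # M = 35 (assumed)
```

With these assumed values the entry for Oct01 holds M + N = 155 cycles after the update.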
  • all the entries stored in the second register may be updated to the memory connected to the processor in any of the following cases: the second register overflows; the second register triggers an interrupt; the performance monitoring period of the processor ends.
  • After all the entries saved in the second register are updated into the memory, the second register can be cleared. Then, the performance of the processor can be evaluated based on the data in the memory, for example, by analyzing the percentage of stalls caused by each type of instruction, which types of instructions are prone to stalls, and the number of processor stall cycles as a percentage of the total execution cycles of the program.
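The flush to memory and the subsequent analysis might look like the sketch below. The function names `flush_to_memory` and `stall_breakdown` are hypothetical, dictionaries stand in for the second register and the memory, and adding the flushed values to any totals already in memory is an assumption; the text only says the entries are updated to the memory.

```python
def flush_to_memory(second_register: dict, memory: dict) -> None:
    """Move all accumulated counts to memory, then clear the register."""
    for instr_type, cycles in second_register.items():
        memory[instr_type] = memory.get(instr_type, 0) + cycles
    second_register.clear()


def stall_breakdown(memory: dict, total_cycles: int):
    """Return each type's share of stall cycles and the overall stall fraction."""
    stalled = sum(memory.values())
    share_by_type = ({t: c / stalled for t, c in memory.items()}
                     if stalled else {})
    return share_by_type, stalled / total_cycles
```

Given assumed counts of 30 stall cycles for branch instructions and 70 for load instructions in a program of 400 cycles, the load instructions account for 70% of the stall cycles and stalls make up 25% of the execution time.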
  • the method shown in FIG. 3 is used to update the first entry in the first register when a performance event occurs, and to stop updating the first entry after the processor stall is terminated.
  • the first entry can be used to indicate the index path of the instruction type of the first instruction that caused the processor to stall.
  • the processor starts a counter when a pause occurs. Then, after the pause is terminated, the counter has recorded the first clock cycle number for which the pause lasted. The first clock cycle number may be used to indicate the performance overhead caused by the pause. Therefore, after the pause ends, the first entry can be read to determine the instruction type of the first instruction that caused the pause; at the same time, the second register stores the cumulative number of pause cycles caused by the instructions under each instruction type.
  • the first clock cycle number can be accumulated into the entry corresponding to the instruction type of the first instruction in the second register (that is, the second entry), so that the performance of the processor can be comprehensively analyzed, for example, by analyzing the percentage of stalls caused by each type of instruction, which types of instructions are prone to stalls, and the number of processor stall cycles as a percentage of the total execution cycles of the program.
  • the type of the instruction that caused the stall can be accurately determined after the processor stalls, and the performance cost of each type of instruction can be assessed after the execution of the program ends.
  • processor monitoring can be implemented through a low-cost hardware mechanism. That is, in the embodiment of the present application, the instruction that caused the pause can be accurately determined by adding a register group that supports 2 reads and 1 write in the execute section, combined with the judgment logic of the commit section.
  • an embodiment of the present application further provides a method for monitoring processor performance.
  • This method can be regarded as a specific example of the method shown in FIG. 3.
  • the method may be: when a front-end cache miss occurs in the fetch section and the ROB is empty, the entry pointed to by the writing pointer in the CTS is updated, FE is set to 1, and the stall ID of the fetch instruction is written into the Squash SN field (i.e., a specific example of a sequence number indication field).
  • the processor judges that the pause is a front-end pause of the pipeline, uses the stall ID stored in Squash SN to index the SPCT, and accumulates the SCC count value into the corresponding entry in the SPCT.
  • the embodiment of the present application further provides a processor performance monitoring device, which can be used to execute the processor performance monitoring method shown in FIG. 3.
  • the processor performance monitoring device 500 (hereinafter referred to as “device 500”) includes a processor 501, a first register 502, a counter 503, and a second register 504.
  • the processor 501 is configured to: update a first entry in the first register 502 when a performance event occurs, and when a pause occurs, start the counter 503 to count a first clock cycle number for which the pause lasts, where the first entry is used to indicate an index path of the instruction type of the first instruction that caused the performance event; after the pause ends, stop updating the first entry, and determine the instruction type of the first instruction according to the first entry; add the first clock cycle number to the accumulated pause cycle number corresponding to the instruction type of the first instruction, and write the accumulated pause cycle number to the second entry in the second register 504; the second register 504 is provided with multiple entries, the multiple entries respectively correspond to multiple instruction types, and the multiple entries are used to store the cumulative number of pause cycles caused by the instructions under each instruction type.
  • the first register 502, the counter 503, and the second register 504 may be integrated on the processor 501, or may be set separately.
  • the processor performance monitoring device 500 can also be regarded as a type of processor.
  • the first entry may include a front-end indication field, a type indication field, and a sequence number indication field.
  • the front-end indication field is used to indicate whether a pause occurs in the front-end
  • the type indication field is used to indicate whether the pause occurs before the first instruction is submitted.
  • the sequence number indication field is used to indicate a mispredicted sequence number of the first instruction.
  • the processor 501 can update the first entry in multiple ways, and four of them are listed below.
  • when the processor 501 updates the first entry in the first register 502, the processor 501 is specifically configured to: when the first performance event occurs at the front end of the processor 501 pipeline and the reordering buffer (ROB) of the processor 501 is empty, set the front-end indication field to the first value, set the type indication field to the second value, and store the instruction type information of the first instruction in the sequence number indication field.
  • when the processor 501 updates the first entry in the first register 502, the processor 501 is specifically configured to: when the second performance event does not occur at the front end of the processor 501 pipeline and the second performance event is not a misprediction event, set both the front-end indication field and the type indication field to the second value.
  • when the processor 501 updates the first entry in the first register 502, the processor 501 is specifically configured to: when the third performance event is a misprediction event, set the front-end indication field to the second value, set the type indication field to the first value, and store the mispredicted sequence number of the third performance event in the sequence number indication field.
  • when the processor 501 updates the first entry in the first register 502, the processor 501 is specifically configured to: when the fourth performance event is a misprediction event, set the front-end indication field to the second value, set the type indication field to the first value, and store the misprediction sequence number of the fourth performance event in the sequence number indication field; when the fifth performance event is a misprediction event, compare the misprediction sequence number of the fifth performance event with that of the fourth performance event, and store the smaller of the two in the sequence number indication field.
  • the processor 501 may determine the instruction type of the first instruction according to the first entry in the following three specific implementation manners.
  • when the processor 501 determines the instruction type of the first instruction according to the first entry, the processor 501 is specifically configured to: when determining that the front-end indication field is the first value, obtain the instruction type information of the first instruction stored in the sequence number indication field.
  • when the processor 501 determines the instruction type of the first instruction according to the first entry, the processor 501 is specifically configured to: when determining that the front-end indication field and the type indication field are both the second value, obtain the instruction type information of the first instruction stored in the first entry in the reordering buffer (ROB).
  • the device 500 further includes a third register.
  • the third register stores a third entry that was last deleted by the reordering buffer ROB.
  • the third entry contains the instruction type information of the first instruction.
  • the processor 501 obtains the instruction type information of the first instruction stored in the third register when it determines that the front-end indication field is the second value, the type indication field is the first value, and the mispredicted sequence number stored in the sequence number indication field is the same as the mispredicted sequence number stored in the third register.
  • the processor 501 obtains the instruction type information of the first instruction stored in the first entry in the reordering buffer (ROB) when it determines that the front-end indication field is the second value, the type indication field is the first value, and the mispredicted sequence number stored in the sequence number indication field is greater than the mispredicted sequence number stored in the third register.
  • the processor 501 obtains the instruction type information of the first instruction stored in the first entry in the reordering buffer (ROB) when it determines that the front-end indication field is the second value, the type indication field is the first value, and the mispredicted sequence number stored in the sequence number indication field is smaller than the mispredicted sequence number stored in the third register.
  • the processor 501 is further configured to: after writing the accumulated pause cycle number into the second entry of the second register 504, update all the entries saved in the second register 504 to the memory connected to the processor 501 in any of the following cases: the second register 504 overflows; the second register 504 triggers an interrupt; the performance monitoring period of the processor 501 ends.
  • processor performance monitoring device 500 may be used to execute the method provided by the embodiment corresponding to FIG. 3, so the implementation manners and technical effects not described in detail for the processor performance monitoring device 500 shown in FIG. 5 may be found in the related description of FIG. 3.
  • an embodiment of the present application further provides a processor performance monitoring device, which can be used to execute the processor performance monitoring method shown in FIG. 3, and can also be regarded as the same device as the processor performance monitoring device 500 shown in FIG. 5.
  • the processor performance monitoring device 600 includes an update module 601, a start module 602, a stop module 603, and a read module 604.
  • An update module 601 is configured to update a first entry in the first register when a performance event occurs, where the first entry is used to indicate an index path of an instruction type of the first instruction that caused the performance event.
  • the starting module 602 is configured to start a counter when a pause occurs, to count the first clock cycle number for which the pause lasts.
  • the stopping module 603 is configured to stop updating the first entry after the pause is terminated.
  • the reading module 604 is configured to determine an instruction type of the first instruction according to the first entry.
  • the update module 601 is further configured to add the first clock cycle number to the accumulated pause cycle number corresponding to the instruction type of the first instruction, and write the accumulated pause cycle number to a second entry in the second register; the second register is provided with multiple entries, the multiple entries respectively correspond to multiple instruction types, and the multiple entries are used to store the cumulative number of pause cycles caused by instructions under each instruction type.
  • processor performance monitoring device 600 shown in FIG. 6 may also be used to perform other operations in the method for monitoring processor performance shown in FIG. 3, which are not described herein again.
  • the division of the modules in the embodiments of the present application is schematic, and is only a logical function division. In actual implementation, there may be another division manner.
  • the functional modules in the embodiments of the present application may be integrated into one processing module, or each module may exist separately physically, or two or more modules may be integrated into one module.
  • the above integrated modules may be implemented in the form of hardware or software functional modules.
  • if the integrated module is implemented in the form of a software functional module and sold or used as an independent product, it can be stored in a computer-readable storage medium.
  • the technical solution of the present application, in essence, or the part that contributes to the existing technology, or all or part of the technical solution, can be embodied in the form of a software product, which is stored in a storage medium.
  • a computer device which may be a personal computer, a server, or a network device
  • the aforementioned storage media include: U disks, mobile hard disks, read-only memories (ROMs), random access memories (RAMs), magnetic disks, compact discs, and other media that can store program code.
  • processor performance monitoring device 600 can be used to execute the method provided by the embodiment corresponding to FIG. 3, so the implementation and technical effects not described in detail for the processor performance monitoring device 600 shown in FIG. 6 may be found in the related description of FIG. 3.
  • an embodiment of the present application further provides a computer storage medium.
  • a program is stored on the computer storage medium, and when the program is executed by a processor, the program is used to implement the method provided by the embodiment corresponding to FIG.
  • An embodiment of the present application further provides a computer program product.
  • when the program code included in the computer program product runs on a computer, it causes the computer to execute the method provided in the embodiment corresponding to FIG. 3.
  • this application may be provided as a method, a system, or a computer program product. Therefore, this application may take the form of an entirely hardware embodiment, an entirely software embodiment, or an embodiment combining software and hardware aspects. Moreover, this application may take the form of a computer program product implemented on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, etc.) containing computer-usable program code.
  • These computer program instructions may also be stored in a computer-readable memory capable of directing a computer or other programmable data processing device to work in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including an instruction device, and the instruction device implements the functions specified in one or more flowcharts and / or one or more blocks of the block diagram.
  • These computer program instructions can also be loaded onto a computer or other programmable data processing device, so that a series of operational steps are performed on the computer or other programmable device to produce a computer-implemented process, and the instructions executed on the computer or other programmable device thereby provide steps for implementing the functions specified in one or more flowcharts and / or one or more blocks of the block diagrams.

Abstract

A method and device for monitoring the performance of a processor, which are used for determining the instruction type of an instruction that causes a pause after the processor pauses and evaluating the performance overhead created by the instruction after a program is finished being executed. The method comprises: a processor updating a first entry in a first register when a performance event occurs and starting up a counter when a pause occurs so as to count a first number of clock cycles for which the pause lasts, the first entry being used for indicating an index path of the instruction type of a first instruction that causes the performance event; the processor stopping updating the first entry after the pause is over and determining the instruction type of the first instruction according to the first entry; the processor superposing the first number of clock cycles into an accumulated number of pause cycles corresponding to the instruction type of the first instruction and writing the accumulated number of pause cycles into a second entry of a second register; the second register is provided with multiple entries, the multiple entries respectively being used for saving the accumulated number of pause cycles of pauses caused by instructions having various instruction types.

Description

Method and device for monitoring processor performance
Technical field
The present application relates to the field of computer technology, and in particular, to a method and a device for monitoring processor performance.
Background
Superscalar execution means that, in the issue section of the processor pipeline, multiple instructions can be issued to the execute section in each clock cycle for execution, so that multiple instructions are executed concurrently within the processor. Out-of-order execution means that the processor need not execute instructions in the order specified by the program. In a traditional in-order processor, the pipeline stalls whenever the next instruction must wait for the execution result of the previous instruction before it can execute. With superscalar out-of-order execution, the processor can in that situation go on to execute later instructions that do not depend on that result, so the execute section of the pipeline can stay in a working state. Superscalar out-of-order execution therefore reduces the average execution time of a program and improves the processing efficiency of the processor.
Although superscalar out-of-order execution brings these advantages, it also complicates processor performance analysis. For example, when the pipeline of a superscalar out-of-order execution processor stalls, multiple performance events caused by different instructions may occur during the stall, and it is difficult for the processor to determine which performance event caused the pipeline to stall. As another example, because such performance events overlap, it is difficult to attribute a processor stall to a particular instruction, and therefore difficult to evaluate the performance overhead caused by each instruction that the processor executes.
In summary, there is an urgent need for a method and device for monitoring processor performance, so that a superscalar out-of-order execution processor can determine the instruction that caused a pause after the pause occurs, and evaluate the performance overhead caused by that instruction after the execution of the program ends.
Summary of the Invention
Embodiments of the present application provide a method and device for monitoring processor performance, which enable a superscalar out-of-order processor to determine the instruction that caused a stall after the stall occurs, and to evaluate the performance overhead caused by that instruction after program execution ends.
In a first aspect, an embodiment of the present application provides a method for monitoring processor performance. The method includes the following steps: the processor updates a first entry in a first register when a performance event occurs and, when a stall occurs, starts a counter to count a first number of clock cycles for which the stall lasts, where the first entry indicates an index path to the instruction type of a first instruction that caused the performance event; the processor stops updating the first entry after the stall ends, and determines the instruction type of the first instruction according to the first entry; the processor adds the first number of clock cycles to the accumulated number of stall cycles corresponding to the instruction type of the first instruction, and writes the accumulated number of stall cycles into a second entry of a second register. The second register has multiple entries that correspond respectively to multiple instruction types and store the accumulated stall cycles of stalls caused by instructions of each instruction type.
With the method of the first aspect, the first entry in the first register is updated when a performance event occurs and is no longer updated once the processor stall ends, so the first entry can indicate the index path to the instruction type of the first instruction that caused the stall. In addition, the processor starts a counter when the stall occurs, so once the stall ends the counter holds the first number of clock cycles for which the stall lasted; this number represents the performance overhead of the stall. Therefore, after the stall ends, the instruction type of the first instruction that caused it can be determined by reading the first entry. Meanwhile, the second register stores the accumulated stall cycles of stalls caused by instructions of each instruction type; after the stall ends, the first number of clock cycles can be added to the entry in the second register that corresponds to the instruction type of the first instruction (that is, the second entry). This allows a comprehensive analysis of processor performance, for example the percentage of stalls caused by each type of instruction, which types of instruction tend to cause stalls, and the processor's stall cycles as a percentage of the program's total execution cycles. In summary, with the processor performance monitoring solution provided in the embodiments of the present application, the instruction type of the instruction that caused a stall can be accurately determined after the processor stalls, and the performance overhead caused by each type of instruction can be evaluated after program execution ends.
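The accounting described in the first aspect can be illustrated with a small software model. This is only a sketch under stated assumptions: the class name `StallAccounting`, its methods, and the string labels used for instruction types are hypothetical, and a dictionary stands in for the second register.

```python
from collections import defaultdict


class StallAccounting:
    """Toy model of the counter and of the second register (hypothetical names)."""

    def __init__(self):
        self.totals = defaultdict(int)  # stands in for the second register
        self.counter = 0                # clock cycles counted in the current stall

    def tick(self, pipeline_stalled):
        """Advance one clock cycle; count it only while the pipeline is stalled."""
        if pipeline_stalled:
            self.counter += 1

    def end_stall(self, instr_type):
        """The stall has ended: add the counted cycles to the entry for instr_type."""
        cycles = self.counter
        self.totals[instr_type] += cycles
        self.counter = 0                # ready for the next stall
        return cycles
```

Calling `tick(True)` for each stalled cycle and then `end_stall(...)` with the instruction type read from the first entry leaves one running total per instruction type in `totals`.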
The first entry includes a front-end indication field, a type indication field, and a sequence number indication field. The front-end indication field indicates whether the stall occurs at the front end, the type indication field indicates whether the stall occurs before the first instruction is committed, and the sequence number indication field indicates the misprediction sequence number of the first instruction.
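One possible layout of the first entry is sketched below. The text does not give field widths or bit positions, so the 16-bit sequence number field and the placement of the two single-bit fields above it are assumptions, as are the function names.

```python
SEQ_BITS = 16                    # assumed width of the sequence number indication field
SEQ_MASK = (1 << SEQ_BITS) - 1
TYPE_SHIFT = SEQ_BITS            # assumed position of the type indication bit
FRONT_SHIFT = SEQ_BITS + 1       # assumed position of the front-end indication bit


def pack_first_entry(front_end, type_flag, seq):
    """Pack the three fields of the first entry into one register word."""
    return (((front_end & 1) << FRONT_SHIFT)
            | ((type_flag & 1) << TYPE_SHIFT)
            | (seq & SEQ_MASK))


def unpack_first_entry(word):
    """Return (front_end, type_flag, seq) from a packed register word."""
    return (word >> FRONT_SHIFT) & 1, (word >> TYPE_SHIFT) & 1, word & SEQ_MASK
```

With this layout, a sequence number wider than the assumed field is truncated on packing.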
In the embodiments of the present application, the processor may update the first entry in the following four ways.
First way
The processor updates the first entry in the first register when a performance event occurs as follows: when a first performance event occurs at the front end of the processor pipeline and the processor's reorder buffer (ROB) is empty, the processor sets the front-end indication field to a first value, sets the type indication field to a second value, and stores the instruction type information of the first instruction in the sequence number indication field.
In the first way, the processor sets the front-end indication field to the first value to indicate that the stall occurs at the front end of the processor pipeline, and sets the type indication field to the second value to indicate that the instruction causing the stall is the last instruction committed before the stall occurred. In addition, because the ROB is empty when a first-type stall occurs, the "stall culprit" is difficult to determine from the entries in the ROB. Therefore, so that the index path of the instruction causing the stall can be determined later, the first way reuses the sequence number indication field: the instruction type information of the first instruction is stored in that field, and the "stall culprit" can then be determined directly from it.
Second way
The processor updates the first entry in the first register when a performance event occurs as follows: when a second performance event does not occur at the front end of the processor pipeline and is not a misprediction event, the processor sets both the front-end indication field and the type indication field to the second value.
In the second way, the processor sets the front-end indication field to the second value to indicate that the stall occurs at the back end of the processor pipeline, and sets the type indication field to the second value to indicate that the instruction causing the stall is the last instruction committed before the stall occurred.
Third way
The processor updates the first entry in the first register when a performance event occurs as follows: when a third performance event is a misprediction event, the processor sets the front-end indication field to the second value, sets the type indication field to the first value, and stores the misprediction sequence number of the third performance event in the sequence number indication field.
In the third way, the processor sets the front-end indication field to the second value to indicate that the stall occurs at the back end of the processor pipeline, and sets the type indication field to the first value to indicate that the instruction causing the stall is the first instruction committed after the stall ends.
Fourth way
The processor updates the first entry in the first register when a performance event occurs as follows: when a fourth performance event is a misprediction event, the processor sets the front-end indication field to the second value, sets the type indication field to the first value, and stores the misprediction sequence number of the fourth performance event in the sequence number indication field; when a fifth performance event is a misprediction event, the processor compares the misprediction sequence number of the fifth performance event with that of the fourth performance event, and stores the smaller of the two misprediction sequence numbers in the sequence number indication field.
In the fourth way, the processor sets the front-end indication field to the second value to indicate that the stall occurs at the back end of the processor pipeline, and sets the type indication field to the first value to indicate that the instruction causing the stall is the first instruction committed after the stall ends. In addition, in the fourth way, if two misprediction events occur while the first entry is being updated (that is, before the stall ends), the first entry keeps the information of the performance event with the smaller misprediction sequence number, so that the index information of the instruction that caused the stall can be determined accurately.
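The four update ways above can be sketched as a single update function. The encoding of the first and second values as 1 and 0, the `FirstEntry` class, and the keyword arguments describing a performance event are all assumptions made for illustration. The text does not say what happens for a front-end event when the ROB is not empty, so the sketch leaves the entry unchanged in that case.

```python
from dataclasses import dataclass

FIRST_VALUE = 1   # assumed encoding of the "first value"
SECOND_VALUE = 0  # assumed encoding of the "second value"


@dataclass
class FirstEntry:
    front_end: int = SECOND_VALUE
    type_flag: int = SECOND_VALUE
    seq_field: int = 0


def update_first_entry(entry, *, at_front_end=False, rob_empty=False,
                       is_mispredict=False, seq=0, instr_type=0):
    """Apply one performance event to the first entry (hypothetical event encoding)."""
    if at_front_end and rob_empty:
        # First way: the sequence number field is reused for the instruction type.
        entry.front_end = FIRST_VALUE
        entry.type_flag = SECOND_VALUE
        entry.seq_field = instr_type
    elif is_mispredict:
        # Third and fourth ways: if a misprediction is already recorded,
        # keep the smaller of the two misprediction sequence numbers.
        already_recorded = (entry.front_end == SECOND_VALUE
                            and entry.type_flag == FIRST_VALUE)
        if already_recorded:
            seq = min(seq, entry.seq_field)
        entry.front_end = SECOND_VALUE
        entry.type_flag = FIRST_VALUE
        entry.seq_field = seq
    elif not at_front_end:
        # Second way: back-end event that is not a misprediction.
        entry.front_end = SECOND_VALUE
        entry.type_flag = SECOND_VALUE
    return entry
```

The function would be called for every performance event until the stall ends, after which the entry is left untouched and read back.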
In the embodiments of the present application, the processor can determine the instruction type of the first instruction in several ways; three are listed below.
First way
The processor determines the instruction type of the first instruction according to the first entry as follows: when the processor determines that the front-end indication field is the first value, it obtains the instruction type information of the first instruction stored in the sequence number indication field.
In the first way, the front-end indication field in the first entry is the first value, so it can be determined that the stall occurred at the front end of the processor pipeline and that the instruction causing it is the first instruction committed after the stall ends. In this case, that instruction is taken to be the first entry in the ROB (that is, the entry corresponding to the instruction about to be committed). However, the ROB is empty when a first-type stall occurs, so the index information of the instruction that caused the stall cannot be obtained from the ROB. As described above, when the stall occurs at the front end of the processor pipeline, the instruction type information of the first instruction was already stored in the sequence number indication field when the first entry was updated. Therefore, when the front-end indication field is the first value, the instruction type information of the first instruction can be read directly from the sequence number indication field of the first entry.
Second way
The processor determines the instruction type of the first instruction according to the first entry as follows: when the processor determines that both the front-end indication field and the type indication field are the second value, it obtains the instruction type information of the first instruction stored in the first entry of the reorder buffer (ROB).
In the second way, both the front-end indication field and the type indication field in the first entry are the second value, so the stall can be determined to be a fourth-type stall. For a fourth-type stall, the instruction causing the stall is the first instruction committed after the stall ends, so this instruction is taken to be the first entry in the ROB (that is, the entry corresponding to the instruction about to be committed). Therefore, in the second way, the instruction type information of the first instruction stored in the first entry of the ROB can be obtained.
Third way
In the third way, the processor can determine the instruction type of the first instruction according to the first entry in the following three manners:
Manner 3a
When the processor determines that the front-end indication field is the second value, the type indication field is the first value, and the misprediction sequence number stored in the sequence number indication field is the same as the misprediction sequence number stored in a third register, the processor obtains the instruction type information of the first instruction stored in the third register. The third register stores the third entry most recently deleted from the reorder buffer (ROB), and the third entry contains the instruction type information of the first instruction.
Manner 3b
When the processor determines that the front-end indication field is the second value, the type indication field is the first value, and the misprediction sequence number stored in the sequence number indication field is greater than the misprediction sequence number stored in the third register, the processor obtains the instruction type information of the first instruction stored in the first entry of the ROB.
Manner 3c
When the processor determines that the front-end indication field is the second value, the type indication field is the first value, and the misprediction sequence number stored in the sequence number indication field is less than the misprediction sequence number stored in the third register, the processor obtains the instruction type information of the first instruction stored in the first entry of the ROB.
In the third way, the front-end indication field in the first entry is the second value and the type indication field is the first value, so it can be determined that the instruction causing the stall is the last instruction committed before the stall occurred. In theory, the instruction recorded in the third register at this point (that is, the last instruction committed before the stall occurred) is the instruction that caused the processor stall. In practice, however, because the processor executes out of order, the first entry may be updated out of order: for example, the instruction that caused the stall may not yet have reached the commit stage, or the performance event recorded in the first entry may have been overwritten by, or may overlap with, other performance events (in which case the performance event recorded in the first entry can be considered not to have caused the stall). To identify the "stall culprit" correctly in these complicated situations, the third way compares the misprediction sequence number stored in the sequence number indication field with the misprediction sequence number stored in the third register, and handles each comparison result accordingly to determine the instruction type of the instruction that caused the processor stall.
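The three ways of reading back the instruction type can be summarized as a small decision function. As before, this is a sketch under assumptions: the first and second values are encoded as 1 and 0, instruction types are represented by integer codes, and `FirstEntry`, `RobEntry`, and `resolve_instr_type` are hypothetical names. `rob_head` stands for the first entry of the ROB and `last_retired` for the content of the third register.

```python
from typing import NamedTuple

FIRST_VALUE = 1   # assumed encoding of the "first value"
SECOND_VALUE = 0  # assumed encoding of the "second value"


class FirstEntry(NamedTuple):
    front_end: int
    type_flag: int
    seq_field: int


class RobEntry(NamedTuple):
    seq: int          # misprediction sequence number
    instr_type: int   # instruction type code (hypothetical encoding)


def resolve_instr_type(entry, rob_head, last_retired):
    """Return the instruction type blamed for the stall."""
    if entry.front_end == FIRST_VALUE:
        # First way: the sequence number field holds the type information itself.
        return entry.seq_field
    if entry.type_flag == SECOND_VALUE:
        # Second way: read the type from the first entry of the ROB.
        return rob_head.instr_type
    if entry.seq_field == last_retired.seq:
        # Manner 3a: sequence numbers match, so use the third register.
        return last_retired.instr_type
    # Manners 3b and 3c: sequence numbers differ, so fall back to the ROB head.
    return rob_head.instr_type
```

Manners 3b and 3c lead to the same action in the text, which is why the sketch handles "greater than" and "less than" in a single branch.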
In addition, in the embodiments of the present application, after writing the accumulated number of stall cycles into the second entry of the second register, the processor may also write all entries stored in the second register to a memory connected to the processor in any of the following cases: the second register overflows; the second register triggers an interrupt; or the processor's performance monitoring period ends. Once all entries stored in the second register have been written to the memory, the second register can be cleared. The processor's performance can then be evaluated from the data in the memory, for example by analyzing the percentage of stalls caused by each type of instruction, which types of instruction tend to cause stalls, and the processor's stall cycles as a percentage of the program's total execution cycles.
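The flush to memory and the subsequent analysis might look as follows in software. The sketch assumes that the flush adds each register entry to a running total in memory (the text only says the entries are written to memory and the register can then be cleared); the function names and the dictionary representation are hypothetical.

```python
def flush_to_memory(second_register, memory):
    """Add every entry of the second register to memory, then clear the register."""
    for instr_type, cycles in second_register.items():
        memory[instr_type] = memory.get(instr_type, 0) + cycles
        second_register[instr_type] = 0


def stall_share_by_type(memory):
    """Percentage of all stall cycles attributed to each instruction type."""
    total = sum(memory.values())
    if total == 0:
        return {}
    return {t: 100.0 * c / total for t, c in memory.items()}


def stall_share_of_program(memory, total_program_cycles):
    """Stall cycles as a percentage of the program's total execution cycles."""
    return 100.0 * sum(memory.values()) / total_program_cycles
```

For instance, if memory ends up holding 30 stall cycles for loads and 10 for branches in a 200-cycle program, loads account for three quarters of the stall cycles and stalls for one fifth of the run.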
In a second aspect, an embodiment of the present application further provides a device for monitoring processor performance. The device includes a processor, a first register, a counter, and a second register.
Specifically, the processor is configured to: update a first entry in the first register when a performance event occurs and, when a stall occurs, start the counter to count a first number of clock cycles for which the stall lasts, where the first entry indicates an index path to the instruction type of a first instruction that caused the performance event; stop updating the first entry after the stall ends, and determine the type of the first instruction according to the first entry; and add the first number of clock cycles to the accumulated number of stall cycles corresponding to the instruction type of the first instruction, and write the accumulated number of stall cycles into a second entry in the second register. The second register has multiple entries that correspond respectively to multiple instruction types and store the accumulated stall cycles of stalls caused by instructions of each instruction type.
The first register, the counter, and the second register may be integrated into the processor or provided separately. When they are integrated into the processor, the processor performance monitoring device can itself be regarded as a processor.
Specifically, the first entry may include a front-end indication field, a type indication field, and a sequence number indication field. The front-end indication field indicates whether the stall occurs at the front end, the type indication field indicates whether the stall occurs before the first instruction is committed, and the sequence number indication field indicates the misprediction sequence number of the first instruction.
In the device, the processor can update the first entry in several ways; four are listed below.
First way
When updating the first entry in the first register, the processor is specifically configured to: when a first performance event occurs at the front end of the processor pipeline and the processor's reorder buffer (ROB) is empty, set the front-end indication field to a first value, set the type indication field to a second value, and store the instruction type information of the first instruction in the sequence number indication field.
Second way
When updating the first entry in the first register, the processor is specifically configured to: when a second performance event does not occur at the front end of the processor pipeline and is not a misprediction event, set both the front-end indication field and the type indication field to the second value.
Third way
When updating the first entry in the first register, the processor is specifically configured to: when a third performance event is a misprediction event, set the front-end indication field to the second value, set the type indication field to the first value, and store the misprediction sequence number of the third performance event in the sequence number indication field.
Fourth way
When updating the first entry in the first register, the processor is specifically configured to: when a fourth performance event is a misprediction event, set the front-end indication field to the second value, set the type indication field to the first value, and store the misprediction sequence number of the fourth performance event in the sequence number indication field; when a fifth performance event is a misprediction event, compare the misprediction sequence number of the fifth performance event with that of the fourth performance event; and store the smaller of the two misprediction sequence numbers in the sequence number indication field.
In addition, in the embodiments of the present application, the processor may determine the instruction type of the first instruction according to the first entry in the following three ways.
First way
When determining the instruction type of the first instruction according to the first entry, the processor is specifically configured to: when it determines that the front-end indication field is the first value, obtain the instruction type information of the first instruction stored in the sequence number indication field.
Second way
When determining the instruction type of the first instruction according to the first entry, the processor is specifically configured to: when it determines that both the front-end indication field and the type indication field are the second value, obtain the instruction type information of the first instruction stored in the first entry of the reorder buffer (ROB).
Third way
In the third way, the device further includes a third register, which stores the third entry most recently deleted from the reorder buffer (ROB); the third entry contains the instruction type information of the first instruction. When determining the instruction type of the first instruction according to the first entry, the processor may use the following three manners:
Manner 3a
When the processor determines that the front-end indication field is the second value, the type indication field is the first value, and the misprediction sequence number stored in the sequence number indication field is the same as the misprediction sequence number stored in the third register, the processor obtains the instruction type information of the first instruction stored in the third register.
Manner 3b
When the processor determines that the front-end indication field is the second value, the type indication field is the first value, and the misprediction sequence number stored in the sequence number indication field is greater than the misprediction sequence number stored in the third register, the processor obtains the instruction type information of the first instruction stored in the first entry of the ROB.
Manner 3c
When the processor determines that the front-end indication field is the second value, the type indication field is the first value, and the misprediction sequence number stored in the sequence number indication field is less than the misprediction sequence number stored in the third register, the processor obtains the instruction type information of the first instruction stored in the first entry of the ROB.
In addition, in the embodiments of the present application, the processor is further configured to: after writing the accumulated number of stall cycles into the second entry of the second register, write all entries stored in the second register to a memory connected to the processor in any of the following cases: the second register overflows; the second register triggers an interrupt; or the processor's performance monitoring period ends.
In a third aspect, an embodiment of the present application further provides a device for monitoring processor performance. The device includes an update module, a start module, a stop module, and a read module.
The update module is configured to update a first entry in a first register when a performance event occurs, where the first entry indicates an index path to the instruction type of a first instruction that caused the performance event.
The start module is configured to start a counter when a stall occurs, to count a first number of clock cycles for which the stall lasts.
The stop module is configured to stop updating the first entry after the stall ends.
The read module is configured to determine the instruction type of the first instruction according to the first entry.
The update module is further configured to add the first number of clock cycles to the accumulated number of stall cycles corresponding to the instruction type of the first instruction, and to write the accumulated number of stall cycles into a second entry in a second register. The second register has multiple entries that correspond respectively to multiple instruction types and store the accumulated stall cycles of stalls caused by instructions of each instruction type.
In addition, the processor performance monitoring device of the third aspect can also implement the other possible implementations of the processor performance monitoring method of the first aspect. For details, refer to the description of the method of the first aspect; they are not repeated here.
In a fourth aspect, an embodiment of the present application further provides a computer-readable storage medium that stores a program for performing the functions of the first aspect or of any design of the first aspect. When executed by a processor, the program implements the method of the first aspect or of any design of the first aspect.
In a fifth aspect, an embodiment of the present application provides a computer program product containing program code. When the program code runs on a computer, it causes the computer to perform the method of the first aspect or of any design of the first aspect.
In addition, for the technical effects of any possible design of the second to fifth aspects, refer to the technical effects of the corresponding designs of the first aspect; they are not repeated here.
BRIEF DESCRIPTION OF THE DRAWINGS
FIG. 1 is a schematic diagram of an instruction pipeline of a processor according to an embodiment of this application;
FIG. 2 is a schematic diagram of processor stalls and the corresponding "stall culprits" according to an embodiment of this application;
FIG. 3 is a schematic flowchart of a method for monitoring processor performance according to an embodiment of this application;
FIG. 4 is a schematic flowchart of another method for monitoring processor performance according to an embodiment of this application;
FIG. 5 is a schematic structural diagram of an apparatus for monitoring processor performance according to an embodiment of this application;
FIG. 6 is a schematic structural diagram of another apparatus for monitoring processor performance according to an embodiment of this application.
DETAILED DESCRIPTION
As described in the background, using a superscalar out-of-order processor can reduce the average execution time of a program and improve processing efficiency. However, a superscalar out-of-order processor can issue multiple instructions to the execute stage in one clock cycle and need not execute instructions in program order, so analyzing its performance becomes complicated.
For example, when the pipeline of a superscalar out-of-order processor stalls, multiple performance events caused by different instructions may occur during the stall, and it is hard for the processor to determine which performance event caused the pipeline stall. As another example, because such performance events overlap, it is hard to attribute a processor stall to a particular instruction, and therefore hard to evaluate the performance overhead caused by the instructions that the processor executes.
To evaluate the performance of superscalar out-of-order processors, the industry has proposed a variety of performance models.
For example, Top-Down is a performance model built on the utilization of pipeline dispatch slots. During program execution, each instruction micro-op (also called a microinstruction or micro operation, uop for short) uses exactly one dispatch slot. The model monitors whether each dispatch slot stalls and tracks what happens to the micro-op in each dispatch slot after execution (for example, whether it is committed or squashed), divides all dispatch slots into four categories, and analyzes each category in detail based on the performance monitoring unit (PMU).
As another example, Statistical Profiling is a method for analyzing program execution performance. Out of every N micro-ops (uops), the method randomly selects one micro-op and records all performance events that occur between its dispatch and its completion in the instruction pipeline, together with the number of clock cycles each performance event lasts. The samples are then analyzed offline to infer the program's performance bottlenecks.
Each of these prior-art models has its own focus. The Top-Down model focuses on analyzing the overall performance of a program; when the processor stalls, it has difficulty pinpointing the design point (for example, the instruction or operation that caused the stall), and many important sub-items in the Top-Down model are obtained by estimation, so their accuracy is hard to guarantee. The Statistical Profiling model focuses on determining which instruction caused each performance event, but because it randomly samples the execution of individual instructions, it is prone to insufficient coverage.
Embodiments of this application provide a method and an apparatus for monitoring processor performance, so that a superscalar out-of-order processor can determine the instruction that caused a stall after the stall occurs, and can evaluate the performance overhead caused by that instruction after program execution ends.
The basic concepts involved in the embodiments of this application are explained below. Note that these explanations are intended to make this application easier to understand and should not be regarded as limiting the scope of protection claimed by this application.
1. Instruction pipeline
A superscalar out-of-order processor typically uses an eight-stage instruction pipeline, as shown in FIG. 1. The eight pipeline stages are fetch, decode, register rename, dispatch, issue, execute, writeback, and commit. An instruction must pass through all eight stages to complete.
In a superscalar out-of-order processor, the issue stage can issue multiple instructions to the execute stage in one clock cycle, so that multiple instructions execute concurrently inside the processor. In the execute stage, instructions need not execute in program order; they can execute out of order, so that while some instructions wait for their source operands, other instructions that do not depend on those operands can execute first, which improves processor throughput.
In particular, superscalar out-of-order processors typically support out-of-order issue, out-of-order execution, and in-order commit. Out-of-order issue and out-of-order execution were introduced in the previous paragraph. In-order commit means that although instructions may execute out of order, they must commit in program order. For example, suppose the program order of three instructions is instruction A → instruction B → instruction C, but instruction B has to wait for a source operand, so the actual execution order is instruction A → instruction C → instruction B. Although the three instructions execute out of order, they must still commit in the program order instruction A → instruction B → instruction C.
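The in-order commit rule can be illustrated with a minimal sketch. It assumes an unlimited commit width and that each instruction's completion cycle is already known; the function name `commit_cycles` and the cycle numbers are hypothetical and are not taken from this application.

```python
def commit_cycles(program_order, done_cycle):
    """Return the earliest cycle at which each instruction can commit.

    An instruction commits only after it has finished executing AND every
    earlier instruction in program order has committed (in-order commit).
    Assumption: any number of instructions may commit in the same cycle.
    """
    commit = {}
    latest = 0
    for instr in program_order:
        latest = max(latest, done_cycle[instr])
        commit[instr] = latest
    return commit


# Instruction B waits for a source operand, so C finishes executing before B.
done = {"A": 1, "B": 5, "C": 2}
print(commit_cycles(["A", "B", "C"], done))  # {'A': 1, 'B': 5, 'C': 5}
```

Instruction C finishes in cycle 2 but cannot commit before cycle 5, because instruction B precedes it in program order.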
Note also that the eight-stage instruction pipeline described above is only an example. The embodiments of this application apply to superscalar out-of-order processors but do not restrict the instruction pipeline model that the processor uses, provided that the pipeline allows superscalar out-of-order execution of instructions.
2. Performance events
A performance event is an event that causes processor performance to fall below the design peak. Performance events can be triggered by instructions or operations that the processor executes; for example, executing a long-latency division instruction, or a cache miss during a cache access, may trigger a performance event.
The performance event penalty (also called the performance event overhead) is the number of extra clock cycles that the performance event adds to completing an instruction or operation, compared with completing the same instruction or operation when no performance event is triggered.
3. Stall
When the instruction pipeline is working normally, multiple instructions can execute out of order and commit in order. When the commit stage is committing instructions, Commit=1 can be defined (at least one instruction commits); when the commit stage has no instruction to commit, Commit=0 can be defined (no instruction commits). When the parameter Commit is 0 for several consecutive clock cycles, the processor pipeline can be considered stalled, which can also be described as a processor stall. The number of clock cycles in this run may differ between processors; for example, it may be one. The embodiments of this application do not restrict this number.
Note that in the above example Commit takes the value 0 or 1: Commit=0 means no instruction commits, Commit=1 means at least one instruction commits, and a processor stall is defined as Commit remaining 0 for several clock cycles. In an actual implementation, Commit may instead take multiple values, such as 0, 1, 2, or 3, where Commit=A means that A instructions commit. A processor stall can then be defined as Commit remaining less than or equal to X for several clock cycles, or as Commit remaining less than Y for several clock cycles. The embodiments of this application do not restrict this.
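The stall definition above can be sketched as a scan over a per-cycle trace of Commit values. This is only an illustration: the function `find_stalls`, its parameters `max_commit` (the threshold X) and `min_cycles` (the "several clock cycles"), and the sample trace are assumptions introduced here.

```python
def find_stalls(commit_counts, max_commit=0, min_cycles=1):
    """Return (start_cycle, length) for every stall in a commit trace.

    A stall is a run of at least `min_cycles` consecutive cycles in which
    the number of committed instructions is <= `max_commit`.
    """
    stalls = []
    run_start = None
    for cycle, committed in enumerate(commit_counts):
        if committed <= max_commit:
            if run_start is None:
                run_start = cycle          # a candidate stall begins
        elif run_start is not None:
            if cycle - run_start >= min_cycles:
                stalls.append((run_start, cycle - run_start))
            run_start = None               # the candidate stall ends
    # Handle a run that is still open at the end of the trace.
    if run_start is not None and len(commit_counts) - run_start >= min_cycles:
        stalls.append((run_start, len(commit_counts) - run_start))
    return stalls


trace = [2, 0, 0, 0, 1, 0, 3]              # instructions committed per cycle
print(find_stalls(trace))                  # [(1, 3), (5, 1)]
print(find_stalls(trace, min_cycles=2))    # [(1, 3)]
print(find_stalls(trace, max_commit=1))    # [(1, 5)]
```

With the defaults, the sketch matches the simplest definition in the text (Commit equal to 0 for one cycle); raising `max_commit` or `min_cycles` gives the alternative definitions.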
Generally, a processor stall is caused by an instruction or operation that the processor executes. An instruction is a command that specifies what operation the processor performs and where the operands are located, for example a command that adds two operands or a command that accesses the cache. An operation is the specific action performed when an instruction executes, for example an addition or a cache access.
Several examples of instructions or operations causing a processor stall follow. A memory access instruction load causes a processor stall when it misses in the last-level cache. A cache access operation can cause a processor stall. A branch instruction (branch prediction predicts which way a program branch will go and prefetches and decodes the instructions of one branch in advance, reducing the time spent waiting for the decoder) causes a large number of instructions to be flushed when the branch is mispredicted, which stalls the processor. In the embodiments of this application, an instruction or operation that causes a processor stall is called the "stall culprit" (culprit).
Note that in the embodiments of this application a processor stall is caused by an instruction or operation that the processor executes. For example, when a load instruction misses in the last-level cache, the stall can be attributed either to the load instruction or to the cache access operation performed while executing the load instruction. The embodiments of this application adopt the first view, that is, processor stalls are all considered to be caused by instructions.
For example, FIG. 2 shows processor stalls that occur while a processor executes several programs, together with the corresponding culprits. A vertical coordinate of 0 represents Commit=0, that is, no instruction commits in the commit stage; a nonzero vertical coordinate represents Commit=1, that is, at least one instruction commits in the commit stage. As FIG. 2 shows, while executing program A, the load 1 instruction has a long latency because of a cache miss, which stalls the processor. Although the load 2 and divide 3 instructions executed after load 1 also trigger performance events, those events are masked by the stall caused by load 1, so the culprit of the first stall (Stall 1) in FIG. 2 is the load 1 instruction. While executing program B, a misprediction on the branch 4 instruction causes a large number of instructions to be flushed, which stalls the processor. Although the load 5 instruction executed after branch 4 also triggers a performance event, that event is masked by the stall caused by branch 4, so the culprit of the second stall (Stall 2) is the branch 4 instruction. While executing program C, an I-cache miss during instruction fetch stalls the processor, so the culprit of the third stall (Stall 3) is the instruction fetch operation. While executing program D, the isb 6 instruction flushes the instructions in the pipeline, which stalls the processor, so the culprit of the fourth stall (Stall 4) is the isb 6 instruction.
Note that in the examples of FIG. 2, the relationship between the time the culprit commits and the time the stall occurs differs from case to case.
In the program A example, the stall caused by the load 1 instruction (Stall 1) occurs before load 1 commits; that is, load 1 is the first instruction committed after Stall 1 ends. This is because Stall 1 results from load 1 taking too long to execute: until load 1 finishes executing, no other instruction can commit, so the processor stalls. Once load 1 finishes executing, Stall 1 ends and load 1 commits.
In the program B example, a misprediction on the branch 4 instruction causes the pipeline to be flushed; branch 4 is the last instruction committed before Stall 2 begins. This is because the stall caused by the misprediction only becomes visible after branch 4 commits, so Stall 2 occurs after branch 4 commits.
As these examples show, the instruction that causes a processor stall may be the first instruction committed after the stall ends, or the last instruction committed before the stall begins.
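The two possible culprit positions can be made concrete with a small sketch that, given a commit log and the bounds of a stall, returns both candidates. Deciding which candidate is the actual culprit is outside this sketch. The function `culprit_candidates`, the log format, and the cycle numbers are hypothetical.

```python
def culprit_candidates(commits, stall_start, stall_end):
    """Return (last committed before the stall, first committed after it).

    commits     : list of (cycle, instruction) tuples in commit order
    stall_start : first cycle of the stall
    stall_end   : first cycle after the stall (exclusive bound)
    """
    before = [instr for cycle, instr in commits if cycle < stall_start]
    after = [instr for cycle, instr in commits if cycle >= stall_end]
    return (before[-1] if before else None, after[0] if after else None)


# Loosely modelled on program B: nothing commits in cycles 2 to 5.
log = [(0, "add"), (1, "branch 4"), (6, "load 5")]
print(culprit_candidates(log, stall_start=2, stall_end=6))
# ('branch 4', 'load 5')
```

In the program B example the culprit is the first element of the pair (branch 4); in the program A example it would be the second element.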
Note also that in the examples of FIG. 2, Stall 3 occurs in the front end of the processor pipeline, whereas Stall 1, Stall 2, and Stall 4 occur in the back end. The front end consists of the pipeline stages that fetch, decode, and dispatch instructions; the back end consists of the pipeline stages that issue, execute, and commit instructions.
In addition, in practice, when a stall begins, the commit width of the commit stage changes from nonzero to 0; when the stall ends, the commit width changes from 0 to nonzero. The processor can therefore determine the start and end of a stall from the value of the commit width.
Note that in the embodiments of this application a performance event causes a processor stall in some cases and not in others. A performance event is considered to cause a processor stall only if it leaves the commit stage of the instruction pipeline with no instruction committing for several clock cycles. When a performance event causes a processor stall, the instruction that triggered the performance event can be regarded as the culprit of the stall. The performance overhead of the culprit can be defined as the number of clock cycles the stall lasts, which can also be understood as the performance overhead of the stall.
4. Reorder buffer
As mentioned above, in a superscalar out-of-order processor, instructions must commit in program order regardless of the order in which they execute. For an instruction that has finished executing in the execute stage but cannot yet commit in program order through the commit stage, the execution result can be stored in the reorder buffer (ROB). One entry in the ROB corresponds to one instruction micro-op. Each ROB entry contains at least two fields: the instruction type and the execution result.
In practice, the ROB can be regarded as a circular queue with a head pointer and a tail pointer. All instructions that enter the instruction pipeline are stored in the ROB in program order, and one ROB entry corresponds to one instruction micro-op. The head pointer points to the information (instruction type, execution result, and so on) of the next instruction to commit, and the tail pointer points to the information (instruction type, execution result, and so on) of the micro-op most recently stored in the ROB.
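The circular-queue view of the ROB can be sketched as follows. This is a simplified software model, not the hardware structure of this application: the class and method names are hypothetical, the `done` flag is an added assumption, and, as a modelling choice, `tail` here marks the next free slot (one past the newest entry) rather than the newest entry itself.

```python
class ReorderBuffer:
    """Toy ROB: a fixed-size circular queue of micro-op entries."""

    def __init__(self, size):
        self.entries = [None] * size
        self.head = 0    # entry of the next instruction to commit
        self.tail = 0    # next free slot (one past the newest entry)
        self.count = 0   # number of occupied entries

    def allocate(self, instr_type):
        """Store a new micro-op in program order; return its ROB index."""
        if self.count == len(self.entries):
            raise RuntimeError("ROB full")
        idx = self.tail
        self.entries[idx] = {"type": instr_type, "result": None, "done": False}
        self.tail = (self.tail + 1) % len(self.entries)
        self.count += 1
        return idx

    def complete(self, idx, result):
        """Record the execution result; may be called out of order."""
        self.entries[idx]["result"] = result
        self.entries[idx]["done"] = True

    def commit(self):
        """Commit the head entry if it has finished; otherwise return None."""
        if self.count == 0 or not self.entries[self.head]["done"]:
            return None
        entry = self.entries[self.head]
        self.entries[self.head] = None
        self.head = (self.head + 1) % len(self.entries)
        self.count -= 1
        return entry


rob = ReorderBuffer(4)
ld = rob.allocate("load")
ad = rob.allocate("add")
rob.complete(ad, 3)            # the add finishes first (out of order)
print(rob.commit())            # None: the load at the head is not done yet
rob.complete(ld, 7)
print(rob.commit()["type"])    # load
print(rob.commit()["type"])    # add
```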
5. First register, second register, and third register
In the embodiments of this application, the first register stores an index path that indicates the instruction that caused a performance event, that is, the index path of the culprit. The processor updates an entry in the first register when a performance event occurs. A performance event that does not cause a processor stall is overwritten by a later update to the entry; only an entry for a performance event that caused a processor stall is ultimately retained in the first register. In other words, one entry in the first register corresponds to one processor stall.
Specifically, the first register maintains a read pointer (reading) and a write pointer (writing). The read pointer (reading) indicates the entry to read when a processor stall ends; by reading that entry the processor can determine the index path of the culprit. The write pointer (writing) is used to write and update entries in the first register.
For example, a performance event that does not cause a processor stall can be overwritten by updating the entry that the write pointer (writing) indicates. Likewise, after a stall ends, reading the entry that the read pointer (reading) points to yields the index path of the culprit, from which the instruction or operation that caused the processor stall is determined.
Specifically, in the initialization phase, the read pointer (reading) and the write pointer (writing) may both point to the first entry in the first register, whose content is then empty or a default value. Whenever a performance event occurs, the entry that the write pointer (writing) points to (the first entry) is updated. After a processor stall ends, the write pointer (writing) is moved to the next entry, which stops updates to the first entry. The entry that the read pointer (reading) points to is then read to determine the instruction that caused the processor stall, and after the read completes the read pointer (reading) is moved to the next entry. Continuing in this way, the write pointer (writing) is used to write and update the entries of the first register (that is, to record the index path of the instruction that caused the stall), and the read pointer (reading) is used to determine the culprit.
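The basic pointer sequence described above can be sketched in software. The sketch follows only the simple case in this paragraph (every event overwrites the current entry; the entry is read and both pointers advance when a stall ends), so the two pointers move in lockstep here. The class name, the wrap-around behaviour, the clearing of the next entry, and the string form of the index path are all assumptions.

```python
class CulpritTrackingSet:
    """Toy model of the first register with a read and a write pointer."""

    def __init__(self, num_entries):
        self.entries = [None] * num_entries   # None stands for empty/default
        self.reading = 0
        self.writing = 0

    def on_performance_event(self, index_path):
        # Each new event overwrites the entry under the write pointer,
        # so an earlier event recorded in that entry is discarded.
        self.entries[self.writing] = index_path

    def on_stall_end(self):
        # Read the entry under the read pointer to identify the culprit.
        culprit = self.entries[self.reading]
        self.reading = (self.reading + 1) % len(self.entries)
        # Move the write pointer on so the entry just read is no longer
        # updated, and reset the entry that will be written next.
        self.writing = (self.writing + 1) % len(self.entries)
        self.entries[self.writing] = None
        return culprit


cts = CulpritTrackingSet(3)
cts.on_performance_event("ROB[2]")   # event that does not lead to a stall
cts.on_performance_event("ROB[5]")   # later event overwrites the entry
print(cts.on_stall_end())            # ROB[5]
cts.on_performance_event("ROB[7]")
print(cts.on_stall_end())            # ROB[7]
```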
In a specific implementation, the read pointer (reading) and the write pointer (writing) of the first register may be moved differently in different scenarios. This is described in detail in later embodiments and is not repeated here.
In an actual implementation, the first register may have a different name. For example, in the embodiments of this application the first register may be called the culprit tracking set (CTS). The embodiments of this application do not restrict the specific name of the first register.
In a specific implementation, when two consecutive instructions both cause processor stalls and the first register has only one entry, the processor may overwrite that entry according to the second instruction before the first instruction has committed. Because the entry for the first instruction has been overwritten, it is then hard to determine the culprit of the first stall. To avoid this problem, the first register in the embodiments of this application may contain multiple entries.
For example, the number of entries in the first register may be the number of pipeline stages in the writeback stage plus 1. The writeback stage controls the processor's writing of instruction execution results back to memory or registers. When the number of entries in the first register equals the number of pipeline stages in the writeback stage plus 1, the first register has enough entries to record the information of multiple instructions even if multiple instructions executing in parallel all cause stalls.
In addition, when the first register contains multiple entries, it can also be regarded as a register set in which each register holds one entry.
In the embodiments of this application, the second register stores the cumulative stall cycle count corresponding to the instruction type of each instruction that the processor executes. The instruction type may be an arithmetic instruction, such as the fadd or divide instruction; a memory access instruction, such as the load or store instruction; or a branch instruction (branch).
Specifically, the cumulative stall cycle count for each instruction type in the second register is updated by accumulation: when the processor stalls, the counter starts counting the number of clock cycles the stall lasts; when the stall ends, the counter stops counting and the count is added to an entry in the second register. For example, suppose entry a of the second register records that the cumulative number of stall cycles caused by load instructions is M. When the processor stalls, the counter starts counting; when the stall ends, the counter stops with a count of N. If the solution provided in the embodiments of this application then determines that this stall was caused by a load instruction, N is added to entry a, which then records a cumulative stall cycle count of M+N for load instructions.
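The M+N accumulation can be sketched with a counter object and a table keyed by instruction type. The names `StallCycleCounter` and `accumulate`, the dictionary representation of the second register, and the values M = 40 and N = 12 are illustrative assumptions.

```python
class StallCycleCounter:
    """Counts the clock cycles of one stall (toy model of the counter)."""

    def __init__(self):
        self.value = 0
        self.running = False

    def start(self):
        self.value = 0
        self.running = True

    def tick(self):
        # Called once per clock cycle; only counts while a stall is active.
        if self.running:
            self.value += 1

    def stop(self):
        self.running = False
        return self.value


def accumulate(table, instr_type, stall_cycles):
    """Add one stall's cycle count to the entry for `instr_type`."""
    table[instr_type] = table.get(instr_type, 0) + stall_cycles
    return table[instr_type]


table = {"load": 40}          # entry a: M = 40 cycles attributed to load
counter = StallCycleCounter()
counter.start()               # a stall begins
for _ in range(12):           # the stall lasts N = 12 cycles
    counter.tick()
n = counter.stop()            # the stall ends
accumulate(table, "load", n)  # the culprit was determined to be a load
print(table["load"])          # 52, that is, M + N
```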
In an actual implementation, the second register may be implemented in static random-access memory (SRAM) hardware. The second register and the counter may also have different names. For example, in the embodiments of this application the second register may be called the stall penalty counting table (SPCT) and the counter may be called the stall cycle counter (SCC). The embodiments of this application do not restrict the specific names of the second register and the counter.
In addition, in the embodiments of this application the entries in the second register may be written out to memory and the second register then cleared, so that processor performance can be evaluated from the data in memory after program execution ends. Specifically, the processor may write the entries stored in the second register out to the memory connected to the processor in any of the following cases: an entry in the second register overflows during accumulation; the second register triggers an interrupt; or the processor's performance monitoring period ends.
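The overflow case can be sketched as follows. The sketch assumes a fixed entry width (`width_bits`, 16 by default), assumes that memory keeps running totals to which the table values are added, and assumes that a single stall never exceeds the entry width; none of these details are specified in this application, and the function name `add_with_spill` is hypothetical.

```python
def add_with_spill(table, memory, instr_type, cycles, width_bits=16):
    """Accumulate `cycles`; write the table to memory first if it would overflow.

    table  : dict modelling the second register (instruction type -> cycles)
    memory : dict modelling the totals kept in memory
    """
    limit = (1 << width_bits) - 1
    if table.get(instr_type, 0) + cycles > limit:
        # Write every entry out to memory, then clear the table.
        for key in list(table):
            memory[key] = memory.get(key, 0) + table[key]
            table[key] = 0
    table[instr_type] = table.get(instr_type, 0) + cycles


table = {"load": 65530, "branch": 100}
memory = {}
add_with_spill(table, memory, "load", 10)   # 65540 would not fit in 16 bits
print(memory)   # {'load': 65530, 'branch': 100}
print(table)    # {'load': 10, 'branch': 0}
```

The same write-out step could be invoked on an interrupt or at the end of the monitoring period; only the trigger differs.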
In the embodiments of this application, the third register holds the entry most recently deleted from the ROB. As described above, when an instruction has finished executing in the execute stage but cannot yet be committed in the commit stage, its information (instruction type, execution result, and so on) may be kept in the ROB; once the instruction commits, its entry is deleted from the ROB. In the embodiments of this application, when an instruction causes a processor stall after it has committed (for example, a committed branch instruction causes the instructions in the pipeline to be flushed, so the stall begins only after the branch commits), the instruction's information has already been deleted from the ROB by the time the "stall culprit" is analyzed after the stall ends. A register therefore needs to be maintained to hold the entry most recently deleted from the ROB so that the "stall culprit" can be determined in this case; this register is the third register.
In an actual implementation, the third register may go by different names; for example, in the embodiments of this application it may be called the last committed instruction (LCI) register. The embodiments of this application do not limit the specific name of the third register.
In addition, it should be noted that in the embodiments of this application the third register holds only one entry. This entry may be obtained by copying an ROB entry, and its contents are the same as those of an ROB entry, so they are not described again here.
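The role of the third register can be sketched as follows: each time the ROB head commits, its entry is copied into a single-entry register before it is dropped. The field names `seq` and `instr_type` and the class names are illustrative assumptions; a real ROB entry carries more information.

```python
from collections import deque
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class RobEntry:
    seq: int          # hypothetical sequence number of the instruction
    instr_type: str   # hypothetical instruction-type tag


class ReorderBuffer:
    def __init__(self):
        self.entries = deque()               # in-flight instructions, oldest first
        self.lci: Optional[RobEntry] = None  # single-entry "last committed instruction" register

    def allocate(self, entry):
        self.entries.append(entry)

    def commit_head(self):
        # Remove the oldest entry from the ROB and keep a copy in the LCI register.
        entry = self.entries.popleft()
        self.lci = entry
        return entry


rob = ReorderBuffer()
rob.allocate(RobEntry(seq=1, instr_type="add"))
rob.allocate(RobEntry(seq=2, instr_type="branch"))
rob.commit_head()
rob.commit_head()   # the branch commits; a flush-induced stall would start after this
```

Even though the ROB is now empty, `rob.lci` still identifies the branch, which is what a post-stall analysis needs.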
6. Front-end indication field, type indication field, and sequence number indication field
In the embodiments of this application, the first entry is updated when a performance event occurs. Specifically, each entry in the first register contains at least three fields: a front-end indication field, a type indication field, and a sequence number indication field. Updating the first entry means updating these three fields. The front-end indication field indicates whether the stall occurred at the front end of the processor pipeline, the type indication field indicates whether the stall occurred before the "stall culprit" committed, and the sequence number indication field indicates the misprediction sequence number of the "stall culprit".
As noted in the earlier description of processor stalls, in the examples of FIG. 2 some stalls occur at the front end of the processor pipeline and others at the back end. In the embodiments of this application, the front-end indication field in the first register may indicate whether a stall occurred at the front end of the processor pipeline. For example, the front-end indication field is represented by 1 bit, denoted FE: FE=1 means the stall occurred at the front end of the processor pipeline; FE=0 means it occurred at the back end.
Also as noted above, the examples in FIG. 2 show that the instruction causing a processor stall may be either the first instruction committed after the stall ends or the last instruction committed before the stall begins. In the embodiments of this application, the type indication field in the first register may indicate which of the two it is. For example, the type indication field is represented by 1 bit, denoted Ctype: Ctype=1 means the instruction causing the stall is the last instruction committed before the stall began; Ctype=0 means it is the first instruction committed after the stall ended.
In addition, the sequence number indication field can be understood as follows: when an entry in the first register records information about a mispredicted branch instruction, its sequence number indication field records the sequence number of that branch instruction; when an entry records information about an instruction other than a branch instruction, its sequence number indication field is either meaningless (and may be left at its default) or is used to record other information (for example, information used to determine the front-end "stall culprit").
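One way to picture a first-register entry is as a small packed word holding the FE bit, the Ctype bit, and the sequence number. The layout below, including the 14-bit sequence number width and the bit order, is purely an assumption for illustration; the application does not specify field widths.

```python
SEQ_BITS = 14  # assumed width of the sequence number indication field


def pack_entry(fe, ctype, seq):
    """Pack FE (1 bit), Ctype (1 bit) and a sequence number into one integer."""
    if fe not in (0, 1) or ctype not in (0, 1):
        raise ValueError("FE and Ctype are single-bit fields")
    if not 0 <= seq < (1 << SEQ_BITS):
        raise ValueError("sequence number does not fit in the assumed field width")
    return (fe << (SEQ_BITS + 1)) | (ctype << SEQ_BITS) | seq


def unpack_entry(word):
    """Recover (FE, Ctype, sequence number) from a packed entry."""
    seq = word & ((1 << SEQ_BITS) - 1)
    ctype = (word >> SEQ_BITS) & 1
    fe = (word >> (SEQ_BITS + 1)) & 1
    return fe, ctype, seq


word = pack_entry(fe=0, ctype=1, seq=37)   # e.g. a mispredicted branch with sequence number 37
fields = unpack_entry(word)
```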
7. Types of processor stalls
Based on the foregoing description of processor stalls, the embodiments of this application divide processor stalls into four types:
Type 1:
In the embodiments of this application, the first type of stall may be called an instruction supply stall; it is a pipeline stall caused by the front end of the processor pipeline. It arises from a miss in a front-end cache (such as the I-Cache or I-TLB); the miss prevents the front end from supplying an instruction stream/micro-operation stream to the back end of the processor pipeline, so that Commit=0. The distinguishing feature of the first type of stall is that the ROB is also empty when the stall occurs.
For the first type of stall, the culprit can be regarded as an operation, such as an I-Cache access; alternatively, it can be regarded as the instruction whose fetch caused the I-Cache miss. The embodiments of this application adopt the second view, that is, the first type of stall is caused by an instruction.
It is worth noting that, because the first type of stall occurs at the front end of the processor pipeline while the ROB is empty, it never overlaps with stalls of other types. The design therefore dedicates 1 bit to indicating this type of stall, namely the front-end indication field in the first register. For example, FE=1 is sufficient to determine that the current stall is of the first type.
For the first type of stall, the front-end indication field is FE=1 and the type indication field is Ctype=0.
In addition, in some cases a cache miss does not necessarily cause a stall. For example, when an instruction fetch misses in the L1 I-Cache but hits in the L2 Cache, the front end cannot supply instructions to the back end for several clock cycles, yet the ROB is very likely not empty and instructions still commit in every clock cycle. The embodiments of this application do not consider a stall to have occurred in this case. A major goal of modern processor microarchitecture design is to hide the cost of performance events as far as possible, and this case arises precisely because the cost of the front-end performance event has been hidden.
Type 2:
In the embodiments of this application, the second type of stall may be called a misprediction stall. Stalls that result from instructions being flushed because of a misprediction fall into this type. Common mispredictions are branch mispredictions and load-store order mispredictions.
The distinguishing feature of the second type of stall is that the instruction causing the stall (the culprit) has already committed before the stall occurs and is the last instruction committed before the stall.
For the second type of stall, the front-end indication field is FE=0 and the type indication field is Ctype=1.
Likewise, if a misprediction does not cause a stall (that is, there is no run of consecutive clock cycles with Commit=0), the embodiments of this application consider its performance cost to have been well hidden by out-of-order execution. In that case the mispredicted instruction is not identified as a culprit.
Type 3:
In the embodiments of this application, the third type of stall may be called a system instruction stall. It is a stall caused by a specific instruction flushing the pipeline; a common example is the isb instruction. The third type of stall resembles the second: the culprit (such as an isb instruction) has already committed before the stall occurs and is the last instruction committed before the stall.
For the third type of stall, the front-end indication field is FE=0 and the type indication field is Ctype=1.
The third type of stall has two unique properties: 1) whenever the specific instruction occurs, it always causes a stall. For example, committing an isb instruction always flushes the entire instruction pipeline, and the resulting performance cost cannot be hidden; 2) the performance cost of each occurrence is almost constant. For example, the number of clock cycles for which a stall caused by an isb instruction lasts is almost always equal to the pipeline depth (the number of stages).
It should be noted that no misprediction is involved when a third-type stall occurs. However, because the front-end indication field and the type indication field are the same for the second and third types of stall, a misprediction sequence number may also be written into the sequence number indication field of the first register when a third-type stall occurs, so that the two types can be distinguished on the basis of the misprediction sequence number.
Type 4:
In the embodiments of this application, the fourth type of stall may be called a long latency stall. It can be regarded as a stall triggered by the long-latency execution of an instruction, that is, an instruction takes so long to execute that the processor stalls. Common culprits include a load instruction that misses in the last-level cache, a floating-point divide instruction, and a load instruction that accesses shared data in the cache. Such instructions often take tens to hundreds of clock cycles to execute and block the execution and commit of subsequent instructions, which stalls the processor.
The distinguishing feature of the fourth type of stall is that the stall occurs before the culprit commits, and the culprit is the first instruction committed after the stall ends.
For the fourth type of stall, the front-end indication field is FE=0 and the type indication field is Ctype=0.
As the above description shows, each of the four stall types has its own characteristics. In the embodiments of this application, when a stall occurs, its type can be determined from the fields in the first register, and from that it can be determined which instruction caused the stall (that is, the culprit).
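The field combinations listed for the four types can be summarised as a small decision function. This is a sketch under the encodings stated above (FE=1 for type 1; FE=0, Ctype=1 for types 2 and 3; FE=0, Ctype=0 for type 4); the function name and the returned strings are illustrative.

```python
def classify_stall(fe, ctype):
    """Map the FE and Ctype bits of a first-register entry to a stall class
    and to the position of the culprit relative to the stall."""
    if fe == 1:
        # Front-end stall with an empty ROB: the culprit is the instruction
        # whose fetch missed in the front-end cache.
        return ("type 1: instruction supply stall", "instruction whose fetch missed")
    if ctype == 1:
        # Types 2 and 3 share FE=0, Ctype=1; telling them apart needs the
        # sequence number indication field, which is not modelled here.
        return ("type 2 or 3: misprediction or system instruction stall",
                "last instruction committed before the stall")
    return ("type 4: long latency stall",
            "first instruction committed after the stall")


kind, culprit_position = classify_stall(fe=0, ctype=1)
```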
Each of the four stall types can be caused by more than one kind of instruction. Table 1 below is used as an example to describe each stall type and the instruction types that cause it. In the embodiments of this application, the instruction type may be represented by a Stall ID.
Table 1
Figure PCTCN2018107402-appb-000001
As described above, the second register holds the accumulated stall cycle counts for the different instruction types. The Stall ID in Table 1 can be used to indicate the instruction type. Specifically, the accumulated stall cycle count for each instruction type in the second register may be maintained by accumulation: when the processor stalls, the counter starts counting the clock cycles for which the stall lasts; when the stall ends, the counter stops counting and the count is added to the corresponding entry in the second register.
In a specific implementation, after the instruction that caused a stall is determined using the solution provided in the embodiments of this application, its Stall ID can be determined and the counter's count added to the corresponding entry in the second register. After the entries in the second register have been written back to memory, the performance cost caused by each instruction type during the performance monitoring period can be obtained by analyzing the data in memory, and the processor's performance can be analyzed accordingly.
For example, suppose the solution provided in the embodiments of this application determines that the instruction type (Stall ID) of the instruction that caused the stall is Oct 10 and the counter counted P clock cycles. P is then added to the count of the entry for Oct 10 in the second register. If that entry's count was Q before the stall, it is P+Q after the addition. When the monitoring period ends, the entries in the second register are written back to memory; if the count of the entry for Oct 10 in memory was X before the write-back, it is X+P+Q afterwards. The processor's performance can then be analyzed by taking the ratio of the Oct 10 entry's count to the total number of clock cycles in the monitoring period and by comparing that count with the counts of the other entries.
It is worth noting that the same instruction is assigned different instruction types if it causes different types of stall during program execution. For example, in Table 1, a load instruction that causes a second-type stall has instruction type (Stall ID) Oct 11, whereas a load instruction that causes a fourth-type stall has instruction type (Stall ID) Oct 31.
In addition, in the embodiments of this application, stalls of types other than the above four may be grouped together as a fifth type. Among the four types above, the first type is clearly characterized by the stall occurring at the front end of the processor pipeline with the ROB empty, and the second and third types by the instruction pipeline being flushed, whereas the fourth type is hard to identify from any directly observable feature. A stall can therefore be determined to be of the fourth type when it is not of the first, second, or third type. The fifth type is handled in much the same way as the fourth, including how the fields of the first-register entry are set, so its handling can follow that of the fourth type and is not described further in the embodiments of this application.
The embodiments of this application are described in further detail below with reference to the accompanying drawings.
It should be noted that in the embodiments of this application, "a plurality of" means two or more. In addition, terms such as "first" and "second" in the description of this application are used only to distinguish between the things described and are not to be understood as indicating or implying relative importance or order.
Referring to FIG. 3, a method for monitoring processor performance provided in an embodiment of this application includes the following steps.
S301: The processor updates the first entry in the first register when a performance event occurs and, when a stall occurs, starts a counter to count a first number of clock cycles for which the stall lasts.
As described above, the first register maintains a read pointer (reading) and a write pointer (writing). The read pointer indicates the entry to be read when a processor stall ends, from which the processor can determine the index path of the "stall culprit"; the write pointer is used to write and update entries in the first register. That is, in S301 the first entry updated by the processor may be the entry to which the write pointer points.
In the embodiments of this application, the first register may contain one or more entries. As described above, the processor updates an entry in the first register when a performance event occurs. A performance event that does not cause a processor stall is overwritten by a later update to the entry; only entries for performance events that cause a processor stall are ultimately retained in the first register. One entry in the first register corresponds to one processor stall; that is, each entry indicates the index path of the instruction type of the instruction that caused one stall.
That is, while the first entry is still being updated, it indicates the index path of the instruction type of the instruction that triggered a performance event; once the first entry has stopped being updated, the performance event it records has caused a processor stall, so the instruction that triggered the performance event is the instruction that caused the stall, and the first entry indicates the index path of the instruction type of that instruction.
Specifically, the first entry may include a front-end indication field, a type indication field, and a sequence number indication field. The front-end indication field indicates whether the stall occurred at the front end, the type indication field indicates whether the stall occurred before the first instruction committed, and the sequence number indication field indicates the misprediction sequence number of the first instruction. The meanings of the three fields are as described above and are not repeated here.
During instruction execution, a performance event can be deemed to have occurred if the processor's performance is found to be below its design peak. Accordingly, in S301 the processor's update of the first entry may take place in the execute stage of the instruction pipeline: when the execute stage finds that processor performance is below the design peak, a performance event is deemed to have occurred and the first entry in the first register may be updated.
In addition, in S301 the counter records the number of clock cycles for which the stall lasts. The counter starts counting when the stall occurs and stops when the stall ends. After it stops, the value it holds is the number of clock cycles for which this stall lasted, that is, the performance cost of this stall.
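The counter's behaviour can be sketched over a per-cycle trace of committed instruction counts, treating a stall as a run of consecutive cycles with Commit=0. The trace, the function name, and the `min_cycles` threshold are illustrative assumptions; the application does not state how many zero-commit cycles constitute a stall.

```python
def stall_lengths(commits_per_cycle, min_cycles=1):
    """Return the length in cycles of every run of zero-commit cycles
    that is at least min_cycles long (min_cycles is an assumed threshold)."""
    lengths = []
    run = 0
    for commits in commits_per_cycle:
        if commits == 0:
            run += 1                 # counter keeps counting while nothing commits
        else:
            if run >= min_cycles:
                lengths.append(run)  # stall ended: record its cost
            run = 0                  # counter is reset for the next stall
    if run >= min_cycles:
        lengths.append(run)          # a run still open at the end of the trace is reported too
    return lengths


trace = [2, 0, 0, 0, 1, 0, 3]        # instructions committed in each clock cycle
durations = stall_lengths(trace)
long_only = stall_lengths(trace, min_cycles=2)
```

Each value in `durations` is what would be added to the matching second-register entry once the culprit has been identified.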
As described above, the embodiments of this application divide processor stalls into four types. Accordingly, depending on the type of stall, the processor may update the first entry in S301 in several ways. Four specific ways of updating the first entry are listed below.
Manner 1
In manner 1, the processor updates the first entry in the first register when a performance event occurs, which may be implemented as follows: when a first performance event occurs at the front end of the processor pipeline and the processor's ROB is empty, the processor sets the front-end indication field to a first value, sets the type indication field to a second value, and stores the instruction type information of the first instruction in the sequence number indication field.
As noted in the earlier description, a processor stall may occur at the front end or at the back end. When the stall is determined to have occurred at the front end of the processor pipeline, the front-end indication field of the first entry may be set to the first value to indicate this.
In manner 1, the stall can be regarded as a first-type stall, an instruction supply stall, that is, a pipeline stall caused by the front end of the processor pipeline. A distinguishing feature of the first type of stall is that the ROB is empty when the stall occurs. Therefore, in manner 1, the front-end indication field in the first entry may be set to the first value when the first performance event occurs at the front end of the processor pipeline and the processor's ROB is empty.
For example, the first value may be 1 and the second value may be 0. In a specific implementation, the default value of the front-end indication field and the type indication field may be set to 0; in that case, if the first performance event is found to have occurred at the front end of the processor pipeline with the processor's ROB empty, only the front-end indication field in the first entry needs to be updated, and the type indication field can be left unchanged.
In addition, because the ROB is empty when a first-type stall occurs, it is difficult to determine the "stall culprit" from entries in the ROB. Therefore, so that the index path of the instruction that caused the stall can be determined later, manner 1 reuses the sequence number indication field: the instruction type information of the first instruction is stored in that field, and the "stall culprit" can then be determined directly from it.
Manner 2
In manner 2, the processor updates the first entry in the first register when a performance event occurs, which may be implemented as follows: when a second performance event does not occur at the front end of the processor pipeline and is not a misprediction event, the processor sets both the front-end indication field and the type indication field to the second value.
As described above, a fourth-type stall is hard to identify from any directly observable feature, so a stall can be determined to be a fourth-type stall, a long latency stall, when it is not of the first, second, or third type. In manner 2, the second performance event does not occur at the front end of the processor pipeline and is not a misprediction event, which indicates that the stall is of the fourth type.
For a fourth-type stall, both the front-end indication field and the type indication field of the first entry may be set to the second value to indicate that the stall is of the fourth type.
For example, the second value may be 0. In a specific implementation, the default value of the front-end indication field and the type indication field may be set to 0; in that case, if the second performance event is found not to have occurred at the front end of the processor pipeline and not to be a misprediction event, the first entry need not be updated.
Manner 3
In manner 3, the processor updates the first entry in the first register when a performance event occurs, which may be implemented as follows: when a third performance event is a misprediction event, the processor sets the front-end indication field to the second value, sets the type indication field to the first value, and stores the misprediction sequence number of the third performance event in the sequence number indication field.
As described above, the front-end indication field and the type indication field are set in the same way for the second and third types of stall. In manner 3, when the third performance event is found to be a misprediction event, the stall can be determined to be of the second or third type.
For a second-type or third-type stall, the front-end indication field of the first entry may be set to the second value and the type indication field to the first value.
For example, the first value may be 1 and the second value may be 0. In a specific implementation, the default value of the front-end indication field and the type indication field may be set to 0; in that case, if the third performance event is found to be a misprediction event, only the type indication field in the first entry needs to be updated, and the front-end indication field can be left unchanged.
Mode 4
In Mode 4, the processor updates the first entry in the first register when a performance event occurs as follows: if the fourth performance event is a misprediction event, the processor sets the front-end indication field to the second value, sets the type indication field to the first value, and stores the misprediction sequence number of the fourth performance event in the sequence number indication field; if the fifth performance event is also a misprediction event, the processor compares the misprediction sequence number of the fifth performance event with that of the fourth performance event and stores the smaller of the two in the sequence number indication field.
For example, the first value may be 1 and the second value may be 0.
As described above, the processor can predict the branch flow of a program and then fetch and decode the instructions of one branch in advance, reducing the time spent waiting for the decoder. If a branch is mispredicted, the instructions in the instruction pipeline are flushed, which stalls the processor. When a program contains multi-level branch instructions and several of them are mispredicted, the instruction that caused the stall should be taken to be the first mispredicted instruction, that is, the instruction with the smallest misprediction sequence number. Therefore, in Mode 4, if two misprediction events occur while the first entry is being updated (that is, before the stall ends), the first entry keeps the information of the performance event with the smaller misprediction sequence number, so that the index information of the instruction that caused the stall can be determined accurately.
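The update rule of Modes 3 and 4 can be sketched in a few lines of Python. This is only an illustrative software model, not the hardware described in the application: the `Entry` class, the `record_misprediction` helper and the sample sequence numbers are assumptions, and the field names `fe`, `ctype` and `squash_sn` simply mirror the FE, Ctype and Squash SN fields.

```python
from dataclasses import dataclass
from typing import Optional

FIRST_VALUE = 1    # example value given in the text
SECOND_VALUE = 0   # example value given in the text; also the assumed default


@dataclass
class Entry:
    """Hypothetical software model of one entry of the first register."""
    fe: int = SECOND_VALUE            # front-end indication field
    ctype: int = SECOND_VALUE         # type indication field
    squash_sn: Optional[int] = None   # sequence number indication field


def record_misprediction(entry: Entry, sn: int) -> None:
    """Modes 3 and 4: record a misprediction, keeping the smallest sequence number."""
    already_mispredicted = entry.ctype == FIRST_VALUE and entry.squash_sn is not None
    if already_mispredicted:
        # Mode 4: an earlier misprediction is recorded; the older one wins.
        sn = min(sn, entry.squash_sn)
    entry.fe = SECOND_VALUE
    entry.ctype = FIRST_VALUE
    entry.squash_sn = sn


entry = Entry()
for sequence_number in (42, 17, 30):   # made-up misprediction sequence numbers
    record_misprediction(entry, sequence_number)
print(entry)
```

With the made-up inputs above, the entry ends up holding sequence number 17, the oldest mispredicted instruction, regardless of the order in which the mispredictions were detected.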
The above are only simple examples of how the first entry may be updated in the embodiments of this application. In an actual implementation, the update of the first entry is carried out according to the actual situation, and the concrete operations may follow the four modes described above. For example, after a second performance event occurs, the first entry is updated according to Mode 2; then, without the processor having stalled, a fourth performance event occurs, and the first entry is updated again, so that it now records the information of the fourth performance event. In this case, the performance cost of the second performance event can be regarded as hidden by the fourth performance event: the second performance event did not cause the processor to stall.
As the above description shows, in the embodiments of this application the entry pointed to by the write pointer (writing) must be updated when a performance event occurs. While an entry of the first register is being updated, the sequence number indication field of the entry pointed to by the write pointer (writing) may first have to be read in order to decide how to update that field. After a stall ends, the entry pointed to by the read pointer (reading) must be read to determine the instruction type of the first instruction. In other words, the first register may be configured with two read ports and one write port, so that it can be regarded as a register file with two reads and one write. One read port is used in the execute stage to read an entry of the first register and decide how to update it; the other read port is used in the commit stage to read an entry of the first register and determine the instruction type of the instruction that caused the stall; the write port is used in the execute stage to update an entry of the first register.
S302: After the stall ends, the processor stops updating the first entry and determines the instruction type of the first instruction according to the first entry.
As described above, in S301 the processor may update the first entry in several ways when a performance event occurs. If a performance event does not cause the processor to stall, the first entry written for that event is overwritten; only the instruction that actually causes the stall remains recorded in the first entry. Hence, in S302, once the stall has ended, the first entry records the information of the instruction that caused the stall; that is, the first instruction corresponding to the first entry can be regarded as the instruction that caused the stall.
Specifically, the processor can stop updating the first entry when the stall ends by moving the write pointer (writing) of the first register to the entry following the first entry. That is, the write pointer (writing) moves to the next entry after each stall ends; the processor then continues to monitor performance events and updates that next entry when a performance event occurs.
The write pointer (writing) is moved to the next entry after every stall to prevent the following situation: when several nearby instructions (for example, two consecutive instructions) each stall the processor, and the write pointer (writing) is not moved, the information of the first instruction may be overwritten by that of the second instruction before it has been read in the commit stage.
In addition, in a specific implementation of S302, the processor may read the first entry pointed to by the read pointer (reading) after the stall ends and determine the instruction type of the first instruction from it. Once the first entry has been read, the read pointer (reading) can be moved to the entry following the first entry, so that when another stall occurs later, the instruction type of the instruction that caused it is determined from the entry the read pointer (reading) then points to.
Furthermore, because the storage capacity of the first register is limited, the first entry may be deleted or released after it has been read and the instruction type of the first instruction has been determined, so that entries that have already been read do not occupy the first register.
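The interplay of the two pointers can be illustrated with a toy Python model. It is a sketch under stated assumptions rather than the claimed hardware: the `StallTable` class, its method names, the table size of four and the circular wrap-around are all hypothetical, and the entries are reduced to plain strings.

```python
class StallTable:
    """Toy model of the first register as a small circular table (assumed layout)."""

    def __init__(self, size=4):
        self.entries = [None] * size
        self.writing = 0   # entry updated while performance events occur
        self.reading = 0   # entry consulted in the commit stage after a stall

    def update(self, info):
        # A later performance event overwrites an earlier one in the same entry.
        self.entries[self.writing] = info

    def end_stall(self):
        # Freeze the current entry by moving the write pointer to the next one.
        self.writing = (self.writing + 1) % len(self.entries)

    def consume(self):
        # Read the oldest frozen entry, release it, and advance the read pointer.
        info = self.entries[self.reading]
        self.entries[self.reading] = None
        self.reading = (self.reading + 1) % len(self.entries)
        return info


table = StallTable()
table.update("event hidden by a later one")
table.update("load miss")              # overwrites the hidden event
table.end_stall()
table.update("branch misprediction")   # lands in the next entry
table.end_stall()
first = table.consume()
second = table.consume()
print(first, "|", second)
```

Because the write pointer advances at the end of each stall, the record of the first stall is still intact when the commit-stage logic gets around to reading it, even though a second stall has already been recorded.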
As described above, the processor may update the first entry in S301 in several ways, depending on the type of stall. Likewise, in S302 the processor can determine the type of the stall from the information recorded in the three fields of the first entry, and from that the instruction type of the first instruction that caused the stall.
In a specific implementation, the processor may determine the instruction type of the first instruction according to the first entry in the following three ways.
Implementation 1
The processor determines the instruction type of the first instruction according to the first entry as follows: when the processor determines that the front-end indication field holds the first value, it obtains the instruction type information of the first instruction stored in the sequence number indication field.
In Implementation 1, the front-end indication field of the first entry holds the first value, so the stall occurred at the front end of the processor pipeline; that is, it is a first-type stall. For a first-type stall, the instruction that caused the stall is the first instruction committed after the stall ends, which should correspond to the first entry in the ROB (that is, the entry of the instruction about to be committed). However, the ROB is empty when a first-type stall occurs, so the index information of the instruction that caused the stall cannot be obtained from the ROB. To avoid being unable to obtain the index information of the "stall culprit", the design stores the instruction type information of the first instruction in the sequence number indication field when the first entry is updated for a first-type stall (as described in Mode 1 of S301). Then, when the front-end indication field holds the first value, the instruction type information of the first instruction can be read directly from the sequence number indication field of the first entry.
Implementation 2
The processor determines the instruction type of the first instruction according to the first entry as follows: when the processor determines that the front-end indication field and the type indication field both hold the second value, it obtains the instruction type information of the first instruction stored in the first entry of the ROB.
Here, the first entry of the ROB can be understood as the entry pointed to by the head pointer (heading) of the ROB.
In Implementation 2, the front-end indication field and the type indication field of the first entry both hold the second value, so the stall is a fourth-type stall. For a fourth-type stall, the instruction that caused the stall is the first instruction committed after the stall ends, which should correspond to the first entry in the ROB (that is, the entry of the instruction about to be committed). Therefore, in Implementation 2, the instruction type information of the first instruction is obtained from the first entry of the ROB.
Implementation 3
The processor determines the instruction type of the first instruction according to the first entry as follows: when the processor determines that the front-end indication field holds the second value, the type indication field holds the first value, and the misprediction sequence number stored in the sequence number indication field equals the misprediction sequence number stored in the third register, it obtains the instruction type information of the first instruction stored in the third register. The third register holds the third entry, which is the entry most recently deleted from the ROB and contains the instruction type information of the first instruction.
In Implementation 3, the front-end indication field of the first entry holds the second value and the type indication field holds the first value, so the stall is a second-type or third-type stall, and the instruction that caused it is the last instruction committed before the stall occurred. In theory, the instruction recorded in the third register at this point (that is, the last instruction committed before the stall occurred) is the instruction that caused the stall.
In practice, however, because the processor executes instructions out of order, the first entry may be updated out of order. For example, the instruction that caused this stall may not yet have reached the commit stage, or the performance event recorded in the first entry may have been masked by, or may overlap with, another performance event (in which case the performance event recorded in the first entry can be regarded as not having caused this stall).
In Implementation 3, the misprediction sequence number stored in the sequence number indication field is compared with the misprediction sequence number stored in the third register precisely so that the "stall culprit" is identified correctly in these more complex situations. If the two sequence numbers are equal, the instruction recorded in the first entry is the instruction recorded in the third register (that is, the last instruction committed before the stall occurred). This case is similar to the situation shown for programs B/D in FIG. 2, and the instruction type information of the first instruction can be obtained from the third register.
In addition, to handle correctly the case in which the first entry is updated out of order because instructions execute out of order, Implementation 3 may also determine the instruction type of the first instruction in the following two ways.
Option a
In Option a, when the processor determines that the front-end indication field holds the second value, the type indication field holds the first value, and the misprediction sequence number stored in the sequence number indication field is greater than the misprediction sequence number stored in the third register, it obtains the instruction type information of the first instruction stored in the first entry of the ROB.
In Option a, when the misprediction sequence number stored in the sequence number indication field is greater than the one stored in the third register, the instruction indicated by the first entry has not yet reached the commit stage. The stall caused by that instruction has therefore not yet occurred, and the stall that has just ended is the one preceding it. In this situation the index information of the culprit instruction should be obtained not from the third register but from the first entry of the ROB.
Moreover, because the instruction indicated by the first entry has not yet reached the commit stage, in Option a the read pointer (reading) of the first register need not be moved to the next entry after the instruction type information has been obtained from the first entry of the ROB, unlike in the other approaches described in the embodiments of this application. The read pointer (reading) is left unchanged so that the "stall culprit" can still be identified correctly after the next stall ends.
Option b
In Option b, when the processor determines that the front-end indication field holds the second value, the type indication field holds the first value, and the misprediction sequence number stored in the sequence number indication field is smaller than the misprediction sequence number stored in the third register, it obtains the instruction type information of the first instruction stored in the first entry of the ROB.
In Option b, when the misprediction sequence number stored in the sequence number indication field is smaller than the one stored in the third register, the performance event recorded in the first entry can be regarded as masked by another performance event (that is, the performance event recorded in the first entry did not cause the processor to stall). For illustration, call the performance event recorded in the first entry performance event p and the event that masks it performance event q. The instruction that triggered performance event p has already been committed without stalling the processor, so the index information of the instruction that caused this stall (that is, the instruction that triggered performance event q) should be obtained not from the third register but from the first entry of the ROB.
In Option b, the read pointer (reading) of the first register may also be moved to the next entry.
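The commit-stage decision described in Implementations 1 to 3 and Options a and b can be condensed into one function. The following Python sketch only restates those rules; the function name `locate_culprit`, the string labels for the three information sources and the use of 1 and 0 for the first and second values are illustrative assumptions.

```python
def locate_culprit(fe, ctype, squash_sn, last_committed_sn):
    """Return (source, advance_reading) for the entry under the read pointer.

    source is where the instruction type information is taken from:
      "entry"          - the sequence number indication field of the entry itself
      "rob_head"       - the first entry of the ROB
      "last_committed" - the third register (entry most recently deleted from the ROB)
    advance_reading says whether the read pointer moves on to the next entry.
    """
    if fe == 1:
        # Implementation 1: front-end stall, type info was saved in the entry.
        return "entry", True
    if ctype == 0:
        # Implementation 2: the culprit is the next instruction to commit.
        return "rob_head", True
    if squash_sn == last_committed_sn:
        # Implementation 3: the culprit is the last committed instruction.
        return "last_committed", True
    if squash_sn > last_committed_sn:
        # Option a: the recorded instruction has not committed yet; keep the
        # read pointer so the entry is examined again after the next stall.
        return "rob_head", False
    # Option b: the recorded event was masked by another event.
    return "rob_head", True


print(locate_culprit(fe=0, ctype=1, squash_sn=20, last_committed_sn=12))
```

The example call models Option a: the recorded misprediction (sequence number 20) is younger than the last committed instruction (12), so the culprit is looked up at the ROB head and the read pointer stays where it is.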
S303: The processor adds the first clock cycle count to the accumulated stall cycle count corresponding to the instruction type of the first instruction and writes the accumulated stall cycle count to the second entry of the second register.
The second register has multiple entries, each corresponding to one instruction type, and the entries store the accumulated stall cycle counts of the stalls caused by instructions of each instruction type.
As described earlier, each type of stall has its own characteristics, and each type of stall may be caused by instructions of the same or of different types. For example, a branch instruction may cause a second-type stall, and a load instruction may cause a second-type or a fourth-type stall. The second register stores the accumulated stall cycle counts of the stalls caused by instructions of each instruction type. For example, the second register may record that the accumulated stall cycle count of second-type stalls caused by branch instructions is A, that of second-type stalls caused by load instructions is B, that of fourth-type stalls caused by load instructions is C, and so on.
The second entry may be updated in the second register by adding the count of the counter (that is, the first clock cycle count) to the entry of the second register that corresponds to the first instruction. For example, if the method shown in FIG. 3 determines that the stall ID of the first instruction that caused the stall is Oct 01 and the first clock cycle count recorded by the counter is M, then S303 adds M to the accumulated value of the entry corresponding to Oct 01 in the second register. If the accumulated value of that entry was N before the stall occurred, it is M+N after the addition.
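The accumulation in S303 amounts to a keyed add. The short Python sketch below models the second register as a dictionary from stall ID to accumulated cycles; the dictionary representation, the `accumulate` helper, the integer encoding of the stall ID and the concrete values chosen for N and M are assumptions made purely for illustration.

```python
from collections import defaultdict


def accumulate(table, stall_id, stall_cycles):
    """Add the cycles of the stall that just ended to the entry for stall_id."""
    table[stall_id] += stall_cycles
    return table[stall_id]


STALL_ID = 0o01        # stands in for the stall ID "Oct 01" of the example
N = 120                # made-up accumulated value before this stall
M = 35                 # made-up cycle count of this stall

second_register = defaultdict(int)   # unseen stall IDs start at 0
second_register[STALL_ID] = N
total = accumulate(second_register, STALL_ID, M)
print(total)           # M + N
```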
Furthermore, after the processor has written the accumulated stall cycle count to the second entry of the second register in S303, all entries held in the second register may be written to a memory connected to the processor in any of the following cases: the second register overflows; the second register triggers an interrupt; or the performance monitoring period of the processor ends.
After all entries held in the second register have been written to the memory, the second register can be cleared. The performance of the processor can then be evaluated from the data in the memory, for example by analyzing the percentage of stalls caused by each type of instruction, which types of instructions tend to cause stalls, and what percentage of the total execution cycles of the program consists of stall cycles.
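As an illustration of the kind of analysis mentioned above, the following Python sketch computes the share of stall cycles per instruction type and the fraction of all execution cycles spent stalled. The function name `stall_breakdown`, the dictionary input format and all the numbers are hypothetical; they are not data from the application.

```python
def stall_breakdown(cycles_by_type, total_cycles):
    """Return (share of stall cycles per type, stall fraction of all cycles)."""
    total_stall = sum(cycles_by_type.values())
    if total_stall == 0:
        return {}, 0.0
    share = {kind: cycles / total_stall for kind, cycles in cycles_by_type.items()}
    return share, total_stall / total_cycles


# Made-up snapshot of the counts copied from the second register to memory.
snapshot = {"branch": 300, "load": 700}
share, stalled_fraction = stall_breakdown(snapshot, total_cycles=4000)

for kind, fraction in sorted(share.items(), key=lambda item: -item[1]):
    print(f"{kind}: {fraction:.0%} of stall cycles")
print(f"stalled for {stalled_fraction:.0%} of all cycles")
```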
With the method shown in FIG. 3, the first entry in the first register is updated when a performance event occurs and is no longer updated once the processor stall has ended, so the first entry can be used to indicate the index path of the instruction type of the first instruction that caused the stall. In addition, the processor starts a counter when a stall occurs, so after the stall ends the counter holds the first clock cycle count for which the stall lasted; this count represents the performance cost of the stall. Therefore, after the stall ends, the instruction type of the first instruction that caused it can be determined by reading the first entry. The second register stores the accumulated stall cycle counts of the stalls caused by instructions of each instruction type, and after the stall ends the first clock cycle count can be added to the entry of the second register that corresponds to the instruction type of the first instruction (that is, the second entry). This allows a comprehensive analysis of processor performance, for example the percentage of stalls caused by each type of instruction, which types of instructions tend to cause stalls, and what percentage of the total execution cycles of the program consists of stall cycles. In summary, with the processor performance monitoring solution provided in the embodiments of this application, the instruction type of the instruction that caused a stall can be determined accurately after the processor stalls, and the performance cost attributable to each type of instruction can be evaluated after program execution ends.
In addition, the solution provided in the embodiments of this application allows processor monitoring to be implemented with a low-cost hardware mechanism: the instruction that caused a stall can be determined accurately by adding a register file that supports two reads and one write in the execute stage, combined with decision logic in the commit stage.
Based on the above embodiments, an embodiment of this application further provides a method for monitoring processor performance, which can be regarded as a specific example of the method shown in FIG. 3.
Referring to FIG. 4, the method may proceed as follows. When a front-end cache miss occurs in the fetch stage and the ROB is empty, the entry pointed to by the writing pointer in the CTS is updated: FE is set to 1, and the stall ID of the fetched instruction is written into the Squash SN field (a specific example of the sequence number indication field). When the commit stage determines that a stall has ended, the writing pointer is incremented by 1, so that the entry it previously pointed to is no longer updated. The entry pointed to by the reading pointer is then read to determine the "stall culprit" of the stall that has just ended, and the reading pointer is incremented by 1 after the read completes. If the entry pointed to by the reading pointer has FE=1 and Ctype=0, the processor determines that the stall was a pipeline front-end stall, uses the stall ID stored in Squash SN to index the SPCT, and adds the count of the SCC to the corresponding entry of the SPCT.
When a later stall ends and the entry pointed to by the reading pointer has FE=0 and Ctype=0, the processor determines that the entry pointed to by the head pointer (head) of the ROB is the "stall culprit" and adds the count of the SCC to the entry of the SPCT that corresponds to that culprit. The SPCT can then be used to analyze the performance of the processor.
It should be noted that the method shown in FIG. 4 can be regarded as a specific example of the method shown in FIG. 3. For implementations and technical effects not described in detail for the method shown in FIG. 4, refer to the related description of the method shown in FIG. 3.
基于同一发明构思,本申请实施例还提供一种处理器性能的监测装置,该装置可用于执行图3所示的处理器性能的监测方法。参见图5,该处理器性能的监测装置500(以下简称“装置500”)包括处理器501、第一寄存器502、计数器503和第二寄存器504。Based on the same inventive concept, the embodiment of the present application further provides a processor performance monitoring device, which can be used to execute the processor performance monitoring method shown in FIG. 3. Referring to FIG. 5, the processor performance monitoring device 500 (hereinafter referred to as “device 500”) includes a processor 501, a first register 502, a counter 503, and a second register 504.
处理器501用于:在发生性能事件时更新第一寄存器502中的第一条目,在发生停顿时,启动计数器503来统计停顿持续的第一时钟周期数,第一条目用于指示引起性能事件的第一指令的指令类型的索引路径;在停顿终止后,停止更新第一条目,并根据第一条目确定第一指令的指令类型;将第一时钟周期数叠加入第一指令的指令类型对应的累计停顿周期数,并将累计停顿周期数写入第二寄存器504中的第二条目;第二寄存器504设有多个条目,多个条目分别对应多个指令类型,多个条目用于保存各个指令类型下的指令导致的停顿的累计停顿周期数。The processor 501 is configured to: update a first entry in the first register 502 when a performance event occurs, and when a pause occurs, start a counter 503 to count a first clock cycle duration of the pause, and the first entry is used to indicate a cause Index path of the instruction type of the first instruction of the performance event; after the pause ends, stop updating the first entry, and determine the instruction type of the first instruction according to the first entry; add the first clock cycle number to the first instruction The total number of pause periods corresponding to the type of instruction, and write the accumulated number of pause periods to the second entry in the second register 504; the second register 504 is provided with multiple entries, each of which corresponds to multiple instruction types, more This entry is used to store the cumulative number of pause cycles caused by the instructions under each instruction type.
The first register 502, the counter 503, and the second register 504 may be integrated into the processor 501 or provided separately. When they are integrated into the processor 501, the processor performance monitoring device 500 may itself be regarded as a processor.
Specifically, the first entry may include a front-end indication field, a type indication field, and a sequence number indication field. The front-end indication field indicates whether the stall occurred at the front end; the type indication field indicates whether the stall occurred before the first instruction was committed; and the sequence number indication field indicates the misprediction sequence number of the first instruction.
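As a rough illustration of how the three fields could share one register word, the sketch below packs them into an integer. The bit layout (one bit per flag and a 16-bit sequence number field) and the function names are assumptions made for the example; the text does not specify field widths or positions.

```python
# Assumed layout (not specified in the source):
#   bit 0      front-end indication field
#   bit 1      type indication field
#   bits 2-17  sequence number indication field (16 bits)
SEQ_SHIFT = 2
SEQ_MASK = 0xFFFF


def pack_first_entry(front_end, type_flag, seq_field):
    """Pack the three fields of the first entry into one integer word."""
    return (front_end & 1) | ((type_flag & 1) << 1) | ((seq_field & SEQ_MASK) << SEQ_SHIFT)


def unpack_first_entry(word):
    """Return (front_end, type_flag, seq_field) from a packed word."""
    return word & 1, (word >> 1) & 1, (word >> SEQ_SHIFT) & SEQ_MASK


word = pack_first_entry(front_end=0, type_flag=1, seq_field=42)
fields = unpack_first_entry(word)
```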
In the device 500, the processor 501 can update the first entry in several ways, four of which are listed below.
The first way
When updating the first entry in the first register 502, the processor 501 is specifically configured to: when a first performance event occurs at the front end of the pipeline of the processor 501 and the reorder buffer (ROB) of the processor 501 is empty, set the front-end indication field to a first value, set the type indication field to a second value, and store the instruction type information of the first instruction in the sequence number indication field.
The second way
When updating the first entry in the first register 502, the processor 501 is specifically configured to: when a second performance event does not occur at the front end of the pipeline of the processor 501 and the second performance event is not a misprediction event, set both the front-end indication field and the type indication field to the second value.
The third way
When updating the first entry in the first register 502, the processor 501 is specifically configured to: when a third performance event is a misprediction event, set the front-end indication field to the second value, set the type indication field to the first value, and store the misprediction sequence number of the third performance event in the sequence number indication field.
The fourth way
When updating the first entry in the first register 502, the processor 501 is specifically configured to: when a fourth performance event is a misprediction event, set the front-end indication field to the second value, set the type indication field to the first value, and store the misprediction sequence number of the fourth performance event in the sequence number indication field; when a fifth performance event is a misprediction event, compare the misprediction sequence number of the fifth performance event with the misprediction sequence number of the fourth performance event; and store the smaller of the two misprediction sequence numbers in the sequence number indication field.
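The four update ways can be sketched together as one function. This is a hedged illustration, not the claimed implementation: the concrete encodings `FIRST_VALUE = 1` and `SECOND_VALUE = 0`, the `FirstEntry` class, the parameter names, and the order in which the conditions are tested are all assumptions, and the case of a front-end event with a non-empty ROB is simply left unchanged because this passage does not describe it.

```python
from dataclasses import dataclass

FIRST_VALUE, SECOND_VALUE = 1, 0  # assumed encodings of the "first" and "second" values


@dataclass
class FirstEntry:
    front_end: int = SECOND_VALUE  # front-end indication field
    type_flag: int = SECOND_VALUE  # type indication field
    seq_field: int = 0             # sequence number indication field


def update_first_entry(entry, *, at_front_end, rob_empty, is_misprediction,
                       type_info=0, mispredict_seq=0):
    """Apply one performance event to the first entry."""
    if is_misprediction:
        # Third and fourth ways: record the misprediction, keeping the smaller
        # sequence number if an earlier misprediction is already recorded.
        pending = entry.front_end == SECOND_VALUE and entry.type_flag == FIRST_VALUE
        entry.seq_field = min(entry.seq_field, mispredict_seq) if pending else mispredict_seq
        entry.front_end = SECOND_VALUE
        entry.type_flag = FIRST_VALUE
    elif at_front_end and rob_empty:
        # First way: the sequence number field is reused for instruction type info.
        entry.front_end = FIRST_VALUE
        entry.type_flag = SECOND_VALUE
        entry.seq_field = type_info
    elif not at_front_end:
        # Second way: event outside the front end that is not a misprediction.
        entry.front_end = SECOND_VALUE
        entry.type_flag = SECOND_VALUE
    return entry


front = update_first_entry(FirstEntry(), at_front_end=True, rob_empty=True,
                           is_misprediction=False, type_info=7)

mis = FirstEntry()
for seq in (9, 4, 12):  # three successive misprediction events
    update_first_entry(mis, at_front_end=False, rob_empty=False,
                       is_misprediction=True, mispredict_seq=seq)
```

After the three misprediction events, `mis.seq_field` holds 4, the smallest of the three sequence numbers, which matches the comparison described in the fourth way.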
In addition, in this embodiment of the application, the processor 501 may determine the instruction type of the first instruction according to the first entry in the following three ways.
The first way
When determining the instruction type of the first instruction according to the first entry, the processor 501 is specifically configured to: when determining that the front-end indication field is the first value, obtain the instruction type information of the first instruction stored in the sequence number indication field.
The second way
When determining the instruction type of the first instruction according to the first entry, the processor 501 is specifically configured to: when determining that both the front-end indication field and the type indication field are the second value, obtain the instruction type information of the first instruction stored in the first entry of the reorder buffer (ROB).
The third way
In the third way, the device 500 further includes a third register. The third register stores a third entry, which is the entry most recently deleted from the reorder buffer (ROB), and the third entry contains the instruction type information of the first instruction. When determining the instruction type of the first instruction according to the first entry, the processor 501 may specifically use one of the following three ways:
Way 3a
When determining that the front-end indication field is the second value, the type indication field is the first value, and the misprediction sequence number stored in the sequence number indication field is the same as the misprediction sequence number stored in the third register, the processor 501 obtains the instruction type information of the first instruction stored in the third register.
Way 3b
When determining that the front-end indication field is the second value, the type indication field is the first value, and the misprediction sequence number stored in the sequence number indication field is greater than the misprediction sequence number stored in the third register, the processor 501 obtains the instruction type information of the first instruction stored in the first entry of the reorder buffer (ROB).
Way 3c
When determining that the front-end indication field is the second value, the type indication field is the first value, and the misprediction sequence number stored in the sequence number indication field is less than the misprediction sequence number stored in the third register, the processor 501 obtains the instruction type information of the first instruction stored in the first entry of the reorder buffer (ROB).
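The lookup rules above amount to a small decision function. The sketch below is an illustration under assumptions: `FIRST_VALUE = 1` and `SECOND_VALUE = 0` are assumed encodings, the function and parameter names are hypothetical, and instruction types are represented as strings for readability.

```python
FIRST_VALUE, SECOND_VALUE = 1, 0  # assumed encodings of the "first" and "second" values


def resolve_instruction_type(front_end, type_flag, seq_field, rob_head_type,
                             third_reg_seq=None, third_reg_type=None):
    """Pick the source of the instruction type from the fields of the first entry.

    rob_head_type  : type info held in the first entry of the ROB
    third_reg_seq  : misprediction sequence number held in the third register
    third_reg_type : type info held in the third register
    """
    if front_end == FIRST_VALUE:
        # First way: the sequence number field itself holds the type info.
        return seq_field
    if type_flag == SECOND_VALUE:
        # Second way: both fields are the second value, read the ROB head.
        return rob_head_type
    if seq_field == third_reg_seq:
        # Way 3a: the sequence numbers match, read the third register.
        return third_reg_type
    # Ways 3b and 3c: the sequence numbers differ (greater or less), read the ROB head.
    return rob_head_type


way1 = resolve_instruction_type(FIRST_VALUE, SECOND_VALUE, "branch", "load")
way2 = resolve_instruction_type(SECOND_VALUE, SECOND_VALUE, 0, "load")
way3a = resolve_instruction_type(SECOND_VALUE, FIRST_VALUE, 5, "load", 5, "store")
way3b = resolve_instruction_type(SECOND_VALUE, FIRST_VALUE, 6, "load", 5, "store")
way3c = resolve_instruction_type(SECOND_VALUE, FIRST_VALUE, 4, "load", 5, "store")
```

Ways 3b and 3c share one branch in the sketch because, as described above, both read the first entry of the ROB.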
In addition, in this embodiment of the application, the processor 501 is further configured to: after writing the accumulated stall cycle count to the second entry of the second register 504, update all entries stored in the second register 504 to a memory connected to the processor 501 in any of the following cases: the second register 504 overflows; the second register 504 triggers an interrupt; or the performance monitoring period of the processor 501 ends.
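A minimal sketch of this write-back condition follows. The dictionaries standing in for the second register and the memory, the function name `maybe_flush`, and the three boolean flags are assumptions for illustration; the sketch only copies the entries and does not clear the register, because the text does not say whether the register is reset afterwards.

```python
def maybe_flush(second_register, memory, *, overflowed, interrupt, period_ended):
    """Copy all entries of the second register to memory if any trigger holds."""
    if overflowed or interrupt or period_ended:
        memory.update(second_register)  # write back every per-type accumulated count
        return True
    return False


second_register = {"load": 5, "branch": 2}  # per-type accumulated stall cycle counts
memory = {}

skipped = maybe_flush(second_register, memory,
                      overflowed=False, interrupt=False, period_ended=False)
snapshot_before = dict(memory)              # still empty: no trigger was active
flushed = maybe_flush(second_register, memory,
                      overflowed=False, interrupt=False, period_ended=True)
```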
It should likewise be noted that the processor performance monitoring device 500 can be used to perform the method provided by the embodiment corresponding to FIG. 3. For implementations and technical effects not described in detail for the device 500 shown in FIG. 5, refer to the related description of FIG. 3.
Based on the same inventive concept, an embodiment of this application further provides a processor performance monitoring device, which can be used to perform the processor performance monitoring method shown in FIG. 3 and can also be regarded as the same device as the processor performance monitoring device 500 shown in FIG. 5. Referring to FIG. 6, the processor performance monitoring device 600 includes an update module 601, a start module 602, a stop module 603, and a read module 604.
The update module 601 is configured to update a first entry in a first register when a performance event occurs, where the first entry indicates an index path to the instruction type of a first instruction that caused the performance event.
The start module 602 is configured to start a counter when a stall occurs, to count a first number of clock cycles for which the stall lasts.
The stop module 603 is configured to stop updating the first entry after the stall ends.
The read module 604 is configured to determine the instruction type of the first instruction according to the first entry.
The update module 601 is further configured to add the first number of clock cycles to the accumulated stall cycle count corresponding to the instruction type of the first instruction, and to write the accumulated stall cycle count to a second entry in a second register. The second register has multiple entries, which correspond to respective instruction types and store the accumulated stall cycle counts of the stalls caused by instructions of each instruction type.
In addition, the processor performance monitoring device 600 shown in FIG. 6 can also be used to perform the other operations of the processor performance monitoring method shown in FIG. 3, which are not repeated here.
It should be noted that the division into modules in the embodiments of this application is schematic and is merely a division by logical function; other divisions are possible in an actual implementation. The functional modules in the embodiments of this application may be integrated into one processing module, may each exist physically on their own, or two or more of them may be integrated into one module. The integrated module may be implemented in the form of hardware or in the form of a software functional module.
If the integrated module is implemented as a software functional module and sold or used as an independent product, it may be stored in a computer-readable storage medium. Based on this understanding, the technical solution of this application in essence, or the part that contributes to the prior art, or all or part of the technical solution, may be embodied in the form of a software product. The computer software product is stored in a storage medium and includes several instructions that cause a computer device (which may be a personal computer, a server, a network device, or the like) or a processor to perform all or some of the steps of the methods of the embodiments of this application. The storage medium includes any medium that can store program code, such as a USB flash drive, a removable hard disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk, or an optical disc.
It should likewise be noted that the processor performance monitoring device 600 can be used to perform the method provided by the embodiment corresponding to FIG. 3. For implementations and technical effects not described in detail for the device 600 shown in FIG. 6, refer to the related description of FIG. 3.
In addition, an embodiment of this application further provides a computer storage medium storing a program that, when executed by a processor, implements the method provided by the embodiment corresponding to FIG. 3.
An embodiment of this application further provides a computer program product that includes program code which, when run on a computer, causes the computer to perform the method provided by the embodiment corresponding to FIG. 3.
Those skilled in the art should understand that the embodiments of this application may be provided as a method, a system, or a computer program product. Accordingly, this application may take the form of an entirely hardware embodiment, an entirely software embodiment, or an embodiment combining software and hardware. Moreover, this application may take the form of a computer program product implemented on one or more computer-usable storage media (including but not limited to magnetic disk storage, CD-ROM, and optical storage) that contain computer-usable program code.
This application is described with reference to flowcharts and/or block diagrams of the method, the device (system), and the computer program product according to the embodiments of this application. It should be understood that computer program instructions can implement each flow and/or block in the flowcharts and/or block diagrams, and combinations of flows and/or blocks in the flowcharts and/or block diagrams. These computer program instructions may be provided to a processor of a general-purpose computer, a special-purpose computer, an embedded processor, or another programmable data processing device to produce a machine, so that the instructions executed by the processor of the computer or other programmable data processing device produce an apparatus for implementing the functions specified in one or more flows of the flowcharts and/or one or more blocks of the block diagrams.
These computer program instructions may also be stored in a computer-readable memory that can direct a computer or another programmable data processing device to operate in a particular manner, so that the instructions stored in the computer-readable memory produce an article of manufacture that includes an instruction apparatus, where the instruction apparatus implements the functions specified in one or more flows of the flowcharts and/or one or more blocks of the block diagrams.
These computer program instructions may also be loaded onto a computer or another programmable data processing device, so that a series of operation steps is performed on the computer or other programmable device to produce computer-implemented processing, and the instructions executed on the computer or other programmable device thereby provide steps for implementing the functions specified in one or more flows of the flowcharts and/or one or more blocks of the block diagrams.
Obviously, those skilled in the art can make various changes and variations to the embodiments of this application without departing from the spirit and scope of the embodiments. If these modifications and variations fall within the scope of the claims of this application and their equivalent technologies, this application is intended to cover them as well.

Claims (23)

  1. A method for monitoring processor performance, comprising:
    updating, by the processor, a first entry in a first register when a performance event occurs, and, when a stall occurs, starting a counter to count a first number of clock cycles for which the stall lasts, wherein the first entry indicates an index path to the instruction type of a first instruction that caused the performance event;
    stopping, by the processor, updating of the first entry after the stall ends, and determining the instruction type of the first instruction according to the first entry; and
    adding, by the processor, the first number of clock cycles to an accumulated stall cycle count corresponding to the instruction type of the first instruction, and writing the accumulated stall cycle count to a second entry of a second register, wherein the second register has multiple entries, the multiple entries correspond to respective instruction types, and the multiple entries store the accumulated stall cycle counts of the stalls caused by instructions of each instruction type.
  2. The method according to claim 1, wherein the first entry comprises a front-end indication field, a type indication field, and a sequence number indication field, the front-end indication field indicates whether the stall occurred at the front end, the type indication field indicates whether the stall occurred before the first instruction was committed, and the sequence number indication field indicates the misprediction sequence number of the first instruction.
  3. The method according to claim 2, wherein the updating, by the processor, of the first entry in the first register when a performance event occurs comprises:
    when a first performance event occurs at the front end of the processor pipeline and the reorder buffer (ROB) of the processor is empty, setting, by the processor, the front-end indication field to a first value, setting the type indication field to a second value, and storing the instruction type information of the first instruction in the sequence number indication field.
  4. The method according to claim 2, wherein the updating, by the processor, of the first entry in the first register when a performance event occurs comprises:
    when a second performance event does not occur at the front end of the processor pipeline and the second performance event is not a misprediction event, setting, by the processor, both the front-end indication field and the type indication field to a second value.
  5. The method according to claim 2, wherein the updating, by the processor, of the first entry in the first register when a performance event occurs comprises:
    when a third performance event is a misprediction event, setting, by the processor, the front-end indication field to a second value, setting the type indication field to a first value, and storing the misprediction sequence number of the third performance event in the sequence number indication field.
  6. The method according to claim 2, wherein the updating, by the processor, of the first entry in the first register when a performance event occurs comprises:
    when a fourth performance event is a misprediction event, setting, by the processor, the front-end indication field to a second value, setting the type indication field to a first value, and storing the misprediction sequence number of the fourth performance event in the sequence number indication field;
    when a fifth performance event is a misprediction event, comparing, by the processor, the misprediction sequence number of the fifth performance event with the misprediction sequence number of the fourth performance event; and
    storing, by the processor, the smaller of the misprediction sequence number of the fifth performance event and the misprediction sequence number of the fourth performance event in the sequence number indication field.
  7. The method according to any one of claims 1 to 6, wherein the determining, by the processor, of the instruction type of the first instruction according to the first entry comprises:
    when determining that the front-end indication field is the first value, obtaining, by the processor, the instruction type information of the first instruction stored in the sequence number indication field.
  8. The method according to any one of claims 1 to 6, wherein the determining, by the processor, of the instruction type of the first instruction according to the first entry comprises:
    when determining that both the front-end indication field and the type indication field are the second value, obtaining, by the processor, the instruction type information of the first instruction stored in the first entry of the reorder buffer (ROB).
  9. The method according to any one of claims 1 to 6, wherein the determining, by the processor, of the instruction type of the first instruction according to the first entry comprises:
    when determining that the front-end indication field is the second value, the type indication field is the first value, and the misprediction sequence number stored in the sequence number indication field is the same as the misprediction sequence number stored in a third register, obtaining, by the processor, the instruction type information of the first instruction stored in the third register, wherein the third register stores a third entry most recently deleted from the reorder buffer (ROB), and the third entry contains the instruction type information of the first instruction; or
    when determining that the front-end indication field is the second value, the type indication field is the first value, and the misprediction sequence number stored in the sequence number indication field is greater than the misprediction sequence number stored in the third register, obtaining, by the processor, the instruction type information of the first instruction stored in the first entry of the reorder buffer (ROB); or when determining that the front-end indication field is the second value, the type indication field is the first value, and the misprediction sequence number stored in the sequence number indication field is less than the misprediction sequence number stored in the third register, obtaining, by the processor, the instruction type information of the first instruction stored in the first entry of the reorder buffer (ROB).
  10. The method according to any one of claims 1 to 9, further comprising, after the processor writes the accumulated stall cycle count to the second entry of the second register:
    updating, by the processor, all entries stored in the second register to a memory connected to the processor in any of the following cases:
    the second register overflows;
    the second register triggers an interrupt; or
    the performance monitoring period of the processor ends.
  11. A processor performance monitoring device, comprising a processor, a first register, a counter, and a second register, wherein the processor is configured to:
    update a first entry in the first register when a performance event occurs, and, when a stall occurs, start the counter to count a first number of clock cycles for which the stall lasts, wherein the first entry indicates an index path to the instruction type of a first instruction that caused the performance event;
    after the stall ends, stop updating the first entry, and determine the instruction type of the first instruction according to the first entry; and
    add the first number of clock cycles to an accumulated stall cycle count corresponding to the instruction type of the first instruction, and write the accumulated stall cycle count to a second entry in the second register, wherein the second register has multiple entries, the multiple entries correspond to respective instruction types, and the multiple entries store the accumulated stall cycle counts of the stalls caused by instructions of each instruction type.
  12. The device according to claim 11, wherein the first entry comprises a front-end indication field, a type indication field, and a sequence number indication field, the front-end indication field indicates whether the stall occurred at the front end, the type indication field indicates whether the stall occurred before the first instruction was committed, and the sequence number indication field indicates the misprediction sequence number of the first instruction.
  13. The device according to claim 12, wherein, when updating the first entry in the first register, the processor is specifically configured to:
    when a first performance event occurs at the front end of the processor pipeline and the reorder buffer (ROB) of the processor is empty, set the front-end indication field to a first value, set the type indication field to a second value, and store the instruction type information of the first instruction in the sequence number indication field.
  14. The device according to claim 12, wherein, when updating the first entry in the first register, the processor is specifically configured to:
    when a second performance event does not occur at the front end of the processor pipeline and the second performance event is not a misprediction event, set both the front-end indication field and the type indication field to a second value.
  15. The device according to claim 12, wherein, when updating the first entry in the first register, the processor is specifically configured to:
    when a third performance event is a misprediction event, set the front-end indication field to a second value, set the type indication field to a first value, and store the misprediction sequence number of the third performance event in the sequence number indication field.
  16. The device according to claim 12, wherein, when updating the first entry in the first register, the processor is specifically configured to:
    when a fourth performance event is a misprediction event, set the front-end indication field to a second value, set the type indication field to a first value, and store the misprediction sequence number of the fourth performance event in the sequence number indication field;
    when a fifth performance event is a misprediction event, compare the misprediction sequence number of the fifth performance event with the misprediction sequence number of the fourth performance event; and
    store the smaller of the misprediction sequence number of the fifth performance event and the misprediction sequence number of the fourth performance event in the sequence number indication field.
  17. 如权利要求11~16任一项所述的装置,其特征在于,所述处理器在根据所述第一条目确定所述第一指令的指令类型时,具体用于:The device according to any one of claims 11 to 16, wherein when the processor determines an instruction type of the first instruction according to the first entry, the processor is specifically configured to:
    所述处理器在确定所述前端指示字段为第一数值的情况下,获取所述序列号指示字段中保存的所述第一指令的指令类型信息。When the processor determines that the front-end indication field is a first value, the processor obtains instruction type information of the first instruction stored in the sequence number indication field.
  18. 如权利要求11~16任一项所述的装置,其特征在于,所述处理器在根据所述第一条目确定所述第一指令的指令类型时,具体用于:The device according to any one of claims 11 to 16, wherein when the processor determines an instruction type of the first instruction according to the first entry, the processor is specifically configured to:
    所述处理器在确定所述前端指示字段和所述类型指示字段均为第二数值的情况下,获取重排序缓存ROB中的第一个条目中保存的第一指令的指令类型信息。When determining that both the front-end indication field and the type indication field are second values, the processor obtains instruction type information of a first instruction stored in a first entry in a reordering buffer ROB.
  19. The apparatus according to any one of claims 11 to 16, further comprising:
    a third register, wherein the third register stores a third entry most recently deleted from the reorder buffer (ROB), and the third entry contains the instruction type information of the first instruction;
    wherein, when determining the instruction type of the first instruction according to the first entry, the processor is specifically configured to:
    when determining that the front-end indication field is the second value, the type indication field is the first value, and the misprediction sequence number stored in the sequence number indication field is equal to the misprediction sequence number stored in the third register, obtain the instruction type information of the first instruction stored in the third register; or
    when determining that the front-end indication field is the second value, the type indication field is the first value, and the misprediction sequence number stored in the sequence number indication field is greater than the misprediction sequence number stored in the third register, obtain the instruction type information of the first instruction stored in the first entry of the reorder buffer (ROB); or
    when determining that the front-end indication field is the second value, the type indication field is the first value, and the misprediction sequence number stored in the sequence number indication field is less than the misprediction sequence number stored in the third register, obtain the instruction type information of the first instruction stored in the first entry of the reorder buffer (ROB).
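The lookup rules of claims 17 to 19 can be read as a single decision procedure. The sketch below is a hypothetical software model, not the claimed hardware: the value encodings, the class names `FirstEntry` and `RetiredEntry`, the function `resolve_instruction_type`, and the example type labels are all assumptions made for illustration.

```python
from dataclasses import dataclass

FIRST_VALUE = 1   # assumed encoding of the claims' "first value"
SECOND_VALUE = 0  # assumed encoding of the claims' "second value"


@dataclass
class FirstEntry:
    front_end: int   # front-end indication field
    type_flag: int   # type indication field
    payload: object  # sequence number indication field: type info or sequence number


@dataclass
class RetiredEntry:
    """Hypothetical model of the third register (last entry deleted from the ROB)."""
    mispredict_seq: int
    instr_type: str


def resolve_instruction_type(entry: FirstEntry, rob_head_type: str,
                             last_retired: RetiredEntry) -> object:
    """Return the instruction type that the first entry points to."""
    if entry.front_end == FIRST_VALUE:
        # Claim 17: the field itself holds the instruction type information.
        return entry.payload
    if entry.type_flag == SECOND_VALUE:
        # Claim 18: both fields are the second value, so use the ROB's first entry.
        return rob_head_type
    # Claim 19: compare the stored sequence number with the third register.
    if entry.payload == last_retired.mispredict_seq:
        return last_retired.instr_type
    # Both the "greater than" and "less than" branches use the ROB's first entry.
    return rob_head_type


retired = RetiredEntry(mispredict_seq=17, instr_type="branch")
```

For example, `resolve_instruction_type(FirstEntry(SECOND_VALUE, FIRST_VALUE, 17), "load", retired)` selects the type held in the third register, because the two sequence numbers match.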
  20. The apparatus according to any one of claims 11 to 19, wherein the processor is further configured to:
    after writing the accumulated number of stall cycles into the second entry of the second register, write all entries stored in the second register to a memory connected to the processor in any one of the following cases:
    the second register overflows;
    the second register triggers an interrupt;
    the performance monitoring period of the processor ends.
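The three write-back triggers of claim 20 can be illustrated with a small model. Everything here is an assumption made for the sketch: the class `StallCounterRegister`, the `width_bits` parameter, the choice to flush just before an addition would overflow, and the choice to accumulate (rather than overwrite) values in memory are not specified by the claim.

```python
class StallCounterRegister:
    """Hypothetical model of the second register with write-back to memory."""

    def __init__(self, instr_types, width_bits=16):
        self.limit = (1 << width_bits) - 1          # largest value an entry can hold
        self.entries = {t: 0 for t in instr_types}  # one entry per instruction type
        self.memory = {t: 0 for t in instr_types}   # stands in for attached memory

    def flush(self):
        """Move every entry to memory and clear the register."""
        for instr_type, cycles in self.entries.items():
            self.memory[instr_type] += cycles
            self.entries[instr_type] = 0

    def add(self, instr_type, cycles):
        """Accumulate stall cycles; flush first if the entry would overflow.

        Assumes a single stall duration always fits in one entry.
        """
        if self.entries[instr_type] + cycles > self.limit:
            self.flush()                            # trigger 1: overflow
        self.entries[instr_type] += cycles

    def on_interrupt(self):
        self.flush()                                # trigger 2: interrupt

    def end_monitoring(self):
        self.flush()                                # trigger 3: monitoring period ends


reg = StallCounterRegister(["load", "branch"], width_bits=4)  # entries hold 0..15
reg.add("load", 10)
reg.add("load", 10)    # would exceed 15, so the register is flushed first
reg.add("branch", 3)
reg.end_monitoring()
```

After this sequence the model's memory holds 20 cycles for the hypothetical `"load"` type and 3 for `"branch"`, and the register entries are cleared.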
  21. An apparatus for monitoring processor performance, comprising:
    an update module, configured to update a first entry in a first register when a performance event occurs, wherein the first entry indicates an index path to the instruction type of a first instruction that caused the performance event;
    a start module, configured to start a counter when a stall occurs, so as to count a first number of clock cycles for which the stall lasts;
    a stop module, configured to stop updating the first entry after the stall ends;
    a read module, configured to determine the instruction type of the first instruction according to the first entry;
    wherein the update module is further configured to add the first number of clock cycles to the accumulated number of stall cycles corresponding to the instruction type of the first instruction, and to write the accumulated number of stall cycles into a second entry in a second register; the second register has a plurality of entries that respectively correspond to a plurality of instruction types, and the plurality of entries store the accumulated numbers of stall cycles caused by instructions of the respective instruction types.
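The cooperation of the update, start, stop, and read modules can be sketched as a small state machine. This is a simplified, hypothetical model: the class `StallMonitor`, its method names, and the separate `commit` step are assumptions, and the first entry is reduced to holding the instruction type directly rather than an index path.

```python
class StallMonitor:
    """Hypothetical software model of the modules in claim 21."""

    def __init__(self, instr_types):
        self.first_entry = None                         # first register (simplified)
        self.accumulated = {t: 0 for t in instr_types}  # second register
        self.counter = 0                                # stall cycle counter
        self.stalled = False
        self.frozen = False

    def on_performance_event(self, instr_type):
        # Update module: record the instruction type unless updates are stopped.
        if not self.frozen:
            self.first_entry = instr_type

    def on_stall_start(self):
        # Start module: begin counting the clock cycles of this stall.
        self.counter = 0
        self.stalled = True

    def tick(self):
        # One clock cycle elapses; only stalled cycles are counted.
        if self.stalled:
            self.counter += 1

    def on_stall_end(self):
        # Stop module: freeze the first entry so it can be read reliably.
        self.stalled = False
        self.frozen = True

    def commit(self):
        # Read module plus update module: attribute the cycles, then unfreeze.
        instr_type = self.first_entry
        if instr_type is not None:
            self.accumulated[instr_type] += self.counter
        self.first_entry = None
        self.frozen = False
        return instr_type


monitor = StallMonitor(["load", "branch"])
monitor.on_performance_event("load")
monitor.on_stall_start()
for _ in range(3):
    monitor.tick()
monitor.on_stall_end()
monitor.on_performance_event("branch")  # ignored: updates are stopped
blamed = monitor.commit()
```

Freezing the entry between the end of the stall and the read is what keeps a later, unrelated event from being charged with the stall's cycles in this model.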
  22. A computer storage medium, wherein a program is stored on the computer storage medium, and the program, when executed by a processor, implements the method according to any one of claims 1 to 10.
  23. A computer program product, wherein program code contained in the computer program product, when run on a computer, causes the computer to perform the method according to any one of claims 1 to 10.
PCT/CN2018/107402 2018-09-25 2018-09-25 Method and device for monitoring performance of processor WO2020061765A1 (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
PCT/CN2018/107402 WO2020061765A1 (en) 2018-09-25 2018-09-25 Method and device for monitoring performance of processor
CN201880094308.3A CN112219193B (en) 2018-09-25 2018-09-25 Method and device for monitoring processor performance

Publications (1)

Publication Number Publication Date
WO2020061765A1 (en) 2020-04-02

Family

ID=69952375

Country Status (2)

Country Link
CN (1) CN112219193B (en)
WO (1) WO2020061765A1 (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2023241479A1 (en) * 2022-06-13 2023-12-21 上海寒武纪信息科技有限公司 Method and equipment for analyzing timeline performance on assembly line

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117472721B (en) * 2023-12-28 2024-03-12 北京微核芯科技有限公司 Event statistics method and device

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US5941983A (en) * 1997-06-24 1999-08-24 Hewlett-Packard Company Out-of-order execution using encoded dependencies between instructions in queues to determine stall values that control issurance of instructions from the queues
CN101059772A (en) * 2006-04-21 2007-10-24 株式会社东芝 Performance monitoring device, and data collecting method
CN101246447A (en) * 2007-02-15 2008-08-20 国际商业机器公司 Method and apparatus for measuring pipeline stalls in a microprocessor
CN103455132A (en) * 2013-08-20 2013-12-18 西安电子科技大学 Embedded system power consumption estimation method based on hardware performance counter

Also Published As

Publication number Publication date
CN112219193A (en) 2021-01-12
CN112219193B (en) 2021-10-26

Legal Events

Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application (Ref document number: 18935376; Country of ref document: EP; Kind code of ref document: A1)
NENP Non-entry into the national phase (Ref country code: DE)
122 Ep: pct application non-entry in european phase (Ref document number: 18935376; Country of ref document: EP; Kind code of ref document: A1)