WO2020061765A1 - Method and device for monitoring performance of processor

Info

Publication number: WO2020061765A1
Application number: PCT/CN2018/107402
Authority: WO
Inventors: 孙涛; 周昔平
Original assignee: 华为技术有限公司
Priority date: 2018-09-25
Filing date: 2018-09-25
Publication date: 2020-04-02
Also published as: CN112219193A; CN112219193B

Abstract

A method and device for monitoring the performance of a processor, which are used for determining the instruction type of an instruction that causes a pause after the processor pauses and evaluating the performance overhead created by the instruction after a program is finished being executed. The method comprises: a processor updating a first entry in a first register when a performance event occurs and starting up a counter when a pause occurs so as to count a first number of clock cycles for which the pause lasts, the first entry being used for indicating an index path of the instruction type of a first instruction that causes the performance event; the processor stopping updating the first entry after the pause is over and determining the instruction type of the first instruction according to the first entry; the processor superposing the first number of clock cycles into an accumulated number of pause cycles corresponding to the instruction type of the first instruction and writing the accumulated number of pause cycles into a second entry of a second register; the second register is provided with multiple entries, the multiple entries respectively being used for saving the accumulated number of pause cycles of pauses caused by instructions having various instruction types.

Description

Method and device for monitoring processor performance

Technical field

The present application relates to the field of computer technology, and in particular, to a method and a device for monitoring processor performance.

Background

Superscalar execution means that the issue section of the processor pipeline can send multiple instructions to the execute section in each clock cycle, so that multiple instructions execute concurrently within the processor. Out-of-order execution means that the processor may execute instructions in an order other than the one prescribed by the program. In a traditional in-order processor, once an instruction has to wait for the execution result of a preceding instruction, the processor's pipeline stalls. With superscalar out-of-order execution, the processor can instead continue to execute subsequent instructions that do not depend on that result, so the execute section of the pipeline can stay in a working state. Superscalar out-of-order execution can therefore reduce the average execution time of a program and improve the processing efficiency of the processor.

Although superscalar out-of-order execution brings these advantages, it also complicates processor performance analysis. For example, when the pipeline of a superscalar out-of-order execution processor stalls, multiple performance events caused by multiple instructions may occur during the stall, and it is difficult for the processor to determine which performance event caused the pipeline stall. As another example, because these performance events overlap, it is difficult to attribute the stall to a particular instruction, and therefore difficult to evaluate the performance overhead caused by each instruction that the processor executes.

In summary, there is an urgent need for a method and a device for monitoring processor performance, so that a superscalar out-of-order execution processor can determine the instruction that caused a pause after the pause occurs, and can evaluate the performance overhead caused by that instruction after execution of the program ends.

Summary of the Invention

The embodiments of the present application provide a method and a device for monitoring processor performance, which enable a superscalar out-of-order execution processor to determine the instruction that caused a pause after the pause occurs, and to evaluate the performance overhead caused by that instruction after program execution ends.

In a first aspect, an embodiment of the present application provides a method for monitoring processor performance. The method includes the following steps: the processor updates a first entry in a first register when a performance event occurs, and starts a counter when a pause occurs to count a first number of clock cycles for which the pause lasts, where the first entry is used to indicate an index path of the instruction type of a first instruction that caused the performance event; the processor stops updating the first entry after the pause ends, and determines the instruction type of the first instruction according to the first entry; the processor adds the first number of clock cycles to the cumulative number of pause cycles corresponding to the instruction type of the first instruction, and writes the cumulative number of pause cycles to a second entry of a second register. The second register is provided with multiple entries that respectively correspond to multiple instruction types, and each entry stores the cumulative number of pause cycles caused by instructions of the corresponding instruction type.

In the method provided in the first aspect, the first entry in the first register is updated when a performance event occurs, and updating stops after the processor pause ends. The first entry can therefore indicate the index path of the instruction type of the first instruction that caused the processor to pause. In addition, the processor starts a counter when a pause occurs, so that after the pause ends the counter holds the first number of clock cycles for which the pause lasted. The first number of clock cycles indicates the performance overhead caused by the pause. After the pause ends, the first entry can be read to determine the instruction type of the first instruction that caused the pause. At the same time, the second register stores the cumulative number of pause cycles caused by instructions of each instruction type. After the pause ends, the first number of clock cycles can be accumulated into the entry of the second register that corresponds to the instruction type of the first instruction (that is, the second entry), so that the performance of the processor can be analyzed comprehensively, for example, by analyzing the percentage of stalls caused by each type of instruction, which types of instructions are prone to causing stalls, and the number of stall cycles of the processor as a percentage of the total execution cycles of the program. In summary, with the processor performance monitoring solution provided in the embodiments of the present application, the type of the instruction that caused a stall can be determined accurately after the processor stalls, and the performance overhead of each type of instruction can be evaluated after execution of the program ends.
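The accounting flow described above can be sketched in a few lines of Python. This is a minimal illustration, not the patented hardware: the class name `StallAccounting`, its method names, and the instruction types and cycle counts in the usage lines are all assumptions made for the example.

```python
from collections import defaultdict


class StallAccounting:
    """Toy model of the counter and the second register (hypothetical names)."""

    def __init__(self):
        self.counter = 0                    # counts clock cycles of the current pause
        self.in_pause = False
        self.cumulative = defaultdict(int)  # second register: one entry per instruction type

    def on_pause_begin(self):
        # The counter is started when a pause occurs.
        self.in_pause = True
        self.counter = 0

    def on_clock_cycle(self):
        if self.in_pause:
            self.counter += 1

    def on_pause_end(self, instruction_type):
        # instruction_type stands for whatever the first entry resolved to.
        self.in_pause = False
        self.cumulative[instruction_type] += self.counter
        return self.counter


# Illustrative pauses only; the types and durations are made up.
acct = StallAccounting()
for instr_type, cycles in [("load", 12), ("branch", 5), ("load", 8)]:
    acct.on_pause_begin()
    for _ in range(cycles):
        acct.on_clock_cycle()
    acct.on_pause_end(instr_type)

total = sum(acct.cumulative.values())
share = {t: c / total for t, c in acct.cumulative.items()}
```

After the loop, `share` gives the fraction of pause cycles attributed to each instruction type, which is the kind of per-type analysis the paragraph above mentions.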

The first entry includes a front-end indication field, a type indication field, and a sequence number indication field. The front-end indication field is used to indicate whether the pause occurs at the front end, the type indication field is used to indicate whether the pause occurs before the first instruction is submitted, and the sequence number indication field is used to indicate a misprediction sequence number of the first instruction.
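One possible layout of the first entry is sketched below. The text does not give concrete encodings or field widths, so the values `FIRST_VALUE = 1`, `SECOND_VALUE = 0`, the 8-bit sequence number field, and the names `FirstEntry`, `pack`, and `unpack` are assumptions chosen only for illustration.

```python
from dataclasses import dataclass

FIRST_VALUE = 1    # assumption: the text does not define the concrete encodings
SECOND_VALUE = 0
SEQ_BITS = 8       # assumption: width of the sequence number indication field
SEQ_MASK = (1 << SEQ_BITS) - 1


@dataclass
class FirstEntry:
    front_end: int = SECOND_VALUE   # front-end indication field
    type_ind: int = SECOND_VALUE    # type indication field
    seq: int = 0                    # sequence number indication field

    def pack(self) -> int:
        """Pack the three fields into one register word."""
        return ((self.front_end << (SEQ_BITS + 1))
                | (self.type_ind << SEQ_BITS)
                | (self.seq & SEQ_MASK))

    @classmethod
    def unpack(cls, word: int) -> "FirstEntry":
        """Recover the three fields from a register word."""
        return cls((word >> (SEQ_BITS + 1)) & 1,
                   (word >> SEQ_BITS) & 1,
                   word & SEQ_MASK)


entry = FirstEntry(front_end=FIRST_VALUE, type_ind=SECOND_VALUE, seq=0x2A)
word = entry.pack()
```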

In the embodiment of the present application, there are four implementation manners for the processor to update the first entry.

First

The processor updating the first entry in the first register when a performance event occurs may be specifically implemented as follows: when a first performance event occurs at the front end of the processor pipeline and the reorder buffer (ROB) of the processor is empty, the processor sets the front-end indication field to a first value, sets the type indication field to a second value, and stores the instruction type information of the first instruction in the sequence number indication field.

In the first implementation, the processor sets the front-end indication field to the first value to indicate that the pause occurs at the front end of the processor pipeline, and sets the type indication field to the second value to indicate that the instruction that caused the pause is the last instruction submitted before the pause occurred. In addition, because the ROB is empty when the first type of pause occurs, it is difficult to identify the instruction "responsible for the pause" from the relevant entry in the ROB. Therefore, so that the index path of the instruction causing the pause can be determined later, the sequence number indication field is reused in the first implementation: the instruction type information of the first instruction is stored in the sequence number indication field, and the responsible instruction can then be identified directly from that field.

Second

The processor updating the first entry in the first register when a performance event occurs may be specifically implemented as follows: when a second performance event does not occur at the front end of the processor pipeline and the second performance event is not a misprediction event, the processor sets both the front-end indication field and the type indication field to the second value.

In the second implementation, the processor sets the front-end indication field to the second value to indicate that the stall occurs at the back end of the processor pipeline, and sets the type indication field to the second value to indicate that the instruction that caused the stall is the last instruction submitted before the stall occurred.

Third

The processor updating the first entry in the first register when a performance event occurs may be specifically implemented as follows: when a third performance event is a misprediction event, the processor sets the front-end indication field to the second value, sets the type indication field to the first value, and stores the misprediction sequence number of the third performance event in the sequence number indication field.

In the third implementation, the processor sets the front-end indication field to the second value to indicate that the stall occurs at the back end of the processor pipeline, and sets the type indication field to the first value to indicate that the instruction that caused the stall is the first instruction submitted after the stall ends.

Fourth

The processor updating the first entry in the first register when a performance event occurs may be specifically implemented as follows: when a fourth performance event is a misprediction event, the processor sets the front-end indication field to the second value, sets the type indication field to the first value, and stores the misprediction sequence number of the fourth performance event in the sequence number indication field; when a fifth performance event is also a misprediction event, the processor compares the misprediction sequence number of the fifth performance event with the misprediction sequence number of the fourth performance event, and stores the smaller of the two misprediction sequence numbers in the sequence number indication field.

In the fourth implementation, the processor sets the front-end indication field to the second value to indicate that the stall occurs at the back end of the processor pipeline, and sets the type indication field to the first value to indicate that the instruction that caused the stall is the first instruction submitted after the stall ends. In addition, in the fourth implementation, if two misprediction events occur while the first entry is being updated (that is, before the pause ends), the first entry stores the information of the performance event with the smaller misprediction sequence number, so that the index information of the instruction that caused the stall can be determined accurately.
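The four update rules above can be summarized in one sketch. The encodings (`FIRST = 1`, `SECOND = 0`), the function and parameter names, and in particular the order in which the conditions are checked are assumptions; the text does not specify a priority between the cases, nor what happens to a front-end event when the ROB is not empty (the sketch leaves the entry unchanged in that case).

```python
from dataclasses import dataclass

FIRST, SECOND = 1, 0   # assumed encodings of the "first value" and "second value"


@dataclass
class FirstEntry:
    front_end: int = SECOND
    type_ind: int = SECOND
    seq: int = 0


def update_first_entry(entry, *, at_front_end, rob_empty, is_mispredict,
                       instr_type_code=0, mispredict_seq=0):
    """Apply one performance event to the first entry (hypothetical helper)."""
    if at_front_end and rob_empty:
        # First implementation: reuse the sequence number field for type info.
        entry.front_end, entry.type_ind = FIRST, SECOND
        entry.seq = instr_type_code
    elif is_mispredict:
        # Third and fourth implementations: keep the smaller misprediction
        # sequence number if the entry already records a misprediction.
        already_mispredict = entry.front_end == SECOND and entry.type_ind == FIRST
        if already_mispredict:
            entry.seq = min(entry.seq, mispredict_seq)
        else:
            entry.seq = mispredict_seq
        entry.front_end, entry.type_ind = SECOND, FIRST
    elif not at_front_end:
        # Second implementation: back-end event that is not a misprediction.
        entry.front_end, entry.type_ind = SECOND, SECOND
    return entry


e = FirstEntry()
for seq in (7, 3, 9):   # three misprediction events during one pause
    update_first_entry(e, at_front_end=False, rob_empty=False,
                       is_mispredict=True, mispredict_seq=seq)
```

After the three misprediction events in the usage lines, the sequence number field holds 3, the smallest misprediction sequence number seen.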

In the embodiment of the present application, there are multiple ways for the processor to determine the instruction type of the first instruction, and three of them are listed below.

First

The processor determining the instruction type of the first instruction according to the first entry may be specifically implemented as follows: when the processor determines that the front-end indication field is the first value, the processor obtains the instruction type information of the first instruction stored in the sequence number indication field.

In the first implementation, if the front-end indication field in the first entry is the first value, it can be determined that the pause occurred at the front end of the processor pipeline. The instruction that caused the pause is the first instruction submitted after the pause ends. In this case, the instruction would be expected to be the first entry in the ROB (that is, the entry in the ROB that corresponds to the instruction about to be submitted). However, when the first type of stall occurs, the ROB is empty, so the index information of the instruction that caused the processor stall cannot be obtained from the ROB. As mentioned earlier, when the pause occurs at the front end of the processor pipeline, the instruction type information of the first instruction has already been saved in the sequence number indication field when the first entry was updated. Therefore, when the front-end indication field is the first value, the instruction type information of the first instruction can be obtained directly from the sequence number indication field of the first entry.

Second

The processor determining the instruction type of the first instruction according to the first entry may be specifically implemented as follows: when the processor determines that the front-end indication field and the type indication field are both the second value, the processor obtains the instruction type information of the first instruction stored in the first entry in the reorder buffer (ROB).

In the second implementation, if the front-end indication field and the type indication field in the first entry are both the second value, it can be determined that the pause is a fourth type pause. For the fourth type of pause, because the instruction that caused the pause is the first instruction submitted after the pause ends, the instruction would be expected to be the first entry in the ROB (that is, the entry in the ROB that corresponds to the instruction about to be submitted). Therefore, in the second implementation, the instruction type information of the first instruction stored in the first entry in the ROB can be obtained.

Third

In a third implementation manner, the processor determining the instruction type of the first instruction according to the first entry may be specifically implemented in the following three ways:

Way 3a

When the processor determines that the front-end indication field is the second value, the type indication field is the first value, and the misprediction sequence number stored in the sequence number indication field is the same as the misprediction sequence number stored in a third register, the processor obtains the instruction type information of the first instruction stored in the third register. The third register stores a third entry that was most recently deleted from the reorder buffer (ROB), and the third entry contains the instruction type information of the first instruction.

Way 3b

When the processor determines that the front-end indication field is the second value, the type indication field is the first value, and the misprediction sequence number stored in the sequence number indication field is greater than the misprediction sequence number stored in the third register, the processor obtains the instruction type information of the first instruction stored in the first entry in the reorder buffer (ROB).

Way 3c

When the processor determines that the front-end indication field is the second value, the type indication field is the first value, and the misprediction sequence number stored in the sequence number indication field is less than the misprediction sequence number stored in the third register, the processor obtains the instruction type information of the first instruction stored in the first entry in the reorder buffer (ROB).

In the third implementation, if the front-end indication field in the first entry is the second value and the type indication field is the first value, it can be determined that the instruction that caused the pause is the last instruction submitted before the pause occurred. In theory, the instruction recorded in the third register (that is, the last instruction submitted before the pause occurred) is then the instruction that caused the processor to pause. In an actual implementation, however, because the processor executes out of order, the first entry may be updated out of order. For example, the instruction that caused the pause may not yet have reached the commit section, or the performance event recorded in the first entry may have been overwritten by, or may overlap with, other performance events (in which case the performance event recorded in the first entry can be considered not to have caused the pause). To correctly identify the instruction "responsible for the pause" in these complicated situations, the third implementation compares the misprediction sequence number stored in the sequence number indication field with the misprediction sequence number stored in the third register, and processes the result of the comparison accordingly to determine the instruction type of the instruction that caused the processor to stall.
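The three ways of determining the instruction type, including ways 3a to 3c, can be read as a single decision procedure. The sketch below follows the rules exactly as stated above. The function name, the parameter names, and the encodings of the first and second values are assumptions, and the instruction types are represented as plain strings only for readability.

```python
FIRST_VALUE, SECOND_VALUE = 1, 0   # assumed encodings; the text leaves them open


def resolve_instruction_type(front_end, type_ind, seq_field,
                             rob_head_type, last_retired_type, last_retired_seq):
    """Return the instruction type blamed for the pause (hypothetical helper).

    rob_head_type     -- type stored in the first entry of the ROB
    last_retired_type -- type stored in the third register
    last_retired_seq  -- misprediction sequence number stored in the third register
    """
    if front_end == FIRST_VALUE:
        # First way: the sequence number field was reused to hold the type info.
        return seq_field
    if type_ind == SECOND_VALUE:
        # Second way: both fields are the second value, so read the ROB head.
        return rob_head_type
    # Third way: front end is the second value, type is the first value.
    if seq_field == last_retired_seq:
        return last_retired_type   # way 3a: read the third register
    return rob_head_type           # ways 3b and 3c: read the ROB head


blamed = resolve_instruction_type(SECOND_VALUE, FIRST_VALUE, 4,
                                  rob_head_type="load",
                                  last_retired_type="branch",
                                  last_retired_seq=4)
```

In the usage line the two sequence numbers match, so the type held in the third register is returned.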

In addition, in the embodiment of the present application, after the processor writes the cumulative number of pause cycles to the second entry of the second register, the processor may also update all entries stored in the second register into a memory connected to the processor in any of the following cases: the second register overflows; the second register triggers an interrupt; or the performance monitoring period of the processor ends. After all the entries stored in the second register are updated into the memory, the second register can be cleared. The performance of the processor can then be evaluated based on the data in the memory, for example, by analyzing the percentage of stalls caused by each type of instruction, which types of instructions are prone to causing stalls, and the number of processor stall cycles as a percentage of the total execution cycles of the program.
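A minimal sketch of this spill-to-memory behavior follows. It assumes that "updating into the memory" means adding each register entry to a running total in memory, which the text does not state explicitly. The class name, the entry capacity of 100 cycles, and the sample cycle counts are likewise made up for illustration.

```python
class CumulativeStallRegister:
    """Toy second register with a memory-backed running total (hypothetical)."""

    def __init__(self, num_types, max_value):
        self.entries = [0] * num_types   # one entry per instruction type
        self.max_value = max_value       # largest value an entry can hold
        self.memory = [0] * num_types    # stands in for memory connected to the processor

    def flush(self):
        # Update all entries into memory, then clear the register.
        for i, value in enumerate(self.entries):
            self.memory[i] += value
        self.entries = [0] * len(self.entries)

    def add(self, type_index, cycles):
        total = self.entries[type_index] + cycles
        if total > self.max_value:
            # Overflow case: spill the register and account the new cycles in memory.
            self.flush()
            self.memory[type_index] += cycles
        else:
            self.entries[type_index] = total


reg = CumulativeStallRegister(num_types=3, max_value=100)
reg.add(0, 60)
reg.add(1, 30)
reg.add(0, 50)   # 60 + 50 exceeds 100, so the register is flushed
reg.add(2, 10)
reg.flush()      # end of the performance monitoring period

total = sum(reg.memory)
percent = [100.0 * c / total for c in reg.memory]
```

The `percent` list is the per-type share of pause cycles computed from the data in memory, one of the analyses listed above.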

In a second aspect, an embodiment of the present application further provides a processor performance monitoring device. The device includes a processor, a first register, a counter, and a second register.

Specifically, the processor is configured to: update a first entry in the first register when a performance event occurs; when a pause occurs, start the counter to count a first number of clock cycles for which the pause lasts, where the first entry is used to indicate an index path of the instruction type of a first instruction that caused the performance event; stop updating the first entry after the pause ends, and determine the instruction type of the first instruction according to the first entry; add the first number of clock cycles to the cumulative number of pause cycles corresponding to the instruction type of the first instruction, and write the cumulative number of pause cycles into a second entry in the second register. The second register is provided with multiple entries that respectively correspond to multiple instruction types, and each entry stores the cumulative number of pause cycles caused by instructions of the corresponding instruction type.

The first register, the counter, and the second register may be integrated on the processor or may be set separately. When the first register, the counter, and the second register are integrated on the processor, the processor performance monitoring device can also be regarded as a processor.

Specifically, the first entry may include a front-end indication field, a type indication field, and a sequence number indication field. The front-end indication field is used to indicate whether a pause occurs in the front-end, and the type indication field is used to indicate whether the pause occurs before the first instruction is submitted. The sequence number indication field is used to indicate a mispredicted sequence number of the first instruction.

In the device, there are multiple ways for the processor to update the first entry, four of which are listed below.

The first way

When the processor updates the first entry in the first register, the processor is specifically configured to: when a first performance event occurs at the front end of the processor pipeline and the reorder buffer (ROB) of the processor is empty, set the front-end indication field to the first value, set the type indication field to the second value, and store the instruction type information of the first instruction in the sequence number indication field.

The second way

When the processor updates the first entry in the first register, the processor is specifically configured to: when a second performance event does not occur at the front end of the processor pipeline and the second performance event is not a misprediction event, set both the front-end indication field and the type indication field to the second value.

The third way

When the processor updates the first entry in the first register, the processor is specifically configured to: when a third performance event is a misprediction event, set the front-end indication field to the second value, set the type indication field to the first value, and store the misprediction sequence number of the third performance event in the sequence number indication field.

The fourth way

When the processor updates the first entry in the first register, the processor is specifically configured to: when a fourth performance event is a misprediction event, set the front-end indication field to the second value, set the type indication field to the first value, and store the misprediction sequence number of the fourth performance event in the sequence number indication field; when a fifth performance event is also a misprediction event, compare the misprediction sequence number of the fifth performance event with the misprediction sequence number of the fourth performance event, and store the smaller of the two misprediction sequence numbers in the sequence number indication field.

In addition, in the embodiment of the present application, the processor may determine the instruction type of the first instruction according to the first entry in the following three specific ways.

The first way

When the processor determines the instruction type of the first instruction according to the first entry, the processor is specifically configured to: when determining that the front-end indication field is the first value, obtain the instruction type information of the first instruction stored in the sequence number indication field.

The second way

When the processor determines the instruction type of the first instruction according to the first entry, the processor is specifically configured to: when determining that the front-end indication field and the type indication field are both the second value, obtain the instruction type information of the first instruction stored in the first entry in the reorder buffer (ROB).

The third way

In the third way, the device further includes a third register. The third register stores a third entry that was most recently deleted from the reorder buffer (ROB), and the third entry contains the instruction type information of the first instruction. When the processor determines the instruction type of the first instruction according to the first entry, the following three ways can be specifically used:

Way 3a

When the processor determines that the front-end indication field is the second value, the type indication field is the first value, and the misprediction sequence number stored in the sequence number indication field is the same as the misprediction sequence number stored in the third register, the processor obtains the instruction type information of the first instruction stored in the third register.

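Way 3b

When the processor determines that the front-end indication field is the second value, the type indication field is the first value, and the misprediction sequence number stored in the sequence number indication field is greater than the misprediction sequence number stored in the third register, the processor obtains the instruction type information of the first instruction stored in the first entry in the reorder buffer (ROB).

Way 3c

When the processor determines that the front-end indication field is the second value, the type indication field is the first value, and the misprediction sequence number stored in the sequence number indication field is less than the misprediction sequence number stored in the third register, the processor obtains the instruction type information of the first instruction stored in the first entry in the reorder buffer (ROB).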

In addition, in the embodiment of the present application, the processor is further configured to: after writing the cumulative number of pause cycles to the second entry of the second register, update all the entries stored in the second register into a memory connected to the processor in any of the following cases: the second register overflows; the second register triggers an interrupt; or the performance monitoring period of the processor ends.

According to a third aspect, an embodiment of the present application further provides a processor performance monitoring device. The device includes an update module, a start module, a stop module, and a read module.

The update module is configured to update a first entry in the first register when a performance event occurs, where the first entry is used to indicate an index path of the instruction type of the first instruction that caused the performance event.

The start module is configured to start a counter when a pause occurs, to count the first number of clock cycles for which the pause lasts.

The stop module is used to stop updating the first entry after the pause is terminated.

The read module is configured to determine the instruction type of the first instruction according to the first entry.

The update module is further configured to add the first number of clock cycles to the cumulative number of pause cycles corresponding to the instruction type of the first instruction, and to write the cumulative number of pause cycles to a second entry in the second register. The second register is provided with multiple entries that respectively correspond to multiple instruction types, and each entry stores the cumulative number of pause cycles caused by instructions of the corresponding instruction type.

In addition, the processor performance monitoring device provided in the third aspect may also be used to implement the other possible implementations of the method for monitoring processor performance provided in the first aspect. For details, refer to the related descriptions of the method provided in the first aspect; they are not repeated here.

According to a fourth aspect, an embodiment of the present application further provides a computer-readable storage medium for storing a program used to execute the functions of the first aspect or any possible design of the first aspect. When the program is executed by a processor, it implements the method described in the first aspect or any possible design of the first aspect.

In a fifth aspect, an embodiment of the present application provides a computer program product containing program code. When the program code contained in the computer program product runs on a computer, the computer executes the method of the first aspect or any possible design of the first aspect.

In addition, for the technical effects brought by any one of the possible design methods in the second to fifth aspects, refer to the technical effects brought by the different design methods in the first aspect, which will not be described again here.

BRIEF DESCRIPTION OF THE DRAWINGS

FIG. 1 is a schematic diagram of a processor instruction pipeline according to an embodiment of the present application;

FIG. 2 is a schematic diagram of a processor pause and the corresponding instruction "responsible for the pause" according to an embodiment of the present application;

FIG. 3 is a schematic flowchart of a method for monitoring processor performance according to an embodiment of the present application;

FIG. 4 is a schematic flowchart of another method for monitoring processor performance according to an embodiment of the present application;

FIG. 5 is a schematic structural diagram of a processor performance monitoring device according to an embodiment of the present application;

FIG. 6 is a schematic structural diagram of another processor performance monitoring device according to an embodiment of the present application.

Detailed description

As described in the background art, using a superscalar out-of-order execution processor can reduce the average execution time of a program and improve the processing efficiency of the processor. However, because a superscalar out-of-order execution processor can issue multiple instructions to the execute section in one clock cycle, and can execute instructions in an order other than the one prescribed by the program, performance analysis of such a processor becomes complicated.

For example, when the pipeline of a superscalar out-of-order execution processor stalls, multiple performance events caused by multiple instructions may occur during the stall, and it is difficult for the processor to determine which performance event caused the pipeline stall. As another example, because these performance events overlap, it is difficult to attribute the stall to a particular instruction, and therefore difficult to evaluate the performance overhead caused by the instructions that the processor executes.

In order to evaluate the performance of superscalar out-of-order execution processors, the industry has proposed multiple performance models.

For example, Top-Down is a performance model based on the utilization of pipeline dispatch slots. During program execution, each instruction microcode (also called a microinstruction or micro-operation, abbreviated uop) uses one and only one dispatch slot. The model monitors whether each dispatch slot is paused and tracks what happens to the microcode in each slot (for example, whether it is submitted or abandoned, that is, committed or squashed), divides all dispatch slots into four categories, and performs detailed analysis based on a performance monitoring unit (PMU).

As another example, Statistical Profiling is a method for analyzing program performance. The method randomly selects one microcode out of every N microcodes (uops), and tracks and records all performance events that occur on the instruction pipeline between the dispatch and completion of the selected microcode, together with the number of clock cycles each performance event lasts. These samples are then analyzed offline to infer the performance bottlenecks of the program.
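The sampling step of such a scheme can be illustrated as follows. This is only a sketch of "one random microcode out of every N". The function name, the fixed seed, and the choice to ignore a trailing partial window are assumptions, and the event tracking that would follow each sample is not modeled.

```python
import random


def sample_uops(uops, n, seed=0):
    """Pick one uop at a random offset inside each full window of n uops."""
    rng = random.Random(seed)          # fixed seed so the sketch is reproducible
    samples = []
    for start in range(0, len(uops) - n + 1, n):
        offset = rng.randrange(n)      # random position within the window
        samples.append(uops[start + offset])
    return samples


uops = list(range(100))                # stand-in for a stream of 100 uops
picked = sample_uops(uops, n=10)
```

Because only one microcode in each window is observed, events on the other N - 1 microcodes are never recorded, which is the coverage limitation discussed in the surrounding text.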

The models proposed in the prior art each have their own focus. For example, the Top-Down model focuses on the overall picture of program performance. It is difficult to use it to locate the specific point (such as the instruction or operation that caused the pause) when the processor stalls, and many important sub-items in the Top-Down model are based on estimates, so accuracy is difficult to guarantee. The Statistical Profiling model focuses more on analyzing which instruction caused each performance event. Because the model randomly samples the execution of individual instructions, insufficient coverage is likely in some cases.

The embodiments of the present application provide a method and a device for monitoring processor performance, so that a superscalar out-of-order execution processor can determine the instruction that caused a stall after the stall occurs, and can evaluate the performance overhead caused by that instruction after program execution ends.

The basic concepts involved in the embodiments of the present application are explained below. It should be noted that these explanations are to make this application easier to understand, and should not be considered as limiting the scope of protection required by this application.

First, the instruction pipeline

Superscalar out-of-order execution processors usually use an eight-segment instruction pipeline, as shown in Figure 1. The eight-segment instruction pipeline includes the following eight pipeline sections: fetch, decode, register rename, dispatch, issue, execute, writeback, and commit. An instruction must pass through all eight sections to complete.

For a superscalar out-of-order execution processor, in the issue section the processor can issue multiple instructions to the execute section in one clock cycle, so that multiple instructions execute concurrently inside the processor. In the execute section, instructions need not be executed in the order prescribed by the program and may instead be executed out of order, so that while some instructions wait for their source operands, other instructions that do not depend on those source operands can be executed first, which improves the throughput of the processor.

In particular, superscalar out-of-order execution processors typically support out-of-order issue, out-of-order execution, and in-order commit. Out-of-order issue and out-of-order execution were introduced in the previous paragraph. In-order commit means that, in a superscalar out-of-order execution processor, although instructions can be executed out of order, they must be committed in the order prescribed by the program. For example, for instruction A, instruction B, and instruction C, the execution order specified by the program is instruction A → instruction B → instruction C, but in a superscalar out-of-order execution processor, instruction B has to wait for its source operand before it can execute, so the actual execution order is instruction A → instruction C → instruction B. Although these three instructions are executed out of order, they must still be committed in the order instruction A → instruction B → instruction C specified by the program.
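The in-order commit rule in the A/B/C example can be illustrated with a minimal Python sketch. The function name `commit_in_program_order` and its arguments are hypothetical and chosen only for illustration; the sketch models commit order only, not timing or any particular hardware.

```python
def commit_in_program_order(program_order, completion_order):
    """Return the commit order, given the order in which instructions finish executing.

    An instruction is committed only after it has finished executing and
    every earlier instruction (in program order) has been committed.
    """
    finished = set()
    committed = []
    next_to_commit = 0  # index of the oldest instruction not yet committed
    for instr in completion_order:
        finished.add(instr)
        # Commit every instruction at the head that has already finished.
        while (next_to_commit < len(program_order)
               and program_order[next_to_commit] in finished):
            committed.append(program_order[next_to_commit])
            next_to_commit += 1
    return committed


# Program order is A, B, C; B waits for a source operand, so C finishes first.
print(commit_in_program_order(["A", "B", "C"], ["A", "C", "B"]))  # ['A', 'B', 'C']
```

In this sketch, instruction C finishes before instruction B but is held back until B has been committed.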

In addition, it should be noted that the above description of the eight-segment instruction pipeline is only an example. The embodiments of the present application are applicable to superscalar out-of-order execution processors, but the instruction pipeline model adopted by the superscalar out-of-order execution processor is not specifically limited, as long as the instruction pipeline allows the processor to execute instructions in a superscalar, out-of-order manner.

Second, performance events

Performance events are events that cause processor performance to fall below the design peak. Performance events can be triggered by instructions or operations executed by the processor. For example, performing long-delay division instructions or cache misses during cache access operations can cause performance events.

The performance event penalty (also referred to as the performance event overhead) refers to the number of additional clock cycles required to complete an instruction or an operation compared with the case "when a performance event is not raised".

Third, stall

When the instruction pipeline works normally, multiple instructions can be executed out of order and committed in order. When an instruction is committed in the commit section, Commit = 1 can be defined (at least one instruction is committed). When no instruction is committed in the commit section, Commit = 0 can be defined (no instruction is committed). When Commit is 0 for several consecutive clock cycles, the processor pipeline can be considered to stall, which can also be referred to as a processor stall. The number of clock cycles that counts as "several" may differ between processors; for example, it may be one. This number is not specifically limited in the embodiment of the present application.

It should be noted that in the above example, Commit takes one of two values, 0 or 1: when Commit is 0, no instruction is committed; when Commit is 1, an instruction is committed; and the situation in which Commit remains 0 for several clock cycles is defined as a processor stall. In actual implementation, Commit can be defined to take multiple values, for example 0, 1, 2, 3, and so on, where Commit = A indicates that A instructions are committed. Further, a processor stall can be defined as the condition that Commit is always less than or equal to X within several clock cycles (Commit = X means that X instructions are committed), or as the condition that Commit is always less than Y within several clock cycles (Commit = Y means that Y instructions are committed), which is not specifically limited in this embodiment of the present application.
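The stall definition above can be sketched in Python as follows. The function `find_stalls` and the parameters `min_cycles` (how many consecutive cycles count as "several") and `max_commit` (the threshold X) are hypothetical names introduced only for illustration, and the per-cycle Commit trace is made-up sample data.

```python
def find_stalls(commit_trace, min_cycles=1, max_commit=0):
    """Return (start_cycle, length) for each stall found in a per-cycle Commit trace.

    A stall is a run of at least min_cycles consecutive cycles in which
    Commit <= max_commit. A run still open at the end of the trace is
    reported as well.
    """
    stalls = []
    run_start = None
    for cycle, commit in enumerate(commit_trace):
        if commit <= max_commit:
            if run_start is None:
                run_start = cycle          # a candidate stall begins
        elif run_start is not None:
            length = cycle - run_start     # the run ended on the previous cycle
            if length >= min_cycles:
                stalls.append((run_start, length))
            run_start = None
    if run_start is not None:
        length = len(commit_trace) - run_start
        if length >= min_cycles:
            stalls.append((run_start, length))
    return stalls


trace = [1, 0, 0, 0, 1, 1, 0, 1, 0, 0]    # illustrative Commit value per clock cycle
print(find_stalls(trace, min_cycles=2))   # [(1, 3), (8, 2)]
```

With `min_cycles=1`, the single idle cycle at index 6 would also be reported as a stall, which corresponds to the case in which one clock cycle already counts as "several".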

Generally, processor stalls are caused by instructions or operations executed by the processor. An instruction is a command that specifies which operation the processor performs and where the operands are located, such as an instruction to add two operands or an instruction to access a cache. An operation is a specific action performed when an instruction is executed, such as an addition or a cache access.

Several examples of instructions or operations that cause the processor to stall are given below. For example, a load instruction causes the processor to stall when it misses in the last-level cache; for example, an operation that accesses the cache causes the processor to stall; for example, when a branch prediction error occurs for a branch instruction (with branch prediction, the processor reads and decodes the instructions of one of the branches in advance to reduce the waiting time of the decoder), a large number of instructions are flushed, which causes the processor to stall. In the embodiment of the present application, the instruction or operation that causes the processor to stall is referred to as a "culprit."

It should be noted that, in the embodiment of the present application, a processor stall is caused by instructions or operations executed by the processor. For example, when a load instruction misses in the last-level cache, the stall can be regarded as caused by the load instruction, or as caused by the cache access operation performed when the load instruction is executed. In the embodiment of the present application, the first statement is adopted, that is, the processor stall is considered to be caused by instructions.

By way of example, FIG. 2 illustrates the processor stalls that occur while a processor executes several programs, together with the corresponding culprits. When the ordinate is 0, Commit = 0, that is, no instruction is committed in the commit section; when the ordinate is non-zero, Commit = 1, that is, at least one instruction is committed in the commit section. As can be seen from Figure 2, when program A is executed, a cache miss during execution of the load¹ instruction delays that instruction and causes a processor stall. Although the load² and divide³ instructions executed after load¹ also cause performance events, those performance events are covered by the stall caused by the load¹ instruction. Therefore, the culprit of the first stall (Stall 1) in Figure 2 is considered to be the load¹ instruction. When program B is executed, a misprediction of the branch⁴ instruction causes a large number of instructions to be flushed, which stalls the processor. Although the load⁵ instruction executed after branch⁴ also causes a performance event, that performance event is covered by the stall caused by the branch⁴ instruction, so the culprit of the second stall (Stall 2) in Figure 2 is considered to be the branch⁴ instruction. When program C is executed, an instruction fetch operation misses in the processor's I-Cache, which stalls the processor, so the culprit of the third stall (Stall 3) in Figure 2 is the instruction fetch operation. When program D is executed, the isb⁶ instruction flushes the instruction pipeline, which stalls the processor, so the culprit of the fourth stall (Stall 4) is the isb⁶ instruction.

It is worth noting that, in the examples in Figure 2, the relationship between the time at which the culprit is committed and the time at which the stall occurs differs from case to case.

For example, in the example of program A, the stall caused by the load¹ instruction (Stall 1) occurs before the load¹ instruction is committed, that is, the load¹ instruction is the first instruction committed after Stall 1 ends. This is because Stall 1 arises from the load¹ instruction taking too long to execute: other instructions cannot be committed before the load¹ instruction finishes executing, which causes the processor to stall. Stall 1 ends once the load¹ instruction finishes executing, at which point the load¹ instruction is committed.

For example, in the example of program B, instructions are flushed because of the misprediction of the branch⁴ instruction, and the branch⁴ instruction is the last instruction committed before Stall 2 starts. This is because the stall caused by the misprediction of the branch⁴ instruction becomes visible only after the instruction is committed, so Stall 2 occurs after the branch⁴ instruction is committed.

It is not difficult to see from the above examples that the instruction that caused the processor to halt may be the first instruction submitted after the pause is terminated, or the last instruction submitted before the pause begins.

It is also worth noting that, in the examples in Figure 2, Stall 3 occurs at the front end of the processor pipeline, while Stall 1, Stall 2, and Stall 4 all occur at the back end of the processor pipeline. The front end of the processor pipeline refers to the part of the pipeline that performs instruction fetch, decode, and dispatch, and the back end of the processor pipeline refers to the part of the pipeline that performs instruction issue, execution, and commit.

In addition, in actual applications, when the pause starts, the commit width of the commit pipeline changes from non-zero to 0; when the pause ends, the commit width of the commit pipeline changes from 0 to non-zero. Therefore, the processor can judge the start and end of the pause based on the value of the commit width.

It should be noted that, in the embodiments of the present application, the performance event may cause the processor to halt in some cases, and may not cause the processor to halt in some cases. A performance event is considered to cause the processor to stall only if a performance event causes the commit section of the instruction pipeline to have no instruction submissions within several clock cycles. When a performance event causes the processor to stall, the instruction that caused the performance event can be considered a "culprit" that caused the processor to stall. The performance overhead of the "culprit" can be defined as the number of clock cycles that the pause lasts, and the performance overhead of the "culprit" can also be understood as the performance overhead of the pause.

Fourth, reorder buffer

As mentioned earlier, in a superscalar out-of-order execution processor, the instructions need to be submitted in the order specified by the program regardless of the order of execution of the instructions. Then, for the instructions that have been executed in the execute pipeline but cannot be submitted through the commit pipeline in the order prescribed by the program, the execution results can be stored in the re-order buffer (ROB). In other words, one entry in the ROB corresponds to one instruction microcode. Each entry in the ROB includes at least two fields: the instruction type and the execution result.

In practical applications, the ROB can be regarded as a circular queue with a head pointer and a tail pointer. All instructions that enter the instruction pipeline are stored in the ROB in the order prescribed by the program, and each entry in the ROB corresponds to one instruction microcode. The head pointer points to the related information (instruction type, execution result, etc.) of the next instruction to be committed, and the tail pointer points to the related information (instruction type, execution result, etc.) of the instruction microcode most recently stored in the ROB.
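A minimal Python sketch of the ROB as a circular queue is given below. The class `ReorderBuffer` and its methods are hypothetical names used only for illustration. As a simplifying assumption that differs slightly from the description above, the tail here indexes the next free slot rather than the newest entry, and each entry holds only an instruction type, an execution result, and a completion flag.

```python
class ReorderBuffer:
    """Circular queue of in-flight instructions, committed from the head in program order."""

    def __init__(self, capacity):
        self.entries = [None] * capacity
        self.head = 0    # entry of the next instruction to be committed
        self.tail = 0    # next free slot (assumption; see the lead-in)
        self.count = 0

    def allocate(self, instr_type):
        """Store a new instruction at the tail, in program order, and return its index."""
        if self.count == len(self.entries):
            raise RuntimeError("ROB is full")
        index = self.tail
        self.entries[index] = {"type": instr_type, "result": None, "done": False}
        self.tail = (self.tail + 1) % len(self.entries)
        self.count += 1
        return index

    def complete(self, index, result):
        """Record the execution result of an instruction that finished executing."""
        self.entries[index]["result"] = result
        self.entries[index]["done"] = True

    def commit(self):
        """Commit and delete the head entry if it has finished; otherwise return None."""
        if self.count == 0 or not self.entries[self.head]["done"]:
            return None
        entry = self.entries[self.head]
        self.entries[self.head] = None
        self.head = (self.head + 1) % len(self.entries)
        self.count -= 1
        return entry


rob = ReorderBuffer(4)
load_idx = rob.allocate("load")
add_idx = rob.allocate("add")
rob.complete(add_idx, 3)        # the younger add finishes first
print(rob.commit())             # None: the load at the head has not finished
rob.complete(load_idx, 7)
print(rob.commit()["type"])     # load
print(rob.commit()["type"])     # add
```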

Fifth, the first register, the second register, and the third register

In the embodiment of the present application, the first register stores an index path used to indicate the instruction that causes a performance event, that is, the index path of the culprit. The processor updates the entry in the first register when a performance event occurs. A performance event that does not cause a processor stall is overwritten when the entry in the first register is next updated; only the entry for the performance event that caused the processor stall is eventually kept in the first register. That is, an entry in the first register corresponds to one processor stall.

Specifically, a read pointer and a write pointer are maintained for the first register. The read pointer indicates the entry to be read when the processor stall is terminated; the processor can read that entry to determine the index path of the culprit. The write pointer is used to write and update the entries of the first register.

For example, a performance event that does not cause a processor stall can be overwritten by updating the entry indicated by the write pointer. For example, after the stall is terminated, the index path of the culprit can be obtained by reading the entry indicated by the read pointer, and the instruction or operation that caused the processor to stall can then be determined.

Specifically, in the initialization stage, both the read pointer and the write pointer may point to the first entry in the first register, whose content at that time is empty or a default value. Whenever a performance event occurs, the entry indicated by the write pointer (that is, the first entry) can be updated. After the processor stall is terminated, updating of the first entry can be stopped by moving the write pointer to the next entry; then, by reading the entry indicated by the read pointer, the instruction that caused the processor to stall can be determined. After the read is completed, the read pointer can be moved to the next entry. In this way, the entries of the first register (that is, the index path of the instruction causing the stall) are written and updated through the write pointer, and the culprit is determined through the read pointer.
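The basic read-pointer and write-pointer behaviour described above can be sketched in Python as follows. This is a simplified model under stated assumptions, not the register design itself: the class name `CulpritTracker`, its methods, and the use of plain strings as index paths are hypothetical, and the scenario-specific pointer movements mentioned elsewhere in the description are not modelled.

```python
class CulpritTracker:
    """Simplified model of the first register with a read pointer and a write pointer."""

    def __init__(self, num_entries):
        self.entries = [None] * num_entries   # None stands for an empty/default entry
        self.read_ptr = 0
        self.write_ptr = 0

    def on_performance_event(self, index_path):
        # Each new performance event overwrites the entry at the write pointer,
        # so events that did not lead to a stall are discarded.
        self.entries[self.write_ptr] = index_path

    def on_stall_end(self):
        # Moving the write pointer stops further updates to the current entry.
        self.write_ptr = (self.write_ptr + 1) % len(self.entries)
        # The entry at the read pointer identifies the culprit of this stall.
        culprit = self.entries[self.read_ptr]
        self.read_ptr = (self.read_ptr + 1) % len(self.entries)
        return culprit


cts = CulpritTracker(num_entries=3)
cts.on_performance_event("divide")   # did not cause a stall; overwritten below
cts.on_performance_event("load")     # the event that is current when the stall occurs
print(cts.on_stall_end())            # load
cts.on_performance_event("branch")
print(cts.on_stall_end())            # branch
```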

Of course, in specific implementation, the read pointer and the write pointer of the first register can be moved in different ways in different scenarios. This part is described in detail in the following embodiments and is not repeated here.

In actual implementation, the first register may have a different name. For example, in the embodiment of the present application, the first register may be referred to as a culprit tracking register set (CTS). The specific name of the first register is not limited in the embodiment of the present application.

In specific implementation, when two consecutive instructions each cause the processor to stall, if the first register has only one entry, the processor may overwrite that entry before the first instruction has been committed. Because the entry related to the first instruction has been overwritten, it is then difficult to determine the culprit of the first stall. To avoid this problem, the first register in the embodiment of the present application may include multiple entries.

Exemplarily, the number of entries contained in the first register may be the number of pipeline stages in the writeback section plus 1. The writeback section controls the processor to write the execution result of an instruction back to memory or to a register. When the number of entries in the first register equals the number of pipeline stages in the writeback section plus 1, the first register has enough entries to record the information of multiple instructions executed in parallel even if those instructions cause stalls.

In addition, when the first register contains multiple entries, the first register can also be regarded as a register group, and each register in the register group contains one entry.

In the embodiment of the present application, the second register stores the cumulative number of stall cycles corresponding to the instruction type of each instruction executed by the processor. The instruction type may be an arithmetic instruction, such as a fadd instruction or a divide instruction; it may be a memory access instruction, such as a load instruction or a store instruction; or it may be a branch instruction.

Specifically, the cumulative number of stall cycles corresponding to each instruction type in the second register can be updated in an accumulative manner: when the processor stalls, the counter starts counting the number of clock cycles that the stall lasts; when the stall is terminated, the counter stops counting and the count result is accumulated into an entry in the second register. For example, entry a of the second register records that the cumulative number of processor stall cycles caused by the load instruction is M. When the processor stalls, the counter starts counting; when the stall is terminated, the counter stops counting, and the count result at that time is N. Then, according to the solution provided by the embodiment of the present application, it is determined that the stall was caused by the load instruction, and N is accumulated into entry a of the second register, so that entry a of the second register now records that the cumulative number of processor stall cycles caused by the load instruction is M + N.

In actual implementation, the second register may be implemented in static random-access memory (SRAM) hardware. In addition, the second register and the counter may have different names. For example, in the embodiment of the present application, the second register may be referred to as a stall performance overhead statistics table (SPCT), and the counter may be referred to as a stall cycle counter (SCC); the specific names of the second register and the counter are not limited in the embodiments of the present application.
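The M + N accumulation above can be sketched in Python as follows. The class `StallAccounting` and its members are hypothetical names used only for illustration; the sketch assumes that a single cycle with Commit = 0 already counts as a stall, and that the instruction type of the culprit is supplied by the caller once the stall ends.

```python
from collections import defaultdict


class StallAccounting:
    """Simplified model of the counter and the second register."""

    def __init__(self):
        self.table = defaultdict(int)   # instruction type -> cumulative stall cycles
        self.counter = 0                # clock cycles of the current stall
        self.stalled = False

    def tick(self, commit_width):
        """Advance one clock cycle; return True when a stall has just ended."""
        if commit_width == 0:
            self.stalled = True
            self.counter += 1
            return False
        ended = self.stalled
        self.stalled = False
        return ended

    def attribute(self, instr_type):
        """Accumulate the counted stall cycles into the entry of the culprit's type."""
        self.table[instr_type] += self.counter
        self.counter = 0


acct = StallAccounting()
acct.table["load"] = 5                  # M = 5 cycles accumulated earlier (illustrative)
for commit_width in [1, 0, 0, 0, 1]:    # a stall lasting N = 3 cycles
    if acct.tick(commit_width):
        acct.attribute("load")          # the culprit was determined to be a load
print(acct.table["load"])               # 8, that is, M + N
```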

In addition, in the embodiment of the present application, the entry in the second register may be updated into the memory, and then the second register is cleared, so that the performance of the processor may be evaluated according to the data in the memory after the execution of the program is terminated. Specifically, the processor may update the entry stored in the second register to the memory connected to the processor in any of the following cases: an overflow occurs when an entry in the second register is accumulated; the second register triggers an interrupt; The processor's performance monitoring period ends.
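A minimal Python sketch of updating the second register into memory is given below. The 16-bit entry width, the names `COUNTER_MAX`, `flush`, and `accumulate`, and the use of dictionaries for both the second register and the memory are assumptions made only for illustration. The sketch covers the overflow case and an explicit flush at the end of the monitoring period; the interrupt case is not modelled, and a single stall longer than `COUNTER_MAX` is not handled.

```python
COUNTER_BITS = 16                         # hypothetical width of one entry
COUNTER_MAX = (1 << COUNTER_BITS) - 1


def flush(table, memory):
    """Add every entry of the second register to memory, then clear the register."""
    for stall_id, cycles in table.items():
        memory[stall_id] = memory.get(stall_id, 0) + cycles
    table.clear()


def accumulate(table, memory, stall_id, cycles):
    """Accumulate stall cycles; if the entry would overflow, flush to memory first."""
    if table.get(stall_id, 0) + cycles > COUNTER_MAX:
        flush(table, memory)
    table[stall_id] = table.get(stall_id, 0) + cycles


table, memory = {}, {}
accumulate(table, memory, "load", 60000)
accumulate(table, memory, "load", 10000)  # would exceed 65535, so a flush happens first
print(table, memory)                      # {'load': 10000} {'load': 60000}
flush(table, memory)                      # end of the performance monitoring period
print(table, memory)                      # {} {'load': 70000}
```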

In the embodiment of the present application, the third register stores the entry most recently deleted from the ROB. As mentioned before, when an instruction has finished executing in the execute section but cannot yet be committed through the commit section, the relevant information of the instruction (instruction type, execution result, etc.) is saved in the ROB; after the instruction is committed, the entry corresponding to the instruction is deleted from the ROB. In the embodiment of the present application, when a processor stall begins after the culprit instruction has been committed (for example, when the instructions on the pipeline are flushed after a branch instruction is committed, so that the stall starts after the branch instruction is committed), the relevant information of the instruction has already been deleted from the ROB by the time the culprit is analyzed after the stall is terminated. Therefore, a register must be maintained to save the entry most recently deleted from the ROB so that the culprit can be determined in this situation. This register is the third register.

In actual implementation, the third register may have a different name. For example, the third register may be referred to as a last committed instruction (LCI) register in the embodiment of the present application. The specific name of the third register is not limited in the embodiment of the present application.

In addition, it should be noted that in the embodiment of the present application, only one entry is stored in the third register. This entry can be obtained by copying the entry in the ROB, so its content is consistent with the content of the ROB entry and is not described in further detail here.

Sixth, front-end indication field, type indication field, and sequence number indication field

In the embodiment of the present application, the first entry is updated when a performance event occurs. Specifically, each entry in the first register includes at least three fields: a front-end indication field, a type indication field, and a sequence number indication field. Updating the first entry means updating these three fields in the first entry. The front-end indication field indicates whether the stall occurred at the front end of the processor pipeline, the type indication field indicates whether the stall occurred before the culprit was committed, and the sequence number indication field indicates the sequence number of the mispredicted culprit.

As mentioned when the processor stall was introduced earlier, in the examples of FIG. 2 some stalls occur at the front end of the processor pipeline and some occur at the back end. In the embodiment of the present application, whether a stall occurs at the front end of the processor pipeline may be indicated by the front-end indication field in the first register. For example, the front-end indication field occupies 1 bit and is denoted FE: FE = 1 means that the stall occurs at the front end of the processor pipeline; FE = 0 means that the stall occurs at the back end of the processor pipeline.

As also mentioned above, the examples in FIG. 2 show that the instruction that caused the processor to stall may be the first instruction committed after the stall is terminated, or the last instruction committed before the stall begins. In the embodiment of the present application, the type indication field in the first register may be used to indicate which of these two cases applies. For example, the type indication field occupies 1 bit and is denoted Ctype: Ctype = 1 means that the instruction causing the stall is the last instruction committed before the stall started; Ctype = 0 means that the instruction causing the stall is the first instruction committed after the stall is terminated.

In addition, the sequence number indication field can be understood as follows: when an entry in the first register records information about a mispredicted branch instruction, the sequence number indication field of the entry records the sequence number of the branch instruction; when an entry in the first register records information about an instruction other than a branch instruction, the sequence number indication field of the entry is meaningless (it can hold a default value) or is used to record other information (such as information for determining the culprit at the front end of the pipeline).

Seventh, types of processor stalls

According to the foregoing description of the processor stall, in the embodiment of the present application, the types of processor stalls are classified into four types:

The first category:

In the embodiment of the present application, the first type of stall may be referred to as an instruction supply stall, which is a pipeline stall caused by the front end of the processor pipeline. The first type of stall is caused by a miss in a front-end cache of the processor pipeline (such as the I-Cache or I-TLB). The miss prevents the instruction stream or micro-operation stream from being provided to the back end of the processor pipeline, resulting in Commit = 0. The salient feature of the first type of stall is that the ROB is also empty when the stall occurs.

For the first type of pause, we can think of culprit as an operation, such as an I-Cache access operation; at the same time, we can also think of culprit as the instruction that caused the I-Cache to be missing when fetching instructions. In the embodiment of the present application, the second viewpoint is selected, that is, the first type of pause is caused by instructions.

It is worth noting that because the first type of stall occurs at the front end of the processor pipeline and the ROB is empty at that time, the first type of stall cannot overlap with other types of stalls. In the design, a 1-bit field, namely the front-end indication field in the first register, is used to indicate this type of stall. For example, when FE = 1, it can be determined that the stall that occurred is of the first type.

For the first type of stall, the front-end indication field FE = 1 and the type indication field Ctype = 0.

Also, in some cases a cache miss does not necessarily cause a stall. For example, when a fetch operation misses in the L1 I-Cache but hits in the L2 Cache, the front end of the processor pipeline cannot provide instructions to the back end for several clock cycles, but the ROB is likely not empty and instructions are still committed in every clock cycle. In this case, the embodiment of the present application does not consider a stall to have occurred. One of the main goals of modern processor architecture design is to hide performance event overhead as much as possible, and in the above situation the overhead of the front-end performance event is hidden.

The second category:

In the embodiment of the present application, the second type of stall may be referred to as a misprediction stall. A stall caused by instructions being flushed because of a misprediction is classified as the second type of stall. Common mispredictions are branch mispredictions and load-store ordering mispredictions.

The salient feature of the second type of pause is that the instruction that caused the pause (culprit) was submitted before the pause occurred, and is the last instruction submitted before the pause occurred.

For the second type of pause, we can consider the front end indication field FE = 0 and the type indication field Ctype = 1.

Similarly, if a misprediction does not cause a stall (that is, there are no consecutive clock cycles with Commit = 0), the embodiment of the present application considers the performance overhead of the misprediction to be well hidden by out-of-order execution, and the mispredicted instruction is not identified as a culprit.

The third category:

In the embodiment of the present application, the third type of stall may be referred to as a system instruction stall. The third type of stall is caused by the pipeline being flushed by a particular instruction, a common example being the isb instruction. The characteristics of the third type of stall are similar to those of the second type: the culprit (such as the isb instruction) has been committed before the stall, and is the last instruction committed before the stall.

For the third type of pause, we can consider the front-end indication field FE = 0 and the type indication field Ctype = 1.

The third type of stall has two unique features: 1) When the specific instruction appears, it always causes a stall. For example, when the isb instruction is committed, the entire instruction pipeline must be flushed, and the resulting performance overhead cannot be hidden. 2) The performance overhead of the third type of stall is almost constant; for example, the number of clock cycles of the stall caused by the isb instruction is almost always equal to the stage depth of the pipeline.

It should be noted that the third type of stall does not involve misprediction. However, the front-end indication field and the type indication field have the same values for the second and third types of stall. Therefore, to distinguish the two, when the third type of stall occurs, a sequence number can also be written into the sequence number indication field in the first register, so that the second type of stall and the third type of stall can be distinguished based on that sequence number.

The fourth category:

In the embodiment of the present application, the fourth type of stall may be referred to as a long-latency stall. The fourth type of stall is caused by the long-latency execution of an instruction, that is, an instruction that takes too long to execute causes the stall. Common culprits of the fourth type of stall include a load instruction that misses in the last-level cache, a floating-point division instruction, and a load instruction that accesses shared data in the cache. Executing these instructions often requires tens to hundreds of clock cycles, which blocks the execution and commit of subsequent instructions and causes the processor to stall.

The distinctive feature of the fourth type of stall is that the stall occurs before the culprit is committed, and the culprit is the first instruction committed after the stall is terminated.

For the fourth type of pause, we can consider that the front-end indication field FE = 0 and the type indication field Ctype = 0.

It is not difficult to see from the introduction of the types of pauses that the four types of pauses each have their own characteristics. In the embodiment of the present application, when a pause occurs, the type of the pause can be determined according to the indication of each field in the first register, and then it can be determined which instruction caused the pause (that is, the culprit).
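Exemplarily, the mapping from the field values to the type of the pause may be sketched as follows. This is a minimal illustration only: the function name `classify_pause` and the returned labels are assumptions introduced here, and the values 1 and 0 are the exemplary first value and second value used in this application.

```python
def classify_pause(fe: int, ctype: int) -> str:
    """Map the front-end (FE) and type (Ctype) indication fields to a pause type.

    Hypothetical helper; assumes first value = 1 and second value = 0.
    """
    if fe == 1:
        # Pause occurred at the front end of the pipeline (ROB empty).
        return "first"
    if ctype == 1:
        # Pipeline cleared; the second and third types share FE and Ctype
        # and are told apart using the sequence number indication field.
        return "second_or_third"
    # FE = 0 and Ctype = 0: long delay pause.
    return "fourth"
```

For instance, `classify_pause(0, 0)` yields the fourth type, matching FE = 0 and Ctype = 0 as stated above.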

For each of the four types of pauses, there can be multiple instructions that cause a certain type of pause. The following table 1 is used as an example to introduce each type of pause and the type of instruction that caused the type of pause. In the embodiment of the present application, the instruction type may be represented by Stall ID.

Table 1

As mentioned earlier, the second register holds the number of accumulated pause cycles corresponding to different instruction types. The Stall ID in Table 1 can be used to indicate the instruction type. Specifically, the cumulative number of pause cycles corresponding to each instruction type in the second register can be maintained in an accumulative manner: when the processor stalls, the counter starts counting the number of clock cycles for which the stall continues; when the stall is terminated, the counter stops counting and the count result is accumulated into the corresponding entry in the second register.

In specific implementation, after the instruction that caused the stall is determined by using the solution provided in the embodiment of the present application, the Stall ID of the instruction can be determined, and the count result of the counter is accumulated into the corresponding entry in the second register. After the entries in the second register are updated to memory, the performance of the processor can be analyzed by analyzing the memory data, so as to obtain the performance overhead caused by each type of instruction during the performance monitoring period.

For example, suppose the solution provided in the embodiment of the present application determines that the instruction type (Stall ID) of the instruction causing the stall is Oct10 and the number of clock cycles counted by the counter is P. Then P is added to the count value of the entry corresponding to Oct10 in the second register. Assume that the count value of that entry before the pause occurs is Q; the count value of the entry after the superposition is then P + Q. After the monitoring time period ends, the entries in the second register are updated to the memory. If the count value of the entry corresponding to Oct10 in the memory before the update is X, the count value after the update is X + P + Q. By analyzing the ratio of the count value of the entry corresponding to Oct10 to the total number of clock cycles in the entire monitoring period, and by comparing that count value with the count values of other entries, the performance of the processor can be analyzed.
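The accumulation in this example may be sketched in software as follows. This is a minimal model under stated assumptions, not the hardware implementation: the dictionaries `second_register` and `memory`, the helper names, and the numeric values chosen for P, Q, and X are all illustrative.

```python
from collections import defaultdict

# Hypothetical software model: Stall ID -> accumulated pause cycles.
second_register = defaultdict(int)
memory = defaultdict(int)

def accumulate(stall_id: str, cycles: int) -> None:
    """Add the counter result of one pause to the entry for this Stall ID."""
    second_register[stall_id] += cycles

def flush_to_memory() -> None:
    """Add every register entry to memory, then clear the register."""
    for stall_id, cycles in second_register.items():
        memory[stall_id] += cycles
    second_register.clear()

# Arbitrary illustrative values for Q, P, and X.
Q, P, X = 40, 12, 100
second_register["Oct10"] = Q   # value before the pause
memory["Oct10"] = X            # value in memory before the update

accumulate("Oct10", P)
after_pause = second_register["Oct10"]   # P + Q
flush_to_memory()
after_flush = memory["Oct10"]            # X + P + Q
```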

It is worth noting that, for the same instruction, if the instruction causes different types of pauses during program execution, the corresponding instruction types are also different. For example, in Table 1, for a load instruction, if it causes a second type of stall, the instruction type (Stall ID) is Oct11; if it causes a fourth type of stall, the instruction type (Stall ID) is Oct31.

In addition, in the embodiment of the present application, types of pauses other than the four types above may be collectively classified as a fifth type of pause. Among the above four types of pauses, the obvious characteristic of the first type of pause is that the pause occurs at the front end of the processor pipeline and the ROB is empty, and the obvious characteristic of the second and third types of pauses is that the instruction pipeline is cleared. For the fourth type of pause, it is difficult to make a clear judgment based on intuitive characteristics, so a pause can be determined to be a fourth type of pause when it is not the first type of pause, the second type of pause, or the third type of pause. For the fifth type of pause, the processing and the setting of the fields in the entry of the first register are similar to those of the fourth type of pause. Therefore, for the fifth type of pause, reference may be made to the fourth type of pause, and the fifth type of pause is not described again in the embodiments of the present application.

The embodiments of the present application will be further described in detail below with reference to the accompanying drawings.

It should be noted that, in the embodiments of the present application, multiple means two or more. In addition, it should be understood that in the description of this application, the words "first" and "second" are used only for the purpose of distinguishing descriptions, and cannot be understood as indicating or implying relative importance, nor as indicating or implying order.

Referring to FIG. 3, a method for monitoring processor performance according to an embodiment of the present application includes the following steps.

S301: The processor updates the first entry in the first register when a performance event occurs and, when a pause occurs, starts a counter to count the first number of clock cycles for which the pause lasts.

As mentioned above, the first register maintains a reading pointer and a writing pointer. The reading pointer indicates the entry that needs to be read when the processor stall is terminated, and the processor can use this entry to determine the index path of the culprit; the writing pointer indicates the entry of the first register to be written and updated. That is, in S301, the first entry updated by the processor may be the entry pointed to by the writing pointer.

In the embodiment of the present application, the first register may include one or more entries. As mentioned earlier, the processor updates the entry in the first register when a performance event occurs. A performance event that does not cause a processor stall is overwritten when the entry in the first register is updated again; only the performance event that caused the processor stall is eventually saved in the first register. An entry in the first register corresponds to one stall of the processor, that is, each entry in the first register is used to indicate the index path of the instruction type of the instruction that caused the stall.

That is, while the first entry is being updated, the first entry is used to indicate the index path of the instruction type of the instruction that caused the performance event. After the first entry stops being updated, the performance event recorded in the first entry is the one that caused the processor to stall, and the instruction that caused that performance event is the instruction that caused the processor to stall; that is, the first entry is used to indicate the index path of the instruction type of the instruction that caused the processor to stall.

Specifically, the first entry may include a front-end indication field, a type indication field, and a sequence number indication field. The front-end indication field is used to indicate whether a pause occurs in the front-end, and the type indication field is used to indicate whether the pause occurs before the first instruction is submitted. The sequence number indication field is used to indicate a mispredicted sequence number of the first instruction. For the specific meanings of the three fields contained in the first entry, refer to the foregoing description, and will not be repeated here.

During the execution of instructions, if it is judged that the processor performance is lower than the design peak value, it can be judged that a performance event has occurred. Therefore, in S301, the operation in which the processor updates the first entry when a performance event occurs can be performed in the execute section of the instruction pipeline. That is, when the execute section judges that the processor performance is lower than the design peak, it can be determined that a performance event has occurred, at which point the first entry in the first register can be updated.

In addition, in S301, the counter is used to record the number of clock cycles in which the pause is continued. This counter starts counting when a pause occurs, and stops counting when the pause ends. After stopping counting, the value recorded by this counter is the number of clock cycles that the pause lasts, that is, the performance overhead of the pause.

As mentioned above, in the embodiment of the present application, the types of stalls that occur in the processor are classified into four types. Then, according to the type of the pause, in S301, the processor may update the first entry in multiple ways. Here are four specific ways to update the first entry.

Method one

In the first method, the processor updates the first entry in the first register when a performance event occurs, which may be specifically implemented as follows: when the first performance event occurs at the front end of the processor pipeline and the ROB of the processor is empty, the processor sets the front-end indication field to the first value, sets the type indication field to the second value, and stores the instruction type information of the first instruction in the sequence number indication field.

As mentioned in the introduction, processor stalls can occur on the front end or on the back end. When it is determined that the pause occurs at the front end of the processor pipeline, the front end indication field of the first entry may be set to the first value to indicate that the pause occurs at the front end of the processor pipeline.

In the first method, we can think of the processor stall as the first type of stall—instruction supply stall, that is, the pipeline stall caused by the processor's frontend. A significant feature of the first type of pause is that the ROB is empty when a pause occurs. Therefore, in the first method, when the first performance event occurs at the front end of the processor pipeline and the ROB of the processor is empty, the front end indication field in the first entry may be set to the first value.

Exemplarily, the first value may be 1 and the second value may be 0. In specific implementation, the default values of the front-end indication field and the type indication field can both be set to 0. In this case, if it is judged that the first performance event occurs at the front end of the processor pipeline and the ROB of the processor is empty, only the front-end indication field in the first entry needs to be updated, and the type indication field need not be updated.

In addition, because the ROB is empty when the first type of pause occurs, it is difficult to determine the culprit of the pause through the relevant entry in the ROB. Therefore, in order to determine the index path of the instruction that caused the pause, the sequence number indication field is reused in the first method. That is, the instruction type information of the first instruction is stored in the sequence number indication field, so that the culprit can be identified directly through the sequence number indication field.

Method two

In the second method, the processor updates the first entry in the first register when a performance event occurs, which can be specifically implemented in the following manner: when the second performance event does not occur at the front end of the processor pipeline and the second performance event is not a misprediction event, the processor sets both the front-end indication field and the type indication field to the second value.

As mentioned earlier, it is difficult to judge from intuitive features that a pause is the fourth type of pause, so when the type of the pause is not the first type of pause, the second type of pause, or the third type of pause, the pause can be determined to be the fourth type of pause, that is, a long delay pause. In the second method, the second performance event does not occur at the front end of the processor pipeline, and the second performance event is not a misprediction event, indicating that the type of the pause is the fourth type of pause.

For the fourth type of pause, both the front-end indication field and the type indication field of the first entry may be set to the second value to indicate that the current pause is a fourth type of pause.

Exemplarily, the second value may be 0. In specific implementation, the default values of the front-end indication field and the type indication field can both be set to 0. In this case, if it is judged that the second performance event does not occur at the front end of the processor pipeline and the second performance event is not a misprediction event, the first entry need not be updated.

Method three

In the third method, the processor updates the first entry in the first register when a performance event occurs, which can be specifically implemented in the following manner: when the third performance event is a misprediction event, the processor sets the front-end indication field to the second value, sets the type indication field to the first value, and stores the mispredicted sequence number of the third performance event in the sequence number indication field.

As described above, the front-end indication field and the type indication field of the second-type pause and the third-type pause are set in the same manner. In the third method, when it is determined that the third performance event is a misprediction event, it may be determined that the pause is a second type pause or a third type pause.

For the second type of pause or the third type of pause, the front end indication field of the first entry may be set to a second value, and the type indication field may be set to a first value.

Exemplarily, the first value may be 1 and the second value may be 0. In specific implementation, the default values of the front-end indication field and the type indication field can both be set to 0. In this case, if the third performance event is judged to be a misprediction event, only the type indication field in the first entry needs to be updated, and the front-end indication field need not be updated.
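Exemplarily, the field settings of the first three methods may be sketched as follows. This is a software illustration under stated assumptions: the class `Entry`, the function names, and the default of 0 for the sequence number indication field are hypothetical, and the first value and the second value are taken to be 1 and 0.

```python
from dataclasses import dataclass

FIRST_VALUE, SECOND_VALUE = 1, 0  # exemplary values

@dataclass
class Entry:
    """Hypothetical model of one entry of the first register."""
    fe: int = SECOND_VALUE      # front-end indication field (defaults to 0)
    ctype: int = SECOND_VALUE   # type indication field (defaults to 0)
    seq: int = 0                # sequence number indication field (default assumed)

def update_method_one(entry: Entry, instr_type: int) -> None:
    """Front-end event with empty ROB: reuse seq to hold the instruction type."""
    entry.fe = FIRST_VALUE
    entry.ctype = SECOND_VALUE
    entry.seq = instr_type

def update_method_two(entry: Entry) -> None:
    """Event not at the front end and not a misprediction: both fields are 0."""
    entry.fe = SECOND_VALUE
    entry.ctype = SECOND_VALUE

def update_method_three(entry: Entry, mispredict_seq: int) -> None:
    """Misprediction event: record the mispredicted sequence number."""
    entry.fe = SECOND_VALUE
    entry.ctype = FIRST_VALUE
    entry.seq = mispredict_seq
```

Because both flag fields default to 0, a fresh entry is already in the state written by the second method, which is consistent with the observation that the first entry need not be updated in that case.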

Method four

In the fourth method, the processor updates the first entry in the first register when a performance event occurs, which may be specifically implemented as follows: when the fourth performance event is a misprediction event, the processor sets the front-end indication field to the second value, sets the type indication field to the first value, and stores the mispredicted sequence number of the fourth performance event in the sequence number indication field; when the fifth performance event is a misprediction event, the processor compares the mispredicted sequence number of the fifth performance event with the mispredicted sequence number of the fourth performance event, and stores the smaller of the two mispredicted sequence numbers in the sequence number indication field.

Exemplarily, the first value may be 1 and the second value may be 0.

As mentioned earlier, the processor can predict the branch flow of the program, and then read and decode the instructions of one of the branches in advance, thereby reducing the waiting time of the decoder. If a branch prediction error occurs, the instructions in the instruction pipeline will be emptied, causing the processor to stall. When the program contains multi-level branch prediction instructions and multiple mispredictions occur among them, the instruction causing the processor to stall should be regarded as the first mispredicted instruction, that is, the instruction with the smallest mispredicted sequence number. Therefore, in the fourth method, during the process of updating the first entry (that is, before the pause is terminated), if two misprediction events occur, the first entry stores the information about the performance event with the smaller mispredicted sequence number, so that the index information of the instruction that caused the pause can be accurately determined.
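The rule of keeping the smaller mispredicted sequence number may be sketched as follows. This is a minimal illustration: the class `Entry` and the function `record_misprediction` are hypothetical, the use of `None` to mean that no misprediction has been recorded yet is an assumption, and wrap-around of sequence numbers is not modeled.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Entry:
    """Hypothetical model of one entry of the first register."""
    fe: int = 0
    ctype: int = 0
    seq: Optional[int] = None  # None: no misprediction recorded yet (assumption)

def record_misprediction(entry: Entry, mispredict_seq: int) -> None:
    """Update the entry for a misprediction event before the pause terminates."""
    if entry.ctype == 1 and entry.seq is not None:
        # A misprediction is already recorded: keep the earlier (smaller) one.
        entry.seq = min(entry.seq, mispredict_seq)
    else:
        # First misprediction recorded in this entry.
        entry.fe, entry.ctype, entry.seq = 0, 1, mispredict_seq

entry = Entry()
for seq in (7, 3, 9):          # arbitrary illustrative sequence numbers
    record_misprediction(entry, seq)
```

After the three events above, the entry retains sequence number 3, the smallest of the mispredicted sequence numbers.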

The above is only a simple example of a specific manner of updating the first entry in the embodiment of the present application. In actual implementation, the update process of the first entry may be performed according to actual conditions, and these specific operations may also be performed in the manner indicated by the above four methods. For example, after the second performance event occurs, the first entry is updated in the second way; then, a fourth performance event occurs without a processor stall, and the first entry may continue to be updated at this time. At this time, the first entry records the related information of the fourth performance event. In this case, we can consider that the performance cost of the second performance event is hidden by the fourth performance event, and the second performance event does not cause the processor to stall.

It can be seen from the above introduction that, in the embodiments of the present application, when a performance event occurs, the entry pointed to by the writing pointer (writing) needs to be updated; in the process of updating the entry of the first register, the sequence number indication field of the entry pointed to by the writing pointer (writing) may also need to be read to determine how to update the sequence number indication field. After the pause is terminated, the entry pointed to by the reading pointer (reading) needs to be read to determine the instruction type of the first instruction. That is, the first register may be configured with two read channels and one write channel, and can be regarded as a two-read, one-write register group. One read channel is used in the execute section to read the entry in the first register and determine how to update it; the other read channel is used in the commit section to read the entry in the first register and determine the instruction type of the instruction that caused the processor to stall; the write channel is used in the execute section to update the entry in the first register.

S302: The processor stops updating the first entry after the pause is terminated, and determines an instruction type of the first instruction according to the first entry.

As mentioned earlier, in S301, the processor can update the first entry in various ways when a performance event occurs. When a performance event does not cause the processor to stall, the content of the first entry written for that performance event will be overwritten, and only the information about the instruction that caused the processor to stall will be saved in the first entry. Then, in S302, after the processor stall is terminated, the first entry records related information of the instruction that caused the processor stall, that is, the first instruction corresponding to the first entry may be considered as the instruction that caused the processor stall.

Specifically, the processor stops updating the first entry when the pause is terminated, which can be achieved by moving the writing pointer in the first register to the next entry after the first entry. That is, the writing pointer in the first register can move to the next entry after each pause; the processor then continues to monitor the performance events that occur, and updates that next entry when a performance event occurs.

The writing pointer (writing) is moved to the next entry after each pause in order to prevent the following situation: when several closely spaced instructions (such as two consecutive instructions) each cause the processor to pause, if the writing pointer (writing) is not moved, the information about the first instruction may be overwritten by the information about the second instruction before it is read in the commit section.

In addition, in specific implementation, in S302, the processor may read the first entry pointed to by the reading pointer after the pause is terminated, thereby determining the instruction type of the first instruction according to the first entry. When the reading of the first entry is completed, the reading pointer can be pointed to the next entry after the first entry, so that when a stall occurs again later, the instruction type of the instruction that caused the stall is determined according to the entry pointed to by the reading pointer.

In addition, because the storage capacity of the first register is limited, after the first entry has been read and the instruction type of the first instruction has been determined according to the first entry, the first entry can be deleted or released, so that entries that have already been read do not occupy the storage capacity of the first register.
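The movement of the writing pointer and the reading pointer described above may be sketched as follows. This is a simplified software model under stated assumptions: the class `FirstRegister`, its method names, the default size of four entries, and the circular wrap-around of both pointers are illustrative and are not specified in the embodiments.

```python
class FirstRegister:
    """Hypothetical model of the first register with two pointers."""

    def __init__(self, size: int = 4):
        self.entries = [None] * size
        self.writing = 0   # entry currently being updated
        self.reading = 0   # entry to read when a pause terminates

    def update(self, info) -> None:
        """A performance event overwrites the entry under the writing pointer."""
        self.entries[self.writing] = info

    def end_pause(self):
        """Pause terminated: freeze the current entry, then read and release it."""
        # Moving the writing pointer stops further updates of this entry.
        self.writing = (self.writing + 1) % len(self.entries)
        info = self.entries[self.reading]
        self.entries[self.reading] = None   # release the entry after reading
        self.reading = (self.reading + 1) % len(self.entries)
        return info

reg = FirstRegister()
reg.update("event a")          # does not cause a pause, later overwritten
reg.update("event b")          # the event that causes the pause
first_culprit = reg.end_pause()
reg.update("event c")
second_culprit = reg.end_pause()
```

In this run, "event a" is overwritten by "event b" before the first pause ends, so only the event that caused each pause is returned.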

As mentioned above, for different types of pauses, the processor may update the first entry in S301 in multiple ways. Similarly, in S302, according to the information recorded in the three fields of the first entry, the processor may determine the type of the pause, and then determine the instruction type of the first instruction that caused the pause.

In specific implementation, the processor may determine the instruction type of the first instruction according to the first entry in any of the following three implementations.

Implementation method one

In the first implementation manner, if the front-end indication field in the first entry is the first value, it can be determined that the pause occurs at the front end of the processor pipeline, that is, the pause is the first type of pause. For the first type of pause, the instruction that caused the pause is the first instruction submitted after the pause is terminated. In this case, the instruction would be expected to correspond to the first entry in the ROB (that is, the entry corresponding to the instruction to be submitted in the ROB). However, when the first type of pause occurs, the ROB is empty, so the index information of the instruction that caused the processor stall cannot be obtained from the ROB. To avoid a situation in which the index information of the culprit cannot be obtained, the design stores the instruction type information of the first instruction in the sequence number indication field when updating the first entry for the first type of pause (as described in the first method of S301). Then, when the front-end indication field is the first value, the instruction type information of the first instruction stored in the sequence number indication field of the first entry can be obtained directly.

Implementation method two

The processor determines the instruction type of the first instruction according to the first entry, which can be specifically implemented in the following manner: when the processor determines that the front-end indication field and the type indication field are both the second value, the processor obtains the instruction type information of the first instruction stored in the first entry in the ROB.

The first entry in the ROB can be understood as the entry pointed to by the head pointer (heading) of the ROB.

Implementation method three

The processor determines the instruction type of the first instruction according to the first entry, which may be specifically implemented as follows: when the processor determines that the front-end indication field is the second value, the type indication field is the first value, and the mispredicted sequence number stored in the sequence number indication field is the same as the mispredicted sequence number stored in the third register, the processor obtains the instruction type information of the first instruction stored in the third register. The third register stores a third entry, which is the entry most recently deleted from the ROB, and the third entry contains the instruction type information of the first instruction.

In the third implementation, if the front-end indication field in the first entry is the second value and the type indication field is the first value, it can be determined that the pause is the second type of pause or the third type of pause, and the instruction that caused the pause is the last instruction submitted before the pause occurred. In theory, at this time, the instruction recorded in the third register (that is, the last instruction submitted before the pause occurred) is the instruction that caused the processor to pause.

However, in actual implementation, considering the out-of-order execution of the processor, the first entry may be updated out of order. For example, the instruction that caused the pause may not yet have reached the commit segment, or the performance event recorded in the first entry may be overwritten by, or overlap with, other performance events (in which case it can be considered that the performance event recorded in the first entry did not cause the pause).

In the third implementation manner, the mispredicted sequence number stored in the sequence number indication field is compared with the mispredicted sequence number stored in the third register in order to correctly identify the culprit of the pause when the above complex situations occur. When the mispredicted sequence number stored in the sequence number indication field of the first entry is the same as the mispredicted sequence number stored in the third register, it is determined that the instruction recorded in the first entry is the instruction recorded in the third register (that is, the last instruction submitted before the pause occurred). This situation is similar to the situation shown in program B/D in FIG. 2. At this time, the instruction type information of the first instruction stored in the third register can be obtained.

In addition, considering the out-of-order execution of the processor, in order to properly handle the case where the first entry is updated out-of-order due to out-of-order execution of the instruction, in the third implementation method, we can also determine the first instruction in the following two ways Instruction type.

Method a

In method a, when the processor determines that the front-end indication field is the second value, the type indication field is the first value, and the mispredicted sequence number stored in the sequence number indication field is greater than the mispredicted sequence number stored in the third register, the processor obtains the instruction type information of the first instruction stored in the first entry in the ROB.

In method a, if the mispredicted sequence number stored in the sequence number indication field is greater than the mispredicted sequence number stored in the third register, it can be considered that the instruction indicated by the first entry has not yet reached the commit segment; that is, the pause caused by the instruction has not yet occurred. This pause is the last pause caused by the instruction indicated by the first entry. At this time, the index information of the instruction should not be obtained from the third register, but from the first entry in the ROB.

In addition, in method a, since the instruction indicated by the first entry has not reached the commit segment, after the instruction type information of the first instruction stored in the first entry in the ROB is obtained, the reading pointer of the first register is not pointed to the next entry as in the other manners described in the embodiments of the present application. Instead, the reading pointer is kept unchanged, so that the culprit of the pause can be correctly identified after the next pause is terminated.

Method b

In method b, when the processor determines that the front-end indication field is the second value, the type indication field is the first value, and the mispredicted sequence number stored in the sequence number indication field is smaller than the mispredicted sequence number stored in the third register, the processor obtains the instruction type information of the first instruction stored in the first entry in the ROB.

In method b, if the mispredicted sequence number stored in the sequence number indication field is smaller than the mispredicted sequence number stored in the third register, the performance event recorded in the first entry may be considered to have been overwritten by another performance event (in which case it can be considered that the performance event recorded in the first entry did not cause the processor to stall). Exemplarily, the performance event recorded in the first entry is referred to as performance event p, and the performance event covering performance event p is referred to as performance event q. At this time, because the instruction that caused performance event p was the most recently submitted instruction but did not cause the processor to stall, the index information of the instruction that caused the stall (that is, the instruction that caused performance event q) should not be obtained from the third register; instead, it should be obtained from the first entry in the ROB.

In manner b, the reading pointer of the first register may be updated to point to the next entry.
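The three implementations, together with methods a and b, may be summarized in the following sketch. It is a software illustration under stated assumptions: the function name, the tuple layouts of `entry` and `third_reg`, and the boolean return flag indicating whether the reading pointer is advanced are all hypothetical, and the first value and the second value are taken to be 1 and 0.

```python
def resolve_instruction_type(entry, rob_head_type, third_reg):
    """Return (instruction_type, advance_reading_pointer).

    entry:         (fe, ctype, seq) fields of the first entry
    rob_head_type: instruction type stored in the first entry of the ROB
    third_reg:     (mispredicted sequence number, instruction type) of the
                   entry most recently deleted from the ROB
    """
    fe, ctype, seq = entry
    if fe == 1:
        # Implementation one: seq was reused to hold the instruction type.
        return seq, True
    if ctype == 0:
        # Implementation two: take the type from the first entry in the ROB.
        return rob_head_type, True
    last_seq, last_type = third_reg
    if seq == last_seq:
        # Implementation three: culprit is the last submitted instruction.
        return last_type, True
    if seq > last_seq:
        # Method a: recorded instruction has not reached the commit segment;
        # the reading pointer is kept unchanged.
        return rob_head_type, False
    # Method b: recorded event was overwritten; the reading pointer advances.
    return rob_head_type, True
```

As a usage example with arbitrary values, `resolve_instruction_type((0, 1, 12), 5, (9, 3))` follows method a: the type is taken from the first entry in the ROB and the reading pointer is not advanced.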

S303: The processor adds the first clock cycle number to the accumulated pause cycle number corresponding to the instruction type of the first instruction, and writes the accumulated pause cycle number to the second entry of the second register.

The second register is provided with multiple entries, which correspond respectively to multiple instruction types, and each entry is used to store the cumulative number of pause cycles caused by instructions of the corresponding instruction type.

According to the previous introduction, each type of pause has its own characteristics, and each type of pause can be caused by different or the same type of instructions. For example, the branch instruction can cause a second type of pause, and the load instruction can cause a second type of pause or a fourth type of pause. The second register stores the cumulative number of pause cycles caused by the instructions under each instruction type. For example, the second register may record the cumulative number of pause periods of the second type of pause caused by the branch instruction as A, the cumulative number of pause periods of the second type of pause caused by the load instruction as B, and the fourth type caused by the load instruction The cumulative number of pause periods is C ... and so on.

The process of updating the second entry in the second register may be: accumulating the count result of the counter (that is, the first clock cycle number) into the entry corresponding to the first instruction in the second register. For example, if the method shown in FIG. 3 determines that the Stall ID of the first instruction causing the stall is Oct01 and the first clock cycle number counted by the counter is M, then when S303 is executed, M is added to the cumulative value of the entry corresponding to Oct01 in the second register. Assume that the cumulative value of the entry corresponding to Oct01 in the second register before the pause occurs is N; the cumulative value of the entry after the superposition is then M + N.

In addition, after the processor writes the accumulated pause cycle number to the second entry of the second register in S303, all the entries stored in the second register may be updated to the memory connected to the processor in any of the following cases: the second register overflows; the second register triggers an interrupt; the performance monitoring period of the processor ends.

After all the entries saved in the second register are updated into the memory, the second register can be cleared. Then, the performance of the processor can be evaluated based on the data in the memory, for example, by analyzing the percentage of stalls caused by each type of instruction, which types of instructions are prone to cause stalls, and the number of processor stall cycles as a percentage of the total execution cycles of the program.
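The flush and the subsequent evaluation can be sketched as below. The function names, the dictionary representation of the memory, and the choice to merge a flush with values written by earlier flushes are all assumptions made for illustration; the application only states that the entries are updated to the memory and that the second register can then be cleared.

```python
def flush_to_memory(second_register, memory):
    """Copy every entry to memory, then clear the register.

    Assumption: values from earlier flushes are kept and added to.
    """
    for instr_type, cycles in second_register.items():
        memory[instr_type] = memory.get(instr_type, 0) + cycles
    second_register.clear()


def stall_report(memory, total_execution_cycles):
    """Return (share of stall cycles per type, stall fraction of the run)."""
    total_stall = sum(memory.values())
    if total_stall == 0:
        return {}, 0.0
    share = {t: c / total_stall for t, c in memory.items()}
    return share, total_stall / total_execution_cycles


# Illustrative data only: 30 stall cycles from branches, 70 from loads,
# in a program that ran for 1000 cycles.
register = {"branch": 30, "load": 70}
memory = {}
flush_to_memory(register, memory)
share, stall_fraction = stall_report(memory, total_execution_cycles=1000)
```

With these sample numbers, loads account for 70% of the stall cycles and the processor is stalled for 10% of the run.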

With the method shown in FIG. 3, the first entry in the first register is updated when a performance event occurs, and updating of the first entry stops after the processor pause ends. The first entry can be used to indicate the index path of the instruction type of the first instruction that caused the pause. In addition, the processor starts a counter when a pause occurs; after the pause ends, the counter holds the first clock cycle number for which the pause lasted, and this number may be used to indicate the performance overhead caused by the pause. Therefore, after the pause ends, the first entry can be read to determine the instruction type of the first instruction that caused the pause. Meanwhile, the second register stores the cumulative number of pause cycles caused by the instructions of each instruction type. After the pause ends, the first clock cycle number can be accumulated into the entry corresponding to the instruction type of the first instruction in the second register (that is, the second entry), so that the performance of the processor can be comprehensively analyzed, for example, by analyzing the percentage of stalls caused by each type of instruction, which types of instructions are prone to cause stalls, and the number of processor stall cycles as a percentage of the total execution cycles of the program. In summary, by using the processor performance monitoring solution provided in the embodiments of the present application, the type of the instruction that caused a stall can be accurately determined after the processor stalls, and the performance overhead caused by each type of instruction can be evaluated after execution of the program ends.

In addition, by using the solution provided in the embodiment of the present application, processor monitoring can be implemented through a low-cost hardware mechanism. That is, in the embodiment of the present application, the instruction that caused the pause can be accurately determined by adding, in the execute section, a register group that supports 2 reads and 1 write, combined with the judgment logic of the commit section.

Based on the above embodiments, an embodiment of the present application further provides a method for monitoring processor performance. This method can be regarded as a specific example of the method shown in FIG. 3.

Referring to FIG. 4, the method may be as follows: when a front-end cache miss occurs in the fetch section and the ROB is empty, the entry pointed to by the write pointer in the CTS is updated: FE is set to 1, and the stall ID of the fetch instruction is written into the Squash SN field (that is, a specific example of the sequence number indication field). After the commit section determines that a pause has ended, the write pointer is incremented by 1 so that the entry it previously pointed to is no longer updated. Then, the entry pointed to by the read pointer is read to determine the "culprit" of the pause that just ended, and the read pointer is incremented by 1 after the read completes. If the entry pointed to by the read pointer has FE = 1 and Ctype = 0, the processor determines that the pause is a front-end pause of the pipeline, uses the stall ID stored in the Squash SN field to index the SPCT, and adds the SCC count value to the corresponding entry in the SPCT.

When a subsequent pause ends, if the entry pointed to by the read pointer has FE = 0 and Ctype = 0, the processor determines that the entry pointed to by the head pointer (head) of the ROB is the "culprit", and adds the count value of the SCC to the SPCT entry corresponding to that "culprit". The SPCT can be used for subsequent analysis of processor performance.
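The pointer handling described for FIG. 4 can be sketched as a small software model. This is a sketch under stated assumptions, not the hardware design: the class and method names, the CTS depth of 4, the wrap-around of the pointers, and the clearing of the entry that the write pointer moves onto are all hypothetical choices made here so that the example is complete. Only the two cases described in the text (FE = 1, Ctype = 0 and FE = 0, Ctype = 0) are modeled.

```python
from dataclasses import dataclass


@dataclass
class CtsEntry:
    fe: int = 0         # front-end indication (FE)
    ctype: int = 0      # type indication (Ctype)
    squash_sn: int = 0  # Squash SN field (holds a stall ID when FE = 1)


class Fig4Model:
    def __init__(self, cts_size=4):  # depth is an assumption; must be >= 2
        self.cts = [CtsEntry() for _ in range(cts_size)]
        self.write_ptr = 0
        self.read_ptr = 0
        self.spct = {}  # stall ID -> accumulated stall cycles

    def front_end_miss(self, stall_id, rob_empty):
        """Front-end cache miss in the fetch section."""
        if rob_empty:
            entry = self.cts[self.write_ptr]
            entry.fe, entry.ctype, entry.squash_sn = 1, 0, stall_id

    def stall_ended(self, scc_count, rob_head_stall_id):
        """Commit section has determined that a pause ended."""
        # Advance the write pointer so the finished entry is frozen
        # (wrap-around and clearing the next entry are assumptions).
        self.write_ptr = (self.write_ptr + 1) % len(self.cts)
        self.cts[self.write_ptr] = CtsEntry()
        entry = self.cts[self.read_ptr]
        self.read_ptr = (self.read_ptr + 1) % len(self.cts)
        if entry.fe == 1 and entry.ctype == 0:
            culprit = entry.squash_sn      # front-end pause
        elif entry.fe == 0 and entry.ctype == 0:
            culprit = rob_head_stall_id    # entry at the ROB head
        else:
            raise NotImplementedError("case not covered by this sketch")
        self.spct[culprit] = self.spct.get(culprit, 0) + scc_count
        return culprit


model = Fig4Model()
model.front_end_miss(stall_id=3, rob_empty=True)
first = model.stall_ended(scc_count=10, rob_head_stall_id=7)
second = model.stall_ended(scc_count=5, rob_head_stall_id=7)
```

In this run the first pause is attributed to the stall ID written by the front-end miss, and the second pause, for which no entry update occurred, is attributed to the instruction at the ROB head.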

It should be noted that the method shown in FIG. 4 can be regarded as a specific example of the method shown in FIG. 3. For implementation manners and technical effects that are not described in detail in the method shown in FIG. 4, refer to the related descriptions of the method shown in FIG. 3.

Based on the same inventive concept, the embodiment of the present application further provides a processor performance monitoring device, which can be used to execute the processor performance monitoring method shown in FIG. 3. Referring to FIG. 5, the processor performance monitoring device 500 (hereinafter referred to as “device 500”) includes a processor 501, a first register 502, a counter 503, and a second register 504.

The processor 501 is configured to: update a first entry in the first register 502 when a performance event occurs, and start the counter 503 when a pause occurs to count a first clock cycle number for which the pause lasts, where the first entry is used to indicate an index path of the instruction type of the first instruction that caused the performance event; after the pause ends, stop updating the first entry, and determine the instruction type of the first instruction according to the first entry; and add the first clock cycle number to the accumulated pause cycle number corresponding to the instruction type of the first instruction, and write the accumulated pause cycle number to the second entry in the second register 504. The second register 504 is provided with multiple entries that correspond respectively to multiple instruction types, and the entries are used to store the cumulative number of pause cycles caused by the instructions of each instruction type.

The first register 502, the counter 503, and the second register 504 may be integrated on the processor 501, or may be set separately. When the first register 502, the counter 503, and the second register 504 are integrated on the processor 501, the processor performance monitoring device 500 can also be regarded as a type of processor.

In the apparatus 500, the processor 501 can update the first entry in multiple ways, and four of them are listed below.

The first way

When the processor 501 updates the first entry in the first register 502, the processor 501 is specifically configured to: when a first performance event occurs at the front end of the processor 501 pipeline and the reordering buffer ROB of the processor 501 is empty, set the front-end indication field to the first value, set the type indication field to the second value, and store the instruction type information of the first instruction in the sequence number indication field.

The second way

When the processor 501 updates the first entry in the first register 502, the processor 501 is specifically configured to: when the second performance event does not occur at the front end of the processor 501 pipeline and the second performance event is not a misprediction event, set both the front-end indication field and the type indication field to the second value.

The third way

When the processor 501 updates the first entry in the first register 502, the processor 501 is specifically configured to: when the third performance event is a misprediction event, set the front-end indication field to the second value, set the type indication field to the first value, and store the mispredicted sequence number of the third performance event in the sequence number indication field.

The fourth way

When the processor 501 updates the first entry in the first register 502, the processor 501 is specifically configured to: when the fourth performance event is a misprediction event, set the front-end indication field to the second value, set the type indication field to the first value, and store the mispredicted sequence number of the fourth performance event in the sequence number indication field; when the fifth performance event is a misprediction event, compare the mispredicted sequence number of the fifth performance event with the mispredicted sequence number of the fourth performance event; and store the smaller of the two mispredicted sequence numbers in the sequence number indication field.
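The four update ways listed above can be combined into one sketch. Several points are assumptions introduced here for illustration: the first value is taken as 1 and the second value as 0 (consistent with FE being set to 1 in the FIG. 4 example), the order in which the conditions are checked is chosen arbitrarily, and all names (`FirstEntry`, `update_first_entry`, and the keyword arguments) are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional

FIRST_VALUE, SECOND_VALUE = 1, 0  # assumed encodings


@dataclass
class FirstEntry:
    front_end: int = SECOND_VALUE  # front-end indication field
    ctype: int = SECOND_VALUE      # type indication field
    seq: Optional[int] = None      # sequence number indication field


def update_first_entry(entry, *, at_front_end, rob_empty, mispredict,
                       type_info=None, mispredict_sn=None):
    """Apply one performance event to the first entry."""
    if at_front_end and rob_empty:
        # First way: front-end event while the ROB is empty.
        entry.front_end, entry.ctype = FIRST_VALUE, SECOND_VALUE
        entry.seq = type_info
    elif mispredict:
        holds_sn = (entry.front_end == SECOND_VALUE
                    and entry.ctype == FIRST_VALUE
                    and entry.seq is not None)
        if holds_sn:
            # Fourth way: keep the smaller mispredicted sequence number.
            entry.seq = min(entry.seq, mispredict_sn)
        else:
            # Third way: first misprediction recorded in this entry.
            entry.front_end, entry.ctype = SECOND_VALUE, FIRST_VALUE
            entry.seq = mispredict_sn
    elif not at_front_end:
        # Second way: non-front-end event that is not a misprediction.
        entry.front_end = entry.ctype = SECOND_VALUE
    return entry


e = FirstEntry()
for sn in (42, 17, 30):  # three misprediction events in a row
    update_first_entry(e, at_front_end=False, rob_empty=False,
                       mispredict=True, mispredict_sn=sn)
```

After the three misprediction events the entry keeps the smallest sequence number, which identifies the oldest mispredicted instruction if sequence numbers grow in program order (an assumption of this sketch).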

In addition, in the embodiment of the present application, the processor 501 may determine the instruction type of the first instruction according to the first entry in the following three specific ways.

The first way

When the processor 501 determines the instruction type of the first instruction according to the first entry, the processor 501 is specifically configured to: when determining that the front-end indication field is the first value, obtain the instruction type information of the first instruction stored in the sequence number indication field.

The second way

When the processor 501 determines the instruction type of the first instruction according to the first entry, the processor 501 is specifically configured to: when determining that both the front-end indication field and the type indication field are the second value, obtain the instruction type information of the first instruction stored in the first entry in the reordering buffer ROB.

The third way

In a third manner, the device 500 further includes a third register. The third register stores a third entry that was last deleted by the reordering buffer ROB. The third entry contains the instruction type information of the first instruction. When determining the instruction type of the first instruction according to the first entry, the following three methods may be specifically used:

Way 3a

When the processor 501 determines that the front-end indication field is the second value, the type indication field is the first value, and the mispredicted sequence number stored in the sequence number indication field is the same as the mispredicted sequence number stored in the third register, the processor 501 obtains the instruction type information of the first instruction stored in the third register.

Way 3b

When the processor 501 determines that the front-end indication field is the second value, the type indication field is the first value, and the mispredicted sequence number stored in the sequence number indication field is greater than the mispredicted sequence number stored in the third register, the processor 501 obtains the instruction type information of the first instruction stored in the first entry in the reordering buffer ROB.

Way 3c

When the processor 501 determines that the front-end indication field is the second value, the type indication field is the first value, and the mispredicted sequence number stored in the sequence number indication field is smaller than the mispredicted sequence number stored in the third register, the processor 501 obtains the instruction type information of the first instruction stored in the first entry in the reordering buffer ROB.
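The three ways of determining the instruction type, including ways 3a to 3c, can be summarized in the following sketch. It assumes the first value is 1 and the second value is 0, and it represents instruction type information as strings; the function name and parameter names are hypothetical. As stated in the text, ways 3b and 3c both read the first entry in the ROB, so the sketch treats them identically.

```python
FIRST_VALUE, SECOND_VALUE = 1, 0  # assumed encodings


def resolve_instruction_type(front_end, ctype, seq, rob_first_type,
                             third_reg_sn=None, third_reg_type=None):
    """Return the instruction type information of the first instruction.

    front_end, ctype, seq: fields of the first entry.
    rob_first_type: type information held in the first entry in the ROB.
    third_reg_sn, third_reg_type: contents of the third register, that is,
    the entry most recently deleted by the ROB.
    """
    if front_end == FIRST_VALUE:
        # First way: the sequence number field holds the type information.
        return seq
    if ctype == SECOND_VALUE:
        # Second way: both fields are the second value.
        return rob_first_type
    # Third way: front_end is the second value, ctype is the first value.
    if seq == third_reg_sn:
        return third_reg_type   # way 3a
    return rob_first_type       # ways 3b and 3c


way1 = resolve_instruction_type(1, 0, "load", rob_first_type="add")
way2 = resolve_instruction_type(0, 0, None, rob_first_type="add")
way3a = resolve_instruction_type(0, 1, 9, "add", third_reg_sn=9,
                                 third_reg_type="branch")
way3b = resolve_instruction_type(0, 1, 12, "add", third_reg_sn=9,
                                 third_reg_type="branch")
```

The third register matters only in way 3a, where the mispredicted instruction has already left the ROB and its type information would otherwise be lost.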

In addition, in the embodiment of the present application, the processor 501 is further configured to: after writing the accumulated pause cycle number into the second entry of the second register 504, update all entries saved in the second register 504 to the memory connected to the processor 501 in any of the following cases: the second register 504 overflows; the second register 504 triggers an interrupt; the performance monitoring period of the processor 501 ends.

It should also be noted that the processor performance monitoring device 500 may be used to execute the method provided by the embodiment corresponding to FIG. 3. Therefore, for implementation manners and technical effects not described in detail for the processor performance monitoring device 500 shown in FIG. 5, refer to the related description of FIG. 3.

Based on the same inventive concept, an embodiment of the present application further provides a processor performance monitoring device, which can be used to execute the processor performance monitoring method shown in FIG. 3, and which can also be regarded as the same device as the processor performance monitoring device 500 shown in FIG. 5. Referring to FIG. 6, the processor performance monitoring device 600 includes an update module 601, a start module 602, a stop module 603, and a read module 604.

An update module 601 is configured to update a first entry in the first register when a performance event occurs, where the first entry is used to indicate an index path of an instruction type of the first instruction that caused the performance event.

The start module 602 is configured to start a counter when a pause occurs, to count a first clock cycle number for which the pause lasts.

The stopping module 603 is configured to stop updating the first entry after the pause is terminated.

The reading module 604 is configured to determine an instruction type of the first instruction according to the first entry.

The update module 601 is further configured to add the first clock cycle number to the accumulated pause cycle number corresponding to the instruction type of the first instruction, and write the accumulated pause cycle number to a second entry in the second register. The second register is provided with multiple entries that correspond respectively to multiple instruction types, and the entries are used to store the cumulative number of pause cycles caused by the instructions of each instruction type.

In addition, the processor performance monitoring device 600 shown in FIG. 6 may also be used to perform other operations in the method for monitoring processor performance shown in FIG. 3, which are not described herein again.

It should be noted that the division of the modules in the embodiments of the present application is schematic, and is only a logical function division. In actual implementation, there may be another division manner. The functional modules in the embodiments of the present application may be integrated into one processing module, or each module may exist separately physically, or two or more modules may be integrated into one module. The above integrated modules may be implemented in the form of hardware or software functional modules.

If the integrated module is implemented in the form of a software functional module and sold or used as an independent product, it can be stored in a computer-readable storage medium. Based on this understanding, the technical solution of the present application in essence, or the part that contributes to the prior art, or all or part of the technical solution, can be embodied in the form of a software product. The software product is stored in a storage medium and includes a number of instructions to enable a computer device (which may be a personal computer, a server, or a network device) or a processor to perform all or part of the steps of the methods of the embodiments of the present application. The aforementioned storage media include: USB flash drives, removable hard disks, read-only memories (ROMs), random access memories (RAMs), magnetic disks, compact discs, and other media that can store program code.

It should also be noted that the processor performance monitoring device 600 can be used to execute the method provided by the embodiment corresponding to FIG. 3. Therefore, for implementation manners and technical effects not described in detail for the processor performance monitoring device 600 shown in FIG. 6, refer to the related description of FIG. 3.

In addition, an embodiment of the present application further provides a computer storage medium. A program is stored on the computer storage medium, and when the program is executed by a processor, the program is used to implement the method provided by the embodiment corresponding to FIG.

An embodiment of the present application further provides a computer program product. When the program code included in the computer program product runs on a computer, the computer is caused to execute the method provided in the embodiment corresponding to FIG. 3.

Those skilled in the art should understand that the embodiments of the present application may be provided as a method, a system, or a computer program product. Therefore, this application may take the form of an entirely hardware embodiment, an entirely software embodiment, or an embodiment combining software and hardware aspects. Moreover, this application may take the form of a computer program product implemented on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, etc.) containing computer-usable program code.

This application is described with reference to flowcharts and/or block diagrams of methods, devices (systems), and computer program products according to embodiments of the present application. It should be understood that each process and/or block in the flowcharts and/or block diagrams, and combinations of processes and/or blocks in the flowcharts and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general-purpose computer, special-purpose computer, embedded processor, or other programmable data processing device to produce a machine, so that the instructions executed by the processor of the computer or other programmable data processing device produce means for implementing the functions specified in one or more processes of the flowcharts and/or one or more blocks of the block diagrams.

These computer program instructions may also be stored in a computer-readable memory capable of directing a computer or other programmable data processing device to work in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including an instruction device, and the instruction device implements the functions specified in one or more processes of the flowcharts and/or one or more blocks of the block diagrams.

These computer program instructions can also be loaded onto a computer or other programmable data processing device, so that a series of operation steps are performed on the computer or other programmable device to produce a computer-implemented process, and the instructions executed on the computer or other programmable device thereby provide steps for implementing the functions specified in one or more processes of the flowcharts and/or one or more blocks of the block diagrams.

Obviously, those skilled in the art can make various modifications and variations to the embodiments of the present application without departing from the spirit and scope of the embodiments of the present application. In this way, if these modifications and variations of the embodiments of the present application fall within the scope of the claims of the present application and their equivalent technologies, the present application also intends to include these changes and variations.

Claims

A method for monitoring processor performance, comprising:

The processor updates a first entry in a first register when a performance event occurs, and starts a counter when a pause occurs to count a first clock cycle number for which the pause lasts, wherein the first entry is used to indicate an index path of the instruction type of a first instruction that causes the performance event;

The processor stops updating the first entry after the pause is terminated, and determines an instruction type of the first instruction according to the first entry;

The processor adds the first clock cycle number to the accumulated pause cycle number corresponding to the instruction type of the first instruction, and writes the accumulated pause cycle number into a second entry of a second register; the second register is provided with a plurality of entries that correspond respectively to a plurality of instruction types, and the plurality of entries are used to store the cumulative number of pause cycles caused by instructions of each instruction type.
The method according to claim 1, wherein the first entry comprises a front-end indication field, a type indication field, and a sequence number indication field; the front-end indication field is used to indicate whether the pause occurs at the front end, the type indication field is used to indicate whether the pause occurs before the first instruction is submitted, and the sequence number indication field is used to indicate a mispredicted sequence number of the first instruction.
The method of claim 2, wherein the processor updating the first entry in the first register when a performance event occurs comprises:

When a first performance event occurs at the front end of the processor pipeline and the reordering cache ROB of the processor is empty, the processor sets the front-end indication field to a first value, sets the type indication field to a second value, and saves the instruction type information of the first instruction in the sequence number indication field.
The method of claim 2, wherein the processor updating the first entry in the first register when a performance event occurs comprises:

When the second performance event does not occur at the front end of the processor pipeline and the second performance event is not a misprediction event, the processor sets both the front-end indication field and the type indication field to a second value.
The method of claim 2, wherein the processor updating the first entry in the first register when a performance event occurs comprises:

When the third performance event is a misprediction event, the processor sets the front-end indication field to a second value, sets the type indication field to a first value, and stores the mispredicted sequence number of the third performance event in the sequence number indication field.
The method of claim 2, wherein the processor updating the first entry in the first register when a performance event occurs comprises:

When the fourth performance event is a misprediction event, the processor sets the front-end indication field to a second value, sets the type indication field to a first value, and stores the mispredicted sequence number of the fourth performance event in the sequence number indication field;

When the fifth performance event is a misprediction event, comparing the magnitude of the misprediction sequence number of the fifth performance event with the misprediction sequence number of the fourth performance event;

The processor stores the smaller mispredicted sequence number of the mispredicted sequence number of the fifth performance event and the mispredicted sequence number of the fourth performance event in the sequence number indication field.
The method according to any one of claims 1 to 6, wherein the determining the instruction type of the first instruction according to the first entry comprises:

When the processor determines that the front-end indication field is a first value, the processor obtains instruction type information of the first instruction stored in the sequence number indication field.
The method according to any one of claims 1 to 6, wherein the determining the instruction type of the first instruction according to the first entry comprises:

When determining that both the front-end indication field and the type indication field are second values, the processor obtains instruction type information of a first instruction stored in a first entry in a reordering buffer ROB.
The method according to any one of claims 1 to 6, wherein the determining the instruction type of the first instruction according to the first entry comprises:

When the processor determines that the front-end indication field is a second value, the type indication field is a first value, and the mispredicted sequence number stored in the sequence number indication field is the same as the mispredicted sequence number stored in a third register, the processor obtains the instruction type information of the first instruction stored in the third register, wherein the third register stores a third entry most recently deleted by the reordering buffer ROB, and the third entry contains the instruction type information of the first instruction; or,

When the processor determines that the front-end indication field is a second value, the type indication field is a first value, and the mispredicted sequence number stored in the sequence number indication field is greater than the mispredicted sequence number stored in the third register, the processor obtains the instruction type information of the first instruction stored in the first entry in the reordering cache ROB; or, when the processor determines that the front-end indication field is a second value, the type indication field is a first value, and the mispredicted sequence number stored in the sequence number indication field is smaller than the mispredicted sequence number stored in the third register, the processor obtains the instruction type information of the first instruction stored in the first entry in the reordering buffer ROB.
The method according to any one of claims 1 to 9, after the processor writes the accumulated pause period number into a second entry of a second register, further comprising:

The processor updates all entries stored in the second register to a memory connected to the processor in any of the following cases:

An overflow occurs in the second register;

The second register triggers an interrupt;

The performance monitoring period of the processor ends.
A processor performance monitoring device is characterized in that the device includes a processor, a first register, a counter, and a second register; the processor is used for:

When a performance event occurs, update a first entry in the first register, and when a pause occurs, start the counter to count a first clock cycle number for which the pause lasts, wherein the first entry is used to indicate an index path of the instruction type of a first instruction that causes the performance event;

After the pause is terminated, stopping updating the first entry, and determining an instruction type of the first instruction according to the first entry;

Adding the first clock cycle number to the cumulative pause cycle number corresponding to the instruction type of the first instruction, and writing the cumulative pause cycle number to a second entry in the second register; the second register is provided with multiple entries that correspond respectively to multiple instruction types, and the multiple entries are used to store the cumulative number of pause cycles caused by instructions of each instruction type.
The device according to claim 11, wherein the first entry comprises a front-end indication field, a type indication field, and a sequence number indication field; the front-end indication field is used to indicate whether the pause occurs at the front end, the type indication field is used to indicate whether the pause occurs before the first instruction is submitted, and the sequence number indication field is used to indicate a mispredicted sequence number of the first instruction.
The apparatus according to claim 12, wherein, when updating the first entry in the first register, the processor is specifically configured to:

When a first performance event occurs at the front end of the processor pipeline and the reordering cache ROB of the processor is empty, set the front-end indication field to a first value, set the type indication field to a second value, and save the instruction type information of the first instruction in the sequence number indication field.
The apparatus according to claim 12, wherein, when updating the first entry in the first register, the processor is specifically configured to:

When the second performance event does not occur at the front end of the processor pipeline and the second performance event is not a misprediction event, the processor sets both the front-end indication field and the type indication field to a second value.
The apparatus according to claim 12, wherein, when updating the first entry in the first register, the processor is specifically configured to:

When the third performance event is a misprediction event, the processor sets the front-end indication field to a second value, sets the type indication field to a first value, and stores the mispredicted sequence number of the third performance event in the sequence number indication field.
The apparatus according to claim 12, wherein, when updating the first entry in the first register, the processor is specifically configured to:

When the fourth performance event is a misprediction event, the processor sets the front-end indication field to a second value, sets the type indication field to a first value, and stores the mispredicted sequence number of the fourth performance event in the sequence number indication field;

When the fifth performance event is a misprediction event, comparing the magnitude of the misprediction sequence number of the fifth performance event with the misprediction sequence number of the fourth performance event;

The processor stores the smaller mispredicted sequence number of the mispredicted sequence number of the fifth performance event and the mispredicted sequence number of the fourth performance event in the sequence number indication field.
The device according to any one of claims 11 to 16, wherein when the processor determines an instruction type of the first instruction according to the first entry, the processor is specifically configured to:

When the processor determines that the front-end indication field is a first value, the processor obtains instruction type information of the first instruction stored in the sequence number indication field.
The device according to any one of claims 11 to 16, wherein when the processor determines an instruction type of the first instruction according to the first entry, the processor is specifically configured to:

When determining that both the front-end indication field and the type indication field are second values, the processor obtains instruction type information of a first instruction stored in a first entry in a reordering buffer ROB.
The device according to any one of claims 11 to 16, further comprising:

A third register that stores a third entry that was last deleted by the reordering buffer ROB, and the third entry includes instruction type information of the first instruction;

When the processor determines the instruction type of the first instruction according to the first entry, the processor is specifically configured to:

When the processor determines that the front-end indication field is a second value, the type indication field is a first value, and the mispredicted sequence number stored in the sequence number indication field is the same as the mispredicted sequence number stored in the third register, the processor obtains the instruction type information of the first instruction stored in the third register; or

Determining, by the processor, that the front-end indication field is a second value, the type indication field is a first value, and the misprediction sequence number stored in the sequence number indication field is greater than the misprediction stored in the third register In the case of a serial number, obtain the instruction type information of the first instruction stored in the first entry in the reorder buffer ROB; or

Determining, by the processor, that the front-end indication field is a second value, the type indication field is a first value, and the misprediction sequence number stored in the sequence number indication field is less than the misprediction stored in the third register In the case of a serial number, the instruction type information of the first instruction stored in the first entry in the reorder buffer ROB is obtained.
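The three preceding device claims together describe how the first entry is used as an index path to the instruction type. The following Python sketch restates that decision logic for illustration only. The encodings `FIRST_VALUE = 1` and `SECOND_VALUE = 0`, the class names `FirstEntry` and `RobEntry`, and the representation of instruction types as strings are all assumptions of this sketch and are not specified by the claims.

```python
from dataclasses import dataclass
from typing import Union

FIRST_VALUE = 1   # assumed encoding of the claims' "first value"
SECOND_VALUE = 0  # assumed encoding of the claims' "second value"


@dataclass
class FirstEntry:
    """Hypothetical model of the first entry in the first register."""
    front_end: int               # front-end indication field
    type_ind: int                # type indication field
    seq_field: Union[int, str]   # sequence number indication field


@dataclass
class RobEntry:
    """Hypothetical model of a ROB entry (also used for the third register)."""
    mispredict_seq: int
    instr_type: str


def resolve_instruction_type(entry: FirstEntry,
                             rob_first: RobEntry,
                             third_register: RobEntry) -> str:
    # Front-end indication is the first value: the instruction type
    # information is read from the sequence number indication field.
    if entry.front_end == FIRST_VALUE:
        return str(entry.seq_field)
    # Both fields are the second value: read the first ROB entry.
    if entry.type_ind == SECOND_VALUE:
        return rob_first.instr_type
    if entry.type_ind != FIRST_VALUE:
        raise ValueError("unknown type indication value")
    # Type indication is the first value: compare misprediction
    # sequence numbers with the entry last deleted from the ROB.
    if entry.seq_field == third_register.mispredict_seq:
        return third_register.instr_type
    # Greater or less: both cases read the first ROB entry.
    return rob_first.instr_type
```

In this model only an exact match of misprediction sequence numbers selects the third register; every other back-end case falls through to the first ROB entry, as the claims state.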
The device according to any one of claims 11 to 19, wherein the processor is further configured to:

After writing the accumulated number of pause cycles to the second entry of the second register, the processor updates all entries held in the second register to a memory connected to the processor in any of the following cases:

An overflow occurs in the second register;

The second register triggers an interrupt;

The performance monitoring period of the processor ends.
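As a rough illustration of the three trigger conditions listed above, the following Python sketch copies the entries of the second register to memory when any one of them holds. The function names, the use of dictionaries keyed by instruction type, and the choice to copy entries without clearing the register are assumptions of this sketch; the claim does not state whether the register is cleared afterwards.

```python
def should_write_back(overflowed: bool,
                      interrupt_triggered: bool,
                      monitoring_period_ended: bool) -> bool:
    """True if any one of the three listed cases applies."""
    return overflowed or interrupt_triggered or monitoring_period_ended


def write_back(second_register: dict, memory: dict) -> None:
    """Copy every entry of the second register into memory.

    Both arguments map an instruction type to an accumulated number
    of pause cycles (hypothetical representation).
    """
    memory.update(second_register)
```

A caller would evaluate `should_write_back` after each write to the second register and invoke `write_back` when it returns true.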
A processor performance monitoring device, comprising:

An update module, configured to update a first entry in the first register when a performance event occurs, where the first entry is used to indicate an index path of an instruction type of the first instruction that caused the performance event;

A startup module, configured to start a counter when a pause occurs, so as to count a first number of clock cycles for which the pause lasts;

A stopping module, configured to stop updating the first entry after the pause is terminated;

A reading module, configured to determine an instruction type of a first instruction according to the first entry;

The update module is further configured to add the first number of clock cycles to the accumulated number of pause cycles corresponding to the instruction type of the first instruction, and to write the accumulated number of pause cycles to a second entry in a second register; the second register is provided with a plurality of entries, the plurality of entries correspond respectively to a plurality of instruction types, and each entry is used to store the accumulated number of pause cycles caused by instructions of the corresponding instruction type.
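The cooperation of the modules in this claim can be sketched as a small software model, given here for illustration only. The class name `PauseMonitor`, its method names, and the simplification that the first entry holds the instruction type directly (rather than an index path to it) are assumptions of this sketch.

```python
from collections import defaultdict


class PauseMonitor:
    """Simplified software model of the claimed monitoring device."""

    def __init__(self):
        self.first_entry = None              # models the first register
        self.counting = False
        self.cycle_counter = 0               # cycles of the current pause
        self.accumulated = defaultdict(int)  # models the second register

    def on_performance_event(self, instr_type: str) -> None:
        # Update module: record the latest performance event.
        self.first_entry = instr_type

    def on_pause_start(self) -> None:
        # Startup module: start counting the cycles of this pause.
        self.cycle_counter = 0
        self.counting = True

    def on_clock(self) -> None:
        # One clock cycle elapses; it is counted only during a pause.
        if self.counting:
            self.cycle_counter += 1

    def on_pause_end(self) -> None:
        # Stopping module: in this sequential model no event can update
        # first_entry while this method runs, so the entry is stable.
        self.counting = False
        # Reading module: determine the instruction type.
        instr_type = self.first_entry
        # Update module: add the pause length to the per-type total.
        self.accumulated[instr_type] += self.cycle_counter
```

After a program run, `accumulated` holds one total per instruction type, which corresponds to the per-type entries of the second register.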
A computer storage medium, characterized in that a program is stored on the computer storage medium, and when the program is executed by a processor, it is used to implement the method according to any one of claims 1 to 10.
A computer program product, characterized in that when the program code contained in the computer program product runs on a computer, the computer is caused to execute the method according to any one of claims 1 to 10.