CN110178123B - Performance index evaluation method and device - Google Patents

Performance index evaluation method and device

Info

Publication number: CN110178123B
Application number: CN201780083763.9A
Authority: CN (China)
Original language: Chinese (zh)
Other versions: CN110178123A
Inventors: 程捷, 朱冠宇, 赵俊峰
Assignee: Huawei Technologies Co., Ltd.
Legal status: Active (granted)

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00 Error detection; Error correction; Monitoring
    • G06F11/30 Monitoring
    • G06F11/34 Recording or statistical evaluation of computer activity, e.g. of down time, of input/output operation; Recording or statistical evaluation of user activity, e.g. usability assessment

Abstract

A performance index evaluation method and apparatus are provided. The method comprises the following steps: acquiring an instruction stream of a test program and dividing the instruction stream into N instruction segments (301); counting the jump probability between every two basic blocks in each of the N instruction segments to form N jump matrices with M rows and M columns (302); converting the N M×M jump matrices into a first feature matrix A with M² rows and N columns (303); selecting p column vectors from the first feature matrix A and combining them into a second feature matrix A_p with M² rows and p columns (304); sending the p corresponding instruction fragments to a simulator and receiving the simulation results of the p instruction fragments from the simulator to determine a performance indicator vector C_p of the p instruction fragments (305); and determining a performance indicator vector C of the N instruction fragments according to the performance indicator vector C_p, the first feature matrix A, and the second feature matrix A_p (306). With this method, the performance indicator of each instruction fragment of the test program can be effectively evaluated.

Description

Performance index evaluation method and device
Technical Field
The present application relates to the field of computers, and more particularly, to a performance index evaluation method and apparatus.
Background
Personnel engaged in the design and development of processor architectures often need to run a test program in a simulator of a given architecture and then collect the relevant performance indicators, such as Instructions Per Cycle (IPC), Level 2 cache (L2 Cache) hit rate, and energy consumption, so as to find the bottleneck of the current processor architecture. After the architecture is improved, the design is redeployed in the simulator, the test program is run again, data are collected, the performance of the same test program under the new and old architectures is compared, and the bottleneck is then sought again. It can be seen that a great deal of design and test work is done with a software simulator before deployment in hardware.
However, one of the major disadvantages of a software simulation platform is that the runtime of the same test program is much longer than on a hardware platform. Especially when running large, comprehensive test suites, such as SPEC CPU 2006, it often takes weeks or even months to obtain the needed data. Moreover, after every change to the architecture, the test program needs to be run again to collect data under the new architecture. Such repeated runs and waits seriously affect development efficiency.
Disclosure of Invention
The application discloses a performance index evaluation method and apparatus. Based on a linear transformation between local simulation results and local jump matrices, the performance indicator of each instruction segment in an instruction stream is evaluated by simulating only selected instruction segments. This enables effective evaluation of the performance indicator of each instruction segment of a test program, saves simulation cost, and shortens simulation time.
In a first aspect, an embodiment of the present application provides a performance index evaluation method, including: obtaining an instruction stream and dividing it into N instruction segments; counting the jump probability between every two basic blocks in each of the N instruction segments to form N jump matrices with M rows and M columns, where M is the number of basic block types; converting the N M×M jump matrices into a first feature matrix A with M² rows and N columns, where each column of the first feature matrix A represents the jump matrix of one instruction fragment; selecting p column vectors from the first feature matrix A and combining them into a second feature matrix A_p with M² rows and p columns, where the p column vectors represent the jump matrices of p instruction fragments; sending the p instruction fragments to a simulator and receiving the simulation results of the p instruction fragments from the simulator to determine a performance indicator vector C_p of the p instruction fragments; and determining a performance indicator vector C of the N instruction segments according to the first feature matrix A and an indicator contribution vector Y, where the indicator contribution vector Y represents the linear relationship between the performance indicator vector C_p and the second feature matrix A_p.
By simulating only a selected small subset of the instruction fragments, the linear relationship between the simulation results of these fragments and the second feature matrix, namely the indicator contribution vector Y, is obtained. In theory, this linear relationship also holds between the first feature matrix of all instruction fragments and the simulation results of all instruction fragments. The simulation results of all instruction fragments can therefore be evaluated from the indicator contribution vector Y and the feature matrix of all instruction fragments; that is, the simulation results of all fragments are predicted from the simulation results of a small subset, which greatly saves simulation cost and shortens simulation time.
With reference to the first aspect, in a first possible implementation manner of the first aspect, determining the performance indicator vector C of the N instruction segments according to the first feature matrix A and the indicator contribution vector Y specifically includes: determining the indicator contribution vector Y according to the equation A_pᵀY = C_p, and determining the performance indicator vector C of the N instruction fragments according to the equation AᵀY = C.
Since the performance index vector C includes the performance indexes of the N instruction fragments, effective evaluation of the performance index of each instruction fragment of the test program can be achieved.
With reference to the first aspect or the first possible implementation manner of the first aspect, in a second possible implementation manner of the first aspect, selecting p column vectors from the first feature matrix A specifically includes: averaging each row of the first feature matrix A to obtain a column vector B, and selecting p column vectors in the first feature matrix A that are suitable for fitting the column vector B, where

B ≈ HD,

H is the M²×p matrix formed by the p column vectors and D is the coefficient vector required for the fit.
When p column vectors are selected from the first feature matrix A, if the selected p column vectors can linearly express the mean column vector B, the selected p column vectors are close to the average level and are therefore better suited to represent the first feature matrix A.
With reference to the second possible implementation manner of the first aspect, in a third possible implementation manner of the first aspect, determining the indicator contribution vector Y according to the equation A_pᵀY = C_p specifically includes: performing an inner product operation on the performance indicator vector C_p and the coefficient vector D to obtain the total performance indicator value c of the instruction stream.

The coefficient vector D reflects the fitting lengths with which the selected column vectors express the first feature matrix A, so the inner product of the performance indicator vector C_p and the coefficient vector D yields the total performance indicator value c.
With reference to the third possible implementation manner of the first aspect, in a fourth possible implementation manner of the first aspect, determining the indicator contribution vector Y according to the equation A_pᵀY = C_p specifically includes: taking the performance indicator vector C_p and the total performance indicator value c as constraints, taking the second feature matrix A_p as an input parameter, and determining the contribution vector Y according to a compressed sensing algorithm.
In a possible implementation manner of the first aspect, converting the N M×M jump matrices into the first feature matrix A with M² rows and N columns specifically includes: multiplying the data in each column of the N M×M jump matrices by the corresponding basic block weight to form N M×M feature matrices, converting each of the N M×M feature matrices into a column vector with M² rows and 1 column, and combining the N column vectors into the first feature matrix A with M² rows and N columns. The basic block weight is the ratio of the number of occurrences of the basic block represented by the data in a column to the number of all basic blocks in the instruction segment where that basic block is located.
Since the constructed feature matrix further incorporates the basic block weight in combination with the basic block order, it describes the basic blocks from both aspects at the same time, thereby further improving the accuracy of the evaluation.
In a possible implementation manner of the first aspect, the simulation result includes at least one of a number of instructions per cycle, a branch prediction success rate, a branch prediction failure rate, a second level cache hit rate, and energy consumption.
In a second aspect, an embodiment of the present application provides a performance index evaluation apparatus, including: an instruction stream segmentation module, configured to acquire an instruction stream and divide it into N instruction segments; a jump matrix generation module, configured to count the jump probability between every two basic blocks in each of the N instruction segments and form N jump matrices with M rows and M columns, where M is the number of basic block types; a first feature matrix acquisition module, configured to convert the N M×M jump matrices into a first feature matrix A with M² rows and N columns, where each column of the first feature matrix A represents the jump matrix of one instruction segment; a second feature matrix acquisition module, configured to select p column vectors in the first feature matrix A and combine them into a second feature matrix A_p with M² rows and p columns, where the p column vectors represent the jump matrices of p instruction segments; a first performance indicator vector acquisition module, configured to send the p instruction segments to the simulator and receive the simulation results of the p instruction segments from the simulator to determine the performance indicator vector C_p of the p instruction segments; and a second performance indicator vector acquisition module, configured to determine the performance indicator vector C of the N instruction segments according to the first feature matrix A and an indicator contribution vector Y, where the indicator contribution vector Y represents the linear relationship between the performance indicator vector C_p and the second feature matrix A_p.
Any implementation manner of the second aspect or the second aspect is an apparatus implementation manner corresponding to any implementation manner of the first aspect or the first aspect, and the description in any implementation manner of the first aspect or the first aspect is applicable to any implementation manner of the second aspect or the second aspect, and is not described herein again.
In a third aspect, an embodiment of the present application provides a performance index evaluation apparatus, which includes a processor and a memory, where the memory stores program instructions, and the processor executes the program instructions to perform the first aspect and the steps of various possible implementation manners of the first aspect.
In a fourth aspect, there is provided a computer-readable storage medium comprising instructions which, when executed on a computer, cause the computer to perform the method of the above aspects.
In a fifth aspect, a computer program product is provided, which, when run on a computer, causes the computer to perform the method of the above aspects.
Drawings
FIG. 1 is a diagrammatic illustration of a segment of instructions according to an embodiment of the present application;
fig. 2 is a schematic diagram of an application scenario to which the performance index evaluation method provided in the embodiment of the present application is applied;
FIG. 3 is a schematic flow chart diagram of a performance indicator evaluation method according to an embodiment of the application;
FIG. 4 is a schematic diagram of a sliced instruction stream according to an embodiment of the present application;
FIG. 5 is a schematic diagram of a hopping matrix according to an embodiment of the present application;
FIG. 6 is a diagram illustrating a jump matrix after weighting processing according to an embodiment of the present application;
FIG. 7 is a schematic diagram of column vectors obtained after shift processing is performed on a jump matrix according to an embodiment of the present application;
FIG. 8 is a schematic diagram of a first feature matrix and a column vector obtained by averaging each row of the first feature matrix according to an embodiment of the present application;
FIG. 9 is a schematic diagram of column vector fitting according to an embodiment of the present application;
FIG. 10 is a schematic diagram of a second feature matrix according to an embodiment of the present application;
FIG. 11 is a schematic diagram of the performance indicator vector C_p according to an embodiment of the present application;
FIG. 12 is a diagram of a performance indicator vector C according to an embodiment of the present application;
FIG. 13 is a schematic flowchart of the method of selecting A_p according to an embodiment of the present application;
fig. 14 shows a schematic flow chart of a method of picking a column vector from a first feature matrix a in S402 according to an embodiment of the present application;
FIG. 15 is a fitting graph of column vector B according to an embodiment of the present application;
FIG. 16 is a schematic flowchart of the method of deleting vector A_j from Z according to a constraint in S4025 according to an embodiment of the present application;
FIG. 17 shows a schematic flow diagram of a method of solving for Y in accordance with an embodiment of the present application;
FIG. 18 is a schematic diagram showing a data curve after rearrangement of BB blocks;
FIG. 19 shows a schematic representation of the data curve of FIG. 18 after a wavelet basis transform (Fourier transform);
fig. 20 is a sub-flowchart of solving for Y in S605;
FIG. 21 is a schematic diagram of an apparatus structure of a performance index evaluation apparatus according to an embodiment of the present application;
fig. 22 is a schematic hardware configuration diagram of a performance index evaluation apparatus according to an embodiment of the present application.
Detailed Description
The technical solutions in the embodiments of the present application will be described below with reference to the accompanying drawings.
To facilitate understanding of the embodiments of the present application, several elements that will be introduced in the description of the embodiments of the present application are first introduced herein.
An instruction stream file: a file recording instruction stream information is called an instruction stream file; each line of the file represents the information of one executed instruction and conforms to a uniform format. Typically, the size of an instruction stream file is fixed. For example, an instruction stream file typically contains 100 million instructions, and a complete test program can be regarded as an instruction stream composed of many such 100-million-instruction pieces, each of which is referred to as an instruction fragment; in other words, a complete test program is composed of multiple instruction fragments. The full set of the test program is all instruction fragments, and a subset of the test program is a portion of the instruction fragments. The purpose of simplifying the test program is to select representative instruction fragments from the full set; the fewer the selected instruction fragments, the better, provided that the running results of these fragments approximate those of the original test program. An instruction may include the following information: program pointer, assembly instruction, operation type, and memory address. The memory address is optional.
Program pointer: the program pointer of each instruction line is the address in memory of that line's assembly instruction, a hexadecimal number beginning with "0x".
Assembly instruction: the instruction code of the instruction, which must meet assembly syntax requirements.
Operation type: assembly instructions can be divided into the following categories: arithmetic logic unit (ALU) operations, memory reads, memory writes, and control instructions.
Memory address: if the instruction is an ALU operation, no memory address information is needed; if the instruction is a memory read or write operation, a memory address is required.
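The instruction format above can be parsed with a short routine. Below is a minimal sketch, assuming a space-separated field layout (program pointer, assembly mnemonic, operation type, optional memory address); real instruction stream files may use a different layout, and the sample lines are hypothetical.

```python
def parse_instruction(line):
    """Split one trace line into its fields (hypothetical layout)."""
    parts = line.split()
    return {
        "pc": parts[0],        # program pointer, hexadecimal, begins with "0x"
        "asm": parts[1],       # assembly mnemonic
        "op_type": parts[2],   # ALU / load / store / control
        # memory address is optional: absent for ALU and control instructions
        "mem_addr": parts[3] if len(parts) > 3 else None,
    }

load = parse_instruction("0x400a1c LDR load 0x7ffe0010")
alu = parse_instruction("0x400a20 ADD alu")
```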
Basic Block (BB): a piece of sequentially executed instructions. In general, an instruction stream may be divided into a plurality of BBs with a control instruction as a boundary. Each segment of the test program is composed of BB.
Specifically, the control instruction may be a jump instruction, for example, a jump instruction in an assembly language such as JMP, JE, JNE, JZ, JNZ, JS, JNS, JC, and JNC.
Basic Block feature indicator Vector (BBV): the number of executions of each type of BB in each fragment is counted according to the different control instructions, and the vector constructed from the BB types and their execution counts is called the BBV. FIG. 1 is a schematic diagram of instruction code according to an embodiment of the present application. As shown in the instruction code fragment of FIG. 1, if the type IDs of the BBs are {1, 2, 3, 4, 5} and the corresponding execution counts are {1, 20, 0, 5, 0}, the BBV of the fragment can be recorded as {1:1, 2:20, 3:0, 4:5, 5:0}.
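The BBV construction just described can be sketched in a few lines. The BB id sequence below is a toy stand-in chosen to reproduce the example counts, not data from the patent.

```python
from collections import Counter

def build_bbv(bb_sequence, bb_types):
    """Count how many times each BB type executes in one fragment."""
    counts = Counter(bb_sequence)
    # record a count (possibly 0) for every known BB type, as in the example
    return {t: counts.get(t, 0) for t in bb_types}

# reproduces the example BBV {1:1, 2:20, 3:0, 4:5, 5:0}
bbv = build_bbv([1] + [2] * 20 + [4] * 5, bb_types=[1, 2, 3, 4, 5])
```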
Fig. 2 is a schematic diagram of an application scenario to which the performance index evaluation method provided in the embodiment of the present application is applied. As shown in fig. 2, the test system includes a performance index evaluation apparatus 10 and a simulator 20.
While the test program is running on the performance index evaluation apparatus 10, the apparatus 10 fetches the binary code of the test program and stores it. The performance index evaluation apparatus 10 selects an instruction segment from the test program and sends it to the simulator 20 for a simulation test. In the simulation test, the simulator 20 runs the instruction segment and obtains simulation results, for example, Instructions Per Cycle (IPC), branch prediction success rate, branch prediction failure rate, second-level cache hit rate, energy consumption, and the like. The simulator 20 then sends the simulation results to the performance index evaluation apparatus 10, which performs the performance index evaluation based on them.
The performance index evaluation apparatus 10 may be a computer or an integrated circuit. The simulator 20 may be a hardware simulator or a software simulator, which is not limited in this application.
Fig. 3 shows a schematic flowchart of the performance index evaluation method according to an embodiment of the present application. The method is executed by the performance index evaluation apparatus 10 in fig. 2 and, as shown in fig. 3, includes:
step 301: an instruction stream of a test program is obtained and the instruction stream is divided into N instruction fragments.
Specifically, reference may be made to fig. 4, which is a schematic diagram of the instruction stream after division according to an embodiment of the present application; for convenience of description, in this embodiment, N = 6.
It is noted that in other embodiments of the present application, N may be any positive integer.
Also, in this step, the performance index evaluation apparatus 10 may divide the instruction stream according to the number of instructions. For example, the apparatus may divide the instruction stream into N instruction fragments in units of 100 million instructions in the instruction stream, that is, each instruction fragment includes 100 million instructions.
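Step 301 amounts to chunking the stream into consecutive fixed-size pieces. A sketch with a toy fragment size (the document's fragments hold 100 million instructions each):

```python
def split_into_fragments(instructions, fragment_size):
    """Divide an instruction sequence into consecutive fixed-size fragments."""
    return [instructions[i:i + fragment_size]
            for i in range(0, len(instructions), fragment_size)]

# toy stream of 10 instructions split into N = 2 fragments of 5 each
fragments = split_into_fragments(list(range(10)), fragment_size=5)
```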
Step 302: count the jump probability between every two basic blocks in each of the N instruction segments to form N jump matrices with M rows and M columns, where M is the number of basic block types.
Specifically, reference may be made to fig. 5, which is a schematic diagram of a jump matrix according to an embodiment of the present application; for convenience of description, in this embodiment, the number of basic block types M is 3, that is, 3 types of basic blocks are set for each fragment.
It is noted that M may be any positive integer in other embodiments of the present application.
As shown in fig. 5, there are 3 kinds of basic blocks, BB1, BB2, and BB3, and E1 to E6 represent the jump matrices of instruction fragments 1 to 6, respectively. The jump matrix is a Markov (stochastic) matrix; in this example, each column sums to 1.

Taking the jump matrix E1 as an example, the value 0.3 in row 1, column 1 of E1 indicates that the probability of BB1 jumping to BB1 is 30%; the value 0.2 in row 2, column 1 indicates that the probability of BB1 jumping to BB2 is 20%; and the value 0.5 in row 3, column 1 indicates that the probability of BB1 jumping to BB3 is 50%.

In other examples, the row convention may be used instead: the value 0.3 in row 1, column 1 of E1 would indicate that the probability of BB1 jumping to BB1 is 30%, the value 0.4 in row 1, column 2 that the probability of BB1 jumping to BB2 is 40%, and the value 0.3 in row 1, column 3 that the probability of BB1 jumping to BB3 is 30%.

The values in the second column of E1 represent the probabilities of BB2 jumping to BB1, BB2, and BB3, respectively, and the values in the third column of E1 represent the probabilities of BB3 jumping to BB1, BB2, and BB3.
E2 to E6 are similar to E1, and the jump probability between basic blocks is different due to the difference of instruction fragments, and specific values can be seen in fig. 5, which is not described herein for brevity.
In the embodiment of the application, the jump matrix completely reflects the sequence information of the basic block of the instruction fragment, and the accuracy of performance index evaluation can be greatly improved due to the introduction of the sequence information of the basic block.
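The counting in step 302 can be sketched as follows, assuming the column convention of fig. 5 (column j holds the probabilities of jumping from BBj); the BB id sequence is illustrative.

```python
def jump_matrix(bb_sequence, m):
    """Build an m-by-m column-stochastic jump matrix from a BB id sequence."""
    counts = [[0.0] * m for _ in range(m)]
    for src, dst in zip(bb_sequence, bb_sequence[1:]):
        counts[dst - 1][src - 1] += 1.0  # row = destination, column = source
    for j in range(m):  # normalise each column so it sums to 1
        col_sum = sum(counts[i][j] for i in range(m))
        if col_sum:
            for i in range(m):
                counts[i][j] /= col_sum
    return counts

E = jump_matrix([1, 2, 3, 1, 3, 3, 1, 2], m=3)
```

With this toy sequence, BB1 jumps to BB2 twice and to BB3 once, so the first column of E is [0, 2/3, 1/3].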
Step 303: convert the N M×M jump matrices into a first feature matrix A with M² rows and N columns, where each column of the first feature matrix A represents the jump matrix of one instruction segment.
Optionally, in this step, the proportion of the basic blocks in the instruction fragment may be further introduced, so that the order of the basic blocks and the proportion of the basic blocks are combined to perform comprehensive performance index evaluation, and the accuracy of performance index evaluation may be greatly improved.
Specifically, the data of each column of the N M rows and M columns of the hopping matrix may be multiplied by the basic block weight, respectively, to form N M rows and M columns of the feature matrix. Referring to fig. 6, fig. 6 is a schematic diagram of a jump matrix after weighting processing according to an embodiment of the present application, and fig. 6 shows feature matrices F1 to F6 obtained after weighting processing is performed on E1 to E6, respectively.
For F1, the basic block weight of a column is the ratio of the number of occurrences of the basic block represented by that column's data in instruction fragment 1 to the number of all basic blocks in the fragment. Assume that the weight of BB1 is 0.1, the weight of BB2 is 0.4, and the weight of BB3 is 0.5.

The weighted jump matrix F1 is obtained by multiplying the first column of E1 by the basic block weight 0.1 of BB1, the second column of E1 by the basic block weight 0.4 of BB2, and the third column of E1 by the basic block weight 0.5 of BB3.
Similarly, in the instruction fragment 2, the basic block weight of BB1 is 0.2, the basic block weight of BB2 is 0.4, and the basic block weight of BB3 is 0.4, so that F2 is obtained after weighting E2.
In instruction fragment 3, since the basic block weight of BB1 is 0.3, that of BB2 is 0.1, and that of BB3 is 0.5, F3 is obtained by weighting E3. Assuming that the basic block weights of BB1, BB2, and BB3 in instruction fragment 4 are 0.5, 0.2, and 0.3, respectively, F4 is obtained by weighting E4. Assume that in instruction fragment 5 the basic block weight of BB1 is 0.3, that of BB2 is 0.4, and that of BB3 is 0.3; therefore, F5 is obtained by weighting E5. Assume that in instruction fragment 6 the basic block weight of BB1 is 0.2, that of BB2 is 0.2, and that of BB3 is 0.6; therefore, F6 is obtained by weighting E6.
Further, in this step, each of the N M×M feature matrices may be converted into a column vector with M² rows and 1 column. Specifically, referring to fig. 7, fig. 7 is a schematic diagram of the column vectors obtained after shift processing is performed on the jump matrices according to an embodiment of the present application. As shown in fig. 7, taking column vector A1 as an example, the second column of F1 shown in fig. 6 is shifted below the first column, and the third column is shifted below the second column, forming the single column vector A1; A2 to A6 are obtained in a similar manner.
Further, in this step, the N column vectors with M² rows are combined into the first feature matrix A with M² rows and N columns. Referring to fig. 8, fig. 8 is a schematic diagram of the first feature matrix and of the column vector obtained by averaging each row of the first feature matrix according to an embodiment of the present application; the matrix A shown in fig. 8 is the combination of the column vectors A1 to A6 of fig. 7.
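The weighting and shift processing of step 303 can be sketched as below; the toy jump matrix and weights are illustrative rather than the values of figs. 5 and 6.

```python
def weight_and_flatten(jump, weights):
    """Multiply column j of a jump matrix by the weight of BBj, then flatten
    column by column (column 2 below column 1, etc., as in fig. 7)."""
    m = len(jump)
    weighted = [[jump[i][j] * weights[j] for j in range(m)] for i in range(m)]
    return [weighted[i][j] for j in range(m) for i in range(m)]

def first_feature_matrix(jumps, weight_lists):
    """Stack the N flattened vectors side by side: A has M² rows, N columns."""
    cols = [weight_and_flatten(e, w) for e, w in zip(jumps, weight_lists)]
    return [list(row) for row in zip(*cols)]

E1 = [[0.3, 0.0, 0.5],
      [0.2, 0.0, 0.5],
      [0.5, 1.0, 0.0]]
A = first_feature_matrix([E1], [[0.1, 0.4, 0.5]])  # single-fragment example
```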
Step 304: select p column vectors in the first feature matrix A and combine them into a second feature matrix A_p with M² rows and p columns, where the p column vectors represent the jump matrices of p instruction fragments.
In this step, specifically, each row of the first feature matrix A may be averaged to obtain a column vector B (as shown in fig. 8), and p column vectors suitable for fitting the column vector B are selected from the first feature matrix A, such that

B ≈ HD,

where H is the M²×p matrix formed by the p column vectors and D is the coefficient vector required for the fit.
Referring to fig. 9 in particular, fig. 9 is a schematic diagram of column vector fitting according to an embodiment of the present application, and as shown in fig. 9, D1, D2, and D3 are coefficient values, which are real numbers and represent fitting lengths.
In the present embodiment, the column vector selection is implemented by selecting a vector that can linearly express the column vector B from among the column vectors a1 to a6, and in the present embodiment, it is assumed that a1, a2, and A3 can fit the column vector B.
Thus, in this embodiment, p may be 3.
Referring to fig. 10, fig. 10 is a schematic diagram of the second feature matrix according to an embodiment of the present application; the second feature matrix A_p of fig. 10 is obtained by combining the column vectors A1, A2, and A3 shown in fig. 9.
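Once the p columns are chosen, the coefficient vector D of the fit B ≈ HD can be obtained by ordinary least squares through the normal equations HᵀH·D = HᵀB. This is a generic stand-in sketch; the patent's own selection procedure (figs. 13 to 16) imposes further constraints that are not reproduced here.

```python
def solve(mat, rhs):
    """Gaussian elimination with partial pivoting for a small square system."""
    n = len(mat)
    aug = [row[:] + [rhs[i]] for i, row in enumerate(mat)]
    for k in range(n):
        piv = max(range(k, n), key=lambda r: abs(aug[r][k]))
        aug[k], aug[piv] = aug[piv], aug[k]
        for r in range(k + 1, n):
            f = aug[r][k] / aug[k][k]
            for c in range(k, n + 1):
                aug[r][c] -= f * aug[k][c]
    x = [0.0] * n
    for k in range(n - 1, -1, -1):
        x[k] = (aug[k][n] - sum(aug[k][c] * x[c]
                                for c in range(k + 1, n))) / aug[k][k]
    return x

def fit_coefficients(columns, b):
    """Least-squares D for B ≈ HD; `columns` are the p selected M²-vectors."""
    p, n = len(columns), len(b)
    hth = [[sum(columns[i][k] * columns[j][k] for k in range(n))
            for j in range(p)] for i in range(p)]
    htb = [sum(columns[i][k] * b[k] for k in range(n)) for i in range(p)]
    return solve(hth, htb)
```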
Step 305: send the p instruction fragments to the simulator, and receive the simulation results of the p instruction fragments from the simulator to determine the performance indicator vector C_p of the p instruction fragments.
Specifically, since the column vectors A1, A2, and A3 were selected in step 304, the instruction fragments 1, 2, and 3 corresponding to the column vectors A1, A2, and A3 may be sent to the simulator for simulation.
For example, assuming the simulation result is the number of instructions per cycle, which is 2.1, 1.7, and 2.3 for the three fragments respectively, the performance indicator vector C_p = [2.1, 1.7, 2.3]ᵀ can be constructed from the simulation results.
Step 306: determine the performance indicator vector C of the N instruction segments according to the first feature matrix A and the indicator contribution vector Y, where the indicator contribution vector Y represents the linear relationship between the performance indicator vector C_p and the second feature matrix A_p.
Specifically, in this step, equation A may be followedp TY=CpDetermining M2An index contribution vector Y of the line jump feature according to equation ATY-C determines a performance indicator vector C for the N instruction fragments.
Referring now to FIG. 11, FIG. 11 is a schematic diagram of the performance indicator vector C_p according to an embodiment of the present disclosure. FIG. 11 shows the relation between A_p^T Y and C_p: as shown in FIG. 11, A_p^T Y = C_p, where Y represents the indicator contribution vector of the M^2 row jump features of the second feature matrix A_p; in this example, M^2 = 3^2 = 9.
That is, Y represents the linear relationship between each row of A_p and the actual simulation result C_p.
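The two equations A_p^T Y = C_p and A^T Y = C can be illustrated numerically. The matrices below are random stand-ins, and the minimum-norm least-squares solve is only one hypothetical way to obtain Y; the embodiment's preferred route is the compressed-sensing solve described in step 306:

```python
import numpy as np

# Illustrative shapes: M^2 = 9 jump features, N = 6 fragments, p = 3 simulated.
np.random.seed(1)
A  = np.random.rand(9, 6)         # first feature matrix (all fragments)
Ap = A[:, :3]                     # second feature matrix (simulated fragments)
Cp = np.array([2.1, 1.7, 2.3])    # simulated instructions-per-cycle values

# Ap^T Y = Cp is underdetermined (3 equations, 9 unknowns); lstsq returns
# the minimum-norm solution that satisfies the equations exactly.
Y, *_ = np.linalg.lstsq(Ap.T, Cp, rcond=None)

# With Y in hand, the indicators of all N fragments follow from A^T Y = C.
C = A.T @ Y
print(C.shape)
```

This shows only the linear-algebra skeleton; it does not enforce the sparsity that the compressed-sensing formulation of step 306 adds.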
Alternatively, Y may be obtained specifically by:
performing an inner product operation on the performance indicator vector C_p and the coefficient vector D obtained in step 304 to obtain the total performance indicator value c of the instruction stream; then, using the performance indicator vector C_p and the total performance indicator value c as constraints and the second feature matrix A_p as an input parameter, determining the contribution vector Y by a compressed sensing algorithm.
After Y is obtained, the performance indicator vector C of the 6 instruction fragments can further be determined according to equation A^T Y = C. Reference may be made to FIG. 12, which is a schematic diagram of the performance indicator vector C according to an embodiment of the present application; in FIG. 12, the performance indicators C1, C2, C3, C4, C5, and C6 of each instruction fragment in the instruction stream are obtained by multiplying the transposed first feature matrix A by the Y obtained in this step.
In summary, in the embodiment of the present application, the first feature matrix is obtained from the jump matrices, vectors are selected from the first feature matrix to form the second feature matrix, the instruction fragments corresponding to the second feature matrix are sent to a simulator for partial simulation, and the performance indicator of each instruction fragment is obtained from the simulation result, the second feature matrix, and the first feature matrix. Since the jump matrix reflects the order between the basic blocks, the first feature matrix includes information reflecting the order of the basic blocks. Moreover, the simulator only needs to simulate part of the instruction fragments; by using the linear relation between the simulation result of those fragments and the second feature matrix, the performance indicator of every instruction fragment can be obtained from that linear relation and the first feature matrix, so that simulation cost is saved and simulation time is shortened.
A specific application scenario is listed below to further describe picking A_p as shown in step 304.
Referring first to FIG. 13, FIG. 13 shows a schematic flow chart of the method of picking A_p according to an embodiment of the present application. As shown in FIG. 13, picking A_p can be carried out by the following steps.
S401, initializing the residual R = B, Z as an empty set, and J = A;
s402, selecting column vectors from A and adding the column vectors into Z;
S403, judging whether a convergence condition is met; if the convergence condition is not met, jumping to S402; if it is met, calculating the weights of Z to obtain the sparse solution vector X1.
Alternatively, fig. 14 shows a schematic flowchart of a method for picking column vectors from the first feature matrix a in S402 according to an embodiment of the present application, and as shown in fig. 14, the picking column vectors from a and adding to Z may be implemented by the following steps.
S4021, calculating a correlation coefficient between J and R;
S4022, taking the vector A_i with the maximum correlation coefficient as the advancing direction u, and adding A_i to Z;
s4023, walking a first distance U along U according to a first strategy;
This can be understood in conjunction with FIG. 15, where FIG. 15 is a fitting graph of the column vector B according to an embodiment of the present application. In FIG. 15, A1*D1 = U, where D1 is a real number; the fitting error between U and B is R, i.e., U - B = R. R is then used as a new vector to be fitted, and a further A_i*D_i close to R is searched for, until the fitting error meets an acceptable condition; the selected column vectors are the required column vectors. An acceptable condition is, for example, that the angle between B and A_i is smaller than a preset threshold, for example 1 degree.
S4024, judging whether the time is greater than t, and if yes, jumping to S4025; if not, jumping to S4026;
S4025, removing the vector A_j from Z according to the constraint, and jumping to S4026;
s4026, calculating an expression Y closest to B;
S4027, updating the residual R = B - Y.
Alternatively, FIG. 16 shows a schematic flow chart of deleting the vector A_j from Z according to the constraint in S4025 according to an embodiment of the present application. As shown in FIG. 16, deleting the vector A_j from Z according to the constraint in S4025 can be carried out by the following steps.
S40251, determining, by the least squares method, the fit Y of the selected vectors A_i that is closest to B;
S40252, calculating the value of the objective function I(X) with each vector removed;
S40253, deleting the vector A_j corresponding to the minimum objective function value.
It will be appreciated that the basic idea of the algorithm in FIG. 13 to FIG. 16 is to find a set of vectors and linearly fit B with all the vectors in that set, subject to the simulator usage time, i.e., the total time cannot be greater than t.
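Under the simplification that the time budget t and the vector-deletion step S4025 are omitted (stopping instead on a column budget or a small residual), the greedy selection of FIGS. 13 and 14 can be sketched as a matching-pursuit-style loop. All data below are illustrative:

```python
import numpy as np

def pick_columns(A, B, max_cols=3, tol=1e-6):
    """Greedy (matching-pursuit-style) selection of columns of A fitting B.

    Simplified sketch: the time budget t and the deletion step S4025 are
    omitted; the loop stops on max_cols or on a small residual instead.
    """
    R = B.copy()                  # residual, initialised to B (S401)
    chosen = []                   # selected index set Z
    for _ in range(max_cols):
        # correlation of every column with the residual (S4021)
        corr = np.abs(A.T @ R)
        if chosen:
            corr[chosen] = -np.inf   # never pick a column twice
        i = int(np.argmax(corr))     # most correlated column (S4022)
        chosen.append(i)
        # least-squares fit of B on the chosen columns (S4026)
        H = A[:, chosen]
        D, *_ = np.linalg.lstsq(H, B, rcond=None)
        R = B - H @ D                # residual update (S4027)
        if np.linalg.norm(R) < tol:
            break
    return chosen, R

np.random.seed(2)
A = np.random.rand(9, 6)
B = A.mean(axis=1)
chosen, R = pick_columns(A, B)
print(chosen, np.linalg.norm(R))
```

Because each iteration refits B on all chosen columns, the residual never grows, which mirrors the convergence check of S403.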
Further, a specific application scenario is listed below to further describe the method for solving Y in step 306.
Referring first to fig. 17, fig. 17 shows a schematic flow chart of a method for solving for Y according to an embodiment of the present application, and as shown in fig. 17, the method for solving for Y may be determined by the following steps.
S501: and sorting the types of the basic blocks BB of each instruction segment according to the sequence of the predefined index values from large to small.
Optionally, the predefined index value is at least one of the CPI, the cache miss rate (cache miss), and the branch prediction failure rate (branch miss).
For example, FIG. 18 shows a schematic diagram of the data curve after BB block rearrangement. As shown in FIG. 18, an instruction fragment includes 1000 BB blocks, and the 1000 BB blocks are monotonically arranged according to the index value of each BB (e.g., CPI), resulting in the data curve shown in FIG. 18.
S502: determining an optimal wavelet basis matrix Ψ, where Ψ is a wavelet basis matrix of size M^2 × M^2.
Specifically, the wavelet basis matrix Ψ can be determined in the following two ways.
The first way: determining a smooth performance curve of the basic blocks BB of each instruction fragment according to the types and the jump probabilities of the sorted basic blocks BB, where the smooth performance curve is monotonic, and determining the wavelet basis matrix Ψ according to the monotonic smooth performance curve;
the second way: determining the wavelet basis matrix Ψ according to experimental results.
Alternatively, determining the wavelet basis matrix according to experimental results may include the following two ways:
the first way: performing an experiment according to the execution delay of the instructions to determine the wavelet basis matrix Ψ;
the second way: determining the wavelet basis matrix Ψ by a BB instruction test experiment.
For example, FIG. 19 shows a schematic diagram of the data curve of FIG. 18 after the wavelet basis transform (here, a Fourier transform); it can be seen that, after the transform, there are few positions where the frequency-domain coefficients on the wavelet basis are non-zero.
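The sparsity claim can be checked numerically with a stand-in curve. A sigmoid is used here as a hypothetical smooth, monotone performance curve, and a plain FFT stands in for the wavelet basis transform; neither is data from the embodiment:

```python
import numpy as np

# A smooth, monotone "performance curve" standing in for the sorted
# 1000-BB data of FIG. 18 (the sigmoid is only an illustrative choice).
n = 1000
y = 1.0 / (1.0 + np.exp(-np.linspace(-6.0, 6.0, n)))

# After an orthogonal transform, most of the energy of such a curve sits
# in a handful of coefficients -- the sparsity compressed sensing needs.
F = np.fft.fft(y)
energy = np.abs(F) ** 2
top10 = np.sort(energy)[::-1][:10].sum()
ratio = top10 / energy.sum()      # energy fraction in the 10 largest bins
print(round(ratio, 3))
```

The ten largest coefficients carry almost all of the spectral energy, which is the property the constrained model of S504 exploits.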
S503: sorting the second feature matrix A_p according to the predefined index values;
S504: establishing a constrained compressed sensing model according to the second feature matrix A_p, the performance indicator vector C_p, and the wavelet basis matrix Ψ.
Compressed Sensing (also called Compressive Sampling or Sparse Sampling) exploits the sparsity characteristics of a signal: discrete samples of the signal are obtained by random sampling at a rate far below the Nyquist sampling rate, and the signal is then perfectly reconstructed by a nonlinear reconstruction algorithm.
In the embodiment of the application, a compressed sensing method is utilized in index estimation of the instruction segment, and index estimation of each instruction segment is obtained by utilizing a small number of discrete BB blocks and adopting the compressed sensing method.
In this step, the performance index evaluation device takes the first feature matrix A, the number N of instruction fragments, the performance indicator vector C_p, the total performance indicator value c, and the optimal wavelet basis Ψ as input parameters, sets an optimization variable column vector Z, and establishes the constrained compressed sensing model as follows:

minimize ||Z||_1
subject to A_p^T Ψ Z = C_p
(1/N) I A^T Ψ Z = c
where Ψ is a wavelet basis matrix of size M^2 × M^2, Ψ Z = Y, and c is an indicator of the entire test procedure and is a real number.
I is an all-ones matrix (i.e., a matrix whose elements are all 1) of size 1 × N, N is the number of fragments, and the optimization objective is set according to the sparsity of Z.
Optimization objective: the number of non-zero coefficients in Z is minimized (i.e., the 1-norm of Z is minimized);
Constraint 1:

A_p^T Ψ Z = C_p

According to the assumption that the feature vector Y = Ψ Z is a smooth curve, a sparse vector Z is obtained under the optimized wavelet basis Ψ.
Constraint 2:

(1/N) I A^T Ψ Z = c

Here, the calculated mean value of the indicators of all instruction fragments is required to be equal to the total performance indicator value.
S505: solving the optimization model obtained in S504 by the Alternating Direction Method of Multipliers (ADMM) algorithm to obtain the sparse solution vector Z, and obtaining the feature vector Y = Ψ S according to Z. Reference may be made to FIG. 20, which is a sub-flowchart of solving Y in S505; as shown in FIG. 20, step S505 specifically includes:
S5051: according to the constrained compressed sensing model, introducing a slack variable S and adding the further constraint Z = S, i.e., the slack variable S must coincide with Z:

minimize ||Z||_1
subject to A_p^T Ψ S = C_p
(1/N) I A^T Ψ S = c
Z = S
Using the Lagrange multiplier method, Lagrange multipliers U, V, and W are introduced and the corresponding augmented Lagrange function g is established, where Z is the required sparse solution and μ, ν, and ξ are penalty parameters (positive real numbers):

g(Z, S, U, V, W) = ||Z||_1 + U^T (A_p^T Ψ S - C_p) + (μ/2) ||A_p^T Ψ S - C_p||_2^2 + V^T ((1/N) I A^T Ψ S - c) + (ν/2) ||(1/N) I A^T Ψ S - c||_2^2 + W^T (Z - S) + (ξ/2) ||Z - S||_2^2
S5052: setting initial values of Z, S, U, V, W, μ, ν, and ξ (for example, zero for the vectors and multipliers), and setting the conditions for completing the optimization;
S5053: calculating the optimal value of S by the least squares method;
S5054: fixing S, U, V, W, μ, ν, and ξ, and calculating the optimal value of Z by the soft-thresholding method;
S5055: updating the Lagrange multipliers according to the residuals of the constraints:

U = U + μ (A_p^T Ψ S - C_p)
V = V + ν ((1/N) I A^T Ψ S - c)
W = W + ξ (Z - S)

and finally increasing the penalty coefficients μ, ν, and ξ by a fixed multiple ρ > 1;
S5056: judging whether the convergence condition is satisfied; if not, jumping back to S5053 and repeating S5053 to S5056, judging at the end of each cycle whether the optimization condition is satisfied, until convergence, at which point the process ends.
Alternatively, the convergence condition is, for example, a limit on the execution time or on the number of iterations.
Finally, the vector Y can be obtained according to the equation Y = Ψ S.
In summary, the index estimation method and device of the embodiments of the present application help improve the precision of the simulation test program, reduce the measurement error, and provide performance indicator estimates for all instruction fragments; according to these indicators, linear division can be performed or other program performance indicators can be estimated at each stage.
Referring to fig. 21, fig. 21 is a schematic structural diagram of a performance index evaluation device according to an embodiment of the present application, and as shown in fig. 21, the performance index evaluation device 10 includes:
an instruction stream segmentation module 601, configured to obtain an instruction stream of a test program and segment the instruction stream into N instruction segments;
a skip matrix generation module 602, configured to count skip probabilities between every two basic blocks in each instruction segment of the N instruction segments, and form N skip matrices with M rows and M columns, where M is a type of a basic block;
a first feature matrix obtaining module 603, configured to convert the N jump matrices of M rows and M columns into a first feature matrix A of M^2 rows and N columns, where each column of the first feature matrix A represents the jump matrix of one instruction fragment;
a second feature matrix obtaining module 604, configured to select p column vectors in the first feature matrix A and combine the p column vectors to form a second feature matrix A_p of M^2 rows and p columns, where the p column vectors represent the jump matrices of p instruction fragments;
a first performance indicator vector obtaining module 605, configured to send the p instruction fragments to the simulator respectively, and receive the simulation results of the p instruction fragments from the simulator to determine the performance indicator vector C_p of the p instruction fragments;
a second performance indicator vector obtaining module 606, configured to determine the performance indicator vector C of the N instruction fragments according to the performance indicator vector C_p, the first feature matrix A, and the second feature matrix A_p.
Optionally, the second performance indicator vector obtaining module 606 is specifically configured to:
determine the indicator contribution vector Y of the M^2 row jump features according to equation A_p^T Y = C_p;
determine the performance indicator vector C of the N instruction fragments according to equation A^T Y = C.
Optionally, the second feature matrix obtaining module 604 is specifically configured to:
averaging each row of the first feature matrix A to obtain a column vector B;
selecting, in the first feature matrix A, p column vectors suitable for fitting the column vector B, where

B ≈ Σ_{i=1}^{p} H_i · D_i

H_i is the i-th of the p column vectors and D_i is the corresponding coefficient in the coefficient vector D required for the fit.
Optionally, the second performance indicator vector obtaining module 606 is specifically configured to:
performing an inner product operation on the performance indicator vector C_p and the coefficient vector D to obtain the total performance indicator value c of the instruction stream.
Optionally, the second performance indicator vector obtaining module 606 is specifically configured to:
using the performance indicator vector C_p and the total performance indicator value c as constraints and the second feature matrix A_p as an input parameter, and determining the contribution vector Y by a compressed sensing algorithm.
Optionally, the first feature matrix obtaining module 603 is specifically configured to:
multiplying the data of each column of the N jump matrices of M rows and M columns by the corresponding basic block weight to form N feature matrices of M rows and M columns;
converting the N feature matrices of M rows and M columns into N column vectors of M^2 rows and 1 column;
combining the N column vectors of M^2 rows and 1 column into the first feature matrix A of M^2 rows and N columns;
where the basic block weight is the ratio of the number of occurrences, in the instruction fragment, of the basic block represented by the column in which the data is located, to the number of all basic blocks in that instruction fragment.
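The conversion performed by the first feature matrix obtaining module can be sketched as follows; the jump matrices and basic block weights below are random placeholders for illustration only:

```python
import numpy as np

# Illustrative sizes: N = 6 instruction fragments, M = 3 basic-block types.
np.random.seed(4)
N, M = 6, 3
jump = [np.random.rand(M, M) for _ in range(N)]   # N jump matrices
# hypothetical per-column basic-block weights for each fragment
weights = [np.random.rand(M) for _ in range(N)]

cols = []
for J, w in zip(jump, weights):
    # multiply each column of the jump matrix by its basic-block weight ...
    F = J * w                    # w broadcasts over the M columns of J
    # ... then flatten the M x M feature matrix into an M^2-element vector
    cols.append(F.reshape(M * M))

# combine the N column vectors into the M^2 x N first feature matrix A
A = np.column_stack(cols)
print(A.shape)                   # (9, 6)
```

In the embodiment each weight would be the occurrence ratio of the basic block within its fragment rather than a random value; only the shapes and the column-wise scaling are shown here.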
Optionally, the simulation result includes at least one of an instruction number per cycle, a branch prediction success rate, a branch prediction failure rate, a second level cache hit rate, and an energy consumption.
In summary, in the embodiment of the present application, a first feature matrix is obtained through a skip matrix, a vector is selected from the first feature matrix to form a second feature matrix, an instruction segment corresponding to the second feature matrix is sent to a simulator for local simulation, and a performance index of each instruction segment is obtained according to a simulation result, the second feature matrix, and the first feature matrix.
Referring to fig. 22, fig. 22 is a schematic diagram of a hardware structure of a performance index evaluation device according to an embodiment of the present application, and as shown in fig. 22, the performance index evaluation device 10 includes: the memory 701, the processor 702, and the bus 703, wherein the memory 701 and the processor 702 are respectively connected to the bus 703, the memory stores program instructions, and the processor executes the program instructions to perform the performance index estimation method disclosed above.
In summary, by simulating a selected small part of instruction fragments and obtaining a linear relationship between the simulation result and a second feature matrix of the part of instruction fragments, that is, the indicator contribution degree vector Y, the linear relationship can also be theoretically applied to the first feature matrix of all the instruction fragments and the simulation result of all the instruction fragments, so that the simulation result of all the instruction fragments is evaluated according to the indicator contribution degree vector Y and the feature matrix of all the instruction fragments, thereby realizing prediction of the simulation result of all the instruction fragments through the simulation result of the small part of instruction fragments, and greatly saving the cost.
In the embodiment of the present application, the Processor may be a Central Processing Unit (CPU), a Network Processor (NP), or a combination of the CPU and the NP. The processor may further include a hardware chip. The hardware chip may be an Application-Specific Integrated Circuit (ASIC), a Programmable Logic Device (PLD), or a combination thereof. The PLD may be a Complex Programmable Logic Device (CPLD), a Field-Programmable Gate Array (FPGA), General Array Logic (GAL), or any combination thereof.
The memory may be either volatile memory or nonvolatile memory, or may include both volatile and nonvolatile memory. The non-volatile Memory may be a Read-Only Memory (ROM), a Programmable ROM (PROM), an Erasable PROM (EPROM), an Electrically Erasable PROM (EEPROM), or a flash Memory. Volatile Memory can be Random Access Memory (RAM), which acts as external cache Memory.
The above embodiments may be implemented in whole or in part by software, hardware, firmware, or any combination thereof. When implemented in software, they may be implemented in whole or in part in the form of a computer program product. The computer program product may include one or more computer instructions. When the computer instructions are loaded and executed on a computer, the processes or functions described in accordance with the embodiments of the application occur, in whole or in part. The computer may be a general purpose computer, a special purpose computer, a network of computers, or another programmable device. The computer instructions may be stored on a computer-readable storage medium or transmitted from one computer-readable storage medium to another, for example, from one website, computer, server, or data center to another website, computer, server, or data center by wired (e.g., coaxial cable, optical fiber, Digital Subscriber Line (DSL)) or wireless (e.g., infrared, radio, microwave) means. The computer-readable storage medium can be any available medium that can be accessed by a computer, or a data storage device, such as a server or a data center, that incorporates one or more available media. The usable medium may be a magnetic medium (e.g., floppy disk, hard disk, magnetic tape), an optical medium (e.g., DVD), or a semiconductor medium (e.g., Solid State Disk (SSD)), among others.
Those of ordinary skill in the art will appreciate that the various illustrative elements and algorithm steps described in connection with the embodiments disclosed herein may be implemented as electronic hardware or combinations of computer software and electronic hardware. Whether such functionality is implemented as hardware or software depends upon the particular application and design constraints imposed on the implementation. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present application.
It is clear to those skilled in the art that, for convenience and brevity of description, the specific working processes of the above-described systems, apparatuses and units may refer to the corresponding processes in the foregoing method embodiments, and are not described herein again.
In the several embodiments provided in the present application, it should be understood that the disclosed system, apparatus and method may be implemented in other ways. For example, the above-described apparatus embodiments are merely illustrative, and for example, the division of the units is only one logical division, and other divisions may be realized in practice, for example, a plurality of units or components may be combined or integrated into another system, or some features may be omitted, or not executed. In addition, the shown or discussed mutual coupling or direct coupling or communication connection may be an indirect coupling or communication connection through some interfaces, devices or units, and may be in an electrical, mechanical or other form.
The units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the units can be selected according to actual needs to achieve the purpose of the solution of the embodiment.
In addition, functional units in the embodiments of the present application may be integrated into one processing unit, or each unit may exist alone physically, or two or more units are integrated into one unit.
The functions, if implemented in the form of software functional units and sold or used as a stand-alone product, may be stored in a computer readable storage medium. Based on such understanding, the technical solution of the present application or portions thereof that substantially contribute to the prior art may be embodied in the form of a software product stored in a storage medium and including instructions for causing a computer device (which may be a personal computer, a server, or a network device) to execute all or part of the steps of the method according to the embodiments of the present application. And the aforementioned storage medium includes: u disk, removable hard disk, read only memory, random access memory, magnetic or optical disk, etc. for storing program codes.
The above description is only for the specific embodiments of the present application, but the scope of the present application is not limited thereto, and any person skilled in the art can easily conceive of the changes or substitutions within the technical scope of the present application, and shall be covered by the scope of the present application. Therefore, the protection scope of the present application shall be subject to the protection scope of the claims.

Claims (15)

1. A performance index evaluation method is characterized by comprising the following steps:
acquiring an instruction stream and dividing the instruction stream into N instruction segments;
counting the jump probability between every two basic blocks in each instruction segment in the N instruction segments to form N jump matrixes with M rows and M columns, wherein M is the type of the basic blocks;
converting the N jump matrices of M rows and M columns into a first feature matrix A of M^2 rows and N columns, wherein each column of the first feature matrix A represents the jump matrix of one instruction segment;
selecting p column vectors in the first feature matrix A, and combining the p column vectors to form a second feature matrix A_p of M^2 rows and p columns, wherein the p column vectors represent the jump matrices of p instruction fragments;
sending the p instruction fragments to a simulator respectively, and receiving simulation results of the p instruction fragments from the simulator to determine a performance indicator vector C_p of the p instruction fragments;
determining a performance indicator vector C of the N instruction segments according to the first feature matrix A and an indicator contribution vector Y, wherein the indicator contribution vector Y represents a linear relationship between the performance indicator vector C_p and the second feature matrix A_p.
2. The method of claim 1, wherein determining the performance indicator vector C of the N instruction fragments according to the first feature matrix A and the indicator contribution vector Y comprises:
determining the indicator contribution vector Y according to equation A_p^T Y = C_p;
determining the performance indicator vector C of the N instruction fragments according to equation A^T Y = C.
3. The method according to claim 2, wherein selecting p column vectors in the first feature matrix A comprises:
averaging each row of the first feature matrix A to obtain a column vector B;
selecting, in the first feature matrix A, p column vectors suitable for fitting the column vector B, wherein

B ≈ Σ_{i=1}^{p} H_i · D_i

H_i is the i-th of the p column vectors, and D_i is the corresponding fitting coefficient.
4. The method of claim 3, wherein determining the indicator contribution vector Y according to equation A_p^T Y = C_p specifically includes:
performing an inner product operation on the performance indicator vector C_p and the coefficient vectors D_i to obtain a total performance indicator value c of the instruction stream.
5. The method of claim 4, wherein determining the indicator contribution vector Y according to equation A_p^T Y = C_p specifically includes:
using the performance indicator vector C_p and the total performance indicator value c as constraints and the second feature matrix A_p as an input parameter, and determining the indicator contribution vector Y by a compressed sensing algorithm.
6. The method according to any one of claims 1 to 5, wherein converting the N jump matrices of M rows and M columns into the first feature matrix A of M^2 rows and N columns specifically includes:
multiplying the data of each column of the N jump matrices of M rows and M columns by the corresponding basic block weight to form N feature matrices of M rows and M columns;
converting the N feature matrices of M rows and M columns into N column vectors of M^2 rows and 1 column;
combining the N column vectors of M^2 rows and 1 column into the first feature matrix A of M^2 rows and N columns;
wherein the basic block weight is the ratio of the number of occurrences, in the instruction segment, of the basic block represented by the column in which the data is located, to the number of all basic blocks in that instruction segment.
7. The method of any of claims 1 to 5, wherein the simulation results comprise at least one of number of instructions per cycle, branch prediction success rate, branch prediction failure rate, level two cache hit rate, and power consumption.
8. A performance index evaluation device, comprising:
the instruction stream segmentation module is used for acquiring an instruction stream and segmenting the instruction stream into N instruction segments;
a skip matrix generation module, configured to count skip probabilities between every two basic blocks in each of the N instruction segments, and form N skip matrices with M rows and M columns, where M is a type of a basic block;
a first feature matrix obtaining module, configured to convert the N jump matrices of M rows and M columns into a first feature matrix A of M^2 rows and N columns, wherein each column of the first feature matrix A represents the jump matrix of one instruction segment;
a second feature matrix obtaining module, configured to select p column vectors in the first feature matrix A and combine the p column vectors to form a second feature matrix A_p of M^2 rows and p columns, wherein the p column vectors represent the jump matrices of p instruction fragments;
a first performance indicator vector obtaining module, configured to send the p instruction fragments to a simulator respectively, and receive simulation results of the p instruction fragments from the simulator to determine a performance indicator vector C_p of the p instruction fragments;
a second performance indicator vector obtaining module, configured to determine a performance indicator vector C of the N instruction fragments according to the first feature matrix A and the indicator contribution vector Y, wherein the indicator contribution vector Y represents a linear relationship between the performance indicator vector C_p and the second feature matrix A_p.
9. The apparatus of claim 8, wherein the second performance indicator vector obtaining module is specifically configured to:
determine the indicator contribution vector Y according to equation A_p^T Y = C_p;
determine the performance indicator vector C of the N instruction fragments according to equation A^T Y = C.
10. The apparatus of claim 9, wherein the second feature matrix obtaining module is specifically configured to:
respectively averaging each row of the first feature matrix A to obtain a column vector B;
selecting, in the first feature matrix A, p column vectors suitable for fitting the column vector B, wherein

B ≈ Σ_{i=1}^{p} H_i · D_i

H_i is the i-th of the p column vectors, and D_i is the corresponding fitting coefficient.
11. The apparatus of claim 10, wherein the second performance indicator vector obtaining module is specifically configured to:
perform an inner product operation on the performance indicator vector C_p and the coefficient vectors D_i to obtain a total performance indicator value c of the instruction stream.
12. The apparatus of claim 11, wherein the second performance indicator vector obtaining module is specifically configured to:
use the performance indicator vector C_p and the total performance indicator value c as constraints and the second feature matrix A_p as an input parameter, and determine the indicator contribution vector Y by a compressed sensing algorithm.
13. The apparatus according to any one of claims 8 to 12, wherein the first feature matrix obtaining module is specifically configured to:
multiplying the data of each column of the N jump matrices of M rows and M columns by the corresponding basic block weight to form N feature matrices of M rows and M columns;
converting each of the N feature matrices of M rows and M columns into a vector of M² rows and 1 column;
combining the N feature vectors of M² rows and 1 column into the first feature matrix A of M² rows and N columns;
wherein the basic block weight is the ratio of the number of occurrences, within its instruction segment, of the basic block represented by the data in each column of the N jump matrices of M rows and M columns to the total number of basic blocks in that instruction segment.
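The construction of the first feature matrix A in claim 13 can be sketched as follows; the jump-matrix values and per-column basic block weights below are random placeholders, since the claim only defines how they are combined:

```python
import numpy as np

rng = np.random.default_rng(3)
M, N = 4, 5                              # M basic blocks, N instruction fragments

# N jump matrices of shape M x M (illustrative random data).
jump = rng.random((N, M, M))
# Per-fragment, per-column basic block weights (assumed given; claim 13 defines
# them as occurrence ratios within each instruction segment).
weights = rng.random((N, M))

# Scale column j of each jump matrix by its basic block weight, flatten the
# weighted M x M matrix to an M^2-row vector, and stack the N vectors as the
# columns of the first feature matrix A (M^2 rows, N columns).
cols = []
for i in range(N):
    weighted = jump[i] * weights[i][np.newaxis, :]  # column j scaled by weights[i, j]
    cols.append(weighted.reshape(M * M))
A = np.stack(cols, axis=1)

print(A.shape)                           # (16, 5), i.e. (M^2, N)
```

Flattening each weighted M×M matrix into a single column is what lets all N fragments share one matrix A with a common row layout, so the linear model A^T Y = C of the earlier claims applies directly.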
14. The apparatus of any of claims 8 to 12, wherein the simulation results comprise at least one of number of instructions per cycle, branch prediction success rate, branch prediction failure rate, level two cache hit rate, and power consumption.
15. A computer-readable storage medium comprising instructions which, when executed on a computer, cause the computer to perform the method according to any one of claims 1 to 7.
CN201780083763.9A 2017-07-12 2017-07-12 Performance index evaluation method and device Active CN110178123B (en)

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
PCT/CN2017/092662 WO2019010656A1 (en) 2017-07-12 2017-07-12 Method and device for evaluating performance indicator

Publications (2)

Publication Number Publication Date
CN110178123A CN110178123A (en) 2019-08-27
CN110178123B true CN110178123B (en) 2020-12-01

Family

ID=65001410

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201780083763.9A Active CN110178123B (en) 2017-07-12 2017-07-12 Performance index evaluation method and device

Country Status (2)

Country Link
CN (1) CN110178123B (en)
WO (1) WO2019010656A1 (en)

Families Citing this family (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112825058A (en) * 2019-11-21 2021-05-21 阿里巴巴集团控股有限公司 Processor performance evaluation method and device
CN111739646A (en) * 2020-06-22 2020-10-02 平安医疗健康管理股份有限公司 Data verification method and device, computer equipment and readable storage medium
CN111897707B (en) * 2020-07-16 2024-01-05 中国工商银行股份有限公司 Optimization method and device for business system, computer system and storage medium
CN113203920B (en) * 2021-05-11 2023-03-24 国网山东省电力公司临沂供电公司 Power distribution network single-phase earth fault positioning system and method
CN115543719B (en) * 2022-11-24 2023-04-07 飞腾信息技术有限公司 Component optimization method and device based on chip design, computer equipment and medium

Citations (2)

Publication number Priority date Publication date Assignee Title
CN101763291A (en) * 2009-12-30 2010-06-30 中国人民解放军国防科学技术大学 Method for detecting error of program control flow
CN102110013A (en) * 2009-12-23 2011-06-29 英特尔公司 Method and apparatus for efficiently generating processor architecture model

Family Cites Families (7)

Publication number Priority date Publication date Assignee Title
US7904870B2 (en) * 2008-04-30 2011-03-08 International Business Machines Corporation Method and apparatus for integrated circuit design model performance evaluation using basic block vector clustering and fly-by vector clustering
CN103902443B (en) * 2012-12-26 2017-04-26 华为技术有限公司 Program running performance analysis method and device
CN103049310B (en) * 2012-12-29 2016-12-28 中国科学院深圳先进技术研究院 A kind of multi-core simulation parallel acceleration method based on sampling
CN104424101B (en) * 2013-09-10 2017-08-11 华为技术有限公司 The determination method and apparatus of program feature interference model
CN105654120B (en) * 2015-12-25 2019-06-21 东南大学苏州研究院 A kind of software load feature extracting method based on SOM and K-means two-phase analyzing method
CN105630458B (en) * 2015-12-29 2018-03-02 东南大学—无锡集成电路技术研究所 The Forecasting Methodology of average throughput under a kind of out-of order processor stable state based on artificial neural network
CN105677521B (en) * 2015-12-29 2019-06-18 东南大学苏州研究院 A kind of benchmark synthetic method towards mobile intelligent terminal processor

Patent Citations (2)

Publication number Priority date Publication date Assignee Title
CN102110013A (en) * 2009-12-23 2011-06-29 英特尔公司 Method and apparatus for efficiently generating processor architecture model
CN101763291A (en) * 2009-12-30 2010-06-30 中国人民解放军国防科学技术大学 Method for detecting error of program control flow

Also Published As

Publication number Publication date
WO2019010656A1 (en) 2019-01-17
CN110178123A (en) 2019-08-27

Similar Documents

Publication Publication Date Title
CN110178123B (en) Performance index evaluation method and device
Diaz-Uriarte et al. Testing hypotheses of correlated evolution using phylogenetically independent contrasts: sensitivity to deviations from Brownian motion
CN111274134A (en) Vulnerability identification and prediction method and system based on graph neural network, computer equipment and storage medium
JP4627674B2 (en) Data processing method and program
CN107038457B (en) Telemetry data compression batch processing method based on principal component signal-to-noise ratio
US10248462B2 (en) Management server which constructs a request load model for an object system, load estimation method thereof and storage medium for storing program
Tousi et al. Comparative analysis of machine learning models for performance prediction of the spec benchmarks
CN110990603B (en) Method and system for format recognition of segmented image data
CN111260419A (en) Method and device for acquiring user attribute, computer equipment and storage medium
CN111897707B (en) Optimization method and device for business system, computer system and storage medium
CN108008999B (en) Index evaluation method and device
CN110348581B (en) User feature optimizing method, device, medium and electronic equipment in user feature group
KR20210143460A (en) Apparatus for feature recommendation and method thereof
WO2020099606A1 (en) Apparatus and method for creating and training artificial neural networks
CN116149917A (en) Method and apparatus for evaluating processor performance, computing device, and readable storage medium
CN108664368B (en) Processor performance index evaluation method and device
Ganesan et al. A case for generalizable DNN cost models for mobile devices
Shantharam et al. Exploiting dense substructures for fast sparse matrix vector multiplication
CN108037979B (en) Virtual machine performance degradation evaluation method based on Bayesian network containing hidden variables
CN114745366A (en) Method and apparatus for continuous monitoring telemetry in the field
CN107678734B (en) CPU benchmark test program set construction method based on genetic algorithm
CN112733433A (en) Equipment testability strategy optimization method and device
CN107025462B (en) real-time parallel compression processing method for telemetering data of unmanned aerial vehicle
CN115543719B (en) Component optimization method and device based on chip design, computer equipment and medium
KR102413753B1 (en) Information processing apparatus, information processing method, and information processing program stored in a recording medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant